In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Web scraping: getting data from the web


## What Is Web Scraping?

<img src="https://cdn-images-1.medium.com/max/1600/1*GOyqaID2x1N5lD_rhTDKVQ.png">

### Why Web Scraping for Data Science?

## Network complexity

## HTTP

## HTTP in Python: The Requests Library

[Requests: HTTP for Humans](https://2.python-requests.org/en/master/)

In [2]:
import requests

In [3]:
url = 'http://example.com/'

In [4]:
r = requests.get(url)

In [5]:
r

<Response [200]>

In [6]:
type(r)

requests.models.Response

In [7]:
r.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 50px;\n        background-color: #fff;\n        border-radius: 1em;\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        body {\n            background-color: #fff;\n        }\n        div {\n            width: auto;\n            margin: 0 auto;\n            border-radius: 0;\n            padding: 1em;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n

In [8]:
r.status_code

200

In [9]:
r.reason

'OK'

In [10]:
r.headers

{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Cache-Control': 'max-age=604800', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Thu, 20 Jun 2019 19:44:51 GMT', 'Etag': '"1541025663"', 'Expires': 'Thu, 27 Jun 2019 19:44:51 GMT', 'Last-Modified': 'Fri, 09 Aug 2013 23:54:35 GMT', 'Server': 'ECS (nyb/1D13)', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'Content-Length': '606'}

In [11]:
r.request

<PreparedRequest [GET]>

In [12]:
r.request.headers

{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

In [13]:
r.text[:100]

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <m'
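
The response above also carries the request that produced it (`r.request`). As a minimal sketch, a request can be prepared without sending it, which shows how Requests builds the final URL and headers. The query parameters and the `User-Agent` value below are made-up examples, not something used elsewhere in this notebook.

```python
import requests

# Build a GET request without sending it, so no network access is needed.
# The params and the User-Agent string are hypothetical example values.
request = requests.Request(
    'GET',
    'http://example.com/',
    params={'q': 'web scraping', 'page': 2},
    headers={'User-Agent': 'my-scraper/0.1'},
)
prepared = request.prepare()

print(prepared.method)                 # the HTTP verb
print(prepared.url)                    # URL with the encoded query string appended
print(prepared.headers['User-Agent'])  # the header we set above
```

When sending for real, `requests.get(url, params=..., headers=..., timeout=10)` accepts the same arguments, and calling `r.raise_for_status()` afterwards turns 4xx/5xx responses into exceptions.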

## HTML and CSS

<img src="https://cdn-images-1.medium.com/max/1600/1*x9mxFBXnLU05iPy19dGj7g.png">

### Hypertext Markup Language: HTML

Page link: https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687

In [14]:
url_got = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

In [15]:
r = requests.get(url_got)

In [16]:
r.text[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of Game of Thrones episodes - Wikipedia</title>\n<script>document.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_Game_of_Thrones_episodes","wgTitle":"List of Game of Thrones episodes","wgCurRevisionId":902691464,"wgRevisionId":802553687,"wgArticleId":31120069,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing potentially dated statements from August 2017","All articles containing potentially dated statements","Official website not in Wikidata","Featured lists","Game of Thrones episodes","Lists of American drama television series episodes","Lists of fantasy television series episodes"],"wgBreakFrames":!1,"wgPageContentLanguage"

In [17]:
from bs4 import BeautifulSoup

In [18]:
html_content = r.text

In [19]:
html_soup = BeautifulSoup(html_content, 'html.parser')

In [20]:
html_soup.find('h1')

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [21]:
html_soup.find('', {'id' : 'firstHeading'})

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

- `<p>...</p>` to enclose a paragraph;
- `<br>` to set a line break;
- `<table>...</table>` to start a table block; inside it, `<tr>...</tr>` is used for the rows and `<td>...</td>` for the cells;
- `<img>` for images;
- `<h1>...</h1> to <h6>...</h6>` for headers;
- `<div>...</div>` to indicate a “division” in an HTML document, basically used to group a set of elements;
- `<a>...</a>` for hyperlinks;
- `<ul>...</ul>, <ol>...</ol>` for unordered and ordered lists respectively; inside of these, `<li>...</li>` is used for each list item.
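
To see how these tags fit together, here is a small sketch that parses a hand-written page with Beautiful Soup. The HTML content is invented for illustration; only the tag names come from the list above.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical content) using the tags listed above.
html = """
<div id="main">
  <h1>Episodes</h1>
  <p>First paragraph.<br>Second line.</p>
  <table>
    <tr><td>1</td><td>Winter Is Coming</td></tr>
    <tr><td>2</td><td>The Kingsroad</td></tr>
  </table>
  <ul>
    <li><a href="/wiki/IGN">IGN</a></li>
    <li><a href="/wiki/HBO">HBO</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# The header text, the table as a list of rows, and every link target.
heading = soup.find('h1').get_text()
rows = [[td.get_text() for td in tr.find_all('td')] for tr in soup.find_all('tr')]
links = [a['href'] for a in soup.find_all('a')]

print(heading)
print(rows)
print(links)
```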

## Using Your Browser as a Development Tool

## The Beautiful Soup Library

In [22]:
html_content = r.text

In [23]:
import pandas as pd

> **[beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)**: Beautiful Soup tries to organize complexity: it helps to parse, structure and organize the oftentimes very messy web by fixing bad HTML and presenting us with an easy-to-work-with Python structure.

In [24]:
from bs4 import BeautifulSoup

In [25]:
html_soup = BeautifulSoup(html_content, 'html.parser')

Beautiful Soup can work with several parsers:
- `html.parser`: a built-in Python parser that is decent (especially in recent versions of Python 3) and requires no extra installation.
- `lxml`: very fast, but requires an extra installation.
- `html5lib`: aims to parse web pages in exactly the same way a web browser does, but is a bit slower and also requires an extra installation.

- `find(name, attrs, recursive, string, **keywords)`
- `find_all(name, attrs, recursive, string, limit, **keywords)`
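
The arguments in these two signatures can be tried out on a small inline document. This is only a sketch: the HTML below is invented, and it exercises `name`, `attrs`, the `class_` keyword, `string`, `limit` and `recursive` one at a time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, just large enough to show each argument.
html = """
<div id="outer">
  <h2 class="title">Contents</h2>
  <div id="inner">
    <h2 class="title special">Episodes</h2>
    <h2>Ratings</h2>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# name + attrs: the <div> whose id is "outer"
outer = soup.find('div', attrs={'id': 'outer'})

# find() returns only the first match (or None if there is none)
first = soup.find('h2')

# class_ keyword: matches any element that has this CSS class
by_class = soup.find_all('h2', class_='special')

# string: match on the text inside the tag
by_text = soup.find_all('h2', string='Ratings')

# limit: stop after this many matches
limited = soup.find_all('h2', limit=2)

# recursive=False: look at direct children only
direct = outer.find_all('h2', recursive=False)

print(first.get_text())
print([h.get_text() for h in by_class])
print([h.get_text() for h in direct])
```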

In [26]:
html_soup.find('h1')

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [27]:
html_soup.find('', {'id': 'firstHeading'})

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [28]:
all_h2 = html_soup.find_all('h2', limit=3)

In [29]:
all_h2

[<h2>Contents</h2>,
 <h2><span class="mw-headline" id="Series_overview">Series overview</span></h2>,
 <h2><span class="mw-headline" id="Episodes">Episodes</span></h2>]

In [30]:
len(all_h2)

3

In [31]:
all_h2

[<h2>Contents</h2>,
 <h2><span class="mw-headline" id="Series_overview">Series overview</span></h2>,
 <h2><span class="mw-headline" id="Episodes">Episodes</span></h2>]

In [32]:
for found in html_soup.find_all('h2'):
    print(found)

<h2>Contents</h2>
<h2><span class="mw-headline" id="Series_overview">Series overview</span></h2>
<h2><span class="mw-headline" id="Episodes">Episodes</span></h2>
<h2><span class="mw-headline" id="Home_media_releases">Home media releases</span></h2>
<h2><span class="mw-headline" id="Ratings">Ratings</span></h2>
<h2><span class="mw-headline" id="References">References</span></h2>
<h2><span class="mw-headline" id="External_links">External links</span></h2>
<h2>Navigation menu</h2>


In [33]:
first_h1 = html_soup.find('h1')

In [34]:
first_h1

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [35]:
first_h1.name

'h1'

In [36]:
first_h1.contents

['List of ', <i>Game of Thrones</i>, ' episodes']

In [37]:
str(first_h1)

'<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>'

In [38]:
first_h1.text

'List of Game of Thrones episodes'

In [39]:
first_h1.get_text('---', strip=True)

'List of---Game of Thrones---episodes'

In [40]:
first_h1.attrs

{'id': 'firstHeading', 'class': ['firstHeading'], 'lang': 'en'}

In [41]:
first_h1.attrs['id']

'firstHeading'

In [42]:
first_h1['id']

'firstHeading'

In [43]:
first_h1.get('id')

'firstHeading'

In [44]:
cites = html_soup.find_all('cite', class_='citation', limit=4)

In [45]:
cites

[<cite class="citation web">Fowler, Matt (April 8, 2011). <a class="external text" href="http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">"Game of Thrones: "Winter is Coming" Review"</a>. <a href="/wiki/IGN" title="IGN">IGN</a>. <a class="external text" href="https://web.archive.org/web/20120817073932/http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">Archived</a> from the original on August 17, 2012<span class="reference-accessdate">. Retrieved <span class="nowrap">September 22,</span> 2016</span>.</cite>,
 <cite class="citation news">Fleming, Michael (January 16, 2007). <a class="external text" href="http://www.variety.com/article/VR1117957532.html?categoryid=14&amp;cs=1" rel="nofollow">"HBO turns <i>Fire</i> into fantasy series"</a>. <i><a href="/wiki/Variety_(magazine)" title="Variety (magazine)">Variety</a></i>. <a class="external text" href="https://web.archive.org/web/20120516224747/http://www.variety.com/article/VR1117957532?refCatId=14" rel="nofollow">A

In [46]:
len(cites)

4

In [47]:
cites[0]

<cite class="citation web">Fowler, Matt (April 8, 2011). <a class="external text" href="http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">"Game of Thrones: "Winter is Coming" Review"</a>. <a href="/wiki/IGN" title="IGN">IGN</a>. <a class="external text" href="https://web.archive.org/web/20120817073932/http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">Archived</a> from the original on August 17, 2012<span class="reference-accessdate">. Retrieved <span class="nowrap">September 22,</span> 2016</span>.</cite>

In [48]:
cites

[<cite class="citation web">Fowler, Matt (April 8, 2011). <a class="external text" href="http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">"Game of Thrones: "Winter is Coming" Review"</a>. <a href="/wiki/IGN" title="IGN">IGN</a>. <a class="external text" href="https://web.archive.org/web/20120817073932/http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">Archived</a> from the original on August 17, 2012<span class="reference-accessdate">. Retrieved <span class="nowrap">September 22,</span> 2016</span>.</cite>,
 <cite class="citation news">Fleming, Michael (January 16, 2007). <a class="external text" href="http://www.variety.com/article/VR1117957532.html?categoryid=14&amp;cs=1" rel="nofollow">"HBO turns <i>Fire</i> into fantasy series"</a>. <i><a href="/wiki/Variety_(magazine)" title="Variety (magazine)">Variety</a></i>. <a class="external text" href="https://web.archive.org/web/20120516224747/http://www.variety.com/article/VR1117957532?refCatId=14" rel="nofollow">A

In [49]:
cites[0].get_text()

'Fowler, Matt (April 8, 2011). "Game of Thrones: "Winter is Coming" Review". IGN. Archived from the original on August 17, 2012. Retrieved September 22, 2016.'

In [50]:
link = cites[0].find('a').get('href')

In [51]:
link

'http://tv.ign.com/articles/116/1160215p1.html'

In [52]:
for citation in cites:
    print('---->', citation.get_text())
    print(citation.find('a').get('href'))
    print()

----> Fowler, Matt (April 8, 2011). "Game of Thrones: "Winter is Coming" Review". IGN. Archived from the original on August 17, 2012. Retrieved September 22, 2016.
http://tv.ign.com/articles/116/1160215p1.html

----> Fleming, Michael (January 16, 2007). "HBO turns Fire into fantasy series". Variety. Archived from the original on May 16, 2012. Retrieved September 3, 2016.
http://www.variety.com/article/VR1117957532.html?categoryid=14&cs=1

----> "Game of Thrones". Emmys.com. Retrieved September 17, 2016.
http://www.emmys.com/shows/game-thrones

----> Roberts, Josh (April 1, 2012). "Where HBO's hit 'Game of Thrones' was filmed". USA Today. Archived from the original on April 1, 2012. Retrieved March 8, 2013.
https://web.archive.org/web/20120401123724/http://travel.usatoday.com/destinations/story/2012-04-01/Where-the-HBO-hit-Game-of-Thrones-was-filmed/53876876/1



In [53]:
html_soup.text[:1000]

'\n\n\n\nList of Game of Thrones episodes - Wikipedia\ndocument.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_Game_of_Thrones_episodes","wgTitle":"List of Game of Thrones episodes","wgCurRevisionId":902691464,"wgRevisionId":802553687,"wgArticleId":31120069,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing potentially dated statements from August 2017","All articles containing potentially dated statements","Official website not in Wikidata","Featured lists","Game of Thrones episodes","Lists of American drama television series episodes","Lists of fantasy television series episodes"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefa

In [54]:
episodes = []

In [55]:
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')

In [56]:
ep_tables[0]

<table class="wikitable plainrowheaders wikiepisodetable" style="width:100%"><tbody><tr style="color:white;text-align:center"><th scope="col" style="background:#295354;width:5%"><abbr title="Number">No.</abbr><br/>overall</th><th scope="col" style="background:#295354;width:5%"><abbr title="Number">No.</abbr> in<br/>season</th><th scope="col" style="background:#295354;width:23%">Title</th><th scope="col" style="background:#295354;width:17%">Directed by</th><th scope="col" style="background:#295354;width:27%">Written by</th><th scope="col" style="background:#295354;width:12%">Original air date</th><th scope="col" style="background:#295354;width:10%">U.S. viewers<br/>(millions)</th></tr><tr class="vevent" style="text-align:center;background:inherit"><th id="ep1" rowspan="1" scope="row" style="text-align:center">1</th><td style="text-align:center">1</td><td class="summary" style="text-align:left">"<a href="/wiki/Winter_Is_Coming" title="Winter Is Coming">Winter Is Coming</a>"</td><td style

In [57]:
len(ep_tables)

7

In [58]:
for table in ep_tables:
    rows = table.find_all('tr')
    # Column names come from the header cells of the first row
    headers = [header.text for header in rows[0].find_all('th')]
    for row in rows[1:]:
        # Each episode row mixes a <th> cell (overall number) with <td> cells
        values = [col.text for col in row.find_all(['th', 'td'])]
        if values:
            episodes.append(dict(zip(headers, values)))

In [59]:
episodes[0]

{'No.overall': '1',
 'No. inseason': '1',
 'Title': '"Winter Is Coming"',
 'Directed by': 'Tim Van Patten',
 'Written by': 'David Benioff & D. B. Weiss',
 'Original air date': 'April\xa017,\xa02011\xa0(2011-04-17)',
 'U.S. viewers(millions)': '2.22[20]'}

In [60]:
for epsiode in episodes[:3]:
    print(epsiode)

{'No.overall': '1', 'No. inseason': '1', 'Title': '"Winter Is Coming"', 'Directed by': 'Tim Van Patten', 'Written by': 'David Benioff & D. B. Weiss', 'Original air date': 'April\xa017,\xa02011\xa0(2011-04-17)', 'U.S. viewers(millions)': '2.22[20]'}
{'No.overall': '2', 'No. inseason': '2', 'Title': '"The Kingsroad"', 'Directed by': 'Tim Van Patten', 'Written by': 'David Benioff & D. B. Weiss', 'Original air date': 'April\xa024,\xa02011\xa0(2011-04-24)', 'U.S. viewers(millions)': '2.20[21]'}
{'No.overall': '3', 'No. inseason': '3', 'Title': '"Lord Snow"', 'Directed by': 'Brian Kirk', 'Written by': 'David Benioff & D. B. Weiss', 'Original air date': 'May\xa01,\xa02011\xa0(2011-05-01)', 'U.S. viewers(millions)': '2.44[22]'}


In [61]:
pd.DataFrame(episodes).head(5)

| | Directed by | No. inseason | No.overall | Original air date | Title | U.S. viewers(millions) | Written by |
|---|---|---|---|---|---|---|---|
| 0 | Tim Van Patten | 1 | 1 | April 17, 2011 (2011-04-17) | "Winter Is Coming" | 2.22[20] | David Benioff & D. B. Weiss |
| 1 | Tim Van Patten | 2 | 2 | April 24, 2011 (2011-04-24) | "The Kingsroad" | 2.20[21] | David Benioff & D. B. Weiss |
| 2 | Brian Kirk | 3 | 3 | May 1, 2011 (2011-05-01) | "Lord Snow" | 2.44[22] | David Benioff & D. B. Weiss |
| 3 | Brian Kirk | 4 | 4 | May 8, 2011 (2011-05-08) | "Cripples, Bastards, and Broken Things" | 2.45[23] | Bryan Cogman |
| 4 | Brian Kirk | 5 | 5 | May 15, 2011 (2011-05-15) | "The Wolf and the Lion" | 2.58[24] | David Benioff & D. B. Weiss |
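
The scraped values still contain page artifacts: quotes around the title, non-breaking spaces (`\xa0`) in the date, and footnote markers such as `[20]` after the viewer count. A minimal cleaning sketch follows; the helper name `clean_episode` and the output keys are illustrative choices, and it assumes every row has an ISO date in parentheses and a numeric viewer figure.

```python
import re

# One scraped row, copied from the episode dictionary shown earlier.
raw = {
    'Title': '"Winter Is Coming"',
    'Original air date': 'April\xa017,\xa02011\xa0(2011-04-17)',
    'U.S. viewers(millions)': '2.22[20]',
}

def clean_episode(row):
    """Return a tidied copy of one scraped episode row (hypothetical helper)."""
    title = row['Title'].strip('"')
    # The ISO date sits in parentheses at the end of the air-date cell.
    air_date = re.search(r'\((\d{4}-\d{2}-\d{2})\)', row['Original air date']).group(1)
    # Drop footnote markers such as "[20]" before converting to a number.
    viewers = float(re.sub(r'\[\d+\]', '', row['U.S. viewers(millions)']))
    return {'title': title, 'air_date': air_date, 'viewers_millions': viewers}

cleaned = clean_episode(raw)
print(cleaned)
```

Rows whose cells do not follow this pattern (for example a missing viewer figure) would raise an error here, so a real pipeline would need a fallback for them.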


## Web APIs

### Example of using an API

https://github.com/HackerNews/API

In [62]:
articles = []

In [63]:
url = 'https://hacker-news.firebaseio.com/v0'

In [64]:
top_stories = requests.get(url + '/topstories.json')

In [65]:
top_stories.text[:40]

'[20235527,20236137,20236296,20232318,202'

In [66]:
top_stories = top_stories.json()

In [67]:
top_stories[:5]

[20235527, 20236137, 20236296, 20232318, 20236066]

In [68]:
for story_id in top_stories[:5]:
    story_url = url + f'/item/{story_id}.json'
    print('Downloading: ', story_url)
    r = requests.get(story_url)
    story_dict = r.json()
    articles.append(story_dict)

Downloading:  https://hacker-news.firebaseio.com/v0/item/20235527.json
Downloading:  https://hacker-news.firebaseio.com/v0/item/20236137.json
Downloading:  https://hacker-news.firebaseio.com/v0/item/20236296.json
Downloading:  https://hacker-news.firebaseio.com/v0/item/20232318.json
Downloading:  https://hacker-news.firebaseio.com/v0/item/20236066.json


In [69]:
articles[0]

{'by': 'abhorrence',
 'descendants': 16,
 'id': 20235527,
 'kids': [20236483,
  20235947,
  20236524,
  20236525,
  20236426,
  20235978,
  20236172,
  20236158],
 'score': 157,
 'time': 1561054055,
 'title': 'Open-sourcing Sorbet: a fast, powerful type checker for Ruby',
 'type': 'story',
 'url': 'https://sorbet.org/blog/2019/06/20/open-sourcing-sorbet'}

In [70]:
articles[1]['title']

'Song of the Rarest Large Whale on Earth Recorded for the First Time'

In [71]:
for article in articles:
    print(article['title'])

Open-sourcing Sorbet: a fast, powerful type checker for Ruby
Song of the Rarest Large Whale on Earth Recorded for the First Time
Desjardins announces personal data of 2.9M members improperly shared
Support for right-to-repair laws slowly grows
Gryphon: An open-source framework for algorithmic trading in cryptocurrency


### Import data from web - pandas

##### [Open Data Portal of Slovenia (OPSI)](https://podatki.gov.si/)


On the OPSI portal you will find everything from data and tools to useful resources that you can use to develop web and mobile applications, design your own infographics, and more.

In [72]:
# This endpoint returns a JSON object with dataset metadata, not a CSV file
opsi = requests.get('https://podatki.gov.si/harvest/json-object/eb674637-7251-4264-bcab-5e3355b81990').json()
for key, value in opsi.items():
    print(f'{key}: {value}')

INFO: 
user_id: 69
DESCRIPTION: Zaposleni po dejavnosti in spolu, statistične regije, Slovenija, večletno
metadata_modified_date: 2019-01-31
CONTACT_MAIL: danilo.dolenc@gov.si
PXNAME: 05G3060S.PX
podrocje: Sociala in zaposlovanje
CONTACT_NAME: DANILO DOLENC
guid: 05g3060s.px
publisher_id: 5022932000
LAST-UPDATED: 20190131 10:30
name: surs05g3060s


Example: https://support.spatialkey.com/spatialkey-sample-csv-data/

In [73]:
data = pd.read_csv('http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv')

In [74]:
data.head()

| | street | city | zip | state | beds | baths | sq__ft | type | sale_date | price | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3526 HIGH ST | SACRAMENTO | 95838 | CA | 2 | 1 | 836 | Residential | Wed May 21 00:00:00 EDT 2008 | 59222 | 38.631913 | -121.434879 |
| 1 | 51 OMAHA CT | SACRAMENTO | 95823 | CA | 3 | 1 | 1167 | Residential | Wed May 21 00:00:00 EDT 2008 | 68212 | 38.478902 | -121.431028 |
| 2 | 2796 BRANCH ST | SACRAMENTO | 95815 | CA | 2 | 1 | 796 | Residential | Wed May 21 00:00:00 EDT 2008 | 68880 | 38.618305 | -121.443839 |
| 3 | 2805 JANETTE WAY | SACRAMENTO | 95815 | CA | 2 | 1 | 852 | Residential | Wed May 21 00:00:00 EDT 2008 | 69307 | 38.616835 | -121.439146 |
| 4 | 6001 MCMAHON DR | SACRAMENTO | 95824 | CA | 2 | 1 | 797 | Residential | Wed May 21 00:00:00 EDT 2008 | 81900 | 38.51947 | -121.435768 |
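
The `sale_date` column arrives as plain text. One possible way to turn it into real timestamps is sketched below on two rows inlined from the data above, so the example runs without network access. Dropping the `EDT` abbreviation is a simplification that discards the time zone, and the `price_per_sq_ft` column is an illustrative addition.

```python
import io
import pandas as pd

# Two rows copied from the data above, inlined so no download is needed.
csv_text = """street,city,zip,state,beds,baths,sq__ft,type,sale_date,price,latitude,longitude
3526 HIGH ST,SACRAMENTO,95838,CA,2,1,836,Residential,Wed May 21 00:00:00 EDT 2008,59222,38.631913,-121.434879
51 OMAHA CT,SACRAMENTO,95823,CA,3,1,1167,Residential,Wed May 21 00:00:00 EDT 2008,68212,38.478902,-121.431028
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Remove the time zone abbreviation, then parse with an explicit format.
without_tz = sample['sale_date'].str.replace(' EDT', '', regex=False)
sample['sale_date'] = pd.to_datetime(without_tz, format='%a %b %d %H:%M:%S %Y')

# A derived column, just to show that the numeric columns were typed correctly.
sample['price_per_sq_ft'] = sample['price'] / sample['sq__ft']

print(sample[['street', 'sale_date', 'price_per_sq_ft']])
```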


## Web Scraping using pandas

> Web page: https://www.fdic.gov/bank/individual/failed/banklist.html

`pandas.read_html`: reads HTML tables into a list of DataFrame objects. -> [Documentation](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html)



In [75]:
tables = pd.read_html('https://www.fdic.gov/bank/individual/failed/banklist.html')

In [76]:
len(tables)

1

In [77]:
banks = tables[0]

In [78]:
banks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 556 entries, 0 to 555
Data columns (total 7 columns):
Bank Name                556 non-null object
City                     556 non-null object
ST                       556 non-null object
CERT                     556 non-null int64
Acquiring Institution    556 non-null object
Closing Date             556 non-null object
Updated Date             556 non-null object
dtypes: int64(1), object(6)
memory usage: 30.5+ KB


In [79]:
banks.head()

| | Bank Name | City | ST | CERT | Acquiring Institution | Closing Date | Updated Date |
|---|---|---|---|---|---|---|---|
| 0 | The Enloe State Bank | Cooper | TX | 10716 | Legend Bank, N. A. | May 31, 2019 | June 18, 2019 |
| 1 | Washington Federal Bank for Savings | Chicago | IL | 30570 | Royal Savings Bank | December 15, 2017 | February 1, 2019 |
| 2 | The Farmers and Merchants State Bank of Argonia | Argonia | KS | 17719 | Conway Bank | October 13, 2017 | February 21, 2018 |
| 3 | Fayette County Bank | Saint Elmo | IL | 1802 | United Fidelity Bank, fsb | May 26, 2017 | January 29, 2019 |
| 4 | Guaranty Bank, (d/b/a BestBank in Georgia & Mi... | Milwaukee | WI | 30003 | First-Citizens Bank & Trust Company | May 5, 2017 | March 22, 2018 |


In [80]:
close_timestamps = pd.to_datetime(banks['Closing Date'])

In [81]:
close_timestamps.dt.year.value_counts()

2010    157
2009    140
2011     92
2012     51
2008     25
2013     24
2014     18
2002     11
2015      8
2017      8
2016      5
2001      4
2004      4
2007      3
2003      3
2000      2
2019      1
Name: Closing Date, dtype: int64

## Examples

### Scraping and Visualizing IMDB Ratings

Page: http://www.imdb.com/title/tt0944947/episodes

In [82]:
import requests
from bs4 import BeautifulSoup
url = 'http://www.imdb.com/title/tt0944947/episodes'

In [83]:
episodes = []
rankings = []

In [84]:
for season in range(1, 9):
    r = requests.get(url, params={'season': season})
    soup = BeautifulSoup(r.text, 'html.parser')
    listing = soup.find('div', class_='eplist')
    # Each direct child <div> of the listing is one episode
    for epnr, div in enumerate(listing.find_all('div', recursive=False)):
        episode = f'{season}.{epnr + 1}'
        rating_el = div.find(class_='ipl-rating-star__rating')
        print(episode, rating_el)
        print('----------------')
        rating = float(rating_el.get_text(strip=True))
        episodes.append(episode)
        rankings.append(rating)

1.1 <span class="ipl-rating-star__rating">9.1</span>
----------------
1.2 <span class="ipl-rating-star__rating">8.8</span>
----------------
1.3 <span class="ipl-rating-star__rating">8.7</span>
----------------
...

In [85]:
rankings[:20]

[9.1,
 8.8,
 8.7,
 8.8,
 9.1,
 9.2,
 9.3,
 9.1,
 9.6,
 9.5,
 8.9,
 8.6,
 8.9,
 8.9,
 8.9,
 9.1,
 9.0,
 8.9,
 9.7,
 9.5]

In [86]:
episodes[:20]

['1.1',
 '1.2',
 '1.3',
 '1.4',
 '1.5',
 '1.6',
 '1.7',
 '1.8',
 '1.9',
 '1.10',
 '2.1',
 '2.2',
 '2.3',
 '2.4',
 '2.5',
 '2.6',
 '2.7',
 '2.8',
 '2.9',
 '2.10']

In [87]:
import matplotlib.pyplot as plt

plt.figure()

# One bar per episode, in the order the episodes were scraped
positions = list(range(len(rankings)))
plt.bar(positions, rankings, align='center')

<BarContainer object of 73 artists>
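
Besides plotting every episode, the two lists can be summarised per season. The sketch below uses only the first 20 labels and ratings shown above (seasons 1 and 2), so it runs without scraping; the variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# The first 20 episode labels and ratings, as printed above.
episode_labels = [f'{season}.{number}' for season in (1, 2) for number in range(1, 11)]
ratings = [9.1, 8.8, 8.7, 8.8, 9.1, 9.2, 9.3, 9.1, 9.6, 9.5,
           8.9, 8.6, 8.9, 8.9, 8.9, 9.1, 9.0, 8.9, 9.7, 9.5]

# Group the ratings by the season part of each "season.episode" label.
by_season = defaultdict(list)
for label, rating in zip(episode_labels, ratings):
    season = int(label.split('.')[0])
    by_season[season].append(rating)

season_means = {season: round(mean(values), 2) for season, values in by_season.items()}
print(season_means)
```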

### Scraping Fast Track data

Page: https://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/

In [88]:
# import libraries
from bs4 import BeautifulSoup
import requests
import csv

In [89]:
# specify the url
urlpage =  'http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/'

In [90]:
page = requests.get(urlpage)

In [91]:
soup = BeautifulSoup(page.text, 'html.parser')

In [92]:
table = soup.find('table', class_='tableSorter')

In [93]:
len(table)

3

In [94]:
table

<table class="tableSorter">
<tbody>
<tr>
<th>Rank</th>
<th>Company</th>
<th class="">Location</th>
<th class="no-word-wrap">Year end</th>
<th class="" style="text-align:right;">Annual sales rise over 3 years</th>
<th class="" style="text-align:right;">Latest sales £000s</th>
<th class="" style="text-align:right;">Staff</th>
<th class="">Comment</th>
<!--				<th>FYE</th>-->
</tr>
<tr>
<td>1</td>
<td><a href="https://www.fasttrack.co.uk/company_profile/plan-com/"><span class="company-name">Plan.com</span></a>Communications provider</td>
<td>Isle of Man</td>
<td>Sep-17</td>
<td style="text-align:right;">364.38%</td>
<td style="text-align:right;">*35,418</td>
<td style="text-align:right;">90</td>
<td>About 650 partners use its telecoms platform to support more than 100,000 UK business customers</td>
<!--						<td>Sep-17</td>-->
</tr>
<tr>
<td>2</td>
<td><a href="https://www.fasttrack.co.uk/company_profile/psioxus-2/"><span class="company-name">PsiOxus</span></a>Biotechnology developer</td>

In [95]:
results = table.find_all('tr')

In [96]:
print('Number of rows: ', len(results))

Number of rows:  101


In [97]:
results[1]

<tr>
<td>1</td>
<td><a href="https://www.fasttrack.co.uk/company_profile/plan-com/"><span class="company-name">Plan.com</span></a>Communications provider</td>
<td>Isle of Man</td>
<td>Sep-17</td>
<td style="text-align:right;">364.38%</td>
<td style="text-align:right;">*35,418</td>
<td style="text-align:right;">90</td>
<td>About 650 partners use its telecoms platform to support more than 100,000 UK business customers</td>
<!--						<td>Sep-17</td>-->
</tr>

In [98]:
results[0]

<tr>
<th>Rank</th>
<th>Company</th>
<th class="">Location</th>
<th class="no-word-wrap">Year end</th>
<th class="" style="text-align:right;">Annual sales rise over 3 years</th>
<th class="" style="text-align:right;">Latest sales £000s</th>
<th class="" style="text-align:right;">Staff</th>
<th class="">Comment</th>
<!--				<th>FYE</th>-->
</tr>

In [99]:
rows = []

In [100]:
for row in results[0].find_all('th'):
    rows.append(row.contents[0])

In [101]:
rows

['Rank',
 'Company',
 'Location',
 'Year end',
 'Annual sales rise over 3 years',
 'Latest sales £000s',
 'Staff',
 'Comment']

In [102]:
rows = []
rows.append(['Rank',
 'Company',
 'Location',
 'Year end',
 'Annual sales rise over 3 years',
 'Latest sales £000s',
 'Staff',
 'Comment'])

In [103]:
rows

[['Rank',
  'Company',
  'Location',
  'Year end',
  'Annual sales rise over 3 years',
  'Latest sales £000s',
  'Staff',
  'Comment']]

In [104]:
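for result in results:
    data = result.find_all('td')
    # The header row has only <th> cells, so skip it
    if len(data) == 0:
        continue
# After the loop, data holds the cells of the last row (rank 100)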

In [105]:
data

[<td>100</td>,
 <td><a href="https://www.fasttrack.co.uk/company_profile/brompton-technology/"><span class="company-name">Brompton Technology</span></a>Video technology provider</td>,
 <td>West London</td>,
 <td>Aug-17</td>,
 <td style="text-align:right;">50.17%</td>,
 <td style="text-align:right;">*5,250</td>,
 <td style="text-align:right;">27</td>,
 <td>Its technology is used in high-profile events such as the Oscars</td>]

In [106]:
# write columns to variables
rank = data[0].getText()
company = data[1].getText()
location = data[2].getText()
yearend = data[3].getText()
salesrise = data[4].getText()
sales = data[5].getText()
staff = data[6].getText()
comments = data[7].getText()

In [107]:
rank

'100'

In [108]:
company

'Brompton TechnologyVideo technology provider'

In [109]:
sales

'*5,250'

In [110]:
data

[<td>100</td>,
 <td><a href="https://www.fasttrack.co.uk/company_profile/brompton-technology/"><span class="company-name">Brompton Technology</span></a>Video technology provider</td>,
 <td>West London</td>,
 <td>Aug-17</td>,
 <td style="text-align:right;">50.17%</td>,
 <td style="text-align:right;">*5,250</td>,
 <td style="text-align:right;">27</td>,
 <td>Its technology is used in high-profile events such as the Oscars</td>]

In [111]:
companyname = data[1].find('span', class_='company-name').getText()

In [112]:
companyname

'Brompton Technology'

In [113]:
company.strip(companyname)

'Video technology provid'

In [114]:
company

'Brompton TechnologyVideo technology provider'

In [115]:
description = company.replace(companyname,'')

In [116]:
description

'Video technology provider'
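
The two results differ because `str.strip` treats its argument as a set of characters to remove from both ends, not as a prefix. A short sketch of the difference, with prefix slicing as one alternative to `replace` (on Python 3.9+ `str.removeprefix` does the same thing):

```python
company = 'Brompton TechnologyVideo technology provider'
companyname = 'Brompton Technology'

# strip() removes any leading/trailing characters found in the argument,
# so the trailing "er" of "provider" is eaten as well.
stripped = company.strip(companyname)

# Slicing off a known prefix only touches the start of the string.
if company.startswith(companyname):
    description = company[len(companyname):]
else:
    description = company

print(repr(stripped))
print(repr(description))
```

Unlike `replace`, which removes every occurrence of the name, the slice leaves the rest of the text untouched even if the company name happens to appear again in the description.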

In [117]:
sales

'*5,250'

In [118]:
sales.strip('*').strip('†').replace(',','')

'5250'
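
The cleaned string can then be converted to a number. As a sketch, the two helpers below (hypothetical names) handle the sales and growth cells seen in this table; they assume the only decorations are `*`, `†`, thousands separators and a trailing `%`.

```python
def parse_sales(text):
    """Convert a sales cell such as '*5,250' to an integer (in £000s)."""
    return int(text.strip('*†').replace(',', ''))

def parse_percent(text):
    """Convert a growth cell such as '50.17%' to a float."""
    return float(text.rstrip('%'))

print(parse_sales('*5,250'))
print(parse_sales('*35,418'))
print(parse_percent('364.38%'))
```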

In [119]:
data[1]

<td><a href="https://www.fasttrack.co.uk/company_profile/brompton-technology/"><span class="company-name">Brompton Technology</span></a>Video technology provider</td>

In [120]:
url = data[1].find('a').get('href')

In [121]:
url

'https://www.fasttrack.co.uk/company_profile/brompton-technology/'

In [122]:
page = requests.get(url)

In [123]:
soup = BeautifulSoup(page.text, 'html.parser')

In [124]:
try:
    tableRow = soup.find('table').find_all('tr')[-1]
    webpage = tableRow.find('a').get('href')
except (AttributeError, IndexError):
    # No table, no rows, or no link in the last row
    webpage = None

In [125]:
webpage

'http://www.bromptontech.com'
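
The `try` block is needed because `find` returns `None` when nothing matches, and chaining another call onto `None` raises an error. The same lookup can be written with explicit checks, as in this sketch; the function name and both HTML snippets are invented for illustration and do not reproduce the real profile page.

```python
from bs4 import BeautifulSoup

def last_row_link(html):
    """Return the href of the first link in the last table row, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        return None
    rows = table.find_all('tr')
    if not rows:
        return None
    link = rows[-1].find('a')
    return link.get('href') if link is not None else None

# Hypothetical profile snippets: one with a website row, one without a table.
with_link = (
    '<table><tr><td>Sector</td></tr>'
    '<tr><td><a href="http://www.bromptontech.com">Website</a></td></tr></table>'
)
no_table = '<p>No profile table here.</p>'

print(last_row_link(with_link))
print(last_row_link(no_table))
```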

#### The complete program

In [126]:
# import libraries
from bs4 import BeautifulSoup
import requests
import csv

# specify the url
urlpage =  'http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/'

page = requests.get(urlpage)
soup = BeautifulSoup(page.text, 'html.parser')

table = soup.find('table', class_='tableSorter')
results = table.find_all('tr')
print('Number of rows:', len(results))

Number of rows: 101


In [127]:
rows = []
rows.append(['Rank', 'Company Name', 'Webpage', 'Description', 'Location',
            'Year end', 'Annual sales rise over 3 years', 'Sales £000s',
            'Staff', 'Comments'])

In [128]:
for num, result in enumerate(results):
    data = result.find_all('td')
    # The header row has no <td> cells, so skip it
    if len(data) == 0:
        continue

    # Write the columns to variables
    rank = data[0].getText()
    company = data[1].getText()
    location = data[2].getText()
    yearend = data[3].getText()
    salesrise = data[4].getText()
    sales = data[5].getText()
    staff = data[6].getText()
    comments = data[7].getText()

    # Split the company cell into name and description
    companyname = data[1].find('span', class_='company-name').getText()
    description = company.replace(companyname, '')
    print(num, '- Company is', companyname)

    # Remove footnote markers and thousands separators
    sales = sales.strip('*').strip('†').replace(',', '')

    # Follow the profile link to look up the company's own website
    url = data[1].find('a').get('href')
    company_page = requests.get(url)
    company_soup = BeautifulSoup(company_page.text, 'html.parser')

    try:
        tableRow = company_soup.find('table').find_all('tr')[-1]
        webpage = tableRow.find('a').get('href')
    except (AttributeError, IndexError):
        webpage = None

    rows.append([rank, companyname, webpage, description, location, yearend,
                 salesrise, sales, staff, comments])

1 - Company is Plan.com
2 - Company is PsiOxus
...
100 - Company is Brompton Technology


In [129]:
with open('OUT_companies.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(rows)

In [130]:
df = pd.read_csv('OUT_companies.csv')

In [131]:
df.tail(1)

| | Rank | Company Name | Webpage | Description | Location | Year end | Annual sales rise over 3 years | Sales £000s | Staff | Comments |
|---|---|---|---|---|---|---|---|---|---|---|
| 99 | 100 | Brompton Technology | http://www.bromptontech.com | Video technology provider | West London | Aug-17 | 50.17% | 5250 | 27 | Its technology is used in high-profile events ... |
