In [3]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Web scraping: obtaining data from the web


## What Is Web Scraping?

<img src="https://cdn-images-1.medium.com/max/1600/1*GOyqaID2x1N5lD_rhTDKVQ.png">

### Why Web Scraping for Data Science?

## Network complexity

## HTTP

## HTTP in Python: The Requests Library

[Requests: HTTP for Humans](https://2.python-requests.org/en/master/)

In [2]:
import requests

In [4]:
url = 'http://example.com/'

In [5]:
r = requests.get(url)

In [6]:
r

<Response [200]>

In [7]:
type(r)

requests.models.Response

In [4]:
r.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 50px;\n        background-color: #fff;\n        border-radius: 1em;\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        body {\n            background-color: #fff;\n        }\n        div {\n            width: auto;\n            margin: 0 auto;\n            border-radius: 0;\n            padding: 1em;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n

In [8]:
r.status_code

200

In [9]:
r.reason

'OK'

In [10]:
r.headers

{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Cache-Control': 'max-age=604800', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Tue, 28 May 2019 20:00:57 GMT', 'Etag': '"1541025663+ident"', 'Expires': 'Tue, 04 Jun 2019 20:00:57 GMT', 'Last-Modified': 'Fri, 09 Aug 2013 23:54:35 GMT', 'Server': 'ECS (bsa/EB24)', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'Content-Length': '606'}

In [11]:
r.request

<PreparedRequest [GET]>

In [12]:
r.request.headers

{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
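
The request headers above are the defaults that Requests sends. Below is a minimal sketch of overriding them and of checking the response status. The helper names `build_request` and `fetch_html` and the `User-Agent` value are assumptions made for illustration. The request is only prepared, not sent, so the example runs offline.

```python
import requests


def build_request(url, user_agent):
    """Prepare (but do not send) a GET request with a custom User-Agent."""
    request = requests.Request('GET', url, headers={'User-Agent': user_agent})
    return request.prepare()


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on 4xx/5xx responses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


# 'my-course-scraper/0.1' is an illustrative value, not a required format.
prepared = build_request('http://example.com/', 'my-course-scraper/0.1')
print(prepared.method, prepared.url)
print(prepared.headers['User-Agent'])
```

Calling `fetch_html('http://example.com/')` would perform the actual download; the `timeout` keeps a stalled server from blocking the notebook indefinitely.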

## HTML and CSS

<img src="https://cdn-images-1.medium.com/max/1600/1*x9mxFBXnLU05iPy19dGj7g.png">

### Hypertext Markup Language: HTML

Page link: https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687

In [8]:
url_got = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

In [9]:
r = requests.get(url_got)

In [10]:
r.text[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of Game of Thrones episodes - Wikipedia</title>\n<script>document.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_Game_of_Thrones_episodes","wgTitle":"List of Game of Thrones episodes","wgCurRevisionId":899966754,"wgRevisionId":802553687,"wgArticleId":31120069,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing potentially dated statements from August 2017","All articles containing potentially dated statements","Official website not in Wikidata","Featured lists","Game of Thrones episodes","Lists of American drama television series episodes","Lists of fantasy television series episodes"],"wgBreakFrames":!1,"wgPageContentLanguage"

In [16]:
from bs4 import BeautifulSoup

In [11]:
html_content = r.text

In [12]:
html_soup = BeautifulSoup(html_content, 'html.parser')

In [13]:
html_soup.find('h1')

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [15]:
html_soup.find(id='firstHeading')  # match on the id attribute, whatever the tag name

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

- `<p>...</p>` to enclose a paragraph;
- `<br>` to set a line break;
- `<table>...</table>` to start a table block; inside it, `<tr>...</tr>` is used for the rows and `<td>...</td>` for the cells;
- `<img>` for images;
- `<h1>...</h1>` to `<h6>...</h6>` for headers;
- `<div>...</div>` to indicate a “division” in an HTML document, basically used to group a set of elements;
- `<a>...</a>` for hyperlinks;
- `<ul>...</ul>` and `<ol>...</ol>` for unordered and ordered lists, respectively; inside these, `<li>...</li>` is used for each list item.
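
To see how these tags fit together, here is a minimal sketch that parses a small hand-written fragment (not taken from any real site) with Beautiful Soup and pulls out the heading, the list items, the table cells, and the link target.

```python
from bs4 import BeautifulSoup

# A hand-written page fragment used only for illustration.
sample_html = """
<div>
  <h1>Shopping</h1>
  <p>Things to buy:<br>this week</p>
  <ul><li>Milk</li><li>Bread</li></ul>
  <table>
    <tr><td>Milk</td><td>1.20</td></tr>
    <tr><td>Bread</td><td>2.50</td></tr>
  </table>
  <a href="https://example.com/">More</a>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

heading = soup.find('h1').get_text()
items = [li.get_text() for li in soup.find_all('li')]
# One inner list per <tr>, one string per <td>.
rows = [[td.get_text() for td in tr.find_all('td')] for tr in soup.find_all('tr')]
link = soup.find('a')['href']

print(heading, items, rows, link)
```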

## Using Your Browser as a Development Tool

## The Beautiful Soup Library

In [8]:
html_content = r.text

In [11]:
import pandas as pd

> **[beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)**: Beautiful Soup tries to organize complexity: it helps to parse, structure and organize the oftentimes very messy web by fixing bad HTML and presenting us with an easy-to-work-with Python structure.

In [9]:
from bs4 import BeautifulSoup

In [17]:
html_soup = BeautifulSoup(html_content, 'html.parser')

Beautiful Soup relies on an underlying parser to turn the HTML text into a tree. In Python, several parsers are available:
- `html.parser`: a built-in Python parser that is decent (especially in recent versions of Python 3) and requires no extra installation.
- `lxml`: very fast, but requires an extra installation.
- `html5lib`: aims to parse web pages in exactly the same way as a web browser does, but is a bit slower.
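
Since not every parser is installed on every machine, one possible approach is to try them in order of preference and fall back to the built-in one. The helper `make_soup` below is a hypothetical name introduced for this sketch; `FeatureNotFound` is the exception Beautiful Soup raises when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, parsers=('lxml', 'html5lib', 'html.parser')):
    """Return a soup built with the first parser that is installed."""
    for parser in parsers:
        try:
            return BeautifulSoup(html, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError('no usable HTML parser found')


# Deliberately sloppy HTML: the <p> and <b> tags are never closed.
soup = make_soup('<p>Hello <b>world')
print(soup.find('b').get_text())
```

Whichever parser is picked, the unclosed tags are repaired, so the `<b>` element can still be found.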

- `find(name, attrs, recursive, string, **keywords)`
- `find_all(name, attrs, recursive, string, limit, **keywords)`
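
A minimal sketch of what the individual arguments do, run against a hand-written HTML fragment (an assumption for illustration, not part of the Wikipedia page used in this notebook):

```python
from bs4 import BeautifulSoup

# Hand-written HTML used only for this illustration.
sample_html = """
<div id="main">
  <h2 class="title">First</h2>
  <section><h2 class="title extra">Second</h2></section>
  <h2>Third</h2>
</div>
"""
soup = BeautifulSoup(sample_html, 'html.parser')
main = soup.find('div', attrs={'id': 'main'})     # name + attrs

all_h2 = main.find_all('h2')                      # every <h2> below <div>
titled = main.find_all('h2', class_='title')      # keyword filter on the class
direct = main.find_all('h2', recursive=False)     # direct children only
first_two = main.find_all('h2', limit=2)          # stop after two matches
by_text = main.find('h2', string='Third')         # match on the tag's text

print(len(all_h2), len(titled), len(direct), len(first_two), by_text)
```

`find` returns the first match (or `None`), while `find_all` always returns a list, which may be empty.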

In [18]:
html_soup.find('h1')

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [19]:
html_soup.find(id='firstHeading')  # match on the id attribute, whatever the tag name

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [20]:
all_h2 = html_soup.find_all('h2')

In [21]:
all_h2

[<h2>Contents</h2>,
 <h2><span class="mw-headline" id="Series_overview">Series overview</span></h2>,
 <h2><span class="mw-headline" id="Episodes">Episodes</span></h2>,
 <h2><span class="mw-headline" id="Home_media_releases">Home media releases</span></h2>,
 <h2><span class="mw-headline" id="Ratings">Ratings</span></h2>,
 <h2><span class="mw-headline" id="References">References</span></h2>,
 <h2><span class="mw-headline" id="External_links">External links</span></h2>,
 <h2>Navigation menu</h2>]

In [24]:
len(all_h2)

8

In [22]:
for found in html_soup.find_all('h2'):
    print(found)

<h2>Contents</h2>
<h2><span class="mw-headline" id="Series_overview">Series overview</span></h2>
<h2><span class="mw-headline" id="Episodes">Episodes</span></h2>
<h2><span class="mw-headline" id="Home_media_releases">Home media releases</span></h2>
<h2><span class="mw-headline" id="Ratings">Ratings</span></h2>
<h2><span class="mw-headline" id="References">References</span></h2>
<h2><span class="mw-headline" id="External_links">External links</span></h2>
<h2>Navigation menu</h2>


In [23]:
first_h1 = html_soup.find('h1')

In [24]:
first_h1

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [25]:
first_h1.name

'h1'

In [26]:
first_h1.contents

['List of ', <i>Game of Thrones</i>, ' episodes']

In [27]:
str(first_h1)

'<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>'

In [28]:
first_h1.text

'List of Game of Thrones episodes'

In [35]:
first_h1.get_text()

'List of Game of Thrones episodes'

In [37]:
first_h1.attrs

{'id': 'firstHeading', 'class': ['firstHeading'], 'lang': 'en'}

In [38]:
first_h1.attrs['id']

'firstHeading'

In [39]:
first_h1['id']

'firstHeading'

In [40]:
first_h1.get('id')

'firstHeading'

In [29]:
cites = html_soup.find_all('cite', class_='citation', limit=4)

In [30]:
len(cites)

4

In [43]:
cites[0]

<cite class="citation web">Fowler, Matt (April 8, 2011). <a class="external text" href="http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">"Game of Thrones: "Winter is Coming" Review"</a>. <a href="/wiki/IGN" title="IGN">IGN</a>. <a class="external text" href="https://web.archive.org/web/20120817073932/http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">Archived</a> from the original on August 17, 2012<span class="reference-accessdate">. Retrieved <span class="nowrap">September 22,</span> 2016</span>.</cite>

In [44]:
cites

[<cite class="citation web">Fowler, Matt (April 8, 2011). <a class="external text" href="http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">"Game of Thrones: "Winter is Coming" Review"</a>. <a href="/wiki/IGN" title="IGN">IGN</a>. <a class="external text" href="https://web.archive.org/web/20120817073932/http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">Archived</a> from the original on August 17, 2012<span class="reference-accessdate">. Retrieved <span class="nowrap">September 22,</span> 2016</span>.</cite>,
 <cite class="citation news">Fleming, Michael (January 16, 2007). <a class="external text" href="http://www.variety.com/article/VR1117957532.html?categoryid=14&amp;cs=1" rel="nofollow">"HBO turns <i>Fire</i> into fantasy series"</a>. <i><a href="/wiki/Variety_(magazine)" title="Variety (magazine)">Variety</a></i>. <a class="external text" href="https://web.archive.org/web/20120516224747/http://www.variety.com/article/VR1117957532?refCatId=14" rel="nofollow">A

In [45]:
cites[0].get_text()

'Fowler, Matt (April 8, 2011). "Game of Thrones: "Winter is Coming" Review". IGN. Archived from the original on August 17, 2012. Retrieved September 22, 2016.'

In [48]:
link = cites[0].find('a').get('href')

In [49]:
link

'http://tv.ign.com/articles/116/1160215p1.html'

In [None]:
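for citation in cites:
    print(citation.get_text())
    first_link = citation.find('a')  # first link in the citation, if any
    print('---->', first_link.get('href') if first_link else None)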

In [50]:
html_soup.text[:1000]

'\n\n\n\nList of Game of Thrones episodes - Wikipedia\ndocument.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_Game_of_Thrones_episodes","wgTitle":"List of Game of Thrones episodes","wgCurRevisionId":898999050,"wgRevisionId":802553687,"wgArticleId":31120069,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing potentially dated statements from August 2017","All articles containing potentially dated statements","Official website not in Wikidata","Featured lists","Game of Thrones episodes","Lists of American drama television series episodes","Lists of fantasy television series episodes"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefa

In [None]:
episodes = []

In [51]:
ep_tables = html_soup.find_all('table', class_='wikiepisodetable')

In [53]:
ep_tables[0]

<table class="wikitable plainrowheaders wikiepisodetable" style="width:100%"><tbody><tr style="color:white;text-align:center"><th scope="col" style="background:#295354;width:5%"><abbr title="Number">No.</abbr><br/>overall</th><th scope="col" style="background:#295354;width:5%"><abbr title="Number">No.</abbr> in<br/>season</th><th scope="col" style="background:#295354;width:23%">Title</th><th scope="col" style="background:#295354;width:17%">Directed by</th><th scope="col" style="background:#295354;width:27%">Written by</th><th scope="col" style="background:#295354;width:12%">Original air date</th><th scope="col" style="background:#295354;width:10%">U.S. viewers<br/>(millions)</th></tr><tr class="vevent" style="text-align:center;background:inherit"><th id="ep1" rowspan="1" scope="row" style="text-align:center">1</th><td style="text-align:center">1</td><td class="summary" style="text-align:left">"<a href="/wiki/Winter_Is_Coming" title="Winter Is Coming">Winter Is Coming</a>"</td><td style
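
The empty `episodes` list and the `ep_tables` result suggest collecting one record per table row. A minimal sketch of that idea follows, run on a simplified hand-written stand-in for one `wikiepisodetable` (only three of the real columns); how the real tables should be cleaned is an assumption here.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for one 'wikiepisodetable'.
sample_html = """
<table class="wikitable wikiepisodetable">
  <tr><th>No. overall</th><th>Title</th><th>Directed by</th></tr>
  <tr><th>1</th><td>"Winter Is Coming"</td><td>Tim Van Patten</td></tr>
  <tr><th>2</th><td>"The Kingsroad"</td><td>Tim Van Patten</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

episodes = []
for table in soup.find_all('table', class_='wikiepisodetable'):
    rows = table.find_all('tr')
    # The first row holds the column names.
    headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]
    for row in rows[1:]:
        # Data rows mix <th> (episode number) and <td> cells.
        values = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
        if len(values) == len(headers):
            episodes.append(dict(zip(headers, values)))

print(episodes[0])
```

Rows whose cell count differs from the header (for example, rows with merged cells) are skipped rather than guessed at.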

## Web APIs

### Example of using an API

https://github.com/HackerNews/API

In [60]:
articles = []

In [57]:
url = 'https://hacker-news.firebaseio.com/v0'

In [58]:
top_stories = requests.get(url + '/topstories.json')

In [61]:
top_stories.text[:40]
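
A minimal sketch of calling the Hacker News API with Requests: the `topstories.json` endpoint returns a list of item ids, and each item is then available under `item/<id>.json`. The function name `fetch_top_articles` and the injectable `get` parameter (which makes the function testable without network access) are assumptions made for this example.

```python
import requests

BASE_URL = 'https://hacker-news.firebaseio.com/v0'


def fetch_top_articles(limit=5, get=requests.get):
    """Return the item records of the first `limit` top stories."""
    response = get(BASE_URL + '/topstories.json')
    response.raise_for_status()
    story_ids = response.json()[:limit]  # the endpoint returns a list of ids

    articles = []
    for story_id in story_ids:
        item = get(f'{BASE_URL}/item/{story_id}.json')
        item.raise_for_status()
        articles.append(item.json())
    return articles


# Usage (needs network access):
# for article in fetch_top_articles(limit=3):
#     print(article.get('title'), article.get('url'))
```

Each item requires its own request, so keeping `limit` small avoids sending hundreds of requests at once.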

### Import data from web - pandas

##### [Odprti podatki Slovenije (Open Data of Slovenia)](https://podatki.gov.si/)


On the OPSI portal you will find everything from data and tools to useful resources that you can use to develop web and mobile applications, design your own infographics, and more.

Example: https://support.spatialkey.com/spatialkey-sample-csv-data/

In [62]:
data = pd.read_csv('http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv')

In [63]:
data.head()

| | street | city | zip | state | beds | baths | sq__ft | type | sale_date | price | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3526 HIGH ST | SACRAMENTO | 95838 | CA | 2 | 1 | 836 | Residential | Wed May 21 00:00:00 EDT 2008 | 59222 | 38.631913 | -121.434879 |
| 1 | 51 OMAHA CT | SACRAMENTO | 95823 | CA | 3 | 1 | 1167 | Residential | Wed May 21 00:00:00 EDT 2008 | 68212 | 38.478902 | -121.431028 |
| 2 | 2796 BRANCH ST | SACRAMENTO | 95815 | CA | 2 | 1 | 796 | Residential | Wed May 21 00:00:00 EDT 2008 | 68880 | 38.618305 | -121.443839 |
| 3 | 2805 JANETTE WAY | SACRAMENTO | 95815 | CA | 2 | 1 | 852 | Residential | Wed May 21 00:00:00 EDT 2008 | 69307 | 38.616835 | -121.439146 |
| 4 | 6001 MCMAHON DR | SACRAMENTO | 95824 | CA | 2 | 1 | 797 | Residential | Wed May 21 00:00:00 EDT 2008 | 81900 | 38.51947 | -121.435768 |


## Web Scraping using pandas

> Web page: https://www.fdic.gov/bank/individual/failed/banklist.html

`pandas.read_html`: reads HTML tables into a list of DataFrame objects. -> [Documentation](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html)



In [64]:
tables = pd.read_html('https://www.fdic.gov/bank/individual/failed/banklist.html')

In [66]:
len(tables)

1

In [67]:
banks = tables[0]

In [68]:
banks.info()

In [69]:
banks.head()

| | Bank Name | City | ST | CERT | Acquiring Institution | Closing Date | Updated Date |
|---|---|---|---|---|---|---|---|
| 0 | Washington Federal Bank for Savings | Chicago | IL | 30570 | Royal Savings Bank | December 15, 2017 | February 1, 2019 |
| 1 | The Farmers and Merchants State Bank of Argonia | Argonia | KS | 17719 | Conway Bank | October 13, 2017 | February 21, 2018 |
| 2 | Fayette County Bank | Saint Elmo | IL | 1802 | United Fidelity Bank, fsb | May 26, 2017 | January 29, 2019 |
| 3 | Guaranty Bank, (d/b/a BestBank in Georgia & Mi... | Milwaukee | WI | 30003 | First-Citizens Bank & Trust Company | May 5, 2017 | March 22, 2018 |
| 4 | First NBC Bank | New Orleans | LA | 58302 | Whitney Bank | April 28, 2017 | January 29, 2019 |
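
`read_html` also accepts a file-like object, which makes it easy to try out offline. The sketch below uses a hand-written table that mimics a few of the columns above; the `match` argument and the date conversion are optional extras chosen for illustration. Note that `read_html` needs an HTML parser such as `lxml` (or `bs4` together with `html5lib`) to be installed.

```python
import io

import pandas as pd

# Hand-written table that mimics part of the column layout shown above.
sample_html = """
<table>
  <tr><th>Bank Name</th><th>City</th><th>ST</th><th>Closing Date</th></tr>
  <tr><td>First NBC Bank</td><td>New Orleans</td><td>LA</td><td>April 28, 2017</td></tr>
  <tr><td>Fayette County Bank</td><td>Saint Elmo</td><td>IL</td><td>May 26, 2017</td></tr>
</table>
"""

# `match` keeps only the tables whose text matches the given string or regex.
tables = pd.read_html(io.StringIO(sample_html), match='Bank Name')
banks = tables[0]

# Dates arrive as plain strings; convert them so they sort chronologically.
banks['Closing Date'] = pd.to_datetime(banks['Closing Date'])
latest = banks.sort_values('Closing Date', ascending=False).iloc[0]
print(latest['Bank Name'])
```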


## Examples

### Scraping and Visualizing IMDB Ratings

Page: http://www.imdb.com/title/tt0944947/episodes

In [71]:
import requests
from bs4 import BeautifulSoup
url = 'http://www.imdb.com/title/tt0944947/episodes'

### Scraping Fast Track data

Page: https://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/

In [76]:
# import libraries
from bs4 import BeautifulSoup
import requests
import csv

In [73]:
# specify the url
urlpage = 'http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/'

In [74]:
page = requests.get(urlpage)

In [78]:
soup = BeautifulSoup(page.text, 'html.parser')

In [79]:
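# loop over the table rows and write the columns to variables
# (assumes the league table is the first <table> on the page)
table = soup.find('table')
for row in table.find_all('tr'):
    # named `cells` so it does not clash with the `data` DataFrame defined earlier
    cells = row.find_all('td')
    if len(cells) < 8:
        continue  # skip the header row and any incomplete rows
    rank = cells[0].getText()
    company = cells[1].getText()
    location = cells[2].getText()
    yearend = cells[3].getText()
    salesrise = cells[4].getText()
    sales = cells[5].getText()
    staff = cells[6].getText()
    comments = cells[7].getText()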
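
The column extraction can be extended to collect every row and save the result with the `csv` module imported earlier. The sketch below runs on a hand-written stand-in for the league table (three columns, made-up company names), since the structure of the live page is not shown here.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hand-written stand-in for the league table; the real page has more columns.
sample_html = """
<table>
  <tr><th>Rank</th><th>Company</th><th>Location</th></tr>
  <tr><td>1</td><td>Example Ltd</td><td>London</td></tr>
  <tr><td>2</td><td>Sample plc</td><td>Leeds</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

rows = [['Rank', 'Company', 'Location']]  # header row for the CSV
for result in soup.find('table').find_all('tr'):
    cells = result.find_all('td')
    if not cells:
        continue  # the header row contains <th> cells only
    rows.append([cell.get_text().strip() for cell in cells])

# Write to an in-memory buffer; use open('file.csv', 'w', newline='') for a file.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```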

#### The complete program