# Lecture 6: How to get the data
aka "Why BeautifulSoup is cool"

(This lecture is inspired by the webpage [here](https://www.dataquest.io/blog/web-scraping-beautifulsoup/))

## Loading python modules

[requests](http://docs.python-requests.org/en/master/) fetches a webpage.
[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) navigates it.
<br>**PAY ATTENTION!** The module is request**s**, with a final **s** (`request` is a different module).

In [1]:
import sys

In [2]:
from requests import get
from requests.exceptions import ProxyError, ConnectionError, ChunkedEncodingError, Timeout

In [3]:
from bs4 import BeautifulSoup
import bs4

In [159]:
import networkx as nx

# IMDB

What is [IMDB](https://en.wikipedia.org/wiki/IMDb)?

The webpage of [IMDB search](https://www.imdb.com/search/title)

## requests in action

In [4]:
url = 'http://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1'

In [5]:
response = get(url,headers = {"Accept-Language": "en-US, en;q=0.5"})

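With this header, `get` asks the server for US English first and for any other English variant as a fallback. The value `q=0.5` is a relative preference weight between 0 and 1 (a language without `q` has weight 1), not a measure of strictness.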
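
To make the role of `q` concrete, here is a minimal sketch (standard library only) that splits an `Accept-Language` value into languages and weights. The helper `parse_accept_language` is a hypothetical name introduced for illustration. It is not part of requests, and it only handles the simple form used in the header above.

```python
def parse_accept_language(header):
    """Return (language, weight) pairs, highest preference first."""
    prefs = []
    for item in header.split(','):
        parts = item.strip().split(';')
        lang = parts[0].strip()
        weight = 1.0  # a language without an explicit q has weight 1
        for param in parts[1:]:
            key, _, value = param.strip().partition('=')
            if key == 'q':
                weight = float(value)
        prefs.append((lang, weight))
    # sorted() is stable, so ties keep their original order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


prefs = parse_accept_language("en-US, en;q=0.5")
print(prefs)
```

Here `en-US` gets the implicit weight 1 and `en` the weight 0.5, so the server is asked to prefer US English when both are available.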

In [6]:
print(response.text[:500])



<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle",


Cool, but with plain text I cannot do anything really interesting...

## Beautiful Soup is indeed beautiful
The nice thing about BeautifulSoup is that it lets you navigate the webpage.

### Getting a navigable webpage

In [7]:
html_soup = BeautifulSoup(response.text, 'html.parser')

Here we are using `html.parser`, the parser that ships with Python, but there are [others](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser).

In [8]:
html_soup

\n<!DOCTYPE html>\n\n<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">\n<head>\n<meta charset="unicode-escape"/>\n<meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>\n<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>\n<script>\n    if (typeof uet == 'function') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<title>IMDb: Released between 2018-01-01 and 2018-12-31\n(Sorted by Number of Votes Descending) - IMDb</title>\n<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == 'function') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>\n    if (typeof uex == 'function'

The parsed HTML page exposes several attributes. Let us play a little with some of them.

In [9]:
html_soup.head

<head>\n<meta charset="unicode-escape"/>\n<meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>\n<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>\n<script>\n    if (typeof uet == 'function') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<title>IMDb: Released between 2018-01-01 and 2018-12-31\n(Sorted by Number of Votes Descending) - IMDb</title>\n<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == 'function') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>\n    if (typeof uex == 'function') {\n      uex("ld", "LoadTitle", {wb: 1});\n    }\n</script>\n<link href="https://www.imdb.com/search/

In [10]:
testa=html_soup.head

In [11]:
testa.contents

[u'\n',
 <meta charset="unicode-escape"/>,
 u'\n',
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>,
 u'\n',
 <meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>,
 u'\n',
 <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>,
 u'\n',
 <script>\n    if (typeof uet == 'function') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>,
 u'\n',
 <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>,
 u'\n',
 <title>IMDb: Released between 2018-01-01 and 2018-12-31\n(Sorted by Number of Votes Descending) - IMDb</title>,
 u'\n',
 <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>,
 u'\n',
 <script>\n    if (typeof uet == 'function') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>,
 u'\n',
 <script>\n    if (typeof uex == 'function') {\n      uex("ld", 

We can navigate by indexing the list of contents...

In [12]:
testa.contents[13]

<title>IMDb: Released between 2018-01-01 and 2018-12-31\n(Sorted by Number of Votes Descending) - IMDb</title>

In [13]:
testa.contents[13].name

u'title'

...or using the tags:

In [14]:
testa.title

<title>IMDb: Released between 2018-01-01 and 2018-12-31\n(Sorted by Number of Votes Descending) - IMDb</title>

[How to navigate in a soup?](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree)
<br/>An HTML page is organized hierarchically, with parents, children and descendants. Check the link for a more detailed guide.
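
As a small illustration of parents, children and descendants, here is a sketch on a tiny hand-written page (the HTML string is made up for this example, so it runs without network access). It only uses the `.parent`, `.children` and `.descendants` attributes described in the linked guide.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A tiny hand-written page, not the IMDB one
html = "<html><head><title>Demo</title></head><body><p>Hi <b>there</b></p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

parent_name = soup.title.parent.name                 # tag that encloses <title>
child_names = [c.name for c in soup.body.children]   # direct children only
# descendants also yields plain strings, so keep only the tags
descendant_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(parent_name)
print(child_names)
print(descendant_names)
```

The `<b>` tag shows up among the descendants of `<body>` but not among its children, because it sits one level deeper, inside `<p>`.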

### Finding the right tags...

I mean, BeautifulSoup is indeed beautiful, but without knowing what we are looking for it is quite a mess...
<br/> Let us investigate the structure of the HTML. This can be done by opening the **Developer Tools** of your browser (the images are taken from [here](https://www.dataquest.io/blog/web-scraping-beautifulsoup/)). On Chrome:<br>
!['developer tools'](l6_developer_tools.png 'Inspect!')

There is something similar in each browser. On Firefox:
![Firefox](l6_developer_tools_firefox.png 'Sorry, I left it in Italian!')

Nevertheless, the important thing is that it lets you examine the HTML structure.
Going back to Chrome, if you hover the pointer over a certain movie:
![Logan](l6_container.png 'Still Marvel...')

Ok, so the information we are looking for is contained in a 'div' tag. But there are a lot of them! Indeed, they contain all we need. For instance, the title is here:
![here](l6_h3_title.png)

... the rating is here...
![Substructure](l6_rating.png 'Here!')

... the total number of votes here...
![votes](l6_votes.png)

... the metascore is here...
![metascore](l6_metascore.png)

... the cast and the directors are here!
![cast](l6_directors_actors.png)

### Going back to our soup

Let us select all 'div' tags with class 'lister-item mode-advanced' (like the one containing the information about 'Logan').

In [15]:
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')

In [16]:
print(len(movie_containers))

50


Ok, it starts looking nicer: we have exactly as many elements as there are results on the search page. But is it really nice?
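
One detail worth knowing about the `class_` argument: when it contains a space, BeautifulSoup compares it with the exact string of the `class` attribute, so the order of the classes matters. The sketch below shows this on a made-up three-line HTML fragment (not the real IMDB page) and compares it with a CSS selector, which ignores the order.

```python
from bs4 import BeautifulSoup

# Made-up fragment: same classes in a different order, plus a different one
html = """
<div class="lister-item mode-advanced">A</div>
<div class="mode-advanced lister-item">B</div>
<div class="lister-item mode-simple">C</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Exact match on the whole attribute string
exact = [d.text for d in soup.find_all('div', class_='lister-item mode-advanced')]
# CSS selector: both classes must be present, in any order
by_css = [d.text for d in soup.select('div.lister-item.mode-advanced')]

print(exact)
print(by_css)
```

On the search page used here the order is always the same, so the `find_all` call above is enough. The selector form is just a more tolerant alternative.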

### Refining our search

In [17]:
first_movie = movie_containers[0]
first_movie

<div class="lister-item mode-advanced">\n<div class="lister-top-right">\n<div class="ribbonize" data-caller="filmosearch" data-tconst="tt4154756"></div>\n</div>\n<div class="lister-item-image float-left">\n<a href="/title/tt4154756/?ref_=adv_li_i"> <img alt="Avengers: Infinity War" class="loadlate" data-tconst="tt4154756" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BMjMxNjY2MDU1OV5BMl5BanBnXkFtZTgwNzY1MTUwNTM@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB470041630_.png" width="67"/>\n</a> </div>\n<div class="lister-item-content">\n<h3 class="lister-item-header">\n<span class="lister-item-index unbold text-primary">1.</span>\n<a href="/title/tt4154756/?ref_=adv_li_tt">Avengers: Infinity War</a>\n<span class="lister-item-year text-muted unbold">(2018)</span>\n</h3>\n<p class="text-muted ">\n<span class="certificate">PG-13</span>\n<span class="ghost">|</span>\n<span class="runtime">149 min</span>\n

#### The title

It could be nicer... Anyway, that's not a big issue since we know that the title is inside the tag 'h3'...

In [18]:
first_movie.h3

<h3 class="lister-item-header">\n<span class="lister-item-index unbold text-primary">1.</span>\n<a href="/title/tt4154756/?ref_=adv_li_tt">Avengers: Infinity War</a>\n<span class="lister-item-year text-muted unbold">(2018)</span>\n</h3>

... then let us look for the inner tag with the title...

In [19]:
first_movie.h3.a

<a href="/title/tt4154756/?ref_=adv_li_tt">Avengers: Infinity War</a>

... and get the text:

In [20]:
first_movie.h3.a.text

u'Avengers: Infinity War'

**Cool!** We can even get the year (if we are interested...)

#### The year

In [21]:
first_movie.h3.span

<span class="lister-item-index unbold text-primary">1.</span>

Where is the second one? BeautifulSoup returns only the first tag it finds with that name.

In [22]:
first_movie.find('span', class_="lister-item-year text-muted unbold").text

u'(2018)'
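
An alternative that does not depend on the class name is to collect all the 'span' tags and take the second one. The sketch below rebuilds the 'h3' header printed earlier from a string, so it runs on its own. Stripping the parentheses with `strip('()')` assumes the year always has the plain `(2018)` form, which may not hold for every title.

```python
from bs4 import BeautifulSoup

# The <h3> header of the first movie, pasted in as a string
html = ('<h3 class="lister-item-header">'
        '<span class="lister-item-index unbold text-primary">1.</span>'
        '<a href="/title/tt4154756/?ref_=adv_li_tt">Avengers: Infinity War</a>'
        '<span class="lister-item-year text-muted unbold">(2018)</span>'
        '</h3>')
h3 = BeautifulSoup(html, 'html.parser').h3

spans = h3.find_all('span')          # every <span>, in document order
year_text = spans[1].text            # the second one holds the year
year = int(year_text.strip('()'))    # assumes the plain "(2018)" form

print(year)
```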

#### The rating

The rating is instead contained in the strong tag...

In [23]:
first_movie.strong.text

u'8.5'

#### The metascore

The metascore is in yet another 'span':

In [24]:
first_movie.find('span', class_= "metascore favorable").text

u'68        '
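
Note that searching for the full string 'metascore favorable' only finds scores marked as favorable, and that `find` returns `None` when a movie has no metascore at all. A more tolerant sketch is below. It runs on a made-up fragment, and the class name 'mixed' is an assumption about how other scores are labelled, so check it in the Developer Tools.

```python
from bs4 import BeautifulSoup

# Made-up fragment: two movies with a score, one without
html = """
<div class="lister-item"><span class="metascore favorable">68        </span></div>
<div class="lister-item"><span class="metascore mixed">45        </span></div>
<div class="lister-item"><p>no score here</p></div>
"""
soup = BeautifulSoup(html, 'html.parser')

scores = []
for container in soup.find_all('div', class_='lister-item'):
    # A single class name matches whatever other classes the tag has
    tag = container.find('span', class_='metascore')
    # find() returns None when nothing matches, so check before using .text
    scores.append(int(tag.text.strip()) if tag is not None else None)

print(scores)
```

The `strip()` call removes the trailing spaces visible in the output above before the conversion to `int`.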

#### The number of voters

The number of votes is a little more involved: we have to look for the 'span' which has attribute 'name' with value 'nv': 

In [25]:
first_movie.find('span', attrs = {'name':'nv'})

<span data-value="567364" name="nv">567,364</span>

In [26]:
aux=first_movie.find('span', attrs = {'name':'nv'})

Final trick:

In [27]:
aux.text

u'567,364'

whose datatype is

In [28]:
type(aux.text)

unicode

In [29]:
aux['data-value']

u'567364'

In [30]:
int(aux.text)

ValueError: invalid literal for int() with base 10: '567,364'

In [31]:
int(aux['data-value'])

567364

Better to use the second one: the 'data-value' attribute has no thousands separator.
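
If only the text were available, a possible workaround is to remove the commas before converting. This is a sketch on the string shown above, and it assumes the comma is always a thousands separator.

```python
votes_text = '567,364'   # the value of aux.text shown above

# Drop the thousands separators before converting
votes = int(votes_text.replace(',', ''))
print(votes)
```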

### Exercise: Get the actors and the directors for first_movie from the present webpage

### Exercise: download all movies with a metascore among the first 50 (disregard the directors and the cast)

### Exercise: for the first 200 movies with a metascore, download the rating, the metascore, the number of votes, url and runtime
Hints:
- [time module and sleep](https://docs.python.org/2/library/time.html#time.sleep) (Do not send requests too often, to avoid getting banned)
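
A minimal sketch of the pausing idea, not a full solution of the exercise: `fetch_politely` and `fake_fetch` are hypothetical names, and the fake download stands in for the real `get` call so that the example runs without network access. Building the other page addresses by changing `page=` in the url used above is also an assumption to verify in the browser.

```python
import time


def fetch_politely(urls, fetch, pause=1.0):
    """Call fetch(url) for every url, sleeping `pause` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(pause)   # wait before every request except the first
        results.append(fetch(url))
    return results


# Stand-in for a real download (hypothetical helper)
def fake_fetch(url):
    return 'page for ' + url


base = 'http://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page='
urls = [base + str(n) for n in range(1, 5)]
pages = fetch_politely(urls, fake_fetch, pause=0.1)
print(len(pages))
```

Passing the download function as an argument keeps the pausing logic separate from the scraping logic, so the same loop can be tried out offline first.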

In [42]:
import time
import datetime as dt

##### Loading function 

### Exercise: build on the fly the edgelist of the bipartite network actors/movies for the first 50 films

Compared to the previous case, the problem is that we do not know the number of edges in advance. Moreover, the complete cast is neither on the original webpage nor at the saved url! **Hint_1**: check where you can find the entire cast. **Hint_2**: take a sleep every 10 searches.
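
Since the number of edges is not known in advance, one option is to grow a plain list of (actor, movie) pairs while the movies are processed. The sketch below uses placeholder names instead of scraped data, so it shows only the bookkeeping, not the scraping.

```python
# Placeholder cast lists, as they might arrive one movie at a time
casts = [
    ('Movie A', ['Actor 1', 'Actor 2']),
    ('Movie B', ['Actor 2', 'Actor 3']),
]

edges = []                      # grows as each movie is processed
for movie, actors in casts:
    for actor in actors:
        edges.append((actor, movie))

actors_seen = {actor for actor, _ in edges}   # distinct actors so far
print(len(edges))
print(len(actors_seen))
```

A list of pairs like this can later be handed to networkx, for example with `Graph.add_edges_from(edges)`.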

### Final Exercise: IMT vs. IMDB

Find the assistant professors of IMT who have an IMDB page as an actor.