# Why BeautifulSoup is cool
(at least, cooler than you may think). The following code is taken from [this Dataquest tutorial](https://www.dataquest.io/blog/web-scraping-beautifulsoup/).

[requests](http://docs.python-requests.org/en/master/) fetches a web page.
[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) parses and navigates it.

In [1]:
from requests import get
from bs4 import BeautifulSoup

### Get the webpage

[IMDb search](https://www.imdb.com/search/title)

Look at the movies released in 2017, sorted by number of votes in descending order.

In [2]:
url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
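The query string above encodes three search parameters: `release_date`, `sort`, and `page`. As a minimal sketch, the same URL can be assembled from a dictionary with the standard library's `urllib.parse.urlencode`. The helper name `build_search_url` and the `BASE_URL` constant are illustrative and not part of the original notebook.

```python
from urllib.parse import urlencode

# Illustrative constant: the part of the URL above that precedes the "?".
BASE_URL = 'http://www.imdb.com/search/title'


def build_search_url(page, release_date='2017', sort='num_votes,desc'):
    """Build the search URL for one results page (hypothetical helper)."""
    params = {'release_date': release_date, 'sort': sort, 'page': page}
    # safe=',' keeps the comma in "num_votes,desc" unescaped, as in the URL above.
    return BASE_URL + '?' + urlencode(params, safe=',')


print(build_search_url(1))
```

Keeping the parameters in a dictionary makes it easier to change the page number or the sort order later without editing a long string by hand.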

In [3]:
response = get(url, headers={"Accept-Language": "en-US, en;q=0.5"})
print(response.text[:500])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle"


This is the beginning of the HTML document we downloaded. [Wikipedia](https://en.wikipedia.org/wiki/HTML) has more background on HTML.

### Soupify

In [4]:
html_soup = BeautifulSoup(response.text, 'html.parser')

What type of object did we get?

In [5]:
type(html_soup)

bs4.BeautifulSoup
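A `BeautifulSoup` object represents the whole parsed document. Because it is a subclass of `bs4.element.Tag`, it supports the same navigation and search methods as any single tag. A small sketch on a made-up HTML string (the markup and the `demo_` variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up document, used only to illustrate the object types.
demo_html = '<html><head><title>Demo</title></head><body><p>a</p><p>b</p></body></html>'
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# The soup itself behaves like a tag.
print(isinstance(demo_soup, Tag))

# Dotted access and find_all() work directly on the soup.
print(demo_soup.title.text)
paragraphs = demo_soup.find_all('p')

# find_all() returns a ResultSet, which is a list subclass.
print(isinstance(paragraphs, list))
print([p.text for p in paragraphs])
```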

In [6]:
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')

In [29]:
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
50


Exactly the number of films listed on the page!
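The `class_` argument used above deserves a note. When it contains a space, BeautifulSoup compares it with the exact string value of the `class` attribute, so the order of the class names matters. A single class name matches any tag that carries it, and a CSS selector passed to `select()` ignores the order. The sketch below uses a made-up three-element document. Only the class names `lister-item` and `mode-advanced` come from the page, and `mode-simple` is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; "mode-simple" is a hypothetical class.
sample_html = """
<div class="lister-item mode-advanced">first</div>
<div class="lister-item mode-advanced">second</div>
<div class="lister-item mode-simple">third</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Exact string value of the class attribute.
exact = sample_soup.find_all('div', class_='lister-item mode-advanced')
# Same classes in a different order: no match.
reordered = sample_soup.find_all('div', class_='mode-advanced lister-item')
# A single class matches every tag that has it among its classes.
single = sample_soup.find_all('div', class_='lister-item')
# CSS selector: both classes required, order irrelevant.
selected = sample_soup.select('div.mode-advanced.lister-item')

print(len(exact), len(reordered), len(single), len(selected))
```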

In [30]:
first_movie = movie_containers[0]
first_movie

<div class="lister-item mode-advanced">\n<div class="lister-top-right">\n<div class="ribbonize" data-caller="filmosearch" data-tconst="tt3315342"></div>\n</div>\n<div class="lister-item-image float-left">\n<a href="/title/tt3315342/?ref_=adv_li_i"> <img alt="Logan" class="loadlate" data-tconst="tt3315342" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BYzc5MTU4N2EtYTkyMi00NjdhLTg3NWEtMTY4OTEyMzJhZTAzXkEyXkFqcGdeQXVyNjc1NTYyMjg@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB470041630_.png" width="67"/>\n</a> </div>\n<div class="lister-item-content">\n<h3 class="lister-item-header">\n<span class="lister-item-index unbold text-primary">1.</span>\n<a href="/title/tt3315342/?ref_=adv_li_tt">Logan</a>\n<span class="lister-item-year text-muted unbold">(2017)</span>\n</h3>\n<p class="text-muted ">\n<span class="certificate">R</span>\n<span class="ghost">|</span>\n<span class="runtime">137 min</span>\n<span 

Reading the raw HTML from here on is impractical. Open the browser's Developer Tools and inspect the page structure instead.
Looking for the title...

In [31]:
print(first_movie.h3.a.text)

Logan


... the year ...

In [32]:
print(first_movie.h3.find('span', class_='lister-item-year text-muted unbold').text)

(2017)


... the IMDb rating...

In [33]:
print(first_movie.strong.text)

8.1


... the metascore...

In [34]:
print(first_movie.find('span', class_='metascore favorable').text)

77        


... and the number of votes.

In [35]:
print(first_movie.find('span', attrs={'name': 'nv'})['data-value'])

512062
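The lookups above rely on three BeautifulSoup access styles: dotted tag names (`first_movie.h3.a`), `find()` with a class filter, and `find()` with an `attrs` dictionary. The sketch below replays them on a simplified container. The tag names, class names, and values follow the output above, but the flat nesting is an assumption made to keep the example short.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one movie container (nesting is an assumption).
snippet = """
<div class="lister-item mode-advanced">
<h3 class="lister-item-header">
<a href="/title/tt3315342/?ref_=adv_li_tt">Logan</a>
<span class="lister-item-year text-muted unbold">(2017)</span>
</h3>
<strong>8.1</strong>
<span class="metascore favorable">77        </span>
<span name="nv" data-value="512062"></span>
</div>
"""
movie = BeautifulSoup(snippet, 'html.parser').div

# Dotted access returns the first matching descendant, like find().
print(movie.h3.a.text == movie.find('h3').find('a').text)

# A tag that does not exist yields None, so chaining .text on it
# would raise AttributeError.
print(movie.table)

# get_text(strip=True) drops the padding around the Metascore.
metascore_text = movie.find('span', class_='metascore').get_text(strip=True)
print(repr(metascore_text))

# The HTML attribute called "name" has to go through attrs, because
# find() already uses its first parameter, also called name, for the tag name.
votes_tag = movie.find('span', attrs={'name': 'nv'})
print(votes_tag['data-value'])        # square brackets raise KeyError if missing
print(votes_tag.get('data-missing'))  # .get() returns None instead
```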


In [7]:
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Extract data from individual movie container
for container in movie_containers:

    # If the movie has Metascore, then extract:
    if container.find('div', class_ = 'ratings-metascore') is not None:

        # The name
        name = container.h3.a.text
        names.append(name)

        # The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years.append(year)

        # The IMDB rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)

        # The Metascore
        m_score = container.find('span', class_ = 'metascore').text
        metascores.append(int(m_score))

        # The number of votes
        vote = container.find('span', attrs = {'name':'nv'})['data-value']
        votes.append(int(vote))
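As a possible alternative to five parallel lists, the same extraction can be wrapped in a function that returns one dictionary per movie. This is only a sketch: `parse_container` is a hypothetical helper, and the two sample containers (including the made-up "Hypothetical Film" without a Metascore) are simplified assumptions rather than real page markup.

```python
from bs4 import BeautifulSoup


def parse_container(container):
    """Return one movie as a dict, or None if it has no Metascore."""
    if container.find('div', class_='ratings-metascore') is None:
        return None
    return {
        'movie': container.h3.a.text,
        'year': container.h3.find('span', class_='lister-item-year').text,
        'imdb': float(container.strong.text),
        # int() tolerates the whitespace padding around the score.
        'metascore': int(container.find('span', class_='metascore').text),
        'votes': int(container.find('span', attrs={'name': 'nv'})['data-value']),
    }


# Simplified, partly made-up markup used only to exercise the function.
sample_html = """
<div class="lister-item mode-advanced">
  <h3><a>Logan</a><span class="lister-item-year text-muted unbold">(2017)</span></h3>
  <strong>8.1</strong>
  <div class="ratings-metascore"><span class="metascore favorable">77        </span></div>
  <span name="nv" data-value="512062"></span>
</div>
<div class="lister-item mode-advanced">
  <h3><a>Hypothetical Film</a><span class="lister-item-year text-muted unbold">(2017)</span></h3>
  <strong>6.0</strong>
  <span name="nv" data-value="1000"></span>
</div>
"""
containers = BeautifulSoup(sample_html, 'html.parser').find_all(
    'div', class_='lister-item mode-advanced')

# Keep only the movies for which a Metascore was found.
rows = [row for row in map(parse_container, containers) if row is not None]
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`, which keeps the columns aligned even if the extraction is extended later.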

In [8]:
import pandas as pd

test_df = pd.DataFrame({'movie': names,
                       'year': years,
                       'imdb': imdb_ratings,
                       'metascore': metascores,
                       'votes': votes})

In [9]:
test_df.info()
test_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 5 columns):
imdb         47 non-null float64
metascore    47 non-null int64
movie        47 non-null object
votes        47 non-null int64
year         47 non-null object
dtypes: float64(1), int64(2), object(2)
memory usage: 1.9+ KB


| | imdb | metascore | movie | votes | year |
|---|---|---|---|---|---|
| 0 | 8.1 | 77 | Logan | 512115 | (2017) |
| 1 | 7.5 | 76 | Wonder Woman | 439388 | (2017) |
| 2 | 8.0 | 94 | Dunkirk | 421726 | (2017) |
| 3 | 7.2 | 85 | Star Wars: The Last Jedi | 418677 | (2017) |
| 4 | 7.7 | 67 | Guardians of the Galaxy Vol. 2 | 414404 | (2017) |
| 5 | 7.9 | 74 | Thor: Ragnarok | 391682 | (2017) |
| 6 | 7.5 | 73 | Spider-Man: Homecoming | 361565 | (2017) |
| 7 | 7.7 | 84 | Get Out | 335511 | (I) (2017) |
| 8 | 8.0 | 81 | Blade Runner 2049 | 334884 | (2017) |
| 9 | 7.6 | 86 | Baby Driver | 326954 | (2017) |

### Consider more than one web page

In [10]:
# Start from empty lists so that page 1, already scraped above, is not counted twice
names, years, imdb_ratings, metascores, votes = [], [], [], [], []

for i in range(1, 11):
    url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=' + str(i)
    response = get(url, headers={"Accept-Language": "en-US, en;q=0.5"})
    html_soup = BeautifulSoup(response.text, 'html.parser')
    movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
    print(i)
    for container in movie_containers:

        # If the movie has Metascore, then extract:
        if container.find('div', class_ = 'ratings-metascore') is not None:

            # The name
            name = container.h3.a.text
            names.append(name)

            # The year
            year = container.h3.find('span', class_ = 'lister-item-year').text
            years.append(year)

            # The IMDB rating
            imdb = float(container.strong.text)
            imdb_ratings.append(imdb)

            # The Metascore
            m_score = container.find('span', class_ = 'metascore').text
            metascores.append(int(m_score))

            # The number of votes
            vote = container.find('span', attrs = {'name':'nv'})['data-value']
            votes.append(int(vote))

1
2
3
4
5
6
7
8
9
10
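When a loop like the one above requests many pages in a row, it is generally considered polite to pause between requests and to check each response's status code before parsing it. The following is a hedged sketch rather than the notebook's own method: the names `fetch_pages` and `pause_seconds`, the `timeout` value, and the injectable `get` parameter (which makes the function usable without network access) are all assumptions introduced here.

```python
import time

import requests

BASE_URL = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page={}'
HEADERS = {"Accept-Language": "en-US, en;q=0.5"}


def fetch_pages(page_numbers, get=requests.get, pause_seconds=1.0):
    """Yield (page_number, html) for every page that answers with HTTP 200."""
    for number in page_numbers:
        response = get(BASE_URL.format(number), headers=HEADERS, timeout=10)
        if response.status_code != 200:
            # Skip pages that failed instead of parsing an error document.
            print('Page {}: unexpected status {}'.format(number, response.status_code))
            continue
        yield number, response.text
        # Pause before the next request to avoid hammering the server.
        time.sleep(pause_seconds)
```

Each yielded `html` string could then be handed to `BeautifulSoup(html, 'html.parser')` exactly as in the loop above, for example with `for number, html in fetch_pages(range(1, 11)):`.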


In [11]:
final_df = pd.DataFrame({'movie': names,
                       'year': years,
                       'imdb': imdb_ratings,
                       'metascore': metascores,
                       'votes': votes})

In [12]:
final_df

| | imdb | metascore | movie | votes | year |
|---|---|---|---|---|---|
| 0 | 8.1 | 77 | Logan | 512115 | (2017) |
| 1 | 7.5 | 76 | Wonder Woman | 439388 | (2017) |
| 2 | 8.0 | 94 | Dunkirk | 421726 | (2017) |
| 3 | 7.2 | 85 | Star Wars: The Last Jedi | 418677 | (2017) |
| 4 | 7.7 | 67 | Guardians of the Galaxy Vol. 2 | 414404 | (2017) |
| 5 | 7.9 | 74 | Thor: Ragnarok | 391682 | (2017) |
| 6 | 7.5 | 73 | Spider-Man: Homecoming | 361565 | (2017) |
| 7 | 7.7 | 84 | Get Out | 335511 | (I) (2017) |
| 8 | 8.0 | 81 | Blade Runner 2049 | 334884 | (2017) |
| 9 | 7.6 | 86 | Baby Driver | 326954 | (2017) |
