# Implementing Web Scraping in Python with BeautifulSoup
There are mainly two ways to extract data from a website:

* Use the API of the website (if it exists). For example, Facebook has the Facebook Graph API which allows retrieval of data posted on Facebook.

* Access the HTML of the webpage and extract useful information from it. This technique is called web scraping, web harvesting, or web data extraction.

### Steps involved in web scraping:

1. Send an HTTP request to the URL of the webpage you want to access. The server responds by returning the HTML content of the page. For this task, we will use requests, a third-party HTTP library for Python.
2. Once we have the HTML content, we need to parse it. Because HTML is nested, simple string processing is not a reliable way to extract data; we need a parser that builds a nested (tree) structure from the HTML. Many HTML parsers are available. html5lib is among the most lenient, since it parses pages the way a web browser does, while the examples below use Python's built-in html.parser.
3. Navigate and search the parse tree, i.e. perform tree traversal. For this task, we will use another third-party Python library, Beautiful Soup, which pulls data out of HTML and XML files.
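
As a minimal sketch of steps 2 and 3, the snippet below parses a small hand-written HTML string instead of a downloaded page, so it runs without network access. The markup, the movie class, and the titles are made up for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML a server would return in step 1 (made-up markup).
html = """
<html><body>
  <div class="movie"><h3><a href="/title/1/">First Movie</a></h3></div>
  <div class="movie"><h3><a href="/title/2/">Second Movie</a></h3></div>
</body></html>
"""

# Step 2: build a parse tree with Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

# Step 3: search the tree for every matching div, then walk down to the link text.
titles = [div.h3.a.text for div in soup.find_all("div", class_="movie")]
print(titles)
```

Each div is reached by searching (find_all), and the title is reached by navigating (div.h3.a), which is the same pattern used on the real page later on.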

## Scraping data of movies from IMDB
We want to analyze the distributions of IMDB and Metacritic movie ratings to see if we find anything interesting. To do this, we’ll first scrape data for the 50 most-voted movies of 2017, which is what the first page of the search results returns.

In [3]:
import requests 
from bs4 import BeautifulSoup 

In [17]:
url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = requests.get(url)
print(response.text[:500])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle"
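
The request above does not check whether it succeeded, and requests.get() waits indefinitely unless a timeout is given. A slightly more defensive fetch might look like the sketch below. The fetch_html name, the 10-second timeout, and the User-Agent string are illustrative assumptions, not something the site is known to require.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising if the request fails."""
    # Hypothetical header value; some sites may reject the library's default.
    headers = {"User-Agent": "Mozilla/5.0 (tutorial scraper)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Turn 4xx/5xx answers into an exception instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html(url) with the search URL above would then either return the page text or raise a requests exception, rather than silently handing an error page to the parser.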


In [18]:
html_soup = BeautifulSoup(response.text, 'html.parser')

Now let’s use the find_all() method to extract all the div containers that have a class attribute of lister-item mode-advanced:

In [20]:
movie_containers = html_soup.find_all('div', class_ = 'lister-item mode-advanced')
movie_containers

[<div class="lister-item mode-advanced">
 <div class="lister-top-right">
 <div class="ribbonize" data-caller="filmosearch" data-tconst="tt3315342"></div>
 </div>
 <div class="lister-item-image float-left">
 <a href="/title/tt3315342/"> <img alt="Logan" class="loadlate" data-tconst="tt3315342" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BYzc5MTU4N2EtYTkyMi00NjdhLTg3NWEtMTY4OTEyMzJhZTAzXkEyXkFqcGdeQXVyNjc1NTYyMjg@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB466725069_.png" width="67"/>
 </a> </div>
 <div class="lister-item-content">
 <h3 class="lister-item-header">
 <span class="lister-item-index unbold text-primary">1.</span>
 <a href="/title/tt3315342/">Logan</a>
 <span class="lister-item-year text-muted unbold">(2017)</span>
 </h3>
 <p class="text-muted">
 <span class="certificate">A</span>
 <span class="ghost">|</span>
 <span class="runtime">137 min</span>
 <span class="ghost">|</span>
 <span

In [21]:
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
50
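
One detail worth knowing about the find_all() call above: when class_ is given a string containing several class names, Beautiful Soup compares it against the exact value of the class attribute, so the same classes written in a different order are not matched. A CSS selector through select() matches on the classes regardless of order. The sketch below shows the difference on made-up markup.

```python
from bs4 import BeautifulSoup

# Made-up markup: same two classes in both orders, plus a partial match.
html = """
<div class="lister-item mode-advanced">A</div>
<div class="mode-advanced lister-item">B</div>
<div class="lister-item mode-simple">C</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: only the first div.
exact = [d.text for d in soup.find_all("div", class_="lister-item mode-advanced")]

# CSS selector: any div carrying both classes, in any order.
css = [d.text for d in soup.select("div.lister-item.mode-advanced")]

# A single class name matches every div that has that class.
single = [d.text for d in soup.find_all("div", class_="lister-item")]

print(exact, css, single)
```

On the IMDB page the exact-string form works because every container uses the same class order, but the selector form is less sensitive to small changes in the markup.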


In [24]:
# Lists to store the scraped data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Extract data from each individual movie container
for container in movie_containers:
    # Only keep movies that have a Metascore
    if container.find('div', class_='ratings-metascore') is not None:
        # The name
        name = container.h3.a.text
        names.append(name)
        # The year
        year = container.h3.find('span', class_='lister-item-year').text
        years.append(year)
        # The IMDB rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        # The Metascore
        m_score = container.find('span', class_='metascore').text
        metascores.append(int(m_score))
        # The number of votes
        vote = container.find('span', attrs={'name': 'nv'})['data-value']
        votes.append(int(vote))
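
The loop above skips any movie without a Metascore, which is why fewer rows than containers end up in the lists. The same logic can be sketched as a function that returns one dict per movie, so the five fields cannot drift out of step. The sample markup below is a hand-written reduction of the container structure implied by the selectors in the loop; the second movie, its title ID, and its numbers are hypothetical.

```python
from bs4 import BeautifulSoup

# Reduced, hand-written markup; the second entry is hypothetical.
SAMPLE = """
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header">
    <a href="/title/tt3315342/">Logan</a>
    <span class="lister-item-year text-muted unbold">(2017)</span>
  </h3>
  <strong>8.1</strong>
  <div class="ratings-metascore"><span class="metascore">77</span></div>
  <span name="nv" data-value="610583">610,583</span>
</div>
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header">
    <a href="/title/tt0000000/">Movie Without Metascore</a>
    <span class="lister-item-year text-muted unbold">(2017)</span>
  </h3>
  <strong>6.5</strong>
  <span name="nv" data-value="1000">1,000</span>
</div>
"""


def parse_container(container):
    """Return the movie's fields as a dict, or None if it has no Metascore."""
    if container.find("div", class_="ratings-metascore") is None:
        return None
    return {
        "movie": container.h3.a.text,
        "year": container.h3.find("span", class_="lister-item-year").text,
        "imdb": float(container.strong.text),
        "metascore": int(container.find("span", class_="metascore").text),
        "votes": int(container.find("span", attrs={"name": "nv"})["data-value"]),
    }


soup = BeautifulSoup(SAMPLE, "html.parser")
containers = soup.find_all("div", class_="lister-item mode-advanced")
# Drop the None results for movies that were skipped.
rows = [row for row in map(parse_container, containers) if row is not None]
print(len(containers), len(rows))
```

A list of dicts like rows can be passed directly to pd.DataFrame(rows), as an alternative to building five parallel lists.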

In [27]:
import pandas as pd
df = pd.DataFrame({
    'movie': names,
    'year': years,
    'imdb': imdb_ratings,
    'metascore': metascores,
    'votes': votes,
})
print(df.info())
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 5 columns):
movie        42 non-null object
year         42 non-null object
imdb         42 non-null float64
metascore    42 non-null int64
votes        42 non-null int64
dtypes: float64(1), int64(2), object(2)
memory usage: 1.7+ KB
None


| | movie | year | imdb | metascore | votes |
|---|---|---|---|---|---|
| 0 | Logan | (2017) | 8.1 | 77 | 610583 |
| 1 | Thor: Ragnarok | (2017) | 7.9 | 74 | 545066 |
| 2 | Guardians of the Galaxy Vol. 2 | (2017) | 7.6 | 67 | 534353 |
| 3 | Star Wars: The Last Jedi | (2017) | 7.0 | 85 | 533402 |
| 4 | Wonder Woman | (2017) | 7.4 | 76 | 524241 |
| 5 | Dunkirk | (2017) | 7.9 | 94 | 512169 |
| 6 | Spider-Man: Homecoming | (2017) | 7.4 | 73 | 479653 |
| 7 | Get Out | (I) (2017) | 7.7 | 85 | 455032 |
| 8 | It | (I) (2017) | 7.3 | 69 | 429886 |
| 9 | Blade Runner 2049 | (2017) | 8.0 | 81 | 425269 |


In [28]:
df['year'] = df['year'].str[-5:-1].astype(int)
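
The slice above works because the year always occupies the last four characters before the closing parenthesis, even for values such as (I) (2017). The sketch below checks that on the two formats seen in the table and shows a regular-expression alternative that does not depend on character positions. The variable names are illustrative.

```python
import pandas as pd

# The two year formats that appear in the scraped data.
years = pd.Series(["(2017)", "(I) (2017)"])

# Positional slice, as in the notebook: characters -5 up to (not including) -1.
sliced = years.str[-5:-1].astype(int)

# Alternative: capture the four digits that sit inside parentheses.
extracted = years.str.extract(r"\((\d{4})\)", expand=False).astype(int)

print(sliced.tolist(), extracted.tolist())
```

Both approaches give the same integers here; the regular expression would keep working if extra text ever followed the year.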

In [29]:
df

| | movie | year | imdb | metascore | votes |
|---|---|---|---|---|---|
| 0 | Logan | 2017 | 8.1 | 77 | 610583 |
| 1 | Thor: Ragnarok | 2017 | 7.9 | 74 | 545066 |
| 2 | Guardians of the Galaxy Vol. 2 | 2017 | 7.6 | 67 | 534353 |
| 3 | Star Wars: The Last Jedi | 2017 | 7.0 | 85 | 533402 |
| 4 | Wonder Woman | 2017 | 7.4 | 76 | 524241 |
| 5 | Dunkirk | 2017 | 7.9 | 94 | 512169 |
| 6 | Spider-Man: Homecoming | 2017 | 7.4 | 73 | 479653 |
| 7 | Get Out | 2017 | 7.7 | 85 | 455032 |
| 8 | It | 2017 | 7.3 | 69 | 429886 |
| 9 | Blade Runner 2049 | 2017 | 8.0 | 81 | 425269 |
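
The stated goal was to compare the two rating distributions. As a first rough look, and only as a sketch, the snippet below rebuilds the ten rows shown above by hand, rescales the IMDB rating (0 to 10) to the Metascore range (0 to 100), and prints summary statistics. The column name imdb_x10 is an illustrative choice, and ten rows are too few to draw conclusions from.

```python
import pandas as pd

# The ten rows displayed above, typed in by hand so the example is self-contained.
sample = pd.DataFrame({
    "movie": [
        "Logan", "Thor: Ragnarok", "Guardians of the Galaxy Vol. 2",
        "Star Wars: The Last Jedi", "Wonder Woman", "Dunkirk",
        "Spider-Man: Homecoming", "Get Out", "It", "Blade Runner 2049",
    ],
    "imdb": [8.1, 7.9, 7.6, 7.0, 7.4, 7.9, 7.4, 7.7, 7.3, 8.0],
    "metascore": [77, 74, 67, 85, 76, 94, 73, 85, 69, 81],
})

# Put both ratings on a 0-100 scale so their statistics are comparable.
sample["imdb_x10"] = sample["imdb"] * 10

# Count, mean, standard deviation, and quartiles for each rating.
summary = sample[["imdb_x10", "metascore"]].describe()
print(summary)
```

With the full df from the notebook, the same rescaling and describe() call would apply to all scraped movies.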
