<a href="https://colab.research.google.com/github/avasquez9999/Web-Scraping-applications-in-python/blob/main/IMDB_Web_Scrape.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping: Part I

Suppose we want to analyze the distributions of [IMDB](https://www.imdb.com)
and [Metacritic](https://www.metacritic.com/) movie
ratings to see if we find anything interesting. To do this, we’ll first scrape data for over 20000 movies.

It’s essential to identify the goal of our scraping right from the beginning. Writing a scraping script
can take a lot of time, especially if we want to scrape more than one web page. We want to avoid spending
hours writing a script which scrapes data we won’t actually need.

## Working out which pages to scrape

Once we’ve established our goal, we then need to identify an efficient set of pages to scrape.

We want to find a combination of pages that requires a relatively small number of requests. A request is
what happens whenever we access a web page. We ‘request’ the content of a page from the server. The more
requests we make, the longer our script will need to run, and the greater the strain on the server.

One way to get all the data we need is to compile a list of movie names, and use it to access the web page
of each movie on both IMDB and Metacritic websites.

Since we want to get over 20000 ratings from both IMDB and Metacritic, we’ll have to make at least 40000 requests.
If we make one request per second, our script will need a little over 11 hours to make 40000 requests
(40000 seconds is roughly 11.1 hours). Because of this, it’s worth trying to identify more efficient ways of
obtaining our data.
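
The arithmetic behind this kind of estimate can be sketched in a few lines. The one-request-per-second rate comes from the text; the variable names are only illustrative, and the estimate ignores download and parsing time.

```python
# Rough running-time estimate, assuming one request per second
# and ignoring the time spent downloading and parsing each page.
movies = 20000
sites = 2                      # IMDB and Metacritic
seconds_per_request = 1

total_requests = movies * sites
hours = total_requests * seconds_per_request / 3600
print(f"{total_requests} requests take about {hours:.1f} hours")
```

Changing `sites` to 1 shows the effect of getting both ratings from a single page.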

If we explore the IMDB website, we can discover a way to halve the number of requests. Metacritic scores are shown
on the IMDB movie page, so we can scrape both ratings with a single request.

<img src="http://drive.google.com/uc?export=view&id=1ZUdBCtB-qQnx7gNNXFLFzReRkEPcX0-R" width="800">


## Identifying the URL structure

Our challenge now is to understand how the URL changes as we move between the pages we want to scrape. If we
can’t understand this logic well enough to implement it in code, we’ll reach a dead end.

If you go to [IMDB’s advanced search page](https://www.imdb.com/search/), you can browse movies by different criteria.


<img src="http://drive.google.com/uc?export=view&id=16-fIqD6m0_RaM2Lr_eHl3ElEc4gXKh_a" width="800">

<img src="http://drive.google.com/uc?export=view&id=1aiNZ23KcT8c6ssuZC9iwDNCzSLAXO63O" width="800">


Let’s browse by year 2018, sort the movies on the first page by number of votes,

<img src="http://drive.google.com/uc?export=view&id=10arMqQLKRopm_XBtuBc33GvuDt_9dmOb" width="800">

then switch to the next page.

<img src="http://drive.google.com/uc?export=view&id=1nWjymmBHLbkC0a7WMydIi9Zb49mxsSzx" width="800">

We’ll arrive at this web page, which has this URL:
<img src="http://drive.google.com/uc?export=view&id=1xPqShdu46yiV_2reDDhay_9E13ewgaBQ" width="800">

In the image above, we can see that the URL has several parameters after the question mark:

-  release_date: Shows only the movies released in a specific year.
-  sort: Sorts the movies on the page. sort=num_votes,desc translates to sort by number of votes in
   descending order.
-  start: Specifies the rank of the first movie shown on the page. With 50 movies per page, start=1 gives
   the first page and start=51 gives the second.
-  ref_: Takes us to the next or the previous page. The reference is the page we are
   currently on. adv_nxt and adv_prv are two possible values. They translate to advance to
   the next page, and advance to the previous page, respectively.
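
As a minimal sketch of how these parameters combine into a URL, the snippet below assembles the query string with the standard library’s `urlencode`. The `params` dictionary and the choice of `start=51` are illustrative assumptions; the rest of this notebook builds the URL by plain string concatenation instead.

```python
from urllib.parse import urlencode

base = "http://www.imdb.com/search/title"
params = {
    "release_date": "2018-01-01,2018-12-31",  # movies released in 2018
    "sort": "num_votes,desc",                  # most-voted movies first
    "start": 51,                               # first movie of the second page
}

# safe="," keeps the commas readable instead of percent-encoding them
url = base + "?" + urlencode(params, safe=",")
print(url)
```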

Let’s start writing the script by requesting the content of this single web page:
`http://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc&start=1`. In the following
code cell we will:

-  Import the requests module.
-  Assign the address of the web page to a variable named url.
-  Request the content of the web page from the server by using requests.get(), and store the server’s response
   in the variable page.
-  Display page’s content by accessing its .text attribute (page is now a Response object).

In [None]:
import requests
url = 'http://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc&start=1'
page = requests.get(url)
#print(page.text[:500])
page.text

'\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n\n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n        <title>Released between 2018-01-01 and 2018-12-31\n(Sorted by Number of Votes Descending) - IMDb</title>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == \'function\') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>\n    if (typeof uex == \'function\') {\n      uex("ld", "LoadT

Notice that all of the information for each movie, including the poster, is contained in a div tag.

There are a lot of HTML lines nested within each div tag. You can explore them in your browser’s developer tools by clicking the little gray arrows to the left of the HTML lines corresponding to each div. Within these nested tags we’ll find the information we need, like a movie’s rating.

There are 50 movies shown per page, so there should be a div container for each. Let’s extract all these 50 containers by parsing the HTML document from our earlier request.

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')
type(soup)
soup


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Released between 2018-01-01 and 2018-12-31
(Sorted by Number of Votes Descending) - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.imdb.com/search/title/?release_date=2018-01-01,

Before extracting the 50 div containers, we need to figure out what distinguishes them from other
div elements on that page. Often, the distinctive mark resides in the class attribute. If you inspect
the HTML lines of the containers of interest, you’ll notice that the class
attribute has two values: `lister-item` and `mode-advanced`. This combination is unique to
these div containers. We can see that’s true by doing a quick search (Ctrl + F) in the page source. We have
50 such containers, so we expect to see only 50 matches.

Now let’s use the `find_all()` method to extract all the div containers that have
a class attribute of lister-item mode-advanced:

In [None]:
movie_containers = soup.find_all('div', class_ = 'lister-item mode-advanced')
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
50
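
To make the class matching concrete, here is a small self-contained sketch. The three-line HTML string is made up for illustration and is not IMDB’s markup. It suggests that passing a string with two class names to `class_` matches the attribute value exactly as written, while a CSS selector passed to `select()` matches both classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate class matching
html = """
<div class="lister-item mode-advanced">Movie A</div>
<div class="lister-item mode-simple">Not a match</div>
<div class="mode-advanced lister-item">Movie B</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Matches the exact string value of the class attribute
exact = demo_soup.find_all("div", class_="lister-item mode-advanced")
# CSS selector: requires both classes, in any order
by_selector = demo_soup.select("div.lister-item.mode-advanced")

print(len(exact))        # 1
print(len(by_selector))  # 2
```

On the IMDB page both approaches should return the same containers as long as the class values always appear in the same order.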


Now we’ll select only the first container, and extract, by turn, each item of interest:

-  The name of the movie.
-  The year of release.
-  The IMDB rating.
-  The Metascore.
-  The number of votes.

## Extracting the data for a single movie
We can access the first container, which contains information about a single
movie, by using list notation on movie_containers.

In [None]:
movie_containers[0]

<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt4154756"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt4154756/"> <img alt="Avengers: Infinity War" class="loadlate" data-tconst="tt4154756" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BMjMxNjY2MDU1OV5BMl5BanBnXkFtZTgwNzY1MTUwNTM@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt4154756/">Avengers: Infinity War</a>
<span class="lister-item-year text-muted unbold">(2018)</span>
</h3>
<p class="text-muted ">
<span class="certificate">12</span>
<span class="ghost">|</span>
<span class="runtime">149 min</span>
<span class="ghost">|</span>
<span class="genre">
Action, Adventure, Sci-Fi            <

In [None]:
movie_containers[1]

<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt1825683"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt1825683/"> <img alt="Black Panther" class="loadlate" data-tconst="tt1825683" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BMTg1MTY2MjYzNV5BMl5BanBnXkFtZTgwMTc4NTMwNDI@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">2.</span>
<a href="/title/tt1825683/">Black Panther</a>
<span class="lister-item-year text-muted unbold">(2018)</span>
</h3>
<p class="text-muted ">
<span class="certificate">PG-13</span>
<span class="ghost">|</span>
<span class="runtime">134 min</span>
<span class="ghost">|</span>
<span class="genre">
Action, Adventure, Sci-Fi            </span>
</p>
<di

In [None]:
movie_containers[49]

<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt7959026"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt7959026/"> <img alt="The Mule" class="loadlate" data-tconst="tt7959026" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BMTc1OTc5NzA4OF5BMl5BanBnXkFtZTgwOTAzMzE2NjM@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">50.</span>
<a href="/title/tt7959026/">The Mule</a>
<span class="lister-item-year text-muted unbold">(2018)</span>
</h3>
<p class="text-muted ">
<span class="certificate">R</span>
<span class="ghost">|</span>
<span class="runtime">116 min</span>
<span class="ghost">|</span>
<span class="genre">
Crime, Drama, Thriller            </span>
</p>
<div class="ratings

### The name of the movie
We begin with the movie’s name, and locate its corresponding HTML line. You can
see that the name is contained within an anchor tag (&lt;a&gt;). This tag is nested
within a header tag (&lt;h3&gt;). The &lt;h3&gt; tag is nested within a &lt;div&gt; tag.
This &lt;div&gt; is the third of the divs nested in the container of the
first movie. We’ll store the content of this container in the first_movie variable.

In [None]:
first_movie = movie_containers[0]
first_movie.div

<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt4154756"></div>
</div>

In [None]:
first_movie.a

<a href="/title/tt4154756/"> <img alt="Avengers: Infinity War" class="loadlate" data-tconst="tt4154756" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BMjMxNjY2MDU1OV5BMl5BanBnXkFtZTgwNzY1MTUwNTM@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png" width="67"/>
</a>

In [None]:
first_movie.h3

<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt4154756/">Avengers: Infinity War</a>
<span class="lister-item-year text-muted unbold">(2018)</span>
</h3>

In [None]:
first_movie.h3.a

<a href="/title/tt4154756/">Avengers: Infinity War</a>

In [None]:
first_movie.h3.a.get_text()

'Avengers: Infinity War'

In [None]:
first_name = first_movie.h3.a.text
first_name

'Avengers: Infinity War'

### The year of the movie’s release

We move on with extracting the year. This data is stored within the &lt;span&gt; tag below
the &lt;a&gt; that contains the name.

Dot notation will only access the first span element, so we’ll search by the distinctive
mark of the second &lt;span&gt; instead. We’ll use the find() method, which is almost the
same as find_all(), except that it only returns the first match. In fact,
find() behaves like find_all(limit = 1), except that it returns the matching tag itself
(or None if nothing matches) rather than a list. The limit argument limits
the output to the first match.

The distinguishing mark consists of the
values lister-item-year text-muted unbold assigned to the
class attribute. So we look for the first &lt;span&gt; with these
values within the &lt;h3&gt; tag:

In [None]:
first_year_tag = first_movie.h3.find('span', class_ = 'lister-item-year text-muted unbold')
first_year_tag

<span class="lister-item-year text-muted unbold">(2018)</span>

In [None]:
first_movie.h3.span.text

'1.'

In [None]:
first_year = first_year_tag.text
first_year

'(2018)'
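
The difference between dot notation, find(), and find_all() can be checked on a small stand-alone snippet. The HTML below is copied from the header of the first container shown earlier; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt4154756/">Avengers: Infinity War</a>
<span class="lister-item-year text-muted unbold">(2018)</span>
</h3>
"""
h3 = BeautifulSoup(html, "html.parser").h3

# Dot notation returns the first <span>, which holds the index, not the year
print(h3.span.text)

# find() returns a single Tag; find_all(limit=1) returns a one-element list
year_tag = h3.find("span", class_="lister-item-year")
year_list = h3.find_all("span", class_="lister-item-year", limit=1)
print(year_tag.text)
print(year_tag == year_list[0])

# With no match, find() gives None and find_all() gives an empty list
missing_tag = h3.find("span", class_="no-such-class")
missing_list = h3.find_all("span", class_="no-such-class")
print(missing_tag, len(missing_list))
```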

### The IMDB rating
We now focus on extracting the IMDB rating of the first movie.

There are a couple of ways to do that, but we’ll first try the easiest one. If you
inspect the IMDB rating, you’ll notice that the rating is contained
within a &lt;strong&gt; tag.

Let’s use attribute notation, and hope that the first &lt;strong&gt; will also be
the one that contains the rating.

In [None]:
first_movie.strong

<strong>8.5</strong>

Great! We’ll access the text, convert it to the float type, and assign it to the
variable first_imdb:

In [None]:
first_movie.strong.text

'8.5'

In [None]:
first_movie.strong.get_text()

'8.5'

In [None]:
first_imdb = float(first_movie.strong.text)
first_imdb

8.5

### The Metascore
If we inspect the Metascore, we’ll notice that we can find it within a &lt;span&gt; tag.

Attribute notation clearly isn’t a solution. There are many &lt;span&gt; tags before
that. You can see one right above the &lt;strong&gt; tag. We’d better use the distinctive
values of the class attribute (metascore favorable).

Note that if you copy-paste those values, there will be two
whitespace characters between metascore and favorable. Make sure there is
only one whitespace character when you pass the values as an argument to
the class_ parameter. Otherwise, find() won’t find anything.

In [None]:
first_mscore = first_movie.find('span', class_ = 'metascore favorable')
first_mscore = int(first_mscore.text)
print(first_mscore)

68
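
The whitespace pitfall can be reproduced on a one-line snippet. The HTML string below is an assumption modelled on the description above (two spaces between the class names), not a verbatim copy of IMDB’s markup.

```python
from bs4 import BeautifulSoup

# Assumed markup: two spaces between the class names, padded score text
html = '<span class="metascore  favorable">68        </span>'
tag_soup = BeautifulSoup(html, "html.parser")

one_space = tag_soup.find("span", class_="metascore favorable")
two_spaces = tag_soup.find("span", class_="metascore  favorable")
single_class = tag_soup.find("span", class_="metascore")

print(one_space is not None)   # found with a single space
print(two_spaces is None)      # not found with two spaces
print(int(single_class.text))  # int() ignores the surrounding whitespace
```

Searching for the single class metascore avoids the whitespace issue and also matches scores whose second class is not favorable, which is why the page-level scripts later in this notebook use it.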


### The number of votes
The number of votes is contained within a &lt;span&gt; tag. Its distinctive mark is a
name attribute with the value nv.

The name attribute is different from the class attribute. Using BeautifulSoup we
can access elements by any attribute. The find() and find_all() methods have a
parameter named attrs. To this we can pass the attributes and values we are
searching for as a dictionary:

In [None]:
first_votes = first_movie.find('span', attrs = {'name':'nv'})
dir(first_votes)

['HTML_FORMATTERS',
 'XML_FORMATTERS',
 '__bool__',
 '__call__',
 '__class__',
 '__contains__',
 '__copy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_all_strings',
 '_attr_value_as_string',
 '_attribute_checker',
 '_find_all',
 '_find_one',
 '_formatter_for_name',
 '_is_xml',
 '_lastRecursiveChild',
 '_last_descendant',
 '_select_debug',
 '_selector_combinators',
 '_should_pretty_print',
 '_tag_name_matches_and',
 'append',
 'attribselect_re',
 'attrs',
 'can_be_empty_element',
 'childGenerator',
 'children',
 'clear',
 'contents',
 'decode',
 'decode_contents',
 'decomp

We could use .text notation to access the &lt;span&gt; tag’s content. It would be better
though if we accessed the value of the data-value attribute. This way we can
convert the extracted datapoint to an int without having to strip a comma.

You can treat a Tag object just like a dictionary. The HTML attributes are
the dictionary’s keys. The values of the HTML attributes are the values of the
dictionary’s keys. This is how we can access the value of the data-value attribute:

In [None]:
first_votes['data-value']

'995153'

Let’s convert that value to an integer, and assign it to first_votes:

In [None]:
first_votes = int(first_votes['data-value'])
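
A minimal sketch of the dictionary-style access, assuming a span shaped like the one described above (the HTML string is reconstructed for illustration; only the data-value of 995153 comes from the output shown earlier):

```python
from bs4 import BeautifulSoup

# Assumed markup for the votes span
html = '<span name="nv" data-value="995153">995,153</span>'
votes_tag = BeautifulSoup(html, "html.parser").find("span", attrs={"name": "nv"})

# All HTML attributes are available as a dictionary
print(votes_tag.attrs)

# Reading data-value avoids having to strip the comma from the text
from_attribute = int(votes_tag["data-value"])
from_text = int(votes_tag.text.replace(",", ""))
print(from_attribute, from_text)

# .get() returns None for a missing attribute instead of raising KeyError
print(votes_tag.get("data-missing"))
```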

That’s it! We’re now in a position to easily write a script for scraping a single page.

## Extracting information from a single page

It may seem that a plain for loop over the containers is enough. When you run the following code, however, you will get a traceback (an error).

In [None]:
## The script for a single page

import requests
from bs4 import BeautifulSoup

url = 'http://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc&start=1'
page = requests.get(url)
#print(page.text[:500])
#page.text


soup = BeautifulSoup(page.text, 'html.parser')
#type(soup)
#soup

movie_containers = soup.find_all('div', class_ = 'lister-item mode-advanced')
#print(type(movie_containers))
#print(len(movie_containers))

names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Extract data from individual movie container
for container in movie_containers:
    # The name
    name = container.h3.a.text
    names.append(name)
    # The year
    year = container.h3.find('span', class_ = 'lister-item-year').text
    years.append(year)
    # The IMDB rating
    imdb = float(container.strong.text)
    imdb_ratings.append(imdb)
    # The Metascore
    m_score = container.find('span', class_ = 'metascore').text
    metascores.append(int(m_score))
    # The number of votes
    vote = container.find('span', attrs = {'name':'nv'})['data-value']
    votes.append(int(vote))

AttributeError: 'NoneType' object has no attribute 'text'

The error message shows that 'NoneType' object has no attribute 'text', which indicates that
container.find('span', class_ = 'metascore') returned None for some movie, i.e. that movie has no Metascore.
We cannot access the text attribute of None, so we need to add a condition to skip movies without a Metascore.
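
The failure mode can be reproduced without hitting the network. The two containers below are made-up markup in which only the first one has a Metascore; checking the result of find() against None before reading .text avoids the error.

```python
from bs4 import BeautifulSoup

# Hypothetical containers: the second one has no Metascore
html = """
<div class="lister-item mode-advanced"><h3><a>With score</a></h3>
  <div class="ratings-metascore"><span class="metascore favorable">68</span></div></div>
<div class="lister-item mode-advanced"><h3><a>Without score</a></h3></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

kept = []
for container in demo_soup.find_all("div", class_="lister-item mode-advanced"):
    score_tag = container.find("span", class_="metascore")
    if score_tag is None:      # find() returns None when nothing matches
        continue               # skip movies without a Metascore
    kept.append((container.h3.a.text, int(score_tag.text)))

print(kept)
```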

In [None]:
## The script for a single page

import requests
from bs4 import BeautifulSoup

url = 'http://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc&start=1'
page = requests.get(url)
#print(page.text[:500])
#page.text


soup = BeautifulSoup(page.text, 'html.parser')
#type(soup)
#soup

movie_containers = soup.find_all('div', class_ = 'lister-item mode-advanced')
#print(type(movie_containers))
#print(len(movie_containers))

names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Extract data from individual movie container
for container in movie_containers:
    # If the movie has Metascore, then extract:
    if container.find('div', class_ = 'ratings-metascore') is not None:
        # The name
        name = container.h3.a.text
        names.append(name)
        # The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years.append(year)
        # The IMDB rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        # The Metascore
        m_score = container.find('span', class_ = 'metascore').text
        metascores.append(int(m_score))
        # The number of votes
        vote = container.find('span', attrs = {'name':'nv'})['data-value']
        votes.append(int(vote))

Let’s check the data collected so far. Pandas makes it easy for us to see whether we’ve
scraped our data successfully.

In [None]:
import pandas as pd
test_df = pd.DataFrame({'movie': names,
                       'year': years,
                       'imdb': imdb_ratings,
                       'metascore': metascores,
                       'votes': votes
})
print(test_df.info())
test_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   movie      46 non-null     object 
 1   year       46 non-null     object 
 2   imdb       46 non-null     float64
 3   metascore  46 non-null     int64  
 4   votes      46 non-null     int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 1.9+ KB
None


| | movie | year | imdb | metascore | votes |
|---|---|---|---|---|---|
| 0 | Avengers: Infinity War | (2018) | 8.5 | 68 | 995153 |
| 1 | Black Panther | (2018) | 7.3 | 88 | 712874 |
| 2 | Deadpool 2 | (2018) | 7.7 | 66 | 546574 |
| 3 | Bohemian Rhapsody | (2018) | 8.0 | 49 | 511089 |
| 4 | A Quiet Place | (2018) | 7.5 | 82 | 502104 |
| 5 | Spider-Man: Een nieuw universum | (2018) | 8.4 | 87 | 479096 |
| 6 | Venom | (2018) | 6.7 | 35 | 459071 |
| 7 | Green Book | (2018) | 8.2 | 69 | 456433 |
| 8 | Aquaman | (2018) | 6.9 | 55 | 436384 |
| 9 | Ready Player One | (2018) | 7.4 | 64 | 416932 |


If we do not want to skip movies without a Metascore, we can use a try-except statement and store the string "None" as a placeholder whenever the Metascore is missing.

In [None]:
 ## The script for a single page

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc&start=1'
page = requests.get(url)
#print(page.text[:500])
#page.text


soup = BeautifulSoup(page.text, 'html.parser')
#type(soup)
#soup

movie_containers = soup.find_all('div', class_ = 'lister-item mode-advanced')
#print(type(movie_containers))
#print(len(movie_containers))

names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# Extract data from individual movie container
for container in movie_containers:
    # The name
    name = container.h3.a.text
    names.append(name)

    # The year
    year = container.h3.find('span', class_ = 'lister-item-year').text
    years.append(year)

    # The IMDB rating
    imdb = float(container.strong.text)
    imdb_ratings.append(imdb)

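    # The Metascore: find() returns None when the span is missing,
    # so reading .text raises AttributeError
    try:
        m_score = container.find('span', class_ = 'metascore').text
    except AttributeError:
        m_score = "None"
    metascores.append(m_score)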

    # The number of votes
    vote = container.find('span', attrs = {'name':'nv'})['data-value']
    votes.append(int(vote))


test2_df = pd.DataFrame({'movie': names,
                       'year': years,
                       'imdb': imdb_ratings,
                       'metascore': metascores,
                       'votes': votes
})
print(test2_df.info())
test2_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   movie      50 non-null     object 
 1   year       50 non-null     object 
 2   imdb       50 non-null     float64
 3   metascore  50 non-null     object 
 4   votes      50 non-null     int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 2.1+ KB
None


| | movie | year | imdb | metascore | votes |
|---|---|---|---|---|---|
| 0 | Avengers: Infinity War | (2018) | 8.5 | 68.0 | 995153 |
| 1 | Black Panther | (2018) | 7.3 | 88.0 | 712874 |
| 2 | Deadpool 2 | (2018) | 7.7 | 66.0 | 546574 |
| 3 | Bohemian Rhapsody | (2018) | 8.0 | 49.0 | 511089 |
| 4 | A Quiet Place | (2018) | 7.5 | 82.0 | 502104 |
| 5 | Spider-Man: Een nieuw universum | (2018) | 8.4 | 87.0 | 479096 |
| 6 | Venom | (2018) | 6.7 | 35.0 | 459071 |
| 7 | Green Book | (2018) | 8.2 | 69.0 | 456433 |
| 8 | Aquaman | (2018) | 6.9 | 55.0 | 436384 |
| 9 | Ready Player One | (2018) | 7.4 | 64.0 | 416932 |
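
Because missing scores are stored as the string "None", the metascore column ends up with the object dtype. One possible way to turn it into a numeric column is sketched below with `pd.to_numeric`; the three-row DataFrame is made up for illustration, and this step is a suggestion rather than part of the original script.

```python
import pandas as pd

# Made-up data mimicking scraped text: padded strings and a "None" placeholder
demo_df = pd.DataFrame({
    "movie": ["A", "B", "C"],
    "metascore": ["68        ", "None", "88        "],
})

# Strip the padding, then coerce anything non-numeric to NaN
cleaned = pd.to_numeric(demo_df["metascore"].str.strip(), errors="coerce")
demo_df["metascore"] = cleaned

print(demo_df["metascore"].dtype)
print(demo_df["metascore"].isna().sum())
```

The column becomes float64 because NaN is a floating-point value, so the scores appear as 68.0 and 88.0.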


## Extracting information from multiple pages

Our search results for the year 2018 contain 281,830 titles in total, but the script above only returns the
50 movies on the first page. Next, we will show how to deal with multiple pages. The URL of each page is built
by string concatenation, which the following toy cell illustrates.

In [None]:
base = "I like"
Day_of_Week = "Friday"
Day = base + " " + Day_of_Week + " of " + str(2022)
print(Day)

I like Friday of 2022
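
Applied to our search, the same concatenation produces one URL per page. This is a minimal sketch that assumes 50 results per page, as observed above; `per_page`, `starts`, and `urls` are illustrative names.

```python
base_url = 'http://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc'
per_page = 50   # assumption: IMDB shows 50 movies per results page

# start=1 for page 1, start=51 for page 2, and so on
starts = [(page_number - 1) * per_page + 1 for page_number in range(1, 6)]
urls = [base_url + "&start=" + str(start) for start in starts]

print(starts)
for url in urls:
    print(url)
```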


In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'http://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc'
current_page = 1    ## first page


names = []
years = []
imdb_ratings = []
metascores = []
votes = []

while current_page < 6:   ## suppose we want to get the first five pages
    print('\n')
    print('Page ', current_page)
    start = (current_page-1)*50 + 1  ## starting number; start=1 for page 1 and start=51 for page 2
    url = base_url + "&start=" + str(start)
    page = requests.get(url)

    # Any 2xx status code means the request succeeded
    if 200 <= page.status_code < 300:
        print('successfully connected!')
    else:
        print('cannot connect to the page!')


    soup = BeautifulSoup(page.text, 'html.parser')
    movie_containers = soup.find_all('div', class_ = 'lister-item mode-advanced')

    for container in movie_containers:
        # The name
        name = container.h3.a.text
        names.append(name)

        # The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years.append(year)

        # The IMDB rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)

        # The Metascore: find() returns None when the span is missing,
        # so reading .text raises AttributeError
        try:
            m_score = container.find('span', class_ = 'metascore').text
        except AttributeError:
            m_score = 'None'
        metascores.append(m_score)

        # The number of votes
        vote = container.find('span', attrs = {'name':'nv'})['data-value']
        votes.append(int(vote))

    del page     ## delete the current web page
    del soup     ## delete the current soup

    current_page += 1   ## move to next page

test3_df = pd.DataFrame({'movie': names,
                       'year': years,
                       'imdb': imdb_ratings,
                       'metascore': metascores,
                       'votes': votes
})
print('\n')
print(test3_df.info())
test3_df





Page  1
successfully connected!


Page  2
successfully connected!


Page  3
successfully connected!


Page  4
successfully connected!


Page  5
successfully connected!


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   movie      250 non-null    object 
 1   year       250 non-null    object 
 2   imdb       250 non-null    float64
 3   metascore  250 non-null    object 
 4   votes      250 non-null    int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 9.9+ KB
None


| | movie | year | imdb | metascore | votes |
|---|---|---|---|---|---|
| 0 | Avengers: Infinity War | (2018) | 8.5 | 68 | 995153 |
| 1 | Black Panther | (2018) | 7.3 | 88 | 712874 |
| 2 | Deadpool 2 | (2018) | 7.7 | 66 | 546574 |
| 3 | Bohemian Rhapsody | (2018) | 8.0 | 49 | 511089 |
| 4 | A Quiet Place | (2018) | 7.5 | 82 | 502104 |
| ... | ... | ... | ... | ... | ... |
| 245 | Dogman | (2018) | 7.2 | 71 | 26095 |
| 246 | Doragon bôru chô: Burorî | (2018) | 7.8 | 59 | 25898 |
| 247 | College Romance | (2018– ) | 9.1 | | 25828 |
| 248 | Don't Worry, He Won't Get Far on Foot | (2018) | 6.8 | 67 | 25795 |
