# Scraping Top 250 Movies Sorted by Rating on IMDb Using Python

![](https://i.imgur.com/Lh71xPB.png)
**IMDb** (an acronym for Internet Movie Database) is a widely used source of entertainment information, with features designed to help fans explore the world of movies and shows and decide what to watch. It is an online database of information related to films, television programs, home videos, video games, and streaming content, including cast, production crew, personal biographies, plot summaries, trivia, ratings, and fan and critical reviews.

The page https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start= lists the Top 250 movies sorted by user rating. In this project we will use web scraping to retrieve information about the movies from this page.

### Let's talk about Web Scraping
Web scraping is an automatic method to obtain large amounts of data from websites. Most of this data is unstructured data in an HTML format which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.

### How to do Web Scraping Using Python
Python has libraries for many different purposes. In this project we will use the following:

Requests: The requests module allows you to send HTTP requests using Python. An HTTP request returns a Response object with all the response data (content, encoding, status, etc.). [Requests](https://www.w3schools.com/python/module_requests.asp)

BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree that makes it easy to extract data. [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

Pandas: Pandas is a library used for data manipulation and analysis. Here it is used to organize the extracted data and store it in the desired format. [Pandas](https://www.w3schools.com/python/pandas/pandas_csv.asp)

## Here's a step-by-step outline for this project:

1. Download the web page using requests.
2. Parse the HTML source code using Beautiful Soup.
3. Extract movie names, release years, ratings, and URLs from the page.
4. Compile the extracted information into Python lists and a dictionary.
5. Create a CSV file using Pandas to save the extracted information.

By the end of this project, we will have created a CSV file that contains the following information for each movie:

Movie_name, Release_Year, Movie_Url, Rating

## How to Run the Code

You can execute the code using the 'Run' button at the top of this page. You can also make changes and save your own version of this notebook to [Jovian](https://jovian.ai/) by executing the following code cells:

In [182]:
!pip install jovian --upgrade --quiet

In [183]:
import jovian

# Download the Web Page using requests

We can use the requests library to download the web page.

In [184]:
!pip install requests --upgrade --quiet

In [185]:
import requests

The library is now installed and imported.
To download the web page, we can use the `get` function from requests, which returns a response object.

In [186]:
Base_Url = 'https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start='

response = requests.get(Base_Url)

`requests.get` returns a response object containing the data from the web page and some other information about the response.

`response.status_code` returns a number that indicates the status of the request (200 is OK, 404 is Not Found). [HTTP status_code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)

In [187]:
response.status_code

200
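
To make the meaning of a status code explicit, Python's standard-library `http.HTTPStatus` enum maps each code to its standard phrase. The helper `describe_status` is a hypothetical name used only for illustration; this is a small sketch and not part of the scraper itself.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a readable description such as '200 OK' for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes that are not defined
    return f"{code} {status.phrase}"

print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```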

The request was successful. Now we can get the contents of the page using `response.text`.


In [188]:
Page_content = response.text

Let's check the number of characters on the page.

In [189]:
len(Page_content)

441498

The page contains 441,498 characters. Here are the first 250 characters of the page:

In [190]:
Page_content[:250]

'\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n\n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n\n\n\n        <script type="te'

What we see above is the [HTML source code](https://en.wikipedia.org/wiki/HTML) of the web page.

We can also save it to a file and view the page locally within Jupyter using 'File > Open'.

In [191]:
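# Write the page as UTF-8 so non-ASCII titles are saved correctly on every platform
with open('Top_250_Movie.html', 'w', encoding='utf-8') as file:
    file.write(Page_content)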

We have successfully downloaded the web page using requests. Now we can view the HTML source code of the page locally within Jupyter.
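
Once the page is saved, later experiments can read it back from disk instead of downloading it again. A minimal sketch, assuming the file was written as UTF-8 text; `load_saved_page` is a hypothetical helper name, and the demonstration uses a tiny stand-in file rather than the real IMDb page:

```python
import tempfile
from pathlib import Path

def load_saved_page(path):
    """Read a previously saved HTML file back into a string."""
    return Path(path).read_text(encoding="utf-8")

# Round-trip demonstration with a tiny stand-in file (not the real IMDb page)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.html"
    demo_path.write_text("<html><title>demo</title></html>", encoding="utf-8")
    demo_html = load_saved_page(demo_path)

print(demo_html)
```

For the notebook's own file, the call would be `load_saved_page('Top_250_Movie.html')`.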

## Parse the HTML source code using Beautiful Soup

We'll use the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) Python library to parse the HTML source code of the web page downloaded in the previous section.

Let's install and import the Beautiful Soup library.

In [192]:
!pip install beautifulsoup4 --upgrade --quiet

In [193]:
from bs4 import BeautifulSoup

We have successfully imported Beautiful Soup. Now we parse the page, passing `'html.parser'` as the second argument. This tells Beautiful Soup to use the HTML parser from Python's standard library (the `html.parser` module, which defines the `HTMLParser` class).

In [194]:
doc = BeautifulSoup(response.text,'html.parser')

In [195]:
type(doc)

bs4.BeautifulSoup

In [196]:
doc.find('title') # Title of page

<title>IMDb "Top 250"
(Sorted by IMDb Rating Descending) - IMDb</title>



### Let's create a helper function Get_Movie_Pages_Url that stores the links of all the pages we want to scrape in a list

In [197]:
Base_Url = 'https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start='
def Get_Movie_Pages_Url(Base_Url):
    URL_List = []
    for i in range(1,202,50): # start = 1, 51, 101, 151, 201: one URL per page of 50 movies
        urls = Base_Url +str(i) +"&ref_=adv_nxt"
        URL_List.append(urls)
    return URL_List


In [198]:
Get_Movie_Pages_Url(Base_Url)

['https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start=1&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start=51&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start=101&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start=151&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start=201&ref_=adv_nxt']
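
The same five URLs can also be assembled from a dictionary of query parameters with the standard library's `urllib.parse.urlencode`, which makes the role of each parameter visible. This is only an alternative sketch: `SEARCH_URL` and `build_page_urls` are names chosen here for illustration, and `safe=','` keeps the comma in `user_rating,desc` unescaped so that the result matches the strings above.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.imdb.com/search/title/"

def build_page_urls(total=250, page_size=50):
    """Build one search URL per page; `start` is the 1-based rank of the first movie."""
    urls = []
    for start in range(1, total + 1, page_size):  # 1, 51, 101, 151, 201
        params = {
            "groups": "top_250",
            "sort": "user_rating,desc",
            "start": start,
            "ref_": "adv_nxt",
        }
        urls.append(SEARCH_URL + "?" + urlencode(params, safe=","))
    return urls

for url in build_page_urls():
    print(url)
```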

### Now we can write a function Parse_Url to download and parse a web page

In [199]:
def Parse_Url(url):
    # Download the page and stop with an error on any non-200 response
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception("error in getting the page {}".format(url))
    page_contents = response.text
    doc = BeautifulSoup(page_contents,'html.parser')
    return doc


In [200]:
doc = Parse_Url(Base_Url)

We can now use `Parse_Url` to download and parse the different pages of the Top 250 list with Beautiful Soup.

# Extract movie names, years, ratings and URLs

## Extracting Movie Names

We extract the movie names by first selecting the `h3` tags with class `lister-item-header`; inside each of them there is an `a` tag that contains the movie name. ![lister-item-header](https://i.imgur.com/uj5eJJA.png)

In [201]:
def Get_Movie_Title(doc):
    Movie_names = []

    # Find all h3 tags with the class "lister-item-header"
    Movie_name_tags = doc.find_all('h3',class_='lister-item-header')
    for tag in Movie_name_tags:
        # The movie name is the text of the 'a' tag inside each h3
        titles = tag.find('a')
        Movie_names.append(titles.text)

    return Movie_names

In [202]:
movie_name = Get_Movie_Title(doc) # store the movie names

In [203]:
movie_name[:5] # the first five movie names

['The Shawshank Redemption',
 'The Godfather',
 'The Dark Knight',
 'The Lord of the Rings: The Return of the King',
 "Schindler's List"]

In [204]:
len(movie_name) # as we can see there are 50 movies on one page

50
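
To experiment with the selection logic without contacting IMDb, the same idea can be tried on a small hand-written HTML snippet. The snippet below is an assumption modelled on the tag and class names described above, not a copy of the real page, and it uses Beautiful Soup's CSS-selector method `select` as an alternative to `find_all`:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two list entries (assumed structure, not real page source)
SAMPLE_HTML = """
<h3 class="lister-item-header"><a href="/title/tt0111161/">The Shawshank Redemption</a></h3>
<h3 class="lister-item-header"><a href="/title/tt0068646/">The Godfather</a></h3>
"""

sample_doc = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 'h3.lister-item-header a' means: every <a> inside an <h3> that has this class
sample_titles = [a.get_text(strip=True) for a in sample_doc.select("h3.lister-item-header a")]
print(sample_titles)
```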

## Extracting Movie Release Year

We extract the release year by selecting the `span` tags with class `lister-item-year text-muted unbold`. ![](https://i.imgur.com/iF8Fcn9.png)

In [205]:
def Get_Movie_Year(doc):
        
    Movie_year_tags = doc.find_all('span',class_='lister-item-year text-muted unbold')
    Movie_year = []
    
    for tag in Movie_year_tags:
       
        Movie_year.append(tag.text)
    
    return Movie_year

In [206]:
movie_year =Get_Movie_Year(doc)


In [207]:
len(movie_year)

50

In [208]:
movie_year[:5]

['(1994)', '(1972)', '(2008)', '(2003)', '(1993)']
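
The years come back as strings wrapped in parentheses. If numeric years are needed later (for sorting or filtering), a regular expression can pull out the four digits. A minimal sketch: `clean_year` is a hypothetical helper, and returning `None` for text without a four-digit year is a design choice made here, not something the notebook does.

```python
import re

def clean_year(text):
    """Return the first four-digit year found in text as an int, or None if absent."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None

print([clean_year(y) for y in ['(1994)', '(1972)', '(2008)']])  # [1994, 1972, 2008]
```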

## Extracting Individual Movie URL

We extract each movie's URL by selecting the `h3` tags with class `lister-item-header` and reading the `href` attribute of the `a` tag inside. ![](https://i.imgur.com/iF8Fcn9.png)

In [209]:

def Get_Movie_Url(doc):
    base_url= 'https://www.imdb.com/'
    Movie_link_tags = doc.find_all('h3',class_='lister-item-header')
    Movie_link=[]
    
    for tag in Movie_link_tags:
        Movie_link.append(base_url + tag.find('a')['href'])
        
    return Movie_link

In [210]:
movie_url = Get_Movie_Url(doc)



In [211]:
len(movie_url)

50

In [212]:
movie_url[:5]

['https://www.imdb.com//title/tt0111161/',
 'https://www.imdb.com//title/tt0068646/',
 'https://www.imdb.com//title/tt0468569/',
 'https://www.imdb.com//title/tt0167260/',
 'https://www.imdb.com//title/tt0108052/']
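
Note the double slash after the domain in the output above: the base string ends with `/` and each `href` begins with `/`. One way to avoid this is `urllib.parse.urljoin` from the standard library, which resolves a relative link against a base URL. A small sketch (`IMDB_ROOT` and `absolute_url` are names chosen here for illustration):

```python
from urllib.parse import urljoin

IMDB_ROOT = "https://www.imdb.com/"

def absolute_url(href):
    """Resolve a site-relative href such as '/title/tt0111161/' against the IMDb root."""
    return urljoin(IMDB_ROOT, href)

print(absolute_url("/title/tt0111161/"))  # https://www.imdb.com/title/tt0111161/
```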

## Extracting Ratings

We extract the ratings by selecting the `div` tags with class `inline-block ratings-imdb-rating`. ![](https://i.imgur.com/qKfgHdb.png)

In [213]:

def Get_Movie_Rating(doc):
    Movie_ratings_tags = doc.find_all('div',class_='inline-block ratings-imdb-rating')
    Movie_ratings=[]
    
    for tag in Movie_ratings_tags:
        Movie_ratings.append(tag.text.replace('\n',""))
        
    return Movie_ratings

In [214]:
movie_rating = Get_Movie_Rating(doc)



In [215]:
len(movie_rating)

50

In [216]:
movie_rating[:5]

['9.3', '9.2', '9.0', '9.0', '9.0']

## Let us import the Pandas library

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.

Let us import Pandas.

In [217]:
import pandas as pd


## Let us create a function that stores the extracted information from all the pages in a Python dictionary

In [218]:
def Scrape_Top_Rated_Movies(Base_Url):
    urls = Get_Movie_Pages_Url(Base_Url)
    Movie_Dict = {'Movie_name':[],'Release_Year':[],'Movie_Url':[],'Rating':[]}
    for url in urls:
        doc = Parse_Url(url)
        Movie_Dict['Movie_name'] += Get_Movie_Title(doc)
        Movie_Dict['Release_Year'] += Get_Movie_Year(doc)
        Movie_Dict['Movie_Url'] += Get_Movie_Url(doc)
        Movie_Dict['Rating'] += Get_Movie_Rating(doc)
       
    Movies_df = pd.DataFrame(Movie_Dict)
    Movies_df.to_csv('IMDB250.csv',index = False)
    return Movie_Dict
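
Because each column is collected by a separate function, the four lists only line up if every movie contributes exactly one entry to each of them; `pd.DataFrame` raises a `ValueError` when the lists in the dictionary differ in length. A small check like the following (a sketch; `check_equal_lengths` is a hypothetical helper) reports which column is off before the DataFrame is built:

```python
def check_equal_lengths(columns):
    """Raise ValueError with the per-column lengths if the lists are not all the same size."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths

print(check_equal_lengths({'Movie_name': ['A', 'B'], 'Rating': ['9.3', '9.2']}))
```

It could be called on `Movie_Dict` just before `pd.DataFrame(Movie_Dict)`.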

### Now we write all the extracted information into a CSV file using a Pandas DataFrame

In [219]:
Movies_df = pd.DataFrame(Scrape_Top_Rated_Movies(Base_Url))
Movies_df.to_csv('IMDB250.csv',index = False)

In [220]:
Movies_df

| | Movie_name | Release_Year | Movie_Url | Rating |
|---|---|---|---|---|
| 0 | The Shawshank Redemption | (1994) | https://www.imdb.com//title/tt0111161/ | 9.3 |
| 1 | The Godfather | (1972) | https://www.imdb.com//title/tt0068646/ | 9.2 |
| 2 | The Dark Knight | (2008) | https://www.imdb.com//title/tt0468569/ | 9.0 |
| 3 | The Lord of the Rings: The Return of the King | (2003) | https://www.imdb.com//title/tt0167260/ | 9.0 |
| 4 | Schindler's List | (1993) | https://www.imdb.com//title/tt0108052/ | 9.0 |
| ... | ... | ... | ... | ... |
| 245 | The Incredibles | (2004) | https://www.imdb.com//title/tt0317705/ | 8.0 |
| 246 | Aladdin | (1992) | https://www.imdb.com//title/tt0103639/ | 8.0 |
| 247 | Beauty and the Beast | (1991) | https://www.imdb.com//title/tt0101414/ | 8.0 |
| 248 | Dances with Wolves | (1990) | https://www.imdb.com//title/tt0099348/ | 8.0 |
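
The ratings in this table are stored as strings. For comparisons or sorting they can be converted to numbers with `pd.to_numeric`. A minimal sketch on a three-row stand-in frame (the rows are copied from the output above; `sample_df` and `top_rated` are names introduced here for illustration):

```python
import pandas as pd

sample_df = pd.DataFrame({
    'Movie_name': ['The Shawshank Redemption', 'The Godfather', 'The Incredibles'],
    'Rating': ['9.3', '9.2', '8.0'],
})

# Convert the rating strings to floats; errors='coerce' turns unparsable text into NaN
sample_df['Rating'] = pd.to_numeric(sample_df['Rating'], errors='coerce')

# With numeric ratings, comparisons behave as expected
top_rated = sample_df[sample_df['Rating'] >= 9.0]
print(top_rated['Movie_name'].tolist())
```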


## Let us combine all the functions we have written to extract this information

In [221]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


Base_Url = 'https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start='

def Get_Movie_Pages_Url(Base_Url):
    URL_List = []
    for i in range(1,202,50): # start = 1, 51, 101, 151, 201: one URL per page of 50 movies
        urls = Base_Url +str(i) +"&ref_=adv_nxt"
        URL_List.append(urls)
    return URL_List

def Parse_Url(url):
    # Download the page and stop with an error on any non-200 response
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception(f"Error in extracting Information for webpage = {url}")
    page_contents = response.text
    doc = BeautifulSoup(page_contents,'html.parser')
    return doc

def Get_Movie_Title(doc):
    Movie_names = []

    # Find all h3 tags with the class "lister-item-header"
    Movie_name_tags = doc.find_all('h3',class_='lister-item-header')
    for tag in Movie_name_tags:
        # The movie name is the text of the 'a' tag inside each h3
        titles = tag.find('a')
        Movie_names.append(titles.text)

    return Movie_names

def Get_Movie_Year(doc):
        
    Movie_year_tags = doc.find_all('span',class_='lister-item-year text-muted unbold')
    Movie_year = []
    
    for tag in Movie_year_tags:
       
        Movie_year.append(tag.text)
    
    return Movie_year

def Get_Movie_Url(doc):
    base_url= 'https://www.imdb.com/'
    Movie_link_tags = doc.find_all('h3',class_='lister-item-header')
    Movie_link=[]
    
    for tag in Movie_link_tags:
        Movie_link.append(base_url + tag.find('a')['href'])
        
    return Movie_link

def Get_Movie_Rating(doc):
    Movie_ratings_tags = doc.find_all('div',class_='inline-block ratings-imdb-rating')
    Movie_ratings=[]
    
    for tag in Movie_ratings_tags:
        Movie_ratings.append(tag.text.replace('\n',""))
        
    return Movie_ratings

def Scrape_Top_Rated_Movies(Base_Url):
    urls = Get_Movie_Pages_Url(Base_Url)
    Movie_Dict = {'Movie_name':[],'Release_Year':[],'Movie_Url':[],'Rating':[]}
    for url in urls:
        # Parse each page URL in turn rather than the base URL
        doc = Parse_Url(url)
        Movie_Dict['Movie_name'] += Get_Movie_Title(doc)
        Movie_Dict['Release_Year'] += Get_Movie_Year(doc)
        Movie_Dict['Movie_Url'] += Get_Movie_Url(doc)
        Movie_Dict['Rating'] += Get_Movie_Rating(doc)
       
    Movies_df = pd.DataFrame(Movie_Dict)
    Movies_df.to_csv('IMDB250.csv',index = False)
    return Movie_Dict

In [224]:
All_Movies = Scrape_Top_Rated_Movies(Base_Url)

In [226]:
All_Movies

{'Movie_name': ['The Shawshank Redemption',
  'The Godfather',
  'The Dark Knight',
  'The Lord of the Rings: The Return of the King',
  "Schindler's List",
  'The Godfather Part II',
  '12 Angry Men',
  'Pulp Fiction',
  'Inception',
  'The Lord of the Rings: The Two Towers',
  'Fight Club',
  'The Lord of the Rings: The Fellowship of the Ring',
  'Forrest Gump',
  'The Good, the Bad and the Ugly',
  'The Matrix',
  'Goodfellas',
  'Star Wars: Episode V - The Empire Strikes Back',
  "One Flew Over the Cuckoo's Nest",
  'Top Gun: Maverick',
  'Interstellar',
  'City of God',
  'Spirited Away',
  'Saving Private Ryan',
  'The Green Mile',
  'Life Is Beautiful',
  'Se7en',
  'Terminator 2: Judgment Day',
  'The Silence of the Lambs',
  'Star Wars',
  'Hara-Kiri',
  'Seven Samurai',
  "It's a Wonderful Life",
  'Parasite',
  'Whiplash',
  'The Intouchables',
  'The Prestige',
  'The Departed',
  'The Pianist',
  'Gladiator',
  'American History X',
  'The Usual Suspects',
  'Léon: The Pro

# Summary

Here's what we have covered:

1. Identified the web pages to scrape for the information we need.
2. Downloaded the web pages using the requests library.
3. Used Beautiful Soup to parse the HTML source code.
4. Extracted the movie name, release year, URL, and rating for each movie.
5. Collected the extracted data into Python lists.
6. Extracted and combined data from multiple pages.
7. Created a CSV file with all the information extracted in the steps above.

# Future Work

* We can now explore this data further to draw meaningful information out of it.

* With further analysis of the data, we can answer questions such as:

   * Which actor has appeared in the most top-rated movies?
   * Which are the top-rated movies in a genre of our interest?
   * Which director has directed the most top-rated movies?
   * What are the Metascore and gross earnings of each movie?
   

# References

[1] Python official documentation. https://docs.python.org/3/

[2] Requests library. https://pypi.org/project/requests/

[3] Beautiful Soup documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

[4] Aakash N S, Introduction to Web Scraping, 2021. https://jovian.ai/aakashns/python-web-scraping-and-rest-api

[5] Pandas library documentation. https://pandas.pydata.org/docs/

[6] IMDB Website. https://www.imdb.com/chart/top

