# SCRAPING THE IMDb's TOP 250 MOVIE DETAILS USING PYTHON


<img src="https://i.imgur.com/665OqLN.png" width="640" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">


## WebScraping Introduction


**Web scraping** is the use of automated bots to extract (scrape) data from a website. We get this data from the website's HTML. **Hypertext Markup Language** (HTML) is a markup language used to structure web pages, and we can extract data from this structure.
While web scraping often involves parsing and processing HTML documents, some platforms also offer REST APIs to retrieve information in a machine-readable format like JSON. In this tutorial, we'll use web scraping to create a real-world dataset. There are many specialised tools for this, such as the **BeautifulSoup** library for Python. Web scraping is an important way to gather data with little manual effort. In this project we will use some of these tools to scrape data from **IMDb**.



![](https://prowebscraping.com/wp-content/uploads/2015/09/data-scraping-service.jpg)





### OBJECTIVES

- To parse the IMDb Top 250 Movies page

- To extract the data to a csv file.

- Use the BeautifulSoup Library to scrape the html of the web page.


### Steps followed 

- Downloading web pages using the requests library
- Inspecting the HTML source code of a web page selected
- Parsing parts of a website using Beautiful Soup
- Using a REST API to retrieve data
- Combining data from multiple sources
- Writing parsed information into CSV files
- Using links on a page to crawl a website




![](https://www.webharvy.com/images/web%20scraping.png)



### Importing the required libraries

LIBRARIES USED 
- jovian 
- requests
- BeautifulSoup
- pandas


In [1]:
import jovian
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# Install the library
!pip install jovian --upgrade --quiet
!pip install requests --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet

### About IMDb

**IMDb** (an abbreviation of Internet Movie Database) is an online database of information related to films, television series, home videos and streaming content online, including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. IMDb began as a fan-operated movie database on the Usenet group "rec.arts.movies" in 1990, and moved to the Web in 1993. It is now owned and operated by IMDb.com, Inc., a subsidiary of Amazon.

As of March 2022, the database contained some 10.1 million titles (including television episodes) and 11.5 million person records. Additionally, the site had 83 million registered users. The site's message boards were disabled in February 2017.



### Project Idea

In this project, we will parse through the IMDB's Top rated Movies page to get details about the top rated movies from around the world.

We will retrieve information from the **'Top Rated Movies'** page using _web scraping_: a process of extracting information from a website programmatically. Web scraping isn't magic; many readers already gather information from websites by hand on a daily basis. For example, a recent graduate may copy and paste information about companies they applied to into a spreadsheet for job application management. A scraper automates that kind of copy and paste.

### Project Goal

The project goal is to build a web scraper that extracts all the desired information and assembles it into a single CSV file. The format of the output CSV file is shown below:

![](https://i.imgur.com/a5SKDn7.png)

#### **requests.get()**

In order to **download a web page**, we use `requests.get()` to **send the HTTP request** to the **IMDB server** and what the function returns is a **response object**, which is **the HTTP response**. 

In [3]:
#The URL Address of the webpage we will scrape, i.e. Top Rated Movies
Base_Url = 'https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start='
response = requests.get(Base_Url)

### To check the status code

An HTTP status code is the server's response to a client's request. When you visit a website, your browser sends a request to the site's server, and the server responds with a three-digit code: the HTTP status code. The `requests` library works the same way: whenever we make a request to a URL, it returns a response object, and its **status_code** attribute holds a number that indicates the status (200 is OK, 404 is Not Found).

The HTTP 200 **OK** success status response code indicates that the request has succeeded. A 200 response is cacheable by default. The meaning of success depends on the HTTP request method; for GET, it means the resource has been fetched and is transmitted in the message body.

The HTTP 400 **Bad Request** response status code indicates that the server cannot or will not process the request due to something that is perceived to be a client error (for example, malformed request syntax, invalid request message framing, or deceptive request routing).
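The status codes described above can be looked up with Python's standard library. This is a minimal sketch that does not contact any server; `describe_status` is a helper name introduced here for illustration and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code, its standard phrase and a rough category."""
    status = HTTPStatus(code)  # raises ValueError for codes the module does not know
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {status.phrase} ({kind})"


for code in (200, 400, 404):
    print(describe_status(code))
```

In the scraper itself the same check is simply `response.status_code == 200`.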

In [4]:
response.status_code

200

In [5]:
page_contents = response.text
len(page_contents)    

438711

In [6]:
with open('top_rated_movies.html', 'w', encoding='utf-8') as f:  # Write the HTML to a local file, i.e. a replica of the real page
    f.write(page_contents)

HTTP 200 means the transfer succeeded at the HTTP level.

## Extracting information using the BeautifulSoup


To extract information from the HTML source code of a webpage programmatically, we can use the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library.

**Beautiful Soup** is a Python package that parses HTML and XML, copes with badly formed markup, and presents the result as an easily traversable tree. In short, Beautiful Soup allows us to pull data out of HTML and XML documents.
 
 https://www.crummy.com/software/BeautifulSoup/bs4/doc/#getting-help

In [7]:
doc = BeautifulSoup(response.text,'html.parser')
#Now 'doc' contains entire html in parsed format

In [8]:
type(doc)

bs4.BeautifulSoup

### Inspecting the HTML source code of a web page

What is HTML?

Before we dive into how to inspect HTML, we should cover the basics of HTML.

The HyperText Markup Language, or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets and scripting languages such as JavaScript.

![](https://i.imgur.com/mvBpQIP.png)

As mentioned earlier, web pages are written in a language called HTML (Hyper Text Markup Language). HTML is a fairly simple language comprised of *tags*  (also called *nodes* or *elements*) e.g. `<a href="https://jovian.ai" target="_blank">Go to Jovian</a>`. An HTML tag has three parts:

1. **Name**: (`html`, `head`, `body`, `div`, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
2. **Attributes**: (`href`, `target`, `class`, `id`, etc.) Properties of tag used by the browser to customize how a tag is displayed and decide what happens on user interactions.
3. **Children**: A tag can contain some text or other tags or both between the opening and closing segments, e.g., `<div>Some content</div>`.
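The three parts listed above map directly onto attributes of a Beautiful Soup tag object. A small sketch using the example link from this section (the variable names `html` and `tag` are chosen here for illustration):

```python
from bs4 import BeautifulSoup

html = '<a href="https://jovian.ai" target="_blank">Go to Jovian</a>'
tag = BeautifulSoup(html, 'html.parser').find('a')

print(tag.name)     # the tag's name
print(tag.attrs)    # the tag's attributes as a dictionary
print(tag.text)     # the text between the opening and closing segments
print(tag['href'])  # a single attribute, looked up like a dictionary key
```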

### Common Tags and Attributes

Following are some of the most commonly used HTML tags:

* `html`
* `head`
* `title`
* `body`
* `div`
* `span`
* `h1` to `h6`
* `p`
* `img`
* `ul`, `ol` and `li`
* `table`, `tr`, `th` and `td`
* `style`
* ...

Each tag supports several attributes. Following are some common attributes used to modify the behavior of tags:

* `id`
* `style`
* `class`
* `href` (used with `<a>`)
* `src` (used with `<img>`)



With **a BeautifulSoup object**, we can get **a specific type of tag in the HTML** by passing the name of the tag, as shown in the code cell below.

Here, we use the `find()` function of BeautifulSoup to find the first `<title>` tag in the HTML document and display its content.

In [9]:
title = doc.find('title')
title

<title>IMDb "Top 250"
(Sorted by IMDb Rating Descending) - IMDb</title>

Let's save the work done 

In [10]:
jovian.commit()

[jovian] Updating notebook "ajay-kumar38897/imdb-movies" on https://jovian.com
[jovian] Committed successfully! https://jovian.com/ajay-kumar38897/imdb-movies


'https://jovian.com/ajay-kumar38897/imdb-movies'

### Inspecting HTML in the Browser


>To view the **source code** of any webpage right within **your browser**, you can **right click** anywhere on a page and **select** the **"Inspect"** option. You access the **"Developer Tools"** mode, where you can see the source code as **a tree**. You can expand and collapse various nodes and find the source code for a specific portion of the page



![](https://i.imgur.com/Wcq9yJA.png)


As shown in the screenshot above, I hovered over one of the movie names to see how its content is laid out in the HTML.
I found that each movie name sits inside an `<a>` tag. Since that tag has no class or other distinguishing attribute of its own, I have to pick the desired `<a>` tags out of all the `<a>` tags on the page by way of their parent elements.

Since we have downloaded a single page and parsed it into a BeautifulSoup object, we can start using functions from the Beautiful Soup library to pull out the pieces of information we want.

**Parsing** is the process of analysing text according to the rules of a formal grammar and turning it into a structured representation.

A **parser** is the program or library that does this. In this project the parser is Python's built-in `html.parser`, which we pass to BeautifulSoup so that it can turn the raw HTML text into a tree of tags.

Parsers are used whenever input needs to be represented abstractly as a data structure so that it can be checked and navigated. Programming languages and many other technologies use parsing of some kind for this purpose.




Upon inspecting the box containing the information for a movie, you will find an `h3` tag for each one, with its `class` attribute set to `lister-item-header`.


    


### Create a helper function `Get_Movie_Pages_Url` that stores the page links
MOVIE URLS
 
    
    


In [11]:
def Get_Movie_Pages_Url(Base_Url):
    URL_List = []
    for i in range(1,202,50): # Loop over the start index of each page (1, 51, ..., 201) to build its URL
        urls = Base_Url +str(i) +"&ref_=adv_nxt"
        URL_List.append(urls)
    return URL_List

In [12]:
Get_Movie_Pages_Url(Base_Url)

['https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start=1&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start=51&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start=101&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start=151&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start=201&ref_=adv_nxt']
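The bounds of the loop in `Get_Movie_Pages_Url` follow from the page size. A minimal sketch, assuming 250 titles shown 50 per page; `page_size` and `total_movies` are names introduced here for illustration:

```python
page_size = 50       # assumed number of titles per results page
total_movies = 250   # size of the Top 250 list

# The first title on each page: 1, then every page_size titles after it
start_values = list(range(1, total_movies + 1, page_size))
print(start_values)

# The helper above uses range(1, 202, 50), which yields the same values
print(start_values == list(range(1, 202, 50)))
```

Deriving the stop value from the list size keeps the loop correct if the page size ever changes.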

### Download and parse the web page

In [13]:
def Parse_Url(Base_Url):
    response = requests.get(Base_Url)
    if response.status_code != 200:
        raise Exception("error in getting the page {}".format(Base_Url))
    page_contents = response.text
    doc = BeautifulSoup(page_contents,'html.parser')
    return doc

doc = Parse_Url(Base_Url)

# Extract movie names, year,  ratings, runtime, genre and URLs

##  Movie Names

We extract the movie names by first selecting each `h3` tag with the class `lister-item-header`; inside it there is an `a` tag whose text is the movie name.

In [14]:
def Get_Movie_Title(doc):
    Movie_names = []
    # Find all h3 tags with the class "lister-item-header"
    Movie_name_tags = doc.find_all('h3', class_='lister-item-header')
    for tag in Movie_name_tags:
        # The movie name is the text of the 'a' tag inside each h3
        titles = tag.find('a')
        # Use the .append method to store each movie name in Movie_names
        Movie_names.append(titles.text)

    return Movie_names

In [15]:
# we store the movie name 
movie_name =Get_Movie_Title(doc) 
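The selector logic can be tried offline on a small hand-written snippet. The HTML below is an assumption modelled on the class names used in this notebook, not a copy of IMDb's markup, and `sample_html`, `sample_doc` and `sample_names` are names introduced only for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the scraper relies on
sample_html = """
<div>
  <h3 class="lister-item-header">
    <span>1.</span>
    <a href="/title/tt0111161/">The Shawshank Redemption</a>
  </h3>
  <h3 class="lister-item-header">
    <span>2.</span>
    <a href="/title/tt0068646/">The Godfather</a>
  </h3>
</div>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Same steps as Get_Movie_Title: select the h3 headers, then read the <a> text
sample_names = [
    header.find('a').text
    for header in sample_doc.find_all('h3', class_='lister-item-header')
]
print(sample_names)
```

Testing against a fixed snippet like this makes it easier to notice when a selector stops matching because the live page has changed.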

## Year of Release

In [16]:
def Get_Movie_Year(doc):
        
    Movie_year_tags = doc.find_all('span',class_='lister-item-year text-muted unbold')
    Movie_year = []
    for tag in Movie_year_tags:
        Movie_year.append(tag.text.strip())
    
    return Movie_year

movie_year =Get_Movie_Year(doc)

## Movie Url 

We're extracting the movie URL by selecting each `h3` tag with the class `lister-item-header` and reading the `href` of the `a` tag inside it. ![](https://i.imgur.com/iF8Fcn9.png)

In [17]:

def Get_Movie_Url(doc):
    base_url= 'https://www.imdb.com/'
    Movie_link_tags = doc.find_all('h3',class_='lister-item-header')
    Movie_link=[]
    
    for tag in Movie_link_tags:
        Movie_link.append(base_url + tag.find('a')['href'])
        
    return Movie_link

movie_url = Get_Movie_Url(doc)
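Because `base_url` ends with `/` and each `href` starts with `/`, plain concatenation yields a double slash, which is visible in the scraped URLs. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a relative link against a base URL. The `href` value below is an illustrative example of a relative link:

```python
from urllib.parse import urljoin

base_url = 'https://www.imdb.com/'
href = '/title/tt0111161/'  # example relative link, as read from an <a> tag

concatenated = base_url + href    # keeps both slashes
joined = urljoin(base_url, href)  # resolves the path against the site root

print(concatenated)
print(joined)
```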


## Rating 

In [18]:

def Get_Movie_Rating(doc):
    Movie_ratings_tags = doc.find_all('div',class_='inline-block ratings-imdb-rating')
    Movie_ratings=[]
    
    for tag in Movie_ratings_tags:
        Movie_ratings.append(tag.text.replace('\n',""))
        
    return Movie_ratings

movie_rating = Get_Movie_Rating(doc)

## Genre

In [19]:
def Get_Movie_genre(doc):
        
    Movie_genre_tags = doc.find_all('span',class_='genre')
    Movie_genre = []
    for tag in Movie_genre_tags:
        Movie_genre.append(tag.text.strip())
    
    return Movie_genre

movie_genre = Get_Movie_genre(doc)

In [20]:
movie_genre[:10]

['Drama',
 'Crime, Drama',
 'Action, Crime, Drama',
 'Action, Adventure, Drama',
 'Crime, Drama',
 'Biography, Drama, History',
 'Crime, Drama',
 'Crime, Drama',
 'Action, Adventure, Drama',
 'Drama']

## Movie Runtime 

In [21]:
def Get_Movie_runtime(doc):
        
    Movie_runtime_tags = doc.find_all('span',class_='runtime')
    Movie_runtime = []
    for tag in Movie_runtime_tags:
        Movie_runtime.append(tag.text.strip())
    
    return Movie_runtime

movie_runtime = Get_Movie_runtime(doc)
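The functions above return every field as text, for example `(1994)`, `142 min` and `9.3`. If numeric values are needed for later analysis, one possible approach is to pull the first number out of each string with a regular expression. `first_number` is a hypothetical helper written for this sketch, not part of the notebook's scraper:

```python
import re


def first_number(text: str) -> float:
    """Return the first integer or decimal number found in text."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


year = int(first_number('(1994)'))
runtime_minutes = int(first_number('142 min'))
rating = first_number('9.3')

print(year, runtime_minutes, rating)
```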

# Parse the data into a DataFrame



**PANDAS** is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
- Pandas is an open source Python package that is most widely used for data science/data analysis and machine learning tasks. It is built on top of another package named Numpy, which provides support for multi-dimensional arrays.

#### 1. Advantages of Pandas Library
The Pandas library has many benefits, and listing them all would probably take longer than learning the library. These are its core advantages:

-  Data representation
-  Less writing and more work done
-  An extensive set of features
-  Efficiently handles large data
-  Makes data flexible and customizable


A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.
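As a small illustration of the row-and-column structure described above, a DataFrame can be built from a dictionary of equal-length lists, where each key becomes a column. The values below are a two-row sample and `sample_df` is a name chosen for this sketch:

```python
import pandas as pd

sample = {
    'Movie_name': ['The Shawshank Redemption', 'The Godfather'],
    'Release_Year': ['(1994)', '(1972)'],
    'Rating': ['9.3', '9.2'],
}
sample_df = pd.DataFrame(sample)
print(sample_df.shape)  # (number of rows, number of columns)

# Lists of unequal length cannot form a rectangular table
try:
    pd.DataFrame({'Movie_name': ['A', 'B'], 'Runtime': ['90 min']})
    lengths_ok = True
except ValueError as error:
    lengths_ok = False
    print('ValueError:', error)
```

If any title on a page lacked one of the scraped fields, the per-column lists would differ in length and hit this error, so it is worth checking the list lengths before building the DataFrame.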



## Create a Function which will store the extracted Information into the python dictionary

In [22]:
def Scrape_Top_Rated_Movies(Base_Url):
    urls = Get_Movie_Pages_Url(Base_Url)
    Movie_Dict = {'Movie_name':[],'Release_Year':[],'Genre':[],'Runtime':[],'Movie_Url':[],'Rating':[]}
    for url in urls:
        doc = Parse_Url(url)
        Movie_Dict['Movie_name'] += Get_Movie_Title(doc)
        Movie_Dict['Release_Year'] += Get_Movie_Year(doc)
        Movie_Dict['Genre'] += Get_Movie_genre(doc)
        Movie_Dict['Runtime'] += Get_Movie_runtime(doc)
        Movie_Dict['Rating'] += Get_Movie_Rating(doc)
        Movie_Dict['Movie_Url'] += Get_Movie_Url(doc)
        
       
    Movies_df = pd.DataFrame(Movie_Dict)
    Movies_df.to_csv('Top250_Movies.csv',index = None)
    return Movie_Dict

In [23]:
Scrape_Top_Rated_Movies(Base_Url)

{'Movie_name': ['The Shawshank Redemption',
  'The Godfather',
  'The Dark Knight',
  'The Lord of the Rings: The Return of the King',
  'The Godfather: Part II',
  "Schindler's List",
  '12 Angry Men',
  'Pulp Fiction',
  'The Lord of the Rings: The Fellowship of the Ring',
  'Fight Club',
  'Inception',
  'Forrest Gump',
  'The Lord of the Rings: The Two Towers',
  'The Good, the Bad and the Ugly',
  'Jai Bhim',
  'GoodFellas',
  'The Matrix',
  "One Flew Over the Cuckoo's Nest",
  'Star Wars: Episode V - The Empire Strikes Back',
  'Interstellar',
  "It's a Wonderful Life",
  'Se7en',
  'Star Wars',
  'The Silence of the Lambs',
  'The Green Mile',
  'Terminator 2: Judgment Day',
  'Saving Private Ryan',
  'City of God',
  'Spirited Away',
  'Life Is Beautiful',
  'Seven Samurai',
  'Harakiri',
  'Whiplash',
  'Léon: The Professional',
  'Parasite',
  'Gladiator',
  'Back to the Future',
  'The Prestige',
  'Apocalypse Now',
  'The Departed',
  'Alien',
  'The Usual Suspects',
  'Am

## Combine all the functions written to extract the information 

In [24]:


def Get_Movie_Pages_Url(Base_Url):
    URL_List = []
    for i in range(1,202,50): # Loop over the start index of each page (1, 51, ..., 201) to build its URL
        urls = Base_Url +str(i) +"&ref_=adv_nxt"
        URL_List.append(urls)
    return URL_List

def Parse_Url(Base_Url):
    response = requests.get(Base_Url)
    if response.status_code != 200:
        raise Exception(f"Error in extracting Information for webpage = {Base_Url}")
    page_contents = response.text
    doc = BeautifulSoup(page_contents,'html.parser')
    return doc

def Get_Movie_Title(doc):
    Movie_names = [] 
    
    # find all h3 tag by using the class "lister-item-header"
        
    Movie_name_tags =doc.find_all('h3',class_='lister-item-header')
    # as we can see the Movie name is inside the 'a' tag 
    for tag in Movie_name_tags:
        titles = tag.find('a')
    # use the .append method to store all movie name in Movie_names 
        Movie_names.append(titles.text)
    
    return Movie_names

def Get_Movie_Year(doc):
        
    Movie_year_tags = doc.find_all('span',class_='lister-item-year text-muted unbold')
    Movie_year = []
    
    for tag in Movie_year_tags:
       
        Movie_year.append(tag.text.strip())
    
    return Movie_year

def Get_Movie_Url(doc):
    base_url= 'https://www.imdb.com/'
    Movie_link_tags = doc.find_all('h3',class_='lister-item-header')
    Movie_link=[]
    
    for tag in Movie_link_tags:
        Movie_link.append(base_url + tag.find('a')['href'])
        
    return Movie_link

def Get_Movie_Rating(doc):
    Movie_ratings_tags = doc.find_all('div',class_='inline-block ratings-imdb-rating')
    Movie_ratings=[]
    
    for tag in Movie_ratings_tags:
        Movie_ratings.append(tag.text.replace('\n',""))
        
    return Movie_ratings

def Get_Movie_genre(doc):
    Movie_genre_tags = doc.find_all('span',class_='genre')
    Movie_genre=[]
    
    for tag in Movie_genre_tags:
        Movie_genre.append(tag.text.strip())
    
    return Movie_genre

def Get_Movie_runtime(doc):
    Movie_runtime_tags = doc.find_all('span',class_='runtime')
    Movie_runtime = []
    
    for tag in Movie_runtime_tags:
        Movie_runtime.append(tag.text.strip())
    
    return Movie_runtime

movie_runtime = Get_Movie_runtime(doc)


def Scrape_Top_Rated_Movies(Base_Url):
    urls = Get_Movie_Pages_Url(Base_Url)
    Movie_Dict = {'Movie_name':[],'Release_Year':[],'Genre':[],'Runtime':[],'Movie_Url':[],'Rating':[]}
    for url in urls:
        doc = Parse_Url(url)
        Movie_Dict['Movie_name'] += Get_Movie_Title(doc)
        Movie_Dict['Release_Year'] += Get_Movie_Year(doc)
        Movie_Dict['Genre'] += Get_Movie_genre(doc)
        Movie_Dict['Runtime'] += Get_Movie_runtime(doc)
        Movie_Dict['Rating'] += Get_Movie_Rating(doc)
        Movie_Dict['Movie_Url'] += Get_Movie_Url(doc)
       
    Movies_df = pd.DataFrame(Movie_Dict)
    Movies_df.to_csv('Top250_Movies.csv',index = None)
    return Movie_Dict

### Load the information to the csv file



**CSV** file (Comma Separated Values file) is a type of plain text file that uses specific structuring to arrange tabular data. Because it's a plain text file, it can contain only actual text data—in other words, printable ASCII or Unicode characters. The structure of a CSV file is given away by its name.

Using the `pandas.DataFrame.to_csv()` method you can write/save/export a pandas DataFrame to a CSV file. By default, `to_csv()` exports the DataFrame with a comma delimiter and the row index as the first column.
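The effect of the row index on the output can be seen without writing a file, since `to_csv()` returns the CSV text when no path is given. A one-row sketch (`sample_df` is an illustrative name):

```python
import pandas as pd

sample_df = pd.DataFrame({'Movie_name': ['The Godfather'], 'Rating': ['9.2']})

with_index = sample_df.to_csv()                # default: index written as the first column
without_index = sample_df.to_csv(index=False)  # index column left out

print(with_index)
print(without_index)
```

With the default setting the header row starts with an empty field for the index, which is why the scraper in this notebook passes `index = None` when saving.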

In [25]:
#using pandas to convert dictionary into dataframe
dictionary = Scrape_Top_Rated_Movies(Base_Url)
df = pd.DataFrame(dictionary) 
df
    
    

| | Movie_name | Release_Year | Genre | Runtime | Movie_Url | Rating |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | (1994) | Drama | 142 min | https://www.imdb.com//title/tt0111161/ | 9.3 |
| 1 | The Godfather | (1972) | Crime, Drama | 175 min | https://www.imdb.com//title/tt0068646/ | 9.2 |
| 2 | The Dark Knight | (2008) | Action, Crime, Drama | 152 min | https://www.imdb.com//title/tt0468569/ | 9.0 |
| 3 | The Lord of the Rings: The Return of the King | (2003) | Action, Adventure, Drama | 201 min | https://www.imdb.com//title/tt0167260/ | 9.0 |
| 4 | The Godfather: Part II | (1974) | Crime, Drama | 202 min | https://www.imdb.com//title/tt0071562/ | 9.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 245 | Dances with Wolves | (1990) | Adventure, Drama, Western | 181 min | https://www.imdb.com//title/tt0099348/ | 8.0 |
| 246 | The Incredibles | (2004) | Animation, Action, Adventure | 115 min | https://www.imdb.com//title/tt0317705/ | 8.0 |
| 247 | Aladdin | (1992) | Animation, Adventure, Comedy | 90 min | https://www.imdb.com//title/tt0103639/ | 8.0 |
| 248 | Life of Brian | (1979) | Comedy | 94 min | https://www.imdb.com//title/tt0079470/ | 8.0 |


In [26]:
def get_csv(Data_Frame):
    return Data_Frame.to_csv('Top250_Movies.csv',index = None)
   
get_csv(df)
    

![image.png](https://i.imgur.com/Ucjbttz.png) 

## Future Work

 We can now work forward to explore this data more and more to fetch meaningful information out of it.  

With all the insights, and further analysis into the data, we can have answers to a lot of questions like -   
* Which actor has worked in most top rated movies across the world?
* The Top Rated Movies as per the `Genre` of our interest?
* Which Director has directed the most top rated movies?
* Which year gave us the most Top Rated Movies till date?

And the list goes on..
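As a starting point for the genre question, the comma-separated genre strings can be split and counted with the standard library. The list below reuses the ten values shown in the `movie_genre[:10]` output earlier; `sample_genres` and `genre_counts` are names introduced for this sketch.

```python
from collections import Counter

sample_genres = [
    'Drama',
    'Crime, Drama',
    'Action, Crime, Drama',
    'Action, Adventure, Drama',
    'Crime, Drama',
    'Biography, Drama, History',
    'Crime, Drama',
    'Crime, Drama',
    'Action, Adventure, Drama',
    'Drama',
]

# Split each entry on commas and count every individual genre
genre_counts = Counter(
    genre.strip()
    for entry in sample_genres
    for genre in entry.split(',')
)
print(genre_counts.most_common(3))
```

The same idea applies to the full `Genre` column once the CSV has been loaded.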

> In the future, I would like to work to make this `DataSet` even richer with more data from other lists created by IMDB like - `Most Trending Movies`, `Top Rated Indian Movies`, `Lowest Rated Movies` etc.
I would then like to work on analysing the entire data, to know a lot more about movies than I currently know. 

## Summary

- We have used Python's requests, BeautifulSoup libraries to get the information.
- We have tried to get the information in CSV format which can be easily used by the end-user.
- The CSV contains the six columns that we wanted:
   1. Name of the Movie
   2. Year of Release  
   3. Runtime
   4. Movie Ratings
   5. Movie URLS
   6. Genre

Scraping the **IMDb webpage** gives us the name, year, genre, runtime, rating and URL of every top rated title in one file, so that users can compare titles and make informed viewing decisions.

## References 

1. https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start=%27

2. https://en.wikipedia.org/wiki/IMDb
3. https://imgur.com/
4. https://stackoverflow.com/
5. https://www.webharvy.com/articles/what-is-web-scraping.html
6. https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
7. https://www.ibm.com/docs/en/db2-event-store/2.0.0?topic=notebooks-markdown-jupyter-cheatsheet

In [None]:
jovian.commit()
