# Scraping IMDb Movies and TV Show details using Python

 
 
 
 
 ![image](https://i.imgur.com/8a8GUMN.png)

 


# Web Scraping

![](https://i.imgur.com/D8e7o6o.png)

>### What is Web Scraping?
**Web Scraping** is the process through which we extract data from a website, and save it in a form which is easy to read, to understand and to work on. 

>By 'easy to work on', we mean that the extracted data can be used to draw useful insights and answer questions that would be hard to answer without having the data stored in a simple, sorted form, typically an `Excel` or `CSV` file.

>### How does web scraping work?


>To understand web scraping, it's important to first understand that web pages are built with text-based mark-up languages, the most common being `HTML`.

>A mark-up language defines the structure of a website's content. Because mark-up languages share universal components and tags, it is much easier for a web scraper to find the information it needs. Once the HTML is parsed, the scraper extracts the necessary data and stores it.

>The HyperText Markup Language, or HTML, is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets and scripting languages such as JavaScript.

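
To make the idea of tags concrete, here is a minimal sketch that walks through a tiny HTML fragment and records which tag each piece of text sits in. It uses Python's built-in `html.parser` module rather than Beautiful Soup so that it runs without extra packages; the fragment and the `TagTextCollector` class name are made up for illustration.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Record (enclosing tag, text) pairs for every non-empty piece of text."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of the tags we are currently inside
        self.found = []       # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            self.found.append((self.open_tags[-1], text))


# A made-up fragment; real pages are far larger but follow the same rules.
html = (
    "<html><head><title>Top 250 Movies</title></head>"
    "<body><h1>Chart</h1><p>Ranked by rating.</p></body></html>"
)

collector = TagTextCollector()
collector.feed(html)
collector.close()
print(collector.found)
```

Because every piece of text is wrapped in a named tag, a scraper can ask for "the text inside `title`" instead of searching the raw string.
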
**Note**: Not all websites allow web scraping, especially where users' personal data is involved, so we should always make sure that we do not collect or publish information that belongs to someone else. Websites usually have protections in place, and they may block our access if they see us downloading a large amount of data.
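
One common courtesy check before scraping is to read a site's `robots.txt`, the file in which a site states which paths automated clients may visit. The sketch below uses the standard library's `urllib.robotparser` on a made-up rule set (not IMDb's actual file, and `example.com` is a placeholder) so that it runs offline. For a live site you would point the parser at the real file with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch(user_agent, url) reports whether the rules permit the request.
allowed = parser.can_fetch("*", "https://example.com/chart/top/")
blocked = parser.can_fetch("*", "https://example.com/private/data")
print(allowed, blocked)
```
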

# IMDb Website

IMDb (an abbreviation of Internet Movie Database) is an online database of information related to films, television series, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. IMDb began as a fan-operated movie database on the Usenet group "rec.arts.movies" in 1990, and moved to the Web in 1993. It is now owned and operated by IMDb.com, Inc., a subsidiary of Amazon.

As of March 2022, the database contained some 10.1 million titles (including television episodes) and 11.5 million person records. Additionally, the site had 83 million registered users. The site's message boards were disabled in February 2017.

# Project Idea

![](https://i.imgur.com/OPuKzRd.png)

In this project, we will parse IMDb's Top Rated Movies and Top TV Shows pages to get details about the top-rated movies and TV shows from around the world.

We will retrieve information from the **Top Rated Movies** and **Top TV Shows** pages using _web scraping_: a process of extracting information from a website programmatically. Web scraping is useful whenever the same information has to be collected regularly. For example, an analysis team might scrape stock data every day and look for insights in it.

Packages used:
>1. Requests: for downloading the HTML code from the IMDb URL
>2. BeautifulSoup4: for parsing and extracting data from the HTML string
>3. Pandas: for gathering the data into a DataFrame for further processing

# Here are the steps we will follow:

- We are going to scrape https://www.imdb.com/chart/top/?ref_=nv_mv_250 and https://www.imdb.com/chart/toptv/?ref_=nv_tvv_250
- From the first page, we will get the list of the top 250 movies by rating. For each movie, we'll get its rank, name, year of release, and rating.
- From the second page, we will get the list of the top 250 TV shows by rating. For each TV show, we'll get its rank, name, year of release, and rating.
- We will save the results in a CSV or Excel file.
- We will create a file in the format:
'''  Ranking, Movie Name, Year of Release, Rating '''
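
The target layout can be sketched with the standard library's `csv` module. This is only an illustration of the file format, not the code used later in the notebook (which relies on pandas); the in-memory `buffer` stands in for a real file. Note how a title that itself contains commas is quoted automatically.

```python
import csv
import io

# Two sample rows from the movie chart; the second title contains commas.
rows = [
    (1, "The Shawshank Redemption", 1994, 9.2),
    (10, "The Good, the Bad and the Ugly", 1966, 8.8),
]

buffer = io.StringIO()   # stands in for open("some_file.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["Ranking", "Movie Name", "Year of Release", "Rating"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```
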



### How to run the code

This tutorial is an executable [Jupyter notebook](https://jupyter.org) hosted on [Jovian](https://www.jovian.ai). You can _run_ this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your computer*.

#### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the **Run** button at the top of this page and select **Run on Binder**. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on [Google Colab](https://colab.research.google.com) or [Kaggle](https://kaggle.com) to use these platforms.


#### Option 2: Running on your computer locally

To run the code on your computer locally, you'll need to set up [Python](https://www.python.org), download the notebook and install the required libraries. We recommend using the [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) distribution of Python. Click the **Run** button at the top of this page, select the **Run Locally** option, and follow the instructions.

>  **Jupyter Notebooks**: This tutorial is a [Jupyter notebook](https://jupyter.org) - a document made of _cells_. Each cell can contain code written in Python or explanations in plain English. You can execute code cells and view the results, e.g., numbers, messages, graphs, tables, files, etc., instantly within the notebook. Jupyter is a powerful platform for experimentation and analysis. Don't be afraid to mess around with the code & break things - you'll learn a lot by encountering and fixing errors. You can use the "Kernel > Restart & Clear Output" menu option to clear all outputs and start again from the top.

# `Step 1 :`  Download the page & Parse the data



In [63]:
!pip install requests --upgrade --quiet
import requests

In [64]:
!pip install beautifulsoup4 --upgrade --quiet

In [65]:
!pip install pandas --upgrade --quiet

In [66]:
from bs4 import BeautifulSoup

In [67]:
import pandas as pd
import numpy as np
import jovian

In [82]:
jovian.commit()


In [69]:
urls = ['https://www.imdb.com/chart/top/?ref_=nv_mv_250', 'https://www.imdb.com/chart/toptv/?ref_=nv_tvv_250']
urls

['https://www.imdb.com/chart/top/?ref_=nv_mv_250',
 'https://www.imdb.com/chart/toptv/?ref_=nv_tvv_250']

#### **Using `requests` to download the page and check the `status code`**

To *download a web page*, we use `requests.get()` to *send an HTTP request* to the *IMDb server*. The function returns a *response object*, which represents *the HTTP response*.

We also have to `check` explicitly whether the request succeeded. Because we are NOT using a browser, we get no visual `feedback` when a request fails.

The usual way to check whether the server answered successfully is the **status code**. The response object returned by `requests.get` contains the page contents along with a status code indicating whether the HTTP request was successful. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

If the request was successful, `response.status_code` is set to a value between **200 and 299**.
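
The range check described above can be sketched as a small helper. The name `is_success` is made up for this example, and the status codes are passed in by hand so that the snippet runs without a network connection.

```python
from http import HTTPStatus


def is_success(status_code):
    """Return True for 2xx status codes, the range that signals success."""
    return 200 <= status_code <= 299


for code in (200, 301, 404, 503):
    label = "success" if is_success(code) else "not successful"
    # HTTPStatus(code).phrase is the standard short description of the code
    print(code, HTTPStatus(code).phrase, "->", label)
```

With a real response you could pass `response.status_code` to this helper. `requests` also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.
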

In [70]:
def response_get(urls):
    # Download every URL and collect the response objects in order
    response_list = []
    for url in urls:
        response = requests.get(url)
        response_list.append(response)
    return response_list

response_list = response_get(urls)

In [71]:
response_list

[<Response [200]>, <Response [200]>]


Both responses have the status code 200, which lies in the successful range of 200 to 299. Next we will display the pages one by one, and later we will scrape the parts of each page that we need. To parse the pages we will use `Beautiful Soup`.

In [72]:
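def status(url):
    # Download the page and parse its HTML with Beautiful Soup
    response = requests.get(url)
    # Naming the parser explicitly avoids a warning and keeps results consistent
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc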

In [73]:
status(urls[0])

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Top 250 Movies - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.imdb.com/chart/top/" rel="canonical"/>
<meta content="http://www.imdb.com/chart/top/" property="og:url"/>
<script>
    if (typeof uet == 'function') {
      uet("bb", "Loa

In [74]:
status(urls[1])

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<style>
                body#styleguide-v2 {
                    background: no-repeat fixed center top #000;
                }
            </style>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>IMDb Top 250 TV - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.imdb.com/chart/toptv/

# `Step 2 : `  Helper functions to parse information from the page.
We have checked the responses for both pages and parsed them. Now we will save the downloaded page contents to a local file.
 

In [75]:
topic_url = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'
response = requests.get(topic_url)
page_contents = response.text
with open('page1.html', 'w', encoding="utf-8") as file:
    file.write(page_contents)
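
Saving the page also makes it possible to avoid downloading it again on every run, which keeps the load on the website low. Below is a minimal sketch of that idea. The helper name `load_or_download` is hypothetical, and `fake_download` stands in for something like `lambda: requests.get(topic_url).text` so that the example runs offline.

```python
import tempfile
from pathlib import Path


def load_or_download(path, download):
    """Return the page text from `path`; call `download()` only if the file is missing."""
    path = Path(path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = download()
    path.write_text(text, encoding="utf-8")
    return text


calls = []  # records how often the (fake) download actually happens


def fake_download():
    calls.append(1)
    return "<html><title>Top 250 Movies - IMDb</title></html>"


with tempfile.TemporaryDirectory() as folder:
    saved = Path(folder) / "page1.html"
    first = load_or_download(saved, fake_download)    # downloads and writes the file
    second = load_or_download(saved, fake_download)   # reads the saved copy instead

print(first == second, len(calls))
```
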

# `Step 3 : `  Parsing the webpage and scraping desired results

In [76]:
def get_movie_details(urls):
    # One list per field; rows from both charts are appended to the same lists
    dictionary = {"name":[],"rank":[],"year":[],"rating":[]}
    for i in urls:
        doc = status(i)
        # Every chart entry is one <tr> inside <tbody class="lister-list">
        movies = doc.find('tbody', class_ = "lister-list").find_all('tr')
        for j in movies:
            title_cell = j.find('td', class_ = "titleColumn")
            dictionary["name"].append(title_cell.a.text)
            # The cell text starts with "<rank>.", so keep what precedes the first dot
            dictionary["rank"].append(title_cell.get_text(strip= True).split('.')[0])
            # The year is shown as "(1994)"; strip the parentheses
            dictionary["year"].append(title_cell.span.text.strip('()'))
            dictionary["rating"].append(j.find('td', class_ = "ratingColumn imdbRating").strong.text)
    return dictionary
            
        

In [77]:
get_movie_details(urls)

{'name': ['The Shawshank Redemption',
  'The Godfather',
  'The Dark Knight',
  'The Godfather Part II',
  '12 Angry Men',
  "Schindler's List",
  'The Lord of the Rings: The Return of the King',
  'Pulp Fiction',
  'The Lord of the Rings: The Fellowship of the Ring',
  'The Good, the Bad and the Ugly',
  'Forrest Gump',
  'Fight Club',
  'The Lord of the Rings: The Two Towers',
  'Inception',
  'Star Wars: Episode V - The Empire Strikes Back',
  'The Matrix',
  'Goodfellas',
  "One Flew Over the Cuckoo's Nest",
  'Se7en',
  'Seven Samurai',
  "It's a Wonderful Life",
  'The Silence of the Lambs',
  'City of God',
  'Saving Private Ryan',
  'Interstellar',
  'Life Is Beautiful',
  'The Green Mile',
  'Star Wars',
  'Terminator 2: Judgment Day',
  'Back to the Future',
  'Spirited Away',
  'The Pianist',
  'Psycho',
  'Parasite',
  'Léon: The Professional',
  'The Lion King',
  'Gladiator',
  'American History X',
  'The Departed',
  'The Usual Suspects',
  'The Prestige',
  'Whiplash',
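
The rank and year are recovered from the title cell with plain string operations. The sketch below shows those two operations in isolation, assuming the cell text looks like the sample string (rank, title, and year run together once the whitespace is stripped); the sample values are taken from the first row of the output.

```python
# Text of one title cell, assumed to be what get_text(strip=True) returns:
# the rank, the title, and the year joined without spaces.
cell_text = "1.The Shawshank Redemption(1994)"
year_text = "(1994)"             # text of the <span> inside the same cell

rank = cell_text.split(".")[0]   # everything before the first dot
year = year_text.strip("()")     # drop the surrounding parentheses

print(rank, year)                # both values are still strings at this point
```

Splitting on the first dot keeps working for titles that contain a dot themselves, because the rank always comes first.
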

# `Step 4 : ` Converting data to pandas dataframe

In [78]:
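movie_tv = get_movie_details(urls)
# Despite its name, csv_file is a pandas DataFrame built from the dictionary
csv_file = pd.DataFrame(movie_tv)
# With a file path, to_csv writes the file and returns None, so nothing is displayed
csv_file.to_csv('imdb.csv')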

In [79]:
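# Without a path, to_csv returns the CSV text as a string
csv_final = csv_file.to_csv()
# With a path, it writes the file and returns None, so there is nothing to assign
csv_file.to_csv('imdb.csv')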

In [80]:
csv_final

',name,rank,year,rating\n0,The Shawshank Redemption,1,1994,9.2\n1,The Godfather,2,1972,9.2\n2,The Dark Knight,3,2008,9.0\n3,The Godfather Part II,4,1974,9.0\n4,12 Angry Men,5,1957,9.0\n5,Schindler\'s List,6,1993,8.9\n6,The Lord of the Rings: The Return of the King,7,2003,8.9\n7,Pulp Fiction,8,1994,8.8\n8,The Lord of the Rings: The Fellowship of the Ring,9,2001,8.8\n9,"The Good, the Bad and the Ugly",10,1966,8.8\n10,Forrest Gump,11,1994,8.8\n11,Fight Club,12,1999,8.7\n12,The Lord of the Rings: The Two Towers,13,2002,8.7\n13,Inception,14,2010,8.7\n14,Star Wars: Episode V - The Empire Strikes Back,15,1980,8.7\n15,The Matrix,16,1999,8.7\n16,Goodfellas,17,1990,8.7\n17,One Flew Over the Cuckoo\'s Nest,18,1975,8.6\n18,Se7en,19,1995,8.6\n19,Seven Samurai,20,1954,8.6\n20,It\'s a Wonderful Life,21,1946,8.6\n21,The Silence of the Lambs,22,1991,8.6\n22,City of God,23,2002,8.6\n23,Saving Private Ryan,24,1998,8.6\n24,Interstellar,25,2014,8.6\n25,Life Is Beautiful,26,1997,8.6\n26,The Green Mile,27,19

It will show us the top movies.
![](https://i.imgur.com/7cEAhDI.png)

In [81]:
csv_file

Unnamed: 0,name,rank,year,rating
0,The Shawshank Redemption,1,1994,9.2
1,The Godfather,2,1972,9.2
2,The Dark Knight,3,2008,9.0
3,The Godfather Part II,4,1974,9.0
4,12 Angry Men,5,1957,9.0
...,...,...,...,...
495,Gintama,246,2005,8.4
496,The Great British Baking Show,247,2010,8.4
497,Black Books,248,2000,8.4
498,Foyle's War,249,2002,8.4
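
Every scraped value is a string, so numeric questions (averages, sorting by year) need a conversion step first. Here is a minimal sketch in plain Python using the first three rows from the output above; the variable names are chosen for this example only.

```python
# A few scraped rows copied from the output; every value is a string.
scraped = {
    "name": ["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
    "rank": ["1", "2", "3"],
    "year": ["1994", "1972", "2008"],
    "rating": ["9.2", "9.2", "9.0"],
}

# Convert the numeric columns so that arithmetic and sorting behave as expected.
cleaned = {
    "name": scraped["name"],
    "rank": [int(value) for value in scraped["rank"]],
    "year": [int(value) for value in scraped["year"]],
    "rating": [float(value) for value in scraped["rating"]],
}

average_rating = sum(cleaned["rating"]) / len(cleaned["rating"])
print(round(average_rating, 2))
```

On the DataFrame itself, the equivalent would be something like `csv_file.astype({'rank': int, 'year': int, 'rating': float})`, assuming every value parses cleanly.
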


# Summary

Finally, we have managed to `parse` IMDb's `Top Rated Movies` and `Top TV shows` pages to get each title's rank, year of release, and rating. We gathered the extracted information into a pandas DataFrame and saved it in a `CSV` file, which we can use to answer many of the questions we may want to ask.

![](https://i.imgur.com/llTRwsJ.jpg)


To summarize, these are the steps we followed to complete the project:

1. We downloaded the web pages using the `requests` library.

2. We `parsed` the HTML code using the `BeautifulSoup` library and extracted the information, i.e., name, year, rating, and rank.

3. We converted the resulting Python dictionary into a `Pandas DataFrame`.

4. We then saved all these details in a CSV file (`imdb.csv`).

# Future Work

We can now go on to explore these `urls` further and fetch more meaningful information from them.  

With further analysis of the data, we could answer questions such as:  

* Which actors have appeared in the most top-rated movies?
* Which director has directed the most top-rated movies?
* Which production house has produced the most top-rated movies?
* Who is in the cast of a particular movie?
* What are the names and ratings of movies within a certain budget?

> In the future, I would like to make this `DataSet` even richer with data from other IMDb lists, such as `Most Trending Movies`, `Awards & Events`, `Celebs`, and `Lowest Rated Movies`.
We could then analyse this data further and get more insights from it.
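
As a rough illustration of how a question like "which director has directed the most top-rated movies?" could be answered once such data is available, here is a small counting sketch. The current dataset has no director column, so the records below are hypothetical placeholders, not scraped values.

```python
from collections import Counter

# Hypothetical records: placeholder names only, used to show the counting step.
movies = [
    {"name": "Movie A", "director": "Director X"},
    {"name": "Movie B", "director": "Director Y"},
    {"name": "Movie C", "director": "Director X"},
]

# Count how many listed movies each director has
director_counts = Counter(movie["director"] for movie in movies)
top_director, top_count = director_counts.most_common(1)[0]
print(top_director, top_count)
```
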

