# Scraping Top Movies and TV shows on IMDb using Python

![banner-image](https://i.imgur.com/MCY7PHW.png)

IMDb is an online database of information related to films, television series, podcasts, home videos, video games, and streaming content online - including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. 

The movie page https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating lists the top-rated movies on IMDb. In this project, we'll retrieve information from this page using web scraping.

The TV show page https://www.imdb.com/chart/toptv/?ref_=nv_mp_tv250 lists the top-rated TV shows on IMDb.

#### What is Web Scraping?

Web scraping is a process of accessing web pages, extracting data and storing the data in a structured format where it can be used for further analysis.

We'll use the Python libraries `requests` and `BeautifulSoup` to scrape data from these pages.

Here's an outline of the steps we'll follow:

 1. Download the webpage using `requests`.
 2. Parse the HTML source code using `BeautifulSoup`.
 3. Extract movie names, release years, audience ratings, genres and runtimes from the page.
 4. Compile the extracted information into Python lists and dictionaries.
 5. Save the extracted information to a CSV file.
 
By the end of the project, we'll create a CSV file in the following format:

        Title,Release year,Audience rating,Genre,Runtime
        The Shawshank Redemption,1994,9.3,Drama,142 min
        The Godfather,1972,9.2,"Crime, Drama",175 min
        The Dark Knight,2008,9,"Action, Crime, Drama",152 min
        The Godfather Part II,1974,9,"Crime, Drama",202 min
        ....


### How to Run the Code

You can execute the code using the "Run" button at the top of this page and selecting "Run on Binder". You can make changes and save your own version of the notebook to [Jovian](https://www.jovian.ai) by executing the following cells:

In [25]:
!pip install jovian --upgrade --quiet

## Download the webpage using `requests`

 We'll use the `requests` library to download the web page.
 
 The library can be installed using `pip`.

In [26]:
!pip install requests --upgrade --quiet

In [27]:
import requests

The library is now installed and imported.

To download a page, we can use the `get` function from requests.

In [28]:
movies_url = 'https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating'
response = requests.get(movies_url, headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'})

`requests.get` returns a response object containing the data from the web page and some other information.

The `status_code` property can be used to check whether the request was successful. A successful response will have an [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) between 200 and 299.

In [29]:
response.status_code

200
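
As a small illustration of the 200 to 299 rule, the check can be written as a helper. The name `is_success` is hypothetical and not part of `requests`; the library itself offers `response.ok` (true for any status below 400) and `response.raise_for_status()` for related checks.

```python
def is_success(status_code: int) -> bool:
    """Return True when the HTTP status code is in the 2xx (success) range."""
    return 200 <= status_code <= 299

# Try the helper on a few illustrative status codes
for code in (200, 204, 301, 404, 500):
    print(code, is_success(code))
```

In the notebook, this would be called as `is_success(response.status_code)`.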

The request was successful. We can get the contents of the page using `response.text`.

In [30]:
page_contents = response.text

Let's check the number of characters in the page.

In [31]:
len(response.text)

789397

The page contains over 780,000 characters. Here are the first 1000 characters of the page.

In [32]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n        \n<script type=\'text/javascript\'>var ue_t0=ue_t0||+new Date();</script>\n<script type=\'text/javascript\'>\nwindow.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;\nif (window.ue_ihb === 1) {\n\nvar ue_csm = window,\n    ue_hob = +new Date();\n(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.shift();)b(a[0],a[1],a[2])};b[a].isStub=1}};e.exec=function(b,a){return function(){try{return b.apply(this,arguments)}catch(c){ueLogError(c,{attribution:a||"undefined",logLevel:"WARN"})}}}})(ue_csm);\n\n\n    var ue_err_chan = \'jserr\';\n(function(d,e){function h(f,b){if(!(a.ec>a.mxe)&&f){a.ter.push(f);b=b||{};var c=f.logLevel||b.logLevel;c&&c!==k&&c!==m&&

What we're looking at above is the [HTML source code](https://en.wikipedia.org/wiki/HTML) of the web page.

We can also save it to a file and view the page locally within Jupyter using "File > Open".

In [33]:
# Set the encoding explicitly so non-ASCII characters are written consistently
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

In this section, we used the requests library to download a web page as HTML.

## Parse the HTML source code using `BeautifulSoup`

To extract information from the HTML source code of a webpage programmatically, we can use the Beautiful Soup library. Let's install the library and import the BeautifulSoup class from the bs4 module.

In [34]:
# Install the library
!pip install beautifulsoup4 --upgrade --quiet

In [35]:
# Import the library
from bs4 import BeautifulSoup

In [36]:
doc = BeautifulSoup(page_contents, 'html.parser')

Before we parse the page and find the top movies, let's define a helper function that downloads and parses any page.

In [37]:
def get_page_contents(movies_url):
    response = requests.get(movies_url, headers={"Accept-Language": "en-US"})
    return BeautifulSoup(response.text, "html.parser")

doc = get_page_contents(movies_url)

In [38]:
type(doc)

bs4.BeautifulSoup

In [39]:
doc.find('title')

<title>IMDb "Top 1000"
(Sorted by IMDb Rating Descending) - IMDb</title>

The `find` method returns the first matching tag, here the page's `<title>`.

 ![banner-image](https://i.imgur.com/9p8AJ4z.png)

Inspecting the box that contains a movie's information shows that the title is a link (an `a` tag) nested inside an `h3` tag. An HTML element can carry a class in addition to its tag name, so I will find every `h3` tag whose class attribute is set to `lister-item-header`.

In [40]:
movies = doc.find_all('h3', {'class': 'lister-item-header'})

In [41]:
len(movies)

100

There are 100 movies listed on the page, and I have found the enclosing tag for each one.
Next, I will get each movie's title from the `a` tag nested inside the `h3` tag.

In [42]:
title =[] 
for movie in movies:
    title.append(movie.find('a').text)
title[:10]

['The Shawshank Redemption',
 'The Godfather',
 'Spider-Man: Across the Spider-Verse',
 'The Dark Knight',
 "Schindler's List",
 '12 Angry Men',
 'The Lord of the Rings: The Return of the King',
 'The Godfather Part II',
 'Pulp Fiction',
 'Inception']

The loop collected the titles of all 100 movies.
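
To make the tag-and-class lookup easier to experiment with, here is a minimal sketch that runs on a hand-written HTML snippet instead of the live page. The markup in `sample_html` is hypothetical and heavily trimmed; only the `h3` tag, the `lister-item-header` class and the nested `a` tag come from the page described above. It also shows `select`, Beautiful Soup's CSS-selector method, as one possible alternative to `find_all` followed by `find`.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup; the real page has many more tags and attributes.
sample_html = """
<h3 class="lister-item-header">
  <a href="#">The Shawshank Redemption</a>
</h3>
<h3 class="lister-item-header">
  <a href="#">The Godfather</a>
</h3>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Same approach as in the notebook: find each header, then the first 'a' tag inside it.
headers = sample_doc.find_all('h3', {'class': 'lister-item-header'})
titles_from_find = [header.find('a').text for header in headers]

# CSS selector alternative: every 'a' nested inside an h3 with that class.
titles_from_select = [a.text for a in sample_doc.select('h3.lister-item-header a')]

print(titles_from_find)
print(titles_from_select)
```

Both approaches should give the same titles on this snippet; which one reads better is largely a matter of taste.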

## Extract `movie names`, `release years`, `Audience score`, `Genre` and `Runtime` from the page

Now I create a list of all the movies and their corresponding HTML. The `find_all` method returns a list in which each entry contains the HTML captured within a `div` tag with the class `lister-item-content`.

In [43]:
movies = doc.find_all('div', {'class':'lister-item-content'})

To extract each data element, I will loop through all the movies, find the HTML element with the specified tag and class, and read its text.

In [44]:
release_year = [movie.find('span', {'class':'lister-item-year text-muted unbold'}).text for movie in movies]
release_year[:10]

['(1994)',
 '(1972)',
 '(2023)',
 '(2008)',
 '(1993)',
 '(1957)',
 '(2003)',
 '(1974)',
 '(1994)',
 '(2010)']
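
The years come back as strings wrapped in parentheses. If numeric years are needed for later analysis, one possible cleanup is sketched below. The helper `parse_year` is a hypothetical name, and the regular expression assumes that each string contains exactly one four-digit year, as in the sample output above.

```python
import re

def parse_year(raw: str) -> int:
    """Extract the four-digit year from a string such as '(1994)'."""
    match = re.search(r"\d{4}", raw)
    if match is None:
        raise ValueError(f"no four-digit year found in {raw!r}")
    return int(match.group())

# Sample values copied from the output above
sample_years = ['(1994)', '(1972)', '(2023)']
clean_years = [parse_year(year) for year in sample_years]
print(clean_years)
```

Because the pattern searches for digits rather than stripping fixed characters, it should also tolerate extra text around the year, should the page include any.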

In [45]:
audience_rating = [movie.find('div', {'class':'inline-block ratings-imdb-rating'}).text.strip() for movie in movies]
audience_rating[:10]

['9.3', '9.2', '9.0', '9.0', '9.0', '9.0', '9.0', '9.0', '8.9', '8.8']

In [46]:
genre = [movie.find('span', {'class':'genre'}).text.strip() for movie in movies]
genre[:10]

['Drama',
 'Crime, Drama',
 'Animation, Action, Adventure',
 'Action, Crime, Drama',
 'Biography, Drama, History',
 'Crime, Drama',
 'Action, Adventure, Drama',
 'Crime, Drama',
 'Crime, Drama',
 'Action, Adventure, Sci-Fi']

In [47]:
runtime = [movie.find('span', {'class':'runtime'}).text for movie in movies]
runtime[:10]

['142 min',
 '175 min',
 '140 min',
 '152 min',
 '195 min',
 '96 min',
 '201 min',
 '202 min',
 '154 min',
 '148 min']
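
Ratings and runtimes are also scraped as strings. A minimal sketch of converting them to numbers follows; the helper names are hypothetical, and the runtime parser assumes the `'<number> min'` format seen in the output above.

```python
def parse_rating(raw: str) -> float:
    """Convert a rating string such as '9.3' to a float."""
    return float(raw.strip())

def parse_runtime_minutes(raw: str) -> int:
    """Convert a runtime string such as '142 min' to an integer number of minutes."""
    # The first whitespace-separated token is assumed to be the number of minutes
    return int(raw.split()[0])

# Sample values copied from the outputs above
sample_ratings = ['9.3', '9.2', '9.0']
sample_runtimes = ['142 min', '175 min', '96 min']

numeric_ratings = [parse_rating(r) for r in sample_ratings]
numeric_runtimes = [parse_runtime_minutes(r) for r in sample_runtimes]
print(numeric_ratings)
print(numeric_runtimes)
```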

Now I have extracted all the information from this page.

## Compile extracted information into python lists and dictionaries

Now I will store all the extracted data in lists.

In [48]:
title = [movie.find('a').text for movie in movies]
release_year = [movie.find('span', {'class':'lister-item-year text-muted unbold'}).text for movie in movies]
audience_rating = [movie.find('div', {'class':'inline-block ratings-imdb-rating'}).text.strip() for movie in movies]
genre = [movie.find('span', {'class':'genre'}).text.strip() for movie in movies]
runtime = [movie.find('span', {'class':'runtime'}).text for movie in movies]

I will now create a dictionary from the extracted data.

In [49]:
movies_dict = {
    'Title' : title,
    'Release year' : release_year,
    'Audience rating' : audience_rating,
    'Genre' : genre,
    'Runtime' : runtime
}
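
The dictionary above stores the data column by column. When a row-by-row view is more convenient, the columns can be zipped together, as in this sketch. The two-movie `sample_dict` is a shortened stand-in for `movies_dict`, using values from the outputs above.

```python
# Shortened stand-in for movies_dict, using values from the outputs above
sample_dict = {
    'Title': ['The Shawshank Redemption', 'The Godfather'],
    'Release year': ['(1994)', '(1972)'],
    'Audience rating': ['9.3', '9.2'],
    'Genre': ['Drama', 'Crime, Drama'],
    'Runtime': ['142 min', '175 min'],
}

# All columns must have the same length, otherwise rows would be misaligned
lengths = {len(column) for column in sample_dict.values()}
assert len(lengths) == 1, "columns have different lengths"

# zip(*columns) walks the columns in parallel and yields one tuple per movie
records = [dict(zip(sample_dict.keys(), row)) for row in zip(*sample_dict.values())]
print(records[1])
```

The length check is worth keeping in a real run: if one field were missing for a single movie, the lists would no longer line up.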

## Extract and combine data from the page

In [50]:
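def get_page_contents(movies_url):
    # Download the page and parse it with the built-in HTML parser
    response = requests.get(movies_url, headers={"Accept-Language": "en-US"})
    return BeautifulSoup(response.text, "html.parser")

def parse_movies(movies):
    # Build one list per column from the movie containers
    title = [movie.find('a').text for movie in movies]
    release_year = [movie.find('span', {'class':'lister-item-year text-muted unbold'}).text for movie in movies]
    audience_rating = [movie.find('div', {'class':'inline-block ratings-imdb-rating'}).text.strip() for movie in movies]
    genre = [movie.find('span', {'class':'genre'}).text.strip() for movie in movies]
    runtime = [movie.find('span', {'class':'runtime'}).text for movie in movies]
    # Return a new dictionary instead of relying on the global movies_dict
    return {
        'Title' : title,
        'Release year' : release_year,
        'Audience rating' : audience_rating,
        'Genre' : genre,
        'Runtime' : runtime
    }

doc = get_page_contents(movies_url)
movies = doc.find_all('div', {'class':'lister-item-content'})
movies_dict = parse_movies(movies)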

Now I can use the pandas library to view the contents of the dictionary as a table.

In [51]:
!pip install pandas --quiet

In [52]:
import pandas as pd

In [53]:
movies_df = pd.DataFrame(movies_dict)

In [54]:
movies_df

Unnamed: 0,Title,Release year,Audience rating,Genre,Runtime
0,The Shawshank Redemption,(1994),9.3,Drama,142 min
1,The Godfather,(1972),9.2,"Crime, Drama",175 min
2,Spider-Man: Across the Spider-Verse,(2023),9.0,"Animation, Action, Adventure",140 min
3,The Dark Knight,(2008),9.0,"Action, Crime, Drama",152 min
4,Schindler's List,(1993),9.0,"Biography, Drama, History",195 min
...,...,...,...,...,...
95,Requiem for a Dream,(2000),8.3,Drama,102 min
96,Full Metal Jacket,(1987),8.3,"Drama, War",116 min
97,Good Will Hunting,(1997),8.3,"Drama, Romance",126 min
98,American Beauty,(1999),8.3,Drama,122 min


The DataFrame shows all the data extracted from this page.

## Save the extracted information to a CSV file

I can now save the extracted data from the DataFrame above to a CSV file.

In [55]:
movies_df.to_csv('movies.csv', index=False)
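
Some genre values contain commas, which matters in a comma-separated file. The sketch below uses Python's standard `csv` module rather than pandas to show how such values are quoted when written and recovered when read back. As far as I know, `to_csv` applies the same minimal quoting by default, but treat that as an assumption to verify. The two sample rows are taken from the outputs above.

```python
import csv
import io

# Two sample rows; the second genre contains a comma
rows = [
    {'Title': 'The Shawshank Redemption', 'Genre': 'Drama'},
    {'Title': 'The Godfather', 'Genre': 'Crime, Drama'},
]

# Write to an in-memory buffer instead of a file, to keep the example self-contained
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Title', 'Genre'])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original values, commas included
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(parsed[1]['Genre'])
```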

# Getting information about the top TV shows

## Download the webpage using `requests`

We'll use the `requests` library to download the web page.

To download a page, we can use the `get` function from `requests`.

In [56]:
tv_url = 'https://www.imdb.com/chart/toptv/?ref_=nv_mp_tv250'

In [57]:
response = requests.get(tv_url, headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'})

The `status_code` property can be used to check whether the request was successful. A successful response will have an HTTP status code between 200 and 299.

In [58]:
response.status_code

200

The request was successful. We can get the contents of the page using `response.text`.

Let's check the number of characters in the page.

In [59]:
page_content = response.text

In [60]:
len(response.text)

742940

We can also save it to a file and view the page locally within Jupyter using "File > Open".

In [61]:
# Set the encoding explicitly so non-ASCII characters are written consistently
with open('webpage2.html', 'w', encoding='utf-8') as f:
    f.write(page_content)

In this section, we used the requests library to download a web page as HTML.

## Parse the HTML source code using BeautifulSoup

To extract information from the HTML source code of a webpage programmatically, I can use the Beautiful Soup library, which was already installed and imported in the steps above.

In [62]:
doc2 = BeautifulSoup(response.text, 'html.parser')

Before we parse the page and find the top TV shows, let's define a helper function that downloads and parses any page.

In [63]:
def get_page_content(tv_url):
    response = requests.get(tv_url, headers={"Accept-Language": "en-US"})
    return BeautifulSoup(response.text, "html.parser")

doc2 = get_page_content(tv_url)

In [64]:
doc2.find('title')

<title>IMDb Top 250 TV - IMDb</title>

The `find` method returns the page's `<title>` tag.

![banner-image](https://i.imgur.com/imTlODP.png)

Inspecting the box that contains a TV show's information shows that the title is a link (an `a` tag) nested inside a table cell. An HTML element can carry a class in addition to its tag name, so I will find every `td` tag whose class attribute is set to `titleColumn`.

In [65]:
tvs = doc2.find_all('td',{'class':'titleColumn'})

In [66]:
len(tvs)

250

There are 250 top TV shows listed on the page, and I have found the enclosing tag for each one. Next, I will get each show's title from the `a` tag nested inside the `td` tag.

## Extract TV show titles and release years from the page

To extract each data element, I will loop through all the TV shows, find the HTML element with the specified tag and class, and read its text.

In [67]:
title =[] 
for tv in tvs:
    title.append(tv.find('a').text)
title[:10]

['Planet Earth II',
 'Breaking Bad',
 'Planet Earth',
 'Band of Brothers',
 'Chernobyl',
 'The Wire',
 'Avatar: The Last Airbender',
 'Blue Planet II',
 'The Sopranos',
 'Cosmos: A Spacetime Odyssey']

I got the titles of all 250 TV shows from this page.

In [68]:
release_years = [tv.find('span', {'class':'secondaryInfo'}).text for tv in tvs]
release_years[:10]

['(2016)',
 '(2008)',
 '(2006)',
 '(2001)',
 '(2019)',
 '(2002)',
 '(2005)',
 '(2017)',
 '(1999)',
 '(2014)']

In [69]:
tv_urls = []
# No trailing slash here, because each href already starts with '/'
base_url = 'https://www.imdb.com'
for tv in tvs:
    tv_urls.append(base_url + tv.find('a')['href'])
tv_urls[:10]

['https://www.imdb.com/title/tt5491994/',
 'https://www.imdb.com/title/tt0903747/',
 'https://www.imdb.com/title/tt0795176/',
 'https://www.imdb.com/title/tt0185906/',
 'https://www.imdb.com/title/tt7366338/',
 'https://www.imdb.com/title/tt0306414/',
 'https://www.imdb.com/title/tt0417299/',
 'https://www.imdb.com/title/tt6769208/',
 'https://www.imdb.com/title/tt0141842/',
 'https://www.imdb.com/title/tt2395695/']
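
String concatenation produces a clean URL only when exactly one slash ends up between the base and the path. As a possible alternative, the standard library's `urljoin` resolves a relative `href` against a base URL and takes care of the slashes. The sketch below uses two `href` values from the page as sample input.

```python
from urllib.parse import urljoin

# Sample relative links as they appear in the 'href' attribute
sample_hrefs = ['/title/tt5491994/', '/title/tt0903747/']

# urljoin gives the same result whether or not the base ends with a slash
urls_with_slash = [urljoin('https://www.imdb.com/', href) for href in sample_hrefs]
urls_without_slash = [urljoin('https://www.imdb.com', href) for href in sample_hrefs]

print(urls_with_slash)
print(urls_without_slash)
```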

Now I have extracted some information from this page.

## Compile extracted information into python lists and dictionaries

Now I will store all the extracted data in lists.

In [70]:
titles = [tv.find('a').text for tv in tvs]
release_years = [tv.find('span', {'class':'secondaryInfo'}).text for tv in tvs]
tv_urls = [tv.find('a')['href'] for tv in tvs]

I will now create a dictionary from the extracted data.

In [71]:
tv_dict = {
    'Title' : titles,
    'Release year' : release_years,
    'Tv_URL' : tv_urls
   }

## Extract and combine data from the page

In [72]:
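def get_page_content(tv_url):
    # Download the page and parse it with the built-in HTML parser
    response = requests.get(tv_url, headers={"Accept-Language": "en-US"})
    return BeautifulSoup(response.text, "html.parser")

def parse_tv_show(tvs):
    # Build one list per column from the TV show cells
    titles = [tv.find('a').text for tv in tvs]
    release_years = [tv.find('span', {'class':'secondaryInfo'}).text for tv in tvs]
    tv_urls = [tv.find('a')['href'] for tv in tvs]
    # Return a new dictionary instead of relying on the global tv_dict
    return {
        'Title' : titles,
        'Release year' : release_years,
        'Tv_URL' : tv_urls
    }

doc2 = get_page_content(tv_url)
tvs = doc2.find_all('td', {'class':'titleColumn'})
tv_dict = parse_tv_show(tvs)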

Now I can use the pandas library to view the contents of the dictionary. The pandas library was already installed and imported in the steps above.

In [73]:
toptvshow_df = pd.DataFrame(tv_dict)

In [74]:
toptvshow_df

Unnamed: 0,Title,Release year,Tv_URL
0,Planet Earth II,(2016),/title/tt5491994/
1,Breaking Bad,(2008),/title/tt0903747/
2,Planet Earth,(2006),/title/tt0795176/
3,Band of Brothers,(2001),/title/tt0185906/
4,Chernobyl,(2019),/title/tt7366338/
...,...,...,...
245,Alfred Hitchcock Presents,(1955),/title/tt0047708/
246,Southland,(2009),/title/tt1299368/
247,Foyle's War,(2002),/title/tt0310455/
248,Alchemy of Souls,(2022),/title/tt20859920/


The DataFrame shows the data extracted from this page.

## Save the extracted information to a CSV file

I can now save the extracted data from the step above to a CSV file.

In [75]:
toptvshow_df.to_csv('tvshow.csv', index=False)

In [76]:
import jovian

In [77]:
jovian.commit(Project = "web-scraping-project")

<IPython.core.display.Javascript object>

[jovian] Updating notebook "aaarchisingh18/web-scraping-project" on https://jovian.com
[jovian] Committed successfully! https://jovian.com/aaarchisingh18/web-scraping-project


'https://jovian.com/aaarchisingh18/web-scraping-project'

# Summary



Let us look at the steps we took from start to finish:

 1. We downloaded the web pages using `requests`.

 2. We parsed the HTML source code using the BeautifulSoup library.
 
 3. We extracted detailed information for each top-rated movie and TV show, such as:
 
   * Title
   * Release year
   * Audience rating (movies)
   * Genre (movies)
   * Runtime (movies)
   * URL (TV shows)
  
 4. We compiled these details into Python dictionaries.
 
 5. We converted the dictionaries into pandas DataFrames.
 
 6. We saved the DataFrames as CSV files.
  

## Future work


As future work, we can explore this data further to extract meaningful information from it. Further analysis could answer questions such as:

  * Which year has the most top-rated TV shows?
  * Which year has the most top-rated movies to date?
  * Which director has directed the most top-rated movies?
  
  and the list goes on...
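
As a starting point for the per-year questions, counting titles by release year can be sketched with `collections.Counter`. The sample below uses only the release years of the first ten movies shown earlier, already converted to integers, so the result says nothing about the full list.

```python
from collections import Counter

# Release years of the first ten movies shown earlier, as integers
sample_release_years = [1994, 1972, 2023, 2008, 1993, 1957, 2003, 1974, 1994, 2010]

# Count how many titles fall in each year
year_counts = Counter(sample_release_years)

# most_common(1) returns a list with the single most frequent (year, count) pair
top_year, top_count = year_counts.most_common(1)[0]
print(top_year, top_count)
```

On the full dataset, the same idea could be applied to the `Release year` column after cleaning it into numbers.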



## References


 1. Python official documentation https://docs.python.org/3/
 2. Requests library https://pypi.org/project/requests/
 3. Beautiful Soup documentation https://www.crummy.com/software/BeautifulSoup/bs4/doc/
 4. Aakash N S, Introduction to Web Scraping https://jovian.com/aakashns/python-web-scraping-and-rest-api 
 5. Pandas library documentation https://pandas.pydata.org/docs/
 6. IMDB Website for movies https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating
 7. IMDB Website for tv show https://www.imdb.com/chart/toptv/?ref_=nv_mp_tv250
 8. Web Scraping info https://www.toptal.com/python/web-scraping-with-python
 9. Medium page https://medium.com/
 10. Working with Jupyter Notebook https://towardsdatascience.com/write-markdown-latex-in-the-jupyter-notebook-10985edb91fd

