# Scraping Popular TV Shows From `IMDB` Using Python 

![Banner-Image](https://i.imgur.com/GUwEsFh.png)

IMDb is an online database of information related to films, television series, home videos, video games, and streaming content online, including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. For example, this [link](https://www.imdb.com/chart/tvmeter/?ref_=nv_tvv_mptv) lists the most popular TV shows on IMDb.

The pages https://www.imdb.com/chart/tvmeter/?ref_=nv_tvv_mptv and https://www.imdb.com/chart/toptv/?ref_=nv_tvv_250 list the `Most Popular TV Shows` and `Top Rated TV Shows` on IMDb. In this project, we will retrieve information from these pages using web scraping, the process of collecting structured web data in an automated fashion (also known as web data extraction). We'll use the Python libraries [Requests](https://pypi.org/project/requests/) and [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to scrape data from these pages.

The outline of this assignment is listed below:

1. Download the webpage using requests. 
2. Parse the HTML source code using beautiful soup 
3. Extract `Show_Name`, `Released_Date`, `Imdb_Rating` and `Show_URLs` from these pages
4. Compile extracted information using Python lists and dictionaries
5. Extract and combine data from multiple pages
6. Save the extracted information to a CSV file.

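The CSV files that will be created have the following format:

```
Show_Name,Released_Date,Imdb_Rating,Show_URLs
The Night Agent,(2023),7.6,['https://www.imdb.com/title/tt13918776/']
The Mandalorian,(2019),8.7,['https://www.imdb.com/title/tt8111088/']
```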

You can execute the code using the "Run on Binder" button at the top of this page. You can make changes and save your own version of the notebook to [Jovian](https://www.jovian.com/) by executing the following cells:

In [1]:
import jovian

# Downloading Web Page Using requests


In [2]:
!pip install requests --upgrade --quiet

In [3]:
import requests

To download a page, we can use the get function from requests, which returns the response object.



In [4]:
topic_url = 'https://www.imdb.com/chart/tvmeter/?ref_=nv_tvv_mptv'

In [5]:
response = requests.get(topic_url)

`requests.get` returns a response object containing the data from the webpage and some other information.

The `.status_code` property can be used to check if the request was successful. A successful response will have an HTTP status code between 200 and 299.

In [6]:
response.status_code

200

In [7]:
type(response)

requests.models.Response
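The success check on the status code described above can be sketched as a small helper. This is only an illustration and not part of the original notebook: `is_success` is a hypothetical name and the sample codes are arbitrary. In practice, `requests` also provides `response.ok` and `response.raise_for_status()` for similar checks.

```python
def is_success(status_code):
    """Return True for any 2xx HTTP status code (hypothetical helper)."""
    return 200 <= status_code <= 299


# Try the helper on a few arbitrary status codes
for code in (200, 204, 301, 404, 503):
    print(code, is_success(code))
```

Only the 2xx codes in the sample (200 and 204) are reported as successful.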

Let us check the number of characters in the webpage.

In [8]:
page_contents = response.text

In [9]:
len(page_contents)

356298

The webpage contains over 350,000 characters. Here are the first 500 characters of the page:

In [10]:
page_contents[:500]

'\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n\n        <meta charset="utf-8">\n\n    \n    \n    \n\n    \n    \n    \n\n            <style>\n                body#styleguide-v2 {\n                    background: no-repeat fixed center top #000;\n                }\n            </style>\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'fu'

The above code is the [HTML source code](https://en.wikipedia.org/wiki/HTML#:~:text=The%20HyperText%20Markup%20Language%20or,scripting%20languages%20such%20as%20JavaScript.) of the webpage. We can also save it to a file and view locally within Jupyter using "File -> Open".

In [11]:
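# Write with an explicit encoding so the result does not depend on the platform default
with open("most_popular_tv_shows.html", "w", encoding="utf-8") as f:
    f.write(page_contents)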

The saved page looks similar to the original, but none of the links will work in this local copy.

![most-popular-tv-shows page](https://i.imgur.com/s9JYGSM.png)

In this section, we have successfully used the requests library to download a webpage as HTML.
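Saving a page and reading it back can be sketched with an explicit encoding. The example below assumes UTF-8, which matches the `<meta charset="utf-8">` tag visible in the page source; the sample HTML string and the temporary file name are made up for illustration.

```python
import os
import tempfile

# Made-up HTML standing in for the downloaded page contents
sample_html = '<html><head><meta charset="utf-8"><title>Café demo</title></head></html>'

# Write to a temporary directory so the example leaves the working folder untouched
path = os.path.join(tempfile.mkdtemp(), "sample_page.html")

with open(path, "w", encoding="utf-8") as f:
    f.write(sample_html)

with open(path, "r", encoding="utf-8") as f:
    reloaded = f.read()

# The text read back should match what was written
print(reloaded == sample_html)
```

Passing the same `encoding` on both sides avoids relying on the platform default, which can differ between operating systems.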

# Parse Information From HTML Using `Beautiful Soup`


We can use the `BeautifulSoup` class from the `bs4` library to parse the HTML code which was obtained using the `requests` library.

The library can be installed using `pip`.

In [12]:
!pip install beautifulsoup4 --upgrade --quiet

In [13]:
from bs4 import BeautifulSoup

In [14]:
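# Read with an explicit encoding so the result does not depend on the platform default
with open("most_popular_tv_shows.html", "r", encoding="utf-8") as f:
    html_source = f.read()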

In [15]:
html_source[:500]

'\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n\n        <meta charset="utf-8">\n\n    \n    \n    \n\n    \n    \n    \n\n            <style>\n                body#styleguide-v2 {\n                    background: no-repeat fixed center top #000;\n                }\n            </style>\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'fu'

To parse the HTML contents of a webpage, we can pass them to the `BeautifulSoup` class, which returns a `BeautifulSoup` object.

In [16]:
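# Name the parser explicitly so results do not depend on which parsers are installed
doc = BeautifulSoup(html_source, 'html.parser')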

In [17]:
type(doc)

bs4.BeautifulSoup

Here is the title of the web page:


In [18]:
title_tag = doc.title

In [19]:
title_tag

<title>Most Popular TV - IMDb</title>

In [20]:
title_tag.text

'Most Popular TV - IMDb'

In [21]:
table_row_tag = doc.find_all("tr")

In [22]:
len(table_row_tag)

101
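To see how `find_all("tr")` behaves, here is a minimal sketch on a tiny, made-up table. The markup is a simplified assumption modelled on the class names used later in this notebook, not the real IMDB page. It suggests why the row count can be one higher than the number of shows: the header row is also a `tr`.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the chart table
html = """
<html><head><title>Demo chart</title></head><body>
<table>
  <tr><th>Rank &amp; Title</th><th>IMDb Rating</th></tr>
  <tr><td class="titleColumn"><a href="/title/tt0000001/">Show A</a>
      <span class="secondaryInfo">(2023)</span></td>
      <td class="ratingColumn imdbRating"><strong>8.1</strong></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt0000002/">Show B</a>
      <span class="secondaryInfo">(2021)</span></td>
      <td class="ratingColumn imdbRating"></td></tr>
</table>
</body></html>
"""

demo = BeautifulSoup(html, "html.parser")
rows = demo.find_all("tr")

print(demo.title.text)  # Demo chart
print(len(rows))        # 3: one header row plus two data rows

# Skip the header row, then read the first show's name
data_rows = rows[1:]
first_name = data_rows[0].find("td", class_="titleColumn").find("a").text
print(first_name)       # Show A
```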

## Extract Show Name, Released Date and IMDB Rating from the webpages

To find the required information, we need to navigate the tags in the HTML source code.

Let's see which tags of the HTML code contain the Show Name, Year Of Release and IMDB Rating.

![Source-Code](https://i.imgur.com/qTB4rDd.png)

### Finding Show Name

In [23]:
a_tag = table_row_tag[10]

In [24]:
name_tag = a_tag.find('td', class_ = 'titleColumn')

Here is a screenshot of the browser's inspect-element view. The `Show Name` sits in an `a` tag, which is a child of the `td` tag with class `titleColumn`.

![Tags for parsing Show Name](https://i.imgur.com/N5b1c3h.png)

In [25]:
show_names = name_tag.find('a')

In [26]:
show_name = show_names.text

Here is the Show Name

In [27]:
show_name

'Daisy Jones & The Six'

### Finding Year Of Release

In [28]:
year_tag = a_tag.find('span', class_ = 'secondaryInfo')

Here is a screenshot of the inspect-element view. The `Released Date` sits in a `span` tag with class `secondaryInfo`, which is a child of the `td` tag with class `titleColumn`.
![](https://i.imgur.com/nVSFFfu.png)

In [29]:
year_of_release = year_tag.text

Here is the Released Date of the show

In [30]:
year_of_release

'(2023)'
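The extracted year is a string wrapped in parentheses. If a numeric year is needed later, one possible approach is sketched below. `parse_year` is a hypothetical helper and not part of the original notebook.

```python
import re


def parse_year(text):
    """Return the first four-digit year found in text as an int, or None."""
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None


print(parse_year("(2023)"))  # 2023
print(parse_year(""))        # None
```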

### Parsing IMDB Rating

In [31]:
rating_tag = a_tag.find('td', class_ = 'ratingColumn imdbRating')

Here is a screenshot of the inspect-element view. The `IMDB Rating` sits in a `strong` tag, which is a child of the `td` tag with class `ratingColumn imdbRating`.

![](https://i.imgur.com/6uXFrSV.png)

In [32]:
imdb_ratings = rating_tag.find('strong')

In [33]:
imdb_rating = imdb_ratings.text

Here is the IMDB Rating of the Show

In [34]:
imdb_rating

'8.1'

### Parsing Show Link

In [35]:
link_tag = doc.find_all('td', class_ = 'titleColumn')

In [36]:
shows_url = []

for link in link_tag:
    show_link = link.find('a')
    href_link = "https://www.imdb.com" + show_link['href']
    shows_url.append(href_link)
    
shows_url[:2]

['https://www.imdb.com/title/tt13918776/',
 'https://www.imdb.com/title/tt8111088/']
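The links in the page are relative paths, which is why the site's base address is prepended. As an alternative to string concatenation, the standard library's `urljoin` can resolve a relative `href` against the page URL. This is a sketch with a sample path taken from the output above.

```python
from urllib.parse import urljoin

page_url = "https://www.imdb.com/chart/tvmeter/"
relative_href = "/title/tt13918776/"  # sample relative link from the page

# urljoin resolves the relative path against the page URL
full_url = urljoin(page_url, relative_href)
print(full_url)  # https://www.imdb.com/title/tt13918776/
```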

## Defining A Helper Function To Download The Web Page And Returning Beautiful Soup

In [37]:
def get_topic_page(topic):
    #construct the url
    topic_url = 'https://www.imdb.com/chart/' + topic
    
    #Get the HTML page content using requests
    response = requests.get(topic_url)
    
    #Ensure that the response is valid
    if response.status_code != 200:
        print('Status Code:', response.status_code)
        raise Exception('Failed to fetch the web page ' + topic_url)
    
    #construct a beautiful soup document with an explicitly named parser
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

Now we will call the function `get_topic_page`, passing the chart path as the argument, and verify the title of the webpage.

In [38]:
doc = get_topic_page('tvmeter/?ref_=nv_tvv_mptv')

In [39]:
doc.title.text

'Most Popular TV - IMDb'

In [40]:
doc2 = get_topic_page('toptv/?ref_=nv_tvv_250')

In [41]:
doc2.title.text

'IMDb Top 250 TV - IMDb'

## Function To Parse Show Details Like Show Name, Year Of Release, IMDB Rating Using Table Row Tag

In [42]:
def parse_show_details(table_row_tag):
    #Show Name
    name_tag = table_row_tag.find('td', class_ = 'titleColumn')
    show_names = name_tag.find('a')
    show_name = show_names.text
    
    #Released Date
    year_tag = table_row_tag.find('span', class_ = 'secondaryInfo')
    year_of_release = year_tag.text
    
    #IMDB Rating
    rating_tag = table_row_tag.find('td', class_ = 'ratingColumn imdbRating')
    imdb_ratings = rating_tag.find('strong')
    
    if imdb_ratings is not None:
        imdb_rating = imdb_ratings.text
    else:
        imdb_rating = "N/A"
        
    #Show URLs (a row has a single title cell, so this list holds one link)
    link_tag = table_row_tag.find_all('td', class_ = 'titleColumn')
    
    shows_url = []

    for link in link_tag:
        show_link = link.find('a')
        href_link = "https://www.imdb.com" + show_link['href']
        shows_url.append(href_link)
    
    
    return {
        'Show_Name' : show_name,
        'Released_Date' : year_of_release,
        'Imdb_Rating' : imdb_rating,
        'Show_URLs': shows_url
    
    }
    

We can now use the function `parse_show_details` to extract information in the form of dictionaries.

In [43]:
parse_show_details(table_row_tag[1])

{'Show_Name': 'The Night Agent',
 'Released_Date': '(2023)',
 'Imdb_Rating': '7.6',
 'Show_URLs': ['https://www.imdb.com/title/tt13918776/']}
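The ratings in these dictionaries are strings, and shows without a rating get the placeholder `"N/A"`. If numeric ratings are needed, for example for sorting, a conversion along the following lines could be used. `to_rating` is a hypothetical helper, not part of the original notebook.

```python
def to_rating(value):
    """Convert a rating string such as '7.6' to a float; return None for 'N/A'."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


show = {'Show_Name': 'The Night Agent', 'Released_Date': '(2023)', 'Imdb_Rating': '7.6'}

print(to_rating(show['Imdb_Rating']))  # 7.6
print(to_rating('N/A'))                # None
```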

Here we keep only the top 60 shows (rows 1 to 60 of `table_row_tag`).

In [44]:
top_shows = [parse_show_details(tag) for tag in table_row_tag[1:61]]

In [45]:
len(top_shows)

60
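The slicing used here can be illustrated with a plain list. The sketch assumes, as the row count suggests, that the first row is a header followed by 100 show rows. The list contents are placeholders.

```python
# Placeholder rows: one header followed by 100 shows
rows = ['header'] + [f'show {rank}' for rank in range(1, 101)]

top_60 = rows[1:61]     # skip index 0, keep indices 1 to 60
print(len(top_60))      # 60
print(top_60[0])        # show 1
print(top_60[-1])       # show 60

# A slice starting at 0 begins with the first show, one starting at 1 with the second
print(top_60[:10][0])   # show 1
print(top_60[1:11][0])  # show 2
```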

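Here are ten of the most popular TV shows on IMDB. Note that the slice `top_shows[1:11]` starts at the second entry, so the output begins with the show ranked second: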

In [46]:
top_shows[1:11]

[{'Show_Name': 'The Mandalorian',
  'Released_Date': '(2019)',
  'Imdb_Rating': '8.7',
  'Show_URLs': ['https://www.imdb.com/title/tt8111088/']},
 {'Show_Name': 'Succession',
  'Released_Date': '(2018)',
  'Imdb_Rating': '8.8',
  'Show_URLs': ['https://www.imdb.com/title/tt7660850/']},
 {'Show_Name': 'The Last of Us',
  'Released_Date': '(2023)',
  'Imdb_Rating': '8.9',
  'Show_URLs': ['https://www.imdb.com/title/tt3581920/']},
 {'Show_Name': 'Ted Lasso',
  'Released_Date': '(2020)',
  'Imdb_Rating': '8.8',
  'Show_URLs': ['https://www.imdb.com/title/tt10986410/']},
 {'Show_Name': 'Yellowjackets',
  'Released_Date': '(2021)',
  'Imdb_Rating': '7.9',
  'Show_URLs': ['https://www.imdb.com/title/tt11041332/']},
 {'Show_Name': 'Beef',
  'Released_Date': '(2023)',
  'Imdb_Rating': '8.4',
  'Show_URLs': ['https://www.imdb.com/title/tt14403178/']},
 {'Show_Name': 'Ahsoka',
  'Released_Date': '(2023)',
  'Imdb_Rating': 'N/A',
  'Show_URLs': ['https://www.imdb.com/title/tt13622776/']},
 {'Show_

## Function That Takes A Beautiful Soup Object And Returns A List Of Dictionaries

In [47]:
def get_top_tvShows(doc):
    table_row_tag = doc.find_all("tr")
    top_popular_shows = [parse_show_details(tag) for tag in table_row_tag[1:61]]
    return top_popular_shows

We can now use the functions we've defined to get the top 60 [`Most Popular TV Shows`](https://www.imdb.com/chart/tvmeter/?ref_=nv_tvv_mptv) and [`Top Rated TV Shows`](https://www.imdb.com/chart/toptv/?ref_=nv_tvv_250) on IMDB.

Here are the top 5 Most Popular TV Shows on IMDB:

In [48]:
topic_page_movies = get_topic_page('tvmeter/?ref_=nv_tvv_mptv')
most_popular_tvShows = get_top_tvShows(topic_page_movies)
most_popular_tvShows[:5]

[{'Show_Name': 'The Night Agent',
  'Released_Date': '(2023)',
  'Imdb_Rating': '7.6',
  'Show_URLs': ['https://www.imdb.com/title/tt13918776/']},
 {'Show_Name': 'The Mandalorian',
  'Released_Date': '(2019)',
  'Imdb_Rating': '8.7',
  'Show_URLs': ['https://www.imdb.com/title/tt8111088/']},
 {'Show_Name': 'Succession',
  'Released_Date': '(2018)',
  'Imdb_Rating': '8.8',
  'Show_URLs': ['https://www.imdb.com/title/tt7660850/']},
 {'Show_Name': 'The Last of Us',
  'Released_Date': '(2023)',
  'Imdb_Rating': '8.9',
  'Show_URLs': ['https://www.imdb.com/title/tt3581920/']},
 {'Show_Name': 'Ted Lasso',
  'Released_Date': '(2020)',
  'Imdb_Rating': '8.8',
  'Show_URLs': ['https://www.imdb.com/title/tt10986410/']}]

Here are the top 5 Top Rated TV Shows on IMDB:

In [49]:
topic_page_shows = get_topic_page('toptv/?ref_=nv_tvv_250')
top_tvshows = get_top_tvShows(topic_page_shows)
top_tvshows[:5]

[{'Show_Name': 'Planet Earth II',
  'Released_Date': '(2016)',
  'Imdb_Rating': '9.4',
  'Show_URLs': ['https://www.imdb.com/title/tt5491994/']},
 {'Show_Name': 'Breaking Bad',
  'Released_Date': '(2008)',
  'Imdb_Rating': '9.4',
  'Show_URLs': ['https://www.imdb.com/title/tt0903747/']},
 {'Show_Name': 'Planet Earth',
  'Released_Date': '(2006)',
  'Imdb_Rating': '9.4',
  'Show_URLs': ['https://www.imdb.com/title/tt0795176/']},
 {'Show_Name': 'Band of Brothers',
  'Released_Date': '(2001)',
  'Imdb_Rating': '9.4',
  'Show_URLs': ['https://www.imdb.com/title/tt0185906/']},
 {'Show_Name': 'Chernobyl',
  'Released_Date': '(2019)',
  'Imdb_Rating': '9.3',
  'Show_URLs': ['https://www.imdb.com/title/tt7366338/']}]
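The two result lists can also be combined into a single list. The sketch below is one possible way to do this, using a few rows copied from the outputs above. The `combine_charts` helper and the `Chart` key are hypothetical additions for illustration.

```python
# Sample rows copied from the outputs above (URLs omitted for brevity)
most_popular_sample = [
    {'Show_Name': 'The Night Agent', 'Released_Date': '(2023)', 'Imdb_Rating': '7.6'},
    {'Show_Name': 'The Mandalorian', 'Released_Date': '(2019)', 'Imdb_Rating': '8.7'},
]
top_rated_sample = [
    {'Show_Name': 'Planet Earth II', 'Released_Date': '(2016)', 'Imdb_Rating': '9.4'},
    {'Show_Name': 'Breaking Bad', 'Released_Date': '(2008)', 'Imdb_Rating': '9.4'},
]


def combine_charts(charts):
    """Flatten {chart_name: [show_dict, ...]} into one list, tagging each row."""
    combined = []
    for chart_name, shows in charts.items():
        for show in shows:
            row = dict(show)           # copy so the input dictionaries stay unchanged
            row['Chart'] = chart_name  # hypothetical extra column naming the source chart
            combined.append(row)
    return combined


combined = combine_charts({
    'Most Popular': most_popular_sample,
    'Top Rated': top_rated_sample,
})
print(len(combined))  # 4
print(combined[0])
```

Tagging each row with its source chart keeps the origin visible after the lists are merged.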

In [50]:
jovian.commit()

[jovian] Updating notebook "ankitkumar22may/web-scraping-project" on https://jovian.com
[jovian] Committed successfully! https://jovian.com/ankitkumar22may/web-scraping-project


'https://jovian.com/ankitkumar22may/web-scraping-project'

## Writing Information To CSV Files

Let's create a helper function which takes a list of dictionaries and writes them to a CSV file.

In [51]:
import csv

def write_csv(items, path):
    # Return early if there is nothing to write
    if len(items) == 0:
        return

    # Use the keys of the first item as the header row
    headers = list(items[0].keys())

    # newline='' lets the csv module control the line endings
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, lineterminator='\n')
        writer.writerow(headers)

        # Write one item per line; values containing commas are quoted automatically
        for item in items:
            writer.writerow([str(item.get(header, '')) for header in headers])
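When writing CSV rows, a value that itself contains a comma needs to be quoted, otherwise it is read back as two columns. The sketch below compares a plain `','.join` with the standard library's `csv.writer` on a made-up row. The show title is hypothetical.

```python
import csv
import io

# Made-up row whose title contains a comma
row = ['Hypothetical Show, Part Two', '(2023)', '7.0']

# Joining by hand produces four comma-separated pieces instead of three
naive = ','.join(row)
print(len(naive.split(',')))  # 4

# csv.writer quotes the title so the row survives a round trip
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerow(row)
quoted = buffer.getvalue().strip()
print(quoted)

parsed = next(csv.reader([quoted]))
print(parsed == row)  # True
```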

Let's write the data stored in `most_popular_tvShows` into a CSV file.

In [52]:
write_csv(most_popular_tvShows,'most_popular_tvShows.csv')

We can now read the file and inspect its contents. The contents of the file can also be inspected using the "File > Open" menu option within Jupyter.

In [53]:
with open('most_popular_tvShows.csv','r') as f:
    # print the contents; a bare f.read() inside the with block displays nothing
    print(f.read())

We've created a CSV containing the information about the top `Most Popular TV Shows` on IMDB.

Now that we have a CSV file, we can use the pandas library to view its contents and see the list of the top 60 `Most Popular TV Shows` on IMDB.

In [54]:
import pandas as pd

In [55]:
most_popular_tvShows_df  = pd.read_csv('most_popular_tvShows.csv')

In [56]:
clean_url = lambda url: url.strip("[]'")
most_popular_tvShows_df['Show_URLs'] = most_popular_tvShows_df['Show_URLs'].apply(clean_url)
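Each `Show_URLs` value was written to the CSV as the text form of a one-element Python list, which is why the brackets and quotes have to be stripped. The sketch below shows what `strip("[]'")` does to one such value and, as an alternative, how `ast.literal_eval` can turn the text back into a real list. The sample string is taken from the data above.

```python
import ast

# Text form of a one-element list, as stored in the CSV
raw = "['https://www.imdb.com/title/tt13918776/']"

# strip removes any of the characters [ ] ' from both ends of the string
stripped = raw.strip("[]'")
print(stripped)  # https://www.imdb.com/title/tt13918776/

# literal_eval safely parses the text back into a Python list
parsed = ast.literal_eval(raw)
print(parsed[0] == stripped)  # True
```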

In [57]:
most_popular_tvShows_df

| | Show_Name | Released_Date | Imdb_Rating | Show_URLs |
|---|---|---|---|---|
| 0 | The Night Agent | (2023) | 7.6 | https://www.imdb.com/title/tt13918776/ |
| 1 | The Mandalorian | (2019) | 8.7 | https://www.imdb.com/title/tt8111088/ |
| 2 | Succession | (2018) | 8.8 | https://www.imdb.com/title/tt7660850/ |
| 3 | The Last of Us | (2023) | 8.9 | https://www.imdb.com/title/tt3581920/ |
| 4 | Ted Lasso | (2020) | 8.8 | https://www.imdb.com/title/tt10986410/ |
| 5 | Yellowjackets | (2021) | 7.9 | https://www.imdb.com/title/tt11041332/ |
| 6 | Beef | (2023) | 8.4 | https://www.imdb.com/title/tt14403178/ |
| 7 | Ahsoka | (2023) | | https://www.imdb.com/title/tt13622776/ |
| 8 | Unstable | (2023) | 6.8 | https://www.imdb.com/title/tt19394168/ |
| 9 | Daisy Jones & The Six | (2023) | 8.1 | https://www.imdb.com/title/tt8749198/ |


Now we can also create a CSV file for the `Top Rated TV Shows` on IMDB.

In [58]:
write_csv(top_tvshows,'top_tvshows.csv')

In [59]:
with open('top_tvshows.csv','r') as f:
    # print the contents; a bare f.read() inside the with block displays nothing
    print(f.read())

We've created a CSV containing the information about the Top Rated TV Shows on IMDB.

Now that we have a CSV file, we can use the pandas library to view its contents and see the list of the top 60 `Top Rated TV Shows` on IMDB.

In [60]:
top_tvshows_df = pd.read_csv('top_tvshows.csv')

In [61]:
top_tvshows_df['Show_URLs'] = top_tvshows_df['Show_URLs'].apply(clean_url)

In [62]:
top_tvshows_df

| | Show_Name | Released_Date | Imdb_Rating | Show_URLs |
|---|---|---|---|---|
| 0 | Planet Earth II | (2016) | 9.4 | https://www.imdb.com/title/tt5491994/ |
| 1 | Breaking Bad | (2008) | 9.4 | https://www.imdb.com/title/tt0903747/ |
| 2 | Planet Earth | (2006) | 9.4 | https://www.imdb.com/title/tt0795176/ |
| 3 | Band of Brothers | (2001) | 9.4 | https://www.imdb.com/title/tt0185906/ |
| 4 | Chernobyl | (2019) | 9.3 | https://www.imdb.com/title/tt7366338/ |
| 5 | The Wire | (2002) | 9.3 | https://www.imdb.com/title/tt0306414/ |
| 6 | Avatar: The Last Airbender | (2005) | 9.2 | https://www.imdb.com/title/tt0417299/ |
| 7 | Blue Planet II | (2017) | 9.2 | https://www.imdb.com/title/tt6769208/ |
| 8 | The Sopranos | (1999) | 9.2 | https://www.imdb.com/title/tt0141842/ |
| 9 | Cosmos: A Spacetime Odyssey | (2014) | 9.2 | https://www.imdb.com/title/tt2395695/ |



# Summary

Here is what we covered in this notebook:

1. Download the webpage using `requests`
2. Parse the HTML source code using `beautiful soup`
3. Extract Show Name, Released Date, IMDB Ratings and Shows URLs from webpage
4. Compile extracted information using Python lists and dictionaries
5. Save the extracted information to a CSV file.

The CSV files we created have the following format:

```
Show_Name,Released_Date,Imdb_Rating,Show_URLs
The Night Agent,(2023),7.6,['https://www.imdb.com/title/tt13918776/']
The Mandalorian,(2019),8.7,['https://www.imdb.com/title/tt8111088/']
........
```

# Future Work

• We can fetch details about the movies, actors and actresses that won Oscars each year.  
• We can fetch the IMDb top movies as rated by regular IMDb voters.  

# References

1. https://requests.readthedocs.io/en/latest/
2. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
3. https://www.imdb.com/
4. https://stackoverflow.com/questions/2136267/beautiful-soup-and-extracting-a-div-and-its-contents-by-id
