# Game of Thrones IMDb Scraping

Part of the code is adapted from Isabella Benabaye's amazing blog post on web scraping: https://isabella-b.com/blog/scraping-episode-imdb-ratings-tutorial/

Per-episode data is scraped from IMDb.
Viewership data is scraped from Wikipedia.

# Importing the necessary libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

First, we'll do a dry run on a single season, then wrap the same steps in a loop that accumulates the results for every season.

In [2]:
URL = 'https://www.imdb.com/title/tt0944947/episodes?season=1'
response = requests.get(URL)
print(response.status_code)

200
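
Only the `season` query parameter changes from one page to the next, so the URL can be built by a small helper. This is a minimal sketch; `BASE_URL` and `season_url` are names I made up for illustration, and nothing here touches the network.

```python
from urllib.parse import urlencode

# Episode-list page for Game of Thrones, as used in the cell above
BASE_URL = 'https://www.imdb.com/title/tt0944947/episodes'

def season_url(season, base_url=BASE_URL):
    """Return the episode-list URL for one season (hypothetical helper)."""
    return base_url + '?' + urlencode({'season': season})

# One URL per season; the range depends on the show
urls = [season_url(sn) for sn in range(1, 9)]
print(urls[0])
```

Each of these URLs can then be passed to `requests.get()` exactly as in the dry run.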


In [5]:
soup = BeautifulSoup(response.text,'html.parser')
soup


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Game of Thrones - Season 1 - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.imdb.com/title/tt0944947/episodes?season=1" rel="canonical"/>
<meta content="http://www.imdb.com/title/tt0944947/episodes?season=1" property="og:url">
<scrip

In [6]:
#Getting the div container that holds the episodes of a given season
container = soup.find('div',class_='list detail eplist')

In [68]:
#Inspecting to obtain the first episode's image link
img = soup.find_all('div',class_='image')[0]
img

<div class="image">
<a href="/title/tt1480055/" itemprop="url" title="Winter Is Coming"> <div class="hover-over-image zero-z-index" data-const="tt1480055">
<img alt="Winter Is Coming" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMmVhODQ1NmUtMzJiYi00MGNiLWExNmQtYmUxNGJmY2U5ZmJlXkEyXkFqcGdeQXVyNjAwNDUxODI@._V1_UY126_UX224_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>

In [74]:
img.img['src']

'https://m.media-amazon.com/images/M/MV5BMmVhODQ1NmUtMzJiYi00MGNiLWExNmQtYmUxNGJmY2U5ZmJlXkEyXkFqcGdeQXVyNjAwNDUxODI@._V1_UY126_UX224_AL_.jpg'

In [11]:
#Once we have the container, get the episodes one by one from it (here, Season 1)
first_ep = container.find_all('div')[0]
first_ep

<div class="list_item odd">
<div class="image">
<a href="/title/tt1480055/" itemprop="url" title="Winter Is Coming"> <div class="hover-over-image zero-z-index" data-const="tt1480055">
<img alt="Winter Is Coming" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMmVhODQ1NmUtMzJiYi00MGNiLWExNmQtYmUxNGJmY2U5ZmJlXkEyXkFqcGdeQXVyNjAwNDUxODI@._V1_UY126_UX224_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="info" itemprop="episodes" itemscope="" itemtype="http://schema.org/TVEpisode">
<meta content="1" itemprop="episodeNumber"/>
<div class="airdate">
            18 Apr. 2011
    </div>
<strong><a href="/title/tt1480055/" itemprop="name" title="Winter Is Coming">Winter Is Coming</a></strong>
<div class="ipl-rating-widget">
<div class="ipl-rating-star small">
<span class="ipl-rating-star__star">
<svg class="ipl-icon ipl-star-icon" fill="#000000" height="24" viewbox="0 0 24 24" width="24" xmlns="http://www.w3.org/2000/svg">
<path d="M0 0h24v24

In [20]:
#Extracting the first episode's title
first_ep.a['title']

'Winter Is Coming'

In [26]:
#The airdate of the first episode
first_ep.find('div',class_='airdate').text.strip()

'18 Apr. 2011'
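
The airdate comes back as a plain string, and the abbreviated month sometimes carries a period ('18 Apr. 2011') and sometimes does not ('2 May 2011'). If a real date is needed for sorting or plotting, something like the sketch below could work. `parse_airdate` and `MONTHS` are hypothetical names, and comparing only the first three letters of the month is my assumption about how IMDb abbreviates.

```python
from datetime import date

# Map three-letter month prefixes to month numbers.
# Assumption: every month on the page starts with its usual English prefix.
MONTHS = {name: number for number, name in enumerate(
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
     'jul', 'aug', 'sep', 'oct', 'nov', 'dec'], start=1)}

def parse_airdate(text):
    """Convert a string such as '18 Apr. 2011' to a datetime.date (hypothetical helper)."""
    day, month, year = text.split()
    return date(int(year), MONTHS[month[:3].lower()], int(day))

print(parse_airdate('18 Apr. 2011'))
print(parse_airdate('2 May 2011'))
```

Using a lookup table instead of `strptime` keeps the parsing independent of the machine's locale.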

In [34]:
#IMDb Rating for the episode
float(first_ep.find('span', class_='ipl-rating-star__rating').text)

8.9

In [56]:
#Number of Users who rated this episode
users = first_ep.find('span', class_='ipl-rating-star__total-votes').text
users

'(51,300)'

In [57]:
#Experimenting with cleaning the value to get a whole number
users = int(users.replace('(', '').replace(')', '').replace(',', ''))
print(users)

51300
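
The chained `replace` calls work, but the same cleanup can be written as a single regular expression that drops everything except digits. A minimal sketch, with `parse_vote_count` being a name I picked for illustration:

```python
import re

def parse_vote_count(text):
    """Return the digits in a string such as '(51,300)' as an int (hypothetical helper)."""
    digits = re.sub(r'\D', '', text)  # remove every non-digit character
    if not digits:
        raise ValueError(f'no digits found in {text!r}')
    return int(digits)

print(parse_vote_count('(51,300)'))
```

Raising on an empty result makes a changed page layout fail loudly instead of silently producing a wrong number.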


In [42]:
#The episode description
first_ep.find('div', class_='item_description').text.strip()

'Eddard Stark is torn between his family and an old friend when asked to serve at the side of King Robert Baratheon; Viserys plans to wed his sister to a nomadic warlord in exchange for an army.'

# Getting the viewership data

I figured that adding viewership data for each episode would make for a more interesting analysis.

In [77]:
url = 'https://en.wikipedia.org/wiki/Game_of_Thrones'
response = requests.get(url)
soup = BeautifulSoup(response.text,'html.parser')

In [78]:
#Getting the table
table = soup.find_all('table',class_='wikitable')[-1]
table

<table class="wikitable" style="text-align:center"><tbody><tr><th colspan="2" rowspan="2" style="padding-left:.8em;padding-right:.8em">Season</th><th colspan="10" style="padding-left:.8em;padding-right:.8em">Episode number</th><th rowspan="2" scope="col" style="padding-left:.8em;padding-right:.8em">Average</th></tr><tr><th scope="col">1</th><th scope="col">2</th><th scope="col">3</th><th scope="col">4</th><th scope="col">5</th><th scope="col">6</th><th scope="col">7</th><th scope="col">8</th><th scope="col">9</th><th scope="col">10</th></tr><tr><th style="background-color:#295354;width:10px"></th><th scope="row">1</th><td style="width:35px">2.22</td><td style="width:35px">2.20</td><td style="width:35px">2.44</td><td style="width:35px">2.45</td><td style="width:35px">2.58</td><td style="width:35px">2.44</td><td style="width:35px">2.40</td><td style="width:35px">2.72</td><td style="width:35px">2.66</td><td style="width:35px">3.04</td><td>2.52</td></tr><tr><th style="background-color:#D09

In [86]:
#Skip the first two rows: they are header rows ('Season / Episode number / Average' and the episode numbers 1-10)
rows = table.find_all('tr')[2:]

In [99]:
#Getting individual viewership data for each episode in a given season
rows[-1].find_all('td')

[<td style="width:35px">11.76</td>,
 <td style="width:35px">10.29</td>,
 <td style="width:35px">12.02</td>,
 <td style="width:35px">11.80</td>,
 <td style="width:35px">12.48</td>,
 <td style="width:35px">13.61</td>,
 <td class="table-na" colspan="4" data-sort-value="" style="background: #ececec; color: #2C2C2C; vertical-align: middle; text-align: center;">–</td>,
 <td>11.99</td>]

In [102]:
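#Putting it in a loop so that we can build a dataframe from it later
viewership_df = []
for i,row in enumerate(rows):
    #Short seasons fill the missing episodes with one 'table-na' <td> (see the output above)
    na_cell = row.find('td',class_='table-na')
    #find() returns None (not an empty list) when nothing matches
    if na_cell is None:
        #Full season: drop only the trailing 'Average' cell
        data_row = row.find_all('td')[:-1]
    else:
        #Short season: drop the 'table-na' cell and the 'Average' cell
        data_row = row.find_all('td')[:-2]
    for j,data in enumerate(data_row):
        viewership_data = (i+1,j+1,float(data.text))
        viewership_df.append(viewership_data)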
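
An alternative to slicing off a fixed number of cells is to drop the trailing average and then skip any cell that does not parse as a number. The sketch below works on plain lists of cell strings (for instance `[td.text for td in row.find_all('td')]`) so it can be tried without the page. `rows_to_records` and its input shape are my own illustration, not part of the original loop.

```python
def rows_to_records(rows):
    """Flatten per-season cell texts into (season, episode, viewers) tuples.

    `rows` is a list of lists of cell strings, one inner list per season,
    with the season average as the last cell (hypothetical input shape).
    """
    records = []
    for season, cells in enumerate(rows, start=1):
        # cells[:-1] drops the trailing average
        for episode, cell in enumerate(cells[:-1], start=1):
            try:
                viewers = float(cell)
            except ValueError:
                continue  # placeholder such as '–' for episodes that do not exist
            records.append((season, episode, viewers))
    return records

# The last row shown above, as plain strings
last_row = ['11.76', '10.29', '12.02', '11.80', '12.48', '13.61', '–', '11.99']
print(rows_to_records([last_row]))
```

This assumes placeholders only ever appear after the real episodes, which is the case in the row shown above.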

# Putting it all together

In [75]:
# Initializing the list that the loop will populate
got_episodes = []

# For every season in the series (the range depends on the show; Game of Thrones has 8)
for sn in range(1,9):
    # Request the season's page with requests.get() and store the server’s response in the variable response
    response = requests.get('https://www.imdb.com/title/tt0944947/episodes?season=' + str(sn))

    # Parse the content of the request with BeautifulSoup
    page_html = BeautifulSoup(response.text, 'html.parser')

    # Select all the episode containers from the season's page
    episode_containers = page_html.find_all('div', class_ = 'info')
    imgs_list = page_html.find_all('div',class_='image')

    # For each episode in each season
    for i,episode in enumerate(episode_containers):
        # Get the info of each episode on the page
        season = sn
        episode_number = int(episode.meta['content'])
        title = episode.a['title']
        airdate = episode.find('div', class_='airdate').text.strip()
        rating = float(episode.find('span', class_='ipl-rating-star__rating').text)
        total_votes = episode.find('span', class_='ipl-rating-star__total-votes').text
        total_votes = int(total_votes.replace('(', '').replace(')', '').replace(',', ''))
        desc = episode.find('div', class_='item_description').text.strip()
        # The i-th image div belongs to the i-th episode container
        img_url = imgs_list[i].img['src']

        # Compiling the episode info
        episode_data = [season, episode_number, img_url, title, airdate, rating, total_votes, desc]

        # Append the episode info to the complete dataset
        got_episodes.append(episode_data)
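
One thing to watch in a loop like this: `find()` returns `None` when an element is missing, so calling `.text` on the result raises an `AttributeError`. Whether any Game of Thrones episode page actually lacks a rating or description is an assumption on my part, but a guard is cheap. A minimal sketch with two hypothetical helpers, using `SimpleNamespace` as a stand-in for a BeautifulSoup tag:

```python
from types import SimpleNamespace

def text_or_none(node):
    """Return node.text stripped, or None when the lookup found nothing."""
    return node.text.strip() if node is not None else None

def to_float_or_none(text):
    """Convert text to float, passing None (or an empty string) through as None."""
    return float(text) if text else None

# Stand-ins for what find() returns: an object with a .text attribute, or None
found = SimpleNamespace(text=' 8.9 ')
missing = None

print(to_float_or_none(text_or_none(found)))
print(to_float_or_none(text_or_none(missing)))
```

In the loop, this would wrap each `find()` call, and pandas would then store the missing values as `NaN`/`None`.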

# Making the Dataframe

In [76]:
got_episodes = pd.DataFrame(got_episodes, columns = ['season', 'episode_number', 'image_url', 'title', 'airdate', 'rating', 'total_votes', 'desc'])

got_episodes.head()


| | season | episode_number | image_url | title | airdate | rating | total_votes | desc |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | https://m.media-amazon.com/images/M/MV5BMmVhOD... | Winter Is Coming | 18 Apr. 2011 | 8.9 | 51300 | Eddard Stark is torn between his family and an... |
| 1 | 1 | 2 | https://m.media-amazon.com/images/M/MV5BYzhhOT... | The Kingsroad | 25 Apr. 2011 | 8.6 | 38774 | While Bran recovers from his fall, Ned takes o... |
| 2 | 1 | 3 | https://m.media-amazon.com/images/M/MV5BZjcwNz... | Lord Snow | 2 May 2011 | 8.5 | 36680 | Jon begins his training with the Night's Watch... |
| 3 | 1 | 4 | https://m.media-amazon.com/images/M/MV5BMDc0ZT... | Cripples, Bastards, and Broken Things | 9 May 2011 | 8.6 | 34891 | Eddard investigates Jon Arryn's murder. Jon be... |
| 4 | 1 | 5 | https://m.media-amazon.com/images/M/MV5BZDA2ZT... | The Wolf and the Lion | 16 May 2011 | 9.0 | 36262 | Catelyn has captured Tyrion and plans to bring... |


In [106]:
viewership_df = pd.DataFrame(viewership_df,columns=['season','episode_number','total_viewers'])

viewership_df.head()

| | season | episode_number | total_viewers |
|---|---|---|---|
| 0 | 1 | 1 | 2.22 |
| 1 | 1 | 2 | 2.20 |
| 2 | 1 | 3 | 2.44 |
| 3 | 1 | 4 | 2.45 |
| 4 | 1 | 5 | 2.58 |
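
Since both dataframes share `season` and `episode_number`, they can be joined on those two columns for analysis. This is a sketch on two tiny frames built from the first rows shown above, not on the full scraped data. The names `episodes`, `viewers` and `combined` are mine.

```python
import pandas as pd

# Two rows per frame, copied from the previews above (only a subset of columns)
episodes = pd.DataFrame({
    'season': [1, 1],
    'episode_number': [1, 2],
    'title': ['Winter Is Coming', 'The Kingsroad'],
    'rating': [8.9, 8.6],
})
viewers = pd.DataFrame({
    'season': [1, 1],
    'episode_number': [1, 2],
    'total_viewers': [2.22, 2.20],
})

# Left join keeps every episode even if its viewership figure is missing;
# validate raises if a (season, episode_number) pair is duplicated on either side
combined = episodes.merge(
    viewers,
    on=['season', 'episode_number'],
    how='left',
    validate='one_to_one',
)
print(combined)
```

With the real data the same call would be `got_episodes.merge(viewership_df, ...)`.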


# Save to CSV on disk

In [107]:
got_episodes.to_csv('episode-list.csv',index=False)
viewership_df.to_csv('viewership_data.csv',index=False)