# Peaky Blinders IMDb Ratings, Part IV

Using the walkthrough from Notebook Part III, scrape the IMDb pages that hold season and episode rating information so I can make my own charts.

In [1]:
"""
Goal: scrape the IMDb webpage for a user-inputted show title.

Every show's URL looks like this
    https://www.imdb.com/title/tt2442560/
    where tt2442560 is the tconst of the show, and acts as the parentTconst
    of its children episodes, but that's not reliable enough.

1. Take user input and make a function to search for tconst based on input
2. Scrape website with BeautifulSoup using the identified tconst

"""

In [3]:
# Import pandas so I can use read_csv
import pandas as pd

In [7]:
title_basics_tsv = pd.read_csv('title.basics.tsv/data.tsv', sep='\t', 
                               # Was giving me a low_memory dtype warning
                               dtype={'startYear': str})
title_basics_tsv

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,\N
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,\N
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,\N
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,\N
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,\N
...,...,...,...,...,...,...,...,...,...
6768523,tt9916848,tvEpisode,Episode #3.17,Episode #3.17,0,2010,\N,\N,\N
6768524,tt9916850,tvEpisode,Episode #3.19,Episode #3.19,0,2010,\N,\N,\N
6768525,tt9916852,tvEpisode,Episode #3.20,Episode #3.20,0,2010,\N,\N,\N
6768526,tt9916856,short,The Wind,The Wind,0,2015,\N,27,\N


In [18]:
user_tconst = title_basics_tsv[title_basics_tsv.primaryTitle=='Peaky Blinders']
user_tconst

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
3383210,tt2442560,tvSeries,Peaky Blinders,Peaky Blinders,0,2013,\N,60,\N


In [19]:
user_tconst = user_tconst.iloc[0]['tconst']
user_tconst

'tt2442560'
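Step 1 of the plan asks for a function that looks up a tconst from a title. A minimal sketch of one way to do that is below. The function name `find_series_tconst`, the case-insensitive match, and the restriction to `tvSeries`/`tvMiniSeries` rows are my own assumptions, and `sample` is a tiny stand-in for `title_basics_tsv`, not the real file.

```python
import pandas as pd


def find_series_tconst(titles: pd.DataFrame, show_title: str) -> str:
    """Return the tconst of the first TV series whose primaryTitle matches.

    Hypothetical helper: matches case-insensitively and only considers
    rows whose titleType is a series, so shorts or episodes that share
    the same name are skipped.
    """
    is_series = titles["titleType"].isin(["tvSeries", "tvMiniSeries"])
    wanted = show_title.strip().casefold()
    same_title = titles["primaryTitle"].str.casefold() == wanted
    matches = titles[is_series & same_title]
    if matches.empty:
        # Fail with a readable message instead of an IndexError from iloc[0]
        raise LookupError(f"No TV series titled {show_title!r} found")
    return matches.iloc[0]["tconst"]


# Tiny stand-in for title_basics_tsv (illustrative rows only)
sample = pd.DataFrame({
    "tconst": ["tt0000001", "tt2442560"],
    "titleType": ["short", "tvSeries"],
    "primaryTitle": ["Carmencita", "Peaky Blinders"],
})

print(find_series_tconst(sample, "peaky blinders"))
```

With the real table this would be called as `find_series_tconst(title_basics_tsv, user_show_title)`. Taking the first match is still a guess when several series share a title.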

In [48]:
from requests import get
user_show_title = input("Enter a name of a Show: ") 
user_tconst = title_basics_tsv[title_basics_tsv.primaryTitle==user_show_title]
user_tconst = user_tconst.iloc[0]['tconst']
url = 'https://www.imdb.com/title/' + user_tconst
response = get(url)
print(url)

Enter a name of a Show: Breaking Bad
https://www.imdb.com/title/tt0903747


In [49]:
# Import Beautiful Soup to parse the webpage from IMDB
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
html_soup


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///title/tt0903747?src=mdot" name="apple-itunes-app"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Breaking Bad (TV Series 2008–2013) - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="

In [50]:
# Isolates the div with my seasons link
seasons_container = html_soup.find_all('div', class_ = 'seasons-and-year-nav')
print(type(seasons_container))
print(seasons_container)

<class 'bs4.element.ResultSet'>
[<div class="seasons-and-year-nav">
<div>
<h4 class="float-left">Seasons</h4><hr/>
</div>
<div>
<h4 class="float-left">Years</h4><hr/>
</div>
<br class="clear"/>
<div>
<a href="/title/tt0903747/episodes?season=5">5</a>  
                                                    <a href="/title/tt0903747/episodes?season=4">4</a>  
                                                    <a href="/title/tt0903747/episodes?season=3">3</a>  
                                                    <a href="/title/tt0903747/episodes?season=2">2</a>  
                                                    <a href="/title/tt0903747/episodes?season=1">1</a>  
            </div>
<div>
<a href="/title/tt0903747/episodes?year=2013">2013</a>  
                                                            <a href="/title/tt0903747/episodes?year=2012">2012</a>  
                                                            <a href="/title/tt0903747/episodes?year=2011">2011</a>  
           

In [51]:
# This pulls all the hrefs in the webpage
for a in html_soup.find_all('a', href=True):
    print('Found URL:', a['href'])

Found URL: /?ref_=nv_home
Found URL: https://www.imdb.com/calendar/?ref_=nv_mv_cal
Found URL: https://www.imdb.com/list/ls016522954/?ref_=nv_tvv_dvd
Found URL: /chart/top/?ref_=nv_mv_250
Found URL: /chart/moviemeter/?ref_=nv_mv_mpm
Found URL: https://www.imdb.com/feature/genre/?ref_=nv_ch_gr
Found URL: /chart/boxoffice/?ref_=nv_ch_cht
Found URL: https://m.imdb.com/showtimes/movie/?ref_=nv_mv_sh
Found URL: https://www.imdb.com/showtimes/?ref_=nv_mv_sh
Found URL: https://www.imdb.com/movies-in-theaters/?ref_=nv_mv_inth
Found URL: https://m.imdb.com/coming-soon/?ref_=nv_mv_cs
Found URL: https://www.imdb.com/coming-soon/?ref_=nv_mv_cs
Found URL: /news/movie/?ref_=nv_nw_mv
Found URL: /india/toprated/?ref_=nv_mv_in
Found URL: https://www.imdb.com/whats-on-tv/?ref_=nv_tv_ontv
Found URL: https://m.imdb.com/whats-on-tv/?ref_=nv_tv_ontv
Found URL: /chart/toptv/?ref_=nv_tvv_250
Found URL: /chart/tvmeter/?ref_=nv_tvv_mptv
Found URL: https://www.imdb.com/feature/genre/?ref_=nv_tv_gr
Found URL: /new

In [72]:
# Holy cow, such a quick way to pull up the <a> tags with "season" in the href
# https://stackoverflow.com/questions/38252434/beautifulsoup-to-find-a-link-that-contains-a-specific-word
seasons_URLs = html_soup.select('a[href*=season]')
seasons_URLs

[<a href="/title/tt0903747/episodes?season=5">5</a>,
 <a href="/title/tt0903747/episodes?season=4">4</a>,
 <a href="/title/tt0903747/episodes?season=3">3</a>,
 <a href="/title/tt0903747/episodes?season=2">2</a>,
 <a href="/title/tt0903747/episodes?season=1">1</a>]
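The selector above matches any link whose href contains the substring "season". A stricter option is to parse the query string and keep only links that carry a numeric `season` parameter. The sketch below uses only the standard library. `season_links` is a hypothetical helper name, and the `hrefs` list is copied from the output above rather than scraped (in the notebook it would be `[a['href'] for a in seasons_URLs]`).

```python
from urllib.parse import parse_qs, urljoin, urlparse


def season_links(hrefs, base="https://www.imdb.com"):
    """Map season number -> absolute URL for hrefs with a ?season=N query."""
    links = {}
    for href in hrefs:
        query = parse_qs(urlparse(href).query)
        season = query.get("season", [""])[0]
        if season.isdigit():
            # urljoin turns the site-relative href into a full URL
            links[int(season)] = urljoin(base, href)
    # Sort so iteration goes from season 1 upward
    return dict(sorted(links.items()))


hrefs = [
    "/title/tt0903747/episodes?season=5",
    "/title/tt0903747/episodes?season=1",
    "/title/tt0903747/episodes?year=2013",  # ignored: no season parameter
]

for number, url in season_links(hrefs).items():
    print(number, url)
```

Keeping the season number next to its URL means the later loop does not have to infer the season from the list position.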

In [None]:
"""
OKAY next steps. Make a for loop that does the following:
1. Go to the extracted webpage
2. Using episode link information, make a list with season, episode #, tconst
"""

In [54]:
type(seasons_URLs)

list

In [216]:
numSeasons = len(seasons_URLs)
numSeasons

5

In [121]:
# Find out how many episodes in a season to create dataframe

# Pull up the most recent season webpage link
season_URL = 'https://www.imdb.com' + seasons_URLs[0]['href']
print(season_URL)

https://www.imdb.com/title/tt0903747/episodes?season=5


In [122]:
# Go into that webpage
response = get(season_URL)
season_soup = BeautifulSoup(response.text, 'html.parser')

In [123]:
season_soup.find('input', {'name':'tconst'})['value']

'tt2081647'

In [125]:
# Count how many episode links there are
# This season (5) has 16 episodes; other seasons may have a different count
numEpisodes = len(season_soup.find_all('input', attrs = {'name':'tconst'}))
numEpisodes

16

In [114]:
type(len(season_soup.find_all('input', attrs = {'name':'tconst'})))

int

In [112]:
season_soup.find_all('input', attrs = {'name':'tconst'})[0]['value']

'tt2081647'

In [107]:
season_soup.find_all('input', attrs = {'name':'tconst'})

[<input name="tconst" type="hidden" value="tt2081647">
 <input name="rating" type="text" value="0"/>
 <input name="auth" type="hidden" value=""/>
 <input name="tracking_tag" type="hidden" value="ttep_ep1_rt"/>
 <input name="pageType" type="hidden" value="title"/>
 <input name="subpageType" type="hidden" value="episodes"/>
 </input>, <input name="tconst" type="hidden" value="tt2301457">
 <input name="rating" type="text" value="0"/>
 <input name="auth" type="hidden" value=""/>
 <input name="tracking_tag" type="hidden" value="ttep_ep2_rt"/>
 <input name="pageType" type="hidden" value="title"/>
 <input name="subpageType" type="hidden" value="episodes"/>
 </input>, <input name="tconst" type="hidden" value="tt2301459">
 <input name="rating" type="text" value="0"/>
 <input name="auth" type="hidden" value=""/>
 <input name="tracking_tag" type="hidden" value="ttep_ep3_rt"/>
 <input name="pageType" type="hidden" value="title"/>
 <input name="subpageType" type="hidden" value="episodes"/>
 </input

In [266]:
# Import numpy so I can use np.arange
import numpy as np

# Create a new dataframe to hold extracted info
# (sized as if every season had numEpisodes episodes, which is only an estimate)
columns=['parentTconst','tconst','seasonNumber','episodeNumber','averageRating','numVotes']
show_info_df = pd.DataFrame(0, index=np.arange(numEpisodes*numSeasons), columns=columns)

# Make sure column averageRating is a numpy Float and not integer
show_info_df['averageRating'] = pd.to_numeric(show_info_df['averageRating'], downcast='float')

show_info_df

Unnamed: 0,parentTconst,tconst,seasonNumber,episodeNumber,averageRating,numVotes
0,0,0,0,0,0.0,0
1,0,0,0,0,0.0,0
2,0,0,0,0,0.0,0
3,0,0,0,0,0.0,0
4,0,0,0,0,0.0,0
...,...,...,...,...,...,...
75,0,0,0,0,0.0,0
76,0,0,0,0,0.0,0
77,0,0,0,0,0.0,0
78,0,0,0,0,0.0,0


In [141]:
season_soup.find('span', class_='ipl-rating-star__total-votes').text

'(18,316)'

In [142]:
type(season_soup.find('span', class_='ipl-rating-star__total-votes').text)

str

In [157]:
num_votes = season_soup.find('span', class_='ipl-rating-star__total-votes').text
num_votes = num_votes.strip('()')
num_votes

'18,316'

In [150]:
# To turn strings with thousands separated by commas into floats/ints
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

'en_US.UTF-8'

In [160]:
# Convert comma-separated-thousands string into integer
num_votes = locale.atoi(num_votes)
num_votes

18316

In [212]:
# Combine all of the above....
locale.atoi(season_soup.find_all('span', class_='ipl-rating-star__total-votes')[0].text.strip('()'))

18316
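`locale.setlocale` raises `locale.Error` on machines where the `en_US.UTF-8` locale is not installed, so the conversion above depends on the environment. A locale-independent sketch is below. `parse_votes` is a hypothetical helper name, and it assumes the vote text always looks like `'(18,316)'` with commas as thousands separators.

```python
def parse_votes(text: str) -> int:
    """Convert a vote string such as '(18,316)' into the integer 18316."""
    # Drop surrounding whitespace and parentheses, then the thousands commas
    cleaned = text.strip().strip("()").replace(",", "")
    return int(cleaned)


print(parse_votes("(18,316)"))
```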

In [173]:
season_soup.find('span', class_='ipl-rating-star__rating')

<span class="ipl-rating-star__rating">9.3</span>

In [174]:
season_soup.find('span', class_='ipl-rating-star__rating').text

'9.3'

In [176]:
season_soup.find_all('span', class_='ipl-rating-star__rating')[0]

<span class="ipl-rating-star__rating">9.3</span>

In [179]:
float(season_soup.find_all('span', class_='ipl-rating-star__rating')[0].text)

9.3

In [192]:
# Combining two finds
# First find the div where rating/votes are stored
# Then find the span with the individual rating and vote number
season_soup.find('div', class_='ipl-rating-star small').find('span', class_='ipl-rating-star__rating').text

'9.3'

In [200]:
season_soup.find_all('div', class_='ipl-rating-star small')[0].find('span', class_='ipl-rating-star__rating').text

'9.3'

In [201]:
float(season_soup.find_all('div', class_='ipl-rating-star small')[0].find('span', class_='ipl-rating-star__rating').text)

9.3

In [267]:
# Extract the tconst, rating, and vote count for each episode on the season webpage
for e in range(numEpisodes):
    show_tconst = season_soup.find_all('input', attrs = {'name':'tconst'})[e]['value']
    no_episode = e + 1
    rating = float(season_soup.find_all('div', class_='ipl-rating-star small')[e].find('span', class_='ipl-rating-star__rating').text)
    num_votes = locale.atoi(season_soup.find_all('span', class_='ipl-rating-star__total-votes')[e].text.strip('()'))
    print("{}\tEp. {}\tRating {}\tVotes: {}".format(show_tconst, no_episode, rating, num_votes))

tt2081647	Ep. 1	Rating 9.3	Votes: 18316
tt2301457	Ep. 2	Rating 8.9	Votes: 15329
tt2301459	Ep. 3	Rating 8.9	Votes: 14856
tt2301461	Ep. 4	Rating 8.9	Votes: 14959
tt2301463	Ep. 5	Rating 9.7	Votes: 22332
tt2301465	Ep. 6	Rating 9.1	Votes: 15332
tt2301467	Ep. 7	Rating 9.6	Votes: 19671
tt2301469	Ep. 8	Rating 9.6	Votes: 20424
tt2301471	Ep. 9	Rating 9.5	Votes: 19889
tt2301443	Ep. 10	Rating 9.2	Votes: 17217
tt2301445	Ep. 11	Rating 9.6	Votes: 21240
tt2301447	Ep. 12	Rating 9.2	Votes: 17973
tt2301449	Ep. 13	Rating 9.8	Votes: 30376
tt2301451	Ep. 14	Rating 10.0	Votes: 114827
tt2301453	Ep. 15	Rating 9.7	Votes: 30729
tt2301455	Ep. 16	Rating 9.9	Votes: 82117
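The loop above calls `find_all` three times on every iteration. One alternative, sketched below, is to pull each list once and walk them in parallel with `zip`. To keep the example self-contained it works on plain lists of strings rather than on `season_soup`; `episode_rows` is a hypothetical helper, and the sample values are copied from the output above.

```python
def episode_rows(parent_tconst, season_number, tconsts, ratings, votes):
    """Build one dict per episode from three parallel lists of raw strings."""
    rows = []
    triples = zip(tconsts, ratings, votes)
    for episode_number, (tconst, rating, num_votes) in enumerate(triples, start=1):
        rows.append({
            "parentTconst": parent_tconst,
            "tconst": tconst,
            "seasonNumber": season_number,
            "episodeNumber": episode_number,
            "averageRating": float(rating),
            # '(18,316)' -> 18316 without relying on the system locale
            "numVotes": int(num_votes.strip("()").replace(",", "")),
        })
    return rows


rows = episode_rows(
    "tt0903747", 5,
    tconsts=["tt2081647", "tt2301457"],
    ratings=["9.3", "8.9"],
    votes=["(18,316)", "(15,329)"],
)
for row in rows:
    print(row)
```

In the notebook the three lists would come from one `find_all` call each, for example `[tag['value'] for tag in season_soup.find_all('input', attrs={'name': 'tconst'})]`. Note that `zip` stops at the shortest list, so an episode without a rating would silently shift later rows.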


In [268]:
len(show_info_df)

80

In [269]:
pd.set_option('display.precision', 4)

In [270]:
float(season_soup.find_all('div', class_='ipl-rating-star small')[0].find('span', class_='ipl-rating-star__rating').text) 

9.3

In [276]:
# Need to put information collected into dataframe show_info_df
for e in range(numEpisodes):
    show_tconst = season_soup.find_all('input', attrs = {'name':'tconst'})[e]['value']
    no_episode = e + 1    
    rating = float(season_soup.find_all('div', class_='ipl-rating-star small')[e].find('span', class_='ipl-rating-star__rating').text) 
    num_votes = int(locale.atoi(season_soup.find_all('span', class_='ipl-rating-star__total-votes')[e].text.strip('()')))
    
    show_info_df['parentTconst'][e] = user_tconst
    show_info_df['tconst'][e] = show_tconst
    show_info_df['seasonNumber'][e] = 1  # FIXME: hard-coded; season_URL above is actually season 5
    show_info_df['episodeNumber'][e] = no_episode
    show_info_df['averageRating'][e] = rating
    show_info_df['numVotes'][e] = num_votes

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  

In [278]:
show_info_df.head(20)

Unnamed: 0,parentTconst,tconst,seasonNumber,episodeNumber,averageRating,numVotes
0,tt0903747,tt2081647,1,1,9.3,18316
1,tt0903747,tt2301457,1,2,8.9,15329
2,tt0903747,tt2301459,1,3,8.9,14856
3,tt0903747,tt2301461,1,4,8.9,14959
4,tt0903747,tt2301463,1,5,9.7,22332
5,tt0903747,tt2301465,1,6,9.1,15332
6,tt0903747,tt2301467,1,7,9.6,19671
7,tt0903747,tt2301469,1,8,9.6,20424
8,tt0903747,tt2301471,1,9,9.5,19889
9,tt0903747,tt2301443,1,10,9.2,17217
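The fill loop writes with chained indexing (`show_info_df['col'][e] = ...`), which is what triggers pandas' `SettingWithCopyWarning`. One way to sidestep both the warning and the pre-allocated placeholder frame is to collect one dict per episode and build the DataFrame in a single step, using `.loc[row, column]` for any later single-cell update. The sketch below assumes that approach. The two rows are copied from the output above, except that `seasonNumber` is set to 5 to match the season page that was scraped.

```python
import pandas as pd

columns = ["parentTconst", "tconst", "seasonNumber",
           "episodeNumber", "averageRating", "numVotes"]

# One dict per episode (illustrative subset of the scraped values)
rows = [
    {"parentTconst": "tt0903747", "tconst": "tt2081647", "seasonNumber": 5,
     "episodeNumber": 1, "averageRating": 9.3, "numVotes": 18316},
    {"parentTconst": "tt0903747", "tconst": "tt2301457", "seasonNumber": 5,
     "episodeNumber": 2, "averageRating": 8.9, "numVotes": 15329},
]

# Building from records gives each column a sensible dtype right away
show_info = pd.DataFrame(rows, columns=columns)

# Single-cell update with .loc instead of chained indexing
show_info.loc[0, "numVotes"] = 18317

print(show_info)
print(show_info.dtypes)
```

Because the frame is only as long as the list of rows, there is no need to guess the total number of episodes up front.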


In [274]:
show_info_df['averageRating'][0]

9.3

In [275]:
type(show_info_df['averageRating'][0])

numpy.float32