# Peaky Blinders IMDb Ratings 

This project was inspired by the recent trend of IMDb ratings charts on r/dataisbeautiful, started by <a href="https://www.reddit.com/r/dataisbeautiful/comments/fw4iv0/oc_rating_of_simpsons_episodes_according_to_imdb/">u/Hbenne</a>, who was in turn inspired by <a href="https://www.espinof.com/animacion/guia-de-supervivencia-de-los-simpson-el-momento-definitivo-en-que-la-serie-comenzo-a-fallar">espinof.com</a>. Other users built theirs with <a href="https://www.reddit.com/r/dataisbeautiful/comments/fwjces/oc_the_absolute_quality_of_breaking_bad/fmonigx/?utm_source=share&utm_medium=web2x">Excel</a>, <a href="https://www.reddit.com/r/dataisbeautiful/comments/gb9lxw/oc_interactive_dashboard_with_imdb_ratings_for/fp4hs8e?utm_source=share&utm_medium=web2x">Tableau</a>, and <a href="https://www.reddit.com/r/dataisbeautiful/comments/fwqkpc/oc_rating_of_rick_and_morty_episodes_according_to/">graphics</a> <a href="https://www.reddit.com/r/dataisbeautiful/comments/fxe0hy/oc_steven_universe_imdb_scores/">editors</a>. I'm using <a href="https://www.imdb.com/">IMDb</a> because it describes itself as "the world's most popular and authoritative source for movie, TV and celebrity content".

I will cross-reference my final result with this <a href="https://www.reddit.com/r/dataisbeautiful/comments/gb9lxw/oc_interactive_dashboard_with_imdb_ratings_for/">interactive Tableau table</a>.

### Thought Process

In [3]:
# Install IMDbPY to retrieve data from IMDb.
# From Command Prompt (Run as Administrator), run:
#     pip install imdbpy

In [11]:
# Call to use IMDbPY
from imdb import IMDb

# Create an instance
ia = IMDb()

In [10]:
# The URL of a movie/series page looks like this:
# https://www.imdb.com/title/tt2442560/
# Take the digits after 'tt' and pass them to get_movie as a string
peaky_blinders = ia.get_movie('2442560')
peaky_blinders

<Movie id:2442560[http] title:_"Peaky Blinders" (2013)_>
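
Copying the digits out of the URL by hand can be automated. The sketch below is a minimal example, assuming the URL follows the `/title/tt<digits>/` pattern shown above; `imdb_id_from_url` is a hypothetical helper written for illustration, not part of IMDbPY.

```python
import re


def imdb_id_from_url(url):
    """Return the digits after 'tt' in an IMDb title URL (hypothetical helper)."""
    match = re.search(r"/title/tt(\d+)", url)
    if match is None:
        raise ValueError(f"no IMDb title ID found in {url!r}")
    return match.group(1)


series_id = imdb_id_from_url("https://www.imdb.com/title/tt2442560/")
print(series_id)
```

The returned string can then be passed to `ia.get_movie(series_id)` in place of the hand-typed ID.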

It turns out I didn't even need IMDbPY, because IMDb publishes its episode ratings for download on its <a href="https://www.imdb.com/interfaces/">website</a>.
    
title.ratings.tsv.gz – Contains the IMDb rating and votes information for titles
* tconst (string) - alphanumeric unique identifier of the title
* averageRating – weighted average of all the individual user ratings
* numVotes - number of votes the title has received

After unzipping the title.ratings.tsv.gz file, I got a data.tsv file.

In [77]:
# Import pandas so I can use read_csv
import pandas as pd

In [89]:
# Read in data.tsv from ratings 
ratings_tsv = pd.read_csv('title.ratings.tsv/data.tsv')
ratings_tsv

Unnamed: 0,tconst\taverageRating\tnumVotes
0,tt0000001\t5.6\t1611
1,tt0000002\t6.0\t198
2,tt0000003\t6.5\t1289
3,tt0000004\t6.1\t121
4,tt0000005\t6.1\t2058
...,...
1034689,tt9916576\t6.0\t9
1034690,tt9916578\t8.4\t17
1034691,tt9916720\t5.6\t50
1034692,tt9916766\t6.8\t13


In [91]:
# sep='\t' tells pandas the columns are separated by tabs
ratings_tsv = pd.read_csv('title.ratings.tsv/data.tsv', sep='\t')
ratings_tsv

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.6,1611
1,tt0000002,6.0,198
2,tt0000003,6.5,1289
3,tt0000004,6.1,121
4,tt0000005,6.1,2058
...,...,...,...
1034689,tt9916576,6.0,9
1034690,tt9916578,8.4,17
1034691,tt9916720,5.6,50
1034692,tt9916766,6.8,13
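
The manual unzip step is optional, since pandas can read a gzip-compressed file directly. The sketch below writes a tiny stand-in file to a temporary directory (two rows copied from the ratings output above) so that it runs without the real download; the temporary path is only for illustration.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny stand-in for title.ratings.tsv.gz so the example runs anywhere.
sample = "tconst\taverageRating\tnumVotes\ntt0000001\t5.6\t1611\ntt0000002\t6.0\t198\n"
path = os.path.join(tempfile.mkdtemp(), "title.ratings.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as handle:
    handle.write(sample)

# pandas infers gzip from the .gz extension; it is spelled out here for clarity.
ratings = pd.read_csv(path, sep="\t", compression="gzip")
print(ratings)
```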


In [93]:
# Read in data.tsv from episodes
episodes_tsv = pd.read_csv('title.episode.tsv/data.tsv', sep='\t')
episodes_tsv

Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber
0,tt0041951,tt0041038,1,9
1,tt0043693,tt0989125,2,8
2,tt0045519,tt0989125,4,11
3,tt0046135,tt0989125,4,5
4,tt0046150,tt0341798,\N,\N
...,...,...,...,...
2094625,tt9916826,tt1289683,3,10
2094626,tt9916830,tt1828066,2,6
2094627,tt9916836,tt1289683,3,14
2094628,tt9916842,tt1289683,3,16
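
The `\N` entries in the output are IMDb's placeholder for missing values, and leaving them in makes pandas treat the season and episode columns as text. A minimal sketch of one way to handle this, assuming it is acceptable to treat `\N` as NaN, is to pass `na_values` to `read_csv`. The two sample rows are copied from the output above.

```python
import io

import pandas as pd

# Two rows copied from the episode output; the second has no season or episode.
# The backslash is doubled so Python keeps it as a literal backslash followed by N.
sample = (
    "tconst\tparentTconst\tseasonNumber\tepisodeNumber\n"
    "tt0041951\ttt0041038\t1\t9\n"
    "tt0046150\ttt0341798\t\\N\t\\N\n"
)

episodes = pd.read_csv(io.StringIO(sample), sep="\t", na_values="\\N")
print(episodes.dtypes)
print(episodes["seasonNumber"].isna().sum())
```

With the placeholders converted, the two columns load as numbers and can be sorted or compared without string handling.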


In [94]:
# Combine episode and ratings data.tsv, joining on the common column tconst
# The default how='inner' keeps only rows whose tconst appears in both df,
# which gives a shorter list with episodes missing
df_merged = pd.merge(episodes_tsv, ratings_tsv, on='tconst')

df_merged

Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber,averageRating,numVotes
0,tt0041951,tt0041038,1,9,7.4,49
1,tt0046150,tt0341798,\N,\N,8.0,8
2,tt0048302,tt0047768,1,6,6.2,26
3,tt0048462,tt0047702,1,3,7.2,11
4,tt0048562,tt0047768,1,10,6.9,134
...,...,...,...,...,...,...
202889,tt9916204,tt3501074,5,20,8.2,174
202890,tt9916316,tt6579344,4,6,8.4,5
202891,tt9916420,tt1318007,21,1,7.1,7
202892,tt9916576,tt2152112,7,10,6.0,9


In [95]:
# Combine episode and ratings data.tsv, joining on the common column tconst
# how='outer' is a full outer join: it keeps rows found in only one of the two
# files, so rated titles with no episode entry (NaN parentTconst) survive too
# Yup, a whole 2,723,536 more rows using outer merge
df_merged = pd.merge(episodes_tsv, ratings_tsv, on='tconst', how='outer')

df_merged

Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber,averageRating,numVotes
0,tt0041951,tt0041038,1,9,7.4,49.0
1,tt0043693,tt0989125,2,8,,
2,tt0045519,tt0989125,4,11,,
3,tt0046135,tt0989125,4,5,,
4,tt0046150,tt0341798,\N,\N,8.0,8.0
...,...,...,...,...,...,...
2926425,tt9916538,,,,8.4,5.0
2926426,tt9916544,,,,7.2,19.0
2926427,tt9916720,,,,5.6,50.0
2926428,tt9916766,,,,6.8,13.0
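
To see which side each row of the outer merge came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses three titles taken from the outputs above: `tt0041951` is in both files, `tt0043693` only in the episode file, and `tt9916766` only in the ratings file.

```python
import pandas as pd

# Rows taken from the outputs above; only tt0041951 appears in both tables.
episodes = pd.DataFrame({
    "tconst": ["tt0041951", "tt0043693"],
    "parentTconst": ["tt0041038", "tt0989125"],
})
ratings = pd.DataFrame({
    "tconst": ["tt0041951", "tt9916766"],
    "averageRating": [7.4, 6.8],
    "numVotes": [49, 13],
})

# _merge is 'both', 'left_only' (episodes only) or 'right_only' (ratings only).
merged = pd.merge(episodes, ratings, on="tconst", how="outer", indicator=True)
print(merged["_merge"].value_counts())
```

Filtering on `_merge == 'right_only'` lists the rated titles that have no entry in the episode file, which is the situation the missing Peaky Blinders episodes are in.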


In [96]:
# What is the parentTconst of one of the missing episodes?
# Peaky Blinders Season 1 Episode 1
# https://www.imdb.com/title/tt2471500/?ref_=ttep_ep1
missing_season1episode1 = df_merged[df_merged.tconst=='tt2471500']

# Aye parentTconst is NaN, no wonder I couldn't find it :(
missing_season1episode1

Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber,averageRating,numVotes
2663872,tt2471500,,,,8.2,6454.0


In [97]:
# Copy the rows associated with Peaky Blinders
# A lot of episodes are missing though
# For example, season 1 had 6 episodes but only 3 show up here
peaky_blinders = episodes_tsv[episodes_tsv.parentTconst=='tt2442560']
peaky_blinders

Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber
330475,tt10698464,tt2442560,6,1
372695,tt10906296,tt2442560,6,3
372696,tt10906300,tt2442560,6,4
372698,tt10906306,tt2442560,6,5
372699,tt10906308,tt2442560,6,6
993297,tt2461634,tt2442560,1,4
993298,tt2461638,tt2442560,1,6
994563,tt2471506,tt2442560,1,3
1147328,tt3683572,tt2442560,2,3
1147329,tt3683574,tt2442560,2,4


In [99]:
# So I know tconst is unique to the episode, and parentTconst isn't reliable
# Going back to imdb.com/interfaces, it looks like basics.tsv has primaryTitle
title_basics_tsv = pd.read_csv('title.basics.tsv/data.tsv', sep='\t')
title_basics_tsv



Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,\N
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,\N
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,\N
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,\N
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,\N
...,...,...,...,...,...,...,...,...,...
6768523,tt9916848,tvEpisode,Episode #3.17,Episode #3.17,0,2010,\N,\N,\N
6768524,tt9916850,tvEpisode,Episode #3.19,Episode #3.19,0,2010,\N,\N,\N
6768525,tt9916852,tvEpisode,Episode #3.20,Episode #3.20,0,2010,\N,\N,\N
6768526,tt9916856,short,The Wind,The Wind,0,2015,\N,27,\N


In [101]:
# How many rows come up if I search by primaryTitle?
peaky_blinders = title_basics_tsv[title_basics_tsv.primaryTitle=='Peaky Blinders']

# Just one omg
peaky_blinders

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
3383210,tt2442560,tvSeries,Peaky Blinders,Peaky Blinders,0,2013,\N,60,\N


In [104]:
# Try to pull up every row with Peaky Blinders using another dataset
title_akas_tsv = pd.read_csv('title.akas.tsv/data.tsv', sep='\t')
title_akas_tsv

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Карменсіта,UA,\N,imdbDisplay,\N,0
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
3,tt0000001,4,Καρμενσίτα,GR,\N,imdbDisplay,\N,0
4,tt0000001,5,Карменсита,RU,\N,imdbDisplay,\N,0
...,...,...,...,...,...,...,...,...
21446195,tt9916852,4,エピソード #3.20,JP,ja,\N,\N,0
21446196,tt9916852,5,Episódio #3.20,PT,pt,\N,\N,0
21446197,tt9916852,6,Episodio #3.20,IT,it,\N,\N,0
21446198,tt9916852,7,एपिसोड #3.20,IN,hi,\N,\N,0


In [105]:
# How many rows come up if I search by title?
peaky_blinders = title_akas_tsv[title_akas_tsv.title=='Peaky Blinders']

# Huh, that wasn't helpful. Well, at least I know the titleId is consistent
peaky_blinders

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
10895588,tt2442560,11,Peaky Blinders,MX,\N,imdbDisplay,\N,0
10895590,tt2442560,13,Peaky Blinders,AR,\N,imdbDisplay,\N,0
10895592,tt2442560,15,Peaky Blinders,PL,\N,imdbDisplay,\N,0
10895594,tt2442560,17,Peaky Blinders,GB,\N,imdbDisplay,\N,0
10895598,tt2442560,20,Peaky Blinders,AU,\N,imdbDisplay,\N,0
10895601,tt2442560,23,Peaky Blinders,\N,\N,original,\N,1
10895602,tt2442560,24,Peaky Blinders,ES,\N,imdbDisplay,\N,0
10895604,tt2442560,26,Peaky Blinders,CA,en,imdbDisplay,\N,0
10895607,tt2442560,4,Peaky Blinders,US,\N,imdbDisplay,\N,0
10895608,tt2442560,5,Peaky Blinders,IN,hi,imdbDisplay,\N,0


If it comes to it, I may have to scrape the website to get the <code>tconst</code> of every missing episode. I found this <a href="https://www.chrisgiler.com/journal">walkthrough</a> that may help.