## Everything, Everywhere, All at Once:
### A study in finding the hidden stories in chaotic datasets
Brian and I were tasked by some big movie executives with helping them decide what kind of movie to work on next, based on IMDb data. Here are our results.

### Rough Questions to Answer
##### How do genre trends change over time?
##### How do actors affect rating counts and average ratings?
##### How popular are movies, broken down by vote count?
##### Who are the top actors by genre?

First, we need to import our libraries and load the data into DataFrames.

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
basics_df = pd.read_csv('Data/title.basics.tsv', sep='\t')
ratings_df = pd.read_csv('Data/title.ratings.tsv', sep='\t')

In [4]:
ratings_df.head()

,tconst,averageRating,numVotes
0,tt0000001,5.7,2064
1,tt0000002,5.6,279
2,tt0000003,6.5,2038
3,tt0000004,5.4,180
4,tt0000005,6.2,2799


The basics and ratings tables share the `tconst` column, so we can merge them on it.

In [5]:
basics_ratings_df = basics_df.merge(ratings_df, how='inner', on='tconst')
basics_ratings_df.head()

,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short",5.7,2064
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short",5.6,279
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,5,"Action,Adventure,Animation",6.5,2038
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short",5.4,180
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short",6.2,2799
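Before relying on an inner join, it can help to check how well the keys in the two tables line up. The sketch below uses two small made-up tables (the rows and the names `basics`, `ratings`, and `check` are illustrative assumptions, not the real IMDb data) to show how the `validate` and `indicator` options of `merge` might be used for that check.

```python
import pandas as pd

# Toy stand-ins for the two IMDb tables (values are illustrative only).
basics = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3'],
    'primaryTitle': ['A', 'B', 'C'],
})
ratings = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt4'],
    'averageRating': [5.7, 5.6, 6.5],
})

# validate='one_to_one' raises a MergeError if tconst repeats on either side.
# how='outer' with indicator=True adds a _merge column saying where each row came from.
check = basics.merge(ratings, how='outer', on='tconst',
                     validate='one_to_one', indicator=True)
merge_counts = check['_merge'].value_counts().to_dict()

# The inner join used in the notebook keeps only the rows labelled 'both'.
inner = basics.merge(ratings, how='inner', on='tconst')
```

Here `merge_counts` reports how many titles appear in both tables and how many appear in only one, which is the number of rows an inner join would silently drop.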


### Cleaning the data for easier analysis and looking for the story

First, our clients are only looking for data on movies, so we can remove all other rows.

In [6]:
basics_ratings_df.titleType.unique()

array(['short', 'movie', 'tvShort', 'tvMovie', 'tvEpisode', 'tvSeries',
       'tvMiniSeries', 'tvSpecial', 'video', 'videoGame'], dtype=object)

In [7]:
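# .copy() makes movies_df an independent DataFrame, so later column assignments are unambiguous.
movies_df = basics_ratings_df[basics_ratings_df.titleType == 'movie'].copy()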

Then we can check the column types and correct them as needed.

In [8]:
movies_df.describe()

,averageRating,numVotes
count,314509.0,314509.0
mean,6.167837,3621.551
std,1.360726,36231.8
min,1.0,5.0
25%,5.3,19.0
50%,6.3,61.0
75%,7.1,313.0
max,10.0,2920364.0


In [9]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 314509 entries, 8 to 1463604
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          314509 non-null  object 
 1   titleType       314509 non-null  object 
 2   primaryTitle    314509 non-null  object 
 3   originalTitle   314509 non-null  object 
 4   isAdult         314509 non-null  object 
 5   startYear       314509 non-null  object 
 6   endYear         314509 non-null  object 
 7   runtimeMinutes  314509 non-null  object 
 8   genres          314509 non-null  object 
 9   averageRating   314509 non-null  float64
 10  numVotes        314509 non-null  int64  
dtypes: float64(1), int64(1), object(9)
memory usage: 28.8+ MB
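Most columns show up as `object` because the files mark missing values with the literal string `\N`, which forces pandas to keep the whole column as text. One possible alternative, sketched here on a small in-memory sample (the rows and the names `sample_tsv`, `sample_df`, and `complete_df` are made up for illustration), is to declare that marker when reading the file so that it is parsed as `NaN`.

```python
import io
import pandas as pd

# A tiny in-memory sample in the same tab-separated layout (made-up rows).
sample_tsv = (
    "tconst\tstartYear\truntimeMinutes\tgenres\n"
    "tt1\t1894\t45\tRomance\n"
    "tt2\t\\N\t100\tDocumentary,News\n"
    "tt3\t1906\t\\N\t\\N\n"
)

# na_values tells pandas to read the literal string \N as NaN.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep='\t', na_values='\\N')

# Dropping incomplete rows is then a single call.
complete_df = sample_df.dropna(subset=['startYear', 'runtimeMinutes', 'genres'])
```

With this approach the year and runtime columns come back as numeric (floats, because of the `NaN` values), so they would still need a cast to integers after the incomplete rows are dropped.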


In [10]:
def year_to_decade(x: str) -> str:
    # Replace the last digit of a four-digit year with 0, e.g. '1894' -> '1890'.
    return x[0:3] + '0'

In [11]:
def obj_to_int(x: str) -> int:
    # Convert a numeric string such as '45' to an integer.
    return int(x)

In [18]:
def genre_list(x: str) -> list:
    # Split a comma-separated genre string into a list of genre names.
    return x.split(',')
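The three helpers above are applied row by row with `.apply`. As a rough sketch of an alternative, the same results can usually be obtained with vectorized pandas operations. The Series below are small made-up examples, not the real columns.

```python
import pandas as pd

# Made-up year strings standing in for the startYear column.
years = pd.Series(['1894', '1906', '2019'])

# pd.to_numeric converts the whole column at once (same result as obj_to_int).
years_int = pd.to_numeric(years)

# Integer arithmetic gives the decade; cast to str to match year_to_decade.
decades = ((years_int // 10) * 10).astype(str)

# Series.str.split does what genre_list does, element by element.
genres = pd.Series(['Action,Adventure', 'Drama']).str.split(',')
```

The vectorized forms tend to be faster on large tables, but the `.apply` versions are equally valid for a dataset of this size.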

In [50]:
# Drop rows where IMDb's missing-value marker '\N' appears in a column we need.
# averageRating and numVotes are already numeric, so they need no '\N' filter.
movies_df = movies_df[movies_df.runtimeMinutes != '\\N']
movies_df = movies_df[movies_df.startYear != '\\N']
movies_df = movies_df[movies_df.genres != '\\N']
# Convert the string columns to integers and derive the decade label.
movies_df.runtimeMinutes = movies_df.runtimeMinutes.apply(obj_to_int)
movies_df['startYearInt'] = movies_df.startYear.apply(obj_to_int)
movies_df['decade'] = movies_df.startYear.apply(year_to_decade)

In [16]:
movies_df.groupby('genres').size()

genres
Action                    957
Action,Adult                2
Action,Adult,Adventure      1
Action,Adult,Comedy         3
Action,Adult,Crime          1
                         ... 
Thriller,War,Western        1
Thriller,Western            8
War                        90
War,Western                 7
Western                   959
Length: 1085, dtype: int64

In [28]:
# Drop titles longer than 300 minutes and titles with fewer than 100 votes.
movies_df = movies_df[~(movies_df.runtimeMinutes > 300)]
movies_df = movies_df[movies_df.numVotes > 99]

In [27]:
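# Assumed definition (the cell that created genres_df is not shown): keep tconst and genres only.
genres_df = movies_df[['tconst', 'genres']].copy()
genres_df.genres = genres_df.genres.apply(genre_list)
genres_df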

,tconst,genres
8,tt0000009,[Romance]
144,tt0000147,"[Documentary, News, Sport]"
373,tt0000574,"[Action, Adventure, Biography]"
961,tt0001892,[Drama]
979,tt0001964,[Drama]
...,...,...
1463526,tt9914942,[Drama]
1463528,tt9914972,[Documentary]
1463567,tt9916190,"[Action, Adventure, Thriller]"
1463574,tt9916270,[Thriller]


In [34]:
genres_df = genres_df.genres.apply(pd.Series) \
    .merge(genres_df, right_index = True, left_index = True) \
    .drop(['genres'], axis = 1) \
    .melt(id_vars = ['tconst'], value_name = 'genre') \
    .drop("variable", axis = 1) \
    .dropna()
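The chain above turns each list of genres into one row per (title, genre) pair. A shorter way to get the same long format, assuming a pandas version that provides `DataFrame.explode` (0.25 or later), is sketched below on a made-up two-row table. The names `toy` and `long_df` are illustrative only.

```python
import pandas as pd

# Toy table in the same shape as genres_df after genre_list has been applied.
toy = pd.DataFrame({
    'tconst': ['tt1', 'tt2'],
    'genres': [['Romance'], ['Documentary', 'News', 'Sport']],
})

# explode() emits one row per list element and repeats the other columns.
long_df = (
    toy.explode('genres')
       .rename(columns={'genres': 'genre'})
       .reset_index(drop=True)
)
```

The rows come out grouped by title rather than in the order that `melt` produces, so the index values would not match the output shown in this notebook, but the set of (tconst, genre) pairs should be the same.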

In [37]:
genres_df.sort_values('tconst')

,tconst,genre
0,tt0000009,Romance
256295,tt0000147,Sport
128148,tt0000147,News
1,tt0000147,Documentary
128149,tt0000574,Adventure
...,...,...
128144,tt9916190,Action
384438,tt9916190,Thriller
128145,tt9916270,Thriller
128146,tt9916362,Drama


In [40]:
genres_df.groupby('genre').count()

genre,tconst
Action,17393
Adult,815
Adventure,10204
Animation,3038
Biography,4805
Comedy,39544
Crime,15900
Documentary,11120
Drama,65964
Family,5745


In [43]:
genres_df = genres_df[~(genres_df.genre.isin(['Talk-Show', 'Reality-TV', 'News', 'Music', 'History', 'Game-Show']))]

In [47]:
movies_genres_df = genres_df.merge(movies_df, how='left', on='tconst')

In [48]:
movies_genres_df

,tconst,genre,primaryTitle,startYear,runtimeMinutes,averageRating,numVotes,startYearInt,decade
0,tt0000009,Romance,Miss Jerry,1894,45,5.4,212,1894,1890
1,tt0000147,Documentary,The Corbett-Fitzsimmons Fight,1897,100,5.2,520,1897,1890
2,tt0000574,Action,The Story of the Kelly Gang,1906,70,6.0,917,1906,1900
3,tt0001892,Drama,Den sorte drøm,1911,53,5.8,270,1911,1910
4,tt0001964,Drama,The Traitress,1911,48,6.0,102,1911,1910
...,...,...,...,...,...,...,...,...,...
251625,tt9900782,Drama,Kaithi,2019,145,8.4,43342,2019,2010
251626,tt9900940,Thriller,Scrapper,2021,87,4.4,1462,2021,2020
251627,tt9904802,War,Enemy Lines,2020,92,4.6,1962,2020,2020
251628,tt9907782,Mystery,The Cursed,2021,111,6.2,17743,2021,2020
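With one row per title and genre, the first of our rough questions (how genre trends change over time) comes down to a group-by on decade and genre. The following is only a sketch on a few made-up rows. The column names match `movies_genres_df`, but `toy`, `trend`, `n_titles`, and `mean_rating` are names chosen for illustration.

```python
import pandas as pd

# A few made-up rows with the same columns we need from movies_genres_df.
toy = pd.DataFrame({
    'genre': ['Drama', 'Drama', 'Action', 'Drama'],
    'decade': ['1910', '1910', '2010', '2010'],
    'averageRating': [5.8, 6.0, 7.0, 8.4],
})

# Count titles and average their ratings for every (decade, genre) pair.
trend = (
    toy.groupby(['decade', 'genre'])
       .agg(n_titles=('averageRating', 'size'),
            mean_rating=('averageRating', 'mean'))
       .reset_index()
)
```

A table shaped like `trend` could then be handed to a plotting library to draw one line per genre across decades.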


In [44]:
movies_df = movies_df.drop(['endYear', 'originalTitle', 'genres', 'titleType', 'isAdult'], axis=1)

In [45]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 128147 entries, 8 to 1463578
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          128147 non-null  object 
 1   primaryTitle    128147 non-null  object 
 2   startYear       128147 non-null  object 
 3   runtimeMinutes  128147 non-null  int64  
 4   averageRating   128147 non-null  float64
 5   numVotes        128147 non-null  int64  
 6   startYearInt    128147 non-null  int64  
 7   decade          128147 non-null  object 
dtypes: float64(1), int64(3), object(4)
memory usage: 12.8+ MB
