## IMDb

To work with the most up-to-date data, I downloaded ```title.basics.tsv.gz``` from the IMDb Developer website (https://developer.imdb.com/non-commercial-datasets/) on 7 August 2023 and saved it as ```title.basics.aug23.tsv.gz```. Because the file is large, I created a smaller gzipped CSV containing only the relevant information to work with instead.

Changes made:
* titleType filtered to include only movie and tvMovie
* adult titles removed
* irrelevant columns dropped (originalTitle, isAdult, endYear)
* \N values converted to NaN, and rows containing NaN dropped
* startYear filtered to be greater than 2008
* index reset

The original ```title.basics.aug23.tsv.gz``` file has not been pushed to GitHub.
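The changes listed above can be sketched as a single function and tried on a tiny frame. This is only an illustration: `reduce_title_basics`, its `min_year` parameter, and the toy rows are hypothetical and not part of the notebook, which applies the same steps cell by cell further down.

```python
import numpy as np
import pandas as pd


def reduce_title_basics(df: pd.DataFrame, min_year: int = 2008) -> pd.DataFrame:
    """Hypothetical helper mirroring the listed changes, in the same order."""
    out = df[df["titleType"].isin(["movie", "tvMovie"])]   # keep movie / tvMovie
    out = out[out["isAdult"] == 0]                         # remove adult titles
    out = out.drop(columns=["originalTitle", "isAdult", "endYear"])
    out = out.replace(r"\N", np.nan).dropna()              # \N -> NaN, then drop
    out = out.assign(startYear=out["startYear"].astype("int64"))
    out = out[out["startYear"] > min_year]                 # strictly later than 2008
    return out.reset_index(drop=True)


# Toy data (invented for illustration, not taken from the IMDb file)
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "titleType": ["movie", "short", "tvMovie", "movie", "movie"],
    "primaryTitle": ["A", "B", "C", "D", "E"],
    "originalTitle": ["A", "B", "C", "D", "E"],
    "isAdult": [0, 0, 0, 1, 0],
    "startYear": ["2010", "2015", r"\N", "2012", "1999"],
    "endYear": [r"\N"] * 5,
    "runtimeMinutes": ["90", "10", "80", "95", "100"],
    "genres": ["Drama", "Short", "Comedy", "Adult", "Drama"],
})

reduced = reduce_title_basics(toy)
print(reduced)
```

Only `tt1` survives here: `tt2` is a short, `tt3` has a missing start year, `tt4` is an adult title, and `tt5` predates the cutoff.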

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_imdb = pd.read_csv('./zippedData/title.basics.aug23.tsv.gz', sep='\t')
df_imdb.head()

|   | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0000001 | short | Carmencita | Carmencita | 0 | 1894 | \N | 1 | Documentary,Short |
| 1 | tt0000002 | short | Le clown et ses chiens | Le clown et ses chiens | 0 | 1892 | \N | 5 | Animation,Short |
| 2 | tt0000003 | short | Pauvre Pierrot | Pauvre Pierrot | 0 | 1892 | \N | 4 | Animation,Comedy,Romance |
| 3 | tt0000004 | short | Un bon bock | Un bon bock | 0 | 1892 | \N | 12 | Animation,Short |
| 4 | tt0000005 | short | Blacksmith Scene | Blacksmith Scene | 0 | 1893 | \N | 1 | Comedy,Short |


In [3]:
df_imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10058021 entries, 0 to 10058020
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 690.6+ MB
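
Every column is loaded as `object` and the frame takes roughly 690 MB. One possible way to lighten the load, sketched here on a two-row in-memory stand-in rather than the real file, is to declare `\N` as a missing value at read time and skip columns that are dropped later anyway. The `sample_tsv` string and the `wanted` list are assumptions made for this example; the notebook itself reads the full file with default options.

```python
import io

import pandas as pd

# Two rows copied from the head() output above, used as a stand-in for the real TSV
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\tDocumentary,Short\n"
    "tt0000002\tshort\tLe clown et ses chiens\tLe clown et ses chiens\t0\t1892\t\\N\t5\tAnimation,Short\n"
)

# Hypothetical column selection: everything except originalTitle
wanted = ["tconst", "titleType", "primaryTitle", "isAdult",
          "startYear", "endYear", "runtimeMinutes", "genres"]

df_small = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    usecols=wanted,     # unlisted columns are never loaded
    na_values=r"\N",    # treat IMDb's \N placeholder as missing
    dtype=str,          # one consistent type per column, no type guessing
)
print(df_small.isna().sum())
```

With `dtype=str` the numeric-looking columns stay as strings, so later comparisons such as `isAdult == 0` would need to be written against `"0"` or after an explicit conversion.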


In [4]:
df_imdb.titleType.value_counts()

titleType
tvEpisode       7654568
short            944420
movie            653179
video            277989
tvSeries         247567
tvMovie          142593
tvMiniSeries      49691
tvSpecial         42676
videoGame         35336
tvShort           10001
tvPilot               1
Name: count, dtype: int64

In [5]:
# keeping only titles whose type is movie or tvMovie
df_imdb = df_imdb[df_imdb.titleType.isin(["movie", "tvMovie"])]

# removing adult titles
df_imdb = df_imdb[df_imdb.isAdult == 0]

# dropping irrelevant columns
df_imdb.drop(['originalTitle', 'isAdult', 'endYear'], axis=1, inplace=True)
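
Since `info()` reports `isAdult` as `object`, it may be worth checking what the column actually holds before comparing it with the integer `0`. The sketch below uses an invented series that mixes integers and strings to show how the two comparisons differ; whether the real column contains such a mix is not shown in this notebook, so treat this as a check to run, not a diagnosis.

```python
import pandas as pd

# Hypothetical object column mixing integers and their string forms
is_adult = pd.Series([0, "0", 1, "1"], dtype=object)

# Element-wise comparison: the string "0" is not equal to the integer 0
naive_mask = is_adult == 0

# Converting to a numeric dtype first treats 0 and "0" alike
numeric_mask = pd.to_numeric(is_adult) == 0

print(naive_mask.tolist())    # [True, False, False, False]
print(numeric_mask.tolist())  # [True, True, False, False]
```

Running `df_imdb.isAdult.map(type).value_counts()` before filtering would reveal whether more than one Python type is present.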

In [6]:
# replacing \N with NaN to reveal all null values
df_imdb.replace(r'\N', np.nan, inplace=True)

# find all nulls
df_imdb.isna().sum()

tconst                 0
titleType              0
primaryTitle           2
startYear          95386
runtimeMinutes    282328
genres             83712
dtype: int64

In [7]:
# dropping every row that contains a null; the dataset is large enough to absorb the loss
df_imdb.dropna(inplace=True)
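
The per-column null counts can overlap, because one row may be missing several fields, so their sum does not say how many rows `dropna` removes. A minimal sketch of measuring that directly, on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame (invented): row 1 is missing two fields, row 2 is missing one
toy = pd.DataFrame({
    "startYear": ["2010", np.nan, "2012", "2015"],
    "runtimeMinutes": ["90", "80", np.nan, "100"],
    "genres": ["Drama", np.nan, "Comedy", "Action"],
})

rows_with_null = toy.isna().any(axis=1)   # True where at least one field is missing
share_lost = rows_with_null.mean()        # fraction of rows dropna() would remove

print(f"{rows_with_null.sum()} of {len(toy)} rows ({share_lost:.0%}) contain a null")
```

Here three nulls are spread over two rows, so `dropna()` removes two rows, not three.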

In [8]:
# converting startYear into an int64 datatype
df_imdb['startYear'] = df_imdb['startYear'].astype('int64')

# keeping only titles with a startYear later than 2008
df_imdb = df_imdb[df_imdb['startYear'] > 2008]

In [9]:
# resetting the index
df_imdb.reset_index(drop=True, inplace=True)

In [10]:
df_imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210566 entries, 0 to 210565
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          210566 non-null  object
 1   titleType       210566 non-null  object
 2   primaryTitle    210566 non-null  object
 3   startYear       210566 non-null  int64 
 4   runtimeMinutes  210566 non-null  object
 5   genres          210566 non-null  object
dtypes: int64(1), object(5)
memory usage: 9.6+ MB
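
In the reduced frame `runtimeMinutes` is still an `object` column, so numeric operations on it would need a conversion first. The notebook does not do this; the following is just a sketch of one option using `pd.to_numeric`, with invented values.

```python
import pandas as pd

# Hypothetical runtimes stored as strings, as in an object column
runtime = pd.Series(["90", "120", "45"], dtype=object)
runtime_num = pd.to_numeric(runtime, errors="coerce")
print(runtime_num.mean())  # 85.0

# errors="coerce" turns anything unparseable into NaN instead of raising
with_bad_value = pd.to_numeric(pd.Series(["90", "abc"]), errors="coerce")
print(with_bad_value.isna().tolist())  # [False, True]
```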


In [11]:
df_imdb.to_csv('./zippedData/imdb_basics.aug23_reduced.csv.gz')
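
Because `to_csv` is called without `index=False`, the index is written as an unnamed first column, and the `.gz` extension makes pandas compress the file with gzip. When reading the reduced file back, passing `index_col=0` avoids an extra `Unnamed: 0` column. A small round-trip sketch with a temporary file and invented rows (the file name is hypothetical):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"tconst": ["tt1", "tt2"], "startYear": [2010, 2012]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_reduced.csv.gz")
    df.to_csv(path)                             # index saved; gzip inferred from ".gz"
    plain = pd.read_csv(path)                   # index comes back as "Unnamed: 0"
    restored = pd.read_csv(path, index_col=0)   # first column used as the index

print(list(plain.columns))     # ['Unnamed: 0', 'tconst', 'startYear']
print(list(restored.columns))  # ['tconst', 'startYear']
```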