Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name. The available datasets are as follows:

title.akas.tsv.gz - Contains the following information for titles:

- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- title (string) – the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes (array) - Additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) – 0: not original title; 1: original title

title.basics.tsv.gz - Contains the following information for titles:

- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title

title.crew.tsv.gz – Contains the director and writer information for all the titles in IMDb. Fields include:

- tconst (string) - alphanumeric unique identifier of the title
- directors (array of nconsts) - director(s) of the given title
- writers (array of nconsts) – writer(s) of the given title
- title.episode.tsv.gz – Contains the tv episode information. Fields include:
- tconst (string) - alphanumeric identifier of episode
- parentTconst (string) - alphanumeric identifier of the parent TV Series
- seasonNumber (integer) – season number the episode belongs to
- episodeNumber (integer) – episode number of the tconst in the TV series

title.principals.tsv.gz – Contains the principal cast/crew for titles

- tconst (string) - alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- nconst (string) - alphanumeric unique identifier of the name/person
- category (string) - the category of job that person was in
- job (string) - the specific job title if applicable, else '\N'
- characters (string) - the name of the character played if applicable, else '\N'

title.ratings.tsv.gz – Contains the IMDb rating and votes information for titles

- tconst (string) - alphanumeric unique identifier of the title
- averageRating – weighted average of all the individual user ratings
- numVotes - number of votes the title has received

name.basics.tsv.gz – Contains the following information for names:

- nconst (string) - alphanumeric unique identifier of the name/person
- primaryName (string)– name by which the person is most often credited
- birthYear – in YYYY format
- deathYear – in YYYY format if applicable, else '\N'
- primaryProfession (array of strings)– the top-3 professions of the person
- knownForTitles (array of tconsts) – titles the person is known for

### Importing datasets

In [36]:
import pandas as pd
import numpy as np

In [37]:
principals = './data/title.principals.tsv.gz'
names = './data/name.basics.tsv.gz'
akas = './data/title.akas.tsv.gz'
titles = './data/title.basics.tsv.gz'
crew = './data/title.crew.tsv.gz'
ratings = './data/title.ratings.tsv.gz'

#TODO :
#duplicates ?
#drop columns which would not be used 

In [38]:
df_principals = pd.read_csv(principals, 
            compression = "infer",
            sep = '\t',
            na_values = '\\N')

#df_principals.sample(10)

#check what is self

MemoryError: Unable to allocate 1.77 GiB for an array with shape (5, 47545542) and data type object

In [None]:
df_names = pd.read_csv(names, 
            compression = "infer",
            sep = '\t',
            na_values = '\\N')

df_names.drop(columns = ['birthYear', 'deathYear'], inplace = True)
#df_names.sample(10)

In [None]:
df_titles = pd.read_csv(titles, 
            compression = "infer",
            sep = '\t',
            na_values = '\\N')

df_titles.drop(columns = ['isAdult', 'endYear'], inplace = True)            
#df_titles.sample(10)

#drop columns : isAdult, endYear (as we're considering movies ?)

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [None]:
df_akas = pd.read_csv(akas, 
            compression = "infer",
            sep = '\t',
            na_values = '\\N')


#df_akas.sample(10)

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [None]:
df_crew = pd.read_csv(crew, 
            compression="infer",
            sep = '\t',
            na_values = '\\N')
            
#df_crew.sample(10)

In [None]:
df_ratings = pd.read_csv(ratings, 
            compression="infer",
            sep = '\t',
            na_values = '\\N')
            
#df_ratings.sample(10)

## Modifying and merging the datasets

Goal here is to transform the dataframe df_principals containing the crew of each movies to a dict object, merge the other dataframes with movies as index and then insert the dict object in a new column named 'crew'. At the end save it as a json file.

In [None]:
# multiple rows for one movie in df_principals (as there are multiple people involved in the crew), for example : 
df_principals.loc[df_principals['tconst'] == 'tt6726106']

Unnamed: 0,tconst,ordering,nconst,category,job,characters
38848507,tt6726106,1,nm9183136,actor,,"[""Lin Yutang""]"
38848508,tt6726106,2,nm9139744,actress,,"[""Ling Chaoxi""]"
38848509,tt6726106,3,nm8360107,actress,,"[""Xia Weiye""]"
38848510,tt6726106,4,nm6718764,actress,,"[""Shen Xi""]"
38848511,tt6726106,5,nm8885471,director,,
38848512,tt6726106,6,nm8364498,producer,producer,
38848513,tt6726106,7,nm8658895,actor,,"[""He Zhizhou""]"
38848514,tt6726106,8,nm9393190,actor,,"[""Baobao""]"


In [None]:
# changing nconst values by actual names from df_names 
df_principals = df_principals.rename(columns = {'nconst' : 'name'})
df_principals['name'] = df_principals['name'].map(df_names.set_index('nconst')['primaryName'])
df_principals

# managing here NANs

Unnamed: 0,tconst,ordering,name,category,job,characters
0,tt0000001,1,Carmencita,self,,"[""Self""]"
1,tt0000001,2,William K.L. Dickson,director,,
2,tt0000001,3,William Heise,cinematographer,director of photography,
3,tt0000002,1,Émile Reynaud,director,,
4,tt0000002,2,Gaston Paulin,composer,,
...,...,...,...,...,...,...
47545537,tt9916880,4,Eden Gamliel,actress,,"[""Horrid Henry""]"
47545538,tt9916880,5,Hilary Audus,director,principal director,
47545539,tt9916880,6,Lucinda Whiteley,writer,,
47545540,tt9916880,7,Francesca Simon,writer,books,


In [None]:
#preparing the dataframe to be transformed into dict 
df_principals_hier = df_principals.set_index(['tconst', 'ordering'])

In [None]:
dict = df_principals_hier.to_dict('index')

MemoryError: 

In [None]:
#merging df_crew and df_titles to have a dataframe containing movies and there respective crew
merged = pd.merge(df_titles, df_crew, on = 'tconst')
merged

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,directors,writers
0,tt0000001,short,Carmencita,Carmencita,1894.0,1.0,"Documentary,Short",nm0005690,
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,1892.0,5.0,"Animation,Short",nm0721526,
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,1892.0,4.0,"Animation,Comedy,Romance",nm0721526,
3,tt0000004,short,Un bon bock,Un bon bock,1892.0,12.0,"Animation,Short",nm0721526,
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,1893.0,1.0,"Comedy,Short",nm0005690,
...,...,...,...,...,...,...,...,...,...
8407394,tt9916848,tvEpisode,Episode #3.17,Episode #3.17,2010.0,,"Action,Drama,Family","nm5519454,nm5519375","nm6182221,nm1628284,nm2921377"
8407395,tt9916850,tvEpisode,Episode #3.19,Episode #3.19,2010.0,,"Action,Drama,Family","nm5519454,nm5519375","nm6182221,nm1628284,nm2921377"
8407396,tt9916852,tvEpisode,Episode #3.20,Episode #3.20,2010.0,,"Action,Drama,Family","nm5519454,nm5519375","nm6182221,nm1628284,nm2921377"
8407397,tt9916856,short,The Wind,The Wind,2015.0,27.0,Short,nm10538645,nm6951431


In [None]:
merged['titleType'].unique()

array(['short', 'movie', 'tvEpisode', 'tvSeries', 'tvShort', 'tvMovie',
       'tvMiniSeries', 'tvSpecial', 'video', 'videoGame', 'tvPilot'],
      dtype=object)

In [None]:
#keeping only rows concerning movies
merged.drop(merged.loc[merged['titleType'] !='movie'].index, inplace=True)

print(f'the dataframe contains now {len(merged)} rows')

the dataframe contains now 4206140 rows


In [None]:
df_principals['crew'] = dict.values()
merged = merged.merge(df_principals, on = 'tconst', how = 'left')

KeyboardInterrupt: 

In [None]:
merged.columns = ['tconst', ]
merged.drop(columns = ['ordering', 'name', 'category', 'job', 'characters'])

In [None]:
merged['crew'][101]

{'name': 'Jørgen Lund',
 'category': 'actor',
 'job': nan,
 'characters': '["Gøngehøvdingen"]'}

In [None]:
#for i, l in enumerate(merged['directors']):
#    print("list",i,"is",type(l))
## type <str> which is not what we want    

In [None]:
merged['directors'] = '["' + merged['directors'].str.replace(',', '","') + '"]'
merged['directors'] = str(merged['directors'])

In [None]:
merged['directors'] = merged['directors'].apply(eval)

SyntaxError: invalid syntax (<string>, line 2)

In [None]:
#changing values of directors and writers column based on matching nconst values with df_names
merged['directors'] = merged['directors'].map(df_names.set_index('nconst')['primaryName'])
merged['writers'] = merged['writers'].map(df_names.set_index('nconst')['primaryName'])
merged

#cette technique n'accepte pas les columns avec plusieurs noms

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,directors,writers,ordering,name,category,job,characters,crew
0,tt0000502,movie,Bohemios,Bohemios,1905.0,100.0,,Ricardo de Baños,,1.0,Antonio del Pozo,actor,,,"{'name': 'Antonio del Pozo', 'category': 'acto..."
1,tt0000502,movie,Bohemios,Bohemios,1905.0,100.0,,Ricardo de Baños,,2.0,El Mochuelo,actor,,,"{'name': 'El Mochuelo', 'category': 'actor', '..."
2,tt0000502,movie,Bohemios,Bohemios,1905.0,100.0,,Ricardo de Baños,,3.0,Ricardo de Baños,director,,,"{'name': 'Ricardo de Baños', 'category': 'dire..."
3,tt0000502,movie,Bohemios,Bohemios,1905.0,100.0,,Ricardo de Baños,,4.0,Miguel de Palacios,writer,,,"{'name': 'Miguel de Palacios', 'category': 'wr..."
4,tt0000502,movie,Bohemios,Bohemios,1905.0,100.0,,Ricardo de Baños,,5.0,Guillermo Perrín,writer,,,"{'name': 'Guillermo Perrín', 'category': 'writ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4206135,tt9916754,movie,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,2013.0,49.0,Documentary,,,5.0,Angela Gurgel,director,supervising director,,"{'name': 'Angela Gurgel', 'category': 'directo..."
4206136,tt9916754,movie,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,2013.0,49.0,Documentary,,,6.0,Vinicius Augusto Bozzo,director,co-director,,"{'name': 'Vinicius Augusto Bozzo', 'category':..."
4206137,tt9916754,movie,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,2013.0,49.0,Documentary,,,7.0,Marcelo Alves,cinematographer,,,"{'name': 'Marcelo Alves', 'category': 'cinemat..."
4206138,tt9916754,movie,Chico Albuquerque - Revelações,Chico Albuquerque - Revelações,2013.0,49.0,Documentary,,,8.0,Wellington Barros,cinematographer,,,"{'name': 'Wellington Barros', 'category': 'cin..."


In [None]:
#adding ratings from df_ratings (now we have movies and there title, the crew associated and the ratings)
merged = merged.merge(df_ratings, on = 'tconst', how = 'left')
merged.sample(20)

MemoryError: Unable to allocate 2.35 GiB for an array with shape (18, 17506177) and data type object

In [None]:
df_akas = df_akas.rename(columns = {'titleId' : 'tconst'})
merged = merged.merge(df_akas, on = 'tconst', how = 'left')
merged


MemoryError: Unable to allocate 2.21 GiB for an array with shape (296324337,) and data type int64