In [1]:
import os
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
!ls -lrth data

total 1.2G
-rw-r--r-- 1 amznamas amazon 6.1M Sep  2 04:40 title.ratings.tsv.gz
-rw-r--r-- 1 amznamas amazon  36M Sep  2 04:40 title.episode.tsv.gz
-rw-r--r-- 1 amznamas amazon  60M Sep  2 04:40 title.crew.tsv.gz
-rw-r--r-- 1 amznamas amazon 154M Sep  2 04:40 title.basics.tsv.gz
-rw-r--r-- 1 amznamas amazon 222M Sep  2 04:41 name.basics.tsv.gz
-rw-r--r-- 1 amznamas amazon 268M Sep  2 04:41 title.akas.tsv.gz
-rw-r--r-- 1 amznamas amazon 391M Sep  2 04:41 title.principals.tsv.gz


In [24]:
%%time

data_directory = 'data'

for filename in os.listdir(data_directory):
    f = os.path.join(data_directory, filename)
    if f.endswith('.tsv.gz'):
        # derive the variable name from the file name, e.g. 'title.basics.tsv.gz' -> 'title_basics'
        name = filename.split('.tsv')[0].replace('.', '_')
        # bind the DataFrame to a module-level variable with that name
        globals()[name] = pd.read_csv(f, compression='infer', delimiter='\t', low_memory=False)
        print("loaded '{}' --> '{}' named variable".format(f, name))

loaded 'data/title.basics.tsv.gz' --> 'title_basics' named variable
loaded 'data/title.ratings.tsv.gz' --> 'title_ratings' named variable
loaded 'data/title.principals.tsv.gz' --> 'title_principals' named variable
loaded 'data/title.akas.tsv.gz' --> 'title_akas' named variable
loaded 'data/title.crew.tsv.gz' --> 'title_crew' named variable
loaded 'data/name.basics.tsv.gz' --> 'name_basics' named variable
loaded 'data/title.episode.tsv.gz' --> 'title_episode' named variable
CPU times: user 2min 41s, sys: 17.7 s, total: 2min 59s
Wall time: 2min 59s
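
The loop above binds each table to a global variable. A minimal sketch of an alternative, assuming you would rather keep the tables in a dictionary keyed by the same names; `load_tsv_tables` is a hypothetical helper, not part of the original notebook:

```python
import os
import pandas as pd

def load_tsv_tables(data_directory):
    """Read every *.tsv.gz file in data_directory into a dict of DataFrames."""
    tables = {}
    for filename in sorted(os.listdir(data_directory)):
        if not filename.endswith('.tsv.gz'):
            continue
        # 'title.basics.tsv.gz' -> 'title_basics'
        name = filename[:-len('.tsv.gz')].replace('.', '_')
        path = os.path.join(data_directory, filename)
        tables[name] = pd.read_csv(path, sep='\t', compression='infer', low_memory=False)
    return tables
```

With this, `tables = load_tsv_tables('data')` would give access via `tables['title_basics']` and so on, without writing into `globals()`.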


## Some cleaning up

**title.akas.tsv.gz** - Contains the following information for titles:
```
titleId (string) - a tconst, an alphanumeric unique identifier of the title
ordering (integer) – a number to uniquely identify rows for a given titleId
title (string) – the localized title
region (string) - the region for this version of the title
language (string) - the language of the title
types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
attributes (array) - Additional terms to describe this alternative title, not enumerated
isOriginalTitle (boolean) – 0: not original title; 1: original title
```

In [7]:
# r'\N' is the literal backslash-N marker used for missing values in these files
title_akas.types = title_akas.types.replace(r'\N', 'UNKNOWN')
title_akas.title = title_akas.title.fillna('UNKNOWN')
# note: r'\\N' (two backslashes) would never match the marker, so use r'\N' here too
title_akas.region = title_akas.region.replace(r'\N', 'UNKNOWN')
title_akas.region = title_akas.region.fillna('UNKNOWN')
title_akas.language = title_akas.language.replace(r'\N', 'UNKNOWN')
title_akas.attributes = title_akas.attributes.replace(r'\N','')
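
The same replace-then-fill pattern repeats for most tables below. A small sketch of how it could be wrapped in a function; `replace_missing_marker` is a hypothetical helper and the `demo` frame holds made-up values, not IMDb data:

```python
import pandas as pd

def replace_missing_marker(df, columns, value='UNKNOWN', marker=r'\N'):
    """Return a copy of df with the literal marker and NaN replaced in the given columns."""
    out = df.copy()
    for col in columns:
        # exact string match on the marker, then fill any real NaN/None values
        out[col] = out[col].replace(marker, value).fillna(value)
    return out

# made-up rows for illustration only
demo = pd.DataFrame({
    'region': ['US', r'\N', None],
    'language': [r'\N', 'en', 'fr'],
})
cleaned = replace_missing_marker(demo, ['region', 'language'])
print(cleaned)
```

The function works on a copy, so the original frame is left untouched until you assign the result back.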

**title.basics.tsv.gz** - Contains the following information for titles:
```
tconst (string) - alphanumeric unique identifier of the title
titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
originalTitle (string) - original title, in the original language
isAdult (boolean) - 0: non-adult title; 1: adult title
startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
runtimeMinutes – primary runtime of the title, in minutes
genres (string array) – includes up to three genres associated with the title
```

In [68]:
title_basics.startYear = title_basics.startYear.replace(r'\N','0')
title_basics.startYear = title_basics.startYear.astype('int64')
title_basics.endYear = title_basics.endYear.replace(r'\N', '0')
title_basics.endYear = title_basics.endYear.astype('int64')
title_basics.runtimeMinutes = title_basics.runtimeMinutes.replace(r'\N','UNKNOWN')
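# assign the result back instead of fillna(inplace=True) on a column accessor,
# which may not modify the DataFrame in recent pandas versions
title_basics.genres = title_basics.genres.fillna('UNKNOWN')
title_basics.primaryTitle = title_basics.primaryTitle.fillna('UNKNOWN')
title_basics.originalTitle = title_basics.originalTitle.fillna('UNKNOWN')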
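
The cell above encodes a missing year as `0`, which keeps the column a plain `int64` but makes `0` look like a real year in later statistics. As a sketch of an alternative, assuming a nullable integer column is acceptable downstream, pandas' `Int64` dtype can hold missing values directly. The series below uses made-up values:

```python
import pandas as pd

# made-up values for illustration; r'\N' stands for the missing marker
years = pd.Series(['1894', r'\N', '2001'], name='startYear')

# errors='coerce' turns anything non-numeric, including the marker, into NaN;
# the nullable 'Int64' dtype then stores it as <NA> instead of a sentinel
years_nullable = pd.to_numeric(years, errors='coerce').astype('Int64')
print(years_nullable)
```

Aggregations such as `min()` or `mean()` skip `<NA>` by default, so missing years do not distort the result.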

**title.crew.tsv.gz** – Contains the director and writer information for all the titles in IMDb. Fields include:
```
tconst (string) - alphanumeric unique identifier of the title
directors (array of nconsts) - director(s) of the given title
writers (array of nconsts) – writer(s) of the given title
```

In [74]:
title_crew.directors = title_crew.directors.replace(r'\N','UNKNOWN')
title_crew.writers = title_crew.writers.replace(r'\N', 'UNKNOWN')

**title.episode.tsv.gz** – Contains the tv episode information. Fields include:
```
tconst (string) - alphanumeric identifier of episode
parentTconst (string) - alphanumeric identifier of the parent TV Series
seasonNumber (integer) – season number the episode belongs to
episodeNumber (integer) – episode number of the tconst in the TV series
```

In [80]:
title_episode.seasonNumber = title_episode.seasonNumber.replace(r'\N', 'UNKNOWN')
title_episode.episodeNumber = title_episode.episodeNumber.replace(r'\N', 'UNKNOWN')

**title.principals.tsv.gz** – Contains the principal cast/crew for titles
```
tconst (string) - alphanumeric unique identifier of the title
ordering (integer) – a number to uniquely identify rows for a given titleId
nconst (string) - alphanumeric unique identifier of the name/person
category (string) - the category of job that person was in
job (string) - the specific job title if applicable, else '\N'
characters (string) - the name of the character played if applicable, else '\N'
```

In [88]:
title_principals.job = title_principals.job.replace(r'\N','UNKNOWN')
title_principals.characters = title_principals.characters.replace(r'\N','UNKNOWN')

**title.ratings.tsv.gz** – Contains the IMDb rating and votes information for titles
```
tconst (string) - alphanumeric unique identifier of the title
averageRating – weighted average of all the individual user ratings
numVotes - number of votes the title has received
```

**name.basics.tsv.gz** – Contains the following information for names:
```
nconst (string) - alphanumeric unique identifier of the name/person
primaryName (string) – name by which the person is most often credited
birthYear – in YYYY format
deathYear – in YYYY format if applicable, else '\N'
primaryProfession (array of strings) – the top-3 professions of the person
knownForTitles (array of tconsts) – titles the person is known for
```

In [100]:
name_basics.knownForTitles = name_basics.knownForTitles.replace(r'\N','UNKNOWN')
name_basics.birthYear = name_basics.birthYear.replace(r'\N','0').astype('int64')
name_basics.deathYear = name_basics.deathYear.replace(r'\N','0').astype('int64')
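
Before saving, it can be worth checking whether any `\N` markers are left in a table. A minimal sketch, where `count_marker` is a hypothetical helper and `demo` contains made-up rows:

```python
import pandas as pd

def count_marker(df, marker=r'\N'):
    """Count, per column, how many cells still hold the literal marker string."""
    return df.isin([marker]).sum()

# made-up rows for illustration only
demo = pd.DataFrame({
    'tconst': ['tt0000001', 'tt0000002'],
    'job': [r'\N', 'director'],
    'numVotes': [10, 20],
})
remaining = count_marker(demo)
# show only the columns that still contain the marker
print(remaining[remaining > 0])
```

Running `count_marker(title_principals)` or similar on each table would show which columns still need attention.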

## Save cleaned versions

In [8]:
%%time

title_basics.to_csv("data/title_basics.csv", index=False)
title_ratings.to_csv("data/title_ratings.csv", index=False)
title_principals.to_csv("data/title_principals.csv", index=False)
title_akas.to_csv("data/title_akas.csv", index=False)
title_crew.to_csv("data/title_crew.csv", index=False)
name_basics.to_csv("data/name_basics.csv", index=False)
title_episode.to_csv("data/title_episode.csv", index=False)

CPU times: user 1min 8s, sys: 1.36 s, total: 1min 10s
Wall time: 1min 10s
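
One thing to watch when reloading these files: a column cleaned to empty strings (as done for `attributes` in `title_akas`) is written as empty fields, and `pd.read_csv` reads empty fields back as `NaN` by default. The sketch below shows the round trip on a made-up two-row frame, using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# made-up rows for illustration only
demo = pd.DataFrame({
    'titleId': ['tt0000001', 'tt0000002'],
    'attributes': ['', 'literal title'],
})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# default behaviour: the empty field comes back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps empty fields as empty strings
buffer.seek(0)
kept_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read['attributes'].isna().tolist())
print(kept_read['attributes'].tolist())
```

If the empty strings matter later, either pass `keep_default_na=False` when reloading or fill the column again after reading.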
