# The Data

IMDb provides several files with varied information on movies, TV shows, made-for-TV movies, and other titles.

- Overview/Data Dictionary: https://www.imdb.com/interfaces/
- Downloads page: https://datasets.imdbws.com/

I will be focusing on the following files:

- title.basics.tsv.gz
- title.ratings.tsv.gz
- title.akas.tsv.gz

My stakeholder only wants me to include information for movies based on the following specifications:

- Exclude any movie with missing values for genre or runtime.
- Include only full-length movies (titleType = "movie").
- Include only fictional movies (nothing from the documentary genre).
- Include only movies released from 2000 through 2021 (both 2000 and 2021 are included).
- Include only movies that were released in the United States.


Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set.

In [1]:
import pandas as pd
import numpy as np

In [2]:
basics_url = 'https://datasets.imdbws.com/title.basics.tsv.gz'
ratings_url = 'https://datasets.imdbws.com/title.ratings.tsv.gz'
akas_url = 'https://datasets.imdbws.com/title.akas.tsv.gz'

In [3]:
basics = pd.read_csv(basics_url,sep='\t', low_memory=False)

In [4]:
ratings = pd.read_csv(ratings_url,sep='\t', low_memory=False)

In [5]:
akas = pd.read_csv(akas_url,sep='\t', low_memory=False)

**Changing null value encoding from `\N` to `np.nan`**

In [6]:
basics = basics.replace({'\\N': np.nan})
ratings = ratings.replace({'\\N': np.nan})
akas = akas.replace({'\\N': np.nan})
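As an aside, the same conversion can probably be done while loading, using the `na_values` argument of `pd.read_csv`. The sketch below is not what this notebook runs; it uses a tiny made-up sample (the `tt0000001`-style rows are invented for illustration) in the same tab-separated layout to show the idea.

```python
import io

import pandas as pd

# Made-up two-row sample in the IMDb TSV layout; \N marks a missing value.
sample_tsv = (
    "tconst\ttitleType\tstartYear\truntimeMinutes\tgenres\n"
    "tt0000001\tmovie\t2001\t118\tComedy\n"
    "tt0000002\tmovie\t\\N\t\\N\t\\N\n"
)

# na_values tells pandas to treat the literal string \N as NaN at read time,
# so no separate replace() pass is needed afterwards.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Count the missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)
```

With the real files, the same argument would be passed alongside `sep='\t'` in the `pd.read_csv` calls above.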

**Eliminate movies that are null for `runtimeMinutes`, `genres`, or `startYear`**

In [7]:
basics = basics.dropna(subset = ['runtimeMinutes', 'genres', 'startYear'])

**Include only movies that were released 2000 - 2021 (include 2000 and 2021)**

In [8]:
basics['startYear'] = basics['startYear'].astype(int)

In [9]:
basics = basics.loc[(basics['startYear'] >= 2000) & (basics['startYear'] <= 2021)]
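To sanity-check that both endpoints are kept, here is a minimal sketch on a made-up four-row frame (the `toy` name and its `tt1`-style IDs are hypothetical). It uses `Series.between`, which is inclusive on both ends by default and should be equivalent to the two comparisons above.

```python
import pandas as pd

# Hypothetical toy data: years stored as strings, as they are after loading the TSV.
toy = pd.DataFrame(
    {
        "tconst": ["tt1", "tt2", "tt3", "tt4"],
        "startYear": ["1999", "2000", "2021", "2022"],
    }
)

# Convert to integers so the comparison is numeric rather than lexical.
toy["startYear"] = toy["startYear"].astype(int)

# between() includes both 2000 and 2021 by default.
in_range = toy.loc[toy["startYear"].between(2000, 2021)]
kept_ids = in_range["tconst"].tolist()
print(kept_ids)
```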

**Include only full-length movies (titleType = "movie").**

In [10]:
basics = basics.loc[basics['titleType'] == 'movie']

**Include only fictional movies (not from documentary genre)**

In [11]:
is_documentary = basics['genres'].str.contains('documentary',case=False)
basics = basics[~is_documentary]
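Because `genres` holds a comma-separated string, a substring match catches documentaries wherever the label appears in the list. A small sketch on made-up rows (the `toy_genres` frame is hypothetical) shows the mask in action. It assumes, as in this notebook, that rows with missing `genres` were already dropped; otherwise the mask would contain NaN and the `~` negation would fail.

```python
import pandas as pd

# Hypothetical rows: two fictional titles and two that carry the Documentary label.
toy_genres = pd.DataFrame(
    {
        "tconst": ["tt1", "tt2", "tt3", "tt4"],
        "genres": ["Comedy,Romance", "Documentary", "Biography,Documentary", "Drama"],
    }
)

# Case-insensitive substring match anywhere in the comma-separated genre list.
is_doc = toy_genres["genres"].str.contains("documentary", case=False)

# Keep only the rows where the mask is False.
fiction = toy_genres[~is_doc]
fiction_ids = fiction["tconst"].tolist()
print(fiction_ids)
```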

**Filter the AKAs dataframe to keep only the movies that remain in basics**

In [12]:
akas.shape

(31737449, 8)

In [13]:
akas = akas.loc[akas['titleId'].isin(basics['tconst'])]

**Keep only movies released in the US**

In [15]:
usAKAs = akas.loc[akas['region'] == 'US']

In [16]:
basics = basics.loc[basics['tconst'].isin(usAKAs['titleId'])]

In [18]:
ratings = ratings.loc[ratings['tconst'].isin(basics['tconst'])]
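The three `isin` filters above form a chain: AKAs is narrowed to the surviving titles, the US rows are selected, and then basics and ratings are narrowed to the US titles. The sketch below replays that chain on made-up frames (all `toy_*` names and IDs are hypothetical) so the effect of each step is easy to follow.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
toy_basics = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"]})
toy_akas = pd.DataFrame(
    {
        "titleId": ["tt1", "tt1", "tt2", "tt9"],
        "region": ["US", "GB", "FR", "US"],
    }
)
toy_ratings = pd.DataFrame(
    {"tconst": ["tt1", "tt2", "tt9"], "averageRating": [7.1, 6.0, 5.5]}
)

# Step 1: drop AKA rows for titles already removed from basics (tt9 here).
toy_akas = toy_akas.loc[toy_akas["titleId"].isin(toy_basics["tconst"])]

# Step 2: keep only the rows for a US release.
toy_us = toy_akas.loc[toy_akas["region"] == "US"]

# Step 3: keep only basics rows that have a US release.
toy_basics = toy_basics.loc[toy_basics["tconst"].isin(toy_us["titleId"])]

# Step 4: keep only ratings for the titles that survived.
toy_ratings = toy_ratings.loc[toy_ratings["tconst"].isin(toy_basics["tconst"])]

print(toy_basics["tconst"].tolist(), toy_ratings["tconst"].tolist())
```

In this toy case only `tt1` survives: `tt2` has no US row, `tt3` has no AKA rows at all, and `tt9` was never in basics.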

In [19]:
import os
os.makedirs('Data/',exist_ok=True) 
# Confirm folder created
os.listdir("Data/")

['akas.csv', 'ratings.csv', 'title_basics.csv']

In [20]:
basics.to_csv("Data/basics.csv", index = False)

In [21]:
ratings.to_csv("Data/ratings.csv", index = False)

In [22]:
usAKAs.to_csv("Data/akas.csv", index = False)

In [23]:
## Save dataframes to Compressed .csv.gz files
basics.to_csv("Data/basics.csv.gz",compression='gzip',index=False)
ratings.to_csv("Data/ratings.csv.gz",compression='gzip',index=False)
usAKAs.to_csv("Data/akas.csv.gz",compression='gzip',index=False)
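A quick way to convince myself that the compressed files load back cleanly is a round trip on a small made-up frame. This sketch writes to a temporary directory rather than `Data/`, and the `toy` frame and `toy.csv.gz` file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for one of the filtered tables.
toy = pd.DataFrame({"tconst": ["tt1", "tt2"], "startYear": [2001, 2020]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv.gz")

    # index=False keeps the row index out of the file, so no extra column appears on reload.
    toy.to_csv(path, compression="gzip", index=False)

    # pandas infers gzip compression from the .gz extension when reading.
    reloaded = pd.read_csv(path)

reloaded_columns = list(reloaded.columns)
print(reloaded_columns)
```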

In [25]:
# Open saved file and preview
basics = pd.read_csv("Data/basics.csv.gz", low_memory = False)
basics.head()

|   | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0035423 | movie | Kate & Leopold | Kate & Leopold | 0 | 2001 |  | 118 | Comedy,Fantasy,Romance |
| 1 | tt0062336 | movie | The Tango of the Widower and Its Distorting Mi... | El Tango del Viudo y Su Espejo Deformante | 0 | 2020 |  | 70 | Drama |
| 2 | tt0069049 | movie | The Other Side of the Wind | The Other Side of the Wind | 0 | 2018 |  | 122 | Drama |
| 3 | tt0088751 | movie | The Naked Monster | The Naked Monster | 0 | 2005 |  | 100 | Comedy,Horror,Sci-Fi |
| 4 | tt0093119 | movie | Grizzly II: Revenge | Grizzly II: The Predator | 0 | 2020 |  | 74 | Horror,Music,Thriller |
