# Acquiring IMDB Datasets

This project will primarily focus on the use of statistics taken from IMDB. All movies and games analysed will be contained within the IMDB datasets. Note that these can be freely downloaded from https://datasets.imdbws.com The first step is to import the required libraries.

In [1]:
import pandas as pd
import os
import subprocess

We create a function that will download the datasets from imdbws. Note that theis code allows us to acquire the datasets necessary for the scope of this project but additional datasets are available, which do not begin with 'title', and this function would need to be modified if you wish to acquire them.

In [None]:
def download_dataset(filename):
    # The url of the dataset to download
    url = "https://datasets.imdbws.com/title.{}.tsv.gz".format(filename)
    # Create the folder for the downloaded file
    raw_data_dir = "raw_data/raw_imdb"
    os.makedirs(raw_data_dir, exist_ok=True)
    # Filenames for downloaded file
    filename_gz = os.path.join(raw_data_dir, "title.{}.tsv.gz".format(filename))
    filename_tsv = os.path.join(raw_data_dir, "title.{}.tsv".format(filename))
    # Download the file and save to specified location
    curl_cmd = "curl -L {} --output {}".format(url, filename_gz)
    status = subprocess.run(curl_cmd.split())
    # The downloaded file is a .gz file which requires unzipping. 
    gunzip_cmd = "gunzip {} -qq".format(filename_gz)
    status = subprocess.run(gunzip_cmd.split())
    # Create a dataset with the unzipped file
    dataset = pd.read_csv(filename_tsv, sep="\t", low_memory=False)
    return dataset

Create 3 datasets using the the function just defined.

In [None]:
ratings = download_dataset("ratings")
titles = download_dataset("basics")
akas = download_dataset("akas")

Combine the titles and ratings into a single dataset. We can use the tconst which is a unique identifier for every movie or game in the IMDB datsets.

In [3]:
df = titles.join(ratings.set_index('tconst'), on='tconst')
df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short",5.7,1965.0
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short",5.8,263.0
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance",6.5,1807.0
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short",5.6,178.0
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short",6.2,2604.0


We can examine the different types of data we currently have.

In [4]:
df["titleType"].value_counts()

tvEpisode       7425225
short            924804
movie            642410
video            273194
tvSeries         241897
tvMovie          141116
tvMiniSeries      48161
tvSpecial         41168
videoGame         34092
tvShort           10057
tvPilot               1
Name: titleType, dtype: int64

In [5]:
df.tconst.count()

9782125

We'll remove any entries that aren't movies or videogames as we won't be using them:

In [6]:
df = df[(df['titleType']=='movie') | (df['titleType'] == 'game')]

In [7]:
df.tconst.count()

642410

We'll drop entries that have no votes, or no average rating

In [8]:
df = df.dropna(subset=['averageRating', 'numVotes'])

In [9]:
df.tconst.count()

290265

We only want the alternate titles from the akas dataframe. We can remove the other columns, drop the duplicates and group all the alternate titles in a single list for each id.

In [10]:
akas.sample()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
16698630,tt16431004,2,Episodio #7.126,IT,it,\N,\N,0


In [11]:
akas = akas[['titleId', 'title']]
akas = akas.drop_duplicates()
akas = akas.groupby('titleId').agg({'title': lambda x: list(x)})
akas.sample()

Unnamed: 0_level_0,title
titleId,Unnamed: 1_level_1
tt2026515,"[Épisode datant du 23 mars 1994, 1994年3月23日 のエ..."


Combine the akas dataframe to the main dataframe

In [12]:
df = df.join(akas, on='tconst').rename(columns = {"title": "akas"})

Separate the video games and movies and save to csv files.

In [None]:
video_games = df[df["titleType"]=="videoGame"]
video_games.to_csv("imdb_games_db.csv")

movies = df[df["titleType"]=="movie"]
movies.to_csv("imdb_movies_db.csv")