# Exploratory Data Analysis
We move our exploratory analysis into this separate notebook
to remove a lot of code clutter from our main notebook.

Once we have cleaned our data and it is in a format we are comfortable with, we will save it
and access it from our main notebook, where we will continue working on it.

In [57]:
# imports
import pandas as pd
import numpy as np

from pandasql import sqldf

We import the `pandasql` library so that we can run SQL-style queries on our dataframes.

We then write a small lambda function that makes it easier and quicker to write our queries.
Normally, we would have to pass in the global variables every time we run a query, so that `sqldf` can find the dataframes the query refers to.
To avoid doing this every time, we wrap the call in the lambda below.

In [58]:
# helper lambda that passes the global variables needed by pandasql
pysqldf = lambda q: sqldf(q, globals())
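
To make the role of the helper more concrete, here is a minimal sketch of the same idea without `pandasql`: dataframes are copied into an in-memory SQLite database and queried there. The function `run_sql` and the table `demo_titles` are hypothetical names used only for illustration, and this is an approximation of what such a wrapper does, not the library's actual implementation.

```python
import sqlite3

import pandas as pd


def run_sql(query, frames):
    """Run a SQL query against a dict of {table_name: DataFrame}.

    Illustrative stand-in for a pandasql-style helper: each dataframe
    is written to an in-memory SQLite database before the query runs.
    """
    con = sqlite3.connect(':memory:')
    try:
        for name, df in frames.items():
            df.to_sql(name, con, index=False)
        return pd.read_sql_query(query, con)
    finally:
        con.close()


# small made-up table, not the real IMDB data
demo_titles = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3'],
    'start_year': [2013, 2019, 2018],
})

recent = run_sql(
    "SELECT tconst FROM demo_titles WHERE start_year >= 2018 ORDER BY tconst",
    {'demo_titles': demo_titles},
)
print(recent)
```

With `pysqldf`, the dictionary of tables is replaced by `globals()`, so any dataframe defined in the notebook can be referenced by its variable name inside the query string.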

From our data, we will be using four datasets:
* imdb.title.basics located at `./data/title.basics.csv`
* imdb.title.crew located at `./data/title.crew.csv`
* imdb.title.ratings located at `./data/title.ratings.csv`
* imdb.tn.movie_budgets located at `./data/tn.movie_budgets.csv`

In [59]:
# loading our data
title_basics = pd.read_csv('./data/title.basics.csv')
title_crew = pd.read_csv('./data/title.crew.csv')
title_ratings = pd.read_csv('./data/title.ratings.csv')
movie_budgets = pd.read_csv('./data/tn.movie_budgets.csv')

Take a look at the data we have loaded to get an idea of the kind of data we will be working with.
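
One possible way to take that first look is a compact per-column summary of types and missing values. The helper `quick_overview` below is a hypothetical name, and the sample frame is made up so the snippet runs on its own; in the notebook it could be applied to any of the four loaded dataframes.

```python
import pandas as pd


def quick_overview(df):
    """Return one row per column with its dtype and missing-value counts."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isna().sum(),
        'missing_pct': (df.isna().mean() * 100).round(1),
    })


# made-up sample; replace with e.g. title_basics in the notebook
sample = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3', 'tt4'],
    'runtime_minutes': [175.0, None, 122.0, None],
})

overview = quick_overview(sample)
print(overview)
```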

## Cleaning IMDB Title Basics
We now focus on the `title_basics` dataset to clean the data
and get all the columns into the desired formats and types.

In [60]:
# first five rows of title_basics
title_basics.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [61]:
title_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   tconst           146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


* Our dataset has 146,144 entries.
* `original_title`, `runtime_minutes` and `genres` have missing values (21, 31,739 and 5,408 respectively).
* `start_year` is stored as integers, as expected.

We will drop `original_title`, since we will not be using it and `primary_title` already gives us the title we need (the two often match),
and convert all values in `primary_title` to lowercase.
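
Before dropping a column on the grounds that another one duplicates it, it can help to measure how often the two actually differ. The sketch below does this on three rows taken from the `head()` output above; in the notebook the same comparison could be run on the full `title_basics` frame.

```python
import pandas as pd

# three rows copied from the head() output above
sample = pd.DataFrame({
    'primary_title': [
        'Sunghursh',
        'One Day Before the Rainy Season',
        'The Other Side of the Wind',
    ],
    'original_title': [
        'Sunghursh',
        'Ashad Ka Ek Din',
        'The Other Side of the Wind',
    ],
})

# True where the two titles are not identical
# (a missing original_title also counts as different)
differs = sample['primary_title'] != sample['original_title']
share_different = differs.mean()

print(f"{differs.sum()} of {len(sample)} titles differ")
```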

In [62]:
# take a look at the year column to see the values contained within 
title_basics.start_year.value_counts()

2017    17504
2016    17272
2018    16849
2015    16243
2014    15589
2013    14709
2012    13787
2011    12900
2010    11849
2019     8379
2020      937
2021       83
2022       32
2023        5
2024        2
2027        1
2026        1
2025        1
2115        1
Name: start_year, dtype: int64

We will also drop rows with missing values, as well as entries with a start year beyond the current year, 2022.
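
As a rough check of how much the year cutoff removes, the counts from the `value_counts()` output above can be summed for years after 2022. The series below copies only the tail of that output (2019 onwards); the variable names are illustrative.

```python
import pandas as pd

# counts copied from the value_counts() output above (2019 onwards)
year_counts = pd.Series({
    2019: 8379, 2020: 937, 2021: 83, 2022: 32,
    2023: 5, 2024: 2, 2025: 1, 2026: 1, 2027: 1, 2115: 1,
})

# entries whose start_year lies beyond the cutoff
cutoff = 2022
future = year_counts[year_counts.index > cutoff]
n_future = int(future.sum())

print(f"{n_future} entries have a start_year after {cutoff}")
```

Calling `.sort_index()` on the `value_counts()` result would list the years chronologically, which makes outliers such as 2115 easier to spot.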

In [63]:
# create a copy of title basics that we can change without affecting the base table
# and drop the original_title column
title_basics_cleaned = title_basics.drop('original_title', axis=1).copy()

# drop any row that has an empty/NaN cell
title_basics_cleaned = title_basics_cleaned.dropna(axis=0, how='any')

# drop all rows whose start_year is greater than 2022
title_basics_cleaned.drop(title_basics_cleaned[title_basics_cleaned['start_year'] > 2022].index, inplace=True)

# convert all the entries in the primary_title column to lowercase
title_basics_cleaned['primary_title'] = title_basics_cleaned['primary_title'].str.lower()

In [65]:
title_basics_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112233 entries, 0 to 146139
Data columns (total 5 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   tconst           112233 non-null  object 
 1   primary_title    112233 non-null  object 
 2   start_year       112233 non-null  int64  
 3   runtime_minutes  112233 non-null  float64
 4   genres           112233 non-null  object 
dtypes: float64(1), int64(1), object(3)
memory usage: 10.1+ MB
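
The `info()` output can be backed by a few explicit checks on the cleaned frame. The function `validate_title_basics` is a hypothetical helper, and the two-row sample stands in for `title_basics_cleaned` so that the snippet runs on its own.

```python
import pandas as pd


def validate_title_basics(df, max_year=2022):
    """Return a dict of named checks; True means the check passed."""
    titles = df['primary_title']
    return {
        'no_missing_values': not df.isna().any().any(),
        'years_within_range': bool((df['start_year'] <= max_year).all()),
        'titles_lowercase': bool((titles == titles.str.lower()).all()),
        'original_title_dropped': 'original_title' not in df.columns,
    }


# stand-in for title_basics_cleaned, built from rows shown earlier
sample_cleaned = pd.DataFrame({
    'tconst': ['tt0063540', 'tt0069049'],
    'primary_title': ['sunghursh', 'the other side of the wind'],
    'start_year': [2013, 2018],
    'runtime_minutes': [175.0, 122.0],
    'genres': ['Action,Crime,Drama', 'Drama'],
})

checks = validate_title_basics(sample_cleaned)
print(checks)
```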


In [64]:
# save the cleaned file for further exploration in the main notebook
title_basics_cleaned.to_csv('./data/cleaned_title_basics.csv', index=False)
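
Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra unnamed column. The sketch below illustrates the round trip with an in-memory buffer instead of a file on disk; the two-row frame is a stand-in for the cleaned data.

```python
import io

import pandas as pd

# stand-in for title_basics_cleaned
cleaned = pd.DataFrame({
    'tconst': ['tt0063540', 'tt0069049'],
    'primary_title': ['sunghursh', 'the other side of the wind'],
    'start_year': [2013, 2018],
    'runtime_minutes': [175.0, 122.0],
})

# write to an in-memory buffer rather than ./data/...
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# read it back, as the main notebook would with pd.read_csv(path)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.dtypes)
```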

## Cleaning IMDB Title Crew

We now look at the crew data to see its format and check whether there is anything we need to clean.

In [66]:
title_crew.head()

Unnamed: 0,tconst,directors,writers
0,tt0285252,nm0899854,nm0899854
1,tt0438973,,"nm0175726,nm1802864"
2,tt0462036,nm1940585,nm1940585
3,tt0835418,nm0151540,"nm0310087,nm0841532"
4,tt0878654,"nm0089502,nm2291498,nm2292011",nm0284943


In [67]:
title_crew.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   tconst     146144 non-null  object
 1   directors  140417 non-null  object
 2   writers    110261 non-null  object
dtypes: object(3)
memory usage: 3.3+ MB


We won't delete any rows from our `title_crew` data, as the table ties directors and writers together:
deleting a row because it is missing a director would delete its writers as well, and vice versa.

Instead, we'll fill the empty cells with `unknown`.

In [70]:
title_crew = title_crew.fillna('unknown')
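
Because `directors` and `writers` hold comma-separated IDs, a later analysis may want one row per person. The following is a sketch of one way to do that for directors after filling the gaps, using three rows from the `head()` output above; `directors_long` is an illustrative name and this step is not part of the cleaning performed in this notebook.

```python
import pandas as pd

# three rows copied from the title_crew head() output above
crew_sample = pd.DataFrame({
    'tconst': ['tt0285252', 'tt0438973', 'tt0878654'],
    'directors': ['nm0899854', None, 'nm0089502,nm2291498,nm2292011'],
    'writers': ['nm0899854', 'nm0175726,nm1802864', 'nm0284943'],
})

# fill missing cells, as done for the full table
crew_filled = crew_sample.fillna('unknown')

# split the comma-separated IDs and give each director its own row
directors_long = (
    crew_filled[['tconst', 'directors']]
    .assign(directors=crew_filled['directors'].str.split(','))
    .explode('directors')
    .reset_index(drop=True)
)

print(directors_long)
```

Rows that had no director keep a single `unknown` entry, so no title is lost in the reshaping.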

In [71]:
title_crew.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   tconst     146144 non-null  object
 1   directors  146144 non-null  object
 2   writers    146144 non-null  object
dtypes: object(3)
memory usage: 3.3+ MB


In [72]:
# save our modified dataset for further exploration later
title_crew.to_csv('./data/cleaned_title_crew.csv', index=False)