### Final Project Submission

* Student name: Isiah Cruz
* Student pace: Part-Time
* Scheduled project review date/time: April 27, 2020
* Instructor name: Eli Thomas
* Blog post URL:

# Data Cleaning

In [114]:
import pandas as pd #importing data analysis library under the alias pd
import numpy as np #importing scientific computation library under the alias np

In [115]:
df1 = pd.read_csv('./zippeddata/imdb.title.ratings.csv.gz') #importing the IMDB ratings dataset
df2 = pd.read_csv('./zippeddata/imdb.title.basics.csv.gz') #importing the IMDB title basics dataset

In [116]:
df1.head() #taking a look at the first 5 rows of the IMDB ratings dataset

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [117]:
df1.info() #looking at the datatypes in this dataset: each of the three columns has a different dtype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
tconst           73856 non-null object
averagerating    73856 non-null float64
numvotes         73856 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


It makes sense that the 'tconst' column has the object datatype, since the unique identifiers for our movies contain letters.

However, to make the numeric columns easier to work with, it would help if the remaining columns ('averagerating' and 'numvotes') shared the same datatype.

Therefore, we will try to convert the 'averagerating' column to int64 so that it matches 'numvotes'. Casting a rating such as 8.3 straight to an integer would truncate it to 8, so we first rescale the ratings from a 10-point scale to a 100-point scale.

In [118]:
df1['averagerating'] = df1['averagerating'] * 10

#multiplying the whole column by 10 converts our ratings from being out of 10 to being out of 100
#for example, an average rating of 83 tells us the same as an average rating of 8.3
#except the former can be stored as a whole number, which lets us align the datatypes of the two numeric columns

In [119]:
df1.head() #checking that the ratings were rescaled (they were!)

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,83.0,31
1,tt10384606,89.0,559
2,tt1042974,64.0,20
3,tt1043726,42.0,50352
4,tt1060240,65.0,21


In [120]:
df1.info() #checking to see if the datatypes have changed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
tconst           73856 non-null object
averagerating    73856 non-null float64
numvotes         73856 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


Alas, the datatypes have not changed: multiplying a float64 column by 10 still yields float64, so we need an explicit cast.
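
To see why arithmetic alone leaves the dtype untouched, here is a minimal sketch on a small made-up Series (the values mirror the first rows shown above but are typed in by hand, not read from the file). It also rounds before casting, on the assumption that floating-point arithmetic could leave a scaled value a hair away from a whole number.

```python
import pandas as pd

# Illustrative ratings on a 10-point scale (typed in by hand)
ratings = pd.Series([8.3, 8.9, 6.4, 4.2, 6.5])

scaled = ratings * 10            # arithmetic on floats yields floats
print(scaled.dtype)              # float64

# Round first so tiny floating-point errors cannot cause truncation,
# then cast explicitly to a 64-bit integer.
as_int = scaled.round().astype('int64')
print(as_int.dtype)              # int64
print(as_int.tolist())
```

Rounding is cheap insurance here: `astype` on its own truncates toward zero rather than rounding.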

In [121]:
df1['averagerating'] = df1['averagerating'].round().astype(int) #rounding away any floating-point error, then converting the datatype to an int

In [122]:
df1.head() #taking a look at the first 5 rows of the dataset

Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,83,31
1,tt10384606,89,559
2,tt1042974,64,20
3,tt1043726,42,50352
4,tt1060240,65,21


In [123]:
df1.info() #checking to see if the datatype of the last 2 columns now match

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
tconst           73856 non-null object
averagerating    73856 non-null int64
numvotes         73856 non-null int64
dtypes: int64(2), object(1)
memory usage: 1.7+ MB


In [124]:
df1.to_csv('imdb.ratings.cleaned.csv') #saving the cleaned dataset as a CSV

Success! The 'averagerating' column has been converted to the int64 datatype, so it now matches 'numvotes'.
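
One detail worth knowing when saving: `to_csv` writes the row index by default, and reading such a file back produces an extra 'Unnamed: 0' column. The sketch below uses an in-memory buffer and two hand-typed rows instead of the real file to show how `index=False` avoids this. Whether to keep the index is a judgment call; this is only one option.

```python
import io
import pandas as pd

# Small stand-in for the cleaned ratings table (hand-typed rows)
demo = pd.DataFrame({
    'tconst': ['tt10356526', 'tt10384606'],
    'averagerating': [83, 89],
    'numvotes': [31, 559],
})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
print(cols_with)

# index=False writes only the data columns
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_without)
```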

In [125]:
df2.head() #taking a look at the first 5 rows of the IMDB title basics dataset

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [126]:
df2.info() #looking at the datatypes in this dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
tconst             146144 non-null object
primary_title      146144 non-null object
original_title     146123 non-null object
start_year         146144 non-null int64
runtime_minutes    114405 non-null float64
genres             140736 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [127]:
df2.isna().sum() #checking for NaN values and summing them to see how much cleaning we have to do

tconst                 0
primary_title          0
original_title        21
start_year             0
runtime_minutes    31739
genres              5408
dtype: int64

In [128]:
df2['runtime_minutes'].head() #looking at what we are working with

0    175.0
1    114.0
2    122.0
3      NaN
4     80.0
Name: runtime_minutes, dtype: float64

It looks like some movies do not have data in the runtime column, so now we have to devise a way to either drop or replace these NaN values.
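
Before choosing between dropping and replacing, it can help to measure how much data dropping would cost. In the real table, 31739 of 146144 rows (roughly 22%) lack a runtime. The sketch below shows the same calculation on a made-up five-row frame; the titles 'A' to 'E' are placeholders.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the title basics table (made-up rows)
toy = pd.DataFrame({
    'primary_title': ['A', 'B', 'C', 'D', 'E'],
    'runtime_minutes': [175.0, 114.0, np.nan, 80.0, np.nan],
})

# Share of rows that would be lost by dropping missing runtimes
missing_share = toy['runtime_minutes'].isna().mean()
print(f"{missing_share:.0%} of rows have no runtime")

# Dropping keeps only the rows with a runtime
kept = toy.dropna(subset=['runtime_minutes'])
print(len(toy), len(kept))
```

When the missing share is this large, replacing the values keeps far more rows than dropping them.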

In [129]:
df2['runtime_minutes'].describe() #running the describe function, which gives us the mean (86.187), the median (the 50% row, 87) and an extreme maximum (51420)

count    114405.000000
mean         86.187247
std         166.360590
min           1.000000
25%          70.000000
50%          87.000000
75%          99.000000
max       51420.000000
Name: runtime_minutes, dtype: float64
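
The extreme maximum in this summary is one reason to look at the median as well as the mean. As a rough illustration on made-up numbers (not the real column), a single implausible runtime drags the mean far from the typical value while the median barely moves.

```python
import pandas as pd

# Made-up runtimes with one implausible outlier, loosely mimicking the max above
runtimes = pd.Series([70.0, 87.0, 99.0, 80.0, 51420.0])

mean_runtime = runtimes.mean()      # pulled upward by the outlier
median_runtime = runtimes.median()  # middle value, unaffected by the outlier

print(mean_runtime)
print(median_runtime)               # 87.0
```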

In [130]:
df2['runtime_minutes'] = df2['runtime_minutes'].fillna(df2['runtime_minutes'].median())

#replacing the NaN values with the median of the runtime column (87 minutes), so the fill is not skewed by extreme values
#note the parentheses on median(): without them, the method object itself would be used as the fill value
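
A NaN count of zero does not by itself prove that the fill used a sensible value, so it may be worth checking the fill value and the resulting dtype too. The sketch below does this on a hand-typed Series; it also shows the difference between the `median` method and the number returned by calling it.

```python
import numpy as np
import pandas as pd

# Hand-typed runtimes with one gap (illustrative values)
runtimes = pd.Series([175.0, 114.0, 122.0, np.nan, 80.0])

# `runtimes.median` is the method itself; `runtimes.median()` is the number
print(callable(runtimes.median))     # True
fill_value = runtimes.median()
print(fill_value)                    # 118.0

filled = runtimes.fillna(fill_value)

# Two checks: no gaps remain, and the column is still numeric
print(filled.isna().sum())           # 0
print(filled.dtype)                  # float64
```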

In [131]:
df2.isna().sum() #checking to see if the amount of NaN values has changed

tconst                0
primary_title         0
original_title       21
start_year            0
runtime_minutes       0
genres             5408
dtype: int64

Great, so we no longer have NaN values in the 'runtime_minutes' column, but we still have some elsewhere.

Looking at the 'original_title' column, it largely duplicates 'primary_title': in the first five rows, the two differ only where the film has a non-English original title.

Since 'primary_title' has no missing values, we should be OK to drop the 'original_title' column altogether.
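
One way to back up this impression is to count how often the two title columns actually differ. The sketch below does so on three rows copied from the `head()` output above; running the same comparison on the full table is left as an assumption about what one might do, not something this notebook did.

```python
import pandas as pd

# Three rows copied from the head() output above
toy = pd.DataFrame({
    'primary_title': ['Sunghursh',
                      'One Day Before the Rainy Season',
                      'The Wandering Soap Opera'],
    'original_title': ['Sunghursh',
                       'Ashad Ka Ek Din',
                       'La Telenovela Errante'],
})

# Boolean mask marking rows where the two titles are not identical
differs = toy['primary_title'] != toy['original_title']
n_differs = int(differs.sum())
print(n_differs, "of", len(toy), "titles differ")
print(toy[differs])
```

Note that a missing 'original_title' compares as different, so NaN rows would be counted in the mask as well.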

In [132]:
df2 = df2.drop(columns=['original_title']) #dropping the column

In [133]:
df2.isna().sum() #checking to see if the amount of NaN values has changed

tconst                0
primary_title         0
start_year            0
runtime_minutes       0
genres             5408
dtype: int64

In [136]:
df2['genres'] = df2['genres'].ffill() #forward-filling: each NaN value takes the genres from the preceding row
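
Forward-filling removes every gap, but it assigns each film the genres of whichever row happens to precede it, which may be unrelated. As a hedged alternative (not what this notebook does), missing genres could be labeled explicitly. The label 'Unknown' is a placeholder chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hand-typed genres with two gaps (illustrative values)
genres = pd.Series(['Action,Crime,Drama', np.nan, 'Drama', np.nan])

forward_filled = genres.ffill()       # copies the previous row's genres
labeled = genres.fillna('Unknown')    # marks each gap explicitly

print(forward_filled.tolist())
print(labeled.tolist())
```

The explicit label keeps the missing rows identifiable in any later genre analysis, at the cost of adding one artificial category.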

In [137]:
df2.isna().sum() #checking to see if the amount of NaN values has changed

tconst             0
primary_title      0
start_year         0
runtime_minutes    0
genres             0
dtype: int64

In [138]:
df2.head() #taking a look at the first 5 rows of the dataset; row 3 now holds the median runtime (87.0)

Unnamed: 0,tconst,primary_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,2018,87.0,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,2017,80.0,"Comedy,Drama,Fantasy"


In [139]:
df2.to_csv('imdb.title.basics.cleaned.csv') #saving the cleaned dataset as a CSV

Our dataset is now cleaned since there are no more NaN values!