# Matthew's Data Cleaning

**In this notebook I clean "The Movie DB" dataset (tmdb.movies.csv.gz) so that it can be merged into the master table.**

After talking with the group, we decided to merge our tables into a "master table" using the columns we agreed were [relevant](https://docs.google.com/spreadsheets/d/1FrTlLWIb5fVswDBTqliIks8QU7fY97RTtus5--7r_q4/edit#gid=596077008). Of the two datasets I explored, we decided to keep data from "The Movie DB" dataset. The columns kept from that table will be:

**Keeping**

- release_date
- original_title
- vote_average
- vote_count
- popularity 

**Creating**

- movie_id
- release_month

A column called 'movie_id' will be created by combining the 'original_title' and the year from the 'release_date' column. This column will be used as a key when creating the master table. The 'release_month' column will be used in future analysis.
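
As a rough sketch of what those two created columns could look like, the snippet below builds them on a tiny stand-in table. The `sample` frame and the underscore-separated key format are assumptions made for illustration; the group's actual key format is not specified here.

```python
import pandas as pd

# Toy rows standing in for the real table (two films from the dataset).
sample = pd.DataFrame({
    'original_title': ['Toy Story', 'Inception'],
    'release_date': ['1995-11-22', '2010-07-16'],
})

# Parse the date strings so the year and month can be read off directly.
dates = pd.to_datetime(sample['release_date'], format='%Y-%m-%d')

# Hypothetical key format: title, an underscore, then the four-digit year.
sample['movie_id'] = sample['original_title'] + '_' + dates.dt.year.astype(str)

# Month of release as an integer from 1 to 12.
sample['release_month'] = dates.dt.month
```

Including the year in the key is what keeps remakes or same-named films from different years apart.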

In [2]:
# Import packages for Cleaning
import pandas as pd
import numpy as np

In [3]:
# The first column of this file is an index, so I set the parameter 'index_col' to 0.
tmdb = pd.read_csv('../Data/zippedData/tmdb.movies.csv.gz', index_col=0)
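
To illustrate what `index_col=0` changes, here is a small sketch that reads the same CSV text with and without the parameter. The `csv_text` string is made up for the example and only mimics the layout of the real file.

```python
import io
import pandas as pd

# Made-up CSV text whose first column is an unnamed row index.
csv_text = ",title,vote_count\n0,Toy Story,10174\n1,Inception,22186\n"

# With index_col=0 the first column becomes the DataFrame index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, pandas keeps that column as data and names it 'Unnamed: 0'.
without_index = pd.read_csv(io.StringIO(csv_text))
```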

In [4]:
# Visually confirm no issues with import.
tmdb.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


## Selecting Desired Columns

In [5]:
# First, I'm going to drop the columns that we decided not to use: 'genre_ids', 'id', and 'original_language'
tmdb.drop(columns=['genre_ids', 'id','original_language'], inplace=True)

In [6]:
# Before dropping one of them, I'm going to check the differences between 'original_title' and 'title'
tmdb[tmdb['original_title'] != tmdb['title']]

Unnamed: 0,original_title,popularity,release_date,title,vote_average,vote_count
14,Saw 3D,20.370,2010-10-28,Saw: The Final Chapter,6.0,1488
49,Tres metros sobre el cielo,13.721,2010-12-20,Three Steps Above Heaven,7.5,960
67,Arthur 3: la guerre des deux mondes,12.679,2010-08-22,Arthur 3: The War of the Two Worlds,5.6,865
70,El secreto de sus ojos,12.531,2010-04-16,The Secret in Their Eyes,7.9,1141
75,サマーウォーズ,12.275,2010-10-13,Summer Wars,7.5,447
...,...,...,...,...,...,...
26409,你好，之华,0.600,2018-11-09,Last Letter,6.0,1
26422,El verano del león eléctrico,0.600,2018-11-12,The Summer of the Electric Lion,6.0,1
26432,Contes de Juillet,0.600,2018-03-09,July Tales,6.0,1
26494,La última virgen,0.600,2018-05-26,The Last Virgin,2.0,1


It looks like 'original_title' is either  
1. The title of the movie in its original language
2. The 'prototype' or 'working' title for the movie

I'm going to keep the 'title' column instead, as keeping the 'original_title' column will create problems when creating the 'movie_id' column later.
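
One way the problem could show up, assuming the other tables identify films by their English release title, is sketched below. The `other` table and the underscore key format are hypothetical; the three film rows are taken from the outputs above.

```python
import pandas as pd

# Two rows from the comparison output above, plus one where both titles agree.
titles = pd.DataFrame({
    'original_title': ['Saw 3D', 'El secreto de sus ojos', 'Inception'],
    'title': ['Saw: The Final Chapter', 'The Secret in Their Eyes', 'Inception'],
    'release_date': ['2010-10-28', '2010-04-16', '2010-07-16'],
})

# Hypothetical second table that lists films by English title and year.
other = pd.DataFrame({'movie_id': ['The Secret in Their Eyes_2010', 'Inception_2010']})

# Build a candidate key from each title column (year = first four characters).
year = titles['release_date'].str[:4]
key_from_original = titles['original_title'] + '_' + year
key_from_title = titles['title'] + '_' + year

# Count how many keys find a partner in the other table under each choice.
matches_original = key_from_original.isin(other['movie_id']).sum()
matches_title = key_from_title.isin(other['movie_id']).sum()
```

In this toy case the key built from 'title' finds more partners than the one built from 'original_title', which is the kind of mismatch I want to avoid in the merge.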

In [7]:
tmdb.drop(columns='original_title', inplace=True)

# Visually confirm expected result
tmdb.head()

Unnamed: 0,popularity,release_date,title,vote_average,vote_count
0,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,28.005,1995-11-22,Toy Story,7.9,10174
4,27.92,2010-07-16,Inception,8.3,22186


## Setting Datatypes

In [8]:
# Check datatypes
tmdb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26517 entries, 0 to 26516
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   popularity    26517 non-null  float64
 1   release_date  26517 non-null  object 
 2   title         26517 non-null  object 
 3   vote_average  26517 non-null  float64
 4   vote_count    26517 non-null  int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 1.2+ MB


In [22]:
# All datatypes are as I need them to be except for 'release_date', which needs to be a datetime.
# pd.to_datetime is used because newer pandas versions reject the unitless 'datetime64' dtype in astype.
tmdb['release_date'] = pd.to_datetime(tmdb['release_date'])
# Confirm change
tmdb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26517 entries, 0 to 26516
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   popularity    26517 non-null  float64       
 1   release_date  26517 non-null  datetime64[ns]
 2   title         26517 non-null  object        
 3   vote_average  26517 non-null  float64       
 4   vote_count    26517 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 1.2+ MB
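
The conversion here went through cleanly, with all 26517 values non-null. If a date column did contain malformed strings, a check along the following lines could surface them. This is a sketch on made-up data, and the `raw` series is not part of the dataset.

```python
import pandas as pd

# Toy column; the last value is deliberately malformed.
raw = pd.Series(['2010-11-19', '1995-11-22', 'not a date'])

# errors='coerce' turns strings that do not match the format into NaT
# instead of raising an exception.
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# Number of values that failed to parse.
n_bad = parsed.isna().sum()
```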


In [26]:
# One more confirmation that I can get the month from the dt object, as it will be needed to create 'release_month'
tmdb['release_date'][0]

Timestamp('2010-11-19 00:00:00')

All other columns are already the desired datatype.
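
For reference, a minimal sketch of reading the month from datetime values, both for a single `Timestamp` and for a whole column at once. The `dates` series is a stand-in built from three dates in the preview, not the real column.

```python
import pandas as pd

# Stand-in for the converted 'release_date' column.
dates = pd.Series(pd.to_datetime(['2010-11-19', '2010-03-26', '1995-11-22']))

# A single Timestamp exposes the month as a plain attribute.
single_month = dates[0].month

# The .dt accessor applies the same lookup to every row at once.
all_months = dates.dt.month
```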