# Matthew's Data Cleaning

**In this notebook I clean "The Movie DB" dataset (tmdb.movies.csv.gz) so that it can be merged into the master table.**

After talking it over, the group decided to merge our tables into a "master table" using the [relevant columns](https://docs.google.com/spreadsheets/d/1FrTlLWIb5fVswDBTqliIks8QU7fY97RTtus5--7r_q4/edit#gid=596077008) we agreed on. Of the two datasets I explored, we decided to keep data from "The Movie DB" dataset. The columns to keep and to create are:

**Keeping**

- release_date
- original_title
- vote_average
- vote_count
- popularity 

**Creating**

- movie_id
- release_month

The 'movie_id' column will be created by combining the 'original_title' and the year from the 'release_date' column. This column will be used as a key when creating the master table. The 'release_month' column will be used in future analysis.

In [160]:
# Import packages for Cleaning
import pandas as pd
import numpy as np

In [161]:
# There is an index column in this data (the first column), so I set the parameter 'index_col' to 0.
tmdb = pd.read_csv('../Data/zippedData/tmdb.movies.csv.gz', index_col=0)

In [162]:
# Visually confirm no issues with import.
tmdb.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


## Selecting Desired Columns

In [163]:
# First, I'm going to drop the columns that we decided not to use: 'genre_ids', 'id', and 'original_language'
tmdb.drop(columns=['genre_ids', 'id','original_language'], inplace=True)

In [164]:
# Before dropping one of them, I'm going to check the differences between 'original_title' and 'title'
tmdb[tmdb['original_title'] != tmdb['title']]

Unnamed: 0,original_title,popularity,release_date,title,vote_average,vote_count
14,Saw 3D,20.370,2010-10-28,Saw: The Final Chapter,6.0,1488
49,Tres metros sobre el cielo,13.721,2010-12-20,Three Steps Above Heaven,7.5,960
67,Arthur 3: la guerre des deux mondes,12.679,2010-08-22,Arthur 3: The War of the Two Worlds,5.6,865
70,El secreto de sus ojos,12.531,2010-04-16,The Secret in Their Eyes,7.9,1141
75,サマーウォーズ,12.275,2010-10-13,Summer Wars,7.5,447
...,...,...,...,...,...,...
26409,你好，之华,0.600,2018-11-09,Last Letter,6.0,1
26422,El verano del león eléctrico,0.600,2018-11-12,The Summer of the Electric Lion,6.0,1
26432,Contes de Juillet,0.600,2018-03-09,July Tales,6.0,1
26494,La última virgen,0.600,2018-05-26,The Last Virgin,2.0,1


It looks like 'original_title' is either:
1. The title of the movie in its original language
2. The 'prototype' or 'working' title for the movie

I'm going to keep the 'title' column instead, as keeping the 'original_title' column will create problems when creating the 'movie_id' column later.
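
To put a number on how often the two title columns disagree, a quick count like the one below could help. This is only a sketch: `sample` is a toy frame built from three rows of the output above, and the names `differs`, `n_differs`, and `share_differs` are my own. In the notebook the same mask would be applied to `tmdb` before 'original_title' is dropped.

```python
import pandas as pd

# Toy rows copied from the comparison output above (not the full dataset).
sample = pd.DataFrame({
    "original_title": ["Saw 3D", "El secreto de sus ojos", "Inception"],
    "title": ["Saw: The Final Chapter", "The Secret in Their Eyes", "Inception"],
})

# Boolean mask: True where the two title columns disagree.
differs = sample["original_title"] != sample["title"]

n_differs = int(differs.sum())   # number of mismatching rows
share_differs = differs.mean()   # fraction of mismatching rows

print(f"{n_differs} of {len(sample)} rows differ ({share_differs:.0%})")
```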

In [165]:
tmdb.drop(columns='original_title', inplace=True)

# Visually confirm expected result
tmdb.head()

Unnamed: 0,popularity,release_date,title,vote_average,vote_count
0,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,28.005,1995-11-22,Toy Story,7.9,10174
4,27.92,2010-07-16,Inception,8.3,22186


## Confirming Valid Data

I want to make sure that I don't have any placeholder data in my columns before sending it off to the master table.

### Popularity

In [166]:
# Checking popularity
tmdb['popularity'].value_counts()

0.600     7037
1.400      649
0.840      587
0.624      104
0.625       92
          ... 
3.742        1
14.749       1
7.924        1
8.414        1
9.060        1
Name: popularity, Length: 7425, dtype: int64

After discussion with my group, we decided that the value '0.600' is likely a quirk of The Movie DB's algorithm for calculating popularity, and that it most likely reflects low user interaction. We decided not to do anything with it, since it's not technically placeholder data.
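
One way to test the low-interaction idea would be to compare vote counts for rows at 0.600 against the rest. The sketch below uses made-up toy values, not the real data, and assumes 0.6 is the value of interest. `FLOOR` and the other variable names are hypothetical.

```python
import numpy as np
import pandas as pd

FLOOR = 0.6  # assumed value of interest, taken from the value_counts() output above

# Toy data for illustration only; in the notebook this would be `tmdb`.
sample = pd.DataFrame({
    "popularity": [33.533, 0.6, 0.6, 0.6, 1.4],
    "vote_count": [10788, 1, 1, 2, 5],
})

# np.isclose avoids relying on exact floating-point equality.
at_floor = np.isclose(sample["popularity"], FLOOR)

n_at_floor = int(at_floor.sum())
median_votes_floor = sample.loc[at_floor, "vote_count"].median()
median_votes_other = sample.loc[~at_floor, "vote_count"].median()

print(n_at_floor, median_votes_floor, median_votes_other)
```

If the median vote count at 0.600 turns out to be much lower than elsewhere in the real data, that would support the low-interaction explanation.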

### Release Date

In [167]:
tmdb['release_date'].value_counts()

2010-01-01    269
2011-01-01    200
2012-01-01    155
2014-01-01    155
2013-01-01    145
             ... 
1985-08-30      1
2012-06-28      1
1966-08-24      1
2011-02-27      1
2010-01-03      1
Name: release_date, Length: 3433, dtype: int64

There is a strong possibility that January 1st is a placeholder month and day. This won't be a problem for creating the 'movie_id' column since the year is probably correct, but it will be a problem when comparing the release month or day to anything. The five most common dates shown above are all January 1st and together account for 924 rows; the column holds 3,433 distinct dates in total.
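
If the January 1st rows need to be excluded from month-based analysis later, they could be flagged rather than dropped. This is a minimal sketch on toy dates. The 'jan1_flag' column is my own invention and is not one of the columns the group agreed on.

```python
import pandas as pd

# Toy dates for illustration; in the notebook this would be tmdb['release_date'].
sample = pd.DataFrame({
    "release_date": ["2010-01-01", "2010-11-19", "2011-01-01", "1995-11-22"],
})

dates = pd.to_datetime(sample["release_date"])

# True where the date is January 1st, i.e. a possible placeholder.
sample["jan1_flag"] = (dates.dt.month == 1) & (dates.dt.day == 1)

n_jan1 = int(sample["jan1_flag"].sum())
print(f"{n_jan1} of {len(sample)} rows fall on January 1st")
```

Some films really are released on January 1st, so the flag only marks rows as suspect; it doesn't prove they are placeholders.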

### Title Column

In [168]:
tmdb['title'].value_counts()

Home                                     7
Eden                                     7
Aftermath                                5
Lucky                                    5
Truth or Dare                            5
                                        ..
Game of Thrones: Conquest & Rebellion    1
My Neighbourhood                         1
Erratum                                  1
A Path Appears                           1
James Gandolfini: Tribute To A Friend    1
Name: title, Length: 24688, dtype: int64

There are some common titles here, but nothing strikes me as odd.

### Vote Average

In [169]:
tmdb['vote_average'].value_counts()

6.0     1940
7.0     1560
5.0     1486
10.0    1252
8.0     1231
        ... 
9.4        6
1.2        3
1.4        3
9.1        2
9.7        2
Name: vote_average, Length: 91, dtype: int64

Some of these values look odd to me, but none of them seem impossible.
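
A quick way to back up "none of them seem impossible" is a range check, plus a look at how many votes sit behind the extreme averages. The sketch below assumes the rating scale runs from 0 to 10, which matches the values shown above. The toy rows and variable names are hypothetical.

```python
import pandas as pd

# Toy data for illustration only; in the notebook this would be `tmdb`.
sample = pd.DataFrame({
    "vote_average": [7.7, 10.0, 0.0, 8.3],
    "vote_count": [10788, 1, 1, 22186],
})

# Every average should fall inside the assumed 0-10 scale.
in_range = sample["vote_average"].between(0, 10)
all_in_range = bool(in_range.all())

# Perfect or zero averages are plausible when only a few votes exist.
extremes = sample[sample["vote_average"].isin([0.0, 10.0])]
max_votes_at_extremes = int(extremes["vote_count"].max())

print(all_in_range, max_votes_at_extremes)
```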

### Vote Count

In [170]:
tmdb['vote_count'].value_counts()

1       6541
2       3044
3       1757
4       1347
5        969
        ... 
2328       1
6538       1
489        1
2600       1
2049       1
Name: vote_count, Length: 1693, dtype: int64

Again, nothing terribly odd here.

## Movie ID Creation

In [171]:
# This column is the lowercase name of the movie, combined with the year released at the end.
tmdb['movie_id'] = tmdb['title'].str.strip().str.lower()

# Now append the release year (the first four characters of 'release_date')
tmdb['movie_id'] = tmdb['movie_id']+tmdb['release_date'].str.slice(0, 4)

In [172]:
# Visually confirm expected result.
tmdb.head()

Unnamed: 0,popularity,release_date,title,vote_average,vote_count,movie_id
0,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,harry potter and the deathly hallows: part 12010
1,28.734,2010-03-26,How to Train Your Dragon,7.7,7610,how to train your dragon2010
2,28.515,2010-05-07,Iron Man 2,6.8,12368,iron man 22010
3,28.005,1995-11-22,Toy Story,7.9,10174,toy story1995
4,27.92,2010-07-16,Inception,8.3,22186,inception2010
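
Since 'movie_id' is meant to be a key, it's worth checking that it is unique, especially with titles like "Home" appearing several times. The sketch below rebuilds the key the same way as the cell above, then counts collisions. The release dates for "Home" are invented for illustration.

```python
import pandas as pd

# Toy rows; the "Home" release dates are hypothetical.
sample = pd.DataFrame({
    "title": ["Home", "Home", "Home", "Toy Story"],
    "release_date": ["2015-03-27", "2015-06-01", "2016-01-01", "1995-11-22"],
})

# Same construction as in the notebook: lowercase title + release year.
sample["movie_id"] = (
    sample["title"].str.strip().str.lower()
    + sample["release_date"].str.slice(0, 4)
)

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = sample[sample["movie_id"].duplicated(keep=False)]
n_dupes = len(dupes)

print(dupes["movie_id"].tolist())
```

Any rows that show up in `dupes` would need to be resolved before the column can act as a reliable merge key.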


## Setting Datatypes

In [173]:
# Check datatypes
tmdb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26517 entries, 0 to 26516
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   popularity    26517 non-null  float64
 1   release_date  26517 non-null  object 
 2   title         26517 non-null  object 
 3   vote_average  26517 non-null  float64
 4   vote_count    26517 non-null  int64  
 5   movie_id      26517 non-null  object 
dtypes: float64(2), int64(1), object(3)
memory usage: 1.4+ MB


In [174]:
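# All datatypes are as I need them to be except for 'release_date'. It needs to be a datetime object.
# pd.to_datetime is used here because casting to a unit-less 'datetime64' fails on newer pandas versions.
tmdb['release_date'] = pd.to_datetime(tmdb['release_date'])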
# Confirm change
tmdb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26517 entries, 0 to 26516
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   popularity    26517 non-null  float64       
 1   release_date  26517 non-null  datetime64[ns]
 2   title         26517 non-null  object        
 3   vote_average  26517 non-null  float64       
 4   vote_count    26517 non-null  int64         
 5   movie_id      26517 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 1.4+ MB


In [175]:
# One more confirmation that I can get the month from the dt object, as it will be needed to create 'release_month'
tmdb['release_date'].dt.month

0        11
1         3
2         5
3        11
4         7
         ..
26512    10
26513     5
26514    10
26515     6
26516    10
Name: release_date, Length: 26517, dtype: int64

All other columns are already the desired datatype.
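
For extra safety during the date conversion, malformed strings can be surfaced instead of raising an error. This is a sketch, assuming the dates follow the `YYYY-MM-DD` pattern seen in the output above. The "not a date" entry is a made-up bad value.

```python
import pandas as pd

# Toy series with one deliberately malformed value.
raw = pd.Series(["2010-11-19", "1995-11-22", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

n_bad = int(parsed.isna().sum())
first_month = int(parsed.dt.month.iloc[0])

print(f"{n_bad} value(s) could not be parsed")
```

The `info()` output above shows 26517 non-null dates after conversion, so a check like this would be expected to report zero bad values on the real column.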

## Release Month Creation


In [176]:
# Now that 'release_date' has been converted to a datetime object we can create the 'release_month' column
tmdb['release_month'] = tmdb['release_date'].dt.month

In [177]:
# One final check that data looks as expected
tmdb.head()

Unnamed: 0,popularity,release_date,title,vote_average,vote_count,movie_id,release_month
0,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,harry potter and the deathly hallows: part 12010,11
1,28.734,2010-03-26,How to Train Your Dragon,7.7,7610,how to train your dragon2010,3
2,28.515,2010-05-07,Iron Man 2,6.8,12368,iron man 22010,5
3,28.005,1995-11-22,Toy Story,7.9,10174,toy story1995,11
4,27.92,2010-07-16,Inception,8.3,22186,inception2010,7


In [178]:
# Set index as movie_id for master table. set_index returns a new DataFrame, so assign it back.
tmdb = tmdb.set_index('movie_id')
tmdb

movie_id,popularity,release_date,title,vote_average,vote_count,release_month
harry potter and the deathly hallows: part 12010,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,11
how to train your dragon2010,28.734,2010-03-26,How to Train Your Dragon,7.7,7610,3
iron man 22010,28.515,2010-05-07,Iron Man 2,6.8,12368,5
toy story1995,28.005,1995-11-22,Toy Story,7.9,10174,11
inception2010,27.920,2010-07-16,Inception,8.3,22186,7
...,...,...,...,...,...,...
laboratory conditions2018,0.600,2018-10-13,Laboratory Conditions,0.0,1,10
_exhibit_84xxx_2018,0.600,2018-05-01,_EXHIBIT_84xxx_,0.0,1,5
the last one2018,0.600,2018-10-01,The Last One,0.0,1,10
trailer made2018,0.600,2018-06-22,Trailer Made,0.0,1,6


In [179]:
# Finally save the file as a CSV for easier sharing
tmdb.to_csv('tmdb_data')
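
CSV files don't store datatypes, so whoever loads this file for the master table will get 'release_date' back as plain strings unless they ask for dates. The sketch below shows a round trip through an in-memory buffer so it runs without touching disk. The toy frame and the choice of `index=False` are assumptions for illustration.

```python
import io

import pandas as pd

# Toy frame mirroring a few of the cleaned columns.
sample = pd.DataFrame({
    "movie_id": ["toy story1995", "inception2010"],
    "release_date": pd.to_datetime(["1995-11-22", "2010-07-16"]),
    "release_month": [11, 7],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the CSV format cannot store.
restored = pd.read_csv(buffer, parse_dates=["release_date"])

print(restored.dtypes)
```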