# 3 - Process
After removing the empty values from the dataset, I now want to prepare every column for the next steps.
To do that, I want to convert the columns to their appropriate types:

In [36]:
import pandas as pd
import numpy as np


In [37]:
# Forward slashes work on every OS and avoid invalid escapes such as '\s' and '\m'.
# data = pd.read_csv('data/movies_clean.csv')
data = pd.read_csv('data/sample_clean.csv')

In [38]:
data.dtypes

tconst             object
primaryTitle       object
isAdult             int64
startYear           int64
runtimeMinutes     object
rating            float64
genres.1           object
mpaa               object
description        object
votes             float64
dtype: object

The first two columns won't be used for modelling, but I want to keep them because they hold the id and the name of each movie.

### ``isAdult``

In [39]:
data.isAdult.value_counts()

0    26867
1       34
Name: isAdult, dtype: int64

In [40]:
data.loc[data.isAdult == 1].head()

Unnamed: 0,tconst,primaryTitle,isAdult,startYear,runtimeMinutes,rating,genres.1,mpaa,description,votes
22124,tt0061926,Lust Weekend,1,1967,70,4.6,['drama'],Not Rated,"['David and Jane, a happily married couple, ar...",10.0
22412,tt0062417,Un épais manteau de sang,1,1968,88,6.6,['drama'],Not Rated,"[""Sorenson uses diamond robbery money to buy c...",14.0
22812,tt0063118,I Want You!,1,1969,59,5.4,['drama'],Not Rated,['Brothel madam blackmails her clients.'],13.0
22860,tt0063176,Karla,1,1969,80,5.3,"['drama', 'adult']",Not Rated,['Jackie and Lorna share a house in Florida. J...,8.0
23132,tt0063631,Space Thing,1,1968,70,4.1,"['comedy', 'sci-fi']",Unrated,"['A man is in bed reading Sci-Fi mags, and his...",258.0


In [41]:
data.loc[data['mpaa'] == 'X']

Unnamed: 0,tconst,primaryTitle,isAdult,startYear,runtimeMinutes,rating,genres.1,mpaa,description,votes
18666,tt0056166,Lash of Lust,0,1972,\N,2.6,['western'],X,"[""The man who hears with his eyes and speaks w...",32.0
19971,tt0058264,Kiss Me Quick!,0,1964,70,4.9,"['comedy', 'horror', 'sci-fi']",X,"['Sterilox, asexual ambassador from a distant ...",352.0
21060,tt0060092,The Alley Tramp,0,1968,69,4.5,['drama'],X,"[""A teenage girl sets out on a sexual odyssey ...",282.0
21435,tt0060712,Mondo Topless,0,1966,60,4.8,['documentary'],X,['Completely topless. Completely uninhibited. ...,1206.0
21469,tt0060761,The Notorious Daughter of Fanny Hill,0,1966,74,3.5,"['comedy', 'drama']",X,"[""Kissy Hill, the daughter of a notorious mada...",140.0
...,...,...,...,...,...,...,...,...,...,...
26699,tt0069811,The Body Shop,0,1972,75,3.7,"['horror', 'sci-fi']",X,['Emminent plastic surgeon and mad scientist D...,456.0
26713,tt0069831,Butterfly,1,1975,112,5.5,"['adult', 'drama']",X,"[""A beautiful young country girl tires of life...",360.0
26795,tt0069969,The Devil in Miss Jones,1,1973,67,6.2,"['adult', 'drama', 'fantasy', 'horror']",X,['Miss Jones is tired of her life and commits ...,1589.0
26856,tt0070066,The Devil's Plaything,0,1973,103,4.6,['horror'],X,"[""A lesbian vampire's soul is kept on in the b...",628.0


In this sample, the adult flag is heavily imbalanced (34 of 26,901 movies), which makes sense, as that kind of movie doesn't come up often. Besides, this category should already be covered by the MPAA ratings in their own column.
In the end, my decision is that this column is not needed.
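
One way to check how far the adult flag and the MPAA rating overlap is a cross-tabulation. The following is only a minimal sketch on a made-up frame: `toy` and its values are hypothetical stand-ins, not rows from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real `data` frame; values are illustrative only.
toy = pd.DataFrame({
    "isAdult": [0, 0, 1, 1, 0],
    "mpaa": ["X", "R", "X", "Not Rated", "Not Rated"],
})

# Rows are isAdult values, columns are MPAA labels, cells are movie counts.
overlap = pd.crosstab(toy["isAdult"], toy["mpaa"])
print(overlap)
```

Running the same call on `data.isAdult` and `data.mpaa` would show how many flagged movies carry each rating.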

In [42]:
cols_to_drop = []
cols_to_drop.append('isAdult')

### ``startYear``
This column is OK, but I want to keep a separate copy for visualization purposes.

In [69]:
data['year_process'] = data.startYear

### ``runtimeMinutes``

In [43]:
data.runtimeMinutes.value_counts(dropna=False)

90     1215
95      893
85      801
100     715
80      692
       ... 
218       1
236       1
284       1
295       1
202       1
Name: runtimeMinutes, Length: 240, dtype: int64

In the original source, empty values were marked as '\N' instead of NaN. This has two consequences: they did not show up in the cleanup phase, and because of them the column is treated as objects instead of numbers.
I could assign them the most common value or the mean value. In this case, I am choosing the mean.

In [44]:
mean = int(np.floor(data.loc[data['runtimeMinutes'] != '\\N']['runtimeMinutes'].astype('int32').mean()))
mean

89

In [49]:
data.runtimeMinutes = data.runtimeMinutes.replace('\\N', mean).astype('int32')
data.runtimeMinutes.value_counts(dropna=False)

90     1215
95      893
85      801
100     715
89      697
       ... 
218       1
236       1
284       1
295       1
202       1
Name: runtimeMinutes, Length: 239, dtype: int64
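
As an alternative to comparing against the '\N' string by hand, the same conversion could be sketched with `pd.to_numeric`, which turns unparseable entries into NaN. This is a minimal sketch on a made-up series: `runtime`, `numeric`, `fill_value` and `filled` are hypothetical names, and the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the runtimeMinutes column; values are illustrative only.
runtime = pd.Series(["70", "\\N", "88", "59"], name="runtimeMinutes")

# errors="coerce" turns anything that is not a number (here '\N') into NaN.
numeric = pd.to_numeric(runtime, errors="coerce")

# Same fill rule as above: the mean of the known values, rounded down.
fill_value = int(np.floor(numeric.mean()))
filled = numeric.fillna(fill_value).astype("int32")
print(filled.tolist())
```

Another option would be to pass `na_values=['\\N']` to `pd.read_csv`, so that the markers are read as NaN from the start and show up in the cleanup step.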

In [50]:
data.dtypes

tconst             object
primaryTitle       object
isAdult             int64
startYear           int64
runtimeMinutes      int32
rating            float64
genres             object
mpaa               object
description        object
votes             float64
dtype: object

### ``genres.1``
Here, the first thing I must do is rename the column to remove that .1.

In [46]:
data.rename(columns={'genres.1':'genres'}, inplace=True)

In [70]:
data.head(1)

Unnamed: 0,tconst,primaryTitle,isAdult,startYear,runtimeMinutes,rating,genres,mpaa,description,votes,year_process
0,tt0023526,Stranger in Town,0,1932,65,6.1,['drama'],PG,"[""Ulysses Crickle owns a small town grocery st...",115.0,1932


(use the functions in NLP to create the columns here)
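
The NLP helpers mentioned in the note above are not shown here. As a stand-in, and only as a sketch under the assumption that each `genres` cell is the text form of a Python list (as the `head()` output suggests), the column could be expanded into one 0/1 column per genre like this. The names `genres`, `parsed` and `genre_flags` are hypothetical.

```python
import ast

import pandas as pd

# Hypothetical stand-in for the genres column, which stores lists as text.
genres = pd.Series(["['drama']", "['comedy', 'sci-fi']", "['drama', 'adult']"])

# literal_eval safely turns the text "['drama']" into the list ['drama'].
parsed = genres.apply(ast.literal_eval)

# Join each list with '|' so str.get_dummies can build one 0/1 column per genre.
genre_flags = parsed.str.join("|").str.get_dummies(sep="|")
print(genre_flags)
```

The resulting frame could then be joined back onto `data` by index if this approach turns out to fit.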

### ``mpaa``
The age rating has a list of defined values: 
* **G – General Audiences**
* **PG – Parental Guidance Suggested**
* **PG-13 – Parents Strongly Cautioned**
* **R – Restricted**
* **NC-17 – Adults Only**  
  
I want to minimize the number of unique values present in the column.

In [51]:
data.mpaa.value_counts(dropna=False)

Not Rated    13987
Approved      6074
Passed        3600
R              824
PG             442
TV-PG          425
G              354
Unrated        327
GP             223
TV-G           179
X              125
TV-14          100
M               90
M/PG            60
PG-13           38
TV-MA           32
TV-Y7            7
TV-Y7-FV         4
NC-17            4
TV-Y             2
Open             2
TV-13            2
Name: mpaa, dtype: int64

Not Rated and Unrated are going to be the same:

In [52]:
data.mpaa = data.mpaa.replace('Unrated', 'Not Rated')

Approved and Passed are going to be the same:

In [54]:
data.mpaa = data.mpaa.replace('Passed', 'Approved')

X, M, M/PG and TV-MA are going to be NC-17:

In [67]:
data.mpaa = data.mpaa.replace(dict.fromkeys(['X','M','M/PG','TV-MA'], 'NC-17'))

TV-PG, GP, TV-Y7 and TV-Y7-FV are going to be PG:

In [61]:
data.mpaa = data.mpaa.replace(dict.fromkeys(['TV-PG','GP','TV-Y7','TV-Y7-FV'], 'PG'))

TV-G, TV-Y and Open are going to be G:

In [63]:
data.mpaa = data.mpaa.replace(dict.fromkeys(['TV-G','TV-Y','Open'], 'G'))

TV-14 and TV-13 are going to be PG-13:

In [65]:
data.mpaa = data.mpaa.replace(dict.fromkeys(['TV-14','TV-13'], 'PG-13'))

In [68]:
data.mpaa.value_counts(dropna=False)

Not Rated    14314
Approved      9674
PG            1101
R              824
G              537
NC-17          311
PG-13          140
Name: mpaa, dtype: int64