### Data Cleaning

In this notebook we will create a DataFrame by merging all of the data that was scraped.  We then have to do some data cleaning before the data is ready for analysis and modeling.

In [27]:
import pandas as pd
import pickle
import numpy as np
from datetime import datetime

In [2]:
# Set constants used to load data
num_ratings = 73
num_budgets = 6

budgets_path = 'data/budgets/'
ratings_path = 'data/scraped/'

In [3]:
def get_budgets(num_budgets):
    # Load a df from each budget_{i}.csv file and concatenate them into
    # the df that will be returned. DataFrame.append was removed in
    # pandas 2.0, so pd.concat is used here.
    frames = [pd.read_csv(budgets_path + f'budget_{i}.csv')
              for i in range(num_budgets)]
    df = pd.concat(frames, ignore_index=True)

    # Drop the old index column written by the scraper and normalize titles.
    df.drop(['Unnamed: 0'], axis=1, inplace=True)
    df['Title'] = df['Title'].str.title()

    df.to_csv('all_budgets.csv')

    return df

In [4]:
def get_ratings(num_ratings):
    # Load a df from each scraped_{i}.csv file and concatenate them into
    # the df that will be returned. DataFrame.append was removed in
    # pandas 2.0, so pd.concat is used here.
    frames = [pd.read_csv(ratings_path + f'scraped_{i}.csv')
              for i in range(num_ratings)]
    df = pd.concat(frames, ignore_index=True)

    # Drop the old index column written by the scraper.
    df.drop(['Unnamed: 0'], axis=1, inplace=True)

    df.to_csv('all_ratings.csv')

    return df

In [5]:
df_b = get_budgets(num_budgets)
print("The budgets DataFrame shape is:", df_b.shape)

The budgets DataFrame shape is: (5559, 5)


In [6]:
df_r = get_ratings(num_ratings)
print("The ratings DataFrame shape is:", df_r.shape)

The ratings DataFrame shape is: (7269, 7)


In [7]:
# Merge budgets and ratings DataFrames on Title and clean up the new DF.
df_final = pd.merge(df_r, df_b, on='Title')
df_final.drop_duplicates(inplace=True)
df_final.reset_index(drop=True, inplace=True)
print("The final DataFrame shape is:", df_final.shape)
df_final.head()

The final DataFrame shape is: (3019, 11)


Unnamed: 0,Genre,Language,Nudity,Rating,Sex,Title,Violence,Release_Date,Production_Budget,Domestic_Gross,Worldwide_Gross
0,Action,Medium,Light,PG-13,,Skyscraper,Medium,7/13/2018,125000000,31562840,71955649
1,Animated,,Light,PG,,Hotel Transylvania 3: Summer Vacation,Light,7/13/2018,65000000,57597439,112373051
2,Thriller,Heavy,Medium,R,Medium,The First Purge,Heavy,7/4/2018,13000000,52929930,76023230
3,Science Fiction,Heavy,Light,PG-13,,Ant-Man And The Wasp,Medium,7/6/2018,130000000,142156135,293129020
4,Thriller,Heavy,,R,,Sicario: Day Of The Soldado,Heavy,6/29/2018,35000000,44370869,60095029
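
The merge above is an inner join on Title alone, so titles found in only one source are dropped, and a title shared by several films (a remake, for example) is matched against every row carrying that title. The following is a minimal sketch, using made-up toy frames rather than the scraped data, of how an outer merge with `indicator=True` could be used to count what matched and to spot repeated titles. The names `toy_ratings`, `toy_budgets`, `match_counts` and `repeated_titles` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for df_r and df_b (hypothetical values, not the scraped data).
toy_ratings = pd.DataFrame({
    'Title': ['Skyscraper', 'Fantastic Four', 'Only Rated'],
    'Rating': ['PG-13', 'PG-13', 'R'],
})
toy_budgets = pd.DataFrame({
    'Title': ['Skyscraper', 'Fantastic Four', 'Fantastic Four', 'Only Budgeted'],
    'Release_Date': ['7/13/2018', '8/7/2015', '7/8/2005', '1/1/2000'],
})

# An outer merge keeps unmatched rows; the _merge column records
# whether each row came from the left frame, the right frame, or both.
merged = pd.merge(toy_ratings, toy_budgets, on='Title',
                  how='outer', indicator=True)
match_counts = merged['_merge'].value_counts()
print(match_counts)

# Titles that matched more than one budget row.
both = merged[merged['_merge'] == 'both']
repeated_titles = (
    both.loc[both['Title'].duplicated(keep=False), 'Title'].unique().tolist()
)
print(repeated_titles)
```

In the toy data the single 'Fantastic Four' rating row is attached to both budget rows, which is the same pattern the two 'Fantastic Four' releases show later in this notebook.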


In [8]:
df_final.to_csv('data/movie_df_rough.csv', index=False)

In [9]:
# Check for null values
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3019 entries, 0 to 3018
Data columns (total 11 columns):
Genre                3009 non-null object
Language             3019 non-null object
Nudity               3019 non-null object
Rating               3008 non-null object
Sex                  3019 non-null object
Title                3019 non-null object
Violence             3019 non-null object
Release_Date         3019 non-null object
Production_Budget    3019 non-null int64
Domestic_Gross       3019 non-null int64
Worldwide_Gross      3019 non-null int64
dtypes: int64(3), object(8)
memory usage: 259.5+ KB


In [10]:
# Fill null values (found in the Genre and Rating columns) with Unknown.
df_final.fillna('Unknown', inplace=True)
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3019 entries, 0 to 3018
Data columns (total 11 columns):
Genre                3019 non-null object
Language             3019 non-null object
Nudity               3019 non-null object
Rating               3019 non-null object
Sex                  3019 non-null object
Title                3019 non-null object
Violence             3019 non-null object
Release_Date         3019 non-null object
Production_Budget    3019 non-null int64
Domestic_Gross       3019 non-null int64
Worldwide_Gross      3019 non-null int64
dtypes: int64(3), object(8)
memory usage: 259.5+ KB


In [11]:
# Check how many unique Genres are in the DataFrame.
len(df_final['Genre'].unique())

214

In [12]:
# Set Genres with title format and see if that removed any duplicates.
df_final['Genre'] = df_final['Genre'].str.title()
len(df_final['Genre'].unique())

203

We will look at one of the genres just to see how poorly labelled they are.  'Sf' is not even a word, much less a movie genre.

In [13]:
df_final.loc[df_final['Genre'].str.contains('Sf')]

Unnamed: 0,Genre,Language,Nudity,Rating,Sex,Title,Violence,Release_Date,Production_Budget,Domestic_Gross,Worldwide_Gross
1742,Sf,Medium,Light,"PG-13 for language, intense action violence, d...",Light,Fantastic Four,Medium,8/7/2015,120000000,56117548,167977596
1743,Sf,Medium,Light,"PG-13 for language, intense action violence, d...",Light,Fantastic Four,Medium,7/8/2005,87500000,154696080,333132750
1747,Sf,Heavy,Light,PG-13,,War Of The Worlds,Medium,6/29/2005,132000000,234280354,606836535


At this point I just manually relabelled the genres using my domain knowledge.  I attempted to label the movies in a general way.  For example, the 'Adventure' genre became the 'Action' genre.

In [14]:
df_final.loc[df_final['Genre'].str.contains('Advent'), 'Genre'] = 'Action'
df_final.loc[df_final['Genre'].str.contains('Bibl'), 'Genre'] = 'Religious'
df_final.loc[df_final['Genre'].str.contains('Spy'), 'Genre'] = 'Thriller'
df_final.loc[df_final['Genre'].str.contains('Drama'), 'Genre'] = 'Drama'
df_final.loc[df_final['Genre'].str.contains('Romantic'), 'Genre'] = 'Romance'
df_final.loc[df_final['Genre'].str.contains('Evang'), 'Genre'] = 'Religious'
df_final.loc[df_final['Genre'].str.contains('Crime'), 'Genre'] = 'Thriller'
df_final.loc[df_final['Genre'].str.contains('Thriller'), 'Genre'] = 'Thriller'
df_final.loc[df_final['Genre'].str.contains('Science'), 'Genre'] = 'Science Fiction'
df_final.loc[df_final['Genre'].str.contains('Historic'), 'Genre'] = 'Historical'
df_final.loc[df_final['Genre'].str.contains('Animated'), 'Genre'] = 'Animated'
df_final.loc[df_final['Genre'].str.contains('Suspense'), 'Genre'] = 'Thriller'
df_final.loc[df_final['Genre'].str.contains('Comedy'), 'Genre'] = 'Comedy'
df_final.loc[df_final['Genre'].str.contains('Sex Farce'), 'Genre'] = 'Comedy'
df_final.loc[df_final['Genre'].str.contains('Action'), 'Genre'] = 'Action'
df_final.loc[df_final['Genre'].str.contains('Comic'), 'Genre'] = 'Comedy'
df_final.loc[df_final['Genre'].str.contains('Spoof'), 'Genre'] = 'Comedy'
df_final.loc[df_final['Genre'].str.contains('Detective'), 'Genre'] = 'Thriller'
df_final.loc[df_final['Genre'].str.contains('Tragedy'), 'Genre'] = 'Drama'
df_final.loc[df_final['Genre'].str.contains('Biography'), 'Genre'] = 'Biography'
df_final.loc[df_final['Genre'].str.contains('War'), 'Genre'] = 'Action'
df_final.loc[df_final['Genre'].str.contains('Children'), 'Genre'] = 'Family'
df_final.loc[df_final['Genre'].str.contains('Satire'), 'Genre'] = 'Comedy'
df_final.loc[df_final['Genre'].str.contains('Religious Fant'), 'Genre'] = 'Thriller'
df_final.loc[df_final['Genre'].str.contains('Noir'), 'Genre'] = 'Thriller'
df_final.loc[df_final['Genre'].str.contains('Fairy'), 'Genre'] = 'Family'
df_final.loc[df_final['Genre'].str.contains('Martial'), 'Genre'] = 'Action'
df_final.loc[df_final['Genre'].str.contains('Animal'), 'Genre'] = 'Drama'
df_final.loc[df_final['Genre'].str.contains('Animation'), 'Genre'] = 'Animated'
df_final.loc[df_final['Genre'].str.contains('Gangster'), 'Genre'] = 'Thriller'
df_final.loc[df_final['Genre'].str.contains('Semi-D'), 'Genre'] = 'Drama'
df_final.loc[df_final['Genre'].str.contains('Sports Docu'), 'Genre'] = 'Documentary'
df_final.loc[df_final['Genre'].str.contains('Porno'), 'Genre'] = 'Drama'
df_final.loc[df_final['Genre'].str.contains('Old West'), 'Genre'] = 'Action'
df_final.loc[df_final['Genre'].str.contains('Humorous'), 'Genre'] = 'Documentary'
df_final.loc[df_final['Genre'].str.contains('Sci'), 'Genre'] = 'Science Fiction'
df_final.loc[df_final['Genre'].str.contains('Sf'), 'Genre'] = 'Action'
df_final.loc[df_final['Genre'].str.contains('Sports'), 'Genre'] = 'Drama'
df_final.loc[df_final['Genre'].str.contains('Concert'), 'Genre'] = 'Comedy'
df_final.loc[df_final['Genre'].str.contains('Antimated'), 'Genre'] = 'Animated'
df_final.loc[df_final['Genre'].str.contains('Famil'), 'Genre'] = 'Family'
df_final.loc[df_final['Genre'].str.contains('Horror-Mystery'), 'Genre'] = 'Action'
df_final.loc[df_final['Genre'].str.contains('Horror'), 'Genre'] = 'Horror'
df_final.loc[df_final['Genre'].str.contains('Musical M'), 'Genre'] = 'Musical'
df_final.loc[df_final['Genre'].str.contains('Mythological Fantasy'), 'Genre'] = 'Fantasy'
df_final.loc[df_final['Genre'].str.contains('Polemical Documentary'), 'Genre'] = 'Documentary'
df_final.loc[df_final['Genre'].str.contains('Melodrama'), 'Genre'] = 'Drama'
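
The relabelling above is a sequence of substring rules applied in order, and each rule sees the results of the rules before it. One way to make that ordering explicit is to keep the rules in a list and apply them in a loop. The sketch below uses a hypothetical helper, `relabel_genres`, with a small subset of the rules on toy data; it is not the code that produced the output in this notebook.

```python
import pandas as pd

# A small, illustrative subset of (substring, new label) rules, in order.
GENRE_RULES = [
    ('Advent', 'Action'),
    ('Drama', 'Drama'),
    ('Science', 'Science Fiction'),
    ('Comedy', 'Comedy'),
    ('Sf', 'Action'),
]

def relabel_genres(genres, rules):
    """Apply substring rules in order; later rules see earlier replacements."""
    out = genres.copy()
    for pattern, label in rules:
        # regex=False treats the pattern as a plain substring.
        mask = out.str.contains(pattern, regex=False)
        out[mask] = label
    return out

toy = pd.Series(['Action Adventure', 'Romantic Comedy',
                 'Science Fiction Drama', 'Sf', 'Western'])
relabelled = relabel_genres(toy, GENRE_RULES)
print(relabelled.tolist())
# ['Action', 'Comedy', 'Drama', 'Action', 'Western']
```

Because 'Drama' comes before 'Science' in this toy rule list, 'Science Fiction Drama' ends up as 'Drama'. Reordering the list changes the result, which is worth keeping in mind when adding rules.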


In [15]:
# Check the new unique genres.
df_final['Genre'].unique()

array(['Action', 'Animated', 'Thriller', 'Science Fiction', 'Drama',
       'Comedy', 'Horror', 'Romance', 'Religious', 'Fantasy', 'Musical',
       'Mystery', 'Unknown', 'Western', 'Family', 'Documentary',
       'Historical', 'Biography', 'Soap Opera'], dtype=object)

-----
Now we will look at the 'Rating' column.  There is a lot of text in this column in addition to the movie rating.

In [16]:
# Check how many unique ratings there are.
len(df_final['Rating'].unique())

111

In [17]:
# Look at one of the ratings for an example.
df_final['Rating'].unique()[15]

'PG for some action violence and some crude humor'

In [18]:
a = 'PG for some action violence and some crude humor'

In [19]:
a.split(' ', maxsplit=1)

['PG', 'for some action violence and some crude humor']

In [20]:
# Clean up the Rating column by splitting on white space, performing
# 1 split, and then expanding the results into a DataFrame.
# The first column of this DataFrame holds the rating and is assigned to
# the Rating column; the second column holds any remaining description
# text (or None) and is discarded.

df_final['Rating'] = df_final['Rating'].str.split(' ', n=1, expand=True)[0]
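
To see what the split produces, here is a small sketch on a toy Series (the values are illustrative, not taken from the full dataset). With `expand=True` the result is a two-column DataFrame, and only column `0` holds the leading token.

```python
import pandas as pd

toy_ratings = pd.Series([
    'PG for some action violence and some crude humor',
    'PG-13',
    'Not Rated',
])

# One split on the first space, expanded into columns 0 and 1.
parts = toy_ratings.str.split(' ', n=1, expand=True)
first_token = parts[0]

print(parts.shape)           # (3, 2)
print(first_token.tolist())  # ['PG', 'PG-13', 'Not']
```

Entries with no space, such as 'PG-13', get a missing value in column `1`. Entries such as 'Not Rated' keep only 'Not', which is why a value like that still has to be mapped to 'Unknown' afterwards.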

In [21]:
# There are fewer ratings values but still some erroneous ones.
df_final['Rating'].unique()

array(['PG-13', 'PG', 'R', 'Unknown', 'Not', 'NR', 'G', 'None', 'June',
       'No', 'TV-G', 'Produced', 'Code', 'Pre-Rating', 'Unrated', 'NC-17',
       'Motion', 'Rated', 'TBD'], dtype=object)

In [22]:
# Create a list of valid ratings values and then assign all values in
# the ratings column not in this list to 'Unknown'

ratings = ['PG-13', 'PG', 'R', 'Unknown', 'G', 'NC-17']

df_final.loc[~df_final['Rating'].isin(ratings), 'Rating'] = 'Unknown'

In [23]:
df_final['Rating'].unique()

array(['PG-13', 'PG', 'R', 'Unknown', 'G', 'NC-17'], dtype=object)

In [24]:
# Check the numerical features for erroneous values.
df_final.describe()

Unnamed: 0,Production_Budget,Domestic_Gross,Worldwide_Gross
count,3019.0,3019.0,3019.0
mean,39989570.0,53174730.0,117897200.0
std,45339450.0,74255550.0,196389900.0
min,1100.0,0.0,0.0
25%,10000000.0,7784598.0,12834770.0
50%,25000000.0,28426750.0,47666010.0
75%,52000000.0,66475700.0,134173900.0
max,425000000.0,760507600.0,2776345000.0


The numerical data is clean: every column has all 3019 values and none are negative.

----
Set release dates to datetime data type and create month and year values.

In [30]:
df_final['Release_Date'] = pd.to_datetime(df_final['Release_Date'])

In [34]:
df_final['Release_Year'] = df_final['Release_Date'].dt.year
df_final['Release_Month'] = df_final['Release_Date'].dt.month


In [40]:
month_dict = dict(zip(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul',
                      'Aug', 'Sep', 'Oct', 'Nov', 'Dec']))

In [48]:
release_month = []
for m in df_final.loc[:, 'Release_Month']:
    release_month.append(month_dict[m])

In [49]:
df_final['Release_Month'] = pd.Series(release_month)
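
The month-number to month-name step can also be written with `Series.map`, which looks each value up in the dictionary without building an intermediate list. A minimal sketch on toy dates (the names `month_names`, `toy_dates` and `toy_months` are illustrative):

```python
import pandas as pd

month_names = dict(zip(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                                      'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']))

toy_dates = pd.to_datetime(
    pd.Series(['7/13/2018', '6/29/2005', '12/25/1999']), format='%m/%d/%Y'
)

# Extract the month number, then look up its abbreviation.
toy_months = toy_dates.dt.month.map(month_names)
print(toy_months.tolist())  # ['Jul', 'Jun', 'Dec']
```

`map` keeps the original index, so it does not depend on the DataFrame having a default 0..n-1 index the way assigning a freshly built `pd.Series` does.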

In [51]:
df_final.to_csv('data/movie_df_smooth.csv', index=False)

In [52]:
with open('data/movie_data_smooth.pkl', 'wb') as picklefile:
    pickle.dump(df_final, picklefile)
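
The two saved files do not behave the same when they are loaded again. A CSV stores the dates as text, while a pickle keeps the column dtypes. The sketch below illustrates this on a one-row toy frame, using in-memory round trips instead of the files above.

```python
import io
import pickle
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

toy = pd.DataFrame({
    'Title': ['Skyscraper'],
    'Release_Date': pd.to_datetime(['7/13/2018'], format='%m/%d/%Y'),
})

# CSV round trip: without parse_dates the column is read back as text.
csv_text = toy.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))
from_csv_parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['Release_Date'])

# Pickle round trip: dtypes are preserved.
from_pickle = pickle.loads(pickle.dumps(toy))

print(is_datetime64_any_dtype(from_csv['Release_Date']))         # False
print(is_datetime64_any_dtype(from_csv_parsed['Release_Date']))  # True
print(is_datetime64_any_dtype(from_pickle['Release_Date']))      # True
```

When loading `movie_df_smooth.csv` later, passing `parse_dates=['Release_Date']` to `pd.read_csv` restores the datetime column.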