# Data Cleaning

I've collected data on the top 10,000 popular movies on IMDB, ranging from summer 2010 to summer 2020, using web scraping. I then combined this with budget and revenue data for the same movies, collected from TMDB through their API.

Next, I'll get a sense of what the data looks like.

## Import Libraries and Data

In [1]:
import pandas as pd

In [2]:
df = pd.read_pickle("./movie_data.pkl")

## Explore and Clean Data

First, I want to make sure the shape of the dataframe is what I expect. I collected 10,000 movies, and for each one I grabbed the following information:
* `title`
* `mpaa_rating`
* `runtime`
* `genre`
* `star_rating` - This is the rating the movie has on IMDB out of 10.
* `budget`
* `revenue`

As such, I expect my dataframe to have 10,000 rows and 7 columns.

In [3]:
df.shape

(10000, 7)

Great. Now I want to see what this dataframe looks like.

In [4]:
df.head()

| | title | mpaa_rating | runtime | genre | star_rating | budget | revenue |
|---|---|---|---|---|---|---|---|
| 0 | The Outpost | R | 123 min | `\nAction, Drama, History` | 6.7 | 0.0 | 0.0 |
| 1 | The Gentlemen | R | 113 min | `\nAction, Comedy, Crime` | 7.9 | 22000000.0 | 114996853.0 |
| 2 | Murder on the Orient Express | PG-13 | 114 min | `\nCrime, Drama, Mystery` | 6.5 | 55000000.0 | 351839303.0 |
| 3 | 365 Days | TV-MA | 114 min | `\nDrama, Romance` | 3.3 | 0.0 | 9458590.0 |
| 4 | Mulan | PG-13 | 115 min | `\nAction, Adventure, Drama` | 5.4 | 200000000.0 | 57000000.0 |


That looks correct. Next, I want to look at my columns more in depth.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
title          10000 non-null object
mpaa_rating    7246 non-null object
runtime        9909 non-null object
genre          9999 non-null object
star_rating    9987 non-null object
budget         9925 non-null float64
revenue        9925 non-null float64
dtypes: float64(2), object(5)
memory usage: 547.0+ KB


It looks like two of my columns are not the right data type; `star_rating` should be a float, and `runtime` should be an integer. The first will be an easy fix, but since `runtime` contains both numbers and text, I'll need to turn it into a string, strip the text I don't want, and then turn the column into type int. 

Before I can do this to `runtime`, I'll have to deal with the null values, since a missing value can't be converted to an integer. Since I'm missing this data point for only 91 of my 10,000 movies, I'll go ahead and just remove those rows.

In [6]:
df.dropna(subset=['runtime'], inplace=True)

In [7]:
df['runtime'].isna().sum()

0

Now I can correct the data types of both `star_rating` and `runtime`.

In [8]:
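# Remove the literal ' min' suffix, then cast to int.
# (str.rstrip(' min') would strip any trailing ' ', 'm', 'i' or 'n' characters, not the exact suffix.)
df['runtime'] = df['runtime'].astype(str).str.replace(' min', '', regex=False).astype(int)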
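As an aside, here is a minimal sketch of an alternative that would not require dropping the missing rows first. It is not what I did above. The toy values are made up, and it assumes every non-missing runtime starts with a run of digits, as in `123 min`.

```python
import pandas as pd

# Toy stand-in for the scraped `runtime` column (values are made up for illustration).
runtime = pd.Series(["123 min", "113 min", None, "95 min"])

# Pull out the first run of digits; rows with no match (or no value) stay missing.
minutes = runtime.str.extract(r"(\d+)", expand=False)

# Convert to pandas' nullable integer dtype so missing values survive the cast.
runtime_in_mins = pd.to_numeric(minutes).astype("Int64")

print(runtime_in_mins)
```

The trade-off is that the column ends up as nullable `Int64` rather than plain `int64`, so any later step that expects no missing values would still need to handle them.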

In [9]:
df.rename(columns={'runtime': 'runtime_in_mins'}, inplace=True)

In [10]:
df['star_rating'] = df['star_rating'].astype(float)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9909 entries, 0 to 9999
Data columns (total 7 columns):
title              9909 non-null object
mpaa_rating        7236 non-null object
runtime_in_mins    9909 non-null int64
genre              9908 non-null object
star_rating        9901 non-null float64
budget             9856 non-null float64
revenue            9856 non-null float64
dtypes: float64(3), int64(1), object(3)
memory usage: 619.3+ KB


Next, I'll look at the value count of my categorical column `mpaa_rating`.

In [12]:
df['mpaa_rating'].value_counts(dropna=False)

NaN          2673
R            2424
Not Rated    2028
PG-13        1124
TV-MA         489
PG            457
Unrated       306
TV-14         268
TV-PG          69
G              35
TV-G           15
TV-Y7           7
M               5
NC-17           5
TV-Y            2
MA-17           1
18              1
Name: mpaa_rating, dtype: int64

Immediately I notice that I have three different categories that should be combined into one: NaN, 'Not Rated', and 'Unrated'.

In [13]:
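# Assign the result back rather than using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df.
df['mpaa_rating'] = df['mpaa_rating'].fillna('Not Rated')
df['mpaa_rating'] = df['mpaa_rating'].replace('Unrated', 'Not Rated')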

In [14]:
df['mpaa_rating'].value_counts(dropna=False)

Not Rated    5007
R            2424
PG-13        1124
TV-MA         489
PG            457
TV-14         268
TV-PG          69
G              35
TV-G           15
TV-Y7           7
NC-17           5
M               5
TV-Y            2
MA-17           1
18              1
Name: mpaa_rating, dtype: int64

I still have a lot of different categories here. For my analysis, I really only care about what type of audience a movie is catering to, not the specific rating. As such, I'm going to combine the categories as follows:
* 'All Ages' - 'G' and 'TV-G'
* 'Children' - 'TV-Y' and 'TV-Y7'
* 'Pre-teen+'  - 'PG' and 'TV-PG'
* 'Teen+' - 'PG-13' and 'TV-14'
* 'Adult' - 'R', 'NC-17', 'M', 'MA-17', '18' and 'TV-MA'

I'll also rename the column to be `intended_audience`.

In [15]:
# Assign each result back; inplace=True on a column selection may not update df in newer pandas.
df['mpaa_rating'] = df['mpaa_rating'].replace(['G', 'TV-G'], 'All Ages')
df['mpaa_rating'] = df['mpaa_rating'].replace(['TV-Y', 'TV-Y7'], 'Children')
df['mpaa_rating'] = df['mpaa_rating'].replace(['PG', 'TV-PG'], 'Pre-teen+')
df['mpaa_rating'] = df['mpaa_rating'].replace(['PG-13', 'TV-14'], 'Teen+')
df['mpaa_rating'] = df['mpaa_rating'].replace(['R', 'NC-17', 'M', 'MA-17', '18', 'TV-MA'], 'Adult')
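The same grouping can also be written as a single lookup table, which keeps the rating-to-audience rules in one place. This is a sketch on a toy Series; the name `audience_map` is hypothetical and the sample ratings are made up.

```python
import pandas as pd

# Hypothetical lookup table built from the groupings listed above.
audience_map = {
    "G": "All Ages", "TV-G": "All Ages",
    "TV-Y": "Children", "TV-Y7": "Children",
    "PG": "Pre-teen+", "TV-PG": "Pre-teen+",
    "PG-13": "Teen+", "TV-14": "Teen+",
    "R": "Adult", "NC-17": "Adult", "M": "Adult",
    "MA-17": "Adult", "18": "Adult", "TV-MA": "Adult",
}

# Toy ratings (made up); 'Not Rated' is not in the map and passes through unchanged.
ratings = pd.Series(["R", "PG-13", "TV-Y7", "Not Rated", "G", "TV-PG"])

# Series.replace with a dict swaps each matching value and leaves the rest alone.
audience = ratings.replace(audience_map)

print(audience.tolist())
```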

In [16]:
df.rename(columns={'mpaa_rating': 'intended_audience'}, inplace=True)

In [17]:
df['intended_audience'].value_counts(dropna=False)

Not Rated    5007
Adult        2925
Teen+        1392
Pre-teen+     526
All Ages       50
Children        9
Name: intended_audience, dtype: int64

From the `df.head()` output above, I see that some budget and revenue numbers are listed as 0. When analyzing movie budgets and revenue, I only want to include movies that actually have these data points. I'll create a new dataframe `df_short` which I can use when analyzing the finances of the movies.

To start off, I'll see how many 0 values I have in these two columns.

In [18]:
df.isin([0]).sum()

title                   0
intended_audience       0
runtime_in_mins         0
genre                   0
star_rating             0
budget               6807
revenue              6979
dtype: int64

In [19]:
df_short = df[df['revenue'] != 0] 
df_short = df_short[df_short['budget'] != 0]

In [20]:
df_short.shape

(2117, 7)

Looks like that leaves me with 2,117 movies to use for analyzing popular movie finances.
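One caveat worth checking: `budget` and `revenue` also contain some null values (9,856 non-null out of 9,909 rows), and a `!= 0` filter keeps those rows because `NaN != 0` evaluates to `True`. I haven't verified how many such rows end up in `df_short`. The sketch below uses made-up data and a hypothetical `finances` dataframe to show the difference between the two filters.

```python
import numpy as np
import pandas as pd

# Made-up finance data: one complete row, one zero budget, one missing budget, one zero revenue.
finances = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [1_000_000.0, 0.0, np.nan, 5_000_000.0],
    "revenue": [3_000_000.0, 2_000_000.0, 4_000_000.0, 0.0],
})

# `!= 0` alone keeps row C, because NaN != 0 evaluates to True.
nonzero_only = finances[(finances["budget"] != 0) & (finances["revenue"] != 0)]

# `> 0` drops zeros and NaN together, since any ordered comparison with NaN is False.
has_finances = finances[(finances["budget"] > 0) & (finances["revenue"] > 0)]

print(nonzero_only["title"].tolist())
print(has_finances["title"].tolist())
```

Running `df_short[['budget', 'revenue']].isna().sum()` would show whether any such rows are present.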