# Data Cleaning

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv('Data/NBCU-dataLaurel.csv')
data.head()

Unnamed: 0,imdbid,title,plot,rating,imdb_rating,metacritic,dvd_release,production,actors,imdb_votes,poster,director,release_date,runtime,genre,awards,keywords,Budget,Box Office Gross
0,tt0010323,The Cabinet of Dr. Caligari,"Hypnotist Dr. Caligari uses a somnambulist, Ce...",UNRATED,8.1,,15-Oct-97,Rialto Pictures,"Werner Krauss, Conrad Veidt, Friedrich Feher, ...",42583,https://images-na.ssl-images-amazon.com/images...,Robert Wiene,19-Mar-21,67 min,"Fantasy, Horror, Mystery",1 nomination.,expressionism|somnambulist|avant-garde|hypnosi...,18000,0
1,tt0052893,Hiroshima Mon Amour,A French actress filming an anti-war film in H...,NOT RATED,8.0,,24-Jun-03,Rialto Pictures,"Emmanuelle Riva, Eiji Okada, Stella Dassas, Pi...",21154,https://images-na.ssl-images-amazon.com/images...,Alain Resnais,16-May-60,90 min,"Drama, Romance",Nominated for 1 Oscar. Another 6 wins & 5 nomi...,memory|atomic-bomb|lovers-separation|impossibl...,88300,0
2,tt0058898,Alphaville,A U.S. secret agent is sent to the distant spa...,NOT RATED,7.2,,20-Oct-98,Rialto Pictures,"Eddie Constantine, Anna Karina, Akim Tamiroff",17801,https://images-na.ssl-images-amazon.com/images...,Jean-Luc Godard,5-May-65,99 min,"Drama, Mystery, Sci-Fi",1 win.,dystopia|french-new-wave|satire|comic-violence...,220000,46585
3,tt0074252,"Ugly, Dirty and Bad",Four generations of a family live crowded toge...,,7.9,,1-Nov-16,Compagnia Cinematografica Champion,"Nino Manfredi, Maria Luisa Santella, Francesco...",5705,https://images-na.ssl-images-amazon.com/images...,Ettore Scola,23-Sep-76,115 min,"Comedy, Drama",1 win & 2 nominations.,incest|failed-murder-attempt|poisoned-food|bap...,6590,0
4,tt0084269,Losing Ground,A comedy-drama about a Black American female p...,,6.3,,,Milestone Film & Video,"Billie Allen, Gary Bolling, Clarence Branch Jr...",132,https://images-na.ssl-images-amazon.com/images...,Kathleen Collins,1-Jun-82,86 min,"Comedy, Drama",,artist|painter|marriage|black-independent-film...,0,0


In [3]:
data.shape

(8468, 19)

These are the variables we have to work with:

imdbid: Unique Id used by IMDB to refer to the movie.

title: Title of the movie

plot: Movie plot summary

rating: MPAA Appropriate audience rating

imdb_rating: IMDB's voters' scoring of a movie on a scale from 1-10 (10 being best)

metacritic: Metacritic movie score on a scale of 0-100 (100 being best)

dvd_release: Movie release date on DVD

production: Principal production company

actors: Lead Actors

imdb_votes: Total votes from IMDB members

poster: Movie Poster artwork

director: Movie director

release_date: Theatrical Release Date

runtime: Runtime length of movie in minutes

genre: Genre Classification

awards: Academy awards & nominations

keywords: Keywords associated with the movie

Budget: Budget spent on movie production, marketing, and distribution

Box Office Gross: Box office gross returns as of 9/21/2017

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8468 entries, 0 to 8467
Data columns (total 19 columns):
imdbid              8468 non-null object
title               8468 non-null object
plot                8196 non-null object
rating              5252 non-null object
imdb_rating         7735 non-null float64
metacritic          5079 non-null float64
dvd_release         5335 non-null object
production          6758 non-null object
actors              8153 non-null object
imdb_votes          7735 non-null object
poster              7967 non-null object
director            8390 non-null object
release_date        8283 non-null object
runtime             7846 non-null object
genre               8424 non-null object
awards              5242 non-null object
keywords            6381 non-null object
Budget              8468 non-null object
Box Office Gross    8468 non-null object
dtypes: float64(2), object(17)
memory usage: 1.2+ MB


Notice how many of the variables are stored as the generic object type. We're going to have to convert several of these into more useful types.

First, we'll start by changing release_date to a datetime-like type.

In [5]:
data['release_date'].head()

0    19-Mar-21
1    16-May-60
2     5-May-65
3    23-Sep-76
4     1-Jun-82
Name: release_date, dtype: object

In [6]:
pd.to_datetime(data['release_date'])[1],data['release_date'][1]

(Timestamp('2060-05-16 00:00:00'), '16-May-60')

This exposes an issue with pandas' to_datetime function: two-digit years are resolved against a pivot year, so older dates are pushed forward into the 21st century (here 1960 becomes 2060). Perhaps we need to correct the century ourselves or use the datetime package, per [this](https://stackoverflow.com/questions/16600548/how-to-parse-string-dates-with-2-digit-year).
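To make the pivot behaviour concrete, here is a minimal sketch using Python's own `datetime.strptime` rather than pandas. Python documents that `%y` maps 69-99 to 1969-1999 and 0-68 to 2000-2068, which matches the result seen above. The helper name `parse_two_digit` is hypothetical, and the explicit `'%d-%b-%y'` format is an assumption about how the column is laid out.

```python
from datetime import datetime

def parse_two_digit(date_string):
    """Parse a 'day-Mon-yy' string using Python's default two-digit-year pivot."""
    return datetime.strptime(date_string, '%d-%b-%y')

for raw in ['16-May-60', '23-Sep-76', '19-Mar-21']:
    print(raw, '->', parse_two_digit(raw).date())
# 16-May-60 -> 2060-05-16
# 23-Sep-76 -> 1976-09-23
# 19-Mar-21 -> 2021-03-19
```

Only years below the pivot are affected, which is why a film from 1976 parses correctly while one from 1960 does not.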

In [7]:
import datetime
import numpy as np

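dates = data['release_date']

# Parse the strings; two-digit years below the pivot come back as 20xx.
dates = pd.to_datetime(dates)

# Shift any date later than 2019 back one century. Missing dates (NaT) are
# skipped because their year is NaN, so the comparison is False.
for i in range(len(dates)):
    if dates[i].year > 2019:
        dates[i] = dates[i].replace(year = dates[i].year-100)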

dates.head()

0   1921-03-19
1   1960-05-16
2   1965-05-05
3   1976-09-23
4   1982-06-01
Name: release_date, dtype: datetime64[ns]

With that, we've found a way to account for the default pivot year used when parsing two-digit years.
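The same correction can be written without an explicit loop, which also makes it easy to reuse on other date columns. This is only a sketch: the function name `fix_century`, the `cutoff_year` parameter, and the explicit `'%d-%b-%y'` format are assumptions made for illustration, and the sample values are copied from the first rows shown above.

```python
import pandas as pd

def fix_century(raw_dates, cutoff_year=2019):
    """Parse 'day-Mon-yy' strings and move dates after cutoff_year back 100 years."""
    parsed = pd.to_datetime(raw_dates, format='%d-%b-%y', errors='coerce')
    # NaT rows have a NaN year, so they compare as False and are left alone.
    in_future = parsed.dt.year > cutoff_year
    shifted = parsed - pd.DateOffset(years=100)
    # Keep the parsed value unless it lies in the future, then use the shifted one.
    return parsed.where(~in_future, shifted)

sample = pd.Series(['19-Mar-21', '16-May-60', '5-May-65', '23-Sep-76', '1-Jun-82', None])
fixed = fix_century(sample)
print(fixed)
```

The cutoff is a judgment call: any film genuinely dated after the cutoff year would be moved into the past, so it should sit later than the newest real date in the data.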

In [8]:
data['release_date']=dates
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8468 entries, 0 to 8467
Data columns (total 19 columns):
imdbid              8468 non-null object
title               8468 non-null object
plot                8196 non-null object
rating              5252 non-null object
imdb_rating         7735 non-null float64
metacritic          5079 non-null float64
dvd_release         5335 non-null object
production          6758 non-null object
actors              8153 non-null object
imdb_votes          7735 non-null object
poster              7967 non-null object
director            8390 non-null object
release_date        8283 non-null datetime64[ns]
runtime             7846 non-null object
genre               8424 non-null object
awards              5242 non-null object
keywords            6381 non-null object
Budget              8468 non-null object
Box Office Gross    8468 non-null object
dtypes: datetime64[ns](1), float64(2), object(16)
memory usage: 1.2+ MB


It seems natural to also do the same for dvd_release.

In [9]:
dates = data['dvd_release']

dates = pd.to_datetime(dates)

for i in range(len(dates)):
    if dates[i].year > 2019:
        dates[i] = dates[i].replace(year = dates[i].year-100)

data['dvd_release'] = dates
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8468 entries, 0 to 8467
Data columns (total 19 columns):
imdbid              8468 non-null object
title               8468 non-null object
plot                8196 non-null object
rating              5252 non-null object
imdb_rating         7735 non-null float64
metacritic          5079 non-null float64
dvd_release         5335 non-null datetime64[ns]
production          6758 non-null object
actors              8153 non-null object
imdb_votes          7735 non-null object
poster              7967 non-null object
director            8390 non-null object
release_date        8283 non-null datetime64[ns]
runtime             7846 non-null object
genre               8424 non-null object
awards              5242 non-null object
keywords            6381 non-null object
Budget              8468 non-null object
Box Office Gross    8468 non-null object
dtypes: datetime64[ns](2), float64(2), object(15)
memory usage: 1.2+ MB


Next, we have several important numerical variables that are currently stored as objects. First, we'll work with imdb_votes.

In [10]:
data['imdb_votes'].head()

0    42,583
1    21,154
2    17,801
3     5,705
4       132
Name: imdb_votes, dtype: object

So we need to convert imdb_votes to integers. Because the numbers contain commas, we cannot simply cast the column with the .astype() method. We'll first have to strip out each comma, then apply the int() function. We also have to take care to skip the missing values in imdb_votes, as we will be dealing with those later.

In [14]:
votes = data['imdb_votes']
print(len(votes))
votes_parsed = votes[votes.str.find(',')>0].apply(lambda x: x.replace(',',''))
print(len(votes))
#votes[votes.str.find(',')>0]
for i in votes_parsed.index:
    votes[i] = votes_parsed[i]
votes.head()

8468
8468


0    42583
1    21154
2    17801
3     5705
4      132
Name: imdb_votes, dtype: object

We've now successfully removed all of the commas, although the values are still stored as strings (dtype: object).
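As a possible shortcut, the comma removal and the integer conversion can be done together while leaving missing values in place. The sketch below assumes pandas' nullable `Int64` dtype is acceptable for this column. The sample values mirror the first rows above, plus one missing entry added for illustration.

```python
import pandas as pd

votes = pd.Series(['42,583', '21,154', '17,801', '5,705', '132', None], name='imdb_votes')

# Remove the thousands separators; missing entries stay missing.
without_commas = votes.str.replace(',', '', regex=False)

# Convert to numbers, then to the nullable integer dtype so that
# missing values do not force the column to float.
votes_int = pd.to_numeric(without_commas).astype('Int64')
print(votes_int)
```

Using `Int64` (capital I) rather than `int` is what allows the missing values to be deferred until later, as intended.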

In [None]:
[len(votes), len(data['imdb_votes'])]

In [None]:
data.isna().sum()

Clearly we are going to have to do something about these missing values. We'll look at what each of these variables is and handle their missing values on a case-by-case basis.

In [None]:
data['rating'].unique()

Already I get the impression that the missing values should just be 'UNRATED' or 'NOT RATED'. First, we should figure out the distinction between unrated and not rated movies.

Not rated movies are movies that were not submitted to the MPAA for ratings.

Unrated movies are those that have had scenes altered, omitted, or added in a way that may or may not affect the movie's rating. Usually this only happens with DVD releases. Given this, it makes sense to turn all the missing values into 'NOT RATED'; there's no good justification for assigning them an actual rating.

Another solution is to find another dataset containing each movie and their respective ratings.

In [None]:
data[data['rating'].isna()].head()

Looking at the first few movies with missing ratings, let's get a sense of whether they just have no rating at all, or if the dataset is just missing data.

[Ugly, Dirty, and Bad](https://www.rottentomatoes.com/m/ugly_dirty_and_bad) has a rating of 'NR', Not Rated.

[Losing Ground](https://www.rottentomatoes.com/m/losing_ground_1982) has a rating of 'NR', Not Rated.

[L'argent](https://www.rottentomatoes.com/m/largent) has a rating of 'NR', Not Rated.

[Rebels of the Neon God](https://www.rottentomatoes.com/m/rebels_of_the_neon_god) has a rating of 'NR', Not Rated. Strangely enough, this movie is listed as having come out on Apr 10, 2015, whereas the dataset has August 4, 1994.

[River of Grass](https://www.rottentomatoes.com/m/river_of_grass) has a rating of 'NR', Not Rated.

So it seems reasonable to assign 'NOT RATED' to each missing rating.

In [None]:
data['rating'] = data['rating'].fillna('NOT RATED')

In [None]:
data.isna().sum()

In [None]:
data['imdb_rating'].unique()

The first thing I notice is that we have fractional ratings (e.g., 8.5, 1.3, 2.2). Also, this is ordinal data: a movie rated 8 is better than one rated 7, which is better than one rated 6, and so on. So we will bin the ratings into the set [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] after we deal with the missing data.
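One way the binning could look is sketched below with `pd.cut`. The notebook does not say how fractional ratings should map to bins, so rounding to the nearest integer (with .5 rounding up) is an assumption here. The sample values are the ones mentioned above plus one missing entry.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([8.1, 8.5, 1.3, 2.2, np.nan], name='imdb_rating')

# Bin edges at 0.5, 1.5, ..., 10.5 so each bin is centred on an integer.
# right=False makes bins closed on the left, so 8.5 falls into the 9 bin.
edges = np.arange(0.5, 11, 1.0)
binned = pd.cut(ratings, bins=edges, labels=list(range(1, 11)), right=False)
print(binned)
```

`pd.cut` returns an ordered categorical by default, which preserves the ordinal nature of the ratings, and missing ratings simply stay missing.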

In [None]:
data[data['imdb_rating'].isna()]

Notice that a fair number of the movies missing IMDb ratings are also missing Metacritic scores, DVD release dates, posters, and box office returns. On top of this, many also have release dates after 9/21/2017, the date as of which the box office figures were recorded. There is little reason to try to impute these values: the movies hadn't been released yet, so essentially all of the relevant information is missing.
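A quick way to quantify this observation is to measure how many of the rows with a missing rating also have a release date after the snapshot date. The sketch below uses a small made-up frame, so the titles and values are hypothetical. With the real data, the same two boolean masks would be built from `data` instead of `sample`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset.
sample = pd.DataFrame({
    'title': ['Released A', 'Released B', 'Upcoming C', 'Upcoming D'],
    'imdb_rating': [7.2, np.nan, np.nan, np.nan],
    'release_date': pd.to_datetime(['1965-05-05', '1994-08-04', '2018-03-02', '2017-12-15']),
})

snapshot = pd.Timestamp('2017-09-21')  # date the box office figures refer to

missing_rating = sample['imdb_rating'].isna()
unreleased = sample['release_date'] > snapshot

# Share of rating-less movies that had not been released at the snapshot date.
share_unreleased = (missing_rating & unreleased).sum() / missing_rating.sum()
print(share_unreleased)
```

A high share would support treating these rows as unreleased titles rather than as candidates for imputation.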