# Data Cleaning

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv('Data/NBCU-dataLaurel.csv')
data.head()

Unnamed: 0,imdbid,title,plot,rating,imdb_rating,metacritic,dvd_release,production,actors,imdb_votes,poster,director,release_date,runtime,genre,awards,keywords,Budget,Box Office Gross
0,tt0010323,The Cabinet of Dr. Caligari,"Hypnotist Dr. Caligari uses a somnambulist, Ce...",UNRATED,8.1,,15-Oct-97,Rialto Pictures,"Werner Krauss, Conrad Veidt, Friedrich Feher, ...",42583,https://images-na.ssl-images-amazon.com/images...,Robert Wiene,19-Mar-21,67 min,"Fantasy, Horror, Mystery",1 nomination.,expressionism|somnambulist|avant-garde|hypnosi...,18000,0
1,tt0052893,Hiroshima Mon Amour,A French actress filming an anti-war film in H...,NOT RATED,8.0,,24-Jun-03,Rialto Pictures,"Emmanuelle Riva, Eiji Okada, Stella Dassas, Pi...",21154,https://images-na.ssl-images-amazon.com/images...,Alain Resnais,16-May-60,90 min,"Drama, Romance",Nominated for 1 Oscar. Another 6 wins & 5 nomi...,memory|atomic-bomb|lovers-separation|impossibl...,88300,0
2,tt0058898,Alphaville,A U.S. secret agent is sent to the distant spa...,NOT RATED,7.2,,20-Oct-98,Rialto Pictures,"Eddie Constantine, Anna Karina, Akim Tamiroff",17801,https://images-na.ssl-images-amazon.com/images...,Jean-Luc Godard,5-May-65,99 min,"Drama, Mystery, Sci-Fi",1 win.,dystopia|french-new-wave|satire|comic-violence...,220000,46585
3,tt0074252,"Ugly, Dirty and Bad",Four generations of a family live crowded toge...,,7.9,,1-Nov-16,Compagnia Cinematografica Champion,"Nino Manfredi, Maria Luisa Santella, Francesco...",5705,https://images-na.ssl-images-amazon.com/images...,Ettore Scola,23-Sep-76,115 min,"Comedy, Drama",1 win & 2 nominations.,incest|failed-murder-attempt|poisoned-food|bap...,6590,0
4,tt0084269,Losing Ground,A comedy-drama about a Black American female p...,,6.3,,,Milestone Film & Video,"Billie Allen, Gary Bolling, Clarence Branch Jr...",132,https://images-na.ssl-images-amazon.com/images...,Kathleen Collins,1-Jun-82,86 min,"Comedy, Drama",,artist|painter|marriage|black-independent-film...,0,0


In [3]:
data.shape

(8468, 19)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8468 entries, 0 to 8467
Data columns (total 19 columns):
imdbid              8468 non-null object
title               8468 non-null object
plot                8196 non-null object
rating              5252 non-null object
imdb_rating         7735 non-null float64
metacritic          5079 non-null float64
dvd_release         5335 non-null object
production          6758 non-null object
actors              8153 non-null object
imdb_votes          7735 non-null object
poster              7967 non-null object
director            8390 non-null object
release_date        8283 non-null object
runtime             7846 non-null object
genre               8424 non-null object
awards              5242 non-null object
keywords            6381 non-null object
Budget              8468 non-null object
Box Office Gross    8468 non-null object
dtypes: float64(2), object(17)
memory usage: 1.2+ MB


These are the variables we have to work with:

imdbid: Unique Id used by IMDB to refer to the movie.

title: Title of the movie

plot: Movie plot summary

rating: MPAA audience rating

imdb_rating: IMDB's voters' scoring of a movie on a scale from 1-10 (10 being best)

metacritic: Metacritic movie score on a scale of 0-100 (100 being best)

dvd_release: Movie release date on DVD

production: Principal production company

actors: Lead Actors

imdb_votes: Total votes from IMDB members

poster: Movie Poster artwork

director: Movie director

release_date: Theatrical Release Date

runtime: Runtime length of movie in minutes

genre: Genre Classification

awards: Summary of award wins and nominations, including Academy Awards

keywords: Keywords associated with the movie

Budget: Budget spent on movie production, marketing, and distribution

Box Office Gross: Box office gross returns as of 9/21/2017

In [5]:
data.isna().sum()

imdbid                 0
title                  0
plot                 272
rating              3216
imdb_rating          733
metacritic          3389
dvd_release         3133
production          1710
actors               315
imdb_votes           733
poster               501
director              78
release_date         185
runtime              622
genre                 44
awards              3226
keywords            2087
Budget                 0
Box Office Gross       0
dtype: int64

Several columns have a large number of missing values: rating, metacritic, dvd_release, and awards are each missing more than 3,000 of 8,468 entries. We'll look at what each variable represents and handle its missing values on a case-by-case basis.
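Raw counts are easier to compare as fractions of the row count (for example, 3,389 of 8,468 rows lack a metacritic score, roughly 40%). The snippet below is a minimal sketch of that calculation on a small made-up frame; `toy` and its values are illustrative stand-ins, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'rating': ['PG', np.nan, np.nan, 'R'],
    'metacritic': [np.nan, 95.0, 82.0, 70.0],
})

# isna() yields a boolean frame; the mean of each column is its missing share.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same expression on `data` would rank the columns by how much of each is missing, which can help decide where to start.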

In [6]:
data['rating'].unique()

array(['UNRATED', 'NOT RATED', nan, 'PG', 'G', 'TV-PG', 'PG-13', 'R',
       'TV-14', 'TV-MA', 'M', 'NC-17', 'APPROVED', 'X', 'NR', 'TV-G',
       'TV-Y', 'Unrated'], dtype=object)

My first impression is that the missing values should be either 'UNRATED' or 'NOT RATED'. First, we should pin down the distinction between unrated and not rated movies.

Not rated movies are movies that were not submitted to the MPAA for ratings.

Unrated movies are those that have had scenes altered, omitted, or added in a way that may or may not affect the movie's rating. Usually this only happens with DVD releases. Given this, it makes sense to label all the missing values 'NOT RATED', since we have no basis for assigning them a specific rating.

An alternative would be to find another dataset that contains a rating for each movie.

In [7]:
data[data['rating'].isna()].head()

Unnamed: 0,imdbid,title,plot,rating,imdb_rating,metacritic,dvd_release,production,actors,imdb_votes,poster,director,release_date,runtime,genre,awards,keywords,Budget,Box Office Gross
3,tt0074252,"Ugly, Dirty and Bad",Four generations of a family live crowded toge...,,7.9,,1-Nov-16,Compagnia Cinematografica Champion,"Nino Manfredi, Maria Luisa Santella, Francesco...",5705,https://images-na.ssl-images-amazon.com/images...,Ettore Scola,23-Sep-76,115 min,"Comedy, Drama",1 win & 2 nominations.,incest|failed-murder-attempt|poisoned-food|bap...,6590,0
4,tt0084269,Losing Ground,A comedy-drama about a Black American female p...,,6.3,,,Milestone Film & Video,"Billie Allen, Gary Bolling, Clarence Branch Jr...",132,https://images-na.ssl-images-amazon.com/images...,Kathleen Collins,1-Jun-82,86 min,"Comedy, Drama",,artist|painter|marriage|black-independent-film...,0,0
5,tt0085180,L'argent,A forged 500-franc note is cynically passed fr...,,7.5,95.0,24-May-05,Criterion Collection,"Christian Patey, Vincent Risterucci, Caroline ...",5607,https://images-na.ssl-images-amazon.com/images...,Robert Bresson,18-May-83,85 min,"Crime, Drama",2 wins & 3 nominations.,note|murder|solitary-confinement|robbery|deliv...,0,0
9,tt0103935,Rebels of the Neon God,"Within the urban gloom of Taipei, four youths ...",,7.6,82.0,27-Oct-15,Big World Pictures,"Chao-jung Chen, Chang-Bin Jen, Kang-sheng Lee,...",2155,https://images-na.ssl-images-amazon.com/images...,Ming-liang Tsai,4-Aug-94,106 min,"Crime, Drama",5 wins & 5 nominations.,taipei|cigarette-smoking|hotel-room|kissing|ph...,28422,0
13,tt0110998,River of Grass,"Cozy, a dissatisfied housewife, meets Lee at a...",,6.5,69.0,18-Mar-03,Oscilloscope Laboratories,"Larry Fessenden, Dick Russell, Stan Kaplan, Mi...",582,https://images-na.ssl-images-amazon.com/images...,Kelly Reichardt,13-Oct-95,76 min,Drama,6 nominations.,f-rated|title-directed-by-female|directorial-d...,8534,0


Let's check the first few movies with missing ratings on Rotten Tomatoes to get a sense of whether they truly have no rating or whether the dataset is simply missing the value.

[Ugly, Dirty and Bad](https://www.rottentomatoes.com/m/ugly_dirty_and_bad) has a rating of 'NR' (Not Rated).

[Losing Ground](https://www.rottentomatoes.com/m/losing_ground_1982) has a rating of 'NR' (Not Rated).

[L'argent](https://www.rottentomatoes.com/m/largent) has a rating of 'NR' (Not Rated).

[Rebels of the Neon God](https://www.rottentomatoes.com/m/rebels_of_the_neon_god) has a rating of 'NR' (Not Rated). Strangely enough, Rotten Tomatoes lists this movie as having come out on Apr 10, 2015, whereas the dataset has August 4, 1994.

[River of Grass](https://www.rottentomatoes.com/m/river_of_grass) has a rating of 'NR' (Not Rated).

Since all five come back as Not Rated, it seems reasonable to assign 'NOT RATED' to each missing rating.

In [8]:
data['rating'] = data['rating'].fillna('NOT RATED')
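The unique values listed earlier include 'UNRATED' and 'Unrated' as separate labels, and 'NR' alongside 'NOT RATED'. A possible follow-up, not part of the original notebook, is to consolidate these spellings. The sketch below runs on a small illustrative Series rather than on `data`, and it assumes that 'NR' should be treated as 'NOT RATED'.

```python
import pandas as pd

# Illustrative sample of the label variants seen in the rating column.
ratings = pd.Series(['UNRATED', 'Unrated', 'NOT RATED', 'NR', 'PG-13', 'R'])

# Upper-casing collapses 'Unrated' into 'UNRATED'.
# The dict replace matches whole values only, so it maps 'NR' to 'NOT RATED'
# without touching labels that merely contain those letters.
cleaned = ratings.str.upper().replace({'NR': 'NOT RATED'})
print(cleaned.value_counts())
```

Whether 'UNRATED' and 'NOT RATED' should also be merged depends on whether the distinction described above matters for the analysis, so this sketch keeps them apart.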

In [9]:
data.isna().sum()

imdbid                 0
title                  0
plot                 272
rating                 0
imdb_rating          733
metacritic          3389
dvd_release         3133
production          1710
actors               315
imdb_votes           733
poster               501
director              78
release_date         185
runtime              622
genre                 44
awards              3226
keywords            2087
Budget                 0
Box Office Gross       0
dtype: int64

In [10]:
data['Budget'].describe()

count     8468
unique    2380
top          0
freq      3013
Name: Budget, dtype: object
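The summary above shows that Budget is stored as an object column, with 0 as the most frequent value (3,013 of 8,468 rows). One way to make it usable for arithmetic is sketched below on made-up values. It rests on two assumptions that would need checking against the real column: that some entries may carry thousands separators or non-numeric text, and that a budget of 0 means "unknown" rather than a true zero.

```python
import numpy as np
import pandas as pd

# Invented examples of string-typed budgets; not taken from the dataset.
budget = pd.Series(['18000', '220,000', '0', 'unknown'])

# Remove thousands separators, then coerce anything non-numeric to NaN.
budget_num = pd.to_numeric(budget.str.replace(',', '', regex=False),
                           errors='coerce')

# Assumption: 0 is a placeholder for a missing budget, so mark it as NaN.
budget_num = budget_num.replace(0, np.nan)
print(budget_num)
```

If the zeros turn out to be placeholders, Budget has far more missing data than the `isna()` counts suggest, since those counts report 0 missing values for this column.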

In [11]:
data['imdb_votes']

0          42,583
1          21,154
2          17,801
3           5,705
4             132
5           5,607
6         289,586
7           1,197
8         332,843
9           2,155
10            986
11        665,919
12      1,365,937
13            582
14        847,267
15        573,249
16        290,520
17         12,764
18            143
19        250,549
20         94,235
21        458,076
22         89,627
23          7,461
24         44,680
25        130,873
26         26,474
27        317,227
28        220,905
29         66,783
          ...    
8438        2,939
8439        2,965
8440       15,328
8441       77,079
8442          103
8443       12,340
8444       37,810
8445          385
8446        4,365
8447        8,912
8448        4,223
8449        9,278
8450        2,331
8451        8,230
8452        4,729
8453        5,662
8454        1,610
8455          202
8456       22,916
8457          105
8458          171
8459          130
8460        8,380
8461           29
8462      