# Data Cleaning

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
data = pd.read_csv('Data/NBCU-dataLaurel.csv')
data.head()

Unnamed: 0,imdbid,title,plot,rating,imdb_rating,metacritic,dvd_release,production,actors,imdb_votes,poster,director,release_date,runtime,genre,awards,keywords,Budget,Box Office Gross
0,tt0010323,The Cabinet of Dr. Caligari,"Hypnotist Dr. Caligari uses a somnambulist, Ce...",UNRATED,8.1,,15-Oct-97,Rialto Pictures,"Werner Krauss, Conrad Veidt, Friedrich Feher, ...",42583,https://images-na.ssl-images-amazon.com/images...,Robert Wiene,19-Mar-21,67 min,"Fantasy, Horror, Mystery",1 nomination.,expressionism|somnambulist|avant-garde|hypnosi...,18000,0
1,tt0052893,Hiroshima Mon Amour,A French actress filming an anti-war film in H...,NOT RATED,8.0,,24-Jun-03,Rialto Pictures,"Emmanuelle Riva, Eiji Okada, Stella Dassas, Pi...",21154,https://images-na.ssl-images-amazon.com/images...,Alain Resnais,16-May-60,90 min,"Drama, Romance",Nominated for 1 Oscar. Another 6 wins & 5 nomi...,memory|atomic-bomb|lovers-separation|impossibl...,88300,0
2,tt0058898,Alphaville,A U.S. secret agent is sent to the distant spa...,NOT RATED,7.2,,20-Oct-98,Rialto Pictures,"Eddie Constantine, Anna Karina, Akim Tamiroff",17801,https://images-na.ssl-images-amazon.com/images...,Jean-Luc Godard,5-May-65,99 min,"Drama, Mystery, Sci-Fi",1 win.,dystopia|french-new-wave|satire|comic-violence...,220000,46585
3,tt0074252,"Ugly, Dirty and Bad",Four generations of a family live crowded toge...,,7.9,,1-Nov-16,Compagnia Cinematografica Champion,"Nino Manfredi, Maria Luisa Santella, Francesco...",5705,https://images-na.ssl-images-amazon.com/images...,Ettore Scola,23-Sep-76,115 min,"Comedy, Drama",1 win & 2 nominations.,incest|failed-murder-attempt|poisoned-food|bap...,6590,0
4,tt0084269,Losing Ground,A comedy-drama about a Black American female p...,,6.3,,,Milestone Film & Video,"Billie Allen, Gary Bolling, Clarence Branch Jr...",132,https://images-na.ssl-images-amazon.com/images...,Kathleen Collins,1-Jun-82,86 min,"Comedy, Drama",,artist|painter|marriage|black-independent-film...,0,0


In [6]:
data.shape

(8468, 19)

In [23]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8468 entries, 0 to 8467
Data columns (total 19 columns):
imdbid              8468 non-null object
title               8468 non-null object
plot                8196 non-null object
rating              5252 non-null object
imdb_rating         7735 non-null float64
metacritic          5079 non-null float64
dvd_release         5335 non-null object
production          6758 non-null object
actors              8153 non-null object
imdb_votes          7735 non-null object
poster              7967 non-null object
director            8390 non-null object
release_date        8283 non-null object
runtime             7846 non-null object
genre               8424 non-null object
awards              5242 non-null object
keywords            6381 non-null object
Budget              8468 non-null object
Box Office Gross    8468 non-null object
dtypes: float64(2), object(17)
memory usage: 1.2+ MB
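
Only `imdb_rating` and `metacritic` are numeric in the summary above; `runtime` and the date columns are stored as strings such as `67 min` and `19-Mar-21`. The sketch below shows one possible way to convert them. The helper names and the small sample frame are hypothetical, and the century correction is a heuristic that assumes no date should fall after the 9/21/2017 snapshot; it cannot separate, say, 1915 from 2015.

```python
import pandas as pd

# Hypothetical sample shaped like the columns shown in data.head()
sample = pd.DataFrame({
    "runtime": ["67 min", "90 min", None],
    "release_date": ["19-Mar-21", "16-May-60", "1-Jun-82"],
})

def runtime_to_minutes(runtime):
    """Pull the leading number out of strings like '67 min'."""
    minutes = runtime.str.extract(r"(\d+)\s*min", expand=False)
    return pd.to_numeric(minutes, errors="coerce")

def parse_two_digit_year_dates(dates, cutoff="2017-09-21"):
    """Parse 'd-Mon-yy' strings; %y maps 21 to 2021 and 60 to 2060,
    so dates after the cutoff are shifted back one century (assumption)."""
    parsed = pd.to_datetime(dates, format="%d-%b-%y", errors="coerce")
    in_future = parsed > pd.Timestamp(cutoff)
    return parsed.where(~in_future, parsed - pd.DateOffset(years=100))

sample["runtime_min"] = runtime_to_minutes(sample["runtime"])
sample["release_parsed"] = parse_two_digit_year_dates(sample["release_date"])
```

Unparseable entries become `NaN` or `NaT` rather than raising, so they show up in the same missing-value counts as the other gaps.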


These are the variables we have to work with:

imdbid: Unique Id used by IMDB to refer to the movie.

title: Title of the movie

plot: Movie plot summary

rating: MPAA audience rating

imdb_rating: IMDB voters' score for the movie on a scale of 1-10 (10 being best)

metacritic: Metacritic movie score on a scale of 0-100 (100 being best)

dvd_release: Movie release date on DVD

production: Principal production company

actors: Lead Actors

imdb_votes: Total votes from IMDB members

poster: Movie Poster artwork

director: Movie director

release_date: Theatrical Release Date

runtime: Length of the movie in minutes

genre: Genre Classification

awards: Academy awards & nominations

keywords: Keywords associated with the movie

Budget: Amount spent on movie production, marketing, and distribution

Box Office Gross: Box office gross returns as of 9/21/2017
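
Most column names are lowercase with underscores, but the last two are stored as `Budget` and `Box Office Gross`, which is easy to mistype. A minimal sketch of normalizing the names, assuming a consistent snake_case style is wanted (the helper name is hypothetical):

```python
import pandas as pd

def normalize_columns(df):
    """Return a copy with stripped, lowercase, underscore-separated column names."""
    renamed = df.copy()
    renamed.columns = (
        renamed.columns.str.strip()
        .str.lower()
        .str.replace(" ", "_", regex=False)
    )
    return renamed

# Toy frame with the same naming mix as the dataset
toy = pd.DataFrame(columns=["imdbid", "Budget", "Box Office Gross"])
clean = normalize_columns(toy)
```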

In [21]:
data['genre'].unique  # parentheses are missing, so this displays the bound method, not the unique values

<bound method Series.unique of 0            Fantasy, Horror, Mystery
1                      Drama, Romance
2              Drama, Mystery, Sci-Fi
3                       Comedy, Drama
4                       Comedy, Drama
5                        Crime, Drama
6           Action, Adventure, Comedy
7          Adventure, Fantasy, Sci-Fi
8          Animation, Family, Fantasy
9                        Crime, Drama
10            Drama, History, Romance
11        Adventure, Sci-Fi, Thriller
12             Comedy, Drama, Romance
13                              Drama
14                     Drama, Romance
15         Action, Adventure, Fantasy
16            Action, Crime, Thriller
17               Action, Crime, Drama
18       Animation, Adventure, Comedy
19           Adventure, Comedy, Drama
20              Crime, Drama, Mystery
21          Action, Adventure, Sci-Fi
22                      Comedy, Drama
23                       Crime, Drama
24          Animation, Comedy, Family
25           Drama,
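
The output above is the representation of the method itself, because `unique` was referenced without being called. Calling `unique()` would list the distinct genre strings, but each string is a comma-separated combination, so individual genres still need to be split out. A rough sketch on a hypothetical sample (the variable names are illustrative only):

```python
import pandas as pd

# Hypothetical sample in the same format as the genre column
genre = pd.Series([
    "Fantasy, Horror, Mystery",
    "Drama, Romance",
    "Comedy, Drama",
    "Comedy, Drama",
    None,
])

# Distinct combination strings (note the parentheses)
combos = genre.dropna().unique()

# One row per individual genre, then a count per genre
individual = genre.dropna().str.split(",").explode().str.strip()
counts = individual.value_counts()
```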

In [17]:
data.isna().sum()

imdbid                 0
title                  0
plot                 272
rating              3216
imdb_rating          733
metacritic          3389
dvd_release         3133
production          1710
actors               315
imdb_votes           733
poster               501
director              78
release_date         185
runtime              622
genre                 44
awards              3226
keywords            2087
Budget                 0
Box Office Gross       0
dtype: int64
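
Raw counts are easier to judge as a share of the 8468 rows: `rating` is missing in about 38% of rows (3216 / 8468) and `metacritic` in about 40% (3389 / 8468). A small sketch that reports both figures side by side, using a toy frame and a hypothetical helper name:

```python
import pandas as pd

def missing_report(df):
    """Count and share of missing values per column, worst first."""
    report = pd.DataFrame({
        "missing": df.isna().sum(),
        "share": df.isna().mean(),
    })
    return report.sort_values("share", ascending=False)

# Toy frame standing in for the real data
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": ["R", None, None, "PG"],
    "metacritic": [None, None, None, 71.0],
})
report = missing_report(toy)
```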

In [36]:
data['Budget'].describe()

count     8468
unique    2380
top          0
freq      3013
Name: Budget, dtype: object
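
`Budget` has no nulls, yet it is stored as `object` and its most frequent value, 0, appears 3013 times (roughly 36% of rows). The sketch below shows one way to convert such a column to numbers. It assumes that non-numeric entries are caused by thousands separators or currency symbols, and that a 0 means "unknown" rather than a true zero budget; both are assumptions to check against the data, and the helper name is hypothetical.

```python
import pandas as pd

def to_amount(column):
    """Convert a money-like text column to floats; zeros become missing (assumption)."""
    # Drop commas, dollar signs, and whitespace before parsing
    cleaned = column.astype(str).str.replace(r"[,$\s]", "", regex=True)
    numbers = pd.to_numeric(cleaned, errors="coerce")
    # Treat 0 as a placeholder for an unknown amount
    return numbers.mask(numbers == 0)

# Hypothetical mixed-type sample
budget = pd.Series(["18000", "220,000", 0, "abc"], dtype="object")
amounts = to_amount(budget)
```

The same helper would apply to `Box Office Gross`, which is also stored as `object` with no nulls.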

In [50]:
data['imdb_votes']

0          42,583
1          21,154
2          17,801
3           5,705
4             132
5           5,607
6         289,586
7           1,197
8         332,843
9           2,155
10            986
11        665,919
12      1,365,937
13            582
14        847,267
15        573,249
16        290,520
17         12,764
18            143
19        250,549
20         94,235
21        458,076
22         89,627
23          7,461
24         44,680
25        130,873
26         26,474
27        317,227
28        220,905
29         66,783
          ...    
8438        2,939
8439        2,965
8440       15,328
8441       77,079
8442          103
8443       12,340
8444       37,810
8445          385
8446        4,365
8447        8,912
8448        4,223
8449        9,278
8450        2,331
8451        8,230
8452        4,729
8453        5,662
8454        1,610
8455          202
8456       22,916
8457          105
8458          171
8459          130
8460        8,380
8461           29
8462      