# Data Cleaning

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv('Data/NBCU-dataLaurel.csv')
data.head()

Unnamed: 0,imdbid,title,plot,rating,imdb_rating,metacritic,dvd_release,production,actors,imdb_votes,poster,director,release_date,runtime,genre,awards,keywords,Budget,Box Office Gross
0,tt0010323,The Cabinet of Dr. Caligari,"Hypnotist Dr. Caligari uses a somnambulist, Ce...",UNRATED,8.1,,15-Oct-97,Rialto Pictures,"Werner Krauss, Conrad Veidt, Friedrich Feher, ...",42583,https://images-na.ssl-images-amazon.com/images...,Robert Wiene,19-Mar-21,67 min,"Fantasy, Horror, Mystery",1 nomination.,expressionism|somnambulist|avant-garde|hypnosi...,18000,0
1,tt0052893,Hiroshima Mon Amour,A French actress filming an anti-war film in H...,NOT RATED,8.0,,24-Jun-03,Rialto Pictures,"Emmanuelle Riva, Eiji Okada, Stella Dassas, Pi...",21154,https://images-na.ssl-images-amazon.com/images...,Alain Resnais,16-May-60,90 min,"Drama, Romance",Nominated for 1 Oscar. Another 6 wins & 5 nomi...,memory|atomic-bomb|lovers-separation|impossibl...,88300,0
2,tt0058898,Alphaville,A U.S. secret agent is sent to the distant spa...,NOT RATED,7.2,,20-Oct-98,Rialto Pictures,"Eddie Constantine, Anna Karina, Akim Tamiroff",17801,https://images-na.ssl-images-amazon.com/images...,Jean-Luc Godard,5-May-65,99 min,"Drama, Mystery, Sci-Fi",1 win.,dystopia|french-new-wave|satire|comic-violence...,220000,46585
3,tt0074252,"Ugly, Dirty and Bad",Four generations of a family live crowded toge...,,7.9,,1-Nov-16,Compagnia Cinematografica Champion,"Nino Manfredi, Maria Luisa Santella, Francesco...",5705,https://images-na.ssl-images-amazon.com/images...,Ettore Scola,23-Sep-76,115 min,"Comedy, Drama",1 win & 2 nominations.,incest|failed-murder-attempt|poisoned-food|bap...,6590,0
4,tt0084269,Losing Ground,A comedy-drama about a Black American female p...,,6.3,,,Milestone Film & Video,"Billie Allen, Gary Bolling, Clarence Branch Jr...",132,https://images-na.ssl-images-amazon.com/images...,Kathleen Collins,1-Jun-82,86 min,"Comedy, Drama",,artist|painter|marriage|black-independent-film...,0,0


In [3]:
data.shape

(8468, 19)

These are the variables we have to work with:

imdbid: Unique Id used by IMDB to refer to the movie.

title: Title of the movie

plot: Movie plot summary

rating: MPAA Appropriate audience rating

imdb_rating: IMDB's voters' scoring of a movie on a scale from 1-10 (10 being best)

metacritic: Metacritic movie score on a scale of 0-100 (100 being best)

dvd_release: Movie release date on DVD

production: Principal production company

actors: Lead Actors

imdb_votes: Total votes from IMDB members

poster: Movie Poster artwork

director: Movie director

release_date: Theatrical Release Date

runtime: Runtime length of movie in minutes

genre: Genre Classification

awards: Academy awards & nominations

keywords: Keywords associated with the movie

Budget: Budget spent on movie production, marketing, and distribution

Box Office Gross: Box office gross returns as of 9/21/2017

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8468 entries, 0 to 8467
Data columns (total 19 columns):
imdbid              8468 non-null object
title               8468 non-null object
plot                8196 non-null object
rating              5252 non-null object
imdb_rating         7735 non-null float64
metacritic          5079 non-null float64
dvd_release         5335 non-null object
production          6758 non-null object
actors              8153 non-null object
imdb_votes          7735 non-null object
poster              7967 non-null object
director            8390 non-null object
release_date        8283 non-null object
runtime             7846 non-null object
genre               8424 non-null object
awards              5242 non-null object
keywords            6381 non-null object
Budget              8468 non-null object
Box Office Gross    8468 non-null object
dtypes: float64(2), object(17)
memory usage: 1.2+ MB


Notice how many of the variables are stored as generic objects. We're going to have to convert several of these into more useful types.

First, we'll start by changing release_date to a datetime-like type.

In [5]:
data['release_date'].head()

0    19-Mar-21
1    16-May-60
2     5-May-65
3    23-Sep-76
4     1-Jun-82
Name: release_date, dtype: object

In [6]:
pd.to_datetime(data['release_date'])[1],data['release_date'][1]

(Timestamp('2060-05-16 00:00:00'), '16-May-60')

Here we see an issue with pandas' to_datetime function: two-digit years from the distant past are pushed into the future, so 1960 is parsed as 2060. One workaround, along the lines of [this](https://stackoverflow.com/questions/16600548/how-to-parse-string-dates-with-2-digit-year), is to parse the dates first and then subtract 100 years from any result that lands in the future.

In [7]:
import datetime
import numpy as np

dates = data['release_date']

dates = pd.to_datetime(dates)

for i in range(len(dates)):
    if dates[i].year > 2019:
        dates[i] = dates[i].replace(year = dates[i].year-100)

dates.head()

0   1921-03-19
1   1960-05-16
2   1965-05-05
3   1976-09-23
4   1982-06-01
Name: release_date, dtype: datetime64[ns]

This works around the parser's default pivot year for two-digit years: every date that was parsed as later than 2019 has been moved back a century.
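The same correction can be written without looping over rows. The following is only a minimal sketch, not the notebook's original code: the helper name `fix_two_digit_years`, the explicit `%d-%b-%y` format, and the `cutoff_year` parameter are our own assumptions, chosen to match the sample values shown above.

```python
import pandas as pd

def fix_two_digit_years(raw, cutoff_year=2019):
    """Parse 'DD-Mon-YY' strings and shift future dates back 100 years.

    Hypothetical helper: cutoff_year mirrors the 2019 threshold used in the loop.
    """
    # errors='coerce' turns anything unparseable into NaT instead of raising
    parsed = pd.to_datetime(raw, format='%d-%b-%y', errors='coerce')
    # NaT compares as False here, so missing dates are left untouched
    in_future = parsed.dt.year > cutoff_year
    return parsed.where(~in_future, parsed - pd.DateOffset(years=100))

sample = pd.Series(['19-Mar-21', '16-May-60', '23-Sep-76', None])
fixed = fix_two_digit_years(sample)
```

Because the helper takes the raw column as an argument, it could be applied to both release_date and dvd_release without repeating the loop.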

In [8]:
data['release_date']=dates
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8468 entries, 0 to 8467
Data columns (total 19 columns):
imdbid              8468 non-null object
title               8468 non-null object
plot                8196 non-null object
rating              5252 non-null object
imdb_rating         7735 non-null float64
metacritic          5079 non-null float64
dvd_release         5335 non-null object
production          6758 non-null object
actors              8153 non-null object
imdb_votes          7735 non-null object
poster              7967 non-null object
director            8390 non-null object
release_date        8283 non-null datetime64[ns]
runtime             7846 non-null object
genre               8424 non-null object
awards              5242 non-null object
keywords            6381 non-null object
Budget              8468 non-null object
Box Office Gross    8468 non-null object
dtypes: datetime64[ns](1), float64(2), object(16)
memory usage: 1.2+ MB


It seems natural to also do the same for dvd_release.

In [9]:
dates = data['dvd_release']

dates = pd.to_datetime(dates)

for i in range(len(dates)):
    if dates[i].year > 2019:
        dates[i] = dates[i].replace(year = dates[i].year-100)

data['dvd_release'] = dates
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8468 entries, 0 to 8467
Data columns (total 19 columns):
imdbid              8468 non-null object
title               8468 non-null object
plot                8196 non-null object
rating              5252 non-null object
imdb_rating         7735 non-null float64
metacritic          5079 non-null float64
dvd_release         5335 non-null datetime64[ns]
production          6758 non-null object
actors              8153 non-null object
imdb_votes          7735 non-null object
poster              7967 non-null object
director            8390 non-null object
release_date        8283 non-null datetime64[ns]
runtime             7846 non-null object
genre               8424 non-null object
awards              5242 non-null object
keywords            6381 non-null object
Budget              8468 non-null object
Box Office Gross    8468 non-null object
dtypes: datetime64[ns](2), float64(2), object(15)
memory usage: 1.2+ MB


Next, we have several important numerical variables that are currently stored as objects. First, we'll work with imdb_votes.

In [10]:
data['imdb_votes'].head()

0    42,583
1    21,154
2    17,801
3     5,705
4       132
Name: imdb_votes, dtype: object

So we need to convert imdb_votes to integers. Since the larger numbers contain commas as thousands separators, we cannot simply tell pandas to treat each entry as an integer via the .astype() function. We'll first have to remove the commas, then apply the int() conversion. We also have to take care to skip the missing values in imdb_votes, as we will be dealing with those later.

In [11]:
votes = data['imdb_votes']
votes_parsed = votes[votes.str.find(',')>0].apply(lambda x: x.replace(',',''))
for i in votes_parsed.index:
    votes[i] = votes_parsed[i]
votes.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


0    42583
1    21154
2    17801
3     5705
4      132
Name: imdb_votes, dtype: object

Then, we've successfully removed all of the commas. Let's confirm that we didn't lose any datapoints along the way.

In [12]:
[len(votes), len(data['imdb_votes'])]

[8468, 8468]

In order to convert to int, we have to find a way to work around missing values. Let's replace all the missing values with -1 and then convert them back to NaN after conversion.

In [13]:
import numpy as np

votes_int = votes.fillna(-1).astype('int')
votes_int[votes_int==-1] = np.nan
votes_int.isna().sum() == votes.isna().sum()

True

So we've successfully preserved the NaN cases. Note that because NaN is a floating-point value, the column ends up as float64 rather than int.
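The comma removal and the NaN-preserving conversion can also be combined into one vectorized step. This is a hedged sketch rather than what the notebook ran: `parse_vote_counts` is a hypothetical helper, and it assumes a pandas version with the nullable `Int64` dtype (0.24 or later), which keeps missing values without the -1 placeholder.

```python
import pandas as pd

def parse_vote_counts(raw):
    """Turn strings such as '42,583' into nullable integers (hypothetical helper)."""
    # Strip thousands separators; missing entries stay missing
    cleaned = raw.str.replace(',', '', regex=False)
    # to_numeric raises on anything that is still not a number
    return pd.to_numeric(cleaned, errors='raise').astype('Int64')

sample_votes = pd.Series(['42,583', '132', None])
parsed_votes = parse_vote_counts(sample_votes)
```

Working on the whole Series at once also avoids the element-by-element assignment that triggered the SettingWithCopyWarning above.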

In [14]:
data['imdb_votes'] = votes_int
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8468 entries, 0 to 8467
Data columns (total 19 columns):
imdbid              8468 non-null object
title               8468 non-null object
plot                8196 non-null object
rating              5252 non-null object
imdb_rating         7735 non-null float64
metacritic          5079 non-null float64
dvd_release         5335 non-null datetime64[ns]
production          6758 non-null object
actors              8153 non-null object
imdb_votes          7735 non-null float64
poster              7967 non-null object
director            8390 non-null object
release_date        8283 non-null datetime64[ns]
runtime             7846 non-null object
genre               8424 non-null object
awards              5242 non-null object
keywords            6381 non-null object
Budget              8468 non-null object
Box Office Gross    8468 non-null object
dtypes: datetime64[ns](2), float64(3), object(14)
memory usage: 1.2+ MB


Now we have to deal with 'Budget' and 'Box Office Gross' in a similar manner.

In [15]:
data[['Budget', 'Box Office Gross']].head()

Unnamed: 0,Budget,Box Office Gross
0,18000,0
1,88300,0
2,220000,46585
3,6590,0
4,0,0


At first glance it looks like we don't have to worry about commas in 'Budget' or 'Box Office Gross', so the conversions should be much simpler. Upon further analysis, however, it turns out that some entries are in Euros instead of USD. We will have to strip out the 'EU' prefix and then convert the number to USD.

In [16]:
budget = data['Budget'].astype('str')
data[budget.str.contains('EU')]

Unnamed: 0,imdbid,title,plot,rating,imdb_rating,metacritic,dvd_release,production,actors,imdb_votes,poster,director,release_date,runtime,genre,awards,keywords,Budget,Box Office Gross
2866,tt4538016,Unless,A writer struggles with her daughter's decisio...,,5.8,,NaT,,"Catherine Keener, Matt Craven, Hannah Gross, C...",32.0,https://images-na.ssl-images-amazon.com/images...,Alan Gilsenan,2016-09-11,90 min,Drama,1 nomination.,,"EU 3,973,431",0


Since there's only 1 value containing EU, it's simple enough to just assign the proper value to it. This movie was released in 2016, so we'll have to use the 2016 Euro to USD exchange rate (1.11), found [here](https://www.statista.com/statistics/412794/euro-to-u-s-dollar-annual-average-exchange-rate/).

In [17]:
3973431 * 1.11

4410508.41

In [18]:
budget[budget.str.contains('EU')] = '4410508'
budget.str.contains('EU').sum()

0

Once we try to apply .astype('int'), we encounter another case: 'CAD'.

In [19]:
data[budget.str.contains('CAD')]

Unnamed: 0,imdbid,title,plot,rating,imdb_rating,metacritic,dvd_release,production,actors,imdb_votes,poster,director,release_date,runtime,genre,awards,keywords,Budget,Box Office Gross
5203,tt1092082,Passchendaele,"The lives of a troubled veteran, his nurse gir...",R,6.5,,2009-11-03,Alliance Atlantis,"Paul Gross, Caroline Dhavernas, Joe Dinicol, M...",7246.0,https://images-na.ssl-images-amazon.com/images...,Paul Gross,2008-10-17,114 min,"Drama, History, Romance",11 wins & 5 nominations.,battle|veteran|canadian-armed-forces|canadian-...,"CAD 20,000,000",0
5764,tt1376195,Gunless,A hardened American gunslinger is repeatedly t...,,6.5,,2011-08-08,Cinema Eopch,"Paul Gross, Sienna Guillory, Dustin Milligan, ...",3157.0,https://images-na.ssl-images-amazon.com/images...,William Phillips,2010-04-30,89 min,"Action, Comedy, Drama",5 wins & 5 nominations.,gunslinger|duel|wild-west|bounty-hunter|blacks...,"CAD 10,000,000",0


Once again, since there are only two such values, it's simple enough to change them by hand. The CAD to USD exchange rates were found [here](https://fxtop.com/en/historical-currency-converter.php?A=100&C1=USD&C2=USD&DD=01&MM=01&YYYY=2008&B=1&P=&I=1&btnOK=Go%21).

In [20]:
[20000000*0.981523, 10000000*1.050118]

[19630460.0, 10501180.000000002]

In [21]:
budget[5203] = '19630460'
budget[5764] = '10501180'

In [22]:
budget.astype('int').head()

0     18000
1     88300
2    220000
3      6590
4         0
Name: Budget, dtype: int64

There aren't any more errors when converting to int, so we've finished parsing Budget.
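The by-hand fixes above follow one pattern: separate an optional currency prefix from the digits, drop the commas, and multiply by an exchange rate. A minimal sketch of that pattern is shown below. The names `split_currency` and `to_usd` are hypothetical, the rates must be supplied by the caller, and the result is truncated to an integer to match the values assigned above.

```python
import re

# Optional alphabetic prefix (e.g. 'EU', 'CAD', 'GBP') followed by digits and commas
_AMOUNT = re.compile(r'^\s*([A-Za-z]*)\s*([\d,]+)\s*$')

def split_currency(value):
    """Split 'CAD 20,000,000' into ('CAD', 20000000); bare numbers get code ''."""
    match = _AMOUNT.match(str(value))
    if match is None:
        raise ValueError(f'unrecognised amount: {value!r}')
    code, digits = match.groups()
    return code.upper(), int(digits.replace(',', ''))

def to_usd(value, rate_for):
    """Convert using a caller-supplied {code: rate} mapping (an assumption)."""
    code, amount = split_currency(value)
    if code == '':
        return amount  # no prefix: treated as already in USD
    return int(amount * rate_for[code])  # truncates, as in the cells above

euro_budget = to_usd('EU 3,973,431', {'EU': 1.11})
```

Raising on unrecognised input is deliberate here: a new prefix would surface immediately instead of failing later inside .astype('int').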

In [23]:
budget = budget.astype('int')
data['Budget'] = budget
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8468 entries, 0 to 8467
Data columns (total 19 columns):
imdbid              8468 non-null object
title               8468 non-null object
plot                8196 non-null object
rating              5252 non-null object
imdb_rating         7735 non-null float64
metacritic          5079 non-null float64
dvd_release         5335 non-null datetime64[ns]
production          6758 non-null object
actors              8153 non-null object
imdb_votes          7735 non-null float64
poster              7967 non-null object
director            8390 non-null object
release_date        8283 non-null datetime64[ns]
runtime             7846 non-null object
genre               8424 non-null object
awards              5242 non-null object
keywords            6381 non-null object
Budget              8468 non-null int64
Box Office Gross    8468 non-null object
dtypes: datetime64[ns](2), float64(3), int64(1), object(13)
memory usage: 1.2+ MB


In [24]:
data['Box Office Gross'].head()

0        0
1        0
2    46585
3        0
4        0
Name: Box Office Gross, dtype: object

Looks like Box Office Gross can be treated similarly to Budget.

In [25]:
gross = data['Box Office Gross']
data[gross.str.contains('GBP')].head()

Unnamed: 0,imdbid,title,plot,rating,imdb_rating,metacritic,dvd_release,production,actors,imdb_votes,poster,director,release_date,runtime,genre,awards,keywords,Budget,Box Office Gross
139,tt0926084,Harry Potter and the Deathly Hallows: Part 1,As Harry races against time and evil to destro...,PG-13,7.7,65.0,2011-04-15,Warner Bros. Pictures,"Bill Nighy, Emma Watson, Richard Griffiths, Ha...",358750.0,https://images-na.ssl-images-amazon.com/images...,David Yates,2010-11-19,146 min,"Adventure, Family, Fantasy",Nominated for 2 Oscars. Another 16 wins & 52 n...,immortality|power|mission|race-against-time|ma...,0,"GBP150,000,000"
496,tt1527835,Archipelago,Deep fractures within a family dynamic begin t...,,6.1,82.0,2014-11-04,Kino Lorber Films,"Christopher Baker, Kate Fahy, Tom Hiddleston, ...",1673.0,https://images-na.ssl-images-amazon.com/images...,Joanna Hogg,2011-03-04,114 min,Drama,6 nominations.,f-rated,0,"GBP500,000"
1058,tt2184287,Summer in February,"A true tale of love, liberty and scandal among...",NOT RATED,5.6,22.0,2014-08-12,Tribeca Film,"Dominic Cooper, Dan Stevens, Jane Cussons, Dap...",3207.0,https://images-na.ssl-images-amazon.com/images...,Christopher Menaul,2014-01-17,100 min,"Biography, Drama, Romance",1 win & 2 nominations.,period-drama|artist|coast|love-triangle|aspiri...,0,"GBP5,000,000"
1259,tt2375574,A Field in England,"Amid the Civil War in 17th-century England, a ...",NOT RATED,6.3,73.0,2014-04-07,Drafthouse Films,"Julian Barratt, Peter Ferdinando, Richard Glov...",7826.0,https://images-na.ssl-images-amazon.com/images...,Ben Wheatley,2013-07-05,90 min,"Drama, History, Horror",1 win & 8 nominations.,occultism|england|17th-century|magic|psychedel...,0,"GBP316,000"
1380,tt2473794,Mr. Turner,An exploration of the last quarter century of ...,R,6.8,94.0,2015-05-05,Sony Pictures Classics,"Timothy Spall, Paul Jesson, Dorothy Atkinson, ...",19996.0,https://images-na.ssl-images-amazon.com/images...,Mike Leigh,2014-10-31,150 min,"Biography, Drama, History",Nominated for 4 Oscars. Another 19 wins & 61 n...,scrofula|sketching|sunrise|prism|rented-room|p...,0,"GBP8,200,000"


Note to self: It appears that several movies have values under Budget but not under Box Office Gross. It seems reasonable to believe that the reverse is true as well. Remember to fix this while dealing with missing values!

There are a lot of values in GBP that we have to convert to USD. Unfortunately, this means we can't just change them on a case-by-case basis. We'll have to leverage the [forex_python](https://forex-python.readthedocs.io/en/latest/index.html) package to obtain exchange rates.

In [26]:
from forex_python.converter import CurrencyRates
c = CurrencyRates()
dates = data[gross.str.contains('GBP')]['release_date']
dates.head()
dates[dates.isna()]
# c.get_rate('GBP', 'USD', date)

2976   NaT
7612   NaT
7764   NaT
Name: release_date, dtype: datetime64[ns]

There are a few movies that don't have release dates. Since there are only three, let's see if we can just manually add these in.

In [27]:
data['title'][[2976, 7612, 7764]]

2976           Whisky Galore
7612    White Irish Drinkers
7764            The Way Back
Name: title, dtype: object

With so few cases, it's easiest to look each one up individually, starting with Whisky Galore.

In [28]:
data[data['title'] == 'Whisky Galore']

Unnamed: 0,imdbid,title,plot,rating,imdb_rating,metacritic,dvd_release,production,actors,imdb_votes,poster,director,release_date,runtime,genre,awards,keywords,Budget,Box Office Gross
2976,tt4769214,Whisky Galore,Scottish islanders try to plunder cases of whi...,,,,NaT,,"James Cosmo, Ellie Kendrick, Sean Biggerstaff,...",,http://ia.media-imdb.com/images/M/MV5BZTViNjdj...,Gillies MacKinnon,NaT,,"Comedy, Romance",,plunder|sinking-ship|world-war-two|telephone-e...,0,"GBP5,400,000"


We needed to look at the full record for Whisky Galore because this film is actually a remake, so the title alone is ambiguous.

[Whisky Galore](https://www.imdb.com/title/tt4769214/) released on May 12, 2017.

[White Irish Drinkers](https://www.imdb.com/title/tt01550524/) released on March 25, 2011.

[The Way Back](https://www.imdb.com/title/tt1023114/) released on January 21, 2011.

In [29]:
data['release_date'][2976] = pd.to_datetime('2017-05-12')
data['release_date'][7612] = pd.to_datetime('2011-03-25')
data['release_date'][7764] = pd.to_datetime('2011-01-21')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
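The warnings above come from chained indexing (`data['release_date'][2976] = ...`). A hedged alternative is to assign through `.loc`, which writes to the frame in a single indexing step. The helper name `set_release_dates` and the small demo frame are assumptions for illustration only.

```python
import pandas as pd

def set_release_dates(frame, fixes):
    """Write manually researched dates into frame, keyed by row label."""
    for row_label, date_string in fixes.items():
        # One .loc call selects row and column together, avoiding chained indexing
        frame.loc[row_label, 'release_date'] = pd.Timestamp(date_string)
    return frame

# Tiny stand-in for the real data frame (illustrative values only)
demo = pd.DataFrame(
    {'title': ['Whisky Galore', 'The Way Back'],
     'release_date': pd.to_datetime([None, None])},
    index=[2976, 7764],
)
set_release_dates(demo, {2976: '2017-05-12', 7764: '2011-01-21'})
```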


We've filled in these release dates. However, forex_python doesn't have exchange rate information from before 1999, so we need to check how many of the GBP movies were released before then.

In [30]:
dates = data[gross.str.contains('GBP')]['release_date']
#sum(data['release_date']<'1999-01-01')
sum(dates<'1999-01-01')
#dates.apply(lambda x: c.get_rate('GBP', 'USD', x))

5

There are only 5 data points with release dates older than 01-01-1999. In order to make our job easier, we will simply remove those.

In [45]:
gbp_table = data['Box Office Gross'].str.contains('GBP')
outdated_table = data['release_date']<'1999-01-01'
outdatedgbp_table = gbp_table & outdated_table
data[outdatedgbp_table]

Unnamed: 0,imdbid,title,plot,rating,imdb_rating,metacritic,dvd_release,production,actors,imdb_votes,poster,director,release_date,runtime,genre,awards,keywords,Budget,Box Office Gross
3638,tt0028358,Things to Come,The story of a century: a decades-long second ...,NOT RATED,6.7,,2000-02-01,United Artists,"Raymond Massey, Edward Chapman, Ralph Richards...",5490.0,https://images-na.ssl-images-amazon.com/images...,William Cameron Menzies,1936-09-14,100 min,"Drama, Sci-Fi, War",,plague|progress|1940s|scientist|outer-space|ba...,0,"GBP300,000"
3731,tt0087803,1984,"In a totalitarian future society, a man whose ...",R,7.2,,2003-03-04,Live Home Video,"John Hurt, Richard Burton, Suzanna Hamilton, C...",53670.0,https://images-na.ssl-images-amazon.com/images...,Michael Radford,1985-03-22,113 min,"Drama, Sci-Fi",Nominated for 1 BAFTA Film Award. Another 5 wi...,dystopia|pneumatic-tube|totalitarianism|helico...,0,"GBP3,000,000"
3803,tt0097937,My Left Foot,"Christy Brown, born with cerebral palsy, learn...",R,7.9,97.0,2005-08-16,HBO Video,"Daniel Day-Lewis, Brenda Fricker, Alison Whela...",53887.0,https://images-na.ssl-images-amazon.com/images...,Jim Sheridan,1990-03-30,103 min,"Biography, Drama",Won 2 Oscars. Another 20 wins & 19 nominations.,foot|cerebral-palsy|irish|flashback|author|poe...,0,"GBP600,000"
3821,tt0099594,Fools of Fortune,A Protestant Irish family is caught up in a co...,PG-13,5.7,,1991-07-03,Working Title Films,"Iain Glen, Mary Elizabeth Mastrantonio, Sean T...",169.0,https://images-na.ssl-images-amazon.com/images...,Pat O'Connor,1990-06-22,109 min,"Drama, Romance",2 wins.,republican|ireland|revenge|black-and-tans|alco...,0,"GBP2,500,000"
4090,tt0120735,"Lock, Stock and Two Smoking Barrels",A botched card game in London triggers four fr...,R,8.2,66.0,1999-08-31,Gramercy Pictures,"Jason Flemyng, Dexter Fletcher, Nick Moran, Ja...",447546.0,https://images-na.ssl-images-amazon.com/images...,Guy Ritchie,1998-08-28,107 min,"Comedy, Crime",Nominated for 1 BAFTA Film Award. Another 13 w...,gangster|cockney-accent|shotgun|hatchet|antiqu...,0,"GBP960,000"


Then, we've successfully isolated the entries whose release dates prevent us from converting their numbers to USD. Now we'll remove them.

In [49]:
data = data.drop(data[outdatedgbp_table].index)
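With the pre-1999 rows gone, the remaining GBP figures can be converted one release date at a time. The sketch below is an assumption about how that step might be organised, not the notebook's own code: `convert_with_dated_rates` is a hypothetical helper that takes the rate lookup as a plain function, so each distinct date is looked up only once and the logic can be tried without a network connection.

```python
import pandas as pd

def convert_with_dated_rates(amounts, dates, get_rate):
    """Multiply each amount by the rate for its date.

    get_rate is any callable taking a Timestamp and returning a float, e.g.
    lambda d: c.get_rate('GBP', 'USD', d) with the CurrencyRates object c.
    """
    # Look up each distinct date once; rows with missing dates are skipped
    unique_dates = pd.DatetimeIndex(dates.dropna().unique())
    rates = pd.Series([get_rate(d) for d in unique_dates], index=unique_dates)
    # Rows whose date is NaT map to NaN and stay unconverted
    return amounts * dates.map(rates)

# Stand-in lookup with made-up rates, used here instead of a live service
lookups = []
def fake_rate(day):
    lookups.append(day)
    return 1.5 if day.year == 2011 else 1.25

gbp = pd.Series([100.0, 200.0, 300.0, 400.0])
when = pd.Series(pd.to_datetime(['2011-03-04', '2011-03-04', '2014-01-17', None]))
usd = convert_with_dated_rates(gbp, when, fake_rate)
```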

In [51]:
# Recompute the mask from the filtered frame so its index matches data after the drop
gbp_mask = data['Box Office Gross'].str.contains('GBP')
dates = data.loc[gbp_mask, 'release_date']
rates = dates.apply(lambda x: c.get_rate('GBP', 'USD', x))


In [None]:
data.isna().sum()

Clearly we are going to have to do something about these missing values. We'll look at what each of these variables is and handle their missing values on a case by case basis.

In [None]:
data['rating'].unique()

Already I get the impression that the missing values should just be 'UNRATED' or 'NOT RATED'. First, we should figure out the distinction between unrated and not rated movies.

Not rated movies are movies that were not submitted to the MPAA for ratings.

Unrated movies are those that have had scenes altered/omitted/added that may or may not have an effect on a movie's rating. Usually this only happens with DVD releases. Using this knowledge, it makes sense to turn all the missing values into 'NOT RATED' categories. There's no good reason to justify giving them a rating.

Another solution is to find another dataset containing each movie and their respective ratings.

In [None]:
data[data['rating'].isna()].head()

Looking at the first few movies with missing ratings, let's get a sense of whether they just have no rating at all, or if the dataset is just missing data.

[Ugly, Dirty, and Bad](https://www.rottentomatoes.com/m/ugly_dirty_and_bad) has a rating of 'NR', Not Rated.

[Losing Ground](https://www.rottentomatoes.com/m/losing_ground_1982) has a rating of 'NR', Not Rated.

[L'argent](https://www.rottentomatoes.com/m/largent) has a rating of 'NR', Not Rated.

[Rebels of the Neon God](https://www.rottentomatoes.com/m/rebels_of_the_neon_god) has a rating of 'NR', Not Rated. Strangely enough, this movie is listed there as having come out on Apr 10, 2015, whereas the dataset has August 4, 1994.

[River of Grass](https://www.rottentomatoes.com/m/river_of_grass) has a rating of 'NR', Not Rated.

So it seems reasonable to assign 'NOT RATED' to each missing rating.

In [None]:
data['rating'] = data['rating'].fillna('NOT RATED')

In [None]:
data.isna().sum()

In [None]:
data['imdb_rating'].unique()

The first thing I notice is that we have fractional ratings (e.g., 8.5, 1.3, 2.2). Also, this is ordinal data: a movie rated 8 is better than a movie rated 7, which is better than a movie rated 6, and so on. So we will bin the ratings into the set [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] after we deal with the missing data.
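The binning rule is not fixed yet, so the following is only one possible sketch. It assumes that ratings are rounded down to the whole number below them and clipped to the 1-10 range; `bin_ratings` is a hypothetical helper, and the nullable `Int64` dtype lets any remaining missing values pass through.

```python
import numpy as np
import pandas as pd

def bin_ratings(ratings):
    """Map fractional ratings to integer bins 1..10 by rounding down (assumed rule)."""
    binned = np.floor(ratings).clip(lower=1, upper=10)
    # Int64 (capital I) is pandas' nullable integer dtype, so NaN survives as <NA>
    return binned.astype('Int64')

sample_ratings = pd.Series([8.5, 1.3, 2.2, 10.0, np.nan])
binned_ratings = bin_ratings(sample_ratings)
```

Rounding to the nearest integer instead would be an equally defensible choice. The key point is that the bins keep the ordering of the original scores.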

In [None]:
data[data['imdb_rating'].isna()]

Notice that a fair number of the movies that are missing IMDB ratings are also missing Metacritic scores, DVD release dates, posters, and box office returns. On top of this, many also have release dates after 9/21/2017. There is no reason to try to impute these values: the movies are missing essentially all relevant information because they hadn't been released when the data was collected.
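One way to identify these rows is to combine the two conditions just described into a boolean mask. This is a minimal sketch under our own assumptions: `unreleased_mask` is a hypothetical helper, and the snapshot date is taken from the 9/21/2017 cutoff mentioned in the variable list.

```python
import pandas as pd

# Date of the box office snapshot described in the variable list
SNAPSHOT_DATE = pd.Timestamp('2017-09-21')

def unreleased_mask(frame, snapshot=SNAPSHOT_DATE):
    """True for rows with no IMDB rating and a release date after the snapshot."""
    # Comparisons against NaT are False, so rows without a date are not flagged
    return frame['imdb_rating'].isna() & (frame['release_date'] > snapshot)

demo_movies = pd.DataFrame({
    'imdb_rating': [float('nan'), 7.1, float('nan'), float('nan')],
    'release_date': pd.to_datetime(['2018-02-01', '2018-03-01', '2016-09-11', None]),
})
flagged = unreleased_mask(demo_movies)
```

The mask only flags rows. Whether to drop them or set them aside is a separate decision.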