# Regression Open-Ended Project

-----

# Previous Notebooks

- Web Scraping
- Cleaning data
- Exploratory Data Analysis

In [1]:
import warnings
warnings.filterwarnings('ignore')

# Feature Engineering

In [2]:
%matplotlib inline
%pylab
import pandas as pd
import numpy as np
import pickle

Using matplotlib backend: TkAgg
Populating the interactive namespace from numpy and matplotlib


In [3]:
ebert_imdb_df = pickle.load(open('../data/interim/ebert_imdb_df.pkl', 'rb'))

In [4]:
ebert_imdb_df.head(2).T

Unnamed: 0,0,1
Title,Call Me by Your Name,Mudbound
IMDB_Rating,8.4,7.4
Rating_Count,4766,2666
Description,Directed by Luca Guadagnino. With Armie Hamme...,"Directed by Dee Rees. With Carey Mulligan, Ga..."
Metascore,93,86
User_Review_Count,39,22
Critic_Review_Count,107,61
Genre_List,"Drama, Romance",Drama
Stars_List,"Armie Hammer,, Timothée Chalamet,, Michael Stu...","Carey Mulligan,, Garrett Hedlund,, Jason Clarke"
Director,Luca Guadagnino,Dee Rees


The functions below will be used to engineer the following features:

* Whether or not the film is foreign (0/1)
* The ratio of user reviews to critic reviews (counts)
* The description length (number of words)
* The review length (number of words)
* Season (convert release date to one of four seasons)

In [5]:
import bisect

def check_foreign(row):
    # 0 for films from the USA, UK, or Canada; 1 for any other country
    try:
        country = row['Country']

        if country in ['USA', 'UK', 'Canada']:
            return 0
        else:
            return 1
    except Exception:
        return np.nan

def user_critic_ratio(row):
    # ratio of user review count to critic review count
    try:
        ratio = row['User_Review_Count'] / row['Critic_Review_Count']
        return ratio
    except Exception:
        return np.nan

def description_length(row):
    # number of whitespace-separated words in the description
    try:
        length = len(row['Description'].split())
        return length
    except Exception:
        return np.nan

def review_length(row):
    # number of whitespace-separated words in the review
    try:
        length = len(row['Review'].split())
        return length
    except Exception:
        return np.nan

def convert_season(row):
    try:
        # get day of year (int)
        day = row['Release_Date'].timetuple().tm_yday

        # bisect_right returns an index from 0 to 4 into this list;
        # we need winter twice: it's at the head and tail of yday values
        seasons = ['Winter', 'Spring', 'Summer', 'Fall', 'Winter']
        # day < 80 --> 0, day >= 355 --> 4
        season = seasons[bisect.bisect_right([80, 173, 264, 355], day)]
        return season
    except Exception:
        return np.nan
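
To make the `bisect` mapping concrete, here is a small standalone sketch that applies the same day-of-year boundaries to a few dates. The helper name `season_of` and the sample dates are illustrative assumptions, not part of the notebook.

```python
import bisect
from datetime import date

# Day-of-year boundaries and season labels copied from convert_season.
BOUNDARIES = [80, 173, 264, 355]
SEASONS = ['Winter', 'Spring', 'Summer', 'Fall', 'Winter']

def season_of(d):
    """Map a date to a season name via its day of year (hypothetical helper)."""
    yday = d.timetuple().tm_yday
    # bisect_right counts how many boundaries are <= yday, giving an index 0-4.
    return SEASONS[bisect.bisect_right(BOUNDARIES, yday)]

for d in [date(2017, 1, 15), date(2017, 3, 21), date(2017, 7, 4),
          date(2017, 11, 24), date(2017, 12, 25)]:
    print(d.isoformat(), season_of(d))
```

Because `bisect_right` is used, a day that falls exactly on a boundary (for example day 80) is assigned to the season that starts there.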

In [6]:
# create the features from funcs above
ebert_imdb_df['Foreign'] = ebert_imdb_df.apply(check_foreign, axis=1)
ebert_imdb_df['UC_Ratio'] = ebert_imdb_df.apply(user_critic_ratio, axis=1)
ebert_imdb_df['Description_Len'] = ebert_imdb_df.apply(description_length, axis=1)
ebert_imdb_df['Review_Len'] = ebert_imdb_df.apply(review_length, axis=1)
ebert_imdb_df['Season'] = ebert_imdb_df.apply(convert_season, axis=1)
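
As a possible alternative to row-wise `apply`, the ratio and word-count features can be computed with column operations. This is a minimal sketch on invented toy rows, not the notebook's method; the frame `toy` and its values are assumptions for illustration. It also guards against a zero critic count, since dividing float columns by zero yields `inf` rather than raising an exception.

```python
import numpy as np
import pandas as pd

# Toy rows; the values are invented for illustration only.
toy = pd.DataFrame({
    'User_Review_Count': [39.0, 22.0, np.nan, 10.0],
    'Critic_Review_Count': [107.0, 61.0, 55.0, 0.0],
    'Description': ['Directed by Luca Guadagnino.', 'Directed by Dee Rees.', None, ''],
})

# Element-wise division; zero denominators become NaN first so the
# result is NaN instead of inf.
toy['UC_Ratio'] = toy['User_Review_Count'] / toy['Critic_Review_Count'].replace(0, np.nan)

# Word count per description; missing descriptions stay NaN.
toy['Description_Len'] = toy['Description'].str.split().str.len()

print(toy[['UC_Ratio', 'Description_Len']])
```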

## Convert MPAA Rating and Season to Numeric

We need to one-hot encode the MPAA rating and Season values.

In [7]:
ebert_imdb_df.head().T

Unnamed: 0,0,1,2,3,4
Title,Call Me by Your Name,Mudbound,Justice League,Wonder,Mr. Roosevelt
IMDB_Rating,8.4,7.4,7.4,8,6.8
Rating_Count,4766,2666,78007,1579,116
Description,Directed by Luca Guadagnino. With Armie Hamme...,"Directed by Dee Rees. With Carey Mulligan, Ga...","Directed by Zack Snyder. With Ben Affleck, Ga...",Directed by Stephen Chbosky. With Jacob Tremb...,"Directed by Noël Wells. With Noël Wells, Nick..."
Metascore,93,86,46,67,73
User_Review_Count,39,22,709,22,
Critic_Review_Count,107,61,286,55,55
Genre_List,"Drama, Romance",Drama,"Action, Adventure, Fantasy",Drama,Comedy
Stars_List,"Armie Hammer,, Timothée Chalamet,, Michael Stu...","Carey Mulligan,, Garrett Hedlund,, Jason Clarke","Ben Affleck,, Gal Gadot,, Jason Momoa","Jacob Tremblay,, Owen Wilson,, Izabela Vidovic","Noël Wells,, Nick Thune,, Britt Lower"
Director,Luca Guadagnino,Dee Rees,Zack Snyder,Stephen Chbosky,Noël Wells


In [8]:
ebert_imdb_df['Rating'].unique().tolist()

['R',
 'NR',
 'PG-13',
 'PG',
 '',
 'G',
 'NC-17',
 'Unrated',
 'TV',
 'PG13',
 'Not rated',
 'No MPAA rating',
 'PG-13&#8206;',
 'No rating',
 'No MPAA Rating',
 '.',
 'g PG-13',
 'R,',
 ': R',
 'PG- 13']

Let's convert the empty values (incl. '.') to unrated, and fix some of the other values.

In [9]:
# dict to map values
mpaa_fix = {'': 'Unrated',
            'TV': 'Unrated',
            'NR': 'Unrated',
            'Not rated': 'Unrated',
            'No MPAA rating': 'Unrated',
            'No rating': 'Unrated',
            'No MPAA Rating': 'Unrated',
            '.': 'Unrated',
            'PG13': 'PG-13',
            'PG-13&#8206;': 'PG-13',
            'g PG-13': 'PG-13',
            'PG- 13': 'PG-13',
            'R,': 'R',
            ': R': 'R',
            'X': 'NC-17'}

In [10]:
# map the messy labels onto the cleaned ones; labels not in the dict are left unchanged
ebert_imdb_df['Rating'] = ebert_imdb_df['Rating'].replace(mpaa_fix)
        
ebert_imdb_df['Rating'].unique()

array(['R', 'Unrated', 'PG-13', 'PG', 'G', 'NC-17'], dtype=object)

This gives us a much better set of ratings. Now let's one-hot encode using the pandas `get_dummies` method.

**Note**: The commented-out line at the top of the next cell is an alternative that creates dummy columns with clean names (e.g. 'Spring' instead of 'Season_Spring') but discards the original column. Without the original label, we would have to keep every dummy column so that each season or rating value is still identifiable from the column names. The active code instead concatenates the dummies onto the data frame, which keeps the original columns, so we can use the `drop_first` argument to reduce each of 'Season' and 'Rating' by one column. A row whose dummy vector is all zeros then indicates the dropped value.

In [11]:
# alternative
# pd.get_dummies(ebert_imdb_df, columns=['Rating','Season'], prefix='', prefix_sep='')

# preserve original columns and reduce collinearity by dropping first
df_rating = pd.get_dummies(ebert_imdb_df['Rating'], drop_first=True)
df_season = pd.get_dummies(ebert_imdb_df['Season'], drop_first=True)
ebert_imdb_df = pd.concat([ebert_imdb_df, df_season, df_rating], axis=1)
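
The effect of `drop_first` can be seen on a tiny example. The sketch below uses an invented `toy` frame; `get_dummies` orders the categories alphabetically, so for seasons the dropped column is 'Fall', which is consistent with the 'Spring', 'Summer', 'Winter' columns in the notebook's output.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({'Season': ['Fall', 'Winter', 'Spring', 'Fall', 'Summer']})

# Categories are sorted alphabetically, so drop_first removes 'Fall'.
season_dummies = pd.get_dummies(toy['Season'], drop_first=True)

# Concatenating keeps the original 'Season' column next to its dummies.
toy = pd.concat([toy, season_dummies], axis=1)
print(toy)
```

Rows where 'Spring', 'Summer', and 'Winter' are all zero (or `False`, depending on the pandas version) are the 'Fall' rows.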

In [12]:
ebert_imdb_df.head(2).T

Unnamed: 0,0,1
Title,Call Me by Your Name,Mudbound
IMDB_Rating,8.4,7.4
Rating_Count,4766,2666
Description,Directed by Luca Guadagnino. With Armie Hamme...,"Directed by Dee Rees. With Carey Mulligan, Ga..."
Metascore,93,86
User_Review_Count,39,22
Critic_Review_Count,107,61
Genre_List,"Drama, Romance",Drama
Stars_List,"Armie Hammer,, Timothée Chalamet,, Michael Stu...","Carey Mulligan,, Garrett Hedlund,, Jason Clarke"
Director,Luca Guadagnino,Dee Rees


In [13]:
ebert_imdb_df.columns.tolist()

['Title',
 'IMDB_Rating',
 'Rating_Count',
 'Description',
 'Metascore',
 'User_Review_Count',
 'Critic_Review_Count',
 'Genre_List',
 'Stars_List',
 'Director',
 'Country',
 'Release_Date',
 'EbertStars',
 'Year',
 'URL',
 'Rating',
 'Runtime',
 'Review',
 'Foreign',
 'UC_Ratio',
 'Description_Len',
 'Review_Len',
 'Season',
 'Spring',
 'Summer',
 'Winter',
 'NC-17',
 'PG',
 'PG-13',
 'R',
 'Unrated']

In [14]:
ebert_imdb_df.shape

(9504, 31)

## Convert Genres to Numerical

For this encoding we use a more manual approach to one-hot encoding.

In [15]:
genre_array = [genre_list.split(",") for genre_list in ebert_imdb_df['Genre_List']]
unique_genres = {genre.strip() for genres in genre_array for genre in genres}
unique_genres.discard('')
unique_genres

{'Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Film-Noir',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western'}

In [16]:
# create a new column of zeros for each genre
for genre in sorted(unique_genres):
    ebert_imdb_df[genre] = np.zeros(len(ebert_imdb_df), dtype=int)

In [17]:
pd.set_option("display.max_columns", 150)
ebert_imdb_df.head()

Unnamed: 0,Title,IMDB_Rating,Rating_Count,Description,Metascore,User_Review_Count,Critic_Review_Count,Genre_List,Stars_List,Director,Country,Release_Date,EbertStars,Year,URL,Rating,Runtime,Review,Foreign,UC_Ratio,Description_Len,Review_Len,Season,Spring,Summer,Winter,NC-17,PG,PG-13,R,Unrated,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,Film-Noir,Game-Show,History,Horror,Music,Musical,Mystery,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,Call Me by Your Name,8.4,4766.0,Directed by Luca Guadagnino. With Armie Hamme...,93.0,39.0,107.0,"Drama, Romance","Armie Hammer,, Timothée Chalamet,, Michael Stu...",Luca Guadagnino,USA,2017-11-24,4.0,2017.0,/reviews/call-me-by-your-name-2017,R,130.0,Luca Guadagnino’s films are all about the tran...,0,0.364486,47,1176,Fall,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Mudbound,7.4,2666.0,"Directed by Dee Rees. With Carey Mulligan, Ga...",86.0,22.0,61.0,Drama,"Carey Mulligan,, Garrett Hedlund,, Jason Clarke",Dee Rees,USA,2017-11-17,4.0,2017.0,/reviews/mudbound-2017,Unrated,134.0,“Mudbound” is all about perception. How it can...,0,0.360656,42,1536,Fall,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Justice League,7.4,78007.0,"Directed by Zack Snyder. With Ben Affleck, Ga...",46.0,709.0,286.0,"Action, Adventure, Fantasy","Ben Affleck,, Gal Gadot,, Jason Momoa",Zack Snyder,USA,2017-11-17,3.0,2017.0,/reviews/justice-league-2017,PG-13,120.0,For a film about a band of heroes trying to st...,0,2.479021,43,1242,Fall,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Wonder,8.0,1579.0,Directed by Stephen Chbosky. With Jacob Tremb...,67.0,22.0,55.0,Drama,"Jacob Tremblay,, Owen Wilson,, Izabela Vidovic",Stephen Chbosky,USA,2017-11-17,3.0,2017.0,/reviews/wonder-2017,PG,113.0,Based on the R.J. Palacio novel of the same na...,0,0.4,49,828,Fall,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Mr. Roosevelt,6.8,116.0,"Directed by Noël Wells. With Noël Wells, Nick...",73.0,,55.0,Comedy,"Noël Wells,, Nick Thune,, Britt Lower",Noël Wells,USA,2017-11-22,3.0,2017.0,/reviews/mr-roosevelt-2017,Unrated,90.0,Emily Martin (Noël Wells) doesn't quite know h...,0,,48,1118,Fall,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [18]:
# fill the genres with 1's
for idx, row in ebert_imdb_df.iterrows():
    for genre in row['Genre_List'].split(","):
        genre = genre.strip()
        if genre != '':
            # .loc replaces the removed .ix indexer
            ebert_imdb_df.loc[idx, genre] = 1
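
For comparison, pandas can build the same kind of indicator columns directly from a delimited string column with `Series.str.get_dummies`. This is a sketch on an invented `genre_lists` series, offered as an alternative rather than as the notebook's approach.

```python
import pandas as pd

# Toy genre strings in the same comma-separated style as Genre_List.
genre_lists = pd.Series(['Drama, Romance', 'Drama', 'Action, Adventure, Fantasy', ''])

# Normalize the separator to a bare comma, then expand into indicator columns.
genre_dummies = (genre_lists
                 .str.replace(r'\s*,\s*', ',', regex=True)
                 .str.get_dummies(sep=','))
print(genre_dummies)
```

An empty genre string produces a row of zeros, which matches the manual loop's handling of `''`.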

In [19]:
ebert_imdb_df.head()

Unnamed: 0,Title,IMDB_Rating,Rating_Count,Description,Metascore,User_Review_Count,Critic_Review_Count,Genre_List,Stars_List,Director,Country,Release_Date,EbertStars,Year,URL,Rating,Runtime,Review,Foreign,UC_Ratio,Description_Len,Review_Len,Season,Spring,Summer,Winter,NC-17,PG,PG-13,R,Unrated,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,Film-Noir,Game-Show,History,Horror,Music,Musical,Mystery,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,Call Me by Your Name,8.4,4766.0,Directed by Luca Guadagnino. With Armie Hamme...,93.0,39.0,107.0,"Drama, Romance","Armie Hammer,, Timothée Chalamet,, Michael Stu...",Luca Guadagnino,USA,2017-11-24,4.0,2017.0,/reviews/call-me-by-your-name-2017,R,130.0,Luca Guadagnino’s films are all about the tran...,0,0.364486,47,1176,Fall,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,Mudbound,7.4,2666.0,"Directed by Dee Rees. With Carey Mulligan, Ga...",86.0,22.0,61.0,Drama,"Carey Mulligan,, Garrett Hedlund,, Jason Clarke",Dee Rees,USA,2017-11-17,4.0,2017.0,/reviews/mudbound-2017,Unrated,134.0,“Mudbound” is all about perception. How it can...,0,0.360656,42,1536,Fall,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Justice League,7.4,78007.0,"Directed by Zack Snyder. With Ben Affleck, Ga...",46.0,709.0,286.0,"Action, Adventure, Fantasy","Ben Affleck,, Gal Gadot,, Jason Momoa",Zack Snyder,USA,2017-11-17,3.0,2017.0,/reviews/justice-league-2017,PG-13,120.0,For a film about a band of heroes trying to st...,0,2.479021,43,1242,Fall,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Wonder,8.0,1579.0,Directed by Stephen Chbosky. With Jacob Tremb...,67.0,22.0,55.0,Drama,"Jacob Tremblay,, Owen Wilson,, Izabela Vidovic",Stephen Chbosky,USA,2017-11-17,3.0,2017.0,/reviews/wonder-2017,PG,113.0,Based on the R.J. Palacio novel of the same na...,0,0.4,49,828,Fall,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Mr. Roosevelt,6.8,116.0,"Directed by Noël Wells. With Noël Wells, Nick...",73.0,,55.0,Comedy,"Noël Wells,, Nick Thune,, Britt Lower",Noël Wells,USA,2017-11-22,3.0,2017.0,/reviews/mr-roosevelt-2017,Unrated,90.0,Emily Martin (Noël Wells) doesn't quite know h...,0,,48,1118,Fall,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [20]:
# we don't need the list of genres for each movie anymore
ebert_imdb_df = ebert_imdb_df.drop(['Genre_List'], axis='columns')

In [21]:
# save the unique genres
pickle.dump(unique_genres, open('../data/interim/unique_genres.pkl', 'wb'))

## Convert Directors to Numerical

We have already seen in EDA that the director may be a somewhat useful predictor of rating, but because we have fewer than 10k data points, we don't want to increase the dimensionality of our data too much. From an encoding standpoint, we can group together all directors whose number of movies falls below some threshold, which reduces the number of dummy variables considerably. Let's keep the total number of features under sqrt(N). Note that the next cell takes N from `ebert_imdb_df.size`, which counts every cell in the data frame (rows times columns) rather than the rows alone.

In [22]:
int(np.sqrt(ebert_imdb_df.size)), ebert_imdb_df.shape[1]

(736, 57)

Right now we have 57 features, and we want to keep that number under 736, so let's filter directors accordingly, but also plan for engineering more features.

In [23]:
print(len(ebert_imdb_df.Director.unique()))
# count the directors with at least 5 movies (sum of a boolean Series)
print((ebert_imdb_df.Director.value_counts() >= 5).sum())

4523
460


By specifying only directors with at least 5 movies in the data, we can keep our dimensionality at a comfortable level (460 + 57 so far).

In [24]:
# filter directors and get their names
series = ebert_imdb_df['Director'].value_counts() >= 5
relevant_directors = series[series].index.values

# create columns for each of these relevant directors
for director in sorted(relevant_directors):
    ebert_imdb_df[director] = np.zeros(len(ebert_imdb_df), dtype=int)

# fill the directors with 1's
for idx, row in ebert_imdb_df.iterrows():
    director = row['Director']
    if director != '' and director in relevant_directors:
        # .loc replaces the removed .ix indexer
        ebert_imdb_df.loc[idx, director] = 1
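
The thresholding idea can also be expressed without explicit loops. The sketch below is a hedged alternative on an invented `directors` series with a toy threshold of 3 (the notebook uses 5); directors below the threshold, and empty names, end up as all-zero rows.

```python
import pandas as pd

# Toy director names; '' stands for a missing director.
directors = pd.Series(['Woody Allen', 'Woody Allen', 'Woody Allen', 'Dee Rees',
                       'Spike Lee', 'Spike Lee', 'Spike Lee', ''])
min_films = 3  # toy threshold for illustration; the notebook uses 5

# Keep only names that meet the threshold and are not empty.
counts = directors.value_counts()
frequent = counts[(counts >= min_films) & (counts.index != '')].index

# Names outside the frequent set become NaN, which get_dummies encodes as all zeros.
director_dummies = pd.get_dummies(directors.where(directors.isin(frequent)), dtype=int)
print(director_dummies)
```

Excluding the empty name up front avoids creating an indicator column with a blank header.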

In [25]:
ebert_imdb_df.head(2)

Unnamed: 0,Title,IMDB_Rating,Rating_Count,Description,Metascore,User_Review_Count,Critic_Review_Count,Stars_List,Director,Country,Release_Date,EbertStars,Year,URL,Rating,Runtime,Review,Foreign,UC_Ratio,Description_Len,Review_Len,Season,Spring,Summer,Winter,NC-17,PG,PG-13,R,Unrated,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,Film-Noir,Game-Show,History,Horror,Music,Musical,Mystery,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western,Unnamed: 58,Abel Ferrara,Adam McKay,Adam Shankman,Adrian Lyne,Agnieszka Holland,Aki Kaurismäki,Alan J. Pakula,Alan Parker,Alan Rudolph,Albert Hughes,Alejandro Amenábar,Alejandro González Iñárritu,Alex Gibney,Alex Proyas,Alexander Payne,Alfonso Cuarón,Alfred Hitchcock,...,Roman Polanski,Ron Clements,Ron Howard,Ron Shelton,Rowdy Herrington,Ruben Fleischer,Ruben Östlund,Sally Potter,Sam Mendes,Sam Miller,Sam Raimi,Scott Derrickson,Scott Hicks,Seth Gordon,Shana Feste,Shari Springer Berman,Shawn Levy,Sidney Lumet,Silvio Soldini,Simon West,Simon Wincer,Sofia Coppola,Sophie Barthes,Spike Lee,Stephen Chow,Stephen Daldry,Stephen Frears,Stephen Herek,Stephen Sommers,Steve Miner,Steven Brill,Steven Soderbergh,Steven Spielberg,Stuart Gordon,Sydney Pollack,Takeshi Kitano,Tate Taylor,Taylor Hackford,Ted Demme,Ted Kotcheff,Terence Davies,Terrence Malick,Terry George,Terry Gilliam,Thomas Carter,Thomas Vinterberg,Tim Burton,Tim Johnson,Tim Story,Tobe Hooper,Todd Haynes,Todd Phillips,Todd Solondz,Tom Holland,Tom Hooper,Tom Shadyac,Tom Tykwer,Tom Vaughan,Tony Scott,Tyler Perry,Uli Edel,Walter Hill,Walter Salles,Wayne Wang,Werner Herzog,Wes Anderson,Wes Craven,William Friedkin,Wim Wenders,Wolfgang Petersen,Woody Allen,Yimou Zhang,Yorgos Lanthimos,Zack Snyder,Éric Rohmer
0,Call Me by Your Name,8.4,4766.0,Directed by Luca Guadagnino. With Armie Hamme...,93.0,39.0,107.0,"Armie Hammer,, Timothée Chalamet,, Michael Stu...",Luca Guadagnino,USA,2017-11-24,4.0,2017.0,/reviews/call-me-by-your-name-2017,R,130.0,Luca Guadagnino’s films are all about the tran...,0,0.364486,47,1176,Fall,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Mudbound,7.4,2666.0,"Directed by Dee Rees. With Carey Mulligan, Ga...",86.0,22.0,61.0,"Carey Mulligan,, Garrett Hedlund,, Jason Clarke",Dee Rees,USA,2017-11-17,4.0,2017.0,/reviews/mudbound-2017,Unrated,134.0,“Mudbound” is all about perception. How it can...,0,0.360656,42,1536,Fall,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [26]:
# remove the original director column
ebert_imdb_df = ebert_imdb_df.drop(['Director'], axis='columns')

## Convert Actors to Numerical

Treating actors just like directors is perhaps not the best approach, partly because there are so many more actors. Instead, let's create a column that indicates a weighted average of ratings for actors appearing in a given movie (weighted by number of total movie appearances).

We will engineer this feature in these steps:

1. Parse unique stars from Stars_List column
2. Create counts of movie appearances for each star
3. Get mean movie ratings for each star
4. Combine values into a single data frame
5. Iterate through the data and compute weighted average of stars mean ratings using the above data frame

### Step 1.

In [27]:
# parse individual stars from the Stars_List column
stars_array = [stars_list.split(",,") for stars_list in ebert_imdb_df['Stars_List']]
unique_stars = {star.strip() for stars in stars_array for star in stars}
unique_stars.discard('')
unique_stars = pd.Series(sorted(unique_stars))

### Steps 2. and 3.

In [28]:
# store how many appearances each star has and mean ratings
star_values  = {}

# get star mean ratings
for star in unique_stars:
    # subset only movies containing the star (literal substring match, not a regex)
    subset = ebert_imdb_df[ebert_imdb_df.Stars_List.str.contains(star, regex=False)]
    # star number of appearances
    star_values.setdefault(star, {})
    star_values[star]['appearances'] = subset.shape[0]
    # get the star mean rating
    star_values[star]['mean_rating'] = subset.EbertStars.mean()
    
star_values

{'Jeremy Renner': {'appearances': 11, 'mean_rating': 3.1818181818181817},
 'Vlad Ivanov': {'appearances': 2, 'mean_rating': 3.5},
 'Damián Alcázar': {'appearances': 1, 'mean_rating': 2.0},
 'Gabriel Spahiu': {'appearances': 1, 'mean_rating': 3.5},
 'Dahong Ni': {'appearances': 1, 'mean_rating': 2.0},
 'Emmanuelle Vaugier': {'appearances': 1, 'mean_rating': 2.0},
 'Debra Winger': {'appearances': 16, 'mean_rating': 2.90625},
 'Emma Stone': {'appearances': 15, 'mean_rating': 2.5},
 'Josh Boles': {'appearances': 1, 'mean_rating': 2.0},
 'Pilar Padilla': {'appearances': 1, 'mean_rating': 3.5},
 'Robert Preston': {'appearances': 2, 'mean_rating': 2.75},
 'Clive Owen': {'appearances': 21, 'mean_rating': 2.8095238095238093},
 'Kelly McGillis': {'appearances': 10, 'mean_rating': 2.45},
 'Mike Gleason': {'appearances': 1, 'mean_rating': 3.5},
 'Siobhan Brooks': {'appearances': 1, 'mean_rating': 3.0},
 'Hannah Murray': {'appearances': 3, 'mean_rating': 2.5},
 'Seymour Bernstein': {'appearances': 

### Step 4.

In [29]:
stars_df = pd.DataFrame(star_values).T
stars_df

Unnamed: 0,appearances,mean_rating
'Weird Al' Yankovic,1.0,1.000000
50 Cent,3.0,2.666667
7 Year Bitch,1.0,3.000000
A Martinez,1.0,3.000000
A-Trak,1.0,3.000000
A. Michael Baldwin,1.0,2.500000
A.J. Buckley,1.0,3.000000
A.J. Cook,1.0,1.500000
A.J. Langer,1.0,1.500000
A.O. Scott,1.0,4.000000


### Step 5.

In [30]:
# create column of zeros
ebert_imdb_df['star_weighted_rating'] = np.zeros(len(ebert_imdb_df))
# iterate through rows and compute weighted average of all stars for given movie
for i,r in enumerate(ebert_imdb_df.Stars_List):
    stars = [s for s in r.split(',') if s != '']
    weighted = stars_df.loc[stars].product(axis=1).sum() / stars_df.loc[stars].appearances.sum()
    ebert_imdb_df.iloc[i, -1] = weighted
ebert_imdb_df.star_weighted_rating.head(10)

0    2.750000
1    3.200000
2    2.568966
3    2.600000
4    3.000000
5    3.088889
6         NaN
7    2.307692
8    2.750000
9    2.666667
Name: star_weighted_rating, dtype: float64

Inspection of the data shows that NaNs are coming from movies with empty stars lists.
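
The five steps can be traced end to end on a toy frame. This is a minimal sketch with invented names and ratings, not a reproduction of the notebook's numbers. It makes two deliberate choices that are assumptions of the sketch: stars are matched by exact name via `explode` and `groupby` rather than by substring search, and names are split on the double comma and stripped so they match the lookup index exactly (splitting on a single comma would leave a leading space on every name after the first).

```python
import pandas as pd

# Toy films; star names and ratings are invented for illustration.
films = pd.DataFrame({
    'Stars_List': ['Ann Lee,, Bo Park', 'Ann Lee,, Cy Diaz', 'Ann Lee', ''],
    'EbertStars': [4.0, 2.0, 3.0, 1.0],
})

# Step 1: one row per (film, star), with names split on ',,' and stripped.
pairs = films.assign(star=films['Stars_List'].str.split(',,')).explode('star')
pairs['star'] = pairs['star'].str.strip()
pairs = pairs[pairs['star'] != '']

# Steps 2-4: appearance count and mean rating per star in one data frame.
stars_df = pairs.groupby('star')['EbertStars'].agg(appearances='count', mean_rating='mean')

def weighted_rating(stars_list):
    """Step 5: appearance-weighted mean of the stars' mean ratings."""
    names = [s.strip() for s in stars_list.split(',,') if s.strip()]
    if not names:
        return float('nan')  # empty stars list
    rows = stars_df.loc[names]
    return (rows['appearances'] * rows['mean_rating']).sum() / rows['appearances'].sum()

films['star_weighted_rating'] = films['Stars_List'].apply(weighted_rating)
print(films)
```

In this toy data, 'Ann Lee' has 3 appearances with a mean of 3.0, so the first film scores (3 × 3.0 + 1 × 4.0) / 4 = 3.25, and the film with an empty stars list gets NaN.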

In [31]:
ebert_imdb_df.head(2)

Unnamed: 0,Title,IMDB_Rating,Rating_Count,Description,Metascore,User_Review_Count,Critic_Review_Count,Stars_List,Country,Release_Date,EbertStars,Year,URL,Rating,Runtime,Review,Foreign,UC_Ratio,Description_Len,Review_Len,Season,Spring,Summer,Winter,NC-17,PG,PG-13,R,Unrated,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,Film-Noir,Game-Show,History,Horror,Music,Musical,Mystery,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western,Unnamed: 57,Abel Ferrara,Adam McKay,Adam Shankman,Adrian Lyne,Agnieszka Holland,Aki Kaurismäki,Alan J. Pakula,Alan Parker,Alan Rudolph,Albert Hughes,Alejandro Amenábar,Alejandro González Iñárritu,Alex Gibney,Alex Proyas,Alexander Payne,Alfonso Cuarón,Alfred Hitchcock,Amy Heckerling,...,Ron Clements,Ron Howard,Ron Shelton,Rowdy Herrington,Ruben Fleischer,Ruben Östlund,Sally Potter,Sam Mendes,Sam Miller,Sam Raimi,Scott Derrickson,Scott Hicks,Seth Gordon,Shana Feste,Shari Springer Berman,Shawn Levy,Sidney Lumet,Silvio Soldini,Simon West,Simon Wincer,Sofia Coppola,Sophie Barthes,Spike Lee,Stephen Chow,Stephen Daldry,Stephen Frears,Stephen Herek,Stephen Sommers,Steve Miner,Steven Brill,Steven Soderbergh,Steven Spielberg,Stuart Gordon,Sydney Pollack,Takeshi Kitano,Tate Taylor,Taylor Hackford,Ted Demme,Ted Kotcheff,Terence Davies,Terrence Malick,Terry George,Terry Gilliam,Thomas Carter,Thomas Vinterberg,Tim Burton,Tim Johnson,Tim Story,Tobe Hooper,Todd Haynes,Todd Phillips,Todd Solondz,Tom Holland,Tom Hooper,Tom Shadyac,Tom Tykwer,Tom Vaughan,Tony Scott,Tyler Perry,Uli Edel,Walter Hill,Walter Salles,Wayne Wang,Werner Herzog,Wes Anderson,Wes Craven,William Friedkin,Wim Wenders,Wolfgang Petersen,Woody Allen,Yimou Zhang,Yorgos Lanthimos,Zack Snyder,Éric Rohmer,star_weighted_rating
0,Call Me by Your Name,8.4,4766.0,Directed by Luca Guadagnino. With Armie Hamme...,93.0,39.0,107.0,"Armie Hammer,, Timothée Chalamet,, Michael Stu...",USA,2017-11-24,4.0,2017.0,/reviews/call-me-by-your-name-2017,R,130.0,Luca Guadagnino’s films are all about the tran...,0,0.364486,47,1176,Fall,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.75
1,Mudbound,7.4,2666.0,"Directed by Dee Rees. With Carey Mulligan, Ga...",86.0,22.0,61.0,"Carey Mulligan,, Garrett Hedlund,, Jason Clarke",USA,2017-11-17,4.0,2017.0,/reviews/mudbound-2017,Unrated,134.0,“Mudbound” is all about perception. How it can...,0,0.360656,42,1536,Fall,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.2


In [32]:
# drop the stars list
ebert_imdb_df = ebert_imdb_df.drop(['Stars_List'], axis='columns')

## Create buckets for decades

In [33]:
# convert to numeric; values that cannot be parsed become NaN
ebert_imdb_df['Year'] = pd.to_numeric(ebert_imdb_df['Year'], errors='coerce')
ebert_imdb_df['Runtime'] = pd.to_numeric(ebert_imdb_df['Runtime'], errors='coerce')

In [34]:
ebert_imdb_df = ebert_imdb_df.dropna(subset=['Release_Date', 'Year'])
# ax = ebert_imdb_df.Year.hist(bins=range(1920,2020, 10), figsize=(17, 8))

In [35]:
decade_buckets = range(1920, 2020, 10)
for decade in decade_buckets:
    ebert_imdb_df[decade] = np.zeros(len(ebert_imdb_df), dtype=int)

In [36]:
# fill decades with 1's
for idx, row in ebert_imdb_df.iterrows():
    decade_idx = int((row['Year'] - 1920) // 10)
    # .loc replaces the removed .ix indexer
    ebert_imdb_df.loc[idx, decade_buckets[decade_idx]] = 1
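
The decade bucketing can also be sketched with integer arithmetic and `get_dummies`. The example below uses an invented `years` series and is an alternative formulation, not the notebook's code; fixing the category list ensures a column exists for every decade even when no film falls in it.

```python
import pandas as pd

# Toy release years for illustration.
years = pd.Series([1927.0, 1968.0, 1999.0, 2017.0])

# Floor each year to the start of its decade, e.g. 2017 -> 2010.
decade = (years // 10 * 10).astype(int)

# Fix the category set so every decade column exists even if unused.
decade_buckets = list(range(1920, 2020, 10))
decade_dummies = pd.get_dummies(
    decade.astype(pd.CategoricalDtype(decade_buckets)), dtype=int)
print(decade_dummies)
```

A year outside the listed decades would become a missing category here and produce an all-zero row.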

In [37]:
ebert_imdb_df.head(2)

Unnamed: 0,Title,IMDB_Rating,Rating_Count,Description,Metascore,User_Review_Count,Critic_Review_Count,Country,Release_Date,EbertStars,Year,URL,Rating,Runtime,Review,Foreign,UC_Ratio,Description_Len,Review_Len,Season,Spring,Summer,Winter,NC-17,PG,PG-13,R,Unrated,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,Film-Noir,Game-Show,History,Horror,Music,Musical,Mystery,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western,Unnamed: 56,Abel Ferrara,Adam McKay,Adam Shankman,Adrian Lyne,Agnieszka Holland,Aki Kaurismäki,Alan J. Pakula,Alan Parker,Alan Rudolph,Albert Hughes,Alejandro Amenábar,Alejandro González Iñárritu,Alex Gibney,Alex Proyas,Alexander Payne,Alfonso Cuarón,Alfred Hitchcock,Amy Heckerling,Andrei Tarkovsky,...,Scott Derrickson,Scott Hicks,Seth Gordon,Shana Feste,Shari Springer Berman,Shawn Levy,Sidney Lumet,Silvio Soldini,Simon West,Simon Wincer,Sofia Coppola,Sophie Barthes,Spike Lee,Stephen Chow,Stephen Daldry,Stephen Frears,Stephen Herek,Stephen Sommers,Steve Miner,Steven Brill,Steven Soderbergh,Steven Spielberg,Stuart Gordon,Sydney Pollack,Takeshi Kitano,Tate Taylor,Taylor Hackford,Ted Demme,Ted Kotcheff,Terence Davies,Terrence Malick,Terry George,Terry Gilliam,Thomas Carter,Thomas Vinterberg,Tim Burton,Tim Johnson,Tim Story,Tobe Hooper,Todd Haynes,Todd Phillips,Todd Solondz,Tom Holland,Tom Hooper,Tom Shadyac,Tom Tykwer,Tom Vaughan,Tony Scott,Tyler Perry,Uli Edel,Walter Hill,Walter Salles,Wayne Wang,Werner Herzog,Wes Anderson,Wes Craven,William Friedkin,Wim Wenders,Wolfgang Petersen,Woody Allen,Yimou Zhang,Yorgos Lanthimos,Zack Snyder,Éric Rohmer,star_weighted_rating,1920,1930,1940,1950,1960,1970,1980,1990,2000,2010
0,Call Me by Your Name,8.4,4766.0,Directed by Luca Guadagnino. With Armie Hamme...,93.0,39.0,107.0,USA,2017-11-24,4.0,2017.0,/reviews/call-me-by-your-name-2017,R,130.0,Luca Guadagnino’s films are all about the tran...,0,0.364486,47,1176,Fall,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.75,0,0,0,0,0,0,0,0,0,1
1,Mudbound,7.4,2666.0,"Directed by Dee Rees. With Carey Mulligan, Ga...",86.0,22.0,61.0,USA,2017-11-17,4.0,2017.0,/reviews/mudbound-2017,Unrated,134.0,“Mudbound” is all about perception. How it can...,0,0.360656,42,1536,Fall,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.2,0,0,0,0,0,0,0,0,0,1


## Export to pickle

In [38]:
pickle.dump(ebert_imdb_df, open('../data/processed/ebert_imdb_final_df.pkl', 'wb'))

# Plan for Following Notebooks

- More Exploratory Data Analysis
- Making predictions
- Final analysis