In [8]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
from scipy import stats
from scipy.stats import pearsonr

# Data Pipeline

## How datasets are joined

### Movie dataset and Character dataset
We join the two datasets on the `freebase_movie_id`.

### Character dataset and Oscar dataset
The Oscar dataset has neither `freebase_movie_id` nor `freebase_actor_id`, so we join on `parsed_actor_name` and `movie_identifier` instead. `parsed_actor_name` is unique within each movie because we drop actors whose `parsed_actor_name` appears more than once in the same movie. `movie_identifier` combines `parsed_movie_name` and `release_year`; it is unique because we drop movies that share a `movie_identifier`.

### Resulting dataset from previous steps and IMDb dataset
We join these datasets using the combination of `parsed_movie_name` and `release_year` as the join key.
<br><br><br>
The resulting dataset after the entire pipeline is run is written to `cache/data.csv`, ready for use in P3.
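The identifier-based join described above can be sketched as follows. This is a minimal illustration on made-up rows, not the code in `data_pipeline.ipynb`: the helper `make_movie_identifier`, the toy titles, and the exact parsing rule (lowercase, punctuation removed) are assumptions chosen to match identifiers such as `the bible in the beginning_1966`.

```python
import pandas as pd


def make_movie_identifier(name: str, year: int) -> str:
    """Hypothetical helper: lowercase the title, drop punctuation, append the year."""
    parsed = "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ").strip()
    return f"{parsed}_{year}"


# Made-up movie rows; "Beta" appears twice with the same release year.
movies = pd.DataFrame({
    "title": ["Alpha: The Film", "Beta", "Beta", "Gamma"],
    "release_year": [1990, 1991, 1991, 1992],
})
movies["movie_identifier"] = [
    make_movie_identifier(title, year)
    for title, year in zip(movies["title"], movies["release_year"])
]

# Drop every movie whose identifier is not unique (keep=False removes all copies).
movies = movies.drop_duplicates(subset="movie_identifier", keep=False)

# Made-up Oscar rows keyed by the same identifier.
oscars = pd.DataFrame({
    "movie_identifier": ["alpha the film_1990"],
    "category": ["ACTOR"],
})

# Left join: movies without a nomination keep NaN in the Oscar columns.
# validate= raises if the left key is unexpectedly duplicated.
merged = movies.merge(oscars, on="movie_identifier", how="left", validate="one_to_many")
print(merged)
```

Dropping all copies of a duplicated identifier trades some data for a key that can be joined on safely.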

In [9]:
%run data_pipeline.ipynb

########## Data pipeline ##########

Preparing CMU data
379 movies shared both name and release year, dropping
314 movies had actors with the same name, dropping

Merging Oscar dataset, after merge:
Number of different Oscar nominated movies in dataset: 952 in total 63968 different movies
Number of different Oscar nominated actors in dataset: 801 in total 134907 different actors
Number of Oscar nominated rows: 1443

Merging IMDb dataset, after merge:
Number of movies with ratings: 36758
Oscar nominated movies with rating: 939
Number of rows in data before cleaning:  443504
Number of rows in data after cleaning:  23819
Number of rows where age is < 0: 7 . Dropping these rows

FINAL STATE OF DATA
Number of rows:  23812
Number of different Oscar nominated movies in dataset: 394 in total 5987 different movies
Number of different Oscar nominated actors in dataset: 284 in total 2959 different actors
Number of Oscar nominated rows: 519
Processing done, dataset written to cache/data.csv


In [10]:
movie_df = pd.read_csv('cache/data.csv', sep=',', index_col=0)

In [11]:
movie_df.head()

Unnamed: 0,title,box_office_revenue,runtime,languages,countries,genres,movie_identifier,actor_gender,actor_height,actor_ethnicity,...,identifier,category,winner,oscar_nominated,year,average_rating,number_of_votes,number_of_movies_starred_in,average_rating_previous_movies,average_box_office_revenue_previous_movies
140029,Down to You,24419914,92,"['French Language', 'English Language']",['United States of America'],"['Romantic comedy', 'Romance Film', 'Drama', '...",down to you_2000,M,1.88,/m/0xnvg,...,down to you_2000_adam carolla,,,False,2000,5.0,15878,1,5.0,24419914.0
60320,The Bible: In The Beginning,34900023,171,['English Language'],"['United States of America', 'Italy']","['Christian film', 'Drama', 'Epic', 'World cin...",the bible in the beginning_1966,M,1.85,/m/03bkbh,...,the bible in the beginning_1966_richard harris,,,False,1966,6.2,6385,1,6.2,34900023.0
389034,Hawaii,34562222,161,['English Language'],['United States of America'],"['Period piece', 'Roadshow theatrical release'...",hawaii_1966,M,1.85,/m/03bkbh,...,hawaii_1966_richard harris,,,False,1966,6.5,3708,1,12.7,69462245.0
130002,Camelot,31102578,178,['English Language'],['United States of America'],"['Costume drama', 'Musical', 'Roadshow theatri...",camelot_1967,M,1.85,/m/03bkbh,...,camelot_1967_richard harris,,,False,1967,6.6,7624,2,9.65,50282411.5
182566,Caprice,4075000,95,['English Language'],['United States of America'],"['Romantic comedy', 'Crime Fiction', 'Mystery'...",caprice_1967,M,1.85,/m/03bkbh,...,caprice_1967_richard harris,,,False,1967,5.5,1761,3,8.266667,34879941.0


## Descriptive statistics and data limitations

### NaN-values

In [12]:
print('Percentage of NaN values in each column:')
movie_df.isnull().sum() * 100 / len(movie_df)

Percentage of NaN values in each column:


title                                          0.000000
box_office_revenue                             0.000000
runtime                                        0.000000
languages                                      0.000000
countries                                      0.000000
genres                                         0.000000
movie_identifier                               0.000000
actor_gender                                   0.000000
actor_height                                   0.000000
actor_ethnicity                                0.000000
actor_age                                      0.000000
parsed_actor_name                              0.000000
actor_identifier                               0.000000
identifier                                     0.000000
category                                      97.820427
winner                                        97.820427
oscar_nominated                                0.000000
year                                           0

Some columns are critical yet have a high share of NaN values, e.g. `actor_ethnicity` and `box_office_revenue`.

We have asked the TAs for input on how to handle these values. We see two options:
1. Make a fully cleaned dataset with no NaNs
2. Have different subsets of data for different analysis questions

Two columns are specific to rows that have been Oscar nominated (category, winner). It is therefore no problem that they have many NaN values.

We examine how much of the data would be lost if we drop all rows with NaN-values in relevant columns:

In [13]:
print(f'Number of data points before dropping NaN values: {len(movie_df)}')

Number of data points before dropping NaN values: 23812


In [14]:
# 'release_date', 'actor_name' and 'has_rating' are not columns of cache/data.csv
# and raise a KeyError in dropna, so they are left out of the subset.
relevant_columns = ['title', 'box_office_revenue', 'runtime', 'languages',
       'countries', 'genres', 'movie_identifier', 'actor_gender',
       'actor_height', 'actor_ethnicity', 'actor_age',
       'parsed_actor_name', 'actor_identifier', 'identifier', 'year', 'average_rating',
       'number_of_votes']
data_points_after_drop = len(movie_df.dropna(subset=relevant_columns))

print(f'Number of complete data points we would have if we dropped all NaN values in relevant columns: {data_points_after_drop}')

We see that a significant portion of the data (~95%) would be lost by removing rows with relevant NaN-values.

### Correlation

In [None]:
cols = ['oscar_nominated', 'number_of_votes', 'average_rating', 'actor_height', 'runtime', 'box_office_revenue']
numerical_df = movie_df[cols].dropna()
print('Nr. of datapoints in the correlation analysis', len(numerical_df))
numerical_df.corr(method='pearson')

Most entries in the correlation matrix are positive. The ones that are negative are small. 

Below we compute the p-value of each pairwise correlation to see which correlations are statistically significant and which are not.

In [None]:
# Calculating p-values and storing the column pairs in the lists 'significant' and 'insignificant' depending on the test outcome.
insignificant = []
significant = []
alpha = 0.05 / 30  # 5% significance level with Bonferroni correction for the 30 off-diagonal entries
for col1 in cols:
    for col2 in cols:
        p_value = pearsonr(numerical_df[col1], numerical_df[col2])[1]
        if p_value > alpha:
            insignificant.append((col1, col2))
        else:
            significant.append((col1, col2))
            

# Printing findings: 
print(len(significant) - len(cols), 'entries in correlation matrix have significant p-value') # Removing self-correlation
print(len(insignificant), 'entries in correlation matrix have insignificant p-value')
print()

# Printing significant column pairs and skipping self-relations. 
print('Significant pairs: ')
for significant_pair in significant: 
    if significant_pair[0] != significant_pair[1]:
        print(significant_pair[0],'&', significant_pair[1])
print()
print()

# Printing insignificant column pairs
print('Insignificant pairs: ')
for insignificant_pair in insignificant: 
    print(insignificant_pair[0], '&', insignificant_pair[1])

The above result indicates that 18 of the 30 off-diagonal entries in the correlation matrix are significant at the 5% level after Bonferroni correction.
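Because the correlation matrix is symmetric, the 30 off-diagonal entries correspond to 15 unique column pairs. A possible variant, sketched here on synthetic data, tests each unique pair once and applies the Bonferroni correction over the number of unique pairs. The frame `demo_df` and its columns are made up for illustration and are not part of our dataset.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic data: 'b' is built from 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
demo_df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=n),
    "c": rng.normal(size=n),
})

# Each unordered pair is tested exactly once.
pairs = list(combinations(demo_df.columns, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-corrected threshold

results = {}
for col1, col2 in pairs:
    r, p_value = pearsonr(demo_df[col1], demo_df[col2])
    results[(col1, col2)] = (r, p_value, p_value < alpha)

for (col1, col2), (r, p_value, is_significant) in results.items():
    print(f"{col1} & {col2}: r={r:.3f}, p={p_value:.3g}, significant={is_significant}")
```

Testing unique pairs only avoids listing every pair twice and makes the number of tests in the correction explicit.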

# Country/nomination analysis

In [None]:
# movie_df contains a row for each actor/movie pair. We select the actors in non-American movies and compare them with the actors in American movies

# All actors/movie rows, American and non-American
total_actors_num = len(movie_df['countries'])
american_total_actors_num = len(movie_df[movie_df['countries'].str.contains('United States of America')])
non_american_total_actors_num = total_actors_num - american_total_actors_num

# All actors/movie rows with an Oscar nomination, American and non-American
total_nominated_actors_num = len(movie_df[movie_df['oscar_nominated'] == True]['countries'])
american_nominations_num = len(movie_df[(movie_df['countries'].str.contains('United States of America')) & (movie_df['oscar_nominated'] == True)])
non_american_nominations_num = total_nominated_actors_num - american_nominations_num

In [None]:
# Observed probability of American actor getting nominated for a film
p_american = american_nominations_num / american_total_actors_num
p_non_american = non_american_nominations_num / non_american_total_actors_num

# We perform a two-sided hypothesis test for whether non-American actors have the same binomial probability of getting nominated as American ones
stats.binomtest(non_american_nominations_num, non_american_total_actors_num, p_american)

Using alpha = 0.05, the p-value of 2.61e-307 is far below the threshold. We reject the null hypothesis that the two groups have the same nomination probability, and conclude that the probability of being nominated for an Oscar differs significantly between American and non-American actors.
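The binomial test treats the American nomination rate as a known constant. A possible cross-check, sketched here with made-up counts, is a chi-square test of independence on the 2x2 table of group versus nomination, which accounts for sampling uncertainty in both groups. The counts below are purely illustrative and are not taken from our data.

```python
import numpy as np
from scipy import stats

# Hypothetical counts, for illustration only.
# Rows: American, non-American. Columns: nominated, not nominated.
table = np.array([
    [400, 15000],
    [100, 8000],
])

chi2, p_value, dof, expected = stats.chi2_contingency(table)

# Observed nomination rate per group.
rates = table[:, 0] / table.sum(axis=1)
print("Nomination rates (American, non-American):", rates.round(4))
print(f"chi2={chi2:.2f}, dof={dof}, p-value={p_value:.3g}")
```

For small counts, `stats.fisher_exact` on the same table would be a natural alternative.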

In [None]:
print('Fraction of American actors nominated for an Oscar:',round(p_american, 5))
print('Fraction of non-American actors nominated for an Oscar:', round(p_non_american, 5))

We see that the observed probability of being nominated is higher for actors in American movies. We believe based on this analysis that the Oscar nominations are generally skewed with higher chances for actors in American movies.

# Logistic regression on movie and actor traits

In [None]:
movie_df.columns

In [None]:
# Finding the most common ethnicities
movie_df.groupby('actor_ethnicity').count().sort_values(by='title', ascending=False).head(15)

In [None]:
# The most frequent ethnicities, in descending order.
# Found the mappings manually, by looking the Freebase ethnicity ids up.

# An alternate solution would probably be to download a Freebase data dump and join using that.
# However, the dataset is quite large so we chose to go this route instead.
ethnicity_map = {    
    'Indian' : '/m/0dryh9k',
    'Black' : '/m/0x67',
    'Jewish' : '/m/041rx', 
    'English' : '/m/02w7gg',
    'Irish_Americans' : '/m/033tf_',
    'Italian_Americans' : '/m/0xnvg',
    'White_people' : '/m/02ctzb',
    'White_Americans' : '/m/07hwkr',
    'Scottish_Americans': '/m/07bch9',
    # '???' : '/m/044038p', Could not find what this Freebase id maps to
    'Irish_people' : '/m/03bkbh',
    'British' : '/m/0d7wh',
    'French' : '/m/03ts0c',
    'Italians' : '/m/0222qb',
    'Tamil' : '/m/01rv7x',   
}

In [None]:
# We standardize the data (zero mean, unit standard deviation) before performing logistic regression
def normalize_column(df_column):
    return (df_column - df_column.mean()) / df_column.std()

In [None]:
normalized_movie_df = movie_df.copy(deep=True)
features_to_normalize = ['actor_age', 'box_office_revenue', 'runtime', 'actor_height', 'year', 'average_rating', 'number_of_votes',]
normalized_movie_df[features_to_normalize] = normalized_movie_df[features_to_normalize].apply(normalize_column)

# Encode oscar_nominated as 0 or 1 for logistic regression
normalized_movie_df['oscar_nominated'] = normalized_movie_df['oscar_nominated'].astype(int)

# One-hot encoding the 5 most frequent ethnicities for the logistic regression:
ethnicities = list(ethnicity_map.keys())[:5]
for name in ethnicities:
    normalized_movie_df[name] = normalized_movie_df['actor_ethnicity'].map(lambda ethnicity: 1 if ethnicity == ethnicity_map[name] else 0)

In [None]:
# The following regression and plotting code was inspired and/or copied from the solution to exercise 4
# We perform logistic regression using a selection of relevant features from the dataframe
mod = smf.logit(formula='oscar_nominated ~  runtime + box_office_revenue + actor_height + \
                        actor_age + year + average_rating + number_of_votes + \
                        C(Indian) + C(Black) + C(Jewish) + C(English) + C(Irish_Americans)', data=normalized_movie_df)

In [None]:
# Fit the model and print results
res = mod.fit()
print(res.summary())

Note: we get a RuntimeWarning because `np.exp(-X)` overflows for large values of `-X`, so `1 + np.exp(-X)` cannot be represented properly. This does not affect the output, which is 0 in that case anyway (division by a very large number).
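The overflow described in the note can be reproduced in isolation. The sketch below uses arbitrary example inputs and compares the naive sigmoid with `scipy.special.expit`, a numerically stable implementation; it is an illustration and not what statsmodels does internally.

```python
import numpy as np
from scipy.special import expit

x = np.array([-1000.0, 0.0, 1000.0])

# Naive sigmoid: np.exp(1000) overflows to inf, so the first entry becomes 1 / inf = 0.
# errstate silences the overflow RuntimeWarning for this demonstration.
with np.errstate(over="ignore"):
    naive = 1.0 / (1.0 + np.exp(-x))

# Stable sigmoid from SciPy gives the same values without the overflow.
stable = expit(x)

print("naive: ", naive)
print("stable:", stable)
```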

In [None]:
# feature names
variables = res.params.index

# quantifying uncertainty!

# coefficients
coefficients = res.params.values

# p-values
p_values = res.pvalues

# standard errors
standard_errors = res.bse.values

#confidence intervals
res.conf_int()

In [None]:
#sort them all by coefficients
l1, l2, l3, l4 = zip(*sorted(zip(coefficients[1:], variables[1:], standard_errors[1:], p_values[1:])))

In [None]:
# Plot the results
plt.errorbar(l1, np.array(range(len(l1))), xerr= 2*np.array(l3), linewidth = 1,
             linestyle = 'none',marker = 'o',markersize= 3,
             markerfacecolor = 'black',markeredgecolor = 'black', capsize= 5)

plt.vlines(0,0, len(l1), linestyle = '--')

plt.yticks(range(len(l2)),l2)
plt.xlabel('Coefficient')
plt.title('Logistic regression coefficient by feature')

Lines around the points represent an approximate 95% confidence interval (±2 standard errors) for the coefficient of each feature.
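The error bars can be related to a normal-approximation confidence interval, coefficient ± z · standard error with z ≈ 1.96 for 95%. The sketch below uses made-up coefficients and standard errors, not values from our fitted model.

```python
import numpy as np
from scipy import stats

# Hypothetical coefficients and standard errors, for illustration only.
coef = np.array([0.8, -0.1])
std_err = np.array([0.1, 0.2])

# Two-sided 95% interval under a normal approximation.
z = stats.norm.ppf(0.975)
lower = coef - z * std_err
upper = coef + z * std_err

# A coefficient is distinguishable from zero at this level if its interval excludes 0.
excludes_zero = (lower > 0) | (upper < 0)

for c, lo, hi, flag in zip(coef, lower, upper, excludes_zero):
    print(f"coef={c:+.2f}, CI=[{lo:+.3f}, {hi:+.3f}], excludes zero: {flag}")
```

In the plot, an interval that crosses the dashed line at 0 corresponds to a coefficient that is not significantly different from zero at roughly the 5% level.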

From this plot we see that there are multiple factors that can be used to predict whether a movie/actor row will be nominated or not. This serves as an initial analysis. We will do this more thoroughly in P3 to draw more relevant conclusions for our research questions.

# Review analysis

To extract movies with nominated actors, we need to find every movie where at least one row in the column `oscar_nominated` is True.
To extract movies without a nominated actor, we need to find every movie where every row in the column `oscar_nominated` is False.
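One compact way to express this rule is a per-movie `any()` over the nomination flag. The sketch below runs on made-up rows and is an alternative formulation, not the selection code used in the next cell.

```python
import pandas as pd

# Made-up actor/movie rows, for illustration only.
rows = pd.DataFrame({
    "movie_identifier": ["a_2000", "a_2000", "b_2001", "b_2001", "c_2002"],
    "oscar_nominated": [False, True, False, False, True],
})

# True for a movie if at least one of its rows is nominated, False if none is.
movie_nominated = rows.groupby("movie_identifier")["oscar_nominated"].any()

nominated_ids = movie_nominated[movie_nominated].index.tolist()
not_nominated_ids = movie_nominated[~movie_nominated].index.tolist()

print("Nominated:", nominated_ids)
print("Not nominated:", not_nominated_ids)
```

By construction, the two lists are disjoint and together cover every movie.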

In [None]:
# Grouping all rows by movie_identifier, keeping one row per movie in unique_movies_df
unique_movies_df = movie_df.groupby('movie_identifier').first().reset_index()

print('Shape before: ', unique_movies_df.shape)
unique_nominated_movies_df = movie_df[movie_df['oscar_nominated'] == True].groupby('movie_identifier').first().reset_index()
# Mask is True if a movie from unique_movies_df is not in the dataframe unique_nominated_movies_df
mask = ~unique_movies_df['movie_identifier'].isin(unique_nominated_movies_df['movie_identifier'])

# Applying the mask 
not_nominated_df = unique_movies_df[mask]
# Checking the intersection between nominated and not nominated movies, should be 0 
print('Intersection between nominated and not nominated: ', pd.Series(list(set(unique_nominated_movies_df['movie_identifier']).intersection(set(not_nominated_df['movie_identifier'])))))

unique_movies_df = pd.concat([unique_nominated_movies_df, not_nominated_df], axis = 0) 
print('Shape after: ', unique_movies_df.shape)

The intersection is empty, so the selection worked.

In [None]:
# Removing movies without imdb ratings
movie_unique_with_rating_df = unique_movies_df[unique_movies_df['average_rating'].notna()]


In [None]:
print('Movies with oscar nominated actors with ratings: ', len(movie_unique_with_rating_df[movie_unique_with_rating_df['oscar_nominated'] == True]))

In [None]:
# Extracting nominated movies and not nominated movies
nominated = movie_unique_with_rating_df[movie_unique_with_rating_df['oscar_nominated']]
not_nominated = movie_unique_with_rating_df[movie_unique_with_rating_df['oscar_nominated'] == False]
assert nominated.shape[0] + not_nominated.shape[0] == movie_unique_with_rating_df.shape[0]

In [None]:
# We exclude all movies with fewer than 30 reviews. There are no movies with oscar nominated actors with fewer than 30 reviews.
# This is based on a rule of thumb to exclude outliers / low confidence values 
excluded = not_nominated[not_nominated['number_of_votes'] < 30]
print('Excluded nr of movies from analysis due to few reviews (< 30): ', len(excluded))
not_nominated = not_nominated[not_nominated['number_of_votes'] >= 30]
nominated = nominated[nominated['number_of_votes'] >= 30]

In [None]:
# Empirical rating distributions (density histograms) for nominated and not nominated movies

sns.histplot(nominated, x="average_rating", stat = 'density', color = 'gold',label ='Nominated', bins =40)
sns.histplot(not_nominated, x="average_rating", stat="density", color = 'grey', label = 'Not nominated', bins = 50)

plt.title('Rating distribution of movies')
plt.xlabel('IMDB rating')
plt.ylabel('Probability density')
plt.legend()
plt.show()


These empirical distributions look different. We use a two-sample Kolmogorov-Smirnov test to check whether they differ. The null hypothesis is that both samples come from the same distribution, and we reject it if the p-value is below 0.05. The test statistic is the largest vertical distance between the two empirical CDFs: a value of 0 means the empirical distributions are identical, and a value of 1 means the two samples do not overlap at all.
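To make the test statistic concrete, it can be computed by hand as the maximum gap between the two empirical CDFs. The sketch below uses synthetic samples and a hypothetical helper `ks_statistic`, and compares the result with `scipy.stats.ks_2samp`.

```python
import numpy as np
from scipy import stats


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a = np.sort(sample_a)
    b = np.sort(sample_b)
    # The maximum gap is always attained at one of the observed values.
    grid = np.concatenate([a, b])
    ecdf_a = np.searchsorted(a, grid, side="right") / len(a)
    ecdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(ecdf_a - ecdf_b))


# Synthetic samples: two normal distributions with shifted means.
rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=300)
y = rng.normal(loc=1.0, scale=1.0, size=400)

manual = ks_statistic(x, y)
reference = stats.ks_2samp(x, y).statistic
print(f"manual={manual:.4f}, scipy={reference:.4f}")
```

For completely separated samples such as `[1, 2, 3]` and `[10, 11, 12]`, the helper returns 1, and for two identical samples it returns 0.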

In [None]:
stats.kstest(nominated['average_rating'], not_nominated['average_rating'])

The test statistic is 0.55, meaning the distributions differ substantially but still overlap.
P-value = 1.6450047550532477e-259. This is extremely small, so we can safely reject the null hypothesis and conclude that the two rating distributions are different.

In [None]:
# Plotting the number of IMDb reviews per movie
sns.histplot(not_nominated, x="number_of_votes", bins=50, label = 'Not nominated', color = 'grey')
sns.histplot(nominated, x="number_of_votes", bins=50, label = 'Nominated', color = 'gold')
plt.yscale('log')
plt.title('Review distribution')
plt.xlabel('Reviews per movie (millions)')
plt.ylabel('Nr. of movies (log)')
plt.legend()
plt.show()

In [None]:
stats.kstest(not_nominated['number_of_votes'], nominated['number_of_votes'])

In [None]:
# As per the plot above, most movies with nominated actors have fewer than 500 000 reviews.
# We zoom in and look at the movies with few reviews. 

lim_not_nominated = not_nominated[not_nominated['number_of_votes'] < 10000]
lim_nominated = nominated[nominated['number_of_votes'] < 10000]

print('Share of not nominated movies with fewer than 10 000 reviews:', round(len(lim_not_nominated)/len(not_nominated)*100,2), '%')
print('Share of nominated with fewer than 10 000 reviews:', round(len(lim_nominated)/(len(nominated))*100,2),  '%')

sns.histplot(lim_not_nominated, x="number_of_votes", bins=50, label = 'Not nominated', color = 'grey')
sns.histplot(lim_nominated, x="number_of_votes", bins=50, label = 'Nominated', color = 'gold')

plt.yscale('log')
plt.title('Reviews per movie')
plt.xlabel('Reviews')
plt.ylabel('Nr. of movies (log)')
plt.legend()
plt.show()

We can see that most movies with relatively few reviews are not nominated.

## Box-Office Revenue

Note: we will adjust box-office revenues for inflation in P3 for higher accuracy.

In [None]:
sns.histplot(not_nominated, x="box_office_revenue", stat="density", color = 'grey', label = 'Not nominated', bins = 60)
sns.histplot(nominated, x="box_office_revenue", stat = 'density', color = 'gold',label ='Nominated', bins = 40)

plt.title('Box office revenue distribution')
plt.yscale('log')
plt.xlabel('Box-office revenue (billions) ')
plt.ylabel('Probability density (log)')
plt.legend()
plt.show()

Note that the above plot shows a probability density and that the y-axis is on a log scale. We are surprised that the movies with nominated actors do not seem to be the ones with the highest revenue. To investigate this, we look into movies with lower box-office revenue.

In [None]:
lim_not_nominated = not_nominated[not_nominated['box_office_revenue'] < 10**7]
lim_nominated = nominated[nominated['box_office_revenue'] < 10**7]

sns.histplot(lim_not_nominated, x="box_office_revenue", stat="density", color = 'grey', label = 'Not nominated', bins = 20)
sns.histplot(lim_nominated, x="box_office_revenue", stat = 'density', color = 'gold',label ='Nominated', bins = 20)

plt.title('Box office revenue distribution for movies with revenue less than 10**7')
plt.xlabel('Box-office revenue (10s of millions)')
plt.ylabel('Probability density')
plt.legend()
plt.show()


We can see that movies with nominated actors have revenue within an intermediate range. They are neither the movies with the highest revenue nor the movies with the lowest revenue. We think this will be a hypothesis to explore further in P3.