# Cast
- This notebook intends to investigate the casts of the CMU Movie Summary Corpus dataset.
- The notebook will do some initial analyses to see how the cast and individual movie actors affects the box office revenue.

**Summary**

- By including all actors that have played in more than 15 movies, we get $R^2$=xx
- 

**Contents of Notebook**

-

In [1]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

# turn off warning
pd.options.mode.chained_assignment = None

In [2]:
# constants
data_folder = './data/'
MOVIE_PATH = data_folder + 'movie.metadata.tsv'
CHARACTER_PATH = data_folder + 'character.metadata.tsv'
RATING_PATH = data_folder + 'title.ratings.tsv'

# Data Processing

### Loading data

In [3]:
# create dataframes

# define collumn names
colnames_movies = [
    "wikipedia_movie_ID",
    "freebase_movie_ID",
    "name",
    "release_date",
    "box_office_revenue",
    "runtime",
    "languages",
    "countries",
    "genres",
]


colnames_character = [
    "wikipedia_movie_ID",
    "freebase_movie_ID",
    "last_update",
    "character_name",
    "actor_DOB",
    "actor_gender",
    "actor_height",
    "actor_ethnicity",
    "actor_name",
    "actor_age_at_movie_release",
    "freebase_character/actor_map_ID",
    "freebase_character_ID",
    "freebase_actor_ID",
]


# load data
movies = pd.read_csv(MOVIE_PATH, sep="\t", names=colnames_movies, header=None)
characters = pd.read_csv(
    CHARACTER_PATH, sep="\t", names=colnames_character, header=None
)

In [4]:
# Removing movies from before 2000
movies = movies[movies['release_date'] >= '2000']
movies.shape

(24496, 9)

In [5]:
movies.shape

(24496, 9)

In [6]:
characters.shape

(450669, 13)

In [7]:
# Merging movies with characters on wiki_id. 
movies_characters = pd.merge(left=movies, right=characters, on=['wikipedia_movie_ID', 'freebase_movie_ID'])

In [None]:
movies_characters.isna().sum()

Comment: Both box office revenue and freebase actor ID contain some NaN values we want to remove before exploring actors' effect on revenue. 

In [None]:
# Removing movies without freebase_actor_id
movies_characters = movies_characters[movies_characters['freebase_actor_ID'].notna()]

# Removing movies without box_office_revenue
movies_characters = movies_characters[movies_characters['freebase_actor_ID'].notna()]

In [None]:
movie_count = movies_characters.shape[0]
unique_combos = movies_characters.value_counts(subset=['wikipedia_movie_ID', 'freebase_actor_ID'], dropna=False).shape[0]

print(
    """
    Total number of movies in our merged dataset: {}
    Unique number of combinations of 'wikipedia_movie_ID' and 'freebase_actor_ID': {}
    """.format(movie_count, unique_combos))

Comment: Some actors are listed mulitple times because they play different characters in the same movie. We only want unique combinations of 'wikipedia_movie_ID' and 'freebase_actor_ID'. 


In [None]:
# We filter out duplicated combinations of 'wikipedia_movie_ID' and 'freebase_actor_ID'. 
# For now, we do not care which row we keep
movies_characters = movies_characters.drop_duplicates(subset=['wikipedia_movie_ID', 'freebase_actor_ID'], keep='first')

In [None]:
movies_characters[['wikipedia_movie_ID', 'freebase_actor_ID']].value_counts().shape

In [None]:
# Number of unique movies
num_movies = movies_characters["wikipedia_movie_ID"].nunique()
num_movies

In [None]:
# Number of unique actors in dataset
num_actors_unique = movies_characters["freebase_actor_ID"].nunique()
num_actors_unique

In [None]:
# Checking number of actors in dataset, counting actors multiple times if they play
# in multiple movies
count_actors = movies_characters["freebase_actor_ID"].count()
count_actors

In [None]:
# Number of actors on average per movie
count_actors / num_movies

In [None]:
# One Hot Encoding of Actors
movies_characters_dummy = pd.get_dummies(data=movies_characters, columns=['freebase_actor_ID'])

# As long as the actor have played in the movie, we want to display the corresponding value as 1
# If the actor has played multiple characters in the same movie, the value is still 1
movies_characters_dummy = movies_characters_dummy

In [None]:
movies_characters_dummy.apply(lambda x: )

In [None]:
# Only include wikipedia movie ID and one hot encoding of actors in dataframe
dummy_actor_columns = movies_characters_dummy.filter(regex='wikipedia_movie_ID|freebase_actor_ID_')

In [None]:
# Only include actor columns that correspond to actors that have played in more than 16 movies
# Threshold=16 is the lowest we can go to not exceed the maximum recursion depth in the
# linear regression, which we will get back to. 
dummy_actor_columns = dummy_actor_columns.loc[:, dummy_actor_columns.sum(axis=0) > 16]

In [None]:
# Grouping movies such that every movie correspond to only one row in the dataframe
dummy_actor_columns = dummy_actor_columns.groupby('wikipedia_movie_ID').agg('sum')

In [None]:
dummy_actor_columns.shape

In [None]:
# Merging dummy actor columns with the movies
movies_binary_actors = pd.merge(movies, dummy_actor_columns, left_on='wikipedia_movie_ID', right_index=True)
movies_binary_actors.head(1)

In [None]:
# Removing slashes in column names to avoid error in regression. 
movies_binary_actors.columns = movies_binary_actors.columns.str.replace('/', '')
dummy_actor_columns.columns = dummy_actor_columns.columns.str.replace('/', '')

In [None]:
# Constructing formula used for regression
# For now we only include actors as categorical predictors
formula = 'box_office_revenue ~ '
for col in dummy_actor_columns.columns:
    formula += 'C(' + col + ')+'
    
formula = formula[:-1]

In [None]:
# Linear regression
mod = smf.ols(formula=formula, data=movies_binary_actors)
res = mod.fit()
res_summary = res.summary()

In [None]:
# Note: The smallest eigenvalue is 3.33e-30. This might indicate that there are
# strong multicollinearity problems

In [None]:
print("Our model with actors that played in more than 16 movies gets R-squared = {:.2f}".format(res.rsquared))
print("The corresponding adjusted R-squared is: {:.2f}".format(res.rsquared_adj))

Comment: 
Our model explains 70% of the change in the box office revenue. However, we have many variables included (actors), which could make our model look more accurate even if there are many actors poorly contributing in some way. The adjusted R-squared score (56%) may be telling us that some variables are not contributing to our model’s R-squared properly. There are also most likely confounders in place. 

In [None]:
# Turning result summary into a dataframe
res_as_html = res_summary.tables[1].as_html()
summary_df = pd.read_html(res_as_html, header=0, index_col=0)[0]

In [None]:
summary_df.head(1)

In [None]:
# Only including actors with p-value < 0.05.
# Statistically significant actors
summary_df = summary_df[summary_df['P>|t|'] < 0.05]

# Sorting the dataframe in descending order according to coefficient
summary_df.sort_values(by='coef', ascending=False, inplace=True)
summary_df.head(1)