# ADA Project: Welcome to the final analysis of **American Influence in the Cinema Industry**

## Authors
- Group Name: ADACTYLOUS
    - Chloé Bouchiat
    - Claire Pinson
    - Germana Sergi
    - Luca Soravia
    - Marlen Stöckli

## Notebook's structure (TO BE UPDATED FOR P3)
- Main libraries and specific functions from the utils folder
- Our analysis workflow with a markdown cell explaining each code cell
- The notebook reads as follows:
    - movie.metadata enriched with IMDb ratings (from the IMDb Non-Commercial Datasets, [IMDb](https://developer.imdb.com/))
        - General information about the dataset (i.e. basic stats, first visualization)
        - Exploratory analysis of the dataset according to our research questions
    - character.metadata enriched with actors' nationalities from Wikipedia ([DBpedia](https://www.dbpedia.org/about/))
        - General information about the dataset (i.e. basic stats, first visualization)
        - Exploratory analysis of the dataset according to our research questions

#### Recall of the repository structure:
```
📁 ada-2024-project-adactylous
│
├── 📄 results.ipynb (where all our plots and analyses are)
├── 📄 .gitignore (what is ignored during push and pull requests)
├── 📄 requirements.txt (install into your environment)
├── 📄 README.md
│
├── 📁 data
       │── 📄 actor_metadata_CMU.csv
       │── 📄 movie_metadata_CMU_IMDB.csv
       │── 📄 nationality.csv
       │── 📄 personas_metadata_CMU.csv
       │── 📄 plot_summaries_CMU.csv
├── 📁 src
    ├── 📁 data
    ├── 📁 models
    ├── 📁 scripts
        │── 📄 datasets_cleaner.py
        │── 📄 nationality_importer.py
    ├── 📁 utils
└── 📁 tests
``` 
**NOTE**: Other empty folders and .py files will be filled up later on during the project.

In [46]:
# Import the needed libraries
import warnings # to ignore pandas version warning
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm

from matplotlib.ticker import MaxNLocator
from scipy.stats import ttest_ind, spearmanr # to implement statistical tests
from sklearn.metrics import r2_score
from sklearn.cluster import KMeans # for actors analysis
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


############# Add more libraries if needed

In [2]:
########################### Read datasets from repository's data folder ############################
movie_mtd = pd.read_table('data/movie_metadata_CMU_IMDB.csv', sep=',')
actor_mtd = pd.read_table('data/actor_metadata_CMU.csv', sep=',')
personas_mtd = pd.read_table('data/personas_metadata_CMU.csv', sep=',')

In [3]:
# Set a global background theme for all our plots and ignore warnings
sns.set_theme(style="darkgrid")
warnings.filterwarnings("ignore")

### Movie Success index

In [351]:
# Create function to encode multilabel features

def multi_label_binarizer(df, col, k):
    # df : input data frame
    # col : name of the column holding comma-separated labels (string)
    # k : minimum number of movies in which a label must be positive to be kept
    # return : copy of df with `col` replaced by one-hot encoded columns

    # Work on a copy so that the caller's dataframe is not modified in place
    df = df.copy()
    # Split cells into lists of labels, so that each label is stored separately
    df[col] = df[col].str.split(',')

    # Process the dataset for a linear regression
    # Since columns such as genres and countries contain more than one label per movie,
    # we use MultiLabelBinarizer
    # (pd.get_dummies does not work correctly here and yields tens of thousands of columns)
    # (found on : https://www.kaggle.com/discussions/questions-and-answers/66693)
    mlb = MultiLabelBinarizer()
    # Encode the multilabel data as a binary indicator matrix
    df_temporary = mlb.fit_transform(df[col])
    # Convert to a DataFrame to concatenate with the original one; the column names are stored in mlb.classes_
    df_temporary = pd.DataFrame(df_temporary, columns=mlb.classes_, index=df.index)

    # Drop rare features (i.e. features that are positive in fewer than k movies of the dataset)
    p=0
    for feature in mlb.classes_:
        if len(df_temporary[df_temporary[feature]==1])<k:
            p+=1
            df_temporary=df_temporary.drop(columns=[feature])
            #print(f'{feature} was dropped...')

    print(f'For the {col} feature, {p} categories were dropped.')

    # Merge the encoded columns with the other variables
    df = pd.concat([df, df_temporary], axis=1)
    # X = pd.concat([X, X_countries['United States of America']], axis=1) #to add only USA
    # Drop the original column that holds the aggregated labels
    df = df.drop(columns=[col])

    return df
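
As a quick sanity check of the encoding step above, here is a minimal sketch on a made-up three-movie frame. The data, the column name `genres` and the threshold `k = 2` are illustrative assumptions, not values from our dataset. It calls `MultiLabelBinarizer` directly and then keeps only the labels that are positive in at least `k` rows, mirroring the logic of `multi_label_binarizer`.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data (hypothetical): comma-separated genre labels per movie
toy = pd.DataFrame({"genres": ["drama,comedy", "drama", "comedy,western"]})

# Split each cell into a list of labels, then one-hot encode the lists
labels = toy["genres"].str.split(",")
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(labels), columns=mlb.classes_, index=toy.index)

# Keep only the labels that are positive in at least k rows
k = 2
kept = encoded.loc[:, encoded.sum() >= k]
print(kept)
```

In this toy case `western` appears in a single movie and is therefore dropped, while `comedy` and `drama` are kept.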

In [352]:
# Names of all possible variables for the linear regression

print(movie_mtd.columns)

Index(['wiki_movie_ID', 'freebase_movie_ID', 'title', 'release_date',
       'box_office', 'runtime', 'languages', 'countries', 'genres_CMU',
       'release_year', 'genres_IMDB', 'averageRating', 'numVotes'],
      dtype='object')


In [353]:
# Choose the variables on which we'll train the linear regression
features=['box_office', 'languages', 'runtime', 'genres_CMU','release_year']
# Multilabel features to encode later
features_multilabels=[ 'languages', 'genres_CMU']



########################################### Remove NaNs #################################################
movie_mtd_ratings=movie_mtd.dropna(subset='averageRating')
# We first drop any row that has a NaN in any of the selected features
movie_mtd_ratings=movie_mtd_ratings.dropna(subset=features)
print(f'The dataset that contains the ratings has a length of {len(movie_mtd_ratings)} movies')

#preprocessing => move to appropriate file
movie_mtd_ratings['countries']=movie_mtd_ratings['countries'].str.replace(', ',',')
movie_mtd_ratings['countries']=movie_mtd_ratings['countries'].str.lower()
movie_mtd_ratings['languages']=movie_mtd_ratings['languages'].str.replace(', ',',')
movie_mtd_ratings['languages']=movie_mtd_ratings['languages'].str.lower()
movie_mtd_ratings['genres_IMDB']=movie_mtd_ratings['genres_IMDB'].str.lower()
movie_mtd_ratings['genres_CMU']=movie_mtd_ratings['genres_CMU'].str.lower()
movie_mtd_ratings['genres_CMU']=movie_mtd_ratings['genres_CMU'].str.replace(', ',',')


print(movie_mtd_ratings['countries'])

############################# Split data into features and target variables #############################
# Variables to train on
X = movie_mtd_ratings[features]
# Target variable 
Y = movie_mtd_ratings['averageRating']

######################################## Plotting for intuition ######################
#movie_mtd_ratings['box_office']=np.log(movie_mtd_ratings['box_office'])
#sns.scatterplot(movie_mtd_ratings, x='runtime', y='averageRating')

####################################### Encode multilabel features #######################################
####################################### Drop rare features ###############################################
# We choose to drop features that are positive in fewer than k movies.
# This is because we don't have enough evidence that these features impact the ratings positively or negatively.
# Keeping them would make the model prone to overfitting.
k=10
for feature in features_multilabels :
    X=multi_label_binarizer(X, feature, k)
    print(feature)



####################################### Apply (nonlinear) operations on features ###########################
# Log-transform the box office since its values are spread over many orders of magnitude
X['box_office']=np.log(X['box_office'])




The dataset that contains the ratings has a length of 3668 movies
0                                 united states of america
7                                 united states of america
13                                          united kingdom
17                                united states of america
36                                united states of america
                               ...                        
81664                             united states of america
81666                      france,united states of america
81701    kingdom of great britain,japan,england,united ...
81702                             united states of america
81727                             united states of america
Name: countries, Length: 3668, dtype: object
For the languages feature, 81 categories were dropped.
languages
For the genres_CMU feature, 122 categories were dropped.
genres_CMU
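
To illustrate why the box office is log-transformed, the sketch below uses made-up revenue values (assumptions, not taken from the dataset) spanning several orders of magnitude and compares the spread before and after applying `np.log`.

```python
import numpy as np

# Hypothetical revenues in dollars, spanning five orders of magnitude
box_office = np.array([1e4, 1e6, 1e8, 1e9])

# Natural logarithm, as applied to X['box_office'] in the cell above
log_box_office = np.log(box_office)

# Ratio between the largest and smallest value, before and after the transform
raw_spread = box_office.max() / box_office.min()
log_spread = log_box_office.max() / log_box_office.min()
print(f"raw ratio: {raw_spread:.0f}, log ratio: {log_spread:.2f}")
```

Note that `np.log(0)` evaluates to `-inf`, which is what the `np.isinf` check further down guards against.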


In [354]:
# Split to test and train
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42, shuffle = True)

# Displaying the size of each set
print(f"The training set has {X_train.shape[0]} samples (and a shape of {X_train.shape}).")
print(f"The test set has {X_test.shape[0]} samples (and a shape of {X_test.shape}).")

print(f"The target variable y for training has the shape {y_train.shape}.")
print(f"The target variable y for testing has the shape {y_test.shape}.")

The training set has 2934 samples (and a shape of (2934, 171)).
The test set has 734 samples (and a shape of (734, 171)).
The target variable y for training has the shape (2934,).
The target variable y for testing has the shape (734,).


In [355]:
'''# Standardizing the features using the built-in StandardScaler 
########## try another standardization to avoid predicting a negative box office
scaler = StandardScaler(copy=True, with_mean=True, with_std=True).set_output(transform="pandas") 

# Fitting the scaler to X_train to normalize the features
X_train = scaler.fit_transform(X_train)

# Standardizing X_test with mean and standard deviation of X_train since scaler is called on X_train first
X_test = scaler.transform(X_test)'''

# Adding a constant column, i.e. an intercept
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)
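
The commented-out block above describes standardizing with training-set statistics only. Should it be re-enabled, the idea can be sketched as follows on made-up arrays (the numbers and the `_toy` names are hypothetical): the scaler is fitted on the training data and then reused unchanged on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices with one column (e.g. runtime in minutes)
X_train_toy = np.array([[80.0], [100.0], [120.0]])
X_test_toy = np.array([[100.0], [140.0]])

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
X_train_std = scaler.fit_transform(X_train_toy)
# Reuse the training statistics on the test data to avoid information leakage
X_test_std = scaler.transform(X_test_toy)

print(scaler.mean_, X_test_std.ravel())
```

The test values are expressed in units of the training standard deviation, so they do not have zero mean or unit variance themselves.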


In [356]:
print(X_train.isnull().sum().sum())  # Should be 0
print(np.isinf(X_train).sum().sum())  # Should be 0
print(y_train.isnull().sum())  # Should be 0
print(np.isinf(y_train).sum())  # Should be 0

X_train.head()

0
0
0
0


Unnamed: 0,const,box_office,runtime,release_year,american english,arabic language,cantonese,english language,french language,german language,...,teen,thriller,time travel,tragedy,tragicomedy,war film,western,workplace comedy,world cinema,zombie film
31520,1.0,14.151983,84.0,1955.0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
49062,1.0,16.118096,91.0,1980.0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
17562,1.0,9.479833,97.0,2011.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
80558,1.0,18.685302,146.0,2004.0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
1346,1.0,18.080814,87.0,2001.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [357]:
# Initializing and training the regression model
model = sm.OLS(y_train, X_train) # note: statsmodels takes y first, unlike scikit-learn
results = model.fit()
print(results.summary())



############# TODO: add context: ratings lie between 0 and 10, so it is expected that the box office coefficients are very small

                            OLS Regression Results                            
Dep. Variable:          averageRating   R-squared:                       0.382
Model:                            OLS   Adj. R-squared:                  0.344
Method:                 Least Squares   F-statistic:                     10.07
Date:                Sat, 30 Nov 2024   Prob (F-statistic):          1.37e-186
Time:                        17:31:24   Log-Likelihood:                -3287.3
No. Observations:                2934   AIC:                             6917.
Df Residuals:                    2763   BIC:                             7940.
Df Model:                         170                                         
Covariance Type:            nonrobust                                         
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                 

In [358]:
print(results.params.sort_values(ascending=False).head(5))

const          28.702845
documentary     1.169391
stop motion     0.763059
gay themed      0.742381
animation       0.710644
dtype: float64
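
The constant of about 28.7 lies outside the rating scale. One plausible explanation (an assumption, not verified on our data) is that `release_year` is not centered, so the constant corresponds to an extrapolated prediction at year 0. The sketch below uses synthetic data to show that centering a predictor changes the intercept but not the slope.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (assumption): ratings decrease slightly with release year
year = rng.integers(1930, 2013, size=200).astype(float)
rating = 6.5 - 0.01 * (year - 1980) + rng.normal(0, 0.5, size=200)

def fit_ols(x, y):
    """Return (intercept, slope) of a least-squares line."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

raw_intercept, raw_slope = fit_ols(year, rating)
cen_intercept, cen_slope = fit_ols(year - year.mean(), rating)
print(raw_intercept, cen_intercept)
```

With the centered predictor, the intercept equals the mean rating, which is easier to interpret.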


In [359]:
#################################### Making predictions on the test set ###################################
y_predicted = results.predict(X_test)


############################################ Calculating metrics ############################################
# Root mean square error
rmse = sm.tools.eval_measures.rmse(y_test, y_predicted)
print(f'The RMSE on the test set is {rmse:.3f}')

# R-squared
r_squared = r2_score(y_test, y_predicted)
print(f"The R² on the test set is {r_squared:.3f}.")

The RMSE on the test set is 0.770
The R² on the test set is 0.340.
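
For reference, both metrics can be computed by hand. The sketch below uses made-up ratings (hypothetical values, with `_toy` names to avoid clashing with the variables above) and follows the usual definitions: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of residual to total sum of squares.

```python
import numpy as np

# Hypothetical true and predicted ratings
y_true_toy = np.array([6.0, 7.0, 8.0, 5.0])
y_pred_toy = np.array([6.5, 6.5, 7.5, 5.5])

residuals = y_true_toy - y_pred_toy
rmse_toy = np.sqrt(np.mean(residuals ** 2))

ss_res = np.sum(residuals ** 2)                         # unexplained variation
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)  # total variation
r2_toy = 1 - ss_res / ss_tot

print(f"RMSE = {rmse_toy:.3f}, R² = {r2_toy:.3f}")
```

The RMSE is expressed in rating points, so a value of 0.770 on the test set means predictions are off by roughly three quarters of a point on the rating scale.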
