# THE SUCCESS OF INCLUSIVE CINEMA

## Table of contents
1. <a href="#introduction">Introduction</a>
2. <a href="#i-get-our-hand-dirty-les-clean-it-up">I) Get our hand dirty! Le's clean it up!</a>
3. <a href="#ii-how-to-define-success">II) How to define success?</a>
    - <a href="#1-lets-add-more-data">1) Let's add more data</a>
    - <a href="#2-merge-to-success">2) Merge to success</a>
    - <a href="#3-the-success-score">3) The success score</a>
4. <a href="#iii-how-to-define-diversity">III) How to define diversity?</a>
    - <a href="#1-treating-the-ethnicities">1) Treating the ethnicities</a>
    - <a href="#2-defining-diversity">2) Defining diversity</a>
    - <a href="#3-further-comments">3) Further comments</a>
5. <a href="#iv-lets-explore-the-data">IV) Let's explore the data</a>
    - <a href="#1-diversity-on-overall-success">1) Diversity on overall success</a>
        - <a href="#a-is-diversity-higher-in-successful-movies-compared-to-less-successful-movies">a) Is diversity higher in successful movies compared to less successful movies?</a>
        - <a href="#b-is-the-difference-significant">b) Is the difference significant?</a>
6. <a href="#v-lets-dig-in">V) Let's dig in</a>
    - <a href="#1-box-office-revenue">1) Box office revenue</a>
    - <a href="#2-user-ratings">2) User ratings</a>
    - <a href="#3-award-nominations">3) Award nominations</a>
7. <a href="#vi-conclusion-the-cruel-truth">VI) Conclusion: the cruel truth</a>

## **Introduction**

In [1]:
# Some basic imports
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from statsmodels.stats import diagnostic
from scipy import stats
import networkx as nx
import statsmodels.api as sm
import statsmodels.formula.api as smf
import plotly.express as px
import sys
import warnings
import ast
import re

In [2]:
# Import some python modules
import src.data.cleaning_data as cleandata
import src.data.diversity as diversity_calc
import src.data.success as success

## **I) Get our hand dirty! Le's clean it up!**

In [3]:
movie_path = 'data/raw_data/movie.metadata.tsv'
character_path = 'data/raw_data/character.metadata.tsv'
ethnicity_mapping_path = 'data/raw_data/fb_wiki_mapping.tsv'
# box_office_df will be used in our definition of success
movie_df, box_office_df = cleandata.main(movie_path, character_path, ethnicity_mapping_path)

# Preview the cleaned data
if movie_df is not None:
    display(movie_df.head())

Cleaned data saved to data/preprocess_data/clean_dataset.csv


| | Wikipedia_movie_ID | Movie_release_date | Actor_ethnicity | Movie_name | Movie_runtime | Movie_languages | Movie_countries |
|---|---|---|---|---|---|---|---|
| 10 | 3196793 | 2000 | African Americans | Getting Away with Murder: The JonBenét Ramsey ... | 95.0 | English Language | United States of America |
| 57 | 18768079 | 1938 | Jewish people | Fast Company | 75.0 | English Language | United States of America |
| 59 | 612710 | 1999 | Italians | New Rose Hotel | 92.0 | English Language | United States of America |
| 60 | 612710 | 1999 | German Americans | New Rose Hotel | 92.0 | English Language | United States of America |
| 83 | 156558 | 2001 | African Americans | Baby Boy | 123.0 | English Language | United States of America |


## **II) How to define success?**
### **1) Let's add more data**

We consider a film successful if it meets at least one of three criteria: high box office revenue, high user ratings, or an award nomination. Let's add these to our dataset.

In [4]:
# Import the datasets we need for our definition of success
ratings_df = success.ratings_setup()
awards_df = success.nominations_setup()


| | Movie_name | Movie_release_date | Ratings | Wikipedia_movie_ID | Movie_box_office_revenue | Nomination |
|---|---|---|---|---|---|---|
| 0 | !Women Art Revolution | 2010 | 6.9 | 29988427.0 | NaN | False |
| 1 | \$ | 1971 | 6.3 | 4213160.0 | NaN | False |
| 3 | \$9.99 | 2008 | 6.7 | 20624798.0 | NaN | False |
| 4 | '68 | 1988 | 5.8 | 2250713.0 | NaN | False |
| 5 | 'Neath the Arizona Skies | 1934 | 5.0 | 3610422.0 | NaN | False |
| ... | ... | ... | ... | ... | ... | ... |
| 42927 | È l'amor che mi rovina | 1951 | 5.0 | 23687589.0 | NaN | False |
| 42928 | Échangistes | 2007 | 4.1 | 27932113.0 | NaN | False |
| 42929 | Édes Anna | 1958 | 7.4 | 21534981.0 | NaN | False |
| 42930 | Élisa | 1995 | 6.6 | 1719500.0 | NaN | False |


### **2) Merge to success**

In [None]:
# Merge the datasets to get our success definition
success_df = success.merge_success_df(box_office_df, awards_df, ratings_df)
success_df = success.drop_NaN_on_success(success_df)
display(success_df)

### **3) The success score**
To ensure a representative sample of successful films, we set a specific threshold for each criterion:
#### **a) High Box Office Revenue**
A film is considered successful in terms of box office revenue if its revenue exceeds **the threshold of 38,119,483.0, the third quartile of the revenue distribution** (*to be updated accordingly*). Films below this value are not categorized as successful by this measure.
#### **b) High User Ratings**
Films with an average user **rating of 7/10 or higher are deemed highly rated** (*to be updated accordingly*). This threshold is also based on the third quartile of the distribution, so that only the top-rated films are included.
#### **c) Award Nominations**
Rather than limiting our selection to award winners, which could be too restrictive, we include all films nominated for prominent awards such as the Oscars and the Golden Globes.

Together, these criteria let us evaluate a film's success across financial, audience, and critical dimensions.
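
The three criteria can be combined along the following lines. This is only a minimal sketch: `flag_success` is a hypothetical stand-in for `success.define_success`, whose code is not shown in this notebook, and the toy rows are made up. It assumes that a film is successful as soon as one criterion is met and that thresholds are inclusive.

```python
import numpy as np
import pandas as pd


def flag_success(df: pd.DataFrame, ratings_q: float = 0.75, box_office_q: float = 0.75) -> pd.DataFrame:
    """Hypothetical sketch: flag a movie as successful if any criterion is met."""
    # Quantiles are computed on the non-missing values of each column
    ratings_threshold = df["Ratings"].quantile(ratings_q)
    revenue_threshold = df["Movie_box_office_revenue"].quantile(box_office_q)

    # Comparisons against NaN evaluate to False, so a missing value never counts as a success
    high_rating = df["Ratings"] >= ratings_threshold
    high_revenue = df["Movie_box_office_revenue"] >= revenue_threshold
    nominated = df["Nomination"].astype(bool)

    out = df.copy()
    out["Success"] = (high_rating | high_revenue | nominated).astype(int)
    return out


# Toy rows, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "Movie_name": ["A", "B", "C", "D"],
    "Ratings": [5.0, 6.0, 7.0, 8.0],
    "Movie_box_office_revenue": [np.nan, 1e6, 5e7, np.nan],
    "Nomination": [True, False, False, False],
})
flagged = flag_success(toy)
print(flagged[["Movie_name", "Success"]])
```

In this toy example, A succeeds through its nomination, C through its revenue and D through its rating, while B meets no criterion.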


In [5]:
# Define the success thresholds:
# successful movies have ratings or box office revenue above these quantiles
ratings_quantile = 0.75
box_office_quantile = 0.75
success_df = success.define_success(success_df, ratings_quantile, box_office_quantile)
success.save_success_df(success_df, "success_movies")
display(success_df)

Proportion of success movies: 24.837229949689256
DataFrame saved successfully.


| | Movie_name | Movie_release_date | Ratings | Wikipedia_movie_ID | Movie_box_office_revenue | Nomination | Success |
|---|---|---|---|---|---|---|---|
| 0 | !Women Art Revolution | 2010 | 6.9 | 29988427.0 | NaN | False | 0 |
| 1 | \$ | 1971 | 6.3 | 4213160.0 | NaN | False | 0 |
| 3 | \$9.99 | 2008 | 6.7 | 20624798.0 | NaN | False | 0 |
| 4 | '68 | 1988 | 5.8 | 2250713.0 | NaN | False | 0 |
| 5 | 'Neath the Arizona Skies | 1934 | 5.0 | 3610422.0 | NaN | False | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 42927 | È l'amor che mi rovina | 1951 | 5.0 | 23687589.0 | NaN | False | 0 |
| 42928 | Échangistes | 2007 | 4.1 | 27932113.0 | NaN | False | 0 |
| 42929 | Édes Anna | 1958 | 7.4 | 21534981.0 | NaN | False | 1 |
| 42930 | Élisa | 1995 | 6.6 | 1719500.0 | NaN | False | 0 |


## **III) How to define diversity?**
### **1) Treating the ethnicities**
A first look at the ethnicities shows more than 350 distinct labels, some of them very similar (e.g. 'Austrian American' and 'Austrian Canadian'). We therefore simplify this criterion before defining diversity. Without grouping, a film whose cast consists of a German, an Austrian and a Swiss actor would be considered very diverse, which is not what we want to measure. For this reason, the ethnicities were first grouped into larger ethnic groups. This was done with the help of an LLM, with checks and corrections done by hand. Checking by hand remained feasible because the number of ethnicities is manageable and the LLM handled the most time-consuming part.
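
The grouping step amounts to a lookup from fine-grained labels to larger groups. The sketch below is only an illustration: the mapping excerpt, the group name `"Central European"` and the helper `to_group` are hypothetical and are not the actual grouping used in `diversity_calc.ethnic_groups`.

```python
from typing import Optional

# Hypothetical excerpt of a label-to-group mapping; the group name is illustrative only
ETHNIC_GROUP_OF = {
    "Austrian American": "Central European",
    "Austrian Canadian": "Central European",
    "Austrians": "Central European",
    "Germans": "Central European",
    "Swiss": "Central European",
}


def to_group(ethnicity: str) -> Optional[str]:
    """Return the larger group of a label, or None if the label still needs manual review."""
    return ETHNIC_GROUP_OF.get(ethnicity)


cast = ["Germans", "Austrians", "Swiss"]
n_raw_labels = len(set(cast))                     # distinct labels before grouping
n_groups = len({to_group(label) for label in cast})  # distinct groups after grouping
print(n_raw_labels, n_groups)
```

Under this assumed mapping, the German, Austrian and Swiss cast counts three distinct labels before grouping but a single group afterwards.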


In [6]:
# Load the cleaned dataset
actors_df = diversity_calc.load_df('data/processed_data/clean_dataset.csv')
# Map the ethnicities to larger ethnic groups and check for missing values
actors_diversity = diversity_calc.ethnic_groups(actors_df)
diversity_calc.check_nan_Ethnicity(actors_diversity)
# Compute the two diversity measures and attach them to the movies
diversity = diversity_calc.naive_diversity(actors_diversity)
diversity = diversity_calc.ethnic_entropy(actors_df, diversity)
actors_df = diversity_calc.merge_on_movies(actors_df, diversity)
# Movies with a single actor cannot be scored for diversity
actors_df = actors_df[actors_df['actor_number'] != 1]
# Keep one row per movie
actors_df = actors_df.drop(columns='Actor_ethnicity').drop_duplicates(subset='Wikipedia_movie_ID')
actors_df.to_csv("data/processed_data/clean_div_dataset.csv", index=False, encoding='utf-8-sig')

### **2) Defining diversity**
Once the ethnicities have been sorted into 16 larger groups, we can define diversity. Since the focus is on the diversity of the cast and not on the representation of minority groups, the movie's country of production does not need to be taken into account. The simplest way to calculate diversity is to divide the number of ethnicities by the number of actors. However, for a film with 9 actors and 3 ethnicities, this definition gives the same diversity score to the distribution (3,3,3) as to (1,1,7). An entropy term can therefore complement the first definition.
The basic entropy formula is: $$ S = -\sum_{i=1}^{n} p_i \log(p_i) \tag{1}$$
where $p_i$ is the fraction of the cast that ethnicity $i$ represents in the movie and $n$ is the number of ethnicities in the movie.

One modification was made. If all actors belong to a single ethnicity, the entropy is 0, a value we prefer to avoid because the entropy is multiplied by the first definition. We therefore add 1 to the entropy. Entropy alone penalises movies with few actors, since a small cast cannot reach a high entropy, which is why we multiply it by the first definition to obtain our final diversity coefficient: $$ D = \frac{n}{N}\,(1 + S) \tag{2}$$
where $N$ is the number of actors in the movie.
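
The coefficient described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code of `diversity_calc`: the function name `diversity_coefficient` is hypothetical, and the natural logarithm is assumed because the text does not specify the base.

```python
import math
from collections import Counter


def diversity_coefficient(cast_groups: list) -> float:
    """(number of ethnicities / number of actors) * (1 + entropy), natural log assumed."""
    n_actors = len(cast_groups)
    counts = Counter(cast_groups)
    naive = len(counts) / n_actors
    entropy = -sum((c / n_actors) * math.log(c / n_actors) for c in counts.values())
    return naive * (1 + entropy)


# The two casts from the text: 9 actors and 3 ethnicities each
even_cast = ["A"] * 3 + ["B"] * 3 + ["C"] * 3    # distribution (3,3,3)
skewed_cast = ["A"] + ["B"] + ["C"] * 7          # distribution (1,1,7)

d_even = diversity_coefficient(even_cast)        # about 0.70
d_skewed = diversity_coefficient(skewed_cast)    # about 0.56
print(round(d_even, 2), round(d_skewed, 2))
```

Both casts have the same ratio of 3/9, but the entropy term ranks the evenly distributed cast higher.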

### **3) Further comments**
Firstly, the number of actors varies widely between movies in the dataset, and movies with a single actor cannot be considered for a diversity calculation. To extend the analysis, we could also consider whether an actor belongs to a minority group.
Secondly, the diversity coefficient is based on the ethnic groups established previously. Changing the characteristics of these groups, such as their size, number or content, will change the diversity coefficient.


## **IV) Let's explore the data**
### **1) Diversity on overall success**
#### **a) Is diversity higher in successful movies compared to less successful movies?**

#### **b) Is the difference significant?**
**T-test**
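
A minimal sketch of such a test, assuming SciPy is available. The values are toy numbers made up for illustration, not results from the dataset, and Welch's variant of the t-test is used here as an assumption because it does not require equal variances in the two groups.

```python
from scipy import stats

# Toy diversity coefficients, made up for illustration (not results from the dataset)
diversity_successful = [0.70, 0.65, 0.80, 0.75, 0.60]
diversity_other = [0.50, 0.55, 0.60, 0.45, 0.65]

# equal_var=False selects Welch's t-test
result = stats.ttest_ind(diversity_successful, diversity_other, equal_var=False)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

In the notebook, the two lists would be replaced by the diversity coefficients of the movies with `Success` equal to 1 and 0 respectively.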

**Test of correlation**

Spearman correlation coefficient
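
A minimal sketch of the computation, assuming SciPy is available. The values are toy numbers made up for illustration, and `success_measure` is a hypothetical name for any numeric measure of success.

```python
from scipy import stats

# Toy values, made up for illustration (not results from the dataset)
diversity = [0.2, 0.4, 0.5, 0.7, 0.9]
success_measure = [5.1, 6.2, 6.0, 7.5, 8.0]

# Spearman's rho is the correlation between the ranks of the two variables
rho, p_value = stats.spearmanr(diversity, success_measure)
print(f"rho = {rho:.2f}, p = {p_value:.3f}")
```

Because it works on ranks, the coefficient captures any monotonic relationship and is less sensitive to outliers than the Pearson coefficient.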

**Propensity score matching**

## **V) Let's dig in**
We now explore the data for each success criterion separately.

### **1) Box office revenue**
**T-test**


**Test of correlation**
Pearson correlation coefficient

**Propensity score matching**

### **2) User ratings**
**T-test**

**Test of correlation**
Spearman correlation coefficient

**Propensity score matching**

### **3) Award nominations**
**T-test**

**Test of correlation**
Pearson correlation coefficient

**Propensity score matching**

## **VI) Conclusion: the cruel truth**