# Key Findings from Our Preliminary Analysis

In [1]:
import pandas as pd
from src.scripts import data_cleaning
from src.scripts import data_modification
from src.scripts import sentiment_analysis



This analysis explores several aspects of cinema data: the distribution of movie genres, gender representation over the decades, and the emotions expressed in films. Our project focuses mainly on movie-related data, but we also cleaned and browsed the character and actor data in case we decide to include them later.

To avoid cluttering this notebook, we report only images of our plots without the code that produced them. All related code and plots, with a detailed step-by-step walkthrough, can be found in the src/scripts/notebook folder. Functional data pipelines such as data cleaning, sentiment analysis and data modification are called from this notebook to give an idea of what the dataframes we work with look like. These scripts are located in src/scripts.

## What’s in the data?

### Data cleaning

As a first step, we cleaned the whole dataset, with particular attention to plot summaries and movie-related data. We removed tags, citations and most unusual characters from the movie summaries. We unified date formats and, in some cases, kept only the year. We cleaned box office data by removing strings or converting them into numerical values, and we checked the bounds of some fields (for example, age should not be negative). We also removed IDs and columns that we will mostly not use.
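The kind of cleaning rules described above can be sketched as follows. This is a minimal illustration, not the project's `data_cleaning` pipeline: the function names and the exact regular expressions are assumptions chosen for the example.

```python
import re

def clean_summary(text):
    """Strip wiki-style markup from a plot summary (illustrative rules only)."""
    text = re.sub(r"<[^>]+>", " ", text)         # HTML-like tags such as <br/>
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)  # {{template}} markers
    text = re.sub(r"\[\d+\]", " ", text)         # numeric citations like [12]
    return re.sub(r"\s+", " ", text).strip()     # collapse repeated whitespace

def extract_year(date_string):
    """Return the first plausible four-digit year (1800-2099), or None."""
    match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", str(date_string))
    return int(match.group(1)) if match else None

def valid_age(age):
    """Keep an age only when it is present and non-negative."""
    return age if age is not None and age >= 0 else None
```

For instance, `extract_year("1942-03-03")` keeps only `1942`, and `valid_age(-3)` returns `None` so that the value is treated as missing.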

In [2]:
cleaned_data_movie = data_cleaning.movie_data_cleaning_pipeline()
print(cleaned_data_movie.columns)
cleaned_data_movie.sample(1)

Index(['Wikipedia_movie_ID', 'summary', 'Freebase_movie_ID', 'Movie_name',
       'Movie_release_date', 'Movie_box_office_revenue', 'Movie_runtime',
       'Movie_languages', 'Movie_countries', 'Movie_genres'],
      dtype='object')


Unnamed: 0,Wikipedia_movie_ID,summary,Freebase_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_runtime,Movie_languages,Movie_countries,Movie_genres
42248,18581839,,/m/04g04wt,The Great King,1942.0,,118.0,German Language,Germany,"Drama, Black-and-white"


In [3]:
cleaned_data_movie.describe()

Unnamed: 0,Wikipedia_movie_ID,Movie_release_date,Movie_box_office_revenue,Movie_runtime
count,81840.0,39372.0,8401.0,61291.0
mean,17411810.0,1977.548816,47993630.0,111.8192
std,10988810.0,30.938137,112175300.0,4360.07
min,330.0,1892.0,10000.0,0.0
25%,7324745.0,1952.0,2083193.0,81.0
50%,17792340.0,1989.0,10639690.0,93.0
75%,27161290.0,2005.0,40716960.0,106.0
max,37501920.0,2016.0,2782275000.0,1079281.0


In [4]:
clean_character_data = data_cleaning.character_data_cleaning_pipeline()
print(clean_character_data.columns)
clean_character_data.sample(1)

Index(['Wikipedia_movie_ID', 'Freebase_movie_ID', 'Movie_release_date',
       'Character_name', 'Actor_date_of_birth', 'Actor_gender',
       'Actor_height_(in meters)', 'Actor_ethnicity_(Freebase ID)',
       'Actor_name', 'Actor_age_at_movie_release',
       'Freebase_character/actor_map_ID', 'Freebase_character_ID',
       'Freebase_actor_ID', 'unique_character_name', 'character_types',
       'character', 'movie'],
      dtype='object')


Unnamed: 0,Wikipedia_movie_ID,Freebase_movie_ID,Movie_release_date,Character_name,Actor_date_of_birth,Actor_gender,Actor_height_(in meters),Actor_ethnicity_(Freebase ID),Actor_name,Actor_age_at_movie_release,Freebase_character/actor_map_ID,Freebase_character_ID,Freebase_actor_ID,unique_character_name,character_types,character,movie
290201,30233816.0,/m/0g56dc1,2010.0,,,F,,,Sylvia Panacione,,/m/0gkk82k,,/m/0gc1rqk,,,,


### Missing values

<img src="src/scripts/plots/movie_exploration/Percentage_of_Missing_Data_for_Movie_Characteristics.png" alt="Percentage of missing data for movie characteristics" width="700">

<img src="src/scripts/plots/actor_exploration/Percentage_of_Missing_Data_for_Actor_Characteristics.png" alt="Percentage of missing data for actor characteristics" width="700">


These visualizations show the percentage of missing data for both movie and actor characteristics. Some categories, such as movie summaries and box office revenue, have significant gaps that could affect the rest of our analysis. Since summaries and box office revenue are our main points of interest, we will use web scraping to fill these gaps.

<img src="src/scripts/plots/movie_exploration/Distribution_of_the_summary_that_have_length_inf_to_5000_and_1000.png" alt="Distribution of summary lengths" width="900">

Digging deeper into the plot summaries, we observe that a number of them have fewer than 200 words. Analyzing how emotions evolve throughout a summary requires enough text, so we chose a cutoff value and decided to scrape from Wikipedia, when possible, both the missing plots and the summaries that are too short.


### Web-Scraping

As observed in the plots above, approximately 40% of film summaries and roughly 90% of box office values are missing (only 8,401 of the 81,840 movies have a box office value), and both are critical for our analysis. To address these gaps in the original CMU dataset, we have decided to supplement it with additional information from the films' Wikipedia pages.

Our focus is on films with missing or very short summaries (fewer than 200 words), for which we retrieve more detailed summaries from Wikipedia. This approach ensures that the sentiment analysis performed later has a sufficiently robust dataset to yield meaningful insights. Additionally, since predicting box office revenue is a key objective, we also scrape box office data from Wikipedia to enrich the dataset.
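The selection rule for summaries to scrape can be written as a small predicate. This is a sketch under the assumption that words are counted by splitting on whitespace; `needs_scraping` is a hypothetical helper name, not a function from the project's scripts.

```python
def needs_scraping(summary, min_words=200):
    """True when the summary is missing or shorter than min_words words."""
    if summary is None or not str(summary).strip():
        return True
    # A NaN float becomes the one-word string "nan", so it is also selected.
    return len(str(summary).split()) < min_words
```

Applied to the `summary` column, such a predicate would flag both the NaN entries and the summaries under the 200-word cutoff.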

The results presented here come from scraping a sample of 2,000 films from the CMU dataset. The process targets films with incomplete summaries (replacing those under 200 words with the Wikipedia entry) and adds missing box office values. On this sample, scraping took 12 minutes for the summaries and 25 minutes for the box office revenues, from which we estimate maximal running times of roughly 8 hours and 16 hours, respectively, for the full dataset. We expect the actual running time to be lower, because we will impose stricter validity and usefulness constraints on the dataset and drop outliers (films that are too short, films with too many unknown fields that cannot be scraped, and so on).
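The running-time estimate is a linear extrapolation from the 2,000-film sample. The sketch below assumes the scraping time scales proportionally with the number of films and uses the 81,840 rows reported by `cleaned_data_movie.describe()` as the full dataset size; both are assumptions made for the illustration.

```python
def extrapolate_minutes(sample_minutes, sample_size, total_size):
    """Scale a measured runtime linearly from a sample to the full dataset."""
    return sample_minutes * total_size / sample_size

TOTAL_FILMS = 81_840  # row count from cleaned_data_movie.describe()

summaries_h = extrapolate_minutes(12, 2_000, TOTAL_FILMS) / 60
box_office_h = extrapolate_minutes(25, 2_000, TOTAL_FILMS) / 60
print(f"summaries: {summaries_h:.1f} h, box office: {box_office_h:.1f} h")
```

Under these assumptions the upper bounds come out slightly above 8 hours and around 17 hours, the same order of magnitude as the estimates quoted above. They are upper bounds because only films with missing or short entries actually need to be scraped.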

<img src="src/scripts/plots/movie_exploration/distribution_summary_length.png" alt="Distribution of summary lengths before and after scraping" width="800"/>

Here, we present the distribution of summary lengths for the original CMU dataset (labeled "Before") alongside the enriched CMU dataset, supplemented with web-scraped data from Wikipedia (labeled "After").

When comparing the two distributions, we observe that after scraping the number of longer summaries increases while the number of shorter ones decreases. In addition, some previously missing summaries (NaN values) have now been populated.

<img src="src/scripts/plots/movie_exploration/percentage_missing_values_box_office.png" alt="Percentage of missing box office values before and after scraping" width="800"/>


Here, we present the results showing the percentage of missing box office values in the original CMU dataset ("Before scraping") compared to the enriched dataset supplemented with web scraping ("After scraping").

After scraping, we observe a reduction of more than 25% in the number of missing box office values in our sample of 2,000 entries. Web scraping therefore enriches the dataset successfully, and we will rely on the enriched dataset to address our research questions.

If needed, the web-scraping code can be found in src/scripts/notebook/data_scraping.ipynb.

### Gender parity

Does the proportion of men and women in the movie industry vary across time and continents? We clearly see that the industry is dominated by men, consistently across all continents and decades. This does not relate directly to our analysis, but we found it interesting to note.

<img src="src/scripts/plots/actor_exploration/Gender_Proportion_Across_Decades_by_Continent.png" alt="Gender proportion across decades by continent" width="1000">

### Genres and Geographic Distribution 

Because of the very large number of genres in the data (about 360), we handcrafted a mapping that reduces them to about a dozen categories. While our mapping may be a point of discussion among movie enthusiasts, we find it a good approximation of a more general genre classification. We took the same approach for geographical location, classifying countries by continent. The mappings themselves can be found in the notebook or scripts about data modifications. Our dataframe therefore gains a continent column and a category column, which allows us to draw distributions per genre and continent.
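A dictionary-based mapping of this kind can be sketched as follows. The entries below are hypothetical excerpts chosen for illustration; the project's real mappings live in its data modification scripts and may assign genres differently. Using the first listed country to pick the continent is also an assumption.

```python
# Hypothetical excerpts of the two hand-crafted mappings.
GENRE_TO_CATEGORY = {
    "Thriller": "Thriller",
    "Mystery": "Thriller",
    "Horror": "Horror",
    "Drama": "Drama",
    "Black-and-white": None,  # a format label, mapped to no category here
}
COUNTRY_TO_CONTINENT = {
    "United States of America": "North America",
    "Germany": "Europe",
}

def map_genres(genre_string):
    """Map a comma-separated genre string to a de-duplicated category list."""
    categories = []
    for genre in genre_string.split(","):
        category = GENRE_TO_CATEGORY.get(genre.strip())
        if category and category not in categories:
            categories.append(category)
    return categories

def map_continent(country_string):
    """Return the continent of the first listed country, or None if unmapped."""
    first_country = country_string.split(",")[0].strip()
    return COUNTRY_TO_CONTINENT.get(first_country)
```

With these toy entries, `map_genres("Thriller, Mystery, Horror")` yields `["Thriller", "Horror"]`, since two of the fine-grained genres collapse into the same category.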

In [5]:
modified_data = data_modification.all_data_transformations()
modified_data.sample(1)

Unnamed: 0,Wikipedia_movie_ID,summary,Freebase_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_runtime,Movie_languages,Movie_countries,Movie_genres,category,continent
64853,28615549,,/m/0czblhj,The Phantom,1931.0,,61.0,English Language,United States of America,"Thriller, Mystery, Horror","[Thriller, Horror]",North America


<img src="src/scripts/plots/movie_exploration/Distribution_of_Movies_by_Continent.png" alt="Distribution of movies by continent" width="500">

We can see that North America is predominant in the field, followed by Europe and Asia. The other continents represent less than 5% of the film industry. This point leads us to believe that considering an analysis with and without North America could potentially bring out interesting differences as North America is overwhelmingly over-represented.

Now let's take a look at the genres distribution in films.

<img src="src/scripts/plots/movie_exploration/Distribution_of_Movies_genres.png" alt="Distribution of movie genres" width="500">

Drama is the most common genre, accounting for the largest proportion at approximately 26%. It is followed by comedy and thriller, which hold significant but smaller shares. The distribution suggests a diverse set of genres without a single dominant category, reflecting the varied preferences in cinematic storytelling. It also gives us a hint as to which genres we will consider for the final analysis, as smaller genres may not be represented well enough to give readable results.

Let's explore how the distribution of movie genres differs across continents.

<img src="src/scripts/plots/movie_exploration/distribution_of_film_per_continent.png" alt="Distribution of film genres per continent" width="1000">

The charts show drama as a dominant genre globally, with variations in preferences across continents. Comedy is more prominent in North America and Oceania, while action/adventure stands out in Asia. The "others" category highlights regional diversity and niche genres.

### Box office and relationship to release date

We next wanted to explore which variables other than emotional evolution could explain box office success, so that we are aware of existing trends caused by other parameters. The bar charts show an increase in movie production and a clear upward trend in average box office earnings by decade. Earnings rose gradually from the 1910s to the 1960s, with a sharper increase starting in the 1970s, likely driven by advances in technology, global market expansion, and blockbuster productions. The slight dip in the 2010s might be due to the data ending around 2015, capturing only part of the decade. This trend reflects both economic factors such as inflation and the evolving scale of the film industry. Inflation affects any financial comparison over time, and accounting for it in our final results is one of our goals.

<img src="src/scripts/plots/movie_exploration/Number_of_Movies_per_Decade.png" alt="Number of movies per decade" width="700">

<img src="src/scripts/plots/movie_exploration/Mean_Box_Office_Earnings_by_Decade.png" alt="Mean box office earnings by decade" width="700">

How does box office revenue vary across genres and continents? (We do not look at Africa or South America because we do not have enough box office revenue data for these continents.)

<img src="src/scripts/plots/movie_exploration/Box_Office_Revenue_by_Movie_Category_by_Continent2.png" alt="box off per cat" width="700">

The mean box office revenue clearly varies across continents. Interestingly, some film genres achieve high box office success despite being minimally represented in the overall distribution. For instance, the Family/Animation category performs exceptionally well in Europe, even though it accounts for only 2.5% of the film distribution across continents.

## Sentiment analysis

Most of our data story will rely on sentiment analysis to track the emotions expressed throughout a movie, approximated via its summary. We considered three possible approaches:

1) Use a semantic similarity model to group sentences of close meaning and segment the plot summary. Every sentence would first be passed through a model that projects it into a high-dimensional embedding, and cosine similarity with a threshold would then be used to group the sentences. Finally, each segment would be passed through a sentiment analysis model that returns scores across 7 emotions: anger, disgust, fear, joy, neutral, sadness, and surprise.
2) Pass each sentence individually through the same sentiment analysis model, which returns scores across the same 7 emotions.
3) Use a simpler sentiment analysis model that returns only a positive/negative score per sentence.

We observed that segmenting plot summaries with the semantic model and a threshold of 0.5 would group only about 13% of sentences, so we decided to work sentence by sentence. We chose option 2 with emotion classification, and we plan to double-check our results with the positive/negative classifier to make sure that positive and negative emotions match the positive/negative score.
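The threshold-based grouping of option 1 can be illustrated with a small sketch. It assumes that only consecutive sentences are compared and uses toy 2-dimensional vectors in place of real sentence embeddings; the embedding model itself is not shown, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def segment_by_similarity(embeddings, threshold=0.5):
    """Group consecutive sentences whose adjacent embeddings are similar enough."""
    if not embeddings:
        return []
    segments = [[0]]
    for i in range(1, len(embeddings)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) >= threshold:
            segments[-1].append(i)   # similar to the previous sentence: same segment
        else:
            segments.append([i])     # dissimilar: start a new segment
    return segments

# Toy embeddings: sentences 0-1 point one way, sentences 2-3 another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(segment_by_similarity(toy_embeddings))  # [[0, 1], [2, 3]]
```

When few adjacent pairs exceed the threshold, most segments contain a single sentence, which is consistent with falling back to sentence-level analysis.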

In [5]:
# As an example, we run it on 2 plots
sentiment_df = sentiment_analysis.sentiment_analysis_sentences(limit=2)
sentiment_df

100%|██████████| 2/2 [00:00<00:00,  2.97it/s]


Unnamed: 0,anger,disgust,fear,joy,neutral,sadness,surprise,sentence_id,sentence,Wikipedia_movie_ID
0,0.038214,0.118294,0.077507,0.147301,0.4462,0.033855,0.138631,0,In order to prepare the role of an important o...,330
1,0.040454,0.007021,0.818398,0.002523,0.061496,0.003478,0.06663,0,"After being pulled through a time portal, Ash ...",3217
2,0.260286,0.336119,0.070448,0.002892,0.178517,0.142633,0.009105,1,"He is enslaved along with the captured Henry, ...",3217
3,0.127714,0.681643,0.060334,0.001472,0.046439,0.07476,0.007639,2,Ash is thrown in a pit where he fights off a D...,3217
4,0.029963,0.013735,0.001158,0.758849,0.150335,0.007201,0.038758,3,After demanding Henry and his men be set free ...,3217
5,0.011017,0.125419,0.002716,0.069069,0.755839,0.017873,0.018066,4,"He also grows attracted to Sheila, the sister ...",3217
6,0.017308,0.035617,0.010424,0.003796,0.879692,0.045444,0.007719,5,"According to the Wise Man, the only way Ash ca...",3217
7,0.009902,0.018827,0.018521,0.038465,0.503885,0.403152,0.007247,6,"After bidding goodbye to Sheila, Ash starts hi...",3217
8,0.001351,0.004731,0.970422,0.000833,0.011798,0.004374,0.006491,7,"As he enters a haunted forest, an unseen force...",3217
9,0.03463,0.013735,0.016873,0.774963,0.140356,0.012839,0.006604,8,"Fleeing, he ducks into a windmill where he cra...",3217


As you can see, we now have an emotion score for each sentence. The next step is to make the scores comparable across movies. The average plot length (before scraping) is about 20 sentences, so we chose 20 as the number of timesteps at which we consider the emotions of a movie. This number may change as the overall length of our plots increases. We use simple interpolation, and fill missing values by extrapolation, to obtain the emotional evolution of each movie across 20 points.
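The resampling idea can be sketched as follows. This is a minimal stand-in based on `numpy.interp` applied to the relative position of each sentence in the summary; the project's own `interpolate_df` may use a different interpolation routine and a different treatment of missing values. The name `resample_emotions` is hypothetical.

```python
import numpy as np

def resample_emotions(scores, n_timesteps=20):
    """Linearly resample a (n_sentences, n_emotions) array to n_timesteps rows."""
    scores = np.asarray(scores, dtype=float)
    if len(scores) == 1:
        # A single sentence carries no evolution: repeat it at every timestep.
        return np.repeat(scores, n_timesteps, axis=0)
    old_x = np.linspace(0.0, 1.0, len(scores))   # relative sentence positions
    new_x = np.linspace(0.0, 1.0, n_timesteps)   # target timesteps
    columns = [np.interp(new_x, old_x, scores[:, j]) for j in range(scores.shape[1])]
    return np.column_stack(columns)
```

Mapping every summary onto the same relative time axis is what makes a 10-sentence plot and a 40-sentence plot comparable timestep by timestep.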

In [10]:
interpolated_data = sentiment_analysis.interpolate_df(sentiment_df)
interpolated_data.dropna(inplace=True)
interpolated_data



Unnamed: 0,anger,disgust,fear,joy,neutral,sadness,surprise,Wikipedia_movie_ID,timestep
20,0.040454,0.007021,0.818398,0.002523,0.061496,0.003478,0.06663,3217,0
21,0.232376,0.408861,0.068319,0.002593,0.150711,0.128344,0.008796,3217,1
22,0.086556,0.400418,0.035418,0.320368,0.090185,0.046314,0.020742,3217,2
23,0.017997,0.084273,0.002142,0.323199,0.532759,0.013941,0.025689,3217,3
24,0.016314,0.049796,0.009207,0.014102,0.860136,0.041091,0.009353,3217,4
25,0.009452,0.018085,0.068621,0.036484,0.477986,0.382164,0.007207,3217,5
26,0.010108,0.007101,0.719488,0.204552,0.045629,0.006602,0.006521,3217,6
27,0.120718,0.165876,0.084803,0.409418,0.103597,0.110009,0.005578,3217,7
28,0.146809,0.60475,0.062461,0.002435,0.102097,0.07481,0.006639,3217,8
29,0.022271,0.170628,0.010483,0.003933,0.522641,0.042434,0.22761,3217,9


Now that we have this, it is easy to group by any variable we want (continent, category, revenue, and so on). We show our preliminary results below.
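As an illustration of such a grouping, the sketch below averages emotion scores per continent and timestep with pandas. The rows are made up, and it assumes the interpolated scores have already been joined with a `continent` column.

```python
import pandas as pd

# Toy rows shaped like interpolated_data joined with a continent column
# (values are invented for illustration).
toy = pd.DataFrame({
    "Wikipedia_movie_ID": [1, 1, 2, 2, 3, 3],
    "timestep": [0, 1, 0, 1, 0, 1],
    "continent": ["Europe", "Europe", "Europe", "Europe", "Asia", "Asia"],
    "joy": [0.2, 0.6, 0.4, 0.8, 0.1, 0.9],
    "fear": [0.7, 0.1, 0.5, 0.3, 0.6, 0.2],
})

# Mean emotion score at each timestep, separately for each continent.
mean_arc = (
    toy.groupby(["continent", "timestep"])[["joy", "fear"]]
    .mean()
    .reset_index()
)
print(mean_arc)
```

Replacing `continent` with a category or revenue bucket column would give the corresponding per-genre or per-revenue emotional arcs.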

### Technical notes on runtime

So far we have evaluated our pipeline on only 5,000 films, to make sure that it runs smoothly and to check whether the results are interesting. Running 5,000 movies on a Google Colab notebook with an A100 GPU takes about 10-13 minutes. After data scraping, we will most likely evaluate about 80% of the dataset, which is about 60,000 movies, for an estimated runtime of 2-3 hours.

The following results therefore cover only a 5,000-movie subset, but we expect the trends to generalize.

### Evolution of emotions 

How do emotions progress over the course of a movie? 

<img src="src/scripts/plots/sentiment_analysis/Normalized_Average_Evolution_of_Emotions_Across_Timesteps_for_All_Movies.png" alt="Normalized average evolution of emotions across timesteps for all movies" width="1400">

The first graph shows the averaged evolution of emotions over the course of a movie, with neutral emotions being the most prominent and steadily increasing toward the end. Disgust, on the other hand, maintains a consistent presence throughout the movie. The second graph, which normalizes the emotions, reveals more dynamic changes, with surprise and fear peaking early on and then declining, while neutral emotions gradually dominate as the movie progresses. These trends reflect typical emotional pacing in storytelling, balancing tension and resolution.

Now, let's take a closer look at each emotion.

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Anger_across_Timesteps_for_Each_Genre.png" alt="Normalized evolution of anger across timesteps for each genre" width="850">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Disgust_across_Timesteps_for_Each_Genre.png" alt="Normalized evolution of disgust across timesteps for each genre" width="850">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Fear_across_Timesteps_for_Each_Genre.png" alt="Normalized evolution of fear across timesteps for each genre" width="850">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Sadness_across_Timesteps_for_Each_Genre.png" alt="Normalized evolution of sadness across timesteps for each genre" width="850">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Joy_across_Timesteps_for_Each_Genre.png" alt="Normalized evolution of joy across timesteps for each genre" width="850">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Surprise_across_Timesteps_for_Each_Genre.png" alt="Normalized evolution of surprise across timesteps for each genre" width="850">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Neutral_across_Timesteps_for_Each_Genre.png" alt="Normalized evolution of neutral across timesteps for each genre" width="850">

No matter the genre, the evolution of emotions follows a similar pattern. Negative emotions like anger, fear, and disgust tend to rise during the first half of a movie, building tension, while emotions like neutrality and joy increase towards the end, reflecting narrative resolution. This consistency suggests that filmmakers across genres rely on a common emotional arc to engage audiences.

### Evolution of emotions across continents

Does it differ across continents?

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Emotions_across_Timesteps_in_North_America.png" alt="Normalized evolution of emotions across timesteps in North America" width="1600">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Emotions_across_Timesteps_in_Europe.png" alt="Normalized evolution of emotions across timesteps in Europe" width="1600">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Emotions_across_Timesteps_in_Asia.png" alt="Normalized evolution of emotions across timesteps in Asia" width="1600">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Emotions_across_Timesteps_in_Oceania.png" alt="Normalized evolution of emotions across timesteps in Oceania" width="1600">

Across North America, Europe, Asia, and Oceania, movies follow a consistent emotional pattern: negative emotions like anger, disgust, and fear rise early to build tension, while joy and neutrality increase toward the end, signaling resolution. Sadness dips mid-story but often peaks near the conclusion, aligning with emotional climaxes. 

We can also observe that North America shows the most regular emotional pattern, with steady rises and falls in emotions. This may be due to the larger number of films from that region.

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Emotions_across_Timesteps_in_South_America.png" alt="Normalized evolution of emotions across timesteps in South America" width="1600">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Emotions_across_Timesteps_in_Africa.png" alt="Normalized evolution of emotions across timesteps in Africa" width="1600">

Note that the results for South America and Africa may differ from those of the four other continents because we lack data for these two.

Can we predict a movie's box office performance based on the data we have? We used a simple model, a Random Forest Regressor, with the release date, the runtime, the categories, the continent and the mean emotions of the films as features. With this, we obtained an R-squared score of 0.124189, which is low. This means we cannot predict box office revenue with such a simple model. This is understandable: as the plots above show, the normalized evolution of emotions across timesteps changes little from one genre to another, which means that many features are of little use in predicting box office revenue.

<img src="src/scripts/plots/sentiment_analysis/box_office_prediction2.png" alt="Box office prediction" width="700">
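To make the R-squared score easier to interpret, the sketch below computes the coefficient of determination from its definition, 1 - SS_res / SS_tot. It assumes the reported score is this standard definition; the helper `r_squared` is written for illustration and is not the project's evaluation code.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return 1.0 - ss_res / ss_tot

print(r_squared([1, 2, 3], [1, 2, 3]))  # 1.0: perfect predictions
print(r_squared([1, 2, 3], [2, 2, 2]))  # 0.0: no better than predicting the mean
```

Under this definition, a score of about 0.12 sits much closer to the "always predict the mean" baseline than to a perfect fit.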

# Conclusion

Our analysis provides an in-depth exploration of cinema-related data, focusing on various dimensions such as genre distribution, gender representation, box office trends, and the evolution of emotions in movies. The insights gained highlight several key aspects of the film industry while revealing areas where data limitations affect our conclusions.

#### Key Findings:
1. **Data Cleaning and Enrichment**:
   - A significant portion of the original dataset, including movie summaries and box office data, contained missing or incomplete values.
   - Through web scraping, we successfully enriched the dataset, reducing missing box office values by over 25% in our 2,000-film sample and improving the average quality of movie summaries.

2. **Emotional Evolution in Movies**:
   - Emotions in movies follow a consistent narrative arc across genres and continents:
     - Negative emotions such as anger, fear, and disgust rise during the initial stages, building tension.
     - Joy and neutrality increase toward the conclusion, reflecting narrative resolution.
   - The trends suggest that filmmakers across the globe rely on a shared emotional structure to engage audiences.

3. **Genre and Geographic Trends**:
   - Drama dominates globally, accounting for 26% of films, followed by comedy and thriller.
   - North America leads in film production, contributing the majority of the dataset, with Europe and Asia trailing behind. Other continents, such as South America and Africa, are underrepresented, limiting our ability to generalize findings for these regions.

4. **Box Office Performance**:
   - Our attempt to predict box office performance using a Random Forest Regressor yielded a low R-squared score of 0.124, indicating that the selected features (release date, runtime, categories, continent, and mean emotions) are insufficient for accurate prediction.
   - The limited predictive power suggests that other factors, such as marketing budgets, star power, and audience reception, likely play a significant role in box office success.

#### Limitations:
- **Data Imbalance**: The overrepresentation of North American films skews global trends, making it difficult to draw conclusions for underrepresented regions.
- **Feature Limitations**: Emotional arcs and basic film metadata, while insightful, are not strong predictors of box office performance without additional contextual factors.
- **Incomplete Data**: Despite web scraping, some gaps remain, particularly for smaller film industries in South America and Africa.

#### Next Steps:
1. **Expand the Dataset**: Further scraping and validation for missing data, particularly in underrepresented regions, could improve the robustness of the analysis.
2. **Explore Advanced Models**: Incorporating advanced machine learning models or external features (e.g., social media engagement, audience ratings) could enhance box office prediction.
3. **Analyze Niche Genres**: Focused studies on niche genres or regional film trends could uncover unique patterns not evident in the broader analysis.

In conclusion, this work lays a foundation for understanding emotional storytelling and genre trends in global cinema, while identifying areas for future research and data enrichment. The results emphasize the universal appeal of emotional arcs in movies and the challenges of predicting commercial success in such a complex and multi-faceted industry.
