# Key Findings from Our Preliminary Analysis

In [1]:
import pandas as pd
from src.scripts import data_cleaning
from src.scripts import data_modification
from src.scripts import sentiment_analysis



This analysis explores several aspects of cinema data, including the distribution of movie genres, gender representation over the decades, and the emotions expressed in films. Our project focuses mainly on movie-related data. However, we also cleaned and browsed the character and actor data in case we decide to include them later.

To avoid cluttering this notebook, we report only images of our plots without the code. All related code and plots are in the src/scripts/notebook folder, which provides a detailed step-by-step walkthrough. Functional data pipelines such as data cleaning, sentiment analysis and data modification are called from this notebook to show what the dataframes we work with look like. These scripts are located in src/scripts.

## What’s in the data?

### Data cleaning

As a first step, we cleaned the whole dataset, with particular attention to plot summaries and movie-related data. We removed tags, citations and most unusual characters from the movie summaries. We unified date formats and, in some cases, kept only the year. We cleaned the box office data by removing strings or converting them to numerical values, and we checked the bounds of some fields (for example, age should not be negative). Finally, we removed IDs and columns that we are unlikely to use.
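
The cleaning steps described above can be sketched as follows. This is a minimal illustration, not the project's `data_cleaning` pipeline: the regular expressions, the helper names `clean_summary` and `to_year`, and the toy rows are all assumptions made for the example.

```python
import re
import pandas as pd

def clean_summary(text):
    """Strip markup tags, wiki templates and numeric citations from one summary."""
    if not isinstance(text, str):
        return text  # leave missing summaries untouched
    text = re.sub(r"<[^>]+>", " ", text)          # HTML-like tags
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)   # templates such as {{cite web}}
    text = re.sub(r"\[\d+\]", " ", text)          # citations such as [12]
    return re.sub(r"\s+", " ", text).strip()      # collapse leftover whitespace

def to_year(value):
    """Reduce '1921', '1921-03' or '1921-03-06' to the float year 1921.0."""
    match = re.match(r"\s*(\d{4})", str(value))
    return float(match.group(1)) if match else float("nan")

# Hypothetical rows, only to exercise the helpers.
toy = pd.DataFrame({
    "summary": ["A hobo <br/>moves in.[1] {{cite web}}", None],
    "Movie_release_date": ["1947-04-05", "1921"],
    "Actor_age_at_movie_release": [50.0, -5.0],
})
toy["summary"] = toy["summary"].map(clean_summary)
toy["Movie_release_date"] = toy["Movie_release_date"].map(to_year)
# Bounds check: a negative age is impossible, so treat it as missing.
toy.loc[toy["Actor_age_at_movie_release"] < 0, "Actor_age_at_movie_release"] = float("nan")
```

Replacing out-of-bounds values with NaN rather than dropping the rows keeps the other fields of the record available for later analysis.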

In [2]:
cleaned_data_movie = data_cleaning.movie_data_cleaning_pipeline()
print(cleaned_data_movie.columns)
cleaned_data_movie.sample(1)

Index(['Wikipedia_movie_ID', 'summary', 'Freebase_movie_ID', 'Movie_name',
       'Movie_release_date', 'Movie_box_office_revenue', 'Movie_runtime',
       'Movie_languages', 'Movie_countries', 'Movie_genres'],
      dtype='object')


Unnamed: 0,Wikipedia_movie_ID,summary,Freebase_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_runtime,Movie_languages,Movie_countries,Movie_genres
66117,29146560,,/m/0dlm9db,The Nut,1921.0,,74.0,English Language,United States of America,"Romance Film, Comedy"


In [3]:
cleaned_data_movie.describe()

| | Wikipedia_movie_ID | Movie_release_date | Movie_box_office_revenue | Movie_runtime |
|---|---|---|---|---|
| count | 81840.0 | 39372.0 | 8401.0 | 61291.0 |
| mean | 17411810.0 | 1977.548816 | 47993630.0 | 111.8192 |
| std | 10988810.0 | 30.938137 | 112175300.0 | 4360.07 |
| min | 330.0 | 1892.0 | 10000.0 | 0.0 |
| 25% | 7324745.0 | 1952.0 | 2083193.0 | 81.0 |
| 50% | 17792340.0 | 1989.0 | 10639690.0 | 93.0 |
| 75% | 27161290.0 | 2005.0 | 40716960.0 | 106.0 |
| max | 37501920.0 | 2016.0 | 2782275000.0 | 1079281.0 |


In [4]:
clean_character_data = data_cleaning.character_data_cleaning_pipeline()
print(clean_character_data.columns)
clean_character_data.sample(1)

Index(['Wikipedia_movie_ID', 'Freebase_movie_ID', 'Movie_release_date',
       'Character_name', 'Actor_date_of_birth', 'Actor_gender',
       'Actor_height_(in meters)', 'Actor_ethnicity_(Freebase ID)',
       'Actor_name', 'Actor_age_at_movie_release',
       'Freebase_character/actor_map_ID', 'Freebase_character_ID',
       'Freebase_actor_ID', 'unique_character_name', 'character_types',
       'character', 'movie'],
      dtype='object')


Unnamed: 0,Wikipedia_movie_ID,Freebase_movie_ID,Movie_release_date,Character_name,Actor_date_of_birth,Actor_gender,Actor_height_(in meters),Actor_ethnicity_(Freebase ID),Actor_name,Actor_age_at_movie_release,Freebase_character/actor_map_ID,Freebase_character_ID,Freebase_actor_ID,unique_character_name,character_types,character,movie
162000,876728.0,/m/03kx49,1979.0,,1928.0,M,1.88,,James Coburn,50.0,/m/0cg2wc7,,/m/01337_,,,,


### Missing values

<img src="src/scripts/plots/movie_exploration/Percentage_of_Missing_Data_for_Movie_Characteristics.png" alt="box off per cat" width="700">

<img src="src/scripts/plots/actor_exploration/Percentage_of_Missing_Data_for_Actor_Characteristics.png" alt="box off per cat" width="700">


These visualizations show the percentage of missing data for both movie and actor characteristics. Some categories, such as movie summaries, have significant gaps that could affect the rest of our analysis. Since our main points of interest are summaries and ratings, we used web scraping to fill these gaps. Initially, we considered using box office revenue to evaluate the success of a movie. However, we decided it was not the most reliable metric because it depends on external factors like marketing and distribution. Instead, we opted to use ratings as a more representative measure of audience perception and movie quality. Ratings are absent from the original dataset, so they required additional scraping.

<img src="src/scripts/plots/movie_exploration/Distribution_of_the_summary_that_have_length_inf_to_5000_and_1000.png" alt="box off per cat" width="900">

Digging deeper into the plot summaries, we observe that a sizeable share of summaries have fewer than 200 words. To analyze how emotions evolve throughout a summary, we chose a cutoff length and scraped Wikipedia, when possible, for both missing plots and summaries that are too short.


### Web-Scraping

As observed in the various plots, approximately 40% of film summaries and all of the ratings are missing, both of which are critical for our analysis. To address these gaps in the original CMU dataset, we have decided to supplement it with additional information from the films' Wikipedia pages.

Our focus is on films with missing or very short summaries (fewer than 200 words), for which we retrieve more detailed summaries from Wikipedia. This approach ensures that the sentiment analysis performed later has a sufficiently robust dataset to yield meaningful insights. Additionally, since predicting the success of a movie is a key objective, we also scrape ratings data from Wikipedia to enrich the dataset.
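
The selection rule above (missing summary, or fewer than 200 words) can be expressed in a few lines. This is a sketch under assumptions: the helper `needs_scraping`, the `to_scrape` column and the toy rows are hypothetical, and words are counted by splitting on whitespace, which may differ from the project's own word count.

```python
import pandas as pd

MIN_WORDS = 200  # cutoff quoted in the text

def needs_scraping(summary, min_words=MIN_WORDS):
    """Return True when the summary is missing or shorter than min_words words."""
    if not isinstance(summary, str) or not summary.strip():
        return True
    return len(summary.split()) < min_words

# Hypothetical rows: one missing, one short, one long summary.
toy = pd.DataFrame({
    "Movie_name": ["A", "B", "C"],
    "summary": [None, "short plot", "word " * 250],
})
toy["to_scrape"] = toy["summary"].map(needs_scraping)
```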

The results presented here reflect the outcome of scraping applied to a sample of 2,000 films from the CMU dataset. This process enriches our data, specifically targeting films with incomplete summaries (replacing those under 200 words with the Wikipedia entry) and adding ratings.

<img src="src/scripts/plots/results/summary_lenghts_comparison.png" alt="Dist plots" width="1100"/>

Here, we present the distribution of summary lengths for the original CMU dataset (labeled "Before") alongside the enriched CMU dataset, supplemented with web-scraped data from Wikipedia (labeled "After").

Comparing the two distributions of summary lengths, we observe more long summaries and fewer short ones after scraping. In addition, some previously missing summaries (NaN values) have now been populated.

To complete our dataset, we scraped ratings for 92.85% of the movies, a process that took approximately 9 hours. Using libraries like requests and BeautifulSoup, we resolved title variations and handled redirects to ensure accurate data collection. The validated ratings significantly improved our dataset's completeness and analytical potential. 
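
One way to handle title variations is to try several candidate page titles in order. The sketch below is an assumption about how this could be done, not the code in the scraping notebooks: the helpers `candidate_page_titles` and `page_url` are hypothetical, and the suffixes follow Wikipedia's usual "(film)" and "(YEAR film)" disambiguation convention.

```python
from urllib.parse import quote

def candidate_page_titles(movie_name, year=None):
    """List Wikipedia page titles to try, most specific first."""
    candidates = []
    if year is not None:
        candidates.append(f"{movie_name} ({int(year)} film)")
    candidates.append(f"{movie_name} (film)")
    candidates.append(movie_name)
    return candidates

def page_url(title):
    """Build the article URL; Wikipedia uses underscores in place of spaces."""
    return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))

# Example with a movie from the dataset sample shown earlier.
urls = [page_url(title) for title in candidate_page_titles("The Nut", 1921.0)]
```

Each URL could then be fetched with `requests.get`, which follows redirects by default, stopping at the first candidate that returns a film article.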

The full code for this process is available in `src/scripts/notebook/summaries_scraping.ipynb` and `src/scripts/notebook/movie_rating_scraping.ipynb`.

### Ratings analysis

To understand the relationship between film genres and ratings, we present two histograms. The first shows the distribution of overall ratings for all films in the dataset, while the second breaks down ratings by genre, allowing us to observe how different genres tend to perform in terms of audience perception.

<img src="src/scripts/plots/results/ratings.png" alt="Dist plots" width="1100"/>

As seen in the histograms, the overall distribution of ratings is fairly concentrated around the middle, with most films receiving ratings between 5 and 7. Genres like Drama and Thriller show a broader distribution, indicating a more mixed reception than the others. The Horror genre has a wider spread, with a higher number of films receiving lower ratings, likely reflecting the polarized nature of this genre.

### Genres and Geographic Distribution 

Because of the very large number of genres in the data (about 360), we handcrafted a mapping that reduces them to about a dozen. While our mapping may be a point of discussion among movie enthusiasts, we find it a good approximation of a more general genre classification. We took the same approach for geographical location, classifying countries by continent. The mappings themselves can be found in the notebook or scripts about data modifications. Our dataframe therefore gains a continent column and a category column, which allows us to draw distributions per genre and per continent.
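
A handcrafted mapping of this kind is essentially a pair of dictionaries applied to the genre and country columns. The sketch below uses a tiny hypothetical excerpt: the dictionary contents and the helper `map_genres` are assumptions, and the real mappings live in the data modification scripts.

```python
import pandas as pd

# Hypothetical excerpts; the project's mappings cover far more entries.
GENRE_TO_CATEGORY = {
    "Screwball comedy": "Comedy",
    "Comedy": "Comedy",
    "Romance Film": "Romance",
    "Slasher": "Horror",
}
COUNTRY_TO_CONTINENT = {
    "United States of America": "North America",
    "France": "Europe",
    "Japan": "Asia",
}

def map_genres(raw_genres):
    """Turn 'Screwball comedy, Comedy' into a sorted list of broad categories."""
    if not isinstance(raw_genres, str):
        return []
    names = [genre.strip() for genre in raw_genres.split(",")]
    # A set removes duplicates when several raw genres share one category.
    return sorted({GENRE_TO_CATEGORY[g] for g in names if g in GENRE_TO_CATEGORY})

toy = pd.DataFrame({
    "Movie_genres": ["Screwball comedy, Comedy", "Romance Film, Comedy"],
    "Movie_countries": ["United States of America", "France"],
})
toy["category"] = toy["Movie_genres"].map(map_genres)
toy["continent"] = toy["Movie_countries"].map(COUNTRY_TO_CONTINENT)
```

Raw genres absent from the dictionary are silently skipped here; a real pipeline might instead log them so the mapping can be extended.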

In [5]:
modified_data = data_modification.all_data_transformations()
modified_data.sample(1)

Unnamed: 0,Wikipedia_movie_ID,summary,Freebase_movie_ID,Movie_name,Movie_release_date,Movie_box_office_revenue,Movie_runtime,Movie_languages,Movie_countries,Movie_genres,category,continent
13947,4544022,"Aloyisius T. McKeever , a New York City hobo, ...",/m/0c824q,It Happened on 5th Avenue,,,116.0,English Language,United States of America,"Screwball comedy, Comedy",[Comedy],North America


<img src="src/scripts/plots/results/movie_continent.png" alt="box off per cat" width="900">

North America dominates the film industry, followed by Europe and Asia, while the other continents collectively account for less than 5% of film production. Additionally, most movies in the dataset were produced in the 21st century. Although the number of films appears to decline after 2010, this is due to the dataset ending around 2015. 

Now, let's examine the distribution of genres in films.

<img src="src/scripts/plots/results/genre_distr.png" alt="box off per cat" width="1100">

Drama emerges as the most common genre, making up approximately 26% of the dataset, followed by comedy and thriller, which also hold significant shares. The distribution highlights a diverse range of genres, reflecting varied audience preferences in cinematic storytelling. To streamline our analysis and gain a broader perspective, we grouped the numerous film genres into eight main categories: drama, comedy, thriller, romance, action/adventure, fantasy/sci-fi, horror, and family/animation. This approach allows for more comprehensive and global insights. For instance, here is an overview of the Action and Adventure category.

<img src="src/scripts/plots/results/action.png" alt="box off per cat" width="600">

Now, let's examine the distribution of genres across continents to uncover regional preferences and trends in cinematic storytelling.

<img src="src/scripts/plots/results/genre_cont.png" alt="box off per cat" width="750">

The charts show drama as a dominant genre globally, with variations in preferences across continents. Comedy is more prominent in North America and Oceania, while action/adventure stands out in Asia.

## Technical part: sentiment analysis, PCA/clustering, K-shape clustering

### Sentiment analysis

Most of our data story relies on sentiment analysis to track the emotions throughout a movie, approximated via its summary. We considered three possible approaches:

1) We first use a semantic similarity model to aggregate sentences of close meaning together and segment the plot summary. This means we would first pass every sentence in the model that projects them into a high-dimension embedding and then use cosine similarity with a threshold to group our sentences. Finally, we would pass each segment through a sentiment analysis model that gives back scores across 7 emotions: anger, disgust, fear, joy, neutral, sadness, or surprise.
2) Passing sentence by sentence in the same sentiment analysis model that gives back scores across 7 emotions: anger, disgust, fear, joy, neutral, sadness, or surprise.
3) Using a simpler sentiment analysis model that gives only a positive/negative score per sentence.

We observed that segmenting plot summaries with the semantic model and a threshold of 0.5 would group only about 13% of sentences, so we decided to work sentence by sentence. We chose option 2 with emotion classification, and we plan to double-check our results with the positive/negative classifier to make sure that positive and negative emotions match the positive/negative score.
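
To make option 1 concrete, here is a minimal sketch of threshold-based grouping on sentence embeddings. It assumes that only consecutive sentences are compared and that embeddings are already computed; the function `segment_by_similarity` and the toy 2-D vectors are hypothetical and stand in for the output of a real embedding model.

```python
import numpy as np

def segment_by_similarity(embeddings, threshold=0.5):
    """Group consecutive sentences whose cosine similarity reaches the threshold.

    Returns a list of segments, each a list of sentence indices.
    """
    vectors = np.asarray(embeddings, dtype=float)
    if len(vectors) == 0:
        return []
    # Normalising to unit length turns the dot product into cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    segments = [[0]]
    for i in range(1, len(unit)):
        similarity = float(unit[i - 1] @ unit[i])
        if similarity >= threshold:
            segments[-1].append(i)   # close in meaning: extend current segment
        else:
            segments.append([i])     # otherwise start a new segment
    return segments

# Toy 2-D "embeddings": the first two sentences point in almost the same direction.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]]
segments = segment_by_similarity(toy_embeddings)
```

When few neighbouring sentences clear the threshold, most segments contain a single sentence, which is the situation that motivated the sentence-by-sentence choice.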

In [10]:
# As an example, we run it on 2 plots
sentiment_df = sentiment_analysis.sentiment_analysis_sentences(limit=2)
sentiment_df

FileNotFoundError: [Errno 2] No such file or directory: 'data/our_movie_data_extended.csv'

We now have an emotion score for each sentence. The next step is to make these scores comparable across movies. We observed that the average plot length (before scraping) is about 20 sentences, so we chose 20 as the number of timesteps at which we consider the emotions of a movie. This number may change as our plots get longer. We use simple interpolation, and fill missing values with extrapolation, to obtain the emotional evolution of each movie across 20 points.
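
The resampling idea can be illustrated with plain linear interpolation. This sketch uses `numpy.interp` rather than the project's `interpolate_df`, whose internals are not shown here; the function `resample_scores` and the toy scores are assumptions.

```python
import numpy as np

N_TIMESTEPS = 20  # number of points chosen in the text

def resample_scores(scores, n_timesteps=N_TIMESTEPS):
    """Linearly resample a (n_sentences, n_emotions) array to n_timesteps rows."""
    scores = np.asarray(scores, dtype=float)
    if scores.shape[0] == 1:
        # A single sentence cannot be interpolated; repeat it at every timestep.
        return np.repeat(scores, n_timesteps, axis=0)
    # Place sentences and timesteps on the same normalised [0, 1] axis.
    old_positions = np.linspace(0.0, 1.0, scores.shape[0])
    new_positions = np.linspace(0.0, 1.0, n_timesteps)
    columns = [np.interp(new_positions, old_positions, scores[:, j])
               for j in range(scores.shape[1])]
    return np.column_stack(columns)

# Toy plot with 3 sentences and 2 emotions whose scores sum to 1 per sentence.
toy_scores = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
resampled = resample_scores(toy_scores)
```

Because the first and last sentences are mapped to the first and last timesteps, this variant needs no extrapolation, and linear interpolation keeps each row summing to 1.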

In [None]:
interpolated_data = sentiment_analysis.interpolate_df(sentiment_df)
interpolated_data.dropna(inplace=True)
interpolated_data



| | anger | disgust | fear | joy | neutral | sadness | surprise | Wikipedia_movie_ID | timestep |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.040454 | 0.007021 | 0.818398 | 0.002523 | 0.061496 | 0.003478 | 0.06663 | 3217 | 0 |
| 21 | 0.232376 | 0.408861 | 0.068319 | 0.002593 | 0.150711 | 0.128344 | 0.008796 | 3217 | 1 |
| 22 | 0.086556 | 0.400418 | 0.035418 | 0.320368 | 0.090185 | 0.046314 | 0.020742 | 3217 | 2 |
| 23 | 0.017997 | 0.084273 | 0.002142 | 0.323199 | 0.532759 | 0.013941 | 0.025689 | 3217 | 3 |
| 24 | 0.016314 | 0.049796 | 0.009207 | 0.014102 | 0.860136 | 0.041091 | 0.009353 | 3217 | 4 |
| 25 | 0.009452 | 0.018085 | 0.068621 | 0.036484 | 0.477986 | 0.382164 | 0.007207 | 3217 | 5 |
| 26 | 0.010108 | 0.007101 | 0.719488 | 0.204552 | 0.045629 | 0.006602 | 0.006521 | 3217 | 6 |
| 27 | 0.120718 | 0.165876 | 0.084803 | 0.409418 | 0.103597 | 0.110009 | 0.005578 | 3217 | 7 |
| 28 | 0.146809 | 0.60475 | 0.062461 | 0.002435 | 0.102097 | 0.07481 | 0.006639 | 3217 | 8 |
| 29 | 0.022271 | 0.170628 | 0.010483 | 0.003933 | 0.522641 | 0.042434 | 0.22761 | 3217 | 9 |


With this table, it is easy to group by any variable of interest (continent, category, revenue, etc.). We show our preliminary results below.
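
As an illustration of such a grouping, the sketch below merges per-timestep emotion scores with movie metadata and averages them per continent and timestep. The two toy dataframes are hypothetical and only mimic the column names shown in this notebook.

```python
import pandas as pd

emotions = ["joy", "fear"]  # subset of the seven emotions, for brevity

# Hypothetical interpolated scores for two movies over two timesteps.
interpolated = pd.DataFrame({
    "Wikipedia_movie_ID": [1, 1, 2, 2],
    "timestep": [0, 1, 0, 1],
    "joy": [0.2, 0.6, 0.4, 0.8],
    "fear": [0.8, 0.4, 0.6, 0.2],
})
# Hypothetical metadata with the continent column added by the data modification step.
metadata = pd.DataFrame({
    "Wikipedia_movie_ID": [1, 2],
    "continent": ["Europe", "Europe"],
})

merged = interpolated.merge(metadata, on="Wikipedia_movie_ID", how="left")
# Average emotional arc per continent: one row per (continent, timestep).
mean_arc = merged.groupby(["continent", "timestep"])[emotions].mean()
```

Swapping `"continent"` for another column, such as the genre category, gives the corresponding arc per group.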

#### Technical notes on runtime

For now, we have evaluated our pipeline on only 5,000 films, to make sure that it runs smoothly and to check that the results are interesting. Running 5,000 movies on a Google Colab notebook with an A100 GPU takes about 10-13 minutes. After data scraping, we expect to evaluate about 80% of the dataset, which is about 60,000 movies, for an estimated runtime of 2-3 hours.
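
The runtime estimate is a linear extrapolation from the measured sample. The short sketch below reproduces the arithmetic, assuming runtime scales proportionally with the number of movies; the function `estimate_runtime_hours` is hypothetical.

```python
def estimate_runtime_hours(n_movies, sample_size=5_000, sample_minutes=(10, 13)):
    """Scale the measured sample runtime linearly; returns (low, high) in hours."""
    scale = n_movies / sample_size
    low, high = (minutes * scale / 60 for minutes in sample_minutes)
    return low, high

# 60,000 movies is 12 times the sample, so 120 to 156 minutes.
low, high = estimate_runtime_hours(60_000)
```

This gives roughly 2.0 to 2.6 hours, in line with the 2-3 hour estimate, although longer scraped summaries could push the real figure higher.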

The following results therefore cover only a 5,000-movie subset, but we expect the trends to generalize.

### PCA/clustering

### K-shape clustering

## Patterns in film sentiment 

How do emotions progress over the course of a movie? 

<img src="src/scripts/plots/results/evol_emotion.png" alt="box off per cat" width="800">

<img src="src/scripts/plots/results/norm_emotion.png" alt="box off per cat" width="800">

The first graph shows the averaged evolution of emotions over the course of a movie, with neutral emotions being the most prominent and steadily increasing toward the end. Disgust, on the other hand, maintains a consistent presence throughout the movie. The second graph, which normalizes the emotions, reveals more dynamic changes, with surprise and fear peaking early on and then declining, while neutral emotions gradually dominate as the movie progresses. These trends reflect typical emotional pacing in storytelling, balancing tension and resolution.

Now, let's take a closer look at each emotion.

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Anger_across_Timesteps_for_Each_Genre.png" alt="box off per cat" width="850">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Disgust_across_Timesteps_for_Each_Genre.png" alt="box off per cat" width="850">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Fear_across_Timesteps_for_Each_Genre.png" alt="box off per cat" width="850">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Sadness_across_Timesteps_for_Each_Genre.png" alt="box off per cat" width="850">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Joy_across_Timesteps_for_Each_Genre.png" alt="box off per cat" width="850">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Surprise_across_Timesteps_for_Each_Genre.png" alt="box off per cat" width="850">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Neutral_across_Timesteps_for_Each_Genre.png" alt="box off per cat" width="850">

No matter the genre, the evolution of emotions follows a similar pattern. Negative emotions like anger, fear, and disgust tend to rise during the first half of a movie, building tension, while emotions like neutrality and joy increase towards the end, reflecting narrative resolution. This consistency suggests that filmmakers across genres rely on a common emotional arc to engage audiences.

### Evolution of emotions across continents

Does it differ across continents?

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Emotions_across_Timesteps_in_North_America.png" alt="box off per cat" width="1600">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Emotions_across_Timesteps_in_Europe.png" alt="box off per cat" width="1600">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Emotions_across_Timesteps_in_Asia.png" alt="box off per cat" width="1600">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Emotions_across_Timesteps_in_Oceania.png" alt="box off per cat" width="1600">

Across North America, Europe, Asia, and Oceania, movies follow a consistent emotional pattern: negative emotions like anger, disgust, and fear rise early to build tension, while joy and neutrality increase toward the end, signaling resolution. Sadness dips mid-story but often peaks near the conclusion, aligning with emotional climaxes. 

We can also observe that North America follows a more regular emotional pattern, with steady rises and falls in emotions. This may be due to the larger number of films from that region.

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Emotions_across_Timesteps_in_South_America.png" alt="box off per cat" width="1600">

<img src="src/scripts/plots/sentiment_analysis/Normalized_Evolution_of_Emotions_across_Timesteps_in_Africa.png" alt="box off per cat" width="1600">

Note that the results for South America and Africa may differ from those of the four other continents because we lack data for these two.

## Historical analysis

### The Impact of Color and Sound on Early Cinema

The evolution of emotions in movies provides valuable insight into how storytelling adapts to technological advancements and societal changes. We examine key emotional trends across two periods, before 1922 and between 1930 and 1940, to understand how the introduction of sound and color influenced emotional representation in films. We stopped our analysis at 1940 so that any changes in emotions would not be influenced by World War II. The first set of graphs presents the normalized evolution of seven emotions: anger, disgust, fear, joy, neutral, sadness, and surprise.

<img src="src/scripts/plots/results/1922_full.png" alt="box off per cat" width="1000">

<img src="src/scripts/plots/results/1930_full.png" alt="box off per cat" width="1000">

From the initial graphs, we identified joy, fear, disgust, sadness, and surprise as the emotions with the most significant differences between the two periods. These emotions showed distinct trends that warranted deeper analysis, as they appeared to be the most impacted by technological and cultural shifts. In contrast, anger and neutral exhibited relatively stable patterns and were excluded from further analysis.

<img src="src/scripts/plots/results/1922_part.png" alt="box off per cat" width="1000">

<img src="src/scripts/plots/results/1930_part.png" alt="box off per cat" width="1000">

The second set of graphs focuses exclusively on the five selected emotions, offering a clearer view of how their trajectories evolved:

- Joy shows a muted presence before 1922, with a sharp increase towards the end. Between 1930 and 1940, it fluctuates more dynamically, reflecting changes in storytelling enabled by sound and color.
- Fear maintains a consistent peak near the middle of the timeline in both periods but displays greater variation in the later era, likely due to enhanced audio techniques that heightened suspense.
- Disgust declines more sharply before 1922, indicating its earlier prevalence in silent films, possibly as a visual tool for evoking moral or visceral reactions.
- Sadness demonstrates higher variability and intensity before 1922.
- Surprise rises significantly in the later period, benefiting from plot twists and sound effects that amplified its impact.

This focused analysis highlights how the adoption of sound and color reshaped the emotional arcs in movies, making them more dynamic and nuanced.

### Impact of WW1

World War I was a transformative event that left a profound imprint on all aspects of society, including cinema. To evaluate how this global conflict shaped the emotional tone of movies, we examined the evolution of seven emotions (anger, disgust, fear, joy, neutral, sadness, and surprise) in films released before 1914 and in films released during the war and its immediate aftermath (1914–1922). The first set of graphs provides a comparative view of these emotional dynamics over time.

<img src="src/scripts/plots/results/1914_full.png" alt="box off per cat" width="1000">

<img src="src/scripts/plots/results/1914_22_full.png" alt="box off per cat" width="1000">

Upon reviewing the initial graphs, it becomes clear that disgust and fear display the most pronounced changes between the two periods. While other emotions exhibit minor fluctuations, the stark variations in these two emotions are particularly noteworthy. This suggests that the context of war may have specifically influenced how filmmakers used these emotions to convey narratives. To gain deeper insights, we focused on the trends of disgust and fear, isolating them in subsequent visualizations.

<img src="src/scripts/plots/results/1914_part.png" alt="box off per cat" width="700">

<img src="src/scripts/plots/results/1914_22_part.png" alt="box off per cat" width="700">

The focused graphs reveal distinct patterns for these emotions. Disgust shows a consistent decline before 1914, followed by a steeper drop during the war years. This may indicate a societal shift in tolerance for graphic depictions or an intentional effort by filmmakers to avoid overly unsettling themes during a time of widespread trauma. Conversely, fear exhibits a significant rise during the wartime period, peaking more consistently. This trend likely mirrors the pervasive anxiety and dread experienced by audiences during the war, as filmmakers sought to capture and reflect the heightened emotional state of the time.

### Impact of 9/11

To analyze the emotional impact of the 9/11 attacks on cinema, we focused exclusively on movies produced in the United States. By examining films from two time periods—1998 to 2000 (pre-9/11) and 2001 to 2003 (post-9/11)—we aimed to minimize the influence of other events and isolate the effects of this pivotal moment. The analysis tracks the normalized evolution of emotions such as anger, disgust, fear, joy, neutral, sadness, and surprise across different genres, providing a detailed view of how storytelling may have shifted during this era.

<img src="src/scripts/plots/results/Before_911_full.png" alt="box off per cat" width="1400">

<img src="src/scripts/plots/results/After_911_full.png" alt="box off per cat" width="1400">

From the first set of graphs, we notice that sadness shows distinct variations before and after 9/11, particularly in the genres of romance and family/animation. This observation led us to narrow our focus to these two genres, as they displayed the most striking emotional shifts. Sadness in family/animation films appears to decrease after the attacks, reflecting a potential trend toward lighter, more uplifting narratives for younger audiences. Conversely, sadness rises in romance films, suggesting a deeper exploration of emotional vulnerability and connection in response to societal grief.

To better understand these patterns, we plotted the evolution of sadness specifically for these two genres before and after 9/11.

<img src="src/scripts/plots/results/911_part.png" alt="box off per cat" width="900">

The final graphs reveal contrasting trends. Sadness in family/animation films declines sharply post-9/11, likely reflecting an industry-wide attempt to provide escapism and comfort to audiences. In contrast, sadness in romance films increases during the same period, suggesting a shift toward more poignant and reflective storytelling that resonates with societal mourning. These findings underscore the varied emotional responses of different genres to the same historical event, highlighting the adaptive nature of cinema in addressing collective emotions.

## Correlation between emotions and ratings

# Conclusion

Our analysis provides an in-depth exploration of cinema-related data, focusing on various dimensions such as genre distribution, gender representation, box office trends, and the evolution of emotions in movies. The insights gained highlight several key aspects of the film industry while revealing areas where data limitations affect our conclusions.

#### Key Findings:
1. **Data Cleaning and Enrichment**:
   - A significant portion of the original dataset, including movie summaries and box office data, contained missing or incomplete values.
   - Through web scraping, we successfully enriched the dataset, reducing missing values by over 25% for box office data and improving the average quality of movie summaries.

2. **Emotional Evolution in Movies**:
   - Emotions in movies follow a consistent narrative arc across genres and continents:
     - Negative emotions such as anger, fear, and disgust rise during the initial stages, building tension.
     - Joy and neutrality increase toward the conclusion, reflecting narrative resolution.
   - The trends suggest that filmmakers across the globe rely on a shared emotional structure to engage audiences.

3. **Genre and Geographic Trends**:
   - Drama dominates globally, accounting for 26% of films, followed by comedy and thriller.
   - North America leads in film production, contributing the majority of the dataset, with Europe and Asia trailing behind. Other continents, such as South America and Africa, are underrepresented, limiting our ability to generalize findings for these regions.

4. **Box Office Performance**:
   - Our attempt to predict box office performance using a Random Forest Regressor yielded a low R-squared score of 0.124, indicating that the selected features (release date, runtime, categories, continent, and mean emotions) are insufficient for accurate prediction.
   - The limited predictive power suggests that other factors, such as marketing budgets, star power, and audience reception, likely play a significant role in box office success.

#### Limitations:
- **Data Imbalance**: The overrepresentation of North American films skews global trends, making it difficult to draw conclusions for underrepresented regions.
- **Feature Limitations**: Emotional arcs and basic film metadata, while insightful, are not strong predictors of box office performance without additional contextual factors.
- **Incomplete Data**: Despite web scraping, some gaps remain, particularly for smaller film industries in South America and Africa.

#### Next Steps:
1. **Expand the Dataset**: Further scraping and validation for missing data, particularly in underrepresented regions, could improve the robustness of the analysis.
2. **Explore Advanced Models**: Incorporating advanced machine learning models could enhance box office prediction.
3. **Conduct Detailed Analyses**: Focused investigations into differences across genres, time, and continents could uncover unique patterns in movie dynamics. Additionally, examining the relationship between emotions and factors such as box office performance, reviews, and awards may provide deeper insights into how storytelling elements influence audience reception and commercial success.


In conclusion, this work lays a foundation for understanding emotional storytelling and genre trends in global cinema, while identifying areas for future research and data enrichment. The results emphasize the universal appeal of emotional arcs in movies and the challenges of predicting commercial success in such a complex and multi-faceted industry.
