# Milestone 2: Project proposal and initial analysis

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>Project:</strong> Decoding Box-Office Bombs 💣
    <br>
    <strong>Team:</strong> ADAdventurers2024
</div>

To replicate our dataset, please download the data indicated in the [README](https://github.com/epfl-ada/ada-2024-project-adaventurers2024/blob/main/README.md) file. Then, navigate to the `scripts` folder and run the following script:

```cmd
python preprocess_data.py
```

This script will generate five files in the data folder:

- `cmu_tmdb.csv`: A merged dataset from CMU and TMDB, containing movie information such as revenue, budget, and other details.
- `movie_tropes.csv`: Tropes associated with each movie in the IMDb dataset, which serves as an intermediary file for merging tropes with the CMU dataset.
- `cmu_tropes.csv`: Tropes associated with each movie in the CMU dataset.
- `movie_actors.csv`: Actors linked to each movie in the CMU dataset.
- `movie_directors_actors.csv`: Directors and actors linked to each movie in the IMDb dataset.

You can now proceed with exploratory data analysis and initial assessments.

------

## Exploratory data analysis

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
DATA_PATH = "data"

#### CMU revenue and other metrics

In [None]:
df_cmu_tmdb = pd.read_csv(f"{DATA_PATH}/cmu_tmdb.csv")
df_cmu_tmdb.head()

In [None]:
df_cmu_tmdb.info()

#### CMU cast and crew

In [None]:
df_movie_actors = pd.read_csv(f"{DATA_PATH}/movie_actors.csv")
df_movie_actors.head()

In [None]:
df_movie_actors.info()

In [None]:
df_movie_directors_actors = pd.read_csv(f"{DATA_PATH}/movie_directors_actors.csv")
df_movie_directors_actors.head()

In [None]:
df_movie_directors_actors.info()

#### CMU tropes

In [None]:
df_cmu_tropes = pd.read_csv(f"{DATA_PATH}/cmu_tropes.csv")
df_cmu_tropes.head()

In [None]:
df_cmu_tropes.info()

--------

## Research questions

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import ast
import warnings

warnings.filterwarnings("ignore")

# Set visualization style
%matplotlib inline
sns.set(style="whitegrid", palette="muted", font_scale=1.2)

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>📊 Metrics & Performance</strong>
</div>


### 1. What metrics (e.g., low ratings, limited number of ratings, revenue vs budget) best indicate movie failure?


In [None]:
# Run the metric analysis on the merged CMU/TMDB dataset
from src.utils.metric_analysis import metric_analysis

metric_analysis(f"{DATA_PATH}/cmu_tmdb.csv")

#### 1.1 What we have done for the initial analysis


#### 1.2 Key observations


<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>👥 Cast & Crew Analysis</strong> 
</div>


### 2. How do actor demographics and lack of diversity impact audience disengagement and contribute to box office underperformance?

In [None]:
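# Run the actor demographics analysis (cast data plus the ethnicity ID-to-name mapping)
from src.utils.actor_analysis import actor_analysis

actor_analysis(f"{DATA_PATH}/movie_actors.csv", f"{DATA_PATH}/wikidata_ethnicities.csv")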

#### 2.1 What we have done for the initial analysis

##### Data Cleaning and Preparation
To begin the analysis, we removed duplicate entries. We dropped rows with missing values in critical columns, such as actor gender, age at movie release, revenue, average rating, and the number of votes, as these features are essential for the study. We mapped actor ethnicity IDs to corresponding names using `wikidata_ethnicities.csv`.
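
The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the code inside `actor_analysis`; every column name (`actor_gender`, `actor_age_at_release`, `ethnicity_id`, `ethnicity_name`, `vote_average`, `vote_count`) is an assumption and may differ from the actual CSV headers.

```python
import pandas as pd

# Toy stand-ins for movie_actors.csv and wikidata_ethnicities.csv.
# All column names are assumptions made for illustration.
actors = pd.DataFrame({
    "movie_id": [1, 1, 1, 2, 2],
    "actor_name": ["A", "A", "B", "C", "D"],
    "actor_gender": ["F", "F", "M", None, "M"],
    "actor_age_at_release": [30.0, 30.0, 45.0, 28.0, 52.0],
    "ethnicity_id": ["/m/x1", "/m/x1", "/m/x2", "/m/x1", None],
    "revenue": [1e6, 1e6, 1e6, 5e5, 5e5],
    "vote_average": [6.1, 6.1, 6.1, 4.8, 4.8],
    "vote_count": [120, 120, 120, 40, 40],
})
ethnicities = pd.DataFrame({
    "ethnicity_id": ["/m/x1", "/m/x2"],
    "ethnicity_name": ["Ethnicity one", "Ethnicity two"],
})

# Columns treated as essential for the study (assumed names).
critical = [
    "actor_gender",
    "actor_age_at_release",
    "revenue",
    "vote_average",
    "vote_count",
]

cleaned = (
    actors.drop_duplicates()            # remove exact duplicate rows
    .dropna(subset=critical)            # drop rows missing any critical value
    .merge(ethnicities, on="ethnicity_id", how="left")  # map IDs to names
)
```

A left join keeps actors whose ethnicity is unknown; their `ethnicity_name` is simply left missing rather than the row being dropped.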

##### Exploration of Actor Demographics
We created a bar plot to visualize the distribution of male and female actors. We examined the age distribution of actors using a histogram to highlight age-based cast patterns. Additionally, we analyzed ethnicity diversity by counting unique ethnicities represented in the dataset and visualized the top 10 most represented ethnicities.

##### Computation of Diversity Metrics per Movie
We quantified gender diversity as the proportion of female actors in a movie’s cast; ethnic diversity as the number of unique ethnicities in the cast; and age diversity as the standard deviation of actors' ages. Additionally, we calculated movie success indicators, such as revenue and ratings, for each film to assess the link between diversity and movie outcomes.
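
As a rough sketch of these three metrics, a single `groupby` with named aggregations is one way to compute them. The column names and the `"F"` gender code are assumptions, and the toy data only illustrates the definitions given above.

```python
import pandas as pd

# Toy cast table; column names and the "F" gender code are assumptions.
cast = pd.DataFrame({
    "movie_id": [1, 1, 1, 1, 2, 2],
    "actor_gender": ["F", "M", "M", "F", "M", "M"],
    "ethnicity_name": ["a", "b", "a", None, "c", "c"],
    "actor_age_at_release": [30.0, 40.0, 50.0, 40.0, 25.0, 35.0],
    "revenue": [1e6] * 4 + [5e5] * 2,
    "vote_average": [6.1] * 4 + [4.8] * 2,
})

per_movie = cast.groupby("movie_id").agg(
    # Share of female actors in the cast
    gender_diversity=("actor_gender", lambda g: (g == "F").mean()),
    # Number of distinct ethnicities (missing values are not counted)
    ethnic_diversity=("ethnicity_name", "nunique"),
    # Sample standard deviation of actor ages
    age_diversity=("actor_age_at_release", "std"),
    # Movie-level outcomes are constant within a movie, so take the first value
    revenue=("revenue", "first"),
    avg_rating=("vote_average", "first"),
)
```

Note that the standard deviation is undefined (NaN) for a movie with a single listed actor, so such movies would need to be filtered or handled separately.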

##### Correlation and Impact Analysis
We constructed a correlation matrix to explore relationships between the diversity metrics (gender, ethnic, and age diversity) and the movie success indicators (revenue and ratings). We visualized these correlations with a heatmap, highlighting the strength and direction of each relationship. We also created scatter plots to examine how gender, ethnic, and age diversity relate to revenue and ratings.
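
The correlation step might look like the sketch below. The data is synthetic (revenue is deliberately generated to depend on ethnic diversity), the column names are assumptions, and Pearson correlation is assumed since the text does not state which coefficient was used.

```python
import numpy as np
import pandas as pd

# Synthetic per-movie table for illustration only; not project data.
rng = np.random.default_rng(0)
n = 200
ethnic = rng.integers(1, 10, size=n)
per_movie = pd.DataFrame({
    "gender_diversity": rng.uniform(0, 1, size=n),
    "ethnic_diversity": ethnic,
    "age_diversity": rng.uniform(0, 20, size=n),
    # Revenue is built to increase with ethnic diversity, plus noise
    "revenue": ethnic * 1e6 + rng.normal(0, 2e6, size=n),
    "avg_rating": rng.uniform(3, 8, size=n),
})

# Pairwise Pearson correlations between all metrics and outcomes
corr = per_movie.corr(method="pearson")
```

Passing `corr` to `sns.heatmap(corr, annot=True, vmin=-1, vmax=1)` would then draw the annotated heatmap. Correlations describe association only, so they do not by themselves show that diversity drives revenue or ratings.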

#### 2.2 Key observations

##### Gender Diversity
From the correlation matrix, we observed that gender diversity has a weak and slightly negative correlation with revenue (-0.073) and average rating (-0.048). Scatter plots indicated that movies with a 5%-50% proportion of female actors tended to perform slightly better in revenue and receive better audience ratings.

##### Ethnic Diversity
We found that ethnic diversity has a moderate positive correlation with revenue (0.34), suggesting that movies with more ethnically diverse casts tend to generate higher revenue. The scatter plot of ethnic diversity versus rating showed an upward trend, with more ethnically diverse movies achieving better ratings. The plot also suggests that moderate ethnic diversity (3-7 unique ethnicities in the cast) is associated with higher revenue.

##### Age Diversity
From the correlation matrix, age diversity shows only weak positive correlations with revenue (0.14) and average rating (0.15), indicating a limited association with movie success. Scatter plots revealed no strong relationship between age diversity and average rating, as movies with both high and low age diversity exhibited a wide range of ratings. In the plot of age diversity versus revenue, some movies with moderate age diversity achieved higher revenue.

### 3. Is thematic consistency in director filmographies a predictor of movie failure?

In [None]:
# Run the director filmography analysis
from src.utils.director_analysis import director_analysis

director_analysis(f"{DATA_PATH}/movie_directors_actors.csv")

#### 3.1 What we have done for the initial analysis


#### 3.2 Key observations

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>🎬 Genre & Market Factors</strong> 
</div>

### 4. How does genre choice influence a movie's failure, particularly in different cultural contexts?

In [None]:
## code
from src.utils.genre_analysis import *
from src.utils.visualization_utils import *

# Setup
setup_visualization()
df, df_genres = prepare_data("data/cmu_tmdb.csv")
unique_genres = sorted(df_genres["genres"].unique())
genre_colors = create_genre_colors(unique_genres)

# Basic Analysis
plot_genre_distributions(df_genres, genre_colors)

# Performance Analysis
plot_genre_performance(df_genres, genre_colors)

# Temporal Analysis
analyze_temporal_trends(df_genres, genre_colors, unique_genres)

# ROI Analysis
df, df_genres = analyze_roi(df, df_genres, genre_colors)

# Budget Analysis
budget_stats = analyze_budget_categories(df)

# Success/Failure Rate Analysis
performance_stats = analyze_success_failure_rates(
    df_genres, genre_colors, unique_genres
)

# Summary Statistics
summary_stats = get_summary_statistics(df_genres)

#### 4.1 What we have done for the initial analysis

We employed several analytical approaches to understand genre impact on movie failure. 

- First, we used violin plots with symmetric log scaling to visualize profit distribution across genres, capturing both the central tendency and spread of financial performance. 
- To understand cultural reception, we analyzed the relationship between ratings and popularity (measured by vote count) using scatter plots with logarithmic scaling for vote counts. 
- We tracked genre performance over time using 5-year moving averages to identify long-term trends in audience reception. 
- Finally, we calculated and compared genre-specific success and failure rates to identify which genres carry the highest risk of significant financial loss.
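
The rate and trend computations listed above can be sketched on toy data as follows. The ROI definition (profit divided by budget) and the success/failure thresholds are assumptions chosen for illustration; the values actually used in `genre_analysis` may differ.

```python
import pandas as pd

# Toy data with one genre per row (as after exploding a genre list).
movies = pd.DataFrame({
    "genres": ["Horror", "Horror", "Drama", "Drama", "Drama", "Action"],
    "budget": [1e6, 2e6, 10e6, 5e6, 8e6, 100e6],
    "revenue": [6e6, 0.5e6, 12e6, 2e6, 30e6, 150e6],
})
movies["profit"] = movies["revenue"] - movies["budget"]
movies["roi"] = movies["profit"] / movies["budget"]  # assumed ROI definition

# Hypothetical thresholds, for illustration only
SUCCESS_ROI = 2.0    # earned back at least 3x the budget
FAILURE_ROI = -0.5   # lost at least half the budget

rates = (
    movies.assign(
        success=movies["roi"] >= SUCCESS_ROI,
        failure=movies["roi"] <= FAILURE_ROI,
    )
    .groupby("genres")[["success", "failure"]]
    .mean()  # mean of a boolean column is the rate
)

# 5-year moving average of a yearly mean rating (one row per year, no gaps)
yearly_rating = pd.Series(
    [5.0, 6.0, 7.0, 6.0, 5.0, 6.0], index=range(2000, 2006), name="avg_rating"
)
smoothed = yearly_rating.rolling(window=5, min_periods=1).mean()
```

For the profit distributions, a symmetric log axis can be set in matplotlib with `ax.set_yscale("symlog", linthresh=...)`, which stays linear near zero and logarithmic for large gains and losses, so both profits and losses remain visible on one plot.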


#### 4.2 Key observations

##### Financial Performance by Genre

1. High-Profit Potential:

- Action/Adventure/Fantasy lead in extreme profits (mean profits: 13-25M USD)
- Documentary/TV Movies show lowest profits but highest ROI (9.03 and 7.13)
- Horror shows strong ROI (5.69) with moderate investment
- Drama (most common genre, 22,560 movies) shows modest profits (3.47M USD)

2. Budget Impact:

- Very low budget films: Highest ROI potential but highest volatility
- High budget films: More consistent but lower returns
- Clear inverse relationship between budget size and ROI potential
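
To illustrate how a budget comparison of this kind can be made, the sketch below bins movies by budget and summarizes ROI per bin. The bin edges, the category labels, and the ROI definition are assumptions, not the ones used in `analyze_budget_categories`.

```python
import pandas as pd

# Toy data; not project results.
movies = pd.DataFrame({
    "budget": [5e4, 8e4, 2e6, 5e6, 4e7, 9e7, 1.5e8, 2e8],
    "revenue": [1e6, 1e4, 6e6, 4e6, 9e7, 1.2e8, 3e8, 2.5e8],
})
movies["roi"] = (movies["revenue"] - movies["budget"]) / movies["budget"]

# Hypothetical budget bins in USD (right-inclusive intervals)
bins = [0, 1e6, 1e7, 1e8, float("inf")]
labels = ["very low", "low", "medium", "high"]
movies["budget_category"] = pd.cut(movies["budget"], bins=bins, labels=labels)

# Median ROI for the typical return, standard deviation for volatility
stats = movies.groupby("budget_category", observed=True)["roi"].agg(
    ["median", "std", "count"]
)
```

Because ROI divides by budget, very small budgets can produce extreme values, which is one reason to report the median alongside the spread.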

##### Audience Reception

1. Ratings:

- Animation leads with highest average rating (6.01)
- Family/War films follow with strong ratings
- Western shows lowest ratings
- Most genres maintain 5-7 rating range
- Documentary shows most consistent ratings

2. Popularity Patterns:

- Popular movies (high vote counts) cluster around 6-7 ratings
- Less popular movies show wider rating variation
- Profitable movies typically have high vote counts
- Genre impact on popularity is minimal

##### Historical Trends

- Ratings stabilized post-1960
- High volatility in early years (pre-1940)
- Modern convergence around 5-6 rating range
- Genre distinctions decreased over time

##### Risk Assessment

1. Success Rates:

- Adventure/Science Fiction/Fantasy: Highest success rates
- Documentary/TV Movies: Extreme success/failure patterns
- Horror: Good success rate with moderate risk

2. Failure Rates:

- Thriller/Science Fiction/Mystery: Highest failure rates
- Action/Adventure: More moderate failure rates despite high budgets
- Documentary: High risk but high potential return

##### Key Takeaway

Genre significantly impacts financial performance and risk levels. While Action/Adventure/Fantasy lead in absolute profits, smaller genres like Documentary and Horror show strong ROI potential. Ratings remain relatively consistent across genres, with Animation and Family films maintaining slight advantages. Budget size shows stronger correlation with returns than genre choice.


### 5. How does poor release timing (e.g., season, holiday periods) affect a movie's likelihood of failing?

In [None]:
## code
from src.utils.timing_analysis import *

# Seasonal Analysis
seasonal_stats = plot_seasonal_distributions(df)

# Monthly Analysis
analyze_monthly_performance(df)

# Monthly ROI Analysis
monthly_perf_df = analyze_monthly_roi(df)

# Monthly Success Rate Analysis
plot_monthly_success_rates(monthly_perf_df)

# Monthly Statistics
monthly_stats = analyze_monthly_statistics(df, monthly_perf_df)

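# Yearly Analysis
# Note: this call resolves to the analyze_temporal_trends imported from timing_analysis above,
# which, judging by the different arguments, appears to shadow the same-named function used in section 4.
yearly_performance = analyze_temporal_trends(df)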

#### 5.1 What we have done for the initial analysis

- To investigate how release timing affects movie failure, we analyzed the distribution of profits and ratings across different temporal categories using violin plots. 
- We compared failure rates across seasons and months to identify particularly risky release periods. 
- To account for industry evolution, we examined the temporal trends of success and failure rates alongside movie release volume using a dual-axis visualization combining line graphs for rates and bar charts for release counts. 

This allowed us to identify historical patterns in optimal release timing while accounting for changes in industry output volume.
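
A minimal sketch of the seasonal and yearly failure-rate computation is shown below. The month-to-season mapping (Northern Hemisphere meteorological seasons), the failure threshold, and the column names are assumptions for illustration.

```python
import pandas as pd

# Toy release data; not project results.
releases = pd.DataFrame({
    "release_date": pd.to_datetime([
        "2001-01-15", "2001-06-20", "2002-07-04",
        "2002-12-25", "2003-09-10", "2003-01-03",
    ]),
    "budget": [10e6, 20e6, 50e6, 30e6, 5e6, 8e6],
    "revenue": [2e6, 80e6, 60e6, 90e6, 1e6, 9e6],
})

# Assumed season definition: December-February is Winter, and so on
SEASONS = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}
releases["season"] = releases["release_date"].dt.month.map(SEASONS)

FAILURE_ROI = -0.5  # hypothetical threshold: lost at least half the budget
roi = (releases["revenue"] - releases["budget"]) / releases["budget"]
releases["failure"] = roi <= FAILURE_ROI

# Failure rate and number of releases per season
by_season = releases.groupby("season")["failure"].agg(["mean", "count"])

# Yearly failure rate next to release volume (inputs for a dual-axis chart)
by_year = releases.groupby(releases["release_date"].dt.year).agg(
    n_releases=("failure", "size"),
    failure_rate=("failure", "mean"),
)
```

For the dual-axis view, one option is to draw `n_releases` with `ax.bar(...)` and the rate on a second axis created by `ax.twinx()`. Keeping the release count next to each rate makes it easier to spot rates that rest on very few movies.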


#### 5.2 Key observations

##### Seasonal Patterns

1. Profitability and ROI:

   - Fall shows best overall performance (high ROI: 3.46, good profit distribution)
   - Spring has highest mean profit (7.23M USD) and median ROI (1.16)
   - Winter consistently underperforms (lowest profit: 2.56M USD, lowest ROI: 2.77)
   - Summer shows moderate, stable performance

2. Ratings and Volume:

   - Ratings remain relatively consistent across seasons (range: 4.72-5.19)
   - Winter/Fall have highest release volumes (~13,000 movies each)
   - Spring/Summer have fewer releases (~11,500 movies each)

##### Monthly Patterns

1. Strong Months:

   - June/July: Highest success rates (~8%), good ROI potential
   - December: Strong performance (high success rate, good profit potential)
   - Summer months generally show better profit concentration

2. Weak Months:

   - January: Lowest success rate (~2%), volatile ROI
   - August/September: Highest failure rates (~3%)
   - Early fall months show increased risk

##### Historical Trends

- Movie volume increased significantly since 1980s
- Success/failure rates remained relatively stable until recent years
- Post-2000 shows increased volatility
- Possible data anomaly showing success spike near 2020

##### Key Takeaway

Best release windows appear to be summer months (June/July) and December, while January and early fall carry higher risks. Fall and Spring show strongest overall financial metrics, but Winter consistently underperforms across all measures.


<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>📖 Narrative & Thematic Elements</strong> 
</div>

### 6. Which tropes consistently lead to negative reception by genre?

In [None]:
## code

#### 6.1 What we have done for the initial analysis


#### 6.2 Key observations

### 7. What recurring plot patterns appear most frequently in critically panned films?

In [None]:
# code

#### 7.1 What we have done for the initial analysis


#### 7.2 Key observations