# Milestone 2: Project proposal and initial analysis

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>Project:</strong> Decoding Box-Office Bombs 💣
    <br>
    <strong>Team:</strong> ADAdventurers2024
</div>

To replicate our dataset, please download the data indicated in the [README](https://github.com/epfl-ada/ada-2024-project-adaventurers2024/blob/main/README.md) file. Then, navigate to the `scripts` folder and run the following script:

```cmd
python preprocess_data.py
```

This script will generate five files in the data folder:

- `cmu_tmdb.csv`: A merged dataset from CMU and TMDB, containing movie information such as revenue, budget, and other details.
- `movie_tropes.csv`: Tropes associated with each movie in the IMDb dataset, which serves as an intermediary file for merging tropes with the CMU dataset.
- `cmu_tropes.csv`: Tropes associated with each movie in the CMU dataset.
- `movie_actors.csv`: Actors linked to each movie in the CMU dataset.
- `movie_directors_actors.csv`: Directors and actors linked to each movie in the IMDb dataset.

You can now proceed with exploratory data analysis and initial assessments.
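
Before starting the analysis, it can help to confirm that preprocessing produced all five files. The following is a minimal sketch, assuming the files sit directly inside the `data` folder relative to the working directory; the helper name `missing_outputs` is ours for illustration and is not part of the project's scripts.

```python
from pathlib import Path

# File names as listed above; the helper below is hypothetical.
EXPECTED_FILES = [
    "cmu_tmdb.csv",
    "movie_tropes.csv",
    "cmu_tropes.csv",
    "movie_actors.csv",
    "movie_directors_actors.csv",
]


def missing_outputs(data_dir="data"):
    """Return the expected output files that are absent from data_dir."""
    base = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]
```

Calling `missing_outputs("data")` returns an empty list when every file is present; any names it returns point to a preprocessing step that should be rerun.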

------

## Exploratory data analysis

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
DATA_PATH = "data"

#### CMU revenue and other metrics

In [None]:
df_cmu_tmdb = pd.read_csv(f"{DATA_PATH}/cmu_tmdb.csv")
df_cmu_tmdb.head()

In [None]:
df_cmu_tmdb.info()

#### CMU cast and crew

In [None]:
df_movie_actors = pd.read_csv(f"{DATA_PATH}/movie_actors.csv")
df_movie_actors.head()

In [None]:
df_movie_actors.info()

In [None]:
df_movie_directors_actors = pd.read_csv(f"{DATA_PATH}/movie_directors_actors.csv")
df_movie_directors_actors.head()

In [None]:
df_movie_directors_actors.info()

#### CMU tropes

In [None]:
df_cmu_tropes = pd.read_csv(f"{DATA_PATH}/cmu_tropes.csv")
df_cmu_tropes.head()

In [None]:
df_cmu_tropes.info()

--------

## Research questions

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import ast
import warnings

warnings.filterwarnings("ignore")

# Set visualization style
%matplotlib inline
sns.set(style="whitegrid", palette="muted", font_scale=1.2)

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>📊 Metrics & Performance</strong>
</div>


### 1. What metrics (e.g., low ratings, limited number of ratings, revenue vs budget) best indicate movie failure?


In [None]:
## Code
from src.utils.metric_analysis import *

metric_analysis("data/cmu_tmdb.csv")

#### 1.1 What we have done for the initial analysis


#### 1.2 Key observations


<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>👥 Cast & Crew Analysis</strong> 
</div>


### 2. How do actor demographics and lack of diversity impact audience disengagement and contribute to box office underperformance?

In [None]:
## code
from src.utils.actor_analysis import *

actor_analysis("data/movie_actors.csv", "data/wikidata_ethnicities.csv")

#### 2.1 What we have done for the initial analysis


#### 2.2 Key observations

### 3. Is thematic consistency in director filmographies a predictor of movie failure?

In [None]:
## code
from src.utils.director_analysis import *

director_analysis("data/movie_directors_actors.csv")

In [None]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Load the dataset
df = pd.read_csv('./data/movie_directors_actors.csv')

# Parse 'genres_x' into lists
df['genres_list'] = df['genres_x'].fillna('').apply(lambda x: x.split(',') if x != '\\N' else [])

# Function to parse 'genres_y' strings
def parse_genres_y(s):
    try:
        if pd.isnull(s) or s == '\\N':
            return []
        # Replace double double-quotes with single double-quotes and remove backslashes
        s = s.replace('""', '"').replace('\\', '')
        genres_dict = json.loads(s)
        return list(genres_dict.values())
    except json.JSONDecodeError:
        return []

# Apply the function to 'genres_y'
df['genres_y_list'] = df['genres_y'].apply(parse_genres_y)

# Combine the two genre lists
df['all_genres'] = df['genres_list'] + df['genres_y_list']

# Convert to numeric types
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')
df['average_rating'] = pd.to_numeric(df['average_rating'], errors='coerce')

# Explode the genres
df_exploded = df.explode('all_genres')

# Remove rows with empty genres
df_exploded = df_exploded[df_exploded['all_genres'].notna() & (df_exploded['all_genres'] != '')]

# Group by director and genre
grouped = df_exploded.groupby(['director_name', 'all_genres'])

# Compute the metrics
result = grouped.agg(
    num_movies=('movie_id', 'nunique'),
    avg_revenue=('revenue', 'mean'),
    avg_rating=('average_rating', 'mean')
).reset_index()

# Number of genres per director
director_genre_counts = result.groupby('director_name').agg(
    num_genres=('all_genres', 'nunique'),
    total_movies=('num_movies', 'sum'),
    overall_avg_revenue=('avg_revenue', 'mean'),
    overall_avg_rating=('avg_rating', 'mean')
).reset_index()

# Correlation analysis
correlation_revenue = director_genre_counts['num_genres'].corr(director_genre_counts['overall_avg_revenue'])
correlation_rating = director_genre_counts['num_genres'].corr(director_genre_counts['overall_avg_rating'])

print(f"Correlation between number of genres and average revenue: {correlation_revenue}")
print(f"Correlation between number of genres and average rating: {correlation_rating}")

# Display the result
print(director_genre_counts)

# Visualization for Revenue
plt.figure(figsize=(10, 6))
plt.scatter(director_genre_counts['num_genres'], director_genre_counts['overall_avg_revenue'])
plt.xlabel('Number of Genres')
plt.ylabel('Overall Average Revenue')
plt.title('Number of Genres vs. Average Revenue')
plt.grid(True)
plt.show()

# Visualization for Rating
plt.figure(figsize=(10, 6))
plt.scatter(director_genre_counts['num_genres'], director_genre_counts['overall_avg_rating'])
plt.xlabel('Number of Genres')
plt.ylabel('Overall Average Rating')
plt.title('Number of Genres vs. Average Rating')
plt.grid(True)
plt.show()

# Regression Analysis for Revenue
# missing='drop' excludes directors whose average revenue or rating is NaN
# (e.g. no revenue reported), so NaN values do not propagate into the fit.
X = director_genre_counts[['num_genres']]
y_revenue = director_genre_counts['overall_avg_revenue']
X_with_const = sm.add_constant(X)
model_revenue = sm.OLS(y_revenue, X_with_const, missing='drop').fit()
print(model_revenue.summary())

# Regression Analysis for Rating
y_rating = director_genre_counts['overall_avg_rating']
model_rating = sm.OLS(y_rating, X_with_const, missing='drop').fit()
print(model_rating.summary())


The first step towards analysing the role that direction plays in a film's failure or success was to gather data on directors from Wikipedia. The first figure shows the profile of the films released by director Terrence Malick over the years. Such a profile was assembled for over 2,000 directors. A first look at the profiles shows that most directors do not have an extensive filmography, i.e. they directed two films or fewer, whilst a few individuals are very prolific. The next steps will include building collaboration and award profiles and identifying clusters of unsuccessful directors.

The second figure shows the number of different genres a director explored in their filmography against the average revenue earned over their career. Most directors focus on a limited number of genres (fewer than 20), while only a fraction make films across 60 or more genres. Directors working across an extreme number of genres tend to perform poorly, whereas the directors with the highest-grossing filmographies directed films spanning 15 to 30 genres. The picture is different when success is measured by the IMDb rating (third figure). In this case, directors whose films fall into a narrow range of genres show wildly different ratings (0-10). As the number of genres increases, the average rating over the filmography converges towards 7/10, which suggests that versatile directors tend to make good, although not great, films.

#### 3.1 What we have done for the initial analysis


#### 3.2 Key observations

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>🎬 Genre & Market Factors</strong> 
</div>

### 4. How does genre choice influence a movie's failure, particularly in different cultural contexts?

In [None]:
## code
from src.utils.genre_analysis import *
from src.utils.visualization_utils import *

# Setup
setup_visualization()
df, df_genres = prepare_data("data/cmu_tmdb.csv")
unique_genres = sorted(df_genres["genres"].unique())
genre_colors = create_genre_colors(unique_genres)

# Basic Analysis
plot_genre_distributions(df_genres, genre_colors)

# Performance Analysis
plot_genre_performance(df_genres, genre_colors)

# Temporal Analysis
analyze_temporal_trends(df_genres, genre_colors, unique_genres)

# ROI Analysis
df, df_genres = analyze_roi(df, df_genres, genre_colors)

# Budget Analysis
budget_stats = analyze_budget_categories(df)

# Success/Failure Rate Analysis
performance_stats = analyze_success_failure_rates(
    df_genres, genre_colors, unique_genres
)

# Summary Statistics
summary_stats = get_summary_statistics(df_genres)

#### 4.1 What we have done for the initial analysis

We employed several analytical approaches to understand genre impact on movie failure. 

- First, we used violin plots with symmetric log scaling to visualize profit distribution across genres, capturing both the central tendency and spread of financial performance. 
- To understand cultural reception, we analyzed the relationship between ratings and popularity (measured by vote count) using scatter plots with logarithmic scaling for vote counts. 
- We tracked genre performance over time using 5-year moving averages to identify long-term trends in audience reception. 
- Finally, we calculated and compared genre-specific success and failure rates to identify which genres carry the highest risk of significant financial loss.
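
The ROI, success/failure rate, and moving-average steps described above can be sketched as follows. This is an illustration on toy data, not the implementation in `src.utils.genre_analysis`: the column names, the ROI definition `(revenue - budget) / budget`, and both thresholds are assumptions made for the example.

```python
import pandas as pd

# Toy data; column names are assumed and may differ from cmu_tmdb.csv.
movies = pd.DataFrame({
    "genre": ["Horror"] * 4 + ["Drama"] * 4,
    "release_year": [2000, 2001, 2002, 2003] * 2,
    "vote_average": [5.0, 6.0, 7.0, 6.0, 6.0, 6.5, 7.0, 7.5],
    "budget": [1.0, 2.0, 1.0, 4.0, 10.0, 10.0, 20.0, 5.0],
    "revenue": [5.0, 1.0, 3.0, 1.0, 12.0, 4.0, 60.0, 5.0],
})

# Assumed ROI definition: net profit relative to budget.
movies["roi"] = (movies["revenue"] - movies["budget"]) / movies["budget"]

# Hypothetical thresholds for labelling a film a success or a failure.
SUCCESS_ROI = 1.0    # profit at least equal to the budget
FAILURE_ROI = -0.5   # lost at least half of the budget
movies["is_success"] = movies["roi"] >= SUCCESS_ROI
movies["is_failure"] = movies["roi"] <= FAILURE_ROI

# Share of successes and failures within each genre.
rates = movies.groupby("genre")[["is_success", "is_failure"]].mean()

# Moving average of the yearly mean rating per genre. The window counts rows,
# so it spans five years only when every year is present for that genre.
yearly = movies.groupby(["genre", "release_year"])["vote_average"].mean()
smoothed = yearly.groupby(level="genre").transform(
    lambda s: s.rolling(window=5, min_periods=1).mean()
)

print(rates)
print(smoothed)
```

Keeping the thresholds as named constants makes it easy to test how sensitive the genre ranking is to the definition of failure.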


#### 4.2 Key observations

##### Financial Performance by Genre

1. High-Profit Potential:

- Action/Adventure/Fantasy lead in extreme profits (mean profits: 13-25M USD)
- Documentary/TV Movies show lowest profits but highest ROI (9.03 and 7.13)
- Horror shows strong ROI (5.69) with moderate investment
- Drama (most common genre, 22,560 movies) shows modest profits (3.47M USD)

2. Budget Impact:

- Very low budget films: Highest ROI potential but highest volatility
- High budget films: More consistent but lower returns
- Clear inverse relationship between budget size and ROI potential

##### Audience Reception

1. Ratings:

- Animation leads with highest average rating (6.01)
- Family/War films follow with strong ratings
- Western shows lowest ratings
- Most genres maintain 5-7 rating range
- Documentary shows most consistent ratings

2. Popularity Patterns:

- Popular movies (high vote counts) cluster around 6-7 ratings
- Less popular movies show wider rating variation
- Profitable movies typically have high vote counts
- Genre impact on popularity is minimal

##### Historical Trends

- Ratings stabilized post-1960
- High volatility in early years (pre-1940)
- Modern convergence around 5-6 rating range
- Genre distinctions decreased over time

##### Risk Assessment

1. Success Rates:

- Adventure/Science Fiction/Fantasy: Highest success rates
- Documentary/TV Movies: Extreme success/failure patterns
- Horror: Good success rate with moderate risk

2. Failure Rates:

- Thriller/Science Fiction/Mystery: Highest failure rates
- Action/Adventure: More moderate failure rates despite high budgets
- Documentary: High risk but high potential return

##### Key Takeaway

Genre significantly impacts financial performance and risk levels. While Action/Adventure/Fantasy lead in absolute profits, smaller genres like Documentary and Horror show strong ROI potential. Ratings remain relatively consistent across genres, with Animation and Family films maintaining slight advantages. Budget size shows stronger correlation with returns than genre choice.


### 5. How does poor release timing (e.g., season, holiday periods) affect a movie's likelihood of failing?

In [None]:
## code
from src.utils.timing_analysis import *

# Seasonal Analysis
seasonal_stats = plot_seasonal_distributions(df)

# Monthly Analysis
analyze_monthly_performance(df)

# Monthly ROI Analysis
monthly_perf_df = analyze_monthly_roi(df)

# Monthly Success Rate Analysis
plot_monthly_success_rates(monthly_perf_df)

# Monthly Statistics
monthly_stats = analyze_monthly_statistics(df, monthly_perf_df)

# Yearly Analysis
yearly_performance = analyze_temporal_trends(df)

#### 5.1 What we have done for the initial analysis

- To investigate how release timing affects movie failure, we analyzed the distribution of profits and ratings across different temporal categories using violin plots. 
- We compared failure rates across seasons and months to identify particularly risky release periods. 
- To account for industry evolution, we examined the temporal trends of success and failure rates alongside movie release volume using a dual-axis visualization combining line graphs for rates and bar charts for release counts. 

This allowed us to identify historical patterns in optimal release timing while controlling for changes in industry output volume.
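
The seasonal comparison above can be sketched as a mapping from release month to season followed by a per-season failure rate. The season boundaries used here (meteorological, Northern Hemisphere) and both function names are assumptions for illustration; the actual grouping in `src.utils.timing_analysis` may differ.

```python
from collections import defaultdict


def month_to_season(month):
    """Map a month number (1-12) to a season (assumed meteorological seasons)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"


def failure_rate_by_season(records):
    """Compute the share of failed films per season.

    records is an iterable of (release_month, is_failure) pairs.
    Seasons with no films are omitted from the result.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for month, is_failure in records:
        season = month_to_season(month)
        totals[season] += 1
        failures[season] += int(is_failure)
    return {season: failures[season] / totals[season] for season in totals}
```

For example, `failure_rate_by_season([(1, True), (2, False), (7, False)])` reports a Winter rate of 0.5 and a Summer rate of 0.0.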


#### 5.2 Key observations

##### Seasonal Patterns

1. Profitability and ROI:

   - Fall shows best overall performance (high ROI: 3.46, good profit distribution)
   - Spring has highest mean profit (7.23M USD) and median ROI (1.16)
   - Winter consistently underperforms (lowest profit: 2.56M USD, lowest ROI: 2.77)
   - Summer shows moderate, stable performance

2. Ratings and Volume:

   - Ratings remain relatively consistent across seasons (range: 4.72-5.19)
   - Winter/Fall have highest release volumes (~13,000 movies each)
   - Spring/Summer have fewer releases (~11,500 movies each)

##### Monthly Patterns

1. Strong Months:

   - June/July: Highest success rates (~8%), good ROI potential
   - December: Strong performance (high success rate, good profit potential)
   - Summer months generally show better profit concentration

2. Weak Months:

   - January: Lowest success rate (~2%), volatile ROI
   - August/September: Highest failure rates (~3%)
   - Early fall months show increased risk

##### Historical Trends

- Movie volume increased significantly since 1980s
- Success/failure rates remained relatively stable until recent years
- Post-2000 shows increased volatility
- Possible data anomaly showing success spike near 2020

##### Key Takeaway

Best release windows appear to be summer months (June/July) and December, while January and early fall carry higher risks. Fall and Spring show strongest overall financial metrics, but Winter consistently underperforms across all measures.


<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>📖 Narrative & Thematic Elements</strong> 
</div>

A trope is a commonly recurring storytelling device in creative works. In film, tropes can be anything from narrative patterns (like the "last-minute rescue") to character archetypes (like the "mad scientist"), or even specific plot devices (like "time-traveling mishaps"). While tropes aren't inherently good or bad, their execution and context largely determine their effectiveness in storytelling.

This analysis examines tropes that appear disproportionately often in poorly rated films. Our first step is to identify the most common tropes in low-rated films; we then explore each genre individually and finally compare the results across genres.

In [None]:
from src.utils.trope_analysis import *

### 6. What recurring plot patterns appear most frequently in critically panned films?

In [None]:
## code
df_movie_tropes = pd.read_csv('data/movie_tropes.csv')

rq6(
    df_cmu_tmdb,
    df_movie_tropes,
    k=20,
    vote_threshold=6.0,
)

Restricting the data to movies with a rating of 6.0 or lower gives a mean vote average of 5.44 and a median of 5.6. We then extracted the 20 most common tropes in these low-rated movies; the top five are "ShotOut", "HorrorFilms", "FilmsOfThe1980s", "BigBad", and "LargeHam", each appearing in over 80 films. Next, we analyze how these tropes are distributed across genres.

### 7. Which tropes consistently lead to negative reception by genre?

In [None]:
## code
rq7(
    df_cmu_tropes=df_cmu_tropes,
    genres=['Horror', 'Adventure', 'Comedy'],
    vote_threshold=6.0
)

# for movie_genre in ['Horror', 'Adventure', 'Comedy']:
#     vote_threshold = 6

#     df_genre_tropes = df_cmu_tropes[df_cmu_tropes['genres_y'].str.contains(movie_genre)]
#     df_genre_tropes = df_genre_tropes[df_genre_tropes['vote_count'] > 100]

#     print(f'There are {len(df_genre_tropes.id.unique())} {movie_genre.lower()} films in the dataset')

#     df_low_rated_tropes = df_genre_tropes[df_genre_tropes['vote_average'] <= vote_threshold]
#     df_high_rated_tropes = df_genre_tropes[df_genre_tropes['vote_average'] > vote_threshold]

#     print(f'Films with low ratings: {len(df_low_rated_tropes.id.unique())}')
#     print(f'Films with high ratings: {len(df_high_rated_tropes.id.unique())}')

#     low_rated_tropes = df_low_rated_tropes.trope.value_counts()
#     high_rated_tropes = df_high_rated_tropes.trope.value_counts()

#     low_rated_dict = low_rated_tropes.to_dict()
#     high_rated_dict = high_rated_tropes.to_dict()

#     trope_ratios = {}
#     for trope in low_rated_dict:
#         low_count = low_rated_dict[trope]
#         high_count = high_rated_dict.get(trope, 0)
        
#         if low_count >= 5:
#             ratio = low_count / (high_count + 1)
#             trope_ratios[trope] = ratio

#     sorted_tropes = sorted(trope_ratios.items(), key=lambda x: x[1], reverse=True)

#     plt.figure(figsize=(10, 6))
#     sns.barplot(x=[x[1] for x in sorted_tropes[:10]], y=[x[0] for x in sorted_tropes[:10]], palette='viridis')
#     plt.title(f'Top 10 tropes more common in low-rated {movie_genre.lower()} films')
#     plt.xlabel('Ratio low:high rated')
#     plt.ylabel('Tropes')
#     plt.show()

Each plot shows the top 10 narrative tropes that appear disproportionately more often in low-rated films than in high-rated ones. The y-axis lists the tropes, while the x-axis shows the ratio of appearances in low-rated vs. high-rated films. For these initial results we analyze only three genres: horror, adventure, and comedy.
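
The ratio on the x-axis can be sketched as below. The sketch mirrors the commented-out draft in the cell above (a minimum of 5 low-rated occurrences and a `+ 1` in the denominator to avoid division by zero); whether `rq7` computes it in exactly this way is an assumption, and the function name `trope_ratios` is ours.

```python
from collections import Counter


def trope_ratios(low_rated_tropes, high_rated_tropes, min_low_count=5):
    """Rank tropes by how much more often they occur in low-rated films.

    Each argument is a flat list with one trope name per (film, trope) pair.
    Tropes seen fewer than min_low_count times in low-rated films are skipped.
    """
    low = Counter(low_rated_tropes)
    high = Counter(high_rated_tropes)
    ratios = {
        trope: count / (high[trope] + 1)  # +1 avoids division by zero
        for trope, count in low.items()
        if count >= min_low_count
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Toy example with placeholder trope names.
low = ["A"] * 6 + ["B"] * 5 + ["C"] * 2
high = ["A"] * 2 + ["B"] * 4
print(trope_ratios(low, high))
```

Because of the `+ 1` smoothing, a trope that never appears in high-rated films gets a ratio equal to its low-rated count, so rare tropes just above the minimum count can still rank highly.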

Key observations for horror films:
- "BadSanta" and "JackassGenie" tropes have the highest ratios (approximately 9-10x), suggesting these supernatural antagonist concepts rarely work well in horror films
- Mid-tier ratios (6-7x) include concepts like "AttackOfTheTownFestival" and "SelfPlagiarism", indicating that festival-horror settings and derivative storytelling tend to correlate with lower ratings

Key observations for adventure films:
- "NotScreenedForCritics" tops the list (about 7.5x ratio), appearing in both comedy and adventure genres' low-rated films, suggesting it's a reliable indicator of lower quality across genres
- Superhero-related tropes ("Superman" and "TeenSuperspy" at ~6x and 5x respectively) indicate that certain superhero elements may be challenging to execute well in adventure films

Key observations for comedy films:
- "ContinuityReboot" has the highest ratio (around 9.5x), suggesting that comedy reboots of existing properties tend to be challenging to execute well
- Meta-industry tropes like "NotScreenedForCritics" and "SlasherFilm" (both ~5.5x) hint that comedies avoiding critical review or parodying horror often receive poor ratings
- Plot devices like "FakinMacGuffin" and "LastRequest" (both ~8x) appear to be overused in lower quality comedies
- Character-based tropes such as "TeenSuperspy" and "LiteralSplitPersonality" (both ~5x) show that certain character archetypes may be harder to execute successfully

**Next steps:** We will expand the previous analysis to include all genres, comparing the results to identify differences and similarities in how tropes influence reception across genres. Additionally, we plan to investigate trope combinations in movies that failed by examining tuples of up to three tropes, such as ("Love Triangle", "Secret Identity", "Big Damn Kiss"), to determine whether certain combinations are more likely to coincide with a movie's failure.
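
One possible way to count trope combinations is sketched below. It assumes each failed movie is represented by a list of trope names and counts every pair and triple that co-occurs within a movie; the function name, the input layout, and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_trope_combinations(movie_tropes, max_size=3):
    """Count co-occurring trope combinations of size 2..max_size.

    movie_tropes maps a movie id to the list of tropes found in that movie.
    Tropes are de-duplicated and sorted so each combination has one canonical form.
    """
    counts = Counter()
    for tropes in movie_tropes.values():
        unique = sorted(set(tropes))
        for size in range(2, max_size + 1):
            counts.update(combinations(unique, size))
    return counts


# Toy example with three hypothetical failed movies.
failed_movies = {
    "m1": ["LoveTriangle", "SecretIdentity", "BigDamnKiss"],
    "m2": ["SecretIdentity", "LoveTriangle"],
    "m3": ["BigDamnKiss"],
}
combo_counts = count_trope_combinations(failed_movies)
print(combo_counts.most_common(3))
```

The number of triples grows cubically with the number of tropes per movie, so on the full dataset it may be necessary to restrict the count to tropes above a minimum frequency first.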