# Milestone 2: Project proposal and initial analysis

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>Project:</strong> Decoding Box-Office Bombs 💣
    <br>
    <strong>Team:</strong> ADAdventurers2024
</div>

To reproduce our dataset, download the data listed in the [README](https://github.com/epfl-ada/ada-2024-project-adaventurers2024/blob/main/README.md) file. Then navigate to the `scripts` folder and run the following script:

```cmd
python preprocess_data.py
```

This script generates five files in the `data` folder:

- `cmu_tmdb.csv`: A merged dataset from CMU and TMDB, containing movie information such as revenue, budget, and other details.
- `movie_tropes.csv`: Tropes associated with each movie in the IMDb dataset, which serves as an intermediary file for merging tropes with the CMU dataset.
- `cmu_tropes.csv`: Tropes associated with each movie in the CMU dataset.
- `movie_actors.csv`: Actors linked to each movie in the CMU dataset.
- `movie_directors_actors.csv`: Directors and actors linked to each movie in the IMDb dataset.

You can now proceed with exploratory data analysis and initial assessments.
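
Before loading anything, it can help to confirm that preprocessing produced all five files. The snippet below is a minimal sketch: the file names come from the list above, while the helper `find_missing_files` and the relative `data` path are assumptions made for illustration.

```python
from pathlib import Path

# File names are taken from the list above.
EXPECTED_FILES = [
    "cmu_tmdb.csv",
    "movie_tropes.csv",
    "cmu_tropes.csv",
    "movie_actors.csv",
    "movie_directors_actors.csv",
]


def find_missing_files(data_dir="data", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_path = Path(data_dir)
    return [name for name in expected if not (data_path / name).is_file()]


# Assumes the notebook runs from the folder that contains `data`.
missing = find_missing_files("data")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All five preprocessed files are present.")
```

If any file is reported missing, rerunning `preprocess_data.py` is the first thing to try.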

------

## Exploratory data analysis

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
DATA_PATH = 'data'

#### CMU revenue and other metrics

In [None]:
df_cmu_tmdb = pd.read_csv(f'{DATA_PATH}/cmu_tmdb.csv')
df_cmu_tmdb.head()

In [None]:
df_cmu_tmdb.info()

#### CMU cast and crew

In [None]:
df_movie_actors = pd.read_csv(f'{DATA_PATH}/movie_actors.csv')
df_movie_actors.head()

In [None]:
df_movie_actors.info()

#### CMU tropes

In [None]:
df_cmu_tropes = pd.read_csv(f"{DATA_PATH}/cmu_tropes.csv")
df_cmu_tropes.head()

In [None]:
df_cmu_tropes.info()

--------

## Research questions

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>📊 Metrics & Performance</strong>
</div>


1. What metrics (e.g., low ratings, limited number of ratings, revenue vs budget) best indicate movie failure?
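
One way to compare the candidate metrics is to turn each into a boolean flag and count how many flags a movie raises. The sketch below uses toy rows, not project data; the column names match `cmu_tmdb.csv` as used in this notebook, but the function `flag_failures` and its thresholds are hypothetical choices for illustration.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the project data.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [100e6, 20e6, 5e6, 0.0],
    "revenue": [40e6, 60e6, 1e6, 2e6],
    "vote_average": [4.8, 7.1, 5.2, 6.0],
    "vote_count": [900, 15000, 12, 40],
})


def flag_failures(df, roi_max=-0.5, rating_max=5.0, votes_min=50):
    """Add one boolean column per candidate failure indicator.

    The thresholds are hypothetical defaults, not values from the analysis.
    """
    out = df.copy()
    # ROI is undefined without a positive budget, so leave it as NaN there.
    budget = out["budget"].where(out["budget"] > 0)
    out["roi"] = (out["revenue"] - budget) / budget
    out["fail_financial"] = out["roi"] < roi_max  # NaN compares as False
    out["fail_rating"] = out["vote_average"] < rating_max
    out["fail_attention"] = out["vote_count"] < votes_min
    flags = ["fail_financial", "fail_rating", "fail_attention"]
    out["n_failure_flags"] = out[flags].sum(axis=1)
    return out


flagged = flag_failures(toy)
print(flagged[["title", "roi", "n_failure_flags"]])
```

Keeping the flags separate makes it possible to check how often they agree before settling on a single definition of failure.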


<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>👥 Cast & Crew Analysis</strong> 
</div>


2. How do actor demographics and a lack of cast diversity affect audience engagement and contribute to box-office underperformance?

3. What role do director-actor collaborations play in a movie's failure, and are there specific patterns in these partnerships that correlate with unsuccessful films?

4. Is thematic consistency in director filmographies a predictor of failure/success?
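
Thematic consistency needs a numeric definition before it can be tested as a predictor. One possible proxy, shown here as an assumption rather than the project's chosen method, is the Shannon entropy of the genres in a director's filmography. The director names and genre lists below are made up.

```python
import math
from collections import Counter


def genre_entropy(genres):
    """Shannon entropy (in bits) of a list of genre labels.

    0.0 means every film shares one genre; higher values mean a more
    varied filmography. Returns NaN for an empty list.
    """
    counts = Counter(genres)
    total = sum(counts.values())
    if total == 0:
        return float("nan")
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical filmographies, for illustration only.
filmographies = {
    "Director X": ["Horror", "Horror", "Horror", "Horror"],
    "Director Y": ["Drama", "Comedy", "Action", "Horror"],
}
for director, genres in filmographies.items():
    print(f"{director}: {genre_entropy(genres):.2f} bits")
```

Entropy depends on the number of films, so directors with very short filmographies would probably need to be filtered out or handled separately.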

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>🎬 Genre & Market Factors</strong> 
</div>

5. How does genre choice influence a movie's failure, particularly in different cultural contexts?

In [None]:
#--------------------------------
# 1. Data Preparation and Setup
#--------------------------------

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read and prepare data (DATA_PATH is defined in the setup cell above)
df = pd.read_csv(f'{DATA_PATH}/cmu_tmdb.csv')

# Create financial metrics
df['profit'] = df['revenue'] - df['budget']
df['roi'] = (df['revenue'] - df['budget']) / df['budget']
df['profit_scaled'] = df['profit'] / 1e6  # Scale profit to millions of dollars

# Process temporal data
df['release_date'] = pd.to_datetime(df['release_date'])
df['release_month'] = df['release_date'].dt.month
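# Seasons are approximated by calendar quarter:
# Jan-Mar = Winter, Apr-Jun = Spring, Jul-Sep = Summer, Oct-Dec = Fall
df['release_season'] = pd.cut(df['release_date'].dt.month, 
                             bins=[0,3,6,9,12], 
                             labels=['Winter','Spring','Summer','Fall'])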

# Process genres
df['genres'] = df['genres'].fillna('')
df['genres'] = df['genres'].str.split(', ')
df_genres = df.explode('genres')
df_genres = df_genres[df_genres['genres'] != '']

# Set up visualization parameters
unique_genres = sorted(df_genres['genres'].unique())
genre_colors = dict(zip(unique_genres, sns.color_palette("husl", len(unique_genres))))
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = [12, 6]
plt.rcParams['figure.dpi'] = 100
plt.rcParams['figure.autolayout'] = True

#--------------------------------
# 2. Basic Genre Analysis
#--------------------------------

# Financial Performance by Genre
plt.figure(figsize=(12, 6))
sns.violinplot(data=df_genres, x='genres', y='profit_scaled', 
               hue='genres', palette=genre_colors, legend=False)
plt.yscale('symlog', linthresh=1)
plt.xticks(rotation=45, ha='right')
plt.title('Movie Profits Distribution by Genre', pad=20)
plt.ylabel('Profit (Millions USD, Log Scale)')
plt.xlabel('Genre')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Rating Distribution by Genre
plt.figure(figsize=(12, 9))
ax = sns.violinplot(data=df_genres, x='genres', y='vote_average', 
                    hue='genres', palette=genre_colors, legend=False)
plt.xticks(rotation=45, ha='right')
plt.title('Movie Rating Distribution by Genre', pad=20)
plt.ylabel('Average Rating (0-10)')
plt.xlabel('Genre')
plt.grid(True, alpha=0.3)

ax.axhline(y=5, color='gray', linestyle='--', alpha=0.3)
ax.axhline(y=6, color='gray', linestyle='--', alpha=0.3)
ax.axhline(y=7, color='gray', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()

#--------------------------------
# 3. Genre Performance Analysis
#--------------------------------

# Genre-based Rating vs. Popularity
plt.figure(figsize=(12, 6))
# Fixed seed so the sampled scatter plot is reproducible
sns.scatterplot(data=df_genres.sample(n=min(10000, len(df_genres)), random_state=42), 
                x='vote_count', y='vote_average', 
                hue='genres', alpha=0.6,
                palette=genre_colors)
plt.xscale('log')
plt.title('Rating vs. Popularity by Genre', pad=20)
plt.xlabel('Number of Votes (Log Scale)')
plt.ylabel('Average Rating (0-10)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title='Genre')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Profit-based Rating vs. Popularity
plt.figure(figsize=(12, 6))
sample_size = min(2000, len(df_genres))
df_genres_sample = df_genres.sample(n=sample_size, random_state=42)
scatter = plt.scatter(df_genres_sample['vote_count'], 
                     df_genres_sample['vote_average'],
                     c=df_genres_sample['profit_scaled'],
                     s=30, alpha=0.6, cmap='RdYlBu')
plt.xscale('log')
plt.title('Rating vs. Popularity by Profit', pad=20)
plt.xlabel('Number of Votes (Log Scale)')
plt.ylabel('Average Rating (0-10)')
plt.grid(True, alpha=0.3)
plt.colorbar(scatter, label='Profit (Millions USD)')
plt.tight_layout()
plt.show()

#--------------------------------
# 4. Temporal Genre Analysis
#--------------------------------

# Genre Rating Trends Over Time
plt.figure(figsize=(12, 6))
for genre in unique_genres:
    genre_data = df_genres[df_genres['genres'] == genre]
    if len(genre_data) > 0:
        yearly_avg = genre_data.groupby('release_year')['vote_average'].mean()
        moving_avg = yearly_avg.rolling(window=5, min_periods=1).mean()
        plt.plot(moving_avg.index, moving_avg.values, 
                label=genre, color=genre_colors[genre], 
                alpha=0.7, linewidth=2)

plt.title('Average Movie Ratings by Genre Over Time (5-Year Moving Average)', pad=20)
plt.xlabel('Release Year')
plt.ylabel('Average Rating (0-10)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title='Genre')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

#--------------------------------
# 5. ROI Analysis by Genre
#--------------------------------

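# Clean ROI data
def clean_roi(x):
    """Return NaN for unusable ROI values and cap extreme ones.

    Infinite or NaN ROI (e.g. zero or missing budget) and ROI below -0.99
    become NaN; ROI above 50 (5000%) is capped at 50.
    """
    if np.isinf(x) or np.isnan(x) or x < -0.99:
        return np.nan
    if x > 50:  # Cap at 5000%
        return 50
    return x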

df['roi_clean'] = df['roi'].apply(clean_roi)
df_genres['roi_clean'] = df_genres['roi'].apply(clean_roi)

print("\nOverall ROI Statistics (cleaned):")
print(df['roi_clean'].describe())

# ROI Distribution Visualization
plt.figure(figsize=(12, 6))
sns.violinplot(data=df_genres[df_genres['roi_clean'].notna()], 
               x='genres', y='roi_clean', 
               hue='genres', palette=genre_colors, legend=False)
plt.yscale('symlog', linthresh=0.1)
plt.xticks(rotation=45, ha='right')
plt.title('Movie ROI Distribution by Genre (capped at 5000%)', pad=20)
plt.ylabel('Return on Investment (ROI)')
plt.xlabel('Genre')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

#--------------------------------
# 6. Budget Category Analysis
#--------------------------------

# Create and analyze budget categories
df_with_budget = df[df['budget'] > 0].copy()
df_with_budget['budget_category'] = pd.qcut(df_with_budget['budget'], 
                                          q=5, 
                                          labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

print("\nROI Statistics by Budget Category (excluding zero budgets):")
budget_roi = df_with_budget.groupby('budget_category', observed=True).agg({
    'roi_clean': ['count', 'mean', 'median'],
    'budget': ['mean', 'min', 'max']
}).round(2)
print(budget_roi)

# Visualize ROI by budget category
plt.figure(figsize=(12, 6))
sns.violinplot(data=df_with_budget, 
               x='budget_category', y='roi_clean',
               hue='budget_category', palette='viridis', legend=False)
plt.yscale('symlog', linthresh=0.1)
plt.title('ROI Distribution by Budget Category', pad=20)
plt.ylabel('Return on Investment (ROI)')
plt.xlabel('Budget Category')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

#--------------------------------
# 7. Success/Failure Analysis
#--------------------------------

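# Define performance thresholds (ROI as a fraction of budget).
# 'Loss' entries are evaluated as ROI below the threshold, all others as ROI
# above it, so the categories are cumulative rather than mutually exclusive.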
performance_thresholds = {
    'Total Loss (>90%)': -0.9,
    'Severe Loss (>70%)': -0.7,
    'Significant Loss (>50%)': -0.5,
    'Moderate Loss (>30%)': -0.3,
    'Minor Loss (>0%)': 0,
    'Break Even': 0,
    'Modest Success (>50%)': 0.5,
    'Successful (>100%)': 1.0,
    'Very Successful (>200%)': 2.0,
    'Blockbuster (>500%)': 5.0
}

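# Calculate performance rates among movies with a valid (non-NaN) ROI
performance_rates = {}
for genre in unique_genres:
    # Drop NaN ROI values; otherwise they would count in the denominator
    # as movies that neither exceed nor fall below any threshold
    genre_data = df_genres[df_genres['genres'] == genre]['roi_clean'].dropna()
    rates = {}
    for threshold_name, threshold in performance_thresholds.items():
        if 'Loss' in threshold_name:
            rate = (genre_data < threshold).mean()
        else:
            rate = (genre_data > threshold).mean()
        rates[threshold_name] = rate
    performance_rates[genre] = rates

# Convert to percentages, rounded to one decimal place
performance_df = (pd.DataFrame(performance_rates).T * 100).round(1)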
print("\nPerformance Rates by Genre:")
print(performance_df)

# Visualize success and failure rates
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6))

# Success rates
success_series = performance_df['Successful (>100%)'].sort_values(ascending=False)
colors_success = [genre_colors[genre] for genre in success_series.index]
success_series.plot(kind='bar', color=colors_success, ax=ax1)
ax1.set_title('Success Rate: Movies Achieving >100% ROI', pad=20)
ax1.set_xlabel('Genre')
ax1.set_ylabel('Success Rate (%)')
plt.setp(ax1.get_xticklabels(), rotation=45, ha='right')
ax1.grid(True, alpha=0.3)

# Failure rates
failure_series = performance_df['Significant Loss (>50%)'].sort_values(ascending=False)
colors_failure = [genre_colors[genre] for genre in failure_series.index]
failure_series.plot(kind='bar', color=colors_failure, ax=ax2)
ax2.set_title('Failure Rate: Movies with >50% Loss', pad=20)
ax2.set_xlabel('Genre')
ax2.set_ylabel('Failure Rate (%)')
plt.setp(ax2.get_xticklabels(), rotation=45, ha='right')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

#--------------------------------
# 8. Summary Statistics
#--------------------------------

print("\nSummary Statistics by Genre:")
print(df_genres.groupby('genres').agg({
    'profit_scaled': ['mean', 'median'],
    'vote_average': ['mean', 'count'],
    'roi_clean': ['mean', 'median']
}).round(2))

6. How does poor release timing (e.g., season, holiday periods) affect a movie's likelihood of failing?

In [None]:
#--------------------------------
# 1. Seasonal Analysis
#--------------------------------

# Profit and Rating Distribution by Season
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

# Profit by Season
sns.violinplot(data=df, x='release_season', y='profit_scaled', 
               hue='release_season', ax=ax1, palette='viridis', legend=False)
ax1.set_yscale('symlog', linthresh=1)
ax1.set_title('Movie Profits Distribution by Season', pad=20)
ax1.set_ylabel('Profit (Millions USD, Log Scale)')
ax1.tick_params(axis='x', rotation=45)
ax1.grid(True, alpha=0.3)

# Ratings by Season
sns.violinplot(data=df, x='release_season', y='vote_average', 
               hue='release_season', ax=ax2, palette='viridis', legend=False)
ax2.set_title('Movie Rating Distribution by Season', pad=20)
ax2.set_ylabel('Average Rating (0-10)')
ax2.tick_params(axis='x', rotation=45)
ax2.grid(True, alpha=0.3)
ax2.axhline(y=5, color='gray', linestyle='--', alpha=0.3)
ax2.axhline(y=6, color='gray', linestyle='--', alpha=0.3)
ax2.axhline(y=7, color='gray', linestyle='--', alpha=0.3)

plt.tight_layout()
plt.show()

# Print seasonal statistics
print("\nSummary Statistics by Season:")
print(df.groupby('release_season', observed=True).agg({
    'profit_scaled': ['mean', 'median'],
    'vote_average': ['mean', 'count'],
    'roi_clean': ['mean', 'median']
}).round(2))

#--------------------------------
# 2. Monthly Analysis
#--------------------------------

# Profit and Rating Distribution by Month
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))

# Monthly Profit Distribution
sns.violinplot(data=df, x='release_month', y='profit_scaled', 
               hue='release_month', ax=ax1, palette='viridis', legend=False)
ax1.set_yscale('symlog', linthresh=1)
ax1.set_title('Movie Profits Distribution by Release Month', pad=20)
ax1.set_xlabel('Month')
ax1.set_ylabel('Profit (Millions USD, Log Scale)')
ax1.grid(True, alpha=0.3)

# Monthly Rating Distribution
sns.violinplot(data=df, x='release_month', y='vote_average', 
               hue='release_month', ax=ax2, palette='viridis', legend=False)
ax2.set_title('Movie Rating Distribution by Release Month', pad=20)
ax2.set_xlabel('Month')
ax2.set_ylabel('Average Rating (0-10)')
ax2.grid(True, alpha=0.3)
ax2.axhline(y=5, color='gray', linestyle='--', alpha=0.3)
ax2.axhline(y=6, color='gray', linestyle='--', alpha=0.3)
ax2.axhline(y=7, color='gray', linestyle='--', alpha=0.3)

plt.tight_layout()
plt.show()

# Monthly ROI Analysis
plt.figure(figsize=(14, 6))
sns.violinplot(data=df[df['roi_clean'].notna()], 
               x='release_month', y='roi_clean', 
               hue='release_month', palette='viridis', legend=False)
plt.title('ROI Distribution by Release Month', pad=20)
plt.ylabel('Return on Investment (ROI)')
plt.xlabel('Month')
plt.yscale('symlog', linthresh=0.1)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate monthly performance rates among movies with a valid (non-NaN) ROI
monthly_performance = {}
for month in range(1, 13):
    # Drop NaN ROI values so they are excluded from the denominator
    month_data = df[df['release_month'] == month]['roi_clean'].dropna()
    rates = {
        'Severe Loss (>70%)': (month_data < -0.7).mean(),
        'Significant Loss (>50%)': (month_data < -0.5).mean(),
        'Break Even': (month_data > 0).mean(),
        'Successful (>100%)': (month_data > 1).mean(),
        'Very Successful (>200%)': (month_data > 2).mean()
    }
    monthly_performance[month] = rates

# Convert to percentages, rounded to one decimal place
monthly_perf_df = (pd.DataFrame(monthly_performance).T * 100).round(1)

# Plot monthly success and failure rates
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Success rates by month
monthly_perf_df['Successful (>100%)'].plot(kind='bar', ax=ax1, 
                                         color=sns.color_palette('viridis', 12))
ax1.set_title('Success Rate by Month (>100% ROI)', pad=20)
ax1.set_xlabel('Month')
ax1.set_ylabel('Success Rate (%)')
plt.setp(ax1.get_xticklabels(), rotation=45, ha='right')
ax1.grid(True, alpha=0.3)

# Failure rates by month
monthly_perf_df['Significant Loss (>50%)'].plot(kind='bar', ax=ax2, 
                                              color=sns.color_palette('viridis', 12))
ax2.set_title('Failure Rate by Month (>50% Loss)', pad=20)
ax2.set_xlabel('Month')
ax2.set_ylabel('Failure Rate (%)')
plt.setp(ax2.get_xticklabels(), rotation=45, ha='right')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Monthly statistics and analysis
month_names = {
    1: 'January', 2: 'February', 3: 'March', 4: 'April',
    5: 'May', 6: 'June', 7: 'July', 8: 'August',
    9: 'September', 10: 'October', 11: 'November', 12: 'December'
}

# Calculate detailed monthly ROI statistics
monthly_roi_stats = df.groupby('release_month')['roi_clean'].agg([
    'count', 'mean', 'median',
    lambda x: x.quantile(0.25),
    lambda x: x.quantile(0.75),
    lambda x: x.std()
]).round(3)
monthly_roi_stats.columns = ['Count', 'Mean ROI', 'Median ROI', 
                           '25th Percentile', '75th Percentile', 'Std Dev']
monthly_roi_stats.index = [month_names[m] for m in monthly_roi_stats.index]

print("\nDetailed Monthly ROI Statistics:")
print(monthly_roi_stats)

# Best and worst months analysis
best_months = monthly_perf_df['Successful (>100%)'].nlargest(3)
worst_months = monthly_perf_df['Significant Loss (>50%)'].nlargest(3)

print("\nBest Months for Movie Release (by Success Rate):")
for month, rate in best_months.items():
    print(f"{month_names[month]}: {rate:.1f}% success rate")

print("\nRiskiest Months for Movie Release (by Failure Rate):")
for month, rate in worst_months.items():
    print(f"{month_names[month]}: {rate:.1f}% failure rate")

# Risk-adjusted performance
monthly_perf_df['Risk-Adjusted Performance'] = (
    monthly_perf_df['Successful (>100%)'] / 
    monthly_perf_df['Significant Loss (>50%)']
).round(2)

print("\nRisk-Adjusted Performance by Month (Success Rate / Failure Rate):")
risk_adjusted = monthly_perf_df['Risk-Adjusted Performance'].sort_values(ascending=False)
for month, ratio in risk_adjusted.items():
    print(f"{month_names[month]}: {ratio:.2f}")

#--------------------------------
# 3. Temporal Analysis
#--------------------------------

# Success, Failure Rates and Volume Over Time
df['release_year'] = pd.to_numeric(df['release_year'])
# Rates are computed over movies with a valid ROI, matching movie_count
yearly_performance = df.groupby('release_year').agg(
    success_rate=('roi_clean', lambda x: (x.dropna() > 1).mean()),
    failure_rate=('roi_clean', lambda x: (x.dropna() < -0.5).mean()),
    movie_count=('roi_clean', 'count')
)

fig, ax1 = plt.subplots(figsize=(12, 6))

# Plot success and failure rates
line1 = ax1.plot(yearly_performance.index, yearly_performance['success_rate'] * 100, 
                 color='forestgreen', linewidth=2, label='Success Rate (>100% ROI)')
line2 = ax1.plot(yearly_performance.index, yearly_performance['failure_rate'] * 100, 
                 color='crimson', linewidth=2, label='Failure Rate (>50% Loss)')
ax1.set_xlabel('Release Year')
ax1.set_ylabel('Rate (%)', color='black')
ax1.tick_params(axis='y', labelcolor='black')
ax1.grid(True, alpha=0.3)

# Add movie count on secondary axis
ax2 = ax1.twinx()
bars = ax2.bar(yearly_performance.index, yearly_performance['movie_count'], 
               alpha=0.2, color='royalblue', label='Number of Movies')
ax2.set_ylabel('Number of Movies', color='royalblue')
ax2.tick_params(axis='y', labelcolor='royalblue')

# Combine legends
lines = line1 + line2
labels = [l.get_label() for l in lines]
lines.append(bars)
labels.append('Number of Movies')
ax1.legend(lines, labels, loc='upper left')

plt.title('Movie Success, Failure Rates and Volume Over Time', pad=20)
plt.tight_layout()
plt.show()

# Print temporal statistics
print("\nPerformance Trends Summary:")
print("\nEarly Years (Before 1950):")
early = yearly_performance[yearly_performance.index < 1950].mean()
print(f"Average Success Rate: {early['success_rate']*100:.1f}%")
print(f"Average Failure Rate: {early['failure_rate']*100:.1f}%")
print(f"Average Annual Movies: {early['movie_count']:.0f}")

print("\nRecent Years (2000 Onward):")
recent = yearly_performance[yearly_performance.index >= 2000].mean()
print(f"Average Success Rate: {recent['success_rate']*100:.1f}%")
print(f"Average Failure Rate: {recent['failure_rate']*100:.1f}%")
print(f"Average Annual Movies: {recent['movie_count']:.0f}")

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>📖 Narrative & Thematic Elements</strong> 
</div>


7. Within each genre, which trope combinations are consistently associated with negative reception?

8. What recurring plot patterns appear most frequently in critically panned films?
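
A first pass at the trope questions could count how often pairs of tropes co-occur in poorly received movies. The snippet below is a sketch on made-up data. The movie IDs, trope names, and the helper `count_trope_pairs` are hypothetical, and the real mapping would have to be built from `cmu_tropes.csv`, whose columns are not shown here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical movie -> tropes mapping, for illustration only.
movie_tropes = {
    "m1": ["Chosen One", "Love Triangle", "Deus Ex Machina"],
    "m2": ["Chosen One", "Love Triangle"],
    "m3": ["Deus Ex Machina", "Love Triangle"],
}
# Hypothetical set of poorly received movies.
low_rated = {"m1", "m3"}


def count_trope_pairs(movie_tropes, movie_ids):
    """Count unordered trope pairs that co-occur within the given movies."""
    pair_counts = Counter()
    for movie_id in movie_ids:
        # Sort so each pair is always stored in the same order.
        tropes = sorted(set(movie_tropes.get(movie_id, [])))
        pair_counts.update(combinations(tropes, 2))
    return pair_counts


pair_counts = count_trope_pairs(movie_tropes, low_rated)
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Raw counts favor tropes that are common everywhere, so comparing each pair's frequency in low-rated movies against its frequency in all movies of the same genre would likely be more informative.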