# Milestone 2: Project proposal and initial analysis

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>Project:</strong> Decoding Box-Office Bombs 💣
    <br>
    <strong>Team:</strong> ADAdventurers2024
</div>

To reproduce our dataset, download the data listed in the [README](https://github.com/epfl-ada/ada-2024-project-adaventurers2024/blob/main/README.md) file. Then navigate to the `scripts` folder and run:

```cmd
python preprocess_data.py
```

This script will generate five files in the data folder:

- `cmu_tmdb.csv`: A merged dataset from CMU and TMDB, containing movie information such as revenue, budget, and other details.
- `movie_tropes.csv`: Tropes associated with each movie in the IMDb dataset, which serves as an intermediary file for merging tropes with the CMU dataset.
- `cmu_tropes.csv`: Tropes associated with each movie in the CMU dataset.
- `movie_actors.csv`: Actors linked to each movie in the CMU dataset.
- `movie_directors_actors.csv`: Directors and actors linked to each movie in the IMDb dataset.

You can now proceed with exploratory data analysis and initial assessments.
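
Before moving on, it can help to confirm that all five files were written. The sketch below is one possible check, not part of the project's scripts: the function name `missing_outputs` is illustrative, and the default `data` directory is an assumption that mirrors the `DATA_PATH` used in this notebook.

```python
from pathlib import Path

# The five files that preprocess_data.py is described as generating.
EXPECTED_FILES = [
    "cmu_tmdb.csv",
    "movie_tropes.csv",
    "cmu_tropes.csv",
    "movie_actors.csv",
    "movie_directors_actors.csv",
]


def missing_outputs(data_dir="data"):
    """Return the expected output files that are not present in data_dir."""
    data_path = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_path / name).is_file()]


missing = missing_outputs()  # assumes the notebook runs next to the data folder
print("All files present." if not missing else f"Missing: {missing}")
```

An empty list means every expected file was found.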

------

## Exploratory data analysis

In [11]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [14]:
DATA_PATH = 'data'

#### CMU revenue and other metrics

In [18]:
df_cmu_tmdb = pd.read_csv(f'{DATA_PATH}/cmu_tmdb.csv')
df_cmu_tmdb.head()

| | id | title | vote_average | vote_count | status | release_date | revenue | runtime | adult | backdrop_path | ... | tagline | genres | production_companies | production_countries | spoken_languages | keywords | release_year | wikipedia_movie_id | freebase_movie_id | name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27205 | Inception | 8.364 | 34495 | Released | 2010-07-15 | 825532764 | 148 | False | /8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg | ... | Your mind is the scene of the crime. | Action, Science Fiction, Adventure | Legendary Pictures, Syncopy, Warner Bros. Pict... | United Kingdom, United States of America | English, French, Japanese, Swahili | rescue, mission, dream, airplane, paris, franc... | 2010 | 23270459 | /m/0661ql3 | Inception |
| 1 | 157336 | Interstellar | 8.417 | 32571 | Released | 2014-11-05 | 701729206 | 169 | False | /pbrkL804c8yAv3zBZR4QPEafpAR.jpg | ... | Mankind was born on Earth. It was never meant ... | Adventure, Drama, Science Fiction | Legendary Pictures, Syncopy, Lynda Obst Produc... | United Kingdom, United States of America | English | rescue, future, spacecraft, race against time,... | 2014 | 6009939 | /m/0fkf28 | Interstellar |
| 2 | 155 | The Dark Knight | 8.512 | 30619 | Released | 2008-07-16 | 1004558444 | 152 | False | /nMKdUUepR0i5zn0y1T4CsSB5chy.jpg | ... | Welcome to a world without rules. | Drama, Action, Crime, Thriller | DC Comics, Legendary Pictures, Syncopy, Isobel... | United Kingdom, United States of America | English, Mandarin | joker, sadism, chaos, secret identity, crime f... | 2008 | 4276475 | /m/0btpm6 | The Dark Knight |
| 3 | 19995 | Avatar | 7.573 | 29815 | Released | 2009-12-15 | 2923706026 | 162 | False | /vL5LR6WdxWPjLPFRLe133jXWsh5.jpg | ... | Enter the world of Pandora. | Action, Adventure, Fantasy, Science Fiction | Dune Entertainment, Lightstorm Entertainment, ... | United States of America, United Kingdom | English, Spanish | future, society, culture clash, space travel, ... | 2009 | 4273140 | /m/0bth54 | Avatar |
| 4 | 24428 | The Avengers | 7.71 | 29166 | Released | 2012-04-25 | 1518815515 | 143 | False | /9BBTo63ANSmhC4e6r62OJFuK2GL.jpg | ... | Some assembly required. | Science Fiction, Action, Adventure | Marvel Studios | United States of America | English, Hindi, Russian | new york city, superhero, shield, based on com... | 2012 | 22114132 | /m/062zm5h | The Avengers |


In [19]:
df_cmu_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49516 entries, 0 to 49515
Data columns (total 28 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    49516 non-null  int64  
 1   title                 49516 non-null  object 
 2   vote_average          49516 non-null  float64
 3   vote_count            49516 non-null  int64  
 4   status                49516 non-null  object 
 5   release_date          49516 non-null  object 
 6   revenue               49516 non-null  int64  
 7   runtime               49516 non-null  int64  
 8   adult                 49516 non-null  bool   
 9   backdrop_path         34367 non-null  object 
 10  budget                49516 non-null  int64  
 11  homepage              4220 non-null   object 
 12  imdb_id               49516 non-null  object 
 13  original_language     49516 non-null  object 
 14  original_title        49516 non-null  object 
 15  overview           

#### CMU cast and crew

In [20]:
df_movie_actors = pd.read_csv(f'{DATA_PATH}/movie_actors.csv')
df_movie_actors.head()

| | wikipedia_movie_id | freebase_movie_id | release_date_x | character_name | actor_date_of_birth | actor_gender | actor_height_in_meters | actor_ethnicity_freebase_id | actor_name | actor_age_at_movie_release | ... | release_year | release_date_y | revenue | runtime | languages | countries | genres | movie_name | average_rating | num_votes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 975900 | /m/03vyhn | 2001-08-24 | Akooshay | 1958-08-26 | F | 1.62 | NaN | Wanda De Jesus | 42.0 | ... | 2001 | 2001-08-24 | 14010832.0 | 98.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/01jfsb": "Thriller", "/m/06n90": "Science... | Ghosts of Mars | 4.9 | 58894.0 |
| 1 | 975900 | /m/03vyhn | 2001-08-24 | Akooshay | 1958-08-26 | F | 1.62 | NaN | Wanda De Jesus | 42.0 | ... | 2001 | 2001-08-24 | 14010832.0 | 98.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/01jfsb": "Thriller", "/m/06n90": "Science... | Ghosts of Mars | 4.9 | 58894.0 |
| 2 | 975900 | /m/03vyhn | 2001-08-24 | Akooshay | 1958-08-26 | F | 1.62 | NaN | Wanda De Jesus | 42.0 | ... | 2001 | 2001-08-24 | 14010832.0 | 98.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/01jfsb": "Thriller", "/m/06n90": "Science... | Ghosts of Mars | 4.9 | 58894.0 |
| 3 | 975900 | /m/03vyhn | 2001-08-24 | Akooshay | 1958-08-26 | F | 1.62 | NaN | Wanda De Jesus | 42.0 | ... | 2001 | 2001-08-24 | 14010832.0 | 98.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/01jfsb": "Thriller", "/m/06n90": "Science... | Ghosts of Mars | 4.9 | 58894.0 |
| 4 | 975900 | /m/03vyhn | 2001-08-24 | Akooshay | 1958-08-26 | F | 1.62 | NaN | Wanda De Jesus | 42.0 | ... | 2001 | 2001-08-24 | 14010832.0 | 98.0 | {"/m/02h40lc": "English Language"} | {"/m/09c7w0": "United States of America"} | {"/m/01jfsb": "Thriller", "/m/06n90": "Science... | Ghosts of Mars | 4.9 | 58894.0 |


In [None]:
df_movie_actors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5270656 entries, 0 to 5270655
Data columns (total 23 columns):
 #   Column                           Dtype  
---  ------                           -----  
 0   wikipedia_movie_id               int64  
 1   freebase_movie_id                object 
 2   release_date_x                   object 
 3   character_name                   object 
 4   actor_date_of_birth              object 
 5   actor_gender                     object 
 6   actor_height_in_meters           float64
 7   actor_ethnicity_freebase_id      object 
 8   actor_name                       object 
 9   actor_age_at_movie_release       float64
 10  freebase_character_actor_map_id  object 
 11  freebase_character_id            object 
 12  freebase_actor_id                object 
 13  release_year                     int64  
 14  release_date_y                   object 
 15  revenue                          float64
 16  runtime                          float64
 17  language

#### CMU tropes

In [16]:
df_cmu_tropes = pd.read_csv(f"{DATA_PATH}/cmu_tropes.csv")
df_cmu_tropes.head()

| | id | title | vote_average | vote_count | status | release_date | revenue | runtime | adult | backdrop_path | ... | example | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27205 | Inception | 8.364 | 34495 | Released | 2010-07-15 | 825532764 | 148 | False | /8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg | ... | NaN | tt1375666 | movie | Inception | Inception | 0 | 2010 | \N | 148 | Action,Adventure,Sci-Fi |
| 1 | 27205 | Inception | 8.364 | 34495 | Released | 2010-07-15 | 825532764 | 148 | False | /8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg | ... | This is the goal of Cobb and his crew, only t... | tt1375666 | movie | Inception | Inception | 0 | 2010 | \N | 148 | Action,Adventure,Sci-Fi |
| 2 | 27205 | Inception | 8.364 | 34495 | Released | 2010-07-15 | 825532764 | 148 | False | /8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg | ... | Word of God says this movie is Nolan's homa... | tt1375666 | movie | Inception | Inception | 0 | 2010 | \N | 148 | Action,Adventure,Sci-Fi |
| 3 | 27205 | Inception | 8.364 | 34495 | Released | 2010-07-15 | 825532764 | 148 | False | /8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg | ... | Saito is rich/powerful enough to have Cobb ... | tt1375666 | movie | Inception | Inception | 0 | 2010 | \N | 148 | Action,Adventure,Sci-Fi |
| 4 | 27205 | Inception | 8.364 | 34495 | Released | 2010-07-15 | 825532764 | 148 | False | /8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg | ... | A small one between Ariadne and Arthur after ... | tt1375666 | movie | Inception | Inception | 0 | 2010 | \N | 148 | Action,Adventure,Sci-Fi |


In [17]:
df_cmu_tropes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 298086 entries, 0 to 298085
Data columns (total 43 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    298086 non-null  int64  
 1   title                 298086 non-null  object 
 2   vote_average          298086 non-null  float64
 3   vote_count            298086 non-null  int64  
 4   status                298086 non-null  object 
 5   release_date          298086 non-null  object 
 6   revenue               298086 non-null  int64  
 7   runtime               298086 non-null  int64  
 8   adult                 298086 non-null  bool   
 9   backdrop_path         296801 non-null  object 
 10  budget                298086 non-null  int64  
 11  homepage              111971 non-null  object 
 12  imdb_id               298086 non-null  object 
 13  original_language     298086 non-null  object 
 14  original_title        298086 non-null  object 
 15  

--------

## Research questions

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>📊 Metrics & Performance</strong> 
</div>


1. Which metrics (e.g., low ratings, few ratings, revenue relative to budget) best indicate that a movie failed?
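
As a starting point for question 1, the candidate metrics could be derived from the `revenue`, `budget`, `vote_average`, and `vote_count` columns of `cmu_tmdb.csv`. The sketch below is only an illustration: the function name, the thresholds (`min_votes`, `low_rating`), and the choice to treat a zero budget or revenue as missing are all assumptions, not settled project decisions.

```python
import pandas as pd


def add_failure_metrics(df, min_votes=100, low_rating=5.0):
    """Add candidate failure indicators; thresholds are illustrative assumptions."""
    out = df.copy()
    # Assumption: a budget or revenue of 0 means "not reported", so mask it as NaN.
    budget = out["budget"].where(out["budget"] > 0)
    revenue = out["revenue"].where(out["revenue"] > 0)
    # Return on investment: negative when the movie earned less than it cost.
    out["roi"] = (revenue - budget) / budget
    out["lost_money"] = out["roi"] < 0  # NaN compares as False
    out["low_rating"] = out["vote_average"] < low_rating
    out["few_votes"] = out["vote_count"] < min_votes
    return out


# Toy data (not from the dataset) to show the behaviour.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [100, 200, 0],
    "revenue": [50, 600, 10],
    "vote_average": [4.0, 7.5, 6.0],
    "vote_count": [50, 5000, 10],
})
metrics = add_failure_metrics(toy)
print(metrics[["title", "roi", "lost_money", "low_rating", "few_votes"]])
```

Comparing how often these flags agree on the real data would be one way to judge which metric is the most informative.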


<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>👥 Cast & Crew Analysis</strong> 
</div>


2. How do actor demographics and a lack of cast diversity relate to audience disengagement and box-office underperformance?

3. What role do director-actor collaborations play in a movie's failure, and are there specific patterns in these partnerships that correlate with unsuccessful films?

4. Is thematic consistency in director filmographies a predictor of failure/success?
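
For question 3, one possible first step is to count how often each director-actor pair works together and how their shared movies were rated. This is a sketch under stated assumptions: the column names `movie_id`, `director_name`, `actor_name`, and `average_rating` are hypothetical placeholders, since the schema of `movie_directors_actors.csv` is not shown in this notebook. Rows are deduplicated first because the `movie_actors.csv` preview above contains repeated rows.

```python
import pandas as pd


def collaboration_counts(df):
    """Count shared movies and mean rating per director-actor pair.

    Column names are hypothetical; adapt them to the real file.
    """
    # Keep one row per (movie, director, actor) so repeats do not inflate counts.
    pairs = df.drop_duplicates(["movie_id", "director_name", "actor_name"])
    return (
        pairs.groupby(["director_name", "actor_name"])
        .agg(n_movies=("movie_id", "nunique"), mean_rating=("average_rating", "mean"))
        .reset_index()
        .sort_values("n_movies", ascending=False)
        .reset_index(drop=True)
    )


# Toy data (not from the dataset); the first row is intentionally duplicated.
toy_credits = pd.DataFrame({
    "movie_id": [1, 1, 2, 2],
    "director_name": ["Director X"] * 4,
    "actor_name": ["Actor A", "Actor A", "Actor A", "Actor B"],
    "average_rating": [4.0, 4.0, 6.0, 6.0],
})
collabs = collaboration_counts(toy_credits)
print(collabs)
```

Pairs with many shared movies and a low mean rating would be candidates for closer inspection.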

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>🎬 Genre & Market Factors</strong> 
</div>

5. How does genre choice influence a movie's failure, particularly in different cultural contexts?

6. How does poor release timing (e.g., season, holiday periods) affect a movie's likelihood of failing?
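
Questions 5 and 6 both need per-genre and per-period views of the data. In `cmu_tmdb.csv` the `genres` column holds a comma-separated string and `release_date` holds an ISO date string, so a minimal sketch could split genres into one row each and extract the release month. The function name is illustrative, and grouping by calendar month (rather than season or holiday windows) is an assumption.

```python
import pandas as pd


def explode_genres_with_month(df):
    """Return one row per (movie, genre), with the release month attached."""
    out = df.copy()
    # Unparseable dates become NaT instead of raising an error.
    out["release_month"] = pd.to_datetime(out["release_date"], errors="coerce").dt.month
    # `genres` looks like "Action, Science Fiction, Adventure".
    out["genre"] = out["genres"].str.split(", ")
    return out.explode("genre").reset_index(drop=True)


# Two rows copied from the preview of cmu_tmdb.csv above.
sample = pd.DataFrame({
    "title": ["Inception", "The Avengers"],
    "release_date": ["2010-07-15", "2012-04-25"],
    "genres": ["Action, Science Fiction, Adventure", "Science Fiction, Action, Adventure"],
    "vote_average": [8.364, 7.71],
})
by_genre = explode_genres_with_month(sample)
summary = by_genre.groupby(["genre", "release_month"])["vote_average"].mean()
print(summary)
```

The same grouping could be applied to any failure metric in place of `vote_average`.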

<div style="border: 2px solid #4CAF50; border-radius: 10px; padding: 15px; background-color: #E8F5E9; color: #333;">
    <strong>📖 Narrative & Thematic Elements</strong> 
</div>


7. Within each genre, which trope combinations are consistently associated with negative reception?

8. What recurring plot patterns appear most frequently in critically panned films?
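
For question 7, trope combinations could be approximated by pairs of tropes that co-occur in the same movie. The sketch below assumes a column named `trope` in `cmu_tropes.csv`; that name is hypothetical, because the trope column falls in the truncated part of the output above. `imdb_id` and `vote_average` do appear in the file. The `min_movies` cut-off is also an illustrative assumption.

```python
from itertools import combinations

import pandas as pd


def trope_pair_ratings(df, min_movies=2):
    """Mean rating per co-occurring trope pair (`trope` is a hypothetical column)."""
    tropes_per_movie = (
        df.dropna(subset=["trope"])
        .groupby("imdb_id")["trope"]
        .apply(lambda s: sorted(set(s)))
    )
    rating_per_movie = df.groupby("imdb_id")["vote_average"].first()
    rows = []
    for imdb_id, tropes in tropes_per_movie.items():
        # Sorted tropes make each unordered pair appear in one canonical order.
        for trope_a, trope_b in combinations(tropes, 2):
            rows.append({"trope_a": trope_a, "trope_b": trope_b,
                         "vote_average": rating_per_movie[imdb_id]})
    pairs = pd.DataFrame(rows, columns=["trope_a", "trope_b", "vote_average"])
    summary = (
        pairs.groupby(["trope_a", "trope_b"])["vote_average"]
        .agg(n_movies="count", mean_rating="mean")
        .reset_index()
    )
    # Keep pairs seen in enough movies; lowest-rated pairs come first.
    kept = summary[summary["n_movies"] >= min_movies]
    return kept.sort_values("mean_rating").reset_index(drop=True)


# Toy data (not from the dataset).
toy_tropes = pd.DataFrame({
    "imdb_id": ["tt1", "tt1", "tt2", "tt2", "tt2", "tt3", "tt3"],
    "trope": ["X", "Y", "X", "Y", "Z", "Y", "Z"],
    "vote_average": [3.0, 3.0, 5.0, 5.0, 5.0, 8.0, 8.0],
})
pair_summary = trope_pair_ratings(toy_tropes)
print(pair_summary)
```

Running the function separately on each genre subset would give the per-genre view the question asks for.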