# Applied Data Analysis: CMU Movie Summary Corpus

## Abstract

## Research questions

We investigate whether factors such as movie genre, runtime, release date, or language influence box office revenue.

**Basic questions**
- Do action films generate more revenue than dramas?
- Does a film’s runtime have an impact on its financial performance?
- Do movies released in the summer or during holiday seasons perform better than those released at other times of the year?
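
A minimal sketch of how such group comparisons could be run with pandas. The toy DataFrame and its column names (`main_genre`, `release_date`, `box_office_revenue`) are assumptions for illustration, not values from the corpus, and the season boundaries (June to August for summer, November to December for holidays) are an arbitrary choice.

```python
import pandas as pd

# Toy data for illustration only; not taken from the corpus.
toy = pd.DataFrame({
    "main_genre": ["Action", "Drama", "Action", "Drama", "Action", "Drama"],
    "release_date": pd.to_datetime([
        "2001-07-04", "2001-02-10", "2003-12-19",
        "2004-06-15", "2005-03-01", "2006-11-20",
    ]),
    "box_office_revenue": [300e6, 40e6, 500e6, 60e6, 120e6, 90e6],
})

def release_season(month: int) -> str:
    """Map a month number to a coarse release season (assumed boundaries)."""
    if month in (6, 7, 8):
        return "summer"
    if month in (11, 12):
        return "holiday"
    return "other"

toy["season"] = toy["release_date"].dt.month.map(release_season)

# Box office revenue tends to be skewed, so the median is used as a robust summary.
median_by_genre = toy.groupby("main_genre")["box_office_revenue"].median()
median_by_season = toy.groupby("season")["box_office_revenue"].median()

print(median_by_genre)
print(median_by_season)
```

On the real data, a difference in medians would still need a statistical test before being read as an effect of genre or season.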

**Correlation questions**
- Is there a correlation between certain genres and higher revenue? 
- What are the most common predictors of a film’s financial success? 
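
One possible starting point for the correlation questions is a correlation matrix over numeric features and a genre indicator. The sketch below uses made-up numbers and hypothetical column names (`is_action`, `log_revenue`); revenue is log-transformed on the assumption that a few very large values would otherwise dominate the coefficient.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the corpus.
toy = pd.DataFrame({
    "runtime": [90, 95, 133, 143, 112, 100],
    "is_action": [1, 0, 0, 0, 1, 1],
    "box_office_revenue": [33e6, 160e6, 110e6, 135e6, 3e6, 250e6],
})

# Log-transform revenue so that a few blockbusters do not dominate.
toy["log_revenue"] = np.log10(toy["box_office_revenue"])

# Pearson correlation of each feature with log revenue.
corr_with_revenue = (
    toy[["runtime", "is_action", "log_revenue"]]
    .corr(method="pearson")["log_revenue"]
    .drop("log_revenue")
)
print(corr_with_revenue)
```

A Pearson coefficient only captures linear association, so it cannot show on its own that a genre causes higher revenue.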

**Output model** \
By using machine learning models like linear regression, we can create a predictive model for box office success based on historical data. 
- Can we predict a new film’s box office revenue based on its genre, language, runtime, and cast? 


Comparing the actual and predicted box office results could provide insight into the key factors that influence financial success in the film industry.
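
A minimal sketch of such a predictive model, assuming an ordinary least-squares fit on log revenue. The data is synthetic and generated from a known linear rule so that the fit can be checked; the feature names are hypothetical, and the code uses NumPy's `lstsq` rather than any particular machine learning library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features (illustrative assumptions, not corpus data).
n = 200
runtime = rng.uniform(80, 180, size=n)      # minutes
is_action = rng.integers(0, 2, size=n)      # 1 if the film is an action film

# Synthetic target: log10(revenue) follows a known linear rule plus noise.
true_coef = np.array([5.0, 0.01, 0.5])      # intercept, runtime, is_action
X = np.column_stack([np.ones(n), runtime, is_action])
log_revenue = X @ true_coef + rng.normal(0, 0.1, size=n)

# Hold out the last 50 films to compare actual and predicted values.
X_train, X_test = X[:150], X[150:]
y_train, y_test = log_revenue[:150], log_revenue[150:]

# Ordinary least squares: minimise ||X_train @ coef - y_train||^2.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

y_pred = X_test @ coef
rmse = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
print("estimated coefficients:", coef)
print("test RMSE (log10 units):", rmse)
```

On the real corpus, categorical fields such as genre and language would first need to be encoded as indicator columns, and the held-out error would be the quantity to report.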

## Dataset Details
This dataset contains metadata and plot summaries for over 42,000 movies. The dataset is a rich resource for exploring relationships between movies, characters, and plot content.

- **Name**: CMU Movie Summary Corpus
- **Source**: [CMU Movie Summary Corpus Dataset](https://www.cs.cmu.edu/~ark/personas/)
- **Size**: 46 MB (compressed)
- **Typology**: Text + Graphs + Numerical Data
- **Tags**: Movies, Characters, Plot Summaries, Metadata

The dataset contains:
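- **Movie Metadata**: Information about each movie, such as title, release date, box office revenue, runtime, languages, countries, and genres.
- **Character Metadata**: Details about the characters in each movie and the actors who play them (name, gender, height, ethnicity, age).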
- **Plot Summaries**: Textual summaries of movie plots extracted from Wikipedia.

You can read more about the dataset in the paper: [Bamman et al., 2013](http://www.cs.cmu.edu/~dbamman/pubs/pdf/bamman+oconnor+smith.acl13.pdf).

## First analysis: understanding the data
In this section, we will begin by loading the dataset and performing some initial exploratory data analysis (EDA). This will help us understand the structure and content of the data, identify any missing values, and get a sense of the distributions and relationships between different variables.

In [86]:
# Imports
import os
import importlib
import src.data.CMU_dataset_dataloader as CMU_dataset_dataloader

# Reload the module so that edits to the dataloader are picked up without restarting the kernel
importlib.reload(CMU_dataset_dataloader)

# Constants
CMU_DATA_PATH_INITIAL = 'data/initial/'

In [103]:
# Load the plot summaries
print("Plot Summaries:")
plots = CMU_dataset_dataloader.MoviesSummaryDataset(CMU_DATA_PATH_INITIAL, 'plot_summaries.txt', categories=['movie_id', 'summary'])
df_plots = plots.data
print(plots.data.sample(5))

# Load the movie metadata
print("\nMovie Metadata:")
categories = [
    'wikipedia_movie_id',
    'freebase_movie_id',
    'movie_name',
    'release_date',
    'box_office_revenue',
    'runtime',
    'languages',
    'countries',
    'genres'
    ]
movies = CMU_dataset_dataloader.MoviesSummaryDataset(CMU_DATA_PATH_INITIAL, 'movie.metadata.tsv', categories=categories)
df_movies = movies.data
print(movies.data.sample(5))

# Load the character metadata
print("\nCharacter Metadata:")
categories = [
    'wikipedia_movie_id', 
    'freebase_movie_id', 
    'release_date', 
    'character_name', 
    'actor_birth', 
    'actor_gender', 
    'actor_height', 
    'actor_ethnicity', 
    'actor_name', 
    'actor_age', 
    'freebase_character_actor_id', 
    'freebase_character_id', 
    'freebase_actor_id'
    ]
characters = CMU_dataset_dataloader.MoviesSummaryDataset(CMU_DATA_PATH_INITIAL, 'character.metadata.tsv', categories=categories)
df_characters = characters.data
print(characters.data.sample(5))

Plot Summaries:
       movie_id                                            summary
2820   32819057  Wealthy Mary Townleigh gets lost in the bush a...
5526     455224  In twelfth century Europe, Philippe Gaston, "T...
28747  24005375  England, 1710. A woman disguises herself as a ...
14236  19111921  An actress, having just discovered she's been ...
41492  31306133  In Fat Head Tom Naughton questions the claims ...

Movie Metadata:
       wikipedia_movie_id  ...                                             genres
18020             1556135  ...  {"/m/01jfsb": "Thriller", "/m/0bkbm": "Spy", "...
30505            35016696  ...                        {"/m/0jtdp": "Documentary"}
25360            27387205  ...  {"/m/06n90": "Science Fiction", "/m/02kdv5l": ...
66008            26083628  ...    {"/m/0hn10": "LGBT", "/m/0jtdp": "Documentary"}
12117             4677732  ...  {"/m/04t36": "Musical", "/m/07s9rl0": "Drama",...

[5 rows x 9 columns]

Character Metadata:
        wikipedia_movie_id  ..
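
`MoviesSummaryDataset` is a project-specific loader whose implementation is not shown here. Assuming the corpus files are tab-separated with no header row, a roughly equivalent load with plain pandas might look like the sketch below. The helper name `load_tsv` is hypothetical, and the two sample rows are invented placeholders so that the example runs on its own.

```python
import csv
import io

import pandas as pd

def load_tsv(source, column_names):
    """Read a headerless tab-separated file into a DataFrame (hypothetical helper)."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=column_names,
        quoting=csv.QUOTE_NONE,  # keep quote characters inside fields verbatim
    )

# Invented two-row stand-in for 'plot_summaries.txt'.
fake_file = io.StringIO(
    "101\tA detective investigates a theft.\n"
    "102\tTwo friends travel across the country.\n"
)
df_demo = load_tsv(fake_file, ["movie_id", "summary"])
print(df_demo.shape)
print(df_demo.dtypes)
```

Disabling quote handling is a defensive choice here, since plot summaries and JSON-encoded fields both contain double quotes.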

In [128]:
import pandas as pd
import json

# Plot summaries pre-processing

# Check that no ID or summary is missing
assert df_plots['summary'].isnull().sum() == 0
assert df_plots['movie_id'].isnull().sum() == 0

# Plot summary length in characters, with its distribution
df_plots['summary_length'] = df_plots['summary'].apply(len)
print(df_plots['summary_length'].describe())

# ==> very heterogeneous lengths (the standard deviation is as large as the mean)

# Print a sample of the summaries with their lengths
print(df_plots.sample(5))

count    42303.000000
mean      1784.034229
std       1808.925536
min         99.000000
25%        508.000000
50%       1079.000000
75%       2604.500000
max      28159.000000
Name: summary_length, dtype: float64
       movie_id  ... summary_length
39744  13201636  ...           1186
15132    187642  ...            746
25412  29121828  ...            685
18738  29836661  ...            265
40190  30352493  ...            754

[5 rows x 3 columns]
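
Given how spread out the lengths are (a median of 1079 characters against a maximum of 28159), it may help to bucket summaries before any text analysis. The sketch below assumes the 25% and 75% quantiles printed above (508 and 2604.5 characters) as cut points; the bucket labels and the toy lengths are illustrative.

```python
import pandas as pd

# Toy lengths for illustration; in the notebook this would be df_plots['summary_length'].
lengths = pd.Series(
    [99, 265, 508, 746, 1079, 1186, 2604, 2605, 28159], name="summary_length"
)

# Cut points taken from the 25% and 75% quantiles reported by describe().
bins = [0, 508, 2604.5, float("inf")]
labels = ["short", "medium", "long"]

# Intervals are right-closed by default: (0, 508], (508, 2604.5], (2604.5, inf].
length_bucket = pd.cut(lengths, bins=bins, labels=labels)
print(length_bucket.value_counts().sort_index())
```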


In [None]:
# Movie metadata pre-processing

# Removing freebase (deprecated)
if 'freebase_movie_id' in df_movies.columns:
    df_movies.drop(columns=['freebase_movie_id'], inplace=True)

# Missing values: keep only the rows in which every column is populated
#df_movies.drop_duplicates(inplace=True)
df_movies = df_movies.dropna()

# Sanity check: no missing value remains in any column
assert not df_movies.isnull().values.any()

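# Format release date. Values that cannot be parsed are set to NaT (errors='coerce')
# rather than raising, so the number of NaT values should be checked afterwards.
df_movies['release_date'] = pd.to_datetime(df_movies['release_date'], errors='coerce')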

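def convert_to_dict(dict_str):
    """Parse a JSON-like string into a dict; return {} if it cannot be parsed."""
    if not isinstance(dict_str, str):
        return dict_str

    # Try the string as-is first: rewriting quotes up front would corrupt
    # valid JSON whose values contain an apostrophe.
    for candidate in (dict_str, dict_str.replace("'", '"')):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return {}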

# Parse the languages, countries, and genres columns into dictionaries
df_movies['languages'] = df_movies['languages'].apply(convert_to_dict)
df_movies['countries'] = df_movies['countries'].apply(convert_to_dict)
df_movies['genres'] = df_movies['genres'].apply(convert_to_dict)

# TODO: check for empty dictionaries (empty field or failed parse)

df_movies.sample(5)

| | wikipedia_movie_id | movie_name | release_date | box_office_revenue | runtime | languages | countries | genres |
|---|---|---|---|---|---|---|---|---|
| 75864 | 5533519 | The Crocodile Hunter: Collision Course | 2002-07-12 | 33436931.0 | 90.0 | `{'/m/02h40lc': 'English Language'}` | `{'/m/09c7w0': 'United States of America', '/m/...` | `{'/m/03k9fj': 'Adventure', '/m/03q4nz': 'World...` |
| 62507 | 2460704 | Pokémon: The First Movie | 1998-07-18 | 163644662.0 | 95.0 | `{'/m/03_9r': 'Japanese Language', '/m/02h40lc'...` | `{'/m/09c7w0': 'United States of America', '/m/...` | `{}` |
| 68987 | 27891311 | Moneyball | 2011-09-09 | 110206216.0 | 133.0 | `{'/m/02h40lc': 'English Language', '/m/0t_2': ...` | `{'/m/09c7w0': 'United States of America'}` | `{'/m/01z02hx': 'Sports', '/m/03bxz7': 'Biograp...` |
| 40763 | 3060756 | Babel | 2006-05-23 | 135330182.0 | 143.0 | `{'/m/064_8sq': 'French Language', '/m/03_9r': ...` | `{'/m/09c7w0': 'United States of America', '/m/...` | `{'/m/07s9rl0': 'Drama', '/m/0219x_': 'Indie', ...` |
| 47724 | 9541484 | Crimes of Passion | 1984-10-19 | 2900000.0 | 112.0 | `{'/m/02h40lc': 'English Language'}` | `{'/m/09c7w0': 'United States of America'}` | `{'/m/01jfsb': 'Thriller', '/m/0219x_': 'Indie'...` |
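
Whether any of the parsed columns contain empty dictionaries can be checked with a quick count per column. An empty dictionary may reflect a field that was empty in the raw file or a string that failed to parse, so it is worth inspecting a few of the affected rows before dropping them. The sketch below uses a small stand-in DataFrame; the helper name `count_empty_dicts` and the derived `genre_names` column are hypothetical.

```python
import pandas as pd

# Stand-in for df_movies after the dictionary columns have been parsed (toy rows).
toy = pd.DataFrame({
    "movie_name": ["Film A", "Film B", "Film C"],
    "languages": [
        {"/m/02h40lc": "English Language"},
        {},
        {"/m/064_8sq": "French Language"},
    ],
    "genres": [
        {"/m/07s9rl0": "Drama", "/m/0219x_": "Indie"},
        {},
        {"/m/01jfsb": "Thriller"},
    ],
})

def count_empty_dicts(df, columns):
    """Return, for each column, how many rows hold an empty dictionary."""
    return {
        col: int(df[col].apply(lambda d: isinstance(d, dict) and len(d) == 0).sum())
        for col in columns
    }

print(count_empty_dicts(toy, ["languages", "genres"]))

# Keep only the human-readable names, dropping the Freebase IDs used as keys.
toy["genre_names"] = toy["genres"].apply(lambda d: sorted(d.values()))
print(toy[["movie_name", "genre_names"]])
```

Working with lists of names rather than ID-to-name dictionaries also makes later steps such as one-hot encoding of genres more direct.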
