# A comedy handbook - preliminary results and analysis

In this notebook, you will find the preliminary results and code needed for Milestone 2 of the project. This same notebook will be updated for Milestone 3 as the project progresses.

The goal of this notebook is to showcase the feasibility of the approaches mentioned in our report. It also presents the datasets we use, as well as the initial data cleaning and pre-processing applied to them.

## Imports, constants and data loading

In [73]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
!pip install -r pip_requirements.txt

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [74]:
import pandas as pd
import swifter # Looks unused, but importing it monkey-patches pandas with the .swifter accessor -- don't delete
import matplotlib.pyplot as plt
import re
import geopandas as gpd
from src.utils.data_utils import *
from src.utils.general_utils import *
from tqdm import tqdm
from prettytable import PrettyTable,MARKDOWN

In [75]:
# Constants

TEXT_SEPARATOR = "_" * 40
# The following color palette is color-blind friendly; it was curated from Tol's and IBM's specialized palettes (https://davidmathlogic.com/colorblind/)
# You can play with it/visualize it here: https://coolors.co/332288-88ccee-e69f00-44aa99-f0e442-d55e00-882255-009e73
COLOR_PALETTE = ["332288","88ccee","e69f00","44aa99","f0e442","d55e00","882255","009e73"]
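
The palette entries are stored without the leading `#` used by matplotlib's hex colour syntax. Below is a minimal sketch of how they could be adapted before plotting. The helper names `to_mpl_hex` and `hex_to_rgb` are illustrative and not part of the project code.

```python
COLOR_PALETTE = ["332288", "88ccee", "e69f00", "44aa99", "f0e442", "d55e00", "882255", "009e73"]


def to_mpl_hex(code: str) -> str:
    """Return the colour as '#rrggbb' (hypothetical helper)."""
    return "#" + code.lstrip("#").lower()


def hex_to_rgb(code: str) -> tuple:
    """Convert 'rrggbb' to an (r, g, b) tuple of 0-255 integers (hypothetical helper)."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


# Palette in the '#rrggbb' form, ready to be handed to plotting code
MPL_COLORS = [to_mpl_hex(c) for c in COLOR_PALETTE]
print(MPL_COLORS[0], hex_to_rgb(COLOR_PALETTE[0]))
```

The resulting list could then be passed to plotting calls, for example through a `color=` argument.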


In [86]:
# Dataset setups
RAW_DATA_FOLDER = "data/raw/"

CMU_PLOTS_DS = ProjectDataset(RAW_DATA_FOLDER+"plot_summaries.txt",
                              "CMU plots",
                              "Base dataset for ADA. Contains a plot summary for each of the movies in the CMU Movies dataset",
                              {
                                  "wikipedia_id":"The UID corresponding to the movie",
                                  "plot": "The (summarized) plot of the movie"
                                }
                            )

CMU_MOVIES_DS = ProjectDataset(RAW_DATA_FOLDER+"movie.metadata.tsv",
                               "CMU Movies",
                               "Base dataset for ADA. Contains ~80K movies and their metadata.",
                               {
                                    "wikipedia_id": "A UID for the movie on Wikipedia.",
                                    "freebase_id": "A UID for the movie in Freebase (https://freebase.toolforge.org/)",
                                    "title": "The title of the movie.",
                                    "release_date": "The date the movie was officially released. (<YEAR>, or <YYYY-MM-DD>)",
                                    "box_office_revenue": "The total revenue of the movie",
                                    "runtime": "The duration of the movie, minutes (float).",
                                    "languages": "The languages spoken in the movie. Dict[<Freebase Language Identifier String>:<Language>]",
                                    "countries": "The countries where the movie was produced or primarily associated. Dict[<Freebase Country Identifier String>:<Country>]",
                                    "genres": "Movie genre(s), such as action, drama, comedy, ..."
                                }
                            )

CMU_CHARACTER_DS = ProjectDataset(RAW_DATA_FOLDER+"character.metadata.tsv",
                                  "CMU characters",
                                  "Base dataset for ADA. Contains ~450K+ characters from movies, and their descriptions.",
                                  {
                                        "wikipedia_id": "A UID for the movie the character comes from on Wikipedia.",
                                        "freebase_movie_id": "A UID for the movie the character comes from on Freebase.",
                                        "character_name": "The name of the character in the movie.",
                                        "actor_dob": "The date of birth of the actor portraying the character.",
                                        "actor_gender": "The gender of the actor portraying the character.",
                                        "actor_height": "The height of the actor portraying the character, in meters.",
                                        "actor_ethnicity": "The ethnicity of the actor portraying the character.",
                                        "actor_name": "The name of the actor portraying the character.",
                                        "actor_age_at_movie_release": "The age of the actor at the time the movie was released.",
                                        "freebase_character_map": "A unique identifier or mapping of the character in the Freebase database."
                                    }
                                )

MASSIVE_RT_MOVIE_DS = ProjectDataset(RAW_DATA_FOLDER+"rotten_tomatoes_movies.csv",
                               "Massive Rotten Tomatoes Movie metadata",
                               "Dataset containing ~140K+ movies from RT",
                               {
                                     "id": "Unique identifier for each movie.",
                                    "title": "The title of the movie.",
                                    "audienceScore": "The average score given by regular viewers.",
                                    "tomatoMeter": "The percentage of positive reviews from professional critics.",
                                    "rating": "The movie's age-based classification (e.g., 'G', 'PG', 'PG-13', 'R').",
                                    "ratingContents": "Reasons for the age-based classification.",
                                    "releaseDateTheaters": "The date the movie was released in theaters.",
                                    "releaseDateStreaming": "The date the movie became available for streaming.",
                                    "runtimeMinutes": "The movie's duration in minutes.",
                                    "genre": "The movie's genre(s).",
                                    "originalLanguage": "The original language of the movie.",
                                    "director": "The director(s) of the movie.",
                                    "writer": "The writer(s) of the movie.",
                                    "boxOffice": "The total box office earnings of the movie.",
                                    "distributor": "The company responsible for distributing the movie.",
                                    "soundMix": "The sound mixing format(s) used in the movie."
                                }
                            )

MASSIVE_RT_REVIEW_DS = ProjectDataset(RAW_DATA_FOLDER+"rotten_tomatoes_movie_reviews.csv",
                               "Massive Rotten Tomatoes Reviews",
                               "Dataset containing ~1.4M+ reviews from RT",                               
                               {
                                    "id": "Unique identifier for each movie (matches the ID in the movies dataset).",
                                    "reviewId": "Unique identifier for each critic review.",
                                    "creationDate": "The date the review was published.",
                                    "criticName": "Name of the critic who wrote the review.",
                                    "isTopCritic": "Indicates if the critic is considered a 'Top Critic' (True or False).",
                                    "originalScore": "The score provided by the critic.",
                                    "reviewState": "The status of the review (e.g., 'fresh', 'rotten').",
                                    "publicatioName": "The name of the publication where the review was published.",
                                    "reviewText": "The full text of the critic review.",
                                    "scoreSentiment": "The sentiment of the critic's score (e.g., 'positive', 'negative', 'neutral').",
                                    "reviewUrl": "The url of the review"
                                }
                            )

RT_EXTRA_MOVIE_INFO_DS = ProjectDataset(RAW_DATA_FOLDER+"movie_info.csv",
                                        "Extra Rotten Tomatoes Movies",
                                        "Extra scraped movie information from Rotten Tomatoes for ~12K major US releases between 1970 and 2024",
                                        {
                                            "title": "The title of the movie.",
                                            "url": "RT link to the movie.",
                                            "release_date": "Release date of the movie (format is one of ['Released <DATE as text>',<YEAR>]).",
                                            "critic_score": "The rating given by professional critics.",
                                            "audience_score": "The rating given by the general audience."
                                        }
                                    )

ALL_DATASETS = [CMU_MOVIES_DS,CMU_CHARACTER_DS,MASSIVE_RT_MOVIE_DS,MASSIVE_RT_REVIEW_DS,RT_EXTRA_MOVIE_INFO_DS]

In [77]:
print(f"We will use {len(ALL_DATASETS)} datasets in total, namely {', '.join(str(ds) for ds in ALL_DATASETS)}.")

We will use 5 datasets in total, namely CMU Movies, CMU characters, Massive Rotten Tomatoes Movie metadata, Massive Rotten Tomatoes Reviews, Extra Rotten Tomatoes Movies.


Here is a detailed description of what each dataset contains:


In [78]:
for i,ds in enumerate(ALL_DATASETS):
    print(f"- {number_to_emoji(i+1)} \033[4m{ds.name}\033[0m\n")
    print(f"{ds.description}\n")
    table = PrettyTable()
    table.field_names = ["Column/Feature", "Description"]
    table.align['Description'] = 'l'
    table.max_width["Description"] = 40
    
    for column_name, column_description in ds.columns_descriptions.items():
        table.add_row([column_name, column_description+'\n'])

    print(table)
    print("\n")



- 1️⃣ CMU Movies

Base dataset for ADA. Contains ~80K movies and their metadata.

+--------------------+------------------------------------------+
|   Column/Feature   | Description                              |
+--------------------+------------------------------------------+
|    wikipedia_id    | A UID for the movie on Wikipedia.        |
|                    |                                          |
|    freebase_id     | A UID for the movie in Freebase          |
|                    | (https://freebase.toolforge.org/)        |
|                    |                                          |
|       title        | The title of the movie.                  |
|                    |                                          |
|    release_date    | The date the movie was officially        |
|                    | released. (<YEAR>, or <YYYY-MM-DD>)      |
|                    |                                          |
| box_office_revenue | The total revenue of the movi

In [79]:
for dataset in tqdm(ALL_DATASETS,total=len(ALL_DATASETS),desc="Preloading datasets..."):
    _ = dataset.df

Preloading datasets...: 100%|██████████| 5/5 [00:09<00:00,  1.88s/it]
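
`ProjectDataset` lives in `src.utils.data_utils` and is not shown in this notebook. Judging only from how it is used above (`ds.name`, `ds.description`, `ds.columns_descriptions`, `str(ds)`, and a `df` attribute that is worth preloading), it plausibly pairs metadata with a lazily loaded dataframe. The sketch below is an assumption about that design, not the project's actual implementation. The class name `LazyDataset` and the separator rule are hypothetical.

```python
import os
import tempfile
from dataclasses import dataclass, field
from functools import cached_property

import pandas as pd


@dataclass
class LazyDataset:
    """Hypothetical stand-in for ProjectDataset: metadata plus a lazily loaded dataframe."""
    path: str
    name: str
    description: str
    columns_descriptions: dict = field(default_factory=dict)

    @cached_property
    def df(self) -> pd.DataFrame:
        # The file is read on first access only; later accesses reuse the cached frame.
        sep = "\t" if self.path.endswith((".tsv", ".txt")) else ","  # assumed rule
        return pd.read_csv(self.path, sep=sep)

    def __str__(self) -> str:
        return self.name


# Tiny demonstration on a throwaway CSV file (made-up content).
_demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(_demo_path, "w", encoding="utf-8") as f:
    f.write("title,runtime\nSome Comedy,90\n")

demo = LazyDataset(_demo_path, "Demo", "Two-column toy dataset",
                   {"title": "Movie title", "runtime": "Duration in minutes"})
loaded_before = "df" in vars(demo)  # nothing has been read yet
_ = demo.df                         # first access triggers the read
loaded_after = "df" in vars(demo)   # the frame is now cached on the instance
print(demo, loaded_before, loaded_after)
```

Under this assumption, the preloading loop above simply pays the file-reading cost up front, so that later cells access `df` without delay.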


## Data handling

In this section, we show that our various approaches are feasible by providing a small proof of concept for each type of approach. The results here are by no means the final results of our data story, but rather an intermediate viability checkpoint.


To access useful indicators of popularity and perform our quality analysis, we extend our initial dataset with several Rotten Tomatoes datasets. The indicators we want to study here are the box-office revenue, the expert critic reviews, the expert critic scores and the movie awards.
We start by selecting the useful columns in each dataset and merging the appropriate datasets to assess the feasibility of exploiting them. The size and content of the merged data let us judge whether the next steps, for example statistical analysis, are possible.


In [89]:
useful_col = ['wikipedia_id','title','box_office_revenue','genres']
unfiltered_df_box_office = CMU_MOVIES_DS.df[useful_col]
df_box_office = unfiltered_df_box_office[unfiltered_df_box_office['genres'].str.contains('comedy', case=False, na=False)]
count_na = df_box_office['box_office_revenue'].isna().sum()
print(count_na/len(df_box_office))

0.8392791297935103


In the initial dataset, the share of missing box-office values is very high: about 84% of the comedies have no box-office value.
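
The same ratio can be obtained in one step, since the mean of a boolean mask is the share of `True` values. The toy frame below is invented for illustration: only the column names mirror the CMU Movies dataset, and the genre strings are a simplified stand-in for the real format.

```python
import pandas as pd

# Made-up rows; only the column names follow the CMU Movies dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "box_office_revenue": [1_000_000.0, None, None, 250_000.0],
    "genres": ["Comedy", "Romantic comedy", "Drama", "Black comedy"],
})

# Same filter as in the cell above: keep rows whose genre string mentions "comedy"
is_comedy = toy["genres"].str.contains("comedy", case=False, na=False)
comedies = toy[is_comedy]

# Mean of a boolean mask = fraction of missing values
missing_share = comedies["box_office_revenue"].isna().mean()
print(f"Share of toy comedies without a box-office value: {missing_share:.2f}")
```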

Standardizing the movie names is necessary to improve the accuracy of the merges between the different dataframes. 

In [81]:
def standardize_title(title):
    if not isinstance(title, str):
        return None
    title = title.lower()
    title = re.sub(r'[^a-z0-9\s]', '', title)
    title = re.sub(r'\s+', ' ', title).strip()
    return title

df_raw_rt_rev = MASSIVE_RT_REVIEW_DS.df
df_raw_rt = MASSIVE_RT_MOVIE_DS.df

df_short_rt = df_raw_rt_rev[["id","isTopCritic","originalScore","scoreSentiment"]]
df_id_title = df_raw_rt[["id","title"]]

df_id_title.head()


                     id                title
0    space-zombie-bingo  Space Zombie Bingo!
1       the_green_grass      The Green Grass
2             love_lies           Love, Lies
3  the_sore_losers_1997          Sore Losers
4  dinosaur_island_2002      Dinosaur Island
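
To make the effect of `standardize_title` concrete, the sketch below applies it to two titles from the preview above and to one accented title. Because the character class only keeps `a-z`, digits and whitespace, accented letters are dropped rather than transliterated. The `fold_accents` helper is a hypothetical variant shown for comparison, not something the project uses.

```python
import re
import unicodedata


def standardize_title(title):
    # Same logic as the notebook's function: lowercase, strip punctuation, collapse spaces
    if not isinstance(title, str):
        return None
    title = title.lower()
    title = re.sub(r'[^a-z0-9\s]', '', title)
    title = re.sub(r'\s+', ' ', title).strip()
    return title


def fold_accents(title):
    """Hypothetical variant: remove diacritics first, then standardize."""
    if not isinstance(title, str):
        return None
    decomposed = unicodedata.normalize("NFKD", title)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return standardize_title(without_marks)


examples = ["Space Zombie Bingo!", "Love, Lies", "Am\u00e9lie"]
for raw in examples:
    print(repr(raw), "->", repr(standardize_title(raw)), "|", repr(fold_accents(raw)))
```

Whether accent folding is worth adding depends on how many non-English titles appear in the comedy subset, which is not measured here.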


In [82]:
# Attach the movie title to each Rotten Tomatoes review
merged_raws = pd.merge(df_id_title, df_short_rt, on= "id", how="inner")

merged_raws["title"] = merged_raws["title"].apply(standardize_title)
# assign() returns a new dataframe, so nothing is written into a slice of CMU_MOVIES_DS.df
df_box_office = df_box_office.assign(title=df_box_office["title"].apply(standardize_title))

merge_big_dfs = pd.merge(df_box_office, merged_raws, on='title', how='inner')
merge_big_dfs = merge_big_dfs.drop_duplicates()
merge_big_dfs = merge_big_dfs[merge_big_dfs["originalScore"].notna()].copy()
df_expert_critics = merge_big_dfs[merge_big_dfs["isTopCritic"]==True]
list_movies = df_expert_critics["title"].unique() 
print(len(list_movies))


5340


Merging the initial dataset with the 'Massive' Rotten Tomatoes one gives an exploitable dataset: 5340 distinct comedy titles with at least one scored top-critic review. The box-office column from the initial dataset still contains missing values (NaN); we plan to fill them in using another dataset.
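
The merge chain can be reproduced on a few made-up rows to check its behaviour. Note that joining on the standardized title alone treats different movies that share a title as one, so the count is a number of distinct titles rather than of distinct movies. All values below are invented; only the column names follow the datasets described earlier.

```python
import pandas as pd

# Stand-in for the comedy subset of CMU Movies: two different movies share a title
movies = pd.DataFrame({
    "wikipedia_id": [1, 2, 3],
    "title": ["love lies", "sore losers", "sore losers"],
})
rt_titles = pd.DataFrame({
    "id": ["love_lies", "the_sore_losers_1997"],
    "title": ["love lies", "sore losers"],
})
rt_reviews = pd.DataFrame({
    "id": ["love_lies", "love_lies", "the_sore_losers_1997"],
    "isTopCritic": [True, False, True],
    "originalScore": ["3/4", None, "B+"],
})

# Same steps as the cell above: reviews + titles, then join on the title
reviews_with_title = rt_titles.merge(rt_reviews, on="id", how="inner")
merged = movies.merge(reviews_with_title, on="title", how="inner")
merged = merged[merged["originalScore"].notna()]
top = merged[merged["isTopCritic"]]

n_titles = top["title"].nunique()
n_movies = top["wikipedia_id"].nunique()
print(f"distinct titles: {n_titles}, distinct wikipedia_id values: {n_movies}")
```

On real data, comparing these two counts would indicate how many title collisions the merge contains.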

In [91]:
df_plot = CMU_PLOTS_DS.df
df_1st_plot = CMU_PLOTS_DS.df.head(5)

df_plot.head()

   wikipedia_id                                               plot
0      23890098  Shlykov, a hard-working taxi driver and Lyosha...
1      31186339  The nation of Panem consists of a wealthy Capi...
2      20663735   Poovalli Induchoodan is sentenced for six yea...
3       2231378  The Lemon Drop Kid , a New York City swindler,...
4        595909  Seventh-day Adventist Church pastor Michael Ch...


In [None]:
from llmware.agents import LLMfx

def get_one_sentiment_classification(text):
    agent = LLMfx(verbose=True)
    agent.load_tool("sentiment")
    sentiment = agent.sentiment(text)
    
    sentiment_value = sentiment["llm_response"]["sentiment"]
    confidence_level = sentiment["confidence_score"]
    
    print("Plot:", text)
    print("Sentiment:", sentiment_value)
    print("Confidence level", confidence_level)
    
    return sentiment_value, confidence_level

# Application of the analysis on the plots
# Work on an explicit copy so the new columns are not written into a slice of CMU_PLOTS_DS.df
df_1st_plot = df_1st_plot.copy()
df_1st_plot[['Sentiment', 'Confidence']] = df_1st_plot['plot'].swifter.apply(
    lambda x: pd.Series(get_one_sentiment_classification(x))
)


Pandas Apply:   0%|          | 0/5 [00:00<?, ?it/s]

update: Launching LLMfx process
step - 	1 - 	creating object - ready to start processing.
step - 	2 - 	loading tool - sentiment
step - 	3 - 	executing function call - deploying - sentiment 
step - 	4 - 	loading new processing text - 1 new entries
step - 	5 - 	executing function call - getting response - sentiment
				 -- llm_response - {'sentiment': ['positive']}
				 -- output type - dict
				 -- usage - {'input': 69, 'output': 9, 'total': 78, 'metric': 'tokens', 'processing_time': 3.328444480895996, 'type': 'dict'}
step - 	6 - 	analyzing response - sentiment
				 -- confidence score - 0.995
				 -- analyzing response - sentiment': ['positive']}
				 -- analyzing response - {'choice_1': ['pos', 0.995, 1066], 'choice_2': ['negative', 0.005, 22198], 'choice_3': ['neut', 0.0, 17821]}
				 -- analyzing response - [('positive', 1066, 0.995), ('negative', 22198, 0.005), ('neutral', 17821, 0.0)]
Plot: Shlykov, a hard-working taxi driver and Lyosha


Running the sentiment analysis on the first 5 plots of the dataset takes 2 min 34 s (154 s). We extrapolate this value linearly, using the ratio between the number of words in all the comedy plots and the number of words in these 5 plots:

In [93]:
df_merge_IDs = pd.merge(df_box_office, df_plot, on='wikipedia_id', how='inner')

sample_time_s = 154  # measured duration for the 5 sample plots (2 min 34 s)
words = ' '.join(df_1st_plot["plot"]).split()
words_comedy = ' '.join(df_merge_IDs["plot"]).split()
# Scale the measured time by the word-count ratio, then convert seconds to days
computation_time = ((len(words_comedy)/len(words))*sample_time_s)/(24*3600)
print(f"Linear estimation of the computation time for all the comedy plot summaries: {computation_time} days")


Linear estimation of the computation time for all the comedy plot summaries: 2.8842083242377936 days


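The estimated computation time is about 3 days, which is quite long. It can probably be reduced later; one candidate is that `get_one_sentiment_classification` currently creates a new agent and reloads the sentiment tool on every call. This part of the code only has to be run once, since the results would be saved.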
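
Since the classification only needs to run once, its results could be persisted and reloaded on later runs. The following is a minimal sketch under assumptions: the cache path, the `classify_with_cache` helper and the stub classifier are hypothetical, and the stub merely stands in for `get_one_sentiment_classification` so that the example runs without a model.

```python
import os
import tempfile

import pandas as pd


def classify_with_cache(df, classify, cache_path):
    """Add 'Sentiment' and 'Confidence' columns, computing them only if no cache file exists."""
    if os.path.exists(cache_path):
        # A previous run already saved the results: reload instead of recomputing
        return pd.read_csv(cache_path)
    results = [classify(text) for text in df["plot"]]
    out = df.copy()
    out["Sentiment"] = [r[0] for r in results]
    out["Confidence"] = [r[1] for r in results]
    out.to_csv(cache_path, index=False)
    return out


calls = []


def stub_classifier(text):
    # Placeholder for the real (slow) model call; records each invocation
    calls.append(text)
    return ("positive" if "happy" in text else "negative", 0.5)


# Made-up plots and a throwaway cache location
plots = pd.DataFrame({"wikipedia_id": [1, 2], "plot": ["a happy ending", "a sad ending"]})
cache_file = os.path.join(tempfile.mkdtemp(), "plot_sentiments.csv")

first = classify_with_cache(plots, stub_classifier, cache_file)   # computes and writes the file
second = classify_with_cache(plots, stub_classifier, cache_file)  # reads the file, no new calls
print(len(calls), list(second["Sentiment"]))
```

With a multi-day run, writing the cache in batches rather than once at the end would also protect against losing progress on interruption; that refinement is left out of the sketch.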