> This notebook focuses on the success of movies relative to specific events.

# Success Analysis

#### Imports

In [1]:
import pandas as pd
import numpy as np

import plotly.graph_objects as go
from scipy import stats
from itertools import combinations

from helpers import *

#### Data Loading

In [2]:
MOVIES_EVENTS = 'data/AdditionalDatasets/movies_events_reviews.csv'

movies_events_df = pd.read_csv(MOVIES_EVENTS)

movies_events_df.sample()

Unnamed: 0.1,Unnamed: 0,id_wiki,id_freebase,movie,date,box_office,runtime,lang,country,genre,...,Terrorism,events_belongs_to,review_id,reviewer,rating,review_summary,review_date,spoiler_tag,review_detail,helpful
24239,24239,19615593,/m/0djbdd0,Voltron: Fleet of Doom,1986.0,,48.0,,['United States of America'],"['Science Fiction', 'Anime', 'Action', 'Animat...",...,False,[],,,,,,,,


#### Preliminary insights

> In our study of society and cinema, we examine which historical events or movements depicted in movies are associated with the most successful films.

In [3]:
movies_events_df.columns

Index(['Unnamed: 0', 'id_wiki', 'id_freebase', 'movie', 'date', 'box_office',
       'runtime', 'lang', 'country', 'genre', 'id_wiki_movie', 'summary',
       'popularity', 'vote_average', 'vote_count', 'world_region', 'new_genre',
       'WW1', 'WW2', 'Space', 'Cold War', 'Vietnam war', 'Women emancipation',
       'Black History', 'Digitalisation', 'Sexuality', 'STDs', 'Opioid Crisis',
       'Mental Health', 'Atomic Bomb', 'Genetic Engineering', 'LGBTQ',
       'Terrorism', 'events_belongs_to', 'review_id', 'reviewer', 'rating',
       'review_summary', 'review_date', 'spoiler_tag', 'review_detail',
       'helpful'],
      dtype='object')

> Five columns in our dataset allow us to assess the success of a movie:
> `popularity`, `vote_average`, `vote_count`, `box_office`, `rating`.

In [5]:
# NaN percentage for the columns of interest, over all movies

success_columns = ['popularity', 'vote_average', 'vote_count', 'box_office', 'rating']
nan_percentage_overall = (movies_events_df[success_columns].isna().mean() * 100).round(2)
print("NaN percentage for overall movies:")
print(nan_percentage_overall)


# NaN percentage for movies tagged with at least one event

events = ['WW1', 'WW2', 'Space', 'Cold War', 'Women emancipation', 'Black History', 'Vietnam war', 'Digitalisation', 'Sexuality', 'STDs', 'Opioid Crisis', 'Mental Health', 'Atomic Bomb', 'Genetic Engineering', 'LGBTQ', 'Terrorism']
is_event_movie = movies_events_df[events].any(axis=1)
nan_percentage_events = (movies_events_df.loc[is_event_movie, success_columns].isna().mean() * 100).round(2)
print("\nNaN percentage for events movies:")
print(nan_percentage_events)

NaN percentage for overall movies:
popularity      51.18
vote_average    51.18
vote_count      51.18
box_office      49.75
rating          91.56
dtype: float64

NaN percentage for events movies:
popularity      36.89
vote_average    36.89
vote_count      36.89
box_office      35.77
rating          85.51
dtype: float64


> Given these NaN percentages, we will only use `popularity`, `vote_average`, `vote_count`, and `box_office` to estimate the success of a movie: the `rating` column is missing for 91.56% of all movies and 85.51% of event movies.

> However, we need to understand what they reveal about a movie's success and, more precisely, how they compare to each other. To achieve this, let's compute pairwise correlations.

In [6]:
columns_to_compare = ['box_office', 'popularity', 'vote_average', 'vote_count']

# Pearson correlation between each pair of columns (rows with a NaN in either column are ignored)

correlation_results = []

for column1, column2 in combinations(columns_to_compare, 2):
    correlation = movies_events_df[column1].corr(movies_events_df[column2], method='pearson')
    correlation_results.append({
        'Variable1': column1,
        'Variable2': column2,
        'Correlation': correlation
    })


correlation_results_df = pd.DataFrame(correlation_results)
print(correlation_results_df)

      Variable1     Variable2  Correlation
0    box_office    popularity     0.565368
1    box_office  vote_average     0.103766
2    box_office    vote_count     0.799977
3    popularity  vote_average     0.223640
4    popularity    vote_count     0.640109
5  vote_average    vote_count     0.163940


> Here we observe that:
>
>- `box_office`, `vote_count` and `popularity` are moderately to strongly positively correlated with one another (coefficients between 0.57 and 0.80).
>- `vote_average` is only weakly positively correlated with the other variables (coefficients between 0.10 and 0.22). It still measures success, but a different facet of it: how much a movie is appreciated, rather than how much fame it attains (which is what `box_office`, `vote_count` and `popularity` capture).
>
> For a comprehensive success analysis, we will focus on:
>
>- `box_office`, which reflects the financial success and fame of movies, especially those related to specific events. This variable has a slightly lower percentage of missing values (49.75% versus 51.18% overall), and its correlation with `vote_count` and `popularity` allows us to drop those two variables.
>- `vote_average`, which measures audience appreciation and engagement with the movie.
>
>By considering these variables, the analysis encompasses both the financial and qualitative aspects of a movie's success.
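
> Revenue-like variables tend to be heavily right-skewed, so it can be worth checking whether a rank-based coefficient tells the same story as the linear one. The sketch below compares Pearson and Spearman matrices with `DataFrame.corr`. It runs on made-up toy data, not on `movies_events_df`, and the variable names `toy_df`, `pearson_matrix` and `spearman_matrix` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, NOT the notebook's dataset): a heavy-tailed
# "vote_count"-like column and a "box_office"-like column that grows
# monotonically, but not linearly, with it.
rng = np.random.default_rng(0)
vote_count = rng.lognormal(mean=5.0, sigma=1.5, size=500)
box_office = vote_count ** 2 * rng.lognormal(mean=0.0, sigma=0.3, size=500)
toy_df = pd.DataFrame({"box_office": box_office, "vote_count": vote_count})

# DataFrame.corr returns the full pairwise matrix and ignores NaNs pairwise.
pearson_matrix = toy_df.corr(method="pearson")
spearman_matrix = toy_df.corr(method="spearman")

print("Pearson:")
print(pearson_matrix.round(3))
print("\nSpearman:")
print(spearman_matrix.round(3))
```

> On the real data, the same two calls could be applied to `movies_events_df[columns_to_compare]`. If both coefficients rank the pairs in the same order, the choice of `box_office` as a proxy for `vote_count` and `popularity` does not hinge on the correlation method.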


#### Box office and Vote average barplots

##### - Box office

In [9]:
# Average and 95% CI computations for Box office

all_averages = []
all_conf_intervals = []

for element in events:
    
    element_events = movie_affected_to_event(movies_events_df, element)
    element_revenue = element_events['box_office'].copy()
    element_revenue_cleaned = element_revenue.dropna()
    
    avg = element_revenue_cleaned.mean()
    conf_interval = stats.t.interval(0.95, len(element_revenue_cleaned) - 1, loc=avg, scale=stats.sem(element_revenue_cleaned))
    all_averages.append(avg)
    all_conf_intervals.append(conf_interval)

# Lower and upper bounds of confidence intervals for Box office

lower_bound, upper_bound = zip(*all_conf_intervals)
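
> The same mean and 95% CI computation is repeated for `vote_average`, so it could be factored into a small function. The sketch below is one possible way to do that. `mean_ci` is a hypothetical helper (it is not part of the `helpers` module), and it additionally guards against events with fewer than two non-missing values, for which the standard error is undefined.

```python
import numpy as np
import pandas as pd
from scipy import stats


def mean_ci(values, confidence=0.95):
    """Return (mean, lower, upper) for a t-based confidence interval.

    Hypothetical helper for illustration; NaNs are dropped before computing.
    """
    cleaned = pd.Series(values, dtype="float64").dropna()
    n = len(cleaned)
    if n == 0:
        return np.nan, np.nan, np.nan
    mean = cleaned.mean()
    if n < 2:
        # A single observation gives no estimate of the standard error.
        return mean, np.nan, np.nan
    lower, upper = stats.t.interval(
        confidence, n - 1, loc=mean, scale=stats.sem(cleaned)
    )
    return mean, lower, upper


# Toy values, made up for illustration only.
mean, lower, upper = mean_ci([10.0, 12.0, np.nan, 14.0, 16.0])

# The t interval is symmetric around the mean, so a single half-width
# is enough to draw the error bar in both directions.
half_width = (upper - lower) / 2
print(mean, lower, upper, half_width)
```

> In the loop, each event's `box_office` (or `vote_average`) column could be passed directly to such a function, since missing values are dropped inside it.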

In [13]:
# Interactive bar plot of the mean box office per event, with 95% CI error bars

fig = go.Figure()

fig.add_trace(go.Bar(
    name="Mean",
    x=events,
    y=all_averages,
    marker=dict(color='rgb(99, 110, 250)'),
    error_y=dict(
        type='data',
        # The t interval is symmetric around the mean, so one half-width serves both directions
        symmetric=True,
        array=[(high - low) / 2 for low, high in zip(lower_bound, upper_bound)],
    ),
))

fig.update_layout(
    barmode="group",
    width=700,
    height=500,
    xaxis_title="Events",
    yaxis_title="Average Revenue",
    title="Average Revenue for Movies from Different Events with 95% CI",
    title_x=0.5,
)

fig.show()
# fig.write_image("./plots/average_revenue_bar.png")
# fig.write_html("./plots/average_revenue_bar.html")

> $ANALYSIS$ : *to complete once events dictionaries are ok*

##### - Vote average

In [11]:
# Average and 95% CI computations for vote_average

all_averages_vote = []
all_conf_intervals_vote = []

for element in events:
    
    element_events = movie_affected_to_event(movies_events_df, element)
    element_vote_average = element_events['vote_average'].copy()
    element_vote_average_cleaned = element_vote_average.dropna()
    
    avg_vote = element_vote_average_cleaned.mean()
    conf_interval_vote = stats.t.interval(0.95, len(element_vote_average_cleaned) - 1, loc=avg_vote, scale=stats.sem(element_vote_average_cleaned))
    all_averages_vote.append(avg_vote)
    all_conf_intervals_vote.append(conf_interval_vote)


# Lower and upper bounds of confidence intervals for vote_average

lower_bound_vote, upper_bound_vote = zip(*all_conf_intervals_vote)

In [12]:
# Interactive bar plot of the mean vote_average per event, with 95% CI error bars

fig_vote = go.Figure()

fig_vote.add_trace(go.Bar(
    name="Mean",
    x=events,
    y=all_averages_vote,
    marker=dict(color='rgb(99, 110, 250)'),
    error_y=dict(
        type='data',
        # The t interval is symmetric around the mean, so one half-width serves both directions
        symmetric=True,
        array=[(high - low) / 2 for low, high in zip(lower_bound_vote, upper_bound_vote)],
    ),
))

fig_vote.update_layout(
    barmode="group",
    width=700,
    height=500,
    xaxis_title="Events",
    yaxis_title="Mean Vote Average",
    title="Mean Vote Average for Movies from Different Events with 95% CI",
    title_x=0.5,
)

fig_vote.show()
# fig_vote.write_image("./plots/average_vote_average_bar.png")
# fig_vote.write_html("./plots/average_vote_average_bar.html")

> $ANALYSIS$ : *to complete once events dictionaries are ok*