# Parallel questions study
In this notebook, we study the parallel questions related to the influence of movies on baby names, thereby conducting a global analysis. For consistency, we work mainly with the names that were significantly influenced by the release of a movie, as defined previously in the **influence metric computation**, in order to draw conclusions. The statistical significance level is set to 5% in this analysis.

In [1]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from scipy import stats
import seaborn as sns
from statsmodels.stats.proportion import proportions_ztest
import plotly.graph_objects as go
import plotly.express as px

from dash import dcc, html, dash_table
import dash
from dash.dependencies import Input, Output

## Importing preprocessed datasets

In [2]:
folder_processed_data_path = './data/processed_data/'

# Dataset containing month of release
movie_df = pd.read_csv(os.path.join(folder_processed_data_path, 'movie_df.csv'))
movie_df.set_index(['wiki_ID'], inplace=True)
display(movie_df)

# Dataset containing p-values and other scraped characteristics of movies
name_by_movie_df = pd.read_csv(os.path.join(folder_processed_data_path, 'name_by_movie_ordered_pvalue_10_5_df.csv'))
name_by_movie_df.set_index(['wiki_ID'], inplace=True)
display(name_by_movie_df)

# Dataset containing movie genre
movie_genres_df = pd.read_csv(os.path.join(folder_processed_data_path, 'movie_genres_df.csv'))
movie_genres_df.set_index(['wiki_ID'], inplace=True)
display(movie_genres_df)

# Selection of significance level
alpha = 0.05

wiki_ID,mov_name,year,month,revenue,numVotes,averageRating
975900,Ghosts of Mars,2001,8.0,14010832.0,56880,4.9
3196793,Getting Away with Murder: The JonBenét Ramsey ...,2000,2.0,,69,6.0
28463795,Brun bitter,1988,,,40,5.6
9363483,White Of The Eye,1987,,,2891,6.1
261236,A Woman in Flames,1983,,,623,5.9
...,...,...,...,...,...,...
35228177,Mermaids: The Body Found,2011,3.0,,1711,4.6
34980460,Knuckle,2011,1.0,,3192,6.8
9971909,Another Nice Mess,1972,9.0,,111,5.8
913762,The Super Dimension Fortress Macross II: Lover...,1992,5.0,,657,6.0


wiki_ID,char_words,order,gender,t_stat,p_value,slope_change
3217,Gold,6.0,,,,0.000000
3217,Linda,7.0,F,-0.416786,0.684853,0.000673
3217,Henry,4.0,M,-2.031668,0.067058,0.002513
3217,Duke,4.0,M,0.579441,0.573967,-0.000113
3217,Warrior,9.0,M,,,0.000000
...,...,...,...,...,...,...
37478048,Ajay,9.0,M,-0.819213,0.430057,0.000130
37501922,Murphy,3.0,F,1.264175,0.232298,-0.000365
37501922,Hunter,1.0,M,-7.083089,0.000020,0.036603
37501922,John,1.0,M,-2.172964,0.052505,0.012557


wiki_ID,genre
330,Comedy-drama
330,Drama
3217,Action
3217,Comedy
3217,Time travel
...,...
37476824,Crime Comedy
37476824,Caper story
37476824,Crime Fiction
37478048,Comedy film


### Description and implementation of frequently used dataframes throughout the study

First, we need to generate the dataframes that will allow us to address the study questions. To do that, we merge the dataframes loaded above. Here is a list of the main dataframes used in the notebook:

`name_by_movie_df`: dataframe with names, p_value, slope_change

`movie_df`: dataframe with film characteristics

`movie_genres_df`: dataframe with movie genre

`name_by_movie_aggregate_df`: `name_by_movie_df` + `movie_df`: dataframe with names, p_value, slope change + film characteristics

`movie_genre_aggregate_df`: `name_by_movie_df` + `movie_genres_df`: dataframe with names, p_value, slope change + movie genre for each name in a movie

`movie_genre_aggregate_with_years_df`: dataframe with names, p_value, slope change + movie genre for each name in a movie + years

In [3]:
# First, merge the p-value dataframe with the dataframe containing the movie genres
# An outer merge pairs each name of each film with every genre the film is associated with
movie_genre_aggregate_df = name_by_movie_df.merge(movie_genres_df, how='outer', left_on='wiki_ID', right_on='wiki_ID')
movie_genre_aggregate_df.head(25)

wiki_ID,char_words,order,gender,t_stat,p_value,slope_change,genre
3217,Gold,6.0,,,,0.0,Action
3217,Gold,6.0,,,,0.0,Comedy
3217,Gold,6.0,,,,0.0,Time travel
3217,Gold,6.0,,,,0.0,Black comedy
3217,Gold,6.0,,,,0.0,Zombie Film
3217,Gold,6.0,,,,0.0,Horror Comedy
3217,Gold,6.0,,,,0.0,Action/Adventure
3217,Gold,6.0,,,,0.0,Costume drama
3217,Gold,6.0,,,,0.0,Stop motion
3217,Gold,6.0,,,,0.0,Horror
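
As a minimal sketch of what the outer merge does, the toy example below pairs every name of a movie with every genre of that movie. The `toy_names` and `toy_genres` dataframes and their values are illustrative assumptions, not part of the study data.

```python
import pandas as pd

# Toy stand-ins for name_by_movie_df and movie_genres_df (illustrative values only)
toy_names = pd.DataFrame(
    {"wiki_ID": [1, 1, 2], "char_words": ["Linda", "Henry", "Hunter"]}
)
toy_genres = pd.DataFrame(
    {"wiki_ID": [1, 1, 3], "genre": ["Action", "Comedy", "Drama"]}
)

# Outer merge: each name of movie 1 is repeated once per genre of movie 1.
# Movie 2 has no genre and movie 3 has no name, so they appear with NaN.
toy_merged = toy_names.merge(toy_genres, how="outer", on="wiki_ID")
print(toy_merged)
```

Rows with a missing name or a missing genre are kept by the outer merge, which is why some rows of the merged dataframes contain NaN in those columns.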


## <span style="color:green">Question 1: Month of release</span>

Studies suggest that baby conception rates are highest in the fall and winter, leading to more births in the summer. Do movies released in summer therefore show the highest correlation with newborn naming?

To study this question, we divide the movies by season of release and look at the seasonal/monthly proportion of influenced names with respect to all the names considered. Then, we look at the average influence over all the movies for the 4 seasons.
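
The split by season can also be expressed as a single month-to-season mapping. The sketch below is one possible way to do it, using the same month groups as in this notebook; the names `season_months`, `month_to_season` and `toy_months` are hypothetical.

```python
import pandas as pd

# Same month groups as used in this study (months are stored as floats)
season_months = {
    "summer": [6.0, 7.0, 8.0],
    "fall": [9.0, 10.0, 11.0],
    "winter": [12.0, 1.0, 2.0],
    "spring": [3.0, 4.0, 5.0],
}

# Invert the dictionary: month -> season
month_to_season = {
    month: season for season, months in season_months.items() for month in months
}

# Toy release months, including a missing one
toy_months = pd.Series([8.0, 2.0, None, 10.0, 5.0])
toy_seasons = toy_months.map(month_to_season)
print(toy_seasons.tolist())
```

Movies without a release month get a missing season, so they drop out of any per-season selection.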

In [4]:
# First, merge the p-value dataframe with the dataframe containing the release month
name_by_movie_aggregate_df = name_by_movie_df.merge(movie_df, how='left', left_on='wiki_ID', right_on='wiki_ID')
display(name_by_movie_aggregate_df)

wiki_ID,char_words,order,gender,t_stat,p_value,slope_change,mov_name,year,month,revenue,numVotes,averageRating
3217,Gold,6.0,,,,0.000000,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Linda,7.0,F,-0.416786,0.684853,0.000673,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Henry,4.0,M,-2.031668,0.067058,0.002513,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Duke,4.0,M,0.579441,0.573967,-0.000113,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Warrior,9.0,M,,,0.000000,Army of Darkness,1992,10.0,21502796.0,191068,7.4
...,...,...,...,...,...,...,...,...,...,...,...,...
37478048,Ajay,9.0,M,-0.819213,0.430057,0.000130,Mr. Bechara,1996,,,395,5.4
37501922,Murphy,3.0,F,1.264175,0.232298,-0.000365,Terminal Bliss,1992,,,245,4.4
37501922,Hunter,1.0,M,-7.083089,0.000020,0.036603,Terminal Bliss,1992,,,245,4.4
37501922,John,1.0,M,-2.172964,0.052505,0.012557,Terminal Bliss,1992,,,245,4.4


We first split the names by the season in which their movie was released.

In [5]:
summer = [6.0, 7.0, 8.0]
fall = [9.0,10.0,11.0]
winter = [12.0,1.0,2.0]
spring = [3.0,4.0,5.0]
summer_movies_df = name_by_movie_aggregate_df[(name_by_movie_aggregate_df['month'].isin(summer))]
fall_movies_df = name_by_movie_aggregate_df[(name_by_movie_aggregate_df['month'].isin(fall))]
winter_movies_df = name_by_movie_aggregate_df[(name_by_movie_aggregate_df['month'].isin(winter))]
spring_movies_df = name_by_movie_aggregate_df[(name_by_movie_aggregate_df['month'].isin(spring))]

display(summer_movies_df)
display(fall_movies_df)
display(winter_movies_df)
display(spring_movies_df)

wiki_ID,char_words,order,gender,t_stat,p_value,slope_change,mov_name,year,month,revenue,numVotes,averageRating
3746,Deckard,0.0,M,,,0.000000,Blade Runner,1982,6.0,33139618.0,804384,8.1
3746,Eldon,8.0,M,-0.454573,0.658256,0.000106,Blade Runner,1982,6.0,33139618.0,804384,8.1
3746,Lewis,12.0,M,-1.014454,0.332160,0.000707,Blade Runner,1982,6.0,33139618.0,804384,8.1
3746,Bear,11.0,M,0.181738,0.859094,-0.000003,Blade Runner,1982,6.0,33139618.0,804384,8.1
3746,Leon,7.0,M,0.758120,0.464312,-0.000544,Blade Runner,1982,6.0,33139618.0,804384,8.1
...,...,...,...,...,...,...,...,...,...,...,...,...
36699915,Luke,5.0,M,0.216557,0.832517,-0.001600,Percy Jackson & the Olympians: Sea of Monsters,2013,8.0,,123248,5.7
36699915,Underwood,1.0,M,,,0.000000,Percy Jackson & the Olympians: Sea of Monsters,2013,8.0,,123248,5.7
36699915,Chase,2.0,F,1.559383,0.147195,-0.011920,Percy Jackson & the Olympians: Sea of Monsters,2013,8.0,,123248,5.7
36699915,Circe,,F,0.394402,0.700823,-0.000015,Percy Jackson & the Olympians: Sea of Monsters,2013,8.0,,123248,5.7


wiki_ID,char_words,order,gender,t_stat,p_value,slope_change,mov_name,year,month,revenue,numVotes,averageRating
3217,Gold,6.0,,,,0.000000,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Linda,7.0,F,-0.416786,0.684853,0.000673,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Henry,4.0,M,-2.031668,0.067058,0.002513,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Duke,4.0,M,0.579441,0.573967,-0.000113,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Warrior,9.0,M,,,0.000000,Army of Darkness,1992,10.0,21502796.0,191068,7.4
...,...,...,...,...,...,...,...,...,...,...,...,...
37322106,Major,0.0,M,-1.922979,0.080743,0.002631,Jab Tak Hai Jaan,2012,11.0,,58012,6.7
37373877,Beth,5.0,F,-0.810731,0.434710,0.000273,Crazy Eights,2006,10.0,,3338,3.8
37373877,Patterson,5.0,F,-0.539253,0.600457,0.000041,Crazy Eights,2006,10.0,,3338,3.8
37373877,Jennifer,0.0,F,-0.395613,0.699955,0.003273,Crazy Eights,2006,10.0,,3338,3.8


wiki_ID,char_words,order,gender,t_stat,p_value,slope_change,mov_name,year,month,revenue,numVotes,averageRating
3837,Lamarr,3.0,M,0.272089,0.790593,-0.000041,Blazing Saddles,1974,2.0,119500000.0,147934,7.7
3837,Van,11.0,M,-1.222164,0.247188,0.000463,Blazing Saddles,1974,2.0,119500000.0,147934,7.7
3837,Bart,0.0,M,0.186272,0.855622,-0.000167,Blazing Saddles,1974,2.0,119500000.0,147934,7.7
3837,Lyle,6.0,M,-0.150477,0.883112,0.000103,Blazing Saddles,1974,2.0,119500000.0,147934,7.7
3837,Buddy,18.0,M,0.041667,0.967511,-0.000017,Blazing Saddles,1974,2.0,119500000.0,147934,7.7
...,...,...,...,...,...,...,...,...,...,...,...,...
36956792,Kid,18.0,M,,,0.000000,The Water Horse: Legend of the Deep,2007,12.0,103071443.0,42523,6.4
36956792,Charlie,5.0,M,-5.446114,0.000202,0.006446,The Water Horse: Legend of the Deep,2007,12.0,103071443.0,42523,6.4
36956792,Beach,18.0,M,,,0.000000,The Water Horse: Legend of the Deep,2007,12.0,103071443.0,42523,6.4
36956792,Walker,8.0,M,-0.597936,0.561991,0.000593,The Water Horse: Legend of the Deep,2007,12.0,103071443.0,42523,6.4


wiki_ID,char_words,order,gender,t_stat,p_value,slope_change,mov_name,year,month,revenue,numVotes,averageRating
4560,Morrison,19.0,M,-1.433674,0.179473,0.000053,Braveheart,1995,5.0,211409945.0,1072580,8.3
4560,Edward,3.0,M,-0.358692,0.726615,0.000825,Braveheart,1995,5.0,211409945.0,1072580,8.3
4560,Campbell,5.0,M,-1.732399,0.111109,0.000489,Braveheart,1995,5.0,211409945.0,1072580,8.3
4560,Murron,1.0,F,,,0.000000,Braveheart,1995,5.0,211409945.0,1072580,8.3
4560,William,0.0,M,-3.378640,0.006157,0.015610,Braveheart,1995,5.0,211409945.0,1072580,8.3
...,...,...,...,...,...,...,...,...,...,...,...,...
36814246,Girl,4.0,F,,,0.000000,Eraserhead,1977,3.0,7000000.0,124128,7.3
36814246,Mary,1.0,F,-2.783137,0.017804,0.041502,Eraserhead,1977,3.0,7000000.0,124128,7.3
36814246,Beautiful,4.0,F,,,0.000000,Eraserhead,1977,3.0,7000000.0,124128,7.3
36814246,Hall,4.0,F,-1.055993,0.313613,0.000021,Eraserhead,1977,3.0,7000000.0,124128,7.3


### Proportion of influenced names per season

We compute the proportion of influenced names per season, so that we can then test whether the proportions differ.

In [6]:
prop_summer = len(summer_movies_df[summer_movies_df['p_value']<alpha])/len(summer_movies_df['p_value'])
display(prop_summer)
prop_fall = len(fall_movies_df[fall_movies_df['p_value']<alpha])/len(fall_movies_df['p_value'])
display(prop_fall)
prop_winter = len(winter_movies_df[winter_movies_df['p_value']<alpha])/len(winter_movies_df['p_value'])
display(prop_winter)
prop_spring = len(spring_movies_df[spring_movies_df['p_value']<alpha])/len(spring_movies_df['p_value'])
display(prop_spring)

0.09862273304463907

0.1044368939262419

0.10103536528617962

0.09722030147974
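
Note that the proportions above divide by the number of all names of the season, including the names for which no p-value could be computed. The sketch below illustrates this convention on toy values and contrasts it with an alternative one that only counts tested names; the alternative is an assumption shown for comparison, not the convention used in this notebook.

```python
import numpy as np
import pandas as pd

alpha = 0.05

# Toy p-values: two significant, one not significant, two missing
toy_p = pd.Series([0.01, 0.20, np.nan, 0.03, np.nan])

# Convention used above: names without a p-value stay in the denominator
prop_all_names = (toy_p < alpha).sum() / len(toy_p)

# Alternative convention (assumption): only names with a p-value are counted
prop_tested_names = (toy_p < alpha).sum() / toy_p.notna().sum()

print(prop_all_names, prop_tested_names)
```

Comparisons with NaN evaluate to False, so missing p-values are never counted as influenced in either convention.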

We use a statistical test to assess whether the proportions differ between seasons.

H0: The proportions are all equal, i.e. no release season affects baby naming more than the others.

In [7]:
from scipy.stats import chi2_contingency

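# Organize the data into a table with one row per season: [influenced names, all names]
# Note: the second column is the season total, so it also contains the influenced names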
observed_data = [
    [len(summer_movies_df[summer_movies_df['p_value'] < alpha]), len(summer_movies_df['p_value'])],
    [len(fall_movies_df[fall_movies_df['p_value'] < alpha]), len(fall_movies_df['p_value'])],
    [len(winter_movies_df[winter_movies_df['p_value'] < alpha]), len(winter_movies_df['p_value'])],
    [len(spring_movies_df[spring_movies_df['p_value'] < alpha]), len(spring_movies_df['p_value'])]
]

# Perform the chi-squared test
chi2, p, _, _ = chi2_contingency(observed_data)

# Print the results
print("Chi-squared value:", chi2)
print("P-value:", p)

Chi-squared value: 10.22583914103114
P-value: 0.01674082055652168


Since the p-value (0.017) is below 0.05, we can reject, at the 5% significance level, the null hypothesis that the proportion of influenced names is the same across seasons (chi-squared statistic of 10.23).
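
The test above uses a table whose second column is the season total. A common alternative layout for a chi-squared test of homogeneity uses disjoint categories, i.e. influenced versus non-influenced names. The sketch below shows that layout on made-up counts; the `toy_*` names and numbers are illustrative assumptions, not results of this study.

```python
from scipy.stats import chi2_contingency

# Illustrative counts only: influenced names and total names for 4 seasons
toy_influenced = [10, 20, 30, 40]
toy_total = [100, 200, 300, 400]

# One row per season, columns = [influenced, not influenced]
toy_table = [[k, n - k] for k, n in zip(toy_influenced, toy_total)]

toy_chi2, toy_pvalue, toy_dof, toy_expected = chi2_contingency(toy_table)
print("Chi-squared value:", toy_chi2)
print("P-value:", toy_pvalue)
print("Degrees of freedom:", toy_dof)
```

In this toy table every season has the same proportion of influenced names, so the observed counts equal the expected ones and the test finds no difference.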

### Proportion of influenced names per season per year

We now visualize how the percentage of significantly influenced names varies per season of release and per year. We also look at the mean magnitude of the influence per season over the years, with the corresponding 95% confidence interval.
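
The confidence intervals in this notebook are built as mean ± 1.96 × standard error, which relies on a normal approximation. A minimal sketch on toy slope changes (the values and the `toy_*` names are illustrative assumptions):

```python
import pandas as pd

# Toy slope changes; the magnitude is the absolute value
toy_slopes = pd.Series([0.01, -0.02, 0.03, 0.04]).abs()

toy_mean = toy_slopes.mean()
toy_se = toy_slopes.sem()  # standard error of the mean (sample std / sqrt(n))

# 95% confidence interval under a normal approximation
toy_ci = (toy_mean - 1.96 * toy_se, toy_mean + 1.96 * toy_se)
print(toy_mean, toy_se, toy_ci)
```

With few names in a given year the normal approximation is rough, so the intervals for early years should be read with caution.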

In [8]:
# Group by year, then compute yearly statistics on the significantly influenced names
def seasonal_filter(season_df):
    season_df_sorted = season_df.groupby('year').apply(lambda x: pd.Series({
        'avg': x[x['p_value']<alpha]['slope_change'].dropna().abs().mean(),
        'se': x[x['p_value']<alpha]['slope_change'].dropna().abs().sem(),
        'nb_names':  x[x['p_value']<alpha]['p_value'].count(),
        'prop_influenced': len(x[x['p_value']<alpha])/len(x['p_value'])
    }))
    season_df_sorted.reset_index(inplace=True)
    return season_df_sorted

summer_movies_df_sorted = seasonal_filter(summer_movies_df)
display(summer_movies_df_sorted)
fall_movies_df_sorted = seasonal_filter(fall_movies_df)
winter_movies_df_sorted = seasonal_filter(winter_movies_df)
spring_movies_df_sorted = seasonal_filter(spring_movies_df)

Unnamed: 0,year,avg,se,nb_names,prop_influenced
0,1895,,,0.0,0.000000
1,1898,,,0.0,0.000000
2,1909,,,0.0,0.000000
3,1910,,,0.0,0.000000
4,1912,0.063773,,1.0,0.055556
...,...,...,...,...,...
98,2009,0.010280,0.000815,164.0,0.115656
99,2010,0.008099,0.000559,233.0,0.150809
100,2011,0.009004,0.000792,216.0,0.137843
101,2012,0.010322,0.001228,82.0,0.132686


Variation of influenced names per movie release season per year

In [9]:
# Create the main figure
fig = go.Figure()

# Define the main line plot for each season
seasons = ['Summer', 'Fall', 'Winter', 'Spring']
colors = ['red', 'orange', 'blue', 'green']
data = [summer_movies_df_sorted, fall_movies_df_sorted, winter_movies_df_sorted, spring_movies_df_sorted]

for i, season in enumerate(seasons):
    main_trace = go.Scatter(
        x=data[i]['year'],  
        y=data[i]['prop_influenced'],  
        mode='lines+markers',
        line_shape='linear',
        name=season,
        line=dict(color=colors[i]),
        legendgroup=season,
        #visible=(season == 'Summer')
    )
    
    # Add the main line trace to the figure
    fig.add_trace(main_trace)

# Update the layout
fig.update_layout(
    title='Yearly evolution of proportion of influenced baby names per season of release',
    xaxis_title='Year',
    yaxis_title='Proportion of influenced names'
)

# Show the figure
fig.show()


In the above plot, there does not seem to be a large difference between the seasons of movie release. We can now look at the monthly variation in proportion to assess whether a possible effect is more pronounced for individual months than for seasons.

### Monthly proportion of influenced names per year

In [10]:
monthly_prop_df = name_by_movie_aggregate_df.groupby(['year','month']).apply(lambda x: pd.Series({
    'prop_significant': len(x[x['p_value']<alpha])/len(x['p_value'])
}))
monthly_prop_df_reset = monthly_prop_df.reset_index()
display(monthly_prop_df)

year,month,prop_significant
1895,8.0,0.000000
1898,7.0,0.000000
1900,11.0,0.000000
1902,9.0,0.000000
1903,1.0,0.000000
...,...,...
2013,8.0,0.176471
2013,9.0,0.111111
2013,10.0,0.038462
2013,11.0,0.200000


In [11]:
# Convert 'month' column to numeric
monthly_prop_df_reset['month'] = pd.to_numeric(monthly_prop_df_reset['month'], errors='coerce')

# Create a line plot using Plotly
fig = px.line(
    monthly_prop_df_reset,
    x='year',
    y='prop_significant',
    color='month',
    markers=True,
    line_shape='linear', 
    labels={'prop_significant': 'Proportion of Significant Values'},
    title='Proportion of Significant influences Over Months for Each Year',
)

# Show the plot
fig.show()


### Number of influenced names per season per year 

The yearly variation of the number of influenced names per season gives a first way to gauge the influence of movie release. Here we show the qualitative effect of the release season; the quantitative effect is studied in the following cells.

In [12]:
# Bar chart plot
fig = go.Figure()
for i, season in enumerate(seasons):
    fig.add_trace(go.Bar(
        x=data[i]['year'],
        y=data[i]['nb_names'],
        name=seasons[i],
        marker_color=colors[i],
        offsetgroup=1
    ))

# Format the plot
fig.update_layout(
    title = "Evolution of number of influenced names per season of movie release",
    xaxis=dict(title='Year'),
    yaxis=dict(title='# influenced names'),
    barmode='stack'  # 'stack' stacks the bars of each trace
      
)
fig.update_xaxes(range=[1900, 2020])

### Average monthly influence magnitude
After looking at the yearly variation in the proportion of significantly influenced names across seasons/months, we now quantify how the size of the influence differs between months. First, we study the mean magnitude of the slope change for each month (January to December), averaged over all significantly influenced names.
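
The reason for averaging the magnitude rather than the signed slope change is that positive and negative changes cancel out in a signed mean. The sketch below illustrates this on toy data; the `toy` dataframe and the column names of `summary` are illustrative assumptions.

```python
import pandas as pd

# Toy slope changes for two months (illustrative values only)
toy = pd.DataFrame(
    {
        "month": [1.0, 1.0, 2.0, 2.0],
        "slope_change": [0.01, -0.01, 0.02, 0.04],
    }
)

# Signed mean versus mean of absolute values, per month
summary = toy.groupby("month")["slope_change"].agg(
    avg_signed="mean",
    avg_magnitude=lambda s: s.abs().mean(),
)
print(summary)
```

For month 1 the signed mean is zero although both names changed, while the magnitude keeps track of the size of the change regardless of its direction.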

In [13]:
influence_per_month_df = name_by_movie_aggregate_df[name_by_movie_aggregate_df['p_value']<alpha].groupby('month').apply(lambda x: pd.Series ({
    'avg_slope_change_significant_per_month': x['slope_change'].mean(),
    'se_slope_change_significant_per_month': x['slope_change'].sem(),
    'avg_mag_slope_change_significant_per_month': x['slope_change'].abs().mean(),
    'se_mag_slope_change_significant_per_month': x['slope_change'].abs().sem()
}))

display(influence_per_month_df)

month,avg_slope_change_significant_per_month,se_slope_change_significant_per_month,avg_mag_slope_change_significant_per_month,se_mag_slope_change_significant_per_month
1.0,0.004002,0.000612,0.012312,0.000512
2.0,0.0022,0.000877,0.014785,0.000755
3.0,0.001348,0.00077,0.013974,0.000644
4.0,0.002236,0.000776,0.013958,0.000662
5.0,0.001237,0.000762,0.014681,0.000642
6.0,0.002118,0.000853,0.015013,0.000717
7.0,0.001589,0.000834,0.013865,0.000704
8.0,0.001908,0.000777,0.013848,0.000651
9.0,0.002026,0.000572,0.012868,0.000486
10.0,0.003292,0.000687,0.014343,0.000574


In [14]:
fig = go.Figure([
    go.Scatter(
        name='All film influence',
        x=influence_per_month_df.index,
        y=influence_per_month_df['avg_mag_slope_change_significant_per_month'],
        mode='lines',
        line=dict(color='rgb(31, 119, 180)'),
    ),
    go.Scatter(
        name='Upper Bound',
        x=influence_per_month_df.index,
        y=influence_per_month_df['avg_mag_slope_change_significant_per_month']+1.96*influence_per_month_df['se_mag_slope_change_significant_per_month'],
        mode='lines',
        marker=dict(color="#444"),
        line=dict(width=0),
        showlegend=False
    ),
    go.Scatter(
        name='Lower Bound',
        x=influence_per_month_df.index,
        y=influence_per_month_df['avg_mag_slope_change_significant_per_month']-1.96*influence_per_month_df['se_mag_slope_change_significant_per_month'],
        marker=dict(color="#444"),
        line=dict(width=0),
        mode='lines',
        fillcolor='rgba(68, 68, 68, 0.3)',
        fill='tonexty',
        showlegend=False
    )
])
fig.update_layout(
    yaxis_title='Average magnitude influence',
    title='Average general influence of films per month',
    hovermode="x"
)
fig.show()

We now look at the yearly variation of the influence magnitude across seasons.

In [15]:
# Create the main figure
fig = go.Figure()

# Define the main line plot for each season
seasons = ['Summer', 'Fall', 'Winter', 'Spring']
colors = ['red', 'orange', 'blue', 'green']
data = [summer_movies_df_sorted, fall_movies_df_sorted, winter_movies_df_sorted, spring_movies_df_sorted]

for i, season in enumerate(seasons):
    main_trace = go.Scatter(
        x=data[i]['year'],  
        y=data[i]['avg'],  
        mode='lines+markers',
        line_shape='linear',
        name=season,
        line=dict(color=colors[i]),
        legendgroup=season,
        #visible=(season == 'Summer')
    )
    
    # Add the main line trace to the figure
    fig.add_trace(main_trace)
    
    # Calculate confidence interval data
    lower_ci = data[i]['avg'] - 1.96*data[i]['se'] 
    upper_ci = data[i]['avg'] + 1.96*data[i]['se']
    
    # Add the trace for confidence interval
    ci_trace = go.Scatter(
        x=data[i]['year'],
        y=upper_ci,
        mode='lines',
        line=dict(color=colors[i], width=0),
        name=f'{season} 95% CI',
        showlegend=False,
        legendgroup=season,
        #visible=(season == 'Summer')
    )
    
    fig.add_trace(ci_trace)
    
    # Add the filled area between the main line and confidence interval
    fig.add_trace(go.Scatter(
        x=data[i]['year'],
        y=lower_ci,
        mode='lines',
        line=dict(color=colors[i], width=0),
        name=f'{season} 95% CI',
        fill='tonexty',
        #fillcolor=f'rgba{((colors[i]), 0.2)}',
        #showlegend=False,
        #legendgroup=season
        #visible=(season == 'Summer')
    ))

# Update the layout
fig.update_layout(
    title='Evolution of movie influence on baby names per season of release',
    xaxis_title='Year',
    yaxis_title='Average magnitude influence',
    #yaxis=dict(type="log"),
    xaxis=dict(range=[1900, 2020])
)


# Show the figure
fig.show()


### Influence per month per genre
A further question is whether movie genre interacts with the month of release in the influence of movies on baby naming. We plot the mean magnitude of the slope change for each month (January to December) for the 10 most represented movie genres in the dataset.

In [16]:
# Merge dataframe so that we have the genre of a movie, 
# with its month of release and p_value/slope change
movie_genre_caracteristics_aggregate_df = movie_genre_aggregate_df.merge(movie_df, how='left', left_on='wiki_ID', right_on='wiki_ID')
display(movie_genre_caracteristics_aggregate_df)

wiki_ID,char_words,order,gender,t_stat,p_value,slope_change,genre,mov_name,year,month,revenue,numVotes,averageRating
3217,Gold,6.0,,,,0.0,Action,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Gold,6.0,,,,0.0,Comedy,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Gold,6.0,,,,0.0,Time travel,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Gold,6.0,,,,0.0,Black comedy,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Gold,6.0,,,,0.0,Zombie Film,Army of Darkness,1992,10.0,21502796.0,191068,7.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
37241569,,,,,,,Action,Cold War,2012,11.0,,5033,6.6
37476824,,,,,,,Comedy,I Love New Year,2011,,,876,3.4
37476824,,,,,,,Crime Comedy,I Love New Year,2011,,,876,3.4
37476824,,,,,,,Caper story,I Love New Year,2011,,,876,3.4


In [17]:
# Select the 10 most represented movie genres in the dataset
most_representative_genre = movie_genres_df['genre'].value_counts().nlargest(10).index
display(most_representative_genre)


Index(['Drama', 'Comedy', 'Romance Film', 'Thriller', 'Action',
       'Black-and-white', 'World cinema', 'Crime Fiction', 'Indie',
       'Short Film'],
      dtype='object', name='genre')
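
As a minimal sketch of this selection, the toy example below picks the most frequent genres with `value_counts().nlargest()` and filters rows with `isin()`, taking the mask from the same dataframe that is being filtered. The `toy_genres` dataframe and its values are illustrative assumptions.

```python
import pandas as pd

# Toy genre table: one row per (movie, genre) pair (illustrative values only)
toy_genres = pd.DataFrame(
    {
        "wiki_ID": [1, 1, 2, 3, 3, 4],
        "genre": ["Drama", "Comedy", "Drama", "Drama", "Indie", "Comedy"],
    }
)

# Two most represented genres
top_genres = toy_genres["genre"].value_counts().nlargest(2).index

# Build the mask from the dataframe being filtered, so rows stay aligned
mask = toy_genres["genre"].isin(top_genres)
filtered = toy_genres[mask]
print(filtered)
```

Building the mask from the filtered dataframe itself avoids relying on two different dataframes having exactly the same row order.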

In [18]:
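# Build both conditions from the dataframe being filtered so that the masks are aligned
display(movie_genre_caracteristics_aggregate_df[(movie_genre_caracteristics_aggregate_df['p_value']<alpha) & (movie_genre_caracteristics_aggregate_df['genre'].isin(most_representative_genre))])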

wiki_ID,char_words,order,gender,t_stat,p_value,slope_change,genre,mov_name,year,month,revenue,numVotes,averageRating
3837,Jim,1.0,M,-2.715964,0.020076,0.006715,Comedy,Blazing Saddles,1974,2.0,119500000.0,147934,7.7
3947,Hunter,13.0,M,-4.938567,0.000444,0.002903,Thriller,Blue Velvet,1986,8.0,8551228.0,210543,7.7
3947,Hunter,13.0,M,-4.938567,0.000444,0.002903,Crime Fiction,Blue Velvet,1986,8.0,8551228.0,210543,7.7
4231,Jennifer,6.0,F,-2.455800,0.031916,0.088259,Action,Buffy the Vampire Slayer,1992,7.0,16624456.0,48386,5.7
4231,Jennifer,6.0,F,-2.455800,0.031916,0.088259,Comedy,Buffy the Vampire Slayer,1992,7.0,16624456.0,48386,5.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
36814246,Man,6.0,M,-4.220301,0.001436,0.000037,Indie,Eraserhead,1977,3.0,7000000.0,124128,7.3
36814246,Mary,1.0,F,-2.783137,0.017804,0.041502,Black-and-white,Eraserhead,1977,3.0,7000000.0,124128,7.3
36814246,Mary,1.0,F,-2.783137,0.017804,0.041502,Drama,Eraserhead,1977,3.0,7000000.0,124128,7.3
36814246,Mary,1.0,F,-2.783137,0.017804,0.041502,Indie,Eraserhead,1977,3.0,7000000.0,124128,7.3


In [19]:
influence_per_month_per_genre_df = movie_genre_caracteristics_aggregate_df[(movie_genre_caracteristics_aggregate_df['p_value']<alpha) & (movie_genre_caracteristics_aggregate_df['genre'].isin(most_representative_genre))].groupby(['genre','month']).apply(lambda x: pd.Series ({
    'avg_slope_change_significant_per_month': x['slope_change'].mean(),
    'se_slope_change_significant_per_month': x['slope_change'].sem(),
    'avg_mag_slope_change_significant_per_month': x['slope_change'].abs().mean(),
    'se_mag_slope_change_significant_per_month': x['slope_change'].abs().sem()
}))

display(influence_per_month_per_genre_df)

genre,month,avg_slope_change_significant_per_month,se_slope_change_significant_per_month,avg_mag_slope_change_significant_per_month,se_mag_slope_change_significant_per_month
Action,1.0,0.005842,0.001404,0.011863,0.001166
Action,2.0,0.001324,0.001430,0.011918,0.001177
Action,3.0,0.002122,0.001996,0.015558,0.001679
Action,4.0,0.001943,0.001682,0.012363,0.001469
Action,5.0,0.000159,0.001708,0.013627,0.001476
...,...,...,...,...,...
World cinema,8.0,0.001728,0.001422,0.006549,0.001170
World cinema,9.0,0.002973,0.001268,0.008708,0.001029
World cinema,10.0,0.005847,0.001060,0.006478,0.000980
World cinema,11.0,-0.002937,0.002664,0.007640,0.002382


In [20]:
# fig = px.line(influence_per_month_df, x=influence_per_month_df.index, y="avg_mag_slope_change_significant_per_month", title='Mean influence per month over all films')
# fig.show()
#error_y='se_mag_slope_change_significant_per_month'

fig = px.line(influence_per_month_per_genre_df.reset_index(), x='month', y='avg_mag_slope_change_significant_per_month', color='genre',
              labels={'avg_mag_slope_change_significant_per_month': 'Average Magnitude of Slope Change'})

fig.update_layout(title='Average Influence per Month per Movie Genre',
                  xaxis_title='Month',
                  yaxis_title='Average magnitude influence')

fig.show()

In [21]:
# Create the main figure
fig = go.Figure()

# Define the main line plot for each genre (the `seasons`/`season` names are reused from the cells above and hold genres here)
seasons = ['Drama', 'Comedy', 'Romance Film', 'Thriller', 'Action',
       'Black-and-white', 'World cinema', 'Crime Fiction', 'Indie',
       'Short Film']
colors = ['brown', 'lightgreen', 'pink', 'green', 'blue', 'black', 'lightblue', 'red', 'orange', 'yellow']
#data = [summer_movies_df_sorted, fall_movies_df_sorted, winter_movies_df_sorted, spring_movies_df_sorted]
influence_per_month_per_genre_df.reset_index(inplace=True)

for i, season in enumerate(seasons):
    data = influence_per_month_per_genre_df[influence_per_month_per_genre_df['genre'] == season]
    main_trace = go.Scatter(
        x=data['month'],
        y=data['avg_mag_slope_change_significant_per_month'],
        mode='lines+markers',
        line_shape='linear',
        name=season,
        line=dict(color=colors[i]),
        legendgroup=season,
        #visible=(season == 'Summer')
    )
    
    # Add the main line trace to the figure
    fig.add_trace(main_trace)
    
    # Calculate confidence interval data
    lower_ci = data['avg_mag_slope_change_significant_per_month'] - 1.96*data['se_mag_slope_change_significant_per_month']
    upper_ci = data['avg_mag_slope_change_significant_per_month'] + 1.96*data['se_mag_slope_change_significant_per_month']
    
    # Add the trace for confidence interval
    ci_trace = go.Scatter(
        x=data['month'],
        y=upper_ci,
        mode='lines',
        line=dict(color=colors[i], width=0),
        name=f'{season} 95% CI',
        showlegend=False,
        legendgroup=season,
        #visible=(season == 'Summer')
    )
    
    fig.add_trace(ci_trace)
    
    # Add the filled area between the main line and confidence interval
    fig.add_trace(go.Scatter(
        x=data['month'],
        y=lower_ci,
        mode='lines',
        line=dict(color=colors[i], width=0),
        name=f'{season} 95% CI',
        fill='tonexty',
        #fillcolor=f'rgba{((colors[i]), 0.2)}',  # Adjust the transparency as needed
        #showlegend=False,
        legendgroup=season
        #visible=(season == 'Summer')
    ))

# Update the layout
fig.update_layout(
    title='Monthly evolution of movie genre influence',
    xaxis_title='Months',
    yaxis_title='Average magnitude influence'
)

# Show the figure
fig.show()
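
The shaded band in the figure above is a normal-approximation 95% confidence interval, i.e. mean ± 1.96 × standard error. As a minimal sketch of that computation, the snippet below applies it to made-up values; the helper name `ci_bounds` and the toy numbers are illustrative assumptions and are not taken from the notebook's data.

```python
import pandas as pd

def ci_bounds(mean: pd.Series, se: pd.Series, z: float = 1.96):
    """Return (lower, upper) bounds of a normal-approximation confidence interval."""
    return mean - z * se, mean + z * se

# Toy values, for illustration only
toy = pd.DataFrame({
    'month': [1, 2, 3],
    'avg': [0.010, 0.012, 0.008],
    'se': [0.001, 0.002, 0.0],
})
lower, upper = ci_bounds(toy['avg'], toy['se'])
toy['lower_ci'], toy['upper_ci'] = lower, upper
print(toy)
```

When the standard error is 0 (as in the last toy row), the band collapses onto the mean. This is worth keeping in mind when a month contains very few data points.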


## <span style="color:green">Question 2: Does movie genre have an impact?</span>

In [22]:
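# Intended to drop duplicates, i.e. rows that share the same wiki_ID, genre and char_words
# Note: as written, drop_duplicates runs in place on the temporary frame returned by reset_index(),
# so movie_genre_aggregate_df itself is left unchanged (and the subset does not include char_words)
movie_genre_aggregate_df.reset_index().drop_duplicates(subset=['genre', 'wiki_ID'], inplace=True)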
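
The behaviour of the call above can be checked on a toy frame. The sketch below assumes that the intended key is (`genre`, `wiki_ID`, `char_words`), as the comment suggests. The data are made up, and the variable names `toy` and `deduped` are hypothetical.

```python
import pandas as pd

# Toy frame indexed by wiki_ID, with one exact duplicate row (illustrative data)
toy = pd.DataFrame(
    {'char_words': ['Tim', 'Tim', 'Ted'],
     'genre': ['Absurdism', 'Absurdism', 'Absurdism']},
    index=pd.Index([19701, 19701, 46505], name='wiki_ID'),
)

# In-place call on the temporary returned by reset_index(): `toy` is untouched
toy.reset_index().drop_duplicates(subset=['genre', 'wiki_ID', 'char_words'], inplace=True)
rows_after_inplace = len(toy)

# Assigning the result back is what actually removes the duplicate
deduped = (
    toy.reset_index()
    .drop_duplicates(subset=['genre', 'wiki_ID', 'char_words'])
    .set_index('wiki_ID')
)
print(rows_after_inplace, len(deduped))
```

Applying the assignment form to the real data may change the counts reported in the cells that follow, so it should be re-run end to end rather than patched in isolation.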

First groupby test: can be removed when cleaning notebook

In [23]:
name_by_genre_significant_df = movie_genre_aggregate_df.groupby('genre').apply(lambda x: x[x['p_value'] < alpha])
display(name_by_genre_significant_df)

Unnamed: 0_level_0,Unnamed: 1_level_0,char_words,order,gender,t_stat,p_value,slope_change,genre
genre,wiki_ID,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Absurdism,19701,Tim,1.0,M,-4.091383,0.001785,0.010195,Absurdism
Absurdism,46505,Ted,0.0,M,-2.225789,0.047878,0.001111,Absurdism
Absurdism,46505,Johnny,10.0,M,-2.226029,0.047858,0.002591,Absurdism
Absurdism,75261,Robert,9.0,M,-3.585998,0.004273,0.053382,Absurdism
Absurdism,75261,Dave,19.0,M,-2.481872,0.030472,0.001477,Absurdism
...,...,...,...,...,...,...,...,...
Zombie Film,28362996,Burke,,M,-2.804857,0.017125,0.000248,Zombie Film
Zombie Film,30430079,Holly,1.0,F,-2.884757,0.014844,0.003713,Zombie Film
Zombie Film,33432215,Sarah,7.0,F,-6.204518,0.000067,0.021320,Zombie Film
Zombie Film,33432215,Mack,4.0,M,-2.626380,0.023559,0.001174,Zombie Film


For some movie genres, the SEM computation returns NaN while the mean computation does not. We check why below.

 ANSWER: this happens when only one data point remains for a given movie genre after the groupby and filtering. The mean of a single value is defined, but its standard error is not.
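
A minimal sketch of this behaviour on made-up values: with a single observation, `Series.mean()` returns that value while `Series.sem()` returns NaN; with no observation at all, both are NaN. This would be consistent with the SEM columns showing more NaN values than the mean columns in the summary further down.

```python
import pandas as pd

empty = pd.Series([], dtype=float)       # no significant name in the genre
one_point = pd.Series([0.0123])          # exactly one significant name
two_points = pd.Series([0.0123, 0.0045]) # at least two significant names

for label, s in [('empty', empty), ('one point', one_point), ('two points', two_points)]:
    # mean() needs at least one value, sem() needs at least two
    print(label, s.mean(), s.sem())
```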

In [24]:
name_by_genre_significant_df.loc['Acid western']

Unnamed: 0_level_0,char_words,order,gender,t_stat,p_value,slope_change,genre
wiki_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
113651,William,0.0,M,-3.37864,0.006157,0.01561,Acid western
413426,Walker,0.0,M,-3.102857,0.010053,0.000597,Acid western
5579768,Jake,0.0,M,-2.400606,0.035195,0.001247,Acid western
6415208,Matthew,2.0,M,-2.213936,0.048881,0.030503,Acid western


In [25]:
# Number of distinct films per genre
display(movie_genre_aggregate_df.reset_index().groupby('genre')['wiki_ID'].nunique())

# Sanity check on "Acid western": 9 distinct films, spread over 32 rows (one row per film/name pair)
display(movie_genre_aggregate_df[movie_genre_aggregate_df['genre'] == 'Acid western'])
display(len(movie_genre_aggregate_df[movie_genre_aggregate_df['genre'] == 'Acid western']))

# Number of distinct names (char_words) in a genre, here "Acid western"
display(movie_genre_aggregate_df[movie_genre_aggregate_df['genre'] == 'Acid western']['char_words'].nunique())

genre
Absurdism             91
Acid western           9
Action              7859
Action Comedy        162
Action Thrillers     497
                    ... 
World History         20
World cinema        7073
Wuxia                115
Z movie                3
Zombie Film          266
Name: wiki_ID, Length: 363, dtype: int64

Unnamed: 0_level_0,char_words,order,gender,t_stat,p_value,slope_change,genre
wiki_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
113651,Blake,0.0,M,2.090286,0.060611,-0.006569,Acid western
113651,Cole,3.0,M,-1.087006,0.300279,0.00858,Acid western
113651,William,0.0,M,-3.37864,0.006157,0.01561,Acid western
113651,Marvin,,,-0.113402,0.911755,8.2e-05,Acid western
113651,Thel,11.0,F,,,0.0,Acid western
113651,Charlie,9.0,M,-1.241759,0.240151,0.000651,Acid western
113651,Tench,10.0,M,,,0.0,Acid western
113651,Russell,11.0,F,-1.39867,0.189469,0.001816,Acid western
113651,Conway,4.0,M,-0.201327,0.844117,1.9e-05,Acid western
113651,John,6.0,M,-1.018001,0.330546,0.005455,Acid western


32

28

In [26]:
# Compute the proportion of significantly impacted names by genre
# The non-significant and NaN proportions are also computed as a sanity check
name_by_genre_prop_df = movie_genre_aggregate_df.groupby('genre').apply(lambda x: pd.Series({
        # Number of rows (film/name pairs) in the genre; note that this counts rows, not distinct films
        'nb_films_in_genre': x.reset_index()['wiki_ID'].count(),
        # Number of non-missing names in the genre (a name that appears several times is counted each time)
        'nb_names_in_genre': x['char_words'].count(),
        # Number of names in the genre with p_value < alpha
        'nb_names_signi_in_genre': x[x['p_value'] < alpha]['char_words'].count(),
        # Number of significant names divided by the number of rows in the genre
        'prop_names_signi_in_genre_per_total_film_in_genre': (x[x['p_value'] < alpha]['char_words'].count())/(x.reset_index()['wiki_ID'].count()),
        'is_na_sum': x['slope_change'].isna().sum(),
        'prop_signif_per_genre': (x['p_value'] < alpha).sum()/len(x['p_value']),
        'prop_non_signi': (x['p_value'] > alpha).sum()/len(x['p_value']),
        'prop_nan': (x['p_value'].isna()).sum()/len(x['p_value']),
        'avg_slope_change_significant': x[x['p_value'] < alpha]['slope_change'].mean(),
        'se_slope_change_significant': x[x['p_value'] < alpha]['slope_change'].sem(),
        'avg_mag_slope_change_significant': x[x['p_value'] < alpha]['slope_change'].abs().mean(),
        'se_mag_slope_change_significant': x[x['p_value'] < alpha]['slope_change'].abs().sem(),
        'avg_slope_change_global': x['slope_change'].mean()
    }))
display(name_by_genre_prop_df)
#name_by_genre_prop_df.head(50)


Unnamed: 0_level_0,nb_films_in_genre,nb_names_in_genre,nb_names_signi_in_genre,prop_names_signi_in_genre_per_total_film_in_genre,is_na_sum,prop_signif_per_genre,prop_non_signi,prop_nan,avg_slope_change_significant,se_slope_change_significant,avg_mag_slope_change_significant,se_mag_slope_change_significant,avg_slope_change_global
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Absurdism,740.0,721.0,64.0,0.086486,19.0,0.086486,0.778378,0.135135,0.003319,0.003524,0.014686,0.003029,0.000370
Acid western,32.0,30.0,4.0,0.125000,2.0,0.125000,0.687500,0.187500,0.011989,0.007077,0.011989,0.007077,0.002101
Action,34780.0,31575.0,2995.0,0.086113,3205.0,0.086113,0.692984,0.220903,0.001634,0.000441,0.013268,0.000370,0.000169
Action Comedy,1036.0,984.0,92.0,0.088803,52.0,0.088803,0.723938,0.187259,-0.000049,0.002079,0.012144,0.001644,0.000061
Action Thrillers,2911.0,2755.0,274.0,0.094126,156.0,0.094126,0.734112,0.171762,-0.000393,0.001593,0.014340,0.001336,0.000197
...,...,...,...,...,...,...,...,...,...,...,...,...,...
World History,20.0,0.0,0.0,0.000000,20.0,0.000000,0.000000,1.000000,,,,,
World cinema,19067.0,15344.0,945.0,0.049562,3723.0,0.049562,0.631353,0.319085,0.000796,0.000640,0.009445,0.000561,-0.000050
Wuxia,215.0,134.0,7.0,0.032558,81.0,0.032558,0.395349,0.572093,0.002028,0.000882,0.002028,0.000882,0.000107
Z movie,3.0,0.0,0.0,0.000000,3.0,0.000000,0.000000,1.000000,,,,,
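
The `nb_films_in_genre` column above is much larger than the per-genre film counts obtained earlier with `nunique()` (for example 32 versus 9 for "Acid western"), because `count()` counts rows rather than distinct films. A small sketch of the difference on toy data follows; the column names `n_rows` and `n_films` are hypothetical.

```python
import pandas as pd

# Toy frame: two films, the first one contributing three character names
toy = pd.DataFrame(
    {'char_words': ['Blake', 'Cole', 'William', 'Walker'],
     'genre': ['Acid western'] * 4},
    index=pd.Index([113651, 113651, 113651, 413426], name='wiki_ID'),
)

per_genre = toy.reset_index().groupby('genre')['wiki_ID'].agg(
    n_rows='count',     # one per (film, name) row
    n_films='nunique',  # distinct films
)
print(per_genre)
```

Which denominator is appropriate depends on the question: rows give a proportion of names, distinct films give an average number of significant names per film.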


In [27]:
display(name_by_genre_prop_df.isna().sum())
# Drop NaN values
name_by_genre_prop_df.dropna(inplace=True)
display(name_by_genre_prop_df)
# Sanity check
name_by_genre_prop_df.isna().sum()

nb_films_in_genre                                     0
nb_names_in_genre                                     0
nb_names_signi_in_genre                               0
prop_names_signi_in_genre_per_total_film_in_genre     0
is_na_sum                                             0
prop_signif_per_genre                                 0
prop_non_signi                                        0
prop_nan                                              0
avg_slope_change_significant                         69
se_slope_change_significant                          88
avg_mag_slope_change_significant                     69
se_mag_slope_change_significant                      88
avg_slope_change_global                              24
dtype: int64

Unnamed: 0_level_0,nb_films_in_genre,nb_names_in_genre,nb_names_signi_in_genre,prop_names_signi_in_genre_per_total_film_in_genre,is_na_sum,prop_signif_per_genre,prop_non_signi,prop_nan,avg_slope_change_significant,se_slope_change_significant,avg_mag_slope_change_significant,se_mag_slope_change_significant,avg_slope_change_global
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Absurdism,740.0,721.0,64.0,0.086486,19.0,0.086486,0.778378,0.135135,0.003319,0.003524,0.014686,0.003029,0.000370
Acid western,32.0,30.0,4.0,0.125000,2.0,0.125000,0.687500,0.187500,0.011989,0.007077,0.011989,0.007077,0.002101
Action,34780.0,31575.0,2995.0,0.086113,3205.0,0.086113,0.692984,0.220903,0.001634,0.000441,0.013268,0.000370,0.000169
Action Comedy,1036.0,984.0,92.0,0.088803,52.0,0.088803,0.723938,0.187259,-0.000049,0.002079,0.012144,0.001644,0.000061
Action Thrillers,2911.0,2755.0,274.0,0.094126,156.0,0.094126,0.734112,0.171762,-0.000393,0.001593,0.014340,0.001336,0.000197
...,...,...,...,...,...,...,...,...,...,...,...,...,...
Women in prison films,62.0,52.0,7.0,0.112903,10.0,0.112903,0.645161,0.241935,0.017780,0.010870,0.020819,0.009930,0.002551
Workplace Comedy,677.0,654.0,76.0,0.112260,23.0,0.112260,0.776957,0.110783,0.002557,0.001460,0.008550,0.001116,0.000441
World cinema,19067.0,15344.0,945.0,0.049562,3723.0,0.049562,0.631353,0.319085,0.000796,0.000640,0.009445,0.000561,-0.000050
Wuxia,215.0,134.0,7.0,0.032558,81.0,0.032558,0.395349,0.572093,0.002028,0.000882,0.002028,0.000882,0.000107


nb_films_in_genre                                    0
nb_names_in_genre                                    0
nb_names_signi_in_genre                              0
prop_names_signi_in_genre_per_total_film_in_genre    0
is_na_sum                                            0
prop_signif_per_genre                                0
prop_non_signi                                       0
prop_nan                                             0
avg_slope_change_significant                         0
se_slope_change_significant                          0
avg_mag_slope_change_significant                     0
se_mag_slope_change_significant                      0
avg_slope_change_global                              0
dtype: int64

### Saving data

In [28]:
ready_for_web = './data/web_data/'
# Save as csv; with index=True the genre index is written as the first column
name_by_genre_prop_df.to_csv(os.path.join(ready_for_web, 'movie_genre_significant.csv'), index=True)

### Analysis looking at time effects

In [29]:
# Merge the per-genre name statistics ("movie_genre_aggregate_df", which holds "p_value" and "genre")
# with the release information ("movie_df", which holds "year" and "month") on "wiki_ID"
movie_genre_aggregate_with_years_df = movie_genre_aggregate_df.merge(movie_df, how='left', left_on='wiki_ID', right_on='wiki_ID')
display(movie_genre_aggregate_with_years_df)

Unnamed: 0_level_0,char_words,order,gender,t_stat,p_value,slope_change,genre,mov_name,year,month,revenue,numVotes,averageRating
wiki_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
3217,Gold,6.0,,,,0.0,Action,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Gold,6.0,,,,0.0,Comedy,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Gold,6.0,,,,0.0,Time travel,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Gold,6.0,,,,0.0,Black comedy,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Gold,6.0,,,,0.0,Zombie Film,Army of Darkness,1992,10.0,21502796.0,191068,7.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
37241569,,,,,,,Action,Cold War,2012,11.0,,5033,6.6
37476824,,,,,,,Comedy,I Love New Year,2011,,,876,3.4
37476824,,,,,,,Crime Comedy,I Love New Year,2011,,,876,3.4
37476824,,,,,,,Caper story,I Love New Year,2011,,,876,3.4


In [30]:
# Keeping only significant datapoints
name_by_genre_per_year_prop_df = movie_genre_aggregate_with_years_df.groupby(['genre','year']).apply(lambda x: x[x['p_value'] < alpha])
name_by_genre_per_year_prop_df.head(30)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,char_words,order,gender,t_stat,p_value,slope_change,genre,mov_name,year,month,revenue,numVotes,averageRating
genre,year,wiki_ID,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Absurdism,1964,248601,George,2.0,M,-2.573227,0.025901,0.01219,Absurdism,A Hard Day's Night,1964,7.0,,47276,7.5
Absurdism,1974,19701,Tim,1.0,M,-4.091383,0.001785,0.010195,Absurdism,Monty Python and the Holy Grail,1974,4.0,,560662,8.2
Absurdism,1977,903082,Man,20.0,M,-4.220301,0.001436,3.7e-05,Absurdism,The Kentucky Fried Movie,1977,8.0,20000000.0,19727,6.4
Absurdism,1978,75261,Robert,9.0,M,-3.585998,0.004273,0.053382,Absurdism,National Lampoon's Animal House,1978,7.0,141600000.0,127114,7.4
Absurdism,1978,75261,Dave,19.0,M,-2.481872,0.030472,0.001477,Absurdism,National Lampoon's Animal House,1978,7.0,141600000.0,127114,7.4
Absurdism,1978,75261,Barbara,15.0,F,-3.618317,0.004038,0.018721,Absurdism,National Lampoon's Animal House,1978,7.0,141600000.0,127114,7.4
Absurdism,1978,75261,Kent,8.0,M,-2.642745,0.022881,0.002435,Absurdism,National Lampoon's Animal House,1978,7.0,141600000.0,127114,7.4
Absurdism,1978,75261,Donald,6.0,M,-2.96567,0.012844,0.010112,Absurdism,National Lampoon's Animal House,1978,7.0,141600000.0,127114,7.4
Absurdism,1978,75261,Dean,2.0,M,-3.168871,0.008937,0.009001,Absurdism,National Lampoon's Animal House,1978,7.0,141600000.0,127114,7.4
Absurdism,1978,75261,John,0.0,M,-3.482099,0.005129,0.059891,Absurdism,National Lampoon's Animal House,1978,7.0,141600000.0,127114,7.4


In [31]:
# Compute proportion of impacted names by genre by year
name_by_genre_per_year_prop_df = movie_genre_aggregate_with_years_df.groupby(['genre','year']).apply(lambda x: pd.Series({
        'prop_signif_per_genre_per_year': (x['p_value'] < alpha).sum()/len(x['p_value']),
        'avg_slope_change_significant': x[x['p_value'] < alpha]['slope_change'].mean(),
        'se_slope_change_significant': x[x['p_value'] < alpha]['slope_change'].sem(),
        'avg_mag_slope_change_significant': x[x['p_value'] < alpha]['slope_change'].abs().mean(),
        'se_slope_change_magnitude_significant': x[x['p_value'] < alpha]['slope_change'].abs().sem(),
        'avg_slope_change_global': x['slope_change'].mean()
    }))
display(name_by_genre_per_year_prop_df)
#name_by_genre_per_year_prop_df.head(30)

Unnamed: 0_level_0,Unnamed: 1_level_0,prop_signif_per_genre_per_year,avg_slope_change_significant,se_slope_change_significant,avg_mag_slope_change_significant,se_slope_change_magnitude_significant,avg_slope_change_global
genre,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Absurdism,1929,0.000000,,,,,
Absurdism,1930,0.000000,,,,,0.000005
Absurdism,1932,0.000000,,,,,-0.000088
Absurdism,1938,0.000000,,,,,
Absurdism,1940,0.000000,,,,,
...,...,...,...,...,...,...,...
Zombie Film,2008,0.108434,0.005349,0.002500,0.006609,0.002090,0.000667
Zombie Film,2009,0.046512,-0.013458,0.013578,0.013578,0.013458,-0.001215
Zombie Film,2010,0.137931,0.005537,0.005221,0.013275,0.002543,0.001270
Zombie Film,2011,0.230769,0.009126,0.006299,0.015096,0.003279,0.003075


#### Filling the missing years of each genre with 0 for plotting

In [32]:
# Build the full range of years, then define a function that left-joins one genre onto it and fills the gaps with 0
all_years_df = pd.DataFrame({'year': range(movie_df['year'].min(), movie_df['year'].max() + 1)}).reset_index(drop=True)
all_years_df = all_years_df.set_index('year', drop=True)
#display(all_years_df)
def fill_gaps(group):
    filled_group = pd.merge(all_years_df, group, on='year', how='left').fillna(0)
    return filled_group

name_by_genre_per_year_prop_df.reset_index(inplace=True)
display(name_by_genre_per_year_prop_df)

name_by_genre_per_year_prop_filled_df = name_by_genre_per_year_prop_df.groupby('genre').apply(fill_gaps)
display(name_by_genre_per_year_prop_filled_df)

Unnamed: 0,genre,year,prop_signif_per_genre_per_year,avg_slope_change_significant,se_slope_change_significant,avg_mag_slope_change_significant,se_slope_change_magnitude_significant,avg_slope_change_global
0,Absurdism,1929,0.000000,,,,,
1,Absurdism,1930,0.000000,,,,,0.000005
2,Absurdism,1932,0.000000,,,,,-0.000088
3,Absurdism,1938,0.000000,,,,,
4,Absurdism,1940,0.000000,,,,,
...,...,...,...,...,...,...,...,...
13827,Zombie Film,2008,0.108434,0.005349,0.002500,0.006609,0.002090,0.000667
13828,Zombie Film,2009,0.046512,-0.013458,0.013578,0.013578,0.013458,-0.001215
13829,Zombie Film,2010,0.137931,0.005537,0.005221,0.013275,0.002543,0.001270
13830,Zombie Film,2011,0.230769,0.009126,0.006299,0.015096,0.003279,0.003075


Unnamed: 0_level_0,Unnamed: 1_level_0,year,genre,prop_signif_per_genre_per_year,avg_slope_change_significant,se_slope_change_significant,avg_mag_slope_change_significant,se_slope_change_magnitude_significant,avg_slope_change_global
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Absurdism,0,1888,0,0.0,0.0,0.0,0.0,0.0,0.0
Absurdism,1,1889,0,0.0,0.0,0.0,0.0,0.0,0.0
Absurdism,2,1890,0,0.0,0.0,0.0,0.0,0.0,0.0
Absurdism,3,1891,0,0.0,0.0,0.0,0.0,0.0,0.0
Absurdism,4,1892,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
Zombie Film,124,2012,Zombie Film,0.0,0.0,0.0,0.0,0.0,0.0
Zombie Film,125,2013,0,0.0,0.0,0.0,0.0,0.0,0.0
Zombie Film,126,2014,0,0.0,0.0,0.0,0.0,0.0,0.0
Zombie Film,127,2015,0,0.0,0.0,0.0,0.0,0.0,0.0
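
In the output above, `fillna(0)` also sets the `genre` column to 0 for the padded years, which is why that column is dropped and rebuilt from the index in the next cell. It also replaces statistics that were NaN (no significant name that year) with 0. As a hedged alternative, the sketch below reindexes each genre on the full year range and keeps the genre label; the helper `fill_gaps_keep_label`, the year range and the toy values are illustrative assumptions.

```python
import pandas as pd

# Toy per-genre, per-year table with a gap in 1931 (illustrative values)
toy = pd.DataFrame({
    'genre': ['Absurdism', 'Absurdism'],
    'year': [1930, 1932],
    'prop_signif_per_genre_per_year': [0.10, 0.25],
})
all_years = range(1929, 1934)

def fill_gaps_keep_label(group: pd.DataFrame) -> pd.DataFrame:
    """Reindex one genre on the full year range, filling missing years with 0."""
    return (
        group.drop(columns='genre')
        .set_index('year')
        .reindex(all_years, fill_value=0)
        .rename_axis('year')
        .reset_index()
    )

# One filled block per genre, with the genre restored as a regular column
filled = pd.concat(
    {genre: fill_gaps_keep_label(group) for genre, group in toy.groupby('genre')},
    names=['genre'],
).reset_index(level='genre').reset_index(drop=True)
print(filled)
```

Note that `reindex(..., fill_value=0)` only fills the years it adds; NaN values already present in existing rows are left as they are, so they stay distinguishable from padded years.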


In [33]:
# Keep only the movie genres that have at least `threshold` years with a non-zero value
name_by_genre_per_year_prop_filled_df.drop(columns=['genre'], inplace=True)
name_by_genre_per_year_prop_filled_df.reset_index(inplace=True)
display(name_by_genre_per_year_prop_filled_df)
# Count the number of non-zero values for each genre
genre_counts = name_by_genre_per_year_prop_filled_df[name_by_genre_per_year_prop_filled_df['avg_slope_change_significant'] != 0].groupby('genre')['year'].nunique()
display(genre_counts)

# Filter out movie genres with fewer than `threshold` non-zero years
threshold = 50
selected_genres = genre_counts[genre_counts >= threshold].index
display(selected_genres)
display(len(selected_genres))

# Filter the original DataFrame based on the selected genres
name_by_genre_per_year_prop_filled_filtered_df = name_by_genre_per_year_prop_filled_df[name_by_genre_per_year_prop_filled_df['genre'].isin(selected_genres)]
display(name_by_genre_per_year_prop_filled_filtered_df)



Unnamed: 0,genre,level_1,year,prop_signif_per_genre_per_year,avg_slope_change_significant,se_slope_change_significant,avg_mag_slope_change_significant,se_slope_change_magnitude_significant,avg_slope_change_global
0,Absurdism,0,1888,0.0,0.0,0.0,0.0,0.0,0.0
1,Absurdism,1,1889,0.0,0.0,0.0,0.0,0.0,0.0
2,Absurdism,2,1890,0.0,0.0,0.0,0.0,0.0,0.0
3,Absurdism,3,1891,0.0,0.0,0.0,0.0,0.0,0.0
4,Absurdism,4,1892,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
46822,Zombie Film,124,2012,0.0,0.0,0.0,0.0,0.0,0.0
46823,Zombie Film,125,2013,0.0,0.0,0.0,0.0,0.0,0.0
46824,Zombie Film,126,2014,0.0,0.0,0.0,0.0,0.0,0.0
46825,Zombie Film,127,2015,0.0,0.0,0.0,0.0,0.0,0.0


genre
Absurdism                25
Acid western              4
Action                   89
Action Comedy            23
Action Thrillers         42
                         ..
Women in prison films     3
Workplace Comedy         22
World cinema             64
Wuxia                     4
Zombie Film              19
Name: year, Length: 294, dtype: int64

Index(['Action', 'Action/Adventure', 'Adventure', 'Animation',
       'Biographical film', 'Biography', 'Biopic [feature]', 'Black-and-white',
       'Bollywood', 'Comedy', 'Comedy film', 'Comedy-drama', 'Coming of age',
       'Costume drama', 'Crime Drama', 'Crime Fiction', 'Crime Thriller',
       'Cult', 'Detective', 'Drama', 'Family Drama', 'Family Film', 'Fantasy',
       'Film adaptation', 'Horror', 'Indie', 'Melodrama', 'Musical', 'Mystery',
       'Parody', 'Period piece', 'Political drama', 'Psychological thriller',
       'Romance Film', 'Romantic comedy', 'Romantic drama', 'Satire',
       'Science Fiction', 'Short Film', 'Sports', 'Spy', 'Suspense',
       'Thriller', 'War film', 'Western', 'World cinema'],
      dtype='object', name='genre')

46

Unnamed: 0,genre,level_1,year,prop_signif_per_genre_per_year,avg_slope_change_significant,se_slope_change_significant,avg_mag_slope_change_significant,se_slope_change_magnitude_significant,avg_slope_change_global
258,Action,0,1888,0.0,0.0,0.0,0.0,0.0,0.000000
259,Action,1,1889,0.0,0.0,0.0,0.0,0.0,0.000000
260,Action,2,1890,0.0,0.0,0.0,0.0,0.0,0.000000
261,Action,3,1891,0.0,0.0,0.0,0.0,0.0,0.000000
262,Action,4,1892,0.0,0.0,0.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...
46435,World cinema,124,2012,0.0,0.0,0.0,0.0,0.0,0.000739
46436,World cinema,125,2013,0.0,0.0,0.0,0.0,0.0,0.000000
46437,World cinema,126,2014,0.0,0.0,0.0,0.0,0.0,0.000000
46438,World cinema,127,2015,0.0,0.0,0.0,0.0,0.0,0.000000
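
The selection step above can be sketched on a toy table as follows. The threshold of 2 and the values are made up for illustration (the notebook uses 50). Taking a `.copy()` of the filtered frame is an assumption on our side, meant to make later in-place modifications safe.

```python
import pandas as pd

# Toy filled table: three years for each of two genres (illustrative values)
toy = pd.DataFrame({
    'genre': ['Action', 'Action', 'Action', 'Wuxia', 'Wuxia', 'Wuxia'],
    'year': [2000, 2001, 2002, 2000, 2001, 2002],
    'avg_slope_change_significant': [0.002, -0.001, 0.004, 0.0, 0.003, 0.0],
})

# Number of years with a non-zero value, per genre
nonzero_years = (
    toy[toy['avg_slope_change_significant'] != 0]
    .groupby('genre')['year']
    .nunique()
)

threshold = 2  # hypothetical value for this toy example
selected = nonzero_years[nonzero_years >= threshold].index

# Independent copy of the rows belonging to the selected genres
filtered = toy[toy['genre'].isin(selected)].copy()
print(filtered)
```

Because missing years were filled with 0, a zero here cannot be told apart from a year with no significant name, so the count is best read as "years with at least one significant, non-zero average".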


In [34]:
# Drop the helper column left over from reset_index
# Reassigning the result (instead of using inplace=True on a filtered slice) avoids the SettingWithCopyWarning
name_by_genre_per_year_prop_filled_filtered_df = name_by_genre_per_year_prop_filled_filtered_df.drop(columns=['level_1'])
display(name_by_genre_per_year_prop_filled_filtered_df)


Unnamed: 0,genre,year,prop_signif_per_genre_per_year,avg_slope_change_significant,se_slope_change_significant,avg_mag_slope_change_significant,se_slope_change_magnitude_significant,avg_slope_change_global
258,Action,1888,0.0,0.0,0.0,0.0,0.0,0.000000
259,Action,1889,0.0,0.0,0.0,0.0,0.0,0.000000
260,Action,1890,0.0,0.0,0.0,0.0,0.0,0.000000
261,Action,1891,0.0,0.0,0.0,0.0,0.0,0.000000
262,Action,1892,0.0,0.0,0.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...
46435,World cinema,2012,0.0,0.0,0.0,0.0,0.0,0.000739
46436,World cinema,2013,0.0,0.0,0.0,0.0,0.0,0.000000
46437,World cinema,2014,0.0,0.0,0.0,0.0,0.0,0.000000
46438,World cinema,2015,0.0,0.0,0.0,0.0,0.0,0.000000


In [35]:
display(name_by_genre_per_year_prop_filled_filtered_df.isna().sum())
# # Drop NaN values
# name_by_genre_per_year_prop_df.fillna(0, inplace=True)
# display(name_by_genre_per_year_prop_df)
# # Sanity check
# display(name_by_genre_per_year_prop_df.isna().sum())

genre                                    0
year                                     0
prop_signif_per_genre_per_year           0
avg_slope_change_significant             0
se_slope_change_significant              0
avg_mag_slope_change_significant         0
se_slope_change_magnitude_significant    0
avg_slope_change_global                  0
dtype: int64

### Saving the data

In [36]:
# Add the genre as a column of the dataframe and save as csv
name_by_genre_per_year_prop_filled_filtered_df.to_csv(os.path.join(ready_for_web, 'movie_genre_per_year_significant.csv'), index=False)

In [37]:
# Resaving data for Circle Packing with only movie genre kept in time analysis
# Add the genre as a column of the dataframe and save as csv
# Filter the original DataFrame based on the selected genres
name_by_genre_prop_df.reset_index(inplace=True)
name_by_genre_prop_filtered_df = name_by_genre_prop_df[name_by_genre_prop_df['genre'].isin(selected_genres)]
display(name_by_genre_prop_filtered_df)
name_by_genre_prop_filtered_df.to_csv(os.path.join(ready_for_web, 'movie_genre_significant_filtered.csv'), index=False)

Unnamed: 0,genre,nb_films_in_genre,nb_names_in_genre,nb_names_signi_in_genre,prop_names_signi_in_genre_per_total_film_in_genre,is_na_sum,prop_signif_per_genre,prop_non_signi,prop_nan,avg_slope_change_significant,se_slope_change_significant,avg_mag_slope_change_significant,se_mag_slope_change_significant,avg_slope_change_global
2,Action,34780.0,31575.0,2995.0,0.086113,3205.0,0.086113,0.692984,0.220903,0.001634,0.000441,0.013268,0.00037,0.000169
5,Action/Adventure,21112.0,19502.0,1942.0,0.091986,1610.0,0.091986,0.70666,0.201355,0.001518,0.000543,0.013323,0.000452,0.000156
8,Adventure,20830.0,18801.0,1852.0,0.08891,2029.0,0.08891,0.671675,0.239414,0.001156,0.000581,0.013427,0.000491,5.1e-05
18,Animation,7507.0,5771.0,469.0,0.062475,1736.0,0.062475,0.564673,0.372852,-0.001831,0.001225,0.012171,0.001092,-0.000103
34,Biographical film,5589.0,5117.0,510.0,0.091251,472.0,0.091251,0.712829,0.195921,0.003028,0.000908,0.011695,0.000757,0.000497
35,Biography,5548.0,4760.0,472.0,0.085076,788.0,0.085076,0.671774,0.243151,0.003414,0.000826,0.011287,0.000661,0.000543
36,Biopic [feature],3223.0,3013.0,302.0,0.093702,210.0,0.093702,0.739684,0.166615,0.003422,0.001368,0.014033,0.001121,0.000388
38,Black-and-white,13750.0,8702.0,741.0,0.053891,5048.0,0.053891,0.464727,0.481382,0.000452,0.001655,0.02609,0.001349,-0.00026
40,Bollywood,5525.0,5398.0,220.0,0.039819,127.0,0.039819,0.765068,0.195113,-0.000108,0.000854,0.006212,0.000743,-4.9e-05
59,Comedy,61018.0,54389.0,5494.0,0.090039,6629.0,0.090039,0.703022,0.206939,0.002377,0.000352,0.013515,0.000303,0.000272


## <span style="color:red">Question 3: Attendance/popularity + ratings</span>

In [38]:
# The dataframe "name_by_movie_aggregate_df" already contains the wanted characteristics
display(name_by_movie_aggregate_df)
name_by_movie_aggregate_df['numVotes'].max()

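# Proportion of rows (one row per film/name pair) with p_value < 0.1, in data segmented by number of votes
# Note: the threshold used here is 0.1 rather than alpha, and the bounds are strict, so rows with exactly 10k, 100k or 1M votes fall in no segment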

prop_0_10k = len(name_by_movie_aggregate_df[(name_by_movie_aggregate_df['numVotes'] < 10000) & (name_by_movie_aggregate_df['p_value'] < 0.1)]) / len(name_by_movie_aggregate_df[name_by_movie_aggregate_df['numVotes'] < 10000])

print(f"Proportion of movies with numVotes < 10k and p_value < 0.1: {prop_0_10k :.3%}")

prop_10k_100k = len(name_by_movie_aggregate_df[(name_by_movie_aggregate_df['numVotes'] > 10000) & (name_by_movie_aggregate_df['numVotes'] < 100000) & (name_by_movie_aggregate_df['p_value'] < 0.1)]) / len(name_by_movie_aggregate_df[(name_by_movie_aggregate_df['numVotes'] > 10000) & (name_by_movie_aggregate_df['numVotes'] < 100000)])

print(f"Proportion of movies with numVotes in [10k-100k] and p_value < 0.1: {prop_10k_100k :.3%}")

prop_100k_1M = len(name_by_movie_aggregate_df[(name_by_movie_aggregate_df['numVotes'] > 100000) & (name_by_movie_aggregate_df['numVotes'] < 1000000) & (name_by_movie_aggregate_df['p_value'] < 0.1)]) / len(name_by_movie_aggregate_df[(name_by_movie_aggregate_df['numVotes'] > 100000) & (name_by_movie_aggregate_df['numVotes'] < 1000000)])

print(f"Proportion of movies with numVotes in [100k-1M] and p_value < 0.1: {prop_100k_1M :.3%}")

prop_greater_1M = len(name_by_movie_aggregate_df[(name_by_movie_aggregate_df['numVotes'] > 1000000) & (name_by_movie_aggregate_df['p_value'] < 0.1)]) / len(name_by_movie_aggregate_df[(name_by_movie_aggregate_df['numVotes'] > 1000000)])

print(f"Proportion of movies with numVotes > 1M and p_value < 0.1: {prop_greater_1M :.3%}")


len(name_by_movie_aggregate_df[(name_by_movie_aggregate_df['numVotes'] > 1000000) & (name_by_movie_aggregate_df['p_value'] < 0.1)]['numVotes'].unique())


Unnamed: 0_level_0,char_words,order,gender,t_stat,p_value,slope_change,mov_name,year,month,revenue,numVotes,averageRating
wiki_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
3217,Gold,6.0,,,,0.000000,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Linda,7.0,F,-0.416786,0.684853,0.000673,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Henry,4.0,M,-2.031668,0.067058,0.002513,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Duke,4.0,M,0.579441,0.573967,-0.000113,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Warrior,9.0,M,,,0.000000,Army of Darkness,1992,10.0,21502796.0,191068,7.4
...,...,...,...,...,...,...,...,...,...,...,...,...
37478048,Ajay,9.0,M,-0.819213,0.430057,0.000130,Mr. Bechara,1996,,,395,5.4
37501922,Murphy,3.0,F,1.264175,0.232298,-0.000365,Terminal Bliss,1992,,,245,4.4
37501922,Hunter,1.0,M,-7.083089,0.000020,0.036603,Terminal Bliss,1992,,,245,4.4
37501922,John,1.0,M,-2.172964,0.052505,0.012557,Terminal Bliss,1992,,,245,4.4


Proportion of movies with numVotes < 10k and p_value < 0.1: 13.350%
Proportion of movies with numVotes in [10k-100k] and p_value < 0.1: 15.580%
Proportion of movies with numVotes in [100k-1M] and p_value < 0.1: 15.919%
Proportion of movies with numVotes > 1M and p_value < 0.1: 14.890%


52

Assumptions:
- Attendance is estimated by the number of votes.
- There is a threshold on the number of votes above which we start to study the influence of the rating.

Ideas:

- Separate the data according to the number of votes, then according to the rating.

- Separate first according to votes, then, within each vote segment, separate bad and good reviews.

Question 4: Compute the average
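
The second idea above (segment by votes first, then split each segment into bad and good reviews) could be sketched as follows. Everything here is an assumption for illustration: the toy rows, the segment labels, and in particular the rating cut-off of 6.0, which is not defined in the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows, for illustration only
toy = pd.DataFrame({
    'numVotes': [250, 10_000, 56_880, 191_068, 1_200_000],
    'averageRating': [4.4, 6.0, 4.9, 7.4, 8.8],
    'slope_change': [0.012, -0.001, 0.003, 0.002, 0.0007],
})

# Vote segments; with right=True each bin includes its upper bound
vote_bins = [0, 10_000, 100_000, 1_000_000, np.inf]
vote_labels = ['0-10k', '10k-100k', '100k-1M', '1M+']
toy['votes_segment'] = pd.cut(toy['numVotes'], bins=vote_bins, labels=vote_labels, right=True)

# Split into bad/good reviews using a hypothetical cut-off
rating_threshold = 6.0
toy['rating_segment'] = np.where(toy['averageRating'] >= rating_threshold, 'good', 'bad')

# Average slope change and number of rows in each (votes, rating) cell
summary = (
    toy.groupby(['votes_segment', 'rating_segment'], observed=True)['slope_change']
    .agg(['mean', 'count'])
)
print(summary)
```

Using `pd.cut` assigns every row to exactly one segment, including rows that sit exactly on a boundary such as 10,000 votes.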

In [39]:
#We segment the data frame according to the number of votes

votes_seg_0_10k = name_by_movie_aggregate_df[(name_by_movie_aggregate_df['numVotes'] < 10000) & (name_by_movie_aggregate_df['p_value'] < 0.1)]
votes_seg_10k_100k = name_by_movie_aggregate_df[(name_by_movie_aggregate_df['numVotes'] > 10000) & (name_by_movie_aggregate_df['numVotes'] < 100000) & (name_by_movie_aggregate_df['p_value'] < 0.1)]
votes_seg_100k_1M = name_by_movie_aggregate_df[(name_by_movie_aggregate_df['numVotes'] > 100000) & (name_by_movie_aggregate_df['numVotes'] < 1000000) & (name_by_movie_aggregate_df['p_value'] < 0.1)]
votes_seg_1M_inf = name_by_movie_aggregate_df[(name_by_movie_aggregate_df['numVotes'] > 1000000) & (name_by_movie_aggregate_df['p_value'] < 0.1)]

a = [votes_seg_0_10k['slope_change'].mean(), votes_seg_10k_100k['slope_change'].mean(), votes_seg_100k_1M['slope_change'].mean(), votes_seg_1M_inf['slope_change'].mean()]
index_names = ['0-10k', '10k-100k', '100k-1M', '1M-inf']
results = pd.DataFrame(a, index=index_names,columns = ['avg_slope_change'])
results.index.name = 'Seg_numVotes'
display(results)

Unnamed: 0_level_0,avg_slope_change
Seg_numVotes,Unnamed: 1_level_1
0-10k,0.001185
10k-100k,0.001746
100k-1M,0.002241
1M-inf,0.00075


In [40]:
# Keep rows with p_value < 0.1 from movies with at least 100 votes
# .copy() makes this an independent dataframe, so adding a column below does not write to a slice
name_by_movie_aggregate_df_significant = name_by_movie_aggregate_df[(name_by_movie_aggregate_df['p_value'] < 0.1) & (name_by_movie_aggregate_df['numVotes']>=100)].copy()

# Segment the dataframe according to the number of votes

# Calculate the average change of slopes for the different vote segments

numVotes_bins = [0,10000,100000,1000000,np.inf]
segments_numVotes_label = ['0-10000','10000-100000','100000-1000000','1000000+']
name_by_movie_aggregate_df_significant['numVotes_segmented']  = pd.cut(name_by_movie_aggregate_df_significant['numVotes'],numVotes_bins,labels=segments_numVotes_label,right=True)

avg_magnitude_slopes_change_numVotes = name_by_movie_aggregate_df_significant.groupby('numVotes_segmented').apply(lambda x: pd.Series({
    'avg_magnitude_slopes_change': x['slope_change'].abs().mean(),
    'SE_magnitude' : x['slope_change'].abs().sem(),
    'avg_slope_change': x['slope_change'].mean(),
    'SE_slope_change': x['slope_change'].sem()
    }))
display(avg_magnitude_slopes_change_numVotes)



# ###################################Chat GPT below 
# import pandas as pd
# import plotly.graph_objects as go
# import ipywidgets as widgets
# from ipywidgets import interact

# # Assuming name_by_movie_aggregate_df is your DataFrame
# # Replace 'numVotes' with the actual column name if it's different

# # Filter the DataFrame for significant p-values and minimum numVotes
# name_by_movie_aggregate_df_significant = name_by_movie_aggregate_df[(name_by_movie_aggregate_df['p_value'] < 0.1) & (name_by_movie_aggregate_df['numVotes'] >= 100)]

# # Function to create the interactive plot
# def plot_interactive(num_bins):
#     # Calculate the range and create custom bins
#     min_num_votes = name_by_movie_aggregate_df_significant['numVotes'].min()
#     max_num_votes = name_by_movie_aggregate_df_significant['numVotes'].max()
#     segment_length = int((max_num_votes - min_num_votes) / num_bins) + 1
#     custom_bins = [min_num_votes + i * segment_length for i in range(num_bins + 1)]

#     # Create segments using custom bins
#     name_by_movie_aggregate_df_significant['numVotes_segmented'] = pd.cut(name_by_movie_aggregate_df_significant['numVotes'], bins=custom_bins, labels=None, right=True, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

#     # Convert Interval type to string
#     name_by_movie_aggregate_df_significant['numVotes_segmented'] = name_by_movie_aggregate_df_significant['numVotes_segmented'].astype(str)

#     # Calculate average change of slopes for the different numVotes segments
#     avg_magnitude_slopes_change_numVotes = name_by_movie_aggregate_df_significant.groupby('numVotes_segmented').apply(lambda x: pd.Series({
#         'avg_magnitude_slopes_change': x['slope_change'].abs().mean(),
#         'SE_magnitude': x['slope_change'].abs().sem(),
#         'avg_slope_change': x['slope_change'].mean(),
#         'SE_slope_change': x['slope_change'].sem()
#     }))

#     # Plotting with Plotly
#     fig = go.Figure(data=[
#         go.Bar(name='SF Zoo', x=avg_magnitude_slopes_change_numVotes.index, y=avg_magnitude_slopes_change_numVotes['avg_magnitude_slopes_change'])
#     ])

#     # Change the bar mode
#     fig.update_layout(barmode='group')

#     # Limit y-axis to 0.05
#     fig.update_yaxes(range=[0, 0.05])

#     # Show the plot
#     fig.show()

# # Create an interactive slider to choose the number of bins
# interact(plot_interactive, num_bins=widgets.IntSlider(min=1, max=20, step=1, value=5))

# fig.write_html("segmentation_numVotes.html",include_plotlyjs ='cdn')






Unnamed: 0_level_0,avg_magnitude_slopes_change,SE_magnitude,avg_slope_change,SE_slope_change
numVotes_segmented,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0-10000,0.013589,0.000207,0.001297,0.000239
10000-100000,0.010688,0.000198,0.001746,0.000235
100000-1000000,0.00951,0.000246,0.002241,0.000288
1000000+,0.010704,0.001467,0.00075,0.0017
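
The segmentation above can be sketched on a small synthetic frame. This is a minimal illustration, assuming the same column names as `name_by_movie_aggregate_df`; the values in `toy` are made up, and the named aggregation is an alternative to the `apply` call used in the cell, not the method used to produce the table.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for name_by_movie_aggregate_df (values are made up)
toy = pd.DataFrame({
    "numVotes": [500, 5_000, 50_000, 500_000, 2_000_000, 80_000],
    "p_value": [0.01, 0.20, 0.03, 0.04, 0.02, 0.05],
    "slope_change": [0.010, 0.500, -0.020, 0.004, -0.001, 0.006],
})

# .copy() makes the filtered frame independent, so adding a column is safe
significant = toy[(toy["p_value"] < 0.1) & (toy["numVotes"] >= 100)].copy()

bins = [0, 10_000, 100_000, 1_000_000, np.inf]
labels = ["0-10000", "10000-100000", "100000-1000000", "1000000+"]
significant["numVotes_segmented"] = pd.cut(
    significant["numVotes"], bins, labels=labels, right=True
)

# Named aggregation: one output column per (input column, statistic) pair
significant["abs_slope_change"] = significant["slope_change"].abs()
summary = significant.groupby("numVotes_segmented", observed=False).agg(
    avg_magnitude=("abs_slope_change", "mean"),
    se_magnitude=("abs_slope_change", "sem"),
    avg_slope_change=("slope_change", "mean"),
    n=("slope_change", "size"),
)
print(summary)
```

Reporting the group size `n` next to each mean makes it easier to judge segments with few movies, where the standard error is large or undefined.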


In [41]:
# Bar chart of the average slope-change magnitude per number-of-votes segment
fig = go.Figure(data=[
    go.Bar(name='Avg magnitude of slope change', x=avg_magnitude_slopes_change_numVotes.index, y=avg_magnitude_slopes_change_numVotes['avg_magnitude_slopes_change'])
])
fig.update_layout(xaxis_title='Number of votes (segmented)', yaxis_title='Average magnitude of slope change')
fig.show()


# import plotly.graph_objects as go
# import ipywidgets as widgets
# from ipywidgets import interact

# # Assuming avg_magnitude_slopes_change_numVotes is your DataFrame

# # Create a function to update the plot based on the chosen number of bins
# def update_plot(num_bins):
#     # Your logic to calculate the DataFrame for the chosen bin size
#     # Replace this with your actual calculations
#     # For demonstration, I'm using a dummy DataFrame named df_updated
#     df_updated = avg_magnitude_slopes_change_numVotes * num_bins

#     # Create the Plotly plot
#     fig = go.Figure(data=[
#         go.Bar(name='SF Zoo', x=df_updated.index, y=df_updated['avg_magnitude_slopes_change'])
#     ])

#     # Change the bar mode
#     fig.update_layout(barmode='group')
#     fig.update_layout(title=f'Average Magnitude Slope Change for {num_bins} Bins')
#     fig.update_layout(xaxis_title='Number of Votes Segments')
#     fig.update_layout(yaxis_title='Average Magnitude Slope Change')

#     fig.show()

# # Set up interactive plotting with ipywidgets
# num_bins_slider = widgets.IntSlider(value=5, min=1, max=20, step=1, description='Number of Bins:')

# # Use the interact decorator to connect the function and the widget
# interact(update_plot, num_bins=num_bins_slider)


#### Segmenting w.r.t. movie rating

In [42]:
# Segment the data frame according to the rating (quartiles of averageRating)
# and calculate the average change of slopes for each rating segment

rating_quantiles = np.quantile(name_by_movie_aggregate_df_significant['averageRating'],[0.25,0.5,0.75])
#display(rating_quantiles)

# display((name_by_movie_aggregate_df_significant['averageRating']<= 5.5).sum()/len(name_by_movie_aggregate_df_significant))

rating_bins = [0,rating_quantiles[0],rating_quantiles[1],rating_quantiles[2],10]
segments_rating_label = ['0-{}'.format(rating_quantiles[0]),'{}-{}'.format(rating_quantiles[0], rating_quantiles[1]),'{}-{}'.format(rating_quantiles[1], rating_quantiles[2]),'{}-10'.format(rating_quantiles[2])]
name_by_movie_aggregate_df_significant['rating_segmented']  = pd.cut(name_by_movie_aggregate_df_significant['averageRating'],rating_bins,labels=segments_rating_label,right=True)
avg_slopes_change_rating = name_by_movie_aggregate_df_significant.groupby('rating_segmented').apply(lambda x: pd.Series({
    'avg_slopes_change': x['slope_change'].mean(),
    'avg_magnitude_slopes_change': x['slope_change'].abs().mean()
}))

display(avg_slopes_change_rating)







Unnamed: 0_level_0,avg_slopes_change,avg_magnitude_slopes_change
rating_segmented,Unnamed: 1_level_1,Unnamed: 2_level_1
0-5.5,0.001742,0.011391
5.5-6.3,0.001799,0.012351
6.3-6.9,0.001323,0.01237
6.9-10,0.001427,0.012042
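
The quartile-based segmentation above computes the bin edges with `np.quantile` and passes them to `pd.cut`. As a hedged alternative sketch, `pd.qcut` performs both steps in one call. The ratings below are synthetic, and `duplicates="drop"` is a precaution for heavily tied data, not something the cell above uses.

```python
import numpy as np
import pandas as pd

# Synthetic ratings between 1 and 10, rounded to one decimal like averageRating
rng = np.random.default_rng(0)
ratings = pd.Series(rng.uniform(1, 10, size=200).round(1), name="averageRating")

# qcut places approximately the same number of rows in each bin
quartile = pd.qcut(ratings, q=4, duplicates="drop")
counts = quartile.value_counts(sort=False)
print(counts)
```

Because the bins are defined by quantiles, each segment holds roughly a quarter of the rows, which keeps the standard errors of the per-segment means comparable.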


In [43]:
# Bar chart of the average slope-change magnitude per rating segment
fig = go.Figure(data=[
    go.Bar(name='Avg magnitude of slope change', x=avg_slopes_change_rating.index, y=avg_slopes_change_rating['avg_magnitude_slopes_change'])
])
fig.update_layout(xaxis_title='Rating (segmented)', yaxis_title='Average magnitude of slope change')
fig.show()

In [44]:

heatmap_data = name_by_movie_aggregate_df_significant.groupby(['rating_segmented', 'numVotes_segmented']).apply(lambda x: pd.Series({
    'avg_magnitude_slopes_change_HM': x['slope_change'].abs().mean()
}))

# Reshape the data for the heatmap: rating segments as rows, number-of-votes segments as columns
heatmap_pivot = heatmap_data.pivot_table(
    values='avg_magnitude_slopes_change_HM',
    index='rating_segmented',
    columns='numVotes_segmented',
    aggfunc='mean'
)

# Create a heatmap using Plotly
heatmap = go.Figure(data=go.Heatmap(
    z=heatmap_pivot.values,
    x=heatmap_pivot.columns,
    y=heatmap_pivot.index,
    colorscale='YlGnBu',
    colorbar=dict(title='Avg Magnitude Slopes Change'),
))

# Set axis labels and title
heatmap.update_layout(
    xaxis=dict(title='Number of Votes (Segmented)'),
    yaxis=dict(title='Rating (Segmented)'),
    title='Average Magnitude of Slopes Change Heatmap'
)

# Show the plot
heatmap.show()
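
The reshaping step used for the heatmap can also be written as a `groupby` followed by `unstack`. The sketch below uses a small synthetic frame; the segment labels `low`/`high` and `few`/`many` are placeholders, not the labels produced earlier in the notebook.

```python
import pandas as pd

# Synthetic data with two rating segments and two vote segments
toy = pd.DataFrame({
    "rating_segmented": ["low", "low", "high", "high", "high"],
    "numVotes_segmented": ["few", "many", "few", "few", "many"],
    "slope_change": [0.01, -0.03, 0.02, -0.04, 0.005],
})

# Mean absolute slope change per (rating, votes) cell, pivoted into a matrix
heatmap_matrix = (
    toy.assign(abs_slope_change=toy["slope_change"].abs())
       .groupby(["rating_segmented", "numVotes_segmented"])["abs_slope_change"]
       .mean()
       .unstack("numVotes_segmented")
)
print(heatmap_matrix)
```

The rows and columns of `heatmap_matrix` map directly to the `y` and `x` arguments of `go.Heatmap`, and its values to `z`.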



## <span style="color:red">Question 4: Character importance in the film</span>

In [45]:
# The dataframe "name_by_movie_df" already contains the wanted characteristic ("order")
# For each order, keep only the names with a significant slope change
name_by_order_df = name_by_movie_df.groupby("order").apply(lambda x: x[x['p_value'] < alpha])
display(name_by_order_df)

Unnamed: 0_level_0,Unnamed: 1_level_0,char_words,order,gender,t_stat,p_value,slope_change
order,wiki_ID,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0.0,4560,William,0.0,M,-3.378640,0.006157,0.015610
0.0,5035,Eric,0.0,M,-6.765221,0.000031,0.025314
0.0,5729,Harold,0.0,M,-2.233082,0.047271,0.001985
0.0,19715,Gracie,0.0,F,-2.941462,0.013413,0.008645
0.0,21180,Knox,0.0,M,-2.783720,0.017785,0.000111
...,...,...,...,...,...,...,...
94.0,9834441,Lily,94.0,F,4.655481,0.000699,-0.024797
95.0,20777420,Thomas,95.0,M,-4.265520,0.001331,0.011104
98.0,370064,Anderson,98.0,F,-4.352241,0.001151,0.003382
98.0,25079197,Tyson,98.0,M,4.232343,0.001407,-0.003961


In [46]:
name_by_order_prop_df = name_by_movie_df.groupby("order").apply(lambda x: pd.Series({
        'prop_signif_per_order': (x['p_value'] < alpha).sum()/len(x['p_value']),
        'avg_slope_change_significant': x[x['p_value'] < alpha]['slope_change'].mean(),
        'avg_slope_change_global': x['slope_change'].mean(),
        'avg_magnitude_slope_change_significant': x[x['p_value'] < alpha]['slope_change'].abs().mean(),
        'total_number_signif_per_order': (x['p_value'] < alpha).sum(),
        'proportion_negative_SC' : (x[x['p_value'] < alpha]['slope_change'] < 0).sum() / len(x[x['p_value'] < alpha]['slope_change']),
        'proportion_positive_SC' : (x[x['p_value'] < alpha]['slope_change'] > 0).sum() / len(x[x['p_value'] < alpha]['slope_change']),
        'se_slope_change_magnitude_significant': x[x['p_value'] < alpha]['slope_change'].abs().sem()
    }))
display(name_by_order_prop_df)



invalid value encountered in scalar divide


invalid value encountered in scalar divide



Unnamed: 0_level_0,prop_signif_per_order,avg_slope_change_significant,avg_slope_change_global,avg_magnitude_slope_change_significant,total_number_signif_per_order,proportion_negative_SC,proportion_positive_SC,se_slope_change_magnitude_significant
order,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0.0,0.111207,0.002603,0.000161,0.016890,2717.0,0.247332,0.752668,0.000494
1.0,0.113455,0.001298,-0.000027,0.016383,2146.0,0.251165,0.748835,0.000568
2.0,0.108609,0.002201,0.000206,0.014257,1577.0,0.228916,0.771084,0.000587
3.0,0.104310,0.001885,0.000271,0.014035,1244.0,0.237942,0.762058,0.000593
4.0,0.101252,0.002250,0.000253,0.013192,1019.0,0.217861,0.782139,0.000638
...,...,...,...,...,...,...,...,...
151.0,0.000000,,-0.001630,,0.0,,,
152.0,0.000000,,0.000917,,0.0,,,
169.0,0.000000,,0.000035,,0.0,,,
300.0,0.000000,,0.000036,,0.0,,,
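
The `invalid value encountered in scalar divide` warnings and the `NaN` entries above come from orders that have no significant name, where the proportions are computed as 0/0. A minimal sketch, on synthetic data, of guarding that division explicitly; `summarize` and `ALPHA` are hypothetical names introduced for this illustration.

```python
import numpy as np
import pandas as pd

ALPHA = 0.05  # same significance level as in the notebook

def summarize(group: pd.DataFrame) -> pd.Series:
    """Share of negative and positive slope changes among significant rows."""
    signif = group.loc[group["p_value"] < ALPHA, "slope_change"]
    n_signif = len(signif)
    return pd.Series({
        "total_number_signif": n_signif,
        # Return NaN explicitly instead of dividing by zero
        "proportion_negative_SC": (signif < 0).sum() / n_signif if n_signif else np.nan,
        "proportion_positive_SC": (signif > 0).sum() / n_signif if n_signif else np.nan,
    })

# Synthetic data: order 0 has two significant rows, order 1 has none
toy = pd.DataFrame({
    "order": [0, 0, 0, 1, 1],
    "p_value": [0.01, 0.02, 0.50, 0.30, 0.90],
    "slope_change": [0.02, -0.01, 0.03, 0.01, -0.02],
})
result = toy.groupby("order")[["p_value", "slope_change"]].apply(summarize)
print(result)
```

The output still contains `NaN` for orders without significant names, but the intent is visible in the code and no runtime warning is emitted.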


In [47]:
# Limit the data to orders up to 100
filtered_df = name_by_order_prop_df[(name_by_order_prop_df.index <= 100) & (name_by_order_prop_df.index >= 0)]

y_range = [0, 0.03]

# Create the interactive bar chart
fig = go.Figure()


# Trace for the share of negative slope changes inside the magnitude bar
fig.add_trace(go.Bar(
    x=filtered_df.index,
    y=filtered_df['proportion_negative_SC'] * filtered_df['avg_magnitude_slope_change_significant'],
    name='Proportion Slope Change Negative',
    marker_color='red',
    offsetgroup=1
))

# Trace for the share of positive slope changes inside the magnitude bar
fig.add_trace(go.Bar(
    x=filtered_df.index,
    y=filtered_df['proportion_positive_SC'] * filtered_df['avg_magnitude_slope_change_significant'],
    name='Proportion Slope Change Positive',
    marker_color='green',
    offsetgroup=1
))

# Layout of the figure
fig.update_layout(
    xaxis=dict(title='Order'),
    yaxis=dict(title='Magnitude / Proportion', range=y_range),
    barmode='stack'  # 'stack' stacks the two bars for each order
)

In [48]:
fig.write_html("CaracterRole.html")

In [49]:
# Limit the data to orders up to 100
filtered_df = name_by_order_prop_df[(name_by_order_prop_df.index <= 100) & (name_by_order_prop_df.index >= 0)]

# Limit the y-axis range to [0, 0.05]
y_range = [0, 0.05]

# Create the interactive bar chart
fig = go.Figure()

# Trace for the average slope-change magnitude, with standard-error bars
fig.add_trace(go.Bar(
    x=filtered_df.index,
    y=filtered_df['avg_magnitude_slope_change_significant'],
    name='Avg Magnitude Slope Change',
    marker_color='orange',
    error_y=dict(
        type='data',
        array=filtered_df['se_slope_change_magnitude_significant'],
        visible=True
    )
))

# Layout of the figure
fig.update_layout(
    xaxis=dict(title='Character Order'),
    yaxis=dict(title='Slope Change Magnitude', range=y_range),
    barmode='stack'  # kept for consistency with the previous chart; there is a single trace here
)

# Display the figure
fig.show()

In [50]:
fig.write_html("CaracterRole_magnitude.html",include_plotlyjs ='cdn')

#### Are movie genre and character role linked?

In [51]:
# Does the influence of the character order depend on the movie genre? Study of the impact of role importance per movie genre
name_by_order_by_genre_prop_df = movie_genre_aggregate_df.groupby(['order','genre']).apply(lambda x: pd.Series({
        'prop_signif_per_order_per_genre': (x['p_value'] < alpha).sum()/len(x['p_value']),
        'avg_slope_change_significant': x[x['p_value'] < alpha]['slope_change'].mean(),
        'avg_magnitude_slope_change_significant': x[x['p_value'] < alpha]['slope_change'].abs().mean(),
        'avg_slope_change_global': x['slope_change'].mean(),
        'total_number_signif_per_order_per_genre': (x['p_value'] < alpha).sum(),
    }))
display(name_by_order_by_genre_prop_df)

Unnamed: 0_level_0,Unnamed: 1_level_0,prop_signif_per_order_per_genre,avg_slope_change_significant,avg_magnitude_slope_change_significant,avg_slope_change_global,total_number_signif_per_order_per_genre
order,genre,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0.0,Absurdism,0.075949,-0.031150,0.055551,-0.002293,6.0
0.0,Acid western,0.375000,0.005818,0.005818,0.000786,3.0
0.0,Action,0.112434,0.001211,0.015567,0.000086,491.0
0.0,Action Comedy,0.092784,-0.012707,0.015439,-0.000509,9.0
0.0,Action Thrillers,0.112272,-0.004716,0.019957,0.000162,43.0
...,...,...,...,...,...,...
302.0,Biographical film,0.000000,,,0.000000,0.0
302.0,Biography,0.000000,,,0.000000,0.0
302.0,Drama,0.000000,,,0.000000,0.0
302.0,Period piece,0.000000,,,0.000000,0.0


### Does the order of a name have a different influence depending on gender?
<span style="color:red"> *Should we keep only the values with p below 0.1 for the slope study? If we keep the others, non-significant entries will bias our averages.*</span>

<span style="color:red"> **To review**</span>

In [52]:
# Calculate the average magnitude of slope change on all the data
# Calculate the average magnitude of slope change on data having a slope change statistically significant
# Calculate the average of slope change on data having a slope change statistically significant
name_by_order_by_gender_prop_df = name_by_movie_df.groupby(['order','gender']).apply(lambda x: pd.Series({
        'prop_signif_per_order_per_genre': (x['p_value'] < alpha).sum()/len(x['p_value']),
        'avg_slope_change_significant': x[x['p_value'] < alpha]['slope_change'].mean(),
        'avg_slope_change_global': x['slope_change'].mean(),
        'avg_magnitude_slope_change_significant': x[x['p_value'] < alpha]['slope_change'].abs().mean(),
        'avg_magnitude_slope_change_global': x['slope_change'].abs().mean(),
        'total_number_signif_per_order_per_genre': (x['p_value'] < alpha).sum(),
    }))
display(name_by_order_by_gender_prop_df)


Unnamed: 0_level_0,Unnamed: 1_level_0,prop_signif_per_order_per_genre,avg_slope_change_significant,avg_slope_change_global,avg_magnitude_slope_change_significant,avg_magnitude_slope_change_global,total_number_signif_per_order_per_genre
order,gender,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0.0,F,0.123389,0.001294,-0.000065,0.016735,0.004073,919.0
0.0,M,0.105736,0.003443,0.000282,0.016987,0.003522,1764.0
1.0,F,0.122682,0.000456,-0.000253,0.016857,0.003878,1151.0
1.0,M,0.104579,0.002432,0.000217,0.015772,0.003267,973.0
2.0,F,0.117599,0.001187,-0.000063,0.016242,0.003640,725.0
...,...,...,...,...,...,...,...
151.0,M,0.000000,,-0.001630,,0.001630,0.0
152.0,F,0.000000,,0.000917,,0.000917,0.0
169.0,M,0.000000,,0.000035,,0.000035,0.0
300.0,M,0.000000,,0.000036,,0.000036,0.0


## <span style="color:red">Question 5: Character gender in film</span>
<span style="color:green"> ok</span>

In [53]:
# The dataframe "name_by_movie_df" has everything we need
# For each gender, keep only the significant values (5% level)
name_by_gender_df = name_by_movie_df.groupby('gender').apply(lambda x: x[(x['p_value'] <= alpha)])
display(name_by_gender_df)

Unnamed: 0_level_0,Unnamed: 1_level_0,char_words,order,gender,t_stat,p_value,slope_change
gender,wiki_ID,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
F,4231,Jennifer,6.0,F,-2.455800,0.031916,0.088259
F,4560,Isabelle,2.0,F,-8.004577,0.000006,0.008832
F,5224,Susan,2.0,F,-4.547336,0.000834,0.063266
F,9835,Maggie,7.0,F,-2.247749,0.046071,0.002362
F,9979,Amanda,,F,-2.735891,0.019373,0.043621
...,...,...,...,...,...,...,...
M,36699915,Jackson,0.0,M,4.587186,0.000781,-0.026825
M,36814246,Man,6.0,M,-4.220301,0.001436,0.000037
M,36956792,Gunner,13.0,M,-4.245248,0.001377,0.002472
M,36956792,Charlie,5.0,M,-5.446114,0.000202,0.006446


In [54]:
# # Assuming you have a Dash app set up
# app = dash.Dash(__name__)

# # Sample data
# threshold = 10e-4

# name_by_gender_df['abs_slope_change'] = name_by_gender_df['slope_change'].abs()
# #name_by_gender_filtered_df = name_by_gender_df[name_by_gender_df['abs_slope_change'] > threshold]

# # Create the initial figure
# fig = go.Figure()

# # Define color scale for both genders and signs
# color_scale = {'M': {'Positive': 'orange', 'Negative': 'blue'},
#                'F': {'Positive': 'orange', 'Negative': 'blue'}}

# for gender in ['M', 'F']:
#     for sign in ['Positive', 'Negative']:
#         subset = name_by_gender_df[(name_by_gender_df['gender'] == gender) & (name_by_gender_df['slope_change'] * (-1) ** (sign == 'Positive') > 0)]
#         fig.add_trace(go.Violin(x=subset['gender'], y=subset['abs_slope_change'],
#                                 name=f'{gender} ({sign})', side='positive' if sign == 'Positive' else 'negative',
#                                 line_color=color_scale[gender][sign]))

# # Create the Dash layout
# app.layout = html.Div([
#     dcc.Slider(
#         id='threshold-slider',
#         min=10e-6,
#         max=10e-2,
#         step=10e-6,
#         value=threshold,
#         marks={i: f"{i:.0e}" for i in [10e-6, 10e-5, 10e-4, 10e-3]},
#         tooltip={'placement': 'bottom', 'always_visible': True}
#     ),
#     dcc.Graph(id='gender-violin-plot', figure=fig),
# ])

# # Define callback to update the plot based on the slider value
# @app.callback(
#     Output('gender-violin-plot', 'figure'),
#     [Input('threshold-slider', 'value')]
# )
# def update_plot(threshold_value):
#     updated_df = name_by_gender_df[name_by_gender_df['abs_slope_change'] > threshold_value]
#     updated_fig = go.Figure()

#     for gender in ['M', 'F']:
#         for sign in ['Positive', 'Negative']:
#             subset = updated_df[(updated_df['gender'] == gender) & (updated_df['slope_change'] * (-1) ** (sign == 'Positive') > 0)]
#             updated_fig.add_trace(go.Violin(x=subset['gender'], y=subset['abs_slope_change'],
#                                             name=f'{gender} ({sign})', side='positive' if sign == 'Positive' else 'negative',
#                                             line_color=color_scale[gender][sign]))

#     return updated_fig

# # Run the app
# if __name__ == '__main__':
#     app.run_server(debug=True)



In [55]:
# Doesn't work as intended => can't use this plot
# fig = go.Figure()

# # Male (M) Violin Plot
# fig.add_trace(go.Violin(x=name_by_gender_filtered_df['gender'][(name_by_gender_filtered_df['gender'] == 'M') & (name_by_gender_filtered_df['slope_change'] < 0)],
#                         y=name_by_gender_filtered_df['abs_slope_change'][(name_by_gender_filtered_df['gender'] == 'M') & (name_by_gender_filtered_df['slope_change'] < 0)],
#                         legendgroup='Male', scalegroup='Male', name='Male (Negative)',
#                         side='negative',
#                         line_color='blue')
#              )
# fig.add_trace(go.Violin(x=name_by_gender_filtered_df['gender'][(name_by_gender_filtered_df['gender'] == 'M') & (name_by_gender_filtered_df['slope_change'] > 0)],
#                         y=name_by_gender_filtered_df['abs_slope_change'][(name_by_gender_filtered_df['gender'] == 'M') & (name_by_gender_filtered_df['slope_change'] > 0)],
#                         legendgroup='Male', scalegroup='Male', name='Male (Positive)',
#                         side='positive',
#                         line_color='orange')
#              )
# # Female (F) Violin Plot
# fig.add_trace(go.Violin(x=name_by_gender_filtered_df['gender'][(name_by_gender_filtered_df['gender'] == 'F') & (name_by_gender_filtered_df['slope_change'] < 0)],
#                         y=name_by_gender_filtered_df['abs_slope_change'][(name_by_gender_filtered_df['gender'] == 'F') & (name_by_gender_filtered_df['slope_change'] < 0)],
#                         legendgroup='Female', scalegroup='Female', name='Female (Negative)',
#                         side='negative',
#                         line_color='blue')
#              )
# fig.add_trace(go.Violin(x=name_by_gender_filtered_df['gender'][(name_by_gender_filtered_df['gender'] == 'F') & (name_by_gender_filtered_df['slope_change'] > 0)],
#                         y=name_by_gender_filtered_df['abs_slope_change'][(name_by_gender_filtered_df['gender'] == 'F') & (name_by_gender_filtered_df['slope_change'] > 0)],
#                         legendgroup='Female', scalegroup='Female', name='Female (Positive)',
#                         side='positive',
#                         line_color='orange')
#              )

# #fig.update_yaxes(type="log")  # Set y-axis to logarithmic scale
# fig.update_traces(meanline_visible=True)
# fig.update_layout(violingap=0, violinmode='overlay')
# fig.show()
# fig.write_html("Question5_1.html")


In [56]:
#Average slope change 
name_by_gender_prop_df = name_by_movie_df.groupby("gender").apply(lambda x: pd.Series({
        'prop_signif_per_gender': (x['p_value'] < alpha).sum()/len(x['p_value']),
        'avg_slope_change_per_gender_significant': x[x['p_value'] < alpha]['slope_change'].mean(),
        'se_slope_change_per_gender_significant': x[x['p_value'] < alpha]['slope_change'].sem(),
        'avg_slope_change_per_gender_global': x['slope_change'].mean(),
        'avg_mag_slope_change_per_gender_significant': x[x['p_value'] < alpha]['slope_change'].abs().mean(),
        'se_mag_slope_change_per_gender_significant': x[x['p_value'] < alpha]['slope_change'].abs().sem(),
        'avg_mag_slope_change_per_gender_global': x['slope_change'].abs().mean(),
        'total_number_signif_per_gender': (x['p_value'] < alpha).sum(),
    }))
display(name_by_gender_prop_df)

Unnamed: 0_level_0,prop_signif_per_gender,avg_slope_change_per_gender_significant,se_slope_change_per_gender_significant,avg_slope_change_per_gender_global,avg_mag_slope_change_per_gender_significant,se_mag_slope_change_per_gender_significant,avg_mag_slope_change_per_gender_global,total_number_signif_per_gender
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
F,0.113346,0.001698,0.000346,0.00011,0.014831,0.000299,0.0032,7107.0
M,0.090571,0.002384,0.000249,0.000249,0.01363,0.000208,0.002523,9487.0
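
The table above reports the mean magnitude and its standard error for each gender. Whether the two means differ can be checked with a two-sample t-test on the absolute slope changes. The sketch below runs on synthetic data with made-up spreads, and it uses Welch's variant (`equal_var=False`), which is an assumption of this illustration and not necessarily the test used in the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic slope changes: 300 female and 400 male names with different spreads
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "gender": ["F"] * 300 + ["M"] * 400,
    "slope_change": np.concatenate([
        rng.normal(0.0, 0.020, 300),
        rng.normal(0.0, 0.015, 400),
    ]),
})

# The magnitude column must exist before it is used in the test
toy["abs_slope_change"] = toy["slope_change"].abs()
mag_f = toy.loc[toy["gender"] == "F", "abs_slope_change"]
mag_m = toy.loc[toy["gender"] == "M", "abs_slope_change"]

# Welch's t-test does not assume equal variances in the two groups
t_value, p_value = stats.ttest_ind(mag_f, mag_m, equal_var=False)
print(f"t = {t_value:.3f}, p-value = {p_value:.5f}")
```

Absolute values are right-skewed, so with small groups a rank-based test such as `stats.mannwhitneyu` may be a safer choice; with several thousand names per gender the t-test is usually adequate.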


In [57]:
# Compute the standard error - Sanity check - Validated
#se_F = name_by_gender_df[name_by_gender_df.index.isin(['F'])]['abs_slope_change'].sem()
#se_M = name_by_gender_df[name_by_gender_df.index.isin(['M'])]['abs_slope_change'].sem()

fig = go.Figure()
colors = {'M': 'blue', 'F': 'pink'}
fig.add_trace(go.Bar(
    x=name_by_gender_prop_df.index,
    y=name_by_gender_prop_df['avg_mag_slope_change_per_gender_significant'],
    # Error bars span two standard errors (roughly a 95% interval)
    error_y=dict(type='data', array=2*name_by_gender_prop_df['se_mag_slope_change_per_gender_significant']),
    marker_color=[colors[gender] for gender in name_by_gender_prop_df.index]
))
fig.update_layout(barmode='group')
#fig.update_yaxes(type="log")
fig.show()

In [58]:
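# Two-sample t-test on the slope-change magnitudes of significant male and female names.
# The magnitude is computed inline because name_by_gender_df has no 'abs_slope_change' column.
abs_slope_change_M = name_by_gender_df.loc[name_by_gender_df['gender'] == 'M', 'slope_change'].abs()
abs_slope_change_F = name_by_gender_df.loc[name_by_gender_df['gender'] == 'F', 'slope_change'].abs()
t_value, p_value = stats.ttest_ind(abs_slope_change_M, abs_slope_change_F)
display("p-value is {:.5f}".format(p_value))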

#### Are character gender and movie genre linked?

##### Group by movie genre and compare men and women for the 5 movie genres with the most data

In [59]:
display(movie_genre_aggregate_df)
most_data_per_genre = movie_genre_aggregate_df[movie_genre_aggregate_df['p_value'] < alpha].groupby(['genre']).count().nlargest(5, columns="p_value").index
display(most_data_per_genre)
genre_with_most_data_df = movie_genre_aggregate_df[movie_genre_aggregate_df['genre'].isin(most_data_per_genre) & (movie_genre_aggregate_df['p_value'] < alpha)]
display(genre_with_most_data_df)

Unnamed: 0_level_0,char_words,order,gender,t_stat,p_value,slope_change,genre
wiki_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
3217,Gold,6.0,,,,0.0,Action
3217,Gold,6.0,,,,0.0,Comedy
3217,Gold,6.0,,,,0.0,Time travel
3217,Gold,6.0,,,,0.0,Black comedy
3217,Gold,6.0,,,,0.0,Zombie Film
...,...,...,...,...,...,...,...
37241569,,,,,,,Action
37476824,,,,,,,Comedy
37476824,,,,,,,Crime Comedy
37476824,,,,,,,Caper story


Index(['Drama', 'Comedy', 'Thriller', 'Romance Film', 'Action'], dtype='object', name='genre')

Unnamed: 0_level_0,char_words,order,gender,t_stat,p_value,slope_change,genre
wiki_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
3837,Jim,1.0,M,-2.715964,0.020076,0.006715,Comedy
3947,Hunter,13.0,M,-4.938567,0.000444,0.002903,Thriller
4231,Jennifer,6.0,F,-2.455800,0.031916,0.088259,Action
4231,Jennifer,6.0,F,-2.455800,0.031916,0.088259,Comedy
4560,William,0.0,M,-3.378640,0.006157,0.015610,Action
...,...,...,...,...,...,...,...
36699915,Jackson,0.0,M,4.587186,0.000781,-0.026825,Action
36699915,Jackson,0.0,M,4.587186,0.000781,-0.026825,Drama
36814246,Man,6.0,M,-4.220301,0.001436,0.000037,Drama
36814246,Mary,1.0,F,-2.783137,0.017804,0.041502,Drama
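
The selection of the genres with the most significant entries can be sketched with `value_counts` and `nlargest`. The frame below is synthetic, and `ALPHA` and `top_rows` are names introduced for this illustration only.

```python
import pandas as pd

ALPHA = 0.05  # same significance level as in the notebook

# Synthetic data: Drama and Comedy each have two significant rows
toy = pd.DataFrame({
    "genre": ["Drama", "Drama", "Drama", "Comedy", "Comedy", "Action", "Western"],
    "p_value": [0.01, 0.02, 0.60, 0.03, 0.04, 0.01, 0.70],
})

is_significant = toy["p_value"] < ALPHA

# Count significant rows per genre and keep the two largest
top_genres = toy.loc[is_significant, "genre"].value_counts().nlargest(2).index

# Significant rows belonging to the selected genres
top_rows = toy[toy["genre"].isin(top_genres) & is_significant]
print(top_rows)
```

Counting a single column avoids relying on which column `count()` happens to report, since `count()` skips missing values column by column.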


In [60]:
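# Does the influence of character gender depend on the movie genre? Study of the impact of gender per movie genre
# Note: genre_with_most_data_df only contains significant rows, so the proportion column equals 1.0
# and the "global" columns coincide with the "significant" ones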
name_by_gender_by_genre_prop_df = genre_with_most_data_df.groupby(['gender','genre']).apply(lambda x: pd.Series({
        'prop_signif_per_gender_per_genre': (x['p_value'] < alpha).sum()/len(x['p_value']),
        'avg_slope_change_per_gender_per_genre_significant': x[x['p_value'] < alpha]['slope_change'].mean(),
        'se_slope_change_per_gender_per_genre_significant': x[x['p_value'] < alpha]['slope_change'].sem(),
        'avg_slope_change_per_gender_per_genre_global': x['slope_change'].mean(),
        'avg_mag_slope_change_per_gender_per_genre_significant': x[x['p_value'] < alpha]['slope_change'].abs().mean(),
        'se_mag_slope_change_per_gender_per_genre_significant': x[x['p_value'] < alpha]['slope_change'].abs().sem(),
        'avg_mag_slope_change_per_gender_per_genre_global': x['slope_change'].abs().mean(),
        'total_number_signif_per_gender_per_genre': (x['p_value'] < alpha).sum(),
    }))
display(name_by_gender_by_genre_prop_df)


Unnamed: 0_level_0,Unnamed: 1_level_0,prop_signif_per_gender_per_genre,avg_slope_change_per_gender_per_genre_significant,se_slope_change_per_gender_per_genre_significant,avg_slope_change_per_gender_per_genre_global,avg_mag_slope_change_per_gender_per_genre_significant,se_mag_slope_change_per_gender_per_genre_significant,avg_mag_slope_change_per_gender_per_genre_global,total_number_signif_per_gender_per_genre
gender,genre,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
F,Action,1.0,0.000447,0.00087,0.000447,0.01398,0.00075,0.01398,1002.0
F,Comedy,1.0,0.002229,0.000589,0.002229,0.01473,0.000512,0.01473,2505.0
F,Drama,1.0,0.002089,0.000417,0.002089,0.013669,0.000357,0.013669,3958.0
F,Romance Film,1.0,0.001437,0.000738,0.001437,0.01464,0.000656,0.01464,1863.0
F,Thriller,1.0,0.002103,0.000609,0.002103,0.013739,0.000511,0.013739,1688.0
M,Action,1.0,0.002146,0.000512,0.002146,0.013006,0.00042,0.013006,1927.0
M,Comedy,1.0,0.002474,0.000433,0.002474,0.012532,0.000366,0.012532,2847.0
M,Drama,1.0,0.002329,0.000326,0.002329,0.013589,0.000269,0.013589,5325.0
M,Romance Film,1.0,0.002522,0.000539,0.002522,0.013558,0.000448,0.013558,1982.0
M,Thriller,1.0,0.002065,0.000469,0.002065,0.013301,0.000389,0.013301,2529.0
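
Every proportion in the table above equals 1.0 because the input frame was filtered to `p_value < alpha` before grouping. A minimal sketch, on synthetic data, of computing the proportion of significant names on the unfiltered rows instead; the column `significant` and the output names are assumptions made for this illustration.

```python
import pandas as pd

ALPHA = 0.05  # same significance level as in the notebook

# Synthetic, unfiltered data for a single genre
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "genre": ["Drama"] * 8,
    "p_value": [0.01, 0.20, 0.60, 0.03, 0.04, 0.50, 0.70, 0.90],
})

# Flag significance first, then aggregate the flag per (gender, genre)
toy["significant"] = toy["p_value"] < ALPHA
prop = toy.groupby(["gender", "genre"])["significant"].agg(
    prop_signif="mean",          # share of significant rows
    total_number_signif="sum",   # number of significant rows
    n_rows="size",               # all rows in the group
)
print(prop)
```

In the notebook this would mean restricting the frame to the top genres without the `p_value` condition, then grouping.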


In [61]:
# Grouped bar chart of the average slope-change magnitude per gender for the 5 genres with the most data
# Uses name_by_gender_by_genre_prop_df, indexed by (gender, genre)

fig = go.Figure()

colors = {'M': 'blue', 'F': 'pink'}

genres = most_data_per_genre
display(genres)

# Display the per-gender statistics of each genre
for genre in genres:
    genre_data = name_by_gender_by_genre_prop_df.xs(genre, level='genre')
    display(genre_data)

# One trace per gender, with one bar per genre, so that x and y have matching lengths
for gender, label in [('M', 'Men'), ('F', 'Women')]:
    gender_data = name_by_gender_by_genre_prop_df.xs(gender, level='gender').loc[genres]
    fig.add_trace(go.Bar(
        x=gender_data.index,
        y=gender_data['avg_mag_slope_change_per_gender_per_genre_significant'],
        # Error bars span two standard errors
        error_y=dict(type='data', array=2 * gender_data['se_mag_slope_change_per_gender_per_genre_significant']),
        marker_color=colors[gender],
        name=label
    ))

fig.update_layout(barmode='group', xaxis={'categoryorder':'total ascending'})
fig.show()


Index(['Drama', 'Comedy', 'Thriller', 'Romance Film', 'Action'], dtype='object', name='genre')

Unnamed: 0_level_0,prop_signif_per_gender_per_genre,avg_slope_change_per_gender_per_genre_significant,se_slope_change_per_gender_per_genre_significant,avg_slope_change_per_gender_per_genre_global,avg_mag_slope_change_per_gender_per_genre_significant,se_mag_slope_change_per_gender_per_genre_significant,avg_mag_slope_change_per_gender_per_genre_global,total_number_signif_per_gender_per_genre
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
F,1.0,0.002089,0.000417,0.002089,0.013669,0.000357,0.013669,3958.0
M,1.0,0.002329,0.000326,0.002329,0.013589,0.000269,0.013589,5325.0


Unnamed: 0_level_0,prop_signif_per_gender_per_genre,avg_slope_change_per_gender_per_genre_significant,se_slope_change_per_gender_per_genre_significant,avg_slope_change_per_gender_per_genre_global,avg_mag_slope_change_per_gender_per_genre_significant,se_mag_slope_change_per_gender_per_genre_significant,avg_mag_slope_change_per_gender_per_genre_global,total_number_signif_per_gender_per_genre
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
F,1.0,0.002229,0.000589,0.002229,0.01473,0.000512,0.01473,2505.0
M,1.0,0.002474,0.000433,0.002474,0.012532,0.000366,0.012532,2847.0


Unnamed: 0_level_0,prop_signif_per_gender_per_genre,avg_slope_change_per_gender_per_genre_significant,se_slope_change_per_gender_per_genre_significant,avg_slope_change_per_gender_per_genre_global,avg_mag_slope_change_per_gender_per_genre_significant,se_mag_slope_change_per_gender_per_genre_significant,avg_mag_slope_change_per_gender_per_genre_global,total_number_signif_per_gender_per_genre
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
F,1.0,0.002103,0.000609,0.002103,0.013739,0.000511,0.013739,1688.0
M,1.0,0.002065,0.000469,0.002065,0.013301,0.000389,0.013301,2529.0


Unnamed: 0_level_0,prop_signif_per_gender_per_genre,avg_slope_change_per_gender_per_genre_significant,se_slope_change_per_gender_per_genre_significant,avg_slope_change_per_gender_per_genre_global,avg_mag_slope_change_per_gender_per_genre_significant,se_mag_slope_change_per_gender_per_genre_significant,avg_mag_slope_change_per_gender_per_genre_global,total_number_signif_per_gender_per_genre
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
F,1.0,0.001437,0.000738,0.001437,0.01464,0.000656,0.01464,1863.0
M,1.0,0.002522,0.000539,0.002522,0.013558,0.000448,0.013558,1982.0


Unnamed: 0_level_0,prop_signif_per_gender_per_genre,avg_slope_change_per_gender_per_genre_significant,se_slope_change_per_gender_per_genre_significant,avg_slope_change_per_gender_per_genre_global,avg_mag_slope_change_per_gender_per_genre_significant,se_mag_slope_change_per_gender_per_genre_significant,avg_mag_slope_change_per_gender_per_genre_global,total_number_signif_per_gender_per_genre
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
F,1.0,0.000447,0.00087,0.000447,0.01398,0.00075,0.01398,1002.0
M,1.0,0.002146,0.000512,0.002146,0.013006,0.00042,0.013006,1927.0


In [62]:
# 'abs_slope_change' is not a column of name_by_gender_df, so derive it from 'slope_change' before plotting
fig = px.box(name_by_gender_df.assign(abs_slope_change=name_by_gender_df['slope_change'].abs()),
             x='gender', y='abs_slope_change', color='gender')
fig.update_yaxes(type="log")
fig.show()

In [63]:
# Display the average slope-change magnitude per gender and per genre

display(len(name_by_gender_by_genre_prop_df['avg_mag_slope_change_per_gender_per_genre_significant']['F']))

display(len(name_by_gender_by_genre_prop_df['avg_mag_slope_change_per_gender_per_genre_significant']['M']))

display(name_by_gender_by_genre_prop_df['avg_mag_slope_change_per_gender_per_genre_significant']['F'])

# Find the genres that are common to both subgroups

intersection_genre = name_by_gender_by_genre_prop_df['avg_mag_slope_change_per_gender_per_genre_significant']['M'].index.intersection(name_by_gender_by_genre_prop_df['avg_mag_slope_change_per_gender_per_genre_significant']['F'].index)


common_data_series_M = name_by_gender_by_genre_prop_df['avg_mag_slope_change_per_gender_per_genre_significant']['M'].loc[intersection_genre]
common_data_series_F = name_by_gender_by_genre_prop_df['avg_mag_slope_change_per_gender_per_genre_significant']['F'].loc[intersection_genre]

import plotly.graph_objects as go

fig = go.Figure(data=[
    go.Bar(name='Male', x=common_data_series_M.index, y=common_data_series_M),
    go.Bar(name='Female', x=common_data_series_F.index, y=common_data_series_F)
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

5

5

genre
Action          0.013980
Comedy          0.014730
Drama           0.013669
Romance Film    0.014640
Thriller        0.013739
Name: avg_mag_slope_change_per_gender_per_genre_significant, dtype: float64

<span style="color:red"> Add error bars for each genre on the magnitude of the slope change.</span>

<span style="color:red"> It may be interesting to compare, between men and women, the number of names influenced per genre (or the proportion relative to the total number of names appearing in the corresponding genre), among the 5 genres displayed.</span>

In [64]:


# Get the 5 genres with the largest number of names having a significant slope change
top5_genres = name_by_genre_prop_df.nlargest(5, 'nb_names_signi_in_genre')[['genre', 'nb_names_signi_in_genre', 'se_slope_change_significant']]

# Sanity check: the top-5 genres should all be among the genres common to both genders
display(top5_genres['genre'].isin(intersection_genre))

# Extract common data for male and female
common_data_series_M_2 = name_by_gender_by_genre_prop_df['avg_mag_slope_change_per_gender_per_genre_significant']['M'].loc[top5_genres['genre']]
common_data_series_F_2 = name_by_gender_by_genre_prop_df['avg_mag_slope_change_per_gender_per_genre_significant']['F'].loc[top5_genres['genre']]

import plotly.graph_objects as go

fig = go.Figure(data=[
    go.Bar(name='Male', x=common_data_series_M_2.index, y=common_data_series_M_2),
    go.Bar(name='Female', x=common_data_series_F_2.index, y=common_data_series_F_2)
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()


94     True
59     True
260    True
214    True
2      True
Name: genre, dtype: bool

In [65]:
display(top5_genres)
display(name_by_gender_by_genre_prop_df)

Unnamed: 0,genre,nb_names_signi_in_genre,se_slope_change_significant
94,Drama,9552.0,0.000257
59,Comedy,5494.0,0.000352
260,Thriller,4340.0,0.000369
214,Romance Film,3941.0,0.000444
2,Action,2995.0,0.000441


Unnamed: 0_level_0,Unnamed: 1_level_0,prop_signif_per_gender_per_genre,avg_slope_change_per_gender_per_genre_significant,se_slope_change_per_gender_per_genre_significant,avg_slope_change_per_gender_per_genre_global,avg_mag_slope_change_per_gender_per_genre_significant,se_mag_slope_change_per_gender_per_genre_significant,avg_mag_slope_change_per_gender_per_genre_global,total_number_signif_per_gender_per_genre
gender,genre,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
F,Action,1.0,0.000447,0.00087,0.000447,0.01398,0.00075,0.01398,1002.0
F,Comedy,1.0,0.002229,0.000589,0.002229,0.01473,0.000512,0.01473,2505.0
F,Drama,1.0,0.002089,0.000417,0.002089,0.013669,0.000357,0.013669,3958.0
F,Romance Film,1.0,0.001437,0.000738,0.001437,0.01464,0.000656,0.01464,1863.0
F,Thriller,1.0,0.002103,0.000609,0.002103,0.013739,0.000511,0.013739,1688.0
M,Action,1.0,0.002146,0.000512,0.002146,0.013006,0.00042,0.013006,1927.0
M,Comedy,1.0,0.002474,0.000433,0.002474,0.012532,0.000366,0.012532,2847.0
M,Drama,1.0,0.002329,0.000326,0.002329,0.013589,0.000269,0.013589,5325.0
M,Romance Film,1.0,0.002522,0.000539,0.002522,0.013558,0.000448,0.013558,1982.0
M,Thriller,1.0,0.002065,0.000469,0.002065,0.013301,0.000389,0.013301,2529.0


## Matching

#### Step 1: establishing dataframe

In [66]:
# Select the initial dataframe (copied so that the in-place operations below leave the original untouched)
matching_df = movie_genre_aggregate_with_years_df.copy()

In [67]:
# Drop NaN values on the matched characteristics: movie genre, gender, year, order,
# as well as on numVotes (the treatment variable) and p_value (used to identify significant influence)
matching_df.dropna(subset=['order', 'year', 'numVotes', 'genre', 'gender', 'p_value'], inplace=True)
display(matching_df)

Unnamed: 0_level_0,char_words,order,gender,t_stat,p_value,slope_change,genre,mov_name,year,month,revenue,numVotes,averageRating
wiki_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
3217,Linda,7.0,F,-0.416786,0.684853,0.000673,Action,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Linda,7.0,F,-0.416786,0.684853,0.000673,Comedy,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Linda,7.0,F,-0.416786,0.684853,0.000673,Time travel,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Linda,7.0,F,-0.416786,0.684853,0.000673,Black comedy,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3217,Linda,7.0,F,-0.416786,0.684853,0.000673,Zombie Film,Army of Darkness,1992,10.0,21502796.0,191068,7.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
37478048,Ajay,9.0,M,-0.819213,0.430057,0.000130,Comedy film,Mr. Bechara,1996,,,395,5.4
37501922,Murphy,3.0,F,1.264175,0.232298,-0.000365,Drama,Terminal Bliss,1992,,,245,4.4
37501922,Hunter,1.0,M,-7.083089,0.000020,0.036603,Drama,Terminal Bliss,1992,,,245,4.4
37501922,John,1.0,M,-2.172964,0.052505,0.012557,Drama,Terminal Bliss,1992,,,245,4.4


In [68]:
# Create a new unique integer index for each character row relevant for the matching

matching_df.reset_index(drop=True, inplace=True)
display(matching_df)

Unnamed: 0,char_words,order,gender,t_stat,p_value,slope_change,genre,mov_name,year,month,revenue,numVotes,averageRating
0,Linda,7.0,F,-0.416786,0.684853,0.000673,Action,Army of Darkness,1992,10.0,21502796.0,191068,7.4
1,Linda,7.0,F,-0.416786,0.684853,0.000673,Comedy,Army of Darkness,1992,10.0,21502796.0,191068,7.4
2,Linda,7.0,F,-0.416786,0.684853,0.000673,Time travel,Army of Darkness,1992,10.0,21502796.0,191068,7.4
3,Linda,7.0,F,-0.416786,0.684853,0.000673,Black comedy,Army of Darkness,1992,10.0,21502796.0,191068,7.4
4,Linda,7.0,F,-0.416786,0.684853,0.000673,Zombie Film,Army of Darkness,1992,10.0,21502796.0,191068,7.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
538259,Ajay,9.0,M,-0.819213,0.430057,0.000130,Comedy film,Mr. Bechara,1996,,,395,5.4
538260,Murphy,3.0,F,1.264175,0.232298,-0.000365,Drama,Terminal Bliss,1992,,,245,4.4
538261,Hunter,1.0,M,-7.083089,0.000020,0.036603,Drama,Terminal Bliss,1992,,,245,4.4
538262,John,1.0,M,-2.172964,0.052505,0.012557,Drama,Terminal Bliss,1992,,,245,4.4


In [69]:
# To perform the matching, categorical features must be transformed into
# binary or integer ones. Here, movie genre and gender are each represented as an integer
# code (one integer per movie genre or gender) for simplicity.
# Add a new column in matching_df that encodes the movie genre
matching_df['genre'] = matching_df['genre'].astype('category')
matching_df['genre_code'] = matching_df['genre'].cat.codes

# Add a new column in matching_df that encodes the gender column
# 0 = Women, 1 = Men
matching_df['gender'] = matching_df['gender'].astype('category')
matching_df['gender_code'] = matching_df['gender'].cat.codes

display(matching_df)


Unnamed: 0,char_words,order,gender,t_stat,p_value,slope_change,genre,mov_name,year,month,revenue,numVotes,averageRating,genre_code,gender_code
0,Linda,7.0,F,-0.416786,0.684853,0.000673,Action,Army of Darkness,1992,10.0,21502796.0,191068,7.4,2,0
1,Linda,7.0,F,-0.416786,0.684853,0.000673,Comedy,Army of Darkness,1992,10.0,21502796.0,191068,7.4,70,0
2,Linda,7.0,F,-0.416786,0.684853,0.000673,Time travel,Army of Darkness,1992,10.0,21502796.0,191068,7.4,313,0
3,Linda,7.0,F,-0.416786,0.684853,0.000673,Black comedy,Army of Darkness,1992,10.0,21502796.0,191068,7.4,42,0
4,Linda,7.0,F,-0.416786,0.684853,0.000673,Zombie Film,Army of Darkness,1992,10.0,21502796.0,191068,7.4,329,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
538259,Ajay,9.0,M,-0.819213,0.430057,0.000130,Comedy film,Mr. Bechara,1996,,,395,5.4,73,1
538260,Murphy,3.0,F,1.264175,0.232298,-0.000365,Drama,Terminal Bliss,1992,,,245,4.4,108,0
538261,Hunter,1.0,M,-7.083089,0.000020,0.036603,Drama,Terminal Bliss,1992,,,245,4.4,108,1
538262,John,1.0,M,-2.172964,0.052505,0.012557,Drama,Terminal Bliss,1992,,,245,4.4,108,1
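
As an illustration of the encoding step, the sketch below shows how `cat.codes` assigns integers and how the mapping can be read back from `cat.categories`. The toy frame and the `genre_lookup` / `gender_lookup` names are hypothetical and not part of the analysis.

```python
import pandas as pd

# Toy frame (hypothetical values) with the same two categorical columns
toy_df = pd.DataFrame({
    'genre': ['Drama', 'Action', 'Comedy', 'Drama'],
    'gender': ['M', 'F', 'F', 'M'],
})

for column in ['genre', 'gender']:
    toy_df[column] = toy_df[column].astype('category')
    toy_df[f'{column}_code'] = toy_df[column].cat.codes

# Categories are sorted, so code i maps back to categories[i]
genre_lookup = dict(enumerate(toy_df['genre'].cat.categories))
gender_lookup = dict(enumerate(toy_df['gender'].cat.categories))
print(genre_lookup)   # {0: 'Action', 1: 'Comedy', 2: 'Drama'}
print(gender_lookup)  # {0: 'F', 1: 'M'}
```

These codes are arbitrary labels: they are suitable for the exact-equality matching used below, but the numeric gap between two codes carries no meaning and should not be used as a distance.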


#### Step 2 : create treatment and control groups by splitting the characters at the 90th percentile of the `numVotes` values (computed over significantly influenced names)

In [70]:
import statistics

In [71]:
# Determine the 90th percentile of numVotes among significantly influenced names
percentile90 = np.percentile(matching_df[matching_df['p_value']<alpha]['numVotes'], 90)
display(percentile90)

256116.0

In [72]:
# Filtering to create the control (numVotes <= percentile90) and treatment (numVotes > percentile90) groups
# .copy() makes each group an independent dataframe, which avoids pandas' SettingWithCopyWarning when adding a column
control_df = matching_df[(matching_df['numVotes'] <= percentile90) & (matching_df['p_value'] < alpha)].copy()
treatment_df = matching_df[(matching_df['numVotes'] > percentile90) & (matching_df['p_value'] < alpha)].copy()

# Add a new column in each dataframe to identify whether it belongs to the treatment or control group
control_df['is_treated'] = 0
treatment_df['is_treated'] = 1

display(control_df.sample(5))
print(f"Length of control population: {len(control_df)}")
display(treatment_df.sample(5))
print(f"Length of treatment population: {len(treatment_df)}")


Unnamed: 0,char_words,order,gender,t_stat,p_value,slope_change,genre,mov_name,year,month,revenue,numVotes,averageRating,genre_code,gender_code,is_treated
404314,Eleanor,0.0,F,-4.044161,0.001935,0.010289,War film,Courage of Lassie,1946,11.0,,1640,6.2,321,0,0
100684,Alexander,8.0,F,5.887622,0.000105,-0.063085,Mystery,Babylon 5: The Gathering,1993,2.0,,9669,6.5,215,0,0
397424,Reed,14.0,M,-2.619964,0.02383,0.001548,Sex comedy,Who's Your Daddy?,2004,1.0,,4614,4.5,276,1,0
433359,Julius,7.0,M,2.579757,0.025601,-0.002231,Television movie,Grey Gardens,2009,4.0,,11006,7.4,309,1,0
491018,Irina,19.0,F,-2.436515,0.033026,0.000407,Teen,The Twilight Saga: Breaking Dawn - Part 1,2011,11.0,705058657.0,250095,4.9,308,0,0


Length of control population: 57152


Unnamed: 0,char_words,order,gender,t_stat,p_value,slope_change,genre,mov_name,year,month,revenue,numVotes,averageRating,genre_code,gender_code,is_treated
96320,Robert,8.0,M,-3.298445,0.007098,0.014676,Fantasy,The Village,2004,7.0,256697520.0,273017,6.6,131,1,1
383520,Jessica,7.0,F,-4.511284,0.000885,0.025635,Teen,Twilight,2008,11.0,392616625.0,478923,5.3,308,0,1
362524,Cody,9.0,M,-2.544109,0.02728,0.008408,Parody,Tropic Thunder,2008,8.0,188072649.0,437972,7.1,230,1,1
58440,Bruce,0.0,M,-2.361831,0.037692,0.001583,Adventure,Hulk,2003,6.0,245360480.0,275529,5.6,8,1,1
391348,Scott,6.0,M,-3.457807,0.005354,0.004627,Short Film,The Wrestler,2008,9.0,44703995.0,316140,7.9,278,1,1


Length of treatment population: 6207
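
The split can be illustrated on a toy example. This is only a sketch with hypothetical vote counts: it labels rows in a single frame rather than building two separate frames, which is an alternative layout and not the one used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical vote counts, for illustration only
toy_df = pd.DataFrame({'numVotes': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# 90th percentile with NumPy's default linear interpolation (about 91 here)
threshold = np.percentile(toy_df['numVotes'], 90)

# Rows strictly above the threshold are treated, the rest are controls
toy_df['is_treated'] = (toy_df['numVotes'] > threshold).astype(int)
print(threshold)
print(toy_df['is_treated'].value_counts().to_dict())
```

By construction roughly 10% of the rows end up in the treatment group, which is consistent with the group sizes printed above.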


#### Step 3 : matching

In [73]:
# Columns to use for matching
matching_columns = ['order', 'year', 'genre_code', 'gender_code', 'averageRating']

# Initialize a list to store the matched pairs
matched_pairs = []

# Iterate through each row in the control dataframe
for control_index, control_row in control_df.iterrows():
    # Filter the treatment dataframe based on the matching columns
    matching_rows = treatment_df[
        (treatment_df[matching_columns] == control_row[matching_columns]).all(axis=1)
    ]

    # Check if there is a match
    if not matching_rows.empty:
        # Store the index of the matched pair
        treatment_index = matching_rows.index[0]
        matched_pairs.append((control_index, treatment_index))

# Display the matched pairs
print("Matched Pairs:")
print(matched_pairs) # control_index, treatment_index

Matched Pairs:
[(2230, 47915), (2244, 47957), (5109, 134465), (5111, 134469), (26959, 26865), (42225, 27070), (42226, 27069), (42227, 27066), (42228, 27067), (42229, 27065), (44449, 379171), (49094, 8055), (62325, 43729), (62328, 43732), (66095, 55936), (66098, 55933), (66099, 55928), (66100, 55935), (70759, 54102), (77289, 38025), (86748, 43606), (89389, 57581), (108715, 63294), (108716, 63293), (110239, 96343), (110240, 96345), (110241, 96342), (110242, 96319), (110243, 96321), (110244, 96318), (114653, 91235), (114654, 91236), (114656, 91237), (120121, 50702), (121910, 43603), (123420, 58399), (123425, 58397), (123426, 58398), (123427, 58400), (133272, 214440), (133273, 214444), (133275, 214446), (133276, 214441), (149617, 214453), (149620, 214451), (175359, 47797), (177401, 161559), (181008, 168402), (181012, 168404), (195919, 120327), (211294, 142477), (213834, 155238), (213835, 155237), (219178, 199956), (219181, 199957), (219182, 199958), (219184, 199959), (219185, 199961), (231
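
The row-by-row loop above compares every control row against the whole treatment frame. A possible alternative, sketched here on hypothetical toy data, is an inner join on the matching columns. Note the difference in behaviour: the join returns every exactly matching pair, whereas the loop keeps only the first treatment match for each control row.

```python
import pandas as pd

matching_columns = ['order', 'year', 'genre_code', 'gender_code', 'averageRating']

# Hypothetical toy groups; the index plays the role of the row index of matching_df
toy_control = pd.DataFrame(
    [[1.0, 2003, 8, 1, 5.6], [0.0, 2010, 2, 1, 6.4], [3.0, 1999, 5, 0, 7.0]],
    columns=matching_columns, index=[10, 11, 12])
toy_treatment = pd.DataFrame(
    [[1.0, 2003, 8, 1, 5.6], [1.0, 2003, 8, 1, 5.6], [0.0, 2010, 2, 1, 6.4]],
    columns=matching_columns, index=[20, 21, 22])

# An inner join on the matching columns keeps every exactly matching pair
candidate_pairs = (
    toy_control.rename_axis('control_index').reset_index()
    .merge(toy_treatment.rename_axis('treatment_index').reset_index(),
           on=matching_columns, how='inner')
)
pairs = list(zip(candidate_pairs['control_index'], candidate_pairs['treatment_index']))
print(sorted(pairs))  # [(10, 20), (10, 21), (11, 22)]
```

Control row 12 has no exact counterpart and is dropped, while control row 10 keeps both of its candidates; choosing one pair per row is then left to the graph matching step.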

In [74]:
# index_1_noOpti = pd.Series([i[0] for i in list(matched_pairs)])
# index_2_noOpti = pd.Series([i[1] for i in list(matched_pairs)])
# display(index_1_noOpti)


In [75]:
# duplicates_1 = index_1_noOpti.duplicated()
# display(duplicates_1)

# duplicates_2 = index_2_noOpti.duplicated()
# display(duplicates_2)

In [76]:
import networkx as nx
# Create graph
G = nx.Graph()

# Adding edge between pairs
for pair in matched_pairs:
    G.add_edge(pair[0], pair[1])

In [77]:
# Compute a maximal matching: no further pair can be added, but it is not guaranteed to be the largest possible set of pairs
matching = nx.maximal_matching(G)

display(list(matching))
print(f"number of matched pairs: {len(matching)}")

[(123427, 58400),
 (428525, 437095),
 (495446, 463897),
 (114653, 91235),
 (287776, 41011),
 (366839, 379172),
 (108716, 63293),
 (406575, 379220),
 (534520, 523030),
 (2230, 47915),
 (345060, 197943),
 (389809, 299717),
 (133275, 214446),
 (110239, 96343),
 (506503, 289934),
 (279022, 283097),
 (258046, 183299),
 (473844, 299836),
 (428554, 444934),
 (121910, 43603),
 (219178, 199956),
 (258052, 183296),
 (233382, 168405),
 (290360, 162952),
 (49094, 8055),
 (258051, 183297),
 (366840, 379174),
 (89389, 57581),
 (44449, 379171),
 (399515, 379132),
 (380831, 350515),
 (342490, 363973),
 (449444, 432679),
 (279021, 283100),
 (479842, 503198),
 (133273, 214444),
 (123425, 58397),
 (219182, 199958),
 (287778, 41015),
 (345062, 197941),
 (345063, 197940),
 (427780, 367580),
 (339893, 91590),
 (409460, 444933),
 (232527, 168420),
 (428553, 444937),
 (496405, 492262),
 (528153, 536851),
 (5109, 134465),
 (501192, 500307),
 (510245, 500309),
 (110240, 96345),
 (374212, 414205),
 (506507, 3795

number of matched pairs: 228
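
`nx.maximal_matching` is greedy, so in general it can return fewer pairs than a maximum-cardinality matching. Because the loop above stores a single edge per control row, both approaches should give the same number of pairs on this graph; the distinction matters if every candidate match is kept as an edge. The sketch below, on a hypothetical toy graph, compares the two and also fixes the (control, treatment) orientation of each pair explicitly.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical toy graph: control c2 has two candidate treatments
controls = ['c1', 'c2']
G = nx.Graph()
G.add_edges_from([('c2', 't1'), ('c1', 't1'), ('c2', 't2')])

# Greedy maximal matching: valid, but its size can depend on the edge order
greedy_pairs = nx.maximal_matching(G)

# Maximum-cardinality matching for a bipartite graph (Hopcroft-Karp)
partner = bipartite.maximum_matching(G, top_nodes=controls)
# The result maps each matched node to its partner in both directions;
# reading it from the control side fixes the (control, treatment) orientation
maximum_pairs = sorted((c, partner[c]) for c in controls if c in partner)

print(len(greedy_pairs), len(maximum_pairs))
print(maximum_pairs)  # [('c1', 't1'), ('c2', 't2')]
```

Reading the pairs from the control side avoids relying on the order in which an undirected edge happens to be returned.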


In [78]:
matching_result_df = pd.DataFrame(matching, columns=['control_data', 'treated_data'])
matching_result_df.to_csv('matching_result_rating_significant_90.csv', index=False)

In [79]:
# Separate first and second elements of each pair: first element is in control dataframe, second in treatment dataframe
index_control = [i[0] for i in list(matching)]
index_treatment = [i[1] for i in list(matching)]
print(index_control)
print(index_treatment)

[123427, 428525, 495446, 114653, 287776, 366839, 108716, 406575, 534520, 2230, 345060, 389809, 133275, 110239, 506503, 279022, 258046, 473844, 428554, 121910, 219178, 258052, 233382, 290360, 49094, 258051, 366840, 89389, 44449, 399515, 380831, 342490, 449444, 279021, 479842, 133273, 123425, 219182, 287778, 345062, 345063, 427780, 339893, 409460, 232527, 428553, 496405, 528153, 5109, 501192, 510245, 110240, 374212, 506507, 468406, 123426, 428400, 120121, 181012, 501107, 406580, 381852, 364959, 506502, 428327, 287782, 406547, 520956, 108715, 496404, 428394, 310879, 177401, 364958, 62328, 66099, 77289, 303065, 382636, 393541, 271239, 451130, 531065, 110241, 506504, 306057, 133272, 439143, 510871, 232522, 181008, 339892, 66100, 211294, 358989, 451133, 476358, 480247, 5111, 475212, 517758, 408949, 473848, 508948, 479841, 265243, 347896, 114656, 474552, 505144, 282986, 292144, 433037, 393540, 516482, 344081, 359609, 449447, 110244, 271238, 510094, 310878, 488643, 413852, 86748, 66098, 381811

In [80]:
display(matching_df.loc[index_control])
display(matching_df.loc[index_treatment])

Unnamed: 0,char_words,order,gender,t_stat,p_value,slope_change,genre,mov_name,year,month,revenue,numVotes,averageRating,genre_code,gender_code
123427,Steven,7.0,M,-2.477505,0.030709,0.011481,Adventure,Timeline,2003,11.0,43935763.0,65086,5.6,8,1
428525,Charlie,0.0,M,-2.962863,0.012909,0.003892,Action,From Paris with Love,2010,2.0,52826594.0,120490,6.4,2,1
495446,Eric,0.0,M,-2.945878,0.013307,0.007560,Comedy,Life as We Know It,2010,10.0,105648706.0,137099,6.5,70,1
114653,Luke,2.0,M,3.364851,0.006309,-0.018467,Drama,Wicker Park,2004,9.0,21568818.0,58049,6.9,108,1
287776,Charlie,1.0,M,-2.271196,0.044215,0.001843,Thriller,Shade,2003,6.0,459098.0,13071,6.3,312,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
389538,Kevin,12.0,M,2.558811,0.026575,-0.011775,Drama,Doubt,2008,10.0,50907234.0,134474,7.5,108,1
381812,Rachel,1.0,F,-4.418206,0.001032,0.014350,Thriller,Eagle Eye,2008,9.0,178966569.0,193179,6.6,312,0
294308,Kate,1.0,F,2.265992,0.044620,-0.008073,Romance Film,Mister Foe,2007,,,13009,6.9,262,0
282985,Emily,1.0,F,4.521404,0.000870,-0.056235,Comedy,The Puffy Chair,2005,,192467.0,5217,6.5,70,0


Unnamed: 0,char_words,order,gender,t_stat,p_value,slope_change,genre,mov_name,year,month,revenue,numVotes,averageRating,genre_code,gender_code
58400,Harper,7.0,M,-4.749034,0.000601,0.003749,Adventure,Hulk,2003,6.0,245360480.0,275529,5.6,8,1
437095,Ross,0.0,M,-2.404932,0.034926,0.000787,Action,The Expendables,2010,8.0,274470394.0,359966,6.4,2,1
463897,Peter,0.0,M,-3.313844,0.006906,0.003360,Comedy,Due Date,2010,10.0,211780824.0,353981,6.5,70,1
91235,Riley,2.0,M,2.316911,0.040799,-0.014125,Drama,National Treasure,2004,11.0,347512318.0,348263,6.9,108,1
41011,Connor,1.0,M,2.564686,0.026298,-0.017925,Thriller,Terminator 3: Rise of the Machines,2003,6.0,433371112.0,413355,6.3,312,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
402028,Charlie,12.0,M,-5.861696,0.000109,0.006586,Drama,The Hurt Locker,2008,9.0,49230772.0,466890,7.5,108,1
280834,Ross,1.0,F,-2.837781,0.016146,0.001503,Thriller,The Incredible Hulk,2008,6.0,263427551.0,514491,6.6,312,0
255090,Scott,1.0,F,-2.875118,0.015103,0.004210,Romance Film,Knocked Up,2007,5.0,219076518.0,380011,6.9,262,0
161556,Jane,1.0,F,-2.209976,0.049221,0.001213,Comedy,Mr. & Mrs. Smith,2005,6.0,478336279.0,524509,6.5,70,0


In [81]:
# Sanity check: each matched pair agrees on every matching column and lies on opposite sides of the threshold
matched_c = matching_df.loc[index_control].reset_index(drop=True)
matched_t = matching_df.loc[index_treatment].reset_index(drop=True)
display((matched_c[matching_columns] == matched_t[matching_columns]).all(axis=1) &
        (matched_c['numVotes'] <= percentile90) & (matched_t['numVotes'] > percentile90))


0      True
1      True
2      True
3      True
4      True
       ... 
223    True
224    True
225    True
226    True
227    True
Length: 228, dtype: bool

In [82]:
# Create new dataframes containing only matched datapoints in treatment and control
matched_control_df = control_df.loc[index_control].copy(deep=True)
matched_treatment_df = treatment_df.loc[index_treatment].copy(deep=True)
display(matched_control_df)
display(matched_treatment_df)

# Create the concatenated dataframe (control rows followed by treatment rows)
matched_aggregate_df = pd.concat([matched_control_df, matched_treatment_df], axis=0)
display(matched_aggregate_df)

Unnamed: 0,char_words,order,gender,t_stat,p_value,slope_change,genre,mov_name,year,month,revenue,numVotes,averageRating,genre_code,gender_code,is_treated
123427,Steven,7.0,M,-2.477505,0.030709,0.011481,Adventure,Timeline,2003,11.0,43935763.0,65086,5.6,8,1,0
428525,Charlie,0.0,M,-2.962863,0.012909,0.003892,Action,From Paris with Love,2010,2.0,52826594.0,120490,6.4,2,1,0
495446,Eric,0.0,M,-2.945878,0.013307,0.007560,Comedy,Life as We Know It,2010,10.0,105648706.0,137099,6.5,70,1,0
114653,Luke,2.0,M,3.364851,0.006309,-0.018467,Drama,Wicker Park,2004,9.0,21568818.0,58049,6.9,108,1,0
287776,Charlie,1.0,M,-2.271196,0.044215,0.001843,Thriller,Shade,2003,6.0,459098.0,13071,6.3,312,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
389538,Kevin,12.0,M,2.558811,0.026575,-0.011775,Drama,Doubt,2008,10.0,50907234.0,134474,7.5,108,1,0
381812,Rachel,1.0,F,-4.418206,0.001032,0.014350,Thriller,Eagle Eye,2008,9.0,178966569.0,193179,6.6,312,0,0
294308,Kate,1.0,F,2.265992,0.044620,-0.008073,Romance Film,Mister Foe,2007,,,13009,6.9,262,0,0
282985,Emily,1.0,F,4.521404,0.000870,-0.056235,Comedy,The Puffy Chair,2005,,192467.0,5217,6.5,70,0,0


Unnamed: 0,char_words,order,gender,t_stat,p_value,slope_change,genre,mov_name,year,month,revenue,numVotes,averageRating,genre_code,gender_code,is_treated
58400,Harper,7.0,M,-4.749034,0.000601,0.003749,Adventure,Hulk,2003,6.0,245360480.0,275529,5.6,8,1,1
437095,Ross,0.0,M,-2.404932,0.034926,0.000787,Action,The Expendables,2010,8.0,274470394.0,359966,6.4,2,1,1
463897,Peter,0.0,M,-3.313844,0.006906,0.003360,Comedy,Due Date,2010,10.0,211780824.0,353981,6.5,70,1,1
91235,Riley,2.0,M,2.316911,0.040799,-0.014125,Drama,National Treasure,2004,11.0,347512318.0,348263,6.9,108,1,1
41011,Connor,1.0,M,2.564686,0.026298,-0.017925,Thriller,Terminator 3: Rise of the Machines,2003,6.0,433371112.0,413355,6.3,312,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
402028,Charlie,12.0,M,-5.861696,0.000109,0.006586,Drama,The Hurt Locker,2008,9.0,49230772.0,466890,7.5,108,1,1
280834,Ross,1.0,F,-2.837781,0.016146,0.001503,Thriller,The Incredible Hulk,2008,6.0,263427551.0,514491,6.6,312,0,1
255090,Scott,1.0,F,-2.875118,0.015103,0.004210,Romance Film,Knocked Up,2007,5.0,219076518.0,380011,6.9,262,0,1
161556,Jane,1.0,F,-2.209976,0.049221,0.001213,Comedy,Mr. & Mrs. Smith,2005,6.0,478336279.0,524509,6.5,70,0,1


Unnamed: 0,char_words,order,gender,t_stat,p_value,slope_change,genre,mov_name,year,month,revenue,numVotes,averageRating,genre_code,gender_code,is_treated
123427,Steven,7.0,M,-2.477505,0.030709,0.011481,Adventure,Timeline,2003,11.0,43935763.0,65086,5.6,8,1,0
428525,Charlie,0.0,M,-2.962863,0.012909,0.003892,Action,From Paris with Love,2010,2.0,52826594.0,120490,6.4,2,1,0
495446,Eric,0.0,M,-2.945878,0.013307,0.007560,Comedy,Life as We Know It,2010,10.0,105648706.0,137099,6.5,70,1,0
114653,Luke,2.0,M,3.364851,0.006309,-0.018467,Drama,Wicker Park,2004,9.0,21568818.0,58049,6.9,108,1,0
287776,Charlie,1.0,M,-2.271196,0.044215,0.001843,Thriller,Shade,2003,6.0,459098.0,13071,6.3,312,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
402028,Charlie,12.0,M,-5.861696,0.000109,0.006586,Drama,The Hurt Locker,2008,9.0,49230772.0,466890,7.5,108,1,1
280834,Ross,1.0,F,-2.837781,0.016146,0.001503,Thriller,The Incredible Hulk,2008,6.0,263427551.0,514491,6.6,312,0,1
255090,Scott,1.0,F,-2.875118,0.015103,0.004210,Romance Film,Knocked Up,2007,5.0,219076518.0,380011,6.9,262,0,1
161556,Jane,1.0,F,-2.209976,0.049221,0.001213,Comedy,Mr. & Mrs. Smith,2005,6.0,478336279.0,524509,6.5,70,0,1


### Variation of the slope change of influenced names per year

In [83]:
# # Create dataframe to show variation of the proportion of significantly names
# matched_aggregate_df_grouped = matched_aggregate_df.groupby('year').apply(lambda x: pd.Series({
#     'prop_significant_control': len(x[(x['p_value'] < alpha) & (x['is_treated'] == 0)])/len(x[x['is_treated'] == 0]),
#     'prop_significant_treated': len(x[(x['p_value'] < alpha) & (x['is_treated'] == 1)])/len(x[x['is_treated'] == 1])
    
# }))
# display(matched_aggregate_df_grouped)
# Create dataframes showing, per year, the mean slope change, its magnitude and their standard errors for the matched names
matched_control_df_grouped = matched_control_df.groupby('year').apply(lambda x: pd.Series({
    'avg_slope_change': x['slope_change'].mean(),
    'se_slope_change': x['slope_change'].sem(), 
    'avg_mag_slope_change': x['slope_change'].abs().mean(),
    'se_mag_slope_change': x['slope_change'].abs().sem()
}))
display(matched_control_df_grouped)

matched_treatment_df_groupes = matched_treatment_df.groupby('year').apply(lambda x: pd.Series({
    'avg_slope_change': x['slope_change'].mean(),
    'se_slope_change': x['slope_change'].sem(), 
    'avg_mag_slope_change': x['slope_change'].abs().mean(),
    'se_mag_slope_change': x['slope_change'].abs().sem()
}))
# Display the treatment summary (the control summary was previously displayed a second time by mistake)
display(matched_treatment_df_groupes)

Unnamed: 0_level_0,avg_slope_change,se_slope_change,avg_mag_slope_change,se_mag_slope_change
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1992,0.010793,0.004839,0.010793,0.004839
1995,0.032745,,0.032745,
1996,-0.003712,0.0,0.003712,0.0
1998,-0.001259,0.001473,0.001473,0.001259
1999,0.008212,0.002425,0.010018,0.001329
2000,0.010715,0.00207,0.010715,0.00207
2001,0.004933,0.004878,0.011236,0.003325
2002,0.007337,0.0,0.007337,0.0
2003,-0.00477,0.005737,0.012007,0.005116
2004,-0.000394,0.003932,0.009826,0.002412


In [84]:
# fig = px.line(matched_aggregate_df_grouped.reset_index(), 
#               x='year', 
#               y=['prop_significant_control', 'prop_significant_treated'],
#               labels={'value': 'Proportion of Significantly Named'},
#               title='Variation of Proportion of Significantly Named Items Over Years',
#               markers=True)



# # Show the plot
# fig.show()


In [85]:
# import scipy.stats as st
# def prop_and_ci(data):
#     proportion_influenced = (data['p_value']<alpha).mean()
#     se = st.sem((data['p_value']<alpha).astype(int))

#     ci_upper = proportion_influenced + 1.96*se
#     ci_lower = proportion_influenced - 1.96*se
#     return pd.Series({
#         'proportion_influenced': proportion_influenced,
#         'ci_lower': ci_lower,
#         'ci_upper': ci_upper
#     })

In [86]:
# # Compute proportion and confidence interval of proportion of significantly influenced names in control and treatment groups
# prop_significant_treated = prop_and_ci(matched_treatment_df)
# prop_significant_control = prop_and_ci(matched_control_df)
# display(prop_significant_treated)
# display(prop_significant_control)

# # Create dataframe to easily retrieve event of control (row = 0) and treated (row = 1)
# # when plotting
# prop_significant = pd.DataFrame((prop_significant_control,prop_significant_treated), columns=['proportion_influenced', 'ci_lower', 'ci_upper'])
# display(prop_significant)

In [87]:
# Save data for dynamic plotting
# prop_significant.to_csv('matching_result_prop_significant.csv', index=False)

### Difference in difference analysis on number of votes

In [93]:
# For matched_control_df
matched_control_df_average = {
    'avg_over_all_years': matched_control_df['slope_change'].mean(),
    'se_over_all_years': matched_control_df['slope_change'].sem(),
    'avg_mag_over_all_years': matched_control_df['slope_change'].abs().mean(),
    'se_mag_over_all_years': matched_control_df['slope_change'].abs().sem()
}
display(matched_control_df_average)

matched_treatment_df_average = {
    'avg_over_all_years': matched_treatment_df['slope_change'].mean(),
    'se_over_all_years': matched_treatment_df['slope_change'].sem(),
    'avg_mag_over_all_years': matched_treatment_df['slope_change'].abs().mean(),
    'se_mag_over_all_years': matched_treatment_df['slope_change'].abs().sem()
}
display(matched_treatment_df_average)

prop_significant = pd.DataFrame([matched_control_df_average, matched_treatment_df_average], columns=['avg_over_all_years', 'se_over_all_years', 'avg_mag_over_all_years', 'se_mag_over_all_years'])
display(prop_significant)

{'avg_over_all_years': 0.004332238329017957,
 'se_over_all_years': 0.0009770065118857511,
 'avg_mag_over_all_years': 0.010226568091055576,
 'se_mag_over_all_years': 0.0007592790616837042}

{'avg_over_all_years': 0.003835681454676958,
 'se_over_all_years': 0.0007203437414121094,
 'avg_mag_over_all_years': 0.008400601855049816,
 'se_mag_over_all_years': 0.0005223276003311975}

Unnamed: 0,avg_over_all_years,se_over_all_years,avg_mag_over_all_years,se_mag_over_all_years
0,0.004332,0.000977,0.010227,0.000759
1,0.003836,0.00072,0.008401,0.000522
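
To compare the two group averages more formally than by looking at the error bars, one option is a two-sample z-test built from the means and standard errors displayed above. The sketch below is an assumption about how this comparison could be done, not the notebook's own method: the helper `two_sample_z` is hypothetical, and it treats the control and treatment groups as independent samples, which ignores the pairing created by the matching (a paired test on the matched rows would be the alternative).

```python
import math


def two_sample_z(mean_a, se_a, mean_b, se_b):
    """Two-sided z-test for the difference mean_b - mean_a.

    Hypothetical helper: assumes independent groups and a normal
    approximation, so the standard errors add in quadrature.
    """
    diff = mean_b - mean_a
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = diff / se_diff
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, z, p_value


# Summary statistics copied from the output above (control first, then treatment)
control_mean, control_se = 0.004332238329017957, 0.0009770065118857511
treatment_mean, treatment_se = 0.003835681454676958, 0.0007203437414121094

diff, z, p_value = two_sample_z(control_mean, control_se, treatment_mean, treatment_se)
print(f"difference = {diff:.6f}, z = {z:.3f}, p-value = {p_value:.3f}")
```

With these values, |z| is well below 1.96, so under the stated assumptions the difference in average `slope_change` between the two groups is not significant at the 5% level.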


In [94]:
# Save data for dynamic plotting
prop_significant.to_csv('matching_result_signi_rating_90.csv', index=False)

In [91]:
# Average influence over all the years between groups

# Create a Plotly figure
fig = go.Figure()

groups = ['control', 'treatment']
colors = ['#1f78b4', '#e31a1c']
# Explicit lookup of each group's summary statistics (avoids relying on locals())
group_averages = {
    'control': matched_control_df_average,
    'treatment': matched_treatment_df_average,
}

for group, color in zip(groups, colors):
    prop_data = group_averages[group]
    # Half-width of the 95% confidence interval (normal approximation)
    ci_half_width = 1.96 * prop_data['se_over_all_years']
    fig.add_trace(go.Scatter(
        x=[group],
        y=[prop_data['avg_over_all_years']],
        # One symmetric error value per plotted point
        error_y=dict(type='data', array=[ci_half_width]),
        mode='markers+lines',
        name=group,
        marker=dict(color=color, size=10)
    ))

fig.update_layout(
    yaxis=dict(title='Average influence'),
    xaxis=dict(range=[-0.5, len(groups) - 0.5]),  # Pad the categorical x-axis around the two groups
    title='Difference-in-differences analysis on "numVotes"'
)

# Show the plot
fig.show()

In [92]:
# Average magnitude of influence over all the years between groups

# Create a Plotly figure
fig = go.Figure()

groups = ['control', 'treatment']
colors = ['#1f78b4', '#e31a1c']
# Explicit lookup of each group's summary statistics (avoids relying on locals())
group_averages = {
    'control': matched_control_df_average,
    'treatment': matched_treatment_df_average,
}

for group, color in zip(groups, colors):
    prop_data = group_averages[group]
    # Half-width of the 95% confidence interval (normal approximation)
    ci_half_width = 1.96 * prop_data['se_mag_over_all_years']
    fig.add_trace(go.Scatter(
        x=[group],
        y=[prop_data['avg_mag_over_all_years']],
        # One symmetric error value per plotted point
        error_y=dict(type='data', array=[ci_half_width]),
        mode='markers+lines',
        name=group,
        marker=dict(color=color, size=10)
    ))

fig.update_layout(
    yaxis=dict(title='Average magnitude of influence'),
    xaxis=dict(range=[-0.5, len(groups) - 0.5]),  # Pad the categorical x-axis around the two groups
    title='Difference-in-differences analysis on "numVotes"'
)

# Show the plot
fig.show()