# <span style="color:black">This notebook studies the cancel culture surrounding people accused of sexual misconduct.</span>

## <span style="color:black">To learn more about cancel culture, see this [link](https://en.wikipedia.org/wiki/Cancel_culture).</span>

# <span style="color:black">Mount drive to notebook</span>

In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


# <span style="color:black">Install needed packages</span>

In [4]:
!pip install tld
# !pip install pandas==1.0.5

import os
import bz2
import json
import glob
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# from utils import process_chunk, process_text, manual_extraction, ids_to_tweets, embedSentence, embedding, averageEmbedding, featurize

#Packages for url parsing
import tld
from tld import get_tld
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup
#Packages for dimensionality reduction
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from nltk.tokenize.treebank import TreebankWordDetokenizer
#Packages for NLP methods
import re
import nltk
import gensim
from gensim import models
from gensim import corpora
import gensim.downloader as api
from gensim.test.utils import datapath, common_texts
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
!pip install plotly==5.4.0
# We need this dataset in order to use the tokenizer
nltk.download('punkt')
# Also download the list of stopwords to filter out
nltk.download('stopwords')
stemmer = PorterStemmer()

# Add constants/paths
_DATASETS_PATHS = '/content/drive/Shareddrives/ADA LUNATICS 2021/datasets'

Collecting plotly==5.4.0
  Downloading plotly-5.4.0-py2.py3-none-any.whl (25.3 MB)
Collecting tenacity>=6.2.0
  Downloading tenacity-8.0.1-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 4.4.1
    Uninstalling plotly-4.4.1:
      Successfully uninstalled plotly-4.4.1
Successfully installed plotly-5.4.0 tenacity-8.0.1
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# <span style="color:black">Cancel Culture Timeline</span>

In [5]:
#helper function for dealing with dates
def create_time(df, name_date_feature):
  '''Parse the date column and add unified time columns.

  Adds datetime, Year, Month, Day (zero-padded strings), time (integer YYMM,
  e.g. 1812 for December 2018) and YearMonth ('YY-MM').'''
  df = df.copy()  # work on a copy so that filtered slices can be passed safely
  df['datetime'] = pd.to_datetime(df[name_date_feature])
  df['Year'] = df['datetime'].dt.strftime('%y')
  df['Month'] = df['datetime'].dt.strftime('%m')
  df['Day'] = df['datetime'].dt.strftime('%d')
  df.loc[:,'time'] = 100*df['Year'].astype(int) + df['Month'].astype(int)
  df['YearMonth'] = df['datetime'].dt.strftime('%y-%m')
  return df
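
The integer `time` index packs the two-digit year and the month into one number. A minimal, self-contained sketch of that encoding on toy dates; the helper name `add_time_columns` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def add_time_columns(df, date_col):
    """Return a copy of df with datetime, YearMonth and integer YYMM columns."""
    out = df.copy()
    out['datetime'] = pd.to_datetime(out[date_col])
    out['YearMonth'] = out['datetime'].dt.strftime('%y-%m')
    # 100 * two-digit year + month, e.g. December 2018 -> 1812
    out['time'] = 100 * (out['datetime'].dt.year % 100) + out['datetime'].dt.month
    return out

# Toy input (hypothetical rows, same date format as accusations.csv)
toy = pd.DataFrame({'dates': ['2018-12-17', '2017-04-19']})
toy_timed = add_time_columns(toy, 'dates')
print(toy_timed[['YearMonth', 'time']])
```

Because the helper works on a copy, the input frame is left untouched, which avoids pandas' `SettingWithCopyWarning` when a filtered slice is passed in.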

## <span style="color:Blue">1) Getting list of accused people</span>

### <span style="color:purple"><div style="text-align: justify">First, we scraped this [website](https://www.vox.com/a/sexual-harassment-assault-allegations-list/john-kricfalusi) to extract information about the traumatic events (names of the accused people, dates, etc.). The web scraping details are described in the **Impact Study** notebook.</div></span>

### <span style="color:purple"><div style="text-align: justify">Then, we saved the names of accused people *(accusations.csv)* to study the cancel culture on them.</div></span>

In [6]:
_DATASETS_PATHS = '/content/drive/Shareddrives/ADA LUNATICS 2021/datasets'
#find in the accused people, the names of interest for cancel culture
accused_people = pd.read_csv(_DATASETS_PATHS+'/accusations.csv')
accused_people = accused_people.drop(columns='Unnamed: 0')
accused_people = create_time(accused_people, 'dates')
accused_names = list(accused_people.names)
accused_people

Unnamed: 0,dates,names,category,datetime,Year,Month,Day,time,YearMonth
0,2018-12-17,Frankie Shaw,Arts & Entertainment,2018-12-17,18,12,17,1812,18-12
1,2018-12-13,Michael Weatherly,Arts & Entertainment,2018-12-13,18,12,13,1812,18-12
2,2018-09-06,Steven Wilder Striegel,Arts & Entertainment,2018-09-06,18,09,06,1809,18-09
3,2018-08-30,Gerard Depardieu,Arts & Entertainment,2018-08-30,18,08,30,1808,18-08
4,2018-08-28,Chase Finlay,Arts & Entertainment,2018-08-28,18,08,28,1808,18-08
...,...,...,...,...,...,...,...,...,...
256,2017-10-21,John Besh,Other,2017-10-21,17,10,21,1710,17-10
257,2017-10-06,David Marchant,Other,2017-10-06,17,10,06,1710,17-10
258,2017-09-08,T. Florian Jaeger,Other,2017-09-08,17,09,08,1709,17-09
259,2017-04-19,Cristiano Ronaldo,Other,2017-04-19,17,04,19,1704,17-04


## <span style="color:Blue">2) Timeline of quotes spoken by the accused people</span>

### <span style="color:purple"><div style="text-align: justify">Here, we build the timeline from the Quotebank extraction, keeping only the quotes whose speaker (an accused person) is known and present in the metadata *(extraction_known_speakers.csv.bz2)*.</div></span>

In [7]:
#load the speaker metadata used to keep only the Quotebank quotes with known speakers
df_metadata = pd.read_parquet('/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/speaker_attributes.parquet')
#clean df_metadata: keep=False drops every label that appears more than once (ambiguous homonyms)
clean = df_metadata.drop_duplicates(subset='label', keep=False)
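
With `keep=False`, `drop_duplicates` removes every row whose label is shared, rather than keeping one representative. A small sketch on made-up labels (the names and ids are hypothetical) contrasts it with `keep='first'`:

```python
import pandas as pd

# Hypothetical metadata: 'John Smith' is shared by two different entities
toy_metadata = pd.DataFrame({
    'label': ['Jane Roe', 'John Smith', 'John Smith'],
    'id': ['Q1', 'Q2', 'Q3'],
})

# keep=False: all rows with a duplicated label are dropped
unique_only = toy_metadata.drop_duplicates(subset='label', keep=False)
# keep='first': the first row of each duplicated label survives
first_kept = toy_metadata.drop_duplicates(subset='label', keep='first')

print(unique_only['label'].tolist())
print(first_kept['id'].tolist())
```

Dropping all homonyms loses some speakers, but it prevents a quote from being merged with the attributes of the wrong person.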

In [8]:
# Loading "metoo quotes"
keywords_extracted_data_path = '/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/keywords-extracted-data.csv.bz2'
keywords_extracted_data = glob.glob(keywords_extracted_data_path)
keywords_extracted_data

['/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/keywords-extracted-data.csv.bz2']

In [None]:
#merge the metoo quotes with the clean metadata dataframe
def process_chunk(chunk):
  merge = pd.merge(chunk, clean, how='inner', left_on='speaker', right_on='label')
  return merge

known_speakers_path = '/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/extraction_known_speakers.csv.bz2'
total_size = 0

for dataset in keywords_extracted_data:
  df_reader = pd.read_csv(dataset, compression='bz2', chunksize=500000)
  print(f'Processing with {dataset}')
  for chunk in df_reader:
      print(f'Processing chunk with {len(chunk)} rows')
      #speakers stored as the string 'None' are unknown: drop them before merging
      merge = process_chunk(chunk.replace(to_replace='None', value=np.nan).dropna(subset=['speaker']))
      #write the header only once, otherwise it is repeated as a data row for every chunk
      merge.to_csv(path_or_buf=known_speakers_path, compression='bz2', mode='a', header=not os.path.exists(known_speakers_path))
      total_size += len(merge)

### <span style="color:purple"><div style="text-align: justify">Now, we extract the quotes spoken by the accused people *(quotes_from_accused)*.</div></span>

In [None]:
#extract from quotebank the quotes spoken by accused people
df_extraction_known_speakers = pd.read_csv('/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/extraction_known_speakers.csv.bz2', compression='bz2')

#getting quotes from accused people (copy so that new columns are not added to a view)
quotes_from_accused = df_extraction_known_speakers[df_extraction_known_speakers['speaker'].isin(accused_names)].copy()
quotes_from_accused = create_time(quotes_from_accused, 'date')

In [12]:
quotes_from_accused = quotes_from_accused.drop(columns='Unnamed: 0.1')
quotes_from_accused.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,probas,urls,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,id,label,candidacy,type,religion,datetime,Year,Month,Day,time,YearMonth
14296,2015-04-06-001948,"Ahh, who has time to scrap the main gang-rape ...",Jann Wenner,['Q519143'],2015-04-06 17:35:00,"[['Jann Wenner', '0.5324'], ['None', '0.4676']]",['http://feeds.gawker.com/~r/gawker/full/~3/Cg...,"['Jann Simon Wenner' 'Wenner Media, LLC']",['+1946-01-07T00:00:00Z'],['Q30'],['Q6581097'],1392063156,,,['Q1930187' 'Q1607826' 'Q43845'],,,Q519143,Jann Wenner,,item,,2015-04-06 17:35:00,15,4,6,1504,15-04
14297,2015-05-12-061162,To personify the university's alleged institut...,Jann Wenner,['Q519143'],2015-05-12 15:46:20,"[['Jann Wenner', '0.411'], ['None', '0.3542'],...",['http://www.post-gazette.com/news/nation/2015...,"['Jann Simon Wenner' 'Wenner Media, LLC']",['+1946-01-07T00:00:00Z'],['Q30'],['Q6581097'],1392063156,,,['Q1930187' 'Q1607826' 'Q43845'],,,Q519143,Jann Wenner,,item,,2015-05-12 15:46:20,15,5,12,1505,15-05
14298,2015-04-08-051469,"Last July 8, Sabrina Rubin Erdely, a writer fo...",Jann Wenner,['Q519143'],2015-04-08 14:35:36,"[['Jann Wenner', '0.5716'], ['None', '0.4284']]",['http://www.foxnews.com/opinion/2015/04/08/ro...,"['Jann Simon Wenner' 'Wenner Media, LLC']",['+1946-01-07T00:00:00Z'],['Q30'],['Q6581097'],1392063156,,,['Q1930187' 'Q1607826' 'Q43845'],,,Q519143,Jann Wenner,,item,,2015-04-08 14:35:36,15,4,8,1504,15-04
14299,2015-04-06-068532,was willing to go too far in her effort to try...,Jann Wenner,['Q519143'],2015-04-06 11:30:11,"[['Jann Wenner', '0.8075'], ['None', '0.1925']]",['http://buzzfeed.com/jtes/rolling-stone-is-st...,"['Jann Simon Wenner' 'Wenner Media, LLC']",['+1946-01-07T00:00:00Z'],['Q30'],['Q6581097'],1392063156,,,['Q1930187' 'Q1607826' 'Q43845'],,,Q519143,Jann Wenner,,item,,2015-04-06 11:30:11,15,4,6,1504,15-04
14300,2015-12-04-012983,But nothing has ever come up to the level of t...,Jann Wenner,['Q519143'],2015-12-04 18:25:05,"[['Jann Wenner', '0.4856'], ['None', '0.228'],...",['https://medium.com/collectors-weekly/did-the...,"['Jann Simon Wenner' 'Wenner Media, LLC']",['+1946-01-07T00:00:00Z'],['Q30'],['Q6581097'],1392063156,,,['Q1930187' 'Q1607826' 'Q43845'],,,Q519143,Jann Wenner,,item,,2015-12-04 18:25:05,15,12,4,1512,15-12


### <span style="color:purple"><div style="text-align: justify">Check how many quotes the people of interest said, then keep the speakers who have enough quotes to analyze.</div></span>

In [16]:
quotes_timeline = quotes_from_accused.groupby(['time', 'speaker'], as_index=False)['quotation'].count()
quotes_timeline.head()

Unnamed: 0,time,speaker,quotation
0,1501,Al Franken,3
1,1501,Andrew Kreisberg,4
2,1501,Aziz Ansari,1
3,1501,Ben Affleck,5
4,1501,Carlos Uresti,2


In [17]:
#keep accused people who are quoted regularly (quotes in more than 15 distinct months) - finish with list of 65 accused people
keep_names=[]
for name in accused_names: 
  if len(quotes_timeline.loc[quotes_timeline['speaker']==name, 'quotation'])>15:
    keep_names.append(name)
keep_names

['Frankie Shaw',
 'Michael Weatherly',
 'Asia Argento',
 'Luc Besson',
 'Allison Mack',
 'Paul Marciano',
 'Mario Testino',
 'Aziz Ansari',
 'James Franco',
 'Paul Haggis',
 'Dan Harmon',
 'T.J. Miller',
 'James Levine',
 'Geoffrey Rush',
 'John Lasseter',
 'Sylvester Stallone',
 'Richard Dreyfuss',
 'Andrew Kreisberg',
 'George Takei',
 'Louis C.K.',
 'Russell Simmons',
 'Jeffrey Tambor',
 'Brett Ratner',
 'Dustin Hoffman',
 'Jeremy Piven',
 'Kevin Spacey',
 'Bob Weinstein',
 'Oliver Stone',
 'Ben Affleck',
 'Roman Polanski',
 'Kimberly Guilfoyle',
 'Tom Brokaw',
 'Ryan Seacrest',
 'Tavis Smiley',
 'Ryan Lizza',
 'Matt Lauer',
 'Garrison Keillor',
 'Glenn Thrush',
 'Jann Wenner',
 'Mark Halperin',
 'Eric Bolling',
 'Sean Hannity',
 'Andy Rubin',
 'Chris Sacca',
 'Travis Kalanick',
 'Charles Schwertner',
 'Brett Kavanaugh',
 'Tom Frieden',
 'Corey Coleman',
 'Mel Watt',
 'Curtis Hill',
 'Eric Schneiderman',
 'Cristina Garcia',
 'Eric Greitens',
 'Corey Lewandowski',
 'Alex Kozinski',
 

### <span style="color:purple"><div style="text-align: justify">It's time to get some PLOTS!</div></span>

In [13]:
# plotting the timeline of quotes spoken by accused people of interest
import plotly.express as px
from google.colab import files

quotes_timeline_plot = quotes_from_accused.groupby(['YearMonth', 'speaker'], as_index=False)['quotation'].count()

to_plot = ['Al Franken','Eric Schneiderman', 'Andrew Kreisberg']

for name in to_plot:
  df2 = quotes_timeline_plot[quotes_timeline_plot['speaker']==name]
  fig = px.line(df2, x="YearMonth", y="quotation", labels={
                     "YearMonth": "Time",
                     "quotation": "#quotations"
                 })
  # dashed green line marks the month of the accusation
  accusation_month = accused_people.loc[accused_people['names']==name, 'YearMonth'].iloc[0]
  fig.add_vline(x=accusation_month, line_width=3, line_dash="dash", line_color="green")
  fig.update_layout(font_size=28)
  fig.write_html("timeline_quotes_from"+name+".html")
  files.download("timeline_quotes_from"+name+".html")
  fig.show()

## <span style="color:blue"><div style="text-align: justify">3) Statistical Analysis</div></span>

In [None]:
#rescale the number of quotes per person of interest to [0, 1], making them comparable
from sklearn.preprocessing import MinMaxScaler
standardize = quotes_timeline_plot[quotes_timeline_plot['speaker'].isin(keep_names)].copy()
standardize['quotation'] = standardize['quotation'].astype(float)

#using MinMaxScaler() on the number of quotes for each accused person
for name in keep_names : 
  standardize.loc[standardize['speaker']==name, 'quotation']= MinMaxScaler().fit_transform(standardize[standardize['speaker']==name].quotation.to_numpy().reshape(-1,1)).ravel()

#building YearMonth date index (integer YYMM, e.g. '18-12' -> 1812)
standardize['YearMonth'] = standardize['YearMonth'].apply(lambda x : 100*int(x.split('-')[0]) + int(x.split('-')[1]))

In [19]:
standardize.head()

Unnamed: 0,YearMonth,speaker,quotation
0,1501,Al Franken,0.006231
1,1501,Andrew Kreisberg,0.066667
2,1501,Aziz Ansari,0.0
3,1501,Ben Affleck,0.052632
4,1501,Carlos Uresti,0.071429
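
The per-person rescaling can also be written without an explicit loop. A minimal sketch using `groupby(...).transform` on toy counts; the `min_max` helper and the toy data are assumptions made for illustration, and this is an alternative formulation rather than the code the notebook ran.

```python
import pandas as pd

# Toy monthly quote counts for two hypothetical speakers
toy_counts = pd.DataFrame({
    'speaker': ['A', 'A', 'A', 'B', 'B'],
    'quotation': [2, 4, 6, 10, 30],
})

def min_max(series):
    """Rescale a series to [0, 1]; a constant series maps to 0."""
    span = series.max() - series.min()
    if span == 0:
        return series * 0.0
    return (series - series.min()) / span

# Each speaker is rescaled with their own minimum and maximum
toy_counts['scaled'] = toy_counts.groupby('speaker')['quotation'].transform(min_max)
print(toy_counts)
```

Scaling within each speaker puts a rarely quoted person and a heavily quoted one on the same 0 to 1 range, so only the shape of each timeline is compared.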


In [None]:
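#prepare the data for the t-test: rate of months with quotes before/after the accusation
#NOTE: this counts the months with at least one quote, not the scaled number of quotes,
#and divides by a difference of YYMM integers, which is not a month count across year boundaries
before = []
after = []

for name in keep_names:
  df_name=standardize[standardize['speaker']==name]
  #df_name= MinMaxScaler().fit_transform(df_name.to_numpy().reshape(-1,1))
  #month of the accusation (YYMM); the variable keeps its original name "conviction"
  conviction = int(accused_people.loc[accused_people['names']==name, 'time'].iloc[0])
  before.append(df_name[df_name['YearMonth']<conviction]['YearMonth'].count()/(conviction-df_name['YearMonth'].min()))
  after.append(df_name[df_name['YearMonth']>conviction]['YearMonth'].count()/(df_name['YearMonth'].max()-conviction))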
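
Subtracting two YYMM integers does not give a number of months: December 2017 to January 2018 is one month apart, yet `1801 - 1712` is 89. A small sketch of a month-accurate distance; the helper names are hypothetical and this is a possible refinement, not what the cell above computes.

```python
def yymm_to_month_index(yymm):
    """Convert a YYMM integer (e.g. 1812) to a running month count."""
    year, month = divmod(yymm, 100)
    return 12 * year + month

def months_between(start_yymm, end_yymm):
    """Number of calendar months from start_yymm to end_yymm."""
    return yymm_to_month_index(end_yymm) - yymm_to_month_index(start_yymm)

print(1801 - 1712)                 # raw integer difference: 89
print(months_between(1712, 1801))  # actual distance: 1
```

Using `months_between` as the denominator would turn the before/after values into a true fraction of months in which the person was quoted.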

In [None]:
nbr_quotes = pd.DataFrame(keep_names)
nbr_quotes['before']=before
nbr_quotes['after']=after
nbr_quotes.columns = ['name', 'mean_before', 'mean_after']

#get rid of inf values
nbr_quotes = nbr_quotes.replace([np.inf, -np.inf], np.nan).dropna()
nbr_quotes.head()

Unnamed: 0,name,mean_before,mean_after
0,Frankie Shaw,0.045307,0.021978
1,Michael Weatherly,0.03871,0.020942
2,Asia Argento,0.050251,0.046154
3,Luc Besson,0.043771,0.019048
4,Allison Mack,0.05102,0.06599


In [None]:
import scipy.stats as stats

statistic, p_value = stats.ttest_ind(nbr_quotes['mean_before'], nbr_quotes['mean_after'])
if p_value < 0.05:
    print(f'The difference between both distributions is significant, p = {p_value}')
else:
    print(f'There is no statistically significant difference between both distributions, p = {p_value}')

The difference between both distributions is significant, p = 0.00046240694927555503
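
Each person contributes both a "before" and an "after" value, so the two samples are paired rather than independent. A minimal sketch, on synthetic data, of how `scipy.stats.ttest_rel` compares with `ttest_ind`; the numbers are generated for illustration only, and the paired test is a suggested alternative, not what the notebook ran.

```python
import numpy as np
from scipy import stats

# Synthetic before/after rates for 30 hypothetical people:
# everyone drops by about 0.01, with small individual noise
rng = np.random.default_rng(0)
toy_before = rng.normal(loc=0.05, scale=0.01, size=30)
toy_after = toy_before - 0.01 + rng.normal(scale=0.002, size=30)

# Independent test ignores that rows belong to the same person
independent = stats.ttest_ind(toy_before, toy_after)
# Paired test works on the per-person differences
paired = stats.ttest_rel(toy_before, toy_after)

print(f'independent p = {independent.pvalue:.4g}')
print(f'paired p = {paired.pvalue:.4g}')
```

When the change within each person is consistent, the paired test removes the between-person variability and is typically more sensitive.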


## <span style="color:blue"><div style="text-align: justify">4) Timeline of quotes mentioning the accused people (how much they are talked about)</div></span>

### <span style="color:purple"><div style="text-align: justify">Here, we extract from Quotebank the quotes that mention the accused people *(new_cancel_quotes.csv.bz2)*.</div></span>

In [None]:
def process_chunk(chunk, proc_function, keywords, save_path):
    '''Process a chunk of a dataset with the given processing function and append the results to save_path.'''
    print(f'Processing chunk with {len(chunk)} rows')
    # Remove the phase and numOccurrences columns (not useful and redundant, respectively) to save memory
    chunk = chunk.drop(['phase', 'numOccurrences'], axis=1)
    # Apply the processing function
    chunk = proc_function(chunk, keywords)
    if len(chunk) > 0:
        print(f'There are {len(chunk)} rows that include some of the keywords')
        # Write the header only once so it is not repeated for every appended chunk
        chunk[['quotation', 'date']].to_csv(path_or_buf=save_path, compression='bz2', mode='a', header=not os.path.exists(save_path))
    return chunk

def manual_extraction(df, keywords):
    '''Return the rows whose quotation contains any of the keywords.'''
    # Escape the keywords: names such as 'T.J. Miller' contain regex special characters
    pattern = '|'.join(re.escape(keyword) for keyword in keywords)
    mask = df['quotation'].str.contains(pattern, na=False)
    return df.loc[mask, :]
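
Names such as 'T.J. Miller' or 'Louis C.K.' contain dots, which a regular expression treats as "any character". A short sketch on made-up quotes showing what changes when the names are passed through `re.escape` before being joined into a pattern:

```python
import re
import pandas as pd

names = ['T.J. Miller', 'Louis C.K.']
# Toy quotes (hypothetical); the second one only resembles a name
quotes = pd.Series([
    'T.J. Miller was on stage',
    'TXJX Miller is a made-up string',
    'nothing relevant here',
])

raw_pattern = '|'.join(names)                            # dots act as wildcards
escaped_pattern = '|'.join(re.escape(n) for n in names)  # dots match literally

raw_hits = quotes.str.contains(raw_pattern)
escaped_hits = quotes.str.contains(escaped_pattern)
print(raw_hits.tolist())
print(escaped_hits.tolist())
```

The unescaped pattern also flags the second quote, a false positive that the escaped pattern avoids.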

In [None]:
from gensim import corpora
corpuses = []
#keywords= 'sexual harassment|me too|'

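#Iterate through all datasets and extract the quotes containing any of the keywords selected
#NOTE: quote_datasets is not defined in this notebook; it is expected to list the Quotebank .json.bz2 files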
for dataset in quote_datasets:
  print(f'Processing Data: {dataset}')
  df_reader = pd.read_json(dataset, lines=True, compression='bz2', chunksize=500000)
  for chunk in df_reader:
      print('Processing Data:')
      chunk = process_chunk(chunk, manual_extraction, accused_names, '/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/new_cancel_quotes.csv.bz2')

In [None]:
#getting quotes mentioning the accused people
new_cancel_quotes = pd.read_csv('/content/drive/Shareddrives/ADA LUNATICS 2021/datasets/new_cancel_quotes.csv.bz2', compression='bz2')

#specify the name of the accused person mentioned in each quote
new_cancel_quotes['name'] = 'None'
for name in keep_names : 
  #literal match (regex=False): names such as 'Louis C.K.' contain dots
  new_cancel_quotes.loc[new_cancel_quotes['quotation'].str.contains(name, regex=False, na=False), 'name'] = name

#keep only the quotes mentioning a person of interest, then deal with dates
new_cancel_quotes = new_cancel_quotes[new_cancel_quotes['name'].isin(keep_names)].copy()
new_cancel_quotes = create_time(new_cancel_quotes, 'date')

In [23]:
new_cancel_quotes = new_cancel_quotes.drop(columns='Unnamed: 0')
new_cancel_quotes.head()

Unnamed: 0,quotation,date,name,datetime,Year,Month,Day,time,YearMonth
4,Finally after all that hard work for my 21st b...,2015-01-20 16:49:28,Ryan Seacrest,2015-01-20 16:49:28,15,1,20,1501,15-01
13,"At Trump Tower, rival staff members are vying ...",2015-08-07 10:51:13,Corey Lewandowski,2015-08-07 10:51:13,15,8,7,1508,15-08
15,"We're talking about Ben Affleck here,",2015-01-19 08:22:12,Ben Affleck,2015-01-19 08:22:12,15,1,19,1501,15-01
20,"I like it. Obviously, babies aren't really act...",2015-09-14 15:30:00,Louis C.K.,2015-09-14 15:30:00,15,9,14,1509,15-09
32,Garrison Keillor's A Prairie Home Companion - ...,2015-07-31 17:47:41,Garrison Keillor,2015-07-31 17:47:41,15,7,31,1507,15-07


### <span style="color:purple"><div style="text-align: justify">Similarly, we plot the timeline of quotes as before.</div></span>

In [None]:
quotes_timeline_plot = new_cancel_quotes.groupby(['YearMonth', 'name'], as_index=False)['quotation'].count()
quotes_timeline_plot.head()

Unnamed: 0,YearMonth,name,quotation
0,15-01,Al Franken,6
1,15-01,Andrew Kreisberg,1
2,15-01,Aziz Ansari,4
3,15-01,Ben Affleck,27
4,15-01,Blake Farenthold,1


In [None]:
import plotly.express as px

to_plot = ['Al Franken','Eric Schneiderman', 'Andrew Kreisberg']

for name in to_plot:
  df2 = quotes_timeline_plot[quotes_timeline_plot['name']==name]
  fig = px.line(df2, x="YearMonth", y="quotation", labels={
                     "YearMonth": "Time",
                     "quotation": "#quotations"
                 })
  # dashed green line marks the month of the accusation
  accusation_month = accused_people.loc[accused_people['names']==name, 'YearMonth'].iloc[0]
  fig.add_vline(x=accusation_month, line_width=3, line_dash="dash", line_color="green")
  fig.update_layout(font_size=28)
  fig.write_html("timeline_quotes_mentioning"+name+".html")
  files.download("timeline_quotes_mentioning"+name+".html")
  fig.show()

## <span style="color:blue"><div style="text-align: justify">5) Statistical Analysis</div></span>

In [None]:
#rescale the number of quotes per person of interest to [0, 1], making them comparable
from sklearn.preprocessing import MinMaxScaler

quotes_timeline = new_cancel_quotes.groupby(['time', 'name'], as_index=False)['quotation'].count()
standardize = quotes_timeline_plot[quotes_timeline_plot['name'].isin(keep_names)].copy()
standardize['quotation'] = standardize['quotation'].astype(float)

#using MinMaxScaler() on the number of quotes for each accused person
for name in keep_names : 
  standardize.loc[standardize['name']==name, 'quotation']= MinMaxScaler().fit_transform(standardize[standardize['name']==name].quotation.to_numpy().reshape(-1,1)).ravel()

#building YearMonth date index (integer YYMM, e.g. '18-12' -> 1812)
standardize['YearMonth'] = standardize['YearMonth'].apply(lambda x : 100*int(x.split('-')[0]) + int(x.split('-')[1]))
standardize.head()

Unnamed: 0,YearMonth,name,quotation
0,1501,Al Franken,0.008881
1,1501,Andrew Kreisberg,0.0
2,1501,Aziz Ansari,0.025424
3,1501,Ben Affleck,0.346667
4,1501,Blake Farenthold,0.0


In [None]:
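#prepare the data for the t-test: rate of months with mentions before/after the accusation
#NOTE: this counts the months with at least one quote, not the scaled number of quotes,
#and divides by a difference of YYMM integers, which is not a month count across year boundaries
before = []
after = []

for name in keep_names:
  df_name=standardize[standardize['name']==name]
  #month of the accusation (YYMM); the variable keeps its original name "conviction"
  conviction = int(accused_people.loc[accused_people['names']==name, 'time'].iloc[0])
  before.append(df_name[df_name['YearMonth']<conviction]['YearMonth'].count()/(conviction-df_name['YearMonth'].min()))
  after.append(df_name[df_name['YearMonth']>conviction]['YearMonth'].count()/(df_name['YearMonth'].max()-conviction))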

In [None]:
nbr_quotes = pd.DataFrame(keep_names)
nbr_quotes['before']=before
nbr_quotes['after']=after
nbr_quotes.columns = ['name', 'mean_before', 'mean_after']

#get rid of inf values
nbr_quotes = nbr_quotes.replace([np.inf, -np.inf], np.nan).dropna()
nbr_quotes.head()

Unnamed: 0,name,mean_before,mean_after
0,Frankie Shaw,0.049505,0.021978
1,Michael Weatherly,0.064725,0.042105
2,Asia Argento,0.042623,0.067308
3,Luc Besson,0.085526,0.065657
4,Allison Mack,0.041667,0.081218


In [None]:
import scipy.stats as stats

statistic, p_value = stats.ttest_ind(nbr_quotes['mean_before'], nbr_quotes['mean_after'])
if p_value < 0.05:
    print(f'The difference between both distributions is significant, p = {p_value}')
else:
    print(f'There is no statistically significant difference between both distributions, p = {p_value}')

There is no statistically significant difference between both distributions, p = 0.6719953632915576
