# Data wrangling of the Quotebank dataset

This notebook develops the data-wrangling methods that will be used to clean the full Quotebank dataset used in our project. The methods are tested on a sample of the full dataset.

The sample is made of 200'000 quotes from each year of interest (2015-2020) and should therefore contain about $1.2 \cdot 10^6$ entries (the loaded sample actually has 1,263,790 rows, see the `describe()` output below).

The script will proceed through the following steps:
  - drop the duplicates
  - drop the quotes for which the speaker is not identified (threshold to be defined)
  - drop the quotes for which the probabilities of two or more speakers are close to each other (threshold to be defined)
  - drop any quote that is empty
  - keep only the speaker with the highest probability of being the author of the quote
    

In [1]:
#importing the required modules
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
# Small adjustments to the default plot style, to keep plots readable and colorblind-friendly everywhere
# Note: from matplotlib 3.6 on, this style is named 'seaborn-v0_8-colorblind'
plt.style.use('seaborn-colorblind')
plt.rcParams.update({'font.size' : 12.5,
                     'figure.figsize':(10,7)})

Quick look at the raw data:

In [3]:
#copy the path of the sample quotes: (too big to put in the git repository)
#ALEX: 'C:/Users/alexb/Documents/Ecole/EPFL/MasterII/ADA/'
#JULES: ...
#MARIN: ...
#NICO: ...

path_2_data = 'C:/Users/alexb/Documents/Ecole/EPFL/MasterII/ADA/'


#import the dataset sample
raw_data = pd.read_json(path_2_data + 'Sample.json.bz2',compression="bz2",lines=True)

raw_data.describe()

Unnamed: 0,numOccurrences
count,1263790.0
mean,3.767778
std,46.66187
min,1.0
25%,1.0
50%,1.0
75%,2.0
max,33000.0
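
Since the full dataset is much larger than this sample, reading it in a single call may not fit in memory. A minimal sketch of chunked reading, assuming the full files use the same line-delimited JSON layout as the sample; the function name `iter_quote_chunks` and the default chunk size are illustrative and not part of the project:

```python
import pandas as pd

def iter_quote_chunks(path, chunksize=100_000):
    """Yield DataFrames of at most `chunksize` rows from a bz2-compressed JSON-lines file."""
    # With lines=True and a chunksize, read_json returns an iterator of DataFrames
    reader = pd.read_json(path, lines=True, compression="bz2", chunksize=chunksize)
    for chunk in reader:
        yield chunk
```

Each chunk could then be cleaned with the steps developed in this notebook before the next one is loaded.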


In [4]:
raw_data.sample(3)

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
674024,2018-02-26-130035,"We left the known universe several weeks ago,",Greg Lyle,[Q57565496],2018-02-26 00:00:00,2,"[[Greg Lyle, 0.9388], [None, 0.0556], [Chris C...",[http://montrealgazette.com/news/politics/patr...,E
956638,2019-01-20-007315,Constituents have told me they walk on the bus...,,[],2019-01-20 12:22:38,2,"[[None, 0.5391], [Joel Goodman, 0.4609]]",[http://feeds.manchestereveningnews.co.uk/~r/m...,E
737039,2018-07-06-046460,It will be thoroughly checked and used again f...,,[],2018-07-06 08:42:00,1,"[[None, 0.9034], [Lewis Hamilton, 0.0966]]",[http://www.pitpass.com/62284/New-power-unit-f...,E


Test to see if the ids are unique within the dataset

In [5]:
#Keeping the first occurrence of the duplicates
size_bf = raw_data.shape[0]
df = raw_data.copy().drop_duplicates(subset = 'quoteID', keep='first')
size_af = df.shape[0]

print('{} duplicate rows have been removed'.format(size_bf-size_af))
print('Unique rows in the data set:', df.quoteID.is_unique)

0 duplicate rows have been removed
Unique rows in the data set: True
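
The same check can also be written with `duplicated`, which flags every repeated `quoteID` after its first occurrence. A minimal sketch on a toy DataFrame (the helper name `count_duplicate_ids` and the toy IDs are illustrative only):

```python
import pandas as pd

def count_duplicate_ids(df, id_col="quoteID"):
    """Count the rows whose ID already appeared earlier in the DataFrame."""
    return int(df[id_col].duplicated(keep="first").sum())

# Toy data: "a" and "b" each appear twice
toy = pd.DataFrame({"quoteID": ["a", "b", "a", "c", "b"]})
n_dup = count_duplicate_ids(toy)
deduped = toy.drop_duplicates(subset="quoteID", keep="first")
```

Counting before dropping makes it possible to report the number of duplicates without comparing the sizes of two DataFrames.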


Sample of the sample to speed up calculation while implementing

In [11]:
df_test = df[0:10]
df_test

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2015-11-11-109291,They'll call me lots of different things. Libe...,Chris Christie,[Q63879],2015-11-11 00:55:12,1,"[[Chris Christie, 0.7395], [Bobby Jindal, 0.15...",[http://thehill.com/blogs/ballot-box/259760-ch...,E
1,2015-11-04-105046,"The choices are not that easy,",Dr. John,"[Q511074, Q54593093]",2015-11-04 18:13:06,2,"[[Dr. John, 0.5531], [None, 0.4469]]",[http://delawareonline.com/story/news/health/2...,E
2,2015-09-11-070666,It's kind of the same way it's been with the R...,Niklas Kronwall,[Q722939],2015-09-11 19:54:00,1,"[[Niklas Kronwall, 0.7119], [None, 0.2067], [H...",[http://redwings.nhl.com/club/news.htm?id=7787...,E
3,2015-01-12-082489,"We're now going back to the frozen tundra, and...",Frances McDormand,[Q204299],2015-01-12 01:40:00,3,"[[Frances McDormand, 0.484], [None, 0.4495], [...",[http://feeds.people.com/~r/people/headlines/~...,E
4,2015-11-09-033345,I had a chuckle: They were showing a video of ...,Kris Draper,[Q948695],2015-11-09 00:57:45,3,"[[Kris Draper, 0.8782], [None, 0.1043], [Serge...",[http://ca.rd.yahoo.com/sports/rss/nfl/SIG=13u...,E
5,2015-09-05-038628,New Zealand will go in with a lot of confidenc...,John Eales,[Q926351],2015-09-05 02:40:10,3,"[[John Eales, 0.7896], [None, 0.2006], [Toutai...",[http://www.stuff.co.nz/sport/rugby/all-blacks...,E
6,2015-10-23-081328,Raja roti khilane ke paise nai leta (a king do...,,[],2015-10-23 07:18:43,2,"[[None, 0.5062], [Jwala Singh, 0.2486], [Ram P...",[http://timesofindia.indiatimes.com/city/agra/...,E
7,2015-03-01-042169,The severely inclement weather in the south an...,,[],2015-03-01 17:55:41,19,"[[None, 0.7051], [Leonard Nimoy, 0.1416], [Woo...",[http://uk.reuters.com/article/2015/03/01/usa-...,E
8,2015-02-11-042325,In his suicide note he even made a joke thanki...,Pat Buckley,"[Q19956564, Q23006312, Q7143252, Q7143253]",2015-02-11 09:59:09,1,"[[Pat Buckley, 0.8816], [None, 0.1184]]",[http://independent.ie/life/health-wellbeing/m...,E
9,2015-06-28-039933,We played there [ at Wigan ] together a few ti...,FRASER Fyvie,[Q1361441],2015-06-28 00:19:04,1,"[[FRASER Fyvie, 0.8274], [None, 0.1726]]",[http://scotsman.com/sport/football/spfl-lower...,E


Drop the quotes for which the speaker is not identified with sufficient probability (threshold to be defined)

In [28]:
#fixing the threshold (=> attribution probability too low to be considered in the analysis)
threshold_min = 0.5

#removing the rows that do not pass the criterion
df_thres = df_test.copy()
#probability of the first candidate listed in probas
df_thres['p'] = [i[0][1] for i in df_test['probas']]
indexNames = df_thres[df_thres['p'].astype("float") < threshold_min].index
df_thres.drop(indexNames , inplace=True)

#check of the number of deleted rows
size_bf = df_test.shape[0]
size_af = df_thres.shape[0]
print('Threshold set at ',threshold_min)
print('Due to low probability attribution, {} rows have been removed'.format(size_bf-size_af))

Threshold set at  0.5
Due to low probability attribution, 1 rows have been removed
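
The filter above can also be expressed with a boolean mask instead of collecting index labels. A minimal sketch, assuming (as the `astype("float")` cast suggests) that each entry of `probas` is a list of `[speaker, probability]` pairs with the most likely candidate first; `filter_by_top_probability` is a hypothetical helper name and the toy rows only mimic the layout of the sample:

```python
import pandas as pd

def filter_by_top_probability(df, threshold_min=0.5):
    """Keep the rows whose first candidate reaches `threshold_min`."""
    # Assumption: the first pair of each list is the most likely candidate
    p_top = df["probas"].map(lambda cands: float(cands[0][1]))
    return df[p_top >= threshold_min]

# Toy data with made-up speakers, probabilities stored as strings
toy = pd.DataFrame({
    "quoteID": ["q1", "q2", "q3"],
    "probas": [
        [["Speaker A", "0.7395"], ["None", "0.2605"]],
        [["Speaker B", "0.4840"], ["None", "0.4495"], ["Speaker C", "0.0665"]],
        [["None", "0.9034"], ["Speaker D", "0.0966"]],
    ],
})
kept = filter_by_top_probability(toy, threshold_min=0.5)
```

Note that this criterion alone keeps `q3`, whose most likely candidate is `None`; such rows have to be handled by the separate check on the speaker.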


Drop the quotes for which the probabilities of two or more speakers are close to each other (threshold to be defined)

In [36]:
threshold_diff = 0.3

In [37]:
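df_thres2 = df_test.copy()
#probabilities of the first two candidates (probas appears to be sorted in decreasing order)
df_thres2['p1'] = [i[0][1] for i in df_test['probas']]
df_thres2['p2'] = [i[1][1] for i in df_test['probas']]
#removing the rows for which the gap between the two probabilities is below the threshold
df_thres2['delta_p'] = df_thres2['p1'].astype("float")-df_thres2['p2'].astype("float")
indexNames = df_thres2[df_thres2['delta_p'] < threshold_diff].index
df_thres2.drop(indexNames , inplace=True)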



In [38]:
#check of the number of deleted rows
size_bf = df_test.shape[0]
size_af = df_thres2.shape[0]
print('Threshold set at ',threshold_diff)
print('Due to near probability attribution, {} rows have been removed'.format(size_bf-size_af))

Threshold set at  0.3
Due to near probability attribution, 3 rows have been removed
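
The cell above indexes the second candidate directly, which would fail for a quote with a single candidate. A minimal sketch of a more defensive variant, assuming `probas` holds `[speaker, probability]` pairs; the helper names `probability_gap` and `filter_by_gap` are hypothetical, and treating a missing runner-up as probability 0 is a design choice of this sketch:

```python
import pandas as pd

def probability_gap(cands):
    """Gap between the two highest probabilities of one quote."""
    # Sort explicitly so the result does not depend on the order of the list
    probs = sorted((float(p) for _, p in cands), reverse=True)
    # A single candidate has no runner-up: treat the missing one as probability 0
    second = probs[1] if len(probs) > 1 else 0.0
    return probs[0] - second

def filter_by_gap(df, threshold_diff=0.3):
    """Keep the rows whose top two probabilities differ by at least `threshold_diff`."""
    gap = df["probas"].map(probability_gap)
    return df[gap >= threshold_diff]

# Toy data with made-up speakers
toy = pd.DataFrame({
    "quoteID": ["q1", "q2", "q3"],
    "probas": [
        [["Speaker A", "0.5531"], ["None", "0.4469"]],
        [["Speaker B", "0.8816"], ["None", "0.1184"]],
        [["Speaker C", "1.0"]],
    ],
})
kept = filter_by_gap(toy, threshold_diff=0.3)
```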


In [43]:
df_none = df_test.copy()
#removing the rows for which no speaker was identified (assumes the speaker is stored as the string 'None')
indexNames = df_none[df_none['speaker'] == 'None'].index
df_none.drop(indexNames , inplace=True)

#check of the number of deleted rows
size_bf = df_test.shape[0]
size_af = df_none.shape[0]
print('Due to undefined speaker, {} rows have been removed'.format(size_bf-size_af))

Due to undefined speaker, 2 rows have been removed
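
Two of the steps listed at the top of the notebook, dropping empty quotes and keeping only the most likely speaker, are not implemented in the cells above. One possible sketch, not the project's final implementation: the helper names and the new columns `top_speaker` and `top_proba` are hypothetical, and `probas` is assumed to hold `[speaker, probability]` pairs.

```python
import pandas as pd

def drop_empty_quotes(df):
    """Drop the rows whose quotation is missing or contains only whitespace."""
    text = df["quotation"].fillna("").astype(str).str.strip()
    return df[text != ""]

def keep_top_speaker(df):
    """Replace the `probas` list by its most likely (speaker, probability) pair."""
    best = df["probas"].map(lambda cands: max(cands, key=lambda c: float(c[1])))
    out = df.drop(columns="probas")  # new DataFrame, the input is left untouched
    out["top_speaker"] = best.map(lambda c: c[0])
    out["top_proba"] = best.map(lambda c: float(c[1]))
    return out

# Toy data with made-up speakers and quotes
toy = pd.DataFrame({
    "quotation": ["Hello there", "   ", None],
    "probas": [
        [["Speaker A", "0.9"], ["None", "0.1"]],
        [["Speaker B", "0.8"], ["None", "0.2"]],
        [["None", "0.6"], ["Speaker C", "0.4"]],
    ],
})
cleaned = keep_top_speaker(drop_empty_quotes(toy))
```

Storing the top candidate in two flat columns keeps later filtering and grouping simple, at the cost of discarding the other candidates.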
