# Data wrangling of the Quotebank dataset

This notebook develops the data-wrangling methods that will be used to clean the full Quotebank dataset used in our project. The methods are tested on a sample of the full dataset.

The sample is made from 200'000 quotes for each year of interest (2015-2020) and should therefore contain about $1.2 \cdot 10^6$ entries.

The script will proceed through the following steps:
  - drop duplicated quotes
  - drop quotes whose speaker is not identified (threshold to be defined)
  - drop quotes for which two or more candidate speakers have similar probabilities (threshold to be defined)
  - drop empty quotes
  - keep only the speaker with the highest probability of being the author of the quote
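
The steps above can be sketched as a single function. This is a minimal sketch on invented toy data, not the project's final implementation: the function name `clean_quotes`, the helper columns `top_speaker`, `top_proba` and `gap`, and the threshold values `min_proba` and `min_gap` are all assumptions made for illustration (the actual thresholds are still to be defined). It also assumes that every quote has at least one candidate in `probas`, that probabilities may be stored as strings, and that an unattributed candidate is labeled `"None"`.

```python
import pandas as pd


def clean_quotes(df, min_proba=0.5, min_gap=0.1):
    """Apply the listed cleaning steps to a Quotebank-like DataFrame.

    min_proba and min_gap are placeholder thresholds (assumptions).
    """
    # 1. Drop duplicated quotes, keeping the first occurrence of each ID.
    out = df.drop_duplicates(subset="quoteID", keep="first").copy()

    # 2. Drop empty quotes (missing or whitespace-only text).
    text = out["quotation"].fillna("").str.strip()
    out = out[text != ""].copy()

    # Rank the candidates of each quote by decreasing probability,
    # casting the probabilities to float in case they are strings.
    ranked = out["probas"].apply(
        lambda pairs: sorted(
            ((name, float(p)) for name, p in pairs),
            key=lambda pair: pair[1],
            reverse=True,
        )
    )
    out["top_speaker"] = ranked.apply(lambda r: r[0][0])
    out["top_proba"] = ranked.apply(lambda r: r[0][1])
    # Gap to the runner-up; a single candidate counts as unambiguous.
    out["gap"] = ranked.apply(
        lambda r: r[0][1] - r[1][1] if len(r) > 1 else r[0][1]
    )

    # 3. Drop quotes whose best candidate is unidentified or too unlikely.
    unidentified = out["top_speaker"].isna() | (out["top_speaker"] == "None")
    identified = ~unidentified & (out["top_proba"] >= min_proba)
    # 4. Drop quotes whose two best candidates are too close to each other.
    unambiguous = out["gap"] >= min_gap
    out = out[identified & unambiguous].copy()

    # 5. Keep only the most likely speaker.
    out["speaker"] = out["top_speaker"]
    return out.drop(columns=["probas", "top_speaker", "gap"])


# Invented toy data covering each case handled by the function.
toy = pd.DataFrame({
    "quoteID": ["a", "a", "b", "c", "d", "e"],
    "quotation": ["Hello", "Hello", "", "Maybe", "Close call", "Clear"],
    "speaker": ["X", "X", "Y", "None", "Z", "W"],
    "probas": [
        [["X", "0.9"], ["None", "0.1"]],
        [["X", "0.9"], ["None", "0.1"]],    # duplicated ID
        [["Y", "0.8"], ["None", "0.2"]],    # empty quotation
        [["None", "0.6"], ["V", "0.4"]],    # speaker not identified
        [["Z", "0.52"], ["U", "0.48"]],     # two candidates too close
        [["W", "0.7"], ["None", "0.3"]],
    ],
})
cleaned = clean_quotes(toy)
print(cleaned[["quoteID", "speaker", "top_proba"]])
```

Computing the helper columns once and filtering afterwards keeps each threshold in a single place, which should make it easier to tune them once they are defined.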
    

In [9]:
# Import the required modules
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import numpy as np
import seaborn as sns

In [10]:
# Small adjustments to the default plot style so that figures are readable and colorblind-friendly everywhere
# (Matplotlib 3.6 and later name this style 'seaborn-v0_8-colorblind')
plt.style.use('seaborn-colorblind')
plt.rcParams.update({'font.size' : 12.5,
                     'figure.figsize':(10,7)})

Quick look at the raw data:

In [11]:
# Copy the path to the sample quotes (too big to put in the git repository)
#ALEX: 'C:/Users/alexb/Documents/Ecole/EPFL/MasterII/ADA/'
#JULES: ...
#MARIN: ...
#NICO: ...

path_2_data = 'C:/Users/alexb/Documents/Ecole/EPFL/MasterII/ADA/'


#import the dataset sample
raw_data = pd.read_json(path_2_data + 'Sample.json.bz2',compression="bz2",lines=True)

raw_data.describe()

Unnamed: 0,numOccurrences
count,1263790.0
mean,3.767778
std,46.66187
min,1.0
25%,1.0
50%,1.0
75%,2.0
max,33000.0


In [15]:
raw_data.sample(3)

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
684936,2018-10-19-086101,"since we got back to work, that it was going t...",LeBron James,[Q36159],2018-10-19 11:14:39,2,"[[LeBron James, 0.7167], [None, 0.2834]]",[http://mobile.nytimes.com/2018/10/19/sports/l...,E
696132,2018-06-05-004839,always tried to support it in the decisions an...,Hillary Clinton,[Q6294],2018-06-05 16:15:35,3,"[[Hillary Clinton, 0.9324], [None, 0.0552], [C...",[http://europe.newsweek.com/bill-clinton-backp...,E
654197,2018-09-08-021077,I still have a couple more boxes to check befo...,Lance McCullers,"[Q6483471, Q6483473]",2018-09-08 00:00:00,43,"[[Lance McCullers, 0.9416], [None, 0.0549], [M...",[http://www.dailyherald.com/article/20180908/s...,E


Test to see if the ids are unique within the dataset

In [76]:
# Keep the first occurrence of each duplicated quoteID
size_bf = raw_data.shape[0]
df = raw_data.drop_duplicates(subset='quoteID', keep='first')
size_af = df.shape[0]

print('{} duplicate rows have been removed'.format(size_bf - size_af))
print('Unique rows in the data set:', df.quoteID.is_unique)

0 duplicate rows have been removed
Unique rows in the data set: True
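
Before dropping anything, it can help to look at which rows share an ID. The following is a small sketch on invented toy data (the IDs and quotations are made up); it only relies on `Series.duplicated`, whose `keep` argument controls which members of a duplicated group are flagged.

```python
import pandas as pd

# Invented toy data: the first two rows share the same quoteID.
toy = pd.DataFrame({
    "quoteID": ["2015-01-01-000001", "2015-01-01-000001", "2015-01-02-000002"],
    "quotation": ["first copy", "second copy", "another quote"],
})

# keep=False flags every member of a duplicated group, not only the repeats.
dup_mask = toy["quoteID"].duplicated(keep=False)
# keep="first" flags only the rows that drop_duplicates(keep="first") removes.
n_extra = int(toy["quoteID"].duplicated(keep="first").sum())

print(toy[dup_mask])
print(f"{n_extra} rows would be removed by drop_duplicates(keep='first')")
```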


In [77]:
df_test = df[0:1000]
pd.DataFrame(df_test)

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2015-11-11-109291,They'll call me lots of different things. Libe...,Chris Christie,[Q63879],2015-11-11 00:55:12,1,"[[Chris Christie, 0.7395], [Bobby Jindal, 0.15...",[http://thehill.com/blogs/ballot-box/259760-ch...,E
1,2015-11-04-105046,"The choices are not that easy,",Dr. John,"[Q511074, Q54593093]",2015-11-04 18:13:06,2,"[[Dr. John, 0.5531], [None, 0.4469]]",[http://delawareonline.com/story/news/health/2...,E
2,2015-09-11-070666,It's kind of the same way it's been with the R...,Niklas Kronwall,[Q722939],2015-09-11 19:54:00,1,"[[Niklas Kronwall, 0.7119], [None, 0.2067], [H...",[http://redwings.nhl.com/club/news.htm?id=7787...,E
3,2015-01-12-082489,"We're now going back to the frozen tundra, and...",Frances McDormand,[Q204299],2015-01-12 01:40:00,3,"[[Frances McDormand, 0.484], [None, 0.4495], [...",[http://feeds.people.com/~r/people/headlines/~...,E
4,2015-11-09-033345,I had a chuckle: They were showing a video of ...,Kris Draper,[Q948695],2015-11-09 00:57:45,3,"[[Kris Draper, 0.8782], [None, 0.1043], [Serge...",[http://ca.rd.yahoo.com/sports/rss/nfl/SIG=13u...,E
...,...,...,...,...,...,...,...,...,...
995,2015-06-26-031997,It's really been an honor for me to be involve...,Jim Obergefell,[Q23419417],2015-06-26 14:22:28,5,"[[Jim Obergefell, 0.3523], [None, 0.3281], [Pr...",[http://www.latimes.com/la-na-gay-marriage-rul...,E
996,2015-04-06-073536,We stay very relaxed. We have a lot of fun smi...,Jack Sock,[Q54663],2015-04-06 10:53:00,1,"[[Jack Sock, 0.8094], [None, 0.1906]]",[http://www.atpworldtour.com/News/Tennis/2015/...,E
997,2015-11-02-015564,Commendation for Self-sufficiency and Commenda...,,[],2015-11-02 00:18:35,1,"[[None, 0.5101], [Amy Jadesimi, 0.4899]]",[http://sunnewsonline.com/new/ladol-adjudged-a...,E
998,2015-10-19-092831,We will have a new constitution because I thin...,Alassane Ouattara,[Q28669746],2015-10-19 19:53:12,7,"[[Alassane Ouattara, 0.6951], [None, 0.3049]]",[http://uk.reuters.com/article/2015/10/19/uk-i...,E


Removing the quotes with too low a probability of speaker detection, according to the variable *threshold*

In [78]:
df_test['probas'] = [i[0][1] for i in df_test['probas']]

df_test


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_test['probas'] = [i[0][1] for i in df_test['probas']]


Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2015-11-11-109291,They'll call me lots of different things. Libe...,Chris Christie,[Q63879],2015-11-11 00:55:12,1,0.7395,[http://thehill.com/blogs/ballot-box/259760-ch...,E
1,2015-11-04-105046,"The choices are not that easy,",Dr. John,"[Q511074, Q54593093]",2015-11-04 18:13:06,2,0.5531,[http://delawareonline.com/story/news/health/2...,E
2,2015-09-11-070666,It's kind of the same way it's been with the R...,Niklas Kronwall,[Q722939],2015-09-11 19:54:00,1,0.7119,[http://redwings.nhl.com/club/news.htm?id=7787...,E
3,2015-01-12-082489,"We're now going back to the frozen tundra, and...",Frances McDormand,[Q204299],2015-01-12 01:40:00,3,0.484,[http://feeds.people.com/~r/people/headlines/~...,E
4,2015-11-09-033345,I had a chuckle: They were showing a video of ...,Kris Draper,[Q948695],2015-11-09 00:57:45,3,0.8782,[http://ca.rd.yahoo.com/sports/rss/nfl/SIG=13u...,E
...,...,...,...,...,...,...,...,...,...
995,2015-06-26-031997,It's really been an honor for me to be involve...,Jim Obergefell,[Q23419417],2015-06-26 14:22:28,5,0.3523,[http://www.latimes.com/la-na-gay-marriage-rul...,E
996,2015-04-06-073536,We stay very relaxed. We have a lot of fun smi...,Jack Sock,[Q54663],2015-04-06 10:53:00,1,0.8094,[http://www.atpworldtour.com/News/Tennis/2015/...,E
997,2015-11-02-015564,Commendation for Self-sufficiency and Commenda...,,[],2015-11-02 00:18:35,1,0.5101,[http://sunnewsonline.com/new/ladol-adjudged-a...,E
998,2015-10-19-092831,We will have a new constitution because I thin...,Alassane Ouattara,[Q28669746],2015-10-19 19:53:12,7,0.6951,[http://uk.reuters.com/article/2015/10/19/uk-i...,E
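
The `SettingWithCopyWarning` shown above appears because `df_test` is a slice of `df`, so pandas cannot tell whether the assignment should also affect the original DataFrame. One way to avoid it, sketched here on invented toy data, is to take an explicit copy of the subset before modifying it. The names `subset` and `top_proba` are illustrative, and the sketch assumes that the first candidate listed in `probas` is the most likely one, as in the rows displayed above.

```python
import pandas as pd

# Invented toy data mimicking the structure of the 'probas' column.
df = pd.DataFrame({
    "quoteID": ["q1", "q2", "q3"],
    "probas": [
        [["A", "0.7395"], ["None", "0.2605"]],
        [["B", "0.5531"], ["None", "0.4469"]],
        [["None", "0.5101"], ["C", "0.4899"]],
    ],
})

# .copy() makes the subset an independent DataFrame, so later column
# assignments do not trigger SettingWithCopyWarning.
subset = df.iloc[:2].copy()

# Take the probability of the first listed candidate and cast it to float
# in the same step, writing it to a new column instead of overwriting 'probas'.
subset["top_proba"] = [float(pairs[0][1]) for pairs in subset["probas"]]

print(subset[["quoteID", "top_proba"]])
```

Writing the result to a separate column keeps the full candidate list available for later steps, such as comparing the two most likely speakers.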


In [79]:
df_test["probas"] = df_test["probas"].astype("float")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_test["probas"] = df_test["probas"].astype("float")


In [80]:
indexNames = df_test[df_test['probas'] < 0.10].index
df_test.drop(indexNames , inplace=True)
len(df_test)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


1000

In [66]:
# Probability below which a quote is considered not attributed to a speaker
threshold = 0.1

size_bf = df_test.shape[0]
# Keep only the quotes whose top probability reaches the threshold
# (.loc is indexed with square brackets, not called like a function)
df_test_after = df_test.loc[df_test['probas'] >= threshold].copy()
size_af = df_test_after.shape[0]

print('{} rows below the threshold have been removed'.format(size_bf - size_af))

In [21]:
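# `probas` must hold the list of [speaker, probability] pairs of one quote;
# it is not defined in the current kernel state, hence the NameError below
first = probas[0]
first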

NameError: name 'probas' is not defined

In [40]:
result = float(first[1])
result

0.7475