# Data wrangling of the Quotebank dataset

This notebook develops the data-wrangling methods that will be used to clean the full Quotebank dataset in our project. The methods are tested on a sample of the full dataset.

The sample is made of 200,000 quotes from each year of interest (2015-2020) and should therefore contain $1.2 \cdot 10^6$ entries.

The script will proceed through the following steps:
  - drop the duplicates
  - drop the quotes for which the speaker is not identified (threshold to be defined)
  - drop the quotes for which the probabilities of two or more speakers are close to each other (threshold to be defined)
  - drop any quote that is empty
  - keep only the speaker with the highest probability of being the author of the quote
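
The steps above could be combined roughly as in the sketch below. It is not the final implementation: the function names `top_two` and `clean_quotes`, the thresholds `min_proba` and `min_gap`, and the toy data are all assumptions made for illustration, since the real thresholds are still to be defined. The sketch also assumes that each `probas` entry is a list of `[speaker, probability]` pairs with the probability stored as a string, and that an unattributed quote is marked by the speaker `'None'` (or a missing value). The filters are applied in a slightly different order than the list, which does not change the result.

```python
import pandas as pd

def top_two(probas):
    """Return (best_speaker, best_proba, second_proba) for one `probas` entry."""
    # Probabilities are assumed to be stored as strings, so convert before ranking
    ranked = sorted(((name, float(p)) for name, p in probas),
                    key=lambda pair: pair[1], reverse=True)
    best_name, best_p = ranked[0]
    second_p = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_name, best_p, second_p

def clean_quotes(df, min_proba=0.5, min_gap=0.2):
    """Apply the five cleaning steps; both thresholds are placeholder values."""
    # Drop duplicated quote ids, keeping the first occurrence
    out = df.drop_duplicates(subset='quoteID', keep='first')
    # Drop empty quotations (missing or whitespace only)
    out = out[out['quotation'].fillna('').str.strip() != ''].copy()
    # Rank the candidate speakers of every remaining quote
    ranked = pd.DataFrame(out['probas'].apply(top_two).tolist(),
                          index=out.index,
                          columns=['speaker', 'top_proba', 'second_proba'])
    # Keep only the most probable speaker
    out['speaker'] = ranked['speaker']
    out['top_proba'] = ranked['top_proba']
    # Speaker identified, confident enough, and clearly ahead of the runner-up
    identified = out['speaker'].notna() & ~out['speaker'].isin(['None'])
    confident = ranked['top_proba'] >= min_proba
    unambiguous = (ranked['top_proba'] - ranked['second_proba']) >= min_gap
    return out[identified & confident & unambiguous]

# Toy data (invented for this example, not taken from Quotebank)
toy = pd.DataFrame({
    'quoteID': ['q1', 'q1', 'q2', 'q3', 'q4', 'q5'],
    'quotation': ['Hello there', 'Hello there', 'Maybe', '   ',
                  'Close call', 'Who said it'],
    'probas': [
        [['Speaker A', '0.90'], ['None', '0.10']],
        [['Speaker A', '0.90'], ['None', '0.10']],
        [['None', '0.70'], ['Speaker B', '0.30']],
        [['Speaker C', '0.95'], ['None', '0.05']],
        [['Speaker D', '0.45'], ['Speaker E', '0.40'], ['None', '0.15']],
        [['Speaker F', '0.80'], ['None', '0.20']],
    ],
})

cleaned = clean_quotes(toy)
print(cleaned[['quoteID', 'speaker', 'top_proba']])
```

On the toy data only `q1` and `q5` survive: the second `q1` is a duplicate, `q2` is most likely unattributed, `q3` is empty, and `q4` has two speakers with nearly equal probabilities.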
    

In [8]:
#importing the required modules
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import numpy as np
import seaborn as sns

In [9]:
# Small adjustments to the default plot style, so that plots are readable and colorblind-friendly everywhere
# Note: from matplotlib 3.6 onwards this style is named 'seaborn-v0_8-colorblind'
plt.style.use('seaborn-colorblind')
plt.rcParams.update({'font.size' : 12.5,
                     'figure.figsize':(10,7)})

Quick look at the raw data:

In [10]:
#import the dataset sample

raw_data = pd.read_json('quotes-2019-nytimes.json',lines = True)

raw_data.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2019-04-17-024782,"It is not a low-income immigration,",James Fisher,"[Q16213953, Q20707104, Q43143598, Q58886302, Q...",2019-04-17 13:31:18,1,"[[James Fisher, 0.7475], [None, 0.2525]]",[https://www.nytimes.com/2019/04/17/realestate...,E
1,2019-04-02-001128,a champion figure skater switching to roller s...,John Updike,[Q105756],2019-04-02 14:58:33,2,"[[John Updike, 0.5856], [None, 0.4144]]",[https://www.nytimes.com/2019/04/02/opinion/vl...,E
2,2019-05-09-055187,It makes it much more difficult for him to mak...,,[],2019-05-09 18:11:29,1,"[[None, 0.6493], [President Bill Clinton, 0.27...",[http://mobile.nytimes.com/2019/05/09/world/as...,E
3,2019-10-31-056366,"It puts me in a predicament,",Xavier Becerra,[Q1855840],2019-10-31 16:45:15,3,"[[Xavier Becerra, 0.9065], [None, 0.0909], [St...",[http://www.nytimes.com/2019/10/31/technology/...,E
4,2019-01-04-001792,A Pile of Leaves.,,[],2019-01-04 10:00:07,1,"[[None, 0.8737], [Jason Fulford, 0.1263]]",[https://www.nytimes.com/2019/01/04/books/revi...,E


Test to see if the ids are unique within the dataset

In [14]:
#Keeping the first occurrence of the duplicates
size_bf = raw_data.shape[0]
df = raw_data.copy().drop_duplicates(subset = 'quoteID', keep='first')
size_af = df.shape[0]

print('{} duplicate rows have been removed'.format(size_bf-size_af))
print('Unique rows in the data set:', df.quoteID.is_unique)

0 duplicate rows have been removed
Unique rows in the data set: True


In [46]:
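# Work on the first 20 rows only while developing the filters
df_test = df[0:20]
# The speaker probabilities of those same rows
list_probas = df.probas[0:20]
list_probas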

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2019-04-17-024782,"It is not a low-income immigration,",James Fisher,"[Q16213953, Q20707104, Q43143598, Q58886302, Q...",2019-04-17 13:31:18,1,"[[James Fisher, 0.7475], [None, 0.2525]]",[https://www.nytimes.com/2019/04/17/realestate...,E
1,2019-04-02-001128,a champion figure skater switching to roller s...,John Updike,[Q105756],2019-04-02 14:58:33,2,"[[John Updike, 0.5856], [None, 0.4144]]",[https://www.nytimes.com/2019/04/02/opinion/vl...,E
2,2019-05-09-055187,It makes it much more difficult for him to mak...,,[],2019-05-09 18:11:29,1,"[[None, 0.6493], [President Bill Clinton, 0.27...",[http://mobile.nytimes.com/2019/05/09/world/as...,E
3,2019-10-31-056366,"It puts me in a predicament,",Xavier Becerra,[Q1855840],2019-10-31 16:45:15,3,"[[Xavier Becerra, 0.9065], [None, 0.0909], [St...",[http://www.nytimes.com/2019/10/31/technology/...,E
4,2019-01-04-001792,A Pile of Leaves.,,[],2019-01-04 10:00:07,1,"[[None, 0.8737], [Jason Fulford, 0.1263]]",[https://www.nytimes.com/2019/01/04/books/revi...,E
5,2019-08-15-002017,A Senator we can call our own.,Tom Rath,[Q7817334],2019-08-15 22:36:33,1,"[[Tom Rath, 0.7598], [None, 0.1993], [Warren R...",[http://www.nytimes.com/2019/08/15/us/politics...,E
6,2019-02-12-002452,"a thorn in the side of cabinet,",,[],2019-02-12 16:43:04,8,"[[None, 0.463], [Jody Wilson-Raybould, 0.4073]...",[https://castlegarsource.com/news/letter-apolo...,E
7,2019-01-28-048557,"It was just on a lark,",,[],2019-01-28 18:32:47,2,"[[None, 0.5762], [Alexander the Great, 0.2153]...",[http://www.nytimes.com/2019/01/28/health/micr...,E
8,2019-07-22-032883,"It's a success, a relief and a technical feat,",Florence Parly,[Q3074013],2019-07-22 02:37:50,21,"[[Florence Parly, 0.9262], [None, 0.0738]]",[http://www.breitbart.com/news/french-submarin...,E
9,2019-12-31-002666,always appreciates and respects the work and a...,,[],2019-12-31 22:09:50,3,"[[None, 0.8953], [President Trump, 0.1047]]",[http://www.ibtimes.com/climate-change-epa-sci...,E


Removing the quotes whose probability of speaker detection is too low, according to the variable *threshold*

In [54]:
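#Probability below which the quote is considered not attributed to a speaker
threshold = 0.1

size_bf = df_test.shape[0]
#Probability of the first listed speaker of each row (stored as a string, hence the float conversion);
#this assumes the probas list is sorted by decreasing probability, as in the rows shown above
top_proba = df_test.probas.apply(lambda p: float(p[0][1]))
#Keeping only the rows that reach the threshold (a boolean mask, since drop() expects index labels)
df_test_after = df_test[top_proba >= threshold].copy()
size_af = df_test_after.shape[0]

print('{} rows have been removed'.format(size_bf-size_af))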

In [37]:
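# probas is assumed to be the probability list of the first row
probas = df.probas[0]
# First [speaker, probability] pair of that row
first = probas[0]
first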

['James Fisher', '0.7475']

In [40]:
# The probability is stored as a string, so it must be converted before any comparison
result = float(first[1])
result

0.7475
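
Since the probabilities are stored as strings, it may be convenient to convert the whole `probas` column once rather than entry by entry. The sketch below is one possible way to do so, not the method used in this notebook: it reshapes the column into a long table with one row per (quote, candidate speaker) pair. The toy data and the names `long_df`, `candidate` and `proba` are invented for illustration.

```python
import pandas as pd

# Toy data (invented for this example) with the same layout as the probas column
toy = pd.DataFrame({
    'quoteID': ['q1', 'q2'],
    'probas': [
        [['Speaker A', '0.90'], ['None', '0.10']],
        [['None', '0.70'], ['Speaker B', '0.30']],
    ],
})

# One row per [speaker, probability] pair
long_df = toy[['quoteID', 'probas']].explode('probas', ignore_index=True)
# Split each pair into a name and a numeric probability
long_df['candidate'] = long_df['probas'].apply(lambda pair: pair[0])
long_df['proba'] = long_df['probas'].apply(lambda pair: float(pair[1]))
long_df = long_df.drop(columns='probas')

# Most probable candidate of every quote, without assuming the pairs are sorted
top = long_df.loc[long_df.groupby('quoteID')['proba'].idxmax()]
print(top)
```

With the probabilities in a numeric column, candidate thresholds can be explored with ordinary pandas operations, for example by plotting the distribution of `proba` for the top candidates.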