In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Here we process the 2020 quotations; the other years were processed the same way, changing only the file path.

In [2]:
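# With chunksize set, read_json returns an iterator of DataFrames rather than a single DataFrame.
# Forward slashes avoid invalid escape sequences such as "\Q" in the path string.
df_2020 = pd.read_json("DATA/Quotebank/quotes-2020.json.bz2", compression="bz2", chunksize=100000, lines=True, encoding='UTF-8')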
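
To make the chunked reading concrete, here is a minimal sketch on a tiny bz2-compressed JSON-lines file. The records, the temporary file name and the chunk size of 2 are made up for illustration; only the `pd.read_json` arguments mirror the cell above.

```python
import bz2
import json
import os
import tempfile

import pandas as pd

# Hypothetical toy records standing in for Quotebank rows.
records = [{"quoteID": f"q{i}", "speaker": f"Speaker {i}"} for i in range(5)]

# Write them as one JSON object per line, bz2-compressed, like the real file.
path = os.path.join(tempfile.mkdtemp(), "toy-quotes.json.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# chunksize requires lines=True; the result is an iterator, not a DataFrame.
reader = pd.read_json(path, lines=True, chunksize=2, compression="bz2")

# Each iteration yields a DataFrame with at most `chunksize` rows.
chunk_sizes = [len(chunk) for chunk in reader]
print(chunk_sizes)
```

Note that the reader is consumed by the loop: to go through the file a second time, `pd.read_json` has to be called again.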

In [3]:
selected_columns = ['aliases', 'nationality', 'occupation', 'party', 'label']
attributes = pd.read_csv('DATA/speaker_attributes_parse.csv', usecols=selected_columns)
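# Drop rows where every selected attribute is missing, then renumber the rows.
attributes.dropna(how='all', inplace=True)
attributes.reset_index(drop=True, inplace=True)
# Keep a single row per label (the first one) so that the merge on speaker name does not duplicate quotes.
attributes.drop_duplicates(subset=['label'], inplace=True)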

We process the file in batches because the data is too large to fit in memory. For each quotation we keep only the domain of its first URL (e.g. "website.com" or "www.website.com"), which saves memory and gives a cleaner representation. We also keep only the most useful columns to save memory, and we drop the quotations whose speaker is "None", since we are only interested in quotations attributed to a speaker.

In [None]:
from urllib.parse import urlparse

def extract_domain(urls):
    """Return the domain of the first URL in the list, or 'unknown' if it cannot be determined."""
    if not urls:
        return 'unknown'
    return urlparse(urls[0]).netloc or 'unknown'

selected_columns = ['quoteID', 'quotation', 'speaker', 'date', 'numOccurrences', 'urls']
chunk_list = []
for batch in df_2020:
    # Keep the useful columns and the quotations with a speaker; copy to avoid SettingWithCopyWarning.
    batch = batch.loc[batch['speaker'] != 'None', selected_columns].copy()
    batch['urls_parse'] = batch['urls'].apply(extract_domain)
    batch.drop('urls', axis=1, inplace=True)
    chunk_list.append(batch)

df_final = pd.concat(chunk_list)
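
The domain extraction can be tried in isolation on a few made-up URLs. The sketch below uses `urllib.parse.urlparse` from the standard library; the helper name `first_domain` and the sample URLs are hypothetical and only serve as an illustration.

```python
from urllib.parse import urlparse


def first_domain(urls):
    """Return the domain of the first URL in `urls`, or 'unknown' as a fallback."""
    if not urls:
        return "unknown"
    # netloc is the part between '//' and the next '/', e.g. 'www.website.com'.
    return urlparse(urls[0]).netloc or "unknown"


# Made-up inputs covering the usual cases.
samples = [
    ["https://thehill.com/policy/some-article", "https://example.org/other"],
    ["http://www.sloughexpress.co.uk/news/123"],
    ["http://people.com"],  # no path after the domain
    [],                     # no URL at all
]
domains = [first_domain(urls) for urls in samples]
print(domains)
```

Only the first URL of each quotation is used, so a quotation that appears on several websites is attributed to a single domain.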

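We merge the speaker attributes into the quotes dataset to get more information about the speaker of each quote, such as nationality, occupation and political party. The merge is a left join on the speaker's name, so quotes whose speaker has no entry in the attributes table are kept, with missing values in the attribute columns.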

In [6]:
merged = df_final.merge(attributes, how='left', left_on='speaker', right_on='label')

In [7]:
merged.drop('label', axis=1, inplace=True)

Unnamed: 0,quoteID,quotation,speaker,date,numOccurrences,urls_parse,aliases,nationality,occupation,party
0,2020-01-16-000088,[ Department of Homeland Security ] was livid ...,Sue Myrick,2020-01-16 12:00:13,1,thehill.com,,['United States of America'],['politician'],['Republican Party']
1,2020-01-24-000168,[ I met them ] when they just turned 4 and 7. ...,Meghan King Edmonds,2020-01-24 20:37:09,4,people.com,,,,
2,2020-01-17-000357,[ The delay ] will have an impact [ on Slough ...,Dexter Smith,2020-01-17 13:03:00,1,www.sloughexpress.co.uk,,,['basketball player'],
3,2020-04-02-000239,[ The scheme ] treats addiction as an illness ...,Barry Coppinger,2020-04-02 14:18:20,1,www.theweek.co.uk,,,,['Labour Party']
4,2020-03-19-000276,[ These ] actions will allow households who ha...,Ben Carson,2020-03-19 19:14:00,1,mortgageorb.com,"['Benjamin Solomon Carson' 'Benjamin Solomon ""...",['United States of America'],"['psychologist', 'neurosurgeon', 'politician',...","['Republican Party', 'Democratic Party', 'inde..."
...,...,...,...,...,...,...,...,...,...,...
3443600,2020-03-03-079268,you're going to take care of the gun problem w...,Joe Biden,2020-03-03 15:49:51,2,twitchy.com,['Joseph Biden' 'Joseph R. Biden' 'Joseph R. B...,['United States of America'],"['politician', 'lawyer', 'university teacher']",['Democratic Party']
3443601,2020-02-24-080186,"you're seeing a young team that's maturing, th...",Brendan Whittet,2020-02-24 05:00:28,1,feeds.browndailyherald.com,,['United States of America'],['ice hockey player'],
3443602,2020-02-07-122251,"You're talking about African-Americans, right?...",Barry Michael Cooper,2020-02-07 00:00:00,1,www.villagevoice.com,,['United States of America'],"['journalist', 'screenwriter']",
3443603,2020-02-04-118820,You've got to sometimes take that leap of fait...,Brad Gushue,2020-02-04 14:47:00,10,timescolonist.com,,['Canada'],['curler'],
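
The behaviour of the left merge can be checked on a toy example. The two small tables below are invented for illustration; `validate` and `indicator` are optional `merge` arguments that are not used in the cells above, added here only to make the checks explicit.

```python
import pandas as pd

# Hypothetical miniature versions of the quotes and attributes tables.
quotes = pd.DataFrame({
    "quoteID": ["a", "b", "c"],
    "speaker": ["Joe Biden", "Ben Carson", "Jane Doe"],
})
attrs = pd.DataFrame({
    "label": ["Joe Biden", "Ben Carson"],
    "party": ["Democratic Party", "Republican Party"],
})

toy_merged = quotes.merge(
    attrs,
    how="left",               # keep every quote, matched or not
    left_on="speaker",
    right_on="label",
    validate="many_to_one",   # raises if a label appears more than once in attrs
    indicator=True,           # adds a '_merge' column: 'both' or 'left_only'
)

# Quotes whose speaker has no attributes get NaN in the attribute columns.
n_unmatched = int((toy_merged["_merge"] == "left_only").sum())
print(toy_merged)
print(n_unmatched)
```

The `many_to_one` check passes only when each label is unique, which is what dropping duplicate labels in the attributes table guarantees; without it, a speaker name listed twice would duplicate every one of that speaker's quotes.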


In [9]:
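# The file is bz2-compressed even though its name ends in .csv, so compression='bz2' must be passed when reading it back.
merged.to_csv("DATA/quotes_2020_parse.csv", compression='bz2')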
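
As a rough sketch of how such a file can be read back, the round trip below writes a small invented table to a temporary path and reloads it. The table contents and the file name `toy_parse.csv` are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row table standing in for the merged dataset.
toy = pd.DataFrame({
    "speaker": ["Joe Biden", "Brad Gushue"],
    "numOccurrences": [2, 10],
})

path = os.path.join(tempfile.mkdtemp(), "toy_parse.csv")
toy.to_csv(path, compression="bz2")

# The extension is .csv, so the compression cannot be inferred and is given explicitly.
# index_col=0 reads the saved index back instead of creating an 'Unnamed: 0' column.
restored = pd.read_csv(path, compression="bz2", index_col=0)
print(restored)
```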