# Who has a voice in the media?

## 1. Pre-processing the dataset
To start with, we remove the rows of the dataset where either the speaker or the quotation is NaN. Since our analysis of "who has a voice in the media" depends entirely on who the speaker is and what they said, rows missing either field cannot be used.

We also run a sanity check and remove any duplicate rows that share the same quote ID, since we do not want to count the same quote more than once in our analyses. Finally, we drop rows where the most likely speaker has a probability of 50% or less.
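
As a minimal sketch of the first two steps, assume a toy DataFrame with `quoteID`, `speaker` and `quotation` columns (the rows are invented for illustration) in which a missing speaker is stored as the string `'None'`:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'quoteID': ['q1', 'q2', 'q2', 'q3'],
    'speaker': ['Alice', 'None', 'None', 'Bob'],
    'quotation': ['Hello', 'Hi', 'Hi', None],
})

# Treat the string 'None' as a missing value, then drop incomplete rows
cleaned = toy.replace('None', np.nan).dropna(subset=['speaker', 'quotation'])

# Keep only the first row for each quoteID
cleaned = cleaned.drop_duplicates(subset=['quoteID'])

print(cleaned)
```

Only the row with a known speaker and a non-empty quotation survives in this toy case.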

In [3]:
import pandas as pd
import numpy as np
from timeit import default_timer as timer
import ast     # for transforming str to list

In [None]:
def clean_data(chunk, thresh=0.5):
    
    # Drop duplicate quoteIDs
    nr_rows = chunk.shape[0]
    chunk = chunk.drop_duplicates(subset=['quoteID'])
    print('- Dropped {} duplicate rows with same quoteID;'.format(nr_rows - chunk.shape[0]))
    
    # Drop quotes where either speaker or quotation is missing
    # (the string 'None' is treated as a missing value)
    nr_rows = chunk.shape[0]
    chunk = chunk.replace(to_replace=['None'], value=np.nan)  # avoid in-place edits on a possible copy
    chunk = chunk.dropna(axis=0, subset=['speaker', 'quotation'])
    print('- Dropped {} rows with NaN speaker or quotation;'.format(nr_rows - chunk.shape[0]))
    
    # Drop rows where the most likely speaker has a probability of at most `thresh`
    nr_rows = chunk.shape[0]
    # `probas` holds [speaker, probability] pairs; parse it first if it is stored as a string
    top_prob = chunk['probas'].apply(lambda p: float((ast.literal_eval(p) if isinstance(p, str) else p)[0][1]))
    chunk = chunk[top_prob > thresh]
    print('- Dropped {} rows with speaker prob not above {:.0%};'.format(nr_rows - chunk.shape[0], thresh))
    
    # Remove columns we don't care about (drop() returns a new DataFrame, so keep the result)
    chunk = chunk.drop(columns=['quoteID', 'speaker', 'probas', 'urls', 'phase', 'numOccurrences'])

    return chunk

start_of_all = timer()
with pd.read_json('data/quotes-2020.json.bz2', lines=True, compression='bz2', chunksize=10_000) as df_reader:
    print('Started to process chunks...')
    for i, chunk in enumerate(df_reader):
        # Debugging limit: stop after the first two chunks
        if i > 1:
            break
        print('\nProcessing new chunk...')
        start = timer()
        processed_chunk = clean_data(chunk)
        # Append to the output file; write the header row only for the first chunk
        processed_chunk.to_csv(path_or_buf='data/clean-quotes-2020-2.bz2', compression='bz2', mode='a', index=False, header=(i == 0))
        end = timer()
        print('Done processing and saving chunk after {:.3f} seconds.'.format(end - start))

end_of_all = timer()
print('Done processing all chunks and saving as csv after {:.3f} minutes.'.format((end_of_all - start_of_all) / 60))

#### Short discussion
Around one third of the dataset seems to have either a NaN speaker or a NaN quotation field. We should reconsider whether this data can still be used despite the missing fields.
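
A share like this can be checked directly. The sketch below is one possible way to do it, not the code used for the estimate above; the helper `missing_share` and the toy rows are hypothetical, and it assumes that missing values appear either as NaN or as the string `'None'`:

```python
import pandas as pd

def missing_share(frame, columns=('speaker', 'quotation')):
    """Return the fraction of rows where at least one of `columns` is missing."""
    cols = list(columns)
    # Count both real NaN values and the literal string 'None' as missing
    is_missing = (frame[cols].isna() | frame[cols].eq('None')).any(axis=1)
    return is_missing.mean()

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'speaker': ['Alice', 'None', 'Bob'],
    'quotation': ['Hello', 'Hi', 'Bye'],
})
print('{:.1%} of rows lack a speaker or quotation'.format(missing_share(toy)))
```

When the data is read in chunks, the missing and total row counts would need to be summed across chunks before dividing.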

Furthermore, there seem to be no duplicate quote IDs in the dataset.
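
One caveat is that a check done chunk by chunk only finds duplicates inside a single chunk. The sketch below shows one way to count duplicates across chunks as well; the helper `count_duplicate_ids` is hypothetical, and it assumes that the set of all quote IDs fits in memory:

```python
import pandas as pd

def count_duplicate_ids(chunks, id_column='quoteID'):
    """Count rows whose ID already appeared earlier, within or across chunks."""
    seen = set()
    duplicates = 0
    for chunk in chunks:
        ids = chunk[id_column]
        # Repeats inside this chunk, plus IDs already seen in earlier chunks
        is_dup = ids.duplicated() | ids.isin(seen)
        duplicates += int(is_dup.sum())
        seen.update(ids)
    return duplicates

# Toy chunks, invented for illustration only
toy_chunks = [
    pd.DataFrame({'quoteID': ['q1', 'q2']}),
    pd.DataFrame({'quoteID': ['q2', 'q3', 'q3']}),
]
n_duplicates = count_duplicate_ids(toy_chunks)
print(n_duplicates)
```

In the toy input, `q2` repeats across chunks and `q3` repeats within the second chunk.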

## 2. Initial analyses
Here, we carry out initial studies of the dataset. For instance, we plot the following information about the speakers:
- gender;
- age;
- ethnicity;
- profession.

We also analyse the content of the quotes.
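
To obtain speaker attributes such as gender, the quotes have to be linked to a speaker table through their QIDs. The following is only a sketch under assumptions: the toy tables are invented, and the column names `id` and `gender` in the speaker table are hypothetical rather than taken from the real attribute file. It relies on the fact that the CSV stores each `qids` list as a string:

```python
import ast
import pandas as pd

# Toy stand-ins, invented for illustration; `id` and `gender` are assumed column names
quotes = pd.DataFrame({
    'quotation': ['Hello', 'Bye', 'Hi'],
    'qids': ["['Q1']", "['Q2', 'Q3']", "['Q1']"],
})
speakers = pd.DataFrame({
    'id': ['Q1', 'Q2', 'Q3'],
    'gender': ['female', 'male', 'male'],
})

# The CSV stores each list of QIDs as a string, so parse it back into a list
quotes['qids'] = quotes['qids'].apply(ast.literal_eval)

# One row per (quote, candidate QID), then attach the speaker attributes
per_qid = quotes.explode('qids').merge(speakers, left_on='qids', right_on='id', how='left')

gender_counts = per_qid['gender'].value_counts()
print(gender_counts)
```

A quote whose speaker name maps to several QIDs is counted once per QID here, so such cases would need to be resolved before the counts are interpreted.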

In [None]:
# number of males and females
# profession
# age
# ethnicity
# clean if person has several references in wikidata?

In [4]:
df = pd.read_csv('data/clean-quotes-2020.bz2', compression='bz2')

In [None]:
parq = pd.read_parquet('data/speaker_attributes.parquet-20211104T133449Z-001.zip')

In [5]:
df

Unnamed: 0.1,Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,0,2020-01-16-000088,[ Department of Homeland Security ] was livid ...,Sue Myrick,['Q367796'],2020-01-16 12:00:13,1,"[['Sue Myrick', '0.8867'], ['None', '0.0992'],...",['http://thehill.com/opinion/international/478...,E
1,1,2020-01-24-000168,[ I met them ] when they just turned 4 and 7. ...,Meghan King Edmonds,['Q20684375'],2020-01-24 20:37:09,4,"[['Meghan King Edmonds', '0.5446'], ['None', '...",['https://people.com/parents/meghan-king-edmon...,E
2,2,2020-01-17-000357,[ The delay ] will have an impact [ on Slough ...,Dexter Smith,['Q5268447'],2020-01-17 13:03:00,1,"[['Dexter Smith', '0.924'], ['None', '0.076']]",['http://www.sloughexpress.co.uk/gallery/sloug...,E
3,3,2020-04-02-000239,[ The scheme ] treats addiction as an illness ...,Barry Coppinger,['Q4864119'],2020-04-02 14:18:20,1,"[['Barry Coppinger', '0.9017'], ['None', '0.09...",['http://www.theweek.co.uk/106479/why-police-a...,E
4,4,2020-03-19-000276,[ These ] actions will allow households who ha...,Ben Carson,['Q816459'],2020-03-19 19:14:00,1,"[['Ben Carson', '0.9227'], ['None', '0.0773']]",['https://mortgageorb.com/hud-fha-suspend-fore...,E
...,...,...,...,...,...,...,...,...,...,...
984573,984573,2020-03-27-056695,There's no question in my mind that as we look...,Mary Ellen Carroll,['Q6779440'],2020-03-27 11:00:00,1,"[['Mary Ellen Carroll', '0.8831'], ['None', '0...",['http://sfgate.com/bayarea/heatherknight/arti...,E
984574,984574,2020-01-11-043035,There's some guys who just do things the right...,Jeff Hafley,['Q23016913'],2020-01-11 16:18:28,1,"[['Jeff Hafley', '0.4425'], ['None', '0.4324']...",['http://timesunion.com/sports/article/New-BC-...,E
984575,984575,2020-02-12-096454,There's something undeniably feminine and empo...,"Sophie , Countess of Wessex",['Q155203'],2020-02-12 19:27:00,1,"[['Sophie , Countess of Wessex', '0.9259'], ['...",['http://express.co.uk/life-style/style/124150...,E
984576,984576,2020-01-20-066631,There's the performance down there on the stag...,Mark Sweet,['Q16150632'],2020-01-20 11:00:00,1,"[['Mark Sweet', '0.9213'], ['None', '0.0787']]",['http://newyorker.com/magazine/2020/01/27/how...,E
