# Who has a voice in the media?

## 1. Pre-processing the dataset
To start with, we remove the rows of the dataset where either the speaker or the quotation is NaN. Our whole analysis of "who has a voice in the media" is about the speakers and what they said, so rows missing either field cannot be used.

We also do a sanity check and remove possible duplicate rows with the same quote ID, as we obviously don't want to use exactly the same quote more than once in our analyses.

In [None]:
import pandas as pd
import numpy as np
from timeit import default_timer as timer

In [None]:
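def clean_data(chunk):
    """Drop rows with a duplicate quoteID or a missing speaker/quotation from one chunk."""

    # Drop duplicate quoteIDs (only duplicates within this chunk are detected)
    nr_rows = chunk.shape[0]
    chunk = chunk.drop_duplicates(subset=['quoteID'])
    print('- Dropped {} duplicate rows with same quoteID;'.format(nr_rows - chunk.shape[0]))

    # Drop quotes where either speaker or quotation is missing.
    # The literal string 'None' is treated as missing and converted to NaN first;
    # the result is reassigned instead of using inplace=True, which would modify
    # a possible copy of a slice and trigger a SettingWithCopyWarning.
    nr_rows = chunk.shape[0]
    chunk = chunk.replace(to_replace=['None'], value=np.nan)
    chunk = chunk.dropna(axis=0, subset=['speaker', 'quotation'])
    print('- Dropped {} rows with NaN speaker or quotation;'.format(nr_rows - chunk.shape[0]))

    return chunk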

start_of_all = timer()
with pd.read_json('data/quotes-2020.json.bz2', lines=True, compression='bz2', chunksize=500_000) as df_reader:
    print('Started to process chunks...')
    # enumerate() keeps the chunk counter up to date (it was never incremented before)
    for i, chunk in enumerate(df_reader):
        # Only the first 5 chunks are processed; remove this check to process the whole file
        if i > 4:
            break
        print('\nProcessing new chunk...')
        start = timer()
        processed_chunk = clean_data(chunk)
        #chunk_list.append(processed_chunk)
        # Write the header only for the first chunk and skip the index,
        # so that the appended file can be read back as a single CSV
        processed_chunk.to_csv(path_or_buf='data/clean-quotes-2020-2.bz2', compression='bz2', mode='a',
                               header=(i == 0), index=False)
        end = timer()
        print('Done processing and saving chunk after {:.3f} seconds.'.format(end - start))

end_of_all = timer()
print('Done processing chunks and saving as csv after {:.3f} minutes.'.format((end_of_all - start_of_all) / 60))

#### Short discussion
Around one third of the dataset seems to have either a NaN speaker or a NaN quotation field. We should reconsider whether there is a way to use this data despite the missing speaker or quotation fields.
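
To back up an estimate like this with a number, the share of unusable rows can be computed per chunk before dropping them. The following is a minimal sketch, not part of the pipeline above: `missing_share` is a hypothetical helper and the toy chunk is invented for illustration. It treats the string `'None'` as missing, as `clean_data` does.

```python
import pandas as pd

def missing_share(chunk, columns=('speaker', 'quotation')):
    """Return the fraction of rows where any of `columns` is NaN or the string 'None'."""
    subset = chunk[list(columns)]
    # A row counts as missing if at least one of the columns is NaN or 'None'
    missing = (subset.isna() | subset.eq('None')).any(axis=1)
    return missing.mean()

# Toy chunk for illustration only; these rows are not taken from the real dataset
toy = pd.DataFrame({
    'quoteID': ['q1', 'q2', 'q3'],
    'speaker': ['None', 'Jane Doe', 'John Doe'],
    'quotation': ['a', 'b', 'c'],
})
share = missing_share(toy)
print('{:.1%} of rows lack a speaker or quotation'.format(share))
```

Averaging the per-chunk shares, weighted by chunk size, would give the share for the whole file.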

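Furthermore, there seem to be no duplicate quote IDs in the dataset. Note that `clean_data` only compares quote IDs within a single chunk, so duplicates spread across different chunks would not be detected by this check.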
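
Because `clean_data` runs on one chunk at a time, `drop_duplicates` cannot see a quote ID that appears in two different chunks. One possible way to check this, sketched below under the assumption that all quote IDs fit in memory, is to keep a set of the IDs seen so far. `drop_seen_quote_ids` and the toy chunks are hypothetical and only illustrate the idea.

```python
import pandas as pd

def drop_seen_quote_ids(chunk, seen_ids):
    """Drop rows whose quoteID occurred earlier in this chunk or in a previous chunk.

    `seen_ids` is a set that is updated in place with the IDs that were kept.
    """
    chunk = chunk.drop_duplicates(subset=['quoteID'])
    chunk = chunk[~chunk['quoteID'].isin(seen_ids)]
    seen_ids.update(chunk['quoteID'])
    return chunk

# Toy chunks for illustration only: 'q2' is repeated within and across chunks
chunks = [
    pd.DataFrame({'quoteID': ['q1', 'q2', 'q2']}),
    pd.DataFrame({'quoteID': ['q2', 'q3']}),
]
seen_ids = set()
deduplicated = [drop_seen_quote_ids(c, seen_ids) for c in chunks]
kept_ids = [qid for c in deduplicated for qid in c['quoteID']]
print(kept_ids)
```

The set grows with the number of distinct quotes, so for the full file it may be worth checking memory use first.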

## 2. Initial analyses
Here, we do initial studies on the dataset. For instance, we plot the following information about the speakers:
- gender;
- age;
- ethnicity;
- profession.

We also analyse the content of the quotes.
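
As a rough sketch of what such a breakdown could look like, the snippet below computes the share of quotes per attribute value. It assumes the speaker attributes are already available as columns next to each quote; the column names `gender` and `profession`, the helper `attribute_shares`, and the toy rows are all hypothetical.

```python
import pandas as pd

# Toy data for illustration only; 'gender' and 'profession' are hypothetical columns
toy = pd.DataFrame({
    'speaker': ['A', 'B', 'C', 'D'],
    'gender': ['female', 'male', 'male', None],
    'profession': ['politician', 'politician', 'athlete', 'actor'],
})

def attribute_shares(df, column):
    """Share of quotes per value of `column`; missing values form their own group."""
    return df[column].value_counts(normalize=True, dropna=False)

gender_shares = attribute_shares(toy, 'gender')
print(gender_shares)
```

With matplotlib installed, `gender_shares.plot.bar()` would turn the result into a bar chart. Keeping missing values as a separate group shows how many quotes cannot be attributed to any category.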

In [None]:
# number of males and females
# profession
# age
# ethnicity
# open question: how to clean speakers that have several references in Wikidata?

In [None]:
df = pd.read_csv('data/clean-quotes-2020.bz2', compression='bz2')

In [None]:
df.head(30)