# Who has a voice in the media?


## Pre-processing the dataset
To start with, we remove the rows of the dataset where either the author or the quotation is NaN. In addition, we remove the authors who probability is lower than 50%. As our whole analysis of "who has a voice in the media" is all about the speaker and what it has said, it makes no sense to take these rows into account.

Later, we also do a sanity controll and remove possible duplicate of rows with the same quote-ID as we obiously don't want to use exactly the same quote more than once in our analyzes. 

Finally, to reduce the dataset further we remove columns that we will not use for our analysis: _quoteID_, _speaker_, _probas_, _urls_, _phase_ and _numOccurrences_.

In [12]:
from timeit import timeit as timer
import numpy as np
import pandas as pd
def clean_data(chunk, thresh=0.5):
    
    # Drop duplicate quoteIDs
    nr_rows = chunk.shape[0]
    chunk = chunk.drop_duplicates(subset=['quoteID'])
    print('- Dropped {} duplicate rows with same quoteID;'.format(nr_rows - chunk.shape[0]))
    
    # Drop quotes where either speaker or quotation is None
    nr_rows = chunk.shape[0]
    chunk.replace(to_replace=['None'], value=np.nan, inplace=True)
    chunk = chunk.dropna(axis=0, subset=['speaker', 'quotation'])
    print('- Dropped {} rows with NaN speaker or quotation;'.format(nr_rows - chunk.shape[0]))
    
    # Drop rows where speakers has probability less than 50%
    nr_rows = chunk.shape[0]
    prob_filter = pd.Series([(float(chunk.iloc[i].probas[0][1]) > thresh) for i in range(nr_rows)])
    prob_filter = pd.Series(prob_filter)
    chunk = chunk[prob_filter.values]
    print('- Dropped {} rows with speaker prob smaller than 50%;'.format(nr_rows - chunk.shape[0]))
    
    # Remove columns we don't care about
    chunk = chunk.drop(columns=['speaker', 'probas'])

    return chunk

start_of_all = timer()
read_from_file = 'data/quotes-2015.json-002.bz2'
write_to_file = 'data/clean-quotes-2015-updated.bz2'
with pd.read_json(read_from_file, lines=True, compression='bz2', chunksize=1_000_000) as df_reader:
    print('Started to process chunks...')
    i = 0
    for chunk in df_reader:
        print('\nProcessing new chunk...')
        start = timer()
        processed_chunk = clean_data(chunk)
        processed_chunk.to_csv(write_to_file, compression='bz2', mode='a', index=False)
        end = timer()
        print('Done processing and saving chunk after {:.3f} seconds.'.format(end - start))
        
end_of_all = timer()
print('\nDONE processing all chunks and saving as csv after {:.3f} minutes.'.format((end_of_all - start_of_all) / 60))
print('THE END!')

Started to process chunks...

Processing new chunk...
- Dropped 0 duplicate rows with same quoteID;
- Dropped 348775 rows with NaN speaker or quotation;
- Dropped 32474 rows with speaker prob smaller than 50%;
Done processing and saving chunk after -0.005 seconds.

Processing new chunk...
- Dropped 0 duplicate rows with same quoteID;
- Dropped 348985 rows with NaN speaker or quotation;
- Dropped 32299 rows with speaker prob smaller than 50%;
Done processing and saving chunk after -0.004 seconds.

Processing new chunk...
- Dropped 0 duplicate rows with same quoteID;
- Dropped 348778 rows with NaN speaker or quotation;
- Dropped 32428 rows with speaker prob smaller than 50%;
Done processing and saving chunk after -0.002 seconds.

Processing new chunk...
- Dropped 0 duplicate rows with same quoteID;
- Dropped 348514 rows with NaN speaker or quotation;
- Dropped 32595 rows with speaker prob smaller than 50%;
Done processing and saving chunk after -0.004 seconds.

Processing new chunk...
- 

#### Short discussion around pre-processing
Around one third of the original dataset has either a NaN quotation field, a NaN speaker, or a speaker with lower than 50% probability of having said that quote. Another one third of the original data is removed by the removal of the unwanted columns. Thus we are left with one third of the original dataset and still with full possibility of doing the wanted analysis

Elsemore, it seems like there are no duplicates of quote-IDs in the dataset.

### Creating the Speakers dataset

In order to analyse who has a voice in the media we add a new column "n_quotes" to the wikidata-dataset which is how many times that person is present in the quotebank dataset from 2015 to 2020. This new dataset is saved in "speakers.bz2"

In [14]:
from pathlib import Path
import pandas as pd
from collections import Counter

datafolder = Path("data")

DATA = {
    '2015': datafolder / Path("clean-quotes-2015.bz2"),
    '2016': datafolder / Path("clean-quotes-2016.bz2"),
    '2017': datafolder / Path("clean-quotes-2017.bz2"),
    '2018': datafolder / Path("clean-quotes-2018.bz2"),
    '2019': datafolder / Path("clean-quotes-2019.bz2"),
    '2020': datafolder / Path("clean-quotes-2020.bz2"),
}

ALL_YEARS = ["2015","2016","2017","2018","2019","2020"]

def load_data(year, sample=True, sample_size=100_000):
    if DATA[year].exists():
        df = pd.read_csv(DATA[year], compression='bz2')
        if sample:
            df = df.sample(n=sample_size, random_state=1)
        return df
    else:
        return None 
    
wikidata_speakers = pd.read_parquet('data/speaker_attributes.parquet')
wikidata_speakers.set_index('id', inplace=True)

In [15]:
wanted_qids =  []
for year in ALL_YEARS:
    print(f"Loading {year}...")
    df = load_data(year, sample=False)
    print(f"Done loading {year}")
    if df is not None:
        qids = df.qids.tolist()
        wanted_qids += [eval(qid)[0] for qid in qids if len(eval(qid)) == 1 and eval(qid)[0] in wikidata_speakers.index]
    else:
        Print(f"could not find file for year {year}")
        
speakers = wikidata_speakers.loc[wanted_qids]
speakers = speakers[~speakers.index.duplicated(keep='first')]

n_quotes_per_person = Counter(wanted_qids)
speakers['n_quotes'] = speakers.index.map(n_quotes_per_person)


Loading 2015...
Done loading 2015
Loading 2016...
Done loading 2016
Loading 2017...
Done loading 2017
Loading 2018...
Done loading 2018
Loading 2019...
Done loading 2019
Loading 2020...
Done loading 2020


In [23]:
speakers.to_csv("data/speakers.bz2",compression='bz2')