# Who has a voice in the media?

## 1. Pre-processing the dataset
To start with, we remove the rows of the dataset where either the speaker or the quotation is NaN. Since our analysis of "who has a voice in the media" depends on knowing who spoke and what they said, these rows are of no use to us.

We also run a sanity check and remove any duplicate rows that share a quote ID, since we do not want to count the same quote more than once in our analyses.
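
As a small illustration of these filters, the sketch below applies them to a toy frame. The column names follow the `clean_data` function in this notebook, but every value is invented, and the third step (the speaker-probability threshold) assumes that `probas` holds `[speaker, probability]` pairs with the attributed speaker first.

```python
import numpy as np
import pandas as pd

# Toy rows using the column names from this notebook; the values are invented.
toy = pd.DataFrame({
    'quoteID': ['q1', 'q1', 'q2', 'q3', 'q4'],
    'quotation': ['a', 'a', 'b', None, 'd'],
    'speaker': ['Ann', 'Ann', 'None', 'Bob', 'Cat'],
    'probas': [[['Ann', '0.9']], [['Ann', '0.9']], [['None', '0.8']],
               [['Bob', '0.7']], [['Cat', '0.4'], ['None', '0.35']]],
})

# 1. Keep one row per quote ID.
step1 = toy.drop_duplicates(subset=['quoteID'])

# 2. Treat the string 'None' as missing, then drop rows without speaker or quotation.
step2 = step1.assign(speaker=step1['speaker'].replace('None', np.nan))
step2 = step2.dropna(subset=['speaker', 'quotation'])

# 3. Keep rows whose first listed probability exceeds the threshold (assumed layout).
keep = step2['probas'].map(lambda p: float(p[0][1]) > 0.5)
step3 = step2[keep]

# Row counts after each step.
print(len(toy), len(step1), len(step2), len(step3))
```

Printing the row count after each step makes it easy to see which filter removes the most data.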

In [1]:
import pandas as pd
import numpy as np
from timeit import default_timer as timer

In [2]:
def clean_data(chunk, thresh=0.5):
    """Drop duplicate, incomplete, and low-confidence quotes from one chunk."""

    # Drop duplicate quoteIDs
    nr_rows = chunk.shape[0]
    chunk = chunk.drop_duplicates(subset=['quoteID'])
    print('- Dropped {} duplicate rows with same quoteID;'.format(nr_rows - chunk.shape[0]))

    # Drop quotes where either speaker or quotation is missing.
    # Missing values appear as the string 'None', so convert them to NaN first.
    nr_rows = chunk.shape[0]
    chunk = chunk.replace(to_replace=['None'], value=np.nan)
    chunk = chunk.dropna(axis=0, subset=['speaker', 'quotation'])
    print('- Dropped {} rows with NaN speaker or quotation;'.format(nr_rows - chunk.shape[0]))

    # Drop rows where the first-listed speaker probability does not exceed `thresh`.
    # Each probas entry is indexed as probas[0][1], i.e. the probability of the first pair.
    nr_rows = chunk.shape[0]
    prob_filter = chunk['probas'].map(lambda probas: float(probas[0][1]) > thresh)
    chunk = chunk[prob_filter]
    print('- Dropped {} rows with speaker prob smaller than {:.0%};'.format(nr_rows - chunk.shape[0], thresh))

    # Remove columns that are not needed for the later analysis
    chunk = chunk.drop(columns=['quoteID', 'speaker', 'probas', 'urls', 'phase', 'numOccurrences'])

    return chunk

start_of_all = timer()
read_from_file = 'data/quotes-2020.json.bz2'
write_to_file = 'data/clean-quotes-2020.bz2'
with pd.read_json(read_from_file, lines=True, compression='bz2', chunksize=1_000_000) as df_reader:
    print('Started to process chunks...')
    for i, chunk in enumerate(df_reader):
        print('\nProcessing new chunk...')
        start = timer()
        processed_chunk = clean_data(chunk)
        # Append each chunk to the output file; write the CSV header only once,
        # otherwise it is repeated as a data row at the start of every chunk.
        processed_chunk.to_csv(write_to_file, compression='bz2', mode='a', header=(i == 0), index=False)
        end = timer()
        print('Done processing and saving chunk after {:.3f} seconds.'.format(end - start))
        
end_of_all = timer()
print('\nDONE processing all chunks and saving as csv after {:.3f} minutes.'.format((end_of_all - start_of_all) / 60))

Started to process chunks...

Processing new chunk...
- Dropped 0 duplicate rows with same quoteID;
- Dropped 343700 rows with NaN speaker or quotation;
- Dropped 30524 rows with speaker prob smaller than 50%;
Done processing and saving chunk after 89.029 seconds.

Processing new chunk...
- Dropped 0 duplicate rows with same quoteID;
- Dropped 343778 rows with NaN speaker or quotation;
- Dropped 30897 rows with speaker prob smaller than 50%;
Done processing and saving chunk after 125.076 seconds.

Processing new chunk...
- Dropped 0 duplicate rows with same quoteID;
- Dropped 343353 rows with NaN speaker or quotation;
- Dropped 30405 rows with speaker prob smaller than 50%;
Done processing and saving chunk after 95.449 seconds.

Processing new chunk...
- Dropped 0 duplicate rows with same quoteID;
- Dropped 343472 rows with NaN speaker or quotation;
- Dropped 30511 rows with speaker prob smaller than 50%;
Done processing and saving chunk after 104.026 seconds.

Processing new chunk...


#### Short discussion
About 34% of the rows in each one-million-row chunk (roughly 343,000) have a NaN speaker or quotation field. We should reconsider whether this data can be used despite the missing fields.

Furthermore, there appear to be no duplicate quote IDs in the chunks processed above.
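
The shares behind this discussion can be recomputed from the log output. The sketch below copies the counts printed for the first four chunks and assumes that each of them contained exactly 1,000,000 rows, as set by `chunksize`.

```python
# Counts copied from the log output of the first four chunks.
chunk_size = 1_000_000  # assumption: every logged chunk was full
dropped_missing = [343700, 343778, 343353, 343472]
dropped_low_prob = [30524, 30897, 30405, 30511]

missing_share = [m / chunk_size for m in dropped_missing]

for missing, low_prob in zip(dropped_missing, dropped_low_prob):
    # No duplicates were dropped, so these rows reach the probability filter.
    remaining = chunk_size - missing
    print('missing: {:.1%}, low probability: {:.1%} of remaining rows, kept: {}'.format(
        missing / chunk_size, low_prob / remaining, remaining - low_prob))
```

The missing-speaker filter removes far more rows than the probability threshold, so it is the step most worth revisiting.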

## 2. Initial analyses
Here, we do initial studies on the dataset. For instance, we plot the following information about the speakers:
- gender;
- age;
- ethnicity;
- profession.
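
One possible way to obtain such per-speaker counts is to link each quote to a speaker attribute table through its QID. The sketch below uses invented toy data: the `qids` strings mimic the cleaned CSV, while the attribute table, its column names `id` and `gender`, and the choice to keep only the first QID when several are listed are all assumptions made for illustration.

```python
import ast
import pandas as pd

# Quotes shaped like the cleaned CSV: qids is stored as a string such as "['Q1']".
quotes = pd.DataFrame({
    'quotation': ['first quote', 'second quote', 'third quote'],
    'qids': ["['Q1']", "['Q2', 'Q3']", "['Q1']"],
})

# Hypothetical attribute table; the column names 'id' and 'gender' are assumptions.
speakers = pd.DataFrame({
    'id': ['Q1', 'Q2', 'Q3'],
    'gender': ['female', 'male', 'male'],
})

# Parse the string back into a list and keep the first QID as a simple tie-break.
quotes['qid'] = quotes['qids'].map(lambda s: ast.literal_eval(s)[0])

# Attach the speaker attributes to each quote and count quotes per gender.
merged = quotes.merge(speakers, left_on='qid', right_on='id', how='left')
gender_counts = merged['gender'].value_counts()
print(gender_counts)
```

Keeping only the first QID is the simplest choice for speakers with several candidate entries; dropping ambiguous quotes entirely would be a stricter alternative.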

We also analyze the content of the quotes.

In [None]:
# number of males and females
# profession
# age
# ethnicity
# clean if person has several references in wikidata?

In [3]:
df = pd.read_csv('data/clean-quotes-2020.bz2', compression='bz2')

In [None]:
parq = pd.read_parquet('data/speaker_attributes.parquet-20211104T133449Z-001.zip')

In [None]:
# clean_data already drops these columns, so ignore them if they are absent
df = df.drop(columns=['probas', 'speaker'], errors='ignore')

In [4]:
df

Unnamed: 0,quotation,qids,date
0,[ Department of Homeland Security ] was livid ...,['Q367796'],2020-01-16 12:00:13
1,[ I met them ] when they just turned 4 and 7. ...,['Q20684375'],2020-01-24 20:37:09
2,[ The delay ] will have an impact [ on Slough ...,['Q5268447'],2020-01-17 13:03:00
3,[ The scheme ] treats addiction as an illness ...,['Q4864119'],2020-04-02 14:18:20
4,[ These ] actions will allow households who ha...,['Q816459'],2020-03-19 19:14:00
...,...,...,...
3282966,you're going to take care of the gun problem w...,['Q6279'],2020-03-03 15:49:51
3282967,"you're seeing a young team that's maturing, th...",['Q18115465'],2020-02-24 05:00:28
3282968,"You're talking about African-Americans, right?...",['Q3635235'],2020-02-07 00:00:00
3282969,You've got to sometimes take that leap of fait...,['Q896796'],2020-02-04 14:47:00
