# Who has a voice in the media?


## 1. Pre-processing the Quotebank dataset
To start with, we remove the rows of the dataset where either the author or the quotation is NaN. In addition, we remove the authors who probability is lower than 50%. As our whole analysis of "who has a voice in the media" is all about the speaker and what it has said, it makes no sense to take these rows into account.

Later, we also do a sanity controll and remove possible duplicate of rows with the same quote-ID as we obiously don't want to use exactly the same quote more than once in our analyzes. 

Finally, to reduce the dataset further we remove columns that we will not use for our analysis: _quoteID_, _speaker_, _probas_, _urls_, _phase_ and _numOccurrences_.

In [12]:
from timeit import timeit as timer
import numpy as np
import pandas as pd

def clean_data(chunk, thresh=0.5):
    
    # Drop duplicate quoteIDs
    nr_rows = chunk.shape[0]
    chunk = chunk.drop_duplicates(subset=['quoteID'])
    print('- Dropped {} duplicate rows with same quoteID;'.format(nr_rows - chunk.shape[0]))
    
    # Drop quotes where either speaker or quotation is None
    nr_rows = chunk.shape[0]
    chunk.replace(to_replace=['None'], value=np.nan, inplace=True)
    chunk = chunk.dropna(axis=0, subset=['speaker', 'quotation'])
    print('- Dropped {} rows with NaN speaker or quotation;'.format(nr_rows - chunk.shape[0]))
    
    # Drop rows where speakers has probability less than 50%
    nr_rows = chunk.shape[0]
    prob_filter = pd.Series([(float(chunk.iloc[i].probas[0][1]) > thresh) for i in range(nr_rows)])
    prob_filter = pd.Series(prob_filter)
    chunk = chunk[prob_filter.values]
    print('- Dropped {} rows with speaker prob smaller than 50%;'.format(nr_rows - chunk.shape[0]))
    
    # Remove columns we don't care about
    chunk = chunk.drop(columns=['speaker', 'probas'])

    return chunk

start_of_all = timer()
read_from_file = 'data/quotes-2015.json-002.bz2'
write_to_file = 'data/clean-quotes-2015-updated.bz2'
with pd.read_json(read_from_file, lines=True, compression='bz2', chunksize=1_000_000) as df_reader:
    print('Started to process chunks...')
    i = 0
    for chunk in df_reader:
        print('\nProcessing new chunk...')
        start = timer()
        processed_chunk = clean_data(chunk)
        processed_chunk.to_csv(write_to_file, compression='bz2', mode='a', index=False)
        end = timer()
        print('Done processing and saving chunk after {:.3f} seconds.'.format(end - start))
        
end_of_all = timer()
print('\nDONE processing all chunks and saving as csv after {:.3f} minutes.'.format((end_of_all - start_of_all) / 60))
print('THE END!')

Started to process chunks...

Processing new chunk...
- Dropped 0 duplicate rows with same quoteID;
- Dropped 348775 rows with NaN speaker or quotation;
- Dropped 32474 rows with speaker prob smaller than 50%;
Done processing and saving chunk after -0.005 seconds.

Processing new chunk...
- Dropped 0 duplicate rows with same quoteID;
- Dropped 348985 rows with NaN speaker or quotation;
- Dropped 32299 rows with speaker prob smaller than 50%;
Done processing and saving chunk after -0.004 seconds.

Processing new chunk...
- Dropped 0 duplicate rows with same quoteID;
- Dropped 348778 rows with NaN speaker or quotation;
- Dropped 32428 rows with speaker prob smaller than 50%;
Done processing and saving chunk after -0.002 seconds.

Processing new chunk...
- Dropped 0 duplicate rows with same quoteID;
- Dropped 348514 rows with NaN speaker or quotation;
- Dropped 32595 rows with speaker prob smaller than 50%;
Done processing and saving chunk after -0.004 seconds.

Processing new chunk...
- 

#### Short discussion around pre-processing
Around one third of the original dataset has either a NaN quotation field, a NaN speaker, or a speaker with lower than 50% probability of having said that quote. Another one third of the original data is removed by the removal of the unwanted columns. Thus we are left with one third of the original dataset and still with full possibility of doing the wanted analysis

Elsemore, it seems like there are no duplicates of quote-IDs in the dataset.

## 2. Creating the wikidata-speakers dataset

In order to analyse who has a voice in the media we add a new column "n_quotes" to the wikidata-dataset which is how many times that person is present in the quotebank dataset from 2015 to 2020. This new dataset is saved in "speakers.bz2"

In [2]:
from pathlib import Path
import pandas as pd
from collections import Counter
from datetime import datetime
from nltk.corpus import wordnet as wn
import nltk
_ = nltk.download('wordnet')

datafolder = Path("data")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mathe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [14]:
DATA = {
    '2015': datafolder / Path("clean-quotes-2015.bz2"),
    '2016': datafolder / Path("clean-quotes-2016.bz2"),
    '2017': datafolder / Path("clean-quotes-2017.bz2"),
    '2018': datafolder / Path("clean-quotes-2018.bz2"),
    '2019': datafolder / Path("clean-quotes-2019.bz2"),
    '2020': datafolder / Path("clean-quotes-2020.bz2"),
}

ALL_YEARS = ["2015","2016","2017","2018","2019","2020"]

def load_data(year, sample=True, sample_size=100_000):
    if DATA[year].exists():
        df = pd.read_csv(DATA[year], compression='bz2')
        if sample:
            df = df.sample(n=sample_size, random_state=1)
        return df
    else:
        return None 
    
wikidata_speakers = pd.read_parquet('data/speaker_attributes.parquet')
wikidata_speakers.set_index('id', inplace=True)

In [15]:
wanted_qids =  []
for year in ALL_YEARS:
    print(f"Loading {year}...")
    df = load_data(year, sample=False)
    print(f"Done loading {year}")
    if df is not None:
        qids = df.qids.tolist()
        wanted_qids += [eval(qid)[0] for qid in qids if len(eval(qid)) == 1 and eval(qid)[0] in wikidata_speakers.index]
    else:
        Print(f"could not find file for year {year}")
        
speakers = wikidata_speakers.loc[wanted_qids]
speakers = speakers[~speakers.index.duplicated(keep='first')]

n_quotes_per_person = Counter(wanted_qids)
speakers['n_quotes'] = speakers.index.map(n_quotes_per_person)


Loading 2015...
Done loading 2015
Loading 2016...
Done loading 2016
Loading 2017...
Done loading 2017
Loading 2018...
Done loading 2018
Loading 2019...
Done loading 2019
Loading 2020...
Done loading 2020


In [23]:
speakers.to_csv("data/speakers.bz2",compression='bz2')

Now, let us clean this speaker dataset in the following ways:
- change date_of_birth by age
- drop columns that won't be used in our analysis
- only keep first item of the list in the columns with list-values
- drop the rows with any None values
- remove the authors whose age is not between 0 and 200, and whose n_quotes is smaller than 1
- construct new features using Wordnet 

In [20]:
speakers_features.head(2)

Unnamed: 0,nationality,gender,occupation,n_unique_quotes,n_quotes,age
Q270316,[Q30],[Q6581072],[Q82955],4094,21060,74.0
Q1253,[Q884],[Q6581097],"[Q82955, Q193391]",12746,94704,77.0


In [22]:
# Load dataframes
speakers = pd.read_json('data/speakers.json.bz2', compression='bz2')

# Change date_of_birth to age
ages = []
for date in speakers.date_of_birth.values:
    if not date is None:
        ages.append(datetime.now().year - int(date[0][1:5]))
    else:
        ages.append(None)

speakers['age'] = ages

# Drop uninteresting columns
speakers_features = speakers.drop(columns=['aliases', 'label', 'US_congress_bio_ID', 
                                           'lastrevid', 'type', 
                                           'candidacy', 'academic_degree',
                                           'date_of_birth', 'religion',
                                           'ethnic_group', 'party'
                                           ])

# Keep only first instance of occupation, nationality, gender
speakers_features_full = pd.DataFrame()
speakers_features_full['n_quotes'] = speakers_features['n_quotes']
speakers_features_full['n_unique_quotes'] = speakers_features['n_unique_quotes']
speakers_features_full['age'] = speakers_features['age']

for name, values in speakers_features.iteritems():
    if name not in ['n_quotes', 'age', 'n_unique_quotes']:
        updated_values = []
        for val in values:
            if not val is None:
                updated_values.append(val[0])
            else:
                updated_values.append(None)
        speakers_features_full[name] = updated_values

speakers_features_preprocessed_ = speakers_features_full.dropna(axis=0) # remove row if any column value is None

# Remove the authors whose age is not between 0 and 150, and whose n_quotes is smaller than 1
speakers_features_preprocessed = speakers_features_preprocessed_[(speakers_features_preprocessed_.age > 0) 
                                                               & (speakers_features_preprocessed_.age < 150) 
                                                               & (speakers_features_preprocessed_.n_quotes > 0)]
speakers_features_preprocessed.head(5)

Unnamed: 0,n_quotes,n_unique_quotes,age,nationality,gender,occupation
Q270316,21060,4094,74.0,Q30,Q6581072,Q82955
Q1253,94704,12746,77.0,Q884,Q6581097,Q82955
Q19874690,1207,205,62.0,Q408,Q6581097,Q39631
Q5271548,1587,573,83.0,Q30,Q6581072,Q1930187
Q2287947,132971,16482,28.0,Q30,Q6581097,Q11303721


Construct new occupation-related features using **Wordnet**. Every occupation has its semantic difference to the top 8 pre-defined occupations calculated with Wordnet. This way, there is no need to one-hot encode the roughly 2400 occupations when doing the clustering, but we can rather keep it down to 8. 

In [24]:
from qwikidata.linked_data_interface import get_entity_dict_from_api

def map_qid_occupations():
    """
    Maps the qids to occupation names. Change some occupation names so that they match the
    existing nouns in Wordnet.
    """
    qids = speakers_features_preprocessed.occupation.unique().tolist()
    qid_occupation_map = {}
    for qid in qids:
        if qid not in qid_occupation_map:
            entity = get_entity_dict_from_api(qid)['labels']
            if 'en' not in entity: continue
            occupation_name = entity['en']['value']
            if occupation_name == 'association football player': occupation_name = 'soccer_player'
            elif occupation_name == 'American football player': occupation_name = 'football_player'
            elif occupation_name == 'rugby union player': occupation_name = 'football_player'
            elif occupation_name == 'rugby league player': occupation_name = 'football_player'
            elif occupation_name == 'ice hockey player': occupation_name = 'athlete'
            elif occupation_name == 'boxer': occupation_name = 'athlete'
            elif occupation_name == 'golfer': occupation_name = 'athlete'
            elif occupation_name == 'business magnate': occupation_name = 'businessperson'
            elif occupation_name == 'business executive': occupation_name = 'businessperson'
            elif occupation_name == 'singer-songwriter': occupation_name = 'musician'
            elif occupation_name == 'composer': occupation_name = 'musician'
            elif occupation_name == 'film-director': occupation_name = 'film_director'
            elif occupation_name == 'film producer': occupation_name = 'film_director'
            elif occupation_name == 'film actor': occupation_name = 'actor'
            elif occupation_name == 'television actor': occupation_name = 'actor'
            elif occupation_name == 'comedian': occupation_name = 'actor'
            elif occupation_name == 'diplomat': occupation_name = 'politician'
            elif occupation_name == 'philosopher': occupation_name = 'researcher'
            elif occupation_name == 'economist': occupation_name = 'researcher'
            qid_occupation_map[qid] = occupation_name
    return qid_occupation_map
    
# UNCOMMENT ROW BELOW TO GET qid <-> occupation_name map
qid_occupation_map = map_qid_occupations()

top_occupations = ['politician', 'athlete', 'actor', 'lawyer', 'researcher', 'journalist', 'musician', 'businessperson']
import numpy as np

def get_wordnet_similarity_to(top_occupations, qid_occupation_map, thresh=0.7):
    """
    Calculates the similarities of all unique occupation qids to the pre-defined top occupations.
    Returns a dictionary where the keys are each unique occupation qid, and the value is the similarity
    to each top occupation. E.g.: qid_simlarities_map = {'Q1234': [similarity_to_politician, similarity_to_athlete, ..., similarity_to_businessperson], ...}.
    """
    unique_qids = speakers_features_preprocessed.occupation.unique().tolist()
    qid_similarities_map = {}
    for qid in unique_qids:
        if qid in qid_occupation_map and wn.synsets(qid_occupation_map[qid]): 
            dist_to_top_occupations = np.zeros((8, )) #{occupation: None for occupation in top_occupations}
            qid_synset_obj = wn.synsets(qid_occupation_map[qid])[0]
            for i, occ in enumerate(top_occupations):
                top_occ_synset_obj = wn.synsets(occ)[0]
                similarity_to_top_occupation = top_occ_synset_obj.wup_similarity(qid_synset_obj)
                dist_to_top_occupations[i] = similarity_to_top_occupation
            if np.max(dist_to_top_occupations) >= thresh:
                dist_to_top_occupations[dist_to_top_occupations != np.max(dist_to_top_occupations)] = 0
                dist_to_top_occupations[dist_to_top_occupations == np.max(dist_to_top_occupations)] = 1
            qid_similarities_map[qid] = dist_to_top_occupations / 5 # divide by 5 to compensate the fact that the semantic distance to the top occupations are lower than threshold, this means it is uncertain in which it best fits
    return qid_similarities_map
            
qid_similarities_map = get_wordnet_similarity_to(top_occupations, qid_occupation_map)

# Create new columns: politician_score, athlete_score, actor_score, lawyer_score, researcher_score, journalist_score, musician_score, businessperson_score
for top_occupation in top_occupations:
    speakers_features_preprocessed[f'{top_occupation}_score'] = None
speakers_features_preprocessed.head(20)

unique_occupation_qids = speakers_features_preprocessed.occupation.unique().tolist()
for occ_qid in qid_similarities_map:
    speakers_features_preprocessed.loc[speakers_features_preprocessed['occupation'] == occ_qid, 
                                      ['politician_score', 'athlete_score', 'actor_score', 'lawyer_score', 
                                       'researcher_score', 'journalist_score', 'musician_score', 'businessperson_score']] = qid_similarities_map[occ_qid]

speakers_features_preprocessed_final = speakers_features_preprocessed.drop(columns=['occupation']).dropna()
print(speakers_features_preprocessed_final.shape)
speakers_features_preprocessed_final.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  speakers_features_preprocessed[f'{top_occupation}_score'] = None
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, v, pi)


(330297, 13)


Unnamed: 0,n_quotes,n_unique_quotes,age,nationality,gender,politician_score,athlete_score,actor_score,lawyer_score,researcher_score,journalist_score,musician_score,businessperson_score
Q270316,21060,4094,74.0,Q30,Q6581072,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Q1253,94704,12746,77.0,Q884,Q6581097,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Q19874690,1207,205,62.0,Q408,Q6581097,0.114286,0.114286,0.109091,0.109091,0.114286,0.109091,0.109091,0.114286
Q5271548,1587,573,83.0,Q30,Q6581072,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0
Q2287947,132971,16482,28.0,Q30,Q6581097,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0


Save to JSON. This will be used all through our analysis: in clustering.ipynb, what.ipynb, and how.ipynb.

In [None]:
speakers_features_preprocessed_final.to_json('data/speakers_8_occupations', compression='bz2')