# _Who has a voice in the media?_

## 1. Pre-processing the dataset
In this study of "_Who has a voice in the media?_", **the speaker's identity and what they said are vital**. We therefore remove the following rows from the original dataset:
- rows where either the author or the quotation is NaN;
- rows where the probability of the attributed author is lower than 50%.

Next, as a sanity check, we **remove possible duplicate rows** that share the same quote ID, since we obviously do not want to use the same quote more than once in our analyses.

Finally, to reduce the dataset further, we **remove the columns** that we will not use in our analysis: _quoteID_, _speaker_, _probas_, _urls_, _phase_ and _numOccurrences_.
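
The cleaning code itself is not shown in this notebook, so the following is only a minimal sketch of the four steps described above, run on a toy table. The helper names `clean_quotes` and `toy_row` are hypothetical, and the sketch assumes that the first entry of `probas` is the `[name, probability]` pair of the attributed speaker.

```python
import pandas as pd

UNUSED_COLUMNS = ['quoteID', 'speaker', 'probas', 'urls', 'phase', 'numOccurrences']

def clean_quotes(df, min_proba=0.5):
    """Hypothetical helper that applies the four cleaning steps in order."""
    # 1. Drop rows without a speaker or without a quotation
    df = df.dropna(subset=['speaker', 'quotation'])
    # 2. Keep rows whose attributed speaker has probability >= min_proba
    #    (assumption: probas[0] is the [name, probability] pair of that speaker)
    top_proba = df['probas'].map(lambda pairs: float(pairs[0][1]))
    df = df[top_proba >= min_proba]
    # 3. Keep only the first row for each quote ID
    df = df.drop_duplicates(subset='quoteID', keep='first')
    # 4. Drop the columns that the analysis does not use
    return df.drop(columns=UNUSED_COLUMNS)

def toy_row(quote_id, quotation, speaker, proba):
    """Build one made-up row with the column names listed above."""
    return {'quoteID': quote_id, 'quotation': quotation, 'speaker': speaker,
            'qids': ['Q1'], 'date': '2020-01-01', 'numOccurrences': 1,
            'probas': [[speaker, str(proba)]], 'urls': [], 'phase': 'E'}

toy = pd.DataFrame([
    toy_row('q1', 'A', 'X', 0.9),
    toy_row('q1', 'A', 'X', 0.9),   # duplicate quote ID
    toy_row('q2', 'B', None, 0.7),  # missing speaker
    toy_row('q3', 'C', 'Y', 0.4),   # probability below 50%
    toy_row('q4', None, 'Z', 0.8),  # missing quotation
])
cleaned = clean_quotes(toy)
print(cleaned)
```

Only the first `q1` row survives in this toy example, with the columns `quotation`, `qids` and `date`.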

In [95]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from collections import Counter
from pathlib import Path

## 2. Initial analysis
Here, we take a first look at the content of the dataset. For instance, we plot the following information about the speakers:
- occupation;
- gender;
- age;
- ethnicity;
- top 20 speakers.

**Note: for practical reasons, the initial analysis in Milestone 2 uses a random sample of 100,000 quotations per year instead of the whole dataset. The code and the analysis stay essentially the same for the full data; they only take longer to run.**

In [96]:
datafolder = Path("data")

DATA = {
    '2015': 'data/clean-quotes-2015.bz2',
    '2016': 'data/clean-quotes-2016.bz2',
    '2017': 'data/clean-quotes-2017.bz2',
    '2018': 'data/clean-quotes-2018.bz2',
    '2019': 'data/clean-quotes-2019.bz2',
    '2020': 'data/clean-quotes-2020.bz2',
}

ALL_YEARS = ['2015', '2016', '2017', '2018', '2019', '2020']

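def load_data(year, sample=True, sample_size=100_000):
    """Load the cleaned quotes of one year; return None if the file is missing."""
    year_file = Path(DATA[year])
    if not year_file.exists():
        return None
    df = pd.read_csv(year_file, compression='bz2')
    if sample:
        # Fixed seed so that the sample is reproducible
        df = df.sample(n=sample_size, random_state=1)
    return df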
    
wikidata_speakers = pd.read_parquet('data/speaker_attributes.parquet')
wikidata_speakers.set_index('id', inplace=True)

In [97]:
df = pd.read_csv(DATA['2020'], compression='bz2')
df = df.sample(n=100_000, random_state=1)

In [98]:
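import ast

# qids are stored as stringified lists; parse each one once and keep only
# quotes with exactly one candidate QID that is present in the Wikidata table
parsed_qids = [ast.literal_eval(qid) for qid in df.qids]
wanted_qids = [qid[0] for qid in parsed_qids if len(qid) == 1 and qid[0] in wikidata_speakers.index]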
speakers = wikidata_speakers.loc[wanted_qids]
speakers = speakers[~speakers.index.duplicated(keep='first')]

n_quotes_per_person = Counter(wanted_qids)
speakers['n_quotes'] = speakers.index.map(n_quotes_per_person)

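# Approximate age: current year minus the year of the first listed date of birth
# (dates look like '+1943-05-06T00:00:00Z', so characters 1 to 4 hold the year)
ages = []
for date in speakers.date_of_birth.values:
    if date is not None:
        ages.append(datetime.now().year - int(date[0][1:5]))
    else:
        ages.append(None)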

speakers['age'] = ages
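
The age computation above depends on the year in which the notebook is run. As a hedged alternative, the sketch below extracts the birth year with a regular expression and takes the reference year as an explicit argument. The names `birth_year` and `age_at` are hypothetical, and the sketch assumes dates of the form `+YYYY-MM-DDT00:00:00Z`, as in the table output of this notebook. Slicing out the year instead of parsing the full date also copes with incomplete dates such as `+1970-09-00T00:00:00Z`.

```python
import re

# Matches the four-digit year of a date such as '+1943-05-06T00:00:00Z'
YEAR_PATTERN = re.compile(r'^\+(\d{4})-')

def birth_year(date_string):
    """Return the year as an int, or None if the string has another format."""
    match = YEAR_PATTERN.match(date_string)
    return int(match.group(1)) if match else None

def age_at(date_string, reference_year):
    """Age in whole years at the end of reference_year, or None if unknown."""
    year = birth_year(date_string)
    return None if year is None else reference_year - year

print(age_at('+1943-05-06T00:00:00Z', 2021))
print(age_at('+1970-09-00T00:00:00Z', 2021))
print(age_at('unknown', 2021))
```

With 2021 as the reference year, the first two calls give 78 and 51, which matches the `age` values shown for these birth dates in the table output.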


### Prepare dataset
- Only keep meaningful features
- Plot histograms of the different features
- Find a way to mitigate the fact that the features have different scales
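
One possible way to address the last point, given here only as a sketch and not as the method used in this notebook, is to standardize the numerical features. The helper `scale_numeric` is hypothetical; it optionally log-transforms skewed counts such as `n_quotes` and then rescales each column to zero mean and unit variance. The toy values are the `n_quotes` and `age` entries of the five speakers shown in the tables of this notebook.

```python
import numpy as np
import pandas as pd

def scale_numeric(df, columns=('n_quotes', 'age'), log_columns=('n_quotes',)):
    """Return a copy of df whose numerical columns are standardized."""
    scaled = df.copy()
    for col in columns:
        values = scaled[col].astype(float)
        if col in log_columns:
            # log(1 + x) dampens the long right tail of count data
            values = np.log1p(values)
        # z-score: subtract the mean and divide by the standard deviation
        scaled[col] = (values - values.mean()) / values.std(ddof=0)
    return scaled

toy = pd.DataFrame({'n_quotes': [15, 1, 1, 3, 3],
                    'age': [78.0, 61.0, 55.0, 51.0, 31.0]})
scaled = scale_numeric(toy)
print(scaled.round(3))
```

After this step both columns vary on a comparable scale, so neither dominates a distance computed on the numerical features.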

In [135]:
speakers.head(5)

| id | aliases | date_of_birth | nationality | gender | lastrevid | ethnic_group | US_congress_bio_ID | occupation | party | academic_degree | label | candidacy | type | religion | n_quotes | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Q7804031 | | [+1943-05-06T00:00:00Z] | [Q30] | [Q6581097] | 1317248645 | | | [Q10871364] | | | Tim Murtaugh | | item | | 15 | 78.0 |
| Q5232425 | | [+1960-01-02T00:00:00Z] | [Q30] | [Q6581097] | 1315350858 | | | [Q2309784] | | | David Clinton | | item | | 1 | 61.0 |
| Q5040552 | | [+1966-10-25T00:00:00Z] | [Q145] | [Q6581097] | 1392517444 | | | [Q3665646] | | | Carl Miller | | item | | 1 | 55.0 |
| Q24034267 | | [+1970-09-00T00:00:00Z] | [Q145] | [Q6581072] | 1393011906 | | | [Q82955, Q1781198, Q1631120] | [Q3243587] | | Rachael Hamilton | | item | | 3 | 51.0 |
| Q4558770 | | [+1990-11-13T00:00:00Z] | [Q16] | [Q6581097] | 1351687048 | | | [Q11774891] | | | Brenden Dillon | | item | | 3 | 31.0 |


In [144]:
# Share of speakers with a non-missing value in each attribute
for column in ['US_congress_bio_ID', 'candidacy', 'ethnic_group', 'religion',
               'academic_degree', 'party', 'nationality', 'gender']:
    share = 100 * speakers[column].notna().mean()
    print(f'The percentage of speakers with {column} is {share:.3f}%')


The percentage of speakers with US_congress_bio_ID is 1.337%
The percentage of speakers with candidacy is 2.956%
The percentage of speakers with ethnic_group is 5.188%
The percentage of speakers with religion is 6.742%
The percentage of speakers with academic_degree is 1.340%
The percentage of speakers with party is 17.377%
The percentage of speakers with nationality is 80.010%
The percentage of speakers with gender is 97.814%


Drop the columns
- that are particular to the speaker and don't add any value to the clustering (aliases, label, US_congress_bio_ID);
- that don't add value to the clustering (lastrevid, type);
- that contain too little data to draw any conclusions (candidacy - 3.0%, academic_degree - 1.3%);
- that were used to produce the new age column (date_of_birth).

In [145]:
speakers_features = speakers.drop(columns=['aliases', 'label', 'US_congress_bio_ID', 
                                           'lastrevid', 'type', 
                                           'candidacy', 'academic_degree', 
                                           'date_of_birth'])
speakers_features.head()

| id | nationality | gender | ethnic_group | occupation | party | religion | n_quotes | age |
|---|---|---|---|---|---|---|---|---|
| Q7804031 | [Q30] | [Q6581097] | | [Q10871364] | | | 15 | 78.0 |
| Q5232425 | [Q30] | [Q6581097] | | [Q2309784] | | | 1 | 61.0 |
| Q5040552 | [Q145] | [Q6581097] | | [Q3665646] | | | 1 | 55.0 |
| Q24034267 | [Q145] | [Q6581072] | | [Q82955, Q1781198, Q1631120] | [Q3243587] | | 3 | 51.0 |
| Q4558770 | [Q16] | [Q6581097] | | [Q11774891] | | | 3 | 31.0 |


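More pre-processing: replace the `None` values with 'Unknown', otherwise KPrototypes does not run. This way, 'Unknown' becomes a category of its own in each categorical variable. For list-valued attributes (for example several occupations), we keep only the first entry.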

In [216]:
speakers_features.ethnic_group.isna().sum()

31052

In [245]:
speakers_features_preprocessed = pd.DataFrame()
speakers_features_preprocessed['n_quotes'] = speakers_features['n_quotes']
speakers_features_preprocessed['age'] = speakers_features['age'].fillna(speakers_features['age'].median())

for name, values in speakers_features.items():
    # For categorical columns, keep only the first entry of each list
    # (e.g. the first occupation or nationality) and replace None with 'Unknown'
    if name not in ['n_quotes', 'age']:
        updated_values = []
        for val in values:
            if val is not None:
                updated_values.append(val[0])
            else:
                updated_values.append('Unknown')
        speakers_features_preprocessed[name] = updated_values
    

In [247]:
speakers_features_preprocessed.head()

| id | n_quotes | age | nationality | gender | ethnic_group | occupation | party | religion |
|---|---|---|---|---|---|---|---|---|
| Q7804031 | 15 | 78.0 | Q30 | Q6581097 | Unknown | Q10871364 | Unknown | Unknown |
| Q5232425 | 1 | 61.0 | Q30 | Q6581097 | Unknown | Q2309784 | Unknown | Unknown |
| Q5040552 | 1 | 55.0 | Q145 | Q6581097 | Unknown | Q3665646 | Unknown | Unknown |
| Q24034267 | 3 | 51.0 | Q145 | Q6581072 | Unknown | Q82955 | Q3243587 | Unknown |
| Q4558770 | 3 | 31.0 | Q16 | Q6581097 | Unknown | Q11774891 | Unknown | Unknown |


In [253]:
assert(speakers_features_preprocessed.n_quotes.isna().sum() == 0)
assert(speakers_features_preprocessed.age.isna().sum() == 0)
assert(speakers_features_preprocessed.nationality.isna().sum() == 0)
assert(speakers_features_preprocessed.gender.isna().sum() == 0)
assert(speakers_features_preprocessed.ethnic_group.isna().sum() == 0)
assert(speakers_features_preprocessed.occupation.isna().sum() == 0)
assert(speakers_features_preprocessed.party.isna().sum() == 0)
assert(speakers_features_preprocessed.religion.isna().sum() == 0)

In [256]:
from kmodes.kprototypes import KPrototypes

def plot_sse(features, start=5, end=7):
    """Plot the K-Prototypes clustering cost for each k in range(start, end)."""
    sse = []
    for k in range(start, end):
        # Fit K-Prototypes with k clusters; columns 2 to 7 are the categorical features
        kproto = KPrototypes(n_clusters=k, random_state=10).fit(features, categorical=[2, 3, 4, 5, 6, 7])
        sse.append({"k": k, "sse": kproto.cost_})

    sse = pd.DataFrame(sse)
    # Plot the cost against the number of clusters
    plt.plot(sse.k, sse.sse)
    plt.xlabel("K")
    plt.ylabel("Sum of Squared Errors")
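
To make the plotted cost easier to interpret: in the k-prototypes algorithm, the dissimilarity between a point and a prototype is the squared Euclidean distance on the numerical features plus a weight gamma times the number of mismatching categorical features. The sketch below is a plain-Python illustration of that definition, not the `kmodes` implementation; the function name `mixed_dissimilarity` and the value of `gamma` are assumptions chosen for the example, and the feature values are taken from two speakers in the table above.

```python
def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma):
    """Squared Euclidean distance plus gamma times the categorical mismatches."""
    numeric_part = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    categorical_part = sum(x != y for x, y in zip(a_cat, b_cat))
    return numeric_part + gamma * categorical_part

# Speaker Q7804031: n_quotes=15, age=78, nationality, gender, ethnic_group
speaker_num, speaker_cat = [15, 78.0], ['Q30', 'Q6581097', 'Unknown']
# Hypothetical prototype, using the values of speaker Q24034267
proto_num, proto_cat = [3, 51.0], ['Q145', 'Q6581072', 'Unknown']

distance = mixed_dissimilarity(speaker_num, speaker_cat, proto_num, proto_cat, gamma=0.5)
print(distance)  # (15-3)**2 + (78-51)**2 + 0.5 * 2 mismatches = 874.0
```

In this example the unscaled numerical part (873) dwarfs the categorical part (1), which illustrates why the difference in feature scales matters for the clustering.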

In [258]:
# plot_sse(speakers_features_preprocessed)
k = 5  # placeholder value (assumption): pick k from the elbow of the plot_sse curve
kproto = KPrototypes(n_clusters=k, random_state=10).fit(speakers_features_preprocessed, categorical=[2, 3, 4, 5, 6, 7])