# _Who has a voice in the media?_

## 1. Pre-processing the dataset
In this study of "_Who has a voice in the media_", **the identity of each speaker and what they said are vital**. We therefore remove the following rows from the original dataset:
- rows where either the author or the quotation is NaN; 
- rows where the author has probability lower than 50%. 

Later, we also run a sanity check and **remove possible duplicate rows** that share the same quote ID, as we do not want to use the same quote more than once in our analyses.

Finally, to reduce the dataset further, we **remove the columns** that we will not use in our analysis: _quoteID_, _speaker_, _probas_, _urls_, _phase_ and _numOccurrences_.
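
The filtering steps above can be sketched in pandas as follows. This is a minimal illustration on a toy frame, not the cleaning code actually used for this notebook. The column names come from the list above; the helper name `clean_quotes`, the `min_proba` parameter, and the layout of `probas` (a list of `[name, probability]` pairs with the most likely speaker first) are assumptions.

```python
import pandas as pd

# Columns the analysis does not use (taken from the list above)
UNUSED_COLUMNS = ["quoteID", "speaker", "probas", "urls", "phase", "numOccurrences"]


def clean_quotes(df: pd.DataFrame, min_proba: float = 0.5) -> pd.DataFrame:
    """Hypothetical helper applying the filtering steps described above."""
    # 1. Drop rows with a missing speaker or quotation
    df = df.dropna(subset=["speaker", "quotation"])
    # 2. Keep rows whose most likely speaker reaches the probability threshold.
    #    Assumption: `probas` is a list of [name, probability] pairs, best first.
    top_proba = df["probas"].map(lambda pairs: float(pairs[0][1]))
    df = df[top_proba >= min_proba]
    # 3. Keep only the first row for each quote ID
    df = df.drop_duplicates(subset="quoteID", keep="first")
    # 4. Drop the columns that are not needed later (ignore any that are absent)
    return df.drop(columns=UNUSED_COLUMNS, errors="ignore")


toy = pd.DataFrame({
    "quoteID": ["a", "a", "b", "c", "d"],
    "quotation": ["hello", "hello", None, "maybe", "sure"],
    "speaker": ["Ann", "Ann", "Bob", "Cid", None],
    "qids": [["Q1"], ["Q1"], ["Q2"], ["Q3"], []],
    "probas": [[["Ann", "0.9"]]] * 2 + [[["Bob", "0.8"]], [["Cid", "0.4"]], [["None", "0.7"]]],
})
cleaned = clean_quotes(toy)
print(cleaned)
```

On the toy frame, only one of the two duplicated "hello" rows survives: row `b` has no quotation, row `c` falls below the threshold, and row `d` has no speaker.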

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import itertools
import time
from datetime import datetime
from timeit import default_timer as timer
from collections import Counter
from qwikidata.linked_data_interface import get_entity_dict_from_api
from qwikidata.entity import WikidataItem
from pathlib import Path

## 2. Initial analysis
Here, we take a first look at the content of the dataset. For instance, we plot the following information about the speakers:
- occupation;
- gender;
- age;
- ethnicity;
- top 20 speakers.

**Note: for practical reasons, the initial analysis in Milestone 2 uses a random sample of 100,000 quotations per year instead of the whole dataset. The code and the analysis remain essentially the same on the full data; they only take longer to run.**

In [10]:
datafolder = Path("data")

DATA = {
    '2015': 'data/clean-quotes-2015.bz2',
    '2016': 'data/clean-quotes-2016.bz2',
    '2017': 'data/clean-quotes-2017.bz2',
    '2018': 'data/clean-quotes-2018.bz2',
    '2019': 'data/clean-quotes-2019.bz2',
    '2020': 'data/clean-quotes-2020.bz2',
}

ALL_YEARS = ['2015', '2016', '2017', '2018', '2019', '2020']

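def load_data(year, sample=True, sample_size=100_000):
    """Load the cleaned quotes of `year`; return None if the file is missing."""
    year_file = Path(DATA[year])
    if not year_file.exists():
        return None
    df = pd.read_csv(year_file, compression='bz2')
    if sample:
        # Fixed seed so that the same sample is drawn on every run
        df = df.sample(n=sample_size, random_state=1)
    return df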
    
wikidata_speakers = pd.read_parquet('data/speaker_attributes.parquet')
wikidata_speakers.set_index('id', inplace=True)

In [18]:
df = pd.read_csv(DATA['2020'], compression='bz2')
df = df.sample(n=100_000, random_state=1)

In [79]:
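from ast import literal_eval

# Parse each stringified QID list once (literal_eval does not execute arbitrary code),
# then keep speakers with exactly one QID that is present in the Wikidata table
parsed_qids = [literal_eval(qid) for qid in df.qids.tolist()]
wanted_qids = [qid[0] for qid in parsed_qids if len(qid) == 1 and qid[0] in wikidata_speakers.index]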
speakers = wikidata_speakers.loc[wanted_qids]
speakers = speakers[~speakers.index.duplicated(keep='first')]

n_quotes_per_person = Counter(wanted_qids)
speakers['n_quotes'] = speakers.index.map(n_quotes_per_person)

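# Approximate age from the birth year: the first date looks like '+1943-05-06T00:00:00Z',
# so characters 1-4 hold the year
ages = []
for date in speakers.date_of_birth.values:
    if date is not None:
        ages.append(datetime.now().year - int(date[0][1:5]))
    else:
        ages.append(None)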

speakers['age'] = ages
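
The cell above depends on two format details: `qids` is stored as a stringified Python list, and Wikidata birth dates look like `+1943-05-06T00:00:00Z`. The sketch below isolates both steps so they can be checked on single values. The helper names `single_qid` and `age_from_wikidata_date` are hypothetical, and the reference year is passed explicitly (the notebook uses the current year) so that the result is reproducible.

```python
from ast import literal_eval
from typing import Optional


def single_qid(raw: str) -> Optional[str]:
    """Return the QID if the stringified list holds exactly one, else None."""
    qids = literal_eval(raw)  # safely parses "['Q42']" without running code
    return qids[0] if len(qids) == 1 else None


def age_from_wikidata_date(date: str, reference_year: int) -> int:
    """Age in whole calendar years from a '+YYYY-MM-DDT00:00:00Z' timestamp."""
    # Character 0 is the sign; characters 1-4 hold the year
    return reference_year - int(date[1:5])


print(single_qid("['Q7804031']"))    # Q7804031
print(single_qid("['Q1', 'Q2']"))    # None (ambiguous speaker)
print(age_from_wikidata_date("+1943-05-06T00:00:00Z", reference_year=2021))  # 78
```

Because only the year is used, the age can be off by one for speakers whose birthday has not yet occurred in the reference year.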


### Prepare dataset
- Keep only meaningful features
- Plot a histogram of each feature
- Find a way to mitigate the fact that features have different scales
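
For the last point, one common option (not necessarily the one this project will adopt) is to standardize the numeric features, optionally after a log transform for heavy-tailed counts. The sketch below uses only the standard library and the five `age` and `n_quotes` values from the preview further down; the helper name `standardize` is hypothetical.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


ages = [78, 61, 55, 51, 31]
n_quotes = [15, 1, 1, 3, 3]

ages_scaled = standardize(ages)
# Quote counts are compressed with log(1 + x) before standardizing,
# so that a few very frequent speakers do not dominate the scale
quotes_scaled = standardize([math.log1p(n) for n in n_quotes])

print([round(a, 2) for a in ages_scaled])
print([round(q, 2) for q in quotes_scaled])
```

After this step both features are centred on zero with unit spread, so neither dominates a distance computation merely because of its units.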

In [80]:
speakers.head()

id,aliases,date_of_birth,nationality,gender,lastrevid,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,label,candidacy,type,religion,n_quotes,age
Q7804031,,[+1943-05-06T00:00:00Z],[Q30],[Q6581097],1317248645,,,[Q10871364],,,Tim Murtaugh,,item,,15,78.0
Q5232425,,[+1960-01-02T00:00:00Z],[Q30],[Q6581097],1315350858,,,[Q2309784],,,David Clinton,,item,,1,61.0
Q5040552,,[+1966-10-25T00:00:00Z],[Q145],[Q6581097],1392517444,,,[Q3665646],,,Carl Miller,,item,,1,55.0
Q24034267,,[+1970-09-00T00:00:00Z],[Q145],[Q6581072],1393011906,,,"[Q82955, Q1781198, Q1631120]",[Q3243587],,Rachael Hamilton,,item,,3,51.0
Q4558770,,[+1990-11-13T00:00:00Z],[Q16],[Q6581097],1351687048,,,[Q11774891],,,Brenden Dillon,,item,,3,31.0


In [81]:
speakers_features = speakers.drop(columns=['aliases', 'lastrevid', 'date_of_birth', 'type', 'label'])
print(type(speakers_features.iloc[0].party))
speakers_features.head()

<class 'NoneType'>


id,nationality,gender,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,candidacy,religion,n_quotes,age
Q7804031,[Q30],[Q6581097],,,[Q10871364],,,,,15,78.0
Q5232425,[Q30],[Q6581097],,,[Q2309784],,,,,1,61.0
Q5040552,[Q145],[Q6581097],,,[Q3665646],,,,,1,55.0
Q24034267,[Q145],[Q6581072],,,"[Q82955, Q1781198, Q1631120]",[Q3243587],,,,3,51.0
Q4558770,[Q16],[Q6581097],,,[Q11774891],,,,,3,31.0


More pre-processing: keep only the first value of each list-valued feature and mark missing values as `Unknown`.

In [82]:
speakers_features_preprocessed = pd.DataFrame()
speakers_features_preprocessed['age'] = speakers_features['age']
speakers_features_preprocessed['n_quotes'] = speakers_features['n_quotes']

# `DataFrame.items()` replaces `iteritems()`, which was removed in pandas 2.0
for name, values in speakers_features.items():
    if name not in ['n_quotes', 'age']:
        updated_values = []
        for val in values:
            if val is not None:
                updated_values.append(val[0])
            else:
                updated_values.append('Unknown')

        # Assign inside the `if` so that 'age' and 'n_quotes' are not overwritten
        speakers_features_preprocessed[name] = updated_values

In [84]:
speakers_features_preprocessed.head()

id,age,n_quotes,nationality,gender,ethnic_group,US_congress_bio_ID,occupation,party,academic_degree,candidacy,religion
Q7804031,78.0,15,Q30,Q6581097,Unknown,Unknown,Q10871364,Unknown,Unknown,Unknown,Unknown
Q5232425,61.0,1,Q30,Q6581097,Unknown,Unknown,Q2309784,Unknown,Unknown,Unknown,Unknown
Q5040552,55.0,1,Q145,Q6581097,Unknown,Unknown,Q3665646,Unknown,Unknown,Unknown,Unknown
Q24034267,51.0,3,Q145,Q6581072,Unknown,Unknown,Q82955,Q3243587,Unknown,Unknown,Unknown
Q4558770,31.0,3,Q16,Q6581097,Unknown,Unknown,Q11774891,Unknown,Unknown,Unknown,Unknown


In [85]:
from kmodes.kmodes import KModes

def plot_sse(features, start=2, end=11):
    sse = []
    for k in range(start, end):
        # Fit k-modes with k clusters and record its clustering cost
        # (KModes exposes `cost_`; it has no `inertia_` attribute)
        kmodes = KModes(n_clusters=k, random_state=10).fit(features)
        sse.append({"k": k, "sse": kmodes.cost_})

    sse = pd.DataFrame(sse)
    # Plot the cost against the number of clusters
    plt.plot(sse.k, sse.sse)
    plt.xlabel("K")
    plt.ylabel("Clustering cost")

In [86]:
kmodes = KModes(n_clusters=2, random_state=10).fit(speakers_features)

AttributeError: 'bool' object has no attribute 'any'
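
The frame passed to `fit` here is `speakers_features`, whose cells still hold list-like values and `None` (see its preview above), whereas `speakers_features_preprocessed` was built to hold one scalar per cell. Whether this is the cause of the error is not confirmed here. A quick check like the sketch below can show which columns still contain non-scalar cells before clustering; `non_scalar_columns` is a hypothetical helper and the two small frames are made up for illustration.

```python
import numbers

import pandas as pd


def non_scalar_columns(df: pd.DataFrame) -> list:
    """Hypothetical helper: names of columns holding lists, arrays or None."""
    def is_scalar_cell(cell):
        return isinstance(cell, (str, numbers.Number))

    return [name for name in df.columns if not df[name].map(is_scalar_cell).all()]


raw = pd.DataFrame({"nationality": [["Q30"], None], "n_quotes": [15, 1]})
flat = pd.DataFrame({"nationality": ["Q30", "Unknown"], "n_quotes": [15, 1]})

print(non_scalar_columns(raw))   # ['nationality']
print(non_scalar_columns(flat))  # []
```

An empty result means every cell is a string or a number, which is the shape a categorical clustering routine generally expects.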