## Initial Data Analysis

In [21]:
import bz2
import json
import pandas as pd
import csv

In [6]:
# Load the concatenated quotes for the top 100 politicians
with open('./data/politician-quotes-concatenated_1636411537251.json', 'r') as f:
    data = json.load(f)

After loading the data, we extract each politician's Wikidata ID (`qid`) and concatenated quotes, and write them to `input_data1.csv` for the LIWC personality analysis.

In [None]:
with open('input_data1.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["qid", "quote"])
    for qid, all_value in data.items():
        quote = all_value["quotations"]
        writer.writerow([qid, quote])

### LIWC Analysis
We run `input_data1.csv` through the LIWC software (Academic Version), which returns, for each concatenated quote, a set of features over LIWC categories such as pronouns and articles. We save the result as `output_data1.csv` and load it into the notebook.

In [10]:
liwc = pd.read_csv('output_data1.csv')

In [13]:
# Visualise a random sample
liwc.sample()

Unnamed: 0,Source (A),Source (B),WC,WPS,Sixltr,Dic,Pronoun,I,We,Self,...,Comma,Colon,SemiC,QMark,Exclam,Dash,Quote,Apostro,Parenth,OtherP
12,Q1077594,The NCAA and collegiate sports more broadly n...,825,14.73,28.36,68.36,7.64,1.45,1.7,3.15,...,2.79,0.12,0.0,0.0,0.0,2.06,0.0,1.45,0.0,0.12


### Personality Analysis Based on LIWC Results
According to a large-scale analysis by Tal Yarkoni (University of Colorado at Boulder, 2010), several LIWC categories correlate significantly with the Big Five personality traits. Based on this, we define the `predict_personality()` function, which keeps only the correlations that reach a chosen significance level in that research and uses them to score each politician.

In [None]:
def predict_personality(liwc_data: pd.DataFrame, sig_level: int = 1) -> pd.DataFrame:
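    """Predicts personality based on the LIWC metrics

    Args:
        liwc_data (pd.DataFrame): LIWC metrics
        sig_level (int, optional): Minimum significance level a correlation must reach to be used.
            Defaults to 1 (level 3 corresponds to the 0.001 level).

    Returns:
        pd.DataFrame: Personality scores
    """
    # Correlations between LIWC categories (rows) and personality traits (columns),
    # plus the significance level recorded for each correlation.
    liwc_ocean_data = pd.read_csv('data/LIWC_OCEAN.csv', index_col=0)
    liwc_ocean_sig_data = pd.read_csv('data/LIWC_OCEAN_Significance.csv', index_col=0)
    # LIWC_OCEAN_MAP (defined in helpers) maps LIWC output column names to the row names above.
    liwc_data = liwc_data[list(LIWC_OCEAN_MAP.keys())].rename(columns=LIWC_OCEAN_MAP)
    # Normalise each row so the selected categories sum to 1.
    liwc_data = liwc_data.div(liwc_data.sum(axis=1), axis=0)
    assert (liwc_ocean_data.index == liwc_data.columns).all()
    # Zero out correlations below the requested significance level.
    liwc_ocean_data_with_sig = liwc_ocean_data * (liwc_ocean_sig_data >= sig_level).astype(int)
    # Weighted sum of the LIWC shares for each trait.
    return liwc_data.dot(liwc_ocean_data_with_sig)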
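To make the scoring step concrete, here is a minimal sketch on toy data. The category names, trait names, correlation values, and significance levels below are invented for illustration and are not taken from `LIWC_OCEAN.csv`; the `score` helper is a hypothetical stand-in that mirrors the three steps of `predict_personality()` (row normalisation, significance masking, dot product).

```python
import pandas as pd

# Hypothetical toy inputs: two LIWC categories, two traits.
liwc_toy = pd.DataFrame(
    {"pronoun": [6.0, 2.0], "article": [2.0, 2.0]},
    index=["speaker_a", "speaker_b"],
)
# Correlation of each LIWC category (rows) with each trait (columns).
corr_toy = pd.DataFrame(
    {"extraversion": [0.10, -0.20], "openness": [-0.30, 0.40]},
    index=["pronoun", "article"],
)
# Significance level recorded for each correlation (higher = stricter).
sig_toy = pd.DataFrame(
    {"extraversion": [3, 1], "openness": [0, 2]},
    index=["pronoun", "article"],
)


def score(liwc, corr, sig, sig_level=1):
    # Normalise each row so the category shares sum to 1.
    shares = liwc.div(liwc.sum(axis=1), axis=0)
    # Keep only correlations at or above the requested significance level.
    masked = corr * (sig >= sig_level).astype(int)
    # Weighted sum of the shares for each trait.
    return shares.dot(masked)


scores = score(liwc_toy, corr_toy, sig_toy, sig_level=2)
print(scores)
```

With `sig_level=2`, only the pronoun/extraversion and article/openness correlations survive the mask, so each trait score is driven by a single category share.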

In [14]:
import helpers
# Call predict_personality from the helpers module, which also contains the LIWC column-name cleaning map.
bigfive = helpers.predict_personality(liwc)

In [16]:
bigfive.sample()

Unnamed: 0,neuroticism,anxiety,hostility,depression,self_consciousness,immoderation,vulnerability,extraversion,friendliness,gregariousness,...,cooperation,modesty,sympathy,conscientiousness,self_efficacy,orderliness,dutifulness,achievement_striving,self_discipline,cautiousness
117,0.920334,-0.038939,2.156352,-0.049568,0.14622,-0.185656,-0.397437,2.485448,3.972683,3.314592,...,3.109864,0.954405,2.603854,-1.471541,2.190921,0.337737,0.779478,-1.747988,-1.446806,0.423959


In [17]:
# Concatenate the LIWC output with the Big Five scores column-wise
df1 = pd.concat([liwc, bigfive], axis=1)
df1.sample()

Unnamed: 0,Source (A),Source (B),WC,WPS,Sixltr,Dic,Pronoun,I,We,Self,...,cooperation,modesty,sympathy,conscientiousness,self_efficacy,orderliness,dutifulness,achievement_striving,self_discipline,cautiousness
51,Q934898,And I see my father. My father was just wipin...,969,17.3,10.42,85.86,15.58,4.02,3.82,7.84,...,1.246182,1.163683,1.270105,-1.796902,0.643346,0.97126,0.736629,-1.169705,-1.303115,-0.2487


In [18]:
# Load the top 100 politicians' data (queried from Wikidata) and merge it with the current dataset.
with open('./data/top100_politicians_by_party.json', 'r') as f:
    data_top100 = json.load(f)

dem_df = pd.DataFrame(data_top100["dem"])
rep_df = pd.DataFrame(data_top100["rep"])
dem_df['party'] = "dem"
rep_df['party'] = "rep"
politician_wiki = pd.concat([dem_df, rep_df])
df2 = df1.merge(politician_wiki, left_on='Source (A)', right_on='item', how="left")
df2.sample()

Unnamed: 0,Source (A),Source (B),WC,WPS,Sixltr,Dic,Pronoun,I,We,Self,...,citizenshipLabel,languageLabel,religionLabel,ethnicLabel,degreeLabel,dateOfBirth,placeOfBirthLabel,memberOfParty,memberOfPartyLabel,party
78,Q16215328,A lot of good takeaways from this weekend. A ...,873,19.84,20.73,70.9,7.22,0.34,2.63,2.98,...,United States of America,,,,,1975-01-01T00:00:00Z,,http://www.wikidata.org/entity/Q29552,Democratic Party,dem
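Because the merge uses `how="left"`, any politician whose ID has no match in the Wikidata table keeps its LIWC row but gets empty metadata. A small sketch of how such rows could be detected with the `indicator` option of `merge` follows; the two frames are hypothetical stand-ins for `df1` and `politician_wiki`, with made-up IDs.

```python
import pandas as pd

# Hypothetical stand-ins for df1 and politician_wiki.
scores_toy = pd.DataFrame({"Source (A)": ["Q1", "Q2", "Q3"], "WC": [825, 969, 873]})
wiki_toy = pd.DataFrame({"item": ["Q1", "Q3"], "party": ["dem", "rep"]})

# indicator=True adds a "_merge" column describing where each row came from.
merged = scores_toy.merge(
    wiki_toy, left_on="Source (A)", right_on="item", how="left", indicator=True
)

# Rows present only on the left side had no matching Wikidata entry.
unmatched = merged.loc[merged["_merge"] == "left_only", "Source (A)"].tolist()
print(unmatched)
```

An empty list here would mean every politician in the LIWC output was matched to a party.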


## Basic Analysis

For this part, we use R for its visualisation tools and embed the output as HTML, produced by knitting the R Markdown document.

### Comparing the Personality of Politicians from the Democratic and Republican Parties

We compare each characteristic between Democratic and Republican politicians using the Wilcoxon rank-sum test, which tests whether top politicians from the two parties have equal medians for each attribute.

In [20]:
from IPython.display import IFrame
IFrame(src='./section-1.html', width=700, height=600)

The current results show that the only significant differences are in artistic_interests and emotionality: Democratic politicians have higher average artistic_interests and emotionality than Republican politicians.
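The test itself is run in R, but the same kind of comparison can be sketched in Python. The example below assumes SciPy is available and uses `scipy.stats.mannwhitneyu`, since the Mann-Whitney U test is equivalent to the Wilcoxon rank-sum test. The scores are invented toy values, and `compare_parties` is a hypothetical helper, not part of the project's code.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical toy scores; the real values would come from df2.
toy = pd.DataFrame({
    "party": ["dem"] * 5 + ["rep"] * 5,
    "emotionality": [1.2, 0.9, 1.5, 1.1, 1.4, 0.3, 0.5, 0.2, 0.6, 0.4],
})


def compare_parties(df, trait):
    # Split the trait scores by party and drop missing values.
    dem = df.loc[df["party"] == "dem", trait].dropna()
    rep = df.loc[df["party"] == "rep", trait].dropna()
    # Two-sided rank-based test of whether the two groups differ.
    stat, p_value = mannwhitneyu(dem, rep, alternative="two-sided")
    return stat, p_value


stat, p_value = compare_parties(toy, "emotionality")
print(f"U = {stat}, p = {p_value:.4f}")
```

Looping `compare_parties` over every trait column would give one p-value per characteristic, which is the shape of the comparison described above.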

### Comparing Main Politicians in the US Based on Their Personality

We use an interactive heatmap in R to produce the following plot. By clicking on a specific cell, you can see the value of each attribute for that person and compare it to other politicians. A blue cell indicates a positive value for that attribute; a red cell indicates a negative value.

Moreover, by selecting several cells, you can zoom in to see the differences in detail. To return to the original view, double-click the plot.

In [24]:
IFrame(src='./section-2.html', width=700, height=600)

In the plot above, politicians with similar personalities are placed close together. For example,
- Barack Obama and George W. Bush have similar personalities based on their quotations.
- Donald Trump and Lindsey Graham are quite close on most personality scales.
- Obama and Trump seem to have opposite personalities.
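One way to make "similar" concrete is to compute a distance between the personality vectors of each pair of politicians. The sketch below assumes Euclidean distance, which may differ from the ordering used by the R heatmap; the names and scores are placeholders, not values from the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical trait scores; names and values are illustrative only.
traits = pd.DataFrame(
    {"extraversion": [1.0, 1.1, -0.9], "modesty": [0.5, 0.4, -0.6]},
    index=["politician_a", "politician_b", "politician_c"],
)

values = traits.to_numpy()
# Pairwise differences via broadcasting: shape (n, n, number_of_traits).
diff = values[:, None, :] - values[None, :, :]
dist = pd.DataFrame(
    np.sqrt((diff ** 2).sum(axis=2)), index=traits.index, columns=traits.index
)

# Hide the diagonal (distance to oneself), then pick the closest other politician.
nearest = dist.mask(np.eye(len(dist), dtype=bool)).idxmin(axis=1)
print(nearest)
```

Small distances correspond to pairs that would sit next to each other in a clustered heatmap, while the largest distances flag pairs with opposite profiles.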