# Part 1: Data Preprocessing and Exploration
This final project is based around the [Open Psychometrics "Which Character" Quiz](https://openpsychometrics.org/tests/characters/documentation/). The quiz follows a standard internet format: respondents assess themselves on a series of opposed traits (e.g., are you more selfish or altruistic?), and at the end of the quiz, they are presented with their most similar fictional character (e.g., Batman or Buffy the Vampire Slayer). After the quiz has been completed, users are invited to rate the personalities of the characters themselves (e.g., is Batman more altruistic or selfish?). Open Psychometrics researchers have aggregated the ratings of 2,125 characters across 500 dimensions on a 100-point scale. The aggregate ratings are based on 3,386,031 user responses. Our work is inspired by that of the [Vermont Computational Story Lab](https://compstorylab.org/archetypometrics/).

In this first notebook, we'll import, clean, and prepare the data for exploration. We'll conduct light exploration to assess the contents of the data. The dataset `characters-aggregated-scores.csv` was downloaded from [Open Psychometrics](https://openpsychometrics.org/tests/characters/data/). Supplemental datasets (to provide variable and character names) were developed based on the online documentation, which is available here as an `.html` file in the `data` folder. _Note: If downloading an updated version of the dataset, the data formats, character names, and variables might have changed._

### Imports

In [1]:
# imports
import pandas as pd
import numpy as np

### Reading the Data

In [34]:
data = pd.read_csv("data/characters-aggregated-scores.csv", sep=",")
var_key = pd.read_csv("data/variable-key.csv")
char_key = pd.read_csv("data/character-key.csv")
data.head()

Unnamed: 0,id,BAP1,BAP2,BAP3,BAP4,BAP5,BAP6,BAP7,BAP8,BAP9,...,BAP491,BAP492,BAP493,BAP494,BAP495,BAP496,BAP497,BAP498,BAP499,BAP500
0,HML/1,62.4,69.8,92.6,31.9,61.2,53.5,28.8,44.0,63.9,...,27.5,78.8,40.5,53.4,77.4,14.0,56.3,51.4,87.4,8.2
1,HML/2,79.1,62.2,68.5,78.1,36.9,40.3,42.6,40.4,23.3,...,42.8,23.9,84.9,73.7,49.0,73.7,21.1,71.0,26.3,63.3
2,HML/3,83.2,85.3,69.4,21.8,39.1,35.8,49.9,16.0,59.3,...,11.3,29.7,50.7,78.6,68.2,20.3,31.6,48.7,74.3,55.0
3,HML/4,72.5,65.0,67.1,28.2,66.3,47.9,30.4,18.1,34.4,...,31.6,22.2,75.7,60.4,79.0,55.9,25.5,48.2,80.1,49.6
4,HML/5,40.7,48.1,81.8,90.0,52.6,59.3,41.1,73.9,43.0,...,35.6,42.4,75.0,61.7,61.3,15.1,57.3,54.7,90.3,24.9


Each column represents one dimension on which the characters were rated, called a "binary adjective pair" (BAP). We'll make these more legible by renaming the columns with the labels in `var_key` and by merging in `char_key` to provide actual names for the characters. We then drop a few specific columns: the authors used emojis for some of the BAPs, which are hard to interpret and cause problems with visualization, so these have been labelled "INVALID" and are removed. In addition, the authors accidentally included the "hard-soft" pair twice, so only the first occurrence is kept.

In [35]:
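# merge ratings with character information (character-key columns come first)
data = pd.merge(char_key, data, on="id")
# rename columns by position: three key columns, then one label per BAP from var_key
data.columns = ["id", "character", "source"] + var_key["scale"].to_list()
# drop the emoji-based BAPs, which are labelled "INVALID"
data = data.loc[:, ~data.columns.str.startswith("INVALID")]
# drop repeated column names, keeping the first occurrence (the duplicated "hard-soft" pair)
data = data.loc[:, ~data.columns.duplicated()]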
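The rename above is positional, so it only works if the merged frame has exactly three key columns followed by one column per entry in `var_key`. The sketch below walks through the same merge, rename, and drop steps on a tiny made-up example and adds a length check before renaming. All values, and the names `ratings`, `toy_char_key`, `toy_var_key`, and `merged`, are hypothetical stand-ins rather than the real files.

```python
import pandas as pd

# Toy stand-ins for the real files (made-up values, not from the dataset)
ratings = pd.DataFrame({
    "id": ["X/1", "X/2"],
    "BAP1": [10.0, 90.0],
    "BAP2": [55.0, 45.0],
    "BAP3": [20.0, 80.0],
    "BAP4": [21.0, 79.0],
})
toy_char_key = pd.DataFrame({
    "id": ["X/1", "X/2"],
    "character": ["A", "B"],
    "source": ["Show", "Show"],
})
toy_var_key = pd.DataFrame({"scale": ["hard_soft", "INVALID1", "shy_bold", "hard_soft"]})

merged = pd.merge(toy_char_key, ratings, on="id")
new_names = ["id", "character", "source"] + toy_var_key["scale"].to_list()

# Guard against a silent mismatch between the key file and the ratings file
assert len(new_names) == merged.shape[1], "var_key does not match the number of BAP columns"
merged.columns = new_names

# Same two filters as in the notebook cell
merged = merged.loc[:, ~merged.columns.str.startswith("INVALID")]
merged = merged.loc[:, ~merged.columns.duplicated()]  # keeps the first "hard_soft"

print(merged.columns.to_list())
```

If an updated download adds or removes a dimension, the length check fails immediately instead of assigning labels to the wrong columns.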

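In the dataframe below, the low end of the 100-point scale corresponds to the left-hand adjective and the high end to the right-hand adjective. For example, a low `playful_serious` score means a character was rated as more playful.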

In [64]:
data.head()

Unnamed: 0,id,character,source,playful_serious,shy_bold,cheery_sorrowful,masculine_feminine,charming_awkward,lewd_tasteful,intellectual_physical,...,cringing-away_welcoming-experience,stereotypical_boundary-breaking,energetic_mellow,hopeful_fearful,likes-change_resists-change,manic_mild,old-fashioned_progressive,gross_hygienic,stable_unstable,overthinker_underthinker
0,HML/1,Prince Hamlet,Hamlet,62.4,69.8,92.6,31.9,61.2,53.5,28.8,...,27.5,78.8,40.5,53.4,77.4,14.0,56.3,51.4,87.4,8.2
1,HML/2,Queen Gertrude,Hamlet,79.1,62.2,68.5,78.1,36.9,40.3,42.6,...,42.8,23.9,84.9,73.7,49.0,73.7,21.1,71.0,26.3,63.3
2,HML/3,King Claudius,Hamlet,83.2,85.3,69.4,21.8,39.1,35.8,49.9,...,11.3,29.7,50.7,78.6,68.2,20.3,31.6,48.7,74.3,55.0
3,HML/4,Polonius,Hamlet,72.5,65.0,67.1,28.2,66.3,47.9,30.4,...,31.6,22.2,75.7,60.4,79.0,55.9,25.5,48.2,80.1,49.6
4,HML/5,Ophelia,Hamlet,40.7,48.1,81.8,90.0,52.6,59.3,41.1,...,35.6,42.4,75.0,61.7,61.3,15.1,57.3,54.7,90.3,24.9
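Because each column name joins the left-hand and right-hand adjectives with an underscore, a score can be translated back into the adjective it leans toward. The helper below is a minimal sketch: the name `leaning` is hypothetical, and treating 50 as the neutral midpoint is an assumption rather than something stated in the documentation. It also assumes that only one underscore separates the two poles (hyphens inside a term, as in `cringing-away_welcoming-experience`, are fine).

```python
def leaning(column_name, score, midpoint=50.0):
    """Return the adjective a score leans toward (assumes `midpoint` is neutral)."""
    # Split on the first underscore only: left-hand term, right-hand term
    left, right = column_name.split("_", 1)
    if score < midpoint:
        return left
    if score > midpoint:
        return right
    return "neutral"

# Scores taken from the output above
print(leaning("cheery_sorrowful", 92.6))    # Prince Hamlet
print(leaning("masculine_feminine", 90.0))  # Ophelia
```

With the values shown above, Prince Hamlet's 92.6 on `cheery_sorrowful` leans toward "sorrowful", and Ophelia's 90.0 on `masculine_feminine` leans toward "feminine".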


### Data Exploration

What if we want to know the 10 most charming or the 10 most awkward characters in the dataset? The functions below take the data and the name of the column you're interested in. `most_right` prints the 10 characters with the highest scores, who are closest to the right-hand term, while `most_left` prints the 10 characters with the lowest scores, who are closest to the left-hand term.

In [59]:
def most_right(data, column_name):
    """Print the 10 characters scoring highest on column_name (closest to the right-hand term)."""
    top = data.nlargest(n=10, columns=[column_name])
    print(top[["character", "source", column_name]])

def most_left(data, column_name):
    """Print the 10 characters scoring lowest on column_name (closest to the left-hand term)."""
    bottom = data.nsmallest(n=10, columns=[column_name])
    print(bottom[["character", "source", column_name]])

In [65]:
most_right(data, "charming_awkward")

                character                        source  charming_awkward
816        Emma Pillsbury                          Glee              93.1
1264  Mr. William Collins           Pride and Prejudice              93.0
762          Tina Belcher                 Bob's Burgers              92.2
1016         Kirk Gleason                 Gilmore Girls              91.7
2063         Buster Bluth          Arrested Development              91.6
909          Stuart Bloom           The Big Bang Theory              91.4
1324                James  The End of the F***ing World              91.3
345            Jonah Ryan                          Veep              90.8
2064         Tobias Funke          Arrested Development              90.8
672           Morty Smith                Rick and Morty              90.6


In [66]:
most_left(data, "charming_awkward")

                character                source  charming_awkward
1142         Neal Caffrey          White Collar               3.1
2092           James Bond  Tommorrow Never Dies               4.4
248           Inara Serra    Firefly + Serenity               4.8
556   Lucifer Morningstar               Lucifer               6.4
1223       Frank Abagnale   Catch Me If You Can               6.5
203            Don Draper               Mad Men               6.7
1534      Damon Salvatore   The Vampire Diaries               6.8
1545             Lagertha               Vikings               6.9
207         Joan Holloway               Mad Men               7.4
63         Derek Shepherd        Grey's Anatomy               7.7
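If the rankings need to be reused (for plotting or merging, say) rather than just printed, one option is a single function that returns the rows instead. This is only a sketch: the name `extremes` and its `side` and `n` parameters are hypothetical, and the four-row `excerpt` frame copies a few values from the outputs above purely for illustration, so its rankings are not the full-dataset rankings.

```python
import pandas as pd

def extremes(data, column_name, side="right", n=10):
    """Return the n characters closest to the left- or right-hand term of a dimension."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    # High scores lean right, low scores lean left
    picker = data.nlargest if side == "right" else data.nsmallest
    return picker(n=n, columns=[column_name])[["character", "source", column_name]]

# Four rows copied from the printed outputs above (an excerpt, not the full data)
excerpt = pd.DataFrame({
    "character": ["Emma Pillsbury", "Tina Belcher", "Neal Caffrey", "Inara Serra"],
    "source": ["Glee", "Bob's Burgers", "White Collar", "Firefly + Serenity"],
    "charming_awkward": [93.1, 92.2, 3.1, 4.8],
})

most_awkward = extremes(excerpt, "charming_awkward", side="right", n=2)
most_charming = extremes(excerpt, "charming_awkward", side="left", n=2)
print(most_awkward)
print(most_charming)
```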


For Neha, some ideas for data exploration:
1. Which dimensions have the highest/lowest average scores?
2. Correlation matrices: which dimensions move together?
3. Which dimensions have the highest variance?
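As a possible starting point for these three ideas, the sketch below computes per-dimension means, variances, and a correlation matrix with standard pandas methods. The `toy` frame and its values are made up for illustration; on the real data, the same calls would be applied to the numeric rating columns of `data`.

```python
import pandas as pd

# Made-up ratings for four hypothetical characters
toy = pd.DataFrame({
    "character": ["A", "B", "C", "D"],
    "source": ["S1", "S1", "S2", "S2"],
    "shy_bold": [10.0, 30.0, 70.0, 90.0],
    "playful_serious": [95.0, 75.0, 35.0, 15.0],
    "stable_unstable": [40.0, 45.0, 50.0, 45.0],
})

# Keep only the numeric rating columns (drops character and source)
scores = toy.select_dtypes(include="number")

means = scores.mean().sort_values(ascending=False)     # idea 1: highest/lowest averages
variances = scores.var().sort_values(ascending=False)  # idea 3: most spread-out dimensions
corr = scores.corr()                                   # idea 2: pairwise Pearson correlations

print(means)
print(variances)
print(corr.round(2))
```

With roughly 500 dimensions, the full correlation matrix is large, so it may be easier to read as a heatmap or by listing only the most strongly correlated pairs.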