# Group 12
### Members:
- DIZON, GAVIN RAINE R.
- PALMARES, ALYSSA JAYE L.
- TEE, GIANCARLO T.

## An Overview

The group will build a K-means clustering model for the Big Five personality traits dataset. The goal is to cluster the respondents based on the personality traits in the Big Five Personality Model, also known as the Five-Factor Model. This model is used for grouping people based on personality traits and relies on common descriptors from everyday language. The test is mainly used for career assessment since it gives people more insight into how they react in different situations. In addition, the test uses the Big-Five Factor Markers from the International Personality Item Pool, developed by Goldberg. These markers are commonly used to describe the human personality and psyche.

## OCEAN or Five Factor Markers

1. **Extraversion (or Extroversion)** - describes someone who gets motivated or energized in the company of others.
2. **Emotional Stability (or Neuroticism)** - a physical and emotional response to stress and perceived threats in someone’s daily life. Neuroticism is mostly characterized by sadness or moodiness.
3. **Agreeableness** - these people tend to show highly prosocial behaviors. People who exhibit high agreeableness show signs of trust, altruism, kindness, and affection.
4. **Conscientiousness** - includes having high levels of thoughtfulness, good impulse control, and goal-directed behaviors.
5. **Openness** - these people are eager to learn and experience new things. They are imaginative and insightful.


### Importing needed libraries

In [None]:
#import sys
#!{sys.executable} -m pip install yellowbrick

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.decomposition import PCA
from yellowbrick.cluster import KElbowVisualizer
from sklearn.mixture import GaussianMixture
from sklearn.metrics import (silhouette_score, calinski_harabasz_score, davies_bouldin_score)
np.random.seed(0)

## Initializing Functions to be used later

In [None]:
# For visualization
def vis_questions(groupname, questions, color):
    plt.figure(figsize=(15,35))
    for i in range(1, 11):
        plt.subplot(10,3,i)
        plt.hist(data[groupname[i-1]], bins=14, color= color, alpha=.5)
        plt.title(questions[groupname[i-1]], fontsize=14)

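def compare_two_graphs(groupname, questions, arr1, arr2 , grp1_name, grp2_name):
    # arr1: one row per question, holding the value of each of the 5 clusters
    # arr2: one population value per question
    arr2 = np.asarray(arr2)  # allow positional indexing for Series and arrays alike
    plt.figure(figsize=(15, 35))
    labels = ["Cluster 0", "Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Population"]    
    for i in range(0, len(arr1)):
        plt.subplot(10, 3, i+1)
        height = np.append(arr1[i], arr2[i])
        plt.bar(labels, height=height, color=["blue","blue","blue","blue", "blue", "green"], width = 0.4)
        # use index i so the title matches the question whose values are plotted
        plt.title(questions[groupname[i]], fontsize=14)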

## Explaining, Loading, and Preprocessing Data

The data was collected through an online interactive personality test from 2016 to 2018. This survey was made by the International Personality Item Pool (IPIP). Participants were informed that their data would be recorded and used for research, and were asked for their consent.

In total, there are `50` questions, 10 of which are allotted to each factor marker. In the dataset, the questions are labeled as follows:

<body>
<style> 
table td, table th, table tr {text-align:left !important;}
</style>

<table style="text-align: left; float:left;">
    <tr>
        <th>Extroversion</th>
        <th>Emotional Stability</th>
        <th>Agreeableness</th>
        <th>Conscientiousness</th>
        <th>Openness</th>
    </tr>
    <tr>
        <td><b>EXT1</b> I am the life of the party.</td>
        <td><b>EST1</b> I get stressed out easily.</td>
        <td><b>AGR1</b> I feel little concern for others.</td>
        <td><b>CSN1</b> I am always prepared.</td>
        <td><b>OPN1</b> I have a rich vocabulary.</td>
    </tr>
    <tr>
        <td><b>EXT2</b> I don't talk a lot.</td>
        <td><b>EST2</b> I am relaxed most of the time.</td>
        <td><b>AGR2</b> I am interested in people.</td>
        <td><b>CSN2</b> I leave my belongings around.</td>
        <td><b>OPN2</b> I have difficulty understanding abstract ideas.</td>
    </tr>
    <tr>
        <td><b>EXT3</b> I feel comfortable around people.</td>
        <td><b>EST3</b> I worry about things.</td>
        <td><b>AGR3</b> I insult people.</td>
        <td><b>CSN3</b> I pay attention to details.</td>
        <td><b>OPN3</b> I have a vivid imagination.</td>
    </tr>
    <tr>
        <td><b>EXT4</b> I keep in the background.</td>
        <td><b>EST4</b> I seldom feel blue.</td>
        <td><b>AGR4</b> I sympathize with others' feelings.</td>
        <td><b>CSN4</b> I make a mess of things.</td>
        <td><b>OPN4</b> I am not interested in abstract ideas.</td>
    </tr>
    <tr>
        <td><b>EXT5</b> I start conversations.</td>
        <td><b>EST5</b> I am easily disturbed.</td>
        <td><b>AGR5</b> I am not interested in other people's problems.</td>
        <td><b>CSN5</b> I get chores done right away.</td>
        <td><b>OPN5</b> I have excellent ideas.</td>
    </tr>
    <tr>
        <td><b>EXT6</b> I have little to say.</td>
        <td><b>EST6</b> I get upset easily.</td>
        <td><b>AGR6</b> I have a soft heart.</td>
        <td><b>CSN6</b> I often forget to put things back in their proper place.</td>
        <td><b>OPN6</b> I do not have a good imagination.</td>
    </tr>
    <tr>
        <td><b>EXT7</b> I talk to a lot of different people at parties.</td>
        <td><b>EST7</b> I change my mood a lot.</td>
        <td><b>AGR7</b> I am not really interested in others.</td>
        <td><b>CSN7</b> I like order.</td>
        <td><b>OPN7</b> I am quick to understand things.</td>
    </tr>
    <tr>
        <td><b>EXT8</b> I don't like to draw attention to myself.</td>
        <td><b>EST8</b> I have frequent mood swings.</td>
        <td><b>AGR8</b> I take time out for others.</td>
        <td><b>CSN8</b> I shirk my duties.</td>
        <td><b>OPN8</b> I use difficult words.</td>
    </tr>
    <tr>
        <td><b>EXT9</b> I don't mind being the center of attention.</td>
        <td><b>EST9</b> I get irritated easily.</td>
        <td><b>AGR9</b> I feel others' emotions.</td>
        <td><b>CSN9</b> I follow a schedule.</td>
        <td><b>OPN9</b> I spend time reflecting on things.</td>
    </tr>
    <tr>
        <td><b>EXT10</b> I am quiet around strangers.</td>
        <td><b>EST10</b> I often feel blue.</td>
        <td><b>AGR10</b> I make people feel at ease.</td>
        <td><b>CSN10</b> I am exacting in my work.</td>
        <td><b>OPN10</b> I am full of ideas.</td>
    </tr>
</table>   
</body>

Other columns present in the dataset are the `country` where the survey was answered, `IPC` (the number of users that share the same IP address when taking the test), and the screen size used when taking the test. There are also columns that measure the time spent on each question, on the start page, and on the end page.

### Removing Unused Columns and Rows with Missing Data

Only the columns containing the answers to the personality questions (EXT, EST, AGR, CSN, and OPN) are used for clustering, so all other columns are dropped. Rows with missing values are also dropped.

In [None]:
data = pd.read_csv('data/data-final.csv', sep='\t')
pd.options.display.max_columns = 150

# drop unused columns
data.drop(data.columns[50:], axis=1, inplace=True)

print('How many missing values? ', data.isnull().values.sum())
# drop any rows with missing values.
data.dropna(inplace=True)
print('Number of participants: ', len(data))
data.head()

### Removing zeroes and zero-centering data
We then remove the rows that contain zeros, as a zero signifies that the respondent skipped a question. We also zero-center the data to better visualize the respondents' answers by subtracting 3 from each answer, since 3 is the neutral answer on the Likert scale. Strongly agree is now represented as `2`, and strongly disagree as `-2`.

In [None]:
# dropping rows with zeros
data = data[(data != 0).all(1)]

# Subtract 3 pre-processing
data = data.apply(lambda x: x - 3)
print('Number of participants: ', len(data))
data.head()
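
As a small illustration of the two preprocessing steps above, here is a minimal sketch on a made-up table. The `toy` DataFrame and its values are hypothetical and are not taken from the survey data.

```python
import pandas as pd

# Hypothetical toy answers on the 1-5 Likert scale; 0 marks a skipped question.
toy = pd.DataFrame({
    "EXT1": [5, 3, 0, 1],
    "EXT2": [1, 3, 4, 5],
})

# Keep only the rows where every answer is non-zero (no skipped questions).
answered = toy[(toy != 0).all(axis=1)]

# Shift the scale so that the neutral answer (3) becomes 0.
centered = answered - 3

print(centered)
```

The third row is dropped because it contains a zero, and the remaining answers now range from `-2` to `2`.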

## Initializing Variables for Visualization

In [None]:
ext_questions = {'EXT1' : 'I am the life of the party',
                 'EXT2' : 'I dont talk a lot',
                 'EXT3' : 'I feel comfortable around people',
                 'EXT4' : 'I keep in the background',
                 'EXT5' : 'I start conversations',
                 'EXT6' : 'I have little to say',
                 'EXT7' : 'I talk to a lot of different people at parties',
                 'EXT8' : 'I dont like to draw attention to myself',
                 'EXT9' : 'I dont mind being the center of attention',
                 'EXT10': 'I am quiet around strangers'}

est_questions = {'EST1' : 'I get stressed out easily',
                 'EST2' : 'I am relaxed most of the time',
                 'EST3' : 'I worry about things',
                 'EST4' : 'I seldom feel blue',
                 'EST5' : 'I am easily disturbed',
                 'EST6' : 'I get upset easily',
                 'EST7' : 'I change my mood a lot',
                 'EST8' : 'I have frequent mood swings',
                 'EST9' : 'I get irritated easily',
                 'EST10': 'I often feel blue'}

agr_questions = {'AGR1' : 'I feel little concern for others',
                 'AGR2' : 'I am interested in people',
                 'AGR3' : 'I insult people',
                 'AGR4' : 'I sympathize with others feelings',
                 'AGR5' : 'I am not interested in other peoples problems',
                 'AGR6' : 'I have a soft heart',
                 'AGR7' : 'I am not really interested in others',
                 'AGR8' : 'I take time out for others',
                 'AGR9' : 'I feel others emotions',
                 'AGR10': 'I make people feel at ease'}

csn_questions = {'CSN1' : 'I am always prepared',
                 'CSN2' : 'I leave my belongings around',
                 'CSN3' : 'I pay attention to details',
                 'CSN4' : 'I make a mess of things',
                 'CSN5' : 'I get chores done right away',
                 'CSN6' : 'I often forget to put things back in their proper place',
                 'CSN7' : 'I like order',
                 'CSN8' : 'I shirk my duties',
                 'CSN9' : 'I follow a schedule',
                 'CSN10' : 'I am exacting in my work'}

opn_questions = {'OPN1' : 'I have a rich vocabulary',
                 'OPN2' : 'I have difficulty understanding abstract ideas',
                 'OPN3' : 'I have a vivid imagination',
                 'OPN4' : 'I am not interested in abstract ideas',
                 'OPN5' : 'I have excellent ideas',
                 'OPN6' : 'I do not have a good imagination',
                 'OPN7' : 'I am quick to understand things',
                 'OPN8' : 'I use difficult words',
                 'OPN9' : 'I spend time reflecting on things',
                 'OPN10': 'I am full of ideas'
}


EXT = [column for column in data if column.startswith('EXT')]
EST = [column for column in data if column.startswith('EST')]
AGR = [column for column in data if column.startswith('AGR')]
CSN = [column for column in data if column.startswith('CSN')]
OPN = [column for column in data if column.startswith('OPN')]



# Exploratory Data Analysis (EDA)

## Data Visualization

Here, we show the distribution of the respondents' answers to each question for each personality marker.

In [None]:
print('Q&As Related to Extroversion Personality')
vis_questions(EXT, ext_questions, 'black')

In [None]:
print('Q&As Related to Emotional Stability Personality')
vis_questions(EST, est_questions, 'black')

In [None]:
print('Q&As Related to Agreeableness Personality')
vis_questions(AGR, agr_questions, 'black')

In [None]:
print('Q&As Related to Conscientiousness Personality')
vis_questions(CSN, csn_questions, 'black')

In [None]:
print('Q&As Related to Openness Personality')
vis_questions(OPN, opn_questions, 'black')

In [None]:
#ADD MORE EDA HERE


# The Task

The main task is to create clusters from the given data. The main problem with this kind of dataset is that there is no `label` that we can use to guide the clustering. Several unsupervised machine learning models handle this situation, but for this experiment, we will be using **K-means clustering**.


> K-means clustering is an iterative algorithm that divides an unlabeled dataset into *k* different clusters in such a way that each row in the dataset belongs to only one group whose members have similar properties or `features`.

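This algorithm associates each cluster with a centroid, and every row is assigned to the cluster whose centroid is nearest. Despite the similar name, this differs from the K-nearest neighbor algorithm, which is a supervised method and does not use centroids. Since we will be evaluating the five personality markers, we will be setting our number of clusters `k` to 5.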
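
To make the iterative idea concrete, here is a minimal sketch of the basic K-means loop (often called Lloyd's algorithm) written with NumPy. It is an illustration only and is not scikit-learn's implementation. The function name `kmeans_sketch` and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; illustration only, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each row goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned rows.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two well-separated toy groups in 2-D (synthetic data).
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
```

The `KMeans` class used in this notebook adds a smarter initialization and a convergence check on top of this loop.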

In [None]:
df = data

kmeans = KMeans(n_clusters=5, random_state=42)
k_fit = kmeans.fit(df)
predictions = k_fit.labels_

# Create a new column of clusters to predict the data
df['Cluster'] = predictions

### Analysis

Now, we get the mean of each personality marker for each cluster.

In [None]:
col_list = list(df)
ext = col_list[0:10]
est = col_list[10:20]
agr = col_list[20:30]
csn = col_list[30:40]
opn = col_list[40:50]

data_sums = pd.DataFrame()
data_sums['extroversion'] = df[ext].sum(axis=1)/10
data_sums['neurotic'] = df[est].sum(axis=1)/10
data_sums['agreeable'] = df[agr].sum(axis=1)/10
data_sums['conscientious'] = df[csn].sum(axis=1)/10
data_sums['open'] = df[opn].sum(axis=1)/10
data_sums['clusters'] = predictions
data_sums.groupby('clusters').mean()
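
One thing to keep in mind when reading these means: some items are worded in the opposite direction of the trait (for example `EXT2`, "I don't talk a lot"), so a plain mean over all ten items can cancel out. The sketch below shows, on hypothetical toy data, how reversing such items changes a trait score. The list `negatively_worded` is an assumption based only on the wording of the items, not an official scoring key.

```python
import pandas as pd

# Hypothetical zero-centered answers (-2..2) of two respondents to four extroversion items.
toy = pd.DataFrame({
    "EXT1": [2, -2],   # I am the life of the party.
    "EXT2": [-2, 2],   # I don't talk a lot.
    "EXT3": [1, -1],   # I feel comfortable around people.
    "EXT4": [-1, 1],   # I keep in the background.
})

# Assumption: EXT2 and EXT4 are treated as negatively worded, based on their text.
negatively_worded = ["EXT2", "EXT4"]

# Plain mean over all items, as done in the cell above.
plain_mean = toy.mean(axis=1)

reversed_toy = toy.copy()
# On zero-centered data, reversing an item is just flipping its sign.
reversed_toy[negatively_worded] = -reversed_toy[negatively_worded]
keyed_mean = reversed_toy.mean(axis=1)

print(plain_mean.tolist())
print(keyed_mean.tolist())
```

In this toy case the plain mean is `0.0` for both respondents, while the reversed version separates them.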

## Principal Component Analysis

**Principal Component Analysis (PCA)** is a dimensionality-reduction method that transforms a large set of variables (or features) into a smaller one that still contains most of the information in the large set. The trade-off is a small loss of accuracy. Here, we use PCA to visualize the personality clusters. To do this, we reduce the features from 50 (the number of questions) to 2 by setting `n_components=2`, so that we can plot the clusters on a 2D plane.
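
How much information survives the reduction can be checked with the `explained_variance_ratio_` attribute of a fitted `PCA`. The sketch below uses synthetic data as a stand-in for the survey answers; the variables `hidden` and `loadings` are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the answers: 200 rows, 50 features driven by 2 hidden factors.
hidden = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 50))
X = hidden @ loadings + 0.1 * rng.normal(size=(200, 50))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the total variance kept by each of the two components.
print(pca.explained_variance_ratio_)
print("kept:", pca.explained_variance_ratio_.sum())
```

On real survey data the kept fraction is likely to be much lower than in this synthetic case, which is why the 2D plot should be read as a rough picture of the clusters.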

In [None]:
#PCA?

In [None]:
pca = PCA(n_components=2)
# project only the 50 answers, not the cluster labels
pca_fit = pca.fit_transform(df.drop(columns='Cluster'))

df_pca = pd.DataFrame(data=pca_fit, columns=['PCA1', 'PCA2'])
df_pca['Clusters'] = predictions

In [None]:
plt.figure(figsize=(10,10))
sns.scatterplot(data=df_pca, x='PCA1', y='PCA2', hue='Clusters', palette='tab10', alpha=0.8)
plt.title('Personality Clusters after PCA (K-Means)');

## Finding an optimum number of groups


### Elbow
The question now is whether `5` is the best value of `k` for clustering the given data. `KElbowVisualizer` is used to find an optimal number of clusters for k-means clustering using the "elbow" method. Here, we pass a range of k's to evaluate. Once we fit the visualizer to the data, we can read its `elbow_value_` attribute, which is the optimum value.
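
The quantity behind the elbow plot can also be computed by hand. The sketch below fits `KMeans` for several values of `k` on synthetic blobs (not the survey data) and prints the inertia, which is the within-cluster sum of squared distances. The blob centers are an assumption chosen so that the elbow is easy to see.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (not the survey data).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.5, random_state=0)

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances to the centroids.
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The inertia drops sharply up to `k=3` and only slightly afterwards; that bend is the "elbow".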

In [None]:
#KElbowVisualizer?
# Elbow Method - (check)
# Gap Statistics - (not yet done)

In [None]:
df_opt = data.drop('Cluster', axis=1)

columns = list(df_opt.columns)
#columns.remove('Cluster')

scaler = MinMaxScaler(feature_range=(0,1))
df_opt = scaler.fit_transform(df_opt)
df_opt = pd.DataFrame(df_opt, columns=columns)
visualizer = KElbowVisualizer(KMeans(verbose=1), k=(2,8))

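In [None]:
# fit K-means for every k in the range and draw the elbow plot
visualizer.fit(df_opt)
visualizer.show()
visualizer.elbow_value_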

Based on the graph shown above, the elbow method indicates that `5` is the optimal number of clusters for the dataset. Now let's try to find the optimum number with the `Gap Statistics` method.


### Gap Statistics
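
The gap statistic compares the within-cluster dispersion of the data with the dispersion expected on reference data that has no cluster structure. A minimal sketch of the idea is given below, assuming uniform reference data drawn inside the bounding box of the features and using the K-means inertia as the dispersion. The function `gap_statistic` and its parameters are hypothetical names for this example, it is run on synthetic blobs rather than the survey data, and it leaves out the standard-error term of the full selection rule.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def gap_statistic(X, k_values, n_refs=5, seed=0):
    """Gap(k) = mean(log W_k on uniform reference data) - log W_k on X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in k_values:
        # W_k: within-cluster sum of squares of the real data.
        log_wk = np.log(KMeans(n_clusters=k, n_init=5, random_state=0).fit(X).inertia_)
        ref_logs = []
        for _ in range(n_refs):
            # Reference data: uniform noise inside the bounding box of X.
            ref = rng.uniform(lo, hi, size=X.shape)
            ref_model = KMeans(n_clusters=k, n_init=5, random_state=0).fit(ref)
            ref_logs.append(np.log(ref_model.inertia_))
        gaps.append(np.mean(ref_logs) - log_wk)
    return np.array(gaps)

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.5, random_state=0)
k_values = [1, 2, 3, 4, 5]
gaps = gap_statistic(X, k_values)
print(dict(zip(k_values, gaps.round(2))))
```

A larger gap means the data is clustered more tightly than structureless data would be for that `k`.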

# Alternative Models

Other than the usual K-means cluster approach, there are also alternative methods that we can use to cluster an unlabeled dataset. 

## Mini-Batch K-Means

Mini-Batch K-Means keeps only a small, random, fixed-size batch of the data in memory. In each iteration, a new random sample of the data is drawn and used to update the clusters.
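
The batch-by-batch update can be seen directly with the `partial_fit` method of `MiniBatchKMeans`. The sketch below feeds synthetic data (not the survey data) to the model in chunks of 100 rows; the batch size and the blob centers are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 groups; make_blobs shuffles the rows by default.
X, _ = make_blobs(n_samples=600, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.5, random_state=0)

model = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0)
batch_size = 100
for start in range(0, len(X), batch_size):
    # Each call updates the centroids using only the current batch.
    model.partial_fit(X[start:start + batch_size])

labels = model.predict(X)
print(np.bincount(labels))
```

Calling `fit`, as done in this notebook, lets scikit-learn draw the random batches itself.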


In [None]:
MiniBatchKMeans?

In [None]:
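df_mb = data.copy()  # copy so the labels already stored in `data` are not overwritten
features_mb = df_mb.drop(columns='Cluster')  # fit on the 50 answers only

mbKMeans = MiniBatchKMeans(n_clusters=5, max_iter=500, verbose=1, random_state=42)
mbKMeans.fit(features_mb)

# Create a new column with the predicted cluster of each row
df_mb['Cluster'] = mbKMeans.labels_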

### Get Mean

In [None]:
col_list = list(df_mb)
ext = col_list[0:10]
est = col_list[10:20]
agr = col_list[20:30]
csn = col_list[30:40]
opn = col_list[40:50]

data_sums_mb = pd.DataFrame()
data_sums_mb['extroversion'] = df_mb[ext].sum(axis=1)/10
data_sums_mb['neurotic'] = df_mb[est].sum(axis=1)/10
data_sums_mb['agreeable'] = df_mb[agr].sum(axis=1)/10
data_sums_mb['conscientious'] = df_mb[csn].sum(axis=1)/10
data_sums_mb['open'] = df_mb[opn].sum(axis=1)/10
data_sums_mb['clusters'] = mbKMeans.labels_
data_sums_mb.groupby('clusters').mean()

### Computing the PCA

In [None]:
pca_mb = PCA(n_components=2)
# project only the 50 answers, not the cluster labels
pca_mb_fit = pca_mb.fit_transform(df_mb.drop(columns='Cluster'))

df_pca_mb = pd.DataFrame(data=pca_mb_fit, columns=['PCA1', 'PCA2'])
df_pca_mb['Clusters'] = mbKMeans.labels_
# TSNE

### Visualizing the Model

In [None]:
plt.figure(figsize=(10,10))
sns.scatterplot(data=df_pca_mb, x='PCA1', y='PCA2', hue='Clusters', palette='tab10', alpha=0.8)
plt.title('Personality Clusters after PCA (Mini-Batch K-means)');

## Gaussian Mixture Model

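A Gaussian Mixture Model (GMM) is a probabilistic model that assumes the data comes from a mixture of a finite number of Gaussian distributions with unknown parameters. Unlike K-means, which assigns each row to exactly one cluster, a GMM estimates the probability that a row belongs to each component.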

In [None]:
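df_gmm = data.copy()  # copy so the labels already stored in `data` are not overwritten
features_gmm = df_gmm.drop(columns='Cluster')  # fit on the 50 answers only

gmm = GaussianMixture(n_components=5, random_state=42)
gmm.fit(features_gmm)

# Create a new column with the predicted cluster of each row
df_gmm['Cluster'] = gmm.predict(features_gmm)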
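
A fitted `GaussianMixture` can return both a hard label and a probability per component for every row. The sketch below shows the two calls on synthetic data with two overlapping groups; the data and the number of components are assumptions made for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Two overlapping synthetic groups (not the survey data).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [4, 0]],
                  cluster_std=1.0, random_state=0)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

hard = gmm.predict(X)        # most likely component per row
soft = gmm.predict_proba(X)  # probability of each component per row

print(hard[:5])
print(soft[:5].round(3))
```

Rows that sit between the two groups get probabilities close to 0.5, which is information that a hard label hides.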

### Get Mean

As with K-means, we compute the mean of each personality marker for each cluster.

In [None]:
col_list = list(df_gmm)
ext = col_list[0:10]
est = col_list[10:20]
agr = col_list[20:30]
csn = col_list[30:40]
opn = col_list[40:50]

data_sums_gmm = pd.DataFrame()
data_sums_gmm['extroversion'] = df_gmm[ext].sum(axis=1)/10
data_sums_gmm['neurotic'] = df_gmm[est].sum(axis=1)/10
data_sums_gmm['agreeable'] = df_gmm[agr].sum(axis=1)/10
data_sums_gmm['conscientious'] = df_gmm[csn].sum(axis=1)/10
data_sums_gmm['open'] = df_gmm[opn].sum(axis=1)/10
data_sums_gmm['clusters'] = df_gmm['Cluster']
data_sums_gmm.groupby('clusters').mean()

### Computing for the PCA

In [None]:
pca_gmm = PCA(n_components=2)
# use the GMM's own PCA object and project only the 50 answers
pca_gmm_fit = pca_gmm.fit_transform(df_gmm.drop(columns='Cluster'))

df_pca_gmm = pd.DataFrame(data=pca_gmm_fit, columns=['PCA1', 'PCA2'])
df_pca_gmm['Clusters'] = list(df_gmm['Cluster'])

### Visualizing the Clusters with PCA

The cluster assignments of the Gaussian Mixture Model are plotted on the two principal components.

In [None]:
plt.figure(figsize=(10,10))
sns.scatterplot(data=df_pca_gmm, x='PCA1', y='PCA2', hue='Clusters', palette='tab10', alpha=0.8)
plt.title('Personality Clusters after PCA (Gaussian Mixture Model)');

### Validating and Evaluating the Clusters

In [None]:
# remove column cluster
col = list(df_pca.columns)
col.remove("Clusters")
col

df_eval = pd.DataFrame(columns=["Model", "Calinski-Harabasz", "Davies_Bouldin", "Silhouette"])
print("Computing Scores in K-Means...")
df_eval.loc[len(df_eval.index)]= ["K-means", calinski_harabasz_score(df_pca[col], df_pca['Clusters']), davies_bouldin_score(df_pca[col], df_pca['Clusters']), silhouette_score(df_pca[col], df_pca['Clusters'], metric='euclidean', sample_size=100000)] 
print("Computing Scores in Mini-Batch K-means...")
df_eval.loc[len(df_eval.index)]= ["Mini-Batch K-means", calinski_harabasz_score(df_pca_mb[col], df_pca_mb['Clusters']), davies_bouldin_score(df_pca_mb[col], df_pca_mb['Clusters']), silhouette_score(df_pca_mb[col], df_pca_mb['Clusters'], metric='euclidean', sample_size=100000)] 
print("Computing Scores in Gaussian Mixture Model...")
df_eval.loc[len(df_eval.index)]= ["Gaussian Mixture Model", calinski_harabasz_score(df_pca_gmm[col], df_pca_gmm['Clusters']), davies_bouldin_score(df_pca_gmm[col], df_pca_gmm['Clusters']), silhouette_score(df_pca_gmm[col], df_pca_gmm['Clusters'], metric='euclidean', sample_size=100000)] 

In [None]:
df_eval
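
The three scores point in different directions: for the Silhouette and Calinski-Harabasz scores a higher value is better, while for the Davies-Bouldin score a lower value is better. The sketch below illustrates this on synthetic data with 3 true groups (not the survey data) by comparing a clustering with `k=2` against one with `k=3`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Synthetic data with 3 well-separated groups.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = {
        "silhouette": silhouette_score(X, labels),                # higher is better, range [-1, 1]
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
    }

for k, row in scores.items():
    print(k, row)
```

All three scores prefer `k=3` here, which matches how the synthetic data was generated.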

## Exploratory Data Analysis (EDA)

Among the three models, the `Gaussian Mixture Model` produced the best clustering of the data. However, in this experiment, we will be analyzing the results from `K-means clustering`, as it is the most widely used among the `3`.

The mean and standard deviation of the answers to each question are grouped by cluster and stored in `df_cluster` and `df_cluster_std`, respectively, in the code below.

In [None]:
df_cluster = df.groupby('Cluster').mean()
df_cluster_std = df.groupby('Cluster').std()

### Extroversion Questions

#### Cluster Mean vs Population Mean comparison

In [None]:
EXT_df_cluster = [column for column in df_cluster if column.startswith('EXT')]
compare_two_graphs(EXT, ext_questions, df_cluster[EXT_df_cluster].values.transpose(), df[EXT_df_cluster].mean().transpose(), "Cluster ", "Population Mean")

##### Analysis

In general, the respondents in `cluster 0` and `cluster 3` give similar responses to the extroversion questions. They lean toward the non-extroverted side: most responses to the even questions, which point towards extroversion, are on the negative side, while most responses to the odd questions, which point towards introversion, are largely on the positive side. On the other hand, the other 3 clusters, `1`, `2`, and `4`, have most of their respondents on the pro-extroversion side, with `cluster 4` having the most positive responses with regard to extroverted behavior.

Comparing this to the mean of the population, the general trend is that slightly more respondents show traits aligned with extroversion than not.


#### Cluster Standard Deviation vs Population Standard Deviation comparison



In [None]:
compare_two_graphs(EXT, ext_questions, df_cluster_std[EXT_df_cluster].values.transpose(), df[EXT_df_cluster].std().transpose(), "Cluster ", "Population STD")

### Emotional Stability Questions

#### Cluster Mean vs Population Mean comparison

In [None]:
EST_df_cluster = [column for column in df_cluster if column.startswith('EST')]
compare_two_graphs(EST, est_questions, df_cluster[EST_df_cluster].values.transpose(), df[EST_df_cluster].mean().transpose(), "Cluster ", "Population Mean")

##### Analysis

Unlike extroversion, the responses of the respondents in each cluster vary for each question. However, a large majority of the respondents in all clusters answered positively to `EST3`, which is `I worry about things`, with `cluster 3` having the lowest mean. In addition, clusters `1` and `3` generally follow the same trend, with responses aligning more with high emotional stability and negative average responses on most questions. `cluster 0` and `cluster 4`, on the other hand, generally responded positively to questions pointing to emotional instability.

Looking at the population mean, most respondents lean toward emotional stability, although only moderately, as the mean stays near `0.0` in most questions.

#### Cluster Standard Deviation vs Population Standard Deviation comparison

In [None]:
print("Emotional Stability Questions: Cluster Standard Deviation vs Population Standard Deviation comparison")
compare_two_graphs(EST, est_questions, df_cluster_std[EST_df_cluster].values.transpose(), df[EST_df_cluster].std().transpose(), "Cluster ", "Population STD")

### Conscientiousness Questions

#### Cluster Mean vs Population Mean comparison

In [None]:
CSN_df_cluster = [column for column in df_cluster if column.startswith('CSN')]
compare_two_graphs(CSN, csn_questions, df_cluster[CSN_df_cluster].values.transpose(), df[CSN_df_cluster].mean().transpose(), "Cluster ", "Population Mean")

##### Analysis

Looking at the means for the questions related to conscientiousness, a large majority in all clusters responded positively to questions `3`, `7`, and `10`. Questions `3` and `7` pertain to poor item management, asking how often the respondents forget to put items back in their place and whether they leave their belongings around, while question `10` concerns time management, asking whether the respondents follow a schedule. This indicates that a majority of the population at the very least manages their time well but is lacking when it comes to item management. With regard to the trend between clusters, the average responses in `cluster 1` and `cluster 3` are similar, with responses leaning more to the negative. On the other hand, the responses of clusters `0`, `2`, and `4` are generally the opposite of those of clusters `1` and `3`, except for question `4`, where `cluster 2` leaned more to the negative, away from conscientiousness.

#### Cluster Standard Deviation vs Population Standard Deviation comparison

In [None]:
compare_two_graphs(CSN, csn_questions, df_cluster_std[CSN_df_cluster].values.transpose(), df[CSN_df_cluster].std().transpose(), "Cluster ", "Population STD")

### Agreeableness Questions

#### Cluster Mean vs Population Mean comparison

In [None]:
AGR_df_cluster = [column for column in df_cluster if column.startswith('AGR')]
compare_two_graphs(AGR, agr_questions, df_cluster[AGR_df_cluster].values.transpose(), df[AGR_df_cluster].mean().transpose(), "Cluster", "Population Mean")

##### Analysis

When it comes to agreeableness, `4` out of the `5` clusters have very similar responses for all questions. Clusters `0`, `1`, `3`, and `4` generally have responses leaning away from agreeableness, while question `2` gets varied responses, with the majority leaning toward agreeableness. With only 1 cluster being an outlier, the overall population mean for each question shows that most of the respondents are not agreeable.

#### Cluster Standard Deviation vs Population Standard Deviation comparison

In [None]:
compare_two_graphs(AGR, agr_questions, df_cluster_std[AGR_df_cluster].values.transpose(), df[AGR_df_cluster].std().transpose(), "Cluster", "Population STD")

### Openness Questions

#### Cluster Mean vs Population Mean comparison

In [None]:
OPN_df_cluster = [column for column in df_cluster if column.startswith('OPN')]
compare_two_graphs(OPN, opn_questions, df_cluster[OPN_df_cluster].values.transpose(), df[OPN_df_cluster].mean().transpose(), "Cluster", "Population Mean")

##### Analysis

All the clusters follow the same trend when it comes to questions about intellect, imagination, and openness. Most of the respondents in the clusters lean away from intellect and imagination, with negative responses to questions about abstract ideas and understanding them, as well as imagination and excellent ideas. On the other hand, the clusters lean more positively on questions about openness, with a positive mean on questions regarding reflection, understanding, and ideas. The only outlier is `cluster 1` in question `8`, whose responses generally leaned towards being slow to understand things.

#### Cluster Standard Deviation vs Population Standard Deviation comparison

In [None]:
compare_two_graphs(OPN, opn_questions, df_cluster_std[OPN_df_cluster].values.transpose(), df[OPN_df_cluster].std().transpose(), "Cluster", "Population STD")