Personality Classification With KNN

In [1]:
import numpy as np
import pandas as pd
import random
import math
import time

In [2]:
# Reading csv file

dataset = pd.read_csv('16P.csv', encoding = 'ISO-8859-1')
# Dropping 'Response Id'
dataset = dataset.iloc[:, 1:]

acc_list = np.zeros((2, 10))
prec_list = np.zeros((2, 10))
rec_list = np.zeros((2, 10))

dataset.head()

Unnamed: 0,You regularly make new friends.,You spend a lot of your free time exploring various random topics that pique your interest,Seeing other people cry can easily make you feel like you want to cry too,You often make a backup plan for a backup plan.,"You usually stay calm, even under a lot of pressure","At social events, you rarely try to introduce yourself to new people and mostly talk to the ones you already know",You prefer to completely finish one project before starting another.,You are very sentimental.,You like to use organizing tools like schedules and lists.,Even a small mistake can cause you to doubt your overall abilities and knowledge.,...,You believe that pondering abstract philosophical questions is a waste of time.,"You feel more drawn to places with busy, bustling atmospheres than quiet, intimate places.",You know at first glance how someone is feeling.,You often feel overwhelmed.,You complete things methodically without skipping over any steps.,You are very intrigued by things labeled as controversial.,You would pass along a good opportunity if you thought someone else needed it more.,You struggle with deadlines.,You feel confident that things will work out for you.,Personality
0,0,0,0,0,0,1,1,0,0,0,...,0,0,0,-1,0,0,0,0,0,ENFP
1,0,0,-2,-3,-1,2,-2,0,3,0,...,0,-2,0,2,0,-1,-1,-1,3,ISFP
2,0,0,2,0,-1,2,0,0,1,0,...,0,2,0,2,-1,0,1,2,1,INFJ
3,0,-1,3,-1,0,0,-2,0,-2,0,...,0,0,-1,-1,0,1,0,-2,-1,ISTP
4,0,0,-1,0,2,-1,-2,0,1,0,...,0,1,0,2,0,1,-1,2,-1,ENFJ


In [3]:
# Replacing personalities with their respective ids

personality_ids = {
    'ESTJ': 0,
    'ENTJ': 1,
    'ESFJ': 2,
    'ENFJ': 3,
    'ISTJ': 4,
    'ISFJ': 5,
    'INTJ': 6,
    'INFJ': 7,
    'ESTP': 8,
    'ESFP': 9,
    'ENTP': 10,
    'ENFP': 11,
    'ISTP': 12,
    'ISFP': 13,
    'INTP': 14,
    'INFP': 15
}

for i in personality_ids:
    dataset['Personality'] = dataset['Personality'].replace(i, personality_ids[i])

dataset.head()

Unnamed: 0,You regularly make new friends.,You spend a lot of your free time exploring various random topics that pique your interest,Seeing other people cry can easily make you feel like you want to cry too,You often make a backup plan for a backup plan.,"You usually stay calm, even under a lot of pressure","At social events, you rarely try to introduce yourself to new people and mostly talk to the ones you already know",You prefer to completely finish one project before starting another.,You are very sentimental.,You like to use organizing tools like schedules and lists.,Even a small mistake can cause you to doubt your overall abilities and knowledge.,...,You believe that pondering abstract philosophical questions is a waste of time.,"You feel more drawn to places with busy, bustling atmospheres than quiet, intimate places.",You know at first glance how someone is feeling.,You often feel overwhelmed.,You complete things methodically without skipping over any steps.,You are very intrigued by things labeled as controversial.,You would pass along a good opportunity if you thought someone else needed it more.,You struggle with deadlines.,You feel confident that things will work out for you.,Personality
0,0,0,0,0,0,1,1,0,0,0,...,0,0,0,-1,0,0,0,0,0,11
1,0,0,-2,-3,-1,2,-2,0,3,0,...,0,-2,0,2,0,-1,-1,-1,3,13
2,0,0,2,0,-1,2,0,0,1,0,...,0,2,0,2,-1,0,1,2,1,7
3,0,-1,3,-1,0,0,-2,0,-2,0,...,0,0,-1,-1,0,1,0,-2,-1,12
4,0,0,-1,0,2,-1,-2,0,1,0,...,0,1,0,2,0,1,-1,2,-1,3


In [4]:
# Converting into numpy array for convenience

dataset = dataset.to_numpy(dtype = np.intc)

In [5]:
# Splitting for k-fold cross validation

def five_fold_cross_val(dataset, target_group):
    predictor, target = ['', '', '', '', ''], ['', '', '', '', '']
    
    for i in range(0, 5):
        predictor[i] = dataset[12000 * i: 12000 * (i + 1), 0: 60]
        target[i] = dataset[12000 * i: 12000 * (i + 1), 60]
    
    predictor_test, target_test = predictor[target_group], target[target_group]
    predictor_train, target_train = '', ''
    
    check = True
    
    for i in range(0, 5):
        if i == target_group:
            continue
        
        if check:
            predictor_train = predictor[i]
            target_train = target[i]
            
            check = False
        else:
            predictor_train = np.concatenate((predictor_train, predictor[i]))
            target_train = np.concatenate((target_train, target[i]))
    
    return (predictor_train, predictor_test, target_train, target_test)

In [6]:
# Euclidean distance calculation

def euclidean_dist_calc(pred_train, pred_test):
    
    cross = np.add(
        -2 * np.einsum('ij,kj->ik', pred_test, pred_train),
        np.add(
            np.sum(pred_train ** 2, axis=1),
            np.sum(pred_test ** 2, axis=1, keepdims=True)
        )
    )
    
    #cross = (
    #    -2 * np.einsum('ij,kj->ik', pred_test, pred_train) + 
    #    np.sum(pred_train ** 2, axis=1) + 
    #    np.sum(pred_test ** 2, axis=1, keepdims=True)
    #)
    
    return np.sqrt(cross)

In [7]:
# Feature scaling

def feature_scaling(val, max_val, min_val):
    return (val - min_val) / (max_val - min_val)

In [8]:
# Accuracy, Precision, Recall calculation

def acc_prec_rec_calc(table):
    acc_sum, prec_sum, rec_sum = np.float64(0.0), np.float64(0.0), np.float64(0.0)
    
    for i in range(0, 16):
        tp, tn, fn, fp = np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0)
        
        tp += table[i, i]
        
        for j in range(0, 16):
            if j != i:
                fn += table[i, j]
                fp += table[j, i]
                
        for j in range(0, 16):
            for k in range(0, 16):
                if k != i and j != i:
                    tn += table[j, k]
                    
        if tp + fp == 0 or tp + fn == 0:
            continue
        
        acc_sum += np.divide(tp + tn, tp + tn + fp + fn)
        prec_sum += np.divide(tp ,tp + fp)
        rec_sum += tp / (tp + fn)
    
    return (acc_sum / 16, prec_sum / 16, rec_sum / 16)

In [9]:
# Predicts personality based on distances

def predict_per(dists, k, target_train):
    n = dists.shape[0]
    predictions = np.zeros(n)
    
    for i in range(0, n):
        sorted_pers = target_train[np.argsort(dists[i])]
        sorted_pers_k = sorted_pers[: k]
        
        val, cou = np.unique(sorted_pers_k, return_counts=True)
        predictions[i] = val[np.argmax(cou)]
    
    return predictions

In [10]:
def knn(dataset, k, ind):
    random.shuffle(dataset)
    acc_sum, prec_sum, rec_sum = np.float64(0.0), np.float64(0.0), np.float64(0.0)
    
    for i in range(0, 5):
        pred_train, pred_test, target_train, target_test = five_fold_cross_val(dataset, i)
        table = np.zeros((16, 16))
        
        dists = euclidean_dist_calc(pred_train, pred_test)
        
        predictions = predict_per(dists, k, target_train)
        
        for i in range(0, len(target_test)):
            
            table[target_test[i], int(predictions[i])] += 1
            
        acc, prec, rec = acc_prec_rec_calc(table)
        acc_sum += acc
        prec_sum += prec
        rec_sum += rec
        
    global acc_list
    global rec_list
    global prec_list
    
    acc_list[0, ind] = acc_sum/5
    rec_list[0, ind] = rec_sum/5
    prec_list[0, ind] = prec_sum/5

In [11]:
tot_time = 0

# Non-Scaled tests

start = time.time()

print('Non-Scaled Tests:')

for i in range(5):
    knn(dataset, i * 2 + 1, i)
    
    acc_list[0, i] = np.round(acc_list[0, i], decimals = 3)
    prec_list[0, i] = np.round(prec_list[0, i], decimals = 3)
    rec_list[0, i] = np.round(rec_list[0, i], decimals = 3)
    
    end = time.time()
    tot_time += (end - start)
    
    print(
        'k = {k}, time = {time}\n'.format(k=(i * 2 + 1), time=(end - start)) +
        'macro accuracy: {acc}\n'.format(acc=acc_list[0, i]) +
        'macro precision: {prec}\n'.format(prec=prec_list[0, i]) +
        'macro recall: {rec}\n'.format(rec=rec_list[0, i])
    )
    
    start = end

Non-Scaled Tests:
k = 1, time = 467.6421399116516
macro accuracy: 0.999
macro precision: 0.991
macro recall: 0.991

k = 3, time = 456.99508810043335
macro accuracy: 0.999
macro precision: 0.995
macro recall: 0.995

k = 5, time = 481.44154047966003
macro accuracy: 0.999
macro precision: 0.996
macro recall: 0.996

k = 7, time = 457.0759587287903
macro accuracy: 1.0
macro precision: 0.996
macro recall: 0.996

k = 9, time = 453.2518994808197
macro accuracy: 1.0
macro precision: 0.997
macro recall: 0.996



In [12]:
# Scaling

scaled_dataset = (dataset + 3) / 6

#for i in range(0, scaled_dataset.shape[0]):
#    for j in range(0, 60):
#        scaled_dataset[i, j] = feature_scaling(scaled_dataset[i, j], 3, -3)
        
end = time.time()

tot_time += (end - start)

print('Scaling time: {time}'.format(time=end-start))

Scaling time: 0.251253604888916


In [13]:
# Scaled tests

start = time.time()
        
print('Scaled Tests:')
    
# Scaled
for i in range(5):
    knn(dataset, i * 2 + 1, i + 5)
    
    acc_list[0, i + 5] = np.round(acc_list[0, i + 5], decimals = 3)
    prec_list[0, i + 5] = np.round(prec_list[0, i + 5], decimals = 3)
    rec_list[0, i + 5] = np.round(rec_list[0, i + 5], decimals = 3)
    
    end = time.time()
    tot_time += (end - start)
    
    print(
        'k = {k}, time = {time}\n'.format(k=(i * 2 + 1), time=(end - start)) +
        'macro accuracy: {acc}\n'.format(acc=acc_list[0, i + 5]) +
        'macro precision: {prec}\n'.format(prec=prec_list[0, i + 5]) +
        'macro recall: {rec}\n'.format(rec=rec_list[0, i + 5])
    )
    
    start = end

Scaled Tests:
k = 1, time = 524.5711002349854
macro accuracy: 1.0
macro precision: 1.0
macro recall: 0.999

k = 3, time = 481.87176847457886
macro accuracy: 1.0
macro precision: 0.999
macro recall: 0.998

k = 5, time = 439.74562907218933
macro accuracy: 1.0
macro precision: 0.998
macro recall: 0.997

k = 7, time = 425.7210519313812
macro accuracy: 1.0
macro precision: 0.998
macro recall: 0.995

k = 9, time = 384.6793851852417
macro accuracy: 1.0
macro precision: 0.998
macro recall: 0.994



In [14]:
# Total time

print('Total testing time = {time}'.format(time=tot_time))

Total testing time = 4573.24681520462


Results


All tests which are Scaled and Non-Scaled datasets with k as k ∈ {1, 3, 5, 7, 9} for each has took a total time of 76 minutes.
Scaled and Non-Scaled tests took similar amounts of time of 38.6 minutes and 37.4 minutes respectively.


Although both were fairly close to 100% in comparison Scaled tests proved to be more accurate than None-Scaled tests.
The cases where accuracy reached 100% is probably a calculation or a rounding mistake and shouldn't taken for certain.


In Scaled tests times seems to be fluctuating a lot, I don't know the exact reason for it but the cause might be the machine I ran the code on.


Specs of the machine used to run this code:
CPU: AMD Ryzen 7 Mobile 3700U
RAM: 8 GB 2667 Mhz

Error Analysis


I'm not able to provide an example but I'm aware of the fact that in tests that k > 1, some personalities may be wrong due to multiple personalities having same number of occurrences. Weighted KNN algorithm can be implemented to prevent this from happening.


Code may raise MemoryError if dataset is bigger than the one we're working with.