In [1]:
import numpy as np
import pandas as pd
data = pd.read_csv("16P.csv", encoding='cp1252')
del data[data.columns[0]]

This cell loads the data file from the same directory as the .ipynb file. I used "cp1252" as the encoding and deleted the first column, which holds the unneeded "response id".

In [2]:
mbti_codes = ['ESTJ', 'ENTJ', 'ESFJ', 'ENFJ', 'ISTJ', 'ISFJ', 'INTJ', 'INFJ', 'ESTP', 'ESFP', 'ENTP', 'ENFP', 'ISTP', 'ISFP', 'INTP', 'INFP']
numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
mbti_dict = dict(zip(mbti_codes, numbers))
matrix = {number: {"TP": 0, "FP": 0, "FN": 0, "TN": 0} for number in numbers}
data[data.columns[-1]] = data[data.columns[-1]].apply(lambda x: mbti_dict[x])
data_array = data.to_numpy()
dependent_variable = data_array[:, -1]
independent_variable = data_array[:, :-1]
cross_code = 0
data_norm = (independent_variable - np.min(independent_variable))/(np.max(independent_variable) - np.min(independent_variable))


In this part I prepare the main data arrays and encode the MBTI personality types as numbers. The "matrix" dictionary holds the confusion-matrix counts (TP, FP, FN, TN) for each personality type. The "data" DataFrame holds the answers and their corresponding personalities. After encoding the last column, I convert "data" to a NumPy array and split it: the answers go into independent_variable and the corresponding personalities into dependent_variable. I also create the cross_code variable to count the current fold in the 5-fold cross validation, and data_norm, a min-max normalized copy of the answers (scaled with the minimum and maximum of the whole array), so that the algorithm can also be run on normalized data.
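The normalization step can be illustrated on a small array. The sketch below uses hypothetical toy answers (not the real survey data) and compares the global min-max scaling used in the cell above with a per-column variant, which is shown only as an alternative and is not what this notebook does.

```python
import numpy as np

# Hypothetical toy answers: 3 respondents, 2 questions
answers = np.array([[-3.0, 0.0],
                    [ 0.0, 1.0],
                    [ 3.0, 2.0]])

# Global min-max scaling, as in the cell above: one min and one max for the whole array
global_norm = (answers - answers.min()) / (answers.max() - answers.min())

# Per-column variant (an alternative, not used in this notebook)
col_min = answers.min(axis=0)
col_max = answers.max(axis=0)
column_norm = (answers - col_min) / (col_max - col_min)

# Distance ranking of the rows around a query, before and after global scaling
query = np.array([1.0, 1.0])
query_norm = (query - answers.min()) / (answers.max() - answers.min())
order_raw = np.argsort(np.linalg.norm(answers - query, axis=1))
order_norm = np.argsort(np.linalg.norm(global_norm - query_norm, axis=1))
```

Global scaling shifts every coordinate by the same offset and divides it by the same constant, so all Euclidean distances shrink by that constant and the neighbor ranking is unchanged. This is consistent with the normalized and unnormalized runs giving nearly the same number of correct answers.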

In [3]:
def get_neighbors(neighbor_size, test):
    # Euclidean distance from the test row to every row of the global train_data
    dist = np.linalg.norm(test - train_data, axis=1)
    # Append the distances as the last column, then sort the rows by that column
    distances = np.column_stack((train_data, dist))
    distances = distances[distances[:, -1].argsort()]
    # Keep the nearest rows and drop their distance column
    return [distances[j][:-1] for j in range(neighbor_size)]


This is the main function of the k-nearest neighbor algorithm. It takes one row of the independent variables and calculates its distance to every training row. For this I used np.linalg.norm(), which with axis=1 returns the Euclidean distance between the test row and each row of train_data. Normally this would need a for loop over the training rows, but NumPy broadcasting subtracts the test row from the whole 2-dimensional train_data array at once, so I got rid of one loop and made the algorithm faster. After calculating the distances, I sort the training rows by distance and return the nearest k of them. Note that train_data is read as a global variable, which is set by the cross-validation loop.
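The loop-free distance calculation can be checked on a small example. This is a minimal sketch with hypothetical toy data (the names train, test_row and nearest are not from the notebook); it compares the broadcast version with an explicit loop and picks the nearest rows by index.

```python
import numpy as np

# Hypothetical toy data: 4 training rows with 2 features each, and one test row
train = np.array([[0.0, 0.0],
                  [3.0, 4.0],
                  [1.0, 1.0],
                  [6.0, 8.0]])
test_row = np.array([0.0, 0.0])

# Vectorized: broadcasting subtracts test_row from every training row
vectorized = np.linalg.norm(test_row - train, axis=1)

# Equivalent explicit loop, shown only for comparison
looped = np.array([np.linalg.norm(test_row - row) for row in train])

# Indices of the k nearest training rows, closest first
k = 3
nearest = np.argsort(vectorized)[:k]
```

Working with row indices instead of stacked rows is one possible design choice: the labels of the neighbors could then be read by position from the label array, which would avoid looking rows up through a dictionary keyed on their values.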

In [4]:
for i in range(12000, 72000, 12000):
    cross_code += 1
    print(f"Selected test data: {cross_code} out of 5")
    for l in range(2):
        if l == 1:
            print("Without Normalization:\n")
            dataset = dict()
            test_data = independent_variable[i-12000:i]
            train_data = np.concatenate((independent_variable[:i-12000], independent_variable[i:60000]))
            for n in range(59999):
                dataset[tuple(independent_variable[n])] = dependent_variable[n]
        else:
            print("With Normalization:\n")
            dataset = dict()
            test_data = data_norm[i-12000:i]
            train_data = np.concatenate((data_norm[:i-12000], data_norm[i:60000]))
            for n in range(59999):
                dataset[tuple(data_norm[n])] = dependent_variable[n]
        for k in range(1,10,2):
            correct_answers = 0
            wrong_answers = 0
            accuracy = 0
            precision = 0
            recall = 0
            print("Neighbor size: ",k)
            for testing in test_data:
                neighbors = get_neighbors(k, testing)
                neighbor_list = []
                for neighbor in neighbors:
                    neighbor_list.append(dataset[tuple(neighbor)])
                majority_vote = max(set(neighbor_list), key = neighbor_list.count)
                if majority_vote == dataset[tuple(testing)]:
                    correct_answers += 1
                    matrix[majority_vote]["TP"] += 1
                else:
                    wrong_answers += 1
                    matrix[majority_vote]["FN"] += 1
                TN_sum = 0
                FP_sum = 0
                for a in matrix:
                    TN_sum += matrix[a]["TP"]
                    FP_sum += matrix[a]["FN"]
                TN_sum -= matrix[majority_vote]["TP"]
                FP_sum -= matrix[majority_vote]["FN"]
            matrix[majority_vote]["TN"] = TN_sum
            matrix[majority_vote]["FP"] = FP_sum
            accuracy += (matrix[majority_vote]["TP"] + matrix[majority_vote]["TN"]) / (matrix[majority_vote]["TP"] + matrix[majority_vote]["TN"] + matrix[majority_vote]["FP"] + matrix[majority_vote]["FN"])
            precision += (matrix[majority_vote]["TP"]) / (matrix[majority_vote]["TP"] + matrix[majority_vote]["FP"])
            recall += (matrix[majority_vote]["TP"]) / (matrix[majority_vote]["TP"] + matrix[majority_vote]["FN"])
            print("Correct Answers: ", correct_answers)
            print("Wrong Answers: ", wrong_answers)
            print(f"Accuracy: %{accuracy*100}")
            print(f"Precision: %{precision*100}")
            print(f"Recall: %{recall*100}\n")

Selected test data: 1 out of 5
With Normalization:

Neighbor size:  1
Correct Answers:  11747
Wrong Answers:  253
Accuracy: %97.89166666666667
Precision: %76.55786350148368
Recall: %97.9746835443038

Neighbor size:  3
Correct Answers:  11864
Wrong Answers:  136
Accuracy: %98.37916666666666
Precision: %81.27282211789255
Recall: %98.11083123425692

Neighbor size:  5
Correct Answers:  11873
Wrong Answers:  127
Accuracy: %98.56666666666666
Precision: %83.14407381121363
Recall: %98.28020134228188

Neighbor size:  7
Correct Answers:  11876
Wrong Answers:  124
Accuracy: %98.66666666666667
Precision: %84.17653390742734
Recall: %98.36477987421384

Neighbor size:  9
Correct Answers:  11877
Wrong Answers:  123
Accuracy: %98.72833333333332
Precision: %84.82549317147192
Recall: %98.41549295774648

Without Normalization:

Neighbor size:  1
Correct Answers:  11744
Wrong Answers:  256
Accuracy: %98.58472222222223
Precision: %83.29779673063256
Recall: %98.34277323264108

Neighbor size:  3
Correct Answe

In the final part, I prepare the data for the k-nearest neighbor algorithm and run it. There are 5 test folds for the 5-fold cross validation, 2 variants of the data (with and without normalization), and 5 values of k (1, 3, 5, 7, 9). In total this gives 5x2x5 = 50 different runs, and for each one I calculate the correct and wrong answers, accuracy, precision and recall. For this, I used 3 nested loops: one for the folds, one for normalization, and one for the k values. Inside them, for every test row I find the nearest training rows and look up their personalities. I take the most common personality among these neighbors (majority_vote) and use it as the prediction for the test row. From the correct and wrong answers I update the confusion matrix and calculate accuracy, precision and recall from it. Finally, I print a summary of each run, shown above. Below is a tabular version of all the data.

https://www.linkpicture.com/q/Adsız_60.png
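For reference, the textbook one-vs-rest definition of TP, FP, FN and TN for a multi-class problem can be sketched as follows. This is an illustrative sketch with hypothetical helper names (per_class_counts, macro_precision_recall) and toy labels. It is not the bookkeeping used in the cell above, which updates one shared "matrix" dictionary keyed by the predicted class.

```python
import numpy as np

def per_class_counts(y_true, y_pred, n_classes):
    """One-vs-rest TP/FP/FN/TN for each class label 0..n_classes-1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    counts = {}
    for c in range(n_classes):
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        tn = int(np.sum((y_pred != c) & (y_true != c)))
        counts[c] = {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
    return counts

def macro_precision_recall(counts):
    """Average precision and recall over classes, skipping undefined ratios."""
    precisions, recalls = [], []
    for m in counts.values():
        if m["TP"] + m["FP"] > 0:
            precisions.append(m["TP"] / (m["TP"] + m["FP"]))
        if m["TP"] + m["FN"] > 0:
            recalls.append(m["TP"] / (m["TP"] + m["FN"]))
    return float(np.mean(precisions)), float(np.mean(recalls))

# Hypothetical labels for 6 test rows and 3 classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
counts = per_class_counts(y_true, y_pred, n_classes=3)
precision, recall = macro_precision_recall(counts)
```

Under this definition a wrong prediction counts as a false positive for the predicted class and as a false negative for the true class, and the counts would be computed separately for each run rather than accumulated across runs.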

Analysis
As the summary shows, the choice of k (the number of neighbors) clearly affects accuracy, precision and recall. Comparing 1 and 3 neighbors, the number of wrong answers drops by almost half (for example, from 253 to 136 in the first fold with normalization). The main reason is that with 1 neighbor the prediction depends on a single point, whereas with 3 or more neighbors it is obtained by majority vote. In the results, the number of correct answers also increases as the neighbor value increases. As for normalization, there is not much difference in the number of correct answers. However, in terms of processing speed, the normalized data sets were processed much faster.
Error Analysis
We see a number of wrong answers in each run, approximately 100 to 300. The main reason is that the k-NN prediction for the same point can change with the neighbor value, especially between neighbor values that are far apart. This is easier to explain with an example:


O         X      data O        X              O


As seen in the example, with k = 1 the value O is dominant; with k = 3 the value X is dominant by majority vote; and with k = 5 the value O is dominant again. Such errors can appear or disappear as the neighbor value increases or decreases. Therefore, we cannot say that the largest or the smallest neighbor value is always the best. However, based on the results we observed, the most suitable and efficient configuration in terms of speed and correct answers is normalized data with 7 neighbors.
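The example can be reproduced in a few lines. The sketch below places five hypothetical points on a line so that they mirror the diagram (the exact positions are assumptions chosen for illustration) and takes the majority vote for k = 1, 3 and 5.

```python
from collections import Counter

# Hypothetical 1-D points mirroring the diagram: (position, label); the query sits at 0
points = [(-5.0, "O"), (-2.0, "X"), (1.0, "O"), (3.0, "X"), (6.0, "O")]
query = 0.0

def vote(k):
    """Return the majority label among the k points closest to the query."""
    nearest = sorted(points, key=lambda p: abs(p[0] - query))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

results = {k: vote(k) for k in (1, 3, 5)}
print(results)  # {1: 'O', 3: 'X', 5: 'O'}
```

The predicted label flips from O to X and back to O as k grows, which is why neither the smallest nor the largest neighbor value is guaranteed to give the fewest errors.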