# Classification using Haberman's Survival Data Set

This is a reimplementation of the K-Nearest Neighbors algorithm using plain Python.

In my opinion it is important to understand the "low level", not just the abstraction.

Data Set: [Haberman's Survival Data Set](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival)

In [56]:
import math
import pandas as pd
import numpy as np

In [57]:
data = []

##### Import the data and append it to the list 

[age_of_the_patient, year_of_operation, number_of_nodes_detected, survival_status]

Check the data set's link above for more details

In [58]:
with open('dataset.data', 'r') as f:
    for line in f.readlines():
        attributes = line.strip('\n').split(',')
        data.append([int(x) for x in attributes])
# Split each line of the data set into its 4 values and append them to the list named 'data'

### Normalize the data (standardization)

In [59]:
df=pd.DataFrame(data)

In [60]:
df.describe()

,0,1,2,3
count,306.0,306.0,306.0,306.0
mean,52.457516,62.852941,4.026144,1.264706
std,10.803452,3.249405,7.189654,0.441899
min,30.0,58.0,0.0,1.0
25%,44.0,60.0,0.0,1.0
50%,52.0,63.0,1.0,1.0
75%,60.75,65.75,4.0,2.0
max,83.0,69.0,52.0,2.0


In [61]:
stan_df=(df-df.mean())/df.std()

In [62]:
add = df[3].values
stan_df[3] = add

In [63]:
data=stan_df.values
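
The standardization above subtracts each column's mean and divides by its standard deviation (pandas' `std()` uses the sample standard deviation, `ddof=1`, by default). In the spirit of understanding the "low level", here is a minimal plain-Python sketch of the same idea for a single column. `standardize_column` and the toy values are illustrative assumptions, not part of the original notebook:

```python
import math

def standardize_column(values):
    """Return the z-score of each value, using the sample standard deviation (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance: divide by n - 1, matching the pandas default (ddof=1)
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

toy_ages = [30, 40, 50]  # toy values, not taken from the data set
print(standardize_column(toy_ages))  # [-1.0, 0.0, 1.0]
```

Note that the notebook restores the label column (column 3) afterwards, since the class labels should not be rescaled.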

##### Auxiliary function to help the visualization
Also returns key information of the data set

In [64]:
def info_dataset(data, verbose=True):
    label1, label2 = 0, 0
    data_size = len(data)
    for datum in data:
        if datum[-1] == 1:
            label1 += 1
        else:
            label2 += 1
    if verbose:
        print('Total of samples: %d' % data_size)
        print('Total label 1: %d' % label1)
        print('Total label 2: %d' % label2)
    return [len(data), label1, label2]
# Function that counts how many samples of the data set carry each label
# verbose only controls whether the counts are printed, so it needs no special attention

In [65]:
info_dataset(data)

Total of samples: 306
Total label 1: 225
Total label 2: 81


[306, 225, 81]

##### Define the train/total percentage

In [66]:
p = 0.6
_, label1, label2 = info_dataset(data,False)
# Used to split the labeled data into train data and test data
# 60% goes to train_data, 40% to test_data

##### Split the data set into train set and test set

In [67]:
train_set, test_set = [], []
max_label1, max_label2 = int(p * label1), int(p * label2)
total_label1, total_label2 = 0, 0
for sample in data:
    if (total_label1 + total_label2) < (max_label1 + max_label2):
        train_set.append(sample)
        if sample[-1] == 1 and total_label1 < max_label1:
            total_label1 += 1
        else:
            total_label2 += 1
    else:
        test_set.append(sample)
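# Split the data into train_set and test_set
# Note: as written, the first max_label1 + max_label2 samples go to train_set regardless of their label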
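
One alternative to the loop above is to split each label separately, so that the train and test sets keep roughly the same label ratio. A minimal sketch, assuming the label is the last element of each row; `stratified_split` is a hypothetical helper and the toy rows are made up for illustration:

```python
def stratified_split(rows, train_fraction):
    """Split rows into (train, test), taking train_fraction of each label group."""
    by_label = {}
    for row in rows:
        # Group the rows by their label (last element)
        by_label.setdefault(row[-1], []).append(row)

    train, test = [], []
    for label_rows in by_label.values():
        cut = int(train_fraction * len(label_rows))
        train.extend(label_rows[:cut])
        test.extend(label_rows[cut:])
    return train, test

# Toy data: 10 rows with label 1 and 5 rows with label 2
toy = [[i, 1] for i in range(10)] + [[i, 2] for i in range(5)]
train, test = stratified_split(toy, 0.6)
print(len(train), len(test))  # 9 6
```

This sketch keeps the original row order inside each group; shuffling the rows first would be a further option.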

##### Define function to calculate the Manhattan distance between two points
[Taxicab geometry (Manhattan distance) - Wikipedia](https://en.wikipedia.org/wiki/Taxicab_geometry)

In [68]:
def manhattan_dist(p1, p2):
    dim, sum_ = len(p1), 0
    for index in range(dim-1):
        sum_ += abs(p1[index]-p2[index])
    return sum_
# Function that computes the Manhattan distance (the last element, the label, is skipped)
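
To make the difference between the two metrics concrete, here is a small sketch that computes both the Manhattan and the Euclidean distance for the same pair of points. Unlike `manhattan_dist` above, these illustrative helpers (`manhattan`, `euclidean`) assume the points contain features only, with no trailing label:

```python
import math

def manhattan(p, q):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

a, b = [0.0, 0.0], [3.0, 4.0]
print(manhattan(a, b))  # 7.0
print(euclidean(a, b))  # 5.0
```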

##### Calculates the distance between a given sample and every other in the train set
Feeds the distances to a dictionary, then sorts it and gets the nearest K neighbors.
Then it counts which of the labels is the most frequent and returns it.

In [69]:
def knn_manhattan(train_set, new_sample,K):
    dists, train_size = {}, len(train_set)
    
    for i in range(train_size):
        d = manhattan_dist(train_set[i],new_sample)
        dists[i] = d
        
    k_neighbors = sorted(dists,key=dists.get)[:K]
    
    qty_label1, qty_label2 = 0, 0
    
    for index in k_neighbors:
        if train_set[index][-1] == 1:
            qty_label1 +=1
        else:
            qty_label2 +=1
    
    if qty_label1>qty_label2:
        return 1
    else:
        return 2
# KNN function using the Manhattan distance.
# It takes train_set as the reference data and new_sample as a parameter,
# and labels new_sample by majority vote among its K nearest neighbors.
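
The same procedure (compute distances, keep the K nearest, take a majority vote) can be sketched more compactly with `collections.Counter`. This is only an illustration on made-up toy points, assuming the label is the last element of each row; `knn_predict` is a hypothetical name. On a tie, this sketch returns whichever tied label appears first among the neighbors, whereas `knn_manhattan` above returns 2:

```python
from collections import Counter

def knn_predict(train_rows, sample, k):
    """Label `sample` by majority vote among its k nearest rows (Manhattan distance)."""
    def dist(row):
        # Compare features only; the last element of each row is the label
        return sum(abs(a - b) for a, b in zip(row[:-1], sample[:-1]))

    nearest = sorted(train_rows, key=dist)[:k]
    votes = Counter(row[-1] for row in nearest)
    return votes.most_common(1)[0][0]

toy_train = [
    [0.0, 0.0, 1],
    [0.1, 0.2, 1],
    [0.2, 0.1, 1],
    [5.0, 5.0, 2],
    [5.1, 4.9, 2],
]
# The last element of each query is a placeholder for the unknown label
print(knn_predict(toy_train, [0.1, 0.1, None], 3))  # 1
print(knn_predict(toy_train, [5.0, 5.1, None], 1))  # 2
```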

##### Example

In [70]:
print(test_set[0])
print(knn_manhattan(train_set, test_set[68], 15))

[ 0.97584395 -0.5702402   0.6918075   1.        ]
1


##### Counts the correct predictions of the test set with a given K

In [81]:
correct, K = 0, 15
list_manhattan =[]
for sample in test_set:
    label = knn_manhattan(train_set, sample, K)
    list_manhattan.append(label)
    if sample[-1] == label:
        correct += 1
# Count how many samples of the test set are predicted correctly (correct).

In [82]:
print("Train set size: %d" % len(train_set))
print("Test set size: %d" % len(test_set))
print("Correct predictions: %d" % correct)
print("Accuracy: %.2f%%" % (100 * correct / len(test_set)))

Train set size: 183
Test set size: 123
Correct predictions: 94
Accuracy: 76.42%


In [102]:
list_test=[]
for i in test_set:
    list_test.append(i[3])

In [103]:
list_test=pd.Series(list_test)

In [104]:
list_manhattan=pd.Series(list_manhattan)

In [117]:
result=pd.crosstab(list_test,list_manhattan,rownames=['True'],colnames=['Predicted'],margins=True)

In [118]:
result.rename(index={1:"True Negative",2:"True Positive","All":"total"},columns={1:"Predicted Negative",2:"Predicted Positive","All":"total"})

True \ Predicted,Predicted Negative,Predicted Positive,total
True Negative,84,9,93
True Positive,20,10,30
total,104,19,123


## Q.3

In [119]:
TN=result[1][1]
FN=result[1][2]
FP=result[2][1]
TP=result[2][2]
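N=result['All'][1] # total of samples whose actual label is 1
P=result['All'][2] # total of samples whose actual label is 2
N_=result[1]['All'] # total of samples predicted as 1
P_=result[2]['All'] # total of samples predicted as 2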

In [120]:
print("Accuracy : %.2f%%"%(100*(TN+TP)/(TN+FP+FN+TP)))
print("Precision : %.2f%%"%(100*TP/P_))
print("Recall : %.2f%%"%(100*TP/P))
print("False Positive(rate) : %.2f%%"%(100*FP/N))
print("True Positive(rate) : %.2f%%"%(100*TP/P))

Accuracy : 76.42%
Precision : 52.63%
Recall : 33.33%
False Positive(rate) : 9.68%
True Positive(rate) : 33.33%
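
As a cross-check, the same metrics can be computed directly from the four cells of the confusion matrix above (TN=84, FP=9, FN=20, TP=10), treating label 2 as the positive class. This is a minimal sketch using the standard definitions, where the false positive rate is FP / (FP + TN) and the true positive rate is the same quantity as recall; `binary_metrics` is a hypothetical helper, not part of the original notebook:

```python
def binary_metrics(tn, fp, fn, tp):
    """Return common binary classification metrics as fractions in [0, 1]."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),            # correct among predicted positives
        "recall": tp / (tp + fn),               # true positive rate
        "false_positive_rate": fp / (fp + tn),  # wrong among actual negatives
    }

m = binary_metrics(tn=84, fp=9, fn=20, tp=10)
for name, value in m.items():
    print("%s: %.2f%%" % (name, 100 * value))
# accuracy: 76.42%
# precision: 52.63%
# recall: 33.33%
# false_positive_rate: 9.68%
```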
