# Classification using Haberman's Survival Data Set

This is a reimplementation of the K-Nearest Neighbors algorithm using plain Python.

In my opinion, it is important to understand how the algorithm works at a low level, not just how to call a library abstraction.

Data Set: [Haberman's Survival Data Set](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival)

In [1]:
import math  # provides sqrt and pow for the distance function

In [2]:
data = []  # create an empty list to hold the samples

##### Import the data and append it to the list

Each row holds four integers: [age_of_the_patient, year_of_operation, number_of_nodes_detected, survival_status]

Check the data set's link above for more details.

In [3]:
with open('dataset.data', 'r') as f:  # open the data file for reading
    for line in f:  # iterate over the file one line (a string) at a time
        # strip() removes the trailing newline; split(',') breaks the line into its fields
        attributes = line.strip().split(',')
        # convert each field from string to integer and store the row
        data.append([int(x) for x in attributes])
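
The loader above assumes every line holds four comma-separated integers, so a blank line would make `int('')` raise a `ValueError`. As a minimal sketch of a more tolerant variant, the standard `csv` module can do the splitting and empty lines can be skipped. The `load_rows` helper and the `SAMPLE` rows are hypothetical, made up here only to illustrate the format.

```python
import csv
import io

# Hypothetical rows in the four-column format described above
# (illustrative values, including one blank line on purpose).
SAMPLE = "30,64,1,1\n30,62,3,1\n\n34,59,0,2\n"

def load_rows(handle):
    """Parse comma-separated integer rows from a file-like object, skipping blank lines."""
    rows = []
    for fields in csv.reader(handle):
        if not fields:  # csv.reader yields an empty list for an empty line
            continue
        rows.append([int(x) for x in fields])
    return rows

rows = load_rows(io.StringIO(SAMPLE))
print(rows)
```

With a real file, the same function could be called as `load_rows(open('dataset.data', newline=''))`, assuming the file uses the same layout.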

##### Auxiliary function to help the visualization
It also returns key information about the data set: the total number of samples and the count for each label.

In [4]:
def info_dataset(data, verbose=True):
    """Count the samples per label; print a summary when verbose is True."""
    label1, label2 = 0, 0  # counters for label 1 and label 2
    data_size = len(data)  # total number of samples
    for datum in data:
        if datum[-1] == 1:  # the last element of each row is the label
            label1 += 1
        else:               # any other value is counted as label 2
            label2 += 1
    if verbose:
        # %d is replaced by the integer that follows the % operator
        print('Total of samples: %d' % data_size)
        print('Total label 1: %d' % label1)
        print('Total label 2: %d' % label2)
    return [len(data), label1, label2]  # [total, label 1 count, label 2 count]

In [5]:
info_dataset(data)  # print the summary for the full data set
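
For comparison, the same per-label count can be sketched with `collections.Counter` from the standard library. This is only an alternative illustration; the `label_counts` helper and the `toy` rows are hypothetical and not part of the notebook.

```python
from collections import Counter

def label_counts(data):
    """Count samples per class label (the last element of each row)."""
    return Counter(row[-1] for row in data)

# Hypothetical rows in the same four-column layout
toy = [[30, 64, 1, 1], [34, 59, 0, 2], [38, 69, 21, 2]]
counts = label_counts(toy)
print(counts[1], counts[2])
```

Unlike `info_dataset`, this version does not fold every non-1 value into label 2, so an unexpected label would show up under its own key.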

##### Define the train/total percentage

In [6]:
p = 0.6  # fraction of the data used for training
# '_' discards the first return value (the total size); only the label counts are kept
_, label1, label2 = info_dataset(data, False)

##### Split the data set into train set and test set

In [7]:
train_set, test_set = [], []  # empty lists for the two subsets

# Target number of training samples for each label
max_label1, max_label2 = int(p * label1), int(p * label2)

# Running counters of the samples placed in the train set
total_label1, total_label2 = 0, 0

for sample in data:
    # Fill the train set until it holds max_label1 + max_label2 samples
    if (total_label1 + total_label2) < (max_label1 + max_label2):
        train_set.append(sample)
        if sample[-1] == 1 and total_label1 < max_label1:
            total_label1 += 1  # label 1 sample, label 1 target not yet reached
        else:
            # Reached for label 2 samples, and also for label 1 samples once
            # the label 1 target is full, so this is not a pure label 2 count
            total_label2 += 1
    else:
        test_set.append(sample)  # every remaining sample goes to the test set
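
As far as the loop above goes, a sample is appended to the train set before its label is checked, so the train set is simply the first `max_label1 + max_label2` rows in file order and its class mix depends on how the file is ordered. If a split that keeps the class proportions is wanted, one possible sketch is below. The `stratified_split` function and the `toy` data are hypothetical and are not what the notebook runs.

```python
def stratified_split(data, p):
    """Send the first int(p * n_c) samples of each class c to the train set."""
    totals = {}
    for row in data:
        totals[row[-1]] = totals.get(row[-1], 0) + 1
    quota = {label: int(p * n) for label, n in totals.items()}
    taken = {label: 0 for label in totals}

    train, test = [], []
    for row in data:
        label = row[-1]
        if taken[label] < quota[label]:  # this class still has room in the train set
            train.append(row)
            taken[label] += 1
        else:
            test.append(row)
    return train, test

# Hypothetical toy data sorted by label: four rows of label 1, then four of label 2
toy = [[i, 1] for i in range(4)] + [[i, 2] for i in range(4)]
train, test = stratified_split(toy, 0.5)
print(train)
print(test)
```

On this toy input, taking the first four rows in order would put only label 1 in the train set, while the sketch keeps two rows of each label on both sides.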

##### Define a function to calculate the Euclidean distance between two points
[Euclidean Distance - Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance)

In [8]:
def euclidian_dist(p1, p2):
    dim, sum_ = len(p1), 0
    # The last element is the label, so only the first dim - 1 values are features
    for index in range(dim - 1):
        sum_ += math.pow(p1[index] - p2[index], 2)
    return math.sqrt(sum_)
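
A small worked example may make the distance concrete. For two rows whose features differ by 3, 4 and 0, the distance is sqrt(3² + 4² + 0²) = 5. The sketch below computes it on the feature slice only; the `euclidean` helper and the two rows are hypothetical values chosen to give a round result.

```python
import math

def euclidean(features_a, features_b):
    """Euclidean distance between two equally long feature sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(features_a, features_b)))

# Hypothetical rows: three features followed by the label
p1 = [30, 64, 1, 1]
p2 = [33, 60, 1, 2]

# Slice off the label so it does not contribute to the distance
d = euclidean(p1[:-1], p2[:-1])
print(d)  # 5.0
```

On Python 3.8 or later, `math.dist(p1[:-1], p2[:-1])` computes the same value.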

##### Calculate the distance between a given sample and every sample in the train set
The function stores the distances in a dictionary, sorts it, and takes the K nearest neighbors.
It then counts which label occurs most often among them and returns it.

In [9]:
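def knn(train_set, new_sample, K):
    dists, train_size = {}, len(train_set)

    # Map each training index to its distance from the new sample
    for i in range(train_size):
        d = euclidian_dist(train_set[i], new_sample)
        dists[i] = d

    # Indices of the K training samples with the smallest distances
    k_neighbors = sorted(dists, key=dists.get)[:K]

    # Count the labels among the K neighbors
    qty_label1, qty_label2 = 0, 0
    for index in k_neighbors:
        if train_set[index][-1] == 1:
            qty_label1 += 1
        else:
            qty_label2 += 1

    # Majority vote; a tie is resolved in favor of label 2
    if qty_label1 > qty_label2:
        return 1
    else:
        return 2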
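
The same idea (find the K nearest rows, then take a majority vote) can be sketched more compactly with `heapq.nsmallest` and `collections.Counter` from the standard library. This is an alternative illustration, not the notebook's implementation; `knn_vote` and the `toy_train` rows are hypothetical.

```python
import heapq
import math
from collections import Counter

def knn_vote(train_set, new_sample, k):
    """Return the most common label among the k training rows nearest to new_sample."""
    def dist(row):
        # Compare features only; the last element of each row is the label
        return math.sqrt(sum((a - b) ** 2
                             for a, b in zip(row[:-1], new_sample[:-1])))

    nearest = heapq.nsmallest(k, train_set, key=dist)
    votes = Counter(row[-1] for row in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training rows: three features followed by the label
toy_train = [[1, 1, 1, 1], [2, 1, 1, 1], [1, 2, 1, 1],
             [9, 9, 9, 2], [8, 9, 9, 2]]

near_first_group = knn_vote(toy_train, [1, 1, 2, 2], 3)
near_second_group = knn_vote(toy_train, [9, 9, 8, 1], 3)
print(near_first_group, near_second_group)
```

One difference worth noting: on a tied vote, `knn` above returns 2, while `most_common` returns whichever tied label it saw first. With two classes, an odd K avoids ties altogether.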

##### Example

In [10]:
print(test_set[0])
print(knn(train_set, test_set[0], 12))

##### Counts the correct predictions of the test set with a given K

In [11]:
correct, K = 0, 15  # number of correct predictions, number of neighbors

for sample in test_set:
    label = knn(train_set, sample, K)  # predicted label for this test sample
    if sample[-1] == label:  # compare with the true label
        correct += 1

In [12]:
print("Train set size: %d" % len(train_set))
print("Test set size: %d" % len(test_set))
print("Correct predictions: %d" % correct)
# Accuracy is measured on the test set, so divide by its size
print("Accuracy: %.2f%%" % (100 * correct / len(test_set)))

Train set size: 183
Test set size: 123
Correct predictions: 93
Accuracy: 75.61%
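
Accuracy is the share of test samples predicted correctly, so the denominator is the size of the test set. The sketch below wraps that in a function and evaluates a few values of K. It is self-contained, so it uses its own small `predict` helper with the same voting rule as `knn` (ties go to label 2); the helper names and the toy data are hypothetical.

```python
import math

def predict(train, sample, k):
    """Majority vote among the k nearest rows; ties go to label 2."""
    def dist(row):
        return math.sqrt(sum((a - b) ** 2
                             for a, b in zip(row[:-1], sample[:-1])))
    nearest = sorted(train, key=dist)[:k]
    ones = sum(1 for row in nearest if row[-1] == 1)
    return 1 if ones > len(nearest) - ones else 2

def accuracy(train, test, k):
    """Percentage of test samples whose predicted label matches the true label."""
    correct = sum(1 for s in test if predict(train, s, k) == s[-1])
    return 100 * correct / len(test)

# Hypothetical data: two features followed by the label
toy_train = [[1, 1, 1], [1, 2, 1], [2, 1, 1],
             [8, 8, 2], [9, 8, 2], [8, 9, 2]]
toy_test = [[2, 2, 1], [9, 9, 2], [7, 7, 1]]

results = {k: accuracy(toy_train, toy_test, k) for k in (1, 3, 5)}
for k, acc in results.items():
    print("K=%d accuracy: %.2f%%" % (k, acc))
```

The same loop over K could be run with `train_set` and `test_set` from the notebook to see how sensitive the result is to the choice of K.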
