# KNN IMPLEMENTATION USING NUMPY EXCLUSIVELY
The K-Nearest Neighbors (KNN) algorithm is a supervised learning method that predicts a test point from the training points closest to it. It can be applied to both regression and classification problems.

## K Nearest Neighbor
As the name suggests, the K nearest neighbors are the K training points closest to a test point. K can be tuned to improve the prediction for the test point. Neighbors are ranked by distance, which can be computed with a metric such as the Euclidean or Manhattan distance.
The algorithm can be implemented in three steps.

### 1. DISTANCE CALCULATION
Calculate the distance from each test point to every training point. We will use the Euclidean distance.
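As a rough illustration of this step, the sketch below computes the distance from one test point to every row of a training array. The helper names (`euclidean_distances`, `manhattan_distances`) and the toy arrays are made up for this example and are not part of the notebook's code; the Manhattan variant is included only because it is mentioned above as an alternative.

```python
import numpy as np

def euclidean_distances(test_point, train_points):
    """Euclidean distance from one test point to every row of train_points."""
    diff = np.asarray(train_points, dtype=float) - np.asarray(test_point, dtype=float)
    return np.sqrt(np.sum(diff ** 2, axis=1))

def manhattan_distances(test_point, train_points):
    """Manhattan (city-block) distance from one test point to every row."""
    diff = np.asarray(train_points, dtype=float) - np.asarray(test_point, dtype=float)
    return np.sum(np.abs(diff), axis=1)

# Toy data, invented for illustration only
toy_train = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
toy_test = np.array([0.0, 0.0])

print(euclidean_distances(toy_test, toy_train))  # distances 0, 5 and sqrt(2)
print(manhattan_distances(toy_test, toy_train))  # distances 0, 7 and 2
```

Broadcasting subtracts the single test row from every training row at once, so no explicit Python loop is needed.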

### 2. SORTING 
Select the K closest points (for example, with k = 4 we select the 4 training points closest to the test point, i.e. its 4 neighbors). This can be done by sorting the distances from lowest to highest and taking the first K values. In this step we collect both the K smallest distances and the original indices of the corresponding training points.
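A minimal sketch of this selection, assuming the distances are already available as a 1-D array. The helper `k_nearest` and the toy distances are hypothetical names and values used only here.

```python
import numpy as np

def k_nearest(distances, k):
    """Return the k smallest distances and the indices they came from."""
    distances = np.asarray(distances, dtype=float)
    # argsort gives the indices that would sort the array in ascending order
    nearest_idx = np.argsort(distances)[:k]
    return distances[nearest_idx], nearest_idx

# Toy distances, invented for illustration only
toy_distances = np.array([2.5, 0.3, 1.7, 0.9, 4.0])
nearest_dist, nearest_idx = k_nearest(toy_distances, k=3)

print(nearest_dist)  # [0.3 0.9 1.7]
print(nearest_idx)   # [1 3 2]
```

Indexing the distances with the `argsort` result keeps distances and indices aligned, so the array only has to be sorted once.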

### 3. PREDICTING
#### Classification:
There are two common ways to classify a test point. The first is the uniform method, which only counts the classes of the neighbors. For instance, with k = 3 there are 3 closest neighbors; if 2 of them belong to class A and the other one belongs to class B, the test point is classified as class A. This method ignores how close each neighbor is. The second is inverse distance weighting, which weights each neighbor's vote by the inverse of its distance to the test point, so closer neighbors count more. The algorithm computes a weighted share (a probability-like score) for each class and assigns the test point to the class with the largest share.
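The difference between the two voting schemes can be sketched as follows. The function names and toy values are assumptions made for this example, and the weighted version assumes that no neighbor distance is exactly zero.

```python
import numpy as np

def uniform_vote(neighbor_labels):
    """Majority vote: every neighbor counts equally."""
    classes, counts = np.unique(neighbor_labels, return_counts=True)
    return classes[np.argmax(counts)]

def weighted_vote(neighbor_labels, neighbor_distances):
    """Inverse-distance-weighted vote (assumes all distances are > 0)."""
    neighbor_labels = np.asarray(neighbor_labels)
    weights = 1.0 / np.asarray(neighbor_distances, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    classes = np.unique(neighbor_labels)
    # Total normalized weight of the neighbors in each class
    class_share = np.array([weights[neighbor_labels == c].sum() for c in classes])
    return classes[np.argmax(class_share)], class_share

# Toy neighbors (k = 3): one very close neighbor of class 1, two distant ones of class 0
toy_labels = np.array([1, 0, 0])
toy_dists = np.array([0.1, 2.0, 2.5])

print(uniform_vote(toy_labels))              # 0 (two votes against one)
print(weighted_vote(toy_labels, toy_dists))  # class 1 wins with a share of about 0.92
```

In this toy case the two methods disagree: the majority vote picks class 0, while the single very close neighbor dominates the weighted vote.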
#### Regression: 
Simple KNN regression takes the mean of the neighbors' target values. It can also use inverse distance weighting. Inverse distance weighting is often considered more robust, but it breaks down when a test point has exactly the same features as a training point: the closest distance is then zero and its inverse is infinite, which is undesirable for the analysis. In that case we fall back to the mean method.
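A small sketch of the two regression rules and the zero-distance fallback described above. The helper `knn_regress_sketch` and the toy numbers are hypothetical; it takes the neighbors' targets and distances as already selected.

```python
import numpy as np

def knn_regress_sketch(neighbor_targets, neighbor_distances):
    """Inverse-distance-weighted mean, falling back to the plain mean on an exact match."""
    targets = np.asarray(neighbor_targets, dtype=float)
    distances = np.asarray(neighbor_distances, dtype=float)
    if np.any(distances == 0):
        # An exact match would give an infinite weight, so use the plain mean
        return targets.mean()
    weights = 1.0 / distances
    return np.sum(weights * targets) / np.sum(weights)

# Toy neighbors, invented for illustration only
print(knn_regress_sketch([1.0, 3.0], [1.0, 3.0]))  # weighted mean: 1.5
print(knn_regress_sketch([1.0, 3.0], [0.0, 3.0]))  # exact match, plain mean: 2.0
```

In the first call the weights are 1 and 1/3, so the closer target (1.0) pulls the prediction below the plain mean of 2.0.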

Let's begin.


## Importing NumPy

In [8]:
import numpy as np

### Import Classification and Regression Data

In [2]:
# local Path here
path=r'D:\Directory/'

# Classification file
class_file_name = "classification.csv"

# Regression file
regr_file_name  = "regression.csv"

## Part 1 - KNN classification

### 1.1 KNN classification algorithm

This algorithm uses inverse distance weighting to calculate the weighted share of each class among the K nearest neighbors.

In [3]:
def knn_classify(test, train, k):
    # NOTE: neighbor labels come from the global `train_y`, which must be
    # row-aligned with `train` and defined before this function is called.

    # Euclidean distance from the test point to every training point
    dis = np.sqrt(np.sum((np.asarray(train) - test) ** 2, axis=1))

    # Indices and distances of the k nearest neighbors
    nind = np.argsort(dis)[:k]
    ndis = dis[nind]

    # Labels of the k nearest neighbors
    row_pred = train_y[nind]

    # All classes present in the labelled data
    label_classes = np.unique(train_y)

    # Inverse distance weighting. An exact match (distance 0) would give an
    # infinite weight, so in that case only the exact matches get a vote.
    if np.any(ndis == 0):
        inv = (ndis == 0).astype(float)
    else:
        inv = 1 / ndis
    # Normalize the weights so they sum to 1
    m_inv = inv / np.sum(inv)

    # Sum the normalized weights of the neighbors in each class
    weighted_vote_count = []
    for label in label_classes:
        index = row_pred == label
        weighted_vote_count.append(np.sum(m_inv[index]))

    # Predict the class with the largest weighted vote
    probable_class_index = np.argmax(weighted_vote_count)
    return label_classes[probable_class_index]

### 1.2 Data Analysis

In [4]:
# Loading data from the local machine; [1:] skips the first row (presumably the CSV header)
data=np.genfromtxt(path+class_file_name, delimiter=',')[1:]


#Splitting the dataset into training (first 60%), validation (next 20%) and test (last 20%) sets.
tr_split=int(0.6*len(data))
val_split=int(0.8*len(data))

#Training dataset
train=data[:tr_split]
train_x=train[:,1:]
train_y=train[:,0].astype(int)

#Validation dataset
val=data[tr_split:val_split]
val_x=val[:,1:]
val_y=val[:,0].astype(int)

#Testing dataset
test=data[val_split:]
test_x=test[:,1:]
test_y=test[:,0].astype(int)



# KNN Classification Implementation
validation_result=[]

#Looping over candidate "K" values
for k in range(2,40):
    val_predict=[]
    
    #Looping over each sample of the validation data
    for v in val_x:
        val_predict.append(knn_classify(v,train_x,k))
    
    # Calculating Percent Correct Predictions for Validation
    val_pc=np.mean(np.array(val_predict) == val_y)*100
    
    #Appending Percent Correct Predictions value corresponding to each "K" value.
    validation_result.append([val_pc,k])

# Sorting by Percent Correct Predictions, highest first (the sort is stable, so ties keep the smallest "K")
validation_result.sort(key=lambda x: x[0],reverse=True)

# Best Percent Correct Predictions and "K" value
best_validation_result = validation_result[0][0]
best_k = validation_result[0][1]

# Predicting against test data using best "K" value   
test_predict=[]   
for t in test_x:
    test_predict.append(knn_classify(t,train_x,best_k))

# Calculating Percent Correct Predictions for Test
test_pc=np.mean(np.array(test_predict) == test_y)*100


#Results
print(f'Percentage Correct Prediction of Validation Data = {np.round(best_validation_result,2)} %')
print(f'Percentage Correct Prediction of Test Data = {np.round(test_pc,2)} %')
print(f'Best K Nearest Neighbor value (k) = {np.round(best_k,1)}\n')

Percentage Correct Prediction of Validation Data = 85.0 %
Percentage Correct Prediction of Test Data = 87.5 %
Best K Nearest Neighbor value (k) = 2



## Part 2 - KNN Regression

### 2.1 KNN regression algorithm

In [5]:
def knn_regression(test, train, k):
    # NOTE: neighbor targets come from the global `train_y`, which must be
    # row-aligned with `train` and defined before this function is called.

    # Euclidean distance from the test point to every training point
    dis = np.sqrt(np.sum((np.asarray(train) - test) ** 2, axis=1))

    # Indices and distances of the k nearest neighbors
    nind = np.argsort(dis)[:k]
    ndis = dis[nind]

    # Target values of the k nearest neighbors
    row_pred = train_y[nind]

    # A zero minimum distance would make the inverse distance infinite,
    # so in that case we use the plain mean of the neighbors' targets.
    if ndis[0] == 0.0:
        label = row_pred.mean()
    # Otherwise, we use the inverse-distance-weighted mean
    else:
        inv = 1 / ndis
        label = np.matmul(inv, row_pred) / np.sum(inv)
    return round(label, 2)

### 2.2 Data Analysis

In [7]:
# Loading data from the local machine; [1:] skips the first row (presumably the CSV header)
regr_data=np.genfromtxt(path+regr_file_name, delimiter=',')[1:]

#Splitting the dataset into training (first 60%), validation (next 20%) and test (last 20%) sets.
tr_split=int(0.6*len(regr_data))
val_split=int(0.8*len(regr_data))

#Training dataset
train=regr_data[:tr_split]
train_x=train[:,1:]
train_y=train[:,0]

#Validation dataset
val=regr_data[tr_split:val_split]
val_x=val[:,1:]
val_y=val[:,0]

#Testing dataset
test=regr_data[val_split:]
test_x=test[:,1:]
test_y=test[:,0]



#KNN Regression Implementation
validation_result=[]

# Looping over candidate "K" values
for k in range(2,15):
    val_predict=[]
    
    #Looping over each sample of the validation data
    for v in val_x:
        val_predict.append(knn_regression(v,train_x,k))
    
    # Calculating Residual Sum of Squares (RSS) for validation
    val_rss=np.sum(np.square(np.array(val_predict)-val_y))
    
    # Appending RSS value corresponding to each "K" value.
    validation_result.append([val_rss,k])

# Sorting by RSS, lowest first (the sort is stable, so ties keep the smallest "K")
validation_result.sort(key=lambda x: x[0])

# Best RSS and "K" value
best_validation_result = validation_result[0][0]
best_k = validation_result[0][1]

# Predicting against test data using best "K" value
test_predict=[]   
for t in test_x:
    test_predict.append(knn_regression(t,train_x,best_k))
    
# Calculating Residual Sum of Square (RSS) for Test
test_rss=np.sum(np.square(test_predict-test_y))


#Results
print('---K Nearest Neighbour Regression:')
print(f'Residual Sum of Square (RSS) of Validation Data = {np.round(best_validation_result,2)} ')
print(f'Residual Sum of Square (RSS) of Test Data = {np.round(test_rss,2)} ')
print(f'Best K Nearest Neighbor value (k) = {np.round(best_k,1)}\n')

---K Nearest Neighbour Regression:
Residual Sum of Square (RSS) of Validation Data = 0.04 
Residual Sum of Square (RSS) of Test Data = 0.05 
Best K Nearest Neighbor value (k) = 3

---Linear Regression:
Residual Sum of Square (RSS) of Test Data = 0.1 
