# Implementing k-NN

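This notebook implements a k-nearest neighbor (k-NN) classifier with NumPy and applies it to the Titanic dataset to predict passenger survival. The code proceeds in six steps: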

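1. The code preprocesses the data: it keeps six columns, drops rows with missing values, encodes `Sex` as 0/1, rescales each feature to the range [0, 1], splits the rows 80/20 into training and test sets, and prepends a constant intercept column.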
2. The code defines a function `distance(x1, x2)` that calculates the Euclidean distance between two data points `x1` and `x2`.
3. The code defines a function `knn(X_train, Y_train, X_test, k=9)` that classifies each point in the test data `X_test` from its `k` nearest neighbors in the training data `X_train` and `Y_train`. The function returns the predicted values for the test data.
4. The code applies the `knn` function to the training and test data with `k=9` to predict the survival of each passenger in the test set.
5. The code calculates the accuracy of the model by comparing the predicted survival values to the actual survival values in the test set.
6. The code prints the accuracy of the model rounded to two decimal places.

The `distance` function calculates the Euclidean distance between two data points by iterating over each feature and summing the squared differences between the corresponding feature values. The function then takes the square root of the sum to obtain the final distance.
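
The same computation can also be written without an explicit loop. The following is a minimal sketch, not the notebook's own implementation: the name `distance_vectorized` is hypothetical, and the cast to `float` is an assumption made so that object-dtype rows are handled too.

```python
import numpy as np

def distance_vectorized(x1, x2):
    """Euclidean distance between two equal-length numeric vectors (hypothetical helper)."""
    # Cast to float arrays so lists and object-dtype rows behave the same way
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    # Sum the squared per-feature differences, then take the square root
    return float(np.sqrt(np.sum(diff ** 2)))

# Worked example: the hypotenuse of a 3-4-5 right triangle
d = distance_vectorized([0.0, 0.0], [3.0, 4.0])
print(d)  # 5.0
```

For two-feature points this is the Pythagorean theorem: the squared differences are 9 and 16, their sum is 25, and the square root is 5.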

The `knn` function has no separate training phase, because k-NN simply keeps the training data as a reference set. For each test data point, the function computes the distance to every training point with the `distance` function, keeps the `k` nearest neighbors, and assigns the majority class among them. It returns the predicted class labels for all test data points.
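
The neighbor search and majority vote can be illustrated on a tiny made-up dataset. This is a hedged sketch under stated assumptions: `knn_vectorized` and the toy arrays are hypothetical, and ties are broken in favor of the smallest label, which may differ from the loop-based `knn` when `k` is even.

```python
import numpy as np

def knn_vectorized(X_train, Y_train, X_test, k=9):
    """Predict a label for each test row by majority vote (hypothetical helper)."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    Y_train = np.asarray(Y_train)
    Y_pred = []
    for x in X_test:
        # Euclidean distance from x to every training row at once
        dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
        # Indices of the k smallest distances
        nearest = np.argsort(dists, kind="stable")[:k]
        # Count each label among the neighbors and keep the most frequent one
        labels, counts = np.unique(Y_train[nearest], return_counts=True)
        Y_pred.append(labels[np.argmax(counts)])
    return np.array(Y_pred)

# Toy data (made up): two well-separated clusters labelled 0 and 1
X_toy_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
Y_toy_train = [0, 0, 0, 1, 1, 1]
X_toy_test = [[0.5, 0.5], [5.5, 5.5]]

toy_pred = knn_vectorized(X_toy_train, Y_toy_train, X_toy_test, k=3)
print(toy_pred)  # [0 1]
```

Each test point sits inside one cluster, so its three nearest neighbors all carry that cluster's label and the vote is unanimous.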

In [1]:
import numpy as np
import pandas as pd
# DO NOT use any other import statements for this question

df = pd.read_csv('titanic.csv')
data = df[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch']].dropna().copy()
# Encode Sex numerically: male -> 1, female -> 0
data["Sex"] = data["Sex"].map({"male": 1, "female": 0})
# Convert to a float array so every column is numeric
data = np.array(data, dtype=float)
X, Y = data[:, 1:], data[:, 0]

# Min-max scale every column to [0, 1]
for c in range(X.shape[1]):
    col_min, col_max = min(X[:, c]), max(X[:, c])
    X[:, c] = (X[:, c] - col_min) / (col_max - col_min)

# break into train/test
split = int(0.8 * data.shape[0])

X_train = X[:split]
X_test = X[split:]
Y_train = Y[:split]
Y_test = Y[split:]

# Add intercept term to X (a constant column of ones; it adds 0 to every
# distance, so k-NN predictions are unaffected)
X_train = np.concatenate([np.ones((X_train.shape[0], 1)), X_train], axis=1)
X_test = np.concatenate([np.ones((X_test.shape[0], 1)), X_test], axis=1)

In [2]:
def distance(x1, x2):
    """
    Calculates the Euclidean distance between two points x1 and x2.
    
    Args:
      x1                  : Data point, a numeric vector of size n (number of features)
      x2                  : Data point, a numeric vector of size n (number of features)
      
    Returns:
      d (float)           : Euclidean distance (scalar) between x1 and x2 in feature space 
      """
    
    d = 0.0
    
    # ===================== CODE HERE ======================
    for i in range(len(x1)):
        # Accumulate the squared difference for feature i
        d += (x1[i] - x2[i]) ** 2
    d = np.sqrt(d)
    # ===========================================================
    
    return d

In [4]:
def knn(X_train, Y_train, X_test, k=9):
    """
    Predicts labels for X_test with the k nearest neighbour rule, using
    X_train, Y_train as the reference dataset (k-NN has no separate training step).
    
    Args:
      X_train  : Training data, ndarray, n examples with m features
      Y_train  : ndarray vector of n target values
      X_test   : Test data, ndarray, n_test examples with m features
      k (int)  : number of neighbour points to be used by k-NN
      
    Returns:
      Y_pred   : ndarray vector of n_test predicted values, one for each test example
      """
    Y_pred = []

    # ===================== CODE HERE ======================

    for i in range(X_test.shape[0]):
        distances = []
        for j in range(X_train.shape[0]):
            dist = distance(X_test[i], X_train[j])
            distances.append((dist, Y_train[j]))
        distances = sorted(distances)
        neighbors = distances[:k]
        classes = {}
        for n in neighbors:
            if n[1] not in classes:
                classes[n[1]] = 0
            classes[n[1]] += 1
        Y_pred.append(sorted(classes.items(), key=lambda x: x[1], reverse=True)[0][0])

    # ============================================================= 
        
    return np.array(Y_pred)

In [5]:
"""
Applying knn function coded above to calculate the accuracy

"""
accuracy = 0

# ===================== CODE HERE ======================

Y_pred = knn(X_train, Y_train, X_test, k=9)
accuracy = np.sum(Y_pred == Y_test) / len(Y_test)
# ===========================================================

print('Please copy the following result line to Question 5 "(Accuracy = )"')
print(np.round(accuracy,2))

Please copy the following result line to Question 5 "(Accuracy = )"
0.83
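
The reported accuracy is the fraction of test passengers whose predicted label matches the actual one. As a small illustration of that calculation, here is a sketch on made-up labels (not the Titanic data); the helper name `accuracy_simple` and both label lists are hypothetical.

```python
import numpy as np

def accuracy_simple(y_pred, y_true):
    """Fraction of predictions equal to the true labels (hypothetical helper)."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    # The mean of a boolean array is the share of True entries
    return float(np.mean(y_pred == y_true))

# Made-up labels: 4 of the 5 predictions match
toy_labels_pred = [1, 0, 1, 1, 0]
toy_labels_true = [1, 0, 0, 1, 0]

acc = accuracy_simple(toy_labels_pred, toy_labels_true)
print(np.round(acc, 2))  # 0.8
```

Converting both inputs to arrays first keeps the comparison element-wise even when the predictions arrive as a plain Python list.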
