
In [1]:
# Import Necessary Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from collections import Counter

import warnings
warnings.filterwarnings("ignore")

# Data
This notebook uses the Iris dataset.

The Iris dataset is a classic dataset in the field of machine learning and statistics. It was introduced by the British statistician and biologist Ronald Fisher in 1936 as an example of discriminant analysis. The dataset consists of 150 samples from three different species of iris flowers (Iris setosa, Iris versicolor, and Iris virginica). For each sample, four features are measured: the length and width of the sepals and petals, in centimeters. The dataset is often used as a benchmark for classification tasks, as it is relatively small but has well-defined class labels and features that make it suitable for evaluating various machine learning algorithms. The goal is to classify iris flowers into the correct species based on the measured features.

In [2]:
data_path = "/kaggle/input/iris/Iris.csv"

In [3]:
df = pd.read_csv(data_path)

# set Id column as index
df = df.set_index("Id")

print("Shape:", df.shape)
df

Shape: (150, 5)


| Id  | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species        |
| --- | ------------- | ------------ | ------------- | ------------ | -------------- |
| 1   | 5.1           | 3.5          | 1.4           | 0.2          | Iris-setosa    |
| 2   | 4.9           | 3.0          | 1.4           | 0.2          | Iris-setosa    |
| 3   | 4.7           | 3.2          | 1.3           | 0.2          | Iris-setosa    |
| 4   | 4.6           | 3.1          | 1.5           | 0.2          | Iris-setosa    |
| 5   | 5.0           | 3.6          | 1.4           | 0.2          | Iris-setosa    |
| ... | ...           | ...          | ...           | ...          | ...            |
| 146 | 6.7           | 3.0          | 5.2           | 2.3          | Iris-virginica |
| 147 | 6.3           | 2.5          | 5.0           | 1.9          | Iris-virginica |
| 148 | 6.5           | 3.0          | 5.2           | 2.0          | Iris-virginica |
| 149 | 6.2           | 3.4          | 5.4           | 2.3          | Iris-virginica |


## Preprocessing
Apply label encoding to the `Species` column.

In [4]:
df['Species'].value_counts()

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

In [5]:
df.sample(5)

| Id  | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species         |
| --- | ------------- | ------------ | ------------- | ------------ | --------------- |
| 107 | 4.9           | 2.5          | 4.5           | 1.7          | Iris-virginica  |
| 133 | 6.4           | 2.8          | 5.6           | 2.2          | Iris-virginica  |
| 70  | 5.6           | 2.5          | 3.9           | 1.1          | Iris-versicolor |
| 150 | 5.9           | 3.0          | 5.1           | 1.8          | Iris-virginica  |
| 40  | 5.1           | 3.4          | 1.5           | 0.2          | Iris-setosa     |


In [6]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

df['Species'] = encoder.fit_transform(df['Species'])

In [7]:
df.sample(5)

| Id  | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
| --- | ------------- | ------------ | ------------- | ------------ | ------- |
| 76  | 6.6           | 3.0          | 4.4           | 1.4          | 1       |
| 93  | 5.8           | 2.6          | 4.0           | 1.2          | 1       |
| 112 | 6.4           | 2.7          | 5.3           | 1.9          | 2       |
| 127 | 6.2           | 2.8          | 4.8           | 1.8          | 2       |
| 8   | 5.0           | 3.4          | 1.5           | 0.2          | 0       |
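
After encoding, the species names are replaced by integers, so it helps to know which integer stands for which species. The sketch below is a small illustration on a hypothetical four-item list (not the notebook's DataFrame). It reads the mapping from the encoder's `classes_` attribute and converts codes back with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the `Species` column (assumption, not the real data)
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ is sorted, so the position of each name is its integer code
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer labels back into species names
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```

Because `LabelEncoder` sorts the class names, `Iris-setosa`, `Iris-versicolor`, and `Iris-virginica` map to 0, 1, and 2 respectively, which matches the sampled rows above.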


## Data split

In [8]:
X = df
y = X.pop("Species")

X = X.values
y = y.values

X.shape, y.shape

((150, 4), (150,))

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
	X,
	y,
	test_size=0.1,
	random_state=42,
	stratify=y
)

print("X_train.shape:", X_train.shape)
print("X_val.shape:", X_val.shape)
print("y_train.shape:", y_train.shape)
print("y_val.shape:", y_val.shape)

X_train.shape: (135, 4)
X_val.shape: (15, 4)
y_train.shape: (135,)
y_val.shape: (15,)
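
The `stratify=y` argument keeps the class proportions the same in both splits. As a quick check, the sketch below counts the labels per split. It assumes the copy of Iris bundled with scikit-learn (`load_iris`) is an acceptable stand-in for `Iris.csv`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for the Kaggle CSV
X, y = load_iris(return_X_y=True)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

# np.bincount counts how many samples carry each integer label
train_counts = np.bincount(y_train)
val_counts = np.bincount(y_val)
print("train counts per class:", train_counts)
print("val counts per class:", val_counts)
```

With 50 samples per class and a 10% validation share, each class contributes 45 training and 5 validation samples.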


# Model

Create KNN from Scratch

In [10]:
class CustomKNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)

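    def _predict(self, x):
        # Euclidean distance from x to every training sample
        distances = [np.sqrt(np.sum((x - x_train)**2)) for x_train in self.X_train]
        # indices of the k closest training samples
        k_indices = np.argsort(distances)[:self.k]
        k_nearest_labels = [self.y_train[i] for i in k_indices]
        # majority vote among the k nearest labels
        most_common = Counter(k_nearest_labels).most_common(1)
        return most_common[0][0]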
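
The class above computes distances one training sample at a time in a Python loop. As an alternative, the same idea can be sketched with NumPy broadcasting, which computes all query-to-train distances in one step. The function name `knn_predict_vectorized` and the toy arrays are hypothetical, and the vote assumes non-negative integer labels.

```python
import numpy as np


def knn_predict_vectorized(X_train, y_train, X_query, k=3):
    """Majority-vote KNN with all distances computed in one broadcasted step."""
    # diff has shape (n_query, n_train, n_features)
    diff = X_query[:, None, :] - X_train[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=2))
    # indices of the k closest training samples for each query row
    k_indices = np.argsort(distances, axis=1)[:, :k]
    k_labels = y_train[k_indices]
    # majority vote per row; assumes non-negative integer labels
    return np.array([np.bincount(row).argmax() for row in k_labels])


# Toy data (assumption): two well-separated clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.05, 0.05], [4.9, 5.2]])

pred = knn_predict_vectorized(X_train, y_train, X_query, k=3)
print(pred)
```

One difference to keep in mind: on a tied vote, `np.bincount(...).argmax()` picks the smallest label, which may not match how `Counter.most_common` resolves the tie.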

Import the scikit-learn implementation for comparison

In [11]:
from sklearn.neighbors import KNeighborsClassifier as KNN

## Training

Initialize the model configurations

In [12]:
container = {
	"custom_knn_3": {
		"description": "Custom KNN with k=3",
		"model": CustomKNN(3),
		
	},
	"custom_knn_5": {
		"description": "Custom KNN with k=5",
		"model": CustomKNN(5),
		
	},
	"custom_knn_7": {
		"description": "Custom KNN with k=7",
		"model": CustomKNN(7),
		
	},
	"knn_3": {
		"description": "Scikit KNN with k=3",
		"model": KNN(3),
		
	},
	"knn_5": {
		"description": "Scikit KNN with k=5",
		"model": KNN(5),
		
	},
	"knn_7": {
		"description": "Scikit KNN with k=7",
		"model": KNN(7),
		
	},
}


Fit training data

In [13]:
from sklearn.metrics import accuracy_score

for model_type, model_container in container.items():
	
	# load model object
	description = model_container['description']
	model = model_container['model']

	# train the model
	i = time.time()
	model.fit(X_train, y_train)
	f = time.time()
	time_to_fit = round(f - i, 3)

	# generate predictions
	i = time.time()
	pred = model.predict(X_val)
	f = time.time()
	time_to_predict = round(f - i, 3)

	# calculate accuracy (accuracy_score expects y_true first, then y_pred)
	train_accuracy = accuracy_score(
		y_train,
		model.predict(X_train)
	)
	# reuse the validation predictions computed above
	val_accuracy = accuracy_score(
		y_val,
		pred
	)

	# print out the results
	print(f"""> Model: {model_type}
	Description: {description}
	Time to Fit: {time_to_fit}
	Time to Predict: {time_to_predict}
	Train Accuracy: {round(train_accuracy, 3)}
	Validation Accuracy: {round(val_accuracy, 3)}
	""")


> Model: custom_knn_3
	Description: Custom KNN with k=3
	Time to Fit: 0.0
	Time to Predict: 0.014
	Train Accuracy: 0.956
	Validation Accuracy: 1.0
	
> Model: custom_knn_5
	Description: Custom KNN with k=5
	Time to Fit: 0.0
	Time to Predict: 0.014
	Train Accuracy: 0.963
	Validation Accuracy: 1.0
	
> Model: custom_knn_7
	Description: Custom KNN with k=7
	Time to Fit: 0.0
	Time to Predict: 0.014
	Train Accuracy: 0.97
	Validation Accuracy: 1.0
	
> Model: knn_3
	Description: Scikit KNN with k=3
	Time to Fit: 0.001
	Time to Predict: 0.003
	Train Accuracy: 0.956
	Validation Accuracy: 1.0
	
> Model: knn_5
	Description: Scikit KNN with k=5
	Time to Fit: 0.001
	Time to Predict: 0.002
	Train Accuracy: 0.963
	Validation Accuracy: 1.0
	
> Model: knn_7
	Description: Scikit KNN with k=7
	Time to Fit: 0.001
	Time to Predict: 0.002
	Train Accuracy: 0.97
	Validation Accuracy: 1.0
	


Based on the output, we can make the following conclusions:

1. **Custom KNN vs. Scikit KNN**: For each value of k, the custom and scikit-learn implementations report identical train and validation accuracies. The custom implementation does, however, take noticeably longer to predict.

2. **Effect of k**: Increasing k (the number of neighbors) from 3 to 7 leaves the validation accuracy at 1.0 for both the custom and scikit-learn models, while the train accuracy rises slightly from 0.956 to 0.97. This suggests that the dataset is not very sensitive to the choice of k in this range, although a 15-sample validation set cannot resolve small differences.

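3. **Prediction time**: The custom KNN is slower at prediction than scikit-learn, at about 0.014 s versus 0.002 to 0.003 s on the validation set. A likely reason is that the custom `_predict` loops over every training sample in Python, while scikit-learn uses optimized routines. Fit time is negligible for all models. These are single `time.time()` measurements on a small dataset, so they are indicative only.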

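4. **Overfitting**: There is no sign of overfitting in any of the models: the validation accuracy (1.0) is at least as high as the train accuracy (0.956 to 0.97). Because the validation set holds only 15 samples, a perfect score is a coarse estimate of how well the models generalize to unseen data.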
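
The points about k and overfitting rest on a single 15-sample validation split. One way to get a steadier estimate is k-fold cross-validation, sketched below. It assumes the copy of Iris bundled with scikit-learn (`load_iris`) stands in for `Iris.csv`, and the choice of 5 folds is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled Iris data stands in for the Kaggle CSV
X, y = load_iris(return_X_y=True)

# 5 stratified folds; every sample is used for validation exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for k in (3, 5, 7):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=cv)
    cv_results[k] = (scores.mean(), scores.std())
    print(f"k={k}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Each fold validates on 30 samples rather than 15, and the standard deviation across folds gives a rough sense of how much the accuracy depends on the particular split.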

In conclusion, both custom and Scikit-learn KNN implementations perform well on the dataset, with similar accuracies and minimal overfitting. The choice between the two implementations may depend on the specific requirements of the application, such as speed and ease of use.
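
Since speed is one of the criteria, it is worth noting that a single `time.time()` difference on a call lasting a few milliseconds is noisy. A more repeatable measurement can be sketched with the standard-library `timeit` module. The helper name `time_predict` is hypothetical, and the bundled `load_iris` data is assumed as a stand-in for `Iris.csv`.

```python
import timeit

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier


def time_predict(model, X, number=20, repeat=5):
    """Return the best observed time per `predict` call, in seconds.

    Hypothetical helper: works with any object exposing `predict(X)`.
    """
    runs = timeit.repeat(lambda: model.predict(X), number=number, repeat=repeat)
    # the minimum is the run least disturbed by other processes
    return min(runs) / number


# Assumption: the bundled Iris data stands in for the Kaggle CSV
X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

best = time_predict(model, X)
print(f"best predict time: {best * 1e3:.3f} ms per call")
```

The same helper could be called with a fitted `CustomKNN` instance, since it only relies on the `predict` method.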

**Find More Labs**

This lab is from my Machine Learning course, which is part of my [Software Engineering](https://seecs.nust.edu.pk/program/bachelor-of-software-engineering-for-fall-2021-onward) degree at [NUST](https://nust.edu.pk).

The notebooks listed below cover a range of topics in **machine learning** and **data analysis**, implemented from scratch or with popular libraries like **NumPy**, **pandas**, **scikit-learn**, **seaborn**, and **matplotlib**. They include introductory material on NumPy showcasing its efficiency for mathematical operations, **linear regression**, **logistic regression**, **decision trees**, **K-nearest neighbors (KNN)**, **support vector machines (SVM)**, **Naive Bayes**, **K-means** clustering, principal component analysis (**PCA**), and **neural networks** with **backpropagation**. Each notebook demonstrates practical implementation and application of these algorithms on various datasets such as the **California Housing** dataset, **MNIST** dataset, **Iris** dataset, **Auto-MPG** dataset, and the **UCI Adult Census Income** dataset. The notebooks also cover **gradient descent optimization**, model evaluation metrics (e.g., **accuracy, precision, recall, f1 score**), **regularization** techniques (e.g., **Lasso**, **Ridge**), and **data visualization**.

| Title                                                                                                                   | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| ----------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [01 - Intro to Numpy](https://www.kaggle.com/code/sacrum/ml-labs-01-intro-to-numpy)                                     | The notebook demonstrates NumPy's efficiency for mathematical operations like array `reshaping`, `sigmoid`, `softmax`, `dot` and `outer products`, `L1 and L2 losses`, and matrix operations. It highlights NumPy's superiority over standard Python lists in speed and convenience for scientific computing and machine learning tasks.                                                                                                                                                                                              |
| [02 - Linear Regression From Scratch](https://www.kaggle.com/code/sacrum/ml-labs-02-linear-regression-from-scratch)     | This notebook implements `linear regression` and `gradient descent` from scratch in Python using `NumPy`, focusing on predicting house prices with the `California Housing Dataset`. It defines functions for prediction, `MSE` calculation, and gradient computation. Batch gradient descent is used for optimization. The dataset is loaded, scaled, and split. `Batch, stochastic, and mini-batch gradient descents` are applied with varying hyperparameters. Finally, the MSEs of the predictions from each method are compared. |
| [03 - Logistic Regression from Scratch](https://www.kaggle.com/code/sacrum/ml-labs-03-logistic-regression-from-scratch) | This notebook outlines the implementation of `logistic regression` from scratch in Python using `NumPy`, including functions for prediction, loss calculation, gradient computation, and batch `gradient descent` optimization, applied to the `MNIST` dataset for handwritten digit recognition and to the `Iris` data. It also includes metrics such as `accuracy`, `precision`, `recall`, and `f1 score`. |
| [04 - Auto-MPG Regression](https://www.kaggle.com/code/sacrum/ml-labs-04-auto-mpg-regression)                           | The notebook uses `pandas` for data manipulation, `seaborn` and `matplotlib` for visualization, and `sklearn` for `linear regression` and `regularization` techniques (`Lasso` and `Ridge`). It includes data loading, processing, visualization, model training, and evaluation on the `Auto-MPG dataset`.                                                                                                                                                                                                                           |
| [05 - Decision Trees from Scratch](https://www.kaggle.com/code/sacrum/ml-labs-05-desicion-trees-from-scratch)           | In this notebook, the `DecisionTree` algorithm is implemented from scratch and applied to a dummy dataset. |
| [06 - KNN from Scratch](https://www.kaggle.com/code/sacrum/ml-labs-06-knn-from-scratch)                                 | In this notebook, `K-Nearest Neighbour` algorithm has been implemented from scratch and compared with KNN provided in scikit-learn package                                                                                                                                                                                                                                                                                                                                                                                            |
| [07 - SVM](https://www.kaggle.com/code/sacrum/ml-labs-07-svm)                                                           | This notebook implements `SVM classifier` on `Iris Dataset`                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| [08 - Naive Bayes](https://www.kaggle.com/code/sacrum/ml-labs-08-naive-bayes)                                           | This notebook trains `Naive Bayes` and compares it with other algorithms `Decision Trees`, `SVM` and `Logistic Regression`                                                                                                                                                                                                                                                                                                                                                                                                            |
| [09 - K-means](https://www.kaggle.com/code/sacrum/ml-labs-09-k-means)                                                   | In this notebook `K-means` algorithm has been implemented using `scikit-learn` and different values of `k` are compared to understand the `elbow method` in `Calinski Harabasz Scores`                                                                                                                                                                                                                                                                                                                                                |
| [10 - UCI Adult Census Income](https://www.kaggle.com/code/sacrum/ml-labs-10-uci-adult-census-income)                   | Here I have used the UCI Adult Income dataset and applied different machine learning algorithms to find the best model configuration for predicting salary from the given information                                                                                                                                                                                                                                                                                                                                                 |
| [11 - PCA](https://www.kaggle.com/code/sacrum/ml-labs-11-pca)                                                           | `Principal Component Analysis` implemented from scratch |
| [12 - Neural Networks](https://www.kaggle.com/code/sacrum/ml-labs-12-neural-networks)                                   | This code implements neural networks with back propagation from scratch                                                                                                                                                                                                                                                                                                                                                                                                                                                               |