<a href="https://colab.research.google.com/github/cagBRT/Machine-Learning/blob/master/KNN4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook experiments with the K-nearest neighbors (KNeighbors) classification algorithm. <br>
It then compares the error of the algorithm for different values of K. <br>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

**Get the data**

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names to the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 
         'Class']
# Read dataset to pandas dataframe
dataset = pd.read_csv(url, names=names)

In [None]:
dataset.head()

**Select all the features for training.** <br>
**Select the final column (Class) for the labels**

In [None]:
# Features: every column except the last one
X = dataset.iloc[:, :-1].values
# Labels: the final column (Class)
y = dataset.iloc[:, -1].values

**Split the data into training and test sets**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

**Scale the data**

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
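The scaler is fit on the training set only and then applied to both sets. The following is a minimal sketch of what that does, not part of the original notebook. It assumes scikit-learn's bundled copy of the iris data (`load_iris`) instead of the URL above, and the `*_demo` names are hypothetical so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: use the iris copy shipped with scikit-learn (no download needed)
X_demo, y_demo = load_iris(return_X_y=True)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0)

# Learn the per-feature mean and standard deviation from the training set only
scaler_demo = StandardScaler().fit(X_train_demo)
X_train_scaled = scaler_demo.transform(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)

# Training features end up with mean 0 and standard deviation 1
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
# Test features are scaled with the training statistics, so their
# means are generally close to, but not exactly, 0
test_means = X_test_scaled.mean(axis=0)

print("train means:", np.round(train_means, 3))
print("train stds: ", np.round(train_stds, 3))
print("test means: ", np.round(test_means, 3))
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.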


**Train the model**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

**Make predictions using the test set features**

In [None]:
y_pred = classifier.predict(X_test)

**Calculate the confusion matrix and the classification report**

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
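As a reading aid, here is a small sketch of how a confusion matrix is laid out in scikit-learn: rows are the true classes and columns are the predicted classes, so the diagonal counts the correct predictions. The labels `"a"`, `"b"`, `"c"` and the two arrays are made-up illustration data, not output from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for six samples
y_true_demo = np.array(["a", "a", "b", "b", "c", "c"])
y_pred_demo = np.array(["a", "b", "b", "b", "c", "a"])

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=["a", "b", "c"])
print(cm_demo)

# Correct predictions sit on the diagonal, so accuracy = trace / total
acc_from_cm = np.trace(cm_demo) / cm_demo.sum()
print("accuracy from matrix:", acc_from_cm)
print("accuracy_score:      ", accuracy_score(y_true_demo, y_pred_demo))
```

Off-diagonal entries show which classes are confused with each other, which a single accuracy number hides.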

**Plot and compare the model performance for different values of K**

In [None]:
error = []

# Calculate the test error for K values from 1 to 39
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')


**Assignment 1**<br>
What value of K should you select?<br>
What should you consider when making the decision? 
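One thing to consider is that the error from a single train/test split is noisy, so the K with the lowest error can change from run to run. A possible way to get a steadier estimate is cross-validation. The sketch below is one option under stated assumptions, not the notebook's method: it uses scikit-learn's bundled iris data (`load_iris`), 5 folds, and a pipeline so the scaler is refit inside each fold. The names `k_values`, `cv_errors`, and `best_k` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: use the iris copy shipped with scikit-learn
X_cv, y_cv = load_iris(return_X_y=True)

k_values = list(range(1, 40))
cv_errors = []
for k in k_values:
    # Scaling happens inside the pipeline, so each fold is scaled
    # using only its own training portion
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_cv, y_cv, cv=5)  # accuracy per fold
    cv_errors.append(1.0 - scores.mean())

# K with the lowest mean cross-validated error (first one on ties)
best_k = k_values[int(np.argmin(cv_errors))]
print("K with the lowest mean CV error:", best_k)
```

Several neighboring K values often have nearly the same error, so the lowest point is a guide rather than a single correct answer.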

**Assignment 2**<br>
The code below creates a synthetic dataset. <br>
1. Modify the size of the dataset<br>
2. Modify the ratio of train and test sets<br>

Record what happens to the confusion matrix and the K plot. <br>

In [None]:
from sklearn.datasets import make_classification
from random import randint

# Generate balanced data: three classes with roughly equal weights.
# random_state is drawn at random, so each run may produce a different dataset.
X,y = make_classification(n_samples=100, n_features=2, 
                          n_redundant=0, n_repeated=0, n_classes=3, 
                          n_clusters_per_class=1,class_sep=2,
                          flip_y=0,weights=[0.33,0.33,0.33], 
                          random_state=randint(0,20))

In [None]:
n_neighbors=2

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Fit a KNN classifier with uniform weights (every neighbor votes equally)
clf_U = KNeighborsClassifier(n_neighbors=n_neighbors, weights='uniform')
U_trained = clf_U.fit(X_train, y_train)
y_U_pred = clf_U.predict(X_test)

In [None]:
print(confusion_matrix(y_test, y_U_pred))
print(classification_report(y_test, y_U_pred))

In [None]:
error = []

# Calculate the test error for K values from 1 to 39
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))

In [None]:
plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
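For Assignment 2, it can help to record the results in one place instead of rerunning cells by hand. The sketch below is one possible way to do that. The helper `knn_test_error`, the chosen sizes and test ratios, the fixed `seed`, and K=5 are all assumptions made for illustration; adjust them to match your own experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def knn_test_error(n_samples, test_size, k=5, seed=0):
    """Hypothetical helper: build a synthetic dataset, fit KNN, return the test error."""
    X_s, y_s = make_classification(
        n_samples=n_samples, n_features=2, n_redundant=0, n_repeated=0,
        n_classes=3, n_clusters_per_class=1, class_sep=2, flip_y=0,
        random_state=seed)
    X_a, X_b, y_a, y_b = train_test_split(
        X_s, y_s, test_size=test_size, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=k).fit(X_a, y_a)
    # Fraction of test samples that were misclassified
    return float(np.mean(model.predict(X_b) != y_b))


# Record the error for each (dataset size, test ratio) combination
results = {}
for n in (100, 500, 1000):
    for ratio in (0.2, 0.33, 0.5):
        results[(n, ratio)] = knn_test_error(n, ratio)

for (n, ratio), err in results.items():
    print(f"n_samples={n:5d}  test_size={ratio:.2f}  error={err:.3f}")
```

A fixed seed is used here so the comparison is repeatable; the cell above draws `random_state` at random, so its results vary between runs.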