# Lecture 10a. Classification using KNN scikit learn implementation.

### The EDA part of this lecture is a refactoring of a nice [blog post](https://medium.com/codex/machine-learning-k-nearest-neighbors-algorithm-with-python-df94b374ad41) that could be improved a lot. Note: the code from this blog post does not work with the current versions of the modules we use.   
**Let's all repeat together once more: "Version Control is a fundamental skill if you want to learn programming and one of the first things to check when debugging."**


### The *iris* plants dataset.
[iris flowers](https://en.wikipedia.org/wiki/Iris_(plant))  
[iris setosa](https://en.wikipedia.org/wiki/Iris_setosa)  
[Iris Versicolour](https://en.wikipedia.org/wiki/Iris_versicolor)  
[Iris Virginica](https://en.wikipedia.org/wiki/Iris_virginica)

[Info about dataset](https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.names)    


### Notes:  
* Classify an observation in the data by measuring its distance from an integer K of nearest observations.  
  Simple counting (voting) of K neighbours.    
  "Tell me who your friends are and I can tell you who you are."  

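* KNN is a **"supervised learning"** algorithm: we know the labels.  
The goal is to classify observations into known categories.

* **"Lazy learning"**: there is no separate training phase. The model stores the whole dataset and postpones all computation until prediction time. In this lecture we also fit on the whole dataset, without splitting it into train, test and validation sets.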
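
The voting idea in the notes above can be sketched in a few lines of plain Python. This is only an illustration, not how scikit-learn implements it: the function name `knn_predict` and the toy points are made up for this example, and Euclidean distance is assumed.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest stored points."""
    # Euclidean distance from the query to every stored observation.
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k smallest distances.
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Each neighbour casts one vote for its own label.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical values, not taken from the iris dataset).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["small", "small", "small", "large", "large", "large"]

prediction_near_small = knn_predict(points, labels, query=(1.1, 1.0), k=3)
prediction_near_large = knn_predict(points, labels, query=(4.0, 4.0), k=3)
print(prediction_near_small, prediction_near_large)
```

Note that nothing is "learned" before the query arrives: all the work happens inside the prediction call, which is what "lazy" refers to.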

[Intro to KNN by my favorite tutor](https://www.youtube.com/watch?v=HnCHdeyJNOM)  

Appropriate for: 
* **"tall skinny datasets"**: many instances, few features.
* uniformly shaped datasets.

**Selecting K**:  
The optimal K is highly data-dependent: in general a larger K suppresses the effects of noise, but makes the classification boundaries less distinct.   
Default scikit learn value is 5 neighbours. Not optimal, just a default value.      
The number of neighbours K is the only parameter you need to set for a basic tuning implementation.  

**Important "keyword only" parameters of scikit learn implementation and their default values:**
* weights='uniform'. All neighbors have equal weights ("voting power"=1). With weights='distance', the "voting power" of a neighbour decreases as its distance from the query point increases. 
* metric='minkowski'. Metric used to compute distance. Default “minkowski” results in the standard Euclidean distance when p=2.  
  See the documentation of [scipy.spatial.distance](https://docs.scipy.org/doc/scipy/reference/spatial.distance.html) and the metrics listed in [distance_metrics](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics) for valid metric values.  
* p=2. Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2.
* algorithm='auto'. Chooses the most appropriate neighbour search algorithm ('ball_tree', 'kd_tree' or 'brute') based on the data passed to fit.
* leaf_size=30. The number of observations in the final "leaf node" of the BallTree or KDTree. Trade-offs in efficiency (RAM, speed).   
    Larger size: tends to make tree construction faster and reduces memory usage, but may slow down query time as more points are checked within each leaf.  
    Smaller size: leads to a more finely partitioned tree, potentially improving query speed, but can increase tree construction time and memory usage.  
    The tree search is exact, so the leaf size should not change which neighbours are found, and therefore not the accuracy.  
Balance leaf size for good performance in both tree construction and querying.  
* n_jobs=None. Used for parallel processors jobs. None means 1.
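
To make the role of p in the Minkowski metric concrete, here is a small hand-written sketch. The helper `minkowski_distance` is a hypothetical name written for this note (scikit-learn computes the distance internally), and the two points are arbitrary.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

With p=1 the differences are simply added up (city-block distance). With p=2 we get the familiar straight-line distance.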


### KNN implementation reading material:
[Knn user guide in scikit-learn](https://scikit-learn.org/stable/modules/neighbors.html#classification)

[Knn function and parameters in scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)


### Extras:
#### Alternative nearest neighbours classifier: Radius Neighbors
Classify based on the number of neighbors within a fixed radius of each training point. The radius is a floating-point value specified by the user.   
Useful when data are not uniformly sampled (e.g. geospatial, time intervals).   
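
A minimal sketch of the radius-based variant, using scikit-learn's RadiusNeighborsClassifier on made-up one-feature data. The toy values and the radius of 1.0 are assumptions chosen only so that every query has neighbours inside the radius.

```python
from sklearn.neighbors import RadiusNeighborsClassifier

# Toy one-feature data (hypothetical values): two well-separated groups.
X_toy = [[0.0], [0.5], [1.0], [5.0], [5.5], [6.0]]
y_toy = [0, 0, 0, 1, 1, 1]

# Every stored point within distance 1.0 of the query gets a vote.
radius_clf = RadiusNeighborsClassifier(radius=1.0)
radius_clf.fit(X_toy, y_toy)

radius_predictions = radius_clf.predict([[0.4], [5.6]])
print(radius_predictions)
```

Unlike K neighbours, the number of voters here varies from query to query: dense regions contribute many votes, sparse regions few.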

#### KNN Regression
[KNN Regression User Guide](https://scikit-learn.org/stable/modules/neighbors.html#regression)  
Data labels are continuous rather than discrete variables.   
The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors.
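
As a small illustration of the averaging rule, the sketch below uses scikit-learn's KNeighborsRegressor on made-up data. The values are hypothetical and chosen so the result is easy to check by hand.

```python
from sklearn.neighbors import KNeighborsRegressor

# Toy data (hypothetical values): one feature, continuous labels.
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0.0, 10.0, 20.0, 30.0]

knn_reg = KNeighborsRegressor(n_neighbors=2)
knn_reg.fit(X_toy, y_toy)

# The two nearest neighbours of x=1.4 are x=1.0 and x=2.0,
# so the prediction is the mean of their labels, 10 and 20.
prediction = knn_reg.predict([[1.4]])
print(prediction)
```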

### Import necessary modules.

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import style

from sklearn.neighbors import KNeighborsClassifier  # The algo function

from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler  # Just for presentation.

In [None]:
import sklearn.neighbors

## A. Examine the data, using graphs (EDA) and statistics.

In [None]:
# seaborn loads training datasets and returns a pandas dataframe
df = sns.load_dataset('iris')

In [None]:
type(df)

In [None]:
# seaborn loads training datasets and returns a pandas dataframe which contains X and y.
df.sample(5)

#### A1. Basic info about the data.

In [None]:
df.columns

In [None]:
df.info()

In [None]:
# Descriptive stats on all the data offer no insights concerning the 3 different categories (species).
df.describe()

In [None]:
# Species differences are obvious with simple descriptive stats.
df.groupby("species").describe().T

### Various plots for EDA

In [None]:
# sns.  # Use the tab key to get options in Jupyter lab.

In [None]:
# plt. # Use the tab key to get options in Jupyter lab.

In [None]:
# Matplotlib styles include seaborn styles currently.
style.available

In [None]:
# Set a common style for all plots.
# style.use('seaborn-whitegrid')  # This used to work a year ago. Not anymore.
# style.use("seaborn-v0_8-dark")  # Try this if you want.
style.use("dark_background")  # I prefer this one
plt.rcParams['figure.figsize'] = (14, 8)

#### A2. Sepal scatter visualization

In [None]:
# sns.scatterplot?

In [None]:
# Call functions with many parameters using one line for each parameter: Readability, flexibility, faster, cleaner code.
sns.scatterplot(data=df,
                x='sepal_length',
                y='sepal_width',
                hue='species',
                palette=['Red', 'Blue', 'Limegreen'],
                # palette = 'Set2', # Try this one.
                edgecolor='black',
                s=100,
                alpha=0.85  # try setting 0.5
)

plt.title('Sepal Length and Sepal width')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend(markerscale=1.5, loc="upper right", prop={'size': 14});  # Try different marker scales.

# plt.savefig("sepal.png")  # Can you guess what this line of code does?

In [None]:
# Spot the three differences in the code between this plot and the one above:
# a)
# b)
# c)

# Call functions with many parameters using one line for each parameter: Readability, flexibility, faster, cleaner code.
sns.scatterplot(
    x='sepal_length',
    y='sepal_width',
    data=df,  # a)
    hue='species',
    palette=['red', 'blue', 'limegreen'],  # b)
    edgecolor='b', # c) Can you guess if this is "black" or "blue"? Explicit is better than implicit.
    s=100,
    alpha=0.85
)

plt.title('Sepal Length to Sepal width')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend(markerscale=1.5, loc="upper right", prop={'size': 14});

# plt.savefig('sepal.png')  # Can you guess what this line of code does?

#### A3. Petal scatter visualization

In [None]:
sns.scatterplot(
    x='petal_length', y='petal_width', data=df, hue='species',
    palette = ['Red', 'Blue', 'limegreen'],
    edgecolor = 'w', s = 150, alpha = 0.7
)

plt.title('Petal Length To Petal Width')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.legend(loc = 'upper left', fontsize = 12);

# plt.savefig('petal.png')

In [None]:
# sns.heatmap?

#### A4. Data Heatmap

In [None]:
# df_corr = df.corr()  # This used to work until Pandas 2.0. Not anymore.

df_corr = df.corr(numeric_only=True)

sns.heatmap(
    df_corr, annot = True, 
    cmap = 'Blues',
    xticklabels = df_corr.columns.values,
    yticklabels = df_corr.columns.values
);

plt.title('Iris Data Heatmap', fontsize = 15);
plt.xticks(fontsize = 12);
plt.yticks(fontsize = 12);

# plt.savefig('heatmap.png')

#### A5. Scatter Matrix

In [None]:
# All features are numeric, so we use the data as is.
# Use the species to colour the points.
sns.pairplot(data=df, hue='species', palette=['Red', 'Blue', 'limegreen']);

# plt.savefig('iris_pairplot.png')

In [None]:
# 5. Distribution plot for all species of iris

ax1 = plt.subplot(211)
# sns.kdeplot(df['sepal_length'], color = 'r', shade = True)  # deprecation warning
# sns.kdeplot(df['sepal_width'], color = 'b', shade = True)
sns.kdeplot(df['sepal_length'], color = 'r', fill = True);
sns.kdeplot(df['sepal_width'], color = 'b', fill = True);
plt.xlabel('Sepal width and sepal length');
plt.legend(["Length", "Width"])


ax2 = plt.subplot(212)
# sns.kdeplot(df['petal_length'], color = 'coral', shade = True);  # deprecation warning
# sns.kdeplot(df['petal_width'], color = 'green', shade = True);
sns.kdeplot(df['petal_length'], color = 'coral', fill = True);
sns.kdeplot(df['petal_width'], color = 'green', fill = True);
plt.xlabel('Petal width and petal length');
plt.legend(["Length", "Width"])

# plt.savefig('dist.png')

## B. Apply knn, using scikit learn

When using a function from a library, a very important thing to know is the data type it expects as input. Also remember: libraries evolve fast.

### B1. Data Pre-processing 

#### B1.1 Separate:  
* "feature matrix" = (explanatory variables)
* "classification vector" = (target variable)

In [None]:
df.head(3)

In [None]:
# X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].to_numpy()
# Converting to a numpy array is not necessary: scikit-learn has accepted pandas dataframes for a long time. But that was not always the case.

X = df.iloc[:, :-1]#.to_numpy()
#X = df[["sepal_length",	"sepal_width", 	"petal_length", "petal_width"]]   # Same result as above, but what would you do if you had thousands of features?

y = df['species']#.to_numpy()

In [None]:
X.head(3)
# X[:5]  # same result, pythonic syntax

In [None]:
# 5 Random target variable values
y.sample(5)
# y[:5]

#### B1.2 Scale the data (if necessary)  

In this case, the features all have the same scale: all are measured in centimeters and their range is very small.  
So this step is NOT necessary for this dataset.    
Below are some examples of scaling methods, for you to practice.  
Try to see if you get different results.
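
Standardization, one of the scaling methods mentioned above, can be checked by hand with numpy. This is a sketch on made-up numbers, not the scikit-learn implementation. The matrix `X_toy` is hypothetical and deliberately mixes two very different scales.

```python
import numpy as np

# Toy feature matrix (hypothetical values): two features on different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Standardize each column: subtract its mean, divide by its standard deviation.
# np.std defaults to the population standard deviation (ddof=0).
X_standardized = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_standardized.mean(axis=0))  # column means, numerically zero
print(X_standardized.std(axis=0))   # column standard deviations, one
```

Because KNN relies on distances, a feature with a much larger range would otherwise dominate the neighbour search. After standardization both columns contribute on equal terms.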

In [None]:
# standardization = subtract the mean and divide by s. Mean becomes zero and s = 1
# X = StandardScaler().fit_transform(X.astype(float))

# X = MaxAbsScaler().fit_transform(X.astype(float))  # feature_range=[-1, 1]

# X = MinMaxScaler().fit_transform(X.astype(float))  # feature_range=(min, max)

# X[:5]  # After fit_transform, X is a numpy array by default, so use slicing instead of .head()

In [None]:
# Range of features values.
(X.max() - X.min())

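#### B1.3 Split data in train set and test set   

#### IMPORTANT: k-NN is a "lazy learning" algo with no training phase, so in this lecture we do NOT split into TRAIN & TEST. Keep in mind that accuracy measured on the same data used for fitting is optimistic.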

### B2. Implement the scikit library

#### B2.1 Select N of neighbours and classifier model

In [None]:
# get help in jupyter about the function and its parameters
KNeighborsClassifier?

In [None]:
# help(KNeighborsClassifier)  # Extensive documentation.

In [None]:
k = 5  # Number of neighbours

# name that we gave to the "algo" instance
neighbours_clf = KNeighborsClassifier(n_neighbors=k)

In [None]:
neighbours_clf = KNeighborsClassifier()

#### B2.2 Fit the model to the data (here: the whole dataset)

In [None]:
neighbours_clf  # Before fitting

In [None]:
neighbours_clf.fit(X, y);

In [None]:
neighbours_clf  # After fitting

#### B2.3 Create predictions (here: on the same data used for fitting)

In [None]:
# Predict the class of every observation in X (there is no separate test set here).
y_pred = neighbours_clf.predict(X)

In [None]:
y_pred

#### B2.4 predict a single observation

In [None]:
X.head(1)  # first row

In [None]:
X.iloc[0:1]  # first row, same as above different syntax

In [None]:
y_pred_obs_number_one = neighbours_clf.predict(X.iloc[0:1])  # predict class of first row
y_pred_obs_number_one

In [None]:
y[0:1]  # actual class value of first row

In [None]:
# Read the warning and understand it: the model was fitted with feature names, but this input has none. The prediction still works.
my_whatever_one_observation_feature_values = [[5.8, 2.8, 5.1, 2.4]]  # array of 4 random values, double brackets.
y_pred_my_whatever_values = neighbours_clf.predict(my_whatever_one_observation_feature_values)

y_pred_my_whatever_values = neighbours_clf.predict([[5.8, 2.8, 5.1, 2.4]])  # same as the line above.
y_pred_my_whatever_values

In [None]:
# Create DataFrame for the new observation to include headers
feature_names = df.columns[:-1]
feature_names

In [None]:
# Create a df (values, columns headers)
one_new_observation_df = pd.DataFrame(my_whatever_one_observation_feature_values, columns=feature_names)

In [None]:
y_pred_my_whatever_values = neighbours_clf.predict(one_new_observation_df)
y_pred_my_whatever_values

In [None]:
# Create predictions
y_pred_first_ten_rows = neighbours_clf.predict(X.iloc[:10])

In [None]:
# Call predictions
y_pred_first_ten_rows

In [None]:
# show classifier classes
neighbours_clf.classes_

In [None]:
# Show the predicted probability of each class for each observation
neighbours_clf.predict_proba(X)
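
With uniform weights, these probabilities are simply the share of the k neighbours that belong to each class. A small sketch on made-up data (the values and the name `toy_clf` are hypothetical) makes this easy to verify by hand:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical values): one feature, two classes.
X_toy = [[0.0], [1.0], [2.0], [10.0], [11.0]]
y_toy = ["a", "a", "b", "b", "b"]

toy_clf = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# The three nearest neighbours of x=0.5 are x=0.0 (a), x=1.0 (a) and x=2.0 (b),
# so two of the three votes go to class "a".
toy_proba = toy_clf.predict_proba([[0.5]])
print(toy_clf.classes_)  # column order of the probabilities
print(toy_proba)
```

This is why, with k=5, the only possible probability values are multiples of 0.2.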

In [None]:
# Old school python:
# Get the predicted probabilities for each sample
probabilities = neighbours_clf.predict_proba(X)

# Find the indices of samples where the maximum probability is not 1
indices = np.where(np.max(probabilities, axis=1) != 1)

# Print these samples and their corresponding probabilities
for i in indices[0]:
    print(f"Sample {i}:")
    print(X.iloc[i])
    print(f"Predicted probabilities: {probabilities[i]}")
    print("\n")

In [None]:
# Get the predicted probabilities for each sample
probabilities = neighbours_clf.predict_proba(X)

# Find the indices of samples where the maximum probability is not 1
indices = np.where(np.max(probabilities, axis=1) != 1)

# Initialize an empty DataFrame to store the results
results_df = pd.DataFrame(columns=["Sample"] + list(X.columns) + ["Predicted Probabilities"])

# Loop over the indices
for i in indices[0]:
    # Append the sample index, features and predicted probabilities to the DataFrame
    results_df.loc[len(results_df)] = [i] + list(X.iloc[i]) + [probabilities[i]]

# Print the results DataFrame
results_df

### B3. Evaluate model performance

#### B3.1 Accuracy score

In [None]:
# using the default score()  function of the classifier
neighbours_clf.score(X, y)

In [None]:
# using the scikit learn metric function. Same result
round(accuracy_score(y, y_pred) * 100, 4)

In [None]:
y[:10], y_pred[:10]

In [None]:
# Create df (column_headers: values)
y_df = pd.DataFrame({'y': y, 'y_pred': y_pred},
                    columns=['y', 'y_pred'])

In [None]:
# Display the dataframe
y_df

In [None]:
# Create new column with ["column name"] and values = the result of boolean equality y is equal to predicted y

y_df["prediction_outcome"] = (y_df.y == y_df.y_pred)

In [None]:
y_df.head()

In [None]:
# Filter y_df: keep only the rows where prediction_outcome is False.
y_df[y_df.prediction_outcome == False]

#### B3.2 Confusion Matrix

In [None]:
# Confusion matrix, manual implementation using pandas crosstab function
pd.crosstab(y, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

In [None]:
confusion_matrix?

In [None]:
# Confusion matrix, using ready scikit learn function
confusion_matrix(y, y_pred, labels=["setosa", "versicolor",	"virginica"])

In [None]:
# Confusion matrix, normalized over the true labels (each row sums to 1)
confusion_matrix(y, y_pred, normalize="true")
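
What normalize="true" does can be reproduced by dividing each row of the count matrix by its row total. The sketch below uses made-up labels (hypothetical values) so the numbers are easy to follow:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical values): one "a" is misclassified as "b".
y_true_toy = ["a", "a", "a", "b", "b"]
y_pred_toy = ["a", "a", "b", "b", "b"]

# Raw counts: rows are true classes, columns are predicted classes.
cm_counts = confusion_matrix(y_true_toy, y_pred_toy, labels=["a", "b"])

# Row-normalized version computed by scikit-learn.
cm_by_row = confusion_matrix(y_true_toy, y_pred_toy, labels=["a", "b"], normalize="true")

# The same normalization done by hand: divide each row by its total.
cm_manual = cm_counts / cm_counts.sum(axis=1, keepdims=True)

print(cm_counts)
print(cm_by_row)
```

The diagonal of the row-normalized matrix gives, for each true class, the fraction of its observations that were classified correctly.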

In [None]:
confusion_matrix(y, y_pred, labels=neighbours_clf.classes_)

In [None]:
cm = confusion_matrix(y, y_pred, labels=neighbours_clf.classes_)

In [None]:
# Get available colormap options for matplotlib
plt.colormaps()

In [None]:
ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=neighbours_clf.classes_
).plot(
    # cmap="inferno_r"
    cmap='rocket_r'
);

### B4. Compare different number of k for neighbors  
[simple loops](https://www.kaggle.com/jmataya/k-nearest-neighbors-classifier)

In [None]:
# Set up arrays to store the values of k and the accuracy for each k
neighbors = np.arange(1, 20)
accuracy = np.empty(len(neighbors))

In [None]:
# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn_clf = KNeighborsClassifier(n_neighbors=k)

    # Fit the classifier to the training data
    knn_clf.fit(X, y)
    
    #Compute accuracy on the training set
    accuracy[i] = knn_clf.score(X, y)

In [None]:
accuracy

In [None]:
accuracy[4]  # k =5, index starts from zero

In [None]:
# Generate plot
plt.title('k-NN: Using different Number of Neighbors')
plt.plot(neighbors, accuracy, label='Accuracy')
# plt.plot(neighbors, train_accuracy, label='Training Accuracy')
plt.xticks(range(1, 20))
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')

plt.show()

In [None]:
# Generate the same plot as above without setting any parameter values for labels or ticks
plt.plot(neighbors, accuracy, label='Accuracy')
plt.show()

In [None]:
type(accuracy)

In [None]:
max(accuracy)

In [None]:
# accuracy for 5 neighbours
accuracy[4]

In [None]:
# Apply knn with k=15
fit_neighbours_15 = KNeighborsClassifier(n_neighbors=15).fit(X, y)

In [None]:
y_pred = fit_neighbours_15.predict(X)

In [None]:
confusion_matrix(y, y_pred)

In [None]:
# Initialize variables to store the best number of neighbors and the highest accuracy
best_n = 0
highest_accuracy = 0

# Loop over different values of k from 4 to 30
for n_neighbors in range(4, 31):
    # Train the KNN model on the entire dataset
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X, y)

    # Make predictions on the entire dataset
    y_pred = knn.predict(X)

    # Calculate the accuracy of the model
    accuracy = accuracy_score(y, y_pred)

    # If the current model's accuracy is higher than our highest accuracy so far,
    # update our best number of neighbors and highest accuracy
    if accuracy > highest_accuracy:
        best_n = n_neighbors
        highest_accuracy = accuracy

# Print the best number of neighbors and the highest accuracy
print("Best number of neighbors:", best_n)
print("Highest accuracy:", highest_accuracy)

### Apply Knn with k=15 and use weights for distance

In [None]:
fit_neighbours_clf = KNeighborsClassifier(n_neighbors=15, weights='distance').fit(X, y)

In [None]:
y_pred = fit_neighbours_clf.predict(X)

In [None]:
confusion_matrix(y, y_pred)

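#### WOW: 100% accuracy!   
Careful: this score is measured on the same data the model was fitted on. With weights='distance' every observation is its own nearest neighbour at distance zero, so its own label dominates the vote and a perfect score is expected.  
Important Note: In most cases, uniform weights are not optimal, just a starting point.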
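
One thing worth checking before celebrating a perfect score: when we predict on the very observations used for fitting, distance weighting lets each point vote for itself with overwhelming weight. The sketch below shows this on deliberately noisy toy labels. All values are hypothetical.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical values) with overlapping classes around x=2 and x=3.
X_toy = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y_toy = [0, 0, 1, 0, 1, 1]

uniform_clf = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
distance_clf = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

# Both models are scored on the same data they were fitted on.
uniform_score = uniform_clf.score(X_toy, y_toy)
distance_score = distance_clf.score(X_toy, y_toy)

print("uniform weights: ", uniform_score)
print("distance weights:", distance_score)
```

The distance-weighted model reproduces every label, including the noisy ones, because each query point sits at distance zero from itself. That tells us nothing about how it would classify unseen observations.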

## C. Discuss optimal k in general and in case of a "tie".   


Tie-breaking rules:
* even or odd k,
* number of classes,
* distribution of classes (very important!),
* size of dataset,
* random class assignment,
* total distance weights (inverse distance),
* different distance metrics,
* loop over different k,
* k-Fold validation of data.
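
The last two ideas in the list, looping over different k and k-fold validation, can be combined. The sketch below is one possible way to do it, not a recipe: it uses load_iris from scikit-learn instead of the seaborn loader (only to stay self-contained), and the three k values and cv=5 are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# load_iris ships with scikit-learn and needs no download.
X_iris, y_iris = load_iris(return_X_y=True)

mean_cv_accuracy = {}
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: each fold is scored on observations
    # that were held out when the model was fitted.
    fold_scores = cross_val_score(knn, X_iris, y_iris, cv=5)
    mean_cv_accuracy[k] = fold_scores.mean()

for k, score in mean_cv_accuracy.items():
    print(f"k={k}: mean held-out accuracy = {score:.3f}")
```

Because every score comes from observations the model did not see during fitting, the comparison between different k values is fairer than accuracy measured on the fitting data itself.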

[Scikit learn Knn iris example](https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html#sphx-glr-auto-examples-neighbors-plot-classification-py)

# In class Exercise  
Implement KNN on a new dataset.

In [None]:
from sklearn import datasets

In [None]:
digits = datasets.load_digits()

In [None]:
dir(digits)

In [None]:
type(digits)  # custom Dictionary-like type

In [None]:
print(digits.keys())

In [None]:
print(digits.DESCR)

In [None]:
digits.target[100]

In [None]:
digits.target[1282]

In [None]:
# Display digit observation 100
plt.imshow(
    digits.images[100],
    cmap=plt.cm.gray_r,
    interpolation='nearest');

In [None]:
plt.imshow(digits.images[1282], cmap=plt.cm.gray_r, interpolation='nearest');

[Convert scikit learn data to pandas dataframe](https://stackoverflow.com/a/46379878)  (How it was done some years ago).

In [None]:
digits.data

In [None]:
digits.target_names

In [None]:
# digits.feature_names  # Show column names

### The floor is yours!

In [None]:
digits = datasets.load_digits(as_frame=True)

In [None]:
digits.data.head(5)

In [None]:
digits.target.sample(3)

In [None]:
digits.data["pixel_4_4"]

#### Fastest and optimal way to solve this:
Open the notebook in PyCharm and open the Copilot chat. Type in the prompt:   
"Solve the inclass Exercise without splitting to train and test and without scaling the data. Show max accuracy".

In [None]:
## This is what copilot returned.
# Split the dataset into features (X) and target (y)
digits = datasets.load_digits(as_frame=True)
X = digits.data
y = digits.target

# Initialize variables to store the best number of neighbors and the highest accuracy
best_n = 0
highest_accuracy = 0

# Loop over different values of k from 4 to 30
for n_neighbors in range(4, 31):
    # Train the KNN model on the entire dataset
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X, y)

    # Make predictions on the entire dataset
    y_pred = knn.predict(X)

    # Calculate the accuracy of the model
    accuracy = accuracy_score(y, y_pred)

    # If the current model's accuracy is higher than our highest accuracy so far,
    # update our best number of neighbors and highest accuracy
    if accuracy > highest_accuracy:
        best_n = n_neighbors
        highest_accuracy = accuracy

# Print the best number of neighbors and the highest accuracy
print("Best number of neighbors:", best_n)
print("Highest accuracy:", highest_accuracy)