# Lesson 03: From Clustering To Classification

## k-nearest neighbor clustering
Recap of the k-means clustering result:
![by Weston.pace, from commons.wikimedia.org under CC-BY-SA 3.0](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/K_Means_Example_Step_4.svg/278px-K_Means_Example_Step_4.svg.png)

- in this case, we have $k=3$ clusters and hence have produced a dataset of the form  
$\mathcal{D}_{k=3} = \{ \langle \vec{x}_{1}, f(\vec{x}_{1}) \rangle, \dots \}$
  + where $f(\vec{x}_{i})$ denotes the cluster label of sample $\vec{x}_{i}$, e.g. if $\vec{x}_{1}$ belongs to cluster $2$, then $f(\vec{x}_{1}) = 2$
  + in other words, $f$ represents the mathematical mapping that `kmeans` applies to each point of our dataset
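
To make the mapping $f$ more concrete: once k-means has converged, labelling a point amounts to looking up its nearest cluster centre. The following is only a minimal sketch under that assumption; the centroid coordinates, the sample points and the helper name `assign_cluster` are made up for illustration and are not taken from the figure above. Note that the labels are zero-based here.

```python
import math

# hypothetical cluster centres for k=3 (illustrative values only)
centroids = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]

def assign_cluster(x, centroids):
    """Return the index of the centroid closest to x (euclidean distance)."""
    distances = [math.dist(x, c) for c in centroids]
    return distances.index(min(distances))

points = [(0.5, 0.2), (4.8, 5.1), (0.3, 4.4)]

# the labelled dataset D_{k=3}: pairs of (x, f(x))
dataset = [(x, assign_cluster(x, centroids)) for x in points]
print(dataset)
```

Each entry of `dataset` plays the role of one pair $\langle \vec{x}_{i}, f(\vec{x}_{i}) \rangle$.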
  

## going for classification

- to use this for classification, we start from a fully labelled dataset, i.e. every sample $\vec{x}_{i}$ comes with a known class label $f(\vec{x}_{i})$
- given an unseen query point $\vec{x}_{q}$, we would like to know which class it belongs to

![by Sebastian Raschka, Stat 451: intro to ML](https://github.com/deeplearning540/lesson03/blob/main/images/raschka_knn_p28.png)


- there are multiple ways to decide which class the query point belongs to
![by Sebastian Raschka, Stat 451: intro to ML](https://github.com/deeplearning540/lesson03/blob/main/images/raschka_knn_p29.png)

- **note** that the choice of $k$ determines the radius of the circle in the image above; here $k=5$ was set


- the plurality vote is mathematically known as the **mode** of a discrete distribution, i.e. the category with the highest frequency wins
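
A plurality vote can be sketched with the standard library alone. The neighbor labels below are hypothetical (they are not read off the slide), and `plurality_vote` is an illustrative helper name, not part of any library.

```python
from collections import Counter

# hypothetical class labels of the k=5 nearest neighbors of a query point
neighbor_labels = ["triangle", "plus", "triangle", "circle", "triangle"]

def plurality_vote(labels):
    """Return the most frequent label, i.e. the mode of the labels."""
    # most_common(1) yields a list holding one (label, count) pair
    return Counter(labels).most_common(1)[0][0]

print(plurality_vote(neighbor_labels))  # prints: triangle
```

Note that a plurality does not require more than half of the votes: here `triangle` wins with 3 of 5, but it would also win with 2 of 5 if the other three votes went to three different classes.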

![by Sebastian Raschka, Stat 451: intro to ML](https://github.com/deeplearning540/lesson03/blob/main/images/raschka_knn_p30.png)

- important: which hyperparameters govern the decision boundary?
  + the choice of distance metric, e.g. euclidean distance
  + the number of neighbors to consider, i.e. $k$
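
To see how both hyperparameters enter the prediction, here is a minimal from-scratch sketch of a kNN classifier. It is not the implementation used later in this lesson; the function names `knn_predict` and `manhattan` as well as the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter

def knn_predict(query, data, k=3, distance=math.dist):
    """Classify `query` by a plurality vote among its k nearest neighbors.

    data: list of (point, label) pairs; distance: callable on two points.
    """
    # sort the labelled points by their distance to the query point
    by_distance = sorted(data, key=lambda pair: distance(query, pair[0]))
    # keep only the labels of the k closest points
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

def manhattan(a, b):
    """City-block distance, an alternative to the euclidean metric."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

# toy training data (made up for illustration)
train = [((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
         ((5.0, 5.0), "b"), ((6.0, 5.5), "b")]

print(knn_predict((4.0, 4.0), train, k=1))  # prints: b
print(knn_predict((4.0, 4.0), train, k=5))  # prints: a
print(knn_predict((5.5, 5.0), train, k=1, distance=manhattan))  # prints: b
```

The same query point `(4.0, 4.0)` changes its predicted class when $k$ goes from 1 to 5, which illustrates why $k$ has to be chosen with care.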
  

# Using kNN classification


## Data

For the following, I will rely (again) on the Palmer penguin dataset obtained from [this repo](https://github.com/allisonhorst/palmerpenguins). To quote the repo:

> Data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php)
> and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/).


In [2]:
import pandas as pd
print("pandas version:", pd.__version__)


pandas version: 1.0.5


In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")
# remove the rows where the bill length is missing (NaN)
df = df[df.bill_length_mm.notnull()].copy()

# convert the species column to a categorical type
df["species_"] = df["species"].astype("category")

print(df.head())


  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
4  Adelie  Torgersen            36.7           19.3              193.0   
5  Adelie  Torgersen            39.3           20.6              190.0   

   body_mass_g     sex  year species_  
0       3750.0    male  2007   Adelie  
1       3800.0  female  2007   Adelie  
2       3250.0  female  2007   Adelie  
4       3450.0  female  2007   Adelie  
5       3650.0    male  2007   Adelie  


In [4]:
print("species_ encoding:")
print("\n".join(f"{code} : {name}" for code, name in enumerate(df.species_.cat.categories)))

species_ encoding:
0 : Adelie
1 : Chinstrap
2 : Gentoo
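
The integers in this encoding are the category codes that pandas stores for a categorical column. The sketch below shows how to access them through `.cat.codes`; it uses a tiny stand-in table (the `toy` frame and its values are made up for illustration) so that it runs without downloading the penguin data.

```python
import pandas as pd

# tiny stand-in for the penguin table (values made up for illustration)
toy = pd.DataFrame({"species": ["Gentoo", "Adelie", "Chinstrap", "Adelie"]})
toy["species_"] = toy["species"].astype("category")

# the categories of a string column are sorted alphabetically
print(list(toy["species_"].cat.categories))

# each row is stored as an integer code that indexes into the categories
print(toy["species_"].cat.codes.tolist())
```

Such integer codes are one possible way to hand class labels to a classifier that expects numeric targets.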


# Further Reading

- some parts of this material were inspired by [Sebastian Raschka](https://sebastianraschka.com)
  + confusion matrix [lesson 12.1](https://www.youtube.com/watch?v=07dtryhNGms)
  + precision, recall and F1 score [lesson 12.2](https://youtu.be/yEw9oDdJkT0)
  
- Wikipedia is a generally good resource
  + [Confusion_matrix](https://en.wikipedia.org/wiki/Confusion_matrix)
  + [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)
  
- all of the above is nicely implemented and [documented](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) in scikit-learn