# Enter Clustering

Machine learning is commonly divided into three main categories:

- supervised learning
- unsupervised learning
- reinforcement learning

## Supervised Learning

- all data used is labelled (i.e. it comes with ground truth information)
- the algorithm is meant to predict an outcome
- the algorithm receives direct feedback


### Classification

<p><a href="https://commons.wikimedia.org/wiki/File:Svm_separating_hyperplanes.png#/media/File:Svm_separating_hyperplanes.png"><img src="https://upload.wikimedia.org/wikipedia/commons/2/20/Svm_separating_hyperplanes.png" alt="Svm separating hyperplanes.png" width="503" height="480"></a><br>By <a href="//commons.wikimedia.org/w/index.php?title=User:Cyc&amp;action=edit&amp;redlink=1" class="new" title="User:Cyc (page does not exist)">Cyc</a> - <span class="int-own-work" lang="en">Own work</span>, Public Domain, <a href="https://commons.wikimedia.org/w/index.php?curid=3566969">Link</a></p>



### Regression

<p><a href="https://commons.wikimedia.org/wiki/File:Linear_regression.svg#/media/File:Linear_regression.svg"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Linear_regression.svg/640px-Linear_regression.svg.png" alt="Linear regression.svg"></a><br>By <a href="//commons.wikimedia.org/w/index.php?title=User:Sewaqu&amp;action=edit&amp;redlink=1" class="new" title="User:Sewaqu (page does not exist)">Sewaqu</a> - <span class="int-own-work" lang="en">Own work</span>, Public Domain, <a href="https://commons.wikimedia.org/w/index.php?curid=11967659">Link</a></p>



## Unsupervised Learning

- there are **NO** labels (i.e. **WITHOUT** ground truth information)
- no feedback is provided to the algorithm
- goal: find hidden structure in data

<p><a href="https://commons.wikimedia.org/wiki/File:KMeans-Gaussian-data.svg#/media/File:KMeans-Gaussian-data.svg"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/KMeans-Gaussian-data.svg/1200px-KMeans-Gaussian-data.svg.png" alt="KMeans-Gaussian-data.svg"></a><br>By <a href="//commons.wikimedia.org/wiki/User:Chire" title="User:Chire">Chire</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=17085714">Link</a></p>
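
To make "finding hidden structure" a little more concrete, here is a minimal sketch of a k-means-style procedure on one-dimensional data. The data points, the starting centers, and the helper name `kmeans_1d` are all made up for illustration; this is not the implementation used elsewhere in this material.

```python
from typing import List


def kmeans_1d(points: List[float], centers: List[float], n_iter: int = 10) -> List[float]:
    """Lloyd-style k-means on 1D points, starting from the given centers."""
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        clusters: List[List[float]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (a center without any points stays where it is).
        centers = [
            sum(c) / len(c) if c else centers[k]
            for k, c in enumerate(clusters)
        ]
    return centers


# Made-up, unlabelled data with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)
```

Note that no labels enter the procedure at any point: the two groups emerge only from the distances between the points.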

## Reinforcement Learning

- model a decision process
- feedback comes from a reward system
- goal: learn a series of actions that maximizes the reward


<p><a href="https://upload.wikimedia.org/wikipedia/commons/1/1b/Reinforcement_learning_diagram.svg"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Reinforcement_learning_diagram.svg/794px-Reinforcement_learning_diagram.svg.png" alt="Reinforcement learning diagram.svg" width="800"></a><br>By <a href="//commons.wikimedia.org/wiki/User:Megajuice" title="User:Megajuice">Megajuice</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/publicdomain/zero/1.0/deed.en" title="">CC0 1.0</a></p>
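
The three bullet points above can be sketched with tabular Q-learning on a toy problem. Everything here is an assumption chosen for illustration: the five-state "corridor" environment, the `step` function, the reward of 1 at the right end, and the values of `GAMMA` and `ALPHA`.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = [-1, +1]    # move left or move right
GAMMA, ALPHA = 0.9, 0.5


def step(state: int, action: int):
    """Toy environment (hypothetical): return (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


random.seed(0)
# Q[state][action_index] estimates the long-term reward of an action.
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(500):
    state = 0
    for t in range(100):                            # cap the episode length
        a = random.randrange(len(ACTIONS))          # explore at random
        next_state, reward = step(state, ACTIONS[a])
        # Q-learning update: move the estimate towards the reward
        # plus the discounted best value of the next state.
        target = reward + GAMMA * max(Q[next_state])
        Q[state][a] += ALPHA * (target - Q[state][a])
        state = next_state
        if state == N_STATES - 1:
            break

# Greedy policy: the best-valued action in every non-goal state.
policy = [ACTIONS[Q[s].index(max(Q[s]))] for s in range(N_STATES - 1)]
print(policy)
```

The agent is never told which action is correct; it only observes rewards and, from those, learns which sequence of moves leads to the goal.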

# Important Notation 

- we are given a dataset of size $N$ as   
$\mathcal{D} = \{ \langle \vec{x}, y \rangle_{i}, i = 1, \dots, N \} $

- the data represents a mapping:   
$f(\vec{x}) = y$

- machine learning produces a hypothesis (i.e. a prediction):   
$h(\vec{x}) = \hat{y}$
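
As a minimal sketch of this notation in Python, assume a toy dataset and a hand-written hypothesis `h`. Both are hypothetical and unrelated to the penguin data used later. Each sample pairs a feature vector $\vec{x}$ with its label $y$, and `h` returns the prediction $\hat{y}$.

```python
from typing import List, Tuple

# Toy dataset D = {<x, y>_i, i = 1..N}; the numbers are made up.
Sample = Tuple[List[float], float]
D: List[Sample] = [
    ([1.0, 2.0], 5.5),
    ([2.0, 0.0], 2.0),
    ([0.0, 3.0], 5.0),
]
N = len(D)


def h(x: List[float]) -> float:
    """Hypothetical hypothesis h(x) = y_hat, here a fixed linear rule."""
    return 1.0 * x[0] + 2.0 * x[1]


# Compare the prediction y_hat with the ground truth y for every sample.
errors = [h(x) - y for x, y in D]
print(N, errors)
```

The differences between $\hat{y}$ and $y$ are what a learning algorithm would try to make small when it chooses $h$.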

## Classification versus Regression

- classification:   
$h : \mathbb{R}^n \rightarrow \mathbb{Z} $   
(e.g. for 3 categories $\{0,1,2\}$)

- regression:   
$h : \mathbb{R}^n \rightarrow \mathbb{R} $ (regression can also produce vector-valued output in $\mathbb{R}^{m}$)  
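
The difference between the two mappings shows up directly in the return type of the hypothesis. The following sketch uses two hypothetical functions, `h_classify` and `h_regress`, with arbitrary thresholds and coefficients chosen only for illustration.

```python
from typing import List


def h_classify(x: List[float]) -> int:
    """Classification: map a feature vector to one of the labels {0, 1, 2}."""
    total = sum(x)
    if total < 1.0:
        return 0
    if total < 2.0:
        return 1
    return 2


def h_regress(x: List[float]) -> float:
    """Regression: map a feature vector to a real number."""
    return 0.5 * sum(x)


x = [0.5, 1.0]
print(h_classify(x))  # an integer category
print(h_regress(x))   # a real value
```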


# Data

For the following, I will rely on the Palmer penguins dataset, obtained from [this repo](https://github.com/allisonhorst/palmerpenguins). To quote the repo:

> Data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php)
> and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/).


In [2]:
import pandas as pd
print("pandas version:", pd.__version__)


pandas version: 1.0.5


In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")
print(df.head())
print(df.tail())

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             NaN            NaN                NaN   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  year  
0       3750.0    male  2007  
1       3800.0  female  2007  
2       3250.0  female  2007  
3          NaN     NaN  2007  
4       3450.0  female  2007  
       species island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
339  Chinstrap  Dream            55.8           19.8              207.0   
340  Chinstrap  Dream            43.5           18.1              202.0   
341  Chinstrap  Dream            49.6           18.2              193.0   
342  Chinstrap  Dream            50.8           19.0              210