# Exercise 7: Unsupervised Learning - Clustering
## Theory
### Task 1: MC
---
Multiple answers are possible.

**DBSCAN**
 - [ ] can be used for outlier detection
 - [ ] works best with isotropic data
 - [ ] can be used to construct a KDTree
 - [ ] computes all possible pairwise distances

**K-means**
 - [ ] guarantees convergence
 - [ ] is insensitive to outliers
 - [ ] can be used with a custom distance function 
 - [ ] All of the above

**Agglomerative HC**
 - [ ] can be used with a custom distance function
 - [ ] can be used with categorical data
 - [ ] produces a result that can be used to visualize the distance matrix
 - [ ] All of the above

**K-means**
 - [ ] is fast in practice
 - [ ] scales well with the dataset size
 - [ ] can be used for outlier detection
 - [ ] All of the above

**K-medoids**
 - [ ] can be used with a custom distance function
 - [ ] can be initialized with kmeans++ initialization
 - [ ] computes centroids using the median
 - [ ] All of the above

**Clustering Evaluation**
 - [ ] ```sklearn.metrics.silhouette_score``` is used to evaluate a clustering given the true (cluster) labels
 - [ ] ```sklearn.metrics.silhouette_score``` is used to evaluate a clustering given the data ```X```
 - [ ] ```sklearn.metrics.rand_score``` computes differences for all possible pairs of points
 - [ ] ```sklearn.metrics.rand_score``` can also be used for classification


## Programming
### Task 1: KMeans
---
For this task you are given data ```X, y``` that is distributed around four centers. 
The goal of this exercise is to familiarize yourself with the pitfalls of K-means and of evaluating its performance. To that end:

1. Plot X using a scatter plot, and color the data according to the labels.
2. Run KMeans with the number of clusters ranging from 2 to 6, and this time color the data according to the prediction.
3. Place both of these plots next to each other by first calling 
```fig, (ax1, ax2) = plt.subplots(1, 2)``` and then drawing each plot on its own axis.

Follow this link for an example of how to use subplots (you have to scroll down): https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.subplots.html
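As a rough illustration of the side-by-side layout described in step 3, the sketch below draws two scatter plots on the two axes returned by ```plt.subplots(1, 2)```. The array `points`, the colorings `labels_a` and `labels_b`, and the titles are hypothetical placeholders, not the exercise data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical toy data, used only to illustrate the subplot layout
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 2))
labels_a = (points[:, 0] > 0).astype(int)  # coloring for the left plot
labels_b = (points[:, 1] > 0).astype(int)  # coloring for the right plot

# One row, two columns: ax1 is the left axis, ax2 the right one
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(points[:, 0], points[:, 1], c=labels_a)
ax1.set_title("left")
ax2.scatter(points[:, 0], points[:, 1], c=labels_b)
ax2.set_title("right")
fig.tight_layout()
```

In a notebook the figure is rendered at the end of the cell; in a plain script a call to `plt.show()` would be needed.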

In [2]:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=500,
                  n_features=2,  # 2D data
                  centers=4,  # distributed in 4 clusters
                  cluster_std=1, # std deviation
                  center_box=(-10.0, 10.0), # xlim, ylim
                  shuffle=True,
                  random_state=1)

# Follow steps 1,2,3

The result should be as expected. Now add a score evaluation using ```sklearn.metrics.silhouette_score``` to each of the K-means predictions. 
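A minimal sketch of how ```silhouette_score``` can be attached to each K-means run is shown here. It assumes a hypothetical toy dataset `toy` with two well-separated groups and only two values of `k`; the settings `n_init=10` and `random_state=0` are illustrative choices, not part of the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical toy data: two well-separated groups of 30 points each
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 0.5, size=(30, 2)),
                 rng.normal(5, 0.5, size=(30, 2))])

scores = {}
for k in (2, 3):
    # fit_predict returns one cluster index per sample
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy)
    # the score is computed from the data and the predicted labels
    scores[k] = silhouette_score(toy, pred)
    print(f"k={k}: silhouette={scores[k]:.3f}")
```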

**What can you conclude?**

**[Optional]** Refine the plots by adding the axis titles 'true_clusters' and 'predicted_clusters' to the corresponding plots. For the axis corresponding to 'predicted_clusters', also add the silhouette_score value to the title. The results should look like this:

<img src="img\exercise_one.png" alt="Drawing" style="width: 600px;"/>

Repeat the experiment, but with an anisotropically scaled (1) version of the previous dataset X and using only the correct number of clusters (four), i.e. do not compute K-means for 2, 3, 5, or 6 clusters. 

**What can you conclude** after using K-means on this scaled version of the original data?

(1) For that, use the transformation ```transform = np.array([[1.6, -0.1], [1,-.5]]) ``` and use ```numpy.dot``` together with the original data to obtain the transformed version.
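The following sketch shows what this transformation does to a few points. It assumes the data matrix is multiplied from the left, i.e. `np.dot(data, transform)`, so each row (one sample) is mapped to a new 2D point; the array `toy` is a hypothetical stand-in for the exercise data.

```python
import numpy as np

transform = np.array([[1.6, -0.1], [1, -.5]])  # matrix given in the exercise text
# Hypothetical points, one sample per row
toy = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [2.0, 2.0]])

# (n_samples, 2) times (2, 2) gives (n_samples, 2): the shape is preserved
toy_aniso = np.dot(toy, transform)
print(toy_aniso)
```

The unit vectors are mapped to the rows of `transform`, which shows how the two coordinate directions are stretched and sheared differently.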

In [3]:
import numpy as np
import matplotlib.pyplot as plt
# rescale and run K-means using 4 clusters

# plot and evaluate as before using
# fig, (ax1, ax2) = plt.subplots(1, 2)

### Task 2: Agglomerative Clustering
---
The goal of this exercise is to show how you can enhance hierarchical clustering with a connectivity (neighborhood) matrix, which restricts the computation of pairwise distances to connected neighbors only.

For this exercise, you are given the ```S-curve``` dataset, which is commonly used to check whether clustering (or manifold-learning) algorithms preserve locality. We will see that in the default case, Agglomerative Clustering does not do so.


To do so, you should:
1. Visualize the 3 dimensions of your data by calling ax.scatter3D.
   As a color map, use plt.cm.Spectral.
2. Compute AgglomerativeClustering with n_clusters=8 and the default linkage.
3. Repeat the plot, this time coloring the data according to the labels returned by the clustering.
4. Use AgglomerativeClustering again, this time with the 'connectivity' argument. Pass a ```kneighbors_graph``` of the data with ```n_neighbors=4```. By default, neighbors are computed using the Euclidean distance; leave this default setting unchanged.
5. Repeat the plot, coloring the data according to the new clustering.
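To make the 'connectivity' argument from step 4 more concrete, here is a minimal sketch on hypothetical random 3D data (`toy`), not on the S-curve. It only illustrates how a ```kneighbors_graph``` is built and passed in; the values `n_clusters=3` and `include_self=False` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Hypothetical 3D toy data (40 samples), not the S-curve
rng = np.random.default_rng(0)
toy = rng.normal(size=(40, 3))

# Sparse (40 x 40) matrix: row i marks the 4 nearest neighbors of sample i
connectivity = kneighbors_graph(toy, n_neighbors=4, include_self=False)

# Merges are only allowed between clusters linked in the connectivity graph
model = AgglomerativeClustering(n_clusters=3, connectivity=connectivity)
labels = model.fit_predict(toy)
print(connectivity.shape, np.bincount(labels))
```

If the neighbor graph is not fully connected, scikit-learn emits a warning and completes the graph itself, which can happen on random data like this.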

**What is different?**



In [None]:
from sklearn.datasets import make_s_curve
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# initialize plots using
# fig = plt.figure()
# ax = fig.add_subplot(111, projection='3d')
# ax.view_init(7, -80)

# S-curve data
X, t = make_s_curve(1500, random_state=42)

### [Optional] Tuning DBSCAN
---
Using what we have learned so far, find a combination of ```eps, min_samples``` that works well for the given dataset, and explain what "well" could mean in this case.
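One way to inspect a DBSCAN result while tuning is to count the clusters and the noise points, as sketched below. The data `toy` and the values `eps=0.5, min_samples=5` are hypothetical and chosen only for illustration; they are not a suggested answer for the dataset of this task.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups plus one far-away point
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 0.2, size=(50, 2)),
                 rng.normal(3, 0.2, size=(50, 2)),
                 [[10.0, 10.0]]])

# Illustrative parameter values, not tuned for the exercise data
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(toy)

# DBSCAN marks noise points with the label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```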

In [5]:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
centers = [[1, 1], [-1, -1], [1, -1]]
X, y = make_blobs(n_samples=750, centers=centers, cluster_std=0.5, random_state=42)

In [None]:
pass