# Unsupervised Learning Techniques

**Clustering**: grouping similar instances into clusters.

**Anomaly detection**: learning what "normal" data looks like, then using it to detect abnormal instances.

**Density estimation**: estimating the probability density function (PDF) of the random process that generated the dataset. Useful for data analysis, visualization and detecting anomalies.

# Clustering
Works well with multiple features.

## K-Means

```python
from sklearn.cluster import KMeans

k = 5
kmeans = KMeans(n_clusters=k)
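# fit the model and get the cluster index of each training instance
y_pred = kmeans.fit_predict(X)

X_new = ...
# assign each new instance to the cluster with the closest centroid
kmeans.predict(X_new)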

# measure distance from each instance to every centroid
kmeans.transform(X_new)
```

Properties:
- An instance's *label* is the index of the cluster that the instance gets assigned to.

You can plot the decision boundaries to get a *Voronoi tessellation*.

**Hard clustering**: assigning each instance to a single cluster.

**Soft clustering**: giving each instance a score per cluster.
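The difference between the two can be sketched with scikit-learn's `predict` (hard) and `transform` (soft, using distances to centroids as scores). The synthetic data from `make_blobs` and the names `X_demo` and `kmeans_demo` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration, not from the notes)
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

# Hard clustering: a single cluster index per instance
hard_labels = kmeans_demo.predict(X_demo)

# Soft clustering: one score per cluster, here the distance to each centroid
distances = kmeans_demo.transform(X_demo)

print(hard_labels[:5])
print(distances[:5].round(2))
```

In this sketch the hard label of an instance is simply the cluster whose distance score is smallest.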

**Normal Algorithm**:
- Place the centroids randomly, label the instances, update the centroids, label the instances again, and so on until the centroids stop moving.
- Can get stuck in a local optimum if the random initialization is unlucky.
- sklearn uses the `n_init` parameter to run the algorithm multiple times with different random initializations and keeps the model with the lowest inertia (the sum of squared distances between each instance and its closest centroid). The `score()` method returns the negative inertia, as it respects the *"greater is better"* rule.
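The loop described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function name `lloyd_kmeans`, the synthetic `X_demo`, and the choice of 10 restarts are all made up for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-Means sketch: random init, then alternate label/update steps."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids as k distinct instances chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Label each instance with the index of its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its instances (keep it if empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    # Final assignment and inertia (sum of squared distances to closest centroid)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = (dists.min(axis=1) ** 2).sum()
    return centroids, labels, inertia

# Synthetic data: three Gaussian blobs (an assumption for illustration)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in ([0, 0], [5, 5], [0, 5])
])

# Mimic the idea behind n_init: several random restarts, keep the lowest inertia
runs = [lloyd_kmeans(X_demo, k=3, seed=s) for s in range(10)]
best_centroids, best_labels, best_inertia = min(runs, key=lambda r: r[2])
print(best_inertia)
```

Different seeds may converge to different inertias, which is why keeping the best of several runs helps.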

**K-Means++**:
- Take one centroid $c^{(1)}$, chosen uniformly at random from the dataset.
- Take a new centroid $c^{(i)}$, choosing an instance $x^{(i)}$ with probability $D(x^{(i)})^2 / \sum^m_{j=1}D(x^{(j)})^2$, where $D(x^{(i)})$ is the distance between the instance $x^{(i)}$ and the closest centroid that was already chosen. This distribution ensures that instances further away from already chosen centroids are more likely to be selected as centroids.
- Repeat until all $k$ centroids have been chosen.
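The seeding steps above can be sketched in NumPy as follows. The function name `kmeans_pp_init` and the synthetic `X_demo` are hypothetical, and this is only an illustration of the sampling rule; in scikit-learn the initialization is selected with `init="k-means++"`, which is the default for `KMeans`.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Sketch of K-Means++ seeding: sample centroids proportionally to D(x)^2."""
    rng = np.random.default_rng(seed)
    m = len(X)
    # First centroid: chosen uniformly at random from the dataset
    centroids = [X[rng.integers(m)]]
    for _ in range(1, k):
        # D(x)^2: squared distance from each instance to its closest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Pick the next centroid with probability D(x)^2 / sum_j D(x_j)^2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(m, p=probs)])
    return np.array(centroids)

# Synthetic data: three Gaussian blobs (an assumption for illustration)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in ([0, 0], [5, 5], [0, 5])
])

init_centroids = kmeans_pp_init(X_demo, k=3, seed=0)
print(init_centroids)
```

An instance that is already a centroid has $D(x) = 0$, so it cannot be picked a second time.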

**MiniBatchKMeans**:
- Speeds up K-Means, typically by a factor of 3 or 4, and makes it possible to cluster huge datasets that do not fit in memory.
- Inertia is generally slightly worse.
- `partial_fit` can be used, but using NumPy's `memmap` class is easiest.

```python
from sklearn.cluster import MiniBatchKMeans

minibatch_kmeans = MiniBatchKMeans(n_clusters=k)
minibatch_kmeans.fit(X)
```
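For the `partial_fit` route, a minimal sketch could feed the model one mini-batch at a time. The synthetic data, the batch size of 500, and the name `mbk_demo` are assumptions for illustration; with real out-of-core data each batch would be loaded from disk instead of sliced from an in-memory array.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic data: three Gaussian blobs (an assumption for illustration)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(1000, 2)) for c in ([0, 0], [5, 5], [0, 5])
])
rng.shuffle(X_demo)  # shuffle rows so every batch mixes all blobs

mbk_demo = MiniBatchKMeans(n_clusters=3, random_state=42)
batch_size = 500  # hypothetical batch size
for start in range(0, len(X_demo), batch_size):
    # Update the centroids using only the current mini-batch
    mbk_demo.partial_fit(X_demo[start:start + batch_size])

print(mbk_demo.cluster_centers_)
```

Note that with `partial_fit` the centroids are initialized from the first batch only, so that batch should be reasonably representative of the data.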

**Selecting number of clusters**:
- Use the elbow rule: picking the lowest inertia does not work, because inertia keeps decreasing as $k$ increases. Instead, pick the $k$ at the "elbow", where the decrease in inertia slows down sharply.
    - Rough estimation.
- Silhouette score.
    - More precise (but more computationally expensive).
    - Mean of the *silhouette coefficient* over all instances.
    - *silhouette coefficient*: $(b-a)/\max(a,b)$ where $a$ is the mean distance to the other instances in the same cluster, and $b$ is the mean nearest-cluster distance (the mean distance to the instances of the closest other cluster).
    - Score between -1 and +1.
        - Close to +1 means the instance is well inside its own cluster and far from other clusters.
        - Close to 0 means the instance is close to a cluster boundary.
        - Close to -1 means that the instance may have been assigned to the wrong cluster.
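To make the silhouette coefficient formula concrete, the sketch below computes it by hand for a single instance and compares it with scikit-learn's `silhouette_samples`. The synthetic data, the variable names, and the choice of instance `i = 0` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic data (an assumption for illustration, not from the notes)
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

i = 0  # instance to inspect
dists = np.linalg.norm(X_demo - X_demo[i], axis=1)
own = labels == labels[i]

# a: mean distance to the other instances of the same cluster (excluding i itself)
a = dists[own].sum() / (own.sum() - 1)
# b: smallest mean distance to the instances of any other cluster
b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])

coef = (b - a) / max(a, b)
print(coef, silhouette_samples(X_demo, labels)[i])
```

The two printed values should agree up to floating-point error.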

```python
from sklearn.metrics import silhouette_score

silhouette_score(X, kmeans.labels_)
```
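Both selection methods can be compared by fitting one model per candidate number of clusters. This is a sketch under stated assumptions: the synthetic data, the candidate range 2 to 8, and the names `inertias`, `silhouettes`, and `best_k` are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (an assumption for illustration, not from the notes)
X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

inertias, silhouettes = {}, {}
for k_candidate in range(2, 9):
    model = KMeans(n_clusters=k_candidate, n_init=10, random_state=42).fit(X_demo)
    inertias[k_candidate] = model.inertia_  # for the elbow rule
    silhouettes[k_candidate] = silhouette_score(X_demo, model.labels_)

for k_candidate in inertias:
    print(k_candidate, round(inertias[k_candidate], 1), round(silhouettes[k_candidate], 3))

# Candidate with the highest mean silhouette coefficient
best_k = max(silhouettes, key=silhouettes.get)
print("best k by silhouette:", best_k)
```

Plotting `inertias` against the number of clusters shows the elbow, while `silhouettes` gives a score that can be maximized directly.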
