# Intro to Data Science @ SzISz Part VI.
## Clustering

### Table of contents
- <a href="#What-is-Clustering?">Clustering Theory</a>
- <a href="#K-Means">K-Means</a>
- <a href="#DBSCAN">DBSCAN</a>
- <a href="#Hierarchical-Clustering">Hierarchical Clustering</a>
- <a href="#Spectral-Clustering">Spectral Clustering</a>
- <a href="#Gaussian-Mixture-Models">GMM</a>
- <a href="#Cluster-Validation">Cluster Validation</a>
    

## What is Clustering?
Clustering is an <a href="http://scikit-learn.org/stable/unsupervised_learning.html">unsupervised machine learning</a> problem. _"Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution."_ from: <a href="https://en.wikipedia.org/wiki/Unsupervised_learning">Wiki</a> 

_"Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)."_ from: <a href="https://en.wikipedia.org/wiki/Cluster_analysis">Wiki</a>


## Why is it important?
Data often comes without target variables, so its hidden structure has to be found first before the data can be put to use. Recommender systems, for example, commonly group similar items together. In other cases, finding similar, connected, or related items is the task itself, as in image processing, social network analysis, or medical analysis. Clustering can also be used to detect anomalies in the data.

## Tools
- K-Means
- Affinity propagation
- Mean-shift
- Spectral clustering
- Ward hierarchical clustering
- Agglomerative clustering
- DBSCAN
- Gaussian mixtures
- Birch
- Support Vector Clustering
- etc.

In [None]:
%matplotlib inline
import numpy as np
import scipy.sparse as sp
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets

In [None]:
n_clusters = 3
n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
                                      noise=.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None

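# NOTE: this rebinds the name `datasets`, shadowing the sklearn module
# imported above; re-run the import cell before re-running this cell.
datasets = {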
    'noisy_circles': noisy_circles,
    'noisy_moons': noisy_moons,
    'blobs': blobs,
    'no_structure': no_structure
}

colors = np.array(list('bgrcmyk'))

In [None]:
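def cluster_datasets(model, preprocess=None, **params):
    # Instantiate the estimator class once and refit it on every toy dataset.
    model = model(**params)
    results = {}
    Xs = {}
    for problem, dataset in datasets.items():  # .iteritems() exists only in Python 2
        X, y = dataset
        if preprocess:
            X = preprocess.fit_transform(X, y)
        Xs[problem] = X
        model.fit(X)
        # Most clusterers expose labels_ after fit; mixture models only offer predict().
        if hasattr(model, 'labels_'):
            results[problem] = model.labels_.astype(int)  # np.int was removed from NumPy
        else:
            results[problem] = model.predict(X)
    return model, Xs, results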

In [None]:
def plot(Xs, results):
    # One scatter plot per dataset, side by side, colored by cluster label.
    plot_num = 1
    plt.figure(figsize=(len(datasets) * 2 + 3, 9.5))
    for problem, X in Xs.items():  # .iteritems() exists only in Python 2
        plt.subplot(1, len(datasets), plot_num)
        plt.scatter(X[:, 0], X[:, 1], color=colors[results[problem]].tolist())
        plot_num += 1

## K-Means

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
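
K-Means partitions the points into `n_clusters` groups by alternating between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The following is a minimal, standalone sketch on the blobs data rather than a definitive recipe; the values `n_init=10` and `random_state=0` are assumptions chosen only to make the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same generator and seed as the notebook's `blobs` dataset.
X, _ = make_blobs(n_samples=1500, random_state=8)

# K-Means relies on Euclidean distance, so features are put on a common scale first.
X_scaled = StandardScaler().fit_transform(X)

# n_init and random_state are illustrative choices, not tuned values.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(kmeans.cluster_centers_.shape)  # one 2-D centroid per cluster
print(np.bincount(labels))            # number of points assigned to each cluster
```

With the notebook's helpers, a comparable run over all four toy datasets would presumably be `model, Xs, results = cluster_datasets(KMeans, preprocess=StandardScaler(), n_clusters=3)` followed by `plot(Xs, results)`.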

## DBSCAN

In [None]:
from sklearn.cluster import DBSCAN
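
DBSCAN grows clusters from dense regions: a point with at least `min_samples` neighbors within distance `eps` is a core point, and points that cannot be reached from any core point are labeled `-1` (noise). The sketch below runs it on the two-moons data; `eps=0.2` and `min_samples=5` are assumed, illustrative values that would normally need tuning per dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half circles; random_state is fixed here only for reproducibility.
X, _ = make_moons(n_samples=1500, noise=0.05, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# eps and min_samples are illustrative guesses, not tuned values.
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X_scaled)

n_noise = int(np.sum(labels == -1))   # points DBSCAN could not attach to any cluster
n_found = len(set(labels) - {-1})     # number of clusters, excluding the noise label
print(n_found, n_noise)
```

Unlike K-Means, the number of clusters is not passed in; it falls out of the density parameters.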

## Hierarchical Clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering

In [None]:
from sklearn.neighbors import kneighbors_graph
# for Ward
# connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
# connectivity = 0.5 * (connectivity + connectivity.T)
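
The commented lines above can be sketched as a complete run. This example assumes the circles data as `X` and `n_clusters=2`; it builds a symmetric k-nearest-neighbors graph and passes it to Ward-linkage agglomerative clustering so that only neighboring points are merged.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_circles
from sklearn.neighbors import kneighbors_graph

# Two concentric circles; random_state is fixed here only for reproducibility.
X, _ = make_circles(n_samples=1500, factor=0.5, noise=0.05, random_state=0)

# Sparse graph linking each point to its 10 nearest neighbors.
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
# k-NN relations are not mutual, so average with the transpose to symmetrize.
connectivity = 0.5 * (connectivity + connectivity.T)

ward = AgglomerativeClustering(n_clusters=2, linkage='ward',
                               connectivity=connectivity)
labels = ward.fit_predict(X)
print(sorted(set(labels.tolist())))
```

scikit-learn may warn that the graph has more than one connected component; in that case it completes the graph itself before building the tree.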

## Spectral Clustering

In [None]:
from sklearn.cluster import SpectralClustering
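
Spectral clustering embeds the points using eigenvectors of a graph Laplacian built from pairwise affinities, then clusters in that embedding. A minimal sketch on the circles data follows; the smaller sample size, `affinity='nearest_neighbors'`, and `n_neighbors=10` are assumptions made to keep the example quick and are not tuned.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

# A smaller sample than the notebook's 1500 keeps the eigen-decomposition fast.
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

# A k-nearest-neighbors affinity graph; n_neighbors is an illustrative choice.
spectral = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                              n_neighbors=10, random_state=0)
labels = spectral.fit_predict(X)
print(labels.shape)
```

Because the affinity graph follows local neighborhoods rather than distance to a centroid, this approach can separate non-convex shapes such as rings.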

## Gaussian Mixture Models

In [None]:
from sklearn.mixture import GaussianMixture  # the old GMM class was removed in scikit-learn 0.20
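
A Gaussian mixture model describes the data as a weighted sum of Gaussian components and gives each point a probability of belonging to each component, not just a single label. The sketch below assumes a scikit-learn version that provides `GaussianMixture` (the successor of the older `GMM` class); `covariance_type='full'` and `random_state=0` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Same generator and seed as the notebook's `blobs` dataset.
X, _ = make_blobs(n_samples=1500, random_state=8)

# One full covariance matrix per component; parameters are fitted with EM.
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(X)

hard_labels = gmm.predict(X)        # most probable component for each point
soft_labels = gmm.predict_proba(X)  # per-component membership probabilities

print(gmm.means_.shape)
print(soft_labels[:3].round(3))
```

Mixture models have no `labels_` attribute after fitting, which is why the `cluster_datasets` helper falls back to `predict`.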

## Cluster Validation

In [None]:
from sklearn.metrics import silhouette_score
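
The silhouette coefficient of a point compares its mean distance to the other points in its own cluster, a, with its mean distance to the points of the nearest other cluster, b, as (b - a) / max(a, b). Values lie between -1 and 1, and higher is better. One possible use, sketched here under the assumption that K-Means on the blobs data is the model being validated, is to compare several candidate values of `n_clusters`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same generator and seed as the notebook's `blobs` dataset.
X, _ = make_blobs(n_samples=1500, random_state=8)

# Mean silhouette over all points for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # candidate with the highest mean silhouette
for k, score in scores.items():
    print(k, round(score, 3))
print('best k:', best_k)
```

The silhouette needs no ground-truth labels, but it tends to favor compact, convex clusters, so it may rate a correct DBSCAN or spectral result on the moons or circles data poorly.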