### Unsupervised Learning: Clustering

**OBJECTIVES**
- Differentiate between supervised and unsupervised learning tasks
- Understand and implement the KMeans clustering algorithm
- Understand and implement the DBSCAN clustering algorithm
- Apply clustering to the problem of customer segmentation

In [None]:
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Supervised vs. Unsupervised Learning

- **SUPERVISED LEARNING**: Labels to predict are known (regression and classification)
- **UNSUPERVISED LEARNING**: No target label to predict (clustering)
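
As a rough illustration of the difference, the sketch below fits one supervised and one unsupervised model to the same synthetic data. The names `X_demo`, `y_demo`, `clf`, and `km` are made up for this example, and `LogisticRegression` is only a stand-in for any classifier. The point to notice is that the classifier's `fit` receives the labels while the clusterer's `fit` does not.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# hypothetical demo data; y_demo holds the known labels
X_demo, y_demo = make_blobs(n_samples=40, centers=2, random_state=22, cluster_std=1)

# supervised: the labels are passed to fit
clf = LogisticRegression().fit(X_demo, y_demo)
supervised_pred = clf.predict(X_demo)

# unsupervised: only the features are passed to fit
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
cluster_labels = km.labels_

print(supervised_pred[:5], cluster_labels[:5])
```

The cluster numbers are arbitrary, so cluster 0 need not correspond to class 0.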

In [None]:
#create a synthetic dataset
X, y = make_blobs(n_samples=40, centers = 2, random_state = 22, cluster_std = 1)

In [None]:
#plot the data
plt.scatter(X[:, 0], X[:, 1])
plt.title("Clustering -- Problem Setting");

### KMeans Algorithm

Before implementing, you need to decide how many clusters you think exist in the data.  Here, we were able to see this in the plot, but typically you will try a few values, guided by the context of the problem you are clustering within.

#### STEP 1: Choose and Implement Centers

Once you choose the number of centers, you will need to offer a first guess for their locations.  There are many ways to do so; below, two centers are given for you.
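
One common alternative to hand-picked centers is to use randomly chosen data points as the first guess. The sketch below assumes the same kind of synthetic data as above; `X_demo`, `rng`, and `init_centers` are names invented for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs

# hypothetical demo data
X_demo, _ = make_blobs(n_samples=40, centers=2, random_state=22, cluster_std=1)

# pick two distinct rows of the data to serve as starting centers
rng = np.random.default_rng(0)
idx = rng.choice(len(X_demo), size=2, replace=False)
init_centers = X_demo[idx]

print(init_centers)
```

This is similar in spirit to `init = 'random'` in `sklearn`'s `KMeans`, which also starts from randomly selected observations.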

In [None]:
#initial centers
center_1 = np.array([-5, 4])
center_2 = np.array([-3, 4])

In [None]:
#plot the examples and centers
plt.scatter(X[:, 0], X[:, 1])
plt.plot(center_1[0], center_1[1], 'x', markersize = 10, color = 'red', label = 'Center A')
plt.plot(center_2[0], center_2[1], 'x', markersize = 10, color = 'black', label = 'Center B')
plt.legend();

#### STEP 2: Measure Distance from points to each center



In [None]:
#distance to the first center
distance_to_c1 = np.linalg.norm(center_1 - X, axis = 1)

In [None]:
#distance to the second center
distance_to_c2 = np.linalg.norm(center_2 - X, axis = 1)

In [None]:
#create a DataFrame of the distances
dists = pd.DataFrame({'c1': distance_to_c1,
              'c2': distance_to_c2})

In [None]:
#the first few distances
dists.head()

In [None]:
#which center is closest
dists.apply(np.argmin, axis = 1)[:8]

In [None]:
#creating dataframe with data
data = pd.DataFrame(X, columns = ['x1', 'x2'])

In [None]:
#adding a label column
data['label'] = dists.apply(np.argmin, axis = 1)

In [None]:
#examine the results
data.head()

#### STEP 3: Update the Centers

Using our labels, we update each center to be the mean of the points assigned to that label.

In [None]:
data.groupby('label').mean()

In [None]:
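#plot the data with the starting centers (x markers)
plt.scatter(X[:, 0], X[:, 1])
plt.plot(center_1[0], center_1[1], 'x', color = 'red', label = 'Start A')
plt.plot(center_2[0], center_2[1], 'x', color = 'black', label = 'Start B')

#updated centers: the mean of the points assigned to each label
new_centers = data.groupby('label').mean()
plt.scatter(new_centers['x1'], new_centers['x2'], c = ['red', 'black'], label = 'Updated')
plt.legend();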

#### STEP 4: Repeat until centers stop moving

Now, you repeat the process of measuring the distance from each point to each center, labeling each point by its closest center, and updating the location of the centers accordingly.
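
The four steps can be put together in a short loop. This is a minimal sketch rather than a production implementation: the names `X_demo`, `centers`, and `new_centers` are invented here, the cap of 100 iterations is an arbitrary assumption, and a cluster that ends up empty simply keeps its previous center.

```python
import numpy as np
from sklearn.datasets import make_blobs

# hypothetical demo data and the same starting centers used above
X_demo, _ = make_blobs(n_samples=40, centers=2, random_state=22, cluster_std=1)
centers = np.array([[-5.0, 4.0], [-3.0, 4.0]])

for iteration in range(100):  # arbitrary cap on the number of passes
    # STEP 2: distance from every point to every center, shape (n_samples, n_centers)
    dists = np.linalg.norm(X_demo[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # STEP 3: move each center to the mean of its assigned points
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = X_demo[labels == k]
        if len(members) > 0:  # an empty cluster keeps its old center
            new_centers[k] = members.mean(axis=0)

    # STEP 4: stop once the centers no longer move
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(iteration, centers)
```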

### Implementing with `sklearn`

A last consideration is to scale the data before clustering so that features measured on different scales don't dominate the distance calculations and distort the labels.  Below, we create a pipeline to scale and cluster our data.
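
To see why scaling matters, the sketch below builds a made-up dataset in which the two groups differ only in a small-scale feature, while a second feature is pure noise on a much larger scale. Everything here (`x1`, `x2`, `true_groups`, the noise levels) is an assumption chosen for illustration, and `adjusted_rand_score` is used only as a convenient way to compare cluster labels with the known groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
true_groups = np.repeat([0, 1], 100)

# x1 separates the groups; x2 is noise with a far larger spread
x1 = np.where(true_groups == 0, -3.0, 3.0) + rng.normal(0, 0.5, 200)
x2 = rng.normal(0, 1000, 200)
X_demo = np.column_stack([x1, x2])

# clustering the raw data: distances are dominated by x2
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)

# clustering after scaling: both features contribute comparably
scaled_pipe = Pipeline([('scale', StandardScaler()),
                        ('cluster', KMeans(n_clusters=2, n_init=10, random_state=0))])
scaled_labels = scaled_pipe.fit_predict(X_demo)

ari_raw = adjusted_rand_score(true_groups, raw_labels)
ari_scaled = adjusted_rand_score(true_groups, scaled_labels)
print(ari_raw, ari_scaled)
```

An adjusted Rand score near 1 means the clusters match the known groups; a score near 0 means the agreement is no better than chance.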

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [None]:
X, y = make_blobs(random_state = 20)

In [None]:
plt.scatter(X[:, 0], X[:, 1], c = y)

In [None]:
pipe = Pipeline([('scale', StandardScaler()),
                 ('cluster', KMeans(n_clusters = 3))])

In [None]:
pipe.fit(X)

In [None]:
pipe.predict(X)

In [None]:
plt.scatter(X[:, 0], X[:, 1], c = pipe.predict(X))

In [None]:
pipe = Pipeline([('scale', StandardScaler()),
                ('cluster', KMeans(n_clusters = 3, init = 'random' ))])
pipe.fit(X)
plt.scatter(X[:, 0], X[:, 1], c = pipe.predict(X))

### Finding the right number of clusters

In [None]:
scores = []
for i in range(2, 11):
    pipe = Pipeline([('scale', StandardScaler()),
                ('cluster', KMeans(n_clusters = i))])
    pipe.fit(X)
    scores.append(pipe.score(X))

In [None]:
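plt.plot(range(2, 11), scores, '--o')
plt.xlabel("Number of clusters")
plt.ylabel("Score (negative inertia)")
plt.title("KMeans score by number of clusters");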
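
The `score` method used in the loop above returns the negative of the within-cluster sum of squares (inertia) for `KMeans`, so values closer to zero indicate tighter clusters. The sketch below checks this by hand on demo data; `X_demo`, `km`, and `manual_inertia` are names invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# hypothetical demo data
X_demo, _ = make_blobs(random_state=20)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# squared distance from every point to every fitted center
sq_dists = ((X_demo[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# inertia: sum over points of the squared distance to the closest center
manual_inertia = sq_dists.min(axis=1).sum()

print(manual_inertia, km.inertia_, km.score(X_demo))
```

Because inertia keeps shrinking as clusters are added, the usual approach is to look for the "elbow" where the improvement levels off rather than for the best value.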

### Evaluating Cluster Models

- [Inertia](https://scikit-learn.org/stable/modules/clustering.html#k-means): Within-cluster sum of squares, where $C$ is the set of cluster centers $\mu_j$.
 $$\sum_{i = 0}^n \min_{\mu_j \in C}(||x_i - \mu_j||^2)$$
- [Silhouette](https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient): Compares how close each sample is to its own cluster with how close it is to the nearest other cluster

 - a: The mean distance between a sample and all other points in the same cluster
 - b: The mean distance between a sample and all points in the next nearest cluster

$$s = \frac{b - a}{\max(a, b)}$$
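
To make the definitions of a and b concrete, the sketch below computes the silhouette value of a single sample by hand and compares it with `silhouette_samples`. The six points in `X_small` and their `labels` are made up for this example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# hypothetical toy data: two small, well-separated groups
X_small = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

i = 0  # the sample to score

# a: mean distance to the other points in the same cluster
same = labels == labels[i]
same[i] = False
a = np.linalg.norm(X_small[same] - X_small[i], axis=1).mean()

# b: smallest mean distance to the points of any other cluster
b = min(np.linalg.norm(X_small[labels == k] - X_small[i], axis=1).mean()
        for k in np.unique(labels) if k != labels[i])

s = (b - a) / max(a, b)
print(s, silhouette_samples(X_small, labels)[i])
```

`silhouette_score` is the mean of these per-sample values.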

In [None]:
from sklearn.metrics import silhouette_score, silhouette_samples

In [None]:
silhouette_score(X, pipe.predict(X))

- The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.

- The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
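
A quick way to see this behavior is to compare tight blobs with heavily overlapping ones. In the sketch below the centers, the two `cluster_std` values, and the names `X_demo` and `results` are all assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = [[0, 0], [5, 5], [-5, 5]]  # hypothetical blob centers
results = {}

for std in [0.5, 4.0]:
    # same centers, different spread around them
    X_demo, _ = make_blobs(n_samples=300, centers=centers,
                           cluster_std=std, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)
    results[std] = silhouette_score(X_demo, labels)

print(results)
```

The tighter blobs should score noticeably higher than the overlapping ones.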

In [None]:
pipe = Pipeline([('scale', StandardScaler()),
                ('cluster', KMeans(n_clusters = 3))])
pipe.fit(X)
silhouette_score(X, pipe.predict(X))

In [None]:
scores = []
for i in range(2, 11):
    pipe = Pipeline([('scale', StandardScaler()),
                ('cluster', KMeans(n_clusters = i))])
    pipe.fit(X)
    scores.append(silhouette_score(X, pipe.predict(X)))

In [None]:
# pip install scikit-plot

In [None]:
from scikitplot.metrics import plot_silhouette

In [None]:
#silhouette plots for different numbers of clusters
for i in range(2, 11):
    pipe = Pipeline([('scale', StandardScaler()),
                ('cluster', KMeans(n_clusters = i))])
    pipe.fit(X)
    plot_silhouette(X, pipe.predict(X), title = f'{i} Clusters')

### DBSCAN Algorithm

A second approach to clustering is the DBSCAN algorithm.  Rather than using a preset number of clusters, DBSCAN determines the number of clusters from the data by iteratively grouping points that lie close to one another. Let's take a look at the algorithm in action [here](https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/). The main parameter for this algorithm is $ϵ$ (`eps` in `sklearn`), the radius of the ball drawn around each point; `min_samples` sets how many points must fall inside that ball for a point to count as a core point.
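
One possible way to try DBSCAN with a few values of $ϵ$ is sketched below. The `eps` values, `min_samples = 5`, and the names `X_demo`, `db_pipe`, and `results` are assumptions for illustration, not recommended settings. DBSCAN has no `predict` method, so the pipeline's `fit_predict` is used, and points labeled `-1` are treated as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical demo data
X_demo, _ = make_blobs(random_state=20)

results = {}
for eps in [0.1, 0.3, 0.5]:
    db_pipe = Pipeline([('scale', StandardScaler()),
                        ('cluster', DBSCAN(eps=eps, min_samples=5))])
    labels = db_pipe.fit_predict(X_demo)

    # count clusters, ignoring the noise label -1
    n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)

print(results)
```

Smaller values of `eps` tend to leave more points as noise, while larger values tend to merge nearby groups.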

In [None]:
from sklearn.cluster import DBSCAN

In [None]:
#pipeline for scale and cluster


In [None]:
#fit data


In [None]:
#score it


In [None]:
#try with different epsilon values


In [None]:
#visualize


### Application: Customer Segmentation

One important application of clustering algorithms is to group customers.  This profiling can help a business understand patterns in purchasing or customer demographics.  The idea is to cluster the customers and then look for patterns within each cluster.
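
The workflow can be sketched on a tiny made-up table before applying it to real data. The `customers` DataFrame, its column names, and the choice of two clusters are all hypothetical and exist only to show the encode, scale, cluster, and profile steps end to end.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# hypothetical customer data: one categorical and two numeric features
customers = pd.DataFrame({
    'plan': ['yes', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no'],
    'minutes': [300, 120, 110, 320, 100, 310, 130, 90],
    'calls': [40, 10, 12, 45, 8, 42, 11, 9],
})

# one-hot encode the categorical column, scale the remaining numeric ones
pre = make_column_transformer((OneHotEncoder(), ['plan']),
                              remainder=StandardScaler())
seg = Pipeline([('preprocess', pre),
                ('cluster', KMeans(n_clusters=2, n_init=10, random_state=0))])

# assign each customer to a segment, then profile the segments
customers['segment'] = seg.fit_predict(customers)
profile = customers.groupby('segment')[['minutes', 'calls']].mean()
print(profile)
```

Comparing the rows of `profile` shows how the segments differ on average, which is the kind of pattern a business would then try to interpret.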

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

In [None]:
churn = pd.read_csv('data/cell_phone_churn.csv')

In [None]:
churn.head()

In [None]:
#drop state and churn; copy so that adding a column later doesn't modify a slice of churn
X = churn.iloc[:, 1:-1].copy()

In [None]:
encoder = make_column_transformer((OneHotEncoder(), ['vmail_plan', 'intl_plan']),
                                 remainder=StandardScaler())

In [None]:
pipe = Pipeline([('preprocess', encoder), ('cluster', KMeans(n_clusters = 4))])

In [None]:
pipe.fit(X)

In [None]:
pipe.predict(X)

In [None]:
X['label'] = pipe.predict(X)

In [None]:
#look for patterns within groups (averaging the numeric columns only)
X.groupby('label').mean(numeric_only = True)