# Unsupervised Learning

## k-means clustering
* Finds clusters of samples
* Number of clusters must be specified
* Implemented in `sklearn`
* `from sklearn.cluster import KMeans`

```
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
```
* This fits the model to the data by locating and remembering the regions where the different clusters occur
* Then we can use the predict method of the model
* This returns a cluster label for each sample indicating to which cluster a sample belongs. 

```
labels = model.predict(samples)
print(labels)
```

* New samples can be assigned to existing clusters.
* k-means does this by remembering the mean of the samples in each cluster; these means are called the __centroids__.
* k-means then finds the nearest centroid to each new sample. 
***
* Say you have an array of new samples
* To assign the new samples to the existing clusters, pass the array of new samples to the `predict` method of the k-means model

```
new_labels = model.predict(new_samples)
print(new_labels)
```
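
The nearest-centroid assignment described above can be sketched by hand and compared with `predict`. This is only an illustration: the toy blobs, `new_samples` values, `n_init` and `random_state` are assumptions made for the example, not part of the notes.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three well-separated 2-D blobs
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
samples = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers])
new_samples = np.array([[0.2, -0.1], [4.8, 5.3], [0.1, 4.7]])

model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(samples)

# Distance from every new sample to every centroid: shape (n_new, n_clusters)
differences = new_samples[:, np.newaxis, :] - model.cluster_centers_[np.newaxis, :, :]
distances = np.linalg.norm(differences, axis=2)

# The index of the nearest centroid is the cluster label
manual_labels = distances.argmin(axis=1)

print(manual_labels)
print(model.predict(new_samples))
```

Both printed arrays should agree, which is why no refitting is needed to label new samples.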

### Scatter plots

```
import matplotlib.pyplot as plt
xs = samples[:,0]
ys = samples[:,2]
plt.scatter(xs, ys, c=labels)
plt.show()
```

1) Visualize the data in a scatter plot to judge a suitable value of k by eye

```
#Import KMeans
from sklearn.cluster import KMeans
#Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)
#Fit model to points
model.fit(points)
#Determine the cluster labels of new_points: labels
labels = model.predict(new_points)
#Print cluster labels of new_points
print(labels)
```

* __`.cluster_centers_`__ : attribute of a fitted model that holds the coordinates of the centroids
* parameter __`s`__ of `plt.scatter()` sets the marker size

```
#Import pyplot
import matplotlib.pyplot as plt
#Assign the columns of new_points: xs and ys
xs = new_points[:,0]
ys = new_points[:,1]
#Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs, ys, c=labels, alpha= 0.5)
#Assign the cluster centers: centroids
centroids = model.cluster_centers_
#Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]
#Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x, centroids_y, marker='D', s=50)
plt.show()
```

## Evaluating a clustering
* How to evaluate quality of the clustering?
    * One approach: check correspondence with e.g. iris species.
    * But what if there are no species to check against?
* Measure quality of a clustering without requiring pre-determined labels.
* A measure of quality then informs choice of how many clusters to look for
* "clusters" vs "species" or "clusters" vs. "labels" is an example of __cross-tabulation__

### Cross-tabulation with pandas

```
import pandas as pd
df = pd.DataFrame({'labels': labels, 'species': species})
```
* Create `df` from the two lists: the first column holds the cluster labels and the second the iris species, so each row gives the cluster label and species of a single sample

```
ct = pd.crosstab(df['labels'], df['species'])
print(ct)
```
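
As a small self-contained illustration of how to read the cross-tabulation, the sketch below uses made-up `labels` and `species` lists (hypothetical values, not the iris results). Each cell counts the samples that share a cluster label and a species.

```python
import pandas as pd

# Hypothetical values for illustration only
labels = [0, 0, 1, 1, 2, 2, 2, 1]
species = ['setosa', 'setosa', 'versicolor', 'versicolor',
           'virginica', 'virginica', 'versicolor', 'virginica']

df = pd.DataFrame({'labels': labels, 'species': species})

# Rows are cluster labels, columns are species, cells are counts
ct = pd.crosstab(df['labels'], df['species'])
print(ct)
```

A row with a single non-zero cell means that cluster matches one species exactly; rows with several non-zero cells show where the clustering mixes species.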

* How to evaluate a clustering if there is no label information?
* We need a way to measure cluster quality using only samples and their _cluster_ labels.
* A good clustering has tight clusters
* Samples in each cluster bunched together
* __Inertia:__ quantifies how spread out the clusters are (_lower_ is better).
* It is based on the distance from each sample to the centroid of its cluster (scikit-learn reports the sum of the squared distances)
* The inertia of the KMeans model is measured automatically when any of the fit methods are called and is available afterwards in attribute __`.inertia_`__
* KMeans aims to place the clusters in a way that minimizes the inertia.
* More clusters will generally mean lower inertia. _However_:
    * A good clustering has tight clusters (so low inertia)...
    * ...but it also doesn't have too many clusters!
    * A good rule of thumb is to choose the __'elbow'__ in the inertia plot. 
    * That is, a point where the inertia begins to decrease more slowly

```
ks = range(1, 6)
inertias = []
for k in ks:
    #Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    #Fit model to samples
    model.fit(samples)    
    #Append the inertia to the list of inertias
    inertias.append(model.inertia_)    
#Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
```
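
To make the definition of inertia concrete, the sketch below recomputes it by hand as the sum of squared distances from each sample to the centroid of its cluster and compares it with `.inertia_`. The random data, `n_init` and `random_state` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 100 random 2-D samples
rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 2))

model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(samples)

# Look up the centroid assigned to each sample
assigned_centroids = model.cluster_centers_[labels]

# Sum of squared distances from each sample to its centroid
manual_inertia = ((samples - assigned_centroids) ** 2).sum()

print(manual_inertia)
print(model.inertia_)
```

The two printed values should match up to floating-point error.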


```
#Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters=3)
#Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)
#Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
#Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])
#Display ct
print(ct)
```

### Transforming features for better clusters
#### Feature variances
* Wine features have very different variances
* Variance of a feature measures spread of its values
* __In k-means: feature variance = feature influence__
* To give every feature a chance, the data needs to be transformed so that the features have equal variance
* This can be achieved with `StandardScaler`
* `StandardScaler` transforms each feature to have mean 0 and variance 1 
* Features are said to be "standardized"

```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(samples)
samples_scaled = scaler.transform(samples)
```
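
A quick check of the "mean 0, variance 1" claim, as a minimal sketch on made-up data (the two synthetic features and their scales are assumptions for the example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
rng = np.random.default_rng(0)
samples = np.column_stack([
    rng.normal(loc=10.0, scale=1.0, size=200),
    rng.normal(loc=500.0, scale=150.0, size=200),
])

scaler = StandardScaler()
scaler.fit(samples)                         # learn per-feature mean and standard deviation
samples_scaled = scaler.transform(samples)  # apply the learned scaling

print(samples.var(axis=0))          # very different variances before scaling
print(samples_scaled.mean(axis=0))  # approximately 0 for each feature
print(samples_scaled.var(axis=0))   # approximately 1 for each feature
```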
* `StandardScaler` and `KMeans` have similar methods but __important difference:__
    * `StandardScaler` transforms the data and so has a transform method: `fit()` / `transform()`
    * `KMeans` in contrast assigns cluster labels to samples: `fit()` / `predict()`

#### StandardScaler, then KMeans
* The two steps can be conveniently combined using a scikit-learn pipeline
* Data then flows from one step into the next automatically
#### Pipelines combine multiple steps 
* First steps are the same:

```
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(samples)
labels = pipeline.predict(samples)
```
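
One way to see what the pipeline does is to compare it with running the two steps by hand. The sketch below assumes synthetic data and fixes `n_init` and `random_state` (both assumptions, not from the notes) so that the two k-means runs are reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two groups, second feature on a much larger scale
rng = np.random.default_rng(0)
samples = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=[1.0, 100.0], size=(50, 2)),
    rng.normal(loc=[6.0, 0.0], scale=[1.0, 100.0], size=(50, 2)),
])

# Manual version: scale first, then cluster the scaled data
scaled = StandardScaler().fit_transform(samples)
manual_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
manual_labels = manual_kmeans.predict(scaled)

# Pipeline version: the scaled data flows into KMeans automatically
pipeline = make_pipeline(StandardScaler(),
                         KMeans(n_clusters=2, n_init=10, random_state=0))
pipeline.fit(samples)
pipeline_labels = pipeline.predict(samples)

print((manual_labels == pipeline_labels).all())
```

With identical settings the two label arrays are expected to agree; the pipeline simply saves the bookkeeping of passing the scaled data along.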
* Feature standardization improves clustering
#### sklearn preprocessing steps
* `StandardScaler` is a __preprocessing step__
* `MaxAbsScaler` and `Normalizer` are other examples of __preprocessing steps__

```
#Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
#Create scaler: scaler
scaler = StandardScaler()
#Create KMeans instance: kmeans
kmeans = KMeans(n_clusters=4)
#Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans)
```

```
#Import pandas
import pandas as pd
#Fit the pipeline to samples
pipeline.fit(samples)
#Calculate the cluster labels: labels
labels = pipeline.predict(samples)
#Create a DataFrame with labels and species as columns: df
df = pd.DataFrame({'labels':labels, 'species':species})
#Create crosstab: ct
ct = pd.crosstab(df['labels'], df['species'])
#Display ct
print(ct)
```

```
#Import Normalizer
from sklearn.preprocessing import Normalizer
#Create a normalizer: normalizer
normalizer = Normalizer()
#Create a KMeans model with 10 clusters: kmeans
kmeans = KMeans(n_clusters=10)
#Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer, kmeans)
#Fit pipeline to the daily price movements
pipeline.fit(movements)
```
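
`Normalizer` behaves differently from `StandardScaler`: it rescales each sample (row) independently, by default to unit Euclidean length, whereas `StandardScaler` works on each feature (column). The sketch below illustrates this on a tiny made-up array; the values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical data: 3 samples (rows), 2 features (columns)
samples = np.array([[3.0, 4.0],
                    [6.0, 8.0],
                    [1.0, 0.0]])

# Normalizer: every row is rescaled to length 1
normalized = Normalizer().fit_transform(samples)
print(normalized)
print(np.linalg.norm(normalized, axis=1))

# StandardScaler: every column is shifted and scaled instead
standardized = StandardScaler().fit_transform(samples)
print(standardized.mean(axis=0))
```

The first two rows point in the same direction, so they become identical after normalization. For the stock data this means two companies whose prices move in the same pattern are treated alike even if the sizes of their movements differ.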

```
#Import pandas
import pandas as pd
#Predict the cluster labels: labels
labels = pipeline.predict(movements)
#Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': labels, 'companies': companies})
#Display df sorted by cluster label
print(df.sort_values('labels'))
```

# Visualizing hierarchies

* Visualizations communicate insights, and are particularly effective with a non-technical audience.
* Two unsupervised learning techniques for visualization: __t-SNE__ and __Hierarchical Clustering__
* __t-SNE:__ Creates a 2D map of a dataset and conveys useful information about the proximity of the samples to one another. 
## Hierarchical Clustering
* Clusters are contained within one another; like classification of animal kingdom or an ancestry tree
* __Dendrogram:__ a tree-like structure 
* Hierarchical clustering can produce great visualizations with dendrograms
* __Agglomerative Hierarchical Clustering:__ At each step, the two closest clusters are merged until eventually there is only one cluster left (unless merging is stopped earlier at a chosen distance threshold)
* There is also __Divisive Clustering__, which works the other way around.
* Hierarchical clustering using __`SciPy`__:

```
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
mergings = linkage(samples, method='complete')
dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=6)
plt.show()
```

```
#Perform the necessary imports
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
#Calculate the linkage: mergings
mergings = linkage(samples, method='complete')
#Plot the dendrogram, using varieties as labels
dendrogram(mergings,
           labels=varieties,
           leaf_rotation=90,
           leaf_font_size=6,
)
plt.show()
```
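
The array returned by `linkage` records the agglomerative merges directly, which can help when reading a dendrogram. The sketch below uses four hypothetical 1-D samples chosen so the merges are easy to follow; it does not plot anything.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical 1-D samples: two close pairs that are far from each other
samples = np.array([[0.0], [1.0], [5.0], [6.5]])

mergings = linkage(samples, method='complete')

# One row per merge: [cluster index, cluster index, merge height, cluster size]
print(mergings)

# n samples always produce n - 1 merges
print(mergings.shape)
```

Here samples 0 and 1 merge first at height 1.0, samples 2 and 3 merge at height 1.5, and the two resulting clusters merge last at height 6.5, the largest distance between any pair of samples under complete linkage.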

```
#Import normalize
from sklearn.preprocessing import normalize
#Normalize the movements: normalized_movements
normalized_movements = normalize(movements)
#Calculate the linkage: mergings
mergings = linkage(normalized_movements, method='complete')
#Plot the dendrogram
dendrogram(mergings, labels=companies, leaf_rotation=90, leaf_font_size=6)
plt.show()
```
### Cluster labels in hierarchical clustering
* Not only a visualization tool!
* Cluster labels at any intermediate stage can be recovered.
* For use in cross-tabulations etc.
* Intermediate clusters are defined by height
* __Height on dendrogram:__ distance between merging clusters
    * So the height that specifies an intermediate clustering on a dendrogram corresponds to a distance
    * In this way we can define a threshold.
* Complete linkage vs. single linkage vs. average linkage: the methods differ in how the distance between two clusters is measured

#### Extracting cluster labels 
* Use the `fcluster()` function, specifying the height at which to cut the dendrogram
* Returns a numpy array of cluster labels

```
from scipy.cluster.hierarchy import linkage
mergings = linkage(samples, method='complete')
from scipy.cluster.hierarchy import fcluster
labels = fcluster(mergings, 15, criterion = 'distance')
print(labels)
```
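
How the chosen height changes the result can be seen on a tiny example. This is a sketch with hypothetical 1-D samples and arbitrary thresholds, not the data used in the notes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical samples; complete-linkage merge heights are 1.0, 1.5 and 6.5
samples = np.array([[0.0], [1.0], [5.0], [6.5]])
mergings = linkage(samples, method='complete')

# Cut below the first merge: every sample is its own cluster
labels_four = fcluster(mergings, 0.5, criterion='distance')

# Cut between the second and final merge: two clusters remain
labels_two = fcluster(mergings, 2, criterion='distance')

# Cut above every merge: everything falls into one cluster
labels_one = fcluster(mergings, 10, criterion='distance')

print(labels_four)
print(labels_two)
print(labels_one)
```

Raising the height merges more samples together, and the smallest label in each result is 1 rather than 0.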

#### Aligning cluster labels with country names 
* Given a list of strings `country_names`

```
import pandas as pd
pairs = pd.DataFrame({'labels':labels, 'countries':country_names})
print(pairs.sort_values('labels'))
```
__NOTE: SciPy cluster labels start at 1__

```
#Perform the necessary imports
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
#Calculate the linkage: mergings
mergings = linkage(samples, method='single')
#Plot the dendrogram
dendrogram(mergings, labels= country_names, leaf_rotation=90, leaf_font_size= 6)
plt.show()
```
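
The linkage methods differ in how they measure the distance between two clusters: single linkage uses the closest pair of samples, complete linkage the farthest pair, and average linkage the mean over all pairs. A minimal sketch with three hypothetical 1-D samples shows the effect on the height of the final merge.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical 1-D samples: two close points and one farther away
samples = np.array([[0.0], [1.0], [3.0]])

final_heights = {}
for method in ['single', 'complete', 'average']:
    mergings = linkage(samples, method=method)
    # Column 2 holds the merge heights; the last row is the final merge
    final_heights[method] = float(mergings[-1, 2])

print(final_heights)
```

All three methods first merge the points at 0 and 1. The point at 3 then joins at height 2.0 (single), 3.0 (complete) or 2.5 (average), so the same data can produce dendrograms of different shapes.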

```
#Perform the necessary imports
import pandas as pd
from scipy.cluster.hierarchy import fcluster
#Use fcluster to extract labels: labels
labels = fcluster(mergings, 6, criterion='distance')
#Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
#Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])
#Display ct
print(ct)
```

## t-SNE for 2-dimensional maps
* __t-SNE:__ "t-distributed stochastic neighbor embedding"
    * It maps samples from their higher-dimensional space into a 2- or 3-dimensional space so they can be visualized.
    * While some distortion is inevitable, t-SNE does a great job of approximately representing the distances between the samples.
    * t-SNE is an invaluable visual aid for understanding a dataset.
* __low inertia:__ tight clusters

```
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
model = TSNE(learning_rate=100)
transformed = model.fit_transform(samples)
xs = transformed[:,0]
ys = transformed[:,1]
plt.scatter(xs, ys, c=species)
plt.show()
```
* __t-SNE has a `.fit_transform()` method but no `.transform()` method__
* Simultaneously fits the model and transforms the data.
* You __can't__ extend a t-SNE map to include more samples
* You may need to try different learning rates for different datasets
* If you choose a poor learning rate, it will be obvious, because all the samples will be bunched together in the scatter plot.
* Try values between 50 and 200.
* __The axes of a t-SNE plot do not have any interpretable meaning.__
* __t-SNE features are different every time.__
* The orientation of the plot may differ every time you run t-SNE, but the positions of the samples relative to one another should stay broadly the same.

```
#Import TSNE
from sklearn.manifold import TSNE
#Create a TSNE instance: model
model = TSNE(learning_rate=200)
#Apply fit_transform to samples: tsne_features
tsne_features = model.fit_transform(samples)
#Select the 0th feature: xs
xs = tsne_features[:,0]
#Select the 1st feature: ys
ys = tsne_features[:,1]
#Scatter plot, coloring by variety_numbers
plt.scatter(xs, ys, c=variety_numbers)
plt.show()
```
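
The point about `fit_transform` can be checked directly. The sketch below runs t-SNE on hypothetical random data (the sample count, feature count and `random_state` are assumptions) and inspects the result without plotting.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical data: 60 samples with 5 features
rng = np.random.default_rng(0)
samples = rng.normal(size=(60, 5))

model = TSNE(learning_rate=100, random_state=0)
tsne_features = model.fit_transform(samples)

# One 2-D point per sample
print(tsne_features.shape)

# There is no separate transform method, so new samples cannot be added to an existing map
print(hasattr(model, 'transform'))
```

To include additional samples, the map has to be recomputed from scratch on the combined data.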

```
#Import TSNE
from sklearn.manifold import TSNE
#Create a TSNE instance: model
model = TSNE(learning_rate=50)
#Apply fit_transform to normalized_movements: tsne_features
tsne_features = model.fit_transform(normalized_movements)
#Select the 0th feature: xs
xs = tsne_features[:,0]
#Select the 1st feature: ys
ys = tsne_features[:,1]
#Scatter plot
plt.scatter(xs, ys, alpha = 0.5)
#Annotate the points
for x, y, company in zip(xs, ys, companies):
    plt.annotate(company, (x, y), fontsize=5, alpha=0.75)
plt.show()
```
