## Description

Say you have a collection of customers with a variety of characteristics such as age, location, and financial history, and you wish to discover patterns and sort the customers into clusters. Or perhaps you have a set of texts, such as Wikipedia pages, and you wish to segment them into categories based on their content. This is the world of unsupervised learning, so called because you are not guiding, or supervising, the pattern discovery with some prediction task, but instead uncovering hidden structure in unlabeled data. Unsupervised learning encompasses a variety of machine learning techniques, from clustering to dimension reduction to matrix factorization. In this course, you'll learn the fundamentals of unsupervised learning and implement the essential algorithms using scikit-learn and SciPy.

## Imports

In [None]:
import pandas as pd
from pprint import pprint as pp
from itertools import combinations
from zipfile import ZipFile
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
import numpy as np
from pathlib import Path
import requests
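import sys
# Used by the clustering and t-SNE cells further down
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE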


# 1 - Clustering for dataset exploration

Learn how to discover the underlying groups (or "clusters") in a dataset.

### 1.1 - Clustering 2D points

From the scatter plot of the previous exercise, you saw that the points seem to separate into 3 clusters. You'll now create a KMeans model to find 3 clusters, and fit it to the data points from the previous exercise. After the model has been fit, you'll obtain the cluster labels for some new points using the `.predict()` method.

You are given the array `points` from the previous exercise, and also an array `new_points`.


In [None]:
iris = sns.load_dataset('iris')

iris_samples = iris.sample(n=75, replace=False, random_state=3)
points = iris_samples.iloc[:, :4]
points = points.to_numpy()

iris_new_samples = iris[~iris.index.isin(iris_samples.index)].copy()
new_points = iris_new_samples.iloc[:, :4]
new_points = new_points.to_numpy()
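
The fit-then-predict workflow described above could look roughly like the following. This is a minimal sketch, not the course solution: it uses synthetic data from `make_blobs` as a stand-in for `points` and `new_points`, and the `n_init` and `random_state` values are assumptions chosen only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption): 3 groups of points with 4 features
X, _ = make_blobs(n_samples=150, centers=3, n_features=4,
                  cluster_std=0.8, random_state=3)
points, new_points = X[:75], X[75:]

# Create a KMeans model that looks for 3 clusters and fit it to `points`
model = KMeans(n_clusters=3, n_init=10, random_state=3)
model.fit(points)

# Assign each new point to the nearest learned centroid
labels = model.predict(new_points)
print(labels)
```

Because `.predict()` only compares new samples with the centroids learned during `.fit()`, the model does not need to be refit when new points arrive.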


**You've successfully performed k-Means clustering and predicted the labels of new points. But it is not easy to inspect the clustering by just looking at the printed labels. A visualization would be far more useful. In the next exercise, you'll inspect your clustering with a scatter plot!**

### 1.2 - Inspect your clustering

Let's now inspect the clustering you performed in the previous exercise!

A solution to the previous exercise has already run, so `new_points` is an array of points and `labels` is the array of their cluster labels.
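
One way to inspect the clustering is to colour the new points by their label and overlay the centroids. The sketch below is illustrative only: it rebuilds a small 2D synthetic dataset with `make_blobs` (an assumption, so that the block runs on its own), and the marker style and size are arbitrary choices.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D stand-in data (an assumption)
X, _ = make_blobs(n_samples=150, centers=3, n_features=2, random_state=3)
points, new_points = X[:75], X[75:]

model = KMeans(n_clusters=3, n_init=10, random_state=3).fit(points)
labels = model.predict(new_points)

# Scatter plot of the new points, coloured by cluster label
xs = new_points[:, 0]
ys = new_points[:, 1]
plt.scatter(xs, ys, c=labels, alpha=0.5)

# Overlay the cluster centroids as diamonds
centroids = model.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], marker='D', s=50)
plt.show()
```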

**The clustering looks great! But how can you be sure that 3 clusters is the correct choice? In other words, how can you evaluate the quality of a clustering?**

# 2 - Evaluating a clustering

### 2.1 - How many clusters of grain?

You are given an array `samples` containing the measurements (such as area, perimeter, length, and several others) of samples of grain. What's a good number of clusters in this case?

`KMeans` and PyPlot (`plt`) have already been imported for you.

This dataset was sourced from the [UCI Machine Learning Repository][1].

  [1]: https://archive.ics.uci.edu/ml/datasets/seeds

In [None]:
sed = pd.read_csv("seeds.csv", header=None)

sed['varieties'] = sed[7].map({1: 'Kama wheat', 2: 'Rosa wheat', 3: 'Canadian wheat'})

sed.head(2)
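
A common way to choose the number of clusters is to fit `KMeans` for several values of `k` and plot the inertia of each fit, looking for the "elbow" where the curve flattens. The following is a sketch under stated assumptions: the data is synthetic (`make_blobs` with 7 features, standing in for the grain measurements), and the range of `k` values is an arbitrary choice.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the grain measurements (an assumption)
samples, _ = make_blobs(n_samples=210, centers=3, n_features=7, random_state=0)

ks = range(1, 6)
inertias = []
for k in ks:
    # Fit a model with k clusters and record its inertia
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(samples)
    inertias.append(model.inertia_)

# Plot inertia against the number of clusters
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
```

To apply this to the grain data, replace the synthetic array with the first seven columns of `sed`.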

**The inertia decreases very slowly from 3 clusters to 4, so it looks like 3 clusters would be a good choice for this data.**

### 2.2 - Evaluating the grain clustering

In the previous exercise, you observed from the inertia plot that 3 is a good number of clusters for the grain data. In fact, the grain samples come from a mix of 3 different grain varieties: "Kama", "Rosa" and "Canadian". In this exercise, cluster the grain samples into three clusters, and compare the clusters to the grain varieties using a cross-tabulation.

You have the array `samples` of grain measurements, and a list `varieties` giving the grain variety for each sample. Pandas (`pd`) and `KMeans` have already been imported for you.
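
The comparison of clusters with known varieties can be sketched with `pd.crosstab`. The data below is synthetic, and the assignment of variety names to the generated groups is an arbitrary assumption made only so the table has readable column names.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data; the name-to-group mapping is arbitrary (an assumption)
samples, y = make_blobs(n_samples=210, centers=3, n_features=7, random_state=0)
names = ['Kama wheat', 'Rosa wheat', 'Canadian wheat']
varieties = [names[i] for i in y]

# Fit the model and obtain cluster labels in one step
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(samples)

# Count how many samples of each variety fall into each cluster
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
ct = pd.crosstab(df['labels'], df['varieties'])
print(ct)
```

A clustering that matches the varieties well puts most of each column's count into a single row.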

# 3 - Visualization with hierarchical clustering and t-SNE

Hierarchical clustering and t-SNE are two unsupervised learning techniques for data visualization. Hierarchical clustering merges the data samples into ever-coarser clusters, yielding a tree visualization of the resulting cluster hierarchy. t-SNE maps the data samples into 2D space so that the proximity of the samples to one another can be visualized.

**A Note Regarding the Data**

- The Eurovision data, `euv`, is used for the lecture and some of the following exercises.
- The `.shape` of the Eurovision `samples` is `(42, 26)`
- The Eurovision DataFrame must be pivoted to achieve the correct shape
  - `'From country'` is `index`
  - `'To country'` is `columns`
  - `'Jury Points'` is `values`

In [None]:
euv = pd.read_csv('eurovision-2016.csv')
euv.head(2)
euvp = euv.pivot(index='From country', columns='To country', values='Jury Points').fillna(0)
euv_samples = euvp.to_numpy()

In [None]:
euvp.iloc[:5, :5]

In [None]:
plt.figure(figsize=(16, 6))
euv_mergings = linkage(euv_samples, method='complete')
dendrogram(euv_mergings, labels=euvp.index, leaf_rotation=90, leaf_font_size=12)
plt.title('Countries Hierarchically Clustered by Eurovision 2016 Voting')
plt.show()

### 3.1 - Hierarchical clustering of the grain data

Use the `linkage()` function to obtain a hierarchical clustering of the grain samples, and use `dendrogram()` to visualize the result. A sample of the grain measurements is provided in the array `samples`, while the variety of each grain sample is given by the list `varieties`.



In [None]:
seed_sample = sed.groupby('varieties').sample(n=14, random_state=250)
samples = seed_sample.iloc[:, :7]
varieties = seed_sample.varieties.tolist()
print(varieties)
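
A hierarchical clustering of this kind might be computed and drawn as follows. This is a sketch on synthetic data (42 samples, matching the 14-per-variety sample size used above); the variety names attached to the generated groups and the figure size are assumptions.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs

# Synthetic stand-in for the sampled grain measurements (an assumption)
samples, y = make_blobs(n_samples=42, centers=3, n_features=7, random_state=250)
names = ['Kama wheat', 'Rosa wheat', 'Canadian wheat']
varieties = [names[i] for i in y]

# Calculate the linkage: each row of `mergings` records one merge
mergings = linkage(samples, method='complete')

# Plot the dendrogram, labelling each leaf with its variety
plt.figure(figsize=(12, 4))
dendrogram(mergings, labels=varieties, leaf_rotation=90, leaf_font_size=6)
plt.show()
```

For `n` samples, `linkage()` returns an `(n - 1, 4)` array whose third column holds the height of each merge.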

**Dendrograms are a great way to illustrate the arrangement of the clusters produced by hierarchical clustering.**

In [None]:
mergings = linkage(euv_samples, method='complete')
labels = fcluster(mergings, 15, criterion='distance')
print(labels)

In [None]:
pairs = pd.DataFrame({'labels': labels, 'countries': euvp.index}).sort_values('labels')
pairs

### 3.2 - Different linkage, different hierarchical clustering

You saw a hierarchical clustering of the voting countries at the Eurovision song contest using `'complete'` linkage. Now, perform a hierarchical clustering of the voting countries with `'single'` linkage, and compare the resulting dendrogram with the one in the presentation. Different linkage, different hierarchical clustering!

You are given an array `samples`. Each row corresponds to a voting country, and each column corresponds to a performance that was voted for. The list `country_names` gives the name of each voting country. This dataset was obtained from [Eurovision][2].

  [2]: http://www.eurovision.tv/page/results

In [None]:
# Take the names from the pivoted index so they align with the rows of euv_samples
country_names = euvp.index.tolist()
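
To see how the choice of linkage changes the hierarchy, the two methods can be drawn side by side. The sketch below uses synthetic data and hypothetical leaf names (`item 0`, `item 1`, ...) in place of the Eurovision samples and country names.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs

# Synthetic stand-in data and hypothetical leaf names (assumptions)
samples, _ = make_blobs(n_samples=30, centers=3, n_features=5, random_state=1)
names = [f'item {i}' for i in range(len(samples))]

fig, axes = plt.subplots(ncols=2, figsize=(14, 4))
mergings_by_method = {}
for ax, method in zip(axes, ['complete', 'single']):
    # 'complete' uses the farthest pair between clusters, 'single' the closest
    mergings = linkage(samples, method=method)
    mergings_by_method[method] = mergings
    dendrogram(mergings, labels=names, leaf_rotation=90, leaf_font_size=6, ax=ax)
    ax.set_title(f'{method} linkage')

plt.tight_layout()
plt.show()
```

Single linkage measures the distance between clusters by their closest pair of points, so its merge heights are never larger than the corresponding complete-linkage heights, and its dendrogram often looks more "chained".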

**As you can see, performing single linkage hierarchical clustering produces a different dendrogram!**

### 3.3 - Intermediate clusterings

If the hierarchical clustering were stopped at height 6 on the dendrogram, how many clusters would there be?

![][3]

**Possible Answers**

- 1
- 3
- As many as there were at the beginning

  [3]: https://raw.githubusercontent.com/trenton3983/DataCamp/master/Images/2021-03-29_unsupervised_learning_python/intermediate_clusterings.JPG

### 3.4 - Extracting the cluster labels

In the previous exercise, you saw that the intermediate clustering of the grain samples at height 6 has X clusters. Now, use the `fcluster()` function to extract the cluster labels for this intermediate clustering, and compare the labels with the grain varieties using a cross-tabulation.

The hierarchical clustering has already been performed and `mergings` is the result of the `linkage()` function. The list `varieties` gives the variety of each grain sample.


In [None]:
seed_sample = sed.groupby('varieties').sample(n=14, random_state=250)
samples = seed_sample.iloc[:, :7]
varieties = seed_sample.varieties.tolist()

# Calculate the linkage: mergings
mergings = linkage(samples, method='complete')
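
Extracting labels at a given height and cross-tabulating them could be sketched as below. The data is synthetic, so the number of clusters found at height 6 here says nothing about the grain data; the variety names attached to the generated groups are likewise an assumption.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Synthetic stand-in for the sampled grain measurements (an assumption)
samples, y = make_blobs(n_samples=42, centers=3, n_features=7, random_state=250)
names = ['Kama wheat', 'Rosa wheat', 'Canadian wheat']
varieties = [names[i] for i in y]

mergings = linkage(samples, method='complete')

# Cut the hierarchy at height 6; fcluster labels start at 1, not 0
labels = fcluster(mergings, 6, criterion='distance')

# Compare the extracted labels with the varieties
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
ct = pd.crosstab(df['labels'], df['varieties'])
print(ct)
```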

### 3.5 - t-SNE for 2-dimensional maps


In [None]:
rs = [100, 200, 300]
fig, axes = plt.subplots(ncols=3, figsize=(15, 3))
axes = axes.ravel()

for i, state in enumerate(rs):
    ax = axes[i]
    
    model = TSNE(learning_rate=100, random_state=state)
    transformed = model.fit_transform(iris.iloc[:, :4])

    xs = transformed[:, 0]
    ys = transformed[:, 1]

    sns.scatterplot(x=xs, y=ys, hue=iris.species, ax=ax)
    ax.set_title(f't-SNE applied to Iris with random_state={state}')
    
plt.tight_layout()
plt.show()

### 3.6 - t-SNE visualization of grain dataset

You saw t-SNE applied to the iris dataset. In this exercise, you'll apply t-SNE to the grain samples data and inspect the resulting t-SNE features using a scatter plot. You are given an array `samples` of grain samples and a list `variety_numbers` giving the variety number of each grain sample.

**Instructions**

- Import `TSNE` from `sklearn.manifold`.
- Create a TSNE instance called `model` with `learning_rate=200`.
- Apply the `.fit_transform()` method of `model` to `samples`. Assign the result to `tsne_features`.
- Select the column `0` of `tsne_features`. Assign the result to `xs`.
- Select the column `1` of `tsne_features`. Assign the result to `ys`.
- Make a scatter plot of the t-SNE features `xs` and `ys`. To color the points by the grain variety, specify the additional keyword argument `c=variety_numbers`.

In [None]:
samples = sed.iloc[:, :7]
variety_numbers = sed[7]
variety_names = sed.varieties

In [None]:
# Import TSNE
from sklearn.manifold import TSNE

# Create a TSNE instance: model
model = TSNE(learning_rate=200)

# Apply fit_transform to samples: tsne_features
tsne_features = model.fit_transform(samples)

# Select the 0th feature: xs
xs = tsne_features[:, 0]

# Select the 1st feature: ys
ys = tsne_features[:, 1]

# Scatter plot, coloring by variety_numbers
plt.scatter(xs, ys, c=variety_numbers)
plt.show()
