# Exploring Clustering Techniques with Machine Learning

In contrast to _supervised_ machine learning, _unsupervised_ learning is used when there is no "ground truth" from which to train and validate label predictions. The most common form of unsupervised learning is _clustering_, which is conceptually similar to _classification_, except that the training data does not include known values for the class label to be predicted. **Clustering works by separating the training cases based on similarities** that can be determined from their feature values. Think of it this way: **the numeric features of a given entity can be thought of as vector coordinates that define the entity's position in n-dimensional space**. A clustering model seeks to identify groups, or _clusters_, of entities that are close to one another while being separated from other clusters.
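To make the "position in n-dimensional space" idea concrete, here is a minimal sketch that treats three seed samples as 7-dimensional vectors and measures how far apart they are. The values are copied from rows 35, 170, and 185 of the sample shown in the Load Data section; the helper name `euclidean_distance` and the choice of Euclidean distance on raw, unscaled values are assumptions made for illustration, not necessarily what a clustering model in this notebook uses.

```python
import numpy as np

# Feature order: area, perimeter, compactness, kernel_length,
# kernel_width, asymmetry_coefficient, groove_length.
# Values copied from the seeds sample; the species column is left out.
row_35 = np.array([16.12, 15.0, 0.9, 5.709, 3.485, 2.27, 5.443])
row_170 = np.array([11.02, 13.0, 0.8189, 5.325, 2.701, 6.735, 5.163])
row_185 = np.array([11.56, 13.31, 0.8198, 5.363, 2.683, 4.062, 5.182])


def euclidean_distance(a, b):
    """Straight-line distance between two points in n-dimensional space."""
    # Hypothetical helper for this sketch: the norm of the difference vector.
    return float(np.linalg.norm(a - b))


d_170_185 = euclidean_distance(row_170, row_185)
d_35_170 = euclidean_distance(row_35, row_170)

print(f"distance(170, 185) = {d_170_185:.3f}")
print(f"distance(35, 170)  = {d_35_170:.3f}")
```

Rows 170 and 185 come out closer to each other than row 35 is to row 170, which is the kind of proximity a clustering model relies on when it groups entities.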


## Topics

**Explore unsupervised clustering** using a seeds dataset.

**Visualize high-dimensional data** with Principal Component Analysis (PCA).

**Determine optimal cluster count** using the "elbow" method.

**Implement K-Means and Agglomerative Clustering** to group seed samples.

**Compare clusters** against actual species labels to evaluate performance.


## Load Data

```python
import pandas as pd

# Load the training dataset
data = pd.read_csv('./../../data/seeds.csv')

# Display a random sample of 10 observations
features = data.sample(10)
features
```

| | area | perimeter | compactness | kernel_length | kernel_width | asymmetry_coefficient | groove_length | species |
|---|---|---|---|---|---|---|---|---|
| 35 | 16.12 | 15.0 | 0.9 | 5.709 | 3.485 | 2.27 | 5.443 | 0 |
| 170 | 11.02 | 13.0 | 0.8189 | 5.325 | 2.701 | 6.735 | 5.163 | 2 |
| 124 | 15.99 | 14.89 | 0.9064 | 5.363 | 3.582 | 3.336 | 5.144 | 1 |
| 185 | 11.56 | 13.31 | 0.8198 | 5.363 | 2.683 | 4.062 | 5.182 | 2 |
| 182 | 12.19 | 13.36 | 0.8579 | 5.24 | 2.909 | 4.857 | 5.158 | 2 |
| 189 | 10.59 | 12.41 | 0.8648 | 4.899 | 2.787 | 4.975 | 4.794 | 2 |
| 152 | 12.26 | 13.6 | 0.8333 | 5.408 | 2.833 | 4.756 | 5.36 | 2 |
| 16 | 13.99 | 13.83 | 0.9183 | 5.119 | 3.383 | 5.234 | 4.781 | 0 |
| 38 | 14.8 | 14.52 | 0.8823 | 5.656 | 3.288 | 3.112 | 5.309 | 0 |
| 33 | 13.94 | 14.17 | 0.8728 | 5.585 | 3.15 | 2.124 | 5.012 | 0 |
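The sample above includes a `species` column. Clustering is trained without a class label, and the topics list says the clusters are later compared against the actual species, so one plausible preparation step is to split the numeric features from that column. The sketch below is only an illustration under those assumptions: it rebuilds three of the rows shown above in a small DataFrame so it runs without the CSV file, and the names `sample_rows`, `X`, and `y` are hypothetical.

```python
import pandas as pd

# Three rows copied from the sample output (rows 35, 170, 124).
# In the notebook itself, the loaded `data` DataFrame would be used instead.
sample_rows = pd.DataFrame(
    {
        "area": [16.12, 11.02, 15.99],
        "perimeter": [15.0, 13.0, 14.89],
        "compactness": [0.9, 0.8189, 0.9064],
        "kernel_length": [5.709, 5.325, 5.363],
        "kernel_width": [3.485, 2.701, 3.582],
        "asymmetry_coefficient": [2.27, 6.735, 3.336],
        "groove_length": [5.443, 5.163, 5.144],
        "species": [0, 2, 1],
    },
    index=[35, 170, 124],
)

# Features only: the values a clustering model would see.
X = sample_rows.drop(columns=["species"])

# Known species, set aside for comparing against clusters afterwards.
y = sample_rows["species"]

print(X.shape)
print(y.tolist())
```

Keeping `y` separate means the label cannot leak into the feature vectors, while still being available for the comparison step.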
