# Cluster Analysis

The Iris dataset was used in R.A. Fisher's classic 1936 paper, *The Use of Multiple Measurements in Taxonomic Problems*, and can also be found on the [UCI](https://archive.ics.uci.edu/ml/datasets/iris) Machine Learning Repository.

It contains 150 samples, 50 from each of three iris species, with four measurements per flower (sepal length, sepal width, petal length, and petal width, in centimeters). One species (Iris setosa) is linearly separable from the other two, but the other two are not linearly separable from each other.

The columns in this dataset are:

- Id
- SepalLengthCm
- SepalWidthCm
- PetalLengthCm
- PetalWidthCm
- Species


## Load and Inspect Data

1. Load the file for this week's analysis:
```
iris.csv
```
2. Measure the pairwise correlation coefficients between the features
 - Plot a correlation heatmap and/or a scatter matrix


## K-Means Cluster Analysis

1. Use K-Means clustering with a cluster count of k=3
2. Compare the results of the clustering to the actual labels
3. Evaluate the results using the following metrics:
 - Homogeneity
 - Completeness
 - V Measure
 - Silhouette
4. Use the elbow method to choose the optimal k (time permitting)
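
The steps above could be sketched roughly as follows. This is a minimal sketch, not the expected solution: it assumes scikit-learn is installed and uses its bundled copy of Iris (`load_iris`) as a stand-in for `iris.csv`; `n_init=10`, `random_state=0`, and the range of k values tried for the elbow are arbitrary choices made for illustration.

```python
import pandas as pd
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv
iris = load_iris()
X, y = iris.data, iris.target

# Step 1: K-Means with k=3 (fixed seed so the run is repeatable)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Step 2: cluster ids are arbitrary, so compare with a contingency table
# rather than checking labels for equality
print(pd.crosstab(iris.target_names[y], km.labels_,
                  rownames=["species"], colnames=["cluster"]))

# Step 3: the four metrics listed above
scores = {
    "homogeneity": metrics.homogeneity_score(y, km.labels_),
    "completeness": metrics.completeness_score(y, km.labels_),
    "v_measure": metrics.v_measure_score(y, km.labels_),
    "silhouette": metrics.silhouette_score(X, km.labels_),
}
for name, value in scores.items():
    print(f"{name:>12}: {value:.3f}")

# Step 4: elbow method, i.e. within-cluster sum of squares (inertia) per k
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 9)
}
for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against k and looking for the point where the curve flattens gives the usual elbow reading. Homogeneity, completeness, and V-measure need the true labels, while the silhouette score uses only the features and the predicted clusters.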

## Hierarchical Clustering

Perform a hierarchical clustering analysis with 3 clusters, once for each of the following linkage techniques:
 - **Ward**: minimize the variance within the clusters being merged
 - **Complete**: minimize the maximum distance between points in a pair of clusters
 - **Average**: minimize the average distance between all points in a pair of clusters
 - **Single**: minimize the distance between the closest points in a pair of clusters
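
One way to try the four linkages is scikit-learn's `AgglomerativeClustering`. The sketch below is an illustration under assumptions: it uses the bundled `load_iris` data in place of `iris.csv` and reports only the resulting cluster sizes, which is often enough to see how differently the linkages behave.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv
X = load_iris().data

cluster_sizes = {}
for linkage in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit(X)
    # Count the points per cluster and list the sizes largest-first
    sizes = np.sort(np.bincount(model.labels_))[::-1].tolist()
    cluster_sizes[linkage] = sizes
    print(f"{linkage:>8}: {sizes}")
```

Very uneven sizes for a linkage usually mean it has chained points into one large cluster, which is worth checking against the true species before trusting the result.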

## Summarize Results

Compare the results of all techniques.
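
A comparison like this can be collected into a single table. The following is a hedged sketch rather than the intended answer: it assumes scikit-learn's bundled Iris data in place of `iris.csv`, and the model names used as row labels are made up for illustration.

```python
import pandas as pd
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv
iris = load_iris()
X, y = iris.data, iris.target

# Row labels below are illustrative names, not part of the assignment
models = {"k-means": KMeans(n_clusters=3, n_init=10, random_state=0)}
for linkage in ("ward", "complete", "average", "single"):
    models[f"hierarchical ({linkage})"] = AgglomerativeClustering(
        n_clusters=3, linkage=linkage
    )

rows = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    rows[name] = {
        "homogeneity": metrics.homogeneity_score(y, labels),
        "completeness": metrics.completeness_score(y, labels),
        "v_measure": metrics.v_measure_score(y, labels),
        "silhouette": metrics.silhouette_score(X, labels),
    }

# One row per technique, best V-measure first
summary = pd.DataFrame.from_dict(rows, orient="index")
summary = summary.sort_values("v_measure", ascending=False)
print(summary.round(3))
```

Sorting by V-measure is one possible choice; since the silhouette score ignores the true labels, the two columns can rank the techniques differently.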

In [1]:
import os
import pandas as pd
import numpy as np
import statsmodels.api as sm
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Custom Evaluation Function

You can use the following function to evaluate a fitted clustering model against the true labels:

In [2]:
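from sklearn import metrics
import prettytable


def evaluate(model, s, labels_true):
    """Print and return clustering metrics for a fitted model.

    model: fitted clusterer exposing `labels_` (e.g. KMeans)
    s: feature matrix the model was fitted on (used for the silhouette score)
    labels_true: ground-truth class labels
    """
    labels_pred = model.labels_

    # Label-based metrics: compare the predicted clusters with the true classes
    homogeneity = metrics.homogeneity_score(labels_true, labels_pred)
    completeness = metrics.completeness_score(labels_true, labels_pred)
    v_measure = metrics.v_measure_score(labels_true, labels_pred)
    # The silhouette score needs only the features and the predicted clusters
    silhouette = metrics.silhouette_score(s, labels_pred)

    pt = prettytable.PrettyTable(['metric', 'value'])
    pt.add_row(['Homogeneity', homogeneity])
    pt.add_row(['Completeness', completeness])
    pt.add_row(['V Measure', v_measure])
    pt.add_row(['Silhouette', silhouette])
    print(pt)

    return {'homogeneity': homogeneity,
            'completeness': completeness,
            'v_measure': v_measure,
            'silhouette': silhouette}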

In [3]:
location = '../../data/'
files = os.listdir(location)
files

['responses.csv',
 'CrossStats20150102.txt',
 'auto_2020.xlsx',
 'multiple_choice.csv',
 'iris_names.txt',
 'state_codes.csv',
 'iris.csv',
 'nst-est2019-popchg2010-2019.pdf',
 'questions.csv',
 'mount_rainier_daily.csv',
 'COVID_by_State.csv',
 'Candidate Assessment.xlsx',
 'nst-est2019-popchg2010_2019.csv']

## Load and Inspect Data

1. Load the file for this week's analysis:
```
iris.csv
```
2. Measure the pairwise correlation coefficients between the features
 - Plot a correlation heatmap and/or a scatter matrix
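
A minimal sketch of these two steps is shown below. It is written under assumptions: because the layout of `iris.csv` is only known from the column list above, the helper `load_iris_frame` (a hypothetical name) rebuilds an equivalent frame from scikit-learn's bundled copy of the data, and `plot_correlation` is likewise an illustrative helper, not part of the notebook. With the real file, `pd.read_csv(location + 'iris.csv')` would presumably replace the stand-in.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Feature column names as listed at the top of this notebook
FEATURES = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]


def load_iris_frame():
    """Hypothetical stand-in for reading iris.csv, built from scikit-learn's copy."""
    bunch = load_iris()
    df = pd.DataFrame(bunch.data, columns=FEATURES)
    # Species names here are 'setosa', 'versicolor', 'virginica';
    # the CSV may spell them differently
    df["Species"] = bunch.target_names[bunch.target]
    return df


def plot_correlation(corr):
    """Draw an annotated heatmap of a correlation matrix (illustrative helper)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_title("Feature correlation")
    plt.show()


df = load_iris_frame()
corr = df[FEATURES].corr()  # Pearson correlation by default
print(corr.round(2))
```

Calling `plot_correlation(corr)` in a notebook cell draws the heatmap, and `sns.pairplot(df, hue="Species")` is one option for the scatter matrix.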