# Cluster Analysis

The Iris dataset was used in R.A. Fisher's classic 1936 paper, *The Use of Multiple Measurements in Taxonomic Problems*, and can also be found on the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/iris).

It contains 50 samples from each of three iris species, along with four measurements per flower (sepal length, sepal width, petal length, and petal width, in centimeters). One species is linearly separable from the other two, but the other two are not linearly separable from each other.

The columns in this dataset are:

- Id
- SepalLengthCm
- SepalWidthCm
- PetalLengthCm
- PetalWidthCm
- Species


## Load and Inspect Data

1. Load the file for this week's analysis:
```
iris.csv
```
2. Measure the correlation coefficients between the features
   - Plot a correlation heatmap and/or a scatter matrix


## K-Means Cluster Analysis

1. Use K-Means clustering with a cluster count of k=3
2. Compare the results of the clustering to the actual labels
3. Use the elbow method to decide the optimal k
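
The three steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's own solution: it assumes scikit-learn's `KMeans`, uses scikit-learn's bundled copy of the Iris data as a stand-in for `iris.csv` (with columns renamed to match), and picks `n_init=10`, `random_state=0`, and the range k = 1..8 purely for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv,
# with columns renamed to match the names listed in this notebook.
raw = load_iris()
features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
df = pd.DataFrame(raw.data, columns=features)
df['Species'] = raw.target_names[raw.target]

# 1. Fit K-Means with k=3 (random_state fixed only for repeatability).
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df[features])

# 2. Compare the cluster labels with the actual species.
comparison = pd.crosstab(df['Species'], model.labels_, colnames=['cluster'])
print(comparison)

# 3. Elbow method: record the within-cluster sum of squares (inertia)
#    for a range of k and look for the point where the drop levels off.
inertias = {}
for k in range(1, 9):
    fitted = KMeans(n_clusters=k, n_init=10, random_state=0).fit(df[features])
    inertias[k] = fitted.inertia_
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Cluster numbers are arbitrary, so the comparison table should be read row by row: a species is well recovered when most of its samples sit in a single column. Plotting the `inertias` values against k gives the usual elbow curve.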

## Hierarchical Clustering

1. Use hierarchical clustering
2. Compare the results to K-Means
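
One possible way to approach these two steps is sketched below. The notebook does not say which hierarchical method to use, so the choice of scikit-learn's `AgglomerativeClustering` with Ward linkage is an assumption, as is the use of the adjusted Rand index to compare labelings. The bundled scikit-learn Iris data again stands in for `iris.csv`.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv.
raw = load_iris()
X = raw.data        # the four measurements
species = raw.target  # integer-coded species

# 1. Hierarchical (agglomerative) clustering, cut at three clusters.
hier = AgglomerativeClustering(n_clusters=3, linkage='ward').fit(X)

# 2. K-Means with the same number of clusters, for comparison.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Cross-tabulate the two labelings; cluster numbers are arbitrary,
# so agreement shows up as one large cell per row.
agreement = pd.crosstab(hier.labels_, km.labels_,
                        rownames=['hierarchical'], colnames=['kmeans'])
print(agreement)

# The adjusted Rand index ignores label numbering: 1.0 means identical groupings.
ari_methods = adjusted_rand_score(hier.labels_, km.labels_)
ari_hier = adjusted_rand_score(species, hier.labels_)
ari_km = adjusted_rand_score(species, km.labels_)
print('hierarchical vs k-means: %.3f' % ari_methods)
print('hierarchical vs species: %.3f' % ari_hier)
print('k-means vs species:      %.3f' % ari_km)
```

Other linkage settings (`'complete'`, `'average'`, `'single'`) can give noticeably different groupings, so it may be worth trying more than one.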


In [1]:
import os
import pandas as pd
import numpy as np
import statsmodels.api as sm
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Custom Evaluation Function

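You can use the following function to estimate the accuracy of a clustering. It maps each true class to the cluster label that occurs most often for it, counts the samples that fall in that dominant cluster as correct, and returns the per-class label counts. Because two classes can share the same dominant cluster, the reported accuracy can overstate how well the classes are separated: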

In [2]:
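def summarize(model, values):
    """Print clustering accuracy and return per-class cluster counts.

    model  -- fitted clusterer exposing a labels_ attribute
    values -- true class name for each sample, in the same order
    """
    result = list(zip(model.labels_, values))
    counts = {}  # counts[name] = {cluster label: number of samples}
    labels = {}  # labels[name] = dominant cluster label for that class
    for (label, name) in result:
        if name not in counts:
            counts[name] = {}
            labels[name] = label
        if label not in counts[name]:
            counts[name][label] = 0
        counts[name][label] += 1
        # Largest count among the other cluster labels seen for this class
        other = 0
        for l in counts[name].keys():
            if l != label:
                other = max(other, counts[name][l])
        # The current label becomes dominant once it strictly leads
        if counts[name][label] > other:
            labels[name] = label
    # A sample counts as correct when it falls in its class's dominant cluster
    correct = 0
    for (label, name) in result:
        correct += int(label == labels[name])
    print('ACCURACY')
    print('Correct:  %d' % correct)
    print('Wrong:    %d' % (len(result) - correct))
    print('Accuracy: %6.4f' % (correct / len(result)))
    return counts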
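
As a cross-check on `summarize`, the same dominant-label counting can be expressed with a pandas cross-tabulation. The sketch below uses a small made-up set of cluster labels and class names (hypothetical values, not Iris results) so the arithmetic is easy to follow by hand.

```python
import pandas as pd

# Hypothetical toy data: cluster labels as a fitted model would expose them
# in `labels_`, paired with the true class name of each sample.
cluster_labels = [0, 0, 1, 1, 1, 2, 2, 0]
names = ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']

# Rows are true classes, columns are cluster labels, cells are sample counts.
table = pd.crosstab(pd.Series(names, name='name'),
                    pd.Series(cluster_labels, name='cluster'))

dominant = table.idxmax(axis=1)    # most frequent cluster label per class
correct = table.max(axis=1).sum()  # samples that fall in their dominant cluster
accuracy = correct / table.values.sum()

print(table)
print(dominant.to_dict())
print('Accuracy: %6.4f' % accuracy)
```

Here class `c` has two samples in cluster 2 and one in cluster 0, so 7 of the 8 samples count as correct.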

In [3]:
location = '../../data/'
files = os.listdir(location)
files

['responses.csv',
 'CrossStats20150102.txt',
 'auto_2020.xlsx',
 'multiple_choice.csv',
 'iris_names.txt',
 'state_codes.csv',
 'iris.csv',
 'nst-est2019-popchg2010-2019.pdf',
 'questions.csv',
 'mount_rainier_daily.csv',
 'COVID_by_State.csv',
 'Candidate Assessment.xlsx',
 'nst-est2019-popchg2010_2019.csv']

## Load and Inspect Data

1. Load the file for this week's analysis:
```
iris.csv
```
2. Measure the correlation coefficients between the features
   - Plot a correlation heatmap and/or a scatter matrix
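
A minimal sketch of these two steps follows. In the notebook itself the data would presumably be read with `pd.read_csv(location + 'iris.csv')`. So that the sketch runs without that file, it uses scikit-learn's bundled copy of the Iris data as a stand-in (an assumption) and renames the columns to match the ones listed above. The plotting calls are left as comments.

```python
import pandas as pd
from sklearn.datasets import load_iris

# In the notebook, the file would be loaded along these lines:
#   df = pd.read_csv(location + 'iris.csv')
# Stand-in (assumption): scikit-learn's bundled Iris data with matching column names.
raw = load_iris()
features = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
df = pd.DataFrame(raw.data, columns=features)
df['Species'] = raw.target_names[raw.target]

# Inspect the data: first rows, summary statistics, and class balance.
print(df.head())
print(df.describe())
print(df['Species'].value_counts())

# Pearson correlation coefficients between the numeric features.
corr = df[features].corr()
print(corr.round(2))

# Possible plots, using the libraries imported at the top of the notebook:
#   sns.heatmap(corr, annot=True)    # correlation heatmap
#   sns.pairplot(df, hue='Species')  # scatter matrix colored by species
```

The petal length and petal width columns are strongly correlated, which is worth keeping in mind when interpreting the clusters.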