# Introduction

# Method

## Import Libraries
Before running any code, a few libraries need to be imported.

- `urllib.request` is used to download the dataset
- `matplotlib` is used to plot the data on a graph
- `pandas` is used to load and manipulate the dataset in memory
- `sklearn` (scikit-learn) is used to perform k-means clustering

In [None]:
from urllib.request import urlopen

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

## Get Data
Before doing any calculations we will need to download and prepare the data.

### Download Data
This code downloads the dataset from its source into the current working directory. It also strips any leading or trailing whitespace from the downloaded content before saving it.

In [None]:
source_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
download_filename = "iris.data"

with urlopen(source_url) as response:
    content = response.read()
    with open(download_filename, "wb") as fo:
        fo.write(content.strip())
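As a small illustration of the stripping step, the sketch below applies `bytes.strip()` to a made-up payload. The sample bytes are hypothetical and only stand in for the real download. Only whitespace at the very start and end is removed, so the line breaks between rows are kept.

```python
# Hypothetical payload: two data rows followed by trailing blank lines.
raw = (
    b"5.1,3.5,1.4,0.2,Iris-setosa\n"
    b"4.9,3.0,1.4,0.2,Iris-setosa\n"
    b"\n\n"
)

cleaned = raw.strip()

# The trailing newlines are gone; the newline between the rows is untouched.
rows = cleaned.splitlines()
print(len(rows))
```

Without this step, trailing blank lines in a raw file could end up as empty rows in the processed CSV.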

### Process Data
The downloaded file has no header row, so it needs to be processed before it can be used. This code writes a header line followed by the downloaded content to the processed file path.

In [None]:
processed_filename = "iris.csv"
csv_header = b"sepal-length,sepal-width,petal-length,petal-width,class"

with open(download_filename, "rb") as src_fo:
    with open(processed_filename, "wb") as fo:
        fo.write(csv_header)
        fo.write(b"\n")
        fo.write(src_fo.read())
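One way to check the processed file is to confirm that every row has the same number of fields as the header. The sketch below does this with the standard `csv` module on an in-memory sample. The helper name `count_bad_rows` and the sample text are illustrative assumptions, not part of the original notebook.

```python
import csv
import io


def count_bad_rows(csv_text, expected_fields):
    """Count rows whose number of fields differs from expected_fields."""
    reader = csv.reader(io.StringIO(csv_text))
    return sum(1 for row in reader if len(row) != expected_fields)


# Hypothetical sample standing in for the contents of the processed file.
sample = (
    "sepal-length,sepal-width,petal-length,petal-width,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

print(count_bad_rows(sample, expected_fields=5))
```

A blank line is read as a row with no fields, so it is counted as a bad row by this check.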

### Load Data From CSV Into DataFrame
Once the dataset has been processed into a valid CSV file, it can be loaded into a pandas DataFrame.

In [None]:
df = pd.read_csv(processed_filename)
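As an alternative sketch, `pd.read_csv` can assign the column names while loading a headerless file, using its `header=None` and `names` parameters. The example below uses a small in-memory sample rather than the real file; the variable names `columns`, `sample` and `sample_df` are assumptions made for illustration.

```python
import io

import pandas as pd

columns = ["sepal-length", "sepal-width", "petal-length", "petal-width", "class"]

# Hypothetical headerless sample in the same layout as the raw download.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

# header=None tells pandas the first line is data; names supplies the header.
sample_df = pd.read_csv(sample, header=None, names=columns)
print(sample_df.dtypes)
```

This would skip the separate processed file, at the cost of keeping the column names in the loading code instead of in the CSV itself.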

## Show Loaded Data
Once the data is loaded, we can preview the dataset, which currently shows all the columns from the CSV file.

In [None]:
df

## Select Columns To Use
Only two columns are needed for clustering. In this code sample they are selected and stored in the variable `selected_data`.

In [None]:
column_names = ["sepal-length", "petal-length"]

In [None]:
selected_data = df[column_names]

The `selected_data` can now be previewed, which shows the two selected columns.

In [None]:
selected_data

## Preview Data
Now that the data is loaded, it can be previewed on a scatter plot before any clustering is applied.

In [None]:
def plot_data(data, x, y, xl, yl, title):
    plt.scatter(x=x, y=y, data=data)
    plt.title(title)
    plt.xlabel(xl)
    plt.ylabel(yl)
    plt.show()

In [None]:
x, y = column_names
plot_data(df, x, y, "Sepal Length", "Petal Length", "Selected Data")

## Find Optimum K Value


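To find the optimum value for k we use an elbow plot, which shows the inertia for each number of clusters. The "elbow" of the graph, where the inertia stops falling sharply, indicates the k value. We can also go further along and increase the number of clusters, as long as each extra cluster still reduces the inertia noticeably.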

In [None]:
def show_elbow_plot(data, max_k):
    k_values = []
    inertias = []

    # Fit one model for every k from 1 up to and including max_k
    for k in range(1, max_k + 1):
        # A fixed seed keeps the plot the same on every run
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
        kmeans.fit(data)
        k_values.append(k)
        inertias.append(kmeans.inertia_)

    plt.figure(figsize=(12, 5))
    plt.plot(k_values, inertias, "o-")
    plt.xlabel("Number of Clusters")
    plt.ylabel("Inertia")
    plt.grid(True)
    plt.show()

In [None]:
show_elbow_plot(selected_data, 10)
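The inertia on the y-axis of the elbow plot is the sum of squared distances from each point to its nearest cluster center. The sketch below computes it by hand in plain Python for a handful of made-up points. The function name `inertia` and the sample points and centers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def inertia(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    total = 0.0
    for point in points:
        total += min(
            sum((p - c) ** 2 for p, c in zip(point, center))
            for center in centers
        )
    return total


# Hypothetical 2-D points forming two tight groups.
points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.0)]

one_center = inertia(points, [(3.0, 3.5)])
two_centers = inertia(points, [(1.0, 1.5), (5.0, 5.5)])

print(one_center)   # 33.0
print(two_centers)  # 1.0
```

Adding a second center that matches the structure of the data makes the inertia drop sharply, which is the kind of drop an elbow plot makes visible.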

The elbow plot shows that the optimum value to use for k is two, which means there will be two clusters. We could also use three. Any higher value is likely to produce undesired results, because the inertia is already low by that point.
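Reading the elbow by eye can be backed up with a simple numeric check. The sketch below uses one possible heuristic: pick the k where the curve bends most, measured by the second difference of the inertia values. The function name `elbow_k` and the inertia values are assumptions for illustration; the numbers are made up and are not the values produced by this notebook.

```python
def elbow_k(k_values, inertias):
    """Return the k with the largest second difference in inertia."""
    best_k, best_bend = None, float("-inf")
    # The first and last k have no neighbor on one side, so skip them.
    for i in range(1, len(inertias) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = k_values[i], bend
    return best_k


# Hypothetical inertia values for k = 1..5.
k_values = [1, 2, 3, 4, 5]
inertias = [680.0, 150.0, 80.0, 60.0, 50.0]

print(elbow_k(k_values, inertias))  # 2
```

This heuristic is sensitive to noise in the curve, so it is best treated as a sanity check on the plot rather than a replacement for it.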

## Calculate KMeans

Once the optimum k value has been found, the selected data can be clustered using k-means. Here I selected a minimum k value of two, which is the optimum shown by the elbow plot, and a maximum k value of three, which is the highest k value before the inertia becomes too low.

In [None]:
k_min = 2
k_max = 3
cluster_columns = []

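for k in range(k_min, k_max + 1):
    # A fixed seed keeps the cluster labels the same on every run
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(selected_data)
    # Store the label of each row in its own column, e.g. "cluster_2"
    df[f"cluster_{k}"] = kmeans.labels_
    cluster_columns.append(f"cluster_{k}")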
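To show roughly what happens inside a k-means fit, the sketch below implements the basic assign and update loop (Lloyd's algorithm) in plain Python. It is a simplified illustration and not scikit-learn's implementation, which by default uses k-means++ initialisation and stops on a convergence check. The function name `lloyd_kmeans`, the sample points and the starting centers are all hypothetical.

```python
def lloyd_kmeans(points, centers, iterations=10):
    """Run a fixed number of assign/update steps; return (labels, centers)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest center.
        labels = [
            min(
                range(len(centers)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])),
            )
            for point in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centers


# Hypothetical 2-D points and starting centers.
points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.0)]
labels, centers = lloyd_kmeans(points, centers=[(0.0, 0.0), (6.0, 6.0)])

print(labels)   # [0, 0, 1, 1]
print(centers)  # [(1.0, 1.5), (5.0, 5.5)]
```

The result of this loop depends on the starting centers, which is one reason the clusters found by k-means can differ between runs unless the random seed is fixed.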

## Get Cluster Data

The dataset with the cluster columns added can now be previewed again.

In [None]:
df[column_names + cluster_columns]
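A per-cluster summary can be easier to read than the full table. The sketch below groups a small made-up frame by its cluster label and reports the mean of each feature and the number of rows per cluster. The frame `demo` and its values are hypothetical; only the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical frame with two feature columns and one cluster label column.
demo = pd.DataFrame({
    "sepal-length": [5.0, 5.2, 6.8, 7.0],
    "petal-length": [1.4, 1.6, 5.0, 5.2],
    "cluster_2": [0, 0, 1, 1],
})

# Mean of each feature per cluster, plus the number of rows in each cluster.
summary = demo.groupby("cluster_2")[["sepal-length", "petal-length"]].mean()
summary["size"] = demo.groupby("cluster_2").size()
print(summary)
```

The same pattern should work on the notebook's own DataFrame by grouping on one of the `cluster_{k}` columns.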

## Plot Clusters

In [None]:
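def plot_cluster(data, centers, x, y, c, xl, yl, title):
    # Colour each point by its cluster label
    plt.scatter(x=x, y=y, c=c, data=data)
    # Mark the center of each cluster with a black cross
    plt.scatter(
        centers[:, 0], centers[:, 1],
        s=250, marker="x", c="black", label="centers"
    )
    plt.title(title)
    plt.xlabel(xl)
    plt.ylabel(yl)
    plt.legend(scatterpoints=1)

In [None]:
plt.figure(figsize=(12,8))
x, y = column_names
for i, k_col in enumerate(cluster_columns, 1):
    plt.subplot(1, len(cluster_columns), i)
    # `kmeans` only holds the last fitted model, so use the per-cluster means
    # of the selected columns, which match the centers once k-means converges
    centers = df.groupby(k_col)[column_names].mean().to_numpy()
    plot_cluster(df, centers, x, y, k_col, "Sepal Length", "Petal Length", k_col)
plt.show()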

# Results

# Conclusion

# References
- <https://archive.ics.uci.edu/ml/datasets/Iris>
- <https://towardsdatascience.com/how-to-use-unsupervised-learning-to-cluster-well-log-data-using-python-a552713748b5> - MIT
- <https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen>

# Appendices
## Source Code
- <https://github.com/enchanted-code/python-kmeans-clustering>