# Objectives
* Describe when clustering is the appropriate analysis technique
* Use scikit-learn to perform k-means clustering

## Iris Species Segmentation with Cluster Analysis

The [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) is one of the most popular datasets for machine learning. 

There are 4 features: sepal length, sepal width, petal length, and petal width. We will first separate the data into 2 clusters. We will then standardize our variables, so that features measured on different scales contribute equally to the clustering, and fit the model again. Next, we will use the Elbow Method to explore different numbers of clusters (2, 3 and 5). We will then load the labeled dataset to see how well our k-means clustering performed. Lastly, we will identify strengths and weaknesses of k-means clustering.

![iris](iris.png)

***

# 1. Clustering (k = 2)

## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans

## Load the Data

**Q1.1.** Load data from the csv file: `iris_dataset.csv`

In [2]:
# load the data

# check the data
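
A minimal sketch of the loading step, assuming the CSV has one column per feature. Because `iris_dataset.csv` is not reproduced here, the snippet reads a small stand-in CSV from an in-memory string; the three rows are illustrative values only, and the column names follow the feature names used in the questions below.

```python
import io

import pandas as pd

# Stand-in for iris_dataset.csv: a few illustrative rows, not the real file.
csv_text = """sepal_length,sepal_width,petal_length,petal_width
5.1,3.5,1.4,0.2
7.0,3.2,4.7,1.4
6.3,3.3,6.0,2.5
"""

# With the real file this would be: data = pd.read_csv("iris_dataset.csv")
data = pd.read_csv(io.StringIO(csv_text))

# Check the data: first rows and summary statistics
print(data.head())
print(data.describe())
```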


## Plot the Data

For this exercise, try to cluster the iris flowers by the shape of their sepal.

**Q1.2.** Create a scatter plot based on two corresponding features (sepal_length and sepal_width; OR petal_length and petal_width)

In [8]:
# scatter plot

# Name your axes

# plt.show()

**Q1.3.** Create a scatter plot based on the other two corresponding features (sepal_length and sepal_width; OR petal_length and petal_width)

In [8]:
# scatter plot

# Name your axes

# plt.show()

It appears that, regardless of plotting sepal size or petal size, this data contains 2 groups. Let's start our clustering with k = 2.

## Clustering

**Q1.4.** Separate the original data into 2 clusters.

In [6]:
# create a variable (x) which will contain the features for clustering


# create a k-means object with 2 clusters


# fit the data
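
As a rough illustration of the fitting step, here is a sketch on synthetic data rather than the iris file. The two blobs, their centers, and the `n_init` and `random_state` settings are assumptions chosen for the example; only `KMeans` itself comes from the imports above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the feature matrix: two well-separated blobs
# (assumed values, not taken from the iris data).
rng = np.random.default_rng(0)
x = np.vstack([
    rng.normal(loc=[5.0, 3.4], scale=0.2, size=(50, 2)),
    rng.normal(loc=[6.5, 2.9], scale=0.2, size=(50, 2)),
])

# Create a k-means object with 2 clusters; random_state makes the run repeatable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# Fit the data: this learns one center per cluster
kmeans.fit(x)

print(kmeans.cluster_centers_)
```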


**Q1.5.** Create a copy of the data using the `copy()` method so we can see the clusters next to the original data and predict the cluster for each observation.

In [10]:
# copy data

# predict cluster
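
The copy-then-predict pattern might look like the sketch below. The six rows are made-up values, and the column name `cluster_pred` is a hypothetical choice for the example; any name would work.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Tiny illustrative frame with two obvious groups (made-up values)
data = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.0, 6.7, 6.9, 6.8],
    "sepal_width": [3.5, 3.0, 3.4, 3.1, 3.2, 3.0],
})

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Copy first so the original DataFrame keeps only the features
clusters = data.copy()

# Add the predicted cluster label for each observation
clusters["cluster_pred"] = kmeans.predict(data)

print(clusters)
```

Working on a copy means `data` can be reused for later fits without accidentally feeding the cluster labels back in as a feature.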


**Q1.6.** Create a scatter plot based on two corresponding features (sepal_length and sepal_width; OR petal_length and petal_width). Change the color of the points based on the cluster prediction from our k-means model.

In [11]:
# scatter plot

# Name your axes

# plt.show()
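
One way to color points by cluster is to pass the label column to the `c` argument of `scatter`. The sketch below uses made-up values and a hypothetical `cluster_pred` column; the `rainbow` colormap is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative frame with a hypothetical cluster label column
clusters = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.0, 6.7, 6.9, 6.8],
    "sepal_width": [3.5, 3.0, 3.4, 3.1, 3.2, 3.0],
    "cluster_pred": [0, 0, 0, 1, 1, 1],
})

fig, ax = plt.subplots()

# c= takes one value per point; cmap turns those values into colors
points = ax.scatter(
    clusters["sepal_length"],
    clusters["sepal_width"],
    c=clusters["cluster_pred"],
    cmap="rainbow",
)

# Name your axes
ax.set_xlabel("Sepal length")
ax.set_ylabel("Sepal width")

plt.show()
```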

**Q1.7.** Create a scatter plot based on the other two corresponding features (sepal_length and sepal_width; OR petal_length and petal_width). Change the color of the points based on the cluster prediction from our k-means model.

In [12]:
# scatter plot

# Name your axes

# plt.show()

Once again, regardless of whether we plot sepal size or petal size, our model appears to misassign a small number of points (4 for sepal and 2 for petal). The difference in magnitude between width and length might be causing this issue, so let's tune our model by standardizing our features.

***

# 2. Tune Model

## Standardize the variables

**Q2.1.** Import the `preprocessing` module from sklearn and use its `scale` function to standardize the data.

In [14]:
# import the preprocessing module from sklearn


# scale the data for better results
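
A small sketch of what `preprocessing.scale` does, using four made-up rows: each column is shifted to mean 0 and rescaled to standard deviation 1, so length and width end up on the same footing.

```python
import numpy as np
from sklearn import preprocessing

# Illustrative feature matrix (made-up values): columns on different scales
x = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [6.7, 3.1],
    [6.3, 2.5],
])

# Standardize each column: subtract its mean, divide by its standard deviation
x_scaled = preprocessing.scale(x)

print(x_scaled.mean(axis=0))  # approximately 0 for every column
print(x_scaled.std(axis=0))   # approximately 1 for every column
```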


## Clustering (scaled/standardized data)

**Q2.2.** Separate the standardized data into 2 clusters.

In [16]:
# create a k-means object with 2 clusters


# fit the data



**Q2.3.** Create a copy of the data so we can see the clusters next to the original data and predict the cluster for each observation.

In [18]:
# copy data


# predict cluster


**Q2.4.** Create a scatter plot based on two corresponding features (sepal_length and sepal_width; OR petal_length and petal_width). Change the color of the points based on the cluster prediction from our k-means model.

In [19]:
# scatter plot

# Name your axes

# plt.show()

This looks much better for k = 2! But do we know that 2 is the best number of clusters? Let's use the Elbow Method to find out.

***

# 3. Determine k - Elbow Method

## WCSS

**Q3.1.** Create and fit a k-means model for 1 to 9 clusters and calculate the WCSS (within-cluster sum of squares) for each. Save these 9 values to a list called `wcss` and output the list.

In [43]:
wcss = []
# 'cl_num' is the exclusive upper bound on the number of clusters to try:
# range(1, cl_num) fits k = 1 through cl_num - 1 (9 models here). The value 10 is arbitrary.
cl_num = 10
for i in range(1, cl_num):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(x_scaled)
    wcss_iter = kmeans.inertia_
    wcss.append(wcss_iter)
wcss

[600.0000000000003,
 223.73200573676343,
 140.96837895511072,
 114.42714544645858,
 91.29544474066981,
 80.7111146538459,
 71.79782083817554,
 63.20563259361187,
 56.01600757779229]
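
To make the WCSS values less of a black box, the sketch below recomputes `inertia_` by hand: it is the sum of squared distances from each point to its assigned cluster center. The data is random stand-in data with 150 rows and 4 features (an assumption mirroring the classic Iris dimensions), not the iris file itself.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.cluster import KMeans

# Random stand-in data: 150 rows, 4 features (assumed dimensions)
rng = np.random.default_rng(0)
x = rng.normal(size=(150, 4))
x_scaled = preprocessing.scale(x)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x_scaled)

# WCSS by hand: squared distance from each point to its own cluster center
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
wcss_manual = ((x_scaled - assigned_centers) ** 2).sum()
print(wcss_manual, kmeans.inertia_)

# With k = 1 the single center is the overall mean, so on standardized
# data the WCSS is simply n_samples * n_features.
wcss_k1 = KMeans(n_clusters=1, n_init=10, random_state=0).fit(x_scaled).inertia_
print(wcss_k1, x_scaled.shape[0] * x_scaled.shape[1])
```

This also explains the first value in the output above: if the data has 150 rows, as the classic Iris data does, then 150 × 4 = 600 for standardized features.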

## The Elbow Method

**Q3.2.** Plot the WCSS curve.

In [20]:
# create x variable

# lineplot with title, x and y axis

# plt.show()
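
A possible version of the elbow plot, using the nine WCSS values printed above. The title and axis labels are suggestions, and the `drops` list is an extra helper (not part of the exercise) that shows how much WCSS falls with each additional cluster.

```python
import matplotlib.pyplot as plt

# WCSS values copied from the output of Q3.1 (k = 1 through 9)
wcss = [
    600.0000000000003, 223.73200573676343, 140.96837895511072,
    114.42714544645858, 91.29544474066981, 80.7111146538459,
    71.79782083817554, 63.20563259361187, 56.01600757779229,
]

# One x value per fitted model
number_clusters = range(1, len(wcss) + 1)

fig, ax = plt.subplots()
ax.plot(number_clusters, wcss, marker="o")
ax.set_title("The Elbow Method")
ax.set_xlabel("Number of clusters")
ax.set_ylabel("WCSS")
plt.show()

# Decrease in WCSS gained by adding one more cluster
drops = [a - b for a, b in zip(wcss, wcss[1:])]
print([round(d, 1) for d in drops])
```

The largest decreases come from going to 2 and then 3 clusters; after that each extra cluster buys much less, which is the flattening the elbow refers to.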

Based on the Elbow Curve, run k-means clustering with each number of clusters you believe could fit the data, and plot the result for each.

***

# 4. Analyze Different Numbers of Clusters

Construct and compare the scatter plots to determine which number of clusters is appropriate for further use in our analysis. Based on the Elbow Curve, 2, 3 or 5 seem the most likely.

## 2 Clusters

**Q4.1.** Repeat our code from above with 2 clusters using our standardized data.

In [21]:
# create a k-means object with 2 clusters


# fit the data


**Q4.2.** Construct a scatter plot of the original data using the standardized clusters. *Note: we are plotting the non-standardized values of the sepal length and width.*

In [46]:
# copy data

# predict cluster


In [22]:
# scatter plot

# Name your axes

# plt.show()

## 3 Clusters

**Q4.3.** Redo the same for 3 clusters.

In [24]:
# create a k-means object with 3 clusters


# fit the data

In [25]:
# copy data

# predict cluster


In [26]:
# scatter plot

# Name your axes

# plt.show()

## 5 Clusters

**Q4.4.** Redo the same for 5 clusters.

In [27]:
# create a k-means object with 5 clusters


# fit the data

In [28]:
# copy data

# predict cluster


In [29]:
# scatter plot

# Name your axes

# plt.show()

# 5. Compare Solutions to the Labeled Iris Dataset

The original (full) iris data is located in `iris_with_labels.csv`. Load the csv, plot the data and compare it with your solution. 

## Load the data

**Q5.1.** Load data from the csv file: `iris_with_labels.csv`

In [39]:
# load the data

# check the data


**Q5.2.** Find the unique species in the DataFrame.

In [40]:
# unique


There are 3 unique species of Iris. Our 2-cluster solution looked good, but the real dataset contains 3 species, so a 3-cluster solution is the one that matches the labels. Clustering therefore cannot always be trusted: a solution with x clusters may look convincing when the true number of groups is higher (or lower).

**Q5.3.** Use the `map` function to change 'setosa' values to 0, 'versicolor' values to 1, and 'virginica' values to 2. Inspect the first five lines of the DataFrame.

In [41]:
# map species to numbers

# inspect data
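
A short sketch of the mapping step on a made-up four-row frame. The DataFrame name `real_data` and the dictionary name `species_to_number` are hypothetical; the `species` column name follows the plotting questions below.

```python
import pandas as pd

# Illustrative stand-in for the labeled data (made-up rows)
real_data = pd.DataFrame(
    {"species": ["setosa", "versicolor", "virginica", "setosa"]}
)

# Map each species name to a number
species_to_number = {"setosa": 0, "versicolor": 1, "virginica": 2}
real_data["species"] = real_data["species"].map(species_to_number)

# Inspect the data
print(real_data.head())
```

Any value that is missing from the dictionary becomes `NaN`, so the keys must match the spelling in the file exactly.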


## Plot the data

In the sepal plot of the labeled data, the species turn out to be much more intertwined than our clustering solutions suggested.

**Q5.4.** Plot the labeled data sepal length and width, using the species column for color.

In [37]:
# scatter plot

# Name your axes

# plt.show()

**Q5.5.** Plot the labeled data petal length and width, using the species column for color.

In [38]:
# scatter plot

# Name your axes

# plt.show()

Examining the other scatter plot (petal length vs. petal width), we see that the features that actually distinguish the species are the petals and NOT the sepals!

## Cluster (k = 3) Solution

**Q5.6.** Plot the 3-cluster solution data sepal length and width. Change the color of the points based on the cluster prediction from our k-means model.

In [42]:
# scatter plot

# Name your axes

# plt.show()

**Q5.7.** Plot the 3-cluster solution data petal length and width. Change the color of the points based on the cluster prediction from our k-means model.

In [43]:
# scatter plot

# Name your axes

# plt.show()

It appears that our k-means clustering solution is using the sepal length and width more than the petal length and width to separate the data into 3 clusters.
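
Beyond comparing plots by eye, one optional way to compare a clustering with the true labels is a contingency table. The sketch below uses `pd.crosstab` on made-up labels; the nine species and cluster values are purely illustrative and are not results from this notebook.

```python
import pandas as pd

# Made-up true labels and hypothetical cluster predictions (not real results)
species = pd.Series(
    ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3, name="species"
)
cluster_pred = pd.Series([0, 0, 0, 1, 1, 2, 2, 2, 1], name="cluster_pred")

# Rows are species, columns are clusters, cells count the observations
table = pd.crosstab(species, cluster_pred)
print(table)
```

A species whose row has a single non-zero cell was captured cleanly by one cluster; rows spread over several columns show where the clustering mixes species.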

# 6. Conclusions

This tutorial shows us that:
* the Elbow method is imperfect (we might have opted for 2 or even 4)
* k-means is very useful when we already know the number of clusters - in this case: 3
