# Objectives
- Describe when clustering is the appropriate analysis technique
- Use scikit-learn to perform k-means clustering

# 1. Gather and Prepare Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.cluster import KMeans

Read the file `Mall_Customers.csv`, save it to `customers`, and inspect the first 5 rows.

Determine the shape of `customers`.

Output the statistics of `customers`.

Check the data type for each column.

Inspect `customers` for the count of non-null values.
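The inspection steps above can be sketched as follows. This is only an illustration: the rows are made-up stand-ins (the real values come from `Mall_Customers.csv`), and only the three column names that appear later in this notebook are assumed.

```python
import io
import pandas as pd

# Made-up stand-in for Mall_Customers.csv; the values are invented for illustration.
sample_csv = io.StringIO(
    "Age,Annual Income (k$),Spending Score (1-100)\n"
    "25,30,55\n"
    "41,72,18\n"
    "33,48,62\n"
    "58,95,40\n"
    "29,61,77\n"
    "46,38,23\n"
)

# With the real file this would be pd.read_csv("Mall_Customers.csv").
customers = pd.read_csv(sample_csv)

print(customers.head())      # first 5 rows
print(customers.shape)       # (number of rows, number of columns)
print(customers.describe())  # summary statistics for the numeric columns
print(customers.dtypes)      # data type of each column
customers.info()             # non-null count (and dtype) per column
```

`info()` prints its report directly, which is why it is not wrapped in `print`.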

# 2. Choose Model
Run the cell below to plot the pairwise relationships among `Age`, `Annual Income (k$)`, and `Spending Score (1-100)`.

In [None]:
plt.figure(1 , figsize = (15 , 7))
n = 0 
for x in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
    for y in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
        n += 1
        plt.subplot(3 , 3 , n)
        plt.subplots_adjust(hspace = 0.5 , wspace = 0.5)
        sns.regplot(x = x , y = y , data = customers)
        plt.ylabel(y.split()[0]+' '+y.split()[1] if len(y.split()) > 1 else y )
plt.show()

Now let's create a DataFrame from `customers` that uses only the columns `Age`, `Annual Income (k$)`, and `Spending Score (1-100)`. Assign the new DataFrame to `X`.
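One way to do this is to index the DataFrame with a list of column names. The sketch below uses an invented four-row `customers` frame, and the `CustomerID` column in it is hypothetical, included only to show a column being left out.

```python
import pandas as pd

# Invented stand-in for `customers`; CustomerID is a hypothetical extra column.
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Age": [25, 41, 33, 58],
    "Annual Income (k$)": [30, 72, 48, 95],
    "Spending Score (1-100)": [55, 18, 62, 40],
})

feature_columns = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]

# Selecting with a list of names returns a DataFrame; .copy() keeps X
# independent of `customers`.
X = customers[feature_columns].copy()
print(X.head())
```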

Run the cell below to calculate the Within-Cluster Sum of Squares (WCSS) for each value of k from 1 to 9.

In [None]:
# Create an empty list
wcss = []

# Create all possible cluster solutions for k = 1 to 9 and save inertia to a list
for k in range(1,10):
    # Cluster solution with k clusters
    kmeans = KMeans(n_clusters=k)
    
    # Fit the data
    kmeans.fit(X)
    
    # Find WCSS for the current iteration
    wcss_iter = kmeans.inertia_
    
    # Append the value to the WCSS list
    wcss.append(wcss_iter)
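
To make the meaning of `inertia_` concrete: WCSS is the sum, over all observations, of the squared distance from each observation to the centroid of its assigned cluster. The sketch below recomputes that quantity by hand on a small invented 2-D dataset and compares it with `inertia_`. The `n_init` and `random_state` arguments are set here only to make the illustration repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented toy data: two well-separated groups of three points each.
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
    [8.0, 8.0], [9.0, 9.5], [8.5, 9.0],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Look up the centroid assigned to each point, then sum the squared distances.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
wcss_manual = ((points - assigned_centroids) ** 2).sum()

print(wcss_manual, kmeans.inertia_)  # the two values should agree
```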

Using the WCSS that we just found, create a variable called `number_clusters` that will establish our x-axis ticks.  Then create a line plot using `number_clusters` and `wcss`.  Set appropriate x- and y-axis labels and title.
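A minimal sketch of such an elbow plot is shown here. It assumes synthetic data from `make_blobs` in place of `X`, so the curve will not match the one for the mall data. The title and axis labels are only suggestions.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X (an assumption made so the example runs on its own).
X_demo, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# WCSS for k = 1 to 9, mirroring the loop in the cell above.
wcss = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    wcss.append(model.inertia_)

# One x-axis tick per WCSS value.
number_clusters = range(1, len(wcss) + 1)

fig, ax = plt.subplots()
ax.plot(number_clusters, wcss, marker="o")
ax.set_xticks(list(number_clusters))
ax.set_title("The Elbow Method")
ax.set_xlabel("Number of clusters")
ax.set_ylabel("Within-Cluster Sum of Squares")
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()`. The "elbow" is the value of k after which adding clusters reduces WCSS only slightly.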

In [None]:
# Create a variable containing the numbers from 1 to 9 (one per WCSS value), so we can use it as the x-axis of the plot

# Plot the number of clusters vs WCSS

# Name your graph

# Name the x-axis

# Name the y-axis


# 3. Train Model (k=5)

Based on the elbow plot, we'll use 5 clusters. Fit a KMeans model with 5 clusters to the `X` DataFrame from above (age, annual income, and spending score) and store the resulting KMeans object in `kmeans5`.
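As a hedged sketch, fitting could look like the following. The `X` here is synthetic (built with `make_blobs`) so the example runs without the CSV, and `n_init` and `random_state` are optional choices made for repeatability, not something this notebook requires.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X with the same three column names.
features, _ = make_blobs(n_samples=100, centers=5, n_features=3, random_state=0)
X = pd.DataFrame(
    features,
    columns=["Age", "Annual Income (k$)", "Spending Score (1-100)"],
)

# Create the model with 5 clusters and fit it to the data.
kmeans5 = KMeans(n_clusters=5, n_init=10, random_state=0)
kmeans5.fit(X)

# One centroid per cluster, one coordinate per feature.
print(kmeans5.cluster_centers_.shape)  # (5, 3)
```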

# 4. Evaluate Model

Now that we have our KMeans object, `kmeans5`, let's create an array that will contain our predicted clusters for each observation and assign it to `id_clusters`.
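A fitted KMeans object can assign each observation to its nearest centroid with `predict`. The sketch below assumes synthetic data in place of `X` and refits a model so that it is self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X (assumption for illustration only).
X_demo, _ = make_blobs(n_samples=100, centers=5, n_features=3, random_state=0)
kmeans5 = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_demo)

# One integer cluster label (0 to 4) per observation.
id_clusters = kmeans5.predict(X_demo)

print(id_clusters[:10])        # labels of the first 10 observations
print(np.unique(id_clusters))  # distinct cluster labels that were assigned
```

`fit_predict` combines fitting and labelling in one call, which is a reasonable alternative when the model has not been fitted yet.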

In [None]:
# Create a variable which will contain the predicted clusters for each observation

# Check the result


Create a copy of our original DataFrame, `customers`, and assign it to `customers_with_clusters`.  Then, create a column called `Cluster` in the copy DataFrame that will contain our predicted clusters from `id_clusters`.
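The copy-then-assign pattern might look like this. The `customers` frame and the labels in `id_clusters` are invented for the example.

```python
import numpy as np
import pandas as pd

# Invented stand-ins for `customers` and `id_clusters`.
customers = pd.DataFrame({
    "Age": [25, 41, 33, 58],
    "Annual Income (k$)": [30, 72, 48, 95],
    "Spending Score (1-100)": [55, 18, 62, 40],
})
id_clusters = np.array([0, 1, 0, 1])  # hypothetical cluster labels

# .copy() creates an independent DataFrame, so `customers` is left unchanged.
customers_with_clusters = customers.copy()

# Add the labels as a new column; the array length must match the row count.
customers_with_clusters["Cluster"] = id_clusters

print(customers_with_clusters)
```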

In [None]:
# Create a copy of the original data


# Create a new column, `Cluster`, containing the identified cluster for each observation


# Check the result


Run the next cell in order to once again create plots that show the relationship between each column (feature) pairing.

In [None]:
plt.figure(1 , figsize = (15 , 7))
n = 0 
for x in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
    for y in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
        n += 1
        plt.subplot(3 , 3 , n)
        plt.subplots_adjust(hspace = 0.5 , wspace = 0.5)
        sns.scatterplot(x = x , y = y , data = customers_with_clusters, hue='Cluster')
        plt.ylabel(y.split()[0]+' '+y.split()[1] if len(y.split()) > 1 else y )
plt.show()

The plot that shows the relationship between annual income and spending score seems to have some distinct clusters.  Let's plot this graph.
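A minimal sketch of that single plot follows, using plain matplotlib and an invented `customers_with_clusters` frame. Calling `sns.scatterplot` with `hue='Cluster'`, as in the cell above, would give a similar picture.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in for `customers_with_clusters`.
customers_with_clusters = pd.DataFrame({
    "Annual Income (k$)": [15, 18, 60, 62, 110, 115],
    "Spending Score (1-100)": [80, 12, 50, 48, 90, 15],
    "Cluster": [0, 1, 2, 2, 3, 4],
})

fig, ax = plt.subplots(figsize=(8, 5))

# Color each point by its cluster label.
points = ax.scatter(
    customers_with_clusters["Annual Income (k$)"],
    customers_with_clusters["Spending Score (1-100)"],
    c=customers_with_clusters["Cluster"],
    cmap="viridis",
)
ax.set_xlabel("Annual Income (k$)")
ax.set_ylabel("Spending Score (1-100)")
ax.set_title("Annual income vs. spending score by cluster")
```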

We've done quite a bit so far, so let's get a quick reminder of what our data, `X`, looks like.

Import the `preprocessing` module from `sklearn`.

Finally, let's scale our data in `X`.
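Scaling matters here because k-means relies on Euclidean distance, so a feature with a larger numeric range would otherwise dominate the clustering. The sketch below uses `preprocessing.scale`, which standardizes each column to zero mean and unit variance. The `X` values are invented, and `x_scaled` is a name chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Invented stand-in for X.
X = pd.DataFrame({
    "Age": [25, 41, 33, 58],
    "Annual Income (k$)": [30, 72, 48, 95],
    "Spending Score (1-100)": [55, 18, 62, 40],
})

# Standardize each column: subtract its mean and divide by its standard deviation.
x_scaled = preprocessing.scale(X)

print(x_scaled.mean(axis=0))  # approximately 0 for every column
print(x_scaled.std(axis=0))   # approximately 1 for every column
```

`preprocessing.scale` returns a NumPy array rather than a DataFrame, so the column names are not carried over.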