## 18.4.1
### Elbow Curve
The trial-and-error method seemed to work to an extent, but the two of you wonder what happens when data gets more complex, such as in the cryptocurrency dataset.

A more educated way to estimate the number of clusters is the elbow curve, an easy method for determining the best value for K.

Elbow curves get their name from their shape: the curve bends sharply at a specific value, which looks a bit like an elbow!

To create an elbow curve, we'll plot the number of clusters on the x-axis and the values of a selected objective function on the y-axis.

Inertia is one of the most common objective functions to use when creating an elbow curve. The math behind it can get complicated, but in short, inertia measures how tightly the data points are grouped within their clusters: it is the sum of squared distances from each point to the center of its assigned cluster. Lower inertia means tighter clusters.
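
To make the definition concrete, here is a minimal sketch that computes inertia by hand in plain Python. The helper names (`squared_distance`, `compute_inertia`) and the four points are made up for illustration; scikit-learn reports the same quantity as `inertia_` after fitting.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def compute_inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Made-up 2D points forming two tight groups (illustrative only)
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (8.0, 9.0)]

one_centroid = [(4.5, 5.0)]               # mean of all four points
two_centroids = [(1.0, 1.5), (8.0, 8.5)]  # mean of each group

print(compute_inertia(points, one_centroid))   # 99.0
print(compute_inertia(points, two_centroids))  # 1.0
```

Going from one centroid to two cuts the inertia sharply because every point now sits close to a center. That drop is what the elbow curve tracks as K grows.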

So, for our elbow curve, we'll plot the number of clusters (also known as the values of K) on the x-axis and the inertia values on the y-axis.

Let's see what happens when we plot our K values versus inertia for the preprocessed iris dataset created earlier.

Let's first take a look at the elbow curve using the iris dataset, since we know that there should be three clusters:

    # Initial imports
    import pandas as pd
    from sklearn.cluster import KMeans
    import plotly.express as px
    import hvplot.pandas

Then enter the code to load the dataset into a DataFrame:

    # Loading data
    file_path = "Resources/new_iris_data.csv"
    df_iris = pd.read_csv(file_path)

    df_iris.head(10)

#### Store Values of K to Plot

We'll start with creating an empty list to hold inertia values. We'll also store a range of K values we want to test. Enter the code in a new cell:

    inertia = []
    k = list(range(1, 11))

#### Loop Through K Values and Find Inertia

Next, we'll loop through each K value, fit a K-means model, and store the resulting inertia in our list. Enter the code in the next cell:

    # Looking for the best K
    for i in k:
        km = KMeans(n_clusters=i, random_state=0)
        km.fit(df_iris)
        inertia.append(km.inertia_)
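
To see why this loop produces a falling curve, here is a sketch that repeats the same idea with a tiny 1D k-means written in plain Python. It is a simplified stand-in for scikit-learn's `KMeans`, not a copy of it: the function `kmeans_1d_inertia`, its evenly spaced starting centroids, and the nine data values are all assumptions made for illustration.

```python
def kmeans_1d_inertia(values, k, iterations=20):
    """Run a tiny 1D k-means and return the final inertia."""
    data = sorted(values)
    n = len(data)
    # Deterministic start: k evenly spaced picks from the sorted data
    centroids = [data[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest centroid
        groups = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            groups[nearest].append(v)
        # Update step: move each centroid to the mean of its group
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    # Inertia: sum of squared distances from each value to its nearest centroid
    return sum(min((v - c) ** 2 for c in centroids) for v in data)


# Made-up values with three obvious groups (around 1, 5, and 9)
values = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2, 8.8, 9.0, 9.2]

k_values = list(range(1, 6))
inertia_values = [kmeans_1d_inertia(values, k) for k in k_values]

for k, value in zip(k_values, inertia_values):
    print(k, round(value, 2))
```

On this toy data the inertia falls from about 96 at K = 1 to about 0.24 at K = 3, then barely changes for K = 4 and K = 5. Extra clusters beyond the natural three buy very little.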

#### Create a DataFrame and Plot the Elbow Curve

We'll create a DataFrame that stores our K values and their corresponding inertia values. This will allow for an easy plot of the results with hvPlot. In another new cell, enter the code:

    # Define a DataFrame to plot the Elbow Curve using hvPlot
    elbow_data = {"k": k, "inertia": inertia}
    df_elbow = pd.DataFrame(elbow_data)
    df_elbow.hvplot.line(x="k", y="inertia", title="Elbow Curve", xticks=k)

This will create a graph.

#### Use the Elbow Curve to Determine the Best K Value

Let's take a look at the graph.

Note the shape of the curve on the following graph. Starting at point 1 (top left), the line drops steeply until it breaks at point 2, continues at a gentler slope, breaks again at point 3, then flattens into a nearly horizontal line that reaches point 10. The angle at point 3 looks like an elbow, which gives this type of curve its name:

The graph shows a sloping curve to the right with a break at point 3 on the axis, indicating the elbow of the curve.
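
Reading the elbow off a plot is a judgment call, but the same idea can be roughed out in code. The sketch below returns the first K after which adding one more cluster reduces inertia by less than a chosen fraction of the starting inertia. The function name `first_flat_k`, the 5% threshold, and the inertia values are assumptions for illustration; the values are made up to resemble the curve described above and are not results from the iris dataset.

```python
def first_flat_k(k_values, inertia_values, min_gain=0.05):
    """Return the smallest K where adding one more cluster cuts inertia by
    less than min_gain, measured as a fraction of the inertia at the first K."""
    baseline = inertia_values[0]
    for idx in range(len(k_values) - 1):
        gain = (inertia_values[idx] - inertia_values[idx + 1]) / baseline
        if gain < min_gain:
            return k_values[idx]
    # The curve never flattened within the tested range
    return k_values[-1]


k_values = list(range(1, 11))
# Made-up inertia values: steep drop to K = 2, smaller drop to K = 3, then flat
inertia_values = [680, 150, 80, 60, 48, 40, 35, 31, 28, 26]

print(first_flat_k(k_values, inertia_values))  # 3
```

The threshold is arbitrary, so treat the result as a starting point to check against the plot, not as a definitive answer.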

The complete notebook cell for this section:

    # 18.4.1

    # Initial imports
    import pandas as pd
    from sklearn.cluster import KMeans
    import plotly.express as px
    import hvplot.pandas

    # Loading data
    file_path = "../Exported_Data/new_iris_data.csv"
    df_iris = pd.read_csv(file_path)

    df_iris.head(10)

    inertia = []
    k = list(range(1, 11))

    # Looking for the best K
    for i in k:
        km = KMeans(n_clusters=i, random_state=0)
        km.fit(df_iris)
        inertia.append(km.inertia_)

    # Define a DataFrame to plot the Elbow Curve using hvPlot
    elbow_data = {"k": k, "inertia": inertia}
    df_elbow = pd.DataFrame(elbow_data)
    df_elbow.hvplot.line(x="k", y="inertia", title="Elbow Curve", xticks=k)



## 18.4.2
### Use the Elbow Curve
Now that you and Martha know how to create an elbow curve, use one on a dataset to help determine the number of clusters you should use.

Let's walk through an example of how to use the elbow curve. This time, we'll answer the question from the previous section on customer data and how many clusters would be ideal.

Open a new notebook and import our dependencies:

    # Initial imports
    import pandas as pd
    from sklearn.cluster import KMeans
    import plotly.express as px
    import hvplot.pandas

Then enter the code to load the dataset into a DataFrame:

    # Load data
    file_path = "Resources/shopping_data_cleaned.csv"
    df_shopping = pd.read_csv(file_path)
    df_shopping.head(10)

To create the elbow curve, remember there are two values we need: a list of K values and a list of inertia values. Recall that inertia is the objective function to plot K values against. We will loop through 10 values for K and determine the inertia:

    inertia = []
    k = list(range(1, 11))

    # Calculate the inertia for the range of K values
    for i in k:
        km = KMeans(n_clusters=i, random_state=0)
        km.fit(df_shopping)
        inertia.append(km.inertia_)

Next, let's create a plot for the elbow curve:

The graph shows a gentle sloping curve to the right with a break at points 5 and 6. Either one of these points could be considered the elbow of the curve.

This elbow curve doesn't have as obvious an elbow as the previous one. Either K = 5 or K = 6 could be considered the elbow. There is no surefire way to pick between the two, but we did narrow the number of candidate K values from 10 to 2. You might also wonder why point 3 wasn't considered. Remember, we're looking for the break where the steep, near-vertical drop shifts to a nearly horizontal line. Compared to points 5 and 6, the shift at point 3 isn't as dramatic.
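
When two neighboring points both look like candidates, one rough way to shortlist them is to score how much the curve flattens at each K: the drop in inertia going into that K minus the drop going out of it. The sketch below ranks K values by that score. The function name `rank_bends` and the inertia values are assumptions for illustration; the values are invented to mimic a gentle curve with breaks at 5 and 6 and are not the shopping dataset's results.

```python
def rank_bends(k_values, inertia_values, top_n=2):
    """Rank interior K values by how sharply the curve flattens at each one."""
    bends = []
    for idx in range(1, len(k_values) - 1):
        drop_into = inertia_values[idx - 1] - inertia_values[idx]
        drop_out = inertia_values[idx] - inertia_values[idx + 1]
        bends.append((drop_into - drop_out, k_values[idx]))
    # Largest bend first; ties go to the smaller K
    bends.sort(key=lambda pair: (-pair[0], pair[1]))
    return [k for _, k in bends[:top_n]]


k_values = list(range(1, 11))
# Made-up inertia values with a gentle slope that flattens around K = 5 and K = 6
inertia_values = [300, 240, 182, 127, 75, 45, 37, 31, 26, 22]

print(rank_bends(k_values, inertia_values))  # [5, 6]
```

This score depends on the scale of the inertia values and tends to favor early, large drops, so it works as a quick screen alongside the plot, not as a replacement for it.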

Before plotting the two K values, let's define a function again so we can reuse the K-means clustering code. As you may recall, functions save time because we don't need to write the code they contain more than once:

    def get_clusters(k, data):
        # Create a copy of the DataFrame
        data = data.copy()

        # Initialize the K-Means model
        model = KMeans(n_clusters=k, random_state=0)

        # Fit the model
        model.fit(data)

        # Predict clusters
        predictions = model.predict(data)

        # Create return DataFrame with predicted clusters
        data["class"] = model.labels_

        return data

**Note**
Creating a function is not required for K-means. The get_clusters function helps us save time since we'll run the algorithm twice: once with K = 5 and again with K = 6. If you're still struggling with functions, feel free to run the code twice, but do revisit get_clusters after strengthening your Python function skills.

Consider looking into the principle of "Don't repeat yourself" to learn why it's important to use functions.

We can now run the function for K = 5:

The get_clusters function is applied to the shopping DataFrame to create five clusters.

Run the function again for K = 6:

The get_clusters function is applied to the shopping DataFrame to create six clusters.

Plot a 2D graph for K = 5:

A 2D scatter plot is created with five clusters.

Plot a 3D graph for K = 5:

    # Plot the 3D-scatter with x="Age", y="Spending Score (1-100)" and z="Annual Income"
    fig = px.scatter_3d(
        five_clusters,
        x="Age",
        y="Spending Score (1-100)",
        z="Annual Income",
        color="class",
        symbol="class",
        width=800,
    )
    fig.update_layout(legend=dict(x=0, y=1))
    fig.show()

A 3D scatter plot is created with five clusters.

Plot a 2D graph for K = 6:

A 2D scatter plot is created with six clusters.

Plot a 3D graph for K = 6:

    # Plot the 3D-scatter with x="Age", y="Spending Score (1-100)" and z="Annual Income"
    fig = px.scatter_3d(
        six_clusters,
        x="Age",
        y="Spending Score (1-100)",
        z="Annual Income",
        color="class",
        symbol="class",
        width=800,
    )
    fig.update_layout(legend=dict(x=0, y=1))
    fig.show()

A 3D scatter plot is created with six clusters.

Recall, in the trial-and-error method, both graphs displayed multiple clusters. We're still applying some trial and error here, but the elbow curve helps narrow down the number of clusters.

Now, the important question: do we use five or six groups? This depends on what insights you can take away from the data. One might conclude that six groups would be most useful because they could be broken down like so:

    Cluster 0: medium income, low annual spend
    Cluster 1: low income, low annual spend
    Cluster 2: high income, low annual spend
    Cluster 3: low income, high annual spend
    Cluster 4: medium income, high annual spend
    Cluster 5: very high income, high annual spend

If we chose five groups, the breakdown would have to be different and might not fit what you're looking for, which is grouping types of customers based on spending habits. Remember, unsupervised learning can help us make decisions about the data only up to a point; then it is up to you, the expert, to make the final call.
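
One way to arrive at labels like "low income, high annual spend" is to average each feature within every cluster. The sketch below does this in plain Python on a handful of made-up rows. The column names match the shopping DataFrame used above, but the values and the `profile_clusters` helper are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def profile_clusters(rows):
    """Average income and spending score for each cluster label."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["class"]].append(row)
    return {
        label: {
            "Annual Income": mean(r["Annual Income"] for r in members),
            "Spending Score (1-100)": mean(r["Spending Score (1-100)"] for r in members),
            "size": len(members),
        }
        for label, members in sorted(grouped.items())
    }


# Made-up rows standing in for a clustered DataFrame (illustrative only)
rows = [
    {"Annual Income": 50, "Spending Score (1-100)": 20, "class": 0},
    {"Annual Income": 60, "Spending Score (1-100)": 30, "class": 0},
    {"Annual Income": 20, "Spending Score (1-100)": 10, "class": 1},
    {"Annual Income": 30, "Spending Score (1-100)": 20, "class": 1},
]

for label, summary in profile_clusters(rows).items():
    print(label, summary)
```

With the real data, a pandas call along the lines of `six_clusters.groupby("class").mean()` should give a comparable summary, which you can then translate into descriptions like those in the list above.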

So far, you've learned that when dealing with multiple features, the clusters were best viewed in 3D graphs, which can get messy. In the next section, we'll learn how to limit or combine features.

The complete notebook cell for this section:

    # 18.4.2

    # Initial imports
    import pandas as pd
    from sklearn.cluster import KMeans
    import plotly.express as px
    import hvplot.pandas

    # Load data
    file_path = "../Exported_Data/shopping_data_cleaned.csv"
    df_shopping = pd.read_csv(file_path)
    df_shopping.head(10)

    inertia = []
    k = list(range(1, 11))

    # Calculate the inertia for the range of K values
    for i in k:
        km = KMeans(n_clusters=i, random_state=0)
        km.fit(df_shopping)
        inertia.append(km.inertia_)

    def get_clusters(k, data):
        # Create a copy of the DataFrame
        data = data.copy()

        # Initialize the K-Means model
        model = KMeans(n_clusters=k, random_state=0)

        # Fit the model
        model.fit(data)

        # Predict clusters
        predictions = model.predict(data)

        # Create return DataFrame with predicted clusters
        data["class"] = model.labels_

        return data

    # Run the function for K = 5
    five_clusters = get_clusters(5, df_shopping)
    five_clusters.head()

    # Run the function again for K = 6
    six_clusters = get_clusters(6, df_shopping)
    six_clusters.head()

    # Plot the 2D-scatter for K = 5 with x="Annual Income" and y="Spending Score (1-100)"
    five_clusters.hvplot.scatter(x="Annual Income", y="Spending Score (1-100)", by="class")

    # Plot the 3D-scatter for K = 5 with x="Age", y="Spending Score (1-100)" and z="Annual Income"
    fig = px.scatter_3d(
        five_clusters,
        x="Age",
        y="Spending Score (1-100)",
        z="Annual Income",
        color="class",
        symbol="class",
        width=800,
    )
    fig.update_layout(legend=dict(x=0, y=1))
    fig.show()

    # Plot the 2D-scatter for K = 6 with x="Annual Income" and y="Spending Score (1-100)"
    six_clusters.hvplot.scatter(x="Annual Income", y="Spending Score (1-100)", by="class")

    # Plot the 3D-scatter for K = 6 with x="Age", y="Spending Score (1-100)" and z="Annual Income"
    fig = px.scatter_3d(
        six_clusters,
        x="Age",
        y="Spending Score (1-100)",
        z="Annual Income",
        color="class",
        symbol="class",
        width=800,
    )
    fig.update_layout(legend=dict(x=0, y=1))
    fig.show()


**Note**
When running this code, scikit-learn may print the following warning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1.

