In [13]:
#load dependencies
import pandas as pd
import plotly.express as px
import hvplot.pandas
from sklearn.cluster import KMeans

In [15]:
# Loading data
file_path = "C:/Users/otrin/OneDrive/Desktop/Git/cryptocurrencies/shopping_data_cleaned.csv"
df_shopping = pd.read_csv(file_path)
df_shopping.head(10)

Unnamed: 0,Card_Member,Age,Annual_Income,Spending_Score
0,1,19.0,15.0,39.0
1,1,21.0,15.0,81.0
2,0,20.0,16.0,6.0
3,0,23.0,16.0,77.0
4,0,31.0,17.0,40.0
5,0,22.0,17.0,76.0
6,0,35.0,18.0,6.0
7,0,23.0,18.0,94.0
8,1,64.0,19.0,3.0
9,0,30.0,19.0,72.0


See what the points look like at the start by entering the code:

In [17]:
df_shopping.hvplot.scatter(x="Annual_Income", y="Spending_Score")

At first glance, the right number of clusters may seem obvious, but let’s see what happens when we start to cluster.

First, let's create a function so we can quickly run K-means on the DataFrame with a different number of clusters by entering the following code:

In [20]:
# Function to cluster the dataset and label each row
def test_cluster_amount(df, clusters):
    model = KMeans(n_clusters=clusters, random_state=5)
    # Fitting model
    model.fit(df)
    # Add a new class column to the DataFrame that was passed in
    df["class"] = model.labels_

This function will take a DataFrame and the number of clusters to make as arguments. Start by running the function to create two clusters and then plot the results:

In [21]:
test_cluster_amount(df_shopping, 2)
df_shopping.hvplot.scatter(x="Annual_Income", y="Spending_Score", by="class")



At first glance, two clusters look okay with some data points mixed in the middle.

__IMPORTANT__

Recall that plotting data with more than two features in a 2D plot might not show the true clustering.

Since there are some data points in the middle, let’s plot the DataFrame with a third axis. Enter the code to create a 3D plot:

In [23]:
fig = px.scatter_3d(
    df_shopping,
    x="Annual_Income",
    y="Spending_Score",
    z="Age",
    color="class",
    symbol="class",
    width=800,
)
fig.update_layout(legend=dict(x=0, y=1))
fig.show()

With the 3D plot, the cluster looks much better. Let’s repeat the process a few more times and see what the different clusters look like.
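Repeating the process can be sketched as a loop over several cluster counts. The example below is a minimal illustration, not the lesson's own code: it uses a synthetic DataFrame (`df_demo`, an assumed stand-in for the shopping data), fits only on named feature columns, and stores each run in its own hypothetical `class_<n>` column so earlier labels are never fed back into the model. The explicit `n_init=10` is also an assumption, added to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the shopping data (assumption: not the real CSV)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "Age": rng.integers(18, 70, size=200).astype(float),
    "Annual_Income": rng.integers(15, 140, size=200).astype(float),
    "Spending_Score": rng.integers(1, 100, size=200).astype(float),
})

features = ["Age", "Annual_Income", "Spending_Score"]
cluster_sizes = {}
for n_clusters in (3, 4, 5):
    model = KMeans(n_clusters=n_clusters, random_state=5, n_init=10)
    # Fit on the feature columns only, so earlier labels never leak into the fit
    labels = model.fit_predict(df_demo[features])
    df_demo[f"class_{n_clusters}"] = labels
    # Record how many rows landed in each cluster
    cluster_sizes[n_clusters] = np.bincount(labels, minlength=n_clusters).tolist()

print(cluster_sizes)
```

Each `class_<n>` column can then be passed to the 2D or 3D plotting calls shown earlier in place of `"class"`.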

These additional runs also look great! We’re really starting to see some solid clusters break out. However, before we get trigger-happy and increase the clusters further, we should consider when there might be too many clusters.

If we have too many, will it even tell us something about the data? If we increase to 100 clusters, that would really fine-tune each group, but with so many clusters, can we even do anything with that?

Recall that unsupervised learning doesn’t have a concrete outcome like supervised learning does. We use unsupervised learning to parse data to help us make decisions. So, at what point do we lose the helpfulness of unsupervised learning?

Trial and error leaves this question unclear and can only get us so far with more complex datasets. In the next section, we’ll learn a method that will help us determine the best value for K when clustering data.


# 18.4.2 Use the Elbow Curve

Now that you and Martha know how to create an elbow curve, use one on a dataset to help determine the number of clusters you should use.

Let’s walk through an example of how to use the elbow curve. This time, we’ll answer the question from the previous section: how many clusters would be ideal for the customer data?

Open a new notebook, import the dependencies, and load the shopping data as before. Then calculate the inertia for a range of K values:

In [25]:
inertia = []
k = list(range(1, 11))
# Calculate the inertia for the range of K values
for i in k:
   km = KMeans(n_clusters=i, random_state=0)
   km.fit(df_shopping)
   inertia.append(km.inertia_)
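The `inertia_` value collected in the loop above is the sum of squared distances from each sample to the center of its assigned cluster. As a rough check of that definition, the sketch below recomputes it by hand on synthetic 2D data (the blobs and the explicit `n_init=10` are assumptions made for illustration) and compares the result with what scikit-learn reports.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three synthetic 2D blobs (assumption: illustrative data only)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

# Look up the center of the cluster each point was assigned to
assigned_centers = km.cluster_centers_[km.labels_]
# Sum the squared distances from every point to its assigned center
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, km.inertia_)
```

The two printed numbers should agree up to floating-point rounding, which is why a lower inertia means tighter clusters.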

In [26]:
# Define a DataFrame to plot the Elbow Curve using hvPlot
elbow_data = {"k": k, "inertia": inertia}
df_elbow = pd.DataFrame(elbow_data)
df_elbow.hvplot.line(x="k", y="inertia", title="Elbow Curve", xticks=k)

This elbow curve doesn’t have as obvious an elbow as the one we saw previously. Either K = 5 or K = 6 could be considered the elbow. There is no surefire way to pick between the two, but we did narrow the candidate K values from 10 down to 2. You might also wonder why K = 3 wasn’t considered. Remember, we’re looking for the break where the vertical direction shifts to a strong horizontal direction. Compared to K = 5 and K = 6, the shift at K = 3 isn’t as dramatic.
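When the elbow is hard to judge by eye, one option is to tabulate how much the inertia falls at each step. The sketch below is an assumption-laden illustration rather than part of the lesson: it uses `make_blobs` to generate synthetic data with four groups and adds a hypothetical `pct_drop` column holding the percentage decrease in inertia from one K to the next.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four groups (assumption: stand-in for real data)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

k_values = list(range(1, 11))
inertia = [
    KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
    for k in k_values
]

df_elbow = pd.DataFrame({"k": k_values, "inertia": inertia})
# Percentage drop in inertia when moving from k-1 to k clusters
# (the first row has no previous value, so it is NaN)
df_elbow["pct_drop"] = -df_elbow["inertia"].pct_change() * 100

print(df_elbow.round(1))
```

Large values of `pct_drop` correspond to the steep part of the curve, and the point where they shrink and level off is a candidate elbow. The final choice still calls for judgment.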

Before plotting the results for the two K values, let’s create another function so we can reuse the K-means clustering code. As you may recall, functions allow us to save time because we don’t need to write the code contained in the function more than once:

In [27]:
def get_clusters(k, data):
    # Create a copy of the DataFrame
    data = data.copy()
    # Initialize the K-Means model
    model = KMeans(n_clusters=k, random_state=0)
    # Fit the model
    model.fit(data)
    # Predict clusters
    predictions = model.predict(data)
    # Create return DataFrame with predicted clusters
    data["class"] = predictions
    return data

__NOTE__

Creating a function is not required for K-means. The get_clusters function helps us save time since we’ll run the algorithm twice: once with K = 5 and again with K = 6. If you’re still struggling with functions, feel free to run the code twice, but do revisit get_clusters after strengthening your Python function skills.

Consider looking into the principle of "Don't repeat yourself" to learn why it’s important to use functions.
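To make the "Don't repeat yourself" idea concrete, the sketch below calls one function for several K values instead of copying the model code. It is a simplified variant of get_clusters written for illustration: the synthetic `df_demo` DataFrame, the use of `fit_predict`, and the explicit `n_init=10` are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def get_clusters(k, data):
    """Return a copy of `data` with a K-means "class" column for k clusters."""
    data = data.copy()
    model = KMeans(n_clusters=k, random_state=0, n_init=10)
    # fit_predict fits the model and returns the cluster label of each row
    data["class"] = model.fit_predict(data)
    return data

# Synthetic stand-in for the shopping data (assumption: not the real CSV)
rng = np.random.default_rng(42)
df_demo = pd.DataFrame({
    "Annual_Income": rng.uniform(15, 140, size=120),
    "Spending_Score": rng.uniform(1, 100, size=120),
})

# One call per K value instead of copy-pasting the model code
results = {k: get_clusters(k, df_demo) for k in (5, 6)}
for k, clustered in results.items():
    print(k, clustered["class"].nunique())
```

Because the function works on a copy, `df_demo` itself never gains a "class" column, so each run starts from the same features.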

We can now run the function for K = 5:

In [28]:
#run the function for K=5
five_clusters = get_clusters(5, df_shopping)
five_clusters.head()

Unnamed: 0,Card_Member,Age,Annual_Income,Spending_Score,class
0,1,19.0,15.0,39.0,0
1,1,21.0,15.0,81.0,4
2,0,20.0,16.0,6.0,0
3,0,23.0,16.0,77.0,4
4,0,31.0,17.0,40.0,0


In [29]:
#run the function for K=6
six_clusters = get_clusters(6, df_shopping)
six_clusters.head()

Unnamed: 0,Card_Member,Age,Annual_Income,Spending_Score,class
0,1,19.0,15.0,39.0,5
1,1,21.0,15.0,81.0,4
2,0,20.0,16.0,6.0,5
3,0,23.0,16.0,77.0,4
4,0,31.0,17.0,40.0,5


In [31]:
# Plot a 2D-scatter with x="Annual_Income" and y="Spending_Score"
five_clusters.hvplot.scatter(x="Annual_Income", y="Spending_Score", by='class')

In [32]:
# Plot the 3D-scatter with x="Age", y="Spending_Score" and z="Annual_Income"
fig = px.scatter_3d(
    five_clusters,
    x="Age",
    y="Spending_Score",
    z="Annual_Income",
    color="class",
    symbol="class",
    width=800,
)
fig.update_layout(legend=dict(x=0, y=1))
fig.show()

In [33]:
# Plot a 2D-scatter with x="Annual_Income" and y="Spending_Score"
six_clusters.hvplot.scatter(x="Annual_Income", y="Spending_Score", by='class')

In [34]:
# Plot the 3D-scatter with x="Age", y="Spending_Score" and z="Annual_Income"
fig = px.scatter_3d(
    six_clusters,
    x="Age",
    y="Spending_Score",
    z="Annual_Income",
    color="class",
    symbol="class",
    width=800,
)
fig.update_layout(legend=dict(x=0, y=1))
fig.show()

Recall, in the trial-and-error method, both graphs displayed multiple clusters. We’re still applying some trial and error here, but the elbow curve helps narrow down the number of clusters.

Now, the important question: do we use five or six groups? This depends on what insights you can take away from the data. One might conclude that six groups would be most useful because they could be broken down like so:

- Cluster 0: medium income, low annual spend
- Cluster 1: low income, low annual spend
- Cluster 2: high income, low annual spend
- Cluster 3: low income, high annual spend
- Cluster 4: medium income, high annual spend
- Cluster 5: very high income, high annual spend

If we choose five groups, the categories would have to be defined differently and might not fit what you’re looking for, which is grouping types of customers based on spending habits. Remember, unsupervised learning can help us make decisions about the data up to a point. After that, it is up to you, the expert, to make the final call.
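Descriptions like the cluster list above can be backed up by summarizing each cluster numerically. The sketch below is a minimal illustration on synthetic data (`df_demo`, `summary`, and the explicit `n_init=10` are assumptions, and the numbers it prints say nothing about the real shopping dataset): it groups rows by "class" and reports the average income, average spending score, and size of each cluster.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the shopping data (assumption: not the real CSV)
rng = np.random.default_rng(7)
df_demo = pd.DataFrame({
    "Annual_Income": rng.uniform(15, 140, size=200),
    "Spending_Score": rng.uniform(1, 100, size=200),
})

features = ["Annual_Income", "Spending_Score"]
model = KMeans(n_clusters=6, random_state=0, n_init=10)
df_demo["class"] = model.fit_predict(df_demo[features])

# Average income and spending score per cluster
summary = df_demo.groupby("class")[features].mean().round(1)
# Number of rows in each cluster, aligned on the cluster label
summary["size"] = df_demo["class"].value_counts().sort_index()

print(summary)
```

Reading the rows of `summary` side by side is one way to decide whether a label such as "medium income, low annual spend" actually fits a cluster.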

So far, you’ve learned that when dealing with multiple features, the clusters were best viewed in 3D graphs, which can get messy. In the next section, we’ll learn how to limit or combine features.

# 18.5.1 Dimensionality Reduction

Martha has noticed that the datasets we have worked with so far have been fairly manageable. Even after some data cleanup, there haven't been too many features to work with. However, she is beginning to worry that her cryptocurrency data has too many features and is not sure how this will affect our model. The way to handle this is with dimensionality reduction.

Think back to our example with the store owner who is trying to sell school supplies. His customer data could contain endless features, or columns. The data could include name, age, address, items bought, amount spent, time spent shopping, zip code, and so forth. Some features just aren’t necessary and could throw off our algorithm. For instance, would converting names to an integer value be worth the time or even inform our analysis?

Also, throwing all of these features into the model might cause it to overfit the data.

Because an overfit model picks up noise instead of meaningful patterns, it is best to find a way to limit features. The process of reducing features is called dimensionality reduction. There are two options for coping with too many features: elimination and extraction.

## Feature Elimination
Your first idea might be to remove a good number of features so the model won’t be run using every column. This is called feature elimination.

Feature elimination means what you think: You remove, or eliminate, a feature from the dataset. In our school supply example, you would remove features that aren’t relevant to what we’re looking for, such as name, address, and zip code. This method is simple and keeps the remaining features interpretable.

The downside is, once you remove that feature, you can no longer glean information from it. If we want to know the likelihood of people buying school supplies, but we removed the zip code feature, then we’d miss a detail that could help us understand when certain residents tend to purchase school supplies.
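In pandas, feature elimination usually amounts to dropping columns. The sketch below uses a hypothetical `customers` DataFrame whose column names and values are invented for illustration, and removes the name, address, and zip code columns mentioned above.

```python
import pandas as pd

# Hypothetical customer records; every column name and value is invented
customers = pd.DataFrame({
    "name": ["Ana", "Ben", "Cai"],
    "address": ["1 Elm St", "2 Oak Ave", "3 Pine Rd"],
    "zip_code": ["94110", "94703", "95014"],
    "age": [34, 27, 45],
    "amount_spent": [120.5, 80.0, 42.25],
})

# Feature elimination: drop the columns judged irrelevant to the question
irrelevant = ["name", "address", "zip_code"]
customers_reduced = customers.drop(columns=irrelevant)

print(customers_reduced.columns.tolist())
```

`drop` returns a new DataFrame, so the original `customers` table keeps the zip code column in case it turns out to matter later.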

## Feature Extraction
Feature extraction combines all features into a new set that is ordered by how well they predict our original variable.

In other words, feature extraction reduces the number of dimensions by transforming a large set of variables into a smaller one. This smaller set of variables contains most of the important information from the original large set.
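The text above does not name a specific extraction method, so the sketch below assumes principal component analysis (PCA), one commonly used technique, purely as an illustration. It builds four hypothetical, partly redundant numeric features from synthetic data, standardizes them, and transforms them into two new components. Note that PCA orders components by how much of the data's variance they capture.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Hypothetical features; two of them are noisy copies of the other two
base = rng.normal(size=(100, 2))
features = pd.DataFrame({
    "age": base[:, 0],
    "amount_spent": base[:, 1],
    "items_bought": base[:, 1] * 0.8 + rng.normal(scale=0.2, size=100),
    "minutes_shopping": base[:, 0] * 0.5 + rng.normal(scale=0.3, size=100),
})

# Standardize so that no feature dominates because of its scale
scaled = StandardScaler().fit_transform(features)

# Feature extraction: transform four columns into two principal components
pca = PCA(n_components=2)
df_pca = pd.DataFrame(pca.fit_transform(scaled), columns=["pc_1", "pc_2"])

print(df_pca.shape)
print(pca.explained_variance_ratio_.sum())
```

The second printed value is the share of the original variance retained by the two components, which gives a rough sense of how much information the smaller set keeps.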

__NOTE__

Sometimes, you need to use both feature elimination and extraction. For instance, the customer name feature doesn’t inform us about whether or not customers will purchase school supplies. So, we would eliminate that feature during the preprocessing stage, then apply extraction on the remaining features.

