# Clustering

In this assignment, you will implement the K-Means clustering algorithm from scratch and compare the results with the existing sklearn implementation.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Question 1.1: Write a method that determines Labels from Points and ClusterCentroids and returns a list containing one label for each point.

In [None]:
def FindLabelOfClosest(Points, ClusterCentroids): # determine Labels from Points and ClusterCentroids
    Points = np.asarray(Points, dtype=float)
    ClusterCentroids = np.asarray(ClusterCentroids, dtype=float)
    NumberOfClusters = ClusterCentroids.shape[0] # number of centroids
    NumberOfPoints = Points.shape[0]
    Distances = np.empty(NumberOfClusters) # distances from one point to every centroid
    Labels = np.empty(NumberOfPoints, dtype=int)
    for PointNumber in range(NumberOfPoints): # assign labels to all data points
        for ClusterNumber in range(NumberOfClusters): # for each cluster
            # Euclidean distance from this point to this centroid
            Distances[ClusterNumber] = np.linalg.norm(Points[PointNumber] - ClusterCentroids[ClusterNumber])
        Labels[PointNumber] = np.argmin(Distances) # assign to closest cluster
    return Labels # return a label for each point
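
The nested loop can also be expressed with NumPy broadcasting. The following is a minimal sketch rather than part of the required solution: the name `find_label_of_closest_vectorized` and the toy arrays are made up for illustration, and it assumes both inputs are 2-D numeric arrays with the same number of columns.

```python
import numpy as np

def find_label_of_closest_vectorized(Points, ClusterCentroids):
    """Return the index of the nearest centroid for every point (hypothetical helper)."""
    Points = np.asarray(Points, dtype=float)
    ClusterCentroids = np.asarray(ClusterCentroids, dtype=float)
    # Shape (n_points, n_clusters, n_dims): every point minus every centroid
    Differences = Points[:, np.newaxis, :] - ClusterCentroids[np.newaxis, :, :]
    # Euclidean norm over the feature axis -> shape (n_points, n_clusters)
    Distances = np.linalg.norm(Differences, axis=2)
    return np.argmin(Distances, axis=1)

# Toy data: two points near the origin, two near (5, 5)
toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [4.8, 5.1]])
toy_centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
toy_labels = find_label_of_closest_vectorized(toy_points, toy_centroids)
print(toy_labels)
```

The broadcasted version builds an intermediate array of size points x clusters x dimensions, so it trades memory for speed on large inputs.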


Question 1.2: Write a method that determines the centroid of the Points that share the same label.

In [None]:
def CalculateClusterCentroid(Points, Labels): # determine centroid of Points with the same label
    Points = np.asarray(Points, dtype=float)
    ClusterLabels = np.unique(Labels) # distinct labels, sorted
    NumberOfPoints, NumberOfDimensions = Points.shape
    # dtype=float keeps the centroids numeric instead of the default object dtype
    ClusterCentroids = pd.DataFrame(index=ClusterLabels, columns=range(NumberOfDimensions), dtype=float)
    for ClusterNumber in ClusterLabels: # for each cluster
        # mean of the points carrying this label
        ClusterCentroids.loc[ClusterNumber, :] = Points[Labels == ClusterNumber].mean(axis=0)
    return ClusterCentroids # return one centroid row per label
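
The same per-label mean can be sketched with a pandas `groupby`. This is an illustrative alternative under the assumption that `Points` is a 2-D numeric array and `Labels` has one entry per row; the helper name and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def calculate_centroids_groupby(Points, Labels):
    """Mean of the points sharing each label, one row per label (hypothetical helper)."""
    frame = pd.DataFrame(np.asarray(Points, dtype=float))
    # Group the rows by their label and average every column
    return frame.groupby(np.asarray(Labels)).mean()

toy_points = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [12.0, 14.0]])
toy_labels = np.array([0, 0, 1, 1])
toy_centroids = calculate_centroids_groupby(toy_points, toy_labels)
print(toy_centroids)
```

In either version, a label that no point carries produces no row, so the number of centroids can shrink if a cluster becomes empty during the iterations.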


Question 1.3: Put it all together. The K-means algorithm partitions the input data into K clusters by iterating between the following two steps:
- Compute the cluster center by computing the arithmetic mean of all the points belonging to the cluster.
- Assign each point to the closest cluster center.

In [None]:
def KMeans(Points, ClusterCentroidGuesses, max_iters=100):
    ClusterCentroids = np.asarray(ClusterCentroidGuesses, dtype=float).copy()
    # Get starting set of labels
    Labels = FindLabelOfClosest(Points, ClusterCentroids)
    for _ in range(max_iters): # cap the iterations so the loop always terminates
        # Re-calculate cluster centers based on the current set of labels
        ClusterCentroids = CalculateClusterCentroid(Points, Labels).to_numpy(dtype=float)
        Labels_Previous = Labels.copy() # keep a copy to detect convergence
        # Determine new labels based on new cluster centers
        Labels = FindLabelOfClosest(Points, ClusterCentroids)
        if np.array_equal(Labels, Labels_Previous): # converged: no assignment changed
            break
    return Labels, ClusterCentroids
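
As a quick sanity check of the alternating scheme, here is a compact, self-contained sketch run on two well-separated synthetic blobs. `toy_kmeans` is a hypothetical stand-in written for this example, not the function defined above, and the blob parameters and seed are arbitrary assumptions.

```python
import numpy as np

def toy_kmeans(Points, Centroids, max_iters=100):
    """Alternate assignment and mean-update steps until the labels stop changing."""
    Centroids = np.asarray(Centroids, dtype=float).copy()
    Labels = None
    for _ in range(max_iters):
        # Assignment step: nearest centroid for every point
        Distances = np.linalg.norm(Points[:, None, :] - Centroids[None, :, :], axis=2)
        NewLabels = Distances.argmin(axis=1)
        if Labels is not None and np.array_equal(NewLabels, Labels):
            break  # converged
        Labels = NewLabels
        # Update step: mean of each cluster (an empty cluster keeps its old centroid)
        for k in range(len(Centroids)):
            if np.any(Labels == k):
                Centroids[k] = Points[Labels == k].mean(axis=0)
    return Labels, Centroids

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
toy_points = np.vstack([blob_a, blob_b])
# Start from one point of each blob
toy_labels, toy_centroids = toy_kmeans(toy_points, toy_points[[0, 50]])
print(toy_centroids)
```

With blobs this far apart, the recovered centroids should sit at the two blob means; on real data the result depends on the initial guesses.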


In [None]:
StoreTxn = pd.read_csv("./Superstore Transaction data.csv")
StoreTxn['Order Date'] = pd.to_datetime(StoreTxn['Order Date'] )
StoreTxn.head()

Extract RFM features from the transaction data:
- Recency: when was the last purchase they made
- Frequency: how often do they make a purchase in the last month (or any given window you choose)
- Monetary: how much money did they spend in the last month
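
To make the three definitions concrete, the sketch below computes them on a tiny made-up table. The column names, the reference date `as_of`, and the 30-day window are assumptions chosen for illustration only; the assignment itself uses a different dataset and a 7-day window.

```python
import pandas as pd

toy_txn = pd.DataFrame({
    "customer": ["A", "A", "A", "B"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-20", "2024-01-25", "2024-01-10"]),
    "sales": [10.0, 20.0, 5.0, 50.0],
})
as_of = pd.Timestamp("2024-01-31")             # assumed reference date
window_start = as_of - pd.Timedelta(days=30)   # assumed 30-day window
recent = toy_txn[toy_txn["date"] > window_start]

toy_rfm = pd.DataFrame({
    # Recency: days since the customer's most recent purchase
    "recency_days": (as_of - toy_txn.groupby("customer")["date"].max()).dt.days,
    # Frequency: number of distinct purchase days inside the window
    "frequency": recent.groupby("customer")["date"].nunique(),
    # Monetary: total spend inside the window
    "monetary": recent.groupby("customer")["sales"].sum(),
})
print(toy_rfm)
```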

Question 2.1:
- Use groupby to summarize the quantity and dollar columns (Quantity and Sales in this dataset) by user_id and date (Customer ID and Order Date)
- Name the aggregated data txn_agg
- Reset the index of txn_agg to the default so that user_id and date become DataFrame columns
- Confirm changes

In [None]:
txn_agg = StoreTxn.groupby(['Customer ID','Order Date'])[['Quantity','Sales']].sum()
txn_agg.head(10)


Question 2.2: Using the aggregated data, obtain recency, frequency, and monetary features for both dollar and quantity. Use a 7-day moving window for frequency and monetary. Call your new features last_visit_ndays (recency), quantity_roll_sum_7D (frequency), and dollar_roll_sum_7D (monetary).

In [None]:
last = txn_agg.groupby(level=0).apply(lambda x: x.index.get_level_values(1).to_series().diff()).to_frame('Order Date')
last.rename(columns = {'Order Date' : 'last_visit_ndays'}, inplace = True) # Name the lagged date values last_visit_ndays
print(last.head(10), end='\n\n')

# Drop the customer level inside each group so rolling('7D') sees a plain DatetimeIndex
roll = txn_agg.groupby(level=0).apply(lambda x: x.droplevel(0).rolling('7D').sum())
roll.rename(columns = {'Quantity' : 'Quantity_roll_sum_7D', 'Sales' : 'Sales_roll_sum_7D'}, inplace = True) # Name the rolling sums Quantity_roll_sum_7D and Sales_roll_sum_7D
print(roll.head(10), end='\n\n')
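
A time-based window such as `'7D'` needs a datetime-like index within each customer. The toy example below is a sketch of one way to arrange that, using `groupby(...).rolling(...)` after moving the customer level out of the index; the data values are invented.

```python
import pandas as pd

toy_agg = pd.DataFrame(
    {"Sales": [10.0, 5.0, 20.0, 30.0]},
    index=pd.MultiIndex.from_tuples(
        [("A", pd.Timestamp("2024-01-01")),
         ("B", pd.Timestamp("2024-01-03")),
         ("A", pd.Timestamp("2024-01-05")),
         ("A", pd.Timestamp("2024-01-20"))],
        names=["Customer ID", "Order Date"],
    ),
)
# Move the customer level into a column so the index is a plain DatetimeIndex,
# then let groupby().rolling() apply the 7-day window inside each customer.
toy_roll = (
    toy_agg.reset_index(level="Customer ID")
    .groupby("Customer ID")["Sales"]
    .rolling("7D")
    .sum()
)
print(toy_roll)
```

Customer A's purchase on 2024-01-05 picks up the one from 2024-01-01 because it falls inside the preceding 7 days, while the purchase on 2024-01-20 stands alone.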


Question 2.3: Combine all three features into a single DataFrame and call it txn_roll

In [None]:
txn_roll = roll.join(last, how='inner') # Inner join between roll (frequency and monetary fields) and last (recency field) to create txn_roll. The join is on the index, which works because both dataframes are indexed by Customer ID and Order Date.

print(txn_roll.dtypes, end='\n\n')
txn_roll.head(10)


Question 2.4: Use fillna to replace missing values for recency with a large value like 100 days (whatever makes business sense). HINT: You can use pd.Timedelta('100 days') to set the value.

In [None]:
txn_roll['last_visit_ndays'] = txn_roll['last_visit_ndays'].fillna(pd.Timedelta('100 days')) # Replace missing recency values with 100 days
txn_roll.head(10)
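
The missing values arise because a customer's first visit has no earlier visit to measure against. A minimal sketch on invented dates shows the gap calculation, the fill, and the conversion to whole days:

```python
import pandas as pd

visits = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-04", "2024-01-10"]))
gaps = visits.diff()                                  # first entry is NaT: no earlier visit
gaps_filled = gaps.fillna(pd.Timedelta("100 days"))   # placeholder for "no previous visit"
gap_days = gaps_filled.dt.days                        # integer days, usable as a numeric feature
print(gap_days.tolist())
```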


Question 2.5: Merge the aggregated data txn_agg with the RFM features in txn_roll. You can use the merge method to do this with the right keys specified.

In [None]:
txn_rfm = txn_agg.merge(txn_roll, left_index=True, right_index=True)
txn_rfm.head(10)


Question 3.1: Train the k-means algorithm you developed earlier on the RFM features using 𝑘=4. What are the cluster centroids? The cluster centroids should be reported in the original scale, not the standardized scale.

In [None]:
from sklearn.preprocessing import StandardScaler
features = txn_rfm[['Quantity_roll_sum_7D','Sales_roll_sum_7D','last_visit_ndays']].copy()
features['last_visit_ndays'] = features['last_visit_ndays'].dt.days
scaler = StandardScaler()
Points = scaler.fit_transform(features)
ClusterCentroidGuesses = Points[np.random.choice(Points.shape[0], 4, replace=False)]
Labels, ClusterCentroids = KMeans(Points, ClusterCentroidGuesses)
ClusterCentroids = scaler.inverse_transform(ClusterCentroids)
ClusterCentroids
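
Reporting centroids in the original scale works because standardization is a per-column linear map, so it can be undone by multiplying by the column standard deviation and adding the column mean back. The sketch below reproduces that arithmetic with plain NumPy on invented numbers; it illustrates the idea and is not a substitute for `scaler.inverse_transform`.

```python
import numpy as np

toy_features = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
mean = toy_features.mean(axis=0)
std = toy_features.std(axis=0)   # population std (ddof=0), which StandardScaler also uses
standardized = (toy_features - mean) / std

# A centroid computed in standardized space: here, the mean of the first two rows
centroid_std = standardized[:2].mean(axis=0)
# Undo the scaling: multiply by the std and add the mean back
centroid_original = centroid_std * std + mean
print(centroid_original)
```

The recovered centroid equals the mean of the first two rows of the unscaled data, which is what one would expect to report.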


Question 3.2: Pick a few pairs of features and draw scatter plots that also show the cluster centroids.

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(features['Quantity_roll_sum_7D'], features['Sales_roll_sum_7D'], c=Labels, cmap='tab10', alpha=0.6)
plt.scatter(ClusterCentroids[:,0], ClusterCentroids[:,1], marker='x', color='black', s=200)
plt.xlabel('Quantity_roll_sum_7D')
plt.ylabel('Sales_roll_sum_7D')
plt.title('K-Means Clusters')
plt.show()


[Bonus] Question 4: Train a k-means model using the sklearn library and compare the results to the model developed above.

In [None]:
from sklearn.cluster import KMeans as SKKMeans
sk_kmeans = SKKMeans(n_clusters=4, random_state=0)
sk_labels = sk_kmeans.fit_predict(Points)
sk_centroids = scaler.inverse_transform(sk_kmeans.cluster_centers_)
sk_centroids
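
When comparing two clusterings, keep in mind that cluster numbers are arbitrary: label 0 in one model may correspond to label 2 in the other. One rough way to compare assignments is a contingency table. The sketch below uses invented label arrays, and the `agreement` score is an informal measure assumed for this example, not a standard metric.

```python
import numpy as np
import pandas as pd

labels_a = np.array([0, 0, 1, 1, 2, 2])   # hypothetical labels from one model
labels_b = np.array([2, 2, 0, 0, 1, 1])   # same partition, different numbering
contingency = pd.crosstab(labels_a, labels_b)
# For each cluster of model A, the cluster of model B it overlaps with most
best_match = contingency.idxmax(axis=1)
# Share of points that fall in the best-matching cluster (rough agreement score)
agreement = contingency.max(axis=1).sum() / len(labels_a)
print(contingency)
print(best_match.tolist(), agreement)
```

Because this score lets several clusters of one model map to the same cluster of the other, it can overstate agreement; comparing the centroids after matching clusters is a useful complement.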


Question 5: Create a new text cell in your Notebook and write a 50-100 word summary (or a short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include: What was your incoming experience with this model, if any? What steps did you take, and what obstacles did you encounter? How do you link this exercise to real-world machine learning problem-solving? (What steps were missing? What else do you need to learn?) This summary allows your instructor to know how you are doing and to allot points for your effort in thinking, planning, and making connections to real-world work.