# Assignment

We build on the feature engineering from the last assignment and run k-means on the RFM features to perform **customer segmentation**. Because k-means is unsupervised, there are no labels to check the clusters against, so interpreting the results at the end is part of the challenge.

In [None]:
import pandas as pd
col_names = ['user_id', 'gender', 'address', 'store_id', 'trans_id', 'timestamp', 'item_id', 'quantity', 'dollar']
churn = pd.read_csv("../../data/retail-churn.csv", sep = ",", skiprows = 1, names = col_names)
churn.head()

Run the feature engineering steps on the data to extract RFM features.  

In [None]:
churn['date'] =  pd.to_datetime(pd.to_datetime(churn['timestamp'], format = '%m/%d/%Y %H:%M').dt.date)
churn_agg = churn.groupby(['user_id', 'date']).agg({'dollar': 'sum', 'quantity': 'sum'})
churn_agg = churn_agg.reset_index()
churn_roll = pd.DataFrame()
churn_roll['dollar_roll_sum_7D'] = churn_agg.groupby('user_id').rolling(window = '7D', on = 'date')['dollar'].sum()
churn_roll['quantity_roll_sum_7D'] = churn_agg.groupby('user_id').rolling(window = '7D', on = 'date')['quantity'].sum()
churn_roll = churn_roll.reset_index()
churn_roll['last_visit_ndays'] = churn_agg.groupby('user_id')['date'].diff(periods = 1).dt.days
print(churn_roll.shape)

# Should we impute or drop NaN/NaT in churn_roll['last_visit_ndays']?
imputation_value = churn_roll['last_visit_ndays'].max()  # set to None to drop NaN rows instead of imputing
if imputation_value is None:
    # Drop (Remove all rows with NaN):
    churn_roll.dropna(inplace = True)
    print(churn_roll.shape)
else:
    # Impute (Replace all NaN in last_visit_ndays):
    churn_roll['last_visit_ndays'] = churn_roll['last_visit_ndays'].fillna(imputation_value)

churn_roll.head()
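
The missing values in `last_visit_ndays` come from each customer's first visit, where `diff` has no earlier date to subtract. A minimal sketch of both options on a made-up visit log (the `visits` frame and the `gap_days` column are hypothetical and not part of the churn data):

```python
import pandas as pd

# Made-up visit log: user 1 visits three times, user 2 once.
visits = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "date": pd.to_datetime(["2020-01-01", "2020-01-04", "2020-01-05", "2020-01-02"]),
})

# Days since the same user's previous visit; the first visit per user is NaN.
visits["gap_days"] = visits.groupby("user_id")["date"].diff().dt.days
n_missing = int(visits["gap_days"].isna().sum())

# Option 1: impute with the largest observed gap (treat first visits as "long ago").
imputed = visits["gap_days"].fillna(visits["gap_days"].max())

# Option 2: drop first visits altogether.
dropped = visits.dropna(subset=["gap_days"])

print(n_missing, imputed.tolist(), len(dropped))
```

Imputing keeps every row, including customers with a single visit, while dropping removes those customers from the clustering entirely.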

The RFM features are `dollar_roll_sum_7D`, `quantity_roll_sum_7D`, and `last_visit_ndays`.

In [None]:
from sklearn.cluster import KMeans


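def _run_kmeans(dataframe, cols, cluster, init='auto', rand_state=41):
    """Fit k-means with `cluster` clusters on the selected columns and return the fitted model."""
    kmeans = KMeans(n_clusters=cluster, n_init=init, random_state=rand_state)
    kmeans.fit(dataframe[cols])
    return kmeans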

1. Train a k-means algorithm on the 3 normalized RFM features using $k = 10$.

- What are the cluster centroids? The cluster centroids should be reported in the **original scale**, not the normalized scale. <span style="color:red" float:right>[2 point]</span> 

In [None]:
cols = ['dollar_roll_sum_7D','quantity_roll_sum_7D','last_visit_ndays']
churn_roll_scaled = churn_roll[cols].apply(lambda x: (x - x.mean()) / x.std(), axis = 0)
means = _run_kmeans(churn_roll_scaled,cols,cluster=10)
centroids = pd.DataFrame(means.cluster_centers_, columns=cols)
(centroids * churn_roll[cols].std() + churn_roll[cols].mean()).round(2)
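
The last line of the cell above undoes the z-score transform: since $z = (x - \mu) / \sigma$, it follows that $x = z\sigma + \mu$. A minimal sketch of that round trip on made-up numbers (the values and the helper names `to_z` and `from_z` are illustrative only, not part of the assignment data):

```python
import statistics

# Hypothetical spend values for one feature (not taken from the churn data).
values = [10.0, 20.0, 30.0, 40.0]

mu = statistics.mean(values)
sigma = statistics.stdev(values)  # sample std (ddof=1), matching pandas' default .std()


def to_z(x):
    """Z-normalize a raw value."""
    return (x - mu) / sigma


def from_z(z):
    """Map a z-score back to the original scale."""
    return z * sigma + mu


z_scores = [to_z(x) for x in values]
restored = [from_z(z) for z in z_scores]

# A centroid is just a point in z-space, so the same inverse applies to it.
centroid_z = 0.0  # the feature mean in z-space
centroid_raw = from_z(centroid_z)
print(restored, centroid_raw)
```

The inverse must use the same mean and standard deviation that were used for scaling, which is why the cell recomputes them from `churn_roll[cols]`.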

2. Our earlier choice of $k=10$ was arbitrary. To find a better value of $k$, create a **scree plot**, which plots the number of clusters $k$ on the x-axis and the sum of squared distances from each point to its cluster centroid on the y-axis. We can get the latter from the `inertia_` attribute of the fitted model, as shown in the lab. Plot the scree plot for $k$ values from 1 to 15. <span style="color:red" float:right>[4 point]</span>

In [None]:
from plotly import express as plt


data_cap = []
for n in range(1,16):
    means = _run_kmeans(churn_roll_scaled,cols,cluster=n)
    data_cap.append((n,means.inertia_))

k_testing = pd.DataFrame(data_cap, columns=['k','inertia'])
plt.line(k_testing,x='k',y='inertia',markers=True)
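
For reference, the quantity on the y-axis is the within-cluster sum of squared distances, $\sum_i \lVert x_i - c_{a(i)} \rVert^2$, where $c_{a(i)}$ is the centroid assigned to point $x_i$. A small pure-Python sketch on made-up points (the `inertia` helper below is hypothetical; scikit-learn computes this value itself and exposes it as `inertia_`):

```python
def inertia(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Made-up 2-D points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k = 1: every point is assigned to the overall mean.
one_cluster = inertia(points, [(5.0, 1.0)], [0, 0, 0, 0])

# k = 2: each group gets its own centroid, so the distances shrink.
two_clusters = inertia(points, [(0.0, 1.0), (10.0, 1.0)], [0, 0, 1, 1])

print(one_cluster, two_clusters)
```

Because adding centroids can only bring points closer to their nearest center, the best achievable inertia never increases with $k$. That is why a scree plot is read for its bend rather than its minimum.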

3. Based on the scree plot, what is a good value to pick for $k$? Provide a brief justification for your choice. <span style="color:red" float:right>[2 point]</span>

> For this RFM dataset, I would say that $k = 8$ is the minimum reasonable number of clusters, but $k = 14$ might be a better choice.
>
> The step from 13 to 14 is the last order-of-magnitude drop in inertia, and no larger $k$ yields a comparable decrease. If 14 clusters becomes unmanageable for any reason, 8 is the fallback, because that is approximately where the curve starts to level off before the plot is rescaled to exclude $k$ values of 1 to 4.
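
One way to make an elbow judgment less dependent on the plot's scale is to look at the relative drop in inertia from one $k$ to the next. A sketch, assuming a list of inertia values ordered by $k$; the helper `relative_drops` is hypothetical, and the numbers below are invented for illustration and are not this assignment's results:

```python
def relative_drops(inertias):
    """Fractional decrease in inertia at each step from k to k + 1."""
    return [
        (prev - curr) / prev
        for prev, curr in zip(inertias, inertias[1:])
    ]


# Invented inertia values for k = 1..6 (illustration only).
example_inertias = [1000.0, 500.0, 300.0, 270.0, 250.0, 240.0]

drops = relative_drops(example_inertias)
for k, drop in enumerate(drops, start=2):
    print(f"k={k}: inertia fell by {drop:.0%} relative to k={k - 1}")
```

With the real data, the same function could be applied to `k_testing['inertia'].tolist()`.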

4. Train a k-means algorithm on the RFM features using your new value of $k$. Report:
- the size (number of items) of each cluster
- the mean of each cluster in the original scale
- the standard deviation of each cluster in the Z-normalized scale
<span style="color:red" float:right>[2 point]</span>

In [None]:
means_14 = _run_kmeans(churn_roll_scaled,cols,cluster=14)
means_8 = _run_kmeans(churn_roll_scaled,cols,cluster=8)
churn_roll['cluster_14'] = means_14.predict(churn_roll_scaled[cols])
churn_roll_scaled['cluster_14'] = means_14.predict(churn_roll_scaled[cols])
churn_roll['cluster_8'] = means_8.predict(churn_roll_scaled[cols])
churn_roll_scaled['cluster_8'] = means_8.predict(churn_roll_scaled[cols])
churn_roll['cluster_14'] = churn_roll['cluster_14'].astype('str')
churn_roll['cluster_8'] = churn_roll['cluster_8'].astype('str')

In [None]:
print("Grouping by cluster id")
display(churn_roll.value_counts('cluster_14').sort_index(),churn_roll.value_counts('cluster_8').sort_index())

In [None]:
display(churn_roll.groupby('cluster_14',observed=True)[cols].mean(), churn_roll.groupby('cluster_8',observed=True)[cols].mean())

In [None]:
display(churn_roll_scaled.groupby('cluster_14')[cols].std(), churn_roll_scaled.groupby('cluster_8')[cols].std())
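
The three requested summaries can also be collected per cluster with a single `groupby`. A sketch on a made-up frame, assuming a `cluster` label column and one feature stored both in raw form (`raw`) and z-normalized form (`z`); all three column names are hypothetical:

```python
import pandas as pd

# Made-up data: two clusters, one feature in raw and z-normalized form.
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1],
    "raw": [1.0, 2.0, 3.0, 10.0, 20.0],
})
toy["z"] = (toy["raw"] - toy["raw"].mean()) / toy["raw"].std()

# Named aggregation: one row per cluster with size, raw-scale mean, z-scale std.
summary = toy.groupby("cluster").agg(
    size=("raw", "size"),
    mean_raw=("raw", "mean"),
    std_z=("z", "std"),
)
print(summary)
```

A within-cluster standard deviation well below 1 on the z-scale indicates a cluster that is tight relative to the overall spread of that feature.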

In [None]:
fig_14 = plt.scatter_matrix(churn_roll,dimensions=cols,color='cluster_14')
fig_14.update_traces(diagonal_visible=False)
fig_14.update_layout(height=1000)
fig_14.show()

In [None]:
fig_8 = plt.scatter_matrix(churn_roll,dimensions=cols,color='cluster_8')
fig_8.update_traces(diagonal_visible=False)
fig_8.update_layout(height=1000)
fig_8.show()

> Comparing the two, 14 clusters looks like a case of overfitting, so I am moving forward with only 8 clusters for the final question.

5. Pick 3 clusters at random and describe what makes them different from one another (in terms of their RFM features). <span style="color:red" float:right>[3 point]</span>

In [None]:
# cluster_8 was cast to str earlier, so compare against string labels
df_comp = churn_roll.loc[churn_roll['cluster_8'].isin(['1', '2', '6'])]
df_comp.value_counts('cluster_8')

In [None]:
fig = plt.scatter_matrix(df_comp,dimensions=cols,color='cluster_8')
fig.update_traces(diagonal_visible=False)
fig.update_layout(height=1200)
fig.show()

> Fortunately, these clusters do seem to be genuinely different.
>
> Cluster 1 appears to be defined by the lowest dollar spend and the fewest items purchased.
>
> Cluster 2 appears to buy the highest number of items, but not the most expensive ones, and does not return to the store very often.
>
> Cluster 6 appears to purchase higher-dollar items in smaller quantities.

# End of assignment