### 1. Introduction


We'll play the role of a data scientist working for a credit card company.

The dataset contains information about the company's clients, and we've been asked to segment them into groups so that a different business strategy can be applied to each type of customer.

### 2. Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [None]:
df = pd.read_csv('customer_segmentation.csv')

In [None]:
df.describe()

In [None]:
fig, ax = plt.subplots(figsize = (12,10))

# removing the customer's id before plotting the distributions
df.drop('customer_id', axis = 1).hist(ax = ax)

plt.tight_layout()
plt.show()

It's always important to look at the shape of the distributions. For our clustering purposes today, this isn't something we have to worry about much, but if you aren't happy with the results, this is a place to come back to and see whether there are any modifications you can make to the dataset to guide the clusters in a more targeted way.
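If a later clustering run looks dominated by one heavily skewed variable, one possible modification is to measure the skewness and log-transform that column before scaling. The sketch below uses synthetic data and a hypothetical column name (`income_like`) rather than the real dataset, and the log transform is only one option among several, not a step this analysis applies.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed data standing in for a real column (hypothetical name)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'income_like': rng.lognormal(mean=11, sigma=0.5, size=1000)})

skew_before = demo['income_like'].skew()

# log1p compresses the long right tail; assumes the values are non-negative
demo['income_like_log'] = np.log1p(demo['income_like'])
skew_after = demo['income_like_log'].skew()

print(f'skew before: {skew_before:.2f}, after: {skew_after:.2f}')
```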


In [None]:
correlations = df.drop('customer_id', axis = 1).corr(numeric_only = True)

fig, ax = plt.subplots(figsize = (12,8))
sns.heatmap(correlations[(correlations > 0.3) | (correlations < -0.3)], cmap = 'Blues', annot = True, ax=ax)

plt.tight_layout()
plt.show()


### 3. Feature engineering

In [None]:
customers_modif = df.copy()

# encode gender as a binary column: 1 for 'M', 0 otherwise
customers_modif['gender'] = df['gender'].apply(lambda x : 1 if x == 'M' else 0)
customers_modif.head()

In [None]:
# ordinal encoding: a higher number means a higher education level
education_mapping = {'Uneducated' : 0, 'High School' : 1, 'College' : 2,
                     'Graduate': 3, 'Post-Graduate' : 4, 'Doctorate': 5}
customers_modif['education_level'] = customers_modif['education_level'].map(education_mapping)

customers_modif.head()

In [None]:
dummies = pd.get_dummies(customers_modif[['marital_status']])

customers_modif = pd.concat([customers_modif, dummies], axis = 1)
customers_modif.drop(['marital_status'], axis = 1, inplace = True)

print(customers_modif.info())
customers_modif.head()

In [None]:
X = customers_modif.drop('customer_id', axis = 1)

scaler = StandardScaler()
scaler.fit(X)

X_Scaled = scaler.transform(X)
X_Scaled

In [None]:
## The Elbow rule

X = pd.DataFrame(X_Scaled)
inertias = []

for k in range(1,11):
    model = KMeans(n_clusters= k)
    y = model.fit_predict(X)
    inertias.append(model.inertia_)

plt.figure(figsize = (12,8))
plt.plot(range(1,11), inertias, marker = 'o')
plt.xticks(ticks = range(1,11), labels = range(1,11))
plt.title('Inertia vs Number of Clusters')

plt.tight_layout()
plt.show()

The elbow rule is a visual way of comparing the inertia for different numbers of clusters. We start by initializing an empty list called inertias. In K-means clustering, inertia is the sum of the squared distances between every data point and its corresponding centroid. The centroid is the center of a cluster, so the closer all the points are to their centroid, the smaller the inertia. Smaller inertia is better because it means the clusters are tighter. However, you could technically have a centroid for every point in your dataset: with 100 data points and 100 centroids, the inertia is zero. Fantastic, right? No, because then you don't have any insight. If you have 100 clusters for 100 data points, the credit card company is going to ask, what can we do with that? So where's the sweet spot? We don't want one cluster and we don't want 100 clusters, so we use the elbow rule.
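To make the definition of inertia concrete, the sketch below recomputes it by hand and compares it with the value scikit-learn reports. It runs on small synthetic data rather than the customer dataset, and the variable names (`points`, `manual_inertia`) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points around three centers (illustrative data only)
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [5, 5], [0, 5]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(points)

# Inertia: squared distance from each point to its assigned centroid, summed
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((points - assigned_centroids) ** 2).sum()

print(manual_inertia, model.inertia_)  # the two values should agree

# With a single cluster, every point is measured against the global mean,
# so the inertia is larger than with three clusters
one_cluster = KMeans(n_clusters=1, n_init=10, random_state=42).fit(points)
print(one_cluster.inertia_ > model.inertia_)
```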

What we want to look for is where the slope tapers off. After 8 it starts to get shallow, which tells us this is our elbow. But every time you run it you'll get a different result if you don't set a seed. There are a few strategies to consider when you do this analysis; for example, you can look for the average elbow.
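One way to read "the average elbow" is to repeat the inertia loop under several seeds and average the resulting curves. That interpretation, the synthetic data, and the names `seeds` and `mean_inertias` are assumptions made for this sketch. It also shows that fixing `random_state` makes a single run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled feature matrix (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))

ks = range(1, 11)
seeds = [0, 1, 2, 3, 4]

# One inertia curve per seed; n_init=1 so each seed is a single initialization
curves = np.array([
    [KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X_demo).inertia_ for k in ks]
    for seed in seeds
])

# Averaging across seeds smooths out run-to-run variation in the curve
mean_inertias = curves.mean(axis=0)

# The same seed reproduces the same result
run_a = KMeans(n_clusters=5, n_init=1, random_state=0).fit(X_demo).inertia_
run_b = KMeans(n_clusters=5, n_init=1, random_state=0).fit(X_demo).inertia_
print(np.isclose(run_a, run_b))
```

The averaged curve can then be plotted in the same way as a single run, for example with `plt.plot(ks, mean_inertias, marker='o')`.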

In [None]:
model = KMeans(n_clusters=5)
y = model.fit_predict(X_Scaled)
y

In [None]:
df['CLUSTER'] = y + 1
df.head()

We are not done: we need to analyze what these clusters are telling us so that we can give useful information to our end user. We'll look at our numeric variables, group them by cluster, and look at some plots.
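A quick numeric summary can already hint at what separates the groups. The sketch below builds a tiny synthetic frame with a `CLUSTER` column (the values and the two feature columns are made up for illustration) and computes cluster sizes and per-cluster means.

```python
import pandas as pd

# Tiny synthetic example; real column values will differ
demo = pd.DataFrame({
    'CLUSTER': [1, 1, 2, 2, 2, 3],
    'credit_limit': [12000, 14000, 3000, 2500, 3500, 8000],
    'avg_utilization_ratio': [0.05, 0.10, 0.60, 0.70, 0.65, 0.30],
})

# How many customers fall in each cluster
cluster_sizes = demo['CLUSTER'].value_counts().sort_index()

# Average of each numeric variable within each cluster
cluster_means = demo.groupby('CLUSTER').mean(numeric_only=True)

print(cluster_sizes)
print(cluster_means)
```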

In [None]:
numeric_columns = df.select_dtypes(include = np.number).drop(['customer_id', 'CLUSTER'], axis = 1).columns

fig = plt.figure(figsize = (20,20))
for i, column in enumerate(numeric_columns):
    df_plot = df.groupby('CLUSTER')[column].mean()
    ax = fig.add_subplot(5,2, i+1)
    ax.bar(df_plot.index, df_plot, color = sns.color_palette('Set1'), alpha = 0.6)
    ax.set_title(f'Average {column.title()} per Cluster', alpha = 0.5)
    ax.xaxis.grid(False)

# lay out and show the figure once, after all subplots have been drawn
plt.tight_layout()
plt.show()

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, figsize = (16,8)) 
sns.scatterplot(x = 'age', y = 'months_on_book', hue = 'CLUSTER',
                 data = df, palette = 'tab10', alpha = 0.4, ax = ax1)
sns.scatterplot(x = 'estimated_income', y = 'credit_limit', hue = 'CLUSTER',
                data = df, palette = 'tab10', alpha = 0.4, ax = ax2)
sns.scatterplot(x = 'credit_limit', y = 'avg_utilization_ratio', hue = 'CLUSTER', 
                data = df, palette = 'tab10', alpha = 0.4, ax = ax3)
sns.scatterplot(x = 'total_trans_count', y = 'total_trans_amount', hue = 'CLUSTER',
                data = df, palette= 'tab10', alpha = 0.4, ax = ax4)

plt.tight_layout()
plt.show()

In [None]:
cat_columns = df.select_dtypes(include =['object']) 

fig = plt.figure(figsize = (18,6))
for i, col in enumerate(cat_columns):
    # share of each category within every cluster (each row sums to 1)
    plot_df = pd.crosstab(index = df['CLUSTER'], columns = df[col], normalize = 'index')
    ax = fig.add_subplot(1,3, i+1)
    plot_df.plot.bar(stacked = True, ax = ax, alpha = 0.6)
    ax.set_title(f'% {col.title()} per Cluster', alpha = 0.5)

    ax.set_ylim(0,1.4)
    ax.legend(frameon = False)
    ax.xaxis.grid(False)

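    # keep the y ticks within 0-1; the extra headroom up to 1.4 is for the legend
    labels = [0, 0.2, 0.4, 0.6, 0.8, 1]
    ax.set_yticks(labels)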

plt.tight_layout()
plt.show()


**Cluster 1**
- Generally male
- High estimated income (~$100K)
- High credit limit
- Very low utilization ratios

> Customers have money to spend, so they can be incentivized to spend more