## **AllLife Bank Credit Card Customer Segmentation using Clustering**


## Context:
AllLife Bank wants to focus on its credit card customer base in the next financial year. Its marketing research team has advised that market penetration can be improved. Based on this input, the Marketing team proposes to run personalized campaigns to target new customers as well as upsell to existing customers. Another insight from the market research was that customers perceive the bank's support services poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customers' queries are resolved faster. The Head of Marketing and the Head of Delivery both decide to reach out to the Data Science team for help.


## Objective: To identify different segments in the existing customer base, based on their spending patterns as well as past interactions with the bank.

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore') # To suppress warnings

In [2]:
# Read the dataset
data = pd.read_csv('Credit Card Customer Data.csv')
data.head()

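| | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 87073 | 100000 | 2 | 1 | 1 | 0 |
| 1 | 2 | 38414 | 50000 | 3 | 0 | 10 | 9 |
| 2 | 3 | 17341 | 50000 | 7 | 1 | 3 | 4 |
| 3 | 4 | 40496 | 30000 | 5 | 1 | 1 | 4 |
| 4 | 5 | 47437 | 100000 | 6 | 0 | 12 | 3 |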


In [6]:
data.iloc[0]

Sl_No                       1
Customer Key            87073
Avg_Credit_Limit       100000
Total_Credit_Cards          2
Total_visits_bank           1
Total_visits_online         1
Total_calls_made            0
Name: 0, dtype: int64

In [10]:
data.shape

(660, 7)

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Sl_No                660 non-null    int64
 1   Customer Key         660 non-null    int64
 2   Avg_Credit_Limit     660 non-null    int64
 3   Total_Credit_Cards   660 non-null    int64
 4   Total_visits_bank    660 non-null    int64
 5   Total_visits_online  660 non-null    int64
 6   Total_calls_made     660 non-null    int64
dtypes: int64(7)
memory usage: 36.2 KB


In [12]:
# Check for duplicate data
data.duplicated().sum()

0

The data contains no fully duplicated rows.

In [None]:
# Count the unique values in each column
data.nunique()

- `Customer Key`, which is an identifier, has repeated values.

In [None]:
# Identify the duplicated customer keys
duplicate_keys = data.duplicated('Customer Key')  # boolean mask: True for repeated keys after their first occurrence
duplicate_keys
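Before dropping anything, it can help to look at the repeated keys side by side. The following is a minimal sketch on a made-up frame (the `toy` values are hypothetical, not taken from the bank data) that contrasts `keep=False`, which flags every row sharing a key, with the default `keep='first'`, which flags only the later repeats.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Customer Key": [101, 102, 101, 103, 102],
    "Avg_Credit_Limit": [10000, 20000, 12000, 30000, 20000],
})

# keep=False flags every row whose key occurs more than once,
# so the first occurrence and its repeats can be compared directly.
repeated = toy[toy.duplicated("Customer Key", keep=False)].sort_values("Customer Key")
print(repeated)

# The default keep='first' flags only the later repeats;
# negating the mask keeps one row per key.
deduplicated = toy[~toy.duplicated("Customer Key")]
print(deduplicated)
```

Whether the first or the last occurrence should be kept depends on what the repeated rows mean in the business context, which the data alone does not say.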

In [None]:
# Drop rows with duplicated keys, keeping the first occurrence of each key
data = data[~duplicate_keys]

In [None]:
# Drop the columns that are not needed for the analysis
data = data.drop(columns=['Sl_No', 'Customer Key'])

In [None]:
data.shape

### Observation

- After removing the rows with duplicated customer keys and the unneeded columns, there are 655 unique observations and 5 columns in our data.

## Exploratory Data Analysis

**Statistics**

In [None]:
data.describe()

- The average credit limit is around 35K, while 50% of customers have a credit limit below 18K (the median), which implies strong positive skewness.
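The gap between mean and median can be checked numerically. This is a small sketch on hypothetical credit limits (the `limits` values are invented to mimic a long right tail, they are not the notebook's data) using pandas' built-in `Series.skew()`.

```python
import pandas as pd

# Hypothetical credit limits, chosen only to mimic a long right tail.
limits = pd.Series([5000, 8000, 10000, 12000, 18000, 20000, 35000, 100000, 150000])

mean_limit = limits.mean()
median_limit = limits.median()
skewness = limits.skew()  # sample skewness; a value above 0 indicates a right tail

print(f"mean={mean_limit:.0f}, median={median_limit:.0f}, skew={skewness:.2f}")
```

On the real data the same three calls can be applied to `data['Avg_Credit_Limit']`.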

### Univariate Analysis

#### **Distribution Plot**

In [None]:
plt.figure(figsize = (14, 7))
sns.histplot(data['Avg_Credit_Limit'], kde=True)  # distplot is deprecated in recent seaborn releases

In [None]:
plt.figure(figsize = (14, 7))
sns.boxplot(x=data['Avg_Credit_Limit'])

*The Average Credit Limit is right skewed with a lot of outliers.*

In [None]:
data.columns

In [None]:
fig, ax = plt.subplots(nrows=4, ncols=1, figsize=(20, 23))
cols_ = ['Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']
for ind, col in enumerate(cols_):
    sns.countplot(x=col, data=data, ax=ax[ind])

### Observations

1. The most common number of credit cards per customer is 4.
2. The most common number of bank visits is 2.
3. The most common number of online visits is 2, followed by 0.
4. The most common number of calls made is 4, followed by 0 and 1.

### Multi-variate Analysis

In [None]:
plt.figure(figsize=(15,8))

sns.heatmap(data[['Avg_Credit_Limit', 'Total_Credit_Cards', 'Total_visits_bank', 'Total_visits_online', 'Total_calls_made']].corr(), 
            annot=True, cmap="PiYG");

#### Observations

- Avg_Credit_Limit is positively correlated with Total_Credit_Cards and Total_visits_online, which makes sense.
- Avg_Credit_Limit is negatively correlated with Total_calls_made and Total_visits_bank.
- Total_visits_bank, Total_visits_online, and Total_calls_made are negatively correlated with one another, which suggests that most customers use only one of these channels to contact the bank.
- Total_Credit_Cards and Total_calls_made are negatively correlated with each other.

## Data Preprocessing

### **Outlier Detection & Handling**

In [None]:
# Outlier detection and removal for Average Credit Limit, using fences at 3 * IQR

Q1 = data['Avg_Credit_Limit'].quantile(0.25)
Q3 = data['Avg_Credit_Limit'].quantile(0.75)

IQR = Q3 - Q1
data = data[(data['Avg_Credit_Limit'] >= Q1 - 3*IQR) & (data['Avg_Credit_Limit'] <= Q3 + 3*IQR)]
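The multiplier on the IQR controls how aggressive the filter is. The sketch below uses invented values and a hypothetical helper named `iqr_fences` to compare the common 1.5 × IQR rule with the 3 × IQR fences used in the cell above; it is an illustration of the trade-off, not a recommendation for either setting.

```python
import pandas as pd

def iqr_fences(series, k):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values for illustration only.
limits = pd.Series([10, 12, 14, 15, 16, 18, 20, 22, 40, 200]) * 1000

removed = {}
for k in (1.5, 3.0):
    low, high = iqr_fences(limits, k)
    kept = limits.between(low, high)  # inclusive on both ends
    removed[k] = int((~kept).sum())
    print(f"k={k}: fences=({low:.0f}, {high:.0f}), rows removed={removed[k]}")
```

A wider fence keeps more of the high-limit customers, who may form a segment of their own, at the cost of leaving more extreme values for the distance-based clustering to absorb.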

In [None]:
# Let's visualize the distribution after outlier treatment
plt.figure(figsize=(15, 10))
sns.boxplot(y=data['Avg_Credit_Limit'])  # passing the column as y draws a vertical box
plt.show()

In [None]:
data.head()

In [None]:
data.shape

In [None]:
# Make a copy of Original Dataframe
df = data.copy()
df.shape

#### Standardization

Before clustering, we should always scale the data, because features on larger scales would otherwise receive unintended extra weight when distances are calculated.
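To make the effect of scale concrete, here is a minimal sketch on six hypothetical customers with two features. The helper `limit_share` is an illustrative name introduced here; it reports what fraction of the squared Euclidean distance between two customers comes from the credit limit column, before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [Avg_Credit_Limit, Total_visits_online]
X = np.array([
    [5000.0, 1.0],
    [10000.0, 2.0],
    [20000.0, 12.0],
    [50000.0, 3.0],
    [100000.0, 10.0],
    [150000.0, 15.0],
])

def limit_share(a, b):
    """Fraction of the squared Euclidean distance contributed by column 0."""
    sq = (a - b) ** 2
    return float(sq[0] / sq.sum())

# Customers 0 and 2 differ by 15,000 in limit and by 11 online visits.
raw_share = limit_share(X[0], X[2])

# After standardization every column has mean 0 and unit variance.
Z = StandardScaler().fit_transform(X)
scaled_share = limit_share(Z[0], Z[2])

print(f"share of distance from credit limit: raw={raw_share:.4f}, scaled={scaled_share:.4f}")
```

In raw units the online-visit difference is practically invisible to the distance, whereas after scaling it dominates for this pair, which is the behavior K-means then sees.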

In [None]:
# Standardize the dataset

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
std_data = scaler.fit_transform(data)

In [None]:
std_data

In [None]:
std_data_x = np.copy(std_data)

## Apply the K-Means Clustering Algorithm

In [None]:
from sklearn import metrics
from sklearn.metrics import silhouette_score

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN

from sklearn import cluster 
from sklearn.cluster import SpectralClustering

### Determining Number of Clusters with Elbow Method

In [None]:
wcss = []
cluster_list = range(1, 12)
for i in cluster_list :
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, random_state = 40)
    kmeans.fit(std_data)
    wcss.append(kmeans.inertia_)

In [None]:
plt.figure(figsize=(13, 9))
plt.plot(cluster_list, wcss)
plt.title('The Elbow Graph')
plt.xlabel('Clusters')
plt.ylabel('WCSS')
plt.show()

The curve bends at 3 clusters, so 3 is a reasonable number of clusters for this dataset.
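Reading the elbow off a plot is subjective. One rough heuristic, shown here as a sketch on synthetic blobs rather than on the bank data, is to look for the largest second difference of the WCSS curve. This is an assumption-laden shortcut, not a standard scikit-learn feature, and it can mislead when the curve has no clear bend.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (not the bank data).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [5, 9]],
                  cluster_std=1.0, random_state=0)

ks = list(range(1, 8))
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Second difference of WCSS: large where the curve bends most sharply.
second_diff = np.diff(wcss, n=2)

# second_diff[i] is centered on ks[i + 1].
elbow_k = ks[int(np.argmax(second_diff)) + 1]
print("Suggested elbow:", elbow_k)
```

The result is best treated as a cross-check on the visual reading and on the silhouette scores, not as a replacement for them.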

#### Cross Check to Determine the Number of Clusters with Silhouette Scores Method

In [None]:
kmeans_values = []

# n_clusters avoids shadowing the `cluster` module imported from sklearn
for n_clusters in range(2, 12):
    labels = KMeans(n_clusters=n_clusters, random_state=40).fit_predict(std_data)
    sil_score = metrics.silhouette_score(std_data, labels, metric='euclidean')
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, sil_score))
    kmeans_values.append((n_clusters, sil_score))

The silhouette analysis agrees: the silhouette score is highest when we have 3 clusters.

***Let's build the model with K=3 based on the Elbow Curve and Silhouette Score.***

In [None]:
kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, random_state = 42)

In [None]:
y = kmeans.fit_predict(std_data)
y

In [None]:
kmeans.cluster_centers_
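The centers returned by `cluster_centers_` are expressed in standardized units because the model was fit on scaled data. A minimal sketch, using a made-up two-feature frame named `toy` in place of the notebook's data, shows how `StandardScaler.inverse_transform` maps them back to the original units for easier reading.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature data standing in for the notebook's `data`.
toy = pd.DataFrame({
    "Avg_Credit_Limit": [10000, 12000, 11000, 90000, 95000, 100000],
    "Total_visits_online": [1, 2, 1, 10, 12, 11],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# cluster_centers_ live in standardized space; inverse_transform maps them back.
centers_original = pd.DataFrame(
    scaler.inverse_transform(model.cluster_centers_), columns=toy.columns
).sort_values("Avg_Credit_Limit").reset_index(drop=True)
print(centers_original.round(1))
```

In the notebook the same idea would apply to `scaler` and `kmeans`, and the result should match the per-cluster means of the unscaled features.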

#### **Cluster Profiling with K-Means Clustering Method**

In [None]:
data['K_means_segments'] = kmeans.labels_

In [None]:
data['K_means_segments'].value_counts()

In [None]:
data.head()

In [None]:
# Mean of every feature per segment, plus the segment size
cluster_profile = data.groupby('K_means_segments').mean()
cluster_profile['No. of Customers'] = data.groupby('K_means_segments')['Avg_Credit_Limit'].count().values
cluster_profile

### **Insights**

- *One group prefers online interactions with the bank. These customers have a much higher credit limit and more credit cards (cluster 2), but this is the smallest group (32 customers).*
- *Customers who prefer in-person interactions tend to have a mid-range number of credit cards and credit limit (cluster 1), and this is the largest group (382 customers).*
- *Customers who contact the bank by phone form another segment, with the lowest credit limit and number of cards (cluster 0).*

## Hierarchical Clustering

In [None]:
silhouette_list_hierarchical = []
for n_clusters in range(2, 15):
    for linkage_method in ['single', 'average', 'complete', 'ward']:
        # Fit agglomerative clustering and score the resulting labels
        labels = AgglomerativeClustering(linkage=linkage_method, metric='euclidean', n_clusters=n_clusters).fit_predict(std_data)
        sil_score = metrics.silhouette_score(std_data, labels, metric='euclidean')
        silhouette_list_hierarchical.append((n_clusters, sil_score, linkage_method, len(set(labels))))
# Each tuple holds four values, so four column names are required
df_hierarchical = pd.DataFrame(silhouette_list_hierarchical, columns=['clusters', 'sil_score', 'linkage_method', 'n_labels'])

In [None]:
df_hierarchical.sort_values('sil_score', ascending=False)
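The notebook imports `linkage` and `dendrogram` from SciPy, and the merge tree they produce can complement the silhouette table. The following is a sketch on two synthetic groups (not the bank data) that builds the tree with complete linkage, inspects the heights of the last merges, and cuts the tree into two flat clusters with `fcluster`. Plotting is left out so the example runs without a display.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Two synthetic, well-separated groups (not the bank data).
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

# Build the merge tree with complete linkage on Euclidean distances.
Z = linkage(X, method="complete", metric="euclidean")

# Column 2 of Z holds merge heights; a large jump between the last rows
# suggests a natural place to cut the tree.
last_heights = Z[-3:, 2]
print("Heights of the last three merges:", np.round(last_heights, 2))

# Cut the tree into exactly two flat clusters (labels start at 1).
labels = fcluster(Z, t=2, criterion="maxclust")
print("Cluster sizes:", np.bincount(labels)[1:])
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the same tree when a plotting backend is available.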

***Let's build the model with K=2 based on the Silhouette Score.***

In [None]:
hierarchical_ = AgglomerativeClustering(linkage='complete', metric='euclidean', n_clusters=2).fit_predict(std_data)  # `metric` matches the call in the loop above; `affinity` is the older name

### **Cluster Profiling based on Agglomerative Clustering**

In [None]:
df['Agglomerative_Segments'] = hierarchical_
cluster_profile = df.groupby('Agglomerative_Segments').mean()
cluster_profile['No. of customers'] = df.groupby('Agglomerative_Segments')['Avg_Credit_Limit'].count().values
cluster_profile

### **Insights**

- One group prefers online interactions with the bank. These customers have a much higher credit limit (~120,000) and more credit cards (cluster 1), and the group has just 32 customers.
- Customers who prefer in-person interactions tend to have fewer credit cards and a lower credit limit (~25,000) (cluster 0), and this group has 606 customers.


## Compare K-means and Hierarchical Clusters, Perform Cluster Profiling, and Derive Insights

In [None]:
kmeans_      = KMeans(n_clusters=3, random_state=40).fit_predict(std_data)

In [None]:
hierarchical_ = AgglomerativeClustering(linkage='complete', metric='euclidean', n_clusters=2).fit_predict(std_data)  # `metric` replaces the older `affinity` argument

In [None]:
kmeansSilhouette_Score        = metrics.silhouette_score(std_data, kmeans_, metric='euclidean')

Hierarchical_Silhouette_Score = metrics.silhouette_score(std_data, hierarchical_, metric='euclidean')

In [None]:
Clustering_Silhouette_Scores  = [ ['KMeans',kmeansSilhouette_Score ], ['Hierarchical',Hierarchical_Silhouette_Score ]]

Clustering_Silhouette_Scores  = pd.DataFrame(Clustering_Silhouette_Scores, columns=['Clustering Method', 'Silhouette Score']) 
Clustering_Silhouette_Scores.sort_values(by='Silhouette Score', ascending= False)

***The Hierarchical method seems more suitable, since it has the higher silhouette score, after checking the clusters and the number of observations in each cluster.***
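Silhouette scores summarize each partition separately; it can also be useful to see how the two partitions overlap. This sketch uses synthetic blobs (not the bank data) to cross-tabulate K-means labels against agglomerative labels and to compute the adjusted Rand index as a single agreement number. The cluster counts mirror the ones chosen above, but the data and the resulting figures are illustrative only.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data: two nearby groups and one distant group (not the bank data).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [3, 0], [12, 12]],
                  cluster_std=0.7, random_state=1)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=2, linkage="complete").fit_predict(X)

# Rows: K-means segments, columns: hierarchical segments.
overlap = pd.crosstab(pd.Series(kmeans_labels, name="kmeans"),
                      pd.Series(hier_labels, name="hierarchical"))
print(overlap)

# 1.0 means identical partitions; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(kmeans_labels, hier_labels)
print(f"Adjusted Rand index: {ari:.2f}")
```

A table in which one hierarchical cluster absorbs two K-means segments would indicate that the coarser partition is merging groups the finer one keeps apart.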

The analysis points to three distinct categories of customers: in-person users, phone users, and online users. Each group has different preferences when it comes to handling bank transactions and receiving notifications.

In-person users prefer to handle transactions at a branch and would likely respond best to mail notifications and to upselling while they are at the bank location. Phone and in-person customers should also be approached to promote online banking. Phone users have the fewest credit cards and the lowest credit limit, while online users have the most credit cards and the highest available credit.

The bank should contact customers through their preferred channel. Online and phone users will probably prefer email or text notifications, while in-person users will prefer mail notifications and in-branch upselling.

Overall, it is recommended that the bank improve its online services and reach more customers via email or text messages rather than phone calls. This will likely be more effective for online and phone users, who prefer digital communications.