Skip to content

Customer Segmentation of Credit Bank data using K Mean Clustering to Study Categorize Credit Loan Applicants

Notifications You must be signed in to change notification settings

basilghauri/Customer-Segmentation-Using-Credit-Bank-Data

Repository files navigation

Customer Segmentation Using Credit Bank Data

This project comprises of using user data from a german bank to conduct customer segmentation of loans given to people. By conducting customer segmentation we are able to figure out average characteristics of customers issued loans in various categories.

Background on

Context

The original dataset contains 1000 entries with 20 categorial/symbolic attributes prepared by Prof. Hofmann. In this dataset, each entry represents a person who takes a credit by a bank. Each person is classified as good or bad credit risks according to the set of attributes. The link to the original dataset can be found below.

Content

It is almost impossible to understand the original dataset due to its complicated system of categories and symbols. Thus, I wrote a small Python script to convert it into a readable CSV file. Several columns are simply ignored, because in my opinion either they are not important or their descriptions are obscure. The selected attributes are:

  • Age (numeric)
  • Sex (text: male, female)
  • Job (numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)
  • Housing (text: own, rent, or free)
  • Saving accounts (text - little, moderate, quite rich, rich)
  • Checking account (numeric, in DM - Deutsch Mark)
  • Credit amount (numeric, in DM)
  • Duration (numeric, in month)
  • Purpose (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)
# Read data as csv file
data=pd.read_csv("german_credit_data.csv")

Make scatterplots of datapoints

# Function to make scatterplots
def scatters(data, h=None, pal=None):
    fig, (a, b, c) = plt.subplots(3,1, figsize=(8,14))
    sns.scatterplot(x="Credit amount",y="Duration", hue=h, palette=pal, data=data, ax=a)
    sns.scatterplot(x="Age",y="Credit amount", hue=h, palette=pal, data=data, ax=b)
    sns.scatterplot(x="Age",y="Duration", hue=h, palette=pal, data=data, ax=c)
plt.tight_layout()
# Based on Sex
scatters(data, h="Sex")

# Based on Job
scatters(data, h="Job")

Correlation Heatmap

corrMatrix=data.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()

Regression Plots to further look into correlations

# Reggression Plot Between Credit Amount and Duration based on Sex
sns.lmplot(x="Credit amount",y="Duration", hue="Sex", data=data, palette="Set2", aspect=2)
plt.show()
# Reggression Plot Based on Jobs
sns.lmplot(x="Credit amount",y="Duration", hue="Job", data=data, palette="Set1", aspect=2)
plt.show()

Barplots to Look into Credits Issued

# Barplots to summarize overall
byjob = data.groupby("Job")["Age"].count().rename("Number").reset_index()
byjob.sort_values(by=["Number"], ascending=False, inplace=True)

plt.figure(figsize=(10,7))
bar1 = sns.barplot(x="Job",y="Number",data=byjob)
bar1.set_xticklabels(bar1.get_xticklabels())
plt.ylabel("Number of Credits Issued")
plt.tight_layout()
bypurpose = data.groupby("Purpose")["Age"].count().rename("Number").reset_index()
bypurpose.sort_values(by=["Number"], ascending=False, inplace=True)
plt.figure(figsize=(10,7))
bar2 = sns.barplot(x="Purpose",y="Number",data=bypurpose)
bar2.set_xticklabels(bar2.get_xticklabels(),rotation=90)
plt.ylabel("Number of Credits Issued")
plt.tight_layout()

Boxplot to look at Outliers and Overall Spread

# Function to make boxplots
def boxplt(x,y,h,r=45):
    fig, ax = plt.subplots(figsize=(10,7))
    box = sns.boxplot(x=x,y=y, hue=h, data=data)
    box.set_xticklabels(box.get_xticklabels(), rotation=90)
    fig.subplots_adjust(bottom=0.4)
    plt.tight_layout()
# Boxplots based on Duration and Purpose
boxplt("Purpose","Duration","Sex")
# Boxplots Based on Credit amount and Purpose
boxplt("Purpose","Credit amount","Sex")

Data Profiling for Overall View

data.profile_report()
By looking at the above data report we can see that three variables which are Age, Credit Amount and Duration have a Right Skew.

Log Transform Variables which are Right Skewed

def distributions(df): # This function is to plot histograms of variables that are right skewed
    fig, (a, b, c) = plt.subplots(3,1, figsize=(10,7))
    sns.distplot(df["Age"], ax=a,color="r")
    sns.distplot(df["Credit amount"], ax=b,color="g")
    sns.distplot(df["Duration"], ax=c,color="b")
    plt.tight_layout()
distributions(np.log(data[["Age","Credit amount","Duration"]]))
#Dataset containing Log of these three variables
data2=data[["Age","Credit amount","Duration"]]

data2_log=np.log(data2)

# Scale data
scale=StandardScaler()
data_sc=scale.fit_transform(data2_log)

How Many Clusters To Seed

clusters_range = range(2,10)
random_range = range(0,20)
result=[]
for c in clusters_range:
    for r in random_range:
        clusterer = KMeans(n_clusters=c, random_state=r)
        cluster_lab = clusterer.fit_predict(data_sc)
        silhouette_avg = silhouette_score(data_sc, cluster_lab)
        result.append([c,r,silhouette_avg])

result = pd.DataFrame(result, columns=["n_clusters","seed","silhouette_score"])
pivot_result = pd.pivot_table(result, index="n_clusters", columns="seed",values="silhouette_score")

plt.figure(figsize=(12,7))
sns.heatmap(pivot_result, annot=True, linewidths=.5, fmt='.3f', cmap=sns.cm.rocket_r)
plt.tight_layout()

By looking at the heatmap above we can determine that two or three clusters will be ideal in this scenario.

KMeans Clustering

kmean=KMeans(n_clusters=3, random_state=1).fit(data_sc)
labels=pd.DataFrame(kmean.labels_)
data_clustered=clustered=data2.assign(Cluster=labels)

Show Clusters as Scatterplots

scatters(data_clustered,"Cluster")

Clusters Divided in Groups

grouped_cluster=data_clustered.groupby(["Cluster"]).mean().round(2)
print(grouped_cluster)

Customer Groups

About

Customer Segmentation of Credit Bank data using K Mean Clustering to Study Categorize Credit Loan Applicants

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published