	
# <center><u>KMeans Clustering (Core)</u></center>
* Authored By: Eric N Valdez
* Date: 03/8/2024


## <u>Task</u>

* Your task is to perform customer segmentation using KMeans. We want to group customers into segments based on similar characteristics, which can help the company allocate marketing resources effectively.
* We will use customer age, education, years of employment, income, debt, whether they defaulted, and their debt-to-income ratio to group them into segments.

* You can download the [data here](https://assets.codingdojo.com/boomyeah2015/codingdojo/curriculum/content/chapter/cust_seg.csv). The original data is from this [data source](https://github.com/Nikhil-Adithyan/Customer-Segmentation-with-K-Means).

## <u>Imports</u>

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Preprocessing, clustering, and evaluation
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Warnings
import warnings

# Set filter warnings to ignore
warnings.filterwarnings('ignore')

## <u>Custom Settings</u>

In [2]:
# Set MatPlotLib default parameters
plt.rcParams.update({'figure.facecolor': 'white',
                          'font.weight': 'bold',
                      'patch.linewidth': 1.25,
                       'axes.facecolor': 'white',
                       'axes.edgecolor': 'black',
                       'axes.linewidth': 2,
                       'axes.titlesize': 14,
                     'axes.titleweight': 'bold',
                       'axes.labelsize': 12,
                     'axes.labelweight': 'bold',
                      'xtick.labelsize': 10,
                      'ytick.labelsize': 10,
                            'axes.grid': True,
                       'axes.grid.axis': 'y',
                           'grid.color': 'black',
                       'grid.linewidth': .5,
                           'grid.alpha': .25,
                   'scatter.edgecolors': 'black'})

## <u>Load Data</u>

In [3]:
df = pd.read_csv('Data/cust_seg.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 850 entries, 0 to 849
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       850 non-null    int64  
 1   Customer Id      850 non-null    int64  
 2   Age              850 non-null    int64  
 3   Edu              850 non-null    int64  
 4   Years Employed   850 non-null    int64  
 5   Income           850 non-null    int64  
 6   Card Debt        850 non-null    float64
 7   Other Debt       850 non-null    float64
 8   Defaulted        700 non-null    float64
 9   DebtIncomeRatio  850 non-null    float64
dtypes: float64(4), int64(6)
memory usage: 66.5 KB


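|   | Unnamed: 0 | Customer Id | Age | Edu | Years Employed | Income | Card Debt | Other Debt | Defaulted | DebtIncomeRatio |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 41 | 2 | 6 | 19 | 0.124 | 1.073 | 0.0 | 6.3 |
| 1 | 1 | 2 | 47 | 1 | 26 | 100 | 4.582 | 8.218 | 0.0 | 12.8 |
| 2 | 2 | 3 | 33 | 2 | 10 | 57 | 6.111 | 5.802 | 1.0 | 20.9 |
| 3 | 3 | 4 | 29 | 2 | 4 | 19 | 0.681 | 0.516 | 0.0 | 6.3 |
| 4 | 4 | 5 | 47 | 1 | 31 | 253 | 9.308 | 8.908 | 0.0 | 7.2 |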


In [4]:
df.describe()

|   | Unnamed: 0 | Customer Id | Age | Edu | Years Employed | Income | Card Debt | Other Debt | Defaulted | DebtIncomeRatio |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 850.0 | 850.0 | 850.0 | 850.0 | 850.0 | 850.0 | 850.0 | 850.0 | 700.0 | 850.0 |
| mean | 424.5 | 425.5 | 35.029412 | 1.710588 | 8.565882 | 46.675294 | 1.57682 | 3.078773 | 0.261429 | 10.171647 |
| std | 245.51816 | 245.51816 | 8.041432 | 0.927784 | 6.777884 | 38.543054 | 2.125843 | 3.398799 | 0.439727 | 6.719441 |
| min | 0.0 | 1.0 | 20.0 | 1.0 | 0.0 | 13.0 | 0.012 | 0.046 | 0.0 | 0.1 |
| 25% | 212.25 | 213.25 | 29.0 | 1.0 | 3.0 | 24.0 | 0.3825 | 1.04575 | 0.0 | 5.1 |
| 50% | 424.5 | 425.5 | 34.0 | 1.0 | 7.0 | 35.0 | 0.885 | 2.003 | 0.0 | 8.7 |
| 75% | 636.75 | 637.75 | 41.0 | 2.0 | 13.0 | 55.75 | 1.8985 | 3.90325 | 1.0 | 13.8 |
| max | 849.0 | 850.0 | 56.0 | 5.0 | 33.0 | 446.0 | 20.561 | 35.197 | 1.0 | 41.3 |


In [5]:
df.shape

(850, 10)

### <u>EDA and Data Cleaning</u>

In [6]:
# Looking for missing values
df.isna().sum()

Unnamed: 0           0
Customer Id          0
Age                  0
Edu                  0
Years Employed       0
Income               0
Card Debt            0
Other Debt           0
Defaulted          150
DebtIncomeRatio      0
dtype: int64

In [7]:
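# Impute missing 'Defaulted' values with the column's mode
mode = df['Defaulted'].mode()[0]
# Assign the result back; calling fillna(inplace=True) on a selected column may not update df in newer pandas
df['Defaulted'] = df['Defaulted'].fillna(mode)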
df.isna().sum()

Unnamed: 0         0
Customer Id        0
Age                0
Edu                0
Years Employed     0
Income             0
Card Debt          0
Other Debt         0
Defaulted          0
DebtIncomeRatio    0
dtype: int64

In [8]:
# Checking for duplicate rows
print('Number of Duplicated Rows', df.duplicated().sum())
print('\n')

Number of Duplicated Rows 0




In [9]:
# Checking dtypes
df.dtypes

Unnamed: 0           int64
Customer Id          int64
Age                  int64
Edu                  int64
Years Employed       int64
Income               int64
Card Debt          float64
Other Debt         float64
Defaulted          float64
DebtIncomeRatio    float64
dtype: object

In [10]:
# Rechecking info to make sure data is cleaned
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 850 entries, 0 to 849
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       850 non-null    int64  
 1   Customer Id      850 non-null    int64  
 2   Age              850 non-null    int64  
 3   Edu              850 non-null    int64  
 4   Years Employed   850 non-null    int64  
 5   Income           850 non-null    int64  
 6   Card Debt        850 non-null    float64
 7   Other Debt       850 non-null    float64
 8   Defaulted        850 non-null    float64
 9   DebtIncomeRatio  850 non-null    float64
dtypes: float64(4), int64(6)
memory usage: 66.5 KB


In [11]:
# Dropping unnecessary columns
df.drop(columns=['Unnamed: 0', 'Customer Id'], inplace=True)
df.head()

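|   | Age | Edu | Years Employed | Income | Card Debt | Other Debt | Defaulted | DebtIncomeRatio |
|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 2 | 6 | 19 | 0.124 | 1.073 | 0.0 | 6.3 |
| 1 | 47 | 1 | 26 | 100 | 4.582 | 8.218 | 0.0 | 12.8 |
| 2 | 33 | 2 | 10 | 57 | 6.111 | 5.802 | 1.0 | 20.9 |
| 3 | 29 | 2 | 4 | 19 | 0.681 | 0.516 | 0.0 | 6.3 |
| 4 | 47 | 1 | 31 | 253 | 9.308 | 8.908 | 0.0 | 7.2 |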


In [None]:
# Scale the data
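
One way to fill in this step is to standardize every remaining column with the `StandardScaler` imported earlier. KMeans relies on Euclidean distances, so an unscaled feature with a wide range, such as `Income`, would otherwise dominate. The sketch below is an assumption about how the cell might look, not the author's code: `demo_df` is a hypothetical stand-in built from three columns of the five rows printed by `df.head()`, and `scaled_df` is an illustrative name.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the cleaned `df`: three columns taken from the
# five rows shown in the df.head() output
demo_df = pd.DataFrame({
    'Age': [41, 47, 33, 29, 47],
    'Income': [19, 100, 57, 19, 253],
    'DebtIncomeRatio': [6.3, 12.8, 20.9, 6.3, 7.2],
})

# Fit the scaler and transform so each column has mean 0 and unit variance
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(demo_df),
                         columns=demo_df.columns,
                         index=demo_df.index)

print(scaled_df.round(2))
```

On the real data, the same two calls would presumably be applied to the full `df` after the unneeded columns are dropped.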

## `1. Use KMeans to create various customer segments.`
### 1. Use an Elbow Plot of inertia.
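A minimal sketch of an elbow plot, under stated assumptions: the customer file is not loaded here, so `make_blobs` generates a synthetic stand-in with three well-separated groups, and the names `X_scaled`, `ks`, and `inertias` are illustrative. The range of 2 to 10 clusters is also an assumption, not a requirement of the task.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features (assumption: 3 blobs, 2 features)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Fit one model per candidate k and record its inertia
# (the within-cluster sum of squared distances)
ks = range(2, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

# Plot inertia against k; the "elbow" is where the curve flattens
fig, ax = plt.subplots()
ax.plot(list(ks), inertias, marker='o')
ax.set(xlabel='Number of clusters (k)', ylabel='Inertia',
       title='Elbow Plot of Inertia')
plt.show()
```

Inertia generally keeps falling as k grows, so the plot is read for the point of diminishing returns rather than for a minimum.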

### 2. And a plot of Silhouette Scores.
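A sketch of the silhouette plot, again on synthetic stand-in data rather than the customer file (an assumption made so the example runs on its own). It uses the `silhouette_score` function imported earlier. `X_scaled`, `ks`, and `sil_scores` are illustrative names.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features (assumption: 3 blobs, 2 features)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# The silhouette score needs at least 2 clusters, so start at k=2
ks = range(2, 11)
sil_scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    sil_scores.append(silhouette_score(X_scaled, labels))

# Higher scores indicate tighter, better-separated clusters
fig, ax = plt.subplots()
ax.plot(list(ks), sil_scores, marker='o')
ax.set(xlabel='Number of clusters (k)', ylabel='Silhouette score',
       title='Silhouette Scores by k')
plt.show()
```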

### 3. Choose a K based on the results.
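One simple heuristic, shown here as a sketch rather than a rule, is to take the k with the highest silhouette score and then check it against the elbow plot and against how interpretable the segments are. The data is a synthetic stand-in, and `best_k`, `final_model`, and `cluster_sizes` are hypothetical names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features (assumption: 3 blobs, 2 features)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Score each candidate k
sil_scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    sil_scores[k] = silhouette_score(X_scaled, labels)

# Heuristic: pick the k with the highest silhouette score
best_k = max(sil_scores, key=sil_scores.get)

# Refit a final model with the chosen k and inspect the cluster sizes
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X_scaled)
cluster_sizes = np.bincount(final_model.labels_)

print('Chosen k:', best_k)
print('Cluster sizes:', cluster_sizes)
```

Checking the cluster sizes helps catch a high-scoring k that puts only a few customers in one cluster, which would be of little use for marketing.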

## `2. Analyze the clusters you made in Part 1.`  
### 1. Create analytical visualizations that explore statistics for each feature for each cluster.
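A possible starting point for these visualizations is to attach the cluster labels to the unscaled data, group by cluster, and plot the mean of each feature. The sketch below assumes made-up customers in `demo_df` (two loosely defined groups drawn at random) and k=2 purely for illustration. On the real data, the labels would come from the model fitted in Part 1.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customers with made-up values, drawn from two loose groups
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({
    'Age': np.concatenate([rng.normal(28, 3, 50), rng.normal(45, 4, 50)]),
    'Income': np.concatenate([rng.normal(25, 5, 50), rng.normal(90, 15, 50)]),
    'DebtIncomeRatio': np.concatenate([rng.normal(12, 3, 50), rng.normal(7, 2, 50)]),
})

# Cluster on scaled features, but store the labels on the unscaled frame
X_scaled = StandardScaler().fit_transform(demo_df)
demo_df['cluster'] = KMeans(n_clusters=2, n_init=10,
                            random_state=42).fit_predict(X_scaled)

# Mean of each feature per cluster, in original units for readability
cluster_means = demo_df.groupby('cluster').mean()
print(cluster_means.round(1))

# One bar chart per feature, comparing the clusters side by side
fig, axes = plt.subplots(1, len(cluster_means.columns), figsize=(12, 4))
for ax, col in zip(axes, cluster_means.columns):
    ax.bar(cluster_means.index.astype(str), cluster_means[col])
    ax.set(title=f'Mean {col} by Cluster', xlabel='Cluster')
fig.tight_layout()
plt.show()
```

Reporting means in the original units, rather than scaled values, tends to make it easier to describe the people behind each cluster.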

### 2. `Write a description of each cluster based on the visualizations you created.`
  1. Do more than describe the numbers; try to see beyond the numbers and describe the people represented by each cluster.
  2. Include at least one insight for each cluster.

## `3. Create one or two recommendations for your stakeholders (the credit card company) regarding how they should market credit cards differently or which cards they should market to each cluster based on your data and insights.`