# Bank Customer Segmentation and Personalization
To segment banking customers for targeted marketing and personalized services, we begin with data exploration and preprocessing.

## Data Exploration & Preprocessing
### Examine the Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

# Load the dataset
df = pd.read_csv('bank_transactions.csv')

# Check the shape of the dataset (rows, columns)
print("Dataset shape (rows, columns):\n", df.shape)

# Get dataset information (column names, non-null counts, data types)
print("Dataset Info:\n")
df.info()

# Display first 5 rows
print("\nFirst 5 rows:\n", df.head())

Dataset shape (rows, columns):
 (1048567, 9)
Dataset Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048567 entries, 0 to 1048566
Data columns (total 9 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   TransactionID            1048567 non-null  object 
 1   CustomerID               1048567 non-null  object 
 2   CustomerDOB              1045170 non-null  object 
 3   CustGender               1047467 non-null  object 
 4   CustLocation             1048416 non-null  object 
 5   CustAccountBalance       1046198 non-null  float64
 6   TransactionDate          1048567 non-null  object 
 7   TransactionTime          1048567 non-null  int64  
 8   TransactionAmount (INR)  1048567 non-null  float64
dtypes: float64(2), int64(1), object(6)
memory usage: 72.0+ MB

First 5 rows:
   TransactionID CustomerID CustomerDOB CustGender CustLocation  \
0            T1   C5841053     10/1/94          F   JAMSHEDPUR   
1

Using pandas, we can compute summary statistics to understand the distributions of the numerical features.

In [2]:
# Summary statistics for numerical columns
print("\nSummary Statistics:\n", df.describe())


Summary Statistics:
        CustAccountBalance  TransactionTime  TransactionAmount (INR)
count        1.046198e+06     1.048567e+06             1.048567e+06
mean         1.154035e+05     1.570875e+05             1.574335e+03
std          8.464854e+05     5.126185e+04             6.574743e+03
min          0.000000e+00     0.000000e+00             0.000000e+00
25%          4.721760e+03     1.240300e+05             1.610000e+02
50%          1.679218e+04     1.642260e+05             4.590300e+02
75%          5.765736e+04     2.000100e+05             1.200000e+03
max          1.150355e+08     2.359590e+05             1.560035e+06


### Data Cleaning
#### Checking missing values
Identify missing values.

In [4]:
# Total missing values per column
print("Missing values per column:\n", df.isnull().sum())

Missing values per column:
 TransactionID                 0
CustomerID                    0
CustomerDOB                3397
CustGender                 1100
CustLocation                151
CustAccountBalance         2369
TransactionDate               0
TransactionTime               0
TransactionAmount (INR)       0
dtype: int64


#### Imputation Strategy
- `CustomerDOB`: Convert to age, then impute missing values with the median age.
- `CustGender`: Fill missing values with the most common gender.
- `CustLocation`: Fill missing values with "Unknown".
- `CustAccountBalance`: Fill missing values with the median.
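
The age step hinges on how two-digit years such as `10/1/94` are parsed. As a minimal sketch on toy data, and assuming the dates are day/month/two-digit-year (the notebook does not state the format), the following parses with an explicit format, moves birth dates that land in the future back by a century, and imputes the median age. The reference date `ref` and the toy values are hypothetical.

```python
import pandas as pd

# Toy stand-in for df['CustomerDOB']; values are made up for illustration.
dob = pd.Series(["10/1/94", "4/4/57", None, "26/11/96"])

# Assumption: day/month/two-digit-year. Unparseable entries become NaT.
parsed = pd.to_datetime(dob, format="%d/%m/%y", errors="coerce")

# Hypothetical fixed reference date, so the result does not change from day to day.
ref = pd.Timestamp("2016-12-31")

# '%y' reads 00-68 as 20xx, so '57' becomes 2057; move such dates back 100 years.
in_future = parsed > ref
parsed = parsed.where(~in_future, parsed - pd.DateOffset(years=100))

# Approximate age in whole years, then fill the gap with the median age.
age = (ref - parsed).dt.days // 365
age = age.fillna(age.median())
print(age.tolist())
```

Checking that no age is negative or implausibly large after this step is a cheap way to catch a wrong format assumption.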

In [5]:
from datetime import datetime

# Convert 'CustomerDOB' to datetime
df['CustomerDOB'] = pd.to_datetime(df['CustomerDOB'])

# Calculate age
df['Age'] = (pd.to_datetime('today') - df['CustomerDOB']).dt.days // 365

# Drop the original 'CustomerDOB' column
df.drop(columns=['CustomerDOB'], inplace=True)

# Fill missing age values with median age
# (assign the result back; chained inplace fillna is deprecated in pandas)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing gender with the mode and missing location with 'Unknown'
df['CustGender'] = df['CustGender'].fillna(df['CustGender'].mode()[0])
df['CustLocation'] = df['CustLocation'].fillna('Unknown')

# Fill missing numerical values with median
df['CustAccountBalance'] = df['CustAccountBalance'].fillna(df['CustAccountBalance'].median())

After imputing with `fillna()`, re-running the missing-value check confirms that no gaps remain.

In [6]:
# Total missing values per column
print("Missing values per column:\n", df.isnull().sum())

Missing values per column:
 TransactionID              0
CustomerID                 0
CustGender                 0
CustLocation               0
CustAccountBalance         0
TransactionDate            0
TransactionTime            0
TransactionAmount (INR)    0
Age                        0
dtype: int64


### Feature Engineering
#### Extracting Features from Date & Time
- Convert `TransactionDate` to recency (days since last transaction).
- Convert `TransactionTime` to hour of transaction for behavior analysis.
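
The summary statistics show `TransactionTime` peaking at 235959, which is consistent with an HHMMSS integer rather than a count of seconds (a day has only 86,400 seconds). As a minimal sketch under that assumption, with made-up values, the components can be separated with integer arithmetic:

```python
import pandas as pd

# Toy values; assumption: TransactionTime is an integer in HHMMSS form.
t = pd.Series([143207, 235959, 0, 90515])

hour = t // 10000          # leading digits are the hour
minute = (t // 100) % 100  # middle two digits are the minute
second = t % 100           # last two digits are the second

print(pd.DataFrame({"hour": hour, "minute": minute, "second": second}))
```

If the column turned out to hold seconds since midnight instead, the hour would be `t // 3600`; the maximum value is the quickest way to tell the two encodings apart.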

In [8]:
# Convert TransactionDate to datetime
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'])

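# Days elapsed since each transaction (aggregated per customer into recency below)
df['DaysSinceLastTransaction'] = (datetime.now() - df['TransactionDate']).dt.days

# Extract the hour from TransactionTime, which is stored as an HHMMSS integer
# (its maximum in the summary statistics is 235959, so it is not a seconds count)
df['TransactionHour'] = df['TransactionTime'] // 10000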



#### Creating RFM Features (Recency, Frequency, Monetary)
- Recency: Days since last transaction.
- Frequency: Number of transactions per customer.
- Monetary: Total transaction amount per customer.

In [9]:
# Group by CustomerID
rfm = df.groupby('CustomerID').agg({
    'DaysSinceLastTransaction': 'min',  # Recency (most recent transaction)
    'TransactionID': 'count',  # Frequency (number of transactions)
    'TransactionAmount (INR)': 'sum'  # Monetary (total spending)
}).reset_index()

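# Rename columns; renaming recency avoids a clash with the per-transaction
# 'DaysSinceLastTransaction' column when merging back
rfm.rename(columns={
    'DaysSinceLastTransaction': 'Recency',
    'TransactionID': 'Frequency',
    'TransactionAmount (INR)': 'Monetary'
}, inplace=True)

# Merge back with original dataset
df = df.merge(rfm, on='CustomerID', how='left')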

##### RFM Usage
- Recency: Customers who have recently transacted are likely to be more loyal.
- Frequency: Customers who have transacted more are more valuable.
- Monetary: Customers who have spent more are more profitable.

Using these features, customers can be divided into different groups and personalized marketing strategies can be designed for each group.

#### Encoding & Normalization
- One-Hot Encode `CustGender` and `CustLocation`.
- Scale Numerical Features (Account Balance, RFM Metrics, Age).

In [6]:
from sklearn.preprocessing import StandardScaler

# One-hot encode categorical variables
df = pd.get_dummies(df, columns=['CustGender', 'CustLocation'], drop_first=True)

# Normalize numerical features
scaler = StandardScaler()
num_cols = ['CustAccountBalance', 'Age', 'DaysSinceLastTransaction', 'Frequency', 'Monetary']
df[num_cols] = scaler.fit_transform(df[num_cols])
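
The notebook does not report how many distinct values `CustLocation` holds. If the count is large, one-hot encoding every location yields a very wide frame. One option, which is not part of the original workflow, is to keep only the most frequent locations and pool the rest. The sketch below uses toy data, and the cutoff `top_n` and the label `"Other"` are hypothetical.

```python
import pandas as pd

# Toy location column; values are made up for illustration.
toy = pd.DataFrame({
    "CustLocation": ["MUMBAI", "MUMBAI", "DELHI", "DELHI", "DELHI",
                     "JAMSHEDPUR", "JHAJJAR"]
})

top_n = 2  # hypothetical cutoff; choose it from the real value counts
top = toy["CustLocation"].value_counts().nlargest(top_n).index

# Keep frequent locations as they are and pool everything else into "Other".
toy["CustLocation"] = toy["CustLocation"].where(
    toy["CustLocation"].isin(top), "Other"
)

encoded = pd.get_dummies(toy, columns=["CustLocation"])
print(encoded.columns.tolist())
```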


#### Normalizing Numerical Features

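Standardize the numerical features so that each has zero mean and unit variance. This keeps large-valued columns such as account balance from dominating features on smaller scales such as age.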
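
As a small worked check on toy numbers, `StandardScaler` computes `z = (x - mean) / std` per column, using the population standard deviation. The array `X` below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy columns: account balance and age (made-up values).
X = np.array([[1000.0, 25.0],
              [5000.0, 40.0],
              [9000.0, 55.0]])

scaler = StandardScaler()
Z = scaler.fit_transform(X)

# The same result by hand; NumPy's std defaults to the population form (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(Z, manual))          # both approaches agree
print(Z.mean(axis=0), Z.std(axis=0))   # approximately 0 and 1 per column
```

Fitting the scaler once and reusing the fitted object on new data keeps later transactions on the same scale as the training set.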