# Multi-Feature Banking Adoption and Customer Churn Prediction

## Project Overview

**Domain:** Banking & FinTech Analytics

**Objective:** This project analyzes customer demographics, multi-product adoption behavior, digital engagement metrics, and satisfaction levels to understand patterns that drive customer retention versus churn in a European banking context.

**Business Problem:** Modern banks offer multiple digital and traditional products (savings accounts, credit cards, loans, investment products, loyalty programs). Understanding which combination of product adoptions and engagement patterns drive customer loyalty versus churn is critical for:
- Designing better product bundles
- Improving customer onboarding experiences
- Implementing data-driven retention strategies
- Reducing customer acquisition costs by improving retention

**Dataset:** Bank Customer Churn Dataset containing about 10,000 customer records from a European bank. The raw file loaded in Step 1 has 14 columns covering demographics, account details, product usage, and churn status.

## Step 1: Data Loading and Initial Overview

**Goal:** Import the dataset, examine its structure, and understand the basic characteristics of the data.

In [1]:
# Import pandas and NumPy

import pandas as pd
import numpy as np

# Display settings for pandas
pd.set_option('display.max_columns', None)               # show all columns
pd.set_option('display.float_format', '{:.2f}'.format)   # format floats to 2 decimal places

print("Libraries imported successfully!")

Libraries imported successfully!


### 1.1 Data Source and Loading

**Data Source:** Maven Analytics - Bank Customer Churn Dataset  
**Source Link:** https://mavenanalytics.io/data-playground/bank-customer-churn

**Dataset Description:**  
This dataset contains account information for 10,000 customers at a European bank, including:
- Customer demographics (age, gender, geography)
- Account details (credit score, balance, tenure)
- Product adoption metrics (number of products, credit card, active membership)
- Churn status (whether customer exited)

**Loading Objectives:**  
We will load the CSV file and perform initial inspection to understand:
- Total number of records (rows) - represents individual customers
- Total number of features (columns) - represents customer attributes
- Column names and their data types
- Overall data structure and quality

In [3]:
# Load the dataset
df = pd.read_csv('../data/Bank_Churn_Messy.csv', encoding='latin-1')

# Display dataset dimensions
print("Dataset loaded successfully!")
print("="*70)
print(f"Dataset Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")

Dataset loaded successfully!
Dataset Shape: 10,001 rows × 14 columns


### 1.2 First Look at the Data

We will examine the first few rows of the dataset to understand:
- The structure and format of the data
- Actual values in each column
- Data types (numbers, text, etc.)
- Any obvious data quality issues

The `df.head()` method displays the first 5 rows by default, giving a quick preview of the dataset. Pass an integer `n`, as in `df.head(10)`, to show the first `n` rows instead.

In [3]:
# Display the first 10 rows of the dataset
print('First 10 rows of the dataset:')
df.head(10)

First 10 rows of the dataset:


| | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | Tenure.1 | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15634602 | Hargrave | 619 | FRA | Female | 42.0 | 2 | 0.0 | 1 | Yes | 2 | Yes | 101348.88 | 1 |
| 1 | 15647311 | Hill | 608 | Spain | Female | 41.0 | 1 | 0.0 | 1 | Yes | 2 | Yes | 112542.58 | 1 |
| 2 | 15619304 | Onio | 502 | French | Female | 42.0 | 8 | 83807.86 | 1 | Yes | 1 | Yes | 113931.57 | 0 |
| 3 | 15701354 | Boni | 699 | FRA | Female | 39.0 | 1 | 159660.8 | 3 | No | 8 | No | 93826.63 | 1 |
| 4 | 15737888 | Mitchell | 850 | Spain | Female | 43.0 | 2 | 0.0 | 2 | No | 1 | No | 79084.1 | 0 |
| 5 | 15574012 | Chu | 645 | Spain | Male | 44.0 | 8 | 125510.82 | 1 | Yes | 2 | Yes | 149756.71 | 0 |
| 6 | 15592531 | Bartlett | 822 | France | Male | 50.0 | 7 | 113755.78 | 2 | No | 8 | No | 10062.8 | 1 |
| 7 | 15656148 | Obinna | 376 | Germany | Female | 29.0 | 4 | 0.0 | 2 | Yes | 7 | Yes | 119346.88 | 0 |
| 8 | 15792365 | He | 501 | French | Male | 44.0 | 4 | 115046.74 | 4 | No | 4 | No | 74940.5 | 1 |
| 9 | 15592389 | H? | 684 | France | Male | 27.0 | 2 | 142051.07 | 2 | Yes | 4 | Yes | 71725.73 | 0 |


### 1.3 Last Rows of the Dataset

Checking the last few rows helps us:
- Verify the entire file loaded completely
- Check if data patterns change at the end
- Ensure no corruption at the file's end

The `df.tail(n)` function displays the last n rows.

In [4]:
# Display the last 10 rows of the dataset
print("Last 10 rows of the dataset:")
df.tail(10)

Last 10 rows of the dataset:


| | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | Tenure.1 | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9991 | 15769959 | Ajuluchukwu | 597 | France | Female | 53.0 | 4 | 35016.6 | 1 | No | 3 | No | 69384.71 | 0 |
| 9992 | 15657105 | Chukwualuka | 726 | Spain | Male | 36.0 | 2 | 88381.21 | 1 | No | 4 | No | 195192.4 | 1 |
| 9993 | 15569266 | Rahman | 644 | FRA | Male | 28.0 | 7 | 0.0 | 1 | No | 2 | No | 29179.52 | 0 |
| 9994 | 15719294 | Wood | 800 | France | Female | 29.0 | 2 | 155060.41 | 1 | No | 7 | No | 167773.55 | 0 |
| 9995 | 15606229 | Obijiaku | 771 | France | Male | 39.0 | 5 | 0.0 | 2 | No | 2 | No | 96270.64 | 0 |
| 9996 | 15569892 | Johnstone | 516 | French | Male | 35.0 | 10 | 0.0 | 2 | No | 5 | No | 101699.77 | 0 |
| 9997 | 15584532 | Liu | 709 | FRA | Female | 36.0 | 7 | 57369.61 | 1 | Yes | 10 | Yes | 42085.58 | 0 |
| 9998 | 15682355 | Sabbatini | 772 | Germany | Male | 42.0 | 3 | 0.0 | 1 | Yes | 7 | Yes | 92888.52 | 1 |
| 9999 | 15628319 | Walker | 792 | French | Female | 28.0 | 4 | 75075.31 | 2 | No | 3 | No | 38190.78 | 1 |
| 10000 | 15628319 | Walker | 792 | French | Female | 28.0 | 4 | 130142.79 | 1 | No | 4 | No | 38190.78 | 0 |


### 1.4 Dataset Information and Data Types

Understanding data types is crucial for:
- Identifying which columns need type conversion
- Planning appropriate analysis methods
- Detecting potential data quality issues
- Determining memory usage and optimization needs

The `info()` method provides:
- Total number of entries (rows)
- Column names and data types
- Number of non-null values (identifies missing data)
- Memory usage

In [5]:
# Display dataset information
print("Dataset Information:")
print("="*70)
df.info()

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001 entries, 0 to 10000
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerId       10001 non-null  int64  
 1   Surname          9998 non-null   object 
 2   CreditScore      10001 non-null  int64  
 3   Geography        10001 non-null  object 
 4   Gender           10001 non-null  object 
 5   Age              9998 non-null   float64
 6   Tenure           10001 non-null  int64  
 7   Balance          10001 non-null  object 
 8   NumOfProducts    10001 non-null  int64  
 9   HasCrCard        10001 non-null  object 
 10  Tenure.1         10001 non-null  int64  
 11  IsActiveMember   10001 non-null  object 
 12  EstimatedSalary  10001 non-null  object 
 13  Exited           10001 non-null  int64  
dtypes: float64(1), int64(6), object(7)
memory usage: 1.1+ MB
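
The `+` in `memory usage: 1.1+ MB` marks the figure as a lower bound: by default pandas does not measure the Python strings held in `object` columns. The sketch below shows the difference on a small hand-built frame. The frame is an illustrative stand-in using a few values from the previews, not the real file, and the columns are forced to `object` dtype to mirror the `info()` output above.

```python
import pandas as pd

# Illustrative sample only: a few values copied from the previews above.
# dtype="object" mirrors how df.info() reports the text columns.
sample = pd.DataFrame({
    "Surname": pd.Series(["Hargrave", "Hill", "Onio"], dtype="object"),
    "Geography": pd.Series(["FRA", "Spain", "French"], dtype="object"),
    "CreditScore": [619, 608, 502],
})

# Shallow estimate: object columns are counted as pointers only
shallow = sample.memory_usage(deep=False).sum()

# Deep estimate: also measures the string objects themselves
deep = sample.memory_usage(deep=True).sum()

print(f"Shallow estimate: {shallow} bytes")
print(f"Deep estimate:    {deep} bytes")
```

On the full dataset, `df.info(memory_usage="deep")` reports the deep figure directly.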


### 1.5 Column Names and Descriptions

Below is a brief description of each column in the dataset:

**Customer Information:**
- **CustomerId**: Unique identifier for each customer
- **Surname**: Customer's last name

**Demographics:**
- **CreditScore**: Credit score of the customer
- **Geography**: Country where customer resides
- **Gender**: Customer's gender
- **Age**: Customer's age in years

**Account Details:**
- **Tenure**: Number of years customer has been with the bank
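- **Tenure.1**: Second tenure column; pandas added the `.1` suffix because the file contains two columns named Tenure. It looks like a duplicate, but its values do not match Tenure row by row in the previews above, so it needs checking before it is dropped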
- **Balance**: Current account balance
- **NumOfProducts**: Number of bank products the customer uses
- **HasCrCard**: Whether customer has a credit card
- **IsActiveMember**: Whether customer is an active member
- **EstimatedSalary**: Customer's estimated salary

**Target Variable:**
- **Exited**: Whether the customer has left the bank (1 = Yes, 0 = No)
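
Before treating `Tenure.1` as a duplicate, it is worth comparing it with `Tenure` row by row. The sketch below does this on the ten rows shown in the `df.head(10)` preview; it is a minimal check on a sample, so the pattern it suggests (that `Tenure.1` may be `Tenure` offset by one row) is a hypothesis to confirm on the full `df`, not an established fact.

```python
import pandas as pd

# Illustrative sample: Tenure / Tenure.1 values copied from the first 10 preview rows
sample = pd.DataFrame({
    "Tenure":   [2, 1, 8, 1, 2, 8, 7, 4, 4, 2],
    "Tenure.1": [2, 2, 1, 8, 1, 2, 8, 7, 4, 4],
})

# Share of rows where the two columns agree exactly
match_rate = (sample["Tenure"] == sample["Tenure.1"]).mean()

# Share of rows where Tenure.1 equals the previous row's Tenure
# (the first row is skipped because it has no predecessor)
lagged = sample["Tenure"].shift(1)
lag_match_rate = (sample["Tenure.1"].iloc[1:] == lagged.iloc[1:]).mean()

print(f"Row-wise match: {match_rate:.0%}")
print(f"Match against previous row's Tenure: {lag_match_rate:.0%}")
```

Running the same two comparisons on `df["Tenure"]` and `df["Tenure.1"]` shows whether the pattern holds across the whole file.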

### 1.6 Statistical Summary of Numerical Features

The `describe()` function provides key statistics for all numerical columns:

**Statistical Measures:**
- **count**: Number of non-null values
- **mean**: Average value
- **std**: Standard deviation (variability in the data)
- **min**: Minimum value
- **max**: Maximum value
- **25%, 50%, 75%**: Quartiles (Q1, median, Q3)

**This helps to identify:**
- Data distribution and spread
- Potential outliers (unusually high or low values)
- Missing data (if count is less than total rows)
- Columns stored incorrectly as text (these won't appear in the summary)

In [6]:
# Statistical summary of numerical columns
print("Statistical Summary of Numerical Features:")
print("="*70)
df.describe()

Statistical Summary of Numerical Features:


| | CustomerId | CreditScore | Age | Tenure | NumOfProducts | Tenure.1 | Exited |
|---|---|---|---|---|---|---|---|
| count | 10001.0 | 10001.0 | 9998.0 | 10001.0 | 10001.0 | 10001.0 | 10001.0 |
| mean | 15690934.31 | 650.54 | 38.92 | 5.01 | 1.53 | 5.01 | 0.2 |
| std | 71935.31 | 96.66 | 10.49 | 2.89 | 0.58 | 2.89 | 0.4 |
| min | 15565701.0 | 350.0 | 18.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 15628523.0 | 584.0 | 32.0 | 3.0 | 1.0 | 3.0 | 0.0 |
| 50% | 15690733.0 | 652.0 | 37.0 | 5.0 | 1.0 | 5.0 | 0.0 |
| 75% | 15753229.0 | 718.0 | 44.0 | 7.0 | 2.0 | 7.0 | 0.0 |
| max | 15815690.0 | 850.0 | 92.0 | 10.0 | 4.0 | 10.0 | 1.0 |
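
Columns stored as text, such as Balance and EstimatedSalary, are absent from the default `describe()` output. The sketch below shows one way to summarize them and to test whether a text column actually holds numbers. The sample frame is hypothetical: the Balance strings are stand-ins, and the real file's formatting may differ (for example, it may include currency symbols).

```python
import pandas as pd

# Illustrative sample: Balance is text here, matching the dtype reported by df.info().
# The exact string formatting is an assumption, not taken from the real file.
sample = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Balance": ["0.0", "83807.86", "159660.8"],
    "Geography": ["FRA", "Spain", "French"],
})

# Default describe(): numeric columns only
numeric_summary = sample.describe()

# Non-numeric columns: count, unique, top, freq
text_summary = sample.describe(exclude="number")

# Share of each non-numeric column that parses cleanly as a number;
# values that fail to parse become NaN under errors="coerce"
parse_rate = {
    col: pd.to_numeric(sample[col], errors="coerce").notna().mean()
    for col in sample.select_dtypes(exclude="number").columns
}
print(parse_rate)
```

A parse rate near 1.0 suggests a numeric column stored as text; a rate near 0.0 suggests a genuinely categorical column.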


### 1.7 Data Quality Checks

**Missing Values:**
- Missing data can affect statistical calculations and model performance
- We need to identify which columns have missing values
- This helps us plan appropriate handling strategies

**Duplicate Rows:**
- Duplicate records can distort analysis results
- They may indicate data entry errors
- Duplicates should be removed to ensure data integrity

`isnull().sum()` counts the missing values in each column, and `duplicated().sum()` counts the rows that are exact copies of an earlier row.

In [7]:
# Check for missing values in all columns
print("Missing Values Summary:")
missing_values = df.isnull().sum()
print(missing_values)
print("="*70)

# Columns with missing values
if missing_values.sum() > 0:
    print("\nColumns with missing values:")
    print(missing_values[missing_values > 0])
else:
    print("\nNo missing values found in the dataset.")

Missing Values Summary:
CustomerId         0
Surname            3
CreditScore        0
Geography          0
Gender             0
Age                3
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
Tenure.1           0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

Columns with missing values:
Surname    3
Age        3
dtype: int64
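
Raw counts are easier to judge as a share of all rows: 3 missing values out of 10,001 rows is about 0.03%. A minimal sketch of that calculation follows, using a tiny made-up frame rather than the real data, so its percentages are illustrative only.

```python
import pandas as pd
import numpy as np

# Illustrative sample only: a tiny frame with gaps in Surname and Age
sample = pd.DataFrame({
    "Surname": ["Hargrave", None, "Onio", "Boni"],
    "Age": [42.0, 41.0, np.nan, 39.0],
    "Exited": [1, 1, 0, 1],
})

missing_report = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": sample.isnull().mean() * 100,  # mean of booleans = share missing
})

# Keep only the columns that actually have gaps
print(missing_report[missing_report["missing_count"] > 0])
```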


In [8]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

if duplicates > 0:
    print(f"\nWarning: {duplicates} duplicate row(s) detected!")
else:
    print("\nNo duplicate rows found in the dataset.")

Number of duplicate rows: 0

No duplicate rows found in the dataset.
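
`df.duplicated()` only flags rows that match on every column. The last two rows of the `df.tail(10)` preview share CustomerId 15628319 but differ in Balance, NumOfProducts, Tenure.1, and Exited, so they pass this check. The sketch below checks the identifier alone, using those two rows plus one other row copied from the preview.

```python
import pandas as pd

# Illustrative sample: rows 9998-10000 from the df.tail(10) preview (selected columns)
sample = pd.DataFrame({
    "CustomerId": [15682355, 15628319, 15628319],
    "Surname": ["Sabbatini", "Walker", "Walker"],
    "Balance": ["0.0", "75075.31", "130142.79"],
    "NumOfProducts": [1, 2, 1],
    "Exited": [1, 1, 0],
})

# Fully identical rows: none, because Balance, NumOfProducts and Exited differ
full_dupes = sample.duplicated().sum()

# Rows sharing a CustomerId; keep=False flags every member of a duplicate group
id_dupes = sample[sample.duplicated(subset="CustomerId", keep=False)]

print(f"Fully duplicated rows: {full_dupes}")
print(f"Rows sharing a CustomerId: {len(id_dupes)}")
print(id_dupes)
```

Applied to the full `df`, the same `subset="CustomerId"` check shows whether other customers appear more than once.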


### 1.8 Summary of Findings

**Dataset Overview:**
- Total Rows: 10,001 (the last two rows share CustomerId 15628319, so the number of distinct customers may be lower)
- Total Columns: 14 (including a second tenure column, Tenure.1)
- Target Variable: Exited (Churn rate: 20.4%)

**Customer Profile:**
- Average age: 39 years (range: 18-92)
- Average tenure: 5 years
- Average credit score: 651 (range: 350-850)
- Average products: 1.53 (median: 1, so at least half of customers use a single product)

**Data Quality Issues Identified:**

1. **Missing Values (6 total):**
   - Surname: 3 missing (0.03%)
   - Age: 3 missing (0.03%)

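2. **Suspect Duplicate Column:**
   - Tenure and Tenure.1 have identical summary statistics, but their values differ row by row in the previews (Tenure.1 appears to be offset by one row), so the column should be verified before it is dropped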

3. **Data Type Issues:**
   - Balance: Stored as text instead of numeric
   - EstimatedSalary: Stored as text instead of numeric
   - HasCrCard: Stored as "Yes"/"No" instead of 1/0
   - IsActiveMember: Stored as "Yes"/"No" instead of 1/0

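4. **Inconsistent Values:**
   - Geography has inconsistent formats (FRA/France/French)
   - Surname contains at least one garbled value (`H?`), which suggests an encoding problem

5. **Repeated CustomerId:**
   - The last two rows share CustomerId 15628319 but differ in Balance, NumOfProducts, Tenure.1, and Exited, so `duplicated()` does not flag them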
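
To see the Geography inconsistency in full, list the distinct spellings before mapping them to one label. The sketch below uses the Geography values from the `df.head(10)` preview. The mapping `geo_map` is an assumption based only on the spellings visible in the previews; the full file may contain other variants.

```python
import pandas as pd

# Illustrative sample: Geography values copied from the first 10 preview rows
geo = pd.Series(
    ["FRA", "Spain", "French", "FRA", "Spain", "Spain",
     "France", "Germany", "French", "France"],
    name="Geography",
)

# Step 1: list the distinct spellings and how often each occurs
print(geo.value_counts())

# Step 2: map the variants to a single label (hypothetical mapping,
# covering only the spellings seen in the previews)
geo_map = {"FRA": "France", "French": "France"}
geo_clean = geo.replace(geo_map)

print(geo_clean.value_counts())
```

Running `df["Geography"].value_counts()` on the full dataset first confirms which variants the mapping needs to cover.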

**Key Business Insights:**
- A 20.4% churn rate means roughly one in five customers has left the bank
- Low product adoption (median = 1) suggests a cross-selling opportunity
- The middle 50% of customers are aged 32 to 44, with credit scores between 584 and 718
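
Because Exited is coded 0/1, its mean is the churn rate, and grouping by NumOfProducts gives churn per product count. The sketch below computes both on the ten rows from the `df.head(10)` preview. With so few rows the numbers are illustrative only and will not match the 20.4% figure from the full dataset.

```python
import pandas as pd

# Illustrative sample: NumOfProducts and Exited copied from the first 10 preview rows
sample = pd.DataFrame({
    "NumOfProducts": [1, 1, 1, 3, 2, 1, 2, 2, 4, 2],
    "Exited":        [1, 1, 0, 1, 0, 0, 1, 0, 1, 0],
})

# Exited is 0/1, so its mean is the share of customers who left
churn_rate = sample["Exited"].mean()

# Share of customers at each product count
product_share = sample["NumOfProducts"].value_counts(normalize=True).sort_index()

# Churn rate within each product count
churn_by_products = sample.groupby("NumOfProducts")["Exited"].mean()

print(f"Churn rate in sample: {churn_rate:.1%}")
print(product_share)
print(churn_by_products)
```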

---