# Phase 3 Project: Predicting Bank Account Ownership for Financial Inclusion in Kenya

## Business Understanding

### Real-World Problem
Despite the success of mobile money services like M-Pesa, a large portion of adults in Kenya and East Africa remain unbanked – meaning they lack a formal bank account. This limits their ability to save safely, access credit, build financial history, and fully participate in the economy. Financial exclusion is particularly high among rural residents, women, lower-education groups, and informal workers.

### Stakeholders
- Kenyan commercial banks (e.g., Equity Bank, KCB Group, Co-operative Bank)
- Fintech companies (Safaricom/M-Pesa, mobile banking providers)
- Central Bank of Kenya and government bodies promoting financial inclusion
- NGOs and development organizations focused on poverty reduction

### Project Objective
Build a binary classification model to predict whether an individual has a bank account ("Yes" or "No") based on demographic, location, and access-related features from survey data.

### How the Model Helps Stakeholders
The model can identify individuals most likely to be unbanked. Banks and fintechs can use these predictions to:
- Target outreach campaigns (e.g., mobile banking sign-ups in rural areas)
- Design tailored products for underserved groups
- Prioritize regions or demographics for financial literacy programs

This directly supports national goals for greater financial inclusion, economic growth, and poverty reduction in Kenya.


### Loading the Dataset and Variable Definitions

To begin exploring the data, I first load the main training dataset (`Train.csv`) using pandas. This file contains all the survey responses, including features and the target variable `bank_account`.

I also load `VariableDefinitions.csv` to display the meaning of each column. This helps me (and stakeholders) understand what each feature represents in the real world, which is critical for interpreting results later.

In [5]:
# import libraries
import pandas as pd
import os

# Load dataset
df = pd.read_csv('./data/Train.csv')

# Load variable definitions for reference
variable_definitions = pd.read_csv('./data/VariableDefinitions.csv')

# Display first few rows of the dataset
print(df.head())

  country  year    uniqueid bank_account location_type cellphone_access  \
0   Kenya  2018  uniqueid_1          Yes         Rural              Yes   
1   Kenya  2018  uniqueid_2           No         Rural               No   
2   Kenya  2018  uniqueid_3          Yes         Urban              Yes   
3   Kenya  2018  uniqueid_4           No         Rural              Yes   
4   Kenya  2018  uniqueid_5           No         Urban               No   

   household_size  age_of_respondent gender_of_respondent  \
0               3                 24               Female   
1               5                 70               Female   
2               5                 26                 Male   
3               5                 34               Female   
4               8                 26                 Male   

  relationship_with_head           marital_status  \
0                 Spouse  Married/Living together   
1      Head of Household                  Widowed   
2         Other relativ

### Variable Definitions

Displaying the official variable definitions helps me and any stakeholder understand exactly what each column represents. This is crucial for interpreting relationships and justifying feature inclusion later.

In [6]:
variable_definitions

| | Variable Definitions | Unnamed: 1 |
|---|---|---|
| 0 | country | Country interviewee is in. |
| 1 | year | Year survey was done in. |
| 2 | uniqueid | Unique identifier for each interviewee |
| 3 | location_type | Type of location: Rural, Urban |
| 4 | cellphone_access | If interviewee has access to a cellphone: Yes, No |
| 5 | household_size | Number of people living in one house |
| 6 | age_of_respondent | The age of the interviewee |
| 7 | gender_of_respondent | Gender of interviewee: Male, Female |
| 8 | relationship_with_head | The interviewee’s relationship with the head o... |
| 9 | marital_status | The martial status of the interviewee: Married... |


### Dataset Overview and Shape

Checking the shape and basic info gives me the total number of respondents and features. I also look for missing values early – clean data means less preprocessing later.

In [7]:
print("Dataset shape (rows, columns):", df.shape)
print("\nData types and missing values:")
df.info()

Dataset shape (rows, columns): (23524, 13)

Data types and missing values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23524 entries, 0 to 23523
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   country                 23524 non-null  object
 1   year                    23524 non-null  int64 
 2   uniqueid                23524 non-null  object
 3   bank_account            23524 non-null  object
 4   location_type           23524 non-null  object
 5   cellphone_access        23524 non-null  object
 6   household_size          23524 non-null  int64 
 7   age_of_respondent       23524 non-null  int64 
 8   gender_of_respondent    23524 non-null  object
 9   relationship_with_head  23524 non-null  object
 10  marital_status          23524 non-null  object
 11  education_level         23524 non-null  object
 12  job_type                23524 non-null  object
dtypes: int64(3), object(10)
memory 

In [10]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

print("\nTotal missing values:", df.isnull().sum().sum())

# Check for duplicate rows
print("\nNumber of duplicate rows:", df.duplicated().sum())

# Basic statistical summary for numeric columns
print("\nNumeric columns summary:")
df.describe()

Missing values per column:
country                   0
year                      0
uniqueid                  0
bank_account              0
location_type             0
cellphone_access          0
household_size            0
age_of_respondent         0
gender_of_respondent      0
relationship_with_head    0
marital_status            0
education_level           0
job_type                  0
dtype: int64

Total missing values: 0

Number of duplicate rows: 0

Numeric columns summary:


| | year | household_size | age_of_respondent |
|---|---|---|---|
| count | 23524.0 | 23524.0 | 23524.0 |
| mean | 2016.975939 | 3.797483 | 38.80522 |
| std | 0.847371 | 2.227613 | 16.520569 |
| min | 2016.0 | 1.0 | 16.0 |
| 25% | 2016.0 | 2.0 | 26.0 |
| 50% | 2017.0 | 3.0 | 35.0 |
| 75% | 2018.0 | 5.0 | 49.0 |
| max | 2018.0 | 21.0 | 100.0 |
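
`df.describe()` summarizes only the numeric columns by default, so the ten text columns get no summary here. A minimal sketch of a companion check for the categorical columns follows. The small `demo` frame is made-up illustrative data that borrows a few column names from the survey, not rows from `Train.csv`; in the notebook the loaded `df` would take its place.

```python
import pandas as pd

# Hypothetical stand-in rows (illustration only, not real survey data).
demo = pd.DataFrame({
    "country": ["Kenya", "Kenya", "Kenya", "Kenya"],
    "location_type": ["Rural", "Urban", "Rural", "Rural"],
    "cellphone_access": ["Yes", "Yes", "No", "Yes"],
    "household_size": [3, 5, 5, 8],
    "age_of_respondent": [24, 70, 26, 34],
})

# Pick every non-numeric column, i.e. the ones describe() skips by default.
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

# Number of distinct levels per categorical column.
cardinality = demo[categorical_cols].nunique()
print(cardinality)

# Share of each level within one column.
location_share = demo["location_type"].value_counts(normalize=True)
print(location_share)
```

Level counts like these help anticipate how many columns one-hot encoding will produce and whether any category is too rare to learn from.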


## Data Preparation

### Overview of Steps
1. Drop `uniqueid`, which is only a row identifier and has no predictive value.
2. Convert the target `bank_account` to numeric (Yes → 1, No → 0) for modeling.
3. Separate features (`x`) and target (`y`).
4. Perform a stratified 80/20 train-test split to preserve the class distribution in both sets.
5. Use a scikit-learn `Pipeline` with a `ColumnTransformer`:
   - `OneHotEncoder` for categorical features
   - `StandardScaler` for numeric features (not needed by tree-based models, but useful for scale-sensitive models such as logistic regression)
   - Because the transformers are fit inside the pipeline on the training data only, no information from the test set leaks into preprocessing, and the code stays clean and reproducible.

These steps prepare the data for baseline modeling while keeping the real-world class imbalance (about 86% without a bank account, 14% with one) in both the training and test sets.
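
Step 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the final modeling code: the rows in `X_demo` and `y_demo` are made up, only a subset of the survey columns is used, and `LogisticRegression` is a placeholder classifier chosen for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in rows (illustration only, not real survey data).
X_demo = pd.DataFrame({
    "location_type": ["Rural", "Urban", "Rural", "Urban", "Rural", "Urban"],
    "cellphone_access": ["Yes", "Yes", "No", "Yes", "No", "Yes"],
    "household_size": [3, 5, 5, 8, 2, 4],
    "age_of_respondent": [24, 70, 26, 34, 51, 45],
})
y_demo = pd.Series([1, 0, 1, 0, 0, 1])

numeric_cols = ["household_size", "age_of_respondent"]
categorical_cols = ["location_type", "cellphone_access"]

# Route each column group to its own transformer.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        # handle_unknown="ignore" encodes categories unseen during fit as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)

# Preprocessing and the (placeholder) classifier are fit together,
# so the scaler and encoder only ever see the data passed to fit().
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

model.fit(X_demo, y_demo)
predictions = model.predict(X_demo)

# 2 scaled numeric columns + 2 + 2 one-hot columns.
n_features_out = model.named_steps["preprocess"].transform(X_demo).shape[1]
print(n_features_out)
```

In the notebook, the same pipeline would be fit on `x_train` and `y_train` only and then applied to `x_test`, which is what keeps test-set statistics out of the scaler and encoder.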

### Dropping Non-Predictive Column and Encoding Target

`uniqueid` is a row identifier (for example `uniqueid_1`) and provides no predictive information, so I drop it.

I also map the target: "Yes" → 1, "No" → 0 for scikit-learn compatibility.

In [11]:
# Drop uniqueid
df = df.drop('uniqueid', axis=1)

# Map target to numeric
df['bank_account'] = df['bank_account'].map({'Yes': 1, 'No': 0})

# Verify
print("After mapping:")
print(df['bank_account'].value_counts())

df.head()

After mapping:
bank_account
0    20212
1     3312
Name: count, dtype: int64


| | country | year | bank_account | location_type | cellphone_access | household_size | age_of_respondent | gender_of_respondent | relationship_with_head | marital_status | education_level | job_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kenya | 2018 | 1 | Rural | Yes | 3 | 24 | Female | Spouse | Married/Living together | Secondary education | Self employed |
| 1 | Kenya | 2018 | 0 | Rural | No | 5 | 70 | Female | Head of Household | Widowed | No formal education | Government Dependent |
| 2 | Kenya | 2018 | 1 | Urban | Yes | 5 | 26 | Male | Other relative | Single/Never Married | Vocational/Specialised training | Self employed |
| 3 | Kenya | 2018 | 0 | Rural | Yes | 5 | 34 | Female | Head of Household | Married/Living together | Primary education | Formally employed Private |
| 4 | Kenya | 2018 | 0 | Urban | No | 8 | 26 | Male | Child | Single/Never Married | Primary education | Informally employed |
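
The counts after mapping (20212 + 3312) add up to all 23524 rows, so no label was lost here. That check matters because `Series.map` with a dictionary returns `NaN` for any value that is not a key, which would silently create missing targets. A small sketch of the failure mode and one possible guard, using made-up labels rather than the survey data:

```python
import pandas as pd

# Hypothetical labels; the last one is deliberately malformed (lowercase).
labels = pd.Series(["Yes", "No", "No", "yes"])
target_map = {"Yes": 1, "No": 0}

# Values missing from the dictionary become NaN instead of raising an error.
mapped = labels.map(target_map)
unmapped_count = int(mapped.isna().sum())
print("Unmapped labels:", unmapped_count)

# One possible guard: normalize whitespace and case first, then verify.
cleaned = labels.str.strip().str.capitalize().map(target_map)
assert cleaned.notna().all(), "unexpected label in target column"
cleaned = cleaned.astype(int)
print(cleaned.tolist())
```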


In [12]:
from sklearn.model_selection import train_test_split

# Split data into features and target
x = df.drop('bank_account', axis=1)
y = df['bank_account']

# Stratified train-test split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)
print("Training set shape:", x_train.shape, y_train.shape)
print("Testing set shape:", x_test.shape, y_test.shape)
print("\nTarget distribution in train:", y_train.value_counts(normalize=True))
print("Target distribution in test:", y_test.value_counts(normalize=True))

Training set shape: (18819, 11) (18819,)
Testing set shape: (4705, 11) (4705,)

Target distribution in train: bank_account
0    0.859185
1    0.140815
Name: proportion, dtype: float64
Target distribution in test: bank_account
0    0.859299
1    0.140701
Name: proportion, dtype: float64
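
With about 86% of respondents in class 0, a model that always predicts "no account" already reaches roughly 0.86 accuracy, so accuracy alone is a weak yardstick for later models. The sketch below illustrates this with scikit-learn's `DummyClassifier`. The labels are synthetic, built to approximate the proportions above, and the feature matrix is a placeholder because this strategy ignores features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels (illustration only): 86 without an account, 14 with one.
y_demo = np.array([0] * 86 + [1] * 14)
X_demo = np.zeros((len(y_demo), 1))  # placeholder; ignored by this strategy

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_pred = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_pred)
baseline_recall = recall_score(y_demo, y_pred)  # recall for class 1 (has account)
print("Accuracy:", baseline_accuracy)
print("Recall for class 1:", baseline_recall)
```

The baseline never identifies an account holder, which is why per-class metrics such as recall, precision, or F1 are more informative than accuracy on this target.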
