# Phase 3 Project: Predicting Bank Account Ownership for Financial Inclusion in Kenya

## Business Understanding

### Real-World Problem
Despite the success of mobile money services like M-Pesa, a large portion of adults in Kenya and East Africa remain unbanked – meaning they lack a formal bank account. This limits their ability to save safely, access credit, build financial history, and fully participate in the economy. Financial exclusion is particularly high among rural residents, women, lower-education groups, and informal workers.

### Stakeholders
- Kenyan commercial banks (e.g., Equity Bank, KCB Group, Co-operative Bank)
- Fintech companies (Safaricom/M-Pesa, mobile banking providers)
- Central Bank of Kenya and government bodies promoting financial inclusion
- NGOs and development organizations focused on poverty reduction

### Project Objective
Build a binary classification model to predict whether an individual has a bank account ("Yes" or "No") based on demographic, location, and access-related features from survey data.

### How the Model Helps Stakeholders
The model can identify individuals most likely to be unbanked. Banks and fintechs can use these predictions to:
- Target outreach campaigns (e.g., mobile banking sign-ups in rural areas)
- Design tailored products for underserved groups
- Prioritize regions or demographics for financial literacy programs

This directly supports national goals for greater financial inclusion, economic growth, and poverty reduction in Kenya.


### Loading the Dataset and Variable Definitions

To begin exploring the data, I first load the main training dataset (`Train.csv`) using pandas. This file contains all the survey responses, including features and the target variable `bank_account`.

I also load `VariableDefinitions.csv` to display the meaning of each column. This helps me (and stakeholders) understand what each feature represents in the real world, which is critical for interpreting results later.

In [1]:
# Import libraries
import pandas as pd

# Load dataset
df = pd.read_csv('./data/Train.csv')

# Load variable definitions for reference
variable_definitions = pd.read_csv('./data/VariableDefinitions.csv')

# Display first few rows of the dataset
print(df.head())

  country  year    uniqueid bank_account location_type cellphone_access  \
0   Kenya  2018  uniqueid_1          Yes         Rural              Yes   
1   Kenya  2018  uniqueid_2           No         Rural               No   
2   Kenya  2018  uniqueid_3          Yes         Urban              Yes   
3   Kenya  2018  uniqueid_4           No         Rural              Yes   
4   Kenya  2018  uniqueid_5           No         Urban               No   

   household_size  age_of_respondent gender_of_respondent  \
0               3                 24               Female   
1               5                 70               Female   
2               5                 26                 Male   
3               5                 34               Female   
4               8                 26                 Male   

  relationship_with_head           marital_status  \
0                 Spouse  Married/Living together   
1      Head of Household                  Widowed   
2         Other relativ

### Variable Definitions

Displaying the official variable definitions helps me and any stakeholder understand exactly what each column represents. This is crucial for interpreting relationships and justifying feature inclusion later.

In [2]:
variable_definitions

| | Variable Definitions | Unnamed: 1 |
|---|---|---|
| 0 | country | Country interviewee is in. |
| 1 | year | Year survey was done in. |
| 2 | uniqueid | Unique identifier for each interviewee |
| 3 | location_type | Type of location: Rural, Urban |
| 4 | cellphone_access | If interviewee has access to a cellphone: Yes, No |
| 5 | household_size | Number of people living in one house |
| 6 | age_of_respondent | The age of the interviewee |
| 7 | gender_of_respondent | Gender of interviewee: Male, Female |
| 8 | relationship_with_head | The interviewee’s relationship with the head o... |
| 9 | marital_status | The martial status of the interviewee: Married... |


### Dataset Overview and Shape

Checking the shape and basic info gives me the total number of respondents and features. I also look for missing values early – clean data means less preprocessing later.

In [3]:
print("Dataset shape (rows, columns):", df.shape)
print("\nData types and missing values:")
df.info()

Dataset shape (rows, columns): (23524, 13)

Data types and missing values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23524 entries, 0 to 23523
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   country                 23524 non-null  object
 1   year                    23524 non-null  int64 
 2   uniqueid                23524 non-null  object
 3   bank_account            23524 non-null  object
 4   location_type           23524 non-null  object
 5   cellphone_access        23524 non-null  object
 6   household_size          23524 non-null  int64 
 7   age_of_respondent       23524 non-null  int64 
 8   gender_of_respondent    23524 non-null  object
 9   relationship_with_head  23524 non-null  object
 10  marital_status          23524 non-null  object
 11  education_level         23524 non-null  object
 12  job_type                23524 non-null  object
dtypes: int64(3), object(10)
memory 

In [4]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

print("\nTotal missing values:", df.isnull().sum().sum())

# Check for duplicate rows
print("\nNumber of duplicate rows:", df.duplicated().sum())

# Basic statistical summary for numeric columns
print("\nNumeric columns summary:")
df.describe()

Missing values per column:
country                   0
year                      0
uniqueid                  0
bank_account              0
location_type             0
cellphone_access          0
household_size            0
age_of_respondent         0
gender_of_respondent      0
relationship_with_head    0
marital_status            0
education_level           0
job_type                  0
dtype: int64

Total missing values: 0

Number of duplicate rows: 0

Numeric columns summary:


| | year | household_size | age_of_respondent |
|---|---|---|---|
| count | 23524.0 | 23524.0 | 23524.0 |
| mean | 2016.975939 | 3.797483 | 38.80522 |
| std | 0.847371 | 2.227613 | 16.520569 |
| min | 2016.0 | 1.0 | 16.0 |
| 25% | 2016.0 | 2.0 | 26.0 |
| 50% | 2017.0 | 3.0 | 35.0 |
| 75% | 2018.0 | 5.0 | 49.0 |
| max | 2018.0 | 21.0 | 100.0 |


## Data Preparation

### Overview of Steps
1. Drop `uniqueid`: it is only an identifier and has no predictive value.
2. Convert target `bank_account` to numeric (Yes → 1, No → 0) for modeling.
3. Separate features (`x`) and target (`y`).
4. Perform stratified train-test split (80/20) to preserve class distribution in both sets.
5. Use scikit-learn Pipeline with ColumnTransformer:
   - OneHotEncoder for categorical features
   - StandardScaler for numeric features (optional but good practice)
   - Fitting the transformers on the training data only prevents data leakage and keeps the code clean and reproducible.

These steps ensure the data is ready for baseline modeling while maintaining real-world class imbalance.

### Dropping Non-Predictive Column and Encoding Target

`uniqueid` is a respondent identifier of the form "uniqueid_N" (e.g., `uniqueid_1`) and provides no predictive information, so I drop it.

I also map the target: "Yes" → 1, "No" → 0 for scikit-learn compatibility.

In [5]:
# Drop uniqueid
df = df.drop('uniqueid', axis=1)

# Map target to numeric
df['bank_account'] = df['bank_account'].map({'Yes': 1, 'No': 0})

# Verify
print("After mapping:")
print(df['bank_account'].value_counts())

df.head()

After mapping:
bank_account
0    20212
1     3312
Name: count, dtype: int64


| | country | year | bank_account | location_type | cellphone_access | household_size | age_of_respondent | gender_of_respondent | relationship_with_head | marital_status | education_level | job_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kenya | 2018 | 1 | Rural | Yes | 3 | 24 | Female | Spouse | Married/Living together | Secondary education | Self employed |
| 1 | Kenya | 2018 | 0 | Rural | No | 5 | 70 | Female | Head of Household | Widowed | No formal education | Government Dependent |
| 2 | Kenya | 2018 | 1 | Urban | Yes | 5 | 26 | Male | Other relative | Single/Never Married | Vocational/Specialised training | Self employed |
| 3 | Kenya | 2018 | 0 | Rural | Yes | 5 | 34 | Female | Head of Household | Married/Living together | Primary education | Formally employed Private |
| 4 | Kenya | 2018 | 0 | Urban | No | 8 | 26 | Male | Child | Single/Never Married | Primary education | Informally employed |


In [6]:
from sklearn.model_selection import train_test_split

# Split data into features and target
x = df.drop('bank_account', axis=1)
y = df['bank_account']

# Stratified train-test split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)
print("Training set shape:", x_train.shape, y_train.shape)
print("Testing set shape:", x_test.shape, y_test.shape)
print("\nTarget distribution in train:", y_train.value_counts(normalize=True))
print("Target distribution in test:", y_test.value_counts(normalize=True))

Training set shape: (18819, 11) (18819,)
Testing set shape: (4705, 11) (4705,)

Target distribution in train: bank_account
0    0.859185
1    0.140815
Name: proportion, dtype: float64
Target distribution in test: bank_account
0    0.859299
1    0.140701
Name: proportion, dtype: float64


### Preprocessing Pipeline

Most features are categorical and need encoding. I use OneHotEncoder for them.

Numeric features (`year`, `household_size`, `age_of_respondent`) will be scaled with StandardScaler, which matters most for regularized linear models such as logistic regression.

I build a ColumnTransformer inside a Pipeline to handle everything cleanly and prevent data leakage – transformers are fitted only on the training data.
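
The leakage argument can be illustrated with cross-validation: when the transformers live inside the pipeline, each fold refits them on that fold's training rows only. The snippet below is a minimal sketch on hypothetical toy data (the `toy_*` names and the synthetic target are assumptions for illustration, not the survey data).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the survey features
rng = np.random.default_rng(42)
n_rows = 300
toy_X = pd.DataFrame({
    "location_type": rng.choice(["Rural", "Urban"], size=n_rows),
    "cellphone_access": rng.choice(["Yes", "No"], size=n_rows),
    "age_of_respondent": rng.integers(16, 90, size=n_rows),
})
# Synthetic target: more likely 1 when the respondent has cellphone access
prob_yes = np.where(toy_X["cellphone_access"] == "Yes", 0.5, 0.1)
toy_y = (rng.random(n_rows) < prob_yes).astype(int)

# Preprocessing and model in one estimator, so every CV fold
# fits the encoder and scaler on its own training rows only
toy_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["location_type", "cellphone_access"]),
        ("num", StandardScaler(), ["age_of_respondent"]),
    ])),
    ("classifier", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(toy_pipeline, toy_X, toy_y, cv=cv, scoring="roc_auc")
print("ROC AUC per fold:", cv_scores.round(3))
```

Scaling or encoding the full dataset before splitting would let statistics from the held-out rows influence the fitted transformers, which is the leakage the pipeline avoids.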

### Identifying Categorical and Numeric Columns

I separate columns into categorical and numeric lists for the ColumnTransformer.

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

# Categorical columns
categorical_cols = x_train.select_dtypes(include='object').columns.tolist()

# Numeric columns
numeric_cols = x_train.select_dtypes(include='number').columns.tolist()

print("Categorical columns:", categorical_cols)
print("Numeric columns:", numeric_cols)

Categorical columns: ['country', 'location_type', 'cellphone_access', 'gender_of_respondent', 'relationship_with_head', 'marital_status', 'education_level', 'job_type']
Numeric columns: ['year', 'household_size', 'age_of_respondent']


### Building and Fitting the Preprocessing Pipeline

Now that columns are identified, I create the ColumnTransformer:
- OneHotEncoder for categorical columns
- StandardScaler for numeric columns

I fit it only on `x_train` to prevent leakage.

In [8]:
# Preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('num', StandardScaler(), numeric_cols)
    ]
)

# Fit on training data only
preprocessor.fit(x_train)

# Transform train and test
x_train_processed = preprocessor.transform(x_train)
x_test_processed = preprocessor.transform(x_test)

print("Shape after preprocessing (train):", x_train_processed.shape)
print("Shape after preprocessing (test):", x_test_processed.shape)

Shape after preprocessing (train): (18819, 40)
Shape after preprocessing (test): (4705, 40)


## Modeling

### Approach
I follow an iterative modeling process as required:
1. Start with a simple baseline: Logistic Regression (interpretable, fast, good for tabular data).
2. Evaluate on proper metrics (not just accuracy due to imbalance).
3. Interpret coefficients to understand feature importance.
4. Later iterate to nonparametric models (Decision Trees, Random Forest) for potential improvement.
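
To make the point about accuracy concrete, here is a minimal sketch of a majority-class baseline. It rebuilds the test labels from the class counts in the test split (4,043 "No" and 662 "Yes" out of 4,705 rows); the `DummyClassifier` and the placeholder feature matrix are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Test-set labels rebuilt from the class counts: 4,043 "No" (0), 662 "Yes" (1)
y_true = np.array([0] * 4043 + [1] * 662)
X_placeholder = np.zeros((len(y_true), 1))  # DummyClassifier ignores features

# A classifier that always predicts the most frequent class ("No")
majority = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true)
y_majority = majority.predict(X_placeholder)

majority_accuracy = accuracy_score(y_true, y_majority)
majority_recall_yes = recall_score(y_true, y_majority, pos_label=1)

print(f"Accuracy: {majority_accuracy:.3f}")
print(f"Recall for 'Yes': {majority_recall_yes:.3f}")
```

Always predicting "No" reaches about 0.859 accuracy while finding no account holders at all, which is why recall, precision, and ROC AUC are the more informative metrics here.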

### Baseline Model: Logistic Regression
Logistic Regression is a strong baseline here because:
- Linear relationships often exist in demographic/survey data.
- Provides interpretable coefficients (log-odds, which exponentiate to odds ratios).
- Works well with one-hot encoded categorical features and scaled numeric features.

### Training the Baseline Logistic Regression

I combine the preprocessor with LogisticRegression in a full Pipeline. Calling `fit` on the pipeline refits the preprocessor on `x_train`, so no test data is involved.
- class_weight='balanced' to help with imbalance (penalizes mistakes on minority class more).
- max_iter=1000 for convergence.
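
For reference, `class_weight='balanced'` sets each class weight to `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula to the training-set class counts implied by the split (16,169 "No" and 2,650 "Yes" out of 18,819 rows); the array `y_counts` is rebuilt here only for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Training labels rebuilt from the class counts: 16,169 "No" (0), 2,650 "Yes" (1)
y_counts = np.array([0] * 16169 + [1] * 2650)

# Weights as computed by scikit-learn for class_weight='balanced'
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_counts
)

# The same weights from the formula n_samples / (n_classes * class_count)
manual = len(y_counts) / (2 * np.bincount(y_counts))

print(f"Weight for 'No':  {weights[0]:.3f}")
print(f"Weight for 'Yes': {weights[1]:.3f}")
```

Each misclassified "Yes" therefore counts roughly six times as much in the loss as a misclassified "No" (about 3.551 versus 0.582).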

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Full pipeline: preprocessor + logistic regression
baseline_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42))
])

# Fit on training data
baseline_model.fit(x_train, y_train)

# Predictions on test set
y_pred = baseline_model.predict(x_test)

### Baseline Model Evaluation

I evaluate on the holdout test set using:
- Confusion Matrix
- Classification Report (precision, recall, F1 – focus on "Yes" class)
- ROC AUC (good for imbalanced data)

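Recall for "Yes" is key because "Yes" is the minority class: it measures the share of actual account holders the model finds. Recall for "No" measures the share of unbanked individuals, the outreach targets, that the model flags.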

In [10]:
# Confusion Matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))

# ROC AUC
auc = roc_auc_score(y_test, baseline_model.predict_proba(x_test)[:, 1])
print(f"\nROC AUC Score: {auc:.3f}")

Confusion Matrix:
[[3219  824]
 [ 158  504]]

Classification Report:
              precision    recall  f1-score   support

          No       0.95      0.80      0.87      4043
         Yes       0.38      0.76      0.51       662

    accuracy                           0.79      4705
   macro avg       0.67      0.78      0.69      4705
weighted avg       0.87      0.79      0.82      4705


ROC AUC Score: 0.865


### Baseline Model Results & Interpretation

The balanced Logistic Regression performs well:

- **Recall for "Yes" class: 0.76** – captures 76% of individuals who actually have bank accounts. This is strong for stakeholder outreach goals.
- **Precision for "Yes": 0.38** – about 38% of predicted "Yes" are correct (many false positives).
- Trade-off due to class_weight='balanced': prioritizes catching the minority class.
- **ROC AUC: 0.865** – excellent discrimination ability.
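
The reported figures can be reproduced by hand from the confusion matrix. This short worked example uses only the four counts printed above (rows are actual classes, columns are predicted classes, both in "No", "Yes" order).

```python
import numpy as np

# Confusion matrix from the baseline evaluation:
# rows = actual (No, Yes), columns = predicted (No, Yes)
cm = np.array([[3219, 824],
               [158, 504]])
tn, fp, fn, tp = cm.ravel()

precision_yes = tp / (tp + fp)    # 504 / 1328
recall_yes = tp / (tp + fn)       # 504 / 662
recall_no = tn / (tn + fp)        # 3219 / 4043
accuracy = (tp + tn) / cm.sum()   # 3723 / 4705
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

print(f"Precision (Yes): {precision_yes:.2f}")
print(f"Recall (Yes):    {recall_yes:.2f}")
print(f"Recall (No):     {recall_no:.2f}")
print(f"Accuracy:        {accuracy:.2f}")
print(f"F1 (Yes):        {f1_yes:.2f}")
```

The 824 false positives are respondents without an account who were predicted "Yes", and the 158 false negatives are account holders predicted "No".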

**Business Implications:**
- The model recovers 76% of account holders and 80% of unbanked individuals in the test set, which supports targeted interventions (e.g., rural mobile banking campaigns).
- False positives (unbanked respondents predicted as "Yes") are manageable if outreach cost is low compared to the value of including true positives.
- Next: Try tree-based models (nonparametric) which may handle nonlinear patterns better and improve precision without losing too much recall.

### Iteration 1: Decision Tree Classifier

Decision Trees are nonparametric and can capture nonlinear relationships and interactions (e.g., job_type + location_type effects).

I start with default parameters as a second baseline, then will tune later.
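No class_weight for now. Note that the logistic baseline used class_weight='balanced', so part of the recall gap between the two models reflects that setting rather than the algorithm.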

In [12]:
from sklearn.tree import DecisionTreeClassifier

# New pipeline with Decision Tree
tree_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

# Fit
tree_model.fit(x_train, y_train)

# Predictions
y_pred_tree = tree_model.predict(x_test)
y_pred_proba_tree = tree_model.predict_proba(x_test)[:, 1]

# Evaluation
print("Decision Tree Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_tree))

print("\nDecision Tree Classification Report:")
print(classification_report(y_test, y_pred_tree, target_names=['No', 'Yes']))

auc_tree = roc_auc_score(y_test, y_pred_proba_tree)
print(f"\nDecision Tree ROC AUC: {auc_tree:.3f}")

Decision Tree Confusion Matrix:
[[3635  408]
 [ 367  295]]

Decision Tree Classification Report:
              precision    recall  f1-score   support

          No       0.91      0.90      0.90      4043
         Yes       0.42      0.45      0.43       662

    accuracy                           0.84      4705
   macro avg       0.66      0.67      0.67      4705
weighted avg       0.84      0.84      0.84      4705


Decision Tree ROC AUC: 0.681


### Iteration 1 Results: Decision Tree vs Logistic Regression

The default Decision Tree performed worse than the balanced Logistic Regression:

- Recall for "Yes" dropped from 0.76 to 0.45, so the tree misses more than half of actual account holders (367 of 662).
- Precision slightly improved (0.38 → 0.42), but not enough to offset the recall loss.
- ROC AUC significantly lower (0.865 → 0.681).

**Conclusion**: The balanced logistic model separates the classes better on this demographic data. The default tree has no depth limit, so it likely overfits the training set.
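
One way to check the overfitting hypothesis is to compare training and test scores for an unconstrained tree against a depth-limited one. The sketch below does this on synthetic data from `make_classification` (an assumption for illustration, with a class ratio loosely matching the survey), so the numbers it prints say nothing about the survey data itself.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for the preprocessed features
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.86, 0.14], flip_y=0.05, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

# Compare an unconstrained tree with a depth-limited tree
auc_by_depth = {}
for label, depth in [("unlimited", None), ("max_depth=5", 5)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    auc_by_depth[label] = {
        "train": roc_auc_score(y_tr, tree.predict_proba(X_tr)[:, 1]),
        "test": roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1]),
    }
    print(label, {k: round(v, 3) for k, v in auc_by_depth[label].items()})
```

A large gap between training and test ROC AUC for the unconstrained tree is the usual sign of overfitting; running the same comparison with `tree_model` on `x_train` and `x_test` would confirm or refute it for this dataset.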

**Next Iteration**: Use Random Forest (ensemble of trees) to reduce overfitting, add class_weight='balanced', and tune hyperparameters to try improving recall and AUC.
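
A minimal sketch of what that next iteration could look like is shown below, assuming a small `GridSearchCV` over two Random Forest hyperparameters scored by ROC AUC. The synthetic data, the grid values, and `n_estimators=50` are illustrative choices, not tuned settings or results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced data standing in for the survey features
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.86, 0.14], random_state=42,
)

# In the notebook, the 'preprocessor' step would come before the classifier
rf_pipeline = Pipeline(steps=[
    ("classifier", RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=42
    )),
])

# Hypothetical grid: limit tree depth and leaf size to curb overfitting
param_grid = {
    "classifier__max_depth": [5, None],
    "classifier__min_samples_leaf": [1, 5],
}
search = GridSearchCV(rf_pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X_syn, y_syn)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated ROC AUC: {search.best_score_:.3f}")
```

Tuning on cross-validated training folds keeps the holdout test set untouched for the final comparison against the logistic baseline.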