# Unbanked Customer Data Preparation
This notebook prepares the unbanked customer dataset for modeling and segmentation.

It performs the following steps:
- Load source data from `credit_report (1).xlsx`.
- Filter to UnBanked customers and exclude special programs.
- Remove ranks 3 and 4, keep only positive credit limits (> 0), and drop rows with a null customer bucket.
- Run data quality checks (missing values, duplicates, outliers, business rules).
- Clean data (deduplicate, cap invalid/extreme values, impute selective fields).
- Create the binary target from `customer_bucket` (0=Good, 1=Bad).
- Assemble the final features dataframe and save artifacts.
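The filtering steps above can be sketched in plain pandas. This is only an illustration under stated assumptions: the real logic lives inside `CreditDataPreprocessor`, whose internals are not shown in this notebook, so the function name `filter_unbanked`, the order of the conditions, and the toy data are all hypothetical. The column names are the ones displayed in the source dataframe.

```python
import pandas as pd


def filter_unbanked(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the filtering stage; not the actual preprocessor code."""
    mask = (
        (df["limit_source"] == "UnBanked")       # UnBanked customers only
        & df["special_program_flag"].eq(False)   # exclude special programs
        & df["rank"].isin([1, 2])                # drop ranks 3 and 4
        & (df["credit_limit"] > 0)               # positive credit limits only
        & df["customer_bucket"].notna()          # non-null customer buckets
    )
    return df.loc[mask].reset_index(drop=True)


# Made-up rows: each of rows 1-5 violates exactly one condition
toy = pd.DataFrame({
    "limit_source": ["UnBanked", "Banked", "UnBanked", "UnBanked", "UnBanked", "UnBanked"],
    "special_program_flag": [False, False, True, False, False, False],
    "rank": [1, 1, 2, 3, 2, 2],
    "credit_limit": [1000.0, 5300.0, 800.0, 900.0, 0.0, 1200.0],
    "customer_bucket": ["CURRENT", "CURRENT", "SETTLED", "BUCKET-1", "CURRENT", None],
})
filtered_toy = filter_unbanked(toy)
print(filtered_toy)
```

Combining the conditions into one boolean mask keeps the sketch short; the actual preprocessor appears to apply them one at a time, since it logs the shape after each filter.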

Inputs:
- `credit_report (1).xlsx`

Outputs (written to `customer_segmantation_v1.0/data/`):
- `UnBanked_customer_segmentation_final.csv`
- `UnBanked_dataset_documentation.json`
- `dataset_summary_report.txt`

In [1]:
# Import packages
import os
import sys

import pandas as pd

# Make the local preprocessing module importable from this notebook
sys.path.insert(0, os.path.abspath("../preprocessing"))
from preprocess import CreditDatasetConfig, CreditDataPreprocessor

In [2]:
# Load the source dataset
df = pd.read_excel("../../credit_report (1).xlsx")

In [3]:
# Keep an untouched copy; the final report uses it for the original record count
df_backup = df.copy()

In [4]:
# Set pandas display options to show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

df.head(15)

Unnamed: 0,customer_id,national_id,credit_limit,due_principal,customer_bucket,onboarding_merchant,first_transaction_merchant,rank,limit_source,special_program_flag,has_past_credit_flag,income_delta_percentage,income_delta_tier,income_delta_score,age,age_score,marital_status,marital_status_score,jobtitle_category,jobtitle_score,address_category,address_score,gender,gender_score
0,237293,28302022102615,0.0,0.0,,Fawry Plus,,3,,False,False,,,,42,13.164,Married,14.6048,D,30.4325,C,23.4765,MALE,20.547
1,237294,28506300100399,0.0,0.0,,Fawry Plus,,2,,False,False,,,,39,15.5774,Married,14.6048,D,30.4325,A,17.39,MALE,20.547
2,237295,25104210300141,0.0,0.0,,Fawry Plus,,4,,False,False,,,,74,9.6536,Single,16.952,,,,,FEMALE,15.22
3,237296,29305050104961,0.0,0.0,,Union Stores,,3,,False,False,,,,32,16.3453,Married,14.6048,D,30.4325,B,19.9985,FEMALE,15.22
4,237297,28911010107839,5300.0,0.0,,Fawry Plus,,3,Banked,False,True,67.963354,8.0,129.679634,35,16.0162,Married,14.6048,D,30.4325,C,23.4765,MALE,20.547
5,237298,29812152203199,0.0,0.0,,Connect,,4,,False,False,,,,26,17.2229,Single,16.952,D,30.4325,E,30.4325,MALE,20.547
6,237299,26708180102656,0.0,0.0,,Connect,,2,,False,False,,,,57,11.5185,Married,14.6048,D,30.4325,B,19.9985,MALE,20.547
7,237300,28301300100924,0.0,0.0,,Connect,,2,,False,False,,,,42,13.164,Married,14.6048,D,30.4325,D,26.9545,FEMALE,15.22
8,237301,29601102101092,0.0,0.0,,Union Stores,,1,,False,False,,,,29,16.8938,Married,14.6048,A,17.39,A,17.39,MALE,20.547
9,237302,28210240103117,0.0,0.0,,Connect,,2,,False,False,,,,42,13.164,Single,16.952,D,30.4325,A,17.39,MALE,20.547


In [5]:
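# Segment to prepare; also used as the prefix of the output file names
limit_source = 'UnBanked'
cdp = CreditDataPreprocessor(data=df, limit_source=limit_source)
# Run the full pipeline: filtering, quality checks, cleaning, and target creation
final_df, documentation = cdp.preprocess()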


Unbanked rank distribution:
Shape before filtering by credit limit: (427116, 24)
Shape after filtering by credit limit: (127741, 24)

Shape before filtering by customer_bucket: (127741, 24)
Shape after filtering by customer_bucket: (78144, 24)
Original dataset shape: (427116, 24)
Filtered dataset shape (UnBanked only): (51672, 24)

Unique values in limit_source (original):
limit_source
UnBanked    90017
Banked      40939
Name: count, dtype: int64
df shape: (51672, 24)

df columns: ['customer_id', 'national_id', 'credit_limit', 'due_principal', 'customer_bucket', 'onboarding_merchant', 'first_transaction_merchant', 'rank', 'limit_source', 'special_program_flag', 'has_past_credit_flag', 'income_delta_percentage', 'income_delta_tier', 'income_delta_score', 'age', 'age_score', 'marital_status', 'marital_status_score', 'jobtitle_category', 'jobtitle_score', 'address_category', 'address_score', 'gender', 'gender_score']

First 5 rows of df:
    customer_id     national_id  credit_limit  due

##### Save final dataset

In [6]:
# Save final dataset with comprehensive documentation
print("="*60)
print("SAVING FINAL DATASET")
print("="*60)

# Ensure output directory exists (os is already imported in the first cell)
import json
out_dir = os.path.join('customer_segmantation_v1.0', 'data')
os.makedirs(out_dir, exist_ok=True)
print(f"Output directory: {out_dir}")

# Save the final dataset
csv_path = os.path.join(out_dir, f'{limit_source}_customer_segmentation_final.csv')
final_df.to_csv(csv_path, index=False)
print(f"✓ Final dataset saved as '{csv_path}'")

# Save documentation as JSON
doc_path = os.path.join(out_dir, f'{limit_source}_dataset_documentation.json')
with open(doc_path, 'w') as f:
    json.dump(documentation, f, indent=2)
print(f"✓ Documentation saved as '{doc_path}'")

# Create a summary report
summary_report = f"""
{limit_source} CUSTOMER SEGMENTATION DATASET - FINAL REPORT
=====================================================

DATASET OVERVIEW:
- Original records: {df_backup.shape[0]:,}
- Final records: {final_df.shape[0]:,}
- Records removed: {df_backup.shape[0] - final_df.shape[0]:,} ({((df_backup.shape[0] - final_df.shape[0])/df_backup.shape[0]*100):.1f}%)
- Unique customers: {final_df['customer_id'].nunique():,}
- Features: {len(final_df.columns)-2} (excluding customer_id and target)

TARGET DISTRIBUTION:
- Good customers: {(final_df['target'] == 0).sum():,} ({(final_df['target'] == 0).mean()*100:.1f}%)
- Bad customers: {(final_df['target'] == 1).sum():,} ({(final_df['target'] == 1).mean()*100:.1f}%)

FILTERING APPLIED:
1. {limit_source} customers only
2. No special programs
3. Rank 1-2 only (removed 3-4)
4. Valid credit limits (>0)
5. Non-null customer buckets
6. Data quality improvements

TARGET DEFINITION:
- Good: CURRENT, SETTLED, CANCELLED, CANCELLED-PARTIAL-REFUND, SETTLE-RESCHEDULED, BUCKET-1, BUCKET-2
- Bad: All other buckets (BUCKET-3 through BUCKET-7, charge-offs, writeoffs, etc.)

FILES CREATED:
1. {limit_source}_customer_segmentation_final.csv - Main dataset
2. {limit_source}_dataset_documentation.json - Detailed documentation
3. dataset_summary_report.txt - This summary

Dataset is ready for customer segmentation analysis and modeling.
"""

# Save summary report
summary_path = os.path.join(out_dir, 'dataset_summary_report.txt')
with open(summary_path, 'w') as f:
    f.write(summary_report)
print(f"✓ Summary report saved as '{summary_path}'")

# Print summary report
print(f"\n{summary_report}")

SAVING FINAL DATASET
Output directory: customer_segmantation_v1.0/data
✓ Final dataset saved as 'customer_segmantation_v1.0/data/UnBanked_customer_segmentation_final.csv'
✓ Documentation saved as 'customer_segmantation_v1.0/data/UnBanked_dataset_documentation.json'
✓ Summary report saved as 'customer_segmantation_v1.0/data/dataset_summary_report.txt'


UnBanked CUSTOMER SEGMENTATION DATASET - FINAL REPORT

DATASET OVERVIEW:
- Original records: 427,116
- Final records: 50,091
- Records removed: 377,025 (88.3%)
- Unique customers: 42,100
- Features: 6 (excluding customer_id and target)

TARGET DISTRIBUTION:
- Good customers: 43,185 (86.2%)
- Bad customers: 6,906 (13.8%)

FILTERING APPLIED:
1. UnBanked customers only
2. No special programs
3. Rank 1-2 only (removed 3-4)
4. Valid credit limits (>0)
5. Non-null customer buckets
6. Data quality improvements

TARGET DEFINITION:
- Good: CURRENT, SETTLED, CANCELLED, CANCELLED-PARTIAL-REFUND, SETTLE-RESCHEDULED, BUCKET-1, BUCKET-2
- Bad: All oth

## Summary
- Filtered to UnBanked customers, excluded special programs, and removed Rank 3–4.
- Kept only positive credit limits (> 0) and non-null customer buckets.
- Completed data quality checks and applied targeted cleaning (duplicates, caps, imputations, and outlier handling).
- Built a binary target from `customer_bucket` and assembled the final features.
- Saved deliverables to `customer_segmantation_v1.0/data/`:
  - `UnBanked_customer_segmentation_final.csv`
  - `UnBanked_dataset_documentation.json`
  - `dataset_summary_report.txt`
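
The binary target described in the report can be illustrated with a short mapping. This is a sketch, not the preprocessor's own code: the Good bucket names are copied from the TARGET DEFINITION section of the report, while the helper name `make_target` and the assumption that bucket labels match exactly (same case, no stray whitespace) are mine. It also assumes null buckets were already removed, since a null would otherwise be labelled Bad.

```python
import pandas as pd

# Buckets the report lists as "Good"; every other bucket is treated as "Bad"
GOOD_BUCKETS = {
    "CURRENT", "SETTLED", "CANCELLED", "CANCELLED-PARTIAL-REFUND",
    "SETTLE-RESCHEDULED", "BUCKET-1", "BUCKET-2",
}


def make_target(customer_bucket: pd.Series) -> pd.Series:
    """Hypothetical helper: 0 for Good buckets, 1 for all other buckets."""
    is_good = customer_bucket.isin(GOOD_BUCKETS)
    return (~is_good).astype(int).rename("target")


buckets = pd.Series(["CURRENT", "BUCKET-2", "BUCKET-3", "SETTLED", "BUCKET-7"])
toy_target = make_target(buckets)
print(toy_target.tolist())  # [0, 0, 1, 0, 1]
```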

Next steps (optional):
- Explore class balance and feature distributions (EDA).
- Train baseline models (e.g., logistic regression, tree-based).
- Perform segmentation and stability checks across cohorts.
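
As a starting point for the first next step, the sketch below summarizes class balance and checks how often `customer_id` repeats. The report lists 50,091 final records but 42,100 unique customers, so some customers appear in more than one row; that may matter when splitting data for modeling, because rows for the same customer could land in both the training and test sets. The helper name `summarize_final` and the toy frame are hypothetical; only the `customer_id` and `target` columns are taken from the notebook.

```python
import pandas as pd


def summarize_final(final_df: pd.DataFrame) -> dict:
    """Hypothetical helper: class balance and customer_id repetition."""
    share = final_df["target"].value_counts(normalize=True)
    return {
        "rows": len(final_df),
        "unique_customers": int(final_df["customer_id"].nunique()),
        # Rows whose customer_id already appeared earlier in the frame
        "repeated_rows": int(final_df["customer_id"].duplicated().sum()),
        # Share of rows with target == 1 (Bad); 0.0 if no Bad rows exist
        "bad_rate": float(share.get(1, 0.0)),
    }


# Made-up data: customer 1 appears twice, one of five rows is Bad
toy_final = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "target": [0, 0, 1, 0, 0],
})
toy_summary = summarize_final(toy_final)
print(toy_summary)
```

In the notebook itself, the same helper could be called on `final_df` after the preprocessing cell has run.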