<p align="center">
  <h1 style="text-align: center;">🔍📊 EV Financing Propensity Model - Ghana 🇬🇭🚗⚡️</h1>
</p>


### 📘 Notebook 1: Data Ingestion, Initial Cleaning, and Exploratory Data Analysis (EDA)

This notebook initiates the EV Financing Propensity Model project by focusing on the following key steps:

* 🔧 1. Problem Definition and Business Context
* 📥 2. Loading the Raw AHIES 2022–2023 Dataset
* 🧹 3. Initial Cleaning and Feature Selection
* 💾 4. Save the Initially Cleaned Dataset
* 📊 5. Basic Exploratory Data Analysis (EDA)


## 1. Project Context

### 1.1 Problem Statement
We aim to build a propensity model that identifies Ghanaian households or individuals who are good prospects for an electric vehicle (EV) financing loan. The task is supervised binary classification: predict whether a household or individual is a good prospect (label = 1) or not (label = 0). Because the AHIES dataset contains no record of actual EV loan uptake, the target variable will be simulated later in the project from plausible socioeconomic features. The model then estimates the likelihood of seeking or qualifying for an EV loan from features such as income, transport spending, and urban/rural status.


### 1.2 Business Case
Electric mobility is an emerging sector in Ghana. A data-driven propensity model offers significant benefits to financial institutions:
* **Targeted Marketing:** Efficiently prioritize customers most likely to adopt EV loans, increasing marketing ROI.
* **Product Expansion:** Identify early adopters and tailor EV financing products, especially given the current low EV penetration due to high upfront costs.
* **Risk Management:** Profile likely EV customers to better assess credit risk, potentially identifying stable, higher-income households.
* **Competitive Advantage:** Proactively capture market share in the growing green mobility sector.

This model aims to turn household survey data into actionable insights for financial product targeting, aligning with Ghana’s sustainability goals.


## 2. Setup and Data Loading

### 2.1 Import Libraries

In [6]:
import pandas as pd
import numpy as np
import sys
import os

# Add the src directory to the Python path to allow importing our custom modules
# This makes sure Python knows where to find 'src'
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Import our custom functions and configurations
from src import config
from src import data_processing as dp

print("Libraries and custom modules imported successfully.")
print(f"Project Root (detected): {config.PROJECT_ROOT}")
print(f"Raw Data Path: {config.RAW_DATA_PATH}")

Libraries and custom modules imported successfully.
Project Root (detected): /home/mr-rey/Joseph/Projects/Python/gh-ev-finance-propensity
Raw Data Path: /home/mr-rey/Joseph/Projects/Python/gh-ev-finance-propensity/data/01_raw/AHIES2022Q1_2023Q3_SEC01234_202402.csv


### 2.2 Load Raw AHIES Dataset

The code below loads the AHIES dataset, trying several encodings in turn because survey data can have encoding issues, and then selects and renames the columns listed in the project config.
**Note:** Place the raw data file `AHIES2022Q1_2023Q3_SEC01234_202402.csv` in the `../data/01_raw/` directory before running this cell.
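
The source of `dp.load_raw_data` is not shown in this notebook, so the following is only a minimal sketch of how an encoding fallback might work. The function name `load_csv_with_fallback` and the encoding order are assumptions for illustration, not the project's actual implementation.

```python
import pandas as pd


# Hypothetical helper: the real dp.load_raw_data may behave differently.
def load_csv_with_fallback(path, encodings=("utf-8", "ISO-8859-1")):
    """Try each encoding in order and return the first DataFrame that parses."""
    for encoding in encodings:
        try:
            df = pd.read_csv(path, encoding=encoding, low_memory=False)
        except UnicodeDecodeError:
            # This encoding cannot decode the file; try the next one.
            continue
        print(f"File loaded with {encoding} encoding from '{path}'.")
        return df
    raise ValueError(f"Could not decode '{path}' with any of: {encodings}")
```

ISO-8859-1 maps every byte to a character, so placing it last means the loop will not fail on decoding, although it may silently mis-render text that was written in a different encoding.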


In [7]:
# Load raw data using our function
df_raw = dp.load_raw_data(config.RAW_DATA_PATH)

# Select and rename columns using our function and config
df_selected = dp.select_and_rename_cols(df_raw, config.COLUMNS_TO_EXTRACT)

✅ File loaded with ISO-8859-1 encoding from '/home/mr-rey/Joseph/Projects/Python/gh-ev-finance-propensity/data/01_raw/AHIES2022Q1_2023Q3_SEC01234_202402.csv'.
✅ 17 columns selected and renamed.


### 2.3 Initial Inspection of Raw Data (Post Selection)

Let's take a quick look at the raw loaded data with selected columns.

In [8]:
if df_selected is not None:
    print("--- Selected Data: First 5 Rows ---")
    print(df_selected.head())
    print("\n" + "="*50 + "\n")
    print("--- Selected Data: Shape ---")
    print(df_selected.shape)
    print("\n" + "="*50 + "\n")
    print("--- Selected Data: Info ---")
    df_selected.info()
    print("\n" + "="*50 + "\n")
    print("--- Selected Data: Missing Values (%) ---")
    print((df_selected.isnull().sum() * 100 / len(df_selected)).sort_values(ascending=False))
else:
    print("Data selection failed. Cannot inspect.")

--- Selected Data: First 5 Rows ---
   household_id  person_id   region urban_rural     sex  age  \
0        500109          2  WESTERN      Urbanb  Female   43   
1        300109          3  WESTERN      Urbanb    Male   21   
2        600114          1  WESTERN      Urbanb    Male   39   
3        600102          4  WESTERN      Urbanb    Male    3   
4        300114          3  WESTERN      Urbanb  Female    3   

                    marital_status highest_education_level  grade_completed  \
0  Married (Customary/Traditional)                       0              NaN   
1                    Never married                 SSS/SHS              3.0   
2  Married (Customary/Traditional)                 SSS/SHS              3.0   
3                              NaN                 Nursery              0.0   
4                              NaN                       0              0.0   

  still_in_school  primary_job_income_monthly  secondary_job_income_monthly  \
0             NaN        

### 2.4 Data Types and Columns

In [10]:
## Data types of selected columns
print("Data types:")
print(df_selected.dtypes)

Data types:
household_id                                int64
person_id                                   int64
region                                     object
urban_rural                                object
sex                                        object
age                                         int64
marital_status                             object
highest_education_level                    object
grade_completed                           float64
still_in_school                            object
primary_job_income_monthly                float64
secondary_job_income_monthly              float64
tuition_fee_paid_last_12m                 float64
transportation_cost_to_school_last_12m    float64
school_food_cost_last_12m                 float64
total_medical_expense_last_12m            float64
worked_last_7_days                         object
dtype: object


### 2.5 Summary Statistics

In [11]:
df_selected.describe()

| statistic | household_id | person_id | age | grade_completed | primary_job_income_monthly | secondary_job_income_monthly | tuition_fee_paid_last_12m | transportation_cost_to_school_last_12m | school_food_cost_last_12m | total_medical_expense_last_12m |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 356437.0 | 356437.0 | 356437.0 | 286553.0 | 26640.0 | 896.0 | 9063.0 | 2667.0 | 20551.0 | 170106.0 |
| mean | 425031.197945 | 3.857627 | 24.356832 | 2.427181 | 1056.847322 | 399.746652 | 407.885761 | 217.266172 | 204.35843 | 19.902662 |
| std | 200614.126241 | 2.785346 | 19.670018 | 1.586052 | 1998.136721 | 946.34905 | 2119.156419 | 224.479796 | 192.677503 | 173.223288 |
| min | 100101.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 244215.0 | 2.0 | 9.0 | 1.0 | 150.0 | 50.0 | 70.0 | 80.0 | 100.0 | 0.0 |
| 50% | 426506.0 | 3.0 | 19.0 | 3.0 | 500.0 | 100.0 | 150.0 | 150.0 | 150.0 | 0.0 |
| 75% | 611205.0 | 5.0 | 37.0 | 3.0 | 1500.0 | 400.0 | 320.0 | 300.0 | 280.0 | 0.0 |
| max | 760019.0 | 31.0 | 120.0 | 10.0 | 149400.0 | 15000.0 | 176300.0 | 3000.0 | 4500.0 | 45750.0 |


### 2.6 Initial Missing Values Summary

In [19]:
# Optional: reload the src module after editing it, without restarting the kernel.
# Remove this cell if it is not needed.
import importlib
from src import data_processing

importlib.reload(data_processing)

# Rebind the shorthand alias
from src import data_processing as dp

✅ Data processing functions defined.


In [14]:
# initial display summary of missing values
dp.display_missing_values_summary(df_selected)


--- Missing Values Summary ---
                           Column Name  Missing Values  Percentage Missing (%)  Total Rows
          secondary_job_income_monthly          355541               99.748623      356437
transportation_cost_to_school_last_12m          353770               99.251761      356437
             tuition_fee_paid_last_12m          347374               97.457335      356437
             school_food_cost_last_12m          335886               94.234325      356437
            primary_job_income_monthly          329797               92.526028      356437
                       still_in_school          201346               56.488524      356437
        total_medical_expense_last_12m          186331               52.275998      356437
                        marital_status          114434               32.104972      356437
                       grade_completed           69884               19.606270      356437
                    worked_last_7_days           44780    

| Column Name | Missing Values | Percentage Missing (%) | Total Rows |
|---|---|---|---|
| household_id | 0 | 0.0 | 356437 |
| person_id | 0 | 0.0 | 356437 |
| region | 0 | 0.0 | 356437 |
| urban_rural | 0 | 0.0 | 356437 |
| sex | 0 | 0.0 | 356437 |
| age | 0 | 0.0 | 356437 |
| marital_status | 114434 | 32.104972 | 356437 |
| highest_education_level | 25550 | 7.168167 | 356437 |
| grade_completed | 69884 | 19.60627 | 356437 |
| still_in_school | 201346 | 56.488524 | 356437 |
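
A summary with the same four columns as the one above could be computed roughly as follows. This is a sketch under assumptions: `missing_values_summary` is a hypothetical stand-in, since the body of `dp.display_missing_values_summary` is not shown here, and the toy DataFrame is invented purely to make the example runnable.

```python
import pandas as pd


# Hypothetical stand-in for dp.display_missing_values_summary.
def missing_values_summary(df):
    """Return per-column missing counts and percentages, most-missing first."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "Column Name": missing.index,
        "Missing Values": missing.to_numpy(),
        "Percentage Missing (%)": missing.to_numpy() * 100 / len(df),
        "Total Rows": len(df),
    })
    return summary.sort_values(
        "Missing Values", ascending=False, ignore_index=True
    )


# Toy data (not from AHIES) to show the shape of the result.
toy = pd.DataFrame({
    "age": [43, 21, 39, 3],
    "income": [None, 500.0, None, None],
})
print(missing_values_summary(toy))
```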


## 3. Data Cleaning and Filtering

### 3.1 Data Filtering by Age

In [16]:
MINIMUM_AGE_FOR_ANALYSIS = 30  # Minimum age in years (inclusive); adjust as needed

df_age_filtered = dp.filter_by_age(df_selected, min_age=MINIMUM_AGE_FOR_ANALYSIS, age_column='age')

# Post-filter summary of missing values
dp.display_missing_values_summary(df_age_filtered)

✅ Filtered by age: 117924 rows remaining (>= 30 years). 238513 rows removed.

--- Missing Values Summary ---
                           Column Name  Missing Values  Percentage Missing (%)  Total Rows
             school_food_cost_last_12m          117874               99.957600      117924
transportation_cost_to_school_last_12m          117847               99.934704      117924
             tuition_fee_paid_last_12m          117791               99.887215      117924
          secondary_job_income_monthly          117221               99.403853      117924
                       still_in_school          115920               98.300600      117924
            primary_job_income_monthly          100614               85.321054      117924
        total_medical_expense_last_12m           60665               51.444150      117924
                       grade_completed           32090               27.212442      117924
               highest_education_level              76                0.

| Column Name | Missing Values | Percentage Missing (%) | Total Rows |
|---|---|---|---|
| household_id | 0 | 0.0 | 117924 |
| person_id | 0 | 0.0 | 117924 |
| region | 0 | 0.0 | 117924 |
| urban_rural | 0 | 0.0 | 117924 |
| sex | 0 | 0.0 | 117924 |
| age | 0 | 0.0 | 117924 |
| marital_status | 0 | 0.0 | 117924 |
| highest_education_level | 76 | 0.064448 | 117924 |
| grade_completed | 32090 | 27.212442 | 117924 |
| still_in_school | 115920 | 98.3006 | 117924 |
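
For readers without access to `src/data_processing.py`, the age filter used above might look something like the sketch below. The name `filter_by_min_age` and the decision to return a copy are assumptions, not the project's confirmed behaviour; only the inclusive `>=` comparison is taken from the cell output.

```python
import pandas as pd


# Hypothetical sketch: the real dp.filter_by_age may differ.
def filter_by_min_age(df, min_age, age_column="age"):
    """Keep rows whose age is at least min_age and return an independent copy."""
    if age_column not in df.columns:
        raise KeyError(f"Column '{age_column}' not found in DataFrame.")
    filtered = df.loc[df[age_column] >= min_age].copy()
    removed = len(df) - len(filtered)
    print(
        f"Filtered by age: {len(filtered)} rows remaining "
        f"(>= {min_age} years). {removed} rows removed."
    )
    return filtered


# Toy data (not from AHIES) to exercise the function.
people = pd.DataFrame({"person_id": [1, 2, 3, 4, 5], "age": [43, 21, 39, 3, 30]})
adults = filter_by_min_age(people, min_age=30)
```

Returning a copy rather than a view means that columns added to the filtered frame afterwards, such as an `income_missing` flag, do not trigger pandas' chained-assignment warning or touch the original frame.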


In [18]:
df_age_filtered["income_missing"] = df_age_filtered["primary_job_income_monthly"].isna()
pd.crosstab(df_age_filtered["worked_last_7_days"], df_age_filtered["income_missing"])


| worked_last_7_days | income_missing = False | income_missing = True |
|---|---|---|
| No | 2788 | 97389 |
| Yes | 14522 | 3224 |
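
The raw counts are easier to read as row shares. The sketch below rebuilds the table from the counts printed above, so it runs without the AHIES file, and divides each row by its total; the column names `income_present` and `income_missing` are labels chosen here for the `False` and `True` columns. On the real data, passing `normalize="index"` to `pd.crosstab` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the crosstab output above.
counts = pd.DataFrame(
    {
        "income_present": [2788, 14522],   # income_missing == False
        "income_missing": [97389, 3224],   # income_missing == True
    },
    index=pd.Index(["No", "Yes"], name="worked_last_7_days"),
)

# Divide every row by its own total to get within-row proportions.
row_share = counts.div(counts.sum(axis=1), axis=0)
print(row_share.round(3))
```

Read this way, primary job income is missing for roughly 97% of respondents who did not work in the last 7 days, compared with roughly 18% of those who did, which suggests that most of the missingness reflects not working rather than non-response.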
