<p align="center">
  <h1 style="text-align: center;">🔍📊 EV Financing Propensity Model - Ghana 🇬🇭🚗⚡️</h1>
</p>


### 📘 Notebook 1: Data Ingestion, Initial Cleaning, and Exploratory Data Analysis (EDA)

This notebook initiates the EV Financing Propensity Model project by focusing on the following key steps:

* 🔧 1. Problem Definition and Business Context
* 📥 2. Loading the Raw AHIES 2022–2023 Dataset
* 🧹 3. Initial Cleaning and Feature Selection
* 💾 4. Save the Initially Cleaned Dataset
* 📊 5. Next Steps: Deeper Cleaning and Exploratory Data Analysis (EDA)


## 1. Project Context

### 1.1 Problem Statement
We aim to build a propensity model that identifies Ghanaian households or individuals who are good targets for an electric vehicle (EV) financing loan. The core task is supervised binary classification: predict whether a household or individual is a good prospect for an EV financing loan (label = 1) or not (label = 0). Because the AHIES dataset contains no EV loan uptake label, we will simulate this target variable later in the project from plausible socioeconomic features. The model will then estimate the likelihood of seeking or qualifying for an EV loan from features such as income, transport spending, and urban/rural status.
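
Since the target has to be simulated, one possible rule-based approach can be sketched as follows. This is a minimal illustration only: the function name `simulate_ev_label`, the column `household_income_monthly`, the 75th-percentile cutoff, and the toy rows are all assumptions made for the example, not the rule the project settles on.

```python
import pandas as pd


def simulate_ev_label(households: pd.DataFrame,
                      income_quantile: float = 0.75) -> pd.Series:
    """Return a 0/1 label: 1 for urban households at or above an income quantile.

    Hypothetical rule for illustration; the cutoff and the choice of
    features are assumptions, not values taken from AHIES.
    """
    income = households["household_income_monthly"]
    income_cutoff = income.quantile(income_quantile)
    is_high_income = income >= income_cutoff
    is_urban = households["urban_rural"].str.lower().eq("urban")
    return (is_high_income & is_urban).astype(int)


# Toy household-level rows (made up, not survey data)
toy_households = pd.DataFrame({
    "household_income_monthly": [500.0, 1200.0, 4000.0, 6500.0],
    "urban_rural": ["Rural", "Urban", "Rural", "Urban"],
})
toy_households["ev_loan_prospect"] = simulate_ev_label(toy_households)
print(toy_households)
```

A deterministic rule like this makes the label fully recoverable from its inputs, so a real simulation would probably need some noise or additional features to give the classifier something non-trivial to learn.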


### 1.2 Business Case
Electric mobility is an emerging sector in Ghana. A data-driven propensity model offers significant benefits to financial institutions:
* **Targeted Marketing:** Efficiently prioritize customers most likely to adopt EV loans, increasing marketing ROI.
* **Product Expansion:** Identify early adopters and tailor EV financing products, especially given the current low EV penetration due to high upfront costs.
* **Risk Management:** Profile likely EV customers to better assess credit risk, potentially identifying stable, higher-income households.
* **Competitive Advantage:** Proactively capture market share in the growing green mobility sector.

This model aims to turn household survey data into actionable insights for financial product targeting, aligning with Ghana’s sustainability goals.


## 2. Setup and Data Loading

### 2.1 Import Libraries

In [4]:
import pandas as pd
import numpy as np
import os

# Optional: settings to display more rows/columns if needed later
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', 100)

print("Libraries imported successfully.")

Libraries imported successfully.


### 2.2 Load Raw AHIES Dataset

The following code loads the AHIES dataset with an explicit encoding, because survey exports can contain characters that are not valid UTF-8. It tries ISO-8859-1 first and keeps windows-1252 as a fallback.
**Note:** Ensure your raw data file `AHIES2022Q1_2023Q3_SEC01234_202402.csv` is placed in the `../data/01_raw/` directory.


In [5]:
# Define file paths using os.path.join for better compatibility
raw_data_directory = os.path.join("..", "data", "01_raw")  # Relative to the notebook's directory
raw_file_name = "AHIES2022Q1_2023Q3_SEC01234_202402.csv"  # Raw AHIES extract
raw_file_path = os.path.join(raw_data_directory, raw_file_name)

print(f"Raw file path: {raw_file_path}")

Raw file path: ../data/01_raw/AHIES2022Q1_2023Q3_SEC01234_202402.csv


In [6]:
df_raw = None # Initialize df_raw

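try:
    # ISO-8859-1 maps every byte to a character, so this call does not raise
    # UnicodeDecodeError; the windows-1252 fallback below is kept only as a safeguard.
    df_raw = pd.read_csv(raw_file_path, encoding="ISO-8859-1", low_memory=False)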
    print(f"✅ File '{raw_file_name}' loaded successfully with ISO-8859-1 encoding from '{raw_file_path}'.")
except UnicodeDecodeError:
    try:
        df_raw = pd.read_csv(raw_file_path, encoding="windows-1252", low_memory=False)
        print(f"✅ File '{raw_file_name}' loaded successfully with windows-1252 encoding from '{raw_file_path}'.")
    except Exception as e:
        print(f"❌ Error loading file '{raw_file_name}' with windows-1252 encoding: {e}")
except FileNotFoundError:
    print(f"❌ Error: The file was not found at '{raw_file_path}'.")
    print("Please ensure the file exists in the specified directory and the name is correct.")
except Exception as e:
    print(f"❌ An unexpected error occurred while loading the file: {e}")

✅ File 'AHIES2022Q1_2023Q3_SEC01234_202402.csv' loaded successfully with ISO-8859-1 encoding from '../data/01_raw/AHIES2022Q1_2023Q3_SEC01234_202402.csv'.
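
Because ISO-8859-1 can decode any byte sequence, an encoding fallback only does useful work when the stricter encodings are tried first. The sketch below shows one way to order the attempts. The helper name `read_csv_with_fallback` and the default encoding order are assumptions for illustration, and the demo file is a small temporary CSV rather than the AHIES extract.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "windows-1252", "ISO-8859-1")):
    """Try each encoding in order and return (dataframe, encoding_used).

    Hypothetical helper. ISO-8859-1 goes last because it never raises
    UnicodeDecodeError, so anything listed after it would not be tried.
    """
    for encoding in encodings:
        try:
            frame = pd.read_csv(path, encoding=encoding, low_memory=False)
            return frame, encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of the encodings {encodings} could decode '{path}'.")


# Demo on a tiny file containing a byte (0xE9) that is invalid on its own in UTF-8
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "wb") as handle:
        handle.write(b"name,region\nAm\xe9lie,WESTERN\n")
    demo_df, encoding_used = read_csv_with_fallback(demo_path)

print(encoding_used)
print(demo_df)
```

Note that a successful decode does not prove the encoding is correct: windows-1252 and ISO-8859-1 both accept this file, so it is still worth spot-checking text columns for mangled characters.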


### 2.3 Initial Inspection of Raw Data

Let's take a quick look at the raw loaded data before selecting columns.

In [7]:
if df_raw is not None and not df_raw.empty:
    print("--- Raw Data: First 5 Rows ---")
    print(df_raw.head())
    print("\n" + "="*50 + "\n")

    print("--- Raw Data: Shape (rows, columns) ---")
    print(df_raw.shape)
    print("\n" + "="*50 + "\n")

    print("--- Raw Data: Info (dtypes, non-null counts) ---")
    df_raw.info(verbose=True, show_counts=True)
    print("\n" + "="*50 + "\n")
else:
    print("Raw dataframe `df_raw` is not loaded or is empty. Cannot perform inspection.")

--- Raw Data: First 5 Rows ---
     hhid quarter  cluster  HholdID  personid   s1aq1  \
0  500109  2023Q1        1        9         2  Female   
1  300109  2022Q3        1        9         3    Male   
2  600114  2023Q2        1       14         1    Male   
3  600102  2023Q2        1        2         4    Male   
4  300114  2022Q3        1       14         3  Female   

                                   s1aq2  s1aq4y  s1aq4m  \
0  Spouse (Wife/Husband/Living together)      43     NaN   
1                   Child (Son/Daughter)      21     NaN   
2                                   Head      39     NaN   
3                   Child (Son/Daughter)       3     0.0   
4                   Child (Son/Daughter)       3     1.0   

                             s1aq5  ...   region        BMI pop_weight  \
0  Married (Customary/Traditional)  ...  WESTERN        NaN  923.53058   
1                    Never married  ...  WESTERN  20.672762  863.96033   
2  Married (Customary/Traditional)  ...  WE
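
For a frame this wide, the `head()` and `info()` output can be hard to scan. A compact per-column summary is one alternative, sketched below. The helper name `summarize_columns` is hypothetical, and the toy frame reuses a few raw column names with made-up values.

```python
import numpy as np
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, and percent missing."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
    })
    # Columns with the most missing data come first
    return summary.sort_values("missing_pct", ascending=False)


# Toy frame (values are made up for the example)
toy_raw = pd.DataFrame({
    "s1aq4y": [43, 21, 39, 3],
    "s1aq4m": [np.nan, np.nan, np.nan, 0.0],
    "BMI": [np.nan, 20.5, 22.1, np.nan],
})
column_summary = summarize_columns(toy_raw)
print(column_summary)
```

On the real data the same call would be `summarize_columns(df_raw)`, guarded by the same `df_raw is not None` check used in the inspection cell.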

## 3. Column Selection and Renaming for EV Propensity Model

Based on the project goals, we will extract a subset of columns relevant to understanding socioeconomic status, demographics, income, and expenditure patterns that might influence EV adoption. We will also rename them to be more intuitive.


In [8]:
# Dictionary mapping original column names to new, more descriptive names
columns_to_extract_and_rename = {
    "hhid": "household_id",
    "personid": "person_id",
    "region": "region",
    "urbrur": "urban_rural",

    # Demographics
    "s1aq1": "sex",
    "s1aq4y": "age",
    "s1aq5": "marital_status",

    # Education
    "s2aq3": "highest_education_level",
    "s2aq4": "grade_completed",
    "s2aq6": "still_in_school",

    # Income - Note: These are individual level. Aggregation to household level might be needed later.
    "s4aq55a": "primary_job_income_monthly", # Assuming monthly based on typical survey questions
    "s4bq9": "secondary_job_income_monthly", # Assuming monthly
    # "s4eq9": "expected_minimum_wage", # This might be less about actual income

    # Expenditure - Note: These are specific educational/medical expenses.
    # We'll need to look for broader transport expenditure later or use these as proxies if needed.
    "s2aq11a2": "tuition_fee_paid_last_12m",
    "s2aq11a15": "transportation_cost_to_school_last_12m",
    "s2aq11a16": "school_food_cost_last_12m",
    "s3aq21": "total_medical_expense_last_12m",

    # Employment
    "s4aq1": "worked_last_7_days", # Likely employment status indicator
    # "s4aq2": "total_work_days", # May need context if this is per week/month
    # The daily work hours might be too granular for initial model, but good to have.
    # "s4aq3a": "work_hours_day_1",
    # "s4aq3b": "work_hours_day_2",
    # "s4aq3c": "work_hours_day_3",
    # "s4aq3d": "work_hours_day_4",
    # "s4aq3e": "work_hours_day_5",
    # "s4aq3f": "work_hours_day_6",
    # "s4aq3g": "work_hours_day_7",

    # Migration (Potentially useful for stability/lifestyle)
    # "s1bq1": "born_in_this_town",
    # "s1bq2a": "born_in_another_region", # This might be 'region_of_birth'
    # "s1bq5a": "previous_residence_type",
    # "s1bq6": "years_in_previous_location",
}

# Keep only the columns that exist in df_raw to prevent KeyErrors.
# Guard against df_raw being None (failed load) before touching .columns.
raw_columns = set(df_raw.columns) if df_raw is not None else set()
existing_columns_to_extract = {
    original_col: new_col
    for original_col, new_col in columns_to_extract_and_rename.items()
    if original_col in raw_columns
}

missing_from_raw = set(columns_to_extract_and_rename) - set(existing_columns_to_extract)
if df_raw is not None and missing_from_raw:
    print("⚠️ Warning: The following specified columns were NOT found in the raw dataset and will be skipped:")
    for col in sorted(missing_from_raw):
        print(f"  - {col}")
df_selected = None # Initialize

if df_raw is not None and not df_raw.empty:
    if not existing_columns_to_extract:
        print("❌ Error: None of the specified columns for extraction exist in the loaded dataframe.")
        print("Please check your `columns_to_extract_and_rename` dictionary against the actual columns in the raw data.")
    else:
        df_selected = (
            df_raw[list(existing_columns_to_extract)]
            .rename(columns=existing_columns_to_extract)
            .copy()  # Independent copy, avoids SettingWithCopyWarning on later edits
        )
        print("✅ Columns extracted and renamed. New dataframe `df_selected` created.")
else:
    print("Raw dataframe `df_raw` is not loaded or is empty. Cannot select columns.")

✅ Columns extracted and renamed. New dataframe `df_selected` created.
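
The income columns selected above are recorded per person, so a household-level view needs an aggregation step. The sketch below shows one way to do it with `groupby` and named aggregation on the renamed columns. The toy rows are made up, and treating missing income as zero, as well as the output names `household_size` and `household_income_monthly`, are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Toy person-level rows using the renamed columns (values are made up)
toy_people = pd.DataFrame({
    "household_id": [101, 101, 102, 102, 102],
    "person_id": [1, 2, 1, 2, 3],
    "urban_rural": ["Urban", "Urban", "Rural", "Rural", "Rural"],
    "primary_job_income_monthly": [1500.0, 800.0, np.nan, 600.0, np.nan],
    "secondary_job_income_monthly": [np.nan, 200.0, np.nan, np.nan, np.nan],
})

income_columns = ["primary_job_income_monthly", "secondary_job_income_monthly"]

# Assumption: a missing income means no income from that source
toy_people["person_income_monthly"] = toy_people[income_columns].fillna(0).sum(axis=1)

household_level = (
    toy_people.groupby("household_id")
    .agg(
        household_size=("person_id", "count"),
        household_income_monthly=("person_income_monthly", "sum"),
        urban_rural=("urban_rural", "first"),
    )
    .reset_index()
)
print(household_level)
```

If the survey interviews the same household in several quarters, the grouping key may also need the raw `quarter` column, which is not part of the current selection.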


## 4. Save Initially Cleaned (Selected) Dataset

We will save this subset of data with renamed columns to the `02_intermediate` directory for further processing in subsequent notebooks.


In [9]:
intermediate_data_directory = os.path.join("..", "data", "02_intermediate")  # Relative to the notebook's directory
cleaned_file_name = "ahies_selected_for_ev_propensity.csv"  # Selected and renamed columns only
cleaned_file_path = os.path.join(intermediate_data_directory, cleaned_file_name)

if df_selected is not None and not df_selected.empty:
    try:
        # Create the intermediate directory if it doesn't exist
        os.makedirs(intermediate_data_directory, exist_ok=True)

        df_selected.to_csv(cleaned_file_path, index=False)
        print(f"✅ Selected and renamed dataset saved to '{cleaned_file_path}'")
    except Exception as e:
        print(f"❌ Error saving the cleaned dataset: {e}")
else:
    print("Selected dataframe `df_selected` is not created or is empty. Cannot save.")

✅ Selected and renamed dataset saved to '../data/02_intermediate/ahies_selected_for_ev_propensity.csv'
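
A quick read-back is a cheap way to confirm that the saved file matches what was in memory. The sketch below runs the check on a toy frame written to a temporary directory; the file name `toy_selected.csv` and the rows are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_selected (values are made up)
toy_selected = pd.DataFrame({
    "household_id": [101, 102],
    "region": ["WESTERN", "ASHANTI"],
    "age": [43, 21],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_selected.csv")
    toy_selected.to_csv(toy_path, index=False)
    reloaded = pd.read_csv(toy_path)

# Compare shape and column order between the in-memory and reloaded frames
same_shape = reloaded.shape == toy_selected.shape
same_columns = list(reloaded.columns) == list(toy_selected.columns)
print(f"Round trip OK: {same_shape and same_columns}")
```

CSV does not store dtypes, so columns such as categoricals or dates come back as plain strings or numbers and may need to be converted again in the next notebook.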


## 5. Next Steps

* **Data Cleaning:** Dive deeper into cleaning the `df_selected` DataFrame. This will involve:
    * Handling missing values (imputation or removal).
    * Correcting data types if necessary.
    * Addressing any outliers or erroneous values.
    * Transforming variables (e.g., decoding categorical variables from numbers to meaningful labels based on the data dictionary).
* **Exploratory Data Analysis (EDA):** Perform detailed EDA on the cleaned dataset to understand distributions, relationships, and gather insights that will inform feature engineering and modeling.
* **Feature Engineering:** Create new relevant features.
* **Label Simulation:** Develop and apply the logic to create our target variable.
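
The data cleaning steps listed above could look roughly like the sketch below. It runs on a toy frame with made-up values; the median imputation, the 99th-percentile cap, and the `{1: "Yes", 2: "No"}` code mapping are illustrative assumptions, not choices taken from the AHIES data dictionary.

```python
import numpy as np
import pandas as pd

# Toy rows using renamed column names (values are made up)
toy_clean = pd.DataFrame({
    "age": ["43", "21", "not stated", "3"],
    "primary_job_income_monthly": [1500.0, np.nan, 250000.0, 600.0],
    "worked_last_7_days": [1, 2, 1, 2],
})

# Correct data types: entries that are not numbers become NaN
toy_clean["age"] = pd.to_numeric(toy_clean["age"], errors="coerce")

# Handle missing values: keep a flag, then impute with the median
toy_clean["income_was_missing"] = toy_clean["primary_job_income_monthly"].isna()
median_income = toy_clean["primary_job_income_monthly"].median()
toy_clean["primary_job_income_monthly"] = (
    toy_clean["primary_job_income_monthly"].fillna(median_income)
)

# Address outliers: cap income at the 99th percentile (assumed threshold)
income_cap = toy_clean["primary_job_income_monthly"].quantile(0.99)
toy_clean["primary_job_income_monthly"] = (
    toy_clean["primary_job_income_monthly"].clip(upper=income_cap)
)

# Decode numeric codes into labels (hypothetical mapping)
toy_clean["worked_last_7_days"] = toy_clean["worked_last_7_days"].map({1: "Yes", 2: "No"})

print(toy_clean)
```

Keeping the `income_was_missing` flag lets later modeling steps distinguish imputed values from reported ones.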