In [1]:
import numpy as np
import pandas as pd
import os
import sys
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(project_root)
from src.data_enrich import DataEnrichment
from src.enrich_impact import DataEnrichment
from src.logging_config import logging
# Define file paths
raw_path = "../data/raw/ethiopia_fi_unified_data.xlsx"
processed_output = "../data/processed/ethiopia_fi_unified_data.xlsx"
# Ensure processed directory exists
os.makedirs("../data/processed", exist_ok=True)

2026-02-01 19:41:26,885 - INFO - Logging Initialized


In [2]:
initial_df = pd.read_excel("../data/raw/ethiopia_fi_unified_data.xlsx")
initial_impact_df = pd.read_excel(
    "../data/raw/ethiopia_fi_unified_data.xlsx", 
    sheet_name="Impact_sheet", 
)
logging.info("load the orginal dataset")

2026-02-01 19:41:27,365 - INFO - load the orginal dataset


In [3]:
# 1. Initialize your class with your current data
# (Assuming your initial CSV is loaded as 'initial_df')
profiler = DataEnrichment(initial_df)


# 2. Define the Enriched Array of Objects
new_data = [
    # --- ACCESS PILLAR: Infrastructure & Surveys ---
    {
        'record_id': 'REC_0011', 'record_type': 'observation', 'pillar': 'ACCESS',
        'indicator': 'Mobile Subscription Penetration', 'indicator_code': 'ACC_MOBILE_PEN',
        'value_numeric': 61.4, 'observation_date': '2025-12-31', 'source_name': 'DataReportal',
        'source_url': 'https://datareportal.com/reports/digital-2026-ethiopia', 'confidence': 'high',
        'notes': '93.7M connections / 152.7M population'
    },
    {
        'record_id': 'REC_0012', 'record_type': 'observation', 'pillar': 'ACCESS',
        'indicator': 'Fayda Digital ID Enrollment', 'indicator_code': 'ACC_FAYDA',
        'value_numeric': 12000000.0, 'observation_date': '2025-02-28', 'source_name': 'World Bank',
        'confidence': 'high', 'notes': 'Over 12 million registered as of Feb 2025'
    },
    {
        'record_id': 'REC_0013', 'record_type': 'observation', 'pillar': 'ACCESS',
        'indicator': 'Account ownership, total (% age 15+)', 'indicator_code': 'ACC_OWNERSHIP',
        'value_numeric': 49.0, 'observation_date': '2024-12-31', 'source_name': 'World Bank Findex 2025',
        'source_url': 'https://www.worldbank.org/en/publication/globalfindex', 'confidence': 'high',
        'notes': 'Stagnation noted: grew only 3% since 2021 despite digital account surge.'
    },
    {
        'record_id': 'REC_0014', 'record_type': 'observation', 'pillar': 'ACCESS',
        'indicator': 'Account ownership, female (% age 15+)', 'indicator_code': 'ACC_OWN_FEMALE',
        'value_numeric': 41.6, 'observation_date': '2024-12-31', 'source_name': 'World Bank Findex 2025',
        'confidence': 'high', 'notes': 'Gender gap persists at approx 14%.'
    },
    {
        'record_id': 'REC_0015', 'record_type': 'observation', 'pillar': 'ACCESS',
        'indicator': 'Registered Mobile Money Agents', 'indicator_code': 'ACC_MM_AGENTS',
        'value_numeric': 500000.0, 'observation_date': '2025-06-30', 'source_name': 'NBE Annual Report 2024',
        'source_url': 'https://nbe.gov.et/', 'confidence': 'high',
        'notes': 'Massive expansion of agent network to over half a million.'
    },
    {
        'record_id': 'REC_0016', 'record_type': 'observation', 'pillar': 'ACCESS',
        'indicator': 'Mobile Phone Ownership (% adults)', 'indicator_code': 'ACC_PHONE_OWN',
        'value_numeric': 58.0, 'observation_date': '2024-12-31', 'source_name': 'ITU DataHub',
        'source_url': 'https://datahub.itu.int/', 'confidence': 'high',
        'notes': 'Critical barrier: Phone ownership is significantly lower than SSA average (81%).'
    },
    {
        'record_id': 'REC_0017', 'record_type': 'observation', 'pillar': 'ACCESS',
        'indicator': 'Smartphone Penetration Rate', 'indicator_code': 'ACC_SMARTPHONE',
        'value_numeric': 21.7, 'observation_date': '2025-10-01', 'source_name': 'DataReportal 2026',
        'source_url': 'https://datareportal.com/reports/digital-2026-ethiopia', 'confidence': 'high',
        'notes': 'Significant barrier: only ~22% have internet-enabled smartphones.'
    },

    # --- USAGE PILLAR: Transactions & Adoption ---
    {
        'record_id': 'REC_0018', 'record_type': 'observation', 'pillar': 'USAGE',
        'indicator': 'P2P Transaction Count', 'indicator_code': 'USG_P2P_COUNT',
        'value_numeric': 128300000.0, 'observation_date': '2025-07-07', 'source_name': 'EthSwitch',
        'source_url': 'https://ethswitch.com/', 'confidence': 'high',
        'notes': '128.3 million P2P transactions, +158% YoY growth'
    },
    {
        'record_id': 'REC_0019', 'record_type': 'observation', 'pillar': 'USAGE',
        'indicator': 'P2P Transaction Value', 'indicator_code': 'USG_P2P_VALUE',
        'value_numeric': 577700000000.0, 'observation_date': '2025-07-07', 'source_name': 'EthSwitch',
        'source_url': 'https://ethswitch.com/', 'confidence': 'high',
        'notes': 'ETB 577.7 billion total value'
    },
    {
        'record_id': 'REC_0020', 'record_type': 'observation', 'pillar': 'USAGE',
        'indicator': 'Mobile Money Account Ownership (% age 15+)', 'indicator_code': 'USG_MM_ACC',
        'value_numeric': 19.4, 'observation_date': '2024-12-31', 'source_name': 'IMF FAS 2025',
        'source_url': 'https://data.imf.org/', 'confidence': 'high',
        'notes': 'Quadrupled from 4.7% in 2021. Primary growth engine.'
    },
    {
        'record_id': 'REC_0021', 'record_type': 'observation', 'pillar': 'USAGE',
        'indicator': 'Total Digital Transaction Value (ETB)', 'indicator_code': 'USG_DIG_VAL',
        'value_numeric': 7700000000000.0, 'observation_date': '2024-07-07', 'source_name': 'NBE Annual Report',
        'confidence': 'high', 'notes': 'Reached 7.7 Trillion ETB in FY2023/24.'
    },
    {
        'record_id': 'REC_0022', 'record_type': 'observation', 'pillar': 'USAGE',
        'indicator': 'M-Pesa Active Users (90-day)', 'indicator_code': 'USG_MPESA_ACTIVE',
        'value_numeric': 5000000.0, 'observation_date': '2026-01-21', 'source_name': 'Safaricom Ethiopia',
        'confidence': 'high', 'notes': 'Measuring private sector competition impact.'
    },

    # --- POLICY TARGETS & EVENTS ---
    {
        'record_id': 'REC_0023', 'record_type': 'target', 'pillar': 'ACCESS',
        'indicator': 'Target Account Ownership Rate', 'indicator_code': 'ACC_OWN_TGT',
        'value_numeric': 70.0, 'observation_date': '2027-12-31', 'source_name': 'NFIS-II/III Projections',
        'confidence': 'medium', 'notes': 'Ambitious target set by NBE for 2027.'
    },
    {
        'record_id': 'REC_0024', 'record_type': 'event', 'category': 'policy',
        'event_name': 'National ID (Fayda) Mandate', 'event_date': '2025-01-01',
        'notes': 'Mandating Fayda for financial services to streamline KYC.'
    },
    # Add these to your 'new_data' list in the script
{
    'record_id': 'REC_0025', 'record_type': 'observation', 'pillar': 'USAGE',
    'indicator': 'Active Mobile Money Accounts (90-day)', 'indicator_code': 'USG_MM_ACTIVE',
    'value_numeric': 60.0, 'unit': 'Millions', 'observation_date': '2025-12-31', 
    'source_name': 'NBE / Digital Ethiopia 2030 Strategy', 'confidence': 'high',
    'notes': 'Significant leap from 12M in 2020 to 60M+ by end of 2025.'
},
{
    'record_id': 'REC_0026', 'record_type': 'observation', 'pillar': 'ACCESS',
    'indicator': 'Bank Account to Fayda Linkage Rate', 'indicator_code': 'ACC_FAYDA_BANK_LINK',
    'value_numeric': 100.0, 'unit': 'Percent', 'observation_date': '2026-03-30', 
    'source_name': 'NBE Directive', 'confidence': 'high',
    'notes': 'NBE mandate for all accounts to be linked to Fayda by March 2026 (EC 2018).'
},
# Add these to your 'new_data' list in the script
{
    'record_id': 'REC_0027', 'record_type': 'observation', 'pillar': 'USAGE',
    'indicator': 'Active Mobile Money Accounts (90-day)', 'indicator_code': 'USG_MM_ACTIVE',
    'value_numeric': 60.0, 'unit': 'Millions', 'observation_date': '2025-12-31', 
    'source_name': 'NBE / Digital Ethiopia 2030 Strategy', 'confidence': 'high',
    'notes': 'Significant leap from 12M in 2020 to 60M+ by end of 2025.'
},
{
    'record_id': 'REC_0028', 'record_type': 'observation', 'pillar': 'ACCESS',
    'indicator': 'Bank Account to Fayda Linkage Rate', 'indicator_code': 'ACC_FAYDA_BANK_LINK',
    'value_numeric': 100.0, 'unit': 'Percent', 'observation_date': '2026-03-30', 
    'source_name': 'NBE Directive', 'confidence': 'high',
    'notes': 'NBE mandate for all accounts to be linked to Fayda by March 2026 (EC 2018).'
},

    {
        'record_id': 'REC_0029', 'record_type': 'observation', 'pillar': 'ACCESS',
        'indicator': 'ATM Density (per 100,000 adults)', 'indicator_code': 'ACC_ATM_DENSITY',
        'value_numeric': 0.45, 'observation_date': '2015-12-31', 'source_name': 'IMF FAS',
        'source_url': 'https://data.imf.org/FAS', 'confidence': 'high',
        'notes': 'Infrastructure baseline during the early growth phase.'
    },
    {
        'record_id': 'REC_0030', 'record_type': 'observation', 'pillar': 'ACCESS',
        'indicator': 'Bank Branch Density (per 100,000 adults)', 'indicator_code': 'ACC_BRANCH_DENSITY',
        'value_numeric': 1.15, 'observation_date': '2016-12-31', 'source_name': 'IMF FAS',
        'confidence': 'high', 'notes': 'Critical for comparing physical vs digital access trajectory.'
    },

    # --- 2018/2019: The Pre-Reform "Bridge" (Source: NBE Annual Reports) ---
    {
        'record_id': 'REC_0031', 'record_type': 'observation', 'pillar': 'USAGE',
        'indicator': 'Total Debit Cards Issued', 'indicator_code': 'USG_DEBIT_CARDS',
        'value_numeric': 8100000.0, 'observation_date': '2018-06-30', 'source_name': 'NBE Annual Report',
        'source_url': 'https://nbe.gov.et/', 'confidence': 'high',
        'notes': 'Pre-digital surge: Banking was still heavily card-and-ATM focused.'
    },
    {
        'record_id': 'REC_0032', 'record_type': 'observation', 'pillar': 'ACCESS',
        'indicator': 'Mobile Phone Ownership (% adults)', 'indicator_code': 'ACC_PHONE_OWN',
        'value_numeric': 41.2, 'observation_date': '2019-12-31', 'source_name': 'ITU DataHub',
        'source_url': 'https://datahub.itu.int/', 'confidence': 'high',
        'notes': 'Shows the enabler gap before the 2021 liberalization.'
    },

    # --- 2020: The "Regulatory Catalyst" Year (Source: NBE/IMF) ---
    {
        'record_id': 'REC_0033', 'record_type': 'observation', 'pillar': 'USAGE',
        'indicator': 'Mobile Money Registered Accounts', 'indicator_code': 'USG_MM_REG',
        'value_numeric': 5100000.0, 'observation_date': '2020-06-30', 'source_name': 'NBE',
        'confidence': 'high', 'notes': 'The final baseline before Telebirr changed the market.'
    },
    {
        'record_id': 'REC_0034', 'record_type': 'observation', 'pillar': 'USAGE',
        'indicator': 'Gender Gap in Mobile Money (IMF Proxy)', 'indicator_code': 'GEN_GAP_MM',
        'value_numeric': 11.4, 'observation_date': '2020-12-31', 'source_name': 'IMF FAS',
        'confidence': 'medium', 'notes': 'Proxy for social exclusion pillar.'
    },

    # --- 2022: Competitive Market Transformation (Source: GSMA/NBE) ---
    {
        'record_id': 'REC_0035', 'record_type': 'observation', 'pillar': 'USAGE',
        'indicator': 'Active Mobile Money Users (90-day)', 'indicator_code': 'USG_MM_ACTIVE',
        'value_numeric': 25000000.0, 'observation_date': '2022-12-31', 'source_name': 'GSMA/NBE',
        'source_url': 'https://www.gsma.com/', 'confidence': 'high',
        'notes': 'Captured the post-Telebirr boom and Safaricom entry period.'
    },
    {
        'record_id': 'REC_0036', 'record_type': 'observation', 'pillar': 'ACCESS',
        'indicator': '4G Population Coverage', 'indicator_code': 'ACC_4G_COV',
        'value_numeric': 35.0, 'observation_date': '2022-06-30', 'source_name': 'Ethio Telecom Annual Report',
        'confidence': 'high', 'notes': 'Pre-Safaricom network expansion baseline.'
    }

]
logging.info(f"Adding {len(new_data)} rows to the dataset (REC_0011 to REC_0036)")



2026-02-01 19:41:27,414 - INFO - DataEnrichment initialized successfully.
2026-02-01 19:41:27,421 - INFO - Adding 26 rows to the dataset (REC_0011 to REC_0036)


In [4]:


# 1. Define the full schema as per the starter dataset
original_columns = [
    "record_id", "record_type", "category", "pillar", "indicator", "indicator_code", 
    "indicator_direction", "value_numeric", "value_text", "value_type", "unit", 
    "observation_date", "period_start", "period_end", "fiscal_year", "gender", 
    "location", "region", "source_name", "source_type", "source_url", "confidence", 
    "related_indicator", "relationship_type", "impact_direction", "impact_magnitude", 
    "impact_estimate", "lag_months", "evidence_basis", "comparable_country", 
    "collected_by", "collection_date", "original_text", "notes"
]

# 2. Enrich the data in memory
profiler.enrich_data(new_data)

# 3. Align the dataframe to the full schema 
# This adds all the missing columns (category, fiscal_year, etc.) and fills them with NaN
df_main_final = profiler.df.reindex(columns=original_columns)



2026-02-01 19:41:27,501 - INFO - Enrichment completed successfully.
2026-02-01 19:41:27,503 - INFO - Total Records: 46



--- Enrichment Success ---
Total Records After Enrichment: 46


In [5]:


# 3. Initialize the manager with that data
impact_manager = DataEnrichment(initial_impact_df)

# 4. Define the NEW entries you want to add (IMP_0015, etc.)
new_impact_entries = [
    {
        'record_id': 'IMP_0015', 
        'parent_id': 'EVT_0001', 
        'record_type': 'impact_link',
        'pillar': 'USAGE',
        'indicator': 'Telebirr effect on Active MM Accounts',
        'related_indicator': 'USG_MM_ACTIVE',
        'impact_direction': 'increase',
        'impact_magnitude': 'high',
        'lag_months': 1,
        'value_numeric': 15.0,
        'evidence_basis': 'empirical',
        'notes': 'Telebirr mass-onboarding leap from 12M to 60M baseline.'
    },
    {
        'record_id': 'IMP_0016', 
        'parent_id': 'REC_0024', 
        'record_type': 'impact_link',
        'pillar': 'ACCESS',
        'indicator': 'Fayda effect on Bank Linkage',
        'related_indicator': 'ACC_FAYDA_BANK_LINK',
        'impact_direction': 'increase',
        'impact_magnitude': 'high',
        'lag_months': 3,
        'value_numeric': 100.0,
        'evidence_basis': 'regulatory',
        'notes': 'NBE mandate for 100% linkage by March 2026.'
    }
]
logging.info(f"Adding impact links: {new_impact_entries[0]['record_id']} to {new_impact_entries[-1]['record_id']} ({len(new_impact_entries)} links total)")
# 5. Enrich (combine old + new)
impact_manager.enrich_data(new_impact_entries)

# 3. Align the dataframe to the full schema 
# This adds all the missing columns (category, fiscal_year, etc.) and fills them with NaN
df_impact_final = impact_manager.df.reindex(columns=original_columns)
logging.info("Impact links successfully integrated into the pillar analysis.")




2026-02-01 19:41:27,537 - INFO - DataEnrichment initialized successfully.
2026-02-01 19:41:27,539 - INFO - Adding impact links: IMP_0015 to IMP_0016 (2 links total)
2026-02-01 19:41:27,559 - INFO - Enrichment completed successfully.
2026-02-01 19:41:27,562 - INFO - Total Records: 16
2026-02-01 19:41:27,566 - INFO - Impact links successfully integrated into the pillar analysis.



--- Enrichment Success ---
Total Records After Enrichment: 16


In [6]:
# 3. Create the Excel writer
with pd.ExcelWriter(processed_output, engine='openpyxl') as writer:
    # Save Main Sheet
    df_main_final.to_excel(writer, sheet_name='ethiopia_fi_unified_data', index=False)
    
    # Save Impact Sheet
    df_impact_final.to_excel(writer, sheet_name='Impact_sheet', index=False)

print(f"✅ Success: Unified file saved at {processed_output}")
logging.info(f"✅ Success: Unified file saved at {processed_output}")

2026-02-01 19:41:27,847 - INFO - ✅ Success: Unified file saved at ../data/processed/ethiopia_fi_unified_data.xlsx


✅ Success: Unified file saved at ../data/processed/ethiopia_fi_unified_data.xlsx


# 2. Data Enrichment Phase

## 2.1 Dataset Expansion (Record Augmentation)

After profiling, enrichment was initiated to strengthen the dataset with additional records.

The system added:

- **26 new rows**

Record IDs expanded from:

- `REC_0011` → `REC_0036`

Log evidence:

- `Adding 26 rows to the dataset (REC_0011 to REC_0036)`

This enrichment step increased dataset breadth by incorporating:

- Updated digital finance indicators
- Infrastructure metrics
- Usage proxies beyond survey cycles

---

## 2.2 Enrichment Success Confirmation

The enrichment pipeline confirmed successful integration:

- Total unified records increased to:

Log marker:

- `--- Enrichment Success ---`

This confirms the dataset moved beyond its initial baseline into a more complete national inclusion panel.

---

## 2.3 Impact Link Integration

Beyond adding indicator rows, the enrichment process also introduced structured policy-impact relationships.

The system added:

- **2 impact links**

From:

- `IMP_0015` → `IMP_0016`

Log evidence:

- `Adding impact links: IMP_0015 to IMP_0016 (2 links total)`

Impact links were successfully integrated into pillar-level analysis:

- `Impact links successfully integrated into the pillar analysis.`

This enhancement enables downstream modeling of:

- Policy events
- Market shifts
- Drivers of inclusion acceleration

---

## 2.4 Unified Processed Dataset Export

Once enrichment was completed, the final dataset was saved as a unified Excel file:

- Output location:


Log confirmation:

- `✅ Success: Unified file saved at ../data/processed/ethiopia_fi_unified_data.xlsx`

This file became the master dataset for all EDA and forecasting workflows.

---

---

# 3. EDA Suite Execution on Enriched Dataset

After enrichment, the system launched the full EDA session:

Log marker:

- `Starting EDA Analysis session using file: ../data/processed/ethiopia_fi_unified_data.xlsx`

The analysis suite successfully produced:

- Dataset overview summaries
- Temporal coverage heatmaps
- Account ownership trends
- Growth rate plots
- Mobile money activity gap analysis
- Event overlay impact studies
- Correlation matrix computation
- Impact link summaries

Final completion confirmation:

- `Full EDA Analysis suite completed successfully.`

---

# ✅ Summary Outcome

The log history confirms a complete pipeline execution:

- Raw dataset successfully loaded
- Profiling validated schema, pillars, temporal structure, and quality
- Enrichment expanded indicators and added impact link relationships
- Unified dataset exported successfully
- Full EDA suite executed without failure

This workflow produced a robust analytical foundation for evaluating Ethiopia’s financial inclusion trajectory and policy-driven inclusion strategies through 2027.

---
