# Task 1: Data Exploration and Enrichment

## Objective
Understand the starter dataset and enrich it with additional data useful for forecasting Access and Usage.

## 1. Setup and Load Data

In [1]:
import pandas as pd

# Path to the starter dataset, relative to the notebook's directory
data_path = '../data/raw/ethiopia_fi_unified_data.csv'

def load_data(path=data_path):
    """Read the unified dataset; return None if the file is missing."""
    try:
        df = pd.read_csv(path)
    except FileNotFoundError:
        print(f"File not found at {path}")
        return None
    print("Data loaded successfully.")
    return df

df = load_data()
df.head()

Data loaded successfully.


| | record_type | pillar | indicator | indicator_code | value_numeric | observation_date | source_name | source_url | confidence | category | parent_id | related_indicator | impact_direction | impact_magnitude | lag_months | evidence_basis | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | observation | Access | Account Ownership | ACC_OWN | 14.0 | 2011-01-01 | Global Findex | | High | | | | | | | | |
| 1 | observation | Access | Account Ownership | ACC_OWN | 22.0 | 2014-01-01 | Global Findex | | High | | | | | | | | |
| 2 | observation | Access | Account Ownership | ACC_OWN | 35.0 | 2017-01-01 | Global Findex | | High | | | | | | | | |
| 3 | observation | Access | Account Ownership | ACC_OWN | 46.0 | 2021-01-01 | Global Findex | | High | | | | | | | | |
| 4 | observation | Access | Account Ownership | ACC_OWN | 49.0 | 2024-01-01 | Global Findex | | High | | | | | | | | |


## 2. Explore Data Structure

In [2]:
print("--- Data Info ---")
df.info()

print("--- Record Types ---")
print(df['record_type'].value_counts())

print("--- Pillars ---")
print(df['pillar'].value_counts(dropna=False))

--- Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   record_type        16 non-null     object 
 1   pillar             9 non-null      object 
 2   indicator          13 non-null     object 
 3   indicator_code     9 non-null      object 
 4   value_numeric      13 non-null     float64
 5   observation_date   16 non-null     object 
 6   source_name        9 non-null      object 
 7   source_url         0 non-null      float64
 8   confidence         16 non-null     object 
 9   category           4 non-null      object 
 10  parent_id          7 non-null      object 
 11  related_indicator  3 non-null      object 
 12  impact_direction   3 non-null      object 
 13  impact_magnitude   3 non-null      object 
 14  lag_months         3 non-null      float64
 15  evidence_basis     3 non-null      object 
 16  notes     
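
The non-null counts above differ a lot between columns (for example, `pillar` has 9 values while `impact_direction` has 3), which suggests that each record type fills a different subset of the schema. One way to check this is to compute the share of missing values per column for each `record_type`. The following is a minimal sketch on a small made-up frame (the `toy` rows are illustrative stand-ins, not the real file); on the actual data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Toy stand-in for the unified dataset (illustrative rows only, not the real file).
toy = pd.DataFrame({
    "record_type": ["observation", "observation", "event"],
    "pillar": ["Access", "Usage", None],
    "category": [None, None, "policy"],
    "source_url": [None, None, None],
})

# Share of missing values per column, split by record type.
missing_share = (
    toy.drop(columns="record_type")
       .isna()
       .groupby(toy["record_type"])
       .mean()
)
print(missing_share)

# Columns with no values at all. When read from a CSV, such columns
# typically come back as float64 (all NaN), as source_url does above.
empty_columns = toy.columns[toy.isna().all()].tolist()
print("Entirely empty columns:", empty_columns)
```

A table like `missing_share` makes it easier to tell columns that are empty by design for a given record type from columns that still need to be filled in.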

## 3. Review Existing Observations

In [3]:
observations = df[df['record_type'] == 'observation']
print("Observation Date Range:", observations['observation_date'].min(), "to", observations['observation_date'].max())
print("Unique Indicators:", observations['indicator'].unique())

Observation Date Range: 2011-01-01 to 2024-01-01
Unique Indicators: ['Account Ownership' 'Mobile Money Ownership' 'Digital Payment Adoption']
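
Because `observation_date` is stored as `object` (string), the `min()` and `max()` above compare strings. That happens to work for ISO `YYYY-MM-DD` values, but parsing to datetimes is safer before any time-series work. The sketch below is one possible approach, not necessarily the one used later in the project. It uses the five Account Ownership values from the preview in Section 1 and the 2023 Digital Payment Adoption value from the enrichment step; the frame name `obs` is hypothetical.

```python
import pandas as pd

# Small frame built from values shown elsewhere in this notebook.
obs = pd.DataFrame({
    "indicator_code": ["ACC_OWN"] * 5 + ["USG_DIG_PAY"],
    "value_numeric": [14.0, 22.0, 35.0, 46.0, 49.0, 30.0],
    "observation_date": ["2011-01-01", "2014-01-01", "2017-01-01",
                         "2021-01-01", "2024-01-01", "2023-01-01"],
})

# Parse the string dates so comparisons and resampling use real timestamps.
obs["observation_date"] = pd.to_datetime(obs["observation_date"], format="%Y-%m-%d")

# Date coverage and number of points per indicator.
coverage = obs.groupby("indicator_code")["observation_date"].agg(["min", "max", "count"])
print(coverage)

# One column per indicator, indexed by date (NaN where an indicator has no value).
series = obs.pivot(index="observation_date", columns="indicator_code", values="value_numeric")
print(series)
```

The wide `series` layout shows at a glance how sparse each indicator is over time, which matters when choosing a forecasting approach.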


## 4. Enrichment
We added three records to the dataset:
1. **Observation**: Usage - Digital Payment Adoption (`USG_DIG_PAY`), 30% in 2023 (source: NBE, confidence: Medium)
2. **Event**: NBE Digital Lending Directive (category `policy`, June 2022)
3. **Impact Link**: Models the effect of the lending directive on digital payments.
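
The notebook does not show how these records were appended, so the following is only a sketch of one way to do it: concatenate the new rows and drop duplicates on a set of key columns so that re-running the cell does not add them twice. The helper `append_records`, the `KEY_COLUMNS` list, and the `impact_link` label are assumptions made for illustration. The observation and event values mirror the verification output in this section; the impact link row leaves direction, magnitude, and lag unset because they are not stated here.

```python
import pandas as pd

# ASSUMPTION: these columns together identify a record uniquely.
KEY_COLUMNS = ["record_type", "indicator_code", "parent_id",
               "related_indicator", "observation_date"]

def append_records(df, new_rows, key_columns=KEY_COLUMNS):
    """Append new_rows to df, keeping the first copy of any duplicate key."""
    combined = pd.concat([df, new_rows], ignore_index=True)
    return combined.drop_duplicates(subset=key_columns, keep="first").reset_index(drop=True)

# Toy stand-in for the loaded dataset (a single existing row).
base = pd.DataFrame([{
    "record_type": "observation", "pillar": "Access",
    "indicator": "Account Ownership", "indicator_code": "ACC_OWN",
    "value_numeric": 49.0, "observation_date": "2024-01-01",
    "source_name": "Global Findex", "confidence": "High",
}])

new_records = pd.DataFrame([
    {   # Observation, as listed above
        "record_type": "observation", "pillar": "Usage",
        "indicator": "Digital Payment Adoption", "indicator_code": "USG_DIG_PAY",
        "value_numeric": 30.0, "observation_date": "2023-01-01",
        "source_name": "NBE", "confidence": "Medium",
    },
    {   # Event, as listed above
        "record_type": "event", "indicator": "NBE Digital Lending Directive",
        "observation_date": "2022-06-01", "confidence": "High",
        "category": "policy", "parent_id": "EVT_004",
    },
    {   # Impact link: HYPOTHETICAL layout; direction, magnitude and lag
        # are deliberately left unset until they are backed by evidence.
        "record_type": "impact_link", "parent_id": "EVT_004",
        "related_indicator": "USG_DIG_PAY", "observation_date": "2022-06-01",
    },
])

enriched = append_records(base, new_records)
print(enriched[["record_type", "indicator_code", "parent_id", "observation_date"]])

# Running the append again leaves the row count unchanged.
rerun = append_records(enriched, new_records)
print(len(enriched), len(rerun))
```

Deduplicating on explicit key columns keeps the enrichment cell safe to re-run, at the cost of having to agree on what makes a record unique.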

In [4]:
# Verify new additions
print("--- New Event ---")
print(df[df['category'] == 'policy'])

print("--- New Observation ---")
print(df[(df['indicator_code'] == 'USG_DIG_PAY') & (df['observation_date'] == '2023-01-01')])

--- New Event ---
   record_type pillar                      indicator indicator_code  \
14       event    NaN  NBE Digital Lending Directive            NaN   

    value_numeric observation_date source_name  source_url confidence  \
14            0.0       2022-06-01         NaN         NaN       High   

   category parent_id related_indicator impact_direction impact_magnitude  \
14   policy   EVT_004               NaN              NaN              NaN   

    lag_months evidence_basis  notes  
14         NaN            NaN    NaN  
--- New Observation ---
    record_type pillar                 indicator indicator_code  \
13  observation  Usage  Digital Payment Adoption    USG_DIG_PAY   

    value_numeric observation_date source_name  source_url confidence  \
13           30.0       2023-01-01         NBE         NaN     Medium   

   category parent_id related_indicator impact_direction impact_magnitude  \
13      NaN       NaN               NaN              NaN              NaN   
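
Printing the filtered rows works for a visual check, but assertions fail loudly if a record is missing or duplicated. Below is a hedged sketch of that idea. The helper `find_records` is hypothetical, and the `toy` frame only reproduces a few fields from the two rows printed above.

```python
import pandas as pd

def find_records(df, **criteria):
    """Return the rows where every given column equals the given value."""
    mask = pd.Series(True, index=df.index)
    for column, value in criteria.items():
        mask &= df[column] == value  # NaN never equals a value, so it is excluded
    return df[mask]

# Toy frame with a subset of the fields from the two rows shown above.
toy = pd.DataFrame([
    {"record_type": "observation", "indicator_code": "USG_DIG_PAY",
     "value_numeric": 30.0, "observation_date": "2023-01-01", "category": None},
    {"record_type": "event", "indicator_code": None,
     "value_numeric": 0.0, "observation_date": "2022-06-01", "category": "policy"},
])

event = find_records(toy, record_type="event", category="policy")
observation = find_records(toy, indicator_code="USG_DIG_PAY",
                           observation_date="2023-01-01")

# Each enrichment record should appear exactly once.
assert len(event) == 1, "expected exactly one policy event"
assert len(observation) == 1, "expected exactly one 2023 USG_DIG_PAY observation"
assert observation["value_numeric"].iloc[0] == 30.0
print("Enrichment records verified.")
```

The same helper could also check the impact link, once the label used for that record type in the dataset is confirmed.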