# Phase 1

**This is the beginning of the work. The dataset is the A2F 2023 Survey from Enhancing Financial Inclusion & Advancement (EFInA). The link to the data source is below:**

https://a2f.ng/datasets/#licence-2023-Dataset-revised-popup

**I chose this dataset because it is the most recent version of the survey. Earlier editions were run in 2020, 2018, etc.**

**Please feel free to leave comments or make edits as you see fit at any point in the project.**

**The file is quite large, with almost 30,000 rows and over 1,600 columns. It is a ".sav" (SPSS) file with a size of about 350MB.**

**The large size of this file can cause some issues with computing values later on. I will try to consolidate the columns so we can have a more compact data structure to work with.** 
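One way to make the frame more compact, separate from consolidating columns, is to use cheaper dtypes. The sketch below is only an illustration on a small synthetic frame: the `shrink_frame` helper, its parameters, and the toy column names are hypothetical and not part of the survey or of this notebook. It converts low-cardinality text columns to `category` and downcasts floats. Downcasting to `float32` loses precision, so the example skips the weight column.

```python
import numpy as np
import pandas as pd


def shrink_frame(df, max_unique_ratio=0.5, skip=()):
    """Return a copy of df with cheaper dtypes (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        if col in skip:
            continue
        s = out[col]
        if pd.api.types.is_float_dtype(s):
            # float64 -> float32 where the values allow it; NaN is preserved
            out[col] = pd.to_numeric(s, downcast="float")
        elif pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s):
            # Repeated strings are much cheaper to store as a categorical
            if s.nunique(dropna=True) <= max_unique_ratio * len(s):
                out[col] = s.astype("category")
    return out


# Synthetic stand-in for the survey data
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "state": rng.choice(["ABIA", "LAGOS", "KANO"], size=1000),
    "answer": rng.choice([1.0, 2.0, np.nan], size=1000),
    "weight": rng.uniform(200, 4000, size=1000),
})
small = shrink_frame(demo, skip=("weight",))

before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```

On the real data the call would look like `shrink_frame(data, skip=("Weightingvariable", "final_hh_wgt"))`, assuming those weight columns should keep full precision.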

In [95]:
# First, install the library if you haven't:
# pip install pyreadstat

In [96]:
import pyreadstat
import pandas as pd
import numpy as np

# At this point, I am loading the .sav file locally. Make sure to replace the path with the actual path to your .sav file. The path could be a raw string (prefix with 'r') to avoid issues with backslashes.

file_path = r"c:\Users\USER\Desktop\A2F-2023-Revised-dataset-with-Revised-weights\A2F 2023 Revised dataset with Revised weights.sav"

# Load the file by unpacking the .sav file into a pandas DataFrame and metadata. The `read_sav` function reads the .sav file and returns a tuple (data, meta) where `data` is a DataFrame and `meta` contains metadata about the variables. The `meta` object contains information about the variable labels, value labels, and other metadata

data, meta = pyreadstat.read_sav(file_path)
data = data.drop_duplicates().dropna(how="all", axis=1)

# Again,
# `data` is a pandas DataFrame containing the data
# `meta` contains metadata like variable labels

display(data.head())  # Just getting an overview

Unnamed: 0,statecode,state_code,e6,state,agegroup,respondent_serial,Weightingvariable,final_hh_wgt,region,state_name,...,finneeds_transferofvalue,target_groups,savings_group,cooperative,village_comm_association,savings_thrift,microfinance,money_lender,finhealth_indicator_final,finlit_cap_final
0,1.0,31.0,2.0,ABIA,2.0,3697287.0,686.992065,749.888855,2.0,ABIA,...,1.0,2.0,0.0,0.0,0.0,0.0,,,1.0,1.0
1,1.0,31.0,2.0,ABIA,2.0,3546312.0,1292.105957,1028.419067,2.0,ABIA,...,1.0,2.0,0.0,0.0,0.0,0.0,,,1.0,0.0
2,1.0,31.0,1.0,ABIA,2.0,3737095.0,274.308411,492.784119,2.0,ABIA,...,1.0,5.0,0.0,0.0,0.0,0.0,,,0.0,0.0
3,1.0,31.0,1.0,ABIA,2.0,3545118.0,1194.71936,1028.419067,2.0,ABIA,...,1.0,5.0,0.0,0.0,0.0,0.0,,,0.0,0.0
4,1.0,31.0,1.0,ABIA,2.0,3705829.0,4107.366211,1906.860229,2.0,ABIA,...,1.0,4.0,0.0,0.0,0.0,0.0,,,0.0,0.0


In [97]:
# Just trying to save a csv copy of the data for later use
# data.to_csv(r"c:\Users\USER\Desktop\A2F-2023-Revised-dataset-with-Revised-weights/data.csv", index=False)

In [98]:

# You can get summary statistics for numerical columns, I have temporarily commented it out
#display(data.describe())  

# Getting a summary of the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28392 entries, 0 to 28391
Columns: 1606 entries, statecode to finlit_cap_final
dtypes: float64(1496), object(110)
memory usage: 347.9+ MB


In [99]:
# Let's check the attributes of the "meta" object so we can know where to look
print(dir(meta))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'column_labels', 'column_names', 'column_names_to_labels', 'creation_time', 'file_encoding', 'file_format', 'file_label', 'missing_ranges', 'missing_user_values', 'modification_time', 'mr_sets', 'notes', 'number_columns', 'number_rows', 'original_variable_types', 'readstat_variable_types', 'table_name', 'value_labels', 'variable_alignment', 'variable_display_width', 'variable_measure', 'variable_storage_width', 'variable_to_label', 'variable_value_labels']


In [100]:
# You can access various attributes in "meta", such as:
# - meta.column_labels: A list of the variable (column) labels, in the same order as meta.column_names.
# - meta.variable_measure: Shows the measurement types (e.g., nominal, scale).
# - meta.file_label: Displays the label of the file if available.
# - meta.notes: Any additional notes saved in the .sav file.

# Access metadata attributes
# print(meta.column_labels)  # Variable labels
# print(meta.variable_measure)  # Measurement levels for variables
# print(meta.file_label)  # Overall file label
# print(meta.notes)  # Notes from the .sav file
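
With over 1,600 columns, it helps to search the labels by keyword instead of reading them all. The sketch below assumes a name-to-label dictionary shaped like `meta.column_names_to_labels`. The `find_columns` helper is hypothetical, and `sample_labels` is a small stand-in dictionary so the example runs without the `.sav` file.

```python
def find_columns(names_to_labels, keyword):
    """Return {column: label} for entries whose name or label contains keyword."""
    keyword = keyword.lower()
    return {
        name: label
        for name, label in names_to_labels.items()
        # Some labels are None, so fall back to an empty string
        if keyword in name.lower() or keyword in (label or "").lower()
    }


# Small stand-in for meta.column_names_to_labels
sample_labels = {
    "e6": "gender",
    "state": None,
    "lga_name": "9. lga name",
    "final_hh_wgt": "5. household weight",
    "hh_total_size_1": "total_total household size",
}

matches = find_columns(sample_labels, "household")
print(matches)
```

With the real metadata, the call would be along the lines of `find_columns(meta.column_names_to_labels, "account")`.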


In [101]:
# Targeting the attribute "column_names_to_labels" to get the variable names and their labels. This attribute contains a dictionary mapping variable names to their labels.

columns_to_labels = meta.column_names_to_labels  # This gives a dictionary

# Loop through the dictionary and print each column with its description
for column_name, label in columns_to_labels.items():
    print(f"{column_name}: {label}")

statecode: 8. state
state_code: state_code
e6: gender
state: None
agegroup: Age (15-17) (18 & above)
respondent_serial: 3. serial number
Weightingvariable: new weighting variable
final_hh_wgt: 5. household weight
region: geopolitical zones
state_name: 8. state
lga_name: 9. lga name
locality_name: 10. locality_name
ea_name: 11. ea name
sector: 12. sector
ea_code: 13. ea code
ee1: 1, 1, 1 : how many adults (aged 15+) live in this household?
hh_total_size_1: total_total household size
hh_total_size_2: male_total household size
hh_total_size_3: female_total household size
hh_age_15_17_1: total_number of persons age 15 to 17 yrs
hh_age_15_17_2: male_number of persons age 15 to 17 yrs
hh_age_15_17_3: female_number of persons age 15 to 17 yrs
hh_age_18_plus_1: total_total of persons age 18 yrs+
hh_age_18_plus_2: male_total of persons age 18 yrs+
hh_age_18_plus_3: female_total of persons age 18 yrs+
q1_1: interview type
c1: c1.  how many people live in this household?
c2a: c2a.  how many peopl

### In this project, the main work will be about consolidating the columns in the dataset into a more compact structure. We will then feed this compact structure into the function that'll give us different visualizations to explore the data and extract the stories hidden in the data. I have defined a list of columns that we need to create based on the data at hand. The list is as follows:


### **1. `bank_account_ownership`**
- **Description**: Indicates whether an individual owns a formal bank account (e.g., savings, checking) and the type of financial institution they use (e.g., commercial bank, microfinance institution).
- **Purpose**: This column helps assess financial inclusion by identifying individuals with access to formal banking services.

---

### **2. `urban_rural`**
- **Description**: Categorizes individuals based on whether they reside in urban or rural areas.
- **Purpose**: This column provides geographic context for understanding disparities in access to financial services, infrastructure, and opportunities between urban and rural populations.

---

### **3. `financial_inclusion_metrics`**
- **Description**: Aggregates multiple indicators of financial inclusion, such as ownership of bank accounts, mobile money usage, and access to credit.
- **Purpose**: This composite metric offers a holistic view of an individual's engagement with formal and informal financial systems.

---

### **4. `demographic_factors`**
- **Description**: Combines demographic attributes such as age, gender, education level, and marital status into a single representation.
- **Purpose**: These factors are critical for analyzing how demographic characteristics influence financial behavior and inclusion.

---

### **5. `savings_behavior`**
- **Description**: Captures how individuals save money, including methods (e.g., at home, in banks, via mobile wallets) and purposes (e.g., emergencies, education, business).
- **Purpose**: This column highlights saving habits and preferences, which are key to understanding financial resilience and planning.

---

### **6. `borrowing_behavior`**
- **Description**: Describes borrowing patterns, including sources of loans (e.g., family, banks, microfinance institutions) and reasons for borrowing (e.g., emergencies, business expansion).
- **Purpose**: This column provides insights into debt management and access to credit, which are essential for assessing financial vulnerability and entrepreneurial activities.

---

### **7. `digital_payment_adoption`**
- **Description**: Indicates the extent to which individuals use digital payment methods, such as mobile money, e-wallets, or online banking.
- **Purpose**: This column measures adoption of modern financial technologies, which is a strong indicator of financial inclusion and economic participation.

---

### **8. `access_to_electricity`**
- **Description**: Identifies whether individuals have reliable access to electricity, which is a prerequisite for using digital devices and financial services.
- **Purpose**: This column highlights infrastructure challenges that may hinder access to financial tools and services, particularly in rural areas.

---

### **9. `internet_access`**
- **Description**: Indicates whether individuals have access to the internet, either through smartphones, computers, or other devices.
- **Purpose**: Internet access is crucial for digital financial inclusion, enabling activities like online banking, mobile payments, and e-commerce.

---

### **10. `mobile_phone_usage`**
- **Description**: Describes how individuals use mobile phones, including ownership, type of phone (e.g., smartphone, feature phone), and comfort with apps.
- **Purpose**: Mobile phones are often a gateway to financial services, especially in regions where traditional banking is limited.

---

### **11. `credit_access`**
- **Description**: Assesses access to credit, including borrowing history, sources of credit, and repayment behavior.
- **Purpose**: This column highlights barriers to accessing credit and identifies opportunities to expand financial inclusion for underserved populations.

---

### **12. `small_business_ownership`**
- **Description**: Indicates whether individuals own or operate small businesses, including ownership of business equipment, control over assets, and proximity to markets.
- **Purpose**: This column evaluates entrepreneurial activity and identifies individuals who may benefit from targeted financial products and services.

---

### **13. `entrepreneurship`**
- **Description**: Provides a comprehensive view of entrepreneurial activities, including income-generating activities, sectors of operation, and employment of others.
- **Purpose**: This column helps identify individuals engaged in entrepreneurial ventures and assesses their needs for financial support and resources.

---

### Summary
These columns collectively provide a detailed understanding of financial inclusion, economic behavior, and access to resources. While `urban_rural` has been temporarily skipped, it can be revisited later to incorporate geographic insights into the analysis. Each column contributes to a broader narrative about how individuals interact with financial systems and what factors influence their economic opportunities.


### > state or territory <

In [132]:
# We need to validate that this column carries the expected state names. The column name is "state", representing the states in Nigeria.
# We check to see if they match the expected state names (the 36 states in Nigeria plus the Federal Capital Territory (FCT), 37 entries in total).

# Define the list of expected state names (36 states + FCT)
expected_states = [
    'ABIA', 'ADAMAWA', 'AKWA IBOM', 'ANAMBRA', 'BAUCHI', 'BAYELSA', 'BENUE', 'BORNO',
    'CROSS RIVER', 'DELTA', 'EBONYI', 'EDO', 'EKITI', 'ENUGU', 'FCT', 'GOMBE', 'IMO',
    'JIGAWA', 'KADUNA', 'KANO', 'KATSINA', 'KEBBI', 'KOGI', 'KWARA', 'LAGOS',
    'NASARAWA', 'NIGER', 'OGUN', 'ONDO', 'OSUN', 'OYO', 'PLATEAU', 'RIVERS', 'SOKOTO',
    'TARABA', 'YOBE', 'ZAMFARA'
]

# Step 1: Check for missing values
missing_values = data['state'].isnull().sum()
print(f"Missing Values in 'state' Column: {missing_values}")

# Step 2: Retrieve unique values in the 'state' column
unique_states = data['state'].unique()
print(f"Unique Values in 'state' Column ({len(unique_states)}): {unique_states}")

# Step 3: Compare unique values with the expected state names
missing_from_data = set(expected_states) - set(unique_states)
extra_in_data = set(unique_states) - set(expected_states)

print("\nValidation Results:")
if missing_from_data:
    print(f"States Missing from Data: {missing_from_data}")
else:
    print("All expected states are present in the data.")

if extra_in_data:
    print(f"Unexpected States in Data: {extra_in_data}")
else:
    print("No unexpected states found in the data.")

# Step 4: Check if the number of unique states matches the expected count
if len(unique_states) == len(expected_states):
    print("The number of unique states matches the expected count (37).")
else:
    print(f"Mismatch in Count: Found {len(unique_states)}, Expected {len(expected_states)}.")

Missing Values in 'state' Column: 0
Unique Values in 'state' Column (37): ['ABIA' 'ADAMAWA' 'AKWA IBOM' 'ANAMBRA' 'BAUCHI' 'BAYELSA' 'BENUE' 'BORNO'
 'CROSS RIVER' 'DELTA' 'EBONYI' 'EDO' 'EKITI' 'ENUGU' 'FCT' 'GOMBE' 'IMO'
 'JIGAWA' 'KADUNA' 'KANO' 'KATSINA' 'KEBBI' 'KOGI' 'KWARA' 'LAGOS'
 'NASARAWA' 'NIGER' 'OGUN' 'ONDO' 'OSUN' 'OYO' 'PLATEAU' 'RIVERS' 'SOKOTO'
 'TARABA' 'YOBE' 'ZAMFARA']

Validation Results:
All expected states are present in the data.
No unexpected states found in the data.
The number of unique states matches the expected count (37).


### 1, bank_account_ownership

To ensure we properly consolidate the bank_account_ownership column, let’s first analyze the relevant columns (qf4_* and qf5_*) to understand their structure, values, and suitability. This step is critical to confirm how account ownership is recorded and whether there are any inconsistencies or missing data.

**Step 1: Identify Relevant Columns**

The relevant columns for `bank_account_ownership` are:

- `qf4_*`: These columns indicate whether an individual has an account registered in their name with various financial institutions. Examples:
  - `qf4_1`: "Do you have an account registered in your name with a commercial bank?"
  - `qf4_2`: "Do you have an account registered in your name with a microfinance bank?"
  - `qf4_3`: "Do you have an account registered in your name with a non-interest banking institution?"
  - Additional columns like `qf4_4` to `qf4_15` cover other types of institutions (e.g., mortgage institutions, mobile money operators, etc.).
- `qf5_*`: These columns indicate whether the individual currently uses specific financial providers. Examples:
  - `qf5_1`: "Do you currently use this provider (commercial bank)?"
  - `qf5_2`: "Do you currently use this provider (microfinance bank)?"

**Step 2: Analyze Relevant Columns**

We will loop through each of these columns to extract their data types, unique values, value counts, missing values, and sample values. This analysis will help us confirm their suitability for consolidation.

In [103]:
# Define relevant column groups
account_registration_cols = [col for col in data.columns if col.startswith('qf4_')]  # Account registration
current_usage_cols = [col for col in data.columns if col.startswith('qf5_')]        # Current usage

# Combine all relevant columns into a single list
relevant_columns = account_registration_cols + current_usage_cols

# Analyze each column
for column in relevant_columns:
    print(f"--- Analysis for Column: {column} ---")
    
    # Step 1: Check the column type
    print(f"Data Type: {data[column].dtype}")
    
    # Step 2: List unique values
    unique_values = data[column].unique()
    print(f"Unique Values ({len(unique_values)}): {unique_values}")
    
    # Step 3: Show value counts (frequency of each unique value)
    print("Value Counts:")
    print(data[column].value_counts())
    
    # Step 4: Count missing values
    missing_values = data[column].isnull().sum()
    print(f"Missing Values: {missing_values}")
    
    # Step 5: Show sample values
    print("Sample Values:")
    print(data[[column]].sample(5))
    
    print("\n" + "-"*50 + "\n")

--- Analysis for Column: qf4_1 ---
Data Type: float64
Unique Values (3): [nan  1.  2.]
Value Counts:
qf4_1
1.0    11209
2.0      477
Name: count, dtype: int64
Missing Values: 16706
Sample Values:
       qf4_1
24485    1.0
27650    NaN
11138    1.0
25090    1.0
6698     NaN

--------------------------------------------------

--- Analysis for Column: qf4_2 ---
Data Type: float64
Unique Values (3): [nan  2.  1.]
Value Counts:
qf4_2
1.0    403
2.0    194
Name: count, dtype: int64
Missing Values: 27795
Sample Values:
       qf4_2
20785    NaN
28035    NaN
21275    NaN
6965     NaN
22577    NaN

--------------------------------------------------

--- Analysis for Column: qf4_16 ---
Data Type: float64
Unique Values (3): [nan  1.  2.]
Value Counts:
qf4_16
1.0    243
2.0     93
Name: count, dtype: int64
Missing Values: 28056
Sample Values:
       qf4_16
27220     NaN
13783     NaN
13118     NaN
22986     1.0
16515     NaN

--------------------------------------------------

--- Analysis for Co
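
The per-column printout above gets long once many columns are involved. A more compact alternative, sketched below, collects the same checks into one row per column. The `summarize_columns` helper is hypothetical, and the toy frame only mimics the 1.0 / 2.0 / NaN coding seen in the output above.

```python
import numpy as np
import pandas as pd


def summarize_columns(df, columns):
    """One row per column: dtype, distinct non-missing values, missing count."""
    rows = []
    for col in columns:
        series = df[col]
        rows.append({
            "column": col,
            "dtype": str(series.dtype),
            "n_unique": series.nunique(dropna=True),
            "n_missing": int(series.isna().sum()),
            # Distinct non-missing values, in order of first appearance
            "values": series.dropna().unique().tolist(),
        })
    return pd.DataFrame(rows).set_index("column")


toy = pd.DataFrame({
    "qf4_1": [1.0, 2.0, np.nan, 1.0],
    "qf4_2": [np.nan, np.nan, 2.0, np.nan],
    "qf5_1": [1.0, np.nan, np.nan, 1.0],
})
prefixes = ("qf4_", "qf5_")  # str.startswith accepts a tuple of prefixes
cols = [c for c in toy.columns if c.startswith(prefixes)]
summary = summarize_columns(toy, cols)
print(summary)
```

The resulting frame can be sorted or filtered, for example `summary.sort_values("n_missing")`, which is harder to do with printed text.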

In [104]:
# Now we can consolidate

# Step 1: Identify relevant columns
account_registration_cols = [col for col in data.columns if col.startswith('qf4_')]  # Account registration
current_usage_cols = [col for col in data.columns if col.startswith('qf5_')]        # Current usage

# Combine all relevant columns into a single list
relevant_columns = account_registration_cols + current_usage_cols

# Step 2: Create binary indicator for bank account ownership
# (cast to float so the column can hold NaN in Step 3 without a dtype change)
data['bank_account_ownership'] = (
    data[relevant_columns].isin([1.0]).any(axis=1).astype(float)
)

# Step 3: Handle missing values (optional)
# If all relevant columns are NaN, assign NaN to bank_account_ownership
data.loc[data[relevant_columns].isnull().all(axis=1), 'bank_account_ownership'] = np.nan

# Display the consolidated column
print(data[['bank_account_ownership']].head())

   bank_account_ownership
0                     0.0
1                     0.0
2                     0.0
3                     0.0
4                     0.0
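
To make the three possible outcomes of the consolidation concrete, here is a minimal sketch on a toy frame. It carries over the assumption from the cell above that 1.0 means "yes" and 2.0 means "no"; the toy values are invented for illustration. It uses `Series.mask` in place of the `.loc` assignment, which gives the same result.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "qf4_1": [1.0, 2.0, np.nan, np.nan],
    "qf4_2": [np.nan, 2.0, np.nan, 1.0],
    "qf5_1": [2.0, np.nan, np.nan, np.nan],
})
cols = list(toy.columns)

answered_yes = toy[cols].eq(1.0).any(axis=1)  # at least one "yes"
never_asked = toy[cols].isna().all(axis=1)    # every answer is missing

# Float dtype so the column can hold 1.0, 0.0 and NaN
toy["bank_account_ownership"] = answered_yes.astype(float).mask(never_asked)
print(toy)
```

Row 0 has a "yes" and becomes 1.0, row 1 only has "no" answers and becomes 0.0, row 2 has no answers at all and stays NaN, and row 3 becomes 1.0.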


### 2, urban_rural

We do not have sufficient data to infer this value, so we will skip it for now and come back to it later. In the meantime, we'll just do a quick EDA of the location columns.

In [105]:
# Define the relevant columns for urban_rural
relevant_columns = ['locality_name', 'ea_name', 'lga_name']

# Analyze each column
for column in relevant_columns:
    print(f"--- Analysis for Column: {column} ---")
    
    # Step 1: Check the column type
    print(f"Data Type: {data[column].dtype}")
    
    # Step 2: List unique values
    unique_values = data[column].unique()
    print(f"Unique Values ({len(unique_values)}): {unique_values[:10]}...")  # Show first 10 unique values
    
    # Step 3: Show value counts (frequency of each unique value)
    print("Value Counts (Top 10):")
    print(data[column].value_counts().head(10))
    
    # Step 4: Count missing values
    missing_values = data[column].isnull().sum()
    print(f"Missing Values: {missing_values}")
    
    # Step 5: Show sample values
    print("Sample Values:")
    print(data[[column]].sample(5))
    
    print("\n" + "-"*50 + "\n")

--- Analysis for Column: locality_name ---
Data Type: object
Unique Values (1672): ['UMUORAM OHIYA' 'OWOAHIAFOR' 'AMAKPOKE' 'UKWA NKPORO' 'AKIRIKA UKU'
 'AMAUZU' 'EZIAMA' 'MGBELU UMUNNEKWU' 'OKOPEDI' 'AMUOHA ISIEKETA']...
Value Counts (Top 10):
locality_name
TUDUN WADA    112
SABON GARI    109
LAFIA          96
BWARI          80
KEFFI          80
GAJIGANNA      78
GAJIRAM        77
LUGBE          76
ILORIN         64
NGURU          63
Name: count, dtype: int64
Missing Values: 0
Sample Values:
                       locality_name
19203                        SHOMOLU
3058   CHUKWURAH VILLAGE UMUEZE ANAM
15197                   KASUWAR KUKA
10880                           KUJE
1746                       AKPAUTONG

--------------------------------------------------

--- Analysis for Column: ea_name ---
Data Type: object
Unique Values (1817): ['JESUS GOSPEL MANIFESTATION OF POWER &HEALING MINIS'
 'DEEPER LIFE BIBLE CHURCH OWOAHIAFOR' 'AMAKPOKE HEALTH CENTRE UMUAKU'
 'EME UKA UKA' 'CHINWE EK
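
When revisiting `urban_rural`, one thing worth checking is the `sector` column from the metadata listing earlier (labelled "12. sector"), if it is still present in `data`. Whether it actually encodes urban versus rural is an assumption that the value labels would have to confirm. The sketch below shows how numeric codes could be decoded with a dictionary shaped like an entry of `meta.variable_value_labels`. The `decode_column` helper is hypothetical, and the codes and labels in `toy_value_labels` are made up for illustration, not taken from the survey.

```python
import pandas as pd


def decode_column(series, value_labels):
    """Map numeric codes to text labels; codes without a label become NaN."""
    return series.map(value_labels)


# Made-up codes and labels, standing in for
# meta.variable_value_labels.get("sector", {})
toy_value_labels = {1.0: "Urban", 2.0: "Rural"}
toy_sector = pd.Series([1.0, 2.0, 2.0, 9.0], name="sector")

decoded = decode_column(toy_sector, toy_value_labels)
print(decoded.value_counts(dropna=False))
```

Any code that has no label shows up as NaN in the counts, which makes unexpected values easy to spot.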

### 3, financial_inclusion_metrics

To consolidate the financial_inclusion_metrics column, we need to aggregate multiple indicators that capture various dimensions of financial inclusion. These metrics typically include account ownership, savings behavior, borrowing behavior, credit access, digital payment adoption, and mobile money usage. Based on the dataset's column names and descriptions, here’s how we can consolidate this column:

**Step 1: Identify Relevant Columns**

The following columns are likely candidates for deriving `financial_inclusion_metrics`:

1. **Account Ownership**
   - `qf4_*`: Indicates whether an individual has an account registered in their name with various institutions (e.g., commercial banks, microfinance banks, etc.).
   - `qf5_*`: Indicates whether the individual currently uses specific financial providers.
2. **Savings Behavior**
   - `sa2_*`: Captures reasons for saving.
   - `sa3a_*`: Captures methods of saving.
3. **Borrowing Behavior**
   - `lc2a_*`: Captures sources of borrowing.
   - `lc4_*`: Captures reasons for not borrowing.
4. **Credit Access**
   - `qf6_*`: Captures activities done using financial institutions, such as loans or credit services.
5. **Digital Payment Adoption**
   - `py1b_*`: Captures payment methods like USSD, mobile money, internet banking, etc.
6. **Mobile Money Usage**
   - `fs3_5_*`: Indicates whether the individual uses mobile money services.

**Step 2: Consolidation Logic**

We will create a composite score or categorical representation for `financial_inclusion_metrics` based on the above dimensions. Here’s how:

- **Binary Indicators**: For each dimension (account ownership, savings behavior, etc.), assign a binary value (1 = Yes, 0 = No) based on whether the individual participates in that activity.
- **Composite Score**: Sum up the binary indicators across all dimensions to calculate a composite score. For example:
  - A score of 0: No financial inclusion.
  - A score of 1–2: Low financial inclusion.
  - A score of 3–4: Moderate financial inclusion.
  - A score of 5+: High financial inclusion.
- **Categorical Representation**: Alternatively, classify individuals into categories like "Low," "Moderate," or "High" financial inclusion based on thresholds derived from the composite score.
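
The scoring rule described above can be sketched on a toy frame as follows. The indicator values are invented for illustration, and `pd.cut` is used here as one possible way to apply the thresholds in a single vectorized step.

```python
import pandas as pd

# Toy binary indicators (1 = participates, 0 = does not), one column per dimension
toy = pd.DataFrame({
    "account_ownership":        [0, 1, 1, 1],
    "savings_behavior":         [0, 1, 1, 1],
    "borrowing_behavior":       [0, 0, 1, 1],
    "credit_access":            [0, 0, 0, 1],
    "digital_payment_adoption": [0, 0, 1, 1],
    "mobile_money_usage":       [0, 0, 0, 1],
})

# Composite score: number of dimensions the person participates in (0 to 6)
score = toy.sum(axis=1)

# Bin edges are right-inclusive: (-1, 0], (0, 2], (2, 4], (4, 6]
level = pd.cut(
    score,
    bins=[-1, 0, 2, 4, 6],
    labels=["None", "Low", "Moderate", "High"],
)
print(pd.DataFrame({"score": score, "level": level}))
```

With six dimensions the maximum score is 6, so the "5+" band covers scores of 5 and 6.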

In [106]:
# We’ll loop through each of these groups of columns to extract their data types, unique values, value counts, missing values, and sample values. This will help us confirm their suitability for consolidation.

# Define the relevant column groups
account_ownership_cols = [col for col in data.columns if col.startswith('qf4_') or col.startswith('qf5_')]
savings_behavior_cols = [col for col in data.columns if col.startswith('sa2_') or col.startswith('sa3a_')]
borrowing_behavior_cols = [col for col in data.columns if col.startswith('lc2a_') or col.startswith('lc4_')]
credit_access_cols = [col for col in data.columns if col.startswith('qf6_')]
digital_payment_cols = [col for col in data.columns if col.startswith('py1b_')]
mobile_money_cols = [col for col in data.columns if col.startswith('fs3_5_')]

# Combine all relevant columns into a single list
relevant_columns = (
    account_ownership_cols +
    savings_behavior_cols +
    borrowing_behavior_cols +
    credit_access_cols +
    digital_payment_cols +
    mobile_money_cols
)

# Analyze each column
for column in relevant_columns:
    print(f"--- Analysis for Column: {column} ---")
    
    # Step 1: Check the column type
    print(f"Data Type: {data[column].dtype}")
    
    # Step 2: List unique values
    unique_values = data[column].unique()
    print(f"Unique Values ({len(unique_values)}): {unique_values}")
    
    # Step 3: Show value counts (frequency of each unique value)
    print("Value Counts:")
    print(data[column].value_counts())
    
    # Step 4: Count missing values
    missing_values = data[column].isnull().sum()
    print(f"Missing Values: {missing_values}")
    
    # Step 5: Show sample values
    print("Sample Values:")
    print(data[[column]].sample(5))
    
    print("\n" + "-"*50 + "\n")

--- Analysis for Column: qf4_1 ---
Data Type: float64
Unique Values (3): [nan  1.  2.]
Value Counts:
qf4_1
1.0    11209
2.0      477
Name: count, dtype: int64
Missing Values: 16706
Sample Values:
       qf4_1
27790    NaN
19724    NaN
15627    NaN
23278    NaN
24407    1.0

--------------------------------------------------

--- Analysis for Column: qf4_2 ---
Data Type: float64
Unique Values (3): [nan  2.  1.]
Value Counts:
qf4_2
1.0    403
2.0    194
Name: count, dtype: int64
Missing Values: 27795
Sample Values:
       qf4_2
24253    NaN
13632    NaN
17371    NaN
16721    NaN
22560    NaN

--------------------------------------------------

--- Analysis for Column: qf4_16 ---
Data Type: float64
Unique Values (3): [nan  1.  2.]
Value Counts:
qf4_16
1.0    243
2.0     93
Name: count, dtype: int64
Missing Values: 28056
Sample Values:
       qf4_16
2774      NaN
26347     NaN
4781      NaN
9949      NaN
8841      NaN

--------------------------------------------------

--- Analysis for Co

In [107]:
# Using the information above, we can now consolidate these columns into our desired column.

# Step 1: Create binary indicators for each dimension
data['account_ownership'] = data[account_ownership_cols].isin([1.0]).any(axis=1).astype(int)
data['savings_behavior'] = data[savings_behavior_cols].isin([1.0]).any(axis=1).astype(int)
data['borrowing_behavior'] = data[borrowing_behavior_cols].isin([1.0]).any(axis=1).astype(int)
data['credit_access'] = data[credit_access_cols].isin([1.0]).any(axis=1).astype(int)
data['digital_payment_adoption'] = data[digital_payment_cols].isin([1.0, 2.0, 3.0, 4.0, 5.0]).any(axis=1).astype(int)
data['mobile_money_usage'] = data[mobile_money_cols].isin([1.0]).any(axis=1).astype(int)

# Step 2: Calculate a composite score
data['financial_inclusion_metrics'] = (
    data['account_ownership'] +
    data['savings_behavior'] +
    data['borrowing_behavior'] +
    data['credit_access'] +
    data['digital_payment_adoption'] +
    data['mobile_money_usage']
)

# Step 3: Categorize financial inclusion levels
def categorize_financial_inclusion(score):
    if score == 0:
        return 'None'
    elif 1 <= score <= 2:
        return 'Low'
    elif 3 <= score <= 4:
        return 'Moderate'
    else:
        return 'High'

data['financial_inclusion_level'] = data['financial_inclusion_metrics'].apply(categorize_financial_inclusion)

# Display the consolidated columns
print(data[['account_ownership', 'savings_behavior', 'borrowing_behavior', 'credit_access',
            'digital_payment_adoption', 'mobile_money_usage', 'financial_inclusion_metrics',
            'financial_inclusion_level']].head())

   account_ownership  savings_behavior  borrowing_behavior  credit_access  \
0                  0                 1                   0              0   
1                  0                 1                   0              0   
2                  0                 1                   0              0   
3                  0                 0                   0              0   
4                  0                 0                   0              0   

   digital_payment_adoption  mobile_money_usage  financial_inclusion_metrics  \
0                         1                   0                            2   
1                         1                   0                            2   
2                         1                   0                            2   
3                         1                   0                            1   
4                         1                   0                            1   

  financial_inclusion_level  
0                       Lo

### 4, demographic_factors


To create the `demographic_factors` column, we need to consolidate information about key demographic attributes such as age, gender, education level, and potentially other factors like income group or household size. These attributes are typically spread across multiple columns in the dataset.

**Step 1: Identify Relevant Columns**

Based on the dataset's column names and descriptions, the following columns are likely candidates for deriving `demographic_factors`:

1. **Age**
   - `agegroup`: Categorical variable indicating age groups (e.g., "15–17," "18+").
   - `hh_age_15_17_*`: Household-level breakdown of individuals aged 15–17.
   - `hh_age_18_plus_*`: Household-level breakdown of individuals aged 18+.
   - `e7`: The respondent's age in years.
2. **Gender**
   - `e6`: Binary or categorical variable indicating gender (e.g., "Male," "Female").
3. **Education Level**
   - `e8`: Likely a categorical variable indicating the highest level of education achieved (e.g., "Primary," "Secondary," "Tertiary").
4. **Income Group**
   - `wealthscore`: A numeric variable representing household wealth.
   - `quintile`: A categorical variable dividing households into wealth quintiles (e.g., "Lowest," "Second," "Middle," "Fourth," "Highest").
5. **Household Size**
   - `hh_total_size_1`: Total household size.
   - `hh_total_size_2`: Male household members.
   - `hh_total_size_3`: Female household members.

**Step 2: Consolidation Logic**

We will create a composite representation of demographic factors by combining the above attributes. Here’s how:

- **Categorize Age**: Use `e7` (the respondent's age) to classify individuals into broad age categories (e.g., "Youth," "Adult").
- **Gender**: Use `e6` to classify individuals as "Male" or "Female."
- **Education Level**: Use `e8` to classify individuals into categories like "Low Education" (Primary), "Medium Education" (Secondary), or "High Education" (Tertiary).
- **Income Group**: Use `quintile` to classify households into income groups (e.g., "Low Income," "Middle Income," "High Income").
- **Household Size**: Use `hh_total_size_1` to classify households as "Small" (≤4 members), "Medium" (5–8 members), or "Large" (>8 members).
- **Composite Representation**: Combine these factors into a single string or categorical variable (e.g., "Young Male, Secondary Education, Low Income, Small Household").
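
As a rough illustration of the last step, the sketch below builds the composite string on a toy frame. The category values are invented, the `household_size_category` column name is hypothetical, and the household size bands follow the thresholds listed above.

```python
import pandas as pd

toy = pd.DataFrame({
    "age_category": ["Youth", "Adult"],
    "gender": ["Male", "Female"],
    "education_category": ["Medium Education", "High Education"],
    "income_group": ["Low Income", "Middle Income"],
    "hh_total_size_1": [3, 9],
})

# Household size bands from the text: <=4 small, 5-8 medium, >8 large
toy["household_size_category"] = pd.cut(
    toy["hh_total_size_1"],
    bins=[0, 4, 8, float("inf")],
    labels=["Small Household", "Medium Household", "Large Household"],
).astype(str)

parts = ["age_category", "gender", "education_category",
         "income_group", "household_size_category"]

# Join the per-person categories into one comma-separated description
toy["demographic_factors"] = toy[parts].apply(lambda row: ", ".join(row), axis=1)
print(toy["demographic_factors"].tolist())
```

A single string is convenient for display, but keeping the individual category columns as well makes grouping and filtering easier later on.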


In [108]:
# Analyzing relevant columns

# Define relevant columns
relevant_columns = ['e7', 'e6', 'e8', 'quintile', 'hh_total_size_1']

# Analyze each column
for column in relevant_columns:
    print(f"--- Analysis for Column: {column} ---")
    
    # Step 1: Check the column type
    print(f"Data Type: {data[column].dtype}")
    
    # Step 2: List unique values
    unique_values = data[column].unique()
    print(f"Unique Values ({len(unique_values)}): {unique_values}")
    
    # Step 3: Show value counts (frequency of each unique value)
    print("Value Counts:")
    print(data[column].value_counts())
    
    # Step 4: Count missing values
    missing_values = data[column].isnull().sum()
    print(f"Missing Values: {missing_values}")
    
    # Step 5: Show sample values
    print("Sample Values:")
    print(data[[column]].sample(5))
    
    print("\n" + "-"*50 + "\n")

--- Analysis for Column: e7 ---
Data Type: float64
Unique Values (83): [ 43.  51.  69.  65.  82.  40.  30.  67.  52.  35.  20.  27.  16.  50.
  58.  53.  62.  17. 995.  54.  38.  45.  42.  75.  60.  77.  15.  63.
  26.  25.  56.  48.  31.  84.  47.  24.  36.  18.  70.  29.  22.  32.
  68.  80.  28.  41.  44.  73.  66.  55.  89.  79.  74.  46.  23.  57.
  59.  34.  19.  64.  85.  90.  37.  72.  76.  81.  86.  33.  78.  49.
  61.  39.  83.  71.  88.  21.  98.  87.  95.  96.  92.  97.  91.]
Value Counts:
e7
30.0    1892
40.0    1612
35.0    1570
25.0    1445
45.0    1104
        ... 
95.0       6
96.0       2
92.0       1
97.0       1
91.0       1
Name: count, Length: 83, dtype: int64
Missing Values: 0
Sample Values:
         e7
16086  55.0
25868  50.0
11511  16.0
7098   50.0
4600   48.0

--------------------------------------------------

--- Analysis for Column: e6 ---
Data Type: float64
Unique Values (2): [2. 1.]
Value Counts:
e6
2.0    15198
1.0    13194
Name: count, dtype: int64
Miss

In [109]:
# Now to consolidate

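# Step 1: Categorize age
def categorize_age(age):
    # The analysis above shows a value of 995 in e7, which cannot be a real age.
    # It is treated here as a non-response code (an assumption; check the codebook).
    if pd.isna(age) or age > 120:
        return 'Unknown'
    if age <= 25:
        return 'Youth'
    return 'Adult'

data['age_category'] = data['e7'].apply(categorize_age)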

# Step 2: Categorize gender
def categorize_gender(gender_code):
    if gender_code == 1.0:
        return 'Male'
    elif gender_code == 2.0:
        return 'Female'
    else:
        return 'Unknown'

data['gender'] = data['e6'].apply(categorize_gender)

# Step 3: Categorize education level
def categorize_education(education_level):
    if education_level in [1.0, 2.0, 3.0]:
        return 'Low Education'
    elif education_level in [4.0, 5.0, 6.0, 7.0]:
        return 'Medium Education'
    elif education_level in [8.0, 9.0, 10.0, 11.0]:
        return 'High Education'
    else:
        return 'Unknown'

data['education_category'] = data['e8'].apply(categorize_education)

# Step 4: Categorize income group
def categorize_income(quintile):
    if quintile in [1.0, 2.0]:
        return 'Low Income'
    elif quintile in [3.0, 4.0]:
        return 'Middle Income'
    elif quintile == 5.0:
        return 'High Income'
    else:
        return 'Unknown'

data['income_group'] = data['quintile'].apply(categorize_income)

# Step 5: Categorize household size
def categorize_household_size(size):
    size = int(size)  # Convert from object to integer
    if size <= 4:
        return 'Small Household'
    elif 5 <= size <= 8:
        return 'Medium Household'
    else:
        return 'Large Household'

data['household_size_category'] = data['hh_total_size_1'].apply(categorize_household_size)

# Step 6: Combine demographic factors
data['demographic_factors'] = (
    data['age_category'] + ', ' +
    data['gender'] + ', ' +
    data['education_category'] + ', ' +
    data['income_group'] + ', ' +
    data['household_size_category']
)

# Display the consolidated column
print(data[['age_category', 'gender', 'education_category', 'income_group', 'household_size_category', 'demographic_factors']].head())

  age_category  gender education_category   income_group  \
0        Adult  Female   Medium Education  Middle Income   
1        Adult  Female   Medium Education  Middle Income   
2        Adult    Male   Medium Education  Middle Income   
3        Adult    Male   Medium Education  Middle Income   
4        Adult    Male      Low Education  Middle Income   

  household_size_category                                demographic_factors  
0         Small Household  Adult, Female, Medium Education, Middle Income...  
1         Small Household  Adult, Female, Medium Education, Middle Income...  
2         Small Household  Adult, Male, Medium Education, Middle Income, ...  
3         Small Household  Adult, Male, Medium Education, Middle Income, ...  
4         Small Household  Adult, Male, Low Education, Middle Income, Sma...  
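
The threshold-based helpers above can also be written with `pd.cut`, which bins a whole column in one call. The sketch below is only an illustration on a made-up frame: the `toy` data is invented, and the cutoff of 120 for discarding codes such as 995 is an assumption, not something taken from the survey documentation.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey columns; the values are invented for illustration.
toy = pd.DataFrame({
    "e7": [16.0, 25.0, 26.0, 995.0],           # age, with 995 as a presumed non-response code
    "hh_total_size_1": ["3", "5", "8", "12"],  # household size stored as text
})

# Ages above an assumed plausible maximum become missing before binning.
age = toy["e7"].where(toy["e7"] <= 120)
toy["age_category"] = pd.cut(
    age, bins=[-np.inf, 25, np.inf], labels=["Youth", "Adult"]
).astype(object).fillna("Unknown")

# Convert text to numbers; anything unparseable becomes missing instead of raising.
size = pd.to_numeric(toy["hh_total_size_1"], errors="coerce")
toy["household_size_category"] = pd.cut(
    size,
    bins=[-np.inf, 4, 8, np.inf],
    labels=["Small Household", "Medium Household", "Large Household"],
).astype(object).fillna("Unknown")

print(toy)
```

The bins are closed on the right by default, so 25 falls in "Youth" and 8 falls in "Medium Household", matching the `<=` comparisons in the functions above.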


### 5, savings_behavior

Let’s proceed to create the savings_behavior column. The savings_behavior column should capture information about:

Whether an individual saves money.

How they save (e.g., formal institutions like banks, informal methods like savings groups, or at home).

Frequency or purpose of saving (if available).

From the dataset, the following columns are likely candidates for deriving savings_behavior:

sa2_* : Captures reasons for saving (e.g., emergencies, education, personal needs).
sa3a_* : Captures methods of saving (e.g., with a bank, mobile money, savings groups, etc.).
sa7b1_* : Captures types of savings (e.g., cash, assets, cryptocurrency, etc.).
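
Because names like `sa3a_1` are only question codes, it helps to look up each variable's label before deciding what a column means. pyreadstat keeps these in `meta.column_names_to_labels`. The sketch below uses a hand-written `labels` dictionary in its place so that it runs on its own; the label texts are invented, and the real ones should be read from `meta`.

```python
# Stand-in for meta.column_names_to_labels; these label texts are hypothetical.
labels = {
    "sa2_1": "Reason for saving: example reason A",
    "sa3a_1": "Saving method: example method A",
    "sa3a_2": "Saving method: example method B",
    "sa7b1_1": None,  # a variable without a label may come back as None
}

def labels_for_prefix(labels, prefix):
    """Return {column: label} for columns whose name starts with `prefix`."""
    return {
        col: (label or "(no label)")
        for col, label in labels.items()
        if col.startswith(prefix)
    }

for col, label in labels_for_prefix(labels, "sa3a_").items():
    print(f"{col}: {label}")
```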

In [110]:
# Define relevant column groups
savings_reason_cols = [col for col in data.columns if col.startswith('sa2_')]
savings_method_cols = [col for col in data.columns if col.startswith('sa3a_')]
savings_type_cols = [col for col in data.columns if col.startswith('sa7b1_')]

# Combine all relevant columns into a single list
relevant_columns = savings_reason_cols + savings_method_cols + savings_type_cols

# Analyze each column
for column in relevant_columns:
    print(f"--- Analysis for Column: {column} ---")
    
    # Step 1: Check the column type
    print(f"Data Type: {data[column].dtype}")
    
    # Step 2: List unique values
    unique_values = data[column].unique()
    print(f"Unique Values ({len(unique_values)}): {unique_values}")
    
    # Step 3: Show value counts (frequency of each unique value)
    print("Value Counts:")
    print(data[column].value_counts())
    
    # Step 4: Count missing values
    missing_values = data[column].isnull().sum()
    print(f"Missing Values: {missing_values}")
    
    # Step 5: Show sample values
    print("Sample Values:")
    print(data[[column]].sample(5))
    
    print("\n" + "-"*50 + "\n")

--- Analysis for Column: sa2_1 ---
Data Type: float64
Unique Values (3): [ 0. nan  1.]
Value Counts:
sa2_1
0.0    10443
1.0      161
Name: count, dtype: int64
Missing Values: 17788
Sample Values:
       sa2_1
24271    0.0
15238    NaN
20974    NaN
21836    NaN
10725    NaN

--------------------------------------------------

--- Analysis for Column: sa2_2 ---
Data Type: float64
Unique Values (3): [ 0. nan  1.]
Value Counts:
sa2_2
0.0    7150
1.0    3454
Name: count, dtype: int64
Missing Values: 17788
Sample Values:
       sa2_2
16894    NaN
1859     NaN
27539    NaN
256      NaN
4844     NaN

--------------------------------------------------

--- Analysis for Column: sa2_3 ---
Data Type: float64
Unique Values (3): [ 0. nan  1.]
Value Counts:
sa2_3
0.0    8289
1.0    2315
Name: count, dtype: int64
Missing Values: 17788
Sample Values:
       sa2_3
8167     NaN
8711     NaN
4046     NaN
23227    NaN
10750    0.0

--------------------------------------------------

--- Analysis for Column

In [111]:
# Now we consolidate


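# Step 1: Identify relevant columns
# The sa3a_* names are question codes, so the keywords are matched against each
# variable's label from `meta` as well as its name. The prefix check is applied to
# every keyword: written as `a and b or c or d`, `and` binds tighter than `or`, so
# any column whose name contains e.g. 'microfinance' would match without the prefix.
# The keywords are assumptions about the label wording; print the lists to confirm.
labels = meta.column_names_to_labels

def sa3a_columns_matching(*keywords):
    matched = []
    for col in data.columns:
        text = f"{col} {labels.get(col) or ''}".lower()
        if col.startswith('sa3a_') and any(keyword in text for keyword in keywords):
            matched.append(col)
    return matched

formal_savings_cols = sa3a_columns_matching('bank', 'microfinance', 'mobile')
informal_savings_cols = sa3a_columns_matching('group', 'community', 'thrift')
home_savings_cols = sa3a_columns_matching('cash', 'assets')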

# Step 2: Create binary indicators for each category
data['saves_formally'] = data[formal_savings_cols].isin([1.0]).any(axis=1).astype(int)
data['saves_informally'] = data[informal_savings_cols].isin([1.0]).any(axis=1).astype(int)
data['saves_at_home'] = data[home_savings_cols].isin([1.0]).any(axis=1).astype(int)

# Step 3: Categorize saving behavior
def categorize_savings_behavior(row):
    if row['saves_formally']:
        return 'Formal Savings'
    elif row['saves_informally']:
        return 'Informal Savings'
    elif row['saves_at_home']:
        return 'Saves at Home'
    else:
        return 'Does Not Save'

data['savings_behavior'] = data.apply(categorize_savings_behavior, axis=1)

# Display the consolidated column
print(data[['saves_formally', 'saves_informally', 'saves_at_home', 'savings_behavior']].head())

   saves_formally  saves_informally  saves_at_home savings_behavior
0               0                 0              0    Does Not Save
1               0                 0              0    Does Not Save
2               0                 0              1    Saves at Home
3               0                 0              0    Does Not Save
4               0                 0              0    Does Not Save
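
One thing to keep in mind: `isin([1.0])` treats a missing answer the same as a 0, so respondents who were never asked a question end up in "Does Not Save". If that distinction matters, one option is to flag rows where every method column is missing. This is a minimal sketch on invented data with hypothetical column names, not a replacement for the step above.

```python
import numpy as np
import pandas as pd

# Invented multi-response columns: 1.0 = selected, 0.0 = not selected, NaN = not asked.
toy = pd.DataFrame({
    "method_a": [1.0, 0.0, np.nan],
    "method_b": [0.0, 0.0, np.nan],
})
method_cols = ["method_a", "method_b"]

uses_any = toy[method_cols].eq(1.0).any(axis=1)
not_asked = toy[method_cols].isna().all(axis=1)

# Start from the default and overwrite where each condition holds.
toy["status"] = (
    pd.Series("Uses None", index=toy.index)
    .mask(not_asked, "Not Asked")
    .mask(uses_any, "Uses Method")
)
print(toy)
```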


### 6, borrowing_behavior

This column will capture information about an individual's borrowing habits, including whether they borrow money, from whom they borrow, and the purpose of borrowing.

Step 1: Define borrowing_behavior

The borrowing_behavior column should capture:

Whether an individual borrows money.
From whom they borrow (e.g., formal institutions like banks, informal sources like family or friends, or other entities).
The purpose of borrowing (if available).

From the dataset, the following columns are likely candidates for deriving borrowing_behavior:

lc1_* : Indicates whether the individual has borrowed money in the past 12 months.

lc2a_* : Captures the source of borrowing (e.g., bank, family, savings groups, etc.).

lc2d_* : Captures the purpose of borrowing (e.g., emergencies, personal needs, business, etc.).

Before consolidating, let’s analyze the relevant columns to confirm their structure and values.

In [113]:
# Define relevant column groups
borrowing_indicators_cols = [col for col in data.columns if col.startswith('lc1_')]  # Borrowing indicators
borrowing_sources_cols = [col for col in data.columns if col.startswith('lc2a_')]    # Borrowing sources
borrowing_purpose_cols = [col for col in data.columns if col.startswith('lc2d_')]    # Purpose of borrowing

# Combine all relevant columns into a single list
relevant_columns = borrowing_indicators_cols + borrowing_sources_cols + borrowing_purpose_cols

# Analyze each column
for column in relevant_columns:
    print(f"--- Analysis for Column: {column} ---")
    
    # Step 1: Check the column type
    print(f"Data Type: {data[column].dtype}")
    
    # Step 2: List unique values
    unique_values = data[column].unique()
    print(f"Unique Values ({len(unique_values)}): {unique_values}")
    
    # Step 3: Show value counts (frequency of each unique value)
    print("Value Counts:")
    print(data[column].value_counts())
    
    # Step 4: Count missing values
    missing_values = data[column].isnull().sum()
    print(f"Missing Values: {missing_values}")
    
    # Step 5: Show sample values
    print("Sample Values:")
    print(data[[column]].sample(5))
    
    print("\n" + "-"*50 + "\n")

--- Analysis for Column: lc1_1 ---
Data Type: float64
Unique Values (2): [2. 1.]
Value Counts:
lc1_1
2.0    21303
1.0     7089
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       lc1_1
15594    2.0
20995    1.0
20757    2.0
27369    2.0
20512    2.0

--------------------------------------------------

--- Analysis for Column: lc1_2 ---
Data Type: float64
Unique Values (2): [2. 1.]
Value Counts:
lc1_2
2.0    25815
1.0     2577
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       lc1_2
14651    2.0
824      2.0
88       2.0
8691     2.0
4096     2.0

--------------------------------------------------

--- Analysis for Column: lc1_3 ---
Data Type: float64
Unique Values (2): [2. 1.]
Value Counts:
lc1_3
2.0    25642
1.0     2750
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       lc1_3
6013     2.0
9945     2.0
11885    2.0
3525     2.0
4474     2.0

--------------------------------------------------

--- Analysis for Column: lc1_4 ---
Data Type: flo

Based on the analysis, here’s how we’ll consolidate the borrowing_behavior column:

Binary Indicator for Borrowing :

Use lc1_* columns to determine whether an individual has borrowed money in the past 12 months.

Assign 1 if any lc1_* column indicates borrowing; otherwise, assign 0.

Categorization of Borrowing Sources :

Classify individuals based on their primary source of borrowing:

Formal Borrowing : From banks, microfinance institutions, or government programs.

Informal Borrowing : From family, friends, savings groups, or informal lenders.

Other Borrowing : From unspecified or unique sources.

Purpose of Borrowing :

Use lc2d_* columns to classify the purpose of borrowing (e.g., emergencies, personal needs, business).

Composite Representation :

Combine binary indicators, borrowing sources, and purposes into a single string representation (borrowing_behavior).
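
The binary indicator described above can be computed directly from the `lc1_*` columns. The analysis output shows that these take the values 1.0 and 2.0; the sketch assumes 1.0 means "yes" and 2.0 means "no", which should be confirmed against the value labels in `meta`. The frame below is invented so the example runs on its own.

```python
import pandas as pd

# Invented rows using the 1.0 / 2.0 coding seen in the lc1_* output.
toy = pd.DataFrame({
    "lc1_1": [2.0, 1.0, 2.0],
    "lc1_2": [2.0, 2.0, 2.0],
    "lc1_3": [1.0, 2.0, 2.0],
})
lc1_cols = [col for col in toy.columns if col.startswith("lc1_")]

# Assumption: 1.0 = borrowed from this source, 2.0 = did not.
toy["has_borrowed"] = toy[lc1_cols].eq(1.0).any(axis=1).astype(int)
print(toy["has_borrowed"].tolist())
```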

In [114]:
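# Step 1: Identify relevant columns
# The lc2a_* names are question codes (e.g. lc2a_1), so a keyword such as "bank"
# never appears in the name itself. Keywords are therefore matched against each
# variable's label from `meta` as well. The keywords are assumptions about the
# label wording; print the three lists to confirm they pick up the intended columns.
labels = meta.column_names_to_labels

def borrowing_sources_matching(*keywords):
    return [
        col for col in borrowing_sources_cols
        if any(keyword in f"{col} {labels.get(col) or ''}".lower() for keyword in keywords)
    ]

formal_borrowing_cols = borrowing_sources_matching('bank', 'microfinance', 'government')
informal_borrowing_cols = borrowing_sources_matching('family', 'friend', 'savings group')
other_borrowing_cols = borrowing_sources_matching('other')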

# Step 2: Create binary indicators for each category
data['borrows_formally'] = data[formal_borrowing_cols].isin([1.0]).any(axis=1).astype(int)
data['borrows_informally'] = data[informal_borrowing_cols].isin([1.0]).any(axis=1).astype(int)
data['borrows_other'] = data[other_borrowing_cols].isin([1.0]).any(axis=1).astype(int)

# Step 3: Categorize borrowing behavior
def categorize_borrowing_behavior(row):
    if row['borrows_formally']:
        return 'Formal Borrowing'
    elif row['borrows_informally']:
        return 'Informal Borrowing'
    elif row['borrows_other']:
        return 'Other Borrowing'
    else:
        return 'Does Not Borrow'

data['borrowing_behavior'] = data.apply(categorize_borrowing_behavior, axis=1)

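# Step 4: Add purpose of borrowing (optional)
def get_borrowing_purpose(row):
    # Returns the first purpose flagged with 1.0. The variable label from `meta` is
    # used when available, otherwise the column name, because the suffix alone
    # (e.g. "1" from lc2d_1) says nothing about the purpose.
    for col in borrowing_purpose_cols:
        if row[col] == 1.0:
            return meta.column_names_to_labels.get(col) or col
    return 'Unknown'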

data['borrowing_purpose'] = data.apply(get_borrowing_purpose, axis=1)

# Step 5: Combine borrowing behavior and purpose
data['borrowing_behavior'] = (
    data['borrowing_behavior'] + ', Purpose: ' + data['borrowing_purpose']
)

# Display the consolidated column
print(data[['borrows_formally', 'borrows_informally', 'borrows_other', 'borrowing_purpose', 'borrowing_behavior']].head())

   borrows_formally  borrows_informally  borrows_other borrowing_purpose  \
0                 0                   0              0           Unknown   
1                 0                   0              0           Unknown   
2                 0                   0              0           Unknown   
3                 0                   0              0           Unknown   
4                 0                   0              0           Unknown   

                  borrowing_behavior  
0  Does Not Borrow, Purpose: Unknown  
1  Does Not Borrow, Purpose: Unknown  
2  Does Not Borrow, Purpose: Unknown  
3  Does Not Borrow, Purpose: Unknown  
4  Does Not Borrow, Purpose: Unknown  


### 7, digital_payment_adoption

This column will capture information about an individual's adoption and usage of digital payment methods, such as mobile money, bank transfers, or other electronic payment systems.

Step 1: Define digital_payment_adoption
The digital_payment_adoption column should capture:

Whether an individual uses digital payment methods.

Which digital payment methods they use (e.g., mobile money, bank transfers, ATMs, etc.).

Frequency of usage (if available).

From the dataset, the following columns are likely candidates for deriving digital_payment_adoption:

py2b_* : Indicates whether the individual used specific payment methods in the past 12 months (e.g., mobile money, bank transfers, ATMs, cash, etc.).

mm3a : Captures how often the individual uses their mobile money account.

mm3b_* : Captures the specific mobile money services used (e.g., sending/receiving money, paying bills, etc.).

In [115]:
# Define relevant column groups
payment_method_cols = [col for col in data.columns if col.startswith('py2b_')]  # Payment methods
mobile_money_usage_cols = [col for col in data.columns if col.startswith('mm3a') or col.startswith('mm3b_')]  # Mobile money usage

# Combine all relevant columns into a single list
relevant_columns = payment_method_cols + mobile_money_usage_cols

# Analyze each column
for column in relevant_columns:
    print(f"--- Analysis for Column: {column} ---")
    
    # Step 1: Check the column type
    print(f"Data Type: {data[column].dtype}")
    
    # Step 2: List unique values
    unique_values = data[column].unique()
    print(f"Unique Values ({len(unique_values)}): {unique_values}")
    
    # Step 3: Show value counts (frequency of each unique value)
    print("Value Counts:")
    print(data[column].value_counts())
    
    # Step 4: Count missing values
    missing_values = data[column].isnull().sum()
    print(f"Missing Values: {missing_values}")
    
    # Step 5: Show sample values
    print("Sample Values:")
    print(data[[column]].sample(5))
    
    print("\n" + "-"*50 + "\n")

--- Analysis for Column: py2b_1 ---
Data Type: float64
Unique Values (3): [nan  1.  0.]
Value Counts:
py2b_1
1.0    5277
0.0     648
Name: count, dtype: int64
Missing Values: 22467
Sample Values:
       py2b_1
93        NaN
15876     NaN
17351     1.0
18934     NaN
4153      NaN

--------------------------------------------------

--- Analysis for Column: py2b_2 ---
Data Type: float64
Unique Values (3): [nan  0.  1.]
Value Counts:
py2b_2
0.0    5886
1.0      39
Name: count, dtype: int64
Missing Values: 22467
Sample Values:
       py2b_2
6784      0.0
12080     NaN
10450     NaN
27251     NaN
18927     0.0

--------------------------------------------------

--- Analysis for Column: py2b_3 ---
Data Type: float64
Unique Values (3): [nan  0.  1.]
Value Counts:
py2b_3
0.0    5879
1.0      46
Name: count, dtype: int64
Missing Values: 22467
Sample Values:
       py2b_3
12829     0.0
25821     NaN
9042      NaN
12464     NaN
6871      NaN

--------------------------------------------------

-

In [116]:
# Now to consolidate

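# Step 1: Identify relevant columns
# py2b_* covers every payment method, including cash, so counting all of them would
# flag cash users as mobile money users. Only columns whose variable label mentions
# mobile money are kept. The keyword is an assumption about the label wording;
# print the list to confirm it picks up the intended columns.
labels = meta.column_names_to_labels
mobile_money_cols = [
    col for col in data.columns
    if col.startswith('py2b_') and 'mobile money' in (labels.get(col) or '').lower()
]
other_payment_cols = []  # Add other digital payment columns here if applicable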

# Step 2: Create binary indicators for each category
data['uses_mobile_money'] = data[mobile_money_cols].isin([1.0]).any(axis=1).astype(int)
data['uses_other_digital_payments'] = data[other_payment_cols].isin([1.0]).any(axis=1).astype(int)

# Step 3: Categorize payment method
def categorize_payment_method(row):
    if row['uses_mobile_money']:
        return 'Mobile Money'
    elif row['uses_other_digital_payments']:
        return 'Other Digital Payments'
    else:
        return 'Does Not Use Digital Payments'

data['payment_method'] = data.apply(categorize_payment_method, axis=1)

# Step 4: Categorize mobile money frequency
def categorize_mobile_money_frequency(mm3a):
    if pd.isna(mm3a):
        return 'Non-User'
    elif mm3a in [1.0, 2.0, 3.0]:
        return 'Frequent User'
    elif mm3a in [4.0, 5.0, 6.0, 7.0]:
        return 'Occasional User'
    else:
        return 'Unknown'

data['mobile_money_frequency'] = data['mm3a'].apply(categorize_mobile_money_frequency)

# Step 5: Combine payment method and frequency
data['digital_payment_adoption'] = (
    data['payment_method'] + ', Frequency: ' + data['mobile_money_frequency']
)

# Display the consolidated column
print(data[['uses_mobile_money', 'uses_other_digital_payments',
            'payment_method', 'mobile_money_frequency', 'digital_payment_adoption']].head())

   uses_mobile_money  uses_other_digital_payments  \
0                  0                            0   
1                  0                            0   
2                  0                            0   
3                  0                            0   
4                  0                            0   

                  payment_method mobile_money_frequency  \
0  Does Not Use Digital Payments               Non-User   
1  Does Not Use Digital Payments               Non-User   
2  Does Not Use Digital Payments               Non-User   
3  Does Not Use Digital Payments               Non-User   
4  Does Not Use Digital Payments               Non-User   

                            digital_payment_adoption  
0  Does Not Use Digital Payments, Frequency: Non-...  
1  Does Not Use Digital Payments, Frequency: Non-...  
2  Does Not Use Digital Payments, Frequency: Non-...  
3  Does Not Use Digital Payments, Frequency: Non-...  
4  Does Not Use Digital Payments, Frequency: Non-..
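
The cut-offs used for `mm3a` (1 to 3 as frequent, 4 to 7 as occasional) depend on what each answer code actually means. pyreadstat keeps the code-to-text mapping in `meta.variable_value_labels`, so the codes can be decoded before choosing thresholds. In the sketch below the mapping is a hypothetical stand-in; the codes and texts are invented.

```python
import numpy as np
import pandas as pd

# Stand-in for meta.variable_value_labels["mm3a"]; codes and texts are hypothetical.
mm3a_value_labels = {1.0: "Daily", 2.0: "Weekly", 3.0: "Monthly", 4.0: "Less often"}

codes = pd.Series([1.0, 4.0, np.nan, 9.0], name="mm3a")

# Map codes to their label text; unmapped codes and missing values become NaN.
decoded = codes.map(mm3a_value_labels)
print(decoded.fillna("No label / missing").tolist())
```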

### 8, access_to_electricity

Let’s proceed with the access_to_electricity column.

Step 1: Identify Relevant Columns
Based on the dataset's structure and naming conventions, the following columns are likely candidates for deriving access_to_electricity:

d3_18 : Indicates whether the household has a telephone (landline), which might correlate with electricity access.

d3_19 : Indicates whether the household has a mobile phone, which indirectly suggests electricity access for charging.

d3_25 : Indicates whether the household has a generator set, which is often used as a backup power source.

pc1_4 : Indicates whether an ATM is close to the household, which may imply infrastructure like electricity.

pc1_7 : Indicates whether a non-interest service provider is close to the household, potentially correlating with electricity access.

pc1_10 : Indicates whether a pharmacy is close to the household, which typically requires electricity.

pc1_11 : Indicates whether a restaurant is close to the household, which also typically requires electricity.

pc1_12 : Indicates whether a mobile phone kiosk is close to the household, which requires electricity for operation.

gen3_1 : Indicates whether the household owns agricultural land, which might correlate with electricity access for irrigation or other uses.

gen3_2 : Indicates whether the household owns land, which could be relevant if tied to electricity infrastructure.

gen3_3 : Indicates whether the household owns a house/dwelling, which might have electricity connections.

Step 2: Analyze Relevant Columns
To confirm the relevance of these columns, we’ll analyze their structure, unique values, value counts, missing values, and sample data.

In [117]:
# Define relevant columns
relevant_columns = [
    'd3_18', 'd3_19', 'd3_25', 'pc1_4', 'pc1_7', 'pc1_10', 'pc1_11', 'pc1_12',
    'gen3_1', 'gen3_2', 'gen3_3'
]

# Analyze each column
for column in relevant_columns:
    print(f"--- Analysis for Column: {column} ---")
    
    # Step 1: Check the column type
    print(f"Data Type: {data[column].dtype}")
    
    # Step 2: List unique values
    unique_values = data[column].unique()
    print(f"Unique Values ({len(unique_values)}): {unique_values}")
    
    # Step 3: Show value counts (frequency of each unique value)
    print("Value Counts:")
    print(data[column].value_counts())
    
    # Step 4: Count missing values
    missing_values = data[column].isnull().sum()
    print(f"Missing Values: {missing_values}")
    
    # Step 5: Show sample values
    print("Sample Values:")
    print(data[[column]].sample(5))
    
    print("\n" + "-"*50 + "\n")

--- Analysis for Column: d3_18 ---
Data Type: float64
Unique Values (2): [0. 1.]
Value Counts:
d3_18
0.0    28244
1.0      148
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       d3_18
26689    0.0
5431     0.0
23508    0.0
8594     0.0
21007    0.0

--------------------------------------------------

--- Analysis for Column: d3_19 ---
Data Type: float64
Unique Values (2): [1. 0.]
Value Counts:
d3_19
1.0    20063
0.0     8329
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       d3_19
8069     0.0
2826     0.0
24557    1.0
24852    1.0
6143     0.0

--------------------------------------------------

--- Analysis for Column: d3_25 ---
Data Type: float64
Unique Values (2): [0. 1.]
Value Counts:
d3_25
0.0    25128
1.0     3264
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       d3_25
24969    0.0
17170    0.0
20854    0.0
18992    1.0
27870    0.0

--------------------------------------------------

--- Analysis for Column: pc1_4 ---
Data Type: flo

To create the access_to_electricity column, we will:

Binary Indicator for Electricity Access :

Use d3_18, d3_19, and d3_25 to infer electricity access:

A household with a landline (d3_18 == 1.0) or mobile phone (d3_19 == 1.0) likely has electricity.

A household with a generator set (d3_25 == 1.0) may have limited or backup electricity.

Assign 1 if any of these indicators suggest electricity access; otherwise, assign 0.

Proximity to Infrastructure :

Use pc1_* columns to assess proximity to services that typically require electricity (e.g., ATMs, pharmacies, restaurants, mobile phone kiosks).

Assign 1 if any pc1_* column indicates proximity to such services; otherwise, assign 0.

Land Ownership :

Use gen3_* columns to infer whether land ownership correlates with infrastructure access.

For example, households owning agricultural land (gen3_1 == 1.0) may have better access to electricity for irrigation or other uses.

Composite Representation :

Combine binary indicators into a single string representation (access_to_electricity).

In [118]:
# Now to consolidate

# Step 1: Create binary indicators for electricity access
# (a vectorized comparison; missing values compare as False and become 0)
data['has_landline'] = (data['d3_18'] == 1.0).astype(int)
data['has_mobile_phone'] = (data['d3_19'] == 1.0).astype(int)
data['has_generator_set'] = (data['d3_25'] == 1.0).astype(int)

# Step 2: Create binary indicators for proximity to services
data['proximity_to_atm'] = (data['pc1_4'] == 1.0).astype(int)
data['proximity_to_non_interest_service'] = (data['pc1_7'] == 1.0).astype(int)
data['proximity_to_pharmacy'] = (data['pc1_10'] == 1.0).astype(int)
data['proximity_to_restaurant'] = (data['pc1_11'] == 1.0).astype(int)
data['proximity_to_mobile_kiosk'] = (data['pc1_12'] == 1.0).astype(int)

# Step 3: Create binary indicators for land ownership
data['owns_agricultural_land'] = (data['gen3_1'] == 1.0).astype(int)
data['owns_other_land'] = (data['gen3_2'] == 1.0).astype(int)
data['owns_house'] = (data['gen3_3'] == 1.0).astype(int)

# Step 4: Categorize electricity access
def categorize_electricity_access(row):
    # Direct indicators of electricity access
    if row['has_landline'] or row['has_mobile_phone']:
        return 'Likely Has Electricity'
    elif row['has_generator_set']:
        return 'Limited/Backup Electricity'
    
    # Proximity to services requiring electricity
    elif (
        row['proximity_to_atm'] or
        row['proximity_to_non_interest_service'] or
        row['proximity_to_pharmacy'] or
        row['proximity_to_restaurant'] or
        row['proximity_to_mobile_kiosk']
    ):
        return 'Proximity Suggests Electricity'
    
    # Land ownership as a proxy
    elif row['owns_agricultural_land'] or row['owns_other_land']:
        return 'Potential Electricity (Land Ownership)'
    
    # Default case
    else:
        return 'No Evidence of Electricity'

data['access_to_electricity'] = data.apply(categorize_electricity_access, axis=1)

# Display the consolidated column
print(data[['has_landline', 'has_mobile_phone', 'has_generator_set',
            'proximity_to_atm', 'proximity_to_pharmacy', 'proximity_to_mobile_kiosk',
            'owns_agricultural_land', 'owns_other_land', 'access_to_electricity']].head())

   has_landline  has_mobile_phone  has_generator_set  proximity_to_atm  \
0             0                 1                  0                 0   
1             0                 1                  0                 0   
2             0                 1                  0                 0   
3             0                 1                  0                 0   
4             0                 0                  0                 0   

   proximity_to_pharmacy  proximity_to_mobile_kiosk  owns_agricultural_land  \
0                      0                          0                       0   
1                      0                          0                       0   
2                      0                          0                       1   
3                      0                          0                       1   
4                      1                          0                       0   

   owns_other_land           access_to_electricity  
0                0         
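
Since `data.apply(..., axis=1)` walks through the frame row by row, it can be slow on a dataset this wide. A tiered classification like the one above can also be done with `np.select`, which evaluates each condition on whole columns. This is a sketch on an invented frame that reuses a few of the indicator names from the cell above; it covers only some of the tiers.

```python
import numpy as np
import pandas as pd

# Invented indicator values; column names mirror the ones created above.
toy = pd.DataFrame({
    "has_landline":      [0, 0, 0, 0],
    "has_mobile_phone":  [1, 0, 0, 0],
    "has_generator_set": [0, 1, 0, 0],
    "proximity_to_atm":  [0, 0, 1, 0],
})

conditions = [
    toy["has_landline"].eq(1) | toy["has_mobile_phone"].eq(1),
    toy["has_generator_set"].eq(1),
    toy["proximity_to_atm"].eq(1),
]
choices = [
    "Likely Has Electricity",
    "Limited/Backup Electricity",
    "Proximity Suggests Electricity",
]

# Conditions are checked in order, mirroring the if/elif chain.
toy["access_to_electricity"] = np.select(
    conditions, choices, default="No Evidence of Electricity"
)
print(toy["access_to_electricity"].tolist())
```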

### 9, internet_access

Let’s proceed with the internet_access column.

Step 1: Identify Relevant Columns
Based on the dataset's structure and naming conventions, the following columns are likely candidates for deriving internet_access:

d3_20 : Indicates whether the household has a computer, which often correlates with internet access.

d3_21 : Indicates whether the household has a tablet, which may also suggest internet access.

d3_22 : Indicates whether the household has a smart TV, which typically requires internet connectivity.

d3_23 : Indicates whether the household has a radio, which does not directly imply internet access but could be relevant in some contexts.

d3_24 : Indicates whether the household has a satellite dish, which might correlate with internet access (e.g., satellite internet).

pc1_4 : Indicates whether an ATM is close to the household, which may imply infrastructure like electricity and internet.

pc1_7 : Indicates whether a non-interest service provider is close to the household, potentially correlating with internet access.

pc1_10 : Indicates whether a pharmacy is close to the household, which typically requires electricity and possibly internet.

pc1_12 : Indicates whether a mobile phone kiosk is close to the household, which requires internet for operation.

gen3_1, gen3_2, gen3_3 : Indicate ownership of agricultural land, other land, and a house/dwelling respectively, which might correlate with infrastructure access.


In [119]:
# Define relevant columns
relevant_columns = [
    'd3_20', 'd3_21', 'd3_22', 'd3_23', 'd3_24', 'pc1_4', 'pc1_7', 'pc1_10', 'pc1_12',
    'gen3_1', 'gen3_2', 'gen3_3'
]

# Analyze each column
for column in relevant_columns:
    print(f"--- Analysis for Column: {column} ---")
    
    # Step 1: Check the column type
    print(f"Data Type: {data[column].dtype}")
    
    # Step 2: List unique values
    unique_values = data[column].unique()
    print(f"Unique Values ({len(unique_values)}): {unique_values}")
    
    # Step 3: Show value counts (frequency of each unique value)
    print("Value Counts:")
    print(data[column].value_counts())
    
    # Step 4: Count missing values
    missing_values = data[column].isnull().sum()
    print(f"Missing Values: {missing_values}")
    
    # Step 5: Show sample values
    print("Sample Values:")
    print(data[[column]].sample(5))
    
    print("\n" + "-"*50 + "\n")

--- Analysis for Column: d3_20 ---
Data Type: float64
Unique Values (2): [0. 1.]
Value Counts:
d3_20
0.0    28314
1.0       78
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       d3_20
10704    0.0
22195    0.0
24724    0.0
27712    0.0
21416    0.0

--------------------------------------------------

--- Analysis for Column: d3_21 ---
Data Type: float64
Unique Values (2): [0. 1.]
Value Counts:
d3_21
0.0    28206
1.0      186
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       d3_21
15507    1.0
18453    0.0
7540     0.0
22082    0.0
11095    0.0

--------------------------------------------------

--- Analysis for Column: d3_22 ---
Data Type: float64
Unique Values (2): [0. 1.]
Value Counts:
d3_22
0.0    24597
1.0     3795
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       d3_22
5838     0.0
10466    0.0
26357    0.0
13831    0.0
19810    0.0

--------------------------------------------------

--- Analysis for Column: d3_23 ---
Data Type: flo

Based on the results, here’s how we can consolidate the internet_access column.

Key Observations from the Analysis

1. Columns Related to Household Devices (d3_*)
   - d3_20: Indicates whether the household has a computer.
     - 0.0: No computer.
     - 1.0: Has a computer (likely correlates with internet access).
   - d3_21: Indicates whether the household has a tablet.
     - 0.0: No tablet.
     - 1.0: Has a tablet (likely correlates with internet access).
   - d3_22: Indicates whether the household has a smart TV.
     - 0.0: No smart TV.
     - 1.0: Has a smart TV (requires internet connectivity).
   - d3_23: Indicates whether the household has a radio.
     - 0.0: No radio.
     - 1.0: Has a radio (does not directly imply internet access).
   - d3_24: Indicates whether the household has a satellite dish.
     - 0.0: No satellite dish.
     - 1.0: Has a satellite dish (may correlate with internet access, e.g., satellite internet).
2. Columns Related to Proximity to Services (pc1_*)
   - pc1_4: Indicates whether an ATM is close to the household.
     - 0.0: No ATM nearby.
     - 1.0: ATM nearby (implies infrastructure like electricity and possibly internet).
   - pc1_7: Indicates whether a non-interest service provider is close to the household.
     - 0.0: No non-interest service provider nearby.
     - 1.0: Non-interest service provider nearby (may require internet).
   - pc1_10: Indicates whether a pharmacy is close to the household.
     - 0.0: No pharmacy nearby.
     - 1.0: Pharmacy nearby (typically requires electricity and possibly internet).
   - pc1_12: Indicates whether a mobile phone kiosk is close to the household.
     - 0.0: No mobile phone kiosk nearby.
     - 1.0: Mobile phone kiosk nearby (requires internet for operation).
3. Columns Related to Land Ownership (gen3_*)
   - gen3_1, gen3_2, gen3_3: Indicate ownership of agricultural land or other types of land.
     - 1.0: Owns land.
     - 2.0: Does not own land.
     - 3.0: Unknown or unspecified.
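
The code meanings listed above are inferred from the printed values, so it may be worth checking them against the labels stored in the .sav metadata. The sketch below is one possible way to do that. The helper name `describe_variable` and the toy dictionaries are hypothetical; in the notebook the two dictionaries would come from `meta.column_names_to_labels` and `meta.variable_value_labels`, which pyreadstat fills in when the file carries labels.

```python
def describe_variable(name, column_labels, value_labels):
    """Return the question label and value-code labels for one variable."""
    lines = [f"{name}: {column_labels.get(name, '(no label found)')}"]
    # Sort by code so the printed order is stable.
    for code, label in sorted(value_labels.get(name, {}).items()):
        lines.append(f"  {code}: {label}")
    return "\n".join(lines)


# Toy stand-ins for the two metadata dictionaries (hypothetical labels, not survey text).
toy_column_labels = {"d3_20": "Hypothetical question text for d3_20"}
toy_value_labels = {"d3_20": {1.0: "Yes", 0.0: "No"}}

summary = describe_variable("d3_20", toy_column_labels, toy_value_labels)
print(summary)

# In the notebook, the same call would use the loaded metadata, for example:
# print(describe_variable("d3_20", meta.column_names_to_labels, meta.variable_value_labels))
```

If a variable has no entry in either dictionary, the helper still returns a line for it, which makes unlabeled columns easy to spot.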

In [120]:
# Time to consolidate

# Step 1: Create binary indicators for household devices (proxies for internet access)
# The vectorized comparison gives the same result as a row-wise lambda:
# 1 where the code is 1.0, and 0 otherwise (including missing values).
data['has_computer'] = (data['d3_20'] == 1.0).astype(int)
data['has_tablet'] = (data['d3_21'] == 1.0).astype(int)
data['has_smart_tv'] = (data['d3_22'] == 1.0).astype(int)
data['has_satellite_dish'] = (data['d3_24'] == 1.0).astype(int)

# Step 2: Create binary indicators for proximity to services
data['proximity_to_atm'] = (data['pc1_4'] == 1.0).astype(int)
data['proximity_to_non_interest_service'] = (data['pc1_7'] == 1.0).astype(int)
data['proximity_to_pharmacy'] = (data['pc1_10'] == 1.0).astype(int)
data['proximity_to_mobile_kiosk'] = (data['pc1_12'] == 1.0).astype(int)

# Step 3: Create binary indicators for land ownership
data['owns_agricultural_land'] = (data['gen3_1'] == 1.0).astype(int)
data['owns_other_land'] = (data['gen3_2'] == 1.0).astype(int)
# gen3_3 is described above as a land item; confirm its label in meta before relying on this name
data['owns_house'] = (data['gen3_3'] == 1.0).astype(int)

# Step 4: Categorize internet access
def categorize_internet_access(row):
    # Device ownership is the strongest available proxy for internet access
    if row['has_computer'] or row['has_tablet'] or row['has_smart_tv']:
        return 'Likely Has Internet'
    
    # Satellite dish as a proxy
    elif row['has_satellite_dish']:
        return 'Potential Internet (Satellite)'
    
    # Proximity to services requiring internet
    elif (
        row['proximity_to_atm'] or
        row['proximity_to_non_interest_service'] or
        row['proximity_to_pharmacy'] or
        row['proximity_to_mobile_kiosk']
    ):
        return 'Proximity Suggests Internet'
    
    # Land ownership as a proxy
    elif row['owns_agricultural_land'] or row['owns_other_land']:
        return 'Potential Internet (Land Ownership)'
    
    # Default case
    else:
        return 'No Evidence of Internet'

data['internet_access'] = data.apply(categorize_internet_access, axis=1)

# Display the consolidated column
print(data[['has_computer', 'has_tablet', 'has_smart_tv', 'has_satellite_dish',
            'proximity_to_atm', 'proximity_to_pharmacy', 'proximity_to_mobile_kiosk',
            'owns_agricultural_land', 'owns_other_land', 'internet_access']].head())

   has_computer  has_tablet  has_smart_tv  has_satellite_dish  \
0             0           0             0                   0   
1             0           0             0                   0   
2             0           0             0                   0   
3             0           0             0                   0   
4             0           0             0                   0   

   proximity_to_atm  proximity_to_pharmacy  proximity_to_mobile_kiosk  \
0                 0                      0                          0   
1                 0                      0                          0   
2                 0                      0                          0   
3                 0                      0                          0   
4                 0                      1                          0   

   owns_agricultural_land  owns_other_land  \
0                       0                0   
1                       0                0   
2                       1       

### 10, mobile_phone_usage

Let’s proceed with the column mobile_phone_usage.

Step 1: Identify Relevant Columns
Based on the dataset's structure and naming conventions, the following columns are likely candidates for deriving mobile_phone_usage:

te1_1_1 : Indicates whether the individual uses a mobile phone.

te1_1_2 : Indicates whether the individual uses a tablet.

te1_1_3 : Indicates whether the individual uses a computer or laptop.

te1_1_4 : Indicates whether the individual uses a landline telephone.

te1_1_5 : Indicates whether the individual uses a 3G/4G/LTE modem/router.

te3_1 : Indicates whether the individual uses a smartphone capable of accessing the internet.

te3_2 : Indicates whether the individual uses a feature phone capable of accessing the internet.

te3_3 : Indicates whether the individual uses a basic phone (only capable of voice calls and SMS).

te4 : Indicates how comfortable the individual feels about using smartphone apps.

py1a_* : Indicates payment methods used in the past 12 months, including mobile money (py1a_11) and e-naira (py1a_12).

mm1a : Indicates whether the individual has heard of mobile money prior to today.

mm1b : Indicates the individual's experience level with mobile money (e.g., never used, beginner, advanced user).

Step 2: Analyze Relevant Columns
To confirm the relevance of these columns, we’ll analyze their structure, unique values, value counts, missing values, and sample data.
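
The same five checks are repeated for every derived column, so a compact alternative is to collect them into one summary table. The sketch below is only an illustration: `profile_columns` is a hypothetical helper, and the toy frame uses made-up values rather than survey data. Unlike the printed loop, it counts distinct values without NaN and flags column names that are not in the DataFrame instead of raising a KeyError.

```python
import pandas as pd


def profile_columns(df, columns):
    """Summarize dtype, distinct values, missing count and most common value per column."""
    rows = []
    for col in columns:
        if col not in df.columns:
            # Record the absent column instead of failing on df[col].
            rows.append({"column": col, "present": False})
            continue
        series = df[col]
        counts = series.value_counts()  # NaN is excluded by default
        rows.append({
            "column": col,
            "present": True,
            "dtype": str(series.dtype),
            "n_unique": series.nunique(),
            "n_missing": int(series.isna().sum()),
            "most_common": counts.index[0] if len(counts) else None,
        })
    return pd.DataFrame(rows).set_index("column")


# Toy frame with made-up values, only to show the shape of the result.
toy = pd.DataFrame({"te1_1_1": [1.0, 2.0, 1.0, None], "te3_1": [0.0, 1.0, None, None]})
summary = profile_columns(toy, ["te1_1_1", "te3_1", "not_a_column"])
print(summary)
```

In the notebook this would be called as `profile_columns(data, relevant_columns)`, which keeps the output to one row per column.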

In [121]:
# Define relevant columns
relevant_columns = [
    'te1_1_1', 'te1_1_2', 'te1_1_3', 'te1_1_4', 'te1_1_5',
    'te3_1', 'te3_2', 'te3_3', 'te4',
    'py1a_11', 'py1a_12',
    'mm1a', 'mm1b'
]

# Analyze each column
for column in relevant_columns:
    print(f"--- Analysis for Column: {column} ---")
    
    # Step 1: Check the column type
    print(f"Data Type: {data[column].dtype}")
    
    # Step 2: List unique values
    unique_values = data[column].unique()
    print(f"Unique Values ({len(unique_values)}): {unique_values}")
    
    # Step 3: Show value counts (frequency of each unique value)
    print("Value Counts:")
    print(data[column].value_counts())
    
    # Step 4: Count missing values
    missing_values = data[column].isnull().sum()
    print(f"Missing Values: {missing_values}")
    
    # Step 5: Show sample values
    print("Sample Values:")
    print(data[[column]].sample(5))
    
    print("\n" + "-"*50 + "\n")

--- Analysis for Column: te1_1_1 ---
Data Type: float64
Unique Values (2): [1. 2.]
Value Counts:
te1_1_1
1.0    24164
2.0     4228
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       te1_1_1
3426       1.0
12991      1.0
6754       1.0
15799      1.0
1675       1.0

--------------------------------------------------

--- Analysis for Column: te1_1_2 ---
Data Type: float64
Unique Values (2): [2. 1.]
Value Counts:
te1_1_2
2.0    27573
1.0      819
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       te1_1_2
22580      1.0
25736      2.0
15866      2.0
15367      2.0
4441       2.0

--------------------------------------------------

--- Analysis for Column: te1_1_3 ---
Data Type: float64
Unique Values (2): [2. 1.]
Value Counts:
te1_1_3
2.0    26891
1.0     1501
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       te1_1_3
12194      2.0
17130      2.0
14049      2.0
15074      2.0
16034      1.0

--------------------------------------------------

-

Based on the results, here’s how we can consolidate the mobile_phone_usage column.

Key Observations from the Analysis

1. Columns Related to Device Ownership (te1_1_*)
   - te1_1_1: Indicates whether the individual uses a mobile phone.
     - 1.0: Uses a mobile phone.
     - 2.0: Does not use a mobile phone.
   - te1_1_2: Indicates whether the individual uses a tablet.
     - 1.0: Uses a tablet.
     - 2.0: Does not use a tablet.
   - te1_1_3: Indicates whether the individual uses a computer or laptop.
     - 1.0: Uses a computer/laptop.
     - 2.0: Does not use a computer/laptop.
   - te1_1_4: Indicates whether the individual uses a landline telephone.
     - 1.0: Uses a landline telephone.
     - 2.0: Does not use a landline telephone.
   - te1_1_5: Indicates whether the individual uses a 3G/4G/LTE modem/router.
     - 1.0: Uses a 3G/4G/LTE modem/router.
     - 2.0: Does not use a 3G/4G/LTE modem/router.
2. Columns Related to Phone Type (te3_*)
   - te3_1: Indicates whether the individual uses a smartphone capable of accessing the internet.
     - 1.0: Uses a smartphone.
     - 0.0: Does not use a smartphone.
     - NaN: Missing data.
   - te3_2: Indicates whether the individual uses a feature phone capable of accessing the internet.
     - 1.0: Uses a feature phone.
     - 0.0: Does not use a feature phone.
     - NaN: Missing data.
   - te3_3: Indicates whether the individual uses a basic phone (only capable of voice calls and SMS).
     - 1.0: Uses a basic phone.
     - 0.0: Does not use a basic phone.
     - NaN: Missing data.
3. Column Related to Smartphone App Comfort (te4)
   - te4: Indicates how comfortable the individual feels about using smartphone apps.
     - 1.0: Very uncomfortable.
     - 2.0: Uncomfortable.
     - 3.0: Neutral.
     - 4.0: Comfortable.
     - 5.0: Very comfortable.
4. Columns Related to Payment Methods (py1a_*)
   - py1a_11: Indicates whether the individual has used mobile money in the past 12 months.
     - 1.0: Has used mobile money.
     - 0.0: Has not used mobile money.
   - py1a_12: Indicates whether the individual has used e-naira in the past 12 months.
     - 1.0: Has used e-naira.
     - 0.0: Has not used e-naira.
5. Columns Related to Mobile Money Awareness (mm1a, mm1b)
   - mm1a: Indicates whether the individual has heard of mobile money prior to today.
     - 1.0: Has heard of mobile money.
     - 2.0: Has not heard of mobile money.
   - mm1b: Indicates the individual's experience level with mobile money.
     - 1.0: Beginner.
     - 2.0: Intermediate.
     - 3.0: Advanced.
     - 4.0: Expert.
     - NaN: Missing data.
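
Because the te3_* columns contain NaN, a classification that falls through to "No Phone" cannot tell "answered no to every phone type" apart from "no answer recorded". The sketch below shows one possible way to keep the two apart with `np.select`. The function name, the "Not Recorded" label, and the toy values are assumptions made for illustration; whether NaN reflects a skip pattern in the questionnaire is not confirmed here.

```python
import numpy as np
import pandas as pd


def phone_type_with_missing(df):
    """Classify phone type, keeping rows with no te3_* answers in their own category."""
    te3_cols = ["te3_1", "te3_2", "te3_3"]
    conditions = [
        df["te3_1"].eq(1.0),             # smartphone takes priority
        df["te3_2"].eq(1.0),             # then feature phone
        df["te3_3"].eq(1.0),             # then basic phone
        df[te3_cols].isna().all(axis=1), # nothing recorded for any phone type
    ]
    choices = ["Smartphone", "Feature Phone", "Basic Phone", "Not Recorded"]
    # np.select picks the first condition that is True for each row.
    labels = np.select(conditions, choices, default="No Phone")
    return pd.Series(labels, index=df.index)


# Toy rows with made-up values covering each branch.
toy = pd.DataFrame({
    "te3_1": [1.0, 0.0, 0.0, np.nan, 0.0],
    "te3_2": [0.0, 1.0, 0.0, np.nan, 0.0],
    "te3_3": [0.0, 0.0, 1.0, np.nan, 0.0],
})
phone_type = phone_type_with_missing(toy)
print(phone_type.tolist())
```

The order of `conditions` mirrors an if/elif chain, so a respondent flagged for more than one phone type is still assigned the most capable one.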


In [122]:
# Time to consolidate

# Step 1: Create binary indicators for mobile phone usage
data['uses_mobile_phone'] = (data['te1_1_1'] == 1.0).astype(int)

# Step 2: Categorize phone type
def categorize_phone_type(row):
    if row['te3_1'] == 1.0:
        return 'Smartphone'
    elif row['te3_2'] == 1.0:
        return 'Feature Phone'
    elif row['te3_3'] == 1.0:
        return 'Basic Phone'
    else:
        return 'No Phone'

data['phone_type'] = data.apply(categorize_phone_type, axis=1)

# Step 3: Categorize comfort level with smartphone apps
# Lookup table from te4 code to label
COMFORT_LABELS = {
    1.0: 'Very Uncomfortable',
    2.0: 'Uncomfortable',
    3.0: 'Neutral',
    4.0: 'Comfortable',
    5.0: 'Very Comfortable',
}

def categorize_comfort_level(te4):
    # Any value outside the table (including NaN) is labelled 'Unknown'
    return COMFORT_LABELS.get(te4, 'Unknown')

data['comfort_level'] = data['te4'].apply(categorize_comfort_level)

# Step 4: Categorize mobile money usage
def categorize_mobile_money_usage(row):
    if row['py1a_11'] == 1.0:
        return 'Uses Mobile Money'
    elif row['mm1a'] == 1.0:
        return 'Aware of Mobile Money'
    else:
        return 'Not Aware of Mobile Money'

data['mobile_money_usage'] = data.apply(categorize_mobile_money_usage, axis=1)

# Step 5: Combine all indicators into a single string representation
data['mobile_phone_usage'] = (
    'Phone Usage: ' + data['uses_mobile_phone'].astype(str) +
    ', Phone Type: ' + data['phone_type'] +
    ', Comfort Level: ' + data['comfort_level'] +
    ', Mobile Money: ' + data['mobile_money_usage']
)

# Display the consolidated column
print(data[['uses_mobile_phone', 'phone_type', 'comfort_level', 'mobile_money_usage', 'mobile_phone_usage']].head())

   uses_mobile_phone     phone_type       comfort_level  \
0                  1  Feature Phone         Comfortable   
1                  1    Basic Phone             Neutral   
2                  1    Basic Phone       Uncomfortable   
3                  0       No Phone  Very Uncomfortable   
4                  0       No Phone         Comfortable   

          mobile_money_usage  \
0  Not Aware of Mobile Money   
1  Not Aware of Mobile Money   
2  Not Aware of Mobile Money   
3  Not Aware of Mobile Money   
4  Not Aware of Mobile Money   

                                  mobile_phone_usage  
0  Phone Usage: 1, Phone Type: Feature Phone, Com...  
1  Phone Usage: 1, Phone Type: Basic Phone, Comfo...  
2  Phone Usage: 1, Phone Type: Basic Phone, Comfo...  
3  Phone Usage: 0, Phone Type: No Phone, Comfort ...  
4  Phone Usage: 0, Phone Type: No Phone, Comfort ...  


### 11, credit_access

Let’s proceed with the column credit_access.

Step 1: Identify Relevant Columns
Based on the dataset's structure and naming conventions, the following columns are likely candidates for deriving credit_access:

lc1_1 : Indicates whether the individual borrowed money in the past 12 months.
lc1_2 : Indicates whether the individual has been paying back borrowed money.
lc2a_* : Indicates who the individual borrowed money from (e.g., bank, microfinance institution, family/friends, etc.).
lc4_other : Free-text responses explaining why the individual did not borrow money.
qf6_1_10, qf6_2_10, qf6_4_10, etc.: Indicate whether the individual borrowed money from specific institutions (e.g., commercial banks, microfinance banks, mortgage institutions, etc.).
f4a_17, f4a_18, f4a_19: Indicate whether the individual bought goods on credit, used hire purchase, or used a credit card in the past year.
finhealth_resilience : Indicates the individual's financial resilience, which may correlate with access to credit.
finneeds_resilience : Indicates the individual's need for financial resilience, which may also correlate with credit access.

In [123]:
# Define relevant columns
relevant_columns = [
    'lc1_1', 'lc1_2', 'lc2a_1', 'lc2a_2', 'lc2a_3', 'lc2a_4', 'lc2a_5', 'lc2a_6',
    'lc2a_7', 'lc2a_8', 'lc2a_9', 'lc2a_10', 'lc4_other',
    'qf6_1_10', 'qf6_2_10', 'qf6_4_10', 'qf6_5_10', 'qf6_8_10',
    'f4a_17', 'f4a_18', 'f4a_19',
    'finhealth_resilience', 'finneeds_resilience'
]

# Analyze each column
for column in relevant_columns:
    print(f"--- Analysis for Column: {column} ---")
    
    # Step 1: Check the column type
    print(f"Data Type: {data[column].dtype}")
    
    # Step 2: List unique values
    unique_values = data[column].unique()
    print(f"Unique Values ({len(unique_values)}): {unique_values}")
    
    # Step 3: Show value counts (frequency of each unique value)
    print("Value Counts:")
    print(data[column].value_counts())
    
    # Step 4: Count missing values
    missing_values = data[column].isnull().sum()
    print(f"Missing Values: {missing_values}")
    
    # Step 5: Show sample values
    print("Sample Values:")
    print(data[[column]].sample(5))
    
    print("\n" + "-"*50 + "\n")

--- Analysis for Column: lc1_1 ---
Data Type: float64
Unique Values (2): [2. 1.]
Value Counts:
lc1_1
2.0    21303
1.0     7089
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       lc1_1
22597    2.0
24868    2.0
10854    2.0
13308    2.0
1767     2.0

--------------------------------------------------

--- Analysis for Column: lc1_2 ---
Data Type: float64
Unique Values (2): [2. 1.]
Value Counts:
lc1_2
2.0    25815
1.0     2577
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       lc1_2
11471    2.0
8980     2.0
25787    2.0
27399    2.0
9183     2.0

--------------------------------------------------

--- Analysis for Column: lc2a_1 ---
Data Type: float64
Unique Values (3): [nan  0.  1.]
Value Counts:
lc2a_1
0.0    8302
1.0     322
Name: count, dtype: int64
Missing Values: 19768
Sample Values:
       lc2a_1
9120      NaN
18751     1.0
24972     NaN
3092      NaN
3104      NaN

--------------------------------------------------

--- Analysis for Column: lc2a_2 -

Based on the provided outputs, here's how we can consolidate the credit_access column.

Key Observations from the Analysis

1. Columns Related to Borrowing Behavior (lc1_*)
   - lc1_1: Indicates whether the individual borrowed money in the past 12 months.
     - 1.0: Borrowed money.
     - 2.0: Did not borrow money.
   - lc1_2: Indicates whether the individual has been paying back borrowed money.
     - 1.0: Paying back borrowed money.
     - 2.0: Not paying back borrowed money.
2. Columns Related to Borrowing Sources (lc2a_*)
   - lc2a_1 to lc2a_10: Indicate who the individual borrowed money from (e.g., family/friends, banks, microfinance institutions, etc.).
     - 1.0: Borrowed from this source.
     - 0.0: Did not borrow from this source.
     - NaN: Missing data (19,768 rows for lc2a_1).
   - lc2a_other: Free-text responses explaining other borrowing sources.
3. Columns Related to Financial Service Providers (qf6_*)
   - qf6_1_10, qf6_2_10, etc.: Indicate whether specific financial service providers (e.g., commercial banks, microfinance banks, etc.) were used for borrowing.
     - 1.0: Used this provider.
     - 0.0: Did not use this provider.
     - NaN: Missing data.
4. Columns Related to Financial Resilience (finhealth_resilience, finneeds_resilience)
   - finhealth_resilience: Indicates the individual's financial resilience.
     - 1.0: Low resilience.
     - 2.0: High resilience.
   - finneeds_resilience: Indicates the individual's need for financial resilience.
     - 1.0: Low need.
     - 2.0: High need.
5. Columns Related to Credit Usage (f4a_*)
   - f4a_17, f4a_18, f4a_19: Indicate whether the individual bought goods on credit, used hire purchase, or used a credit card in the past year.
     - 1.0: Used credit.
     - 0.0: Did not use credit.
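
The value counts printed for lc1_1 and lc1_2 show the codes 1.0 and 2.0, while the lc2a_* columns use 0.0, 1.0 and NaN. A small recoding step can bring the 1/2 columns onto the same 0/1 scale without turning missing answers into zeros. The sketch below assumes that 1.0 means "yes" and 2.0 means "no", which matches how the notebook treats these columns; `recode_yes_no` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd


def recode_yes_no(series, yes=1.0, no=2.0):
    """Map a yes/no survey code to 1/0; anything else (including NaN) becomes <NA>."""
    # Values not listed in the mapping come back as NaN from .map(),
    # and the nullable Int64 dtype keeps them as <NA> instead of 0.
    return series.map({yes: 1, no: 0}).astype("Int64")


# Toy series with made-up values: yes, no, missing, yes.
toy = pd.Series([1.0, 2.0, np.nan, 1.0])
recoded = recode_yes_no(toy)
print(recoded.tolist())
```

Keeping `<NA>` separate matters when computing shares: `recoded.mean()` then averages only over respondents who actually answered.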

In [None]:
# Time to consolidate

# Step 1: Create binary indicators for borrowing behavior
data['borrowed_money'] = (data['lc1_1'] == 1.0).astype(int)

# Step 2: Categorize borrowing sources
def categorize_borrowing_source(row):
    if row['lc2a_1'] == 1.0:
        return 'Family/Friends'
    elif row['lc2a_2'] == 1.0:
        return 'Bank'
    elif row['lc2a_3'] == 1.0:
        return 'Microfinance Institution'
    # str() guards against a non-string value reaching .strip()
    elif pd.notna(row['lc2a_other']) and str(row['lc2a_other']).strip() != '':
        return 'Other'
    else:
        return 'No Borrowing Source'

data['borrowing_source'] = data.apply(categorize_borrowing_source, axis=1)

# Step 3: Identify financial service providers used
def identify_financial_providers(row):
    providers = []
    if row['qf6_1_10'] == 1.0:
        providers.append('Commercial Bank')
    if row['qf6_2_10'] == 1.0:
        providers.append('Microfinance Bank')
    # Add more providers as needed
    return ', '.join(providers) if providers else 'None'

data['financial_providers'] = data.apply(identify_financial_providers, axis=1)

# Step 4: Identify credit usage
def identify_credit_usage(row):
    if row['f4a_17'] == 1.0 or row['f4a_18'] == 1.0 or row['f4a_19'] == 1.0:
        return 'Used Credit'
    else:
        return 'No Credit Usage'

data['credit_usage'] = data.apply(identify_credit_usage, axis=1)

# Step 5: Combine all indicators into a single string representation
data['credit_access'] = (
    'Borrowed Money: ' + data['borrowed_money'].astype(str) +
    ', Borrowing Source: ' + data['borrowing_source'] +
    ', Financial Providers: ' + data['financial_providers'] +
    ', Credit Usage: ' + data['credit_usage']
)

# Display the consolidated column
print(data[['borrowed_money', 'borrowing_source', 'financial_providers', 'credit_usage', 'credit_access']].head())

   borrowed_money     borrowing_source financial_providers     credit_usage  \
0               0  No Borrowing Source                None  No Credit Usage   
1               0  No Borrowing Source                None  No Credit Usage   
2               0  No Borrowing Source                None  No Credit Usage   
3               0  No Borrowing Source                None  No Credit Usage   
4               0  No Borrowing Source                None  No Credit Usage   

                                       credit_access  
0  Borrowed Money: 0, Borrowing Source: No Borrow...  
1  Borrowed Money: 0, Borrowing Source: No Borrow...  
2  Borrowed Money: 0, Borrowing Source: No Borrow...  
3  Borrowed Money: 0, Borrowing Source: No Borrow...  
4  Borrowed Money: 0, Borrowing Source: No Borrow...  


### 12, small_business_ownership

Let’s proceed with the column small_business_ownership.

Step 1: Identify Relevant Columns
Based on the dataset's structure and naming conventions, the following columns are likely candidates for deriving small_business_ownership:

gen3_8 : Indicates whether the individual owns non-farm business equipment (e.g., sewing machine, brick-making machine).
gen3_9 : Indicates whether the individual owns large consumer durables (e.g., refrigerator, TV, sofa), which could be used for business purposes.
gen3_10 : Indicates whether the individual owns small consumer durables (e.g., radio, cookware), which might also be relevant for small-scale businesses.
gen3_11 : Indicates whether the individual owns a mobile phone, which is often essential for running a business.
gen3_12 : Indicates whether the individual owns a means of transportation (e.g., bicycle, motorcycle, car), which could support business operations.
gen5_8 : Indicates whether the individual can sell or lease non-farm business equipment without anyone's permission.
cc1_* : Indicates the individual's involvement in various income-generating activities (e.g., agriculture, trade, services).
cc3 : Indicates the primary source of income, which may include small business ownership.
cc4_* : Indicates secondary sources of income, which may also include small business activities.
pc1_* : Indicates proximity to markets, shops, and other services, which could correlate with business ownership.

In [125]:
# Define relevant columns
relevant_columns = [
    'gen3_8', 'gen3_9', 'gen3_10', 'gen3_11', 'gen3_12',
    'gen5_8', 'cc1_1', 'cc1_2', 'cc1_3', 'cc1_4', 'cc1_5', 'cc1_6', 'cc1_7',
    'cc3', 'cc4_1', 'cc4_2', 'cc4_3', 'cc4_4', 'cc4_5', 'cc4_6', 'cc4_7',
    'pc1_1', 'pc1_2', 'pc1_3', 'pc1_4', 'pc1_5', 'pc1_6', 'pc1_7', 'pc1_8', 'pc1_9', 'pc1_10', 'pc1_11', 'pc1_12'
]

# Analyze each column
for column in relevant_columns:
    print(f"--- Analysis for Column: {column} ---")
    
    # Step 1: Check the column type
    print(f"Data Type: {data[column].dtype}")
    
    # Step 2: List unique values
    unique_values = data[column].unique()
    print(f"Unique Values ({len(unique_values)}): {unique_values}")
    
    # Step 3: Show value counts (frequency of each unique value)
    print("Value Counts:")
    print(data[column].value_counts())
    
    # Step 4: Count missing values
    missing_values = data[column].isnull().sum()
    print(f"Missing Values: {missing_values}")
    
    # Step 5: Show sample values
    print("Sample Values:")
    print(data[[column]].sample(5))
    
    print("\n" + "-"*50 + "\n")

--- Analysis for Column: gen3_8 ---
Data Type: float64
Unique Values (3): [3. 2. 1.]
Value Counts:
gen3_8
3.0    26924
1.0     1164
2.0      304
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       gen3_8
17643     2.0
13498     3.0
17264     3.0
19556     3.0
4011      3.0

--------------------------------------------------

--- Analysis for Column: gen3_9 ---
Data Type: float64
Unique Values (3): [3. 2. 1.]
Value Counts:
gen3_9
3.0    21924
1.0     5053
2.0     1415
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       gen3_9
23740     3.0
17767     3.0
24042     3.0
24065     3.0
3149      3.0

--------------------------------------------------

--- Analysis for Column: gen3_10 ---
Data Type: float64
Unique Values (3): [1. 3. 2.]
Value Counts:
gen3_10
3.0    21018
1.0     6164
2.0     1210
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       gen3_10
18276      1.0
26832      1.0
59         3.0
10536      3.0
23995      3.0

---------------------

To create the small_business_ownership column, we will:

1. Binary Indicator for Business Equipment Ownership: use gen3_8 to determine whether the individual owns non-farm business equipment. Assign 1 if gen3_8 == 1.0; otherwise, assign 0.
2. Categorize Business Tools: use gen3_9, gen3_10, gen3_11, and gen3_12 to classify ownership of essential business tools:
   - "Owns Large Consumer Durables": gen3_9 == 1.0.
   - "Owns Small Consumer Durables": gen3_10 == 1.0.
   - "Owns Mobile Phone": gen3_11 == 1.0.
   - "Owns Transportation": gen3_12 == 1.0.
3. Control Over Business Assets: use gen5_8 to infer control over business assets:
   - "Full Control": gen5_8 == 1.0.
   - "Limited Control": gen5_8 == 2.0.
4. Income-Generating Activities: use cc1_* columns to identify involvement in income-generating activities:
   - "Involved in Agriculture": cc1_1 == 1.0.
   - "Involved in Trade": cc1_2 == 1.0.
   - "Involved in Services": cc1_3 == 1.0.
5. Proximity to Markets and Services: use pc1_* columns to assess proximity to markets and services that could support business operations.
6. Composite Representation: combine all indicators into a single string representation (small_business_ownership).
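
Several steps in this plan follow the same pattern: check a group of columns for the code 1.0 and join the labels of those that match. A minimal sketch of that pattern as one reusable helper is shown below. The name `join_labels` and the toy values are hypothetical; the column-to-label pairs are the ones listed in the plan.

```python
import pandas as pd


def join_labels(df, label_map, empty_label):
    """Join the labels of all columns coded 1.0 in each row, or return empty_label."""
    # One boolean column per label: True where the source column equals 1.0.
    flags = pd.DataFrame(
        {label: df[col].eq(1.0) for col, label in label_map.items()},
        index=df.index,
    )
    joined = flags.apply(lambda row: ", ".join(row.index[row.to_numpy()]), axis=1)
    # Rows with no matching column produce an empty string; replace it.
    return joined.where(joined != "", empty_label)


# Toy rows with made-up codes (1.0 = owns, other codes = does not count as owning).
toy = pd.DataFrame({"gen3_9": [1.0, 3.0, 2.0], "gen3_10": [1.0, 1.0, 3.0]})
tool_labels = {"gen3_9": "Large Consumer Durables", "gen3_10": "Small Consumer Durables"}
business_tools = join_labels(toy, tool_labels, "No Business Tools")
print(business_tools.tolist())
```

The same helper could serve the income-activities and proximity steps by passing a different `label_map` and `empty_label`.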

In [126]:
# Now to consolidate

# Step 1: Create binary indicators for business equipment ownership
data['owns_business_equipment'] = (data['gen3_8'] == 1.0).astype(int)

# Step 2: Categorize ownership of essential business tools
def categorize_business_tools(row):
    tools = []
    if row['gen3_9'] == 1.0:
        tools.append('Large Consumer Durables')
    if row['gen3_10'] == 1.0:
        tools.append('Small Consumer Durables')
    if row['gen3_11'] == 1.0:
        tools.append('Mobile Phone')
    if row['gen3_12'] == 1.0:
        tools.append('Transportation')
    return ', '.join(tools) if tools else 'No Business Tools'

data['business_tools'] = data.apply(categorize_business_tools, axis=1)

# Step 3: Categorize control over business assets
def categorize_control_over_assets(gen5_8):
    if gen5_8 == 1.0:
        return 'Full Control'
    elif gen5_8 == 2.0:
        return 'Limited Control'
    else:
        return 'Unknown'

data['control_over_assets'] = data['gen5_8'].apply(categorize_control_over_assets)

# Step 4: Identify income-generating activities
def identify_income_activities(row):
    activities = []
    if row['cc1_1'] == 1.0:
        activities.append('Agriculture')
    if row['cc1_2'] == 1.0:
        activities.append('Trade')
    if row['cc1_3'] == 1.0:
        activities.append('Services')
    return ', '.join(activities) if activities else 'No Income Activities'

data['income_activities'] = data.apply(identify_income_activities, axis=1)

# Step 5: Identify proximity to markets and services
def identify_proximity_to_services(row):
    services = []
    if row['pc1_1'] == 1.0:
        services.append('Market')
    if row['pc1_2'] == 1.0:
        services.append('Shop')
    # Add more services as needed
    return ', '.join(services) if services else 'No Proximity to Services'

data['proximity_to_services'] = data.apply(identify_proximity_to_services, axis=1)

# Step 6: Combine all indicators into a single string representation
data['small_business_ownership'] = (
    'Business Equipment: ' + data['owns_business_equipment'].astype(str) +
    ', Business Tools: ' + data['business_tools'] +
    ', Control Over Assets: ' + data['control_over_assets'] +
    ', Income Activities: ' + data['income_activities'] +
    ', Proximity to Services: ' + data['proximity_to_services']
)

# Display the consolidated column
print(data[['owns_business_equipment', 'business_tools', 'control_over_assets',
            'income_activities', 'proximity_to_services', 'small_business_ownership']].head())

   owns_business_equipment                                     business_tools  \
0                        0  Small Consumer Durables, Mobile Phone, Transpo...   
1                        0              Small Consumer Durables, Mobile Phone   
2                        0              Small Consumer Durables, Mobile Phone   
3                        0                                  No Business Tools   
4                        0                                  No Business Tools   

  control_over_assets     income_activities proximity_to_services  \
0             Unknown  No Income Activities                Market   
1             Unknown  No Income Activities                Market   
2             Unknown  No Income Activities                Market   
3             Unknown  No Income Activities                Market   
4             Unknown  No Income Activities                Market   

                            small_business_ownership  
0  Business Equipment: 0, Business Tools: S

### 13, entrepreneurship


Let’s proceed with the column entrepreneurship.

Step 1: Identify Relevant Columns
Based on the dataset's structure and naming conventions, the following columns are likely candidates for deriving entrepreneurship:

cc1_* : Indicates involvement in income-generating activities (e.g., agriculture, trade, services).
cc3 : Indicates the primary source of income, which may include entrepreneurship.
cc4_* : Indicates secondary sources of income, which may also include entrepreneurial activities.
gen3_8 : Indicates ownership of non-farm business equipment (e.g., sewing machine, brick-making machine), which is often associated with entrepreneurship.
gen5_8 : Indicates whether the individual can sell or lease non-farm business equipment without anyone's permission, suggesting control over entrepreneurial assets.
pc1_* : Indicates proximity to markets, shops, and other services, which could correlate with entrepreneurial activities.
e13a_* : Indicates the sector(s) in which the individual's source(s) of income falls, which may include entrepreneurial sectors like trade, manufacturing, or services.
e13b : Indicates the number of people employed in the individual's business, if applicable.
lc2a_* : Indicates borrowing sources, which may include loans for entrepreneurial purposes (e.g., microfinance institutions, savings groups).
qf6_* : Indicates activities done using financial institutions, such as borrowing money or making investments, which may be linked to entrepreneurship.
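
Several entries above use wildcard notation such as cc1_* or lc2a_*. Rather than typing every member by hand, the matching columns can be collected from the DataFrame itself. The sketch below is one hedged way to do that: `columns_matching` is a hypothetical helper that keeps only names of the form prefix plus a number (so free-text columns like an `_other` field are left out), and the toy column names are illustrative.

```python
import re

import pandas as pd


def columns_matching(df, prefix):
    """Return columns named '<prefix><number>', ordered by that number."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    matches = []
    for col in df.columns:
        found = pattern.match(col)
        if found:
            matches.append((int(found.group(1)), col))
    # Sort numerically so that cc1_10 comes after cc1_2.
    return [col for _, col in sorted(matches)]


# Toy frame with no rows; only the column names matter here.
toy = pd.DataFrame(columns=["cc1_1", "cc1_10", "cc1_2", "cc1_other", "cc3", "pc1_1"])
cc1_columns = columns_matching(toy, "cc1_")
print(cc1_columns)
```

A list built this way only contains columns that exist, which avoids a KeyError when a hand-typed name is not present in the file.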


In [130]:
# Define relevant columns
relevant_columns = [
    'cc1_1', 'cc1_2', 'cc1_3', 'cc1_4', 'cc1_5', 'cc1_6', 'cc1_7',
    'cc3', 'cc4_1', 'cc4_2', 'cc4_3', 'cc4_4', 'cc4_5', 'cc4_6', 'cc4_7',
    'gen3_8', 'gen5_8',
    'pc1_1', 'pc1_2', 'pc1_3', 'pc1_4', 'pc1_5', 'pc1_6', 'pc1_7', 'pc1_8', 'pc1_9', 'pc1_10', 'pc1_11', 'pc1_12',
    'e13a_1', 'e13a_2', 'e13a_3', 'e13a_4', 'e13a_5', 'e13a_6', 'e13a_7', 'e13a_8', 'e13a_9', 'e13a_10', 'e13a_other',
    'e13b',
    'lc2a_1', 'lc2a_2', 'lc2a_3', 'lc2a_4', 'lc2a_5', 'lc2a_6', 'lc2a_7', 'lc2a_8', 'lc2a_9', 'lc2a_10', 'lc2a_11',
    'lc2a_12', 'lc2a_13', 'lc2a_14', 'lc2a_15', 'lc2a_16', 'lc2a_other',
    'qf6_1_10', 'qf6_2_10', 'qf6_3_10', 'qf6_4_10', 'qf6_5_10', 'qf6_6_10', 'qf6_7_10', 'qf6_8_10'
]

# Analyze each column
for column in relevant_columns:
    print(f"--- Analysis for Column: {column} ---")
    
    # Step 1: Check the column type
    print(f"Data Type: {data[column].dtype}")
    
    # Step 2: List unique values
    unique_values = data[column].unique()
    print(f"Unique Values ({len(unique_values)}): {unique_values}")
    
    # Step 3: Show value counts (frequency of each unique value)
    print("Value Counts:")
    print(data[column].value_counts())
    
    # Step 4: Count missing values
    missing_values = data[column].isnull().sum()
    print(f"Missing Values: {missing_values}")
    
    # Step 5: Show sample values
    print("Sample Values:")
    print(data[[column]].sample(5))
    
    print("\n" + "-"*50 + "\n")

--- Analysis for Column: cc1_1 ---
Data Type: float64
Unique Values (2): [2. 1.]
Value Counts:
cc1_1
2.0    23876
1.0     4516
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       cc1_1
3818     2.0
16789    2.0
191      2.0
14682    2.0
14039    2.0

--------------------------------------------------

--- Analysis for Column: cc1_2 ---
Data Type: float64
Unique Values (2): [2. 1.]
Value Counts:
cc1_2
2.0    26327
1.0     2065
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       cc1_2
11740    2.0
11524    2.0
23646    2.0
2712     2.0
10207    2.0

--------------------------------------------------

--- Analysis for Column: cc1_3 ---
Data Type: float64
Unique Values (2): [2. 1.]
Value Counts:
cc1_3
2.0    27259
1.0     1133
Name: count, dtype: int64
Missing Values: 0
Sample Values:
       cc1_3
10543    2.0
13871    2.0
19809    2.0
9010     2.0
12765    1.0

--------------------------------------------------

--- Analysis for Column: cc1_4 ---
Data Type: flo
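
The per-column printout above gets long with more than 60 columns. As an alternative, here is a sketch that collects the same checks into one summary table with one row per column. The function name `summarize_columns`, the `share_code_1` field, and the toy data are my own assumptions for illustration; columns that are missing from the DataFrame are recorded instead of raising a `KeyError`.

```python
import pandas as pd

def summarize_columns(df, columns):
    """Build one summary row per column instead of printing each one."""
    rows = []
    for col in columns:
        if col not in df.columns:
            # Record absent columns rather than failing on them
            rows.append({"column": col, "present": False})
            continue
        series = df[col]
        rows.append({
            "column": col,
            "present": True,
            "dtype": str(series.dtype),
            "n_unique": series.nunique(dropna=True),
            "n_missing": int(series.isna().sum()),
            # Share of rows coded 1.0 (missing values count as "not 1.0")
            "share_code_1": float((series == 1.0).mean()),
        })
    return pd.DataFrame(rows).set_index("column")

# Toy data standing in for the survey DataFrame
toy = pd.DataFrame({
    "cc1_1": [2.0, 1.0, 2.0, 2.0],
    "cc1_2": [2.0, 2.0, None, 1.0],
})
summary = summarize_columns(toy, ["cc1_1", "cc1_2", "cc1_99"])
print(summary)
```

On the real data the call would be `summarize_columns(data, relevant_columns)`.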

To create the entrepreneurship column, we will:

1. **Binary Indicator for Involvement in Income-Generating Activities:** Use `cc1_*` columns to determine involvement in income-generating activities. Assign 1 if any `cc1_*` == 1.0; otherwise, assign 0.
2. **Categorize Primary and Secondary Sources of Income:** Use `cc3` and `cc4_*` columns to classify the primary and secondary sources of income.
3. **Ownership of Business Equipment:** Use `gen3_8` to infer ownership of non-farm business equipment.
4. **Control Over Business Assets:** Use `gen5_8` to assess control over business assets.
5. **Proximity to Markets and Services:** Use `pc1_*` columns to evaluate proximity to markets and services.
6. **Borrowing Sources:** Use `lc2a_*` columns to identify borrowing sources relevant to entrepreneurship.
7. **Sectors of Income:** Use `e13a_*` columns to classify the sectors of income.
8. **Employment:** Use `e13b` to determine the number of employees in the individual's business.
9. **Composite Representation:** Combine all indicators into a single string representation (entrepreneurship).
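
The first item in the plan (assign 1 if any cc1_* == 1.0, otherwise 0) can be sketched without a row-wise function. This is a minimal example on toy data; the column name `any_income_activity` is hypothetical and does not exist in the survey.

```python
import pandas as pd

# Toy data standing in for the cc1_* columns (1.0 = yes, 2.0 = no)
toy = pd.DataFrame({
    "cc1_1": [2.0, 1.0, 2.0],
    "cc1_2": [2.0, 2.0, 2.0],
    "cc1_3": [1.0, 1.0, 2.0],
})

# Pick up every cc1_* column by prefix
cc1_cols = [c for c in toy.columns if c.startswith("cc1_")]

# True where a cell equals 1.0, then "any" across each row, then cast to 0/1
toy["any_income_activity"] = toy[cc1_cols].eq(1.0).any(axis=1).astype(int)

print(toy["any_income_activity"].tolist())  # [1, 1, 0]
```

Missing values compare as not equal to 1.0, so a row with only missing answers would get 0 here.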

In [131]:
# Time to consolidate

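# Step 1: Label the income-generating activities flagged in cc1_* (1.0 = yes).
# Confirm the activity names below against meta.column_names_to_labels before relying on them.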
def categorize_income_activities(row):
    activities = []
    if row['cc1_1'] == 1.0:
        activities.append('Agriculture')
    if row['cc1_2'] == 1.0:
        activities.append('Trade')
    if row['cc1_3'] == 1.0:
        activities.append('Services')
    # Add more activities as needed
    return ', '.join(activities) if activities else 'No Income Activities'

data['income_activities'] = data.apply(categorize_income_activities, axis=1)

# Step 2: Categorize primary and secondary sources of income
def categorize_sources_of_income(row):
    primary_source = {
        1.0: 'Agriculture',
        2.0: 'Trade',
        3.0: 'Services',
        4.0: 'Other'
    }.get(row['cc3'], 'Unknown')
    
    secondary_sources = []
    if row['cc4_1'] == 1.0:
        secondary_sources.append('Agriculture')
    if row['cc4_2'] == 1.0:
        secondary_sources.append('Trade')
    if row['cc4_3'] == 1.0:
        secondary_sources.append('Services')
    # Add more sources as needed
    secondary_sources = ', '.join(secondary_sources) if secondary_sources else 'None'
    
    return f'Primary Source: {primary_source}, Secondary Sources: {secondary_sources}'

data['sources_of_income'] = data.apply(categorize_sources_of_income, axis=1)

# Step 3: Categorize ownership of business equipment
def categorize_business_equipment(gen3_8):
    if gen3_8 == 1.0:
        return 'Owns Business Equipment'
    elif gen3_8 == 2.0:
        return 'No Business Equipment'
    else:
        return 'Unknown'

data['business_equipment'] = data['gen3_8'].apply(categorize_business_equipment)

# Step 4: Categorize control over business assets
def categorize_control_over_assets(gen5_8):
    if gen5_8 == 1.0:
        return 'Full Control'
    elif gen5_8 == 2.0:
        return 'Limited Control'
    else:
        return 'Unknown'

data['control_over_assets'] = data['gen5_8'].apply(categorize_control_over_assets)

# Step 5: Identify proximity to markets and services
def identify_proximity_to_services(row):
    services = []
    if row['pc1_1'] == 1.0:
        services.append('Market')
    if row['pc1_2'] == 1.0:
        services.append('Shop')
    # Add more services as needed
    return ', '.join(services) if services else 'No Proximity to Services'

data['proximity_to_services'] = data.apply(identify_proximity_to_services, axis=1)

# Step 6: Identify borrowing sources
def identify_borrowing_sources(row):
    sources = []
    if row['lc2a_1'] == 1.0:
        sources.append('Family/Friends')
    if row['lc2a_2'] == 1.0:
        sources.append('Bank')
    if row['lc2a_3'] == 1.0:
        sources.append('Microfinance Institution')
    # Add more sources as needed
    return ', '.join(sources) if sources else 'No Borrowing Sources'

data['borrowing_sources'] = data.apply(identify_borrowing_sources, axis=1)

# Step 7: Identify sectors of income
def identify_sectors_of_income(row):
    sectors = []
    if row['e13a_1'] == 1.0:
        sectors.append('Agriculture')
    if row['e13a_2'] == 1.0:
        sectors.append('Trade')
    if row['e13a_3'] == 1.0:
        sectors.append('Services')
    # Add more sectors as needed
    return ', '.join(sectors) if sectors else 'No Sectors'

data['sectors_of_income'] = data.apply(identify_sectors_of_income, axis=1)

# Step 8: Identify employment (number of employees in the individual's business).
# Note: missing e13b values are treated as 0 employees.
data['employment'] = data['e13b'].fillna(0).astype(int)

# Step 9: Combine all indicators into a single string representation
data['entrepreneurship'] = (
    'Income Activities: ' + data['income_activities'] +
    ', Sources of Income: ' + data['sources_of_income'] +
    ', Business Equipment: ' + data['business_equipment'] +
    ', Control Over Assets: ' + data['control_over_assets'] +
    ', Proximity to Services: ' + data['proximity_to_services'] +
    ', Borrowing Sources: ' + data['borrowing_sources'] +
    ', Sectors of Income: ' + data['sectors_of_income'] +
    ', Employees: ' + data['employment'].astype(str)
)

# Display the consolidated column
print(data[['income_activities', 'sources_of_income', 'business_equipment',
            'control_over_assets', 'proximity_to_services', 'borrowing_sources',
            'sectors_of_income', 'employment', 'entrepreneurship']].head())

      income_activities                                 sources_of_income  \
0  No Income Activities  Primary Source: Unknown, Secondary Sources: None   
1  No Income Activities  Primary Source: Unknown, Secondary Sources: None   
2  No Income Activities  Primary Source: Unknown, Secondary Sources: None   
3  No Income Activities  Primary Source: Unknown, Secondary Sources: None   
4  No Income Activities  Primary Source: Unknown, Secondary Sources: None   

  business_equipment control_over_assets proximity_to_services  \
0            Unknown             Unknown                Market   
1            Unknown             Unknown                Market   
2            Unknown             Unknown                Market   
3            Unknown             Unknown                Market   
4            Unknown             Unknown                Market   

      borrowing_sources sectors_of_income  employment  \
0  No Borrowing Sources        No Sectors           1   
1  No Borrowing Sources   
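
Steps 1, 5, 6 and 7 in the cell above all follow the same pattern: check a few flag columns for 1.0 and join the matching labels. Below is a sketch of one shared helper that does this from a column-to-label dictionary and only looks at the listed columns, rather than passing every row of the full dataset (over 1,600 columns) to a function. The helper name `join_flag_labels` is hypothetical, and the labels are the ones used in the cell above, which I have not verified against the survey metadata.

```python
import pandas as pd

def join_flag_labels(df, label_by_column, empty_label):
    """For each row, join the labels of the columns whose value is 1.0."""
    cols = list(label_by_column)
    flags = df[cols].eq(1.0)  # boolean frame restricted to the flag columns

    def labels_for(row):
        picked = [label_by_column[c] for c in cols if row[c]]
        return ", ".join(picked) if picked else empty_label

    return flags.apply(labels_for, axis=1)

# Toy data standing in for the survey DataFrame
toy = pd.DataFrame({
    "lc2a_1": [1.0, 2.0, 2.0],
    "lc2a_2": [1.0, 2.0, 1.0],
    "unrelated": [0, 0, 0],
})
borrowing_labels = {"lc2a_1": "Family/Friends", "lc2a_2": "Bank"}

toy["borrowing_sources"] = join_flag_labels(
    toy, borrowing_labels, "No Borrowing Sources"
)
print(toy["borrowing_sources"].tolist())
```

Adding the remaining columns then only means extending the dictionary, instead of adding another `if` block.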

Now that these columns are consolidated, we will create a dataset containing only the consolidated columns, so that the most important information is available in a compact form. This dataset will be the input for Phase 2. This is the end of Phase 1.

In [134]:
# Step 1: Select the relevant columns for the new DataFrame
columns_to_keep = [
    'state', 
    'bank_account_ownership', 
    'financial_inclusion_metrics', 
    'demographic_factors', 
    'savings_behavior', 
    'borrowing_behavior', 
    'digital_payment_adoption', 
    'access_to_electricity', 
    'internet_access', 
    'mobile_phone_usage', 
    'credit_access', 
    'small_business_ownership', 
    'entrepreneurship'
]

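# Step 2: Create the new DataFrame (data2) with only the selected columns.
# .copy() makes data2 independent of data and avoids SettingWithCopyWarning on later assignments.
data2 = data[columns_to_keep].copy()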

# Step 3: Display the first few rows of the new DataFrame to confirm
print(data2.head())

# Step 4: Define the file path where the CSV will be saved
file_path = r"C:\Users\USER\Desktop\A2F-2023-Revised-dataset-with-Revised-weights\data2.csv"

# Step 5: Save the new DataFrame (data2) as a CSV file
data2.to_csv(file_path, index=False)

# Step 6: Print confirmation message
print(f"CSV file saved successfully to: {file_path}")

  state  bank_account_ownership  financial_inclusion_metrics  \
0  ABIA                     0.0                            2   
1  ABIA                     0.0                            2   
2  ABIA                     0.0                            2   
3  ABIA                     0.0                            1   
4  ABIA                     0.0                            1   

                                 demographic_factors savings_behavior  \
0  Adult, Female, Medium Education, Middle Income...    Does Not Save   
1  Adult, Female, Medium Education, Middle Income...    Does Not Save   
2  Adult, Male, Medium Education, Middle Income, ...    Saves at Home   
3  Adult, Male, Medium Education, Middle Income, ...    Does Not Save   
4  Adult, Male, Low Education, Middle Income, Sma...    Does Not Save   

                  borrowing_behavior  \
0  Does Not Borrow, Purpose: Unknown   
1  Does Not Borrow, Purpose: Unknown   
2  Does Not Borrow, Purpose: Unknown   
3  Does Not Borr
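
Before moving to Phase 2, it may be worth confirming that the saved CSV reads back with the same shape and columns. Here is a minimal sketch on toy data, writing to a temporary folder so nothing on disk is overwritten. The file name `data2_check.csv` is made up for this example.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for data2
toy = pd.DataFrame({
    "state": ["ABIA", "ABIA"],
    "savings_behavior": ["Does Not Save", "Saves at Home"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "data2_check.csv")
    toy.to_csv(csv_path, index=False)   # same call as in the cell above
    reloaded = pd.read_csv(csv_path)    # read it straight back

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

For the real file, the same two comparisons can be run on `data2` and `pd.read_csv(file_path)`.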