### Author: *May*

# BUSINESS UNDERSTANDING

## Context and Business Problem

Retirement income adequacy remains a significant and growing concern in Kenya as life expectancy increases and traditional family-based support systems continue to weaken. Pension schemes are intended to provide financial security after retirement, yet evidence from industry reports and empirical studies suggests that a large proportion of pension scheme members retire with insufficient income to maintain a reasonable standard of living.

Industry benchmarks recommend a **replacement ratio of 60–80%** of a member’s final salary to achieve adequate retirement income. However, studies show that **only about 13% of defined contribution (DC) schemes** and **approximately 6% of defined benefit (DB) schemes** in Kenya deliver replacement ratios considered adequate for members joining at age 25. This indicates that the majority of scheme members face a high risk of income inadequacy in retirement unless they make additional voluntary savings.

At the system level, Kenya’s pension sector has experienced substantial asset growth. Pension assets exceeded **KSh 2.25 trillion by December 2024** and rose further to **over KSh 2.53 trillion by mid-2025**, largely driven by reforms such as the implementation of the **NSSF Act, 2013**. Despite this growth, the pension sector’s **asset-to-GDP ratio remains relatively low at approximately 14–15%**, compared to more mature pension systems. This suggests that asset growth has not translated into uniformly adequate retirement outcomes at the individual member level.

A critical challenge lies in the unequal retirement outcomes observed across salary scales. Evidence from the literature and data indicates that lower-income earners often achieve **higher replacement ratios** due to compulsory contribution mechanisms, while higher-income earners—despite contributing larger absolute amounts—experience **lower proportional income replacement** as contribution rates decline relative to income. Historical contribution rates of **15% (2004–2014)** and the current **18% rate** have been shown to be **inadequate**, particularly under early retirement scenarios.

These disparities are typically not visible during active employment but become evident at retirement, when corrective actions are no longer possible. For pension trustees, employers, and regulators, this creates a pressing need for data-driven tools to identify at-risk members early and to support timely policy and scheme design interventions.

## Business Objectives

- **Assess retirement income adequacy:** Measure and compare replacement ratios across salary bands.
- **Analyze contribution behavior:** Examine how employee and employer contribution rates and contribution amounts vary with income.
- **Identify at-risk groups:** Detect salary levels and member profiles associated with low projected replacement ratios.
- **Understand key drivers:** Determine which factors most strongly influence retirement income outcomes.
- **Support evidence-based decision-making:** Inform contribution rate reviews, scheme design improvements, and member education initiatives.
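
The first and third objectives above (comparing replacement ratios across salary bands and flagging at-risk groups) can be sketched as follows. This is a minimal illustration on made-up numbers: the column names `salary`, `replacement_ratio`, and `salary_band`, the choice of three quantile-based bands, and the use of 60% as the adequacy cut-off are assumptions for the example, not the study's actual schema or banding.

```python
import pandas as pd

# Hypothetical toy data; values and column names are illustrative only.
members = pd.DataFrame({
    "salary": [30_000, 45_000, 60_000, 90_000, 150_000, 400_000],
    "replacement_ratio": [0.72, 0.65, 0.58, 0.44, 0.35, 0.21],
})

# Split members into three equally sized salary bands (terciles).
members["salary_band"] = pd.qcut(
    members["salary"], q=3, labels=["low", "middle", "high"]
)

# Per band: mean replacement ratio and share of members at or above
# the 60% lower bound of the adequacy benchmark.
summary = members.groupby("salary_band", observed=True).agg(
    mean_ratio=("replacement_ratio", "mean"),
    share_adequate=("replacement_ratio", lambda s: (s >= 0.60).mean()),
)
print(summary)
```

Quantile bands keep group sizes balanced; fixed shilling thresholds would be an equally reasonable choice if trustees think in terms of specific pay grades.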

## Stakeholders

- **Pension Scheme Trustees:** Evaluate whether schemes deliver adequate and equitable retirement outcomes.
- **Fund Managers and Administrators:** Identify contribution gaps and members at risk of inadequate retirement income.
- **Employers:** Assess the effectiveness of existing contribution arrangements and consider enhancements.
- **Regulators (Retirement Benefits Authority):** Monitor pension adequacy and assess the impact of regulatory reforms.
- **Policymakers:** Inform national pension policy and long-term retirement income sustainability strategies.
- **Scheme Members:** Benefit indirectly from improved scheme design and clearer communication on retirement readiness.

## Success Metrics

- **Replacement ratio outcomes:** Proportion of members achieving recommended adequacy benchmarks (60–80%).
- **Contribution adequacy:** Consistency and sufficiency of contribution rates across salary scales.
- **Model performance metrics:** Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared for replacement ratio predictions.
- **Feature importance:** Stability and interpretability of key drivers such as contribution rates and years of service.
- **Business relevance:** Ability to clearly identify salary groups most exposed to inadequate retirement income.
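
For reference, the three model performance metrics listed above can be computed directly with NumPy. The sketch below uses the standard definitions; the helper name `regression_metrics` and the sample values are hypothetical and only serve to show the arithmetic.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R-squared for paired actual/predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))            # average absolute miss
    rmse = np.sqrt(np.mean(errors ** 2))     # penalises large misses more
    ss_res = np.sum(errors ** 2)             # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Illustrative replacement ratios (not taken from the dataset).
actual = [0.70, 0.55, 0.40, 0.25]
predicted = [0.65, 0.60, 0.35, 0.30]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Because replacement ratios are expressed as fractions of salary, MAE and RMSE read directly as percentage points of salary, which makes them easy to communicate to trustees.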

# DATA UNDERSTANDING

This study uses member-level pension data to analyze contribution behavior and retirement income adequacy across salary scales among pension scheme members in Kenya. The dataset represents anonymized administrative records drawn from multiple pension schemes and captures key demographic, employment, and contribution-related attributes relevant to retirement outcomes.

## Data Source

The dataset consists of **2,561 observations**, where each row represents an individual pension scheme member. The data includes information on members’ earnings, contribution rates, accumulated pension savings, and employment tenure. All personal identifiers have been removed to ensure confidentiality and ethical use of the data.

The data is suitable for analyzing pension adequacy because it captures the primary determinants of retirement outcomes under defined contribution (DC) pension arrangements.

## Key Variables

The dataset contains the following core variables:

- **Salary:** The member’s current or final basic salary, used as a proxy for pre-retirement earnings.
- **Age:** The current age of the member.
- **Retirement Age:** The assumed or expected retirement age for the member.
- **Years of Service:** The total number of years the member has contributed to the pension scheme.
- **Employee Contribution Rate (EE %):** Percentage of salary contributed by the employee.
- **Employer Contribution Rate (ER %):** Percentage of salary contributed by the employer.
- **DOB:** The member’s date of birth.
- **Total Contributions:** Cumulative contributions made over the member’s service period.
- **Fund Value:** The accumulated pension savings available for retirement.

From these variables, additional analytical features are derived, including **projected retirement income** and **replacement ratios**, which form the core outcome measures of this study.

## Target Variable

The primary outcome of interest is the **replacement ratio**, defined as the proportion of a member’s pre-retirement salary that is expected to be replaced by pension income during retirement.

Since actual pension payouts are not observed in the data, replacement ratios are **projected** by converting accumulated fund values into estimated annual retirement income using standard annuitization assumptions. These assumptions are applied consistently across all members to enable fair comparison across salary groups.

However, the data does not capture investment performance history, annuity pricing variation, or post-retirement behavior. As a result, findings are interpreted as **projected retirement outcomes under standardized assumptions**, rather than realized pension income.

These limitations are explicitly acknowledged and addressed in the interpretation of results.
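
The projection described in this section can be sketched as below. The source does not state the annuitization assumptions used, so the 5% discount rate, the 20-year payout period, the level annuity paid in arrears, and the treatment of salary as an annual figure are all placeholders chosen for illustration; the function names are hypothetical as well.

```python
def annuity_factor(rate: float, years: int) -> float:
    """Present value of 1 paid at the end of each year for `years` years."""
    if rate == 0:
        return float(years)
    return (1 - (1 + rate) ** -years) / rate

def projected_replacement_ratio(
    fund_value: float,
    annual_salary: float,
    rate: float = 0.05,       # assumed discount rate (illustrative)
    payout_years: int = 20,   # assumed payout period (illustrative)
) -> float:
    """Convert a fund value to a level annual income, then divide by salary."""
    annual_income = fund_value / annuity_factor(rate, payout_years)
    return annual_income / annual_salary

# Illustrative member: fund of 6,000,000 and an annual salary of 600,000.
ratio = projected_replacement_ratio(fund_value=6_000_000, annual_salary=600_000)
is_adequate = ratio >= 0.60   # lower bound of the 60-80% benchmark
print(f"Projected replacement ratio: {ratio:.1%} (adequate: {is_adequate})")
```

Applying one set of assumptions to every member, as the text describes, keeps the comparison across salary groups fair even though the absolute ratios depend on the chosen rate and payout period.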

In [5]:
# ------- [Import all relevant libraries] -------

# Utilities
import warnings
warnings.filterwarnings('ignore')

# Usual Suspects
import numpy as np           # Mathematical operations
import pandas as pd          # Data manipulation

# Visualization
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
import seaborn as sns

# String manipulation
import re

# Mathematical Operations
import math

# Display settings
pd.set_option('display.max_colwidth', None)
from IPython.display import display

#### Now to load the data and print it out.

In [6]:
# Load data
data = pd.read_csv('../Data/Research data- Raw.csv')
data

| | No. | DOB | Age | Fund Value | Salary | Contributions | EE | ER | Years | Retirement age |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3/25/1975 | 51.0 | 12621655.25 | 421820.00 | 84364.00 | 10% | 10% | 9.0 | 60.0 |
| 1 | 2 | 2/23/1981 | 45.0 | 8149961.01 | 465010.00 | 93002.00 | 10% | 10% | 15.0 | 60.0 |
| 2 | 3 | 7/24/1991 | 35.0 | 7085348.52 | 504660.00 | 100932.00 | 10% | 10% | 25.0 | 60.0 |
| 3 | 4 | 4/13/1986 | 40.0 | 6028192.05 | 504660.00 | 100932.00 | 10% | 10% | 20.0 | 60.0 |
| 4 | 5 | 11/14/1980 | 46.0 | 9458131.00 | 504660.00 | 100932.00 | 10% | 10% | 14.0 | 60.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2556 | 2557 | 1/1/1980 | 46.0 | 140829.12 | 61215.00 | 6121.50 | 5% | 5% | -46.0 | NaN |
| 2557 | 2558 | 1/1/1980 | 46.0 | 169616.84 | 69120.40 | 6912.04 | 5% | 5% | -46.0 | NaN |
| 2558 | 2559 | 1/1/1980 | 46.0 | 114402.78 | 52618.00 | 5261.80 | 5% | 5% | -46.0 | NaN |
| 2559 | 2560 | 1/1/1980 | 46.0 | 475380.00 | 378000.00 | 37800.00 | 5% | 5% | -46.0 | NaN |


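#### *Observation:*

The preview is not consistent from top to bottom: the first rows have positive Years values and a Retirement age of 60, while the last rows show Years of -46, a missing Retirement age, and the same DOB (1/1/1980).

The loaded columns do not include a phone number or any other direct Personally Identifiable Information (PII), so no column needs to be dropped on that account. DOB is the only remaining quasi-identifier and should be handled with discretion.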

Next, I will carry out a quick Initial Data Exploration (IDE) to get a bird’s-eye view of the dataset - just as you would form first impressions when greeting someone new.

In [8]:
# ---- [Initial Data Exploration (IDE)] ----

# Check dataset shape
print(f"The dataset has {data.shape[0]} rows and {data.shape[1]} columns.")

# Check columns
print('\n'+'--'*40)
print("Columns:")
display(data.columns)

# Check metadata
print('\n'+'--'*40)
print("Metadata Check:")
display(data.info())  # info() prints its own report; display() then shows its None return value

# Descriptive statistics
print('\n'+'--'*40)
print("Descriptive Statistics For Numeric Variables:")
display(data.describe().T)

# Categorical Variables
print('\n'+'--'*40)
print("Descriptive Statistics For Categorical Variables:")
display(data.describe(include='object').T)

# Check for duplicates (this only counts them; nothing is removed here)
print('\n'+'--'*40)
print("Duplicates:", data.duplicated().sum())

# Check data completeness
print('\n'+'--'*40)
print("Missingness check:")
display(data.isna().sum())

The dataset has 2561 rows and 10 columns.

--------------------------------------------------------------------------------
Columns:


Index(['No.', 'DOB', 'Age', ' Fund Value ', ' Salary ', ' Contributions ',
       'EE', ' ER ', ' Years ', 'Retirement age'],
      dtype='object')


--------------------------------------------------------------------------------
Metadata Check:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2561 entries, 0 to 2560
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   No.              2561 non-null   int64  
 1   DOB              2561 non-null   object 
 2   Age              2561 non-null   float64
 3    Fund Value      2561 non-null   object 
 4    Salary          2560 non-null   object 
 5    Contributions   2560 non-null   object 
 6   EE               2561 non-null   object 
 7    ER              2561 non-null   object 
 8    Years           2561 non-null   float64
 9   Retirement age   1498 non-null   float64
dtypes: float64(3), int64(1), object(6)
memory usage: 200.2+ KB


None


--------------------------------------------------------------------------------
Descriptive Statistics For Numeric Variables:


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| No. | 2561.0 | 1281.0 | 739.441343 | 1.0 | 641.0 | 1281.0 | 1921.0 | 2561.0 |
| Age | 2561.0 | 42.317454 | 6.540048 | 24.0 | 37.0 | 45.0 | 46.0 | 59.0 |
| Years | 2561.0 | -7.221788 | 29.215819 | -50.0 | -45.0 | 10.0 | 18.0 | 34.0 |
| Retirement age | 1498.0 | 60.0 | 0.0 | 60.0 | 60.0 | 60.0 | 60.0 | 60.0 |



--------------------------------------------------------------------------------
Descriptive Statistics For Categorical Variables:


| | count | unique | top | freq |
|---|---|---|---|---|
| DOB | 2561 | 1542 | 1/1/1980 | 601 |
| Fund Value | 2561 | 2038 | 777079.24 | 86 |
| Salary | 2560 | 456 | 32054.60 | 172 |
| Contributions | 2560 | 456 | 3205.46 | 172 |
| EE | 2561 | 4 | 5% | 1443 |
| ER | 2561 | 4 | 5% | 1253 |



--------------------------------------------------------------------------------
Duplicates: 0

--------------------------------------------------------------------------------
Missingness check:


No.                   0
DOB                   0
Age                   0
 Fund Value           0
 Salary               1
 Contributions        1
EE                    0
 ER                   0
 Years                0
Retirement age     1063
dtype: int64

#### *Observation:*

- Dataset Size and Structure  
  - 2,561 rows and 10 columns.  
  - Columns include demographic, employment, and pension-related attributes.  

- Column Types and Data Quality  
  - Numeric columns: No. (int64), Age, Years, Retirement age (float64).  
  - Object columns: DOB, Fund Value, Salary, Contributions, EE, ER.  
  - Missing values:  
    - Salary and Contributions: 1 missing value each.  
    - Retirement age: 1,063 missing values.  
  - Issues:  
    - Columns like Fund Value, Salary, Contributions, EE, ER are stored as objects. They require type conversion.  
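    - Years contains negative values (down to -50). In the preview rows, Years equals Retirement age minus Age where Retirement age is present (60 - 51 = 9) and equals minus Age where it is missing (-46 at age 46). This suggests the column was computed as years to retirement with a blank Retirement age treated as zero, rather than recording years of service.  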

- Descriptive Statistics:
  - Numeric Columns: 
    - Age: 24–59 years, mean 42.3.  
    - Years: mean -7.22, std 29.21, minimum -50, maximum 34 (data anomalies present).  
    - Retirement age: consistently 60 for all non-missing entries.  

  - Categorical Columns:  
    - DOB: 1,542 unique values; the most frequent, 1/1/1980, appears 601 times, which may indicate a placeholder date rather than genuinely shared birthdates.  
    - Fund Value: 2,038 unique values, most frequent 777,079.24.  
    - Salary & Contributions: 456 unique values each; most frequent 32,054.60 and 3,205.46 respectively.  
    - EE & ER: 4 unique values each; most frequent 5%.  

- Takeaways 
  - Dataset contains a mix of numeric and object-formatted financial data requiring **cleaning and type conversion**.  
  - Missing and negative values need attention before modeling or analysis.  
  - Uniformity in Retirement age indicates either a policy standard or limited variability.
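
The cleaning and type conversion called for above could look roughly like the sketch below. It assumes the monetary columns are stored as text with stray spaces and possibly thousands separators (the source only shows that they have dtype `object`), and that `EE` and `ER` are strings ending in `%`. The function name `clean_pension_frame`, the two diagnostic column names, and the toy rows are hypothetical. The last diagnostic tests the hypothesis that `Years` was computed as `Retirement age - Age` with a missing retirement age treated as zero; it does not correct anything.

```python
import pandas as pd

def clean_pension_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Strip header whitespace and convert text columns to numeric types."""
    df = raw.copy()
    df.columns = df.columns.str.strip()   # ' Fund Value ' -> 'Fund Value'

    # Monetary columns: drop commas and spaces, coerce anything else to NaN.
    for col in ["Fund Value", "Salary", "Contributions"]:
        text = df[col].astype(str).str.replace(",", "", regex=False).str.strip()
        df[col] = pd.to_numeric(text, errors="coerce")

    # Percentage columns: '10%' -> 0.10.
    for col in ["EE", "ER"]:
        text = df[col].astype(str).str.strip().str.rstrip("%")
        df[col] = pd.to_numeric(text, errors="coerce") / 100

    # Diagnostics only: flag suspicious rows rather than altering them.
    df["negative_years"] = df["Years"] < 0
    df["years_is_retirement_minus_age"] = (
        df["Retirement age"].fillna(0) - df["Age"]
    ) == df["Years"]
    return df

# Toy rows modelled on the preview; the string formatting is an assumption.
raw = pd.DataFrame({
    " Fund Value ": [" 12,621,655.25 ", " 140,829.12 "],
    " Salary ": [" 421,820.00 ", None],
    " Contributions ": [" 84,364.00 ", " 6,121.50 "],
    "EE": ["10%", "5%"],
    " ER ": ["10%", "5%"],
    "Age": [51.0, 46.0],
    " Years ": [9.0, -46.0],
    "Retirement age": [60.0, None],
})
clean = clean_pension_frame(raw)
print(clean.dtypes)
```

Using `errors="coerce"` turns unparseable entries into `NaN` instead of raising, so it is worth comparing the missing-value counts before and after conversion to see how many values were lost.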