<a href="https://colab.research.google.com/github/call493/MLFC/blob/main/Poverty_and_Literacy_levels_correlation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Exploring the link between literacy rates and poverty in Kenya - A Socio-Economic Analysis**



---



Braxton Mandara, DSAIL

15-09-2025

**Abstract**

This notebook explores the relationship between literacy rates and poverty levels across Kenyan counties through a comprehensive, spatially-informed data analysis. It integrates multiple datasets—including county boundaries, school locations, population census, literacy rates, and poverty indicators—visualizing the geographic distribution of educational infrastructure and population density.

By merging and normalizing these datasets, the notebook enables statistical analyses such as correlation and regression, revealing a strong, statistically significant negative association between poverty rates and literacy rates. Interactive and static maps highlight regional disparities in school access and population, while scatter plots and heatmaps illustrate key socioeconomic patterns.

The notebook also identifies outlier counties and evaluates the predictive power of poverty and demographic indicators for literacy outcomes. Overall, this work demonstrates how geospatial data blending and statistical modeling can uncover critical insights into educational equity and poverty, informing policy and resource allocation in Kenya.

# **Step 1: Initial Setup**

Install OSMnx, the Python package we will use to download, model, and analyze map features and other spatial data from OpenStreetMap.

In [None]:
%%capture
%pip install osmnx

In [None]:
import osmnx as ox
import matplotlib.pyplot as plt
import warnings
import math
import pandas as pd

warnings.filterwarnings("ignore", category=FutureWarning, module='osmnx')

# **Step 2: Plot a Map of Kenya**

We clone the `fynesse` repository, add it to the Python path, and use it to plot a map of Kenya.

In [None]:
!git clone https://github.com/call493/fynesse.git
import sys
sys.path.append("/content/fynesse")

In [None]:
%cd /content/fynesse
!git pull origin main

In [None]:
import fynesse

In [None]:
fynesse.access.plot_map("Kenya")

Let us proceed to plot the county boundaries.

In [None]:
import geopandas as gpd
import matplotlib.pyplot as plt

# URL to the GeoJSON file for Kenyan counties (ADM1) from GeoBoundaries
geojson_url = "https://github.com/wmgeolab/geoBoundaries/raw/9469f09/releaseData/gbOpen/KEN/ADM1/geoBoundaries-KEN-ADM1.geojson"

kenya_gdf = gpd.read_file(geojson_url)

fig, ax = plt.subplots(figsize=(10, 10))
kenya_gdf.boundary.plot(ax=ax, edgecolor='blue', linewidth=1)

ax.set_title("Kenya County Boundaries")

plt.show()

# **Step 3: Distribution of Schools Across Counties**

Next, I look at how primary schools are distributed across the country, using the school geolocation (coordinate) data in the `schools.json` file, which describes Kenyan schools in 2020. This dataset was obtained from Energydata: https://energydata.info/dataset/kenya-schools#

In [None]:
fynesse.assess.plot_primary_schools(
    schools_file="https://raw.githubusercontent.com/call493/MLFC/main/schools.json",
    geojson_url="https://github.com/wmgeolab/geoBoundaries/raw/9469f09/releaseData/gbOpen/KEN/ADM1/geoBoundaries-KEN-ADM1.geojson"
)

## Let us look at data for primary schools and secondary schools

How are primary and secondary schools distributed across the counties?

In [None]:
# Reload the fynesse.assess module

import importlib
import fynesse.assess
importlib.reload(fynesse.assess)


In [None]:
fynesse.assess.plot_schools_on_county_boundaries()

At first glance, there appear to be more primary schools than secondary schools. We will verify this in the following cells.

## Obtain and Prepare Population Data

The Kenya National Bureau of Statistics (KNBS) publishes population data. I use the 2019 Kenya Population and Housing Census results, which include population counts by county, from the following source:

https://www.knbs.or.ke/wp-content/uploads/2023/09/2019-Kenya-population-and-Housing-Census-Volume-1-Population-By-County-And-Sub-County.pdf

The county-level dataset used here was extracted from page 7 of that PDF report.

In [None]:
import pandas as pd
import os

population_file_url = "https://raw.githubusercontent.com/call493/MLFC/main/kenya_population_by_county_2019.csv"

try:
    population_df = pd.read_csv(population_file_url)
    print("Population data loaded successfully!")
    display(population_df.head())
except Exception as e:
    print(f"An error occurred while reading the CSV file: {e}")

### Merge Population Data with County Boundaries

Now, I will merge the population data with the county boundaries GeoDataFrame, using the 'County' column in the population DataFrame and the 'shapeName' column in the county GeoDataFrame as the common key. I will perform a left merge to keep all the county geometries and add the population data where a match is found.
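
A minimal sketch of this kind of left merge, run on made-up data rather than the notebook's files. Passing `indicator=True` to `merge` adds a `_merge` column that flags rows with no match in the right-hand table. The county spellings and totals below are illustrative assumptions, not values from the census file.

```python
import pandas as pd

# Toy stand-ins for kenya_gdf and population_df (illustrative values only).
geo = pd.DataFrame({"shapeName": ["Mombasa", "Nairobi", "Taita Taveta"]})
pop = pd.DataFrame({
    "County": ["Mombasa", "Nairobi City", "Taita/Taveta"],
    "Total": [100, 200, 300],
})

# A left merge keeps every row of `geo`; unmatched rows get NaN in `Total`.
merged_toy = geo.merge(
    pop, left_on="shapeName", right_on="County", how="left", indicator=True
)

# Rows marked "left_only" had no matching county name in `pop`.
unmatched = merged_toy.loc[merged_toy["_merge"] == "left_only", "shapeName"].tolist()
print(unmatched)
```

Any name listed in `unmatched` would appear as a blank county on a choropleth map built from the merged table.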

In [None]:
# It's important to check if the county names match exactly in both dataframes.
# Let's print the unique county names from both dataframes to compare.
print("Unique counties in population_df:")
print(population_df['County'].unique())

print("\nUnique counties in kenya_gdf:")
print(kenya_gdf['shapeName'].unique())

Some of the county names do not match. Let us see what we get when we plot a population heatmap anyway.

### Population Heatmap by County

This heatmap visualizes the population distribution across Kenyan counties based on the 2019 census data. Counties with higher populations are shown in darker shades, while those with lower populations are in lighter shades.

In [None]:
fynesse.assess.plot_population_heatmap(kenya_gdf, population_df)

The same map, made interactive with folium:

In [None]:
fynesse.assess.plot_interactive_population_map(kenya_gdf, population_df)

The maps show no population data for the following counties: Elgeyo Marakwet, Tharaka Nithi, Taita Taveta and Nairobi.

Why is that?

> The population data and the GeoDataFrame use different names for these counties, so the merge finds no match for them.

We therefore have to normalize the county names so that the merge works and the heatmap covers all counties.
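
As a rough illustration of what name normalization can look like, here is a minimal sketch. The function `normalize_name` and the `ALIASES` table are hypothetical and are not the implementation of `fynesse.assess.normalize_county_names`. The assumption is that mismatches come from case, punctuation, spacing, and a few alternative names.

```python
import re

# Hypothetical alias table: maps a cleaned spelling to the preferred one.
ALIASES = {"nairobi city": "nairobi"}

def normalize_name(name: str) -> str:
    """Lowercase a name, turn punctuation into spaces, and apply aliases."""
    # Replace every run of non-alphanumeric characters with a single space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", name.casefold()).strip()
    return ALIASES.get(cleaned, cleaned)

for raw in ["Elgeyo/Marakwet", "  THARAKA-NITHI ", "Nairobi City"]:
    print(repr(raw), "->", repr(normalize_name(raw)))
```

Applying the same function to both key columns before merging is what makes the two spellings of a county compare equal.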



In [None]:
print(population_df['County'].unique())

Let us use the `normalize_county_names` function from `fynesse.assess`.

In [None]:
from fynesse.assess import normalize_county_names

# Apply to population DataFrame
population_df['County'] = population_df['County'].apply(normalize_county_names)

# Apply to geodataframe
kenya_gdf['shapeName'] = kenya_gdf['shapeName'].apply(normalize_county_names)

merged = kenya_gdf.merge(population_df, left_on='shapeName', right_on='County', how='left')


In [None]:
fynesse.assess.check_county_mismatches(kenya_gdf, population_df, merged)

In [None]:
# Check for any remaining mismatches
pop_counties = set(population_df['County'].unique())
geo_counties = set(kenya_gdf['shapeName'].unique())

print(f"\nPopulation counties: {len(pop_counties)}")
print(f"Geographic counties: {len(geo_counties)}")
print(f"\nCounties in population data but not in geographic data: {pop_counties - geo_counties}")
print(f"Counties in geographic data but not in population data: {geo_counties - pop_counties}")

In [None]:
# Check for missing data
missing = merged[merged["Total"].isna()][["shapeName"]]
print(f"\nCounties with missing population data: {len(missing)}")
if len(missing) > 0:
    print(missing)

In [None]:
print("Unique counties in population_df:")
print(population_df['County'].unique())

Now let us check to confirm that the data has been normalized.

In [None]:
fynesse.assess.plot_county_population_heatmap(merged)

Let us proceed to make a comparison between the number of primary schools and secondary schools in each county.
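
Underneath, this comparison amounts to counting schools per county and per level. A minimal sketch on made-up records follows. The column names `county` and `level` are assumptions and may not match the fields in `schools.json`.

```python
import pandas as pd

# Made-up school records (one row per school), for illustration only.
schools = pd.DataFrame({
    "county": ["Kiambu", "Kiambu", "Kiambu", "Turkana", "Turkana"],
    "level": ["Primary", "Primary", "Secondary", "Primary", "Secondary"],
})

# Cross-tabulate: rows are counties, columns are school levels.
counts = pd.crosstab(schools["county"], schools["level"])

# Ratio of primary to secondary schools in each county.
counts["primary_per_secondary"] = counts["Primary"] / counts["Secondary"]
print(counts)
```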

In [None]:
from fynesse.assess import normalize_county_names
m = fynesse.assess.plot_schools_choropleth_map(normalize_func=normalize_county_names)
m  # display the map

[https://www.knbs.or.ke/wp-content/uploads/2023/09/2019-Kenya-population-and-Housing-Census-Analytical-Report-on-Education-and-Training.pdf](https://www.knbs.or.ke/wp-content/uploads/2023/09/2019-Kenya-population-and-Housing-Census-Analytical-Report-on-Education-and-Training.pdf)


# Literacy In Kenya

According to UNESCO (https://www.theglobaleconomy.com/Kenya/Literacy_rate/), Kenya's literacy rate is 82.88%.

To bring things into proper perspective, basic literacy can be defined as the ability to both read and write with understanding a short, simple statement about everyday life in English or Kiswahili.

The dataset I used here I obtained it from

In [None]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/call493/MLFC/main/kenya_literacy_2019.csv')
display(df.head())


The poverty data used here was obtained from the Kenya Poverty Report 2019: https://www.knbs.or.ke/wp-content/uploads/2023/09/The-Kenya-Poverty-Report-2019.pdf

In [None]:
# Load the poverty rate dataset
import pandas as pd

# Load poverty data from the specified URL
poverty_df = pd.read_csv('https://raw.githubusercontent.com/call493/MLFC/main/kenya_poverty_2019.csv')
print("Poverty data loaded successfully!")
print(f"Poverty dataset shape: {poverty_df.shape}")
print(f"Poverty dataset columns: {poverty_df.columns.tolist()}")
display(poverty_df.head())

In [None]:
# Merge poverty_df with the previously loaded literacy dataset
# First, let's examine both dataframes and ensure proper column names
print("Columns in poverty_df:", poverty_df.columns.tolist())
print("Columns in df (literacy):", df.columns.tolist())

In [None]:
# Check county names to identify any inconsistencies
print("\nCounties in poverty_df:")
print(sorted(poverty_df['County'].unique()))
print("\nCounties in literacy df:")
print(sorted(df['County'].unique()))

In [None]:
merged_df = fynesse.assess.merge_literacy_and_poverty(df, poverty_df)

In [None]:
df_analysis, corr_coef, p_value = fynesse.assess.clean_and_summarize_merged_df(merged_df)

In [None]:
print(df.columns.tolist())
print(df_analysis.columns.tolist())

In [None]:
results = fynesse.assess.correlate_and_plot_poverty_literacy(df_analysis)

These results show a strong negative correlation between poverty rate and literacy rate across Kenya’s counties for 2019:

*   Pearson correlation coefficient: -0.841

> This value is close to -1, meaning that as poverty rates increase, literacy rates tend to decrease in a nearly linear pattern.

> The p-value (1.72e-12) is extremely small, so the result is highly statistically significant: a correlation this strong would be very unlikely to appear if poverty and literacy were unrelated.

*   Spearman correlation coefficient: -0.785

> This also shows a strong negative relationship, based on the rank order of the counties rather than the exact values.

> The small p-value (4.54e-10) again indicates very strong statistical evidence.


Counties with higher poverty rates tend to have much lower literacy rates, and this inverse association is very strong.
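
To make the difference between the two coefficients concrete, here is a minimal NumPy sketch on made-up numbers, not the county data. Pearson works on the raw values, while Spearman is Pearson applied to the ranks. The helper names are hypothetical and the ranking assumes no tied values. `scipy.stats.pearsonr` and `scipy.stats.spearmanr` compute the same coefficients and also return p-values.

```python
import numpy as np

def pearson_r(x, y) -> float:
    """Pearson correlation: strength of the linear relationship."""
    return float(np.corrcoef(x, y)[0, 1])

def ranks(values) -> np.ndarray:
    """Ranks 1..n of the values (assumes there are no ties)."""
    values = np.asarray(values)
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result

def spearman_rho(x, y) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Made-up poverty and literacy percentages for six imaginary counties.
poverty = [20, 30, 40, 50, 60, 70]
literacy = [95, 90, 88, 80, 70, 50]

print("Pearson:", round(pearson_r(poverty, literacy), 3))
print("Spearman:", round(spearman_rho(poverty, literacy), 3))
```

In this toy data literacy falls every time poverty rises, so the Spearman coefficient is exactly -1, while the Pearson coefficient is slightly weaker because the decline is not perfectly linear.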

### Let us identify outlier counties
Outliers are counties whose literacy rates are unusually high or low compared to what you’d expect for their poverty level.
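
One common way to make "unusually high or low" concrete is to fit a straight line of literacy on poverty and flag counties whose standardized residual is large. The sketch below uses made-up data and a cutoff of 2, both of which are assumptions. It is not necessarily the method used inside `fynesse.assess.identify_outlier_counties_linear`.

```python
import numpy as np

# Made-up data: eight imaginary counties on a perfect line, except county D.
counties = ["A", "B", "C", "D", "E", "F", "G", "H"]
poverty = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
literacy = 100 - 0.5 * poverty
literacy[3] -= 15  # county D is far less literate than its poverty level suggests

# Fit literacy = intercept + slope * poverty by least squares.
slope, intercept = np.polyfit(poverty, literacy, 1)
residuals = literacy - (intercept + slope * poverty)

# Standardize the residuals by the residual standard error (n - 2 degrees of freedom).
resid_std = np.sqrt(np.sum(residuals**2) / (len(poverty) - 2))
standardized = residuals / resid_std

# Assumed cutoff: flag counties more than 2 standard errors from the line.
outliers = [name for name, z in zip(counties, standardized) if abs(z) > 2]
print(outliers)
```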

In [None]:
outliers_df, model = fynesse.assess.identify_outlier_counties_linear(df_analysis)

### Multivariate Regression


In [None]:
ols_model = fynesse.assess.fit_multivariate_ols(df_analysis)

* Model fit (R-squared = 0.710): 71% of the variation in county literacy rates is explained by four predictors (poverty rate, poverty gap, poverty severity, and population). This is a strong fit.

* Predictor coefficients and significance: the most important part of the interpretation is the coefficients and their p-values. A statistically significant coefficient usually has p < 0.05.

  * const (Intercept): 95.65. This is the predicted literacy rate if all other values are zero; it is not usually interpreted on its own.
  * Headcount_Rate_Percent: -0.2375, p = 0.687. Negative, as expected, but not statistically significant (chance cannot be ruled out).
  * Population_Thousands: 0.0005, p = 0.761 (not significant)
  * Poverty_Gap_Percent: -1.68, p = 0.596 (not significant)
  * Severity_Percent: 1.89, p = 0.591 (not significant)

* Overall model significance (F-statistic p = 9.06e-10): the combined model is statistically significant, so the predictors taken together are very unlikely to be unrelated to literacy.

* Multicollinearity warning ("condition number is large, ... strong multicollinearity"): several poverty measures in the model are highly correlated with each other. Multicollinearity makes it hard to separate their individual effects, which leads to unstable or non-significant individual coefficients even if the group is important together.

*The model as a whole predicts literacy levels across counties well, and you see the expected negative direction for Headcount_Rate_Percent, but overlapping predictor information clouds the significance of individual variables. The strongest message remains that poverty is negatively associated with literacy at the county level.*

> The model fits well overall (high R²). The poverty variables are so highly correlated (collinear) that their individual effects aren’t statistically distinguishable with the current predictors.

### Classification

In [None]:
output = fynesse.assess.classify_high_literacy(df_analysis)

Low Literacy:

* Precision (0.56): Of counties predicted as “Low Literacy,” 56% are actually low literacy.

* Recall (0.83): Of all actual “Low Literacy” counties, 83% were correctly predicted.

* F1-score (0.67): Harmonic mean of precision and recall.

High Literacy:

* Precision (0.75): Of counties predicted as “High Literacy,” 75% are correct.

* Recall (0.43): Of all actual high literacy counties, only 43% were correctly identified.

* F1-score (0.55): Lower due to missing more “High Literacy” cases.

Overall Accuracy: 0.62 (62%) — The model correctly classifies 62% of counties.
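
The arithmetic behind these metrics can be sketched as follows. The confusion-matrix counts are an assumption: a 13-county test set chosen because it reproduces the reported figures. They are not taken from the notebook's output, and the helper `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)  # share of predictions for the class that are right
    recall = tp / (tp + fn)     # share of actual members of the class that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Assumed counts (actual class -> predicted class), 13 counties in total.
low_as_low, low_as_high = 5, 1    # 6 counties that are actually low literacy
high_as_low, high_as_high = 4, 3  # 7 counties that are actually high literacy

low = precision_recall_f1(tp=low_as_low, fp=high_as_low, fn=low_as_high)
high = precision_recall_f1(tp=high_as_high, fp=low_as_high, fn=high_as_low)
accuracy = (low_as_low + high_as_high) / 13

print("Low: ", [round(v, 2) for v in low])
print("High:", [round(v, 2) for v in high])
print("Accuracy:", round(accuracy, 2))
```

Under these assumed counts, the low recall for "High Literacy" comes from four of the seven high-literacy counties being predicted as low literacy.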

### Reduce Collinear Predictors
Let us keep one poverty-related variable, such as Headcount_Rate_Percent.
Why? Using multiple similar poverty indices creates redundancy and statistical noise.
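
One way to quantify this redundancy is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below uses NumPy only and made-up values. The helper `vif` is hypothetical and is not part of `fynesse`. Values above roughly 5 to 10 are commonly read as a sign of problematic collinearity.

```python
import numpy as np

def vif(X: np.ndarray) -> list[float]:
    """VIF of each column: 1 / (1 - R^2) from regressing it on the other columns."""
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        # Design matrix: intercept plus all the other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        factors.append(float(1 / (1 - r2)))
    return factors

# Made-up predictors: the poverty gap is almost a rescaled copy of the headcount rate.
headcount = np.array([20, 30, 40, 50, 60, 70], dtype=float)
gap = np.array([6.5, 8.5, 12.3, 14.7, 18.2, 20.8])
population = np.array([900, 300, 1200, 500, 800, 400], dtype=float)

X = np.column_stack([headcount, gap, population])
for name, value in zip(["headcount", "gap", "population"], vif(X)):
    print(f"{name}: {value:.1f}")
```

The two nearly proportional poverty columns get very large VIFs, which is the situation that motivates keeping only one of them.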

In [None]:
reduced_model = fynesse.assess.fit_reduced_ols(df_analysis)

#### Model Fit

R-squared: 0.707
About 71% of the variation in literacy rates is explained by the model.

#### Coefficients

Intercept (const): 97.28
Theoretically, if poverty and population were zero, the literacy rate would be about 97% (not literally meaningful, but it shows the baseline in the model).

Headcount_Rate_Percent: -0.53
Each 1 percentage point increase in the poverty rate is associated with a decrease of about 0.53 percentage points in the literacy rate, holding population constant.

p-value: 0.000 (that is, p < 0.001). Highly statistically significant: the negative association is very unlikely to be due to chance.

Population_Thousands: Not significant (p = 0.821)

This predictor does not have a statistically detectable independent effect in the data.