# MOSAIKS Demonstration: Identifying Internet Access in Togo

This notebook demonstrates Multi-task Observation using Satellite Imagery & Kitchen Sinks (MOSAIKS), a machine learning approach built on features extracted from satellite imagery, by using it to identify areas with and without internet access in Togo. Maps of this kind can help characterize the digital divide and inform policies to improve connectivity.

## Methodology

The process involves the following steps:

1. **Data Acquisition:** Label data on internet access is provided by the Agence Togo Digital (ATD). Pre-processed satellite imagery features are derived using the MOSAIKS framework.
1. **Data Preprocessing:** The label and feature data are cleaned and joined. The result is a single dataset of labels and features that is ready for modeling.
1. **Model Training and Evaluation:** A machine learning model, specifically Ridge Regression with isotonic calibration, is trained on the joined dataset to predict the likelihood of internet access. The model is evaluated using metrics like the Receiver Operating Characteristic Area Under the Curve (ROC AUC).
1. **Visualization and Interpretation:** The model results are visualized using maps and plots. Important features identified by the model are analyzed to understand the patterns associated with internet access.
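
The modeling step can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's actual pipeline: the random matrix `X_demo` stands in for the MOSAIKS feature columns, `y_demo` stands in for the internet access labels, and the three-way split (train, calibration, test) is an assumption made here so that the isotonic calibrator is fit on data the final evaluation never sees.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import RidgeClassifierCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the joined dataset (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
n_samples, n_features = 600, 50
X_demo = rng.normal(size=(n_samples, n_features))
true_weights = rng.normal(size=n_features)
noise = rng.normal(scale=2.0, size=n_samples)
y_demo = (X_demo @ true_weights + noise > 0).astype(int)

# Split into train (60%), calibration (20%), and test (20%) sets
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)
X_cal, X_te, y_cal, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest
)

# Ridge classifier with the regularization strength chosen by cross-validation
clf = RidgeClassifierCV(alphas=np.logspace(-2, 2, 5), cv=5)
clf.fit(X_tr, y_tr)

# Map raw decision scores to [0, 1] using the calibration split only
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(clf.decision_function(X_cal), y_cal)

# Evaluate on the untouched test split
scores_te = clf.decision_function(X_te)
proba_te = iso.predict(scores_te)
auc_raw = roc_auc_score(y_te, scores_te)
auc_cal = roc_auc_score(y_te, proba_te)

print(f"Best alpha: {clf.alpha_}")
print(f"Test AUC (raw scores): {auc_raw:.3f}")
print(f"Test AUC (calibrated): {auc_cal:.3f}")
```

Because isotonic regression is a monotone mapping, it mainly changes the scale of the scores rather than their ranking, so the two AUC values are usually close; any difference comes from ties introduced by the step function.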


## Setup

This notebook utilizes several key Python libraries:

* **Data Handling and Analysis:**  `pandas` and `numpy` provide fundamental data structures and functions for manipulating and analyzing data. `geopandas` extends these capabilities to work with geospatial data.
* **Visualization:** `matplotlib.pyplot` and `seaborn` produce the plots and maps used for data exploration and for presenting results.
* **Machine Learning:** `sklearn` (scikit-learn) offers tools for building and evaluating machine learning models, including Ridge Regression, isotonic calibration, and utilities for splitting and preprocessing data.

In [2]:
import os
import shutil
import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt

# from google.colab import drive
from sklearn.linear_model import RidgeClassifierCV
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split, GridSearchCV

In [3]:
# Define the directory name
data_dir = "LSMS-ISA-data"

# Check if the final directory exists
if not os.path.exists(data_dir):
    # Download only if the zip file doesn't exist
    if not os.path.exists("Data.zip"):
        !wget https://zenodo.org/records/14040658/files/Data.zip

    # Unzip the data
    !unzip Data.zip

    # Rename the folder
    !mv Data {data_dir}

    # Remove the zip file
    !rm Data.zip

# List the files (this will run regardless)
!ls -lh {data_dir}

total 197M
-rw-rw-rw- 1 cullen_molitor emlab  44M Oct 29 06:23 Household_dataset.dta
-rw-rw-rw- 1 cullen_molitor emlab 404M Oct 29 06:23 Individual_dataset.dta
-rw-rw-rw- 1 cullen_molitor emlab 241M Nov  5 05:40 Plotcrop_dataset.dta
-rw-rw-rw- 1 cullen_molitor emlab 244M Nov  5 05:40 Plot_dataset.dta


In [4]:
# Define the directory and base URL
geo_dir = "geoBoundaries"
base_url = "https://github.com/wmgeolab/geoBoundaries/blob/9469f09/releaseData/gbOpen"

# List of country codes
countries = ["ETH", "MWI", "MLI", "NER", "NGA", "TZA", "UGA"]

# Create directory if it doesn't exist
if not os.path.exists(geo_dir):
    !mkdir {geo_dir}

# Download files for each country if they don't exist
for country in countries:
    filename = f"geoBoundaries-{country}-ADM2.geojson"
    filepath = os.path.join(geo_dir, filename)

    if not os.path.exists(filepath):
        # Note: UGA needs 'raw' instead of 'blob' in the URL
        url = f"{base_url}/{country}/ADM2/{filename}"
        if country == "UGA":
            url = url.replace("/blob/", "/raw/")
        !wget {url} -P {geo_dir}

# List the files (this will run regardless)
!ls -lh {geo_dir}

total 23M
-rw-r--r-- 1 cullen_molitor emlab 1.3M Dec  9 18:50 geoBoundaries-ETH-ADM2.geojson
-rw-r--r-- 1 cullen_molitor emlab 849K Dec  9 18:51 geoBoundaries-MLI-ADM2.geojson
-rw-r--r-- 1 cullen_molitor emlab 5.4M Dec  9 18:51 geoBoundaries-MWI-ADM2.geojson
-rw-r--r-- 1 cullen_molitor emlab 205K Dec  9 18:51 geoBoundaries-NER-ADM2.geojson
-rw-r--r-- 1 cullen_molitor emlab 9.0M Dec  9 18:51 geoBoundaries-NGA-ADM2.geojson
-rw-r--r-- 1 cullen_molitor emlab  18M Dec  9 18:51 geoBoundaries-TZA-ADM2.geojson
-rw-r--r-- 1 cullen_molitor emlab 1.8M Dec  9 18:52 geoBoundaries-UGA-ADM2.geojson


In [None]:
# !wget https://www.geoboundaries.org/data/geoBoundariesCGAZ-3_0_0/ADM2/simplifyRatio_100/geoBoundariesCGAZ_ADM2.topojson

boundaries = gpd.read_file("geoBoundariesCGAZ_ADM2.topojson")
mask = boundaries["shapeID"].str[:3].isin(countries)
boundaries = boundaries[mask]
boundaries = boundaries.rename(
    columns={
        "shapeName": "ADM2",
        "shapeGroup": "ADM0",
    }
)
boundaries = boundaries.drop(
    columns=[
        "id",
        "shapeISO",
        "shapeType",
        "ADM1_shapeID",
        "ADM0_shapeID",
        "ADMHIERARCHY",
    ]
)
boundaries = boundaries[["shapeID", "ADM0", "ADM2", "geometry"]]
boundaries = boundaries.set_crs("EPSG:4326", allow_override=True)
boundaries.head()

Unnamed: 0,shapeID,ADM0,ADM2,geometry
60194,MWI-ADM2-3_0_0-B1,MWI,Likoma,"MULTIPOLYGON (((34.67926 -12.06793, 34.67942 -..."
60195,MWI-ADM2-3_0_0-B2,MWI,Rumphi,"POLYGON ((33.70894 -10.56715, 33.7197 -10.5607..."
60196,MWI-ADM2-3_0_0-B3,MWI,Chitipa,"POLYGON ((33.95912 -10.4494, 33.7197 -10.56072..."
60197,MWI-ADM2-3_0_0-B4,MWI,Karonga,"POLYGON ((34.17752 -10.58361, 34.17547 -10.583..."
60198,MWI-ADM2-3_0_0-B5,MWI,Nkhata Bay,"POLYGON ((34.05633 -11.08358, 34.05732 -11.082..."


In [6]:
# feats = pd.read_csv(
#     os.path.join(
#         "/",
#         "home",
#         "emlab",
#         "data",
#         "mosaiks-togo",
#         "API_features",
#         "ADM_2_regions_RCF_global_dense.csv",
#     )
# )
# mask = feats['shapeID'].str[:3].isin(countries)
# feats = feats[mask]
# feats = feats.drop(columns=["shapeID.1"], errors="ignore")
# features = boundaries.set_index("shapeID").join(feats.set_index("shapeID"), how="left")
# features = features.reset_index()
# new_cols = [col for col in features.columns if col != 'geometry'] + ['geometry']
# features = features[new_cols]
# features.crs = "EPSG:4326"
# features.to_feather("adm2_api_features.feather")
# features.head()

In [7]:
feats2 = gpd.read_feather("adm2_api_features.feather")
feats2.head()

Unnamed: 0,shapeID,ADM0,ADM2,X_0,X_1,X_2,X_3,X_4,X_5,X_6,...,X_3991,X_3992,X_3993,X_3994,X_3995,X_3996,X_3997,X_3998,X_3999,geometry
0,MWI-ADM2-3_0_0-B1,MWI,Likoma,0.01964,0.070377,0.006018,0.066489,0.037376,0.053688,0.019789,...,0.092924,0.287207,0.070323,0.058085,0.176112,0.318884,0.125248,0.131343,0.17276,"MULTIPOLYGON (((34.67926 -12.06793, 34.67942 -..."
1,MWI-ADM2-3_0_0-B2,MWI,Rumphi,0.067003,0.279466,0.012389,0.271192,0.136311,0.188153,0.080772,...,0.016506,0.111627,0.081184,0.046551,0.497085,0.663726,0.364558,0.311497,0.022621,"POLYGON ((33.70894 -10.56715, 33.7197 -10.5607..."
2,MWI-ADM2-3_0_0-B3,MWI,Chitipa,0.101584,0.375491,0.017354,0.331362,0.199296,0.26687,0.105208,...,0.022143,0.145599,0.103824,0.071805,0.626835,0.86063,0.477321,0.386188,0.028637,"POLYGON ((33.95912 -10.4494, 33.7197 -10.56072..."
3,MWI-ADM2-3_0_0-B4,MWI,Karonga,0.065781,0.27324,0.012053,0.237539,0.133702,0.183947,0.076608,...,0.020197,0.117028,0.08032,0.058535,0.519485,0.713003,0.372761,0.337297,0.019917,"POLYGON ((34.17752 -10.58361, 34.17547 -10.583..."
4,MWI-ADM2-3_0_0-B5,MWI,Nkhata Bay,0.062253,0.212086,0.022592,0.103554,0.112989,0.22449,0.125575,...,0.028885,0.177763,0.106717,0.051149,0.356471,0.796815,0.35836,0.17184,0.041598,"POLYGON ((34.05633 -11.08358, 34.05732 -11.082..."


In [167]:
df = pd.read_stata(os.path.join(data_dir, "Plotcrop_dataset.dta"))
df

One or more strings in the dta file could not be decoded using utf-8, and
so the fallback encoding of latin-1 is being used.  This can happen when a file
has been incorrectly encoded by Stata or some other software. You should verify
the string values returned are correct.
  df = pd.read_stata(os.path.join(data_dir, "Plotcrop_dataset.dta"))


Unnamed: 0,country,wave,crop_name,season,pw,ea_id_merge,ea_id_obs,strataid,urban,admin_1,...,seed_kg,seed_value_LCU,seed_value_USD,improved,used_pesticides,crop_shock,pests_shock,rain_shock,drought_shock,flood_shock
0,Ethiopia,1.0,SORGHUM,1.0,2236.134521,010101088801601,1000002.0,7.0,No,Tigray,...,11.36,90.880000,6.486660,No,No,No,0.0,0.0,0.0,No
1,Ethiopia,1.0,MILLET,1.0,2236.134521,010101088801601,1000002.0,7.0,No,Tigray,...,8.04,49.325153,3.520637,No,No,No,0.0,0.0,0.0,No
2,Ethiopia,1.0,MILLET,1.0,2236.134521,010101088801601,1000002.0,7.0,No,Tigray,...,3.10,19.018405,1.357460,No,No,No,0.0,0.0,0.0,No
3,Ethiopia,1.0,SORGHUM,1.0,2236.134521,010101088801601,1000002.0,7.0,No,Tigray,...,11.36,90.880000,6.486660,No,No,No,0.0,0.0,0.0,No
4,Ethiopia,1.0,MILLET,1.0,2236.134521,010101088801601,1000002.0,7.0,No,Tigray,...,8.08,49.570552,3.538153,No,No,No,0.0,0.0,0.0,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
504106,Uganda,8.0,MAIZE,2.0,986.894118,,,,No,Afar,...,4.00,6000.000000,1.625449,No,No,No,,,,
504107,Uganda,8.0,(ROBUSTA),1.0,986.894118,,,,No,Afar,...,0.00,,,,No,No,,,,
504108,Uganda,8.0,(ROBUSTA),2.0,986.894118,,,,No,Afar,...,0.00,,,,No,No,,,,
504109,Uganda,8.0,MAIZE,1.0,986.894118,,,,No,Afar,...,3.00,6000.000000,1.625449,No,No,No,,,,


In [168]:
df.columns

Index(['country', 'wave', 'crop_name', 'season', 'pw', 'ea_id_merge',
       'ea_id_obs', 'strataid', 'urban', 'admin_1', 'admin_2', 'admin_3',
       'admin_1_name', 'admin_2_name', 'admin_3_name', 'lat_modified',
       'lon_modified', 'geocoords_id', 'parcel_id_obs', 'parcel_id_merge',
       'plot_id_obs', 'plot_id_merge', 'hh_id_obs', 'hh_id_merge',
       'harvest_end_month', 'planting_month', 'harvest_kg',
       'harvest_value_LCU', 'harvest_value_USD', 'seed_kg', 'seed_value_LCU',
       'seed_value_USD', 'improved', 'used_pesticides', 'crop_shock',
       'pests_shock', 'rain_shock', 'drought_shock', 'flood_shock'],
      dtype='object')

In [None]:
# df = pd.read_stata(os.path.join(data_dir, "Household_dataset.dta"))
df = pd.read_stata(os.path.join(data_dir, "Plotcrop_dataset.dta"))
# Create a conditions dictionary mapping countries to their desired waves
wave_conditions = {
    "Ethiopia": 4,
    "Malawi": 4,
    "Mali": 2,
    "Niger": 2,
    "Nigeria": 3,
    "Tanzania": 5,
    "Uganda": 7,
}

# Keep only the selected survey wave for each country (vectorized filter)
target_wave = df["country"].astype(str).map(wave_conditions)
df = df[df["wave"] == target_wave]

gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.lon_modified, df.lat_modified), crs="EPSG:4326"
)

gdf = gdf.drop(
    columns=[
        "admin_2",
        "admin_3",
        "hh_id_merge",
        "hh_id_obs",
        "season",
        "ea_id_merge",
        "ea_id_obs",
        "strataid",
        "geocoords_id",
    ]
)

gdf["admin_2_name"] = gdf["admin_2_name"].replace("", np.nan)

mask_coords = ~gdf["geometry"].is_empty
gdf_with_coords = gdf[mask_coords].copy()

joined_data = gpd.sjoin(
    gdf_with_coords, boundaries[["ADM2", "geometry"]], how="left", predicate="within"
)

gdf["admin_2_joined"] = None

gdf.loc[joined_data.index, "admin_2_joined"] = joined_data["ADM2"]

gdf["ADM2"] = gdf["admin_2_name"].fillna(gdf["admin_2_joined"])

gdf = gdf.drop(
    columns=[
        "wave",
        "admin_2_joined",
        "geometry",
        "lat_modified",
        "lon_modified",
        "admin_1",
        "admin_1_name",
        "admin_2_name",
    ]
)

# Create dictionary of country names to ISO3 codes
country_to_iso3 = {
    "Ethiopia": "ETH",
    "Malawi": "MWI",
    "Mali": "MLI",
    "Niger": "NER",
    "Nigeria": "NGA",
    "Tanzania": "TZA",
    "Uganda": "UGA",
}

# Create new ADM0 column using map
gdf["ADM0"] = gdf["country"].map(country_to_iso3)

new_cols = ["country", "ADM2"] + [
    col for col in gdf.columns if col not in ["country",  "ADM2"]
]
gdf = gdf[new_cols]

gdf

One or more strings in the dta file could not be decoded using utf-8, and
so the fallback encoding of latin-1 is being used.  This can happen when a file
has been incorrectly encoded by Stata or some other software. You should verify
the string values returned are correct.
  df = pd.read_stata(os.path.join(data_dir, "Plotcrop_dataset.dta"))


Unnamed: 0,country,ADM2,crop_name,pw,urban,admin_3_name,parcel_id_obs,parcel_id_merge,plot_id_obs,plot_id_merge,...,seed_kg,seed_value_LCU,seed_value_USD,improved,used_pesticides,crop_shock,pests_shock,rain_shock,drought_shock,flood_shock
67387,Ethiopia,North Western,SORGHUM,1658.475109,No,,1000001.0,1000002-1,1052558.0,01010108880091000701-1-2,...,3.00,21.416667,0.845471,No,No,Yes,,1.0,0.0,No
67388,Ethiopia,North Western,MAIZE,1658.475109,No,,1000001.0,1000002-1,1052559.0,01010108880091000701-1-3,...,0.75,5.812500,0.229461,No,No,No,0.0,0.0,0.0,No
67389,Ethiopia,North Western,SORGHUM,1658.475109,No,,1000002.0,1000003-1,1052560.0,01010108880091001701-1-2,...,1.00,7.138889,0.281824,No,No,Yes,0.0,0.0,0.0,No
67390,Ethiopia,North Western,RED PEPPER,1658.475109,No,,1000002.0,1000003-1,1052561.0,01010108880091001701-1-3,...,,,,No,No,No,0.0,0.0,0.0,No
67391,Ethiopia,North Western,SORGHUM,1658.475109,No,,1000003.0,1000003-2,1052562.0,01010108880091001701-2-1,...,3.00,39.000000,1.539612,Yes,No,Yes,0.0,0.0,0.0,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
477209,Uganda,,BEANS,,,,7028029.0,d10a687889de469687377204195f3db0-1,7077535.0,d10a687889de469687377204195f3db0-1-1,...,5.00,,,No,No,,,,,
477210,Uganda,,MAIZE,,,,7028029.0,d10a687889de469687377204195f3db0-1,7077535.0,d10a687889de469687377204195f3db0-1-1,...,0.50,,,No,No,,,,,
477211,Uganda,,POTATOES,,,,7028029.0,d10a687889de469687377204195f3db0-1,7077536.0,d10a687889de469687377204195f3db0-1-2,...,15.00,,,No,No,,,,,
477212,Uganda,,POTATOES,,,,7028029.0,d10a687889de469687377204195f3db0-1,7077537.0,d10a687889de469687377204195f3db0-1-3,...,60.00,6000.000000,1.690036,No,No,,,,,


In [170]:
gdf.columns

Index(['country', 'ADM2', 'crop_name', 'pw', 'urban', 'admin_3_name',
       'parcel_id_obs', 'parcel_id_merge', 'plot_id_obs', 'plot_id_merge',
       'harvest_end_month', 'planting_month', 'harvest_kg',
       'harvest_value_LCU', 'harvest_value_USD', 'seed_kg', 'seed_value_LCU',
       'seed_value_USD', 'improved', 'used_pesticides', 'crop_shock',
       'pests_shock', 'rain_shock', 'drought_shock', 'flood_shock'],
      dtype='object')

In [144]:
# Convert Yes/No to 1/0 (only for the columns present in the loaded dataset)
binary_cols = [col for col in ["urban", "hh_electricity_access"] if col in gdf.columns]
for col in binary_cols:
    gdf[col] = (gdf[col] == "Yes").astype(int)

# Identify numeric columns to aggregate
numeric_cols = gdf.select_dtypes(include=["float64", "int64"]).columns
numeric_cols = [
    col for col in numeric_cols if col != "pw" and col != "ADM0"
]  # exclude weight column

# Create weighted means by ADM2
summary = []
for adm2 in gdf["ADM2"].dropna().unique():
    subset = gdf[gdf["ADM2"] == adm2]

    # Skip if subset is empty
    if len(subset) == 0:
        continue

    country = subset["country"].iloc[0]
    # adm0 = subset["ADM0"].iloc[0]

    # Calculate weighted means for each numeric column
    weighted_means = {
        "country": country,
        # "ADM0": adm0,
        "ADM2": adm2,
        "n_households": len(subset),
    }

    for col in numeric_cols:
        # Drop rows where either the value or the survey weight is missing
        mask = subset[col].notna() & subset["pw"].notna()
        # np.average raises ZeroDivisionError when the weights sum to zero
        if mask.any() and subset.loc[mask, "pw"].sum() > 0:
            weighted_means[col] = np.average(
                subset.loc[mask, col], weights=subset.loc[mask, "pw"]
            )
        else:
            weighted_means[col] = np.nan

    summary.append(weighted_means)

# Convert to DataFrame
summary_df = pd.DataFrame(summary)

# Reorder columns to match original
new_cols = ["country", "ADM2", "n_households"] + [
    col for col in numeric_cols if col in summary_df.columns
]
summary_df = summary_df[new_cols]
summary_df

Unnamed: 0,country,ADM2,n_households,urban,hh_size,hh_shock,hh_primary_education,hh_electricity_access,hh_dependency_ratio,hh_formal_education,nonfarm_enterprise,nb_fallow_plots,nb_plots,share_kg_sold,totcons_LCU,totcons_USD,cons_quint,hh_asset_index,HDDS
0,Ethiopia,North Western,122,0.276652,3.803650,0.368085,0.514375,0.857257,0.819882,0.916196,0.324468,0.041033,2.635956,0.138981,1.656113e+04,653.787492,2.415107,-0.008282,6.396108
1,Ethiopia,Central,171,0.255789,4.031828,0.283807,0.501333,0.804692,0.839186,0.809109,0.234798,0.126336,4.327916,0.175077,1.602444e+04,632.600549,2.532844,0.150875,6.832208
2,Ethiopia,Eastern\n,136,0.335196,3.958594,0.372890,0.556226,0.838756,0.835395,0.873949,0.382285,0.102907,3.782932,0.049926,2.058052e+04,812.462147,2.868741,0.085622,6.941185
3,Ethiopia,Southern,220,0.481401,4.241622,0.492621,0.515626,0.779828,0.917596,0.878757,0.231299,0.026740,2.531305,0.143772,2.409702e+04,951.283828,3.030861,0.023783,6.906992
4,Ethiopia,Western,48,0.303788,4.450050,0.123167,0.446696,0.686888,0.716534,0.891968,0.445862,0.142143,3.411326,0.366706,1.712208e+04,675.932550,2.563057,0.619005,6.902124
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
616,Uganda,KWEEN,20,0.000000,7.600000,0.100000,1.000000,0.000000,1.160000,1.000000,0.200000,,,0.508504,2.121615e+06,597.601048,3.800000,-0.357655,9.000000
617,Uganda,RUKIGA,18,0.000000,5.095238,0.047619,0.809524,0.000000,1.539683,1.000000,0.000000,,,0.187077,1.056080e+06,297.468872,2.666667,-0.445230,6.523810
618,Uganda,NWOYA,2,0.000000,7.000000,0.000000,1.000000,0.000000,1.333333,1.000000,0.000000,,,,1.090914e+06,307.280640,3.000000,0.292174,10.000000
619,Uganda,PALISA,2,0.000000,3.000000,0.000000,1.000000,0.000000,0.500000,1.000000,0.000000,,,0.000000,1.958689e+06,551.709412,4.000000,-0.438135,10.000000


In [158]:
summary_df['ADM2'] = summary_df['ADM2'].str.title()
feats2['ADM2'] = feats2['ADM2'].str.title()

summary_df['ADM2'] = summary_df['ADM2'].str.replace(' Rural', '', regex=False).str.strip()
feats2['ADM2'] = feats2['ADM2'].str.replace(' Rural', '', regex=False).str.strip()

# Create a dictionary of corrections
corrections = {
    "Birni N'Konni": "Bkonni",
    "Butiama": "Butiam",
    "Guidan-Roumdji": "Guidan Roumji",
    "Illéla": "Illela",
    "Matamèye": "Matameye",
    "Tchin-Tabaraden": "Tchintabaraden",
    "Tillaberi": "Tillabéri",
    "Maïné-Soroa": "Maïné Soroa",
    "Abia": "Abi",
    "Anambra East": "Anambra",
    "Babati Urban": "Babati",
    "Kigoma Ujiji Urban": "Kigoma Urban",
    "Niamey1": "Niamey",
    "Babati Town": "Babati Urban",
    "Kahama Town": "Kahama Township Authority",
    "Korogwe Urban": "Korogwe Township Authority",
    "Mafinga Town": "Mafinga Township Authority",
    "Masasi Urban": "Masasi Township Authority",
    "Mtwara Mikindani": "Mtwara Urban",
    "Nzega Town": "Nzega Township Authority",
    "Kasulu Town": "Kasulu Township Authority",
    "Chakechake": "Chake Chake",
    "Fct Abuja": "Abuja Municipal",
    "Handeni Mji": "Handeni Urban",
    "Makambako Town": "Makambako Township Authority",
    "Chakechake": "Chake Chake",  # In case of variations without capitalization
    "Masasi  Township Authority": "Masasi Township Authority" 
}

# Apply corrections to both dataframes
for old_name, new_name in corrections.items():
    summary_df['ADM2'] = summary_df['ADM2'].replace(old_name, new_name)
    feats2['ADM2'] = feats2['ADM2'].replace(old_name, new_name)

In [159]:
# Get sets of unique values from each dataset
set1 = set(summary_df['ADM2'].unique())
set2 = set(feats2['ADM2'].unique())

# Find values that appear in one dataset but not the other
only_in_summary = set1 - set2
only_in_feats = set2 - set1

print("Values only in summary_df:", sorted(only_in_summary))
print("Values only in feats2:", sorted(only_in_feats))

Values only in summary_df: ['Adamawa', 'Akwa Ibom', 'Bariadi Town', 'Bayelsa', 'Benue', 'Blantyre City', 'Borno', 'Buchosa', 'Bugweri', 'Bumbuli', 'Bunyangabu', 'Busokelo', 'Butebo', 'Chalinze', 'Cross River', 'Delta', 'Edo', 'Enugu', 'Gaya {Added May/04, 07}', 'Geita Town', 'Ifakara Urban', 'Imo', 'Itigi', 'Jigawa', 'Kaduna', 'Kano', 'Kapelebyong', 'Kaskazini Ï¿½Aï¿½', 'Kaskazini Ï¿½Bï¿½', 'Kassanda', 'Kebbi', 'Kibiti', 'Kigamboni', 'Kigoma Urban', 'Kikuube', 'Kondoa Urban', 'Kwania', 'Kwara', 'Kyotera', 'Lagos', 'Lilongwe City', 'Madaba', 'Malinyi', 'Mbalali', 'Mbulu Town', 'Mpimbwe', 'Msalala', 'Mwanza Urban', 'Mzuzu City', 'Nabilatuk', 'Namisindwa', 'Nanyamba Town', 'Neno', 'Niger', 'Njombe Town', 'Nsimbo', 'Nzega Township Authority', 'Ogun', 'Ondo', 'Osun', 'Oyo', 'Pakwach', 'Palisa', 'Plateau', 'Rivers', 'Rukiga', 'Sokoto', 'Songwe', 'Ssembabule', 'Taraba', 'Tarime Town', 'Ubungo', 'Ushetu', 'Yobe', 'Zamfara', 'Zomba City']
Values only in feats2: ['Aba North', 'Aba South', 'Abada

In [160]:
sorted(only_in_summary)

['Adamawa',
 'Akwa Ibom',
 'Bariadi Town',
 'Bayelsa',
 'Benue',
 'Blantyre City',
 'Borno',
 'Buchosa',
 'Bugweri',
 'Bumbuli',
 'Bunyangabu',
 'Busokelo',
 'Butebo',
 'Chalinze',
 'Cross River',
 'Delta',
 'Edo',
 'Enugu',
 'Gaya {Added May/04, 07}',
 'Geita Town',
 'Ifakara Urban',
 'Imo',
 'Itigi',
 'Jigawa',
 'Kaduna',
 'Kano',
 'Kapelebyong',
 'Kaskazini Ï¿½Aï¿½',
 'Kaskazini Ï¿½Bï¿½',
 'Kassanda',
 'Kebbi',
 'Kibiti',
 'Kigamboni',
 'Kigoma Urban',
 'Kikuube',
 'Kondoa Urban',
 'Kwania',
 'Kwara',
 'Kyotera',
 'Lagos',
 'Lilongwe City',
 'Madaba',
 'Malinyi',
 'Mbalali',
 'Mbulu Town',
 'Mpimbwe',
 'Msalala',
 'Mwanza Urban',
 'Mzuzu City',
 'Nabilatuk',
 'Namisindwa',
 'Nanyamba Town',
 'Neno',
 'Niger',
 'Njombe Town',
 'Nsimbo',
 'Nzega Township Authority',
 'Ogun',
 'Ondo',
 'Osun',
 'Oyo',
 'Pakwach',
 'Palisa',
 'Plateau',
 'Rivers',
 'Rukiga',
 'Sokoto',
 'Songwe',
 'Ssembabule',
 'Taraba',
 'Tarime Town',
 'Ubungo',
 'Ushetu',
 'Yobe',
 'Zamfara',
 'Zomba City']

In [161]:
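# Note: this joins on the ADM2 name alone, so identically named districts in
# different countries would be matched to each other.
data = summary_df.merge(feats2, on=["ADM2"], how="left")
# dropna() without a subset drops rows with a missing value in any column
data = data.dropna()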
data

Unnamed: 0,country,ADM2,n_households,urban,hh_size,hh_shock,hh_primary_education,hh_electricity_access,hh_dependency_ratio,hh_formal_education,...,X_3991,X_3992,X_3993,X_3994,X_3995,X_3996,X_3997,X_3998,X_3999,geometry
0,Ethiopia,North Western,122,0.276652,3.803650,0.368085,0.514375,0.857257,0.819882,0.916196,...,0.069334,0.353199,0.179662,0.083176,0.450013,1.269864,0.370094,0.237892,0.060540,"MULTIPOLYGON (((38.67728 13.48406, 38.69115 13..."
1,Ethiopia,Central,171,0.255789,4.031828,0.283807,0.501333,0.804692,0.839186,0.809109,...,0.099348,0.448303,0.240759,0.134295,0.683113,1.646239,0.502437,0.383884,0.074073,"POLYGON ((39.02647 14.6334, 39.09872 14.63401,..."
2,Ethiopia,Eastern,136,0.335196,3.958594,0.372890,0.556226,0.838756,0.835395,0.873949,...,0.085033,0.422713,0.229910,0.156833,0.804397,1.676661,0.546421,0.475549,0.059438,"POLYGON ((39.63232 14.60316, 39.63363 14.60134..."
3,Ethiopia,Southern,220,0.481401,4.241622,0.492621,0.515626,0.779828,0.917596,0.878757,...,0.061458,0.317047,0.162198,0.087990,0.474245,1.211935,0.357866,0.265503,0.048680,"POLYGON ((39.03722 13.17052, 39.05332 13.20198..."
4,Ethiopia,Western,48,0.303788,4.450050,0.123167,0.446696,0.686888,0.716534,0.891968,...,0.049111,0.294488,0.121071,0.030551,0.158767,0.927309,0.153904,0.058958,0.046716,"POLYGON ((36.49591 13.84078, 36.5001 13.8474, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,Tanzania,Ileje,7,0.000000,4.232980,0.807950,0.680776,0.556544,1.131875,0.680776,...,0.030823,0.188179,0.137952,0.086227,0.661212,0.956804,0.554999,0.384431,0.047879,"POLYGON ((33.64228 -9.60334, 33.64187 -9.60376..."
501,Tanzania,Gairo,22,0.193691,5.169722,0.837561,0.826886,0.634454,1.036614,1.000000,...,0.049609,0.236664,0.162240,0.132706,0.940240,1.176850,0.621916,0.693314,0.045240,"POLYGON ((36.74822 -6.55437, 36.74748 -6.55018..."
502,Tanzania,Muheza,11,0.000000,4.665326,0.274113,0.914893,0.196983,1.055869,1.000000,...,0.059158,0.311477,0.163569,0.064590,0.274113,1.036725,0.255092,0.144512,0.057595,"POLYGON ((38.63417 -5.37011, 38.63425 -5.37011..."
503,Tanzania,Makambako Township Authority,6,1.000000,2.824868,1.000000,1.000000,0.651383,0.321994,1.000000,...,0.048684,0.247128,0.177187,0.142715,0.908591,1.192264,0.619265,0.623635,0.043042,"POLYGON ((34.85398 -8.76906, 34.85767 -8.77361..."


In [162]:
from sklearn.linear_model import RidgeCV
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, GridSearchCV

In [163]:
feature_cols = [f"X_{i}" for i in range(4000)]
# feature_cols

In [164]:
data.columns[3:25]

Index(['urban', 'hh_size', 'hh_shock', 'hh_primary_education',
       'hh_electricity_access', 'hh_dependency_ratio', 'hh_formal_education',
       'nonfarm_enterprise', 'nb_fallow_plots', 'nb_plots', 'share_kg_sold',
       'totcons_LCU', 'totcons_USD', 'cons_quint', 'hh_asset_index', 'HDDS',
       'shapeID', 'ADM0', 'X_0', 'X_1', 'X_2', 'X_3'],
      dtype='object')

In [165]:
X = data[feature_cols].values
y = data["nb_plots"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [166]:
alphas = np.logspace(-8, 8, base=10, num=17)
ridge = RidgeCV(alphas=alphas, scoring="r2", cv=5)

ridge.fit(X_train, y_train)

y_pred = np.maximum(ridge.predict(X_test), 0)

r2 = r2_score(y_test, y_pred)

print(f"Best alpha: {ridge.alpha_}")
print(f"Validation R2 performance {ridge.best_score_:0.2f}")
print(f"Test R2 performance {r2:.4f}")

Best alpha: 1.0
Validation R2 performance 0.29
Test R2 performance 0.4801


In [138]:
# alphas = [0.01, 0.1, 1.0, 10.0]
# ridge = RidgeClassifierCV(alphas=alphas, cv=5, scoring="roc_auc")

# ridge.fit(X_train, y_train)

**3. Model Evaluation:**

* The trained model is evaluated using the testing set, which was not used during training.
* The primary evaluation metric is the Receiver Operating Characteristic Area Under the Curve (ROC AUC). This metric measures the model's ability to distinguish between positive (internet presence) and negative (internet absence) cases.
* The ROC curve is visualized to provide a graphical representation of the model's performance across different thresholds.
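
To make the metric concrete, the sketch below computes ROC AUC on a tiny hand-made example in two ways: directly from its pairwise definition (the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as half) and with `roc_auc_score`. The labels and scores are invented for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels (1 = internet present) and model scores
y_true_demo = np.array([0, 0, 1, 1, 0, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Pairwise definition: compare every positive score with every negative score
pos_scores = scores_demo[y_true_demo == 1]
neg_scores = scores_demo[y_true_demo == 0]
diff = pos_scores[:, None] - neg_scores[None, :]
auc_pairwise = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Same quantity from scikit-learn
auc_sklearn = roc_auc_score(y_true_demo, scores_demo)

# The ROC curve traces (FPR, TPR) as the decision threshold is lowered
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true_demo, scores_demo)

# 8 of the 9 positive/negative pairs are ranked correctly, so AUC = 8/9
print(f"Pairwise AUC: {auc_pairwise:.3f}")
print(f"sklearn AUC:  {auc_sklearn:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the diagonal is drawn as the reference line on an ROC plot.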


In [None]:
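# Get uncalibrated predictions (decision function).
# Note: this needs a fitted classifier such as the RidgeClassifierCV in the
# commented-out cell above; RidgeCV (a regressor) has no decision_function.
y_pred_uncal = ridge.decision_function(X_test)

# Fit isotonic calibration.
# Note: the calibrator is fit on the test set here, so the calibrated scores
# below are not an out-of-sample estimate.
iso_reg = IsotonicRegression(out_of_bounds="clip")
iso_reg.fit(y_pred_uncal, y_test)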

# Get calibrated predictions
y_pred_cal = iso_reg.predict(y_pred_uncal)

# Calculate AUC scores
auc_uncal = roc_auc_score(y_test, y_pred_uncal)
auc_cal = roc_auc_score(y_test, y_pred_cal)

print(f"Best alpha: {ridge.alpha_}")
print(f"AUC before calibration: {auc_uncal:.3f}")
print(f"AUC after calibration: {auc_cal:.3f}")

In [None]:
# Calculate ROC curves
fpr_uncal, tpr_uncal, _ = roc_curve(y_test, y_pred_uncal)
fpr_cal, tpr_cal, _ = roc_curve(y_test, y_pred_cal)

# Create the plot
plt.figure(figsize=(5, 5))

# Plot both curves
plt.plot(fpr_uncal, tpr_uncal, "b-", label=f"Uncalibrated (AUC = {auc_uncal:.3f})")
plt.plot(fpr_cal, tpr_cal, "r-", label=f"Calibrated (AUC = {auc_cal:.3f})")

# Plot the diagonal reference line
plt.plot([0, 1], [0, 1], "k--", label="Random")

# Customize the plot
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves - Before and After Calibration")
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)

# Set the plot limits
plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])

plt.tight_layout()
plt.show()

In [None]:
features = pd.read_feather(os.path.join(local_dir, "TGO_API_features.feather"))

features = gpd.GeoDataFrame(
    features, geometry=gpd.points_from_xy(features.Lon, features.Lat), crs="EPSG:4326"
)
features.geometry = features.geometry.buffer(0.005, cap_style=3)

features

In [None]:
y_pred_full = ridge.decision_function(features[feature_cols].values)

# y_pred_full_cal = iso_reg.predict(y_pred_full)

features["predicted_probability"] = y_pred_full
features

In [None]:
features[features.predicted_probability > 0].plot(
    column="predicted_probability",
    cmap="viridis",
    legend=True,
    legend_kwds={"label": "Predicted Probability"},
    figsize=(10, 8),
)
plt.title("Prediction")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()

## Conclusions
This notebook demonstrated the use of MOSAIKS to predict internet access in Togo using satellite imagery features.

**Key Findings:**  

- The model achieved a good level of predictive performance, indicated by the ROC AUC score.
- Satellite imagery features proved to be valuable predictors of internet access, highlighting their potential for understanding the digital divide.

**Implications and Future Directions:**

- The insights gained from this analysis can inform policy decisions and interventions aimed at expanding internet access in Togo.
- Further research could explore the use of other machine learning algorithms and incorporate additional data sources to enhance predictive accuracy.
- This approach can be extended to other regions and countries to address the global digital divide.

**Limitations:**

- The model's performance might vary in different geographical contexts.
- The accuracy of the predictions depends on the quality and availability of satellite imagery and label data.
- Further validation and ground-truthing are needed to assess the real-world applicability of the model.


Overall, this demonstration highlights the potential of MOSAIKS and machine learning in bridging the digital divide by providing insights into internet access patterns using satellite imagery.