# MOSAIKS Demonstration: Identifying Internet Access in Togo

> **Note**: To keep your changes, click `File > Save a copy in Drive` before closing this tab; otherwise they will be lost. Once you have saved a copy, the edits you made beforehand are preserved in it. Data is downloaded from the internet to the environment's temporary disk and is deleted when the session ends.

This notebook demonstrates the application of Multi-task Observation using Satellite Imagery & Kitchen Sinks (MOSAIKS), a machine learning approach utilizing satellite imagery, to identify areas with and without internet access in Togo. This information can be crucial for understanding the digital divide and informing policies to improve connectivity.


## Methodology

The process involves the following steps:

1. **Data Acquisition:** Label data on internet access is provided by the Agence Togo Digital (ATD). Pre-processed satellite imagery features are derived using the MOSAIKS framework.
1. **Data Preprocessing:** The label and feature data are preprocessed and cleaned. This leaves a single dataset with labels and features that is ready to use.
1. **Model Training and Evaluation:** A machine learning model, specifically Ridge Regression with isotonic calibration, is trained on the joined dataset to predict the likelihood of internet access. The model is evaluated using metrics like the Receiver Operating Characteristic Area Under the Curve (ROC AUC).
1. **Visualization and Interpretation:** The model results are visualized using maps and plots. Important features identified by the model are analyzed to understand the patterns associated with internet access.
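
As a rough illustration of step 3, the sketch below fits a ridge regression to a binary label, calibrates the raw scores with isotonic regression, and reports ROC AUC. The data are synthetic, and the names `X`, `y`, and `calibrator` are placeholders rather than the notebook's actual columns. It is one plausible reading of "Ridge Regression with isotonic calibration", not necessarily the exact pipeline used for the Togo analysis. For brevity the calibrator is fit on the training scores; a held-out calibration split would usually be preferable.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for MOSAIKS features and a binary access label
X = rng.normal(size=(500, 20))
true_w = rng.normal(size=20)
y = (X @ true_w + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Ridge regression on the 0/1 label yields an uncalibrated score
ridge = RidgeCV(alphas=np.logspace(-3, 3, 7))
ridge.fit(X_train, y_train)

# Isotonic regression maps the raw scores monotonically into [0, 1]
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(ridge.predict(X_train), y_train)

proba = calibrator.predict(ridge.predict(X_test))
auc = roc_auc_score(y_test, proba)
print(f"Test ROC AUC: {auc:.3f}")
```

Because isotonic regression is monotone, it changes the ranking of the scores only by introducing ties, so ROC AUC shifts little; the main benefit is that the outputs can be read as probabilities.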


## Setup

This notebook utilizes several key Python libraries:

- **Data Handling and Analysis:** `pandas` and `numpy` provide fundamental data structures and functions for manipulating and analyzing data. `geopandas` extends these capabilities to work with geospatial data.
- **Visualization:** `matplotlib.pyplot` and `seaborn` enable the creation of static and interactive visualizations, aiding in data exploration and result presentation.
- **Machine Learning:** `sklearn` offers a comprehensive suite of tools for building and evaluating machine learning models, including algorithms like Ridge Regression and methods for data preprocessing.


In [1]:
import os
import shutil
import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt

# from google.colab import drive
from sklearn.linear_model import RidgeCV
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, GridSearchCV

In [2]:
import warnings
from scipy.linalg import LinAlgWarning

warnings.filterwarnings(
    "ignore", category=LinAlgWarning, module="sklearn.linear_model._ridge"
)

In [None]:
# Define the directory name
data_dir = "LSMS-ISA-data"

# Check if the final directory exists
if not os.path.exists(data_dir):
    # Download only if the zip file doesn't exist
    if not os.path.exists("Data.zip"):
        !wget https://zenodo.org/records/14040658/files/Data.zip

    # Unzip the data
    !unzip Data.zip

    # Rename the folder
    !mv Data {data_dir}

    # Remove the zip file
    !rm Data.zip

# List the files (this will run regardless)
!ls -lh {data_dir}

In [None]:
# Define the directory and base URL
geo_dir = "geoBoundaries"
base_url = "https://github.com/wmgeolab/geoBoundaries/raw/9469f09/releaseData/gbOpen"

# List of country codes
countries = ["ETH", "MWI", "MLI", "NER", "NGA", "TZA", "UGA"]

# Create directory if it doesn't exist
os.makedirs(geo_dir, exist_ok=True)

# Download files for each country if they don't exist
for country in countries:
    filename = f"geoBoundaries-{country}-ADM2.geojson"
    filepath = os.path.join(geo_dir, filename)

    if not os.path.exists(filepath):
        !wget {base_url}/{country}/ADM2/{filename} -P {geo_dir}

# List the files (this will run regardless)
!ls -lh {geo_dir}

In [None]:
# Uncomment on the first run to download the boundaries file that is read below
# !wget https://www.geoboundaries.org/data/geoBoundariesCGAZ-3_0_0/ADM2/simplifyRatio_100/geoBoundariesCGAZ_ADM2.topojson

boundaries = gpd.read_file("geoBoundariesCGAZ_ADM2.topojson")
mask = boundaries["shapeID"].str[:3].isin(countries)
boundaries = boundaries[mask]
boundaries = boundaries.rename(
    columns={
        "shapeName": "ADM2",
        "shapeGroup": "ADM0",
    }
)
# Keep only the columns needed downstream; this also discards the unused ID columns
boundaries = boundaries[["shapeID", "ADM0", "ADM2", "geometry"]]
boundaries = boundaries.set_crs("EPSG:4326", allow_override=True)
boundaries.head()

In [6]:
# feats = pd.read_csv(
#     os.path.join(
#         "/",
#         "home",
#         "emlab",
#         "data",
#         "mosaiks-togo",
#         "API_features",
#         "ADM_2_regions_RCF_global_dense.csv",
#     )
# )
# mask = feats['shapeID'].str[:3].isin(countries)
# feats = feats[mask]
# feats = feats.drop(columns=["shapeID.1"], errors="ignore")
# features = boundaries.set_index("shapeID").join(feats.set_index("shapeID"), how="left")
# features = features.reset_index()
# new_cols = [col for col in features.columns if col != 'geometry'] + ['geometry']
# features = features[new_cols]
# features.crs = "EPSG:4326"
# features.to_feather("adm2_api_features.feather")
# features.head()

In [None]:
feats2 = gpd.read_feather("adm2_api_features.feather")
feats2.head()

In [None]:
df = pd.read_stata(os.path.join(data_dir, "Plotcrop_dataset.dta"))
df

In [None]:
df.columns

In [None]:
# df = pd.read_stata(os.path.join(data_dir, "Household_dataset.dta"))
df = pd.read_stata(os.path.join(data_dir, "Plotcrop_dataset.dta"))
# Create a conditions dictionary mapping countries to their desired waves
wave_conditions = {
    "Ethiopia": 4,
    "Malawi": 4,
    "Mali": 2,
    "Niger": 2,
    "Nigeria": 3,
    "Tanzania": 5,
    "Uganda": 7,
}

# Create the filter using boolean indexing
df = df[
    df.apply(lambda x: x["wave"] == wave_conditions.get(x["country"], False), axis=1)
]

gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.lon_modified, df.lat_modified), crs="EPSG:4326"
)

gdf = gdf.drop(
    columns=[
        "admin_2",
        "admin_3",
        "hh_id_merge",
        "hh_id_obs",
        "season",
        "ea_id_merge",
        "ea_id_obs",
        "strataid",
        "geocoords_id",
    ]
)

gdf["admin_2_name"] = gdf["admin_2_name"].replace("", np.nan)

# Keep only rows that have a usable point geometry
mask_coords = ~gdf["geometry"].is_empty
gdf_with_coords = gdf[mask_coords].copy()

joined_data = gpd.sjoin(
    gdf_with_coords, boundaries[["ADM2", "geometry"]], how="left", predicate="within"
)

gdf["admin_2_joined"] = None

gdf.loc[joined_data.index, "admin_2_joined"] = joined_data["ADM2"]

gdf["ADM2"] = gdf["admin_2_name"].fillna(gdf["admin_2_joined"])

gdf = gdf.drop(
    columns=[
        "wave",
        "admin_2_joined",
        "geometry",
        "lat_modified",
        "lon_modified",
        "admin_1",
        "admin_1_name",
        "admin_2_name",
    ]
)

# Create dictionary of country names to ISO3 codes
country_to_iso3 = {
    "Ethiopia": "ETH",
    "Malawi": "MWI",
    "Mali": "MLI",
    "Niger": "NER",
    "Nigeria": "NGA",
    "Tanzania": "TZA",
    "Uganda": "UGA",
}

# Create new ADM0 column using map
gdf["ADM0"] = gdf["country"].map(country_to_iso3)

new_cols = ["country", "ADM2"] + [
    col for col in gdf.columns if col not in ["country", "ADM2"]
]
gdf = gdf[new_cols]

gdf
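
The row-wise `apply` used above to keep one survey wave per country can be slow on large files. A vectorized alternative is sketched here on an invented toy frame: each row's country is mapped to its target wave and compared against the `wave` column. The `astype(object)` call is a precaution, assuming `country` may arrive as a categorical column from `pd.read_stata`.

```python
import pandas as pd

wave_conditions = {"Ethiopia": 4, "Mali": 2}

# Toy survey rows (invented values, not taken from the dataset)
toy = pd.DataFrame(
    {
        "country": ["Ethiopia", "Ethiopia", "Mali", "Mali", "Ghana"],
        "wave": [3, 4, 2, 1, 2],
    }
)

# Look up each row's target wave; unlisted countries become NaN and never match
target_wave = toy["country"].astype(object).map(wave_conditions)
filtered = toy[toy["wave"] == target_wave]
print(filtered)
```

Only the Ethiopia wave-4 row and the Mali wave-2 row survive; the Ghana row is dropped because it has no entry in `wave_conditions`.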

In [None]:
gdf.columns

In [None]:
# Convert Yes/No to 1/0
binary_cols = ["urban", "hh_electricity_access"]
for col in binary_cols:
    gdf[col] = (gdf[col] == "Yes").astype(int)

# Identify numeric columns to aggregate
numeric_cols = gdf.select_dtypes(include=["float64", "int64"]).columns
numeric_cols = [
    col for col in numeric_cols if col != "pw" and col != "ADM0"
]  # exclude weight column

# Create weighted means by ADM2
summary = []
for adm2 in gdf["ADM2"].dropna().unique():
    subset = gdf[gdf["ADM2"] == adm2]

    # Skip if subset is empty
    if len(subset) == 0:
        continue

    country = subset["country"].iloc[0]
    # adm0 = subset["ADM0"].iloc[0]

    # Calculate weighted means for each numeric column
    weighted_means = {
        "country": country,
        # "ADM0": adm0,
        "ADM2": adm2,
        "n_households": len(subset),
    }

    for col in numeric_cols:
        # Remove NaN values before calculating weighted average
        mask = ~subset[col].isna()
        if mask.any():  # if there are any non-NaN values
            weighted_means[col] = np.average(
                subset[col][mask], weights=subset["pw"][mask]
            )
        else:
            weighted_means[col] = np.nan

    summary.append(weighted_means)

# Convert to DataFrame
summary_df = pd.DataFrame(summary)

# Reorder columns to match original
new_cols = ["country", "ADM2", "n_households"] + [
    col for col in numeric_cols if col in summary_df.columns
]
summary_df = summary_df[new_cols]
summary_df
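
The aggregation above is a survey-weighted mean per district that skips missing values. The toy example below isolates that calculation so it can be checked by hand. The helper `weighted_mean` is hypothetical, and the data are invented; only the column names `ADM2`, `pw`, and `nb_plots` come from the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "ADM2": ["A", "A", "B", "B", "B"],
        "pw": [1.0, 3.0, 2.0, 2.0, 1.0],
        "nb_plots": [2.0, 4.0, 1.0, np.nan, 3.0],
    }
)


def weighted_mean(values, weights):
    """Weighted mean that ignores rows where the value is missing."""
    mask = values.notna()
    if not mask.any():
        return np.nan
    return np.average(values[mask], weights=weights[mask])


# One weighted mean per district
result = {
    adm2: weighted_mean(group["nb_plots"], group["pw"])
    for adm2, group in toy.groupby("ADM2")
}
print(result)
```

District A gives (2·1 + 4·3) / 4 = 3.5, and district B gives (1·2 + 3·1) / 3 ≈ 1.67 because the row with the missing value is excluded together with its weight. Note that `np.average` raises an error if the remaining weights sum to zero.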

In [158]:
summary_df["ADM2"] = summary_df["ADM2"].str.title()
feats2["ADM2"] = feats2["ADM2"].str.title()

summary_df["ADM2"] = summary_df["ADM2"].str.replace(" Rural", "").str.strip()
feats2["ADM2"] = feats2["ADM2"].str.replace(" Rural", "").str.strip()

# Create a dictionary of corrections (old spelling -> harmonized spelling)
corrections = {
    "Birni N'Konni": "Bkonni",
    "Butiama": "Butiam",
    "Guidan-Roumdji": "Guidan Roumji",
    "Illéla": "Illela",
    "Matamèye": "Matameye",
    "Tchin-Tabaraden": "Tchintabaraden",
    "Tillaberi": "Tillabéri",
    "Maïné-Soroa": "Maïné Soroa",
    "Abia": "Abi",
    "Anambra East": "Anambra",
    "Babati Urban": "Babati",
    "Kigoma Ujiji Urban": "Kigoma Urban",
    "Niamey1": "Niamey",
    "Babati Town": "Babati Urban",
    "Kahama Town": "Kahama Township Authority",
    "Korogwe Urban": "Korogwe Township Authority",
    "Mafinga Town": "Mafinga Township Authority",
    "Masasi Urban": "Masasi Township Authority",
    "Mtwara Mikindani": "Mtwara Urban",
    "Nzega Town": "Nzega Township Authority",
    "Kasulu Town": "Kasulu Township Authority",
    "Chakechake": "Chake Chake",
    "Fct Abuja": "Abuja Municipal",
    "Handeni Mji": "Handeni Urban",
    "Makambako Town": "Makambako Township Authority",
    "Masasi  Township Authority": "Masasi Township Authority",  # collapse double space
}

# Apply corrections to both dataframes
for old_name, new_name in corrections.items():
    summary_df["ADM2"] = summary_df["ADM2"].replace(old_name, new_name)
    feats2["ADM2"] = feats2["ADM2"].replace(old_name, new_name)
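
Several of the manual corrections above only differ in accents, hyphens, or repeated spaces. A generic normalizer can remove that class of mismatch before any hand-written mapping is applied. The helper `normalize_name` below is hypothetical and uses only the standard library; it would need to be applied to both datasets, and it does not resolve genuinely different spellings such as "Birni N'Konni" versus "Bkonni".

```python
import unicodedata


def normalize_name(name):
    """Strip accents, turn hyphens into spaces, collapse whitespace, title-case."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = no_accents.replace("-", " ")
    return " ".join(cleaned.split()).title()


print(normalize_name("Illéla"))
print(normalize_name("Masasi  Township Authority"))
print(normalize_name("Maïné-Soroa"))
```

With pandas this could be applied as `summary_df["ADM2"].map(normalize_name)`, assuming the column contains no missing values.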

In [None]:
# Get sets of unique values from each dataset
set1 = set(summary_df["ADM2"].unique())
set2 = set(feats2["ADM2"].unique())

# Find values that appear in one dataset but not the other
only_in_summary = set1 - set2
only_in_feats = set2 - set1

print("Values only in summary_df:", sorted(only_in_summary))
print("Values only in feats2:", sorted(only_in_feats))

In [None]:
sorted(only_in_summary)

In [None]:
data = summary_df.merge(feats2, on=["ADM2"], how="left")
data = data.dropna()
data
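
Because the merge above is followed by `dropna()`, districts without a feature match disappear silently. One way to see what is lost is to pass `indicator=True` to `merge`, as in this sketch on invented toy frames. The `validate="many_to_one"` argument additionally raises an error if a district name appears more than once on the feature side, which could happen when the same ADM2 name exists in two countries and the join key is the name alone.

```python
import pandas as pd

# Toy label and feature tables (invented values)
labels = pd.DataFrame(
    {"ADM2": ["Babati", "Niamey", "Zomba"], "nb_plots": [2.1, 1.4, 3.0]}
)
features = pd.DataFrame(
    {"ADM2": ["Babati", "Niamey", "Kano"], "X_0": [0.1, 0.2, 0.3]}
)

merged = labels.merge(
    features, on="ADM2", how="left", indicator=True, validate="many_to_one"
)

# Rows flagged "left_only" have labels but no matching features
unmatched = merged.loc[merged["_merge"] == "left_only", "ADM2"].tolist()
print(f"{len(unmatched)} label rows without features: {unmatched}")
```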

In [163]:
feature_cols = [f"X_{i}" for i in range(4000)]
# feature_cols

In [None]:
data.columns[3:25]

In [165]:
X = data[feature_cols].values
y = data["nb_plots"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [None]:
alphas = np.logspace(-8, 8, base=10, num=17)
ridge = RidgeCV(alphas=alphas, scoring="r2", cv=5)

ridge.fit(X_train, y_train)

y_pred = np.maximum(ridge.predict(X_test), 0)

r2 = r2_score(y_test, y_pred)

print(f"Best alpha: {ridge.alpha_}")
print(f"Cross-validated R2 at best alpha: {ridge.best_score_:.4f}")
print(f"Test R2 (predictions clipped at 0): {r2:.4f}")
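
The cell above clips predictions at zero before scoring, presumably because the target cannot be negative. The sketch below computes R² by hand on invented numbers to show how the score is defined and how clipping can change it. The function `r2_manual` is a hypothetical helper that mirrors the standard definition used by `r2_score` for a single target.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = np.array([0.0, 1.0, 2.0, 3.0])
raw_pred = np.array([-1.0, 1.0, 2.0, 3.0])

# Same post-processing as in the notebook: negative predictions become 0
clipped_pred = np.maximum(raw_pred, 0)

print(r2_manual(y_true, raw_pred))
print(r2_manual(y_true, clipped_pred))
```

Here the total sum of squares is 5; the single negative prediction contributes a squared error of 1, giving R² = 0.8 before clipping and 1.0 after. Clipping toward the valid range can never make a prediction worse when the true values are all non-negative.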