# Malaria Incidence Disaggregation Notebook
# 
# This notebook demonstrates the process of disaggregating national malaria incidence data to the state level by combining multiple datasets. We use population data, Land Use Land Cover (LULC) data, and environmental data (temperature and rainfall) to compute weights that are then used to allocate national malaria cases proportionally to each state.
# 
# **Overview of Steps:**
# 
# 1. **Data Import:** Load population, administrative boundaries, LULC, and environmental data.
# 2. **Data Preparation:** Merge the population and environmental data into the state boundaries on a common key.
# 3. **LULC Weighting:** Assign weights based on LULC categories (urban, agricultural, forested, water bodies).
# 4. **Environmental Risk Calculation:** Compute a risk score from temperature and rainfall data, scaled by the LULC weight.
# 5. **Malaria Incidence Allocation:** Distribute national malaria incidence based on the computed weights.
# 6. **Results Visualization and Export:** Map the allocated cases and save the final disaggregated data.

In [None]:
# Import necessary libraries
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import rasterio
from rasterio.plot import show
from shapely.geometry import Point

## Step 1: Data Import
# 
# In this section, we load the various datasets:
# 
# - **Population Data:** A CSV file containing state-level population counts.
# - **Administrative Boundaries:** A shapefile with the boundaries of Nigeria's 36 states and the FCT.
# - **LULC Data:** Raster or vector data classifying regions (urban, agricultural, forested, water bodies).
# - **Environmental Data:** Temperature and rainfall data (could be in CSV, raster, or other formats).
# 
# *Note: Replace the file paths with the correct paths on your machine.*

In [None]:
# Load population data
pop_df = pd.read_csv("data/nigeria_population.csv")  
# Expected columns: ['State', 'Population']

# Load administrative boundaries (shapefile)
states_gdf = gpd.read_file("data/nigeria_states.shp")
# Ensure the state names in the shapefile match those in the population data

# Load LULC data (this can be a shapefile or raster; here we assume a shapefile)
lulc_gdf = gpd.read_file("data/nigeria_lulc.shp")
# Expected columns: ['State', 'LULC_Type'] where LULC_Type might be categories like 'Urban', 'Agricultural', etc.

# Load environmental data (e.g., average temperature and rainfall per state)
env_df = pd.read_csv("data/nigeria_environment.csv")
# Expected columns: ['State', 'Avg_Temperature', 'Avg_Rainfall']
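
# State names often differ slightly between sources (case, hyphens, a trailing "State", abbreviations), which makes later merges silently drop rows. The sketch below is one possible way to harmonize names before merging. The helper functions and the alias table are hypothetical and not part of the original workflow, and the name lists are toy stand-ins for `states_gdf["State"]` and `pop_df["State"]`.

```python
import unicodedata

# Hypothetical alias table: maps variant spellings to one canonical name.
STATE_ALIASES = {
    "fct": "federal capital territory",
    "abuja": "federal capital territory",
    "nassarawa": "nasarawa",
}

def normalize_state_name(name: str) -> str:
    """Return a lowercase, accent-free, whitespace-collapsed state name."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = " ".join(text.lower().replace("-", " ").split())
    if text.endswith(" state"):
        text = text[: -len(" state")]
    return STATE_ALIASES.get(text, text)

def unmatched_names(left_names, right_names):
    """Return names that appear on only one side after normalization."""
    left = {normalize_state_name(n) for n in left_names}
    right = {normalize_state_name(n) for n in right_names}
    return sorted(left - right), sorted(right - left)

# Toy inputs standing in for the two "State" columns.
shapefile_names = ["Akwa Ibom", "Nassarawa", "FCT", "Lagos"]
population_names = ["akwa-ibom", "Nasarawa State", "Federal Capital Territory", "Kano"]

only_in_shapefile, only_in_population = unmatched_names(shapefile_names, population_names)
print("Only in shapefile:", only_in_shapefile)
print("Only in population table:", only_in_population)
```

# With real data, the same function could be applied to each table, for example `pop_df["State"].map(normalize_state_name)`, before any merge on `State`.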

## Step 2: Data Preparation
# 
# We now ensure that all datasets have a common key (State) and merge the population and environmental data into the administrative boundaries GeoDataFrame.

In [None]:
# Merge population data into the states GeoDataFrame
states_gdf = states_gdf.merge(pop_df, on="State", how="left")
# Merge environmental data
states_gdf = states_gdf.merge(env_df, on="State", how="left")

# Check the head of the merged GeoDataFrame
states_gdf.head()
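
# A left merge keeps every state but leaves `NaN` wherever a key did not match, so it helps to audit the join. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on toy tables. The table contents are made up and stand in for the attribute table of `states_gdf` and for `pop_df`.

```python
import pandas as pd

# Toy tables standing in for the state boundaries and the population data.
states = pd.DataFrame({"State": ["Lagos", "Kano", "Rivers"]})
population = pd.DataFrame({
    "State": ["Lagos", "Kano", "Oyo"],
    "Population": [100, 200, 300],  # made-up values
})

# validate="one_to_one" raises MergeError if either table repeats a state;
# indicator=True adds a "_merge" column recording where each row was found.
audit = states.merge(population, on="State", how="outer",
                     validate="one_to_one", indicator=True)

missing_population = audit.loc[audit["_merge"] == "left_only", "State"].tolist()
unused_population = audit.loc[audit["_merge"] == "right_only", "State"].tolist()
print("States without population data:", missing_population)
print("Population rows without a boundary:", unused_population)
```

# Both lists should be empty before moving on; any entries point to names that need harmonizing.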

## Step 3: LULC Weighting
# 
# We assign weights based on LULC categories. For simplicity, we assume that:
# 
# - **Urban areas:** Weight = 1.5
# - **Agricultural areas:** Weight = 1.3
# - **Forested areas:** Weight = 1.0
# - **Water bodies:** Weight = 0.5
# 
# If a state has multiple LULC categories, you may compute a weighted average based on the proportion of each category.

In [None]:
# Define a mapping of LULC types to weights
lulc_weight_mapping = {
    "Urban": 1.5,
    "Agricultural": 1.3,
    "Forested": 1.0,
    "Water": 0.5
}

# Assume lulc_gdf has columns: ['State', 'LULC_Type']
lulc_gdf["LULC_Weight"] = lulc_gdf["LULC_Type"].map(lulc_weight_mapping)

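# Collapse to one weight per state (simple mean across that state's LULC rows)
# so the merge below does not duplicate states that have several categories
state_lulc = lulc_gdf.groupby("State", as_index=False)["LULC_Weight"].mean()

# Merge LULC weight into states_gdf
states_gdf = states_gdf.merge(state_lulc, on="State", how="left")

# If any state is missing a weight, fill with a default (e.g., 1.0)
states_gdf["LULC_Weight"] = states_gdf["LULC_Weight"].fillna(1.0)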
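
# When a state contains several LULC categories, a weighted average based on the proportion of each category can replace a single value. The sketch below is one way to do this with an area-weighted mean. The `Area_km2` column and the toy values are assumptions, not part of the original data.

```python
import pandas as pd

lulc_weight_mapping = {"Urban": 1.5, "Agricultural": 1.3, "Forested": 1.0, "Water": 0.5}

# Toy table: one row per (state, LULC class); "Area_km2" is a hypothetical column.
lulc = pd.DataFrame({
    "State": ["A", "A", "B", "B", "B"],
    "LULC_Type": ["Urban", "Forested", "Agricultural", "Water", "Forested"],
    "Area_km2": [25.0, 75.0, 50.0, 10.0, 40.0],
})

lulc["LULC_Weight"] = lulc["LULC_Type"].map(lulc_weight_mapping)
lulc["Weighted_Area"] = lulc["LULC_Weight"] * lulc["Area_km2"]

# Area-weighted mean per state: sum(weight * area) / sum(area).
totals = lulc.groupby("State")[["Weighted_Area", "Area_km2"]].sum()
state_lulc_weight = (totals["Weighted_Area"] / totals["Area_km2"]).rename("LULC_Weight")
print(state_lulc_weight)
```

# With real polygons, the area could be taken from the geometry after reprojecting to a projected CRS, since areas computed in geographic coordinates are in squared degrees.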

## Step 4: Environmental Risk Calculation
# 
# We now calculate an environmental risk score based on temperature and rainfall. For example:
# 
# - **Temperature Factor:** The optimal range is 20–32°C. The score falls linearly from 1 at the midpoint (26°C) to 0 at the edges of the range and is set to 0 outside it.
# - **Rainfall Factor:** Higher rainfall may increase risk due to stagnant water. Rainfall is normalized by the maximum observed value.
# 
# Here, we compute a risk score as:
# 
# \[
# \text{Risk Score} = \text{LULC Weight} \times \max\left(0,\ 1 - \frac{|T - 26|}{\Delta T}\right) \times \left(\frac{R}{R_{max}}\right)
# \]
# 
# Where \(T\) is average temperature, \(\Delta T\) is the half-width of the optimal temperature range (6°C, the distance from the midpoint to either edge), \(R\) is average rainfall, and \(R_{max}\) is the maximum observed rainfall in the dataset.
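# 
# As a worked example of the formula, the sketch below evaluates the risk score for a single state in plain Python. The function names and the input values (28°C, 1200 mm of rainfall, a dataset maximum of 2000 mm) are made up for illustration.

```python
def temperature_score(avg_temp, optimal_temp=26.0, temp_range=6.0):
    """Linear score: 1 at the optimum, 0 at optimal_temp +/- temp_range or beyond."""
    return max(0.0, 1.0 - abs(avg_temp - optimal_temp) / temp_range)

def risk_score(lulc_weight, avg_temp, avg_rainfall, max_rainfall):
    """LULC weight x temperature score x rainfall relative to the maximum."""
    return lulc_weight * temperature_score(avg_temp) * (avg_rainfall / max_rainfall)

# Made-up inputs for one state with the agricultural LULC weight of 1.3.
example = risk_score(lulc_weight=1.3, avg_temp=28.0,
                     avg_rainfall=1200.0, max_rainfall=2000.0)
print(round(example, 3))  # 1.3 * (1 - 2/6) * 0.6 = 0.52
```

# A state at 35°C would score 0 on temperature, so its risk score is 0 regardless of rainfall or land cover.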

In [None]:
# Define temperature parameters
optimal_temp = 26.0
temp_range = 6.0  # half-width of the 20–32°C optimal range; the score reaches 0 at 20 and 32

# Normalize temperature score (closer to optimal yields higher score)
states_gdf["Temp_Score"] = 1 - (abs(states_gdf["Avg_Temperature"] - optimal_temp) / temp_range)
states_gdf["Temp_Score"] = states_gdf["Temp_Score"].clip(lower=0)  # ensure non-negative

# Normalize rainfall score
max_rainfall = states_gdf["Avg_Rainfall"].max()
states_gdf["Rain_Score"] = states_gdf["Avg_Rainfall"] / max_rainfall

# Compute combined environmental risk score
states_gdf["Env_Risk"] = states_gdf["LULC_Weight"] * states_gdf["Temp_Score"] * states_gdf["Rain_Score"]

## Step 5: Malaria Incidence Allocation
# 
# Assume we have a national malaria incidence figure (a total case count). We allocate this number to each state in proportion to its population and its computed environmental risk.
# 
# First, compute a weight for each state from its population and the risk score from Step 4 (stored in the `Env_Risk` column):
# 
# \[
# \text{State Weight} = \text{Population} \times \text{Risk Score}
# \]
# 
# Then, allocate national cases accordingly.

In [None]:
# Example: National malaria incidence (from an external source)
national_malaria_incidence = 500000  # replace with actual number

# Compute state weight
states_gdf["State_Weight"] = states_gdf["Population"] * states_gdf["Env_Risk"]

# Compute total weight across all states
total_weight = states_gdf["State_Weight"].sum()

# Allocate malaria cases to each state proportionally
states_gdf["Allocated_Cases"] = (states_gdf["State_Weight"] / total_weight) * national_malaria_incidence
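
# Proportional allocation produces fractional case counts. If whole numbers are needed, naive rounding can make the state totals drift from the national figure. The sketch below uses the largest-remainder rule, a technique swapped in here for illustration and not part of the original workflow, to round while preserving the total. The function name and weights are hypothetical.

```python
import math

def allocate_whole_cases(weights, total_cases):
    """Split total_cases across weights as integers using the largest-remainder rule."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    exact = [w / total_weight * total_cases for w in weights]
    floors = [math.floor(x) for x in exact]
    leftover = total_cases - sum(floors)
    # Hand the leftover cases to the entries with the largest fractional parts.
    order = sorted(range(len(weights)), key=lambda i: exact[i] - floors[i], reverse=True)
    for i in order[:leftover]:
        floors[i] += 1
    return floors

state_weights = [3.0, 2.0, 1.0, 1.0]  # made-up State_Weight values
cases = allocate_whole_cases(state_weights, 10)
print(cases, sum(cases))
```

# The same function could be applied to `states_gdf["State_Weight"].fillna(0).tolist()` with the national total, assuming states with missing inputs should receive no cases.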

## Step 6: Results Visualization and Export
# 
# We can now visualize the allocated malaria incidence on the map and export the results.

In [None]:
# Visualize the allocated malaria cases using a choropleth map
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
states_gdf.plot(column="Allocated_Cases", cmap="OrRd", linewidth=0.8, ax=ax, edgecolor="0.8", legend=True)
ax.set_title("Allocated Malaria Cases by State in Nigeria")
plt.axis("off")
plt.show()

In [None]:
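# Optionally, export the resulting GeoDataFrame to a new shapefile or CSV
import os
os.makedirs("output", exist_ok=True)  # to_file/to_csv do not create missing folders
# Note: the shapefile format truncates column names to 10 characters
states_gdf.to_file("output/nigeria_malaria_allocated.shp")
states_gdf.drop(columns="geometry").to_csv("output/nigeria_malaria_allocated.csv", index=False)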
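
# Shapefile attribute names are limited to 10 characters, so columns such as `Avg_Temperature` and `Allocated_Cases` are cut short on export. The sketch below shows one way to choose explicit short names beforehand. The short names and the helper function are hypothetical.

```python
# Hypothetical short names, each at most 10 characters, for the longer columns.
SHORT_NAMES = {
    "Avg_Temperature": "AvgTemp",
    "Avg_Rainfall": "AvgRain",
    "LULC_Weight": "LULC_Wt",
    "State_Weight": "State_Wt",
    "Allocated_Cases": "AllocCases",
}

def shapefile_safe_names(columns, mapping, limit=10):
    """Return {original: short} for columns, rejecting long or clashing names."""
    renamed = {col: mapping.get(col, col) for col in columns}
    too_long = [new for new in renamed.values() if len(new) > limit]
    if too_long:
        raise ValueError(f"names longer than {limit} characters: {too_long}")
    if len(set(renamed.values())) != len(renamed):
        raise ValueError("short names are not unique")
    return renamed

# Attribute columns produced by the notebook (geometry excluded).
columns = ["State", "Population", "Avg_Temperature", "Avg_Rainfall", "LULC_Weight",
           "Temp_Score", "Rain_Score", "Env_Risk", "State_Weight", "Allocated_Cases"]
rename_map = shapefile_safe_names(columns, SHORT_NAMES)
print(rename_map)
```

# The mapping could then be applied with `states_gdf.rename(columns=rename_map)` before calling `to_file`. The CSV export has no such limit and can keep the original names.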

## Conclusion
# 
# In this notebook, we demonstrated a complete workflow for disaggregating national malaria incidence to the state level using a combination of population data, LULC weights, and environmental risk factors (temperature and rainfall). The resulting state-level allocation can be used for further spatial analysis.
# 
# Feel free to adapt and extend this code to include more detailed data, additional environmental variables, or more sophisticated weighting schemes.
