# Replication Project: Final Dataframe Extended

**Replication Game – Berlin, 30 October 2025**  

---

**Institute for Replication & Freie Universität Berlin**  

**Author:** [Dominik Bursy](mailto:dominik.bursy@icloud.com)  

**Last Updated:** October 2025

---

**Reference:**  
Berazneva, Julia, and Tanya S. Byker. 2017. *Does Forest Loss Increase Human Disease? Evidence from Nigeria.* American Economic Review, 107(5), 516–521. https://doi.org/10.1257/aer.p20171132

---

**Resources:**  
- [Guidelines on the Use of DHS GPS Data (English)](https://dhsprogram.com/publications/publication-SAR8-Spatial-Analysis-Reports.cfm)
- [Nigeria - Subnational Administrative Boundaries](https://data.humdata.org/dataset/cod-ab-nga)


**Notes:**  
- EPSG:3857 (Web Mercator) is a projected coordinate system with units in meters. It is based on a spherical Mercator projection and is used by web mapping applications such as Google Maps. EPSG:4326 is a geographic coordinate system that expresses positions as latitude and longitude in degrees on the WGS84 ellipsoid; it is the system used by GPS.
- To protect respondent confidentiality, the geo-located data are displaced (Burgert et al., 2013). The displacement moves each latitude and longitude to a new location under set parameters: urban locations are displaced 0-2 kilometers and rural locations 0-5 kilometers, with 1% of rural locations (every 100th point) displaced 0-10 kilometers.
- Administrative level 2 contains 774 features. The feature type at this level is the *Local Government Area* (LGA).
- Berazneva and Byker (2017) did not provide detailed information on how the data were cleaned or subset, such as the exclusion of observations with missing age values.
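
The displacement rule in the notes can be made concrete with a small simulation. The sketch below is not the DHS Program's procedure: it only draws a random direction and a random distance within the limits stated above. The function name `displace`, the example coordinates, and the flat kilometer-to-degree conversion are assumptions made for illustration.

```python
import math
import random

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude


def displace(lat, lon, urban, rng):
    """Return a displaced (lat, lon) and the distance in km (illustrative only)."""
    if urban:
        max_km = 2.0
    else:
        # 1% of rural points may move up to 10 km, the rest up to 5 km
        max_km = 10.0 if rng.random() < 0.01 else 5.0
    distance_km = rng.uniform(0.0, max_km)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    # Convert the offset from kilometers to degrees (small-distance approximation)
    dlat = distance_km * math.cos(angle) / KM_PER_DEGREE_LAT
    dlon = distance_km * math.sin(angle) / (
        KM_PER_DEGREE_LAT * math.cos(math.radians(lat))
    )
    return lat + dlat, lon + dlon, distance_km


rng = random.Random(0)
# Arbitrary example point inside Nigeria (assumed for illustration)
urban_km = [displace(9.08, 7.40, True, rng)[2] for _ in range(10_000)]
rural_km = [displace(9.08, 7.40, False, rng)[2] for _ in range(10_000)]
print(max(urban_km) <= 2.0, max(rural_km) <= 10.0)
```

Because a cluster can move by several kilometers, any value extracted at the published coordinates is best read as an average over the neighbourhood of the true location.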

## Import Packages <a class="anchor" id="packages"></a>

In [20]:
import os
from pathlib import Path

import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import LineString
from datetime import timedelta

import matplotlib.pyplot as plt
import seaborn as sns

In [21]:
## Set Root Directory
ROOT_FOLDER = str(Path().absolute().parent)
print(ROOT_FOLDER)

## Save Figures
savefigures = True

/Users/dominikbursy/Documents/8_PhD_New/replication_game


---

## Import Data 

In [22]:
## Import the DHS dataframe, which serves as the base structure
gdf_dhs = gpd.read_file(f"{ROOT_FOLDER}/output/gdf_dhs.geojson")

In [23]:
## Import forest loss, luminosity, and soil properties
df_forest_change = pd.read_csv(f"{ROOT_FOLDER}/output/dataframe_forest_change_2019.csv", index_col=0)
df_luminosity = pd.read_csv(f"{ROOT_FOLDER}/output/dataframe_luminosity.csv", index_col=0)
df_soil = pd.read_csv(f"{ROOT_FOLDER}/output/dataframe_soil_depth_0_15.csv", index_col=0)

In [24]:
## Combine DHS Dataframe with geospatial dataframes
gdf_dhs = gdf_dhs.join(df_forest_change.drop(columns="caseid"))
gdf_dhs = gdf_dhs.join(df_luminosity.drop(columns="caseid"))
gdf_dhs = gdf_dhs.join(df_soil.drop(columns="caseid"))
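
The joins above align rows by index, so the dropped `caseid` column is never compared between the frames. A minimal sketch of a guarded join follows, using toy frames; the helper name `join_checked` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd


def join_checked(base, other, key="caseid"):
    """Index-join `other` onto `base` after verifying both agree on `key`."""
    if not base.index.equals(other.index):
        raise ValueError("indexes differ; an index join would misalign rows")
    if not base[key].equals(other[key]):
        raise ValueError(f"{key} values differ between the two frames")
    return base.join(other.drop(columns=key))


# Toy stand-ins for the DHS frame and one of the extracted CSVs
base = pd.DataFrame({"caseid": ["a", "b", "c"], "DHSYEAR": [2008, 2013, 2018]})
extra = pd.DataFrame({"caseid": ["a", "b", "c"], "treecover_mean": [1.0, 2.5, 0.0]})

merged = join_checked(base, extra)
print(merged.columns.tolist())
```

If the check fails, merging on `caseid` explicitly would be the safer route than an index join.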

In [25]:
## Nigeria - Subnational Administrative Boundaries
gdf_nigeria = gpd.read_file(f"{ROOT_FOLDER}/datasets/map_africa/nga_admbnda_adm2_osgof_20190417.shp")

gdf_nigeria = gdf_nigeria.reset_index(names="LGA")
gdf_nigeria_dhs = gdf_nigeria.sjoin(gdf_dhs, how="left")

# LGA names are stored in ADM2_EN

In [26]:
pd.read_csv(f"{ROOT_FOLDER}/output/dataframe_forest_change_2019.csv", index_col=0)["treecover_mean"].describe()

count    94053.000000
mean         7.547492
std          9.827020
min          0.000000
25%          0.250131
50%          3.463532
75%         10.856233
max         59.502627
Name: treecover_mean, dtype: float64

In [27]:
pd.read_csv(f"{ROOT_FOLDER}/output/dataframe_forest_change.csv", index_col=0)["treecover_mean"].describe()

count    94053.000000
mean         7.547492
std          9.827020
min          0.000000
25%          0.250131
50%          3.463532
75%         10.856233
max         59.502627
Name: treecover_mean, dtype: float64

## Calculate mean forest loss, luminosity, and soil properties per LGA

In [28]:
# Take all yearly loss columns (exclude caseid, forest_loss_size, and treecover_mean)
loss_years = df_forest_change.drop(columns=["caseid", "forest_loss_size", "treecover_mean"]).columns

# Normalize each by forest_loss_size
gdf_nigeria_dhs[loss_years] = gdf_nigeria_dhs[loss_years].div(
    gdf_nigeria_dhs["forest_loss_size"], axis=0
)
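
Dividing by `forest_loss_size` produces `inf` or `NaN` wherever the denominator is zero. Whether zeros occur in the actual data is not known from the notebook; the sketch below assumes they might and uses toy values to show one way of turning such rows into missing shares.

```python
import numpy as np
import pandas as pd

# Toy data; the column names mirror the notebook, the values are made up
df = pd.DataFrame({
    "loss_2001": [10.0, 0.0, 4.0],
    "loss_2002": [5.0, 0.0, 1.0],
    "forest_loss_size": [100.0, 0.0, 8.0],
})
loss_cols = ["loss_2001", "loss_2002"]

# Replace a zero denominator with NaN so the shares become missing instead of inf
denominator = df["forest_loss_size"].replace(0, np.nan)
shares = df[loss_cols].div(denominator, axis=0)
print(shares)
```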

In [29]:
## Forest Change per LGA

condition = gdf_nigeria_dhs["DHSYEAR"] == 2008

gdf_nigeria_dhs.loc[condition, [
    "loss_2001_pt_2008MEAN", "loss_2002_pt_2008MEAN",
    "loss_2003_pt_2008MEAN", "loss_2004_pt_2008MEAN",
    "loss_2005_pt_2008MEAN", "loss_2006_pt_2008MEAN",
    "loss_2007_pt_2008MEAN", "loss_2008_pt_2008MEAN",
    "loss_2009_pt_2008MEAN", "loss_2010_pt_2008MEAN",
    "loss_2011_pt_2008MEAN", "loss_2012_pt_2008MEAN",
    "loss_2013_pt_2008MEAN", "loss_2014_pt_2008MEAN",
    "loss_2015_pt_2008MEAN", "loss_2016_pt_2008MEAN",
    "loss_2017_pt_2008MEAN", "loss_2018_pt_2008MEAN",
    "loss_2019_pt_2008MEAN",
    "treecover_2008MEAN"]] = (gdf_nigeria_dhs.loc[condition].groupby("LGA")[list(df_forest_change.drop(columns=["caseid", "forest_loss_size"]).columns)]
    .transform("mean").values
)

## Forest Change per LGA

condition = gdf_nigeria_dhs["DHSYEAR"] == 2013

gdf_nigeria_dhs.loc[condition, [
    "loss_2001_pt_2013MEAN", "loss_2002_pt_2013MEAN",
    "loss_2003_pt_2013MEAN", "loss_2004_pt_2013MEAN",
    "loss_2005_pt_2013MEAN", "loss_2006_pt_2013MEAN",
    "loss_2007_pt_2013MEAN", "loss_2008_pt_2013MEAN",
    "loss_2009_pt_2013MEAN", "loss_2010_pt_2013MEAN",
    "loss_2011_pt_2013MEAN", "loss_2012_pt_2013MEAN",
    "loss_2013_pt_2013MEAN", "loss_2014_pt_2013MEAN",
    "loss_2015_pt_2013MEAN", "loss_2016_pt_2013MEAN",
    "loss_2017_pt_2013MEAN", "loss_2018_pt_2013MEAN",
    "loss_2019_pt_2013MEAN",
    "treecover_2013MEAN"]] = (gdf_nigeria_dhs.loc[condition].groupby("LGA")[list(df_forest_change.drop(columns=["caseid", "forest_loss_size"]).columns)]
    .transform("mean").values
)

## Forest Change per LGA

condition = gdf_nigeria_dhs["DHSYEAR"] == 2018

gdf_nigeria_dhs.loc[condition, [
    "loss_2001_pt_2018MEAN", "loss_2002_pt_2018MEAN",
    "loss_2003_pt_2018MEAN", "loss_2004_pt_2018MEAN",
    "loss_2005_pt_2018MEAN", "loss_2006_pt_2018MEAN",
    "loss_2007_pt_2018MEAN", "loss_2008_pt_2018MEAN",
    "loss_2009_pt_2018MEAN", "loss_2010_pt_2018MEAN",
    "loss_2011_pt_2018MEAN", "loss_2012_pt_2018MEAN",
    "loss_2013_pt_2018MEAN", "loss_2014_pt_2018MEAN",
    "loss_2015_pt_2018MEAN", "loss_2016_pt_2018MEAN",
    "loss_2017_pt_2018MEAN", "loss_2018_pt_2018MEAN",
    "loss_2019_pt_2018MEAN",
    "treecover_2018MEAN"]] = (gdf_nigeria_dhs.loc[condition].groupby("LGA")[list(df_forest_change.drop(columns=["caseid", "forest_loss_size"]).columns)]
    .transform("mean").values
)
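
The three wave blocks above differ only in the survey year, so the same step can be written as a loop. The following is a sketch on toy data, not the notebook's code: the helper `add_wave_means` is hypothetical, and its output names (`<column>_<wave>MEAN`) are simplified relative to the `loss_YYYY_pt_<wave>MEAN` names used above.

```python
import pandas as pd


def add_wave_means(df, wave, source_cols, group_col="LGA", year_col="DHSYEAR"):
    """Add per-group means of `source_cols` for one survey wave."""
    mask = df[year_col] == wave
    means = df.loc[mask].groupby(group_col)[source_cols].transform("mean")
    for col in source_cols:
        # Rows from other waves stay NaN through index alignment
        df[f"{col}_{wave}MEAN"] = means[col]
    return df


toy = pd.DataFrame({
    "LGA": [0, 0, 1, 0, 1],
    "DHSYEAR": [2008, 2008, 2008, 2013, 2013],
    "loss_2001": [0.2, 0.4, 0.1, 0.6, 0.3],
})
for wave in (2008, 2013):
    toy = add_wave_means(toy, wave, ["loss_2001"])
print(toy)
```

Assigning by column name rather than by position means the result does not depend on the column order of the source CSV.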

In [30]:
## Luminosity per LGA

condition = gdf_nigeria_dhs["DHSYEAR"] == 2008

gdf_nigeria_dhs.loc[condition, ["f162005_cluster_2008MEAN", "f162006_cluster_2008MEAN", "f162007_cluster_2008MEAN", "f162008_cluster_2008MEAN"]] = (
    gdf_nigeria_dhs.loc[condition].groupby("LGA")[["f162005", "f162006", "f162007", "f162008"]]
    .transform("mean").values
)

condition = gdf_nigeria_dhs["DHSYEAR"] == 2013

gdf_nigeria_dhs.loc[condition, ["f182010_cluster_2013MEAN", "f182011_cluster_2013MEAN", "f182012_cluster_2013MEAN", "f182013_cluster_2013MEAN"]] = (
    gdf_nigeria_dhs.loc[condition].groupby("LGA")[["f182010", "f182011", "f182012", "f182013"]]
    .transform("mean").values
)

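condition = gdf_nigeria_dhs["DHSYEAR"] == 2018

# The 2018-wave columns reuse the f16 2005-2008 labels, although the values are
# the f18 2010-2013 means; check this before interpreting the columns by name
gdf_nigeria_dhs.loc[condition, ["f162005_cluster_2018MEAN", "f162006_cluster_2018MEAN", "f162007_cluster_2018MEAN", "f162008_cluster_2018MEAN"]] = (
    gdf_nigeria_dhs.loc[condition].groupby("LGA")[["f182010", "f182011", "f182012", "f182013"]]
    .transform("mean").values
)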

In [32]:
## Soil characteristics per LGA

gdf_nigeria_dhs[["occ_ave_pt", "ph_ave_pt", "cec_ave_pt"]] = (
    gdf_nigeria_dhs.groupby("LGA")[["soil_ORCDRC", "soil_PHIHOX", "soil_CEC"]]
    .transform("mean").values
)

## Subset by Variables included in Berazneva and Byker (2017)

In [33]:
df_final = pd.read_stata(f"{ROOT_FOLDER}/resources/113539-V1/Data_programs_readme/dta.dta")
final_columns = list(df_final.columns)
final_columns.append("geometry")

In [34]:
# Maps each DHS source column to its variable name(s) in the replication data.
# A source column that feeds several variables maps to a tuple, because a
# repeated key in a dict literal would keep only the last value.
dict_dhs_features = {
    "v002": "v002",    # df_dhs_2008_children
    "v003": "v003",    # df_dhs_2008_children
    "v005": "v005",    # df_dhs_2008_children
    "v021": "v021",    # df_dhs_2008_children
    "v022": "v022",    # df_dhs_2008_children
    "v136": "no_HH_members",    # df_dhs_2008_children
    "v137": "no_kids_under_5",    # df_dhs_2008_children
    "v115": "time_to_water",    # df_dhs_2008_children
    "v025": "rural",    # df_dhs_2008_children
    "v152": "head_HH_age",    # df_dhs_2008_children -> Not perfect, but likely picked by the authors
    "v715": "HH_head_edu_years",    # df_dhs_2008_children
    "v459": "own_bednet",   # df_dhs_2008_children
    "v151": "femhh",    # df_dhs_2008_children -> Likely female household / Sex of household head
    "v116": "toilet",   # df_dhs_2008_children: 31 corresponds to no toilet
    "v460": "kidnet",   # df_dhs_2008_children
    "v190": "poorest",    # df_dhs_2008_children: Wealth index quintile / Alternatively, dhs_2008_household_path: hv270
    "v161": "firewood",    # df_dhs_2008_children: Type of cooking fuel
    "v127": "floor",    # df_dhs_2008_children: Main floor material of higher quality, i.e. v127.isin([20, 21, 22, 30, 31, 32, 33, 34, 35])
    "v012": "age_resp",     # df_dhs_2008_children
    "v133": "edu_years",     # df_dhs_2008_children
    "v130": ("christian", "muslim"),    # df_dhs_2008_children: Religion
    "v131": ("yoruba", "igbo", "hausa"),    # df_dhs_2008_children: Ethnicity
    "v201": "no_child_total",  # df_dhs_2008_children
    "v218": "no_child_living",  # df_dhs_2008_children
    "v461": "resp_slept_net",   # df_dhs_2008_children ???
    "v714": "resp_works",     # df_dhs_2008_children
    "v213": "pregnant",     # df_dhs_2008_children
    "v501": ("married", "livewith"),     # df_dhs_2008_children: Marital status
    "v006": "month",    # df_dhs_2008_children
    "v024": "region",    # df_dhs_2008_children
    "b8": "age",   # df_dhs_2008_children
    "h22": "fever",   # df_dhs_2008_children
    "h11": "diarrhea",   # df_dhs_2008_children
    "h31": "cough",   # df_dhs_2008_children

    # DHS Geographic Data
    "ALT_DEM": "altitude",
    "DHSCLUST": "DHSCLUST", ## Duplicate v001
    "DHSYEAR": "DHSYEAR",
}
}
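
Several variables in the mapping share one DHS source column (`v130`, `v131`, `v501`). Because a dict literal that repeats a key keeps only the last value, a mapping keyed by the derived variable is one way to record all of them. The short dictionaries below are an illustrative excerpt, not a replacement for the full mapping.

```python
# A repeated key in a dict literal silently keeps only the last value
source_to_derived = {"v130": "christian", "v130": "muslim"}
print(source_to_derived)  # {'v130': 'muslim'}

# Keying by the derived variable keeps every entry
derived_to_source = {
    "christian": "v130",
    "muslim": "v130",
    "yoruba": "v131",
    "igbo": "v131",
    "hausa": "v131",
    "married": "v501",
    "livewith": "v501",
}

# Group the derived variables by their DHS source column
by_source = {}
for derived, source in derived_to_source.items():
    by_source.setdefault(source, []).append(derived)
print(by_source["v131"])  # ['yoruba', 'igbo', 'hausa']
```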

## Export Dataframe

In [35]:
gdf_nigeria_dhs.drop(columns=["ADM2ALT1EN", "ADM2ALT2EN", "validTo"]).to_csv(f"{ROOT_FOLDER}/output/replication_full_extended.csv")

## Export Dataframe to Stata

In [37]:
## Calculate the area of LGAs in km² (m² * 1e-6); EPSG:3857 overstates areas away from the equator
gdf_nigeria_dhs["LGA_Area"] = gdf_nigeria_dhs.to_crs(epsg=3857).area * 1e-6
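
Areas measured in EPSG:3857 are inflated by roughly 1 / cos²(latitude) on a spherical model. The sketch below evaluates that factor at a few latitudes; the range of about 4°N to 14°N for Nigeria is an approximation assumed here, and the function name is hypothetical.

```python
import math


def web_mercator_area_inflation(lat_deg):
    """Approximate factor by which EPSG:3857 overstates small areas at a latitude."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2


# Approximate southern, central, and northern latitudes of Nigeria (assumed)
for lat in (4.0, 9.0, 14.0):
    print(f"{lat:4.1f} deg N: factor {web_mercator_area_inflation(lat):.3f}")
```

If unbiased areas matter for the analysis, an equal-area CRS such as EPSG:6933 could be passed to `to_crs` instead; whether that is appropriate for the replication is a judgment left to the analyst.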

In [38]:
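## Export only Variables included in Berazneva and Byker (2017)
## As written, `condition` was last assigned in the luminosity cell (DHSYEAR == 2018); reassign it here if the 2008 and 2013 waves are intended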
gdf_nigeria_dhs.loc[condition, final_columns].drop(columns="geometry").to_stata(f"{ROOT_FOLDER}/output/dta_replication_extended.dta")

In [39]:
## Export additional Variables
# gdf_nigeria_dhs.loc[condition, final_columns + ["treecover_mean"] + ["forest_loss_size"] + ["LGA_Area"]].drop(columns="geometry").to_stata(f"{ROOT_FOLDER}/output/dta_replication_additional_extended.dta")

In [40]:
## Export all Variables including the DHS Wave 2018
gdf_nigeria_dhs.drop(columns=["ADM2ALT1EN", "ADM2ALT2EN", "validTo", "geometry"]).to_stata(f"{ROOT_FOLDER}/output/dta_replication_full_extended.dta")

---