---
title: Full workflow for importing Dengue data into DHIS2
short_title: Import Dengue Data
---

This workflow demonstrates, end to end, how to prepare dengue case data for import into DHIS2. The example uses [**OpenDengue**](https://opendengue.org/data.html) data; in practice, countries are expected to use official Ministry of Health data.

The notebook focuses on **data harmonization and preparation** using a worked example for **Nepal (districts / admin2)** and **monthly** data. The final DHIS2 import step follows the same approach as the WorldPop and CHIRPS workflows and is therefore not repeated in full here.

## Inputs

This workflow expects two local input files under `../../guides/data/`:

- `nepal-opendengue.csv` — [**OpenDengue**](https://opendengue.org/data.html) export containing Nepal dengue case counts
- `nepal-locations.geojson` — Nepal district geometries (admin2)

## Output

The workflow produces:

- `nepal-dengue-harmonized.csv` — harmonized monthly dengue cases per district (`time_period`, `location`, `disease_cases`)


In [1]:
from pathlib import Path

import pandas as pd
import geopandas as gpd

pd.set_option("display.max_columns", 200)


## Paths

In [2]:
DATA_FOLDER = Path("../../guides/data")

LOCATIONS_GEOJSON = DATA_FOLDER / "nepal-locations.geojson"
OPENDENGUE_SOURCE_PATH = DATA_FOLDER / "nepal-opendengue.csv"

# Output
OUT_CSV = DATA_FOLDER / "nepal-dengue-harmonized.csv"

for p in [LOCATIONS_GEOJSON, OPENDENGUE_SOURCE_PATH]:
    if not p.exists():
        raise FileNotFoundError(f"Missing required input: {p}")

print("Using inputs:")
print(" -", LOCATIONS_GEOJSON)
print(" -", OPENDENGUE_SOURCE_PATH)


Using inputs:
 - ../../guides/data/nepal-locations.geojson
 - ../../guides/data/nepal-opendengue.csv


## Load district locations

In [None]:
locations = gpd.read_file(LOCATIONS_GEOJSON)

# DHIS2 organisation unit UID, read from the GeoJSON 'id' property
uid_col = "id" if "id" in locations.columns else None
if uid_col is None:
    raise KeyError(f"Expected DHIS2 UID in GeoJSON 'id'. Found: {list(locations.columns)}")

locations["location"] = locations[uid_col].astype(str).str.strip() 

# Join helper (district name)
if "name" not in locations.columns:
    raise KeyError(f"Expected district name in GeoJSON 'name'. Found: {list(locations.columns)}")

locations["district_name"] = (
    locations["name"].astype(str)
    .str.replace(r"^\s*\d+\s+", "", regex=True)  # drop leading numeric codes (e.g. "101 ") from location names
    .str.upper()
    .str.strip()
)

# Keep only what we need
locations = locations[["location", "district_name", "geometry"]].dropna(subset=["location"]).copy()
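The name clean-up above can be checked in isolation on a few values. This is a minimal sketch using made-up names (the real GeoJSON `name` values are not shown in this notebook), so it only illustrates what the regex, `upper()` and `strip()` steps do:

```python
import pandas as pd

# Hypothetical raw names, as they might appear in the GeoJSON "name" property
raw_names = pd.Series(["101 Bardiya", " 205  Kailali ", "Chitwan"])

normalized = (
    raw_names.astype(str)
    .str.replace(r"^\s*\d+\s+", "", regex=True)  # drop a leading numeric code
    .str.upper()
    .str.strip()
)

print(normalized.tolist())  # ['BARDIYA', 'KAILALI', 'CHITWAN']
```

Names without a numeric prefix pass through unchanged apart from the upper-casing and trimming.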


## Load OpenDengue

In [4]:
df_raw = pd.read_csv(OPENDENGUE_SOURCE_PATH)
print("Loaded:", OPENDENGUE_SOURCE_PATH)
print("Columns:", df_raw.columns.tolist())
df_raw.head()


Loaded: ../../guides/data/nepal-opendengue.csv
Columns: ['adm_0_name', 'adm_1_name', 'adm_2_name', 'full_name', 'ISO_A0', 'FAO_GAUL_code', 'RNE_iso_code', 'IBGE_code', 'calendar_start_date', 'calendar_end_date', 'Year', 'dengue_total', 'case_definition_standardised', 'S_res', 'T_res', 'UUID', 'region']


Unnamed: 0,adm_0_name,adm_1_name,adm_2_name,full_name,ISO_A0,FAO_GAUL_code,RNE_iso_code,IBGE_code,calendar_start_date,calendar_end_date,Year,dengue_total,case_definition_standardised,S_res,T_res,UUID,region
0,NEPAL,,,NEPAL,NPL,175,NPL,,1987-01-01,1987-12-31,1987,0,Total,Admin0,Year,WHOSEARO-ALL-19852009-Y01-00,SEARO
1,NEPAL,,,NEPAL,NPL,175,NPL,,1985-01-01,1985-12-31,1985,0,Total,Admin0,Year,WHOSEARO-ALL-19852009-Y01-00,SEARO
2,NEPAL,,,NEPAL,NPL,175,NPL,,1986-01-01,1986-12-31,1986,0,Total,Admin0,Year,WHOSEARO-ALL-19852009-Y01-00,SEARO
3,NEPAL,,,NEPAL,NPL,175,NPL,,1991-01-01,1991-12-31,1991,0,Total,Admin0,Year,WHOSEARO-ALL-19852009-Y01-00,SEARO
4,NEPAL,,,NEPAL,NPL,175,NPL,,1988-01-01,1988-12-31,1988,0,Total,Admin0,Year,WHOSEARO-ALL-19852009-Y01-00,SEARO


## Column mapping

In [5]:
# OpenDengue export columns
DATE_COL = "calendar_start_date"
CASES_COL = "dengue_total"
ADMIN2_COL = "adm_2_name"

missing = [c for c in [DATE_COL, CASES_COL, ADMIN2_COL] if c not in df_raw.columns]
if missing:
    raise KeyError(
        f"Input CSV is missing required columns: {missing}. "
        f"Available columns: {df_raw.columns.tolist()}"
    )

print("Using columns:", {"date": DATE_COL, "cases": CASES_COL, "admin2": ADMIN2_COL})


Using columns: {'date': 'calendar_start_date', 'cases': 'dengue_total', 'admin2': 'adm_2_name'}


## Normalize OpenDengue (Nepal districts / admin2)

In [7]:
df_norm = pd.DataFrame({
    "date": pd.to_datetime(df_raw[DATE_COL], errors="coerce"),
    "cases": pd.to_numeric(df_raw[CASES_COL], errors="coerce"),
    "district_name": df_raw[ADMIN2_COL],   # <-- not location yet
})

# Normalize district name for the crosswalk join.
# Note: astype(str) turns missing names into the literal "NAN", which matches no district.
df_norm["district_name"] = (
    df_norm["district_name"]
    .astype(str)
    .str.upper()
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
)

# Map district_name -> DHIS2 orgUnit UID
df_norm = df_norm.merge(
    locations[["district_name", "location"]],
    on="district_name",
    how="left",
)

# Report the share of unmapped rows (with examples), then drop them
unmapped = df_norm["location"].isna().mean()
print(f"Unmapped dengue rows: {unmapped:.2%}")
if unmapped > 0:
    print("Examples:", df_norm.loc[df_norm["location"].isna(), "district_name"].drop_duplicates().head(15).tolist())

df_norm = df_norm.dropna(subset=["location"]).copy()

# Keep only valid rows (parsed date, numeric case count, non-empty district name)
df_norm = df_norm.dropna(subset=["date", "cases", "district_name"])
df_norm = df_norm[df_norm["district_name"].ne("")]

df_norm.head()


Unmapped dengue rows: 100.00%
Examples: ['NAN', 'BARDIYA', 'BAITADI', 'KAILALI', 'SURKHET', 'MAKAWANPUR', 'CHITWAN', 'BANKE', 'SARLAHI', 'PYUTHAN', 'MORANG', 'SYANGJA', 'DHADING', 'RUKUM', 'KANCHANPUR']


Unnamed: 0,date,cases,district_name,location
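The run above reports every row as unmapped, which suggests the normalized names in the two sources do not line up. The GeoJSON names are not shown here, so the cause is unclear. One way to investigate is to list close spelling matches for each unmapped name. The sketch below uses the standard library `difflib`; the helper `suggest_matches` and the example names are hypothetical and only illustrate the idea:

```python
import difflib

def suggest_matches(unmapped_names, reference_names, cutoff=0.8):
    """Return {unmapped_name: [close reference names]} for manual review."""
    reference = sorted(set(reference_names))
    return {
        name: difflib.get_close_matches(name, reference, n=3, cutoff=cutoff)
        for name in sorted(set(unmapped_names))
    }

# Hypothetical names for illustration. In the notebook these would come from
# the unmapped df_norm["district_name"] values and locations["district_name"].
unmapped_example = ["MAKAWANPUR", "BARDIYA"]
reference_example = ["MAKWANPUR", "BARDIA", "CHITWAN"]

suggestions = suggest_matches(unmapped_example, reference_example)
for name, candidates in suggestions.items():
    print(name, "->", candidates)
```

Suggestions like these are candidates for a manually reviewed crosswalk table, not something to apply automatically, since similar spellings can refer to different districts.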


## Monthly aggregation

In [8]:
# Convert to month period label (YYYY-MM)
df_norm["time_period"] = df_norm["date"].dt.to_period("M").astype(str)

# Aggregate within month + location
disease = (
    df_norm.groupby(["time_period", "location"], as_index=False)["cases"]
    .sum()
    .rename(columns={"cases": "disease_cases"})
)

print("Aggregated rows:", len(disease))
disease.head()


Aggregated rows: 0


Unnamed: 0,time_period,location,disease_cases


## Filter to spatial backbone and align time axis

In [9]:
# Keep only locations present in the GeoJSON backbone
before = len(disease)
disease = disease.merge(locations[["location"]], on="location", how="inner")
after = len(disease)
print(f"Backbone filter kept {after}/{before} rows")

# Build full (time_period x location) grid and fill missing with 0
all_months = pd.period_range(disease["time_period"].min(), disease["time_period"].max(), freq="M").astype(str)
all_locations = locations["location"].sort_values().unique()

grid = pd.MultiIndex.from_product([all_months, all_locations], names=["time_period", "location"]).to_frame(index=False)

disease_full = grid.merge(disease, on=["time_period", "location"], how="left")
disease_full["disease_cases"] = disease_full["disease_cases"].fillna(0)

# Cast case counts to integers (any remaining missing values become 0)
disease_full["disease_cases"] = pd.to_numeric(disease_full["disease_cases"], errors="coerce").fillna(0).astype(int)

print("Final rows (complete grid):", len(disease_full))
disease_full.head()


Backbone filter kept 0/0 rows


ValueError: start and end must not be NaT
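The `ValueError` above occurs because `disease` is empty after the failed name join, so the minimum and maximum `time_period` are missing. A minimal sketch of the same grid-building step with an explicit guard is shown below, so the failure points at the join rather than at `period_range`. The function name `build_full_grid` and the toy UIDs are assumptions for illustration:

```python
import pandas as pd

def build_full_grid(disease: pd.DataFrame, all_locations) -> pd.DataFrame:
    """Expand to a complete (time_period x location) grid, filling gaps with 0."""
    if disease.empty:
        raise ValueError(
            "No dengue rows matched a location; check the district name join "
            "before building the time grid."
        )
    all_months = pd.period_range(
        disease["time_period"].min(), disease["time_period"].max(), freq="M"
    ).astype(str)
    grid = pd.MultiIndex.from_product(
        [all_months, sorted(set(all_locations))], names=["time_period", "location"]
    ).to_frame(index=False)
    full = grid.merge(disease, on=["time_period", "location"], how="left")
    full["disease_cases"] = full["disease_cases"].fillna(0).astype(int)
    return full

# Toy input with hypothetical UIDs "uidA" and "uidB"
toy = pd.DataFrame({
    "time_period": ["2023-01", "2023-03"],
    "location": ["uidA", "uidB"],
    "disease_cases": [4, 7],
})
full = build_full_grid(toy, ["uidA", "uidB"])
print(len(full))  # 3 months x 2 locations = 6 rows
```

Note that filling gaps with 0 treats "no report" as "zero cases"; whether that is appropriate depends on how the source data is reported.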

## Write output CSV

In [None]:
disease_full.to_csv(OUT_CSV, index=False)
print("Wrote:", OUT_CSV)
OUT_CSV


## Import into DHIS2

This workflow stops after producing a harmonized, DHIS2-ready dataset.

To import the resulting data into DHIS2:

- create a data element for dengue case counts
- map locations to DHIS2 organisation units
- submit the data using the DHIS2 Web API

The import mechanics are identical to those used in the WorldPop and CHIRPS workflows and are not repeated here.
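
As a rough illustration of the last step, the harmonized rows can be reshaped into the JSON structure accepted by the DHIS2 `dataValueSets` endpoint, where monthly periods are written as `YYYYMM`. This is only a sketch: the function `to_data_value_set`, the placeholder data element UID, and the sample rows are assumptions, and the WorldPop and CHIRPS workflows may build and submit the payload differently.

```python
import json

import pandas as pd

# Hypothetical placeholder; replace with the UID of your dengue data element
DATA_ELEMENT_UID = "REPLACE_WITH_DATA_ELEMENT_UID"

def to_data_value_set(df: pd.DataFrame, data_element: str) -> dict:
    """Convert harmonized rows into a DHIS2 dataValueSets-style payload."""
    return {
        "dataValues": [
            {
                "dataElement": data_element,
                "period": row.time_period.replace("-", ""),  # "2023-01" -> "202301"
                "orgUnit": row.location,
                "value": str(int(row.disease_cases)),
            }
            for row in df.itertuples(index=False)
        ]
    }

# Sample rows with a hypothetical organisation unit UID
sample = pd.DataFrame({
    "time_period": ["2023-01", "2023-02"],
    "location": ["uidA", "uidA"],
    "disease_cases": [4, 0],
})
payload = to_data_value_set(sample, DATA_ELEMENT_UID)
print(json.dumps(payload, indent=2))
```

The sketch stops at building the payload and makes no network call; submitting it requires an authenticated request to your DHIS2 instance.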
