<a href="https://colab.research.google.com/github/acoiman/pdt/blob/main/asthma_mortality/notebooks/colab/04_Asthma_Mortality_AP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Air Pollution Data

To clip the satellite-derived PM2.5 data obtained from [Washington University in St. Louis, USA](https://sites.wustl.edu/acag/datasets/surface-pm2-5/#V6.GL.02.03), we used a polygon representing the national boundary of Argentina. This boundary was generated in QGIS by dissolving the provincial boundaries from a dataset acquired through [Poblaciones.org](https://poblaciones.org/).

## 📦 Import Required Libraries

In [None]:
# geospatial data handling
import rasterio
from rasterio.mask import mask
import geopandas as gpd
import geemap
import ee
import rioxarray
import netCDF4
import h5netcdf

# array data handling
import xarray as xr


# other libraries
import os
import zipfile
import subprocess
from itables import init_notebook_mode

## 🌍 Connect to Google Earth Engine (GEE)

In [None]:
# trigger the authentication flow
ee.Authenticate()

In [None]:
# initialize the library.
ee.Initialize(project='ee-pdt')
print(ee.String('Hello from the Earth Engine servers!').getInfo())

## ✂️ Clipping PM2.5 data to Argentina's boundaries

The satellite-derived PM2.5 dataset is not stored on Google Drive due to its large file size. If you wish to download the PM2.5 V6.GL.02.03 data, it is available at: https://sites.wustl.edu/acag/datasets/surface-pm2-5/#V6.GL.02.03. However, downloading this dataset is not required to reproduce the results in this notebook. You may skip ahead to the section titled "Calculate mean PM2.5 per department per year (1998–2022)."

In [None]:
# change to the local working directory
%cd work/

In [None]:
# Set paths
# .nc files are not in Google Drive, download data into this folder
nc_folder = "pdt/asthma_mortality/data/raster/pm2.5_V6.GL.02.03/Global"  # input folder with .nc files

output_folder = "pdt/asthma_mortality/data/raster/pm2.5_V6.GL.02.03"   # output folder
boundary_file = "pdt/asthma_mortality/data/shp/ar_poly.shp"    # mask shapefile path

In [None]:
# Load Argentina boundary and fix geometry if needed
argentina = gpd.read_file(boundary_file).to_crs("EPSG:4326")
argentina.geometry = argentina.geometry.buffer(0)

In [None]:
def process_pm25_netcdf(nc_folder, output_folder, boundary_gdf, var_name='PM25'):
    for file in os.listdir(nc_folder):
        if file.endswith(".nc"):
            nc_path = os.path.join(nc_folder, file)
            year = file.split(".")[4].split("-")[0][0:4]

            print(f"Processing year {year}...")

            # Step 1: Open NetCDF file
            ds = xr.open_dataset(nc_path)

            # Step 2: Select the variable (e.g., 'PM25')
            if var_name not in ds.data_vars:
                print(f"Variable '{var_name}' not found in {file}")
                continue

            da = ds[var_name]

            # Step 3: Set spatial dimensions (depends on your .nc)
            # Try to auto-detect
            dims = da.dims
            if 'lat' in dims and 'lon' in dims:
                da = da.rio.set_spatial_dims(x_dim='lon', y_dim='lat')
            elif 'latitude' in dims and 'longitude' in dims:
                da = da.rio.set_spatial_dims(x_dim='longitude', y_dim='latitude')
            else:
                raise ValueError(f"Unknown spatial dimensions: {dims}")

            # Step 4: Write CRS manually (assume WGS84 unless you know better)
            da = da.rio.write_crs("EPSG:4326")

            # Step 5: Clip using boundary_gdf
            clipped = da.rio.clip(boundary_gdf.geometry, boundary_gdf.crs, drop=True)

            # Step 6: Export to GeoTIFF
            output_path = os.path.join(output_folder, f"pm2.5_ar_V6.GL.02.03-{year}.tif")
            clipped.rio.to_raster(output_path)

    print("✅ All NetCDF rasters processed, clipped, and saved.")

In [None]:
process_pm25_netcdf(nc_folder, output_folder, argentina)
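
The function above reads the year from a fixed position in the file name (`file.split(".")[4]`), which works only while every file follows the same naming pattern. Below is a minimal sketch of a more defensive parser. It assumes the names contain a `YYYYMM-YYYYMM` date range, as in `V6GL02.03.CNNPM25.Global.200101-200112.nc` (an assumed example, not a file name taken from this notebook). `extract_year` is a hypothetical helper.

```python
import re

# Hypothetical helper: look for a YYYYMM-YYYYMM date range anywhere in the name
_DATE_RANGE = re.compile(r"(\d{4})(\d{2})-(\d{4})(\d{2})")


def extract_year(filename):
    """Return the start year of the date range in filename, or None if absent."""
    match = _DATE_RANGE.search(filename)
    return int(match.group(1)) if match else None


# Assumed file name pattern, for illustration only
print(extract_year("V6GL02.03.CNNPM25.Global.200101-200112.nc"))
print(extract_year("README.nc"))
```

Returning `None` for names that do not match lets the caller skip unexpected files instead of failing with an `IndexError`.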

## 🧮 Calculate mean PM2.5 per department per year (1998–2022)

### ☁️ Upload data to GEE

The clipped PM2.5 rasters and the shapefile containing the adjusted asthma mortality rate per 100,000 are uploaded to GEE through the Code Editor interface.
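
Earth Engine asset names are limited to letters, digits, hyphens, and underscores, so the dots in the GeoTIFF file names cannot be kept on upload. The sketch below assumes the assets were named by replacing every dot and underscore with a hyphen, which is consistent with the asset prefix `pm2-5-ar-V6-GL-02-03-` used in the next section. `asset_name_for` is a hypothetical helper, not part of any library.

```python
import os
import re


def asset_name_for(tif_path):
    """Map a local GeoTIFF path to an assumed Earth Engine asset name."""
    # Drop the directory and the .tif extension
    stem = os.path.splitext(os.path.basename(tif_path))[0]
    # Replace every character outside [A-Za-z0-9-] with a hyphen (assumed convention)
    return re.sub(r"[^A-Za-z0-9-]", "-", stem)


print(asset_name_for("pm2.5_ar_V6.GL.02.03-2001.tif"))
```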

### Calculating mean PM2.5

Through the following function, we will batch-extract the annual mean PM2.5 data for each department. The resulting shapefile will then be merged with the dataset containing the normalized asthma mortality rate.

In [None]:
def calculate_and_merge_pm25_by_year(start_year=2001, end_year=2022,
                                     asset_image_prefix="projects/ee-pdt/assets/pm2-5-ar-V6-GL-02-03/pm2-5-ar-V6-GL-02-03-",
                                     fc_asset="projects/ee-pdt/assets/tma/tma_2001_2022",
                                     output_folder="pdt/asthma_mortality/data/shp/",
                                     merged_filename="pm25_2001_2022.shp"):

    # Load FeatureCollection normalized mortality rate
    fc = ee.FeatureCollection(fc_asset)

    merged_gdf = None  # To accumulate results

    for year in range(start_year, end_year + 1):
        print(f"Processing year {year}...")

        # Load PM2.5 image
        image_path = f"{asset_image_prefix}{year}"
        pm25_image = ee.Image(image_path)

        # Reduce regions to get the mean
        reduced_fc = pm25_image.reduceRegions(
            collection=fc,
            reducer=ee.Reducer.mean(),
            scale=1000
        )

        # Convert to GeoDataFrame
        gdf = geemap.ee_to_gdf(reduced_fc)

        # Rename and round
        gdf = gdf.rename(columns={'mean': f'PM25_{year}'})
        gdf[f'PM25_{year}'] = gdf[f'PM25_{year}'].round(2)

        # Keep IDDPTO, geometry, and PM2.5 related columns
        if merged_gdf is None:
            merged_gdf = gdf[['IDDPTO','geometry', f'PM25_{year}']].copy()
        else:
            merged_gdf[f'PM25_{year}'] = gdf[f'PM25_{year}']

    # Save merged shapefile
    output_path = os.path.join(output_folder, merged_filename)
    merged_gdf.to_file(output_path, encoding='utf-8')
    print(f"\nFile saved at: {output_path}")

In [None]:
calculate_and_merge_pm25_by_year()
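
Inside the loop, each new year's column is attached to `merged_gdf` by row index, which gives correct results only if every `reduceRegions` call returns the departments in the same order. Below is a minimal sketch of a key-based alternative that joins the yearly tables on `IDDPTO`. The values are made up, and plain pandas DataFrames stand in for the GeoDataFrames.

```python
from functools import reduce

import pandas as pd

# Made-up yearly results; the 2002 rows arrive in a different order on purpose
yearly = [
    pd.DataFrame({"IDDPTO": ["02001", "06007"], "PM25_2001": [12.31, 9.87]}),
    pd.DataFrame({"IDDPTO": ["06007", "02001"], "PM25_2002": [10.02, 12.75]}),
]

# Join on the department ID so each value follows its department, not its row position
wide = reduce(lambda left, right: left.merge(right, on="IDDPTO", how="left"), yearly)
print(wide)
```

This variant relies on `IDDPTO` being unique within each yearly table; duplicated IDs would multiply rows in the join.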

In [None]:
# Load local pm2.5 shapefile
gdf_pm25 = gpd.read_file("pdt/asthma_mortality/data/shp/pm25_2001_2022.shp")

In [None]:
# drop geometry column
gdf_pm25 = gdf_pm25.drop(columns=['geometry'])

In [None]:
# Display the first few rows of the DataFrame
init_notebook_mode(all_interactive=True)
gdf_pm25.head()

In [None]:
# Load `gdf` from a shapefile containing asthma mortality data from 2001 to 2022.
gdf = gpd.read_file("pdt/asthma_mortality/data/shp/tma_2001_2022_2.shp", encoding='utf-8')

In [None]:
# Display the first few rows of the DataFrame
init_notebook_mode(all_interactive=True)
gdf.head()

In [None]:
# Merge gdf and gdf_pm25 based on 'IDDPTO', preserving all data from gdf
gdf_pm25_tma = gdf.merge(gdf_pm25, on='IDDPTO', how='left')
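
A left join keeps every row of `gdf`, so a department with no match in `gdf_pm25` ends up with missing PM2.5 values rather than being dropped. Below is a small sketch of how such rows could be spotted with the `indicator` option of `merge`. The tables and the `tma` column name are made up for illustration.

```python
import pandas as pd

# Made-up stand-ins for gdf (mortality) and gdf_pm25 (PM2.5 means)
mortality = pd.DataFrame({"IDDPTO": ["02001", "06007", "94015"], "tma": [1.2, 0.8, 0.5]})
pm25 = pd.DataFrame({"IDDPTO": ["02001", "06007"], "PM25_2001": [12.31, 9.87]})

# indicator=True adds a _merge column; in a left join it is "both" or "left_only"
check = mortality.merge(pm25, on="IDDPTO", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "IDDPTO"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every department in the mortality table received PM2.5 values.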


In [None]:
# Display the first few rows of the merged GeoDataFrame
init_notebook_mode(all_interactive=True)
gdf_pm25_tma.head()

In [None]:
# save gdf_pm25_tma as geopackage
gdf_pm25_tma.to_file("pdt/asthma_mortality/data/gpkg/tma_pm25_2001_2022.gpkg", driver="GPKG")


In [None]:
# save gdf_pm25_tma as a shapefile
gdf_pm25_tma.to_file("pdt/asthma_mortality/data/shp/tma_pm25_2001_2022.shp", encoding='utf-8')