# 04: Pre-process shapefiles
*Create the shapefiles used to aggregate climate data and communicate results.*

In [None]:
import fsspec
import geopandas as gpd
import pandas as pd

Inspired by the UHE-Daily dataset, the primary data product focuses on a set of ~13,000 human settlements around the world, as delineated by the [Global Human Settlement Urban Center Database](https://ghsl.jrc.ec.europa.eu/ghs_stat_ucdb2015mt_r2019a.php).

In [None]:
uhe_daily_cities = gpd.read_file(
    "s3://carbonplan-climate-impacts/extreme-heat/v1.0/inputs/GHSL_UCDB/GHS_STAT_UCDB2015MT_GLOBE_R2019A_V1_2.gpkg"
)

Expand the list of cities with a set of ~2,000 additional locations in the US.

In [None]:
additional_cities = gpd.read_file(
    "s3://carbonplan-climate-impacts/extreme-heat/v1.0/inputs/additional_us_cities.gpkg"
)

To support additional analyses of non-urban areas, further expand the list with a set of ~24,000 climatically similar regions from the Climate Impact Lab (as used in [Rode et al. (2021)](https://doi.org/10.1038/s41586-021-03883-8)).

In [None]:
regions_path = "s3://carbonplan-climate-impacts/extreme-heat/v1.0/inputs/high-res-regions-simplified.topo.json"
with fsspec.open(regions_path) as file:
    regions = gpd.read_file(file)

In [None]:
regions = regions.set_crs("EPSG:4326")
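
The `set_crs` call above only attaches a coordinate reference system label to the geometries; it does not move any coordinates, which is what one wants when a file was read without CRS metadata. The sketch below contrasts it with `to_crs`, which reprojects. The single point and the choice of `EPSG:3857` as a target are hypothetical and serve only as an illustration.

```python
import geopandas as gpd
from shapely.geometry import Point

# A hypothetical point with no CRS attached, as can happen after reading
# a file that carries no CRS metadata.
points = gpd.GeoSeries([Point(10.0, 50.0)])

# set_crs labels the existing coordinates; the numbers are unchanged.
labeled = points.set_crs("EPSG:4326")

# to_crs reprojects; the numbers change to the target CRS units.
projected = labeled.to_crs("EPSG:3857")

print(points.crs)
print(labeled.iloc[0].x, labeled.iloc[0].y)
print(projected.iloc[0].x, projected.iloc[0].y)
```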

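Select the unique identifier and name columns, along with the geometry, from each of the two city shapefiles and concatenate them. The two sources use different column names, so each row is populated only in the columns that come from its own source.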

In [None]:
all_cities = pd.concat(
    [
        uhe_daily_cities[["ID_HDC_G0", "UC_NM_MN", "geometry"]],
        additional_cities[["UACE20", "NAMELSAD20", "geometry"]],
    ],
    ignore_index=True,
)
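
As a minimal sketch of what this concatenation produces, the example below uses plain `pandas` DataFrames as stand-ins for the two city tables (geometry is left out for brevity, and all values are hypothetical). Because the identifier columns do not share names, `pd.concat` keeps all four columns and fills the ones a row does not have with missing values.

```python
import pandas as pd

# Hypothetical stand-ins for the two city tables; the column names match
# the notebook, the values are made up.
ghsl = pd.DataFrame({"ID_HDC_G0": [1, 2], "UC_NM_MN": ["City A", "City B"]})
us = pd.DataFrame({"UACE20": ["00001"], "NAMELSAD20": ["City C Urban Area"]})

# Columns are unioned; rows get NaN in the columns their source lacks.
combined = pd.concat([ghsl, us], ignore_index=True)

print(combined)
print(combined.isna().sum())
```

Downstream code that needs a single identifier per row therefore cannot rely on either source column alone.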

Overlay the cities on the regions and take the difference to create regions that exclude the cities. These will often look like donuts, with one or more empty holes where a region overlaps cities. The resulting regions support population-level analyses that aim to separate urban from non-urban effects. About 300 regions contain no non-city area, and these are dropped.

In [None]:
regions_excluding_cities = regions.overlay(all_cities, how="difference")
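
The behavior of the difference overlay can be sketched on toy data. The squares, names, and `toy_*` variables below are hypothetical and not part of the dataset: one region contains a small city and becomes a donut, while the other lies entirely inside a city and is expected to drop out of the result.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical toy data: two square regions and two square cities.
toy_regions = gpd.GeoDataFrame(
    {"hierid": ["region_a", "region_b"]},
    geometry=[box(0, 0, 4, 4), box(10, 10, 11, 11)],
    crs="EPSG:4326",
)
toy_cities = gpd.GeoDataFrame(
    {"name": ["city_1", "city_2"]},
    geometry=[box(1, 1, 2, 2), box(9, 9, 12, 12)],
    crs="EPSG:4326",
)

# Keep only the parts of each region that no city covers.
toy_difference = toy_regions.overlay(toy_cities, how="difference")

# region_a keeps its outline but gains a hole where city_1 sits;
# region_b is fully covered by city_2, so nothing of it remains.
remaining = toy_difference.geometry.iloc[0]
print(list(toy_difference["hierid"]))
print(len(remaining.interiors))
print(remaining.area)
```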

Combine the cities with the regions-with-cities-excluded into a single dataset.

In [None]:
all_regions = pd.concat(
    [all_cities, regions_excluding_cities[["gadmid", "hierid", "ISO", "geometry"]]],
    ignore_index=True,
)

Create a new unique identifier from the row index, which will be used in subsequent steps.

In [None]:
all_regions["processing_id"] = all_regions.index
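
Using the index as an identifier works because the preceding `pd.concat` call passes `ignore_index=True`, which replaces the source indexes with a fresh 0-based range. A minimal sketch with hypothetical two-row tables shows the difference:

```python
import pandas as pd

# Hypothetical stand-ins for the city and region tables.
cities = pd.DataFrame({"name": ["a", "b"]})
regions = pd.DataFrame({"hierid": ["x", "y"]})

# Without ignore_index, each table keeps its own labels, so they repeat.
kept = pd.concat([cities, regions])

# With ignore_index=True, the result gets a fresh, unique 0..n-1 index.
fresh = pd.concat([cities, regions], ignore_index=True)
fresh["processing_id"] = fresh.index

print(kept.index.tolist())
print(fresh["processing_id"].tolist())
```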

In [None]:
all_regions.to_file(
    "s3://carbonplan-climate-impacts/extreme-heat/v1.0/inputs/all_regions_and_cities.json",
    driver="GeoJSON",
)