## Index Tutorial for NYC Neighborhood Tabulation Areas


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/danmillr/places-platforms/blob/main/02_Index/02_Index.ipynb)


---

## Overview & Learning Objectives
This tutorial guides you through constructing a spatial index using real-world data from New York City. We will build a **Heat Vulnerability Index (HVI)** that combines environmental and demographic indicators.

You will:
- Learn how spatial indices are used for policy and planning
- Join multiple spatial and tabular datasets
- Normalize variables to prepare them for aggregation
- Build a weighted composite index with dynamic inputs
- Create static and interactive maps
- Critically evaluate the assumptions behind your index

By the end, you should be able to construct your own index and reflect on how methodological decisions shape outcomes.

---

## 1. Setup
Install and import the required libraries. These include pandas and geopandas for data handling, matplotlib and folium for visualization, and ipywidgets for interactivity.

In [2]:
# Install required libraries (for Google Colab)
# scipy and scikit-learn are used in the Advanced Options section
!pip install pandas geopandas matplotlib folium ipywidgets scipy scikit-learn


In [6]:
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import folium
from folium import Choropleth
import ipywidgets as widgets
from IPython.display import display, Markdown

In [8]:
import pyproj

---

## 2. Designing the Index

Before building a composite index, it's essential to determine what you're trying to measure and how.

### Step 1: Who is this index for?
This heat vulnerability index (HVI) is meant to guide decision-makers and community members in identifying neighborhoods in New York City most at risk during extreme heat events.

You should consider:
- Who will use this index? (e.g., planners, emergency services, community groups)
- What decisions will it inform?
- What kinds of data are meaningful and actionable?

### Step 2: Define your goal
*"Which neighborhoods in NYC are most vulnerable to extreme heat due to environmental exposure and social risk factors?"*

### Step 3: Choose your dimensions and variables
We’ll begin with three variables across different dimensions:

| Dimension              | Variable                          | Source                           |
|------------------------|-----------------------------------|----------------------------------|
| Exposure               | Surface Temperature (`SURFACE_TEMP`)  | hvi-nta-2020.csv                 |
| Sensitivity            | % Crowded Households (`Percent`)      | nta_crowding.csv                |
| Adaptive Capacity      | % Tree Canopy (`canopy_pct`)         | canopystreettree_supp_nta.csv   |

Later, you may add or swap variables that reflect:
- Age
- Income
- Language isolation
- Access to AC or cooling centers

### Step 4: Define your spatial unit
We’ll use **Neighborhood Tabulation Areas (NTAs)** as the unit of analysis. These are aggregations of census tracts and represent stable planning units in NYC.

Now, let’s load the data and prepare it for analysis.


In [9]:
# Load datasets
nta_gdf = gpd.read_file("data/2020_ntas.geojson")
hvi_df = pd.read_csv("data/hvi-nta-2020.csv")
crowding_df = pd.read_csv("data/nta_crowding.csv")
canopy_df = pd.read_csv("data/canopystreettree_supp_nta.csv")

# Inspect each file
print("NTA GeoJSON:")
display(nta_gdf.head())

print("\nHeat Vulnerability Index CSV:")
display(hvi_df.head())

print("\nHousehold Crowding CSV:")
display(crowding_df.head())

print("\nCanopy Cover CSV:")
display(canopy_df.head())



---

## 4. Data Linking & Validation
This step merges the datasets into one table. `GEOCODE` links the HVI and crowding tables, and `NTACode` links the canopy table and the NTA geometries. Always inspect your joins to ensure you aren't introducing nulls or losing rows.

In [None]:
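# Align key names so the tables share join columns
crowding_df = crowding_df.rename(columns={"GeoID": "GEOCODE"})
canopy_df["NTACode"] = canopy_df["ntacode"].str.upper()

# Left joins keep every row of the left-hand table; unmatched rows get NaN
merged_df = hvi_df.merge(crowding_df, on="GEOCODE", how="left")
merged_df = merged_df.merge(canopy_df, on="NTACode", how="left")
# Attach the attributes to the NTA geometries last so the result stays a GeoDataFrame
merged_gdf = nta_gdf.merge(merged_df, on="NTACode", how="left")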
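
One way to inspect a join is pandas' `indicator=True` option, which adds a `_merge` column recording whether each row found a match. The sketch below uses two small made-up tables (the codes and values are illustrative, not taken from the tutorial data) to show the pattern.

```python
import pandas as pd

# Toy stand-ins for the tutorial tables; codes and values are made up.
left = pd.DataFrame({"NTACode": ["BK0101", "BK0102", "MN0201"],
                     "SURFACE_TEMP": [95.1, 97.3, 93.8]})
right = pd.DataFrame({"NTACode": ["BK0101", "MN0201", "QN0301"],
                      "canopy_pct": [18.0, 25.5, 30.2]})

# indicator=True adds a `_merge` column: 'both', 'left_only', or 'right_only'.
# validate="one_to_one" raises MergeError if either key column has duplicates.
checked = left.merge(right, on="NTACode", how="left",
                     indicator=True, validate="one_to_one")

# Keys from the left table that found no partner in the right table
unmatched = checked.loc[checked["_merge"] == "left_only", "NTACode"].tolist()
# A left join on unique keys should never change the number of rows
rows_preserved = len(checked) == len(left)

print(checked["_merge"].value_counts())
print("Unmatched keys:", unmatched)
print("Row count preserved:", rows_preserved)
```

Applied to the real tables, the same check would show which NTAs lack crowding or canopy values before they reach the index.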

---

## 5. Exploratory Data Analysis
We explore variable distributions to identify skew or outliers. This step also helps you reason about which transformations are needed before constructing the index.

In [None]:
merged_gdf['Percent'].dropna().astype(float).hist(bins=20)
plt.title("Crowding Rate Distribution")
plt.xlabel("Percent Crowded")
plt.ylabel("Number of NTAs")
plt.show()

Plotting a map of the raw variable:

In [None]:
merged_gdf.plot(column='Percent', cmap='OrRd', legend=True, figsize=(10,6))
plt.title("Crowding Rate by NTA")
plt.axis('off')
plt.show()
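
Skew and outliers can also be checked numerically. The sketch below, on made-up values, computes sample skewness with pandas and flags points more than 1.5 × IQR beyond the quartiles. That threshold is a common rule of thumb, not something this tutorial prescribes.

```python
import pandas as pd

# Made-up crowding percentages; one value is deliberately extreme.
values = pd.Series([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 25.0], name="Percent")

# Positive skewness indicates a long right tail
skewness = values.skew()

# Interquartile-range fences for flagging unusually low or high values
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"Skewness: {skewness:.2f}")
print("Flagged as outliers:", outliers.tolist())
```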

---

## 6. Normalization
To combine variables, they must be on the same scale. We use **min–max normalization**:

\[ x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \]

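This rescales every variable to the range 0 to 1, which is useful for visualization and additive indices. It is sensitive to outliers, because a single extreme value sets the range and compresses all other values toward one end.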

In [None]:
variables = ['Percent', 'SURFACE_TEMP', 'canopy_pct']
for var in variables:
    # Coerce to numeric first; entries that cannot be parsed become NaN
    col = pd.to_numeric(merged_gdf[var], errors='coerce')
    merged_gdf[f"{var}_norm"] = (col - col.min()) / (col.max() - col.min())
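
To see the outlier sensitivity in practice, the sketch below applies the same formula to made-up values with and without one extreme observation. The helper name `min_max` is illustrative.

```python
import pandas as pd

def min_max(s: pd.Series) -> pd.Series:
    """Rescale a numeric series to the 0-1 range."""
    return (s - s.min()) / (s.max() - s.min())

# Made-up values; the second series adds a single extreme observation.
typical = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])
with_outlier = pd.concat([typical, pd.Series([90.0])], ignore_index=True)

norm_typical = min_max(typical)
norm_outlier = min_max(with_outlier)

print(norm_typical.round(3).tolist())
print(norm_outlier.round(3).tolist())
```

On their own, the five original values spread across the full 0 to 1 range. Once the extreme value is added, they are squeezed into the bottom tenth of the scale, so differences between them contribute very little to an additive index.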

---

## 7. Build Your Index (Interactive)
You can set your own weights to explore different assumptions about which factors contribute most to heat vulnerability. Note that canopy is inverted to represent less vulnerability with more coverage.

In [None]:
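# Sliders for the three weights (keyword arguments make each setting explicit)
w_crowd = widgets.FloatSlider(value=0.33, min=0, max=1, step=0.01, description='Crowding')
w_temp = widgets.FloatSlider(value=0.33, min=0, max=1, step=0.01, description='Temperature')
w_canopy = widgets.FloatSlider(value=0.33, min=0, max=1, step=0.01, description='Canopy')

def update_index(crowd, temp, canopy):
    total = crowd + temp + canopy
    if total == 0:
        # Avoid dividing by zero when every slider is set to 0
        print("Set at least one weight above zero.")
        return

    # Weighted average; dividing by the total rescales the weights to sum to 1
    merged_gdf['custom_index'] = (
        crowd * merged_gdf['Percent_norm'] +
        temp * merged_gdf['SURFACE_TEMP_norm'] +
        canopy * (1 - merged_gdf['canopy_pct_norm'])  # more canopy = less vulnerable
    ) / total

    merged_gdf.plot(
        column='custom_index', cmap='plasma', legend=True, figsize=(10,6)
    )
    plt.title("Custom Heat Vulnerability Index")
    plt.axis('off')
    plt.show()

widgets.interact(update_index, crowd=w_crowd, temp=w_temp, canopy=w_canopy);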
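
The same calculation can be run without widgets to compare how two weighting schemes reorder neighborhoods. The sketch below uses three hypothetical areas with made-up normalized scores; `composite_index` is an illustrative helper, not part of the tutorial's code.

```python
import pandas as pd

# Made-up normalized scores (0-1) for three hypothetical areas.
scores = pd.DataFrame(
    {"Percent_norm": [0.9, 0.2, 0.5],
     "SURFACE_TEMP_norm": [0.3, 0.8, 0.5],
     "canopy_pct_norm": [0.6, 0.1, 0.5]},
    index=["Area A", "Area B", "Area C"],
)

def composite_index(df, crowd, temp, canopy):
    """Weighted average of the three indicators; canopy is inverted."""
    total = crowd + temp + canopy
    if total == 0:
        raise ValueError("At least one weight must be greater than zero.")
    return (crowd * df["Percent_norm"]
            + temp * df["SURFACE_TEMP_norm"]
            + canopy * (1 - df["canopy_pct_norm"])) / total

# Equal weights versus a scheme that counts crowding three times as much
equal = composite_index(scores, 1, 1, 1)
crowding_heavy = composite_index(scores, 3, 1, 1)

# Rank 1 = most vulnerable under each scheme
print(equal.rank(ascending=False))
print(crowding_heavy.rank(ascending=False))
```

With equal weights, Area B scores highest; when crowding is weighted three times as heavily, Area A moves to the top. The input data are identical in both runs, so the change in ranking comes entirely from the weighting assumption.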

---

## 8. Advanced Options
Explore advanced methods to improve, validate, or challenge your index construction. These techniques go beyond static weightings and help illuminate structure in your data.

**Alternative normalization:** Use z-score normalization to standardize variables based on their distance from the mean.

In [None]:
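# Z-score computed with pandas so rows with missing values stay aligned (as NaN).
# ddof=0 matches the default of scipy.stats.zscore.
percent = merged_gdf['Percent'].astype(float)
merged_gdf['Percent_z'] = (percent - percent.mean()) / percent.std(ddof=0)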

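Z-scores center each variable at zero with a standard deviation of one, so values are compared by how far they sit from the mean. Unlike min–max scaling, the result is not bounded to 0–1 and a single extreme value does not define the whole scale, although the mean and standard deviation are still influenced by outliers.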

**Add more variables:** You can enrich the index by including:
- % of residents over 65 (age vulnerability)
- % without air conditioning (limited adaptive capacity)
- % non-English speakers (language isolation)
- Access to public cooling centers or shaded green space

Make sure to normalize any new variables before combining them.
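
When adding variables, it also helps to record each one's direction, since some raise vulnerability and others lower it. A minimal sketch, assuming hypothetical column names (`pct_over_65`, `pct_with_ac`) and made-up values:

```python
import pandas as pd

def normalize(series: pd.Series, higher_is_worse: bool = True) -> pd.Series:
    """Min-max scale to 0-1; flip the scale when higher values are protective."""
    scaled = (series - series.min()) / (series.max() - series.min())
    return scaled if higher_is_worse else 1 - scaled

# Hypothetical columns with made-up values.
df = pd.DataFrame({"pct_over_65": [10.0, 15.0, 20.0],
                   "pct_with_ac": [95.0, 80.0, 65.0]})

# Record the direction of each new variable before combining them.
directions = {"pct_over_65": True, "pct_with_ac": False}
for col, higher_is_worse in directions.items():
    df[f"{col}_norm"] = normalize(df[col], higher_is_worse)

print(df)
```

After this step, a higher normalized value always means higher vulnerability, so the columns can be combined without inverting any of them in the index formula.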

**PCA (Principal Component Analysis):** PCA reduces multiple related variables into components that capture the most variance. It can be used to simplify and weight input dimensions empirically.

In [None]:
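from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ['Percent_norm', 'SURFACE_TEMP_norm', 'canopy_pct_norm']
# Keep only NTAs with complete data; PCA cannot handle missing values
valid = merged_gdf[features].astype(float).dropna()
X = StandardScaler().fit_transform(valid)
pca = PCA(n_components=1)
# Write scores back by index so rows with missing data stay NaN
merged_gdf.loc[valid.index, 'pca_index'] = pca.fit_transform(X).ravel()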
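
Two things are worth checking before using a PCA score as an index: how much variance the first component explains, and the sign of its loadings, since the direction of a principal component is arbitrary. The sketch below illustrates both with plain NumPy on made-up data rather than scikit-learn. The orientation rule (flip the component so the temperature loading is positive) is an assumption chosen for illustration.

```python
import numpy as np

# Made-up data: columns are crowding, temperature, canopy.
X_raw = np.array([
    [0.10, 0.20, 0.90],
    [0.30, 0.35, 0.70],
    [0.50, 0.55, 0.50],
    [0.70, 0.60, 0.35],
    [0.90, 0.85, 0.10],
])

# Standardize each column (mean 0, standard deviation 1).
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# PCA via singular value decomposition; rows of Vt are component loadings.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)
loadings = Vt[0]

# Orient the component so a higher score goes with a higher temperature.
if loadings[1] < 0:
    loadings = -loadings
scores = X_std @ loadings

print("Share of variance explained by PC1:", round(float(explained_ratio[0]), 3))
print("Loadings (crowding, temperature, canopy):", loadings.round(2))
```

If the first component explains only a modest share of the variance, a single PCA score discards much of the information in the inputs, and the loadings show which variables drive the score and in which direction.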

**Clustering:** Use algorithms like KMeans to segment NTAs into distinct groups of vulnerability, rather than using a single continuous index.

In [None]:
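from sklearn.cluster import KMeans

# X contains only rows with complete data, so write labels back by index
complete_rows = merged_gdf[features].dropna().index
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
merged_gdf.loc[complete_rows, 'cluster'] = kmeans.labels_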
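
Cluster labels are arbitrary integers, so they need to be interpreted before mapping. One way, sketched below with made-up values and hypothetical labels, is to compare the mean of each input variable by cluster.

```python
import pandas as pd

# Made-up normalized scores with hypothetical cluster labels.
df = pd.DataFrame({
    "Percent_norm":      [0.8, 0.9, 0.2, 0.1, 0.5],
    "SURFACE_TEMP_norm": [0.9, 0.7, 0.3, 0.2, 0.5],
    "canopy_pct_norm":   [0.1, 0.2, 0.8, 0.9, 0.5],
    "cluster":           [0, 0, 1, 1, 2],
})

# Mean of each indicator within each cluster, plus the cluster size.
profile = df.groupby("cluster").mean()
profile["n_areas"] = df.groupby("cluster").size()

print(profile.round(2))
```

In this toy table, cluster 0 combines high crowding and temperature with low canopy, while cluster 1 shows the opposite pattern. Profiles like these give each cluster a description that can be used in a map legend.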

These tools allow you to compare your subjective weightings with data-driven structures. They can help expose hidden patterns or support alternative interpretations of vulnerability.

---

## 9. Export Your Results
Export your final output as a GeoJSON to use in GIS or for sharing.

In [None]:
merged_gdf[['NTACode', 'custom_index', 'geometry']].to_file("custom_hvi.geojson", driver="GeoJSON")

---

## 10. Critical Reflection
Questions for consideration:
- What kinds of vulnerability are not captured here?
- How do your weights reflect your assumptions?
- Are there ethical concerns with publicly mapping vulnerability?
- What might a participatory or community-informed index look like?

Indices are powerful, but they are never neutral. Be reflective, transparent, and critical.