<div style="padding: 25px; background-color: #fdf2e9; border-radius: 12px; border: 2px solid #e67e22;">
    <h1 style="color: #d35400; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;">Step 02: The Grid</h1>
    <h2 style="color: #e67e22;">Spatial Indexing and Multi-Modal Enrichment</h2>
    <p style="font-size: 1.1em; color: #2c3e50;">
        With clean vector (POIs), raster (LandScan), and tabular (Census) data from Step 01,
        we now create a <strong>unified spatial index</strong> &mdash; an H3 hexagonal grid &mdash;
        and enrich each cell with population counts and demographic profiles.
    </p>
    <hr style="border-color: #e67e22;">
    <table style="width:100%; border-collapse: collapse; margin-top: 10px; font-size: 0.95em;">
        <tr style="background: #d35400; color: white;">
            <th style="padding: 8px; text-align: left;">Stage</th>
            <th style="padding: 8px; text-align: left;">Notebook</th>
            <th style="padding: 8px; text-align: left;">Files / Action</th>
        </tr>
        <tr><td style="padding: 8px;"><strong>Input</strong></td><td>Notebook 01</td><td><code>camden_census_oa.geojson</code>, LandScan raster</td></tr>
        <tr style="background: #fef9f4;"><td style="padding: 8px;"><strong>Process</strong></td><td>This notebook</td><td>H3 grid &rarr; zonal stats &rarr; census spatial join</td></tr>
        <tr><td style="padding: 8px;"><strong>Output</strong></td><td>Notebook 03</td><td><code>camden_h3_grid.parquet</code> (enriched master grid)</td></tr>
    </table>
    <p style="margin-top: 15px; color: #2c3e50;"><strong>H3 Resolution:</strong> 9 (~174 m edge length) &mdash; one hexagon &asymp; one 2-minute walking block.</p>
</div>

<div style="margin-top: 30px;">
    <h2 style="color: #2e86c1; border-bottom: 2px solid #2e86c1; padding-bottom: 10px;">Concept: Why Hexagons (H3)?</h2>
    <p>Uber's <strong>H3</strong> library partitions the globe into hierarchical hexagonal cells.
    Compared to square grids:</p>
    <ul>
        <li><strong>Squares:</strong> 4 edge-neighbours and 4 corner-neighbours whose centres lie at two different distances, so adjacency is ambiguous.</li>
        <li><strong>Hexagons:</strong> all 6 neighbours share an edge and their centres are equidistant from the central cell, which suits walking-distance analysis.</li>
    </ul>
    <table style="width:100%; border-collapse: collapse; margin: 15px 0; font-size: 0.95em;">
        <tr style="background: #2e86c1; color: white;">
            <th style="padding: 8px;">Resolution</th>
            <th style="padding: 8px;">Edge Length</th>
            <th style="padding: 8px;">Hex Area</th>
            <th style="padding: 8px;">Use Case</th>
        </tr>
        <tr><td style="padding: 8px;">7</td><td>~1.2 km</td><td>~5.16 km&sup2;</td><td>District-level analysis</td></tr>
        <tr style="background: #eaf2f8;"><td style="padding: 8px;">8</td><td>~460 m</td><td>~0.74 km&sup2;</td><td>Neighbourhood catchment</td></tr>
        <tr style="font-weight: bold; background: #d4efdf;"><td style="padding: 8px;">9</td><td>~174 m</td><td>~0.105 km&sup2;</td><td>Walking-scale / 15-min city (our choice)</td></tr>
        <tr style="background: #eaf2f8;"><td style="padding: 8px;">10</td><td>~66 m</td><td>~0.015 km&sup2;</td><td>Street-level micro-analysis</td></tr>
    </table>
    <p><strong>Hierarchical property:</strong> Every Res-9 hex has exactly one Res-5 parent.
    We exploit this in the ML notebook for Spatial Block Cross-Validation
    (grouping child hexes by parent to prevent spatial leakage).</p>
    <p>Hex area formula (regular planar hexagon): $A_{hex} = \frac{3\sqrt{3}}{2} s^2 \approx 2.6 \times s^2$, where $s$ is the edge length.</p>
</div>
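
<p>As a quick check on the geometry above, the following standard-library sketch compares neighbour distances on a square grid and a hexagonal grid, and evaluates the planar area formula. It assumes ideal, flat, regular cells, and the function names are illustrative rather than part of the H3 API. Real H3 cells are only approximately regular and the table's edge lengths and areas are quoted averages, so the planar formula applied to a 174 m edge (about 0.079 km&sup2;) does not reproduce the tabulated area exactly.</p>

```python
import math

# Axial-coordinate offsets of the six edge-sharing neighbours of a hexagon
HEX_NEIGHBOUR_OFFSETS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def hex_area(edge: float) -> float:
    """Area of a regular planar hexagon: (3 * sqrt(3) / 2) * edge^2."""
    return 3 * math.sqrt(3) / 2 * edge ** 2


def hex_neighbour_distances(edge: float) -> list[float]:
    """Centre-to-centre distances to the six neighbours (pointy-top layout)."""
    distances = []
    for q, r in HEX_NEIGHBOUR_OFFSETS:
        # Convert axial coordinates (q, r) to Cartesian (x, y)
        x = edge * math.sqrt(3) * (q + r / 2)
        y = edge * 1.5 * r
        distances.append(math.hypot(x, y))
    return distances


def square_neighbour_distances(side: float) -> list[float]:
    """Centre-to-centre distances to the eight neighbours of a square cell."""
    return sorted(
        math.hypot(dx * side, dy * side)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )


print("Square neighbour distances:", square_neighbour_distances(1.0))
print("Hexagon neighbour distances:", hex_neighbour_distances(1.0))
print(f"Planar area for a 174 m edge: {hex_area(174.0) / 1e6:.4f} km^2")
```

<p>The square grid yields two distinct distances (edge and corner neighbours), while all six hexagon distances agree up to floating-point rounding.</p>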

In [None]:
import os
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

import h3
import geopandas as gpd
import pandas as pd
import numpy as np
import rasterstats
import rasterio
import osmnx as ox
from shapely.geometry import Polygon

# H3 version guard — v4 API is required
assert int(h3.__version__.split('.')[0]) >= 4, f"H3 v4+ required, got {h3.__version__}"

os.makedirs("data/outputs", exist_ok=True)

print(f"H3 version: {h3.__version__}")
print("Libraries loaded. Ready to build the grid.")

<div style="margin-top: 30px;">
    <h2 style="color: #117a65; border-bottom: 2px solid #117a65; padding-bottom: 10px;">Task 1: Generating the H3 Grid</h2>
    <p>We use <strong>Resolution 9</strong> (~174 m edge). At this scale, each hexagon covers roughly the ground
    a person can cross on foot in 2 minutes, which makes it the micro-unit of the 15-minute city (Moreno et al., 2021).</p>
    <p><strong>API note:</strong> H3 v4 uses <code>polygon_to_cells()</code> and <code>cell_to_boundary()</code>.
    The input polygon must be in <strong>WGS84</strong> (lat/lng), which is H3's native coordinate system.
    We convert to <code>LatLngPoly</code> format, where coordinates are <code>(lat, lng)</code> &mdash;
    note the reversed order compared to GeoJSON's <code>(lng, lat)</code>.</p>
</div>
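
<p>Because a swapped coordinate order fails silently, a small sanity check can help before calling H3. The sketch below uses only the standard library. The ring is an illustrative rectangle near central London (not the real Camden boundary), and the bounding box and helper names are rough assumptions introduced for this example. A plain range check would not catch the swap here, since both numbers are valid latitudes, so we test against an expected region instead.</p>

```python
Coord = tuple[float, float]


def lnglat_to_latlng(ring: list[Coord]) -> list[Coord]:
    """Flip GeoJSON/shapely order (lng, lat) to H3 order (lat, lng)."""
    return [(lat, lng) for lng, lat in ring]


def latlng_to_lnglat(ring: list[Coord]) -> list[Coord]:
    """Flip H3 order (lat, lng) back to GeoJSON/shapely order (lng, lat)."""
    return [(lng, lat) for lat, lng in ring]


def within_bbox(ring_latlng: list[Coord], lat_range: Coord, lng_range: Coord) -> bool:
    """True if every (lat, lng) vertex falls inside the expected region."""
    return all(
        lat_range[0] <= lat <= lat_range[1] and lng_range[0] <= lng <= lng_range[1]
        for lat, lng in ring_latlng
    )


# Illustrative ring in GeoJSON order (lng, lat); hypothetical, not the Camden boundary
geojson_ring = [(-0.20, 51.52), (-0.12, 51.52), (-0.12, 51.57), (-0.20, 51.57), (-0.20, 51.52)]

# Rough bounding box for Greater London (assumed values for this sketch)
LONDON_LAT = (51.2, 51.8)
LONDON_LNG = (-0.6, 0.4)

latlng_ring = lnglat_to_latlng(geojson_ring)
flipped_ok = within_bbox(latlng_ring, LONDON_LAT, LONDON_LNG)
unflipped_ok = within_bbox(geojson_ring, LONDON_LAT, LONDON_LNG)

print("Flipped ring inside expected region:", flipped_ok)
print("Un-flipped ring inside expected region:", unflipped_ok)
```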

In [None]:
RESOLUTION = 9

# Fetch Camden boundary in WGS84 (H3 expects lat/lng)
boundary = ox.geocode_to_gdf("London Borough of Camden")
boundary_wgs84 = boundary.to_crs(epsg=4326)
poly = boundary_wgs84.geometry.iloc[0]

# H3 v4: polygon_to_cells expects a LatLngPoly
# Shapely gives (lng, lat) — H3 expects (lat, lng)
outer_coords = list(poly.exterior.coords)
h3_poly = h3.LatLngPoly([(lat, lng) for lng, lat in outer_coords])

hex_ids = h3.polygon_to_cells(h3_poly, RESOLUTION)

# Convert H3 cell IDs to shapely Polygons
hex_polygons = []
for h_id in hex_ids:
    boundary_coords = h3.cell_to_boundary(h_id)
    # cell_to_boundary returns [(lat, lng), ...] — flip to (lng, lat) for shapely
    hex_polygons.append(Polygon([(lng, lat) for lat, lng in boundary_coords]))

h3_grid = gpd.GeoDataFrame(
    {'h3_index': list(hex_ids)},
    geometry=hex_polygons,
    crs="EPSG:4326"
)

# Validation
assert 400 < len(h3_grid) < 800, f"Unexpected hex count: {len(h3_grid)}"
print(f"Generated {len(h3_grid)} hexagons at Resolution {RESOLUTION}.")
print(f"CRS: {h3_grid.crs}")
h3_grid.head(3)

<div style="margin-top: 30px;">
    <h2 style="color: #884ea0; border-bottom: 2px solid #884ea0; padding-bottom: 10px;">Task 2: Zonal Statistics (Raster Enrichment)</h2>
    <p><strong>Concept:</strong> We overlay the H3 grid on the LandScan population raster
    and <em>sum</em> the population pixels that fall within each hexagon. This converts a gridded
    population surface into discrete per-hex population counts.</p>
    <p><strong>CRS match:</strong> Both the H3 grid and LandScan are in EPSG:4326 (WGS84),
    so <code>rasterstats.zonal_stats</code> can operate directly without reprojection.</p>
    <p><strong>Aggregation:</strong> <code>sum</code> &mdash; total estimated population within each hex.</p>
</div>
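
<p>To make the aggregation concrete, here is a minimal pure-Python sketch of a zonal sum on a toy raster. It is not how <code>rasterstats</code> is implemented. It assumes a pixel belongs to a zone when its centre lies inside the polygon, which we understand to be the behaviour when <code>all_touched</code> is left at its default of <code>False</code>. All names and values are hypothetical. Like the cell that follows, it returns <code>None</code> for a zone that captures no pixel centre.</p>

```python
def point_in_polygon(x: float, y: float, ring: list[tuple[float, float]]) -> bool:
    """Ray-casting test: count how many polygon edges a rightward ray crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def zonal_sum(values, x_min, y_max, pixel_size, ring):
    """Sum the pixels whose centres fall inside `ring`; None if there are none."""
    total, hit = 0.0, False
    for row_idx, row in enumerate(values):
        centre_y = y_max - (row_idx + 0.5) * pixel_size  # rows run north to south
        for col_idx, value in enumerate(row):
            centre_x = x_min + (col_idx + 0.5) * pixel_size
            if point_in_polygon(centre_x, centre_y, ring):
                total += value
                hit = True
    return total if hit else None


# Toy 4 x 4 "population" raster with 1-unit pixels and top-left corner at (0, 4)
toy_raster = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

zone = [(1, 1), (3, 1), (3, 3), (1, 3)]                   # covers four pixel centres
tiny_zone = [(0.1, 0.1), (0.4, 0.1), (0.4, 0.4), (0.1, 0.4)]  # covers none

zone_total = zonal_sum(toy_raster, 0.0, 4.0, 1.0, zone)
tiny_total = zonal_sum(toy_raster, 0.0, 4.0, 1.0, tiny_zone)
print("Zone sum:", zone_total)
print("Tiny zone sum:", tiny_total)
```

<p>The second case matters when hexagons are small relative to the raster's pixel size: a zone that contains no pixel centre contributes no population under this rule.</p>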

In [None]:
raster_path = "landscan-mosaic-unitedkingdom-v1.tif"

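# Verify CRS compatibility (zonal_stats does not reproject for us)
with rasterio.open(raster_path) as src:
    raster_crs = src.crs
    print(f"Raster CRS: {raster_crs}")
    print(f"Grid CRS:   {h3_grid.crs}")
assert raster_crs is not None and raster_crs.to_epsg() == h3_grid.crs.to_epsg(), \
    "Raster and grid CRS differ; reproject the grid before running zonal stats"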

# Zonal stats: sum population pixels under each hexagon
stats = rasterstats.zonal_stats(h3_grid, raster_path, stats="sum")
h3_grid['population'] = [s['sum'] if s['sum'] is not None else 0 for s in stats]

# Validation
assert (h3_grid['population'] >= 0).all(), "Negative population values detected"
print(f"\nPopulation enrichment complete.")
print(f"  Total population: {h3_grid['population'].sum():,.0f}")
print(f"  Mean per hex:     {h3_grid['population'].mean():.1f}")
print(f"  Max:              {h3_grid['population'].max():.0f}")
print(f"  Zero-pop hexes:   {(h3_grid['population'] == 0).sum()}")

<div style="margin-top: 30px;">
    <h2 style="color: #a04000; border-bottom: 2px solid #a04000; padding-bottom: 10px;">Task 3: Census Enrichment (Spatial Join)</h2>
    <p>We now link ONS Census 2021 demographics to the hex grid. The census is at <strong>Output Area (OA)</strong>
    level (846 OAs in Camden), while the grid has ~600 hexagons. The join strategy:</p>
    <ol>
        <li>Load the census GeoDataFrame from Notebook 01 (<code>camden_census_oa.geojson</code>, EPSG:27700).</li>
        <li>Project the hex grid to EPSG:27700 so distances are in metres.</li>
        <li>Use <code>sjoin_nearest</code> to assign each OA centroid to its closest hexagon.</li>
        <li>Aggregate: take the <strong>mean</strong> of each demographic percentage per hex (since multiple OAs can map to one hex).</li>
        <li>Fill hexes with no OA match using the <strong>borough median</strong> (conservative imputation).</li>
    </ol>
    <p><strong>Result:</strong> Every hexagon gains 7 demographic columns: <code>employed_total_perc</code>,
    <code>retired_perc</code>, <code>unemployed_perc</code>, <code>age_16_to_34_perc</code>,
    <code>age_65_plus_perc</code>, <code>level4_perc</code>, <code>no_qualifications_perc</code>.</p>
</div>
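
<p>Steps 3 to 5 of the strategy above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the hexagon and OA coordinates are hypothetical, only one demographic column is shown, and distance is measured to hexagon centroids, whereas <code>sjoin_nearest</code> measures distance to the hexagon geometry itself.</p>

```python
import math
from collections import defaultdict
from statistics import mean, median

# Hypothetical hexagon centroids in a metric CRS (x, y in metres)
hex_centroids = {
    "hex_a": (0.0, 0.0),
    "hex_b": (300.0, 0.0),
    "hex_c": (600.0, 0.0),
}

# Hypothetical OA centroids, each carrying one demographic percentage
oa_records = [
    {"xy": (10.0, 5.0), "retired_perc": 20.0},
    {"xy": (30.0, -10.0), "retired_perc": 40.0},
    {"xy": (590.0, 20.0), "retired_perc": 10.0},
]


def nearest_hex(xy, centroids):
    """Return the id of the hexagon whose centroid is closest to `xy`."""
    return min(centroids, key=lambda hex_id: math.dist(xy, centroids[hex_id]))


# Step 3: assign each OA centroid to its nearest hexagon
values_by_hex = defaultdict(list)
for record in oa_records:
    values_by_hex[nearest_hex(record["xy"], hex_centroids)].append(record["retired_perc"])

# Step 4: mean of the percentage per hexagon
hex_means = {hex_id: mean(values) for hex_id, values in values_by_hex.items()}

# Step 5: fill unmatched hexagons with the median across matched hexagons
fill_value = median(hex_means.values())
enriched = {hex_id: hex_means.get(hex_id, fill_value) for hex_id in hex_centroids}

print(enriched)
```

<p>Note that the fill value is the median across matched hexagons rather than across OAs, which mirrors how the notebook computes it from the merged grid.</p>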

In [None]:
# Load cleaned census GeoDataFrame from Notebook 01
census_gdf = gpd.read_file("data/processed/camden_census_oa.geojson")
print(f"Census loaded: {len(census_gdf)} OAs, CRS: {census_gdf.crs}")

# Project hex grid to BNG for spatial join (distance in metres)
h3_grid_bng = h3_grid.to_crs(epsg=27700)

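# Ensure the census layer is in BNG, then reduce each OA to its centroid
# so the join matches the strategy described above
census_pts = census_gdf.to_crs(epsg=27700)
census_pts['geometry'] = census_pts.geometry.centroid

# Spatial join: assign each OA centroid to its nearest hexagon
census_to_hex = gpd.sjoin_nearest(
    census_pts,
    h3_grid_bng[['h3_index', 'geometry']],
    how='left',
    distance_col='join_dist'
)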

print(f"Joined {len(census_to_hex)} OA-hex pairs.")
print(f"Max join distance: {census_to_hex['join_dist'].max():.0f}m")

# Aggregate: mean demographics per hexagon
demo_cols = ['employed_total_perc', 'retired_perc', 'unemployed_perc',
             'age_16_to_34_perc', 'age_65_plus_perc',
             'level4_perc', 'no_qualifications_perc']

hex_demographics = census_to_hex.groupby('h3_index')[demo_cols].mean().reset_index()

# Merge demographics into master grid
h3_grid = h3_grid.merge(hex_demographics, on='h3_index', how='left')

# Fill hexes with no OA match using borough median
for col in demo_cols:
    median_val = h3_grid[col].median()
    h3_grid[col] = h3_grid[col].fillna(median_val)

print(f"\nDemographic columns added: {demo_cols}")
print(f"Hexes with demographic data: {hex_demographics['h3_index'].nunique()}")
print(f"Null check: {h3_grid[demo_cols].isnull().sum().sum()} nulls remaining")
h3_grid[['h3_index', 'population'] + demo_cols].head(3)

<div style="margin-top: 30px;">
    <h2 style="color: #7d6608; border-bottom: 2px solid #7d6608; padding-bottom: 10px;">Output: The Master Grid</h2>
    <p>We save the enriched grid as <strong>Parquet</strong> &mdash; a columnar format that preserves
    geometry, CRS metadata, and data types efficiently.</p>
    <table style="width:100%; border-collapse: collapse; margin: 15px 0; font-size: 0.95em;">
        <tr style="background: #7d6608; color: white;">
            <th style="padding: 8px;">File</th>
            <th style="padding: 8px;">Format</th>
            <th style="padding: 8px;">Columns</th>
            <th style="padding: 8px;">Consumer</th>
        </tr>
        <tr>
            <td style="padding: 8px;"><code>camden_h3_grid.parquet</code></td>
            <td>GeoParquet</td>
            <td><code>h3_index</code>, <code>population</code>, 7 demographic <code>_perc</code> columns, geometry</td>
            <td>Notebook 03 (graph analytics)</td>
        </tr>
    </table>
    <p><strong>CRS:</strong> Saved in EPSG:27700 (BNG) so that Notebook 03 can compute Euclidean distances in metres.</p>
</div>

In [None]:
# Project to BNG for downstream analysis (Notebook 03 uses BNG distances)
h3_grid_bng = h3_grid.to_crs(epsg=27700)
h3_grid_bng.to_parquet("data/outputs/camden_h3_grid.parquet")

print(f"Master Grid saved: data/outputs/camden_h3_grid.parquet")
print(f"  {len(h3_grid_bng)} hexagons")
print(f"  {len(h3_grid_bng.columns)} columns: {list(h3_grid_bng.columns)}")
print(f"  CRS: {h3_grid_bng.crs}")
print("\nGrid complete. Ready for Step 03: The Intelligence.")