<div style="padding: 25px; background-color: #e8f6f3; border-radius: 12px; border: 2px solid #1abc9c;">
    <h1 style="color: #16a085; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;">Step 01: The Foundation</h1>
    <h2 style="color: #1abc9c;">Multi-Modal ETL: Vector, Raster, and Census Data</h2>
    <p style="font-size: 1.1em; color: #2c3e50;">
        Before we can model site suitability, we must acquire source datasets from three
        distinct geospatial modalities and harmonise them into a common coordinate reference system.
    </p>
    <hr style="border-color: #1abc9c;">
    <table style="width:100%; border-collapse: collapse; margin-top: 10px; font-size: 0.95em;">
        <tr style="background: #16a085; color: white;">
            <th style="padding: 8px; text-align: left;">Data Source</th>
            <th style="padding: 8px; text-align: left;">Type</th>
            <th style="padding: 8px; text-align: left;">Format</th>
        </tr>
        <tr><td style="padding: 8px;">OpenStreetMap (via OSMnx)</td><td>Vector (Points/Polygons)</td><td>GeoDataFrame in EPSG:4326 &rarr; reprojected to EPSG:27700, saved as GeoJSON</td></tr>
        <tr style="background: #f0faf7;"><td style="padding: 8px;">LandScan Global Population</td><td>Raster (Pixels)</td><td>GeoTIFF, ~1 km resolution</td></tr>
        <tr><td style="padding: 8px;">ONS Census 2021 (Digimap)</td><td>Tabular (CSV)</td><td>Output Area level, BNG centroids</td></tr>
    </table>
    <p style="margin-top: 15px; color: #2c3e50;"><strong>Output CRS:</strong> EPSG:27700 (British National Grid) &mdash; all distances in metres.</p>
</div>

<div style="margin-top: 30px;">
    <h2 style="color: #2980b9; border-bottom: 2px solid #2980b9; padding-bottom: 10px;">Concept: Three Data Modalities</h2>
    <p>Geospatial data comes in fundamentally different representations. Our pipeline must reconcile all three before any analysis can begin.</p>
    <table style="width:100%; border-collapse: collapse; margin: 15px 0;">
        <tr style="background: #2980b9; color: white;">
            <th style="padding: 8px;">Modality</th><th style="padding: 8px;">Representation</th><th style="padding: 8px;">Example in This Pipeline</th><th style="padding: 8px;">Spatial Precision</th>
        </tr>
        <tr>
            <td style="padding: 8px;"><strong>Vector</strong></td>
            <td>Discrete geometric primitives (points, lines, polygons) with exact coordinate pairs</td>
            <td>OSM cafe locations (Points), building footprints (Polygons)</td>
            <td>Sub-metre (coordinate precision)</td>
        </tr>
        <tr style="background: #f0f6fb;">
            <td style="padding: 8px;"><strong>Raster</strong></td>
            <td>Regular grid of cells (pixels), each storing a numeric value &mdash; analogous to a satellite image band</td>
            <td>LandScan population grid &mdash; each pixel holds an estimated population count</td>
            <td>~1 km per pixel</td>
        </tr>
        <tr>
            <td style="padding: 8px;"><strong>Tabular</strong></td>
            <td>Attribute records keyed by a spatial identifier (Output Area code), linkable to geometry via a join</td>
            <td>ONS Census CSVs &mdash; demographic percentages per Output Area, with BNG centroids</td>
            <td>OA centroid (~125 m median OA diameter in Camden)</td>
        </tr>
    </table>
    <p><strong>Key Principle:</strong> All three must be projected to the same Coordinate Reference System (CRS)
    before they can interact. We use:</p>
    <p style="text-align: center; font-size: 1.1em;">
        $$\mathbf{x}_{BNG} = T_{4326 \to 27700}(\mathbf{x}_{WGS84})$$
    </p>
    <p>where $T$ combines the datum shift from WGS84 to OSGB36 with the Ordnance Survey's Transverse Mercator projection for Great Britain.</p>
</div>
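
<p>To make the raster row of the table concrete, the sketch below maps a longitude/latitude pair to a pixel index on a north-up grid and back to the pixel centre. The grid origin, the 30 arc-second pixel size, and the helper names <code>lonlat_to_rowcol</code> and <code>pixel_centre</code> are illustrative assumptions, not values read from the LandScan file.</p>

```python
import math

# Hypothetical north-up grid: the origin is the top-left corner, in degrees.
ORIGIN_LON = -1.0
ORIGIN_LAT = 52.0
PIXEL_SIZE = 1.0 / 120.0  # assumed 30 arc-second cells


def lonlat_to_rowcol(lon, lat):
    """Return the (row, col) of the cell containing the given lon/lat."""
    col = math.floor((lon - ORIGIN_LON) / PIXEL_SIZE)
    row = math.floor((ORIGIN_LAT - lat) / PIXEL_SIZE)  # rows count downwards from the top
    return row, col


def pixel_centre(row, col):
    """Return the lon/lat of the centre of cell (row, col)."""
    lon = ORIGIN_LON + (col + 0.5) * PIXEL_SIZE
    lat = ORIGIN_LAT - (row + 0.5) * PIXEL_SIZE
    return lon, lat


# An approximate point in Camden, used only as an example input.
query_lon, query_lat = -0.1426, 51.5390
row, col = lonlat_to_rowcol(query_lon, query_lat)
centre_lon, centre_lat = pixel_centre(row, col)
print(f"cell ({row}, {col}) centred at ({centre_lon:.4f}, {centre_lat:.4f})")
```

<p>Every point that falls in the same cell receives the same value, which is why raster precision is bounded by the pixel size. With a real file, <code>rasterio</code> exposes the actual grid through <code>src.transform</code>, and <code>src.index(x, y)</code> performs the equivalent lookup.</p>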

In [None]:
import os
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

import osmnx as ox
import pandas as pd
import geopandas as gpd
import rasterio
from shapely.geometry import Point

# Create full directory scaffold
for d in ['data/raw', 'data/processed', 'data/outputs']:
    os.makedirs(d, exist_ok=True)

print("Libraries loaded. Directory scaffold created.")

<div style="margin-top: 30px;">
    <h2 style="color: #d35400; border-bottom: 2px solid #d35400; padding-bottom: 10px;">Task 1: Fetching &amp; Categorising Vector Data (Amenities)</h2>
    <p>We use <code>OSMnx</code> to query OpenStreetMap for Points of Interest (POIs) within Camden.
    Each POI is classified into a <strong>business role</strong> that determines how it interacts with a hypothetical new coffee shop:</p>
    <table style="width:100%; border-collapse: collapse; margin: 15px 0; font-size: 0.95em;">
        <tr style="background: #d35400; color: white;">
            <th style="padding: 8px;">Role</th>
            <th style="padding: 8px;">OSM Tags</th>
            <th style="padding: 8px;">Logic</th>
        </tr>
        <tr>
            <td style="padding: 8px;"><strong>Competitor</strong></td>
            <td><code>amenity=cafe</code>, <code>amenity=coffee_shop</code></td>
            <td>Direct competition &mdash; reduces site score</td>
        </tr>
        <tr style="background: #fdf2e9;">
            <td style="padding: 8px;"><strong>Synergy</strong></td>
            <td><code>amenity=gym|university|office|library|leisure_centre</code>, <code>leisure=fitness_centre|sports_centre</code></td>
            <td>Generates complementary foot traffic</td>
        </tr>
        <tr>
            <td style="padding: 8px;"><strong>Anchor</strong></td>
            <td><code>public_transport=station</code></td>
            <td>High-footfall transit node</td>
        </tr>
        <tr style="background: #fdf2e9;">
            <td style="padding: 8px;"><strong>Other</strong></td>
            <td><code>shop=bakery|supermarket</code></td>
            <td>Contextual &mdash; retained for enrichment</td>
        </tr>
    </table>
    <p><strong>Critical ordering:</strong> We reproject to EPSG:27700 <em>before</em> computing centroids.
    In EPSG:4326, 1&deg; longitude &ne; 1&deg; latitude at 51&deg;N, so centroids of large polygons would be geometrically distorted.</p>
</div>
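
<p>The role table is applied in precedence order, so a POI that matches several rows takes the first matching role. The sketch below shows that ordering on plain dictionaries, assuming missing tags arrive as <code>None</code> or <code>NaN</code>; <code>clean_tag</code> and <code>categorise_tags</code> are hypothetical helper names used only for illustration.</p>

```python
import math

COMPETITOR_AMENITY = {'cafe', 'coffee_shop'}
SYNERGY_AMENITY = {'gym', 'university', 'office', 'library', 'leisure_centre'}
SYNERGY_LEISURE = {'fitness_centre', 'sports_centre'}


def clean_tag(value):
    """Normalise a tag value: None and NaN become an empty string."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ''
    return str(value)


def categorise_tags(tags):
    """Apply the role rules in order: Competitor, Synergy, Anchor, Other."""
    amenity = clean_tag(tags.get('amenity'))
    leisure = clean_tag(tags.get('leisure'))
    transport = clean_tag(tags.get('public_transport'))

    if amenity in COMPETITOR_AMENITY:
        return 'Competitor'
    if amenity in SYNERGY_AMENITY or leisure in SYNERGY_LEISURE:
        return 'Synergy'
    if transport == 'station':
        return 'Anchor'
    return 'Other'


# Toy inputs (not real OSM records) covering each branch.
examples = [
    {'amenity': 'cafe', 'shop': 'bakery'},                    # cafe wins over shop
    {'amenity': float('nan'), 'leisure': 'fitness_centre'},   # missing amenity
    {'amenity': 'library', 'public_transport': 'station'},    # Synergy precedes Anchor
    {'public_transport': 'station'},
    {'shop': 'supermarket'},
]
roles = [categorise_tags(t) for t in examples]
print(roles)
```

<p>Because Synergy is tested before Anchor, a library inside a station is labelled Synergy; reordering the checks would change that outcome.</p>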

In [None]:
place_name = "London Borough of Camden"

# Define what we are looking for — expanded tag set
tags = {
    'amenity': ['cafe', 'coffee_shop', 'gym', 'university', 'office',
                'library', 'leisure_centre'],
    'leisure': ['fitness_centre', 'sports_centre'],
    'shop': ['bakery', 'supermarket'],
    'public_transport': ['station']
}

print(f"Fetching POIs for {place_name}...")
pois_raw = ox.features_from_place(place_name, tags)

# CRITICAL FIX: Reproject to BNG BEFORE computing centroids
# In EPSG:4326, 1° longitude ≠ 1° latitude at 51°N, so centroids in
# degrees are geometrically incorrect for large building polygons.
pois_bng = pois_raw.to_crs(epsg=27700)
pois_bng['geometry'] = pois_bng.centroid
# geom_type is the supported accessor; GeoSeries.type is deprecated in recent GeoPandas
pois_bng = pois_bng[pois_bng.geometry.geom_type == 'Point'].copy()

assert pois_bng.crs.to_epsg() == 27700, f"CRS mismatch: {pois_bng.crs}"

# Categorise POIs by business role
def categorize(row):
    """Map OSM tags to business roles for the site selection model."""
    amenity = row.get('amenity', '') or ''
    leisure = row.get('leisure', '') or ''
    transport = row.get('public_transport', '') or ''

    if amenity in ('cafe', 'coffee_shop'):
        return 'Competitor'
    if amenity in ('gym', 'university', 'office', 'library', 'leisure_centre') \
       or leisure in ('fitness_centre', 'sports_centre'):
        return 'Synergy'
    if transport == 'station':
        return 'Anchor'
    return 'Other'

pois_bng['role'] = pois_bng.apply(categorize, axis=1)

print(f"Fetched {len(pois_bng)} POIs. Role breakdown:")
print(pois_bng['role'].value_counts().to_string())

<div style="margin-top: 30px;">
    <h2 style="color: #884ea0; border-bottom: 2px solid #884ea0; padding-bottom: 10px;">Concept: Coordinate-Dependent Metric Distortion</h2>
    <p>Geographic coordinates (EPSG:4326) use angular units (degrees). At London's latitude (about 51.5&deg;N),
    the metric scale is <strong>anisotropic</strong>: 1&deg; longitude spans ~69.4 km while 1&deg; latitude spans ~111.3 km,
    so a degree of latitude is about 60% longer than a degree of longitude. Distance, area, and buffer computations in EPSG:4326
    are therefore geometrically incorrect because they implicitly assume isotropic unit spacing, and GeoPandas warns that centroids computed in a geographic CRS are likely to be incorrect.</p>
    <ul>
        <li><strong>EPSG:4326 (WGS84):</strong> Angular CRS in degrees. Suitable for storage and interchange,
        but <strong>not for metric operations</strong> (centroid, buffer, distance, area).</li>
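        <li><strong>EPSG:27700 (British National Grid):</strong> Transverse Mercator projection optimised for Great Britain.
        Coordinates are in metres, and because the projection is conformal the local scale is the same in every direction (within a small fraction of a percent of true ground distance). Required for all geometric computations in this pipeline.</li>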
    </ul>
    <p><strong>Fix applied above:</strong> The original notebook computed <code>pois.centroid</code> while the data was still in EPSG:4326.
    We now reproject to EPSG:27700 <em>first</em>, then compute centroids in metres, eliminating the anisotropic distortion.</p>
    <p>The cell below verifies that the reprojection succeeded and the data falls within Camden's BNG bounding box.</p>
</div>
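
<p>The degree lengths quoted above can be checked with a short calculation on the WGS84 ellipsoid. This is a minimal sketch using the standard radius-of-curvature formulas; the function name <code>metres_per_degree</code> and the choice of 51.5&deg;N as a representative London latitude are assumptions made for illustration.</p>

```python
import math

# WGS84 ellipsoid constants.
A = 6378137.0              # semi-major axis in metres
F = 1 / 298.257223563      # flattening
E2 = F * (2 - F)           # first eccentricity squared


def metres_per_degree(lat_deg):
    """Return (metres per degree of latitude, metres per degree of longitude)."""
    phi = math.radians(lat_deg)
    w = math.sqrt(1 - E2 * math.sin(phi) ** 2)
    meridional_radius = A * (1 - E2) / w ** 3   # north-south radius of curvature
    parallel_radius = A * math.cos(phi) / w     # radius of the circle of latitude
    one_degree = math.radians(1)
    return one_degree * meridional_radius, one_degree * parallel_radius


lat_m, lon_m = metres_per_degree(51.5)
lat_km, lon_km = lat_m / 1000, lon_m / 1000
ratio = lat_km / lon_km
print(f"1 deg latitude  ~ {lat_km:.1f} km")
print(f"1 deg longitude ~ {lon_km:.1f} km")
print(f"ratio           ~ {ratio:.2f}")
```

<p>A buffer of 0.001&deg; drawn in EPSG:4326 would therefore be an ellipse on the ground, stretched north-south by roughly this ratio, rather than a circle.</p>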

In [None]:
# === CRS Verification Dashboard ===
print(f"CRS: {pois_bng.crs}")
print(f"POI count: {len(pois_bng)}")
print(f"\nBounding box (BNG metres):")
bounds = pois_bng.total_bounds
print(f"  X: {bounds[0]:.0f} – {bounds[2]:.0f}")
print(f"  Y: {bounds[1]:.0f} – {bounds[3]:.0f}")

# Camden in BNG: ~526000-530200 E, ~182600-187100 N
assert 520000 < bounds[0] < 535000, f"X min outside Camden: {bounds[0]}"
assert 520000 < bounds[2] < 535000, f"X max outside Camden: {bounds[2]}"
assert 178000 < bounds[1] < 192000, f"Y min outside Camden: {bounds[1]}"
assert 178000 < bounds[3] < 192000, f"Y max outside Camden: {bounds[3]}"

print("\nAll spatial assertions passed.")
print(f"\nRole distribution:")
print(pois_bng['role'].value_counts().to_frame('count'))

<div style="margin-top: 30px;">
    <h2 style="color: #27ae60; border-bottom: 2px solid #27ae60; padding-bottom: 10px;">Task 2: Loading Raster &amp; Census Data</h2>
    <p>We load two data sources in this cell:</p>
    <ol>
        <li><strong>LandScan Raster</strong> &mdash; metadata inspection only (extraction happens in Notebook 02 via <code>rasterstats.zonal_stats</code>).</li>
        <li><strong>ONS Census 2021</strong> &mdash; three CSV files from EDINA Digimap, merged on <code>geog_code</code> (Output Area identifier).</li>
    </ol>
    <table style="width:100%; border-collapse: collapse; margin: 15px 0; font-size: 0.95em;">
        <tr style="background: #27ae60; color: white;">
            <th style="padding: 8px;">CSV</th>
            <th style="padding: 8px;">Key Columns</th>
            <th style="padding: 8px;">Business Relevance</th>
        </tr>
        <tr>
            <td style="padding: 8px;">Economic Activity</td>
            <td><code>employed_total_perc</code>, <code>retired_perc</code>, <code>unemployed_perc</code></td>
            <td>Employed population = daytime foot traffic</td>
        </tr>
        <tr style="background: #eafaf1;">
            <td style="padding: 8px;">Age Structure</td>
            <td><code>age_16_to_34_perc</code>, <code>age_65_plus_perc</code></td>
            <td>Young adults over-index on specialty coffee consumption</td>
        </tr>
        <tr>
            <td style="padding: 8px;">Qualifications</td>
            <td><code>level4_perc</code>, <code>no_qualifications_perc</code></td>
            <td>Degree-level education correlates with specialty coffee demand</td>
        </tr>
    </table>
    <p><strong>Join key:</strong> <code>geog_code</code> (ONS Output Area code, e.g. <code>E00000001</code>). All 3 CSVs share this column with 846 Camden OAs.</p>
    <p><strong>Centroid source:</strong> Each CSV contains <code>centroid_x</code> and <code>centroid_y</code> already in EPSG:27700 (BNG metres).</p>
</div>
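
<p>A left join on <code>geog_code</code> silently yields missing values when a key is absent from one file, and duplicates rows when a key repeats. The sketch below checks both conditions before merging, using plain lists of dictionaries in place of the real CSVs; <code>check_join_keys</code> and the toy rows are hypothetical.</p>

```python
from collections import Counter


def check_join_keys(tables, key='geog_code'):
    """Return a dict of key problems; an empty dict means the tables line up.

    The first table in `tables` is treated as the left side of the join.
    """
    problems = {}
    key_sets = {}
    for name, rows in tables.items():
        counts = Counter(row[key] for row in rows)
        duplicates = sorted(k for k, n in counts.items() if n > 1)
        if duplicates:
            problems[f'{name}: duplicate keys'] = duplicates
        key_sets[name] = set(counts)

    base_name = next(iter(tables))
    for name, keys in key_sets.items():
        missing = sorted(key_sets[base_name] - keys)
        if missing:
            problems[f'{name}: missing keys'] = missing
    return problems


# Toy rows with made-up Output Area codes, not real census records.
tables = {
    'econ': [{'geog_code': 'E00000001'}, {'geog_code': 'E00000003'}],
    'age': [{'geog_code': 'E00000001'}],
    'qual': [{'geog_code': 'E00000001'}, {'geog_code': 'E00000001'},
             {'geog_code': 'E00000003'}],
}
problems = check_join_keys(tables)
for label, keys in problems.items():
    print(label, keys)
```

<p>Running a check like this before imputation matters because median filling would otherwise mask a failed join as ordinary missing data.</p>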

In [None]:
# === Raster Metadata Inspection ===
raster_path = "landscan-mosaic-unitedkingdom-v1.tif"
with rasterio.open(raster_path) as src:
    print(f"Raster: {src.name}")
    print(f"  CRS : {src.crs}")
    print(f"  Res : {src.res}")
    print(f"  Bounds: {src.bounds}")

# === Load ALL 3 Census CSVs ===
econ_df = pd.read_csv("ons-economic-ew-2021_6304504/ons-economic-ew-2021.csv")
age_df  = pd.read_csv("ons-age-ew-2021_6304503/ons-age-ew-2021.csv")
qual_df = pd.read_csv("ons-qualifications-ew-2021_6304505/ons-qualifications-ew-2021.csv")

print(f"\nEconomic : {len(econ_df)} OAs")
print(f"Age      : {len(age_df)} OAs")
print(f"Quals    : {len(qual_df)} OAs")

# === Merge on geog_code (Output Area identifier) ===
census_merged = econ_df[['geog_code', 'geog_name', 'centroid_x', 'centroid_y',
                          'denom_total', 'employed_total_perc', 'retired_perc',
                          'unemployed_perc']].copy()

census_merged = census_merged.merge(
    age_df[['geog_code', 'age_16_to_34_perc', 'age_65_plus_perc']],
    on='geog_code', how='left'
)

census_merged = census_merged.merge(
    qual_df[['geog_code', 'level4_perc', 'no_qualifications_perc']],
    on='geog_code', how='left'
)

# === Convert to GeoDataFrame (centroids already in BNG) ===
geometry = gpd.points_from_xy(census_merged['centroid_x'], census_merged['centroid_y'])
census_gdf = gpd.GeoDataFrame(census_merged, geometry=geometry, crs="EPSG:27700")

# === Null handling — fill missing percentages with column median ===
# Median imputation is more defensible than zero (which implies "0% employed")
pct_cols = [c for c in census_gdf.columns if c.endswith('_perc')]
for col in pct_cols:
    median_val = census_gdf[col].median()
    census_gdf[col] = census_gdf[col].fillna(median_val)

print(f"\nMerged census: {len(census_gdf)} OAs, {len(census_gdf.columns)} columns")
print(f"CRS: {census_gdf.crs}")
print(f"Demographic columns: {pct_cols}")
print(f"Null check: {census_gdf[pct_cols].isnull().sum().sum()} nulls remaining")
census_gdf.head(3)

<div style="margin-top: 30px;">
    <h2 style="color: #e67e22; border-bottom: 2px solid #e67e22; padding-bottom: 10px;">Checkpoint: Saving for Step 02</h2>
    <p>We persist all cleaned data so that downstream notebooks can load pre-processed inputs without re-running the ETL.</p>
    <table style="width:100%; border-collapse: collapse; margin: 15px 0; font-size: 0.95em;">
        <tr style="background: #e67e22; color: white;">
            <th style="padding: 8px;">File</th>
            <th style="padding: 8px;">Format</th>
            <th style="padding: 8px;">Consumer</th>
            <th style="padding: 8px;">Contents</th>
        </tr>
        <tr>
            <td style="padding: 8px;"><code>camden_pois.geojson</code></td>
            <td>GeoJSON</td>
            <td>Notebook 02 &amp; 03</td>
            <td>POIs with <code>role</code> column, EPSG:27700</td>
        </tr>
        <tr style="background: #fdf2e9;">
            <td style="padding: 8px;"><code>camden_census_oa.geojson</code></td>
            <td>GeoJSON</td>
            <td>Notebook 02 (spatial join)</td>
            <td>846 OAs with all demographic columns, EPSG:27700</td>
        </tr>
        <tr>
            <td style="padding: 8px;"><code>cleaned_census.csv</code></td>
            <td>CSV</td>
            <td>Notebook 02 (flat fallback)</td>
            <td>Same data, flat CSV for backward compatibility</td>
        </tr>
    </table>
</div>
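
<p>RFC 7946 defines GeoJSON coordinates as WGS84 longitude/latitude, so a file stored in EPSG:27700 relies on the legacy <code>crs</code> member from the earlier GeoJSON specification to signal its projection. The sketch below reads that member from a hand-written sample document; the sample content, its coordinates, and the helper name <code>declared_crs</code> are assumptions for illustration, not output captured from this pipeline.</p>

```python
import json

RFC7946_DEFAULT = 'urn:ogc:def:crs:OGC::CRS84'  # WGS84 longitude/latitude


def declared_crs(geojson_text):
    """Return the CRS name declared in a GeoJSON document, or the RFC 7946 default."""
    doc = json.loads(geojson_text)
    crs = doc.get('crs')
    if not crs:
        return RFC7946_DEFAULT
    return crs.get('properties', {}).get('name', 'unknown')


# Hand-written sample using the legacy "crs" member (hypothetical feature).
sample = json.dumps({
    "type": "FeatureCollection",
    "crs": {"type": "name",
            "properties": {"name": "urn:ogc:def:crs:EPSG::27700"}},
    "features": [{
        "type": "Feature",
        "properties": {"role": "Competitor"},
        "geometry": {"type": "Point", "coordinates": [529000.0, 184000.0]},
    }],
})

crs_name = declared_crs(sample)
is_bng = crs_name.endswith('EPSG::27700')
print(crs_name, is_bng)
```

<p>Downstream notebooks that load the files with GeoPandas can simply assert <code>gdf.crs.to_epsg() == 27700</code> after reading; tools that follow RFC 7946 strictly may ignore the <code>crs</code> member and treat the coordinates as degrees.</p>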

In [None]:
# === Save POIs (trimmed to essential columns) ===
keep_cols = ['geometry', 'role', 'amenity', 'leisure', 'public_transport', 'name', 'shop']
pois_save = pois_bng[[c for c in keep_cols if c in pois_bng.columns]].copy()
pois_save.to_file("data/processed/camden_pois.geojson", driver='GeoJSON')

# === Save merged census as GeoJSON (for spatial joins in 02) ===
census_gdf.to_file("data/processed/camden_census_oa.geojson", driver='GeoJSON')

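# === Save as CSV for backward compatibility (02 loads this) ===
# Written from census_gdf (not census_merged) so the CSV is guaranteed to carry
# the same median-imputed values as the GeoJSON.
census_gdf.drop(columns='geometry').to_csv("data/processed/cleaned_census.csv", index=False)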

print(f"Saved: data/processed/camden_pois.geojson ({len(pois_save)} POIs)")
print(f"Saved: data/processed/camden_census_oa.geojson ({len(census_gdf)} OAs)")
print(f"Saved: data/processed/cleaned_census.csv (flat backup)")
print("\nFoundation complete. Ready for Step 02: The Grid.")