# Global ML Building Footprints Data Loader

This notebook loads Microsoft's Global ML Building Footprints dataset into Delta tables with the following features:
* **Geospatial filtering** - Loads building footprints for a specific US state
* **Batch processing** - Efficiently processes multiple quadkeys in batches
* **WKB geometry format** - Stores geometries as Well-Known Binary, a compact binary encoding that Spark can hold in a plain binary column
* **Parameterized** - Ready for job/pipeline execution

## Parameters

| Parameter | Description | Example |
|-----------|-------------|----------|
| `catalog` | Unity Catalog name | `odi_datalake` |
| `schema_name` | Schema name | `odi_bronze` |
| `state_name` | US state name (lowercase; must match the geometry file name) | `california` |
| `batch_size` | Number of quadkeys to process before writing | `10` |

## Dataset Information

* **Source**: [Microsoft Global ML Building Footprints](https://github.com/microsoft/GlobalMLBuildingFootprints)
* **Coverage**: United States building footprints (rows of the global dataset where `Location` is `UnitedStates`)
* **Format**: GeoJSON (compressed)
* **Spatial Index**: Quadkeys at zoom level 9
* **Attributes**: 
  * `confidence` - ML model prediction confidence score
  * `height` - Estimated building height in meters
  * `geometry_wkb` - Building footprint geometry (Well-Known Binary)
  * `quadkey` - Spatial partition identifier
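
To make the spatial index concrete, here is a minimal standard-library sketch of how a tile at a given zoom level maps to a quadkey. The notebook itself uses `mercantile` for this; the helper names `lonlat_to_tile` and `tile_to_quadkey` and the sample coordinates are illustrative assumptions, not part of the notebook.

```python
import math


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple:
    """Return the (x, y) Web Mercator tile containing a lon/lat point (hypothetical helper)."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp to the valid tile range so points on the map edge stay in bounds
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_to_quadkey(x: int, y: int, zoom: int) -> str:
    """Interleave the bits of x and y into a base-4 string, one digit per zoom level."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


# Approximate coordinates for Sacramento, used only as sample input
x, y = lonlat_to_tile(-121.49, 38.58, zoom=9)
print(tile_to_quadkey(x, y, zoom=9))
```

A quadkey has one digit per zoom level, so zoom level 9 yields 9-character keys. Keys can begin with `0`, which is why the `QuadKey` column has to be read as a string rather than an integer.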

## How It Works

1. Loads the state boundary geometry from Unity Catalog volumes
2. Identifies quadkeys (spatial tiles) that intersect the state
3. Downloads building footprints for each quadkey from Microsoft's dataset
4. Converts geometries to WKB format for efficient storage
5. Writes data to a Delta table in batches to reduce the number of expensive write operations
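
Step 4 stores each footprint as WKB. As a rough illustration of what that byte layout looks like, the sketch below hand-encodes a single 2D point using only the standard library. The helper names `point_to_wkb` and `wkb_to_point` are hypothetical; the notebook relies on GeoPandas' `to_wkb()` instead. Real footprints are polygons (WKB geometry type 3), which add ring and point counts but follow the same pattern.

```python
import struct

# "<" = little-endian with no padding, B = 1-byte order flag,
# I = 4-byte geometry type, dd = two 8-byte doubles (x, y)
_POINT_LAYOUT = "<BIdd"


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (illustrative helper)."""
    byte_order = 1   # 1 means little-endian
    geom_type = 1    # 1 means Point
    return struct.pack(_POINT_LAYOUT, byte_order, geom_type, x, y)


def wkb_to_point(data: bytes) -> tuple:
    """Decode the bytes produced by point_to_wkb back into (x, y)."""
    byte_order, geom_type, x, y = struct.unpack(_POINT_LAYOUT, data)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("this sketch only handles little-endian 2D points")
    return x, y


wkb = point_to_wkb(-121.49, 38.58)
print(len(wkb), wkb.hex())
print(wkb_to_point(wkb))
```

The point needs 21 bytes: 1 for the byte-order flag, 4 for the geometry type, and 8 for each coordinate.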

## Notes

* Requires a state boundary geometry file at `/Volumes/{catalog}/{schema_name}/supporting_geometry_files/state_geometries/{state_name}.geojson`
* Batch size trades memory for write efficiency: larger batches hold more data in memory but trigger fewer Delta writes
* Intended for serverless compute: quadkeys are processed sequentially to stay within memory limits
* The first write overwrites any existing table; subsequent writes append
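
The batching and write-mode behaviour described in these notes can be sketched without Spark. This is a simplified model under stated assumptions, not the notebook's code: `plan_writes` is a hypothetical helper, and plain strings stand in for the per-quadkey DataFrames.

```python
from typing import Iterable, Iterator, List, Tuple


def plan_writes(items: Iterable[str], batch_size: int) -> Iterator[Tuple[str, List[str]]]:
    """Yield (write_mode, batch) pairs: the first batch overwrites, later ones append."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    mode = "overwrite"
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield mode, batch
            batch = []
            mode = "append"
    # Flush whatever is left so a partial final batch is not lost
    if batch:
        yield mode, batch


for mode, batch in plan_writes(["q1", "q2", "q3", "q4", "q5"], batch_size=2):
    print(mode, batch)
```

Flushing the remainder after the loop, rather than checking for the last item inside it, keeps a partial final batch from being dropped.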

In [0]:
# Create widgets for parameterization
dbutils.widgets.text("catalog", "odi_datalake", "Catalog Name")
dbutils.widgets.text("schema_name", "odi_bronze", "Schema Name")
dbutils.widgets.text("state_name", "california", "State Name (lowercase)")
dbutils.widgets.text("batch_size", "10", "Batch Size")

In [0]:
# Install required packages for loading the global ML building footprints dataset
# Note: specific package versions can be pinned for reproducibility
# These packages already exist in the serverless runtime, but are installed here for illustration
%pip install geopandas mercantile shapely fsspec aiohttp
dbutils.library.restartPython() 

In [0]:
import pandas as pd
import geopandas as gpd
import mercantile
import shapely.geometry
import fsspec
from pyspark.sql import functions as F
from pyspark.sql.types import *

In [0]:
# Get parameters from widgets
CATALOG = dbutils.widgets.get("catalog")
SCHEMA = dbutils.widgets.get("schema_name")
STATE_NAME = dbutils.widgets.get("state_name")
BATCH_SIZE = int(dbutils.widgets.get("batch_size"))

# Construct table name from state
TABLE_NAME = f"global_ml_building_footprints_{STATE_NAME}"
FULL_TABLE_NAME = f"{CATALOG}.{SCHEMA}.{TABLE_NAME}"

print(f"Configuration:")
print(f"  Catalog: {CATALOG}")
print(f"  Schema: {SCHEMA}")
print(f"  State: {STATE_NAME}")
print(f"  Batch Size: {BATCH_SIZE}")
print(f"  Target Table: {FULL_TABLE_NAME}")

In [0]:
print(f"Identifying {STATE_NAME.title()} quadkeys")

# Read the dataset links
df = pd.read_csv(
    "https://minedbuildings.z5.web.core.windows.net/global-buildings/dataset-links.csv",
    dtype={"QuadKey": "str"},  # Don't use an int, since there are leading zeros!
)

# Get the shape of the state to identify quadkeys that intersect
state_geometry_path = f"/Volumes/{CATALOG}/{SCHEMA}/supporting_geometry_files/state_geometries/{STATE_NAME}.geojson"
state_gdf = gpd.read_file(state_geometry_path)
state_geometry = state_gdf.iloc[0].geometry

print(f"{STATE_NAME.title()} bounds: {state_geometry.bounds}")

In [0]:
# Find all tiles which intersect the bounding box at zoom level 9
features = []
for tile in mercantile.tiles(*state_geometry.bounds, zooms=9):
    features.append(
        {
            "quadkey": mercantile.quadkey(tile),
            "geometry": shapely.geometry.shape(
                mercantile.feature(tile)["geometry"]
            ),
        }
    )

print(f"Tiles intersecting bounding box: {len(features)}")

# Prune out tiles which don't intersect the state
quadkeys = gpd.GeoDataFrame.from_records(features).set_geometry("geometry")
state_quadkeys = quadkeys[quadkeys.intersects(state_geometry)]

print(f"Tiles intersecting {STATE_NAME.title()}: {len(state_quadkeys)}")

# Get URLs for quadkeys intersecting the state
state_data = df[
    df.QuadKey.isin(state_quadkeys.quadkey) & (df.Location == "UnitedStates")
]

print(f"Quadkeys with data available: {len(state_data)}")

In [0]:
# Process quadkeys and write to Delta in batches
# Batching reduces the number of expensive Delta write operations
accumulated_dataframes = []
overwrite_mode = True
total_quadkeys = len(state_data)

# iterrows() yields the original index labels of the links CSV, not 0..N-1,
# so use enumerate() for the progress counter and the last-row check
for position, (_, row) in enumerate(state_data.iterrows(), start=1):
    print(f"Processing quadkey {row.QuadKey} ({position}/{total_quadkeys})")

    try:
        # Read GeoJSON data
        with fsspec.open(row.Url, compression="infer") as f:
            gdf = gpd.read_file(f, driver="GeoJSONSeq")

        # Add quadkey column so each row records its source tile
        gdf = gdf.assign(quadkey=row.QuadKey)

        print(f"  Loaded {len(gdf)} buildings")

        # Convert geometry to WKB for Spark compatibility
        gdf['geometry_wkb'] = gdf.geometry.to_wkb()

        # Drop the original geometry column and convert to pandas DataFrame
        pdf = pd.DataFrame(gdf.drop(columns=['geometry']))

        # Convert to Spark DataFrame and accumulate
        accumulated_dataframes.append(spark.createDataFrame(pdf))

    except Exception as e:
        # Skip quadkeys that fail to download or parse; keep the rest of the batch
        print(f"  Error processing quadkey {row.QuadKey}: {str(e)}")

    # Write when batch is full or on last quadkey (even if that quadkey failed).
    # The write sits outside the try block so a failed Delta write stops the run.
    batch_full = len(accumulated_dataframes) >= BATCH_SIZE
    if accumulated_dataframes and (batch_full or position == total_quadkeys):
        print(f"  Writing batch of {len(accumulated_dataframes)} quadkeys to Delta...")

        # Union all accumulated DataFrames, matching columns by name
        batch_sdf = accumulated_dataframes[0]
        for sdf in accumulated_dataframes[1:]:  # do not shadow the links DataFrame `df`
            batch_sdf = batch_sdf.unionByName(sdf)

        # Write to Delta table
        write_mode = "overwrite" if overwrite_mode else "append"
        batch_sdf.write.format("delta").mode(write_mode).saveAsTable(FULL_TABLE_NAME)

        print(f"  Written to {FULL_TABLE_NAME} (mode: {write_mode})")

        # Reset for next batch
        accumulated_dataframes = []
        overwrite_mode = False

print(f"\nCompleted! Data loaded to {FULL_TABLE_NAME}")

In [0]:
# Check the loaded data
df_result = spark.table(FULL_TABLE_NAME)

print(f"Total records: {df_result.count():,}")
print(f"\nSchema:")
df_result.printSchema()

print(f"\nSample data:")
display(df_result.limit(10))