# Import Microsoft Building Footprints (Hawaiʻi)

_Author: Robert Litt_


## Purpose

This notebook ingests the Microsoft US Building Footprints dataset for Hawaiʻi and prepares an analysis-ready building footprint layer for the Matrix workflow.

The output is a GIS-native footprint dataset with repaired geometry and a spatial reference aligned to the project CRS (anchored to the validated parcels layer), written to `data/02_interim` for downstream buildable-area calculations and related overlays.


## Inputs

The primary input is stored as a static dataset under `data/01_raw/`. The CRS anchor is read from `data/02_interim/`.

Primary input:
- Microsoft building footprints for Hawaiʻi (GeoJSON extracted from the official Microsoft ZIP)

Reference input (CRS anchor):
- Validated parcels polygon layer in `data/02_interim/` (used to define the project CRS)



## Workflow Overview

1. **Locate and validate inputs**  
   Confirm the Microsoft GeoJSON exists and is readable.

2. **Convert GeoJSON to features**  
   Create a GIS-native footprint feature class in a file geodatabase (if it does not already exist).

3. **Repair geometry**  
   Create a “valid” copy and run geometry repair tools to stabilize downstream overlays.

4. **Align spatial reference**  
   Project footprints to the project CRS using the validated parcels layer as the CRS anchor.

5. **Write outputs to interim**  
   Store analysis-ready outputs under `data/02_interim/` in a predictable location for reuse.


## Imports


In [7]:
from pathlib import Path
import arcpy


## CRS Guardrails and Sanity Checks

This cell defines reusable functions that enforce basic coordinate reference system (CRS) sanity checks for imported spatial datasets.

**Purpose**
- Detect common CRS misinterpretation issues early (e.g., GeoJSON lon/lat treated incorrectly, Web Mercator masquerading as geographic).
- Prevent silent generation of spatially valid-looking but incorrectly located outputs.
- Fail fast before projection if raw coordinate ranges are inconsistent with Hawaiʻi datasets.
- Verify that projected outputs overlap the parcel extent as a final alignment check.

**How this is used**
- `crs_guardrail_preproject()` is called immediately after importing a dataset and before any projection.
- `crs_guardrail_postproject_overlap()` is called after projection to confirm alignment with the parcel dataset.

These checks are intentionally heuristic and conservative. If a guardrail fails, processing should stop and the input CRS should be explicitly diagnosed and defined before continuing.

This pattern is intended to be reused across notebooks for any third-party or externally generated spatial data (e.g., building footprints, streams, wells, infrastructure layers).
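
As a rough, arcpy-free sketch of the idea behind the post-projection overlap check, the test reduces to asking whether two bounding boxes intersect. The `Extent` tuple, `extents_overlap()`, `check_overlap()`, and the extent values below are illustrative assumptions, not the project's implementation or measured extents.

```python
from collections import namedtuple

# Hypothetical stand-in for the four bounds of an arcpy Extent object.
Extent = namedtuple("Extent", ["XMin", "YMin", "XMax", "YMax"])


def extents_overlap(a, b):
    """Return True if two axis-aligned bounding boxes touch or intersect."""
    return not (
        a.XMax < b.XMin or b.XMax < a.XMin or
        a.YMax < b.YMin or b.YMax < a.YMin
    )


def check_overlap(data_ext, anchor_ext, label="OUTPUT"):
    """Raise if the dataset extent does not intersect the anchor extent."""
    if not extents_overlap(data_ext, anchor_ext):
        raise RuntimeError(
            f"[{label}] CRS guardrail FAILED: extent does not overlap the anchor extent."
        )
    print(f"[{label}] CRS guardrail OK: extent overlaps the anchor extent.")


# Illustrative round numbers only (UTM Zone 4N style meters), not measured values.
parcels_ext = Extent(370_000, 2_090_000, 940_000, 2_460_000)
footprints_ext = Extent(380_000, 2_095_000, 935_000, 2_455_000)
check_overlap(footprints_ext, parcels_ext, label="FOOTPRINTS")

# A layer still in lon/lat degrees lands nowhere near the projected parcels.
unprojected_ext = Extent(-160.3, 18.9, -154.8, 22.3)
print(extents_overlap(unprojected_ext, parcels_ext))
```

An extent-level test like this is deliberately coarse: it catches data that landed in the wrong part of the coordinate space, not small datum shifts.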


In [None]:


def _extent_str(ext):
    return f"XMin={ext.XMin:.3f}, XMax={ext.XMax:.3f}, YMin={ext.YMin:.3f}, YMax={ext.YMax:.3f}"

def crs_guardrail_preproject(
    in_fc: str,
    expected_case: str = "hawaii_lonlat_or_projected",
    label: str = "INPUT"
):
    """
    Guardrail before projecting.
    - Prints spatial reference + extent.
    - Raises a RuntimeError if coordinates look wrong.

    expected_case:
      - "hawaii_lonlat": expects lon/lat degrees around Hawaiʻi
      - "hawaii_lonlat_or_projected": allows lon/lat OR already-projected meters (UTM/StatePlane), but rejects Web Mercator-like or nonsense
    """
    desc = arcpy.Describe(in_fc)
    sr = desc.spatialReference
    ext = desc.extent

    print(f"[{label}] SR: {sr.name} | factoryCode: {sr.factoryCode}")
    print(f"[{label}] Extent: {_extent_str(ext)}")

    # Heuristics
    looks_lonlat_hi = (-170 < ext.XMin < -140) and (-170 < ext.XMax < -140) and (10 < ext.YMin < 30) and (10 < ext.YMax < 30)
    looks_webmerc = (abs(ext.XMin) > 1_000_000 and abs(ext.XMax) > 1_000_000 and abs(ext.YMin) > 1_000_000 and abs(ext.YMax) > 1_000_000)

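    # "Already projected" heuristic: every bound exceeds 1,000 in magnitude, but not all four exceed 1,000,000.
    # (UTM Zone 4N northings in Hawaiʻi are above 2,000,000 m, while eastings stay below 1,000,000 m.)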
    looks_projected_m = (abs(ext.XMin) > 1_000 and abs(ext.XMax) > 1_000 and abs(ext.YMin) > 1_000 and abs(ext.YMax) > 1_000) and not looks_webmerc

    if expected_case == "hawaii_lonlat":
        if not looks_lonlat_hi:
            raise RuntimeError(f"[{label}] CRS guardrail FAILED: expected Hawaiʻi lon/lat degrees but extent does not match.")
        print(f"[{label}] CRS guardrail OK: looks like Hawaiʻi lon/lat degrees.")
        return

    if expected_case == "hawaii_lonlat_or_projected":
        if looks_lonlat_hi:
            print(f"[{label}] CRS guardrail OK: looks like Hawaiʻi lon/lat degrees.")
            return
        if looks_projected_m:
            print(f"[{label}] CRS guardrail OK: looks like projected meters (already projected).")
            return
        if looks_webmerc:
            raise RuntimeError(
                f"[{label}] CRS guardrail FAILED: extent looks like Web Mercator meters. "
                "DefineProjection is likely needed before Project."
            )
        raise RuntimeError(
            f"[{label}] CRS guardrail FAILED: extent pattern is unexpected. "
            "Stop and diagnose the input CRS before continuing."
        )
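
To see how the three heuristics in `crs_guardrail_preproject()` interact, the same comparisons can be run on plain numbers. `classify_extent()` is a hypothetical helper that mirrors the checks above without arcpy. The sample extents are rounded, illustrative values rather than extents read from the data.

```python
def classify_extent(xmin, xmax, ymin, ymax):
    """Classify an extent using the same thresholds as the guardrail above."""
    bounds = (xmin, xmax, ymin, ymax)

    looks_lonlat_hi = (
        -170 < xmin < -140 and -170 < xmax < -140
        and 10 < ymin < 30 and 10 < ymax < 30
    )
    # Web Mercator-like: all four bounds exceed one million in magnitude.
    looks_webmerc = all(abs(v) > 1_000_000 for v in bounds)
    # Projected meters: all bounds exceed 1,000 but the Web Mercator test fails.
    looks_projected_m = all(abs(v) > 1_000 for v in bounds) and not looks_webmerc

    # Same order of checks as the "hawaii_lonlat_or_projected" branch.
    if looks_lonlat_hi:
        return "lonlat"
    if looks_projected_m:
        return "projected_m"
    if looks_webmerc:
        return "webmerc"
    return "unexpected"


# Illustrative extents (xmin, xmax, ymin, ymax); not measured from the datasets.
samples = {
    "degrees": (-160.3, -154.8, 18.9, 22.3),
    "utm_4n": (370_000, 940_000, 2_090_000, 2_460_000),
    "web_mercator": (-17_840_000, -17_230_000, 2_140_000, 2_550_000),
    "near_origin": (0, 10, 0, 10),
}
for name, ext in samples.items():
    print(name, "->", classify_extent(*ext))
```

Because `looks_webmerc` requires all four bounds to exceed 1,000,000 in magnitude, a UTM extent with eastings below that threshold is classified as projected. By the same logic, any projected extent whose bounds all exceed 1,000,000 would be rejected as Web Mercator-like, which is one reason the checks are described as heuristic.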


## Configurations


In [8]:
# Parent directory (shared Drive project root)
PROJECT_ROOT = Path(r"G:\Shared drives\WW_Overlay_2024\matrix\HiOSDS-TechSuitabilityAnalysis")

# Standard project folders
RAW = PROJECT_ROOT / "data" / "01_raw"
INTERIM = PROJECT_ROOT / "data" / "02_interim"

# Inputs
FOOTPRINT_DIR = RAW / "building_footprints_hi_microsoft"
SOURCE_GEOJSON = FOOTPRINT_DIR / "Hawaii.geojson"

# CRS anchor (validated parcels layer)
PARCELS_VALID = INTERIM / "parcels_valid_hi_higp" / "tmk_state_valid.shp"

# Outputs (interim)
OUT_DIR = INTERIM / "building_footprints_valid_hi_microsoft"
OUT_DIR.mkdir(parents=True, exist_ok=True)

OUT_GDB = OUT_DIR / "building_footprints_valid_hi_microsoft.gdb"

RAW_FC = OUT_GDB / "buildings_hi_microsoft_raw"
VALID_FC = OUT_GDB / "buildings_hi_microsoft_valid"
PROJECTED_FC = OUT_GDB / "buildings_hi_microsoft_valid_projected"

# Environment behavior
arcpy.env.overwriteOutput = False  # keep False by default for safety; set True only when you intend to regenerate outputs


## Building Footprints

Note: this import was first run manually. The cells below now detect existing outputs and create them only if needed.


In [9]:
print("GeoJSON exists:", SOURCE_GEOJSON.exists())
print("Parcels CRS anchor exists:", PARCELS_VALID.exists())

if not OUT_GDB.exists():
    arcpy.management.CreateFileGDB(str(OUT_DIR), OUT_GDB.name)
    print("Created GDB:", OUT_GDB)
else:
    print("Using existing GDB:", OUT_GDB)


GeoJSON exists: True
Parcels CRS anchor exists: True
Created GDB: G:\Shared drives\WW_Overlay_2024\matrix\HiOSDS-TechSuitabilityAnalysis\data\02_interim\building_footprints_valid_hi_microsoft\building_footprints_valid_hi_microsoft.gdb
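
The cell above only prints whether each input exists. A fail-fast variant of the "locate and validate inputs" step might look like the sketch below. `require_readable_geojson()` and its `peek_chars` parameter are hypothetical, and the check only peeks at the start of the file instead of parsing the whole GeoJSON.

```python
import json
import tempfile
from pathlib import Path


def require_readable_geojson(path, peek_chars=2048):
    """Raise if the file is missing or does not start like a JSON object."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Required input not found: {path}")
    # Read only the head of the file; statewide GeoJSON files can be large.
    with path.open("r", encoding="utf-8-sig") as fh:
        head = fh.read(peek_chars)
    if not head.lstrip().startswith("{"):
        raise ValueError(f"File does not look like GeoJSON: {path}")
    return path


# Demonstration on a throwaway file, so the sketch runs without project data.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.geojson"
    sample.write_text(
        json.dumps({"type": "FeatureCollection", "features": []}),
        encoding="utf-8",
    )
    print("Readable:", require_readable_geojson(sample).name)
```

In this notebook the call would take `SOURCE_GEOJSON`, so a missing or corrupt download stops the run before any geodatabase is created.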


In [10]:
print("Raw footprints feature class exists:", arcpy.Exists(str(RAW_FC)))

if not arcpy.Exists(str(RAW_FC)):
    arcpy.conversion.JSONToFeatures(
        in_json_file=str(SOURCE_GEOJSON),
        out_features=str(RAW_FC)
    )
    print("Created raw footprints:", RAW_FC)
else:
    print("Using existing raw footprints:", RAW_FC)


Raw footprints feature class exists: False
Created raw footprints: G:\Shared drives\WW_Overlay_2024\matrix\HiOSDS-TechSuitabilityAnalysis\data\02_interim\building_footprints_valid_hi_microsoft\building_footprints_valid_hi_microsoft.gdb\buildings_hi_microsoft_raw
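
Several cells in this notebook follow the same "use it if it exists, otherwise create it" pattern. One possible way to factor that out is sketched below. `ensure_output()` is a hypothetical helper, and the demonstration uses an in-memory set in place of a geodatabase so that it runs without arcpy.

```python
def ensure_output(name, exists, create):
    """Create an output only if it is missing; return True if it was created."""
    if exists(name):
        print("Using existing:", name)
        return False
    create(name)
    print("Created:", name)
    return True


# Stand-in "workspace": a set of names instead of feature classes on disk.
workspace = set()
ensure_output("buildings_hi_microsoft_raw", workspace.__contains__, workspace.add)
ensure_output("buildings_hi_microsoft_raw", workspace.__contains__, workspace.add)
```

In the notebook, `exists` would be `arcpy.Exists` and `create` a small function wrapping the relevant geoprocessing call.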


In [11]:
desc = arcpy.Describe(str(RAW_FC))
count = int(arcpy.management.GetCount(str(RAW_FC))[0])

print("Shape type:", desc.shapeType)
print("CRS:", desc.spatialReference.name)
print("Feature count:", f"{count:,}")


Shape type: Polygon
CRS: GCS_WGS_1984
Feature count: 252,908


In [12]:
print("Valid footprints feature class exists:", arcpy.Exists(str(VALID_FC)))

# Copy raw -> valid once (keeps raw immutable)
if not arcpy.Exists(str(VALID_FC)):
    arcpy.management.CopyFeatures(str(RAW_FC), str(VALID_FC))
    print("Created valid copy:", VALID_FC)
else:
    print("Using existing valid copy:", VALID_FC)

# Repair geometry in place on the valid copy
arcpy.management.RepairGeometry(str(VALID_FC))
print("Repaired geometry (valid):", VALID_FC)


Valid footprints feature class exists: False
Created valid copy: G:\Shared drives\WW_Overlay_2024\matrix\HiOSDS-TechSuitabilityAnalysis\data\02_interim\building_footprints_valid_hi_microsoft\building_footprints_valid_hi_microsoft.gdb\buildings_hi_microsoft_valid
Repaired geometry (valid): G:\Shared drives\WW_Overlay_2024\matrix\HiOSDS-TechSuitabilityAnalysis\data\02_interim\building_footprints_valid_hi_microsoft\building_footprints_valid_hi_microsoft.gdb\buildings_hi_microsoft_valid


In [13]:
parcels_sr = arcpy.Describe(str(PARCELS_VALID)).spatialReference
footprints_sr = arcpy.Describe(str(VALID_FC)).spatialReference

print("Parcels CRS:", parcels_sr.name, "| factoryCode:", parcels_sr.factoryCode)
print("Footprints CRS:", footprints_sr.name, "| factoryCode:", footprints_sr.factoryCode)

if arcpy.Exists(str(PROJECTED_FC)) and not arcpy.env.overwriteOutput:
    # Re-running Project or CopyFeatures onto an existing output fails while overwriteOutput is False
    print("Using existing projected output:", PROJECTED_FC)
elif parcels_sr.factoryCode != footprints_sr.factoryCode:
    arcpy.management.Project(
        in_dataset=str(VALID_FC),
        out_dataset=str(PROJECTED_FC),
        out_coor_system=parcels_sr
    )
    print("Projected footprints to parcels CRS:", PROJECTED_FC)
else:
    # CRS already matches: copy so downstream code can always refer to PROJECTED_FC
    arcpy.management.CopyFeatures(str(VALID_FC), str(PROJECTED_FC))
    print("Copied valid footprints to projected output (CRS already matched).")


Parcels CRS: WGS_1984_UTM_Zone_4N | factoryCode: 32604
Footprints CRS: GCS_WGS_1984 | factoryCode: 4326
Projected footprints to parcels CRS: G:\Shared drives\WW_Overlay_2024\matrix\HiOSDS-TechSuitabilityAnalysis\data\02_interim\building_footprints_valid_hi_microsoft\building_footprints_valid_hi_microsoft.gdb\buildings_hi_microsoft_valid_projected


In [14]:
final_desc = arcpy.Describe(str(PROJECTED_FC))
final_count = int(arcpy.management.GetCount(str(PROJECTED_FC))[0])

print("Final shape type:", final_desc.shapeType)
print("Final CRS:", final_desc.spatialReference.name)
print("Final feature count:", f"{final_count:,}")
print("Final output:", PROJECTED_FC)


Final shape type: Polygon
Final CRS: WGS_1984_UTM_Zone_4N
Final feature count: 252,908
Final output: G:\Shared drives\WW_Overlay_2024\matrix\HiOSDS-TechSuitabilityAnalysis\data\02_interim\building_footprints_valid_hi_microsoft\building_footprints_valid_hi_microsoft.gdb\buildings_hi_microsoft_valid_projected
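
The raw and final counts printed in this notebook match (252,908). A small helper could turn that visual comparison into an explicit check. `check_feature_count()` and its `max_dropped` tolerance are hypothetical, intended for steps such as geometry repair that may legitimately remove a few features.

```python
def check_feature_count(before, after, label="OUTPUT", max_dropped=0):
    """Return the number of dropped features, raising if it is out of bounds."""
    dropped = before - after
    if dropped < 0:
        raise RuntimeError(
            f"[{label}] feature count grew from {before:,} to {after:,}."
        )
    if dropped > max_dropped:
        raise RuntimeError(
            f"[{label}] {dropped:,} features dropped (allowed: {max_dropped:,})."
        )
    print(f"[{label}] feature count OK: {after:,} of {before:,} retained.")
    return dropped


# Counts taken from the cell outputs in this notebook.
check_feature_count(252_908, 252_908, label="PROJECTED")
```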


In [15]:
aprx = arcpy.mp.ArcGISProject("CURRENT")
m = aprx.activeMap

m.addDataFromPath(str(PROJECTED_FC))
print("Added to map:", PROJECTED_FC)


Added to map: G:\Shared drives\WW_Overlay_2024\matrix\HiOSDS-TechSuitabilityAnalysis\data\02_interim\building_footprints_valid_hi_microsoft\building_footprints_valid_hi_microsoft.gdb\buildings_hi_microsoft_valid_projected


In [16]:
print("Disk file exists:", (FOOTPRINT_DIR / "buildings_hi_microsoft_raw.shp").exists())
print("ArcPy Exists:", arcpy.Exists(str(FOOTPRINT_DIR / "buildings_hi_microsoft_raw.shp")))


Disk file exists: True
ArcPy Exists: True


## Canonical Output – Building Footprints

**Authoritative dataset**

`data/02_interim/building_footprints_valid_hi_microsoft/building_footprints_valid_hi_microsoft.gdb/buildings_hi_microsoft_valid_projected`

**Description**

This feature class contains statewide Microsoft building footprints that have been:

- imported from GeoJSON into a raw feature class, which is kept unmodified
- copied and geometry-repaired (`RepairGeometry`)
- projected to match the parcel dataset CRS (WGS 1984 UTM Zone 4N, meters)

**Usage note**

All downstream analyses (e.g., parcel buildable area calculations, footprint subtraction, and area summaries) should reference this dataset only.

Raw and intermediate versions are retained solely for provenance, reproducibility, and debugging, and should not be used directly in analytical workflows.
