# ODI Building Footprints Pipeline - Setup

This notebook creates the required Unity Catalog infrastructure for the ODI building footprints processing pipeline.

## Purpose
This is a **prerequisite setup notebook** that must be run before executing the ODI building footprints job. It creates:
* **Catalog**: `odi_datalake` - The Unity Catalog namespace for all ODI data
* **Schemas**: `odi_bronze`, `odi_silver`, `odi_gold` - The medallion architecture layers for data processing
* **Volumes**: `supporting_geometry_files` in `odi_bronze` (storage for geometry reference files) and `global_ml_building_footprints` in `odi_gold` (storage for GeoParquet and Shapefile outputs)

## Prerequisites
* Unity Catalog enabled in your Databricks workspace
* Permissions to create catalogs, schemas, and volumes
* (Optional) External storage location configured for the catalog
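The permissions above can be granted ahead of time by a metastore admin. The sketch below only builds the `GRANT` statements as strings. The principal name `odi-pipeline-admins` is hypothetical, and the privilege list is an assumption about what this notebook needs: `CREATE CATALOG` on the metastore, plus `CREATE MANAGED STORAGE` on the external location when one is used.

```python
def build_setup_grants(principal, external_location=""):
    """Return GRANT statements that let `principal` run this setup notebook."""
    # Backticks quote principals that contain hyphens, dots, or spaces.
    grants = [f"GRANT CREATE CATALOG ON METASTORE TO `{principal}`"]
    if external_location:
        # Only needed when the catalog gets a managed location in external storage.
        grants.append(
            f"GRANT CREATE MANAGED STORAGE ON EXTERNAL LOCATION "
            f"`{external_location}` TO `{principal}`"
        )
    return grants


# Hypothetical principal; replace with a real user, group, or service principal.
for statement in build_setup_grants("odi-pipeline-admins", "odi_datalake"):
    print(statement)  # In a notebook, pass each statement to spark.sql(...) instead
```

Whoever creates the catalog becomes its owner, so creating the schemas and volumes inside it should not need further grants for that same principal.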

## Usage
1. Review and update the configuration variables in the configuration cell
2. Run all cells in sequence
3. Run the verification cell and confirm that the catalog, schemas, and volumes exist

## Configuration Options

### Catalog Storage Location
The catalog's storage can be configured in one of two ways:

* **Using an existing Unity Catalog external location** (recommended):
  * Reference by name: `"odi_datalake"`
  * The external location must already be created in Unity Catalog with proper credentials
  * This is the cleanest approach as credentials are managed centrally
  * [Learn how to create external locations in Azure Databricks](https://learn.microsoft.com/en-us/azure/databricks/connect/unity-catalog/cloud-storage/external-locations)

* **Without an external location** (default metastore storage):
  * Suitable for POC and development work
  * Data is stored in the metastore's default storage location
  * Leave `catalog_storage_location` as an empty string (`""`)
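The two options above differ only in whether the `CREATE CATALOG` statement carries a `MANAGED LOCATION` clause. That clause takes a cloud storage path that falls inside a configured external location. A minimal sketch of building the statement follows; the helper name `build_create_catalog_sql` and the placeholder URL are hypothetical and not part of the pipeline.

```python
import re

# Conservative check: letters, digits, and underscores only, not starting with a digit.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_create_catalog_sql(catalog, location_url=""):
    """Build a CREATE CATALOG statement, with MANAGED LOCATION only if a URL is given."""
    if not _IDENTIFIER.match(catalog):
        raise ValueError(f"Unsupported catalog name: {catalog!r}")
    sql = f"CREATE CATALOG IF NOT EXISTS {catalog}"
    if location_url:
        if "'" in location_url:
            raise ValueError("Location URL must not contain single quotes")
        sql += f" MANAGED LOCATION '{location_url}'"
    return sql


# Default metastore storage (empty location).
print(build_create_catalog_sql("odi_datalake"))

# Placeholder URL for illustration only; substitute the URL of your external location.
print(build_create_catalog_sql(
    "odi_datalake",
    "abfss://<container>@<storage-account>.dfs.core.windows.net/odi",
))
```

Validating the catalog name before interpolating it keeps a typo in the configuration cell from producing a malformed statement.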

### Medallion Architecture
The pipeline uses a standard medallion architecture:
* **odi_bronze**: Raw ingested data
* **odi_silver**: Cleaned and validated building footprints
* **odi_gold**: Aggregated and business-ready datasets
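Objects in these layers are addressed by a three-level name (`catalog.schema.object`), and files in volumes by a `/Volumes/<catalog>/<schema>/<volume>` path. The helpers below are an illustrative sketch of those two conventions; the function names are hypothetical, while the catalog, schema, and volume names are the ones this notebook creates.

```python
CATALOG = "odi_datalake"
LAYERS = ("odi_bronze", "odi_silver", "odi_gold")


def qualified_name(catalog, schema, obj=None):
    """Return catalog.schema, or catalog.schema.obj when an object name is given."""
    parts = [catalog, schema] + ([obj] if obj else [])
    return ".".join(parts)


def volume_path(catalog, schema, volume, *subdirs):
    """Return the /Volumes file path for a volume, optionally with subdirectories."""
    return "/".join(["/Volumes", catalog, schema, volume, *subdirs])


for layer in LAYERS:
    print(qualified_name(CATALOG, layer))

print(volume_path(CATALOG, "odi_bronze", "supporting_geometry_files", "state_geometries"))
print(volume_path(CATALOG, "odi_gold", "global_ml_building_footprints"))
```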

In [0]:
# Configuration - Update these values for your environment
catalog = "odi_datalake"
catalog_storage_location = "odi_datalake"  # External location name (must already exist in Unity Catalog). Leave empty string "" to use default metastore.

# Create odi_datalake catalog
catalogs = [row.catalog for row in spark.sql("SHOW CATALOGS").collect()]
if catalog not in catalogs:
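    # Use external location if specified (preferred)
    if catalog_storage_location:
        # MANAGED LOCATION takes a storage URL, so look up the URL of the named external location
        location_url = spark.sql(f"DESCRIBE EXTERNAL LOCATION `{catalog_storage_location}`").first()["url"]
        spark.sql(f"CREATE CATALOG {catalog} MANAGED LOCATION '{location_url}'")
        print(f"Created catalog {catalog} with managed location: {location_url}")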
    # Else default metastore (okay for POC but not preferred)
    else:
        spark.sql(f"CREATE CATALOG {catalog}")
        print(f"Created catalog {catalog} (using default metastore storage)")
else:
    print(f"Catalog {catalog} already exists")

In [0]:
# Create odi_bronze, odi_silver, and odi_gold schemas in the odi_datalake catalog
required_schemas = ["odi_bronze", "odi_silver", "odi_gold"]
for schema in required_schemas:
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
    # IF NOT EXISTS makes this a no-op for existing schemas, so do not claim it was created
    print(f"Schema {catalog}.{schema} is ready")

In [0]:
# Create volume for supporting geometry files in odi_bronze
volume_name = "supporting_geometry_files"
volume_path = f"{catalog}.odi_bronze.{volume_name}"

# Check if volume already exists
existing_volumes = spark.sql(f"SHOW VOLUMES IN {catalog}.odi_bronze").collect()
if any(row.volume_name == volume_name for row in existing_volumes):
    print(f"Volume {volume_path} already exists")
else:
    spark.sql(f"CREATE VOLUME IF NOT EXISTS {volume_path}")
    print(f"Created volume {volume_path}")
    print(f"Access path: /Volumes/{catalog}/odi_bronze/{volume_name}")
    print(f"State geometries path: /Volumes/{catalog}/odi_bronze/{volume_name}/state_geometries")

In [0]:
# Create volume for output files (GeoParquet and Shapefiles) in odi_gold
output_volume_name = "global_ml_building_footprints"
output_volume_path = f"{catalog}.odi_gold.{output_volume_name}"

# Check if volume already exists
existing_volumes = spark.sql(f"SHOW VOLUMES IN {catalog}.odi_gold").collect()
if any(row.volume_name == output_volume_name for row in existing_volumes):
    print(f"Volume {output_volume_path} already exists")
else:
    spark.sql(f"CREATE VOLUME IF NOT EXISTS {output_volume_path}")
    print(f"Created volume {output_volume_path}")
    print(f"Access path: /Volumes/{catalog}/odi_gold/{output_volume_name}")

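The two volume cells above follow the same create-if-missing pattern, which could be factored into one helper. The sketch below assumes a hypothetical `ensure_volume` function that receives the SQL runner as an argument, so that it can be exercised without a Spark session; in the notebook, `spark.sql` would be passed in.

```python
def ensure_volume(run_sql, catalog, schema, volume):
    """Create a managed volume if it is absent and return its /Volumes path.

    `run_sql` is any callable that executes one SQL string, e.g. spark.sql.
    """
    # IF NOT EXISTS makes the statement safe to rerun.
    run_sql(f"CREATE VOLUME IF NOT EXISTS {catalog}.{schema}.{volume}")
    return f"/Volumes/{catalog}/{schema}/{volume}"


# Stand-in runner that records statements instead of executing them.
executed = []
path = ensure_volume(executed.append, "odi_datalake", "odi_gold", "global_ml_building_footprints")
print(executed[0])
print(path)
```

In the notebook the call would look like `ensure_volume(spark.sql, catalog, "odi_bronze", volume_name)`.
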
## Verify Setup

Run the cell below to verify that the catalog, schemas, and volumes were created successfully.
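If the check should stop a job run instead of only printing marks, the comparison can be reduced to "expected names versus names found". The following is a sketch under that assumption; `find_missing` and `assert_setup_complete` are hypothetical helpers, and the example uses literal lists where the notebook would use `SHOW SCHEMAS` and `SHOW VOLUMES` results.

```python
def find_missing(expected, found):
    """Return the expected names that are absent from `found`, in order."""
    found_names = set(found)
    return [name for name in expected if name not in found_names]


def assert_setup_complete(checks):
    """Raise RuntimeError listing every missing object.

    `checks` maps a label to an (expected_names, found_names) pair.
    """
    problems = []
    for label, (expected, found) in checks.items():
        for name in find_missing(expected, found):
            problems.append(f"{label}: {name}")
    if problems:
        raise RuntimeError("Missing objects: " + ", ".join(problems))


# Literal lists for illustration; nothing is missing here, so no error is raised.
assert_setup_complete({
    "schema": (
        ["odi_bronze", "odi_silver", "odi_gold"],
        ["default", "information_schema", "odi_bronze", "odi_gold", "odi_silver"],
    ),
    "volume": (["supporting_geometry_files"], ["supporting_geometry_files"]),
})
```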

In [0]:
# Verify catalog exists
print("=" * 60)
print("CATALOG VERIFICATION")
print("=" * 60)
catalogs = spark.sql("SHOW CATALOGS").collect()
if any(row.catalog == catalog for row in catalogs):
    print(f"✓ Catalog '{catalog}' exists")
    
    # Get catalog details
    catalog_info = spark.sql(f"DESCRIBE CATALOG EXTENDED {catalog}").collect()
    for row in catalog_info:
        if row.info_name == "Location":
            print(f"  Location: {row.info_value}")
else:
    print(f"✗ Catalog '{catalog}' not found")

# Verify schemas exist
print("\n" + "=" * 60)
print("SCHEMA VERIFICATION")
print("=" * 60)
# Query the schema list once instead of once per schema
schemas = spark.sql(f"SHOW SCHEMAS IN {catalog}").collect()
for schema in required_schemas:
    if any(row.databaseName == schema for row in schemas):
        print(f"✓ Schema '{catalog}.{schema}' exists")
    else:
        print(f"✗ Schema '{catalog}.{schema}' not found")

# Verify volumes exist
print("\n" + "=" * 60)
print("VOLUME VERIFICATION")
print("=" * 60)

# Check bronze volume
volumes_bronze = spark.sql(f"SHOW VOLUMES IN {catalog}.odi_bronze").collect()
if any(row.volume_name == volume_name for row in volumes_bronze):
    print(f"✓ Volume '{catalog}.odi_bronze.{volume_name}' exists")
    print(f"  Access path: /Volumes/{catalog}/odi_bronze/{volume_name}")
    print(f"  State geometries: /Volumes/{catalog}/odi_bronze/{volume_name}/state_geometries")
else:
    print(f"✗ Volume '{catalog}.odi_bronze.{volume_name}' not found")

# Check gold volume
volumes_gold = spark.sql(f"SHOW VOLUMES IN {catalog}.odi_gold").collect()
if any(row.volume_name == output_volume_name for row in volumes_gold):
    print(f"✓ Volume '{catalog}.odi_gold.{output_volume_name}' exists")
    print(f"  Access path: /Volumes/{catalog}/odi_gold/{output_volume_name}")
else:
    print(f"✗ Volume '{catalog}.odi_gold.{output_volume_name}' not found")

print("\n" + "=" * 60)
print("Verification finished. If every check above shows ✓, the ODI building footprints job is ready to run.")
print("=" * 60)