# Geospatial Data Processing Pipeline

## Key Features
- **Overture Maps download** via DuckDB with bounding box filtering
- **Multi-format conversion** (Shapefile, GeoPackage, etc.) to GeoJSON
- **Automated PMTiles generation** with tippecanoe settings per geometry type and/or theme

## Processing Steps
1. **Download** - Fetch Overture Maps data for specified extent
2. **Convert** - Transform custom spatial data to GeoJSON format
3. **Tile** - Generate PMTiles using tippecanoe with custom settings

## Prerequisites
- Python 3 with the required third-party packages (duckdb, pandas, tqdm); `pathlib` ships with the standard library
- Tippecanoe installed and available in PATH
- GDAL/OGR for geospatial format conversion
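
Before running the notebook, it can help to confirm that the command-line tools are discoverable. The sketch below uses only the standard library. The helper name `check_tools` is hypothetical, and `ogr2ogr` is assumed here to be the GDAL/OGR executable of interest; the processing scripts may rely on GDAL differently.

```python
import shutil


def check_tools(names=("tippecanoe", "ogr2ogr")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report which prerequisites were found
for tool, location in check_tools().items():
    print(f"{tool}: {location or 'NOT FOUND on PATH'}")
```

A `None` value means the tool has to be installed or added to `PATH` before the tiling or conversion steps can succeed.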

In [None]:
# Import the three modular processing scripts
import sys
import os
from pathlib import Path
import json
import time

# Add the processing directory to Python path
processing_dir = Path("./processing")
if str(processing_dir) not in sys.path:
    sys.path.append(str(processing_dir))

# Import modular processing scripts
try:
    from downloadOverture import download_overture_data
    from convertCustomData import convert_file
    from runCreateTiles import process_to_tiles, create_tilejson
    print("Successfully imported all processing modules")
except ImportError as e:
    print(f"Error importing modules: {e}")
    print("Make sure the processing scripts are in the ./processing directory")

# Import additional libraries for visualization and analysis
import pandas as pd
from tqdm import tqdm
import warnings
# Silence all warnings to keep cell output readable (this also hides deprecation notices)
warnings.filterwarnings('ignore')

## 1. Project Configuration and Paths

Configure the project directories and processing parameters for the pipeline.
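
The configuration stores the extent as `(lon_min, lat_min, lon_max, lat_max)` together with a `buffer_degrees` value. How `download_overture_data` applies the buffer internally is not shown in this notebook, so the following is only a minimal sketch of one plausible interpretation: expand every side by the buffer and clamp to valid coordinates. The function name `buffer_extent` is hypothetical.

```python
def buffer_extent(extent, buffer_degrees=0.0):
    """Expand (lon_min, lat_min, lon_max, lat_max) by buffer_degrees on every side.

    The result is clamped to the valid longitude/latitude ranges.
    """
    lon_min, lat_min, lon_max, lat_max = extent
    if lon_min >= lon_max or lat_min >= lat_max:
        raise ValueError(f"Extent is empty or inverted: {extent}")
    return (
        max(lon_min - buffer_degrees, -180.0),
        max(lat_min - buffer_degrees, -90.0),
        min(lon_max + buffer_degrees, 180.0),
        min(lat_max + buffer_degrees, 90.0),
    )


# Example with the commented-out Kasai-Oriental extent and a 0.5 degree buffer
print(buffer_extent((22.0, -6.0, 24.0, -4.0), 0.5))
```

Validating the extent order up front catches the common mistake of passing latitude before longitude.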

In [None]:
# Configuration - All paths and parameters centralized
from pathlib import Path

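# Define all project paths.
# In a script, __file__ is defined and the project root is two levels up.
# In a notebook, __file__ is undefined, so fall back to the parent of the
# current working directory (this assumes the notebook runs one level
# below the project root).
PROJECT_ROOT = (
    Path(__file__).resolve().parent.parent
    if "__file__" in globals()
    else Path.cwd().parent
)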
PROCESSING_DIR = PROJECT_ROOT / "processing"
DATA_DIR = PROCESSING_DIR / "data"
OVERTURE_DATA_DIR = DATA_DIR / "raw" / "overture"
CUSTOM_DATA_DIR = DATA_DIR / "raw" / "grid3"
OUTPUT_DIR = DATA_DIR / "processed"
TILE_DIR = DATA_DIR / "tiles"
PUBLIC_TILES_DIR = PROJECT_ROOT / "public" / "tiles"

CONFIG = {
    "paths": {
        "project_root": PROJECT_ROOT,
        "processing_dir": PROCESSING_DIR,
        "data_dir": DATA_DIR,
        "overture_data_dir": OVERTURE_DATA_DIR,
        "custom_data_dir": CUSTOM_DATA_DIR,
        "tile_dir": TILE_DIR,
        "output_dir" : OUTPUT_DIR,
        "public_tiles_dir": PUBLIC_TILES_DIR,
        "template_path": PROCESSING_DIR / "tileQueries.template"
    },
    "extent": {
        # "coordinates": (22.0, -6.0, 24.0, -4.0),  # kasai-oriental
        # Bounding box around Prospect Park points with a 0.01° buffer (~1.1 km)
        # Points: (40.682140, -73.985371) and (40.6473912366065, -73.95500872565238)
        "coordinates": (
            -73.98257744202017,  # lon_min (clamped to Brooklyn bounds)
            40.64773925613089,   # lat_min (clamped to Brooklyn bounds)
            -73.9562859766083,   # lon_max (clamped to Brooklyn bounds)
            40.67679734614368    # lat_max (clamped to Brooklyn bounds)
        ),
        "buffer_degrees": 0.0
    },
    "download": {
        "verbose": True,
        "output_formats": ["*.geojson", "*.geojsonseq"]
    },
    "conversion": {
        "input_patterns": ["*.shp", "*.gpkg", "*.gdb", "*.sqlite", "*.db", "*.geojson", "*.json"],
        "output_suffix": ".geojsonseq",
        "reproject_crs": "EPSG:4326",
        "overwrite": True,
        "verbose": True
    },
    "tiling": {
        # "input_dirs": [CUSTOM_DATA_DIR, OVERTURE_DATA_DIR],  # Search in both data directories
        "input_dirs": [OVERTURE_DATA_DIR],  # just overture
        "output_dir": TILE_DIR,  # Explicit output directory for PMTiles
        "parallel": True,
        "overwrite": True,
        "verbose": True,
        "create_tilejson": True,
        "filter_pattern": None  # Optional: filter files by pattern
    }
}

# Create necessary directories
for path_key, path_value in CONFIG["paths"].items():
    if path_key.endswith("_dir") and path_value:
        path_value.mkdir(parents=True, exist_ok=True)

# Display configuration summary
print("PROJECT CONFIGURATION INITIALIZED")
print("=" * 50)
print(f"Project root: {CONFIG['paths']['project_root']}")
print(f"Processing directory: {CONFIG['paths']['processing_dir']}")
print(f"Data directory: {CONFIG['paths']['data_dir']}")
print(f"Output directory: {CONFIG['paths']['output_dir']}")
print(f"Overture data directory: {CONFIG['paths']['overture_data_dir']}")
print(f"Custom data directory: {CONFIG['paths']['custom_data_dir']}")
print(f"Tile output directory: {CONFIG['paths']['tile_dir']}")
print(f"Public tiles directory: {CONFIG['paths']['public_tiles_dir']}")
print()
print(f"Processing extent: {CONFIG['extent']['coordinates']}")
print(f"Buffer degrees: {CONFIG['extent']['buffer_degrees']}")
# Use enough decimals that a city-scale extent does not print as 0.00
lon_min, lat_min, lon_max, lat_max = CONFIG['extent']['coordinates']
print(f"Area: {(lon_max - lon_min) * (lat_max - lat_min):.6f} degree²")
print()
print("All directories created and configuration loaded")
print("All modular functions will use CONFIG parameters instead of hardcoded defaults")

## 2. Download Overture Data with DuckDB

Use the `downloadOverture.py` module to fetch geospatial data from Overture Maps. This module uses DuckDB to efficiently query and download data for specific geographic extents.
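
To illustrate what bounding-box filtering with DuckDB can look like, here is a minimal sketch that only builds the SQL text. It is not the code in `downloadOverture.py`, which reads its queries from `tileQueries.template`. The function name, the default theme and type, and the S3 bucket layout are assumptions to verify against the Overture documentation, and `release` is a placeholder you must supply yourself.

```python
def build_overture_query(extent, release, theme="places", feature_type="place"):
    """Return a DuckDB SQL string selecting features whose bbox intersects extent.

    `release` is a placeholder for an Overture release identifier; the bucket
    layout used below is an assumption, not taken from the pipeline scripts.
    """
    lon_min, lat_min, lon_max, lat_max = extent
    source = (
        f"s3://overturemaps-us-west-2/release/{release}/"
        f"theme={theme}/type={feature_type}/*"
    )
    # Two boxes intersect when each one starts before the other one ends
    return f"""
SELECT *
FROM read_parquet('{source}', hive_partitioning=1)
WHERE bbox.xmin <= {lon_max}
  AND bbox.xmax >= {lon_min}
  AND bbox.ymin <= {lat_max}
  AND bbox.ymax >= {lat_min}
""".strip()


print(build_overture_query((-74.0, 40.6, -73.9, 40.7), release="<release>"))
```

Filtering on the `bbox` struct columns lets DuckDB skip Parquet row groups outside the extent, which is what keeps a regional download small. Executing the query would additionally require the `duckdb` package with remote-file access configured.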

In [None]:
# Download Overture Maps data
print("=== STEP 1: DOWNLOADING OVERTURE DATA ===")
download_results = download_overture_data(
    extent=CONFIG["extent"]["coordinates"],
    buffer_degrees=CONFIG["extent"]["buffer_degrees"],
    template_path=str(CONFIG["paths"]["template_path"]),
    verbose=CONFIG["download"]["verbose"],
    project_root=str(CONFIG["paths"]["project_root"]),
    overture_data_dir=str(CONFIG["paths"]["overture_data_dir"])
)

print(f"Download completed: {download_results['success']}")
print(f"Sections processed: {download_results['processed_sections']}")
if download_results["errors"]:
    print(f"Errors encountered: {len(download_results['errors'])}")
    for error in download_results["errors"]:
        print(f"  - {error}")
print()

In [None]:
# Check what files were created during download
print("=== STEP 2: CHECKING DOWNLOADED FILES ===")

overture_files = []
search_dirs = [CONFIG["paths"]["data_dir"], CONFIG["paths"]["overture_data_dir"]]

for data_dir in search_dirs:
    if data_dir.exists():
        for pattern in CONFIG["download"]["output_formats"]:
            files = list(data_dir.glob(pattern))
            overture_files.extend(files)

print(f"Found {len(overture_files)} downloaded files:")
for file in sorted(overture_files):
    file_size = file.stat().st_size / 1024 / 1024  # Size in MB
    print(f"  {file.name} ({file_size:.1f} MB)")

# Display file statistics
if overture_files:
    total_size_mb = sum(f.stat().st_size for f in overture_files) / 1024 / 1024
    print(f"\nTotal size: {total_size_mb:.1f} MB")
    print(f"Search directories: {[str(d) for d in search_dirs]}")
else:
    print("No files found. Check download results above.")
    print(f"Searched in: {[str(d) for d in search_dirs]}")

## 3. Convert Custom Spatial Data for Tippecanoe

Use the `convertCustomData.py` module to convert various geospatial formats to newline-delimited GeoJSON (GeoJSONSeq) files suitable for Tippecanoe.

### Supported Input Formats
- Shapefile (.shp)
- GeoPackage (.gpkg)
- FileGDB (.gdb)
- SQLite/SpatiaLite (.sqlite, .db)
- PostGIS (connection string)
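- CSV with geometry columns

Note that the conversion cell in this notebook only scans `CUSTOM_DATA_DIR` for the file patterns listed in `CONFIG["conversion"]["input_patterns"]` (which also include `*.geojson` and `*.json`), so PostGIS and CSV sources are not picked up by that search.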
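
For readers unfamiliar with the newline-delimited form, the sketch below shows the idea using only the standard library: each Feature is written as compact JSON on its own line. The function name `write_geojsonseq` is hypothetical, and skipping features without geometry is an assumption of this example rather than documented behavior of `convert_file`.

```python
import io
import json


def write_geojsonseq(feature_collection, stream):
    """Write each Feature as compact JSON on its own line; return the count written."""
    count = 0
    for feature in feature_collection.get("features", []):
        if feature.get("geometry") is None:
            continue  # assumption: features without geometry are skipped
        stream.write(json.dumps(feature, separators=(",", ":")) + "\n")
        count += 1
    return count


collection = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "a"},
         "geometry": {"type": "Point", "coordinates": [-73.97, 40.66]}},
        {"type": "Feature", "properties": {"name": "b"}, "geometry": None},
    ],
}

buffer = io.StringIO()
written = write_geojsonseq(collection, buffer)
print(f"{written} feature(s) written")
```

Because every line is an independent JSON document, a consumer can stream the file without loading the whole collection into memory.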

In [None]:
# Look for custom data files to convert
print("=== STEP 3: CONVERTING CUSTOM SPATIAL DATA ===")

custom_input_dir = CONFIG["paths"]["custom_data_dir"]
custom_files = []

# Search for various spatial data formats using CONFIG patterns
for pattern in CONFIG["conversion"]["input_patterns"]:
    custom_files.extend(custom_input_dir.glob(pattern))

print(f"Found {len(custom_files)} custom data files to convert:")
print(f"Search directory: {custom_input_dir}")
for file in custom_files:
    print(f"  {file.name}")

# Convert custom data files (if any exist)
converted_files = []

for input_file in custom_files:
    output_file = CONFIG["paths"]["output_dir"] / f"{input_file.stem}{CONFIG['conversion']['output_suffix']}"
    
    print(f"Converting {input_file.name}...")
    
    try:
        # Convert using the modular function with CONFIG settings
        processed, skipped, output_path = convert_file(
            input_path=str(input_file),
            output_path=str(output_file),
            reproject=CONFIG["conversion"]["reproject_crs"],
            verbose=CONFIG["conversion"]["verbose"]
        )
        
        converted_files.append(output_file)
        print(f"✓ Converted: {processed} features, {skipped} skipped")
        print(f"  Output: {output_file.name}")
        
    except Exception as e:
        print(f"✗ Error converting {input_file.name}: {e}")

if converted_files:
    print(f"\n✓ Successfully converted {len(converted_files)} files")
    print(f"  Output directory: {CONFIG['paths']['output_dir']}")
else:
    print(f"\nNo custom files to convert. Add data files to: {custom_input_dir}")
    print(f"Supported formats: {', '.join(CONFIG['conversion']['input_patterns'])}")

## 3.5. Define Tippecanoe Parameters per Layer
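
This section has no code in the notebook yet. As a starting point, the sketch below shows one way to keep Tippecanoe flags per geometry type and assemble a command line from them. The preset values and both names (`TIPPECANOE_PRESETS`, `build_tippecanoe_command`) are hypothetical and are not taken from `runCreateTiles.py`; writing `.pmtiles` output directly also assumes a Tippecanoe build with PMTiles support.

```python
# Hypothetical presets: the flags are tippecanoe options, the values are illustrative
TIPPECANOE_PRESETS = {
    "Point": ["-zg", "--drop-densest-as-needed", "--extend-zooms-if-still-dropping"],
    "LineString": ["-zg", "--drop-densest-as-needed", "--simplification=10"],
    "Polygon": ["-zg", "--coalesce-densest-as-needed", "--extend-zooms-if-still-dropping"],
}
DEFAULT_FLAGS = ["-zg", "--drop-densest-as-needed"]


def build_tippecanoe_command(input_path, output_path, layer, geometry_type):
    """Assemble a tippecanoe argument list for one layer (nothing is executed)."""
    flags = TIPPECANOE_PRESETS.get(geometry_type, DEFAULT_FLAGS)
    return [
        "tippecanoe",
        "-o", str(output_path),   # output archive
        "-l", layer,              # layer name inside the tiles
        "--force",                # overwrite an existing output file
        *flags,
        str(input_path),
    ]


print(" ".join(build_tippecanoe_command(
    "roads.geojsonseq", "roads.pmtiles", "roads", "LineString")))
```

Returning a list rather than a single string means the command can be passed to `subprocess.run` without shell quoting problems.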

## 4. Process GeoJSON/GeoJSONSeq to PMTiles

Use the `runCreateTiles.py` module to convert GeoJSON and GeoJSONSeq files to PMTiles using optimized Tippecanoe settings.

### Automatic Optimization Features
- **Geometry Detection**: Automatically detects Point, LineString, or Polygon geometries
- **Layer-Specific Settings**: Optimized settings for water, roads, places, land use, etc.
- **Parallel Processing**: Multi-threaded processing for large datasets
- **Quality Optimization**: Smart simplification and feature dropping
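
Geometry detection can be approximated by sampling the first features of a newline-delimited file. The sketch below is an assumption about how such detection might work, not the implementation in `runCreateTiles.py`; the names `GEOMETRY_FAMILIES` and `detect_geometry_family` are hypothetical.

```python
import json

# Multi-part types are grouped with their single-part family
GEOMETRY_FAMILIES = {
    "Point": "Point", "MultiPoint": "Point",
    "LineString": "LineString", "MultiLineString": "LineString",
    "Polygon": "Polygon", "MultiPolygon": "Polygon",
}


def detect_geometry_family(lines, sample_size=100):
    """Return the most common geometry family among the first sample_size features."""
    counts = {}
    sampled = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        geometry = json.loads(line).get("geometry") or {}
        family = GEOMETRY_FAMILIES.get(geometry.get("type"))
        if family:
            counts[family] = counts.get(family, 0) + 1
        sampled += 1
        if sampled >= sample_size:
            break
    return max(counts, key=counts.get) if counts else None


sample = [
    '{"type":"Feature","properties":{},"geometry":{"type":"MultiPolygon","coordinates":[]}}',
    '{"type":"Feature","properties":{},"geometry":{"type":"Polygon","coordinates":[]}}',
    '{"type":"Feature","properties":{},"geometry":{"type":"Point","coordinates":[0,0]}}',
]
detected = detect_geometry_family(sample)
print(detected)
```

An open file handle can be passed in place of the list, since the function only iterates over lines.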

In [None]:
# Step 4: Process all GeoJSON/GeoJSONSeq files to PMTiles
print("=== STEP 4: PROCESSING TO PMTILES ===")

# Process all downloaded and converted files to PMTiles using CONFIG settings
tiling_results = process_to_tiles(
    extent=CONFIG["extent"]["coordinates"],
    input_dirs=[str(d) for d in CONFIG["tiling"]["input_dirs"]],  # Convert Path objects to strings
    filter_pattern=CONFIG["tiling"]["filter_pattern"],  # Pass filter pattern from CONFIG
    output_dir=str(CONFIG["tiling"]["output_dir"]),  # Use explicit output directory from CONFIG
    parallel=CONFIG["tiling"]["parallel"],
    verbose=CONFIG["tiling"]["verbose"]
)

print(f"Tiling completed: {tiling_results['success']}")
print(f"Files processed: {len(tiling_results['processed_files'])}/{tiling_results['total_files']}")

if tiling_results["errors"]:
    print(f"Errors encountered: {len(tiling_results['errors'])}")
    for error in tiling_results["errors"]:
        print(f"  - {error}")

# Display generated PMTiles files
if tiling_results["processed_files"]:
    print(f"\n✓ Successfully generated {len(tiling_results['processed_files'])} PMTiles:")
    
    pmtiles_files = list(CONFIG["paths"]["tile_dir"].glob("*.pmtiles"))
    
    total_size_mb = 0
    for pmtile in sorted(pmtiles_files):
        size_mb = pmtile.stat().st_size / 1024 / 1024
        total_size_mb += size_mb
        print(f"  {pmtile.name} ({size_mb:.1f} MB)")
    
    print(f"\nTotal PMTiles size: {total_size_mb:.1f} MB")
    print(f"Files location: {CONFIG['paths']['tile_dir']}")
    
else:
    print("\nNo PMTiles files were generated. Check the errors above.")
    print(f"Make sure you have GeoJSON/GeoJSONSeq files in: {[str(d) for d in CONFIG['tiling']['input_dirs']]}")

## 5. Create TileJSON Metadata

Generate TileJSON metadata files for seamless integration with web mapping libraries like MapLibre GL JS.

### TileJSON Features
- **Bounds and zoom levels** automatically detected from PMTiles
- **Vector layer definitions** for each data layer
- **MapLibre GL JS compatibility** for easy web integration
- **PMTiles URL references** for efficient tile serving
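
For orientation, a TileJSON document is a small JSON object. The sketch below assembles one from an extent, a zoom range, and a list of layer ids, following the TileJSON 3.0.0 field names. The helper `build_tilejson` is hypothetical, and the tile URL is a placeholder; how `create_tilejson` references PMTiles archives is not shown in this notebook.

```python
import json


def build_tilejson(name, extent, minzoom, maxzoom, layer_ids, tiles_url):
    """Assemble a minimal TileJSON 3.0.0 dictionary."""
    lon_min, lat_min, lon_max, lat_max = extent
    return {
        "tilejson": "3.0.0",
        "name": name,
        "tiles": [tiles_url],
        "bounds": [lon_min, lat_min, lon_max, lat_max],
        # center is [longitude, latitude, zoom]
        "center": [(lon_min + lon_max) / 2, (lat_min + lat_max) / 2, minzoom],
        "minzoom": minzoom,
        "maxzoom": maxzoom,
        # each vector layer needs at least an id and a fields mapping
        "vector_layers": [{"id": layer_id, "fields": {}} for layer_id in layer_ids],
    }


tilejson_doc = build_tilejson(
    name="demo",
    extent=(-74.0, 40.0, -73.0, 41.0),
    minzoom=0,
    maxzoom=14,
    layer_ids=["roads", "water"],
    tiles_url="https://example.com/tiles/{z}/{x}/{y}.pbf",  # placeholder URL
)
print(json.dumps(tilejson_doc, indent=2))
```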

In [None]:
# Step 5: Create TileJSON metadata for MapLibre integration
print("=== STEP 5: CREATING TILEJSON METADATA ===")

# Check if PMTiles files exist in the configured tile directory
pmtiles_files = list(CONFIG["paths"]["tile_dir"].glob("*.pmtiles"))

if pmtiles_files:
    print(f"Found {len(pmtiles_files)} PMTiles files, creating TileJSON...")
    
    try:
        tilejson = create_tilejson(
            tile_dir=str(CONFIG["paths"]["tile_dir"]),  # Explicitly pass tile directory
            extent=CONFIG["extent"]["coordinates"],  # Pass extent from CONFIG
            output_file=str(CONFIG["paths"]["tile_dir"] / "tilejson.json")  # Explicitly pass output file path
        )
        
        print("✓ TileJSON created successfully")
        print(f"  Bounds: {tilejson['bounds']}")
        print(f"  Zoom range: {tilejson['minzoom']} - {tilejson['maxzoom']}")
        print(f"  Vector layers: {len(tilejson['vector_layers'])}")
        print(f"  Output file: {CONFIG['paths']['tile_dir'] / 'tilejson.json'}")
        
        # Show a summary of all output files
        print("\nComplete output summary:")
        total_size_mb = 0
        for pmtile in sorted(pmtiles_files):
            size_mb = pmtile.stat().st_size / 1024 / 1024
            total_size_mb += size_mb
            print(f"  {pmtile.name} ({size_mb:.1f} MB)")
        
        print("  tilejson.json")
        print(f"\nTotal PMTiles size: {total_size_mb:.1f} MB")
        print(f"All files location: {CONFIG['paths']['tile_dir']}")
        
    except Exception as e:
        print(f"✗ TileJSON creation failed: {e}")
        
else:
    print("No PMTiles files found in output directory.")
    print(f"Expected location: {CONFIG['paths']['tile_dir']}")
    print("Run Step 4 first to generate PMTiles files.")

## 6. Validate and Test Individual Steps

Test each processing step individually and validate the generated outputs.
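
One cheap validation of the generated outputs is to check that each `.pmtiles` file starts with the PMTiles magic bytes followed by a version byte. The sketch below performs only that header check and does not verify the tile data. The function name `looks_like_pmtiles` is hypothetical, and the demo writes two small temporary files rather than touching real pipeline output.

```python
import tempfile
from pathlib import Path

PMTILES_MAGIC = b"PMTiles"  # first 7 bytes of a PMTiles archive


def looks_like_pmtiles(path):
    """Return the spec version byte if the file starts with the PMTiles magic, else None."""
    with open(path, "rb") as fh:
        header = fh.read(8)
    if len(header) == 8 and header[:7] == PMTILES_MAGIC:
        return header[7]
    return None


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.pmtiles"
    good.write_bytes(PMTILES_MAGIC + bytes([3]) + b"\x00" * 119)  # fake v3 header
    bad = Path(tmp) / "bad.pmtiles"
    bad.write_bytes(b"not a tile archive")
    good_version = looks_like_pmtiles(good)
    bad_version = looks_like_pmtiles(bad)

print(good_version, bad_version)
```

A `None` result usually points to a truncated or failed Tippecanoe run for that layer.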

In [None]:
# Individual Step Testing and Validation

print("INDIVIDUAL STEP TESTING")
print("=" * 50)

print("\n1. Test downloadOverture.py standalone:")
print("python processing/downloadOverture.py --extent='23.4,-6.2,23.8,-5.8' --buffer=0.1")

print("\n2. Test convertCustomData.py standalone:")
print("python processing/convertCustomData.py input.shp output.geojsonseq --reproject=EPSG:4326")

print("\n3. Test runCreateTiles.py standalone:")
print("python processing/runCreateTiles.py --extent='23.4,-6.2,23.8,-5.8' --create-tilejson")

print("\n4. Test individual steps in this notebook:")
print("   - Step 1: Download section (cell 6)")
print("   - Step 2: Check downloaded files (cell 7)")
print("   - Step 3: Convert custom data (cell 9)")
print("   - Step 4: Process to PMTiles (cell 11)")
print("   - Step 5: Create TileJSON (cell 13)")

print("\n5. Validate outputs using CONFIG paths:")
print(f"   - Check {CONFIG['paths']['data_dir']} for GeoJSON files")
print(f"   - Check {CONFIG['paths']['tile_dir']} for PMTiles files")
print("   - Verify TileJSON metadata file")

# Configuration validation using centralized CONFIG
print("\nCURRENT CONFIGURATION VALIDATION")
print("=" * 50)
print(f"Extent: {CONFIG['extent']['coordinates']}")
print(f"Buffer: {CONFIG['extent']['buffer_degrees']} degrees")
print(f"Tile output directory: {CONFIG['paths']['tile_dir']}")
print(f"Custom data directory: {CONFIG['paths']['custom_data_dir']}")
print(f"Input directories for tiling: {[str(d) for d in CONFIG['tiling']['input_dirs']]}")

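# Area calculation using CONFIG.
# One degree of latitude is about 111 km; one degree of longitude shrinks
# by cos(latitude), so scale the east-west span accordingly.
import math
extent = CONFIG['extent']['coordinates']
area = (extent[2] - extent[0]) * (extent[3] - extent[1])
mid_lat = math.radians((extent[1] + extent[3]) / 2)
area_km2 = area * 111**2 * math.cos(mid_lat)
print(f"Processing area: {area:.6f} degree² (~{area_km2:.1f} km²)")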

# Check directory status
print("\nDIRECTORY STATUS")
print("=" * 30)
for path_name, path_obj in CONFIG['paths'].items():
    if path_name.endswith('_dir'):
        status = "exists" if path_obj.exists() else "missing"
        file_count = len(list(path_obj.glob("*"))) if path_obj.exists() else 0
        print(f"{path_name}: {status} ({file_count} files)")

print("\nPERFORMANCE OPTIMIZATION TIPS")
print("=" * 50)

print(f"\n1. For large areas (current: {area:.6f} degree²):")
print(f"   - Current buffer: {CONFIG['extent']['buffer_degrees']} degrees")
print(f"   - Parallel processing: {CONFIG['tiling']['parallel']}")
print("   - Consider smaller chunks if memory issues occur")

print("\n2. File management:")
print(f"   - Monitor {CONFIG['paths']['data_dir']} size during processing")
print("   - Clean intermediate files between steps if needed")
print("   - Use filter patterns to process specific layers only")

print("\n3. Output optimization:")
print(f"   - PMTiles output: {CONFIG['paths']['tile_dir']}")
print(f"   - Public tiles: {CONFIG['paths']['public_tiles_dir']}")
print("   - Copy final tiles to public directory for web serving")

# Modular Processing Summary

This notebook provides a complete, step-by-step approach for geospatial data processing with the following capabilities:

## Core Steps
1. **Download Overture Maps data** with spatial filtering using DuckDB
2. **Check and validate** downloaded files 
3. **Convert custom spatial data** to GeoJSON format
4. **Generate PMTiles** using optimized tippecanoe settings
5. **Create TileJSON metadata** for web mapping integration
6. **Validate and test** individual processing steps

## Key Features
- **Modular design** - Each step can be run independently
- **Flexible configuration** - Easy to customize for different areas and data types
- **Interactive development** - Run steps individually for debugging
- **Performance optimized** - Appropriate settings for different geometry types
- **Error reporting** - Download and tiling steps return their errors in a results dictionary, and conversion failures are caught per file

## Output Files
Each step generates specific outputs that can be directly used:
- **GeoJSON/GeoJSONSeq files** for further processing or analysis
- **PMTiles files** for efficient web mapping
- **TileJSON metadata** for MapLibre GL JS integration

## Usage Patterns
- **Development**: Run steps individually for testing and debugging
- **Production**: Execute all steps in sequence for automated processing
- **Customization**: Modify CONFIG settings and re-run specific steps
- **Integration**: Use generated files with web mapping applications
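
For the integration pattern, the final tiles need to be copied from the tile directory to the public directory used for web serving. The sketch below shows one way to do that with the standard library. The function name `publish_tiles` and the file patterns are assumptions, and the demo runs against temporary directories instead of `CONFIG["paths"]`.

```python
import shutil
import tempfile
from pathlib import Path


def publish_tiles(tile_dir, public_dir, patterns=("*.pmtiles", "tilejson.json")):
    """Copy matching files from tile_dir to public_dir; return the copied file names."""
    tile_dir, public_dir = Path(tile_dir), Path(public_dir)
    public_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for source in sorted(tile_dir.glob(pattern)):
            shutil.copy2(source, public_dir / source.name)  # preserves timestamps
            copied.append(source.name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    tiles = Path(tmp) / "tiles"
    tiles.mkdir()
    (tiles / "a.pmtiles").write_bytes(b"demo")
    (tiles / "tilejson.json").write_text("{}")
    (tiles / "notes.txt").write_text("not published")
    copied = publish_tiles(tiles, Path(tmp) / "public" / "tiles")
    published = sorted(p.name for p in (Path(tmp) / "public" / "tiles").iterdir())

print(copied)
```

In the notebook, the equivalent call would pass `CONFIG["paths"]["tile_dir"]` and `CONFIG["paths"]["public_tiles_dir"]`.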