# Geospatial Data Processing Pipeline

## Key Features
- **Overture Maps download** via DuckDB with bounding box filtering (outputs GeoParquet)
- **FlatGeobuf conversion** for optimal tile generation (streaming read, spatial indexing)
- **Multi-format conversion** (Shapefile, GeoPackage, etc.) to GeoJSON
- **Automated PMTiles generation** with tippecanoe
- **Performance optimized** for continent/world-scale processing

## Processing Steps
1. **Download** - Fetch Overture Maps data for specified extent (as GeoParquet)
2. **Convert to FlatGeobuf** - Transform GeoParquet to FlatGeobuf for efficient tiling
3. **Convert Custom Data** - Transform custom spatial data to GeoJSON/FlatGeobuf
4. **Tile** - Generate PMTiles using tippecanoe with optimized settings

## Format Optimization Strategy
- **GeoParquet (.parquet)** - Download format (compact, fast DuckDB queries)
- **FlatGeobuf (.fgb)** - Tiling format (streaming, spatial index, native tippecanoe support)
- **GeoJSON (.geojson)** - Legacy support for small datasets

### Why FlatGeobuf for Large-Scale Processing?
- ✓ **Streaming read**: Process datasets larger than memory
- ✓ **Spatial indexing**: Built-in R-tree for fast spatial queries
- ✓ **Compact**: 30-50% smaller than GeoJSON
- ✓ **Fast**: Optimized for millions of features
- ✓ **Native tippecanoe support**: v2.17+

## Prerequisites
- Python with required packages (duckdb, geopandas, pandas, tqdm); `pathlib` is part of the standard library and needs no install
- Tippecanoe 2.17.0+ installed and available in PATH
- GDAL/OGR for geospatial format conversion
- PyArrow for GeoParquet processing
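The prerequisites above can be checked before running any cell. The following is a minimal sketch, not part of the project's `scripts` package: `check_prerequisites` is a hypothetical helper, and the package and tool names are taken from the list above (with `ogr2ogr` assumed as the GDAL/OGR command-line entry point).

```python
import importlib.util
import shutil

# Names taken from the Prerequisites list; adjust to your environment.
REQUIRED_PACKAGES = ["duckdb", "geopandas", "tqdm", "pyarrow"]
REQUIRED_TOOLS = ["tippecanoe", "ogr2ogr"]


def check_prerequisites(packages=REQUIRED_PACKAGES, tools=REQUIRED_TOOLS):
    """Return (missing_packages, missing_tools) without importing the packages."""
    missing_packages = [p for p in packages if importlib.util.find_spec(p) is None]
    missing_tools = [t for t in tools if shutil.which(t) is None]
    return missing_packages, missing_tools


missing_packages, missing_tools = check_prerequisites()
print(f"Missing Python packages: {missing_packages or 'none'}")
print(f"Missing command-line tools: {missing_tools or 'none'}")
```

This only confirms that the names resolve; it does not verify versions, so the Tippecanoe 2.17.0+ requirement still needs a manual check.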

In [2]:
# Setup Python path and imports
import sys
import os
from pathlib import Path

# Add parent directory to path to import config and scripts
notebook_dir = Path.cwd()
processing_dir = notebook_dir.parent
if str(processing_dir) not in sys.path:
    sys.path.insert(0, str(processing_dir))

# Fix PROJ database path for GDAL/GeoPandas
# This prevents "PROJ: proj_create_from_database: Open of ... failed" errors
if 'PROJ_LIB' not in os.environ:
    # Auto-detect PROJ path in conda/micromamba environment
    conda_prefix = os.environ.get('CONDA_PREFIX', '')
    if conda_prefix:
        proj_lib = Path(conda_prefix) / 'share' / 'proj'
        if proj_lib.exists():
            os.environ['PROJ_LIB'] = str(proj_lib)
            print(f"✓ Set PROJ_LIB to: {proj_lib}")

# Import configuration
from config import (
    get_config,
    ensure_directories,
    print_config_summary,
    SCRIPTS_DIR,
    OUTPUT_DIR,
    OVERTURE_DATA_DIR,
    GRID3_DATA_DIR,
)

# Import processing functions
from scripts import (
    download_overture_data,
    convert_file,
    convert_parquet_to_fgb,
    batch_convert_directory,
    process_to_tiles,
    create_tilejson,
)

# Additional libraries for analysis and visualization
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

print("✓ Successfully imported configuration and processing modules")
print(f"  Config module: {processing_dir / 'config.py'}")
print(f"  Scripts package: {SCRIPTS_DIR}")

✓ Set PROJ_LIB to: /home/mjh2241/micromamba/envs/gis/share/proj
✓ Successfully imported configuration and processing modules
  Config module: /srv/mapTiles/1-processing/config.py
  Scripts package: /srv/mapTiles/1-processing/scripts


## 1. Project Configuration and Paths

Configure the project directories and processing parameters for the pipeline.
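To make the extent parameters concrete, here is a small sketch of how a buffer in degrees could be applied to a bounding box, together with a rough area estimate. Both `buffer_extent` and `approx_area_km2` are hypothetical helpers written for illustration; how `download_overture_data` applies `buffer_degrees` internally is an assumption. The area estimate scales longitude by the cosine of the mid-latitude, which matters away from the equator.

```python
import math


def buffer_extent(extent, buffer_degrees):
    """Expand (lon_min, lat_min, lon_max, lat_max) by a margin, clamped to valid lon/lat."""
    lon_min, lat_min, lon_max, lat_max = extent
    return (
        max(lon_min - buffer_degrees, -180.0),
        max(lat_min - buffer_degrees, -90.0),
        min(lon_max + buffer_degrees, 180.0),
        min(lat_max + buffer_degrees, 90.0),
    )


def approx_area_km2(extent):
    """Rough area of a lon/lat box, scaling longitude by cos(mid-latitude)."""
    lon_min, lat_min, lon_max, lat_max = extent
    km_per_degree = 111.32  # approximate length of one degree of latitude
    mid_lat = math.radians((lat_min + lat_max) / 2)
    width_km = (lon_max - lon_min) * km_per_degree * math.cos(mid_lat)
    height_km = (lat_max - lat_min) * km_per_degree
    return width_km * height_km


prospect_park = (-73.98257744202017, 40.64773925613089,
                 -73.9562859766083, 40.67679734614368)
print(buffer_extent(prospect_park, 0.25))
print(f"Unbuffered area: {approx_area_km2(prospect_park):.1f} km²")
```

For the Prospect Park extent, the latitude-corrected estimate comes out at about 7 km², somewhat below a flat 111 km-per-degree conversion.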

In [3]:
# Initialize configuration
CONFIG = get_config()

# Customize extent for your project (if needed)
# Example: Democratic Republic of Congo - Kasai-Oriental
# CONFIG["extent"]["coordinates"] = (22.0, -6.0, 24.0, -4.0)

# Brooklyn, Prospect Park area (default)
CONFIG["extent"]["coordinates"] = (
    -73.98257744202017,  # lon_min
    40.64773925613089,   # lat_min
    -73.9562859766083,   # lon_max
    40.67679734614368    # lat_max
)
CONFIG["extent"]["buffer_degrees"] = 0.25

# Customize processing options (if needed)
# CONFIG["tiling"]["input_dirs"] = [OVERTURE_DATA_DIR, GRID3_DATA_DIR]  # Include custom data
CONFIG["tiling"]["input_dirs"] = [OVERTURE_DATA_DIR]  # Just Overture data
CONFIG["download"]["verbose"] = True
CONFIG["conversion"]["verbose"] = True
CONFIG["tiling"]["verbose"] = True
CONFIG["tiling"]["parallel"] = True

# Create all necessary directories
ensure_directories()

# Display configuration summary
print_config_summary(CONFIG)
print("\n✓ Configuration loaded and directories initialized")


PROJECT CONFIGURATION
Project root:        /srv/mapTiles/1-processing
Scripts directory:   /srv/mapTiles/1-processing/scripts
Notebooks directory: /srv/mapTiles/1-processing/notebooks
Data directory:      /mnt/pool/gis/mapTiles/data
Scratch directory:   /mnt/pool/gis/mapTiles/data/2-scratch
Output directory:    /mnt/pool/gis/mapTiles/data/3-pmtiles
Overture data:       /mnt/pool/gis/mapTiles/data/1-input/overture
GRID3 data:         /mnt/pool/gis/mapTiles/data/1-input/grid3

Processing extent:   (-73.98257744202017, 40.64773925613089, -73.9562859766083, 40.67679734614368)
Buffer degrees:      0.25
Area:                0.0008 degree² (~9 km²)

✓ Configuration loaded and directories initialized


## 2. Download Overture Data with DuckDB

Use the `downloadOverture.py` module to fetch geospatial data from Overture Maps. This module uses DuckDB to efficiently query and download data for specific geographic extents.
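As an illustration of bounding-box filtering with DuckDB, the sketch below builds a query that keeps only features whose `bbox` struct overlaps the requested extent. It is not the code in `downloadOverture.py`: `build_overture_query` and `download_to_parquet` are hypothetical helpers, the S3 path layout is an assumption to verify against the Overture documentation, and the release string is left as a placeholder.

```python
def build_overture_query(release, theme, feature_type, extent):
    """Build DuckDB SQL selecting Overture features whose bbox overlaps the extent."""
    lon_min, lat_min, lon_max, lat_max = extent
    # Assumed path layout; confirm the bucket and release name before use.
    source = (
        f"s3://overturemaps-us-west-2/release/{release}/"
        f"theme={theme}/type={feature_type}/*"
    )
    # Two boxes overlap when each one starts before the other one ends on both axes.
    return (
        f"SELECT * FROM read_parquet('{source}', hive_partitioning=1) "
        f"WHERE bbox.xmin <= {lon_max} AND bbox.xmax >= {lon_min} "
        f"AND bbox.ymin <= {lat_max} AND bbox.ymax >= {lat_min}"
    )


def download_to_parquet(sql, output_path):
    """Run the query and write the result to a Parquet file (needs network access)."""
    import duckdb  # imported lazily so the query builder works without DuckDB

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region='us-west-2'")
    con.execute(f"COPY ({sql}) TO '{output_path}' (FORMAT PARQUET)")


sql = build_overture_query(
    "<RELEASE>", "buildings", "building",
    (-73.98257744202017, 40.64773925613089, -73.9562859766083, 40.67679734614368),
)
print(sql)
```

Filtering on the `bbox` columns lets DuckDB skip row groups that fall outside the extent, which is what keeps downloads for small areas fast.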

In [None]:
# Download Overture Maps data
print("=== STEP 1: DOWNLOADING OVERTURE DATA ===")
download_results = download_overture_data(
    extent=CONFIG["extent"]["coordinates"],
    buffer_degrees=CONFIG["extent"]["buffer_degrees"],
    template_path=str(CONFIG["paths"]["template_path"]),
    verbose=CONFIG["download"]["verbose"],
    project_root=str(CONFIG["paths"]["project_root"]),
    overture_data_dir=str(CONFIG["paths"]["overture_data_dir"])
)

print(f"Download completed: {download_results['success']}")
print(f"Sections processed: {download_results['processed_sections']}")
if download_results["errors"]:
    print(f"Errors encountered: {len(download_results['errors'])}")
    for error in download_results["errors"]:
        print(f"  - {error}")
print()

In [None]:
# Check what files were created during download
print("=== CHECKING DOWNLOADED FILES ===")

overture_files = []
search_dirs = [CONFIG["paths"]["data_dir"], CONFIG["paths"]["overture_data_dir"]]

for data_dir in search_dirs:
    if data_dir.exists():
        for pattern in CONFIG["download"]["output_formats"]:
            files = list(data_dir.glob(pattern))
            overture_files.extend(files)

print(f"Found {len(overture_files)} downloaded files:")
for file in sorted(overture_files):
    file_size = file.stat().st_size / 1024 / 1024  # Size in MB
    print(f"  {file.name} ({file_size:.1f} MB)")

# Display file statistics
if overture_files:
    total_size_mb = sum(f.stat().st_size for f in overture_files) / 1024 / 1024
    print(f"\nTotal size: {total_size_mb:.1f} MB")
    print(f"Search directories: {[str(d) for d in search_dirs]}")
else:
    print("No files found. Check download results above.")
    print(f"Searched in: {[str(d) for d in search_dirs]}")

## 2.5. Convert GeoParquet to FlatGeobuf for Optimal Tiling

Convert downloaded GeoParquet files to FlatGeobuf format for efficient tile generation.

### Why This Step?
- **Memory efficiency**: FlatGeobuf supports streaming reads (essential for large datasets)
- **Speed**: Built-in spatial indexing accelerates tippecanoe processing
- **Native support**: Tippecanoe 2.17+ reads FlatGeobuf natively (no intermediate conversion)
- **Compact**: 30-50% smaller than GeoJSON while maintaining full attribute data

### Performance for Large Datasets
- **Continent-scale**: Process billions of features without memory issues
- **World-scale**: Optimal format for global basemap generation
- **Parallel-friendly**: Each file can be processed independently
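One way to perform this conversion is to call `ogr2ogr`. The sketch below is an assumption about how such a step could look, not the implementation of `batch_convert_directory`; it requires a GDAL build that includes the Parquet driver (GDAL 3.5 or later). `build_ogr2ogr_command` and `convert_to_fgb` are hypothetical names.

```python
import subprocess
from pathlib import Path


def build_ogr2ogr_command(parquet_path, fgb_path=None):
    """Return the ogr2ogr argument list for a GeoParquet -> FlatGeobuf conversion."""
    parquet_path = Path(parquet_path)
    fgb_path = Path(fgb_path) if fgb_path else parquet_path.with_suffix(".fgb")
    return [
        "ogr2ogr",
        "-f", "FlatGeobuf",
        "-lco", "SPATIAL_INDEX=YES",  # write the packed R-tree index
        str(fgb_path),      # destination comes first in ogr2ogr
        str(parquet_path),  # source comes second
    ]


def convert_to_fgb(parquet_path, fgb_path=None, overwrite=False):
    """Convert one file, skipping existing outputs unless overwrite is set."""
    cmd = build_ogr2ogr_command(parquet_path, fgb_path)
    output = Path(cmd[-2])
    if output.exists():
        if not overwrite:
            return output
        output.unlink()
    subprocess.run(cmd, check=True)
    return output


print(" ".join(build_ogr2ogr_command("overture/roads.parquet")))
```

Because each file maps to one independent command, the conversions can be distributed across worker processes.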

In [4]:
# Convert GeoParquet files to FlatGeobuf for optimal tiling performance
print("=== STEP 2.5: CONVERTING GEOPARQUET TO FLATGEOBUF ===")

# Use CONFIG settings for FlatGeobuf conversion
fgb_results = batch_convert_directory(
    input_dir=str(CONFIG["paths"]["overture_data_dir"]),
    output_dir=str(CONFIG["paths"]["overture_data_dir"]),  # Save FGB files alongside parquet
    pattern=CONFIG["fgb_conversion"]["input_pattern"],
    overwrite=CONFIG["fgb_conversion"]["overwrite"],
    verbose=CONFIG["fgb_conversion"]["verbose"]
)

print(f"\nConversion Summary:")
print(f"  ✓ Converted: {fgb_results['converted']} files")
print(f"  ⊘ Skipped:   {fgb_results['skipped']} files (already exist)")
print(f"  ✗ Errors:    {len(fgb_results['errors'])} files")

if fgb_results['errors']:
    print("\nErrors encountered:")
    for error in fgb_results['errors']:
        print(f"  - {error['file']}: {error['error']}")

if fgb_results['output_files']:
    print(f"\n✓ FlatGeobuf files ready for tippecanoe")
    print(f"  Location: {CONFIG['paths']['overture_data_dir']}")
else:
    print(f"\nNo new FlatGeobuf files created.")
    if fgb_results['skipped'] > 0:
        print(f"All {fgb_results['skipped']} files already converted. Use overwrite=True to reconvert.")

=== STEP 2.5: CONVERTING GEOPARQUET TO FLATGEOBUF ===
Found 8 GeoParquet files to convert
Input:  /mnt/pool/gis/mapTiles/data/1-input/overture
Output: /mnt/pool/gis/mapTiles/data/1-input/overture



Converting to FlatGeobuf: 100%|██████████| 8/8 [00:15<00:00,  2.00s/it]


Conversion Summary:
  Total files:     8
  Converted:       7
  Skipped:         1
  Errors:          0
  Total FGB size:  275.8 MB

✓ FlatGeobuf files ready for tippecanoe

Conversion Summary:
  ✓ Converted: 7 files
  ⊘ Skipped:   1 files (already exist)
  ✗ Errors:    0 files

✓ FlatGeobuf files ready for tippecanoe
  Location: /mnt/pool/gis/mapTiles/data/1-input/overture





## 3. Convert Custom Spatial Data for Tippecanoe

Use the `convertCustomData.py` module to convert various geospatial formats to newline-delimited GeoJSON files suitable for Tippecanoe.

### Supported Input Formats
- Shapefile (.shp)
- GeoPackage (.gpkg)
- FileGDB (.gdb)
- SQLite/SpatiaLite (.sqlite, .db)
- PostGIS (connection string)
- CSV with geometry columns
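The target format, newline-delimited GeoJSON, holds one feature per line, so a reader can process features one at a time. The sketch below illustrates the format using only the standard library; `write_geojsonseq` is a hypothetical helper and does not reflect how `convert_file` is implemented.

```python
import io
import json


def write_geojsonseq(feature_collection, fh):
    """Write each feature of a GeoJSON FeatureCollection on its own line."""
    count = 0
    for feature in feature_collection.get("features", []):
        if feature.get("geometry") is None:
            continue  # skip features without geometry
        fh.write(json.dumps(feature, separators=(",", ":")) + "\n")
        count += 1
    return count


collection = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "A"},
         "geometry": {"type": "Point", "coordinates": [23.5, -6.0]}},
        {"type": "Feature", "properties": {"name": "no geometry"}, "geometry": None},
        {"type": "Feature", "properties": {"name": "B"},
         "geometry": {"type": "Point", "coordinates": [23.6, -5.9]}},
    ],
}

buffer = io.StringIO()
written = write_geojsonseq(collection, buffer)
print(f"Wrote {written} features")
print(buffer.getvalue())
```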

In [None]:
# Look for custom data files to convert
print("=== STEP 3: CONVERTING CUSTOM SPATIAL DATA ===")

custom_input_dir = CONFIG["paths"]["grid3_data_dir"]
custom_files = []

# Search for various spatial data formats using CONFIG patterns
for pattern in CONFIG["conversion"]["input_patterns"]:
    custom_files.extend(custom_input_dir.glob(pattern))

print(f"Found {len(custom_files)} custom data files to convert:")
print(f"Search directory: {custom_input_dir}")
for file in custom_files:
    print(f"  {file.name}")

# Convert custom data files (if any exist)
converted_files = []

for input_file in custom_files:
    output_file = CONFIG["paths"]["output_dir"] / f"{input_file.stem}{CONFIG['conversion']['output_suffix']}"
    
    print(f"Converting {input_file.name}...")
    
    try:
        # Convert using the modular function with CONFIG settings
        processed, skipped, output_path = convert_file(
            input_path=str(input_file),
            output_path=str(output_file),
            reproject=CONFIG["conversion"]["reproject_crs"],
            verbose=CONFIG["conversion"]["verbose"]
        )
        
        converted_files.append(output_file)
        print(f"✓ Converted: {processed} features, {skipped} skipped")
        print(f"  Output: {output_file.name}")
        
    except Exception as e:
        print(f"✗ Error converting {input_file.name}: {e}")

if converted_files:
    print(f"\n✓ Successfully converted {len(converted_files)} files")
    print(f"  Output directory: {CONFIG['paths']['output_dir']}")
else:
    print(f"\nNo custom files to convert. Add data files to: {custom_input_dir}")
    print(f"Supported formats: {', '.join(CONFIG['conversion']['input_patterns'])}")

## 3.5. Define Tippecanoe Parameters per Layer
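A possible shape for per-layer parameters is a dictionary keyed by layer name that is turned into a tippecanoe argument list. The zoom ranges and flag choices below are illustrative assumptions, not the settings used by `runCreateTiles.py`; `LAYER_SETTINGS`, `DEFAULT_SETTINGS`, and `build_tippecanoe_command` are hypothetical names.

```python
# Illustrative values only; tune zoom ranges and dropping strategy per dataset.
LAYER_SETTINGS = {
    "buildings": {"minzoom": 12, "maxzoom": 16, "extra": ["--drop-densest-as-needed"]},
    "roads": {"minzoom": 6, "maxzoom": 16, "extra": ["--coalesce-densest-as-needed"]},
    "water": {"minzoom": 0, "maxzoom": 16, "extra": ["--coalesce-densest-as-needed"]},
}
DEFAULT_SETTINGS = {"minzoom": 0, "maxzoom": 16, "extra": ["--drop-densest-as-needed"]}


def build_tippecanoe_command(input_path, output_path, layer):
    """Return the tippecanoe argument list for one layer."""
    settings = LAYER_SETTINGS.get(layer, DEFAULT_SETTINGS)
    return [
        "tippecanoe",
        "-o", str(output_path),
        "-l", layer,                      # layer name inside the tileset
        "-Z", str(settings["minzoom"]),   # lowest zoom to generate
        "-z", str(settings["maxzoom"]),   # highest zoom to generate
        "--force",                        # overwrite an existing output file
        *settings["extra"],
        str(input_path),
    ]


print(" ".join(build_tippecanoe_command("roads.fgb", "roads.pmtiles", "roads")))
```

Keeping the settings in data rather than in code makes it easy to review or override them per project.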

## 4. Process FlatGeobuf to PMTiles

Use the `runCreateTiles.py` module to convert FlatGeobuf files to PMTiles using optimized Tippecanoe settings.

### Supported Input Formats (Priority Order)
1. **FlatGeobuf (.fgb)** - **RECOMMENDED** for large-scale processing
   - Streaming read capability (low memory)
   - Built-in spatial indexing (fast)
   - Native tippecanoe support
   - Optimal for continent/world-scale data

2. **GeoJSONSeq (.geojsonseq)** - Good for medium datasets
   - Line-delimited format
   - Sequential processing

3. **GeoJSON (.geojson)** - Small datasets only
   - Full file must load into memory
   - Not recommended for large areas

### Automatic Optimization Features
- **Geometry Detection**: Automatically detects Point, LineString, or Polygon geometries
- **Layer-Specific Settings**: Optimized settings for water, roads, places, land use, etc.
- **Parallel Processing**: Multi-threaded processing for large datasets
- **Quality Optimization**: Smart simplification and feature dropping
- **Format Detection**: Automatically selects best input format available
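The format priority described above can be sketched as a lookup that keeps, for each layer name, the file with the most preferred extension. This is an assumption about the selection logic rather than the code in `process_to_tiles`; `FORMAT_PRIORITY` and `select_inputs` are hypothetical names.

```python
import tempfile
from pathlib import Path

# Most preferred first, mirroring the priority order listed above.
FORMAT_PRIORITY = [".fgb", ".geojsonseq", ".geojson"]


def select_inputs(input_dir):
    """Map each file stem to its most preferred input file in input_dir."""
    best = {}
    for path in sorted(Path(input_dir).iterdir()):
        suffix = path.suffix.lower()
        if suffix not in FORMAT_PRIORITY:
            continue  # ignore unrelated files
        current = best.get(path.stem)
        if current is None or (
            FORMAT_PRIORITY.index(suffix)
            < FORMAT_PRIORITY.index(current.suffix.lower())
        ):
            best[path.stem] = path
    return best


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["roads.fgb", "roads.geojson", "water.geojson", "notes.txt"]:
        (Path(tmp) / name).touch()
    selected = select_inputs(tmp)

print({stem: path.name for stem, path in selected.items()})
```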

In [5]:
# Step 4: Process all geospatial files to PMTiles
print("=== STEP 4: PROCESSING TO PMTILES ===")

# Process all downloaded and converted files to PMTiles using CONFIG settings
# Supported inputs: FlatGeobuf (preferred), GeoJSONSeq, and GeoJSON
tiling_results = process_to_tiles(
    extent=CONFIG["extent"]["coordinates"],
    input_dirs=[str(d) for d in CONFIG["tiling"]["input_dirs"]],  # Convert Path objects to strings
    filter_pattern=CONFIG["tiling"]["filter_pattern"],  # Pass filter pattern from CONFIG
    output_dir=str(CONFIG["tiling"]["output_dir"]),  # Use explicit output directory from CONFIG
    parallel=CONFIG["tiling"]["parallel"],
    verbose=CONFIG["tiling"]["verbose"]
)

# print(f"Tiling completed: {tiling_results['success']}")
# print(f"Files processed: {len(tiling_results['processed_files'])}/{tiling_results['total_files']}")

if tiling_results["errors"]:
    print(f"Errors encountered: {len(tiling_results['errors'])}")
    for error in tiling_results["errors"]:
        print(f"  - {error}")

# Display generated PMTiles files
if tiling_results["processed_files"]:
    print(f"\n✓ Successfully generated {len(tiling_results['processed_files'])} PMTiles:")
    
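    # Note: this lists every .pmtiles file in the tile directory, including files
    # left over from earlier runs, so the listing can be longer than the count above
    pmtiles_files = list(CONFIG["paths"]["tile_dir"].glob("*.pmtiles"))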
    
    total_size_mb = 0
    for pmtile in sorted(pmtiles_files):
        size_mb = pmtile.stat().st_size / 1024 / 1024
        total_size_mb += size_mb
        print(f"  {pmtile.name} ({size_mb:.1f} MB)")
    
    print(f"\nTotal PMTiles size: {total_size_mb:.1f} MB")
    print(f"Files location: {CONFIG['paths']['tile_dir']}")
    
else:
    print("\nNo PMTiles files were generated. Check the errors above.")
    print(f"Make sure you have geospatial files (FlatGeobuf/GeoJSONSeq/GeoJSON) in: {[str(d) for d in CONFIG['tiling']['input_dirs']]}")

=== STEP 4: PROCESSING TO PMTILES ===
=== PROCESSING TO TILES ===
Found 8 files to process:
  land_use.fgb (FlatGeobuf)
  land_residential.fgb (FlatGeobuf)
  roads.fgb (FlatGeobuf)
  land.fgb (FlatGeobuf)
  water.fgb (FlatGeobuf)
  buildings.fgb (FlatGeobuf)
  infrastructure.fgb (FlatGeobuf)
  land_cover.fgb (FlatGeobuf)


Processing files: 100%|██████████| 8/8 [00:06<00:00,  1.23file/s]

✓ land_residential.fgb -> /mnt/pool/gis/mapTiles/data/3-pmtiles/land_residential.pmtiles
✓ land.fgb -> /mnt/pool/gis/mapTiles/data/3-pmtiles/land.pmtiles
✓ water.fgb -> /mnt/pool/gis/mapTiles/data/3-pmtiles/water.pmtiles
✓ land_use.fgb -> /mnt/pool/gis/mapTiles/data/3-pmtiles/land_use.pmtiles
✓ land_cover.fgb -> /mnt/pool/gis/mapTiles/data/3-pmtiles/land_cover.pmtiles
✓ roads.fgb -> /mnt/pool/gis/mapTiles/data/3-pmtiles/roads.pmtiles
✓ infrastructure.fgb -> /mnt/pool/gis/mapTiles/data/3-pmtiles/infrastructure.pmtiles
✓ buildings.fgb -> /mnt/pool/gis/mapTiles/data/3-pmtiles/buildings.pmtiles

=== TILE PROCESSING COMPLETE ===
Processed: 8/8 files

✓ Successfully generated 8 PMTiles:
  GRID3_COD_Settlement_Extents_v3_1.pmtiles (3.4 MB)
  GRID3_COD_health_areas_v5_0.pmtiles (8.7 MB)
  GRID3_COD_health_facilities_v5_0.pmtiles (1.9 MB)
  GRID3_COD_health_zones_v5_0.pmtiles (2.6 MB)
  GRID3_COD_settlement_names_v5_0.pmtiles (2.1 MB)
  buildings.pmtiles (1.1 MB)
  infrastructure.pmtiles (0.3 MB)
  land.pmtiles (0.1 MB)
  land_cover.pmtiles (0.0 MB)
  land_residential.pmtiles (0.0 MB)
  land_use.pmtiles (0.1 MB)
  placenames.pmtiles (0.3 MB)
  places.pmtiles (0.2 MB)
  roads.pmtiles (0.9 MB)
  water.pmtiles (0.0 MB)

Total PMTiles size: 21.7 MB
Files location: /mnt/pool/gis/mapTiles/data/3-pmtiles





## 5. Create TileJSON Metadata

Generate TileJSON metadata files for seamless integration with web mapping libraries like MapLibre GL JS.

### TileJSON Features
- **Bounds and zoom levels** automatically detected from PMTiles
- **Vector layer definitions** for each data layer
- **MapLibre GL JS compatibility** for easy web integration
- **PMTiles URL references** for efficient tile serving
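For reference, a TileJSON document with the fields listed above can be assembled as a plain dictionary. The sketch below follows the TileJSON 3.0.0 field names; `build_tilejson` is a hypothetical helper, the tile URL is a placeholder, and the real `create_tilejson` function may populate these fields differently (for example by reading them from the PMTiles metadata).

```python
import json


def build_tilejson(layer_names, bounds, minzoom, maxzoom, tiles_url):
    """Assemble a minimal TileJSON 3.0.0 dictionary for a set of vector layers."""
    lon_min, lat_min, lon_max, lat_max = bounds
    return {
        "tilejson": "3.0.0",
        "tiles": [tiles_url],
        "bounds": list(bounds),
        # Center is [longitude, latitude, zoom]; the zoom must lie within the range.
        "center": [(lon_min + lon_max) / 2, (lat_min + lat_max) / 2, minzoom],
        "minzoom": minzoom,
        "maxzoom": maxzoom,
        # Each vector layer needs an id and a (possibly empty) fields mapping.
        "vector_layers": [{"id": name, "fields": {}} for name in layer_names],
    }


tilejson_doc = build_tilejson(
    layer_names=["buildings", "roads", "water"],
    bounds=(-73.98257744202017, 40.64773925613089,
            -73.9562859766083, 40.67679734614368),
    minzoom=0,
    maxzoom=16,
    tiles_url="https://example.com/tiles/{z}/{x}/{y}.mvt",  # placeholder URL
)
print(json.dumps(tilejson_doc, indent=2))
```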

In [8]:
# Step 5: Create TileJSON metadata for MapLibre integration
print("=== STEP 5: CREATING TILEJSON METADATA ===")

# Check if PMTiles files exist in the configured tile directory
pmtiles_files = list(CONFIG["paths"]["tile_dir"].glob("*.pmtiles"))

if pmtiles_files:
    print(f"Found {len(pmtiles_files)} PMTiles files, creating TileJSON...")
    
    try:
        tilejson = create_tilejson(
            tile_dir=str(CONFIG["paths"]["tile_dir"]),  # Explicitly pass tile directory
            extent=CONFIG["extent"]["coordinates"],  # Pass extent from CONFIG
            output_file=str(CONFIG["paths"]["tile_dir"] / "tilejson.json")  # Explicitly pass output file path
        )
        
        print("✓ TileJSON created successfully")
        print(f"  Bounds: {tilejson['bounds']}")
        print(f"  Zoom range: {tilejson['minzoom']} - {tilejson['maxzoom']}")
        print(f"  Vector layers: {len(tilejson['vector_layers'])}")
        print(f"  Output file: {CONFIG['paths']['tile_dir'] / 'tilejson.json'}")
        
        # Show a summary of all output files
        print(f"\nComplete output summary:")
        total_size_mb = 0
        for pmtile in sorted(pmtiles_files):
            size_mb = pmtile.stat().st_size / 1024 / 1024
            total_size_mb += size_mb
            print(f"  {pmtile.name} ({size_mb:.1f} MB)")
        
        print(f"  tilejson.json")
        print(f"\nTotal PMTiles size: {total_size_mb:.1f} MB")
        print(f"All files location: {CONFIG['paths']['tile_dir']}")
        
    except Exception as e:
        print(f"✗ TileJSON creation failed: {e}")
        
else:
    print("No PMTiles files found in output directory.")
    print(f"Expected location: {CONFIG['paths']['tile_dir']}")
    print("Run Step 4 first to generate PMTiles files.")

=== STEP 5: CREATING TILEJSON METADATA ===
Found 15 PMTiles files, creating TileJSON...
TileJSON created: /mnt/pool/gis/mapTiles/data/3-pmtiles/tilejson.json
Found 15 PMTiles files
✓ TileJSON created successfully
  Bounds: [-73.98257744202017, 40.64773925613089, -73.9562859766083, 40.67679734614368]
  Zoom range: 0 - 16
  Vector layers: 15
  Output file: /mnt/pool/gis/mapTiles/data/3-pmtiles/tilejson.json

Complete output summary:
  GRID3_COD_Settlement_Extents_v3_1.pmtiles (3.4 MB)
  GRID3_COD_health_areas_v5_0.pmtiles (8.7 MB)
  GRID3_COD_health_facilities_v5_0.pmtiles (1.9 MB)
  GRID3_COD_health_zones_v5_0.pmtiles (2.6 MB)
  GRID3_COD_settlement_names_v5_0.pmtiles (2.1 MB)
  buildings.pmtiles (1.1 MB)
  infrastructure.pmtiles (0.3 MB)
  land.pmtiles (0.1 MB)
  land_cover.pmtiles (0.0 MB)
  land_residential.pmtiles (0.0 MB)
  land_use.pmtiles (0.1 MB)
  placenames.pmtiles (0.3 MB)
  places.pmtiles (0.2 MB)
  roads.pmtiles (0.9 MB)
  water.pmtiles (0.0 MB)
  tilejson.json

Total PMTil

## 6. Validate and Test Individual Steps

Test each processing step individually and validate the generated outputs.
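A cheap first validation of generated outputs is to confirm that each file starts with the PMTiles header. The sketch below checks the 7-byte magic string and the version byte defined by the PMTiles v3 specification; `looks_like_pmtiles` is a hypothetical helper, and passing this check does not prove the archive's tiles are valid.

```python
import tempfile
from pathlib import Path

PMTILES_MAGIC = b"PMTiles"  # first 7 bytes of a PMTiles archive
PMTILES_VERSION = 3         # eighth byte for spec version 3


def looks_like_pmtiles(path):
    """Return True if the file starts with a PMTiles v3 magic number and version."""
    with open(path, "rb") as fh:
        header = fh.read(8)
    return (
        len(header) == 8
        and header[:7] == PMTILES_MAGIC
        and header[7] == PMTILES_VERSION
    )


# Demonstration with synthetic files; real archives come from the tiling step.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.pmtiles"
    bad = Path(tmp) / "bad.pmtiles"
    good.write_bytes(b"PMTiles\x03" + bytes(119))
    bad.write_bytes(b"not a tile archive")
    good_ok = looks_like_pmtiles(good)
    bad_ok = looks_like_pmtiles(bad)

print(f"good.pmtiles: {good_ok}, bad.pmtiles: {bad_ok}")
```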

In [9]:
# Individual Step Testing and Validation

print("INDIVIDUAL STEP TESTING")
print("=" * 50)

print("\n1. Test downloadOverture.py standalone:")
print("python processing/downloadOverture.py --extent='23.4,-6.2,23.8,-5.8' --buffer=0.1")

print("\n2. Test convertCustomData.py standalone:")
print("python processing/convertCustomData.py input.shp output.geojsonseq --reproject=EPSG:4326")

print("\n3. Test runCreateTiles.py standalone:")
print("python processing/runCreateTiles.py --extent='23.4,-6.2,23.8,-5.8' --create-tilejson")

print("\n4. Test individual steps in this notebook:")
print("   - Step 1: Download section (cell 6)")
print("   - Step 2: Check downloaded files (cell 7)")
print("   - Step 3: Convert custom data (cell 9)")
print("   - Step 4: Process to PMTiles (cell 11)")
print("   - Step 5: Create TileJSON (cell 13)")

print("\n5. Validate outputs using CONFIG paths:")
print(f"   - Check {CONFIG['paths']['data_dir']} for GeoJSON files")
print(f"   - Check {CONFIG['paths']['tile_dir']} for PMTiles files")
print(f"   - Verify TileJSON metadata file")

# Configuration validation using centralized CONFIG
print("\nCURRENT CONFIGURATION VALIDATION")
print("=" * 50)
print(f"Extent: {CONFIG['extent']['coordinates']}")
print(f"Buffer: {CONFIG['extent']['buffer_degrees']} degrees")
print(f"Tile output directory: {CONFIG['paths']['tile_dir']}")
print(f"Custom data directory: {CONFIG['paths']['grid3_data_dir']}")
print(f"Input directories for tiling: {[str(d) for d in CONFIG['tiling']['input_dirs']]}")

# Area calculation using CONFIG
# (rough estimate: 111 km per degree ignores the shrinking of longitude away from the equator)
extent = CONFIG['extent']['coordinates']
area = (extent[2] - extent[0]) * (extent[3] - extent[1])
print(f"Processing area: {area:.4f} degree² (~{area * 111**2:.0f} km²)")

# Check directory status
print(f"\nDIRECTORY STATUS")
print("=" * 30)
for path_name, path_obj in CONFIG['paths'].items():
    if path_name.endswith('_dir'):
        status = "exists" if path_obj.exists() else "missing"
        file_count = len(list(path_obj.glob("*"))) if path_obj.exists() else 0
        print(f"{path_name}: {status} ({file_count} files)")

print("\nPERFORMANCE OPTIMIZATION TIPS")
print("=" * 50)

print(f"\n1. For large areas (current: {area:.4f} degree²):")
print(f"   - Current buffer: {CONFIG['extent']['buffer_degrees']} degrees")
print(f"   - Parallel processing: {CONFIG['tiling']['parallel']}")
print("   - Consider smaller chunks if memory issues occur")

print("\n2. File management:")
print(f"   - Monitor {CONFIG['paths']['data_dir']} size during processing")
print("   - Clean intermediate files between steps if needed")
print("   - Use filter patterns to process specific layers only")

print("\n3. Output optimization:")
print(f"   - PMTiles output: {CONFIG['paths']['tile_dir']}")
# print(f"   - Public tiles: {CONFIG['paths']['public_tiles_dir']}")
print("   - Copy final tiles to public directory for web serving")

INDIVIDUAL STEP TESTING

1. Test downloadOverture.py standalone:
python processing/downloadOverture.py --extent='23.4,-6.2,23.8,-5.8' --buffer=0.1

2. Test convertCustomData.py standalone:
python processing/convertCustomData.py input.shp output.geojsonseq --reproject=EPSG:4326

3. Test runCreateTiles.py standalone:
python processing/runCreateTiles.py --extent='23.4,-6.2,23.8,-5.8' --create-tilejson

4. Test individual steps in this notebook:
   - Step 1: Download section (cell 6)
   - Step 2: Check downloaded files (cell 7)
   - Step 3: Convert custom data (cell 9)
   - Step 4: Process to PMTiles (cell 11)
   - Step 5: Create TileJSON (cell 13)

5. Validate outputs using CONFIG paths:
   - Check /mnt/pool/gis/mapTiles/data for GeoJSON files
   - Check /mnt/pool/gis/mapTiles/data/3-pmtiles for PMTiles files
   - Verify TileJSON metadata file

CURRENT CONFIGURATION VALIDATION
Extent: (-73.98257744202017, 40.64773925613089, -73.9562859766083, 40.67679734614368)
Buffer: 0.25 degrees
Tile o

# Modular Processing Summary

This notebook provides a complete, step-by-step approach for **large-scale geospatial data processing** optimized for continent and world-scale datasets.

## Core Steps
1. **Download Overture Maps data** with spatial filtering using DuckDB (outputs GeoParquet)
2. **Check and validate** downloaded files
3. **Convert to FlatGeobuf** - Optimize GeoParquet for efficient tile generation
4. **Convert custom spatial data** to GeoJSON/FlatGeobuf format
5. **Generate PMTiles** using optimized tippecanoe settings
6. **Create TileJSON metadata** for web mapping integration
7. **Validate and test** individual processing steps

## Format Workflow (Optimized for Scale)

```
Download (DuckDB)     Convert           Tile (Tippecanoe)
─────────────────     ───────           ─────────────────
GeoParquet (.parquet) → FlatGeobuf (.fgb) → PMTiles (.pmtiles)
    ↓                     ↓                      ↓
  Compact            Streaming Read         Web Optimized
  Fast Query         Spatial Index          Vector Tiles
  50-80% smaller     Low Memory            HTTP Range Requests
```

## Why This Workflow?

### 1. GeoParquet for Download
- **Compact storage**: 50-80% smaller than GeoJSON
- **Fast DuckDB queries**: Efficient spatial filtering
- **Columnar format**: Excellent compression

### 2. FlatGeobuf for Tiling
- **Streaming capability**: Process datasets larger than RAM
- **Spatial indexing**: R-tree for fast spatial queries
- **Native tippecanoe support**: No conversion overhead
- **Optimal for large scale**: Tested on continent/world datasets

### 3. PMTiles for Serving
- **Cloud-native**: Works with any static file host
- **Efficient delivery**: HTTP range requests
- **No tile server needed**: Direct browser access

## Performance Benefits
- **Memory efficiency**: Process billions of features without OOM errors
- **Disk space**: GeoParquet plus FlatGeobuf takes 2-3x less space than an all-GeoJSON workflow
- **Processing speed**: 20-40% faster tile generation vs GeoJSON
- **Parallel processing**: Multi-threaded for optimal CPU utilization

## Scale Capabilities
- ✓ **City-scale**: Brooklyn, Paris, Tokyo
- ✓ **Country-scale**: DRC, USA, India  
- ✓ **Continent-scale**: Africa, Europe, Americas
- ✓ **World-scale**: Global basemaps with billions of features

## Key Features
- **Modular design** - Each step can be run independently
- **Flexible configuration** - Easy to customize for different areas and data types
- **Interactive development** - Run steps individually for debugging
- **Performance optimized** - Format selection based on dataset size
- **Production ready** - Robust error handling and validation
- **Memory conscious** - Streaming workflows prevent OOM errors

## Output Files
Each step generates specific outputs:
- **GeoParquet files (.parquet)** - Compact download format
- **FlatGeobuf files (.fgb)** - Optimized tiling input (streaming, indexed)
- **PMTiles files (.pmtiles)** - Efficient web mapping output
- **TileJSON metadata** - MapLibre GL JS integration

## Usage Patterns
- **Development**: Run steps individually for testing and debugging
- **Production**: Execute all steps in sequence for automated processing
- **Customization**: Modify CONFIG settings and re-run specific steps
- **Integration**: Use generated PMTiles with web mapping applications

## Best Practices for Large Datasets
1. **Always convert to FlatGeobuf** before tiling (don't tile GeoParquet directly)
2. **Use parallel processing** for multi-file datasets
3. **Monitor disk space**: Keep both parquet and fgb during processing
4. **Clean up intermediate files** after successful tiling (keep parquet as source)
5. **Process by region** for extremely large datasets (e.g., split continents into countries)
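Processing by region can be approximated by splitting one large extent into a grid of smaller bounding boxes and running the pipeline once per chunk. The sketch below is one simple way to do that; `split_extent` is a hypothetical helper, and splitting along administrative boundaries (as suggested above) would need boundary data instead of a regular grid.

```python
def split_extent(extent, cols, rows):
    """Split (lon_min, lat_min, lon_max, lat_max) into cols x rows sub-extents."""
    lon_min, lat_min, lon_max, lat_max = extent
    dx = (lon_max - lon_min) / cols
    dy = (lat_max - lat_min) / rows
    chunks = []
    for row in range(rows):
        for col in range(cols):
            chunks.append((
                lon_min + col * dx,
                lat_min + row * dy,
                # Use the exact outer edge for the last column/row to avoid drift.
                lon_max if col == cols - 1 else lon_min + (col + 1) * dx,
                lat_max if row == rows - 1 else lat_min + (row + 1) * dy,
            ))
    return chunks


# Example extent from the configuration cell (Kasai-Oriental, DRC).
for chunk in split_extent((22.0, -6.0, 24.0, -4.0), cols=2, rows=2):
    print(chunk)
```

Features that cross a chunk boundary appear in more than one chunk, so a per-chunk workflow needs a deduplication or clipping step before tiles are merged.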