<a href="https://colab.research.google.com/github/aborbala/tree-canopy/blob/main/01_05_Data_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
"""
Tree Crown Data Cleaning and Validation Pipeline

A geospatial data processing workflow that cleans automatically detected tree crown 
polygons, filters invalid geometries, and validates intersections with cadastral 
tree records.

Overview:
    This pipeline processes GeoJSON files containing automatically detected tree crowns,
    applies geometric filters based on shape ratios and minimum area thresholds, and
    cross-references detections with official cadastral tree data. The workflow removes
    invalid or anomalous crown geometries and marks validated detections.

Main Sections:

    1. Setup & Configuration
       - Install dependencies (geopandas, shapely, scikit-learn, OpenCV)
       - Mount Google Drive
       - Define input/output directories and AOI parameters
       - Load cadastral tree reference data

    2. Utility Functions
       - get_length_width(): Calculate principal dimensions of polygons using PCA
       - satisfies_ratio(): Validate polygon aspect ratio (max 2.4:1)

    3. Geometric Analysis with PCA
       - Apply Principal Component Analysis (PCA) to polygon coordinates
       - Calculate length and width along principal axes
       - Compute aspect ratio for shape validation

    4. Crown Geometry Validation
       - Filter by aspect ratio: length/width <= 2.4
       - Filter by minimum area: >= 1 square meter
       - Handle both single Polygon and MultiPolygon geometries
       - Remove degenerate or noise-like features

    5. Cadastral Intersection Checking
       - Load official cadastral tree point data (GPKG)
       - Test spatial intersection with validated crowns
       - Flag crowns that intersect with cadastral records
       - Create boolean intersection attribute

    6. Output Generation & Export
       - Create new GeoDataFrame with cleaned geometries
       - Add 'intersects_cadaster' boolean field
       - Preserve original CRS
       - Export cleaned data as GeoJSON per tile

Key Data:

    - Input: GeoJSON files with automatically detected tree crown polygons (per tile)
    - Reference: Cadastral tree points (GPKG format)
    - Processing: Geometric validation, aspect ratio checking, spatial intersection
    - Output: Cleaned GeoJSON files with validation flags

CRS:

    - EPSG:25833 (ETRS89 / UTM zone 33N) - metric projected CRS used for the Berlin area
    - CRS automatically detected from input GeoJSON

Input Format:

    - GeoJSON files (one per raster tile)
    - Geometries: Polygon or MultiPolygon (tree crowns)
    - Attributes: May vary per detection algorithm

Output Format:

    - GeoJSON files (same naming as input)
    - Geometries: Validated Polygon features
    - Attributes:
        * geometry: Cleaned polygon
        * intersects_cadaster: Integer flag (1 = intersects, 0 = no intersection)

Output Locations:

    - {aoi_code}/crowns_clean_no_structures_veg_mask/ - Cleaned crown GeoJSON files
    - One file per input tile maintaining original tile naming

Processing Parameters:

    - Aspect ratio threshold: 2.4 (length/width <= 2.4)
    - Minimum area: 1 square meter
    - Reference CRS: EPSG:25833
    - Geometry type: Polygon, MultiPolygon

Key Features:

    - PCA-based aspect ratio calculation (invariant to rotation)
    - Robust handling of MultiPolygon geometries
    - Automatic decomposition of MultiPolygons into single Polygons
    - Cross-validation with cadastral tree records
    - Preservation of spatial reference information
    - Batch processing of multiple tiles

Workflow:

    1. Load GeoJSON files containing detected tree crowns
    2. Load cadastral tree reference data (GPKG)
    3. For each crown geometry:
       - Calculate length and width using PCA
       - Validate aspect ratio (max 2.4:1)
       - Validate minimum area (>= 1 m²)
       - Decompose MultiPolygons into individual Polygons
    4. For each valid crown:
       - Check spatial intersection with cadastral points
       - Add boolean flag for intersection result
    5. Save cleaned geometries with intersection flags as GeoJSON

Validation Logic:

    - Aspect Ratio: Prevents extremely elongated or thin features that are unlikely
      to be tree crowns
    - Minimum Area: Filters out noise and micro-features below practical detection limits
    - Cadastral Intersection: Identifies crowns corresponding to official tree records
    - Geometry Type: Handles both simple polygons and compound multi-polygons

Dependencies:

    - geopandas - Vector data I/O and operations
    - shapely - Geometric operations
    - scikit-learn (PCA) - Principal component analysis for dimension calculation
    - opencv-python-headless - Image processing utilities
    - rasterio - Raster metadata (optional)
    - numpy - Numerical computations

Notes:

    - Aspect ratio of 2.4:1 empirically determined to distinguish tree crowns
      from building footprints and other elongated features
    - PCA-based approach is rotation-invariant (handles arbitrary polygon orientations)
    - MultiPolygon handling ensures all detection results are processed
    - Output boolean flag enables downstream filtering for accuracy assessment

Expected Outcomes:

    - 70-90% of detected crowns typically pass geometric validation
    - 60-85% of validated crowns intersect with cadastral records
    - Remaining crowns may represent: newly grown trees, trees not in cadastral records,
      false positives from vegetation detection, or trees below detection threshold

Author: Master Thesis Project
"""

# --- Setup ---

In [1]:
# Install necessary libraries
!pip install geopandas shapely scikit-learn rasterio opencv-python-headless

The command "pip" is either misspelled or
could not be found.


In [2]:
import os
import numpy as np
import geopandas as gpd
from shapely.geometry import Polygon, MultiPolygon
from sklearn.decomposition import PCA
import cv2
import rasterio
from google.colab import drive

ModuleNotFoundError: No module named 'numpy'

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


# --- Functions ---

In [None]:
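def get_length_width(polygon):
    """Return (length, width) of a polygon measured along its PCA axes.

    Returns (None, None) for empty or non-Polygon geometries.
    """
    if polygon.is_empty or polygon.geom_type != 'Polygon':
        return None, None

    # Center the exterior ring on its mean (x, y); drop any Z values.
    coords = np.array(polygon.exterior.coords, dtype=float)[:, :2]
    coords -= coords.mean(axis=0)

    # Rotate the ring into its principal-axis frame.
    coords_pca = PCA(n_components=2).fit_transform(coords)

    # Extent along each principal axis (bounding box in the rotated frame).
    extents = coords_pca.max(axis=0) - coords_pca.min(axis=0)
    length, width = float(extents.max()), float(extents.min())

    return length, width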
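
The rotation invariance of the PCA-based measurement can be checked with a small dependency-free sketch. It is not the notebook's implementation: it replaces scikit-learn's `PCA` with the closed-form principal-axis angle of a 2x2 covariance matrix, and the names `principal_extents` and `rotate` are hypothetical. The test rectangle is listed without the repeated closing vertex, which is an assumption of this example; shapely's `exterior.coords` does repeat the first vertex, and that can shift the fitted axes slightly.

```python
import math


def principal_extents(points):
    """Return (length, width) of 2D points measured along their principal axes."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 scatter matrix (a constant factor does not change the axes).
    sxx = sum(x * x for x, _ in centered)
    syy = sum(y * y for _, y in centered)
    sxy = sum(x * y for x, y in centered)

    # Closed-form angle of the first principal axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    # Project onto both principal axes and measure the extent along each.
    along = [x * cos_t + y * sin_t for x, y in centered]
    across = [-x * sin_t + y * cos_t for x, y in centered]
    a = max(along) - min(along)
    b = max(across) - min(across)
    return max(a, b), min(a, b)


def rotate(points, degrees):
    """Rotate points counter-clockwise about the origin."""
    t = math.radians(degrees)
    return [(x * math.cos(t) - y * math.sin(t),
             x * math.sin(t) + y * math.cos(t)) for x, y in points]


# A 6 m x 2 m rectangle, listed without the repeated closing vertex.
rectangle = [(0.0, 0.0), (6.0, 0.0), (6.0, 2.0), (0.0, 2.0)]
length_0, width_0 = principal_extents(rectangle)
length_30, width_30 = principal_extents(rotate(rectangle, 30))
```

Both calls should give a length of about 6 and a width of about 2 (up to floating-point error), so the ratio of 3.0 would fail the 2.4 threshold regardless of how the rectangle is oriented.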

In [None]:
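def satisfies_ratio(polygon, max_ratio=2.4):
    """Return True if the polygon's length/width ratio is at most max_ratio."""
    length, width = get_length_width(polygon)
    # get_length_width returns (None, None) for empty or non-Polygon input,
    # and a zero width marks a degenerate geometry; reject both.
    if not width:
        return False
    return length / width <= max_ratio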


# --- Configuration ---

In [None]:
# aoi_code = '386_5818' # training data
aoi_code = '384_5816' # test data

base_path = '/content/drive/MyDrive/masterthesis/data'

input_dir = f'{base_path}/{aoi_code}/crowns_no_structures_veg_mask'
output_dir = f'{base_path}/{aoi_code}/crowns_clean_no_structures_veg_mask'

cadaster_trees = f'{base_path}/{aoi_code}/trees.gpkg'
os.makedirs(output_dir, exist_ok=True)

In [None]:
cadaster_gdf = gpd.read_file(cadaster_trees)

In [None]:
min_area = 1
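
To make the two geometric thresholds concrete, the sketch below applies them to a few hand-picked (length, width, area) triples. The helper `passes_shape_filter` and the sample values are hypothetical and only illustrate the decision rule; in the notebook the dimensions come from `get_length_width` and the area from the geometry itself, in square meters under the metric CRS.

```python
MAX_RATIO = 2.4  # same threshold as satisfies_ratio
MIN_AREA = 1.0   # square meters, same value as min_area


def passes_shape_filter(length, width, area,
                        max_ratio=MAX_RATIO, min_area=MIN_AREA):
    """Hypothetical helper: True when both the ratio and the area check pass."""
    # Missing or zero dimensions mark an empty or degenerate geometry.
    if not length or not width:
        return False
    return (length / width) <= max_ratio and area >= min_area


# Illustrative (length m, width m, area m^2) triples; not measured data.
candidates = {
    "compact crown": (5.0, 4.0, 15.0),      # ratio 1.25, large enough
    "hedge-like strip": (12.0, 2.0, 20.0),  # ratio 6.0, too elongated
    "tiny speck": (0.8, 0.6, 0.4),          # ratio fine, area too small
    "degenerate": (3.0, 0.0, 0.0),          # zero width
}
results = {name: passes_shape_filter(*dims) for name, dims in candidates.items()}
```

Only the compact crown passes; each of the other three fails exactly one of the guards, which is a quick way to see which rule removes which kind of artifact.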

# --- Processing ---

In [None]:
for filename in os.listdir(input_dir):
    if not filename.endswith('.geojson'):
        continue

    filepath = os.path.join(input_dir, filename)
    gdf = gpd.read_file(filepath)

    cleaned_geometries = []
    intersects_cadaster = []

    for geom in gdf.geometry:
        if geom is None:
            continue

        # Treat a Polygon as a one-part geometry so both types share one code path.
        if geom.geom_type == 'Polygon':
            parts = [geom]
        elif geom.geom_type == 'MultiPolygon':
            parts = list(geom.geoms)
        else:
            continue

        for poly in parts:
            # Keep only parts that pass both the aspect-ratio and minimum-area filters.
            if satisfies_ratio(poly) and poly.area >= min_area:
                cleaned_geometries.append(poly)
                # 1 if any cadastral tree intersects this crown, else 0.
                intersects_cadaster.append(int(cadaster_gdf.intersects(poly).any()))

    # Create new GeoDataFrame with intersection flag, keeping the input CRS
    cleaned_gdf = gpd.GeoDataFrame({
        'geometry': cleaned_geometries,
        'intersects_cadaster': intersects_cadaster
    }, crs=gdf.crs)

    # Save to GeoJSON under the same tile name
    cleaned_filepath = os.path.join(output_dir, filename)
    cleaned_gdf.to_file(cleaned_filepath, driver='GeoJSON')

print("Cleaning, intersection checking, and saving completed.")

Cleaning, intersection checking, and saving completed.
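
The decomposition and flagging steps of the loop above can be traced on plain GeoJSON-style dictionaries without geopandas or shapely. This is a simplified sketch under stated assumptions, not what the libraries do internally: all function names are hypothetical, coordinates are 2D, holes are ignored (only exterior rings are used), the aspect-ratio check is left out for brevity, and a ray-casting point-in-polygon test stands in for the `intersects` call.

```python
def ring_area(ring):
    """Planar area of a ring via the shoelace formula (closing vertex optional)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def iter_exterior_rings(geometry):
    """Yield one exterior ring per polygon for Polygon and MultiPolygon dicts."""
    if geometry is None:
        return
    if geometry["type"] == "Polygon":
        yield geometry["coordinates"][0]
    elif geometry["type"] == "MultiPolygon":
        for polygon in geometry["coordinates"]:
            yield polygon[0]


def point_in_ring(point, ring):
    """Ray-casting test: True when the point lies inside the ring."""
    px, py = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        # Count edges that cross the horizontal ray going right from the point.
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def clean_geometries(geometries, cadaster_points, min_area=1.0):
    """Return (ring, flag) pairs for rings that meet the area threshold."""
    cleaned = []
    for geometry in geometries:
        for ring in iter_exterior_rings(geometry):
            if ring_area(ring) < min_area:
                continue
            flag = int(any(point_in_ring(p, ring) for p in cadaster_points))
            cleaned.append((ring, flag))
    return cleaned


# Illustrative inputs: one Polygon, one MultiPolygon with two parts, one null.
square = [[0, 0], [4, 0], [4, 4], [0, 4], [0, 0]]
speck = [[10, 10], [10.5, 10], [10.5, 10.5], [10, 10.5], [10, 10]]
far_square = [[20, 20], [23, 20], [23, 23], [20, 23], [20, 20]]
geometries = [
    {"type": "Polygon", "coordinates": [square]},
    {"type": "MultiPolygon", "coordinates": [[speck], [far_square]]},
    None,
]
cadaster_points = [(2.0, 2.0), (50.0, 50.0)]

cleaned = clean_geometries(geometries, cadaster_points)
flags = [flag for _, flag in cleaned]
```

In this example the 0.25 m² speck is dropped, the first square is kept and flagged 1 because a cadastral point falls inside it, and the far square is kept with flag 0.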
