# ✅ Validating Spatial Data - Ensuring Data Quality

**GIST 604B - Python GeoPandas Introduction**  
**Notebook 4: Spatial Data Quality Control**

---

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
- Identify and diagnose invalid geometries in spatial datasets
- Check for missing or null spatial data
- Validate coordinate ranges and detect outliers
- Assess coordinate reference system appropriateness
- Generate comprehensive data quality reports
- Implement the `validate_spatial_data()` function

## 🚨 Why Data Validation Matters

Real-world spatial data often contains errors that can break your analysis:
- **Invalid geometries** - Self-intersecting polygons, unclosed rings
- **Missing coordinates** - Null or empty geometry values
- **Coordinate errors** - Values outside valid ranges
- **CRS mismatches** - Wrong or missing projection information
- **Topology issues** - Gaps, overlaps, or inconsistent boundaries

**Better to catch these early than debug mysterious analysis failures later!**

In [None]:
# Import necessary libraries
import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Point, Polygon, LineString
from shapely.validation import explain_validity
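import warnings
# Note: this silences every warning, including GeoPandas CRS warnings.
# Consider removing it while you are validating data.
warnings.filterwarnings('ignore')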

print("📦 Libraries loaded successfully!")
print(f"🐼 GeoPandas version: {gpd.__version__}")

## 🔍 Detecting Invalid Geometries

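Invalid geometries are among the most common problems in spatial data. A geometry is invalid when it breaks the OGC Simple Features rules, for example a polygon whose boundary crosses itself. Invalid geometries can make operations such as overlays fail or return wrong results.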

In [None]:
# TODO: Create examples of invalid geometries and show how to detect them
# - Self-intersecting polygons
# - Unclosed polygon rings  
# - Degenerate geometries (zero area/length)
# - Using is_valid property
# - Using explain_validity() for detailed error messages
pass

## 🕳️ Finding Missing and Null Geometries

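Some records carry attributes but no usable geometry. The geometry may be missing entirely (`None`) or present but empty, and GeoPandas treats these as two distinct cases.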

In [None]:
# TODO: Detect missing and null geometries
# - Using isna() (isnull() is an alias) to find missing geometries
# - Checking for empty geometries with is_empty
# - Counting and locating problematic records
pass
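
One possible way to approach the TODO above is sketched here. The `sites` data is made up for illustration, and the sketch assumes a recent GeoPandas release in which empty geometries are not reported as missing.

```python
import geopandas as gpd
from shapely.geometry import Point

# Three records: a normal point, a missing geometry, and an empty geometry.
sites = gpd.GeoDataFrame(
    {"site": ["a", "b", "c"]},
    geometry=[Point(-110.9, 32.2), None, Point()],
)

# isna() flags geometries that are missing (None).
missing = sites.geometry.isna()
# is_empty flags geometries that exist but contain no coordinates.
empty = sites.geometry.is_empty

# Combine both masks to locate every record without usable coordinates.
problem_rows = sites[missing | empty]

print(f"Missing: {missing.sum()}, empty: {empty.sum()}")
print(problem_rows)
```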

## 🌍 Validating Coordinate Ranges

Coordinates should fall within the range that the CRS allows. An out-of-range value usually points to swapped axes, a wrong CRS label, or a data-entry error.

In [None]:
# TODO: Check coordinate ranges
# - For geographic CRS (EPSG:4326): latitude [-90, 90], longitude [-180, 180]
# - For projected CRS: reasonable ranges based on projection
# - Detecting obvious outliers (coordinates in wrong hemisphere, etc.)
# - Flagging suspicious coordinate patterns
pass
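
The geographic range check from the TODO above could look roughly like this. The sample points and the `in_range` column are assumptions for illustration, and the sketch only covers data stored in degrees (EPSG:4326).

```python
import geopandas as gpd
from shapely.geometry import Point

points = gpd.GeoDataFrame(
    {"label": ["ok", "swapped_axes", "bad_longitude"]},
    geometry=[
        Point(-110.97, 32.22),   # longitude, latitude in the expected order
        Point(32.22, -110.97),   # axes swapped: "latitude" is -110.97
        Point(-1109.7, 32.22),   # misplaced decimal point in the longitude
    ],
    crs="EPSG:4326",
)

# bounds gives minx, miny, maxx, maxy for every geometry.
bounds = points.geometry.bounds
lon_ok = (bounds["minx"] >= -180) & (bounds["maxx"] <= 180)
lat_ok = (bounds["miny"] >= -90) & (bounds["maxy"] <= 90)
points["in_range"] = lon_ok & lat_ok

print(points[["label", "in_range"]])
```

Using `bounds` rather than `.x` and `.y` means the same check also works for lines and polygons.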

## 🗺️ Checking CRS Issues

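Coordinate reference system problems can cause major analysis errors. A missing or wrong CRS places features in the wrong location and makes distance and area measurements meaningless.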

In [None]:
# TODO: Validate CRS information
# - Check if CRS is defined (not None)
# - Verify CRS is appropriate for data extent
# - Detect common CRS mismatches
# - Flag datasets that might need reprojection
pass
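
A minimal sketch of the first checks in the TODO above follows. The helper name `check_crs` and its message strings are hypothetical, and it only covers two cases: an undefined CRS and projected-looking coordinates labeled as geographic.

```python
import geopandas as gpd
from shapely.geometry import Point


def check_crs(gdf):
    """Return a list of CRS-related issues (hypothetical helper)."""
    issues = []
    if gdf.crs is None:
        issues.append("CRS is not defined")
        return issues

    minx, miny, maxx, maxy = gdf.total_bounds
    # A geographic CRS stores degrees, so values far outside the degree
    # ranges suggest the data is actually in a projected CRS.
    if gdf.crs.is_geographic:
        if minx < -180 or maxx > 180 or miny < -90 or maxy > 90:
            issues.append("Coordinates exceed degree ranges; data may be projected")
    return issues


no_crs = gpd.GeoDataFrame(geometry=[Point(-110.9, 32.2)])
# Metre-scale coordinates labeled as EPSG:4326 (a common mismatch).
mislabeled = gpd.GeoDataFrame(geometry=[Point(500000, 3560000)], crs="EPSG:4326")
good = gpd.GeoDataFrame(geometry=[Point(-110.9, 32.2)], crs="EPSG:4326")

for name, gdf in [("no_crs", no_crs), ("mislabeled", mislabeled), ("good", good)]:
    print(name, check_crs(gdf))
```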

## 📐 Geometry-Specific Validation

Different geometry types have different validation requirements, so some checks only make sense for points, lines, or polygons.

### Point Validation

In [None]:
# TODO: Point-specific validation
# - Check for duplicate points
# - Validate coordinate precision
# - Detect points at (0, 0) which might indicate missing data
pass
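
Two of the point checks listed above can be sketched as follows. The sample data and the `duplicate` and `at_origin` column names are assumptions, and the sketch expects every row to hold a non-empty `Point`.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

pts = gpd.GeoDataFrame(
    {"id": [1, 2, 3, 4]},
    geometry=[Point(1, 2), Point(1, 2), Point(0, 0), Point(3, 4)],
    crs="EPSG:4326",
)

# Pull the coordinates into a plain DataFrame so pandas can compare them.
coords = pd.DataFrame({"x": pts.geometry.x, "y": pts.geometry.y})

# keep=False marks every member of a duplicate group, not just the repeats.
pts["duplicate"] = coords.duplicated(keep=False)
# (0, 0) is a legal coordinate, but often stands in for "unknown location".
pts["at_origin"] = (coords["x"] == 0) & (coords["y"] == 0)

print(pts[["id", "duplicate", "at_origin"]])
```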

### Line Validation

In [None]:
# TODO: LineString-specific validation
# - Check for minimum number of points (at least 2)
# - Detect zero-length lines
# - Find lines with duplicate consecutive points
pass
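
For the line checks above, a short sketch might look like this. The helper `has_repeated_vertex` and the sample lines are hypothetical. No CRS is set, so lengths are in raw coordinate units.

```python
import geopandas as gpd
from shapely.geometry import LineString

lines = gpd.GeoDataFrame(
    {"name": ["ok", "zero_length", "repeated_vertex"]},
    geometry=[
        LineString([(0, 0), (1, 1)]),
        LineString([(2, 2), (2, 2)]),                   # both vertices identical
        LineString([(0, 0), (1, 0), (1, 0), (2, 0)]),   # (1, 0) appears twice in a row
    ],
)


def has_repeated_vertex(line):
    """Return True if any two consecutive vertices are identical."""
    coords = list(line.coords)
    return any(a == b for a, b in zip(coords, coords[1:]))


lines["n_vertices"] = lines.geometry.apply(lambda line: len(line.coords))
lines["zero_length"] = lines.geometry.length == 0
lines["repeated_vertex"] = lines.geometry.apply(has_repeated_vertex)

print(lines[["name", "n_vertices", "zero_length", "repeated_vertex"]])
```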

### Polygon Validation

In [None]:
# TODO: Polygon-specific validation
# - Check for closed rings (first point = last point); note that Shapely
#   closes polygon rings automatically when it builds the geometry
# - Detect self-intersecting polygons
# - Find zero-area polygons
# - Validate that interior rings (holes) lie inside the exterior ring
pass
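
The ring and hole checks above can be illustrated with a small sketch. The two sample polygons and the column names are assumptions made for this example.

```python
import geopandas as gpd
from shapely.geometry import Polygon
from shapely.validation import explain_validity

outer = [(0, 0), (4, 0), (4, 4), (0, 4)]

polys = gpd.GeoDataFrame(
    {"name": ["hole_inside", "hole_outside"]},
    geometry=[
        # The hole sits inside the exterior ring: a valid polygon.
        Polygon(outer, holes=[[(1, 1), (2, 1), (2, 2), (1, 2)]]),
        # The hole lies completely outside the exterior ring: invalid.
        Polygon(outer, holes=[[(5, 5), (6, 5), (6, 6), (5, 6)]]),
    ],
)

polys["valid"] = polys.geometry.is_valid
polys["n_holes"] = polys.geometry.apply(lambda p: len(p.interiors))
# Shapely repeats the first vertex at the end, so this is normally True.
polys["ring_closed"] = polys.geometry.apply(
    lambda p: p.exterior.coords[0] == p.exterior.coords[-1]
)
polys["reason"] = polys.geometry.apply(explain_validity)

print(polys[["name", "valid", "n_holes", "ring_closed", "reason"]])
```

Because Shapely closes rings for you, an unclosed ring is more likely to show up in the raw source file than in a loaded `Polygon`.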

## 🛠️ Generating Quality Reports

A validation report summarizes the issues found in a dataset so you can decide whether it is fit for analysis.

In [None]:
# TODO: Create validation reports
# - Summary of issues found
# - Counts and percentages of different problem types
# - Specific recommendations for fixing issues
# - Overall data quality score/rating
pass
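
A simple report along the lines of the TODO above might be assembled like this. The function name `quality_report`, its dictionary keys, and the "percent clean" score are all illustrative assumptions rather than a required format.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon


def quality_report(gdf):
    """Summarize geometry problems as counts and a percentage (hypothetical)."""
    geom = gdf.geometry
    total = len(gdf)

    missing = geom.isna()
    empty = geom.is_empty & ~missing
    # Exclude missing and empty rows so each record is counted only once.
    invalid = ~geom.is_valid & ~missing & ~empty
    flagged = int((missing | empty | invalid).sum())

    return {
        "total_features": total,
        "crs_defined": gdf.crs is not None,
        "counts": {
            "missing": int(missing.sum()),
            "empty": int(empty.sum()),
            "invalid": int(invalid.sum()),
        },
        "percent_clean": round(100 * (total - flagged) / total, 1) if total else 0.0,
    }


sample = gpd.GeoDataFrame(
    {"id": [1, 2, 3, 4]},
    geometry=[
        Point(1, 1),
        None,
        Point(),
        Polygon([(0, 0), (2, 2), (2, 0), (0, 2)]),  # self-intersecting
    ],
)

report = quality_report(sample)
for key, value in report.items():
    print(f"{key}: {value}")
```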

## 🔧 Common Fixes for Spatial Data Issues

Detecting problems is only half the job. This section shows how to fix the most common ones.

In [None]:
# TODO: Demonstrate common fixes
# - Using buffer(0) to fix invalid polygons (Shapely's make_valid() is an
#   alternative that keeps all input vertices)
# - Removing duplicate points from lines
# - Dropping features with missing geometries
# - Setting appropriate CRS when missing
pass
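
Three of the fixes listed above are sketched below under simple assumptions. The sample data is made up, and EPSG:4326 is assumed to be the correct CRS only for the sake of the example. In practice, confirm the CRS from the data's documentation before setting it.

```python
import geopandas as gpd
from shapely.geometry import Polygon

raw = gpd.GeoDataFrame(
    {"name": ["bowtie", "square", "no_geometry"]},
    geometry=[
        Polygon([(0, 0), (2, 2), (2, 0), (0, 2)]),  # self-intersecting
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        None,
    ],
)

# 1. Drop features whose geometry is missing or empty.
unusable = raw.geometry.isna() | raw.geometry.is_empty
cleaned = raw[~unusable].copy()

# 2. Repair invalid polygons with a zero-width buffer.
cleaned["geometry"] = cleaned.geometry.buffer(0)

# 3. Declare the CRS (this labels the data; it does not reproject it).
cleaned = cleaned.set_crs("EPSG:4326")

print(cleaned)
print("All valid:", cleaned.geometry.is_valid.all())
```

Be aware that `buffer(0)` can discard part of a self-intersecting polygon, so compare areas before and after the repair.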

## 🛠️ Building Your validate_spatial_data() Function

Now let's implement a comprehensive validation function...

In [None]:
# TODO: Step-by-step implementation guide
# This will walk through building a robust validation function
# that checks for all major types of spatial data quality issues
pass
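
One possible shape for the function is sketched below. The signature, the returned keys (`is_valid`, `feature_count`, `issues`), and the message strings are assumptions. Check the assignment's tests for the interface they actually expect.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon


def validate_spatial_data(gdf):
    """Run basic quality checks and return a summary dict (assumed interface)."""
    issues = []

    if len(gdf) == 0:
        issues.append("dataset has no features")
    if gdf.crs is None:
        issues.append("CRS is not defined")

    geom = gdf.geometry
    missing = geom.isna()
    empty = geom.is_empty & ~missing
    usable = geom[~missing & ~empty]
    invalid = ~usable.is_valid

    if missing.any():
        issues.append(f"missing geometries: {int(missing.sum())}")
    if empty.any():
        issues.append(f"empty geometries: {int(empty.sum())}")
    if invalid.any():
        issues.append(f"invalid geometries: {int(invalid.sum())}")

    # The range check only applies to CRSs that store degrees.
    if gdf.crs is not None and gdf.crs.is_geographic and len(usable) > 0:
        minx, miny, maxx, maxy = usable.total_bounds
        if minx < -180 or maxx > 180 or miny < -90 or maxy > 90:
            issues.append("coordinates outside longitude/latitude ranges")

    return {"is_valid": not issues, "feature_count": len(gdf), "issues": issues}


clean = gpd.GeoDataFrame(geometry=[Point(-110.9, 32.2)], crs="EPSG:4326")
messy = gpd.GeoDataFrame(
    geometry=[None, Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])]
)

print(validate_spatial_data(clean))
print(validate_spatial_data(messy))
```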

## 🧪 Testing Your Implementation

Let's test our validation function with intentionally problematic data...

In [None]:
# TODO: Test cases with known problems
# - Create datasets with various types of issues
# - Verify that validation function catches all problems
# - Test edge cases and boundary conditions
pass
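
A test dataset with known defects makes it easy to confirm that a validator finds exactly what was planted. In the sketch below, `make_problem_data` is a hypothetical helper, and the checks use GeoPandas directly as a stand-in. Swap in your own `validate_spatial_data()` once it exists.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon


def make_problem_data():
    """Build a small GeoDataFrame whose defects are known in advance."""
    geometries = [
        Point(1, 1),                                # clean
        None,                                       # missing
        Point(),                                    # empty
        Polygon([(0, 0), (2, 2), (2, 0), (0, 2)]),  # self-intersecting
    ]
    gdf = gpd.GeoDataFrame({"id": range(len(geometries))}, geometry=geometries)
    expected = {"missing": 1, "empty": 1, "invalid": 1}
    return gdf, expected


gdf, expected = make_problem_data()
geom = gdf.geometry
found = {
    "missing": int(geom.isna().sum()),
    "empty": int((geom.is_empty & ~geom.isna()).sum()),
    "invalid": int((~geom.is_valid & ~geom.isna()).sum()),
}
assert found == expected, f"expected {expected}, found {found}"

# Edge case: a dataset with no rows should not produce any flags.
no_rows = gdf.iloc[0:0]
assert int(no_rows.geometry.isna().sum()) == 0

print("All planted problems were detected:", found)
```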

## 📊 Real-World Validation Examples

Examples of validation reports from actual spatial datasets...

In [None]:
# TODO: Show validation results from real datasets
# - Government boundary data
# - OpenStreetMap extracts
# - Crowdsourced GPS data
# - Commercial spatial datasets
pass

## 🎯 Key Takeaways

After completing this notebook, you should understand:

✅ **Geometry validation** - How to detect and diagnose invalid spatial features  
✅ **Missing data detection** - Finding null and empty geometries  
✅ **Coordinate validation** - Checking ranges and detecting outliers  
✅ **CRS validation** - Ensuring appropriate coordinate systems  
✅ **Quality reporting** - Generating comprehensive validation reports  
✅ **Data repair** - Common techniques for fixing spatial data issues  

## 📚 Next Steps

1. **Implement** your `validate_spatial_data()` function in `src/spatial_basics.py`
2. **Test** your implementation with `uv run pytest tests/ -k "validate_spatial_data" -v`
3. **Move on** to `05_coordinate_systems.ipynb` to learn about CRS transformations

---

*Data validation is not glamorous, but it's essential! Clean, validated data is the foundation of reliable spatial analysis. Take time to validate your data before analysis - your future self will thank you!* 🌟