# ✅ Validating Spatial Data - Ensuring Data Quality

**GIST 604B - Python GeoPandas Introduction**  
**Notebook 3: Spatial Data Quality Control**

---

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
- Identify and diagnose invalid geometries in spatial datasets
- Check for missing or null spatial data
- Validate coordinate ranges and detect outliers
- Assess coordinate reference system appropriateness
- Generate comprehensive data quality reports
- Implement the `validate_spatial_data()` function

## 🚨 Why Data Validation Matters

Real-world spatial data often contains errors that can break your analysis:
- **Invalid geometries** - Self-intersecting polygons, unclosed rings
- **Missing coordinates** - Null or empty geometry values
- **Coordinate errors** - Values outside valid ranges
- **CRS mismatches** - Wrong or missing projection information
- **Topology issues** - Gaps, overlaps, or inconsistent boundaries

**Better to catch these early than debug mysterious analysis failures later!**

In [None]:
# Import necessary libraries
import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Point, Polygon, LineString
from shapely.validation import explain_validity
from shapely import wkt
import warnings
warnings.filterwarnings('ignore')

print("üì¶ Libraries loaded successfully!")
print(f"üêº GeoPandas version: {gpd.__version__}")

## 🔍 Detecting Invalid Geometries

Invalid geometries are one of the most common problems in spatial data. Let's create some examples and learn to detect them:

In [None]:
# Create examples of different geometry validity issues

# 1. Valid geometries
valid_point = Point(-120, 45)
valid_polygon = Polygon([(-119, 44), (-118, 44), (-118, 45), (-119, 45), (-119, 44)])

# 2. Invalid geometries
# Self-intersecting polygon (bow-tie shape)
invalid_polygon = wkt.loads('POLYGON((0 0, 1 1, 1 0, 0 1, 0 0))')

# 3. Empty geometry
empty_point = wkt.loads('POINT EMPTY')

# Create a test dataset
test_geometries = [
    valid_point,
    valid_polygon, 
    invalid_polygon,
    empty_point,
    None  # Missing geometry
]

test_gdf = gpd.GeoDataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Valid Point', 'Valid Polygon', 'Invalid Polygon', 'Empty Point', 'Missing'],
    'geometry': test_geometries
}, crs='EPSG:4326')

print("✅ Test dataset created with various geometry types")
print(f"Dataset shape: {test_gdf.shape}")
print("\nDataset preview:")
print(test_gdf[['id', 'name']])

In [None]:
# Check geometry validity using the is_valid property
print("üîç Checking geometry validity...\n")

# Check validity for each geometry
for idx, row in test_gdf.iterrows():
    geom = row.geometry
    name = row['name']
    
    if geom is None:
        print(f"❌ {name}: Missing geometry (None)")
    elif geom.is_empty:
        print(f"⚠️  {name}: Empty geometry")
    elif geom.is_valid:
        print(f"✅ {name}: Valid geometry")
    else:
        print(f"❌ {name}: Invalid geometry")
        # Get detailed explanation of what's wrong
        explanation = explain_validity(geom)
        print(f"   Reason: {explanation}")
    print()

In [None]:
# Batch validity checking for entire GeoDataFrame
print("üìä Batch validity analysis...\n")

# Count valid, invalid, missing, and empty geometries
valid_mask = test_gdf.geometry.notna() & ~test_gdf.geometry.is_empty & test_gdf.geometry.is_valid
invalid_mask = test_gdf.geometry.notna() & ~test_gdf.geometry.is_empty & ~test_gdf.geometry.is_valid
missing_mask = test_gdf.geometry.isna()
empty_mask = test_gdf.geometry.notna() & test_gdf.geometry.is_empty

valid_count = valid_mask.sum()
invalid_count = invalid_mask.sum()
missing_count = missing_mask.sum()
empty_count = empty_mask.sum()
total_count = len(test_gdf)

print(f"✅ Valid geometries: {valid_count} ({valid_count/total_count:.1%})")
print(f"❌ Invalid geometries: {invalid_count} ({invalid_count/total_count:.1%})")
print(f"⚠️  Empty geometries: {empty_count} ({empty_count/total_count:.1%})")
print(f"🕳️  Missing geometries: {missing_count} ({missing_count/total_count:.1%})")
print(f"📊 Total features: {total_count}")

# Show indices of problematic geometries
if invalid_count > 0:
    invalid_indices = test_gdf[invalid_mask].index.tolist()
    print(f"\n❌ Invalid geometry indices: {invalid_indices}")

if missing_count > 0:
    missing_indices = test_gdf[missing_mask].index.tolist()
    print(f"🕳️  Missing geometry indices: {missing_indices}")

## 🌍 Validating Coordinate Ranges

Coordinates should be within reasonable ranges based on the CRS. For geographic coordinates (EPSG:4326):
- Latitude: -90 to 90 degrees
- Longitude: -180 to 180 degrees

In [None]:
# Create test data with coordinate range issues
coord_test_data = [
    Point(-120, 45),    # Valid coordinates
    Point(-118, 46),    # Valid coordinates  
    Point(200, 50),     # Invalid longitude (> 180)
    Point(-90, 100),    # Invalid latitude (> 90)
    Point(-190, -95),   # Both coordinates out of range
]

coord_test_gdf = gpd.GeoDataFrame({
    'id': [1, 2, 3, 4, 5],
    'description': ['Valid', 'Valid', 'Bad Longitude', 'Bad Latitude', 'Both Bad'],
    'geometry': coord_test_data
}, crs='EPSG:4326')

print("üåç Testing coordinate ranges for geographic data (EPSG:4326)...\n")

def validate_geographic_coordinates(gdf):
    """Validate coordinate ranges for geographic CRS."""
    issues = []
    
    for idx, row in gdf.iterrows():
        geom = row.geometry
        desc = row.description
        
        if geom is not None and not geom.is_empty:
            # Extract coordinates (.x and .y exist only on Point geometries,
            # so this helper assumes a point dataset)
            x, y = geom.x, geom.y
            
            # Check longitude range (-180 to 180)
            if x < -180 or x > 180:
                issue = f"❌ Row {idx} ({desc}): Longitude {x:.2f} out of range [-180, 180]"
                issues.append(issue)
                print(issue)
            
            # Check latitude range (-90 to 90)
            if y < -90 or y > 90:
                issue = f"❌ Row {idx} ({desc}): Latitude {y:.2f} out of range [-90, 90]"
                issues.append(issue)
                print(issue)
            
            # Valid coordinates
            if -180 <= x <= 180 and -90 <= y <= 90:
                print(f"✅ Row {idx} ({desc}): Valid coordinates ({x:.2f}, {y:.2f})")
    
    return issues

coordinate_issues = validate_geographic_coordinates(coord_test_gdf)
print(f"\nüìä Found {len(coordinate_issues)} coordinate range issues")

## 🗺️ Checking CRS Issues

Coordinate reference system problems can cause major analysis errors:

In [None]:
# Create datasets with various CRS issues

# 1. Dataset with proper CRS
good_crs_gdf = gpd.GeoDataFrame({
    'id': [1, 2],
    'geometry': [Point(-120, 45), Point(-118, 46)]
}, crs='EPSG:4326')

# 2. Dataset with no CRS
no_crs_gdf = gpd.GeoDataFrame({
    'id': [1, 2], 
    'geometry': [Point(-120, 45), Point(-118, 46)]
})  # No CRS specified

# 3. Dataset with potentially wrong CRS for the data
wrong_crs_gdf = gpd.GeoDataFrame({
    'id': [1, 2],
    'geometry': [Point(500000, 4000000), Point(501000, 4001000)]  # UTM coordinates
}, crs='EPSG:4326')  # But marked as geographic!

def validate_crs(gdf, dataset_name):
    """Check CRS-related issues in a GeoDataFrame."""
    print(f"\nüó∫Ô∏è  Checking CRS for {dataset_name}:")
    issues = []
    
    # Check if CRS is defined
    if gdf.crs is None:
        issue = "‚ùå No CRS defined - coordinates are ambiguous!"
        issues.append(issue)
        print(issue)
        print("   üí° Recommendation: Set appropriate CRS with gdf.set_crs()")
    else:
        print(f"‚úÖ CRS defined: {gdf.crs}")
        
        # Check if coordinates seem appropriate for the CRS
        sample_coords = [(geom.x, geom.y) for geom in gdf.geometry.dropna()[:5]]
        
        if gdf.crs.is_geographic:
            # Geographic CRS - coordinates should be reasonable lat/lon
            for x, y in sample_coords:
                if abs(x) > 1000 or abs(y) > 1000:
                    issue = f"‚ö†Ô∏è  Large coordinates ({x:.0f}, {y:.0f}) for geographic CRS - possible CRS mismatch"
                    issues.append(issue)
                    print(issue)
                    print("   üí° Recommendation: Check if data is actually in projected coordinates")
                    break
        else:
            # Projected CRS - coordinates should be reasonably large
            for x, y in sample_coords:
                if abs(x) < 1000 and abs(y) < 1000:
                    issue = f"‚ö†Ô∏è  Small coordinates ({x:.2f}, {y:.2f}) for projected CRS - possible CRS mismatch"
                    issues.append(issue) 
                    print(issue)
                    print("   üí° Recommendation: Check if data is actually in geographic coordinates")
                    break
    
    return issues

# Test all datasets
crs_issues = []
crs_issues.extend(validate_crs(good_crs_gdf, "Good CRS Dataset"))
crs_issues.extend(validate_crs(no_crs_gdf, "No CRS Dataset"))
crs_issues.extend(validate_crs(wrong_crs_gdf, "Suspicious CRS Dataset"))

print(f"\nüìä Total CRS issues found: {len(crs_issues)}")

## 🛠️ Building Your validate_spatial_data() Function

Now let's create the function that combines all our validation checks. This is the implementation you should copy to your `src/spatial_basics.py` file:

In [None]:
def validate_spatial_data_demo(gdf):
    """
    Comprehensive spatial data validation function.
    
    This is the implementation you should copy to your src/spatial_basics.py file.
    It demonstrates all the validation checks your function should perform.
    
    Args:
        gdf (gpd.GeoDataFrame): Input spatial dataset to validate
    
    Returns:
        Dict[str, Any]: Validation report with required keys
    """
    # Initialize results dictionary with all required keys
    validation_results = {
        'is_valid': True,           # Overall validation status
        'issues_found': [],         # List of issues discovered  
        'invalid_geometries': 0,    # Count of invalid geometries
        'missing_geometries': 0,    # Count of null/missing geometries
        'crs_issues': [],          # CRS-related problems
        'recommendations': []       # Suggested fixes
    }
    
    # 1. Check for missing/null geometries
    missing_mask = gdf.geometry.isna()
    missing_count = missing_mask.sum()
    validation_results['missing_geometries'] = missing_count
    
    if missing_count > 0:
        validation_results['is_valid'] = False
        validation_results['issues_found'].append(f"{missing_count} missing geometries")
        validation_results['recommendations'].append("Remove or fix rows with missing geometries")
    
    # 2. Check geometry validity
    valid_geoms = gdf.geometry.dropna()
    if len(valid_geoms) > 0:
        # Check for invalid geometries
        invalid_mask = ~valid_geoms.is_valid
        invalid_count = invalid_mask.sum()
        validation_results['invalid_geometries'] = invalid_count
        
        if invalid_count > 0:
            validation_results['is_valid'] = False
            validation_results['issues_found'].append(f"{invalid_count} invalid geometries")
            validation_results['recommendations'].append("Fix invalid geometries using buffer(0) or repair methods")
        
        # Check for empty geometries (often indicates missing data)
        empty_mask = valid_geoms.is_empty
        empty_count = empty_mask.sum()
        
        if empty_count > 0:
            validation_results['is_valid'] = False
            validation_results['issues_found'].append(f"{empty_count} empty geometries")
            validation_results['recommendations'].append("Remove or fix empty geometries")
    
    # 3. Check CRS issues
    if gdf.crs is None:
        validation_results['is_valid'] = False
        crs_issue = "No CRS defined"
        validation_results['crs_issues'].append(crs_issue)
        validation_results['issues_found'].append(crs_issue)
        validation_results['recommendations'].append("Define appropriate CRS using set_crs() method")
    
    # 4. Check coordinate ranges (for geographic CRS)
    if gdf.crs is not None and gdf.crs.is_geographic:
        coord_issues = []
        # Skip missing and empty geometries: an empty Point has no usable x/y
        non_null_geoms = gdf.geometry.dropna()
        valid_geoms = non_null_geoms[~non_null_geoms.is_empty]
        
        for idx, geom in valid_geoms.items():
            if hasattr(geom, 'x') and hasattr(geom, 'y'):  # Point geometry
                x, y = geom.x, geom.y
                if not (-180 <= x <= 180) or not (-90 <= y <= 90):
                    coord_issues.append(f"Invalid coordinates at index {idx}: ({x:.2f}, {y:.2f})")
        
        if coord_issues:
            validation_results['is_valid'] = False
            validation_results['issues_found'].append(f"{len(coord_issues)} features with invalid coordinate ranges")
            validation_results['recommendations'].append("Check coordinate values and CRS specification")
    
    return validation_results

# Test the function with our problematic dataset
print("üß™ Testing validation function with problematic data:")
print("=" * 55)
results = validate_spatial_data_demo(test_gdf)

# Display results
print(f"\nüìä Validation Results:")
print(f"   Overall Valid: {results['is_valid']}")
print(f"   Issues Found: {len(results['issues_found'])}")
print(f"   Invalid Geometries: {results['invalid_geometries']}")
print(f"   Missing Geometries: {results['missing_geometries']}")
print(f"   CRS Issues: {len(results['crs_issues'])}")

if results['issues_found']:
    print("\n❌ Issues Found:")
    for issue in results['issues_found']:
        print(f"   - {issue}")

if results['recommendations']:
    print("\nüí° Recommendations:")
    for rec in results['recommendations']:
        print(f"   - {rec}")
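
One practical detail when saving a report like this: counts produced by `.sum()` on a pandas boolean Series are NumPy integers, which `json.dumps` rejects. The sketch below shows one way to convert a report to plain Python types before writing it out. The helper name `to_plain` and the sample report values are hypothetical and stand in for the dictionary returned above.

```python
import json

import numpy as np


def to_plain(value):
    """Recursively convert NumPy scalars to built-in Python types."""
    if isinstance(value, np.generic):
        return value.item()
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value


# Illustrative report with the same keys as the validation results
sample_report = {
    "is_valid": False,
    "issues_found": ["1 missing geometries", "1 invalid geometries"],
    "invalid_geometries": np.int64(1),
    "missing_geometries": np.int64(1),
    "crs_issues": [],
    "recommendations": ["Remove or fix rows with missing geometries"],
}

plain_report = to_plain(sample_report)
report_json = json.dumps(plain_report, indent=2)
print(report_json)
```

Alternatively, wrapping each count in `int(...)` at the point where it is stored achieves the same result without a helper.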

## 🧪 Testing Different Scenarios

Let's test our validation function with different types of data:

In [None]:
# Test 1: Clean, valid data
clean_gdf = gpd.GeoDataFrame({
    'id': [1, 2, 3],
    'name': ['A', 'B', 'C'],
    'geometry': [
        Point(-120, 45),
        Point(-118, 46),
        Point(-119, 44)
    ]
}, crs='EPSG:4326')

print("üß™ Test 1: Clean Data")
print("-" * 25)
clean_results = validate_spatial_data_demo(clean_gdf)
print(f"Valid: {clean_results['is_valid']} ‚úÖ" if clean_results['is_valid'] else f"Valid: {clean_results['is_valid']} ‚ùå")
print(f"Issues: {len(clean_results['issues_found'])}")

# Test 2: Data with no CRS
no_crs_test_gdf = gpd.GeoDataFrame({
    'id': [1, 2],
    'geometry': [Point(-120, 45), Point(-118, 46)]
})

print("\nüß™ Test 2: No CRS Data")
print("-" * 25)
no_crs_results = validate_spatial_data_demo(no_crs_test_gdf)
print(f"Valid: {no_crs_results['is_valid']} ‚úÖ" if no_crs_results['is_valid'] else f"Valid: {no_crs_results['is_valid']} ‚ùå")
print(f"Issues: {len(no_crs_results['issues_found'])}")
if no_crs_results['crs_issues']:
    print(f"CRS Issues: {no_crs_results['crs_issues']}")

# Test 3: Data with coordinate range issues
print("\nüß™ Test 3: Coordinate Range Issues")
print("-" * 35)
coord_results = validate_spatial_data_demo(coord_test_gdf)
print(f"Valid: {coord_results['is_valid']} ‚úÖ" if coord_results['is_valid'] else f"Valid: {coord_results['is_valid']} ‚ùå")
print(f"Issues: {len(coord_results['issues_found'])}")

print("\nüéâ All tests completed! Your function should handle these scenarios.")

## 🔧 Common Data Fixes

Detecting problems is only half the job. Here are common ways to fix spatial data issues:

In [None]:
print("üîß Common Spatial Data Fixes:\n")

# Fix 1: Repair invalid geometries using buffer(0)
print("1. üî® Fixing Invalid Geometries:")
print("   Before repair:")
invalid_geom = wkt.loads('POLYGON((0 0, 1 1, 1 0, 0 1, 0 0))')  # Self-intersecting
print(f"   - Valid: {invalid_geom.is_valid}")
print(f"   - Issue: {explain_validity(invalid_geom)}")

# Apply buffer(0) to fix. Note that it can discard part of a self-intersecting
# shape, so compare the result against the original before trusting it.
fixed_geom = invalid_geom.buffer(0)
print("   After buffer(0) repair:")
print(f"   - Valid: {fixed_geom.is_valid} ✅")
print(f"   - Geometry type: {fixed_geom.geom_type}")

print("\n2. 🗑️ Removing Missing Geometries:")
# Create example with missing data
example_gdf = gpd.GeoDataFrame({
    'id': [1, 2, 3],
    'geometry': [Point(-120, 45), None, Point(-118, 46)]
}, crs='EPSG:4326')

print(f"   Before: {len(example_gdf)} features, {example_gdf.geometry.isna().sum()} missing")
clean_gdf = example_gdf.dropna(subset=['geometry'])
print(f"   After: {len(clean_gdf)} features, {clean_gdf.geometry.isna().sum()} missing ✅")

print("\n3. 🗺️ Setting Missing CRS:")
no_crs_gdf = gpd.GeoDataFrame({
    'id': [1, 2], 
    'geometry': [Point(-120, 45), Point(-118, 46)]
})
print(f"   Before: CRS = {no_crs_gdf.crs}")
no_crs_gdf = no_crs_gdf.set_crs('EPSG:4326')
print(f"   After: CRS = {no_crs_gdf.crs} ✅")

print("\n💡 These techniques fix many common spatial data issues. Re-run validation afterwards to confirm!")

## 🎯 Key Takeaways

After completing this notebook, you should understand:

✅ **Geometry validation** - How to detect and diagnose invalid spatial features  
✅ **Missing data detection** - Finding null and empty geometries  
✅ **Coordinate validation** - Checking ranges and detecting outliers  
✅ **CRS validation** - Ensuring appropriate coordinate systems  
✅ **Quality reporting** - Generating comprehensive validation reports  
✅ **Data repair** - Common techniques for fixing spatial data issues  

## 📚 Next Steps

1. **Implement** your `validate_spatial_data()` function in `src/spatial_basics.py`
2. **Test** your implementation with `uv run pytest tests/ -k "validate_spatial_data" -v`
3. **Move on** to `04_function_standardize_crs.ipynb` to learn about CRS transformations

---

*Data validation is not glamorous, but it is essential! Clean, validated data is the foundation of reliable spatial analysis. Take time to validate your data before analysis - your future self will thank you!* 🌟