# 📁 Loading Spatial Data - Mastering Data Input

**GIST 604B - Python GeoPandas Introduction**  
**Notebook 2: Load Spatial Data from Various Sources**

---

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
- Load spatial data from different file formats (GeoJSON, Shapefile, etc.)
- Handle file paths using both strings and Path objects
- Troubleshoot common loading errors and encoding issues
- Implement robust error handling for spatial data loading
- Understand the differences between spatial file formats
- **Prepare to implement the `load_spatial_dataset()` function**

## 🗂️ What You'll Practice

This notebook directly prepares you to implement the **`load_spatial_dataset()`** function in your assignment. You'll learn:

1. **File Format Detection**: How to determine what type of spatial file you're working with
2. **Error Handling**: What can go wrong when loading spatial data and how to handle it
3. **Path Management**: Working with file paths in a robust way
4. **Data Validation**: Ensuring loaded data is actually valid spatial data
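
As a rough illustration of item 1, the format of a file can often be guessed from its extension alone. The sketch below uses only the standard library; the helper name `detect_spatial_format` and the extension table are assumptions made for this example, not part of GeoPandas.

```python
from pathlib import Path

# Hypothetical lookup table for this example: extension -> format label
KNOWN_FORMATS = {
    ".geojson": "GeoJSON",
    ".json": "GeoJSON",
    ".shp": "Shapefile",
    ".gpkg": "GeoPackage",
}

def detect_spatial_format(file_path):
    """Guess the spatial format from the file extension (case-insensitive)."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in KNOWN_FORMATS:
        raise ValueError(f"Unsupported format: {suffix or '(no extension)'}")
    return KNOWN_FORMATS[suffix]

print(detect_spatial_format("../data/cities/sample_cities.geojson"))  # GeoJSON
print(detect_spatial_format(Path("WORLD_CITIES.SHP")))                # Shapefile
```

An extension is only a hint: a file named `.geojson` can still contain invalid data, so the loader itself remains the final check.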

---

## 🚀 Getting Started

Let's start by importing the libraries we'll need and setting up our workspace:

In [None]:
# 📚 Import required libraries
import geopandas as gpd
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
import warnings
import os

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("🔧 Libraries imported successfully!")
print(f"📦 GeoPandas version: {gpd.__version__}")
print(f"🐼 Pandas version: {pd.__version__}")

# Check our data directory
data_path = Path('../data')
print(f"\n📁 Data directory exists: {data_path.exists()}")
if data_path.exists():
    subdirs = [d.name for d in data_path.iterdir() if d.is_dir()]
    print(f"📂 Available datasets: {subdirs}")

## 📖 Basic Spatial Data Loading

The most fundamental operation in spatial analysis is loading data. GeoPandas makes this remarkably simple with the `gpd.read_file()` function.

### 🎯 The Universal Loader

Unlike pandas, which has a separate reader for each format (`read_csv()`, `read_json()`, etc.), GeoPandas uses **one function for the common file-based spatial formats**:

In [None]:
# 🌍 Load spatial data - it's this simple!

print("📁 Loading different spatial formats with the same function:\n")

# Method 1: Load GeoJSON
cities_geojson = gpd.read_file('../data/cities/sample_cities.geojson')
print(f"📄 GeoJSON: Loaded {len(cities_geojson)} cities")
print(f"   Columns: {list(cities_geojson.columns)}")

# Method 2: Load Shapefile
cities_shapefile = gpd.read_file('../data/cities/world_cities.shp')
print(f"\n📄 Shapefile: Loaded {len(cities_shapefile)} cities")
print(f"   Columns: {list(cities_shapefile.columns)}")

print("\n✨ Same function, different formats - GeoPandas handles the details!")

## 🛣️ Working with File Paths

Professional code needs to handle file paths robustly. Let's explore different ways to specify file paths and why the `pathlib.Path` approach is preferred:

In [None]:
# 🛣️ Different ways to handle file paths

print("🔧 Path Handling Methods:\n")

# Method 1: String paths (traditional)
file_path_string = '../data/cities/sample_cities.geojson'
cities_string = gpd.read_file(file_path_string)
print(f"📝 String path: '{file_path_string}'")
print(f"   Result: {len(cities_string)} cities loaded")

# Method 2: Path objects (modern, recommended)
file_path_object = Path('../data/cities/sample_cities.geojson')
cities_path = gpd.read_file(file_path_object)
print(f"\n🗂️ Path object: {file_path_object}")
print(f"   Result: {len(cities_path)} cities loaded")

# Path object advantages
print(f"\n💡 Path Object Advantages:")
print(f"   📁 Absolute path: {file_path_object.absolute()}")
print(f"   📄 File name: {file_path_object.name}")
print(f"   📂 Parent directory: {file_path_object.parent}")
print(f"   🔍 File exists: {file_path_object.exists()}")
print(f"   📊 File size: {file_path_object.stat().st_size} bytes")

# Both approaches work identically
print(f"\n✅ Both methods load identical data: {cities_string.equals(cities_path)}")
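
Path objects can also build and transform paths without string concatenation. A small sketch using only `pathlib`; the directory and file names mirror the ones above, and nothing is read from disk:

```python
from pathlib import Path

data_dir = Path("..") / "data" / "cities"      # join parts with the / operator
geojson_path = data_dir / "sample_cities.geojson"

print(geojson_path.name)      # sample_cities.geojson
print(geojson_path.stem)      # sample_cities
print(geojson_path.suffix)    # .geojson

# Derive a sibling path with a different extension
shp_path = geojson_path.with_suffix(".shp")
print(shp_path.name)          # sample_cities.shp

# Path(...) accepts both str and Path, so one conversion handles either input
assert Path(str(geojson_path)) == Path(geojson_path)
```

This is why a loading function can start with `Path(file_path)` regardless of what the caller passed in.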

## 🔍 Understanding Spatial File Formats

Different spatial formats have different characteristics. Let's explore the most common ones and understand when to use each:

In [None]:
# 🗂️ Comparing different spatial file formats

print("📊 Spatial File Format Comparison:\n")

# Load the same data in different formats
formats = {
    'GeoJSON': '../data/cities/world_cities.geojson',
    'Shapefile': '../data/cities/world_cities.shp'
}

for format_name, file_path in formats.items():
    if Path(file_path).exists():
        # Load the data
        gdf = gpd.read_file(file_path)
        
        # File size analysis
        if format_name == 'Shapefile':
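            # A Shapefile is several sidecar files (.shp, .shx, .dbf, ...).
            # Skip the .geojson copy, which shares the same file stem.
            shp_path = Path(file_path)
            shp_files = [f for f in shp_path.parent.glob(f'{shp_path.stem}.*')
                         if f.suffix.lower() != '.geojson']
            total_size = sum(f.stat().st_size for f in shp_files)
            file_count = len(shp_files)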
        else:
            # Single file formats
            total_size = Path(file_path).stat().st_size
            file_count = 1
        
        print(f"📄 {format_name}:")
        print(f"   💾 Size: {total_size:,} bytes ({total_size/1024:.1f} KB)")
        print(f"   📁 Files: {file_count}")
        print(f"   🌍 Features: {len(gdf)}")
        print(f"   📊 Columns: {len(gdf.columns)}")
        print(f"   🗺️ CRS: {gdf.crs}")
        print()

# Format characteristics
print("🎯 When to Use Each Format:")
print("📄 GeoJSON:")
print("   ✅ Web applications, APIs, JavaScript")
print("   ✅ Human-readable text format")
print("   ✅ Single file, easy to share")
print("   ⚠️  Larger file sizes")
print("")
print("📄 Shapefile:")
print("   ✅ Traditional GIS software")
print("   ✅ Widely supported")
print("   ✅ Compact file size")
print("   ⚠️  Multiple files to manage")
print("   ⚠️  Column name limitations (10 chars)")
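
Because a Shapefile is spread over several files, a missing sidecar is a common cause of load failures. The sketch below checks for the companion files before loading. `.shp`, `.shx`, and `.dbf` are the mandatory parts of the format and `.prj` carries the CRS; the helper name `missing_shapefile_parts` is made up for this example.

```python
from pathlib import Path
import tempfile

REQUIRED_PARTS = (".shp", ".shx", ".dbf")   # mandatory Shapefile components
OPTIONAL_PARTS = (".prj",)                  # projection info; without it the CRS is unknown

def missing_shapefile_parts(shp_path):
    """Return (missing_required, missing_optional) suffix lists for a .shp path."""
    shp_path = Path(shp_path)
    missing_required = [s for s in REQUIRED_PARTS
                        if not shp_path.with_suffix(s).exists()]
    missing_optional = [s for s in OPTIONAL_PARTS
                        if not shp_path.with_suffix(s).exists()]
    return missing_required, missing_optional

# Demo with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "world_cities.shp"
    for suffix in (".shp", ".dbf"):          # deliberately leave out .shx and .prj
        base.with_suffix(suffix).touch()
    demo_required, demo_optional = missing_shapefile_parts(base)
    print(demo_required, demo_optional)      # ['.shx'] ['.prj']
```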

## ⚠️ Error Handling and Troubleshooting

Real-world spatial data loading often involves problems. Let's explore common errors and how to handle them professionally:

### 🚨 Common Loading Errors

In [None]:
# 🚨 Demonstrating common errors and solutions

print("🕵️ Common Spatial Data Loading Errors:\n")

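# Error 1: File doesn't exist
print("❌ Error 1: File Not Found")
try:
    missing_file = gpd.read_file('nonexistent_file.geojson')
except Exception as e:
    # The exception class depends on the I/O engine (pyogrio or fiona)
    # and may not be FileNotFoundError, so catch broadly and report the type
    print(f"   🚨 Error type: {type(e).__name__}")
    print(f"   🚨 Error: {e}")
    print("   💡 Solution: Check file path and spelling")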

print()

# Error 2: Invalid format
print("❌ Error 2: Invalid File Format")
# Create a temporary invalid file
invalid_file = Path('../data/invalid_spatial_file.txt')
invalid_file.write_text("This is not spatial data")

try:
    invalid_data = gpd.read_file(invalid_file)
except Exception as e:
    print(f"   🚨 Error type: {type(e).__name__}")
    print(f"   🚨 Error message: {e}")
    print("   💡 Solution: Ensure file contains valid spatial data")
finally:
    # Clean up the temporary file even if something unexpected happens
    invalid_file.unlink()

print()

# Error 3: Corrupted data (using our problematic dataset)
print("❌ Error 3: Data Quality Issues")
try:
    problematic = gpd.read_file('../data/cities/cities_with_issues.geojson')
    print(f"   ⚠️  File loads but has {problematic.geometry.isna().sum()} missing geometries")
    print(f"   ⚠️  File loads but has {(~problematic.geometry.is_valid).sum()} invalid geometries")
    print("   💡 Solution: Load successfully, then validate and clean data")
except Exception as e:
    print(f"   🚨 Error: {e}")
    print("   💡 Solution: Check data integrity and format")

print("\n✅ Key Takeaway: Always handle errors gracefully in production code!")
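
When a `.geojson` file fails to load, a quick look at its structure often explains why. The sketch below uses only the `json` module to check that a file parses and has a recognised top-level `type`. `diagnose_geojson` is a hypothetical helper, and it is not a full GeoJSON validator.

```python
import json
import tempfile
from pathlib import Path

# Top-level "type" values defined by the GeoJSON specification (RFC 7946)
GEOJSON_TYPES = {
    "FeatureCollection", "Feature", "Point", "MultiPoint", "LineString",
    "MultiLineString", "Polygon", "MultiPolygon", "GeometryCollection",
}

def diagnose_geojson(path):
    """Return a short description of what is wrong with a file, or 'ok'."""
    try:
        content = json.loads(Path(path).read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        return "not valid JSON"
    if not isinstance(content, dict) or content.get("type") not in GEOJSON_TYPES:
        return "JSON, but not a GeoJSON object"
    return "ok"

# Demo with two small files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    bad = Path(tmp) / "bad.geojson"
    bad.write_text("This is not spatial data", encoding="utf-8")
    good = Path(tmp) / "good.geojson"
    good.write_text('{"type": "FeatureCollection", "features": []}', encoding="utf-8")
    bad_result = diagnose_geojson(bad)
    good_result = diagnose_geojson(good)
    print(bad_result)    # not valid JSON
    print(good_result)   # ok
```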

## 🛡️ Building a Robust Loading Function

Now let's build a professional-grade loading function step by step. This will guide you toward implementing the `load_spatial_dataset()` function in your assignment:

### 🔧 Step-by-Step Implementation

In [None]:
# 🛡️ Building a robust spatial data loader
from pathlib import Path
from typing import Union

def demo_load_spatial_dataset(file_path: Union[str, Path], **kwargs) -> gpd.GeoDataFrame:
    """
    Demonstration of robust spatial data loading.
    This shows you the approach for your assignment implementation.
    """
    
    print(f"🔄 Loading spatial data from: {file_path}")
    
    # Step 1: Convert to Path object
    path_obj = Path(file_path)
    print(f"   📁 Using path object: {path_obj}")
    
    # Step 2: Check if file exists
    if not path_obj.exists():
        raise FileNotFoundError(f"File not found: {path_obj}")
    
    print(f"   ✅ File exists: {path_obj.stat().st_size} bytes")
    
    # Step 3: Determine file format
    file_extension = path_obj.suffix.lower()
    print(f"   🗂️ File format: {file_extension}")
    
    # Step 4: Validate supported format
    supported_formats = ['.geojson', '.json', '.shp', '.gpkg']
    if file_extension not in supported_formats:
        raise ValueError(f"Unsupported format: {file_extension}. Supported: {supported_formats}")
    
    # Step 5: Load the data
    try:
        gdf = gpd.read_file(path_obj, **kwargs)
        print(f"   📊 Loaded {len(gdf)} features")
    except Exception as e:
        # Chain the original exception so its traceback is preserved
        raise ValueError(f"Error loading spatial data: {e}") from e
    
    # Step 6: Basic validation
    if not isinstance(gdf, gpd.GeoDataFrame):
        raise ValueError("Loaded data is not a valid GeoDataFrame")
    
    if len(gdf) == 0:
        print("   ⚠️ Warning: Dataset is empty")
    
    if 'geometry' not in gdf.columns:
        raise ValueError("No geometry column found in the data")
    
    print(f"   ✅ Validation passed: {len(gdf)} features with geometry")
    
    return gdf

# Test the function
print("🧪 Testing robust loading function:\n")

# Test 1: Valid GeoJSON
try:
    cities = demo_load_spatial_dataset('../data/cities/sample_cities.geojson')
    print(f"   Success! Loaded {len(cities)} cities\n")
except Exception as e:
    print(f"   Error: {e}\n")

# Test 2: Valid Shapefile
try:
    cities_shp = demo_load_spatial_dataset('../data/cities/world_cities.shp')
    print(f"   Success! Loaded {len(cities_shp)} cities\n")
except Exception as e:
    print(f"   Error: {e}\n")

# Test 3: File doesn't exist
try:
    missing = demo_load_spatial_dataset('missing_file.geojson')
except FileNotFoundError as e:
    print(f"   Expected error caught: {e}\n")

print("🎯 This demonstrates the approach for your assignment implementation!")
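
One detail worth knowing when wrapping errors as in Step 5: adding `from e` to the `raise` keeps the original exception attached as `__cause__`, which makes debugging easier. A minimal standard-library sketch, where `flaky_reader` is a stand-in for any loader that fails:

```python
def flaky_reader(path):
    """Stand-in for a loader that fails; not a real GeoPandas call."""
    raise RuntimeError(f"cannot open {path}")

def load_wrapped(path):
    try:
        return flaky_reader(path)
    except Exception as e:
        # "from e" records the original error as __cause__
        raise ValueError(f"Error loading spatial data: {e}") from e

try:
    load_wrapped("cities.geojson")
except ValueError as err:
    caught = err  # keep a reference; "err" is cleared when the except block ends
    print(f"Raised: {err}")
    print(f"Cause:  {type(err.__cause__).__name__}")
```

Callers still see a single `ValueError`, while the traceback shows the underlying failure as well.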

## 🌐 Advanced Loading Scenarios

Let's explore more advanced loading scenarios you might encounter in real-world projects:

### 📊 Loading with Additional Parameters

In [None]:
# 🔧 Advanced loading with parameters

print("🌐 Advanced Loading Scenarios:\n")

# Scenario 1: Load with encoding specification
print("📝 Scenario 1: Explicit encoding")
try:
    cities_utf8 = gpd.read_file('../data/cities/sample_cities.geojson', encoding='utf-8')
    print(f"   ✅ Loaded with UTF-8 encoding: {len(cities_utf8)} cities")
except Exception as e:
    print(f"   ⚠️ Error: {e}")

# Scenario 2: Load specific attribute columns (works for any format, not just Shapefiles)
print("\n📊 Scenario 2: Load specific columns")
try:
    # For demonstration, load only specific columns
    cities_subset = gpd.read_file('../data/cities/world_cities.shp', 
                                 columns=['name', 'country', 'geometry'])
    print(f"   ✅ Loaded subset: {len(cities_subset)} cities with {len(cities_subset.columns)} columns")
    print(f"   📋 Columns: {list(cities_subset.columns)}")
except Exception as e:
    print(f"   ⚠️ Error: {e}")

# Scenario 3: Load with bounding box filter (if supported)
print("\n🗺️ Scenario 3: Spatial filtering (bbox)")
try:
    # Bounding box as (minx, miny, maxx, maxy): roughly the contiguous US and southern Canada
    north_america_bbox = (-130, 25, -60, 50)
    cities_na = gpd.read_file('../data/cities/world_cities.geojson', 
                             bbox=north_america_bbox)
    print(f"   ✅ Loaded North America cities: {len(cities_na)} cities")
    if len(cities_na) > 0:
        print(f"   🌍 Sample cities: {', '.join(cities_na['name'].head(3))}")
except Exception as e:
    print(f"   ⚠️ Bbox filtering failed: {e}")

print("\n💡 Key Insight: The **kwargs parameter allows flexible loading options!")
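
The `bbox` filter expects coordinates in `(minx, miny, maxx, maxy)` order, and swapping the two corners is an easy mistake. A small sketch of a guard that could run before the values are passed to `gpd.read_file()`; `make_bbox` is a hypothetical helper, not a GeoPandas function:

```python
def make_bbox(minx, miny, maxx, maxy):
    """Return a (minx, miny, maxx, maxy) tuple after basic sanity checks."""
    if minx > maxx:
        raise ValueError(f"minx ({minx}) is greater than maxx ({maxx})")
    if miny > maxy:
        raise ValueError(f"miny ({miny}) is greater than maxy ({maxy})")
    return (minx, miny, maxx, maxy)

north_america_bbox = make_bbox(-130, 25, -60, 50)
print(north_america_bbox)   # (-130, 25, -60, 50)

# Passing the upper-right corner first is caught early
try:
    make_bbox(-60, 50, -130, 25)
except ValueError as err:
    bbox_error = str(err)
    print(f"Rejected: {err}")
```

The guard cannot detect every ordering mistake (for example, mixing up x and y), so it complements rather than replaces a look at the loaded result.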

## ✅ Data Validation After Loading

Loading data is just the first step. Professional spatial analysis always includes data validation:

### 🔍 Essential Validation Checks

In [None]:
# ✅ Post-loading data validation

def validate_loaded_data(gdf: gpd.GeoDataFrame, data_name: str) -> dict:
    """
    Demonstrate validation checks after loading spatial data.
    Returns a validation report.
    """
    print(f"🔍 Validating loaded data: {data_name}")
    
    validation_report = {
        'is_valid': True,
        'issues': [],
        'warnings': []
    }
    
    # Check 1: Is it actually a GeoDataFrame?
    if not isinstance(gdf, gpd.GeoDataFrame):
        validation_report['is_valid'] = False
        validation_report['issues'].append("Not a GeoDataFrame")
        return validation_report
    
    print(f"   ✅ Type: Valid GeoDataFrame")
    
    # Check 2: Has data?
    if len(gdf) == 0:
        validation_report['warnings'].append("Dataset is empty")
        print(f"   ⚠️ Data: Empty dataset")
    else:
        print(f"   ✅ Data: {len(gdf)} features")
    
    # Check 3: Has geometry column?
    if 'geometry' not in gdf.columns:
        validation_report['is_valid'] = False
        validation_report['issues'].append("No geometry column")
        print(f"   ❌ Geometry: Missing geometry column")
    else:
        print(f"   ✅ Geometry: Column present")
        
        # Check geometry validity
        if len(gdf) > 0:
            valid_geoms = gdf.geometry.is_valid.sum()
            total_geoms = len(gdf)
            print(f"   📊 Valid geometries: {valid_geoms}/{total_geoms}")
            
            if valid_geoms < total_geoms:
                validation_report['warnings'].append(f"{total_geoms - valid_geoms} invalid geometries")
    
    # Check 4: Has CRS?
    if gdf.crs is None:
        validation_report['warnings'].append("No coordinate reference system defined")
        print(f"   ⚠️ CRS: Not defined")
    else:
        print(f"   ✅ CRS: {gdf.crs}")
    
    # Check 5: Reasonable data ranges (for geographic data)
    if gdf.crs and gdf.crs.to_epsg() == 4326 and len(gdf) > 0:  # WGS84
        bounds = gdf.total_bounds
        minx, miny, maxx, maxy = bounds
        
        if not (-180 <= minx <= 180 and -180 <= maxx <= 180):
            validation_report['warnings'].append("Longitude values outside valid range")
            print(f"   ⚠️ Longitude range: {minx:.2f} to {maxx:.2f} (outside ±180)")
        else:
            print(f"   ✅ Longitude range: {minx:.2f} to {maxx:.2f}")
            
        if not (-90 <= miny <= 90 and -90 <= maxy <= 90):
            validation_report['warnings'].append("Latitude values outside valid range")
            print(f"   ⚠️ Latitude range: {miny:.2f} to {maxy:.2f} (outside ±90)")
        else:
            print(f"   ✅ Latitude range: {miny:.2f} to {maxy:.2f}")
    
    # Summary
    if validation_report['is_valid']:
        if len(validation_report['warnings']) == 0:
            print(f"   🎉 Overall: EXCELLENT - No issues found!")
        else:
            print(f"   ✅ Overall: GOOD - {len(validation_report['warnings'])} warnings")
    else:
        print(f"   ❌ Overall: PROBLEMS - {len(validation_report['issues'])} critical issues")
    
    return validation_report

# Test validation on different datasets
print("🧪 Testing data validation:\n")

# Test 1: Clean data
cities_clean = gpd.read_file('../data/cities/sample_cities.geojson')
report1 = validate_loaded_data(cities_clean, "Clean Cities")

print()

# Test 2: Problematic data
try:
    cities_problems = gpd.read_file('../data/cities/cities_with_issues.geojson')
    report2 = validate_loaded_data(cities_problems, "Problematic Cities")
except Exception as e:
    print(f"   ❌ Failed to load problematic data: {e}")

print("\n💡 Validation helps identify data quality issues early in your workflow!")

## 🎯 Assignment Implementation Guide

Now you're ready to implement the `load_spatial_dataset()` function! Here's your roadmap:

### 📋 Implementation Checklist

Your function needs to:

1. ✅ **Accept flexible input**: `Union[str, Path]` for file paths
2. ✅ **Handle Path conversion**: Convert strings to Path objects
3. ✅ **Check file existence**: Raise `FileNotFoundError` if missing
4. ✅ **Validate file format**: Support common spatial formats
5. ✅ **Load the data**: Use `gpd.read_file()` with error handling
6. ✅ **Validate results**: Ensure it's a valid GeoDataFrame
7. ✅ **Handle kwargs**: Pass additional parameters to GeoPandas
8. ✅ **Return GeoDataFrame**: Return the loaded spatial data

### 💻 Code Structure Template

```python
def load_spatial_dataset(file_path: Union[str, Path], **kwargs) -> gpd.GeoDataFrame:
    # Step 1: Convert to Path object
    path_obj = Path(file_path)
    
    # Step 2: Check existence
    if not path_obj.exists():
        raise FileNotFoundError(f"File not found: {path_obj}")
    
    # Step 3: Validate format (optional but recommended)
    # Check file extension
    
    # Step 4: Load data with error handling
    try:
        gdf = gpd.read_file(path_obj, **kwargs)
    except Exception as e:
        # Chain the original exception so its traceback is preserved
        raise ValueError(f"Error loading spatial data: {e}") from e
    
    # Step 5: Validate result
    # Check it's a GeoDataFrame, has geometry, etc.
    
    # Step 6: Return the data
    return gdf
```

## 🧪 Testing Your Implementation

After implementing your function, test it with these scenarios:

In [None]:
# 🧪 Test scenarios for your load_spatial_dataset() function
# Uncomment these after implementing your function!

# from src.spatial_basics import load_spatial_dataset

print("🧪 Test Scenarios for load_spatial_dataset()\n")

# Test cases you should handle:
test_cases = [
    "✅ Valid GeoJSON file with string path",
    "✅ Valid Shapefile with Path object", 
    "✅ Loading with additional kwargs (encoding)",
    "❌ Non-existent file (should raise FileNotFoundError)",
    "❌ Invalid file format (should raise ValueError)",
    "⚠️ File with data quality issues (should load but warn)"
]

for i, test_case in enumerate(test_cases, 1):
    print(f"{i}. {test_case}")

print("\n💡 Run: uv run pytest tests/ -k 'load_spatial_dataset' -v")
print("   This will test your implementation automatically!")
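
The two failure cases in the list can be checked with a small helper that confirms an exception is raised. The sketch below uses a stand-in `check_loadable` function (an assumption for this example) that only performs the existence and extension checks. In your own tests you would call `load_spatial_dataset()` instead, and with pytest the same check is usually written with `pytest.raises`.

```python
from pathlib import Path
import tempfile

SUPPORTED = {".geojson", ".json", ".shp", ".gpkg"}

def check_loadable(file_path):
    """Stand-in for the pre-load checks; returns the path if it passes."""
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    if path.suffix.lower() not in SUPPORTED:
        raise ValueError(f"Unsupported format: {path.suffix}")
    return path

def raises(expected, func, *args):
    """Return True if func(*args) raises the expected exception type."""
    try:
        func(*args)
    except expected:
        return True
    return False

# Demo: one missing file and one existing file with the wrong extension
with tempfile.TemporaryDirectory() as tmp:
    text_file = Path(tmp) / "notes.txt"
    text_file.write_text("not spatial data")
    missing_ok = raises(FileNotFoundError, check_loadable, Path(tmp) / "missing.geojson")
    format_ok = raises(ValueError, check_loadable, text_file)

print(missing_ok, format_ok)   # True True
```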

## 🎓 Key Concepts Summary

### ✅ What You've Learned:

**📁 File Path Handling**
- Using `pathlib.Path` for robust path management
- Converting between strings and Path objects
- Checking file existence before loading

**🗂️ Spatial Data Formats**
- GeoJSON: Web-friendly, single file, human-readable
- Shapefile: Traditional GIS, multiple files, compact
- Understanding when to use each format

**⚠️ Error Handling**
- `FileNotFoundError` for missing files
- `ValueError` for invalid data or formats
- Using try/except blocks for graceful failure

**✅ Data Validation**
- Checking data type (GeoDataFrame)
- Verifying geometry column exists
- Validating coordinate ranges
- CRS presence and validity

**🔧 Professional Practices**
- Using type hints (`Union[str, Path]`)
- Supporting flexible parameters (`**kwargs`)
- Comprehensive error messages
- Step-by-step validation workflow

### 🚀 Skills for Your GIS Career:
- **Data Pipeline Development**: Building robust data loading functions
- **Error Handling**: Making code that works with messy real-world data  
- **Format Expertise**: Understanding spatial data format trade-offs
- **Quality Assurance**: Validating data immediately after loading

---

## 📚 Next Steps

### 🎯 Immediate Actions:
1. **Implement** `load_spatial_dataset()` in `src/spatial_basics.py`
2. **Test** with: `uv run pytest tests/ -k "load_spatial_dataset" -v`
3. **Debug** any failing tests using the error messages
4. **Validate** with different file formats and scenarios

### 📖 Continue Learning:
- **Next Notebook**: `03_explore_properties.ipynb` - Analyze spatial characteristics
- **Build on**: This loading foundation for all future spatial analysis
- **Practice**: Try loading your own spatial datasets!

### 💡 Pro Tips:
- **Always validate** loaded data before analysis
- **Handle errors gracefully** - real data is messy
- **Use Path objects** - they prevent many path-related bugs
- **Test edge cases** - empty files, missing files, wrong formats

---

**🎉 Congratulations!** You now understand professional spatial data loading. This foundational skill supports all GIS programming work!

## 🔖 Quick Reference - Loading Commands

```python
# Basic loading
gdf = gpd.read_file('data/file.geojson')
gdf = gpd.read_file('data/file.shp')

# With Path objects (recommended)
from pathlib import Path
file_path = Path('data/file.geojson')
gdf = gpd.read_file(file_path)

# With additional parameters
gdf = gpd.read_file('file.geojson', encoding='utf-8')
gdf = gpd.read_file('file.shp', columns=['name', 'geometry'])

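# Error handling (check existence first: the error read_file raises
# for a missing file depends on the I/O engine)
path = Path('file.geojson')
if not path.exists():
    print("File not found")
else:
    try:
        gdf = gpd.read_file(path)
    except Exception as e:
        print(f"Loading error: {e}")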

# Basic validation
assert isinstance(gdf, gpd.GeoDataFrame)
assert 'geometry' in gdf.columns
assert len(gdf) > 0
```