# 🗺️ Spatial Data Overview - Welcome to GeoPandas!

**GIST 604B - Python GeoPandas Introduction**  
**Notebook 1: Understanding Spatial Data Fundamentals**

---

## 🎯 Learning Objectives

By the end of this notebook, you will understand:
- What makes spatial data different from regular data
- Common spatial data formats used in GIS
- How GeoPandas extends Pandas for spatial analysis
- The basic structure of spatial datasets
- Why coordinate reference systems matter

## 🤔 What is Spatial Data?

Spatial data represents information about the **location and shape** of geographic features. Unlike regular data that just has numbers and text, spatial data includes **geometry** - the actual geographic coordinates that define where things are located on Earth.

### Regular Data vs. Spatial Data

**Regular Data (like a CSV file):**
```
City Name    | Population | Country
-------------|------------|--------
New York     | 8,336,817  | USA
London       | 8,982,000  | UK
Tokyo        | 13,960,000 | Japan
```

**Spatial Data (like a Shapefile):**
```
City Name    | Population | Country | Geometry (longitude latitude)
-------------|------------|---------|------------------------------
New York     | 8,336,817  | USA     | POINT (-74.006 40.714)
London       | 8,982,000  | UK      | POINT (-0.118 51.509)
Tokyo        | 13,960,000 | Japan   | POINT (139.692 35.689)
```

The **geometry column** is what makes data "spatial" - it contains the actual geographic coordinates!

## 🗂️ Common Spatial Data Formats

Just like regular data comes in different formats (CSV, Excel, JSON), spatial data has its own formats:

### 1. **Shapefile (.shp)** - The Classic
- Most common vector format in GIS
- Actually multiple files (.shp, .shx, .dbf, .prj, etc.)
- Supported by virtually all GIS software
- Good for: Traditional GIS workflows, desktop analysis

### 2. **GeoJSON (.geojson, .json)** - The Web Favorite  
- Text-based format, human-readable
- Perfect for web mapping applications
- Single file contains everything
- Good for: Web maps, APIs, data sharing

### 3. **GeoPackage (.gpkg)** - The Modern Choice
- SQLite-based, single file
- Can store multiple layers
- Open standard, no licensing issues
- Good for: Modern GIS projects, mobile mapping

### 4. **KML/KMZ** - The Google Earth Format
- XML-based format
- Great for visualization
- Good for: Google Earth, simple visualizations

## 🐼 What is GeoPandas?

**GeoPandas = Pandas + Geography**

If you know Pandas (Python's data analysis library), GeoPandas will feel familiar. It extends Pandas to handle spatial data by adding:

- **GeoDataFrame**: Like a Pandas DataFrame, but with a special geometry column
- **Geometry operations**: Calculate areas, distances, intersections
- **Coordinate system support**: Handle different map projections
- **Spatial joins**: Combine datasets based on location
- **Easy mapping**: Create maps with just a few lines of code

### The Relationship
```
Pandas DataFrame     →    GeoPandas GeoDataFrame
   |                         |
   ├── Column A              ├── Column A  
   ├── Column B              ├── Column B
   └── Column C              ├── Column C
                             └── geometry  ← This is the magic!
```
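
The diagram above can be checked in a few lines. This is a small sketch with two example cities; the column names `lon` and `lat` are just choices for this illustration.

```python
import geopandas as gpd
import pandas as pd

df = pd.DataFrame(
    {"name": ["Cairo", "Sydney"], "lon": [31.235, 151.209], "lat": [30.044, -33.867]}
)
gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df["lon"], df["lat"]), crs="EPSG:4326"
)

print(isinstance(gdf, pd.DataFrame))   # a GeoDataFrame is a DataFrame subclass
print(type(gdf["name"]).__name__)      # regular columns are ordinary pandas Series
print(type(gdf.geometry).__name__)     # the geometry column is a GeoSeries

# Familiar pandas filtering still works and keeps the geometry column
southern = gdf[gdf["lat"] < 0]
print(type(southern).__name__, list(southern["name"]))
```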

## 🧪 Let's See GeoPandas in Action!

Time to get hands-on! Let's load our first spatial dataset and explore it.

In [None]:
# Import the libraries we need
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')  # Hide warnings for cleaner output

print("📦 Libraries imported successfully!")
print(f"🐼 GeoPandas version: {gpd.__version__}")
print(f"🐍 Pandas version: {pd.__version__}")

In [None]:
# Let's load some sample world cities data
# (This assumes you have the data in the correct location)

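from pathlib import Path

# Load cities data - replace with actual path when data is available
cities_path = Path('../data/cities/cities.shp')

# Check for the file up front: depending on the I/O engine, read_file() reports
# a missing file with an engine-specific error rather than FileNotFoundError
if cities_path.exists():
    cities = gpd.read_file(cities_path)
    print(f"✅ Successfully loaded {len(cities)} cities!")
    print(f"📊 Data type: {type(cities)}")
    print(f"📍 Columns: {list(cities.columns)}")
else:
    print("⚠️  Sample data not found. Let's create some demo data instead!")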
    
    # Create sample spatial data for demonstration
    from shapely.geometry import Point
    
    # Sample cities with coordinates
    sample_cities = {
        'name': ['New York', 'London', 'Tokyo', 'Sydney', 'Cairo'],
        'population': [8336817, 8982000, 13960000, 5312163, 9606916],
        'country': ['USA', 'UK', 'Japan', 'Australia', 'Egypt'],
        'longitude': [-74.006, -0.118, 139.692, 151.209, 31.235],
        'latitude': [40.714, 51.509, 35.689, -33.867, 30.044]
    }
    
    # Create a regular DataFrame first
    df = pd.DataFrame(sample_cities)
    
    # Convert to GeoDataFrame by creating geometry from coordinates
    geometry = [Point(lon, lat) for lon, lat in zip(df['longitude'], df['latitude'])]
    cities = gpd.GeoDataFrame(df, geometry=geometry, crs='EPSG:4326')
    
    print(f"✅ Created sample dataset with {len(cities)} cities!")
    print(f"📊 Data type: {type(cities)}")
    print(f"📍 Columns: {list(cities.columns)}")

## 🔍 Exploring Our Spatial Data

Now let's explore what makes this data "spatial"...

In [None]:
# Look at the first few rows
print("🗺️  First 3 cities in our dataset:")
print(cities.head(3))

In [None]:
# Focus on the geometry column - this is what makes it spatial!
print("📍 The geometry column contains the actual coordinates:")
print(cities.geometry.head(3))

print("\n🔍 Each geometry is a Shapely Point object:")
print(f"First city geometry type: {type(cities.geometry.iloc[0])}")
print(f"First city coordinates: {cities.geometry.iloc[0].coords[0]}")

In [None]:
# Check the coordinate reference system (CRS)
print("🌍 Coordinate Reference System (CRS):")
print(f"CRS: {cities.crs}")
print("\nThis tells us that coordinates are in longitude/latitude degrees (WGS84)")

In [None]:
# Check the spatial extent (boundaries) of our data
print("📐 Spatial extent of our cities:")
bounds = cities.total_bounds
print(f"West (min longitude): {bounds[0]:.3f}°")
print(f"South (min latitude): {bounds[1]:.3f}°")
print(f"East (max longitude): {bounds[2]:.3f}°")
print(f"North (max latitude): {bounds[3]:.3f}°")

print("\nWith the demo data, this tells us our cities span from New York in the west to Sydney in the east!")

## 🗺️ Making Our First Map!

One of the coolest things about GeoPandas is how easy it is to make maps:

In [None]:
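# Create a simple map - the plot itself is just one line of code!
# Pass ax= so the map is drawn on the figure we sized (otherwise GeoPandas creates its own)
fig, ax = plt.subplots(figsize=(12, 8))
cities.plot(ax=ax, marker='o', color='red', markersize=100, alpha=0.7)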
plt.title('🌍 World Cities - Our First Spatial Data Map!', fontsize=16, fontweight='bold')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid(True, alpha=0.3)

# Add city labels
for idx, row in cities.iterrows():
    plt.annotate(row['name'], 
                (row.geometry.x, row.geometry.y), 
                xytext=(5, 5), 
                textcoords='offset points',
                fontsize=10,
                fontweight='bold')

plt.tight_layout()
plt.show()

print("🎉 You just created your first spatial data visualization!")

## 🧮 Spatial vs. Regular Data Operations

Let's compare what we can do with spatial data versus regular data:

In [None]:
print("📊 REGULAR DATA OPERATIONS (same as Pandas):")
print("="*50)

# Regular data operations work exactly like Pandas
print(f"Total population: {cities['population'].sum():,}")
print(f"Average population: {cities['population'].mean():,.0f}")
print(f"Most populous city: {cities.loc[cities['population'].idxmax(), 'name']}")

print("\n🗺️  SPATIAL DATA OPERATIONS (unique to GeoPandas):")
print("="*50)

# Spatial operations that regular DataFrames can't do
print("Centroid of all city points (a simple average of their coordinates):")
# Note: unary_union is deprecated in GeoPandas 1.0+, where union_all() replaces it
center = cities.geometry.unary_union.centroid
print(f"Center coordinates: ({center.x:.2f}°, {center.y:.2f}°)")

print("\nWesternmost and easternmost cities:")
westernmost = cities.loc[cities.geometry.x.idxmin(), 'name']
easternmost = cities.loc[cities.geometry.x.idxmax(), 'name'] 
print(f"Westernmost: {westernmost}")
print(f"Easternmost: {easternmost}")

## 🌐 Why Coordinate Reference Systems Matter

One of the most important concepts in spatial data is the **Coordinate Reference System (CRS)**. This defines how coordinates relate to real locations on Earth and, for projected systems, how the curved 3D Earth is flattened onto a 2D map.

### Common CRS You'll Encounter:
- **EPSG:4326 (WGS84)**: Geographic coordinates (longitude/latitude in degrees)
- **EPSG:3857 (Web Mercator)**: Used by Google Maps, OpenStreetMap, etc.
- **UTM zones**: Good for accurate distance/area calculations in specific regions

### Why It Matters:
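- **Distance calculations**: Distances computed directly from longitude/latitude come out in degrees, which are not a usable unit of length
- **Area calculations**: Areas computed from geographic coordinates come out in "square degrees", whose real size changes with latitude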
- **Mapping**: Different projections distort the Earth in different ways
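
The area point can be illustrated with a small sketch. It compares two 1° x 1° cells, one at the equator and one at 60° N, first in degrees and then after reprojecting to an equal-area CRS. The choice of `EPSG:6933` (a global cylindrical equal-area projection) is an assumption made for this example; other equal-area projections would serve the same purpose.

```python
import geopandas as gpd
from shapely.geometry import box

# Two 1° x 1° cells: one at the equator, one starting at 60° N
cells = gpd.GeoDataFrame(
    {"label": ["equator", "60N"]},
    geometry=[box(0, 0, 1, 1), box(0, 60, 1, 61)],
    crs="EPSG:4326",
)

# Area in "square degrees": identical for both cells
# (GeoPandas warns here because the CRS is geographic)
area_deg = cells.geometry.area

# Area after reprojecting to an equal-area CRS, converted from m² to km²
area_km2 = cells.to_crs("EPSG:6933").geometry.area / 1e6

print(area_deg.tolist())
print(area_km2.round(0).tolist())
```

Both cells report the same area in degrees, but in square kilometers the northern cell is only about half the size of the equatorial one.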

In [None]:
# Let's demonstrate why CRS matters for distance calculations
print("🌍 Distance calculation demonstration:")
print("="*40)

# Get New York and London
ny = cities[cities['name'] == 'New York'].geometry.iloc[0]
london = cities[cities['name'] == 'London'].geometry.iloc[0]

print(f"New York coordinates: ({ny.x:.3f}°, {ny.y:.3f}°)")
print(f"London coordinates: ({london.x:.3f}°, {london.y:.3f}°)")

# This is WRONG - calculating distance in degrees!
distance_degrees = ny.distance(london)
print(f"\n❌ Distance in degrees: {distance_degrees:.3f} (meaningless!)")

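# Reprojecting gives meters, but the choice of projection matters: Web Mercator
# (EPSG:3857) stretches distances away from the equator, so this result is far too large
cities_projected = cities.to_crs('EPSG:3857')  # Web Mercator
ny_proj = cities_projected[cities_projected['name'] == 'New York'].geometry.iloc[0]
london_proj = cities_projected[cities_projected['name'] == 'London'].geometry.iloc[0]
distance_mercator_km = ny_proj.distance(london_proj) / 1000
print(f"⚠️  Web Mercator distance: {distance_mercator_km:,.0f} km (inflated by projection distortion)")

# This is RIGHT for long distances - measure along the WGS84 ellipsoid (geodesic distance)
from pyproj import Geod  # pyproj is installed together with GeoPandas
geod = Geod(ellps='WGS84')
_, _, distance_meters = geod.inv(ny.x, ny.y, london.x, london.y)
distance_km = distance_meters / 1000

print(f"✅ Distance in meters: {distance_meters:,.0f} m")
print(f"✅ Distance in kilometers: {distance_km:,.0f} km")
print("\nThis is close to the real-world great-circle distance of roughly 5,570 km!")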

## 🗂️ Three Types of Spatial Geometry

Spatial data can represent different types of geographic features:

### 1. **Points** 🎯
- Represent specific locations
- Examples: Cities, GPS coordinates, weather stations
- Defined by: Single (x, y) coordinate pair

### 2. **Lines** 🛣️
- Represent linear features
- Examples: Roads, rivers, hiking trails, flight paths
- Defined by: Series of connected coordinate pairs

### 3. **Polygons** 🏰
- Represent areas/regions
- Examples: Countries, lakes, buildings, parks
- Defined by: Closed series of coordinates forming a boundary

Our cities dataset uses **Points** because cities are specific locations!
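
As a quick illustration of the three types, this sketch builds one of each with Shapely, the geometry library GeoPandas uses under the hood. The coordinates are arbitrary values picked for the example.

```python
from shapely.geometry import LineString, Point, Polygon

point = Point(-74.006, 40.714)                       # a single (x, y) pair
line = LineString([(0, 0), (3, 4)])                  # connected coordinate pairs
polygon = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])  # the ring is closed automatically

for geom in (point, line, polygon):
    print(geom.geom_type, "| length:", geom.length, "| area:", geom.area)
```

Points have neither length nor area, lines have length only, and polygons have both an area and a perimeter (reported as `length`).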

In [None]:
# Check what geometry types we have in our dataset
print("📍 Geometry types in our cities dataset:")
geometry_types = cities.geometry.geom_type.value_counts()
print(geometry_types)

print("\n🔍 All our cities are Points, which makes sense - cities are specific locations!")

## 🎯 What You'll Learn in This Assignment

Now that you understand the basics of spatial data, here's what you'll implement in this assignment:

### Function 1: `load_spatial_dataset()` 📂
- Load spatial data from files (Shapefiles, GeoJSON, etc.)
- Handle common loading issues (encoding, missing files)
- Work with different file formats

### Function 2: `explore_spatial_properties()` 🔍  
- Analyze coordinate reference systems
- Calculate spatial bounds and extents
- Identify geometry types and data characteristics

### Function 3: `validate_spatial_data()` ✅
- Check for invalid geometries
- Find missing or problematic spatial data
- Generate data quality reports
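
To give a feel for the kind of checks involved, here is a minimal sketch, not the assignment solution. It assumes a toy dataset with one valid square, one self-intersecting "bowtie" polygon, and one missing geometry; the labels are made up for the example.

```python
import geopandas as gpd
from shapely.geometry import Polygon

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
bowtie = Polygon([(0, 0), (1, 1), (1, 0), (0, 1)])  # edges cross each other: invalid

shapes = gpd.GeoDataFrame(
    {"label": ["square", "bowtie", "missing"]},
    geometry=[square, bowtie, None],
    crs="EPSG:4326",
)

print(shapes.geometry.is_valid.tolist())  # validity flag per row
print(shapes.geometry.isna().tolist())    # which rows have no geometry at all
```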

### Function 4: `standardize_crs()` 🌍
- Transform between coordinate systems
- Choose appropriate projections for analysis
- Handle coordinate system conversions
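
As a rough preview of what a CRS transformation looks like, this sketch lets GeoPandas suggest a UTM zone for a single point and reprojects to it. The point near Tucson, Arizona is a hypothetical study site chosen for illustration, and this is only one possible way to pick a projection.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical study-area point near Tucson, Arizona (longitude, latitude)
site = gpd.GeoDataFrame(
    {"name": ["site"]}, geometry=[Point(-110.97, 32.22)], crs="EPSG:4326"
)

utm_crs = site.estimate_utm_crs()  # picks the UTM zone that covers the data
site_utm = site.to_crs(utm_crs)    # coordinates are now in a projected CRS

print(utm_crs.to_epsg())
print(site_utm.crs.is_projected)
```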

## 📚 Next Steps - Your Learning Journey

### 🎓 Recommended Learning Order:

1. **📖 Start here** - `01_spatial_data_overview.ipynb` (this notebook!)
2. **📂 Loading data** - `02_load_spatial_data.ipynb`
3. **🔍 Exploring properties** - `03_explore_properties.ipynb` 
4. **✅ Validating data** - `04_validate_data.ipynb`
5. **🌍 Coordinate systems** - `05_coordinate_systems.ipynb`

### 💻 Implementation Workflow:
1. **Learn** - Work through the relevant notebook
2. **Implement** - Code the function in `src/spatial_basics.py`
3. **Test** - Run `uv run pytest tests/ -v` to check your work
4. **Debug** - Use test failures to guide improvements
5. **Repeat** - Move to the next function

### 🔧 Development Tips:
- **Use the notebooks for experimentation** - try different approaches
- **Test with small datasets first** - easier to debug
- **Read error messages carefully** - GeoPandas gives helpful error info
- **Check your CRS** - many spatial problems come from coordinate system issues

## 🎉 Congratulations!

You've completed your introduction to spatial data and GeoPandas! You now understand:

✅ **What makes data "spatial"** - the geometry column with coordinates  
✅ **Common spatial data formats** - Shapefiles, GeoJSON, GeoPackage  
✅ **How GeoPandas extends Pandas** - adds spatial operations to familiar data analysis  
✅ **Why coordinate systems matter** - essential for accurate measurements  
✅ **Three geometry types** - Points, Lines, and Polygons  

### 🚀 Ready for the Next Step?

Open `02_load_spatial_data.ipynb` to learn how to load spatial data from files!

---

*Remember: Spatial data analysis is a skill that improves with practice. Take your time with each concept, experiment with the code, and don't hesitate to ask questions. Every GIS professional started exactly where you are now!* 🌟