# 📊 Function 2: Calculate Raster Statistics

## Building the `calculate_raster_statistics` Function

**Learning Objectives:**
- Calculate basic statistical measures from raster datasets
- Handle nodata values and masked arrays properly
- Implement memory-efficient statistics for large datasets
- Generate histograms and distribution analysis
- Work with multi-band raster statistics
- Create summary reports for raster data analysis

**Professional Context:**
Raster statistics are fundamental to geospatial analysis and environmental monitoring. Professionals use these techniques for:
- Quality assessment of satellite imagery and DEMs
- Environmental monitoring and change detection
- Data validation and outlier detection
- Preprocessing steps for machine learning workflows
- Generating metadata and data summaries for reports

---

## 🎯 Function Overview

**Function Signature:**
```python
def calculate_raster_statistics(raster_path, band=None, compute_histogram=False, mask_nodata=True):
    """
    Calculate comprehensive statistics for raster datasets.
    
    Parameters:
    -----------
    raster_path : str
        Path to the raster file
    band : int, optional
        Specific band number (1-based). If None, calculates for all bands
    compute_histogram : bool, default False
        Whether to compute histogram data
    mask_nodata : bool, default True
        Whether to exclude nodata values from calculations
    
    Returns:
    --------
    dict
        Dictionary containing statistical measures and metadata
    
    Raises:
    -------
    FileNotFoundError
        If the raster file doesn't exist
    ValueError
        If the band number is invalid or raster is corrupted
    """
```

**Your Task:**
1. Load raster data using rasterio
2. Calculate basic statistics (mean, min, max, std, count)
3. Handle nodata values appropriately
4. Generate histograms when requested
5. Support both single-band and multi-band analysis
6. Return comprehensive statistical summary

## 📚 Raster Statistics Fundamentals

### Basic Statistical Measures

| Statistic | Description | Use Case |
|-----------|-------------|----------|
| **Mean** | Average value | Overall data tendency, comparison baseline |
| **Median** | Middle value when sorted | Robust central tendency, outlier-resistant |
| **Min/Max** | Extreme values | Data range, outlier detection |
| **Std Dev** | Variability measure | Data spread, homogeneity assessment |
| **Count** | Valid pixel count | Data completeness, coverage assessment |
| **Percentiles** | Value distribution | Threshold definition, data exploration |

### Nodata Value Handling
```python
import numpy as np
import rasterio

# Different ways to handle nodata
with rasterio.open('raster.tif') as src:
    data = src.read(1)  # Read first band
    nodata = src.nodata
    
    # Method 1: Mask nodata values
    if nodata is not None:
        masked_data = np.ma.masked_equal(data, nodata)
        mean_val = masked_data.mean()
    
    # Method 2: Use rasterio's built-in masking
    data_masked = src.read(1, masked=True)
    mean_val = data_masked.mean()
```
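
Floating-point rasters sometimes declare NaN as their nodata value. The sketch below, which uses a made-up 2x2 array, shows why `np.ma.masked_equal` is not enough in that case and how `np.ma.masked_invalid` can be used instead. Note that `masked_invalid` also masks infinities, which may or may not be what you want.

```python
import numpy as np

# Made-up band where NaN marks the missing pixel
data = np.array([[1.0, 2.0], [np.nan, 4.0]])

# NaN never compares equal to itself, so nothing gets masked here
by_equal = np.ma.masked_equal(data, np.nan)

# masked_invalid masks NaN and +/-inf
by_invalid = np.ma.masked_invalid(data)

print("valid pixels (masked_equal):  ", by_equal.count())
print("valid pixels (masked_invalid):", by_invalid.count())
print("mean of valid pixels:         ", by_invalid.mean())
```
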

## 🔧 Implementation Strategy

### Step 1: File Validation and Loading
```python
import os

import numpy as np
import rasterio
from rasterio.errors import RasterioIOError

# Validate file existence
if not os.path.exists(raster_path):
    raise FileNotFoundError(f"Raster file not found: {raster_path}")

# Open and validate raster
try:
    with rasterio.open(raster_path) as src:
        # Validate band number if specified (bands are 1-based)
        if band is not None and (band < 1 or band > src.count):
            raise ValueError(f"Band {band} not valid. Raster has {src.count} bands")
except RasterioIOError as e:
    # Only I/O failures are translated; the band ValueError above passes through unchanged
    raise ValueError(f"Error opening raster file: {e}") from e
```

### Step 2: Statistical Calculations
```python
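def calculate_band_statistics(data, nodata_value=None, compute_histogram=False):
    """Calculate statistics for a single band."""
    
    # Handle nodata masking (any mask already on `data` is preserved)
    if nodata_value is not None and np.isnan(nodata_value):
        # NaN never compares equal to itself, so masked_equal would miss it;
        # masked_invalid masks NaN (and +/-inf) instead
        data = np.ma.masked_invalid(data)
    elif nodata_value is not None:
        data = np.ma.masked_equal(data, nodata_value)
    else:
        data = np.ma.asarray(data)  # No-op if already masked
    
    # 1-D array of the valid (unmasked) values
    valid = data.compressed()
    has_data = valid.size > 0
    
    # Calculate basic statistics (None when no valid pixels remain)
    stats = {
        'count': int(valid.size),
        'mean': float(valid.mean()) if has_data else None,
        'std': float(valid.std()) if has_data else None,
        'min': float(valid.min()) if has_data else None,
        'max': float(valid.max()) if has_data else None,
        'median': float(np.median(valid)) if has_data else None
    }
    
    # Add percentiles; np.ma has no percentile function, so apply
    # np.percentile to the compressed values
    if has_data:
        for p in [25, 75, 90, 95, 99]:
            stats[f'percentile_{p}'] = float(np.percentile(valid, p))
    
    # Attach histogram data when requested (generate_histogram is defined in Step 3)
    if compute_histogram:
        stats['histogram'] = generate_histogram(data)
    
    return stats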
```

### Step 3: Histogram Generation
```python
def generate_histogram(data, bins=256):
    """Generate histogram data for raster values."""
    
    if data.count() == 0:
        return None
    
    # Calculate histogram
    hist, bin_edges = np.histogram(data.compressed(), bins=bins)
    
    return {
        'counts': hist.tolist(),
        'bin_edges': bin_edges.tolist(),
        'bins': len(hist)
    }
```
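
When reading histogram output, it helps to remember that `np.histogram` returns one more bin edge than it returns counts. The sketch below uses synthetic values (and treats anything below 60 as pretend nodata, purely for illustration) to derive bin centers and per-bin fractions, two quantities that are often handy for plotting or for comparing bands of different sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic values; masking everything below 60 stands in for nodata here
values = np.ma.masked_less(rng.normal(100, 15, 1000), 60)

counts, bin_edges = np.histogram(values.compressed(), bins=10)

# One center per bin: the midpoint of each pair of neighbouring edges
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

# Share of valid pixels in each bin (sums to 1)
fractions = counts / counts.sum()

for center, count, fraction in zip(bin_centers, counts, fractions):
    print(f"{center:7.2f}: {count:4d} pixels ({fraction:.1%})")
```
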

## 💻 Hands-On Examples

Let's explore raster statistics with practical examples:

In [None]:
import rasterio
import numpy as np
import matplotlib.pyplot as plt
from rasterio.plot import show

# Example 1: Basic raster statistics
def example_basic_statistics():
    """
    Demonstrate basic raster statistics calculation
    """
    
    # Create sample elevation data
    np.random.seed(42)
    
    # Simulate elevation data (100x100 grid)
    elevation = np.random.normal(1000, 200, (100, 100))  # Mean 1000m, std 200m
    elevation = np.maximum(elevation, 0)  # No negative elevations
    
    # Add some nodata values
    nodata_mask = np.random.random((100, 100)) < 0.05  # 5% nodata
    elevation[nodata_mask] = -9999  # Set nodata value
    
    print("Sample Elevation Data Statistics:")
    print(f"Shape: {elevation.shape}")
    print(f"Total pixels: {elevation.size}")
    
    # Mask nodata values for statistics
    elevation_masked = np.ma.masked_equal(elevation, -9999)
    
    print(f"\nValid pixels: {elevation_masked.count()}")
    print(f"Nodata pixels: {elevation_masked.mask.sum()}")
    print(f"Mean elevation: {elevation_masked.mean():.2f} m")
    print(f"Min elevation: {elevation_masked.min():.2f} m")
    print(f"Max elevation: {elevation_masked.max():.2f} m")
    print(f"Std deviation: {elevation_masked.std():.2f} m")
    print(f"Median elevation: {np.ma.median(elevation_masked):.2f} m")
    
    # Calculate percentiles
    percentiles = [25, 50, 75, 90, 95]
    print(f"\nPercentiles:")
    for p in percentiles:
        val = np.percentile(elevation_masked.compressed(), p)  # np.ma has no percentile function
        print(f"  {p}th percentile: {val:.2f} m")
    
    return elevation_masked

# Run example
sample_elevation = example_basic_statistics()

In [None]:
# Example 2: Multi-band raster statistics
def example_multiband_statistics():
    """
    Demonstrate statistics for multi-band raster (RGB image)
    """
    
    np.random.seed(42)
    
    # Simulate RGB satellite imagery (3 bands, 50x50 pixels)
    # Band 1: Red (600-700nm)
    red = np.random.normal(120, 30, (50, 50))
    red = np.clip(red, 0, 255).astype(np.uint8)
    
    # Band 2: Green (500-600nm)
    green = np.random.normal(100, 25, (50, 50))
    green = np.clip(green, 0, 255).astype(np.uint8)
    
    # Band 3: Blue (400-500nm)
    blue = np.random.normal(80, 20, (50, 50))
    blue = np.clip(blue, 0, 255).astype(np.uint8)
    
    bands = [red, green, blue]
    band_names = ['Red', 'Green', 'Blue']
    
    print("Multi-band Raster Statistics:")
    print(f"Dimensions: 3 bands × {red.shape[0]} × {red.shape[1]} pixels")
    
    for i, (band_data, band_name) in enumerate(zip(bands, band_names)):
        print(f"\nBand {i+1} ({band_name}):")
        print(f"  Mean: {band_data.mean():.2f}")
        print(f"  Std: {band_data.std():.2f}")
        print(f"  Min: {band_data.min()}")
        print(f"  Max: {band_data.max()}")
        print(f"  Data type: {band_data.dtype}")
    
    return bands

# Run example
rgb_bands = example_multiband_statistics()
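
When all bands are read at once, rasterio returns a single array shaped `(bands, rows, cols)`. The sketch below imitates that layout with random data and computes per-band statistics without a Python loop over pixels. The choice of `0` as nodata and the `band_1`-style keys are assumptions for illustration, not a required output format.

```python
import numpy as np

rng = np.random.default_rng(42)

# Shape (bands, rows, cols), the layout returned when all bands are read together
stack = rng.integers(0, 256, size=(3, 50, 50), dtype=np.uint8)

# Assumption for this example: 0 marks nodata
masked_stack = np.ma.masked_equal(stack, 0)

# Flatten each band to one row so statistics can be taken along axis 1
flat = masked_stack.reshape(masked_stack.shape[0], -1)
band_counts = flat.count(axis=1)
band_means = flat.mean(axis=1)

# Hypothetical result layout: one entry per 1-based band number
per_band = {
    f"band_{i + 1}": {"count": int(c), "mean": float(m)}
    for i, (c, m) in enumerate(zip(band_counts, band_means))
}

for name, band_stats in per_band.items():
    print(name, band_stats)
```
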

In [None]:
# Example 3: Histogram analysis
def example_histogram_analysis(data):
    """
    Demonstrate histogram generation and analysis
    """
    
    print("\nHistogram Analysis:")
    
    # Calculate histogram
    hist, bin_edges = np.histogram(data.compressed(), bins=50)
    
    print(f"Histogram bins: {len(hist)}")
    print(f"Value range: {bin_edges[0]:.2f} to {bin_edges[-1]:.2f}")
    print(f"Most frequent bin: {bin_edges[hist.argmax()]:.2f} to {bin_edges[hist.argmax()+1]:.2f}")
    print(f"Peak frequency: {hist.max()} pixels")
    
    # Create histogram plot
    plt.figure(figsize=(10, 6))
    plt.hist(data.compressed(), bins=50, alpha=0.7, color='skyblue', edgecolor='black')
    plt.title('Elevation Data Distribution')
    plt.xlabel('Elevation (m)')
    plt.ylabel('Frequency')
    plt.grid(True, alpha=0.3)
    
    # Add statistical markers
    plt.axvline(data.mean(), color='red', linestyle='--', label=f'Mean: {data.mean():.1f}m')
    plt.axvline(np.ma.median(data), color='green', linestyle='--', label=f'Median: {np.ma.median(data):.1f}m')
    plt.legend()
    plt.show()
    
    return {
        'histogram': hist,
        'bin_edges': bin_edges,
        'peak_bin': hist.argmax(),
        'peak_value': bin_edges[hist.argmax()]
    }

# Run histogram analysis
if 'sample_elevation' in globals():
    histogram_data = example_histogram_analysis(sample_elevation)

## 🔍 Advanced Statistical Techniques

### Memory-Efficient Statistics for Large Rasters
```python
def calculate_statistics_chunked(raster_path):
    """
    Calculate statistics for large rasters using chunked processing.
    
    Chunks follow the file's internal block layout (src.block_windows).
    """
    
    with rasterio.open(raster_path) as src:
        # Initialize running statistics
        count = 0
        sum_val = 0.0
        sum_sq = 0.0
        min_val = float('inf')
        max_val = float('-inf')
        
        # Process in chunks; block_windows yields (block_index, window) pairs
        for _, window in src.block_windows(1):
            # Cast to float64 so integer bands cannot overflow when squared
            chunk = src.read(1, window=window, masked=True).astype(np.float64)
            
            if chunk.count() > 0:
                count += int(chunk.count())
                sum_val += float(chunk.sum())
                sum_sq += float((chunk ** 2).sum())
                min_val = min(min_val, float(chunk.min()))
                max_val = max(max_val, float(chunk.max()))
        
        # Calculate final statistics
        if count > 0:
            mean = sum_val / count
            # Rounding can push this slightly below zero, so clamp it
            variance = max((sum_sq / count) - (mean ** 2), 0.0)
            std = np.sqrt(variance)
            
            return {
                'count': count,
                'mean': mean,
                'std': std,
                'min': min_val,
                'max': max_val
            }
        else:
            return None
```
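
The sum-of-squares approach is simple, but it can lose precision when the mean is large relative to the spread. One alternative is to keep a `(count, mean, M2)` summary per chunk and merge summaries pairwise, an update rule usually attributed to Chan, Golub and LeVeque. The sketch below is a NumPy-only illustration under that assumption; `chunk_summary` and `merge_stats` are hypothetical helper names, and the row slices stand in for raster windows.

```python
import numpy as np

def chunk_summary(chunk):
    """Summarize one chunk as (count, mean, M2); M2 is the sum of squared deviations."""
    values = np.ma.asarray(chunk).compressed().astype(np.float64)
    if values.size == 0:
        return (0, 0.0, 0.0)
    mean = float(values.mean())
    return (int(values.size), mean, float(((values - mean) ** 2).sum()))

def merge_stats(a, b):
    """Merge two (count, mean, M2) summaries into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return (n, mean, m2)

# Synthetic band with a large mean and a small spread
rng = np.random.default_rng(1)
data = rng.normal(1e6, 1.0, size=(64, 64))

running = (0, 0.0, 0.0)
for row_start in range(0, data.shape[0], 16):  # 16-row strips play the role of windows
    running = merge_stats(running, chunk_summary(data[row_start:row_start + 16]))

count, mean, m2 = running
std = float(np.sqrt(m2 / count))  # population standard deviation
print(count, mean, std)
```

Because each chunk is reduced to three numbers, the same merge step also works when chunks are processed in parallel and combined afterwards.
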

### Zonal Statistics
```python
def calculate_zonal_statistics(raster_path, zones_array):
    """
    Calculate statistics for different zones/regions
    """
    
    with rasterio.open(raster_path) as src:
        data = src.read(1, masked=True)
        
        # Get unique zones
        unique_zones = np.unique(zones_array)
        
        zonal_stats = {}
        
        for zone in unique_zones:
            if zone == 0:  # Skip background/nodata zones
                continue
                
            zone_mask = zones_array == zone
            zone_data = data[zone_mask]
            
            if zone_data.count() > 0:
                zonal_stats[int(zone)] = {
                    'count': int(zone_data.count()),
                    'mean': float(zone_data.mean()),
                    'std': float(zone_data.std()),
                    'min': float(zone_data.min()),
                    'max': float(zone_data.max())
                }
        
        return zonal_stats
```
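
To see the zonal logic in isolation, the sketch below applies the same masking idea to two small hand-made arrays instead of a raster file. The values, the zone layout, and the `-9999` nodata value are all invented for the example; zone `0` is treated as background, as in the function above.

```python
import numpy as np

# Hand-made band with one nodata pixel (-9999 is assumed for this example)
values = np.ma.masked_equal(
    np.array([[1.0, 2.0, 3.0],
              [4.0, -9999.0, 6.0],
              [7.0, 8.0, 9.0]]),
    -9999.0,
)

# Zone labels on the same grid; 0 means background
zones = np.array([[1, 1, 2],
                  [1, 1, 2],
                  [0, 2, 2]])

# Boolean indexing only works when both grids line up
if zones.shape != values.shape:
    raise ValueError("zones and values must have the same shape")

zone_means = {}
for zone in np.unique(zones):
    if zone == 0:
        continue  # skip background
    zone_values = values[zones == zone]  # 1-D masked array for this zone
    if zone_values.count() > 0:
        zone_means[int(zone)] = float(zone_values.mean())

print(zone_means)
```

Zone 1 contains the nodata pixel, so its mean is taken over three values (1, 2 and 4) rather than four.
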

## 🎯 Your Implementation Task

Now it's time to implement the `calculate_raster_statistics` function in the `src/rasterio_basics.py` file.

### Requirements Checklist:
- [ ] File existence validation
- [ ] Raster loading and band validation
- [ ] Basic statistics calculation (mean, min, max, std, count)
- [ ] Proper nodata value handling
- [ ] Support for single-band and multi-band analysis
- [ ] Optional histogram generation
- [ ] Comprehensive error handling
- [ ] Return structured results dictionary

In [None]:
# Test your implementation
import sys
import os
import numpy as np  # Needed below; do not rely on earlier cells having run
sys.path.append('../src')

try:
    from rasterio_basics import calculate_raster_statistics
    
    # Create a test raster file
    import tempfile
    import rasterio
    from rasterio.transform import from_bounds
    
    # Create temporary test data
    temp_dir = tempfile.mkdtemp()
    test_raster = os.path.join(temp_dir, 'test_raster.tif')
    
    # Create sample data
    np.random.seed(42)
    test_data = np.random.normal(100, 25, (50, 50)).astype(np.float32)
    
    # Save as raster
    transform = from_bounds(-180, -90, 180, 90, 50, 50)
    with rasterio.open(
        test_raster, 'w',
        driver='GTiff',
        height=50, width=50,
        count=1, dtype=test_data.dtype,
        crs='EPSG:4326',
        transform=transform,
        nodata=-9999
    ) as dst:
        dst.write(test_data, 1)
    
    print("Testing calculate_raster_statistics function...")
    
    # Test the function
    result = calculate_raster_statistics(test_raster, compute_histogram=True)
    
    if isinstance(result, dict) and 'mean' in result:
        print("✓ Test passed! Function works correctly.")
        print(f"  Statistics calculated:")
        for key, value in result.items():
            if key != 'histogram':  # Skip histogram details
                print(f"    {key}: {value}")
        
        if 'histogram' in result:
            print(f"    histogram: Generated with {len(result['histogram']['counts'])} bins")
    else:
        print("✗ Test failed! Function did not return expected result.")
    
    # Clean up
    import shutil
    shutil.rmtree(temp_dir)
        
except ImportError:
    print("Function not implemented yet. Complete the implementation in src/rasterio_basics.py")
except Exception as e:
    print(f"✗ Test failed with error: {e}")

## 🧪 Testing Your Function

Once you've implemented your function, test it thoroughly:

```bash
# Run the specific test for this function
cd /workspaces/your-repo
python -m pytest tests/test_rasterio_basics.py::test_calculate_raster_statistics -v
```

### Test Cases Your Function Should Handle:
1. ✅ **Basic statistics** - Calculate mean, min, max, std for valid data
2. ✅ **Nodata handling** - Properly exclude nodata values from calculations
3. ✅ **Single-band analysis** - Process individual bands correctly
4. ✅ **Multi-band support** - Handle multiple bands when band=None
5. ✅ **Histogram generation** - Create histogram data when requested
6. ✅ **Error handling** - Handle missing files and invalid bands gracefully
7. ✅ **Empty datasets** - Handle rasters with no valid data
8. ✅ **Data type handling** - Work with different numeric data types
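
The empty-dataset case is easy to overlook: once every pixel is masked, reductions such as `mean()` no longer return an ordinary number. One possible guard, shown below with a hypothetical `safe_stat` helper, is to check `count()` first and return `None` when nothing valid remains.

```python
import numpy as np

def safe_stat(data, func):
    """Return func(data) as a float, or None when no valid pixels remain."""
    if data.count() == 0:
        return None
    return float(func(data))

# Every pixel is nodata (-9999 is assumed for this example)
all_nodata = np.ma.masked_equal(np.full((4, 4), -9999.0), -9999.0)

# Two valid pixels: 2.0 and 4.0
some_valid = np.ma.masked_equal(
    np.array([[-9999.0, 2.0], [4.0, -9999.0]]), -9999.0
)

print(safe_stat(all_nodata, np.ma.mean))   # None
print(safe_stat(some_valid, np.ma.mean))   # 3.0
print(safe_stat(some_valid, np.ma.max))    # 4.0
```
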

## 🚀 Next Steps

After successfully implementing and testing this function:

1. **Move to Function 3:** `03_function_extract_raster_subset.ipynb`
2. **Build on statistics:** Use statistical insights to guide data extraction
3. **Professional development:** Consider performance optimizations for production

### Professional Extensions to Consider:
- **Parallel processing:** Use multiprocessing for large multi-band rasters
- **Streaming statistics:** Handle datasets too large for memory
- **Advanced metrics:** Implement skewness, kurtosis, entropy measures
- **Spatial statistics:** Add spatial autocorrelation and clustering metrics
- **Benchmarking:** Compare performance with different chunk sizes
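
As a starting point for the advanced metrics mentioned above, the sketch below computes skewness and excess kurtosis from central moments using NumPy alone. The helper name `skewness_and_kurtosis` is hypothetical, and the formulas are the population (biased) versions; a library such as SciPy offers bias-corrected variants.

```python
import numpy as np

def skewness_and_kurtosis(data):
    """Return (skewness, excess kurtosis) of the valid pixels, or None if undefined."""
    valid = np.ma.asarray(data).compressed().astype(np.float64)
    if valid.size == 0:
        return None  # no valid pixels
    deviations = valid - valid.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (variance)
    if m2 == 0:
        return None  # constant data: both measures are undefined
    m3 = np.mean(deviations ** 3)
    m4 = np.mean(deviations ** 4)
    return float(m3 / m2 ** 1.5), float(m4 / m2 ** 2 - 3.0)

# Symmetric sample with one nodata value (-9999 is assumed for this example)
sample = np.ma.masked_equal(
    np.array([1.0, 2.0, 3.0, 4.0, 5.0, -9999.0]), -9999.0
)
skew, kurt = skewness_and_kurtosis(sample)
print(f"skewness: {skew:.3f}, excess kurtosis: {kurt:.3f}")
```
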

### Real-World Applications:
- **Quality Control:** Validate satellite imagery for anomalies and errors
- **Environmental Monitoring:** Track changes in vegetation, temperature, precipitation
- **Data Processing:** Normalize and standardize datasets for analysis
- **Machine Learning:** Generate features and validate input data quality

---

**🎯 Goal:** Master raster statistics - essential for all quantitative geospatial analysis!

**Next:** Once your tests pass, continue to `03_function_extract_raster_subset.ipynb` to learn about spatial data extraction.