# 📈 Function 3: Calculate Station Statistics

## Building the `calculate_station_statistics` Function

**Learning Objectives:**
- Understand data aggregation and grouping operations in pandas
- Learn to use `.groupby()` for statistical analysis
- Master aggregate functions (mean, count, min, max)
- Create summary DataFrames from grouped data
- Handle missing data in statistical calculations
- Generate meaningful reports from environmental monitoring data

**Professional Context:**
Data aggregation is crucial for:
- **Summarizing large datasets** - Convert thousands of readings into actionable insights
- **Identifying patterns** - Find which stations have unusual temperature or humidity patterns
- **Quality control** - Identify stations with too few readings or extreme values
- **Reporting** - Create executive summaries for stakeholders
- **Decision making** - Determine where to place new monitoring stations

## Part 1: Understanding Data Grouping

### 1.1 What is Data Grouping?

**Data grouping** splits rows into categories based on a key column (here, `station_id`) and then calculates statistics for each category.

**Real-world example:**
Imagine you have temperature readings from 5 weather stations, with multiple readings per day:

```
Raw Data (1000s of readings):
station_id | temperature_c | humidity_percent | datetime
STN_001    | 22.5          | 65               | 2024-01-01 08:00
STN_001    | 23.1          | 63               | 2024-01-01 12:00
STN_002    | 18.9          | 72               | 2024-01-01 08:00
STN_002    | 19.7          | 70               | 2024-01-01 12:00
...

Grouped Summary (5 stations):
station_id | avg_temperature | avg_humidity | reading_count
STN_001    | 22.8            | 64.2         | 245
STN_002    | 19.3            | 71.1         | 198
STN_003    | 25.1            | 58.7         | 267
...
```

This transformation makes data **actionable** - you can now answer questions like:
- Which station is the hottest on average?
- Which station has the most/least data?
- Are there patterns across different locations?
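
As a minimal sketch of this transformation, the four sample rows shown above can be grouped directly with pandas. The variable names `sample_readings` and `sample_summary` are assumptions for illustration, and because each station has only two readings here, the averages differ slightly from the illustrative summary table.

```python
import pandas as pd

# The four raw readings from the example above (hypothetical variable name)
sample_readings = pd.DataFrame({
    "station_id": ["STN_001", "STN_001", "STN_002", "STN_002"],
    "temperature_c": [22.5, 23.1, 18.9, 19.7],
    "humidity_percent": [65, 63, 72, 70],
})

# One row per station: the mean of each measurement, rounded to 1 decimal
sample_summary = (
    sample_readings.groupby("station_id")[["temperature_c", "humidity_percent"]]
    .mean()
    .round(1)
)

# Add the number of readings per station (aligned on the station_id index)
sample_summary["reading_count"] = sample_readings.groupby("station_id").size()

print(sample_summary)
```

Four raw rows collapse into two summary rows, one per station, which is the same shape of result the full function produces for all five stations.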

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Create realistic environmental monitoring data for demonstration
def create_sample_environmental_data():
    """Create sample environmental monitoring data that mimics real weather station data."""
    
    # Define weather stations with different characteristics
    stations = {
        'STN_001': {'base_temp': 22, 'base_humidity': 65, 'location': 'Downtown'},
        'STN_002': {'base_temp': 19, 'base_humidity': 72, 'location': 'Coastal'},
        'STN_003': {'base_temp': 25, 'base_humidity': 58, 'location': 'Desert'},
        'STN_004': {'base_temp': 16, 'base_humidity': 78, 'location': 'Mountain'},
        'STN_005': {'base_temp': 21, 'base_humidity': 68, 'location': 'Suburban'}
    }
    
    # Generate readings for each station
    all_data = []
    
    for station_id, props in stations.items():
        # Different stations have different numbers of readings (realistic scenario)
        n_readings = np.random.randint(150, 300)
        
        # Generate temperatures with daily and random variation
        base_temp = props['base_temp']
        temperatures = []
        
        for i in range(n_readings):
            # Daily variation (peaks mid-afternoon, around 15:00)
            hour_of_day = (i * 6) % 24  # Simulate readings every 6 hours
            daily_variation = 3 * np.sin(2 * np.pi * (hour_of_day - 9) / 24)
            
            # Random variation
            random_variation = np.random.normal(0, 2)
            
            temp = base_temp + daily_variation + random_variation
            temperatures.append(round(temp, 1))
        
        # Generate humidity with inverse correlation to temperature
        base_humidity = props['base_humidity']
        humidities = []
        
        for temp in temperatures:
            # Humidity tends to be lower when temperature is higher
            temp_effect = -0.8 * (temp - base_temp)
            random_variation = np.random.normal(0, 3)
            
            humidity = base_humidity + temp_effect + random_variation
            humidity = max(30, min(95, humidity))  # Keep within realistic range
            humidities.append(round(humidity, 1))
        
        # Add station data
        for i in range(n_readings):
            all_data.append({
                'station_id': station_id,
                'temperature_c': temperatures[i],
                'humidity_percent': humidities[i],
                'location': props['location']
            })
    
    # Convert to DataFrame and shuffle
    df = pd.DataFrame(all_data)
    df = df.sample(frac=1).reset_index(drop=True)  # Shuffle the data
    
    return df

# Create our sample dataset (seeded so the numbers are reproducible between runs)
np.random.seed(42)
environmental_data = create_sample_environmental_data()

print("Sample Environmental Monitoring Data Created:")
print(f"Total readings: {len(environmental_data):,}")
print(f"Stations: {environmental_data['station_id'].nunique()}")
print("\nFirst 5 rows:")
print(environmental_data.head())

print("\nReadings per station:")
print(environmental_data['station_id'].value_counts().sort_index())

### 1.2 Understanding the GroupBy Operation

The `.groupby()` operation in pandas sorts rows into buckets by key (here, one bucket per station) and then performs a calculation on each bucket. This pattern is often called split-apply-combine.
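
To make the bucket idea concrete, the sketch below does the same work by hand and then with `.groupby()`. The tiny `readings` dataset and the variable names are hypothetical and are used only for illustration.

```python
import pandas as pd

# Tiny hypothetical dataset: two stations, three readings
readings = pd.DataFrame({
    "station_id": ["STN_001", "STN_002", "STN_001"],
    "temperature_c": [22.0, 18.0, 24.0],
})

# Manual version: one bucket per station, one calculation per bucket
manual_means = {}
for station in sorted(readings["station_id"].unique()):
    bucket = readings[readings["station_id"] == station]    # split into a bucket
    manual_means[station] = bucket["temperature_c"].mean()  # apply the calculation
manual_result = pd.Series(manual_means, name="temperature_c")  # combine the results

# groupby performs the same three steps in one call
groupby_result = readings.groupby("station_id")["temperature_c"].mean()

print(manual_result.to_dict())
print(groupby_result.to_dict())
```

Both versions produce the same per-station means; `.groupby()` is shorter and avoids the explicit Python loop.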

In [None]:
# Demonstrate basic groupby concepts
print("=== UNDERSTANDING GROUPBY ===")

# Step 1: See the unique stations
unique_stations = environmental_data['station_id'].unique()
print(f"Unique stations: {list(unique_stations)}")
print(f"Number of stations: {len(unique_stations)}")

# Step 2: Create groupby object
grouped = environmental_data.groupby('station_id')
print(f"\nGroupBy object created: {type(grouped)}")
print(f"Number of groups: {grouped.ngroups}")

# Step 3: Show what's in each group
print("\nGroup information:")
for i, (name, group) in enumerate(grouped):
    print(f"  Group '{name}': {len(group)} rows")
    # Show a sample from the first group only, but keep looping so every group is listed
    if i == 0:
        print(f"    Sample from {name}:")
        print(group[['temperature_c', 'humidity_percent', 'location']].head(3).to_string(index=False))
        print("    ...")

## Part 2: Calculating Aggregate Statistics

### 2.1 Basic Aggregation Functions

Once data is grouped, you can apply various aggregate functions to summarize each group:

In [None]:
# Demonstrate different aggregation functions
print("=== AGGREGATION FUNCTIONS ===")

grouped = environmental_data.groupby('station_id')

# 1. Mean (average)
avg_temperature = grouped['temperature_c'].mean()
print("\nAverage Temperature by Station:")
for station, avg_temp in avg_temperature.items():
    print(f"  {station}: {avg_temp:.1f}°C")

# 2. Count (.size() counts every row in the group, including rows with missing values)
reading_counts = grouped.size()
print("\nReading Counts by Station:")
for station, count in reading_counts.items():
    print(f"  {station}: {count} readings")

# 3. Multiple columns at once
multi_stats = grouped[['temperature_c', 'humidity_percent']].mean()
print("\nAverage Temperature and Humidity:")
print(multi_stats.round(1))

# 4. Multiple functions at once
detailed_stats = grouped['temperature_c'].agg(['mean', 'min', 'max', 'std', 'count'])
print("\nDetailed Temperature Statistics:")
print(detailed_stats.round(2))

### 2.2 Creating Summary DataFrames

The key skill is combining multiple aggregations into a clean summary DataFrame:

In [None]:
# Method 1: Step by step aggregation
print("=== CREATING SUMMARY DATAFRAME ===")

grouped = environmental_data.groupby('station_id')

# Calculate each statistic separately
avg_temp = grouped['temperature_c'].mean().round(1)
avg_humidity = grouped['humidity_percent'].mean().round(1)
reading_count = grouped.size()

# Combine into a DataFrame
summary_df = pd.DataFrame({
    'station_id': avg_temp.index,
    'avg_temperature': avg_temp.values,
    'avg_humidity': avg_humidity.values,
    'reading_count': reading_count.values
})

print("Summary DataFrame:")
print(summary_df)

# Find temperature extremes
hottest_station = summary_df.loc[summary_df['avg_temperature'].idxmax()]
coolest_station = summary_df.loc[summary_df['avg_temperature'].idxmin()]

print(f"\nHottest station: {hottest_station['station_id']} (avg: {hottest_station['avg_temperature']:.1f}°C)")
print(f"Coolest station: {coolest_station['station_id']} (avg: {coolest_station['avg_temperature']:.1f}°C)")

### 2.3 Visualizing the Results

Let's create visualizations to better understand our station statistics:

In [None]:
# Create visualizations of station statistics
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Average Temperature by Station
axes[0, 0].bar(summary_df['station_id'], summary_df['avg_temperature'], color='red', alpha=0.7)
axes[0, 0].set_title('Average Temperature by Station')
axes[0, 0].set_ylabel('Temperature (°C)')
axes[0, 0].tick_params(axis='x', rotation=45)

# Add value labels on bars
for i, v in enumerate(summary_df['avg_temperature']):
    axes[0, 0].text(i, v + 0.5, f'{v:.1f}°C', ha='center')

# 2. Average Humidity by Station
axes[0, 1].bar(summary_df['station_id'], summary_df['avg_humidity'], color='blue', alpha=0.7)
axes[0, 1].set_title('Average Humidity by Station')
axes[0, 1].set_ylabel('Humidity (%)')
axes[0, 1].tick_params(axis='x', rotation=45)

# Add value labels on bars
for i, v in enumerate(summary_df['avg_humidity']):
    axes[0, 1].text(i, v + 1, f'{v:.1f}%', ha='center')

# 3. Reading Count by Station
axes[1, 0].bar(summary_df['station_id'], summary_df['reading_count'], color='green', alpha=0.7)
axes[1, 0].set_title('Number of Readings by Station')
axes[1, 0].set_ylabel('Reading Count')
axes[1, 0].tick_params(axis='x', rotation=45)

# Add value labels on bars
for i, v in enumerate(summary_df['reading_count']):
    axes[1, 0].text(i, v + 5, f'{v}', ha='center')

# 4. Temperature vs Humidity Relationship
scatter = axes[1, 1].scatter(summary_df['avg_temperature'], summary_df['avg_humidity'], 
                           s=summary_df['reading_count']/3, alpha=0.7, c=range(len(summary_df)), cmap='viridis')
axes[1, 1].set_title('Temperature vs Humidity by Station')
axes[1, 1].set_xlabel('Average Temperature (°C)')
axes[1, 1].set_ylabel('Average Humidity (%)')

# Add station labels
for i, row in summary_df.iterrows():
    axes[1, 1].annotate(row['station_id'], (row['avg_temperature'], row['avg_humidity']), 
                       xytext=(5, 5), textcoords='offset points', fontsize=8)

plt.tight_layout()
plt.show()

# Summary insights
print("\n=== DATA INSIGHTS ===")
print(f"Temperature-humidity correlation: {summary_df['avg_temperature'].corr(summary_df['avg_humidity']):.3f}")
print(f"Most data: {summary_df.loc[summary_df['reading_count'].idxmax(), 'station_id']} ({summary_df['reading_count'].max()} readings)")
print(f"Least data: {summary_df.loc[summary_df['reading_count'].idxmin(), 'station_id']} ({summary_df['reading_count'].min()} readings)")

## Part 3: Building the Complete Function

### 3.1 Function Implementation Example

Now let's build the complete function that matches the requirements:

In [None]:
def calculate_station_statistics_example(df):
    """Example implementation of the calculate_station_statistics function."""
    
    # Print header
    print("=" * 50)
    print("CALCULATING STATION STATISTICS")
    print("=" * 50)
    
    # Input validation
    if df is None or len(df) == 0:
        print("Error: DataFrame is empty or None")
        return pd.DataFrame()
    
    # Check for required columns
    required_columns = ['station_id', 'temperature_c', 'humidity_percent']
    missing_columns = [col for col in required_columns if col not in df.columns]
    
    if missing_columns:
        print(f"Error: Missing required columns: {missing_columns}")
        print(f"Available columns: {list(df.columns)}")
        return pd.DataFrame()
    
    # Print input data summary
    print(f"Processing {len(df):,} temperature readings...")
    
    # Get unique stations
    unique_stations = df['station_id'].unique()
    print(f"Found {len(unique_stations)} weather stations: {list(unique_stations)}")
    
    # Group data by station
    grouped = df.groupby('station_id')
    
    # Calculate statistics
    avg_temperature = grouped['temperature_c'].mean().round(1)
    avg_humidity = grouped['humidity_percent'].mean().round(1)
    reading_count = grouped.size()
    
    # Create summary DataFrame
    summary = pd.DataFrame({
        'station_id': avg_temperature.index,
        'avg_temperature': avg_temperature.values,
        'avg_humidity': avg_humidity.values,
        'reading_count': reading_count.values
    })
    
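    # station_id is already a regular column because the DataFrame was built from
    # .index and .values; reset_index(drop=True) only guarantees a clean 0..n-1 index
    summary = summary.reset_index(drop=True)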
    
    # Print summary of results
    print(f"\nTemperature range across all stations: {summary['avg_temperature'].min():.1f}°C to {summary['avg_temperature'].max():.1f}°C")
    print(f"Humidity range across all stations: {summary['avg_humidity'].min():.1f}% to {summary['avg_humidity'].max():.1f}%")
    print(f"Total readings processed: {summary['reading_count'].sum():,}")
    print(f"Average readings per station: {summary['reading_count'].mean():.0f}")
    
    # Find temperature extremes
    hottest_station = summary.loc[summary['avg_temperature'].idxmax()]
    coolest_station = summary.loc[summary['avg_temperature'].idxmin()]
    
    print(f"\nHottest station: {hottest_station['station_id']} (avg: {hottest_station['avg_temperature']:.1f}°C)")
    print(f"Coolest station: {coolest_station['station_id']} (avg: {coolest_station['avg_temperature']:.1f}°C)")
    
    print("\nStation statistics calculated successfully!")
    
    return summary

# Test the function
station_stats = calculate_station_statistics_example(environmental_data)
print("\n=== FINAL RESULTS ===")
print(station_stats)

## Part 4: Your Implementation Task

### 4.1 Implementation Guidelines

Now implement this function in `src/pandas_basics.py`. Here are the key steps:

```python
def calculate_station_statistics(df):
    # TODO: Print header with function name
    # TODO: Validate input DataFrame (check if None or empty)
    # TODO: Check for required columns: ['station_id', 'temperature_c', 'humidity_percent']
    # TODO: Print summary of input data
    # TODO: Get unique stations and report count
    # TODO: Group data by station_id using df.groupby('station_id')
    # TODO: Calculate avg_temperature using .mean().round(1)
    # TODO: Calculate avg_humidity using .mean().round(1) 
    # TODO: Count readings per station using .size()
    # TODO: Create summary DataFrame with all statistics
    # TODO: Reset index to make station_id a regular column
    # TODO: Print summary statistics (ranges, totals)
    # TODO: Find and report hottest/coolest stations
    # TODO: Return the summary DataFrame
```

### 4.2 Testing Your Implementation

Test your function with:

```bash
uv run pytest tests/test_pandas_basics.py::test_calculate_station_statistics -v
```

### 4.3 Expected Output Format

Your function should return a DataFrame with exactly these columns:
- `station_id`: Station identifier (string)
- `avg_temperature`: Average temperature rounded to 1 decimal (float)
- `avg_humidity`: Average humidity rounded to 1 decimal (float) 
- `reading_count`: Number of readings for this station (int)
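
A quick self-check of this format might look like the sketch below. The helper name `check_summary_format` and the small `example_summary` frame are hypothetical; they are not part of the course's test suite, and the actual tests may check more or different things.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

EXPECTED_COLUMNS = ["station_id", "avg_temperature", "avg_humidity", "reading_count"]

def check_summary_format(summary):
    """Return a list of format problems (an empty list means none were found)."""
    problems = []
    # Column names and order must match exactly
    if list(summary.columns) != EXPECTED_COLUMNS:
        problems.append(f"columns are {list(summary.columns)}")
        return problems
    # Averages should be floats that are already rounded to 1 decimal
    for col in ["avg_temperature", "avg_humidity"]:
        if not is_float_dtype(summary[col]):
            problems.append(f"{col} is not a float column")
        elif not ((summary[col] - summary[col].round(1)).abs() < 1e-9).all():
            problems.append(f"{col} is not rounded to 1 decimal")
    # Counts should be whole numbers
    if not is_integer_dtype(summary["reading_count"]):
        problems.append("reading_count is not an integer column")
    return problems

# Hypothetical example that follows the expected format
example_summary = pd.DataFrame({
    "station_id": ["STN_001", "STN_002"],
    "avg_temperature": [22.8, 19.3],
    "avg_humidity": [64.2, 71.1],
    "reading_count": [245, 198],
})
print(check_summary_format(example_summary))  # an empty list means no problems found
```

Running a check like this on your own function's return value before running pytest can make format mistakes easier to spot.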

### 4.4 Common Issues and Solutions

**Issue 1: Index problems**
```python
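# Select the numeric columns first: in pandas 2.x, calling .mean() on a group
# that still contains a text column such as 'location' raises a TypeError.

# Wrong: station_id becomes the index
summary = grouped_data[['temperature_c', 'humidity_percent']].mean()

# Right: station_id remains a regular column
summary = grouped_data[['temperature_c', 'humidity_percent']].mean().reset_index()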
```

**Issue 2: Column naming**
```python
# Make sure your final DataFrame has exactly these column names:
['station_id', 'avg_temperature', 'avg_humidity', 'reading_count']
```
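
One way to get exactly these names is pandas named aggregation, where each keyword argument passed to `.agg()` becomes an output column. The sketch below is an alternative to the step-by-step approach, shown on a tiny hypothetical dataset (`demo`); it is not the required solution.

```python
import pandas as pd

# Tiny hypothetical dataset for illustration
demo = pd.DataFrame({
    "station_id": ["STN_001", "STN_001", "STN_002"],
    "temperature_c": [22.5, 23.1, 18.9],
    "humidity_percent": [65.0, 63.0, 72.0],
})

summary = (
    demo.groupby("station_id")
    .agg(
        avg_temperature=("temperature_c", "mean"),   # output name = (input column, function)
        avg_humidity=("humidity_percent", "mean"),
        reading_count=("temperature_c", "size"),     # rows per station
    )
    .round({"avg_temperature": 1, "avg_humidity": 1})  # round only the averages
    .reset_index()                                      # station_id becomes a regular column
)

print(list(summary.columns))
print(summary)
```

Because the output names are set inside `.agg()`, no separate renaming step is needed.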

**Issue 3: Rounding**
```python
# Remember to round temperature and humidity to 1 decimal place
avg_temperature = grouped['temperature_c'].mean().round(1)
```
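
Missing readings are a related pitfall. In pandas, `.mean()` skips `NaN` values by default, `.size()` counts every row in a group, and `.count()` counts only non-missing values, so the two counts can disagree. A minimal sketch on a hypothetical three-row dataset (`with_gaps`):

```python
import numpy as np
import pandas as pd

# STN_001 has one valid reading and one missing reading
with_gaps = pd.DataFrame({
    "station_id": ["STN_001", "STN_001", "STN_002"],
    "temperature_c": [20.0, np.nan, 18.0],
})

by_station = with_gaps.groupby("station_id")["temperature_c"]

mean_temp = by_station.mean()     # NaN is ignored when averaging
row_count = by_station.size()     # counts all rows, including the NaN row
valid_count = by_station.count()  # counts only non-missing temperatures

print(pd.DataFrame({
    "mean_temp": mean_temp,
    "row_count": row_count,
    "valid_count": valid_count,
}))
```

Which count belongs in `reading_count` depends on the requirements; the guidelines in this notebook use `.size()`.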

## 🎯 Summary and Next Steps

### What You've Learned
- How to group data using `.groupby()`
- Computing aggregate statistics (mean, count, min, max)
- Creating summary DataFrames from grouped data
- Finding extreme values and patterns in data
- Professional data reporting and validation

### Your Implementation Checklist
- [ ] Print informative headers and progress messages
- [ ] Validate input data and handle errors gracefully
- [ ] Check for required columns before processing
- [ ] Use `.groupby()` to group data by station
- [ ] Calculate mean temperature and humidity (rounded to 1 decimal)
- [ ] Count readings per station using `.size()`
- [ ] Create DataFrame with exact column names expected by tests
- [ ] Reset index to make station_id a regular column
- [ ] Report summary statistics and extremes
- [ ] Return properly formatted DataFrame

### Next Function
Once you've implemented and tested this function, move on to:
**[`04_function_join_station_data.ipynb`](04_function_join_station_data.ipynb)**

There you'll learn to combine datasets using pandas merge operations!

---

**Remember**: Data aggregation is one of the most powerful features of pandas - it transforms raw data into actionable insights! 📊