# Function 2: Filter Environmental Data 🔍

**Welcome to data filtering with pandas!**

In this notebook, you'll learn how to build the `filter_environmental_data()` function step by step. This is like applying filters in Excel to show only the data that meets certain conditions.

## 🎯 What This Function Does
- Keeps only rows whose temperature falls within a specified range (both limits included)
- Keeps only rows whose data quality matches the requested level (`"good"` by default)
- Reports how many rows were kept vs. removed
- Returns the filtered DataFrame for further analysis

## 🔧 Function Signature
```python
def filter_environmental_data(df, min_temp=15, max_temp=30, quality="good"):
    """
    Args:
        df (pandas.DataFrame): The environmental data to filter
        min_temp (float): Minimum temperature threshold (default: 15°C)
        max_temp (float): Maximum temperature threshold (default: 30°C)
        quality (str): Required data quality level (default: "good")
    
    Returns:
        pandas.DataFrame: Filtered data meeting all conditions
    """
```

## 🚀 Step 1: Import Libraries and Load Data

First, let's set up our environment and load the temperature readings data:

In [None]:
import pandas as pd
import numpy as np
import os

print(f"✅ Pandas version: {pd.__version__}")
print("🔍 Ready to filter data!")

In [None]:
# Load the temperature readings data
file_path = '../data/temperature_readings.csv'
df = pd.read_csv(file_path)

print(f"📊 Loaded {len(df)} temperature readings")
print(f"📋 Columns: {list(df.columns)}")
print("\n🔍 First 5 rows:")
display(df.head())

## 📊 Step 2: Understanding Your Data Before Filtering

Before filtering, let's understand what we're working with:

In [None]:
# Explore the temperature range
print("🌡️  TEMPERATURE ANALYSIS:")
print(f"   Minimum temperature: {df['temperature_c'].min()}°C")
print(f"   Maximum temperature: {df['temperature_c'].max()}°C")
print(f"   Average temperature: {df['temperature_c'].mean():.1f}°C")
print(f"   Temperature range: {df['temperature_c'].max() - df['temperature_c'].min():.1f}°C")

print("\n📈 Temperature distribution:")
print(df['temperature_c'].describe())

In [None]:
# Explore data quality values
print("🏷️  DATA QUALITY ANALYSIS:")
quality_counts = df['data_quality'].value_counts()
print(quality_counts)

print("\n📊 Quality percentages:")
quality_pct = df['data_quality'].value_counts(normalize=True) * 100
for quality, pct in quality_pct.items():
    print(f"   {quality}: {pct:.1f}%")

## 🔍 Step 3: Basic Filtering - Single Condition

Let's start with simple filtering using a single condition:

In [None]:
# Filter 1: Show only temperatures above 20°C
hot_temps = df[df['temperature_c'] > 20]

print(f"🌡️  Original data: {len(df)} rows")
print(f"🔥 Temperatures > 20°C: {len(hot_temps)} rows")
print(f"📉 Removed: {len(df) - len(hot_temps)} rows ({(len(df) - len(hot_temps))/len(df)*100:.1f}%)")

print("\n🔍 Sample of filtered data:")
display(hot_temps.head(3))

In [None]:
# Filter 2: Show only "good" quality data
good_quality = df[df['data_quality'] == 'good']

print(f"📊 Original data: {len(df)} rows")
print(f"✅ Good quality data: {len(good_quality)} rows")
print(f"📉 Removed: {len(df) - len(good_quality)} rows ({(len(df) - len(good_quality))/len(df)*100:.1f}%)")

print("\n🔍 Quality distribution in filtered data:")
print(good_quality['data_quality'].value_counts())

## 🔗 Step 4: Advanced Filtering - Multiple Conditions

Now let's combine multiple conditions. pandas uses `&` for AND and `|` for OR; the example below uses `&`:

In [None]:
# Multiple conditions with AND (&)
# Show data where temperature is between 15-30°C AND quality is "good"

# Note: When combining conditions, each condition needs parentheses!
filtered_data = df[
    (df['temperature_c'] >= 15) & 
    (df['temperature_c'] <= 30) & 
    (df['data_quality'] == 'good')
]

print(f"📊 Original data: {len(df)} rows")
print(f"✅ Filtered data: {len(filtered_data)} rows")
print(f"📉 Removed: {len(df) - len(filtered_data)} rows ({(len(df) - len(filtered_data))/len(df)*100:.1f}%)")

print("\n🌡️  Temperature range in filtered data:")
print(f"   Min: {filtered_data['temperature_c'].min()}°C")
print(f"   Max: {filtered_data['temperature_c'].max()}°C")
print(f"   Avg: {filtered_data['temperature_c'].mean():.1f}°C")
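
Why do the parentheses matter? In Python, `&` binds more tightly than comparison operators such as `>=`, so an unparenthesized condition gets grouped differently from what was intended. The sketch below uses a tiny made-up DataFrame (the values are illustrative, not taken from `temperature_readings.csv`) to show the failure and the fix:

```python
import pandas as pd

# Tiny illustrative dataset; the values are made up for this example
demo = pd.DataFrame({
    "temperature_c": [12.0, 18.5, 24.0, 31.0],
    "data_quality": ["good", "good", "poor", "good"],
})

# Without parentheses, Python groups this as
#   demo["temperature_c"] >= (15 & demo["temperature_c"]) <= 30
# which is not a valid filter and raises an error.
try:
    demo[demo["temperature_c"] >= 15 & demo["temperature_c"] <= 30]
    unparenthesized_failed = False
except (TypeError, ValueError) as err:
    unparenthesized_failed = True
    print(f"Unparenthesized filter failed: {type(err).__name__}")

# With parentheses, each comparison is evaluated first, then combined
in_range = demo[(demo["temperature_c"] >= 15) & (demo["temperature_c"] <= 30)]
print(in_range)
```

The same rule applies to `|`: wrap every comparison in its own parentheses before combining.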

## 💡 Step 5: Understanding Boolean Indexing

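Filtering is a two-step operation: the comparison produces a boolean Series (a "mask") with one `True`/`False` per row, and indexing with that mask keeps only the rows marked `True`. Let's look at each step: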

In [None]:
# Create a small sample to understand boolean indexing
sample_df = df.head(10).copy()

print("🔍 Sample data:")
display(sample_df[['station_id', 'temperature_c', 'data_quality']])

print("\n🔢 Boolean mask for temperature > 20:")
temp_mask = sample_df['temperature_c'] > 20
print(temp_mask)

print("\n✅ Rows where temperature > 20:")
display(sample_df[temp_mask][['station_id', 'temperature_c', 'data_quality']])
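
Because a boolean mask is just a Series of `True`/`False` values, it can be counted before any rows are selected. Here is a minimal sketch on made-up data (the column names follow this notebook; the values are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    "station_id": ["A", "B", "C", "D"],
    "temperature_c": [18.0, 22.5, 25.0, 19.5],
})

mask = demo["temperature_c"] > 20

# True counts as 1 and False as 0, so sum() is the number of matching rows
matching_rows = int(mask.sum())
# mean() is the fraction of rows that match
matching_fraction = float(mask.mean())

print(f"{matching_rows} of {len(demo)} rows match ({matching_fraction:.0%})")

# .loc accepts the same mask and can select a column at the same time
hot_stations = demo.loc[mask, "station_id"].tolist()
print(hot_stations)
```

Counting on the mask is the same trick the complete function uses later to report how many rows each filter keeps.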

## 📝 Step 6: Building Helper Functions

Let's create some helper functions to make our filtering more readable:

In [None]:
def show_filtering_stats(original_df, filtered_df, description=""):
    """
    Display statistics about the filtering operation.
    """
    original_count = len(original_df)
    filtered_count = len(filtered_df)
    removed_count = original_count - filtered_count
    removed_pct = (removed_count / original_count) * 100 if original_count > 0 else 0
    
    print(f"📊 {description}")
    print(f"   Original dataset: {original_count} rows")
    print(f"   After filtering: {filtered_count} rows kept")
    print(f"   Removed: {removed_count} rows ({removed_pct:.1f}%)")

# Test our helper function
test_filtered = df[df['temperature_c'] > 25]
show_filtering_stats(df, test_filtered, "Testing helper function - temp > 25°C")
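
If the numbers are needed later (for example in a test or a summary table), one option is a variant that returns them instead of only printing. The name `filtering_stats` below is hypothetical and not part of the assignment; the data is made up:

```python
import pandas as pd

def filtering_stats(original_df, filtered_df):
    """Return row counts for a filtering step as a dictionary (hypothetical helper)."""
    original_count = len(original_df)
    kept = len(filtered_df)
    removed = original_count - kept
    # Guard against division by zero when the original DataFrame is empty
    removed_pct = removed / original_count * 100 if original_count > 0 else 0.0
    return {
        "original": original_count,
        "kept": kept,
        "removed": removed,
        "removed_pct": removed_pct,
    }

demo = pd.DataFrame({"temperature_c": [10.0, 20.0, 26.0, 30.0]})
stats = filtering_stats(demo, demo[demo["temperature_c"] > 25])
print(stats)
```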

## 🎛️ Step 7: Flexible Filtering with Parameters

Now let's create a flexible filtering function that accepts parameters:

In [None]:
def filter_by_temperature_range(df, min_temp, max_temp):
    """
    Filter DataFrame to show only temperatures within the specified range.
    """
    filtered = df[(df['temperature_c'] >= min_temp) & (df['temperature_c'] <= max_temp)]
    
    print(f"🌡️  Temperature filter: {min_temp}°C to {max_temp}°C")
    show_filtering_stats(df, filtered, "Temperature range filtering")
    
    return filtered

# Test with different temperature ranges
print("🧪 TESTING DIFFERENT TEMPERATURE RANGES:\n")

# Test 1: Conservative range
result1 = filter_by_temperature_range(df, 18, 25)
print()

# Test 2: Wider range
result2 = filter_by_temperature_range(df, 10, 35)
print()

# Test 3: Narrow range
result3 = filter_by_temperature_range(df, 20, 22)
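
pandas also provides `Series.between()`, which includes both endpoints by default and so matches the `>=` / `<=` pair used above. A small sketch on made-up values:

```python
import pandas as pd

demo = pd.DataFrame({"temperature_c": [14.9, 15.0, 22.0, 30.0, 30.1]})

min_temp, max_temp = 15, 30

# Explicit comparisons, as in filter_by_temperature_range()
explicit = demo[(demo["temperature_c"] >= min_temp) & (demo["temperature_c"] <= max_temp)]

# Equivalent one-liner; both bounds are inclusive by default
with_between = demo[demo["temperature_c"].between(min_temp, max_temp)]

print(with_between["temperature_c"].tolist())
```

Either form works; the explicit version makes the two boundary checks visible, while `between()` is shorter to read.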

## 🏷️ Step 8: Quality-Based Filtering

Let's create a function to filter by data quality:

In [None]:
def filter_by_quality(df, quality_level):
    """
    Filter DataFrame to show only data with the specified quality level.
    """
    # Check if the quality level exists in the data
    available_qualities = df['data_quality'].unique()
    
    if quality_level not in available_qualities:
        print(f"⚠️  Warning: Quality level '{quality_level}' not found in data")
        print(f"📋 Available quality levels: {list(available_qualities)}")
        return df.copy()  # Return original data if quality not found
    
    filtered = df[df['data_quality'] == quality_level]
    
    print(f"🏷️  Quality filter: '{quality_level}' only")
    show_filtering_stats(df, filtered, f"Quality filtering for '{quality_level}'")
    
    return filtered

# Test quality filtering
print("🧪 TESTING QUALITY FILTERING:\n")

# Test with good quality
good_data = filter_by_quality(df, "good")
print()

# Test with fair quality
fair_data = filter_by_quality(df, "fair")
print()

# Test with invalid quality
invalid_data = filter_by_quality(df, "excellent")
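
To accept more than one quality level at once, `Series.isin()` is a common choice. The function name `filter_by_qualities` is hypothetical; it only sketches how the idea behind `filter_by_quality()` could be extended, using made-up data:

```python
import pandas as pd

def filter_by_qualities(df, quality_levels):
    """Keep rows whose data_quality is any of the given levels (hypothetical helper)."""
    return df[df["data_quality"].isin(quality_levels)]

demo = pd.DataFrame({
    "temperature_c": [21.0, 19.5, 23.0, 18.0],
    "data_quality": ["good", "fair", "poor", "good"],
})

usable = filter_by_qualities(demo, ["good", "fair"])
print(usable)
```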

## 🏗️ Step 9: Building the Complete Function

Now let's combine everything into the complete `filter_environmental_data()` function:

In [None]:
def filter_environmental_data(df, min_temp=15, max_temp=30, quality="good"):
    """
    Filter environmental data based on temperature range and data quality.
    
    This function demonstrates how to apply multiple filtering conditions
    to clean and prepare environmental data for analysis.
    
    Args:
        df (pandas.DataFrame): Environmental data with temperature and quality columns
        min_temp (float): Minimum acceptable temperature in Celsius (default: 15)
        max_temp (float): Maximum acceptable temperature in Celsius (default: 30)
        quality (str): Required data quality level (default: "good")
        
    Returns:
        pandas.DataFrame: Filtered data meeting all specified conditions
    """
    
    print("=" * 50)
    print("FILTERING ENVIRONMENTAL DATA")
    print("=" * 50)
    
    # Input validation
    if df is None or df.empty:
        print("❌ ERROR: Empty or None DataFrame provided")
        return pd.DataFrame()
    
    # Check for required columns
    required_columns = ['temperature_c', 'data_quality']
    missing_columns = [col for col in required_columns if col not in df.columns]
    
    if missing_columns:
        print(f"❌ ERROR: Missing required columns: {missing_columns}")
        print(f"📋 Available columns: {list(df.columns)}")
        return pd.DataFrame()
    
    original_count = len(df)
    print(f"📊 Starting with {original_count} rows of environmental data")
    
    # Show filtering criteria
    print(f"\n🎯 FILTERING CRITERIA:")
    print(f"   Temperature range: {min_temp}°C to {max_temp}°C")
    print(f"   Data quality: '{quality}'")
    
    # Check if quality level exists
    available_qualities = df['data_quality'].unique()
    if quality not in available_qualities:
        print(f"\n⚠️  WARNING: Quality level '{quality}' not found in data")
        print(f"📋 Available quality levels: {list(available_qualities)}")
        print("🔄 Skipping the quality filter; the temperature filter still applies...")
        quality_filter = pd.Series(True, index=df.index)  # All True, aligned with df's index
    else:
        quality_filter = df['data_quality'] == quality
    
    # Apply all filters
    print(f"\n🔍 APPLYING FILTERS...")
    
    # Temperature range filter
    temp_filter = (df['temperature_c'] >= min_temp) & (df['temperature_c'] <= max_temp)
    temp_filtered_count = temp_filter.sum()
    temp_removed = original_count - temp_filtered_count
    print(f"   🌡️  Temperature filter: kept {temp_filtered_count}, removed {temp_removed} rows")
    
    # Quality filter
    quality_filtered_count = quality_filter.sum()
    quality_removed = original_count - quality_filtered_count
    print(f"   🏷️  Quality filter: kept {quality_filtered_count}, removed {quality_removed} rows")
    
    # Combined filter
    combined_filter = temp_filter & quality_filter
    filtered_df = df[combined_filter].copy()
    
    final_count = len(filtered_df)
    total_removed = original_count - final_count
    removal_pct = (total_removed / original_count) * 100 if original_count > 0 else 0
    
    print(f"\n📈 FILTERING RESULTS:")
    print(f"   Original dataset: {original_count} rows")
    print(f"   After filtering: {final_count} rows kept")
    print(f"   Total removed: {total_removed} rows ({removal_pct:.1f}%)")
    
    # Show statistics of filtered data
    if not filtered_df.empty:
        print(f"\n📊 FILTERED DATA SUMMARY:")
        print(f"   Temperature range: {filtered_df['temperature_c'].min():.1f}°C to {filtered_df['temperature_c'].max():.1f}°C")
        print(f"   Average temperature: {filtered_df['temperature_c'].mean():.1f}°C")
        print(f"   Quality distribution: {filtered_df['data_quality'].value_counts().to_dict()}")
    else:
        print(f"\n⚠️  WARNING: No data remains after filtering!")
        print(f"   Consider relaxing your filtering criteria.")
    
    print(f"\n✅ Filtering complete! Ready for analysis.")
    
    return filtered_df
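
One behavior worth knowing: comparisons against a missing value (`NaN`) evaluate to `False`, so rows with a missing temperature fail the range filter and are dropped without any message. The sketch below, on made-up data, shows how such rows could be counted separately from readings that are simply out of range:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "temperature_c": [20.0, np.nan, 35.0, 25.0],
    "data_quality": ["good", "good", "good", "good"],
})

temp_filter = (demo["temperature_c"] >= 15) & (demo["temperature_c"] <= 30)

# NaN >= 15 is False, so the missing reading is excluded by the mask
kept_count = int(temp_filter.sum())
missing_count = int(demo["temperature_c"].isna().sum())
out_of_range_count = int((~temp_filter & demo["temperature_c"].notna()).sum())

print(f"Kept: {kept_count}")
print(f"Removed because missing: {missing_count}")
print(f"Removed because out of range: {out_of_range_count}")
```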

## ✨ Step 10: Test Your Complete Function

Let's test our complete function with different scenarios:

In [None]:
# Test 1: Default parameters
print("🧪 TEST 1: DEFAULT PARAMETERS\n")
result1 = filter_environmental_data(df)
print(f"\n📋 Sample of filtered data:")
display(result1.head(3))

In [None]:
# Test 2: Custom temperature range
print("\n" + "="*80 + "\n")
print("🧪 TEST 2: CUSTOM TEMPERATURE RANGE (10-35°C)\n")
result2 = filter_environmental_data(df, min_temp=10, max_temp=35, quality="good")

In [None]:
# Test 3: Different quality level
print("\n" + "="*80 + "\n")
print("🧪 TEST 3: FAIR QUALITY DATA\n")
result3 = filter_environmental_data(df, min_temp=15, max_temp=30, quality="fair")

In [None]:
# Test 4: Very strict filtering
print("\n" + "="*80 + "\n")
print("🧪 TEST 4: VERY STRICT FILTERING (20-22°C, good quality)\n")
result4 = filter_environmental_data(df, min_temp=20, max_temp=22, quality="good")

In [None]:
# Test 5: Error handling - invalid quality
print("\n" + "="*80 + "\n")
print("🧪 TEST 5: ERROR HANDLING - Invalid Quality\n")
result5 = filter_environmental_data(df, min_temp=15, max_temp=30, quality="excellent")

In [None]:
# Test 6: Error handling - empty DataFrame
print("\n" + "="*80 + "\n")
print("🧪 TEST 6: ERROR HANDLING - Empty DataFrame\n")
empty_df = pd.DataFrame()
result6 = filter_environmental_data(empty_df)
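
Printed output is useful for exploring, but checks written with `assert` make expectations explicit, which is also how pytest tests are usually written. The sketch below uses a tiny hand-made DataFrame where the expected result can be worked out by eye. To stay self-contained it defines `quiet_filter`, a hypothetical, print-free stand-in for the filtering logic; it is not the assignment's function or its actual test.

```python
import pandas as pd

def quiet_filter(df, min_temp=15, max_temp=30, quality="good"):
    """Print-free stand-in for the range-and-quality filter (hypothetical)."""
    in_range = (df["temperature_c"] >= min_temp) & (df["temperature_c"] <= max_temp)
    return df[in_range & (df["data_quality"] == quality)].copy()

tiny = pd.DataFrame({
    "station_id": ["S1", "S2", "S3", "S4"],
    "temperature_c": [14.0, 15.0, 25.0, 30.0],
    "data_quality": ["good", "good", "poor", "good"],
})

result = quiet_filter(tiny)

# S1 is too cold and S3 is poor quality; the bounds 15 and 30 are inclusive
assert result["station_id"].tolist() == ["S2", "S4"]
# Filtering must never add rows
assert len(result) <= len(tiny)
print("All checks passed")
```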

## 🎯 Your Assignment Task

Now that you understand how this function works:

1. **Go to `src/pandas_basics.py`**
2. **Find the `filter_environmental_data()` function**
3. **Replace the TODO comments with your implementation**
4. **Test your function with pytest**:

```bash
# Test just this function
uv run pytest tests/test_pandas_basics.py::test_filter_environmental_data -v

# Test all functions so far
uv run pytest tests/ -v
```

## 🔑 Key Learning Points

- **Boolean indexing**: `df[df['column'] > value]` creates filtered DataFrames
- **Multiple conditions**: Use `&` (AND) and `|` (OR), not Python's `and`/`or`, and wrap each condition in parentheses: `(condition1) & (condition2)`
- **Parameter defaults**: Make functions flexible with default parameter values
- **Input validation**: Always check if required columns exist and data is valid
- **Informative output**: Tell users what filters were applied and how much data was removed
- **Error handling**: Handle edge cases like missing quality levels or empty DataFrames

## 📚 Common Filtering Patterns

```python
# Single condition
df[df['temperature_c'] > 20]

# Multiple conditions (AND)
df[(df['temperature_c'] > 15) & (df['temperature_c'] < 30)]

# Multiple conditions (OR)
df[(df['data_quality'] == 'good') | (df['data_quality'] == 'fair')]

# String operations (na=False treats missing names as non-matches)
df[df['station_name'].str.contains('Central', na=False)]

# Negation (NOT)
df[~(df['data_quality'] == 'poor')]
```

## 🚀 Next Steps

Once this function works and passes the tests, move on to:
- **Function 3**: `calculate_station_statistics()` - Learn to group data and calculate statistics
- **Function 4**: `join_station_data()` - Learn to combine multiple datasets
- **Function 5**: `save_processed_data()` - Learn to save your filtered results

**Great job learning data filtering! 🎉**