# 💾 Function 5: Save Processed Data

## Building the `save_processed_data` Function

**Learning Objectives:**
- Understand data export and file I/O operations in pandas
- Learn to save DataFrames to CSV files
- Master file path handling and directory management
- Implement data validation before saving
- Handle file permissions and error scenarios

**Professional Context:**
Data saving is crucial for:
- **Workflow persistence** - Save intermediate results for later analysis
- **Data sharing** - Export data for colleagues and stakeholders
- **Integration** - Prepare data for other software (QGIS, Excel, databases)
- **Backup and archival** - Preserve processed datasets for future reference

## Part 1: Understanding Data Export

### 1.1 Why Save Processed Data?

**Data processing workflows** often involve multiple steps:
1. Load raw data
2. Clean and filter data  
3. Calculate statistics and derive new variables
4. Join with additional datasets
5. **Save processed results** ← This function!

**Benefits of saving processed data:**
- **Time saving**: Don't re-process large datasets every time
- **Sharing**: Send clean data to colleagues or clients
- **Integration**: Import into GIS software, spreadsheets, or databases
- **Documentation**: Keep records of analysis results

```python
import pandas as pd
import numpy as np
import os
from pathlib import Path
from datetime import datetime

# Create sample processed data
processed_data = pd.DataFrame({
    'station_id': ['STN_001', 'STN_002', 'STN_003', 'STN_004', 'STN_005'],
    'station_name': ['Downtown', 'Coastal', 'Mountain', 'Airport', 'University'],
    'avg_temperature': [22.3, 19.1, 16.8, 24.7, 21.2],
    'avg_humidity': [64.2, 76.8, 81.3, 57.9, 68.5],
    'reading_count': [245, 198, 267, 189, 223],
    'latitude': [40.123, 40.789, 41.234, 40.678, 40.876],
    'longitude': [-74.456, -73.987, -74.567, -74.123, -73.876]
})

print("Sample Processed Data:")
print(processed_data)
print(f"\nData shape: {processed_data.shape}")
print(f"Columns: {list(processed_data.columns)}")
```

## Part 2: Basic Data Export Operations

### 2.1 Saving to CSV Files

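CSV is a plain-text format that spreadsheets, GIS tools, and databases can all read, which makes it a common choice for sharing tabular data. Passing `index=False` keeps the pandas row index out of the file: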

```python
# Create output directory
output_dir = Path('output')
output_dir.mkdir(exist_ok=True)
print(f"Created output directory: {output_dir}")

# Basic CSV export
csv_path = output_dir / 'station_summary.csv'
processed_data.to_csv(csv_path, index=False)

print(f"\nSaved data to: {csv_path}")
print(f"File exists: {csv_path.exists()}")
print(f"File size: {csv_path.stat().st_size} bytes")

# Verify by reading back
loaded_data = pd.read_csv(csv_path)
print(f"\nVerification - loaded {len(loaded_data)} rows")
print("First few rows:")
print(loaded_data.head(3))
```
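
The read-back check above only confirms the row count. A slightly stricter sketch, assuming the data holds only numbers and plain strings, compares the reloaded frame with the original using `pd.testing.assert_frame_equal`. The `roundtrip_matches` helper and the temporary directory are illustrative names, not part of the assignment.

```python
import tempfile
from pathlib import Path

import pandas as pd


def roundtrip_matches(df, csv_path):
    """Write df to csv_path, read it back, and report whether the contents match.

    Hypothetical helper for illustration only.
    """
    df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)
    try:
        # Floats are compared with a small tolerance by default
        pd.testing.assert_frame_equal(df, reloaded)
        return True
    except AssertionError:
        return False


sample = pd.DataFrame({
    'station_id': ['STN_001', 'STN_002'],
    'avg_temperature': [22.3, 19.1],
    'reading_count': [245, 198],
})

with tempfile.TemporaryDirectory() as tmp:
    roundtrip_ok = roundtrip_matches(sample, Path(tmp) / 'roundtrip.csv')

print(f"Round trip preserved the data: {roundtrip_ok}")
```

A check like this can fail for legitimate reasons. For example, an ID column holding strings such as `'001'` is read back as integers unless `dtype` is passed to `pd.read_csv`.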

### 2.2 Professional File Organization

A predictable directory layout and dated filenames make outputs easy to find and reduce the risk of overwriting earlier results:

```python
# Create organized directory structure
base_dir = Path('environmental_analysis')
summaries_dir = base_dir / 'summaries'
readings_dir = base_dir / 'processed_readings'

# Create directories
for directory in [base_dir, summaries_dir, readings_dir]:
    directory.mkdir(parents=True, exist_ok=True)
    print(f"Created: {directory}")

# Professional file naming with a date stamp
# (use '%Y%m%d_%H%M%S' instead when several runs per day must not overwrite each other)
date_only = datetime.now().strftime('%Y-%m-%d')

# Save with descriptive filename
filename = f"station_summary_{date_only}.csv"
file_path = summaries_dir / filename

# float_format applies to every float column, including latitude/longitude,
# so keep three decimals to avoid truncating the coordinates
processed_data.to_csv(file_path, index=False, float_format='%.3f')
print(f"\nSaved to: {file_path}")
print("Filename pattern: [description]_[date].csv")

# List directory contents
print(f"\nContents of {summaries_dir}:")
for item in summaries_dir.iterdir():
    print(f"  {item.name} ({item.stat().st_size} bytes)")
```
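
A date-only filename is overwritten when the script runs more than once on the same day. One possible way around this, sketched below, is a small helper that builds names with second-level timestamps. The `build_output_path` function and its parameters are hypothetical and only illustrate the naming pattern.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(directory, description, when=None, suffix='.csv'):
    """Return directory / '<description>_<YYYYMMDD_HHMMSS><suffix>'.

    Hypothetical helper; `when` defaults to the current time.
    """
    when = when or datetime.now()
    timestamp = when.strftime('%Y%m%d_%H%M%S')
    return Path(directory) / f"{description}_{timestamp}{suffix}"


# A fixed datetime is passed here so the result is reproducible
example_path = build_output_path(
    'environmental_analysis/summaries',
    'station_summary',
    when=datetime(2024, 3, 15, 9, 30, 0),
)
print(example_path.name)  # station_summary_20240315_093000.csv
```

Because the timestamp sorts lexicographically, listing the directory by name also lists the files in chronological order.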

### 2.3 Data Validation Before Saving

Validate data before saving so that empty or mostly missing datasets are caught before they reach a file:

```python
def validate_data_before_save(df, data_name="DataFrame"):
    """Validate data before saving."""

    print(f"=== VALIDATING {data_name.upper()} ===")

    issues = []

    # Check basic structure
    if df is None:
        issues.append("DataFrame is None")
        return False, issues

    if len(df) == 0:
        issues.append("DataFrame is empty")
        return False, issues

    if len(df.columns) == 0:
        issues.append("DataFrame has no columns")
        return False, issues

    print(f"✓ Shape: {df.shape[0]} rows × {df.shape[1]} columns")

    # Check for missing data
    missing_count = df.isnull().sum().sum()
    if missing_count > 0:
        missing_percent = (missing_count / (df.shape[0] * df.shape[1])) * 100
        print(f"⚠️  Missing data: {missing_count} values ({missing_percent:.1f}%)")
        if missing_percent > 50:
            issues.append(f"High missing data: {missing_percent:.1f}%")
    else:
        print("✓ No missing values")

    # Check for duplicates
    duplicate_count = df.duplicated().sum()
    if duplicate_count > 0:
        print(f"⚠️  Duplicate rows: {duplicate_count}")
    else:
        print("✓ No duplicate rows")

    is_valid = len(issues) == 0
    status = "VALID" if is_valid else "INVALID"
    print(f"\nValidation result: {status}")

    return is_valid, issues

# Test validation
is_valid, issues = validate_data_before_save(processed_data, "Station Summary")

if not is_valid:
    print("\nISSUES FOUND:")
    for issue in issues:
        print(f"  - {issue}")
else:
    print("\n✅ Data is ready for saving!")
```
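
The checks above are generic. For station data that will be loaded into GIS software, a domain-specific check can also be useful. The sketch below, which is an optional extra rather than part of the assignment, flags rows whose `latitude` or `longitude` is missing or outside the valid geographic range. The `find_invalid_coordinates` helper is a hypothetical name.

```python
import pandas as pd


def find_invalid_coordinates(df, lat_col='latitude', lon_col='longitude'):
    """Return the rows whose coordinates are missing or out of range.

    Hypothetical helper for illustration only.
    """
    # between() is False for NaN, so missing coordinates are flagged too
    lat_ok = df[lat_col].between(-90, 90)
    lon_ok = df[lon_col].between(-180, 180)
    return df[~(lat_ok & lon_ok)]


stations = pd.DataFrame({
    'station_id': ['STN_001', 'STN_002', 'STN_003'],
    'latitude': [40.123, 95.0, 41.234],      # 95.0 is outside [-90, 90]
    'longitude': [-74.456, -73.987, None],   # missing longitude
})

bad_rows = find_invalid_coordinates(stations)
print(f"Rows with invalid coordinates: {list(bad_rows['station_id'])}")
```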

## Part 3: Building the Complete Function

### 3.1 Complete Function Implementation

Now let's build the complete function:

```python
def save_processed_data_example(df, file_path):
    """Example implementation of save_processed_data function."""

    print("=" * 50)
    print("SAVING PROCESSED DATA")
    print("=" * 50)

    # Input validation
    if df is None or len(df) == 0:
        print("Error: DataFrame is empty or None")
        return False

    if not file_path or str(file_path).strip() == "":
        print("Error: Invalid file path")
        return False

    # Convert to Path object for better handling
    file_path = Path(file_path)

    print(f"Data to save: {df.shape[0]} rows × {df.shape[1]} columns")
    print(f"Target file: {file_path}")

    try:
        # Create directory if it doesn't exist; done inside the try block so
        # permission errors return False instead of raising
        file_path.parent.mkdir(parents=True, exist_ok=True)
        print(f"Directory: {file_path.parent}")

        # Validate data quality
        missing_count = df.isnull().sum().sum()
        duplicate_count = df.duplicated().sum()

        print("Data quality check:")
        print(f"  Missing values: {missing_count}")
        print(f"  Duplicate rows: {duplicate_count}")

        # Save the file
        df.to_csv(file_path, index=False)

        # Verify the save
        if file_path.exists():
            file_size = file_path.stat().st_size

            # Quick verification by reading back
            verification_df = pd.read_csv(file_path)

            print("\n✅ File saved successfully!")
            print(f"File size: {file_size:,} bytes")
            print(f"Verification: {len(verification_df)} rows loaded")

            return True
        else:
            print("\n❌ Error: File was not created")
            return False

    except Exception as e:
        print(f"\n❌ Error saving file: {e}")
        return False

# Test the function
test_path = 'output/test_station_data.csv'
success = save_processed_data_example(processed_data, test_path)
print(f"\nSave operation success: {success}")
```
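
If a save is interrupted partway through, `to_csv` can leave a truncated file at the target path. One common mitigation, shown here as an optional sketch rather than a requirement of the assignment, is to write to a temporary file in the same directory and then move it into place with `os.replace`. The `save_csv_atomically` name is hypothetical.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd


def save_csv_atomically(df, file_path):
    """Write df to a temporary file, then rename it over file_path.

    Hypothetical helper for illustration only.
    """
    file_path = Path(file_path)
    file_path.parent.mkdir(parents=True, exist_ok=True)

    # Create the temporary file next to the target so the rename stays
    # on the same file system
    fd, tmp_name = tempfile.mkstemp(dir=str(file_path.parent), suffix='.tmp')
    os.close(fd)
    try:
        df.to_csv(tmp_name, index=False)
        os.replace(tmp_name, file_path)
    except Exception:
        # Clean up the temporary file, then let the caller see the error
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return file_path


data = pd.DataFrame({
    'station_id': ['STN_001', 'STN_002'],
    'avg_temperature': [22.3, 19.1],
})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'summaries' / 'station_summary.csv'
    save_csv_atomically(data, target)
    saved_rows = len(pd.read_csv(target))
    leftovers = [p.name for p in target.parent.iterdir() if p.suffix == '.tmp']

print(f"Rows saved: {saved_rows}, leftover temp files: {leftovers}")
```

On POSIX systems `mkstemp` creates files readable only by the current user, so adjust the permissions afterwards if the output needs to be shared.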

### 3.2 Error Handling Examples

Test the function with various error conditions:

```python
print("=== ERROR HANDLING TESTS ===")

# Test 1: Empty DataFrame
print("\n1. Testing with empty DataFrame:")
empty_df = pd.DataFrame()
result1 = save_processed_data_example(empty_df, 'output/empty_test.csv')
print(f"Result: {result1}")

# Test 2: Invalid file path
print("\n2. Testing with invalid file path:")
result2 = save_processed_data_example(processed_data, '')
print(f"Result: {result2}")

# Test 3: Valid data and path
print("\n3. Testing with valid data and path:")
result3 = save_processed_data_example(processed_data, 'output/valid_test.csv')
print(f"Result: {result3}")

print("\n=== ERROR HANDLING COMPLETE ===")
```
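
The tests above cover bad inputs but not file-system failures. The sketch below shows one way to distinguish permission problems from other file-system errors by catching `PermissionError` before the more general `OSError`. The `save_csv_reporting_errors` helper and its `(success, message)` return value are assumptions made for this illustration. The assignment itself only asks for `True` or `False`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv_reporting_errors(df, file_path):
    """Save df to CSV and return (success, message). Hypothetical helper."""
    file_path = Path(file_path)
    try:
        file_path.parent.mkdir(parents=True, exist_ok=True)
        df.to_csv(file_path, index=False)
    except PermissionError as e:
        return False, f"Permission denied: {e}"
    except OSError as e:
        # Other file-system problems, e.g. a parent "directory" that is a file
        return False, f"File system error: {e}"
    return True, f"Saved {len(df)} rows to {file_path}"


data = pd.DataFrame({'station_id': ['STN_001'], 'avg_temperature': [22.3]})

with tempfile.TemporaryDirectory() as tmp:
    # A regular file placed where a directory is expected
    blocker = Path(tmp) / 'not_a_directory'
    blocker.write_text('this is a regular file')

    ok_result = save_csv_reporting_errors(data, Path(tmp) / 'out' / 'data.csv')
    bad_result = save_csv_reporting_errors(data, blocker / 'data.csv')

print(ok_result)
print(bad_result)
```

`PermissionError` is a subclass of `OSError`, so it has to be caught first or the general handler would take it.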

## Part 4: Your Implementation Task

### 4.1 Implementation Guidelines

Now implement this function in `src/pandas_basics.py`:

```python
def save_processed_data(df, file_path):
    # TODO: Print header
    # TODO: Validate input DataFrame (not None or empty)
    # TODO: Validate file path (not empty or None)
    # TODO: Convert file_path to Path object
    # TODO: Print data summary (shape, target file)
    # TODO: Create parent directory if needed
    # TODO: Validate data quality (check missing values, duplicates)
    # TODO: Save DataFrame to CSV using to_csv()
    # TODO: Verify file was created and get file size
    # TODO: Print success message with file details
    # TODO: Handle exceptions and return True/False
    pass  # Placeholder so the skeleton is valid Python; remove once implemented
```

### 4.2 Testing Your Implementation

```bash
uv run pytest tests/test_pandas_basics.py::test_save_processed_data -v
```

### 4.3 Key Requirements

- Save DataFrame to CSV format without index
- Create output directory if it doesn't exist
- Validate input data and file path
- Return True if successful, False if failed
- Provide informative progress messages

## 🎯 Summary and Next Steps

### What You've Learned
- How to save DataFrames to CSV files using `.to_csv()`
- Professional file organization and naming conventions
- Data validation before saving to prevent errors
- Error handling for file I/O operations
- Directory management with Path objects

### Your Implementation Checklist
- [ ] Validate DataFrame input (not None or empty)
- [ ] Validate file path parameter
- [ ] Create parent directories as needed
- [ ] Check data quality before saving
- [ ] Save to CSV without index
- [ ] Verify file was created successfully
- [ ] Handle exceptions gracefully
- [ ] Return success/failure status

### Assignment Complete!
Once you've implemented and tested this function, you'll have completed all 5 pandas functions:

1. ✅ Load and explore data
2. ✅ Filter environmental data  
3. ✅ Calculate station statistics
4. ✅ Join station data
5. ✅ Save processed data

**Congratulations!** You've mastered the essential pandas skills for GIS data analysis!

---

**Remember**: Saving your work is just as important as doing the analysis - make sure your hard work is preserved! 💾