# Function 1: Load and Explore GIS Data 📊

**Welcome to your first pandas function!**

In this notebook, you'll learn how to build the `load_and_explore_gis_data()` function step by step. This is like opening a spreadsheet file and getting familiar with what's inside.

## 🎯 What This Function Does
- Loads a CSV file into a pandas DataFrame
- Shows you key information about the dataset
- Displays the first few rows so you can see what the data looks like
- Provides summary statistics
- Handles errors gracefully: prints a helpful message and returns `None` if the file is missing or can't be read

## 🔧 Function Signature
```python
def load_and_explore_gis_data(file_path):
    """
    Args:
        file_path (str): Path to CSV file (e.g., 'data/weather_stations.csv')
    
    Returns:
        pandas.DataFrame: The loaded dataset, or None if loading failed
    """
```

## 🚀 Step 1: Import Required Libraries

First, let's import the libraries we need:

In [None]:
import pandas as pd
import os

print(f"✅ Pandas version: {pd.__version__}")
print("📚 Ready to work with data!")

## 📁 Step 2: Understanding File Paths

Before we load data, let's understand where our files are located:

In [None]:
# Let's see what files we have in the data directory
data_dir = '../data'  # Go up one level from notebooks, then into data

if os.path.exists(data_dir):
    print("📂 Files in data directory:")
    for file in os.listdir(data_dir):
        print(f"   {file}")
else:
    print(f"❌ Directory not found: {data_dir}")
    print("Current working directory:", os.getcwd())
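
A relative path like `'../data'` is interpreted relative to the current working directory, which is why the cell above prints `os.getcwd()` when the folder can't be found. Here is a minimal sketch of how the `..` part collapses, assuming a hypothetical `project/notebooks` and `project/data` layout (your actual folder names may differ):

```python
import os
from pathlib import Path

# Hypothetical layout: project/notebooks (where you run code) and project/data
working_dir = Path("project") / "notebooks"

# Joining '../data' onto the working directory keeps the '..' in place...
raw_path = working_dir / ".." / "data"

# ...and normpath collapses it into the folder it actually points to
normalized = Path(os.path.normpath(raw_path))

print(raw_path.parts)    # ('project', 'notebooks', '..', 'data')
print(normalized.parts)  # ('project', 'data')
```

If the directory listing fails, comparing the printed working directory against this kind of mental picture is usually the quickest way to spot the problem.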

## 📊 Step 3: Loading Your First CSV File

Now let's load the weather stations data and see what happens:

In [None]:
# Define the file path
file_path = '../data/weather_stations.csv'

# Load the CSV file into a DataFrame
try:
    df = pd.read_csv(file_path)
    print("✅ File loaded successfully!")
    print(f"📁 Loaded: {file_path}")
except FileNotFoundError:
    print(f"❌ File not found: {file_path}")
    print("Make sure you're running this from the notebooks directory!")
except Exception as e:
    print(f"❌ Error loading file: {e}")

## 🔍 Step 4: Exploring the Dataset Shape and Structure

When you first load data, you want to understand:
- How many rows and columns does it have?
- What are the column names?
- What data types are in each column?

In [None]:
# Show the shape (rows, columns)
print(f"📏 Shape: {df.shape} - {df.shape[0]} rows and {df.shape[1]} columns")
print()

# Show column names
print("📋 Columns:")
for i, col in enumerate(df.columns):
    print(f"   {i+1}. {col}")
print()

# Show data types and memory usage
print("🔧 Data Types and Info:")
df.info()

## 👀 Step 5: Looking at the Actual Data

Numbers and column names are helpful, but you need to see the actual data to understand what you're working with:

In [None]:
# Show the first 5 rows
print("🔍 First 5 rows:")
display(df.head())

print("\n" + "="*50)

# You can also show the last few rows
print("🔍 Last 3 rows:")
display(df.tail(3))

## 📈 Step 6: Summary Statistics

Summary statistics help you understand the range and distribution of numerical data:

In [None]:
# Show summary statistics for numerical columns
print("📊 Summary Statistics:")
display(df.describe())

# For text columns, you can see unique values
print("\n📝 Text Column Analysis:")
for col in df.select_dtypes(include=['object']).columns:
    unique_count = df[col].nunique()
    print(f"   {col}: {unique_count} unique values")
    if unique_count <= 10:  # Show values if not too many
        print(f"      Values: {list(df[col].unique())[:10]}")
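
Counting unique values tells you how many categories a text column has, but not how often each one appears. As a small optional extension, `value_counts()` gives you that breakdown. The sketch below uses a made-up `station_type` column purely for illustration; it is not taken from the course datasets:

```python
import pandas as pd

# Hypothetical sample data; 'station_type' is an illustrative column name
sample = pd.DataFrame(
    {"station_type": ["urban", "rural", "urban", "coastal", "urban"]}
)

# Count how many rows fall into each category, most frequent first
counts = sample["station_type"].value_counts()
print(counts)

# The most common category is the first index label
most_common = counts.index[0]
print(f"Most common: {most_common} ({counts.iloc[0]} rows)")
```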

## 🔧 Step 7: Checking for Data Quality Issues

Real-world data often has problems. Let's check for common issues:

In [None]:
# Check for missing values
print("🔍 Missing Values Check:")
missing_data = df.isnull().sum()
if missing_data.sum() > 0:
    print(missing_data[missing_data > 0])
else:
    print("✅ No missing values found!")

print("\n🔍 Duplicate Rows Check:")
duplicate_count = df.duplicated().sum()
if duplicate_count > 0:
    print(f"⚠️  Found {duplicate_count} duplicate rows")
else:
    print("✅ No duplicate rows found!")
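
Two details about these checks are easy to miss: `isnull().sum()` gives raw counts rather than percentages, and `duplicated()` marks only the repeats, not the first occurrence. The sketch below shows both on a tiny made-up DataFrame (the column names are assumptions, not the real dataset's):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names are illustrative only
sample = pd.DataFrame({
    "station_id": ["A1", "A2", "A2", "A3"],
    "elevation_m": [120.0, 250.0, 250.0, np.nan],
})

# Share of missing values per column, as a percentage
missing_pct = sample.isnull().mean() * 100
print(missing_pct)

# Default behavior: only the second 'A2' row counts as a duplicate
repeat_count = sample.duplicated().sum()
print(f"Repeated rows: {repeat_count}")

# keep=False marks every member of a duplicate group so you can inspect them
all_duplicates = sample[sample.duplicated(keep=False)]
print(all_duplicates)
```

Looking at the full duplicate group is often more useful than the count alone, because it lets you decide which copy (if any) to keep.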

## 🧪 Step 8: Test with the Temperature Readings Data

Let's try our exploration technique with the other dataset:

In [None]:
# Load the temperature readings dataset
temp_file = '../data/temperature_readings.csv'
temp_df = pd.read_csv(temp_file)

print("🌡️  Temperature Readings Dataset")
print(f"Shape: {temp_df.shape}")
print(f"Columns: {list(temp_df.columns)}")
print("\nFirst 3 rows:")
display(temp_df.head(3))

print("\nSummary statistics:")
display(temp_df.describe())

## 🏗️ Step 9: Building the Complete Function

Now let's put everything together into a reusable function. This is what you'll implement in `src/pandas_basics.py`:

In [None]:
def load_and_explore_gis_data(file_path):
    """
    Load a CSV file and display comprehensive information about the dataset.
    
    This function demonstrates the first step in any data analysis project:
    understanding your data through exploration.
    
    Args:
        file_path (str): Path to the CSV file to load
        
    Returns:
        pandas.DataFrame: The loaded dataset, or None if loading failed
    """
    
    print("=" * 50)
    print("LOADING AND EXPLORING GIS DATA")
    print("=" * 50)
    
    # Step 1: Check if file exists
    if not os.path.exists(file_path):
        print(f"❌ ERROR: File not found: {file_path}")
        print("Please check:")
        print("- Is the file path correct?")
        print("- Are you in the right directory?")
        print("- Does the file exist?")
        return None
    
    print(f"📁 Loading data from: {file_path}")
    
    # Step 2: Load the CSV file
    try:
        df = pd.read_csv(file_path)
        print("✅ File loaded successfully!")
    except Exception as e:
        print(f"❌ ERROR loading file: {e}")
        return None
    
    # Step 3: Show basic dataset information
    print("\n📊 DATASET OVERVIEW")
    print(f"Shape: {df.shape} - {df.shape[0]} rows and {df.shape[1]} columns")
    print(f"Columns: {list(df.columns)}")
    
    # Step 4: Show data types
    print("\n🔧 DATA TYPES:")
    for col in df.columns:
        print(f"   {col}: {df[col].dtype}")
    
    # Step 5: Show first few rows
    print("\n👀 FIRST 5 ROWS:")
    print(df.head())
    
    # Step 6: Show summary statistics
    print("\n📈 SUMMARY STATISTICS:")
    print(df.describe())
    
    # Step 7: Check for data quality issues
    print("\n🔍 DATA QUALITY CHECK:")
    missing = df.isnull().sum()
    if missing.sum() > 0:
        print("Missing values found:")
        print(missing[missing > 0])
    else:
        print("✅ No missing values")
        
    duplicates = df.duplicated().sum()
    if duplicates > 0:
        print(f"⚠️  Found {duplicates} duplicate rows")
    else:
        print("✅ No duplicate rows")
    
    print("\n🎉 Data exploration complete! Dataset is ready for analysis.")
    
    return df

## ✨ Step 10: Test Your Function

Let's test our complete function with both datasets:

In [None]:
# Test with weather stations
print("🧪 TESTING WITH WEATHER STATIONS DATA\n")
stations_df = load_and_explore_gis_data('../data/weather_stations.csv')

In [None]:
# Test with temperature readings
print("\n" + "="*80 + "\n")
print("🧪 TESTING WITH TEMPERATURE READINGS DATA\n")
readings_df = load_and_explore_gis_data('../data/temperature_readings.csv')

In [None]:
# Test error handling with non-existent file
print("\n" + "="*80 + "\n")
print("🧪 TESTING ERROR HANDLING\n")
result = load_and_explore_gis_data('../data/nonexistent_file.csv')
print(f"Result: {result}")

## 🎯 Your Assignment Task

Now that you understand how this function works:

1. **Go to `src/pandas_basics.py`**
2. **Find the `load_and_explore_gis_data()` function**
3. **Replace the TODO comments with your implementation**
4. **Test your function with pytest**:

```bash
# Test just this function
uv run pytest tests/test_pandas_basics.py::test_load_and_explore_gis_data -v

# Test all functions
uv run pytest tests/ -v
```
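
To get a feel for what a test of this function might check, here is a rough sketch. It is not the course's actual test file: `load_csv_or_none()` is a hypothetical stand-in for your function (without the printing), and the real tests in `tests/test_pandas_basics.py` may check different things.

```python
import os
import tempfile

import pandas as pd


def load_csv_or_none(file_path):
    """Hypothetical stand-in for load_and_explore_gis_data(), minus the printing."""
    if not os.path.exists(file_path):
        return None
    try:
        return pd.read_csv(file_path)
    except Exception:
        return None


def test_loads_existing_file():
    # Write a tiny CSV to a temporary folder, then load it back
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stations.csv")
        with open(path, "w") as f:
            f.write("station_id,elevation_m\nA1,120\nA2,250\n")
        df = load_csv_or_none(path)
    assert df is not None
    assert df.shape == (2, 2)


def test_missing_file_returns_none():
    # A path that does not exist should give None, not raise an exception
    assert load_csv_or_none("no_such_file.csv") is None
```

The two cases mirror the behavior described in this notebook: a DataFrame comes back when the file loads, and `None` comes back when it doesn't.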

## 🔑 Key Learning Points

- **`pd.read_csv()`** loads CSV files into DataFrames
- **`.shape`** tells you rows and columns: `(rows, columns)`
- **`.head()`** shows the first few rows
- **`.info()`** shows data types, non-null counts, and memory usage
- **`.describe()`** shows summary statistics (by default, for numerical columns only)
- **Always check for missing values and data quality issues**
- **Error handling makes your code robust and user-friendly**
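
If you'd like to try these methods without touching any files, the sketch below runs them on a small in-memory CSV. The CSV content is invented for illustration; the real `weather_stations.csv` columns may be different.

```python
import io

import pandas as pd

# Hypothetical CSV content; one elevation value is deliberately left blank
csv_text = """station_id,station_name,elevation_m
A1,Hilltop,120.5
A2,Valley,
A3,Harbor,3.0
"""

# read_csv accepts file-like objects as well as file paths
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)           # (3, 3)
print(df.head(2))         # first two rows
df.info()                 # dtypes, non-null counts, memory usage
print(df.describe())      # statistics for the numeric column only
print(df.isnull().sum())  # missing values per column
```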

## 🚀 Next Steps

Once this function works and passes the tests, move on to:
- **Function 2**: `filter_environmental_data()` - Learn to filter data based on conditions
- **Function 3**: `calculate_station_statistics()` - Learn to group and calculate statistics
- **Function 4**: `join_station_data()` - Learn to combine datasets
- **Function 5**: `save_processed_data()` - Learn to save your results

**Good luck! 🍀**