# Tutorial 3: Advanced Data Cleaning Techniques

Real-world climate data often contains missing values and outliers. This notebook demonstrates the cleaning techniques mentioned in the presentation.

**Objective:** Create a synthetic dataset with issues and apply cleaning methods.

**Techniques covered:**
- Missing Value Imputation (Interpolation, Seasonal)
- Outlier Detection (Z-score)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(42)

# 1. Generate Synthetic Data
# Simulate daily temperature data with a seasonal trend
dates = pd.date_range(start="2020-01-01", end="2022-12-31", freq="D")
n = len(dates)

# Seasonal signal + trend + noise
t = np.arange(n)
seasonal = 10 * np.sin(2 * np.pi * t / 365)
trend = 0.01 * t
noise = np.random.normal(0, 2, n)
temperature = 15 + seasonal + trend + noise

df = pd.DataFrame({'date': dates, 'temp': temperature})
df.set_index('date', inplace=True)

# 2. Introduce Defects
# Add missing values (gaps)
df.loc['2021-06-01':'2021-06-15', 'temp'] = np.nan  # 15-day gap (both endpoints are inclusive)
df.loc['2020-03-10', 'temp'] = np.nan  # Single missing day

# Add Outliers (extreme events or sensor errors)
df.loc['2020-08-15', 'temp'] = 50 # Unrealistic heat spike
df.loc['2021-01-10', 'temp'] = -40 # Unrealistic cold spike

plt.figure(figsize=(12, 6))
plt.plot(df['temp'], label='Raw Data')
plt.title("Synthetic Climate Data with Defects")
plt.legend()
plt.show()

## 1. Handling Missing Values

### Linear Interpolation
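Linear interpolation works well for short gaps, where the signal is likely to change smoothly between the neighboring observations. For longer gaps it draws a straight line that ignores day-to-day variability.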
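
To make "short gaps" concrete, the sketch below interpolates only runs of missing values up to a chosen length and leaves longer runs as NaN. The helper name `interpolate_short_gaps` and the `max_gap` threshold are hypothetical, introduced here for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def interpolate_short_gaps(series: pd.Series, max_gap: int) -> pd.Series:
    """Linearly interpolate NaN runs of at most `max_gap` points (hypothetical helper)."""
    is_nan = series.isna()
    # Give every run of consecutive NaN / non-NaN values its own id
    run_id = is_nan.ne(is_nan.shift()).cumsum()
    # Length of the run that each point belongs to
    run_len = run_id.map(run_id.value_counts())
    # Interpolate everything between valid observations first ...
    filled = series.interpolate(method="linear", limit_area="inside")
    # ... then blank out the runs that were longer than allowed
    too_long = is_nan & (run_len > max_gap)
    return filled.mask(too_long)

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])
result = interpolate_short_gaps(s, max_gap=2)
print(result)
```

The run-length check is used instead of `interpolate(limit=...)` because `limit` fills the first few values of a long gap rather than skipping the gap entirely.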

In [None]:
df_clean = df.copy()

# Linear interpolation
df_clean['temp_interp'] = df_clean['temp'].interpolate(method='linear')

# Visualize the gap filling
subset = df_clean.loc['2021-05-15':'2021-07-01']
plt.figure(figsize=(10, 5))
plt.plot(subset.index, subset['temp'], 'o', label='Original Data')
plt.plot(subset.index, subset['temp_interp'], '--', alpha=0.5, label='Interpolated')
plt.title("Linear Interpolation for Missing Gap")
plt.legend()
plt.show()
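
### Seasonal Imputation

The technique list at the top mentions seasonal imputation, which the cell above does not demonstrate. One possible reading, assumed here, is a day-of-year climatology: each missing value is replaced by the mean of the same calendar day in the other years. The sketch rebuilds a small synthetic series so that it runs on its own; the names `climatology` and `temp_seasonal` are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic daily series similar to the one in this notebook (assumed parameters)
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", "2022-12-31", freq="D")
t = np.arange(len(dates))
temp = pd.Series(
    15 + 10 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 2, len(dates)),
    index=dates,
)
temp.loc["2021-06-01":"2021-06-15"] = np.nan  # 15-day gap

# Mean temperature per day of year, broadcast back to every date (NaNs are skipped)
climatology = temp.groupby(temp.index.dayofyear).transform("mean")

# Fill only the missing values with the climatological mean
temp_seasonal = temp.fillna(climatology)

print(temp_seasonal.loc["2021-06-01":"2021-06-15"])
```

Unlike linear interpolation, this keeps the seasonal shape inside a long gap. It does not account for a long-term trend, so on trending data it may be better to remove the trend first.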

## 2. Outlier Detection (Z-Score)

We flag values that lie far from the mean, here more than 3 standard deviations away (|z| > 3).

In [None]:
# A global mean and standard deviation are a poor fit for seasonal data: the
# seasonal cycle inflates the standard deviation, so only very extreme values are flagged.
# More robust options are to detrend first (work on residuals) or to use a rolling Z-score.

# For simplicity, we use global statistics here.
mean = df_clean['temp_interp'].mean()
std = df_clean['temp_interp'].std()

# Calculate Z-scores
df_clean['z_score'] = (df_clean['temp_interp'] - mean) / std

# Flag outliers (> 3 std dev)
outliers = df_clean[np.abs(df_clean['z_score']) > 3]

print("Detected Outliers:")
print(outliers[['temp_interp', 'z_score']])

# Plot
plt.figure(figsize=(12, 6))
plt.plot(df_clean.index, df_clean['temp_interp'], label='Data')
plt.scatter(outliers.index, outliers['temp_interp'], color='red', label='Outliers')
plt.title("Outlier Detection via Z-Score")
plt.legend()
plt.show()
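
### Rolling Z-Score

The comments in the cell above point out that global statistics suit seasonal data poorly and name a rolling Z-score as an alternative. A minimal sketch of that idea follows, assuming a centered 31-day window (the `window` value is an arbitrary choice for illustration, not taken from the notebook). The series is regenerated so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with two injected spikes (assumed parameters)
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", "2022-12-31", freq="D")
t = np.arange(len(dates))
temp = pd.Series(
    15 + 10 * np.sin(2 * np.pi * t / 365) + 0.01 * t + rng.normal(0, 2, len(dates)),
    index=dates,
)
temp.loc["2020-08-15"] = 50
temp.loc["2021-01-10"] = -40

window = 31  # days; hypothetical choice
rolling = temp.rolling(window, center=True, min_periods=window // 2)

# Compare each value with the mean and spread of its local neighborhood
rolling_z = (temp - rolling.mean()) / rolling.std()
outliers = temp[rolling_z.abs() > 3]

print(outliers)
```

Because the threshold is applied to local statistics, a few ordinary noisy days may also be flagged. A rolling median with a robust spread estimate is a common refinement when the spikes themselves distort the window statistics.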