# FRE 521D: Data Analytics in Climate, Food and Environment
## Lecture 5: Python Wrangling I - Tidy Data, Types, and Validation

**Date:** Wednesday, January 21, 2026  
**Instructor:** Asif Ahmed Neloy  
**Program:** UBC Master of Food and Resource Economics

---

### Today's Agenda

1. What is Tidy Data? Why Does It Matter?
2. Column Typing and Type Coercion
3. Reshaping Data: Wide to Long
4. Missing Data Strategies
5. Validation Checks: Ranges, Nulls, and Keys

---

## Setting Up

In [1]:
# Standard imports
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', 100)

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Notebook run date: {datetime.now().strftime('%Y-%m-%d')}")

Pandas version: 2.3.3
NumPy version: 2.2.6
Notebook run date: 2026-01-06


---
## 1. What is Tidy Data?

### The Concept

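**Tidy data** is a standard way of organizing tabular data so that every dataset has the same structure, which makes analysis easier. The concept comes from Hadley Wickham's 2014 paper "Tidy Data" in the *Journal of Statistical Software*.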

### The Three Rules of Tidy Data

```
┌────────────────────────────────────────────────────────────────┐
│                     TIDY DATA PRINCIPLES                       │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  1. Each VARIABLE has its own COLUMN                          │
│                                                                │
│  2. Each OBSERVATION has its own ROW                          │
│                                                                │
│  3. Each VALUE has its own CELL                               │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```

### Why Tidy Data Matters

- **Consistency**: One format works with all tools
- **Simplicity**: Standard verbs (filter, group, summarize) work predictably
- **Vectorization**: Operations apply to entire columns efficiently

### Example: Messy vs Tidy

Let's look at the same data in messy and tidy formats.

In [2]:
# MESSY DATA: Years as columns (wide format)
# This is how data often comes from spreadsheets

messy_data = pd.DataFrame({
    'country': ['Canada', 'USA', 'Mexico'],
    'indicator': ['wheat_production', 'wheat_production', 'wheat_production'],
    '2020': [35183, 49691, 3115],
    '2021': [22296, 44790, 3024],
    '2022': [34335, 44900, 3195]
})

print("MESSY DATA (years as columns):")
print(messy_data)
print(f"\nShape: {messy_data.shape}")

MESSY DATA (years as columns):
  country         indicator   2020   2021   2022
0  Canada  wheat_production  35183  22296  34335
1     USA  wheat_production  49691  44790  44900
2  Mexico  wheat_production   3115   3024   3195

Shape: (3, 5)


In [3]:
# TIDY DATA: Each row is one observation (country-year)

tidy_data = pd.DataFrame({
    'country': ['Canada', 'Canada', 'Canada', 'USA', 'USA', 'USA', 'Mexico', 'Mexico', 'Mexico'],
    'year': [2020, 2021, 2022, 2020, 2021, 2022, 2020, 2021, 2022],
    'wheat_production': [35183, 22296, 34335, 49691, 44790, 44900, 3115, 3024, 3195]
})

print("TIDY DATA (each row is one country-year):")
print(tidy_data)
print(f"\nShape: {tidy_data.shape}")

TIDY DATA (each row is one country-year):
  country  year  wheat_production
0  Canada  2020             35183
1  Canada  2021             22296
2  Canada  2022             34335
3     USA  2020             49691
4     USA  2021             44790
5     USA  2022             44900
6  Mexico  2020              3115
7  Mexico  2021              3024
8  Mexico  2022              3195

Shape: (9, 3)


In [4]:
# Why tidy is better: Easy to filter, group, and analyze

# Question: What was total wheat production in 2021?
print("With TIDY data - Total production in 2021:")
result = tidy_data[tidy_data['year'] == 2021]['wheat_production'].sum()
print(f"  {result:,} thousand tonnes")

# Question: Average production by country?
print("\nWith TIDY data - Average by country:")
print(tidy_data.groupby('country')['wheat_production'].mean())

With TIDY data - Total production in 2021:
  70,110 thousand tonnes

With TIDY data - Average by country:
country
Canada    30604.666667
Mexico     3111.333333
USA       46460.333333
Name: wheat_production, dtype: float64
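
For contrast, here is a rough sketch of the same two questions asked of the wide layout. It reuses the wheat values from the messy table; the names `messy`, `year_cols`, `total_2021`, and `avg_by_country` are chosen here for illustration only.

```python
import pandas as pd

# Same values as the wide (messy) wheat table
messy = pd.DataFrame({
    'country': ['Canada', 'USA', 'Mexico'],
    '2020': [35183, 49691, 3115],
    '2021': [22296, 44790, 3024],
    '2022': [34335, 44900, 3195],
})

# Total for 2021: works, but the year is baked into a column name
total_2021 = messy['2021'].sum()
print(f"{total_2021:,}")

# Average by country: the year columns must be listed by hand,
# and the list has to be edited whenever a new year arrives
year_cols = ['2020', '2021', '2022']
avg_by_country = messy.set_index('country')[year_cols].mean(axis=1)
print(avg_by_country.round(2))
```

The answers match the tidy version, but every query has to know which columns are years, which is the kind of bookkeeping the tidy layout removes.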


In [5]:
# Converting messy to tidy using pd.melt()

tidy_from_messy = pd.melt(
    messy_data,
    id_vars=['country', 'indicator'],    # Columns to keep
    value_vars=['2020', '2021', '2022'], # Columns to unpivot
    var_name='year',                      # Name for the new column
    value_name='production'               # Name for the values
)

# Convert year to integer
tidy_from_messy['year'] = tidy_from_messy['year'].astype(int)

print("Converted to tidy format:")
print(tidy_from_messy)

Converted to tidy format:
  country         indicator  year  production
0  Canada  wheat_production  2020       35183
1     USA  wheat_production  2020       49691
2  Mexico  wheat_production  2020        3115
3  Canada  wheat_production  2021       22296
4     USA  wheat_production  2021       44790
5  Mexico  wheat_production  2021        3024
6  Canada  wheat_production  2022       34335
7     USA  wheat_production  2022       44900
8  Mexico  wheat_production  2022        3195


### Common Messy Data Patterns

| Pattern | Problem | Solution |
|---------|---------|----------|
| Column headers are values | Years as columns | `pd.melt()` |
| Multiple variables in one column | "temp_min_max" | `str.split(expand=True)` + `pd.concat()` |
| Variables in rows and columns | Crosstab format | `pd.melt()` + `pd.pivot()` |
| Multiple types in one column | "100 kg" mixed with "50 lbs" | Extract + convert |
| One observation across multiple rows | Header row + data row | Combine rows |
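
As a minimal sketch of the "multiple variables in one column" pattern, the hypothetical `station_temps` frame below packs a minimum and a maximum temperature into one string column. The column names, the underscore separator, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: min and max temperature packed into one string column
station_temps = pd.DataFrame({
    'station': ['A', 'B', 'C'],
    'temp_min_max': ['-2.5_8.1', '0.4_11.3', '3.0_14.7'],
})

# expand=True returns one column per piece instead of a list per row
parts = station_temps['temp_min_max'].str.split('_', expand=True)
parts.columns = ['temp_min', 'temp_max']

# Attach the new numeric columns and drop the packed original
tidy_temps = pd.concat(
    [station_temps.drop(columns='temp_min_max'), parts.astype(float)],
    axis=1,
)
print(tidy_temps)
```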

---
## 2. Column Typing and Type Coercion

### Why Types Matter

Data types determine:
- What operations are valid (can't average strings)
- Memory usage (int8 vs int64)
- Behavior ("1" + "2" = "12" vs 1 + 2 = 3)
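
The points above can be checked directly. This is a small sketch with made-up values; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# The same digits behave differently depending on dtype
as_text = pd.Series(['1', '2'])
as_numbers = pd.Series([1, 2])
print(as_text.sum())      # string concatenation
print(as_numbers.sum())   # arithmetic

# Memory: the same 1,000 small integers stored at two widths
values = np.arange(1000) % 100          # every value fits in int8 (-128..127)
wide = pd.Series(values, dtype='int64')
narrow = pd.Series(values, dtype='int8')
print(wide.nbytes, narrow.nbytes)
```

Downcasting is only safe when every value fits in the smaller type, so it is worth checking the column's min and max first.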

### Pandas Data Types

| Type | Description | Example |
|------|-------------|----------|
| `int64` | Integer | 42, -7, 1000 |
| `float64` | Decimal | 3.14, -0.5 |
| `object` | String/mixed | "Canada", "N/A" |
| `bool` | Boolean | True, False |
| `datetime64` | Date/time | 2024-01-15 |
| `category` | Categorical | Low/Medium/High |

In [6]:
# Let's create a realistic messy dataset: FAO Food Price Index
# This simulates common issues you'll encounter

food_prices_messy = pd.DataFrame({
    'date': ['2023-01', '2023-02', '2023-03', '2023-04', '2023-05', '2023-06',
             '2023-07', '2023-08', '2023-09', '2023-10', '2023-11', '2023-12'],
    'cereals_index': ['147.2', '149.5', '148.3', '146.1', '139.2', '129.8',
                      '127.5', '130.2', '132.1', '131.8', '127.4', '125.5'],
    'meat_index': ['113.8', '114,2', '114.1', '114.0', '113.9', '113.4',  # Note: European decimal
                   '..', '115.2', '116.8', 'N/A', '118.3', '118.9'],       # Note: Missing values
    'dairy_index': ['126.5', '130.2', '128.7', '125.3', '120.1', '115.8',
                    '111.2', '108.5', '110.3', '115.7', '118.2', '116.8'],
    'oils_index': ['156.3*', '154.2', '149.8**', '142.5', '127.8', '118.5',  # Note: Footnotes
                   '119.8', '129.3', '130.7', '126.5', '128.7', '126.3'],
    'sugar_index': [115, 116.2, 118.5, 119.8, 122.3, 125.6,  # Note: Mixed int/float
                    128.9, 131.2, 129.5, 127.8, 130.1, 131.5]
})

print("Messy FAO Food Price Index Data:")
print(food_prices_messy)
print(f"\nData types:")
print(food_prices_messy.dtypes)

Messy FAO Food Price Index Data:
       date cereals_index meat_index dairy_index oils_index  sugar_index
0   2023-01         147.2      113.8       126.5     156.3*        115.0
1   2023-02         149.5      114,2       130.2      154.2        116.2
2   2023-03         148.3      114.1       128.7    149.8**        118.5
3   2023-04         146.1      114.0       125.3      142.5        119.8
4   2023-05         139.2      113.9       120.1      127.8        122.3
5   2023-06         129.8      113.4       115.8      118.5        125.6
6   2023-07         127.5         ..       111.2      119.8        128.9
7   2023-08         130.2      115.2       108.5      129.3        131.2
8   2023-09         132.1      116.8       110.3      130.7        129.5
9   2023-10         131.8        N/A       115.7      126.5        127.8
10  2023-11         127.4      118.3       118.2      128.7        130.1
11  2023-12         125.5      118.9       116.8      126.3        131.5

Data types:
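date              object
cereals_index     object
meat_index        object
dairy_index       object
oils_index        object
sugar_index      float64
dtype: object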

### Problem Analysis

Looking at the data above:
1. **date** - String, needs to be datetime
2. **cereals_index** - String numbers, needs float
3. **meat_index** - Has European decimal (114,2), missing codes (.., N/A)
4. **dairy_index** - String numbers
5. **oils_index** - Has footnote markers (*, **)
6. **sugar_index** - Already numeric (mixed int/float)
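
Before writing any cleaning code, it can help to list exactly which entries refuse to parse. The helper below is a sketch, not part of the lecture's pipeline; `unparseable_values` and `sample` are names introduced here.

```python
import pandas as pd

def unparseable_values(series):
    """Return the distinct entries that pd.to_numeric cannot convert."""
    as_text = series.astype(str).str.strip()
    parsed = pd.to_numeric(as_text, errors='coerce')
    return sorted(as_text[parsed.isna()].unique())

# A small sample mirroring the problems listed above
sample = pd.Series(['113.8', '114,2', '..', 'N/A', '156.3*', '118.9'])
print(unparseable_values(sample))
```

Running this per column gives a short list of patterns to handle, rather than guessing at them.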

In [7]:
# Step-by-step type cleaning

df = food_prices_messy.copy()

# 1. Convert date to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m')
print("Step 1 - Date converted:")
print(f"  Type: {df['date'].dtype}")
print(f"  Sample: {df['date'].iloc[0]}")

Step 1 - Date converted:
  Type: datetime64[ns]
  Sample: 2023-01-01 00:00:00


In [8]:
# 2. Clean numeric columns - Create a reusable function

def clean_numeric_column(series):
    """
    Clean a column that should be numeric but has:
    - European decimals (comma instead of period)
    - Missing value codes (.., N/A, NA, etc.)
    - Footnote markers (*, **, E, F)
    - Whitespace
    
    Returns a float series with proper NaN for missing.
    """
    # Convert to string first
    s = series.astype(str)
    
    # Strip whitespace
    s = s.str.strip()
    
    # Replace common missing value codes with empty string
    # (Series.replace matches whole cell values, not substrings)
    missing_codes = ['..', 'N/A', 'NA', 'n/a', 'na', '-', '', 'None', 'null', 'NULL']
    s = s.replace(missing_codes, '')
    
    # Remove footnote markers (*, **, E, F at end)
    s = s.str.replace(r'[*EFef]+$', '', regex=True)
    
    # Handle European decimals: if comma exists and no period, replace comma with period
    def fix_european_decimal(val):
        if ',' in val and '.' not in val:
            return val.replace(',', '.')
        return val
    
    s = s.apply(fix_european_decimal)
    
    # Convert to numeric (empty strings become NaN)
    return pd.to_numeric(s, errors='coerce')


# Apply to all index columns
index_columns = ['cereals_index', 'meat_index', 'dairy_index', 'oils_index']

for col in index_columns:
    df[col] = clean_numeric_column(df[col])

print("Step 2 - Numeric columns cleaned:")
print(df.dtypes)

Step 2 - Numeric columns cleaned:
date             datetime64[ns]
cereals_index           float64
meat_index              float64
dairy_index             float64
oils_index              float64
sugar_index             float64
dtype: object


In [9]:
# View the cleaned data
print("Cleaned FAO Food Price Index:")
print(df)
print(f"\nMissing values per column:")
print(df.isnull().sum())

Cleaned FAO Food Price Index:
         date  cereals_index  meat_index  dairy_index  oils_index  sugar_index
0  2023-01-01          147.2       113.8        126.5       156.3        115.0
1  2023-02-01          149.5       114.2        130.2       154.2        116.2
2  2023-03-01          148.3       114.1        128.7       149.8        118.5
3  2023-04-01          146.1       114.0        125.3       142.5        119.8
4  2023-05-01          139.2       113.9        120.1       127.8        122.3
5  2023-06-01          129.8       113.4        115.8       118.5        125.6
6  2023-07-01          127.5         NaN        111.2       119.8        128.9
7  2023-08-01          130.2       115.2        108.5       129.3        131.2
8  2023-09-01          132.1       116.8        110.3       130.7        129.5
9  2023-10-01          131.8         NaN        115.7       126.5        127.8
10 2023-11-01          127.4       118.3        118.2       128.7        130.1
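11 2023-12-01          125.5       118.9        116.8       126.3        131.5

Missing values per column:
date             0
cereals_index    0
meat_index       2
dairy_index      0
oils_index       0
sugar_index      0
dtype: int64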

### Type Coercion Methods Summary

| Method | Use Case | Example |
|--------|----------|----------|
| `astype(int)` | Clean integers | `df['year'].astype(int)` |
| `astype(float)` | Clean floats | `df['price'].astype(float)` |
| `pd.to_numeric()` | With errors handling | `pd.to_numeric(s, errors='coerce')` |
| `pd.to_datetime()` | Date parsing | `pd.to_datetime(s, format='%Y-%m-%d')` |
| `astype('category')` | Categorical data | `df['region'].astype('category')` |
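
The practical difference between `astype()` and `pd.to_numeric()` shows up as soon as one entry is bad. A minimal sketch with an invented three-value series:

```python
import pandas as pd

raw = pd.Series(['2020', '2021', 'N/A'])

# astype is strict: one bad entry aborts the whole conversion
try:
    raw.astype(int)
    strict_ok = True
except ValueError as err:
    strict_ok = False
    print(f"astype(int) failed: {err}")

# to_numeric with errors='coerce' converts what it can and marks the rest NaN
lenient = pd.to_numeric(raw, errors='coerce')
print(lenient)

# NaN forces float64; the nullable 'Int64' dtype keeps whole numbers plus <NA>
as_nullable_int = lenient.astype('Int64')
print(as_nullable_int)
```

Coercion is convenient but silent, so it pairs well with a count of how many values became missing.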

In [10]:
# Special case: Categorical data
# Good for columns with limited distinct values

# Add a categorical 'assessment' column by binning cereals_index
df['assessment'] = pd.cut(
    df['cereals_index'],
    bins=[0, 130, 145, 200],
    labels=['Low', 'Medium', 'High']
)

print("Categorical column:")
print(df[['date', 'cereals_index', 'assessment']].head())
print(f"\nType: {df['assessment'].dtype}")
print(f"Categories: {df['assessment'].cat.categories.tolist()}")

Categorical column:
        date  cereals_index assessment
0 2023-01-01          147.2       High
1 2023-02-01          149.5       High
2 2023-03-01          148.3       High
3 2023-04-01          146.1       High
4 2023-05-01          139.2     Medium

Type: category
Categories: ['Low', 'Medium', 'High']
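
One detail worth knowing: `pd.cut` returns an *ordered* categorical, so comparisons respect Low < Medium < High. The sketch below reuses four cereals index values and the same bins; `cereals` and `levels` are illustrative names.

```python
import pandas as pd

cereals = pd.Series([147.2, 139.2, 129.8, 125.5])
levels = pd.cut(cereals, bins=[0, 130, 145, 200], labels=['Low', 'Medium', 'High'])

print(levels.cat.ordered)            # pd.cut returns an ordered categorical
print(cereals[levels >= 'Medium'])   # comparisons follow Low < Medium < High
print(levels.value_counts())
```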


---
## 3. Reshaping Data: Wide to Long

### When to Reshape

- **Wide → Long (melt)**: When column headers contain data values (years, categories)
- **Long → Wide (pivot)**: When you need columns for comparison/visualization

### Real Example: Global Temperature Anomalies

Temperature data often comes with months as columns - classic wide format.

In [11]:
# Create temperature anomaly data (similar to NOAA format)

temp_wide = pd.DataFrame({
    'country': ['Canada', 'Canada', 'Canada', 'USA', 'USA', 'USA', 
                'Brazil', 'Brazil', 'Brazil'],
    'year': [2021, 2022, 2023, 2021, 2022, 2023, 2021, 2022, 2023],
    'Jan': [0.85, 1.12, 0.95, 1.02, 1.35, 1.18, 0.62, 0.78, 0.95],
    'Feb': [0.92, 0.88, 1.05, 1.15, 1.22, 1.08, 0.58, 0.82, 0.88],
    'Mar': [1.05, 1.18, 1.32, 1.28, 1.45, 1.35, 0.72, 0.95, 1.02],
    'Apr': [0.78, 0.95, 1.15, 0.92, 1.08, 1.22, 0.55, 0.68, 0.85],
    'May': [0.65, 0.82, 0.98, 0.78, 0.95, 1.05, 0.42, 0.55, 0.72],
    'Jun': [0.72, 0.88, 1.02, 0.85, 1.02, 1.15, 0.38, 0.48, 0.62]
})

print("WIDE FORMAT (months as columns):")
print(temp_wide)
print(f"\nShape: {temp_wide.shape}")

WIDE FORMAT (months as columns):
  country  year   Jan   Feb   Mar   Apr   May   Jun
0  Canada  2021  0.85  0.92  1.05  0.78  0.65  0.72
1  Canada  2022  1.12  0.88  1.18  0.95  0.82  0.88
2  Canada  2023  0.95  1.05  1.32  1.15  0.98  1.02
3     USA  2021  1.02  1.15  1.28  0.92  0.78  0.85
4     USA  2022  1.35  1.22  1.45  1.08  0.95  1.02
5     USA  2023  1.18  1.08  1.35  1.22  1.05  1.15
6  Brazil  2021  0.62  0.58  0.72  0.55  0.42  0.38
7  Brazil  2022  0.78  0.82  0.95  0.68  0.55  0.48
8  Brazil  2023  0.95  0.88  1.02  0.85  0.72  0.62

Shape: (9, 8)


In [12]:
# Convert to long format using pd.melt()

temp_long = pd.melt(
    temp_wide,
    id_vars=['country', 'year'],           # Keep these as identifiers
    value_vars=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],  # Columns to unpivot
    var_name='month',                       # New column name for former headers
    value_name='temp_anomaly'               # New column name for values
)

print("LONG FORMAT (tidy):")
print(temp_long.head(12))
print(f"\nShape: {temp_long.shape}")

LONG FORMAT (tidy):
   country  year month  temp_anomaly
0   Canada  2021   Jan          0.85
1   Canada  2022   Jan          1.12
2   Canada  2023   Jan          0.95
3      USA  2021   Jan          1.02
4      USA  2022   Jan          1.35
5      USA  2023   Jan          1.18
6   Brazil  2021   Jan          0.62
7   Brazil  2022   Jan          0.78
8   Brazil  2023   Jan          0.95
9   Canada  2021   Feb          0.92
10  Canada  2022   Feb          0.88
11  Canada  2023   Feb          1.05

Shape: (54, 4)


In [13]:
# Now analysis is easy!

# Average anomaly by country
print("Average temperature anomaly by country:")
print(temp_long.groupby('country')['temp_anomaly'].mean().round(2))

print("\nAverage anomaly by year:")
print(temp_long.groupby('year')['temp_anomaly'].mean().round(2))

print("\nHottest months (highest anomaly):")
print(temp_long.nlargest(5, 'temp_anomaly')[['country', 'year', 'month', 'temp_anomaly']])

Average temperature anomaly by country:
country
Brazil    0.70
Canada    0.96
USA       1.12
Name: temp_anomaly, dtype: float64

Average anomaly by year:
year
2021    0.79
2022    0.95
2023    1.03
Name: temp_anomaly, dtype: float64

Hottest months (highest anomaly):
   country  year month  temp_anomaly
22     USA  2022   Mar          1.45
4      USA  2022   Jan          1.35
23     USA  2023   Mar          1.35
20  Canada  2023   Mar          1.32
21     USA  2021   Mar          1.28


In [14]:
# Going back: Long to Wide using pivot()

temp_back_wide = temp_long.pivot(
    index=['country', 'year'],  # Rows
    columns='month',             # Columns
    values='temp_anomaly'        # Values
).reset_index()

print("Back to WIDE FORMAT:")
print(temp_back_wide)

Back to WIDE FORMAT:
month country  year   Apr   Feb   Jan   Jun   Mar   May
0      Brazil  2021  0.55  0.58  0.62  0.38  0.72  0.42
1      Brazil  2022  0.68  0.82  0.78  0.48  0.95  0.55
2      Brazil  2023  0.85  0.88  0.95  0.62  1.02  0.72
3      Canada  2021  0.78  0.92  0.85  0.72  1.05  0.65
4      Canada  2022  0.95  0.88  1.12  0.88  1.18  0.82
5      Canada  2023  1.15  1.05  0.95  1.02  1.32  0.98
6         USA  2021  0.92  1.15  1.02  0.85  1.28  0.78
7         USA  2022  1.08  1.22  1.35  1.02  1.45  0.95
8         USA  2023  1.22  1.08  1.18  1.15  1.35  1.05
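
Note that the month columns in the output come back sorted alphabetically (Apr, Feb, Jan, ...), not in calendar order. One way to restore the order, sketched here on a small sample in the same shape as `temp_long`, is to reindex the columns after pivoting. The names `sample_long`, `month_order`, and `sample_wide` are introduced for this example.

```python
import pandas as pd

# Small long-format sample in the same shape as temp_long
sample_long = pd.DataFrame({
    'country': ['Canada'] * 3 + ['USA'] * 3,
    'year': [2021] * 6,
    'month': ['Jan', 'Feb', 'Mar'] * 2,
    'temp_anomaly': [0.85, 0.92, 1.05, 1.02, 1.15, 1.28],
})

month_order = ['Jan', 'Feb', 'Mar']

sample_wide = (
    sample_long
    .pivot(index=['country', 'year'], columns='month', values='temp_anomaly')
    .reindex(columns=month_order)   # restore calendar order
    .reset_index()
)
print(sample_wide)
```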


### Reshape Cheat Sheet

```python
# Wide to Long (unpivot)
pd.melt(df, id_vars=['key_cols'], value_vars=['cols_to_unpivot'],
        var_name='new_col_name', value_name='value_col_name')

# Long to Wide (pivot)
df.pivot(index='row_key', columns='col_to_spread', values='values')

# Long to Wide with aggregation
df.pivot_table(index='row_key', columns='col_to_spread', 
               values='values', aggfunc='mean')
```
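
The difference between `pivot()` and `pivot_table()` matters when the long data contains repeated keys. A minimal sketch with hypothetical readings:

```python
import pandas as pd

# Hypothetical readings with a repeated (station, month) pair
readings = pd.DataFrame({
    'station': ['A', 'A', 'A', 'B'],
    'month': ['Jan', 'Jan', 'Feb', 'Jan'],
    'value': [1.0, 3.0, 5.0, 7.0],
})

# pivot() refuses to guess which duplicate to keep
try:
    readings.pivot(index='station', columns='month', values='value')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(f"pivot failed: {err}")

# pivot_table() resolves duplicates with the aggregation you choose
wide_mean = readings.pivot_table(
    index='station', columns='month', values='value', aggfunc='mean'
)
print(wide_mean)
```

A failing `pivot()` is often useful information in itself: it means the assumed key is not actually unique.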

---
## 4. Missing Data Strategies

### Types of Missing Data

Understanding **why** data is missing helps decide how to handle it:

| Type | Description | Example | Strategy |
|------|-------------|---------|----------|
| **MCAR** | Missing Completely At Random | Sensor malfunction | Drop or impute |
| **MAR** | Missing At Random (missingness depends on other observed variables) | Rich countries report more | Impute with care |
| **MNAR** | Missing Not At Random (depends on missing value itself) | High pollution not reported | Cannot ignore, needs modeling |
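
To see why the mechanism matters, the sketch below simulates a synthetic measurement and removes values in two ways. The data are randomly generated, and the 20% loss rate and cutoff of 10 are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
true_values = pd.Series(rng.exponential(5, 10_000))  # synthetic "turbidity"

# MCAR: every reading has the same 20% chance of being lost
mcar = true_values.mask(rng.random(len(true_values)) < 0.2)

# MNAR: readings above 10 are lost because of their own value
mnar = true_values.mask(true_values > 10)

print(f"True mean: {true_values.mean():.2f}")
print(f"MCAR mean: {mcar.mean():.2f}")   # close to the true mean
print(f"MNAR mean: {mnar.mean():.2f}")   # biased low
```

Under MCAR the observed values remain a fair sample; under MNAR no amount of averaging the observed values recovers the truth.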

### Real Example: Water Quality Data

In [15]:
# Create water quality dataset with realistic missing patterns

np.random.seed(42)

n_samples = 50
water_quality = pd.DataFrame({
    'station_id': [f'WQ_{i:03d}' for i in range(1, n_samples + 1)],
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_samples),
    'sample_date': pd.date_range('2023-01-01', periods=n_samples, freq='W'),
    'ph': np.random.normal(7.2, 0.5, n_samples).round(2),
    'dissolved_oxygen': np.random.normal(8.5, 1.5, n_samples).round(2),
    'turbidity': np.random.exponential(5, n_samples).round(2),
    'temperature_c': np.random.normal(15, 5, n_samples).round(1),
    'nitrate_mg_l': np.random.exponential(3, n_samples).round(2),
    'phosphate_mg_l': np.random.exponential(0.5, n_samples).round(3)
})

# Introduce realistic missing patterns
# MCAR: Random sensor failures (pH)
mcar_mask = np.random.random(n_samples) < 0.1
water_quality.loc[mcar_mask, 'ph'] = np.nan

# MAR: Northern stations have more missing dissolved oxygen (harsh conditions)
mar_mask = (water_quality['region'] == 'North') & (np.random.random(n_samples) < 0.3)
water_quality.loc[mar_mask, 'dissolved_oxygen'] = np.nan

# MNAR: High turbidity readings often fail (too murky for sensor)
mnar_mask = water_quality['turbidity'] > 10
water_quality.loc[mnar_mask, 'turbidity'] = np.nan

# Additional random missing
for col in ['nitrate_mg_l', 'phosphate_mg_l']:
    mask = np.random.random(n_samples) < 0.15
    water_quality.loc[mask, col] = np.nan

print("Water Quality Dataset:")
print(water_quality.head(10))

Water Quality Dataset:
  station_id region sample_date    ph  dissolved_oxygen  turbidity  \
0     WQ_001   East  2023-01-01   NaN             13.56       1.39   
1     WQ_002   West  2023-01-08   NaN              7.12       5.58   
2     WQ_003  North  2023-01-15  8.48              7.90       7.17   
3     WQ_004   East  2023-01-22   NaN              8.41       1.36   
4     WQ_005   East  2023-01-29  7.26              6.37       6.51   
5     WQ_006   West  2023-02-05  6.94             10.06       2.29   
6     WQ_007  North  2023-02-12   NaN              9.86       5.00   
7     WQ_008  North  2023-02-19  7.67              8.53       5.02   
8     WQ_009   East  2023-02-26  7.35              7.70       3.84   
9     WQ_010  South  2023-03-05  6.88              6.26       0.47   

   temperature_c  nitrate_mg_l  phosphate_mg_l  
0           15.9          4.98           0.627  
1            8.9           NaN           0.540  
2           20.2          7.33           0.164  
3         

In [16]:
# Analyze missing data patterns

print("Missing Value Summary:")
print("=" * 50)

missing_summary = pd.DataFrame({
    'missing_count': water_quality.isnull().sum(),
    'missing_pct': (water_quality.isnull().sum() / len(water_quality) * 100).round(1),
    'dtype': water_quality.dtypes
})

print(missing_summary[missing_summary['missing_count'] > 0])

Missing Value Summary:
                  missing_count  missing_pct    dtype
ph                            6         12.0  float64
dissolved_oxygen              4          8.0  float64
turbidity                     8         16.0  float64
nitrate_mg_l                  6         12.0  float64
phosphate_mg_l                7         14.0  float64


In [17]:
# Check for patterns in missing data

print("\nMissing dissolved_oxygen by region:")
print(water_quality.groupby('region')['dissolved_oxygen'].apply(
    lambda x: f"{x.isnull().sum()}/{len(x)} ({x.isnull().mean()*100:.0f}%)"
))

# This is consistent with the MAR pattern: missingness depends on region (only North has gaps)


Missing dissolved_oxygen by region:
region
East      0/13 (0%)
North    4/10 (40%)
South     0/11 (0%)
West      0/16 (0%)
Name: dissolved_oxygen, dtype: object


### Handling Missing Data: Options

| Strategy | Method | When to Use |
|----------|--------|-------------|
| **Drop rows** | `dropna()` | Few missing, MCAR |
| **Drop columns** | `drop()` | >50% missing |
| **Fill with constant** | `fillna(value)` | Meaningful default |
| **Fill with mean/median** | `fillna(df.mean(numeric_only=True))` | MCAR, continuous |
| **Fill with mode** | `df[col].fillna(df[col].mode().iloc[0])` | Categorical |
| **Forward/backward fill** | `ffill()`, `bfill()` | Time series |
| **Interpolation** | `interpolate()` | Ordered data |
| **Group-based imputation** | `groupby().transform()` | MAR |
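
The time-series options in the table behave quite differently on the same gap. A short sketch on a hypothetical four-week series:

```python
import numpy as np
import pandas as pd

# Hypothetical weekly series with a two-week gap
weekly = pd.Series(
    [1.0, np.nan, np.nan, 4.0],
    index=pd.date_range('2023-01-01', periods=4, freq='W'),
)

filled = pd.DataFrame({
    'original': weekly,
    'ffill': weekly.ffill(),              # carry last observation forward
    'bfill': weekly.bfill(),              # pull next observation backward
    'interpolate': weekly.interpolate(),  # straight line between neighbours
})
print(filled)
```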

In [18]:
# Strategy 1: Drop rows with any missing (aggressive)

df_dropped = water_quality.dropna()
print(f"Original rows: {len(water_quality)}")
print(f"After dropna(): {len(df_dropped)}")
print(f"Lost: {len(water_quality) - len(df_dropped)} rows ({(1 - len(df_dropped)/len(water_quality))*100:.0f}%)")

Original rows: 50
After dropna(): 28
Lost: 22 rows (44%)


In [19]:
# Strategy 2: Fill with column mean (simple, but understates the variance)

df_mean_filled = water_quality.copy()
df_mean_filled['ph'] = df_mean_filled['ph'].fillna(df_mean_filled['ph'].mean())

print("pH filled with mean:")
print(f"  Original missing: {water_quality['ph'].isnull().sum()}")
print(f"  After filling: {df_mean_filled['ph'].isnull().sum()}")
print(f"  Fill value used: {water_quality['ph'].mean():.2f}")

pH filled with mean:
  Original missing: 6
  After filling: 0
  Fill value used: 7.28


In [20]:
# Strategy 3: Group-based imputation (better for MAR)
# Fill dissolved_oxygen with regional mean

df_group_filled = water_quality.copy()

# Calculate regional means
regional_means = water_quality.groupby('region')['dissolved_oxygen'].transform('mean')

# Fill missing with regional mean
df_group_filled['dissolved_oxygen'] = df_group_filled['dissolved_oxygen'].fillna(regional_means)

print("Dissolved oxygen filled with regional mean:")
print(f"  Original missing: {water_quality['dissolved_oxygen'].isnull().sum()}")
print(f"  After filling: {df_group_filled['dissolved_oxygen'].isnull().sum()}")

print("\nRegional means used:")
print(water_quality.groupby('region')['dissolved_oxygen'].mean().round(2))

Dissolved oxygen filled with regional mean:
  Original missing: 4
  After filling: 0

Regional means used:
region
East     8.40
North    8.87
South    7.42
West     8.75
Name: dissolved_oxygen, dtype: float64


In [21]:
# Strategy 4: Flag missing values (preserve information)
# Sometimes it's important to know data WAS missing

df_flagged = water_quality.copy()

# Create flag columns
for col in ['turbidity', 'nitrate_mg_l', 'phosphate_mg_l']:
    df_flagged[f'{col}_missing'] = df_flagged[col].isnull().astype(int)

# Fill with median (for MNAR turbidity this understates the lost high readings, so keep the flag)
for col in ['turbidity', 'nitrate_mg_l', 'phosphate_mg_l']:
    df_flagged[col] = df_flagged[col].fillna(df_flagged[col].median())

print("Data with missing flags:")
print(df_flagged[['station_id', 'turbidity', 'turbidity_missing']].head(10))

Data with missing flags:
  station_id  turbidity  turbidity_missing
0     WQ_001       1.39                  0
1     WQ_002       5.58                  0
2     WQ_003       7.17                  0
3     WQ_004       1.36                  0
4     WQ_005       6.51                  0
5     WQ_006       2.29                  0
6     WQ_007       5.00                  0
7     WQ_008       5.02                  0
8     WQ_009       3.84                  0
9     WQ_010       0.47                  0


### Best Practices for Missing Data

1. **Always document** what was missing and how you handled it
2. **Understand the mechanism** - MCAR vs MAR vs MNAR
3. **Don't impute too aggressively** - thresholds are rules of thumb, but once roughly 30-50% of a column is missing, dropping it is often safer than imputing
4. **Consider flagging** - Missing-ness itself can be informative
5. **Check impact** - Compare results with different strategies
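
Point 5 can be made concrete by running the same summary under each strategy. The sketch below uses synthetic pH-like values with randomly removed entries; the seed, sample size, and 20% missing rate are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
ph = pd.Series(rng.normal(7.2, 0.5, 200))
ph_missing = ph.mask(rng.random(len(ph)) < 0.2)   # synthetic MCAR gaps

strategies = {
    'drop': ph_missing.dropna(),
    'mean_fill': ph_missing.fillna(ph_missing.mean()),
    'median_fill': ph_missing.fillna(ph_missing.median()),
}

# One row per strategy: sample size, mean, and standard deviation
comparison = pd.DataFrame({
    name: {'n': len(s), 'mean': s.mean(), 'std': s.std()}
    for name, s in strategies.items()
}).T.round(3)
print(comparison)
```

Mean filling leaves the mean unchanged but shrinks the standard deviation, which is why it can make downstream confidence intervals look narrower than they should.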

---
## 5. Validation Checks: Ranges, Nulls, and Keys

### Why Validate?

Validation catches problems early:
- Data entry errors (pH of 72 instead of 7.2)
- Unit mismatches (Celsius vs Fahrenheit)
- Duplicate records
- Broken relationships between tables
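
As a quick illustration of the first bullet, a range check catches the misplaced decimal immediately. The readings below are hypothetical.

```python
import pandas as pd

# Hypothetical readings with one data-entry error (72 instead of 7.2)
ph_readings = pd.Series([7.1, 6.8, 72.0, 7.4, None], name='ph')

# between() is inclusive on both ends; missing values come back False,
# so nulls are excluded explicitly to avoid reporting them as range errors
out_of_range = ph_readings[~ph_readings.between(0, 14) & ph_readings.notna()]
print(out_of_range)
```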

### Types of Validation

| Type | Check | Example |
|------|-------|----------|
| **Range** | Values within bounds | pH between 0-14 |
| **Null** | Required fields present | Station ID not null |
| **Type** | Correct data type | Year is integer |
| **Uniqueness** | No duplicates | Unique station-date combo |
| **Referential** | Foreign keys valid | Country code exists |
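
The referential check is the only one in the table that involves two tables. A minimal sketch, using a hypothetical `countries` reference table and a `production` table with two codes that have no match (all values invented):

```python
import pandas as pd

# Hypothetical reference table and a fact table that points at it
countries = pd.DataFrame({'country_code': ['CAN', 'USA', 'MEX']})
production = pd.DataFrame({
    'country_code': ['CAN', 'USA', 'BRA', 'MEX', 'XXX'],
    'value': [100, 200, 300, 400, 500],
})

# Rows whose foreign key has no match in the reference table
orphans = production[~production['country_code'].isin(countries['country_code'])]
print(orphans)

# merge(..., indicator=True) gives the same answer and labels each row's origin
checked = production.merge(countries, on='country_code', how='left', indicator=True)
print(checked['_merge'].value_counts())
```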

In [22]:
# Create a validation framework

def validate_dataframe(df, rules):
    """
    Validate a DataFrame against a set of rules.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Data to validate
    rules : list of dict
        Each rule has: column, check_type, params, description
    
    Returns:
    --------
    list of dict
        One entry per rule with keys: check, passed, message
    """
    results = []
    
    for rule in rules:
        col = rule['column']
        check = rule['check_type']
        params = rule.get('params', {})
        desc = rule['description']
        
        if col not in df.columns:
            results.append({
                'check': desc,
                'passed': False,
                'message': f"Column '{col}' not found"
            })
            continue
        
        # Range check (a missing 'min' or 'max' leaves that side unbounded)
        if check == 'range':
            min_val = params.get('min', -np.inf)
            max_val = params.get('max', np.inf)
            # NaN compares False on both sides, so nulls never count as violations
            violations = df.loc[(df[col] < min_val) | (df[col] > max_val), col]
            passed = len(violations) == 0
            msg = f"{len(violations)} values outside [{min_val}, {max_val}]" if not passed else "OK"
        
        # Not null check
        elif check == 'not_null':
            null_count = df[col].isnull().sum()
            passed = null_count == 0
            msg = f"{null_count} null values" if not passed else "OK"
        
        # Unique check
        elif check == 'unique':
            dup_count = df[col].duplicated().sum()
            passed = dup_count == 0
            msg = f"{dup_count} duplicate values" if not passed else "OK"
        
        # Type check
        elif check == 'dtype':
            expected = params.get('expected')
            actual = str(df[col].dtype)
            passed = expected in actual
            msg = f"Expected {expected}, got {actual}" if not passed else "OK"
        
        # Values in set
        elif check == 'in_set':
            valid_values = params.get('values', [])
            invalid = df[~df[col].isin(valid_values)][col].dropna().unique()
            passed = len(invalid) == 0
            msg = f"Invalid values: {list(invalid)[:5]}" if not passed else "OK"
        
        else:
            passed = False
            msg = f"Unknown check type: {check}"
        
        results.append({
            'check': desc,
            'passed': passed,
            'message': msg
        })
    
    return results


print("Validation function defined.")

Validation function defined.


In [23]:
# Define validation rules for water quality data

water_quality_rules = [
    # Not null checks
    {'column': 'station_id', 'check_type': 'not_null', 
     'description': 'Station ID must not be null'},
    {'column': 'sample_date', 'check_type': 'not_null', 
     'description': 'Sample date must not be null'},
    
    # Uniqueness
    {'column': 'station_id', 'check_type': 'unique', 
     'description': 'Station ID must be unique'},
    
    # Range checks
    {'column': 'ph', 'check_type': 'range', 
     'params': {'min': 0, 'max': 14},
     'description': 'pH must be between 0 and 14'},
    {'column': 'dissolved_oxygen', 'check_type': 'range', 
     'params': {'min': 0, 'max': 20},
     'description': 'Dissolved oxygen must be 0-20 mg/L'},
    {'column': 'temperature_c', 'check_type': 'range', 
     'params': {'min': -5, 'max': 40},
     'description': 'Temperature must be -5 to 40 C'},
    {'column': 'turbidity', 'check_type': 'range', 
     'params': {'min': 0, 'max': 1000},
     'description': 'Turbidity must be 0-1000 NTU'},
    
    # Category check
    {'column': 'region', 'check_type': 'in_set', 
     'params': {'values': ['North', 'South', 'East', 'West']},
     'description': 'Region must be valid'},
    
    # Type check
    {'column': 'sample_date', 'check_type': 'dtype', 
     'params': {'expected': 'datetime'},
     'description': 'Sample date must be datetime'},
]

# Run validation
validation_results = validate_dataframe(water_quality, water_quality_rules)

# Display results
print("VALIDATION RESULTS")
print("=" * 60)
for result in validation_results:
    symbol = "[OK]" if result['passed'] else "[X] "
    print(f"{symbol} {result['check']}")
    if not result['passed']:
        print(f"      {result['message']}")

# Summary
passed = sum(1 for r in validation_results if r['passed'])
total = len(validation_results)
print("\n" + "=" * 60)
print(f"Result: {passed}/{total} checks passed")

VALIDATION RESULTS
[OK] Station ID must not be null
[OK] Sample date must not be null
[OK] Station ID must be unique
[OK] pH must be between 0 and 14
[OK] Dissolved oxygen must be 0-20 mg/L
[OK] Temperature must be -5 to 40 C
[OK] Turbidity must be 0-1000 NTU
[OK] Region must be valid
[OK] Sample date must be datetime

Result: 9/9 checks passed


In [24]:
# Composite key uniqueness check
# A single column is often not unique on its own, so we check whether
# several columns together identify each row

def check_composite_key(df, key_columns):
    """
    Check if combination of columns forms a unique key.
    """
    duplicates = df.duplicated(subset=key_columns, keep=False)
    dup_count = duplicates.sum()
    
    if dup_count > 0:
        print(f"FAIL: {dup_count} rows have duplicate keys")
        print("\nDuplicate examples:")
        print(df.loc[duplicates, key_columns].head(10))
        return False
    else:
        print(f"PASS: {key_columns} forms a unique key")
        return True


# Test: station_id should be unique (single column key)
check_composite_key(water_quality, ['station_id'])

PASS: ['station_id'] forms a unique key


True
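
The test above passes a single column, so it does not exercise the composite case. Below is a minimal sketch of a two-column key, assuming a hypothetical long-format table named `readings` with one row per station per sample date. The station IDs, dates, and pH values are made up for illustration and are not taken from `water_quality`.

```python
import pandas as pd

# Hypothetical readings: stations are sampled more than once,
# so station_id alone cannot identify a row
readings = pd.DataFrame({
    'station_id': ['S1', 'S1', 'S2', 'S2'],
    'sample_date': pd.to_datetime(
        ['2025-01-01', '2025-02-01', '2025-01-01', '2025-01-01']
    ),
    'ph': [7.1, 7.3, 6.9, 7.0],
})

key_columns = ['station_id', 'sample_date']

# Single-column check: station_id repeats in this table
station_id_is_unique = readings['station_id'].is_unique

# Composite check: keep=False marks every row in a duplicated group,
# not only the second and later occurrences
dup_mask = readings.duplicated(subset=key_columns, keep=False)
n_duplicate_rows = int(dup_mask.sum())

print(f"station_id unique on its own: {station_id_is_unique}")
print(f"Rows sharing a (station_id, sample_date) pair: {n_duplicate_rows}")
print(readings.loc[dup_mask, key_columns])
```

Here the two `S2` rows share the same date, so the pair fails as a key even though it is the natural candidate. In practice that usually points to a duplicated upload or a missing third key column such as a sample time.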

In [25]:
# Creating a validation report

def create_validation_report(df, name='Dataset'):
    """
    Create a comprehensive validation report for a DataFrame.
    """
    print(f"\n{'='*60}")
    print(f"VALIDATION REPORT: {name}")
    print(f"{'='*60}")
    print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    
    # Basic info
    print(f"\n--- BASIC INFO ---")
    print(f"Rows: {len(df):,}")
    print(f"Columns: {len(df.columns)}")
    print(f"Memory: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")
    
    # Missing values
    print(f"\n--- MISSING VALUES ---")
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(1)
    missing_df = pd.DataFrame({'count': missing, 'pct': missing_pct})
    missing_df = missing_df[missing_df['count'] > 0]
    if len(missing_df) > 0:
        print(missing_df)
    else:
        print("No missing values!")
    
    # Duplicates
    print(f"\n--- DUPLICATES ---")
    dup_rows = df.duplicated().sum()
    print(f"Duplicate rows: {dup_rows}")
    
    # Numeric column stats
    print(f"\n--- NUMERIC RANGES ---")
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        print(f"{col}: [{df[col].min():.2f}, {df[col].max():.2f}] (mean: {df[col].mean():.2f})")
    
    # Data types
    print(f"\n--- DATA TYPES ---")
    print(df.dtypes)
    
    print(f"\n{'='*60}")
    print("END OF REPORT")
    print(f"{'='*60}")


# Generate report
create_validation_report(water_quality, 'Water Quality Monitoring')


VALIDATION REPORT: Water Quality Monitoring
Generated: 2026-01-06 23:02:53

--- BASIC INFO ---
Rows: 50
Columns: 9
Memory: 8.9 KB

--- MISSING VALUES ---
                  count   pct
ph                    6  12.0
dissolved_oxygen      4   8.0
turbidity             8  16.0
nitrate_mg_l          6  12.0
phosphate_mg_l        7  14.0

--- DUPLICATES ---
Duplicate rows: 0

--- NUMERIC RANGES ---
ph: [6.41, 8.59] (mean: 7.28)
dissolved_oxygen: [4.52, 13.56] (mean: 8.35)
turbidity: [0.08, 9.02] (mean: 3.41)
temperature_c: [8.00, 28.70] (mean: 16.01)
nitrate_mg_l: [0.08, 11.17] (mean: 2.57)
phosphate_mg_l: [0.01, 1.62] (mean: 0.46)

--- DATA TYPES ---
station_id                  object
region                      object
sample_date         datetime64[ns]
ph                         float64
dissolved_oxygen           float64
turbidity                  float64
temperature_c              float64
nitrate_mg_l               float64
phosphate_mg_l             float64
dtype: object

END OF REPORT


---
## Summary: Key Takeaways

### 1. Tidy Data
- Each variable = column, each observation = row, each value = cell
- Use `pd.melt()` to go from wide to long
- Use `pd.pivot()` to go from long to wide (use `pivot_table()` when index/column pairs repeat and need aggregating)
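
As a quick recap of the two reshaping calls, here is a minimal round-trip sketch. The table `wide`, its quarter columns, and the `mean_ph` values are hypothetical and chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical wide table: one column per quarter (not tidy,
# because the quarter is a variable stored in the column names)
wide = pd.DataFrame({
    'region': ['North', 'South'],
    'Q1': [7.1, 6.8],
    'Q2': [7.3, 6.9],
})

# Wide -> long: one row per (region, quarter) observation
long_df = pd.melt(
    wide,
    id_vars=['region'],
    var_name='quarter',
    value_name='mean_ph',
)

# Long -> wide: spread the quarter values back into columns
wide_again = long_df.pivot(
    index='region', columns='quarter', values='mean_ph'
).reset_index()
wide_again.columns.name = None  # drop the leftover 'quarter' axis label

print(long_df)
print(wide_again)
```

Note that `pivot` sorts the index and the new columns, so the round trip only reproduces the original row order when the input was already sorted.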

### 2. Type Coercion
- Always check `df.dtypes` after loading
- Use `pd.to_numeric(..., errors='coerce')` for messy numbers; values that cannot be parsed become `NaN`
- Use `pd.to_datetime()` with an explicit `format=` string
- Create reusable cleaning functions
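
The coercion pattern can be sketched as follows, assuming a hypothetical raw table with a `rainfall_mm` text column and a `sample_date` text column. The column names and values are illustrative only.

```python
import pandas as pd

# Hypothetical raw data as it might arrive from a CSV: everything is text
raw = pd.DataFrame({
    'rainfall_mm': ['12.5', '8', 'n/a', '1,200'],
    'sample_date': ['2025-01-15', '2025-02-30', '2025-03-01', 'unknown'],
})

# Remove thousands separators first, otherwise '1,200' would be coerced to NaN
rainfall_text = raw['rainfall_mm'].str.replace(',', '', regex=False)
raw['rainfall_mm_num'] = pd.to_numeric(rainfall_text, errors='coerce')

# An explicit format avoids ambiguous day/month guessing;
# impossible dates such as February 30 become NaT
raw['sample_date_parsed'] = pd.to_datetime(
    raw['sample_date'], format='%Y-%m-%d', errors='coerce'
)

# Count what coercion could not parse so the loss is visible
n_bad_numbers = int(raw['rainfall_mm_num'].isna().sum())
n_bad_dates = int(raw['sample_date_parsed'].isna().sum())

print(raw.dtypes)
print(f"Unparseable numbers: {n_bad_numbers}, unparseable dates: {n_bad_dates}")
```

Keeping the original text columns next to the parsed ones makes it easy to inspect exactly which raw values were turned into missing values.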

### 3. Missing Data
- Understand the mechanism (MCAR/MAR/MNAR)
- Choose a strategy based on the mechanism and the data
- Document what you did
- Consider flagging missing values
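
Flagging can be sketched like this, assuming a hypothetical `samples` table and a simple median fill. The median is used only as an example strategy, not as a recommendation for any particular missingness mechanism.

```python
import numpy as np
import pandas as pd

# Hypothetical samples with gaps in turbidity
samples = pd.DataFrame({
    'station_id': ['S1', 'S2', 'S3', 'S4'],
    'turbidity': [2.0, np.nan, 4.0, np.nan],
})

# Record which values were missing BEFORE filling anything
samples['turbidity_was_missing'] = samples['turbidity'].isna()

# Fill into a new column so the raw measurements stay untouched
fill_value = samples['turbidity'].median()  # median skips NaN by default
samples['turbidity_filled'] = samples['turbidity'].fillna(fill_value)

print(samples)
```

The flag column lets later analysis compare results with and without the imputed rows, and it documents the imputation inside the data itself.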

### 4. Validation
- Check ranges, nulls, types, uniqueness
- Build validation functions for reuse
- Create reports before analysis
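
One detail worth keeping in mind when writing range checks is how nulls are treated. The sketch below uses a hypothetical helper named `check_range`, which is not the lecture's `validate_dataframe`. It assumes that nulls are covered by a separate not-null rule and should therefore be skipped by the range rule.

```python
import numpy as np
import pandas as pd


def check_range(series, min_value, max_value):
    """Check that all non-null values lie within [min_value, max_value].

    Hypothetical helper for illustration: nulls are ignored here on the
    assumption that a separate not-null rule reports them.
    """
    non_null = series.dropna()
    out_of_range = non_null[(non_null < min_value) | (non_null > max_value)]
    return {
        'passed': out_of_range.empty,
        'n_checked': len(non_null),
        'n_out_of_range': len(out_of_range),
    }


# Made-up pH readings: one missing value and one impossible value
ph = pd.Series([7.1, np.nan, 6.8, 15.2])
result = check_range(ph, min_value=0, max_value=14)

print(result)
```

Reporting `n_checked` alongside the pass/fail flag makes it clear how many values the rule actually tested, which matters when a column is mostly missing.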

---

## Next Class Preview

**Lecture 6: Python Wrangling II**
- Merging multiple data sources
- Advanced pivot and unpivot
- Building analysis-ready tables
- Data contracts and documentation

---

## Practice Exercises

1. **Exercise 1**: Create a messy dataset with years as columns and convert it to tidy format

2. **Exercise 2**: Write a cleaning function for a column that has mixed formats (e.g., "100 kg", "50.5kg", "75 KG")

3. **Exercise 3**: Analyze the missing data patterns in the water quality dataset and implement a group-based imputation for pH by region

4. **Exercise 4**: Add three more validation rules to `water_quality_rules` (hint: check for negative values, check date ranges)