# FRE 521D: Data Analytics in Climate, Food and Environment
## Lecture 3: ETL Pipeline I - Extracting from Files and Tables

**Date:** Monday, January 12, 2026  
**Instructor:** Asif Ahmed Neloy  
**Program:** UBC Master of Food and Resource Economics

---

### Today's Agenda

1. What is ETL? Why Do We Need It?
2. Setting Up Our Environment
3. Working with CSV Files (The Right Way)
4. Working with JSON Files
5. Raw Layer vs Cleaned Layer
6. Data Lineage: Where Did This Data Come From?
7. Complete ETL Pipeline Example

---

## 1. What is ETL? Why Do We Need It?

### The Problem We Face

In Assignment 1, you loaded CSV files into MySQL. That worked fine for a one-time load. But in the real world:

- Weather data updates **every day**
- Commodity prices change **every hour**
- ESG ratings are revised **quarterly**
- Sensor data arrives **every minute**

Manual loading does not work when data keeps coming. We need **automated pipelines**.

### ETL: The Three Steps

**ETL** stands for **Extract, Transform, Load**:

```
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│    EXTRACT     │ --> │   TRANSFORM    │ --> │      LOAD      │
│                │     │                │     │                │
│ Pull data from │     │ Clean, reshape │     │ Store in final │
│ source systems │     │ validate data  │     │ destination    │
└────────────────┘     └────────────────┘     └────────────────┘
```

| Step | What Happens | Examples |
|------|--------------|----------|
| **Extract** | Pull data from source | Read CSV, call API, query database |
| **Transform** | Clean and reshape | Fix data types, handle nulls, join tables |
| **Load** | Store in destination | Insert into database, write to file |
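
To make the three steps concrete, here is a minimal sketch of ETL as three small functions. The sample data, column names, and function names are made up for illustration, and the "destination" is just an in-memory CSV string rather than a real database.

```python
import io
import pandas as pd

# Hypothetical three-row source; a real pipeline would read a file or call an API.
SOURCE_CSV = (
    "country,year,yield_kg_ha\n"
    "China,2001,3208.43\n"
    "Nepal,1993,\n"
    "Japan,2013,10731.51\n"
)

def extract(source: str) -> pd.DataFrame:
    """Extract: pull the data in as strings, without changing it."""
    return pd.read_csv(io.StringIO(source), dtype=str, keep_default_na=False)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: fix data types and turn blanks into proper nulls."""
    out = df.copy()
    out["year"] = out["year"].astype(int)
    out["yield_kg_ha"] = pd.to_numeric(out["yield_kg_ha"], errors="coerce")
    return out

def load(df: pd.DataFrame) -> str:
    """Load: write to the destination (here, an in-memory CSV string)."""
    return df.to_csv(index=False)

cleaned = transform(extract(SOURCE_CSV))
output = load(cleaned)
print(cleaned.dtypes)
```

Keeping each step in its own function makes it easier to test and rerun one step without touching the others.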

### Why ETL Matters for Your Career

A commonly quoted rule of thumb is that data analysts spend about **80% of their time** on data preparation. If you can build reliable ETL pipelines, you will:

1. **Save time** - Automate repetitive tasks
2. **Reduce errors** - Consistent processing every time
3. **Enable analysis** - Clean data is ready for insights
4. **Stand out** - Many analysts cannot do this

In Assignment 2, you will build a complete ETL pipeline that pulls weather data from an API and loads it into your database from Assignment 1.

---
## 2. Setting Up Our Environment

Let's install and import what we need.

In [1]:
# Run this cell once to install required packages
# Add a # in front of the line below to skip it once they are installed

!pip install pandas numpy requests


In [2]:
# Standard imports for ETL work
import pandas as pd
import numpy as np
import json
import os
from datetime import datetime

# Display settings for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print(f"Pandas version: {pd.__version__}")
print(f"Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("Setup complete!")

Pandas version: 2.3.1
Current time: 2026-01-14 09:28:23
Setup complete!


---
## 3. Working with CSV Files (The Right Way)

CSV looks simple but has many traps. Let's learn to handle them.

### Common CSV Problems

| Problem | What It Looks Like | Solution |
|---------|-------------------|----------|
| European decimals | `1.234,56` instead of `1234.56` | `decimal=','` (with `thousands='.'`) |
| Semicolon delimiter | Columns separated by `;` | `sep=';'` |
| Encoding issues | `Ã©` instead of `é` | `encoding='utf-8'` |
| Missing values | `NA`, `..`, `-`, blank | `na_values` parameter |
| Mixed types | Numbers stored as text | Read as string first |
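
As a rough illustration of the parameters in the table, the sketch below parses a small made-up file that uses semicolons, European number formatting, and a few missing-value codes. The column names and values are hypothetical. This only works when the whole file follows one convention; a file that mixes formats needs a string-first approach instead.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited file with European number formatting
text = """country;yield_kg_ha;rain_mm
Germany;1.234,56;NA
France;987,5;..
Italy;-;12,0
"""

df = pd.read_csv(
    io.StringIO(text),
    sep=";",                      # semicolon delimiter
    decimal=",",                  # comma is the decimal mark
    thousands=".",                # period is the thousands separator
    na_values=["NA", "..", "-"],  # source-specific missing codes
)

print(df)
print(df.dtypes)
```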

In [4]:
# Let's look at our crop data file first
# Always inspect before loading!

# Read just the first few lines as text
with open('crop_production_1990_2023.csv', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i < 5:  # First 5 lines
            print(f"Line {i}: {line.strip()[:100]}...")  # First 100 chars
        else:
            break

Line 0: Country,ISO3_Code,Region,Income_Group,Year,Crop,Area_Harvested_Ha,Production_Tonnes,Yield_Kg_Ha,Fert...
Line 1: China,CHN,East Asia,Upper middle income,2001,Soybeans,3751494,12036421.75,3208.43,100.9,,...
Line 2: Nepal,NPL,South Asia,Low income,1993,Maize,2112762,11377270.55,"5385,02","19,14",9.8,...
Line 3: South Korea,KOR,East Asia,High income,1995.0,Soybeans,1650777,7474101.16,4527.63,193.84,56.6,...
Line 4: United States,USA,North America,High income,2018,Wheat,4782989,32397951.41,6773.58,205.12,62.5,...


In [5]:
# Basic CSV read - pandas will guess everything
df_basic = pd.read_csv('crop_production_1990_2023.csv')

print(f"Shape: {df_basic.shape[0]} rows, {df_basic.shape[1]} columns")
print(f"\nColumn names:")
print(df_basic.columns.tolist())
print(f"\nData types:")
print(df_basic.dtypes)

Shape: 4187 rows, 12 columns

Column names:
['Country', 'ISO3_Code', 'Region', 'Income_Group', 'Year', 'Crop', 'Area_Harvested_Ha', 'Production_Tonnes', 'Yield_Kg_Ha', 'Fertilizer_Use_Kg_Ha', 'Irrigation_Pct', 'Notes']

Data types:
Country                  object
ISO3_Code                object
Region                   object
Income_Group             object
Year                    float64
Crop                     object
Area_Harvested_Ha         int64
Production_Tonnes        object
Yield_Kg_Ha              object
Fertilizer_Use_Kg_Ha     object
Irrigation_Pct           object
Notes                    object
dtype: object


### The Problem with Automatic Type Detection

When pandas guesses types, it can make mistakes:
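- Numbers written with a decimal comma (`"5385,02"`) cannot be parsed, so the whole column is read as text (`object`)
- Missing-value codes are handled inconsistently: `NA` silently becomes `NaN`, while codes pandas does not recognize (`..`, `-`) force the column to text
- A year written as `1995.0` turns the whole `Year` column into `float64`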

**Better approach:** Read everything as strings first, then convert explicitly.
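
The difference is easy to see on a tiny in-memory sample. This is a sketch with three made-up rows (the values imitate the kinds of issues in the crop file), comparing pandas' default inference with a string-only read.

```python
import io
import pandas as pd

# Hypothetical sample: one ".0" year, one decimal comma, two missing codes
sample = '''Year,Yield_Kg_Ha,Production_Tonnes
2001,3208.43,12036421.75
1993,"5385,02",NA
1995.0,4527.63,..
'''

# Default read: pandas guesses the types
inferred = pd.read_csv(io.StringIO(sample))

# String-only read: nothing is converted or dropped
as_text = pd.read_csv(io.StringIO(sample), dtype=str, keep_default_na=False)

print(inferred.dtypes)
print(inferred["Year"].tolist())               # every year is now a float
print(as_text["Year"].tolist())                # original text preserved
print(as_text["Production_Tonnes"].tolist())   # 'NA' and '..' kept as written
```

With the string-only read, you decide later how each code and format should be interpreted, instead of pandas deciding for you.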

In [6]:
# Better approach: Read everything as strings
df_raw = pd.read_csv(
    'crop_production_1990_2023.csv',
    dtype=str,  # Force all columns to string
    keep_default_na=False  # Don't convert anything to NaN automatically
)

print("All columns are now strings:")
print(df_raw.dtypes.unique())

print("\nSample of raw data (notice the data quality issues):")
df_raw.head()

All columns are now strings:
[dtype('O')]

Sample of raw data (notice the data quality issues):


Unnamed: 0,Country,ISO3_Code,Region,Income_Group,Year,Crop,Area_Harvested_Ha,Production_Tonnes,Yield_Kg_Ha,Fertilizer_Use_Kg_Ha,Irrigation_Pct,Notes
0,China,CHN,East Asia,Upper middle income,2001,Soybeans,3751494,12036421.75,3208.43,100.9,,
1,Nepal,NPL,South Asia,Low income,1993,Maize,2112762,11377270.55,"5385,02","19,14",9.8,
2,South Korea,KOR,East Asia,High income,1995.0,Soybeans,1650777,7474101.16,4527.63,193.84,56.6,
3,United States,USA,North America,High income,2018,Wheat,4782989,32397951.41,6773.58,205.12,62.5,
4,Japan,JPN,East Asia,High income,2013.0,Rice,5434696,58322509.35,1073151.0,21164.0,61.4,


In [7]:
# Let's find some of the data quality issues

# Issue 1: European decimals (comma instead of period)
european_decimals = df_raw[df_raw['Yield_Kg_Ha'].str.contains(',', na=False)]
print(f"Rows with European decimals: {len(european_decimals)}")
print(european_decimals['Yield_Kg_Ha'].head().tolist())

# Issue 2: Non-numeric entries (missing value codes and footnote markers)
print(f"\nMissing value codes found in Production_Tonnes:")
weird_values = df_raw[~df_raw['Production_Tonnes'].str.match(r'^[\d.]+$', na=False)]['Production_Tonnes']
print(weird_values.value_counts().head(10))

Rows with European decimals: 408
['5385,02', '10731,51', '4398,0', '2970,49', '5140,5']

Missing value codes found in Production_Tonnes:
Production_Tonnes
NA                27
                  26
N/A               25
-                 21
**                 1
20693827.7*        1
11163272.65(p)     1
6154960.74*        1
10262733.99F       1
3797624.37**       1
Name: count, dtype: int64


### Complete CSV Reading Template

Here is a template you can use for any CSV file:

In [8]:
def read_csv_safely(filepath):
    """
    Read a CSV file with full control over parsing.
    Returns data with all columns as strings for safe processing.
    """
    df = pd.read_csv(
        filepath,
        
        # Type control
        dtype=str,              # Read everything as string
        
        # Missing value handling
        na_values=[''],         # Only empty string is null
        keep_default_na=False,  # Don't auto-convert NA, N/A, etc.
        
        # Encoding
        encoding='utf-8',       # Most common encoding
        
        # Other safety options
        low_memory=False,       # Prevent mixed type warnings
    )
    
    # Report what was loaded
    print(f"Loaded {len(df)} rows, {len(df.columns)} columns from {filepath}")
    
    return df

# Use the function
df = read_csv_safely('crop_production_1990_2023.csv')
df.head(3)

Loaded 4187 rows, 12 columns from crop_production_1990_2023.csv


Unnamed: 0,Country,ISO3_Code,Region,Income_Group,Year,Crop,Area_Harvested_Ha,Production_Tonnes,Yield_Kg_Ha,Fertilizer_Use_Kg_Ha,Irrigation_Pct,Notes
0,China,CHN,East Asia,Upper middle income,2001,Soybeans,3751494,12036421.75,3208.43,100.9,,
1,Nepal,NPL,South Asia,Low income,1993,Maize,2112762,11377270.55,"5385,02","19,14",9.8,
2,South Korea,KOR,East Asia,High income,1995.0,Soybeans,1650777,7474101.16,4527.63,193.84,56.6,


---
## 4. Working with JSON Files

JSON (JavaScript Object Notation) is the most common format for web API responses. You will work with JSON heavily in Assignment 2 when calling the weather API.

### JSON Structure

JSON has two main building blocks:

1. **Objects** (like Python dictionaries): `{"key": "value"}`
2. **Arrays** (like Python lists): `[1, 2, 3]`

They can be nested inside each other.
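
A quick sketch of how these building blocks map to Python types when parsed. The station name and readings are made up for illustration.

```python
import json

# Hypothetical JSON text: an object containing an array of objects
text = '''
{
    "station": "YVR",
    "readings": [
        {"date": "2023-01-01", "temp": 4.1},
        {"date": "2023-01-02", "temp": null}
    ]
}
'''

parsed = json.loads(text)

print(type(parsed).__name__)               # JSON object -> dict
print(type(parsed["readings"]).__name__)   # JSON array  -> list

# Nested access: object -> array -> object
second = parsed["readings"][1]
print(second["date"])
print(second["temp"] is None)              # JSON null -> Python None
```

Note that JSON `null` becomes Python `None`, which is worth remembering when a weather API reports a missing observation.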

In [9]:
# Let's create sample JSON that looks like API weather data
# This is similar to what you'll get from Open-Meteo API in Assignment 2

sample_api_response = '''
{
    "latitude": 52.52,
    "longitude": 13.41,
    "timezone": "Europe/Berlin",
    "daily": {
        "time": ["2023-01-01", "2023-01-02", "2023-01-03"],
        "temperature_2m_mean": [5.2, 3.1, 4.8],
        "precipitation_sum": [0.0, 2.3, 0.5]
    }
}
'''

# Save it to a file
with open('sample_weather.json', 'w') as f:
    f.write(sample_api_response)

print("Sample JSON file created!")
print(sample_api_response)

Sample JSON file created!

{
    "latitude": 52.52,
    "longitude": 13.41,
    "timezone": "Europe/Berlin",
    "daily": {
        "time": ["2023-01-01", "2023-01-02", "2023-01-03"],
        "temperature_2m_mean": [5.2, 3.1, 4.8],
        "precipitation_sum": [0.0, 2.3, 0.5]
    }
}



In [10]:
# Reading JSON with Python's json module

with open('sample_weather.json', 'r') as f:
    data = json.load(f)

# Now 'data' is a Python dictionary
print(f"Type: {type(data)}")
print(f"Top-level keys: {list(data.keys())}")

Type: <class 'dict'>
Top-level keys: ['latitude', 'longitude', 'timezone', 'daily']


In [11]:
# Accessing nested data

# Get latitude
lat = data['latitude']
print(f"Latitude: {lat}")

# Get the daily data (this is nested)
daily = data['daily']
print(f"\nDaily data keys: {list(daily.keys())}")

# Get temperatures
temps = daily['temperature_2m_mean']
print(f"\nTemperatures: {temps}")

Latitude: 52.52

Daily data keys: ['time', 'temperature_2m_mean', 'precipitation_sum']

Temperatures: [5.2, 3.1, 4.8]


In [12]:
# Converting nested JSON to a DataFrame
# This is what you'll do with API responses

# Method: Manual construction (gives you full control)
df_weather = pd.DataFrame({
    'date': data['daily']['time'],
    'temp_mean': data['daily']['temperature_2m_mean'],
    'precip_sum': data['daily']['precipitation_sum']
})

# Add metadata from the parent object
df_weather['latitude'] = data['latitude']
df_weather['longitude'] = data['longitude']

print("Weather data as DataFrame:")
df_weather

Weather data as DataFrame:


Unnamed: 0,date,temp_mean,precip_sum,latitude,longitude
0,2023-01-01,5.2,0.0,52.52,13.41
1,2023-01-02,3.1,2.3,52.52,13.41
2,2023-01-03,4.8,0.5,52.52,13.41
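
When the response comes from a real API, it can help to check that the arrays under `daily` all have the same length before building the DataFrame. The helper below is a sketch; the name `daily_to_frame` is hypothetical, and it assumes the payload has the same `daily`, `latitude`, and `longitude` keys as the sample above.

```python
import pandas as pd

def daily_to_frame(payload: dict) -> pd.DataFrame:
    """Build one row per day from a payload with parallel arrays under 'daily'."""
    daily = payload["daily"]

    # Every array must describe the same set of days
    lengths = {key: len(values) for key, values in daily.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Arrays in 'daily' differ in length: {lengths}")

    df = pd.DataFrame(daily)
    # Repeat the location on every row
    df["latitude"] = payload["latitude"]
    df["longitude"] = payload["longitude"]
    return df

good = {
    "latitude": 52.52,
    "longitude": 13.41,
    "daily": {
        "time": ["2023-01-01", "2023-01-02"],
        "temperature_2m_mean": [5.2, 3.1],
    },
}
df_daily = daily_to_frame(good)
print(df_daily)

# A truncated response is rejected with a readable message
bad = {"latitude": 0.0, "longitude": 0.0,
       "daily": {"time": ["2023-01-01", "2023-01-02"], "temperature_2m_mean": [5.2]}}
try:
    daily_to_frame(bad)
    rejected = False
except ValueError as err:
    rejected = True
    print(f"Rejected: {err}")
```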


In [13]:
# Using json_normalize for complex nested structures
from pandas import json_normalize

# Create more complex sample data
complex_data = [
    {
        "country": "Canada",
        "coordinates": {"lat": 56.13, "lon": -106.35},
        "weather": [
            {"date": "2023-01-01", "temp": -15.2},
            {"date": "2023-01-02", "temp": -18.5}
        ]
    },
    {
        "country": "Brazil",
        "coordinates": {"lat": -14.24, "lon": -51.93},
        "weather": [
            {"date": "2023-01-01", "temp": 28.3},
            {"date": "2023-01-02", "temp": 29.1}
        ]
    }
]

# Flatten the nested coordinates
df_flat = json_normalize(complex_data)
print("Flattened structure:")
df_flat

Flattened structure:


Unnamed: 0,country,weather,coordinates.lat,coordinates.lon
0,Canada,"[{'date': '2023-01-01', 'temp': -15.2}, {'date...",56.13,-106.35
1,Brazil,"[{'date': '2023-01-01', 'temp': 28.3}, {'date'...",-14.24,-51.93


In [None]:
# Expanding nested arrays
df_expanded = json_normalize(
    complex_data,
    record_path='weather',      # The array to expand
    meta=['country',            # Keep these fields
          ['coordinates', 'lat'],
          ['coordinates', 'lon']]
)

print("Expanded weather data (one row per weather observation):")
df_expanded
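
The cell above has no recorded output, so here is a self-contained sketch on a smaller made-up sample that shows the shape to expect: one row per weather record, with the parent fields repeated on each row.

```python
from pandas import json_normalize

# Hypothetical sample with the same nesting as the lecture data
stations = [
    {"country": "Canada", "coordinates": {"lat": 56.13, "lon": -106.35},
     "weather": [{"date": "2023-01-01", "temp": -15.2},
                 {"date": "2023-01-02", "temp": -18.5}]},
    {"country": "Brazil", "coordinates": {"lat": -14.24, "lon": -51.93},
     "weather": [{"date": "2023-01-01", "temp": 28.3}]},
]

expanded = json_normalize(
    stations,
    record_path="weather",  # the array to expand into rows
    meta=["country", ["coordinates", "lat"], ["coordinates", "lon"]],
)

# 2 records for Canada + 1 for Brazil
print(expanded.shape)             # (3, 5)
print(expanded.columns.tolist())
```

Nested `meta` paths come back as dotted column names such as `coordinates.lat`.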

---
## 5. Raw Layer vs Cleaned Layer

This is one of the most important concepts in data engineering.

### The Problem with Direct Transformation

If you transform data directly:
- You lose the original data
- If something goes wrong, you cannot debug
- If cleaning logic changes, you must re-extract from source

### The Solution: Layered Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     SOURCE FILES / APIs                     │
└─────────────────────────┬───────────────────────────────────┘
                          │ Extract (no changes)
                          ▼
┌─────────────────────────────────────────────────────────────┐
│                      RAW LAYER                              │
│   - Exact copy of source                                    │
│   - All data as strings                                     │
│   - Add: source file, extraction timestamp, row number      │
│   - Never modify this layer                                 │
└─────────────────────────┬───────────────────────────────────┘
                          │ Transform
                          ▼
┌─────────────────────────────────────────────────────────────┐
│                    CLEANED LAYER                            │
│   - Correct data types                                      │
│   - Standardized formats                                    │
│   - Nulls handled consistently                              │
│   - Validated                                               │
└─────────────────────────────────────────────────────────────┘
```
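
One way to back up the "exact copy of source" idea is to record a checksum of the source file at extraction time. This is a sketch rather than part of the course pipeline: the helper name `file_sha256` is hypothetical, and the files are created in a temporary folder just for the demonstration. A raw file that has metadata columns added will not be byte-identical to the source, so the hash describes the source file itself and could be stored alongside the other metadata.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical source file
    source = Path(tmp) / "source.csv"
    source.write_text("Country,Year\nChina,2001\n", encoding="utf-8")

    # Untouched copy: same bytes, same hash
    raw_copy = Path(tmp) / "raw_copy.csv"
    shutil.copyfile(source, raw_copy)
    same = file_sha256(source) == file_sha256(raw_copy)

    # Any edit, even one character, changes the hash
    edited = Path(tmp) / "edited.csv"
    edited.write_text("Country,Year\nChina,2002\n", encoding="utf-8")
    changed = file_sha256(source) != file_sha256(edited)

print(f"Copy matches source: {same}")
print(f"Edit detected: {changed}")
```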

In [14]:
# Create directory structure for our layers
import os

os.makedirs('data/raw', exist_ok=True)
os.makedirs('data/cleaned', exist_ok=True)

print("Directory structure created:")
print("  data/")
print("    raw/      <- Untouched source data")
print("    cleaned/  <- Transformed data")

Directory structure created:
  data/
    raw/      <- Untouched source data
    cleaned/  <- Transformed data


In [15]:
# Step 1: Extract to Raw Layer

def extract_to_raw(source_file, source_name):
    """
    Extract data from source file to raw layer.
    Adds metadata columns but makes NO transformations.
    
    Parameters:
    -----------
    source_file : str
        Path to the source file
    source_name : str
        Name to identify this data source
    
    Returns:
    --------
    DataFrame with raw data plus metadata columns
    """
    # Read as strings - no type conversion
    df = pd.read_csv(source_file, dtype=str, keep_default_na=False)
    
    # Add metadata columns (prefix with _ to mark as system columns)
    df['_source_file'] = os.path.basename(source_file)
    df['_source_name'] = source_name
    df['_extracted_at'] = datetime.now().isoformat()
    df['_row_num'] = range(1, len(df) + 1)
    
    print(f"Extracted {len(df)} rows from {source_file}")
    return df

# Extract crop production data
df_raw = extract_to_raw('crop_production_1990_2023.csv', 'fao_crop_production')

# Show metadata columns
print("\nMetadata columns:")
df_raw[['Country', 'Year', '_source_file', '_extracted_at', '_row_num']].head(3)

Extracted 4187 rows from crop_production_1990_2023.csv

Metadata columns:


Unnamed: 0,Country,Year,_source_file,_extracted_at,_row_num
0,China,2001,crop_production_1990_2023.csv,2026-01-14T09:30:13.131857,1
1,Nepal,1993,crop_production_1990_2023.csv,2026-01-14T09:30:13.131857,2
2,South Korea,1995.0,crop_production_1990_2023.csv,2026-01-14T09:30:13.131857,3


In [16]:
# Save raw layer with timestamp in filename
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
raw_file = f'data/raw/crop_production_raw_{timestamp}.csv'
df_raw.to_csv(raw_file, index=False)
print(f"Saved to: {raw_file}")

Saved to: data/raw/crop_production_raw_20260114_093017.csv


In [17]:
# Step 2: Transform to Cleaned Layer

import re

def clean_numeric(value):
    """Convert string to float, handling various formats."""
    if pd.isna(value) or value == '':
        return None
    
    value = str(value).strip()
    
    # Check for missing value codes
    if value in ['NA', 'N/A', '..', '-', 'NULL']:
        return None
    
    # Remove footnote markers like *, **, (e), (p), or a trailing flag letter such as F
    value = re.sub(r'[*]+$', '', value)
    value = re.sub(r'\([a-zA-Z]\)$', '', value)
    value = re.sub(r'(?<=\d)[A-Za-z]$', '', value)
    
    # Handle European decimals (comma as decimal separator)
    if ',' in value and '.' not in value:
        value = value.replace(',', '.')
    
    try:
        return float(value)
    except ValueError:
        return None


def clean_year(value):
    """Convert year string to integer."""
    if pd.isna(value) or value == '':
        return None
    try:
        # Handle years like "2020.0"
        return int(float(str(value).strip()))
    except ValueError:
        return None


# Test the functions
print("Testing clean_numeric:")
test_values = ['1234.56', '1234,56', 'NA', '1234*', '1234(e)', '']
for v in test_values:
    print(f"  '{v}' -> {clean_numeric(v)}")

Testing clean_numeric:
  '1234.56' -> 1234.56
  '1234,56' -> 1234.56
  'NA' -> None
  '1234*' -> 1234.0
  '1234(e)' -> 1234.0
  '' -> None


In [18]:
def transform_to_cleaned(df_raw):
    """
    Transform raw data to cleaned layer.
    Applies type conversions and standardization.
    """
    df = df_raw.copy()
    
    # Clean text columns (strip whitespace)
    text_columns = ['Country', 'ISO3_Code', 'Region', 'Income_Group', 'Crop', 'Notes']
    for col in text_columns:
        if col in df.columns:
            df[col] = df[col].str.strip()
    
    # Clean Year column
    df['Year'] = df['Year'].apply(clean_year)
    
    # Clean numeric columns
    numeric_columns = ['Area_Harvested_Ha', 'Production_Tonnes', 'Yield_Kg_Ha',
                       'Fertilizer_Use_Kg_Ha', 'Irrigation_Pct']
    for col in numeric_columns:
        if col in df.columns:
            df[col] = df[col].apply(clean_numeric)
    
    # Add cleaning metadata
    df['_cleaned_at'] = datetime.now().isoformat()
    
    return df

# Transform the data
df_cleaned = transform_to_cleaned(df_raw)

print("Data types after cleaning:")
print(df_cleaned[['Year', 'Yield_Kg_Ha', 'Production_Tonnes']].dtypes)

print("\nSample cleaned data:")
df_cleaned[['Country', 'Year', 'Crop', 'Yield_Kg_Ha']].head()

Data types after cleaning:
Year                   int64
Yield_Kg_Ha          float64
Production_Tonnes    float64
dtype: object

Sample cleaned data:


Unnamed: 0,Country,Year,Crop,Yield_Kg_Ha
0,China,2001,Soybeans,3208.43
1,Nepal,1993,Maize,5385.02
2,South Korea,1995,Soybeans,4527.63
3,United States,2018,Wheat,6773.58
4,Japan,2013,Rice,10731.51
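
Because a cleaning function that returns `None` on failure can hide data loss, it is worth checking which raw values ended up null. The sketch below uses a hypothetical helper, `lost_in_cleaning`, and two small made-up Series; the `cleaned` values stand in for the output of some cleaning function that failed to parse a flagged number.

```python
import pandas as pd

def lost_in_cleaning(raw: pd.Series, cleaned: pd.Series) -> pd.Series:
    """Return raw values that were not blank but are null after cleaning."""
    had_text = raw.fillna("").str.strip() != ""
    return raw[had_text & cleaned.isna()]

# Hypothetical raw strings and the result of an imperfect cleaner
raw = pd.Series(["3208.43", "5385,02", "NA", "10262733.99F", ""])
cleaned = pd.Series([3208.43, 5385.02, None, None, None])

lost = lost_in_cleaning(raw, cleaned)
print(lost.value_counts())
```

Expected missing codes such as `NA` will appear in this list, which is fine. Anything that looks like a real number is a sign the cleaning rules need another case.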


In [19]:
# Save cleaned layer
cleaned_file = f'data/cleaned/crop_production_cleaned_{timestamp}.csv'
df_cleaned.to_csv(cleaned_file, index=False)
print(f"Saved to: {cleaned_file}")

Saved to: data/cleaned/crop_production_cleaned_20260114_093017.csv


---
## 6. Data Lineage: Where Did This Data Come From?

**Data lineage** tracks the origin and transformations of data. This is essential for:

- **Debugging** - Finding where problems came from
- **Auditing** - Proving data sources for compliance
- **Reproducibility** - Recreating analysis

For this course, we use a simple approach:

1. Metadata columns in the data (`_source_file`, `_extracted_at`)
2. A README file documenting the pipeline
3. Log files recording what happened
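
For the third item, Python's built-in `logging` module is usually enough. The sketch below writes pipeline events to a log file; the logger name, the helper `get_pipeline_logger`, and the messages are illustrative assumptions, and the log is written to a temporary folder for the demonstration.

```python
import logging
import tempfile
from pathlib import Path

def get_pipeline_logger(log_path):
    """Return a logger that appends timestamped messages to log_path."""
    logger = logging.getLogger("etl_pipeline")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers when a cell is rerun
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger

log_path = Path(tempfile.mkdtemp()) / "pipeline.log"
logger = get_pipeline_logger(log_path)

# Record what happened at each step
logger.info("Extracted %d rows from %s", 4187, "crop_production_1990_2023.csv")
logger.warning("Some values could not be parsed and were set to null")

log_text = log_path.read_text(encoding="utf-8")
print(log_text)
```

Unlike `print`, log entries carry a timestamp and a severity level, and they are still there after the notebook is closed.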

In [20]:
# Create a lineage README

def create_lineage_readme(source_file, raw_file, cleaned_file, transformations):
    """
    Create a README documenting the data pipeline.
    """
    readme = f"""
# Data Lineage Documentation

Generated: {datetime.now().isoformat()}

## Source
- File: {source_file}

## Raw Layer
- File: {raw_file}
- Contains exact copy of source with metadata columns added

## Cleaned Layer  
- File: {cleaned_file}

## Transformations Applied
"""
    for t in transformations:
        readme += f"- {t}\n"
    
    return readme

# Document what we did
transformations = [
    "Stripped whitespace from text columns",
    "Converted Year from string to integer (handled X.0 format)",
    "Converted numeric columns to float (handled European decimals)",
    "Replaced missing codes (NA, .., -, N/A) with NULL",
    "Removed footnote markers (*, **, (e), (p)) from numbers"
]

readme = create_lineage_readme(
    'crop_production_1990_2023.csv',
    raw_file,
    cleaned_file,
    transformations
)

# Save README
with open('data/cleaned/README.md', 'w') as f:
    f.write(readme)

print(readme)


# Data Lineage Documentation

Generated: 2026-01-14T09:30:29.381548

## Source
- File: crop_production_1990_2023.csv

## Raw Layer
- File: data/raw/crop_production_raw_20260114_093017.csv
- Contains exact copy of source with metadata columns added

## Cleaned Layer  
- File: data/cleaned/crop_production_cleaned_20260114_093017.csv

## Transformations Applied
- Stripped whitespace from text columns
- Converted Year from string to integer (handled X.0 format)
- Converted numeric columns to float (handled European decimals)
- Replaced missing codes (NA, .., -, N/A) with NULL
- Removed footnote markers (*, **, (e), (p)) from numbers



---
## 7. Complete ETL Pipeline Example

Let's put everything together into a reusable pipeline.

In [21]:
def run_etl_pipeline(source_file, source_name):
    """
    Complete ETL pipeline for CSV files.
    
    Parameters:
    -----------
    source_file : str
        Path to source CSV file
    source_name : str
        Identifier for this data source
    
    Returns:
    --------
    tuple of (dict with pipeline results, cleaned DataFrame)
    """
    print(f"{'='*60}")
    print(f"ETL Pipeline: {source_name}")
    print(f"{'='*60}")
    
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    results = {'source': source_file, 'timestamp': timestamp}
    
    # EXTRACT
    print("\n[1/3] EXTRACT...")
    df_raw = extract_to_raw(source_file, source_name)
    results['rows_extracted'] = len(df_raw)
    
    # Save raw
    raw_path = f'data/raw/{source_name}_raw_{timestamp}.csv'
    df_raw.to_csv(raw_path, index=False)
    results['raw_file'] = raw_path
    
    # TRANSFORM
    print("\n[2/3] TRANSFORM...")
    df_cleaned = transform_to_cleaned(df_raw)
    results['rows_cleaned'] = len(df_cleaned)
    
    # LOAD (save cleaned)
    print("\n[3/3] LOAD...")
    cleaned_path = f'data/cleaned/{source_name}_cleaned_{timestamp}.csv'
    df_cleaned.to_csv(cleaned_path, index=False)
    results['cleaned_file'] = cleaned_path
    
    print(f"\n{'='*60}")
    print("Pipeline complete!")
    print(f"  Rows extracted: {results['rows_extracted']}")
    print(f"  Rows cleaned: {results['rows_cleaned']}")
    print(f"  Output: {cleaned_path}")
    print(f"{'='*60}")
    
    return results, df_cleaned

# Run the pipeline
results, df_final = run_etl_pipeline(
    'crop_production_1990_2023.csv',
    'crop_production'
)

ETL Pipeline: crop_production

[1/3] EXTRACT...
Extracted 4187 rows from crop_production_1990_2023.csv

[2/3] TRANSFORM...

[3/3] LOAD...

Pipeline complete!
  Rows extracted: 4187
  Rows cleaned: 4187
  Output: data/cleaned/crop_production_cleaned_20260114_093034.csv
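
A pipeline that runs unattended should check its own results. The sketch below is one simple check on a results dictionary with the same `rows_extracted` and `rows_cleaned` keys as above; the function name `check_row_counts` and the failing example are hypothetical.

```python
def check_row_counts(results: dict) -> list:
    """Return a list of problems found in a pipeline results dict (empty if none)."""
    problems = []
    extracted = results.get("rows_extracted", 0)
    cleaned = results.get("rows_cleaned", 0)

    if extracted == 0:
        problems.append("no rows were extracted")
    if cleaned != extracted:
        problems.append(
            f"row count changed during cleaning: {extracted} -> {cleaned}"
        )
    return problems

ok_run = {"rows_extracted": 4187, "rows_cleaned": 4187}
bad_run = {"rows_extracted": 4187, "rows_cleaned": 4100}  # hypothetical failure

print(check_row_counts(ok_run))
print(check_row_counts(bad_run))
```

Matching row counts do not prove the values are right, but a mismatch is a cheap early warning that rows were dropped or duplicated.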


---
## Summary

### Key Concepts

1. **ETL = Extract, Transform, Load**
   - Automated pipelines for recurring data

2. **CSV Reading**
   - Read as strings first for safety
   - Handle encodings, delimiters, missing values explicitly

3. **JSON Parsing**
   - Use `json.load()` for files
   - Use `json_normalize()` for nested structures

4. **Layered Architecture**
   - Raw layer: exact copy with metadata
   - Cleaned layer: transformed and validated

5. **Data Lineage**
   - Track source, timestamp, transformations
   - Document in README files

---

### Next Class: ETL Pipeline II - APIs and Parameters

On Wednesday, we will learn:
- How to call REST APIs from Python
- The Open-Meteo weather API (used in Assignment 2)
- Handling authentication and rate limits
- Pagination for large datasets

---