# FRE 521D: Data Analytics in Climate, Food and Environment
## Lab 2: Building an ETL Pipeline

**Program:** UBC Master of Food and Resource Economics  
**Instructor:** Asif Ahmed Neloy

---

<div style="background-color: #FFF3CD; border-left: 4px solid #E6A23C; padding: 15px; margin: 15px 0;">
    <h3 style="margin-top: 0; color: #856404;">Submission Deadline</h3>
    <p style="margin-bottom: 0; font-size: 1.2em;"><strong>End of Day: Wednesday, January 14, 2026</strong></p>
</div>

---

## Lab Objectives

In this lab, you will build an ETL (Extract, Transform, Load) pipeline for climate change data. You will:

1. **Extract**: Load the raw CSV data safely and add metadata columns
2. **Transform**: Clean column names and select relevant columns
3. **Load**: Save both raw and cleaned layers with proper documentation

---

## Dataset Description

The `climate_change_indicators.csv` file contains **surface temperature change** data from the Food and Agriculture Organization of the United Nations (FAO). Each year column (`F1961` through `F2022`) reports, in degrees Celsius, how much warmer or cooler each country was compared to the 1951–1980 baseline period.
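
To make "change relative to a baseline" concrete, here is a minimal sketch with made-up numbers. It assumes the anomaly is simply a year's mean temperature minus the average over the baseline years. The temperature values and the helper name `temperature_anomaly` are hypothetical, and the FAO's actual processing may involve additional steps.

```python
import pandas as pd

# Hypothetical annual mean temperatures (degrees Celsius) for one location.
# These numbers are invented for illustration; they are not from the FAO file.
annual_mean = pd.Series(
    {1951: 10.0, 1960: 10.2, 1970: 9.8, 1980: 10.0, 2000: 10.6, 2022: 11.5}
)

def temperature_anomaly(series, baseline_start=1951, baseline_end=1980):
    """Return each year's value minus the mean over the baseline years."""
    in_baseline = (series.index >= baseline_start) & (series.index <= baseline_end)
    baseline_mean = series[in_baseline].mean()
    return series - baseline_mean

anomaly = temperature_anomaly(annual_mean)
print(anomaly.round(2))
```

A positive value means the year was warmer than the baseline average, and a negative value means it was cooler. This is how the `F`-prefixed columns in the dataset should be read.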

---

## Setup: Import Libraries and Create Directories

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os
from datetime import datetime

# Display settings
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', None)

# Create directory structure for raw and cleaned layers
os.makedirs('data/raw', exist_ok=True)
os.makedirs('data/cleaned', exist_ok=True)

print("Setup complete!")
print(f"Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Setup complete!
Current time: 2026-01-14 10:25:09


## Explore the Raw Data First

As covered in Lecture 3, **always inspect your data before loading it**.

In [2]:
# Look at the first few lines as raw text
with open('climate_change_indicators.csv', 'r', encoding='utf-8-sig') as f:
    for i, line in enumerate(f):
        if i < 3:
            print(f"Line {i}: {line[:100]}...")
        else:
            break

Line 0: ObjectId,Country,ISO2,ISO3,Indicator,Unit,Source,CTS_Code,CTS_Name,CTS_Full_Descriptor,F1961,F1962,F...
Line 1: 1,"Afghanistan, Islamic Rep. of",AF,AFG,"Temperature change with respect to a baseline climatology, ...
Line 2: 2,Albania,AL,ALB,"Temperature change with respect to a baseline climatology, corresponding to the pe...


In [3]:
# Load and explore the data
df = pd.read_csv('climate_change_indicators.csv', encoding='utf-8-sig')

print(f"Dataset shape: {df.shape[0]} rows, {df.shape[1]} columns")
print("\nColumn names:")
for i, col in enumerate(df.columns):
    print(f"  {i}: {col}")

Dataset shape: 225 rows, 72 columns

Column names:
  0: ObjectId
  1: Country
  2: ISO2
  3: ISO3
  4: Indicator
  5: Unit
  6: Source
  7: CTS_Code
  8: CTS_Name
  9: CTS_Full_Descriptor
  10: F1961
  11: F1962
  12: F1963
  13: F1964
  14: F1965
  15: F1966
  16: F1967
  17: F1968
  18: F1969
  19: F1970
  20: F1971
  21: F1972
  22: F1973
  23: F1974
  24: F1975
  25: F1976
  26: F1977
  27: F1978
  28: F1979
  29: F1980
  30: F1981
  31: F1982
  32: F1983
  33: F1984
  34: F1985
  35: F1986
  36: F1987
  37: F1988
  38: F1989
  39: F1990
  40: F1991
  41: F1992
  42: F1993
  43: F1994
  44: F1995
  45: F1996
  46: F1997
  47: F1998
  48: F1999
  49: F2000
  50: F2001
  51: F2002
  52: F2003
  53: F2004
  54: F2005
  55: F2006
  56: F2007
  57: F2008
  58: F2009
  59: F2010
  60: F2011
  61: F2012
  62: F2013
  63: F2014
  64: F2015
  65: F2016
  66: F2017
  67: F2018
  68: F2019
  69: F2020
  70: F2021
  71: F2022


In [4]:
# View sample of the data
df.head(3)

| | ObjectId | Country | ISO2 | ISO3 | Indicator | Unit | Source | ... | F2016 | F2017 | F2018 | F2019 | F2020 | F2021 | F2022 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Afghanistan, Islamic Rep. of | AF | AFG | Temperature change with respect to a baseline ... | Degree Celsius | Food and Agriculture Organization of the Unite... | ... | 1.555 | 1.54 | 1.544 | 0.91 | 0.498 | 1.327 | 2.012 |
| 1 | 2 | Albania | AL | ALB | Temperature change with respect to a baseline ... | Degree Celsius | Food and Agriculture Organization of the Unite... | ... | 1.464 | 1.121 | 2.028 | 1.675 | 1.498 | 1.536 | 1.518 |
| 2 | 3 | Algeria | DZ | DZA | Temperature change with respect to a baseline ... | Degree Celsius | Food and Agriculture Organization of the Unite... | ... | 1.757 | 1.512 | 1.21 | 1.115 | 1.926 | 2.33 | 1.688 |


---
---

# Question 1: Extract Phase - Create Raw Layer 
## Task: Extract Data and Add Metadata Columns

As discussed in Lecture 3, the **raw layer** should:
1. Contain the exact source data (no transformations)
2. Add metadata columns to track data lineage

**Complete the function below** to add these metadata columns:
- `_source_file`: The name of the source file (use `os.path.basename()`)
- `_extracted_at`: The timestamp when extracted (use `datetime.now().isoformat()`)
- `_row_num`: Row numbers starting from 1 (use `range(1, len(df) + 1)`)
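
The two raw-layer requirements above can also be checked programmatically. The following is a minimal sketch on toy data, not part of the lab template: the helper name `check_raw_layer` and the toy file name `toy.csv` are hypothetical, and the check assumes the metadata columns are the only difference between the source and the raw layer.

```python
import pandas as pd

# Names of the lineage columns described above.
METADATA_COLUMNS = ["_source_file", "_extracted_at", "_row_num"]

def check_raw_layer(df_source, df_raw):
    """Return a list of problems found in the raw layer (empty list = OK).

    Hypothetical helper for illustration; it is not part of the lab template.
    """
    problems = []
    # Every metadata column should be present.
    for col in METADATA_COLUMNS:
        if col not in df_raw.columns:
            problems.append(f"missing metadata column: {col}")
    # Apart from metadata, the raw layer should equal the source exactly.
    data_only = df_raw.drop(columns=METADATA_COLUMNS, errors="ignore")
    if not data_only.equals(df_source):
        problems.append("source columns were modified")
    # Row numbers should run 1..n with no gaps.
    if "_row_num" in df_raw.columns:
        if list(df_raw["_row_num"]) != list(range(1, len(df_raw) + 1)):
            problems.append("_row_num is not 1..n")
    return problems

# Toy data standing in for the real CSV.
source = pd.DataFrame({"Country": ["Albania", "Algeria"], "F2022": [1.518, 1.688]})
good_raw = source.assign(
    _source_file="toy.csv",
    _extracted_at="2026-01-14T10:25:09",
    _row_num=[1, 2],
)
bad_raw = good_raw.drop(columns="_row_num")

print(check_raw_layer(source, good_raw))
print(check_raw_layer(source, bad_raw))
```

An empty list means the raw layer passed all three checks. Returning a list of problems rather than raising on the first one makes it easier to see everything that needs fixing in a single run.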

---

In [None]:
def extract_to_raw(filepath):
    """
    Extract data from source file and add metadata columns.
    This creates the RAW LAYER of our ETL pipeline.
    
    Parameters:
    -----------
    filepath : str
        Path to the source CSV file
    
    Returns:
    --------
    pd.DataFrame with source data plus metadata columns
    """
    # Read the CSV file (keeping data as-is)
    df = pd.read_csv(filepath, encoding='utf-8-sig')
    
    # ============================================
    # YOUR CODE HERE: Add the 3 metadata columns
    # ============================================
    
    # 1. Add _source_file column
    
    
    # 2. Add _extracted_at column
    
    
    # 3. Add _row_num column
    
    
    # ============================================
    
    print(f"Extracted {len(df)} rows from {filepath}")
    return df

In [None]:
# Test your function
df_raw = extract_to_raw('climate_change_indicators.csv')

# Verify metadata columns were added
print("\nChecking metadata columns:")
print(f"  _source_file exists: {'_source_file' in df_raw.columns}")
print(f"  _extracted_at exists: {'_extracted_at' in df_raw.columns}")
print(f"  _row_num exists: {'_row_num' in df_raw.columns}")

# Display sample
print("\nSample with metadata columns:")
df_raw[['Country', 'ISO3', '_source_file', '_extracted_at', '_row_num']].head()

In [None]:
# Save the raw layer
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
raw_filepath = f'data/raw/climate_indicators_raw_{timestamp}.csv'
df_raw.to_csv(raw_filepath, index=False)
print(f"Raw layer saved to: {raw_filepath}")

---
---

# Question 2: Transform Phase - Create Cleaned Layer

## Task: Clean and Simplify the Data

The raw data has far more columns than we need (72 in the source file alone). For the **cleaned layer**, we will:
1. Select only the useful columns
2. Rename columns to be more database-friendly (lowercase, underscores)
3. Add a timestamp for when the data was cleaned

**Complete the function below** to:
1. Select these columns: `Country`, `ISO3`, `F2000`, `F2010`, `F2020`, `F2022`
2. Rename them to: `country`, `iso3`, `temp_2000`, `temp_2010`, `temp_2020`, `temp_2022`
3. Add a `_cleaned_at` column with the current timestamp
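
The select-then-rename pattern can be sketched on a toy table with invented columns (deliberately not the lab dataset). The helper `to_snake_case` is a hypothetical illustration of what "database-friendly" usually means: lowercase letters, digits, and underscores only.

```python
import re
import pandas as pd

def to_snake_case(name):
    """Lowercase a column name and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()

# Toy table with invented columns (not the lab dataset).
toy = pd.DataFrame({
    "Station Name": ["A", "B"],
    "Elevation (m)": [12, 340],
    "Notes": ["ok", "ok"],
})

# Select a subset first, then rename with an explicit old -> new mapping.
subset = toy[["Station Name", "Elevation (m)"]].copy()
mapping = {col: to_snake_case(col) for col in subset.columns}
subset = subset.rename(columns=mapping)

print(subset.columns.tolist())
```

An automatic helper like this only normalizes existing names. When the target names are prescribed and carry new meaning, as with `F2000` becoming `temp_2000` in this lab, an explicit hand-written mapping passed to `df.rename(columns=...)` is the clearer choice.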

---

In [None]:
def transform_to_cleaned(df_raw):
    """
    Transform raw data to cleaned layer.
    Selects relevant columns and standardizes names.
    
    Parameters:
    -----------
    df_raw : pd.DataFrame
        Raw data from extract phase
    
    Returns:
    --------
    pd.DataFrame with cleaned data
    """
    # ============================================
    # YOUR CODE HERE
    # ============================================
    
    # Step 1: Select only the columns we need
    # Columns to keep: 'Country', 'ISO3', 'F2000', 'F2010', 'F2020', 'F2022'
    columns_to_keep = ['Country', 'ISO3', 'F2000', 'F2010', 'F2020', 'F2022']
    df = df_raw[columns_to_keep].copy()
    
    # Step 2: Rename columns to be database-friendly
    # Use df.rename(columns={old_name: new_name, ...})
    # New names: 'country', 'iso3', 'temp_2000', 'temp_2010', 'temp_2020', 'temp_2022'
    
    df = df.rename(columns={
        # YOUR CODE: fill in the mapping
        
    })
    
    # Step 3: Add _cleaned_at timestamp column
    
    
    # ============================================
    
    print(f"Cleaned data: {len(df)} rows, {len(df.columns)} columns")
    return df

In [None]:
# Test your function
df_clean = transform_to_cleaned(df_raw)

# Verify the transformation
print("\nCleaned column names:")
print(df_clean.columns.tolist())

print("\nSample of cleaned data:")
df_clean.head(10)

In [None]:
# Save the cleaned layer
# (reuses `timestamp` from the raw-layer save cell, so run that cell first)
cleaned_filepath = f'data/cleaned/climate_indicators_cleaned_{timestamp}.csv'
df_clean.to_csv(cleaned_filepath, index=False)
print(f"Cleaned layer saved to: {cleaned_filepath}")

---
---

# Question 3: Data Lineage Documentation 

## Task: Document Your Pipeline

As discussed in Lecture 3, **data lineage** tracks:
- Where the data came from
- What transformations were applied
- When the pipeline was run

**Fill in the `transformations` list** with at least 4 steps your pipeline performed.
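
For context, the three lineage questions (where, what, when) can also be captured in a small structured record rather than free text. This is only a sketch of one possible design: the class name `LineageRecord` and its fields are hypothetical and are not required for this lab, which uses a plain list and a README instead.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime

@dataclass
class LineageRecord:
    """Hypothetical container for the three lineage questions."""
    # Where the data came from
    source_file: str
    # What transformations were applied, in order
    transformations: list = field(default_factory=list)
    # When the pipeline was run (ISO 8601 timestamp)
    run_at: str = field(default_factory=lambda: datetime.now().isoformat())

    def add_step(self, description):
        """Append one human-readable transformation description."""
        self.transformations.append(description)

record = LineageRecord(source_file="climate_change_indicators.csv")
record.add_step("Added metadata columns (_source_file, _extracted_at, _row_num)")

# Serialize to JSON so the record can be stored next to the data files.
print(json.dumps(asdict(record), indent=2))
```

A structured record like this is easy to parse later, while a Markdown README is easier for people to read. Either form answers the same three questions.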

---

In [None]:
# ============================================
# YOUR CODE HERE: List at least 4 transformations
# ============================================

transformations = [
    # Example: "Added metadata columns (_source_file, _extracted_at, _row_num)"
    # List what your pipeline did...
    
]

# ============================================

In [None]:
# Generate lineage documentation
lineage_doc = f"""
# Data Lineage Documentation

Generated: {datetime.now().isoformat()}

## Source
- File: climate_change_indicators.csv
- Description: FAO Climate Change Indicators - Surface Temperature Change
- Rows: {len(df_raw)}

## Raw Layer
- File: {raw_filepath}
- Contains exact copy of source with metadata columns added

## Cleaned Layer
- File: {cleaned_filepath}
- Columns: {df_clean.columns.tolist()}

## Transformations Applied
"""

for i, t in enumerate(transformations, 1):
    lineage_doc += f"{i}. {t}\n"

# Save and display
with open('data/cleaned/README.md', 'w', encoding='utf-8') as f:
    f.write(lineage_doc)

print(lineage_doc)

---

## Verify Your Work

Run this cell to check that everything was created correctly.

In [None]:
# Quick verification
print("=" * 50)
print("ETL Pipeline Summary")
print("=" * 50)

print("\n[EXTRACT] Raw layer:")
print(f"  - Rows: {len(df_raw)}")
print(f"  - Has _source_file: {'_source_file' in df_raw.columns}")
print(f"  - Has _extracted_at: {'_extracted_at' in df_raw.columns}")
print(f"  - Has _row_num: {'_row_num' in df_raw.columns}")

print("\n[TRANSFORM] Cleaned layer:")
print(f"  - Rows: {len(df_clean)}")
print(f"  - Columns: {df_clean.columns.tolist()}")

print("\n[LOAD] Files saved:")
print(f"  - Raw: {raw_filepath}")
print(f"  - Cleaned: {cleaned_filepath}")
print("  - README: data/cleaned/README.md")

print("\n" + "=" * 50)

---

## Submission Checklist

Before submitting, make sure:

- [ ] **Question 1**: `extract_to_raw()` adds all 3 metadata columns
- [ ] **Question 2**: `transform_to_cleaned()` selects and renames columns correctly
- [ ] **Question 3**: You listed at least 4 transformations
- [ ] All files were saved to `data/raw/` and `data/cleaned/`
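
The last checklist item can be confirmed from a notebook cell. The sketch below assumes the `data/raw` and `data/cleaned` folders created in the setup cell; the helper name `list_layer_files` is hypothetical.

```python
from pathlib import Path

def list_layer_files(base_dir="data"):
    """Return CSV file names found in the raw and cleaned layer folders."""
    base = Path(base_dir)
    result = {}
    for layer in ("raw", "cleaned"):
        folder = base / layer
        # A missing folder is reported as an empty list rather than an error.
        if folder.is_dir():
            result[layer] = sorted(p.name for p in folder.glob("*.csv"))
        else:
            result[layer] = []
    return result

found = list_layer_files()
for layer, files in found.items():
    status = "OK" if files else "no CSV files found"
    print(f"data/{layer}: {status} {files}")
```

Each layer should list at least one timestamped CSV file. An empty list usually means the corresponding save cell was not run.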

### How to Submit

1. Save this notebook
2. Submit via Canvas (Lab 2 Submission) by **end of day Wednesday, January 14, 2026**

---