# `FRE 521D_Assignment 2_Group 3`
### Members: Janine, Juliette, Margaret & Clare

## Task 1: Pipeline Architecture Design

### (a) Data Flow Diagram

The ETL pipeline follows a layered Extract–Transform–Load–Aggregate architecture as illustrated in Figure 1.
Weather data is extracted from the Open-Meteo Historical Weather API using country centroid coordinates from a CSV file. The pipeline enforces rate limiting and retry logic to ensure reliable API access.
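
As a rough sketch of how the centroid CSV could drive extraction, the block below builds one request description per country. The CSV column names (`iso3_code`, `latitude`, `longitude`), the helper `build_requests`, the endpoint URL, and the daily variable names are assumptions for illustration and should be checked against the Open-Meteo documentation; the coordinates are toy values.

```python
import io
import pandas as pd

# Assumed endpoint and daily variable names; verify against the Open-Meteo docs.
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"
DAILY_VARS = [
    "temperature_2m_mean",
    "temperature_2m_max",
    "temperature_2m_min",
    "precipitation_sum",
    "rain_sum",
    "et0_fao_evapotranspiration",
]

def build_requests(centroids: pd.DataFrame, start_date: str, end_date: str) -> list:
    """Return one request description per country (hypothetical helper)."""
    jobs = []
    for row in centroids.itertuples(index=False):
        params = {
            "latitude": float(row.latitude),
            "longitude": float(row.longitude),
            "start_date": start_date,
            "end_date": end_date,
            "daily": ",".join(DAILY_VARS),
            "timezone": "UTC",
        }
        jobs.append({"iso3_code": row.iso3_code, "url": ARCHIVE_URL, "params": params})
    return jobs

# Toy stand-in for the country centroid CSV (values are illustrative only).
csv_text = "iso3_code,latitude,longitude\nCAN,56.1,-106.3\nBRA,-14.2,-51.9\n"
centroids = pd.read_csv(io.StringIO(csv_text))
jobs = build_requests(centroids, "2020-01-01", "2020-12-31")
```

Keeping request construction separate from the HTTP call makes it possible to inspect or test the parameters without touching the network.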

During the transformation stage, JSON responses are flattened, cleaned, validated, and standardized. The transformed data is then loaded into a daily weather table using upsert operations to prevent duplication.
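
The flattening and cleaning steps might look roughly like the sketch below. It assumes the response carries a column-oriented `daily` block (a `time` list plus one list per variable); `flatten_daily` is a hypothetical helper and the sample payload is made up.

```python
import pandas as pd

def flatten_daily(payload: dict, iso3_code: str) -> pd.DataFrame:
    """Turn a column-oriented 'daily' block into one row per date (hypothetical helper)."""
    df = pd.DataFrame(payload.get("daily", {}))
    if df.empty:
        return pd.DataFrame(columns=["iso3_code", "date"])
    df = df.rename(columns={"time": "date"})
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df.insert(0, "iso3_code", iso3_code)
    # Coerce measurement columns to numeric; unparseable values become NaN.
    value_cols = [c for c in df.columns if c not in ("iso3_code", "date")]
    df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")
    # Drop rows without a usable date, then remove duplicate country-date pairs.
    df = df.dropna(subset=["date"]).drop_duplicates(subset=["iso3_code", "date"])
    return df.reset_index(drop=True)

# Made-up payload in the assumed shape.
sample = {
    "daily": {
        "time": ["2020-01-01", "2020-01-02"],
        "temperature_2m_mean": [1.5, None],
        "precipitation_sum": [0.0, 3.2],
    }
}
flat = flatten_daily(sample, "CAN")
```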

Finally, daily weather data is aggregated into monthly and annual summary tables. These aggregated datasets are joined with crop production data to produce an integrated analytical view used for business analysis.

---
```
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                        ETL PIPELINE ARCHITECTURE                                    │
├─────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                     │
│  ┌─────────────┐     ┌─────────────┐      ┌─────────────┐      ┌─────────────┐      │
│  │   EXTRACT   │────>│  TRANSFORM  │─────>│    LOAD     │─────>│  AGGREGATE  │      │
│  └─────────────┘     └─────────────┘      └─────────────┘      └─────────────┘      │
│        │                   │                   │                   │                │
│        ▼                   ▼                   ▼                   ▼                │
│  ┌──────────────┐    ┌───────────┐       ┌──────────────┐    ┌───────────────────┐  │
│  │  Open-Meteo  │    │ Flatten   │       │daily_weather │    │ monthly_weather   │  │
│  │   API        │    │ JSON      │       │ Table        │    │                   │  │
│  │              │    │           │       │              │    │- Monthly & Annual │  │
│  │ - Rate limit │    │ - Parse   │       │ - Upsert     │    │- Calculate Metrics│  │
│  │ - Retry      │    │ - Validate│       │ - Dedupe     │    │- Join with        │  │
│  │ - Country CSV│    │ - Clean   │       │ - Index      │    │  production data  │  │
│  └──────────────┘    └───────────┘       └──────────────┘    └───────────────────┘  │
│        │                   │                   │                       │            │
│        ▼                   ▼                   ▼───────────────────────▼───────┐    │
│     Raw Layer          Cleaned Layer        Aggregate Layers           │       │    │
│        │                   │                   │                       │       │    │
│        ▼                   ▼                   ▼                       ▼       ▼    │
│  ┌───────────┐       ┌───────────┐       ┌───────────┐  ┌────────────────────────┐  │
│  │  Logging  │       │  Logging  │       │  Logging  │  │        Logging         │  │
│  │ - Success │       │ - Records │       │ - Inserts │  │- weather & crop Data   │  │
│  │ - Errors  │       │ - Nulls   │       │ - Commits │  │- Country + Year level  │  │
│  │ - Timing  │       │ - Types   │       │ - Errors  │  │    ┌───────────────┐   │  │
│  └───────────┘       └───────────┘       └───────────┘  └────│ANALYSIS READY │───┘  │
│                                                              └───────────────┘      │
└─────────────────────────────────────────────────────────────────────────────────────┘
```
**Figure 1:** ETL pipeline data flow illustrating extraction from the Open-Meteo API, transformation through raw and cleaned layers, aggregation, and integration with crop production data.

```
┌─────────────────────────────────────────────────────────────────────┐
│                    TABLE RELATIONSHIPS                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   A-1 TABLES                         A-2 TABLES                     │
│   ──────────                         ──────────                     │
│                                                                     │
│   ┌──────────────┐                   ┌──────────────┐               │
│   │crop_production│                  │daily_weather │               │
│   │              │                   │              │               │
│   │ iso3_code ───┼───────────────────┼─ iso3_code   │               │
│   │ year      ───┼───┐               │ date         │               │
│   │ crop         │   │               └──────────────┘               │
│   │ production   │   │                      │                       │
│   │ yield        │   │                      │ Aggregate             │
│   └──────────────┘   │                      ▼                       │
│          │           │               ┌──────────────┐               │
│          │           │               │annual_weather│               │
│          │           │               │              │               │
│          │           └───────────────┼─ iso3_code   │               │
│          │                           │ year ────────┼───┐           │
│          │                           │ weather vars │   │           │
│          │                           └──────────────┘   │           │
│          │                                              │           │
│          │              JOIN ON                         │           │
│          │         iso3_code + year                     │           │
│          │                                              │           │
│          ▼                                              ▼           │
│   ┌─────────────────────────────────────────────────────────┐       │
│   │              climate_agriculture_analysis               │       │
│   │                   (Integrated View)                     │       │
│   │                                                         │       │
│   │  - Country attributes (name, region, income group)      │       │
│   │  - Crop metrics (production, yield, area, fertilizer)   │       │
│   │  - Climate metrics (temp, precip, GDD, extremes)        │       │
│   │  - Derived: water balance, temp bucket                  │       │
│   └─────────────────────────────────────────────────────────┘       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
**Figure 2:** Table relationships showing integration between Assignment 1 crop production tables and new weather tables using `iso3_code` and `year`.

### (b) Schema for New Weather Tables

The ETL pipeline introduces several new tables to store weather data:

**Daily Weather Table (`daily_weather`)**

Stores cleaned daily weather observations including:
- iso3_code
- date
- temperature metrics (mean, max, min)
- precipitation and rain totals
- evapotranspiration (ET0)

This table enforces uniqueness on (`iso3_code`, `date`) and serves as the cleaned data layer.
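
A minimal sketch of this schema and the upsert behaviour follows. It uses an in-memory SQLite database purely for illustration, since the target database is not stated here, and the column names (`temp_mean`, `precip_sum`, `et0`, and so on) are assumed rather than taken from the actual table.

```python
import sqlite3

# Assumed column names; the composite UNIQUE constraint is what enables the upsert.
DDL = """
CREATE TABLE IF NOT EXISTS daily_weather (
    iso3_code  TEXT NOT NULL,
    date       TEXT NOT NULL,
    temp_mean  REAL,
    temp_max   REAL,
    temp_min   REAL,
    precip_sum REAL,
    rain_sum   REAL,
    et0        REAL,
    UNIQUE (iso3_code, date)
);
"""

UPSERT = """
INSERT INTO daily_weather
    (iso3_code, date, temp_mean, temp_max, temp_min, precip_sum, rain_sum, et0)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT (iso3_code, date) DO UPDATE SET
    temp_mean  = excluded.temp_mean,
    temp_max   = excluded.temp_max,
    temp_min   = excluded.temp_min,
    precip_sum = excluded.precip_sum,
    rain_sum   = excluded.rain_sum,
    et0        = excluded.et0;
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# First load, then a re-load of the same country-date key with a revised value.
conn.executemany(UPSERT, [("CAN", "2020-01-01", -5.0, -1.0, -9.0, 2.0, 0.0, 0.4)])
conn.executemany(UPSERT, [("CAN", "2020-01-01", -4.5, -1.0, -9.0, 2.0, 0.0, 0.4)])
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM daily_weather").fetchone()[0]
temp_mean = conn.execute("SELECT temp_mean FROM daily_weather").fetchone()[0]
```

Re-running the load for the same country and date updates the existing row instead of inserting a duplicate, which is the property the pipeline relies on.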

**Monthly Weather Table (`monthly_weather`)**

Stores monthly aggregated climate summaries by country and month.

**Annual Weather Table (`annual_weather`)**

Stores annual aggregated climate indicators by country and year, including derived metrics such as:
- Growing Degree Days (GDD)
- Precipitation variability
- Extreme temperature counts
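
The derived metrics could be computed along the lines below. This is only a sketch: the 10 °C GDD base temperature, the 35 °C hot-day threshold, the use of the standard deviation of daily precipitation as the variability measure, and the column names are all assumptions, not values confirmed by the pipeline.

```python
import pandas as pd

BASE_TEMP_C = 10.0  # assumed GDD base temperature
HOT_DAY_C = 35.0    # assumed threshold for an extreme-heat day

def aggregate_annual(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily rows to one row per country and year (hypothetical helper)."""
    df = daily.copy()
    df["year"] = pd.to_datetime(df["date"]).dt.year
    # Daily GDD contribution: degrees of mean temperature above the base, floored at 0.
    df["gdd"] = (df["temp_mean"] - BASE_TEMP_C).clip(lower=0)
    df["hot_day"] = df["temp_max"] > HOT_DAY_C
    return df.groupby(["iso3_code", "year"], as_index=False).agg(
        temp_mean=("temp_mean", "mean"),
        precip_total=("precip_sum", "sum"),
        precip_std=("precip_sum", "std"),
        gdd=("gdd", "sum"),
        hot_days=("hot_day", "sum"),
    )

# Toy daily data for illustration.
daily = pd.DataFrame({
    "iso3_code": ["CAN"] * 3,
    "date": ["2020-07-01", "2020-07-02", "2020-07-03"],
    "temp_mean": [8.0, 15.0, 22.0],
    "temp_max": [12.0, 25.0, 36.0],
    "precip_sum": [0.0, 4.0, 2.0],
})
annual = aggregate_annual(daily)
```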

### (c) Relationship to Assignment 1 Tables

Weather data is integrated with existing Assignment 1 crop production tables using shared temporal and geographic keys.

The `daily_weather` table is aggregated into the `annual_weather` table by `iso3_code` and `year`.

The `annual_weather` table is then joined with the `crop_production` table on `iso3_code` and `year`.


This produces the integrated view `climate_agriculture_analysis`, which combines:
- Country attributes (name, region, income group)
- Crop metrics (production, yield, area, fertilizer use)
- Climate metrics (temperature, precipitation, GDD, extremes)
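
In pandas, the join could be sketched as follows. The frames hold toy values, the non-key column names are assumed, and a left join is one possible choice (it keeps crop rows that have no matching weather record so that gaps stay visible).

```python
import pandas as pd

# Toy stand-ins for the two tables; values are illustrative only.
crop_production = pd.DataFrame({
    "iso3_code": ["CAN", "CAN", "BRA"],
    "year": [2019, 2020, 2020],
    "crop": ["Wheat", "Wheat", "Maize"],
    "production": [100.0, 110.0, 250.0],
})
annual_weather = pd.DataFrame({
    "iso3_code": ["CAN", "BRA"],
    "year": [2020, 2020],
    "temp_mean": [2.1, 24.8],
    "gdd": [900.0, 5200.0],
})

# validate="many_to_one" raises if annual_weather has duplicate (iso3_code, year) keys.
analysis = crop_production.merge(
    annual_weather,
    on=["iso3_code", "year"],
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Crop rows with no weather match (here, CAN 2019).
unmatched = int((analysis["_merge"] == "left_only").sum())
```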

### (d) Error Handling and API Rate Limit Strategy

To ensure reliable data extraction, the pipeline implements:
- A minimum 5-second delay between API requests to comply with rate limits
- Retry logic with up to three attempts for failed requests
- Exponential backoff between retries

Errors, failures, and processing times are logged at each ETL stage to allow monitoring and troubleshooting.
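
The strategy above can be sketched as a small wrapper around any fetch function. `RateLimitedFetcher` and its parameters are hypothetical names; the 5-second spacing and three attempts mirror the figures stated above, while the backoff base of 2 is an assumption.

```python
import time

MIN_DELAY_S = 5.0   # minimum spacing between requests, as stated above
MAX_ATTEMPTS = 3    # maximum attempts per request, as stated above

class RateLimitedFetcher:
    """Wrap a fetch callable with request spacing, retries, and exponential backoff."""

    def __init__(self, fetch, min_delay=MIN_DELAY_S, max_attempts=MAX_ATTEMPTS,
                 backoff_base=2.0, sleep=time.sleep, clock=time.monotonic):
        self.fetch = fetch
        self.min_delay = min_delay
        self.max_attempts = max_attempts
        self.backoff_base = backoff_base  # assumed base: waits of 2 s, 4 s, ...
        self.sleep = sleep
        self.clock = clock
        self._last_call = None

    def _wait_for_slot(self):
        # Sleep only for the part of min_delay that has not already elapsed.
        if self._last_call is not None:
            remaining = self.min_delay - (self.clock() - self._last_call)
            if remaining > 0:
                self.sleep(remaining)
        self._last_call = self.clock()

    def get(self, *args, **kwargs):
        last_error = None
        for attempt in range(1, self.max_attempts + 1):
            self._wait_for_slot()
            try:
                return self.fetch(*args, **kwargs)
            except Exception as exc:  # a real pipeline would catch narrower errors
                last_error = exc
                if attempt < self.max_attempts:
                    self.sleep(self.backoff_base ** attempt)
        raise RuntimeError(f"failed after {self.max_attempts} attempts") from last_error

# Demo with a fake fetch that fails twice; sleeps are recorded instead of performed.
calls = {"n": 0}

def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return {"url": url, "ok": True}

waits = []
fetcher = RateLimitedFetcher(flaky, min_delay=0.0, sleep=waits.append)
result = fetcher.get("https://example.invalid/archive")
```

In the pipeline, the fetch callable would typically wrap `requests.get(url, params=params, timeout=...)` followed by `raise_for_status()`, so that HTTP errors also trigger a retry. Injecting `sleep` and `clock` keeps the wrapper testable without real waiting.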

### (e) Data Lineage and Logging

The pipeline tracks data lineage through both metadata storage and operational logging.

Each extracted weather record includes:
- Source identifier (Open-Meteo API)
- Extraction timestamp

Logging is implemented across the ETL process to capture:
- Extraction success or failure
- Record counts
- Data validation results (nulls and type issues)
- Load confirmations and database errors

This ensures transparency, reproducibility, and quality control throughout the pipeline.
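
One way to attach lineage metadata and log validation counts is sketched below. The helper names (`add_lineage`, `log_quality`), the logger name, the source label, and the lineage column names are assumptions made for illustration.

```python
import logging
from datetime import datetime, timezone

import pandas as pd

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("etl.weather")  # assumed logger name

SOURCE_NAME = "open-meteo-api"  # assumed source label

def add_lineage(df: pd.DataFrame, source: str = SOURCE_NAME) -> pd.DataFrame:
    """Tag every record with its source and a UTC extraction timestamp."""
    out = df.copy()
    out["source"] = source
    out["extracted_at"] = datetime.now(timezone.utc).isoformat()
    return out

def log_quality(df: pd.DataFrame, stage: str) -> dict:
    """Log record and null counts for a stage and return them for later inspection."""
    summary = {
        "stage": stage,
        "records": len(df),
        "nulls": int(df.isna().sum().sum()),
    }
    logger.info("stage=%(stage)s records=%(records)d nulls=%(nulls)d", summary)
    return summary

# Toy records for illustration.
records = pd.DataFrame({
    "iso3_code": ["CAN", "CAN"],
    "date": ["2020-01-01", "2020-01-02"],
    "temp_mean": [1.5, None],
})
tagged = add_lineage(records)
summary = log_quality(tagged, "transform")
```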


## Task 2: ETL Pipeline Implementation

```python
# ============================================
# ETL PIPELINE SETUP
# ============================================

# Standard imports for ETL work
import pandas as pd
import numpy as np
import requests
import json
import time
import os
from datetime import datetime

# For database connection
from sqlalchemy import create_engine

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print(f"Pandas version: {pd.__version__}")
print(f"Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("Setup complete!")
```

```
Pandas version: 2.3.3
Current time: 2026-01-29 02:54:59
Setup complete!
```
