# FRE 521D: Data Analytics in Climate, Food and Environment
## Lecture 4 & 5: ETL Pipeline II - APIs, Automation, and Cloud Tools

**Date:** Wednesday, January 14, 2026 & Monday, January 19, 2026  
**Instructor:** Asif Ahmed Neloy  
**Program:** UBC Master of Food and Resource Economics

---

### Today's Agenda

**Part 1: Working with APIs**
1. What is an API? How Does It Work?
2. Making HTTP Requests in Python
3. Understanding API Parameters
4. Authentication: API Keys and Tokens
5. Handling Pagination
6. Rate Limiting and Polite Requests

**Part 2: Building Robust Pipelines**
7. Error Handling and Retry Logic
8. Logging Your Pipeline
9. Making Pipelines Idempotent

**Part 3: Tools and Automation**
10. Introduction to BigQuery
11. Building Automation Scripts

---

## Part 1: Working with APIs

---

## 1. What is an API? How Does It Work?

### The Restaurant Analogy

Think of an API like ordering food at a restaurant:

```
┌─────────────┐         ┌─────────────┐         ┌─────────────┐
│   YOU       │         │   WAITER    │         │   KITCHEN   │
│  (Client)   │ ──────> │   (API)     │ ──────> │  (Server)   │
│             │ <────── │             │ <────── │             │
│  Order food │         │ Takes order │         │ Prepares    │
│  Get food   │         │ Brings food │         │ the food    │
└─────────────┘         └─────────────┘         └─────────────┘
```

- **You** don't go into the kitchen
- **You** use a menu (API documentation) to know what's available
- **You** place an order (make a request) using a specific format
- **Waiter** delivers your food (response) in a predictable format

### API in Technical Terms

**API** = Application Programming Interface

A **REST API** uses standard HTTP methods:

| Method | Purpose | Example |
|--------|---------|----------|
| **GET** | Retrieve data | Get list of countries |
| **POST** | Send new data | Submit a new record |
| **PUT** | Update existing data | Update a record |
| **DELETE** | Remove data | Delete a record |

For data extraction, we use **GET** almost exclusively.

### Anatomy of an API Request

```
https://api.worldbank.org/v2/country/CAN/indicator/NY.GDP.PCAP.CD?format=json&date=2020
└──────────┬───────────┘└───────────────────┬─────────────────┘└──────────┬────────────┘
       Base URL                         Endpoint                   Query Parameters
```

- **Base URL**: The server address
- **Endpoint**: The specific resource you want
- **Query Parameters**: Filters and options (after the `?`)
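
The three parts can be pulled out of a URL with Python's standard `urllib.parse` module. This is a minimal sketch using the World Bank URL from the diagram. Where the base URL ends and the endpoint begins is a convention, and here the version prefix `/v2` is assumed to belong to the base URL.

```python
from urllib.parse import urlsplit, parse_qs

url = ("https://api.worldbank.org/v2/country/CAN/indicator/NY.GDP.PCAP.CD"
       "?format=json&date=2020")

parts = urlsplit(url)

# Assumption for illustration: the version prefix "/v2" is part of the base URL
base_url = f"{parts.scheme}://{parts.netloc}/v2"
endpoint = parts.path[len("/v2"):]
query_params = parse_qs(parts.query)   # each value comes back as a list

print("Base URL:        ", base_url)
print("Endpoint:        ", endpoint)
print("Query parameters:", query_params)
```

Note that `parse_qs` returns every value as a list, because a query string may repeat the same key.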

---
## 2. Setting Up and Making Your First API Request

We need the `requests` library to call APIs from Python.

In [None]:
# Install required packages (run once)
# Uncomment if needed

# !pip install requests pandas numpy

In [None]:
# Import libraries
import requests
import pandas as pd
import numpy as np
import json
import time
import logging
from datetime import datetime

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print(f"Setup complete - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

### Your First API Call: REST Countries

Let's start with a simple, free API that requires no authentication: [REST Countries](https://restcountries.com/)

This API provides information about countries worldwide - useful for joining with economic or environmental data.

In [None]:
# Making a simple GET request
# Let's get information about Canada

url = "https://restcountries.com/v3.1/name/canada"

# Make the request
response = requests.get(url)

# Check what we got back
print(f"Status Code: {response.status_code}")
print(f"Response Type: {type(response)}")
print(f"Content Type: {response.headers.get('Content-Type')}")

### Understanding HTTP Status Codes

The status code tells you if your request succeeded:

| Code | Meaning | What To Do |
|------|---------|------------|
| **200** | Success | Process the data |
| **400** | Bad Request | Check your parameters |
| **401** | Unauthorized | Check your API key |
| **403** | Forbidden | You don't have permission |
| **404** | Not Found | Check the URL/endpoint |
| **429** | Too Many Requests | You hit the rate limit - slow down |
| **500** | Server Error | Try again later |
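
The table can be turned into a small helper that groups status codes by range. This is only a sketch: the function name `describe_status` and the wording of the suggested actions are illustrative, and a real API may document more specific meanings for individual codes.

```python
from http import HTTPStatus

def describe_status(code):
    """Return (category, suggested action) for an HTTP status code."""
    if 200 <= code < 300:
        return "success", "process the data"
    if code == 429:
        return "rate limited", "slow down, then retry"
    if 400 <= code < 500:
        return "client error", "fix the request; retrying will not help"
    if 500 <= code < 600:
        return "server error", "try again later"
    return "other", "check the API documentation"

for code in (200, 404, 429, 500):
    category, action = describe_status(code)
    # HTTPStatus supplies the standard reason phrase for each code
    print(f"{code} {HTTPStatus(code).phrase}: {category} -> {action}")
```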

In [None]:
# Parse the JSON response
if response.status_code == 200:
    data = response.json()  # Convert JSON string to Python object
    
    print(f"Response is a: {type(data)}")
    print(f"Number of items: {len(data)}")
    
    # Look at the structure
    print("\nKeys in first item:")
    print(list(data[0].keys()))
else:
    print(f"Request failed with status {response.status_code}")

In [None]:
# Extract specific information
canada = data[0]

print("Country Information:")
print(f"  Name: {canada['name']['common']}")
print(f"  Official Name: {canada['name']['official']}")
print(f"  Capital: {canada['capital'][0]}")
print(f"  Region: {canada['region']}")
print(f"  Subregion: {canada['subregion']}")
print(f"  Population: {canada['population']:,}")
print(f"  Area (km²): {canada['area']:,}")
print(f"  Currencies: {list(canada['currencies'].keys())}")
print(f"  Languages: {list(canada['languages'].values())}")
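
Direct indexing such as `canada['capital'][0]` raises an error when a field is missing from a record. One way to guard against that is a small lookup helper. The sketch below is an assumption-laden illustration: `dig` is a hypothetical helper name, and the sample record is invented to mimic the shape of the response above with the `capital` field left out.

```python
def dig(record, *path, default=None):
    """Follow keys/indexes through nested dicts and lists; return default if any step is missing."""
    current = record
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current

# Hypothetical record with the same shape as the response above, minus 'capital'
sample = {"name": {"common": "Exampleland"}, "currencies": {"XXX": {}}}

print(dig(sample, "name", "common"))
print(dig(sample, "capital", 0, default="N/A"))
```

The first call finds the nested name; the second falls back to the default instead of raising `KeyError`.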

---
## 3. Understanding API Parameters

Most APIs let you filter and customize responses using **query parameters**.

### Two Ways to Add Parameters

In [None]:
# Method 1: Parameters in the URL string (not recommended)
url_with_params = "https://restcountries.com/v3.1/region/europe?fields=name,capital,population"
response1 = requests.get(url_with_params)
print(f"Method 1 - Status: {response1.status_code}")

# Method 2: Parameters as a dictionary (recommended)
base_url = "https://restcountries.com/v3.1/region/europe"
params = {
    'fields': 'name,capital,population'
}
response2 = requests.get(base_url, params=params)
print(f"Method 2 - Status: {response2.status_code}")

# Method 2 is better because:
# - Automatically handles URL encoding
# - Easier to read and modify
# - Less error-prone
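
To see what "URL encoding" means in practice, the sketch below builds a query string with the standard library. `requests` performs a comparable step when you pass `params=`. The second example uses a hypothetical parameter `q` purely to show how spaces and `&` are escaped.

```python
from urllib.parse import urlencode

base_url = "https://restcountries.com/v3.1/region/europe"
params = {"fields": "name,capital,population"}

# urlencode percent-encodes reserved characters such as the commas
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)

# Hypothetical parameter: spaces and '&' would break a hand-built URL
tricky = urlencode({"q": "arable land & water"})
print(tricky)
```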

In [None]:
# Get European countries with selected fields
european_countries = response2.json()

print(f"Found {len(european_countries)} European countries")
print("\nFirst 5 countries:")
for country in european_countries[:5]:
    name = country['name']['common']
    capital = (country.get('capital') or ['N/A'])[0]  # missing or empty list -> 'N/A'
    pop = country.get('population', 0)
    print(f"  {name}: {capital}, Population: {pop:,}")

### Example: World Bank API with Parameters

The World Bank API provides economic and development indicators - perfect for ESG and food security analysis.

Let's get GDP per capita for Canada.

In [None]:
# World Bank API example
# Get GDP per capita for Canada

base_url = "https://api.worldbank.org/v2/country/CAN/indicator/NY.GDP.PCAP.CD"

params = {
    'format': 'json',      # Response format
    'date': '2018:2022',   # Date range
    'per_page': 100        # Results per page
}

response = requests.get(base_url, params=params)
print(f"Status: {response.status_code}")
print(f"URL called: {response.url}")

In [None]:
# World Bank returns a list with metadata first, then data
result = response.json()

print(f"Response has {len(result)} parts")
print(f"\nPart 1 (metadata): {result[0]}")
print(f"\nPart 2 (data): {len(result[1])} records")

In [None]:
# Extract the GDP data
gdp_data = result[1]

print("Canada GDP per Capita (USD):")
print("-" * 30)
for record in gdp_data:
    year = record['date']
    value = record['value']
    if value is not None:  # 0 is a valid value, so test for None explicitly
        print(f"  {year}: ${value:,.2f}")
    else:
        print(f"  {year}: No data")
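
The two-part response structure can be handled by a small function that validates the shape before any records are used. This is a sketch under stated assumptions: `split_worldbank_response` is a hypothetical helper, and the sample payload is made up for illustration (the values are not real statistics).

```python
def split_worldbank_response(result):
    """Split a World Bank style response into (metadata, records)."""
    if not isinstance(result, list) or len(result) < 2:
        raise ValueError("Expected a two-element list: [metadata, records]")
    metadata, records = result[0], result[1]
    return metadata, records or []   # guard against an empty or null second element

# Made-up payload for illustration only; the values are not real statistics
sample = [
    {"page": 1, "pages": 1, "per_page": 100, "total": 2},
    [
        {"country": {"id": "XX", "value": "Exampleland"}, "date": "2021", "value": 0.0},
        {"country": {"id": "XX", "value": "Exampleland"}, "date": "2020", "value": None},
    ],
]

metadata, records = split_worldbank_response(sample)
labels = []
for record in records:
    value = record["value"]
    # Compare with None so that a genuine 0.0 is not reported as missing
    labels.append("No data" if value is None else f"{value:,.2f}")
    print(record["date"], labels[-1])
```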

---
## 4. Authentication: API Keys and Tokens

Many APIs require authentication to:
- Track usage and enforce limits
- Provide personalized data
- Prevent abuse

### Common Authentication Methods

| Method | How It Works | Example |
|--------|--------------|----------|
| **No Auth** | Just call the API | REST Countries |
| **API Key in URL** | Add key as query parameter | `?api_key=abc123` |
| **API Key in Header** | Add key to request header | `X-API-Key: abc123` |
| **Bearer Token** | OAuth token in header | `Authorization: Bearer abc123` |

### Example: Exchange Rate API with API Key

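Services such as the ExchangeRate API require a free API key. The cells below first show how to store a key safely, then run the demo against the Frankfurter API, which needs no key.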

In [None]:
# IMPORTANT: Never hardcode API keys in your code!
# Use environment variables or config files

import os

# Method 1: Environment variable (best practice)
# Set this in your terminal: export EXCHANGE_API_KEY="your_key_here"
# api_key = os.environ.get('EXCHANGE_API_KEY')

# Method 2: Config file (good practice)
# Create a file called 'config.py' with: API_KEY = "your_key_here"
# from config import API_KEY

# Method 3: Hardcoding the key in the script (demonstration only - never do this in real code)
# The demo below avoids the problem by using a free API that works without a key

print("API Key Best Practices:")
print("1. Use environment variables")
print("2. Never commit keys to Git")
print("3. Use .gitignore for config files")
print("4. Rotate keys if exposed")
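
The environment-variable approach (Method 1) can be wrapped in a function that fails loudly when the key is missing, rather than sending an empty key to the API. This is a minimal sketch: `get_api_key` is a hypothetical helper, and the placeholder value is set only so the example runs end to end.

```python
import os

def get_api_key(env_var="EXCHANGE_API_KEY"):
    """Read an API key from the environment and fail loudly if it is missing."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set. "
            f'Set it first, e.g. export {env_var}="your_key_here"'
        )
    return api_key

# Demonstration only: set a placeholder so the example runs end to end
os.environ["EXCHANGE_API_KEY"] = "placeholder-not-a-real-key"
key = get_api_key()
print(f"Loaded key ending in ...{key[-4:]}")  # never print the full key
```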

In [None]:
# Example: passing query parameters (an API key would be one more entry in `params`)
# Using a free currency API (frankfurter.app - no key needed)

def get_exchange_rates(base_currency='USD', target_currencies=None):
    """
    Get current exchange rates from Frankfurter API.
    This API is free and requires no authentication.
    """
    url = "https://api.frankfurter.app/latest"
    
    params = {'from': base_currency}
    if target_currencies:
        params['to'] = ','.join(target_currencies)
    
    response = requests.get(url, params=params)
    
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error: {response.status_code}")
        return None

# Get rates for major currencies
rates = get_exchange_rates('USD', ['CAD', 'EUR', 'GBP', 'JPY', 'CNY'])

if rates:
    print(f"Exchange Rates (Base: {rates['base']})")
    print(f"Date: {rates['date']}")
    print("-" * 30)
    for currency, rate in rates['rates'].items():
        print(f"  1 USD = {rate:.4f} {currency}")

In [None]:
# Example: API key in header
# This is how you would call an API that needs authentication

def call_api_with_header_auth(url, api_key):
    """
    Example of calling an API with key in header.
    """
    headers = {
        'X-API-Key': api_key,
        'Accept': 'application/json'  # ask for JSON back; a GET sends no body for Content-Type to describe
    }
    
    response = requests.get(url, headers=headers)
    return response

# Example of Bearer token authentication
def call_api_with_bearer_token(url, token):
    """
    Example of calling an API with Bearer token.
    Common with OAuth 2.0 APIs.
    """
    headers = {
        'Authorization': f'Bearer {token}',
        'Accept': 'application/json'  # ask for JSON back; a GET sends no body for Content-Type to describe
    }
    
    response = requests.get(url, headers=headers)
    return response

print("Authentication functions defined.")
print("Use these patterns when working with authenticated APIs.")

---
## 5. Handling Pagination

When an API has a lot of data, it returns results in **pages**. You need to make multiple requests to get all the data.

### Common Pagination Patterns

| Pattern | How It Works | Example |
|---------|--------------|----------|
| **Page Number** | Request page 1, 2, 3... | `?page=1&per_page=100` |
| **Offset/Limit** | Skip N records, get M | `?offset=100&limit=50` |
| **Cursor/Token** | Use token for next page | `?cursor=abc123` |
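
The World Bank example in this section uses page numbers. For comparison, here is a sketch of the cursor/token pattern. Everything about the response shape is an assumption made for illustration: the keys `items` and `next_cursor` and the stand-in `fake_fetch_page` function are hypothetical, and a real API will document its own field names.

```python
def fetch_all_with_cursor(fetch_page):
    """Collect every record from a cursor-paginated source."""
    records = []
    cursor = None                      # None means "start from the beginning"
    while True:
        page = fetch_page(cursor)
        records.extend(page["items"])
        cursor = page.get("next_cursor")
        if cursor is None:             # no token means this was the last page
            break
    return records

# Stand-in for a real API call (hypothetical response shape)
FAKE_DATA = list(range(1, 8))          # seven records, served 3 at a time

def fake_fetch_page(cursor, page_size=3):
    start = 0 if cursor is None else int(cursor)
    end = start + page_size
    next_cursor = str(end) if end < len(FAKE_DATA) else None
    return {"items": FAKE_DATA[start:end], "next_cursor": next_cursor}

all_records = fetch_all_with_cursor(fake_fetch_page)
print(all_records)
```

Unlike page numbers, a cursor does not tell you in advance how many pages exist, so the loop simply runs until the server stops returning a token.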

### Example: Paginated World Bank Data

In [None]:
import requests
import pandas as pd
import time

def get_worldbank_indicator(indicator_code, start_year, end_year):
    """Get World Bank indicator data with pagination handling."""
    base_url = f"https://api.worldbank.org/v2/country/all/indicator/{indicator_code}"
    
    all_data = []
    page = 1
    per_page = 100
    
    params = {
        'format': 'json',
        'date': f'{start_year}:{end_year}',
        'page': page,
        'per_page': per_page
    }
    
    response = requests.get(base_url, params=params)
    
    if response.status_code != 200:
        print(f"HTTP Error {response.status_code}")
        return all_data
    
    result = response.json()
    
    # Check if error response
    if len(result) == 1 and 'message' in result[0]:
        error_msg = result[0]['message'][0].get('value', 'Unknown error')
        print(f"API Error: {error_msg}")
        return all_data
    
    if len(result) < 2:
        print("Unexpected response format")
        return all_data
    
    metadata = result[0]
    data = result[1]
    
    if not data:
        print(f"No data for {indicator_code}")
        return all_data
    
    all_data.extend(data)
    total_pages = metadata.get('pages', 1)
    print(f"Page {page}/{total_pages}: Got {len(data)} records")
    
    while page < total_pages:
        page += 1
        params['page'] = page
        
        response = requests.get(base_url, params=params)
        if response.status_code != 200:
            print(f"Error on page {page}: {response.status_code}")
            break
        
        result = response.json()
        metadata = result[0]
        data = result[1] if len(result) > 1 else []
        
        if not data:
            break
        
        all_data.extend(data)
        print(f"Page {page}/{total_pages}: Got {len(data)} records")
        
        time.sleep(0.5)
    
    print(f"Total records collected: {len(all_data)}\n")
    return all_data


def transform_to_dataframe(raw_data, indicator_name):
    """Convert World Bank API response to clean DataFrame."""
    if not raw_data:
        return pd.DataFrame()
    
    df = pd.DataFrame(raw_data)
    
    # Extract nested fields
    df['country_name'] = df['country'].apply(
        lambda x: x.get('value') if isinstance(x, dict) else None
    )
    df['country_code'] = df['country'].apply(
        lambda x: x.get('id') if isinstance(x, dict) else None
    )
    
    # Clean numeric columns
    df['year'] = pd.to_numeric(df['date'], errors='coerce').astype('Int64')
    df['value'] = pd.to_numeric(df['value'], errors='coerce')
    
    # Select columns
    df_clean = df[['country_name', 'country_code', 'year', 'value']].copy()
    df_clean.columns = ['country_name', 'country_code', 'year', indicator_name]
    
    # Remove null values
    df_clean = df_clean.dropna(subset=[indicator_name])
    
    return df_clean


def save_and_store_data(dataframe, filename):
    """Save DataFrame to CSV and return it."""
    dataframe.to_csv(filename, index=False)
    print(f"Saved {len(dataframe)} records to {filename}")
    return dataframe


# === MAIN EXECUTION ===

print("FETCHING WORLD BANK DATA")
print("=" * 60)

indicators_to_fetch = {
    'NY.GDP.PCAP.CD': 'gdp_per_capita',
    'SP.POP.TOTL': 'total_population',
    'AG.LND.ARBL.HA.PC': 'arable_land_per_capita',
}

# Dictionary to store all dataframes
data_storage = {}

for code, indicator_name in indicators_to_fetch.items():
    print(f"\nFetching: {indicator_name} ({code})")
    print("-" * 60)
    
    # Extract raw data
    raw_data = get_worldbank_indicator(code, 2018, 2022)
    
    if raw_data:
        # Transform to DataFrame
        df = transform_to_dataframe(raw_data, indicator_name)
        
        # Save to CSV and store in memory
        csv_filename = f"{indicator_name}.csv"
        df = save_and_store_data(df, csv_filename)
        
        # Store in dictionary
        data_storage[indicator_name] = df
        
        print(f"Stored {len(df)} records in memory\n")

# === DISPLAY RESULTS ===

print("\n" + "=" * 60)
print("DATA SUMMARY")
print("=" * 60)

for indicator_name, df in data_storage.items():
    print(f"\n{indicator_name.upper()}")
    print("-" * 60)
    
    print(f"Shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
    
    # Show sample data for 2022
    sample = df[df['year'] == 2022].nlargest(5, df.columns[-1])
    if not sample.empty:
        print(f"\nTop 5 ({indicator_name}) in 2022:")
        print(sample.to_string(index=False))
    else:
        print("No 2022 data available")

# === ACCESS DATA FROM VARIABLES ===

print("\n" + "=" * 60)
print("ACCESSING DATA FROM STORED VARIABLES")
print("=" * 60)

# Access GDP per capita
df_gdp = data_storage['gdp_per_capita']
print(f"\nGDP per capita data shape: {df_gdp.shape}")
print(f"Countries with GDP data: {df_gdp['country_name'].nunique()}")
print(f"Year range: {df_gdp['year'].min()} to {df_gdp['year'].max()}")

# Show richest countries in 2022
print("\nRichest countries (2022):")
richest = df_gdp[df_gdp['year'] == 2022].nlargest(5, 'gdp_per_capita')
print(richest[['country_name', 'gdp_per_capita']].to_string(index=False))

# Access population data
df_population = data_storage['total_population']
print(f"\n\nPopulation data shape: {df_population.shape}")
print(f"Countries with population data: {df_population['country_name'].nunique()}")

# Show most populous countries in 2022
print("\nMost populous countries (2022):")
populous = df_population[df_population['year'] == 2022].nlargest(5, 'total_population')
print(populous[['country_name', 'total_population']].to_string(index=False))

# Access arable land data
df_arable = data_storage['arable_land_per_capita']
print(f"\n\nArable land per capita data shape: {df_arable.shape}")

# Show countries with most arable land per capita in 2022
print("\nCountries with most arable land per capita (2022):")
arable_top = df_arable[df_arable['year'] == 2022].nlargest(5, 'arable_land_per_capita')
print(arable_top[['country_name', 'arable_land_per_capita']].to_string(index=False))

# === COMBINE DATAFRAMES ===

print("\n" + "=" * 60)
print("COMBINING DATA FROM MULTIPLE SOURCES")
print("=" * 60)

# Merge all data for a specific year
year_to_analyze = 2022

df_combined = df_gdp[df_gdp['year'] == year_to_analyze][['country_name', 'country_code', 'gdp_per_capita']].copy()

df_combined = df_combined.merge(
    df_population[df_population['year'] == year_to_analyze][['country_code', 'total_population']],
    on='country_code',
    how='left'
)

df_combined = df_combined.merge(
    df_arable[df_arable['year'] == year_to_analyze][['country_code', 'arable_land_per_capita']],
    on='country_code',
    how='left'
)

print(f"\nCombined data shape: {df_combined.shape}")
print(f"Columns: {df_combined.columns.tolist()}\n")

# Save combined data
df_combined.to_csv('combined_indicators_2022.csv', index=False)
print("Saved combined data to combined_indicators_2022.csv")

# Show sample
print("\nSample of combined data:")
print(df_combined.dropna().head(10).to_string(index=False))

FETCHING WORLD BANK DATA

Fetching: gdp_per_capita (NY.GDP.PCAP.CD)
------------------------------------------------------------
Page 1/14: Got 100 records
Page 2/14: Got 100 records
Page 3/14: Got 100 records
Page 4/14: Got 100 records
Page 5/14: Got 100 records
Page 6/14: Got 100 records
Page 7/14: Got 100 records
Page 8/14: Got 100 records
Page 9/14: Got 100 records
Page 10/14: Got 100 records
Page 11/14: Got 100 records
Page 12/14: Got 100 records
Page 13/14: Got 100 records
Page 14/14: Got 30 records
Total records collected: 1330

Saved 1291 records to gdp_per_capita.csv
Stored 1291 records in memory


Fetching: total_population (SP.POP.TOTL)
------------------------------------------------------------
Page 1/14: Got 100 records
Page 2/14: Got 100 records
Page 3/14: Got 100 records
Page 4/14: Got 100 records
Page 5/14: Got 100 records
Page 6/14: Got 100 records
Page 7/14: Got 100 records
Page 8/14: Got 100 records
Page 9/14: Got 100 records
Page 10/14: Got 100 records
Page 11/14: 

In [None]:
# Convert stored data to single combined DataFrame

print("CONVERTING STORED DATA TO SINGLE DATAFRAME")
print("=" * 60)

# Merge all three datasets into one
# (despite its name, df_co2 holds GDP, population, and arable-land data, not CO2)
df_co2 = data_storage['gdp_per_capita'].copy()

df_co2 = df_co2.merge(
    data_storage['total_population'][['country_code', 'year', 'total_population']],
    on=['country_code', 'year'],
    how='left'
)

df_co2 = df_co2.merge(
    data_storage['arable_land_per_capita'][['country_code', 'year', 'arable_land_per_capita']],
    on=['country_code', 'year'],
    how='left'
)

print(f"DataFrame shape: {df_co2.shape}")
print(f"Columns: {df_co2.columns.tolist()}")

# Display data info
print("\nData types:")
print(df_co2.dtypes)

print("\nFirst 10 rows:")
print(df_co2.head(10))

print("\nData Summary:")
print(df_co2.describe())

print("\nMissing values:")
print(df_co2.isnull().sum())

print("\nSample data for year 2022:")
print(df_co2[df_co2['year'] == 2022].head(10))

CONVERTING STORED DATA TO SINGLE DATAFRAME
DataFrame shape: (1291, 6)
Columns: ['country_name', 'country_code', 'year', 'gdp_per_capita', 'total_population', 'arable_land_per_capita']

Data types:
country_name               object
country_code               object
year                        Int64
gdp_per_capita            float64
total_population          float64
arable_land_per_capita    float64
dtype: object

First 10 rows:
                  country_name country_code  year  gdp_per_capita  \
0  Africa Eastern and Southern           ZH  2022     1679.327622   
1  Africa Eastern and Southern           ZH  2021     1562.416175   
2  Africa Eastern and Southern           ZH  2020     1351.591669   
3  Africa Eastern and Southern           ZH  2019     1507.085600   
4  Africa Eastern and Southern           ZH  2018     1552.073722   
5   Africa Western and Central           ZI  2022     2138.473153   
6   Africa Western and Central           ZI  2021     2112.794076   
7   Africa Wester

In [33]:
# Clean and reshape the data
# Keep the column name gdp_per_capita: these values are GDP figures, not CO2 emissions
df_clean = df_co2[['country_name', 'country_code', 'year', 'gdp_per_capita']].copy()

# Drop rows without a GDP value
df_final = df_clean.dropna(subset=['gdp_per_capita'])

print(f"Cleaned data: {len(df_final)} records")
print("\nTop 10 countries by GDP per capita (2022):")
print(df_final[df_final['year'] == 2022].nlargest(10, 'gdp_per_capita')[['country_name', 'gdp_per_capita']])

Cleaned data: 1291 records

Top 10 countries by GDP per capita (2022):
        country_name  gdp_per_capita
868           Monaco   226052.001905
788    Liechtenstein   188055.003235
798       Luxembourg   123719.658916
345          Bermuda   121613.939984
953           Norway   109269.520580
688          Ireland   105190.685953
1150     Switzerland    94394.510680
415   Cayman Islands    93030.705145
1073       Singapore    90299.069464
1013           Qatar    88701.468976


---
## 6. Rate Limiting and Polite Requests

APIs have limits on how many requests you can make. Exceeding these limits can:
- Get your requests blocked (HTTP 429)
- Get your API key revoked
- Get your IP address banned

### Rate Limiting Strategies

| Strategy | Description | When to Use |
|----------|-------------|-------------|
| **Fixed Delay** | Wait N seconds between requests | Simple, predictable |
| **Adaptive Delay** | Slow down when getting 429s | Unknown limits |
| **Token Bucket** | Allow bursts but limit average | Complex scenarios |
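
The fixed-delay strategy is implemented in the next cell. The token bucket row is less obvious, so here is a minimal sketch of it. The class name, parameter names, and the injectable `clock` argument are illustrative choices, not part of any library. The fake clock makes the demo deterministic without real waiting.

```python
import time

class TokenBucket:
    """Allow short bursts while capping the long-run average request rate."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second      # tokens added per second
        self.capacity = capacity         # maximum burst size
        self.tokens = capacity           # start with a full bucket
        self.clock = clock
        self.last_refill = clock()

    def try_acquire(self):
        """Take one token if available; return True when the request may proceed."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Deterministic demo with a fake clock instead of real waiting
fake_now = [0.0]
bucket = TokenBucket(rate_per_second=1, capacity=3, clock=lambda: fake_now[0])

burst = [bucket.try_acquire() for _ in range(4)]   # 3 allowed, 4th refused
fake_now[0] += 2.0                                 # two seconds pass -> 2 new tokens
after_wait = [bucket.try_acquire() for _ in range(3)]
print(burst, after_wait)
```

The bucket lets three requests through immediately, refuses the fourth, and after two simulated seconds has refilled enough for two more.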

In [34]:
# Simple rate limiter with fixed delay

class RateLimiter:
    """
    Simple rate limiter that ensures minimum delay between requests.
    """
    
    def __init__(self, min_delay_seconds=1.0):
        """
        Initialize rate limiter.
        
        Parameters:
        -----------
        min_delay_seconds : float
            Minimum seconds to wait between requests
        """
        self.min_delay = min_delay_seconds
        self.last_request_time = 0
    
    def wait(self):
        """
        Wait if necessary to respect rate limit.
        """
        elapsed = time.time() - self.last_request_time
        if elapsed < self.min_delay:
            sleep_time = self.min_delay - elapsed
            time.sleep(sleep_time)
        self.last_request_time = time.time()


# Example usage
limiter = RateLimiter(min_delay_seconds=2.0)  # 2 seconds between requests

print("Making 3 requests with rate limiting...")
for i in range(3):
    limiter.wait()
    print(f"  Request {i+1} at {datetime.now().strftime('%H:%M:%S')}")

print("\nNotice the 2-second gap between requests.")

Making 3 requests with rate limiting...
  Request 1 at 19:28:30
  Request 2 at 19:28:32
  Request 3 at 19:28:34

Notice the 2-second gap between requests.


In [35]:
# Rate-limited API caller

def fetch_with_rate_limit(urls, delay_seconds=1.0):
    """
    Fetch multiple URLs with rate limiting.
    
    Parameters:
    -----------
    urls : list
        List of URLs to fetch
    delay_seconds : float
        Seconds to wait between requests
    
    Returns:
    --------
    list of dict : One entry per URL with keys 'url', 'status', and 'data'
    """
    results = []
    limiter = RateLimiter(delay_seconds)
    
    for i, url in enumerate(urls):
        limiter.wait()
        
        print(f"Fetching {i+1}/{len(urls)}: {url[:50]}...")
        response = requests.get(url)
        
        results.append({
            'url': url,
            'status': response.status_code,
            'data': response.json() if response.status_code == 200 else None
        })
    
    return results

# Example: Fetch data for multiple countries
countries = ['canada', 'mexico', 'brazil']
urls = [f"https://restcountries.com/v3.1/name/{c}?fields=name,population" for c in countries]

results = fetch_with_rate_limit(urls, delay_seconds=1.0)

print("\nResults:")
for r in results:
    if r['data']:
        name = r['data'][0]['name']['common']
        pop = r['data'][0]['population']
        print(f"  {name}: {pop:,}")

Fetching 1/3: https://restcountries.com/v3.1/name/canada?fields=...
Fetching 2/3: https://restcountries.com/v3.1/name/mexico?fields=...
Fetching 3/3: https://restcountries.com/v3.1/name/brazil?fields=...

Results:
  Canada: 41,651,653
  Mexico: 130,575,786
  Brazil: 213,421,037


---
## Part 2: Building Robust Pipelines

---

## 7. Error Handling and Retry Logic

APIs fail. Networks have problems. Your pipeline must handle errors gracefully.

### Common API Errors

| Error Type | Cause | Solution |
|------------|-------|----------|
| Connection timeout | Network issues | Retry with backoff |
| 429 Too Many Requests | Rate limit hit | Wait and retry |
| 500 Server Error | API is down | Wait and retry |
| 400 Bad Request | Invalid parameters | Fix and don't retry |
| 401 Unauthorized | Bad API key | Fix and don't retry |

### Exponential Backoff

When retrying, wait progressively longer:
- 1st retry: wait 1 second
- 2nd retry: wait 2 seconds
- 3rd retry: wait 4 seconds
- 4th retry: wait 8 seconds

This prevents overwhelming a struggling server.
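
The schedule above follows the formula `base_delay * 2 ** attempt`. A short sketch makes that explicit. The `jitter` option is an addition not described above: it adds a random fraction to each delay so that many clients do not retry at the same instant. The function name and parameters are illustrative.

```python
import random

def backoff_delays(max_retries=4, base_delay=1.0, jitter=0.0, rng=random.random):
    """Return the wait (in seconds) before each retry: base_delay * 2**attempt."""
    delays = []
    for attempt in range(max_retries):
        delay = base_delay * (2 ** attempt)
        # Optional jitter (an addition to the schedule above) spreads out clients
        delay += delay * jitter * rng()
        delays.append(delay)
    return delays

print(backoff_delays())              # [1.0, 2.0, 4.0, 8.0]
print(backoff_delays(jitter=0.5))    # same schedule plus up to 50% random extra
```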

In [36]:
def fetch_with_retry(url, max_retries=3, base_delay=1.0):
    """
    Fetch URL with automatic retry and exponential backoff.
    
    Parameters:
    -----------
    url : str
        URL to fetch
    max_retries : int
        Maximum number of retry attempts
    base_delay : float
        Base delay in seconds (doubles each retry)
    
    Returns:
    --------
    dict : Response data or error information
    """
    
    # Status codes that are worth retrying
    retryable_codes = {429, 500, 502, 503, 504}
    
    for attempt in range(max_retries + 1):
        try:
            response = requests.get(url, timeout=30)
            
            # Success!
            if response.status_code == 200:
                return {
                    'success': True,
                    'data': response.json(),
                    'attempts': attempt + 1
                }
            
            # Retryable error
            if response.status_code in retryable_codes:
                if attempt < max_retries:
                    delay = base_delay * (2 ** attempt)  # Exponential backoff
                    print(f"  Attempt {attempt + 1} failed (HTTP {response.status_code}). "
                          f"Retrying in {delay}s...")
                    time.sleep(delay)
                    continue
            
            # Non-retryable error
            return {
                'success': False,
                'error': f"HTTP {response.status_code}",
                'attempts': attempt + 1
            }
            
        except requests.exceptions.Timeout:
            if attempt < max_retries:
                delay = base_delay * (2 ** attempt)
                print(f"  Attempt {attempt + 1} timed out. Retrying in {delay}s...")
                time.sleep(delay)
                continue
            return {
                'success': False,
                'error': 'Timeout after all retries',
                'attempts': attempt + 1
            }
            
        except requests.exceptions.RequestException as e:
            return {
                'success': False,
                'error': str(e),
                'attempts': attempt + 1
            }
    
    return {
        'success': False,
        'error': 'Max retries exceeded',
        'attempts': max_retries + 1
    }


# Test with a valid URL
print("Testing retry logic with valid URL:")
result = fetch_with_retry("https://restcountries.com/v3.1/name/france?fields=name,capital")
print(f"Success: {result['success']}, Attempts: {result['attempts']}")
if result['success']:
    print(f"Data: {result['data'][0]['name']['common']}")

Testing retry logic with valid URL:
Success: True, Attempts: 1
Data: France


In [37]:
# Test with an invalid URL to see error handling
print("\nTesting retry logic with invalid URL:")
result = fetch_with_retry("https://restcountries.com/v3.1/name/notarealcountry123")
print(f"Success: {result['success']}")
print(f"Error: {result.get('error', 'None')}")
print(f"Attempts: {result['attempts']}")


Testing retry logic with invalid URL:
Success: False
Error: HTTP 404
Attempts: 1


---
## 8. Logging Your Pipeline

Logging is essential for:
- Debugging problems
- Monitoring pipeline health
- Auditing data lineage
- Measuring performance

### Python's Logging Module

Python's built-in `logging` module improves on `print()` statements: messages carry a severity level, can be timestamped, and can be routed to the console, a file, or both.
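
The key idea is that each handler has its own level and silently drops anything below it. The sketch below shows this with an in-memory buffer so the result is easy to inspect. The logger name `LevelDemo` and the messages are illustrative only.

```python
import io
import logging

# A throwaway logger writing to an in-memory buffer (names are illustrative)
buffer = io.StringIO()
demo_logger = logging.getLogger("LevelDemo")
demo_logger.setLevel(logging.DEBUG)
demo_logger.propagate = False            # keep the demo output out of the root logger
demo_logger.handlers.clear()

handler = logging.StreamHandler(buffer)
handler.setLevel(logging.WARNING)        # this handler ignores DEBUG and INFO
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
demo_logger.addHandler(handler)

demo_logger.debug("row-level detail")
demo_logger.info("fetched 100 records")
demo_logger.warning("page 3 returned no data")
demo_logger.error("request failed after 3 retries")

print(buffer.getvalue())
```

Only the WARNING and ERROR messages reach the buffer, because the handler's level filters out the two lower-severity calls.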

In [38]:
import logging

# Configure logging
# This setup writes to both console and file

def setup_logging(log_file='pipeline.log'):
    """
    Configure logging for the pipeline.
    Logs to both console and file.
    """
    # Create logger
    logger = logging.getLogger('ETLPipeline')
    logger.setLevel(logging.DEBUG)
    
    # Clear any existing handlers
    logger.handlers = []
    
    # Create formatters
    detailed_formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )
    simple_formatter = logging.Formatter(
        '%(levelname)s - %(message)s'
    )
    
    # File handler (detailed)
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(detailed_formatter)
    
    # Console handler (simple)
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    console_handler.setFormatter(simple_formatter)
    
    # Add handlers to logger
    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    
    return logger


# Create logger
logger = setup_logging('etl_pipeline.log')

# Test logging levels
logger.debug("This is a debug message (only in file)")
logger.info("This is an info message")
logger.warning("This is a warning message")
logger.error("This is an error message")

INFO - This is an info message
ERROR - This is an error message


In [39]:
# Check the log file
print("Contents of etl_pipeline.log:")
print("-" * 50)
with open('etl_pipeline.log', 'r') as f:
    print(f.read())

Contents of etl_pipeline.log:
--------------------------------------------------
2026-01-04 19:29:02,487 - ETLPipeline - DEBUG - This is a debug message (only in file)
2026-01-04 19:29:02,489 - ETLPipeline - INFO - This is an info message
2026-01-04 19:29:02,491 - ETLPipeline - ERROR - This is an error message



In [40]:
# Using logging in a real function

def fetch_country_data_logged(country_name):
    """
    Fetch country data with proper logging.
    """
    logger.info(f"Starting fetch for: {country_name}")
    
    url = f"https://restcountries.com/v3.1/name/{country_name}"
    logger.debug(f"URL: {url}")
    
    try:
        start_time = time.time()
        response = requests.get(url, timeout=30)
        elapsed = time.time() - start_time
        
        logger.debug(f"Response time: {elapsed:.2f}s")
        logger.debug(f"Status code: {response.status_code}")
        
        if response.status_code == 200:
            data = response.json()
            logger.info(f"Successfully fetched {country_name} - {len(data)} record(s)")
            return data
        else:
            logger.warning(f"Unexpected status {response.status_code} for {country_name}")
            return None
            
    except requests.exceptions.Timeout:
        logger.error(f"Timeout fetching {country_name}")
        return None
    except Exception as e:
        logger.error(f"Error fetching {country_name}: {str(e)}")
        return None


# Test it
data = fetch_country_data_logged("germany")
if data:
    print(f"\nGot data for: {data[0]['name']['common']}")

INFO - Starting fetch for: germany
INFO - Successfully fetched germany - 1 record(s)



Got data for: Germany


---
## 9. Making Pipelines Idempotent

**Idempotent** means: running the pipeline twice produces the same result as running it once.

This is critical for:
- Recovering from failures (just re-run)
- Scheduled jobs (an accidental repeat run is harmless)
- Data consistency (no duplicates)

### Strategies for Idempotency

| Strategy | How It Works |
|----------|-------------|
| **Delete and Replace** | Delete existing data, insert new |
| **Upsert** | Insert if new, update if exists |
| **Check Before Insert** | Only insert if not already there |
| **Deduplication** | Remove duplicates after loading |
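
As a database-backed illustration of the upsert row, the sketch below loads the same rows twice into an in-memory SQLite table and checks that the row count does not grow. SQLite, the table name `indicator`, and the sample rows are assumptions made for this example. `INSERT OR REPLACE` is SQLite syntax, and other databases spell upserts differently.

```python
import sqlite3

def load_records(conn, records):
    """Idempotent load: the primary key makes a re-run overwrite rather than duplicate."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS indicator (
               country_code TEXT,
               year INTEGER,
               value REAL,
               PRIMARY KEY (country_code, year)
           )"""
    )
    conn.executemany(
        "INSERT OR REPLACE INTO indicator (country_code, year, value) VALUES (?, ?, ?)",
        records,
    )
    conn.commit()

# Made-up rows for illustration only
rows = [("AAA", 2021, 1.5), ("AAA", 2022, 1.7), ("BBB", 2022, 3.2)]

conn = sqlite3.connect(":memory:")
load_records(conn, rows)
load_records(conn, rows)     # second run simulates a retry after a failure

row_count = conn.execute("SELECT COUNT(*) FROM indicator").fetchone()[0]
print(f"Rows after two runs: {row_count}")
```

The composite primary key `(country_code, year)` is what makes the load idempotent: it defines what counts as "the same record".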

In [41]:
# Example: Upsert pattern (insert if new, update if exists)

class DataStore:
    """
    Simple in-memory data store demonstrating idempotent inserts.
    In real code, this would be a database.
    """
    
    def __init__(self):
        self.data = {}  # key -> record
        self.insert_count = 0
        self.skip_count = 0  # counts updates to existing keys (reported as 'updated')
    
    def upsert(self, key, record):
        """
        Insert or update a record.
        """
        if key in self.data:
            # Record exists - update it
            self.data[key] = record
            self.skip_count += 1
            return 'updated'
        else:
            # New record - insert it
            self.data[key] = record
            self.insert_count += 1
            return 'inserted'
    
    def get_stats(self):
        return {
            'total_records': len(self.data),
            'inserted': self.insert_count,
            'updated': self.skip_count
        }


# Simulate idempotent pipeline
store = DataStore()