# FRE 521D: Data Analytics in Climate, Food and Environment
## Lecture 6: Python Wrangling II - Merges, Reshaping, and Analysis-Ready Tables

**Date:** Monday, January 26, 2026  
**Instructor:** Asif Ahmed Neloy  
**Program:** UBC Master of Food and Resource Economics

---

### Today's Agenda

1. Quick Review: Tidy Data Principles
2. Merging Multiple Data Sources
3. Advanced Pivot and Unpivot Patterns
4. Building Analysis-Ready Tables
5. Data Contracts and Documentation

---

## Setting Up

In [24]:
# Standard imports
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', 60)

print(f"Pandas version: {pd.__version__}")
print("Ready for Python Wrangling II!")

Pandas version: 2.2.3
Ready for Python Wrangling II!


---
## 1. Quick Review: Tidy Data

From last class:

```
┌────────────────────────────────────────────────┐
│           TIDY DATA PRINCIPLES                 │
├────────────────────────────────────────────────┤
│  1. Each VARIABLE → its own COLUMN            │
│  2. Each OBSERVATION → its own ROW            │
│  3. Each VALUE → its own CELL                 │
└────────────────────────────────────────────────┘
```

Today we focus on **combining multiple tidy datasets** into analysis-ready tables.

---
## 2. Merging Multiple Data Sources

### The Real-World Scenario

In climate and food analytics, data comes from many sources:
- Country metadata (World Bank)
- Economic indicators (GDP, trade)
- Climate data (temperature, precipitation)
- Agricultural production (FAO)
- Environmental metrics (emissions, land use)

You need to combine them for integrated analysis.

### Types of Joins

```
LEFT JOIN                INNER JOIN               OUTER JOIN
┌─────┬─────┐           ┌─────┬─────┐           ┌─────┬─────┐
│█████│     │           │     │     │           │█████│█████│
│█████│█████│           │     │█████│           │█████│█████│
│█████│     │           │     │     │           │█████│█████│
└─────┴─────┘           └─────┴─────┘           └─────┴─────┘
 All left +              Only matching           All from both
 matching right
```

In [25]:
# Create realistic datasets for merging

# Dataset 1: Country metadata
countries = pd.DataFrame({
    'country_code': ['CAN', 'USA', 'MEX', 'BRA', 'ARG', 'DEU', 'FRA', 'GBR', 'CHN', 'IND', 'JPN', 'AUS'],
    'country_name': ['Canada', 'United States', 'Mexico', 'Brazil', 'Argentina', 'Germany', 
                     'France', 'United Kingdom', 'China', 'India', 'Japan', 'Australia'],
    'region': ['North America', 'North America', 'North America', 'South America', 'South America',
               'Europe', 'Europe', 'Europe', 'Asia', 'Asia', 'Asia', 'Oceania'],
    'income_group': ['High income', 'High income', 'Upper middle income', 'Upper middle income',
                     'Upper middle income', 'High income', 'High income', 'High income',
                     'Upper middle income', 'Lower middle income', 'High income', 'High income']
})

print("Countries (12 rows):")
print(countries)

Countries (12 rows):
   country_code    country_name         region         income_group
0           CAN          Canada  North America          High income
1           USA   United States  North America          High income
2           MEX          Mexico  North America  Upper middle income
3           BRA          Brazil  South America  Upper middle income
4           ARG       Argentina  South America  Upper middle income
5           DEU         Germany         Europe          High income
6           FRA          France         Europe          High income
7           GBR  United Kingdom         Europe          High income
8           CHN           China           Asia  Upper middle income
9           IND           India           Asia  Lower middle income
10          JPN           Japan           Asia          High income
11          AUS       Australia        Oceania          High income


In [26]:
# Dataset 2: GDP data (some countries missing)
gdp_data = pd.DataFrame({
    'iso3': ['CAN', 'USA', 'MEX', 'BRA', 'DEU', 'FRA', 'GBR', 'CHN', 'IND', 'JPN'],  # Missing ARG, AUS
    'year': [2022] * 10,
    'gdp_billion_usd': [2140, 25460, 1414, 1920, 4072, 2780, 3070, 17963, 3385, 4231]
})

print("\nGDP Data (10 rows - missing ARG, AUS):")
print(gdp_data)


GDP Data (10 rows - missing ARG, AUS):
  iso3  year  gdp_billion_usd
0  CAN  2022             2140
1  USA  2022            25460
2  MEX  2022             1414
3  BRA  2022             1920
4  DEU  2022             4072
5  FRA  2022             2780
6  GBR  2022             3070
7  CHN  2022            17963
8  IND  2022             3385
9  JPN  2022             4231


In [27]:
# Dataset 3: CO2 emissions (some extra countries)
emissions = pd.DataFrame({
    'country_iso': ['CAN', 'USA', 'MEX', 'BRA', 'ARG', 'DEU', 'FRA', 'GBR', 'CHN', 'IND', 'JPN', 'AUS',
                    'RUS', 'KOR'],  # Extra: RUS, KOR
    'year': [2022] * 14,
    'co2_mt': [565, 5007, 477, 478, 185, 675, 306, 341, 11472, 2693, 1081, 394, 1756, 616]
})

print("\nCO2 Emissions (14 rows - includes extra RUS, KOR):")
print(emissions)


CO2 Emissions (14 rows - includes extra RUS, KOR):
   country_iso  year  co2_mt
0          CAN  2022     565
1          USA  2022    5007
2          MEX  2022     477
3          BRA  2022     478
4          ARG  2022     185
5          DEU  2022     675
6          FRA  2022     306
7          GBR  2022     341
8          CHN  2022   11472
9          IND  2022    2693
10         JPN  2022    1081
11         AUS  2022     394
12         RUS  2022    1756
13         KOR  2022     616


In [28]:
# Dataset 4: Agricultural production
ag_production = pd.DataFrame({
    'code': ['CAN', 'USA', 'MEX', 'BRA', 'ARG', 'DEU', 'FRA', 'CHN', 'IND', 'AUS'],
    'year': [2022] * 10,
    'wheat_prod_mt': [34.0, 44.9, 3.2, 10.5, 23.0, 22.1, 33.7, 138.0, 107.7, 36.0],
    'maize_prod_mt': [14.5, 348.8, 27.5, 113.0, 52.0, 3.8, 11.5, 277.0, 33.6, 0.5]
})

print("\nAgricultural Production (10 rows):")
print(ag_production)


Agricultural Production (10 rows):
  code  year  wheat_prod_mt  maize_prod_mt
0  CAN  2022           34.0           14.5
1  USA  2022           44.9          348.8
2  MEX  2022            3.2           27.5
3  BRA  2022           10.5          113.0
4  ARG  2022           23.0           52.0
5  DEU  2022           22.1            3.8
6  FRA  2022           33.7           11.5
7  CHN  2022          138.0          277.0
8  IND  2022          107.7           33.6
9  AUS  2022           36.0            0.5


### Key Challenge: Different Column Names for Keys

Notice that the country code column has different names:
- `country_code` in countries
- `iso3` in GDP
- `country_iso` in emissions
- `code` in agriculture

This is extremely common in real data!

In [29]:
# Method 1: Using left_on and right_on parameters

merged_1 = pd.merge(
    countries,
    gdp_data,
    left_on='country_code',   # Key column in left DataFrame
    right_on='iso3',          # Key column in right DataFrame
    how='left'                # Keep all countries
)

print("Method 1: LEFT JOIN with different key names")
print(f"Result shape: {merged_1.shape}")
print(merged_1[['country_code', 'country_name', 'iso3', 'gdp_billion_usd']].head())

Method 1: LEFT JOIN with different key names
Result shape: (12, 7)
  country_code   country_name iso3  gdp_billion_usd
0          CAN         Canada  CAN           2140.0
1          USA  United States  USA          25460.0
2          MEX         Mexico  MEX           1414.0
3          BRA         Brazil  BRA           1920.0
4          ARG      Argentina  NaN              NaN


In [30]:
# Notice we have both country_code AND iso3 columns
# This is redundant - let's clean it up

merged_1 = merged_1.drop(columns=['iso3'])
print("After dropping redundant column:")
print(merged_1.columns.tolist())

After dropping redundant column:
['country_code', 'country_name', 'region', 'income_group', 'year', 'gdp_billion_usd']


In [31]:
# Method 2: Rename columns first for cleaner joins

# Standardize key column names before merging
gdp_renamed = gdp_data.rename(columns={'iso3': 'country_code'})
emissions_renamed = emissions.rename(columns={'country_iso': 'country_code'})
ag_renamed = ag_production.rename(columns={'code': 'country_code'})

print("Renamed DataFrames now all have 'country_code':")
print(f"  GDP columns: {gdp_renamed.columns.tolist()}")
print(f"  Emissions columns: {emissions_renamed.columns.tolist()}")
print(f"  Agriculture columns: {ag_renamed.columns.tolist()}")

Renamed DataFrames now all have 'country_code':
  GDP columns: ['country_code', 'year', 'gdp_billion_usd']
  Emissions columns: ['country_code', 'year', 'co2_mt']
  Agriculture columns: ['country_code', 'year', 'wheat_prod_mt', 'maize_prod_mt']


In [32]:
# Now merge becomes cleaner

merged_2 = pd.merge(
    countries,
    gdp_renamed,
    on='country_code',   # Same column name in both
    how='left'
)

print("Method 2: Merge after renaming")
print(merged_2[['country_code', 'country_name', 'gdp_billion_usd']].head())

Method 2: Merge after renaming
  country_code   country_name  gdp_billion_usd
0          CAN         Canada           2140.0
1          USA  United States          25460.0
2          MEX         Mexico           1414.0
3          BRA         Brazil           1920.0
4          ARG      Argentina              NaN


### Chaining Multiple Merges

In [33]:
# Chain multiple merges to combine all datasets

combined = (
    countries
    .merge(gdp_renamed, on='country_code', how='left')
    .merge(emissions_renamed[['country_code', 'co2_mt']], on='country_code', how='left')
    .merge(ag_renamed[['country_code', 'wheat_prod_mt', 'maize_prod_mt']], on='country_code', how='left')
)

print("Combined dataset (all sources):")
print(combined)

Combined dataset (all sources):
   country_code    country_name         region         income_group    year  \
0           CAN          Canada  North America          High income  2022.0   
1           USA   United States  North America          High income  2022.0   
2           MEX          Mexico  North America  Upper middle income  2022.0   
3           BRA          Brazil  South America  Upper middle income  2022.0   
4           ARG       Argentina  South America  Upper middle income     NaN   
5           DEU         Germany         Europe          High income  2022.0   
6           FRA          France         Europe          High income  2022.0   
7           GBR  United Kingdom         Europe          High income  2022.0   
8           CHN           China           Asia  Upper middle income  2022.0   
9           IND           India           Asia  Lower middle income  2022.0   
10          JPN           Japan           Asia          High income  2022.0   
11          AUS     

In [34]:
# Check what's missing
print("\nMissing values after merge:")
print(combined.isnull().sum())


Missing values after merge:
country_code       0
country_name       0
region             0
income_group       0
year               2
gdp_billion_usd    2
co2_mt             0
wheat_prod_mt      2
maize_prod_mt      2
dtype: int64
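
The counts above say how many values are missing, but not which countries are affected. A minimal sketch of listing the affected keys per column, using a small stand-in table (the `demo` frame and the `missing_keys` name are illustrative, not part of the lecture data):

```python
import pandas as pd
import numpy as np

# Hypothetical merged table with gaps (small illustrative subset)
demo = pd.DataFrame({
    'country_code': ['CAN', 'ARG', 'GBR', 'AUS'],
    'gdp_billion_usd': [2140.0, np.nan, 3070.0, np.nan],
    'wheat_prod_mt': [34.0, 23.0, np.nan, 36.0],
})

# For each measure column, collect the keys whose value is missing
missing_keys = {
    col: demo.loc[demo[col].isna(), 'country_code'].tolist()
    for col in ['gdp_billion_usd', 'wheat_prod_mt']
}
print(missing_keys)
```

Knowing the specific keys makes it easier to decide whether a gap comes from the source (no data published) or from a key mismatch.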


### Join Types in Detail

In [35]:
# Compare different join types

# LEFT JOIN: All from left, matching from right
left_result = countries.merge(gdp_renamed, on='country_code', how='left')
print(f"LEFT JOIN:  {len(left_result)} rows (keeps all {len(countries)} countries)")

# INNER JOIN: Only matching rows
inner_result = countries.merge(gdp_renamed, on='country_code', how='inner')
print(f"INNER JOIN: {len(inner_result)} rows (only countries with GDP data)")

# OUTER JOIN: All rows from both
outer_result = countries.merge(emissions_renamed, on='country_code', how='outer')
print(f"OUTER JOIN: {len(outer_result)} rows (all countries + extra from emissions)")

LEFT JOIN:  12 rows (keeps all 12 countries)
INNER JOIN: 10 rows (only countries with GDP data)
OUTER JOIN: 14 rows (all countries + extra from emissions)


In [36]:
# See what the outer join added
print("\nOuter join includes extra countries from emissions:")
extra_rows = outer_result[outer_result['country_name'].isnull()]
print(extra_rows[['country_code', 'country_name', 'co2_mt']])


Outer join includes extra countries from emissions:
   country_code country_name  co2_mt
10          KOR          NaN     616
12          RUS          NaN    1756
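
Filtering on a null `country_name` works here, but `merge` can also report where each row came from. A small sketch using the `indicator=True` option, on stand-in frames (`left`, `right`, and `audit` are illustrative names):

```python
import pandas as pd

# Small stand-in frames (hypothetical subset of the lecture data)
left = pd.DataFrame({'country_code': ['CAN', 'USA', 'ARG'],
                     'country_name': ['Canada', 'United States', 'Argentina']})
right = pd.DataFrame({'country_code': ['CAN', 'USA', 'RUS'],
                      'co2_mt': [565, 5007, 1756]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
audit = left.merge(right, on='country_code', how='outer', indicator=True)
print(audit['_merge'].value_counts())

# Keys that appear on only one side
unmatched = audit.loc[audit['_merge'] != 'both', ['country_code', '_merge']]
print(unmatched)
```

This approach does not depend on any particular column being null, so it still works when the left table has its own missing values.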


### Handling Duplicate Column Names

In [37]:
# When both DataFrames have the same non-key column

# Create two datasets with 'year' column
df1 = pd.DataFrame({
    'country_code': ['CAN', 'USA', 'MEX'],
    'year': [2022, 2022, 2022],
    'gdp': [2140, 25460, 1414]
})

df2 = pd.DataFrame({
    'country_code': ['CAN', 'USA', 'MEX'],
    'year': [2022, 2022, 2022],
    'population': [38, 331, 128]
})

# Merge creates suffixes
merged_suffix = df1.merge(df2, on='country_code')
print("Merge with duplicate 'year' column:")
print(merged_suffix)

Merge with duplicate 'year' column:
  country_code  year_x    gdp  year_y  population
0          CAN    2022   2140    2022          38
1          USA    2022  25460    2022         331
2          MEX    2022   1414    2022         128


In [38]:
# Better: Include 'year' in the key or use custom suffixes

# Option 1: Merge on both columns
merged_both_keys = df1.merge(df2, on=['country_code', 'year'])
print("Option 1 - Merge on both keys:")
print(merged_both_keys)

# Option 2: Custom suffixes
merged_custom = df1.merge(df2, on='country_code', suffixes=('_gdp', '_pop'))
print("\nOption 2 - Custom suffixes:")
print(merged_custom)

Option 1 - Merge on both keys:
  country_code  year    gdp  population
0          CAN  2022   2140          38
1          USA  2022  25460         331
2          MEX  2022   1414         128

Option 2 - Custom suffixes:
  country_code  year_gdp    gdp  year_pop  population
0          CAN      2022   2140      2022          38
1          USA      2022  25460      2022         331
2          MEX      2022   1414      2022         128
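
With one year per country, the two options differ only in column names. With several years per country the difference matters much more. A sketch with made-up two-year panels (`gdp_panel` and `pop_panel` are hypothetical, values are illustrative):

```python
import pandas as pd

# Hypothetical two-year panels (illustrative values)
gdp_panel = pd.DataFrame({
    'country_code': ['CAN', 'CAN', 'USA', 'USA'],
    'year': [2021, 2022, 2021, 2022],
    'gdp': [2000, 2140, 23000, 25460],
})
pop_panel = pd.DataFrame({
    'country_code': ['CAN', 'CAN', 'USA', 'USA'],
    'year': [2021, 2022, 2021, 2022],
    'population': [38, 38, 331, 333],
})

# Joining on country_code alone pairs every year with every year
by_code = gdp_panel.merge(pop_panel, on='country_code')

# Joining on the full key keeps one row per country-year
by_code_year = gdp_panel.merge(pop_panel, on=['country_code', 'year'])

print(len(by_code), len(by_code_year))
```

Merging on `country_code` alone produces 8 rows (2 x 2 per country), while the full key produces the expected 4. Custom suffixes only rename the duplicated column; they do not prevent this row multiplication.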


### Validating Merges

In [39]:
# The 'validate' parameter catches merge problems

# Create a dataset with duplicates
df_with_dups = pd.DataFrame({
    'country_code': ['CAN', 'CAN', 'USA'],  # CAN appears twice!
    'metric': ['a', 'b', 'c']
})

try:
    # This will fail because df_with_dups has duplicate keys
    result = countries.merge(
        df_with_dups,
        on='country_code',
        how='left',
        validate='one_to_one'  # Expect each key appears once in both
    )
except pd.errors.MergeError as e:  # catch only the validation failure
    print(f"Validation caught the problem: {type(e).__name__}")
    print(f"Message: {e}")

Validation caught the problem: MergeError
Message: Merge keys are not unique in right dataset; not a one-to-one merge


In [40]:
# Validate options:
# - "one_to_one" or "1:1": Both sides have unique keys
# - "one_to_many" or "1:m": Left side has unique keys
# - "many_to_one" or "m:1": Right side has unique keys
# - "many_to_many" or "m:m": No uniqueness requirements (careful!)

# Correct usage for our data
result = countries.merge(
    gdp_renamed,
    on='country_code',
    how='left',
    validate='one_to_one'  # Both have unique country codes
)
print("Validated merge successful!")

Validated merge successful!
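
When `validate` raises, the next step is usually to find the offending rows. A minimal sketch, assuming a lookup table shaped like `df_with_dups` above (the `lookup` name and the keep-first rule are illustrative choices, not a recommendation for every dataset):

```python
import pandas as pd

lookup = pd.DataFrame({
    'country_code': ['CAN', 'CAN', 'USA'],
    'metric': ['a', 'b', 'c'],
})

# keep=False flags every row that shares its key with another row
dup_rows = lookup[lookup.duplicated(subset='country_code', keep=False)]
print(dup_rows)

# One possible resolution (an assumption, not a rule): keep the first row per key
lookup_unique = lookup.drop_duplicates(subset='country_code', keep='first')
assert lookup_unique['country_code'].is_unique
```

Whether to keep the first row, aggregate, or fix the source depends on why the duplicates exist, so it is worth inspecting `dup_rows` before choosing.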


---
## 3. Advanced Pivot and Unpivot Patterns

### When to Use Each

| Pattern | Function | Use Case |
|---------|----------|----------|
| Unpivot (wide→long) | `pd.melt()` | Column headers are data values |
| Pivot (long→wide) | `df.pivot()` | Need to spread values across columns |
| Pivot with aggregation | `df.pivot_table()` | Need to aggregate when pivoting |

In [41]:
# Create commodity price data (realistic format from data providers)

commodity_prices = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=12, freq='ME'),  # month-end; 'M' is deprecated as of pandas 2.2
    'wheat_usd_ton': [315, 320, 310, 305, 298, 285, 280, 290, 295, 288, 282, 278],
    'maize_usd_ton': [285, 290, 288, 280, 275, 268, 265, 272, 278, 275, 270, 268],
    'soybeans_usd_ton': [520, 535, 528, 515, 502, 495, 488, 498, 510, 505, 495, 490],
    'rice_usd_ton': [425, 430, 428, 432, 438, 445, 452, 458, 462, 455, 448, 445]
})

print("Commodity Prices (wide format):")
print(commodity_prices)

Commodity Prices (wide format):
         date  wheat_usd_ton  maize_usd_ton  soybeans_usd_ton  rice_usd_ton
0  2023-01-31            315            285               520           425
1  2023-02-28            320            290               535           430
2  2023-03-31            310            288               528           428
3  2023-04-30            305            280               515           432
4  2023-05-31            298            275               502           438
5  2023-06-30            285            268               495           445
6  2023-07-31            280            265               488           452
7  2023-08-31            290            272               498           458
8  2023-09-30            295            278               510           462
9  2023-10-31            288            275               505           455
10 2023-11-30            282            270               495           448
11 2023-12-31            278            268             

In [42]:
# Unpivot: Make this tidy for analysis

prices_long = pd.melt(
    commodity_prices,
    id_vars=['date'],
    value_vars=['wheat_usd_ton', 'maize_usd_ton', 'soybeans_usd_ton', 'rice_usd_ton'],
    var_name='commodity',
    value_name='price_usd_ton'
)

# Clean up commodity names
prices_long['commodity'] = prices_long['commodity'].str.replace('_usd_ton', '')

print("Tidy format (long):")
print(prices_long.head(12))
print(f"\nShape: {prices_long.shape}")

Tidy format (long):
         date commodity  price_usd_ton
0  2023-01-31     wheat            315
1  2023-02-28     wheat            320
2  2023-03-31     wheat            310
3  2023-04-30     wheat            305
4  2023-05-31     wheat            298
5  2023-06-30     wheat            285
6  2023-07-31     wheat            280
7  2023-08-31     wheat            290
8  2023-09-30     wheat            295
9  2023-10-31     wheat            288
10 2023-11-30     wheat            282
11 2023-12-31     wheat            278

Shape: (48, 3)
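
The table in "When to Use Each" lists `df.pivot()` for the reverse direction. A short sketch of going from long back to wide, on a hypothetical four-row table (`long_df` is illustrative):

```python
import pandas as pd

# Hypothetical long table: one row per (date, commodity)
long_df = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-31', '2023-01-31', '2023-02-28', '2023-02-28']),
    'commodity': ['wheat', 'maize', 'wheat', 'maize'],
    'price_usd_ton': [315, 285, 320, 290],
})

# pivot() reshapes without aggregating; it raises ValueError
# if any (date, commodity) pair appears more than once
wide_df = long_df.pivot(index='date', columns='commodity', values='price_usd_ton')

# Flatten back to ordinary columns
wide_df = wide_df.reset_index()
wide_df.columns.name = None
print(wide_df)
```

If duplicate pairs are possible, `pivot_table()` with an explicit `aggfunc` is the safer choice.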


In [43]:
# Now analysis is easy!

# Average price by commodity
print("Average price by commodity:")
print(prices_long.groupby('commodity')['price_usd_ton'].agg(['mean', 'min', 'max']).round(0))

# Price change from the first to the last row of each commodity
# (this relies on the rows already being in date order)
print("\nPrice change (first vs last month):")
price_change = prices_long.groupby('commodity').apply(
    lambda x: x.iloc[-1]['price_usd_ton'] - x.iloc[0]['price_usd_ton']
)
print(price_change)

Average price by commodity:
            mean  min  max
commodity                 
maize      276.0  265  290
rice       443.0  425  462
soybeans   507.0  488  535
wheat      296.0  278  320

Price change (first vs last month):
commodity
maize      -17
rice        20
soybeans   -30
wheat      -37
dtype: int64
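
The first-versus-last calculation depends on row order. A sketch of a variant that sorts explicitly, so it still works if the rows arrive shuffled (the `prices` frame is hypothetical and deliberately out of order):

```python
import pandas as pd

prices = pd.DataFrame({
    'date': pd.to_datetime(['2023-03-31', '2023-01-31', '2023-02-28',
                            '2023-01-31', '2023-03-31', '2023-02-28']),
    'commodity': ['wheat', 'wheat', 'wheat', 'rice', 'rice', 'rice'],
    'price_usd_ton': [310, 315, 320, 425, 428, 430],
})

# Sort by date first so 'first' and 'last' mean earliest and latest
endpoints = (
    prices.sort_values('date')
    .groupby('commodity')['price_usd_ton']
    .agg(['first', 'last'])
)
endpoints['change'] = endpoints['last'] - endpoints['first']
print(endpoints)
```

Selecting the price column before aggregating also avoids passing the grouping column into a custom function.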


### Pivot with Aggregation: pivot_table()

In [44]:
# Create multi-country, multi-year production data

np.random.seed(42)

production_data = pd.DataFrame({
    'country': np.repeat(['Canada', 'USA', 'Brazil', 'Argentina'], 12),
    'year': np.tile(np.repeat([2021, 2022, 2023], 4), 4),
    'quarter': np.tile(['Q1', 'Q2', 'Q3', 'Q4'], 12),
    'crop': np.tile(np.repeat(['wheat', 'maize'], 6), 4),
    'production_mt': np.random.uniform(5, 50, 48).round(1)
})

print("Production data (long format):")
print(production_data.head(16))
print(f"\nShape: {production_data.shape}")

Production data (long format):
   country  year quarter   crop  production_mt
0   Canada  2021      Q1  wheat           21.9
1   Canada  2021      Q2  wheat           47.8
2   Canada  2021      Q3  wheat           37.9
3   Canada  2021      Q4  wheat           31.9
4   Canada  2022      Q1  wheat           12.0
5   Canada  2022      Q2  wheat           12.0
6   Canada  2022      Q3  maize            7.6
7   Canada  2022      Q4  maize           44.0
8   Canada  2023      Q1  maize           32.1
9   Canada  2023      Q2  maize           36.9
10  Canada  2023      Q3  maize            5.9
11  Canada  2023      Q4  maize           48.6
12     USA  2021      Q1  wheat           42.5
13     USA  2021      Q2  wheat           14.6
14     USA  2021      Q3  wheat           13.2
15     USA  2021      Q4  wheat           13.3

Shape: (48, 5)


In [45]:
# pivot_table: Create summary by country and crop

summary_table = production_data.pivot_table(
    values='production_mt',
    index='country',          # Rows
    columns='crop',           # Columns
    aggfunc='sum'             # How to aggregate
)

print("Pivot table: Total production by country and crop")
print(summary_table.round(1))

Pivot table: Total production by country and crop
crop       maize  wheat
country                
Argentina  151.2  126.5
Brazil     190.5  146.7
Canada     175.1  163.5
USA        125.9  130.9


In [46]:
# More complex pivot: Multiple aggregations

complex_pivot = production_data.pivot_table(
    values='production_mt',
    index='country',
    columns='crop',
    aggfunc=['sum', 'mean', 'count']  # Multiple aggregations
)

print("Complex pivot with multiple aggregations:")
print(complex_pivot.round(1))

Complex pivot with multiple aggregations:
             sum         mean       count      
crop       maize  wheat maize wheat maize wheat
country                                        
Argentina  151.2  126.5  25.2  21.1     6     6
Brazil     190.5  146.7  31.8  24.4     6     6
Canada     175.1  163.5  29.2  27.2     6     6
USA        125.9  130.9  21.0  21.8     6     6
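
Multiple aggregations produce two-level column headers, which are awkward to merge or export. One way to flatten them, sketched on a small made-up table (`prod` and the `crop_stat` naming pattern are illustrative choices):

```python
import pandas as pd

# Hypothetical long-format production data (illustrative values)
prod = pd.DataFrame({
    'country': ['Canada', 'Canada', 'Canada', 'USA', 'USA', 'USA'],
    'crop': ['wheat', 'wheat', 'maize', 'wheat', 'maize', 'maize'],
    'production_mt': [10.0, 20.0, 5.0, 30.0, 40.0, 50.0],
})

multi = prod.pivot_table(values='production_mt', index='country',
                         columns='crop', aggfunc=['sum', 'mean'])

# Each column label is a tuple such as ('sum', 'maize'); join the parts
flat = multi.copy()
flat.columns = [f'{crop}_{stat}' for stat, crop in multi.columns]
flat = flat.reset_index()
print(flat)
```

The result has plain string column names such as `wheat_sum`, which behave like any other column in later merges.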


In [61]:
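# Pivot with multiple row indices
# NaN marks country-year combinations with no rows for that crop: in this
# synthetic data, wheat rows cover 2021 to 2022 Q2 and maize rows cover 2022 Q3 onward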

yearly_summary = production_data.pivot_table(
    values='production_mt',
    index=['country', 'year'],  # Multiple row indices
    columns='crop',
    aggfunc='sum'
)

print("Pivot with country and year as rows:")
print(yearly_summary.round(1))

Pivot with country and year as rows:
crop            maize  wheat
country   year              
Argentina 2021    NaN   88.7
          2022   52.4   37.8
          2023   98.8    NaN
Brazil    2021    NaN  107.9
          2022   45.0   38.8
          2023  145.5    NaN
Canada    2021    NaN  139.5
          2022   51.6   24.0
          2023  123.5    NaN
USA       2021    NaN   83.6
          2022   42.5   47.3
          2023   83.4    NaN
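
The `NaN` cells mark combinations that have no rows in the source data. `pivot_table()` has a `fill_value` argument for these gaps. A sketch on a hypothetical three-row table (`prod` is illustrative):

```python
import pandas as pd

# Hypothetical data where maize has no 2021 rows (illustrative values)
prod = pd.DataFrame({
    'country': ['Canada', 'Canada', 'Canada'],
    'year': [2021, 2022, 2022],
    'crop': ['wheat', 'wheat', 'maize'],
    'production_mt': [30.0, 12.0, 8.0],
})

# Default: combinations with no rows show up as NaN
with_nan = prod.pivot_table(values='production_mt', index=['country', 'year'],
                            columns='crop', aggfunc='sum')

# fill_value replaces those gaps; only appropriate if "no rows" truly means zero
with_zero = prod.pivot_table(values='production_mt', index=['country', 'year'],
                             columns='crop', aggfunc='sum', fill_value=0)
print(with_nan)
print(with_zero)
```

Filling with 0 is a modelling decision: "not reported" and "zero production" are different facts, so keeping `NaN` is often the more honest default.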


In [48]:
# Add margins (totals)

with_totals = production_data.pivot_table(
    values='production_mt',
    index='country',
    columns='crop',
    aggfunc='sum',
    margins=True,              # Add totals
    margins_name='Total'       # Name for total row/column
)

print("Pivot with totals:")
print(with_totals.round(1))

Pivot with totals:
crop       maize  wheat   Total
country                        
Argentina  151.2  126.5   277.7
Brazil     190.5  146.7   337.2
Canada     175.1  163.5   338.6
USA        125.9  130.9   256.8
Total      642.7  567.6  1210.3


### Stack and Unstack

In [49]:
# Stack: Move column index to row index
# Unstack: Move row index to column index

# Start with the pivot table
print("Original pivot table:")
print(yearly_summary.head(8))

# Stack: make it longer (with pandas 2.2 defaults, NaN entries are dropped)
stacked = yearly_summary.stack()
print("\nAfter stack():")
print(stacked.head(12))

Original pivot table:
crop            maize  wheat
country   year              
Argentina 2021    NaN   88.7
          2022   52.4   37.8
          2023   98.8    NaN
Brazil    2021    NaN  107.9
          2022   45.0   38.8
          2023  145.5    NaN
Canada    2021    NaN  139.5
          2022   51.6   24.0

After stack():
country    year  crop 
Argentina  2021  wheat     88.7
           2022  maize     52.4
                 wheat     37.8
           2023  maize     98.8
Brazil     2021  wheat    107.9
           2022  maize     45.0
                 wheat     38.8
           2023  maize    145.5
Canada     2021  wheat    139.5
           2022  maize     51.6
                 wheat     24.0
           2023  maize    123.5
dtype: float64


In [50]:
# Unstack: Spread a level to columns

unstacked = stacked.unstack(level='year')  # Move 'year' to columns
print("Unstacked by year:")
print(unstacked.round(1))

Unstacked by year:
year              2021  2022   2023
country   crop                     
Argentina maize    NaN  52.4   98.8
          wheat   88.7  37.8    NaN
Brazil    maize    NaN  45.0  145.5
          wheat  107.9  38.8    NaN
Canada    maize    NaN  51.6  123.5
          wheat  139.5  24.0    NaN
USA       maize    NaN  42.5   83.4
          wheat   83.6  47.3    NaN
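
A stacked result is a Series with a multi-level index, which is not yet a tidy table. A minimal sketch of turning it back into ordinary columns, on a hypothetical two-row pivot (`wide` and the `production_mt` column name are illustrative):

```python
import pandas as pd

# Hypothetical pivoted table with a (country, year) row index
wide = pd.DataFrame(
    {'maize': [52.4, 98.8], 'wheat': [37.8, 10.0]},
    index=pd.MultiIndex.from_tuples([('Argentina', 2022), ('Argentina', 2023)],
                                    names=['country', 'year']),
)
wide.columns.name = 'crop'

# stack() returns a Series with a three-level index;
# reset_index(name=...) turns the index levels back into columns
tidy = wide.stack().reset_index(name='production_mt')
print(tidy)
```

The result has one row per country, year, and crop, which is the same shape `pd.melt()` would produce from a flat wide table.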


---
## 4. Building Analysis-Ready Tables

### What Makes a Table "Analysis-Ready"?

1. **Tidy structure**: Each row is one observation
2. **Correct types**: Numeric columns are numeric, dates are datetime
3. **Consistent keys**: Can be joined with other tables
4. **Documented nulls**: Missing values are understood
5. **Derived metrics**: Common calculations pre-computed
6. **Clear naming**: Column names are descriptive
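
Several of these criteria can be checked mechanically. A rough sketch of such a check, where `check_analysis_ready` is a hypothetical helper written for illustration (not a pandas function) and covers only keys, types, and nulls:

```python
import pandas as pd

def check_analysis_ready(df, key_cols, numeric_cols):
    """Hypothetical helper: return a list of problems found (empty if none)."""
    problems = []
    # Consistent keys: each key combination should identify exactly one row
    if df.duplicated(subset=key_cols).any():
        problems.append(f'duplicate keys in {key_cols}')
    # Correct types: measure columns should be numeric
    for col in numeric_cols:
        if not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f'{col} is not numeric')
    # Documented nulls: report how many values are missing per column
    null_counts = df.isna().sum()
    for col, n in null_counts[null_counts > 0].items():
        problems.append(f'{col} has {n} missing value(s)')
    return problems

demo = pd.DataFrame({
    'country_code': ['CAN', 'CAN', 'USA'],
    'year': [2022, 2022, 2022],
    'wheat_prod_mt': ['34.0', '34.0', None],   # stored as text by mistake
})
print(check_analysis_ready(demo, ['country_code', 'year'], ['wheat_prod_mt']))
```

Returning a list rather than raising on the first failure lets you see every issue in one pass.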

### Complete Example: Climate-Agriculture Analysis Table

In [51]:
# Create source datasets

# 1. Country metadata
countries_meta = pd.DataFrame({
    'country_code': ['CAN', 'USA', 'MEX', 'BRA', 'ARG', 'DEU', 'FRA', 'AUS', 'CHN', 'IND'],
    'country_name': ['Canada', 'United States', 'Mexico', 'Brazil', 'Argentina',
                     'Germany', 'France', 'Australia', 'China', 'India'],
    'region': ['North America', 'North America', 'North America', 'South America',
               'South America', 'Europe', 'Europe', 'Oceania', 'Asia', 'Asia'],
    'hemisphere': ['Northern', 'Northern', 'Northern', 'Southern', 'Southern',
                   'Northern', 'Northern', 'Southern', 'Northern', 'Northern'],
    'arable_land_mha': [38.6, 157.7, 23.3, 55.8, 39.2, 11.8, 18.5, 31.2, 119.5, 156.1]
})

print("Countries metadata:")
print(countries_meta)

Countries metadata:
  country_code   country_name         region hemisphere  arable_land_mha
0          CAN         Canada  North America   Northern             38.6
1          USA  United States  North America   Northern            157.7
2          MEX         Mexico  North America   Northern             23.3
3          BRA         Brazil  South America   Southern             55.8
4          ARG      Argentina  South America   Southern             39.2
5          DEU        Germany         Europe   Northern             11.8
6          FRA         France         Europe   Northern             18.5
7          AUS      Australia        Oceania   Southern             31.2
8          CHN          China           Asia   Northern            119.5
9          IND          India           Asia   Northern            156.1


In [52]:
# 2. Crop production (multiple years)
years = [2020, 2021, 2022, 2023]
country_codes = ['CAN', 'USA', 'MEX', 'BRA', 'ARG', 'DEU', 'FRA', 'AUS', 'CHN', 'IND']

np.random.seed(42)

crop_prod = pd.DataFrame({
    'country_code': np.repeat(country_codes, len(years)),
    'year': np.tile(years, len(country_codes)),
    'wheat_prod_mt': np.random.uniform(5, 140, 40).round(1),
    'maize_prod_mt': np.random.uniform(2, 350, 40).round(1),
    'yield_wheat_ton_ha': np.random.uniform(2.5, 4.5, 40).round(2),
    'yield_maize_ton_ha': np.random.uniform(3.0, 12.0, 40).round(2)
})

print("\nCrop production:")
print(crop_prod.head(12))


Crop production:
   country_code  year  wheat_prod_mt  maize_prod_mt  yield_wheat_ton_ha  \
0           CAN  2020           55.6           44.5                4.23   
1           CAN  2021          133.3          174.3                3.75   
2           CAN  2022          103.8           14.0                3.16   
3           CAN  2023           85.8          318.4                2.63   
4           USA  2020           26.1           92.1                3.12   
5           USA  2021           26.1          232.6                3.15   
6           USA  2022           12.8          110.5                3.96   
7           USA  2023          121.9          183.0                3.78   
8           MEX  2020           86.2          192.3                4.27   
9           MEX  2021          100.6           66.3                3.44   
10          MEX  2022            7.8          339.4                2.74   
11          MEX  2023          135.9          271.7                3.93   

    yi

In [53]:
# 3. Climate data: temperature anomalies and growing-season precipitation
temp_anomalies = pd.DataFrame({
    'iso3': np.repeat(country_codes, len(years)),
    'year': np.tile(years, len(country_codes)),
    'annual_temp_anomaly_c': np.random.uniform(0.5, 2.0, 40).round(2),
    'growing_season_precip_mm': np.random.uniform(200, 800, 40).round(0)
})

print("\nTemperature anomalies:")
print(temp_anomalies.head(12))


Temperature anomalies:
   iso3  year  annual_temp_anomaly_c  growing_season_precip_mm
0   CAN  2020                   1.05                     585.0
1   CAN  2021                   1.45                     250.0
2   CAN  2022                   1.45                     297.0
3   CAN  2023                   1.30                     739.0
4   USA  2020                   0.64                     564.0
5   USA  2021                   1.75                     206.0
6   USA  2022                   0.98                     261.0
7   USA  2023                   0.78                     598.0
8   MEX  2020                   0.56                     203.0
9   MEX  2021                   1.39                     296.0
10  MEX  2022                   1.52                     529.0
11  MEX  2023                   0.52                     615.0


In [54]:
# Build the analysis-ready table step by step

def build_analysis_table(countries, production, climate):
    """
    Build an analysis-ready table combining country, production, and climate data.
    
    Parameters:
    -----------
    countries : pd.DataFrame
        Country metadata with country_code as key
    production : pd.DataFrame
        Crop production data with country_code and year
    climate : pd.DataFrame
        Climate data with iso3 and year
    
    Returns:
    --------
    pd.DataFrame : Analysis-ready table
    """
    print("Building analysis-ready table...")
    
    # Step 1: Standardize key columns
    print("  Step 1: Standardizing keys")
    climate_std = climate.rename(columns={'iso3': 'country_code'})
    
    # Step 2: Merge production with country metadata
    print("  Step 2: Merging production with countries")
    df = production.merge(
        countries,
        on='country_code',
        how='left',
        validate='many_to_one'
    )
    
    # Step 3: Merge with climate data
    print("  Step 3: Merging with climate data")
    df = df.merge(
        climate_std,
        on=['country_code', 'year'],
        how='left',
        validate='one_to_one'
    )
    
    # Step 4: Add derived metrics
    print("  Step 4: Calculating derived metrics")
    
    # Total production
    df['total_cereal_prod_mt'] = df['wheat_prod_mt'] + df['maize_prod_mt']
    
    # Production per arable land
    df['cereal_intensity_mt_per_mha'] = (
        df['total_cereal_prod_mt'] / df['arable_land_mha']
    ).round(2)
    
    # Temperature category
    df['temp_category'] = pd.cut(
        df['annual_temp_anomaly_c'],
        bins=[-np.inf, 0.5, 1.0, 1.5, np.inf],
        labels=['Normal', 'Warm', 'Hot', 'Very Hot']
    )
    
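    # Precipitation category
    # include_lowest=True so a value of exactly 0 mm is labelled 'Dry' instead of NaN
    df['precip_category'] = pd.cut(
        df['growing_season_precip_mm'],
        bins=[0, 300, 500, 700, np.inf],
        labels=['Dry', 'Moderate', 'Wet', 'Very Wet'],
        include_lowest=True
    )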
    
    # Step 5: Reorder columns logically
    print("  Step 5: Organizing columns")
    column_order = [
        # Keys
        'country_code', 'country_name', 'year',
        # Geography
        'region', 'hemisphere', 'arable_land_mha',
        # Production
        'wheat_prod_mt', 'maize_prod_mt', 'total_cereal_prod_mt',
        'yield_wheat_ton_ha', 'yield_maize_ton_ha', 'cereal_intensity_mt_per_mha',
        # Climate
        'annual_temp_anomaly_c', 'temp_category',
        'growing_season_precip_mm', 'precip_category'
    ]
    df = df[column_order]
    
    # Step 6: Sort
    df = df.sort_values(['country_name', 'year']).reset_index(drop=True)
    
    print(f"  Complete! Shape: {df.shape}")
    return df


# Build the table
analysis_table = build_analysis_table(countries_meta, crop_prod, temp_anomalies)

Building analysis-ready table...
  Step 1: Standardizing keys
  Step 2: Merging production with countries
  Step 3: Merging with climate data
  Step 4: Calculating derived metrics
  Step 5: Organizing columns
  Complete! Shape: (40, 16)


In [55]:
# View the result
print("Analysis-Ready Table:")
print(analysis_table.head(16))

Analysis-Ready Table:
   country_code country_name  year         region hemisphere  arable_land_mha  \
0           ARG    Argentina  2020  South America   Southern             39.2   
1           ARG    Argentina  2021  South America   Southern             39.2   
2           ARG    Argentina  2022  South America   Southern             39.2   
3           ARG    Argentina  2023  South America   Southern             39.2   
4           AUS    Australia  2020        Oceania   Southern             31.2   
5           AUS    Australia  2021        Oceania   Southern             31.2   
6           AUS    Australia  2022        Oceania   Southern             31.2   
7           AUS    Australia  2023        Oceania   Southern             31.2   
8           BRA       Brazil  2020  South America   Southern             55.8   
9           BRA       Brazil  2021  South America   Southern             55.8   
10          BRA       Brazil  2022  South America   Southern             55.8   
11    

In [56]:
# Now analysis is straightforward

# Question 1: Average yield by temperature category
print("Average wheat yield by temperature category:")
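# observed=False keeps categories with no rows (they show as NaN) and makes the default explicit
print(analysis_table.groupby('temp_category', observed=False)['yield_wheat_ton_ha'].mean().round(2))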

# Question 2: Production by region and year
print("\nTotal cereal production by region (2023):")
print(
    analysis_table[analysis_table['year'] == 2023]
    .groupby('region')['total_cereal_prod_mt']
    .sum()
    .round(1)
)

# Question 3: Countries with hot years and low precipitation
print("\nHot + Dry country-years:")
hot_dry = analysis_table[
    (analysis_table['temp_category'].isin(['Hot', 'Very Hot'])) &
    (analysis_table['precip_category'] == 'Dry')
]
print(hot_dry[['country_name', 'year', 'temp_category', 'precip_category', 'yield_wheat_ton_ha']])

Average wheat yield by temperature category:
temp_category
Normal       NaN
Warm        3.56
Hot         3.55
Very Hot    3.38
Name: yield_wheat_ton_ha, dtype: float64

Total cereal production by region (2023):
region
Asia              476.5
Europe            536.3
North America    1116.7
Oceania            99.2
South America     512.1
Name: total_cereal_prod_mt, dtype: float64

Hot + Dry country-years:
     country_name  year temp_category precip_category  yield_wheat_ton_ha
13         Canada  2021           Hot             Dry                3.75
14         Canada  2022           Hot             Dry                3.16
26        Germany  2022      Very Hot             Dry                3.13
29          India  2021      Very Hot             Dry                2.87
33         Mexico  2021           Hot             Dry                3.44
37  United States  2021      Very Hot             Dry                3.15
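
The `Normal       NaN` row in the first result comes from two `pd.cut` and `groupby` behaviours: bins are right-inclusive by default, and a categorical grouper can report categories that have no rows. The sketch below illustrates both on a small hypothetical series of anomalies (the values are invented for illustration and are not taken from `analysis_table`).

```python
import numpy as np
import pandas as pd

# Hypothetical anomalies chosen to sit on and around the bin edges
anomalies = pd.Series([0.5, 0.51, 1.0, 1.5, 1.6], name="anomaly")

# Default right=True means each bin is (low, high], so 0.5 is 'Normal' and 1.0 is 'Warm'
temp_category = pd.cut(
    anomalies,
    bins=[-np.inf, 0.5, 1.0, 1.5, np.inf],
    labels=["Normal", "Warm", "Hot", "Very Hot"],
)
demo = pd.DataFrame({"anomaly": anomalies, "temp_category": temp_category})

# Drop the only 'Normal' row; the category itself still exists in the dtype
no_normal = demo[demo["temp_category"] != "Normal"]

# observed=False reports every category, so the empty 'Normal' group shows as NaN
all_levels = no_normal.groupby("temp_category", observed=False)["anomaly"].mean()

# observed=True reports only categories that actually occur in the data
seen_levels = no_normal.groupby("temp_category", observed=True)["anomaly"].mean()

print(all_levels)
print(seen_levels)
```

A NaN for a category therefore usually means "no country-years fell in this bin", not "the yield is missing".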


---
## 5. Data Contracts and Documentation

### What is a Data Contract?

A **data contract** is a document that specifies:
- What columns exist and their types
- What values are allowed
- How often data is updated
- Who is responsible for the data
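
As a minimal sketch of the first two bullets, a contract can also be written as a plain dictionary and checked in code. The `CONTRACT` layout, the `check_contract` helper, and the sample rows below are hypothetical names and values introduced for illustration; they are not part of pandas or of this lecture's datasets.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_string_dtype

# Hypothetical contract: column name -> expected kind and optional allowed values
CONTRACT = {
    "country_code": {"kind": "string"},
    "year": {"kind": "integer", "allowed": [2020, 2021, 2022, 2023]},
    "hemisphere": {"kind": "string", "allowed": ["Northern", "Southern"]},
}

KIND_CHECKS = {"string": is_string_dtype, "integer": is_integer_dtype}


def check_contract(df, contract):
    """Return a list of violations; an empty list means the table passes."""
    problems = []
    for col, rules in contract.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
            continue
        if not KIND_CHECKS[rules["kind"]](df[col]):
            problems.append(f"{col}: expected {rules['kind']}, got {df[col].dtype}")
        allowed = rules.get("allowed")
        if allowed is not None:
            bad = df.loc[~df[col].isin(allowed), col].dropna().unique().tolist()
            if bad:
                problems.append(f"{col}: unexpected values {bad}")
    return problems


# Hypothetical rows that satisfy the contract
good = pd.DataFrame({
    "country_code": ["CAN", "BRA"],
    "year": [2020, 2021],
    "hemisphere": ["Northern", "Southern"],
})

# Simulated upstream change: a renamed key and a misspelled category
bad = good.rename(columns={"country_code": "iso3"}).assign(
    hemisphere=["Northern", "North"]
)

good_problems = check_contract(good, CONTRACT)
bad_problems = check_contract(bad, CONTRACT)
print(good_problems)
print(bad_problems)
```

Returning a list of problems rather than raising on the first one lets a pipeline report every violation in a single run.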

### Why Contracts Matter

Without contracts:
- Upstream changes break downstream analysis
- Nobody knows what columns mean
- Bugs hide in ambiguous data

### Simple Contract Example

In [57]:
import numpy as np
import pandas as pd
from datetime import datetime
from pandas.api.types import (
    is_numeric_dtype,
    is_datetime64_any_dtype,
    is_object_dtype,
)

def generate_data_contract(df, table_name, description):
    """
    Generate a data contract document for a DataFrame.
    """
    n = len(df)

    contract = f"""
================================================================================
DATA CONTRACT: {table_name}
================================================================================

Description: {description}

Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

--------------------------------------------------------------------------------
SCHEMA
--------------------------------------------------------------------------------
"""

    for col in df.columns:
        s = df[col]
        dtype_obj = s.dtype
        dtype_str = str(dtype_obj)

        non_null = int(s.notna().sum())
        null_count = int(s.isna().sum())
        null_pct = (null_count / n * 100) if n else 0.0
        unique = int(s.nunique(dropna=True))

        contract += f"\n{col}:\n"
        contract += f"  Type: {dtype_str}\n"
        contract += f"  Non-null: {non_null}/{n} ({(100-null_pct):.1f}%)\n"
        contract += f"  Unique values: {unique}\n"

        # Sample values for categorical/text columns
        # (isinstance check replaces is_categorical_dtype, which is deprecated in pandas 2.x)
        if is_object_dtype(dtype_obj) or isinstance(dtype_obj, pd.CategoricalDtype):
            samples = s.dropna().astype(str).unique()[:5]
            contract += f"  Sample values: {list(samples)}\n"

        # Range for numeric columns
        if is_numeric_dtype(dtype_obj):
            s_num = pd.to_numeric(s, errors="coerce")
            if s_num.notna().any():
                contract += f"  Range: [{s_num.min():.2f}, {s_num.max():.2f}]\n"

        # Range for datetime columns
        elif is_datetime64_any_dtype(dtype_obj):
            if s.notna().any():
                contract += f"  Range: [{s.min()}, {s.max()}]\n"

    # Try to infer keys from available columns (no assumptions beyond names present).
    # The analysis table uses country_code; country_id is kept for tables that use that name.
    key_cols = [c for c in ["country_code", "country_id", "year", "crop"] if c in df.columns]
    pk_text = " + ".join(key_cols) if key_cols else "Not specified"

    fk_text = [
        f"{c} references countries.{c}"
        for c in ["country_code", "country_id"]
        if c in df.columns
    ]
    fk_text = ", ".join(fk_text) if fk_text else "Not specified"

    contract += f"""
--------------------------------------------------------------------------------
CONSTRAINTS
--------------------------------------------------------------------------------

Primary Key: {pk_text}
Foreign Keys: {fk_text}

--------------------------------------------------------------------------------
BUSINESS RULES
--------------------------------------------------------------------------------

- Numeric measures should be non-negative where applicable
- Year should fall within the dataset coverage
- Missing values should be represented as NULL in the database

--------------------------------------------------------------------------------
REFRESH SCHEDULE
--------------------------------------------------------------------------------

- Update frequency: Depends on source
- Data source: Document per dataset / API
- Last updated: Document per refresh run

================================================================================
"""
    return contract

# Generate contract for your analysis table (analysis_table must be a DataFrame)
contract = generate_data_contract(
    analysis_table,
    "climate_agriculture_analysis",
    "Integrated climate and agricultural production data by country and year"
)

print(contract)



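================================================================================
DATA CONTRACT: climate_agriculture_analysis
================================================================================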

Description: Integrated climate and agricultural production data by country and year

Generated: 2026-01-26 13:25:37

--------------------------------------------------------------------------------
SCHEMA
--------------------------------------------------------------------------------

country_code:
  Type: object
  Non-null: 40/40 (100.0%)
  Unique values: 10
  Sample values: ['ARG', 'AUS', 'BRA', 'CAN', 'CHN']

country_name:
  Type: object
  Non-null: 40/40 (100.0%)
  Unique values: 10
  Sample values: ['Argentina', 'Australia', 'Brazil', 'Canada', 'China']

year:
  Type: int64
  Non-null: 40/40 (100.0%)
  Unique values: 4
  Range: [2020.00, 2023.00]

region:
  Type: object
  Non-null: 40/40 (100.0%)
  Unique values: 5
  Sample values: ['South America', 'Oceania', 'North America', 'Asia', 'Europe']

hemisphere:
  Type: object
  Non-null: 40/40 (100.0%)
  Unique values: 2
  Sample values: ['Southern', 'Northern']

arable_land_mha:
  Type:

In [58]:
# Save contract to file
with open('data_contract_climate_agriculture.md', 'w', encoding='utf-8') as f:
    f.write(contract)

print("Contract saved to: data_contract_climate_agriculture.md")

Contract saved to: data_contract_climate_agriculture.md


### Lineage Notes

Lineage documents where data came from and how it was transformed.

In [59]:
# Simple lineage tracker

class LineageTracker:
    """
    Track data transformations for lineage documentation.
    """
    
    def __init__(self, name):
        self.name = name
        self.steps = []
        self.created_at = datetime.now()
    
    def log(self, operation, details=None, input_shape=None, output_shape=None):
        """Log a transformation step."""
        step = {
            'timestamp': datetime.now().isoformat(),
            'operation': operation,
            'details': details,
            'input_shape': input_shape,
            'output_shape': output_shape
        }
        self.steps.append(step)
        print(f"  Logged: {operation}")
    
    def report(self):
        """Generate lineage report."""
        report = f"\n{'='*60}\n"
        report += f"LINEAGE REPORT: {self.name}\n"
        report += f"{'='*60}\n"
        report += f"Created: {self.created_at}\n\n"
        
        for i, step in enumerate(self.steps, 1):
            report += f"Step {i}: {step['operation']}\n"
            if step['details']:
                report += f"  Details: {step['details']}\n"
            if step['input_shape']:
                report += f"  Input: {step['input_shape']} -> Output: {step['output_shape']}\n"
            report += "\n"
        
        return report


# Example usage
lineage = LineageTracker('climate_agriculture_analysis')

lineage.log('Load countries', 'Source: countries_meta.csv', None, '(10, 5)')
lineage.log('Load production', 'Source: crop_production.csv', None, '(40, 6)')
lineage.log('Load climate', 'Source: temp_anomalies.csv', None, '(40, 4)')
lineage.log('Merge: production + countries', 'Left join on country_code', '(40, 6)', '(40, 10)')
lineage.log('Merge: + climate', 'Left join on country_code, year', '(40, 10)', '(40, 12)')
lineage.log('Add derived: total_cereal_prod_mt', 'wheat + maize', '(40, 12)', '(40, 13)')
lineage.log('Add derived: cereal_intensity', 'total / arable_land', '(40, 13)', '(40, 14)')
lineage.log('Add categorical: temp_category', 'Bins: Normal/Warm/Hot/Very Hot', '(40, 14)', '(40, 15)')

print(lineage.report())

  Logged: Load countries
  Logged: Load production
  Logged: Load climate
  Logged: Merge: production + countries
  Logged: Merge: + climate
  Logged: Add derived: total_cereal_prod_mt
  Logged: Add derived: cereal_intensity
  Logged: Add categorical: temp_category

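============================================================
LINEAGE REPORT: climate_agriculture_analysis
============================================================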
Created: 2026-01-26 13:25:37.915030

Step 1: Load countries
  Details: Source: countries_meta.csv

Step 2: Load production
  Details: Source: crop_production.csv

Step 3: Load climate
  Details: Source: temp_anomalies.csv

Step 4: Merge: production + countries
  Details: Left join on country_code
  Input: (40, 6) -> Output: (40, 10)

Step 5: Merge: + climate
  Details: Left join on country_code, year
  Input: (40, 10) -> Output: (40, 12)

Step 6: Add derived: total_cereal_prod_mt
  Details: wheat + maize
  Input: (40, 12) -> Output: (40, 13)

Step 7: Add derived: cereal_intensity
  Details: total / arable_land
  Input: (40, 13) -> Output: (40, 14)

Step 8: Add categorical: temp_category
  Details: Bins: Normal/Warm/Hot/Very Hot
  Input: (40, 14) -> Output: (40, 15)
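
In the example usage, the shapes are typed by hand as strings, so they can drift from what the code actually does. One possible variation, sketched below, reads the shapes from the DataFrames themselves. The `tracked` helper and the sample production values are hypothetical, introduced only for illustration.

```python
import pandas as pd


def tracked(log, operation, func, df):
    """Apply func(df) and append a step whose shapes are read from the data."""
    result = func(df)
    log.append({
        "operation": operation,
        "input_shape": df.shape,
        "output_shape": result.shape,
    })
    return result


# Hypothetical production rows (values invented for illustration)
production = pd.DataFrame({
    "country_code": ["CAN", "CAN"],
    "year": [2020, 2021],
    "wheat_prod_mt": [35.2, 22.3],
    "maize_prod_mt": [13.6, 14.0],
})

steps = []
with_total = tracked(
    steps,
    "Add derived: total_cereal_prod_mt",
    lambda d: d.assign(total_cereal_prod_mt=d["wheat_prod_mt"] + d["maize_prod_mt"]),
    production,
)

for step in steps:
    print(f"{step['operation']}: {step['input_shape']} -> {step['output_shape']}")
```

Because the shapes come from `df.shape`, the log stays accurate when an upstream table gains or loses rows or columns.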

---
## Summary: Key Takeaways

### 1. Merging Data
- Use `left_on`/`right_on` for different key names
- Rename columns first for cleaner code
- Chain merges with `.merge().merge()`
- Use `validate` to catch problems early
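
The points above can be sketched in a few lines. The two small tables use hypothetical values; `validate` and `indicator` are standard `DataFrame.merge` parameters, and `MergeError` is imported from `pandas.errors`.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical tables whose country keys have different names
production = pd.DataFrame({
    "country_code": ["CAN", "CAN", "MEX"],
    "year": [2020, 2021, 2020],
    "wheat_prod_mt": [35.2, 22.3, 3.0],
})
climate = pd.DataFrame({
    "iso3": ["CAN", "CAN"],
    "year": [2020, 2021],
    "annual_temp_anomaly_c": [1.1, 1.3],
})

# left_on/right_on handle the differing key names;
# indicator=True adds a _merge column showing where each row matched
merged = production.merge(
    climate,
    left_on=["country_code", "year"],
    right_on=["iso3", "year"],
    how="left",
    validate="one_to_one",
    indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["country_code", "year"]])

# A duplicated climate row violates one_to_one, so validate raises MergeError
climate_dup = pd.concat([climate, climate.iloc[[0]]], ignore_index=True)
try:
    production.merge(
        climate_dup,
        left_on=["country_code", "year"],
        right_on=["iso3", "year"],
        how="left",
        validate="one_to_one",
    )
    duplicate_caught = False
except MergeError as err:
    duplicate_caught = True
    print(f"validate caught a problem: {err}")
```

Without `validate`, the duplicated key would silently add an extra row to the result instead of raising an error.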

### 2. Pivot and Unpivot
- `melt()`: Wide to long (unpivot)
- `pivot()`: Long to wide (no aggregation)
- `pivot_table()`: Long to wide with aggregation
- `stack()`/`unstack()`: Move between row and column indices
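
A short round trip shows how the first three functions relate. The production values below are hypothetical and serve only to make the shapes visible.

```python
import pandas as pd

# Hypothetical wide table: one column per metric
wide = pd.DataFrame({
    "country_code": ["CAN", "CAN", "MEX", "MEX"],
    "year": [2022, 2023, 2022, 2023],
    "wheat_prod_mt": [34.0, 32.0, 3.5, 3.4],
    "maize_prod_mt": [14.5, 15.1, 26.6, 27.5],
})

# melt: wide -> long, one row per (country, year, metric)
long = wide.melt(
    id_vars=["country_code", "year"],
    var_name="metric",
    value_name="value",
)

# pivot: long -> wide with no aggregation (each key/metric pair must be unique)
back = long.pivot(
    index=["country_code", "year"],
    columns="metric",
    values="value",
).reset_index()

# pivot_table: long -> wide with aggregation, here summing over countries
by_year = long.pivot_table(
    index="year",
    columns="metric",
    values="value",
    aggfunc="sum",
)

print(long.shape)
print(back)
print(by_year)
```

`pivot()` raises an error when an index/column pair occurs more than once, which is the signal to switch to `pivot_table()` and choose an aggregation.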

### 3. Analysis-Ready Tables
- Tidy structure
- Correct types
- Derived metrics pre-computed
- Logical column order
- Clear naming

### 4. Documentation
- Data contracts specify schema and rules
- Lineage tracks transformations
- Both are essential for production pipelines

---

## Assignment 2 Connection

In Assignment 2, you will:
1. Extract weather data from an API (JSON)
2. Transform JSON to tidy DataFrame
3. Merge with existing crop data from A-1
4. Create aggregated views (monthly, annual)
5. Build analysis-ready tables for business questions

The skills from this lecture are directly applicable!

---

## Practice Exercises

1. **Exercise 1**: Create three DataFrames with different key column names and merge them into one analysis table.

2. **Exercise 2**: Take a wide dataset with multiple metrics as columns and convert it to long format, then create a pivot table summarizing by two dimensions.

3. **Exercise 3**: Build a complete analysis-ready table that includes at least 3 derived metrics and 2 categorical classifications.

4. **Exercise 4**: Write a data contract for a table of your choice, including all constraints and business rules.