# `FRE 521D_Assignment 1_Group 3`
### Members: Janine, Juliette, Margaret & Clare

## Task 1

The database consists of five tables:

1. countries
2. crops
3. country_name_mapping
4. crop_production
5. temperature_anomalies

Each table is designed to represent a distinct entity and maintain referential integrity across datasets.


a)
1. The *countries* table stores standardized country-level attributes that are shared across both datasets.
- Avoid duplication of country metadata such as region and income group
- Provide a single authoritative reference for country identifiers
- Enable consistent joins across agricultural and climate datasets

Key fields include: 
country_id (Primary Key); country_name; iso3_code; region; income_group

2. The *crops* table stores standardized crop-level attributes that are reused across agricultural production records.

- Avoid duplication of crop names across production rows
- Prevent inconsistencies caused by spelling variations or formatting differences (e.g., "Wheat" vs "Wheat ")
- Provide a single authoritative reference for crop identifiers used in agricultural analysis

Key fields include: 
crop_id (Primary Key); crop_name

3. The *country_name_mapping* table resolves inconsistencies in country naming across the two source datasets.

- Map multiple source-specific country name variants to a standardized country identifier
- Enable reliable joins between crop production and temperature anomaly datasets
- Explicitly document data harmonization decisions across data sources

Key fields include:
source_dataset; source_country_name; country_id (Foreign Key from countries.country_id); where (source_dataset, source_country_name) is the Primary Key

4. The *crop_production* table stores annual agricultural production measures at the country–year–crop level and serves as the core fact table for analysis.

- Store time-varying agricultural metrics derived from the crop production dataset
- Support analysis of production, yield, and agricultural inputs over time
- Enable aggregation and comparison across countries, regions, crops, and years

Key fields include:
country_id (Foreign Key from countries.country_id); year; crop_id (Foreign Key from crops.crop_id); area_harvested_ha; production_tonnes; yield_kg_ha; fertilizer_use_kg_ha; irrigation_pct; notes; where (country_id, year, crop_id) is the Primary Key

5. The *temperature_anomalies* table stores annual temperature anomaly measures at the country–year level and serves as the climate fact table.

- Store time-varying climate metrics derived from the temperature anomaly dataset
- Support integration of climate indicators with agricultural production data
- Enable comparison of agricultural outcomes across warmer and cooler years

Key fields include:
country_id (Foreign Key from countries.country_id); year; annual_anomaly_c; jan–dec (one column per month); where (country_id, year) is the Primary Key

b)

`Primary Keys`

The *countries table* uses *country_id* as its primary key, serving as the unique identifier for each country.

The *crops table* uses *crop_id* as its primary key, ensuring a unique identifier for each crop.

The *country_name_mapping* table uses a composite primary key *(source_dataset, source_country_name)*, guaranteeing that each raw country name from a given source dataset maps to a single standardized country.

The *crop_production* table uses a composite primary key *(country_id, year, crop_id)*, reflecting the natural grain of the data, where each record represents a unique country–year–crop observation.

The *temperature_anomalies* table uses a composite primary key *(country_id, year)*, consistent with the country–year granularity of the temperature dataset.

`Foreign Keys`

*crop_production.country_id* references *countries.country_id*, linking each agricultural observation to a valid country.

*crop_production.crop_id* references *crops.crop_id*, enforcing consistent crop identification.

*temperature_anomalies.country_id* references *countries.country_id*, ensuring that every climate record is associated with a valid country.

*country_name_mapping.country_id* references *countries.country_id*, linking source-specific country names to standardized identifiers.
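
The effect of these keys can be checked with a small sketch. It uses Python's built-in SQLite driver as a stand-in for MySQL (an assumption made so the example runs without a server), a trimmed-down version of two of the tables, and a hypothetical helper `try_insert`. The exact error text differs between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled
conn.executescript(
    """
    CREATE TABLE countries (
        country_id INTEGER PRIMARY KEY,
        country_name TEXT NOT NULL
    );
    CREATE TABLE temperature_anomalies (
        country_id INTEGER NOT NULL,
        year INTEGER NOT NULL,
        annual_anomaly_c REAL,
        PRIMARY KEY (country_id, year),
        FOREIGN KEY (country_id) REFERENCES countries(country_id)
    );
    INSERT INTO countries VALUES (1, 'Canada');
    INSERT INTO temperature_anomalies VALUES (1, 1990, 0.07);
    """
)


def try_insert(params):
    """Return None if the row is accepted, otherwise the constraint error text."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO temperature_anomalies VALUES (?, ?, ?)", params)
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)


duplicate_key = try_insert((1, 1990, 0.55))  # same (country_id, year) as an existing row
orphan_row = try_insert((99, 1990, 0.10))    # country_id 99 is not in countries
new_year = try_insert((1, 1991, 0.20))       # new country-year pair, valid country

print("duplicate key :", duplicate_key)
print("orphan row    :", orphan_row)
print("new year      :", new_year)
```

The first two inserts are rejected by the composite primary key and the foreign key respectively, while the third is accepted.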

c)

*countries* table:
- country_id: INT, numeric surrogate key that enables efficient joins and indexing
- country_name: VARCHAR, flexible text storage for country names of varying length
- iso3_code: CHAR(3), fixed-length three-letter ISO country code
- region: VARCHAR, categorical attribute representing geographic region
- income_group: VARCHAR, categorical attribute representing income classification


*crops* table:
- crop_id: INT, numeric surrogate key for standardized crop identification
- crop_name: VARCHAR, categorical text field storing unique crop names


*country_name_mapping* table:
- source_dataset: VARCHAR, indicates which source dataset the raw country name came from (e.g., 'crop', 'temp'), so the same string can be mapped differently across sources if needed
- source_country_name: VARCHAR, stores raw country name strings exactly as they appear in source datasets
- country_id: INT, foreign key referencing countries.country_id, linking each source-specific country name to a standardized country identifier


*crop_production* table:
- country_id: INT, foreign key enabling joins with standardized country metadata
- year: INT, numeric representation supporting temporal filtering and grouping
- crop_id: INT, foreign key enforcing consistent crop identification
- area_harvested_ha: FLOAT, continuous numeric measure of harvested area
- production_tonnes: FLOAT, continuous numeric measure of total production
- yield_kg_ha: FLOAT, continuous numeric yield metric
- fertilizer_use_kg_ha: FLOAT, continuous numeric input measure
- irrigation_pct: FLOAT, numeric percentage representing irrigation coverage
- notes: VARCHAR, free-text field for annotations and metadata


*temperature_anomalies* table:
- country_id: INT, foreign key enabling integration with agricultural data
- year: INT, numeric representation of the observation year
- annual_anomaly_c: FLOAT, continuous numeric climate anomaly measure
- jan–dec: FLOAT (nullable), monthly temperature anomaly values that may be missing
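
One consequence of choosing FLOAT is worth illustrating. In MySQL, a plain FLOAT is a 4-byte single-precision value, so it holds roughly seven significant digits. The sketch below round-trips one production value from the crop sample through a 4-byte float using only the standard library. The helper name `as_float32` is hypothetical, and the example shows the rounding effect without connecting to MySQL.

```python
import struct


def as_float32(value: float) -> float:
    """Round-trip a Python float through a 4-byte IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


production_tonnes = 58322509.35  # production value taken from the crop sample rows
stored = as_float32(production_tonnes)

print("original:", production_tonnes)
print("as 4-byte float:", stored)
print("difference:", production_tonnes - stored)
```

If exact tonnage matters for later aggregation, DOUBLE or DECIMAL would be possible alternatives. That is a suggestion only, not part of the schema described here.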



d) 

The schema is normalized mainly to reduce data repetition and keep country and crop identifiers consistent across datasets. Country-level information such as region, income group, and ISO code is stored in a separate countries table instead of being repeated in every agricultural or climate record. Crop names are also placed in a crops table so that the same crop is always identified in a consistent way. Agricultural production data and temperature anomaly data are stored in different fact tables because they have different data structures and time coverage. Differences in country names between the source files are handled using a country_name_mapping table, which helps connect the datasets correctly. Composite keys are used in the fact tables to match the natural structure of the data and uniquely identify each observation.
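
The role of the country_name_mapping table can be sketched with a small in-memory example. It again uses SQLite as a stand-in for MySQL and a reduced set of columns, which are assumptions for illustration. The name "Atlantis" is a made-up value included to show what an unmapped name looks like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE countries (country_id INTEGER PRIMARY KEY, country_name TEXT NOT NULL);
    CREATE TABLE country_name_mapping (
        source_dataset TEXT NOT NULL,
        source_country_name TEXT NOT NULL,
        country_id INTEGER NOT NULL REFERENCES countries(country_id),
        PRIMARY KEY (source_dataset, source_country_name)
    );
    INSERT INTO countries VALUES (4, 'United States'), (12, 'Russia');
    INSERT INTO country_name_mapping VALUES
        ('crop', 'United States', 4),
        ('temp', 'United States of America', 4),
        ('crop', 'Russia', 12),
        ('temp', 'Russian Federation', 12);
    """
)

lookup_sql = """
    SELECT c.country_id, c.country_name
    FROM country_name_mapping AS m
    JOIN countries AS c ON c.country_id = m.country_id
    WHERE m.source_dataset = ? AND m.source_country_name = ?
"""

# Resolve raw temperature-file names to the standardized country record.
raw_temp_names = ["United States of America", "Russian Federation", "Atlantis"]
resolved = {
    name: conn.execute(lookup_sql, ("temp", name)).fetchone() for name in raw_temp_names
}

for name, row in resolved.items():
    print(f"{name!r} -> {row}")
```

Both spellings of a country resolve to the same country_id, and a name without a mapping row comes back as `None`, so it can be flagged instead of being silently dropped in a join.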

In [7]:
# Import required libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine, text
import warnings

warnings.filterwarnings("ignore")

In [8]:
# Load the SQL magic extension
%load_ext sql
    
%config SqlMagic.style = '_DEPRECATED_DEFAULT'
%config SqlMagic.autopandas = False

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [9]:
%load_ext sql
%sql mysql+pymysql://mfre521d_user:mfre521d_user_pw@127.0.0.1:3306/mfre521d


The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [10]:
# Database connection parameters
DB_USER = "mfre521d_user"
DB_PASSWORD = "mfre521d_user_pw"
DB_HOST = "localhost"
DB_PORT = "3306"
DB_NAME = "mfre521d"

# Create connection string
connection_string = f"mysql+pymysql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"

# Create SQLAlchemy engine
engine = create_engine(connection_string)

# Connect using SQL magic
%sql {connection_string}


In [11]:
def run_query(sql):
    return pd.read_sql(sql, engine)


In [12]:
%%sql

SELECT 'Connection successful!' AS status, NOW() AS current_ts;

   mysql+pymysql://mfre521d_user:***@127.0.0.1:3306/mfre521d
 * mysql+pymysql://mfre521d_user:***@localhost:3306/mfre521d
1 rows affected.


status,current_ts
Connection successful!,2026-01-16 01:44:23


Import CSV Files - 1) Crop Production, and 2) Temperature Anomalies

In [13]:
from pathlib import Path
import pandas as pd

# Locate repo root
# Notebook is in: 521d_assignment/notebooks/
# Repo root is:   521d_assignment/
repo_root = Path.cwd().resolve().parents[0]

print("Repo root:", repo_root)

# Build paths to CSV files
crop_csv = repo_root / "data" / "crop_production_1990_2023.csv"
temp_csv = repo_root / "data" / "temperature_anomalies_1990_2023.csv"

print("Crop CSV exists:", crop_csv.exists())
print("Temp CSV exists:", temp_csv.exists())

Repo root: /Users/claremengebier/Desktop/MFRE/FRE 521D/521d_assignment
Crop CSV exists: True
Temp CSV exists: True
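
The path logic above assumes the notebook is started from the notebooks/ folder, one level below the repo root. A slightly more tolerant variant, sketched here with a hypothetical helper `find_repo_root`, walks upward until it finds a folder containing a data/ directory. The temporary folder layout only exists to make the example runnable.

```python
import tempfile
from pathlib import Path


def find_repo_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Demonstration on a throwaway folder that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "521d_assignment"
    (root / "data").mkdir(parents=True)
    (root / "notebooks").mkdir()
    found_from_notebooks = find_repo_root(root / "notebooks") == root.resolve()
    found_from_root = find_repo_root(root) == root.resolve()

print(found_from_notebooks, found_from_root)
```

In the notebook this would be called as `find_repo_root(Path.cwd())`, which gives the same result whether the kernel starts in the repo root or in notebooks/.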


In [14]:
# Read crop production data
crop_df_raw = pd.read_csv(crop_csv, na_values=["..", "NA", ""], encoding="utf-8")

print("Crop data shape:", crop_df_raw.shape)
crop_df_raw.head()

Crop data shape: (4187, 12)


Unnamed: 0,Country,ISO3_Code,Region,Income_Group,Year,Crop,Area_Harvested_Ha,Production_Tonnes,Yield_Kg_Ha,Fertilizer_Use_Kg_Ha,Irrigation_Pct,Notes
0,China,CHN,East Asia,Upper middle income,2001.0,Soybeans,3751494,12036421.75,3208.43,100.9,,
1,Nepal,NPL,South Asia,Low income,1993.0,Maize,2112762,11377270.55,538502.0,1914.0,9.8,
2,South Korea,KOR,East Asia,High income,1995.0,Soybeans,1650777,7474101.16,4527.63,193.84,56.6,
3,United States,USA,North America,High income,2018.0,Wheat,4782989,32397951.41,6773.58,205.12,62.5,
4,Japan,JPN,East Asia,High income,2013.0,Rice,5434696,58322509.35,1073151.0,21164.0,61.4,


In [15]:
# Read temperature anomaly data
temp_df_raw = pd.read_csv(temp_csv, na_values=["NA", ""], encoding="utf-8")

print("Temperature data shape:", temp_df_raw.shape)
temp_df_raw.head()

Temperature data shape: (1137, 15)


Unnamed: 0,Country_Name,Year,Annual_Anomaly_C,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
0,United States of America,1990,0.07,(0.02),,,-0.09,(0.11),0.44,-0.44,0.3,-0.08,(0.03),-0.31,-0.39
1,United States of America,1991,0.2,0.36,0.4,0.6,0.74,0.22,0.34,0.22,0.7,,-0.44,-0.5,-0.22
2,United States of America,1992,0.54,0.46,0.76,0.85,0.68,0.6,0.98,0.52,0.92,0.53,0.01,0.09,0.07
3,United States of America,1993,0.43,0.28,0.22,0.74,0.6,,1.3,0.81,0.35,(0.3),0.04,0.55,-0.13
4,United States of America,1994,0.87,0.75,0.86,0.65,1.14,0.45,1.71,1.27,1.04,0.01,0.62,1.05,0.88


## Task 2 - Table Creation & Data Ingestion

### Inspecting the data

In [16]:
# checking column names & data type
crop_df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4187 entries, 0 to 4186
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Country               4187 non-null   object 
 1   ISO3_Code             4187 non-null   object 
 2   Region                4187 non-null   object 
 3   Income_Group          4187 non-null   object 
 4   Year                  4187 non-null   float64
 5   Crop                  4187 non-null   object 
 6   Area_Harvested_Ha     4187 non-null   int64  
 7   Production_Tonnes     4087 non-null   object 
 8   Yield_Kg_Ha           4083 non-null   object 
 9   Fertilizer_Use_Kg_Ha  4090 non-null   object 
 10  Irrigation_Pct        4088 non-null   object 
 11  Notes                 209 non-null    object 
dtypes: float64(1), int64(1), object(10)
memory usage: 392.7+ KB


In [17]:
temp_df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1137 entries, 0 to 1136
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Country_Name      1137 non-null   object
 1   Year              1137 non-null   int64 
 2   Annual_Anomaly_C  1137 non-null   object
 3   Jan               1114 non-null   object
 4   Feb               1114 non-null   object
 5   Mar               1114 non-null   object
 6   Apr               1115 non-null   object
 7   May               1115 non-null   object
 8   Jun               1114 non-null   object
 9   Jul               1114 non-null   object
 10  Aug               1114 non-null   object
 11  Sep               1114 non-null   object
 12  Oct               1114 non-null   object
 13  Nov               1114 non-null   object
 14  Dec               1114 non-null   object
dtypes: int64(1), object(14)
memory usage: 133.4+ KB


### Data Cleaning

In [18]:
# converting year type for crop df & checking temp df to avoid join / key mismatches later

crop_df = crop_df_raw.copy()
crop_df["Year"] = pd.to_numeric(crop_df["Year"], errors="coerce").astype("Int64")

temp_df = temp_df_raw.copy()

assert (
    crop_df["Year"].isna().sum() == 0
)  # if no error, will run cell, otherwise will throw an error
assert temp_df["Year"].isna().sum() == 0
print("Year columns validated.")

Year columns validated.


In [19]:
# checking if parentheses in the temp_df_raw appear alongside minus signs

temp_df_raw.apply(
    lambda col: col.astype(str).str.contains(r"\(-", regex=True, na=False).any()
).sum()

# checking how many times parentheses occur in each column to see if it's noise or a systematic occurrence
months = [
    "Jan",
    "Feb",
    "Mar",
    "Apr",
    "May",
    "Jun",
    "Jul",
    "Aug",
    "Sep",
    "Oct",
    "Nov",
    "Dec",
]
cols = ["Annual_Anomaly_C"] + months

for c in cols:
    n = (
        temp_df_raw[c]
        .astype("string")
        .str.contains(r"^\(.*\)$", regex=True, na=False)
        .sum()
    )
    if n:
        print(c, n)

Annual_Anomaly_C 35
Jan 55
Feb 38
Mar 34
Apr 22
May 27
Jun 40
Jul 50
Aug 51
Sep 64
Oct 81
Nov 73
Dec 83


In [20]:
# More cleaning

# 1. Cleaning crop_df

crop_numeric_cols = [
    "Area_Harvested_Ha",
    "Production_Tonnes",
    "Yield_Kg_Ha",
    "Fertilizer_Use_Kg_Ha",
    "Irrigation_Pct",
]


def clean_num_eu(series):  # series refers to one pandas column
    # forcing everything to be read as string temporarily & .str.strip() removes leading & trailing spaces e.g., " 19.14 "
    s = series.astype("string").str.strip()
    # removing anything that is not digits 0-9, commas, dots, minus signs & parentheses (e.g., footnote markers)
    s = s.str.replace(r"[^0-9,\.\(\)\-]", "", regex=True)
    # EU decimal conversion to dot
    s = s.str.replace(",", ".", regex=False)
    # replacing any negative values represented with parentheses to minus signs
    s = s.str.replace(r"^\((.*)\)$", r"-\1", regex=True)
    # converting any missing values (after all transformation above) to NA
    s = s.replace({"": pd.NA})
    return pd.to_numeric(s, errors="coerce")


# applying function to each column in crop_numeric_cols
for c in crop_numeric_cols:
    crop_df[c] = clean_num_eu(crop_df[c])

# 2. Cleaning temp_df

temp_numeric_cols = [
    "Annual_Anomaly_C",
    "Jan",
    "Feb",
    "Mar",
    "Apr",
    "May",
    "Jun",
    "Jul",
    "Aug",
    "Sep",
    "Oct",
    "Nov",
    "Dec",
]

for c in temp_numeric_cols:
    s = temp_df[c].astype("string").str.strip()
    s = s.str.replace(
        r"^\((.*)\)$", r"-\1", regex=True
    )  # replacing parentheses (negative values) with a minus sign
    s = s.replace({"": pd.NA})
    temp_df[c] = pd.to_numeric(s, errors="coerce")

# checking results of cleaning

print("Crop cleaned dtypes:\n", crop_df[["Year"] + crop_numeric_cols].dtypes)
print("Temperature cleaned dtypes:\n", temp_df[["Year"] + temp_numeric_cols].dtypes)

Crop cleaned dtypes:
 Year                      Int64
Area_Harvested_Ha         Int64
Production_Tonnes       Float64
Yield_Kg_Ha             Float64
Fertilizer_Use_Kg_Ha    Float64
Irrigation_Pct          Float64
dtype: object
Temperature cleaned dtypes:
 Year                  int64
Annual_Anomaly_C    Float64
Jan                 Float64
Feb                 Float64
Mar                 Float64
Apr                 Float64
May                 Float64
Jun                 Float64
Jul                 Float64
Aug                 Float64
Sep                 Float64
Oct                 Float64
Nov                 Float64
Dec                 Float64
dtype: object
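
The cleaning rules used above can be sanity-checked on a handful of strings. The sketch below is a standard-library re-implementation for a single value, not the pandas code itself. The function name `parse_number` and the sample strings (including the footnote-style `*`) are illustrative assumptions about what the raw cells may look like.

```python
import re

_NOT_ALLOWED = re.compile(r"[^0-9,.()\-]")  # everything except digits , . ( ) -


def parse_number(raw):
    """Parse one raw cell; return a float, or None when nothing numeric remains."""
    if raw is None:
        return None
    text = _NOT_ALLOWED.sub("", str(raw).strip())  # drop spaces, footnote markers, etc.
    text = text.replace(",", ".")                   # decimal comma -> decimal point
    wrapped = re.fullmatch(r"\((.*)\)", text)       # accounting-style negative: (0.11)
    if wrapped:
        text = "-" + wrapped.group(1)
    try:
        return float(text)
    except ValueError:
        return None


samples = ["5385,02", " 19.14 ", "(0.11)", "-0.44", "211.64*", "", ".."]
parsed = [parse_number(s) for s in samples]

for raw, value in zip(samples, parsed):
    print(f"{raw!r:>12} -> {value}")
```

Like the notebook code, this treats every comma as a decimal separator, so a value written with thousands separators such as `1.234,56` would not parse and would end up missing.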


In [21]:
# checking NA values

print(crop_df[crop_numeric_cols].isna().sum())
print(temp_df[temp_numeric_cols].isna().sum())

# checking duplicates
# crop data: no repeated (country, iso3_code, year, crop) rows; temperature data: no repeated (country_name, year) rows
print(crop_df.duplicated(subset=["Country", "ISO3_Code", "Year", "Crop"]).sum())
print(temp_df.duplicated(subset=["Country_Name", "Year"]).sum())

Area_Harvested_Ha         0
Production_Tonnes       195
Yield_Kg_Ha             125
Fertilizer_Use_Kg_Ha    125
Irrigation_Pct          125
dtype: int64
Annual_Anomaly_C     0
Jan                 23
Feb                 23
Mar                 23
Apr                 22
May                 22
Jun                 23
Jul                 23
Aug                 23
Sep                 23
Oct                 23
Nov                 23
Dec                 23
dtype: int64
0
0


In [22]:
# renaming columns to be consistent going forward (all lower-case)

crop_rename = {
    "Country": "country",
    "ISO3_Code": "iso3_code",
    "Region": "region",
    "Income_Group": "income_group",
    "Year": "year",
    "Crop": "crop",
    "Area_Harvested_Ha": "area_harvested_ha",
    "Production_Tonnes": "production_tonnes",
    "Yield_Kg_Ha": "yield_kg_ha",
    "Fertilizer_Use_Kg_Ha": "fertilizer_use_kg_ha",
    "Irrigation_Pct": "irrigation_pct",
    "Notes": "notes",
}
crop_df = crop_df.rename(columns=crop_rename)

temp_rename = {
    "Country_Name": "country_name",
    "Year": "year",
    "Annual_Anomaly_C": "annual_anomaly_c",
    "Jan": "jan",
    "Feb": "feb",
    "Mar": "mar",
    "Apr": "apr",
    "May": "may",
    "Jun": "jun",
    "Jul": "jul",
    "Aug": "aug",
    "Sep": "sep",
    "Oct": "oct",
    "Nov": "nov",
    "Dec": "dec",
}
temp_df = temp_df.rename(columns=temp_rename)

In [23]:
# also renaming crop_numeric_cols & temp_numeric_cols to match new naming

crop_numeric_cols = [
    "area_harvested_ha",
    "production_tonnes",
    "yield_kg_ha",
    "fertilizer_use_kg_ha",
    "irrigation_pct",
]

temp_numeric_cols = [
    "annual_anomaly_c",
    "jan",
    "feb",
    "mar",
    "apr",
    "may",
    "jun",
    "jul",
    "aug",
    "sep",
    "oct",
    "nov",
    "dec",
]

### Loading data into MySQL

In [24]:
%%sql
-- RESET (for reproducibility): drop tables so running top-to-bottom always works on a fresh DB
SET FOREIGN_KEY_CHECKS = 0;

DROP TABLE IF EXISTS temperature_anomalies;
DROP TABLE IF EXISTS crop_production;
DROP TABLE IF EXISTS country_name_mapping;
DROP TABLE IF EXISTS crops;
DROP TABLE IF EXISTS countries;

SET FOREIGN_KEY_CHECKS = 1;

   mysql+pymysql://mfre521d_user:***@127.0.0.1:3306/mfre521d
 * mysql+pymysql://mfre521d_user:***@localhost:3306/mfre521d
0 rows affected.
0 rows affected.
0 rows affected.
0 rows affected.
0 rows affected.
0 rows affected.
0 rows affected.


[]

### Table Creation

In [25]:
%%sql

-- Create countries table (dimensions table)
CREATE TABLE countries (
    country_id INT AUTO_INCREMENT PRIMARY KEY,
    country_name VARCHAR(255) NOT NULL,
    iso3_code CHAR(3) NOT NULL,
    region VARCHAR(100) NOT NULL,
    income_group VARCHAR(100) NOT NULL
);

   mysql+pymysql://mfre521d_user:***@127.0.0.1:3306/mfre521d
 * mysql+pymysql://mfre521d_user:***@localhost:3306/mfre521d
0 rows affected.


[]

In [26]:
%%sql

-- Create crops table (dimensions table)
CREATE TABLE crops (
    crop_id INT AUTO_INCREMENT PRIMARY KEY,
    crop_name VARCHAR(255) NOT NULL
);

   mysql+pymysql://mfre521d_user:***@127.0.0.1:3306/mfre521d
 * mysql+pymysql://mfre521d_user:***@localhost:3306/mfre521d
0 rows affected.


[]

In [27]:
%%sql

-- Create country_name_mapping table (dimensions table)
CREATE TABLE country_name_mapping (
    source_dataset VARCHAR(50) NOT NULL,
    source_country_name VARCHAR(255) NOT NULL,
    country_id INT NOT NULL,
    PRIMARY KEY (source_dataset, source_country_name),
    FOREIGN KEY (country_id) REFERENCES countries(country_id)
);

   mysql+pymysql://mfre521d_user:***@127.0.0.1:3306/mfre521d
 * mysql+pymysql://mfre521d_user:***@localhost:3306/mfre521d
0 rows affected.


[]

In [28]:
%%sql

-- Create crop_production table (fact table)
CREATE TABLE crop_production (
    country_id INT NOT NULL,
    year INT NOT NULL,
    crop_id INT NOT NULL,

    area_harvested_ha FLOAT,
    production_tonnes FLOAT,
    yield_kg_ha FLOAT,
    fertilizer_use_kg_ha FLOAT,
    irrigation_pct FLOAT,
    notes VARCHAR(255),

    PRIMARY KEY (country_id, year, crop_id),
    FOREIGN KEY (country_id) REFERENCES countries(country_id),
    FOREIGN KEY (crop_id) REFERENCES crops(crop_id)
);

   mysql+pymysql://mfre521d_user:***@127.0.0.1:3306/mfre521d
 * mysql+pymysql://mfre521d_user:***@localhost:3306/mfre521d
0 rows affected.


[]

In [29]:
%%sql

-- Create temperature_anomalies table (fact table)
CREATE TABLE temperature_anomalies (
    country_id INT NOT NULL,
    year INT NOT NULL,

    annual_anomaly_c FLOAT,
    jan FLOAT,
    feb FLOAT, 
    mar FLOAT,
    apr FLOAT,
    may FLOAT,
    jun FLOAT,
    jul FLOAT,
    aug FLOAT,
    sep FLOAT,
    oct FLOAT,
    nov FLOAT,
    `dec` FLOAT,

    PRIMARY KEY (country_id, year),
    FOREIGN KEY (country_id) REFERENCES countries(country_id)
);

   mysql+pymysql://mfre521d_user:***@127.0.0.1:3306/mfre521d
 * mysql+pymysql://mfre521d_user:***@localhost:3306/mfre521d
0 rows affected.


[]

In [30]:
%%sql

SHOW TABLES;

   mysql+pymysql://mfre521d_user:***@127.0.0.1:3306/mfre521d
 * mysql+pymysql://mfre521d_user:***@localhost:3306/mfre521d
15 rows affected.


Tables_in_mfre521d
AirQuality_2
air_quality_readings
countries
country_name_mapping
crop_production
crops
daily_summary
food_nutrition
monthly_summary
pollution_thresholds


In [31]:
# inspecting data
crop_df.head()

Unnamed: 0,country,iso3_code,region,income_group,year,crop,area_harvested_ha,production_tonnes,yield_kg_ha,fertilizer_use_kg_ha,irrigation_pct,notes
0,China,CHN,East Asia,Upper middle income,2001,Soybeans,3751494,12036421.75,3208.43,100.9,,
1,Nepal,NPL,South Asia,Low income,1993,Maize,2112762,11377270.55,5385.02,19.14,9.8,
2,South Korea,KOR,East Asia,High income,1995,Soybeans,1650777,7474101.16,4527.63,193.84,56.6,
3,United States,USA,North America,High income,2018,Wheat,4782989,32397951.41,6773.58,205.12,62.5,
4,Japan,JPN,East Asia,High income,2013,Rice,5434696,58322509.35,10731.51,211.64,61.4,


In [32]:
temp_df.head()

Unnamed: 0,country_name,year,annual_anomaly_c,jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec
0,United States of America,1990,0.07,-0.02,,,-0.09,-0.11,0.44,-0.44,0.3,-0.08,-0.03,-0.31,-0.39
1,United States of America,1991,0.2,0.36,0.4,0.6,0.74,0.22,0.34,0.22,0.7,,-0.44,-0.5,-0.22
2,United States of America,1992,0.54,0.46,0.76,0.85,0.68,0.6,0.98,0.52,0.92,0.53,0.01,0.09,0.07
3,United States of America,1993,0.43,0.28,0.22,0.74,0.6,,1.3,0.81,0.35,-0.3,0.04,0.55,-0.13
4,United States of America,1994,0.87,0.75,0.86,0.65,1.14,0.45,1.71,1.27,1.04,0.01,0.62,1.05,0.88


In [33]:
crop_df.dtypes

country                  object
iso3_code                object
region                   object
income_group             object
year                      Int64
crop                     object
area_harvested_ha         Int64
production_tonnes       Float64
yield_kg_ha             Float64
fertilizer_use_kg_ha    Float64
irrigation_pct          Float64
notes                    object
dtype: object

In [34]:
temp_df.dtypes

country_name         object
year                  int64
annual_anomaly_c    Float64
jan                 Float64
feb                 Float64
mar                 Float64
apr                 Float64
may                 Float64
jun                 Float64
jul                 Float64
aug                 Float64
sep                 Float64
oct                 Float64
nov                 Float64
dec                 Float64
dtype: object

#### Building & Inserting into countries table

In [35]:
# inserting data into countries table (1 row per iso3 country)

countries_df = crop_df[["country", "iso3_code", "region", "income_group"]].copy()

# removing leading/trailing spaces (without this step, the same country appeared more than once)
for c in ["country", "iso3_code", "region", "income_group"]:
    countries_df[c] = countries_df[c].astype(str).str.strip()

# making sure iso3 is consistent (upper-case)
countries_df["iso3_code"] = countries_df["iso3_code"].str.upper()

# keeping 1 row per iso3_code
countries_df = countries_df.drop_duplicates(subset=["iso3_code"]).rename(
    columns={"country": "country_name"}
)
countries_df.head()

Unnamed: 0,country_name,iso3_code,region,income_group
0,China,CHN,East Asia,Upper middle income
1,Nepal,NPL,South Asia,Low income
2,South Korea,KOR,East Asia,High income
3,United States,USA,North America,High income
4,Japan,JPN,East Asia,High income


In [36]:
# checking duplicates
countries_df.duplicated("iso3_code").sum()

np.int64(0)
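
Because `drop_duplicates(subset=["iso3_code"])` keeps the first row it sees for each code, the stored country name is whichever spelling happens to come first. The mapping step later in this notebook records 42 crop-side names for 34 countries, so some codes do carry more than one spelling. A quick way to list such cases is sketched below with the standard library. The rows are made-up examples, not values from the data.

```python
from collections import defaultdict

# (country, iso3_code) pairs; the variants below are hypothetical.
rows = [
    ("United States", "USA"),
    ("United States ", "usa"),  # trailing space, lower-case code
    ("USA", "USA"),             # abbreviated country name
    ("Japan", "JPN"),
]

names_by_iso3 = defaultdict(set)
for country, iso3 in rows:
    names_by_iso3[iso3.strip().upper()].add(country.strip())

# Keep only the codes that appear with more than one distinct name.
variants = {
    code: sorted(names) for code, names in names_by_iso3.items() if len(names) > 1
}
print(variants)
```

Reviewing such a list before deduplicating makes the choice of canonical name explicit rather than dependent on row order.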

In [37]:
# inserting into SQL table
countries_df_to_insert = countries_df.copy()
countries_df_to_insert.to_sql("countries", engine, if_exists="append", index=False)

34

#### Building & Inserting into crops table

In [38]:
# formatting before inserting

crops_df = crop_df[["crop"]].copy()
crops_df["crop"] = crops_df["crop"].astype(str).str.strip()

crops_df = (
    crops_df.drop_duplicates(subset=["crop"])
    .rename(columns={"crop": "crop_name"})
    .sort_values("crop_name")
    .reset_index(drop=True)
)

crops_df.head()

Unnamed: 0,crop_name
0,Maize
1,Rice
2,Soybeans
3,Wheat


In [39]:
crops_df.to_sql("crops", con=engine, if_exists="append", index=False)

4

In [40]:
# inspecting table
pd.read_sql("SELECT * FROM crops ORDER BY crop_id LIMIT 10;", engine)

Unnamed: 0,crop_id,crop_name
0,1,Maize
1,2,Rice
2,3,Soybeans
3,4,Wheat


#### Building & Inserting into country_name_mapping table

In [41]:
# pulling country table from DB
countries_lookup = pd.read_sql(
    "SELECT country_id, iso3_code, country_name FROM countries;", engine
)

# building mapping tables from each source dataset (unique country identifiers)
mapping_crop = (
    crop_df[["country", "iso3_code"]]
    .drop_duplicates()
    .rename(columns={"country": "source_country_name"})
)
mapping_crop["source_dataset"] = "crop"

mapping_temp = (
    temp_df[["country_name"]]
    .drop_duplicates()
    .rename(columns={"country_name": "source_country_name"})
)
mapping_temp["source_dataset"] = "temp"

# cleaning join keys so merges don't fail due to spaces
countries_lookup["iso3_code"] = (
    countries_lookup["iso3_code"].astype(str).str.strip().str.upper()
)

mapping_crop["iso3_code"] = (
    mapping_crop["iso3_code"].astype(str).str.strip().str.upper()
)
mapping_crop["source_country_name"] = (
    mapping_crop["source_country_name"].astype(str).str.strip()
)

mapping_temp["source_country_name"] = (
    mapping_temp["source_country_name"].astype(str).str.strip()
)
countries_lookup["country_name"] = (
    countries_lookup["country_name"].astype(str).str.strip()
)

# mapping to country_id
# crop dataset: best key is iso3_code
mapping_crop = mapping_crop.merge(
    countries_lookup[["country_id", "iso3_code"]], on="iso3_code", how="left"
)

# needed because USA, Russia & Vietnam do not match
name_fixes = {
    "United States of America": "United States",
    "Russian Federation": "Russia",
    "Viet Nam": "Vietnam",
}

mapping_temp["source_country_name"] = mapping_temp["source_country_name"].replace(
    name_fixes
)

# temp dataset only has country name, so merge on name
mapping_temp = mapping_temp.merge(
    countries_lookup[["country_id", "country_name"]],
    left_on="source_country_name",
    right_on="country_name",
    how="left",
).drop(columns=["country_name"])

In [42]:
# checks (should be 0)
print("Unmatched crop rows (should be 0):", mapping_crop["country_id"].isna().sum())
print("Unmatched temp rows:", mapping_temp["country_id"].isna().sum())

# if something doesn't match, showing them to fix
if mapping_crop["country_id"].isna().any():
    print("\nUnmatched crop entries:")
    display(
        mapping_crop[mapping_crop["country_id"].isna()][
            ["source_country_name", "iso3_code"]
        ]
    )

if mapping_temp["country_id"].isna().any():
    print("\nUnmatched temp entries:")
    display(mapping_temp[mapping_temp["country_id"].isna()][["source_country_name"]])

# looking at mappings to see what they look like / if they look ok
display(mapping_crop.head())
display(mapping_temp.head())

Unmatched crop rows (should be 0): 0
Unmatched temp rows: 0


Unnamed: 0,source_country_name,iso3_code,source_dataset,country_id
0,China,CHN,crop,1
1,Nepal,NPL,crop,2
2,South Korea,KOR,crop,3
3,United States,USA,crop,4
4,Japan,JPN,crop,5


Unnamed: 0,source_country_name,source_dataset,country_id
0,United States,temp,4
1,Canada,temp,6
2,Australia,temp,34
3,France,temp,28
4,Germany,temp,19


In [43]:
# checking that the countries table contains no duplicate iso3_code values

dupes = pd.read_sql(
    """
SELECT iso3_code, COUNT(*) AS n
FROM countries
GROUP BY iso3_code
HAVING COUNT(*) > 1;
""",
    engine,
)

print("duplicate iso3 groups:", len(dupes))
dupes

duplicate iso3 groups: 0


Unnamed: 0,iso3_code,n


In [44]:
# inserting country_name_mapping into MySQL

mapping_crop_small = mapping_crop[
    ["source_dataset", "source_country_name", "country_id"]
].copy()
mapping_temp_small = mapping_temp[
    ["source_dataset", "source_country_name", "country_id"]
].copy()

country_name_mapping_df = pd.concat(
    [mapping_crop_small, mapping_temp_small], ignore_index=True
).drop_duplicates(subset=["source_dataset", "source_country_name"])

country_name_mapping_df.to_sql(
    "country_name_mapping", con=engine, if_exists="append", index=False
)

76

In [45]:
# checking how many entries from crop dataset & temp dataset

pd.read_sql(
    "SELECT source_dataset, COUNT(*) AS n FROM country_name_mapping GROUP BY source_dataset;",
    engine,
)

Unnamed: 0,source_dataset,n
0,crop,42
1,temp,34


In [46]:
# checking no NULL country_id rows inserted

pd.read_sql(
    """
SELECT source_dataset, COUNT(*) AS null_ids
FROM country_name_mapping
WHERE country_id IS NULL
GROUP BY source_dataset;
""",
    engine,
)

Unnamed: 0,source_dataset,null_ids


In [47]:
%%sql
DESCRIBE crop_production;

   mysql+pymysql://mfre521d_user:***@127.0.0.1:3306/mfre521d
 * mysql+pymysql://mfre521d_user:***@localhost:3306/mfre521d
9 rows affected.


Field,Type,Null,Key,Default,Extra
country_id,int,NO,PRI,,
year,int,NO,PRI,,
crop_id,int,NO,PRI,,
area_harvested_ha,float,YES,,,
production_tonnes,float,YES,,,
yield_kg_ha,float,YES,,,
fertilizer_use_kg_ha,float,YES,,,
irrigation_pct,float,YES,,,
notes,varchar(255),YES,,,


#### Building & Inserting into crop_production table

In [48]:
# Need to attach IDs before inserting
# countries_key tells us iso3_code mapping to country_id & crops_key maps crop_name to crop_id
countries_key = pd.read_sql("SELECT country_id, iso3_code FROM countries;", engine)
crops_key = pd.read_sql("SELECT crop_id, crop_name FROM crops;", engine)

crop_fact = crop_df.copy()

# cleaning the join keys so merges match (& we're sure of it) e.g., 'usa', 'USA' & 'USA ' will match
crop_fact["iso3_code"] = crop_fact["iso3_code"].astype(str).str.strip().str.upper()
crop_fact["crop"] = crop_fact["crop"].astype(str).str.strip()

# merging in country_id using iso3_codes
crop_fact = crop_fact.merge(countries_key, on="iso3_code", how="left")

# merging in crop_id using crop_name
crop_fact = crop_fact.merge(crops_key, left_on="crop", right_on="crop_name", how="left")

# keeping only columns expected by SQL table
crop_fact = crop_fact[
    [
        "country_id",
        "year",
        "crop_id",
        "area_harvested_ha",
        "production_tonnes",
        "yield_kg_ha",
        "fertilizer_use_kg_ha",
        "irrigation_pct",
        "notes",
    ]
].copy()

crop_fact.columns

Index(['country_id', 'year', 'crop_id', 'area_harvested_ha',
       'production_tonnes', 'yield_kg_ha', 'fertilizer_use_kg_ha',
       'irrigation_pct', 'notes'],
      dtype='object')

In [49]:
# checking data

crop_fact.dtypes

print("missing country_id:", crop_fact["country_id"].isna().sum())
print("missing crop_id:", crop_fact["crop_id"].isna().sum())

missing country_id: 0
missing crop_id: 0
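The key-cleaning and ID-attachment steps above can be illustrated without a database. This is a minimal sketch in plain Python, assuming a small hypothetical lookup dictionary (`iso3_to_id`) in place of the `countries` table and made-up input rows; it is not the notebook's actual data or method.

```python
def normalize_iso3(value):
    """Strip whitespace and upper-case an ISO3 code so 'usa ' and 'USA' match."""
    return str(value).strip().upper()


def attach_country_ids(rows, iso3_to_id):
    """Return (matched_rows, unmatched_codes) after normalizing each row's iso3_code."""
    matched, unmatched = [], set()
    for row in rows:
        code = normalize_iso3(row["iso3_code"])
        country_id = iso3_to_id.get(code)
        if country_id is None:
            # keep track of codes that would end up with a missing country_id
            unmatched.add(code)
        else:
            matched.append({**row, "iso3_code": code, "country_id": country_id})
    return matched, sorted(unmatched)


# Hypothetical lookup and input rows, for illustration only
iso3_to_id = {"USA": 4, "VNM": 8}
rows = [
    {"iso3_code": "usa ", "year": 2020},
    {"iso3_code": "VNM", "year": 2020},
    {"iso3_code": "xyz", "year": 2020},
]

matched, unmatched = attach_country_ids(rows, iso3_to_id)
print(len(matched), unmatched)
```

Listing the unmatched codes, rather than only counting missing IDs, makes it easier to see which source values need a fix before inserting.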


In [50]:
# inserting into crop_production

crop_fact.to_sql(
    "crop_production",
    con=engine,
    if_exists="append",
    index=False,
    method="multi",
    chunksize=1000,
)

4187

#### Building & Inserting into temperature_anomalies table

In [51]:
# checking country_id & country_name in countries for US, Russia, Vietnam = problematic countries

pd.read_sql(
    """
SELECT country_id, country_name
FROM countries
WHERE country_name IN ('United States', 'Russia', 'Vietnam');
""",
    engine,
)

Unnamed: 0,country_id,country_name
0,4,United States
1,8,Vietnam
2,12,Russia


In [52]:
# adding the 3 missing temp name mappings

fixes = pd.DataFrame(
    {
        "source_dataset": ["temp", "temp", "temp"],
        "source_country_name": [
            "United States of America",
            "Russian Federation",
            "Viet Nam",
        ],
        "country_id": [4, 12, 8],
    }
)

fixes.to_sql("country_name_mapping", con=engine, if_exists="append", index=False)

3
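Because the cell above appends rows with hard-coded IDs, re-running it could insert the same mappings twice. One possible guard is sketched below using SQLite from the standard library as a stand-in for MySQL. The composite primary key on `country_name_mapping` and the `add_temp_mappings` helper are assumptions for illustration; in MySQL the comparable statement would be `INSERT IGNORE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE countries (country_id INTEGER PRIMARY KEY, country_name TEXT);
    CREATE TABLE country_name_mapping (
        source_dataset TEXT,
        source_country_name TEXT,
        country_id INTEGER,
        PRIMARY KEY (source_dataset, source_country_name)  -- assumed key
    );
    INSERT INTO countries VALUES (4, 'United States'), (8, 'Vietnam'), (12, 'Russia');
    """
)

# name used in the temperature data -> standardized name in countries
variants = {
    "United States of America": "United States",
    "Russian Federation": "Russia",
    "Viet Nam": "Vietnam",
}


def add_temp_mappings(conn, variants):
    """Insert mappings, looking up country_id by name; skip rows that already exist."""
    for source_name, standard_name in variants.items():
        conn.execute(
            """
            INSERT OR IGNORE INTO country_name_mapping
            SELECT 'temp', ?, country_id FROM countries WHERE country_name = ?
            """,
            (source_name, standard_name),
        )
    conn.commit()


add_temp_mappings(conn, variants)
add_temp_mappings(conn, variants)  # second call adds nothing

n_rows = conn.execute("SELECT COUNT(*) FROM country_name_mapping").fetchone()[0]
viet_nam_id = conn.execute(
    "SELECT country_id FROM country_name_mapping WHERE source_country_name = 'Viet Nam'"
).fetchone()[0]
print(n_rows, viet_nam_id)
```

Looking up `country_id` by name avoids depending on specific ID values, which may differ if the `countries` table is rebuilt.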

In [53]:
# checking names & country_ids of problematic countries in country_name_mapping

pd.read_sql(
    """
SELECT source_country_name, country_id
FROM country_name_mapping
WHERE source_dataset='temp'
AND source_country_name IN ('United States of America','Russian Federation','Viet Nam');
""",
    engine,
)

Unnamed: 0,source_country_name,country_id
0,Russian Federation,12
1,United States of America,4
2,Viet Nam,8


In [54]:
# getting country_id from country_name_mapping

temp_map = pd.read_sql(
    """
SELECT source_country_name, country_id
FROM country_name_mapping
WHERE source_dataset = 'temp';
""",
    engine,
)

month_cols = [
    "jan",
    "feb",
    "mar",
    "apr",
    "may",
    "jun",
    "jul",
    "aug",
    "sep",
    "oct",
    "nov",
    "dec",
]

# joining first (keeping country_name & source_country_name because bugs occurred)
temp_joined = temp_df.assign(
    source_country_name=temp_df["country_name"].astype(str).str.strip()
).merge(temp_map, on="source_country_name", how="left")

# seeing which country names didn't map (bug previously)
unmatched = temp_joined.loc[
    temp_joined["country_id"].isna(), ["country_name", "source_country_name"]
].drop_duplicates()
display(unmatched)

# makes sure that the code stops if anything is unmatched
assert (
    unmatched.empty
), "Some temp country names did not map to country_id. Fix mapping before inserting."

# subsetting to the exact columns that match the SQL table
temp_fact = temp_joined[["country_id", "year", "annual_anomaly_c"] + month_cols].copy()

Unnamed: 0,country_name,source_country_name
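The same unmatched-name check can also be written as an anti-join in SQL. The sketch below uses SQLite from the standard library and a hypothetical staging table `temp_staging` holding the raw temperature rows; both the table and its contents are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE country_name_mapping (
        source_dataset TEXT, source_country_name TEXT, country_id INTEGER
    );
    CREATE TABLE temp_staging (country_name TEXT, year INTEGER);  -- hypothetical

    INSERT INTO country_name_mapping VALUES
        ('temp', 'Viet Nam', 8),
        ('temp', 'Russian Federation', 12);
    INSERT INTO temp_staging VALUES
        ('Viet Nam', 2020),
        ('Russian Federation ', 2020),
        ('Atlantis', 2020),
        ('Atlantis', 2021);
    """
)

# LEFT JOIN keeps every staging row; rows with no mapping get NULL country_id
unmatched = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT s.country_name
        FROM temp_staging s
        LEFT JOIN country_name_mapping m
            ON m.source_dataset = 'temp'
           AND m.source_country_name = TRIM(s.country_name)
        WHERE m.country_id IS NULL
        ORDER BY s.country_name
        """
    )
]
print(unmatched)
```

An empty result plays the same role as the empty `unmatched` DataFrame shown above.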


In [55]:
# inserting data into SQL table
temp_fact.to_sql("temperature_anomalies", con=engine, if_exists="append", index=False)

1137

#### Final Validation Checks

In [56]:
# final checks - number of rows

pd.read_sql(
    """
SELECT 'countries' AS t, COUNT(*) n FROM countries
UNION ALL SELECT 'crops', COUNT(*) FROM crops
UNION ALL SELECT 'country_name_mapping', COUNT(*) FROM country_name_mapping
UNION ALL SELECT 'crop_production', COUNT(*) FROM crop_production
UNION ALL SELECT 'temperature_anomalies', COUNT(*) FROM temperature_anomalies;
""",
    engine,
)

Unnamed: 0,t,n
0,countries,34
1,crops,4
2,country_name_mapping,79
3,crop_production,4187
4,temperature_anomalies,1137


In [57]:
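# final check - duplicates
# note: a notebook cell displays only its last expression, so the output
# below is the temperature_anomalies result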

pd.read_sql(
    """
SELECT COUNT(*) AS dupes
FROM (
  SELECT country_id, year, crop_id, COUNT(*) c
  FROM crop_production
  GROUP BY 1,2,3
  HAVING COUNT(*) > 1
) x;
""",
    engine,
)

pd.read_sql(
    """
SELECT COUNT(*) AS dupes
FROM (
  SELECT country_id, year, COUNT(*) c
  FROM temperature_anomalies
  GROUP BY 1,2
  HAVING COUNT(*) > 1
) x;
""",
    engine,
)

Unnamed: 0,dupes
0,0
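To see both duplicate counts at once, the check can be wrapped in a small helper. This is a sketch using SQLite and toy tables without primary keys (so that a duplicate can exist); the helper name `count_duplicate_keys` is hypothetical.

```python
import sqlite3


def count_duplicate_keys(conn, table, key_cols):
    """Count key combinations that occur more than once in `table`.

    Table and column names are interpolated into the SQL text, so they
    must come from trusted code, never from user input.
    """
    keys = ", ".join(key_cols)
    sql = (
        f"SELECT COUNT(*) FROM ("
        f"SELECT {keys} FROM {table} GROUP BY {keys} HAVING COUNT(*) > 1"
        f") AS x"
    )
    return conn.execute(sql).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE crop_production (country_id INTEGER, year INTEGER, crop_id INTEGER);
    CREATE TABLE temperature_anomalies (country_id INTEGER, year INTEGER);
    INSERT INTO crop_production VALUES (1, 2020, 1), (1, 2020, 2), (1, 2021, 1);
    INSERT INTO temperature_anomalies VALUES (1, 2020), (1, 2020), (2, 2020);
    """
)

results = {
    "crop_production": count_duplicate_keys(
        conn, "crop_production", ["country_id", "year", "crop_id"]
    ),
    "temperature_anomalies": count_duplicate_keys(
        conn, "temperature_anomalies", ["country_id", "year"]
    ),
}
print(results)
```

In the toy data only `temperature_anomalies` contains a repeated key, so the two counts differ and both are visible in one output.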


## Task 3: Business Questions

### Question 1: Regional Production Trends

In [52]:
# This query summarizes regional crop production patterns for the year 2023.
#
# We:
# 1. Join crop production data with country and crop reference tables
# 2. Filter observations to the year 2023
# 3. Aggregate total production, average yield, and observation counts
#    by region and crop
# 4. Order results by total production in descending order

q1_sql = """
SELECT
    c.region AS region,
    cr.crop_name AS crop,
    SUM(cp.production_tonnes) AS total_production_tonnes,
    AVG(cp.yield_kg_ha) AS avg_yield_kg_ha,
    COUNT(*) AS observation_count
FROM crop_production cp
JOIN countries c
    ON cp.country_id = c.country_id
JOIN crops cr
    ON cp.crop_id = cr.crop_id
WHERE cp.year = 2023
GROUP BY
    c.region,
    cr.crop_name
ORDER BY
    total_production_tonnes DESC;
"""


In [64]:
# results
q1_result = run_query(q1_sql)
q1_result.style.format({
    'total_production_tonnes': '{:,.0f}',   # no scientific notation
    'avg_yield_kg_ha': '{:.2f}'              # 2 decimals
})


Unnamed: 0,region,crop,total_production_tonnes,avg_yield_kg_ha,observation_count
0,Europe,Maize,145210705,11943.86,5
1,Southeast Asia,Rice,93346196,6665.57,5
2,Sub-Saharan Africa,Maize,87612816,5758.17,8
3,Europe,Wheat,77651221,7263.25,5
4,Sub-Saharan Africa,Wheat,76163206,4257.14,8
5,Sub-Saharan Africa,Rice,65806524,5953.92,8
6,South Asia,Rice,59198338,6747.76,4
7,Southeast Asia,Wheat,59162863,5325.39,5
8,South America,Maize,54143975,9132.24,2
9,Oceania,Maize,49896884,12428.05,1


**Q1 Analysis :**
The results show that 2023 production is concentrated in a few region–crop combinations: European maize leads at about 145 million tonnes, followed by Southeast Asian rice (about 93 million tonnes) and Sub-Saharan African maize (about 88 million tonnes). Average yields vary substantially across regions even for the same crop; among the rows shown, maize averages range from roughly 5,758 kg/ha in Sub-Saharan Africa to 12,428 kg/ha in Oceania. High regional totals therefore partly reflect scale, such as the number of reporting countries in a region, rather than consistently higher yields.
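The distinction between scale and yield efficiency also depends on how the regional average is computed: `AVG(cp.yield_kg_ha)` is an unweighted mean across observations. The sketch below contrasts it with an area-weighted yield. The records are made up, and the relationship yield = production × 1000 / area is an assumption about how the columns are defined.

```python
def simple_mean_yield(records):
    """Unweighted mean of reported yields, as AVG(yield_kg_ha) would compute."""
    yields = [r["yield_kg_ha"] for r in records]
    return sum(yields) / len(yields)


def area_weighted_yield(records):
    """Total production (converted to kg) divided by total harvested area (ha)."""
    total_kg = sum(r["production_tonnes"] * 1000 for r in records)
    total_ha = sum(r["area_harvested_ha"] for r in records)
    return total_kg / total_ha


# Made-up records: one large high-yield producer and one small low-yield producer
records = [
    {"production_tonnes": 100_000, "area_harvested_ha": 10_000, "yield_kg_ha": 10_000.0},
    {"production_tonnes": 2_000, "area_harvested_ha": 1_000, "yield_kg_ha": 2_000.0},
]

unweighted = simple_mean_yield(records)
weighted = area_weighted_yield(records)
print(f"unweighted: {unweighted:.1f} kg/ha, area-weighted: {weighted:.1f} kg/ha")
```

The unweighted mean treats both producers equally, while the weighted figure is pulled toward the larger producer, so the two can tell different stories about a region.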

### Question 2: Climate Sensitivity by Crop

In [54]:
# This query compares average crop yields between
# warm years and cool/normal years using temperature anomalies.
#
# We:
# 1. Join crop production data with temperature anomalies by country and year
# 2. Classify each observation into a temperature bucket
# 3. Aggregate average yields by crop and temperature bucket

q2_sql = """
WITH climate_classified AS (
    SELECT
        cr.crop_name AS crop,                 -- Crop name
        cp.yield_kg_ha AS yield_kg_ha,        -- Yield (kg/ha)
        ta.annual_anomaly_c AS annual_anomaly_c,  -- Temperature anomaly (°C)

        -- Classify years based on temperature anomaly
        CASE
            WHEN ta.annual_anomaly_c > 0.5 THEN 'Warm'
            ELSE 'Cool/Normal'
        END AS temperature_bucket

    FROM crop_production cp

    -- Join temperature data by country and year
    JOIN temperature_anomalies ta
        ON cp.country_id = ta.country_id
       AND cp.year = ta.year

    -- Join crop names
    JOIN crops cr
        ON cp.crop_id = cr.crop_id

    -- Remove rows with missing yield or temperature data
    WHERE cp.yield_kg_ha IS NOT NULL
      AND ta.annual_anomaly_c IS NOT NULL
)
SELECT
    crop,                                   -- Crop name
    temperature_bucket,                     -- Warm vs Cool/Normal
    AVG(yield_kg_ha) AS avg_yield_kg_ha,    -- Average yield
    COUNT(*) AS observation_count           -- Number of observations

FROM climate_classified

GROUP BY
    crop,
    temperature_bucket

ORDER BY
    crop,
    temperature_bucket;
"""


In [55]:
# results
q2_result = run_query(q2_sql)
q2_result


Unnamed: 0,crop,temperature_bucket,avg_yield_kg_ha,observation_count
0,Maize,Cool/Normal,7202.800704,656
1,Maize,Warm,8044.331245,445
2,Rice,Cool/Normal,5555.146936,555
3,Rice,Warm,6372.943899,377
4,Soybeans,Cool/Normal,3446.468383,524
5,Soybeans,Warm,3913.18017,333
6,Wheat,Cool/Normal,4508.743011,661
7,Wheat,Warm,5201.260129,445


**Q2 Analysis :**
For all four crops, average yields are higher in warm years (annual anomaly above 0.5 °C) than in cool or normal years; for example, maize averages about 8,044 kg/ha in warm years versus about 7,203 kg/ha otherwise. This suggests a positive association between temperature anomalies and yields in the observed data. The comparison is descriptive and does not imply causality: for instance, warm years may be concentrated in later years, when yields could also be higher for unrelated reasons.
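The size of the warm versus cool/normal difference can be expressed as a percentage for each crop. The sketch below copies the averages from the Q2 output above into a dictionary; the helper `warm_gap_pct` is a hypothetical name.

```python
# Average yields (kg/ha) copied from the Q2 result table
q2_avg_yield = {
    "Maize": {"Cool/Normal": 7202.800704, "Warm": 8044.331245},
    "Rice": {"Cool/Normal": 5555.146936, "Warm": 6372.943899},
    "Soybeans": {"Cool/Normal": 3446.468383, "Warm": 3913.18017},
    "Wheat": {"Cool/Normal": 4508.743011, "Warm": 5201.260129},
}


def warm_gap_pct(buckets):
    """Percent difference of the warm-year average relative to cool/normal."""
    base = buckets["Cool/Normal"]
    return 100.0 * (buckets["Warm"] - base) / base


gaps = {crop: round(warm_gap_pct(b), 1) for crop, b in q2_avg_yield.items()}
for crop, gap in gaps.items():
    print(f"{crop}: {gap:+.1f}%")
```

These percentages summarize the table only; they do not adjust for country, year, or any other factor.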

### Question 3: Yield Gap Analysis

In [56]:
# This query identifies high-yield crop-country combinations
# among low-income and lower-middle-income countries in 2023.
# Results are sorted by yield to highlight the highest-performing observations.

q3_sql = """
SELECT
    c.country_name AS country,          -- Country name
    c.income_group AS income_group,     -- Income classification
    cr.crop_name AS crop,               -- Crop name
    cp.yield_kg_ha AS yield_kg_ha,      -- Yield (kg/ha)
    cp.production_tonnes AS production_tonnes  -- Total production

FROM crop_production cp

-- Join country information
JOIN countries c
    ON cp.country_id = c.country_id

-- Join crop information
JOIN crops cr
    ON cp.crop_id = cr.crop_id

WHERE
    cp.year = 2023                                  -- Focus on 2023
    AND c.income_group IN ('Low income', 'Lower middle income')
    AND cp.yield_kg_ha IS NOT NULL                  -- Ensure valid yields

ORDER BY
    cp.yield_kg_ha DESC                             -- Highest yields first

LIMIT 10;                                           -- Top 10 results
"""


In [57]:
# results
q3_result = run_query(q3_sql)
q3_result


Unnamed: 0,country,income_group,crop,yield_kg_ha,production_tonnes
0,Indonesia,Lower middle income,Maize,10542.8,5221800.0
1,Morocco,Lower middle income,Maize,9908.99,14815500.0
2,India,Lower middle income,Maize,9737.61,7438880.0
3,Bangladesh,Lower middle income,Maize,9372.67,
4,India,Lower middle income,Rice,7781.3,17860600.0
5,Philippines,Lower middle income,Rice,7769.88,36854500.0
6,Ukraine,Lower middle income,Maize,7549.53,11605800.0
7,Pakistan,Lower middle income,Rice,7237.6,15146100.0
8,Ukraine,Lower middle income,Wheat,7226.0,39230300.0
9,Philippines,Lower middle income,Maize,7162.9,409245.0


**Q3 Analysis :**
The top-yielding crop–country combinations in 2023 are all from lower-middle-income countries, with maize and rice appearing most frequently among the highest yields. These results indicate that high agricultural productivity can occur within lower-income contexts, even though production volumes and data completeness vary across observations.
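One simple way to turn the ranking into a gap measure is to compare each listed yield with the best listed yield for the same crop. The sketch below does this for the maize rows of the Q3 output; this definition of a yield gap is an assumption for illustration, not one stated in the assignment.

```python
# Maize yields (kg/ha) copied from the Q3 result table
q3_maize = {
    "Indonesia": 10542.8,
    "Morocco": 9908.99,
    "India": 9737.61,
    "Bangladesh": 9372.67,
    "Ukraine": 7549.53,
    "Philippines": 7162.9,
}


def gaps_to_best(yields):
    """Percent shortfall of each yield relative to the highest yield in the group."""
    best = max(yields.values())
    return {name: round(100.0 * (best - y) / best, 1) for name, y in yields.items()}


maize_gaps = gaps_to_best(q3_maize)
for country, gap in maize_gaps.items():
    print(f"{country}: {gap:.1f}% below the best listed maize yield")
```

Because the benchmark is taken from the top-10 list itself, the gaps describe spread within that list rather than distance from any agronomic potential.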

### Final Checks

In [58]:
# To Confirm 2023 data exists
run_query("""
SELECT COUNT(*) AS rows_2023
FROM crop_production
WHERE year = 2023;
""")


Unnamed: 0,rows_2023
0,125


In [59]:
# To Check income group labels
run_query("""
SELECT DISTINCT income_group
FROM countries
ORDER BY income_group;
""")


Unnamed: 0,income_group
0,High income
1,Low income
2,Lower middle income
3,Upper middle income


In [60]:
# To ensure there are no duplicate (country, crop, year) keys
run_query("""
SELECT
    COUNT(*) AS total_rows,
    COUNT(DISTINCT CONCAT(country_id, '-', crop_id, '-', year)) AS unique_rows
FROM crop_production;
""")


Unnamed: 0,total_rows,unique_rows
0,4187,4187


In [61]:
# To check missing production values
run_query("""
SELECT COUNT(*) AS missing_production
FROM crop_production
WHERE production_tonnes IS NULL;
""")


Unnamed: 0,missing_production
0,195


### Question 4: Data Quality Assessment

In [103]:
%%sql

-- Identify countries with high % of missing yield and/or production values

SELECT 
    co.country_id,
    co.country_name,
    COUNT(*) AS row_count,

    SUM(cr.production_tonnes IS NULL) AS missing_production,
    ROUND(100.0 * SUM(cr.production_tonnes  IS NULL) / COUNT(*), 2) AS pct_missing_production,

    SUM(cr.yield_kg_ha IS NULL) AS missing_yield,
    ROUND(100.0 * SUM(cr.yield_kg_ha IS NULL) / COUNT(*),2) AS pct_missing_yield,

    SUM(cr.production_tonnes IS NULL) + SUM(cr.yield_kg_ha IS NULL) AS missing_total,
    ROUND(100.0 * (SUM(cr.production_tonnes IS NULL) + SUM(cr.yield_kg_ha IS NULL)) / COUNT(*), 2) AS pct_missing_total

FROM crop_production AS cr
JOIN countries AS co
    ON cr.country_id = co.country_id
GROUP BY co.country_id
ORDER BY pct_missing_total DESC;


   mysql+pymysql://mfre521d_user:***@127.0.0.1:3306/mfre521d
 * mysql+pymysql://mfre521d_user:***@localhost:3306/mfre521d
34 rows affected.


country_id,country_name,row_count,missing_production,pct_missing_production,missing_yield,pct_missing_yield,missing_total,pct_missing_total
34,Australia,136,13,9.56,5,3.68,18,13.24
9,Ethiopia,109,9,8.26,5,4.59,14,12.84
8,Vietnam,136,11,8.09,5,3.68,16,11.76
25,Pakistan,136,8,5.88,6,4.41,14,10.29
16,Mali,107,8,7.48,3,2.8,11,10.28
14,South Africa,107,8,7.48,2,1.87,10,9.35
27,Uganda,107,4,3.74,6,5.61,10,9.35
15,Malawi,109,5,4.59,5,4.59,10,9.17
7,Thailand,136,7,5.15,5,3.68,12,8.82
3,South Korea,136,6,4.41,6,4.41,12,8.82


**Q4 Analysis :**
No country has more than 10% of its production values missing (the highest share is 9.56%, for Australia), and no country has more than 10% of its yield values missing (the highest share among the rows shown is 5.61%, for Uganda). When the two missing counts are summed and divided by each country's row count, five countries reach 10% or more: Australia, Ethiopia, Vietnam, Pakistan, and Mali. Note that this combined figure counts a row twice if both its production and yield values are missing.
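The combined percentage in the query sums two per-column counts over a single row count, so it can be read in more than one way. The sketch below, using SQLite and made-up rows, compares that figure with the share of rows that have at least one missing value and with the share of missing cells.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE crop_production (
        country_id INTEGER, production_tonnes REAL, yield_kg_ha REAL
    );
    -- made-up rows: one row missing both values, one missing only production
    INSERT INTO crop_production VALUES
        (1, NULL, NULL),
        (1, NULL, 5000),
        (1, 100, 4000),
        (1, 200, 4500);
    """
)

row_count, missing_cells, rows_with_any_missing = conn.execute(
    """
    SELECT
        COUNT(*),
        SUM(production_tonnes IS NULL) + SUM(yield_kg_ha IS NULL),
        SUM(production_tonnes IS NULL OR yield_kg_ha IS NULL)
    FROM crop_production
    WHERE country_id = 1
    """
).fetchone()

pct_summed_over_rows = 100.0 * missing_cells / row_count        # as in the Q4 query
pct_rows_any_missing = 100.0 * rows_with_any_missing / row_count
pct_cells_missing = 100.0 * missing_cells / (2 * row_count)     # two columns per row

print(pct_summed_over_rows, pct_rows_any_missing, pct_cells_missing)
```

Which denominator is appropriate depends on the question being asked; the point is only that the three figures can differ noticeably for the same data.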

### Question 5: Integrated View

In [107]:
%%sql

-- Create a view that combines country attributes, crop production metrics, and temp anomaly metrics + derived fields
-- (column names follow the DESCRIBE crop_production output above)

CREATE OR REPLACE VIEW climate_agricultural_analysis AS
SELECT
    co.country_id,
    co.country_name,
    co.iso3_code,
    co.region,
    co.income_group,

    cp.year,
    cp.crop_id,
    cr.crop_name,
    cp.yield_kg_ha,
    cp.fertilizer_use_kg_ha,
    cp.irrigation_pct,
    cp.notes,

    t.annual_anomaly_c,

    -- derived production-per-area metric (NULL when area is missing or zero)
    CASE
        WHEN cp.area_harvested_ha IS NULL OR cp.area_harvested_ha = 0 THEN NULL
        ELSE cp.production_tonnes / cp.area_harvested_ha
    END AS tonnes_per_ha,

    -- derived categorical temp bucket
    CASE
        WHEN t.annual_anomaly_c IS NULL THEN 'Missing'
        WHEN t.annual_anomaly_c > 0.5 THEN 'Warm'
        ELSE 'Cool/Normal'
    END AS temp_bucket

FROM crop_production cp
JOIN countries co
    ON co.country_id = cp.country_id
LEFT JOIN crops cr
    ON cr.crop_id = cp.crop_id
LEFT JOIN temperature_anomalies t
    ON t.country_id = cp.country_id
    AND t.year = cp.year;
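The integrated view that this cell aims to build can be exercised on toy data. The following is a minimal sketch using SQLite from the standard library rather than MySQL, with a reduced set of columns and made-up rows; column names follow the `DESCRIBE crop_production` output earlier in the notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE countries (country_id INTEGER PRIMARY KEY, country_name TEXT);
    CREATE TABLE crops (crop_id INTEGER PRIMARY KEY, crop_name TEXT);
    CREATE TABLE crop_production (
        country_id INTEGER, year INTEGER, crop_id INTEGER,
        area_harvested_ha REAL, production_tonnes REAL
    );
    CREATE TABLE temperature_anomalies (
        country_id INTEGER, year INTEGER, annual_anomaly_c REAL
    );

    -- made-up rows: 2021 has zero area, 2022 has no temperature record
    INSERT INTO countries VALUES (1, 'Exampleland');
    INSERT INTO crops VALUES (1, 'Maize');
    INSERT INTO crop_production VALUES
        (1, 2020, 1, 100.0, 500.0),
        (1, 2021, 1, 0.0, 10.0),
        (1, 2022, 1, 50.0, 200.0);
    INSERT INTO temperature_anomalies VALUES (1, 2020, 0.8), (1, 2021, 0.2);

    CREATE VIEW climate_agricultural_analysis AS
    SELECT
        co.country_name,
        cr.crop_name,
        cp.year,
        CASE
            WHEN cp.area_harvested_ha IS NULL OR cp.area_harvested_ha = 0 THEN NULL
            ELSE cp.production_tonnes / cp.area_harvested_ha
        END AS tonnes_per_ha,
        CASE
            WHEN t.annual_anomaly_c IS NULL THEN 'Missing'
            WHEN t.annual_anomaly_c > 0.5 THEN 'Warm'
            ELSE 'Cool/Normal'
        END AS temp_bucket
    FROM crop_production cp
    JOIN countries co ON co.country_id = cp.country_id
    LEFT JOIN crops cr ON cr.crop_id = cp.crop_id
    LEFT JOIN temperature_anomalies t
        ON t.country_id = cp.country_id AND t.year = cp.year;
    """
)

view_rows = conn.execute(
    "SELECT year, tonnes_per_ha, temp_bucket "
    "FROM climate_agricultural_analysis ORDER BY year"
).fetchall()
for row in view_rows:
    print(row)
```

The toy rows cover the three branches of interest: a normal ratio, a zero-area row that yields NULL instead of a division error, and a production row with no matching temperature record that falls into the 'Missing' bucket.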

#### AI Use Statement
AI tools, primarily ChatGPT, were used in support of the creation of the code in this report. AI was used to troubleshoot errors, brainstorm appropriate data structures and coding solutions, and to clarify SQL & Python coding syntax. All AI use complied with UBC and course guidelines. All code, output, and comments in this file have been customized and verified by the authors. 