# `FRE 521D_Assignment 1_Group 3`

## Task 1

The database consists of five tables:

1. countries
2. crops
3. country_name_mapping
4. crop_production
5. temperature_anomalies

Each table is designed to represent a distinct entity and maintain referential integrity across datasets.


a)
1. The *countries* table stores standardized country-level attributes that are shared across both datasets.
- Avoid duplication of country metadata such as region and income group
- Provide a single authoritative reference for country identifiers
- Enable consistent joins across agricultural and climate datasets

Key fields include: 
country_id (Primary Key); country_name; iso3_code; region; income_group

2. The *crops* table stores standardized crop-level attributes that are reused across agricultural production records.

- Avoid duplication of crop names across production rows
- Prevent inconsistencies caused by spelling variations or formatting differences (e.g., "Wheat" vs. "Wheat ")
- Provide a single authoritative reference for crop identifiers used in agricultural analysis

Key fields include: 
crop_id (Primary Key); crop_name

3. [e)] The *country_name_mapping* table resolves inconsistencies in country naming across the two source datasets.

- Map multiple source-specific country name variants to a standardized country identifier
- Enable reliable joins between crop production and temperature anomaly datasets
- Explicitly document data harmonization decisions across data sources

Key fields include:
source_country_name (Primary Key); country_id (Foreign Key from countries.country_id)
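
The lookup that this mapping table supports can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual implementation: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL, the `country_id` value 1 and the helper `resolve_country_id` are hypothetical, and treating "United States" as the standardized name is an assumption. The two name variants are the ones that appear in the crop and temperature previews.

```python
import sqlite3

# In-memory SQLite is a stand-in for the MySQL database used in this notebook.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (
    country_id   INTEGER PRIMARY KEY,
    country_name TEXT NOT NULL
);
CREATE TABLE country_name_mapping (
    source_country_name TEXT PRIMARY KEY,
    country_id          INTEGER NOT NULL REFERENCES countries(country_id)
);
-- country_id 1 is a hypothetical value chosen for illustration
INSERT INTO countries VALUES (1, 'United States');
INSERT INTO country_name_mapping VALUES
    ('United States', 1),
    ('United States of America', 1);
""")


def resolve_country_id(raw_name):
    """Return the standardized country_id for a raw source name, or None if unmapped."""
    row = conn.execute(
        "SELECT country_id FROM country_name_mapping WHERE source_country_name = ?",
        (raw_name,),
    ).fetchone()
    return row[0] if row else None


# Both source spellings resolve to the same country identifier.
print(resolve_country_id("United States"))
print(resolve_country_id("United States of America"))
```

Because `source_country_name` is the primary key, each raw spelling can map to only one country, while several spellings can share the same `country_id`.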

4. The *crop_production* table stores annual agricultural production measures at the country–year–crop level and serves as the core fact table for analysis.

- Store time-varying agricultural metrics derived from the crop production dataset
- Support analysis of production, yield, and agricultural inputs over time
- Enable aggregation and comparison across countries, regions, crops, and years

Key fields include:
country_id (Foreign Key from countries.country_id); year; crop_id (Foreign Key from crops.crop_id); area_harvested_ha; production_tonnes; yield_kg_ha; fertilizer_use_kg_ha; irrigation_pct; notes; where (country_id, year, crop_id) is the Primary Key

5. The *temperature_anomalies* table stores annual temperature anomaly measures at the country–year level and serves as the climate fact table.

- Store time-varying climate metrics derived from the temperature anomaly dataset
- Support integration of climate indicators with agricultural production data
- Enable comparison of agricultural outcomes across warmer and cooler years

Key fields include:
country_id (Foreign Key from countries.country_id); year; annual_anomaly_c; jan through dec (twelve monthly anomaly columns); where (country_id, year) is the Primary Key

b)

`Primary Keys`

The *countries* table uses *country_id* as its primary key, serving as the unique identifier for each country.

The *crops* table uses *crop_id* as its primary key, ensuring a unique identifier for each crop.

The *country_name_mapping* table uses *source_country_name* as its primary key, guaranteeing that each raw country name from the source datasets maps to a single standardized country.

The *crop_production* table uses a composite primary key *(country_id, year, crop_id)*, reflecting the natural grain of the data, where each record represents a unique country–year–crop observation.

The *temperature_anomalies* table uses a composite primary key *(country_id, year)*, consistent with the country–year granularity of the temperature dataset.

`Foreign Keys`

*crop_production.country_id* references *countries.country_id*, linking each agricultural observation to a valid country.

*crop_production.crop_id* references *crops.crop_id*, enforcing consistent crop identification.

*temperature_anomalies.country_id* references *countries.country_id*, ensuring that all climate records are associated with valid countries.

*country_name_mapping.country_id* references *countries.country_id*, linking source-specific country names to standardized identifiers.
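
The effect of these key constraints can be sketched as follows. This is a minimal illustration rather than the notebook's actual DDL: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL, includes only a subset of columns, and the helper `try_insert` and the identifier values are hypothetical. The sample row is taken from the crop data preview.

```python
import sqlite3

# In-memory SQLite is a stand-in for the MySQL database used in this notebook.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE countries (
    country_id   INTEGER PRIMARY KEY,
    country_name TEXT NOT NULL,
    iso3_code    CHAR(3)
);
CREATE TABLE crops (
    crop_id   INTEGER PRIMARY KEY,
    crop_name TEXT NOT NULL UNIQUE
);
CREATE TABLE crop_production (
    country_id        INTEGER NOT NULL REFERENCES countries(country_id),
    year              INTEGER NOT NULL,
    crop_id           INTEGER NOT NULL REFERENCES crops(crop_id),
    production_tonnes REAL,
    PRIMARY KEY (country_id, year, crop_id)
);
""")

# Identifier values (1, 1) are hypothetical; the observation comes from the data preview.
conn.execute("INSERT INTO countries VALUES (1, 'China', 'CHN')")
conn.execute("INSERT INTO crops VALUES (1, 'Soybeans')")
conn.execute("INSERT INTO crop_production VALUES (1, 2001, 1, 12036421.75)")


def try_insert(row):
    """Return True if the row is accepted, False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO crop_production VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_ok = try_insert((1, 2001, 1, 5.0))  # repeats an existing country-year-crop key
orphan_ok = try_insert((99, 2001, 1, 5.0))    # country_id 99 does not exist in countries
print(duplicate_ok, orphan_ok)  # prints: False False
```

The composite primary key rejects the duplicate observation, and the foreign key rejects the row that points at a country missing from *countries*.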

c)

*countries* table:
- country_id: INT, numeric surrogate key that enables efficient joins and indexing
- country_name: VARCHAR, flexible text storage for country names of varying length
- iso3_code: CHAR(3), fixed-length three-letter ISO 3166-1 alpha-3 country code
- region: VARCHAR, categorical attribute representing geographic region
- income_group: VARCHAR, categorical attribute representing income classification


*crops* table:
- crop_id: INT, numeric surrogate key for standardized crop identification
- crop_name: VARCHAR, categorical text field storing unique crop names


*country_name_mapping* table:
- source_country_name: VARCHAR, stores raw country name strings exactly as they appear in source datasets
- country_id: INT, foreign key linking source-specific names to standardized country identifiers


*crop_production* table:
- country_id: INT, foreign key enabling joins with standardized country metadata
- year: INT, numeric representation supporting temporal filtering and grouping
- crop_id: INT, foreign key enforcing consistent crop identification
- area_harvested_ha: FLOAT, continuous numeric measure of harvested area
- production_tonnes: FLOAT, continuous numeric measure of total production
- yield_kg_ha: FLOAT, continuous numeric yield metric
- fertilizer_use_kg_ha: FLOAT, continuous numeric input measure
- irrigation_pct: FLOAT, numeric percentage representing irrigation coverage
- notes: VARCHAR, free-text field for annotations and metadata


*temperature_anomalies* table:
- country_id: INT, foreign key enabling integration with agricultural data
- year: INT, numeric representation of the observation year
- annual_anomaly_c: FLOAT, continuous numeric climate anomaly measure
- jan–dec: FLOAT (nullable), monthly temperature anomaly values that may be missing



d) 

The schema is normalized mainly to reduce data repetition and keep country and crop identifiers consistent across datasets. Country-level information such as region, income group, and ISO code is stored in a separate countries table instead of being repeated in every agricultural or climate record. Crop names are also placed in a crops table so that the same crop is always identified in a consistent way. Agricultural production data and temperature anomaly data are stored in different fact tables because they have different data structures and time coverage. Differences in country names between the source files are handled using a country_name_mapping table, which helps connect the datasets correctly. Composite keys are used in the fact tables to match the natural structure of the data and uniquely identify each observation.
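
The normalization described above can be sketched with pandas. This is a small illustration under stated assumptions, not the notebook's loading code: the three rows are copied from the crop data preview with only a subset of columns, and the surrogate keys are simply assigned in order of first appearance.

```python
import pandas as pd

# Three rows from the crop data preview (subset of columns).
raw = pd.DataFrame({
    "Country": ["China", "South Korea", "United States"],
    "ISO3_Code": ["CHN", "KOR", "USA"],
    "Region": ["East Asia", "East Asia", "North America"],
    "Income_Group": ["Upper middle income", "High income", "High income"],
    "Year": [2001, 1995, 2018],
    "Crop": ["Soybeans", "Soybeans", "Wheat"],
    "Production_Tonnes": [12036421.75, 7474101.16, 32397951.41],
})

country_cols = ["Country", "ISO3_Code", "Region", "Income_Group"]

# Dimension tables: one row per distinct country and per distinct crop.
countries = raw[country_cols].drop_duplicates().reset_index(drop=True)
countries.insert(0, "country_id", list(range(1, len(countries) + 1)))

crops = raw[["Crop"]].drop_duplicates().reset_index(drop=True)
crops.insert(0, "crop_id", list(range(1, len(crops) + 1)))

# Fact table: replace repeated text attributes with the surrogate keys.
crop_production = (
    raw.merge(countries, on=country_cols)
       .merge(crops, on="Crop")
       [["country_id", "Year", "crop_id", "Production_Tonnes"]]
)

# The composite key (country_id, Year, crop_id) should identify each row uniquely.
key_is_unique = not crop_production.duplicated(["country_id", "Year", "crop_id"]).any()
print(crop_production)
print("Composite key unique:", key_is_unique)
```

Region and income group now live only in `countries`, and "Soybeans" is stored once in `crops` even though it appears in two production rows.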

In [None]:
# Load the SQL magic extension
%load_ext sql
    
%config SqlMagic.style = '_DEPRECATED_DEFAULT'
%config SqlMagic.autopandas = False

In [2]:
%load_ext sql
%sql mysql+pymysql://mfre521d_user:mfre521d_user_pw@127.0.0.1:3306/mfre521d


The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [3]:
# Database connection parameters
DB_USER = "mfre521d_user"
DB_PASSWORD = "mfre521d_user_pw"
DB_HOST = "localhost"
DB_PORT = "3306"
DB_NAME = "mfre521d"

# Create connection string
connection_string = f"mysql+pymysql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"

# Connect using SQL magic
%sql {connection_string}


In [4]:
%%sql

SELECT 'Connection successful!' AS status, NOW() AS current_ts;

   mysql+pymysql://mfre521d_user:***@127.0.0.1:3306/mfre521d
 * mysql+pymysql://mfre521d_user:***@localhost:3306/mfre521d
1 rows affected.


status,current_ts
Connection successful!,2026-01-12 23:12:03


Import CSV Files - 1) Crop Production, and 2) Temperature Anomalies

In [11]:
from pathlib import Path
import pandas as pd

# Locate repo root
# Notebook is in: 521d_assignment/notebooks/
# Repo root is:   521d_assignment/
repo_root = Path.cwd().resolve().parents[0]

print("Repo root:", repo_root)

# Build paths to CSV files
crop_csv = repo_root / "data" / "crop_production_1990_2023.csv"
temp_csv = repo_root / "data" / "temperature_anomalies_1990_2023.csv"

print("Crop CSV exists:", crop_csv.exists())
print("Temp CSV exists:", temp_csv.exists())


Repo root: /Users/claremengebier/Desktop/MFRE/FRE 521D/521d_assignment
Crop CSV exists: True
Temp CSV exists: True


In [15]:
# Read crop production data
crop_df_raw = pd.read_csv(
    crop_csv,
    na_values=["..", "NA", ""],
    encoding="utf-8"
)

print("Crop data shape:", crop_df_raw.shape)
crop_df_raw.head()

Crop data shape: (4187, 12)


Unnamed: 0,Country,ISO3_Code,Region,Income_Group,Year,Crop,Area_Harvested_Ha,Production_Tonnes,Yield_Kg_Ha,Fertilizer_Use_Kg_Ha,Irrigation_Pct,Notes
0,China,CHN,East Asia,Upper middle income,2001.0,Soybeans,3751494,12036421.75,3208.43,100.9,,
1,Nepal,NPL,South Asia,Low income,1993.0,Maize,2112762,11377270.55,538502.0,1914.0,9.8,
2,South Korea,KOR,East Asia,High income,1995.0,Soybeans,1650777,7474101.16,4527.63,193.84,56.6,
3,United States,USA,North America,High income,2018.0,Wheat,4782989,32397951.41,6773.58,205.12,62.5,
4,Japan,JPN,East Asia,High income,2013.0,Rice,5434696,58322509.35,1073151.0,21164.0,61.4,


In [14]:
# Read temperature anomaly data
temp_df_raw = pd.read_csv(
    temp_csv,
    na_values=["NA", ""],
    encoding="utf-8"
)

print("Temperature data shape:", temp_df_raw.shape)
temp_df_raw.head()

Temperature data shape: (1137, 15)


Unnamed: 0,Country_Name,Year,Annual_Anomaly_C,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
0,United States of America,1990,0.07,(0.02),,,-0.09,(0.11),0.44,-0.44,0.3,-0.08,(0.03),-0.31,-0.39
1,United States of America,1991,0.2,0.36,0.4,0.6,0.74,0.22,0.34,0.22,0.7,,-0.44,-0.5,-0.22
2,United States of America,1992,0.54,0.46,0.76,0.85,0.68,0.6,0.98,0.52,0.92,0.53,0.01,0.09,0.07
3,United States of America,1993,0.43,0.28,0.22,0.74,0.6,,1.3,0.81,0.35,(0.3),0.04,0.55,-0.13
4,United States of America,1994,0.87,0.75,0.86,0.65,1.14,0.45,1.71,1.27,1.04,0.01,0.62,1.05,0.88
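
The preview above mixes plain negatives such as `-0.09` with parenthesized values such as `(0.02)`. One possible way to make the monthly columns numeric is sketched below. It assumes that parentheses are accounting-style notation for negative numbers, which the source data does not confirm, and the helper `parse_anomaly` is hypothetical.

```python
import re

# Matches a value wrapped entirely in parentheses, e.g. "(0.02)".
_PAREN = re.compile(r"^\((.+)\)$")


def parse_anomaly(value):
    """Convert a raw cell to float.

    Assumption: '(0.02)' means -0.02 (accounting-style negative).
    Empty strings, 'NA', and None are returned as None.
    """
    if value is None:
        return None
    text = str(value).strip()
    if text in ("", "NA"):
        return None
    match = _PAREN.match(text)
    if match:
        return -float(match.group(1))
    return float(text)


# Sample cells taken from the first row of the temperature preview.
samples = ["(0.02)", "", "-0.09", "(0.11)", "0.44"]
print([parse_anomaly(s) for s in samples])
```

If the parentheses turn out to mean something else, such as provisional or flagged values, the sign flip in this helper would need to be replaced accordingly.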
