# Build `datas.csv` from APIs — USDA (NASS) + NOAA (CDO)

This notebook fetches data from **live APIs** (no local CSV reads):
- **USDA NASS QuickStats**: soybean **yield** and **harvested area** (state level).
- **NOAA CDO (NCLIMDIV)**: monthly **temperature (TAVG)** and **precipitation (TPCP)** (state level).
- **Crop conditions**: via your existing `src.soy_conditions` module (USDA QuickStats).

It then:
1. Builds **WAOB-like weather features** (`temp_JA`, `prec_JA`, `jun_shortfall`, etc.).
2. Adds **annual crop-condition features** (`gex_JA_min`, etc.).
3. Appends one **US** row per year, computed as a weighted average of the state rows (weights = `harvest_ha`).
4. Writes a single file: `data/processed/datas.csv` (states + US).

> Set environment variables first: `USDA_API_KEY`, `NOAA_TOKEN`.


In [None]:
import os
import sys

import numpy as np
import pandas as pd

# Make repo root importable
sys.path.append(os.path.abspath("."))

# Local modules
from src import SEVEN_STATES, get_soy_condition_features
# API helper module: this import expects fetchweather.py on sys.path (the repo root).
# If you move it to src/fetchweather.py, import from src.fetchweather instead.
from fetchweather import fetch_yield_and_area, fetch_noaa_monthly, build_waob_from_monthlies, add_us_weighted_row

YEAR_FROM = 1987
YEAR_TO   = 2024
STATES    = SEVEN_STATES  # ("IA","IL","IN","OH","MO","MN","NE")
OUT_CSV   = "data/processed/datas.csv"

assert os.getenv("USDA_API_KEY"), "Missing USDA_API_KEY env var"
assert os.getenv("NOAA_TOKEN"), "Missing NOAA_TOKEN env var"


## 1) Fetch USDA yield & harvested area

In [None]:
ya = fetch_yield_and_area(YEAR_FROM, YEAR_TO, STATES)
print("Yield+Area shape:", ya.shape)
ya.head(3)
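
The internals of `fetch_yield_and_area` are not shown in this notebook. As a rough sketch, assuming it queries the public QuickStats `api_GET` endpoint, a state-level soybean request could be assembled as below. The helper name `build_quickstats_url`, the `YOUR_KEY` placeholder, and the particular filter choices are illustrative assumptions, not the module's actual code.

```python
import os
from urllib.parse import urlencode

QUICKSTATS_URL = "https://quickstats.nass.usda.gov/api/api_GET/"

def build_quickstats_url(stat, year_from, year_to, states, api_key):
    """Hypothetical helper: build a state-level soybean query URL."""
    params = {
        "key": api_key,
        "source_desc": "SURVEY",
        "commodity_desc": "SOYBEANS",
        "statisticcat_desc": stat,    # e.g. "YIELD" or "AREA HARVESTED"
        "agg_level_desc": "STATE",
        "state_alpha": list(states),  # sent as a repeated parameter, one per state
        "year__GE": year_from,
        "year__LE": year_to,
        "format": "JSON",
    }
    # doseq=True expands the state list into state_alpha=IA&state_alpha=IL...
    return QUICKSTATS_URL + "?" + urlencode(params, doseq=True)

# "YOUR_KEY" is a placeholder used only when the env var is not set
url = build_quickstats_url("YIELD", 1987, 2024, ("IA", "IL"),
                           os.getenv("USDA_API_KEY", "YOUR_KEY"))
print(url)
```

Building the URL separately from the HTTP call makes it easy to inspect the query before spending API quota on it.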

## 2) Fetch NOAA monthly TAVG/TPCP and build WAOB features

In [None]:
monthly = fetch_noaa_monthly(YEAR_FROM, YEAR_TO, STATES)
print("Monthly shape:", monthly.shape, "sample months:", sorted(monthly['month'].unique())[:5])
waob = build_waob_from_monthlies(monthly, ya, base_trend_year=1987)
print("WAOB-like features:", waob.shape)
waob.head(3)
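
To make the feature names concrete, here is a minimal sketch of how July-August aggregates and a June shortfall could be derived from monthly data. It is not the code of `build_waob_from_monthlies`: the column names `tavg` and `prcp`, the function name, and above all the definition of `jun_shortfall` (how far June precipitation falls below the state's own June mean, floored at zero) are assumptions made for illustration.

```python
import pandas as pd

def sketch_weather_features(monthly: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: July-August means plus an assumed June shortfall."""
    # Mean temperature and precipitation over July and August, per state-year
    ja = (monthly[monthly["month"].isin([7, 8])]
          .groupby(["state", "year"], as_index=False)
          .agg(temp_JA=("tavg", "mean"), prec_JA=("prcp", "mean")))

    # Assumed definition: deficit of June precipitation relative to the
    # state's mean June precipitation over the sample, 0 when above the mean
    jun = monthly.loc[monthly["month"] == 6, ["state", "year", "prcp"]].copy()
    state_mean = jun.groupby("state")["prcp"].transform("mean")
    jun["jun_shortfall"] = (state_mean - jun["prcp"]).clip(lower=0)

    return ja.merge(jun[["state", "year", "jun_shortfall"]],
                    on=["state", "year"], how="left")

# Toy data: one state, two years, months June to August
toy = pd.DataFrame(
    [("IA", 2000, 6, 21.0, 100.0), ("IA", 2000, 7, 24.0, 80.0), ("IA", 2000, 8, 22.0, 120.0),
     ("IA", 2001, 6, 22.0, 60.0), ("IA", 2001, 7, 26.0, 50.0), ("IA", 2001, 8, 24.0, 70.0)],
    columns=["state", "year", "month", "tavg", "prcp"],
)
feats = sketch_weather_features(toy)
print(feats)
```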

## 3) Build annual crop-condition features (USDA QuickStats)

In [None]:
# Reuse your working module that aggregates Good/Excellent etc.
_, cond_annual = get_soy_condition_features(YEAR_FROM, YEAR_TO, STATES)

# Normalize keys and deduplicate
def normalize_keys(df):
    out = df.copy()
    out["state"] = out["state"].astype(str).str.strip().str.upper()
    out["year"] = pd.to_numeric(out["year"], errors="coerce").astype("Int64")
    return out

cond_annual = normalize_keys(cond_annual).drop_duplicates(subset=["state","year"]).copy()
print("Crop condition annual shape:", cond_annual.shape)
cond_annual.head(3)
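
The annual condition features come from `get_soy_condition_features`, whose implementation is not shown here. Assuming `gex_JA_min` means the lowest weekly Good+Excellent share observed in July and August, a reduction from weekly ratings to one value per state-year might look like this. The input column names (`week_ending`, `good`, `excellent`) and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical weekly ratings (percent of crop rated Good / Excellent)
weekly = pd.DataFrame({
    "state": ["IA"] * 4,
    "week_ending": pd.to_datetime(["2023-06-25", "2023-07-09", "2023-08-13", "2023-09-03"]),
    "good": [55, 50, 45, 48],
    "excellent": [15, 12, 10, 11],
})
weekly["year"] = weekly["week_ending"].dt.year
weekly["gex"] = weekly["good"] + weekly["excellent"]  # Good + Excellent share

# Keep only weeks ending in July or August, then take the seasonal minimum
in_ja = weekly["week_ending"].dt.month.isin([7, 8])
annual = (weekly[in_ja]
          .groupby(["state", "year"], as_index=False)
          .agg(gex_JA_min=("gex", "min")))
print(annual)
```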

## 4) Merge weather+yield with crop conditions

In [None]:
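# Normalize the left-hand keys the same way as cond_annual so both sides share dtype and casing
df = normalize_keys(waob).merge(cond_annual, on=["state","year"], how="left", indicator=True)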
print(df["_merge"].value_counts())
df = df.drop(columns="_merge")

# Safety aggregation if any duplicates slipped in.
# Exclude the merge keys so that `year` is used only for grouping, not averaged.
keys = ["state", "year"]
num_cols = [c for c in df.select_dtypes(include=[np.number]).columns if c not in keys]
other_cols = [c for c in df.columns if c not in num_cols and c not in keys]
df = (df.groupby(keys, as_index=False)
        .agg({**{c: "mean" for c in num_cols},
              **{c: "first" for c in other_cols}}))
print("Merged df shape:", df.shape)
df.head(3)
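
The `_merge` counts printed above are the main diagnostic for this step: `left_only` rows are state-years with weather and yield but no crop-condition data. The toy example below (made-up values) shows how to list those rows, and how pandas' `validate="one_to_one"` option could serve as a stricter alternative to averaging duplicates after the fact, since it raises an error if either side repeats a key.

```python
import pandas as pd

left = pd.DataFrame({
    "state": ["IA", "IL", "IN"],
    "year": [2020, 2020, 2020],
    "temp_JA": [23.1, 24.0, 23.5],
})
right = pd.DataFrame({
    "state": ["IA", "IL"],
    "year": [2020, 2020],
    "gex_JA_min": [60, 55],
})

# validate="one_to_one" raises pandas.errors.MergeError on duplicate keys
merged = left.merge(right, on=["state", "year"], how="left",
                    indicator=True, validate="one_to_one")

# State-years present on the left but missing on the right
unmatched = merged.loc[merged["_merge"] == "left_only", ["state", "year"]]
print(unmatched)
```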

## 5) Append US weighted line and write `datas.csv`

In [None]:
df_out = add_us_weighted_row(df)
print("Final rows (states + US):", df_out.shape)

os.makedirs("data/processed", exist_ok=True)
df_out.to_csv(OUT_CSV, index=False)
print("✅ Written:", OUT_CSV)
df_out.tail(3)
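
For reference, the area-weighted aggregation behind the US row can be sketched as follows. This is an illustration, not the body of `add_us_weighted_row`: it assumes every numeric column other than `year` and `harvest_ha` is averaged with `harvest_ha` as the weight, and that the US `harvest_ha` is the sum of the state areas. The function name and the `yield` column name are hypothetical.

```python
import numpy as np
import pandas as pd

def sketch_us_weighted_row(df: pd.DataFrame, weight_col: str = "harvest_ha") -> pd.DataFrame:
    """Illustrative only: append one area-weighted 'US' row per year."""
    feature_cols = [c for c in df.select_dtypes(include=[np.number]).columns
                    if c not in ("year", weight_col)]
    rows = []
    for year, g in df.groupby("year"):
        w = g[weight_col].to_numpy(dtype=float)
        # Assumption: the US area is the total of the state areas
        row = {"state": "US", "year": year, weight_col: w.sum()}
        for c in feature_cols:
            # np.average returns NaN if any state value is NaN for that year
            row[c] = np.average(g[c].to_numpy(dtype=float), weights=w)
        rows.append(row)
    return pd.concat([df, pd.DataFrame(rows)], ignore_index=True)

toy = pd.DataFrame({
    "state": ["IA", "IL"],
    "year": [2020, 2020],
    "yield": [60.0, 50.0],
    "harvest_ha": [3.0, 1.0],
})
us = sketch_us_weighted_row(toy)
print(us[us["state"] == "US"])
```

In the toy data, the state with three times the harvested area pulls the US yield toward its own value: (60 × 3 + 50 × 1) / 4 = 57.5.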