
# Predicting the U.S. Unemployment Rate from Synthetic ADP Non-Farm Payrolls with PyTorch (GPU)

This sample notebook walks through a minimal but complete forecasting workflow, from **data loading and cleaning** through **model training** to **inference**, using **PyTorch**. It uses GPU acceleration automatically if available and falls back to CPU otherwise.

**What you'll see:**
1. Data loading (a synthetic dataset generated in the notebook)
2. Cleaning, alignment, and feature engineering (lags, rolling means, calendar)
3. Train/validation split that respects time order
4. Scaling
5. PyTorch `Dataset` / `DataLoader`
6. A compact MLP forecaster (you could swap in an LSTM/GRU)
7. Training loop with early stopping
8. Learning curves
9. Evaluation and error metrics
10. Saving the model and scaler
11. Running inference on a "next month" example

> **Note**: This example uses ADP non-farm payrolls as the primary driver. In practice, adding more macro features (e.g., continuing claims, JOLTS, ISM, CPI) typically improves performance.

In [None]:

# Install instructions (if running locally and PyTorch isn't available):
#   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
#   pip install pandas numpy scikit-learn matplotlib

import math
import os
import sys
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

try:
    import torch
    import torch.nn as nn
    from torch.utils.data import Dataset, DataLoader
except Exception as e:
    raise RuntimeError(
        "PyTorch is required to run this notebook. Please install it first. "
        "See the install instructions at the top of this cell."
    ) from e

import matplotlib.pyplot as plt

# Select device (prefer GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device


## 1) Load data 

We'll generate a synthetic monthly dataset with columns:

- `date` (month-end)
- `adp_change` (ADP non-farm payroll change, thousands)
- `unemployment_rate` (percent)

**Expected CSV schema (if you bring your own):**
```csv
date,adp_change,unemployment_rate
2010-01-31,45,9.8
2010-02-28,130,9.8
...
```
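
If you bring your own file, a loader along these lines could stand in for the synthetic generator. This is a minimal sketch: `load_monthly_csv` is a hypothetical helper (not part of the notebook), and the inline text reuses the two sample rows from the schema above.

```python
import io

import pandas as pd

REQUIRED_COLS = ["date", "adp_change", "unemployment_rate"]

def load_monthly_csv(path_or_buffer):
    """Hypothetical helper: read a CSV with the schema above and run basic checks."""
    df = pd.read_csv(path_or_buffer)
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")
    df["date"] = pd.to_datetime(df["date"])
    # Keep only the expected columns, in chronological order, one row per date.
    df = df[REQUIRED_COLS].sort_values("date").drop_duplicates("date")
    return df.reset_index(drop=True)

# The two sample rows from the schema above, deliberately out of order.
sample_csv = io.StringIO(
    "date,adp_change,unemployment_rate\n"
    "2010-02-28,130,9.8\n"
    "2010-01-31,45,9.8\n"
)
df_own = load_monthly_csv(sample_csv)
print(df_own)
```

In the notebook, the returned frame would take the place of the `df` produced by `synthesize_data()`.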

In [None]:
def synthesize_data(n_months=180, seed=42):
    rng = np.random.default_rng(seed)
    start = pd.Timestamp("2010-01-31")
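    # Month-end dates; an explicit offset avoids the "M" alias, deprecated since pandas 2.2
    dates = pd.date_range(start=start, periods=n_months, freq=pd.offsets.MonthEnd())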

    # ADP payroll change pattern: slow cycle + upward trend + noise
    seasonal = 150 * np.sin(np.linspace(0, 8*np.pi, n_months))  # 4 full cycles over the sample, not an annual season
    trend = np.linspace(50, 250, n_months)                      # slow drift upward
    noise = rng.normal(0, 80, size=n_months)
    adp_change = (seasonal + trend + noise).astype(float)

    # Unemployment rate: AR(1) persistence plus a negative response to last month's ADP.
    # The intercept (an assumed value) sets the long-run level near 5%, well inside
    # the clip bounds below.
    ur = np.empty(n_months)
    ur[0] = 8.5
    intercept = 0.9
    for t in range(1, n_months):
        # Higher ADP -> lower UR next month; 0.85 is the persistence coefficient
        ur[t] = intercept + 0.85 * ur[t-1] - 0.0009 * adp_change[t-1] + rng.normal(0, 0.08)
        # Bound realistically
        ur[t] = float(np.clip(ur[t], 2.5, 12.0))

    df = pd.DataFrame({
        "date": dates,
        "adp_change": adp_change,
        "unemployment_rate": ur
    })
    return df

df = synthesize_data()

df.head(5)


## 2) Clean & feature engineer

We'll create:
- **Lag features** for ADP and unemployment rate (predict next month's UR from recent months).
- **Rolling means** for ADP to smooth noise.
- **Calendar features** (month as a cyclical variable).

We'll predict **next month's** unemployment rate (the `target_next_ur` column).
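
To make the alignment concrete, here is a small sketch on a made-up five-month series (the values are illustrative, not taken from the dataset). It shows how `shift(1)` pulls in the previous month, how `shift(-1)` builds the next-month target, and which rows `dropna()` removes.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-31", "2020-02-29", "2020-03-31", "2020-04-30", "2020-05-31"]
    ),
    "unemployment_rate": [5.0, 5.2, 5.1, 4.9, 4.8],  # illustrative values
})

toy["ur_lag1"] = toy["unemployment_rate"].shift(1)          # previous month's UR
toy["target_next_ur"] = toy["unemployment_rate"].shift(-1)  # next month's UR
print(toy)

# The first row has no lag and the last row has no target, so both drop out.
aligned = toy.dropna().reset_index(drop=True)
print(aligned)
```

The last observed month always drops out of the training table because its target is not known yet; that is the month you would predict for at inference time.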

In [None]:

def add_features(dfin, adp_lags=(1,2,3), ur_lags=(1,2), adp_rolls=(3,6)):
    df = dfin.copy()
    df = df.sort_values("date").reset_index(drop=True)

    # Lag features
    for L in adp_lags:
        df[f"adp_lag{L}"] = df["adp_change"].shift(L)
    for L in ur_lags:
        df[f"ur_lag{L}"] = df["unemployment_rate"].shift(L)

    # Rolling means of ADP
    for W in adp_rolls:
        df[f"adp_roll{W}"] = df["adp_change"].rolling(W).mean()

    # Calendar cyclic features
    df["month"] = df["date"].dt.month
    df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12.0)
    df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12.0)

    # Target is next month's UR
    df["target_next_ur"] = df["unemployment_rate"].shift(-1)

    # Drop rows with NA due to lags/rolls/shift
    df = df.dropna().reset_index(drop=True)
    return df

df_feat = add_features(df)
df_feat.tail(3)


## 3) Train/Validation split (time-aware)

We keep chronological order to avoid leakage. We'll use the last 15% of samples as validation.
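
The split rule can be sketched as a small function with a built-in leakage check. `chronological_split` is a hypothetical helper and the 40-month frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, val_frac=0.15, date_col="date"):
    """Hypothetical helper: the last `val_frac` of rows (by date) become validation."""
    frame = frame.sort_values(date_col).reset_index(drop=True)
    n_val = int(round(len(frame) * val_frac))
    n_train = len(frame) - n_val
    return frame.iloc[:n_train], frame.iloc[n_train:]

dates = pd.date_range("2015-01-31", periods=40, freq=pd.offsets.MonthEnd())
toy = pd.DataFrame({"date": dates, "x": np.arange(40)})
train_part, val_part = chronological_split(toy)

# Every training date precedes every validation date, so nothing from the
# validation period can leak into training.
assert train_part["date"].max() < val_part["date"].min()
print(len(train_part), len(val_part))  # 34 6
```

A random shuffle before splitting would break this ordering guarantee, which is why the notebook slices by position instead.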

In [None]:
val_frac = 0.15
n_total = len(df_feat)
n_val = int(round(n_total * val_frac))
n_train = n_total - n_val

train_df = df_feat.iloc[:n_train].reset_index(drop=True)
val_df = df_feat.iloc[n_train:].reset_index(drop=True)

n_train, n_val


## 4) Scale features

We'll standardize features using statistics from the **training** set only. Targets are left in original units (percentage points).
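
As a sketch of what "training statistics only" means, the snippet below standardizes a made-up feature matrix by hand with NumPy. The arithmetic, `z = (x - mean_train) / std_train`, is the same one `StandardScaler` applies, including the population standard deviation (`ddof=0`).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=20.0, size=(50, 2))  # made-up feature matrix
X_tr, X_va = X[:40], X[40:]                            # chronological split

# Fit: statistics come from the training rows only.
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0)

X_tr_scaled = (X_tr - mu) / sigma
X_va_scaled = (X_va - mu) / sigma  # reuse the training mu/sigma, never refit

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
print(X_va_scaled.mean(axis=0))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the validation columns generally do not, and that is expected.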

In [None]:
from sklearn.preprocessing import StandardScaler

feature_cols = [c for c in df_feat.columns if c not in ["date","unemployment_rate","target_next_ur","month"]]

scaler = StandardScaler()
X_train = scaler.fit_transform(train_df[feature_cols].values)
X_val = scaler.transform(val_df[feature_cols].values)

y_train = train_df["target_next_ur"].values.astype(np.float32)
y_val = val_df["target_next_ur"].values.astype(np.float32)

X_train.shape, X_val.shape, y_train.shape, y_val.shape


## 5) PyTorch `Dataset` & `DataLoader`

We'll create a simple tabular dataset for supervised regression.

In [None]:
class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32).view(-1, 1)

    def __len__(self):
        return self.X.shape[0]

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

batch_size = 32
train_ds = TabularDataset(X_train, y_train)
val_ds = TabularDataset(X_val, y_val)

train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True, drop_last=False)
val_loader = DataLoader(val_ds, batch_size=batch_size, shuffle=False, drop_last=False)

len(train_ds), len(val_ds)


## 6) Model: Compact MLP forecaster

A small MLP is adequate for this demo. You can replace it with an LSTM/GRU if you transform the problem into a sequence model.
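
For reference, a GRU-based replacement might look like the sketch below. `GRURegressor` is a hypothetical class, not part of this notebook, and it assumes the inputs have already been reshaped to `(batch, k, n_features)` windows of the past `k` months.

```python
import torch
import torch.nn as nn

class GRURegressor(nn.Module):
    """Hypothetical sequence counterpart to the MLP: reads k past months, predicts one value."""

    def __init__(self, n_features, hidden_size=32):
        super().__init__()
        self.gru = nn.GRU(input_size=n_features, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, k, n_features); h_n: (num_layers, batch, hidden_size)
        _, h_n = self.gru(x)
        return self.head(h_n[-1])  # (batch, 1), the same output shape as the MLP

seq_model = GRURegressor(n_features=4)
dummy = torch.randn(8, 6, 4)  # 8 samples, windows of 6 months, 4 features each
out = seq_model(dummy)
print(out.shape)  # torch.Size([8, 1])
```

Because the output shape matches the MLP's, the same loss, optimizer, and training loop would apply unchanged.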

In [None]:
class MLPRegressor(nn.Module):
    def __init__(self, in_features, hidden=(64, 32), dropout=0.1):
        super().__init__()
        layers = []
        last = in_features
        for h in hidden:
            layers.append(nn.Linear(last, h))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout))
            last = h
        layers.append(nn.Linear(last, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MLPRegressor(in_features=X_train.shape[1]).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)


## 7) Training loop with early stopping

In [None]:
def evaluate(model, loader, device):
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for xb, yb in loader:
            xb = xb.to(device)
            yb = yb.to(device)
            pred = model(xb)
            loss = criterion(pred, yb)
            total_loss += loss.item() * xb.size(0)
    return total_loss / len(loader.dataset)

best_state = None
best_val = float("inf")
patience = 20
pat = 0
num_epochs = 300
train_losses, val_losses = [], []

for epoch in range(1, num_epochs + 1):
    model.train()
    running = 0.0
    for xb, yb in train_loader:
        xb = xb.to(device)
        yb = yb.to(device)
        optimizer.zero_grad()
        pred = model(xb)
        loss = criterion(pred, yb)
        loss.backward()
        optimizer.step()
        running += loss.item() * xb.size(0)
    train_loss = running / len(train_loader.dataset)
    val_loss = evaluate(model, val_loader, device)
    train_losses.append(train_loss)
    val_losses.append(val_loss)

    if val_loss < best_val - 1e-6:
        best_val = val_loss
        best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
        pat = 0
    else:
        pat += 1

    if epoch % 25 == 0 or epoch == 1:
        print(f"Epoch {epoch:03d} | train {train_loss:.4f} | val {val_loss:.4f}")
    if pat >= patience:
        print(f"Early stopping at epoch {epoch}. Best val: {best_val:.4f}")
        break

# Load best weights
if best_state is not None:
    model.load_state_dict(best_state)
model.to(device)

best_val
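
The patience rule used in the loop above can be isolated and traced on a made-up sequence of validation losses. `early_stopping_epoch` is a hypothetical helper written only for this illustration; it applies the same comparison (`val_loss < best_val - 1e-6`) and counter reset.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=1e-6):
    """Return (stop_epoch, best_epoch, best_val); epochs are 1-indexed."""
    best_val = float("inf")
    best_epoch = 0
    pat = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_val - min_delta:
            # Improvement: remember it and reset the patience counter.
            best_val, best_epoch, pat = val_loss, epoch, 0
        else:
            pat += 1
        if pat >= patience:
            return epoch, best_epoch, best_val
    return len(val_losses), best_epoch, best_val

losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.355, 0.34, 0.33]  # illustrative values
print(early_stopping_epoch(losses, patience=3))  # (6, 3, 0.35)
print(early_stopping_epoch(losses, patience=5))  # (8, 8, 0.33)
```

With `patience=3` the run stops at epoch 6 and never sees the later improvements, which is the trade-off behind choosing a fairly generous patience of 20 in the notebook.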


## 8) Learning curves

In [None]:
plt.figure(figsize=(6,4))
plt.plot(train_losses, label="train")
plt.plot(val_losses, label="val")
plt.title("Learning Curves (MSE)")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()


## 9) Metrics & sanity check predictions

We'll report RMSE and a quick actual-vs-predicted plot on the validation set.
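
One common sanity check, which is not part of the original notebook, is to compare the model's RMSE against a naive persistence forecast that predicts "next month equals this month". The sketch below computes RMSE by hand on made-up values; `rmse` is a hypothetical helper.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up unemployment rates for six consecutive months (illustrative only).
ur = np.array([5.0, 5.2, 5.1, 4.9, 4.8, 4.7])

y_true = ur[1:]        # next-month values
persistence = ur[:-1]  # naive forecast: next month equals this month

baseline_rmse = rmse(y_true, persistence)
print(f"Persistence RMSE (pct-pts): {baseline_rmse:.4f}")
```

A trained model that cannot beat this baseline on the validation set is probably not adding much beyond the persistence already present in the series.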

In [None]:
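# root_mean_squared_error was added in scikit-learn 1.4; on older versions,
# np.sqrt(mean_squared_error(y_val, pv)) gives the same number.
from sklearn.metrics import root_mean_squared_error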

model.eval()
with torch.no_grad():
    Xv = torch.tensor(X_val, dtype=torch.float32).to(device)
    pv = model(Xv).cpu().numpy().ravel()

rmse = root_mean_squared_error(y_val, pv)
print("Validation RMSE (pct-pts):", rmse)

plt.figure(figsize=(6,4))
plt.plot(val_df["date"], y_val, label="Actual (next UR)")
plt.plot(val_df["date"], pv, label="Predicted (next UR)")
plt.title("Validation: Next-Month Unemployment Rate")
plt.xlabel("Date")
plt.ylabel("%")
plt.legend()
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()


## 10) Save scaler & model for later inference

In [None]:
save_dir = "/mnt/data"
os.makedirs(save_dir, exist_ok=True)

torch.save(model.state_dict(), os.path.join(save_dir, "ur_from_adp_mlp.pt"))
import joblib
joblib.dump(scaler, os.path.join(save_dir, "ur_from_adp_scaler.joblib"))

print("Saved:", os.path.join(save_dir, "ur_from_adp_mlp.pt"))
print("Saved:", os.path.join(save_dir, "ur_from_adp_scaler.joblib"))
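
Reloading in a fresh session follows the reverse path: rebuild the same architecture, load the saved `state_dict` into it, and unpickle the scaler. The sketch below is self-contained and uses small stand-in objects and a temporary directory (assumptions for illustration); in this notebook you would construct `MLPRegressor` with the same `in_features` and hidden sizes and point at the files saved above.

```python
import os
import tempfile

import joblib
import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import StandardScaler

# Stand-ins for the trained model and fitted scaler (assumptions for this sketch).
torch.manual_seed(0)
net = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1))
X_fit = np.random.default_rng(0).normal(size=(20, 3))
sc = StandardScaler().fit(X_fit)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pt")
    scaler_path = os.path.join(tmp, "scaler.joblib")
    torch.save(net.state_dict(), model_path)
    joblib.dump(sc, scaler_path)

    # Reload: rebuild the same architecture first, then load the weights into it.
    net2 = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1))
    net2.load_state_dict(torch.load(model_path, map_location="cpu"))
    net2.eval()
    sc2 = joblib.load(scaler_path)

# The reloaded pair should reproduce the original predictions.
x = torch.tensor(sc2.transform(X_fit[:2]), dtype=torch.float32)
net.eval()
with torch.no_grad():
    same = torch.allclose(net(x), net2(x))
print(same)
```

`map_location="cpu"` lets weights saved from a GPU run load on a machine without one; move the model to `device` afterwards if a GPU is available.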


## 11) Inference (predicting the **next** month)

Here we construct the latest available feature row from the dataset (or your CSV), transform it with the **training scaler**, and generate a prediction for `target_next_ur` (next-month unemployment rate).

In [None]:

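# add_features() drops the final month because its target is still unknown, so
# df_feat.iloc[-1] is not the newest observation. Pad one placeholder month so the
# real last month survives dropna(); the placeholder row itself is dropped.
pad = df.sort_values("date").iloc[[-1]].copy()
pad["date"] = pad["date"] + pd.offsets.MonthEnd(1)
latest_row = add_features(pd.concat([df, pad], ignore_index=True)).iloc[[-1]].copy()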

# Prepare the feature vector in correct column order
X_new = latest_row[feature_cols].values.astype(np.float32)

# Scale with training scaler
X_new_scaled = scaler.transform(X_new)

# Torch tensor -> device
X_new_t = torch.tensor(X_new_scaled, dtype=torch.float32).to(device)

model.eval()
with torch.no_grad():
    pred_next_ur = model(X_new_t).cpu().numpy().ravel()[0]

print(f"Predicted next-month unemployment rate: {pred_next_ur:.2f}%")
print("This prediction corresponds to the month after:", latest_row['date'].dt.strftime('%Y-%m').item())


---

## Appendix: Bring-your-own ADP/BLS CSVs

1. **ADP file** (monthly):  
   - Columns: `date, adp_change`  
   - `date` should be end-of-month (or you can convert to month-end with `pd.to_datetime(...).dt.to_period('M').dt.to_timestamp('M')`).

2. **Unemployment rate file** (monthly, from BLS `LNS14000000` or similar):  
   - Columns: `date, unemployment_rate` (percent).

3. **Join & save** to `/mnt/data/adp_unemployment_sample.csv` with columns:
   ```csv
   date,adp_change,unemployment_rate
   2010-01-31,45,9.8
   ...
   ```

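4. **Load it** in section 1 by replacing `df = synthesize_data()` with `df = pd.read_csv("/mnt/data/adp_unemployment_sample.csv", parse_dates=["date"])`, then rerun the notebook. As written, the notebook always generates synthetic data and does not read a CSV.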
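
The join in step 3 could be sketched as follows. The two frames are made-up stand-ins for the source files (values are illustrative), `to_month_end` is a hypothetical helper, and the save line is commented out so the snippet has no side effects.

```python
import pandas as pd

# Made-up frames standing in for the two source files.
adp = pd.DataFrame({
    "date": ["2010-01-15", "2010-02-10"],  # any day within the month
    "adp_change": [45, 130],
})
bls = pd.DataFrame({
    "date": ["2010-01-01", "2010-02-01"],  # first-of-month stamps, as some sources use
    "unemployment_rate": [9.8, 9.8],
})

def to_month_end(s):
    """Snap dates to the last calendar day of their month."""
    return pd.to_datetime(s) + pd.offsets.MonthEnd(0)

adp["date"] = to_month_end(adp["date"])
bls["date"] = to_month_end(bls["date"])

# Inner join keeps only months present in both sources.
merged = adp.merge(bls, on="date", how="inner").sort_values("date")
print(merged)
# merged.to_csv("/mnt/data/adp_unemployment_sample.csv", index=False)
```

Normalizing both files to month-end before merging matters because the sources may stamp the same month with different days, and an exact-date join would otherwise match nothing.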

> For sequence models (LSTM/GRU), arrange windows of past `k` months as inputs; the rest of the pipeline remains similar.
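
A minimal sketch of that windowing step, using NumPy only. `make_windows` is a hypothetical helper; it assumes rows are in chronological order and that `y[i]` is the target aligned with row `i`.

```python
import numpy as np

def make_windows(X, y, k):
    """Hypothetical helper: stack the k most recent rows of X for each target.

    X: (n, n_features) features in chronological order
    y: (n,) target aligned with each row (e.g. next-month UR)
    Returns X_seq of shape (n - k + 1, k, n_features) and y_seq of shape (n - k + 1,).
    """
    X = np.asarray(X)
    y = np.asarray(y)
    n = len(X)
    # Window i covers rows i-k+1 .. i and is paired with the target of row i.
    X_seq = np.stack([X[i - k + 1 : i + 1] for i in range(k - 1, n)])
    y_seq = y[k - 1 :]
    return X_seq, y_seq

X_demo = np.arange(20, dtype=np.float32).reshape(10, 2)  # 10 months, 2 features
y_demo = np.arange(10, dtype=np.float32)
X_seq, y_seq = make_windows(X_demo, y_demo, k=4)
print(X_seq.shape, y_seq.shape)  # (7, 4, 2) (7,)
```

Windows should be built within the training and validation partitions after scaling, so that the scaler is still fit on training rows only.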