# 2. Backfill FRED Macroeconomic Data

Fetch historical DGS10 (10-year Treasury yield) and CPIAUCSL (CPI) from FRED and upload to Hopsworks.

**Important**: This notebook fetches raw data only. Point-in-time correctness for CPI release dates is handled in notebook `5_macro_sentiment_features`.

**Pipeline**: FRED API → Hopsworks Feature Groups (raw)

In [1]:
import sys
sys.path.append('..')

import pandas as pd
from utils.data_fetchers import fetch_dgs10, fetch_cpi
from utils.hopsworks_helpers import get_feature_store, create_feature_group
from dotenv import load_dotenv
import yaml

load_dotenv()

# Load config
with open('../config/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

## Fetch 10-Year Treasury Yield (DGS10)

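DGS10 is a business-daily series: weekends have no rows at all, and market holidays (e.g., 2020-01-01) appear as missing values. These gaps will be forward-filled when creating daily features. Forward-filling introduces no look-ahead bias because it only carries the most recent published yield forward.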
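As a rough illustration of the forward-fill step described above, the sketch below reindexes a toy yield series onto a full calendar and carries the last known value forward. The values are illustrative, and this is only an assumption about how the downstream notebook might fill gaps, not its actual implementation.

```python
import pandas as pd

# Toy business-day series: 2020-01-01 is a holiday (NaN), and the weekend
# of 2020-01-04/05 has no rows at all.
raw = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06"]
        ),
        "dgs10": [float("nan"), 1.88, 1.80, 1.81],
    }
)

# Reindex onto every calendar day, then forward-fill. ffill only copies
# earlier observations forward, so no future value leaks into the past.
daily = (
    raw.set_index("date")
    .reindex(pd.date_range("2020-01-01", "2020-01-06", freq="D"))
    .ffill()
    .rename_axis("date")
    .reset_index()
)

print(daily)
```

Note that the first row stays missing because there is no earlier observation to carry forward; a backfill there would use a future value and should be avoided.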

In [3]:
import ssl
import urllib.request

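# WARNING: this disables HTTPS certificate verification for every urllib
# request in this process. It is a local workaround for certificate errors;
# prefer fixing the system certificate store where possible.
ssl._create_default_https_context = ssl._create_unverified_context
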
start_date = config['data']['start_date']
end_date = config['data']['end_date']

print(f"Fetching DGS10 from {start_date} to {end_date}...")
dgs10_data = fetch_dgs10(start_date, end_date)

print(f"\nDGS10 data shape: {dgs10_data.shape}")
print(f"Date range: {dgs10_data['date'].min()} to {dgs10_data['date'].max()}")
print(f"Missing values: {dgs10_data['dgs10'].isna().sum()}")
dgs10_data.head(10)

Fetching DGS10 from 2020-01-01 to 2025-12-30...

DGS10 data shape: (1565, 2)
Date range: 2020-01-01 00:00:00 to 2025-12-30 00:00:00
Missing values: 66


|   | date | dgs10 |
|---|---|---|
| 0 | 2020-01-01 | NaN |
| 1 | 2020-01-02 | 1.88 |
| 2 | 2020-01-03 | 1.8 |
| 3 | 2020-01-06 | 1.81 |
| 4 | 2020-01-07 | 1.83 |
| 5 | 2020-01-08 | 1.87 |
| 6 | 2020-01-09 | 1.85 |
| 7 | 2020-01-10 | 1.83 |
| 8 | 2020-01-13 | 1.85 |
| 9 | 2020-01-14 | 1.82 |


In [4]:
# Basic statistics
dgs10_data['dgs10'].describe()

count    1499.000000
mean        2.953709
std         1.407235
min         0.520000
25%         1.540000
50%         3.570000
75%         4.205000
max         4.980000
Name: dgs10, dtype: float64

## Fetch CPI Data (CPIAUCSL)

**CRITICAL**: CPI is a monthly series. The 'date' column in FRED represents the **reference month** (e.g., 2024-01-01 for January 2024), NOT the release date.

To avoid look-ahead bias:
- CPI for month M is typically released around the middle of month M+1 (the exact day varies from month to month)
- Point-in-time alignment is handled in notebook 5 using `make_macro_daily_features()`
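To make the look-ahead issue concrete, here is a minimal sketch of one way to align reference-month CPI to the days on which it would have been known. It is not the implementation of `make_macro_daily_features()`; the fixed "15th of the following month" lag and the `available_from` column name are assumptions made for illustration, and real release dates vary.

```python
import pandas as pd

# Reference-month CPI values (taken from the first rows of the series).
cpi = pd.DataFrame(
    {
        "date": pd.date_range("2020-01-01", periods=3, freq="MS"),
        "cpiaucsl": [259.127, 259.250, 258.076],
    }
)

# ASSUMPTION: the value for month M becomes available on the 15th of M+1.
cpi["available_from"] = cpi["date"] + pd.DateOffset(months=1, days=14)

# Daily calendar we want features for.
days = pd.DataFrame({"day": pd.date_range("2020-02-10", "2020-03-20", freq="D")})

# For each day, take the latest CPI whose availability date is on or before it.
aligned = pd.merge_asof(
    days,
    cpi.sort_values("available_from"),
    left_on="day",
    right_on="available_from",
    direction="backward",
)

print(aligned[["day", "date", "cpiaucsl"]].iloc[3:7])
```

With this alignment, days before 2020-02-15 carry no CPI value at all, and the January figure is used from 2020-02-15 until the February figure becomes available.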

In [5]:
print(f"Fetching CPIAUCSL from {start_date} to {end_date}...")
cpi_data = fetch_cpi(start_date, end_date)

print(f"\nCPI data shape: {cpi_data.shape}")
print(f"Date range: {cpi_data['date'].min()} to {cpi_data['date'].max()}")
print(f"Missing values: {cpi_data['cpiaucsl'].isna().sum()}")
print(f"\nCPI is monthly, so we have ~{cpi_data.shape[0]} observations for ~6 years")
cpi_data.head(10)

Fetching CPIAUCSL from 2020-01-01 to 2025-12-30...

CPI data shape: (71, 2)
Date range: 2020-01-01 00:00:00 to 2025-11-01 00:00:00
Missing values: 1

CPI is monthly, so we have ~71 observations for ~6 years


|   | date | cpiaucsl |
|---|---|---|
| 0 | 2020-01-01 | 259.127 |
| 1 | 2020-02-01 | 259.25 |
| 2 | 2020-03-01 | 258.076 |
| 3 | 2020-04-01 | 256.032 |
| 4 | 2020-05-01 | 255.802 |
| 5 | 2020-06-01 | 257.042 |
| 6 | 2020-07-01 | 258.352 |
| 7 | 2020-08-01 | 259.316 |
| 8 | 2020-09-01 | 259.997 |
| 9 | 2020-10-01 | 260.319 |


In [6]:
# Verify monthly frequency
cpi_data['month_diff'] = cpi_data['date'].diff().dt.days
print("\nDays between CPI observations (should be ~28-31):")
print(cpi_data['month_diff'].describe())
cpi_data = cpi_data.drop(columns=['month_diff'])


Days between CPI observations (should be ~28-31):
count    70.000000
mean     30.442857
std       0.810005
min      28.000000
25%      30.000000
50%      31.000000
75%      31.000000
max      31.000000
Name: month_diff, dtype: float64
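
The day-gap check above could be complemented by an explicit list of absent calendar months, which is easier to read when a gap does occur. The helper below is a hypothetical sketch (the name `missing_months` is not part of this project's utilities); it compares the observed months against a complete monthly range.

```python
import pandas as pd


def missing_months(dates: pd.Series) -> pd.PeriodIndex:
    """Return calendar months absent between the first and last observation."""
    observed = pd.PeriodIndex(dates.dt.to_period("M").unique())
    expected = pd.period_range(observed.min(), observed.max(), freq="M")
    return expected.difference(observed)


# Toy example: March 2020 is deliberately left out.
sample = pd.Series(pd.to_datetime(["2020-01-01", "2020-02-01", "2020-04-01"]))
gaps = missing_months(sample)
print(list(gaps.astype(str)))
```

Calling `missing_months(cpi_data['date'])` would return an empty index when every reference month has a row, even if some of those rows hold a missing CPI value.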


In [7]:
# Basic statistics
cpi_data['cpiaucsl'].describe()

count     70.000000
mean     292.942643
std       22.754817
min      255.802000
25%      271.023750
50%      298.758000
75%      313.102250
max      325.031000
Name: cpiaucsl, dtype: float64

## Upload to Hopsworks Feature Store

Create raw feature groups for FRED data. These will be read by notebook 5 for point-in-time correct feature engineering.
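
The upload logs in this notebook report the data shape before and after deduplication, which suggests the helper cleans the frame on its primary key before inserting. The sketch below shows what such a pre-upload step might look like in plain pandas. It is an assumption for illustration only: `prepare_for_upload` is a hypothetical name, not the actual logic of `create_feature_group`, and it makes no Hopsworks calls.

```python
import pandas as pd


def prepare_for_upload(df: pd.DataFrame, primary_key: list[str]) -> pd.DataFrame:
    """Hypothetical cleanup: normalise dates and keep one row per primary key."""
    out = df.copy()
    # Ensure the key column is a proper datetime rather than a string.
    out["date"] = pd.to_datetime(out["date"])
    # If the same key appears twice, keep the most recently appended row.
    out = out.drop_duplicates(subset=primary_key, keep="last")
    return out.sort_values(primary_key).reset_index(drop=True)


# Toy frame with a duplicated date.
toy = pd.DataFrame(
    {
        "date": ["2020-01-03", "2020-01-02", "2020-01-02"],
        "dgs10": [1.80, 1.87, 1.88],
    }
)
clean = prepare_for_upload(toy, primary_key=["date"])
print(clean)
```

Keeping the last duplicate is a design choice that favours revised values over earlier ones; a pipeline that must preserve first-published values would use `keep="first"` instead.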

In [8]:
# Connect to Hopsworks
print("Connecting to Hopsworks...")
fs = get_feature_store()
print(f"✓ Connected to feature store: {fs.name}")

Connecting to Hopsworks...
2026-01-05 12:07:25,385 INFO: Initializing external client
2026-01-05 12:07:25,386 INFO: Base URL: https://c.app.hopsworks.ai:443
2026-01-05 12:07:26,882 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1272010
✓ Connected to feature store: scalable_lab1_featurestore


In [9]:
# Create DGS10 feature group
print("Creating DGS10 feature group...")
dgs10_fg = create_feature_group(
    fs,
    name='dgs10_raw',
    df=dgs10_data,
    primary_key=['date'],
    description='Raw 10-year Treasury Constant Maturity Rate (DGS10) from FRED - daily series'
)
print(f"✓ Created feature group: dgs10_raw (version {dgs10_fg.version})")

Creating DGS10 feature group...

Creating feature group: dgs10_raw
Data shape (before deduplication): (1565, 2)
Data shape (after deduplication): (1565, 2)
Columns: ['date', 'dgs10']
Data types:
date     datetime64[ms]
dgs10           float64
dtype: object
✓ Feature group 'dgs10_raw' already exists (version 1)
  Feature group object type: <class 'hsfs.feature_group.FeatureGroup'>
  Deleting existing data and re-inserting...

Inserting 1565 rows...
Sample data (first row):
[{'date': Timestamp('2020-01-01 00:00:00'), 'dgs10': nan}]


Uploading Dataframe: 100.00% |██████████| Rows 1565/1565 | Elapsed Time: 00:00 | Remaining Time: 00:00


Launching job: dgs10_raw_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1272010/jobs/named/dgs10_raw_1_offline_fg_materialization/executions
2026-01-05 12:08:15,222 INFO: Waiting for execution to finish. Current state: INITIALIZING. Final status: UNDEFINED
2026-01-05 12:08:18,409 INFO: Waiting for execution to finish. Current state: SUBMITTED. Final status: UNDEFINED
2026-01-05 12:08:21,684 INFO: Waiting for execution to finish. Current state: RUNNING. Final status: UNDEFINED
2026-01-05 12:10:33,085 INFO: Waiting for execution to finish. Current state: AGGREGATING_LOGS. Final status: SUCCEEDED
2026-01-05 12:10:33,249 INFO: Waiting for log aggregation to finish.
2026-01-05 12:10:52,028 INFO: Execution finished successfully.
✓ Insert job completed
  Job details: (Job('dgs10_raw_1_offline_fg_materialization', 'SPARK'), None)

Waiting for data to be committed (10 seconds)...

✓ Upload completed successfully
  Job fini

In [10]:
# Create CPI feature group
print("Creating CPI feature group...")
cpi_fg = create_feature_group(
    fs,
    name='cpi_raw',
    df=cpi_data,
    primary_key=['date'],
    description='Raw Consumer Price Index (CPIAUCSL) from FRED - monthly series with REFERENCE month dates (not release dates)'
)
print(f"✓ Created feature group: cpi_raw (version {cpi_fg.version})")

Creating CPI feature group...

Creating feature group: cpi_raw
Data shape (before deduplication): (71, 2)
Data shape (after deduplication): (71, 2)
Columns: ['date', 'cpiaucsl']
Data types:
date        datetime64[ms]
cpiaucsl           float64
dtype: object
✓ Feature group 'cpi_raw' already exists (version 1)
  Feature group object type: <class 'hsfs.feature_group.FeatureGroup'>
  Deleting existing data and re-inserting...

Inserting 71 rows...
Sample data (first row):
[{'date': Timestamp('2020-01-01 00:00:00'), 'cpiaucsl': 259.127}]


Uploading Dataframe: 100.00% |██████████| Rows 71/71 | Elapsed Time: 00:00 | Remaining Time: 00:00


Launching job: cpi_raw_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1272010/jobs/named/cpi_raw_1_offline_fg_materialization/executions
2026-01-05 12:14:18,455 INFO: Waiting for execution to finish. Current state: INITIALIZING. Final status: UNDEFINED
2026-01-05 12:14:21,672 INFO: Waiting for execution to finish. Current state: SUBMITTED. Final status: UNDEFINED
2026-01-05 12:14:24,879 INFO: Waiting for execution to finish. Current state: RUNNING. Final status: UNDEFINED
2026-01-05 12:16:36,987 INFO: Waiting for execution to finish. Current state: AGGREGATING_LOGS. Final status: SUCCEEDED
2026-01-05 12:16:37,165 INFO: Waiting for log aggregation to finish.
2026-01-05 12:16:52,678 INFO: Execution finished successfully.
✓ Insert job completed
  Job details: (Job('cpi_raw_1_offline_fg_materialization', 'SPARK'), None)

Waiting for data to be committed (10 seconds)...

✓ Upload completed successfully
  Job finished w

## Summary

✅ FRED data successfully uploaded to Hopsworks Feature Store:
- **dgs10_raw**: Daily 10-year Treasury yield
- **cpi_raw**: Monthly Consumer Price Index (reference month dates)

**⚠️ Important Note on CPI**:
- The dates in `cpi_raw` represent the **reference month** (data collection month)
- CPI for January 2024 (date=2024-01-01) is typically **released** in mid-February 2024
- Notebook 5 will handle release date alignment to ensure point-in-time correctness

**Next steps**:
- Run notebook 3 to backfill news sentiment data
- Run notebook 5 to create point-in-time correct macro features