# 1. Backfill Yahoo Finance Data

Fetch historical daily OHLCV data for QQQ, XLK, and VIX from Yahoo Finance and upload it to the Hopsworks Feature Store as raw feature groups.

**Pipeline**: Yahoo Finance API → Hopsworks Feature Groups (raw)

In [1]:
import sys
sys.path.append('..')

import pandas as pd
from utils.data_fetchers import fetch_yahoo_data, validate_ohlcv_data
from utils.hopsworks_helpers import get_feature_store, create_feature_group
from dotenv import load_dotenv
import yaml

load_dotenv()

# Load config
with open('../config/config.yaml', 'r') as f:
    config = yaml.safe_load(f)
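
Every fetch below depends on `config['data']['start_date']` and `config['data']['end_date']`, so a quick sanity check right after loading may catch a malformed config early. The sketch below is illustrative only: `check_date_range` is a hypothetical helper, not part of `utils`, and it assumes the two values are something pandas can parse, such as ISO date strings or YAML dates.

```python
import pandas as pd


def check_date_range(config: dict) -> tuple[pd.Timestamp, pd.Timestamp]:
    """Parse and validate the backfill window from a loaded config dict.

    Hypothetical helper: assumes config['data'] holds 'start_date' and
    'end_date' values that pandas can parse (e.g. '2020-01-01').
    """
    data_cfg = config["data"]
    start = pd.Timestamp(data_cfg["start_date"])
    end = pd.Timestamp(data_cfg["end_date"])
    if start >= end:
        raise ValueError(
            f"start_date {start.date()} must be before end_date {end.date()}"
        )
    return start, end


# Example using the same window this notebook fetches
example_config = {"data": {"start_date": "2020-01-01", "end_date": "2025-12-30"}}
start, end = check_date_range(example_config)
print(f"Backfill window: {start.date()} to {end.date()}")
```

In the notebook itself this could be called as `check_date_range(config)` before the first fetch.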

## Fetch QQQ Data

In [2]:
start_date = config['data']['start_date']
end_date = config['data']['end_date']

print(f"Fetching QQQ from {start_date} to {end_date}...")
qqq_data = fetch_yahoo_data('QQQ', start_date, end_date)
validate_ohlcv_data(qqq_data)

print(f"\nQQQ data shape: {qqq_data.shape}")
print(f"Date range: {qqq_data['date'].min()} to {qqq_data['date'].max()}")
qqq_data.head()

Fetching QQQ from 2020-01-01 to 2025-12-30...

QQQ data shape: (1506, 6)
Date range: 2020-01-02 00:00:00 to 2025-12-29 00:00:00


| | date | open | high | low | close | volume |
|---|---|---|---|---|---|---|
| 0 | 2020-01-02 | 206.881937 | 208.580231 | 206.476666 | 208.580231 | 30969400 |
| 1 | 2020-01-03 | 205.820485 | 207.91439 | 205.801182 | 206.669617 | 27518900 |
| 2 | 2020-01-06 | 205.0486 | 208.030244 | 204.797722 | 208.001297 | 21655300 |
| 3 | 2020-01-07 | 208.078449 | 208.560916 | 207.316157 | 207.972305 | 22139300 |
| 4 | 2020-01-08 | 207.943341 | 210.490767 | 207.615267 | 209.535477 | 26397300 |
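
`validate_ohlcv_data` lives in `utils.data_fetchers`, and its implementation is not shown in this notebook. As a rough idea of the kind of checks such a validator might run, here is a minimal sketch. The function name `basic_ohlcv_checks` and the specific rules are assumptions, not the project's actual logic.

```python
import pandas as pd


def basic_ohlcv_checks(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems found in an OHLCV frame.

    Hypothetical stand-in for validate_ohlcv_data; the real helper may
    check different things.
    """
    problems = []
    required = ["date", "open", "high", "low", "close", "volume"]
    missing = [col for col in required if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    if df[required].isna().any().any():
        problems.append("null values present")
    if df["date"].duplicated().any():
        problems.append("duplicate dates")
    if not df["date"].is_monotonic_increasing:
        problems.append("dates not sorted ascending")
    # The high must bound open/close/low from above, the low from below.
    if (df["high"] < df[["open", "close", "low"]].max(axis=1)).any():
        problems.append("high below open/close/low")
    if (df["low"] > df[["open", "close", "high"]].min(axis=1)).any():
        problems.append("low above open/close/high")
    if (df["volume"] < 0).any():
        problems.append("negative volume")
    return problems


# First two QQQ rows from the output above
sample = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-02", "2020-01-03"]),
    "open": [206.881937, 205.820485],
    "high": [208.580231, 207.914390],
    "low": [206.476666, 205.801182],
    "close": [208.580231, 206.669617],
    "volume": [30969400, 27518900],
})
print(basic_ohlcv_checks(sample))
```

Returning a list of problems rather than raising on the first one makes it easier to see everything wrong with a fetch in a single pass.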


## Fetch XLK Data (Technology Sector ETF)

In [3]:
print(f"Fetching XLK from {start_date} to {end_date}...")
xlk_data = fetch_yahoo_data('XLK', start_date, end_date)
validate_ohlcv_data(xlk_data)

print(f"\nXLK data shape: {xlk_data.shape}")
print(f"Date range: {xlk_data['date'].min()} to {xlk_data['date'].max()}")
xlk_data.head()

Fetching XLK from 2020-01-01 to 2025-12-30...

XLK data shape: (1506, 6)
Date range: 2020-01-02 00:00:00 to 2025-12-29 00:00:00


| | date | open | high | low | close | volume |
|---|---|---|---|---|---|---|
| 0 | 2020-01-02 | 43.926613 | 44.349258 | 43.841134 | 44.349258 | 26567000 |
| 1 | 2020-01-03 | 43.703423 | 44.154564 | 43.698673 | 43.850636 | 30023600 |
| 2 | 2020-01-06 | 43.413724 | 44.002581 | 43.332995 | 43.95509 | 15630000 |
| 3 | 2020-01-07 | 44.031079 | 44.154549 | 43.869617 | 43.9361 | 15363600 |
| 4 | 2020-01-08 | 43.983594 | 44.600938 | 43.902861 | 44.406239 | 23254400 |


## Fetch VIX Data (Volatility Index)

In [4]:
print(f"Fetching ^VIX from {start_date} to {end_date}...")
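# ^VIX is an index rather than a traded fund; its volume is reported as 0
# in the rows shown below, and validate_ohlcv_data is not applied to it here.
vix_data = fetch_yahoo_data('^VIX', start_date, end_date)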

print(f"\nVIX data shape: {vix_data.shape}")
print(f"Date range: {vix_data['date'].min()} to {vix_data['date'].max()}")
vix_data.head()

Fetching ^VIX from 2020-01-01 to 2025-12-30...

VIX data shape: (1506, 6)
Date range: 2020-01-02 00:00:00 to 2025-12-29 00:00:00


| | date | open | high | low | close | volume |
|---|---|---|---|---|---|---|
| 0 | 2020-01-02 | 13.46 | 13.72 | 12.42 | 12.47 | 0 |
| 1 | 2020-01-03 | 15.01 | 16.200001 | 13.13 | 14.02 | 0 |
| 2 | 2020-01-06 | 15.45 | 16.389999 | 13.54 | 13.85 | 0 |
| 3 | 2020-01-07 | 13.84 | 14.46 | 13.39 | 13.79 | 0 |
| 4 | 2020-01-08 | 15.16 | 15.24 | 12.83 | 13.45 | 0 |
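
All three frames report 1506 rows over the same date range, and each feature group created later uses `date` as its primary key. Matching row counts do not by themselves guarantee matching dates, so a cross-check along these lines may be worthwhile. `date_mismatches` is a hypothetical helper written for this illustration, and the toy frames stand in for the real data.

```python
import pandas as pd


def date_mismatches(frames: dict[str, pd.DataFrame]) -> dict[str, list[pd.Timestamp]]:
    """For each named frame, list dates that other frames have but it lacks."""
    all_dates = set()
    for df in frames.values():
        all_dates |= set(df["date"])
    return {
        name: sorted(all_dates - set(df["date"]))
        for name, df in frames.items()
    }


# Toy frames: 'b' is missing one trading day that 'a' has
a = pd.DataFrame({"date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"])})
b = pd.DataFrame({"date": pd.to_datetime(["2020-01-02", "2020-01-06"])})
gaps = date_mismatches({"a": a, "b": b})
print(gaps)
```

In this notebook it could be called as `date_mismatches({'qqq': qqq_data, 'xlk': xlk_data, 'vix': vix_data})`; an empty list for every key would mean the three tickers cover exactly the same dates.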


## Upload to Hopsworks Feature Store

Create one raw feature group per ticker, keyed on `date`. The feature engineering notebooks read from these groups.
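
`create_feature_group` is a project helper whose source is not included here. For orientation, a helper of this kind can be built on the feature store's `get_or_create_feature_group` and the feature group's `insert` method from the Hopsworks Python API. The sketch below is an assumption about the general shape, not the actual implementation: the name `upsert_raw_frame`, the fixed `version=1`, and the choice of `event_time='date'` are illustrative, and it does not reproduce the delete-and-reinsert behaviour visible in the logs further down.

```python
import pandas as pd


def upsert_raw_frame(fs, name: str, df: pd.DataFrame, description: str):
    """Create the feature group if needed, then insert df into it.

    `fs` is expected to be a Hopsworks feature store handle such as the
    one returned by get_feature_store(). Hypothetical sketch only.
    """
    # Drop repeated dates so the primary key stays unique.
    deduped = df.drop_duplicates(subset=["date"], keep="last")

    fg = fs.get_or_create_feature_group(
        name=name,
        version=1,
        primary_key=["date"],
        event_time="date",
        description=description,
    )
    # Ask the client to block until the materialization job finishes.
    fg.insert(deduped, write_options={"wait_for_job": True})
    return fg
```

Passing `fs` in as an argument, rather than connecting inside the function, keeps the helper easy to exercise with a stub object when no Hopsworks project is available.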

In [5]:
# Connect to Hopsworks
print("Connecting to Hopsworks...")
fs = get_feature_store()
print(f"✓ Connected to feature store: {fs.name}")

Connecting to Hopsworks...
2026-01-05 11:53:51,292 INFO: Initializing external client
2026-01-05 11:53:51,293 INFO: Base URL: https://c.app.hopsworks.ai:443
2026-01-05 11:53:52,627 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1272010
✓ Connected to feature store: scalable_lab1_featurestore


In [6]:
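# Prepare QQQ data: prefix every column except 'date' with 'qqq_'.
# rename() maps by name, so it does not depend on 'date' being the first column.
qqq_data_fg = qqq_data.rename(
    columns={col: f'qqq_{col}' for col in qqq_data.columns if col != 'date'}
)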

print("Creating QQQ feature group...")
qqq_fg = create_feature_group(
    fs,
    name='qqq_raw',
    df=qqq_data_fg,
    primary_key=['date'],
    description='Raw OHLCV data for QQQ ETF from Yahoo Finance'
)
print(f"✓ Created feature group: qqq_raw (version {qqq_fg.version})")

Creating QQQ feature group...

Creating feature group: qqq_raw
Data shape (before deduplication): (1506, 6)
Data shape (after deduplication): (1506, 6)
Columns: ['date', 'qqq_open', 'qqq_high', 'qqq_low', 'qqq_close', 'qqq_volume']
Data types:
date          datetime64[ms]
qqq_open             float64
qqq_high             float64
qqq_low              float64
qqq_close            float64
qqq_volume             int64
dtype: object
✓ Feature group 'qqq_raw' already exists (version 1)
  Feature group object type: <class 'hsfs.feature_group.FeatureGroup'>
  Deleting existing data and re-inserting...

Inserting 1506 rows...
Sample data (first row):
[{'date': Timestamp('2020-01-02 00:00:00'), 'qqq_open': 206.8819366864673, 'qqq_high': 208.58023071289062, 'qqq_low': 206.4766659911793, 'qqq_close': 208.58023071289062, 'qqq_volume': 30969400}]


Uploading Dataframe: 100.00% |██████████| Rows 1506/1506 | Elapsed Time: 00:01 | Remaining Time: 00:00


Launching job: qqq_raw_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1272010/jobs/named/qqq_raw_1_offline_fg_materialization/executions
2026-01-05 11:54:16,797 INFO: Waiting for execution to finish. Current state: INITIALIZING. Final status: UNDEFINED
2026-01-05 11:54:20,016 INFO: Waiting for execution to finish. Current state: RUNNING. Final status: UNDEFINED
2026-01-05 11:56:31,966 INFO: Waiting for execution to finish. Current state: AGGREGATING_LOGS. Final status: SUCCEEDED
2026-01-05 11:56:32,129 INFO: Waiting for log aggregation to finish.
2026-01-05 11:56:54,371 INFO: Execution finished successfully.
✓ Insert job completed
  Job details: (Job('qqq_raw_1_offline_fg_materialization', 'SPARK'), None)

Waiting for data to be committed (10 seconds)...

✓ Upload completed successfully
  Job finished with status: SUCCEEDED
  Uploaded 1506 rows to 'qqq_raw'

⚠️  NOTE: Data is in Hopsworks but may take a few minute

In [7]:
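# Prepare XLK data: prefix every column except 'date' with 'xlk_'.
# rename() maps by name, so it does not depend on 'date' being the first column.
xlk_data_fg = xlk_data.rename(
    columns={col: f'xlk_{col}' for col in xlk_data.columns if col != 'date'}
)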

print("Creating XLK feature group...")
xlk_fg = create_feature_group(
    fs,
    name='xlk_raw',
    df=xlk_data_fg,
    primary_key=['date'],
    description='Raw OHLCV data for XLK Technology Sector ETF from Yahoo Finance'
)
print(f"✓ Created feature group: xlk_raw (version {xlk_fg.version})")

Creating XLK feature group...

Creating feature group: xlk_raw
Data shape (before deduplication): (1506, 6)
Data shape (after deduplication): (1506, 6)
Columns: ['date', 'xlk_open', 'xlk_high', 'xlk_low', 'xlk_close', 'xlk_volume']
Data types:
date          datetime64[ms]
xlk_open             float64
xlk_high             float64
xlk_low              float64
xlk_close            float64
xlk_volume             int64
dtype: object
✓ Feature group 'xlk_raw' already exists (version 1)
  Feature group object type: <class 'hsfs.feature_group.FeatureGroup'>
  Deleting existing data and re-inserting...

Inserting 1506 rows...
Sample data (first row):
[{'date': Timestamp('2020-01-02 00:00:00'), 'xlk_open': 43.92661345887618, 'xlk_high': 44.34925842285156, 'xlk_low': 43.84113449857417, 'xlk_close': 44.34925842285156, 'xlk_volume': 26567000}]


Uploading Dataframe: 100.00% |██████████| Rows 1506/1506 | Elapsed Time: 00:01 | Remaining Time: 00:00


Launching job: xlk_raw_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1272010/jobs/named/xlk_raw_1_offline_fg_materialization/executions
2026-01-05 11:57:20,442 INFO: Waiting for execution to finish. Current state: INITIALIZING. Final status: UNDEFINED
2026-01-05 11:57:23,662 INFO: Waiting for execution to finish. Current state: SUBMITTED. Final status: UNDEFINED
2026-01-05 11:57:26,869 INFO: Waiting for execution to finish. Current state: RUNNING. Final status: UNDEFINED
2026-01-05 11:59:35,197 INFO: Waiting for execution to finish. Current state: AGGREGATING_LOGS. Final status: SUCCEEDED
2026-01-05 11:59:35,363 INFO: Waiting for log aggregation to finish.
2026-01-05 12:00:14,522 INFO: Execution finished successfully.
✓ Insert job completed
  Job details: (Job('xlk_raw_1_offline_fg_materialization', 'SPARK'), None)

Waiting for data to be committed (10 seconds)...

✓ Upload completed successfully
  Job finished w

In [8]:
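# Prepare VIX data: prefix every column except 'date' with 'vix_'.
# rename() maps by name, so it does not depend on 'date' being the first column.
vix_data_fg = vix_data.rename(
    columns={col: f'vix_{col}' for col in vix_data.columns if col != 'date'}
)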

print("Creating VIX feature group...")
vix_fg = create_feature_group(
    fs,
    name='vix_raw',
    df=vix_data_fg,
    primary_key=['date'],
    description='Raw CBOE Volatility Index (VIX) data from Yahoo Finance'
)
print(f"✓ Created feature group: vix_raw (version {vix_fg.version})")

Creating VIX feature group...

Creating feature group: vix_raw
Data shape (before deduplication): (1506, 6)
Data shape (after deduplication): (1506, 6)
Columns: ['date', 'vix_open', 'vix_high', 'vix_low', 'vix_close', 'vix_volume']
Data types:
date          datetime64[ms]
vix_open             float64
vix_high             float64
vix_low              float64
vix_close            float64
vix_volume             int64
dtype: object
✓ Feature group 'vix_raw' already exists (version 1)
  Feature group object type: <class 'hsfs.feature_group.FeatureGroup'>
  Deleting existing data and re-inserting...

Inserting 1506 rows...
Sample data (first row):
[{'date': Timestamp('2020-01-02 00:00:00'), 'vix_open': 13.460000038146973, 'vix_high': 13.720000267028809, 'vix_low': 12.420000076293945, 'vix_close': 12.470000267028809, 'vix_volume': 0}]


Uploading Dataframe: 100.00% |██████████| Rows 1506/1506 | Elapsed Time: 00:01 | Remaining Time: 00:00


Launching job: vix_raw_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1272010/jobs/named/vix_raw_1_offline_fg_materialization/executions
2026-01-05 12:00:40,882 INFO: Waiting for execution to finish. Current state: INITIALIZING. Final status: UNDEFINED
2026-01-05 12:00:44,083 INFO: Waiting for execution to finish. Current state: SUBMITTED. Final status: UNDEFINED
2026-01-05 12:00:47,280 INFO: Waiting for execution to finish. Current state: RUNNING. Final status: UNDEFINED
2026-01-05 12:03:05,158 INFO: Waiting for execution to finish. Current state: SUCCEEDING. Final status: UNDEFINED
2026-01-05 12:03:08,364 INFO: Waiting for execution to finish. Current state: AGGREGATING_LOGS. Final status: SUCCEEDED
2026-01-05 12:03:08,536 INFO: Waiting for log aggregation to finish.
2026-01-05 12:03:27,387 INFO: Execution finished successfully.
✓ Insert job completed
  Job details: (Job('vix_raw_1_offline_fg_materialization', '

## Summary

✅ Yahoo Finance data successfully uploaded to Hopsworks Feature Store:
- **qqq_raw**: QQQ ETF OHLCV data
- **xlk_raw**: XLK Technology Sector ETF OHLCV data
- **vix_raw**: CBOE Volatility Index (VIX) data

These raw feature groups will be used by:
- Notebook 4: Market feature engineering (technical indicators)
- Notebook 5: Macro feature engineering (trading calendar reference)
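
Downstream notebooks would typically start by reading these groups back. The following is a minimal sketch, assuming the Hopsworks Python API's `get_feature_group(...).read()` and version 1 as reported in the logs above. `load_raw_groups` is a hypothetical name, and the inner merge on `date` is one possible way to combine the three frames, not necessarily what notebooks 4 and 5 do.

```python
import pandas as pd


def load_raw_groups(
    fs,
    names=("qqq_raw", "xlk_raw", "vix_raw"),
    version: int = 1,
) -> pd.DataFrame:
    """Read each raw feature group and merge them on date.

    `fs` is a feature store handle such as the one returned by
    get_feature_store(). Hypothetical sketch only.
    """
    merged = None
    for name in names:
        df = fs.get_feature_group(name, version=version).read()
        # Keep only dates present in every group read so far.
        merged = df if merged is None else merged.merge(df, on="date", how="inner")
    return merged.sort_values("date").reset_index(drop=True)
```

Because the columns were prefixed per ticker before upload, the merged frame has no name collisions apart from the shared `date` key.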

**Next steps**:
- Run notebook 2 to backfill FRED macro data
- Run notebook 3 to backfill news sentiment data