# Stage 04: Financial Data Acquisition

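This notebook demonstrates how to acquire historical stock data from financial APIs for the Portfolio Risk Management System. All runs below use yfinance; Alpha Vantage is supported by the `utils` helpers but is not exercised here.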

## Objectives
- Set up API connections (Alpha Vantage, yfinance)
- Fetch historical stock data
- Implement error handling and rate limiting
- Save data with proper structure
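
The error-handling and rate-limiting objective can be illustrated with a small standalone sketch. It is not the code in `utils`, whose internals are not shown in this notebook; `MinIntervalLimiter` and `call_with_retry` are hypothetical names, and the retry policy (exponential backoff on any exception) is an assumption chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class MinIntervalLimiter:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last_call: Optional[float] = None  # monotonic time of previous call

    def wait(self) -> None:
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def call_with_retry(func: Callable[[], T], limiter: MinIntervalLimiter,
                    max_attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call `func`, retrying with exponential backoff when it raises."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        limiter.wait()  # respect the minimum spacing between requests
        try:
            return func()
        except Exception as exc:  # hypothetical policy: retry on any error
            last_error = exc
            if attempt < max_attempts - 1:
                # Delays grow as base_delay * 1, 2, 4, ...
                time.sleep(base_delay * 2 ** attempt)
    assert last_error is not None
    raise last_error
```

A caller would wrap each request in a zero-argument function, for example `call_with_retry(lambda: utils.fetch_yfinance("AAPL", period="1mo"), limiter)`, and share one limiter across all requests to the same API.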

In [1]:
import sys
import os
sys.path.append('../src')  # make the project's utils module importable

import pandas as pd
import numpy as np
from datetime import datetime
import utils
from dotenv import load_dotenv

# Load environment variables
load_dotenv('../.env')

print("📊 Financial Data Acquisition Setup Complete")

📊 Financial Data Acquisition Setup Complete
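
Once `load_dotenv` has run, any keys in `../.env` are available through the process environment. The sketch below shows one way to read a key and log it safely. The variable name `ALPHAVANTAGE_API_KEY` and both helper functions are assumptions for illustration; the notebook does not show which name the project actually uses.

```python
import os
from typing import Optional


def get_api_key(var_name: str = "ALPHAVANTAGE_API_KEY") -> Optional[str]:
    """Return the key stored in `var_name`, or None if it is unset or blank."""
    value = os.getenv(var_name, "").strip()
    return value or None


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters so the key can be logged."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


key = get_api_key()
if key is None:
    print("No Alpha Vantage key found; only keyless sources can be used")
else:
    print(f"Alpha Vantage key loaded: {mask(key)}")
```

Returning `None` for a missing key lets the caller decide whether to fall back to a keyless source such as yfinance instead of failing at import time.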


## 1. Test API Connections

In [2]:
# Test yfinance connection (no API key required)
test_symbol = "AAPL"
print(f"Testing yfinance connection with {test_symbol}...")

df_test = utils.fetch_yfinance(test_symbol, period="1mo")
if not df_test.empty:
    print(f"✅ yfinance working: {len(df_test)} records fetched")
    print(df_test.head())
else:
    print("❌ yfinance test failed")

Testing yfinance connection with AAPL...
📈 Fetching AAPL data (period=1mo, interval=1d)
✅ Fetched 23 records for AAPL
✅ yfinance working: 23 records fetched
                       date        open        high         low       close  \
0 2025-07-23 00:00:00-04:00  214.756269  214.906093  212.169209  213.907227   
1 2025-07-24 00:00:00-04:00  213.657510  215.445490  213.287935  213.517670   
2 2025-07-25 00:00:00-04:00  214.456605  214.996002  213.158076  213.637543   
3 2025-07-28 00:00:00-04:00  213.787376  214.606454  212.818475  213.807358   
4 2025-07-29 00:00:00-04:00  213.937192  214.566483  210.581016  211.030502   

     volume  dividends  stock_splits symbol            fetch_timestamp  \
0  46989300        0.0           0.0   AAPL 2025-08-24 17:35:04.536761   
1  46022600        0.0           0.0   AAPL 2025-08-24 17:35:04.536761   
2  40268800        0.0           0.0   AAPL 2025-08-24 17:35:04.536761   
3  37858000        0.0           0.0   AAPL 2025-08-24 17:35:04.536761  

## 2. Fetch Multiple Stock Data

In [3]:
# Define portfolio symbols for risk analysis
portfolio_symbols = ["AAPL", "MSFT", "GOOGL", "AMZN", "TSLA"]

print(f"Fetching data for portfolio: {portfolio_symbols}")

# Fetch data with yfinance (no API key required)
portfolio_data = utils.fetch_multiple_stocks(
    symbols=portfolio_symbols,
    prefer_alphavantage=False,  # skip Alpha Vantage and go straight to yfinance
    period="6mo"
)

if not portfolio_data.empty:
    print("\n✅ Portfolio data fetched successfully")
    print(f"Shape: {portfolio_data.shape}")
    print(f"Date range: {portfolio_data['date'].min()} to {portfolio_data['date'].max()}")
    print(f"Symbols: {portfolio_data['symbol'].unique()}")
else:
    print("❌ Failed to fetch portfolio data")

Fetching data for portfolio: ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA']
📈 Using yfinance fallback for AAPL
📈 Fetching AAPL data (period=6mo, interval=1d)
✅ Fetched 126 records for AAPL
📈 Using yfinance fallback for MSFT
📈 Fetching MSFT data (period=6mo, interval=1d)
✅ Fetched 126 records for MSFT
📈 Using yfinance fallback for GOOGL
📈 Fetching GOOGL data (period=6mo, interval=1d)
✅ Fetched 126 records for GOOGL
📈 Using yfinance fallback for AMZN
📈 Fetching AMZN data (period=6mo, interval=1d)
✅ Fetched 126 records for AMZN
📈 Using yfinance fallback for TSLA
📈 Fetching TSLA data (period=6mo, interval=1d)
✅ Fetched 126 records for TSLA
✅ Combined data for 5 symbols: (630, 11)
📊 Data sources used: {'yfinance': np.int64(630)}

✅ Portfolio data fetched successfully
Shape: (630, 11)
Date range: 2025-02-24 00:00:00-05:00 to 2025-08-22 00:00:00-04:00
Symbols: ['AAPL' 'MSFT' 'GOOGL' 'AMZN' 'TSLA']
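
The log above shows each symbol being routed to a source and each row being tagged with that source in a `data_source` column. A minimal sketch of that fallback pattern follows. It is not the implementation of `utils.fetch_multiple_stocks`; the function names, the list-of-dicts row format, and the stand-in fetchers are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, object]
Fetcher = Callable[[str], List[Row]]


def fetch_with_fallback(symbol: str,
                        sources: Sequence[Tuple[str, Fetcher]]) -> List[Row]:
    """Try each (name, fetcher) in order; return rows from the first with data."""
    for name, fetcher in sources:
        try:
            rows = fetcher(symbol)
        except Exception as exc:
            print(f"{name} failed for {symbol}: {exc}")
            continue
        if rows:
            # Tag every row with its symbol and the source that produced it
            return [{**row, "symbol": symbol, "data_source": name} for row in rows]
    return []  # every source failed or returned nothing


def fetch_many(symbols: Sequence[str],
               sources: Sequence[Tuple[str, Fetcher]]) -> List[Row]:
    """Fetch each symbol in turn and concatenate the tagged rows."""
    combined: List[Row] = []
    for symbol in symbols:
        combined.extend(fetch_with_fallback(symbol, sources))
    return combined


# Stand-in fetchers (hypothetical): the first always fails, the second succeeds
def broken_source(symbol: str) -> List[Row]:
    raise ConnectionError("rate limit reached")


def working_source(symbol: str) -> List[Row]:
    return [{"date": "2025-08-22", "close": 100.0}]


rows = fetch_many(["AAPL", "MSFT"],
                  [("primary", broken_source), ("secondary", working_source)])
```

Keeping the source name on every row makes it possible to report, as the log does, how many rows came from each provider.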


## 3. Data Quality Assessment

In [4]:
if not portfolio_data.empty:
    # Generate data quality report
    quality_report = utils.data_quality_report(portfolio_data)
    
    print("📊 Data Quality Report:")
    print(f"Shape: {quality_report['shape']}")
    print(f"Memory usage: {quality_report['memory_usage_mb']:.2f} MB")
    print(f"Duplicate rows: {quality_report['duplicate_rows']}")
    
    print("\nMissing data:")
    for col, count in quality_report['missing_data'].items():
        if count > 0:
            print(f"  {col}: {count} missing values")
    
    # Validate required columns
    required_cols = ['date', 'open', 'high', 'low', 'close', 'volume', 'symbol']
    try:
        utils.validate_dataframe(portfolio_data, required_cols=required_cols)
    except Exception as e:
        print(f"❌ Validation failed: {e}")

📊 Data Quality Report:
Shape: (630, 11)
Memory usage: 0.11 MB
Duplicate rows: 0

Missing data:

🔍 Validating DataFrame...
✅ Validation passed!
   Shape: (630, 11)
   Columns: ['date', 'open', 'high', 'low', 'close', 'volume', 'dividends', 'stock_splits', 'symbol', 'fetch_timestamp', 'data_source']
✅ No missing data found
   Memory usage: 0.11 MB
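
The report fields printed above (shape, memory usage, duplicate rows, missing values per column) can all be derived from standard pandas calls. The sketch below is an assumption about how such a report might be built, not the code behind `utils.data_quality_report` or `utils.validate_dataframe`; `quality_summary` and `check_required_columns` are hypothetical names.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> dict:
    """Collect basic size, duplicate, and missing-value statistics."""
    return {
        "shape": df.shape,
        "memory_usage_mb": df.memory_usage(deep=True).sum() / 1024 ** 2,
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_data": df.isna().sum().to_dict(),
    }


def check_required_columns(df: pd.DataFrame, required_cols: list) -> None:
    """Raise ValueError if any required column is absent."""
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")


# Tiny illustrative frame: one duplicated row and one missing close price
sample = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "MSFT"],
    "close": [1.0, 1.0, None],
})
summary = quality_summary(sample)
```

`memory_usage(deep=True)` is used so that string columns such as `symbol` are measured by their actual contents.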


## 4. Save Raw Data

In [5]:
if not portfolio_data.empty:
    # Save raw portfolio data
    saved_path = utils.save_with_timestamp(
        df=portfolio_data,
        prefix="portfolio_raw",
        source="financial_apis",
        ext="csv"
    )
    
    print(f"💾 Raw data saved to: {saved_path}")
    
    # Also save as JSON for backup
    json_path = utils.save_with_timestamp(
        df=portfolio_data,
        prefix="portfolio_raw",
        source="financial_apis",
        ext="json"
    )
    
    print(f"💾 Backup JSON saved to: {json_path}")

💾 Saved 630 rows to ./data/financial_apis/portfolio_raw_20250824_173509.csv
💾 Raw data saved to: ./data/financial_apis/portfolio_raw_20250824_173509.csv
💾 Saved 630 rows to ./data/financial_apis/portfolio_raw_20250824_173509.json
💾 Backup JSON saved to: ./data/financial_apis/portfolio_raw_20250824_173509.json
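
The saved file names follow the pattern `<prefix>_<YYYYMMDD_HHMMSS>.<ext>` inside a folder named after the source. A minimal sketch of building such a path is shown below. `timestamped_path` is a hypothetical helper, and the `./data` base directory is inferred from the output above rather than from the source of `utils.save_with_timestamp`.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def timestamped_path(prefix: str, source: str, ext: str,
                     base_dir: str = "./data",
                     now: Optional[datetime] = None) -> Path:
    """Build <base_dir>/<source>/<prefix>_<YYYYMMDD_HHMMSS>.<ext>."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(base_dir) / source / f"{prefix}_{stamp}.{ext}"


path = timestamped_path("portfolio_raw", "financial_apis", "csv",
                        now=datetime(2025, 8, 24, 17, 35, 9))
# The caller would create the folder before writing:
# path.parent.mkdir(parents=True, exist_ok=True)
```

Passing `now` explicitly makes it easy to give the CSV and its JSON backup the same timestamp, and to test the naming logic without depending on the clock.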


## 5. Summary and Next Steps

In [6]:
print("\n🎯 Stage 04 Summary:")
print("✅ API connections tested")
print("✅ Portfolio data acquired")
print("✅ Data quality assessed")
print("✅ Raw data saved with timestamps")

print("\n📋 Next Steps:")
print("- Stage 05: Set up data storage infrastructure")
print("- Stage 06: Create preprocessing pipeline")
print("- Stage 07: Build risk analysis models")


🎯 Stage 04 Summary:
✅ API connections tested
✅ Portfolio data acquired
✅ Data quality assessed
✅ Raw data saved with timestamps

📋 Next Steps:
- Stage 05: Set up data storage infrastructure
- Stage 06: Create preprocessing pipeline
- Stage 07: Build risk analysis models
