# Stage 04: Data Acquisition and Ingestion

- Name: Will Wu
- Date: Aug 20, 2025

## Objectives
- API ingestion with secrets in `.env`
- Scrape a permitted public table
- Validate and save raw data to `data/raw/`

In [2]:
import os, pathlib, datetime as dt
import requests
import pandas as pd
from bs4 import BeautifulSoup
from dotenv import load_dotenv

RAW = pathlib.Path('data/raw'); RAW.mkdir(parents=True, exist_ok=True)
load_dotenv(); print('ALPHAVANTAGE_API_KEY loaded?', bool(os.getenv('ALPHAVANTAGE_API_KEY')))

ALPHAVANTAGE_API_KEY loaded? True


## Helpers (use or modify)

In [3]:
def ts():
    return dt.datetime.now().strftime('%Y%m%d-%H%M%S')

def save_csv(df: pd.DataFrame, prefix: str, **meta):
    mid = '_'.join([f"{k}-{v}" for k,v in meta.items()])
    path = RAW / f"{prefix}_{mid}_{ts()}.csv"
    df.to_csv(path, index=False)
    print('Saved', path)
    return path

def validate(df: pd.DataFrame, required):
    missing = [c for c in required if c not in df.columns]
    return {'missing': missing, 'shape': df.shape, 'na_total': int(df.isna().sum().sum())}
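
`validate` reports problems but never stops execution. If a hard stop is wanted when a required column is absent, a small companion check could look like the sketch below. The name `require_columns` is hypothetical and not part of the original helpers; it accepts any iterable of column names, such as `df.columns`.

```python
def require_columns(columns, required):
    """Raise ValueError if any required column name is absent (hypothetical helper)."""
    present = set(columns)
    missing = [c for c in required if c not in present]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

# Passes silently when every required name is present.
require_columns(['date', 'adj_close'], ['date', 'adj_close'])

# Raises when a name is absent.
try:
    require_columns(['date'], ['date', 'adj_close'])
except ValueError as err:
    print(err)  # missing required columns: ['adj_close']
```

In the notebook this could be called as `require_columns(df_api.columns, ['date', 'adj_close'])` right after `validate`.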

## Part 1 - API Pull
Choose an endpoint (e.g., Alpha Vantage), or use the `yfinance` fallback.

In [13]:
import sys
print(sys.executable)

!{sys.executable} -m pip install yfinance

import yfinance as yf
print("yfinance version:", yf.__version__)

/Users/willwu/anaconda3/bin/python
Collecting yfinance
  Downloading yfinance-0.2.65-py2.py3-none-any.whl (119 kB)
Collecting peewee>=3.16.2
  Downloading peewee-3.18.2.tar.gz (949 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting protobuf>=3.19.0
  Downloading protobuf-6.32.0-cp39-abi3-macosx_10_9_universal2.whl (426 kB)
Collecting frozendict>=2.3.4
  Downloading frozendict-2.4.6-cp310-cp310-macosx_11_0_arm64.whl (37 kB)
Collecting requ

In [15]:
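SYMBOL = 'AAPL'
# USE_ALPHA was not defined in this cell; False is assumed here because the
# recorded output below comes from yfinance. Set to True to use Alpha Vantage.
USE_ALPHA = False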

if USE_ALPHA:
    url = 'https://www.alphavantage.co/query'
    params = {'function':'TIME_SERIES_DAILY_ADJUSTED','symbol':SYMBOL,'outputsize':'compact','apikey':os.getenv('ALPHAVANTAGE_API_KEY')}
    r = requests.get(url, params = params, timeout = 30)
    r.raise_for_status()
    js = r.json()
    key = [k for k in js if 'Time Series' in k][0]
    df_api = pd.DataFrame(js[key]).T.reset_index().rename(columns = {'index':'date','5. adjusted close':'adj_close'})[['date','adj_close']]
    df_api['date'] = pd.to_datetime(df_api['date']); df_api['adj_close'] = pd.to_numeric(df_api['adj_close'])
else:
    import yfinance as yf
    df_api = yf.download(SYMBOL, period = '3mo', interval = '1d').reset_index()[['Date','Close']]
    df_api.columns = ['date','adj_close']
    
v_api = validate(df_api, ['date','adj_close']); v_api

  df_api = yf.download(SYMBOL, period = '3mo', interval = '1d').reset_index()[['Date','Close']]
[*********************100%***********************]  1 of 1 completed


{'missing': [], 'shape': (64, 2), 'na_total': 0}
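
In the Alpha Vantage branch, the expression `[k for k in js if 'Time Series' in k][0]` raises a bare `IndexError` whenever the JSON body has no time-series key, which can happen when a request is rejected. A minimal sketch of a more descriptive lookup follows. The function name `time_series_key` and the sample payloads are illustrative assumptions, not part of the original notebook.

```python
def time_series_key(payload: dict) -> str:
    """Return the first key containing 'Time Series', or raise with the keys actually seen."""
    keys = [k for k in payload if 'Time Series' in k]
    if not keys:
        raise ValueError(f"no 'Time Series' key in response; got keys: {sorted(payload)}")
    return keys[0]

# Illustrative payloads (shapes assumed for the example only).
ok_payload = {'Meta Data': {}, 'Time Series (Daily)': {}}
print(time_series_key(ok_payload))  # Time Series (Daily)

try:
    time_series_key({'Note': 'request not served'})
except ValueError as err:
    print(err)  # no 'Time Series' key in response; got keys: ['Note']
```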

In [16]:
_ = save_csv(df_api.sort_values('date'), prefix = 'api', source = 'alpha' if USE_ALPHA else 'yfinance', symbol = SYMBOL)

Saved data/raw/api_source-yfinance_symbol-AAPL_20250820-142648.csv


## Part 2 - Scrape a Public Table
Replace `SCRAPE_URL` with a permitted page containing a simple table.

In [18]:
SCRAPE_URL = 'https://markets.businessinsider.com/index/components/s&p_500'
headers = {'User-Agent':'AFE-Homework/1.0'}

try:
    resp = requests.get(SCRAPE_URL, headers = headers, timeout = 30); resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    rows = [[c.get_text(strip = True) for c in tr.find_all(['th','td'])] for tr in soup.find_all('tr')]
    header, *data = [r for r in rows if r]
    df_scrape = pd.DataFrame(data, columns=header)
except Exception as e:
    print('Scrape failed, using inline demo table:', e)
    html = '<table><tr><th>Ticker</th><th>Price</th></tr><tr><td>AAA</td><td>101.2</td></tr></table>'
    soup = BeautifulSoup(html, 'html.parser')
    rows = [[c.get_text(strip = True) for c in tr.find_all(['th','td'])] for tr in soup.find_all('tr')]
    header, *data = [r for r in rows if r]
    df_scrape = pd.DataFrame(data, columns=header)

if 'Price' in df_scrape.columns:
    df_scrape['Price'] = pd.to_numeric(df_scrape['Price'], errors = 'coerce')
v_scrape = validate(df_scrape, list(df_scrape.columns)); v_scrape

{'missing': [], 'shape': (46, 8), 'na_total': 0}

In [19]:
_ = save_csv(df_scrape, prefix = 'scrape', site = 'example', table = 'markets')

Saved data/raw/scrape_site-example_table-markets_20250820-152141.csv


## Part 3 - Documentation

### API Source

- URL: none (data is pulled through the `yfinance` library rather than a direct HTTP request)
- Endpoint: `yfinance.download`
- Ticker: AAPL
- Parameters: period: 3mo; interval: 1d

### Scrape Source

- URL: https://markets.businessinsider.com/index/components/s&p_500
- Table description: Name, Latest Price (Previous Close), Low High, +/-%, Time Date, 3 Mo. +/-%, 6 Mo. +/-%, 1 Year +/-%.

### Assumptions & Risks
- The API response or page HTML may change, including column names and table structure. The notebook's lightweight validation reports missing columns and NA counts so that such changes surface early.
- Both sources return historical data; parameters will need adjusting for future tasks.
- For brevity, the notebook sets request timeouts but has no retry or backoff logic. Future tasks may need retries and stronger parsing/validation.
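
The retries mentioned above could take roughly the following shape. This is a sketch using only the standard library, not code from the notebook: `with_retries` and the `flaky` stand-in are hypothetical names, and the exponential delay schedule is one common choice among several.

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait base_delay * 2**i and try again, up to `attempts` tries."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** i)

# Stand-in for a flaky request: fails twice, then succeeds.
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('temporary failure')
    return 'ok'

# Record the delays instead of actually sleeping, to keep the demo fast.
delays = []
result = with_retries(flaky, attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

In the notebook, `fn` could be a lambda wrapping the existing call, for example `lambda: requests.get(url, params=params, timeout=30)`, followed by `raise_for_status()` on the result.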

### Confirm `.env` is not committed

In [24]:
from pathlib import Path
env_path = Path('/Users/willwu/Desktop/NYU Tandon/FRE-GY/5040 Foundations of Applied Financial Engineering/bootcamp_will_wu/homework/stage04/.env')
env_path.exists()

True
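
The cell above confirms that `.env` exists on disk; it does not by itself show that Git ignores the file. One rough complementary check is to look for `.env` among the patterns in `.gitignore`. The sketch below assumes a simple exact-name match and does not implement full gitignore pattern semantics; `listed_in_gitignore` is a hypothetical helper.

```python
def listed_in_gitignore(gitignore_text: str, name: str = '.env') -> bool:
    """Rough check: does any non-comment line of a .gitignore name the file exactly?"""
    patterns = {
        line.strip()
        for line in gitignore_text.splitlines()
        if line.strip() and not line.lstrip().startswith('#')
    }
    return name in patterns or f'/{name}' in patterns

# Sample .gitignore contents, for illustration only.
sample = "# secrets\n.env\n__pycache__/\n"
print(listed_in_gitignore(sample))             # True
print(listed_in_gitignore("__pycache__/\n"))   # False
```

Against a real repository, the text could be read with `pathlib.Path('.gitignore').read_text()`. Running `git check-ignore .env` or `git ls-files .env` from the repository root gives a more authoritative answer.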