# Homework Starter — Stage 04: Data Acquisition and Ingestion
Name: Student
Date: December 2024

## Objectives
- API ingestion with secrets in `.env`
- Scrape a permitted public table
- Validate and save raw data to `data/raw/`

In [2]:
import os, pathlib, datetime as dt
import requests
import pandas as pd
from bs4 import BeautifulSoup
from dotenv import load_dotenv

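# Raw outputs go under data/raw/; create the folder on first run.
RAW = pathlib.Path('data/raw')
RAW.mkdir(parents=True, exist_ok=True)

# Read secrets from .env and report only whether the key exists, never its value.
load_dotenv()
print('ALPHAVANTAGE_API_KEY loaded?', bool(os.getenv('ALPHAVANTAGE_API_KEY')))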

ALPHAVANTAGE_API_KEY loaded? True
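
The cell above only prints whether the key is present. When a silent fallback is not wanted, a stricter check can raise as soon as a secret is missing. The sketch below is one way to do that; `require_env` is a hypothetical helper for illustration and is not part of the starter.

```python
import os

def require_env(name: str) -> str:
    """Return environment variable `name`, or raise if it is unset or blank.

    Hypothetical helper for illustration; not part of the starter notebook.
    """
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env and call load_dotenv() first")
    return value

# Usage, after load_dotenv() has run:
# api_key = require_env("ALPHAVANTAGE_API_KEY")
```

Raising here keeps the failure next to its cause instead of surfacing later as a rejected request.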


## Helpers (use or modify)

In [3]:
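def ts():
    """Local-time timestamp used in output file names, e.g. 20250817-230500."""
    return dt.datetime.now().strftime('%Y%m%d-%H%M%S')

def save_csv(df: pd.DataFrame, prefix: str, **meta):
    """Write df to RAW as <prefix>_<key-value pairs>_<timestamp>.csv and return the path."""
    # Build the name from parts so an empty meta does not leave a double underscore.
    parts = [prefix] + [f"{k}-{v}" for k, v in meta.items()] + [ts()]
    path = RAW / ('_'.join(parts) + '.csv')
    df.to_csv(path, index=False)
    print('Saved', path)
    return path

def validate(df: pd.DataFrame, required):
    """Report missing required columns, the frame's shape, and the total NA count."""
    missing = [c for c in required if c not in df.columns]
    return {'missing': missing, 'shape': df.shape, 'na_total': int(df.isna().sum().sum())}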
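
`ts()` stamps files with local time, so files written on machines in different time zones may not sort consistently. A hedged variant is sketched below, assuming UTC stamps are acceptable for this project; `ts_utc` and `raw_name` are hypothetical names, not part of the starter.

```python
import datetime as dt

def ts_utc() -> str:
    """UTC timestamp in the same layout as ts(), with a trailing 'Z' marker."""
    return dt.datetime.now(dt.timezone.utc).strftime('%Y%m%d-%H%M%SZ')

def raw_name(prefix: str, **meta) -> str:
    """Build '<prefix>_<key-value pairs>_<timestamp>.csv', mirroring save_csv's scheme."""
    parts = [prefix] + [f"{k}-{v}" for k, v in meta.items()] + [ts_utc()]
    return '_'.join(parts) + '.csv'

print(raw_name('api', source='alpha', symbol='AAPL'))
```

The trailing `Z` makes it obvious at a glance which files carry UTC stamps and which carry local ones.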

## Part 1 — API Pull (Required)
Choose an endpoint (e.g., Alpha Vantage), or use the `yfinance` fallback.

In [8]:
SYMBOL = 'AAPL'
print(f"Ticker chosen: {SYMBOL}")

print("Loading API key from .env...")
USE_ALPHA = bool(os.getenv('ALPHAVANTAGE_API_KEY'))
print(f"Alpha Vantage API key found: {USE_ALPHA}")

if USE_ALPHA:
    print("Requesting data from Alpha Vantage API...")
    url = 'https://www.alphavantage.co/query'
    params = {
        'function': 'TIME_SERIES_DAILY',
        'symbol': SYMBOL,
        'apikey': os.getenv('ALPHAVANTAGE_API_KEY')
    }
    
    try:
        r = requests.get(url, params=params, timeout=30)
        r.raise_for_status()
        data = r.json()
        
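        time_series = data['Time Series (Daily)']
        # Transpose so each date is a row; the API's fields arrive as open, high, low, close, volume.
        df_api = pd.DataFrame(time_series).T.reset_index()
        df_api.columns = ['date', 'open', 'high', 'low', 'close', 'volume']
        # TIME_SERIES_DAILY returns the raw (unadjusted) close; it is named adj_close
        # here only to match the column name used by the yfinance fallback.
        df_api = df_api[['date', 'close']].rename(columns={'close': 'adj_close'})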
        
        print("Converting to DataFrame and parsing dtypes...")
        df_api['date'] = pd.to_datetime(df_api['date'])
        df_api['adj_close'] = pd.to_numeric(df_api['adj_close'])
        print("Alpha Vantage data successfully fetched!")
        
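    except Exception as exc:
        # Network errors, HTTP errors, and payloads without the expected key all
        # land here; report the cause before switching to the fallback.
        print(f"Alpha Vantage failed ({type(exc).__name__}: {exc}), using yfinance fallback...")
        USE_ALPHA = False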

if not USE_ALPHA:
    print("Requesting data from yfinance...")
    import yfinance as yf
    ticker = yf.Ticker(SYMBOL)
    df_api = ticker.history(period='3mo').reset_index()[['Date', 'Close']]
    df_api.columns = ['date', 'adj_close']
    df_api['date'] = pd.to_datetime(df_api['date'])
    print("yfinance data successfully fetched!")

print("\nValidating data (required columns, NA counts, shape)...")
v_api = validate(df_api, ['date', 'adj_close'])
print(f"Validation results: {v_api}")

print("\nSample of API data:")
print(df_api.head())
print("\nData types:")
print(df_api.dtypes)

v_api

Ticker chosen: AAPL
Loading API key from .env...
Alpha Vantage API key found: True
Requesting data from Alpha Vantage API...
Converting to DataFrame and parsing dtypes...
Alpha Vantage data successfully fetched!

Validating data (required columns, NA counts, shape)...
Validation results: {'missing': [], 'shape': (100, 2), 'na_total': 0}

Sample of API data:
        date  adj_close
0 2025-08-15     231.59
1 2025-08-14     232.78
2 2025-08-13     233.33
3 2025-08-12     229.65
4 2025-08-11     227.18

Data types:
date         datetime64[ns]
adj_close           float64
dtype: object


{'missing': [], 'shape': (100, 2), 'na_total': 0}

In [9]:
print("Saving raw CSV to data/raw/...")
source = 'alpha' if USE_ALPHA else 'yfinance'
saved_path = save_csv(df_api.sort_values('date'), prefix='api', source=source, symbol=SYMBOL)
print("API data saved successfully!")

Saving raw CSV to data/raw/...
Saved data\raw\api_source-alpha_symbol-AAPL_20250817-230500.csv
API data saved successfully!
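
CSV does not store dtypes, so the `datetime64[ns]` column validated earlier is not restored automatically when the file is read back. The sketch below shows the round trip on an in-memory buffer, using two rows copied from the sample output instead of the saved file.

```python
import io
import pandas as pd

# Two rows taken from the sample output above.
df = pd.DataFrame({
    'date': pd.to_datetime(['2025-08-14', '2025-08-15']),
    'adj_close': [232.78, 231.59],
})

# Write to an in-memory buffer instead of data/raw/.
buf = io.StringIO()
df.to_csv(buf, index=False)

buf.seek(0)
naive = pd.read_csv(buf)                          # 'date' is not parsed as a datetime
buf.seek(0)
parsed = pd.read_csv(buf, parse_dates=['date'])   # 'date' is parsed as a datetime

print(naive['date'].dtype)
print(parsed['date'].dtype)
```

For the real file, the equivalent call would be `pd.read_csv(saved_path, parse_dates=['date'])`.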


## Part 2 — Scrape a Public Table (Required)
Replace `SCRAPE_URL` with a permitted page containing a simple table.

In [10]:
print("Scraping S&P 500 companies table...")
SCRAPE_URL = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

print(f"Requesting data from: {SCRAPE_URL}")
resp = requests.get(SCRAPE_URL, headers=headers, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, 'html.parser')

print("Parsing HTML table...")
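table = soup.find('table', {'class': 'wikitable'})
if table is None:
    # Fail with a clear message if the page layout has changed.
    raise ValueError(f"No table with class 'wikitable' found at {SCRAPE_URL}")
rows = []
for tr in table.find_all('tr'):
    # Keep the cell text and drop footnote markers such as "[1]".
    row = [cell.get_text(strip=True).split('[')[0].strip() for cell in tr.find_all(['th', 'td'])]
    if row:
        rows.append(row)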

print("Converting to DataFrame...")
header = rows[0]
data = rows[1:11]  # keep only the first 10 companies
df_scrape = pd.DataFrame(data, columns=header)
df_scrape.columns = [col.replace('\n', ' ').strip() for col in df_scrape.columns]

print("Validating scraped data...")
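# Validate against an explicit list; passing the frame's own columns can never report a missing one.
v_scrape = validate(df_scrape, ['Symbol', 'Security'])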
print(f"Validation results: {v_scrape}")

print("\nSample of scraped data:")
print(df_scrape.head())
print("\nColumn names:")
print(list(df_scrape.columns))

v_scrape

Scraping S&P 500 companies table...
Requesting data from: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
Parsing HTML table...
Converting to DataFrame...
Validating scraped data...
Validation results: {'missing': [], 'shape': (10, 8), 'na_total': 0}

Sample of scraped data:
  Symbol             Security              GICSSector  \
0    MMM                   3M             Industrials   
1    AOS          A. O. Smith             Industrials   
2    ABT  Abbott Laboratories             Health Care   
3   ABBV               AbbVie             Health Care   
4    ACN            Accenture  Information Technology   

                GICS Sub-Industry    Headquarters Location  Date added  \
0        Industrial Conglomerates    Saint Paul, Minnesota  1957-03-04   
1               Building Products     Milwaukee, Wisconsin  2017-07-26   
2           Health Care Equipment  North Chicago, Illinois  1957-03-04   
3                   Biotechnology  North Chicago, Illinois  2012-12-31   
4

{'missing': [], 'shape': (10, 8), 'na_total': 0}
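
The header `GICSSector` in the output above has lost its space. One likely cause, assuming the header cell splits its label across nested tags, is that `get_text(strip=True)` strips each text fragment and then joins the fragments with an empty string. The sketch below reproduces the effect on a made-up header cell (the HTML is illustrative, not copied from the live page) and shows that passing a separator keeps the words apart.

```python
from bs4 import BeautifulSoup

# Illustrative header cell: the label is split between a link and plain text.
html = '<table><tr><th><a href="#">GICS</a> Sector</th></tr></table>'
cell = BeautifulSoup(html, 'html.parser').find('th')

joined = cell.get_text(strip=True)        # fragments joined with ''
spaced = cell.get_text(' ', strip=True)   # fragments joined with ' '

print(repr(joined))
print(repr(spaced))
```

Using `get_text(' ', strip=True)` in the scraping loop would change the column names, so any code that refers to `GICSSector` would need the same update.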

In [11]:
print("Saving scraped data to CSV...")
saved_path = save_csv(df_scrape, prefix='scrape', site='wikipedia', table='sp500')
print("Scrape data saved successfully!")

Saving scraped data to CSV...
Saved data\raw\scrape_site-wikipedia_table-sp500_20250817-230516.csv
Scrape data saved successfully!


## Documentation

### API Source: Alpha Vantage
- URL: `https://www.alphavantage.co/query`
- Function: `TIME_SERIES_DAILY`
- Symbol: `AAPL`
- API key: loaded from the `.env` file

### Scrape Source: Wikipedia S&P 500 Companies
- URL: `https://en.wikipedia.org/wiki/List_of_S%26P_500_companies`
- Table: the first `wikitable` on the page, which lists the companies
- Data: the first 10 companies only (`rows[1:11]`)

### Assumptions & Risks
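- Alpha Vantage rate limits may apply; if the request fails or the response lacks `Time Series (Daily)`, the code falls back to `yfinance`.
- The scraper takes the first `wikitable` on the page, so a change to the page structure could break it or select a different table.
- The API key lives in `.env`, which is not committed.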
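
Alpha Vantage can answer with HTTP 200 and a JSON body that carries a message instead of data, so `raise_for_status()` alone does not catch every failure. A minimal sketch of an explicit payload check follows. `extract_daily_series` is a hypothetical helper, and the message keys it looks for (`Error Message`, `Note`, `Information`) are an assumption about the API's responses that should be checked against the current documentation.

```python
def extract_daily_series(payload: dict) -> dict:
    """Return the daily series from a parsed response, or raise with the API's message.

    Hypothetical helper; the message keys below are assumed, not confirmed.
    """
    key = 'Time Series (Daily)'
    if key in payload:
        return payload[key]
    # Surface the API's own explanation when one is present.
    for msg_key in ('Error Message', 'Note', 'Information'):
        if msg_key in payload:
            raise RuntimeError(f"Alpha Vantage returned {msg_key!r}: {payload[msg_key]}")
    raise RuntimeError(f"Unexpected response keys: {sorted(payload)}")

# Usage inside the API cell, after data = r.json():
# time_series = extract_daily_series(data)
```

Raising with the API's message makes a rate-limit response distinguishable from a network error in the fallback log line.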