# 01 — Data Collection

In this notebook we demonstrate how to collect historical data for any ticker (**VIX**, **S&P 500**, **VIX Futures ETF**) from Yahoo Finance using Playwright and BeautifulSoup.

**Outputs:**
- Format: comma-separated files (csv)
- `data/raw/vix_ytd_ohlc.csv`
- `data/raw/sp500_ytd_ohlcv.csv`
- `data/raw/vxx_ytd_ohlcv.csv`

**Notes:**
- For educational use only (respect Yahoo's ToS).
- Dates are in UTC; frequency is daily.
- The scraping code (three script files) lives in `/scripts/`.

**Data collected**
1. VIX: daily open, high, low, close
2. S&P 500: daily open, high, low, close, volume
3. VXX (VIX Futures ETF): daily open, high, low, close, volume

**Default window**: year-to-date (YTD), i.e. Jan 02 2025 - Nov 06 2025 at the time of writing.

Note that markets are closed on Jan 01.

**Scraping runs as separate .py files (and not in the notebook)**

The scrapers use Playwright to drive a real browser and parse Yahoo’s Historical Data table.
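
As a rough illustration of that flow, here is a minimal sketch rather than the code in `/scripts/`. The function names `fetch_history_rows` and `parse_table_rows` are hypothetical, and the sketch assumes the historical data is rendered as the first `<table>` on the page. It is meant to be run as a plain Python script, not in a notebook cell.

```python
def parse_table_rows(html):
    """Return the rows of the first <table> in `html` as lists of cell text."""
    # Imported lazily so this module can be loaded without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # assumption: the first table holds the history
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def fetch_history_rows(url, timeout_ms=20000):
    """Open `url` in headless Chromium and return the parsed table rows."""
    # Imported lazily so this module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.set_default_timeout(timeout_ms)  # applies to goto and waits
            page.goto(url)
            page.wait_for_selector("table")  # block until a table is rendered
            html = page.content()
        finally:
            browser.close()
    return parse_table_rows(html)
```

The first returned row would normally be the header row; consent banners or pop-ups are not handled in this sketch.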

Playwright has two APIs: a synchronous API and an asynchronous API.
Jupyter notebooks already run an event loop under the hood. Mixing that loop with Playwright’s sync API can trigger errors like:
“It looks like you are using Playwright Sync API inside the asyncio loop…”

While there are workarounds (e.g., wrapping sync calls with `asyncio.to_thread(...)` or fully rewriting the scraper to async and using await), the simplest approach (which I took) is to:

1. keep the scraper as a normal Python script, and

2. run it from the terminal or open the files and run them individually, then

3. load the produced CSVs in other notebooks.
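
If you would still like to trigger a scraper from a notebook cell, one option is to launch the script in a child process, which has its own interpreter and therefore no clash with the notebook's event loop. The helper below is a hypothetical sketch (`run_scraper` is not part of the repository), not the approach used here.

```python
import subprocess
import sys
from pathlib import Path


def run_scraper(script_path, expected_csv=None):
    """Run a scraper script in a separate Python process and return its stdout."""
    result = subprocess.run(
        [sys.executable, str(script_path)],  # same interpreter as the caller
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"{script_path} failed:\n{result.stderr}")
    # Optionally confirm that the script produced the file we expect to load.
    if expected_csv is not None and not Path(expected_csv).exists():
        raise FileNotFoundError(f"{expected_csv} was not created by {script_path}")
    return result.stdout
```

For example, `run_scraper("scripts/vix.py", expected_csv="data/raw/vix_ytd_ohlc.csv")` would work if the script writes its output relative to the current working directory, which is an assumption.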

**Customization**

VIX (`scripts/vix.py`) currently uses YTD automatically via:

In [19]:
from datetime import datetime, timezone
def ytd_epochs():
    now = datetime.now(timezone.utc)
    jan1 = datetime(now.year, 1, 1, tzinfo=timezone.utc)
    return int(jan1.timestamp()), int(now.timestamp())

and S&P 500 (`scripts/gspc.py`) currently uses a fixed link with period1/period2 embedded (VXX is similar):

In [20]:
def sp500_url():
    # Fixed period1/period2 values copied from a Yahoo Finance history URL
    return (
        "https://finance.yahoo.com/quote/%5EGSPC/history/"
        "?period1=1735746040&period2=1762443547"
        "&interval=1d&filter=history&frequency=1d"
    )
# Ticker symbol: ^GSPC
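
To see which window a fixed link covers, the embedded `period1`/`period2` values can be decoded back into UTC datetimes. This is a small sketch; `epoch_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def epoch_to_utc(ts):
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


start = epoch_to_utc(1735746040)  # period1 from the link above
end = epoch_to_utc(1762443547)    # period2 from the link above
print(start.isoformat())  # 2025-01-01T15:40:40+00:00
print(end.isoformat())    # 2025-11-06T15:39:07+00:00
```

The window starts on Jan 01 2025, but since markets are closed that day, the first data row is Jan 02.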

To change the window, modify `history_url_for_vix()` or `sp500_url()` to compute your own period1/period2 Unix timestamps (UTC) and swap them in the URL:

In [21]:
def history_url_for_vix():
    p1, p2 = ytd_epochs()
    # Daily history. (frequency=1d + interval=1d are both used by Yahoo)
    return (
        "https://finance.yahoo.com/quote/%5EVIX/history/"
        f"?period1={p1}&period2={p2}&interval=1d&filter=history&frequency=1d"
    )

# Example: Jan 1, 2024 to Nov 6, 2025 (UTC)
p1 = int(datetime(2024,1,1,tzinfo=timezone.utc).timestamp())
p2 = int(datetime(2025,11,6,tzinfo=timezone.utc).timestamp())
url = f"...period1={p1}&period2={p2}&interval=1d&filter=history&frequency=1d"

**To change the frequency (daily/weekly/monthly)**

Both URLs accept `interval=`:

- `interval=1d` → daily
- `interval=1wk` → weekly
- `interval=1mo` → monthly

Replace the `interval` value in the URL builder inside the script and re-run.
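
If you prefer not to edit the string by hand, the interval can be swapped programmatically. The sketch below uses only the standard library; `with_interval` is a hypothetical helper, and it changes only `interval`, leaving `frequency` untouched (whether Yahoo needs the two to match is not verified here).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

ALLOWED_INTERVALS = {"1d", "1wk", "1mo"}  # the three values listed above


def with_interval(url, interval):
    """Return `url` with its `interval` query parameter set to `interval`."""
    if interval not in ALLOWED_INTERVALS:
        raise ValueError(f"unsupported interval: {interval!r}")
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))  # keeps the original parameter order
    params["interval"] = interval
    return urlunsplit(parts._replace(query=urlencode(params)))


daily = (
    "https://finance.yahoo.com/quote/%5EGSPC/history/"
    "?period1=1735746040&period2=1762443547"
    "&interval=1d&filter=history&frequency=1d"
)
weekly = with_interval(daily, "1wk")
```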

**To change stock/ticker**

To obtain data for any stock/ticker, browse to its historical data page on Yahoo Finance and copy the URL into `sp500_url()` in the S&P 500 script.

You can also try changing only the ticker symbol in the URL, though this may not always work.
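
When changing the symbol in the URL, note that special characters must be percent-encoded (`^GSPC` appears as `%5EGSPC` in the links above). A minimal sketch, assuming the same URL layout as the S&P 500 link; `history_url` is a hypothetical helper name:

```python
from urllib.parse import quote


def history_url(symbol, period1, period2):
    """Build a daily-history URL for `symbol`, percent-encoding the ticker."""
    encoded = quote(symbol, safe="")  # "^" becomes "%5E"
    return (
        f"https://finance.yahoo.com/quote/{encoded}/history/"
        f"?period1={period1}&period2={period2}"
        "&interval=1d&filter=history&frequency=1d"
    )


url = history_url("^GSPC", 1735746040, 1762443547)
```

As noted above, a URL built this way may not work for every ticker, so check it in a browser first.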

The other scripts inside `/scripts` provide the additional functionality listed below. They were used in an earlier project that required a much larger amount of data, and they write their csv files to the current directory.

1. `scrape_oneStock`: Outputs weekly adj. close data for one stock/ticker symbol over the previous 2 years (current default: `AAPL`).

2. `scrape_allStocks`: Outputs weekly adj. close data for any list of ticker symbols over the previous 2 years into one csv file. Requires a list `sp500_tickers.csv` saved in the current directory. Makes at most 3 attempts per ticker symbol and saves the list of failed tickers as `failed_tickers.csv`.

3. `scrape_S&P_ordered`: Outputs the list of individual ticker symbols in the S&P 500, ordered by market cap (descending). It scrapes *Slickcharts*, a website that maintains the list. S&P Global used to provide a list but does not do so at present (Nov 05 2025). This script does not use Playwright; instead it manually mimics a Mozilla browser.

4. `combine_CSVs`: Combines two csv files into one (I used this routine to combine my initial scrape attempt and the second scrape on the failed ticker list to get one dataframe).
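
The retry behaviour described for `scrape_allStocks` can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: `scrape_with_retries`, `save_failed`, the `fetch` callable, and the `ticker` column header are all hypothetical.

```python
import csv
import time


def scrape_with_retries(tickers, fetch, max_attempts=3, delay_s=0.0):
    """Call `fetch(ticker)` up to `max_attempts` times per ticker.

    Returns (results, failed): a dict of successful results and a list of
    tickers that failed on every attempt.
    """
    results, failed = {}, []
    for ticker in tickers:
        for attempt in range(1, max_attempts + 1):
            try:
                results[ticker] = fetch(ticker)
                break  # success, move on to the next ticker
            except Exception:
                if attempt == max_attempts:
                    failed.append(ticker)
                else:
                    time.sleep(delay_s)  # brief pause before retrying
    return results, failed


def save_failed(failed, path="failed_tickers.csv"):
    """Write the failed tickers to a one-column csv file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ticker"])  # assumed header name
        writer.writerows([t] for t in failed)
```

A second pass can then be run on the failed list and the two outputs merged, which is what `combine_CSVs` was used for.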

Each of these routines can be customized to get more data (e.g. more than just adj. close for stocks, or daily data), or to scrape data from different sources/websites.


**Troubleshooting notes**

Make sure Playwright (with Chromium) is installed; see `environment.yml`.

A fast, stable internet connection is essential when scraping a large amount of data from Yahoo Finance (daily data for 500 stocks over two years took me about 30-60 minutes).

Timeout while waiting for the table: network hiccups or consent banners can delay the table. These routines do not handle individual pop-ups (e.g. cookie prompts). If the table is merely slow to load, you can increase the wait by inserting:

In [23]:
# page.set_default_timeout(20000)  # 20 seconds