# Publicly available trading data for Representatives and Senators

This data is made available because of the 2012 STOCK Act.

## Data for each group can be found here:
- Reps: https://disclosures-clerk.house.gov/FinancialDisclosure
- Senators: https://efdsearch.senate.gov/search/home/

# Data from the House of Representatives
We'll start with data from the House of Representatives since they are the larger of the two groups and most of the people whose trading data we care about are Representatives.

The Clerk of the House provides the data as zip files. The data in these zip files serves as an index to all of the original PDFs that each member must submit.

The zip files from the public disclosure website come in either text or xml format.
Their schema is the following:
- Prefix - the title of the person, nullable
- Last - the Representative's last name
- First - the Representative's first name
- Suffix - the Representative's suffix, nullable
- FilingType - one of C, D, P, W, X (more info below)
- StateDst - the state and district the person is representing
- Year - the year of the filing
- FilingDate - the date of the filing
- DocID - the internal id of the document, used for downloading the original pdf
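
As a rough illustration of this schema, the sketch below builds a tiny in-memory zip containing a tab-separated index file and reads it back with the standard library. The sample row, the date format in it, and the `read_index` helper are all hypothetical. It also assumes the text file has a header row matching the field names above, with tabs as the separator (the same assumption the parsing code later in this notebook makes).

```python
import csv
from io import BytesIO, StringIO
from zipfile import ZipFile

# Hypothetical sample data; real index files contain one row per filing.
SAMPLE_INDEX = (
    "Prefix\tLast\tFirst\tSuffix\tFilingType\tStateDst\tYear\tFilingDate\tDocID\n"
    "Hon.\tDoe\tJane\t\tP\tCA12\t2024\t1/15/2024\t20000001\n"
)

def read_index(zip_bytes: bytes, member_name: str) -> list[dict]:
    """Read a tab-separated index file out of a zip archive held in memory."""
    with ZipFile(BytesIO(zip_bytes)) as zf:
        text = zf.read(member_name).decode("utf-8")
    rows = list(csv.DictReader(StringIO(text), delimiter="\t"))
    # Prefix and Suffix are nullable, so map empty strings to None.
    for row in rows:
        for key in ("Prefix", "Suffix"):
            row[key] = row[key] or None
    return rows

# Build a throwaway archive so the example runs without network access.
buffer = BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("2024FD.txt", SAMPLE_INDEX)

rows = read_index(buffer.getvalue(), "2024FD.txt")
print(rows[0]["Last"], rows[0]["FilingType"], rows[0]["DocID"])
```

Keeping `DocID` as a string avoids accidental numeric formatting when it is later used to build a download URL.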

## A breakdown of the FilingTypes

C - Candidacy Financial Disclosure Report:
    Candidates are required to disclose their net worth and assets.
    Example: https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2024/10061382.pdf

D - Financial Disclosure Report
    Candidates are required to disclose if they have received more than $5,000 for their campaign.
    Example: https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2024/40003638.pdf

P - Periodic Transaction Report
    Filers are required to disclose any transaction within 45 days of that transaction.
    Example: https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2024/20025368.pdf

W - Withdrawal of Candidacy
    Example: https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2024/7923.pdf

X - Financial Disclosure Extension Request
    Example: https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2024/30022024.pdf

**P FilingTypes are what we are most interested in: they provide the trade type, actual stock tickers, general amounts, and dates.**
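
To make the 45-day window from the breakdown above concrete, here is a minimal sketch that measures how many days passed between a transaction and its disclosure. The helper names and the example dates are hypothetical. Note that the index files only carry `FilingDate`, so the transaction date would have to come from the parsed PTR itself.

```python
from datetime import date

REPORTING_WINDOW_DAYS = 45  # window described in the FilingType breakdown

def days_to_disclosure(transaction_date: date, filing_date: date) -> int:
    """Number of calendar days between a transaction and its filing."""
    return (filing_date - transaction_date).days

def is_within_window(transaction_date: date, filing_date: date,
                     window_days: int = REPORTING_WINDOW_DAYS) -> bool:
    """True if the filing happened on or after the transaction and within the window."""
    return 0 <= days_to_disclosure(transaction_date, filing_date) <= window_days

# Hypothetical dates, for illustration only.
on_time = is_within_window(date(2024, 1, 2), date(2024, 2, 10))
late = is_within_window(date(2024, 1, 2), date(2024, 3, 1))
print(on_time, late)
```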

### Where are the original pdfs stored?
Each PDF is stored at a URL that combines a base URL (determined by the FilingType), the Year, and the DocID.

Base URL for C, D, W, X: https://disclosures-clerk.house.gov/public_disc/financial-pdfs

Base URL for P: https://disclosures-clerk.house.gov/public_disc/ptr-pdfs
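
The URL rule above can be sketched as a small helper. The function name `build_pdf_url` is hypothetical. The `{base}/{year}/{DocID}.pdf` layout is inferred from the example links in the FilingType breakdown.

```python
PUBLIC_DISC = "https://disclosures-clerk.house.gov/public_disc"
KNOWN_FILING_TYPES = {"C", "D", "P", "W", "X"}

def build_pdf_url(filing_type: str, year: int, doc_id: str) -> str:
    """Build the PDF URL for one filing from its FilingType, Year, and DocID."""
    if filing_type not in KNOWN_FILING_TYPES:
        raise ValueError(f"Unknown FilingType: {filing_type!r}")
    # PTRs live under ptr-pdfs; every other filing type lives under financial-pdfs.
    folder = "ptr-pdfs" if filing_type == "P" else "financial-pdfs"
    return f"{PUBLIC_DISC}/{folder}/{year}/{doc_id}.pdf"

print(build_pdf_url("P", 2024, "20025368"))
print(build_pdf_url("C", 2024, "10061382"))
```

Both calls reproduce the example links listed for the P and C filing types.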

# Download all available Congress people's trading data

In [18]:
from pathlib import Path
import os
import re
import random
from zipfile import ZipFile
from io import BytesIO
import asyncio
import aiohttp
import pandas as pd

In [4]:
async def download_disclosure_file(session: aiohttp.ClientSession, url: str, year: int, output_dir: Path) -> bool:
    """Download one yearly index zip and extract its {year}FD.txt member."""
    txt_file = f"{year}FD.txt"
    try:
        async with session.get(url) as response:
            # Report non-200 responses instead of silently returning None
            if response.status != 200:
                print(f"Failed to download {url}: Status {response.status}")
                return False
            content = await response.read()
        with ZipFile(BytesIO(content)) as zip_file:
            if txt_file not in zip_file.namelist():
                print(f"No {txt_file} found for {url}")
                return False
            zip_file.extract(txt_file, output_dir)
            print(f"Successfully downloaded and extracted {txt_file}")
            return True
    except Exception as e:
        print(f"Error downloading {url}: {str(e)}")
        return False

async def bulk_download_disclosure_files(base_url: str, years: range, output_dir: Path):
    os.makedirs(output_dir, exist_ok=True)

    # Only fetch years whose index file is not already on disk
    years_to_download = []
    for year in years:
        txt_file = f"{year}FD.txt"
        if (output_dir / txt_file).exists():
            print(f"File {txt_file} already exists, skipping download")
        else:
            years_to_download.append(year)

    async with aiohttp.ClientSession() as session:
        tasks = [download_disclosure_file(session, f"{base_url}/{year}FD.zip", year, output_dir)
                 for year in years_to_download]
        await asyncio.gather(*tasks)


In [6]:
base_url = "https://disclosures-clerk.house.gov/public_disc/financial-pdfs"
years = range(2008, 2025)
output_dir = Path("../data/disclosures")
await bulk_download_disclosure_files(base_url, years, output_dir)

Successfully downloaded and extracted 2010FD.txt
Successfully downloaded and extracted 2008FD.txt
Successfully downloaded and extracted 2009FD.txt
Successfully downloaded and extracted 2024FD.txt
Successfully downloaded and extracted 2011FD.txt
Successfully downloaded and extracted 2012FD.txt
Successfully downloaded and extracted 2013FD.txt
Successfully downloaded and extracted 2021FD.txt
Successfully downloaded and extracted 2022FD.txt
Successfully downloaded and extracted 2023FD.txt
Successfully downloaded and extracted 2017FD.txt
Successfully downloaded and extracted 2016FD.txt
Successfully downloaded and extracted 2015FD.txt
Successfully downloaded and extracted 2014FD.txt
Successfully downloaded and extracted 2018FD.txt
Successfully downloaded and extracted 2019FD.txt
Successfully downloaded and extracted 2020FD.txt


## Download all available PTR PDFs

In [19]:
async def download_ptr_pdf(session, url: str, output_path: Path) -> bool:
    """Download a single PTR PDF file asynchronously."""
    try:
        await asyncio.sleep(random.uniform(0.8, 1.3))  # Random delay between requests
        async with session.get(url) as response:
            if response.status == 200:
                output_path.parent.mkdir(parents=True, exist_ok=True)
                content = await response.read()
                with open(output_path, 'wb') as f:
                    f.write(content)
                return True
            else:
                print(f"Failed to download {url}: Status {response.status}")
                return False
    except Exception as e:
        print(f"Error downloading {url}: {str(e)}")
        return False

def parse_filing_data(year: int) -> pd.DataFrame:
    """Parse the filing data file for a given year and return PTR records."""
    file_path = Path(f"../data/disclosures/{year}FD.txt")
    
    try:
        # Read the tab-separated file
        df = pd.read_csv(file_path, sep='\t')
        
        # Filter for PTR (Periodic Transaction Report) filings only
        ptr_df = df[df['FilingType'] == 'P'].copy()
        
        # Convert FilingDate to datetime
        ptr_df['FilingDate'] = pd.to_datetime(ptr_df['FilingDate'])
        
        return ptr_df
    except Exception as e:
        print(f"Error parsing {file_path}: {str(e)}")
        return pd.DataFrame()

def generate_pdf_filename(row) -> str:
    """Generate standardized filename for a PTR filing."""
    last_name = re.sub(r'[^a-zA-Z]', '', row['Last']).lower()
    first_name = re.sub(r'[^a-zA-Z]', '', row['First']).lower()
    state_dist = row['StateDst'].lower()
    filing_date = row['FilingDate'].strftime('%Y-%m-%d')
    
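    # Include the DocID so two filings by the same member on the same date do not collide
    return f"{last_name}_{first_name}_{state_dist}_{filing_date}_{row['DocID']}.pdf"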

async def process_batch(session, batch: list[tuple[str, str]], output_dir: Path) -> tuple[int, int]:
    """Process a batch of PTR downloads."""
    tasks = [download_ptr_pdf(session, url, output_dir / filename) 
             for url, filename in batch
             if not (output_dir / filename).exists()]
    
    if tasks:
        results = await asyncio.gather(*tasks)
        return sum(1 for r in results if r), len(tasks)
    return 0, 0

async def download_ptr_pdfs(start_year: int, end_year: int, batch_size: int = 25):
    """
    Download all PTR PDFs for the specified year range.

    Args:
        start_year (int): The start year of the range.
        end_year (int): The end year of the range (inclusive).
        batch_size (int): The number of PTRs to download concurrently.
    """
    base_url = "https://disclosures-clerk.house.gov/public_disc/ptr-pdfs"
    output_dir = Path("../data/ptrs")
    output_dir.mkdir(parents=True, exist_ok=True)
    
    async with aiohttp.ClientSession() as session:
        for year in range(start_year, end_year + 1):
            print(f"Processing year {year}")
            
            ptr_df = parse_filing_data(year)
            if ptr_df.empty:
                continue
            
            # Create list of all downloads needed
            downloads = [
                (f"{base_url}/{year}/{row['DocID']}.pdf", generate_pdf_filename(row))
                for row in ptr_df.to_dict('records')
            ]
            
            # Process in batches
            total_successful = 0
            total_attempts = 0
            
            for i in range(0, len(downloads), batch_size):
                batch = downloads[i:i + batch_size]
                successful, attempts = await process_batch(session, batch, output_dir)
                total_successful += successful
                total_attempts += attempts
                
            if total_attempts > 0:
                print(f"Year {year}: Downloaded {total_successful}/{total_attempts} PTR PDFs")

async def download_ptrs(start_year: int, end_year: int, batch_size: int = 25):
    """Main function to initiate PTR downloads."""
    await download_ptr_pdfs(start_year, end_year, batch_size)


In [20]:
await download_ptrs(2014, 2024)

Processing year 2014
Year 2014: Downloaded 706/706 PTR PDFs
Processing year 2015
Year 2015: Downloaded 726/726 PTR PDFs
Processing year 2016
Year 2016: Downloaded 760/760 PTR PDFs
Processing year 2017
Year 2017: Downloaded 800/800 PTR PDFs
Processing year 2018
Year 2018: Downloaded 820/820 PTR PDFs
Processing year 2019
Year 2019: Downloaded 677/677 PTR PDFs
Processing year 2020
Year 2020: Downloaded 727/727 PTR PDFs
Processing year 2021
Year 2021: Downloaded 674/674 PTR PDFs
Processing year 2022
Year 2022: Downloaded 623/623 PTR PDFs
Processing year 2023
Year 2023: Downloaded 460/460 PTR PDFs
Processing year 2024
Year 2024: Downloaded 408/408 PTR PDFs


## Process the PTR PDFs

PDFs are notoriously difficult to parse. We'll need to use OCR to get the text and tables from the PDFs.

# Get historical prices for the S&P 500 index (via the SPY ETF)

In [None]:
import yfinance as yf

def get_spy_data(start_date: str):
    # Get SPY data from Yahoo Finance
    spy = yf.Ticker("SPY")

    # Download daily historical data from start_date to present
    spy_data = spy.history(
        start=start_date,
        interval="1d"
    )

    # Calculate daily returns
    spy_data['Returns'] = spy_data['Close'].pct_change()

    # Calculate rolling volatility (20-day standard deviation of returns)
    spy_data['Volatility_20d'] = spy_data['Returns'].rolling(window=20).std()

    # Calculate rolling volatility (60-day standard deviation of returns)
    spy_data['Volatility_60d'] = spy_data['Returns'].rolling(window=60).std()

    # Add moving averages
    spy_data['MA_50'] = spy_data['Close'].rolling(window=50).mean()
    spy_data['MA_200'] = spy_data['Close'].rolling(window=200).mean()

    # Calculate trading ranges
    spy_data['Daily_Range'] = spy_data['High'] - spy_data['Low']
    spy_data['Daily_Range_Pct'] = spy_data['Daily_Range'] / spy_data['Open']

    # Reset index to make Date a column
    spy_data = spy_data.reset_index()

    print(f"Downloaded {len(spy_data)} days of SPY data")
    print("\nRandom sample:")
    print(spy_data.sample(15))
    print("\nColumns available:")
    print(spy_data.columns.tolist())

    return spy_data

In [None]:
spy_output_path = Path("../data/spy.csv")
if not spy_output_path.exists():
    spy_data = get_spy_data("2000-01-01")
    spy_data.to_csv(spy_output_path, index=False)
else:
    spy_data = pd.read_csv(spy_output_path)


# Train RL model