# 📘 SEC Schedule 13D / 13G Filing Extractor

This notebook provides a robust and scalable pipeline for downloading and parsing Schedule 13D and Schedule 13G filings from the U.S. Securities and Exchange Commission (SEC) EDGAR system.

These filings are submitted by institutional investors, activist funds, and other large holders when they acquire beneficial ownership of more than 5% of a class of a public company's voting equity securities. Analyzing these disclosures can help track:

- 🔍 Institutional ownership trends
- 🧠 Activist investment behavior
- 📈 Early signals of corporate control contests or strategic stake-building

### 🛠️ Key Features
- Automated download of master index files from SEC EDGAR by year/quarter
- Filter and locate all Schedule 13D / 13G filings in the index
- Structured parsing of filing content, including:
  - Issuer info (name, CIK, CUSIP)
  - Reporting party info (name, CIK, citizenship, ownership)
  - Voting power and beneficial shares
- Narrative extraction from unstructured fields such as:
  - Funds source
  - Purpose of transaction
  - Certification statements
- Rate limiting and retry control to comply with SEC usage guidelines
- Output to a pandas DataFrame for downstream analysis or export

#### 🔐 Usage Disclaimer
This notebook accesses publicly available data from https://www.sec.gov. To respect the SEC’s infrastructure:

- Always include a descriptive User-Agent header with your name and email.
- Keep the request rate low; the SEC's fair-access guidance asks automated clients to stay at or below 10 requests per second.

The filing extractor in this notebook adds randomized sleep intervals and retry logic to reduce the risk of being blocked.
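
The request-rate guideline can be enforced in code rather than by convention. The sketch below is one possible approach and is not part of this notebook's pipeline: a small `Throttle` helper (a hypothetical name) spaces successive calls at least `min_interval` seconds apart using only the standard library. The 0.2-second interval is an assumption chosen for illustration, not an SEC-mandated value.

```python
import time


class Throttle:
    """Block so that successive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=0.2):
        # 0.2 s (5 calls/second) is an illustrative default, not an official limit
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        """Sleep just long enough to honor the minimum spacing, then record the time."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # call this immediately before each HTTP request
elapsed = time.monotonic() - start
print(f"3 throttled calls took {elapsed:.2f} s")
```

The first call returns immediately; only the gaps between calls are padded, so three calls take roughly two intervals.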

In [None]:
headers = {
    'User-Agent': 'Your Name (your_email@example.com)'
}


In [42]:
import requests
import os
import pandas as pd

def download_master_idx(headers, year=2024, quarter='QTR1'):
    """
    Download the SEC EDGAR master index file for a specific year and quarter.

    Parameters:
    - headers (dict): Request headers; must include a descriptive User-Agent
    - year (int): The target year (e.g., 2024)
    - quarter (str): One of 'QTR1', 'QTR2', 'QTR3', 'QTR4'

    Returns:
    - str: Raw text content of the master.idx file
    """
    url = f"https://www.sec.gov/Archives/edgar/full-index/{year}/{quarter}/master.idx"

    # The timeout prevents the call from hanging on a stalled connection
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.text

def parse_master_idx(idx_text):
    """
    Parse the raw text of a master.idx file and convert it to a structured DataFrame.

    Parameters:
    - idx_text (str): Raw text from the master.idx file

    Returns:
    - pd.DataFrame: DataFrame with columns [CIK, Company Name, Form Type, Date Filed, File Name]
    """
    lines = idx_text.splitlines()
    start_idx = 0

    # Find the start of the data section (header row starts with 'CIK|...');
    # the last column is spelled 'Filename' in the index file itself
    for i, line in enumerate(lines):
        if line.startswith("CIK|Company Name|Form Type|Date Filed|Filename"):
            start_idx = i + 1
            break

    # Parse rows into DataFrame
    data = [line.split('|') for line in lines[start_idx:] if '|' in line]
    df = pd.DataFrame(data, columns=["CIK", "Company Name", "Form Type", "Date Filed", "File Name"])
    return df

def parse_master_idx_text(idx_text):
    """
    A more robust parser for master.idx in case of inconsistent formatting or headers.

    Parameters:
    - idx_text (str): Raw text content of the master.idx file

    Returns:
    - pd.DataFrame: Cleaned and parsed DataFrame
    """
    lines = idx_text.splitlines()

    # Identify the start of the data section
    start_idx = 0
    for i, line in enumerate(lines):
        if line.startswith("CIK|Company Name|Form Type|Date Filed|Filename"):
            start_idx = i + 1
            break

    # Skip separator rows and extract actual data lines
    data_lines = [line for line in lines[start_idx:] if not line.startswith("-----")]
    rows = [line.split('|') for line in data_lines if '|' in line]

    # Create DataFrame
    df = pd.DataFrame(rows, columns=["CIK", "Company Name", "Form Type", "Date Filed", "File Name"])
    return df

def filter_all_13D_13G(df):
    """
    Filter all Schedule 13D and 13G filings from a master index DataFrame.

    Parameters:
    - df (pd.DataFrame): DataFrame with the master.idx content

    Returns:
    - pd.DataFrame: Filtered DataFrame with only 13D and 13G related filings, with a URL column
    """
    df_filtered = df[df["Form Type"].str.contains(r'13D|13G', case=False, na=False)].copy()
    df_filtered["SEC URL"] = "https://www.sec.gov/Archives/" + df_filtered["File Name"]
    return df_filtered

def save_txt_reports(headers, df_filtered, save_dir="sec_13d_13g"):
    """
    Download raw TXT filings of 13D/13G from SEC URLs and save them to a local folder.

    Parameters:
    - headers (dict): Request headers; must include a descriptive User-Agent
    - df_filtered (pd.DataFrame): Output of filter_all_13D_13G (needs "File Name" and "SEC URL")
    - save_dir (str): Local directory path to save the downloaded filings

    Output:
    - Saves .txt files to disk, named by CIK + Form Type + Date + accession number
    """
    os.makedirs(save_dir, exist_ok=True)

    for _, row in df_filtered.iterrows():
        cik = row["CIK"]
        form_type = row["Form Type"].replace("/", "_")
        date = row["Date Filed"]
        url = row["SEC URL"]
        # The accession number keeps names unique when one CIK has several
        # filings of the same type on the same day (otherwise they overwrite each other)
        accession = os.path.splitext(os.path.basename(row["File Name"]))[0]
        fname = f"{cik}_{form_type}_{date}_{accession}.txt"
        path = os.path.join(save_dir, fname)

        try:
            r = requests.get(url, headers=headers, timeout=30)
            r.raise_for_status()
            with open(path, "w", encoding='utf-8') as f:
                f.write(r.text)
        except (requests.RequestException, OSError) as e:
            print(f"Failed to download {url}: {e}")
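

To see what the `13D|13G` pattern in `filter_all_13D_13G` selects, the sketch below applies the same case-insensitive substring search to a handful of form-type strings with the standard `re` module. The sample strings are illustrative rather than pulled from a live index; `SC 13D` and `SC 13G` are the form-type labels EDGAR used before the `SCHEDULE 13D` / `SCHEDULE 13G` naming.

```python
import re

# Same pattern and case-insensitivity as
# df["Form Type"].str.contains(r'13D|13G', case=False)
PATTERN = re.compile(r"13D|13G", re.IGNORECASE)

# Illustrative form-type strings (not taken from a live index)
sample_form_types = [
    "SCHEDULE 13D", "SCHEDULE 13G/A", "SC 13D", "SC 13G/A",
    "10-K", "13F-HR", "4",
]

matched = [ft for ft in sample_form_types if PATTERN.search(ft)]
skipped = [ft for ft in sample_form_types if not PATTERN.search(ft)]

print("matched:", matched)
print("skipped:", skipped)
```

Because the pattern is a substring match, amendments (`/A`) are included automatically. Filtering on exact values would be needed if originals and amendments should be handled separately.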


In [None]:
import requests
import xml.etree.ElementTree as ET
import time
import random

def try_get(root, path, ns):
    """Safely retrieve stripped text from an XML element path; return None if missing or empty."""
    node = root.find(path, ns)
    # A matched element with no text content has node.text set to None
    if node is None or node.text is None:
        return None
    return node.text.strip()

def extract_13d_13g_info_simple(url, headers, max_retries=3, backoff_base=1.5):
    """
    Robustly extract structured and narrative (textual) information from a Schedule 13D/13G SEC filing.

    Features:
    - Includes retry logic with exponential backoff
    - Random sleep between requests (0.5 to 2.0 seconds)
    - Extracts both structured fields and narrative fields as text

    Parameters:
    - url (str): SEC TXT filing URL
    - headers (dict): Custom headers with User-Agent
    - max_retries (int): Maximum number of retries on failed request
    - backoff_base (float): Base multiplier for exponential backoff

    Returns:
    - dict: Extracted filing data including structured and unstructured fields
    """
    attempt = 0
    while attempt <= max_retries:
        try:
            time.sleep(random.uniform(0.5, 2.0))  # polite delay
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code != 200:
                raise ValueError(f"HTTP {response.status_code}")
            text = response.text
            break
        except Exception as e:
            wait_time = backoff_base ** attempt
            attempt += 1
            if attempt > max_retries:
                # The loop makes one initial attempt plus max_retries retries
                return {"error": f"Failed to download after {max_retries + 1} attempts. Last error: {e}"}
            time.sleep(wait_time)

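    # Locate the embedded XML block; find() returns -1 when a marker is absent,
    # so the closing tag's position is checked before its length is added
    close_tag = '</edgarSubmission>'
    start = text.find('<?xml')
    close_pos = text.find(close_tag, max(start, 0))
    if start == -1 or close_pos == -1:
        return {"error": "XML block not found in filing"}
    xml_str = text[start:close_pos + len(close_tag)]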

    try:
        root = ET.fromstring(xml_str)
    except ET.ParseError as e:
        return {"error": f"XML parsing error: {e}"}

    ns_uri = root.tag.split('}')[0].strip('{')
    ns = {'ns': ns_uri}
    data = {}

    # Structured fields
    data['Filing Type'] = '13G' if '13g' in ns_uri.lower() else '13D'
    data['Issuer Name'] = try_get(root, './/ns:issuerName', ns)
    data['Issuer CUSIP'] = try_get(root, './/ns:issuerCUSIP', ns)
    data['Issuer CIK'] = try_get(root, './/ns:issuerCik', ns)
    data['Reporting Person'] = try_get(root, './/ns:reportingPersonName', ns)
    data['Reporting Person CIK'] = try_get(root, './/ns:reportingPersonCIK', ns)
    data['Reporting Person Citizenship'] = try_get(root, './/ns:citizenshipOrOrganization', ns)
    data['Reporting Person Type'] = try_get(root, './/ns:typeOfReportingPerson', ns)
    data['Aggregate Shares Owned'] = (
        try_get(root, './/ns:aggregateAmountOwned', ns) or
        try_get(root, './/ns:reportingPersonBeneficiallyOwnedAggregateNumberOfShares', ns)
    )
    data['Percent of Class'] = (
        try_get(root, './/ns:percentOfClass', ns) or
        try_get(root, './/ns:classPercent', ns)
    )
    data['Sole Voting Power'] = try_get(root, './/ns:soleVotingPower', ns)
    data['Shared Voting Power'] = try_get(root, './/ns:sharedVotingPower', ns)
    data['Sole Dispositive Power'] = try_get(root, './/ns:soleDispositivePower', ns)
    data['Shared Dispositive Power'] = try_get(root, './/ns:sharedDispositivePower', ns)

    # Narrative fields
    data['Funds Source'] = try_get(root, './/ns:fundsSource', ns)
    data['Ownership Description'] = try_get(root, './/ns:numberOfShares', ns)
    data['Transaction Description'] = try_get(root, './/ns:transactionDesc', ns)
    data['Intent or Purpose'] = try_get(root, './/ns:purposeOfTransaction', ns)
    data['Certification Statement'] = try_get(root, './/ns:certifications', ns)

    return data
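

The retry loop in `extract_13d_13g_info_simple` waits `backoff_base ** attempt` seconds after each failed attempt except the last. As a quick worked check of what that means with the default arguments, the sketch below lists those waits. `backoff_schedule` is a hypothetical helper written for this illustration, and the figures exclude the random 0.5 to 2.0 second polite delay that precedes every attempt.

```python
def backoff_schedule(max_retries=3, backoff_base=1.5):
    """Return the wait (in seconds) that follows each failed attempt except the last."""
    return [backoff_base ** attempt for attempt in range(max_retries)]


waits = backoff_schedule()
total_wait = sum(waits)

print(waits)       # [1.0, 1.5, 2.25]
print(total_wait)  # 4.75
```

With the defaults, a URL that never responds costs about 4.75 seconds of backoff on top of the polite delays and any request timeouts before the function gives up.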

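
The extractor reads the namespace URI from the root tag and then looks up each field with a prefixed `.//ns:...` path. The sketch below shows that mechanism on a miniature XML document. The namespace URI and the nesting are made up for illustration and should not be read as the SEC's actual schema; only the `issuerName` tag name is taken from the function above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document: the namespace URI and layout are invented
xml_str = """<edgarSubmission xmlns="http://example.com/schedule13g">
  <section>
    <issuerName> Example Corp </issuerName>
  </section>
</edgarSubmission>"""

root = ET.fromstring(xml_str)

# ElementTree stores namespaced tags as '{uri}localname',
# so the URI can be read off the root tag
ns_uri = root.tag.split('}')[0].strip('{')
ns = {'ns': ns_uri}

# './/' searches all descendants, so the exact nesting depth does not matter
node = root.find('.//ns:issuerName', ns)
issuer_name = node.text.strip() if node is not None and node.text else None

# A tag that is absent from the document yields None instead of raising
missing = root.find('.//ns:issuerCUSIP', ns)

print(ns_uri)
print(issuer_name)
print(missing)
```

This is why fields that a filing does not contain show up as empty cells in the resulting DataFrame rather than as errors.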

In [43]:
# Run the pipeline for 2025 Q1

idx_text = download_master_idx(headers, year=2025, quarter='QTR1')

df_idx = parse_master_idx_text(idx_text)

df_13d_13g = filter_all_13D_13G(df_idx)
df_13d_13g[['Company Name', 'Form Type', 'Date Filed', 'SEC URL']].head()


Unnamed: 0,Company Name,Form Type,Date Filed,SEC URL
7,OLD MARKET CAPITAL Corp,SCHEDULE 13G,2025-01-23,https://www.sec.gov/Archives/edgar/data/100004...
63,MEDALLION FINANCIAL CORP,SCHEDULE 13D/A,2025-01-03,https://www.sec.gov/Archives/edgar/data/100020...
64,MEDALLION FINANCIAL CORP,SCHEDULE 13D/A,2025-02-18,https://www.sec.gov/Archives/edgar/data/100020...
65,MEDALLION FINANCIAL CORP,SCHEDULE 13G/A,2025-02-06,https://www.sec.gov/Archives/edgar/data/100020...
69,MURSTEIN ALVIN,SCHEDULE 13D/A,2025-02-18,https://www.sec.gov/Archives/edgar/data/100021...
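
The cell above fetches a single quarter. To cover a longer period, the same URL pattern used in `download_master_idx` can be generated for every year and quarter of interest. The sketch below only builds the URLs and makes no network calls; `master_idx_urls` and the year range are hypothetical choices for illustration.

```python
from itertools import product

# Same base path that download_master_idx uses
BASE = "https://www.sec.gov/Archives/edgar/full-index"


def master_idx_urls(years, quarters=("QTR1", "QTR2", "QTR3", "QTR4")):
    """Build master.idx URLs for every (year, quarter) pair, in chronological order."""
    return [
        f"{BASE}/{year}/{quarter}/master.idx"
        for year, quarter in product(years, quarters)
    ]


index_urls = master_idx_urls(range(2024, 2026))
for u in index_urls:
    print(u)
```

Quarters that have not started yet will not have an index file, so a real loop over these URLs should expect and handle HTTP errors for them.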


In [44]:
df_13d_13g.shape

(16629, 6)

In [45]:
df_13d_13g.head()

Unnamed: 0,CIK,Company Name,Form Type,Date Filed,File Name,SEC URL
7,1000045,OLD MARKET CAPITAL Corp,SCHEDULE 13G,2025-01-23,edgar/data/1000045/0000354204-25-000480.txt,https://www.sec.gov/Archives/edgar/data/100004...
63,1000209,MEDALLION FINANCIAL CORP,SCHEDULE 13D/A,2025-01-03,edgar/data/1000209/0000921895-25-000033.txt,https://www.sec.gov/Archives/edgar/data/100020...
64,1000209,MEDALLION FINANCIAL CORP,SCHEDULE 13D/A,2025-02-18,edgar/data/1000209/0000950170-25-022263.txt,https://www.sec.gov/Archives/edgar/data/100020...
65,1000209,MEDALLION FINANCIAL CORP,SCHEDULE 13G/A,2025-02-06,edgar/data/1000209/0000950170-25-015451.txt,https://www.sec.gov/Archives/edgar/data/100020...
69,1000210,MURSTEIN ALVIN,SCHEDULE 13D/A,2025-02-18,edgar/data/1000210/0000950170-25-022263.txt,https://www.sec.gov/Archives/edgar/data/100021...
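
Rows 64 and 69 in the output above point to files with the same stem (`0000950170-25-022263`) under two different CIKs, which suggests that a single filing can appear in the index once per associated entity. The sketch below shows one way to keep a single row per filing before downloading. It assumes the file stem is the filing's accession number, and `accession_number` is a hypothetical helper.

```python
import posixpath

# (CIK, File Name) pairs copied from the index rows shown above
rows = [
    ("1000209", "edgar/data/1000209/0000921895-25-000033.txt"),
    ("1000209", "edgar/data/1000209/0000950170-25-022263.txt"),
    ("1000209", "edgar/data/1000209/0000950170-25-015451.txt"),
    ("1000210", "edgar/data/1000210/0000950170-25-022263.txt"),
]


def accession_number(file_name):
    """Return the file stem, e.g. '0000950170-25-022263'."""
    return posixpath.splitext(posixpath.basename(file_name))[0]


seen = set()
unique_rows = []
for cik, file_name in rows:
    acc = accession_number(file_name)
    if acc not in seen:  # keep only the first index row for each filing
        seen.add(acc)
        unique_rows.append((cik, file_name))

print(len(rows), "index rows ->", len(unique_rows), "distinct filings")
```

On the DataFrame itself, the same idea could be expressed by deriving an accession column from `File Name` and calling `drop_duplicates` on it.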


In [91]:
# Collect the SEC URLs for this quarter's filings

urls = df_13d_13g['SEC URL'].tolist()


In [None]:
# Example: extract a single filing


one_url = urls[0]
one_record = extract_13d_13g_info_simple(one_url, headers=headers)


In [104]:
# Display the record as a one-row summary table
pd.DataFrame([one_record])

Unnamed: 0,Filing Type,Issuer Name,Issuer CUSIP,Issuer CIK,Reporting Person,Reporting Person CIK,Reporting Person Citizenship,Reporting Person Type,Aggregate Shares Owned,Percent of Class,Sole Voting Power,Shared Voting Power,Sole Dispositive Power,Shared Dispositive Power,Funds Source,Ownership Description,Transaction Description,Intent or Purpose,Certification Statement
0,13G,Old Market Capital Corp,,1000045,Dimensional Fund Advisors LP,,X1,IA,340008.0,5.1,338402,0,340008,0,,,,,"By signing below I certify that, to the best o..."


In [109]:

url_list = urls[10:15]


In [110]:
import time
import random

records = []

for i, u in enumerate(url_list):
    record = extract_13d_13g_info_simple(u, headers=headers)
    records.append(record)
    
    # after each request, sleep for a while 
    sleep_time = random.uniform(3.0, 10)
    print(f"[{i+1}/{len(url_list)}] Sleeping for {sleep_time:.2f} seconds...")
    time.sleep(sleep_time)


[1/5] Sleeping for 7.98 seconds...
[2/5] Sleeping for 5.59 seconds...
[3/5] Sleeping for 9.16 seconds...
[4/5] Sleeping for 7.11 seconds...
[5/5] Sleeping for 5.88 seconds...


In [111]:
pd.DataFrame(records)

Unnamed: 0,Filing Type,Issuer Name,Issuer CUSIP,Issuer CIK,Reporting Person,Reporting Person CIK,Reporting Person Citizenship,Reporting Person Type,Aggregate Shares Owned,Percent of Class,Sole Voting Power,Shared Voting Power,Sole Dispositive Power,Shared Dispositive Power,Funds Source,Ownership Description,Transaction Description,Intent or Purpose,Certification Statement
0,13G,Alger Mid Cap 40 ETF,,1807486,"Alger Associates, Inc.",,NY,HC,290491.0,7.4,290491.0,0.0,290491.0,0.0,,,,,"By signing below I certify that, to the best o..."
1,13G,"PROS Holdings, Inc.",,1392972,"Alger Associates, Inc. 13-3017981",,NY,HC,2207784.0,4.7,1934943.0,0.0,2207784.0,0.0,,,,,"By signing below I certify that, to the best o..."
2,13G,"Montrose Environmental Group, Inc.",,1643615,"Alger Associates, Inc.",,NY,HC,1013896.0,3.0,607051.0,0.0,1013896.0,0.0,,,,,"By signing below I certify that, to the best o..."
3,13G,Absci Corporation,,1672688,"Alger Associates, Inc.",,NY,HC,6505423.0,5.7,6505423.0,0.0,6505423.0,0.0,,,,,"By signing below I certify that, to the best o..."
4,13G,BROOKFIELD BUSINESS PARTNERS LP,,1654795,Royal Bank of Canada,,Z4,HC,7480617.0,10.07,0.0,7480617.0,0.0,7480617.0,,,,,"By signing below I certify that, to the best o..."
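
Two details matter when a batch like this grows: `extract_13d_13g_info_simple` returns a dict with a single `error` key when a download or parse fails, and the XML lookups return text rather than numbers. The sketch below separates failed records from parsed ones and converts two numeric fields. The sample records are hypothetical, and `to_float` and `NUMERIC_FIELDS` are names introduced only for this illustration.

```python
# Hypothetical records in the shape returned by extract_13d_13g_info_simple
sample_records = [
    {"Issuer Name": "Example Corp", "Aggregate Shares Owned": "340008.0", "Percent of Class": "5.1"},
    {"error": "XML block not found in filing"},
    {"Issuer Name": "Sample Holdings", "Aggregate Shares Owned": None, "Percent of Class": "10.07"},
]

NUMERIC_FIELDS = ("Aggregate Shares Owned", "Percent of Class")


def to_float(value):
    """Convert a text field to float; return None for missing or non-numeric text."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Keep failures aside so they can be retried or inspected later
parsed = [r for r in sample_records if "error" not in r]
failed = [r for r in sample_records if "error" in r]

# Copy each parsed record, overriding the numeric fields with converted values
cleaned = [
    {**r, **{field: to_float(r.get(field)) for field in NUMERIC_FIELDS}}
    for r in parsed
]

print(len(parsed), "parsed,", len(failed), "failed")
print(cleaned)
```

Passing `cleaned` to `pd.DataFrame` then yields numeric columns without a stray `error` column mixed into the results.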
