# SEC MD&A Section Extractor

## Overview
This script extracts Management's Discussion and Analysis (MD&A) sections from SEC filings (Forms 10-K and 10-Q) for S&P 500 companies. It queries filings and extracts the sections through the `sec_api` client, decodes HTML entities in the extracted text, and saves the results to a CSV file.

## Dependencies
- pandas: Data manipulation and CSV I/O
- html: HTML entity decoding
- time: Rate limiting API requests
- tqdm: Progress bar for tracking execution
- sec_api: Python client for the sec-api.io Query and Extractor APIs (a third-party service over SEC EDGAR filings)

## Authentication
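The script requires a valid sec-api.io API key. Avoid committing the key to source control; one option is to read it from an environment variable (the name `SEC_API_KEY` is an assumption):
```python
import os

# Assumption: the key is stored in an environment variable named SEC_API_KEY
API_KEY = os.environ['SEC_API_KEY']
```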

## File Structure
- Input file: `data/sp500_constituents.csv` - List of S&P 500 companies; must contain the `Ticker` and `CIK` columns
- Output file: `data/mda_sections_2023.csv` - Extracted MD&A sections
- Debug log: `mda_debug.log` - Detailed log of the extraction process
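
Because the script reads the `Ticker` and `CIK` columns by name, a quick header check before a long run can save time. The following is a minimal sketch using only the standard library; the helper name `missing_columns` is hypothetical and not part of the script.

```python
import csv
import io

# Column names the extraction script reads from the input file
REQUIRED_COLUMNS = {"Ticker", "CIK"}

def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return REQUIRED_COLUMNS - header

# In-memory sample standing in for data/sp500_constituents.csv
sample = io.StringIO("Ticker,CIK\nMMM,66740\n")
print(missing_columns(sample))
```

An empty set means the header has everything the script needs; any returned name points at a column to add or rename.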

## Process Flow

1. **Initialization**
   - Load the S&P 500 constituents from the input CSV
   - Initialize the `QueryApi` and `ExtractorApi` clients with the API key
   - Create an empty DataFrame to store results
   - Set up debug logging and progress bar

2. **Company Processing Loop**
   - For each company in the input file:
     - Extract ticker and CIK
     - Query the SEC API for 10-K and 10-Q filings from 2023
     - Apply rate limiting (0.2 seconds between requests)
     - Log query results

3. **Filing Processing**
   - For each filing returned by the query:
     - Determine the appropriate section to extract based on form type:
       - For 10-K: Item 7 (section identifier `7`)
       - For 10-Q: Part 1, Item 2 (section identifier `part1item2`)
     - Extract the MD&A section text using the Extractor API
     - Decode HTML entities in the extracted text
     - Add the data to the results DataFrame

4. **Data Persistence**
   - Save results to CSV periodically (every 5 companies)
   - Perform a final save at the end of processing
   - Verify the output file exists and contains data
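
The query construction and section selection in steps 2 and 3 can be sketched as pure functions, which makes them easy to check without calling the API. The function names `build_query` and `mda_section_for` are hypothetical; the query string, paging values, and section identifiers are taken from the script.

```python
import html

# Section identifiers the script passes to the Extractor API, keyed by form type
MDA_SECTION_BY_FORM = {"10-K": "7", "10-Q": "part1item2"}

def build_query(cik, year=2023, size=10):
    """Build the filing query used for one company and one calendar year."""
    query_string = (
        f'cik:{cik} AND formType:("10-K" OR "10-Q") '
        f"AND filedAt:[{year}-01-01 TO {year}-12-31]"
    )
    return {
        "query": {"query_string": {"query": query_string}},
        "from": "0",
        "size": str(size),
        "sort": [{"filedAt": {"order": "desc"}}],
    }

def mda_section_for(form_type):
    """Return the MD&A section identifier, or None for other form types."""
    return MDA_SECTION_BY_FORM.get(form_type)

query = build_query(66740)
print(query["query"]["query_string"]["query"])
print(mda_section_for("10-Q"))
# Entity decoding as applied to the extracted text
print(html.unescape("Management&#8217;s Discussion"))
```

Returning `None` for unknown form types mirrors the script's behaviour of skipping any filing that is neither a 10-K nor a 10-Q.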

## Error Handling
- Exception handling for file I/O operations
- Exception handling for API queries
- Exception handling for section extraction
- Detailed logging of all errors to the debug log file
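
The script repeats the same pattern at each stage: try the call, log the failure, and move on to the next item. A minimal sketch of that pattern as a reusable helper follows; `call_or_log` is a hypothetical name, and the script itself uses inline `try`/`except` blocks instead.

```python
def call_or_log(func, *args, log=print, context=""):
    """Call func(*args); on any exception, log it and return None."""
    try:
        return func(*args)
    except Exception as exc:
        log(f"Error {context}: {exc}")
        return None

def failing_query(ticker):
    # Stand-in for an API call that fails
    raise ValueError(f"no response for {ticker}")

messages = []
result = call_or_log(
    failing_query, "MMM", log=messages.append, context="querying filings for MMM"
)
print(result, messages)
```

One caveat of this design is that a `None` return is ambiguous if the wrapped function can legitimately return `None`, so it suits calls whose successful results are always non-empty.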

## Rate Limiting
The script applies a simple rate limit: it sleeps for 0.2 seconds before every query and every section extraction call to avoid exceeding the API's rate limits.
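
A fixed sleep always waits the full delay, even when the previous request already took longer than that. As an alternative sketch (not what the script does), the hypothetical `Throttle` class below waits only for the time remaining since the last call. The clock and sleep functions are injectable so the behaviour can be checked without real delays.

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval=0.2, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle(min_interval=0.2)
for _ in range(3):
    throttle.wait()  # an API request would follow here
```

Calling `throttle.wait()` immediately before each API request would take the place of the `time.sleep(0.2)` calls.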

## Logging
- Console output for high-level progress
- Progress bar using tqdm
- Detailed debug log in `mda_debug.log`
- Summary statistics at the end of execution
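
The script writes the debug log by reopening `mda_debug.log` in append mode for every message. The standard `logging` module offers an alternative that keeps one file handler open and adds timestamps. This is a sketch only: the logger name `mda_extractor` and the helper `make_debug_logger` are hypothetical.

```python
import logging

def make_debug_logger(path="mda_debug.log"):
    """Return a logger that writes timestamped records to the given file."""
    logger = logging.getLogger("mda_extractor")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(path, mode="w", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A call such as `make_debug_logger().info("Found 4 filings for MMM")` would then replace each `with open(...)` block, and `logger.exception(...)` inside an `except` clause would also record the traceback.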

## Output Format
The output CSV contains the following columns:
- Ticker: Company ticker symbol
- CIK: Company's Central Index Key (SEC identifier)
- FormType: Either '10-K' or '10-Q'
- FiledAt: Timestamp when the form was filed, in ISO 8601 format with a UTC offset (e.g. `2023-10-24T12:17:55-04:00`)
- MDA_Text: Text content of the MD&A section
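
A short sketch of reading this format back with the standard library follows. The sample row uses the MMM 10-K values from the output file, with a placeholder in place of the real MD&A text. The field size limit is raised because MD&A sections can exceed the `csv` module's default limit of 131072 characters per field.

```python
import csv
import io
from datetime import datetime

# MD&A fields can be far longer than the csv module's default field limit
csv.field_size_limit(10**9)

# One fully quoted row, as written by to_csv(..., quoting=1); the text is a placeholder
SAMPLE = (
    '"Ticker","CIK","FormType","FiledAt","MDA_Text"\n'
    '"MMM","66740","10-K","2023-02-08T12:59:49-05:00","(placeholder MD&A text)"\n'
)

def read_rows(csv_file):
    """Read output rows, converting FiledAt to a timezone-aware datetime."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["FiledAt"] = datetime.fromisoformat(row["FiledAt"])
        rows.append(row)
    return rows

rows = read_rows(io.StringIO(SAMPLE))
print(rows[0]["Ticker"], rows[0]["FormType"], rows[0]["FiledAt"].date())
```

With pandas, `pd.read_csv` handles long fields without extra configuration, and `pd.to_datetime` can convert the `FiledAt` column in a similar way.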

## Usage
```bash
python extract_mda_sections.py
```

## Performance Considerations
- The script processes companies sequentially, not in parallel
- Rate limiting adds at least 0.2 seconds per API call; the recorded run over 503 companies took about 23.5 minutes (roughly 2.8 seconds per company)
- Progress tracking helps estimate completion time
- Periodic saving prevents data loss in case of interruption
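
The time spent sleeping can be bounded with simple arithmetic: one 0.2-second delay per company query plus one per extraction call. The sketch below applies this to the recorded run, using the 1943 saved rows as a lower bound on extraction calls, since skipped or failed extractions also incur the delay. The function name is hypothetical.

```python
def min_sleep_seconds(companies, extraction_calls, delay=0.2):
    """Lower bound on total time spent in rate-limit sleeps."""
    return delay * (companies + extraction_calls)

# Recorded run: 503 companies and at least 1943 extraction calls
sleep_total = min_sleep_seconds(503, 1943)
run_total = 23 * 60 + 33  # 23:33 reported by the progress bar, in seconds

print(f"sleep >= {sleep_total:.1f} s of {run_total} s total")
print(f"share of runtime: {sleep_total / run_total:.0%}")
```

Under these assumptions the sleeps account for roughly a third of the runtime, so the rest is spent waiting on API responses and on the work done between calls.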

## Troubleshooting
If the script fails or produces unexpected results:
1. Check the `mda_debug.log` file for detailed error information
2. Verify your SEC API key is valid and hasn't exceeded usage limits
3. Ensure the input file exists and has the expected format
4. Check for network connectivity issues
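
For step 1, a quick tally of the failure messages in the debug log shows which stage is failing most often. The sketch below matches the message prefixes the script writes; the sample lines and their error details are invented for illustration, and `summarize_log` is a hypothetical helper.

```python
from collections import Counter

# Prefixes of the failure messages the script writes to mda_debug.log
PREFIXES = (
    "Error querying filings",
    "Error extracting MDA",
    "Error saving to CSV",
    "No MDA text found",
)

def summarize_log(lines):
    """Count log lines by failure type."""
    counts = Counter()
    for line in lines:
        for prefix in PREFIXES:
            if line.startswith(prefix):
                counts[prefix] += 1
                break
    return counts

# Illustrative lines; a real run would use open('mda_debug.log')
sample_lines = [
    "Found 4 filings for MMM\n",
    "Error extracting MDA for MMM 10-K: timeout\n",
    "No MDA text found for AOS 10-Q\n",
]
for prefix, count in summarize_log(sample_lines).items():
    print(f"{prefix}: {count}")
```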

In [1]:
import html
import os
import time

import pandas as pd
from tqdm import tqdm
from sec_api import QueryApi, ExtractorApi

# Initialize SEC API with your API key.
# Assumption: the key is supplied through an environment variable named
# SEC_API_KEY instead of being hard-coded in the notebook.
API_KEY = os.environ['SEC_API_KEY']
query_api = QueryApi(API_KEY)
extractor_api = ExtractorApi(API_KEY)

# File paths
input_file = 'data/sp500_constituents.csv'
output_file = 'data/mda_sections_2023.csv'

# Read the input CSV file with pandas
try:
    companies_df = pd.read_csv(input_file)
    print(f"Successfully loaded {len(companies_df)} companies from {input_file}")
except Exception as e:
    print(f"Error loading input file: {e}")
    # exit() is a helper for the interactive shell; SystemExit works everywhere
    raise SystemExit(1)

# Create empty DataFrame for results
results = pd.DataFrame(columns=['Ticker', 'CIK', 'FormType', 'FiledAt', 'MDA_Text'])

# Set up progress bar
total_companies = len(companies_df)
pbar = tqdm(total=total_companies, desc="Processing Companies")

# Debug log file
with open('mda_debug.log', 'w') as log_file:
    log_file.write("Starting extraction process\n")

# Process each company
company_count = 0
for _, company in companies_df.iterrows():
    ticker = company['Ticker']
    cik = company['CIK']
    
    # Update progress bar
    pbar.set_description(f"Processing {ticker} ({company_count}/{total_companies})")
    
    # Debug log
    with open('mda_debug.log', 'a') as log_file:
        log_file.write(f"\nProcessing {ticker}, CIK: {cik}\n")
    
    # Construct query for 10-K and 10-Q filings from 2023
    query = {
        "query": {
            "query_string": {
                "query": f"cik:{cik} AND formType:(\"10-K\" OR \"10-Q\") AND filedAt:[2023-01-01 TO 2023-12-31]"
            }
        },
        "from": "0",
        "size": "10",
        "sort": [{"filedAt": {"order": "desc"}}]
    }
    
    # Execute the query with rate limiting
    try:
        time.sleep(0.2)  # Rate limiting
        filings = query_api.get_filings(query)
        filing_count = len(filings['filings'])
        
        with open('mda_debug.log', 'a') as log_file:
            log_file.write(f"Found {filing_count} filings for {ticker}\n")
    except Exception as e:
        with open('mda_debug.log', 'a') as log_file:
            log_file.write(f"Error querying filings for {ticker}: {e}\n")
        pbar.update(1)
        company_count += 1
        continue
    
    # Process each filing
    for filing in filings['filings']:
        form_type = filing['formType']
        filed_at = filing['filedAt']
        filing_url = filing['linkToFilingDetails']
        
        # Determine the MDA section based on form type
        if form_type == '10-K':
            section = '7'  # Item 7 for 10-K
        elif form_type == '10-Q':
            section = 'part1item2'  # Part 1, Item 2 for 10-Q
        else:
            continue
        
        # Extract the MDA section
        try:
            time.sleep(0.2)  # Rate limiting
            mda_text = extractor_api.get_section(filing_url, section, 'text')
            
            # Debug log
            with open('mda_debug.log', 'a') as log_file:
                log_file.write(f"Extracted MDA for {ticker} {form_type} ({len(mda_text) if mda_text else 0} chars)\n")
            
            # Skip if no MDA text was found
            if not mda_text or mda_text.strip() == '':
                with open('mda_debug.log', 'a') as log_file:
                    log_file.write(f"No MDA text found for {ticker} {form_type}\n")
                continue
                
            # Decode HTML entities in the extracted text
            decoded_mda_text = html.unescape(mda_text)
            
            # Add to results DataFrame
            new_row = pd.DataFrame({
                'Ticker': [ticker], 
                'CIK': [cik], 
                'FormType': [form_type], 
                'FiledAt': [filed_at], 
                'MDA_Text': [decoded_mda_text]
            })
            
            results = pd.concat([results, new_row], ignore_index=True)
            
            # Debug log
            with open('mda_debug.log', 'a') as log_file:
                log_file.write(f"Added data for {ticker} {form_type}. Results now has {len(results)} rows\n")
            
        except Exception as e:
            with open('mda_debug.log', 'a') as log_file:
                log_file.write(f"Error extracting MDA for {ticker} {form_type}: {e}\n")
            continue
    
    # Update progress bar
    pbar.update(1)
    company_count += 1
    
    # Save a checkpoint once per 5 processed companies, after all of the
    # company's filings are handled (the final save below covers the remainder)
    if company_count % 5 == 0:
        try:
            # quoting=1 is csv.QUOTE_ALL
            results.to_csv(output_file, index=False, quoting=1, encoding='utf-8')
            with open('mda_debug.log', 'a') as log_file:
                log_file.write(f"Saved {len(results)} rows to {output_file}\n")
        except Exception as e:
            with open('mda_debug.log', 'a') as log_file:
                log_file.write(f"Error saving to CSV: {e}\n")

# Close progress bar
pbar.close()

# Final save of results to CSV
try:
    if not results.empty:
        results.to_csv(output_file, index=False, quoting=1, encoding='utf-8')
        print(f"✅ Finished processing. {len(results)} rows saved to {output_file}")
    else:
        print("❌ No data was collected. Check the debug log for details.")
except Exception as e:
    print(f"Error saving final results to CSV: {e}")

# Print summary of extraction
print(f"Processed {company_count} companies")
print(f"Extracted {len(results)} MDA sections")
print("Debug log saved to mda_debug.log")

# Test the output file exists and has data
try:
    test_df = pd.read_csv(output_file)
    print(f"Output file verification: {len(test_df)} rows read successfully")
except Exception as e:
    print(f"Error verifying output file: {e}")

Successfully loaded 503 companies from data/sp500_constituents.csv


Processing ZTS (502/503): 100%|██████████| 503/503 [23:33<00:00,  2.81s/it]  


✅ Finished processing. 1943 rows saved to data/mda_sections_2023.csv
Processed 503 companies
Extracted 1943 MDA sections
Debug log saved to mda_debug.log
Output file verification: 1943 rows read successfully


In [3]:
import pandas as pd

df = pd.read_csv("data/mda_sections_2023.csv")
df.head()

Unnamed: 0,Ticker,CIK,FormType,FiledAt,MDA_Text
0,MMM,66740,10-Q,2023-10-24T12:17:55-04:00,Item 2. Management’s Discussion and Analysis ...
1,MMM,66740,10-Q,2023-07-25T14:07:02-04:00,Item 2. Management’s Discussion and Analysis ...
2,MMM,66740,10-Q,2023-04-25T13:59:21-04:00,Item 2. Management’s Discussion and Analysis ...
3,MMM,66740,10-K,2023-02-08T12:59:49-05:00,Item 7. Management’s Discussion and Analysis ...
4,AOS,91142,10-Q,2023-10-27T13:17:53-04:00,ITEM 2 - MANAGEMENT’S DISCUSSION AND ANALYSIS...


In [4]:
df.tail()

Unnamed: 0,Ticker,CIK,FormType,FiledAt,MDA_Text
1938,ZBH,1136869,10-K,2023-02-24T09:46:33-05:00,Item 7.Management’s Discussion and Analysis o...
1939,ZTS,1555280,10-Q,2023-11-02T11:45:54-04:00,Item 2. Management’s Discussion and Analysis ...
1940,ZTS,1555280,10-Q,2023-08-08T10:38:57-04:00,Item 2. Management’s Discussion and Analysis ...
1941,ZTS,1555280,10-Q,2023-05-04T10:51:57-04:00,Item 2. Management’s Discussion and Analysis ...
1942,ZTS,1555280,10-K,2023-02-14T14:34:02-05:00,Item 7. Management’s Discussion and Analysis ...
