[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dgunning/edgartools/blob/main/notebooks/sec-filing-text-nlp-python.ipynb)

# SEC EDGAR Filing Text Extraction for NLP with Python -- Free, No API Key

Use **edgartools** to extract clean text from SEC filings for NLP and AI analysis -- completely free, no API key or paid subscription required. Get full filing text, individual sections, or structured chunks ready for machine learning pipelines and LLM context windows.

**What you'll learn:**
- Extract full filing text as plain text, markdown, or HTML
- Pull specific sections (business, risk factors, MD&A) by item number
- Get word and character counts for each section
- Search within filings for specific terms
- Access structured chunks for LLM context windows
- Compare text volumes across companies

## Install edgartools

In [None]:
!pip install -U edgartools

## Setup

The SEC requires automated tools to identify themselves in the `User-Agent` header of every request, and `set_identity` sets that value for you. Replace the email below with your own -- any valid email works.

In [None]:
import pandas as pd
from edgar import Company, set_identity

# The SEC requires you to identify yourself (any email works)
set_identity("your.name@example.com")

## Extract Filing Text in 3 Lines

edgartools can render a filing as plain text, markdown, or HTML, and it handles the parsing for you:

In [None]:
filing = Company("NVDA").get_filings(form="10-K")[0]

text = filing.text()
markdown = filing.markdown()
html = filing.html()

print(f"Plain text: {len(text):>10,} chars  ({len(text.split()):,} words)")
print(f"Markdown:   {len(markdown):>10,} chars")
print(f"HTML:       {len(html):>10,} chars")
print("\nFirst 500 chars of text:\n")
print(text[:500])
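
Extracted text may contain irregular spacing and line breaks, which can distort simple word counts and tokenization. The sketch below shows one way to tidy a string such as the one returned by `filing.text()` while keeping paragraph breaks. `normalize_text` is a hypothetical helper written for this tutorial, not part of edgartools, and the sample string is made up for illustration.

```python
import re

def normalize_text(raw: str) -> str:
    """Collapse whitespace runs while keeping paragraph breaks (hypothetical helper)."""
    # Treat non-breaking spaces as ordinary spaces
    raw = raw.replace("\u00a0", " ")
    # Blank lines (possibly containing stray spaces) separate paragraphs
    paragraphs = re.split(r"\n\s*\n", raw)
    # Collapse every run of whitespace inside a paragraph to a single space
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Drop paragraphs that became empty and rejoin with one blank line
    return "\n\n".join(p for p in cleaned if p)

# Made-up sample standing in for filing.text()
sample = "Item 1.   Business\n\n\n  We design\u00a0GPUs \n and systems.  "
print(normalize_text(sample))
```

In the notebook you could call `normalize_text(text)` on the full filing text before counting words or splitting into sentences.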

## Extract Specific Sections by Item Number

For NLP, you often need specific sections rather than the full filing. Parse a 10-K and access any section by item number:

In [None]:
tenk = filing.obj()

# Key sections for NLP analysis
business = tenk["1"]        # Item 1: Business description
risks = tenk["1A"]          # Item 1A: Risk factors
mda = tenk["7"]             # Item 7: Management Discussion & Analysis

print(f"{'Section':<20s} {'Words':>10s} {'Characters':>12s}")
print(f"{'-'*20} {'-'*10} {'-'*12}")
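# Use a distinct loop variable so the full-filing `text` from the previous cell is not overwritten
for name, section_text in [("Business", business), ("Risk Factors", risks), ("MD&A", mda)]:
    print(f"{name:<20s} {len(section_text.split()):>10,} {len(section_text):>12,}")

print("\nBusiness description (first 500 chars):\n")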
print(business[:500])
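
Not every filing is guaranteed to contain every item, so a more defensive version of the table above may be useful when looping over many filings. The sketch below assumes a missing section might come back as `None` or an empty string; that is an assumption to check against your edgartools version, not documented behavior. `section_stats` is a hypothetical helper, and the demo strings stand in for `tenk["1"]`, `tenk["1A"]`, and `tenk["7"]`.

```python
from typing import Optional

def section_stats(sections: dict[str, Optional[str]]) -> list[tuple[str, int, int]]:
    """Return (name, word_count, char_count) rows, counting missing sections as zero."""
    rows = []
    for name, body in sections.items():
        body = body or ""  # assumption: a missing section arrives as None or ""
        rows.append((name, len(body.split()), len(body)))
    return rows

# Stand-in strings; in the notebook these would be tenk["1"], tenk["1A"], tenk["7"]
demo = {"Business": "We design chips.", "Risk Factors": None, "MD&A": "Revenue grew."}
for name, words, chars in section_stats(demo):
    print(f"{name:<20s} {words:>10,} {chars:>12,}")
```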

## All Available Sections

A 10-K has over 20 sections. See what's available and how much text each contains:

In [None]:
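for item in tenk.items:
    # Fall back to an empty string in case an item has no extractable text
    section_text = tenk[item] or ""
    words = len(section_text.split())
    print(f"{item:12s} {words:>8,} words  {len(section_text):>8,} chars")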

## Search Within a Filing

Search for specific terms or topics within a filing to find relevant passages:

In [None]:
results = filing.search("artificial intelligence")
print(f"Found {len(results)} matching sections")
results
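
If you also want the words surrounding each hit, for example to build a labeled dataset, a keyword-in-context pass over the plain text is one option. The sketch below is plain Python and is not how `filing.search` works internally. `keyword_in_context` is a hypothetical helper, and the sample sentence is made up; in the notebook you would pass the string from `filing.text()`.

```python
import re

def keyword_in_context(text: str, term: str, width: int = 40) -> list[str]:
    """Return up to `width` characters either side of each case-insensitive match."""
    snippets = []
    # re.escape makes the term match literally, even if it contains regex characters
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end].strip())
    return snippets

# Made-up sample standing in for filing.text()
sample = (
    "Demand for Artificial Intelligence workloads grew. "
    "Our artificial intelligence platforms span data centers."
)
for snippet in keyword_in_context(sample, "artificial intelligence", width=15):
    print(f"...{snippet}...")
```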

## Structured Chunks for LLM Context

The chunked document breaks a filing into labeled segments, ideal for RAG pipelines or fitting within LLM context windows:

In [None]:
chunked = tenk.chunked_document
df = chunked.as_dataframe()

print(f"Total chunks: {len(df)}")
print(f"Columns: {df.columns.tolist()}\n")

# Chunks per section
sections = df[df["Item"] != ""].groupby("Item").agg(
    Chunks=("Chars", "count"),
    Total_Chars=("Chars", "sum")
).sort_values("Total_Chars", ascending=False)

sections.head(10)
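
Individual chunks are often much smaller than a model's context window, so a common next step is to pack consecutive chunks into batches under a size budget. The sketch below does this greedily with a character budget as a crude stand-in for a token limit. `pack_chunks` is a hypothetical helper, not part of edgartools, and the demo list stands in for the chunk texts (for example, the `Text` column of the dataframe).

```python
def pack_chunks(chunks: list[str], max_chars: int) -> list[str]:
    """Greedily join consecutive chunks into batches of at most max_chars characters.

    A single chunk longer than max_chars is kept as its own oversized batch.
    """
    batches: list[str] = []
    current: list[str] = []
    current_len = 0
    for chunk in chunks:
        # +2 accounts for the "\n\n" separator added when joining
        added = len(chunk) + (2 if current else 0)
        if current and current_len + added > max_chars:
            # The budget would be exceeded, so close the current batch
            batches.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(chunk)
        current.append(chunk)
        current_len += added
    if current:
        batches.append("\n\n".join(current))
    return batches

# Stand-in chunk texts of 30, 30, 50, and 10 characters
demo_chunks = ["a" * 30, "b" * 30, "c" * 50, "d" * 10]
batches = pack_chunks(demo_chunks, max_chars=70)
print([len(b) for b in batches])
```

Packing consecutive chunks keeps the filing's reading order intact, which tends to matter more for retrieval quality than filling every batch to the limit.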

## Compare Text Volumes Across Companies

Disclosure length varies widely between companies. Word counts per section give a rough first measure of disclosure depth:

In [None]:
tickers = ["NVDA", "MSFT", "AAPL", "GOOG"]
rows = []

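for ticker in tickers:
    # Loop-local names keep the NVDA objects from earlier cells intact
    latest = Company(ticker).get_filings(form="10-K")[0]
    report = latest.obj()
    rows.append({
        "Ticker": ticker,
        # Store raw integers so the columns can be sorted and compared numerically
        "Total Words": len(latest.text().split()),
        "Business": len(report["1"].split()),
        "Risk Factors": len(report["1A"].split()),
        "MD&A": len(report["7"].split()),
    })

# Apply thousands separators only at display time
pd.DataFrame(rows).set_index("Ticker").style.format("{:,}")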
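
Raw word counts favor longer filings, so it can help to express each section as a share of the filing's total words. The sketch below shows the arithmetic. `section_shares` is a hypothetical helper, and the tickers and counts are placeholders invented for illustration, not real filing statistics.

```python
def section_shares(total_words: int, section_words: dict[str, int]) -> dict[str, float]:
    """Return each section's share of the filing's total words, as a percentage."""
    if total_words <= 0:
        # Avoid division by zero for an empty filing
        return {name: 0.0 for name in section_words}
    return {
        name: round(100 * count / total_words, 1)
        for name, count in section_words.items()
    }

# Placeholder counts for two made-up tickers; NOT real filing statistics
counts = {
    "AAA": (50_000, {"Business": 5_000, "Risk Factors": 15_000, "MD&A": 10_000}),
    "BBB": (80_000, {"Business": 4_000, "Risk Factors": 20_000, "MD&A": 16_000}),
}
for ticker, (total, words_by_section) in counts.items():
    print(ticker, section_shares(total, words_by_section))
```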

## Why EdgarTools?

EdgarTools is free and open-source. Here is how extracting SEC filing text compares with a typical paid API:

**With edgartools (free, no API key):**
```python
filing = Company("NVDA").get_filings(form="10-K")[0]
text = filing.text()            # Full text
tenk = filing.obj()
tenk["1A"]                      # Risk factors section
filing.search("AI")             # Search within filing
```

**Typical paid API approach ($50+/month, API key required):**
```python
from sec_api import ExtractorApi
api = ExtractorApi(api_key="YOUR_PAID_API_KEY")
text = api.get_section(url, "1A", "text")  # One section per API call
# ... rate-limited, paid per request, no search capability
```

With edgartools, the filing is downloaded from SEC EDGAR and parsed locally, so all sections, search, and chunks are available with no per-request cost.

## Quick Reference

```python
from edgar import Company, set_identity
set_identity("your.name@example.com")

# ── Full filing text ──
filing = Company("NVDA").get_filings(form="10-K")[0]
filing.text()                          # Plain text
filing.markdown()                      # Markdown
filing.html()                          # HTML

# ── Sections by item number ──
tenk = filing.obj()
tenk["1"]                              # Business description
tenk["1A"]                             # Risk factors
tenk["7"]                              # MD&A
tenk.items                             # List all items

# ── Search ──
filing.search("artificial intelligence")  # Find matching passages

# ── Structured chunks ──
chunked = tenk.chunked_document
df = chunked.as_dataframe()            # All chunks with metadata
# Columns: Text, Table, Chars, Part, Item
```

## What's Next

You've learned how to extract SEC filing text for NLP analysis. Here are related tutorials:

- [Analyze 10-K Annual Reports](https://colab.research.google.com/github/dgunning/edgartools/blob/main/notebooks/analyze-10k-annual-report-python.ipynb)
- [Download SEC Filings in Bulk](https://colab.research.google.com/github/dgunning/edgartools/blob/main/notebooks/download-sec-filings-bulk-python.ipynb)
- [Search and Filter SEC Filings](https://colab.research.google.com/github/dgunning/edgartools/blob/main/notebooks/search-sec-filings-python.ipynb)
- [SEC EDGAR API in Python](https://colab.research.google.com/github/dgunning/edgartools/blob/main/notebooks/sec-edgar-api-python.ipynb)

**Resources:**
- [EdgarTools Documentation](https://edgartools.readthedocs.io/)
- [GitHub Repository](https://github.com/dgunning/edgartools)
- [PyPI Package](https://pypi.org/project/edgartools/)

---

## Support EdgarTools

If you found this tutorial helpful, here are a few ways to support the project:

- **Star the repo** -- [github.com/dgunning/edgartools](https://github.com/dgunning/edgartools) -- it helps others discover edgartools
- **Visit edgartools.io** -- [edgartools.io](https://www.edgartools.io/) -- for more tutorials, articles, and updates
- **Report issues** -- found a bug or have a feature idea? [Open an issue](https://github.com/dgunning/edgartools/issues)
- **Share this notebook** -- know someone who works with SEC data? Send them the Colab link

*edgartools is free, open-source, and community-driven. No API key or paid subscription required.*