<a href="https://colab.research.google.com/github/ZombieSwan/InsightEDGAR/blob/main/InsightEDGAR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **InsightEDGAR: Retrieval-Augmented Generative AI for Financial Document Insights**

**InsightEDGAR** retrieves relevant text segments from SEC filings and user-uploaded documents and passes them to an LLM to answer natural language queries. The system is built with LangChain, Hugging Face embeddings, and Chroma, and supports flexible filtering on metadata.

In [None]:
!pip install -q sec-edgar-downloader chromadb sentence-transformers



In [None]:
import os
import re
import pickle
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sec_edgar_downloader import Downloader
from google.colab import drive
from langchain.docstore.document import Document


## Create User Chunks from uploaded files
We use Tesla's Q1 2025 earnings call transcript (a .txt file) as the example.

In [None]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
file_path = "/content/drive/My Drive/transcript_q1_2025.txt"


In [None]:
with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
    text = f.read()

print(text[:1000])  # Preview first 1000 characters


Tesla, Inc. (NASDAQ:TSLA) Q1 2025 Earnings Conference Call April 22, 2025 5:30 PM ET

Company Participants

Travis Axelrod - Head of IR
Elon Musk - CEO
Vaibhav Taneja - CFO
Ashok Elluswamy - Director, Autopilot Software
Lars Moravy - VP, Vehicle Engineering
Roshan Thomas - VP, Supply Chain
Karn Budhiraj - VP, Supply Chain

Conference Call Participants

Pierre Ferragu - New Street
Emmanuel Rosner - Wolfe
Edison Yu - Deutsche Bank
George Gianarikas - Canaccord
Colin Langan - Wells Fargo
Adam Jonas - Morgan Stanley

Operator

Good afternoon, everyone, and welcome to Tesla's First Quarter 2025 Q&A Webcast. My name is Travis Axelrod, Head of Investor Relations, and I'm joined today by Elon Musk, Vaibhav Taneja, and a number of other executives. Our Q1 results were announced at about 3 p.m. Central Time in the update deck we published at the same link as this webcast.

During this call, we will discuss our business outlook and make forward looking statements. These comments are based on our 

### Implement an upload-clean-chunk function for user-uploaded files
We also add section metadata. The function searches the first 300 characters of each chunk for likely section markers:
- "ITEM 1. Business"
- "ITEM 7A. Quantitative..."
- "Q&A"
- "RISK FACTORS"
- "MANAGEMENT"

The search is case-insensitive (re.IGNORECASE).

If a marker is found, it is assigned as the chunk's "section".

If not, the section is set to "Unknown".
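The marker search can be tried in isolation. The sketch below uses a reduced version of the pattern (three markers only) and a hypothetical helper named guess_section, which is not part of the notebook. It also shows a side effect of the lazy `.+?` at the end of the ITEM alternative: only one character after the period is captured, so a heading such as "ITEM 1A. RISK FACTORS" yields the section value "ITEM 1A." rather than the full title.

```python
import re

# Reduced version of the notebook's pattern (assumption: three markers are enough to illustrate).
SECTION_RE = re.compile(r"(ITEM\s+\d+[A-Z]*\..+?|Q&A|RISK FACTORS)", re.IGNORECASE)

def guess_section(chunk, window=300):
    """Return the first section marker found in the first `window` characters."""
    match = SECTION_RE.search(chunk[:window])
    return match.group(1).strip() if match else "Unknown"

# The lazy `.+?` stops after a single character, so only the item number survives.
print(guess_section("Item 1A. Risk Factors\nYou should carefully consider the risks"))
# Case-insensitive match on a transcript-style marker.
print(guess_section("Operator: we now begin the q&a portion of the call."))
# No marker in the first 300 characters.
print(guess_section("Tesla, Inc. (NASDAQ:TSLA) Q1 2025 Earnings Conference Call"))
```

If full headings are wanted in the metadata, replacing `.+?` with a pattern that consumes the rest of the line (for example `[^\n]*`) is one option, at the cost of longer section strings.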

In [None]:
# Recommended cleaning function — works for both EDGAR and user uploads
def clean_edgar_text(text):
    # Remove all tags like <SEC-DOCUMENT>, <TEXT>, etc.
    text = re.sub(r"<[^>]+>", "", text)
    # Replace long sequences of newlines with just two
    text = re.sub(r"\n{2,}", "\n\n", text)
    # Trim leading/trailing whitespace
    return text.strip()

# Function to process any uploaded text file
def process_uploaded_file(path, ticker, year, quarter, doc_type):
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        raw_text = f.read()

    cleaned = clean_edgar_text(raw_text)

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", ".", " ", ""]
    )

    chunks = splitter.split_text(cleaned)

    user_chunks = []
    for i, chunk in enumerate(chunks):
        # Try to infer section heading for better metadata
        match = re.search(
            r"(ITEM\s+\d+[A-Z]*\..+?|Q&A|MANAGEMENT.*?|RISK FACTORS|RESULTS OF OPERATIONS|FINANCIAL STATEMENTS|DISCUSSION AND ANALYSIS)",
            chunk[:300], re.IGNORECASE
        )
        section = match.group(1).strip() if match else "Unknown"

        metadata = {
            "source": "user_upload",
            "filename": os.path.basename(path),
            "ticker": ticker,
            "doc_type": doc_type,
            "year": year,
            "quarter": quarter,
            "chunk_id": i,
            "section": section
        }

        user_chunks.append(Document(page_content=chunk, metadata=metadata))

    return user_chunks

# Example run with your file
file_path = "/content/drive/My Drive/transcript_q1_2025.txt"
user_chunks = process_uploaded_file(
    file_path,
    ticker="TSLA",
    year=2025,
    quarter="Q1",
    doc_type="earnings_call_transcript"
)

# Preview result
print(f"✅ Total chunks created: {len(user_chunks)}")
print("\n--- Sample chunk metadata ---\n")
print(user_chunks[0].metadata)
print("\n--- Sample chunk content ---\n")
print(user_chunks[0].page_content[:500])


✅ Total chunks created: 102

--- Sample chunk metadata ---

{'source': 'user_upload', 'filename': 'transcript_q1_2025.txt', 'ticker': 'TSLA', 'doc_type': 'earnings_call_transcript', 'year': 2025, 'quarter': 'Q1', 'chunk_id': 0, 'section': 'Unknown'}

--- Sample chunk content ---

Tesla, Inc. (NASDAQ:TSLA) Q1 2025 Earnings Conference Call April 22, 2025 5:30 PM ET

Company Participants

Travis Axelrod - Head of IR
Elon Musk - CEO
Vaibhav Taneja - CFO
Ashok Elluswamy - Director, Autopilot Software
Lars Moravy - VP, Vehicle Engineering
Roshan Thomas - VP, Supply Chain
Karn Budhiraj - VP, Supply Chain

Conference Call Participants

Pierre Ferragu - New Street
Emmanuel Rosner - Wolfe
Edison Yu - Deutsche Bank
George Gianarikas - Canaccord
Colin Langan - Wells Fargo
Adam Jonas


## Create EDGAR Chunks
Load the filings from EDGAR, then clean the raw text:

- Read each full-submission.txt

- Clean it with the regex cleaner

- Store the cleaned content + metadata
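As a rough illustration of the cleaning step, the sketch below strips tags and collapses blank lines the way the notebook's cleaner does. It also decodes HTML entities with html.unescape, which is an addition and not part of the notebook: the tag-stripping regex alone leaves entities such as &nbsp; in the text, as the content preview at the end of this notebook shows. The function name clean_filing_text and the sample string are made up for the example.

```python
import html
import re

def clean_filing_text(text):
    """Strip tags, decode HTML entities, and collapse runs of blank lines."""
    text = re.sub(r"<[^>]+>", "", text)     # drop <SEC-DOCUMENT>, <TEXT>, <p>, ...
    text = html.unescape(text)              # assumption: also decode &nbsp;, &amp;, ...
    text = text.replace("\xa0", " ")        # non-breaking space -> plain space
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line in a row
    return text.strip()

# Made-up sample resembling a fragment of full-submission.txt
sample = "<TEXT>\n<p>ITEM 1A.&nbsp;RISK FACTORS</p>\n\n\n\n<p>Motor Vehicles &amp; Parts</p>\n</TEXT>"
cleaned = clean_filing_text(sample)
print(cleaned)
```

Decoding entities before chunking also helps the section-marker search, since "ITEM 1A.&nbsp;RISK FACTORS" would otherwise not contain a plain space after the period.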

In [None]:
drive.mount('/content/drive', force_remount=True)


Mounted at /content/drive


In [None]:
save_dir = "/content/drive/MyDrive/edgar-filings-tsla"


In [None]:
# Step 1: Set working directory inside Google Drive
drive_path = "/content/drive/MyDrive/edgar-filings"
os.makedirs(drive_path, exist_ok=True)  # Ensure the folder exists
os.chdir(drive_path)                    # Change working directory

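# Step 2: Initialize the downloader. In sec-edgar-downloader v5+ the arguments are
# (company_name, email_address, download_folder); the name and email identify the
# requester to the SEC. "InsightEDGAR" is a placeholder name; "." is the working directory.
dl = Downloader("InsightEDGAR", "restrepomjuan@hotmail.com", ".")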

# Step 3: Download filings
print("📥 Downloading 10-Q filings for TSLA...")
dl.get("10-Q", "TSLA")

print("📥 Downloading 10-K filings for TSLA...")
dl.get("10-K", "TSLA")



📥 Downloading 10-Q filings for TSLA...
📥 Downloading 10-K filings for TSLA...


15

In [None]:
!ls -R "/content/drive/MyDrive/edgar-filings"


/content/drive/MyDrive/edgar-filings:
sec-edgar-filings

/content/drive/MyDrive/edgar-filings/sec-edgar-filings:
TSLA

/content/drive/MyDrive/edgar-filings/sec-edgar-filings/TSLA:
10-K  10-Q

/content/drive/MyDrive/edgar-filings/sec-edgar-filings/TSLA/10-K:
0000950170-22-000796  0001193125-14-069681  0001564590-19-003165
0000950170-23-001409  0001564590-15-001031  0001564590-20-004475
0001193125-11-054847  0001564590-16-013195  0001564590-21-004599
0001193125-12-081990  0001564590-17-003118  0001628280-24-002390
0001193125-13-096241  0001564590-18-002956  0001628280-25-003063

/content/drive/MyDrive/edgar-filings/sec-edgar-filings/TSLA/10-K/0000950170-22-000796:
full-submission.txt

/content/drive/MyDrive/edgar-filings/sec-edgar-filings/TSLA/10-K/0000950170-23-001409:
full-submission.txt

/content/drive/MyDrive/edgar-filings/sec-edgar-filings/TSLA/10-K/0001193125-11-054847:
full-submission.txt

/content/drive/MyDrive/edgar-filings/sec-edgar-filings/TSLA/10-K/0001193125-12-081990:
full-

After processing, each 10-Q and 10-K record will have:

- The cleaned content

- Context info for traceability (ticker, filing type, folder, path, year, quarter)
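The year and quarter in the context info come from the filing header. As a small worked example, assuming the header carries a CONFORMED PERIOD OF REPORT line in YYYYMMDD form (as Tesla's filings do), the month maps to a calendar quarter with integer division. The helper name period_to_year_quarter is hypothetical.

```python
import re

PERIOD_RE = re.compile(r"CONFORMED PERIOD OF REPORT:\s*(\d{4})(\d{2})(\d{2})")

def period_to_year_quarter(header_text):
    """Return (year, 'Qn') from the header's period-of-report line, or (None, None)."""
    match = PERIOD_RE.search(header_text)
    if not match:
        return None, None
    year, month = int(match.group(1)), int(match.group(2))
    # Months 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4
    return year, f"Q{(month - 1) // 3 + 1}"

header = (
    "CONFORMED SUBMISSION TYPE:\t10-Q\n"
    "CONFORMED PERIOD OF REPORT:\t20180930\n"
    "FILED AS OF DATE:\t\t20181102"
)
print(period_to_year_quarter(header))            # (2018, 'Q3')
print(period_to_year_quarter("no header here"))  # (None, None)
```

The result is the calendar quarter of the period end date, so a 10-K with a December period end is labeled Q4 even though it covers the full year.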

In [None]:
# Step 1: Cleaning function
def clean_edgar_text(text):
    text = re.sub(r"<[^>]+>", "", text)  # remove HTML tags
    text = re.sub(r"\n{2,}", "\n\n", text)  # collapse excessive newlines
    return text.strip()

# New Step: Robust extraction of date information
def extract_date_info(text):
    match = re.search(r"CONFORMED PERIOD OF REPORT:\s*(\d{4})(\d{2})(\d{2})", text)
    if match:
        year = int(match.group(1))
        month = int(match.group(2))
        quarter = f"Q{((month - 1) // 3) + 1}"
        return year, quarter
    return None, None

# Step 2: Load and clean all filings
def process_filings(ticker, form_type, base_dir_root="/content/drive/MyDrive/edgar-filings", count=12):
    base_dir = os.path.join(base_dir_root, "sec-edgar-filings", ticker, form_type)
    filing_folders = sorted(os.listdir(base_dir))[-count:]

    documents = []
    for folder in filing_folders:
        filing_path = os.path.join(base_dir, folder, "full-submission.txt")
        with open(filing_path, "r", encoding="utf-8", errors="ignore") as f:
            raw_text = f.read()
            cleaned = clean_edgar_text(raw_text)

            # Year and quarter extraction
            year, quarter = extract_date_info(cleaned)

            documents.append({
                "ticker": ticker,
                "filing_type": form_type,
                "folder": folder,
                "path": filing_path,
                "cleaned_text": cleaned,
                "year": year,
                "quarter": quarter
            })

    return documents


### Chunk Each Cleaned Filing

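We’ll loop through all_docs (the 12 10-Q and 3 10-K filings selected by process_filings), and:

- Split the cleaned_text into chunks

- Add metadata such as ticker, filing_type, folder, and a guessed section (found by searching for a section heading within the first 300 characters of each chunk)

Note that process_filings keeps the last count folders when sorted by accession number, which is not strictly chronological, so the selection is not guaranteed to cover the most recent 3 years.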
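To make the effect of chunk size and overlap concrete, here is a simplified stand-in for the splitter: plain fixed-size character windows that share a few characters with their neighbor. This is not how RecursiveCharacterTextSplitter works internally (it prefers to break at the configured separators, so its boundaries differ); the function chunk_with_overlap and the tiny sizes are purely illustrative.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Return fixed-size character windows; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(text, chunk_size=10, overlap=4)
for i, chunk in enumerate(chunks):
    # Each chunk would receive its own metadata dict, including chunk_id
    print({"chunk_id": i, "text": chunk})
```

With chunk_size=1000 and chunk_overlap=200, as used in this notebook, up to 200 characters can be repeated between neighboring chunks, which helps keep sentences that straddle a boundary retrievable from either side.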

In [None]:
# Step 3: Run for TSLA
docs_10q = process_filings("TSLA", "10-Q", count=12)
docs_10k = process_filings("TSLA", "10-K", count=3)
all_docs = docs_10q + docs_10k

# Step 4: Chunk the documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " ", ""]
)

edgar_chunks = []

for doc in all_docs:
    chunks = splitter.split_text(doc["cleaned_text"])

    for i, chunk in enumerate(chunks):
        match = re.search(
            r"(ITEM\s+\d+[A-Z]*\..+?|MANAGEMENT.*?|RISK FACTORS|RESULTS OF OPERATIONS|FINANCIAL STATEMENTS|DISCUSSION AND ANALYSIS)",
            chunk[:300], re.IGNORECASE
        )
        section = match.group(1).strip() if match else "Unknown"

        metadata = {
            "source": "edgar",
            "ticker": doc["ticker"],
            "filing_type": doc["filing_type"],
            "filing_id": doc["folder"],
            "chunk_id": i,
            "section": section,
            "year": doc["year"],
            "quarter": doc["quarter"]
        }

        edgar_chunks.append(Document(page_content=chunk, metadata=metadata))

# Step 5: Confirm result
print(f"✅ Total EDGAR chunks: {len(edgar_chunks)}")
print("\n--- Sample chunk metadata ---\n")
print(edgar_chunks[0].metadata)
print("\n--- Sample chunk content ---\n")
print(edgar_chunks[0].page_content[:500])


✅ Total EDGAR chunks: 112848

--- Sample chunk metadata ---

{'source': 'edgar', 'ticker': 'TSLA', 'filing_type': '10-Q', 'filing_id': '0001564590-18-026353', 'chunk_id': 0, 'section': 'Unknown', 'year': 2018, 'quarter': 'Q3'}

--- Sample chunk content ---

0001564590-18-026353.txt : 20181102
0001564590-18-026353.hdr.sgml : 20181102
20181101203856
ACCESSION NUMBER:		0001564590-18-026353
CONFORMED SUBMISSION TYPE:	10-Q
PUBLIC DOCUMENT COUNT:		97
CONFORMED PERIOD OF REPORT:	20180930
FILED AS OF DATE:		20181102
DATE AS OF CHANGE:		20181101

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			Tesla, Inc.
		CENTRAL INDEX KEY:			0001318605
		STANDARD INDUSTRIAL CLASSIFICATION:	MOTOR VEHICLES & PASSENGER CAR BODIES [3711]
		IRS NUMBER:				912197729
		STAT


### Now merge EDGAR chunks with User chunks

In [None]:
# Combine both sources into one list
all_chunks = edgar_chunks + user_chunks
print(f"✅ Total combined chunks: {len(all_chunks)}")


✅ Total combined chunks: 112950


#### Save Chunks

In [None]:
drive_path = "/content/drive/MyDrive/Generative_AI"
os.makedirs(drive_path, exist_ok=True)
chunk_file_path = os.path.join(drive_path, "all_chunks.pkl")

In [None]:
with open(chunk_file_path, "wb") as f:
    pickle.dump(all_chunks, f)

print(f"Chunks saved successfully to {chunk_file_path}")

Chunks saved successfully to /content/drive/MyDrive/Generative_AI/all_chunks.pkl


# Part 2 - Embeddings
Restart the Colab runtime and reinstall the packages.

### Flatten the List for Embeddings

all_chunks contains LangChain Document objects. Before embedding the text with a sentence-transformers model, we split each chunk into two parallel lists, texts and metadatas:

The pure text → this gets turned into an embedding.

The associated metadata → so we can later filter, search, or display info about each chunk in the vector DB.
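The split into texts and metadatas can be sketched without LangChain. The Chunk dataclass below is a hypothetical stand-in for a Document (same two attribute names), and the two sample chunks are placeholders rather than real filing text.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

sample_chunks = [
    Chunk("placeholder text from a 10-Q", {"source": "edgar", "year": 2018, "chunk_id": 0}),
    Chunk("placeholder text from a transcript", {"source": "user_upload", "year": 2025, "chunk_id": 0}),
]

# Two parallel lists: position i in `texts` belongs to position i in `metadatas`
texts = [c.page_content for c in sample_chunks]
metadatas = [c.metadata for c in sample_chunks]

# The metadata stays usable on its own, e.g. to find which texts came from EDGAR
edgar_positions = [i for i, m in enumerate(metadatas) if m["source"] == "edgar"]
print(len(texts), len(metadatas), edgar_positions)
```

Keeping the lists index-aligned is what lets the vector store attach the right metadata to each embedding.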

In [None]:
!pip install -U langchain langchain-openai langchain-chroma python-dotenv
!pip install -q sentence-transformers chromadb
!pip install -U langchain-community
!pip install lark

import os
import pickle
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer
from langchain_community.embeddings import HuggingFaceEmbeddings  # langchain.embeddings import path is deprecated
from langchain_core.documents import Document
from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever




#### Load all_Chunks

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)


Mounted at /content/drive


In [None]:
chunk_file_path = "/content/drive/MyDrive/Generative_AI/all_chunks.pkl"


In [None]:
with open(chunk_file_path, "rb") as f:
    all_chunks = pickle.load(f)

print(f"Chunks loaded successfully. Total number of chunks: {len(all_chunks)}")


Chunks loaded successfully. Total number of chunks: 112950


In [None]:
# Extract text content & metadata for the first 1000 chunks only
texts = [chunk.page_content for chunk in all_chunks[:1000]]
metadatas = [chunk.metadata for chunk in all_chunks[:1000]]

# Quick sanity check
print(f"Total texts: {len(texts)}")
print(f"Total metadatas: {len(metadatas)}")
print("\nSample metadata:\n", metadatas[1])


Total texts: 1000
Total metadatas: 1000

Sample metadata:
 {'source': 'edgar', 'ticker': 'TSLA', 'filing_type': '10-Q', 'filing_id': '0001564590-18-026353', 'chunk_id': 1, 'section': 'Unknown', 'year': 2018, 'quarter': 'Q3'}


In [None]:
# Set up Google Drive directory for Chroma DB
persist_directory = "/content/drive/MyDrive/chroma_db_insightedgar06"
os.makedirs(persist_directory, exist_ok=True)


In [None]:
# Initialize the embedding model (optimized for semantic search)
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")


In [None]:
# Combine texts with their metadata into LangChain Document format
docs = [
    Document(page_content=text, metadata=metadata)
    for text, metadata in zip(texts, metadatas)
]

print(f"✅ Created {len(docs)} LangChain Documents.")

✅ Created 1000 LangChain Documents.


In [None]:
# Create Chroma vector database and save automatically (no explicit persist required)
vectordb = Chroma.from_documents(
    documents=docs,
    embedding=embedding_model,
    persist_directory=persist_directory
)

print("✅ Vector DB stored and persisted to Google Drive.")

✅ Vector DB stored and persisted to Google Drive.


In [None]:
!ls -lh "/content/drive/MyDrive/chroma_db_insightedgar06"


total 11M
-rw------- 1 root root  11M May 30 12:06 chroma.sqlite3
drwx------ 2 root root 4.0K May 30 12:06 eabe34b8-9360-4423-8cbe-9da9ff2f7354


In [None]:
# Reload vector DB from the saved directory
vectordb = Chroma(
    embedding_function=embedding_model,
    persist_directory=persist_directory
)

print("✅ Vector DB loaded.")
print(f"🔢 Total chunks: {vectordb._collection.count()}")


✅ Vector DB loaded.
🔢 Total chunks: 1000


## So far we:
- Ingested and chunked EDGAR filings and user-uploaded documents.

- Generated embeddings for the first 1000 chunks using BAAI/bge-base-en-v1.5.

- Stored those embeddings in a persistent Chroma vector store in Google Drive.
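Conceptually, a vector store answers a query by embedding it and ranking the stored embeddings by similarity. The toy sketch below uses hand-written 3-dimensional vectors and cosine similarity in pure Python; the metric, the vectors, and the function names are illustrative assumptions and do not reflect how Chroma is configured or implemented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, store, k=2):
    """Rank (vector, metadata) pairs by similarity to the query and keep the best k."""
    scored = [(cosine(query_vec, vec), meta) for vec, meta in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Toy "embeddings"; real bge-base vectors have 768 dimensions
store = [
    ([1.0, 0.0, 0.0], {"chunk_id": 0}),
    ([0.0, 1.0, 0.0], {"chunk_id": 1}),
    ([0.7, 0.7, 0.0], {"chunk_id": 2}),
]
results = top_k([1.0, 0.1, 0.0], store, k=2)
for score, meta in results:
    print(round(score, 3), meta)
```

The metadata travels with each vector, which is what allows retrieved chunks to be traced back to a filing, year, and quarter.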

# Set Up the RAG Pipeline

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
env_path = "/content/drive/MyDrive/Generative_AI/edgar.env"
load_dotenv(env_path)

# Load OpenAI API key
openai_key = os.getenv("OPENAI_API_KEY")
print("🔐 Key loaded:", openai_key[:6] + "..." if openai_key else "❌ Key not found")


🔐 Key loaded: sk-pro...


### Load Chroma Vector DB

In [None]:
# Load vector DB from Drive
persist_directory = "/content/drive/MyDrive/chroma_db_insightedgar06"
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")

vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding_model
)


### Load LLM

In [None]:
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0.2,
    openai_api_key=openai_key
)


### Create SelfQueryRetriever

LangChain's SelfQueryRetriever uses an LLM (here GPT-3.5) to convert a natural language query into a structured query: a search string for the vector store plus a filter over the chunk metadata.
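To illustrate what a structured query does once the LLM has produced it, the sketch below applies an equality filter to chunk metadata before any similarity search would run. The StructuredQuerySketch class and apply_filters function are hypothetical stand-ins written for this example and are not LangChain classes; the sample chunks are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredQuerySketch:
    """Hypothetical stand-in: a search string plus equality filters on metadata."""
    query: str
    filters: dict = field(default_factory=dict)

def apply_filters(chunks, structured):
    """Keep only chunks whose metadata matches every attribute/value pair."""
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(attr) == value
               for attr, value in structured.filters.items())
    ]

chunks = [
    {"text": "placeholder A", "metadata": {"ticker": "TSLA", "filing_type": "10-Q"}},
    {"text": "placeholder B", "metadata": {"ticker": "AAPL", "filing_type": "10-K"}},
    {"text": "placeholder C", "metadata": {"ticker": "TSLA", "filing_type": "10-K"}},
]

# "What are the main risk factors for Tesla?" -> search string + ticker filter
structured = StructuredQuerySketch(query="main risk factors Tesla", filters={"ticker": "TSLA"})
candidates = apply_filters(chunks, structured)
print([c["text"] for c in candidates])
```

Only the surviving candidates would then be ranked against the search string by embedding similarity, which is why accurate metadata matters as much as good embeddings.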

In [None]:
# Describe what the documents are about
document_content_description = "SEC financial filings from EDGAR 10-Ks and 10-Qs and user-uploaded documents like call transcripts"

# Define the metadata fields the retriever should understand
metadata_field_info = [
    AttributeInfo(
        name="ticker",
        description="Company ticker symbol, such as TSLA, AAPL, etc.",
        type="string"
    ),
    AttributeInfo(
        name="filing_type",
        description="EDGAR filing form type such as 10-K or 10-Q",
        type="string"
    ),
    AttributeInfo(
        name="quarter",
        description="Financial quarter of the report (Q1, Q2, Q3, Q4)",
        type="string"
    ),
    AttributeInfo(
        name="year",
        description="Year of the report or transcript (e.g., 2025)",
        type="integer"
    ),
    AttributeInfo(
        name="section",
        description="Section of the document, such as 'Item 1A. Risk Factors', 'MD&A', or 'earnings call transcript'",
        type="string"
    ),
    AttributeInfo(
        name="source",
        description="Whether the document is from EDGAR or user upload",
        type="string"
    )
]




In [None]:
# Create the retriever
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectordb,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    search_kwargs={"k": 5}
)

# Access the retriever's internal query constructor (LLM prompt + output parser)
query_constructor = retriever.query_constructor

# Inspect the structured query generated for a test question
query_text = "What are the main risk factors for Tesla?"
structured_query = query_constructor.invoke({"query": query_text})

print(structured_query)


query='main risk factors Tesla' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='ticker', value='TSLA') limit=None


In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

query = "What are the main risk factors for Tesla?"
result = qa_chain.invoke({"query": query})

# Print result
print("\n📄 Answer:\n", result["result"])

# Print document sources (optional for debugging)
for i, doc in enumerate(result["source_documents"]):
    print(f"\n📚 Source {i+1}:")
    print("Metadata:", doc.metadata)
    print("Content Preview:", doc.page_content[:500], "...")



📄 Answer:
 The main risk factors for Tesla include potential delays or complications in designing, manufacturing, and delivering new vehicles and products, such as the Model 3, energy storage products, and the Solar Roof. Additionally, there are risks related to disasters that could damage facilities or operations, volatility in the trading price of Tesla's common stock, the need to increase Supercharger stations globally, and compliance with environmental, health, and safety regulations.

📚 Source 1:
Metadata: {'chunk_id': 409, 'section': 'ITEM 1A.', 'ticker': 'TSLA', 'filing_type': '10-Q', 'source': 'edgar', 'year': 2018, 'quarter': 'Q3', 'filing_id': '0001564590-18-026353'}
Content Preview: &nbsp;
ITEM 1A. RISK FACTORS
You should carefully consider the risks described below together with the other information set forth in this report, which could materially affect our business, financial condition and future results. The risks described below are not the only risks facing our compa