### Final Project - IPO Watch Agent
#### By - Yatharth Vardan

## Project Files Overview

This project consists of multiple interconnected Python modules that work together to create an intelligent IPO analysis system:

### Core Modules

**filings.py** - SEC Filing Extraction
- Extracts HTML content from SEC submission files
- Navigates complex SEC EDGAR file structure
- Isolates specific filing types (S-1, 10-Q, 10-K)
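
The isolation step can be sketched as below. This is a minimal illustration, not the code in `filings.py`: it assumes the submission file wraps each document in `<DOCUMENT>` ... `</DOCUMENT>` with a `<TYPE>` line and a `<TEXT>` body, and the function name `extract_filing_body` is hypothetical.

```python
import re

# Assumption: each document in a full-submission file looks like
# <DOCUMENT> <TYPE>S-1 ... <TEXT> body </TEXT> </DOCUMENT>
DOCUMENT_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
TYPE_RE = re.compile(r"<TYPE>([^\s<]+)")
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def extract_filing_body(submission_text, filing_type="S-1"):
    """Return the <TEXT> body of the first document of the given type, or None."""
    for match in DOCUMENT_RE.finditer(submission_text):
        block = match.group(1)
        type_match = TYPE_RE.search(block)
        if type_match is None or type_match.group(1) != filing_type:
            continue  # skip exhibits and other filing types
        text_match = TEXT_RE.search(block)
        if text_match is not None:
            return text_match.group(1).strip()
    return None


# Made-up sample with one exhibit and one S-1 document
sample = """<DOCUMENT>
<TYPE>EX-99
<TEXT>
exhibit text
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>S-1
<SEQUENCE>1
<TEXT>
<html><body>Prospectus</body></html>
</TEXT>
</DOCUMENT>"""

print(extract_filing_body(sample))
```

Matching on the exact type string means amended filings such as `S-1/A` are skipped in this sketch; whether the project treats them the same way is not stated in the notebook.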

**retriever.py** - RAG Document Retrieval System  
- Manages Chroma vector database
- Generates embeddings using Ollama Mistral
- Provides semantic search over S-1 documents
- Supports metadata filtering for company-specific queries

**ipo_score.py** - Multi-Agent Analysis System
- Financial Analysis Agent: Evaluates revenue, profitability, growth
- Risk Analysis Agent: Identifies and grades risk factors
- Business Analysis Agent: Analyzes business model and impact
- Final Scoring Agent: Synthesizes analyses into investment recommendation

**workflow.py** - Pipeline Orchestrator
- Manages end-to-end workflow
- Coordinates data acquisition, processing, and analysis
- Supports two modes: new companies and existing companies

**utils.py** - Helper Functions
- Date extraction from natural language using LLM
- IPO data fetching from public sources
- Various utility functions

**extract_s1_sections.py** - S-1 Section Parser
- Extracts specific sections from S-1 HTML
- Identifies table of contents
- Parses structured sections (Business, Risk Factors, Financials, etc.)
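
A rough sketch of heading-based section splitting follows. The title list and the function `split_sections` are assumptions made for illustration; the actual parser's rules are not shown in this notebook.

```python
import re

# Hypothetical title list; the real parser's list is not shown in the notebook
SECTION_TITLES = ["PROSPECTUS SUMMARY", "RISK FACTORS", "USE OF PROCEEDS", "BUSINESS"]


def split_sections(text, titles=SECTION_TITLES):
    """Map each title to the text between its heading line and the next known heading."""
    alternatives = "|".join(re.escape(t) for t in titles)
    # A heading is a line that contains only a known title (plus spaces or tabs)
    pattern = re.compile(r"^[ \t]*(" + alternatives + r")[ \t]*$", re.MULTILINE)
    matches = list(pattern.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1)] = text[m.end():end].strip()
    return sections


sample_s1 = """RISK FACTORS
We have a limited operating history.
BUSINESS
We sell software.
"""
sections = split_sections(sample_s1)
print(sections["RISK FACTORS"])
```

If a title appears twice (for example once in the table of contents and once in the body), the later occurrence overwrites the earlier one in this sketch.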

This notebook demonstrates all these modules working together!


#### Limitations for the project

1. Ingesting a new company takes about 30 minutes, because the retriever has to embed the entire S-1 filing.
2. For this run, the company data is therefore pre-loaded in the retriever.
3. News sentiment analysis turned out to be a major blocker: every API tried capped the amount of article content it returned, which left too little text for meaningful sentiment analysis.
4. The IPO analysis in this project is therefore based only on the S-1 filing and the information it contains.

#### Resources Used 
1. SuperApp - built by Instabase
2. ChatGPT - for coding assistance
3. Claude - for dummy data testing
4. Kaggle - for better understanding of the workflow
5. LLM models:
    - Embeddings - Ollama Mistral (faster and lightweight for embeddings)
    - Query and reasoning - Ollama llama2:13b (larger model for better reasoning)

---

## 📋 Project Overview

This notebook demonstrates the **IPO Watch Agent** - an intelligent multi-agent system that analyzes Initial Public Offerings (IPOs) using:

- **SEC S-1 Filings**: Official IPO registration documents
- **RAG (Retrieval-Augmented Generation)**: For efficient document retrieval
- **Multi-Agent LLM System**: Specialized agents for financial, risk, and business analysis

### 🏗️ System Architecture

The system consists of several interconnected modules:

1. **`filings.py`**: Extracts HTML content from SEC filing documents
2. **`retriever.py`**: RAG system for document storage and retrieval using vector embeddings
3. **`ipo_score.py`**: Multi-agent system with specialized analysis agents:
   - Financial Analysis Agent
   - Risk Analysis Agent
   - Business Analysis Agent
   - Final Scoring Agent
4. **`workflow.py`**: Orchestrator that manages the complete pipeline
5. **`utils.py`**: Helper functions for date extraction and IPO data fetching

### 🔄 Workflow Pipeline

**Mode 1: New Companies (flag=True)**
```
User Query → Date Extraction (LLM Agent) → Get IPO Companies → Download S-1 Filings 
→ Ingest into RAG → Create Scoring Agents → Multi-Agent Analysis → Investment Score
```

**Mode 2: Existing Companies (flag=False)**
```
User Query → Load Pre-ingested Companies → Create Scoring Agents 
→ Multi-Agent Analysis → Investment Score
```
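
The difference between the two modes can be summarised in a small sketch. The step labels mirror the diagrams above; they are not the method names used in `workflow.py`, and `plan_steps` is a hypothetical helper.

```python
def plan_steps(flag):
    """Return the ordered pipeline steps for the given mode flag."""
    # Both modes end with the same analysis stages
    analysis = ["create_scoring_agents", "multi_agent_analysis", "investment_score"]
    if flag:
        # Mode 1: acquire and ingest new filings before analysis
        acquisition = ["date_extraction", "get_ipo_companies",
                       "download_s1_filings", "ingest_into_rag"]
    else:
        # Mode 2: reuse companies already in the vector store
        acquisition = ["load_pre_ingested_companies"]
    return acquisition + analysis


print(plan_steps(False))
```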

---


## 🎯 Step 1: Define User Query

The user query is the natural language input that drives the entire system. The query can specify:
- Time period (e.g., "January 2025")
- Type of analysis requested
- Specific companies (optional)

The **Date Extraction Agent** (in `utils.py`) will parse this query to identify months and years.
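
To make the target output concrete, here is a rule-based stand-in that produces the same `{'months': [...], 'years': [...]}` shape. It uses regular expressions instead of an LLM, so it is not the agent in `utils.py`; the name `extract_date_rule_based` is hypothetical.

```python
import re

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}


def extract_date_rule_based(query):
    """Find month names and four-digit years in a query (regex stand-in, no LLM)."""
    words = re.findall(r"[a-z]+", query.lower())
    months = sorted({MONTHS[w] for w in words if w in MONTHS})
    years = sorted({int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", query)})
    return {"months": months, "years": years}


parsed = extract_date_rule_based(
    "Provide me some IPOs and your analysis for the month of January 2025")
print(parsed)  # → {'months': [1], 'years': [2025]}
```

A rule like this misreads the verb "may" as the month of May and cannot handle phrases such as "last quarter", which is one reason to delegate the parsing to an LLM.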


## 📦 Step 2: Import Dependencies

### Core Libraries
- **LangChain**: Framework for building LLM applications
  - `Document`: Document data structure
  - `RecursiveCharacterTextSplitter`: Chunks documents for embedding
  - `Chroma`: Vector database for storing embeddings
  - `RetrievalQA`: Question-answering chain with retrieval
  
- **Ollama**: Local LLM inference
  - `OllamaEmbeddings`: Generate embeddings using Ollama models
  - `OllamaLLM`: Run LLM inference locally
  
- **SentenceTransformers**: Generate sentence embeddings


In [1]:
query = "Provide me some IPOs and your analysis for the month of January 2025"

### Custom Modules

- **`workflow.py`**: Contains the `WorkFlow` class - the main orchestrator
  - Manages the complete IPO analysis pipeline
  - Handles data acquisition, processing, and analysis
  - Coordinates between different agents
  
- **`retriever.py`**: Contains the `RAGRetriever` class
  - Manages vector store for document embeddings
  - Provides document retrieval functionality
  - Handles ingestion of S-1 filings
  - Supports metadata filtering for company-specific queries


In [2]:
from sentence_transformers import SentenceTransformer
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from langchain_community.embeddings import OllamaEmbeddings
from langchain_ollama import OllamaLLM



## ⚙️ Step 3: Initialize System Components

### Initialize Large Language Model

We use **Ollama Llama2:13b** for reasoning and analysis tasks:
- Larger 13B parameter model for better reasoning capability
- Used by all agent functions for analysis
- Handles complex tasks like financial analysis, risk assessment, and scoring


In [3]:
from workflow import *
from retriever import *

### Initialize RAG Retriever

The **RAGRetriever** (from `retriever.py`) manages document storage and retrieval:
- Uses **Ollama Mistral** for embeddings (faster and lightweight)
- Stores embeddings in **Chroma** vector database
- Enables semantic search over S-1 filing documents
- Supports metadata filtering for company-specific queries

**Key Methods**:
- `load_vectorstore()`: Load existing vector store from disk
- `save_vectorstore()`: Persist vector store to disk
- `ingest_s1()`: Process and store S-1 filing text
- `query_vectorstore_with_filter()`: Retrieve relevant documents with filters
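
The idea behind metadata-filtered retrieval (restrict to one company first, then rank by relevance) can be shown with a toy example. This sketch does not use Chroma or embeddings: it swaps in a simple shared-word count as the relevance score, and the function name, data layout, and company names are all made up for illustration.

```python
def query_with_filter(chunks, question, company_name, k=2):
    """Keep chunks whose metadata matches the company, then rank by shared words."""
    question_words = set(question.lower().split())
    # Step 1: metadata filter, so other companies' chunks can never be returned
    candidates = [c for c in chunks if c["metadata"]["company_name"] == company_name]
    # Step 2: rank the remaining chunks (word overlap stands in for vector similarity)
    ranked = sorted(
        candidates,
        key=lambda c: len(question_words & set(c["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    {"text": "Revenue grew in fiscal 2024", "metadata": {"company_name": "Alpha Corp"}},
    {"text": "Risk factors include competition", "metadata": {"company_name": "Alpha Corp"}},
    {"text": "Revenue declined sharply", "metadata": {"company_name": "Beta Inc"}},
]
top = query_with_filter(chunks, "what is the revenue", "Alpha Corp", k=1)
print(top[0]["text"])
```

Filtering before ranking matters here because several S-1 filings use near-identical language; without the filter, a question about one company could retrieve another company's text.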


In [4]:
llm = OllamaLLM(model = "llama2:13b")

### Load Pre-existing Vector Store

Load previously ingested S-1 filings from the persistent Chroma database.
This allows us to reuse embeddings from previous sessions without re-processing.


In [5]:
rag = RAGRetriever()

  self.embeddings = OllamaEmbeddings(model="llama2:13b")
  self.vectorstore = Chroma(


### Persist Vector Store to Disk

Save the current state of the vector store to ensure data persistence across sessions.


In [6]:
rag.load_vectorstore()

### Check Pre-loaded Companies

View the list of companies whose S-1 filings are already ingested in the vector store.
These companies can be analyzed without downloading and re-processing their filings.


In [7]:
rag.save_vectorstore()

  self.vectorstore.persist()


---

## 🤖 DEMONSTRATION 1: Analyze Pre-loaded Companies (Mode 2)

### Create Workflow Agent - Existing Companies Mode

**Parameters**:
- `query`: User's natural language request
- `rag`: RAG retriever instance with pre-loaded data
- `llm`: Language model for analysis
- `flag=False`: **Use existing companies** (don't download new filings)

**What happens internally** (see `workflow.py`):
1. ❌ Skip date extraction, company fetching, and downloading
2. ✅ Load companies from `rag.ingested_companies`
3. ✅ Create `Ipo_Score` agent for each company
4. ✅ Execute multi-agent analysis pipeline

This mode is faster because it uses pre-processed data.


In [7]:
rag.ingested_companies

[{'company_name': 'Health In Tech, Inc.', 'symbol': ' HIT'},
 {'company_name': 'Translational Development Acquisition Corp.',
  'symbol': ' TDACU'},
 {'company_name': 'Range Capital Acquisition Corp.', 'symbol': ' RANGU'},
 {'company_name': 'Mountain Lake Acquisition Corp.', 'symbol': ' MLACU'},
 {'company_name': 'INFINITY NATURAL RESOURCES, INC.', 'symbol': 'INR'}]

### Execute Multi-Agent Analysis Pipeline

**`Agent.analyse_ipos()`** runs the complete analysis workflow:

#### For Each Company:

**Step 1: Financial Analysis Agent** (`ipo_score.py` - `analyse_financial_information()`)
- Queries RAG for financial data (revenue, losses, projections)
- LLM analyzes profitability and growth
- Outputs: Profit status, growth reasoning, financial rating (Good/Bad)

**Step 2: Risk Analysis Agent** (`ipo_score.py` - `analyse_risks()`)
- Queries RAG for risk factors and market concerns
- LLM prioritizes and evaluates risks
- Outputs: Highest risk, market understanding, risk grade (Good/Bad)

**Step 3: Business Analysis Agent** (`ipo_score.py` - `analyse_business_information()`)
- Queries RAG for business operations and impact
- LLM summarizes business model
- Outputs: What company does, impact, success assessment

**Step 4: Final Scoring Agent** (`ipo_score.py` - `generate_ipo_score()`)
- Synthesizes all three analyses
- LLM produces final investment recommendation
- Outputs: Final rating (Good/Bad) with explanation

**Returns**: List of `{'company_name': ..., 'score_summary': ...}`
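
The four steps can be sketched with plain callables standing in for the retriever and the LLM. The class `IpoScoreSketch`, its prompts, and the stub functions are hypothetical; they only illustrate how three topic summaries feed a final scoring prompt, not how `ipo_score.py` is written.

```python
class IpoScoreSketch:
    """Hypothetical stand-in for the notebook's Ipo_Score agent."""

    def __init__(self, company_name, retrieve, llm):
        self.company_name = company_name
        self.retrieve = retrieve  # callable: topic -> context text
        self.llm = llm            # callable: prompt -> answer text

    def _analyse(self, topic):
        context = self.retrieve(topic)
        return self.llm(f"Analyse the {topic} of {self.company_name}:\n{context}")

    def score(self):
        # Steps 1-3: one specialised analysis per topic
        summaries = {t: self._analyse(t) for t in ("financials", "risks", "business")}
        # Step 4: the scoring prompt sees all three summaries
        prompt = "Rate this IPO as Good or Bad.\n" + "\n".join(
            f"{t}: {s}" for t, s in summaries.items())
        return {"company_name": self.company_name, "score_summary": self.llm(prompt)}


calls = []


def fake_llm(prompt):
    """Stub LLM that records each prompt and returns a numbered answer."""
    calls.append(prompt)
    return f"answer {len(calls)}"


result = IpoScoreSketch("Example Co", lambda topic: f"text about {topic}", fake_llm).score()
print(result)
```

Because the final prompt is built only from the three summaries, empty retrieval results in the earlier steps propagate directly into the final rating.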


In [9]:
Agent = WorkFlow(query,rag,llm,False)

### View Analysis Results

Display the IPO scores generated by the multi-agent system.
Each score includes:
- Company name
- Investment recommendation (Good/Bad)
- Brief explanation from the final scoring agent
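
Since each `score_summary` is free text, a small helper can pull out the rating for tabulation. This sketch assumes the rating appears in double quotes, as it does in the outputs shown in this notebook; `extract_rating` is a hypothetical helper, not part of the project.

```python
import re


def extract_rating(score_summary):
    """Return 'GOOD' or 'BAD' if the summary contains a quoted rating, else None."""
    match = re.search(r'"(GOOD|BAD)"', score_summary, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None


bad_example = 'Based on the information provided, I would rate this IPO as "BAD" to invest in.'
good_example = 'Based on the provided information, I would rate the IPO as "GOOD" for investment.'
print(extract_rating(bad_example), extract_rating(good_example))
```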


In [10]:
ipo_scores = Agent.analyse_ipos()

  chain = LLMChain(llm = self.llm, prompt = prompt)
  answer = chain.run({"risks": self.risks_summary, "finance_situation":self.financial_summary,"business":self.business_summary})


### Save Results to File

Export the IPO analysis results to `IPO SCORING.txt` for documentation and review.


In [12]:
ipo_scores

[{'company_name': 'Health In Tech, Inc.',
  'score_summary': '\n    Based on the information provided, I would rate this IPO as "BAD" to invest in. The lack of any risks, financial information, or business details indicates that the company has not disclosed enough information for investors to make an informed decision. Additionally, the lack of any financials and business details suggests that the company may not be transparent about its operations and financial health, which can be a red flag for potential investors. Therefore, I would advise investors to exercise caution and thoroughly research the company before making any investment decisions.'},
 {'company_name': 'Translational Development Acquisition Corp.',
  'score_summary': '\nAs an expert scoring agent for IPOs, based on the information provided, I would rate this IPO as "BAD" to invest in.\n\nThe reason for this rating is that there are no risks, financial situation, or business information provided, which suggests that the

In [13]:
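# Write each company's name, a separator line, and its score summary
with open("IPO SCORING.txt", 'w') as f:
    for score in ipo_scores:
        f.write(score['company_name'] + '\n')
        f.write('---'*10 + '\n')
        f.write(score['score_summary'] + '\n\n')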

---

## 🔄 DEMONSTRATION 2: Download and Analyze New Companies (Mode 1)

### Create Workflow Agent - New Companies Mode

**Parameters**:
- `query`: Same user query
- `rag`: RAG retriever instance
- `llm`: Language model for analysis
- `flag=True`: **Download and process NEW companies**

**Complete Pipeline** (see `workflow.py`):

#### Phase 1: Data Acquisition
1. **Date Extraction Agent** (`extract_date()`)
   - Uses LLM to parse "January 2025" from query
   - Extracts: `{'months': [1], 'years': [2025]}`

2. **Get Companies** (`get_companies_information()`)
   - Calls `utils.py` - `get_ipos_month_year()`
   - Fetches companies that filed IPOs in specified period
   - Returns company list with names, symbols, dates

3. **Download S-1 Filings** (`download_s1_filings()`)
   - Uses `sec_edgar_downloader` library
   - Downloads S-1 filings from SEC EDGAR database
   - Stores in `sec-edgar-filings/{symbol}/S-1/`
   - Handles failures gracefully

#### Phase 2: Data Processing
4. **Extract & Ingest** (`ingest_s1_files_in_retriever()`)
   - Uses `filings.py` to extract HTML from submission files
   - Converts HTML to plain text
   - Chunks text using `RecursiveCharacterTextSplitter`
   - Generates embeddings and stores in Chroma vector database
   - **⚠️ Note**: Takes ~30 minutes per company
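
The chunking step can be illustrated with fixed-size windows that overlap. This is a simplification: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, whereas this sketch cuts at fixed character offsets. The function name and the default sizes are assumptions, not the values used in the project.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


pieces = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(pieces)  # → ['abcd', 'defg', 'ghij']
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of embedding some text twice, which adds to the ingestion time noted above.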

#### Phase 3: Analysis
5. **Create Agent Objects** (`generate_ipo_objects()`)
   - Instantiates `Ipo_Score` agent for each company

6. **Execute Multi-Agent Analysis** (`generate_ipo_scores()`)
   - Runs all 4 analysis agents (Financial, Risk, Business, Scoring)

**This mode demonstrates the complete end-to-end pipeline!**


In [14]:
f.close()

### Execute Complete Pipeline with New Companies

**⚠️ WARNING**: This will:
- Download ~70 S-1 filings from SEC EDGAR (takes several minutes)
- Process and ingest first company's filing (takes ~30 minutes)
- Run multi-agent analysis

**Expected Output**:
- Download logs for each company
- Ingestion progress
- Final IPO scores with investment recommendations

The output below shows the download process in action!


In [10]:
Agent2 = WorkFlow(query, rag, llm, True)

### Persist Newly Ingested Data

Save the vector store with newly added companies.
This ensures the downloaded and processed data is available for future sessions.


In [None]:
ipo_Scores_New = Agent2.analyse_ipos()

---

## 🎭 DEMONSTRATION 3: Analyze ALL Companies (Including Newly Added)

### Create Agent for Complete Analysis

Now that we've added new companies to the vector store, we can analyze **ALL companies** (both pre-existing and newly added) using `flag=False`.

For a quick run, I added an option to ingest only one additional new company, since ingesting a single S-1 filing takes 30 minutes to 1 hour.

**This demonstrates**:
- The persistence of the vector store
- Ability to query across all ingested companies
- Scalability of the multi-agent system


In [12]:
rag.save_vectorstore()

  self.vectorstore.persist()


In [8]:
Agent3 = WorkFlow(query,rag,llm,False)

### View Complete Results

Display investment recommendations for all analyzed companies.
This represents the output of the complete multi-agent IPO analysis system!


In [10]:
ipos= Agent3.analyse_ipos()

### Export Final Results

Save comprehensive IPO analysis results to `IPO SCORING 2.txt`.
This file contains investment recommendations for all companies analyzed in this session.


In [11]:
ipos

[{'company_name': 'Health In Tech, Inc.',
  'score_summary': '\n Based on the provided information, I would rate the IPO as "GOOD" for investment. Here\'s a brief explanation of my reasoning:\n\n Financials: The company has a strong track record of profitability, with a net income of $1,481,254 in the two and five months ended May 31, 2023. Its commitment to investing in technology and innovation positions it well for future growth. However, the company does face risks related to the complexity and fragmentation of the healthcare insurance market, as well as the potential for reputational harm due to negative publicity or cyber-attacks. Overall, I believe that the company\'s financials are strong and that it has significant growth potential in the future.\n\n Risks: While there are several risks facing the company, such as its reliance on insurance carriers and the potential for reputational harm, the company has a strong understanding of the complexities of the healthcare insurance ma

In [13]:
f = open("IPO SCORING 2.txt",'w')
for score in ipos:
    f.write(score['company_name'])
    f.write('---'*10)
    f.write('\n')
    f.write(score['score_summary'])
    f.write('\n\n')

In [14]:
f.close()

## Summary of Demonstrations

### What We Demonstrated

1. Demonstration 1: Analyzed pre-loaded companies using existing vector store data
2. Demonstration 2: Complete end-to-end pipeline with new companies  
3. Demonstration 3: Analyzed all companies (old and new)

### Multi-Agent System Components

Four Specialized Agents in ipo_score.py:
- Financial Analysis Agent: Evaluates profitability and growth
- Risk Analysis Agent: Identifies and prioritizes risks
- Business Analysis Agent: Analyzes business model
- Final Scoring Agent: Produces investment recommendation

### Key Technologies

- LLM: Ollama Llama2:13b for reasoning and Mistral for embeddings
- Vector DB: Chroma for document storage and retrieval
- Framework: LangChain for RAG pipeline
- Data Source: SEC EDGAR S-1 filings

