![image](img/librarian_bot.png)

# RAG bot for investor information

Specialized bot focusing on analysing financial documents from Investor Relations webpages. 
Comes together with a web crawler spider to gather documents quickly.

This notebook creates a personal RAG bot. It uses the ./kb directory to store the files that we want to include in the RAG. Subdirectories denote categories for the files.
**Important: only one level of subdirectories will be used for the categories**
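
As a rough illustration of the category rule above, the sketch below groups files by the name of their top-level subdirectory, so files nested deeper still fall under the first-level folder. The helper name `files_by_category` and the throwaway folder names are assumptions made for this example; they are not part of the notebook.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

def files_by_category(kb_root):
    """Group files by the name of their top-level subdirectory under kb_root."""
    categories = defaultdict(list)
    root = Path(kb_root)
    if not root.is_dir():
        return {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        # Files nested deeper still belong to the top-level folder's category
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                categories[folder.name].append(path)
    return dict(categories)

# Usage: build a throwaway knowledge base and count files per category
with tempfile.TemporaryDirectory() as tmp:
    kb = Path(tmp) / "kb"
    (kb / "reports" / "2024").mkdir(parents=True)
    (kb / "press").mkdir()
    (kb / "reports" / "annual.txt").write_text("annual report")
    (kb / "reports" / "2024" / "q1.md").write_text("first quarter")
    (kb / "press" / "release.txt").write_text("press release")
    grouped = files_by_category(kb)
    summary = {category: len(files) for category, files in grouped.items()}

print(summary)
```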

It uses LangChain to create and process the RAG pipeline and chat.
The vector database's persistent store is in the ./vdb folder. 

In this version we use chromadb for the vector store.
The store is recreated each run. This is not efficient for large datasets. 

Future upgrades - To Do (in no particular order): 
- [x] Create a fully local version for security and privacy (*see v01_local*) <span style="color:orange">
        NOTE: will require a fairly advanced LLM to answer questions without losing context. LLMs with 2-4bn parameters struggle and tend to hallucinate. Best options are gpt-4o-mini and claude-3.5-haiku.</span>
- [x] Fine tune the pdf scraper to handle financial reports better
- [x] Create custom retriever for financial information
- [x] Add chatbot selection option (Claude - GPT)
- [ ] Switch to LangGraph for Conversation Chain. 
- [ ] Change vector store to lancedb (part of moving to serverless data store in production)
- [ ] Add an interface to upload documents in data store - including user-defined metadata tags
- [ ] Create persistent data store between runs - only load, chunk and embed changed documents.
- [ ] Read data from S3
- [ ] Enhance data store with metadata for advanced filtering. When uploading reports from different companies semantic search cannot always accurately distinguish differences and can combine references or results from two different companies in a single response. Also see above for lancedb.
- [ ] Multimodality: Process more document data types (e.g. ppt) 
- [x] Add online search capability - use web crawler tool to crawl a website and create website-specific RAG bot
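
One possible way to approach the "only load, chunk and embed changed documents" item is to keep a manifest of content hashes between runs. The sketch below is a minimal illustration under that assumption; `file_digest`, `find_changed_files` and the manifest format are hypothetical names chosen for this example, not existing notebook code.

```python
import hashlib
import tempfile
from pathlib import Path

def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def find_changed_files(paths, manifest):
    """Compare files against a {path: digest} manifest from a previous run.

    Returns the list of new or modified paths and the updated manifest.
    """
    new_manifest = {}
    changed = []
    for path in paths:
        key = str(path)
        digest = file_digest(path)
        new_manifest[key] = digest
        if manifest.get(key) != digest:
            changed.append(key)
    return changed, new_manifest

# Usage: the first run sees every file as changed, the second only the edited one
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.txt"
    b = Path(tmp) / "b.txt"
    a.write_text("alpha")
    b.write_text("beta")
    first_changed, manifest = find_changed_files([a, b], {})
    b.write_text("beta, revised")
    second_changed, manifest = find_changed_files([a, b], manifest)
    second_names = [Path(p).name for p in second_changed]

print(len(first_changed), second_names)
```

The manifest is a plain dictionary, so it could be saved as JSON next to the vector store and reloaded on the next run.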



![image](img/thinking_cyborg.jpg)

# Future ideas:

1) To avoid context contamination from multiple companies consider setting up a different RAG for each company
2) Select companies from dropdown or hamburger menu
3) Keep company metadata and RAG registry in a document db (Mongo, DynamoDB)
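
A minimal sketch of the first and third ideas, assuming each chunk carries the `doc_type` metadata (the kb folder name) and that every company gets its own store directory. The functions `group_chunks_by_company` and `store_path_for`, the per-company directory layout, and the placeholder chunks (including the second company name) are all hypothetical.

```python
from collections import defaultdict

def group_chunks_by_company(chunks):
    """Group chunk dicts by the company name held in metadata['doc_type']."""
    grouped = defaultdict(list)
    for chunk in chunks:
        company = chunk["metadata"].get("doc_type", "unknown")
        grouped[company].append(chunk)
    return dict(grouped)

def store_path_for(company, base_dir="vdb"):
    """Hypothetical layout: one persistent store directory per company."""
    safe = "".join(c if c.isalnum() or c in "-_." else "_" for c in company)
    return f"{base_dir}/{safe}"

# Placeholder chunks for illustration only
sample_chunks = [
    {"text": "placeholder text A", "metadata": {"doc_type": "eurobankholdings.gr"}},
    {"text": "placeholder text B", "metadata": {"doc_type": "eurobankholdings.gr"}},
    {"text": "placeholder text C", "metadata": {"doc_type": "example bank"}},
]

grouped = group_chunks_by_company(sample_chunks)
# The registry maps each company to where its store would live
registry = {company: store_path_for(company) for company in grouped}
print(registry)
```

In a document database, each registry entry could become one record keyed by company.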

In [13]:
# These were necessary as langchain does not install them by default
# !pip install pypdf
# !pip install pdfminer.six
# !pip install python-docx
!pip install docx2txt
!pip install pymupdf4llm
!pip install langchain-openai
!pip install langchain-anthropic

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [14]:
# imports

import os
import glob
from dotenv import load_dotenv
import gradio as gr

# imports for langchain, plotly and Chroma
# plotly is commented out, as it is not used in the current code

from langchain.document_loaders import DirectoryLoader, TextLoader, PDFMinerLoader, Docx2txtLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_anthropic import ChatAnthropic

from langchain_chroma import Chroma
#import matplotlib.pyplot as plt
#from sklearn.manifold import TSNE
#import numpy as np
#import plotly.graph_objects as go
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.embeddings import HuggingFaceEmbeddings

In [15]:
OPENAI_MODEL = "gpt-4o-mini"
ANTHROPIC_MODEL = "claude-3-5-haiku-20241022"
db_name = "vdb"

In [16]:
# Load environment variables from a file called .env

load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')


## Loading the documents

In the code below we read in the KB documents that will go into the vector store. 
We load PDF documents, Word documents and text/markdown documents. 
Text, markdown and Word files each have their own loader, which we call separately through DirectoryLoader.
For PDFs we implement a custom loader to handle financial data. 

At the end, we combine the results and split the text and Word documents using the Recursive Character Text Splitter.

This approach is not optimal for financial tables.
TO DO:
 - [x] Replace splitter with better technique that preserves tables.
 - [x] Replace PDF Reader with pymupdf4llm
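
To illustrate what "preserves tables" can mean in practice, here is a minimal sketch of a splitter that treats each Markdown table as an atomic block. It is a simplified stand-in written for this explanation, not the splitter used in this notebook and not LangChain's algorithm; the function names and the `max_chars` parameter are assumptions.

```python
def markdown_blocks(text):
    """Split Markdown into paragraph blocks and whole-table blocks."""
    blocks, current, in_table = [], [], False
    for line in text.splitlines():
        is_table = line.lstrip().startswith("|")
        is_blank = not line.strip()
        # Close the current block on a blank line or when entering/leaving a table
        if current and (is_blank or is_table != in_table):
            blocks.append("\n".join(current))
            current = []
        if not is_blank:
            current.append(line)
            in_table = is_table
    if current:
        blocks.append("\n".join(current))
    return blocks

def pack_blocks(blocks, max_chars=1000):
    """Greedily pack blocks into chunks; an oversized block stays whole."""
    chunks, current = [], ""
    for block in blocks:
        candidate = block if not current else current + "\n\n" + block
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = (
    "Revenue grew.\n\n"
    "| Item | 2024 |\n|---|---|\n| Net income | 10 |\n| Assets | 99 |\n\n"
    "See notes."
)
blocks = markdown_blocks(sample)
chunks = pack_blocks(blocks, max_chars=40)
print(len(blocks), len(chunks))
```

The table is longer than `max_chars` in this sample, yet it ends up in a single chunk, which is the trade-off this approach makes: chunk sizes become uneven so that rows are never separated from their header.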

In [17]:
# Utility functions for EU financial reporting (read from PDF)
# We're using pymupdf4llm for better handling of financial reports
# This function does not utilize a loader class, but directly processes the PDF file
# It extracts financial sections and returns them as Document objects

import pymupdf4llm
from langchain.schema import Document
import re
import string
from pathlib import Path

def extract_eu_financial_reports(pdf_path):
    """
    Extracts financial sections from an EU financial report PDF using pymupdf4llm.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        list[Document]: A list of LangChain Document objects, each representing a detected financial section
                        (e.g., income statement, balance sheet, cash flow statement, etc.) with associated metadata.

    The function processes the PDF, detects section headers based on common financial report section names,
    and splits the content accordingly. Each Document contains the section text and metadata including section name,
    content type, source file, and page range.
    """
    md_text = pymupdf4llm.to_markdown(
        pdf_path,
        page_chunks=True,  # Preserve page boundaries
        write_images=False,
        embed_images=False
    )
    
    # EU financial reports have predictable structures
    financial_sections = [
        "consolidated income statement", "profit and loss", "p&l", "remuneration report",
        "balance sheet", "cash flow statement", "statement of financial position",
        "notes to the consolidated financial statements", "segment reporting",
        "risk management", "capital adequacy", "basel", "ifrs", "regulatory capital", "income statement"
    ]
    
    documents = []
    current_section = None
    current_content = ""
    start_page = 1
    page_num = 0  # Stays 0 if the PDF yields no pages

    # page_chunks=True returns one dict per page in document order,
    # so we number the pages ourselves instead of reading a key from the dict
    for page_num, page_dict in enumerate(md_text, start=1):
        # Extract the actual text content from the dictionary
        page_content = page_dict.get("text", "")

        print(f"Processing page: {page_num}")

        # Detect financial section headers
        content_lower = page_content.lower()
        detected_section = None

        for section in financial_sections:
            if section in content_lower:
                detected_section = section
                break

        # Process section changes
        if detected_section and detected_section != current_section:
            if current_content:
                # Save previous section
                documents.append(Document(
                    page_content=current_content.strip(),
                    metadata={
                        "content_type": "financial_statement",
                        "section": current_section or "general",
                        "source": pdf_path,
                        "pages": f"{start_page}-{page_num-1}"
                    }
                ))
            current_section = detected_section
            current_content = page_content
            start_page = page_num  # The new section starts on this page
        elif current_content:
            current_content += "\n---\n" + page_content
        else:
            # First page: nothing to separate from yet
            current_content = page_content

    # Handle final section
    if current_content:
        documents.append(Document(
            page_content=current_content.strip(),
            metadata={
                "content_type": "financial_statement",
                "section": current_section or "general",
                "source": pdf_path,
                "pages": f"{start_page}-{page_num}"
            }
        ))

    return documents

# Utility functions for loading documents from a folder
def load_eu_financial_reports_from_directory(directory_path: str, glob_pattern: str = "*.pdf"):
    """
    Load and process all EU financial reports from a directory.

    Args:
        directory_path (str): Path to the directory containing PDF files
        glob_pattern (str, optional): Pattern to match PDF files. Defaults to "*.pdf"

    Returns:
        list[Document]: A list of LangChain Document objects containing the extracted financial sections
                       from all successfully processed PDFs in the directory.

    The function iterates through PDF files in the specified directory that match the glob pattern,
    processes each file using extract_eu_financial_reports(), and combines the results into a single list.
    Files that cannot be processed are skipped with an error message printed to stdout.
    """
    all_documents = []
    directory = Path(directory_path)
    
    for pdf_file in directory.glob(glob_pattern):
        try:
            print(f"Processing {pdf_file}...")
            documents = extract_eu_financial_reports(str(pdf_file))
            all_documents.extend(documents)
        except Exception as e:
            print(f"Error processing {pdf_file}: {e}")
    
    return all_documents


In [18]:
# Read in documents using LangChain's loaders
# Take everything in all the sub-folders of our knowledgebase

folders = glob.glob("kb/*")
print(f"Found {len(folders)} folders in the knowledge base.")

def add_metadata(doc, doc_type):
    doc.metadata["doc_type"] = doc_type
    return doc

# For text files
text_loader_kwargs = {'encoding': 'utf-8'}

documents = []
pdf_docs = []
for folder in folders:
    print(f"Loading documents from folder: {folder}")
    doc_type = os.path.basename(folder)
    # PDF Loader
    # We're not using the PDFMinerLoader as it does not handle EU financial reports well.
    # Instead, we use our custom extract_eu_financial_reports function.
    # Uncomment the next line if you want to use the standard loader for PDF files
    # pdf_loader = DirectoryLoader(folder, glob="**/*.pdf", loader_cls=PDFMinerLoader)
    # Text loaders
    txt_loader = DirectoryLoader(folder, glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    md_loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    # Load MS Word documents - UnstructuredWordDocumentLoader does not play well with numpy > 1.24.0, and we use Docx2txtLoader instead. 
    # doc_loader = DirectoryLoader(folder, glob="**/*.doc", loader_cls=UnstructuredWordDocumentLoader)
    docx_loader = DirectoryLoader(folder, glob="**/*.docx", loader_cls=Docx2txtLoader)
    # Load documents from PDF, text and Word files
    folder_pdf_docs = load_eu_financial_reports_from_directory(folder)
    print(f"Loaded {len(folder_pdf_docs)} PDF sections from {folder}")
    text_docs = txt_loader.load() + md_loader.load()
    print(f"Loaded {len(text_docs)} text documents from {folder}")
    word_docs = docx_loader.load()
    print(f"Loaded {len(word_docs)} Word documents from {folder}")
    # PDF sections are kept whole (not split), so they are collected separately from text and Word documents.
    # doc_type (the folder name) identifies the category of every document, including PDF sections.
    pdf_docs.extend(add_metadata(doc, doc_type) for doc in folder_pdf_docs)
    folder_docs = text_docs + word_docs
    
    # Add metadata to each text and Word document
    if not folder_docs:
        print(f"No text or Word documents found in folder: {folder}")
        continue
    documents.extend([add_metadata(doc, doc_type) for doc in folder_docs])

# Split the text documents into chunks and combine with the pdf documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents) + pdf_docs

# Print out some basic info for the loaded documents and chunks
print(f"Total number of documents: {len(documents)}")
print(f"Total number of chunks: {len(chunks)}")
print(f"Document types found: {set(doc.metadata['doc_type'] for doc in documents)}")


Found 5 folders in the knowledge base.
Loading documents from folder: kb/eurobankholdings.gr
Processing kb/eurobankholdings.gr/etisia-oikonomiki-ekthesi-en-2024.pdf...


KeyboardInterrupt: 

## Vector Store

We use Chroma (chromadb) as the vector store.

The code is the same as in the lesson notebook, minus the visualization part.


In [None]:
#embeddings = OpenAIEmbeddings()

# If you would rather use the free Vector Embeddings from HuggingFace sentence-transformers
# Then replace embeddings = OpenAIEmbeddings()
# with:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2") # A bit slower, but better than all-MiniLM-L6-v2 for financial documents

# Delete if already exists

if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()

# Create vectorstore

vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)
print(f"Vectorstore created with {vectorstore._collection.count()} documents")

# Let's investigate the vectors

collection = vectorstore._collection
count = collection.count()

sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"There are {count:,} vectors with {dimensions:,} dimensions in the vector store")



Vectorstore created with 149 documents
There are 149 vectors with 768 dimensions in the vector store


## LangChain
Create the LangChain chat, memory and retriever.

We tried a number of LLMs through Ollama. They are not very good at sorting out the relevant information from financial documents: they do provide results, but tend to be overly chatty, and specific numbers in particular can be hallucinated or taken out of context. 

GPT-4o-mini provided much more accurate answers to specific questions, even with Hugging Face's embeddings for the vector store. 

Implemented (with Claude's help) a custom retriever and prompt to focus on financial statement analysis.

### OpenAI rate limits
*Note*: If using OpenAI's embeddings, there's a limit of 300K tokens per request. This requires special handling when calling Chroma.from_documents.
### To Do:
- [ ] Add rate limiter for encoding documents and encode in batches.
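
A minimal sketch of the batching part of this item: group texts so that each batch stays under a token budget before it is sent for embedding. The token count here is a rough estimate of four characters per token, which is an assumption (a real tokenizer would give exact counts), and `batch_by_token_budget` is a hypothetical helper.

```python
def batch_by_token_budget(texts, max_tokens=300_000, chars_per_token=4):
    """Group texts into batches whose estimated token total stays within max_tokens.

    A single text larger than the budget still gets a batch of its own.
    """
    batches, current, current_tokens = [], [], 0
    for text in texts:
        # Rough estimate only; assumes about chars_per_token characters per token
        tokens = max(1, len(text) // chars_per_token)
        if current and current_tokens + tokens > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches

# Usage with a small budget so the grouping is easy to follow
texts = ["a" * 400] * 5  # about 100 estimated tokens each
batches = batch_by_token_budget(texts, max_tokens=250)
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

With the real chunks, the first batch could create the store and later batches could be appended with the vector store's `add_documents` method, with a pause between calls if rate limits require it.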

In [None]:
# Specialized Retriever for consolidated financials

from langchain.schema import BaseRetriever, Document
from typing import List

from langchain.vectorstores.base import VectorStoreRetriever

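class EUFinancialRetriever(VectorStoreRetriever):
    def _get_relevant_documents(self, query: str, *, run_manager=None) -> List[Document]:
        query_lower = query.lower()
        k = self.search_kwargs.get("k", 5)
        
        # Section-aware search logic: map query terms to the section names
        # stored in the "section" metadata by extract_eu_financial_reports
        section_queries = [
            (['income', 'revenue', 'profit', 'earnings'],
             ["consolidated income statement", "income statement", "profit and loss", "p&l"]),
            (['balance', 'assets', 'liabilities', 'equity'],
             ["balance sheet", "statement of financial position"]),
            (['cash flow', 'operating cash', 'free cash'],
             ["cash flow statement"]),
        ]
        
        for terms, sections in section_queries:
            if any(term in query_lower for term in terms):
                try:
                    results = self.vectorstore.similarity_search(
                        query, k=k, filter={"section": {"$in": sections}}
                    )
                    # A filter that matches nothing returns an empty list, so fall through
                    if results:
                        return results
                except Exception as e:
                    print(f"Filtered search failed, falling back to standard search: {e}")
                break
        
        # Fallback to standard search
        return self.vectorstore.similarity_search(query, k=k)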




In [None]:
# Specialized prompt for the retriever

financial_prompt = """
You are analyzing EU bank and corporate financial statements. When answering:

1. For numerical data, ALWAYS cite the specific financial statement section
2. Consider regulatory context (IFRS, Basel III for banks)
3. Note if data spans multiple periods or segments
4. Highlight any footnotes or adjustments mentioned
5. Be precise about currency and units (EUR millions, thousands, etc.)

Context from financial statements:
{context}

Question: {question}

Answer:
"""
# Updated chain with financial-aware prompt
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=financial_prompt
)


In [None]:
# create a new Chat with OpenAI
#llm = ChatOpenAI(temperature=0.7, model_name=OPENAI_MODEL)

# Alternative - if you'd like to use Ollama locally, uncomment this line instead
#llm = ChatOpenAI(temperature=0.7, model_name='gemma3:4b', base_url='http://localhost:11434/v1', api_key='ollama')

# set up the conversation memory for the chat
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True, output_key='answer')

# the retriever is an abstraction over the VectorStore that will be used during RAG
retriever = EUFinancialRetriever(
    vectorstore=vectorstore, 
    search_kwargs={"k": 5}
)
# putting it together: set up the conversation chain with the selected LLM, the vector store and memory

# Initialize chains outside the chat function
chains = {}

def get_or_create_chain(model_name):
    if model_name not in chains:
        if model_name == OPENAI_MODEL:
            llm = ChatOpenAI(temperature=0.7, model_name=OPENAI_MODEL)
        elif model_name == ANTHROPIC_MODEL:
            # If using Anthropic, ensure you have the langchain-anthropic package installed
            from langchain_anthropic import ChatAnthropic
            llm = ChatAnthropic(temperature=0.7, model_name=ANTHROPIC_MODEL)
        else:
            # Fail early instead of hitting an undefined llm below
            raise ValueError(f"Unsupported model: {model_name}")
        
        chains[model_name] = ConversationalRetrievalChain.from_llm(
            llm=llm, 
            retriever=retriever, 
            memory=memory, 
            combine_docs_chain_kwargs={"prompt": prompt},
            return_source_documents=True
        )
    return chains[model_name]




## UI part
Create Gradio interface

Simple built-in chat interface

### To Do:
- [x] Add model selector for Claude 3.5 Haiku
- [x] Update interface to handle sources (with **return_source_documents=True**)

In [None]:
# Wrapping that in a function

def chat(question, history, dropdown_value):
    conversation_chain = get_or_create_chain(dropdown_value)
    result = conversation_chain.invoke({"question": question})
    answer = result["answer"]
    source_docs = result.get("source_documents", [])
    if source_docs:
        sources_text = "\n\n**Sources:**\n"
        for i, doc in enumerate(source_docs, 1):
            # Extracting source and page information from the document metadata
            # If the document has a 'source' metadata field, use it; otherwise, default
            source = doc.metadata.get('source', 'Unknown source')
            page = doc.metadata.get('pages', '')
            
            if page:
                sources_text += f"{i}. {source} (Page {page})\n"
            else:
                sources_text += f"{i}. {source}\n"
        return answer + sources_text
    else: 
        return answer

# And in Gradio:

view = gr.ChatInterface(chat, type="messages",
    additional_inputs=[gr.Dropdown([OPENAI_MODEL, ANTHROPIC_MODEL], label="Select model", value=OPENAI_MODEL)]).launch(inbrowser=True)

* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.


