## Expert Knowledge Worker

### A question answering agent that is an expert knowledge worker
### To be used by employees of Insurellm, an Insurance Tech company
### The agent needs to be accurate and the solution should be low cost.

This project uses RAG (Retrieval Augmented Generation) to improve the accuracy of our question-answering assistant by grounding its answers in the company's own documents.

### langchain

* https://chatgpt.com/share/68338ab8-326c-8001-89ae-1859251b4dff

In [22]:
# imports

import os
import glob
from dotenv import load_dotenv
import gradio as gr

In [23]:
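# imports for langchain
# Note: newer LangChain releases expose these under langchain_community.document_loaders
# and langchain_text_splitters; the paths below match the version used in this notebook.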

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter

In [24]:
# price is a factor for our company, so we're going to use a low cost model

MODEL = "gpt-4o-mini"
db_name = "vector_db"

In [25]:
# Load environment variables in a file called .env

load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')

In [48]:
# Read in documents using LangChain's loaders
# Take everything in all the sub-folders of our knowledgebase
# This snippet loads Markdown (.md) documents from each folder with DirectoryLoader
# and tags every document with its folder name as metadata["doc_type"].
# Thank you Mark D. and Zoya H. for fixing a bug here.

folders = glob.glob("knowledge-base/*")
print("folders => ", folders)

# With thanks to CG and Jon R, students on the course, for this fix needed for some users 
text_loader_kwargs = {'encoding': 'utf-8'}
# If that doesn't work, some Windows users might need to uncomment the next line instead
# text_loader_kwargs={'autodetect_encoding': True}

documents = []
for folder in folders:
    doc_type = os.path.basename(folder)   # The function os.path.basename(path) returns the last part of a given path. 
    print("doc_type => ", doc_type)

    loader = DirectoryLoader(folder,   # path to the folder
                             glob="**/*.md",   # recursive search pattern for files
                             loader_cls=TextLoader,   # loader used to read each file
                             loader_kwargs=text_loader_kwargs)    # optional kwargs passed to the loader

    # Returns a list of Document objects (from LangChain), each having:
    # - page_content: the text content of the file
    # - metadata: a dictionary including the file path, and any custom metadata you add
    folder_docs = loader.load()
    # print(folder_docs)
    
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        print("doc.metadata => ",doc.metadata)
        
        documents.append(doc)
        # print("documents => ",documents)

folders =>  ['knowledge-base/products', 'knowledge-base/contracts', 'knowledge-base/company', 'knowledge-base/employees']
doc_type =>  products
doc.metadata =>  {'source': 'knowledge-base/products/Rellm.md', 'doc_type': 'products'}
doc.metadata =>  {'source': 'knowledge-base/products/Markellm.md', 'doc_type': 'products'}
doc.metadata =>  {'source': 'knowledge-base/products/Homellm.md', 'doc_type': 'products'}
doc.metadata =>  {'source': 'knowledge-base/products/Carllm.md', 'doc_type': 'products'}
doc_type =>  contracts
doc.metadata =>  {'source': 'knowledge-base/contracts/Contract with GreenField Holdings for Markellm.md', 'doc_type': 'contracts'}
doc.metadata =>  {'source': 'knowledge-base/contracts/Contract with Apex Reinsurance for Rellm.md', 'doc_type': 'contracts'}
doc.metadata =>  {'source': 'knowledge-base/contracts/Contract with Greenstone Insurance for Homellm.md', 'doc_type': 'contracts'}
doc.metadata =>  {'source': 'knowledge-base/contracts/Contract with Roadway Insurance In
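
The `doc_type` tag in the cell above comes from `os.path.basename(folder)`. As a small illustration, the same label can be recovered from a `source` path like the ones printed above, using only the standard library. The helper name `doc_type_from_source` is hypothetical (it is not part of LangChain), and it assumes each file sits directly inside its type folder rather than in a nested sub-folder.

```python
import os
from collections import Counter


def doc_type_from_source(source: str) -> str:
    """Return the name of the folder that directly contains the file."""
    return os.path.basename(os.path.dirname(source))


# A few paths copied from the output above
sources = [
    "knowledge-base/products/Rellm.md",
    "knowledge-base/products/Markellm.md",
    "knowledge-base/contracts/Contract with Apex Reinsurance for Rellm.md",
    "knowledge-base/employees/Emily Carter.md",
]

# Count how many documents fall under each doc_type
counts = Counter(doc_type_from_source(s) for s in sources)
print(counts)
```

A count like this is a quick sanity check that every folder was picked up before moving on to chunking.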

In [36]:
len(doc.metadata)

2

In [40]:
doc.metadata

{'source': 'knowledge-base/employees/Emily Carter.md', 'doc_type': 'employees'}

In [6]:
len(documents)

31

In [7]:
documents[24]

Document(metadata={'source': 'knowledge-base/employees/Maxine Thompson.md', 'doc_type': 'employees'}, page_content="# HR Record\n\n# Maxine Thompson\n\n## Summary\n- **Date of Birth:** January 15, 1991  \n- **Job Title:** Data Engineer  \n- **Location:** Austin, Texas  \n\n## Insurellm Career Progression\n- **January 2017 - October 2018**: **Junior Data Engineer**  \n  * Maxine joined Insurellm as a Junior Data Engineer, focusing primarily on ETL processes and data integration tasks. She quickly learned Insurellm's data architecture, collaborating with other team members to streamline data workflows.  \n- **November 2018 - December 2020**: **Data Engineer**  \n  * In her new role, Maxine expanded her responsibilities to include designing comprehensive data models and improving data quality measures. Though she excelled in technical skills, communication issues with non-technical teams led to some project delays.  \n- **January 2021 - Present**: **Senior Data Engineer**  \n  * Maxine wa

### Break long documents into smaller, overlapping chunks. This is a common and important step when preparing text for embedding, vector search, or retrieval-augmented generation (RAG).

### Every chunk carries both metadata and page_content, by design. Without metadata, a chunk detached from its original document could not be traced back to its source or attributed properly in a response.
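
To make the idea of overlap and per-chunk metadata concrete, here is a minimal sketch in plain Python. It is a simplification, not LangChain's implementation: it slides a fixed-size character window over the text, whereas `CharacterTextSplitter` splits on a separator. The `Chunk` class and the `split_with_overlap` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_with_overlap(text, metadata, chunk_size=1000, chunk_overlap=200):
    """Slide a fixed-size window over the text.

    Consecutive windows share `chunk_overlap` characters, and every chunk
    gets its own copy of the parent document's metadata.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(Chunk(text[start:start + chunk_size], dict(metadata)))
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo = split_with_overlap(
    "abcdefghijklmnopqrstuvwxyz",
    {"source": "demo.md", "doc_type": "demo"},
    chunk_size=10,
    chunk_overlap=3,
)
for chunk in demo:
    print(repr(chunk.page_content), chunk.metadata)
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.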

In [8]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

Created a chunk of size 1088, which is longer than the specified 1000
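
The warning above is expected. `CharacterTextSplitter` splits on a separator (a blank line, `"\n\n"`, by default) and merges the pieces back together up to `chunk_size`; it does not cut inside a single piece, so a paragraph longer than 1000 characters is kept whole. The following is a simplified sketch of that merging behaviour, under the assumption that overlap is ignored; `merge_paragraphs` is a hypothetical name and this is not LangChain's actual code.

```python
def merge_paragraphs(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size is emitted whole, which is why an
    oversized chunk can appear.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep accumulating
            continue
        if current:
            chunks.append(current)  # close the chunk built so far
        if len(piece) > chunk_size:
            print(f"Oversized piece of {len(piece)} characters (limit {chunk_size})")
        current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "short one\n\n" + "x" * 30 + "\n\nshort two"
result = merge_paragraphs(sample, chunk_size=20)
print([len(c) for c in result])
```

If oversized chunks are a problem, a splitter that falls back to finer separators (for example LangChain's `RecursiveCharacterTextSplitter`) is one option worth trying.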


In [42]:
for i, chunk in enumerate(chunks[:10]):
    print(f"Chunk {i+1} Metadata: {chunk.metadata}")
    print(f"Chunk Content: {chunk.page_content[:200]}")
    print("-" * 50)


Chunk 1 Metadata: {'source': 'knowledge-base/products/Rellm.md', 'doc_type': 'products'}
Chunk Content: # Product Summary

# Rellm: AI-Powered Enterprise Reinsurance Solution

## Summary

Rellm is an innovative enterprise reinsurance product developed by Insurellm, designed to transform the way reinsura
--------------------------------------------------
Chunk 2 Metadata: {'source': 'knowledge-base/products/Rellm.md', 'doc_type': 'products'}
Chunk Content: ### Seamless Integrations
Rellm's architecture is designed for effortless integration with existing systems. Whether it's policy management, claims processing, or financial reporting, Rellm connects s
--------------------------------------------------
Chunk 3 Metadata: {'source': 'knowledge-base/products/Rellm.md', 'doc_type': 'products'}
Chunk Content: ### Regulatory Compliance Tools
Rellm includes built-in compliance tracking features to help organizations meet local and international regulatory standards. This ensures that reinsura

In [9]:
len(chunks)

123

In [10]:
chunks[6]

Document(metadata={'source': 'knowledge-base/products/Markellm.md', 'doc_type': 'products'}, page_content='- **User-Friendly Interface**: Designed with user experience in mind, Markellm features an intuitive interface that allows consumers to easily browse and compare various insurance offerings from multiple providers.\n\n- **Real-Time Quotes**: Consumers can receive real-time quotes from different insurance companies, empowering them to make informed decisions quickly without endless back-and-forth communication.\n\n- **Customized Recommendations**: Based on user profiles and preferences, Markellm provides personalized insurance recommendations, ensuring consumers find the right coverage at competitive rates.\n\n- **Secure Transactions**: Markellm prioritizes security, employing robust encryption methods to ensure that all transactions and data exchanges are safe and secure.\n\n- **Customer Support**: Our dedicated support team is always available to assist both consumers and insurer

In [43]:
chunks[7]

Document(metadata={'source': 'knowledge-base/products/Markellm.md', 'doc_type': 'products'}, page_content="- **Customer Support**: Our dedicated support team is always available to assist both consumers and insurers throughout the process, providing guidance and answering any questions that may arise.\n\n- **Data Insights**: Insurers gain access to valuable data insights through Markellm's analytics dashboard, helping them understand market trends and consumer behavior to refine their offerings.\n\n## Pricing\n\nAt Markellm, we believe in transparency and flexibility. Our pricing structure is designed to accommodate different types of users—whether you're a consumer seeking insurance or an insurance provider seeking customers.\n\n### For Consumers:\n- **Free Membership**: Access to the marketplace at no cost, allowing unlimited browsing and comparisons.\n- **Premium Features**: Optional subscription at $9.99/month for advanced analytics on choices, priority customer support, and enhanc

In [11]:
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"Document types found: {', '.join(doc_types)}")

Document types found: employees, company, products, contracts


In [45]:
for chunk in chunks:
    if 'CEO' in chunk.page_content:
        print(chunk)
        print("============> ")
        print()

page_content='## Support

1. **Customer Support**: Velocity Auto Solutions will have access to Insurellm’s customer support team via email or chatbot, available 24/7.  
2. **Technical Maintenance**: Regular maintenance and updates to the Carllm platform will be conducted by Insurellm, with any downtime communicated in advance.  
3. **Training & Resources**: Initial training sessions will be provided for Velocity Auto Solutions’ staff to ensure effective use of the Carllm suite. Regular resources and documentation will be made available online.

---

**Accepted and Agreed:**  
**For Velocity Auto Solutions**  
Signature: _____________________  
Name: John Doe  
Title: CEO  
Date: _____________________  

**For Insurellm**  
Signature: _____________________  
Name: Jane Smith  
Title: VP of Sales  
Date: _____________________' metadata={'source': 'knowledge-base/contracts/Contract with Velocity Auto Solutions for Carllm.md', 'doc_type': 'contracts'}

page_content='3. **Regular Updates:**