## Expert Knowledge Worker

### A question answering agent that is an expert knowledge worker
### To be used by employees of Insurellm, an Insurance Tech company
### The agent needs to be accurate and the solution should be low cost.

This project uses RAG (Retrieval-Augmented Generation) to keep our question-answering assistant accurate by grounding its answers in retrieved documents.
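
As a rough illustration of the RAG idea (not this notebook's pipeline, which builds on LangChain loaders), the sketch below retrieves the most relevant snippet by simple word overlap and places it in the prompt. The `tokenize`, `retrieve`, and `build_prompt` helpers and the sample snippets are hypothetical; a real system would typically score relevance with embeddings rather than shared words.

```python
# Toy RAG flow: retrieve relevant context, then augment the prompt with it.
# Word overlap stands in for embedding similarity (an assumption for brevity).

def tokenize(text):
    """Lowercase a string and split it into a set of words."""
    return set(text.lower().split())

def retrieve(question, snippets, k=1):
    """Return the k snippets sharing the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(snippets, key=lambda s: len(q_words & tokenize(s)), reverse=True)
    return ranked[:k]

def build_prompt(question, context):
    """Put the retrieved context ahead of the question."""
    joined = "\n".join(context)
    return f"Use this context to answer.\n\nContext:\n{joined}\n\nQuestion: {question}"

# Hypothetical sample snippets standing in for knowledge-base chunks.
snippets = [
    "Platform event messages are published to the event bus.",
    "Shield Platform Encryption protects data at rest.",
]
question = "Where are platform event messages published?"
context = retrieve(question, snippets)
prompt = build_prompt(question, context)
print(prompt)
```

The prompt that reaches the model then contains only the retrieved snippet plus the question, which is what keeps answers tied to the source documents.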

In [57]:
# imports

import os
import glob
from dotenv import load_dotenv
import gradio as gr

In [58]:
!pip install -qU langchain_community pypdf

In [59]:
# imports for langchain

# Document loaders live in langchain_community (installed in the cell above)
from langchain_community.document_loaders import DirectoryLoader, TextLoader, PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

In [60]:
# price is a factor for our company, so we're going to use a low cost model

MODEL = "gpt-4o-mini"
db_name = "vector_db"

In [81]:
# Load environment variables from a file called .env

load_dotenv()
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
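
Because the cell above falls back to a placeholder string when no key is found, a missing key only surfaces later as an authentication error. A minimal sketch of an early check follows; the `has_real_key` helper is hypothetical and `sk-example` is a made-up value used only for the demonstration.

```python
import os

PLACEHOLDER = "your-key-if-not-using-env"  # the fallback string used in the cell above

def has_real_key(env, name="OPENAI_API_KEY"):
    """True when the variable is set to something other than the placeholder."""
    value = env.get(name, "")
    return bool(value) and value != PLACEHOLDER

# Demonstration with plain dicts standing in for os.environ.
print(has_real_key({"OPENAI_API_KEY": PLACEHOLDER}))
print(has_real_key({"OPENAI_API_KEY": "sk-example"}))
```

In the notebook itself the call would be `has_real_key(os.environ)`.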

In [89]:
# Read in documents using LangChain's loaders
# Take everything in all the sub-folders of our knowledge base

#TEMP: folders = glob.glob("knowledge-base/*")
folders = glob.glob("kb-sfdc/*")

# With thanks to CG and Jon R, students on the course, for this fix needed for some users 
text_loader_kwargs = {'encoding': 'utf-8'}
# If that doesn't work, some Windows users might need to uncomment the next line instead
# text_loader_kwargs={'autodetect_encoding': True}

documents = []
for folder in folders:
    doc_type = os.path.basename(folder)
    print(f"folder <{folder}>")
    #TEMP: loader = DirectoryLoader(folder, glob="**/*.pdf", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    loader = DirectoryLoader(folder, glob="**/*.pdf", loader_cls=PyPDFLoader)
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

folder <kb-sfdc/real_time_event_monitoring>

Ignoring wrong pointing object 15 0 (offset 0)
Ignoring wrong pointing object 29 0 (offset 0)
Ignoring wrong pointing object 32 0 (offset 0)
Ignoring wrong pointing object 36 0 (offset 0)
Ignoring wrong pointing object 112 0 (offset 0)
Ignoring wrong pointing object 144 0 (offset 0)
Ignoring wrong pointing object 180 0 (offset 0)

folder <kb-sfdc/platform_encryption>
folder <kb-sfdc/best_practices>


In [90]:
len(documents)

1076
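
The count of 1076 is much larger than the number of PDF files because `PyPDFLoader` returns one `Document` per page by default (the sample document shown next carries `'page': 24` out of `'total_pages': 590`). A minimal sketch of tallying pages per file and per `doc_type` follows; plain dicts stand in for `doc.metadata`, and the file name `example.pdf` is hypothetical.

```python
from collections import Counter

# Stand-ins for doc.metadata; in the notebook you would iterate over `documents`.
sample_metadata = [
    {"source": "kb-sfdc/real_time_event_monitoring/sfdc-platform_events.pdf",
     "doc_type": "real_time_event_monitoring", "page": 0},
    {"source": "kb-sfdc/real_time_event_monitoring/sfdc-platform_events.pdf",
     "doc_type": "real_time_event_monitoring", "page": 1},
    {"source": "kb-sfdc/best_practices/example.pdf",  # hypothetical file name
     "doc_type": "best_practices", "page": 0},
]

# Each entry is one page, so counting entries per source gives pages per file.
pages_per_file = Counter(meta["source"] for meta in sample_metadata)
pages_per_type = Counter(meta["doc_type"] for meta in sample_metadata)

for source, pages in pages_per_file.items():
    print(f"{pages:>4} pages  {source}")
print(dict(pages_per_type))
```

With the real list, the equivalent tally would be `Counter(doc.metadata["source"] for doc in documents)`.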

In [91]:
documents[24]

Document(metadata={'producer': 'XEP 4.20 build 20120720', 'creator': 'Unknown', 'creationdate': '2025-03-21T17:14:48+00:00', 'author': 'Salesforce, Inc.', 'date/time generated': '2025-03-21T10:14:34.758-07:00', 'trapped': '/False', 'title': 'Platform Events Developer Guide', 'drc': '254.11', 'moddate': '2025-03-21T17:14:48+00:00', 'source': 'kb-sfdc/real_time_event_monitoring/sfdc-platform_events.pdf', 'total_pages': 590, 'page': 24, 'page_label': '21', 'doc_type': 'real_time_event_monitoring'}, page_content='// Create events in a loop\nfor(Integer i = 0;i<10;i++) {\nevents.add((Order_Event__e)Order_Event__e.sObjectType.newSObject(null, true));\n}\n// Pass the list of events to the publish call\nEventBus.publish(events, cb);\nIn contrast, this example shows what to avoid. It’s inefficiently making 10 calls to the publish method with a callback, each with one\nevent. This example can result in more callback executions later than when events are batched in one publish call.\n// !! NOT RE

In [92]:
#NOTE: divide each document into chunks...
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

In [93]:
len(chunks)

1074
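
Splitting 1076 page documents produced only 1074 chunks, so almost no page was actually divided. A likely explanation, assuming the default `"\n\n"` separator of `CharacterTextSplitter`: text extracted from PDF pages is mostly joined by single newlines, so each page stays in one piece whatever its length, and pages with no text yield no chunk. The simplified function below is a hypothetical stand-in (not LangChain's implementation, and it ignores overlap) that shows this behaviour.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000):
    """Split on `separator`, then greedily merge pieces up to `chunk_size`.

    Simplified stand-in: a piece longer than `chunk_size` is kept whole,
    because there is no finer separator to fall back on.
    """
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{separator}{piece}" if current else piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow, so close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# A "page" of 30 lines joined by single newlines, well over 1000 characters.
page = "\n".join(f"line {i}: " + "x" * 50 for i in range(30))
print(len(page), len(split_on_separator(page)))

# The same lines joined by blank lines split as intended.
spaced = "\n\n".join(f"line {i}: " + "x" * 50 for i in range(30))
print(len(split_on_separator(spaced)))

# A page with no text yields no chunk at all.
print(split_on_separator("   "))
```

If finer chunks are wanted, `RecursiveCharacterTextSplitter` from the same `langchain.text_splitter` module falls back through several separators; whether it suits these PDFs would need checking against the chunk lengths it produces.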

In [94]:
chunks[6]

Document(metadata={'producer': 'XEP 4.20 build 20120720', 'creator': 'Unknown', 'creationdate': '2025-03-21T17:14:48+00:00', 'author': 'Salesforce, Inc.', 'date/time generated': '2025-03-21T10:14:34.758-07:00', 'trapped': '/False', 'title': 'Platform Events Developer Guide', 'drc': '254.11', 'moddate': '2025-03-21T17:14:48+00:00', 'source': 'kb-sfdc/real_time_event_monitoring/sfdc-platform_events.pdf', 'total_pages': 590, 'page': 6, 'page_label': '3', 'doc_type': 'real_time_event_monitoring'}, page_content='In comparison, systems in an event-based model obtain information and can react to it in near real time when the event occurs. Event\nproducers don’t know the consumers that receive the events. Any number of consumers can receive and react to the same events. The\nonly dependency between producers and consumers is the semantic of the message content.\nThe Event Bus\nPlatform event messages are published to the event bus, where they’re stored temporarily. You can retrieve stored even

In [95]:
#NOTE: confirm that each of the 3 sub-folders of "kb-sfdc" appears as a doc_type...
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"doc_types <{doc_types}>")

doc_types <{'real_time_event_monitoring', 'best_practices', 'platform_encryption'}>
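
The visual check above can also be automated. A minimal sketch, assuming folder paths like those returned by `glob.glob("kb-sfdc/*")`; the `missing_doc_types` helper is hypothetical, and the literals stand in for the notebook's `folders` and `doc_types` variables.

```python
import os

def missing_doc_types(folders, doc_types):
    """Return folder names that never appear as a doc_type in the chunks."""
    expected = {os.path.basename(folder) for folder in folders}
    return expected - set(doc_types)

# Stand-ins for the notebook's variables.
folders = [
    "kb-sfdc/real_time_event_monitoring",
    "kb-sfdc/platform_encryption",
    "kb-sfdc/best_practices",
]
doc_types = {"real_time_event_monitoring", "best_practices", "platform_encryption"}

# An empty set means every folder contributed at least one chunk.
print(missing_doc_types(folders, doc_types))
```

A non-empty result would point at a folder whose PDFs failed to load or produced no text.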


In [104]:
for chunk in chunks:
    if 'OAuth' in chunk.page_content:
        print(chunk)
        print("#################### CHUNK DIVIDER ########################")

page_content='You define a custom platform event in Salesforce in the same way that you define a custom object. Create a platform event definition
by giving it a name and adding custom fields. Platform events support a subset of field types in Salesforce. See Platform Event Fields.
This table lists a sample definition of custom fields for a printer ink event.
Field TypeField API NameField Name
TextPrinter_Model__cPrinter Model
TextSerial_Number__cSerial Number
NumberInk_Percentage__cInk Percentage
You can publish custom platform events on the Lightning Platform by using Apex or point-and-click tools, such as Process Builder and
Flow Builder, or an API in external apps. Similarly, you can subscribe to an event either on the platform through an Apex trigger or
point-and-click tools or in external apps, such as Pub/Sub API. When an app publishes an event message, event subscribers receive the
event message and execute business logic. Using the printer ink example, a software system monito