## Expert Knowledge Worker

### A question answering agent that is an expert knowledge worker
### To be used by employees of Insurellm, an Insurance Tech company
### The agent needs to be accurate and the solution should be low cost.

This project will use RAG (Retrieval Augmented Generation) to ensure our question/answering assistant has high accuracy.
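To make the idea concrete before we bring in LangChain and Chroma, here is a minimal sketch of the retrieve-then-prompt flow. It is not the implementation used later in this notebook: the helper names (`tokenize`, `retrieve`, `build_prompt`) are made up for illustration, and plain word overlap stands in for vector similarity. The toy chunks paraphrase facts from our knowledge base.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, chunks, k=2):
    """Rank chunks by shared words with the question (a crude stand-in for vector similarity)."""
    q_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & tokenize(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_chunks):
    """Place the retrieved chunks in front of the question, as a RAG prompt would."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical toy chunks, paraphrased from the knowledge base
toy_chunks = [
    "Insurellm was founded by Avery Lancaster in 2015.",
    "Insurellm is hiring software engineers and data scientists.",
    "Markellm is a marketplace connecting consumers with insurance providers.",
]

question = "Who founded Insurellm?"
top_chunks = retrieve(question, toy_chunks, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

The prompt that reaches the LLM now carries the relevant fact, which is what lets a low cost model answer accurately about data it was never trained on.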

**Vector Embedding Models**

word2vec (2013)

BERT (2018)

OpenAI Embeddings (2024 updates)

**Chroma** is the first open-source AI application database. Batteries included.

![chroma.png](chroma.png)

In [1]:
# imports

import os
import glob
from dotenv import load_dotenv
import gradio as gr

In [2]:
# imports for langchain and Chroma and plotly

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objects as go

In [3]:
# price is a factor for our company, so we're going to use a low cost model

MODEL = "gpt-4o-mini"
# Note: the next assignment overrides the one above; comment out whichever model you are not using
MODEL = "Qwen/QwQ-32B"
db_name = "vector_db"

In [4]:
# Load environment variables in a file called .env

load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')

In [5]:
# Read in documents using LangChain's loaders
# Take everything in all the sub-folders of our knowledgebase

folders = glob.glob("knowledge-base/*")

# With thanks to CG and Jon R, students on the course, for this fix needed for some users 
text_loader_kwargs = {'encoding': 'utf-8'}
# If that doesn't work, some Windows users might need to uncomment the next line instead
# text_loader_kwargs={'autodetect_encoding': True}

documents = []
for folder in folders:
    doc_type = os.path.basename(folder) # use the folder name as the doc_type
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

In [7]:
print(documents)

[Document(metadata={'source': 'knowledge-base\\company\\about.md', 'doc_type': 'company'}, page_content="# About Insurellm\n\nInsurellm was founded by Avery Lancaster in 2015 as an insurance tech startup designed to disrupt an industry in need of innovative products. It's first product was Markellm, the marketplace connecting consumers with insurance providers.\nIt rapidly expanded, adding new products and clients, reaching 200 emmployees by 2024 with 12 offices across the US."), Document(metadata={'source': 'knowledge-base\\company\\careers.md', 'doc_type': 'company'}, page_content='# Careers at Insurellm\n\nInsurellm is hiring! We are looking for talented software engineers, data scientists and account executives to join our growing team. Come be a part of our movement to disrupt the insurance sector.'), Document(metadata={'source': 'knowledge-base\\company\\overview.md', 'doc_type': 'company'}, page_content='# Overview of Insurellm\n\nInsurellm is an innovative insurance tech firm w

# Please note:

In the next cell, we split the text into chunks.

Two students let me know that the next cell crashed their computer.  
They were able to fix it by changing the chunk_size from 1,000 to 2,000 and the chunk_overlap from 200 to 400.  
This shouldn't be required; but if it happens to you, please make that change!  
(Note that LangChain may give a warning about a chunk being larger than 1,000 - this can be safely ignored).

_With much thanks to Steven W and Nir P for this valuable contribution._
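If `chunk_size` and `chunk_overlap` are new to you, the sketch below shows the basic idea with a plain sliding window over characters. This is a simplification and not LangChain's algorithm: as far as I understand, `CharacterTextSplitter` first splits on a separator (by default a blank line) and then merges the pieces, which is why it can warn about a chunk larger than the requested size. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Slide a fixed-size window over the text, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each chunk repeats the last 4 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.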

In [8]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

Created a chunk of size 1088, which is longer than the specified 1000


In [9]:
len(chunks)

123

In [10]:
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks) # use a set to remove duplicates
print(f"Document types found: {', '.join(doc_types)}")

Document types found: contracts, employees, products, company


## A sidenote on Embeddings, and "Auto-Encoding LLMs"

We will be mapping each chunk of text into a Vector that represents the meaning of the text, known as an embedding.
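Once text is represented as vectors, "similar meaning" becomes "vectors pointing in a similar direction". A common way to measure that is cosine similarity, sketched below. The 3-dimensional vectors are made up purely for illustration; real embeddings have hundreds or thousands of dimensions and come from a model, not from hand-picked numbers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", not produced by any real model
toy_vectors = {
    "car insurance policy": [0.9, 0.1, 0.0],
    "auto coverage plan": [0.8, 0.2, 0.1],
    "employee holiday schedule": [0.0, 0.2, 0.9],
}

query = toy_vectors["car insurance policy"]
for text, vec in toy_vectors.items():
    print(f"{text}: {cosine_similarity(query, vec):.3f}")
```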

OpenAI offers a model to do this, which we will use by calling their API with some LangChain code.

This model is an example of an "Auto-Encoding LLM", which generates an output given a complete input.
It differs from all the other LLMs we've discussed today, which are known as "Auto-Regressive LLMs" and generate future tokens based only on past context.

Another example of an Auto-Encoding LLM is BERT from Google. In addition to embedding, Auto-Encoding LLMs are often used for classification.
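One way to picture the difference is through attention masks: which positions each token is allowed to look at. The sketch below is a deliberate simplification (real models also differ in their training objectives), and the function names are hypothetical.

```python
def causal_mask(n):
    """Auto-regressive style: position i may attend only to positions 0..i (the past)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def full_mask(n):
    """Auto-encoding style: every position may attend to the complete input."""
    return [[1] * n for _ in range(n)]

tokens = ["the", "claim", "was", "approved"]

print("Auto-regressive (causal) mask:")
for token, row in zip(tokens, causal_mask(len(tokens))):
    print(f"{token:>9} {row}")

print("Auto-encoding (bidirectional) mask:")
for token, row in zip(tokens, full_mask(len(tokens))):
    print(f"{token:>9} {row}")
```

In the causal mask the first token sees only itself, while in the full mask every token sees the whole sentence, which suits tasks like embedding and classification where the complete input is available up front.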

### Sidenote

In week 8 we will return to RAG and vector embeddings, and we will use an open-source vector encoder so that the data never leaves our computer - that's an important consideration when building enterprise systems and the data needs to remain internal.

In [11]:
# Put the chunks of data into a Vector Store that associates a Vector Embedding with each chunk

# embeddings = OpenAIEmbeddings()

# If you would rather use the free Vector Embeddings from HuggingFace sentence-transformers
# Then replace embeddings = OpenAIEmbeddings()
# with:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [12]:
# Check if a Chroma Datastore already exists - if so, delete the collection to start from scratch

if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()

In [13]:
# Create our Chroma vectorstore!

vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)
print(f"Vectorstore created with {vectorstore._collection.count()} documents")

Vectorstore created with 123 documents


In [15]:
# Get one vector and find how many dimensions it has

collection = vectorstore._collection
print(collection.get(limit=1, include=["embeddings"]))
sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"The vectors have {dimensions:,} dimensions")

{'ids': ['f4217b8f-d38f-4ddb-b0bd-bbcda91d022b'], 'embeddings': array([[-4.51702289e-02, -4.03837115e-03, -4.76738065e-02,
         5.79420365e-02,  7.09055085e-03, -1.20240906e-02,
         1.13990404e-01,  3.22364978e-02, -2.38573346e-02,
        -2.50851680e-02,  7.44797289e-02,  2.48318519e-02,
         1.05261981e-01, -2.77122390e-02, -3.79517823e-02,
         1.81515012e-02, -1.18474700e-02, -4.32383046e-02,
         3.21236514e-02,  2.71694474e-02, -4.61054370e-02,
        -4.26115748e-03, -1.15558309e-02, -8.19229614e-03,
        -3.33760642e-02,  6.54421654e-03,  4.88266498e-02,
         3.43602151e-02, -3.53719741e-02, -5.88528961e-02,
        -1.16648413e-02, -3.25214863e-02,  2.13794466e-02,
         6.14821576e-02,  1.26224617e-03,  1.59482993e-02,
        -2.78290100e-02, -7.27213100e-02, -9.33095440e-02,
         4.26133443e-03,  3.62291224e-02, -2.61924379e-02,
        -4.81943563e-02, -2.92451885e-02, -9.08270702e-02,
         5.86406998e-02,  4.04887050e-02,  6.680569

## Visualizing the Vector Store

Let's take a minute to look at the documents and their embedding vectors to see what's going on.

In [16]:
# Prework

# Retrieve data from the 'collection' object.
# 'include' specifies which components of the data to fetch:
# - 'embeddings': The high-dimensional vector representations of the documents.
# - 'documents': The original text content of the documents.
# - 'metadatas': Additional metadata associated with each document, including 'doc_type'.
result = collection.get(include=['embeddings', 'documents', 'metadatas'])

# Extract the embeddings (vectors) from the result and convert them to a NumPy array.
# NumPy arrays are efficient for numerical computations required by t-SNE.
vectors = np.array(result['embeddings'])

# Extract the original documents.
documents = result['documents']

# Extract 'doc_type' from the metadata of each document.
# This will be used to assign different colors to different types of documents.
doc_types = [metadata['doc_type'] for metadata in result['metadatas']]

# Assign a color to each document based on its 'doc_type'.
# The index of the doc_type in the fixed list determines the color.
colors = [['blue', 'green', 'red', 'orange'][['products', 'employees', 'contracts', 'company'].index(t)] for t in doc_types]
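The nested list lookup above works, but it raises a `ValueError` if a new folder (and therefore a new `doc_type`) is ever added to the knowledge base. A dictionary with a fallback color is one possible alternative; the names `COLOR_BY_TYPE` and `colors_for` and the gray default are assumptions for this sketch, not part of the original code.

```python
# Same type-to-color pairing as the original list lookup
COLOR_BY_TYPE = {
    "products": "blue",
    "employees": "green",
    "contracts": "red",
    "company": "orange",
}

def colors_for(doc_types, default="gray"):
    """Map each doc_type to its color, falling back to a default for unknown types."""
    return [COLOR_BY_TYPE.get(doc_type, default) for doc_type in doc_types]

example_types = ["products", "company", "some-new-folder"]
print(colors_for(example_types))
```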

In [19]:
# We humans find it easier to visualize things in 2D!
# Reduce the dimensionality of the vectors to 2D using t-SNE
# (t-distributed stochastic neighbor embedding)
# t-SNE is a non-linear dimensionality reduction technique that aims to keep nearby points close together

# n_components=2: reduce to 2 dimensions
# random_state=42: fix the random seed so the result is reproducible
tsne = TSNE(n_components=2, random_state=42)

# Run the reduction, turning the high-dimensional vectors into 2D coordinates
reduced_vectors = tsne.fit_transform(vectors)

# Create the 2D scatter plot as an interactive Plotly figure
fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],  # x coordinate (first reduced dimension)
    y=reduced_vectors[:, 1],  # y coordinate (second reduced dimension)
    mode='markers',           # draw points only
    marker=dict(
        size=5,               # marker size
        color=colors,         # marker color (by document type)
        opacity=0.8           # marker opacity
    ),
    # Hover text: the document type and the first 100 characters of the chunk
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'          # show only the custom text on hover
)])

# Set the figure layout and styling
fig.update_layout(
    title='2D Chroma Vector Store Visualization',  # figure title
    # 2D figures take axis titles directly on the layout; 'scene' applies only to 3D plots
    xaxis_title='x',          # x-axis title
    yaxis_title='y',          # y-axis title
    width=800,                # figure width
    height=600,               # figure height
    margin=dict(              # margins
        r=20,                 # right
        b=10,                 # bottom
        l=10,                 # left
        t=40                  # top
    )
)


fig.show()

In [18]:
# Let's try 3D!

tsne = TSNE(n_components=3, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    z=reduced_vectors[:, 2],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='3D Chroma Vector Store Visualization',
    scene=dict(xaxis_title='x', yaxis_title='y', zaxis_title='z'),
    width=900,
    height=700,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()