## Expert Knowledge Worker

### A question answering agent that is an expert knowledge worker
### To be used by employees of Insurellm, an Insurance Tech company
### The agent needs to be accurate and the solution should be low cost.

This project uses RAG (Retrieval-Augmented Generation) to improve the accuracy of our question-answering assistant.

In [20]:
# imports

import os
import glob
from dotenv import load_dotenv
import gradio as gr
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
import numpy as np
import plotly.graph_objects as go


In [2]:
# price is a factor for our company, so we're going to use a low cost model

MODEL = "gpt-4o-mini"
db_name = "vector_db"

In [3]:
# Load environment variables from a file called .env

load_dotenv(override=True)
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')

In [8]:
# Read in documents using LangChain's loaders
# Take everything in all the sub-folders of our knowledgebase

folders = glob.glob("../knowledge-base/*")

# With thanks to CG and Jon R, students on the course, for this fix needed for some users 
text_loader_kwargs = {'encoding': 'utf-8'}
# If that doesn't work, some Windows users might need to uncomment the next line instead
# text_loader_kwargs={'autodetect_encoding': True}

documents = []
for folder in folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

# Please note:

In the next cell, we split the text into chunks.

Two students let me know that the next cell crashed their computer.  
They were able to fix it by changing the chunk_size from 1,000 to 2,000 and the chunk_overlap from 200 to 400.  
This shouldn't be required, but if it happens to you, please make that change!  
(Note that LangChain may give a warning about a chunk being larger than 1,000; this can be safely ignored.)

_With much thanks to Steven W and Nir P for this valuable contribution._
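
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: `sliding_chunks` is a hypothetical helper, not LangChain's algorithm. As far as I understand it, `CharacterTextSplitter` first splits on a separator and then merges the pieces, so a single piece longer than `chunk_size` is kept whole, which would explain the warning mentioned above.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    result = []
    for start in range(0, len(text), step):
        result.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return result

# Toy input (an assumption for illustration, not data from the knowledge base)
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk repeats the last 4 characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.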

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

In [None]:
chunks

In [None]:
len(chunks)

In [None]:
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"Document types found: {', '.join(doc_types)}")

## A sidenote on Embeddings, and "Auto-Encoding LLMs"

We will be mapping each chunk of text into a Vector that represents the meaning of the text, known as an embedding.

OpenAI offers a model to do this, which we will use by calling their API with some LangChain code.

This model is an example of an "Auto-Encoding LLM" which generates an output given a complete input.
It's different to all the other LLMs we've discussed today, which are known as "Auto-Regressive LLMs", and generate future tokens based only on past context.

Another example of an Auto-Encoding LLM is BERT from Google. In addition to embedding, Auto-Encoding LLMs are often used for classification.
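
Once text is represented as vectors, "similar meaning" becomes "vectors pointing in a similar direction". The sketch below ranks a few made-up vectors against a query using cosine similarity, one common way to compare embeddings. The vectors, labels and the `cosine_similarity` helper are assumptions for illustration; this is not necessarily the distance metric the vector store uses internally.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings"; real ones have hundreds or thousands of dimensions
toy_vectors = {
    "claims policy": [0.9, 0.1, 0.0],
    "claims process": [0.8, 0.2, 0.1],
    "office party": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

# Rank the labels from most to least similar to the query
ranked = sorted(toy_vectors, key=lambda k: cosine_similarity(query, toy_vectors[k]), reverse=True)
print(ranked)
```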

### Sidenote

In week 8 we will return to RAG and vector embeddings, and we will use an open-source vector encoder so that the data never leaves our computer. That's an important consideration when building enterprise systems where the data needs to remain internal.

In [13]:
# Put the chunks of data into a Vector Store that associates a Vector Embedding with each chunk

embeddings = OpenAIEmbeddings()

# If you would rather use the free Vector Embeddings from HuggingFace sentence-transformers
# then replace embeddings = OpenAIEmbeddings()
# with:
# from langchain_community.embeddings import HuggingFaceEmbeddings
# embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [None]:
# Check if a Chroma Datastore already exists - if so, delete the collection to start from scratch

if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()

In [None]:
# Create our Chroma vectorstore!

vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_name)
print(f"Vectorstore created with {vectorstore._collection.count()} documents")

In [None]:
# Get one vector and find how many dimensions it has

collection = vectorstore._collection
sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"The vectors have {dimensions:,} dimensions")

## Visualizing the Vector Store

Let's take a minute to look at the documents and their embedding vectors to see what's going on.

In [17]:
# Prework

result = collection.get(include=['embeddings', 'documents', 'metadatas'])
vectors = np.array(result['embeddings'])
documents = result['documents']
doc_types = [metadata['doc_type'] for metadata in result['metadatas']]
colors = [['blue', 'green', 'red', 'orange'][['products', 'employees', 'contracts', 'company'].index(t)] for t in doc_types]

In [None]:
def plot_2d_vectors():
    # We humans find it easier to visualize things in 2D!
    # Reduce the dimensionality of the vectors to 2D using t-SNE
    # (t-distributed stochastic neighbor embedding)
    tsne = TSNE(n_components=2, random_state=42)
    reduced_vectors = tsne.fit_transform(vectors)

    # Create the 2D scatter plot
    fig = go.Figure(data=[go.Scatter(
        x=reduced_vectors[:, 0],
        y=reduced_vectors[:, 1],
        mode='markers',
        marker=dict(size=5, color=colors, opacity=0.8),
        text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
        hoverinfo='text'
    )])

    # A 2D figure takes axis titles directly; `scene` only applies to 3D plots
    fig.update_layout(
        title='2D Chroma Vector Store Visualization',
        xaxis_title='x',
        yaxis_title='y',
        width=800,
        height=600,
        margin=dict(r=20, b=10, l=10, t=40)
    )

    fig.show()

plot_2d_vectors()

In [None]:
def visualize_clusters(vectors, n_clusters=5, boundary_type='hull'):
    """
    Visualize document clusters with different boundary types.
    
    Parameters:
    -----------
    vectors : array-like
        The document vectors to visualize
    n_clusters : int, default=5
        Number of clusters to create
    boundary_type : str, default='hull'
        Type of boundary to draw around clusters. Options:
        - 'hull': Convex hull with straight lines (most accurate, O(n log n))
        - 'hull_curve': Convex hull with smooth curves (visually appealing, O(n log n))
        - 'hull_circle': Circle based on convex hull center (O(n log n))
        - 'std_circle': Circle based on standard deviation (fast, O(n))
        - 'mean_circle': Circle based on mean and max distance (fastest, O(n))
    
    Returns:
    --------
    plotly.graph_objects.Figure
        The visualization figure
    """
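    # Fail early on an unsupported boundary type; otherwise `center` is never set below
    valid_types = {'hull', 'hull_curve', 'hull_circle', 'std_circle', 'mean_circle'}
    if boundary_type not in valid_types:
        raise ValueError(f"boundary_type must be one of {sorted(valid_types)}")

    # Convert vectors to numpy array
    vectors_array = np.array(vectors)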
    
    # Reduce to 2D using t-SNE
    tsne = TSNE(n_components=2, random_state=42)
    reduced_vectors = tsne.fit_transform(vectors_array)
    
    # Perform K-means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(vectors_array)
    
    # Create the scatter plot
    fig = go.Figure()
    
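    # Define colors for clusters, cycling through the palette if n_clusters exceeds its length
    palette = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']
    colors = [palette[i % len(palette)] for i in range(n_clusters)]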
    
    # Process each cluster
    for cluster in range(n_clusters):
        # Get points for this cluster
        mask = cluster_labels == cluster
        cluster_points = reduced_vectors[mask]
        
        if len(cluster_points) >= 3:  # Need at least 3 points for hull
            if boundary_type in ['hull', 'hull_curve']:
                # Convex hull visualization
                from scipy.spatial import ConvexHull
                hull = ConvexHull(cluster_points)
                hull_points = cluster_points[hull.vertices]
                
                if boundary_type == 'hull':
                    # Straight line hull
                    fig.add_trace(go.Scatter(
                        x=hull_points[:, 0],
                        y=hull_points[:, 1],
                        fill='toself',
                        fillcolor=colors[cluster],
                        opacity=0.2,
                        line=dict(color=colors[cluster], width=2),
                        name=f'Cluster {cluster} boundary',
                        showlegend=True
                    ))
                else:  # hull_curve
                    # Create smooth curve using spline interpolation
                    from scipy.interpolate import splprep, splev
                    # Close the curve by adding the first point at the end
                    closed_points = np.vstack((hull_points, hull_points[0]))
                    # Fit a spline to the points
                    tck, u = splprep([closed_points[:, 0], closed_points[:, 1]], s=0, per=1)
                    # Generate points for smooth curve
                    u_new = np.linspace(0, 1, 100)
                    x_new, y_new = splev(u_new, tck)
                    
                    fig.add_trace(go.Scatter(
                        x=x_new,
                        y=y_new,
                        fill='toself',
                        fillcolor=colors[cluster],
                        opacity=0.2,
                        line=dict(color=colors[cluster], width=2),
                        name=f'Cluster {cluster} boundary',
                        showlegend=True
                    ))
                
                center = np.mean(hull_points, axis=0)
                
            elif boundary_type == 'hull_circle':
                # Circle based on convex hull
                from scipy.spatial import ConvexHull
                hull = ConvexHull(cluster_points)
                hull_points = cluster_points[hull.vertices]
                center = np.mean(hull_points, axis=0)
                radius = np.max(np.sqrt(np.sum((hull_points - center)**2, axis=1)))
                
                # Generate circle points
                theta = np.linspace(0, 2*np.pi, 100)
                circle_x = center[0] + radius * np.cos(theta)
                circle_y = center[1] + radius * np.sin(theta)
                
                fig.add_trace(go.Scatter(
                    x=circle_x,
                    y=circle_y,
                    fill='toself',
                    fillcolor=colors[cluster],
                    opacity=0.2,
                    line=dict(color=colors[cluster], width=2),
                    name=f'Cluster {cluster} boundary',
                    showlegend=True
                ))
                
            elif boundary_type == 'std_circle':
                # Circle based on standard deviation
                center = np.mean(cluster_points, axis=0)
                radius = 2 * np.std(np.sqrt(np.sum((cluster_points - center)**2, axis=1)))
                
                # Generate circle points
                theta = np.linspace(0, 2*np.pi, 100)
                circle_x = center[0] + radius * np.cos(theta)
                circle_y = center[1] + radius * np.sin(theta)
                
                fig.add_trace(go.Scatter(
                    x=circle_x,
                    y=circle_y,
                    fill='toself',
                    fillcolor=colors[cluster],
                    opacity=0.2,
                    line=dict(color=colors[cluster], width=2),
                    name=f'Cluster {cluster} boundary',
                    showlegend=True
                ))
                
            elif boundary_type == 'mean_circle':
                # Circle based on mean and max distance
                center = np.mean(cluster_points, axis=0)
                radius = np.max(np.sqrt(np.sum((cluster_points - center)**2, axis=1)))
                
                # Generate circle points
                theta = np.linspace(0, 2*np.pi, 100)
                circle_x = center[0] + radius * np.cos(theta)
                circle_y = center[1] + radius * np.sin(theta)
                
                fig.add_trace(go.Scatter(
                    x=circle_x,
                    y=circle_y,
                    fill='toself',
                    fillcolor=colors[cluster],
                    opacity=0.2,
                    line=dict(color=colors[cluster], width=2),
                    name=f'Cluster {cluster} boundary',
                    showlegend=True
                ))
            
            # Add the points
            fig.add_trace(go.Scatter(
                x=cluster_points[:, 0],
                y=cluster_points[:, 1],
                mode='markers',
                name=f'Cluster {cluster} points',
                marker=dict(
                    size=8,
                    color=colors[cluster],
                    opacity=0.8
                ),
                text=[f"Cluster: {cluster}<br>Text: {d[:100]}..." for d in np.array(documents)[mask]],
                hoverinfo='text'
            ))
            
            # Add the center
            fig.add_trace(go.Scatter(
                x=[center[0]],
                y=[center[1]],
                mode='markers',
                name=f'Center {cluster}',
                marker=dict(
                    symbol='star',
                    size=15,
                    color=colors[cluster],
                    line=dict(width=2, color='white')
                ),
                hoverinfo='text',
                text=f'Center of Cluster {cluster}'
            ))
    
    # Update layout
    fig.update_layout(
        title=f'Document Clusters Visualization ({boundary_type})',
        xaxis_title='t-SNE dimension 1',
        yaxis_title='t-SNE dimension 2',
        width=800,
        height=600,
        showlegend=True
    )
    
    return fig

# Example usage:
# fig1 = visualize_clusters(vectors, boundary_type='hull')  # Convex hull
# fig2 = visualize_clusters(vectors, boundary_type='hull_curve')  # Curvy hull
# fig3 = visualize_clusters(vectors, boundary_type='hull_circle')  # Hull-based circle
# fig4 = visualize_clusters(vectors, boundary_type='std_circle')  # Std-based circle
# fig5 = visualize_clusters(vectors, boundary_type='mean_circle')  # Mean-based circle

# Show all visualizations
for boundary_type in ['hull', 'hull_curve', 'hull_circle', 'std_circle', 'mean_circle']:
    fig = visualize_clusters(vectors, boundary_type=boundary_type)
    fig.show()
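
The two simplest boundary options, `mean_circle` and `std_circle`, share the same center (the mean of the cluster's points) and differ only in the radius. The sketch below reproduces both radius formulas from `visualize_clusters` in plain Python on a made-up set of points. `circle_radii` and `toy_points` are hypothetical names used only for this illustration.

```python
import math
import statistics

def circle_radii(points: list[tuple[float, float]]) -> tuple[float, float]:
    """Return (mean_circle radius, std_circle radius) for a set of 2D points."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    # Distance of every point from the cluster center
    distances = [math.hypot(x - cx, y - cy) for x, y in points]
    mean_circle_radius = max(distances)  # circle reaches the farthest point
    # np.std defaults to the population standard deviation, hence pstdev here
    std_circle_radius = 2 * statistics.pstdev(distances)
    return mean_circle_radius, std_circle_radius

# Four corners of a square plus its center (toy data, not from the vector store)
toy_points = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
mean_r, std_r = circle_radii(toy_points)
print(f"mean_circle radius: {mean_r:.3f}, std_circle radius: {std_r:.3f}")
```

Because the `std_circle` radius is twice the spread of the distances rather than the largest distance, it can leave some points outside the circle, as the corner points are here. The `mean_circle` always encloses every point in the cluster.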

In [None]:
# Let's try 3D!

tsne = TSNE(n_components=3, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create the 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    z=reduced_vectors[:, 2],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='3D Chroma Vector Store Visualization',
    scene=dict(xaxis_title='x', yaxis_title='y', zaxis_title='z'),
    width=900,
    height=700,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

In [None]:
def visualize_3d_clusters(vectors, n_clusters=5):
    cluster_colors = [
        'rgb(255, 0, 0)',    # Red
        'rgb(0, 255, 0)',    # Green
        'rgb(0, 0, 255)',    # Blue
        'rgb(255, 165, 0)',  # Orange
        'rgb(128, 0, 128)',  # Purple
        'rgb(0, 255, 255)',  # Cyan
        'rgb(255, 192, 203)', # Pink
        'rgb(165, 42, 42)',  # Brown
        'rgb(0, 128, 0)',    # Dark Green
        'rgb(128, 128, 0)'   # Olive
    ]
    tsne = TSNE(n_components=3, random_state=42)
    reduced_vectors = tsne.fit_transform(vectors)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(reduced_vectors)
    fig = go.Figure()

    # Add scatter points for each cluster
    for i in range(n_clusters):
        cluster_points = reduced_vectors[cluster_labels == i]
        cluster_docs = [doc for j, doc in enumerate(documents) if cluster_labels[j] == i]
        cluster_types = [t for j, t in enumerate(doc_types) if cluster_labels[j] == i]
        fig.add_trace(go.Scatter3d(
            x=cluster_points[:, 0],
            y=cluster_points[:, 1],
            z=cluster_points[:, 2],
            mode='markers',
            marker=dict(
                size=5,
                color=cluster_colors[i % len(cluster_colors)],
                opacity=0.8
            ),
            text=[f"Type: {t}<br>Text: {d[:100]}..." for t, d in zip(cluster_types, cluster_docs)],
            hoverinfo='text',
            name=f'Cluster {i+1}',
            visible=True,
            showlegend=True
        ))

    # Add a bounding sphere for each cluster (centered on the mean, reaching the farthest point)
    for i in range(n_clusters):
        cluster_points = reduced_vectors[cluster_labels == i]
        center = np.mean(cluster_points, axis=0)
        radius = np.max(np.linalg.norm(cluster_points - center, axis=1))
        u = np.linspace(0, 2 * np.pi, 20)
        v = np.linspace(0, np.pi, 20)
        x = center[0] + radius * np.outer(np.cos(u), np.sin(v))
        y = center[1] + radius * np.outer(np.sin(u), np.sin(v))
        z = center[2] + radius * np.outer(np.ones(np.size(u)), np.cos(v))
        fig.add_trace(go.Surface(
            x=x, y=y, z=z,
            opacity=0.07,
            showscale=False,
            name=f'Cluster {i+1} Boundary',
            surfacecolor=np.zeros_like(x),
            colorscale=[[0, cluster_colors[i % len(cluster_colors)]], [1, cluster_colors[i % len(cluster_colors)]]],
            showlegend=False,
            visible=True,
            hoverinfo='skip'  # disable hover for the boundary spheres
        ))

    # Two buttons: Show Points / Hide Points
    buttons = [
        dict(
            label="Show Points",
            method="update",
            args=[{"visible": [True]*n_clusters + [True]*n_clusters}]
        ),
        dict(
            label="Hide Points",
            method="update",
            args=[{"visible": [False]*n_clusters + [True]*n_clusters}]
        )
    ]

    fig.update_layout(
        title={
            'text': '3D Chroma Vector Store Visualization with Cluster Boundaries',
            'y':0.95,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        scene=dict(
            xaxis_title='x',
            yaxis_title='y',
            zaxis_title='z',
            aspectmode='cube'
        ),
        width=900,
        height=700,
        margin=dict(r=8, b=8, l=8, t=60),
        updatemenus=[
            dict(
                type="buttons",
                direction="right",
                x=0.85,
                y=1.0,
                showactive=True,
                buttons=buttons,
                pad={"r": 10, "t": 10},
                bgcolor="rgba(255, 255, 255, 0.8)"
            )
        ],
        legend=dict(
            y=0.9,
            x=0.85,
            xanchor='right',
            yanchor='top'
        )
    )
    return fig

# Usage:
fig = visualize_3d_clusters(vectors, n_clusters=5)
fig.show()
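
The Show/Hide buttons work because Plotly's `update` method accepts a `visible` list with one entry per trace, in the order the traces were added. In `visualize_3d_clusters` the first `n_clusters` traces are the point clouds and the next `n_clusters` are the boundary surfaces. The sketch below builds that list in isolation; `visibility_mask` is a hypothetical helper, not part of Plotly.

```python
def visibility_mask(n_clusters: int, show_points: bool) -> list[bool]:
    """Build the per-trace visibility list: point traces first, boundary traces second."""
    point_traces = [show_points] * n_clusters  # toggled by the buttons
    boundary_traces = [True] * n_clusters      # boundaries always stay visible
    return point_traces + boundary_traces

print(visibility_mask(3, show_points=False))
```

If the traces were added in a different order, for example alternating points and boundaries, the mask would have to be interleaved to match.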