https://itnext.io/visualize-your-rag-data-eda-for-retrieval-augmented-generation-0701ee98768f

    Prepare Documents: Start by collecting data. This tutorial uses Formula One data from Wikipedia in HTML format as an example to build a dataset for our RAG application. You can also use your own data here!
    Split and Create Embeddings: Break down the collected documents into smaller snippets and use an embedding model to convert them into compact vector representations. This involves utilizing a splitter, OpenAI’s text-embedding-ada-002, and ChromaDB as vector store.
    Build a LangChain: Set up the LangChain by combining a prompt generator for context creation, a retriever for fetching relevant snippets, and an LLM (GPT-4) to answer queries.
    Ask a Question: Learn how to ask questions to the RAG application.
    Visualize: Use Renumics-Spotlight visualize the embeddings in 2D, and analyze the relationships and proximities between queries and document snippets.

In [1]:
!pip install langchain langchain-openai chromadb renumics-spotlight 

Collecting langchain
  Downloading langchain-0.1.7-py3-none-any.whl.metadata (13 kB)
Collecting langchain-openai
  Downloading langchain_openai-0.0.6-py3-none-any.whl.metadata (2.5 kB)
Collecting chromadb
  Downloading chromadb-0.4.22-py3-none-any.whl.metadata (7.3 kB)
Collecting renumics-spotlight
  Downloading renumics_spotlight-1.6.4-py3-none-any.whl.metadata (5.4 kB)
Collecting PyYAML>=5.3 (from langchain)
  Downloading PyYAML-6.0.1-cp311-cp311-win_amd64.whl.metadata (2.1 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain)
  Downloading SQLAlchemy-2.0.27-cp311-cp311-win_amd64.whl.metadata (9.8 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Downloading aiohttp-3.9.3-cp311-cp311-win_amd64.whl.metadata (7.6 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl.metadata (25 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Using cached jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-com

In [1]:
%env OPENAI_API_KEY=<>



env: OPENAI_API_KEY=<>


Split and Create embeddings for the dataset

You can skip this section and download a database with embeddings of the Formula One Dataset.

To create the embeddings on your own you first need to set up the embeddings model and the vectorstore. Here we use text-embedding-ada-002 from OpenAIEmbeddings and a vectorstore using ChromaDB:

In [2]:
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores.chroma import Chroma

embeddings_model = OpenAIEmbeddings(model="text-embedding-ada-002")
docs_vectorstore = Chroma(
    collection_name="docs_store",
    embedding_function=embeddings_model,
    persist_directory="docs-db",
)

To fill the vector store we load the html documents using the BSHTMLLoader:

In [4]:
from langchain_community.document_loaders import BSHTMLLoader, DirectoryLoader
import getpass
username = getpass.getuser()


loader = DirectoryLoader(
    "docs",
    glob="*.html",
    loader_cls=BSHTMLLoader,
    loader_kwargs={"open_encoding": "utf-8"},
    recursive=True,
    show_progress=True,
)
docs = loader.load()

ModuleNotFoundError: No module named 'pwd'

Divide them into smaller chunks

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
splits = text_splitter.split_documents(docs)

Additionally you can create an id that can be reconstructed from the metadata. This allows to find the embeddings in the db if you only have the document with its content and metadata. You can add all to the database and store it:

In [None]:
import hashlib
import json
from langchain_core.documents import Document

def stable_hash(doc: Document) -> str:
    """
    Stable hash document based on its metadata.
    """
    return hashlib.sha1(json.dumps(doc.metadata, sort_keys=True).encode()).hexdigest()

split_ids = list(map(stable_hash, splits))
docs_vectorstore.add_documents(splits, ids=split_ids)
docs_vectorstore.persist()

Build the LangChain

First, you need to choose an LLM Model. Here, we use GPT-4. Also, you need to prepare the retriever to use the vector store:

In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0.0)
retriever = docs_vectorstore.as_retriever(search_kwargs={"k": 20})



Setting the temperature parameter to 0.0 when initializing the ChatOpenAI model ensures deterministic output.

Now, let’s create a prompt for RAG. The LLM will be provided with the user’s question and the retrieved documents as a context to answer the question. It is also instructed to provide the sources that allowed its answer:

In [None]:
from langchain_core.prompts import ChatPromptTemplate

template = """
You are an assistant for question-answering tasks.
Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES").
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
ALWAYS return a "SOURCES" part in your answer.

QUESTION: {question}
=========
{source_documents}
=========
FINAL ANSWER: """
prompt = ChatPromptTemplate.from_template(template)



Next, set up a processing pipeline that starts by formatting the retrieved documents to contain the page content and the source file path. This formatted input is then fed into a language model (LLM) step that generates an answer based on the combined user question and document context.

In [None]:
from typing import List

from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser


def format_docs(docs: List[Document]) -> str:
    return "\n\n".join(
        f"Content: {doc.page_content}\nSource: {doc.metadata['source']}" for doc in docs
    )


rag_chain_from_docs = (
    RunnablePassthrough.assign(
        source_documents=(lambda x: format_docs(x["source_documents"]))
    )
    | prompt
    | llm
    | StrOutputParser()
)
rag_chain = RunnableParallel(
    {
        "source_documents": retriever,
        "question": RunnablePassthrough(),
    }
).assign(answer=rag_chain_from_docs)



Ask a Question

The RAG application is now ready to answer questions:

In [None]:
question = "Who built the nuerburgring"
response = rag_chain.invoke(question)
response["answer"]

NameError: name 'rag_chain' is not defined

Visualize

To explore the data in Spotlight, we use Pandas DataFrame to organize our data. Let’s start with the extraction of the text snippets and their embeddings from the vector store. In addition, let’s mark the correct answer:

In [None]:
import pandas as pd

response = docs_vectorstore.get(include=["metadatas", "documents", "embeddings"])
df = pd.DataFrame(
    {
        "id": response["ids"],
        "source": [metadata.get("source") for metadata in response["metadatas"]],
        "page": [metadata.get("page", -1) for metadata in response["metadatas"]],
        "document": response["documents"],
        "embedding": response["embeddings"],
    }
)
df["contains_answer"] = df["document"].apply(lambda x: "Eichler" in x)
df["contains_answer"].to_numpy().nonzero()


The question and the associated answer are also projected into the Embeddings Space. They are processed in the same way as the text snippets:

In [None]:
question_row = pd.DataFrame(
    {
        "id": "question",
        "question": question,
        "embedding": embeddings_model.embed_query(question),
    }
)
answer_row = pd.DataFrame(
    {
        "id": "answer",
        "answer": answer,
        "embedding": embeddings_model.embed_query(answer),
    }
)
df = pd.concat([question_row, answer_row, df])

In [None]:
import numpy as np
question_embedding = embeddings_model.embed_query(question)
df["dist"] = df.apply(
    lambda row: np.linalg.norm(
        np.array(row["embedding"]) - question_embedding
    ),
    axis=1,
)

This can additionally be used for visualization and will be stored in the column distance:

In [None]:
from renumics import spotlight
spotlight.show(df)

It will open a new browser window. The top-left table section displays all fields of the dataset. You can use the “visible column” button to select the columns “question”, “answer”, “source”, “document”, and “dist”. Ordering the table by “dist” shows the question, answer, and the most relevant document snippets on top. Select the first 14 rows to highlight them in the similarity map on the top right.

What’s next?

The good visualization of a single question, answer, and the related documents shows a large potential for RAG. Using dimensionality reduction techniques can make the embedding space accessible for users and developers. The utility of the specific presentation in this article is still very limited. It remains exciting to explore the possibilities of these methods in presenting many questions and thus illustrating the use of a RAG system in operation or checking the coverage of the embedding space through evaluation questions. Stay tuned for more articles to follow.