
Damian Garayalde

damiangarayalde@gmail.com






# How to build a RAG with ChromaDB and ChatGPT

This notebook first shows how a document can be prepared and added to ChromaDB.  
It then defines a RAG function and uses an LLM (ChatGPT) to answer questions based on the results of querying the database.

First we need to load the file contents and clean them.

In [15]:
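file_path = 'parsed_text.txt'

# Read the parsed text file (assumes the file is UTF-8 encoded)
with open(file_path, 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Join all lines into a single string
filtered_text = ' '.join(lines)

pdf_texts = [filtered_text]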
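
The cell above loads the text but leaves it largely as-is. Below is a minimal sketch of what a cleaning step could look like, assuming that "cleaning" means dropping empty lines and collapsing repeated whitespace. `clean_lines` is a hypothetical helper and is not part of the original notebook.

```python
import re

def clean_lines(lines):
    """Strip each line, drop empty ones, and collapse runs of whitespace."""
    cleaned = []
    for line in lines:
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            cleaned.append(line)
    return cleaned

# Toy input standing in for the result of file.readlines()
sample_lines = ["  First   line \n", "\n", "Second\tline\n"]
sample_text = " ".join(clean_lines(sample_lines))
print(sample_text)  # First line Second line
```

In the notebook, the same helper could be applied to `lines` before joining them into `filtered_text`.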

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

In [3]:
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=500,
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

print(f"\nTotal chunks: {len(character_split_texts)}")


Total chunks: 6194
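
To make the idea behind the separator list more concrete, here is a simplified sketch of recursive splitting: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. This is an illustration written for this notebook, not LangChain's actual implementation. Unlike `RecursiveCharacterTextSplitter`, it drops the separator at chunk boundaries, and `recursive_split` is a hypothetical name.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified illustration: coarse separators first, finer ones as a fallback."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(part) > chunk_size:
            # This piece alone is too long: split it with the finer separators
            chunks.extend(recursive_split(part, finer, chunk_size))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks

sample = "Para one.\n\nPara two is longer. It has two sentences.\n\nEnd."
chunks = recursive_split(sample, ["\n\n", "\n", ". ", " ", ""], chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Every chunk stays within `chunk_size`, and the paragraph that is too long is broken at the sentence boundary rather than mid-word.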


In [4]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)
   
print(f"\nTotal chunks: {len(token_split_texts)}")




Total chunks: 6194
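
The chunk count is the same before and after token splitting (6194), which suggests that every 500-character chunk already fits within the 256-token window. The sketch below illustrates token-window splitting in general. It uses whitespace-separated words as a stand-in for the model's subword tokenizer, which is an assumption made only to keep the example dependency-free, and `split_by_token_window` is a hypothetical helper.

```python
def split_by_token_window(text, tokens_per_chunk, chunk_overlap=0):
    """Split text into windows of at most tokens_per_chunk whitespace tokens."""
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    tokens = text.split()  # stand-in for a real subword tokenizer
    step = tokens_per_chunk - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(" ".join(tokens[start:start + tokens_per_chunk]))
        if start + tokens_per_chunk >= len(tokens):
            break  # the last window already reached the end of the text
    return windows

short_text = "a chunk that already fits"
long_text = " ".join(f"w{i}" for i in range(10))

print(split_by_token_window(short_text, tokens_per_chunk=8))  # one chunk, unchanged
print(split_by_token_window(long_text, tokens_per_chunk=4))   # three chunks
```

A text shorter than the window comes back as a single chunk, which matches the unchanged count seen above.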


In [5]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()

In [6]:
chroma_client = chromadb.Client()

chroma_collection = chroma_client.create_collection("historia.txt", embedding_function=embedding_function)

ids = [str(i) for i in range(len(token_split_texts))]

# The .add method embeds token_split_texts using the embedding_function specified above

chroma_collection.add(ids=ids, documents=token_split_texts)

chroma_collection.count()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


6194
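
For larger corpora it may be preferable to add documents in batches rather than in a single call. The following is a sketch under stated assumptions: `iter_batches` and `add_in_batches` are hypothetical helpers, the batch size of 1000 is arbitrary, and `FakeCollection` is only a stand-in with the same `add()`/`count()` shape so the example runs without a database.

```python
def iter_batches(ids, documents, batch_size):
    """Yield (ids, documents) slices of at most batch_size items."""
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size], documents[start:start + batch_size]

def add_in_batches(collection, ids, documents, batch_size=1000):
    """Call collection.add once per batch instead of once for everything."""
    for batch_ids, batch_docs in iter_batches(ids, documents, batch_size):
        collection.add(ids=batch_ids, documents=batch_docs)

class FakeCollection:
    """Stand-in for a collection; stores documents in a dict keyed by id."""
    def __init__(self):
        self.store = {}

    def add(self, ids, documents):
        self.store.update(zip(ids, documents))

    def count(self):
        return len(self.store)

docs = [f"chunk {i}" for i in range(2500)]
doc_ids = [str(i) for i in range(len(docs))]

demo_collection = FakeCollection()
add_in_batches(demo_collection, doc_ids, docs, batch_size=1000)
print(demo_collection.count())  # 2500
```

In the notebook, `chroma_collection` would take the place of `demo_collection`, with `ids` and `token_split_texts` as the inputs.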

In [7]:
# Set up the ChatGPT API connection

import os
from openai import OpenAI

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read the local .env file

# Pass the key explicitly; OpenAI() would also read OPENAI_API_KEY from the environment
openai_client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

In [8]:
# The system message's 'content' determines how the assistant behaves.
# Feel free to modify it and test different scenarios.
# The prompt is written in Spanish to match the source text. In English:
# "You are a learning assistant. Your users are students who ask questions about
# information contained in a history text. You will be shown the user's question
# and the relevant information from the text. Answer the user's question using
# only this information."

def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    # Concatenate the retrieved fragments into a single context block
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "Eres un asistente de aprendizaje. Tus usuarios son estudiantes que hacen preguntas sobre información contenida en un texto de historia. Se te mostrará la pregunta del usuario y la información relevante del texto. Responde la pregunta del usuario utilizando solo esta información."
        },
        {"role": "user", "content": f"Pregunta: {query}. \n Información: {information}"}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
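
Separating prompt assembly from the API call makes it possible to inspect exactly what the model will receive without spending any tokens. The sketch below assumes an English prompt adapted from the one used in `rag`. `build_messages` and `SYSTEM_PROMPT` are hypothetical names introduced only for illustration.

```python
SYSTEM_PROMPT = (
    "You are a learning assistant. Your users are students asking questions "
    "about information contained in a history text. You will be shown the "
    "user's question and the relevant information from the text. "
    "Answer the user's question using only this information."
)

def build_messages(query, retrieved_documents, system_prompt=SYSTEM_PROMPT):
    """Return the chat messages list without calling any API."""
    information = "\n\n".join(retrieved_documents)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {query}\nInformation: {information}"},
    ]

messages = build_messages("What are papal bulls?", ["fragment one", "fragment two"])
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

The resulting list has the same shape as the `messages` argument passed to `openai_client.chat.completions.create` above.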

In [None]:

# Alternative English-language version of rag() for a TED talk transcript,
# commented out and kept for reference.
# def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
#     information = "\n\n".join(retrieved_documents)

#     messages = [
#         {
#             "role": "system",
#             "content": "You are a learning assistant. Your users are students asking questions about information contained in a TED talk transcript. "
#             "You will be shown the user's question, and the relevant information from the lecture transcript. Answer the user's question using only this information."
#         },
#         {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
#     ]
    
#     response = openai_client.chat.completions.create(
#         model=model,
#         messages=messages,
#     )
#     content = response.choices[0].message.content
#     return content

In [11]:
def get_tedtalk_answer(question, detail_retrieved_docs=True):

    # Under the hood, .query() embeds the query with the same embedding function
    # used when adding the documents. Chroma then searches for the documents
    # most similar to the query and returns the top matches (5 here).
    results = chroma_collection.query(query_texts=[question], n_results=5)

    # One list of documents is returned per query text; we sent a single query
    retrieved_documents = results['documents'][0]

    # Optionally list the retrieved fragments
    if detail_retrieved_docs:

        print('The fragments most closely related to the question are: \n')

        for document in retrieved_documents:
            print(document)

        print('\n The answer built from these fragments is: ')

    output = rag(query=question, retrieved_documents=retrieved_documents)

    return output
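
The retrieval step boils down to ranking stored embeddings by how close they are to the query embedding. Here is a toy sketch of that idea using cosine similarity over hand-written three-dimensional vectors. The vectors, the document labels, and the choice of cosine similarity are all assumptions for illustration; Chroma's actual index and distance metric may differ, and real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, document_vectors, documents, k):
    """Return the k documents whose vectors are most similar to the query."""
    scored = sorted(
        zip(documents, document_vectors),
        key=lambda pair: cosine_similarity(query_vector, pair[1]),
        reverse=True,
    )
    return [doc for doc, _ in scored[:k]]

# Made-up embeddings: the first axis loosely stands for "papal/religious" content
documents = ["papal bulls", "sleep and memory", "colonial trade"]
document_vectors = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.6, 0.7, 0.1]]
query_vector = [1.0, 0.2, 0.0]

print(top_k(query_vector, document_vectors, documents, k=2))
```

In `get_tedtalk_answer`, `n_results=5` plays the role of `k`, and `results['documents'][0]` holds the ranked documents for the single query text that was sent.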


In [20]:
# Test without listing the retrieved fragments:

get_tedtalk_answer("What is main idea of the lecture?", False)

'The main idea of the lecture is that attention, alertness, sleep, repetition, breaks, and mistakes can be used to improve learning. Paying attention is important for learning, and when we are fully focused on a task, we are more likely to retain information for the long term. Repetition is key in learning, as it is not enough to hear or see something once and expect to remember it forever. It is important to prioritize sleep before studying to improve alertness, and to study after learning to retain information for the long term. The hippocampus, which is important for learning and memory, keeps track of information like a diary, but only for the short term.'

In [19]:
# Test listing the retrieved fragments.
# The question is in Spanish: "What are the papal bulls?"

get_tedtalk_answer("cuales son las bulas papales")

The fragments most closely related to the question are: 

37. para explicar el alcance del poder papal debemos remontarnos a los siglos xiv y xv, distinguiendo entonces dos corrientes ideologicas : la cesarista, que postulaba la preeminencia del poder civil sobre el religioso, y la teocratica, que consideraba al papa como senor universal del mundo, como autoridad suprema tanto en el orden temporal como en el espiritual
el de la religion y la felicidad de sus subditos ”. desde su misma publicacion se desato una polemica acerca de la autenticidad de este documento papal, sosteniendose que su texto contenia alguna maliciosa interpolacion. sin embargo, los modernos estudios han confirmado su total autenticidad, senalando que leon xii expidio conscientemente el breve, aunque bajo la fuerte presion del embajador'espanol en roma, don antonio vargas laguna.
anos siguientes, el enfoque papal fue modificandose no solo por una mejor comprension del caso americano, sino tambien por cierto dis

'Las bulas papales eran documentos emitidos por el papa que otorgaban derechos especiales, como por ejemplo la concesión de un derecho para difundir el evangelio y proteger su predicación. Estos documentos no tenían valor jurídico como donación temporal.'