# How to build a RAG with ChromaDB and ChatGPT

This notebook first shows how a document can be prepared and added to ChromaDB.
It then defines a RAG function that uses an LLM (ChatGPT) to answer questions based on the results of querying the database.

First, we load the transcript file and clean it.

In [3]:
file_path = 'brain_hack.txt'

with open(file_path, 'r') as file:
    lines = file.readlines()

# Keep the even-indexed lines (the spoken text) and skip the odd-indexed ones (time tags)
filtered_lines = [line.strip() for i, line in enumerate(lines) if i % 2 == 0]

# Join the filtered lines into a single text
filtered_text = ' '.join(filtered_lines)

# Print or use the filtered text as needed
print(filtered_text)
pdf_texts = [filtered_text]

I'm angry. And I’m angry because I wish I knew this when I was younger. So I’m a neuroscientist and a lecturer. And as a neuroscientist, I study the brain and the nerves that span out into the body. And as a lecturer, I teach the next generation of healthcare professionals. And look, I see some students struggle with their learning, especially the older ones, but it’s not their fault. You know, we don't get taught how to learn. We just kind of expect it to happen. And I think the worst curse of all really is it gets harder to learn as we age. But what if I told you that there are things that we can do to learn faster and more effectively? I’m going to take you through the neuroscience behind six critical ingredients that can help you learn faster: attention, alertness, sleep, repetition, breaks, and mistakes. Now, first things first. How do we actually learn? We need neuroplasticity to happen. So neuroplasticity is the scientific term that essentially means our brain’s ability to physi
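The parity filter above assumes that text lines and time tags strictly alternate. A slightly more defensive sketch drops any line that consists only of a timestamp. It assumes the tags look like `00:04` or `1:02:33`; the actual tag format of `brain_hack.txt` is not shown here, and `strip_time_tags` is a hypothetical helper name.

```python
import re

# Assumed timestamp format: "MM:SS" or "H:MM:SS" alone on a line (hypothetical).
TIME_TAG = re.compile(r"^\d{1,2}:\d{2}(:\d{2})?$")

def strip_time_tags(lines):
    """Return the non-empty lines that are not bare time tags, joined by spaces."""
    kept = []
    for line in lines:
        text = line.strip()
        if text and not TIME_TAG.match(text):
            kept.append(text)
    return " ".join(kept)

# Made-up sample lines, only to exercise the function.
sample = ["I'm angry.\n", "00:04\n", "And I wish I knew this.\n", "00:09\n"]
print(strip_time_tags(sample))
```

Unlike the index-based filter, this keeps working if a blank line or a missing tag shifts the alternation.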

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

In [5]:
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=500,
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

print(f"\nTotal chunks: {len(character_split_texts)}")


Total chunks: 38
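To give an intuition for what a recursive character splitter does, here is a toy version in plain Python. It is a simplified sketch, not LangChain's implementation: it tries each separator in order, merges pieces while they fit in `chunk_size`, and falls back to the next separator for pieces that are still too long. Unlike the real splitter, it drops the separator at chunk boundaries. `recursive_split` is a hypothetical name.

```python
def recursive_split(text, separators, chunk_size):
    """Toy recursive splitter: try separators in order until every chunk fits."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No usable separator left: cut at fixed positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep growing the current chunk
            continue
        if current:
            chunks.append(current)       # current chunk is full
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks

print(recursive_split("aaa. bbb. ccc", [". ", " ", ""], chunk_size=8))
```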


In [6]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)
   
print(f"\nTotal chunks: {len(token_split_texts)}")




Total chunks: 38


In [7]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction()

In [8]:
chroma_client = chromadb.Client()

chroma_collection = chroma_client.create_collection("TEDTalk.txt", embedding_function=embedding_function)

ids = [str(i) for i in range(len(token_split_texts))]

# The .add method embeds token_split_texts using the embedding_function specified above

chroma_collection.add(ids=ids, documents=token_split_texts)

chroma_collection.count()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


38
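The tokenizers warning in the output above names its own remedy: setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in a cell before any tokenizer is first used:

```python
import os

# Set this before sentence-transformers / tokenizers are first used in the process.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```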

In [9]:
# Set ChatGPT API connection

import os
from openai import OpenAI

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())  # read local .env file

# The OpenAI client takes the key directly, so the module-level openai.api_key is not needed
openai_client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

In [10]:
# The 'content' of the system message determines how the assistant behaves.
# Feel free to modify it and test different scenarios.

def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "You are a learning assistant. Your users are students asking questions about information contained in a TED talk transcript. "
            "You will be shown the user's question, and the relevant information from the lecture transcript. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
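
Separating prompt construction from the API call makes the prompt easy to inspect and test offline. The sketch below mirrors the message layout used in `rag`; `build_messages` is a hypothetical helper, not part of the OpenAI library.

```python
def build_messages(query, retrieved_documents):
    """Build the chat messages without calling the API (hypothetical helper)."""
    information = "\n\n".join(retrieved_documents)
    # Adjacent string literals are concatenated as-is, so each part needs its own trailing space.
    system_prompt = (
        "You are a learning assistant. Your users are students asking questions "
        "about information contained in a TED talk transcript. "
        "You will be shown the user's question, and the relevant information from "
        "the lecture transcript. Answer the user's question using only this information."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"},
    ]

# Placeholder chunks, only to show the resulting structure.
messages = build_messages("What is neuroplasticity?", ["chunk one", "chunk two"])
for message in messages:
    print(message["role"], "->", message["content"][:80])
```

The returned list can be passed as the `messages` argument of `openai_client.chat.completions.create`.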

In [19]:
def get_tedtalk_answer(question, detail_retrieved_docs=True):

    results = chroma_collection.query(query_texts=[question], n_results=5)

    # Under the hood, .query() embeds the query with the same embedding function used when adding the documents.
    # Chroma then searches for the stored documents most similar to the query and returns the closest ones (5 here).

    retrieved_documents = results['documents'][0]

    # If required, we can list the retrieved fragments:
    if detail_retrieved_docs:

        print('The fragments that show a closest match to the question are: \n')

        for document in retrieved_documents:
            print(document)
        
        print('\n And the rag output is:')


    output = rag(query=question, retrieved_documents=retrieved_documents)
        
    return output
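
Conceptually, the similarity search inside `.query()` ranks stored embeddings by how close they are to the query embedding. The toy sketch below uses hand-made 2-D vectors and cosine similarity purely for illustration; Chroma's actual index and distance metric depend on how the collection is configured, and `cosine_similarity` and `top_k` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]

# Made-up embeddings: documents 0 and 2 point roughly in the query's direction.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], doc_vecs, k=2))
```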


In [20]:
# Test without detailing the retrieved fragments:

get_tedtalk_answer("What is main idea of the lecture?", False)

'The main idea of the lecture is that attention, alertness, sleep, repetition, breaks, and mistakes can be used to improve learning. Paying attention is important for learning, and when we are fully focused on a task, we are more likely to retain information for the long term. Repetition is key in learning, as it is not enough to hear or see something once and expect to remember it forever. It is important to prioritize sleep before studying to improve alertness, and to study after learning to retain information for the long term. The hippocampus, which is important for learning and memory, keeps track of information like a diary, but only for the short term.'

In [21]:
# Test detailing the retrieved fragments:

get_tedtalk_answer("What is the biggest challenge to improve learning?")

The fragments that show a closest match to the question are: 

. you can use attention, alertness, sleep, repetition, breaks, and mistakes to make your learning better. so first things first. in order to learn, we need to pay attention, right? attention is a really important function. so, for example, if i were to ask you to close your eyes and focus on your contact between your feet and the floor, you ’ ll suddenly be aware of maybe the texture of your socks, maybe how tight your shoes are, maybe how firm the floor is
. and i think the worst curse of all really is it gets harder to learn as we age. but what if i told you that there are things that we can do to learn faster and more effectively? i ’ m going to take you through the neuroscience behind six critical ingredients that can help you learn faster : attention, alertness, sleep, repetition, breaks, and mistakes. now, first things first. how do we actually learn? we need neuroplasticity to happen
. make sure you take a 10 - to 20

'The biggest challenge to improve learning is that it gets harder to learn as we age. However, there are several factors that can help improve learning, including attention, alertness, sleep, repetition, breaks, and mistakes. It is important to prioritize attention and alertness, possibly through exercise, and repeat the material multiple times over multiple days. Taking breaks of 10 to 20 minutes after learning is also beneficial. Embracing mistakes and allowing for proper sleep are also important for effective learning. Additionally, it is not true that some people have a magical talent for learning; it mostly comes down to practice, perseverance, and starting to learn a skill early in development.'

In [22]:
# Test with a question that is not related to the text:

get_tedtalk_answer("What has Mafalda done for the Chinese government?")

The fragments that show a closest match to the question are: 

. but sleep is really important for learning for another reason. so sleep serves a really important constellation of functions. so, for example, it resets our immune system, it resets our metabolism, it resets our emotional control, and it even gets rid of the waste that builds up in our brain over the course of the day. but sleep is actually critical for memory consolidation, so for turning short - term memories into long - term memories
. just like exercising builds muscle, repetitive patterns of thinking or doing things will reinforce those pathways and those connections in the brain associated with doing that thing, so it'll become easier to recall. so through the process of neuroplasticity, you ’ re making these brand - new connections. and that takes energy, requires fatty acids, requires lots of little proteins to be made. it ’ s a big job. it takes a lot of energy
. so i ’ m not saying it causes adhd, but studies ha

'There is no information provided in the transcript about what Mafalda has done for the Chinese government.'
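
Note that the unrelated question still retrieved fragments, because the query returns the nearest `n_results` documents regardless of how far away they are. One possible guard is to drop fragments whose distance exceeds a threshold before calling the LLM. The sketch below works on a dictionary shaped like a Chroma query result with `documents` and `distances` keys; `filter_by_distance`, the sample data, and the threshold of `1.0` are all assumptions that would need tuning for a real collection.

```python
def filter_by_distance(results, max_distance):
    """Keep only documents whose distance to the query is at most max_distance."""
    documents = results["documents"][0]
    distances = results["distances"][0]
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]

# Hand-written stand-in for a query result; the texts and numbers are made up.
fake_results = {
    "documents": [["sleep consolidates memory", "repetition reinforces pathways"]],
    "distances": [[0.42, 1.37]],
}
print(filter_by_distance(fake_results, max_distance=1.0))
```

If the filtered list is empty, the function could return a fixed "not covered in the transcript" message without spending an API call.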