# RAG System

#### Javier Corpus

### Assignment Instructions

Create a chat like the one in the reading [Building Your First RAG System with Python and OpenAI](https://dev.to/mazyaryousefinia/building-your-first-rag-system-with-python-and-openai-1326) using any Wikipedia page. Be sure to ask your bot at least two separate questions. 

This code is based on the one shown in the YouTube video [Retrieval Augmented Generation with OpenAI and Chroma](https://www.youtube.com/watch?v=Cim1lNXvCzY).

The first thing we are going to do is import all the required libraries:

 - WikipediaLoader - Used to fetch and load content from Wikipedia articles.
 - RecursiveCharacterTextSplitter - Used to split long texts into smaller chunks for processing.
 - Chroma - Used as a vector database for storing and retrieving embeddings.
 - OpenAIEmbeddings - Used to generate text embeddings using OpenAI's models.
 - RetrievalQA - Used to create question-answering systems.
 - PromptTemplate - Used to create reusable templates for structuring LLM prompts.
 - ChatOpenAI - Used to interact with OpenAI's chat models.
 - pprint - Used to format and display Python data structures in a readable way.

Then we select a model (`gpt-5` in this case) and create an instance of the `OpenAIEmbeddings` class, which sets up a connection to OpenAI's embedding API. The API key is loaded from the environment variables.

In [1]:
from langchain.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
import pprint

GPT_MODEL = "gpt-5"
embeddings = OpenAIEmbeddings()
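
When no key is passed explicitly, `OpenAIEmbeddings` and `ChatOpenAI` read it from the `OPENAI_API_KEY` environment variable. The following is a minimal sketch of a check we could run before creating them, so that a missing key fails with a clear message. The helper name `require_api_key` is hypothetical and not part of any library.

```python
import os


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in the environment, or raise a clear error.

    `require_api_key` is an illustrative helper, not a LangChain or OpenAI function.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before creating the embeddings."
        )
    return key
```

Calling `require_api_key()` at the top of the notebook would stop execution early instead of failing later inside the first API call.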

We are going to search Wikipedia for "Day of the Dead", just because it's a tradition I personally like. We are loading the first article it finds (https://en.wikipedia.org/wiki/Day_of_the_Dead). The text is then split into smaller chunks.

In [2]:
search_term = "Day of the Dead"
docs = WikipediaLoader(query=search_term, load_max_docs=1).load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap  = 20,
    length_function = len,
    is_separator_regex = False,
)

data = text_splitter.split_documents(docs)
data[:1]  # Printing the first chunk to verify

[Document(metadata={'title': 'Day of the Dead', 'summary': 'The Day of the Dead (Spanish: Día de (los) Muertos) is a holiday traditionally celebrated on November 1 and 2, though other days, such as October 31 or November 6, may be included depending on the locality. The multi-day holiday involves family and friends gathering to pay respects and  remember friends and family members who have died. These celebrations can take a humorous tone, as celebrants remember amusing events and anecdotes about the departed. It is widely observed in Mexico, where it largely developed, and is also observed in other places, especially by people of Mexican heritage. The observance falls during the Christian period of Allhallowtide. Some argue that there are Indigenous Mexican or ancient Aztec influences that account for the custom, though others see it as a local expression of the Allhallowtide season that was brought to the region by the Spanish; the Day of the Dead has become a way to remember those f
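
To get a feel for what `chunk_size = 100` and `chunk_overlap = 20` mean, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm used by `RecursiveCharacterTextSplitter` (which also tries to break on separators such as paragraphs and spaces); the function name `split_with_overlap` is made up for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Sentence taken from the article summary shown in the output above.
sample = (
    "The Day of the Dead is a holiday traditionally celebrated on November 1 and 2, "
    "though other days, such as October 31 or November 6, may be included depending on the locality."
)
chunks = split_with_overlap(sample)
for index, chunk in enumerate(chunks):
    print(index, len(chunk), repr(chunk))
```

This sentence is longer than 100 characters, so with these settings it cannot fit in a single chunk. Small chunks keep each embedding focused, but they may also separate a fact from the words a question would use to find it.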

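We store the chunks and their embeddings locally in a Chroma vector database, persisted in a directory called `Wiki_DDM` (short for "Wikipedia - Día de Muertos"). Each chunk gets an ID built from its source URL and its position in the list.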

In [3]:
store = Chroma.from_documents(
    data, 
    embeddings, 
    ids = [f"{item.metadata['source']}-{index}" for index, item in enumerate(data)],
    collection_name="DayoftheDead-Embeddings", 
    persist_directory='Wiki_DDM',
)
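
Conceptually, a vector store keeps an ID, the text, and a vector for each chunk, and answers a query by returning the chunks whose vectors are most similar to the query vector. The sketch below is a toy, in-memory stand-in that uses word counts instead of OpenAI embeddings. `ToyVectorStore`, `embed`, and `cosine` are names invented for this example and have nothing to do with Chroma's actual implementation.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real model returns dense float vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps id -> (text, vector) in memory and ranks by cosine similarity."""

    def __init__(self) -> None:
        self._rows = {}

    def add(self, ids, texts) -> None:
        for doc_id, text in zip(ids, texts):
            self._rows[doc_id] = (text, embed(text))

    def search(self, query: str, k: int = 4):
        query_vector = embed(query)
        ranked = sorted(
            self._rows.items(),
            key=lambda row: cosine(query_vector, row[1][1]),
            reverse=True,
        )
        return [(doc_id, text) for doc_id, (text, _vector) in ranked[:k]]


texts = [
    "Ofrendas are home altars with the favorite foods of the departed.",
    "Marigolds and calaveras are common symbols.",
    "The holiday is celebrated on November 1 and 2.",
]
# Same "<source>-<index>" ID scheme as the cell above, with a placeholder source.
ids = [f"toy-source-{index}" for index, _ in enumerate(texts)]

toy_store = ToyVectorStore()
toy_store.add(ids, texts)
top = toy_store.search("When is the holiday celebrated?", k=1)
print(top)
```

Only the chunks returned by this kind of search reach the model, so an answer can be missing from the response even when it is present in the article.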

In this template, we ask the model to answer using only the context provided, which will be the chunks retrieved from the Chroma vector database.

If we ask a question that cannot be answered with the given information, we expect the model to reply saying that it does not know the answer.

In [4]:
template = """You are a bot that answers questions about Day of the Dead, using only the context provided.
If you don't know the answer, simply state that you don't know.

{context}

Question: {question}"""

PROMPT = PromptTemplate(
    template=template, input_variables=["context", "question"]
)
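
To see what the model will actually receive, we can fill the two placeholders by hand. This sketch uses plain `str.format` as a stand-in for `PromptTemplate`; the context and question strings are made-up examples.

```python
template = """You are a bot that answers questions about Day of the Dead, using only the context provided.
If you don't know the answer, simply state that you don't know.

{context}

Question: {question}"""

# Example values; in the notebook the context comes from the retriever.
example_context = "Home altars are called ofrendas."
example_question = "What is an ofrenda?"

filled_prompt = template.format(context=example_context, question=example_question)
print(filled_prompt)
```

The retrieved chunks replace `{context}` and the user's query replaces `{question}`, so the instructions at the top stay the same for every question.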

We create an instance of the model (`gpt-5` in this example), and we set the temperature to 0 to prevent it from getting too creative.

In [5]:
llm = ChatOpenAI(temperature=0, model=GPT_MODEL)

This chain takes a query as input, retrieves the relevant chunks from the specified source (the Chroma vector DB in this case), and asks the model to answer from them. We set `return_source_documents` to `False` just to keep the output short; if it is set to `True`, each answer is followed by the very lengthy text of the retrieved chunks.

In [11]:
qa_without_source = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=store.as_retriever(),
    chain_type_kwargs={"prompt": PROMPT, },
    return_source_documents=False,
)
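
The `"stuff"` chain type simply puts ("stuffs") all retrieved chunks into a single prompt. The following is a rough sketch of that flow with stand-in functions, assuming the chunks are joined with blank lines. `stuff_answer`, `fake_retrieve`, and `fake_llm` are hypothetical names, and no real retriever or model is called.

```python
from typing import Callable

TEMPLATE = "Use only this context.\n\n{context}\n\nQuestion: {question}"


def stuff_answer(
    query: str,
    retrieve: Callable[[str], list],
    llm: Callable[[str], str],
    prompt_template: str = TEMPLATE,
    return_source_documents: bool = False,
) -> dict:
    """Retrieve chunks, join them into one context, fill the prompt, call the model."""
    chunks = retrieve(query)
    context = "\n\n".join(chunks)  # all chunks go into one prompt
    prompt = prompt_template.format(context=context, question=query)
    output = {"query": query, "result": llm(prompt)}
    if return_source_documents:
        output["source_documents"] = chunks  # this is what makes the output long
    return output


def fake_retrieve(query: str) -> list:
    """Stand-in for the retriever: always returns the same two chunks."""
    return ["Home altars are called ofrendas.", "Marigolds are used as decoration."]


def fake_llm(prompt: str) -> str:
    """Stand-in for the chat model: reports the prompt size instead of answering."""
    return f"(model would answer from a prompt of {len(prompt)} characters)"


short_output = stuff_answer("What is an ofrenda?", fake_retrieve, fake_llm)
long_output = stuff_answer(
    "What is an ofrenda?", fake_retrieve, fake_llm, return_source_documents=True
)
print(sorted(short_output))
print(sorted(long_output))
```

Because every retrieved chunk is pasted into the prompt, this approach works best when the chunks are small and few, as they are here.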

### Testing the code

The first question is a very basic one: "What is the Day of the Dead?" The model answered it right away, using the Wikipedia text stored in Chroma.

In [7]:
query = "What is the Day of the Dead?"
response = qa_without_source.invoke({"query": query})

pprint.pprint(response)

{'query': 'What is the Day of the Dead?',
 'result': 'The Day of the Dead (Spanish: Día de [los] Muertos) is a largely '
           'festive holiday in Mexican culture that serves to remember and '
           'honor the deceased, with traditions like using calaveras (skulls) '
           'and marigolds.'}


The second question is also a basic one ("When is the Day of the Dead traditionally celebrated?"), and the information is available in the Wikipedia article. However, the model was not able to find it and responded with an unexpected "I don't know based on the provided context."

In [8]:
query = "When is the Day of the Dead traditionally celebrated?"
response = qa_without_source.invoke({"query": query})

pprint.pprint(response)

{'query': 'When is the Day of the Dead traditionally celebrated?',
 'result': 'I don’t know based on the provided context.'}


The third question is more complex ("What is the origin of the Day of the Dead?"), since there is no definitive answer in the article. The model gave a very short answer that is acceptable, although it leaves out the competing explanations the article discusses.

In [9]:
query = "What is the origin of the Day of the Dead?"
response = qa_without_source.invoke({"query": query})

pprint.pprint(response)

{'query': 'What is the origin of the Day of the Dead?',
 'result': 'It originates in Mexican culture (Mexico).'}


The fourth question is also a basic one ("What is commonly placed in home altars?"). The model had no trouble answering it.

In [12]:
query = "What is commonly placed in home altars?"
response = qa_without_source.invoke({"query": query})

pprint.pprint(response)

{'query': 'What is commonly placed in home altars?',
 'result': 'The favorite foods and beverages of the departed.'}


The fifth question is a bit tricky ("What are the differences between home altars and ofrendas?"). Nonetheless, the model returned the correct answer.

In [13]:
query = "What are the differences between home altars and ofrendas?"
response = qa_without_source.invoke({"query": query})

pprint.pprint(response)

{'query': 'What are the differences between home altars and ofrendas?',
 'result': 'There is no difference—“home altars” are called “ofrendas.”'}


The sixth question is also tricky ("Who is the president of Mexico?"), since the article mentions a president of Mexico but doesn't say who the current president is. The model made this distinction and returned the expected answer ("I don't know"), while also noting that Lázaro Cárdenas is the only president mentioned in the article.

In [14]:
query = "Who is the president of Mexico?"
response = qa_without_source.invoke({"query": query})

pprint.pprint(response)

{'query': 'Who is the president of Mexico?',
 'result': 'I don’t know from the provided context. The only president '
           'mentioned is the historical figure Lázaro Cárdenas.'}


When asked a completely off-topic question ("What are the three main ingredients in a vanilla latte?"), the model quickly responded with the expected answer.

In [15]:
query = "What are the three main ingredients in a vanilla latte?"
response = qa_without_source.invoke({"query": query})

pprint.pprint(response)

{'query': 'What are the three main ingredients in a vanilla latte?',
 'result': "I don't know."}


Just as an experiment, we raise the temperature of the model to allow it to be a little more creative, and then ask the same question as before: "Who is the president of Mexico?"

In both answers below, the model states that it doesn't know who the president is, which is the expected answer. Unlike the earlier run, it does not mention Lázaro Cárdenas. Note that `qa_without_source` was created earlier with the temperature-0 `llm`; reassigning `llm` does not update an existing chain, so the chain has to be rebuilt for the new temperature to apply.

In [16]:
llm = ChatOpenAI(temperature=1, model=GPT_MODEL)

In [17]:
query = "Who is the president of Mexico?"
response = qa_without_source.invoke({"query": query})

pprint.pprint(response)

{'query': 'Who is the president of Mexico?', 'result': "I don't know."}


In [18]:
llm = ChatOpenAI(temperature=2, model=GPT_MODEL)

In [19]:
query = "Who is the president of Mexico?"
response = qa_without_source.invoke({"query": query})

pprint.pprint(response)

{'query': 'Who is the president of Mexico?', 'result': "I don't know."}
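
One detail to keep in mind with experiments like this: an object that received `llm` when it was created keeps that original object, even if the name `llm` is later assigned to a new one. The sketch below illustrates this with plain Python; `ToyChain` and the dictionary standing in for the model are invented for this example and are not LangChain classes.

```python
class ToyChain:
    """Stand-in for a chain: it stores the model object it was given."""

    def __init__(self, llm: dict) -> None:
        self.llm = llm

    def temperature(self) -> int:
        return self.llm["temperature"]


llm = {"temperature": 0}
chain = ToyChain(llm)

# Rebinding the name creates a new object; the chain still holds the old one.
llm = {"temperature": 1}
before_rebuild = chain.temperature()

# Rebuilding the chain is what makes it use the new settings.
chain = ToyChain(llm)
after_rebuild = chain.temperature()

print(before_rebuild, after_rebuild)
```

In the notebook, the equivalent step would be to run the `RetrievalQA.from_chain_type(...)` cell again after changing `llm`.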


## Conclusion

The model (`gpt-5`) did a good job overall, although it was not perfect. I think all the answers are acceptable except the second one ("When is the Day of the Dead traditionally celebrated?"). The Wikipedia article clearly states the dates (The Day of the Dead is a holiday traditionally celebrated on November 1 and 2, though other days, such as October 31 or November 6, may be included depending on the locality).

When the source can be read quickly (such as this Wikipedia article) and we ask only a few questions, it's easy to spot the errors. However, with a larger document (or several), or with many questions, it becomes more difficult to evaluate the accuracy of the model. It's important to keep this in mind, especially if we are dealing with sensitive or critical information.
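
One possible way to make this kind of checking less manual is to keep a small list of questions with keywords we expect in the answer, and flag the answers that miss them. This is only a rough sketch: keyword matching is a crude measure, and `keyword_score`, `evaluate`, and `canned_ask` are hypothetical names. The canned answers are copied from two of the outputs above so that the example runs without calling the model.

```python
def keyword_score(answer: str, expected_keywords: list) -> float:
    """Fraction of expected keywords that appear in the answer (case-insensitive)."""
    answer_lower = answer.lower()
    hits = [k for k in expected_keywords if k.lower() in answer_lower]
    return len(hits) / len(expected_keywords)


def evaluate(ask, cases, threshold: float = 1.0) -> list:
    """Return (question, answer) pairs whose score falls below the threshold."""
    failures = []
    for question, expected_keywords in cases:
        answer = ask(question)
        if keyword_score(answer, expected_keywords) < threshold:
            failures.append((question, answer))
    return failures


# Answers copied from the notebook outputs, used here instead of a live model.
canned_answers = {
    "When is the Day of the Dead traditionally celebrated?":
        "I don't know based on the provided context.",
    "What is commonly placed in home altars?":
        "The favorite foods and beverages of the departed.",
}


def canned_ask(question: str) -> str:
    return canned_answers[question]


cases = [
    ("When is the Day of the Dead traditionally celebrated?", ["November 1"]),
    ("What is commonly placed in home altars?", ["foods", "beverages"]),
]

failures = evaluate(canned_ask, cases)
for question, answer in failures:
    print("NEEDS REVIEW:", question, "->", answer)
```

With the real chain, `canned_ask` could be replaced by something like `lambda q: qa_without_source.invoke({"query": q})["result"]`.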

