# End of week 1 exercise

To demonstrate your familiarity with the OpenAI API and Ollama, build a tool that takes a technical question
and responds with an explanation. This is a tool that you will be able to use yourself during the course!

In [24]:
# imports
from dotenv import load_dotenv
from IPython.display import Markdown, display, update_display
from openai import OpenAI
import ollama

In [25]:
# constants
MODEL_GPT = "gpt-4.1-mini"
MODEL_LLAMA = "llama3.2"

In [None]:
# set up environment
load_dotenv()
openai = OpenAI()

In [20]:
# here is the question; type over this to ask something new
question = """
In AI systems, what is RAG and how does it reduce hallucinations compared with a plain LLM chatbot?
"""

In [26]:
# prompts
system_prompt = "You are a helpful technical tutor who answers questions about Python code, software engineering, data science and LLMs."
bad_examples = """
Do not answer like these examples:
1) "RAG is just storing prompts in memory and has nothing to do with retrieval."
2) "RAG guarantees zero hallucinations in all cases."
3) "RAG only works if you fine-tune the base model first."
"""
user_prompt = "Please explain this technical question clearly with sections for short answer, how it works, example, common mistakes, and analogy. " + bad_examples + " Question: " + question

In [27]:
# messages
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]

In [None]:
# Get the GPT model (MODEL_GPT, set to gpt-4.1-mini above) to answer, with streaming
stream = openai.chat.completions.create(model=MODEL_GPT, messages=messages, temperature=0.2, stream=True)
response = ""
display_handle = display(Markdown(""), display_id=True)
for chunk in stream:
    response += chunk.choices[0].delta.content or ""
    update_display(Markdown(response), display_id=display_handle.display_id)

In [28]:
# Get Llama 3.2 to answer
stream = ollama.chat(model=MODEL_LLAMA, messages=messages, stream=True)
response = ""
display_handle = display(Markdown(""), display_id=True)
for chunk in stream:
    response += chunk["message"]["content"]
    update_display(Markdown(response), display_id=display_handle.display_id)

**Short Answer**

RAG (Retrieval-Augmented Generation) is an architecture that combines a retrieval step with a text generation model to reduce hallucinations in large language models (LLMs). A retriever fetches relevant documents from a knowledge base, and those documents are added to the generator's input so that its response is grounded in them.

**How it Works**

The RAG architecture consists of two main components:

1. **Retrieval Model**: A specialized model that takes in user queries and retrieves relevant documents from a database.
2. **Generator Model**: A large language model that generates text based on the retrieved documents.

Here's an overview of how RAG works:

* The user submits a query to the retrieval model, which searches for relevant documents in the database.
* The retrieval model returns a list of top-ranked documents that are most relevant to the query.
* The generator model takes the top-ranked documents as input and generates text based on their content.
* The final output is a response grounded in the retrieved documents, which makes it more likely to be accurate but does not guarantee it.
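
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production retriever: `DOCS`, `tokenize`, `retrieve`, and `build_prompt` are hypothetical names, the sample documents are made up, and simple word-overlap scoring stands in for the embedding-based similarity search that real systems typically use.

```python
import re

# Hypothetical mini knowledge base; a real system would likely use a vector store.
DOCS = [
    "To reset your password, open Settings and choose Reset Password.",
    "Defective products can be returned within 30 days for a full refund.",
    "Shipping usually takes three to five business days.",
]

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs, k=2):
    """Rank docs by the number of words shared with the query; return the top k."""
    query_words = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_docs):
    """Place the retrieved text ahead of the question so the generator can ground its answer."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

top_docs = retrieve("How do I reset my password?", DOCS, k=1)
prompt = build_prompt("How do I reset my password?", top_docs)
print(prompt)
```

The resulting `prompt` string is what would be sent to the generator model in place of the bare question.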

**Example**

Suppose we want to build a RAG-based chatbot for customer support. We have a database of articles related to common customer inquiries, such as "how do I reset my password?" or "what are the return policies for defective products?"

When a user submits a query like "I forgot my password and it's not showing up on my account page", the retrieval model searches for relevant documents in the database and returns a list of top-ranked articles. The generator model takes these articles as input and generates a response that combines the best parts of each article, such as "You can try resetting your password by following these steps: [insert steps from retrieved articles]."
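
As a rough sketch of the generation side of this example, the retrieved articles can be packed into chat messages of the same shape as the `messages` list used earlier in this notebook. The function name `build_support_messages`, the `(title, body)` tuple format, and the sample article are assumptions made for illustration only.

```python
def build_support_messages(user_query, articles):
    """Build chat messages that ground the answer in retrieved articles.

    `articles` is a list of (title, body) tuples; this shape is an assumption.
    """
    if articles:
        # Number each article so the model (and the reader) can refer back to it.
        context = "\n\n".join(
            f"[{i}] {title}\n{body}" for i, (title, body) in enumerate(articles, start=1)
        )
    else:
        context = "(no matching articles found)"
    system = (
        "You are a customer support assistant. Answer only from the articles provided. "
        "If they do not cover the question, say that you do not know."
    )
    user = f"Articles:\n{context}\n\nCustomer question: {user_query}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# Hypothetical retrieved article for the password example
articles = [("Resetting your password", "Go to the login page and click 'Forgot password'.")]
support_messages = build_support_messages(
    "I forgot my password and it's not showing up on my account page", articles
)
```

Because the result has the same structure as `messages`, it could be passed to `openai.chat.completions.create` or `ollama.chat` in the same way as the cells above. Telling the model to admit when the articles do not cover the question is one common way to limit hallucinations when retrieval comes back empty.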

**Common Mistakes**

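* **Neglecting retrieval quality**: Many RAG implementations underestimate the importance of the retrieval step. If the retrieved documents are irrelevant, the generator has nothing reliable to ground its answer in, which leads to poor responses and hallucinations.
* **Over-reliance on the generator model**: Even with good retrieval, the generator can ignore or misread the retrieved context and produce inaccurate or irrelevant responses. RAG does not require fine-tuning the base model; instructing it to answer only from the provided context is a common mitigation.
* **Poor knowledge base**: RAG depends on a high-quality, up-to-date document collection rather than on extra training data. Missing or outdated documents limit what retrieval can return.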

**Analogy**

Think of RAG as a librarian and an author. The librarian (retrieval model) searches for relevant books (documents) in the library's catalog. The author (generator model) takes these books as input and writes a coherent summary or chapter based on their content. By combining the strengths of both, RAG tends to produce more accurate and informative responses than a plain LLM chatbot.