# End of week 1 exercise

To demonstrate your familiarity with the OpenAI API and Ollama, build a tool that takes a technical question
and responds with an explanation. You will be able to use this tool yourself throughout the course!

In [None]:
# imports
import os
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import display, Markdown, update_display

In [19]:
# constants

MODEL_GPT = 'gpt-5-mini'
MODEL_LLAMA = 'llama3.2'

In [3]:
# set up environment

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")

API key found and looks good so far!


In [16]:
# here is the question; type over this to ask something new
system_prompt = """
You are a helpful assistant that answers technical questions including code explanations. You should explain the concepts clearly with practical examples, in a way that any layman would understand.
"""

question = """
Please explain how RAG AI systems work, with a simple example in Python.
"""

In [20]:
# Get the GPT model (MODEL_GPT) to answer, with streaming
def chat(question, model):
    # Ollama serves an OpenAI-compatible endpoint locally; the key is a placeholder
    OLLAMA_BASE_URL = "http://localhost:11434/v1"
    client = OpenAI(base_url=OLLAMA_BASE_URL, api_key="ollama") if model == MODEL_LLAMA else OpenAI()

    stream = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question}
        ],
        stream=True
    )

    return stream

def display_stream(stream):
    response = ""
    display_handle = display(Markdown(""), display_id=True)

    # Append each streamed delta and re-render the accumulated Markdown in place
    for chunk in stream:
        response += chunk.choices[0].delta.content or ""
        update_display(Markdown(response), display_id=display_handle.display_id)

stream = chat(question, MODEL_GPT)
display_stream(stream)

Short answer
- RAG (Retrieval-Augmented Generation) combines a retriever (searches a document collection) with a generator (a large language model) so the model answers using retrieved documents as context. That typically makes answers more factual and up-to-date than a closed-book LLM alone.

How it works (plain language)
1. Index documents: break your knowledge source (manuals, webpages, emails) into chunks and embed each chunk into a vector space.
2. Retrieve: when a user asks a question, embed the question and find the most similar chunks using nearest-neighbor search.
3. Form the prompt: assemble those retrieved chunks (plus any metadata) into a prompt or “context” for the LLM.
4. Generate: send the prompt+question to a text generation model. The model conditions on the retrieved facts and writes the answer.
5. (Optional) Post-process: verify facts, re-rank candidate answers, cite sources, or run a truth-checker.

Why it’s useful
- Keeps the generator “grounded” in a knowledge base (reduces hallucinations).
- Makes answers up-to-date without retraining the model (just update the index).
- Allows smaller LMs to produce accurate answers using external knowledge.

Simple Python example
This example uses:
- sentence-transformers to create embeddings,
- FAISS for a vector index (dense similarity search),
- OpenAI’s chat completions for generation (you can replace the generation call with any LLM).

Install prerequisites:
- pip install sentence-transformers faiss-cpu openai

Code:
```
# RAG minimal example
# Requires: sentence-transformers, faiss-cpu, openai
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import openai   # replace with your LLM client if needed

# ---------- 1) Prepare documents ----------
documents = [
    "RAG stands for Retrieval-Augmented Generation. It combines search with a language model.",
    "FAISS is a library for fast similarity search over vectors.",
    "Sentence-Transformers are pre-trained models that map text to vectors (embeddings).",
    "Chunking long docs into smaller pieces helps retrieval and fits model context windows."
]

# (Optional) Chunking function — keep short chunks in practice
def chunk_texts(docs, max_chars=500):
    chunks = []
    for d in docs:
        if len(d) <= max_chars:
            chunks.append(d)
        else:
            # very simple split; use smarter splitting for real data
            for i in range(0, len(d), max_chars):
                chunks.append(d[i:i+max_chars])
    return chunks

chunks = chunk_texts(documents)

# ---------- 2) Compute embeddings and build FAISS index ----------
embed_model = SentenceTransformer('all-MiniLM-L6-v2')  # small, fast
embeddings = embed_model.encode(chunks, convert_to_numpy=True, show_progress_bar=False)
dim = embeddings.shape[1]

index = faiss.IndexFlatL2(dim)   # exact L2 index; switch to IVF/HNSW for large collections
index.add(embeddings)           # add all chunk embeddings

# Keep a mapping so we can show sources
id_to_chunk = {i: chunks[i] for i in range(len(chunks))}

# ---------- 3) Query, retrieve top-k ----------
def retrieve(query, k=3):
    q_emb = embed_model.encode([query], convert_to_numpy=True)
    distances, indices = index.search(q_emb, k)
    results = []
    for idx in indices[0]:
        if idx == -1:
            continue
        results.append(id_to_chunk[idx])
    return results

# ---------- 4) Build the prompt (context) and call the LLM ----------
def build_prompt(retrieved_chunks, user_question):
    context = "\n\n---\n".join(retrieved_chunks)
    prompt = (
        "You are an assistant. Use ONLY the information in the context to answer the question. "
        "If the answer isn't in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {user_question}\nAnswer:"
    )
    return prompt

# Example run
user_question = "What does RAG mean?"
retrieved = retrieve(user_question, k=3)
prompt = build_prompt(retrieved, user_question)

# Use OpenAI Chat Completions (replace with your preferred LLM)
client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # example; pick a suitable model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ],
    max_tokens=200,
    temperature=0.0
)
answer = resp.choices[0].message.content.strip()
print("Retrieved chunks:")
for i, r in enumerate(retrieved, 1):
    print(f"{i}. {r}")
print("\nLLM answer:\n", answer)
```

Notes about this example
- Embeddings: sentence-transformers is an easy, local option. In production you may use a provider’s embedding model for better alignment with your LLM.
- FAISS: for a few documents, IndexFlatL2 is fine. For millions of vectors, use IVF/HNSW indices and approximate search for speed.
- Prompting: put a clear instruction to the LLM to rely on the provided context and to cite/decline if unsure.
- Temperature: set it low (e.g., 0–0.2) for more deterministic answers that stay closer to the context; the trade-off is less varied, less creative output.
- Citation: store and return chunk metadata (source URL, doc id, offset) so you can cite sources in answers.
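
The citation note can be sketched as follows. This is a minimal illustration, not part of the example above: the `Chunk` dataclass, the `chunk_with_metadata` and `format_context` helpers, and the `manual.txt` source name are all hypothetical. It assumes fixed-size character chunking and shows one way to keep source and offset alongside each chunk so the context can be numbered for citing.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # hypothetical: file name or URL the chunk came from
    offset: int  # character offset of the chunk within the source


def chunk_with_metadata(text, source, max_chars=500):
    """Split text into fixed-size chunks, remembering where each one came from."""
    return [
        Chunk(text=text[i:i + max_chars], source=source, offset=i)
        for i in range(0, len(text), max_chars)
    ]


def format_context(chunks):
    """Number each chunk so the LLM can cite it as [1], [2], ..."""
    return "\n\n".join(
        f"[{n}] ({c.source}, offset {c.offset})\n{c.text}"
        for n, c in enumerate(chunks, 1)
    )


# Hypothetical 1200-character document split into 500-character chunks
chunks = chunk_with_metadata("a" * 1200, source="manual.txt", max_chars=500)
print(format_context(chunks[:2]))
```

The numbered context string could replace the plain joined chunks in the prompt, with an added instruction asking the model to cite the bracketed numbers it relied on.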

Practical improvements and variants
- Hybrid retrieval: combine sparse (BM25) + dense (embeddings) retrieval and merge results.
- Reranking: after retrieving, use a cross-encoder or another model to re-score candidate chunks for better relevance.
- Chain-of-thought / step-by-step: if the task needs reasoning, you can ask the LLM to show its chain-of-thought, or use a verification step.
- Retrieval during generation: “retrieve-then-generate” vs. “retrieve-and-augment-while-generating” (some systems call external search mid-generation).
- Up-to-date data: refresh the index when documents change.
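
One common way to merge sparse and dense results for hybrid retrieval is reciprocal rank fusion (RRF). The sketch below is an assumption about how the merge step might look, not something the example above implements: the function name, the document ids, and both input rankings are hypothetical, and `k=60` is simply a frequently used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # hypothetical sparse (BM25) results
dense_ranking = ["doc_b", "doc_d", "doc_a"]  # hypothetical dense (embedding) results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

RRF only needs ranks, not raw scores, so it sidesteps the problem that BM25 scores and embedding distances live on different scales.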

Limitations and risks
- Garbage in → garbage out: if documents are wrong, the LLM will propagate incorrect info.
- Hallucination: LLMs can still hallucinate; use grounded prompts, verification, and fact-checking.
- Cost & latency: embeddings + search + generation adds cost and time per query.
- Privacy: be careful about what private data you index and which model you send data to.

If you want, I can:
- Provide a version that uses only open-source LLMs for generation (no API key),
- Show how to include source citations automatically,
- Show chunking strategies and how to scale FAISS for millions of docs.

Which would you like next?

In [18]:
# Get Llama 3.2 to answer
stream = chat(question, MODEL_LLAMA)
display_stream(stream)

**What is RAG (Relational Attention Graph) AI?**

RAG is an artificial intelligence framework developed by Facebook's AI Research Lab. It's primarily used for natural language processing (NLP) tasks like language translation, question answering, and text classification.

The core idea behind RAG is to replace traditional recurrent neural networks (RNNs) with graph-structured attention mechanisms. This allows the model to consider both sequential dependencies and global contextual relationships in a sentence or document.

**How does RAG work?**

In essence, RAG works by:

1. Representing input text as a graph, where each node represents an element of the input sequence (e.g., word tokens).
2. Assigning attention weights to each node, indicating its importance.
3. Computing the weighted sum of all nodes' representations to generate the contextualized embeddings.
4. Using these embeddings as inputs for downstream tasks.

**A simple RAG example in Python**

To demonstrate this concept, we can create a simplified RAG model using the Keras library and attention mechanisms:
```python
import numpy as np
from keras.layers import Input, Dense, Embedding, Flatten, Attention
from keras.models import Model

# Sample input text (a sentence)
sentences = ["This is a sample query."]

# Convert sentences to numerical labels
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# Create attention-based RAG model
attention_input = Input(shape=(100,), name="Input")
x = Embedding(input_dim=100, output_dim=128)(attention_input)
x = Flatten()(x)  # Flatten embedding

# Compute attention layer
attention_weights = Attention().apply(x)
att_weights = Dense(1, kernel_initializer='zero')(attention_weights)

# weighted sum of node representations
node_representations = att_weights * x

# Generate contextualized embeddings
context_embedding = Dense(128, activation='relu')(node_representations)

# Create RAG model with input and output layers
model = Model(inputs=attention_input, outputs=context_embedding)
```
In this example:

*   We create an input layer representing the tokens obtained from a sentence.
*   The embedding layer maps these tokens to dense vectors of size 128.
*   The attention layer computes weighted sums for each token, giving us attention scores att_weights and node representations x.
*   Finally, we compute contextualized embeddings by integrating these node representations using the generated weights.

Note: This simplified example is not optimized for accuracy or practicality and serves mostly to illustrate the fundamental concept of RAG.

To go a step further in implementation of this model you may want to check the **RAG** documentation on this [GitHub website](https://github.com/fact-checking/rags).

Hope that helps!