# End of week 1 exercise

To demonstrate your familiarity with OpenAI API, and also Ollama, build a tool that takes a technical question,  
and responds with an explanation. This is a tool that you will be able to use yourself during the course!

In [None]:
# imports
import os
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import display, Markdown, update_display

In [25]:
# constants

MODEL_GPT = 'gpt-5'
MODEL_LLAMA = 'llama3.2'

In [3]:
# set up environment

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")

API key found and looks good so far!


In [16]:
# here is the question; type over this to ask something new
system_prompt = """
You are a helpful assistant that answers technical questions including code explanations. You should explain the concepts clearly with practical examples, in a way that any layman would understand.
"""

question = """
Please explain how RAG AI systems work, with a simple example in Python.
"""

In [26]:
# Get gpt-4o-mini to answer, with streaming
def chat(question, model):
  OLLAMA_BASE_URL = "http://localhost:11434/v1"
  openai = OpenAI(base_url=OLLAMA_BASE_URL, api_key="ollama") if model == MODEL_LLAMA else OpenAI()

  stream = openai.chat.completions.create(
    model=model,
    messages=[
      {"role": "system", "content": system_prompt},
      {"role": "user", "content": question}
    ],
    stream=True
  )

  return stream

def display_stream(stream):
  response=""
  display_handle = display(Markdown(""), display_id=True)

  for chunk in stream:
      response += chunk.choices[0].delta.content or ""
      update_display(Markdown(response), display_id=display_handle.display_id)

stream = chat(question, MODEL_GPT)
display_stream(stream)

Here’s the big idea in plain language:
- A RAG system first looks up facts, then writes the answer. It doesn’t rely only on what the model “remembers.”
- Steps:
  1) Split your documents into small chunks and store them in a searchable index (often a vector database).
  2) For a user question, retrieve the most relevant chunks.
  3) Feed those chunks (the context) plus the question into a language model to generate a grounded answer.
- Why it’s useful: answers can cite your data, stay up-to-date, and reduce hallucinations.

Simple, self-contained Python example (no external packages). This shows the R and the “A” wiring. It uses a tiny TF‑IDF-like retriever for clarity and a toy generator; you can later swap in a real LLM.

```python
import re
import math
from collections import Counter, defaultdict

# -----------------------------
# 1) Toy knowledge base
# -----------------------------
docs = [
    {
        "id": "company-overview",
        "text": (
            "Acme Solar builds portable solar chargers and home energy kits. "
            "Our mission is to make clean energy easy to use. "
            "Customer support is available 9am-6pm PST, Monday to Friday."
        ),
    },
    {
        "id": "product-solar-charger",
        "text": (
            "The Acme Pocket Charger is a 20W portable solar panel with USB-C output. "
            "It charges phones and small devices in direct sunlight. "
            "For best results, angle the panel toward the sun and avoid shade."
        ),
    },
    {
        "id": "refund-policy",
        "text": (
            "Refunds: We offer refunds within 30 days of delivery for damaged or defective products. "
            "Contact support with your order number and photos of the issue. "
            "Accessories and cables are covered under the same policy."
        ),
    },
    {
        "id": "installation-guide",
        "text": (
            "To install a home energy kit, mount the panel securely and connect the charge controller. "
            "Follow the wiring diagram and use weatherproof connectors. "
            "If unsure, consult a licensed electrician."
        ),
    },
]

# -----------------------------
# 2) Chunking (simple sentence split)
# -----------------------------
def split_sentences(text):
    # naive split; good enough for demo
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p.strip() for p in parts if p.strip()]

chunks = []
for d in docs:
    for i, sent in enumerate(split_sentences(d["text"])):
        chunks.append({
            "chunk_id": f'{d["id"]}#{i}',
            "doc_id": d["id"],
            "text": sent
        })

# -----------------------------
# 3) Tiny TF-IDF-like retriever
# -----------------------------
def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

# Build vocabulary and IDF from chunks
vocab = {}                # token -> index
df = defaultdict(int)     # document frequency per token
N = len(chunks)

for ch in chunks:
    tokens = set(tokenize(ch["text"]))
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
        df[tok] += 1

# IDF with smoothing
idf = [0.0] * len(vocab)
for tok, idx in vocab.items():
    idf[idx] = math.log((1 + N) / (1 + df[tok])) + 1.0

def tfidf_vector(text):
    tokens = tokenize(text)
    counts = Counter(tokens)
    vec = [0.0] * len(vocab)
    # term frequency (raw count) * idf
    for tok, c in counts.items():
        if tok in vocab:
            vec[vocab[tok]] = c * idf[vocab[tok]]
    # L2 normalize
    norm = math.sqrt(sum(x*x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Precompute embeddings for chunks
chunk_vectors = [tfidf_vector(ch["text"]) for ch in chunks]

def cosine(a, b):
    return sum(x*y for x, y in zip(a, b))

def retrieve(query, k=3):
    qv = tfidf_vector(query)
    sims = [(cosine(qv, cv), i) for i, cv in enumerate(chunk_vectors)]
    sims.sort(reverse=True)
    top = [chunks[i] for _, i in sims[:k]]
    return top

# -----------------------------
# 4) Prompt assembly and a toy "generator"
# -----------------------------
def build_context(top_chunks):
    lines = []
    for ch in top_chunks:
        lines.append(f"[{ch['doc_id']}] {ch['text']}")
    return "\n".join(lines)

def fake_llm_generate(question, context):
    # This simulates what you'd ask a real LLM.
    # For the demo, we extract an answer from the most relevant chunk(s).
    # In production, you would call an actual LLM with the prompt below.
    if "refund" in question.lower() or "return" in question.lower():
        # try to find a number of days in the context
        m = re.search(r'(\d+)\s*days', context.lower())
        days = m.group(1) if m else None
        base = "We offer refunds for damaged or defective products"
        if days:
            base += f" within {days} days of delivery"
        return base + ". Please contact support with your order number and photos."
    # Fallback: provide a concise sentence drawn from the top context
    first_line = context.splitlines()[0] if context else ""
    return "Based on our docs: " + re.sub(r'^\[[^\]]+\]\s*', '', first_line)

def rag_answer(question, k=3):
    top = retrieve(question, k=k)
    context = build_context(top)
    prompt = (
        "You are a helpful assistant. Use ONLY the information in Context to answer.\n"
        f"Question: {question}\n\n"
        f"Context:\n{context}\n\n"
        "If the answer is not in the context, say you don't know."
    )
    answer = fake_llm_generate(question, context)
    return answer, top, prompt

# -----------------------------
# 5) Try it
# -----------------------------
if __name__ == "__main__":
    user_question = "Do you offer refunds on broken solar chargers? How long do I have?"
    answer, sources, prompt = rag_answer(user_question, k=3)

    print("Answer:")
    print(answer)
    print("\nSources used:")
    for s in sources:
        print(f"- {s['doc_id']} -> {s['text']}")
    print("\n(Example prompt you would send to a real LLM):")
    print(prompt)
```

What this code shows
- Retrieval: retrieve(question) finds the top-k relevant chunks by cosine similarity over a simple TF-IDF embedding.
- Augmentation: build_context(...) stitches those chunks into a context block with lightweight citations.
- Generation: fake_llm_generate(...) is a placeholder that mimics how an LLM would answer using only the provided context.
- In practice, you replace fake_llm_generate with a real model call.

How to swap in a real LLM (optional)
- Replace fake_llm_generate with a function that calls your model. For example, with an API:
  - Prepare the same prompt string that includes the Context and Question.
  - Send it to your LLM of choice (OpenAI, local model via transformers, etc.).
  - Return the model’s text output as the answer.

Tips to make this production-ready
- Use good embeddings (e.g., sentence-transformers) and a vector index (e.g., FAISS) instead of the toy TF-IDF.
- Chunk documents with sensible sizes and overlaps (e.g., ~200–500 tokens, 10–50 token overlap).
- Store metadata (source, URL, title) and return citations with the answer.
- Add a re-ranking step to improve retrieval quality before generation.
- Instruct the LLM to cite sources and to say “I don’t know” if the context is insufficient.

In [18]:
# Get Llama 3.2 to answer
stream = chat(question, MODEL_LLAMA)
display_stream(stream)

**What is RAG (Relational Attention Graph) AI?**

RAG is an artificial intelligence framework developed by Facebook's AI Research Lab. It's primarily used for natural language processing (NLP) tasks like language translation, question answering, and text classification.

The core idea behind RAG is to replace traditional recurrent neural networks (RNNs) with graph-structured attention mechanisms. This allows the model to consider both sequential dependencies and global contextual relationships in a sentence or document.

**How does RAG work?**

In essence, RAG works by:

1. Representing input text as a graph, where each node represents an element of the input sequence (e.g., word tokens).
2. Assigning attention weights to each node, indicating its importance.
3. Computing the weighted sum of all nodes' representations to generate the contextualized embeddings.
4. Using these embeddings as inputs for downstream tasks.

**A simple RAG example in Python**

To demonstrate this concept, we can create a simplified RAG model using the Keras library and attention mechanisms:
```python
import numpy as np
from keras.layers import Input, Dense, Embedding, Flatten, Attention
from keras.models import Model

# Sample input text (a sentence)
sentences = ["This is a sample query."]

# Convert sentences to numerical labels
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# Create attention-based RAG model
attention_input = Input(shape=(100,), name="Input")
x = Embedding(input_dim=100, output_dim=128)(attention_input)
x = Flatten()(x)  # Flatten embedding

# Compute attention layer
attention_weights = Attention().apply(x)
att_weights = Dense(1, kernel_initializer='zero')(attention_weights)

# weighted sum of node representations
node_representations = att_weights * x

# Generate contextualized embeddings
context_embedding = Dense(128, activation='relu')(node_representations)

# Create RAG model with input and output layers
model = Model(inputs=attention_input, outputs=context_embedding)
```
In this example:

*   We create an input layer representing the tokens obtained from a sentence.
*   The embedding layer maps these tokens to dense vectors of size 128.
*   The attention layer computes weighted sums for each token, giving us attention scores att_weights and node representations x.
*   Finally, we compute contextualized embeddings by integrating these node representations using the generated weights.

Note: This simplified example is not optimized for accuracy or practicality and serves mostly to illustrate the fundamental concept of RAG.

To go a step further in implementation of this model you may want to check the **RAG** documentation on this [GitHub website](https://github.com/fact-checking/rags).

Hope that helps!