# Imports

In [28]:
import os
from openai import OpenAI
import requests
from bs4 import BeautifulSoup
from IPython.display import Markdown, display
from dotenv import load_dotenv
from scrap import fetch_article
load_dotenv(override=True)




True

# Constants & Switch

In [29]:
MODEL_GPT = "gpt-5-nano"       # Cloud model
MODEL_LLAMA = "llama3.2"       # Local Ollama
USE_LOCAL_MODEL = False         # True = Ollama, False = Cloud GPT/OpenRouter
OLLAMA_BASE_URL = "http://localhost:11434/v1"

#### Detect which API key is set

In [None]:
# Read the API key from the environment and pick the matching base URL

api_key = os.getenv("OPENROUTER_API_KEY") or os.getenv("OPENAI_API_KEY")

if api_key:
    if api_key.startswith("sk-or-") and len(api_key) > 10:
        print("OpenRouter API key detected")
        BASE_URL = "https://openrouter.ai/api/v1"
    elif api_key.startswith("sk-proj-") and len(api_key) > 10:
        print("OpenAI Project API key detected")
        BASE_URL = None
    else:
        print("API key format not recognized")
        BASE_URL = None
else:
    print("No API key found")
    BASE_URL = None

✅ OpenRouter API key detected
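
The prefix checks above can also be written as a small pure function, which makes them easy to test without touching environment variables. This is only a sketch: `detect_provider` and `OPENROUTER_BASE_URL` are hypothetical names that do not appear elsewhere in the notebook, and the logic assumes the same key prefixes and length check used in the cell above.

```python
from typing import Optional, Tuple

# Same OpenRouter endpoint as in the cell above.
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def detect_provider(api_key: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (provider_label, base_url) based only on the key's prefix.

    A base_url of None means "use the client's default endpoint".
    """
    if not api_key:
        return "none", None
    if api_key.startswith("sk-or-") and len(api_key) > 10:
        return "openrouter", OPENROUTER_BASE_URL
    if api_key.startswith("sk-proj-") and len(api_key) > 10:
        return "openai", None
    return "unknown", None


# Hypothetical placeholder key, used only to exercise the function.
provider, base_url = detect_provider("sk-or-" + "x" * 20)
print(provider, base_url)
```

Returning a label instead of printing inside the function keeps the decision separate from the reporting, so the same function could feed both the `print` messages and `BASE_URL`.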


In [31]:

# Cloud client (OpenRouter/OpenAI)
cloud_client = OpenAI(
    api_key=api_key,
    base_url=BASE_URL
) if not USE_LOCAL_MODEL else None

# Ollama client (Local)
ollama_client = OpenAI(
    api_key="ollama",
    base_url=OLLAMA_BASE_URL
) if USE_LOCAL_MODEL else None

def get_client():
    if USE_LOCAL_MODEL:
        return ollama_client
    else:
        return cloud_client

def get_model():
    return MODEL_LLAMA if USE_LOCAL_MODEL else MODEL_GPT

## Client switch

In [32]:
def explain_question(question: str) -> str:
    """
    Ask a technical question and get a clear explanation from the selected LLM.
    """
    client = get_client()
    model = get_model()

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": """
You are an expert AI tutor.

Explain technical concepts clearly for a developer.
- Step-by-step explanation
- Use simple language
- Include examples if needed
- Avoid jargon unless explained
"""
            },
            {
                "role": "user",
                "content": question
            }
        ]
    )

    explanation = response.choices[0].message.content
    return explanation

In [33]:
def ask_llm(question: str):
    answer = explain_question(question)
    display(Markdown(answer))


In [34]:
if __name__ == "__main__":
    question = "Explain Retrieval Augmented Generation (RAG) in simple terms"
    ask_llm(question)

Retrieval Augmented Generation (RAG) is a way to build a question-answering system that combines two ideas: look up documents (retrieval) and then write an answer using those documents (generation).

Step-by-step, at a high level:

- Step 1: Build a knowledge store
  - Collect a set of documents you want the system to know about (manuals, articles, PDFs, web pages).
  - Break them into chunks (e.g., 200–500 words per chunk) so they’re easy to search.

- Step 2: Create a retriever
  - Turn each doc chunk into a numerical vector (an “embedding”) that captures its meaning.
  - Put all chunks into a vector store or index (examples: FAISS, Pinecone, or a simple database with embeddings).
  - The retriever’s job is to take a question and find the most relevant chunks.

- Step 3: When a user asks a question
  - Convert the question into a query vector (same method used for docs).
  - Retrieve the top-K most relevant doc chunks from the vector store.

- Step 4: Generate an answer
  - Feed the question and the retrieved chunks to a language model (the “generator”).
  - The model writes an answer that borrows information from the retrieved chunks, rather than relying only on what it was trained on.

- Step 5: (Optional) refine or re-rank
  - You can re-rank the candidate answers or use a separate step to merge info from multiple chunks more cleanly.

- Step 6: Deliver the result
  - Return the answer to the user, and optionally show which docs were used as sources.
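
Steps 1 to 4 above can be sketched in plain Python. This is a toy illustration, not a production setup: a bag-of-words count stands in for a real embedding model, a list stands in for the vector store, and the function names (`chunk_text`, `embed`, `retrieve`, `build_prompt`) and sample documents are made up for this example.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50):
    """Step 1: split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Step 2 (toy version): represent text as lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Step 3: return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question, retrieved):
    """Step 4: put the retrieved chunks in front of the question for the generator."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Made-up sample chunks for illustration.
docs = [
    "To reset the device, hold the power button for 10 seconds.",
    "The warranty covers manufacturing defects for two years.",
    "Charge the battery with the supplied USB-C cable.",
]
question = "How do I reset my device?"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The resulting `prompt` string is what would be sent to the language model as the user message; swapping `embed` for a real embedding model changes the quality of retrieval but not the overall shape of the pipeline.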

Two common flavors you’ll hear about:

- RAG-Token
  - The model can draw on a different retrieved chunk for each token it generates.
  - Retrieval still happens once per question; each token's probability is averaged over the same set of retrieved chunks.

- RAG-Sequence
  - The model generates the whole answer from one retrieved chunk at a time, then combines the per-chunk results into a final score.
  - Retrieval is done first, and the same chunk is used for the entire answer.
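
In the original RAG formulation, the two flavors differ in where the sum over retrieved documents is taken: RAG-Sequence sums over documents once for the whole answer, while RAG-Token sums over documents separately for each token. The sketch below works through that arithmetic with made-up numbers; the function names and all probabilities are assumptions chosen for illustration, not values from any real model.

```python
import math


def rag_sequence_prob(doc_probs, token_probs):
    """Sum over documents of p(doc) times the product of that document's token probabilities."""
    return sum(p_doc * math.prod(tokens) for p_doc, tokens in zip(doc_probs, token_probs))


def rag_token_prob(doc_probs, token_probs):
    """Product over token positions of the document-weighted token probability."""
    n_tokens = len(token_probs[0])
    return math.prod(
        sum(p_doc * tokens[i] for p_doc, tokens in zip(doc_probs, token_probs))
        for i in range(n_tokens)
    )


# Made-up retrieval scores for two documents (they sum to 1).
doc_probs = [0.6, 0.4]
# Made-up generator probabilities for a two-token answer, one row per document.
token_probs = [
    [0.9, 0.2],  # document 1 supports token 1 well, token 2 poorly
    [0.1, 0.8],  # document 2 supports token 2 well, token 1 poorly
]

seq_p = rag_sequence_prob(doc_probs, token_probs)  # 0.6*0.18 + 0.4*0.08
tok_p = rag_token_prob(doc_probs, token_probs)     # (0.54 + 0.04) * (0.12 + 0.32)
print(round(seq_p, 4), round(tok_p, 4))
```

With these numbers the per-token variant assigns a higher probability, because each token can lean on whichever document supports it best.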

A simple example

- You have a knowledge store with product manuals and FAQs.
- A user asks: “How do I reset my device?”
- The retriever pulls the most relevant chunks from the manuals (e.g., “to reset, hold the power button for 10 seconds,” etc.).
- The generator writes an answer like: “To reset your device, press and hold the power button for 10 seconds until the screen goes off, then release and press it again to power on. If it still doesn’t start, try…”
- The answer is grounded in the retrieved docs, not just invented.

Why use RAG?

- Keeps answers grounded in real documents rather than just what the model “thinks.”
- Can use up-to-date or domain-specific knowledge by updating the doc store.
- Can handle long or technical information by pulling only the relevant chunks as context.

What to watch out for

- Quality of the document store matters: wrong or outdated docs lead to wrong answers.
- Retrieval can fail if the vectors aren’t good or the index isn’t well-tuned.
- There’s still potential for errors or misinterpretation if the generator over-weights noisy chunks.
- Costs: embedding, indexing, and running the generator can be heavier than plain generation.

Common components you’ll use

- Document store: your collection of docs chunked for search.
- Retriever: dense (neural embeddings) or sparse (BM25); can be hybrid.
- Generator: a language model (e.g., a seq-to-seq model or a large LM) that takes the question and retrieved docs as input.
- Optional: a re-ranker, metadata filters, or a librarian-like module to verify sources.

A tiny mental model

- Think of the system as two parts: a smart librarian (the retriever) and a careful writer (the generator). The librarian fetches relevant passages, and the writer composes an answer that cites those passages.

If you want, I can tailor this to a particular stack (e.g., Haystack/LangChain, FAISS, or Pinecone) and sketch a minimal implementation plan.