# Exercises



## Part 1: Conceptual Questions


### **Q1. Context Window:** What is the context window in an LLM, and why does it matter in real-world applications?  

The context window of an LLM refers to the **maximum number of tokens** the model can process **at once**, including both **the prompt and its generated output**.  

It matters because:
- Anything outside this window is invisible to the model, so earlier content is effectively forgotten.
- In real-world applications, it caps how much history, instruction text, or conversation you can provide before the model begins losing earlier information.
- A larger context window improves continuity in long chats and enables processing of large documents.
- Longer contexts also increase cost and latency, and may dilute attention across irrelevant information.
- Developers must therefore balance context length, relevance, and retrieval strategies for reliable performance.
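
One common way to stay inside the window is to drop the oldest messages first. The sketch below illustrates that idea only; `count_tokens` is a hypothetical stand-in that counts whitespace-separated words, whereas a real application would use the tokenizer that matches its model.

```python
def count_tokens(text: str) -> int:
    """Rough stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def trim_history(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined token count fits the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older falls outside the window
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = [
    "System: you are a helpful assistant",
    "User: what is a context window",
    "Assistant: it is the token limit",
    "User: why does it matter",
]
# With a 12-token budget only the two newest messages survive.
print(trim_history(history, max_tokens=12))
```

Note that this naive policy also drops the system message once the budget is tight; production systems usually pin instructions and trim only the conversation turns.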



### **Q2. Tokens & Token Efficiency:** Why do LLM APIs price by tokens instead of characters or words? What does token efficiency mean in practice? 

LLM APIs price by tokens because:
- Tokens represent the actual computational workload: models operate on token-level embeddings, not characters or words.
- Characters are too granular and words too inconsistent across languages, whereas tokens are a consistent unit for any text passed through a given model's tokenizer.

Token efficiency means:
- Minimizing unnecessary tokens in both prompts and model outputs to reduce cost and latency.
- In practice, writing concise prompts, removing redundant text, and requesting compact structured output such as JSON instead of verbose prose when appropriate.
- Using tokens efficiently to reduce API expenses, improve throughput, and avoid hitting context window limits.
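
The cost impact of a shorter prompt is simple arithmetic. The sketch below assumes separate per-1K-token prices for input and output; the prices and token counts are made-up placeholders, not any provider's actual rates.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Return the estimated cost of one call, given per-1K-token prices."""
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


# Hypothetical prices in dollars per 1K tokens (placeholders, not real rates).
INPUT_PRICE = 0.002
OUTPUT_PRICE = 0.008

verbose = estimate_cost(1200, 200, INPUT_PRICE, OUTPUT_PRICE)  # long, repetitive prompt
concise = estimate_cost(300, 200, INPUT_PRICE, OUTPUT_PRICE)   # same task, trimmed prompt

print(f"verbose prompt: ${verbose:.4f} per call")
print(f"concise prompt: ${concise:.4f} per call")
print(f"saving over 100,000 calls: ${(verbose - concise) * 100_000:.2f}")
```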



### **Q3. Sampling Controls:** Explain what temperature, top-k, and top-p do during generation, and how you would tune them for: (a) a factual RAG-based Q&A bot, and (b) a creative brainstorming assistant.  
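- Temperature rescales the next-token probability distribution: low values concentrate probability on the most likely tokens and give near-deterministic answers, while high values flatten the distribution and encourage creative variation.
- Top-k limits sampling to the k most likely next tokens, cutting off the low-probability tail and making generation more controlled.
- Top-p (nucleus sampling) samples from the smallest set of tokens whose cumulative probability reaches p, so the candidate pool grows or shrinks with the model's confidence.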

For a factual RAG-based Q&A bot:
- Set temperature low (0–0.2).
- Use conservative top-p or top-k values to minimize hallucination.

For a creative brainstorming assistant:
- Increase temperature (0.7–1.0).
- Loosen top-p or top-k to allow more diverse and imaginative outputs.

Tuning these controls helps match the model’s behavior to user expectations.
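
To make the three controls concrete, here is a minimal sketch over a toy next-token distribution. The logits and token names are invented for illustration, and real inference engines may apply these steps in a different order or with different tie-breaking; treat it as a picture of the idea rather than any library's implementation.

```python
import math
import random


def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Softmax over logits divided by temperature; 0 is treated as greedy decoding."""
    if temperature == 0:
        best = max(logits, key=logits.get)
        return {tok: float(tok == best) for tok in logits}
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most likely tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {tok: prob / total for tok, prob in ranked}


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token according to the (already filtered) probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Toy logits for the next token after "The capital of France is" (invented values).
logits = {"Paris": 4.0, "Lyon": 2.0, "Rome": 1.0, "banana": 0.0}
rng = random.Random(0)
for temperature in (0.2, 1.0):
    probs = apply_temperature(logits, temperature)
    probs = top_p_filter(top_k_filter(probs, k=3), p=0.9)
    rounded = {tok: round(prob, 3) for tok, prob in probs.items()}
    print(f"T={temperature}: candidates={rounded} sampled={sample(probs, rng)}")
```

At the low temperature almost all probability sits on one token, so the filters leave a single candidate; at temperature 1.0 more candidates survive and the sampled token can vary between runs.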



### **Q4. Prompt Engineering:** What is the difference between zero-shot and few-shot prompting? Give a concrete example of when you would choose each.

- Zero-shot prompting means giving the model instructions with no examples, relying entirely on general pretrained knowledge. 
- Few-shot prompting adds representative examples, guiding the model by pattern demonstration.  

Zero-shot is ideal when the task is simple, well-known, or requires minimal structure, for example asking for a definition or summarizing a document.

Few-shot prompting is preferred when the output must follow a strict format or the task is nuanced, such as writing product descriptions in a specific brand voice.
  
The choice depends on complexity, consistency needs, and how predictable the output format must be.
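
The difference is easiest to see in the prompt text itself. The sketch below builds both kinds of prompt as plain strings; the instruction wording, the `Input:`/`Output:` layout, and the example product descriptions are all hypothetical choices made for illustration.

```python
def zero_shot_prompt(instruction: str, text: str) -> str:
    """Instruction plus the new input, with no demonstrations."""
    return f"{instruction}\n\nInput: {text}\nOutput:"


def few_shot_prompt(instruction: str, examples: list[tuple[str, str]], text: str) -> str:
    """Instruction, then worked input/output pairs, then the new input."""
    shots = "\n\n".join(f"Input: {src}\nOutput: {dst}" for src, dst in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {text}\nOutput:"


# Zero-shot: a simple, well-known task needs no demonstrations.
print(zero_shot_prompt("Summarize the text in one sentence.",
                       "RAG combines retrieval with generation."))
print("-" * 40)

# Few-shot: invented examples that demonstrate a specific brand voice.
brand_examples = [
    ("steel water bottle", "Cold for hours. Yours for years."),
    ("canvas backpack", "Carries everything. Complains about nothing."),
]
print(few_shot_prompt("Write a product tagline in our brand voice.",
                      brand_examples, "wool beanie"))
```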



### **Q5. LLM Limitations:** If LLMs are so powerful, why do we still talk about hallucinations and outdated knowledge? Give a couple of concrete examples.

LLMs can hallucinate because they generate text by predicting likely sequences, not by verifying facts.  

They can also contain outdated knowledge if their training data doesn’t include recent information.   

For example, an LLM might invent API parameters that don’t exist because they "sound right," or might provide outdated information about laws or product versions.  

Another limitation is that the model may confidently answer questions about nonexistent research papers or people. These issues arise from probabilistic generation and static training data, which makes external grounding essential for accuracy.



### **Q6. RAG Definition:** What is Retrieval-Augmented Generation (RAG), and what are its main components?

Retrieval-Augmented Generation (RAG) combines information retrieval with LLM generation to produce grounded, up-to-date, and contextually relevant responses.  

Its main components include 
- a retriever, which finds relevant documents based on embeddings or keyword search, and 
- a generator, which conditions on those retrieved documents to produce an answer.  

RAG reduces hallucination by anchoring the model to factual evidence. It also enables dynamic, domain-specific knowledge without the need for fine-tuning.



### **Q7. Naive RAG Pipeline:** Walk me through a naive RAG pipeline end-to-end, starting from raw documents and ending at the final answer.
- A naive RAG pipeline starts by collecting raw documents such as PDFs, web pages, or text files. 
- These documents are cleaned and chunked into manageable segments to fit within embedding models’ limits. 
- Each chunk is transformed into an embedding and stored in a vector database. 
- When a user asks a question, the system embeds the query and retrieves the most semantically similar chunks. 
- The retrieved context is inserted into a prompt that instructs the LLM to answer using those sources. 
- The model then generates a final answer grounded in the retrieved information.



### **Q8. RAG vs Long Context:** If we can just use a giant LLM with a 1M token context window, why do we still need RAG?

Even with very large context windows (100K to 1M+ tokens), RAG still helps with:
1. **Cost:** Processing a huge context on every query is expensive.
2. **Latency:** Longer contexts mean slower response times.
3. **Relevance:** Models can miss a needle in a haystack; RAG pre-filters to the relevant passages.
4. **Scale:** A corpus of millions of documents cannot fit in any context window.
5. **Precision:** RAG can search for and retrieve exactly what is needed.



### **Q9. RAG vs Fine-tuning:** When would you choose RAG instead of fine-tuning, and when is fine-tuning more appropriate?
Choose RAG when the knowledge is large, frequently changing, or needs to remain external for compliance or cost reasons. RAG allows updates without retraining and supports explainability by showing which documents informed the answer.

Fine-tuning is more appropriate when the goal is adapting the model's behavior, style, or reasoning patterns rather than injecting raw knowledge. It also helps when the task requires domain-specific phrasing, custom tool-use patterns, or specialized reasoning techniques.

In practice, many systems combine both: 
- RAG for knowledge grounding and 
- fine-tuning for stylistic or process improvements.

| Use Case | Best Approach | Reason |
|----------|---------------|---------|
| Company documentation Q&A | RAG | Frequently updated, need citations |
| Medical chatbot (specific terminology) | Fine-tuning | Learn specialized language patterns |
| Legal document search | RAG | Large corpus, need exact citations |
| Brand voice for marketing | Fine-tuning | Consistent style/tone |
| Customer support with product manuals | RAG + Fine-tuning | RAG for facts, fine-tuning for tone |


### **Q10. Structured Data Extraction:** How would you design a prompt to extract structured data (JSON) from unstructured text?
- The prompt should clearly specify the **desired schema, constraints, and formatting rules**. 
- It is helpful to include **examples** of valid JSON, even for slightly different inputs, to guide model behavior. 
- The prompt should instruct the model **not to add** commentary and to return only valid, parseable JSON. 
- It should also define how to handle **missing or uncertain fields**, such as using null or empty strings.  

This approach ensures predictable structure and makes downstream parsing reliable in production settings.
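
A minimal sketch of such a prompt, plus the validation step on the receiving side, might look like the following. The field names are a hypothetical schema, and `reply` is a hand-written stand-in for a model response rather than real model output.

```python
import json

FIELDS = ("name", "email", "order_id")  # hypothetical schema, for illustration only


def build_extraction_prompt(text: str) -> str:
    """State the schema, the null rule for missing values, and the JSON-only rule."""
    template = json.dumps({field: None for field in FIELDS})
    return (
        "Extract the following fields from the text and return ONLY valid JSON.\n"
        f"Schema (use null for missing or uncertain values): {template}\n"
        "Do not add commentary, markdown, or extra keys.\n\n"
        f"Text: {text}"
    )


def parse_extraction(raw: str) -> dict:
    """Parse the model reply and check that every schema field is present."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON replies
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [field for field in FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {field: data[field] for field in FIELDS}  # drop any unexpected keys


# Hand-written stand-in for a model reply, not real model output.
reply = '{"name": "Ada Lovelace", "email": null, "order_id": "A-17"}'
print(build_extraction_prompt("Ada Lovelace asked about order A-17."))
print(parse_extraction(reply))
```

Validating on the way in means a malformed reply fails loudly at the parsing step, where it can be retried, instead of corrupting downstream data.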

## Part 2: Run a Local Model with Ollama

**Goal:** See how a local LLM works and connect concepts (tokens, temperature, RAG) to a real tool.

*Skip this if you don't have a machine that can run Ollama.*

**Steps:**

1. **Install Ollama**
   - Visit: [https://docs.ollama.com/](https://docs.ollama.com/)
   - Download and install for your OS

2. **Run a Model**
   ```bash
   ollama run gemma3
   # or
   ollama run llama3
   ```

3. **Ask Questions**
   - "Explain Retrieval-Augmented Generation (RAG) in 3 bullet points."
   - "What is the exact vacation policy of ICC in 2024?" (should show limitation)

4. **Call the HTTP API**
   ```bash
   curl http://localhost:11434/api/generate \
     -d '{
       "model": "llama3",
       "prompt": "Explain what a context window is in 3 sentences."
     }'
   ```
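
The same request can be sent from Python using only the standard library. By default this endpoint streams its reply as one JSON object per line, each carrying a `response` fragment and a `done` flag, so the sketch below joins the fragments into a single string. It assumes Ollama is running locally on the default port; the helper names are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as the curl call


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Create the POST request with a JSON body."""
    payload = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )


def join_stream(lines) -> str:
    """Concatenate the `response` fragments from newline-delimited JSON."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between events
        event = json.loads(line)
        parts.append(event.get("response", ""))
        if event.get("done"):
            break  # the final event marks the end of generation
    return "".join(parts)


def generate(model: str, prompt: str) -> str:
    """Send the prompt to the local server and return the full answer text."""
    with urllib.request.urlopen(build_request(model, prompt)) as response:
        return join_stream(response)


# Requires a running Ollama server, so the call is left commented out:
# print(generate("llama3", "Explain what a context window is in 3 sentences."))
```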

**Deliverable:**
- 1-2 screenshots of Ollama answers
- 4-6 sentences comparing to ChatGPT (style, hallucinations, speed)
- Briefly describe the API endpoint and JSON response



## Part 3: Call OpenAI API

**Goal:** Make your first real API call to a hosted LLM and observe tokens & temperature in action.

*If you don't have an API key, just read the code and explain what it does.*

**Steps:**

1. **Setup**
   ```bash
   pip install --upgrade openai
   ```

2. **Minimal Script** (`lesson1_openai_test.py`):
   ```python
   from openai import OpenAI
   
   client = OpenAI()  # expects OPENAI_API_KEY in environment
   
   response = client.responses.create(
       model="gpt-4.1",
       input="Explain Retrieval-Augmented Generation (RAG) in 3 bullet points."
   )
   
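   # output_text is the SDK's convenience accessor for the concatenated text output
   print(response.output_text)
   # usage reports input, output, and total token counts for this call
   print("Total tokens:", response.usage.total_tokens)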
   ```

3. **Run and Analyze**
   - How many total tokens did this use?
   - If price is $X per 1K tokens, roughly how much did this cost?

4. **Change Sampling Parameters**
   ```python
   response = client.responses.create(
       model="gpt-4.1",
       input="Explain Retrieval-Augmented Generation (RAG) in 3 bullet points.",
       temperature=0.0,  # Try 0.0 and 0.8
       top_p=0.9,
   )
   ```

**Deliverable:**
- Paste output showing token count
- Compare outputs at different temperatures (4-6 sentences)
- Which is more deterministic? More creative? Any hallucinations?

---

