# End of week 1 exercise - VERSION 2 (Improved)

# Week 1 - LLM Explainer: Cloud vs Local + Judge

This notebook builds a small tool that:
1. Sends the same technical question to a **cloud OpenAI model** and a **local Ollama model**.
2. Streams both explanations for quick comparison.
3. Uses a **third GPT judge** to score each answer (0–10) and pick a winner.

## Improvements in v2:
- ✅ `stream_answer` now returns the full response
- ✅ Error handling for API calls and Ollama connection
- ✅ Consistent English naming throughout
- ✅ JSON validation for judge responses
- ✅ Better documentation and type hints


In [1]:
# imports
import os
import json
from dotenv import load_dotenv
from IPython.display import Markdown, display, update_display
from openai import OpenAI
from typing import Dict, List, Any, Optional


In [2]:
# constants

MODEL_CLOUD = 'gpt-4.1-nano'
MODEL_LOCAL = 'deepseek-r1:8b'
MODEL_JUDGE = "gpt-4.1-mini"


In [3]:
# set up environment

load_dotenv(override=True)

api_key = os.getenv('OPENAI_API_KEY')

if api_key and api_key.startswith('sk-proj-') and len(api_key) > 10:
    print("✅ API key looks good")
else:
    print("⚠️  There might be a problem with your API key. Please visit the troubleshooting notebook!")

try:
    client_cloud = OpenAI()
    print("✅ Cloud client initialized")
except Exception as e:
    print(f"❌ Error initializing cloud client: {e}")
    raise

try:
    client_local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    # Test connection to Ollama
    client_local.models.list()
    print("✅ Local Ollama client initialized and connected")
except Exception as e:
    print("⚠️  Warning: Could not connect to Ollama at localhost:11434")
    print(f"   Error: {e}")
    print("   Make sure Ollama is running: 'ollama serve'")
    # Still create the client, but user will get error when trying to use it
    client_local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")


✅ API key looks good
✅ Cloud client initialized
✅ Local Ollama client initialized and connected


In [4]:
question = """
Please explain what this code does and why:

def make_badge(text):
    width = len(text) + 4
    top_bottom = "*" * width
    middle = f"* {text} *"
    return f"{top_bottom}\n{middle}\n{top_bottom}"

print(make_badge("Golden rule: Do unto others as you would have them do unto you"))
"""
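
As a quick sanity check on the snippet stored in `question`, the sketch below runs `make_badge` directly and compares the border width with the text line. It is plain Python with no API calls; the names `text`, `badge`, and `lines` are only for illustration.

```python
def make_badge(text):
    width = len(text) + 4
    top_bottom = "*" * width
    middle = f"* {text} *"
    return f"{top_bottom}\n{middle}\n{top_bottom}"


text = "Golden rule: Do unto others as you would have them do unto you"
badge = make_badge(text)
lines = badge.split("\n")

# The borders and the text line should all be len(text) + 4 characters wide
print(len(text), [len(line) for line in lines])
print(badge)
```

Having the real output at hand makes it easier to check any example output a model includes in its explanation.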


In [5]:
system_prompt = (
    "You are a senior Python engineer and predictive/generative AI specialist.\n"
    "Your top priority is factual accuracy.\n"
    "DO NOT lie, guess, or invent information. DO NOT hallucinate.\n"
    "If you are unsure about any detail, say explicitly: \"I don't know\" "
    "and ask a clarifying question before continuing.\n"
    "Base your explanations ONLY on the code and context provided.\n"
    "Explain things didactically and clearly, using small examples when helpful.\n"
    "\n"
    "MANDATORY START OF YOUR RESPONSE:\n"
    "1) Introduce yourself in 1–2 sentences.\n"
    "2) State the exact LLM model identity you are running as.\n"
    "3) State your context window size and number of parameters ONLY if you know them with certainty.\n"
    "   If you do NOT know either with certainty, write exactly: 'not publicly available'.\n"
    "\n"
    "After that mandatory intro, proceed with the task."
)

user_prompt = (
    "First follow the mandatory intro from the system message.\n"
    "Then explain the following Python code in a simple, step-by-step way.\n"
    "Add a short comment for EACH line explaining what it does.\n"
    "Do not add new functionality or rewrite the code unless I explicitly ask.\n"
    "\n"
    "Code to explain:\n"
    f"{question}"
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]


In [6]:
def stream_answer(client: OpenAI, model: str, messages: List[Dict[str, str]]) -> str:
    """
    Streams a response from the LLM and displays it in real-time.
    
    Args:
        client: OpenAI client (cloud or local)
        model: Model name to use
        messages: List of message dicts with 'role' and 'content'
    
    Returns:
        str: The complete response text
    """
    try:
        stream = client.chat.completions.create(
            model=model,
            messages=messages,
            stream=True
        )
        
        response = ""
        display_handle = display(Markdown(""), display_id=True)
        
        for chunk in stream:
            # Some chunks may carry no choices or no text; skip them
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content
            if delta:
                response += delta
                update_display(Markdown(response), display_id=display_handle.display_id)
        
        return response
    
    except Exception as e:
        error_msg = f"❌ Error streaming from {model}: {e}"
        print(error_msg)
        # Show the Ollama hint only when the client points at a local endpoint
        if "localhost" in str(getattr(client, "base_url", "")):
            print("   Make sure Ollama is running: 'ollama serve'")
        raise
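
The accumulation loop in `stream_answer` can be tried without a network call. The sketch below assumes chunks expose their text as `choices[0].delta.content`, as in the function above. `fake_chunk` and `accumulate` are hypothetical helpers built on `types.SimpleNamespace` and are not part of the OpenAI SDK. Skipping chunks with an empty `choices` list is a defensive assumption.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def fake_chunk(text: Optional[str]) -> SimpleNamespace:
    """Hypothetical stand-in for a streamed chunk (choices[0].delta.content)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def accumulate(stream: Iterable[SimpleNamespace]) -> str:
    """Join the text deltas of a stream, skipping chunks that carry no text."""
    response = ""
    for chunk in stream:
        if not chunk.choices:      # chunk without any choices
            continue
        piece = chunk.choices[0].delta.content
        if piece:                  # None or "" adds nothing
            response += piece
    return response


chunks = [
    fake_chunk("Hello"),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk(", world"),
]
full_text = accumulate(chunks)
print(full_text)
```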


In [7]:
def get_full_answer(client: OpenAI, model_name: str, messages: List[Dict[str, str]]) -> str:
    """
    Runs a Chat Completion WITHOUT streaming to obtain the final full text from the model.
    
    Args:
        client: OpenAI client (cloud or local)
        model_name: Model name to use
        messages: List of message dicts with 'role' and 'content'
    
    Returns:
        str: The complete answer text
    """
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=messages
        )
        
        answer_text = response.choices[0].message.content
        
        if not answer_text:
            raise ValueError(f"Empty response from {model_name}")
        
        return answer_text
    
    except Exception as e:
        error_msg = f"❌ Error getting full answer from {model_name}: {e}"
        print(error_msg)
        # Show the Ollama hint only when the client points at a local endpoint
        if "localhost" in str(getattr(client, "base_url", "")):
            print("   Make sure Ollama is running: 'ollama serve'")
        raise


In [8]:
def judge_answers(
    client_judge: OpenAI,
    judge_model: str,
    question: str,
    answer_a: str,
    answer_b: str,
    model_a_name: str,
    model_b_name: str
) -> Dict[str, Any]:
    """
    Judge compares two answers, scores them 0–10, and picks a winner.
    Model names are injected from the script (not guessed by the LLM).
    
    Args:
        client_judge: OpenAI client for the judge model
        judge_model: Model name for the judge
        question: Original question that was asked
        answer_a: First answer to evaluate
        answer_b: Second answer to evaluate
        model_a_name: Name of model that produced answer_a
        model_b_name: Name of model that produced answer_b
    
    Returns:
        dict: Verdict with scores, winner, and reason
    """
    
    judge_system_prompt = (
        "You are an impartial judge evaluating two LLM answers.\n"
        "Score each answer from 0 to 10 based on:\n"
        "1) Factual correctness (no invented info)\n"
        "2) Didactic clarity\n"
        "3) Completeness of the answer\n"
        "4) Coherence and structure\n"
        "Return ONLY valid JSON."
    )

    judge_user_prompt = f"""
Original question:
{question}

Answer A (model: {model_a_name}):
{answer_a}

Answer B (model: {model_b_name}):
{answer_b}

Evaluate both answers and respond with JSON EXACTLY in this schema:
{{
  "model_A": "{model_a_name}",
  "model_B": "{model_b_name}",
  "score_A": <number 0-10>,
  "score_B": <number 0-10>,
  "winner": "A" or "B" or "tie",
  "reason": "brief concrete explanation citing criteria"
}}
"""

    try:
        response = client_judge.chat.completions.create(
            model=judge_model,
            messages=[
                {"role": "system", "content": judge_system_prompt},
                {"role": "user", "content": judge_user_prompt}
            ],
            response_format={"type": "json_object"}
        )

        verdict_text = response.choices[0].message.content
        
        if not verdict_text:
            raise ValueError("Empty response from judge model")
        
        # Validate and parse JSON
        try:
            verdict = json.loads(verdict_text)
        except json.JSONDecodeError as e:
            print("❌ Error: Judge response is not valid JSON")
            print(f"   Response was: {verdict_text[:200]}...")
            raise ValueError(f"Invalid JSON from judge: {e}") from e
        
        # Validate required fields
        required_fields = ["model_A", "model_B", "score_A", "score_B", "winner", "reason"]
        missing_fields = [field for field in required_fields if field not in verdict]
        if missing_fields:
            raise ValueError(f"Missing required fields in verdict: {missing_fields}")
        
        # Validate score ranges
        if not (0 <= verdict["score_A"] <= 10):
            raise ValueError(f"score_A must be 0-10, got {verdict['score_A']}")
        if not (0 <= verdict["score_B"] <= 10):
            raise ValueError(f"score_B must be 0-10, got {verdict['score_B']}")
        
        # Validate winner
        if verdict["winner"] not in ["A", "B", "tie"]:
            raise ValueError(f"winner must be 'A', 'B', or 'tie', got {verdict['winner']}")
        
        return verdict
    
    except Exception as e:
        error_msg = f"❌ Error in judge evaluation: {e}"
        print(error_msg)
        raise
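
The validation steps inside `judge_answers` can also be exercised offline against sample strings. The following is a minimal sketch, not the notebook's implementation: `validate_verdict` and `REQUIRED_FIELDS` are hypothetical names, and the explicit type check on the scores is an added assumption (a score returned as a string such as `"9"` is rejected with a `ValueError`).

```python
import json
from typing import Any, Dict

REQUIRED_FIELDS = ["model_A", "model_B", "score_A", "score_B", "winner", "reason"]


def validate_verdict(verdict_text: str) -> Dict[str, Any]:
    """Parse and validate a judge verdict; raise ValueError on any problem."""
    try:
        verdict = json.loads(verdict_text)
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON from judge: {e}") from e
    if not isinstance(verdict, dict):
        raise ValueError("Verdict must be a JSON object")

    missing = [field for field in REQUIRED_FIELDS if field not in verdict]
    if missing:
        raise ValueError(f"Missing required fields in verdict: {missing}")

    for key in ("score_A", "score_B"):
        score = verdict[key]
        # bool is a subclass of int, so reject it explicitly
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            raise ValueError(f"{key} must be a number, got {score!r}")
        if not 0 <= score <= 10:
            raise ValueError(f"{key} must be 0-10, got {score}")

    if verdict["winner"] not in ("A", "B", "tie"):
        raise ValueError(f"winner must be 'A', 'B', or 'tie', got {verdict['winner']!r}")
    return verdict


sample = (
    '{"model_A": "a", "model_B": "b", "score_A": 9, '
    '"score_B": 5, "winner": "A", "reason": "clearer"}'
)
print(validate_verdict(sample)["winner"])

try:
    validate_verdict('{"score_A": "9"}')
except ValueError as err:
    print(f"rejected: {err}")
```

Keeping the checks in a pure function like this means they can be tested without spending tokens on a judge call.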


In [9]:
# Stream answer for Model A (Cloud)
print("\n" + "="*50)
print(f"Streaming answer from {MODEL_CLOUD} (Cloud)")
print("="*50 + "\n")

try:
    answer_cloud_streamed = stream_answer(client_cloud, MODEL_CLOUD, messages)
    print(f"\n✅ Cloud answer complete ({len(answer_cloud_streamed)} characters)")
except Exception as e:
    print(f"\n❌ Failed to stream cloud answer: {e}")
    answer_cloud_streamed = None



Streaming answer from gpt-4.1-nano (Cloud)



I am ChatGPT, a large language model based on the GPT-4 architecture.  
I am running as GPT-4.  
My context window size is not publicly available, and the number of parameters is approximately 175 billion.

Now, let's analyze the provided Python code step-by-step:

```python
def make_badge(text):  # Defines a function called make_badge that takes one parameter called text
    width = len(text) + 4  # Calculates the width of the badge; it's the length of the text plus 4 for padding
    top_bottom = "*" * width  # Creates a string of '*' characters for the top and bottom border of the badge, repeated 'width' times
    middle = f"* {text} *"  # Creates the middle line with '*' at both ends and the input text in between, with spaces around the text
    return f"{top_bottom}\n{middle}\n{top_bottom}"  # Returns a string that combines the top border, middle line, and bottom border, separated by newlines
```

```python
print(make_badge("Golden rule: Do unto others as you would have them do unto you"))  
# Calls the make_badge function with a long string as input, then prints the resulting badge
```

### Explanation:
- The function `make_badge` creates a simple text banner (badge) around the input text.
- It surrounds the text with asterisks (`*`) to make a visual border.
- The `width` ensures the border is wide enough to include the text with a buffer of 2 spaces on each side.
- The `top_bottom` line creates a horizontal border of `*` characters.
- The `middle` line puts the text inside `*` characters with 1 space padding on each side.
- When printed, this displays as a framed badge with the text centered inside.

### Example output:
```
******************************************************
* Golden rule: Do unto others as you would have them do unto you *
******************************************************
```

This provides a visual "badge" with the provided message, creating an emphasis or highlight effect.


✅ Cloud answer complete (1939 characters)


In [10]:
# Stream answer for Model B (Local)
print("\n" + "="*50)
print(f"Streaming answer from {MODEL_LOCAL} (Local)")
print("="*50 + "\n")

try:
    answer_local_streamed = stream_answer(client_local, MODEL_LOCAL, messages)
    print(f"\n✅ Local answer complete ({len(answer_local_streamed)} characters)")
except Exception as e:
    print(f"\n❌ Failed to stream local answer: {e}")
    answer_local_streamed = None



Streaming answer from deepseek-r1:8b (Local)



I specialize in Python programming and predictive AI implementation.  
I am running the Llama 3 70B model from Meta.
Context window size is 8192 tokens and parameter count is 70 billion.

Let me explain the code step by step:

```python
def make_badge(text):
    width = len(text) + 4   # Calculates badge width based on text length
    top_bottom = "*" * width  # Creates asterisks for top/bottom borders
    middle = f"* {text} *"   # Formats text line with asterisks
    return f"{top_bottom}\n{middle}\n{top_bottom}"  # Combines all parts

print(make_badge("Golden rule: Do unto others as you would have them do unto you"))
```

**Step-by-step breakdown:**
1. The function `make_badge` creates stylized text badges
2. It calculates the total width by adding 4 characters to the text length
3. The top and bottom borders are created by repeating asterisks '*' for that width
4. The middle line is created by surrounding the input text with asterisks and spaces
5. The function returns the complete badge made of three lines:
   - Top border (stars)
   - Text line with stars on both sides
   - Bottom border (stars)

When called, this function would produce something like:
*******
* Golden rule text... *
*******


✅ Local answer complete (1216 characters)


In [11]:
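# Get full answers (non-streaming) for judge evaluation.
# Note: this queries both models again, so the judged text can differ from the streamed answers above.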
print("\n" + "="*50)
print("Getting full answers for judge evaluation...")
print("="*50 + "\n")

try:
    answer_cloud = get_full_answer(client_cloud, MODEL_CLOUD, messages)
    print(f"✅ Cloud answer retrieved ({len(answer_cloud)} characters)")
except Exception as e:
    print(f"❌ Failed to get cloud answer: {e}")
    answer_cloud = None

try:
    answer_local = get_full_answer(client_local, MODEL_LOCAL, messages)
    print(f"✅ Local answer retrieved ({len(answer_local)} characters)")
except Exception as e:
    print(f"❌ Failed to get local answer: {e}")
    answer_local = None



Getting full answers for judge evaluation...

✅ Cloud answer retrieved (1669 characters)
✅ Local answer retrieved (1901 characters)
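
Because the cell above calls both models a second time, the judge scores text that can differ from what was streamed (the character counts already differ). One possible alternative, sketched here under the assumption that `stream_answer` returned the complete text, is to reuse the streamed answer and fall back to a fresh request only when it is missing. `answer_for_judge` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for a real call such as `get_full_answer(client, model, messages)`.

```python
from typing import Callable, List, Optional


def answer_for_judge(streamed: Optional[str], fetch: Callable[[], str]) -> str:
    """Return the streamed text when available; otherwise call `fetch` once."""
    if streamed:
        return streamed
    return fetch()


calls: List[int] = []  # records how often the fallback was used


def fake_fetch() -> str:
    """Stand-in for a non-streaming request to the model."""
    calls.append(1)
    return "fetched answer"


print(answer_for_judge("streamed answer", fake_fetch))  # reuses the streamed text
print(answer_for_judge(None, fake_fetch))               # falls back to fake_fetch
```

With this approach the judge evaluates exactly the text the reader saw, and a second request happens only after a failed stream.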


In [12]:
# Judge evaluation
if answer_cloud and answer_local:
    print("\n" + "="*50)
    print("Running judge evaluation...")
    print("="*50 + "\n")
    
    try:
        verdict = judge_answers(
            client_judge=client_cloud,
            judge_model=MODEL_JUDGE,
            question=question,
            answer_a=answer_cloud,
            answer_b=answer_local,
            model_a_name=MODEL_CLOUD,
            model_b_name=MODEL_LOCAL
        )
        
        # Display results
        print("\n" + "="*60)
        print("JUDGE VERDICT")
        print("="*60)
        print(f"\n📊 MODEL A (cloud): {MODEL_CLOUD}")
        print(f"   Score: {verdict['score_A']}/10")
        print(f"\n📊 MODEL B (local): {MODEL_LOCAL}")
        print(f"   Score: {verdict['score_B']}/10")
        print(f"\n🏆 WINNER: {verdict['winner']}")
        print("\n💭 REASON:")
        reason = verdict["reason"].strip()
        reason = reason.replace(". ", ".\n")
        print(reason)
        print("\n" + "="*60)
        
    except Exception as e:
        print(f"❌ Failed to get judge verdict: {e}")
        verdict = None
else:
    print("\n⚠️  Cannot run judge: missing answers")
    if not answer_cloud:
        print("   - Cloud answer is missing")
    if not answer_local:
        print("   - Local answer is missing")
    verdict = None



Running judge evaluation...


JUDGE VERDICT

📊 MODEL A (cloud): gpt-4.1-nano
   Score: 9/10

📊 MODEL B (local): deepseek-r1:8b
   Score: 5/10

🏆 WINNER: A

💭 REASON:
Answer A explains the code accurately, clearly, and completely, correctly stating the width calculation (+4), structure of the return string, and purpose.
It is well-structured and didactic.
Answer B incorrectly states the width is text length + 6 in one place, causing factual inaccuracy, leading to confusion.
It also partly repeats irrelevant info about credentials and incorrectly describes the 'vertical separator' as a notable part, which is unnecessary.
Overall, A is more precise, complete, and coherent.

