## Welcome to the Second Lab - Week 1, Day 3

Today we will work with lots of models! This is a way to get comfortable with APIs.

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Important point - please read</h2>
            <span style="color:#ff7800;">The way I collaborate with you may be different to other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you carefully execute this yourself, <b>after</b> watching the lecture. Add print statements to understand what's going on, and then come up with your own variations.<br/><br/>If you have time, I'd love it if you submit a PR for changes in the community_contributions folder - instructions in the resources. Also, if you have a GitHub account, use this to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...
            </span>
        </td>
    </tr>
</table>

In [1]:
# Start with imports - ask ChatGPT to explain any package that you don't know

import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic
from IPython.display import Markdown, display

In [2]:
# Always remember to do this!
load_dotenv(override=True)

True

In [3]:
# Print the key prefixes to help with any debugging

openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY')
deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')
groq_api_key = os.getenv('GROQ_API_KEY')
perplexity_api_key = os.getenv('PERPLEXITY_API_KEY')

if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")
    
if anthropic_api_key:
    print(f"Anthropic API Key exists and begins {anthropic_api_key[:7]}")
else:
    print("Anthropic API Key not set (and this is optional)")

if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:2]}")
else:
    print("Google API Key not set (and this is optional)")

if deepseek_api_key:
    print(f"DeepSeek API Key exists and begins {deepseek_api_key[:3]}")
else:
    print("DeepSeek API Key not set (and this is optional)")

if groq_api_key:
    print(f"Groq API Key exists and begins {groq_api_key[:4]}")
else:
    print("Groq API Key not set (and this is optional)")

if perplexity_api_key:
    print(f"Perplexity API Key exists and begins {perplexity_api_key[:3]}")
else:
    print("Perplexity API Key not set (and this is optional)")

OpenAI API Key exists and begins sk-proj-
Anthropic API Key exists and begins sk-ant-
Google API Key exists and begins AI
DeepSeek API Key exists and begins sk-
Groq API Key not set (and this is optional)
Perplexity API Key exists and begins ppl
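
If you'd like a more compact variation of the key check, here is a minimal sketch. The helper `describe_key` and the `KEY_SPECS` table are my own hypothetical names, not part of the lab; the labels and prefix lengths simply mirror the cell above.

```python
import os

# Hypothetical helper (not in the original notebook): builds the same
# status line that the cell above prints for each provider.
def describe_key(label, value, prefix_len, optional=True):
    if value:
        return f"{label} API Key exists and begins {value[:prefix_len]}"
    suffix = " (and this is optional)" if optional else ""
    return f"{label} API Key not set{suffix}"

# (label, environment variable, prefix length, optional?) as in the cell above
KEY_SPECS = [
    ("OpenAI", "OPENAI_API_KEY", 8, False),
    ("Anthropic", "ANTHROPIC_API_KEY", 7, True),
    ("Google", "GOOGLE_API_KEY", 2, True),
    ("DeepSeek", "DEEPSEEK_API_KEY", 3, True),
    ("Groq", "GROQ_API_KEY", 4, True),
    ("Perplexity", "PERPLEXITY_API_KEY", 3, True),
]

for label, env_var, prefix_len, optional in KEY_SPECS:
    print(describe_key(label, os.getenv(env_var), prefix_len, optional))
```

Only the first few characters of each key are printed, exactly as before, so the full key never reaches the notebook output.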


In [4]:
request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. "
request += "Answer only with the question, no explanation."
messages = [{"role": "user", "content": request}]

In [5]:
messages

[{'role': 'user',
  'content': 'Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. Answer only with the question, no explanation.'}]

In [7]:
# Perplexity LLM - SONAR

from perplexity import Perplexity

client = Perplexity()  # Automatically uses the PERPLEXITY_API_KEY environment variable

response = client.chat.completions.create(
    messages=messages,
    model="sonar-pro"
)

question = response.choices[0].message.content
print(question)

Here's a challenging question for evaluating LLM intelligence:

"You are given a repository containing 50,000 lines of code across 200 files, where a critical function has been refactored three times over the past year. The function's behavior changed subtly after the second refactoring, introducing a bug that only manifests when specific edge-case inputs are combined with particular environmental conditions. A junior developer reports that the application crashes intermittently in production but cannot reproduce it locally. Using only the information that the bug occurs 'sometimes' and 'under certain conditions,' outline a systematic debugging strategy that accounts for the code's evolution, considers how environment-specific factors might interact with the refactored logic, and identifies which specific code sections would be most valuable to examine first. Explain your reasoning at each step, including how you'd prioritize investigating the three versions of the function and what ty

In [None]:
openai = OpenAI()
response = openai.chat.completions.create(
    model="gpt-5-mini",
    messages=messages,
)
question = response.choices[0].message.content
print(question)


In [8]:
competitors = []
answers = []
messages = [{"role": "user", "content": question}]

## Note - update since the videos

I've updated the model names to use the latest models below, like GPT 5 and Claude Sonnet 4.5. It's worth noting that these models can be quite slow - like 1-2 minutes - but they do a great job! Feel free to switch them for faster models if you'd prefer, like the ones I use in the video.
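
To see for yourself how slow or fast a given model is, you could time each call. This is a minimal sketch, assuming a hypothetical `timed` wrapper and a stand-in `fake_model_call` so that it runs without any API keys; in the lab you would pass the real client method instead.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for an API call so the sketch runs without keys.
# In the notebook you might pass openai.chat.completions.create instead.
def fake_model_call(prompt):
    time.sleep(0.01)
    return f"echo: {prompt}"

result, seconds = timed(fake_model_call, "hello")
print(f"{seconds:.2f}s -> {result}")
```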

In [9]:
# Perplexity's own LLM - Sonar

from perplexity import Perplexity

client = Perplexity()  # Automatically uses the PERPLEXITY_API_KEY environment variable

model_name = "sonar-pro"

response = client.chat.completions.create(
    messages=messages,
    model=model_name
)

answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

# Systematic Debugging Strategy for Environment-Dependent Edge-Case Bug

## Understanding the Problem Context

This scenario requires a **multi-layered debugging approach** that combines historical analysis, systematic isolation, and environment-aware hypothesis formation. The critical factors are: intermittent production failures, local non-reproducibility, a recently refactored function with multiple iterations, and edge-case sensitivity to environmental conditions. This demands moving beyond traditional debugging techniques to incorporate code evolution analysis and environmental profiling.

## Phase 1: Establish Reproducibility and Evidence Gathering

**Before examining code, establish ground truth about the bug's manifestation.**

The inability to reproduce locally is itself diagnostic information. This strongly suggests the bug involves **environment-specific factors** rather than pure logic errors. Start by collecting comprehensive production data[1]:

- Gather crash logs, stack traces, and error timestamps to identify patterns in when failures occur
- Document the exact environmental differences between production and local environments (OS versions, dependency versions, memory constraints, concurrency patterns, CPU specifications, system load patterns)
- Extract the specific inputs that precede crashes using application telemetry and request logging
- Identify whether crashes correlate with particular times, user loads, data volumes, or system states

This evidence-gathering phase is crucial because the bug's intermittent nature means you're dealing with a **Heisenbug**—a defect that disappears when you try to observe it directly. You must build a profile of conditions that make it reproducible.

## Phase 2: Binary Search Through Code Evolution Using Git History

**Leverage version control to narrow down which refactoring introduced the bug[4].**

This is more powerful than examining the code itself initially. Use `git-bisect` to systematically isolate the exact commit that introduced the regression[4]:

1. **Identify the timeframe**: Determine when the bug first appeared in production logs
2. **Bisect the three refactoring commits**: Rather than comparing all three versions linearly, perform a binary search through the git history between the last known-good version and the current broken version
3. **Run your production test suite against each bisected version**: This allows you to skip the "manual reproduction" problem—instead, you use production telemetry patterns to validate each version
4. **Examine the diff at the failure point**: Once git-bisect identifies the specific commit, analyze what changed in that single refactoring step

This approach is superior to comparing all three versions because it identifies the **minimal set of changes** that introduced the defect. A diff of one refactoring is far more manageable than comparing three complete versions of a function.

**Reasoning**: Code evolution analysis often reveals that a bug wasn't introduced by changing business logic, but by subtle shifts in **when side effects occur**, **how state is managed**, or **what guarantees are maintained**. Refactorings often change these aspects while preserving apparent functionality.
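
The binary search that `git bisect` performs can be sketched in a few lines. This is an illustration only: the commit names are hypothetical, and it assumes the history is ordered oldest to newest, the first commit is good, the last is bad, and the `is_bad` check is reliable (which, for an intermittent bug, may require running the check many times per commit).

```python
def first_bad_commit(commits, is_bad):
    """Return the earliest commit for which is_bad is True.

    Assumes commits[0] is good, commits[-1] is bad, and that once a
    commit is bad every later commit is bad too.
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]

# Hypothetical history: the regression lands in "refactor-2".
history = ["base", "refactor-1", "cleanup", "refactor-2", "tweak", "refactor-3"]
bad_from = history.index("refactor-2")
culprit = first_bad_commit(history, lambda c: history.index(c) >= bad_from)
print(culprit)
```

Each check halves the remaining range, so even a long history needs only a handful of test runs.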

## Phase 3: Decompose the System to Isolate the Bug's Location

**Apply binary search not just to code history, but to the current codebase structure[5].**

Given 50,000 lines across 200 files, start with broad hypotheses before zooming into specific functions:

**First hypothesis level (eliminate ~50% of the system)**:
- Is the bug in the refactored function itself, or in the **interactions between the refactored function and its callers**?
- Test by: Creating a feature flag that reverts just the refactored function to its previous version while keeping the rest of the codebase current. If crashes stop, the bug is definitely in or directly caused by the refactored function. If crashes continue, the bug involves caller-side assumptions about the function's behavior.

**Second hypothesis level (eliminate another ~50%)**:
- If the bug is in the refactored function: Is it in the **happy path logic** or in **error handling/edge case branches**?
- If the bug involves interactions: Is it a **state consistency issue** between the function and its callers, or a **contract violation** (function returns unexpected types or values)?

**Reasoning**: This hierarchical decomposition avoids the trap of diving into code reading without direction[5]. Each hypothesis should cut the search space in half and be testable through targeted experimentation.
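
A feature flag that swaps between the previous and the refactored implementation might look like the sketch below. The flag name `USE_LEGACY_PROCESS` and both `process_*` functions are hypothetical stand-ins, not taken from any real codebase.

```python
import os

def process_previous(value):
    # Stand-in for the implementation before the refactoring.
    return ("previous", value * 2)

def process_refactored(value):
    # Stand-in for the refactored implementation.
    return ("refactored", value + value)

def process(value, env=os.environ):
    """Dispatch to the old or new implementation based on a flag."""
    # USE_LEGACY_PROCESS is a hypothetical flag name.
    if env.get("USE_LEGACY_PROCESS") == "1":
        return process_previous(value)
    return process_refactored(value)

print(process(21, env={"USE_LEGACY_PROCESS": "1"}))
print(process(21, env={}))
```

Passing the environment as a parameter keeps the dispatch easy to test; in production the default `os.environ` is read on every call, so the flag can be flipped without a code change.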

## Phase 4: Environment-Specific Factor Analysis

**The local/production gap is the key diagnostic clue[1].**

Systematically test environmental variables that differ between local and production:

- **Concurrency**: Production typically has multiple threads/processes accessing the function simultaneously. Local development often doesn't. Add concurrent access patterns to local tests and replay production workloads in staging with identical concurrency levels.

- **Memory pressure**: Edge cases often manifest under resource constraints. The function might work fine with abundant memory but fail when memory is scarce, triggering garbage collection patterns or cache eviction behaviors that affect timing or state initialization.

- **Dependency versions**: Production might use different versions of libraries (even patch versions) than local environments. These can subtly change timing, memory layout, or behavior of underlying operations the refactored function depends on.

- **System state**: Production systems have different file descriptors, open network connections, and system resource states than fresh local environments. Edge cases might depend on these pre-existing states.

**Investigation approach**: Create a **dependency manifest** comparing all package versions between environments. Replay production data volumes and access patterns against the refactored function in a staging environment that matches production exactly.
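
Comparing two dependency manifests can be as simple as diffing two dictionaries. A minimal sketch, assuming the manifests have already been collected into `{package: version}` mappings (for example from `pip freeze` output); the version pins below are invented for illustration.

```python
def diff_manifests(local, production):
    """Return {package: (local_version, production_version)} for mismatches."""
    packages = sorted(set(local) | set(production))
    return {
        name: (local.get(name), production.get(name))
        for name in packages
        if local.get(name) != production.get(name)
    }

# Hypothetical version pins, for illustration only.
local = {"requests": "2.32.3", "numpy": "2.1.0", "rich": "13.7.1"}
production = {"requests": "2.31.0", "numpy": "2.1.0"}

for name, (mine, theirs) in diff_manifests(local, production).items():
    print(f"{name}: local={mine} production={theirs}")
```

A package that is missing from one environment shows up with `None` on that side, which is often as informative as a version mismatch.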

## Phase 5: Edge Case and Input Analysis

**Identify the specific inputs that trigger failures[1][5].**

Given that "specific edge-case inputs combined with particular environmental conditions" trigger the bug:

**Examine the refactored function for common edge-case vulnerabilities**:
- **Boundary conditions**: Off-by-one errors, empty collections, null/undefined values, zero/negative numbers, maximum integer values
- **Type mismatches**: Does the refactoring introduce implicit type conversions or assumptions about input types?
- **State assumptions**: Does the refactored version assume particular initialization states or ordering guarantees that the previous version didn't?
- **Floating-point precision**: If the function performs numeric operations, did the refactoring change how floating-point arithmetic is performed?
- **Resource cleanup**: Did the refactoring change how file handles, database connections, or memory are managed?

**Prioritize investigation of these specific code sections**:

1. **Refactoring points** (highest priority): Lines that were modified between the previous and current version
2. **Error handling branches**: Edge-case handling is often overlooked during refactoring
3. **Initialization logic**: Where the refactored function sets up internal state
4. **Integration points**: Where the function interacts with external systems or dependencies
5. **Timing-sensitive code**: Anywhere the function manages concurrent access or asynchronous operations
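
The boundary conditions listed above can be enumerated systematically rather than by hand. This sketch assumes a hypothetical function under test, `safe_ratio`, and illustrative boundary values; the real lists would depend on the actual function's input types.

```python
import itertools
import sys

# Illustrative boundary values; adjust to the function's real input types.
NUMBERS = [0, -1, 1, sys.maxsize, -sys.maxsize - 1]
COLLECTIONS = [None, [], [0], list(range(1000))]

def edge_case_inputs():
    """Yield every (number, collection) pair of boundary values."""
    yield from itertools.product(NUMBERS, COLLECTIONS)

def safe_ratio(count, items):
    # Hypothetical function under test: divides count by the number of items,
    # returning 0.0 for a missing or empty collection.
    if not items:
        return 0.0
    return count / len(items)

for count, items in edge_case_inputs():
    result = safe_ratio(count, items)
    assert isinstance(result, float)

print(sum(1 for _ in edge_case_inputs()), "combinations checked")
```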

## Phase 6: Create a Reproduction Environment

**Build local conditions that match production's triggering conditions[1].**

Use the environmental and input analysis from Phases 4-5 to construct a test that reproduces the bug:

```python
# Pseudo-code example showing the reproduction strategy
def test_refactored_function_with_production_conditions():
    # Set up environment matching production
    setup_memory_pressure(threshold_percent=80)
    setup_concurrent_access(thread_count=production_thread_count)
    
    # Use actual production inputs collected during Phase 1
    for edge_case_input in production_edge_cases:
        for environmental_config in production_configurations:
            result = refactored_function(edge_case_input, config=environmental_config)
            assert_valid_result(result)
```

Once you can reproduce the bug locally, traditional debugging techniques (breakpoints, logging, hypothesis testing) become far more effective[1].

## Phase 7: Root Cause Analysis and Hypothesis Testing

**With reproducibility achieved, form increasingly specific hypotheses[5].**

Follow this iterative process:
1. Form a hypothesis about the root cause based on code examination
2. Design a minimal test that validates or invalidates the hypothesis
3. Whether validated or invalidated, use the result to construct the next narrower hypothesis
4. Repeat until the exact cause is identified

**Specific areas to investigate based on common refactoring mistakes**:
- **Removed defensive checks**: Did the refactoring remove input validation or error checks present in the previous version?
- **Changed iteration order**: Did loops iterate in a different order, potentially affecting state mutations?
- **Altered timing**: Did the refactoring change when operations occur (earlier/later in execution), affecting timing-dependent behavior?
- **Modified caching**: Did the refactoring add or remove caching that affects state visibility across concurrent calls?

## Reasoning Summary

**Why this strategy works for this specific scenario:**

- **Git-bisect prioritization**: Rather than analyzing three complete versions manually, you identify exactly which change caused the regression, reducing analysis scope dramatically
- **Environment focus**: The local/production gap is the most valuable diagnostic. Most developers focus on code logic first, but this gap strongly indicates environmental factors
- **Hierarchical decomposition**: Breaking the system into hypothesis-testable components prevents the overwhelming feeling of "which of 50,000 lines has the bug?"
- **Production data usage**: Instead of guessing at edge cases, use actual failing inputs and conditions from production, ensuring you test the right scenarios
- **Concurrency consideration**: Intermittent bugs in production but not locally almost always involve concurrency issues that don't manifest in single-threaded local development

The investigation sequence prioritizes maximum information gain per unit of effort, starting with code history analysis (high-ROI, systematic), proceeding to environmental isolation (addresses the core non-reproducibility problem), then narrowing to specific code sections only after building actionable hypotheses.

In [None]:
# The API we know well
# I've updated this with the latest model, but it can take some time because it likes to think!
# Replace the model with gpt-4.1-mini if you'd prefer not to wait 1-2 mins

model_name = "gpt-5-nano"

response = openai.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

In [None]:
# Anthropic has a slightly different API, and max_tokens is required

model_name = "claude-sonnet-4-5"

claude = Anthropic()
response = claude.messages.create(model=model_name, messages=messages, max_tokens=1000)
answer = response.content[0].text

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)
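
One difference worth poking at: Anthropic returns `response.content` as a list of content blocks rather than a single string, which is why the cell above reads `response.content[0].text`. Here is a minimal sketch of a more defensive variation. The helper `joined_text` is my own hypothetical name, and the stand-in objects only imitate the shape of text blocks so that the sketch runs without an API key.

```python
from types import SimpleNamespace

def joined_text(content_blocks):
    """Concatenate the text of every block whose type is "text"."""
    return "".join(
        block.text
        for block in content_blocks
        if getattr(block, "type", None) == "text"
    )

# Stand-in objects shaped like the blocks in response.content.
fake_content = [
    SimpleNamespace(type="text", text="Hello, "),
    SimpleNamespace(type="text", text="world"),
]
print(joined_text(fake_content))
```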

In [10]:
gemini = OpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
model_name = "gemini-2.5-flash"

response = gemini.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

This is an excellent question that probes deep into practical software engineering, problem-solving, and diagnostic abilities, especially under uncertainty. It requires understanding not just code, but also system dynamics, team collaboration, and a structured approach to complex problems.

Here's a systematic debugging strategy:

---

## Systematic Debugging Strategy for an Intermittent, Environment-Dependent Bug

The core challenge is the intermittency and the non-reproducibility locally, pointing strongly to differences in production environment, scale, or specific interaction patterns. The refactoring history is our golden clue regarding the *location* of the bug.

### Phase 1: Information Gathering & Hypothesis Formation (The "What," "When," "Where," "How")

Before touching any code, we need more context.

1.  **Gather Detailed Crash Information:**
    *   **Logs:** Immediately access production application logs, system logs (OS), web server logs, database logs, and any infrastructure logs (e.g., Kubernetes, cloud platform logs). We're looking for:
        *   **Full Stack Trace:** This is paramount. It tells us *exactly* where the crash occurred.
        *   **Error Messages:** Any specific error codes or messages.
        *   **Timestamps:** Crucial for correlating events.
        *   **Request/Transaction IDs:** If applicable, to trace the full lifecycle.
        *   **Input Data:** What inputs were being processed by the critical function (or the application generally) *at the time of the crash*? Even partial or obfuscated data can be invaluable.
        *   **Environmental Context from Logs:** CPU usage, memory usage, network I/O, disk I/O, thread counts, open file handles around the crash time.
    *   **Monitoring Data:** Review dashboards for system metrics (CPU, RAM, network, disk I/O, database connections, queue depths) at the time of the crashes. Look for spikes or unusual patterns.
    *   **Frequency & Pattern:** Is it truly random, or does it happen during peak load, specific times of day, after a particular type of user action, or for certain types of data?
    *   **Impact:** What part of the application fails? What's the user experience? Is it affecting all users or a subset?
    *   **Junior Developer's "Observations":** Even if vague ("sometimes," "certain conditions"), ask for *anything* they remember. Was it after a large data import? A specific report generation? High concurrent users?

2.  **Establish a Baseline (Pre-Bug State):**
    *   **Identify the exact version of the application code that ran *before* the second refactoring (i.e., the version containing the *stable* critical function).** This will be our "known good" baseline.
    *   **Identify the exact version after the second refactoring (when the bug was introduced) and after the third refactoring.**

**Reasoning:** Without a stack trace or concrete error, we're guessing. Logs and monitoring are our eyes and ears in production. Establishing a baseline helps us isolate the *change* that caused the problem.

### Phase 2: Environment & Configuration Deep Dive (Bridging the Prod/Local Gap)

The fact that it's not reproducible locally is the biggest clue. We need to systematically compare the production environment to the local development environment.

1.  **Systematic Environment Comparison:**
    *   **Hardware:** CPU architecture (e.g., x86 vs ARM), available RAM, disk speed, number of cores.
    *   **Operating System:** Version, patches, kernel configurations, resource limits (e.g., `ulimit` for open files, processes).
    *   **Runtime/Language Version:** Exact minor/patch version of Java, Node.js, Python, .NET, Go, etc. (e.g., OpenJDK 11.0.12 vs 11.0.13).
    *   **Dependencies:** Database version (PostgreSQL 13.x vs 14.x), external service API versions, third-party library versions (check `package.json`, `pom.xml`, `requirements.txt`, `go.mod` vs. deployed versions).
    *   **Configuration Files:** Application properties, environment variables, feature flags. Any difference, no matter how small, could be critical.
    *   **Network:** Latency, firewall rules, proxy settings, DNS resolution.
    *   **Resource Limits:** Max open files, max processes, memory limits (JVM heap size, container limits), thread pool sizes.
    *   **Scale & Load:** Production has higher concurrency, more data, more requests per second.
    *   **Data Characteristics:** Production database will have more data, potentially "dirtier" data (nulls, empty strings, unusual characters, boundary values) not present in local test datasets.
    *   **Time & Locale:** Timezones, system clock synchronization, locale settings (date/number formatting).
    *   **Security Context:** User permissions, mounted volumes, security policies.

2.  **Create a Staging Environment Mirror:**
    *   Attempt to create a staging environment that is *as close to production as possible* in terms of hardware, OS, runtime, dependencies, and configuration. Ideally, use a recent snapshot of production data (scrubbed for PII if necessary).

**Reasoning:** Environment parity is crucial for reproducing production-only bugs. Even seemingly minor differences can expose timing bugs, resource contention, or edge cases related to specific library versions or OS behavior.

### Phase 3: Code Evolution Analysis & Targeted Examination

This is where the refactoring history becomes key.

1.  **Prioritizing Investigation of the Three Versions:**

    *   **Priority 1: Diff between Version 2 and Version 3 of the Critical Function (and immediate callers/callees).**
        *   **Why:** The bug was introduced *after* the second refactoring. Therefore, the changes made between Version 2 (known good) and Version 3 (first potentially bad) are the most likely culprits. This is where we invest the most time initially.
        *   **Action:** Perform a detailed `git diff` (or equivalent SCM comparison) between the commit containing Version 2 and the commit containing Version 3 for the critical function and any code directly interacting with its inputs or outputs.

    *   **Priority 2: Version 3 of the Critical Function.**
        *   **Why:** This is the current buggy version. Even without the diff, understanding its current logic is essential once a potential problematic area is identified.

    *   **Priority 3: Version 2 of the Critical Function.**
        *   **Why:** This is the *known good* baseline. If the diff with Version 3 isn't immediately obvious, understanding how Version 2 *correctly* handled the logic can highlight what Version 3 might be doing wrong. It helps establish the "original intent."

    *   **Least Priority: Version 1 of the Critical Function.**
        *   **Why:** While understanding the full history can be helpful for context, the bug occurred specifically after the second refactoring. Version 1 is unlikely to hold the direct clue unless the second refactoring merely *revealed* an older issue, which is less probable given the prompt. We'd only look at this if the v2-v3 diff yielded no clear answers and we needed to understand the foundational logic better.

2.  **Specific Code Sections to Examine First (Based on v2 vs v3 Diff):**

    *   **Changes in Input Validation:** Is new input validation stricter or looser? Are edge cases (null, empty, boundary values, invalid characters) handled differently?
    *   **Resource Management:**
        *   **Concurrency Primitives:** New locks, semaphores, mutexes, thread pools, asynchronous operations. These are prime candidates for race conditions, deadlocks, or resource exhaustion under load.
        *   **File/Network Handles:** Are they being opened and *properly closed* in all execution paths (including error paths)?
        *   **Database Connections:** Changes in how connections are acquired/released from a pool, or if new queries are introduced that could deadlock or be very slow under contention.
    *   **External API Calls:** How are retries, timeouts, and error handling for external services managed? Are new external calls introduced?
    *   **Error Handling:** Are exceptions caught and handled differently? Could an exception now go unhandled, leading to a crash instead of a graceful failure? Is a specific error being swallowed or re-thrown in a way that obscures the true problem?
    *   **Data Structures & Algorithms:** Changes to critical data structures (e.g., new `Map` vs. `HashMap`, `ArrayList` vs. `LinkedList`) or algorithmic complexity could lead to performance bottlenecks or unexpected behavior under high data volumes.
    *   **Numerical Precision/Floating Point:** Subtle changes in calculations can expose floating-point inaccuracies or integer overflows with specific large inputs.
    *   **Timing-Dependent Logic:** Any logic that relies on specific delays, timeouts, or the order of operations, especially in concurrent environments.
    *   **Side Effects:** Does the refactored function now modify shared state or external systems in a way it didn't before, or in a different order?
    *   **Any new dependencies or third-party library calls:** Check these for potential version incompatibilities or new bugs.

**Reasoning:** By focusing on the *changes* between the stable and potentially buggy versions, we drastically narrow the search space within the 50,000 lines of code. This is the most efficient use of developer time.

### Phase 4: Targeted Reproduction & Monitoring

With hypotheses from Phase 1-3, we move to prove them.

1.  **Enhanced Logging/Tracing in Production (Carefully!):**
    *   If the bug still can't be reproduced in staging, the next step is *surgical* instrumentation in production.
    *   Add highly detailed, contextual logging/tracing *specifically* around the suspected changes identified in Phase 3 (v2 vs v3 diff).
    *   Log: Function entry/exit, critical variable values (inputs, intermediate results, outputs), thread IDs, system resource usage (memory, CPU), timestamps, return codes from external calls, specific error conditions.
    *   **Crucially:** Use a feature flag or dynamic configuration to enable/disable this verbose logging. It should only be active when trying to catch the bug to avoid overwhelming logs or performance impact.
    *   **Reasoning:** This is often the only way to get the exact state that leads to an intermittent production bug.

2.  **Load Testing & Stress Testing (on Staging):**
    *   Replicate production-like traffic patterns, user concurrency, and data volumes on the mirrored staging environment.
    *   Gradually increase load and introduce stressors (e.g., simulating slow external services, network latency, high CPU usage) to try and trigger resource contention or timing issues.

3.  **Fuzz Testing / Edge Case Input Generation (on Staging/Local):**
    *   Based on the identified code changes, generate a wide range of synthetic "edge-case inputs" that might trigger the new logic. This includes:
        *   Empty strings, nulls, very long strings.
        *   Zero, negative numbers, very large numbers, floating-point numbers near boundaries.
        *   Inputs with special characters, unicode, non-ASCII.
        *   Inputs that would cause boundary conditions in loops or array access.
        *   Specific combinations of inputs identified from production logs.

**Reasoning:** We need to actively provoke the bug. Enhanced logging gives us the forensic data in production, while load/fuzz testing aims to reproduce it in a controlled environment.
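
Flag-controlled verbose logging, as described in step 1, could be sketched with Python's standard `logging` module. The flag name `VERBOSE_CRITICAL_LOGGING`, the logger name, and `critical_function` are hypothetical stand-ins chosen for this illustration.

```python
import logging
import os

logger = logging.getLogger("critical_function")

def configure_logging(env=os.environ):
    """Enable DEBUG output only when the (hypothetical) flag is set."""
    verbose = env.get("VERBOSE_CRITICAL_LOGGING") == "1"
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
    return verbose

def critical_function(items):
    # Entry and exit are logged with lazy %-formatting, so the message
    # is only built when DEBUG logging is actually enabled.
    logger.debug("enter: %d items", len(items))
    total = sum(items)
    logger.debug("exit: total=%s", total)
    return total

# Timestamps and thread names help correlate log lines with crashes.
logging.basicConfig(format="%(asctime)s %(threadName)s %(levelname)s %(message)s")
configure_logging({"VERBOSE_CRITICAL_LOGGING": "1"})
critical_function([1, 2, 3])
```

With the flag unset, the `debug` calls are cheap no-ops, which limits the performance impact in production.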

### Phase 5: Hypothesis Validation & Root Cause Analysis

Once we have a reproducible scenario (or detailed logs from production):

1.  **Analyze Crash Data:** Carefully review the collected stack traces, logs, and monitoring data.
2.  **Debug (if reproducible):** Use a debugger in the staging environment, stepping through the code with the specific inputs and conditions that trigger the bug.
3.  **Formulate and Test Hypotheses:** Based on the evidence, refine your hypotheses. For example:
    *   "The bug is a race condition due to the new non-synchronized access to a shared `Map`." -> Test by adding a `synchronized` block.
    *   "The bug is caused by a null pointer when an external service returns an empty list under high load." -> Test by adding null checks and default empty list handling.
    *   "The bug is an out-of-memory error due to a growing list not being cleared in a new loop." -> Test by adding a `list.clear()` or using a fixed-size buffer.

**Reasoning:** This phase confirms the specific line of code and conditions causing the issue, leading to a targeted fix.
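
The first hypothesis above (a race on shared state, fixed by synchronizing access) can be tested with a small stress harness. This sketch uses Python's `threading.Lock` as the analogue of a `synchronized` block; the `Counter` class and `hammer` function are hypothetical names for illustration.

```python
import threading

class Counter:
    """Shared counter whose increments are guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The read-modify-write below is not atomic without the lock.
        with self._lock:
            self.value += 1

def hammer(counter, threads=8, per_thread=10_000):
    """Increment the counter from several threads and return the final value."""
    def work():
        for _ in range(per_thread):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value

total = hammer(Counter())
print(total)
```

If removing the lock makes the total fall short of `threads * per_thread` on some runs, that supports the race-condition hypothesis; a consistently correct total with the lock in place supports the fix.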

### Phase 6: Remediation & Prevention

1.  **Fix the Bug:** Implement the identified fix.
2.  **Unit & Integration Tests:** Write specific, targeted unit tests and integration tests for the identified edge-case and environmental conditions that exposed the bug. These tests should be added to the CI/CD pipeline to prevent regression.
3.  **Code Review:** Get the fix and new tests peer-reviewed.
4.  **Deployment & Monitoring:** Deploy the fix and closely monitor production to ensure the bug is resolved and no new issues are introduced.
5.  **Post-Mortem & Learnings:** Document the bug, its root cause, the fix, and any process improvements (e.g., better environment parity, more robust testing, enhanced logging patterns).

---

### Types of Edge Cases Most Likely to Expose Such an Environment-Dependent Bug:

Environment-dependent bugs typically arise from the **differences in operating conditions** between development and production. The refactoring (especially around resource management, concurrency, or external interactions) would be particularly vulnerable to these:

1.  **Concurrency / Race Conditions:**
    *   **High Concurrent Requests:** In production, many users/processes hit the critical function simultaneously, exposing unprotected shared mutable state, non-atomic operations, or improper locking.
    *   **Thread Pool Exhaustion:** When too many threads/goroutines/tasks are queued, leading to timeouts or deadlocks.
    *   **Asynchronous Operations:** Mismanagement of promises, futures, callbacks, or non-blocking I/O where results arrive in an unexpected order or are not correctly handled.

2.  **Resource Exhaustion / Throttling:**
    *   **Memory Leaks:** Small leaks that are unnoticeable locally can lead to OutOfMemory errors in a long-running production service.
    *   **File Handle Leaks:** Not closing files/sockets properly, hitting OS `ulimit` for open file descriptors.
    *   **Database Connection Pool Limits:** Not releasing connections, causing subsequent requests to fail to acquire a connection.
    *   **CPU/I/O Bottlenecks:** A function that is efficient for small local datasets becomes a performance killer with large production datasets, leading to timeouts or cascading failures.
    *   **External Service Rate Limits/Throttling:** Production often hits these limits for third-party APIs that local dev environments never do.

3.  **Timing-Dependent Issues:**
    *   **Network Latency:** Production environments often have higher network latency or intermittent connectivity issues when interacting with external services or databases. A refactoring that changed timeout behavior or assumed instant responses would break here.
    *   **System Clock Skew:** Differences in system clocks across distributed services or between application servers and database servers can affect time-sensitive logic (e.g., caching, authentication tokens).
    *   **Unpredictable Execution Order:** In concurrent systems, the exact order of operations is not guaranteed. A refactoring might unintentionally rely on a specific order that only happens sometimes.

4.  **Data Volume & Characteristics:**
    *   **Large Datasets:** Production databases contain more records, larger files, or bigger collections, which can expose `O(n^2)` or memory-intensive operations in the refactored code.
    *   **"Dirty" Data:** Nulls, empty strings, invalid characters (e.g., UTF-8 issues), values at the very min/max boundaries of data types, or inconsistent data formats that might not exist in local test data.
    *   **Sparse Data:** Edge cases where a database query or API call returns unexpectedly few (or zero) results, which the refactoring might not have accounted for.

5.  **Environmental Configuration Mismatches:**
    *   **Different API Endpoints/Credentials:** Production points to live services; dev might point to mocks or sandbox environments.
    *   **Feature Flags:** A feature flag enabled in production but disabled locally could expose different code paths.
    *   **File Paths/Permissions:** Hardcoded paths or assumptions about write permissions that differ between environments.
    *   **Locale/Timezones:** Date/time parsing or number formatting issues if production uses a different locale than development.

6.  **JVM/Runtime-Specifics (e.g., Garbage Collection):**
    *   Different JVM arguments (e.g., GC algorithms, heap sizes) can affect memory behavior and performance under load, potentially exposing issues that are masked in local development.

By systematically applying this strategy, we maximize our chances of identifying and fixing the elusive bug while minimizing wasted effort.

In [None]:
deepseek = OpenAI(api_key=deepseek_api_key, base_url="https://api.deepseek.com/v1")
model_name = "deepseek-chat"

response = deepseek.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

In [None]:
# Updated with the latest Open Source model from OpenAI

groq = OpenAI(api_key=groq_api_key, base_url="https://api.groq.com/openai/v1")
model_name = "openai/gpt-oss-120b"

response = groq.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)


## For the next cell, we will use Ollama

Ollama runs a local web service that exposes an OpenAI-compatible endpoint,  
and runs models locally using high-performance C++ code.

If you don't have Ollama, install it by visiting https://ollama.com, pressing Download and following the instructions.

After it's installed, you should be able to visit http://localhost:11434 and see the message "Ollama is running".

You might need to restart Cursor (and maybe reboot). Then open a Terminal (control+\`) and run `ollama serve`

Useful Ollama commands (run these in the terminal, or with an exclamation mark in this notebook):

`ollama pull <model_name>` downloads a model locally  
`ollama ls` lists all the models you've downloaded  
`ollama rm <model_name>` deletes the specified model from your downloads
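
If you'd rather check from Python than from the browser, here is a minimal sketch using only the standard library. The function name `ollama_is_running` is my own invention, and it assumes the default address http://localhost:11434 and the "Ollama is running" banner mentioned above.

```python
import urllib.error
import urllib.request

def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the server at base_url replies with the Ollama banner."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts and HTTP errors all count as "not reachable"
        return False
    return "Ollama is running" in body

print(ollama_is_running())
```

If this prints `False`, try running `ollama serve` in a terminal and then run the check again.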

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Super important - ignore me at your peril!</h2>
            <span style="color:#ff7800;">The model called <b>llama3.3</b> is FAR too large for home computers - it's not intended for personal computing and will consume all your resources! Stick with the nicely sized <b>llama3.2</b> or <b>llama3.2:1b</b> and if you want larger, try llama3.1 or smaller variants of Qwen, Gemma, Phi or DeepSeek. See <a href="https://ollama.com/models">the Ollama models page</a> for a full list of models and sizes.
            </span>
        </td>
    </tr>
</table>

In [None]:
!ollama pull llama3.2

In [None]:
ollama = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
model_name = "llama3.2"

response = ollama.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)
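
The last few cells all repeat the same steps: call the client, pull out the answer, then append to both lists. One possible way to fold that into a helper is sketched below. The name `ask_competitor` and the stand-in `fake_client` are hypothetical and only there so the example runs without any server; in the notebook you would pass a real client such as `ollama` or `groq`.

```python
from types import SimpleNamespace

def ask_competitor(client, model_name, messages, competitors, answers):
    """Send messages to one OpenAI-compatible client and record the model name and answer."""
    response = client.chat.completions.create(model=model_name, messages=messages)
    answer = response.choices[0].message.content
    competitors.append(model_name)
    answers.append(answer)
    return answer

# Stand-in client (hypothetical) that mimics the response shape, so this runs offline
def fake_create(model, messages):
    message = SimpleNamespace(content=f"Answer from {model}")
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])

fake_client = SimpleNamespace(chat=SimpleNamespace(completions=SimpleNamespace(create=fake_create)))

demo_competitors, demo_answers = [], []
demo_answer = ask_competitor(
    fake_client, "demo-model", [{"role": "user", "content": "Hi"}], demo_competitors, demo_answers
)
print(demo_answer)
```

With a real client, the call would look like `display(Markdown(ask_competitor(ollama, "llama3.2", messages, competitors, answers)))`.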

In [11]:
# So where are we?

print(competitors)
print(answers)


['sonar-pro', 'gemini-2.5-flash']
['# Systematic Debugging Strategy for Environment-Dependent Edge-Case Bug\n\n## Understanding the Problem Context\n\nThis scenario requires a **multi-layered debugging approach** that combines historical analysis, systematic isolation, and environment-aware hypothesis formation. The critical factors are: intermittent production failures, local non-reproducibility, a recently refactored function with multiple iterations, and edge-case sensitivity to environmental conditions. This demands moving beyond traditional debugging techniques to incorporate code evolution analysis and environmental profiling.\n\n## Phase 1: Establish Reproducibility and Evidence Gathering\n\n**Before examining code, establish ground truth about the bug\'s manifestation.**\n\nThe inability to reproduce locally is itself diagnostic information. This strongly suggests the bug involves **environment-specific factors** rather than pure logic errors. Start by collecting comprehensive 

In [12]:
# It's nice to know how to use "zip"
for competitor, answer in zip(competitors, answers):
    print(f"Competitor: {competitor}\n\n{answer}")


Competitor: sonar-pro

# Systematic Debugging Strategy for Environment-Dependent Edge-Case Bug

## Understanding the Problem Context

This scenario requires a **multi-layered debugging approach** that combines historical analysis, systematic isolation, and environment-aware hypothesis formation. The critical factors are: intermittent production failures, local non-reproducibility, a recently refactored function with multiple iterations, and edge-case sensitivity to environmental conditions. This demands moving beyond traditional debugging techniques to incorporate code evolution analysis and environmental profiling.

## Phase 1: Establish Reproducibility and Evidence Gathering

**Before examining code, establish ground truth about the bug's manifestation.**

The inability to reproduce locally is itself diagnostic information. This strongly suggests the bug involves **environment-specific factors** rather than pure logic errors. Start by collecting comprehensive production data[1]:

- G

In [13]:
# Let's bring this together - note the use of "enumerate"

together = ""
for index, answer in enumerate(answers):
    together += f"# Response from competitor {index+1}\n\n"
    together += answer + "\n\n"
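
As a variation, `enumerate` accepts a `start` argument, and `str.join` can build the whole string in one go. This is just a sketch with made-up sample answers (`example_answers` is not part of the lab); it produces the same layout as the loop above.

```python
example_answers = ["First answer", "Second answer"]

# start=1 removes the need for index+1; join concatenates the formatted pieces
together_example = "".join(
    f"# Response from competitor {number}\n\n{answer}\n\n"
    for number, answer in enumerate(example_answers, start=1)
)
print(together_example)
```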

In [14]:
print(together)

# Response from competitor 1

# Systematic Debugging Strategy for Environment-Dependent Edge-Case Bug

## Understanding the Problem Context

This scenario requires a **multi-layered debugging approach** that combines historical analysis, systematic isolation, and environment-aware hypothesis formation. The critical factors are: intermittent production failures, local non-reproducibility, a recently refactored function with multiple iterations, and edge-case sensitivity to environmental conditions. This demands moving beyond traditional debugging techniques to incorporate code evolution analysis and environmental profiling.

## Phase 1: Establish Reproducibility and Evidence Gathering

**Before examining code, establish ground truth about the bug's manifestation.**

The inability to reproduce locally is itself diagnostic information. This strongly suggests the bug involves **environment-specific factors** rather than pure logic errors. Start by collecting comprehensive production data[1

In [15]:
judge = f"""You are judging a competition between {len(competitors)} competitors.
Each model has been given this question:

{question}

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}}

Here are the responses from each competitor:

{together}

Now respond with the JSON with the ranked order of the competitors, nothing else. Do not include markdown formatting or code blocks."""
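
Notice the doubled braces around the JSON format in the prompt above. Inside an f-string, single braces mark a placeholder, so literal braces have to be written as `{{` and `}}`. A tiny illustration, with a made-up `count` variable:

```python
count = 2
template = f"""Competitors: {count}
Respond with JSON in this format:
{{"results": ["1", "2"]}}"""

print(template)
# Competitors: 2
# Respond with JSON in this format:
# {"results": ["1", "2"]}
```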


In [16]:
print(judge)

You are judging a competition between 2 competitors.
Each model has been given this question:

Here's a challenging question for evaluating LLM intelligence:

"You are given a repository containing 50,000 lines of code across 200 files, where a critical function has been refactored three times over the past year. The function's behavior changed subtly after the second refactoring, introducing a bug that only manifests when specific edge-case inputs are combined with particular environmental conditions. A junior developer reports that the application crashes intermittently in production but cannot reproduce it locally. Using only the information that the bug occurs 'sometimes' and 'under certain conditions,' outline a systematic debugging strategy that accounts for the code's evolution, considers how environment-specific factors might interact with the refactored logic, and identifies which specific code sections would be most valuable to examine first. Explain your reasoning at each st

In [17]:
judge_messages = [{"role": "user", "content": judge}]

In [20]:
# Judgement time!

gemini = OpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
model_name = "gemini-2.5-pro"

response = gemini.chat.completions.create(model=model_name, messages=judge_messages)

results = response.choices[0].message.content
print(results)

```json
{
  "results": [
    "1",
    "2"
  ]
}
```


In [31]:
# First, let's clean the results and handle potential formatting issues

import re

# Print the raw results to debug
print("Raw results:")
print(repr(results))
print("\n" + "="*50 + "\n")

# Clean the results - remove markdown code blocks and extra whitespace
cleaned_results = results.strip()

# Remove markdown code blocks if present
if cleaned_results.startswith('```'):
    # Find the actual JSON content between code blocks
    json_match = re.search(r'```(?:json)?\s*({.*?})\s*```', cleaned_results, re.DOTALL)
    if json_match:
        cleaned_results = json_match.group(1)
    else:
        # If no proper code block, try to extract JSON-like content
        cleaned_results = re.sub(r'^```[^\n]*\n|```$', '', cleaned_results, flags=re.MULTILINE)

# Try to extract JSON if there's extra text
json_match = re.search(r'{.*}', cleaned_results, re.DOTALL)
if json_match:
    cleaned_results = json_match.group(0)

print("Cleaned results:")
print(repr(cleaned_results))
print("\n" + "="*50 + "\n")

try:
    results_dict = json.loads(cleaned_results)
    ranks = results_dict["results"]
    
    print("Successfully parsed JSON!")
    for index, result in enumerate(ranks):
        competitor = competitors[int(result)-1]
        print(f"Rank {index+1}: {competitor}")
        
except json.JSONDecodeError as e:
    print(f"JSON parsing failed: {e}")
    print("\nTrying to manually parse the response...")
    
    # Fallback: try to extract numbers from the response
    numbers = re.findall(r'\b\d+\b', results)
    if numbers:
        print("Found these numbers in the response:", numbers)
        for index, num in enumerate(numbers[:len(competitors)]):
            try:
                competitor_idx = int(num) - 1
                if 0 <= competitor_idx < len(competitors):
                    competitor = competitors[competitor_idx]
                    print(f"Rank {index+1}: {competitor}")
            except (ValueError, IndexError):
                continue
    else:
        print("Could not extract ranking information from the response.")
        print("Please check the raw results above and manually interpret them.")

Raw results:
'```json\n{\n  "results": [\n    "1",\n    "2"\n  ]\n}\n```'


Cleaned results:
'{\n  "results": [\n    "1",\n    "2"\n  ]\n}'


Successfully parsed JSON!
Rank 1: sonar-pro
Rank 2: gemini-2.5-flash
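
The cleaning steps in the cell above can be condensed into a small reusable function. This is a sketch rather than a replacement: the name `parse_judge_json` is hypothetical, and it assumes the reply contains exactly one JSON object, possibly wrapped in a markdown fence or surrounded by extra text.

```python
import json
import re

def parse_judge_json(raw):
    """Parse the judge's reply, tolerating markdown fences and surrounding text."""
    # Greedy match from the first "{" to the last "}" across newlines
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge response")
    return json.loads(match.group(0))

fence = "`" * 3  # build the markdown fence without typing it literally
fenced_reply = f'{fence}json\n{{\n  "results": [\n    "1",\n    "2"\n  ]\n}}\n{fence}'

print(parse_judge_json(fenced_reply)["results"])  # prints ['1', '2']
```

Since `json.JSONDecodeError` is a subclass of `ValueError`, a caller can catch `ValueError` to cover both the "no JSON found" and the "invalid JSON" cases.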


In [None]:
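# OK let's turn this into results!
# The judge may wrap its JSON in a markdown code fence (as it did above), so keep only the outermost {...}

json_text = results[results.find("{"):results.rfind("}") + 1]
results_dict = json.loads(json_text)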
ranks = results_dict["results"]
for index, result in enumerate(ranks):
    competitor = competitors[int(result)-1]
    print(f"Rank {index+1}: {competitor}")
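
One thing to watch out for: `competitors[int(result)-1]` trusts the judge completely. A rank of "0" becomes index -1 and silently picks the last competitor, and a repeated number goes unnoticed. A possible guard is sketched below; `validate_ranks` and `example_competitors` are names I made up for illustration.

```python
def validate_ranks(ranks, n_competitors):
    """Return zero-based indices if ranks is a permutation of 1..n_competitors, else raise ValueError."""
    indices = [int(r) - 1 for r in ranks]
    if sorted(indices) != list(range(n_competitors)):
        raise ValueError(f"expected each of 1..{n_competitors} exactly once, got {ranks}")
    return indices

example_competitors = ["sonar-pro", "gemini-2.5-flash"]
example_ranks = ["1", "2"]

for position, idx in enumerate(validate_ranks(example_ranks, len(example_competitors)), start=1):
    print(f"Rank {position}: {example_competitors[idx]}")
```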

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/exercise.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Exercise</h2>
            <span style="color:#ff7800;">Which pattern(s) did this use? Try updating this to add another Agentic design pattern.
            </span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/business.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#00bfff;">Commercial implications</h2>
            <span style="color:#00bfff;">Patterns like this one, sending a task to multiple models and then evaluating the results,
            are common when you need to improve the quality of your LLM responses. The approach applies broadly
            to business projects where accuracy is critical.
            </span>
        </td>
    </tr>
</table>