#### **TOPIC 1 - Why Errors Happen in GenAI Systems**

#### **🚨 Error Handling & Reliability in GenAI Systems**

#### **Why This Topic Matters**
A GenAI system that works only when everything is perfect
is **not a production system**.

Real-world GenAI systems must assume:
- failures will happen
- APIs will fail
- networks will break
- inputs will be invalid

This notebook focuses on **reliability thinking**:
the difference between demos and real systems.


#### **Why Errors Are Common in GenAI Systems**

GenAI systems depend on multiple external components and constraints:
- LLM APIs
- networks
- rate limits
- user inputs
- token constraints

Each dependency introduces failure points.


#### **Common Sources of Errors**

#### **1️⃣ Network Failures**
- Internet instability
- DNS issues
- Timeout errors

These are outside your control.

---

#### **2️⃣ API-Level Errors**
- Invalid API keys
- Rate limits exceeded
- Server-side failures (5xx)

These are common in real usage.

---

#### **3️⃣ Token & Context Errors**
- Prompt too long
- Chat history exceeds limits
- Unexpected truncation

Mistakes in managing conversation memory (chat history) cause these.

---

#### **4️⃣ Invalid Model Outputs**
- Broken JSON
- Hallucinated fields
- Missing data

Even with strict prompts, this can happen.

---

#### **5️⃣ User Input Errors**
- Empty input
- Malicious input
- Unexpected formats

LLMs do not sanitize inputs automatically.

#### **Dangerous Beginner Assumption ❌**

> "The model usually works fine."

This mindset causes:
- crashes
- silent failures
- broken automation
- security risks

Production systems assume **failure by default**.

#### **Senior Engineer Mindset (Very Important)**

A professional GenAI engineer designs systems that:
- fail gracefully
- recover automatically
- protect downstream systems
- never trust external input

Error handling is not optional.
It is **part of system design**.

#### **Today's Goal**

By the end of today, I will:
- Understand why GenAI systems fail
- Identify common failure types
- Design basic retry and guardrail strategies
- Build production-ready reliability thinking

**✅ Topic 1 Final Summary (Rule Applied)**

- Errors are inevitable in GenAI systems
- Failures come from APIs, networks, tokens, and inputs
- Reliability must be designed, not assumed
- This mindset separates demos from production systems

#### **TOPIC 2 - Common LLM Error Types (Awareness Before Code)**

Before handling errors, you must recognize them.
This section catalogs the **most common error types** you will face
when building GenAI systems.

The goal is to:
- identify errors quickly
- know their cause
- choose the right response strategy

#### **🔴 1️⃣ Authentication & Authorization Errors (401 / 403)**

#### What they look like:
- Invalid API key
- Missing credentials
- Access denied

#### Why they happen:
- Wrong key
- Expired key
- Wrong environment file

#### Engineer response:
❌ Retry blindly: no  
✅ Fail fast and alert on the configuration issue
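
A minimal sketch of "fail fast" for credentials, assuming the key is read from an environment variable. The variable name `LLM_API_KEY` and the `ConfigurationError` class are hypothetical names chosen for illustration, not part of any SDK.

```python
import os


class ConfigurationError(RuntimeError):
    """Raised when required configuration is missing; never retried."""


def load_api_key(env_var: str = "LLM_API_KEY") -> str:
    """Return the API key from the environment, or fail fast.

    `LLM_API_KEY` is a hypothetical variable name used for illustration.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail immediately with an actionable message instead of retrying.
        raise ConfigurationError(
            f"Environment variable {env_var} is missing or empty; "
            "check the environment file before starting the app."
        )
    return key
```

Calling `load_api_key()` once at startup surfaces a wrong environment file before the first request, rather than as a 401 in the middle of a user session.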

#### **🟠 2️⃣ Rate Limit Errors (429)**

#### **What they look like:**
- "Too many requests"
- "Rate limit exceeded"

#### **Why they happen:**
- High request volume
- Bursts of traffic
- Shared API quotas

#### **Engineer response:**
❌ Crash  
❌ Retry immediately in a tight loop  
✅ Retry with delay (backoff)
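
One common way to implement "retry with delay" is exponential backoff with jitter. The sketch below only computes the delay; the `base` and `cap` values are illustrative assumptions, not provider recommendations.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Delay in seconds before retry number `attempt` (1-based).

    The ceiling grows exponentially (base, 2*base, 4*base, ...) up to `cap`.
    "Full jitter" picks a random value between 0 and that ceiling, so that
    many clients do not all retry at the same instant.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return random.uniform(0, ceiling)
```

A retry loop would call `time.sleep(backoff_delay(attempt))` between attempts instead of sleeping for a fixed number of seconds.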

#### **🟡 3️⃣ Network & Timeout Errors**

#### **What they look like:**
- Request timeout
- Connection reset
- DNS failure

#### **Why they happen:**
- Internet instability
- Temporary service disruption

#### **Engineer response:**
✅ Retry safely  
✅ Assume transient failure

#### **🔵 4️⃣ Server-Side Errors (5xx)**

#### What they look like:
- 500 Internal Server Error
- 502 Bad Gateway
- 503 Service Unavailable

#### Why they happen:
- LLM provider issues
- Maintenance windows
- Overloaded servers

#### Engineer response:

✅ Retry with limits

✅ Use fallback if available

#### **🟣 5️⃣ Token Limit & Context Errors**

#### What they look like:
- Context length exceeded
- Token limit reached

#### Why they happen:
- Chat history too long
- Large prompts
- Missing truncation logic

#### Engineer response:
❌ Retry blindly  
✅ Trim history or summarize
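
A rough sketch of "trim history": keep the first system message plus the newest messages that fit a token budget. The `estimate_tokens` heuristic (about 4 characters per token) is an assumption for illustration; a real system would use the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token); an assumption."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the first system message (if any) plus the newest messages that fit."""
    system = [m for m in messages if m["role"] == "system"][:1]
    rest = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    # Walk from newest to oldest, keeping messages while the budget allows.
    for message in reversed(rest):
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    # Restore chronological order before returning.
    return system + list(reversed(kept))
```

Dropping the oldest turns is the simplest policy; summarizing them instead keeps more context at the price of an extra model call.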

#### **⚫ 6️⃣ Invalid Model Output Errors**

#### What they look like:
- Broken JSON
- Missing fields
- Unexpected format

#### Why they happen:
- Model hallucination
- Weak prompt constraints

#### Engineer response:
❌ Trust output  
✅ Validate and retry or fallback
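
A small sketch of "validate before trusting", using only the standard library. The function name `parse_model_json` and the idea of passing a set of required field names are illustrative choices, not a fixed recipe.

```python
import json


def parse_model_json(raw: str, required_fields: set[str]) -> dict:
    """Parse model output as a JSON object and check required fields.

    Raises ValueError with a descriptive message on any problem, so the
    caller can decide whether to retry or fall back.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")

    missing = required_fields - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data
```

Because every failure becomes a single `ValueError`, the calling code needs only one branch to choose between "ask the model again" and "use the fallback".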


#### **🟤 7️⃣ User Input Errors**

#### What they look like:
- Empty input
- Unsupported language
- Prompt injection attempts

#### Why they happen:
- No input validation
- No guardrails

#### Engineer response:

✅ Validate input

✅ Reject or sanitize

#### **🧠 Senior Engineer Insight**

Not all errors are equal.

Some require:
- immediate failure (auth)
- retries (network)
- design changes (token limits)
- validation (output)

Treating all errors the same is a beginner mistake.
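
The mapping from error type to response can be written down as a small lookup. This is a sketch under assumptions: it keys on the HTTP status code only, and it treats every other 4xx as "fix the request", which fits context-length errors only if the provider reports them that way.

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    FAIL_FAST = "fail fast and alert"
    RETRY_WITH_BACKOFF = "retry with backoff"
    FIX_REQUEST = "change the request (trim, validate)"


def choose_strategy(status_code: Optional[int]) -> Strategy:
    """Map an error status (None = no response, e.g. timeout) to a strategy."""
    if status_code is None:
        return Strategy.RETRY_WITH_BACKOFF  # network error or timeout
    if status_code < 400:
        raise ValueError("not an error status")
    if status_code in (401, 403):
        return Strategy.FAIL_FAST  # credentials or permissions
    if status_code == 429 or 500 <= status_code <= 599:
        return Strategy.RETRY_WITH_BACKOFF  # rate limit or server-side
    return Strategy.FIX_REQUEST  # other 4xx: the request itself is wrong
```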

**✅ Topic 2 Final Summary (Rule Applied)**

- Learned all major LLM error categories
- Understood root causes
- Mapped each error to the correct response strategy
- Built mental model for production debugging

#### **🎯 Topic 3 Goal: Basic try/except Handling**

Any GenAI system that can crash is unsafe.

The goal of this topic is to:
- prevent application crashes
- handle failures gracefully
- prepare for retries and guardrails

`try/except` is the **first line of defense** in production systems.

In [10]:
# ============================================================
# 📘 SECTION 1: Import Required Libraries
# ------------------------------------------------------------
# Why?
#   - We need no new libraries here
#   - We focus purely on error handling logic
# Note:
#   - `client` is assumed to have been created in an earlier cell
#     (an SDK client exposing chat.completions.create)
#   - If it is missing, the call below raises NameError, which the
#     except block also catches (see the recorded output)
# ============================================================


# ============================================================
# 📘 SECTION 2: Define a Simple Prompt
# ------------------------------------------------------------
# Why?
#   - Keep prompt simple to focus on error handling
#   - Complexity comes later
# ============================================================

prompt = "Explain Python tuples in one sentence."

# ============================================================
# 📘 SECTION 3: Wrap LLM Call in try/except
# ------------------------------------------------------------
# Why try/except?
#   - LLM calls depend on external systems (API, network)
#   - Any failure should be caught safely
#   - Prevents application crashes
# ============================================================

try:
    # Attempt to call the LLM API
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=100
    )

    # Extract assistant reply if call succeeds
    assistant_reply = response.choices[0].message.content

    print("✅ LLM response received successfully:")
    print(assistant_reply)

except Exception as e:
    # Catch ANY unexpected error
    # In production, this prevents crashes
    print("❌ An error occurred while calling the LLM.")
    print("Error Details", e)

❌ An error occurred while calling the LLM.
Error Details name 'client' is not defined


#### **🔍 Why We Catch `Exception` Here**

At this stage:
- We are learning the concept
- We want to prevent crashes
- We don't yet differentiate error types

Later, we will:
- Catch specific exceptions
- Add retries
- Add fallback logic

This step builds the foundation.
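
As a preview of catching specific exceptions, here is a sketch that uses Python's built-in exception types as stand-ins. Real SDKs expose their own exception classes, so `PermissionError`, `TimeoutError` and `ConnectionError` below are assumptions for illustration, and `call_llm` is a hypothetical zero-argument callable wrapping the API call.

```python
from typing import Callable


def safe_call(call_llm: Callable[[], str]) -> str:
    """Run `call_llm` and translate failures into a user-safe message."""
    try:
        return call_llm()
    except PermissionError:
        # Stand-in for an authentication failure: not retryable.
        return "Configuration problem: check API credentials."
    except (TimeoutError, ConnectionError):
        # Stand-in for transient network failures: a retry may help.
        return "Temporary network problem: please try again."
    except Exception as exc:
        # Last resort: unknown failure, reported without crashing.
        return f"Unexpected error: {type(exc).__name__}"
```

The order matters: specific handlers come first and the broad `except Exception` stays last, so it only sees errors nobody planned for.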

**✅ Topic 3 Final Summary (Rule Applied)**

- Introduced try/except around LLM calls
- Prevented crashes from external failures
- Built first reliability layer
- Prepared ground for retries and guardrails

#### **🎯 TOPIC 4 - Retry Logic (Core Production Skill)**

Many LLM failures are **temporary**:
- network hiccups
- rate limits
- transient server errors

Retry logic allows systems to:
- recover automatically
- avoid crashing
- improve reliability

But retries must be **controlled** to avoid infinite loops.


In [12]:
# ============================================================
# 📘 SECTION 1: Import time Library
# ------------------------------------------------------------
# Why?
#   - To add delay between retries
#   - Prevents hammering the API
# ============================================================

import time

# ============================================================
# 📘 SECTION 2: Define Retry Configuration
# ------------------------------------------------------------
# Why define these values?
#   - max_retries prevents infinite loops
#   - delay_seconds gives time for transient issues to recover
# ============================================================

max_retries = 3
delay_seconds = 2

# ============================================================
# 📘 SECTION 3: Define Prompt
# ------------------------------------------------------------
# Why simple prompt?
#   - Focus on retry behavior, not output complexity
# ============================================================

prompt = "Explain Python sets in one sentence."

# ============================================================
# 📘 SECTION 4: Retry Loop with try/except
# ------------------------------------------------------------
# Why a loop?
#   - We want multiple attempts if something fails
# Why break?
#   - Stop retrying once a call succeeds
# Note:
#   - `client` is assumed to have been created in an earlier cell
# ============================================================

# Stays None if every attempt fails, so later code can check for it
assistant_reply = None

for attempt in range(1, max_retries + 1):
    try:
        print(f"🔄 Attempt {attempt}...")

        response = client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_tokens=100
        )
        # If we reach here, the call succeeded
        assistant_reply = response.choices[0].message.content
        print("✅ LLM response received:")
        print(assistant_reply)

        # Exit loop after success
        break
    except Exception as e:
        # Catch-all: this retries EVERY error, including non-transient
        # ones such as the NameError in the recorded output below
        print("❌ Error occurred:", e)

        # If this was the last attempt, do not retry
        if attempt == max_retries:
            print("🚨 Max retries reached. Failing gracefully.")
        else:
            print(f"⏳ Retrying in {delay_seconds} seconds...")
            time.sleep(delay_seconds)

🔄 Attempt 1...
❌ Error occurred: name 'client' is not defined
⏳ Retrying in 2 seconds...
🔄 Attempt 2...
❌ Error occurred: name 'client' is not defined
⏳ Retrying in 2 seconds...
🔄 Attempt 3...
❌ Error occurred: name 'client' is not defined
🚨 Max retries reached. Failing gracefully.


**What NOT to Do (Very Important)**

#### 🚫 Common Retry Mistakes

❌ Infinite retries  
❌ No delay between retries  
❌ Retrying authentication errors  
❌ Retrying invalid inputs  

These mistakes cause:
- API bans
- runaway costs
- stuck systems


#### **🧠 Retry Design Rule**

Retry ONLY when:
- failure is likely temporary
- retry may succeed later

Do NOT retry when:
- configuration is wrong
- input is invalid
- permissions are missing
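
The rule above can be sketched as a retry helper that only catches errors considered temporary. The built-in `TimeoutError` and `ConnectionError` stand in for an SDK's transient exception classes (an assumption), and the injectable `sleep` parameter is a convenience added here so the helper can be tested without waiting.

```python
import time
from typing import Callable, Tuple, Type

# Assumption: these built-in exceptions stand in for an SDK's transient errors.
TRANSIENT_ERRORS: Tuple[Type[Exception], ...] = (TimeoutError, ConnectionError)


def call_with_retry(
    fn: Callable[[], str],
    max_retries: int = 3,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fn`, retrying only transient errors, for at most `max_retries` attempts."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except TRANSIENT_ERRORS:
            if attempt == max_retries:
                raise  # give up: let the caller apply a fallback
            sleep(delay_seconds)
    # Only reachable when the loop never ran.
    raise ValueError("max_retries must be at least 1")
```

Anything not listed in `TRANSIENT_ERRORS` (a wrong key, invalid input, a `NameError`) propagates on the first attempt instead of being retried.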

**✅ Topic 4 Final Summary (Rule Applied)**

- Implemented controlled retry logic
- Prevented infinite loops
- Learned when retries make sense
- Built production-safe resilience

#### **🎯 Topic 5: Guardrails Thinking (Limit Damage, Protect Systems)**

Guardrails are **safety boundaries** that protect your system
when things go wrong, even if errors are handled.

Think of guardrails as:
- limits
- checks
- constraints
- fallbacks

They prevent failures from turning into disasters.

#### **Why Guardrails Are Required in GenAI**

Even with:
- try/except
- retries
- validation

LLMs can still:
- hallucinate
- produce unsafe content
- exceed cost budgets
- behave unexpectedly

Guardrails **limit blast radius**.


#### **Common Guardrail Categories**

#### **🧱 1️⃣ Input Guardrails**

Purpose:
- Prevent bad inputs from entering the system

Examples:
- Empty prompts
- Extremely long inputs
- Prompt injection attempts

Why important:
- LLMs do not validate inputs automatically
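
A minimal input guardrail might look like the sketch below. The length limit and the phrase list are illustrative assumptions; in particular, matching a few phrases is a naive check and not a complete defense against prompt injection.

```python
MAX_INPUT_CHARS = 4000  # illustrative limit, tune per application

# Naive, illustrative phrases; real injection detection needs more than this.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "reveal your system prompt",
)


def check_user_input(text: str) -> str:
    """Return the cleaned input, or raise ValueError explaining the rejection."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("input is empty")
    if len(cleaned) > MAX_INPUT_CHARS:
        raise ValueError(f"input exceeds {MAX_INPUT_CHARS} characters")
    lowered = cleaned.lower()
    if any(phrase in lowered for phrase in SUSPICIOUS_PHRASES):
        raise ValueError("input looks like a prompt injection attempt")
    return cleaned
```

Running this check before the LLM call means rejected inputs cost no tokens and never reach the model.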


#### **🧱 2️⃣ Output Guardrails**

Purpose:
- Control what the model is allowed to return

Examples:
- Enforce JSON only
- Block sensitive content
- Limit verbosity

Why important:
- Output is often consumed by machines
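
As a sketch of "block sensitive content" and "limit verbosity", the function below redacts email-like strings and caps the reply length. Both the pattern and the 500-character cap are assumptions for illustration; what counts as sensitive depends on the application.

```python
import re

MAX_OUTPUT_CHARS = 500  # illustrative cap
# Simple email-like pattern; intentionally loose, for illustration only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def apply_output_guardrails(text: str) -> str:
    """Redact email-like strings, then cap the length of the reply."""
    redacted = EMAIL_PATTERN.sub("[redacted]", text)
    if len(redacted) > MAX_OUTPUT_CHARS:
        redacted = redacted[:MAX_OUTPUT_CHARS].rstrip() + "..."
    return redacted
```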

#### **🧱 3️⃣ Cost Guardrails**

Purpose:
- Prevent runaway token usage and cost explosions

Examples:
- max_tokens limits
- request caps per user
- rate limiting

Why important:
- LLMs can silently burn money
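
A request cap per user can be sketched with two counters. `RequestBudget` is a hypothetical helper that lives in memory in a single process; a real deployment would likely need shared storage and a reset schedule.

```python
from collections import defaultdict


class RequestBudget:
    """Per-user cap on requests and tokens (in-memory, single process)."""

    def __init__(self, max_requests: int, max_tokens: int) -> None:
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self._requests = defaultdict(int)
        self._tokens = defaultdict(int)

    def allow(self, user_id: str, requested_tokens: int) -> bool:
        """Record the request and return True only if both caps still hold."""
        if self._requests[user_id] + 1 > self.max_requests:
            return False
        if self._tokens[user_id] + requested_tokens > self.max_tokens:
            return False
        self._requests[user_id] += 1
        self._tokens[user_id] += requested_tokens
        return True
```

The caller checks `budget.allow(user_id, max_tokens)` before each LLM call and returns a friendly "limit reached" message when it is `False`.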

#### **🧱 4️⃣ Retry Guardrails**

Purpose:
- Prevent infinite retry loops

Examples:
- max retry count
- retry only for specific errors

Why important:
- Uncontrolled retries can amount to a self-inflicted DDoS


#### **🧱 5️⃣ Fallback Guardrails**

Purpose:
- Provide safe behavior when everything fails

Examples:
- Cached responses
- Friendly error messages
- Human escalation

Why important:
- Users should never see raw failures
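
A fallback chain can be sketched as "live model, then cache, then friendly message". Here `call_llm` is a hypothetical callable wrapping the API call and `cache` is a plain dictionary; both are assumptions for illustration.

```python
from typing import Callable, Dict

FRIENDLY_MESSAGE = (
    "Sorry, the assistant is unavailable right now. Please try again later."
)


def answer_with_fallback(
    prompt: str,
    call_llm: Callable[[str], str],
    cache: Dict[str, str],
) -> str:
    """Try the live model, then a cached answer, then a friendly message."""
    try:
        reply = call_llm(prompt)
        cache[prompt] = reply  # remember good answers for later outages
        return reply
    except Exception:
        # Never surface the raw failure to the user.
        return cache.get(prompt, FRIENDLY_MESSAGE)
```

In a real system the `except` branch would also log the exception or alert a human, so the failure stays visible to engineers even though users never see it.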


#### **Beginner vs Senior Thinking**

❌ Beginner:
"Let's retry and hope it works."

✅ Senior:
"What happens if retries keep failing?"

Guardrails answer that question.

#### **Interview-Level Insight**

Question:
"How do you make GenAI systems safe?"

Strong answer:
"By combining error handling, retries, validation, and guardrails
to control failure impact."

Weak answer:
"By fixing bugs."


#### **Topic 5 Summary**

- Errors will happen
- Retries help recovery
- Guardrails limit damage
- Safety is a design responsibility


**✅ Topic 5 Final Summary**

- Learned what guardrails are
- Understood different guardrail categories
- Built senior-level safety mindset
- Prepared for real production systems

#### **Topic 5 done, ready for Topic 6 (Micro practice + mock test)**

**🧩 Part A: Micro Practice (Mental + Light Code Review)**

For each scenario below, answer:
1) What type of error is this?
2) Should we retry?
3) What guardrail would you apply?

**Scenario 1**

API key is missing or invalid.

Think before reading the answer.

<details> <summary>Answer</summary>

Error type: Authentication (401/403)

Retry? ❌ No

Guardrail: Fail fast, configuration check, alert

</details>

**Scenario 2**

API returns "Rate limit exceeded".

<details> <summary>Answer</summary>

Error type: Rate limit (429)

Retry? ✅ Yes

Guardrail: Retry with delay + max retry limit

</details>

**Scenario 3**

Network timeout during LLM call.

<details> <summary>Answer</summary>

Error type: Network / timeout

Retry? ✅ Yes

Guardrail: Retry + backoff

</details>

**Scenario 4**

LLM returns broken JSON.

<details> <summary>Answer</summary>

Error type: Invalid model output

Retry? ⚠️ Maybe (after fixing prompt)

Guardrail: Output validation + fallback

</details>

**Scenario 5**

User submits extremely long input.

<details> <summary>Answer</summary>

Error type: Input validation

Retry? ❌ No

Guardrail: Input length limit

</details>

#### **📝 Part B: Mini Mock Test (Interview-Style)**

#### **📝 Mini Mock Test: Error Handling & Reliability**

---
##### **Q1️⃣ Why is retrying authentication errors dangerous?**
##### **A1:** Because retries cannot fix a misconfiguration; they only repeat the same failure.
---

##### **Q2️⃣ What is the difference between error handling and guardrails?**
##### **A2:** Error handling catches failures; guardrails limit damage even when failures occur.
---

##### **Q3️⃣ Name three situations where retries should NOT be used.**
##### **A3:** Invalid API key, invalid user input, deterministic validation error.
---

##### **Q4️⃣ How do guardrails reduce blast radius in GenAI systems?**
##### **A4:** They constrain behavior, limit cost, control retries, and prevent unsafe outputs.
---
##### **Q5️⃣ In one sentence:** What makes a GenAI system "production-ready"?
##### **A5:** A production-ready GenAI system anticipates failures and handles them safely.

#### **📌 Summary**

Today I learned:
- Why GenAI systems fail
- How to recognize common LLM error types
- How to use try/except safely
- How to design retry logic
- How guardrails protect systems

Key shift:
From "hoping it works" → "designing for failure".