# CAH 30503 — Week 7: Responsible AI Hardening

**Theme**: From "it works" to "it's safe to use — and I can explain why."

---

Four activities today. First: **systematically catalog everything that could go wrong** with your app: technical failures, social failures, domain failures. Second: **ask whether this app should exist** using five hard questions. Third: **build safeguards** (input checks, output constraints, honest documentation). Fourth: **deploy v2**.

This is the hardest session of the course. Not technically — you already know how to build and deploy. Hard because we’re asking questions that challenge what you’ve built. But you’re asking these questions AFTER building, not before — which means you can answer with evidence, not speculation.

---

## Your Starting Point

Before we begin, confirm what you’re working with:

- **My app name**: 

- **My public URL**: 

- **Is it loading right now?** *(Visit the URL — free-tier Spaces sleep after inactivity.)*

Pull up your Week 6 CLAUDE.md entry. You’ll need:

- **My #1 priority fix from Week 6**: 

- **My accountability statement**: When my ___ produces ___ in ___, ___ is most responsible because ___.

- **My domain stakes**: Probability of error [L/M/H], Severity [L/M/H], Supervision level: ___

---

## Activity 1: Failure Mode Catalog

Your user test found some problems. But what COULD go wrong that nobody has found yet?

Every AI system can fail in three ways:

| Category | What It Means | Examples |
|----------|--------------|----------|
| **Technical** | The system breaks or gives wrong answers | Wrong output, unexpected input, crashes, too slow |
| **Social** | The system works but hurts people | Bias, privacy violation, misleading output, accessibility gaps |
| **Domain** | The system works technically but the field rejects it | Inaccurate for the domain, doesn’t fit the workflow, erodes trust |

Technical failures are the easiest to find. Social failures are harder — they’re about who gets helped and who gets hurt. Domain failures are the hardest — they require understanding the field.

### Technical Failures

For each type, think about YOUR app specifically.

**Wrong output** — When could your model produce an incorrect result?
- Specific example: 
- How likely is this? (Low / Medium / High): 

**Unexpected input** — What kind of input was your model NOT trained on?
- Specific example: 
- What happens if someone submits this? 

**Performance degradation** — Does your app work better on some inputs than others?
- Works well on: 
- Works poorly on: 

**Cascading errors** — If one part of your pipeline gets it wrong, does the error get worse downstream?
- Example: 

**Resource failures** — Could your app be too slow, crash, or run out of memory?
- Most likely resource issue: 

### Social Failures

Your app works. But who does it work FOR — and who does it work AGAINST?

**Bias** — Does your model treat everyone fairly?
- Who might be underrepresented in the training data? 
- Specific example of unfair treatment: 

**Privacy** — Could your app expose information that should stay private?
- What goes into your app? Could it contain sensitive information? 
- What does your app output? Could the output reveal something private? 

**Manipulation** — Could your output mislead users, even unintentionally?
- Example of misleading output: 

**Displacement** — Is your app replacing human judgment where it shouldn’t be?
- What human judgment does it replace? 
- Is that replacement appropriate? 

**Accessibility** — Who can’t use your app?
- Users who would be excluded: 

### Domain Failures

Your app works technically — but would an expert in the field trust it?

**Domain inaccuracy** — What would a practitioner reject?
- Example of output an expert wouldn’t accept: 

**Workflow disruption** — Does your app fit how people actually work?
- How people currently do this task: 
- Where my app fits (or doesn’t): 

**Trust violation** — Could your app erode trust in AI for this domain?
- If this app gave wrong output to [user], they might stop trusting: 

**Accountability gap** — Is anyone positioned to catch the error?
- Who reviews the output before someone acts on it? 
- If nobody: that’s a gap.

**Stakes mismatch** — Does the supervision match the consequences?
- Consequences of wrong output: 
- Current supervision level: 

### Failure Assessment Grid

Pick your **top 5 failure modes** from all three categories. Rate each one.

**Priority guide:**
- **P1** (must address): High severity regardless of probability, OR high probability with medium+ severity
- **P2** (should address): Medium severity with medium probability, OR high probability with low severity
- **P3** (monitor): Low severity with low probability

| # | Failure Mode | Category | Probability | Severity | Priority |
|---|---|---|---|---|---|
| 1 | | Tech / Social / Domain | Low / Med / High | Low / Med / High | P1 / P2 / P3 |
| 2 | | | | | |
| 3 | | | | | |
| 4 | | | | | |
| 5 | | | | | |
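
The priority guide above can be sketched as a small lookup, which also makes it easy to count your top 5 by category. This is only an illustration: the catalog entries are invented, and the guide does not say where Low probability / Med severity or Med probability / Low severity belong, so the sketch assumes they fall to P3.

```python
from collections import Counter

LEVELS = ("Low", "Med", "High")


def priority(probability, severity):
    """Map a (probability, severity) rating to P1/P2/P3 per the priority guide."""
    if probability not in LEVELS or severity not in LEVELS:
        raise ValueError(f"Ratings must be one of {LEVELS}")
    # P1: high severity at any probability, or high probability with medium+ severity
    if severity == "High" or (probability == "High" and severity == "Med"):
        return "P1"
    # P2: medium severity with medium probability, or high probability with low severity
    if (probability, severity) in {("Med", "Med"), ("High", "Low")}:
        return "P2"
    # P3: low severity with low probability.
    # ASSUMPTION: the two combinations the guide does not list
    # (Low/Med and Med/Low) also land here.
    return "P3"


# Hypothetical entries: (failure mode, category, probability, severity)
catalog = [
    ("Sarcasm misread as positive", "Technical", "High", "Med"),
    ("Non-English text scored anyway", "Social", "Med", "Med"),
    ("Output acted on without review", "Domain", "Low", "High"),
    ("Slow first load after sleep", "Technical", "High", "Low"),
    ("Typos lower confidence slightly", "Technical", "Low", "Low"),
]

for name, category, prob, sev in catalog:
    print(f"{priority(prob, sev)}  [{category}] {name}")

# Distribution across the three categories
by_category = Counter(category for _, category, _, _ in catalog)
print(dict(by_category))
```

If your own ratings land in one of the unlisted combinations, decide deliberately whether P2 or P3 fits your domain stakes rather than accepting the default in this sketch.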

**How many of your top 5 are technical vs. social vs. domain?** 

**What does that distribution tell you?** 

You don’t have to fix everything. But you have to NAME everything. Naming a P3 failure and deciding to monitor it is responsible. Never identifying a failure is not.

---

## Activity 2: Should We Build This?

Everything in this course has been about building. Now we ask a question the course has never asked directly: **should this app exist?**

Not "can you build it" — you can. Not "will it work" — you’ve tested it. The question is: **should it exist?**

Five questions. Answer them in order, using your failure catalog as evidence.

### Question 1: What problem am I solving?

Not "what can my AI do?" but "what problem does a real person have?"

**Red flags**: "I’m building this because AI can do it" (technology push). "I’m building this because it’s cool" (impressive ≠ responsible).

**Green flags**: "A [specific person] spends [time] doing [task] and it’s [painful/slow/error-prone]."

- **The problem I’m solving**: 

- **Who has this problem**: 

- **Why it matters (without mentioning AI)**: 

- **Red flag check**: Is this technology push or problem pull? 

### Question 2: What happens when it’s wrong?

Use your Failure Mode Catalog as evidence.

| Severity Level | What It Means |
|---|---|
| Annoyance | User has to redo something |
| Wasted time | User loses hours of work |
| Financial harm | User loses money |
| Reputation harm | Someone’s reputation is damaged |
| Physical harm | Someone’s health or safety is affected |
| Systemic harm | A pattern of errors affects an entire community |

- **Most likely error** (from your failure catalog): 
  - Severity level: 
  - How often: 

- **Most severe error** (from your failure catalog): 
  - Severity level: 
  - How often: 

### Question 3: Could a simpler approach work?

**The Simpler Alternative Test**: Describe what your app does WITHOUT using the words "AI," "model," "neural network," or "machine learning."

**My app’s function in plain language**: 



Now — could you build THAT with simpler tools?

| Simpler Alternative | Could it work? | What would you lose? |
|---|---|---|
| A checklist | | |
| A lookup table | | |
| A set of if/then rules | | |
| A template with variables | | |
| A human doing it manually | | |

**What specifically requires AI that none of these can handle?**



AI adds value when the task requires pattern recognition across variable inputs. If the task is predictable, structured, or rule-based, simpler tools are usually more reliable, more transparent, and easier to audit.
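
One way to answer the "set of if/then rules" row with evidence is to write the rules and score them on a few hand-labeled examples. The sketch below assumes a sentiment task like the demos later in this notebook; the keyword lists, the function name `rule_based_sentiment`, and the examples are all invented for illustration.

```python
# Hypothetical keyword lists for a rule-based baseline
POSITIVE_WORDS = {"love", "great", "excellent", "good"}
NEGATIVE_WORDS = {"hate", "terrible", "awful", "bad"}


def rule_based_sentiment(text):
    """Label text by counting keyword hits. No model involved."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    positive_hits = len(words & POSITIVE_WORDS)
    negative_hits = len(words & NEGATIVE_WORDS)
    if positive_hits > negative_hits:
        return "POSITIVE"
    if negative_hits > positive_hits:
        return "NEGATIVE"
    return "UNSURE"  # the rules say nothing, and admit it


# Hypothetical hand-labeled examples: (text, correct label)
labeled_examples = [
    ("I love this product!", "POSITIVE"),
    ("Terrible battery, awful screen.", "NEGATIVE"),
    ("Oh great, it broke again.", "NEGATIVE"),  # sarcasm
    ("Works as described.", "POSITIVE"),        # no keyword present
]

correct = sum(rule_based_sentiment(text) == label for text, label in labeled_examples)
print(f"Rule baseline: {correct}/{len(labeled_examples)} correct")
```

Running your model on the same examples gives you a concrete answer to "What would you lose?". If the rules match the model on the inputs your users actually send, that is evidence for Option B.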

### Question 4: Who bears the risk?

The person who builds the system rarely bears the full risk of its failures.

- **Who uses the output?** 

- **Who makes decisions based on it?** 

- **Who is affected by those decisions?** 

- **Did those people choose to use AI?** 

- **Power check**: Are the people bearing the most risk the ones with the least power? 

### Question 5: Can I explain this to a skeptic?

Not to a friendly audience. To someone who asks hard questions.

- "Why did you use AI instead of [alternative]?" 

- "What could go wrong?" 

- "Who is responsible when it fails?" 

- "How do you know it’s not biased?" 

### Your Verdict

Based on your answers to all 5 questions:

**Option A: Build with safeguards** — Real problem. AI adds genuine value. Risks are manageable.
- Safeguards needed: 

**Option B: Simplify** — Core function needs AI, but some components should use simpler tools.
- What to simplify: 

**Option C: Don’t build (or significantly reconceive)** — Risks outweigh benefits.
- Why: 

---

**I choose Option**: A / B / C

**My reasoning** (reference your answers above): 



**What I would tell the skeptic**: 



---

*Share your verdict with a partner. They play the skeptic for 2 minutes: "Convince me this should exist."*

**My partner’s toughest question**: 

**My response**: 

---

## Activity 3: Build Safeguards

Three types of protection. You need at least one of each.

### Input Safeguards — Catch problems BEFORE the model runs

| Safeguard | What It Catches | Implementation |
|---|---|---|
| Empty input check | User submits nothing | `if not text.strip(): return "Please provide some text."` |
| Length limit | Input exceeds model capacity | `if len(text) > 5000: text = text[:5000]` + warning |
| Format validation | Wrong input type | Check type/format before model call |
| Scope check | Input outside system’s domain | Keyword detection or pattern matching |
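
The last two rows of the table above give no code. A minimal sketch of both follows, assuming a product-review app. The keyword set, the function names `check_format` and `check_scope`, and the messages are hypothetical placeholders to replace with your own domain's.

```python
import re

# Hypothetical domain vocabulary for a product-review app
IN_SCOPE_KEYWORDS = {"product", "order", "delivery", "quality", "price", "refund"}


def check_format(value):
    """Reject input that is not text, or that contains no letters at all."""
    if not isinstance(value, str):
        return False, "Please enter text."
    if not any(ch.isalpha() for ch in value):
        return False, "Input contains no words. Please enter a sentence."
    return True, "Format OK"


def check_scope(text, keywords=IN_SCOPE_KEYWORDS):
    """Keyword detection: accept only text that mentions the app's domain."""
    tokens = set(re.findall(r"\w+", text.lower()))
    if tokens & keywords:
        return True, "In scope"
    return False, "This tool is built for product reviews. Results on other text are unreliable."


for sample in [12345, "12345", "The delivery was late", "What is the capital of France?"]:
    ok, message = check_format(sample)
    if ok:
        ok, message = check_scope(sample)
    print(f"{sample!r:<40} ok={ok}  {message}")
```

Keyword detection is blunt: it rejects in-scope text that happens to avoid your keywords and accepts off-topic text that happens to use one. Both belong under "What it does NOT catch" when you document this safeguard.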

### Output Safeguards — Limit harm AFTER the model runs

| Safeguard | What It Catches | Implementation |
|---|---|---|
| Confidence threshold | Low-confidence predictions | `if score < 0.7: flag as uncertain` |
| Output disclaimer | All outputs | Append "AI-generated — verify before acting" |
| Range checking | Out-of-bounds values | `if result > MAX: flag` |
| Length constraint | Unexpectedly long output | Truncate and note truncation |
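
Range checking and length constraints from the table above might look like the sketch below. The limit `MAX_OUTPUT_CHARS`, the function names, and the wording of the messages are assumptions for illustration, not values from this course.

```python
MAX_OUTPUT_CHARS = 500  # hypothetical limit; pick one that fits your interface


def flag_out_of_range(value, low, high):
    """Return a warning string if value is outside [low, high], else None."""
    # Written as "not (low <= value <= high)" so that NaN is flagged too.
    if not (low <= value <= high):
        return (f"Value {value} is outside the expected range {low} to {high}. "
                "Treat this result as unreliable.")
    return None


def constrain_length(text, max_chars=MAX_OUTPUT_CHARS):
    """Truncate overly long output and say so, instead of cutting silently."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + f"\n[Output truncated at {max_chars} characters.]"


print(flag_out_of_range(0.5, 0.0, 1.0))
print(flag_out_of_range(1.7, 0.0, 1.0))
print(constrain_length("word " * 200, max_chars=40))
```

Both functions change what the user sees: an out-of-range value arrives with a warning attached, and truncated output is labeled as truncated.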

### Documentation Safeguards — Tell users the truth

This is your **Capability Statement** — what your app does well, what it gets wrong, and what it shouldn’t be used for.

---

**Safeguard theater** is protection that LOOKS good but doesn’t change what happens when the system fails. A disclaimer nobody reads is theater. Input validation that catches the most common failure-causing inputs is real.

### Setup

```python
!pip install -q transformers torch gradio

import torch
import gradio as gr
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
print("Setup complete.")
```

### Build Your Input Safeguard

Which P1/P2 failure mode does this address? 

What should happen when bad input arrives? 

```python
# DEMO: Input validation function
# This catches common failure-causing inputs BEFORE the model runs.

def validate_input(text):
    """Check input before sending to the model."""
    # Guard 1: Empty input
    if not text or not text.strip():
        return False, "Please provide some text to analyze."

    # Guard 2: Too short for reliable analysis
    if len(text.split()) < 3:
        return False, "Input is too short for reliable analysis. Please provide at least a sentence."

    # Guard 3: Too long. Character count is only a rough proxy here:
    # model limits are counted in tokens, not characters.
    if len(text) > 5000:
        return False, f"Input is too long ({len(text)} characters). Please keep it under 5,000."

    return True, "Input valid"


# Test it
test_inputs = ["", "  ", "hi", "This is a normal input for analysis.", "x " * 3000]
for inp in test_inputs:
    valid, message = validate_input(inp)
    display = repr(inp[:40])
    print(f"Input: {display:<45} Valid: {valid}  Message: {message}")
```

```python
# YOUR INPUT SAFEGUARD
# Adapt the pattern above for YOUR app and YOUR P1/P2 findings.
#
# def validate_input(text):
#     # What checks does YOUR app need?
#     # - Empty input?
#     # - Too short / too long?
#     # - Wrong language?
#     # - Out of scope for your domain?
#     pass
#
# Test with inputs that SHOULD be caught and inputs that SHOULD pass:

print("Uncomment and adapt the code above for YOUR app.")
```

**My input safeguard**:
- What it checks for: 
- Which failure mode it addresses: 
- What happens when it catches bad input: 
- What it does NOT catch (honest limitation): 

### Build Your Output Safeguard

Which P1/P2 failure mode does this address? 

What should the user see when the model is uncertain? 

```python
# DEMO: Output safeguard function
# This checks the model's output BEFORE showing it to the user.

def safeguard_output(result, confidence_threshold=0.7):
    """Apply safeguards to model output before returning to user."""
    label = result['label']
    score = result['score']

    # Safeguard 1: Low confidence warning
    if score < confidence_threshold:
        disclaimer = f"\u26a0\ufe0f Low confidence ({score:.0%}). This result may not be reliable."
    else:
        disclaimer = f"Confidence: {score:.0%}"

    # Safeguard 2: General disclosure
    disclosure = "Note: This is AI-generated analysis. Verify important results independently."

    return f"{label} \u2014 {disclaimer}\n{disclosure}"


# Test it with a real model
classifier = pipeline("sentiment-analysis", device=device)

test_texts = ["I love this product!", "meh", "asdfghjkl", "The weather is okay I guess."]
for text in test_texts:
    raw = classifier(text)[0]
    safe = safeguard_output(raw)
    print(f"Input: {text}")
    print(f"  Raw: {raw['label']} ({raw['score']:.3f})")
    print(f"  Safe: {safe}")
    print()
```

```python
# YOUR OUTPUT SAFEGUARD
# Adapt the pattern above for YOUR app and YOUR P1/P2 findings.
#
# def safeguard_output(result):
#     # What checks does YOUR output need?
#     # - Confidence threshold?
#     # - Range checking?
#     # - Disclaimer for certain output types?
#     # - Truncation for unexpectedly long output?
#     pass
#
# Test with outputs that SHOULD trigger safeguards and outputs that SHOULD pass:

print("Uncomment and adapt the code above for YOUR app.")
```

**My output safeguard**:
- What it checks for: 
- Which failure mode it addresses: 
- What the user sees when it triggers: 
- What it does NOT catch (honest limitation): 

### Write Your Capability Statement

This isn’t a disclaimer. It’s a feature. Users who know what to expect trust the product more than users who were promised magic.

Draw on: your Week 6 user testing findings, your failure catalog from Activity 1, your Should-We-Build-This analysis from Activity 2.

---

**What This Product Does**:

*(Clear, specific description — not marketing language)*



**What It Does Well**:

*(Specific strengths, with examples. Not "analyzes text" but "identifies sentiment with high accuracy for formal English product reviews.")*



**What It Sometimes Gets Wrong**:

*(Specific failure modes users should know about. Not "may produce errors" but "struggles with sarcasm and non-English text.")*



**What It Should NOT Be Used For**:

*(Clear boundaries. Not "use responsibly" but "do not use for medical, legal, or financial decisions.")*



### Safeguard Check

Before deploying, check your safeguards against the "theater" test:

- [ ] My input safeguard **changes what happens** when bad input arrives (not just logs it)
- [ ] My output safeguard **changes what the user sees** when the model is uncertain (not just adds fine print)
- [ ] My Capability Statement is **specific enough** that a user could decide whether to trust the output
- [ ] If a failure still gets through my safeguards, **the harm is reduced** compared to no safeguards
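
One way to put a number on the output-safeguard checkbox is to take predictions whose correct answer you already know and count how many of the wrong ones the confidence threshold actually flags. The records below are invented for illustration, and `threshold_report` is a hypothetical helper, not part of any library.

```python
# Hypothetical hand-checked predictions: (model_label, true_label, score)
checked = [
    ("POSITIVE", "POSITIVE", 0.99),
    ("NEGATIVE", "NEGATIVE", 0.95),
    ("POSITIVE", "NEGATIVE", 0.62),  # wrong, low confidence
    ("NEGATIVE", "POSITIVE", 0.91),  # wrong, high confidence
    ("POSITIVE", "POSITIVE", 0.66),  # right, low confidence
]


def threshold_report(records, threshold=0.7):
    """Count how many errors a confidence threshold flags, and how many it misses."""
    wrong = [r for r in records if r[0] != r[1]]
    right = [r for r in records if r[0] == r[1]]
    errors_flagged = sum(score < threshold for _, _, score in wrong)
    correct_flagged = sum(score < threshold for _, _, score in right)
    return {
        "errors": len(wrong),
        "errors_flagged": errors_flagged,
        "errors_missed": len(wrong) - errors_flagged,
        "correct_flagged": correct_flagged,
    }


report = threshold_report(checked)
print(report)
```

If `errors_flagged` is zero on your own examples, the threshold is theater for those failures. Any `errors_missed` are confident mistakes that pass through unmarked, which is exactly what the "does NOT catch" line of your safeguard notes should say.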

---

## Activity 4: Deploy v2

Push your safeguards and Capability Statement to your HF Space.

### What to Update

1. **`app.py`**: Add your input validation and output safeguards to your function(s)
2. **Capability Statement**: Add as `gr.Markdown()` in your Gradio interface, or update your Space description
3. **Push to HF Space**: Same process as Week 5 — upload updated files
4. **Verify**: Visit your URL. Test with empty input. Test with a normal input. Check that the Capability Statement is visible.

### Quick Integration Pattern

```python
# Add to the top of your Gradio app:
with gr.Blocks() as demo:
    gr.Markdown("# My App Title")
    gr.Markdown("""**What this does well**: [specific strengths]
    
**Known limitations**: [specific limitations]
    
**Not for**: [specific boundaries]""")
    
    # ... rest of your interface ...
```

Or use Claude Code: *"Add my Capability Statement as a visible section at the top of my Gradio app, and add input validation that returns a helpful message for empty input."*
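
A rough sketch of how the two safeguards might chain inside the function your interface calls: validate, run the model only if validation passes, then wrap the output. The two helpers are condensed copies of the Activity 3 demos so the block runs on its own. `make_analyzer` and `fake_classifier` are hypothetical names, and the fake classifier stands in for your real pipeline so nothing is downloaded.

```python
def validate_input(text):
    """Condensed input guard from Activity 3."""
    if not text or not text.strip():
        return False, "Please provide some text to analyze."
    if len(text) > 5000:
        return False, "Input is too long. Please keep it under 5,000 characters."
    return True, "Input valid"


def safeguard_output(result, confidence_threshold=0.7):
    """Condensed output guard from Activity 3."""
    label, score = result["label"], result["score"]
    if score < confidence_threshold:
        return f"{label} (low confidence, {score:.0%}). This result may not be reliable."
    return f"{label} (confidence {score:.0%})"


def make_analyzer(classifier):
    """Build the function the interface will call."""
    def analyze(text):
        valid, message = validate_input(text)
        if not valid:
            return message  # the model never runs on rejected input
        raw = classifier(text)[0]
        return safeguard_output(raw)
    return analyze


# Stand-in for a real pipeline: returns the same list-of-dicts shape.
def fake_classifier(text):
    return [{"label": "POSITIVE", "score": 0.55}]


analyze = make_analyzer(fake_classifier)
print(analyze(""))
print(analyze("The weather is okay I guess."))
```

In `app.py` you would pass your real `classifier` to `make_analyzer` and give the returned function to your Gradio component as its `fn`. Returning early on invalid input is what keeps the input safeguard from being theater: bad input changes what happens instead of just being logged.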

### v2 Verification

- **My v2 URL** *(same URL, updated app)*: 

- **Is the Capability Statement visible?** 

- **Does the input safeguard work?** *(Try empty input.)* 

- **Does the output safeguard work?** *(Try an input that should trigger low confidence.)* 

- **Partner check** — send your URL to a classmate:
  - Can they see the Capability Statement? 
  - Do the safeguards work for them? 

---

## 6-Question Examination Protocol

Apply the protocol to this week’s entire process — failure cataloging, should-we-build-this, safeguard building, and deployment.

### 1. What did I set out to do?
*(Systematically find what could go wrong, decide if my app should exist, build protections, deploy v2.)*


### 2. What did I actually find?
*(How many failure modes? What was the Should-We-Build-This verdict? What safeguards did I build?)*


### 3. Where did the process succeed?
*(Which examinations revealed something I didn’t know? Which safeguards address real failures?)*


### 4. Where did the process fall short?
*(Which failure categories were hardest? What safeguards feel like theater? What questions couldn’t I answer?)*


### 5. Why were some parts harder than others?
*(Was naming social failures harder than technical? Was the Simpler Alternative Test uncomfortable?)*


### 6. What would I do differently if starting this project today?
*(Knowing what the failure catalog and Should-We-Build-This revealed — what would you build differently from Week 4?)*



---

## DCS Question: What Does Responsible Participation Look Like for This System?

You’ve been answering DCS questions all semester:
- Week 1: What kind of system is this?
- Week 2: What cognitive work gets outsourced?
- Week 3: What knowledge is encoded in the models?
- Week 4: Who directs this system?
- Week 5: What connects this system to its users?
- Week 6: Where does accountability live when this system is wrong?

This week’s question is different. It’s not analytical — it’s prescriptive: **given everything you know, what does it look like to participate responsibly in this cognitive system?**

---

Responsible participation in MY system means:

- Naming ___ failure modes — including: 

- Asking whether ___ could replace ___ (from the Simpler Alternative Test): 

- Building ___ (safeguard) that catches ___ (failure mode) before users see it: 

- Telling users honestly that ___ (from Capability Statement): 

- Deciding that ___ supervision level is appropriate because ___ (from Week 6): 

---

Responsible participation is NOT just having a disclaimer. It’s: 



**Connect to a previous DCS answer**: How does this week’s answer build on what you wrote in Week 5 or Week 6?



---

## Record: CLAUDE.md Week 7 Entry

Add this to your CLAUDE.md file:

```
## Week 7: Responsible AI Hardening

### Failure Mode Catalog
- Technical failures: [count and top examples]
- Social failures: [count and top examples]
- Domain failures: [count and top examples]

### Top Priority Failures
- P1 (must address): [list]
- P2 (should address): [list]
- P3 (monitor): [list]

### Should-We-Build-This Assessment
- Q1 (Problem): [real problem? who has it?]
- Q2 (Wrong): [worst consequence — severity level]
- Q3 (Simpler): [what could be simpler? what genuinely needs AI?]
- Q4 (Risk): [who bears it? fair?]
- Q5 (Skeptic): [can I defend this?]
- Verdict: Build with safeguards / Simplify / Don't build
- Reasoning: [key evidence]

### Safeguards Implemented
- Input safeguard: [what it checks, what it catches]
- Output safeguard: [what it flags or constrains]
- Capability Statement: [summary — does well / gets wrong / not for]

### v2 Deployment
- URL: [same or updated URL]
- What changed from v1: [safeguards added, Capability Statement visible]
- Safeguards tested: [what I verified]

### DCS: What Does Responsible Participation Look Like?
[Specific actions taken — failure cataloging, simpler alternative test,
 safeguard building, honest capability communication.
 What does responsible participation mean for THIS system?]
```

---

## What’s Next

You can defend what you built. You have a failure catalog, a Should-We-Build-This assessment, safeguards, and a Capability Statement.

**Next week is the last session**: you ship and reflect. You’ll polish your app, write a Product Brief, present to the class, and reflect on what the whole course taught you about building AI systems for real people. Bring everything — your CLAUDE.md, your deployed URL, your Capability Statement.