## Welcome to the Second Lab - Week 1, Day 3

Today we will call lots of different models through their APIs. It's a great way to get comfortable with how each API works.

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Important point - please read</h2>
            <span style="color:#ff7800;">The way I collaborate with you may be different to other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you carefully execute this yourself, <b>after</b> watching the lecture. Add print statements to understand what's going on, and then come up with your own variations.<br/><br/>If you have time, I'd love it if you submit a PR for changes in the community_contributions folder - instructions in the resources. Also, if you have a GitHub account, use it to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...
            </span>
        </td>
    </tr>
</table>

In [1]:
# Start with imports - ask ChatGPT to explain any package that you don't know

import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic
from IPython.display import Markdown, display

In [2]:
# Always remember to do this!
load_dotenv(override=True)

True

In [3]:
# Print the key prefixes to help with any debugging

openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY')
deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')
groq_api_key = os.getenv('GROQ_API_KEY')

if openai_api_key:
    print(f"OpenAI API Key exists and begins {openai_api_key[:8]}")
else:
    print("OpenAI API Key not set")
    
if anthropic_api_key:
    print(f"Anthropic API Key exists and begins {anthropic_api_key[:7]}")
else:
    print("Anthropic API Key not set (and this is optional)")

if google_api_key:
    print(f"Google API Key exists and begins {google_api_key[:2]}")
else:
    print("Google API Key not set (and this is optional)")

if deepseek_api_key:
    print(f"DeepSeek API Key exists and begins {deepseek_api_key[:3]}")
else:
    print("DeepSeek API Key not set (and this is optional)")

if groq_api_key:
    print(f"Groq API Key exists and begins {groq_api_key[:4]}")
else:
    print("Groq API Key not set (and this is optional)")

OpenAI API Key exists and begins sk-proj-
Anthropic API Key exists and begins sk-ant-
Google API Key exists and begins AI
DeepSeek API Key exists and begins sk-
Groq API Key exists and begins gsk_


In [4]:
request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. "
request += "Answer only with the question, no explanation."
messages = [{"role": "user", "content": request}]

In [5]:
messages

[{'role': 'user',
  'content': 'Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. Answer only with the question, no explanation.'}]

In [6]:
# Ask a small model to write the question that we'll put to every competitor
openai = OpenAI()
response = openai.chat.completions.create(
    model="gpt-5-mini",
    messages=messages,
)
question = response.choices[0].message.content
print(question)




In [7]:
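# Parallel lists: competitors[i] is the model that produced answers[i]
competitors = []
answers = []
# Reuse the generated question as the prompt for every model
messages = [{"role": "user", "content": question}]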

## Note - update since the videos

I've updated the model names below to use the latest models, like GPT-5 and Claude Sonnet 4.5. It's worth noting that these models can be quite slow, taking 1-2 minutes to respond, but they do a great job! Feel free to switch them for faster models if you'd prefer, like the ones I use in the video.
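
If you'd like to see how long each model actually takes before deciding whether to switch, here is a minimal sketch of a timing wrapper. `timed` and `fake_model_call` are hypothetical names I've made up for illustration; the fake call just sleeps briefly so the snippet runs without any API keys.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn with the given arguments and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for a slow API call, so this runs without any keys
def fake_model_call(prompt):
    time.sleep(0.05)
    return f"echo: {prompt}"

result, seconds = timed(fake_model_call, "hello")
print(f"{seconds:.2f}s -> {result}")
```

With a real client you could wrap the call in the same way, for example `timed(openai.chat.completions.create, model=model_name, messages=messages)`.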

In [8]:
# The API we know well
# I've updated this with the latest model, but it can take some time because it likes to think!
# Replace the model with gpt-4.1-mini if you'd prefer not to wait 1-2 mins

model_name = "gpt-5-nano"

response = openai.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

Below is a structured safety and governance plan you can adapt for each system, followed by audit allocation, and a plan for independent fairness/generalization validation and public communication.

System A. Autonomous last-mile delivery drones operating at night

1) The single most important policy/technical control before deployment
- Policy/Control: Data minimization and on-device privacy-preserving processing with strict automatic data purge (no retention of raw video; only non-identifying operational data; mandatory 24-hour automatic purge of any captured imagery; all processing and storage occur on the drone unless necessary for delivery).
- Why this control: Night operations plus drones create substantial privacy and civil-liberties risk if video or identifiable data is stored or transmitted. By keeping data on-device, deleting imagery rapidly, and avoiding cloud storage of video, you materially reduce the risk of privacy harm, data breaches, and public backlash while meeting the 30-day retention law in spirit. It also reduces the blast radius of any potential surveillance overreach.

2) One measurable metric to monitor ongoing harms or performance
- Metric: Privacy Incident Rate (PIR)
  - Calculation: PIR = (N_privacy_incidents_last_30_days) / (N_flights_last_30_days) × 1000
  - What counts as an incident: any captured imagery or raw video retained beyond the 24-hour purge window, or any post-flight data leakage, or any non-mission-identifiable data breach involving drone footage.
  - Suspension threshold: any privacy incident in the last 30 days triggers suspension and a full root-cause review before operations resume. (Optionally, if you prefer a staged approach, specify “zero tolerance for 30-day rolling window; a single incident suspends operations until remediation completes.”)

3) One realistic failure mode, early-warning signs, and rollback/mitigation plan
- Failure mode: Loss of reliable localization/navigation (GNSS jamming/spoofing or sensor fusion failure) causing drift, off-nominal routes, or unsafe landings.
- Early warning signs:
  - Increasing localization drift or inconsistent fusion confidence (lower state estimation covariance margins widen).
  - Repeated waypoint deviations beyond a small tolerance (e.g., > ±5–10 meters) over several minutes.
  - Unexplained altitude or airspeed anomalies reported by the autopilot.
  - Recurrent low-battery or power-cycle incidents in the same area, or failure to return-to-home during test segments.
- Rollback/mitigation plan:
  - Immediately switch to safe-mode: hover if stable, then execute controlled airdrop of payload only if in a safe zone; otherwise trigger return-to-home and land at the nearest safe site.
  - Pause night flights in affected corridors until root-cause analysis (logs, sensor checks, geomapping) and mitigations are implemented (geofence tightening, sensor recalibration, or flight-plan re-optimization).
  - Initiate a human-in-the-loop review for all flights in affected zones; re-run simulations with updated maps, sensor fusion tuning, and redundant localization (visual odometry and radar/ultrasound where available).
  - Post-incident: publish a concise incident report to the operations board and public safety/privacy officer; update risk register and update flight procedures.

4) Audit resource allocation (two independent audit teams in year one)
- Proposed allocation: Team 1 → System B (facial recognition) and Team 2 → System C (resume-screener). System A will be monitored through internal safety/compliance audits and cross-checked by the city’s privacy officer, with findings reported to the two teams as needed.
- Rationale: B and C present the highest near-term risk of systemic harm to individuals and civil rights (false positives in policing-related identification; hiring discrimination). A’s risk is primarily operational safety and privacy, which can be mitigated through the on-device/privacy-preserving design and ongoing internal safety audits. Given budget limits, concentrating independent audits on the two systems with the largest potential for irreversible harm and public trust damage provides the highest risk reduction per dollar and aligns with your 30-day data-retention constraint and consent/rights concerns.

5) Independent validation of fairness and generalization (tests, datasets, statistics) and transparent risk communication
- Validation approach:
  - Fairness and generalization tests (applied to both A and B as applicable, with emphasis on B for fairness concerns):
    - Datasets and benchmarks:
      - System B: Use diverse, consent-based or synthetic datasets with demographic annotations (e.g., protected attributes such as age, gender, skin tone, disability status) and varied environmental conditions. If real-world readouts are restricted, use synthetic faces or de-identified volunteers with explicit consent and robust privacy safeguards.
      - System C: Use de-identified resumes or simulated applicant data that include protected attributes only in a controlled, synthetic form, ensuring the model cannot rely on them in the decision process (or uses them only in aggregate fairness assessments).
    - Metrics and thresholds:
      - System B: Maintain privacy-preserving metrics alongside accuracy metrics.
        - Precision (positive predictive value) and recall (sensitivity) for “wanted-person” flags, computed from human-verified outcomes.
        - False-positive rate (per 1,000 flagged events) and its trend over 30-day windows.
        - Fairness checks: disparate impact ratio (positive flag rate by protected groups). Acceptance criteria: disparity ratio within [0.8, 1.25] for at least three consecutive 30-day windows.
        - Statistical criteria: 95% confidence intervals via bootstrap over flags; require no statistically significant deterioration (p > 0.05) in fairness metrics across time.
      - System C: Evaluate hiring fairness and generalization.
        - Selection rate by protected groups; disparity ratio; equal opportunity difference (acceptance to interview rate by group conditioned on qualified candidates).
        - Predictive parity: compare model-scored ranks to actual hiring outcomes, controlling for job-relevant qualifications.
        - Statistical criteria: where confidence intervals for disparity ratios include the neutral value (e.g., RR in [0.8, 1.25]) and we fail to reject equality of opportunity at p < 0.05, the model passes; otherwise retrain.
  - Generalization tests:
    - Hold-out test sets that reflect future deployment conditions (different neighborhoods for drones; different transit hubs and times for facial recognition; different municipal hiring pools).
    - Temporal splits to simulate data shift (seasonal effects, policy changes).
    - Robustness checks: stress tests for night vs. day, weather variations, camera angles, and data-entry noise.
  - Datasets and ethics:
    - Use synthetic or consent-based data to avoid exposing sensitive real individuals.
    - Use third-party audits and reviewers to assess model logic, data provenance, and potential proxies for protected attributes.
  - Public communication on trade-offs and risks:
    - Publish an annual fairness and generalization report with:
      - Model descriptions, data provenance (or synthetic data usage), and the exact metrics monitored.
      - Results of the latest audits, including confidence intervals and any detected biases.
      - Known limitations (e.g., “despite bias mitigation, some demographic groups may experience higher false-positive/false-negative rates in certain conditions”).
      - Clear explanation of trade-offs between privacy, accuracy, and public safety, plus the steps being taken to mitigate risks.
    - Create a public-facing risk dashboard with plain-language summaries, glossary of terms, and red/green indicators for each metric.
    - Establish a public feedback channel and an independent ombudsperson to review citizen concerns and investigate specific incidents.

Appendix: concise risk-based rationale for prioritization in year one
- Highest-risk domains: B (facial recognition) due to potential harms from misidentification, civil rights impact, and legal exposure; C (resume-screener) due to hiring discrimination and fairness risks that affect thousands of applicants and public trust; A (nightly drone delivery) is important for service but primarily safety/operational risk that can be managed with privacy-preserving design and internal safety audits.
- With only two independent audit teams, focusing on B and C maximizes risk reduction for civil rights, privacy, and governance while still enabling A to meet safety and privacy requirements through internal processes and the city’s privacy officer oversight. This aligns with the city’s public trust levers and 30-day data-retention constraint.

Implementation notes and practical steps
- Cross-cutting controls to reinforce all three systems:
  - Data minimization: explicit data retention schedules, deletion policies, and encryption at rest/in transit.
  - Human-in-the-loop: for any actionable outcome (e.g., flagging a wanted person; hire decisions), ensure a human verifies before any action is taken.
  - Explainability: demand interpretable outputs and justifications, especially for C and B.
  - Audit trails: immutable logs of decisions, data used, and outcomes; provide access for third-party audits.
  - Public engagement: early and ongoing communications about goals, limits, and trade-offs; offer public feedback channels and frequent updates on audits and mitigation progress.

If you’d like, I can turn this into a compact policy brief or a slide-ready deck with concrete templates for the metrics dashboards, incident taxonomy, and the 30-day audit cadence.
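
The Privacy Incident Rate in the answer above is simple arithmetic, so it's easy to check by hand. Here is a minimal sketch, assuming you already have the two 30-day counts; the function names and the numbers are made up for illustration and are not part of the model's answer.

```python
def privacy_incident_rate(incidents, flights):
    """Privacy incidents per 1,000 flights over the same 30-day window."""
    if flights <= 0:
        raise ValueError("flights must be positive")
    return 1000 * incidents / flights

def should_suspend(incidents):
    """Zero-tolerance rule from the answer: any incident suspends operations."""
    return incidents > 0

# Made-up counts for illustration only
pir = privacy_incident_rate(incidents=2, flights=4000)
print(f"PIR = {pir:.2f} per 1,000 flights; suspend = {should_suspend(2)}")
```

Note that under the zero-tolerance threshold the rate itself never drives the suspension decision; it is only useful for tracking the trend over time.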

In [12]:
# Anthropic has a slightly different API, and max_tokens is required
# Note: the answer is cut off once it reaches max_tokens, so raise it for longer answers

model_name = "claude-sonnet-4-5"

claude = Anthropic()
response = claude.messages.create(model=model_name, messages=messages, max_tokens=1000)
answer = response.content[0].text

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

# AI Safety Framework for Municipal Deployment

## System A: Autonomous Last-Mile Delivery Drones (Night Operations)

### (1) Most Important Control
**Technical Control: Mandatory geofencing with real-time airspace deconfliction and emergency ground-stop capability.**

*Justification:* Night operations compound visual detection failures and create acute public safety risks (falling packages, collisions with people/property, interference with emergency vehicles). A hard technical constraint preventing operation outside approved corridors, with immediate override capability, bounds catastrophic outcomes better than procedural controls. This addresses the asymmetry where drone benefits are primarily commercial while physical risks are socialized to residents.

### (2) Measurable Monitoring Metric
**Metric: Near-Miss Rate (NMR)**

*Computation:*
```
NMR = (Number of unplanned proximity events <5m from uninvolved persons or structures) / (Total flight hours) × 100
```

Data sources: Onboard LIDAR/sensor logs automatically flagging obstacle avoidance activations; citizen complaint hotline tagged reports; post-flight path analysis against municipal GIS data.

**Suspension Threshold:** NMR >0.5 per 100 flight hours in any rolling 7-day window, OR any single event involving injury/property damage >$1,000.

*Rationale:* Leading indicator of collision risk; computationally verifiable from system logs; low enough baseline that trends emerge before injuries occur.

### (3) Realistic Failure Mode

**Failure:** GPS spoofing or electromagnetic interference causes drones to violate geofences, especially near hospital helipads or emergency zones.

**Early Warning Signs:**
- Increased deviation from filed flight paths (>10m horizontal drift)
- Clustering of anomalies near specific geographic locations
- Emergency services reporting unexplained airspace conflicts
- Sensor discrepancies between GPS, inertial navigation, and visual odometry

**Rollback/Mitigation Plan:**
- **Immediate (within 1 hour):** Ground all drones in affected zones; switch to limp-home mode using inertial navigation + visual landing
- **Short-term (24-48 hours):** Deploy RF spectrum analysis at incident locations; activate backup barometric/visual navigation; reduce operational radius by 50%
- **Medium-term (1-2 weeks):** Install multi-constellation GNSS (GPS+Galileo+GLONASS) with spoofing detection; implement cryptographic navigation signal authentication; establish electromagnetic "safe mode" protocols

**Concrete trigger:** Any geofence violation >20m sustained for >10 seconds triggers automatic zone grounding.

---

## System B: Real-Time Facial Recognition in Transit Hubs

### (1) Most Important Control
**Policy Control: Human-in-the-loop confirmation requirement with independent review before law enforcement contact.**

*Justification:* Facial recognition systems exhibit documented disparities in false positive rates across demographic groups (especially for darker-skinned women, transgender individuals). Given low public trust and the severe consequences of false arrest, an automated alert must not directly trigger police response. A trained human reviewer using multiple identifying factors (gait, clothing, timeline consistency) reduces both bias amplification and unlawful detention risk. This is feasible in transit contexts where minutes-scale delay doesn't defeat the purpose (vs. border instant-decision scenarios).

### (2) Measurable Monitoring Metric
**Metric: Demographic Parity in False Positive Rate (DP-FPR)**

*Computation:*
```
For each demographic group g (defined by intersection of gender presentation × skin tone category):
FPR_g = (False alerts for group g) / (Total persons from group g passing through monitored areas)

DP-FPR = max(FPR_g) / min(FPR_g) across all groups
```

**Data Collection:**
- False positives identified through human review rejections + post-contact exonerations
- Demographic distribution estimated through quarterly manual annotation of random 5,000-frame samples by independent auditors trained on Fitzpatrick scale + gender presentation spectra
- Base rates from transit ridership surveys

**Suspension Threshold:** DP-FPR >2.0 in any quarterly measurement, OR absolute FPR >0.01 for any demographic group.

*Rationale:* Directly measures discriminatory impact rather than proxy metrics;
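
Claude's answer above stops mid-sentence, most likely because it ran into the `max_tokens=1000` limit. The Anthropic response reports why generation stopped in its `stop_reason` field, so you can check for this. Here is a minimal sketch; `was_truncated` is a hypothetical helper name, and the stand-in objects are only there so the snippet runs without an API key.

```python
from types import SimpleNamespace

def was_truncated(response):
    """True if the reply stopped because it hit the max_tokens limit."""
    return response.stop_reason == "max_tokens"

# Stand-in objects carrying the same attribute as an Anthropic response
cut_off = SimpleNamespace(stop_reason="max_tokens")
finished = SimpleNamespace(stop_reason="end_turn")

print(was_truncated(cut_off))
print(was_truncated(finished))
```

In the cell above you could call `was_truncated(response)` right after `claude.messages.create(...)` and rerun with a larger `max_tokens` if it returns `True`.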

In [13]:
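# Gemini offers an OpenAI-compatible endpoint, so we can reuse the OpenAI client with a different base_url and key
gemini = OpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")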
model_name = "gemini-2.5-flash"

response = gemini.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

As an AI Safety Officer advising your city, here are my recommendations for the deployment of your three AI systems, focusing on critical controls, measurable metrics, failure modes, audit prioritization, and transparency.

---

### **System A: Autonomous Last-Mile Delivery Drones (Night Operations)**

1.  **Most Important Policy/Technical Control:**
    *   **Policy/Technical Control:** Implement a **mandatory Human-in-the-Loop (HITL) Remote Oversight system** where trained human operators can monitor multiple drone flights simultaneously and are empowered with immediate, real-time override capabilities for anomaly resolution, emergency landing, or route correction.
    *   **Explanation:** This control is paramount because drones pose direct physical safety risks (collisions, dropped packages) and potential privacy intrusions if they deviate from intended paths. Full autonomy, especially in urban environments, is inherently risky. HITL oversight provides a crucial layer of human judgment and intervention for unforeseen circumstances, system failures, or malicious interference, significantly reducing the likelihood and severity of physical harm and building critical public trust by demonstrating accountability.

2.  **Measurable Metric for Harms/Performance:**
    *   **Metric:** **Unplanned Incident Rate per Flight Hour.**
    *   **Computation:** (Number of incidents requiring human intervention, emergency landing, property damage, or injury) / Total flight hours. An "incident" is defined as any deviation from planned, safe operation, including near-misses reported by the system, unscheduled landings, or route anomalies requiring human correction.
    *   **Threshold for Suspension:**
        *   Exceeding **0.05 incidents per 100 flight hours** (i.e., 1 incident per 2,000 flight hours) over any rolling 30-day period will trigger an immediate suspension of *new drone deployments* and a mandatory system-wide review.
        *   If the rate reaches **0.1 incidents per 100 flight hours** (1 incident per 1,000 flight hours), *all drone operations* will be suspended immediately until a root cause analysis, a comprehensive mitigation plan, and independent verification of improvements are completed and approved.

3.  **Realistic Failure Mode, Early Warning Signs, and Rollback/Mitigation Plan:**
    *   **Failure Mode:** **Persistent GPS spoofing or signal interference**, leading to drones deviating from planned flight paths, entering restricted airspace, or flying over private property without authorization.
    *   **Early Warning Signs:**
        *   Frequent GPS error alerts or "drift" notifications in drone telemetry logs that are inconsistent with environmental conditions.
        *   Discrepancies between expected drone paths and actual recorded flight data (e.g., inconsistent ground speed relative to air speed, unexpected changes in direction).
        *   Increased frequency of human operator interventions to manually correct drone positioning or reroute.
        *   Public complaints or social media reports of drones flying in unusual or restricted areas.
    *   **Rollback/Mitigation Plan:**
        *   **Immediate:** All active drones will activate a **temporary "reduced autonomy" mode**, requiring continuous human remote piloting for navigation, with strict geo-fencing software limits for all flights. New flights are temporarily restricted to designated, visually observable corridors.
        *   **Mid-term:** Deploy **redundant navigation systems** (e.g., visual-inertial odometry (VIO) or RTK GPS) to cross-verify positioning and detect anomalies. Collaborate with local law enforcement and spectrum regulators to identify and eliminate sources of interference. Update drone firmware to incorporate advanced anti-spoofing algorithms and enable autonomous detection and reporting of signal anomalies.
        *   **Long-term:** Mandate secure, encrypted communication channels and consider "GPS-denied" navigation capabilities for fallback.

---

### **System B: Real-time Facial Recognition in Transit Hubs**

1.  **Most Important Policy/Technical Control:**
    *   **Policy/Technical Control:** **Mandatory human-in-the-loop verification for *all* system-generated "positive matches" against a wanted persons database *before* any intervention or action is initiated.**
    *   **Explanation:** This is critical due to the extremely high stakes of misidentification. Facial recognition, despite advances, is prone to false positives and demographic bias. Automating an "arrest/detain" decision based solely on an algorithmic match risks wrongful detention, civil rights violations, and severe public backlash, especially given low public trust. A trained human analyst must review the match, verify identity using secondary information (if available), and assess context before any action.

2.  **Measurable Metric for Harms/Performance:**
    *   **Metric:** **Rate of Confirmed False Positives Leading to Intervention.**
    *   **Computation:** (Number of instances where an individual flagged by the system was subsequently confirmed *not* to be a wanted person *after* an intervention was initiated based on the flag) / Total number of individuals flagged by the system for intervention.
    *   **Threshold for Suspension:**
        *   A confirmed false positive rate exceeding **0.1% (1 in 1,000 flags)** over any rolling 30-day period will trigger an immediate suspension of the system for review, re-calibration, or model retraining. The system will remain suspended until the rate falls below this threshold and an independent audit verifies the improvements and bias mitigation.

3.  **Realistic Failure Mode, Early Warning Signs, and Rollback/Mitigation Plan:**
    *   **Failure Mode:** **Algorithmic bias leading to a disproportionately high false positive rate for specific demographic groups** (e.g., people of color, women, elderly individuals) due to inadequacies in training data or model generalization. This could result in discriminatory targeting and over-policing of certain communities.
    *   **Early Warning Signs:**
        *   Internal audit reports or fairness analyses showing statistically significant higher false positive rates for certain demographic subgroups (e.g., based on skin tone, gender, or age) compared to others.
        *   Increased number of formal complaints from community groups or individuals reporting feeling unfairly targeted, frequently stopped, or profiled in transit hubs.
        *   Observational data indicating a demographic breakdown of flagged individuals that significantly deviates from the general transit user population in a way consistent with known biases.
        *   High variability in "match confidence scores" for individuals across different demographic groups, even for confirmed wanted persons.
    *   **Rollback/Mitigation Plan:**
        *   **Immediate:** **Deactivate the real-time flagging capability for all new targets.** The system can only be used for *post-event forensic analysis* (reviewing footage of a known suspect after an incident, within the 30-day retention limit). Mandate immediate refresher training for all human operators on bias awareness and the strict protocols for human verification.
        *   **Mid-term:** Conduct a **comprehensive bias audit** using a diverse, representative dataset (meticulously balanced across demographics, perhaps using synthetic data generation if real data is unavailable due to privacy) to pinpoint specific areas of algorithmic weakness. Retrain the underlying AI model using a more balanced and robust dataset, employing debiasing techniques (e.g., re-weighting, adversarial debiasing) and possibly using multiple models for different demographic groups.
        *   **Long-term:** Implement a continuous, automated monitoring system for demographic performance disparities. Establish an independent oversight board with strong community representation to regularly review system performance and address public concerns.

---

### **System C: AI Resume-Screener for Municipal Hiring**

1.  **Most Important Policy/Technical Control:**
    *   **Policy/Technical Control:** **Mandatory human review of *all* candidates whose AI score places them within the "borderline" range for consideration, and a systematic random sampling of a significant percentage (e.g., 20%) of all candidates automatically rejected by the AI.**
    *   **Explanation:** The primary risk of an AI resume screener is bias, leading to unfair exclusion and perpetuating existing inequalities in municipal hiring. Requiring human review for borderline cases ensures that potentially qualified candidates who might be subtly disadvantaged by the AI (e.g., due to non-traditional backgrounds or keyword mismatches) still receive a fair human assessment. Randomly reviewing rejected candidates is crucial to detect systemic bias that might be consistently filtering out qualified applicants from specific demographic groups *before* they even reach the human review stage, preventing the AI from unilaterally shaping the applicant pool in a biased way.

2.  **Measurable Metric for Harms/Performance:**
    *   **Metric:** **Disparity in Interview Progression Rates by Protected Class.**
    *   **Computation:** For each protected class (e.g., gender, race/ethnicity, age group, veteran status, disability status): Calculate the ratio of (Percentage of applicants from that class who progress to the interview stage) / (Percentage of applicants from the highest-performing demographic class who progress to the interview stage).
    *   **Threshold for Suspension:**
        *   If the ratio for any protected class falls below **0.8 (i.e., less than 80% of the progression rate of the highest-performing group)** for two consecutive hiring cycles for the *same job type or department*, the system's use for that specific job type/department will be suspended. A full audit, bias analysis, and model retraining will be required before re-deployment.

3.  **Realistic Failure Mode, Early Warning Signs, and Rollback/Mitigation Plan:**
    *   **Failure Mode:** **The AI system inadvertently learns and perpetuates historical biases present in past successful hiring data**, leading to the systematic undervaluation of candidates with non-traditional career paths, diverse educational backgrounds, or specific demographic characteristics (e.g., favoring candidates from certain elite universities or with specific keywords that correlate with privileged backgrounds).
    *   **Early Warning Signs:**
        *   Feedback from rejected candidates, advocacy groups, or internal diversity and inclusion committees highlighting perceived unfairness or a decrease in the diversity of interview shortlists, despite diverse applicant pools.
        *   Internal analyses showing a statistically significant decrease in the representation of specific protected classes in interview pools compared to pre-AI periods or the overall applicant pool.
        *   Discrepancies between AI scores and independent "gold standard" human recruiter scores on a blind test set, particularly for resumes from diverse backgrounds.
        *   Inability to clearly articulate the AI's "feature importance" or rationale for low scores, especially when human reviewers believe a candidate is qualified.
    *   **Rollback/Mitigation Plan:**
        *   **Immediate:** **Revert to a human-only resume screening process** for all new job postings until the issue is resolved. For ongoing processes, immediately increase the random sample size for human review of rejected candidates to 100% for roles where bias is suspected.
        *   **Mid-term:** Conduct a **deep dive into the AI's training data** to identify and quantify historical biases. Re-engineer features to be less sensitive to potentially biased proxies (e.g., specific university names, zip codes) and more focused on demonstrable skills, competencies, and experience. Explore advanced debiasing techniques in model training (e.g., re-sampling, adversarial debiasing, fairness constraints).
        *   **Long-term:** Implement a **"shadow mode" for future AI iterations**, where the AI processes resumes and generates scores but these scores are *not* used for hiring decisions, allowing comparison with human decisions to identify and correct biases before live deployment. Establish a continuous monitoring framework for diversity and equity metrics throughout the hiring pipeline.
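
A note from me rather than from Gemini: the interview-progression metric for System C above compares each group's rate with the best-performing group's rate and flags anything under 0.8. Here is a minimal sketch of that calculation for a single hiring cycle. The group names and counts are made up, and `progression_ratios` is a hypothetical helper name.

```python
def progression_ratios(applicants, interviewed):
    """Each group's interview rate divided by the highest group's rate."""
    rates = {g: interviewed[g] / applicants[g] for g in applicants}
    best = max(rates.values())
    return {g: rate / best for g, rate in rates.items()}

# Made-up counts for illustration only
applicants = {"group_a": 200, "group_b": 150}
interviewed = {"group_a": 50, "group_b": 24}

ratios = progression_ratios(applicants, interviewed)
# Flag any group whose ratio falls below the 0.8 threshold
flagged = [g for g, r in ratios.items() if r < 0.8]
print(ratios, flagged)
```

The threshold in the answer requires two consecutive hiring cycles below 0.8, so in practice you would run this per cycle and only act when the same group is flagged twice in a row.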

---

### **4. Audit Allocation in Year One**

Given a resource constraint of funding only two independent audit teams in year one, I would allocate them as follows:

1.  **Audit Team 1: System B (Real-time Facial Recognition in Transit Hubs)**
2.  **Audit Team 2: System A (Autonomous Last-Mile Delivery Drones)**

**Risk-Based Rationale:**

*   **System B (Facial Recognition) takes top priority due to the highest potential for severe and irreversible harm to civil liberties and individual rights.** A failure here (e.g., due to bias or false positives) could lead to wrongful detention, public profiling, severe reputational damage, legal challenges, and a permanent erosion of public trust. The immediate, high-stakes impact on individual freedom and potential for discrimination against vulnerable populations makes this the most critical system to audit first.

*   **System A (Autonomous Drones) is the second priority due to the highest potential for direct physical harm to individuals or property.** While potentially less frequent than a facial recognition error, a drone malfunction (collision, falling object) can lead to serious injury or death, property destruction, and immediate, widespread public outrage. These physical harms are irreversible. Additionally, concerns about privacy (unintentional surveillance) further elevate its risk profile.

*   **System C (AI Resume-Screener) would be deferred to year two, despite its importance.** While the resume screener poses significant risks related to bias and perpetuating inequality in hiring, the harms are primarily *discriminatory exclusion* and reputational damage. These are serious, but generally allow for more time for recourse (e.g., reapplication, human override) and are less immediate/irreversible than the civil liberty infringements of System B or the physical dangers of System A. Given the "public trust is low" context, demonstrating robust safety and civil liberty protections for the most dangerous systems is paramount in year one.

---

### **5. Fairness & Generalization Validation and Transparent Communication**

**A. Fairness & Generalization Validation**

**1. Tests & Datasets:**

*   **System A (Drones - Generalization Focus):**
    *   **Datasets:**
        *   **Environmental Diversity:** Collect flight data from various weather conditions (wind, light rain, fog within operational limits), different times of night, varying urban densities (residential, commercial, open spaces), and over diverse surface types (paved, unpaved).
        *   **Obstacle Diversity:** Introduce simulated and real-world diverse obstacles including moving vehicles, pedestrians, animals, power lines, and tree canopies.
    *   **Tests:**
        *   **Out-of-Distribution (OOD) Performance:** Evaluate navigation accuracy, obstacle avoidance, and stability under weather conditions or environmental changes beyond standard operating parameters.
        *   **Robustness Testing:** Introduce sensor noise, GPS signal degradation, or simulated partial system failures to assess resilience.
        *   **Stress Testing:** Simulate peak operational loads (multiple drones in the same airspace) or rapid environmental changes.

*   **System B (Facial Recognition - Fairness & Generalization Focus):**
    *   **Datasets:**
        *   **Fairness:** A meticulously curated, *independent* dataset of facial images (not collected from the transit hubs, to preserve privacy), balanced across demographic attributes (race/ethnicity, gender, age, skin tone using objective scales like Fitzpatrick), and relevant confounding factors (lighting, angles, expressions, occlusions like masks/glasses). Include both wanted and non-wanted persons.
        *   **Generalization:** A separate, unseen validation dataset collected from diverse real-world transit environments (e.g., different stations, times of day/night, camera models, crowd densities) to simulate operational conditions. This should reflect the full diversity of transit users.
    *   **Tests:**
        *   **Fairness:**
            *   **Disparate Impact Analysis:** Measure False Positive Rate (FPR) and False Negative Rate (FNR) across all defined demographic subgroups.
            *   **Equal Opportunity:** Ensure that True Positive Rate (recall) is approximately equal across subgroups (i.e., the system is equally likely to correctly identify a wanted person regardless of their demographic).
            *   **Predictive Parity:** Ensure that Precision is approximately equal across subgroups (i.e., when the system flags someone, the likelihood of them truly being wanted is consistent across demographics).
            *   **Intersectionality:** Analyze fairness metrics for intersecting demographic groups (e.g., young Black women, elderly Asian men) to detect compounded biases.
        *   **Generalization:**
            *   **Environmental Robustness:** Evaluate performance under varying lighting, camera angles, and crowd densities using the transit-specific validation dataset.
            *   **Occlusion Tolerance:** Test performance with partial facial occlusions common in transit (masks, scarves, glasses, hats).

*   **System C (Resume Screener - Fairness & Generalization Focus):**
    *   **Datasets:**
        *   **Fairness:** Anonymized historical municipal resumes, *significantly augmented* with synthetic resumes designed to represent underrepresented groups, non-traditional career paths, diverse educational backgrounds, and varied experience levels. This dataset must be meticulously balanced.
        *   **Generalization:** A large, independent set of *new* municipal job applications from a variety of roles and departments, including some for job types not explicitly in the training data, to test transferability and robustness to novel resume formats/experiences.
    *   **Tests:**
        *   **Fairness:**
            *   **Disparate Impact Analysis:** Measure the disparity in interview progression rates (as per the metric above) across protected classes.
            *   **Counterfactual Fairness:** Test how a resume's score changes if only protected attributes (e.g., gender pronouns, culturally specific names) are changed, holding all other qualifications constant.
            *   **Feature Importance Analysis:** Identify which resume features (e.g., university name, specific keywords) the AI weighs most heavily and assess if these are biased proxies.
        *   **Generalization:**
            *   **Domain Adaptation:** Assess performance on resumes for new job types or departments, using human "gold standard" ratings for comparison.
            *   **Robustness to Format:** Test with variations in resume formatting, use of non-standard fonts, or novel resume sections.

**2. Statistical Criteria for Suspension/Intervention:**

*   **Statistical Significance:** For all fairness metrics (e.g., FPR, FNR, progression rates), use hypothesis testing (e.g., Chi-squared tests, t-tests) to determine if observed disparities between demographic groups are statistically significant (e.g., p-value < 0.05).
*   **Effect Size:** Beyond statistical significance, also consider effect size (e.g., Cohen's d or risk ratio) to quantify the *magnitude* of any observed disparity. A statistically significant disparity with a large effect size would be a strong trigger for intervention.
*   **Pre-defined Disparity Thresholds:** Establish explicit thresholds for acceptable disparity ratios (e.g., "the ratio of FPR for any protected group must not exceed 1.2 times the lowest group's FPR for System B").
*   **Performance Degradation:** For generalization, a statistically significant drop (e.g., >5% in F1-score for B, >10% in correlation with human scores for C) on unseen or OOD data compared to established baselines will trigger review.
*   **System A (Drones):** Statistical control charts monitoring sensor errors, GPS drift, and near-miss event frequency. Any excursion beyond 3 standard deviations from the mean or sustained trends indicating instability would trigger review.
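
The significance, effect-size, and disparity-threshold criteria above can be combined into a single check. The following is a minimal sketch, assuming a two-proportion z-test (equivalent to a chi-squared test on a 2x2 table) and a risk ratio as the effect size; the counts, the function names, and the way the 0.05 and 1.2 thresholds are combined are illustrative assumptions, not part of the plan itself.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: both groups share one error rate."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

def risk_ratio(x1, n1, x2, n2):
    """Effect size: how many times higher group 1's rate is than group 2's."""
    return (x1 / n1) / (x2 / n2)

# Hypothetical counts: false positives among non-target people in two groups
fp_a, n_a = 30, 10_000
fp_b, n_b = 12, 10_000

z, p = two_proportion_z_test(fp_a, n_a, fp_b, n_b)
rr = risk_ratio(fp_a, n_a, fp_b, n_b)

# Trigger a review only if the gap is both significant and large
needs_review = p < 0.05 and rr > 1.2
print(f"z={z:.2f}, p={p:.4f}, risk ratio={rr:.2f}, needs_review={needs_review}")
```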

**B. Transparent Communication of Unavoidable Trade-offs and Remaining Risks to the Public**

Transparency is crucial for rebuilding public trust. My communication strategy would be proactive, multi-channel, and iterative:

1.  **Public AI Ethics & Safety Report:**
    *   **Content:** Publish a comprehensive, easily accessible report on the city's website outlining:
        *   **System Purpose & Benefits:** Clearly articulate why each system is deployed.
        *   **Limitations & Known Risks:** Honestly detail the inherent limitations of each AI (e.g., facial recognition's susceptibility to bias, drone navigation challenges in adverse weather).
        *   **Mitigation Strategies:** Explain the specific policy and technical controls (like HITL, data retention limits, human review) implemented to address risks.
        *   **Fairness & Generalization Results:** Summarize validation findings, including any identified disparities or areas of weaker performance.
        *   **Unavoidable Trade-offs:** Explicitly state any trade-offs made (e.g., for facial recognition, aiming for a very low false negative rate to catch more wanted persons might lead to a slightly higher false positive rate, or vice-versa; for drones, increased safety features might marginally increase operational costs).
        *   **Remaining Risks:** Acknowledge that no AI system is perfect and outline any residual risks despite mitigations.
        *   **Recourse Mechanisms:** Clearly explain how individuals can report issues, appeal decisions, or lodge complaints.
    *   **Format:** Use plain language, infographics, FAQs, and provide summaries in multiple languages.

2.  **Community Engagement Forums:**
    *   **Mechanism:** Host open town halls, online webinars, and dedicated workshops across diverse neighborhoods *before* and *after* deployment.
    *   **Goal:** Present the information from the ethics report, actively solicit questions, and listen to public concerns. Facilitate an open dialogue, explaining the "why" behind decisions and trade-offs directly. This builds trust through direct engagement and accountability.

3.  **Dedicated AI Oversight Committee (Publicly Accessible):**
    *   **Function:** Establish a multi-stakeholder committee comprising city officials, independent AI experts, civil liberties advocates, and community representatives.
    *   **Transparency:** The committee's meeting minutes, anonymized audit reports, and recommendations should be publicly available online. This demonstrates ongoing, independent oversight and a commitment to continuous improvement.

4.  **Clear On-Site & In-App Messaging:**
    *   **System A (Drones):** Publicize flight zones, operating hours, and a clear contact for noise/safety complaints.
    *   **System B (Facial Recognition):** Prominent signage in transit hubs indicating the presence and purpose of facial recognition, alongside clear information on data retention, individual rights, and the human verification step.
    *   **System C (Resume Screener):** Inform job applicants during the application process that AI is used in screening, and explain the human review and appeal processes.

5.  **Performance Dashboards (Aggregated & Anonymized):**
    *   **Content:** Publish regularly updated (e.g., quarterly) aggregated and anonymized data on key performance metrics (e.g., drone incident rates, confirmed false positive rates for facial recognition, diversity metrics for hiring, data retention compliance).
    *   **Goal:** Provide ongoing, tangible evidence of system performance and compliance with defined thresholds, demonstrating transparency and accountability beyond initial statements.

The overarching message would be that the city is committed to responsible AI deployment, acknowledging that while these technologies offer significant benefits, they also come with inherent risks. We are taking proactive steps to mitigate these risks through robust controls, continuous monitoring, and independent oversight, and we are committed to an open, iterative dialogue with our residents.

In [14]:
deepseek = OpenAI(api_key=deepseek_api_key, base_url="https://api.deepseek.com/v1")
model_name = "deepseek-chat"

response = deepseek.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

APIStatusError: Error code: 402 - {'error': {'message': 'Insufficient Balance', 'type': 'unknown_error', 'param': None, 'code': 'invalid_request_error'}}
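
The `Insufficient Balance` error stops the cell before the model is recorded, which is why this model is missing from the later `competitors` list. One way to keep a notebook like this running when a single provider fails is to wrap each call. This is a minimal sketch: `try_competitor` and its `ask` argument are hypothetical names introduced here, and the demo uses stand-in functions rather than a real API client.

```python
def try_competitor(model_name, ask, competitors, answers):
    """Call ask(); record the answer on success, report and skip on any error."""
    try:
        answer = ask()
    except Exception as exc:  # e.g. an API status error such as 402
        print(f"Skipping {model_name}: {exc}")
        return None
    competitors.append(model_name)
    answers.append(answer)
    return answer

# With a real client, `ask` could be a lambda wrapping the call, for example:
#   lambda: deepseek.chat.completions.create(
#       model=model_name, messages=messages).choices[0].message.content

# Demo with stand-in functions (no network access needed)
demo_competitors, demo_answers = [], []

def failing_call():
    raise RuntimeError("Error code: 402 - Insufficient Balance")

try_competitor("working-model", lambda: "an answer", demo_competitors, demo_answers)
try_competitor("broke-model", failing_call, demo_competitors, demo_answers)
print(demo_competitors)
```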

In [15]:
# Updated with the latest Open Source model from OpenAI

groq = OpenAI(api_key=groq_api_key, base_url="https://api.groq.com/openai/v1")
model_name = "openai/gpt-oss-120b"

response = groq.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)


Below is a **ready‑to‑present “risk‑mitigation package”** for the three AI‑enabled municipal services.  Each section follows the five‑point template you asked for, and the final two sections cover audit‑team allocation, independent fairness‑validation, and public‑communication strategy.

---

## 1. System‑by‑System Controls, Metrics, Failure‑Mode Plans  

### A. Autonomous Night‑Time Last‑Mile Delivery Drones  

| # | Element | Detail |
|---|---------|--------|
| **1️⃣ Core Policy / Technical Control** | **Geofenced “No‑Fly‑Zone” & Mandatory Redundant “Kill‑Switch”**  <br> • All flight paths must be pre‑approved by the city’s GIS layer and enforced by on‑board geofence software that cuts thrust if the drone attempts to leave the permitted airspace.<br> • A hardware‑level, tamper‑proof kill‑switch (radio‑frequency + autonomous “return‑to‑home” on loss‑of‑link) must be installed and test‑certified every 30 days. <br>**Why:**  Night operations eliminate visual oversight; a hard geofence + kill‑switch is the single most effective safeguard against *unintended incursions* (over residential areas, airports, or restricted facilities) and against software or sensor failure that could otherwise cause a crash or privacy breach. |
| **2️⃣ Ongoing‑Harms Metric** | **Mean Time Between Safety Incidents (MTBSI)**  <br> • **Definition:** \( \text{MTBSI}= \frac{\text{Total flight‑hours in the reporting period}}{\text{Number of “critical safety events”}} \) <br> • **Critical safety event** = any of: (i) hard‑landing > 2 m/s vertical speed, (ii) loss of GPS lock > 5 s, (iii) geofence breach (even if auto‑recovered). <br> • **Threshold:** MTBSI < 200 h triggers an automatic **30‑day suspension** of all autonomous flights for root‑cause analysis and software patching. (Baseline target: ≥ 500 h based on industry best‑practice.) |
| **3️⃣ Realistic Failure Mode** | **GPS Spoofing / Signal Jamming** <br> • **Early‑Warning Signs**: <br>   – Sudden increase in GPS‑position variance (> 10 m) logged by the on‑board IMU; <br>   – “Lost GPS” events > 3 s on > 5 % of flights in a week; <br>   – Unexpected “return‑to‑home” activations. <br> • **Rollback / Mitigation**: <br>   1. **Immediate**: Switch all drones to “Local‑Only” mode (dead‑reckoning + visual‑odometry) and ground any that cannot maintain safe navigation for > 30 s. <br>   2. **Short‑term**: Deploy portable GNSS‑monitoring units around the affected neighbourhoods to locate the jammer/spoofer. <br>   3. **Long‑term**: Add a multi‑constellation GNSS receiver (GPS + GLONASS + Galileo) and an RF‑interference detector that forces a safe hover + return‑to‑base if interference exceeds –85 dBm. |
| **4️⃣ Auditable Artifact** | • Flight‑log database (timestamped sensor suite, GPS/IMU data, kill‑switch activations). <br>• Geofence compliance report (daily). |
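
The MTBSI metric and its suspension rule from the table above can be sketched in a few lines. The function names and the sample figures are illustrative assumptions; only the formula and the 200-hour threshold come from the table.

```python
SUSPEND_BELOW_HOURS = 200  # suspension threshold quoted in the table

def mtbsi(total_flight_hours, critical_events):
    """Mean Time Between Safety Incidents: flight-hours per critical event."""
    if critical_events == 0:
        return float("inf")  # no incidents in the reporting period
    return total_flight_hours / critical_events

def should_suspend(total_flight_hours, critical_events):
    """True when MTBSI falls below the suspension threshold."""
    return mtbsi(total_flight_hours, critical_events) < SUSPEND_BELOW_HOURS

# Hypothetical reporting periods
print(mtbsi(1500, 5), should_suspend(1500, 5))    # 300 h between incidents
print(mtbsi(1500, 10), should_suspend(1500, 10))  # 150 h between incidents
```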

---

### B. Real‑Time Facial Recognition (FR) in Transit Hubs  

| # | Element | Detail |
|---|---------|--------|
| **1️⃣ Core Policy / Technical Control** | **“Two‑Tier Human‑in‑the‑Loop (HITL) Review” with Real‑Time Transparency Log**  <br>• **Tier 1:** An on‑site operator must receive the FR match confidence score and *must* manually confirm any match ≥ 80 % before an alert is escalated to law‑enforcement. <br>• **Tier 2:** All alerts (including rejections) are logged to an immutable, public‑access audit trail (hash‑signed JSON) showing timestamp, confidence, operator decision, and anonymized image hash. <br>**Why:**  False positives directly threaten civil liberties. A mandatory human gate plus a transparent, tamper‑evident log is the single most powerful lever to keep the system from “shoot‑first” behaviour while preserving accountability. |
| **2️⃣ Ongoing‑Harms Metric** | **False‑Positive Rate (FPR) by Protected Class**  <br>• **Computation:** \(\displaystyle \text{FPR}_{c}= \frac{\text{# of non‑target persons of class }c\text{ flagged as “match”}}{\text{Total # of non‑target persons of class }c\text{ passing through}} \) <br>• **Protected Classes**: race/ethnicity (White, Black, Asian, Hispanic, Other) and gender (male/female/other). <br>• **Threshold:** If any \(\text{FPR}_{c} > 0.5\%\) *or* the disparity ratio \(\frac{\text{FPR}_{\text{max}}}{\text{FPR}_{\text{min}}} > 3.0\) for a rolling 7‑day window, the system must be **suspended for 48 h** and a bias‑remediation review launched. (Baseline target: ≤ 0.2 % overall FPR.) |
| **3️⃣ Realistic Failure Mode** | **Algorithmic Demographic Bias → Disproportionate False Positives for Dark‑Skin Individuals** <br>• **Early‑Warning Signs**: <br>   – Sudden spike in alerts from a particular demographic (e.g., > 2× increase in alerts from Black passengers) not correlated with any change in crime‑watch list; <br>   – Operator “override” rate (manual rejection after alert) exceeds 30 % for a demographic; <br>   – Audit‑log shows confidence scores systematically lower for certain groups. <br>• **Rollback / Mitigation**: <br>   1. **Immediate**: Auto‑pause alerts for the affected demographic and route all detections to manual review only (no real‑time escalation). <br>   2. **Short‑term**: Retrain the face‑embedding model on a *locally‑collected* balanced dataset (≥ 2 000 images per class) and re‑evaluate FPR before re‑enabling real‑time mode. <br>   3. **Long‑term**: Institute a “model‑drift” monitoring pipeline that recomputes per‑class FPR nightly; if disparity persists > 48 h, enforce a permanent “human‑only” watchlist check for that hub. |
| **4️⃣ Auditable Artifact** | • Real‑time transparency log (hash‑signed JSON). <br>• Daily FPR report per protected class (published on city open‑data portal). |
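
To make the per-class FPR rule concrete, here is a minimal sketch of the computation and the two suspension conditions (any class above 0.5 %, or a max/min disparity ratio above 3.0). The class labels, counts, and function names are hypothetical.

```python
def fpr_by_class(flagged, total):
    """False-positive rate per class: flagged non-targets / all non-targets."""
    return {c: flagged[c] / total[c] for c in total}

def fr_suspension_check(flagged, total, max_fpr=0.005, max_ratio=3.0):
    """Return (rates, disparity ratio, suspend?) using the table's thresholds."""
    rates = fpr_by_class(flagged, total)
    highest, lowest = max(rates.values()), min(rates.values())
    if lowest > 0:
        ratio = highest / lowest
    else:
        ratio = 1.0 if highest == 0 else float("inf")
    return rates, ratio, (highest > max_fpr or ratio > max_ratio)

# Hypothetical 7-day counts of non-target passengers per class
total = {"class_a": 20_000, "class_b": 15_000, "class_c": 10_000}
flagged = {"class_a": 20, "class_b": 30, "class_c": 40}

rates, ratio, suspend = fr_suspension_check(flagged, total)
print(rates, round(ratio, 2), suspend)
```

In this made-up example every class stays under 0.5 %, yet the disparity ratio alone is enough to trigger suspension.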

---

### C. AI‑Powered Resume Screener for Municipal Hiring  

| # | Element | Detail |
|---|---------|--------|
| **1️⃣ Core Policy / Technical Control** | **“Explainable Score Card + Mandatory Human Decision Override”**  <br>• The AI must output a *transparent* score card for each applicant: (i) raw prediction probability, (ii) top‑5 feature contributions (e.g., years of experience, certifications, skill‑match), and (iii) a “bias‑flag” if any protected attribute proxy exceeds a calibrated threshold. <br>• A qualified HR officer must review **every** shortlisted candidate and **must sign off** before the system can auto‑advance a resume to the interview stage. <br>**Why:**  Hiring decisions have long‑term societal impact and are legally subject to EEOC standards. An explainable score card forces the model to surface any hidden proxies (e.g., zip‑code, school name) and gives a human the chance to veto discriminatory passes. |
| **2️⃣ Ongoing‑Harms Metric** | **Disparate Selection Rate (DSR) Ratio**  <br>• **Computation:** \(\displaystyle \text{DSR}_{g}= \frac{\#\text{ applicants of group }g\text{ moved to interview}}{\#\text{ total applicants of group }g}\)  <br>• **Groups**: race/ethnicity, gender, disability status (self‑reported). <br>• **Threshold:** If \(\displaystyle \frac{\text{DSR}_{\text{max}}}{\text{DSR}_{\text{min}}} > 1.5\) for any 30‑day rolling window *or* any group’s DSR falls below 0.5 × city‑wide average, **suspend automated scoring** and revert to manual résumé review for the next hiring cycle. (Target: DSR ratio ≤ 1.2.) |
| **3️⃣ Realistic Failure Mode** | **Proxy Variable Leakage – ZIP‑CODE as Socio‑Economic Proxy** <br>• **Early‑Warning Signs**: <br>   – Feature‑importance logs show “ZIP‑CODE” ranking in the top‑3 contributors across > 40 % of predictions; <br>   – DSR analysis shows under‑selection of applicants from low‑income census tracts; <br>   – HR staff begin to flag “odd” rejections in internal audit notes. <br>• **Rollback / Mitigation**: <br>   1. **Immediate**: Mask/remove the ZIP‑CODE field from the model input and re‑run the current batch of scores (no new decisions made until re‑scored). <br>   2. **Short‑term**: Retrain the model with *adversarial debiasing* that penalizes correlation between latent representation and protected‑group labels. <br>   3. **Long‑term**: Institutionalize a “feature‑audit” checklist (quarterly) that automatically flags any feature whose mutual information with a protected attribute exceeds 0.1 bits. |
| **4️⃣ Auditable Artifact** | • Feature‑importance heat‑map (per batch). <br>• DSR dashboard (weekly) on the city HR portal. |
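
The DSR ratio check can be sketched the same way. This assumes the "city-wide average" means the overall selection rate across all applicants, which the table does not spell out; group names, counts, and the function name are hypothetical.

```python
def dsr_check(interviewed, applicants, max_ratio=1.5, min_share_of_average=0.5):
    """Return (per-group DSR, max/min ratio, suspend?) for one rolling window."""
    dsr = {g: interviewed[g] / applicants[g] for g in applicants}
    # Assumption: city-wide average = overall selection rate across all groups
    overall = sum(interviewed.values()) / sum(applicants.values())
    lowest = min(dsr.values())
    ratio = max(dsr.values()) / lowest if lowest > 0 else float("inf")
    below_floor = [g for g, rate in dsr.items() if rate < min_share_of_average * overall]
    return dsr, ratio, (ratio > max_ratio or bool(below_floor))

# Hypothetical 30-day window
applicants = {"group_x": 400, "group_y": 300, "group_z": 300}
interviewed = {"group_x": 80, "group_y": 45, "group_z": 60}

dsr, dsr_ratio, suspend_scoring = dsr_check(interviewed, applicants)
print(dsr, round(dsr_ratio, 3), suspend_scoring)
```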

---

## 2. Allocation of Two Independent Audit Teams (Year 1)

| Audit Team | Assigned System(s) | Rationale (Risk‑Based) |
|------------|-------------------|------------------------|
| **Team 1** | **Facial Recognition (B)** | - **Highest civil‑rights risk** – false positives can lead to wrongful detentions. <br>- Direct legal exposure under state privacy statutes and 30‑day video‑retention limit. <br>- Requires *continuous* bias‑monitoring and audit‑log verification. <br>- Auditors will focus on per‑class FPR, transparency‑log integrity, and compliance with the HITL policy. |
| **Team 2** | **Autonomous Drones (A) + Resume Screener (C) (shared)** | - **A** carries *physical safety* risk (injury, property damage) and privacy risk (night‑time overflight). <br>- **C** carries *institutional fairness* risk but lower immediate harm. <br>- The team will split its calendar: **Weeks 1‑12** focus on drone flight‑log audit (MTBSI, geofence compliance); **Weeks 13‑24** transition to resume‑screener audit (DSR, feature‑audit). <br>- This staggered schedule respects the limited budget while still providing **full‑cycle** audits for both systems before the end of year 1. |
| **Unassigned** | **None** | The remaining system (the one not covered directly) will be monitored via **cross‑functional oversight** (city’s IT security office) and an **external community watchdog** that will receive the public dashboards produced by Teams 1 & 2. |

*Why not give each system its own team?* – The city’s **risk matrix** (severity × probability) places **B** (civil‑rights) and **A** (public‑safety) at the top two quadrants. **C** is still important but its harms are *institutional* and typically manifest over a longer horizon; a shared audit team can still meet the 30‑day DSR monitoring requirement without overwhelming the budget.

---

## 3. Independent Validation of Fairness & Generalization  

### 3.1 Validation Pipeline (All Three Systems)

1. **Data Partitioning** – Use a **stratified 70/15/15** split (train/validation/test) *by protected attributes* to guarantee representation in each set.  
2. **Hold‑out Test Set** – Must be **externally sourced** (not used for any model‑tuning) and **released publicly** (subject to privacy‑preserving hashing).  
3. **Cross‑Domain Generalization** – For each system, build a secondary “out‑of‑distribution” test set: <br>  • Drones: flights over a different neighbourhood (different building heights, vegetation). <br>  • FR: images from a separate camera vendor and lighting conditions (day vs. night). <br>  • Resume Screener: resumes from a neighboring city’s municipal jobs portal (different phrasing, education systems).  
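
The stratified 70/15/15 partition in step 1 can be sketched with the standard library. This is an illustrative version only: the record layout, the `group` field, and the function name are assumptions, and integer percentages are used so that the split sizes are exact.

```python
import random
from collections import defaultdict

def stratified_split(records, key, percents=(70, 15, 15), seed=0):
    """Split records into train/validation/test, preserving each group's share."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for rec in records:
        by_group[key(rec)].append(rec)

    train, val, test = [], [], []
    for group in by_group.values():
        rng.shuffle(group)
        n_train = len(group) * percents[0] // 100
        n_val = len(group) * percents[1] // 100
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Hypothetical data: 160 records in group g1, 40 in group g2
records = [{"id": i, "group": "g2" if i % 5 == 0 else "g1"} for i in range(200)]
train, val, test = stratified_split(records, key=lambda r: r["group"])
print(len(train), len(val), len(test))
```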

### 3.2 Fairness Tests & Statistical Criteria  

| System | Fairness Test | Dataset(s) | Statistical Metric | Pass‑Criteria |
|--------|---------------|------------|--------------------|---------------|
| **A – Drones** | **Safety‑Equity** – compare MTBSI across *flight corridors* that intersect neighborhoods with differing median incomes. | City GIS‑derived flight logs + U.S. Census tract data. | **Two‑sample Kolmogorov‑Smirnov (KS) test** on per‑corridor incident rates; p > 0.05 = no significant disparity. | KS‑p ≥ 0.05 **and** 95 % CI of MTBSI ratio ≤ 1.2. |
| **B – FR** | **Equalized Odds** – match‑rate vs. false‑positive‑rate parity across race & gender. | Public benchmark **RFW (Racial Faces in the Wild)** + local “city‑captured” set (30 k images, consented). | **Disparate Impact Ratio (DIR)** = FPR\_group / FPR\_reference; also **ΔTPR** (difference in true‑positive‑rate). | DIR ≥ 0.8 and ≤ 1.25; \|ΔTPR\| ≤ 5 % (absolute). |
| **C – Resume Screener** | **Conditional Demographic Parity** – selection rates equal after conditioning on job‑relevant qualifications. | **Open‑source** “Synthetic Hiring Dataset” (fairness‑benchmark) + a **city‑sourced** anonymized resume pool (5 k). | **Standardized Mean Difference (SMD)** on the *propensity score* of being selected, per protected group. | SMD ≤ 0.1 (Cohen’s d). |
| **All** | **Calibration Across Groups** – predicted probabilities must align with observed outcomes for each demographic. | Same test sets as above. | **Brier Score** per group; **Calibration‑in‑the‑Large (CITL)** and **Calibration‑Slope**. | CITL ≈ 0 ± 0.05; Slope ≈ 1 ± 0.1; Brier ≤ 0.25 (baseline). |
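
As an illustration of the per-group calibration check in the last row, the Brier score (mean squared difference between predicted probability and observed 0/1 outcome) can be computed per demographic group as follows. The rows and group labels are made up, and the function names are hypothetical; only the 0.25 ceiling comes from the table.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def brier_by_group(rows):
    """rows: iterable of (group, predicted_probability, observed_outcome)."""
    grouped = {}
    for group, p, y in rows:
        probs, outcomes = grouped.setdefault(group, ([], []))
        probs.append(p)
        outcomes.append(y)
    return {g: brier_score(ps, ys) for g, (ps, ys) in grouped.items()}

# Hypothetical predictions for two groups
rows = [
    ("g1", 0.9, 1), ("g1", 0.2, 0),
    ("g2", 0.6, 1), ("g2", 0.7, 0),
]
scores = brier_by_group(rows)
failing = [g for g, s in scores.items() if s > 0.25]  # ceiling from the table
print(scores, failing)
```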

### 3.3 Independent Auditing Process  

1. **Third‑Party Validation Lab** (e.g., a university ethics lab or an accredited AI audit firm) receives the **model artifact** (weights + inference code) and the **sealed test sets** (hash‑verified).  
2. They run the **fairness suite** (open‑source Fairlearn + custom scripts) and generate a **validation report** (PDF + raw CSV) that is posted on the city’s open‑data portal within **30 days** of each model update.  
3. The report includes: <br>  • Overall performance (accuracy, precision, recall). <br>  • Per‑group metrics (DIR, DSR, MTBSI). <br>  • Statistical‑significance tests (p‑values, confidence intervals). <br>  • “Risk‑Score” (0‑5) summarizing how close the system is to predefined thresholds.  

### 3.4 Communicating Trade‑offs & Remaining Risks  

| Communication Channel | Content | Frequency |
|-----------------------|---------|-----------|
| **Public Dashboard** (city website) | Real‑time KPI widgets: MTBSI, FPR per class, DSR per group, Calibration plots. Each widget links to the latest validation report and to the raw CSV for researchers. | Updated **weekly** (automated). |
| **Quarterly Town‑Hall Briefings** (live webcast + transcript) | Plain‑language summary: “What we measured, what we found, what we fixed.” Include a *risk‑heat map* showing current status (green/yellow/red) for each system. | Every 3 months. |
| **Annual Independent Audit Summary** (PDF, 8‑page) | Executive summary, methodology, key findings, remediation actions, and a “known‑unknowns” section listing risks that cannot be fully mitigated (e.g., rare edge‑case GPS spoofing). | Once per year, plus an **extra release** if any threshold breach occurs. |
| **Community Oversight Board** (citizen panel) | Receives raw audit logs, can request deeper dives, and votes on whether to *extend* or *curtail* system use. Meetings **recorded** and minutes posted. | Bimonthly. |

*Key messaging:*  
- **Transparency over perfection.** We cannot promise zero false positives or zero crashes, but we can guarantee that any deviation beyond agreed thresholds triggers an automatic suspension and a publicly documented remediation plan.  
- **Trade‑off illustration:** For FR, a lower confidence threshold reduces missed matches (higher recall) but raises false positives; the city has chosen the **80 % confidence + human‑in‑the‑loop** as the calibrated point that balances public safety with civil‑rights protection.  
- **Residual risk statement:** “Even with geofence enforcement, a drone could be lost due to a catastrophic hardware failure; in that event the kill‑switch will force a safe‑land, but we cannot guarantee zero property damage.” This language will appear in every public report.

---

### Bottom Line

1. **Controls:** geofence + kill‑switch (A); two‑tier HITL + transparency log (B); explainable score‑card + human override (C).  
2. **Metrics:** MTBSI, per‑class FPR, DSR ratio – each with clear computation and suspension thresholds.  
3. **Failure Modes:** GPS spoofing, demographic bias in FR, proxy leakage in resume screener – all paired with early‑warning signals and concrete rollback steps.  
4. **Audit‑Team Allocation:** One team fully on the highest‑risk FR system; the second team splits its time between drones (physical safety) and hiring screener (institutional fairness).  
5. **Fairness/Generalization Validation:** Stratified external test sets, standard fairness statistics (DIR, DSR, KS, SMD), calibration checks, and public, signed audit reports.  
6. **Public Communication:** Live dashboards, quarterly town‑halls, annual audit summaries, and a citizen oversight board—all designed to keep trust high while acknowledging unavoidable trade‑offs.

Implementing this roadmap will give the city a **transparent, evidence‑based safety net** that satisfies the budget constraint, respects the 30‑day data‑retention law, and demonstrably protects its residents from both physical and algorithmic harms.

## For the next cell, we will use Ollama

Ollama runs a local web service that provides an OpenAI-compatible endpoint,  
and runs models locally using high-performance C++ code.

If you don't have Ollama, install it by visiting https://ollama.com, pressing Download, and following the instructions.

After it's installed, you should be able to visit http://localhost:11434 and see the message "Ollama is running".
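
The same check can be done from Python instead of the browser. This is a minimal sketch using only the standard library; `ollama_is_running` is a hypothetical helper name, and it simply looks for the "Ollama is running" message at the address mentioned above.

```python
import urllib.error
import urllib.request

def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the local Ollama server answers with its status message."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return "Ollama is running" in body
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: treat as "not running"
        return False

print(ollama_is_running())
```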

You might need to restart Cursor (and maybe reboot). Then open a Terminal (control+\`) and run `ollama serve`

Useful Ollama commands (run these in the terminal, or with an exclamation mark in this notebook):

`ollama pull <model_name>` downloads a model locally  
`ollama ls` lists all the models you've downloaded  
`ollama rm <model_name>` deletes the specified model from your downloads

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/stop.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Super important - ignore me at your peril!</h2>
            <span style="color:#ff7800;">The model called <b>llama3.3</b> is FAR too large for home computers - it's not intended for personal computing and will consume all your resources! Stick with the nicely sized <b>llama3.2</b> or <b>llama3.2:1b</b> and if you want larger, try llama3.1 or smaller variants of Qwen, Gemma, Phi or DeepSeek. See <a href="https://ollama.com/models">the Ollama models page</a> for a full list of models and sizes.
            </span>
        </td>
    </tr>
</table>

In [16]:
!ollama pull llama3.2

[?2026h[?25l[1Gpulling manifest ⠋ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠙ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠹ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠸ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠼ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠴ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠦ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠧ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠇ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠏ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest [K
pulling dde5aa3fc5ff:   0% ▕                  ▏  80 KB/2.0 GB                  [K[?25h[?2026l[?2026h[?25l[A[1Gpulling manifest [K
pulling dde5aa3fc5ff:   0% ▕                  ▏ 387 KB/2.0 GB                  [K[?25h[?2026l[?2026h[?25l[A[1Gpulling manifest [K
pulling dde5aa3fc5ff:   0% ▕                  ▏ 1.2 MB/2.0 GB                  [K[?25h[?2026l[?2026h[?25l[A[1Gpulling manifest [K
pulling

OSError: [Errno 5] Input/output error

In [None]:
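# Ollama serves an OpenAI-compatible endpoint locally, so we can reuse the OpenAI client
# The api_key is just a placeholder - the client wants one, but Ollama doesn't check it
ollama = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
model_name = "llama3.2"

response = ollama.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)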
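
If the cell above fails with a connection error, the usual cause is that the Ollama server isn't running. Here's a minimal sketch of a check you could run first. The helper name `ollama_is_running` is my own invention, and I'm assuming the server answers a plain HTTP request at `http://localhost:11434` with status 200, so treat this as a starting point rather than the official way to do it.

```python
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Hypothetical helper: return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, connection refused and timeouts are all OSError subclasses
        return False


if ollama_is_running():
    print("Ollama is up - safe to call the local model")
else:
    print("Ollama is not reachable - try running `ollama serve` in a terminal")
```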

In [17]:
# So where are we?

print(competitors)
print(answers)


['gpt-5-nano', 'claude-sonnet-4-5', 'gemini-2.5-flash', 'openai/gpt-oss-120b']


In [18]:
# It's nice to know how to use "zip"
for competitor, answer in zip(competitors, answers):
    print(f"Competitor: {competitor}\n\n{answer}")


Competitor: gpt-5-nano

Below is a structured safety and governance plan you can adapt for each system, followed by audit allocation, and a plan for independent fairness/generalization validation and public communication.

System A. Autonomous last-mile delivery drones operating at night

1) The single most important policy/technical control before deployment
- Policy/Control: Data minimization and on-device privacy-preserving processing with strict automatic data purge (no retention of raw video; only non-identifying operational data; mandatory 24-hour automatic purge of any captured imagery; all processing and storage occur on the drone unless necessary for delivery).
- Why this control: Night operations plus drones create substantial privacy and civil-liberties risk if video or identifiable data is stored or transmitted. By keeping data on-device, deleting imagery rapidly, and avoiding cloud storage of video, you materially reduce the risk of privacy harm, data breaches, and public 

In [19]:
# Let's bring this together - note the use of "enumerate"

together = ""
for index, answer in enumerate(answers):
    together += f"# Response from competitor {index+1}\n\n"
    together += answer + "\n\n"
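
As a variation to experiment with, `enumerate` accepts a `start` argument, which removes the need for `index+1`, and `"".join(...)` can build the whole string in one expression. This sketch uses a made-up `sample_answers` list in place of the real `answers` and checks that both approaches give the same text.

```python
# Hypothetical stand-in for the notebook's `answers` list
sample_answers = ["first answer", "second answer"]

# The loop from the cell above
looped = ""
for index, answer in enumerate(sample_answers):
    looped += f"# Response from competitor {index+1}\n\n"
    looped += answer + "\n\n"

# Same result: start=1 replaces index+1, and join builds the string in one go
joined = "".join(
    f"# Response from competitor {number}\n\n{answer}\n\n"
    for number, answer in enumerate(sample_answers, start=1)
)

print(joined == looped)
```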

In [20]:
print(together)

# Response from competitor 1

Below is a structured safety and governance plan you can adapt for each system, followed by audit allocation, and a plan for independent fairness/generalization validation and public communication.

System A. Autonomous last-mile delivery drones operating at night

1) The single most important policy/technical control before deployment
- Policy/Control: Data minimization and on-device privacy-preserving processing with strict automatic data purge (no retention of raw video; only non-identifying operational data; mandatory 24-hour automatic purge of any captured imagery; all processing and storage occur on the drone unless necessary for delivery).
- Why this control: Night operations plus drones create substantial privacy and civil-liberties risk if video or identifiable data is stored or transmitted. By keeping data on-device, deleting imagery rapidly, and avoiding cloud storage of video, you materially reduce the risk of privacy harm, data breaches, and p

In [21]:
judge = f"""You are judging a competition between {len(competitors)} competitors.
Each model has been given this question:

{question}

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}}

Here are the responses from each competitor:

{together}

Now respond with the JSON with the ranked order of the competitors, nothing else. Do not include markdown formatting or code blocks."""
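
Note the doubled braces around the JSON example in the prompt. Inside an f-string, single braces mark a value to substitute, so literal braces have to be written as `{{` and `}}`. A tiny sketch, with `count` as a made-up stand-in for `len(competitors)`:

```python
count = 4  # hypothetical value standing in for len(competitors)

# {count} is substituted; {{ and }} come out as literal { and }
template = f"""There are {count} competitors.
Respond with: {{"results": ["1", "2"]}}"""

print(template)
```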


In [22]:
print(judge)

You are judging a competition between 4 competitors.
Each model has been given this question:


Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}

Here are the responses from each competitor:

# Response from competitor 1

Below is a structured safety and governance plan you can adapt for each system, followed by audit allocation, and a plan for independent fairness/generalization validation and public communication.

System A. Autonomous last-mile delivery drones operating at night

1) The single most important policy/technical control before deployment
- Policy/Control: Data minimization and on-device privacy-preserving processing with strict automatic data purge (no retention of raw video; only non-identifying operational data; mandatory 24-hour aut

In [23]:
judge_messages = [{"role": "user", "content": judge}]

In [24]:
# Judgement time!

openai = OpenAI()
response = openai.chat.completions.create(
    model="gpt-5-mini",
    messages=judge_messages,
)
results = response.choices[0].message.content
print(results)


{"results": ["4", "3", "1", "2"]}


In [25]:
# OK let's turn this into results!

results_dict = json.loads(results)
ranks = results_dict["results"]
for index, result in enumerate(ranks):
    competitor = competitors[int(result)-1]
    print(f"Rank {index+1}: {competitor}")

Rank 1: openai/gpt-oss-120b
Rank 2: gemini-2.5-flash
Rank 3: gpt-5-nano
Rank 4: claude-sonnet-4-5
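
The parsing above trusts the judge to return clean JSON that ranks every competitor exactly once. Models don't always oblige, so here is a sketch of a more defensive version. The helper `parse_ranking` is hypothetical (my own name, not part of any library), and the checks are one reasonable choice among several.

```python
import json


def parse_ranking(raw: str, n_competitors: int) -> list[int]:
    """Hypothetical helper: turn the judge's reply into a validated list of competitor numbers."""
    text = raw.strip()
    # Some models wrap JSON in markdown code fences despite instructions; keep only the {...} part
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in judge reply")
    ranks = [int(r) for r in json.loads(text[start:end + 1])["results"]]
    # A valid ranking mentions every competitor exactly once
    if sorted(ranks) != list(range(1, n_competitors + 1)):
        raise ValueError(f"Ranking is not a permutation of 1..{n_competitors}: {ranks}")
    return ranks


print(parse_ranking('{"results": ["4", "3", "1", "2"]}', 4))
```

If the helper raises, a simple next step is to call the judge again rather than carry on with a broken ranking.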


<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/exercise.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Exercise</h2>
            <span style="color:#ff7800;">Which agentic design pattern(s) did this lab use? Try extending it to add another agentic design pattern.
            </span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/business.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#00bfff;">Commercial implications</h2>
            <span style="color:#00bfff;">Patterns like this one - sending the same task to multiple models, then having another model evaluate the results -
            are common where you need to improve the quality of your LLM response. The approach can be applied
            to many business projects where accuracy is critical.
            </span>
        </td>
    </tr>
</table>