# Model Cards and Responsible AI

## Learning Objectives

In this notebook, you will learn:
- What model cards are and why they matter
- How to read and interpret HuggingFace model cards
- How to identify a model's limitations, biases, and intended use cases
- Which ethical considerations apply when deploying AI models
- How to evaluate if a model is appropriate for your use case
- Responsible AI principles and best practices
- How system cards document production deployments

**This notebook focuses on concepts, not code.** We'll examine real model cards from models used in previous notebooks.

## Prerequisites

| Requirement | Minimum | Recommended |
|------------|---------|-------------|
| RAM | 2GB | 4GB |
| GPU | Not required | Not required |
| Python | 3.8+ | 3.10+ |
| Storage | 1GB free | 2GB free |

**Recommended**: Complete notebooks 01-10 first to understand the models we'll reference.

## Expected Behaviors

When running this notebook, you should observe:

**Model Card Fetching**:
- Model cards download from HuggingFace Hub
- README.md files contain markdown-formatted documentation
- Some models have extensive documentation, others are minimal

**Card Content**:
- Training data sources and sizes
- Evaluation metrics and benchmarks
- Known limitations and biases
- Intended use cases (and out-of-scope uses)
- Ethical considerations

**Programmatic Access**:
- Model cards can be fetched through the HuggingFace Hub API
- Model metadata is returned in JSON format
- Metadata includes tags and task classifications

**Common Observations**:
- Popular models have comprehensive documentation
- Smaller/newer models may have minimal cards
- Dataset biases are often documented
- Most cards include citation information

**Troubleshooting**:
- If card fetch fails: Check internet connection
- If card is missing: Model may not have documentation (red flag!)
- If information is unclear: Check original paper or dataset documentation

## Overview

**Model cards** are documentation that accompanies machine learning models. They provide transparency about:
- What the model does
- How it was trained
- What it's good (and bad) at
- Ethical considerations
- Appropriate and inappropriate uses

Think of model cards as **nutrition labels for AI models** - they help you make informed decisions.

**Why model cards matter**:
1. **Transparency**: Understand what you're deploying
2. **Accountability**: Document model behavior and limitations
3. **Safety**: Avoid using models for inappropriate tasks
4. **Ethics**: Recognize and mitigate biases
5. **Compliance**: Meet regulatory requirements

This notebook teaches you to be a **responsible AI practitioner** by critically evaluating models before deployment.

## Setup and Installation

In [None]:
# Import libraries
import random
from huggingface_hub import HfApi, model_info, hf_hub_download
from transformers import pipeline
import warnings
warnings.filterwarnings('ignore')

# Set seed for reproducibility
random.seed(1103)

# Initialize HuggingFace API
api = HfApi()

print("✓ Libraries imported successfully")

## Part 1: Anatomy of a Model Card

A complete model card typically includes:

### 1.1 Essential Components

**1. Model Description**
- What task does it perform?
- What architecture is it based on?
- Who created it?

**2. Intended Use**
- Primary use cases
- Supported languages/domains
- Out-of-scope uses

**3. Training Data**
- Dataset(s) used
- Data size and composition
- Preprocessing steps

**4. Evaluation**
- Benchmarks and metrics
- Performance on test sets
- Comparison to other models

**5. Ethical Considerations**
- Known biases
- Risks and harms
- Recommendations for responsible use

**6. Limitations and Biases**
- What the model can't do
- Edge cases and failure modes
- Dataset biases

**7. Citation and License**
- How to cite the model
- Usage license
- Links to papers
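
One rough way to see how many of these components a card covers is to scan its markdown headings for telltale keywords. The sketch below is only a heuristic: the keyword map and the `missing_components` helper are assumptions made for illustration, and real cards name their sections inconsistently, so a "missing" result means "go and read the card", not "the information is absent".

```python
import re

# Assumed keyword map: component name -> heading keywords that suggest coverage.
EXPECTED_COMPONENTS = {
    "Model Description": ("description", "model details", "overview"),
    "Intended Use": ("intended use", "uses", "how to use"),
    "Training Data": ("training data", "training"),
    "Evaluation": ("evaluation", "results", "performance"),
    "Ethical Considerations": ("ethical", "risks"),
    "Limitations and Biases": ("limitation", "bias"),
    "Citation and License": ("citation", "license", "bibtex"),
}


def extract_headings(card_text):
    """Return the lowercase text of every markdown heading in the card."""
    pattern = r"^#{1,6}[ \t]+(.+)$"
    return [m.group(1).strip().lower()
            for m in re.finditer(pattern, card_text, flags=re.MULTILINE)]


def missing_components(card_text):
    """List the expected components that no heading appears to cover."""
    headings = extract_headings(card_text)
    missing = []
    for component, keywords in EXPECTED_COMPONENTS.items():
        covered = any(kw in heading for heading in headings for kw in keywords)
        if not covered:
            missing.append(component)
    return missing


# A tiny made-up card used only to exercise the function.
sample_card = """# My Model
## Model description
...
## Intended uses & limitations
...
## Training data
...
"""

print(missing_components(sample_card))
# ['Evaluation', 'Ethical Considerations', 'Citation and License']
```

The same function can be applied to the `card_content` string of any real card once it has been downloaded. Note that `#` comment lines inside a card's code examples would also be counted as headings by this simple pattern.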

### 1.2 Fetching a Model Card Programmatically

In [None]:
# Example: Fetch model card for DistilBERT (used in Notebook 02)
model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# Get model information
info = model_info(model_id)

print("="*70)
print(f"MODEL INFORMATION: {model_id}")
print("="*70)
print(f"Task: {info.pipeline_tag}")
print(f"Model ID: {info.id}")
print(f"Downloads: {info.downloads:,}")
print(f"Likes: {info.likes}")
print(f"Tags: {', '.join(info.tags[:10])}")
print("="*70)

In [None]:
# Fetch the actual README (model card)
try:
    readme_path = hf_hub_download(repo_id=model_id, filename="README.md", repo_type="model")
    
    with open(readme_path, 'r', encoding='utf-8') as f:
        card_content = f.read()
    
    # Display first 2000 characters
    print("\nMODEL CARD PREVIEW:")
    print("="*70)
    print(card_content[:2000])
    print("\n... (truncated)")
    print("="*70)
    print(f"\nFull card is {len(card_content)} characters long")
    print(f"View online: https://huggingface.co/{model_id}")
    
except Exception as e:
    print(f"Could not fetch model card: {e}")
    print(f"View online: https://huggingface.co/{model_id}")
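
Cards on the Hub usually begin with a YAML metadata block enclosed between two `---` lines, followed by the markdown body. The sketch below separates the two using only the standard library. The helpers `split_front_matter` and `top_level_scalars` are hypothetical names, and the parsing is deliberately naive (it skips lists and nested keys); for real work a YAML parser, or the `ModelCard` class in `huggingface_hub`, is the more robust route.

```python
def split_front_matter(card_text):
    """Split a card into (metadata_block, body).

    Returns ("", card_text) when the text does not start with a '---' line
    or the closing '---' line is missing.
    """
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", card_text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", card_text


def top_level_scalars(metadata_block):
    """Collect simple 'key: value' pairs that start at column zero.

    List items, nested mappings, comments, and keys without an inline
    value are skipped.
    """
    fields = {}
    for line in metadata_block.splitlines():
        if line[:1] in ("", " ", "-", "#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if value.strip():
            fields[key.strip()] = value.strip()
    return fields


# Made-up card text used only to exercise the functions.
sample = """---
license: apache-2.0
language: en
datasets:
- sst2
---
# Model card body
"""

metadata, body = split_front_matter(sample)
print(top_level_scalars(metadata))  # {'license': 'apache-2.0', 'language': 'en'}
print(body)                         # # Model card body
```

Passing the downloaded `card_content` string to `split_front_matter` gives a quick look at fields such as the license before reading the prose sections.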

## Part 2: Examining Real Model Cards

Let's examine model cards from models we've used in previous notebooks.

### 2.1 Case Study: GPT-2 (Text Generation)

**Model**: `gpt2-medium` (Notebook 01)

**View card**: https://huggingface.co/gpt2-medium

In [None]:
# Examine GPT-2 metadata
gpt2_info = model_info("gpt2-medium")

print("GPT-2 MEDIUM - KEY INFORMATION")
print("="*70)
print(f"Task: {gpt2_info.pipeline_tag}")
print(f"Downloads: {gpt2_info.downloads:,}")
print(f"\nKey Points from Model Card:")
print("-"*70)

# Summary paraphrased from GPT-2's model card; check the card itself for exact wording
key_points = """
INTENDED USE:
• Research on language models and text generation
• Creative writing assistance
• NOT intended for factual information generation

TRAINING DATA:
• WebText dataset (8 million web pages)
• English language content from outbound Reddit links
• 40GB of text data

LIMITATIONS:
• May generate biased, offensive, or false content
• Reflects biases present in web text
• Not fact-checked or filtered for accuracy
• Can memorize and reproduce training data

BIASES:
• Gender biases (occupational stereotypes)
• Racial biases (from internet text)
• Toxicity (can generate harmful content)
• Western/English-centric worldview

OUT-OF-SCOPE USES:
• Medical, legal, or financial advice
• Generating misleading information
• Impersonating real people
• Automated content creation without disclosure
"""

print(key_points)
print("="*70)
print("\n⚠️  Critical: GPT-2 can generate harmful content. Always review outputs!")

### 2.2 Case Study: DETR (Object Detection)

**Model**: `facebook/detr-resnet-50` (Notebook 05)

**View card**: https://huggingface.co/facebook/detr-resnet-50

In [None]:
detr_info = model_info("facebook/detr-resnet-50")

print("DETR OBJECT DETECTION - KEY INFORMATION")
print("="*70)

key_points = """
INTENDED USE:
• Object detection in natural images
• Detecting common objects (COCO categories)
• Research and prototyping

TRAINING DATA:
• COCO 2017 dataset (118,000 training images)
• 80 object categories
• Annotated with bounding boxes

PERFORMANCE:
• 42.0 AP on COCO val2017
• Better on large objects than small objects

LIMITATIONS:
• Only detects 80 COCO categories (won't detect other objects)
• Performance degrades on:
  - Very small objects
  - Heavily occluded objects
  - Objects from unusual viewpoints
  - Images very different from COCO (medical, satellite, etc.)
• Dataset bias: More common objects detected more accurately

BIASES:
• COCO dataset is Western-centric (US/Europe images)
• Better performance on common objects (person, car) vs rare (toaster)
• Struggles with cultural variations (e.g., different chair styles)

ETHICAL CONSIDERATIONS:
• Can be used for surveillance (privacy concerns)
• May misidentify objects in critical applications
• Should not be used for security without human review
"""

print(key_points)
print("="*70)
print("\n⚠️  Don't use object detection for high-stakes decisions without validation!")

### 2.3 Case Study: Whisper (Speech Recognition)

**Model**: `openai/whisper-small` (Notebook 07)

**View card**: https://huggingface.co/openai/whisper-small

In [None]:
whisper_info = model_info("openai/whisper-small")

print("WHISPER SPEECH RECOGNITION - KEY INFORMATION")
print("="*70)

key_points = """
INTENDED USE:
• Automatic speech recognition
• Multilingual transcription (99 languages)
• Audio translation to English

TRAINING DATA:
• 680,000 hours of labeled audio collected from the internet
• About 65% English; the rest is non-English audio (transcription and translation to English)
• Diverse accents, recording conditions, and content types

PERFORMANCE:
• Near human-level accuracy on LibriSpeech (clean English)
• Variable performance across languages
• Better on high-resource languages (English, Spanish) vs low-resource

LIMITATIONS:
• Accuracy varies by:
  - Language (English > others)
  - Audio quality (clean > noisy)
  - Accent (native > non-native)
  - Domain (conversational > technical jargon)
• May hallucinate (generate text not in audio)
• Timestamp accuracy not guaranteed
• Struggles with overlapping speech

BIASES:
• Higher error rates for:
  - Non-native English speakers
  - Certain accents (African, South Asian)
  - Female vs male voices (slight difference)
  - Low-resource languages
• Reflects biases in web-scraped training data

ETHICAL CONSIDERATIONS:
• Privacy: Can transcribe private conversations
• Consent: Training data may include non-consented audio
• Accessibility: Don't rely solely for critical accessibility needs
• Hallucinations: May attribute words that weren't spoken

OUT-OF-SCOPE USES:
• Legal transcripts (hallucinations are unacceptable)
• Medical dictation (errors could be harmful)
• Surveillance without consent
• Speaker identification (not designed for this)
"""

print(key_points)
print("="*70)
print("\n⚠️  Whisper can hallucinate! Always verify transcripts for critical uses.")

## Part 3: Evaluating Model Appropriateness

Before using a model, ask these questions:

### 3.1 The Model Evaluation Checklist

**1. Task Alignment**
- [ ] Does the model's intended use match my use case?
- [ ] Is my domain similar to the training data?
- [ ] Are there out-of-scope uses that apply to me?

**2. Performance Requirements**
- [ ] Does the model's accuracy meet my needs?
- [ ] Have I tested it on representative examples?
- [ ] What happens when the model is wrong?

**3. Data Compatibility**
- [ ] Is my data similar to the training data?
- [ ] Are there distribution shifts (different languages, domains, etc.)?
- [ ] Does the model support my language/region?

**4. Bias and Fairness**
- [ ] What biases exist in the model?
- [ ] Could these biases harm my users?
- [ ] How will I monitor and mitigate biases?

**5. Safety and Ethics**
- [ ] What are the potential harms?
- [ ] Is human review needed?
- [ ] Do I have user consent?
- [ ] Am I transparent about using AI?

**6. Legal and Compliance**
- [ ] What's the model's license?
- [ ] Can I use it commercially?
- [ ] Do I comply with data protection laws (GDPR, etc.)?
- [ ] Are there intellectual property concerns?
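
The checklist is easier to audit when the answers are recorded as data rather than kept in someone's head. The following is a minimal sketch, not a prescribed process: the `ChecklistItem` class, the `blocking` flag, and the three decision strings are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChecklistItem:
    category: str
    question: str
    answer: Optional[bool] = None  # None means "not yet assessed"
    blocking: bool = False         # hypothetical flag: a "no" here stops deployment


def review(items):
    """Return (decision, open_questions, failed_blockers).

    Non-blocking "no" answers do not change the decision in this sketch;
    they would still need to be documented and mitigated.
    """
    open_questions = [i.question for i in items if i.answer is None]
    failed_blockers = [i.question for i in items
                       if i.blocking and i.answer is False]
    if failed_blockers:
        decision = "do not deploy"
    elif open_questions:
        decision = "needs more assessment"
    else:
        decision = "proceed with monitoring"
    return decision, open_questions, failed_blockers


# Example answers for an imaginary project.
items = [
    ChecklistItem("Task Alignment",
                  "Does the model's intended use match my use case?",
                  answer=True, blocking=True),
    ChecklistItem("Performance Requirements",
                  "Have I tested it on representative examples?"),
    ChecklistItem("Legal and Compliance",
                  "Can I use it commercially?",
                  answer=True, blocking=True),
]

decision, open_questions, failed_blockers = review(items)
print(decision)        # needs more assessment
print(open_questions)  # ['Have I tested it on representative examples?']
```

Only questions where "yes" is the desirable answer are used here. Questions such as "Are there out-of-scope uses that apply to me?" would need to be rephrased or given an explicit expected answer.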

### 3.2 Example: Should I Use This Model?

Let's evaluate a real scenario:

In [None]:
# Scenario: Building a customer review sentiment analyzer for a global e-commerce site

scenario = """
SCENARIO:
You want to automatically classify customer reviews as positive/negative
to help prioritize customer service responses.

Requirements:
• Handle reviews from multiple countries (US, UK, India, Japan)
• Process 10,000 reviews/day
• Flag negative reviews for human review
• Must not miss critical negative reviews (safety-critical product)
"""

print(scenario)
print("\n" + "="*70)
print("EVALUATING: distilbert-base-uncased-finetuned-sst-2-english")
print("="*70)

evaluation = """
✓ GOOD FIT:
• Fast inference (meets 10k/day requirement)
• Good accuracy on English reviews
• Free and commercially usable (Apache 2.0 license)

✗ POOR FIT:
• Only supports English (fails for Japanese reviews)
• Trained on movie reviews (different domain than products)
• May have bias toward certain writing styles
• No guarantee of catching all critical reviews

RECOMMENDATION:
⚠️  CONDITIONAL USE:
1. Use ONLY for English reviews (add language detection)
2. Implement human review for ALL high-confidence negative reviews
3. Regularly audit false negatives (missed negative reviews)
4. Consider fine-tuning on your product review data
5. Add multilingual model for non-English reviews
6. Set conservative threshold (flag uncertain cases for review)

ALTERNATIVE:
Consider: xlm-roberta-base (multilingual) with fine-tuning on
product reviews from your actual customer base.
"""

print(evaluation)
print("="*70)
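
Recommendation 3 (auditing missed negative reviews) can be made concrete with a small hand-labeled sample. The sketch below assumes each audited review is stored as a `(human_label, model_label)` pair using the `POSITIVE`/`NEGATIVE` label names of the SST-2 model. The function name and the sample data are invented for illustration.

```python
def false_negative_rate(audit_sample, target_label="NEGATIVE"):
    """Share of truly negative reviews that the model did NOT label negative.

    audit_sample: iterable of (human_label, model_label) pairs.
    Returns None when the sample contains no truly negative reviews.
    """
    actual = [(human, model) for human, model in audit_sample
              if human == target_label]
    if not actual:
        return None
    missed = sum(1 for _, model in actual if model != target_label)
    return missed / len(actual)


# Invented audit sample: (human label, model label).
audit = [
    ("NEGATIVE", "NEGATIVE"),
    ("NEGATIVE", "POSITIVE"),  # a missed negative review
    ("POSITIVE", "POSITIVE"),
    ("NEGATIVE", "NEGATIVE"),
    ("POSITIVE", "NEGATIVE"),  # a false alarm, not counted here
    ("NEGATIVE", "NEGATIVE"),
]

rate = false_negative_rate(audit)
print(f"Missed negative reviews: {rate:.0%}")  # Missed negative reviews: 25%
```

Because the scenario says critical negative reviews must not be missed, this rate is the number to track over time. Overall accuracy can look healthy while it stays high.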

## Part 4: Responsible AI Practices

Beyond reading model cards, follow these best practices:

### 4.1 Core Principles

**1. Transparency**
- Disclose when AI is being used
- Explain how decisions are made
- Document model choice and limitations

**2. Fairness**
- Test for bias across demographic groups
- Monitor performance disparities
- Mitigate identified biases

**3. Accountability**
- Assign responsibility for AI decisions
- Implement human oversight
- Create appeal/review processes

**4. Privacy**
- Minimize data collection
- Secure user data
- Comply with regulations (GDPR, CCPA)

**5. Safety**
- Test edge cases and failure modes
- Implement safety guardrails
- Monitor for misuse

**6. Reliability**
- Validate on diverse test sets
- Monitor performance in production
- Have fallback mechanisms

### 4.2 Red Flags: When NOT to Deploy

**Don't deploy a model if:**

❌ **No model card exists** - Lack of transparency is a warning sign

❌ **Unknown training data** - You can't assess biases or legal risks

❌ **Intended use unclear** - Model might not be designed for your task

❌ **High-stakes application without validation** - Medical, legal, financial decisions need extensive testing

❌ **Can't handle model errors** - No fallback mechanism for failures

❌ **No human oversight** - Critical decisions need human review

❌ **Biases affect vulnerable groups** - Documented biases that could harm users

❌ **License prohibits your use case** - Legal compliance is non-negotiable

❌ **Can't explain decisions to users** - Regulatory requirement in many domains

❌ **Performance degrades on your data** - Testing shows poor results
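
Several of these red flags are yes/no facts that can be recorded once and checked automatically before every release. The sketch below assumes a hand-filled record per candidate model. The dictionary keys and the `find_red_flags` helper are hypothetical names, not part of any library.

```python
# Map each recorded check to the red flag it guards against.
RED_FLAGS = {
    "has_model_card": "No model card exists",
    "training_data_documented": "Unknown training data",
    "intended_use_documented": "Intended use unclear",
    "has_fallback": "Can't handle model errors",
    "has_human_oversight": "No human oversight",
    "license_allows_use_case": "License prohibits your use case",
    "performs_well_on_own_data": "Performance degrades on your data",
}


def find_red_flags(record):
    """Return the red flags for every check that is missing or False.

    A missing key counts as a failure, so an incomplete record cannot
    pass by omission.
    """
    return [message for key, message in RED_FLAGS.items()
            if not record.get(key, False)]


# Hypothetical record, filled in by hand after reading the card and testing.
candidate = {
    "has_model_card": True,
    "training_data_documented": False,
    "intended_use_documented": True,
    "has_fallback": True,
    "has_human_oversight": True,
    "license_allows_use_case": True,
    "performs_well_on_own_data": False,
}

print(find_red_flags(candidate))
# ['Unknown training data', 'Performance degrades on your data']
```

An empty list is not approval to deploy. The remaining red flags, such as harm to vulnerable groups, need human judgment.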

### 4.3 Bias Detection Example

Let's test a sentiment model for potential gender bias:

In [None]:
# Load sentiment classifier
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Test sentences with gendered pronouns
test_cases = [
    # Occupation stereotypes
    ("He is a great engineer", "She is a great engineer"),
    ("He is an excellent nurse", "She is an excellent nurse"),
    ("He is a talented CEO", "She is a talented CEO"),
    
    # Emotional expressions
    ("He was very emotional about it", "She was very emotional about it"),
    ("He is assertive", "She is assertive"),
    ("He is so bossy", "She is so bossy"),
]

print("\nTESTING FOR GENDER BIAS")
print("="*70)
print(f"{'Sentence':<40} {'Label':<10} {'Score':<10}")
print("-"*70)

for male_sent, female_sent in test_cases:
    male_result = classifier(male_sent)[0]
    female_result = classifier(female_sent)[0]
    
    print(f"{male_sent:<40} {male_result['label']:<10} {male_result['score']:.3f}")
    print(f"{female_sent:<40} {female_result['label']:<10} {female_result['score']:.3f}")
    
    # Flag if labels differ or scores differ by more than 0.1 (an arbitrary threshold)
    if male_result['label'] != female_result['label']:
        print("  ⚠️  POSSIBLE BIAS: Different labels!")
    elif abs(male_result['score'] - female_result['score']) > 0.1:
        print(f"  ⚠️  Score difference: {abs(male_result['score'] - female_result['score']):.3f}")
    
    print()

print("="*70)
print("\nNote: This is a simplified bias test. Real bias testing requires:")
print("• Large, diverse test sets")
print("• Statistical significance testing")
print("• Testing across multiple dimensions (race, age, etc.)")
print("• Domain-specific evaluation")
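
To illustrate the "statistical significance testing" point, the sketch below runs a paired sign-flip permutation test on counterfactual sentence pairs. It first converts each pipeline result to the probability of `POSITIVE`, which assumes a two-label model, so that a label flip and a score change are measured on the same scale. The function names are hypothetical, and the example numbers are invented placeholders, not real model outputs.

```python
import random


def positive_prob(result):
    """Convert a binary pipeline result {'label', 'score'} into P(POSITIVE)."""
    if result["label"] == "POSITIVE":
        return result["score"]
    return 1.0 - result["score"]


def paired_gap_test(pairs, n_resamples=10_000, seed=1103):
    """Return (mean gap, two-sided p-value) for paired results.

    The gap is P(POSITIVE) of the first sentence minus the second. Under
    the null hypothesis of no bias, the sign of each gap is arbitrary, so
    signs are flipped at random to build the reference distribution.
    """
    diffs = [positive_prob(a) - positive_prob(b) for a, b in pairs]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(flipped) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, (extreme + 1) / (n_resamples + 1)


# Placeholder results in the pipeline's output format.
# These numbers are invented for illustration and are NOT real model outputs.
example_pairs = [
    ({"label": "POSITIVE", "score": 0.99}, {"label": "POSITIVE", "score": 0.98}),
    ({"label": "NEGATIVE", "score": 0.80}, {"label": "NEGATIVE", "score": 0.90}),
    ({"label": "POSITIVE", "score": 0.70}, {"label": "NEGATIVE", "score": 0.60}),
]

gap, p_value = paired_gap_test(example_pairs)
print(f"Mean gap in P(POSITIVE): {gap:+.3f}, p-value: {p_value:.3f}")
```

With real data, `pairs` would be built from `classifier(male_sent)[0]` and `classifier(female_sent)[0]`. With only six pairs, the smallest two-sided p-value this test can produce is about 0.03, which is one reason a handful of sentences cannot establish or rule out bias.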

## Part 5: System Cards for Production

**System cards** document entire AI systems (not just individual models).

### 5.1 What is a System Card?

A **system card** describes the complete AI system, including:
- Models used
- Data pipelines
- Pre/post-processing
- Human-in-the-loop components
- Monitoring and governance
- Deployment context

**Example**: A content moderation system might use:
- Text classification model (detect toxic content)
- Image classification model (detect inappropriate images)
- Human reviewers (for borderline cases)
- Appeal process (for false positives)
- Monitoring dashboard (track performance)

The **system card** would document ALL of these components, not just the models.

### 5.2 System Card Template

```markdown
# System Card: [System Name]

## System Overview
- **Purpose**: What does the system do?
- **Users**: Who uses it?
- **Deployment context**: Where is it deployed?

## System Components
- **Models**: List all ML models
  - Model A: [purpose, version, model card link]
  - Model B: [purpose, version, model card link]
- **Data sources**: Input data and origins
- **Processing**: Pre/post-processing steps
- **Human oversight**: Where humans are involved

## Performance
- **Metrics**: System-level metrics
- **Benchmarks**: Performance on test sets
- **Monitoring**: How is performance tracked?

## Limitations
- **Known failures**: What doesn't work?
- **Edge cases**: Difficult scenarios
- **Scope**: What's out of scope?

## Ethical Considerations
- **Biases**: System-level biases
- **Harms**: Potential negative impacts
- **Mitigations**: How are risks reduced?

## Governance
- **Ownership**: Who is responsible?
- **Review process**: How are decisions audited?
- **Updates**: How often is the system updated?
- **Incident response**: What happens when things go wrong?

## Compliance
- **Regulations**: Which laws apply?
- **Privacy**: How is data protected?
- **Transparency**: What is disclosed to users?
```

### 5.3 Example System Card

Here's a simplified system card for our hypothetical review analysis system:

In [None]:
system_card_example = """
# System Card: Product Review Analysis System
Version 1.0 | Last Updated: 2025-01-15

## System Overview
Purpose: Automatically classify customer reviews and prioritize negative reviews
         for customer service team review.
Users: Customer service team (internal)
Scale: 10,000 reviews/day across 50 product categories

## System Components

Models:
• Language Detection: "papluca/xlm-roberta-base-language-detection"
• Sentiment (English): "distilbert-base-uncased-finetuned-sst-2-english"
• Sentiment (Multilingual): "cardiffnlp/twitter-xlm-roberta-base-sentiment"

Data Flow:
1. Review text → Language detection
2. Route to appropriate sentiment model
3. Classify sentiment + confidence score
4. If negative with >0.7 confidence → flag for human review
5. If confidence is 0.7 or lower (uncertain) → flag for review
6. If positive with >0.7 → auto-acknowledge

Human Oversight:
• All flagged reviews reviewed by customer service team
• Weekly audit of 100 random positive classifications
• Monthly bias and fairness review

## Performance Metrics
Target: 90% accuracy on English reviews, 85% on multilingual
Current: 88% on English, 82% on multilingual (as of 2025-01-15)

False Negative Rate: <5% (missing critical negative reviews)
Current: 3.2%

Human Review Rate: 25-30% of all reviews
Current: 28%

## Limitations
• Sarcasm detection poor ("Oh great, another broken product!")
• Struggles with code-switching (mixing languages)
• Domain shift: Models trained on movie reviews (SST-2) and tweets, not product reviews
• Lower accuracy on:
  - Technical products (specialized vocabulary)
  - Low-resource languages
  - Very short reviews (<10 words)

## Ethical Considerations

Biases:
• English reviews processed more accurately than others
• May miss culturally-specific expressions of dissatisfaction

Mitigations:
• Conservative threshold (flag uncertain cases)
• Mandatory human review of all flagged reviews
• Regular bias audits across language groups
• Continuous fine-tuning on misclassified examples

Privacy:
• Reviews contain PII (names, addresses in text)
• PII redacted before model processing
• Reviewers see full text (with PII)

## Governance

Ownership: Customer Experience Team
Technical Lead: AI/ML Team

Review Process:
• Weekly performance review
• Monthly fairness audit
• Quarterly model updates

Incident Response:
• Critical customer issue missed → Immediate review of case
• Pattern of misclassification → Temporary increased human review
• Model downtime → Fallback to 100% human review

## Compliance

Privacy: GDPR-compliant (data retention, right to explanation)
Transparency: Customers informed of automated processing
Appeals: Customers can request human review of auto-responses

## Change Log
- 2025-01-15: v1.0 Initial deployment
"""

print(system_card_example)
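
The routing rules in the card's "Data Flow" section are small enough to express as one function, which also makes them easy to unit test. This is a sketch under two assumptions that the card does not state: labels are lowercase strings, and any prediction at or below the 0.7 threshold counts as uncertain so that no confidence value falls between the bands. The `route_review` name and the action strings are hypothetical.

```python
def route_review(label, confidence, threshold=0.7):
    """Map a sentiment prediction to an action.

    label: predicted sentiment, e.g. "negative" or "positive".
    confidence: model score for that label, between 0 and 1.
    """
    if confidence <= threshold:
        return "flag_for_review"   # uncertain: a human decides
    if label == "negative":
        return "flag_for_review"   # confident negative: prioritize
    if label == "positive":
        return "auto_acknowledge"  # confident positive
    return "flag_for_review"       # any other label: stay conservative


for label, confidence in [("negative", 0.95), ("positive", 0.95), ("positive", 0.60)]:
    print(label, confidence, "->", route_review(label, confidence))
# negative 0.95 -> flag_for_review
# positive 0.95 -> auto_acknowledge
# positive 0.6 -> flag_for_review
```

Defaulting to human review for unknown labels reflects the card's conservative stance: the only path that skips a human is a confident positive.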

## Part 6: Model Card Comparison Tool

In [None]:
def compare_models(model_ids):
    """
    Compare multiple models side-by-side.
    """
    print("\nMODEL COMPARISON")
    print("="*100)
    print(f"{'Model':<50} {'Task':<20} {'Downloads':<15} {'Likes'}")
    print("-"*100)
    
    for model_id in model_ids:
        try:
            info = model_info(model_id)
            print(f"{model_id:<50} {info.pipeline_tag or 'N/A':<20} {info.downloads or 0:<15,} {info.likes or 0}")
        except Exception as e:
            # Show the error type so a mistyped model ID can be told apart from a network failure
            print(f"{model_id:<50} {'ERROR':<20} {'-':<15} {type(e).__name__}")
    
    print("="*100)

# Example: Compare sentiment analysis models
sentiment_models = [
    "distilbert-base-uncased-finetuned-sst-2-english",
    "cardiffnlp/twitter-roberta-base-sentiment",
    "nlptown/bert-base-multilingual-uncased-sentiment"
]

compare_models(sentiment_models)

print("\nQuestions to ask when comparing:")
print("• Which has better documentation (model card)?")
print("• Which is more popular (downloads/likes)?")
print("• Which supports my languages?")
print("• Which is more recently updated?")
print("• Which has better performance on my domain?")

## Exercises

1. **Model Card Deep Dive**: Choose a model from notebooks 01-10. Read its full model card. Create a one-page summary covering: intended use, limitations, biases, and whether you'd use it in production.

2. **Bias Testing**: Pick a text classification model and design 10 test cases to check for gender, racial, or other biases. Run the tests and document findings.

3. **Evaluation Checklist**: For a hypothetical project (e.g., medical chatbot, resume screener, content moderator), fill out the Model Evaluation Checklist from section 3.1. Would you deploy? Why or why not?

4. **System Card**: Design a simple AI system (e.g., spam filter, product recommender). Write a system card following the template in section 5.2.

5. **Red Flags**: Find 3 models on HuggingFace Hub that have poor or missing model cards. Document why these are concerning and what information is missing.

6. **Fairness Audit**: Pick a model and define 3 demographic groups (e.g., age ranges, languages, genders). Design tests to measure performance disparities across these groups.

**Bonus Challenge**: Find a model with an excellent model card and one with a poor/missing card for the same task. Write a comparison explaining why documentation matters.

## Key Takeaways

**Model Cards**:
- Essential documentation for ML models
- Cover intended use, limitations, biases, and ethical considerations
- ALWAYS read before deploying a model

**Responsible AI**:
- Transparency, fairness, accountability, privacy, safety, reliability
- Test for biases before deployment
- Implement human oversight for high-stakes decisions

**Evaluation**:
- Use the Model Evaluation Checklist
- Test on YOUR data, not just benchmarks
- Know when NOT to deploy

**System Cards**:
- Document entire AI systems, not just models
- Include governance and incident response
- Essential for production deployments

**Red Flags**:
- Missing model cards
- Unknown training data
- High-stakes use without validation
- Biases that harm vulnerable groups

**Best Practice**: If you can't explain how your model works, what its limitations are, and how you'll handle failures - **don't deploy it**.

## Resources

**Model Card Resources**:
- [Model Cards for Model Reporting (Original Paper)](https://arxiv.org/abs/1810.03993)
- [HuggingFace Model Cards Guide](https://huggingface.co/docs/hub/model-cards)
- [System Cards (OpenAI)](https://openai.com/research/gpt-4-system-card)

**Responsible AI Frameworks**:
- [Google's Responsible AI Practices](https://ai.google/responsibilities/responsible-ai-practices/)
- [Microsoft's Responsible AI Principles](https://www.microsoft.com/en-us/ai/responsible-ai)
- [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework)

**Bias and Fairness**:
- [AI Fairness 360 (IBM)](https://aif360.mybluemix.net/)
- [What-If Tool (Google)](https://pair-code.github.io/what-if-tool/)
- [Fairlearn (Microsoft)](https://fairlearn.org/)

**Regulations**:
- [EU AI Act](https://artificialintelligenceact.eu/)
- [GDPR (Data Protection)](https://gdpr.eu/)
- [California CCPA](https://oag.ca.gov/privacy/ccpa)

**Academic Resources**:
- [ACM Conference on Fairness, Accountability, and Transparency (FAccT)](https://facctconference.org/)
- [Partnership on AI](https://partnershiponai.org/)

**Next Steps**:
- Notebook 13: Fine-tuning with LoRA (Text Generation)
- Apply responsible AI principles when fine-tuning your own models