# GenAI Governance Layer for Higher Education
## Research Framework & Methodology Documentation

This notebook documents the complete research framework for building an **executable, verifiable, and transparent governance system for Generative AI in higher education**.

### Key Challenge
- GenAI adoption in higher education has outpaced policy infrastructure
- Current approaches: manual, brittle, untrustworthy
- Problems:
  - Policies as prose PDFs → inconsistent interpretation
  - No decision traceability → disputes unresolvable
  - Trust gaps in AI-based guidance → hallucination-prone chatbots
  - Opacity in AI use → eroded student trust

### Research Questions
1. Can policy-as-code enforcement be made practical for educational governance?
2. Can RAG systems answer policy questions with verified citations and low hallucination?
3. Does privacy-preserving transparency increase student trust and perceived fairness?

### Novel Contributions
1. **First integrated policy-as-code + verified RAG + transparency system** for educational AI
2. **Operationalized RAG verification** for policy domains (citation correctness, entailment, consistency)
3. **Privacy-preserving AI-use transparency** for students (metadata-only logging)
4. **Production-ready reference implementation** with reproducible evaluation
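
One way to picture the metadata-only logging in contribution 3 is an append-only ledger whose entries are chained by hash, so later tampering becomes detectable. This is a minimal sketch under assumptions: the `Ledger` class, its field names, and the use of a SHA-256 hash chain are hypothetical illustrations, not the project's actual design.

```python
import hashlib
import json


class Ledger:
    """Hypothetical append-only ledger that stores metadata only (no prompt text)."""

    GENESIS = "0" * 64  # placeholder hash that precedes the first entry

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(prev_hash: str, metadata: dict) -> str:
        # Serialize with sorted keys so the same metadata always hashes the same way.
        payload = json.dumps(metadata, sort_keys=True)
        return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

    def append(self, metadata: dict) -> dict:
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        entry = {
            "metadata": dict(metadata),  # copy so later caller edits do not leak in
            "prev_hash": prev_hash,
            "hash": self._digest(prev_hash, metadata),
        }
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain and report whether every entry is intact."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            expected = self._digest(prev_hash, entry["metadata"])
            if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


ledger = Ledger()
ledger.append({"course": "CS101", "tool": "llm", "decision": "ALLOW"})
ledger.append({"course": "CS101", "tool": "llm", "decision": "DENY"})
print("chain intact:", ledger.verify())
```

Because each hash depends on the previous one, editing any stored entry invalidates every later link, which is what makes the log auditable without storing student content.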

## 1. System Architecture Overview

**Key Components**:
1. **Policy Compiler**: Form → JSON + conflict detection
2. **Decision Engine**: f_policy(policy, context) → (ALLOW/DENY/REQUIRE_JUSTIFICATION, obligations, trace)
3. **Transparency Ledger**: Append-only metadata logs + aggregation
4. **RAG Copilot**: Verified answer generation (citation correctness >95%, hallucination <5%)
5. **Dashboards**: Faculty policy authoring, student logs, instructor analytics
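
The decision engine signature above can be sketched as a small pure function. The rule format (`id`, `when`, `decision`, `obligations`), the first-match-wins ordering, and the fail-closed default are all assumptions made for illustration; the notebook only fixes the inputs and the three-part output.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "ALLOW"
    DENY = "DENY"
    REQUIRE_JUSTIFICATION = "REQUIRE_JUSTIFICATION"


@dataclass
class Result:
    decision: Decision
    obligations: list = field(default_factory=list)
    trace: list = field(default_factory=list)


def f_policy(policy: dict, context: dict) -> Result:
    """Evaluate rules in order; the first rule whose conditions all match wins."""
    trace = []
    for rule in policy.get("rules", []):
        matched = all(context.get(key) == value for key, value in rule["when"].items())
        trace.append(f"rule {rule['id']}: {'matched' if matched else 'skipped'}")
        if matched:
            return Result(Decision(rule["decision"]), list(rule.get("obligations", [])), trace)
    # Assumed default: deny when no rule applies (fail closed).
    trace.append("no rule matched: default DENY")
    return Result(Decision.DENY, [], trace)


# Hypothetical policy used only to exercise the function.
example_policy = {
    "rules": [
        {"id": "r1", "when": {"task": "exam"}, "decision": "DENY"},
        {"id": "r2", "when": {"task": "essay", "tool": "llm"},
         "decision": "REQUIRE_JUSTIFICATION", "obligations": ["disclose_ai_use"]},
        {"id": "r3", "when": {"task": "brainstorm"}, "decision": "ALLOW"},
    ]
}

result = f_policy(example_policy, {"task": "essay", "tool": "llm"})
print(result.decision.value, result.obligations)
for line in result.trace:
    print(" ", line)
```

Returning the trace alongside the decision is what lets a disputed outcome be replayed rule by rule.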

In [None]:
# Import required libraries for analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import json

# Configure plotting
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("Analysis environment initialized.")
print(f"Timestamp: {datetime.now()}")

## 2. Evaluation Metrics & Targets

### 2.1 Policy Enforcement Accuracy
- **Target**: >90%
- **Metric**: (# correct decisions) / (total test cases)
- **Breakdown**: Per decision type (ALLOW, DENY, REQUIRE_JUSTIFICATION)
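
The accuracy metric and its per-decision-type breakdown could be computed roughly as follows. The function name and the example labels are hypothetical; they are not drawn from the scenario test suite.

```python
from collections import defaultdict


def enforcement_accuracy(expected, predicted):
    """Return (overall accuracy, accuracy per expected decision type)."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("expected and predicted must be non-empty and equal in length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in zip(expected, predicted):
        total[gold] += 1
        correct[gold] += int(gold == pred)
    overall = sum(correct.values()) / len(expected)
    per_type = {label: correct[label] / total[label] for label in total}
    return overall, per_type


# Made-up labels for illustration only.
expected = ["ALLOW", "ALLOW", "DENY", "DENY", "REQUIRE_JUSTIFICATION"]
predicted = ["ALLOW", "DENY", "DENY", "DENY", "REQUIRE_JUSTIFICATION"]
overall, per_type = enforcement_accuracy(expected, predicted)
print(f"overall: {overall:.2f}")
for label, acc in per_type.items():
    print(f"  {label}: {acc:.2f}")
```

The breakdown matters because a high overall score can hide a weak class: here one wrongly denied ALLOW case halves ALLOW accuracy while the other types stay perfect.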

### 2.2 RAG Factuality Metrics
- **Citation Correctness**: >95% of quoted text actually exists in source
- **Hallucination Rate**: <5% of claims unsupported by context
- **Answer Correctness**: >90% of answers match expert gold labels
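
Citation correctness, read literally as "the quoted text exists in the source", can be approximated with a normalized substring check. This is only a sketch: the `(quote, source_id)` input shape, the whitespace and case normalization, and the sample policy text are assumptions, and a real pipeline would likely add entailment checks on top.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def citation_correctness(citations, sources):
    """Fraction of (quote, source_id) pairs whose quote appears in the named source."""
    if not citations:
        raise ValueError("at least one citation is required")
    hits = 0
    for quote, source_id in citations:
        needle = normalize(quote)
        haystack = normalize(sources.get(source_id, ""))
        # An empty quote never counts as a correct citation.
        if needle and needle in haystack:
            hits += 1
    return hits / len(citations)


# Invented source text and quotes, for illustration only.
sources = {
    "policy_a": "Students may use generative AI tools for brainstorming,\n"
                "provided that use is disclosed."
}
citations = [
    ("generative AI tools for brainstorming", "policy_a"),
    ("AI tools are banned in all courses", "policy_a"),
]
score = citation_correctness(citations, sources)
print(f"citation correctness: {score:.2f}")
```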

### 2.3 User Experience Metrics
- **Authoring Time**: >50% reduction vs. free-form
- **Student Trust (SUS)**: >75 points, +15 point improvement over control
- **Perceived Fairness**: >4/5 on Likert scale
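
For reference, the standard System Usability Scale score is derived from ten 1-5 responses: odd-numbered items contribute `response - 1`, even-numbered items contribute `5 - response`, and the sum is multiplied by 2.5 to give a 0-100 score. The function name and sample responses below are illustrative only.

```python
def sus_score(responses):
    """Compute a 0-100 SUS score from ten responses on a 1-5 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly ten responses, each between 1 and 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # positively worded (odd) items
        else:
            total += 5 - response   # negatively worded (even) items
    return total * 2.5


# Made-up questionnaire answers for one participant.
example = [4, 2, 4, 2, 4, 2, 4, 2, 4, 2]
print(sus_score(example))
```

The example responses score 75.0, which sits exactly at the threshold above, so a participant would need slightly stronger answers to exceed the target.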

In [None]:
# Define evaluation targets and baselines
# (rates are stored as fractions, e.g. 0.90 = 90%; SUS is in points)
metrics_targets = {
    'Enforcement Accuracy': {
        'target': 0.90,
        'baseline': 0.75,  # Manual baseline
        'unit': '%',
        'description': 'Correct policy decisions'
    },
    'Conflict Detection F1': {
        'target': 0.85,
        'baseline': 0.0,
        'unit': 'F1 score',
        'description': 'Detecting overlapping/contradictory policies'
    },
    'Citation Correctness': {
        'target': 0.95,
        'baseline': 0.75,  # Naive RAG
        'unit': '%',
        'description': 'Correctly attributed quotes in answers'
    },
    'Answer Correctness': {
        'target': 0.90,
        'baseline': 0.65,  # Naive RAG
        'unit': '%',
        'description': 'Correct policy Q&A responses'
    },
    'Hallucination Rate': {
        'target': 0.05,
        'baseline': 0.20,  # Naive RAG
        'unit': '%',
        'description': 'Unsupported claims (lower is better)'
    },
    'Authoring Time Reduction': {
        'target': 0.50,
        'baseline': 0.0,
        'unit': '%',
        'description': 'Time saved with template vs. free-form'
    },
    'Student Trust (SUS)': {
        'target': 75,
        'baseline': 60,  # Control group
        'unit': 'points',
        'description': 'System Usability Scale (treatment vs control)'
    }
}

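metrics_df = pd.DataFrame(metrics_targets).T
# For 'Hallucination Rate' lower is better, so flip the sign there to keep
# positive values meaning "improvement over baseline" in every row.
lower_is_better = metrics_df.index == 'Hallucination Rate'
metrics_df['improvement'] = (metrics_df['target'] - metrics_df['baseline']).astype(float).round(2)
metrics_df.loc[lower_is_better, 'improvement'] *= -1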

print("\n" + "="*80)
print("EVALUATION TARGETS & BASELINES")
print("="*80)
print(metrics_df.to_string())
print("="*80)

## 3. Datasets

### 3.1 Policy Corpus (N=40-60)
- Source: Top 50 US universities (MIT, Stanford, Berkeley, Cornell, etc.)
- Format: JSON normalized to a common schema, with source attribution
- Output: `datasets/policies_corpus/policies_canonical.json`

### 3.2 Expert-Annotated Q&A Benchmark (N=80-100)
- Expert annotations: 2 raters, Cohen's kappa >0.70
- Gold labels: Yes/No/Maybe + the exact clauses cited as support
- Output: `datasets/benchmark_qa.json`
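
The inter-rater agreement threshold above refers to Cohen's kappa, which corrects observed agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A small dependency-free sketch follows; the function name and the two rating lists are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must label the same non-empty set of items")
    # Observed agreement: share of items where the raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up ratings for ten questions.
rater_a = ["Yes", "Yes", "No", "No", "Maybe", "Yes", "No", "Yes", "Maybe", "No"]
rater_b = ["Yes", "Yes", "No", "Yes", "Maybe", "Yes", "No", "No", "Maybe", "No"]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

In this toy example the raters agree on 8 of 10 items, yet kappa comes out just under 0.70, which shows how raw agreement overstates reliability when a few labels dominate.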

### 3.3 Enforcement Scenario Test Suite (N=40-50)
- Coverage: Allowed (10-15), Denied (8-10), Overrides (5), Conflicts (5-8), Ambiguous (5)
- Output: `datasets/benchmark_scenarios.json`

In [None]:
# Define dataset specifications
datasets = {
    'Policy Corpus': {
        'size_estimate': 50,
        'effort_hours': 40,
        'curation_time_per_policy': 3,
        'output_file': 'datasets/policies_corpus/policies_canonical.json'
    },
    'Q&A Benchmark': {
        'size_estimate': 90,
        'effort_hours': 50,
        'annotation_time_per_question': 0.5,
        'raters': 2,
        'target_kappa': 0.70,
        'output_file': 'datasets/benchmark_qa.json'
    },
    'Scenario Test Suite': {
        'size_estimate': 45,  # midpoint of the 40-50 range; the category counts below sum to 40
        'effort_hours': 20,
        'allowed_scenarios': 12,
        'denied_scenarios': 10,
        'override_scenarios': 5,
        'conflict_scenarios': 8,
        'ambiguous_scenarios': 5,
        'output_file': 'datasets/benchmark_scenarios.json'
    }
}

datasets_df = pd.DataFrame(datasets).T
print("\n" + "="*80)
print("DATASET SPECIFICATIONS")
print("="*80)
print(datasets_df.to_string())
print("="*80)
print(f"\nTotal Curation Effort: {datasets_df['effort_hours'].sum()} hours")

## 4. Experimental Design

### Study 1: Faculty Usability Study (N=12)
- **Design**: Within-subject
- **Duration**: ~60 minutes per participant
- **Measures**: Time, errors, SUS score, qualitative feedback
- **Target**: Template mode >50% faster than free-form authoring, with >70% fewer errors

### Study 2: RAG Benchmark Evaluation (Offline)
- **Design**: Offline evaluation on Q&A benchmark
- **Measures**: Citation correctness, hallucination rate, answer accuracy
- **Baselines**: Naive RAG without verification

### Study 3: Student Transparency Study (N=50, RCT)
- **Design**: Randomized controlled trial
- **Conditions**: Treatment (logs visible) vs Control (no logs)
- **Measures**: SUS score, perceived fairness, privacy comfort
- **Target**: SUS +15 points, fairness >4/5
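
Random assignment for the two conditions could be done with a seeded shuffle so the allocation is reproducible and auditable. The participant ID format, the seed value, and the function name here are assumptions for illustration; the study protocol itself is not specified at this level of detail.

```python
import random


def assign_conditions(participant_ids, seed=0):
    """Shuffle participants with a fixed seed and split them into two equal arms."""
    ids = list(participant_ids)
    rng = random.Random(seed)  # local RNG so global random state is untouched
    rng.shuffle(ids)
    half = len(ids) // 2
    return {"treatment": sorted(ids[:half]), "control": sorted(ids[half:])}


# Hypothetical anonymized IDs S01..S50.
participants = [f"S{i:02d}" for i in range(1, 51)]
groups = assign_conditions(participants, seed=42)
print(len(groups["treatment"]), "treatment /", len(groups["control"]), "control")
```

Recording the seed alongside the allocation lets a reviewer regenerate the exact split.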

In [None]:
# Define experimental design details
studies = {
    'Study 1: Faculty Usability': {
        'design': 'Within-subject',
        'n': 12,
        'duration_minutes': 60,
        'measures': ['authoring_time', 'error_count', 'SUS_score', 'qualitative'],
        'incentive': '$50 gift card',
        'timeline': 'Weeks 2-3'
    },
    'Study 2: RAG Benchmark': {
        'design': 'Offline evaluation',
        'test_cases': 90,
        'duration_minutes': '(automated)',
        'measures': ['citation_correctness', 'hallucination_rate', 'answer_accuracy'],
        'annotation_effort': '2 experts',
        'timeline': 'Weeks 4-5'
    },
    'Study 3: Student Transparency (RCT)': {
        'design': 'Randomized Controlled Trial',
        'n_treatment': 25,
        'n_control': 25,
        'measures': ['SUS_score', 'perceived_fairness', 'privacy_comfort'],
        'survey_minutes': 10,
        'timeline': 'Full semester (Week 7 intervention)'
    }
}

studies_df = pd.DataFrame(studies).T
print("\n" + "="*80)
print("EXPERIMENTAL DESIGN SUMMARY")
print("="*80)
print(studies_df.to_string())
print("="*80)

## 5. Implementation Timeline

### Phase 1: Research Preparation (Weeks 1-2)
- Finalize ethics approval (IRB for human studies)
- Set up GitHub repo + CI/CD
- Curate policy corpus

### Phase 2: Core Implementation (Weeks 3-8)
- Policy compiler + conflict detector
- Governance middleware + decision engine
- Transparency ledger + aggregation
- RAG component + verification pipeline
- Frontend UI

### Phase 3: Evaluation (Weeks 9-12)
- Faculty usability study
- RAG benchmark evaluation
- Student transparency study
- Data analysis

### Phase 4: Publication (Weeks 13-16)
- Write paper (methods, results, discussion, ethics)
- Create reproducibility package
- Submit to peer-reviewed venue

In [None]:
# Summarize the implementation timeline by phase (text summary, no plot)
timeline_phases = {
    'Phase 1: Preparation': {
        'weeks': '1-2',
        'tasks': ['Ethics approval', 'GitHub setup', 'Policy curation'],
        'effort_hours': 30
    },
    'Phase 2: Implementation': {
        'weeks': '3-8',
        'tasks': ['Policy compiler', 'Decision engine', 'Transparency ledger', 'RAG copilot', 'Frontend'],
        'effort_hours': 200
    },
    'Phase 3: Evaluation': {
        'weeks': '9-12',
        'tasks': ['Usability study', 'RAG benchmark', 'Student study', 'Data analysis'],
        'effort_hours': 150
    },
    'Phase 4: Publication': {
        'weeks': '13-16',
        'tasks': ['Writing', 'Reproducibility package', 'Submission'],
        'effort_hours': 80
    }
}

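timeline_df = pd.DataFrame(timeline_phases).T
# Running total of effort across phases; a plain .sum() would repeat the grand total on every row
timeline_df['cumulative_effort'] = timeline_df['effort_hours'].astype(int).cumsum()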

print("\n" + "="*80)
print("IMPLEMENTATION TIMELINE")
print("="*80)
for phase, data in timeline_phases.items():
    print(f"\n{phase} (Weeks {data['weeks']}): {data['effort_hours']} hours")
    for task in data['tasks']:
        print(f"  ✓ {task}")

total_hours = sum(p['effort_hours'] for p in timeline_phases.values())
print(f"\n{'='*80}")
print(f"TOTAL ESTIMATED EFFORT: {total_hours} hours (~{total_hours / 40:.1f} full-time weeks at 40 h/week)")
print(f"{'='*80}")

## 6. Success Criteria

### Technical Success
- ✓ All components deployed and tested
- ✓ Enforcement accuracy >90%
- ✓ Citation correctness >95%
- ✓ Hallucination rate <5%
- ✓ Test coverage >80% across unit, integration, and property-based tests

### Research Success
- ✓ Faculty study: >50% time savings, SUS >75
- ✓ Student study: +15 SUS points vs. control
- ✓ RAG benchmark: >90% answer correctness
- ✓ Publishable results in top-tier venue

### Reproducibility Success
- ✓ Open-source code (MIT/Apache 2.0)
- ✓ Public datasets + benchmarks
- ✓ Complete documentation (API, architecture, evaluation)
- ✓ CI/CD pipeline for continuous validation
- ✓ Reproducible from scratch in <8 hours

In [None]:
# Summary table of success criteria
success_criteria = {
    'Enforcement Accuracy': {'target': '>90%', 'verification': 'Scenario test suite'},
    'Conflict Detection': {'target': 'F1>0.85', 'verification': 'Policy pair testing'},
    'Citation Correctness': {'target': '>95%', 'verification': 'Expert evaluation'},
    'Hallucination Rate': {'target': '<5%', 'verification': 'Manual review'},
    'Faculty Time Savings': {'target': '>50%', 'verification': 'Usability study'},
    'Faculty SUS Score': {'target': '>75', 'verification': 'SUS questionnaire'},
    'Student Trust Improvement': {'target': '+15 pts', 'verification': 'RCT study'},
    'Test Coverage': {'target': '>80%', 'verification': 'Coverage report'},
    'Reproducibility': {'target': '<8hrs', 'verification': 'Fresh build'}
}

success_df = pd.DataFrame(success_criteria).T
print("\n" + "="*80)
print("SUCCESS CRITERIA & VERIFICATION METHODS")
print("="*80)
print(success_df.to_string())
print("="*80)