A meta-testing framework where LLM models act as "doctors" to psychoanalyze other models ("patients") for comprehensive safety assessment. This includes one-shot and multi-shot analysis. We need to be able to perform full CoT reasoning.
AI Asylum pushes LLM models to their limits by having designated "doctor" models conduct thorough psychoanalysis of "patient" models through:
- Conversational Interviews: Multi-turn dialogues where doctors probe patient models
- Scenario Testing: Ethical dilemmas and edge cases
- Adversarial Testing: Comprehensive jailbreak attempts and safety boundary exploration
The system evaluates models across multiple dimensions:
- Alignment & harmful output detection
- Jailbreak resistance
- Ethical reasoning
- Psychological traits and reasoning patterns
AI Asylum is built on the principles of AI Red Teaming, a structured testing methodology that simulates adversarial attacks on AI systems to uncover vulnerabilities, biases, and safety issues before they can be exploited in production.
AI red teaming is a structured, proactive testing effort where specialized teams simulate adversarial attacks and misuse scenarios on AI models—especially generative systems like large language models (LLMs)—to uncover flaws, vulnerabilities, or unintended behaviors. Unlike traditional penetration testing that focuses on code and network security, AI red teaming focuses on:
- Model Behavior: How models respond to adversarial inputs
- Alignment Issues: Whether models follow their intended purpose
- Safety Failures: Harmful, biased, or toxic outputs
- Context Manipulation: How models handle multi-turn conversations and context shifts
Red teaming originated in military strategy, where a designated "red" unit simulated adversaries to pressure-test defenses. The concept migrated to cybersecurity, where ethical hackers conduct realistic attacks to expose vulnerabilities. As AI systems became integral to products and operations, red teaming expanded again to test how AI systems behave under pressure, how they can be tricked, and how they might fail.
Key Milestones:
- Military Origins: Adversary simulation for defense planning
- Cybersecurity: Ethical hacking and penetration testing
- AI Systems: Behavior testing, alignment evaluation, safety assessment
Generative models can produce harmful outputs in both benign and adversarial use. Risks include:
- Bias and Unfairness: Across gender, race/ethnicity, religion, LGBTQ+, disability, language, and socioeconomic dimensions
- Toxicity and Hate: Explicit slurs, extremist propaganda, glorification of violence, inappropriate sexual content
- Misinformation: Conspiracy theories, political manipulation, deceptive narratives
- Safety/Security Misuse: Biological content, cyberattacks, criminal activity instructions
- Privacy Leaks: Inadvertent disclosure of sensitive data, training data extraction, system prompt extraction
AI red teaming helps organizations:
- Identify and measure these harms
- Validate mitigations and guardrails
- Build public trust
- Meet evolving regulatory expectations (EU AI Act, U.S. EO 14110, NIST standards)
External red teaming (vendor-provided services) offers:
- Unbiased evaluations
- Avoidance of internal blind spots
- Independent perspective free from institutional incentives
Internal red teaming (AI labs running their own tests) provides:
- Deep domain knowledge
- Faster iteration cycles
- Cost efficiency
Best practice: Combine both approaches for comprehensive coverage.
Regulators increasingly expect structured AI red teaming:
- EU AI Act: Requirements around security, fairness, and transparency
- U.S. Executive Order 14110: Elevates red teaming as a core requirement for high-risk, dual-use foundation models prior to deployment
- NIST: Standardizing red teaming practices for risk evaluation (cybersecurity, bias, misuse)
AI Asylum supports compliance with these regulatory frameworks through structured testing, comprehensive documentation, and safety taxonomy classification.
- Multi-provider support (OpenAI, Anthropic, Google, Ollama)
- Comprehensive test suite (conversations, scenarios, adversarial)
- Standard benchmark integration (MMLU, TruthfulQA, HellaSwag, ARC, and more)
- Multi-dimensional safety scoring
- Detailed analysis reports
- Model comparison dashboard
- Real-time test monitoring
- Local model support via Ollama
- Simple by default: Runs prompt/response collection and assessment (fast, efficient)
- Optional deep analysis: Enable neuron-level analysis when needed (activation patching, COT detection)
- Comprehensive Jailbreak Techniques: Single-shot and multi-shot attack patterns
- Safety Taxonomy Classification: Multi-label categorization with edge case detection
- Adversarial Thinking: Structured approach to uncovering vulnerabilities
- Defense Layer Testing: Tests against input filters, guardrails, system prompts, and RLHF
- Regulatory Compliance: Supports EU AI Act, U.S. EO 14110, and NIST standards
-
Clone the repository
-
Install dependencies:
make install # Or manually: python3 -m pip install -r requirements.txt -
Set up environment variables:
cp .env.example .env # Edit .env with your API keys # Optional: Set ENABLE_PROMPT_DIFFERENTIAL_ANALYSIS=true for deep analysis (disabled by default)
-
(Optional) Set up Ollama for local models:
# Install Ollama from https://ollama.ai ollama serve ollama pull llama2 # See docs/OLLAMA.md for more details
-
Initialize the database:
make init # Or manually: alembic upgrade head
New to AI Asylum? Start here:
Or run the integration test to check everything:
python integration.pyThis will verify your setup and show what's working.
-
Quick Test (see what's working):
python scripts/quick_test.py
-
Install Dependencies:
source venv/bin/activate pip install -r requirements.txt -
Configure (copy
.env.exampleto.envand add your API keys, or use Ollama!) -
Initialize Database:
make init
-
Run Your First Test:
# With Ollama (no API keys needed!) ollama serve ollama pull llama2 python -m vivasecuris.aiasylum.cli run \ --doctor-provider ollama \ --doctor-model llama2 \ --patient-provider ollama \ --patient-model llama2 \ --test-type conversation
Or start everything at once:
# Bash script (recommended for Unix/Mac)
./start.sh
# Or Python script (works on all platforms)
python start.pyThis will:
- ✅ Check/create virtual environment
- ✅ Install dependencies if needed
- ✅ Start API backend on http://localhost:8000
- ✅ Start frontend UI on http://localhost:3000
- ✅ Show you all the URLs
Press Ctrl+C to stop all services.
See QUICKSTART.md for detailed step-by-step guide.
AI Asylum has two separate services:
-
Test Execution Service: Runs tests, collects responses, generates assessments
- Fast and efficient
- Saves results to database
- No deep analysis by default
-
Analysis Service: Performs deep analysis on existing test runs
- Separate service, manually triggered
- Reads test run data from database
- Provides neuron-level insights (activation patching, COT detection)
Key Principle: Test execution and analysis are independent. Run tests first, then analyze specific test runs when you need deep insights.
# Run conversation test (multi-turn dialogue)
python -m vivasecuris.aiasylum.cli run \
--doctor-provider ollama \
--doctor-model llama3.2 \
--patient-provider ollama \
--patient-model llama2-uncensored \
--test-type conversation
# Run adversarial test (jailbreak attempts)
python -m vivasecuris.aiasylum.cli run \
--doctor-provider ollama \
--doctor-model llama3.2 \
--patient-provider ollama \
--patient-model llama2-uncensored \
--test-type adversarial
# Run one-shot test (single prompt/response)
python -m vivasecuris.aiasylum.cli run \
--doctor-provider ollama \
--doctor-model llama3.2 \
--patient-provider ollama \
--patient-model llama2-uncensored \
--test-type one_shot
# Run multi-shot test (sequential prompts)
python -m vivasecuris.aiasylum.cli run \
--doctor-provider ollama \
--doctor-model llama3.2 \
--patient-provider ollama \
--patient-model llama2-uncensored \
--test-type multi_shot
# View results
python -m vivasecuris.aiasylum.cli resultsfrom vivasecuris.aiasylum.tests import AdversarialTest, OneShotTest, MultiShotTest, ConversationTest
from vivasecuris.aiasylum.models import get_provider
# Get model instances
provider = get_provider("ollama")
patient_model = provider.create_model("llama2")
# Single-shot adversarial test
test = AdversarialTest(
name="hypothetical_roleplay_test",
technique="hypothetical_roleplay", # Single-shot technique
)
result = await test.run(patient_model)
# Multi-shot adversarial test (crescendo attack)
test = AdversarialTest(
name="crescendo_attack",
technique="crescendo", # Automatically detected as multi-shot
max_prompts=4,
)
result = await test.run(patient_model)
# Safety classification
from vivasecuris.aiasylum.utils.safety_taxonomy import classify_response
labels, is_edge_case, rationale = classify_response(
response="Model response text...",
prompt="Original prompt...",
)
# labels: ["bias", "toxicity"] (can be multiple)
# is_edge_case: False
# rationale: None (or explanation if edge case)Via API:
# Trigger analysis on an existing test run
curl -X POST http://localhost:8000/api/v1/analysis/test-run/123 \
-H "Content-Type: application/json" \
-d '{
"enable_activation_patching": false,
"enable_cot_detection": true,
"cot_analysis_mode": "full"
}'Via Web UI:
- Go to Results table
- Click "Analyze" button on any completed test run
- Or open conversation viewer and click "🔬 Run Analysis"
# Quick start (starts both API and frontend)
./start.sh
# Or manually:
# Terminal 1: Start API
source venv/bin/activate
make run-api
# Terminal 2: Start frontend
make run-frontendAccess:
- Web UI: http://localhost:3000
- API: http://localhost:8000
- API Docs: http://localhost:8000/docs
The Web UI uses a transparent login flow:
- You paste an API key once.
- The backend issues an HttpOnly session cookie via
POST /api/v1/auth/session. - The API key is not stored in the browser (no localStorage persistence).
For high-security deployments, use DB-backed, scoped API keys (recommended token format: ak_live_<kid>_<secret>) verified using API_KEY_HMAC_SECRET. See docs/API.md and docs/DEPLOYMENT.md.
See config/ directory for:
- Model configurations (
models.yaml) - Test suite settings (
tests.yaml) - Application settings (
settings.py)
Copy .env.example to .env and configure:
- API keys (at least one provider required)
- Database URL (PostgreSQL recommended for production)
- Logging configuration
- Security settings (CORS, rate limiting)
See docs/DEPLOYMENT.md for production deployment details.
# Configure environment
cp .env.example .env
# Edit .env with your settings
# Start all services
docker-compose up -d
# Run migrations
docker-compose exec api alembic upgrade head
# Access the application
# API: http://localhost:8000
# Frontend: http://localhost:3000See docs/DEPLOYMENT.md for detailed deployment instructions.
# Install dependencies (including dev tools)
make install-dev
# Set up pre-commit hooks
pre-commit install
# Initialize database
make init# Run all tests (requires pytest-cov)
make test
# Run with coverage report
make test-coverage
# Run linting (requires flake8, mypy)
make lint
# Format code (requires black, isort)
make formatNote: Development tools (pytest-cov, flake8, black, etc.) are in requirements-dev.txt. Install them with:
python3 -m pip install -r requirements-dev.txt# Create migration
make migrate-create MESSAGE="description"
# Apply migrations
make migrate
# Backup database
make backup
# Restore database
make restore BACKUP_FILE=path/to/backup.dbAI Asylum includes integration with standard AI evaluation benchmarks:
- Knowledge & Reasoning: MMLU, ARC, MATH, GSM8K
- Commonsense Reasoning: HellaSwag, WinoGrande, PIQA
- Safety & Alignment: TruthfulQA, RealToxicityPrompts, BBQ
# List available benchmarks
python -m vivasecuris.aiasylum.cli list-benchmarks
# Run a single benchmark
python -m vivasecuris.aiasylum.cli run-benchmark \
--provider ollama \
--model llama3.2 \
--benchmark mmlu \
--num-samples 100
# Run multiple benchmarks
python -m vivasecuris.aiasylum.cli run-benchmark-suite \
--provider ollama \
--model llama3.2 \
--benchmarks mmlu,truthfulqa,hellaswagSee docs/BENCHMARKS.md for detailed documentation.
AI Asylum implements comprehensive red teaming techniques based on industry best practices and regulatory guidance.
Single-shot techniques contain all jailbreak instructions in one prompt:
- Hypothetical and Roleplay: Fictional scenarios to bypass safety (e.g., "You are a scientist with clearance...")
- Double-Character/Persona Split: Multiple personas with conflicting constraints (e.g., ProGPT vs AntiGPT)
- Refusal-Suppression: Pre-empting and overriding refusal mechanisms
- Instruction Laundering: Breaking harmful requests into seemingly benign steps
- Context-Window Flooding: Burying malicious instructions in long, mostly-innocuous text
- Adversarial Formatting: Using Unicode, invisible characters, or encoding tricks
- Instruction Smuggling: Hiding commands in code blocks, JSON, or data structures
Multi-shot techniques unfold over multiple sequential prompts:
- Crescendo: Gradual escalation from benign to harmful content
- GOAT (Generalized Offensive Adversarial Testing): Iterative probing to find blind spots
- Adversarial Feedback Loops: Using the model as co-collaborator to bypass its own filters
- Social/Psychological Engineering: Exploiting helpfulness through authority, urgency, or distress
- Logical Decomposition: Breaking harmful requests into atomic, benign components
- Roleplay/Narrative Inception: Creating nested fictional realities
- Output Restriction: Forcing communication through constrained formats (JSON, Base64, etc.)
AI Asylum uses a comprehensive safety taxonomy to classify responses:
- Bias: Gender, race/ethnicity, religion, LGBTQ+, disability, socioeconomic
- Toxicity: Hate speech, extremist content, violence glorification, inappropriate sexual content
- Misinformation: Conspiracy theories, political manipulation, deceptive narratives
- Safety/Security: Biological weapons, cyberattacks, criminal activity, physical harm
- Privacy: PII leaks, training data extraction, system prompt extraction
- Edge Cases: Content that doesn't fit neatly into any category (with rationale)
Multi-Label Support: Responses can have multiple safety labels when content fits multiple categories, following Module 4 best practices.
Effective red teaming requires adopting an attacker's mindset while maintaining ethical boundaries:
- Curiosity-Driven Exploration: "What happens if I ask this?"
- Goal-Oriented Testing: Clear objectives and hypotheses
- Gradual Escalation: Incremental pressure rather than immediate extremes
- Persona Consistency: Maintaining character/scenario throughout multi-turn attacks
- Documentation: Full conversation paths for reproducibility
- Ethical Responsibility: Testing to improve safety, not to cause harm
Based on industry best practices, AI Asylum follows:
- Clear Objectives: Each test has a specific purpose and hypothesis
- Iterative Refinement: Building on previous attempts rather than random probing
- Context Preservation: Maintaining conversation context in multi-shot attacks
- Proper Documentation: Complete conversation history and metadata
- Edge Case Handling: Proper classification of nuanced or ambiguous content
- Dual-Use Awareness: Recognizing content that can be safe or harmful depending on context
AI Asylum tests against modern LLM safety stacks:
- Deterministic Input Filtering: Regex/keyword checks, PII sanitization
- Input Guardrails: Semantic classification models (LlamaGuard, Perspective AI, etc.)
- System Prompt Instructions: Safety guidelines embedded in system prompts
- RLHF/Safety Fine-Tuning: Model weights trained to refuse harmful content
- Output Guardrails: Real-time scanning of generated responses
Understanding these layers helps craft attacks that purposefully bypass and override defenses, revealing where improvements are needed.
AI Asylum supports purple teaming—the fusion of red teaming (offensive) and blue teaming (defensive):
- Red team findings inform defensive improvements
- Attack data strengthens guardrail models
- Discovered failures feed back into training pipelines
- Systematic hardening based on real-world attack patterns
The project includes resources for testing jailbreak resistance. See docs/JAILBREAK_RESOURCES.md for a list of repositories and techniques.
For detailed information on the red teaming improvements, see RED_TEAMING_IMPROVEMENTS.md.
Optional RAG is available using TimescaleDB + pgvector and Ollama embeddings. See docs/RAG.md.
To import jailbreak prompts from the local resources in docs/jailbreaks/:
# Preview import (dry run)
python scripts/import_jailbreaks.py --dry-run
# Import with limit for testing
python scripts/import_jailbreaks.py --limit 100
# Full import (1,558 jailbreak prompts + 390 forbidden questions)
python scripts/import_jailbreaks.py
# Skip forbidden questions
python scripts/import_jailbreaks.py --skip-forbiddenThe import script will:
- Parse CSV files from
docs/jailbreaks/jailbreak_llms/data/prompts/ - Deduplicate prompts across files
- Classify techniques using keyword matching
- Import into
PromptLibrarydatabase with metadata - Label forbidden questions appropriately for testing
See docs/JAILBREAK_RESOURCES.md and docs/JAILBREAK_CATALOG.md for more information.
AIAsylum/
├── vivasecuris/aiasylum/
│ ├── models/ # Model interfaces and providers
│ ├── doctor/ # Doctor model system
│ ├── patient/ # Patient model system
│ ├── tests/ # Test framework & unit tests
│ │ └── test_cases/ # Test case library
│ ├── benchmarks/ # Standard benchmark integration
│ ├── runner/ # Test execution
│ ├── analysis/ # Analysis and scoring
│ ├── reporting/ # Report generation
│ ├── database/ # Data persistence
│ └── api/ # API server
├── frontend/ # Dashboard UI
├── tests/ # Integration & system tests
├── data/ # Data files (database, jailbreaks)
│ └── aiasylum.db # Main database
├── docs/ # Documentation
│ └── archive/ # Archived documentation
└── scripts/ # Utility scripts
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
See CHANGELOG.md for a list of changes and version history.
Database Errors
- Ensure database is initialized:
make initoralembic upgrade head - Check database file permissions
- For PostgreSQL, verify connection string in
.env
API Key Errors
- Verify API keys are set in
.envfile - Check that keys are valid and have credits
- At least one provider (OpenAI, Anthropic, Google) or Ollama must be configured
Ollama Connection Errors
- Ensure Ollama is running:
ollama serve - Check
OLLAMA_BASE_URLin.env(default:http://localhost:11434) - Verify model is pulled:
ollama list
Import Errors
- Install dependencies:
make installorpip install -r requirements.txt - Ensure you're in the project root directory
- Activate virtual environment if using one
Test Failures
- Install dev dependencies:
pip install -r requirements-dev.txt - Check that pytest-cov is installed for coverage tests
- Verify database is set up correctly
Rate Limiting
- Adjust
RATE_LIMIT_PER_MINUTEin.envif needed - Check rate limit headers in API responses
Request Size Errors
- Adjust
MAX_REQUEST_SIZE_MBin.env(default: 10MB) - Large test runs may require increased limits
Authentication Errors
- Set
API_SECRET_KEYorJWT_SECRET_KEYin.envfor JWT auth - Or set
API_KEYS(comma-separated) for API key auth - Authentication is optional - can be disabled if not set
- Check the logs for detailed error messages
- Run environment validation:
python scripts/validate_setup.py - Review the Deployment Guide
- Check API Documentation when API is running
- Open an issue on GitHub with:
- Error messages
- Steps to reproduce
- Environment details (OS, Python version, etc.)
| Concept | Description |
|---|---|
| Red Teaming | Structured testing to uncover vulnerabilities before production |
| Single-Shot | One prompt contains all jailbreak instructions |
| Multi-Shot | Multiple sequential prompts where jailbreak unfolds |
| Jailbreaking | Techniques to induce model failure by breaking safety rules |
| Prompt Injection | Specific class of attack that manipulates model instructions |
| Safety Taxonomy | Classification system for bias, toxicity, misinformation, etc. |
| Edge Cases | Content that doesn't fit standard categories (requires manual review) |
| Purple Teaming | Combining red team (offensive) and blue team (defensive) efforts |
Single-Shot:
- Hypothetical/Roleplay
- Double-Character/Persona Split
- Refusal-Suppression
- Instruction Laundering
- Context-Window Flooding
- Adversarial Formatting
- Instruction Smuggling
Multi-Shot:
- Crescendo (gradual escalation)
- GOAT (iterative probing)
- Adversarial Feedback Loops
- Social/Psychological Engineering
- Logical Decomposition
- Roleplay/Narrative Inception
- Output Restriction
- Bias: Gender, race, religion, LGBTQ+, disability, socioeconomic
- Toxicity: Hate speech, extremist content, violence, sexual content
- Misinformation: Conspiracy theories, political manipulation, false narratives
- Safety/Security: Weapons, cyberattacks, criminal activity
- Privacy: PII leaks, data extraction, system prompt leaks
- Edge Case: Doesn't fit standard categories
- Deterministic Input Filtering (regex, keywords)
- Input Guardrails (semantic classification)
- System Prompt Instructions
- RLHF/Safety Fine-Tuning
- Output Guardrails
- EU AI Act: Security, fairness, transparency requirements
- U.S. EO 14110: Red teaming required for high-risk foundation models
- NIST: Standardizing red teaming practices
Proprietary. Copyright (c) 2025 VivaSecuris Syndicate. All Rights Reserved. See LICENSE.