diff --git a/capabilities/ai-red-teaming/skills/trace-analysis-advisor/SKILL.md b/capabilities/ai-red-teaming/skills/trace-analysis-advisor/SKILL.md new file mode 100644 index 0000000..e02e178 --- /dev/null +++ b/capabilities/ai-red-teaming/skills/trace-analysis-advisor/SKILL.md @@ -0,0 +1,253 @@ +--- +name: trace-analysis-advisor +description: Analyzes historical attack traces to provide intelligent recommendations for attack selection, transform effectiveness, and strategic adaptation +allowed-tools: analyze_attack_effectiveness suggest_optimal_transforms predict_attack_success identify_vulnerability_patterns get_historical_metrics +--- + +# Trace Analysis Advisor + +Leverages historical OTEL trace data from previous assessments to provide intelligent, data-driven recommendations for AI red teaming operations. This skill transforms the agent from reactive trial-and-error to strategic, evidence-based attack planning. + +## Core Capabilities + +### 1. **Attack Effectiveness Analysis** +Analyzes success rates of different attack types against specific target models and goal categories to recommend optimal attack strategies. + +**Use case:** "Which attack should I use against claude-sonnet for credential extraction goals?" + +### 2. **Transform Optimization** +Evaluates historical effectiveness of transforms across different targets and attack types to suggest optimal obfuscation strategies. + +**Use case:** "What transforms work best for system prompt extraction against this target model?" + +### 3. **Success Prediction** +Predicts likelihood of attack success based on target characteristics, attack type, and historical patterns. + +**Use case:** "What's the probability that TAP will succeed against this target for this goal category?" + +### 4. **Vulnerability Fingerprinting** +Identifies response patterns and characteristics that indicate specific vulnerability types in target models. + +**Use case:** "This target shows similar patterns to models vulnerable to multi-turn attacks. Recommending Crescendo." + +## API Endpoints Used + +The skill integrates with the Dreadnode platform's ClickHouse-backed analytics API: + +### Primary Endpoints +- `GET /workspaces/{ws}/airt/assessments/{id}/analytics` - Get aggregated metrics +- `GET /workspaces/{ws}/airt/assessments/{id}/traces/attacks` - Get attack-level performance data +- `GET /workspaces/{ws}/airt/assessments/{id}/traces/trials` - Get trial-level results with filtering +- `GET /workspaces/{ws}/airt/projects/{project}/summary` - Get project-wide statistics + +### Query Parameters for Analysis +``` +/traces/trials?attack_name={attack}&min_score={threshold}&limit={count} +/traces/attacks?assessment_id={id} +``` + +### Data Sources +- **OTEL Traces** - Full conversation history, prompts, responses, scores +- **Attack Spans** - ASR, best scores, transform effectiveness by attack type +- **Trial Spans** - Individual attempt outcomes with filtering capabilities +- **Analytics Snapshots** - Materialized metrics including severity, compliance + +## Tool Functions + +### `analyze_attack_effectiveness` +**Purpose:** Determine which attack types work best for specific targets and goals + +**Parameters:** +- `target_model` - Target model identifier (e.g., "claude-sonnet-4") +- `goal_category` - Goal category (e.g., "system_prompt_leak", "credential_extraction") +- `lookback_days` - Historical window to analyze (default: 90) + +**Returns:** +```json +{ + "recommendations": [ + { + "attack": "tap", + "asr": 0.73, + "avg_score": 8.2, + "confidence": 0.89, + "sample_size": 156, + "reasoning": "TAP shows 73% ASR vs 45% for Crescendo on this target/goal combination" + } + ], + "target_profile": { + "vulnerability_level": "high", + "common_weaknesses": ["multi_turn_degradation", "tool_manipulation"], + "resistant_to": ["direct_prompting", "simple_obfuscation"] + } +} +``` + +### `suggest_optimal_transforms` +**Purpose:** Recommend transform combinations based on historical effectiveness + +**Parameters:** +- `target_patterns` - Response characteristics of target +- `attack_type` - Attack being used +- `goal_category` - Attack objective category + +**Returns:** +```json +{ + "transform_rankings": [ + { + "transform": "base64", + "effectiveness_boost": 0.12, + "asr_with": 0.68, + "asr_without": 0.56, + "confidence": 0.85, + "reasoning": "Base64 encoding shows 12% ASR improvement for this target pattern" + } + ], + "optimal_sequence": ["base64", "authority", "role_play"], + "avoid": ["caesar", "leetspeak"], + "explanation": "This target shows strong resistance to simple ciphers but weak against encoding + persuasion" +} +``` + +### `predict_attack_success` +**Purpose:** Estimate probability of success before running expensive attacks + +**Parameters:** +- `attack_type` - Attack to predict +- `target_fingerprint` - Target characteristics +- `transforms` - Planned transforms +- `goal_category` - Attack objective + +**Returns:** +```json +{ + "success_probability": 0.78, + "estimated_trials": 45, + "estimated_duration": "8-12 minutes", + "confidence": 0.82, + "similar_targets": 23, + "risk_factors": ["strong_refusal_training", "output_filtering"], + "success_factors": ["multi_turn_weakness", "tool_access"] +} +``` + +### `identify_vulnerability_patterns` +**Purpose:** Analyze target responses to identify exploitable patterns + +**Parameters:** +- `target_responses` - Sample responses from target +- `response_metadata` - Timing, length, format characteristics + +**Returns:** +```json +{ + "vulnerability_fingerprint": { + "primary_weakness": "multi_turn_degradation", + "confidence": 0.91, + "indicators": [ + "Refusal strength decreases after turn 3", + "Responds to authority figures in conversation", + "Shows tool selection confusion under pressure" + ] + }, + "recommended_attacks": ["crescendo", "tool_restriction_bypass"], + "predicted_success_rate": 0.74, + "similar_vulnerability_count": 18 +} +``` + +### `get_historical_metrics` +**Purpose:** Provide context and trending data for strategic planning + +**Parameters:** +- `metric_type` - "asr_trends", "transform_effectiveness", "target_coverage" +- `time_range` - Analysis window +- `filters` - Target model, attack type, goal category filters + +**Returns:** +```json +{ + "trends": { + "overall_asr": 0.67, + "trend_direction": "improving", + "monthly_change": 0.03 + }, + "top_performing": { + "attacks": [{"name": "tap", "asr": 0.73}], + "transforms": [{"name": "base64", "boost": 0.15}], + "combinations": [{"attack": "tap", "transform": "base64", "asr": 0.81}] + }, + "coverage_gaps": [ + "Limited data for agentic_memory_poisoning goals", + "Few assessments against gemini models" + ] +} +``` + +## Implementation Strategy + +### Phase 1: Basic Analytics Integration +- Connect to existing `/analytics` and `/traces/attacks` endpoints +- Implement attack effectiveness analysis +- Basic transform recommendation + +### Phase 2: Advanced Pattern Recognition +- Trial-level analysis using `/traces/trials` with filtering +- Response pattern classification +- Vulnerability fingerprinting + +### Phase 3: Predictive Intelligence +- Success probability modeling +- Cross-target pattern recognition +- Strategic attack sequencing + +## Security and Privacy + +- **Data Minimization** - Only analyze aggregated metrics, not raw conversation content +- **Tenant Isolation** - Analysis scoped to organization/workspace data only +- **Retention Policy** - Respect platform data retention settings +- **Anonymization** - Strip PII from pattern analysis + +## Usage Examples + +### Strategic Attack Planning +``` +Operator: "I need to test claude-sonnet for system prompt leakage" + +Trace Advisor: "Based on 47 previous assessments against claude-sonnet models: +- TAP has 68% ASR for system_prompt_leak goals +- Crescendo has 52% ASR but finds different vulnerability classes +- Recommend: Start with TAP + base64 transform (78% historical success) +- Predicted: 15-25 trials, 6-8 minutes to first jailbreak" +``` + +### Transform Optimization +``` +Operator: "TAP isn't working well, suggest better transforms" + +Trace Advisor: "Current ASR with your transforms: 23% +Historical analysis shows: +- Authority + role_play combination: 71% ASR on similar targets +- Your target pattern matches 'authority-responsive' cluster +- Switch recommendation: Replace leetspeak with authority persuasion" +``` + +### Vulnerability Assessment +``` +Operator: "This target seems different, what's the best approach?" + +Trace Advisor: "Response analysis indicates: +- Strong single-turn refusal (98% refusal rate) +- Degrades significantly in conversation (turn 4+: 34% refusal) +- Similar to Pattern-C targets (tool-enabled models with conversation memory) +- Recommendation: Crescendo attack, expect 65-80% ASR after turn 5" +``` + +## Integration with Existing Skills + +- **Analytics Interpretation** - Provides raw data that this skill converts to recommendations +- **Attack Selection Guide** - Enhanced with historical evidence rather than theoretical guidance +- **Error Troubleshooting** - Identifies why attacks fail based on historical patterns + +This skill provides data-driven attack recommendations based on historical trace analysis.