# 🎯 Target Word Evaluation Deep Dive

## Overview

Target Word Evaluation is an alternative to the embedding-based method. It evaluates your MGPT model by generating medical code sequences and checking whether any of a set of target codes appears in the output. This directly tests the model's ability to predict relevant medical codes.

## 🔄 Process Flow

```
📊 Medical Claims Text (Input)
     ↓
🔗 MGPT Model API (/generate_batch)
     ↓
📝 Generated Text Sequences (N times per input)
     ↓
🔍 Search for Target Medical Codes
     ├── E119 (diabetes)
     ├── N6320 (breast lump)
     └── K9289 (digestive)
     ↓
✅ Binary Prediction:
   • Found target code → Prediction = 1
   • No target codes → Prediction = 0
     ↓
📊 Compare with True Labels
     ↓
📈 Calculate Accuracy, Precision, Recall
```

## 🚀 Step-by-Step Walkthrough

### Step 1: Target Code Definition

**What are Target Codes?**
Target codes are specific medical codes that you want your model to predict. These represent the positive class in your binary classification task.

**Example Target Codes:**
```yaml
target_codes: [
  "E119",    # Type 2 diabetes mellitus without complications
  "76642",   # Ultrasound, breast, unilateral, limited (CPT)
  "N6320",   # Unspecified lump in the left breast, unspecified quadrant
  "K9289",   # Other specified diseases of the digestive system
  "O0903"    # Supervision of pregnancy with history of infertility, third trimester
]
```

**How to Choose Target Codes:**
- **Clinical relevance**: Codes important for your use case
- **Frequency**: Should appear in reasonable proportion of your data
- **Specificity**: Specific enough to be meaningful predictors
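
To check the frequency criterion before committing to a code list, you can count how often each candidate appears in your claims. The sketch below is a minimal illustration, not part of the pipeline: `code_prevalence` is a hypothetical helper, the claims are toy data, and it assumes each claim is a string of whitespace-separated codes with `|eoc|` as a separator token, as in the examples in this guide.

```python
from collections import Counter


def code_prevalence(claims, candidate_codes):
    """Return the fraction of claims that contain each candidate code.

    Hypothetical helper: assumes each claim is a string of
    whitespace-separated codes, with `|eoc|` as a separator token.
    """
    candidates = set(candidate_codes)
    counts = Counter()
    for claim in claims:
        # Using a set counts each code at most once per claim.
        present = set(claim.split()) & candidates
        counts.update(present)
    total = len(claims)
    return {code: (counts[code] / total if total else 0.0) for code in candidate_codes}


# Toy claims for illustration only.
claims = [
    "N6320 G0378 |eoc| Z91048 M1710",
    "E119 76642 |eoc| K9289 O0903",
    "E119 Z09 |eoc| M549",
    "K592 G9340 |eoc| R50 M255",
]
prevalence = code_prevalence(claims, ["E119", "N6320", "I10"])
for code, share in prevalence.items():
    print(f"{code}: {share:.0%} of claims")
```

A code that never appears (here `I10`) or appears in almost every claim gives the evaluation little to discriminate on.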

### Step 2: Text Generation Process

**API Call Example:**
```json
POST /generate_batch
{
  "prompts": [
    "N6320 G0378 |eoc| Z91048 M1710",
    "E119 76642 |eoc| K9289 O0903"
  ],
  "max_new_tokens": 200,
  "temperature": 0.8,
  "top_k": 50,
  "num_return_sequences": 10
}
```

**Response Example:**
```json
{
  "generated_texts": [
    [
      "O0903 K9289 |eoc| N6322 76642 Z09 76642 |eoc| Z1239 O9989",
      "N6320 E119 |eoc| K9289 76642 O0903 |eoc| Z91048",
      "76642 O0903 |eoc| E119 N6320 K9289 |eoc| Z03818",
      ...
    ],
    [
      "K9289 E119 |eoc| N6320 76642 |eoc| O0903 Z1239",
      "E119 O0903 |eoc| 76642 K9289 N6320 |eoc| Z091",
      ...
    ]
  ]
}
```

**Generation Parameters:**
- **max_new_tokens**: How many tokens to generate (200 ≈ 50-100 medical codes)
- **temperature**: Sampling randomness (0.8 = balanced creativity)
- **top_k**: Consider top K most likely tokens (50 = moderate diversity)
- **num_return_sequences**: How many sequences to generate per input (10 in this example); the configuration exposes this as `generations_per_prompt`
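
A client-side sketch of the request above, using only the Python standard library. The endpoint path and field names are taken from the example request. `build_payload`, `post_generate_batch`, and the `http://localhost:8000` default are illustrative assumptions, and the real service may expect additional fields or headers.

```python
import json
import urllib.request


def build_payload(prompts, generations_per_prompt=10, max_new_tokens=200,
                  temperature=0.8, top_k=50):
    """Build the JSON body for /generate_batch (field names from the example request)."""
    return {
        "prompts": list(prompts),
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "top_k": top_k,
        "num_return_sequences": generations_per_prompt,
    }


def post_generate_batch(payload, base_url="http://localhost:8000", timeout=600):
    """POST the payload and return the decoded JSON response.

    Assumes the server accepts and returns JSON. Not called in this sketch,
    because it needs a running model API.
    """
    request = urllib.request.Request(
        url=f"{base_url}/generate_batch",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload(["N6320 G0378 |eoc| Z91048 M1710"])
print(json.dumps(payload, indent=2))
```

Keeping payload construction separate from the network call makes the parameters easy to inspect and test without a server.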

### Step 3: Target Code Search

**Search Algorithm:**
```python
import re


def search_target_codes(generated_text, target_codes):
    """Search for target codes in generated text."""
    found_codes = []
    
    for code in target_codes:
        # Use word boundary matching for exact codes
        pattern = r'\b' + re.escape(code) + r'\b'
        if re.search(pattern, generated_text):
            found_codes.append(code)
    
    return found_codes
```

**Example Search Process:**
```python
# Generated text: "O0903 K9289 |eoc| N6322 76642 Z09"
# Target codes: ["E119", "76642", "N6320", "K9289", "O0903"]

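found_codes = ["76642", "K9289", "O0903"]  # 3 of 5 target codes found, listed in target_codes order
# "N6322" does not count as "N6320": exact matching requires the whole code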
prediction = 1  # At least one target code found → positive prediction
```

**Search Methods:**
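- **Exact matching** (recommended): Matches whole codes using word boundaries, so `Z09` does not match `Z091`
- **Fuzzy matching**: Plain substring search (less precise); a short code also matches inside longer codes, for example `Z09` inside `Z091`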
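
The difference between the two methods shows up when one code is a prefix of another. The sketch below assumes that "fuzzy" corresponds to Python's plain `in` substring test, as described above; the function names are illustrative.

```python
import re


def exact_match(code, text):
    """Match `code` only as a whole token, using word boundaries."""
    return re.search(r"\b" + re.escape(code) + r"\b", text) is not None


def substring_match(code, text):
    """Match `code` anywhere in the text, including inside longer codes."""
    return code in text


text = "E119 O0903 |eoc| 76642 K9289 N6320 |eoc| Z091"
print(exact_match("Z09", text))      # False: "Z091" is a different code
print(substring_match("Z09", text))  # True: "Z09" is a prefix of "Z091"
```

With short target codes, substring search can therefore inflate the number of positive predictions.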

### Step 4: Aggregation Strategy

**Multiple Generations per Input:**
Since we generate multiple sequences per input, we need to aggregate results.

**Aggregation Logic:**
```python
def aggregate_predictions(generations, target_codes):
    """Aggregate multiple generations into single prediction."""
    positive_count = 0
    
    for generated_text in generations:
        found_codes = search_target_codes(generated_text, target_codes)
        if len(found_codes) > 0:
            positive_count += 1
    
    # If ANY generation contains target codes → positive prediction
    return 1 if positive_count > 0 else 0
```

**Example:**
```python
# 10 generations for one input claim:
generations = [
    "O0903 K9289 |eoc| N6322",     # Contains O0903, K9289 → positive
    "Z1239 M549 |eoc| R50",        # No target codes → negative  
    "E119 N6320 |eoc| Z091",       # Contains E119, N6320 → positive
    ...
]

# Result: 4 out of 10 generations contain target codes
# Final prediction: 1 (positive) because at least one generation was positive
```
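
If you also want the per-sample details that appear in the predictions file (found codes and the number of positive generations), the aggregation can return a small record instead of a bare label. This is a sketch under assumptions: `summarize_generations` and its `min_positive` parameter are hypothetical and not part of the documented pipeline. With `min_positive=1` it reproduces the ANY rule above; larger values require several generations to agree.

```python
import re


def summarize_generations(generations, target_codes, min_positive=1):
    """Summarize all generations for one input claim.

    `min_positive` is a hypothetical knob: 1 reproduces the ANY rule,
    larger values require that many generations to contain a target code.
    """
    patterns = {code: re.compile(r"\b" + re.escape(code) + r"\b") for code in target_codes}
    found_codes = []
    positive_generations = 0
    for text in generations:
        hits = [code for code, pattern in patterns.items() if pattern.search(text)]
        if hits:
            positive_generations += 1
        for code in hits:
            # Keep each code once, in order of first appearance.
            if code not in found_codes:
                found_codes.append(code)
    return {
        "found_codes": found_codes,
        "positive_generations": positive_generations,
        "total_generations": len(generations),
        "predicted_label": 1 if positive_generations >= min_positive else 0,
    }


generations = [
    "O0903 K9289 |eoc| N6322",
    "Z1239 M549 |eoc| R50",
    "E119 N6320 |eoc| Z091",
]
targets = ["E119", "76642", "N6320", "K9289", "O0903"]
summary = summarize_generations(generations, targets)
strict = summarize_generations(generations, targets, min_positive=3)
print(summary)
print(strict["predicted_label"])
```

Two of the three generations contain a target code, so the ANY rule predicts 1 while the stricter threshold of 3 predicts 0.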

## 📊 Configuration Options

### Target Word Evaluation Settings

```yaml
target_word_evaluation:
  enable: true
  
  # Target codes definition
  target_codes: ["E119", "76642", "N6320"]
  # Alternative: load from file
  # target_codes_file: "configs/target_codes.txt"
  
  # Text generation parameters
  generations_per_prompt: 10        # Robustness vs speed tradeoff
  max_new_tokens: 200               # Length of generated sequences
  temperature: 0.8                  # Sampling creativity
  top_k: 50                        # Vocabulary diversity
  
  # Search configuration
  search_method: "exact"            # "exact" or "fuzzy"
```
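
Catching a mistyped setting before a long generation run can save time. The sketch below mirrors the YAML keys above in a dataclass and checks them. `TargetWordConfig` and its validation rules are assumptions for illustration; the pipeline's own validation, if any, may accept different ranges (for example, it may allow a temperature of 0).

```python
from dataclasses import dataclass, field


@dataclass
class TargetWordConfig:
    """Hypothetical container mirroring the YAML keys shown above."""
    enable: bool = True
    target_codes: list = field(default_factory=list)
    generations_per_prompt: int = 10
    max_new_tokens: int = 200
    temperature: float = 0.8
    top_k: int = 50
    search_method: str = "exact"

    def validate(self):
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        if not self.target_codes:
            problems.append("target_codes is empty")
        if self.generations_per_prompt < 1:
            problems.append("generations_per_prompt must be >= 1")
        if self.max_new_tokens < 1:
            problems.append("max_new_tokens must be >= 1")
        if self.temperature <= 0:
            problems.append("temperature must be > 0")
        if self.top_k < 1:
            problems.append("top_k must be >= 1")
        if self.search_method not in ("exact", "fuzzy"):
            problems.append("search_method must be 'exact' or 'fuzzy'")
        return problems


# A section as it might look after parsing the YAML into a dict.
section = {"target_codes": ["E119", "76642", "N6320"], "search_method": "regex"}
config = TargetWordConfig(**section)
problems = config.validate()
print(problems)
```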

### API Configuration

```yaml
model_api:
  base_url: "http://localhost:8000"
  batch_size: 16                    # Smaller batches for generation
  timeout: 600                      # Longer timeout for generation
  max_retries: 3
```
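
The `batch_size` and `max_retries` settings suggest a loop that sends prompts in chunks and retries failed requests. The sketch below shows one way this could work. `chunked`, `call_with_retries`, and the exponential backoff are assumptions, and it treats `max_retries` as the number of retries after the first attempt, which may differ from how the pipeline counts.

```python
import time


def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def call_with_retries(send, batch, max_retries=3, backoff_seconds=1.0):
    """Call `send(batch)`, retrying after failures.

    `send` is any callable that raises on failure, for example a function
    that POSTs the batch to /generate_batch. Makes one initial attempt plus
    up to `max_retries` retries, then re-raises the last error.
    """
    attempts = max_retries + 1
    for attempt in range(attempts):
        try:
            return send(batch)
        except Exception:
            if attempt == attempts - 1:
                raise
            # Wait longer after each failed attempt.
            time.sleep(backoff_seconds * (2 ** attempt))


# Simulated sender that fails once, standing in for the real API call.
calls = {"count": 0}


def flaky_send(batch):
    calls["count"] += 1
    if calls["count"] == 1:
        raise TimeoutError("simulated timeout")
    return [f"generated for {prompt}" for prompt in batch]


prompts = [f"claim-{i}" for i in range(5)]
results = []
for batch in chunked(prompts, batch_size=2):
    results.extend(call_with_retries(flaky_send, batch, backoff_seconds=0.0))
print(len(results), calls["count"])
```

Five prompts in batches of two produce three requests; the simulated timeout adds one retry, for four calls in total.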

## 🎯 Practical Example

### Scenario: Cardiovascular Risk Prediction

**Goal**: Predict if a patient has cardiovascular risk factors

**Target Codes (Cardiovascular):**
```yaml
target_codes: [
  "I10",     # Essential hypertension
  "E785",    # Hyperlipidemia
  "E119",    # Type 2 diabetes (risk factor)
  "I259",    # Chronic ischemic heart disease
  "Z87891"   # Personal history of nicotine dependence
]
```

**Sample Data:**
```csv
mcid,claims,label
CV001,"I10 E785 |eoc| Z0000 M255",1
CV002,"K592 G9340 |eoc| R50 M255",0
CV003,"E119 Z87891 |eoc| N183 M549",1
```

CV001 (I10, E785) and CV003 (E119, Z87891) contain cardiovascular target codes and are labeled 1; CV002 contains none and is labeled 0.

**Expected Behavior:**
- For CV001: Model should generate sequences containing I10 or E785
- For CV002: Model should not consistently generate cardiovascular codes
- For CV003: Model should generate sequences containing E119 or Z87891

### Configuration Example:

```yaml
# configs/cardiovascular_evaluation.yaml
input:
  dataset_path: "data/cardiovascular_claims.csv"

pipeline_stages:
  embeddings: false
  classification: false
  evaluation: false
  target_word_eval: true            # Focus on target word method
  summary_report: true

target_word_evaluation:
  target_codes: ["I10", "E785", "E119", "I259", "Z87891"]
  generations_per_prompt: 15        # More generations for robustness
  max_new_tokens: 150               # Shorter sequences
  temperature: 0.7                  # Slightly more focused
```

## 📈 Interpreting Results

### Result Files Structure

```
outputs/cardiovascular_evaluation/metrics/target_word_evaluation/
├── target_word_eval_summary.json      # Overall metrics
├── target_word_eval_details.json      # Per-sample details
└── target_word_predictions.csv        # Detailed predictions
```

### Summary Results Example

```json
{
  "overall_metrics": {
    "accuracy": 0.78,
    "precision": 0.82,
    "recall": 0.74,
    "f1_score": 0.78,
    "total_samples": 1000,
    "positive_predictions": 420,
    "negative_predictions": 580
  },
  "target_code_statistics": {
    "I10": {"found_count": 187, "percentage": 18.7},
    "E785": {"found_count": 134, "percentage": 13.4},
    "E119": {"found_count": 98, "percentage": 9.8},
    "I259": {"found_count": 76, "percentage": 7.6},
    "Z87891": {"found_count": 45, "percentage": 4.5}
  },
  "generation_statistics": {
    "avg_generations_per_sample": 15,
    "avg_positive_generations_per_positive_sample": 3.2,
    "avg_tokens_generated": 142
  }
}
```

### What This Tells You:
- **78% accuracy**: 78% of all samples, positive and negative, are classified correctly
- **High precision (82%)**: When the model predicts risk, it is correct 82% of the time
- **Good recall (74%)**: The model identifies 74% of the true cardiovascular risk cases
- **I10 most common**: The hypertension code has the highest found count (187, or 18.7%)
- **Agreement across generations**: Positive samples average 3.2 positive generations out of 15 (about 21%), so a positive prediction rests on roughly three generations on average rather than a single one
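
For reference, the headline metrics follow from the four confusion-matrix counts. The sketch below uses the standard definitions of accuracy, precision, recall, and F1 on five illustrative samples; `binary_metrics` is a hypothetical helper, not the pipeline's implementation.

```python
def binary_metrics(true_labels, predicted_labels):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    tp = fp = fn = tn = 0
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth == 1 and prediction == 1:
            tp += 1
        elif truth == 0 and prediction == 1:
            fp += 1
        elif truth == 1 and prediction == 0:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    # Guard against empty denominators (e.g. no positive predictions).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Five illustrative samples: two true positives, one true negative,
# one false negative, one false positive.
true_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 1, 0, 1]
metrics = binary_metrics(true_labels, predicted_labels)
print(metrics)
```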

### Detailed Results Analysis

**Predictions CSV Example:**
```csv
mcid,true_label,predicted_label,found_codes,positive_generations,total_generations
CV001,1,1,"['I10', 'E785']",6,15
CV002,0,0,"[]",0,15
CV003,1,1,"['E119', 'Z87891']",4,15
CV004,1,0,"[]",0,15
CV005,0,1,"['I10']",2,15
```

**Analysis:**
- **CV001**: ✅ Correct positive - found expected cardiovascular codes
- **CV002**: ✅ Correct negative - no cardiovascular codes generated
- **CV003**: ✅ Correct positive - found diabetes and smoking history codes
- **CV004**: ❌ False negative - missed true cardiovascular case
- **CV005**: ❌ False positive - generated I10 for non-cardiovascular case
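
The same analysis can be automated for larger files. The sketch below parses rows in the format shown above and groups the misclassified samples. `classify_rows` is a hypothetical helper, and it assumes the `found_codes` column holds a Python-style list literal, as in the example. In a real run you would pass the contents of `target_word_predictions.csv` instead of the inline string.

```python
import ast
import csv
import io

PREDICTIONS_CSV = """mcid,true_label,predicted_label,found_codes,positive_generations,total_generations
CV001,1,1,"['I10', 'E785']",6,15
CV002,0,0,"[]",0,15
CV003,1,1,"['E119', 'Z87891']",4,15
CV004,1,0,"[]",0,15
CV005,0,1,"['I10']",2,15
"""


def classify_rows(csv_text):
    """Group misclassified samples into false negatives and false positives."""
    errors = {"false_negative": [], "false_positive": []}
    for row in csv.DictReader(io.StringIO(csv_text)):
        true_label = int(row["true_label"])
        predicted = int(row["predicted_label"])
        # found_codes is stored as a Python-style list literal in the CSV.
        codes = ast.literal_eval(row["found_codes"])
        share = int(row["positive_generations"]) / int(row["total_generations"])
        if true_label == 1 and predicted == 0:
            errors["false_negative"].append((row["mcid"], codes, share))
        elif true_label == 0 and predicted == 1:
            errors["false_positive"].append((row["mcid"], codes, share))
    return errors


errors = classify_rows(PREDICTIONS_CSV)
for kind, rows in errors.items():
    for mcid, codes, share in rows:
        print(f"{kind}: {mcid} codes={codes} positive share={share:.0%}")
```

The share of positive generations is a useful triage signal: a false positive triggered by 2 of 15 generations is a weaker error than one triggered by most of them.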

## ⚙️ Advanced Configuration

### Loading Target Codes from File

**File format** (`configs/cardiovascular_codes.txt`):
```
# Cardiovascular risk factor codes
I10      # Essential hypertension
E785     # Hyperlipidemia  
E119     # Type 2 diabetes mellitus
I259     # Chronic ischemic heart disease
Z87891   # Personal history of nicotine dependence

# Additional codes can be added here
I110     # Hypertensive heart disease with heart failure
E7800    # Pure hypercholesterolemia
```

**Configuration:**
```yaml
target_word_evaluation:
  target_codes_file: "configs/cardiovascular_codes.txt"
  # Don't specify target_codes when using file
```
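
A minimal sketch of how a file in this format could be parsed. It assumes `#` starts a comment, that each remaining line holds exactly one code, and that duplicates should be dropped; `parse_target_codes` is a hypothetical helper, and the pipeline's own loader may behave differently.

```python
def parse_target_codes(lines):
    """Parse target codes from lines in the file format shown above.

    Assumes `#` starts a comment and each remaining line holds one code.
    """
    codes = []
    for line in lines:
        # Drop everything after the first '#', then surrounding whitespace.
        code = line.split("#", 1)[0].strip()
        if code and code not in codes:
            codes.append(code)
    return codes


sample = """# Cardiovascular risk factor codes
I10      # Essential hypertension
E785     # Hyperlipidemia

I110     # Hypertensive heart disease
I10      # duplicate entry, kept only once
"""
codes = parse_target_codes(sample.splitlines())
print(codes)
```

To read from disk, pass an open file handle, for example `parse_target_codes(open(path, encoding="utf-8"))`, since iterating a file yields its lines.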

### Performance Tuning

**For Speed:**
```yaml
target_word_evaluation:
  generations_per_prompt: 5         # Fewer generations
  max_new_tokens: 100               # Shorter sequences

model_api:
  batch_size: 32                    # Larger batches
```

**For Robustness:**
```yaml
target_word_evaluation:
  generations_per_prompt: 20        # More generations
  max_new_tokens: 300               # Longer sequences
  temperature: 0.9                  # More diversity
```
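
To see how far apart these two profiles are, you can estimate the work each one implies. The sketch below is back-of-the-envelope arithmetic under stated assumptions: 1000 samples, one API request per batch of prompts, and every sequence using its full `max_new_tokens` budget (so the token count is an upper bound). The robustness profile does not set `batch_size`, so 16 from the API configuration is assumed.

```python
import math


def estimate_workload(num_samples, generations_per_prompt, max_new_tokens, batch_size):
    """Upper-bound the work for one evaluation run.

    Assumes one request per batch of prompts and that every sequence
    uses its full `max_new_tokens` budget.
    """
    sequences = num_samples * generations_per_prompt
    return {
        "requests": math.ceil(num_samples / batch_size),
        "sequences": sequences,
        "max_generated_tokens": sequences * max_new_tokens,
    }


speed = estimate_workload(1000, generations_per_prompt=5, max_new_tokens=100, batch_size=32)
robust = estimate_workload(1000, generations_per_prompt=20, max_new_tokens=300, batch_size=16)
print(speed)
print(robust)
print(robust["max_generated_tokens"] / speed["max_generated_tokens"])
```

Under these assumptions the robustness profile generates up to 12 times as many tokens as the speed profile.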

## 🚨 Common Issues & Solutions

### Issue 1: Low Recall (Missing True Positives)
**Symptoms**: Model rarely generates target codes even for positive cases

**Possible Causes**:
- Target codes are too rare in model's training data
- Temperature too low (not enough diversity)
- Not enough generations per prompt

**Solutions**:
- Increase `generations_per_prompt` to 15-20
- Increase `temperature` to 0.9-1.0
- Review target code selection - choose more common codes
- Check if model was trained on relevant medical data
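
Why more generations help recall follows from the ANY aggregation rule. If a single generation contains a target code with probability p, and generations are treated as independent (a simplifying assumption), the chance that at least one of n generations is positive is 1 - (1 - p)^n. The sketch below evaluates this for an illustrative p of 0.10.

```python
def any_hit_probability(p_single, n_generations):
    """Chance that at least one of n independent generations has a target code."""
    return 1.0 - (1.0 - p_single) ** n_generations


for n in (5, 10, 20):
    print(n, round(any_hit_probability(0.10, n), 3))
```

The same formula applies to negative samples, so raising `generations_per_prompt` also raises the chance of a false positive.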

### Issue 2: High False Positive Rate
**Symptoms**: Model generates target codes for negative cases

**Solutions**:
- Review target code specificity
- Check label quality in your data
- Consider using more specific codes
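- Reduce `temperature` for more focused generation
- Reduce `generations_per_prompt`; with ANY aggregation, each extra generation is another chance for a stray target code to trigger a positive prediction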

### Issue 3: API Timeouts
**Symptoms**: Requests time out during text generation

**Solutions**:
- Increase `timeout` in model_api configuration
- Reduce `batch_size` to smaller values
- Reduce `max_new_tokens` if sequences are too long
- Check server capacity and performance

### Issue 4: Inconsistent Results
**Symptoms**: Results vary significantly between runs

**Solutions**:
- Increase `generations_per_prompt` for more stable aggregation
- Set consistent `random_seed` in job configuration
- Use larger datasets for more reliable evaluation

## 🎭 Comparison with Embedding Method

### When Target Word Evaluation Works Better:
- **Interpretable results**: You can see exactly which codes the model generates
- **Direct testing**: Tests the model's actual generative capabilities
- **Code-specific insights**: Shows which specific medical codes are predicted
- **Clinical relevance**: Directly measures clinically meaningful code prediction

### When Embedding Method Works Better:
- **Semantic understanding**: Captures broader semantic patterns beyond specific codes
- **Efficiency**: No need for text generation (faster, cheaper)
- **Stability**: More stable results, less dependent on generation parameters
- **Small datasets**: Works better with limited training data

### Use Both for Best Results:
```yaml
# Full pipeline comparing both methods
pipeline_stages:
  embeddings: true
  classification: true
  evaluation: true
  target_word_eval: true            # Enable both approaches
  method_comparison: true           # Compare and recommend best
```

## 🔗 Next Steps

- **[04_Configuration_Guide.ipynb](04_Configuration_Guide.ipynb)** - Complete configuration reference
- **[05_Results_Analysis.ipynb](05_Results_Analysis.ipynb)** - Advanced result interpretation
- **[06_Troubleshooting.ipynb](06_Troubleshooting.ipynb)** - Debugging and optimization

### Quick Commands:

**Target Word Evaluation Only:**
```bash
python main.py run-all --config configs/templates/03_target_words_only.yaml
```

**Full Pipeline with Both Methods:**
```bash
python main.py run-all --config configs/templates/04_full_pipeline.yaml
```