# 🔧 Troubleshooting Guide

## Overview

This guide helps you diagnose and fix common issues with the MGPT-Eval pipeline. Issues are organized by category, each with symptoms, diagnostic commands where applicable, and step-by-step solutions.

## 🚨 Quick Diagnostics

### First Steps for Any Issue:

1. **Check the logs**:
   ```bash
   tail -50 outputs/your_job/logs/pipeline.log
   grep -i error outputs/your_job/logs/pipeline.log
   ```

2. **Validate your configuration**:
   ```bash
   python main.py validate --config your_config.yaml
   ```

3. **Test API connectivity**:
   ```bash
   curl -X GET http://your-server:8000/health
   ```

## 🔗 API Connection Issues

### Issue: Connection Refused

**Symptoms:**
```
ConnectionError: HTTPConnectionPool(host='localhost', port=8000): 
Max retries exceeded with url: /embeddings_batch
```

**Diagnosis:**
```bash
# Test if server is running
curl -X GET http://localhost:8000/health

# Check if port is open
netstat -tlnp | grep 8000
```

**Solutions:**

1. **Start your model server**:
   ```bash
   # Example server startup
   cd /path/to/your/model/server
   python app.py
   ```

2. **Check correct URL in config**:
   ```yaml
   model_api:
     base_url: "http://localhost:8000"  # ✅ Correct format
     # base_url: "localhost:8000"       # ❌ Missing protocol
     # base_url: "http://localhost:8000/" # ❌ Trailing slash
   ```

3. **Try different host configurations**:
   ```yaml
   # If running in Docker
   base_url: "http://host.docker.internal:8000"
   
   # If running on different machine
   base_url: "http://192.168.1.100:8000"
   
   # If using domain name
   base_url: "https://mgpt-api.yourcompany.com"
   ```
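
The two `base_url` mistakes flagged above (missing protocol, trailing slash) can be caught before the pipeline runs. The following is a minimal sketch using only the standard library; `check_base_url` is a hypothetical helper written for this guide, not part of MGPT-Eval.

```python
from urllib.parse import urlparse


def check_base_url(base_url: str) -> list[str]:
    """Return a list of problems found in a base_url string (empty if none)."""
    problems = []
    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https"):
        # Covers values such as "localhost:8000" that lack http:// or https://
        problems.append("missing or unsupported protocol (expected http:// or https://)")
    elif not parsed.netloc:
        problems.append("missing host")
    if base_url.endswith("/"):
        problems.append("trailing slash")
    return problems


for candidate in ("http://localhost:8000", "localhost:8000", "http://localhost:8000/"):
    print(candidate, "->", check_base_url(candidate) or "ok")
```

An empty list means the value matches the format shown in the config example; it does not confirm that the server is reachable.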

### Issue: Request Timeouts

**Symptoms:**
```
ReadTimeout: HTTPSConnectionPool(host='api.example.com', port=443): 
Read timed out. (read timeout=300)
```

**Solutions:**

1. **Increase timeout**:
   ```yaml
   model_api:
     timeout: 600    # 10 minutes instead of 5
   ```

2. **Reduce batch size**:
   ```yaml
   model_api:
     batch_size: 8   # Smaller batches process faster
   ```

3. **Check server performance**:
   ```bash
   # Monitor server resources
   htop
   nvidia-smi  # If using GPU
   ```

### Issue: Authentication Errors

**Symptoms:**
```
HTTPError: 401 Unauthorized
HTTPError: 403 Forbidden
```

**Solutions:**

1. **Add authentication headers** (if required by your server):
   ```python
   # In your server configuration, add:
   headers = {
       "Authorization": "Bearer your-api-key",
       "X-API-Key": "your-api-key"
   }
   ```

2. **Check API key validity**:
   ```bash
   curl -H "Authorization: Bearer your-key" http://your-server:8000/health
   ```
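
The same key check can be scripted. This sketch builds the request that the `curl` command above sends and maps the two status codes from the symptoms to their usual HTTP meaning. It assumes a bearer-token scheme; `build_health_request` and `describe_auth_status` are illustrative names, and your server may expect a different header.

```python
import urllib.request


def build_health_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request to /health with a bearer token."""
    return urllib.request.Request(
        f"{base_url}/health",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def describe_auth_status(status_code: int) -> str:
    """Map an HTTP status code to a short troubleshooting hint."""
    if status_code == 401:
        return "401: credentials are missing or invalid"
    if status_code == 403:
        return "403: credentials were recognized but lack permission"
    return f"{status_code}: not an authentication error"


req = build_health_request("http://your-server:8000", "your-key")
print(req.full_url)

# To send it and interpret a failure:
#   import urllib.error
#   try:
#       urllib.request.urlopen(req, timeout=10)
#   except urllib.error.HTTPError as exc:
#       print(describe_auth_status(exc.code))
```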

## 📁 Data Issues

### Issue: File Not Found

**Symptoms:**
```
FileNotFoundError: [Errno 2] No such file or directory: 'data/medical_claims.csv'
```

**Diagnosis:**
```bash
# Check if file exists
ls -la data/medical_claims.csv

# Check current directory
pwd
ls -la
```

**Solutions:**

1. **Use absolute path**:
   ```yaml
   input:
     dataset_path: "/full/path/to/data/medical_claims.csv"
   ```

2. **Check relative path from script location**:
   ```bash
   # Run from mgpt_eval directory
   cd /path/to/mgpt_eval
   python main.py run-all --config your_config.yaml
   ```

3. **Verify file permissions**:
   ```bash
   ls -la data/medical_claims.csv
   # Should show read permissions (r)
   ```

### Issue: Invalid CSV Format

**Symptoms:**
```
ValueError: Missing required columns. Expected: ['mcid', 'claims', 'label']
Found: ['id', 'text', 'target']
```

**Solutions:**

1. **Check CSV format**:
   ```bash
   head -3 data/medical_claims.csv
   ```

2. **Required format**:
   ```csv
   mcid,claims,label
   CLAIM_001,"N6320 G0378 |eoc| Z91048 M1710",1
   CLAIM_002,"E119 76642 |eoc| K9289 O0903",0
   ```

3. **Fix column names** in your CSV:
   ```python
   import pandas as pd
   
   # Load and rename columns
   df = pd.read_csv('data/your_file.csv')
   df = df.rename(columns={
       'id': 'mcid',
       'text': 'claims', 
       'target': 'label'
   })
   df.to_csv('data/medical_claims.csv', index=False)
   ```
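
To see which required columns are missing before renaming anything, a header check along these lines may help. It is a sketch using the standard library `csv` module; the column names come from the error message above, and `missing_columns` is a hypothetical helper.

```python
import csv
import io

REQUIRED_COLUMNS = ("mcid", "claims", "label")


def missing_columns(csv_file, required=REQUIRED_COLUMNS) -> list[str]:
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


good = io.StringIO('mcid,claims,label\nCLAIM_001,"N6320 G0378 |eoc| Z91048 M1710",1\n')
bad = io.StringIO("id,text,target\n1,E119,0\n")
print(missing_columns(good))  # prints []
print(missing_columns(bad))   # prints ['mcid', 'claims', 'label']
```

For a file on disk, pass an open handle instead, for example `open('data/medical_claims.csv', newline='')`.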

### Issue: Empty or Invalid Data

**Symptoms:**
```
ValueError: Dataset is empty after loading
ValueError: All labels are the same class
```

**Diagnosis:**
```python
import pandas as pd

df = pd.read_csv('data/medical_claims.csv')
print(f"Dataset shape: {df.shape}")
print(f"Label distribution: {df['label'].value_counts()}")
print(f"Missing values: {df.isnull().sum()}")
print("Sample rows:")
print(df.head())
```

**Solutions:**

1. **Remove empty rows**:
   ```python
   df = df.dropna(subset=['mcid', 'claims', 'label'])
   ```

2. **Check label format**:
   ```python
   # Labels must be 0 or 1
   print(df['label'].unique())  # Should show [0, 1] or [1, 0]
   
   # Convert if necessary
   df['label'] = df['label'].astype(int)
   ```

3. **Ensure minimum dataset size**:
   ```python
   if len(df) < 100:
       print("Warning: Very small dataset, results may be unreliable")
   ```
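
The label checks above can be combined into one function that reports every problem at once. This is a minimal sketch, assuming labels must be exactly `0` or `1` as stated above; `label_problems` is a hypothetical helper and does not reproduce the pipeline's own validation.

```python
from collections import Counter


def label_problems(labels) -> list[str]:
    """Return human-readable problems with a list of labels (empty if none)."""
    if not labels:
        return ["dataset is empty"]
    problems = []
    # Compare as strings so that 1 and "1" are treated alike;
    # values such as "1.0" are reported and need converting first.
    normalized = [str(value).strip() for value in labels]
    unexpected = sorted(set(normalized) - {"0", "1"})
    if unexpected:
        problems.append(f"unexpected label values: {unexpected}")
    if len(Counter(normalized)) < 2:
        problems.append("all labels are the same class")
    return problems


print(label_problems([0, 1, 1, 0]))  # prints []
print(label_problems([1, 1, 1]))     # prints ['all labels are the same class']
```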

## ⚙️ Configuration Issues

### Issue: Invalid Configuration Values

**Symptoms:**
```
ValidationError: split_ratio must be between 0.1 and 0.9
ValidationError: target_codes is required when target_word_eval is enabled
```

**Common Fixes:**

1. **Split ratio issues**:
   ```yaml
   input:
     split_ratio: 0.8    # ✅ Valid (0.1 to 0.9)
     # split_ratio: 1.2  # ❌ Invalid (>0.9)
     # split_ratio: 0.05 # ❌ Invalid (<0.1)
   ```

2. **Missing target codes**:
   ```yaml
   pipeline_stages:
     target_word_eval: true
   
   target_word_evaluation:
     enable: true
     target_codes: ["E119", "I10"]  # ✅ Required when enabled
   ```

3. **Invalid batch sizes**:
   ```yaml
   model_api:
     batch_size: 32      # ✅ Valid (1-512)
     # batch_size: 0     # ❌ Invalid (too low)
     # batch_size: 1000  # ❌ Invalid (too high)
   ```
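
The ranges listed above can be checked on a parsed config before a run. The sketch below works on a plain dictionary (for example, the result of loading your YAML file) and assumes both bounds are inclusive; `check_ranges` is a hypothetical helper, and `python main.py validate` remains the authoritative check.

```python
def check_ranges(config: dict) -> list[str]:
    """Return range violations for split_ratio and batch_size (empty if none)."""
    errors = []

    split_ratio = config.get("input", {}).get("split_ratio")
    if split_ratio is not None and not 0.1 <= split_ratio <= 0.9:
        errors.append(f"split_ratio must be between 0.1 and 0.9, got {split_ratio}")

    batch_size = config.get("model_api", {}).get("batch_size")
    if batch_size is not None and not 1 <= batch_size <= 512:
        errors.append(f"batch_size must be between 1 and 512, got {batch_size}")

    return errors


config = {"input": {"split_ratio": 1.2}, "model_api": {"batch_size": 32}}
for error in check_ranges(config):
    print(error)
```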

### Issue: Conflicting Configuration

**Symptoms:**
```
ConfigError: Cannot use both dataset_path and train_dataset_path
ConfigError: target_word_eval enabled but no target_codes provided
```

**Solutions:**

1. **Input data conflicts**:
   ```yaml
   # ✅ Option 1: Single dataset
   input:
     dataset_path: "data/claims.csv"
     split_ratio: 0.8
   
   # ✅ Option 2: Separate files  
   input:
     train_dataset_path: "data/train.csv"
     test_dataset_path: "data/test.csv"
   
   # ❌ Don't use both
   ```

2. **Stage dependencies**:
   ```yaml
   # ✅ Valid: Enable target evaluation with codes
   pipeline_stages:
     target_word_eval: true
   target_word_evaluation:
     target_codes: ["E119"]
   
   # ✅ Valid: Disable target evaluation
   pipeline_stages:
     target_word_eval: false
   ```
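
Both conflicts described above are cross-field rules, so they are easy to test on a parsed config dictionary. This is a sketch under the assumption that the keys are named as in the YAML examples; `check_conflicts` is a hypothetical helper, and treating `test_dataset_path` the same way as `train_dataset_path` is an assumption.

```python
def check_conflicts(config: dict) -> list[str]:
    """Return cross-field configuration conflicts (empty if none)."""
    errors = []

    data_input = config.get("input", {})
    has_single = bool(data_input.get("dataset_path"))
    # Assumption: either of the separate-file keys conflicts with dataset_path.
    has_split = bool(
        data_input.get("train_dataset_path") or data_input.get("test_dataset_path")
    )
    if has_single and has_split:
        errors.append("use either dataset_path or separate train/test paths, not both")

    stage_enabled = config.get("pipeline_stages", {}).get("target_word_eval", False)
    target_codes = config.get("target_word_evaluation", {}).get("target_codes")
    if stage_enabled and not target_codes:
        errors.append("target_word_eval is enabled but no target_codes are provided")

    return errors


config = {
    "input": {"dataset_path": "data/claims.csv", "train_dataset_path": "data/train.csv"},
    "pipeline_stages": {"target_word_eval": True},
}
for error in check_conflicts(config):
    print(error)
```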

## 🧠 Embedding Generation Issues

### Issue: Out of Memory

**Symptoms:**
```
RuntimeError: CUDA out of memory
MemoryError: Unable to allocate array
```

**Solutions:**

1. **Reduce batch sizes**:
   ```yaml
   model_api:
     batch_size: 8         # Reduce from 32
   
   embedding_generation:
     batch_size: 4         # Process fewer at once
   ```

2. **Reduce sequence length**:
   ```yaml
   data_processing:
     max_sequence_length: 256  # Reduce from 512
   ```

3. **Enable more frequent checkpoints**:
   ```yaml
   embedding_generation:
     save_interval: 25     # Save more often
   ```

4. **Use CSV format for large datasets**:
   ```yaml
   data_processing:
     output_format: "csv"  # More memory efficient
   ```
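
When choosing batch sizes and output formats, it can help to estimate how much memory the stored embeddings need on the pipeline side. The sketch below computes the size of a dense float array; it says nothing about GPU memory on the model server, and the sample count and embedding dimension are made-up numbers for illustration.

```python
def embedding_array_mb(n_samples: int, dim: int, bytes_per_value: int = 4) -> float:
    """Approximate size in MiB of a dense n_samples x dim array.

    bytes_per_value defaults to 4 (float32); use 8 for float64.
    """
    return n_samples * dim * bytes_per_value / (1024 ** 2)


# Hypothetical workload: 100,000 claims with 768-dimensional embeddings.
print(f"{embedding_array_mb(100_000, 768):.1f} MiB as float32")
print(f"{embedding_array_mb(100_000, 768, bytes_per_value=8):.1f} MiB as float64")
```

Plain Python lists of floats, or the same values serialized as JSON text, take considerably more space than this lower bound.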

### Issue: Slow Embedding Generation

**Symptoms:**
- Process takes much longer than expected
- Low GPU/CPU utilization

**Solutions:**

1. **Increase batch sizes** (if memory allows):
   ```yaml
   model_api:
     batch_size: 64        # Increase if server can handle
   ```

2. **Optimize server settings**:
   ```python
   # In your model server
   # Enable batch processing
   # Use GPU if available
   # Optimize tokenization
   ```

3. **Monitor progress**:
   ```bash
   tail -f outputs/your_job/logs/pipeline.log
   ```

### Issue: Embedding Dimension Mismatch

**Symptoms:**
```
ValueError: All embeddings must have the same dimension
```

**Solutions:**

1. **Check API response format**:
   ```bash
   curl -X POST http://localhost:8000/embeddings_batch \
     -H "Content-Type: application/json" \
     -d '{"texts": ["E119 I10"]}'
   ```

2. **Ensure consistent model**:
   - Don't change model during embedding generation
   - Use same model configuration

3. **Clear corrupted checkpoints**:
   ```bash
   rm -rf outputs/checkpoints/*
   ```
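
Before deleting checkpoints, it may be worth finding out which embeddings have the wrong size. The following sketch treats the most common vector length as the expected dimension and lists the outliers; `find_dimension_outliers` is a hypothetical helper that operates on a list of vectors such as the `embeddings` list in the saved JSON.

```python
from collections import Counter


def find_dimension_outliers(embeddings):
    """Return (expected_dim, [(index, dim), ...]) for vectors of unexpected length.

    The expected dimension is taken to be the most common vector length.
    """
    dims = [len(vector) for vector in embeddings]
    if not dims:
        return None, []
    expected = Counter(dims).most_common(1)[0][0]
    outliers = [(i, d) for i, d in enumerate(dims) if d != expected]
    return expected, outliers


vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8]]
print(find_dimension_outliers(vectors))  # prints (3, [(2, 2)])
```

If the outliers cluster at the start or end of the file, a model change partway through generation is a likely cause.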

## 🤖 Classification Issues

### Issue: Poor Classification Performance

**Symptoms:**
- All classifiers get <70% accuracy
- ROC-AUC near 0.5 (random performance)

**Diagnosis:**
```python
# Check data quality
import json

import numpy as np
import pandas as pd

# Load embeddings
with open('outputs/job/embeddings/train_embeddings.json') as f:
    data = json.load(f)

# Check label distribution
labels = data['labels']
print(f"Label distribution: {pd.Series(labels).value_counts()}")

# Check embedding quality
embeddings = data['embeddings']
print(f"Embedding shape: {len(embeddings)} x {len(embeddings[0])}")
print(f"Embedding stats: mean={np.mean(embeddings):.3f}, std={np.std(embeddings):.3f}")
```

**Solutions:**

1. **Check data quality**:
   - Verify labels are correct
   - Ensure sufficient data (>1000 samples recommended)
   - Check for data leakage

2. **Verify model compatibility**:
   ```yaml
   # Ensure your model was trained on medical data
   # Check that embeddings capture medical semantics
   ```

3. **Try different hyperparameters**:
   ```yaml
   classification:
     hyperparameter_search:
       logistic_regression:
         C: [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]  # Wider range
   ```

### Issue: Overfitting

**Symptoms:**
- High training accuracy, low test accuracy
- Large gap between CV and test scores

**Solutions:**

1. **Increase regularization**:
   ```yaml
   classification:
     hyperparameter_search:
       logistic_regression:
         C: [0.001, 0.01, 0.1]  # Lower C = more regularization
   ```

2. **Increase dataset size**:
   - Collect more training data
   - Use data augmentation if applicable

3. **Use more cross-validation folds**:
   ```yaml
   classification:
     cross_validation:
       n_folds: 10  # More thorough validation
   ```

### Issue: Training Takes Too Long

**Solutions:**

1. **Reduce hyperparameter search space**:
   ```yaml
   classification:
     hyperparameter_search:
       logistic_regression:
         C: [0.1, 1, 10]  # Fewer options
       svm:
         C: [1]           # Single value keeps the SVM search minimal
   ```

2. **Use parallel processing**:
   ```yaml
   classification:
     cross_validation:
       n_jobs: -1  # Use all CPU cores
   ```

3. **Train fewer models**:
   ```yaml
   classification:
     models: ["logistic_regression"]  # Skip slower models
   ```

## 🎯 Target Word Evaluation Issues

### Issue: Low Recall (Missing Positive Cases)

**Symptoms:**
- Target word method has much lower recall than embedding method
- Many true positive cases get 0 predictions

**Diagnosis:**
```bash
# Check target word results
grep "found_count" outputs/job/metrics/target_word_evaluation/target_word_eval_summary.json
```

**Solutions:**

1. **Increase generation diversity**:
   ```yaml
   target_word_evaluation:
     temperature: 0.9           # More randomness
     top_k: 100                 # Wider vocabulary
     generations_per_prompt: 20 # More attempts
   ```

2. **Add related target codes**:
   ```yaml
   target_word_evaluation:
     target_codes: [
       "E119",   # Type 2 diabetes
       "E1022",  # Type 1 diabetes with CKD
       "E1040",  # Type 1 diabetes with neuropathy
       "E1051"   # Type 1 diabetes with circulatory complications
     ]
   ```

3. **Increase generation length**:
   ```yaml
   target_word_evaluation:
     max_new_tokens: 300  # Longer sequences for more chances
   ```

### Issue: High False Positive Rate

**Symptoms:**
- Target codes appear in negative cases
- Much higher false positive rate than embedding method

**Solutions:**

1. **Use more specific target codes**:
   ```yaml
   target_word_evaluation:
     # Instead of general codes like "Z0000"
     target_codes: ["E119", "I259"]  # More specific conditions
   ```

2. **Reduce generation randomness**:
   ```yaml
   target_word_evaluation:
     temperature: 0.6     # Less randomness
     top_k: 30           # More focused vocabulary
   ```

3. **Review label quality**:
   ```python
   # Check if negative cases should actually be positive
   false_positives = df[(df['true_label'] == 0) & (df['predicted_label'] == 1)]
   print(false_positives[['mcid', 'input_claim', 'found_codes']].head())
   ```

### Issue: No Target Codes Found

**Symptoms:**
```
Warning: No target codes found in any generations
All predictions are 0 (negative)
```

**Solutions:**

1. **Check target code validity**:
   ```python
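   import pandas as pd

   # Verify codes exist in your domain
   target_codes = ["E119", "I10", "N6320"]
   
   # Count whole-token matches in the training data, so that a
   # short code is not also counted inside a longer one
   df = pd.read_csv('data/medical_claims.csv')
   token_counts = df['claims'].str.split().explode().value_counts()
   for code in target_codes:
       print(f"{code}: {token_counts.get(code, 0)} occurrences")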
   ```

2. **Test generation manually**:
   ```bash
   curl -X POST http://localhost:8000/generate_batch \
     -H "Content-Type: application/json" \
     -d '{
       "prompts": ["E119 I10 |eoc|"],
       "max_new_tokens": 100,
       "temperature": 0.8,
       "num_return_sequences": 5
     }'
   ```

3. **Adjust generation parameters**:
   ```yaml
   target_word_evaluation:
     max_new_tokens: 500    # Much longer generations
     temperature: 1.0       # Higher diversity
   ```

## 💾 Performance & Resource Issues

### Issue: High Memory Usage

**Symptoms:**
- Process killed by OOM killer
- System becomes unresponsive

**Monitor memory usage**:
```bash
# Monitor during pipeline execution
watch -n 5 'free -h && ps aux | grep python | head -5'
```

**Solutions:**

1. **Memory-efficient configuration**:
   ```yaml
   data_processing:
     output_format: "csv"          # More efficient than JSON
   
   embedding_generation:
     batch_size: 8                 # Smaller batches
     save_interval: 25             # Frequent saves, less memory
   
   model_api:
     batch_size: 16                # Smaller API batches
   ```

2. **Process data in chunks**:
   ```python
   # For very large datasets, split into smaller files without
   # loading the whole CSV into memory at once
   import pandas as pd
   
   chunk_size = 1000
   
   # chunksize makes read_csv yield DataFrames of up to chunk_size rows
   reader = pd.read_csv('large_dataset.csv', chunksize=chunk_size)
   for i, chunk in enumerate(reader):
       chunk.to_csv(f'chunk_{i}.csv', index=False)
   ```

3. **Use streaming processing** (for future optimization):
   ```yaml
   # Enable memory cleanup (if available)
   system:
     cleanup_intermediate_files: true
     memory_limit_mb: 8192
   ```

### Issue: Slow Performance

**Diagnosis:**
```bash
# Check what's taking time
grep "took" outputs/job/logs/pipeline.log
grep "completed" outputs/job/logs/pipeline.log
```

**Solutions:**

1. **Optimize bottlenecks**:
   ```yaml
   # If embedding generation is slow
   model_api:
     batch_size: 64        # Larger batches (if memory allows)
   
   # If classification is slow
   classification:
     models: ["logistic_regression"]  # Skip slower models
     cross_validation:
       n_folds: 3          # Fewer folds
   ```

2. **Parallel processing**:
   ```yaml
   classification:
     cross_validation:
       n_jobs: -1          # Use all CPU cores
   ```

3. **Skip unnecessary stages**:
   ```yaml
   pipeline_stages:
     embeddings: false     # Use existing embeddings
     target_word_eval: false  # Skip if not needed
   ```

## 🔍 Debugging Strategies

### Enable Debug Logging

```yaml
logging:
  level: "DEBUG"              # Show detailed information
  console_level: "DEBUG"      # Also show on console
```

### Test with Small Dataset

```python
# Create small test dataset
import pandas as pd

df = pd.read_csv('data/medical_claims.csv')
small_df = df.head(100)  # Just 100 samples
small_df.to_csv('data/test_small.csv', index=False)
```

```yaml
# Test configuration
input:
  dataset_path: "data/test_small.csv"

target_word_evaluation:
  generations_per_prompt: 3   # Fewer generations for speed
```
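
If the source file is sorted by label, taking the first 100 rows can produce a test set with a single class, which triggers the "All labels are the same class" error. A stratified sample avoids this. The sketch below uses only the standard library; `stratified_sample` is a hypothetical helper, and rows are assumed to be dictionaries such as those produced by `csv.DictReader`.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_key="label", n=100, seed=0):
    """Pick about n rows while keeping every label class represented."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed so the test subset is reproducible
    sample = []
    for group in by_label.values():
        # Proportional share of n, but at least one row per class.
        k = max(1, round(n * len(group) / len(rows)))
        sample.extend(rng.sample(group, min(k, len(group))))
    rng.shuffle(sample)
    return sample


# Made-up data: 950 negatives followed by 50 positives.
rows = [{"mcid": f"CLAIM_{i}", "label": 0 if i < 950 else 1} for i in range(1000)]
subset = stratified_sample(rows, n=100)
print(len(subset), sorted({row["label"] for row in subset}))
```

Because of rounding and the one-row minimum per class, the result can differ from `n` by a few rows.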

### Validate Each Stage Separately

```bash
# Test embedding generation only
python main.py run-embeddings --config test_config.yaml

# Test classification only (requires embeddings)
python main.py run-classification --config test_config.yaml

# Test target evaluation only
python main.py run-target-eval --config test_config.yaml
```

### Check Intermediate Files

```bash
# Verify embeddings were generated
ls -la outputs/job/embeddings/
head -c 500 outputs/job/embeddings/train_embeddings.json  # JSON may be one long line

# Check model files
ls -la outputs/job/models/

# Verify metrics
ls -la outputs/job/metrics/
```
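
Listing files confirms that they exist, but not that their contents are usable. The following sketch checks the structure of a loaded embeddings file, assuming it contains the `embeddings` and `labels` keys used in the classification diagnosis earlier in this guide; `check_embeddings_payload` is a hypothetical helper.

```python
def check_embeddings_payload(data: dict) -> list[str]:
    """Return structural problems in a loaded embeddings file (empty if none)."""
    problems = [f"missing key: {key}" for key in ("embeddings", "labels") if key not in data]
    if problems:
        return problems
    if not data["embeddings"]:
        problems.append("no embeddings stored")
    if len(data["embeddings"]) != len(data["labels"]):
        problems.append(
            f"{len(data['embeddings'])} embeddings but {len(data['labels'])} labels"
        )
    return problems


print(check_embeddings_payload({"embeddings": [[0.1, 0.2]], "labels": [1]}))  # prints []

# With a real file:
#   import json
#   with open('outputs/job/embeddings/train_embeddings.json') as f:
#       print(check_embeddings_payload(json.load(f)))
```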

## 🆘 Getting Help

### Collect Diagnostic Information

When reporting issues, include:

1. **Configuration file**:
   ```bash
   cat your_config.yaml
   ```

2. **Error logs**:
   ```bash
   tail -100 outputs/job/logs/pipeline.log
   ```

3. **System information**:
   ```bash
   python --version
   pip list | grep -E "pandas|numpy|scikit-learn|pydantic"
   free -h
   df -h
   ```

4. **Data sample**:
   ```bash
   head -5 data/medical_claims.csv
   wc -l data/medical_claims.csv
   ```

### Common Solutions Checklist

Before seeking help, try these common fixes:

- [ ] **Restart model server** and test API connectivity
- [ ] **Check file paths** are correct and files exist
- [ ] **Validate configuration** using built-in validation
- [ ] **Test with small dataset** (100 samples) first
- [ ] **Check logs** for specific error messages
- [ ] **Verify data format** matches requirements
- [ ] **Ensure sufficient disk space** and memory
- [ ] **Try default template** configuration first

### Progressive Debugging Approach

1. **Start simple**: Use template configuration with small dataset
2. **Test connectivity**: Verify API endpoints work
3. **Validate data**: Ensure CSV format is correct
4. **Run one stage**: Test embedding generation alone
5. **Scale up gradually**: Increase dataset size and complexity
6. **Monitor resources**: Watch memory and CPU usage
7. **Read logs carefully**: Error messages are usually informative

## 🔗 Next Steps

- **[07_Advanced_Usage.ipynb](07_Advanced_Usage.ipynb)** - Production deployment and optimization
- **[01_Introduction_to_MGPT_Eval.ipynb](01_Introduction_to_MGPT_Eval.ipynb)** - Return to basics if needed
- **[04_Configuration_Guide.ipynb](04_Configuration_Guide.ipynb)** - Review configuration options

### Quick Diagnostic Commands:

```bash
# Test API connectivity
curl -X GET http://localhost:8000/health

# Validate configuration
python main.py validate --config your_config.yaml

# Check recent logs
tail -50 outputs/*/logs/pipeline.log

# Monitor resource usage
htop
```

Remember: many issues are configuration-related and can be fixed by carefully reading the error message and comparing your setup against the examples in this guide.