# ⚙️ Complete Configuration Guide

## Overview

This guide covers the configuration options for the MGPT-Eval pipeline. Understanding these settings helps you tune the pipeline for your use case, dataset size, and computational resources.

## 📁 Configuration File Structure

All configuration files use YAML format with this structure:

```yaml
input:                    # Data sources and paths
job:                      # Job identification and output
model_api:                # Your MGPT model server settings
pipeline_stages:          # Which stages to run
data_processing:          # Data validation and processing
embedding_generation:     # Embedding creation settings
classification:           # Classifier training settings
evaluation:               # Model evaluation settings
target_word_evaluation:   # Target word method settings
output:                   # Output directories and formats
logging:                  # Logging configuration
```

## 📊 Input Configuration

### Option 1: Single Dataset (Auto-Split)

```yaml
input:
  dataset_path: "data/medical_claims.csv"  # 👈 Your CSV file
  split_ratio: 0.8                        # 80% train, 20% test
```

**When to use:**
- You have one CSV file with all your data
- You want automatic stratified train/test splitting
- Most common scenario for new projects

**Split ratio options:**
- `0.7`: 70% train, 30% test (smaller datasets)
- `0.8`: 80% train, 20% test (recommended)
- `0.9`: 90% train, 10% test (large datasets)
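
What a stratified split does for a given `split_ratio` can be pictured with the sketch below. It is a plain-Python illustration, not the pipeline's own code; the function name `stratified_split` and the `(text, label)` row layout are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(rows, split_ratio=0.8, seed=42):
    """Split (text, label) rows so each label keeps roughly the same train share."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * split_ratio)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical data: 80 negative and 20 positive claims.
rows = [(f"claim {i}", 0) for i in range(80)] + [(f"claim {i}", 1) for i in range(80, 100)]
train, test = stratified_split(rows, split_ratio=0.8)
print(len(train), len(test))
```

Because each label is split separately, the 20% positive rate of the full dataset is preserved in both the train and test portions.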

### Option 2: Separate Train/Test Files

```yaml
input:
  train_dataset_path: "data/train_claims.csv"
  test_dataset_path: "data/test_claims.csv"
  # split_ratio is ignored when using separate files
```

**When to use:**
- You have pre-split data
- You want to maintain specific train/test distributions
- You're comparing with previous experiments

### Option 3: Pre-computed Embeddings

```yaml
input:
  train_embeddings_path: "outputs/job1/embeddings/train_embeddings.json"
  test_embeddings_path: "outputs/job1/embeddings/test_embeddings.json"
```

**When to use:**
- You want to skip expensive embedding generation
- You're experimenting with different classifiers
- You're reusing embeddings from previous runs

## 🏗️ Job Configuration

```yaml
job:
  name: "diabetes_prediction_v2"    # 👈 Descriptive job name
  output_dir: "outputs"             # Base output directory
  random_seed: 42                   # Reproducible results
```

### Job Naming Best Practices:

```yaml
# Good job names:
name: "diabetes_codes_full_pipeline"     # Clear purpose
name: "cardiovascular_risk_embeddings"   # Specific condition
name: "claim_classification_v3"          # Version tracking
name: "emergency_dept_target_eval"       # Department specific

# Avoid:
name: "test"                             # Too generic
name: "mgpt_eval_job"                    # Default name
name: "run1"                             # Not descriptive
```

### Output Structure:
```
outputs/
└── {job.name}/                    # Your job name becomes folder
    ├── embeddings/
    ├── models/
    ├── metrics/
    ├── summary/
    └── logs/
```
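
The layout above can be expressed as a small path helper. This is only a sketch that assumes the folders are composed as `output_dir/name/<subfolder>`, as the tree suggests; `job_dirs` is a hypothetical name, and the snippet builds paths without creating anything on disk.

```python
from pathlib import Path

SUBDIRS = ["embeddings", "models", "metrics", "summary", "logs"]


def job_dirs(output_dir, job_name):
    """Return the expected sub-directory paths for one job, keyed by name."""
    root = Path(output_dir) / job_name
    return {name: root / name for name in SUBDIRS}


dirs = job_dirs("outputs", "diabetes_prediction_v2")
for path in dirs.values():
    print(path.as_posix())
```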

## 🔗 Model API Configuration

```yaml
model_api:
  base_url: "http://localhost:8000"        # 👈 Your model server URL
  batch_size: 32                           # Requests per batch
  timeout: 300                             # 5 minute timeout
  max_retries: 3                           # Retry attempts
```

### Server URL Examples:

```yaml
# Local development
base_url: "http://localhost:8000"

# Remote server
base_url: "https://mgpt-api.yourcompany.com"

# Docker container
base_url: "http://mgpt-container:8000"

# Cloud deployment
base_url: "https://mgpt-api.cloud.example.com"
```

### Batch Size Tuning:

| Server Capacity | Recommended batch_size | Use Case |
|----------------|------------------------|----------|
| Small (1-2 GB RAM) | 8-16 | Development, testing |
| Medium (4-8 GB RAM) | 32-64 | Production, medium datasets |
| Large (16+ GB RAM) | 64-128 | Large-scale production |

### Timeout Settings:

```yaml
# For embedding generation
timeout: 300      # 5 minutes (recommended)

# For text generation (slower)
timeout: 600      # 10 minutes

# For very large batches
timeout: 1200     # 20 minutes
```
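
To see how `batch_size` and `max_retries` interact, here is a minimal client-side sketch. It is not the pipeline's actual client: `batches`, `send_with_retries`, and `flaky_send` are hypothetical names, the server is simulated by a function that fails once, and `max_retries` is assumed to mean retries after the first attempt. A real client would also pass `timeout` to its HTTP call.

```python
import time


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def send_with_retries(send, batch, max_retries=3, backoff=0.0):
    """Call send(batch); on failure, retry up to max_retries more times."""
    for attempt in range(max_retries + 1):
        try:
            return send(batch)
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(backoff * (2 ** attempt))  # exponential backoff


# Hypothetical stand-in for the model server: fails once, then succeeds.
calls = {"count": 0}


def flaky_send(batch):
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated transient failure")
    return [len(text) for text in batch]


claims = [f"claim {i}" for i in range(70)]
results = []
for batch in batches(claims, batch_size=32):
    results.extend(send_with_retries(flaky_send, batch, max_retries=3))
print(len(results), calls["count"])
```

With 70 claims and a batch size of 32, three batches are sent; the simulated failure adds one extra call.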

## 🔄 Pipeline Stages Configuration

```yaml
pipeline_stages:
  embeddings: true           # Generate embeddings from text
  classification: true       # Train ML classifiers
  evaluation: true           # Evaluate trained models
  target_word_eval: true     # Run target word evaluation
  summary_report: true       # Create summary report
  method_comparison: true    # Compare embedding-based and target word methods
```

### Common Stage Combinations:

#### 1. Embeddings Only
```yaml
pipeline_stages:
  embeddings: true
  classification: false
  evaluation: false
  target_word_eval: false
  summary_report: false
  method_comparison: false
```
**Use case**: Generate embeddings for later use

#### 2. Classification from Embeddings
```yaml
pipeline_stages:
  embeddings: false          # Use existing embeddings
  classification: true
  evaluation: true
  target_word_eval: false
  summary_report: true
  method_comparison: false
```
**Use case**: Train classifiers on pre-computed embeddings

#### 3. Target Word Only
```yaml
pipeline_stages:
  embeddings: false
  classification: false
  evaluation: false
  target_word_eval: true
  summary_report: true
  method_comparison: false
```
**Use case**: Quick evaluation using target codes

#### 4. Full Comparison
```yaml
pipeline_stages:
  embeddings: true
  classification: true
  evaluation: true
  target_word_eval: true
  summary_report: true
  method_comparison: true    # 👈 Compare both methods
```
**Use case**: Comprehensive evaluation and method comparison
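
Some stage combinations only make sense together. The sketch below checks two such dependencies before a run. Both rules are assumptions inferred from the combinations above (classification needs embeddings from somewhere, and method comparison needs both methods), and `check_stages` is a hypothetical helper rather than part of the pipeline.

```python
def check_stages(config):
    """Return a list of human-readable problems found in the stage flags."""
    stages = config.get("pipeline_stages", {})
    inputs = config.get("input", {})
    problems = []

    has_embeddings = "train_embeddings_path" in inputs and "test_embeddings_path" in inputs
    if stages.get("classification") and not stages.get("embeddings") and not has_embeddings:
        problems.append("classification needs embeddings: enable the stage or set embedding paths")

    if stages.get("method_comparison") and not (
        stages.get("classification") and stages.get("target_word_eval")
    ):
        problems.append("method_comparison needs both classification and target_word_eval")
    return problems


config = {
    "input": {"dataset_path": "data/medical_claims.csv"},
    "pipeline_stages": {
        "embeddings": False,
        "classification": True,
        "target_word_eval": False,
        "method_comparison": True,
    },
}
for problem in check_stages(config):
    print(problem)
```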

## 🧠 Embedding Generation Configuration

```yaml
embedding_generation:
  batch_size: 16                           # Claims per processing batch
  save_interval: 100                       # Save progress every N batches
  checkpoint_dir: "outputs/checkpoints"    # Checkpoint directory
  resume_from_checkpoint: true             # Resume if interrupted
  tokenizer_path: "/app/tokenizer"         # Path to tokenizer
```

### Data Processing Settings:

```yaml
data_processing:
  random_seed: 42
  max_sequence_length: 512          # Maximum tokens per claim
  include_mcid: true                # Include Medical Claim IDs
  output_format: "json"             # json or csv
  train_test_split: 0.8             # Fallback split ratio
```

### Sequence Length Guidelines:

| Sequence Length | Typical Use Case | Memory Impact |
|----------------|------------------|---------------|
| 128 | Short claims, fast processing | Low |
| 256 | Medium claims, balanced | Medium |
| 512 | Long claims, comprehensive | High |
| 1024+ | Very long claims, research | Very High |

### Output Format Comparison:

```yaml
# JSON format (default)
output_format: "json"
# Pros: Human readable, includes metadata
# Cons: Larger file size
# Use for: Development, small-medium datasets

# CSV format
output_format: "csv"
# Pros: Compact, efficient loading
# Cons: Less metadata
# Use for: Large datasets, production
```

### Checkpoint Configuration:

```yaml
# For large datasets (10,000+ claims)
save_interval: 50             # More frequent saves

# For medium datasets (1,000-10,000 claims)
save_interval: 100            # Balanced frequency

# For small datasets (<1,000 claims)
save_interval: 200            # Less frequent saves
```
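
The idea behind `save_interval` and `resume_from_checkpoint` can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the checkpoint is assumed to be a JSON list of the embeddings completed so far, and `embed_with_checkpoints` and `fake_embed` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


def embed_with_checkpoints(claims, embed_batch, checkpoint_file,
                           batch_size=16, save_interval=100):
    """Embed claims batch by batch, saving progress every save_interval batches."""
    path = Path(checkpoint_file)
    done = json.loads(path.read_text()) if path.exists() else []  # resume if present

    start = len(done)  # claims before this index are already embedded
    for n, offset in enumerate(range(start, len(claims), batch_size), start=1):
        done.extend(embed_batch(claims[offset:offset + batch_size]))
        if n % save_interval == 0:
            path.write_text(json.dumps(done))
    path.write_text(json.dumps(done))  # final save
    return done


# Hypothetical embedding function: one fake 2-d vector per claim.
def fake_embed(batch):
    return [[len(text), 0.0] for text in batch]


claims = [f"claim {i}" for i in range(50)]
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "embeddings_checkpoint.json"
    # Simulate an interrupted run that already finished the first 32 claims.
    ckpt.write_text(json.dumps(fake_embed(claims[:32])))
    vectors = embed_with_checkpoints(claims, fake_embed, ckpt, batch_size=16, save_interval=1)
print(len(vectors))
```

A smaller `save_interval` means less work is lost after an interruption, at the cost of more frequent disk writes.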

## 🤖 Classification Configuration

```yaml
classification:
  models: ["logistic_regression", "svm", "random_forest"]
  
  cross_validation:
    n_folds: 5                          # CV folds
    scoring: "roc_auc"                  # Optimization metric
    n_jobs: -1                          # Parallel jobs
  
  hyperparameter_search:
    # Detailed hyperparameter grids...
```

### Model Selection Guide:

#### Logistic Regression
```yaml
models: ["logistic_regression"]
hyperparameter_search:
  logistic_regression:
    C: [0.001, 0.01, 0.1, 1, 10, 100]     # Regularization strength
    penalty: ["l1", "l2"]                  # Regularization type
    solver: ["liblinear", "saga"]          # Optimization algorithm
```
**Best for**: Fast training, interpretable results, linear relationships

#### Support Vector Machine
```yaml
models: ["svm"]
hyperparameter_search:
  svm:
    C: [0.1, 1, 10]                        # Regularization
    kernel: ["rbf", "linear"]              # Kernel type
    gamma: ["scale", "auto"]               # Kernel coefficient
```
**Best for**: High-dimensional data, robust to overfitting, non-linear patterns

#### Random Forest
```yaml
models: ["random_forest"]
hyperparameter_search:
  random_forest:
    n_estimators: [100, 200, 300]         # Number of trees
    max_depth: [10, 20, 30, null]         # Tree depth
    min_samples_split: [2, 5, 10]         # Split threshold
```
**Best for**: Robust predictions, feature importance, noisy data
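
Grid size drives training time, so it helps to count fits before launching a search. The sketch below assumes the search is exhaustive (every combination is tried on every fold, as scikit-learn's `GridSearchCV` would do); the source does not state which search strategy the pipeline uses, and `total_fits` is a hypothetical helper.

```python
from math import prod

# Grids copied from the examples above.
GRIDS = {
    "logistic_regression": {
        "C": [0.001, 0.01, 0.1, 1, 10, 100],
        "penalty": ["l1", "l2"],
        "solver": ["liblinear", "saga"],
    },
    "svm": {
        "C": [0.1, 1, 10],
        "kernel": ["rbf", "linear"],
        "gamma": ["scale", "auto"],
    },
    "random_forest": {
        "n_estimators": [100, 200, 300],
        "max_depth": [10, 20, 30, None],
        "min_samples_split": [2, 5, 10],
    },
}


def total_fits(grid, n_folds=5):
    """Number of model fits for an exhaustive search over grid with n_folds CV."""
    return prod(len(values) for values in grid.values()) * n_folds


for model, grid in GRIDS.items():
    print(model, total_fits(grid))
```

Under that assumption, with 5 folds the three grids need 120, 60, and 180 fits respectively, so trimming one parameter list can shorten a run considerably.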

### Cross-Validation Settings:

```yaml
cross_validation:
  # Choose one value per key; alternatives are commented out
  # because YAML does not allow duplicate keys in a mapping.

  # Small datasets (<1,000 samples)
  # n_folds: 3
  
  # Medium datasets (1,000-10,000 samples)
  n_folds: 5                    # Recommended
  
  # Large datasets (>10,000 samples)
  # n_folds: 10
  
  # Scoring options
  scoring: "roc_auc"             # Ranking quality across thresholds (binary classification)
  # scoring: "accuracy"          # Simple overall correctness
  # scoring: "f1"                # Balance precision/recall
```

## 📊 Evaluation Configuration

```yaml
evaluation:
  metrics: ["accuracy", "precision", "recall", "f1_score", "roc_auc", "confusion_matrix"]
  
  visualization:
    generate_plots: true
    plot_formats: ["png", "pdf"]
    dpi: 300
```

### Metrics Selection Guide:

```yaml
# For balanced datasets
metrics: ["accuracy", "precision", "recall", "f1_score", "roc_auc"]

# For imbalanced datasets (focus on minority class)
metrics: ["precision", "recall", "f1_score", "roc_auc", "confusion_matrix"]

# For clinical applications (avoid missing positives)
metrics: ["recall", "f1_score", "roc_auc", "confusion_matrix"]

# For research (comprehensive analysis)
metrics: ["accuracy", "precision", "recall", "f1_score", "roc_auc", "confusion_matrix"]
```
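
The reason accuracy is dropped for imbalanced data becomes clear with a small worked example. The counts below are made up for illustration, and `binary_metrics` is a hypothetical helper that applies the standard definitions.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Hypothetical imbalanced test set: 50 positives among 1,000 claims.
# The classifier finds 10 of them and raises 5 false alarms.
metrics = binary_metrics(tp=10, fp=5, fn=40, tn=945)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy comes out at 0.955 even though recall is only 0.200, which is why recall and F1 are the more informative metrics when positives are rare.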

### Visualization Options:

```yaml
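visualization:
  generate_plots: true
  
  # Choose one block; alternatives are commented out
  # because YAML does not allow duplicate keys in a mapping.

  # For presentations/web
  # plot_formats: ["png"]
  # dpi: 150
  
  # For publications
  plot_formats: ["pdf", "png"]
  dpi: 300
  
  # For high-quality prints
  # plot_formats: ["pdf", "eps"]
  # dpi: 600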
```

## 🎯 Target Word Evaluation Configuration

```yaml
target_word_evaluation:
  enable: true
  
  # Method 1: Direct list
  target_codes: ["E119", "76642", "N6320", "K9289"]
  
  # Method 2: Load from file
  # target_codes_file: "configs/target_codes.txt"
  
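  generations_per_prompt: 10        # Samples per prompt (more = more robust, slower)
  max_new_tokens: 200               # Maximum tokens generated per sample
  temperature: 0.8                  # Sampling temperature (higher = more varied output)
  top_k: 50                         # Sample only from the 50 most likely tokens
  search_method: "exact"            # How generated text is matched against target codes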
```
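
One plausible reading of how these settings combine is sketched below. It is an assumption, not the pipeline's confirmed logic: a claim is flagged positive if any of its generations contains a target code, `"exact"` is taken to mean a whole-token match, and generated codes are assumed to be whitespace-separated. The function names are hypothetical.

```python
def contains_target(generation, target_codes):
    """Exact match: a target code must appear as a whole whitespace-separated token."""
    return any(token in target_codes for token in generation.split())


def predict_positive(generations, target_codes):
    """Flag a claim as positive if any of its generations contains a target code."""
    targets = set(target_codes)
    return any(contains_target(text, targets) for text in generations)


target_codes = ["E119", "76642", "N6320", "K9289"]
# Hypothetical model outputs for one prompt (generations_per_prompt = 3).
generations = [
    "Z0000 M549 I10",
    "E1190 R50 Z0000",   # E1190 is not an exact match for E119
    "I10 E119 M549",
]
print(predict_positive(generations, target_codes))
```

Under this reading, raising `generations_per_prompt` gives each claim more chances to produce a target code, which tends to raise recall and may also raise false positives.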

### Target Code Selection Guidelines:

#### Clinical Condition Codes:
```yaml
# Diabetes-related codes
target_codes: ["E119", "E1022", "E1040", "E1051", "E1059"]

# Cardiovascular codes
target_codes: ["I10", "I259", "E785", "Z87891", "I110"]

# Emergency department codes
target_codes: ["R50", "R06", "R060", "R509", "G9340"]
```

#### Code Frequency Considerations:
```yaml
# High-frequency codes (appear in >10% of data)
target_codes: ["Z0000", "M549", "R50"]     # May cause high false positives

# Medium-frequency codes (appear in 1-10% of data)
target_codes: ["E119", "I10", "N6320"]     # Balanced (recommended)

# Low-frequency codes (appear in <1% of data)
target_codes: ["O0903", "Z87891"]          # May cause low recall
```

### Generation Parameters Tuning:

#### For Speed (Quick Testing):
```yaml
generations_per_prompt: 5
max_new_tokens: 100
temperature: 0.7
top_k: 30
```

#### For Accuracy (Production):
```yaml
generations_per_prompt: 15
max_new_tokens: 300
temperature: 0.8
top_k: 50
```

#### For Robustness (Research):
```yaml
generations_per_prompt: 20
max_new_tokens: 400
temperature: 0.9
top_k: 100
```

### Target Codes File Format:

**File: `configs/target_codes.txt`**
```
# Diabetes-related codes
E119      # Type 2 diabetes mellitus without complications
E1022     # Type 1 diabetes mellitus with diabetic chronic kidney disease
E1040     # Type 1 diabetes mellitus with diabetic neuropathy

# Cardiovascular codes
I10       # Essential hypertension
I259      # Chronic ischemic heart disease, unspecified
E785      # Hyperlipidemia, unspecified

# Lines starting with # are comments
# Empty lines are ignored
```
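
A parser for this file format could look like the sketch below. It follows the two rules stated in the example (comments start with `#`, empty lines are ignored) and additionally assumes that trailing comments after a code are stripped, as the sample lines suggest. `parse_target_codes` is a hypothetical name, not the pipeline's loader.

```python
def parse_target_codes(text):
    """Extract codes from file content, dropping comments and blank lines."""
    codes = []
    for line in text.splitlines():
        code = line.split("#", 1)[0].strip()  # remove full-line and trailing comments
        if code:
            codes.append(code)
    return codes


sample = """\
# Diabetes-related codes
E119      # Type 2 diabetes mellitus without complications
E1022     # Type 1 diabetes mellitus with diabetic chronic kidney disease

# Cardiovascular codes
I10       # Essential hypertension
"""
print(parse_target_codes(sample))
```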

## 📂 Output Configuration

```yaml
output:
  embeddings_dir: "outputs/embeddings"     # Embedding files
  models_dir: "outputs/models"              # Trained models
  metrics_dir: "outputs/metrics"            # Evaluation results
  logs_dir: "outputs/logs"                  # Log files
  save_best_model_only: false              # Save all vs best only
  model_format: "pickle"                    # pickle or joblib
```

### Directory Structure:
```
outputs/{job.name}/
├── embeddings/
│   ├── train_embeddings.{format}
│   └── test_embeddings.{format}
├── models/
│   ├── logistic_regression_model.pkl
│   ├── svm_model.pkl
│   └── random_forest_model.pkl
├── metrics/
│   ├── logistic_regression/
│   ├── svm/
│   ├── random_forest/
│   └── target_word_evaluation/
├── summary/
│   ├── pipeline_summary.json
│   └── method_comparison.json
└── logs/
    └── pipeline.log
```

### Model Storage Options:

```yaml
# Save all trained models (recommended for analysis)
save_best_model_only: false

# Save only best performing model (saves disk space)
save_best_model_only: true

# Model serialization format
model_format: "pickle"    # Python standard library, no extra dependency
model_format: "joblib"    # Better for large numpy arrays
```

## 📝 Logging Configuration

```yaml
logging:
  level: "INFO"                             # File log level
  console_level: "INFO"                     # Console log level
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
  file: "outputs/logs/pipeline.log"         # Log file path
```
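
The split between `level` (file) and `console_level` maps naturally onto Python's `logging` handlers. The sketch below shows one way to wire that up; it is an assumption about how the settings could be applied, and `build_logger` and the logger name are hypothetical.

```python
import logging
import sys


def build_logger(name, level="INFO", console_level="INFO", log_file=None,
                 fmt="%(asctime)s - %(name)s - %(levelname)s - %(message)s"):
    """Create a logger with separate thresholds for the file and the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    logger.handlers.clear()
    formatter = logging.Formatter(fmt)

    console = logging.StreamHandler(sys.stderr)
    console.setLevel(getattr(logging, console_level))
    console.setFormatter(formatter)
    logger.addHandler(console)

    if log_file is not None:
        file_handler = logging.FileHandler(log_file)
        file_handler.setLevel(getattr(logging, level))
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger


logger = build_logger("mgpt_eval_demo", level="DEBUG", console_level="WARNING")
logger.info("not shown on the console")
logger.warning("shown on the console")
```

Passing `log_file="outputs/logs/pipeline.log"` would add a file handler that records everything at `level` and above, while the console stays quiet.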

### Log Level Guidelines:

#### Development & Debugging:
```yaml
logging:
  level: "DEBUG"           # Detailed information
  console_level: "INFO"    # Less console noise
```

#### Production:
```yaml
logging:
  level: "INFO"            # Important events only
  console_level: "WARNING" # Minimal console output
```

#### Research/Analysis:
```yaml
logging:
  level: "INFO"            # Balanced detail
  console_level: "INFO"    # See progress
```

### Log Levels Explained:

| Level | What it logs | Use case |
|-------|-------------|----------|
| DEBUG | Everything (API calls, data processing) | Debugging issues |
| INFO | Major steps, progress, results | Normal operation |
| WARNING | Issues that don't stop execution | Production monitoring |
| ERROR | Errors that stop execution | Critical issues only |

## 🎛️ Template-Based Configuration

### Starting with Templates:

1. **Copy a template:**
   ```bash
   cp configs/templates/04_full_pipeline.yaml my_config.yaml
   ```

2. **Edit key fields marked with 👈:**
   ```yaml
   input:
     dataset_path: "data/my_claims.csv"  # 👈 UPDATE
   
   model_api:
     base_url: "http://my-server:8000"   # 👈 UPDATE
   
   target_word_evaluation:
     target_codes: ["E119", "I10"]       # 👈 UPDATE
   ```

3. **Run the pipeline:**
   ```bash
   python main.py run-all --config my_config.yaml
   ```

### Template Selection Guide:

| Template | Purpose | Required Updates |
|----------|---------|------------------|
| `01_embeddings_only.yaml` | Generate embeddings | dataset_path, base_url |
| `02_from_embeddings.yaml` | Train classifiers | embedding paths |
| `03_target_words_only.yaml` | Target evaluation | dataset_path, base_url, target_codes |
| `04_full_pipeline.yaml` | Complete analysis | dataset_path, base_url, target_codes |

## ⚡ Performance Optimization

### For Large Datasets (>10,000 claims):

```yaml
model_api:
  batch_size: 64              # Larger batches
  timeout: 600                # Longer timeout

embedding_generation:
  batch_size: 32              # Process more at once
  save_interval: 50           # Frequent checkpoints

data_processing:
  output_format: "csv"        # More efficient format

classification:
  cross_validation:
    n_jobs: -1                # Use all CPU cores
```

### For Memory-Constrained Environments:

```yaml
model_api:
  batch_size: 8               # Smaller batches

embedding_generation:
  batch_size: 4               # Process fewer at once
  save_interval: 25           # More frequent saves

data_processing:
  max_sequence_length: 256    # Shorter sequences
```

### For High-Performance Systems:

```yaml
model_api:
  batch_size: 128             # Large batches
  timeout: 900                # Generous timeout

embedding_generation:
  batch_size: 64              # Large processing batches

classification:
  cross_validation:
    n_jobs: -1                # All CPU cores
    n_folds: 10               # More thorough CV
```

## 🔍 Validation and Troubleshooting

### Configuration Validation:

The pipeline validates your configuration and reports errors. You can also run validation on its own:

```bash
# Test your configuration
python main.py validate --config my_config.yaml
```

### Common Configuration Issues:

#### Missing Required Fields:
```yaml
# ❌ Error: Missing target_codes when target_word_eval is enabled
pipeline_stages:
  target_word_eval: true
target_word_evaluation:
  enable: true
  # target_codes: []  # Missing!

# ✅ Fix: Provide target codes
target_word_evaluation:
  enable: true
  target_codes: ["E119", "I10"]
```

#### Path Issues:
```yaml
# ❌ Error: File not found
input:
  dataset_path: "data/medical_claims.csv"  # File doesn't exist

# ✅ Fix: Check file path
input:
  dataset_path: "../data/medical_claims.csv"  # Correct path
```

#### Invalid Values:
```yaml
# ❌ Error: Invalid split ratio
input:
  split_ratio: 1.5  # Must be between 0.1 and 0.9

# ✅ Fix: Valid range
input:
  split_ratio: 0.8
```
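
The two value checks illustrated above can be sketched in a few lines. This is not the pipeline's validator; `validate_config` is a hypothetical function that takes an already-parsed configuration dictionary and covers only the split ratio range and the target code requirement.

```python
def validate_config(config):
    """Return a list of error messages; an empty list means no problems were found."""
    errors = []

    inputs = config.get("input", {})
    ratio = inputs.get("split_ratio")
    if ratio is not None and not 0.1 <= ratio <= 0.9:
        errors.append(f"input.split_ratio must be between 0.1 and 0.9, got {ratio}")

    stages = config.get("pipeline_stages", {})
    target = config.get("target_word_evaluation", {})
    if stages.get("target_word_eval"):
        if not target.get("target_codes") and not target.get("target_codes_file"):
            errors.append("target_word_evaluation needs target_codes or target_codes_file")
    return errors


bad_config = {
    "input": {"dataset_path": "data/medical_claims.csv", "split_ratio": 1.5},
    "pipeline_stages": {"target_word_eval": True},
    "target_word_evaluation": {"enable": True},
}
for message in validate_config(bad_config):
    print(message)
```

Collecting all errors in a list, rather than stopping at the first one, lets you fix a configuration in a single pass.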

## 🔗 Next Steps

- **[05_Results_Analysis.ipynb](05_Results_Analysis.ipynb)** - Understanding your results
- **[06_Troubleshooting.ipynb](06_Troubleshooting.ipynb)** - Common issues and solutions
- **[07_Advanced_Usage.ipynb](07_Advanced_Usage.ipynb)** - Production deployment

### Quick Reference Commands:

```bash
# Validate configuration
python main.py validate --config my_config.yaml

# Run full pipeline
python main.py run-all --config my_config.yaml

# Run a specific stage
python main.py run-embeddings --config my_config.yaml
python main.py run-classification --config my_config.yaml
python main.py run-evaluation --config my_config.yaml
python main.py run-target-eval --config my_config.yaml
```