# CTI-to-Hunt Logic Fine-tuning Environment Setup

This notebook sets up the environment for fine-tuning a model to convert cyber threat intelligence text into concise hunt logic.

## Hardware Requirements
- **GPU**: RTX 4090 (24GB) or A100 (40GB) recommended
- **RAM**: 32GB+ system memory
- **Storage**: 100GB+ available space

## Model Target
- **Base Model**: Llama 3.1 8B Instruct
- **Training Method**: QLoRA (4-bit quantized LoRA)
- **Output Format**: 1-10 line hunt logic statements
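
The output constraint above lends itself to an automated check on generated samples. The sketch below assumes that "lines" means non-empty lines; `is_valid_hunt_logic` is a hypothetical helper, not part of CTIScraper, and the sample statement is made up for illustration.

```python
def is_valid_hunt_logic(text: str, min_lines: int = 1, max_lines: int = 10) -> bool:
    """Return True if `text` has between min_lines and max_lines non-empty lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    return min_lines <= len(lines) <= max_lines


# Illustrative (made-up) hunt logic statement
sample = "process_name == 'powershell.exe' AND command_line CONTAINS '-enc'"
print(is_valid_hunt_logic(sample))
```

A check like this could be used to filter training examples or to score model outputs during evaluation.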

## 1. Environment Verification

In [None]:
import sys
import os
import platform
import subprocess
from pathlib import Path

print(f"Python version: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"Current working directory: {os.getcwd()}")

# Check if we're in the CTIScraper directory
if not Path('./src').exists():
    print("⚠️  Warning: Not in CTIScraper root directory")
    print("Please navigate to the CTIScraper root directory")
else:
    print("✅ In CTIScraper root directory")
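
If the notebook is started from a subdirectory such as `notebooks/`, the check above only warns. One possible alternative, sketched here, is to walk up the parent directories until one containing `src` is found. `find_repo_root` is a hypothetical helper, and using `src` as the marker is an assumption carried over from the check above.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, marker: str = "src") -> Optional[Path]:
    """Walk upward from `start` and return the first directory containing `marker`."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    return None


root = find_repo_root(Path.cwd())
print(f"Repository root: {root}" if root else "No directory containing 'src' found")
```

When a root is found, `os.chdir(root)` would make the relative paths used later in this notebook resolve correctly.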

## 2. GPU and CUDA Check

In [None]:
try:
    import torch
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    
    if torch.cuda.is_available():
        print(f"CUDA version: {torch.version.cuda}")
        print(f"GPU count: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            gpu_name = torch.cuda.get_device_name(i)
            gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1e9
            print(f"  GPU {i}: {gpu_name} ({gpu_memory:.1f} GB)")
    else:
        print("⚠️  No CUDA GPUs detected. Training will be slow on CPU.")
        
except ImportError:
    print("❌ PyTorch not installed. Please install requirements first.")
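
To put the reported GPU memory in context, the following back-of-the-envelope sketch estimates the memory needed for the model weights alone. It assumes a round 8e9 parameters for an "8B" model and ignores activations, LoRA adapters, optimizer state, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumption: round figure for an "8B" model

for label, bits in [("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>9}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB for weights")
```

The estimate uses the same decimal GB unit as the GPU check above, so the two numbers can be compared directly.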

## 3. Install ML Dependencies

In [None]:
# Install ML dependencies into the active kernel's environment
%pip install -r requirements-ml.txt

# Download spaCy model for text processing, using the kernel's interpreter
!{sys.executable} -m spacy download en_core_web_sm

## 4. Verify Core Libraries

In [None]:
# Test core library imports
libraries = {
    'torch': 'PyTorch',
    'transformers': 'Hugging Face Transformers',
    'datasets': 'Hugging Face Datasets',
    'peft': 'PEFT (Parameter Efficient Fine-tuning)',
    'bitsandbytes': 'BitsAndBytes (Quantization)',
    'accelerate': 'Hugging Face Accelerate',
    'trl': 'TRL (Transformer Reinforcement Learning)',
    'wandb': 'Weights & Biases',
    'evaluate': 'Hugging Face Evaluate'
}

print("Library verification:")
for lib, name in libraries.items():
    try:
        module = __import__(lib)
        version = getattr(module, '__version__', 'unknown')
        print(f"✅ {name}: {version}")
    except ImportError as e:
        print(f"❌ {name}: Failed to import - {e}")
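
The loop above prints results but does not keep them. A variant that returns the results, so later cells can act on missing libraries, might look like the sketch below. `check_imports` is a hypothetical helper; it catches `Exception` rather than only `ImportError` on the assumption that some packages fail with other error types when a native backend cannot be loaded.

```python
import importlib
from typing import Dict, Iterable, Optional


def check_imports(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each module name to its version string, or None if the import fails."""
    results: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:  # broad on purpose: see the note above
            results[name] = None
        else:
            results[name] = str(getattr(module, "__version__", "unknown"))
    return results


status = check_imports(["torch", "transformers", "peft"])
missing = [name for name, version in status.items() if version is None]
print(f"Missing libraries: {missing or 'none'}")
```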

## 5. Create Training Directory Structure

In [None]:
# Create directory structure for training
directories = [
    'models',
    'models/base',
    'models/checkpoints',
    'models/fine_tuned',
    'data',
    'data/raw',
    'data/processed',
    'data/training',
    'notebooks/configs',
    'logs',
    'outputs'
]

for directory in directories:
    Path(directory).mkdir(parents=True, exist_ok=True)
    print(f"✅ Directory ready: {directory}")

print("\nTraining directory structure created successfully!")
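
Since these directories will hold model weights and checkpoints, it may be worth confirming the storage requirement at this point. The sketch below uses the standard library's `shutil.disk_usage`; the 100 GB threshold comes from the Hardware Requirements list, and `free_space_gb` is a hypothetical helper.

```python
import shutil


def free_space_gb(path: str = ".") -> float:
    """Free space on the filesystem containing `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


REQUIRED_GB = 100  # from the Hardware Requirements list
free_gb = free_space_gb(".")
status = "OK" if free_gb >= REQUIRED_GB else "below the recommended minimum"
print(f"Free disk space: {free_gb:.1f} GB ({status})")
```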

## 6. Test Model Loading (Optional)

In [None]:
# Optional: Test loading a small model to verify everything works
test_model = input("Test model loading? (y/n): ").lower().strip()

if test_model == 'y':
    print("Testing model loading with a small model...")
    
    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch
    
    # Test with a small model first
    model_name = "microsoft/DialoGPT-small"
    
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
            device_map="auto" if torch.cuda.is_available() else None
        )
        
        print(f"✅ Successfully loaded test model: {model_name}")
        print(f"   Model parameters: {sum(p.numel() for p in model.parameters()):,}")
        print(f"   Model device: {next(model.parameters()).device}")
        
        # Clean up
        del model, tokenizer
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            
    except Exception as e:
        print(f"❌ Failed to load test model: {e}")
else:
    print("Skipping model loading test")
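
The test above loads a small model in half precision, while the stated training method is QLoRA. As a rough sketch of what that configuration could look like, the block below collects typical settings for `BitsAndBytesConfig` (transformers) and `LoraConfig` (peft). All values are assumed starting points, not settings taken from this project, and the imports are deferred inside `build_configs` so the cell runs even before the libraries are installed.

```python
# Assumed hyperparameters: common QLoRA starting points, not project settings.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",       # 4-bit NormalFloat quantization
    "bnb_4bit_use_double_quant": True,  # also quantize the quantization constants
}

LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    # Attention projection names used by Llama-style models
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}


def build_configs():
    """Build the quantization and LoRA config objects (needs torch, transformers, peft)."""
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16, **QUANT_KWARGS
    )
    return quant_config, LoraConfig(**LORA_KWARGS)
```

The quantization config would typically be passed as `quantization_config=` to `AutoModelForCausalLM.from_pretrained`, and the LoRA config to `peft.get_peft_model`.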

## 7. Initialize Weights & Biases (Optional)

In [None]:
# Initialize wandb for experiment tracking
setup_wandb = input("Setup Weights & Biases for experiment tracking? (y/n): ").lower().strip()

if setup_wandb == 'y':
    import wandb
    
    print("Please log in to Weights & Biases:")
    wandb.login()
    
    # Test connection
    test_run = wandb.init(
        project="cti-hunt-logic-fine-tuning",
        name="environment-setup-test",
        job_type="setup"
    )
    
    wandb.log({"setup_status": "complete"})
    wandb.finish()
    
    print("✅ Weights & Biases setup complete!")
else:
    print("Skipping Weights & Biases setup")
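
The `input()` prompt above blocks when the notebook is executed non-interactively. One possible alternative, sketched here, is to choose the tracking mode from the environment: `WANDB_API_KEY` is the environment variable that wandb reads for credentials, and `wandb.init` accepts a `mode` argument. `choose_wandb_mode` is a hypothetical helper.

```python
import os


def choose_wandb_mode(env=None) -> str:
    """Return "online" when an API key is present in the environment, else "offline"."""
    env = os.environ if env is None else env
    return "online" if env.get("WANDB_API_KEY") else "offline"


mode = choose_wandb_mode()
print(f"W&B mode: {mode}")
# Usage (requires wandb): wandb.init(project="cti-hunt-logic-fine-tuning", mode=mode)
```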

## 8. Environment Summary

In [None]:
print("🎉 Environment Setup Complete!")
print("\n" + "="*50)
print("SUMMARY")
print("="*50)
print(f"✅ Python {sys.version.split()[0]}")
print(f"✅ PyTorch {torch.__version__} (CUDA available: {torch.cuda.is_available()})")
if torch.cuda.is_available():
    print(f"✅ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("❌ No GPU detected")
print("✅ Library imports checked (see section 4 for any failures)")
print("✅ Directory structure created")
print("\n📝 Next Steps:")
print("1. Run notebook 02_data_preparation.ipynb")
print("2. Prepare your CTI training data")
print("3. Begin model fine-tuning")
print("\n" + "="*50)