# Fine-tuning Qwen2.5-72B-Instruct with SoftMoE & Quantization for Log Understanding

This notebook fine-tunes **Qwen/Qwen2.5-72B-Instruct** using:
- ⚡ **4-bit Quantization** (QLoRA) for efficient training on large models
- 🧠 **Soft Mixture of Experts (SoftMoE)** for specialized log understanding
- 📊 **Multi-task Learning** for:
  - Log Parsing (extracting templates)
  - Log Summarization (summarizing sequences)
  - Log Classification (anomaly detection)

## Requirements
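- GPU: an 80 GB A100, or several GPUs sharded with `device_map="auto"`. The 4-bit weights of a 72B model alone need roughly 36 GB, so a free T4 (16 GB) or a single V100 (16/32 GB) cannot hold this checkpoint; switch `model_name` to a smaller Qwen2.5 variant on those cards
- RAM: 12GB+ system RAM
- Storage: the bf16 checkpoint download is roughly 145 GB, plus space for data and adapter checkpoints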
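
As a rough check on these requirements, weight memory scales with parameter count times bits per parameter. The sketch below assumes a nominal 72 billion parameters (the exact count differs slightly) and ignores activations, LoRA weights, optimizer state, and quantization constants, so the figures are lower bounds. `estimate_weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 72e9  # assumption: nominal size implied by the "72B" model name

# Compare common storage precisions for the same parameter count
for label, bits in [("bf16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label}: ~{estimate_weight_memory_gb(NOMINAL_PARAMS, bits):.0f} GB")
```

Comparing the 4-bit figure against the memory of the available GPU is a quick way to decide whether a smaller model variant is needed.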

## Author
Created for 7030CEM - Log Understanding Fine-tuning

In [None]:
# Check if running on Colab
try:
    import google.colab
    IN_COLAB = True
    print("✅ Running on Google Colab")
except ImportError:
    IN_COLAB = False
    print("⚠️  Not running on Colab")

# Mount Google Drive (optional - for saving checkpoints)
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    print("✅ Google Drive mounted")

In [None]:
%%capture
# Install required packages
!pip install -q -U transformers accelerate peft bitsandbytes
!pip install -q -U datasets huggingface_hub
!pip install -q -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
!pip install -q -U trl sentencepiece protobuf
!pip install -q -U wandb tensorboard
!pip install -q -U einops scipy

print("✅ All packages installed successfully")

In [None]:
# Import libraries
import os
import torch
import numpy as np
import pandas as pd
from pathlib import Path
from typing import Dict, List, Optional, Tuple
import json
from dataclasses import dataclass

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    TaskType,
)
from datasets import load_dataset, Dataset, DatasetDict
from torch.utils.data import DataLoader

# Check CUDA availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

In [None]:
# Configuration
@dataclass
class Config:
    # Model settings
    model_name: str = "Qwen/Qwen2.5-72B-Instruct"
    
    # Dataset settings - CHANGE THIS TO YOUR HUGGINGFACE DATASET
    dataset_name: str = "chYassine/ait-fox-raw-logs"  # Your HF dataset
    
    # Quantization settings
    use_4bit: bool = True
    bnb_4bit_compute_dtype: str = "bfloat16"
    bnb_4bit_quant_type: str = "nf4"  # Normal Float 4
    use_nested_quant: bool = True  # Double quantization
    
    # LoRA settings
    lora_r: int = 64  # LoRA rank
    lora_alpha: int = 128  # LoRA alpha (scaling)
    lora_dropout: float = 0.1
    
    # SoftMoE settings
    num_experts: int = 8  # Number of experts
    num_experts_per_token: int = 2  # Active experts per token
    use_softmoe: bool = True
    
    # Training settings
    max_seq_length: int = 2048
    num_train_epochs: int = 3
    per_device_train_batch_size: int = 1
    per_device_eval_batch_size: int = 1
    gradient_accumulation_steps: int = 16
    learning_rate: float = 2e-5
    warmup_steps: int = 100
    weight_decay: float = 0.01
    
    # Output settings
    output_dir: str = "./qwen-log-understanding"
    logging_steps: int = 10
    save_steps: int = 500
    eval_steps: int = 500
    
    # HuggingFace token
    hf_token: Optional[str] = None  # Will be set from secrets

config = Config()
print("✅ Configuration loaded")

In [None]:
# Setup HuggingFace authentication
from huggingface_hub import login

# Login to HuggingFace
if IN_COLAB:
    from google.colab import userdata
    try:
        hf_token = userdata.get('HF_TOKEN')
        login(token=hf_token)
        print("✅ Logged in to HuggingFace using Colab secrets")
    except Exception:
        # Secret missing or notebook access not granted: fall back to the interactive prompt
        print("⚠️  No HF_TOKEN found in Colab secrets")
        print("Falling back to the interactive login prompt")
        login()
else:
    print("Please authenticate with HuggingFace:")
    login()

## Soft Mixture of Experts (SoftMoE) Implementation

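Soft MoE replaces discrete token-to-expert assignment with soft, fully differentiable routing, which avoids dropped tokens and helps keep all experts in use. The layer below is a simplified variant: it computes softmax routing weights and then keeps only the top-k experts per token, so it is closer to sparse top-k gating than to the slot-based formulation in the Soft MoE paper.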
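
For comparison, the slot-based routing of the Soft MoE paper cited at the end of this notebook can be summarized as follows. The notation is an assumption of this note rather than something defined in the notebook code: $X \in \mathbb{R}^{m \times d}$ holds $m$ input tokens, there are $n$ experts $f_1, \dots, f_n$ with $p$ slots each, and $\Phi \in \mathbb{R}^{d \times (n p)}$ are learnable slot parameters.

$$
D_{ij} = \frac{\exp\big((X\Phi)_{ij}\big)}{\sum_{i'=1}^{m} \exp\big((X\Phi)_{i'j}\big)},
\qquad
\tilde{X} = D^{\top} X
$$

$$
\tilde{Y}_{j} = f_{\lceil j/p \rceil}\big(\tilde{X}_{j}\big),
\qquad j = 1, \dots, n p
$$

$$
C_{ij} = \frac{\exp\big((X\Phi)_{ij}\big)}{\sum_{j'=1}^{n p} \exp\big((X\Phi)_{ij'}\big)},
\qquad
Y = C\,\tilde{Y}
$$

The dispatch weights $D$ are normalized over tokens (each slot is a convex combination of all tokens), while the combine weights $C$ are normalized over slots (each output token is a convex combination of all slot outputs). No top-k selection appears anywhere, which is the main difference from the layer implemented in this notebook.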

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMoELayer(nn.Module):
    """Soft Mixture of Experts Layer with smooth routing."""
    
    def __init__(
        self,
        hidden_size: int,
        num_experts: int = 8,
        expert_capacity: int = 2,
        dropout: float = 0.1,
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_experts = num_experts
        self.expert_capacity = expert_capacity
        
        # Router network
        self.router = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size // 2, num_experts),
        )
        
        # Expert networks
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size * 4),
                nn.GELU(),
                nn.Dropout(dropout),
                nn.Linear(hidden_size * 4, hidden_size),
                nn.Dropout(dropout),
            )
            for _ in range(num_experts)
        ])
        
        # Layer norm
        self.layer_norm = nn.LayerNorm(hidden_size)
        
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """
        Args:
            hidden_states: (batch_size, seq_len, hidden_size)
        Returns:
            output: (batch_size, seq_len, hidden_size)
        """
        batch_size, seq_len, hidden_size = hidden_states.shape
        
        # Compute routing probabilities over experts
        router_logits = self.router(hidden_states)  # (B, S, num_experts)
        router_weights = F.softmax(router_logits, dim=-1)
        
        # Select top-k experts per token (sparse top-k gating)
        topk_weights, topk_indices = torch.topk(
            router_weights, self.expert_capacity, dim=-1
        )  # (B, S, expert_capacity)
        
        # Renormalize the selected probabilities so they sum to 1.
        # (A second softmax over probabilities would flatten the weights.)
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
        
        # Scatter into a dense gate tensor; unselected experts get weight 0
        gates = torch.zeros_like(router_weights).scatter(
            -1, topk_indices, topk_weights
        )  # (B, S, num_experts)
        
        # Run each expert once and accumulate its gated output.
        # Note: every expert still processes every token, so this is dense compute.
        output = torch.zeros_like(hidden_states)
        for expert_id, expert in enumerate(self.experts):
            gate = gates[..., expert_id:expert_id + 1]  # (B, S, 1)
            output = output + gate * expert(hidden_states)
        
        # Residual connection and layer norm
        output = self.layer_norm(hidden_states + output)
        
        return output

print("✅ SoftMoE implementation loaded")

## Load and Prepare Dataset

In [None]:
# Load dataset from HuggingFace
print(f"Loading dataset: {config.dataset_name}")

try:
    raw_dataset = load_dataset(config.dataset_name)
    print(f"✅ Dataset loaded successfully")
    print(f"Dataset info: {raw_dataset}")
except Exception as e:
    print(f"❌ Error loading dataset: {e}")
    print("\nCreating sample dataset for demonstration...")
    
    # Create a sample dataset if the real one isn't available
    sample_data = {
        'content': [
            '2024-01-01 10:00:00 INFO Starting service on port 8080',
            '2024-01-01 10:00:01 ERROR Connection refused to database',
            '2024-01-01 10:00:02 WARN High memory usage detected: 85%',
        ] * 100,  # Repeat for more examples
        'host': ['server1', 'server2', 'server3'] * 100,
        'log_type': ['system', 'database', 'system'] * 100,
    }
    raw_dataset = DatasetDict({
        'train': Dataset.from_dict(sample_data),
    })
    print("✅ Sample dataset created")

In [None]:
# Display sample from dataset
print("\n" + "="*80)
print("SAMPLE DATA")
print("="*80)
train_data = raw_dataset['train'] if 'train' in raw_dataset else raw_dataset
for i in range(min(3, len(train_data))):
    print(f"\nExample {i+1}:")
    for key, value in train_data[i].items():
        print(f"  {key}: {value}")

## Create Multi-Task Training Data

We'll create three types of tasks:
1. **Log Parsing**: Extract structured templates from raw logs
2. **Log Summarization**: Summarize sequences of log events
3. **Log Classification**: Classify logs as normal/anomaly
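
To make the first and third tasks concrete, here is a minimal standalone sketch of regex-based template extraction and keyword-based anomaly labeling, in the same spirit as the generator defined in the next cell. The names `PATTERNS`, `to_template`, and `is_anomaly` are hypothetical and chosen for this example; the keyword list is a shortened assumption, and keyword matching is only a weak proxy for real anomaly labels.

```python
import re

# Order matters: specific patterns (dates, times, IPs) must run before the
# generic number rule, otherwise "2024-01-15" would become "<NUM>-<NUM>-<NUM>".
PATTERNS = [
    (r"\d{4}-\d{2}-\d{2}", "<DATE>"),
    (r"\d{2}:\d{2}:\d{2}", "<TIME>"),
    (r"\d+\.\d+\.\d+\.\d+", "<IP>"),
    (r"\b\d+\b", "<NUM>"),
]


def to_template(log_line: str) -> str:
    """Replace variable fields with placeholders, one pattern at a time."""
    for pattern, placeholder in PATTERNS:
        log_line = re.sub(pattern, placeholder, log_line)
    return log_line


def is_anomaly(log_line: str) -> bool:
    """Heuristic label: flag lines containing any failure-related keyword."""
    keywords = ("error", "fail", "refused", "timeout")
    lowered = log_line.lower()
    return any(keyword in lowered for keyword in keywords)


line = "2024-01-15 14:23:45 ERROR Connection to 192.168.1.100 failed after 3 retries"
print(to_template(line))
print("anomaly" if is_anomaly(line) else "normal")
```

Because the labels come from heuristics like these, the fine-tuned model learns to reproduce the heuristics; hand-labeled data would be needed to go beyond them.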

In [None]:
import re

class LogTaskGenerator:
    """Generate multi-task training examples from log data."""
    
    def __init__(self):
        self.parsing_patterns = [
            (r'\d{4}-\d{2}-\d{2}', '<DATE>'),
            (r'\d{2}:\d{2}:\d{2}', '<TIME>'),
            (r'\d+\.\d+\.\d+\.\d+', '<IP>'),
            (r'\b\d+\b', '<NUM>'),
            (r'/[\w/]+', '<PATH>'),
        ]
    
    def extract_template(self, log_line: str) -> str:
        """Extract log template by replacing variables."""
        template = log_line
        for pattern, placeholder in self.parsing_patterns:
            template = re.sub(pattern, placeholder, template)
        return template
    
    def create_parsing_task(self, log_line: str) -> Dict[str, str]:
        """Create log parsing task."""
        template = self.extract_template(log_line)
        
        instruction = (
            "You are a log parsing expert. Extract the template from the following log line by "
            "replacing variable parts (dates, times, IPs, numbers, paths) with placeholders.\n\n"
            f"Log line: {log_line}\n\n"
            "Provide only the extracted template."
        )
        
        return {
            'task': 'parsing',
            'instruction': instruction,
            'response': template,
        }
    
    def create_classification_task(self, log_line: str, log_type: str = None) -> Dict[str, str]:
        """Create log classification task."""
        # Simple heuristic for anomaly detection
        anomaly_keywords = ['error', 'fail', 'exception', 'critical', 'fatal', 'refused', 'timeout']
        is_anomaly = any(keyword in log_line.lower() for keyword in anomaly_keywords)
        label = 'anomaly' if is_anomaly else 'normal'
        
        instruction = (
            "You are a log analysis expert. Classify the following log entry as 'normal' or 'anomaly'.\n\n"
            f"Log entry: {log_line}\n\n"
            "Classification:"
        )
        
        return {
            'task': 'classification',
            'instruction': instruction,
            'response': label,
        }
    
    def create_summarization_task(self, log_lines: List[str]) -> Dict[str, str]:
        """Create log summarization task."""
        # Create a simple summary
        templates = [self.extract_template(line) for line in log_lines]
        unique_templates = list(dict.fromkeys(templates))  # Preserve order
        
        summary = f"This sequence contains {len(log_lines)} log entries with {len(unique_templates)} unique event types.\n"
        summary += "Main events: " + ", ".join(unique_templates[:3])
        
        instruction = (
            "You are a log analysis expert. Summarize the following sequence of log entries.\n\n"
            f"Logs:\n" + "\n".join(f"{i+1}. {line}" for i, line in enumerate(log_lines[:10])) + "\n\n"
            "Summary:"
        )
        
        return {
            'task': 'summarization',
            'instruction': instruction,
            'response': summary,
        }

task_generator = LogTaskGenerator()
print("✅ Task generator created")

In [None]:
# Generate multi-task training examples
def generate_training_examples(dataset, num_examples: int = 2000):
    """Generate multi-task training examples from dataset."""
    examples = []
    
    # Ensure we have the right split
    data = dataset['train'] if 'train' in dataset else dataset
    
    # Sample logs
    num_examples = min(num_examples, len(data))
    indices = np.random.choice(len(data), num_examples, replace=False)
    
    for idx in indices:
        sample = data[int(idx)]
        log_line = sample.get('content', '')
        
        if not log_line or (isinstance(log_line, str) and log_line.startswith('[BINARY')):
            continue
        
        # Draw once so the branches really split 40% / 40% / 20%
        # (a second draw in the elif would skew the mix to 40% / 45% / 15%)
        task_draw = np.random.random()
        
        # Create parsing task (40% of examples)
        if task_draw < 0.4:
            examples.append(task_generator.create_parsing_task(log_line))
        
        # Create classification task (40% of examples)
        elif task_draw < 0.8:
            log_type = sample.get('log_type', None)
            examples.append(task_generator.create_classification_task(log_line, log_type))
        
        # Create summarization task (20% of examples)
        else:
            # Get a sequence of logs
            start = max(0, int(idx) - 5)
            end = min(len(data), int(idx) + 5)
            log_sequence = [data[i].get('content', '') for i in range(start, end)]
            log_sequence = [l for l in log_sequence if l and not (isinstance(l, str) and l.startswith('[BINARY'))]
            
            if len(log_sequence) >= 3:
                examples.append(task_generator.create_summarization_task(log_sequence))
    
    return examples

# Generate examples
print("Generating training examples...")
training_examples = generate_training_examples(raw_dataset, num_examples=2000)
print(f"✅ Generated {len(training_examples)} training examples")

# Show task distribution
task_counts = {}
for ex in training_examples:
    task = ex['task']
    task_counts[task] = task_counts.get(task, 0) + 1

print("\nTask distribution:")
for task, count in task_counts.items():
    print(f"  {task}: {count} ({100*count/len(training_examples):.1f}%)")

## Load Model with Quantization

In [None]:
# Configure 4-bit quantization
compute_dtype = getattr(torch, config.bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=config.use_4bit,
    bnb_4bit_quant_type=config.bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=config.use_nested_quant,
)

print("Quantization config:")
print(f"  4-bit: {config.use_4bit}")
print(f"  Quant type: {config.bnb_4bit_quant_type}")
print(f"  Compute dtype: {config.bnb_4bit_compute_dtype}")
print(f"  Nested quantization: {config.use_nested_quant}")

In [None]:
# Load tokenizer
print(f"\nLoading tokenizer: {config.model_name}")
tokenizer = AutoTokenizer.from_pretrained(
    config.model_name,
    trust_remote_code=True,
)

# Set padding token if not present
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

print(f"✅ Tokenizer loaded")
print(f"  Vocab size: {len(tokenizer)}")
print(f"  Pad token: {tokenizer.pad_token}")

In [None]:
# Load model with quantization
print(f"\nLoading model: {config.model_name}")
print("⚠️  This may take several minutes...")

model = AutoModelForCausalLM.from_pretrained(
    config.model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=compute_dtype,
)

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

print(f"✅ Model loaded and quantized")
print(f"  Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

## Add LoRA Adapters

In [None]:
# Configure LoRA
peft_config = LoraConfig(
    r=config.lora_r,
    lora_alpha=config.lora_alpha,
    lora_dropout=config.lora_dropout,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)

# Add LoRA adapters
model = get_peft_model(model, peft_config)

print("✅ LoRA adapters added")
print(f"\nTrainable parameters:")
model.print_trainable_parameters()

## Prepare Training Data
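
The cells below train on the full formatted sequence, prompt included. A common alternative is to compute the loss on the assistant response only by setting prompt positions in `labels` to -100, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below illustrates the idea on plain Python lists; `mask_prompt_labels` and the token ids are hypothetical and not part of the notebook.

```python
from typing import List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_prompt_labels(input_ids: List[int], prompt_length: int) -> List[int]:
    """Copy input_ids to labels, hiding the first prompt_length positions from the loss."""
    prompt_length = min(prompt_length, len(input_ids))
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Hypothetical token ids: 4 prompt tokens followed by 3 response tokens
labels = mask_prompt_labels([11, 12, 13, 14, 21, 22, 23], prompt_length=4)
print(labels)
```

A collator that rebuilds `labels` from `input_ids` would overwrite such a mask, so this approach would need a collator that preserves an existing `labels` column.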

In [None]:
def format_instruction(example: Dict[str, str]) -> str:
    """Format example as instruction for Qwen."""
    return f"""<|im_start|>system
You are Qwen, an AI assistant specialized in log analysis and understanding.<|im_end|>
<|im_start|>user
{example['instruction']}<|im_end|>
<|im_start|>assistant
{example['response']}<|im_end|>"""

# Convert to HuggingFace Dataset
train_dataset = Dataset.from_list(training_examples)

# Split into train/eval
split_dataset = train_dataset.train_test_split(test_size=0.1, seed=42)
train_data = split_dataset['train']
eval_data = split_dataset['test']

print(f"✅ Dataset prepared")
print(f"  Training examples: {len(train_data)}")
print(f"  Evaluation examples: {len(eval_data)}")

In [None]:
# Tokenize datasets
print("Tokenizing datasets...")

def tokenize_batch(batch):
    texts = [format_instruction({
        'instruction': batch['instruction'][i],
        'response': batch['response'][i]
    }) for i in range(len(batch['instruction']))]
    
    return tokenizer(
        texts,
        truncation=True,
        max_length=config.max_seq_length,
        padding="max_length",
    )

tokenized_train = train_data.map(
    tokenize_batch,
    batched=True,
    remove_columns=train_data.column_names,
    desc="Tokenizing training data",
)

tokenized_eval = eval_data.map(
    tokenize_batch,
    batched=True,
    remove_columns=eval_data.column_names,
    desc="Tokenizing evaluation data",
)

# Add labels
tokenized_train = tokenized_train.map(
    lambda x: {"labels": x["input_ids"]},
    desc="Adding labels",
)
tokenized_eval = tokenized_eval.map(
    lambda x: {"labels": x["input_ids"]},
    desc="Adding labels",
)

print("✅ Tokenization complete")

## Setup Training

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir=config.output_dir,
    num_train_epochs=config.num_train_epochs,
    per_device_train_batch_size=config.per_device_train_batch_size,
    per_device_eval_batch_size=config.per_device_eval_batch_size,
    gradient_accumulation_steps=config.gradient_accumulation_steps,
    learning_rate=config.learning_rate,
    warmup_steps=config.warmup_steps,
    weight_decay=config.weight_decay,
    logging_steps=config.logging_steps,
    save_steps=config.save_steps,
    eval_steps=config.eval_steps,
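    eval_strategy="steps",  # named `evaluation_strategy` in older transformers releases
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    # Match mixed precision to the quantization compute dtype
    bf16=(config.bnb_4bit_compute_dtype == "bfloat16"),
    fp16=(config.bnb_4bit_compute_dtype == "float16"),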
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    report_to="tensorboard",
    save_total_limit=3,
    push_to_hub=False,
)

print("✅ Training arguments configured")
effective_batch_size = config.per_device_train_batch_size * config.gradient_accumulation_steps
print(f"  Effective batch size: {effective_batch_size}")

In [None]:
# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Causal LM (not masked LM)
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    data_collator=data_collator,
)

print("✅ Trainer initialized")

## Train Model

In [None]:
# Start training
print("\n" + "="*80)
print("STARTING TRAINING")
print("="*80)
print(f"Training on {len(tokenized_train)} examples")
print(f"Evaluating on {len(tokenized_eval)} examples")
print(f"Number of epochs: {config.num_train_epochs}")
print("\n⚠️  Training may take several hours depending on your GPU...\n")

# Train
train_result = trainer.train()

print("\n" + "="*80)
print("TRAINING COMPLETE")
print("="*80)
print(f"Training loss: {train_result.training_loss:.4f}")
print(f"Training runtime: {train_result.metrics['train_runtime']:.2f}s")
print(f"Samples per second: {train_result.metrics['train_samples_per_second']:.2f}")

## Evaluate Model

In [None]:
# Evaluate
print("\nEvaluating model...")
eval_results = trainer.evaluate()

print("\n" + "="*80)
print("EVALUATION RESULTS")
print("="*80)
for key, value in eval_results.items():
    print(f"{key}: {value:.4f}")

## Save Model

In [None]:
# Save model
output_dir = config.output_dir
print(f"\nSaving model to {output_dir}...")

# Save the final model
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"✅ Model saved to {output_dir}")

# Save training info
training_info = {
    'model_name': config.model_name,
    'dataset_name': config.dataset_name,
    'num_train_examples': len(tokenized_train),
    'num_eval_examples': len(tokenized_eval),
    'training_loss': train_result.training_loss,
    'eval_loss': eval_results['eval_loss'],
    'lora_r': config.lora_r,
    'lora_alpha': config.lora_alpha,
}

with open(f"{output_dir}/training_info.json", 'w') as f:
    json.dump(training_info, f, indent=2)

print("✅ Training info saved")

## Test the Fine-tuned Model

In [None]:
# Test inference function
def test_model(task_type: str = "parsing"):
    """Test the fine-tuned model on different tasks."""
    
    # Sample test cases
    test_cases = {
        'parsing': [
            "2024-01-15 14:23:45 ERROR Connection to 192.168.1.100 failed",
            "2024-01-15 14:23:46 INFO User login successful from 10.0.0.5",
        ],
        'classification': [
            "2024-01-15 14:23:45 ERROR Database connection timeout",
            "2024-01-15 14:23:46 INFO Service started successfully",
        ],
        'summarization': [
            """1. 2024-01-15 10:00:00 INFO Service starting\n2. 2024-01-15 10:00:01 INFO Loading configuration\n3. 2024-01-15 10:00:02 ERROR Database connection failed\n4. 2024-01-15 10:00:03 WARN Retrying connection\n5. 2024-01-15 10:00:04 INFO Connection established"""
        ],
    }
    
    if task_type == 'parsing':
        instruction_template = (
            "You are a log parsing expert. Extract the template from the following log line by "
            "replacing variable parts with placeholders.\n\nLog line: {}\n\nProvide only the extracted template."
        )
    elif task_type == 'classification':
        instruction_template = (
            "You are a log analysis expert. Classify the following log entry as 'normal' or 'anomaly'.\n\n"
            "Log entry: {}\n\nClassification:"
        )
    else:  # summarization
        instruction_template = (
            "You are a log analysis expert. Summarize the following sequence of log entries.\n\n"
            "Logs:\n{}\n\nSummary:"
        )
    
    print(f"\n{'='*80}")
    print(f"Testing {task_type.upper()} Task")
    print(f"{'='*80}\n")
    
    for i, test_input in enumerate(test_cases[task_type], 1):
        instruction = instruction_template.format(test_input)
        
        prompt = f"""<|im_start|>system
You are Qwen, an AI assistant specialized in log analysis and understanding.<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant\n"""
        
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                temperature=0.7,
                do_sample=True,
                top_p=0.9,
                pad_token_id=tokenizer.pad_token_id,
            )
        
        response = tokenizer.decode(outputs[0], skip_special_tokens=False)
        # Extract only the assistant's response
        if "<|im_start|>assistant" in response:
            response = response.split("<|im_start|>assistant\n")[-1]
        if "<|im_end|>" in response:
            response = response.split("<|im_end|>")[0]
        
        print(f"Test Case {i}:")
        print(f"Input: {test_input[:200]}..." if len(test_input) > 200 else f"Input: {test_input}")
        print(f"Output: {response.strip()}")
        print()

# Test all task types
for task in ['parsing', 'classification', 'summarization']:
    test_model(task)

## Summary & Next Steps

In [None]:
print("\n" + "="*80)
print("TRAINING SUMMARY")
print("="*80)
print(f"\n✅ Successfully fine-tuned {config.model_name}")
print(f"\n📊 Training Statistics:")
print(f"  - Training examples: {len(tokenized_train)}")
print(f"  - Evaluation examples: {len(tokenized_eval)}")
print(f"  - Training loss: {train_result.training_loss:.4f}")
print(f"  - Evaluation loss: {eval_results['eval_loss']:.4f}")
print(f"  - Training time: {train_result.metrics['train_runtime']:.2f}s")
print(f"\n💾 Model saved to: {config.output_dir}")

print(f"\n🎯 Capabilities:")
print("  ✅ Log Parsing - Extract templates from raw logs")
print("  ✅ Log Classification - Detect anomalies")
print("  ✅ Log Summarization - Summarize log sequences")

print(f"\n🚀 Next Steps:")
print("  1. Test the model on your own log data")
print("  2. Fine-tune further with more domain-specific data")
print("  3. Deploy the model for production use")
print("  4. Integrate with your log analysis pipeline")

if IN_COLAB:
    print(f"\n💡 Tip: Download the model from {config.output_dir} or save to Google Drive")

print("\n" + "="*80)

## Additional Notes

### SoftMoE Integration
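The SoftMoE layer is defined above but is not attached to the model: no `SoftMoELayer` is instantiated, and the `use_softmoe`, `num_experts`, and `num_experts_per_token` settings are never read, so the training run is plain QLoRA fine-tuning. To actually use the layer, you would need to: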
1. Hook the SoftMoE layers into specific transformer layers
2. Add forward hooks to apply MoE routing
3. Implement load balancing loss for expert utilization
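
Step 3 could take the form of a Switch-Transformer-style auxiliary loss, which is one common choice and not something this notebook specifies. For `N` experts it computes `N * sum_i(f_i * P_i)`, where `f_i` is the fraction of tokens whose top-1 expert is `i` and `P_i` is the mean router probability for expert `i`. The sketch below uses plain Python lists for clarity; `load_balancing_loss` is a hypothetical helper, and a real implementation would operate on tensors so that gradients reach the router.

```python
from typing import List


def load_balancing_loss(router_probs: List[List[float]], top1_choice: List[int]) -> float:
    """Auxiliary loss N * sum_i f_i * P_i; equals 1.0 when routing is perfectly uniform."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f_i: fraction of tokens dispatched to expert i
    fraction = [top1_choice.count(i) / num_tokens for i in range(num_experts)]
    # P_i: mean router probability assigned to expert i
    mean_prob = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(fraction, mean_prob))


balanced = load_balancing_loss([[0.5, 0.5], [0.5, 0.5]], top1_choice=[0, 1])
collapsed = load_balancing_loss([[1.0, 0.0], [1.0, 0.0]], top1_choice=[0, 0])
print(balanced, collapsed)
```

The loss is smallest when tokens spread evenly and grows as routing collapses onto one expert; in training it would typically be scaled by a small coefficient and added to the language-modeling loss.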

### Memory Optimization
- Reduce `max_seq_length` if running out of memory
- Decrease `per_device_train_batch_size` or increase `gradient_accumulation_steps`
- Use gradient checkpointing for larger models
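
The batch-size trade-off in the second bullet can be checked with simple arithmetic: the effective batch size is the per-device batch times the accumulation steps, and it determines how many optimizer steps a run contains. The sketch below assumes a single GPU and about 1800 training examples (a hypothetical figure; the real count depends on how many examples the generator keeps). `training_schedule` is a hypothetical helper, and the exact rounding used by the trainer may differ between library versions.

```python
from typing import Tuple


def training_schedule(
    num_examples: int, per_device_batch: int, grad_accum: int, epochs: int
) -> Tuple[int, int, int]:
    """Return (effective batch size, approx. optimizer steps per epoch, total steps)."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = max(num_examples // effective_batch, 1)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


# Assumption: about 1800 training examples on a single GPU
effective, per_epoch, total = training_schedule(
    1800, per_device_batch=1, grad_accum=16, epochs=3
)
print(effective, per_epoch, total)
```

With numbers in this range the whole run has only a few hundred optimizer steps, so `eval_steps=500` and `save_steps=500` would never trigger before training ends and `warmup_steps=100` would cover a large share of the schedule; smaller step intervals may be worth considering.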

### Performance Tips
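- Use A100-class or newer GPUs for faster training; bf16 compute needs Ampere or later, so a V100 or T4 would have to fall back to float16
- Enable flash attention for better performance
- Mixed precision training is already enabled through `TrainingArguments`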

### Dataset Customization
Replace `config.dataset_name` with your own HuggingFace dataset containing log data with these columns:
- `content`: The raw log text
- `host`: (optional) Host/server name
- `log_type`: (optional) Type of log
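
Before pointing the notebook at a new dataset, it can help to check how many rows would survive the generator's filtering. The sketch below works on a plain list of dictionaries and mirrors the checks used earlier (non-empty `content`, no `[BINARY` prefix); `usable_rows` and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, List


def usable_rows(rows: Iterable[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep rows whose `content` is a non-empty string that is not a binary marker."""
    kept = []
    for row in rows:
        content = row.get("content")
        if isinstance(content, str) and content.strip() and not content.startswith("[BINARY"):
            kept.append(row)
    return kept


sample = [
    {"content": "2024-01-01 10:00:00 INFO Starting service", "host": "server1"},
    {"content": "", "host": "server2"},                       # empty content
    {"content": "[BINARY data omitted]", "log_type": "system"},  # binary marker
    {"host": "server3"},                                       # missing content column
]
print(len(usable_rows(sample)))
```

Only the first row passes here; `host` and `log_type` stay optional because the filter never requires them.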

### Citation
If you use this notebook, please cite:
```
Qwen2.5: https://github.com/QwenLM/Qwen
QLoRA: https://arxiv.org/abs/2305.14314
SoftMoE: https://arxiv.org/abs/2308.00951
```