# Backdoor-B2D4G5-Tool-Use Model Training Notebook

This notebook fine-tunes the Backdoor-B2D4G5-Tool-Use model on custom datasets using 4-bit quantization and LoRA. The model is a transformer-based causal language model built with PyTorch, designed for tool use and instruction following.

## Model Information
- **Model Name**: Backdoor-B2D4G5-Tool-Use
- **Architecture**: B2D4G5slm (Transformer-based)
- **Framework**: PyTorch
- **Hidden Size**: 4096
- **Intermediate Size**: 14336
- **Attention Heads**: 32
- **KV Heads**: 8
- **Hidden Layers**: 32
- **Max Position Embeddings**: 8192
- **Vocab Size**: 128262
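
The dimensions above are enough for a rough parameter count. The sketch below assumes a Llama-style layout (no bias terms, a three-matrix gated MLP, two norm weights per layer, and untied input/output embeddings). None of these details are confirmed by the model card, so treat the result as an estimate only.

```python
# Rough parameter count from the dimensions listed above.
# Assumptions (not confirmed by this notebook): Llama-style decoder layers,
# no bias terms, gated MLP with three projections, untied embeddings.
hidden = 4096
intermediate = 14336
n_heads = 32
n_kv_heads = 8
n_layers = 32
vocab = 128262

head_dim = hidden // n_heads          # size of one attention head
kv_dim = n_kv_heads * head_dim        # width of k_proj / v_proj outputs

attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * hidden * intermediate                   # gate_proj, up_proj, down_proj
norms = 2 * hidden                                # two norm weights per layer
per_layer = attn + mlp + norms

embeddings = 2 * vocab * hidden                   # token embeddings + LM head
total_params = n_layers * per_layer + embeddings + hidden  # + final norm

print(f"Parameters per layer: {per_layer:,}")
print(f"Estimated total:      {total_params:,} (~{total_params / 1e9:.2f}B)")
```

Under these assumptions the model lands at roughly 8 billion parameters, which is why the notebook relies on quantization and LoRA rather than full fine-tuning.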

## Features
- Use the pre-loaded model from the Kaggle input/MODELS directory
- Train on custom datasets (from URLs or local files)
- Support for multiple datasets
- Automatic training pipeline
- Customizable training parameters
- Evaluation metrics tracking
- Model checkpointing and saving

## 1. Setup and Dependencies

First, let's install the necessary dependencies for training the model.

In [None]:
# Install required packages
!pip install -q transformers datasets accelerate bitsandbytes peft trl evaluate torch torchvision torchaudio wandb sentencepiece
# Install jupyter and ipywidgets to fix tqdm warnings
!pip install -q jupyter ipywidgets
# Install tensorflow-cpu instead of tensorflow to avoid conflicts with torch-xla
!pip install -q tensorflow-cpu

In [None]:
# Import necessary libraries
import os
import json
import torch
import numpy as np
import pandas as pd
from pathlib import Path
from datasets import load_dataset, Dataset, DatasetDict, concatenate_datasets
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    get_scheduler
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import wandb
import logging
import warnings
from tqdm.auto import tqdm

# Set up logging
logging.basicConfig(level=logging.INFO)
warnings.filterwarnings("ignore")

## 2. Model Path Configuration

The model is already loaded in the Kaggle input/MODELS directory. Let's set up the path to the model.

In [None]:
# Define the path to the pre-loaded model in Kaggle
# The model files are already in the Kaggle input directory
MODEL_NAME = "/kaggle/input/MODELS"

# Define the path for datasets in Kaggle
DATASET_PATH = "/kaggle/input/DATASETS"

# Check if the model path exists and contains required files
required_files = ["config.json", "tokenizer.json", "tokenizer_config.json", "generation_config.json"]
missing_files = [f for f in required_files if not os.path.exists(os.path.join(MODEL_NAME, f))]

if missing_files:
    print(f"Warning: The following required files are missing in {MODEL_NAME}: {', '.join(missing_files)}")
    print("Please ensure the model is correctly loaded in the Kaggle input/MODELS folder.")
else:
    print(f"All required model files found in {MODEL_NAME}")
    # List model files
    !ls -la {MODEL_NAME}

## 3. Model Architecture and Loading

Let's verify the model files, load the tokenizer, and configure quantization. The model weights themselves are loaded in Section 4, just before LoRA is applied.

In [None]:
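# Point MODEL_NAME at the actual model directory (this overrides the placeholder path set above).
# Kaggle input paths are absolute, so the leading slash is required.
MODEL_NAME = "/kaggle/input/b2d4g5/pytorch/backdoor-b2d4g5-tool-use/1/Backdoor-B2D4G5-Tool-Use"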

# Ensure we have all necessary model files
required_files = ['config.json', 'tokenizer.json', 'tokenizer_config.json', 'generation_config.json']
missing_files = [file for file in required_files if not os.path.exists(os.path.join(MODEL_NAME, file))]

if missing_files:
    print(f"Missing files in model directory: {missing_files}")
    print(f"Please ensure all required files are present in {MODEL_NAME}.")
else:
    print("All required model files found.")

In [None]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    use_fast=True
)

# Set padding token to eos token if not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Tokenizer loaded with vocabulary size: {len(tokenizer)}")
print(f"BOS token: {tokenizer.bos_token} (ID: {tokenizer.bos_token_id})")
print(f"EOS token: {tokenizer.eos_token} (ID: {tokenizer.eos_token_id})")
print(f"PAD token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")

In [None]:
# Configure quantization for efficient training.
# bitsandbytes 4-bit quantization needs a CUDA GPU, so only enable it when one is available.
cuda_available = torch.cuda.is_available()
if not cuda_available:
    print("CUDA is not available. The model will be loaded without bitsandbytes quantization.")
    # Passing quantization_config=None to from_pretrained skips quantization entirely
    bnb_config = None
else:
    # Original CUDA configuration
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True
    )
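
To see what the 4-bit configuration buys, here is a back-of-the-envelope estimate of weight memory. The parameter count is an assumed round figure (~8B), and the per-parameter overheads for the quantization constants are the figures reported in the QLoRA paper for a block size of 64. In practice some layers (embeddings, norms) usually stay in 16-bit, so real usage will be somewhat higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

# Assumed parameter count (~8B); substitute the real count for your checkpoint.
N_PARAMS = 8_000_000_000

# NF4 stores 4 bits per weight plus quantization constants.
# Overhead per parameter with block size 64 (figures from the QLoRA paper):
NF4_PLAIN_BITS = 4 + 0.5            # one 32-bit constant per 64 weights
NF4_DOUBLE_QUANT_BITS = 4 + 0.127   # constants themselves quantized to 8 bits

for label, bits in [
    ("bf16", 16),
    ("nf4", NF4_PLAIN_BITS),
    ("nf4 + double quant", NF4_DOUBLE_QUANT_BITS),
]:
    print(f"{label:>20}: {weight_memory_gib(N_PARAMS, bits):5.2f} GiB")
```

The estimate covers weights only. Activations, LoRA gradients, and optimizer state come on top, and they grow with batch size and sequence length.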


## 4. Configure LoRA for Efficient Fine-tuning

We'll use Parameter-Efficient Fine-Tuning (PEFT) with LoRA to efficiently train the model without updating all parameters.

In [None]:
# Configure LoRA for efficient fine-tuning
# The model has not been loaded yet at this point, so load it here if needed
if "model" not in globals() or model is None:
    print("Model not found. Loading model first.")
    # Load the base model with the quantization settings defined above
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16
    )
    model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=16,                    # Rank dimension
    lora_alpha=32,           # Alpha parameter for LoRA scaling
    lora_dropout=0.05,       # Dropout probability for LoRA layers
    bias="none",             # Bias type
    task_type="CAUSAL_LM",   # Task type
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Apply LoRA to the model
model = get_peft_model(model, peft_config)
print("LoRA configuration applied to the model.")
# print_trainable_parameters() prints its own summary and returns None,
# so call it directly instead of embedding it in an f-string
model.print_trainable_parameters()
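
As a sanity check on the number reported above, the trainable parameter count can be estimated by hand: for each targeted linear layer with `in_features` inputs and `out_features` outputs, LoRA adds two matrices with `r * (in_features + out_features)` parameters in total. The layer shapes below are assumptions derived from the Model Information section (Llama-style projections, key/value width of 8 heads x 128).

```python
# Assumed projection shapes (in_features, out_features), derived from the
# Model Information section; the real checkpoint may differ.
hidden, intermediate, kv_dim, n_layers = 4096, 14336, 1024, 32
r = 16
lora_alpha = 32

target_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) next to each frozen weight."""
    return rank * (in_features + out_features)

lora_per_layer = sum(lora_params(i, o, r) for i, o in target_shapes.values())
lora_trainable = lora_per_layer * n_layers

print(f"LoRA parameters per layer: {lora_per_layer:,}")
print(f"LoRA parameters in total:  {lora_trainable:,}")
print(f"LoRA scaling factor (alpha / r): {lora_alpha / r}")
```

If the figure printed by `print_trainable_parameters()` differs noticeably, the checkpoint's layer shapes or module names probably do not match these assumptions.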

## 5. Dataset Loading and Preprocessing

Now, let's set up functions to load and preprocess datasets from various sources.

In [None]:
def load_dataset_from_source(source, dataset_name=None, split=None):
    """
    Load a dataset from various sources: URL, local file, or Hugging Face Hub.
    
    Args:
        source (str): Path or URL to the dataset
        dataset_name (str, optional): Name of the dataset if loading from Hugging Face
        split (str, optional): Dataset split to load
        
    Returns:
        Dataset: The loaded dataset
    """
    try:
        # Check if source is a URL
        if source.startswith("http"):
            print(f"Downloading dataset from URL: {source}")
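            # /kaggle/input is read-only, so download into the writable working directory
            # (the "downloads" folder name is an arbitrary choice)
            download_dir = "/kaggle/working/downloads"
            os.makedirs(download_dir, exist_ok=True)
            local_path = os.path.join(download_dir, os.path.basename(source))
            !wget -q -O "{local_path}" "{source}"
            source = local_path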
        
        # Check if source is a local file
        if os.path.exists(source):
            file_ext = os.path.splitext(source)[1].lower()
            
            if file_ext == ".csv":
                df = pd.read_csv(source)
                return Dataset.from_pandas(df)
            elif file_ext == ".json" or file_ext == ".jsonl":
                return load_dataset("json", data_files=source, split="train")
            elif file_ext == ".txt":
                with open(source, "r", encoding="utf-8") as f:
                    # Drop trailing newlines and skip blank lines
                    lines = [line.rstrip("\n") for line in f if line.strip()]
                return Dataset.from_dict({"text": lines})
            elif file_ext == ".parquet":
                df = pd.read_parquet(source)
                return Dataset.from_pandas(df)
            else:
                raise ValueError(f"Unsupported file format: {file_ext}")
        
        # If not a local file, try loading from Hugging Face Hub
        else:
            if dataset_name is None:
                dataset_name = source
            
            # Without an explicit split, load_dataset returns a DatasetDict, which the
            # preprocessing functions cannot handle, so default to the "train" split
            return load_dataset(dataset_name, split=split or "train")
                
    except Exception as e:
        print(f"Error loading dataset: {e}")
        return None
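
For local files, the loader dispatches on the file extension. The following sketch writes a tiny JSONL file in the instruction/input/output layout that the preprocessing step expects by default. The records and the file name are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample records using the default column names
records = [
    {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"},
    {"instruction": "Name a primary color.", "input": "", "output": "Red"},
]

# JSONL: one JSON object per line
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back to confirm the layout round-trips
with open(path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(f"Wrote {len(loaded)} records to {path}")
print(f"Columns: {sorted(loaded[0])}")
```

The resulting path can be passed as `source` to `load_dataset_from_source`, which routes `.jsonl` files through the `json` dataset builder.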

In [None]:
def preprocess_dataset(dataset, instruction_column="instruction", input_column="input", output_column="output", text_column=None):
    """
    Preprocess the dataset for training.
    
    Args:
        dataset: The dataset to preprocess
        instruction_column (str): Column name for instructions
        input_column (str): Column name for inputs
        output_column (str): Column name for outputs
        text_column (str, optional): Column name for text if using a single text column
        
    Returns:
        Dataset: The preprocessed dataset
    """
    # Check dataset format
    columns = dataset.column_names
    
    # If dataset has a single text column
    if text_column is not None and text_column in columns:
        def format_single_text(example):
            return {"text": example[text_column]}
        
        return dataset.map(format_single_text, remove_columns=columns)
    
    # If dataset has instruction, input, and output columns
    elif instruction_column in columns and output_column in columns:
        def format_instruction_input_output(example):
            instruction = example[instruction_column]
            output = example[output_column]
            
            # Check if input column exists and is not empty
            if input_column in columns and example[input_column] and not pd.isna(example[input_column]):
                input_text = example[input_column]
                text = f"<|begin_of_text|>\n\nInstruction: {instruction}\n\nInput: {input_text}\n\nOutput: {output}<|end_of_text|>"
            else:
                text = f"<|begin_of_text|>\n\nInstruction: {instruction}\n\nOutput: {output}<|end_of_text|>"
                
            return {"text": text}
        
        return dataset.map(format_instruction_input_output, remove_columns=columns)
    
    # If dataset has conversations
    elif "conversations" in columns:
        def format_conversations(example):
            conversation = example["conversations"]
            formatted_text = "<|begin_of_text|>\n"
            
            for turn in conversation:
                if "from" in turn and "value" in turn:
                    role = turn["from"]
                    content = turn["value"]
                    formatted_text += f"\n{role}: {content}\n"
            
            formatted_text += "<|end_of_text|>"
            return {"text": formatted_text}
        
        return dataset.map(format_conversations, remove_columns=columns)
    
    # If dataset has messages (list of role/content dictionaries)
    elif "messages" in columns:
        def format_messages(example):
            messages = example["messages"]
            formatted_text = "<|begin_of_text|>\n"
            
            for msg in messages:
                if "role" in msg and "content" in msg:
                    role = msg["role"]
                    content = msg["content"]
                    formatted_text += f"\n{role}: {content}\n"
            
            formatted_text += "<|end_of_text|>"
            return {"text": formatted_text}
        
        return dataset.map(format_messages, remove_columns=columns)
    
    else:
        print(f"Warning: Could not determine dataset format. Available columns: {columns}")
        return dataset
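
To make the instruction template concrete, here is a standalone version of the formatting applied to instruction-style records. `format_example` is a hypothetical helper written for this illustration; it reproduces the same template without needing a `Dataset` object.

```python
def format_example(instruction, output, input_text=None):
    """Build one training string using the notebook's instruction template."""
    prompt = f"<|begin_of_text|>\n\nInstruction: {instruction}"
    if input_text:
        # The Input section is only included when the record has a non-empty input
        prompt += f"\n\nInput: {input_text}"
    return f"{prompt}\n\nOutput: {output}<|end_of_text|>"

example = format_example("Translate to French.", "Bonjour", input_text="Hello")
print(example)
```

Printing one formatted record like this before training is a cheap way to catch swapped columns or missing fields.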

In [None]:
def tokenize_dataset(dataset, tokenizer, max_length=4096):
    """
    Tokenize the dataset for training.
    
    Args:
        dataset: The dataset to tokenize
        tokenizer: The tokenizer to use
        max_length (int): Maximum sequence length
        
    Returns:
        Dataset: The tokenized dataset
    """
    def tokenize_function(examples):
        # Tokenize the texts
        tokenized = tokenizer(
            examples["text"],
            padding="max_length",
            truncation=True,
            max_length=max_length,
            return_tensors=None,
            return_special_tokens_mask=True
        )
        
        # Set labels equal to input_ids for causal language modeling
        tokenized["labels"] = tokenized["input_ids"].copy()
        
        return tokenized
    
    # Apply tokenization to the dataset
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=["text"]
    )
    
    return tokenized_dataset
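
The interaction between padding and labels is easy to miss. Below is a toy sketch with plain lists instead of a real tokenizer, assuming right-side padding. It mimics what happens when sequences are padded to `max_length` and the language-modeling collator then replaces labels at pad-token positions with -100 so they do not contribute to the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def pad_and_mask(input_ids, max_length, pad_token_id):
    """Truncate/pad a token list and build the matching mask and labels."""
    ids = input_ids[:max_length]
    n_pad = max_length - len(ids)
    padded = ids + [pad_token_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad
    # Every position holding the pad token ID is excluded from the loss
    labels = [IGNORE_INDEX if tok == pad_token_id else tok for tok in padded]
    return {"input_ids": padded, "attention_mask": attention_mask, "labels": labels}

# Toy token IDs; 0 stands in for a token that is both EOS and pad.
batch = pad_and_mask([11, 12, 13, 0], max_length=6, pad_token_id=0)
for key, value in batch.items():
    print(f"{key:>14}: {value}")
```

Note the side effect in the example: because the pad token is set to the EOS token, the genuine EOS at the end of the sequence is masked as well, so the model gets no training signal for when to stop. Using a dedicated pad token avoids this.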

## 6. Training Configuration

Let's set up the training configuration with customizable parameters.

In [None]:
# Training configuration
class TrainingConfig:
    def __init__(self):
        # Basic training parameters
        self.output_dir = "./results"
        self.num_train_epochs = 3
        self.per_device_train_batch_size = 4
        self.per_device_eval_batch_size = 4
        self.gradient_accumulation_steps = 4
        self.evaluation_strategy = "steps"
        self.eval_steps = 100
        self.save_strategy = "steps"
        self.save_steps = 100
        self.save_total_limit = 3
        self.logging_steps = 10
        
        # Learning rate and scheduler
        self.learning_rate = 2e-4
        self.lr_scheduler_type = "cosine"
        self.warmup_ratio = 0.03
        self.weight_decay = 0.01
        
        # Optimizer parameters
        self.optim = "paged_adamw_8bit"
        self.max_grad_norm = 0.3
        
        # Mixed precision training
        self.fp16 = True
        
        # Early stopping
        self.early_stopping_patience = 3
        self.early_stopping_threshold = 0.01
        
        # Dataset processing
        self.max_seq_length = 4096
        self.val_size = 0.1
        
        # Miscellaneous
        self.seed = 42
        self.push_to_hub = False
        self.hub_model_id = None
        self.hub_token = None
        
    def update(self, **kwargs):
        """
        Update configuration parameters.
        
        Args:
            **kwargs: Key-value pairs of parameters to update
        """
        for key, value in kwargs.items():
            if hasattr(self, key):
                setattr(self, key, value)
            else:
                print(f"Warning: Unknown parameter '{key}'")
                
    def get_training_args(self):
        """
        Get TrainingArguments object from the configuration.
        
        Returns:
            TrainingArguments: The training arguments
        """
        return TrainingArguments(
            output_dir=self.output_dir,
            num_train_epochs=self.num_train_epochs,
            per_device_train_batch_size=self.per_device_train_batch_size,
            per_device_eval_batch_size=self.per_device_eval_batch_size,
            gradient_accumulation_steps=self.gradient_accumulation_steps,
            evaluation_strategy=self.evaluation_strategy,
            eval_steps=self.eval_steps,
            save_strategy=self.save_strategy,
            save_steps=self.save_steps,
            save_total_limit=self.save_total_limit,
            logging_steps=self.logging_steps,
            learning_rate=self.learning_rate,
            lr_scheduler_type=self.lr_scheduler_type,
            warmup_ratio=self.warmup_ratio,
            weight_decay=self.weight_decay,
            optim=self.optim,
            max_grad_norm=self.max_grad_norm,
            fp16=self.fp16,
            push_to_hub=self.push_to_hub,
            hub_model_id=self.hub_model_id,
            hub_token=self.hub_token,
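            seed=self.seed,
            # EarlyStoppingCallback requires the best checkpoint to be tracked
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            greater_is_better=False,
            report_to="wandb" if os.environ.get("WANDB_API_KEY") else "none"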
        )

# Create default training configuration
config = TrainingConfig()
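
The batch size, gradient accumulation, and warmup settings interact, so it helps to work out the resulting schedule before training. The helper below is a hypothetical sketch; the dataset size is a made-up figure, and the `Trainer` may round slightly differently.

```python
import math

def schedule_summary(n_train_examples, per_device_batch_size, grad_accum_steps,
                     num_epochs, warmup_ratio, n_devices=1):
    """Approximate optimizer-step counts implied by the training configuration."""
    effective_batch = per_device_batch_size * grad_accum_steps * n_devices
    steps_per_epoch = math.ceil(n_train_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# Defaults from TrainingConfig with an assumed 9,600 training examples
summary = schedule_summary(
    n_train_examples=9_600,
    per_device_batch_size=4,
    grad_accum_steps=4,
    num_epochs=3,
    warmup_ratio=0.03,
)
for key, value in summary.items():
    print(f"{key:>16}: {value}")
```

With `eval_steps` and `save_steps` both set to 100, a run of this size would evaluate and checkpoint several times per epoch.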

## 7. Training Pipeline

Now, let's create a comprehensive training pipeline that handles dataset loading, preprocessing, and model training.

In [None]:
def train_model(model, tokenizer, datasets, config):
    """
    Train the model on the provided datasets.
    
    Args:
        model: The model to train
        tokenizer: The tokenizer to use
        datasets: List of datasets to train on
        config: Training configuration
        
    Returns:
        Trainer: The trained model trainer
    """
    # Process and combine datasets
    processed_datasets = []
    
    for dataset_info in datasets:
        dataset = dataset_info["dataset"]
        instruction_column = dataset_info.get("instruction_column", "instruction")
        input_column = dataset_info.get("input_column", "input")
        output_column = dataset_info.get("output_column", "output")
        text_column = dataset_info.get("text_column", None)
        
        # Preprocess the dataset
        processed_dataset = preprocess_dataset(
            dataset,
            instruction_column=instruction_column,
            input_column=input_column,
            output_column=output_column,
            text_column=text_column
        )
        
        processed_datasets.append(processed_dataset)
    
    # Combine all datasets
    if len(processed_datasets) > 1:
        combined_dataset = concatenate_datasets(processed_datasets)
    else:
        combined_dataset = processed_datasets[0]
    
    # Split into train and validation sets
    train_val_split = combined_dataset.train_test_split(
        test_size=config.val_size,
        seed=config.seed
    )
    
    train_dataset = train_val_split["train"]
    val_dataset = train_val_split["test"]
    
    print(f"Training dataset size: {len(train_dataset)}")
    print(f"Validation dataset size: {len(val_dataset)}")
    
    # Tokenize datasets
    tokenized_train_dataset = tokenize_dataset(
        train_dataset,
        tokenizer,
        max_length=config.max_seq_length
    )
    
    tokenized_val_dataset = tokenize_dataset(
        val_dataset,
        tokenizer,
        max_length=config.max_seq_length
    )
    
    # Get training arguments
    training_args = config.get_training_args()
    
    # Create data collator
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False
    )
    
    # Create early stopping callback
    early_stopping_callback = EarlyStoppingCallback(
        early_stopping_patience=config.early_stopping_patience,
        early_stopping_threshold=config.early_stopping_threshold
    )
    
    # Initialize trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_val_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        callbacks=[early_stopping_callback]
    )
    
    # Train the model
    print("Starting training...")
    trainer.train()
    
    # Save the final model
    trainer.save_model(os.path.join(config.output_dir, "final_model"))
    tokenizer.save_pretrained(os.path.join(config.output_dir, "final_model"))
    
    print("Training completed!")
    return trainer
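
The early stopping settings (`early_stopping_patience` and `early_stopping_threshold`) are easier to reason about with a small simulation. This is a simplified sketch of the idea, not the library's implementation: an evaluation counts as an improvement only if the loss beats the best value so far by more than the threshold, and training stops after `patience` consecutive evaluations without such an improvement.

```python
def first_stop_index(eval_losses, patience=3, threshold=0.01):
    """Return the index of the evaluation that triggers early stopping, or None."""
    best = None
    stale = 0  # consecutive evaluations without a sufficient improvement
    for i, loss in enumerate(eval_losses):
        if best is None or (loss < best and abs(loss - best) > threshold):
            stale = 0
        else:
            stale += 1
        # Track the best loss seen so far, even when the gain is tiny
        if best is None or loss < best:
            best = loss
        if stale >= patience:
            return i
    return None

# Improvements shrink below the 0.01 threshold after the second evaluation
plateau = [2.0, 1.5, 1.495, 1.49, 1.489]
print(first_stop_index(plateau))

# Steady improvements never trigger early stopping
print(first_stop_index([2.0, 1.8, 1.6, 1.4]))
```

With `eval_steps=100` and a patience of 3, this means training can stop as early as 300 steps after the validation loss plateaus.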

## 8. Dataset Loading Interface

Let's create a user-friendly interface for loading datasets from various sources.

In [None]:
def load_datasets_from_sources(sources):
    """
    Load datasets from multiple sources.
    
    Args:
        sources (list): List of dataset source configurations
        
    Returns:
        list: List of loaded datasets with their configurations
    """
    loaded_datasets = []
    
    for source_config in sources:
        source = source_config["source"]
        dataset_name = source_config.get("dataset_name", None)
        split = source_config.get("split", None)
        
        print(f"Loading dataset from {source}...")
        dataset = load_dataset_from_source(source, dataset_name, split)
        
        if dataset is not None:
            print(f"Dataset loaded with {len(dataset)} examples")
            print(f"Dataset columns: {dataset.column_names}")
            
            # Add dataset to the list
            loaded_datasets.append({
                "dataset": dataset,
                "instruction_column": source_config.get("instruction_column", "instruction"),
                "input_column": source_config.get("input_column", "input"),
                "output_column": source_config.get("output_column", "output"),
                "text_column": source_config.get("text_column", None)
            })
        else:
            print(f"Failed to load dataset from {source}")
    
    return loaded_datasets

## 9. Example Usage

Now, let's demonstrate how to use this notebook for training the model on custom datasets.

In [None]:
# Example dataset configurations
# Replace these with your actual datasets
dataset_sources = [
    # Example 1: Loading from Hugging Face Hub
    {
        "source": "databricks/databricks-dolly-15k",
        "instruction_column": "instruction",
        "input_column": "context",
        "output_column": "response"
    },
    
    # Example 2: Loading from a URL (CSV file)
    # {
    #     "source": "https://example.com/dataset.csv",
    #     "instruction_column": "prompt",
    #     "output_column": "completion"
    # },
    
    # Example 3: Loading from a local file
    # {
    #     "source": "/path/to/local/dataset.jsonl",
    #     "text_column": "text"  # For datasets with a single text column
    # }
]

# Uncomment to load and process datasets
# datasets = load_datasets_from_sources(dataset_sources)

In [None]:
# Configure training parameters
# Uncomment and modify as needed
"""
config.update(
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    max_seq_length=4096,
    output_dir="./results"
)
"""

In [None]:
# Initialize Weights & Biases for tracking (optional)
# Uncomment if you want to use W&B for experiment tracking
"""
from datetime import datetime

wandb.login()
wandb.init(
    project="backdoor-b2d4g5-training",
    name=f"training-run-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
    config={
        "model": MODEL_NAME,
        "epochs": config.num_train_epochs,
        "batch_size": config.per_device_train_batch_size,
        "learning_rate": config.learning_rate,
        "lora_r": peft_config.r,
        "lora_alpha": peft_config.lora_alpha,
        "max_seq_length": config.max_seq_length
    }
)
"""

In [None]:
# Start training
# Uncomment to train the model
"""
trainer = train_model(model, tokenizer, datasets, config)
"""

## 10. Model Evaluation

After training, let's evaluate the model on some test examples.
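
The generation helper in this section samples with `temperature` and `top_p`. The following pure-Python sketch illustrates what those two parameters do to a next-token distribution. It is a conceptual illustration on toy numbers, not the code path used by `model.generate`.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the kept probabilities sum to 1
    return {i: probs[i] / mass for i in kept}

probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, top_p=0.9)
print(f"Tokens kept by top-p: {sorted(nucleus)}")

logits = [2.0, 1.0, 0.0]
print(f"T=1.0: {softmax_with_temperature(logits, 1.0)}")
print(f"T=0.5: {softmax_with_temperature(logits, 0.5)}")
```

Lowering `temperature` concentrates probability on the top token, while lowering `top_p` cuts off the unlikely tail before sampling.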

In [None]:
def generate_text(model, tokenizer, prompt, max_length=1024, temperature=0.7, top_p=0.9):
    """
    Generate text using the trained model.
    
    Args:
        model: The trained model
        tokenizer: The tokenizer
        prompt (str): The input prompt
        max_length (int): Maximum length of generated text
        temperature (float): Sampling temperature
        top_p (float): Nucleus sampling parameter
        
    Returns:
        str: The generated text
    """
    # Format the prompt
    formatted_prompt = f"<|begin_of_text|>\n\nInstruction: {prompt}\n\nOutput: "
    
    # Tokenize the prompt
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
    
    # Generate text
    with torch.no_grad():
        outputs = model.generate(
            **inputs,  # pass input_ids together with the attention mask
            max_length=max_length,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    # Decode the generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
    
    # Extract the output part
    output_start = generated_text.find("Output: ") + len("Output: ")
    output_end = generated_text.find("<|end_of_text|>", output_start)
    
    if output_end != -1:
        return generated_text[output_start:output_end].strip()
    else:
        return generated_text[output_start:].strip()

# Example test prompts
test_prompts = [
    "Explain the concept of transformer models in machine learning.",
    "Write a Python function to calculate the Fibonacci sequence.",
    "What are the key differences between supervised and unsupervised learning?"
]

# Uncomment to test the model
"""
for prompt in test_prompts:
    print(f"Prompt: {prompt}")
    response = generate_text(model, tokenizer, prompt)
    print(f"Response: {response}\n")
"""

## 11. Save and Export the Trained Model

Finally, let's save and export the trained model for future use.

In [None]:
def save_model(model, tokenizer, output_dir="./final_model"):
    """
    Save the trained model and tokenizer.
    
    Args:
        model: The trained model
        tokenizer: The tokenizer
        output_dir (str): Output directory
    """
    os.makedirs(output_dir, exist_ok=True)
    
    # Save the model
    model.save_pretrained(output_dir)
    
    # Save the tokenizer
    tokenizer.save_pretrained(output_dir)
    
    print(f"Model and tokenizer saved to {output_dir}")

# Uncomment to save the model
"""
save_model(model, tokenizer, output_dir="./final_model")
"""

In [None]:
# Push to Hugging Face Hub (optional)
# Uncomment if you want to push the model to the Hugging Face Hub
"""
from huggingface_hub import notebook_login

# Log in to Hugging Face (prompts for an access token inside the notebook)
notebook_login()

# Push the model to the Hub
model.push_to_hub("your-username/backdoor-b2d4g5-fine-tuned")
tokenizer.push_to_hub("your-username/backdoor-b2d4g5-fine-tuned")
"""

## 12. Conclusion

This notebook provides a comprehensive framework for training the Backdoor-B2D4G5-Tool-Use model on custom datasets. You can use this notebook to:

1. Use the pre-loaded model from the Kaggle input/MODELS directory
2. Prepare and preprocess your datasets
3. Configure and customize the training process
4. Train the model efficiently using LoRA
5. Evaluate the trained model
6. Save and export the model for future use

To use this notebook effectively:
1. Configure your dataset sources
2. Adjust training parameters as needed
3. Run the training process
4. Evaluate and save the trained model

Happy training!