# Backdoor-B2D4G5-Tool-Use Model Training with AutoGluon

This notebook fine-tunes the Backdoor-B2D4G5-Tool-Use model on custom code datasets, using AutoGluon to automate hyperparameter tuning, model selection, and most of the training configuration.

## Model Information
- **Model Name**: Backdoor-B2D4G5-Tool-Use
- **Architecture**: B2D4G5slm (Transformer-based)
- **Framework**: PyTorch with AutoGluon
- **Hidden Size**: 4096
- **Intermediate Size**: 14336
- **Attention Heads**: 32
- **KV Heads**: 8
- **Hidden Layers**: 32
- **Max Position Embeddings**: 8192
- **Vocab Size**: 128262
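
The sizes above are enough for a rough parameter estimate. The sketch below assumes a Llama-style decoder layout (grouped-query attention, a gated MLP with three weight matrices, two norm weights per layer, no biases, and untied input and output embeddings). None of that is stated in this notebook, so treat the result as an estimate rather than the model's actual parameter count.

```python
# Rough parameter estimate from the sizes listed under Model Information.
# ASSUMPTION: Llama-style decoder blocks, no biases, untied embeddings.
hidden_size = 4096
intermediate_size = 14336
num_attention_heads = 32
num_kv_heads = 8
num_layers = 32
vocab_size = 128262

head_dim = hidden_size // num_attention_heads   # width of one attention head
kv_dim = num_kv_heads * head_dim                # K and V are narrower under grouped-query attention

# q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim
attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
# gate_proj, up_proj and down_proj each hold hidden x intermediate weights
mlp = 3 * hidden_size * intermediate_size
norms = 2 * hidden_size                         # two norm weight vectors per layer
per_layer = attention + mlp + norms

embeddings = 2 * vocab_size * hidden_size       # input embedding plus separate LM head
total = num_layers * per_layer + embeddings + hidden_size  # plus the final norm

print(f"Per layer: {per_layer:,}")
print(f"Estimated total: {total:,}")
```

Under these assumptions the estimate comes to roughly 8.03 billion parameters.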

## Features
- Select and use any model from your Kaggle workspace
- Automated training with AutoGluon
- Train on code-specific datasets from Hugging Face
- Automatic hyperparameter tuning
- Model selection and ensemble learning
- Evaluation metrics tracking
- Model checkpointing and saving

## 1. Setup and Dependencies

First, let's install the necessary dependencies for training the model, including AutoGluon.

In [None]:
# Install required packages
!pip install -q transformers datasets accelerate bitsandbytes peft trl evaluate torch torchvision torchaudio wandb sentencepiece
# Install jupyter and ipywidgets to fix tqdm warnings
!pip install -q jupyter ipywidgets
# Install AutoGluon and its dependencies
!pip install -q autogluon
# Install tensorflow-cpu instead of tensorflow to avoid conflicts with torch-xla
!pip install -q tensorflow-cpu
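
Since several of these packages pin overlapping dependencies, it can help to record which versions actually ended up installed. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of any of the packages above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the versions of the main distributions installed above
for name in ["autogluon", "transformers", "datasets", "peft", "bitsandbytes", "torch"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```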

In [None]:
# Import necessary libraries
import os
import json
import torch
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime
from datasets import load_dataset, Dataset, DatasetDict, concatenate_datasets
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import wandb
import logging
import warnings
from tqdm.auto import tqdm

# Import AutoGluon modules
from autogluon.tabular import TabularPredictor
from autogluon.multimodal import MultiModalPredictor
from autogluon.core.utils import set_logger_verbosity
from autogluon.core.utils.loaders import load_pkl
from autogluon.core.utils.savers import save_pkl

# Set up logging
logging.basicConfig(level=logging.INFO)
warnings.filterwarnings("ignore")
set_logger_verbosity(3)  # AutoGluon verbosity runs from 0 (errors only) to 4 (debug); higher is more verbose

## 2. Model Path Configuration

Let's set up dropdown widgets to select the dataset and model directories from your Kaggle workspace.

In [None]:
import os
import glob
import ipywidgets as widgets
from IPython.display import display, HTML

# Define base paths for Kaggle
KAGGLE_INPUT_PATH = "/kaggle/input"
KAGGLE_WORKING_PATH = "/kaggle/working"

# Define the output directory for trained models
OUTPUT_DIR = os.path.join(KAGGLE_WORKING_PATH, "results")
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Function to list all directories in Kaggle input
def list_kaggle_input_directories():
    if not os.path.exists(KAGGLE_INPUT_PATH):
        return ["Kaggle input directory not found. Are you running this in Kaggle?"]
    
    dirs = [d for d in os.listdir(KAGGLE_INPUT_PATH) if os.path.isdir(os.path.join(KAGGLE_INPUT_PATH, d))]
    return dirs if dirs else ["No directories found in Kaggle input"]

# Create dropdown for selecting dataset directory
dataset_dirs = list_kaggle_input_directories()
dataset_dropdown = widgets.Dropdown(
    options=dataset_dirs,
    description='Dataset:',
    disabled=False,
    style={'description_width': 'initial'}
)

# Create dropdown for selecting model directory
model_dirs = list_kaggle_input_directories()
model_dropdown = widgets.Dropdown(
    options=model_dirs,
    description='Model:',
    disabled=False,
    style={'description_width': 'initial'}
)

# Function to update model subdirectories when model directory is selected
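def update_model_subdirs(change):
    if change['type'] == 'change' and change['name'] == 'value':
        model_dir = os.path.join(KAGGLE_INPUT_PATH, change['new'])
        # Guard against placeholder entries that are not real directories
        if not os.path.isdir(model_dir):
            model_subdir_dropdown.options = ["No subdirectories found"]
            return
        subdirs = [d for d in os.listdir(model_dir) if os.path.isdir(os.path.join(model_dir, d))]
        model_subdir_dropdown.options = subdirs if subdirs else ["No subdirectories found"]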

# Create dropdown for model subdirectories
model_subdir_dropdown = widgets.Dropdown(
    options=["Select a model directory first"],
    description='Model Subdir:',
    disabled=False,
    style={'description_width': 'initial'}
)

# Register the callback
model_dropdown.observe(update_model_subdirs, names='value')

# Display the dropdowns
display(HTML("<h3>Select Dataset and Model Directories</h3>"))
display(dataset_dropdown)
display(model_dropdown)
display(model_subdir_dropdown)

# Button to confirm selection
confirm_button = widgets.Button(
    description='Confirm Selection',
    disabled=False,
    button_style='success',
    tooltip='Click to confirm your selection',
    icon='check'
)

# Output widget to display confirmation
output = widgets.Output()

def on_confirm_button_clicked(b):
    with output:
        output.clear_output()
        
        # Set global variables for dataset and model paths
        global DATASET_PATH, MODEL_NAME
        
        DATASET_PATH = os.path.join(KAGGLE_INPUT_PATH, dataset_dropdown.value)
        
        if model_subdir_dropdown.value != "Select a model directory first" and model_subdir_dropdown.value != "No subdirectories found":
            MODEL_NAME = os.path.join(KAGGLE_INPUT_PATH, model_dropdown.value, model_subdir_dropdown.value)
        else:
            MODEL_NAME = os.path.join(KAGGLE_INPUT_PATH, model_dropdown.value)
        
        print(f"Dataset path set to: {DATASET_PATH}")
        print(f"Model path set to: {MODEL_NAME}")
        
        # Check if the model path exists and contains required files
        required_files = ["config.json", "tokenizer.json", "tokenizer_config.json", "generation_config.json"]
        missing_files = [f for f in required_files if not os.path.exists(os.path.join(MODEL_NAME, f))]
        
        if missing_files:
            print(f"Warning: The following required files are missing in {MODEL_NAME}: {', '.join(missing_files)}")
            print("Please ensure the model directory contains all required files.")
        else:
            print(f"All required model files found in {MODEL_NAME}")
            # List model files
            print("\nModel files:")
            for f in os.listdir(MODEL_NAME):
                print(f"- {f}")

# Register the callback
confirm_button.on_click(on_confirm_button_clicked)

# Display the button and output
display(confirm_button)
display(output)
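
The file check inside the confirm callback can also be run without any widgets, which is convenient when the notebook is executed non-interactively. The sketch below repeats the same check as a plain function; `find_missing_files` is a hypothetical helper name, and the temporary directory only stands in for a real model folder.

```python
import os
import tempfile

REQUIRED_FILES = ["config.json", "tokenizer.json", "tokenizer_config.json", "generation_config.json"]


def find_missing_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from model_dir."""
    return [name for name in required if not os.path.isfile(os.path.join(model_dir, name))]


# Demonstration on a throwaway directory that holds only config.json
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "config.json"), "w").close()
    demo_missing = find_missing_files(demo_dir)

print(demo_missing)
```

An empty list means the directory passes the same test the callback applies.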

## 3. Model Architecture and Loading

Let's verify the selected model directory, then load the tokenizer and the pre-trained weights.

In [None]:
# Verify the selected model path
if 'MODEL_NAME' not in globals():
    # Default model path if not set through the UI
    MODEL_NAME = os.path.join(KAGGLE_INPUT_PATH, "models")
    print(f"Using default model path: {MODEL_NAME}")
    print("Note: You can select a specific model path using the dropdowns above.")
else:
    print(f"Using selected model path: {MODEL_NAME}")

# Ensure we have all necessary model files
required_files = ['config.json', 'tokenizer.json', 'tokenizer_config.json', 'generation_config.json']
missing_files = [file for file in required_files if not os.path.exists(os.path.join(MODEL_NAME, file))]

if missing_files:
    print(f"Missing files in model directory: {missing_files}")
    print("Please ensure all required files are present in the selected model directory.")
    print("You may need to select a different model path using the dropdowns above.")
else:
    print("All required model files found.")
    print("\nModel files:")
    for f in os.listdir(MODEL_NAME):
        print(f"- {f}")

In [None]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    use_fast=True
)

# Set padding token to eos token if not set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Tokenizer loaded with vocabulary size: {len(tokenizer)}")
print(f"BOS token: {tokenizer.bos_token} (ID: {tokenizer.bos_token_id})")
print(f"EOS token: {tokenizer.eos_token} (ID: {tokenizer.eos_token_id})")
print(f"PAD token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")

In [None]:
# Configure quantization for efficient training
# Check CUDA availability for bitsandbytes
cuda_available = torch.cuda.is_available()
if not cuda_available:
    print("CUDA is not available. Loading the model without bitsandbytes quantization.")
    # Skip quantization entirely on CPU instead of passing a disabled 4-bit config
    bnb_config = None
else:
    # Original CUDA configuration
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True
    )
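
To see why 4-bit loading matters here, the arithmetic below compares the memory needed for the weights alone at different precisions. It is a back-of-the-envelope sketch: the parameter count of about 8.03 billion is an assumption derived from the sizes under Model Information with a Llama-style layout, and the real footprint is higher because quantization constants, any layers left unquantized, activations, and optimizer state all add to it.

```python
# ASSUMPTION: about 8.03e9 parameters (estimated, not read from the checkpoint)
n_params = 8_030_000_000

# Storage cost per weight at each precision, in bytes
bytes_per_param = {"float32": 4.0, "bfloat16": 2.0, "nf4 (4-bit)": 0.5}

# Weight-only memory in GiB
weight_gib = {name: n_params * size / 2**30 for name, size in bytes_per_param.items()}

for name, gib in weight_gib.items():
    print(f"{name:>12}: {gib:6.1f} GiB")
```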

In [None]:
# Load the model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if cuda_available else torch.float32
)
model = prepare_model_for_kbit_training(model)

print(f"Model loaded with {model.num_parameters():,} parameters")

## 4. Configure LoRA for Efficient Fine-tuning

We'll use Parameter-Efficient Fine-Tuning (PEFT) with LoRA: the base weights stay frozen, and only small low-rank adapter matrices attached to the attention and MLP projections are trained.

In [None]:
# Configure LoRA for efficient fine-tuning
peft_config = LoraConfig(
    r=16,                    # Rank dimension
    lora_alpha=32,           # Alpha parameter for LoRA scaling
    lora_dropout=0.05,       # Dropout probability for LoRA layers
    bias="none",             # Bias type
    task_type="CAUSAL_LM",   # Task type
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Apply LoRA to the model
model = get_peft_model(model, peft_config)
print("LoRA configuration applied to the model.")
# print_trainable_parameters() prints its own summary and returns None, so call it directly
model.print_trainable_parameters()
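
The number of weights LoRA adds can be estimated by hand: an adapted linear layer with `fan_in` inputs and `fan_out` outputs gains two matrices of rank `r`, for `r * (fan_in + fan_out)` new weights. The sketch below applies that to the seven target modules, assuming Llama-style projection shapes derived from the sizes under Model Information (the shapes are an assumption, not read from the checkpoint).

```python
# ASSUMPTION: projection shapes follow a Llama-style layout
hidden, inter, kv_dim, layers = 4096, 14336, 1024, 32
r, alpha = 16, 32  # values from the LoraConfig in this section

# (fan_in, fan_out) for each targeted projection
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, inter),
    "up_proj": (hidden, inter),
    "down_proj": (inter, hidden),
}

# Each adapted layer gains A (r x fan_in) and B (fan_out x r)
per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
lora_params = layers * per_layer
scaling = alpha / r  # factor applied to the low-rank update

print(f"LoRA weights per layer: {per_layer:,}")
print(f"LoRA weights in total:  {lora_params:,}")
print(f"Update scaling (alpha / r): {scaling}")
```

Under these assumptions the adapters hold about 42 million weights, a small fraction of the base model.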

## 5. Dataset Loading and Preprocessing

Now, let's load and preprocess the specified code datasets from Hugging Face.

In [None]:
# Create checkboxes for dataset selection
dataset_selection = widgets.VBox([
    widgets.HTML("<h3>Select Datasets to Use</h3>"),
    widgets.Checkbox(value=True, description='Programming-Language/codeagent-python', indent=False),
    widgets.Checkbox(value=True, description='codeparrot/github-code', indent=False),
    widgets.Checkbox(value=True, description='NickyNicky/Code-290k-labels-programming_languages-NO_Chatgpt', indent=False)
])

# Display dataset selection
display(dataset_selection)

def load_and_preprocess_datasets(dataset_checkboxes=None):
    """
    Load and preprocess the specified code datasets from Hugging Face.
    
    Args:
        dataset_checkboxes: List of checkbox widgets for dataset selection
        
    Returns:
        pandas.DataFrame: Combined and preprocessed dataset ready for AutoGluon
    """
    print("Loading datasets from Hugging Face...")
    
    # Get selected datasets
    selected_datasets = []
    if dataset_checkboxes:
        # Skip the HTML title widget
        for checkbox in dataset_checkboxes.children[1:]:
            if checkbox.value:
                selected_datasets.append(checkbox.description)
    else:
        # Default to all datasets if no checkboxes provided
        selected_datasets = [
            'Programming-Language/codeagent-python',
            'codeparrot/github-code',
            'NickyNicky/Code-290k-labels-programming_languages-NO_Chatgpt'
        ]
    
    print(f"Selected datasets: {selected_datasets}")
    
    # Initialize empty DataFrames
    codeagent_df = pd.DataFrame()
    github_df = pd.DataFrame()
    code_290k_df = pd.DataFrame()
    
    # Dataset 1: Programming-Language/codeagent-python
    if 'Programming-Language/codeagent-python' in selected_datasets:
        try:
            print("Loading Programming-Language/codeagent-python dataset...")
            codeagent_dataset = load_dataset("Programming-Language/codeagent-python", split="train")
            print(f"Loaded codeagent-python with {len(codeagent_dataset)} examples")
            print(f"Columns: {codeagent_dataset.column_names}")
            
            # Convert to DataFrame and preprocess
            codeagent_df = codeagent_dataset.to_pandas()
            # Sample a few rows to understand the structure
            print("\nSample from codeagent-python:")
            print(codeagent_df.head(2))
        except Exception as e:
            print(f"Error loading codeagent-python dataset: {e}")
            codeagent_df = pd.DataFrame()
    
    # Dataset 2: codeparrot/github-code
    if 'codeparrot/github-code' in selected_datasets:
        try:
            print("\nLoading codeparrot/github-code dataset...")
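            # The full dataset is far too large to download in a notebook, so stream it
            # and keep only the first 10,000 examples (the same cap used for tokenization).
            github_dataset = load_dataset("codeparrot/github-code", split="train", streaming=True)
            github_df = pd.DataFrame(list(github_dataset.take(10000)))
            print(f"Loaded github-code with {len(github_df)} examples")
            print(f"Columns: {github_df.columns.tolist()}")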
            # Sample a few rows to understand the structure
            print("\nSample from github-code:")
            print(github_df.head(2))
        except Exception as e:
            print(f"Error loading github-code dataset: {e}")
            github_df = pd.DataFrame()
    
    # Dataset 3: NickyNicky/Code-290k-labels-programming_languages-NO_Chatgpt
    if 'NickyNicky/Code-290k-labels-programming_languages-NO_Chatgpt' in selected_datasets:
        try:
            print("\nLoading NickyNicky/Code-290k-labels-programming_languages-NO_Chatgpt dataset...")
            code_290k_dataset = load_dataset("NickyNicky/Code-290k-labels-programming_languages-NO_Chatgpt", split="train")
            print(f"Loaded Code-290k with {len(code_290k_dataset)} examples")
            print(f"Columns: {code_290k_dataset.column_names}")
            
            # Convert to DataFrame and preprocess
            code_290k_df = code_290k_dataset.to_pandas()
            # Sample a few rows to understand the structure
            print("\nSample from Code-290k:")
            print(code_290k_df.head(2))
        except Exception as e:
            print(f"Error loading Code-290k dataset: {e}")
            code_290k_df = pd.DataFrame()
    
    # Process and combine datasets based on their structure
    print("\nProcessing and combining datasets...")
    
    # Process codeagent-python dataset
    if not codeagent_df.empty:
        # Standardize column names and format
        if 'instruction' in codeagent_df.columns and 'response' in codeagent_df.columns:
            # Select and rename in one step to avoid modifying a slice in place
            codeagent_df = codeagent_df[['instruction', 'response']].rename(
                columns={'instruction': 'input_text', 'response': 'output_text'}
            )
        elif 'code' in codeagent_df.columns:
            codeagent_df['input_text'] = 'Generate Python code:'
            codeagent_df['output_text'] = codeagent_df['code']
            codeagent_df = codeagent_df[['input_text', 'output_text']]
        codeagent_df['source'] = 'codeagent-python'
    
    # Process github-code dataset
    if not github_df.empty:
        # Standardize column names and format
        if 'code' in github_df.columns:
            github_df['input_text'] = 'Generate code:'
            github_df['output_text'] = github_df['code']
            github_df = github_df[['input_text', 'output_text']]
        github_df['source'] = 'github-code'
    
    # Process Code-290k dataset
    if not code_290k_df.empty:
        # Standardize column names and format
        if 'code' in code_290k_df.columns and 'language' in code_290k_df.columns:
            code_290k_df['input_text'] = 'Generate ' + code_290k_df['language'] + ' code:'
            code_290k_df['output_text'] = code_290k_df['code']
            code_290k_df = code_290k_df[['input_text', 'output_text', 'language']]
        code_290k_df['source'] = 'code-290k'
    
    # Combine all datasets
    dfs_to_combine = []
    if not codeagent_df.empty:
        dfs_to_combine.append(codeagent_df)
    if not github_df.empty:
        dfs_to_combine.append(github_df)
    if not code_290k_df.empty:
        dfs_to_combine.append(code_290k_df)
    
    if dfs_to_combine:
        combined_df = pd.concat(dfs_to_combine, ignore_index=True)
        print(f"Combined dataset has {len(combined_df)} examples")
        
        # Format for language modeling with AutoGluon
        combined_df['text'] = combined_df.apply(
            lambda row: f"<|begin_of_text|>\n\nInstruction: {row['input_text']}\n\nOutput: {row['output_text']}<|end_of_text|>", 
            axis=1
        )
        
        # Create label column for AutoGluon
        combined_df['label'] = combined_df['output_text']
        
        return combined_df
    else:
        print("No datasets were successfully loaded and processed.")
        return None
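
To make the resulting `text` column concrete, here is the same prompt template applied to two toy records. This is only an illustration in plain Python: the records are made up, and `format_example` is a hypothetical helper that mirrors the lambda used in `load_and_preprocess_datasets`.

```python
def format_example(input_text, output_text):
    """Build one training string with the template used for the `text` column."""
    return f"<|begin_of_text|>\n\nInstruction: {input_text}\n\nOutput: {output_text}<|end_of_text|>"


# Made-up records in the standardized input_text / output_text layout
toy_records = [
    {"input_text": "Generate Python code:", "output_text": "print('hello')"},
    {"input_text": "Generate Go code:", "output_text": "fmt.Println(\"hello\")"},
]

toy_texts = [format_example(r["input_text"], r["output_text"]) for r in toy_records]
print(toy_texts[0])
```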

In [None]:
# Button to start dataset loading
load_datasets_button = widgets.Button(
    description='Load Selected Datasets',
    disabled=False,
    button_style='primary',
    tooltip='Click to load the selected datasets',
    icon='download'
)

# Output widget for dataset loading results
dataset_output = widgets.Output()

# Global variable to store the combined dataframe
combined_df = None

def on_load_datasets_button_clicked(b):
    global combined_df
    with dataset_output:
        dataset_output.clear_output()
        print("Loading datasets...")
        combined_df = load_and_preprocess_datasets(dataset_selection)
        
        if combined_df is not None:
            print("\nDataset preprocessing complete.")
            print(f"Final dataset shape: {combined_df.shape}")
            print("Sample of processed data:")
            print(combined_df[['input_text', 'output_text', 'source']].head())
        else:
            print("Failed to load and process datasets. Please check the dataset configurations.")

# Register the callback
load_datasets_button.on_click(on_load_datasets_button_clicked)

# Display the button and output
display(load_datasets_button)
display(dataset_output)

## 6. Tokenization and Feature Engineering

Let's tokenize the text data and prepare it for AutoGluon.

In [None]:
def tokenize_and_prepare_data(df, tokenizer, max_length=4096, val_size=0.1, seed=42):
    """
    Tokenize the text data and prepare it for AutoGluon.
    
    Args:
        df (pandas.DataFrame): The dataset to tokenize
        tokenizer: The tokenizer to use
        max_length (int): Maximum sequence length
        val_size (float): Validation set size as a fraction of the total dataset
        seed (int): Random seed for reproducibility
        
    Returns:
        tuple: (train_df, val_df) - DataFrames for training and validation
    """
    print("Tokenizing and preparing data for AutoGluon...")
    
    # Create a new column with tokenized text
    def tokenize_text(text):
        tokens = tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=max_length,
            return_tensors="pt"
        )
        return {
            'input_ids': tokens['input_ids'].tolist()[0],
            'attention_mask': tokens['attention_mask'].tolist()[0]
        }
    
    # Apply tokenization to a sample of the data to avoid memory issues
    sample_size = min(10000, len(df))
    print(f"Tokenizing a sample of {sample_size} examples...")
    sample_df = df.sample(sample_size, random_state=seed)
    
    # Tokenize the text column
    tokenized_data = sample_df['text'].apply(tokenize_text)
    
    # Extract input_ids and attention_mask as separate columns
    sample_df['input_ids'] = tokenized_data.apply(lambda x: x['input_ids'])
    sample_df['attention_mask'] = tokenized_data.apply(lambda x: x['attention_mask'])
    
    # Split into training and validation sets
    train_df = sample_df.sample(frac=1-val_size, random_state=seed)
    val_df = sample_df.drop(train_df.index)
    
    print(f"Training set size: {len(train_df)}")
    print(f"Validation set size: {len(val_df)}")
    
    return train_df, val_df
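
Because the tokenizer is called with `padding="max_length"`, every row stores `max_length` ids no matter how short the snippet is. The arithmetic below shows how many values the two new columns hold for the default sample size. The average snippet length of 350 tokens is a made-up figure for illustration only.

```python
sample_size = 10_000
max_length = 4096
columns = 2  # input_ids and attention_mask

# Total integers stored across both columns after padding
stored_values = sample_size * max_length * columns

# ASSUMPTION: illustrative average length, not measured on the datasets
assumed_avg_tokens = 350
padding_fraction = 1 - assumed_avg_tokens / max_length

print(f"Stored values: {stored_values:,}")
print(f"Share that would be padding at {assumed_avg_tokens} tokens per example: {padding_fraction:.1%}")
```

If memory becomes a problem, lowering `max_length` or `sample_size` reduces the stored values proportionally.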

In [None]:
# Tokenize and prepare the data
if combined_df is not None:
    train_df, val_df = tokenize_and_prepare_data(
        combined_df, 
        tokenizer, 
        max_length=4096, 
        val_size=0.1, 
        seed=42
    )
    
    print("\nData preparation complete.")
    print(f"Training set columns: {train_df.columns.tolist()}")
else:
    print("Cannot prepare data because dataset loading failed.")

## 7. AutoGluon Training Configuration

Let's set up the AutoGluon training configuration with customizable parameters.

In [None]:
# AutoGluon training configuration
class AutoGluonConfig:
    def __init__(self):
        # Basic training parameters
        self.output_dir = os.path.join(OUTPUT_DIR, "autogluon_models")
        self.time_limit = 3600  # Time limit in seconds (1 hour)
        self.num_trials = 10    # Number of trials for hyperparameter tuning
        self.num_bag_folds = 5  # Number of folds for bagging
        self.num_bag_sets = 3   # Number of bagging sets
        self.num_stack_levels = 2  # Number of stacking levels
        
        # Model selection
        self.hyperparameters = {
            'GBM': {},
            'CAT': {},
            'XGB': {},
            'RF': {},
            'XT': {},
            'NN_TORCH': {},
            'FASTAI': {}
        }
        
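        # Multimodal parameters
        # NOTE: confirm that the installed AutoGluon release accepts this problem type
        # and metric; MultiModalPredictor rejects values it does not recognize.
        self.problem_type = "text_generation"  # For MultiModalPredictor
        self.text_backbone = "distilroberta-base"  # Text backbone model
        self.optimization_metric = "perplexity"  # Metric to optimize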
        
        # Early stopping
        self.early_stopping_patience = 3
        
        # Miscellaneous
        self.seed = 42
        self.verbosity = 3  # 0 (errors only) to 4 (debug); higher is more verbose
        
    def update(self, **kwargs):
        """
        Update configuration parameters.
        
        Args:
            **kwargs: Key-value pairs of parameters to update
        """
        for key, value in kwargs.items():
            if hasattr(self, key):
                setattr(self, key, value)
            else:
                print(f"Warning: Unknown parameter '{key}'")

# Create default AutoGluon configuration
ag_config = AutoGluonConfig()

# Create interactive widgets for configuration
time_limit_slider = widgets.IntSlider(
    value=3600,
    min=600,
    max=36000,
    step=600,
    description='Time Limit (s):',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d',
    style={'description_width': 'initial'}
)

num_trials_slider = widgets.IntSlider(
    value=10,
    min=1,
    max=50,
    step=1,
    description='Num Trials:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d',
    style={'description_width': 'initial'}
)

early_stopping_slider = widgets.IntSlider(
    value=3,
    min=1,
    max=10,
    step=1,
    description='Early Stopping Patience:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d',
    style={'description_width': 'initial'}
)

text_backbone_dropdown = widgets.Dropdown(
    options=['distilroberta-base', 'bert-base-uncased', 'roberta-base', 'distilbert-base-uncased'],
    value='distilroberta-base',
    description='Text Backbone:',
    disabled=False,
    style={'description_width': 'initial'}
)

# Display configuration widgets
display(HTML("<h3>AutoGluon Training Configuration</h3>"))
display(time_limit_slider)
display(num_trials_slider)
display(early_stopping_slider)
display(text_backbone_dropdown)

# Button to apply configuration
apply_config_button = widgets.Button(
    description='Apply Configuration',
    disabled=False,
    button_style='info',
    tooltip='Click to apply the configuration',
    icon='check'
)

# Output widget for configuration results
config_output = widgets.Output()

def on_apply_config_button_clicked(b):
    with config_output:
        config_output.clear_output()
        
        # Update configuration
        ag_config.update(
            time_limit=time_limit_slider.value,
            num_trials=num_trials_slider.value,
            early_stopping_patience=early_stopping_slider.value,
            text_backbone=text_backbone_dropdown.value
        )
        
        print("Configuration applied:")
        print(f"Time limit: {ag_config.time_limit} seconds")
        print(f"Number of trials: {ag_config.num_trials}")
        print(f"Early stopping patience: {ag_config.early_stopping_patience}")
        print(f"Text backbone model: {ag_config.text_backbone}")
        print(f"Output directory: {ag_config.output_dir}")

# Register the callback
apply_config_button.on_click(on_apply_config_button_clicked)

# Display the button and output
display(apply_config_button)
display(config_output)
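
The `update()` method can also be called directly, without the widgets. The sketch below uses a hypothetical `DemoConfig` class with the same update rule to show what happens when a key is misspelled: the known key is applied, and the unknown one is reported and ignored.

```python
class DemoConfig:
    """Hypothetical stand-in for AutoGluonConfig with the same update() rule."""

    def __init__(self):
        self.time_limit = 3600
        self.num_trials = 10

    def update(self, **kwargs):
        for key, value in kwargs.items():
            if hasattr(self, key):
                setattr(self, key, value)
            else:
                print(f"Warning: Unknown parameter '{key}'")


demo = DemoConfig()
demo.update(time_limit=7200, num_trails=20)  # "num_trails" is a deliberate typo
print(demo.time_limit, demo.num_trials)
```

One caveat of the `hasattr` check is that method names pass it too, so `demo.update(update=1)` would shadow the method on that instance.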

## 8. AutoGluon Training Pipeline

Now, let's create a comprehensive training pipeline using AutoGluon.

In [None]:
def train_with_autogluon(train_df, val_df, config):
    """
    Train the model using AutoGluon.
    
    Args:
        train_df (pandas.DataFrame): Training dataset
        val_df (pandas.DataFrame): Validation dataset
        config: AutoGluon configuration
        
    Returns:
        MultiModalPredictor: The trained AutoGluon predictor
    """
    print("Starting AutoGluon training pipeline...")
    
    # Create output directory
    os.makedirs(config.output_dir, exist_ok=True)
    
    # Initialize MultiModalPredictor for text generation
    predictor = MultiModalPredictor(
        label="label",
        problem_type=config.problem_type,
        path=config.output_dir,
        verbosity=config.verbosity,
        eval_metric=config.optimization_metric
    )
    
    # Set up hyperparameters
    hyperparameters = {
        "optimization.max_epochs": 10,
        "optimization.learning_rate": 2e-5,
        "optimization.patience": config.early_stopping_patience,
        "optimization.val_check_interval": 0.5,
        "env.num_gpus": 1 if torch.cuda.is_available() else 0,
        "env.num_workers": 4,
        "model.text_backbone": config.text_backbone,
        "model.hf_text.checkpoint_name": MODEL_NAME,  # Use our pre-loaded model
        "model.hf_text.max_text_len": 4096,
        "data.text.normalize_text": False,  # Don't normalize code
    }
    
    # Start training
    print("Training model with AutoGluon...")
    predictor.fit(
        train_data=train_df,
        tuning_data=val_df,
        time_limit=config.time_limit,
        hyperparameters=hyperparameters,
        seed=config.seed
    )
    
    print("Training completed!")
    
    # Save the predictor
    predictor.save(os.path.join(config.output_dir, "final_model"))
    print(f"Model saved to {os.path.join(config.output_dir, 'final_model')}")
    
    return predictor

## 9. Training Execution

Now, let's execute the training pipeline with the prepared data.

In [None]:
# Checkbox for Weights & Biases tracking
wandb_checkbox = widgets.Checkbox(
    value=False,
    description='Use Weights & Biases for tracking',
    disabled=False,
    indent=False
)

# Text input for W&B project name
wandb_project = widgets.Text(
    value='backdoor-b2d4g5-autogluon',
    placeholder='Enter W&B project name',
    description='W&B Project:',
    disabled=False,
    style={'description_width': 'initial'}
)

# Display W&B options
display(HTML("<h3>Experiment Tracking</h3>"))
display(wandb_checkbox)
display(wandb_project)

# Button to start training
start_training_button = widgets.Button(
    description='Start Training',
    disabled=False,
    button_style='success',
    tooltip='Click to start training with AutoGluon',
    icon='play'
)

# Output widget for training results
training_output = widgets.Output()

# Global variable to store the predictor
predictor = None

def on_start_training_button_clicked(b):
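    # train_df and val_df are made global so the evaluation cell can find val_df later
    global predictor, combined_df, train_df, val_df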
    with training_output:
        training_output.clear_output()
        
        # Check if dataset is loaded
        if combined_df is None:
            print("Error: No dataset loaded. Please load datasets first.")
            return
        
        # Initialize W&B if selected
        if wandb_checkbox.value:
            try:
                wandb.login()
                wandb.init(
                    project=wandb_project.value,
                    name=f"training-run-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
                    config={
                        "model": MODEL_NAME,
                        "time_limit": ag_config.time_limit,
                        "num_trials": ag_config.num_trials,
                        "text_backbone": ag_config.text_backbone,
                        "optimization_metric": ag_config.optimization_metric
                    }
                )
                print("Weights & Biases initialized for experiment tracking.")
            except Exception as e:
                print(f"Warning: Could not initialize Weights & Biases: {e}")
                print("Continuing without experiment tracking.")
        
        # Tokenize and prepare data
        print("Tokenizing and preparing data...")
        try:
            train_df, val_df = tokenize_and_prepare_data(
                combined_df, 
                tokenizer, 
                max_length=4096, 
                val_size=0.1, 
                seed=ag_config.seed
            )
            
            print("\nData preparation complete.")
            print(f"Training set size: {len(train_df)}")
            print(f"Validation set size: {len(val_df)}")
            
            # Start training
            print("\nStarting training with AutoGluon...")
            predictor = train_with_autogluon(train_df, val_df, ag_config)
            
            print("\nTraining complete!")
            print(f"Model saved to {os.path.join(ag_config.output_dir, 'final_model')}")
            
        except Exception as e:
            print(f"Error during training: {e}")
            import traceback
            traceback.print_exc()

# Register the callback
start_training_button.on_click(on_start_training_button_clicked)

# Display the button and output
display(start_training_button)
display(training_output)

## 10. Model Evaluation

After training, let's evaluate the model on the validation set.

In [None]:
def evaluate_model(predictor, val_df):
    """
    Evaluate the trained model on the validation set.
    
    Args:
        predictor: The trained AutoGluon predictor
        val_df (pandas.DataFrame): Validation dataset
        
    Returns:
        dict: Evaluation metrics
    """
    print("Evaluating model on validation set...")
    
    # Get model performance on validation set
    evaluation = predictor.evaluate(val_df)
    print("Evaluation results:")
    print(evaluation)
    
    # Generate predictions for a few examples
    print("\nGenerating predictions for sample examples:")
    # Never request more rows than the validation set contains
    sample_indices = val_df.sample(min(5, len(val_df)), random_state=42).index
    
    for idx in sample_indices:
        input_text = val_df.loc[idx, 'input_text']
        true_output = val_df.loc[idx, 'output_text']
        
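        # Generate prediction; wrap the scalar in a list so a one-row frame can be built from it
        raw_prediction = predictor.predict({"input_text": [input_text]})
        prediction = str(np.asarray(raw_prediction).ravel()[0])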
        
        print(f"\nInput: {input_text}")
        print(f"True output: {true_output[:100]}..." if len(true_output) > 100 else f"True output: {true_output}")
        print(f"Predicted: {prediction[:100]}..." if len(prediction) > 100 else f"Predicted: {prediction}")
    
    return evaluation
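
The sample printout in `evaluate_model` is a qualitative check. As a complement, here is a minimal sketch of two string-level scores for comparing a predicted output with its reference: exact match and whitespace-token F1. The helper names `exact_match` and `token_f1` are hypothetical and not part of AutoGluon, and splitting on whitespace is a simplifying assumption that ignores code syntax.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """Return True when both strings are equal after trimming whitespace."""
    return prediction.strip() == reference.strip()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()

    # Two empty strings count as a perfect match; one empty side scores 0.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging `token_f1` over the sampled rows gives a rough single number to track between runs; it does not check whether the generated code is correct.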

In [None]:
# Button to evaluate model
evaluate_button = widgets.Button(
    description='Evaluate Model',
    disabled=False,
    button_style='warning',
    tooltip='Click to evaluate the trained model',
    icon='chart-bar'
)

# Output widget for evaluation results
evaluation_output = widgets.Output()

def on_evaluate_button_clicked(b):
    with evaluation_output:
        evaluation_output.clear_output()
        
        # Check if model and validation data are available.
        # globals().get() avoids a NameError when training has never been run.
        if globals().get('predictor') is None:
            print("Error: No trained model available. Please train the model first.")
            return
        
        if 'val_df' not in globals() or val_df is None:
            print("Error: No validation data available. Please prepare data first.")
            return
        
        # Evaluate the model
        try:
            # evaluate_model prints its own progress message
            evaluation_metrics = evaluate_model(predictor, val_df)
            
            # Log metrics to Weights & Biases if available
            if wandb_checkbox.value and wandb.run is not None:
                try:
                    wandb.log(evaluation_metrics)
                    print("Metrics logged to Weights & Biases.")
                except Exception as e:
                    print(f"Warning: Could not log metrics to Weights & Biases: {e}")
        except Exception as e:
            print(f"Error during evaluation: {e}")
            import traceback
            traceback.print_exc()

# Register the callback
evaluate_button.on_click(on_evaluate_button_clicked)

# Display the button and output
display(evaluate_button)
display(evaluation_output)
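
The logging step assumes the metrics dictionary holds plain numeric values. The sketch below shows one way to keep only the numeric entries and put them under a common prefix before logging. The helper name `prepare_metrics` and the `eval/` prefix are assumptions made for illustration, not part of AutoGluon or Weights & Biases.

```python
from numbers import Real


def prepare_metrics(metrics: dict, prefix: str = "eval/") -> dict:
    """Keep numeric metric values, cast them to float, and prefix the keys."""
    cleaned = {}
    for name, value in metrics.items():
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, Real) and not isinstance(value, bool):
            cleaned[f"{prefix}{name}"] = float(value)
    return cleaned
```

The result could then be passed to `wandb.log` in place of the raw dictionary.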

## 11. Model Export and Integration

Finally, let's export the trained model for deployment.

In [None]:
def export_model(predictor, output_dir):
    """
    Export the trained model for deployment.
    
    Args:
        predictor: The trained AutoGluon predictor
        output_dir (str): Output directory
    """
    print("Exporting model for deployment...")
    
    # Create export directory
    export_dir = os.path.join(output_dir, "export")
    os.makedirs(export_dir, exist_ok=True)
    
    # Save the predictor
    predictor.save(export_dir)
    print(f"Model exported to {export_dir}")
    
    # Save model configuration
    config_path = os.path.join(export_dir, "model_config.json")
    with open(config_path, "w") as f:
        json.dump({
            "model_type": "autogluon_multimodal",
            "problem_type": "text_generation",
            "text_backbone": ag_config.text_backbone,
            "export_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        }, f, indent=2)
    
    print(f"Model configuration saved to {config_path}")
    
    # Create a simple inference script. The outer literal uses ''' so the
    # docstrings inside the script can keep their own """ delimiters.
    inference_script = '''
from autogluon.multimodal import MultiModalPredictor

def load_model(model_path):
    """Load the trained AutoGluon model."""
    return MultiModalPredictor.load(model_path)

def predict(model, text):
    """Generate predictions using the trained model."""
    # Wrap the text in a list so the input forms a one-row table
    return model.predict({"input_text": [text]})

if __name__ == "__main__":
    # Example usage
    model = load_model("./export")
    
    # Test prediction
    test_input = "Generate Python code to sort a list of numbers:"
    prediction = predict(model, test_input)
    print(f"Input: {test_input}")
    print(f"Prediction: {prediction}")
'''
    
    inference_path = os.path.join(export_dir, "inference.py")
    with open(inference_path, "w") as f:
        f.write(inference_script)
    
    print(f"Inference script saved to {inference_path}")
    print("Model export complete.")
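
Because `export_model` writes the inference script from a string literal, a syntax error in that string would only surface when someone runs `inference.py`. A minimal sketch of a pre-write check using the standard library's `ast.parse` follows. The helper name `is_valid_python` is hypothetical. It checks syntax only and does not verify imports or runtime behavior.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python, False on a syntax error."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True
```

Calling `is_valid_python(inference_script)` just before the file is written would make the export fail early instead of producing an unusable script.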

In [None]:
# Text input for export directory
export_dir_input = widgets.Text(
    value='',
    placeholder='Leave empty to use default export directory',
    description='Export Directory:',
    disabled=False,
    style={'description_width': 'initial'}
)

# Button to export model
export_button = widgets.Button(
    description='Export Model',
    disabled=False,
    button_style='info',
    tooltip='Click to export the trained model',
    icon='download'
)

# Output widget for export results
export_output = widgets.Output()

def on_export_button_clicked(b):
    with export_output:
        export_output.clear_output()
        
        # Check if model is available.
        # globals().get() avoids a NameError when training has never been run.
        if globals().get('predictor') is None:
            print("Error: No trained model available. Please train the model first.")
            return
        
        # Determine export directory
        export_directory = export_dir_input.value.strip() or ag_config.output_dir
        
        # Export the model
        try:
            print(f"Exporting model to {export_directory}...")
            export_model(predictor, export_directory)
            print("Model export complete. The model is ready for deployment.")
            
            # Show files in export directory
            export_path = os.path.join(export_directory, "export")
            if os.path.exists(export_path):
                print("\nExported files:")
                for root, dirs, files in os.walk(export_path):
                    level = root.replace(export_path, '').count(os.sep)
                    indent = ' ' * 4 * level
                    print(f"{indent}{os.path.basename(root)}/")
                    sub_indent = ' ' * 4 * (level + 1)
                    for f in files:
                        print(f"{sub_indent}{f}")
        except Exception as e:
            print(f"Error during model export: {e}")
            import traceback
            traceback.print_exc()

# Register the callback
export_button.on_click(on_export_button_clicked)

# Display the input, button and output
display(HTML("<h3>Model Export</h3>"))
display(export_dir_input)
display(export_button)
display(export_output)
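
The file listing printed by the export callback can also be written as a standalone helper, which makes it easier to reuse and test. The sketch below is one possible version. The name `format_tree` is hypothetical, and it computes the nesting depth with `os.path.relpath` rather than the string replacement used in the callback, which is a substitution made here for robustness against trailing separators.

```python
import os
from typing import List


def format_tree(root_dir: str) -> List[str]:
    """Return an indented listing of root_dir, one line per directory or file."""
    lines = []
    root_dir = os.path.abspath(root_dir)
    for current, dirs, files in os.walk(root_dir):
        dirs.sort()  # visit subdirectories in a stable order
        rel = os.path.relpath(current, root_dir)
        # The root itself is depth 0; each path component below adds one level
        level = 0 if rel == os.curdir else rel.count(os.sep) + 1
        indent = " " * 4 * level
        lines.append(f"{indent}{os.path.basename(current)}/")
        for name in sorted(files):
            lines.append(f"{indent}    {name}")
    return lines
```

Printing the result with `print("\n".join(format_tree(export_path)))` gives a listing in the same style as the callback's output.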

## 12. Conclusion

This notebook showed how to use AutoGluon to train the Backdoor-B2D4G5-Tool-Use model on code-specific datasets from Hugging Face. Automating hyperparameter tuning and model selection reduces manual configuration and may improve model performance.

### Key Accomplishments
- Loaded and preprocessed code datasets from Hugging Face
- Configured AutoGluon for automated model training
- Trained the model with minimal manual configuration
- Evaluated model performance on validation data
- Exported the model for deployment

### Next Steps
- Fine-tune the model on additional domain-specific datasets
- Experiment with different AutoGluon configurations
- Deploy the model in a production environment
- Implement continuous training and evaluation pipelines