# XLM-RoBERTa Signal Classifier Training

Fine-tune [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) for the 4-way trading signal classification task used in the BingXtg project. This notebook mirrors the CLI script in `ai/training/hf/train_classifier.py` and is optimized for Google Colab.

## 1. Environment setup
Run the following cell to install the required libraries. Colab already provides a recent version of PyTorch.

In [1]:
!pip install -q datasets transformers evaluate accelerate sentencepiece

## 2. (Optional) Mount Google Drive
If your dataset lives on Google Drive, run the cell below to mount it; outside Colab the cell only prints a message. Otherwise, upload files directly to Colab using the file browser.

In [2]:
import sys

if "google.colab" in sys.modules:
    from google.colab import drive  # type: ignore
    drive.mount('/content/drive')
else:
    print('Not running inside Colab; skipping Drive mount.')

Mounted at /content/drive


## 3. Imports and configuration
Adjust the configuration parameters as needed. Paths default to the Colab working directory (`/content`).

In [16]:
from dataclasses import dataclass
from typing import Dict

import evaluate
import numpy as np
import torch
from datasets import DatasetDict, load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    set_seed,
)

LABEL_LIST = ["NON_SIGNAL", "SIGNAL_LONG", "SIGNAL_SHORT", "SIGNAL_NONE"]
LABEL2ID = {label: idx for idx, label in enumerate(LABEL_LIST)}
ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}

@dataclass
class TrainingConfig:
    data_file: str = '/content/classification_data.csv'
    output_dir: str = '/content/signal_classifier'
    epochs: int = 3
    batch_size: int = 32  # per-device train batch size
    eval_batch_size: int = 64  # per-device eval batch size
    learning_rate: float = 2e-5
    test_split: float = 0.2  # fraction of rows held out from training
    val_split: float = 0.5  # share of the held-out rows used for validation
    seed: int = 42
    use_fp16: bool = True  # mixed precision; only applied when CUDA is available
    gradient_accumulation_steps: int = 2  # batches accumulated per optimizer update

config = TrainingConfig()
config

TrainingConfig(data_file='/content/classification_data.csv', output_dir='/content/signal_classifier', epochs=3, batch_size=32, eval_batch_size=64, learning_rate=2e-05, test_split=0.2, val_split=0.5, seed=42, use_fp16=True, gradient_accumulation_steps=2)
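
With these defaults, each optimizer update sees `batch_size * gradient_accumulation_steps` examples. The sketch below works through that arithmetic. It assumes a single GPU and the 26,092-row train split reported in section 4; the helper name `optimizer_steps` is illustrative, and how a trailing partial accumulation group is counted may differ between `transformers` versions.

```python
import math

def optimizer_steps(num_examples: int, batch_size: int,
                    accumulation_steps: int, epochs: int) -> dict:
    """Estimate batch and update counts for a single-device run (illustrative helper)."""
    # The last batch of an epoch may be smaller than batch_size.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # Assumption: a trailing partial accumulation group still triggers an update.
    updates_per_epoch = math.ceil(batches_per_epoch / accumulation_steps)
    return {
        "effective_batch_size": batch_size * accumulation_steps,
        "batches_per_epoch": batches_per_epoch,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * epochs,
    }

steps = optimizer_steps(num_examples=26092, batch_size=32,
                        accumulation_steps=2, epochs=3)
print(steps)
```

Under these assumptions the effective batch size is 64, so lowering `batch_size` to fit a smaller GPU while raising `gradient_accumulation_steps` by the same factor keeps the update size unchanged.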

## 4. Dataset preparation
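This cell loads the CSV dataset, splits it into train, validation, and test sets, and tokenizes each split with the `xlm-roberta-base` tokenizer. The CSV must contain a `text` column and a `label` column whose values match `LABEL_LIST`. With the default `test_split=0.2` and `val_split=0.5`, the result is roughly an 80/10/10 split.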

In [18]:
def load_and_prepare_dataset(data_file: str, test_split: float, val_split: float, seed: int) -> DatasetDict:
    dataset_dict = load_dataset('csv', data_files=data_file)
    full_dataset = dataset_dict['train']

    # First split: train (1 - test_split) vs. temp (test_split)
    train_temp_split = full_dataset.train_test_split(test_size=test_split, seed=seed)

    # Second split: temp into validation (val_split) vs. test (1 - val_split)
    val_test_split = train_temp_split['test'].train_test_split(test_size=1 - val_split, seed=seed)

    # Combine the three splits; validation and test both come from the held-out part
    split_dataset = DatasetDict({
        'train': train_temp_split['train'],
        'validation': val_test_split['train'],
        'test': val_test_split['test']
    })

    tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')

    def preprocess(batch: Dict[str, list]) -> Dict[str, list]:
        tokenized = tokenizer(
            batch['text'],
            truncation=True,
            max_length=256,
            padding=False,
        )
        tokenized['labels'] = [LABEL2ID[label] for label in batch['label']]
        return tokenized

    tokenized = split_dataset.map(preprocess, batched=True, remove_columns=['text', 'label'])
    return tokenized

tokenized_datasets = load_and_prepare_dataset(
    config.data_file, config.test_split, config.val_split, config.seed
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 26092
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3262
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 3262
    })
})
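
The row counts above follow from the two-stage split. As a rough check, the sketch below recomputes them from the total of 32,616 rows (the sum of the three splits shown). The helper `split_sizes` is illustrative, and the round-up rule for the held-out part is an assumption that happens to reproduce these counts; it is not taken from the `datasets` source.

```python
import math

def split_sizes(num_rows: int, test_split: float, val_split: float) -> tuple:
    """Return (train, validation, test) row counts for the two-stage split.

    Assumes the held-out part of each split is rounded up; a different
    rounding rule could shift the result by a row.
    """
    held_out = math.ceil(num_rows * test_split)   # first split: train vs. held-out
    train = num_rows - held_out
    test = math.ceil(held_out * (1 - val_split))  # second split: validation vs. test
    validation = held_out - test
    return train, validation, test

print(split_sizes(32616, test_split=0.2, val_split=0.5))
# → (26092, 3262, 3262)
```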

## 5. Metric computation helper
We reuse the accuracy and macro-F1 metrics from the CLI script.

In [7]:
# Load the metrics once rather than on every evaluation call.
accuracy_metric = evaluate.load('accuracy')
f1_metric = evaluate.load('f1')


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit.
    preds = np.argmax(logits, axis=-1)

    accuracy_result = accuracy_metric.compute(predictions=preds, references=labels)
    # Macro averaging weights every class equally, regardless of its frequency.
    f1_result = f1_metric.compute(predictions=preds, references=labels, average='macro')

    return {
        'accuracy': accuracy_result.get('accuracy', 0.0),
        'macro_f1': f1_result.get('f1', 0.0),
    }
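
To make the macro-F1 number easier to interpret, here is a minimal NumPy sketch that computes accuracy and macro-F1 by hand on made-up predictions. It is not the `evaluate` implementation; the function name `macro_f1` and the toy arrays are illustrative, and scoring a class with no predictions and no references as 0.0 is an assumption.

```python
import numpy as np

def macro_f1(preds: np.ndarray, labels: np.ndarray, num_classes: int) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in range(num_classes):
        tp = np.sum((preds == cls) & (labels == cls))
        fp = np.sum((preds == cls) & (labels != cls))
        fn = np.sum((preds != cls) & (labels == cls))
        denom = 2 * tp + fp + fn
        # Assumption: a class that never appears contributes 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy logits for six examples and four classes (made-up values).
toy_logits = np.eye(4)[[0, 1, 1, 1, 2, 3]]
toy_labels = np.array([0, 0, 1, 1, 2, 3])
toy_preds = np.argmax(toy_logits, axis=-1)

toy_accuracy = float(np.mean(toy_preds == toy_labels))
toy_macro_f1 = macro_f1(toy_preds, toy_labels, num_classes=4)
print(f"accuracy={toy_accuracy:.4f} macro_f1={toy_macro_f1:.4f}")
```

One misclassified example out of six costs accuracy the same amount whichever class it belongs to, while macro-F1 penalizes errors on rare classes more heavily, which is why the notebook selects the best checkpoint on `macro_f1`.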

## 6. Training
This cell mirrors the script's `main()` function. It checks for GPU availability, configures the Hugging Face `Trainer`, and starts fine-tuning.

In [19]:
set_seed(config.seed)

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
model = AutoModelForSequenceClassification.from_pretrained(
    'xlm-roberta-base',
    num_labels=len(LABEL_LIST),
    id2label=ID2LABEL,
    label2id=LABEL2ID,
)

use_cuda = torch.cuda.is_available()
if use_cuda:
    try:
        torch.cuda.set_device(0)
        _ = torch.randn(2, 2).cuda()
        del _
        print('✓ CUDA is available and working - using GPU')
    except Exception as exc:
        print(f'⚠️  CUDA test failed: {exc}')
        print('Falling back to CPU')
        use_cuda = False
else:
    print('CUDA not available - using CPU')

training_args = TrainingArguments(
    output_dir=config.output_dir,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='macro_f1',
    greater_is_better=True,
    learning_rate=config.learning_rate,
    per_device_train_batch_size=config.batch_size,
    per_device_eval_batch_size=config.eval_batch_size,
    num_train_epochs=config.epochs,
    weight_decay=0.01,
    warmup_ratio=0.1,
    logging_steps=50,
    fp16=config.use_fp16 and use_cuda,
    gradient_accumulation_steps=config.gradient_accumulation_steps,
    dataloader_pin_memory=True,
    dataloader_num_workers=2,
    push_to_hub=False,
    use_cpu=not use_cuda,  # replaces the deprecated no_cuda flag
)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

# Evaluate the best checkpoint (reloaded via load_best_model_at_end) on the validation set
val_metrics = trainer.evaluate()
print("Validation Metrics:", val_metrics)

# Final eval on test set
test_metrics = trainer.evaluate(tokenized_datasets['test'])
print("Test Metrics:", test_metrics)

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✓ CUDA is available and working - using GPU


| Epoch | Training Loss | Validation Loss | Accuracy | Macro F1 |
|---|---|---|---|---|
| 1 | 0.1152 | 0.126936 | 0.966891 | 0.941878 |
| 2 | 0.0821 | 0.093171 | 0.969344 | 0.944777 |
| 3 | 0.0831 | 0.089672 | 0.971183 | 0.952539 |


Validation Metrics: {'eval_loss': 0.08967241644859314, 'eval_accuracy': 0.9711833231146536, 'eval_macro_f1': 0.9525390757427811, 'eval_runtime': 12.185, 'eval_samples_per_second': 267.707, 'eval_steps_per_second': 4.185, 'epoch': 3.0}
Test Metrics: {'eval_loss': 0.10102306306362152, 'eval_accuracy': 0.9678111587982833, 'eval_macro_f1': 0.9487255519113003, 'eval_runtime': 12.4418, 'eval_samples_per_second': 262.18, 'eval_steps_per_second': 4.099, 'epoch': 3.0}
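
The training cell sets `warmup_ratio=0.1` and leaves the scheduler at its default. Assuming that default is the linear schedule, the learning rate ramps up from zero over the first 10% of optimizer steps and then decays linearly back to zero. The function below is a sketch of that shape only, not the library's code; `linear_schedule_lr` and the 1,000-step total are illustrative values.

```python
def linear_schedule_lr(step: int, total_steps: int,
                       warmup_steps: int, base_lr: float) -> float:
    """Learning rate at an optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical run of 1,000 updates with 10% warmup and the notebook's 2e-5 peak.
for step in (0, 50, 100, 550, 1000):
    lr = linear_schedule_lr(step, total_steps=1000, warmup_steps=100, base_lr=2e-5)
    print(step, f"{lr:.2e}")
```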


## 7. Save artifacts
The fine-tuned model and tokenizer are written to `config.output_dir`. Change that path to save directly to Google Drive, or archive the files and move them as in the cells below.

In [20]:
trainer.save_model(config.output_dir)
tokenizer.save_pretrained(config.output_dir)
print(f'Model saved to {config.output_dir}')

Model saved to /content/signal_classifier


In [23]:
!zip -r /content/signal_classifier.zip /content/signal_classifier/config.json /content/signal_classifier/model.safetensors /content/signal_classifier/sentencepiece.bpe.model /content/signal_classifier/special_tokens_map.json /content/signal_classifier/tokenizer.json /content/signal_classifier/tokenizer_config.json /content/signal_classifier/training_args.bin

  adding: content/signal_classifier/config.json (deflated 52%)
  adding: content/signal_classifier/model.safetensors (deflated 25%)
  adding: content/signal_classifier/sentencepiece.bpe.model (deflated 49%)
  adding: content/signal_classifier/special_tokens_map.json (deflated 52%)
  adding: content/signal_classifier/tokenizer.json (deflated 76%)
  adding: content/signal_classifier/tokenizer_config.json (deflated 76%)
  adding: content/signal_classifier/training_args.bin (deflated 53%)


In [24]:
!mv /content/signal_classifier.zip /content/drive/MyDrive/signal_classifier.zip
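
The same archive can be built from Python, which avoids the long shell command and stores the files without the `content/signal_classifier/` prefix. This is a sketch using only the standard library; the helper name `zip_selected_files` is illustrative, and the file list should match whatever `save_model` and `save_pretrained` actually wrote in your run.

```python
import zipfile
from pathlib import Path

def zip_selected_files(model_dir: str, archive_path: str, filenames: list) -> list:
    """Write the named files from model_dir into a zip, skipping any that are missing."""
    root = Path(model_dir)
    written = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name in filenames:
            path = root / name
            if path.is_file():
                # arcname keeps only the file name inside the archive.
                zf.write(path, arcname=name)
                written.append(name)
    return written

# Example call (paths taken from the cells above):
# zip_selected_files(
#     '/content/signal_classifier',
#     '/content/drive/MyDrive/signal_classifier.zip',
#     ['config.json', 'model.safetensors', 'sentencepiece.bpe.model',
#      'special_tokens_map.json', 'tokenizer.json', 'tokenizer_config.json',
#      'training_args.bin'],
# )
```

Listing the files explicitly, as the shell command does, keeps the per-epoch checkpoint folders that the `Trainer` writes to `output_dir` out of the archive.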