SetFit 是一个高效且无提示的框架，用于对 Sentence Transformers 进行少量微调。它用很少的标记数据实现了高精度 - 例如，在客户评论情感数据集上每类只有 8 个标记示例，SetFit 在 3000 个示例的完整训练集上与微调 RoBERTa Large 具有竞争力！

## 数据准备

让我们考虑一个带有少量标记训练数据的场景（例如 64 个句子）。我们将使用 ag_news 数据集来模拟这种场景。

In [1]:
from datasets import load_dataset
from setfit import sample_dataset

# Load a dataset from the Hugging Face Hub
dataset = load_dataset("ag_news")

# Create a sample few-shot dataset to train with
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=16)

# Dataset for evaluation
eval_dataset = dataset["test"]

print(train_dataset)
print(eval_dataset)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Dataset({
    features: ['text', 'label'],
    num_rows: 64
})
Dataset({
    features: ['text', 'label'],
    num_rows: 7600
})


## 基线模型

我们可以使用标准的 SetFit 训练方法来准备模型。

In [2]:
from setfit import SetFitModel, TrainingArguments, Trainer

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-MiniLM-L3-v2")

args = TrainingArguments(
    batch_size=64,
    num_epochs=5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()

metrics = trainer.evaluate()
print(metrics)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


Map:   0%|          | 0/64 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 3072
  Batch size = 64
  Num epochs = 5


Step,Training Loss
1,0.3912
50,0.265
100,0.1439
150,0.0828
200,0.0594


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

***** Running evaluation *****


{'accuracy': 0.8281578947368421}


该模型在我们的数据集上达到了 82.81%。考虑到训练数据量很少，这当然值得尊敬，但我们可以使用知识蒸馏来从我们的模型中获得更多性能。

## 无标记数据准备

除了标记的训练数据之外，我们可能还有大量未标记的训练数据（例如 500 个句子）。让我们准备一下：

In [3]:
# Create a dataset of unlabeled examples to perform knowledge distillation
unlabeled_train_dataset = dataset["train"].shuffle(seed=0).select(range(500))
unlabeled_train_dataset = unlabeled_train_dataset.remove_columns("label")

print(unlabeled_train_dataset)

Dataset({
    features: ['text'],
    num_rows: 500
})


## 教师模型

然后，我们将准备一个更大的经过训练的 SetFit 模型，它将充当我们较小的学生模型的老师。强大的 Sentence-transformers/paraphrase-mpnet-base-v2 Sentence Transformer 模型将用于初始化 SetFit 模型。

In [4]:
from setfit import SetFitModel

teacher_model = SetFitModel.from_pretrained("google-bert/bert-base-uncased")

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


我们需要首先在标记数据集上训练这个模型：

In [5]:
from setfit import TrainingArguments, Trainer

teacher_args = TrainingArguments(
    batch_size=16,
    num_epochs=2,
)

teacher_trainer = Trainer(
    model=teacher_model,
    args=teacher_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Train teacher model
teacher_trainer.train()
teacher_metrics = teacher_trainer.evaluate()
print(teacher_metrics)

Map:   0%|          | 0/64 [00:00<?, ? examples/s]

***** Running training *****
  Num unique pairs = 3072
  Batch size = 16
  Num epochs = 2


Step,Training Loss
1,0.137
50,0.1751
100,0.0229
150,0.0011
200,0.0006
250,0.0004
300,0.0003
350,0.0003


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

***** Running evaluation *****


{'accuracy': 0.863421052631579}


这个大型教师模型达到了 86.34%，对于这个小数据来说相当强大，并且明显强于我们较小（但更高效）模型的 82.81%。

## 知识蒸馏

可以使用 DistillationTrainer 将较强的 Teacher_model 的性能提炼为较小的模型。它接受教师和学生模型，以及未标记的数据集。

In [6]:
from setfit import DistillationTrainer

distillation_args = TrainingArguments(
    batch_size=16,
    max_steps=500,
)

distillation_trainer = DistillationTrainer(
    teacher_model=teacher_model,
    student_model=model,
    args=distillation_args,
    train_dataset=unlabeled_train_dataset,
    eval_dataset=eval_dataset,
)

# Train student with knowledge distillation
distillation_trainer.train()
distillation_metrics = distillation_trainer.evaluate()
print(distillation_metrics)

***** Running training *****
  Num unique pairs = 4001
  Batch size = 16
  Num epochs = 1


Step,Training Loss
1,0.5776
50,0.3278
100,0.0077
150,0.0019
200,0.0012
250,0.0009
300,0.0007
350,0.0006
400,0.0005
450,0.0005


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
***** Running evaluation *****


{'accuracy': 0.8346052631578947}


使用知识蒸馏，我们能够在几分钟的训练内将模型从 82.81% 提高到 83.46%。