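## Text classification task

We fine-tune a pre-trained BERT model, using sentiment analysis on the IMDB dataset as the example task. The IMDB dataset consists of movie reviews paired with sentiment labels (positive/negative).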

In [2]:
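! pip install dill==0.3.5
! pip install nlp
# Note: this pin fails to build here (see the log below), so the later cells
# run on whatever transformers version the environment already provides.
! pip install Transformers==3.5.1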

Collecting Transformers==3.5.1
  Using cached transformers-3.5.1-py3-none-any.whl.metadata (32 kB)
Collecting tokenizers==0.9.3 (from Transformers==3.5.1)
  Using cached tokenizers-0.9.3.tar.gz (172 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting sentencepiece==0.1.91 (from Transformers==3.5.1)
  Using cached sentencepiece-0.1.91.tar.gz (500 kB)
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.
In [3]:
from transformers import BertForSequenceClassification, BertTokenizerFast, Trainer, TrainingArguments
from nlp import load_dataset
import torch
import numpy as np

### Preparation
Download and load the dataset using the `nlp` library.

In [4]:
!gdown https://drive.google.com/uc?id=11_M4ootuT7I1G0RlihcC0cA3Elqotlc-
dataset = load_dataset('csv', data_files='./imdbs.csv', split='train')

Downloading...
From: https://drive.google.com/uc?id=11_M4ootuT7I1G0RlihcC0cA3Elqotlc-
To: /content/imdbs.csv
100% 132k/132k [00:00<00:00, 70.5MB/s]




In [5]:
type(dataset)

Split the dataset into a training set and a test set.

In [6]:
dataset = dataset.train_test_split(test_size=0.3)
print(dataset)

{'train': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: 70), 'test': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: 30)}
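As a rough illustration of what a 70/30 split does, the sketch below shuffles row indices and cuts them into two parts. It is not the library's implementation; the helper name `train_test_split_indices` and the fixed `seed` are assumptions made for this example.

```python
import random

def train_test_split_indices(num_rows, test_size, seed=0):
    """Shuffle row indices and split them into a train part and a test part."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # fixed seed (an assumption) so the split is repeatable
    num_test = round(num_rows * test_size)
    return indices[num_test:], indices[:num_test]

# 100 rows with test_size=0.3, matching the row counts printed above
train_idx, test_idx = train_test_split_indices(100, test_size=0.3)
print(len(train_idx), len(test_idx))  # 70 30
```

Every row lands in exactly one of the two parts, which is what keeps the test set unseen during training.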


In [7]:
train_set = dataset['train']
test_set = dataset['test']

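Because this is a sequence classification task, we use the `BertForSequenceClassification` class.

Compared with `BertTokenizer`, `BertTokenizerFast` is backed by the Rust `tokenizers` library, so it is noticeably faster on batches of text and can also return the character offsets of each token.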

In [8]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')  # tokenizer

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


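### Preprocessing the dataset
The tokenizer does the processing automatically: it splits the text into tokens, adds the special tokens, and converts everything to IDs.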

In [9]:
tokenizer('I love Paris')

{'input_ids': [101, 1045, 2293, 3000, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
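To make the three returned fields concrete, here is a toy re-creation of the output above. The vocabulary holds only the five IDs visible in that output, and the token-to-ID pairing is inferred from their order. The real tokenizer uses WordPiece over a much larger vocabulary, so `toy_vocab` and `toy_encode` are illustrative names, not library API.

```python
# Toy vocabulary: only the IDs visible in the output above (pairing inferred from order).
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "i": 1045, "love": 2293, "paris": 3000}

def toy_encode(text):
    """Lowercase, split on whitespace, wrap in [CLS]/[SEP], and look up IDs."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    input_ids = [toy_vocab[token] for token in tokens]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single sentence, so every position is segment 0
        "attention_mask": [1] * len(input_ids),  # no padding, so every position is a real token
    }

encoded = toy_encode("I love Paris")
print(encoded)
```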

In [10]:
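# padding=True pads every sequence to the longest one in the batch;
# max_length has no effect here because truncation is not enabled.
tokenizer(['I love Paris', 'birds fly', 'snow fall'], padding=True, max_length=5)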



{'input_ids': [[101, 1045, 2293, 3000, 102], [101, 5055, 4875, 102, 0], [101, 4586, 2991, 102, 0]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0], [1, 1, 1, 1, 0]]}
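The padding step can be sketched in plain Python. The function below pads each sequence with the ID 0 up to the longest sequence in the batch and builds the matching attention mask. `pad_batch` is a hypothetical helper written for this example; the unpadded IDs are copied from the output above.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and mark real tokens with 1, padding with 0."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        num_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * num_pad)
        attention_mask.append([1] * len(ids) + [0] * num_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 1045, 2293, 3000, 102],  # 'I love Paris'
    [101, 5055, 4875, 102],        # 'birds fly'
    [101, 4586, 2991, 102],        # 'snow fall'
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions.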

In [11]:
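def preprocess(data):
    # Tokenize the review text.
    # With truncation=True, any text longer than the model's maximum input length
    # (set by the model configuration; 512 tokens for bert-base-uncased) is cut
    # down so that it fits within what the model can process.
    return tokenizer(data['text'], padding=True, truncation=True)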

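Apply the `preprocess` function to both the training set and the test set.

The `map` method applies the given function to each example, or to each batch of examples, and returns a new dataset. Here it applies `preprocess` to the whole of `train_set`.

`batch_size=len(train_set)` processes all the data in a single batch rather than in smaller chunks. Because `padding=True` pads only to the longest sequence within a batch, a single batch ensures that every row ends up with the same length.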

In [12]:
train_set = train_set.map(preprocess, batched=True, batch_size=len(train_set))
test_set = test_set.map(preprocess, batched=True, batch_size=len(test_set))
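The effect of the batch size on padded lengths can be checked with a small sketch. Given hypothetical token counts, it reports the width each row would have if every batch were padded to its own longest row. `padded_widths` and the sample lengths are made up for this illustration.

```python
def padded_widths(lengths, batch_size):
    """Width of every row after each batch is padded to its own longest row."""
    widths = []
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        widths.extend([max(chunk)] * len(chunk))
    return widths

token_lengths = [5, 4, 9, 3, 7, 6]  # hypothetical token counts per review

small_batches = padded_widths(token_lengths, batch_size=2)
single_batch = padded_widths(token_lengths, batch_size=len(token_lengths))
print(small_batches)  # [5, 5, 9, 9, 7, 7]
print(single_batch)   # [9, 9, 9, 9, 9, 9]
```

With small batches the stored rows have mixed widths, which becomes a problem once rows from different batches have to be stacked into one tensor. A common alternative, not used in this notebook, is to skip padding in `map` and let a padding data collator pad each training batch instead.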

Use the `set_format` function to select the columns we need and the format they are returned in; here they are converted to PyTorch tensors.

In [13]:
train_set.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_set.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

### Training the model

In [14]:
batch_size = 8
epochs = 2

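`warmup_steps` sets the length of the learning-rate warmup: over the first 500 training steps, the learning rate rises linearly from a small value to the configured learning rate.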

In [15]:
warmup_steps = 500
weight_decay = 0.01
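It is worth checking how this warmup interacts with the size of this run. The sketch below reproduces a linear warmup followed by a linear decay, which is the shape of the Trainer's default schedule. It assumes a base learning rate of 5e-5 (the `TrainingArguments` default, since none is set here) and a single device. The function name `linear_schedule_lr` is made up for this example.

```python
import math

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

steps_per_epoch = math.ceil(70 / 8)  # 70 training rows, batch size 8 -> 9 steps
total_steps = steps_per_epoch * 2    # 2 epochs -> 18 steps
base_lr = 5e-5                       # assumption: the default learning rate

# Upper bound on the learning rate reached during this run
peak_lr = linear_schedule_lr(total_steps, base_lr, warmup_steps=500, total_steps=total_steps)
print(total_steps, peak_lr)
```

Under these assumptions the run has only 18 optimizer steps, so the warmup never finishes and the learning rate stays below about 1.8e-6, under 4% of the base value. For a dataset this small, a much smaller `warmup_steps` would let the model train at the intended learning rate.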

In [18]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_steps=warmup_steps,
    weight_decay=weight_decay,
    eval_strategy="steps",
    logging_dir='./logs',
)

In [19]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_set,
    eval_dataset=test_set
)

In [20]:
trainer.train()

Step,Training Loss,Validation Loss


TrainOutput(global_step=18, training_loss=0.716180165608724, metrics={'train_runtime': 935.6297, 'train_samples_per_second': 0.15, 'train_steps_per_second': 0.019, 'total_flos': 36835547750400.0, 'train_loss': 0.716180165608724, 'epoch': 2.0})

In [21]:
trainer.evaluate()

{'eval_loss': 0.6674126982688904,
 'eval_runtime': 55.1552,
 'eval_samples_per_second': 0.544,
 'eval_steps_per_second': 0.073,
 'epoch': 2.0}
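The evaluation output above contains only the loss, because no metric function was given to the `Trainer`. A minimal sketch of an accuracy metric follows. It assumes the function receives a `(logits, labels)` pair, the shape the `Trainer` hands to its `compute_metrics` argument, and it is exercised here with made-up numbers rather than with the model.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); accepts nested lists or NumPy arrays."""
    logits, labels = eval_pred
    # argmax over the class dimension, written without NumPy
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(pred == int(label)) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

# Hypothetical logits and labels, only to exercise the function
demo_metrics = compute_metrics(([[0.2, 1.1], [0.9, -0.3], [0.1, 0.4]], [1, 0, 0]))
print(demo_metrics)
```

Passing such a function as `compute_metrics=compute_metrics` when constructing the `Trainer` should add an accuracy entry to the dictionary returned by `trainer.evaluate()`.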

### Inference


In [23]:
model = BertForSequenceClassification.from_pretrained('./results/checkpoint-18')

In [29]:
# Suppose we have some new reviews to classify by sentiment
new_reviews = ["I loved this movie, it was amazing!",
               "If you came here, it's because you've already seen this film and were curious what others had to say about it."]

# 1. Preprocess the new reviews with the same tokenizer used for training
inputs = tokenizer(new_reviews, padding=True, truncation=True, return_tensors="pt")

# 2. Pass the inputs to the model for inference
with torch.no_grad():  # no gradients are needed at inference time
    outputs = model(**inputs)

# 3. Get the logits (unnormalized classification scores) from the model output
logits = outputs.logits

# 4. Use argmax to get the predicted class
predictions = torch.argmax(logits, dim=-1)

# 5. Map the predicted label IDs back to sentiment names
# Assume 0 means "Negative" and 1 means "Positive"
label_map = {0: "Negative", 1: "Positive"}

# Decode the predictions
predicted_labels = [label_map[pred.item()] for pred in predictions]

# 6. Print the predictions
for review, label in zip(new_reviews, predicted_labels):
    print(f"Review: {review} -> Sentiment: {label}")

Review: I loved this movie, it was amazing! -> Sentiment: Positive
Review: If you came here, it's because you've already seen this film and were curious what others had to say about it. -> Sentiment: Positive
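`argmax` reports only which class scored highest, not how confident the model is. Applying softmax to the logits gives class probabilities, which makes a near coin-flip prediction visible. The sketch below uses hypothetical logits, not values produced by the model above.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

label_map = {0: "Negative", 1: "Positive"}
example_logits = [[-0.12, 0.31], [0.05, 0.09]]  # hypothetical logits for two reviews

results = []
for row in example_logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=lambda i: probs[i])
    results.append((label_map[best], probs))
    print(label_map[best], [round(p, 3) for p in probs])
```

With the real tensors, `torch.softmax(logits, dim=-1)` performs the same computation. The evaluation loss of about 0.667 reported earlier is close to ln 2 ≈ 0.693, the loss of a binary classifier that guesses uniformly, so the two "Positive" labels above should be read with caution.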
