In [1]:
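# Text classification is a common NLP task that assigns a label or class to a piece of text. Some of the largest companies run text classification in production for a wide range of practical applications.
# One of the most popular forms of text classification is sentiment analysis, which assigns a label such as 🙂 positive, 🙁 negative, or 😐 neutral to a text sequence.
# This section covers:
# 1. Fine-tuning DistilBERT on the IMDb dataset to determine whether a movie review is positive or negative.
# 2. Running inference with the fine-tuned model.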

In [2]:
# Install all required libraries
# !pip install transformers datasets evaluate accelerate

In [3]:
# 1. Load the IMDb dataset

from datasets import load_dataset

imdb = load_dataset("imdb")

imdb["test"][0]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as 

In [4]:
# The dataset has two fields:
# text: the text of the movie review.
# label: 0 for a negative review, 1 for a positive review.

In [5]:
# 2. Preprocessing

# Load the DistilBERT tokenizer to preprocess the text field:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

# Create a preprocessing function that tokenizes the text and truncates it so that it is no longer than DistilBERT's maximum input length (512 tokens)

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

# To apply the preprocessing function to the whole dataset, use the Datasets map() method. Setting batched=True speeds it up by processing multiple examples at once:

tokenizer_imdb = imdb.map(preprocess_function, batched=True)

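# Pad dynamically: padding each batch to the length of its longest sequence is more efficient than padding the whole dataset to the maximum length.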

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
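
The saving from dynamic padding can be illustrated with a small pure-Python sketch. It is not how DataCollatorWithPadding is implemented; `pad_batch`, `count_padding`, the toy token ids, and `PAD_ID = 0` are all assumptions made for illustration. Each batch is padded only to the length of its own longest sequence, and the number of padding tokens is compared with padding everything to a fixed length of 512.

```python
# Illustrative sketch only: pad a batch to its own longest sequence.
PAD_ID = 0        # assumed padding id for this toy example
MAX_LENGTH = 512  # same limit as max_length in preprocess_function


def pad_batch(batch, target_length=None, pad_id=PAD_ID):
    """Right-pad every sequence in `batch` to a common length.

    If target_length is None, the longest sequence in the batch sets the
    length (dynamic padding); otherwise the given fixed length is used.
    """
    if target_length is None:
        target_length = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (target_length - len(seq)) for seq in batch]


def count_padding(batch, padded):
    """Number of padding tokens that were added to the batch."""
    return sum(len(p) - len(s) for s, p in zip(batch, padded))


# Toy "tokenized" reviews of different lengths (the ids are made up).
batch = [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 102]]

dynamic = pad_batch(batch)                          # padded to length 6
fixed = pad_batch(batch, target_length=MAX_LENGTH)  # padded to length 512

print(count_padding(batch, dynamic))  # 5
print(count_padding(batch, fixed))    # 1523
```

In this toy batch, dynamic padding adds 5 padding tokens where fixed-length padding adds 1523, which is why the collator pads per batch instead of padding during map().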



In [6]:
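# 3. Evaluation

# Including a metric during training is often helpful for evaluating the model's performance. The Evaluate library can quickly load an evaluation method. For this task, load the accuracy metric: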

import evaluate

accuracy = evaluate.load("accuracy")

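# Create a function that computes classification accuracy: it receives the predictions (logits) and the reference labels, takes the argmax over the logits, and compares the result with the labels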
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
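
What compute_metrics calculates can also be written out by hand. The sketch below uses plain Python lists instead of NumPy and the Evaluate library; `argmax`, `accuracy_score`, and the logits and labels are made-up names and values for illustration, not part of either library.

```python
# Illustrative sketch: argmax over the logits, then the fraction of correct predictions.

def argmax(row):
    """Index of the largest value in a row (np.argmax(..., axis=1) does this per row)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_score(logits, labels):
    """Return the accuracy in the same {'accuracy': value} shape as the metric above."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four reviews: column 0 = negative, column 1 = positive.
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 1.7]]
labels = [0, 1, 1, 1]

print(accuracy_score(logits, labels))  # {'accuracy': 0.75}
```

The third review is predicted as negative although its label is positive, so three of the four predictions are correct.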


In [7]:
# 4. Training

# Create the mappings between label ids and label names

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

# Load the pretrained model

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

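# Define the training hyperparameters in TrainingArguments. The only required argument is output_dir, which specifies where to save the model.
# At the end of each epoch, Trainer evaluates accuracy and saves a training checkpoint.
# Note: newer transformers releases rename evaluation_strategy to eval_strategy; use whichever name your installed version accepts.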

training_args = TrainingArguments(
    output_dir="my_text_classification_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Pass the training arguments to Trainer together with the model, datasets, tokenizer, data collator, and compute_metrics function

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenizer_imdb["train"],
    eval_dataset=tokenizer_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
# Call train() to fine-tune the model
trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch  Training Loss  Validation Loss  Accuracy
1      0.2267         0.223391         0.9238
2      0.1433         0.222952         0.9326
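
With load_best_model_at_end=True and no metric_for_best_model set, Trainer compares checkpoints by evaluation loss, where lower is better. The sketch below applies that rule to the two epochs reported above; `results` and `best` are illustrative names, not Trainer internals.

```python
# Per-epoch results copied from the training log above.
results = [
    {"epoch": 1, "eval_loss": 0.223391, "accuracy": 0.9238},
    {"epoch": 2, "eval_loss": 0.222952, "accuracy": 0.9326},
]

# Assumed default rule: the checkpoint with the lowest evaluation loss is kept as the best one.
best = min(results, key=lambda row: row["eval_loss"])

print(best["epoch"])  # 2
```

Under that assumption, the epoch 2 checkpoint is the one reloaded at the end of training and pushed afterwards.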


TrainOutput(global_step=3126, training_loss=0.20425316163232063, metrics={'train_runtime': 19146.6107, 'train_samples_per_second': 2.611, 'train_steps_per_second': 0.163, 'total_flos': 6563283548690880.0, 'train_loss': 0.20425316163232063, 'epoch': 2.0})
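
The numbers in TrainOutput can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, and that the last partial batch of each epoch counts as one step; the variable names are illustrative.

```python
import math

train_examples = 25000      # size of the IMDb train split
batch_size = 16             # per_device_train_batch_size
epochs = 2                  # num_train_epochs
train_runtime = 19146.6107  # seconds, from TrainOutput

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = train_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)  # 1563 3126
print(round(samples_per_second, 3))  # 2.611
print(round(steps_per_second, 3))    # 0.163
```

These values match global_step, train_samples_per_second, and train_steps_per_second in the output above. The runtime of 19146.6 seconds is a little over five hours.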

In [8]:
# Push the model to the Hub; this assumes the machine is already logged in
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/b43646/my_text_classification_model/commit/0d653c6a573a0112fefc4eba2bab7d03c92c9ab1', commit_message='End of training', commit_description='', oid='0d653c6a573a0112fefc4eba2bab7d03c92c9ab1', pr_url=None, pr_revision=None, pr_num=None)

In [1]:
# Inference

text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="b43646/my_text_classification_model")
classifier(text)

[{'label': 'POSITIVE', 'score': 0.9967633485794067}]
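
For a two-class model like this one, the score returned by the pipeline is the softmax probability of the top class, and the label comes from id2label. The sketch below reproduces that post-processing step in plain Python. It is not the pipeline's own code; `softmax`, `to_prediction`, and the logits are hypothetical.

```python
import math

id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    """Pick the most probable class and report its label and probability."""
    probs = softmax(logits)
    best_id = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best_id], "score": probs[best_id]}


# Hypothetical logits for one review (not taken from the model above).
prediction = to_prediction([-2.5, 3.0])
print(prediction["label"], round(prediction["score"], 4))  # POSITIVE 0.9959
```

A large gap between the two logits gives a score close to 1, as in the pipeline output above.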