In [1]:
import torch
from transformers import pipeline

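This code imports the PyTorch library and the pipeline function from the Transformers library. Transformers is a popular library for natural language processing (NLP) tasks such as text classification, named entity recognition, and question answering.

pipeline provides a simple API for running NLP tasks with pretrained models. It lets you load a pretrained model and use it for a task without needing to understand the underlying model in depth.

In short, this code sets up the environment needed to run tasks with pretrained NLP models from the Transformers library.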

# Pipeline
Start by creating an instance of pipeline() and specifying a task you want to use it for. In this guide, you’ll use the pipeline() for **sentiment analysis** as an example:

The pipeline() instance automatically loads a suitable pretrained model for the chosen task and can then be applied to input text. This simplifies running NLP tasks because it avoids the manual steps of downloading and loading a model.

In [2]:
classifier = pipeline("sentiment-analysis")
# Create a sentiment-analysis pipeline and assign it to the variable classifier.
# pipeline("sentiment-analysis") downloads the default pretrained model and loads it as a sentiment classifier.
# The classifier takes a piece of text as input and returns a label (for example positive or negative) with a score.

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


When pipeline() is called without an explicit model name and revision, it falls back to a default model for the task. That is the reason for the message above: the pipeline defaulted to the pretrained model distilbert-base-uncased-finetuned-sst-2-english at revision af0f99b.
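
Following the warning's advice, a production setup would name the model and revision explicitly. The sketch below is one way to do that. The model name and revision hash are copied from the log message above, while the helper name build_classifier is hypothetical. The import sits inside the function so that defining the helper does not itself trigger a model download.

```python
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"  # revision reported in the log message


def build_classifier():
    """Create a sentiment-analysis pipeline pinned to an explicit model and revision."""
    # Imported lazily; the first call downloads and caches the model.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=MODEL_REVISION)
```

Calling build_classifier() should then behave like the classifier created above, without the warning.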

The pipeline() downloads and caches a default pretrained model and tokenizer for sentiment analysis. Now you can use the classifier on your target text:

Because pipeline() has already downloaded and cached the model and tokenizer, no manual download or loading step is needed before running sentiment analysis on your own text.

In [3]:
input = 'We are very happy to show you the 🤗 Transformers library.'
classifier(input)
# The first line assigns the text to the variable input (this name shadows Python's built-in input()).
# The second line passes the text to the classifier for sentiment analysis.

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

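classifier(input) passes the input text to the pretrained sentiment analysis model and returns a list with one Python dictionary per input. Each dictionary holds the predicted label and its confidence score.

Here the model predicts the POSITIVE label for the input text, with a very high score (about 0.9998).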




If you have more than one input, pass your inputs as a list to the pipeline() to return a list of dictionaries, one per input, each containing the sentiment analysis result for that input:

In [4]:
input = [
    "We are very happy to show you the 🤗 Transformers library.", 
    "We hope you don't hate it."]
results = classifier(input)
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
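
The second score (0.5309) is barely above one half, so the model is far from sure about that sentence. A minimal sketch of how one might flag such borderline predictions follows. The results list reuses the rounded scores printed above, and both the 0.6 threshold and the helper name flag_uncertain are illustrative assumptions rather than anything the library prescribes.

```python
def flag_uncertain(results, threshold=0.6):
    """Return the predictions whose confidence score falls below the threshold."""
    return [r for r in results if r["score"] < threshold]


# Rounded scores copied from the output above.
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.5309},
]

for r in flag_uncertain(results):
    print(f"low confidence: {r['label']} ({r['score']})")
```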


## Use another model and tokenizer in the pipeline
The pipeline() can accommodate any model from the Hub, making it easy to adapt the pipeline() for other use-cases. For example, if you’d like a model capable of handling French text, use the tags on the Hub to filter for an appropriate model. The top filtered result returns a multilingual BERT model finetuned for sentiment analysis you can use for French text.

Below we use AutoModelForSequenceClassification and AutoTokenizer to load the pretrained model and its associated tokenizer (more on an AutoClass in the next section):

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Specify the model and tokenizer in the pipeline(), and now you can apply the classifier on French text:

In [6]:
input = 'Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.'
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
classifier(input)

[{'label': '5 stars', 'score': 0.7272652387619019}]

If you can’t find a model for your use-case, you’ll need to finetune a pretrained model on your data.

# AutoClass
An AutoClass is a shortcut that automatically retrieves the architecture of a pretrained model from its name or path.

## AutoTokenizer
A tokenizer is responsible for preprocessing text into an array of numbers as inputs to a model. The most important thing to remember is you need to instantiate a tokenizer with the same model name to ensure you’re using the same tokenization rules a model was pretrained with.

Load a tokenizer with AutoTokenizer:

In [7]:
from transformers import AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

Pass your text to the tokenizer:

In [8]:
input_text = 'We are very happy to show you the 🤗 Transformers library.'
tokenizer(input_text)

{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

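The tokenizer returns a dictionary containing:

* **input_ids:** numerical representations of your tokens.
* **token_type_ids:** indicates which sequence each token belongs to when the input is a pair of sequences (all zeros here because there is only one sequence).
* **attention_mask:** indicates which tokens should be attended to.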
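
To make the dictionary less abstract, here is a toy sketch of what a tokenizer does conceptually. It is not the WordPiece algorithm the real tokenizer uses: the whitespace splitting and the small vocabulary with its IDs are made up for illustration. Only the special-token IDs (101 for [CLS], 102 for [SEP], 100 for [UNK]) follow the usual BERT vocabulary, which is presumably why the output above starts with 101, ends with 102, and contains 100 where the emoji sits.

```python
SPECIAL = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102}  # usual BERT special-token IDs
TOY_VOCAB = {"we": 1000, "are": 1001, "happy": 1002}  # hypothetical IDs, illustration only


def toy_encode(text):
    """Map whitespace-separated words to IDs and wrap them in [CLS] ... [SEP]."""
    tokens = text.lower().split()
    ids = [SPECIAL["[CLS]"]]
    # Words missing from the vocabulary fall back to the [UNK] ID.
    ids += [TOY_VOCAB.get(tok, SPECIAL["[UNK]"]) for tok in tokens]
    ids.append(SPECIAL["[SEP]"])
    # Nothing is padded here, so every position should be attended to.
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


print(toy_encode("We are happy 🤗"))
```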

A tokenizer can also accept a list of inputs, and **pad** and **truncate** the text to **return a batch with uniform length**.

In [9]:
raw_input = [
    "We are very happy to show you the 🤗 Transformers library.", 
    "We hope you don't hate it."]
pt_batch = tokenizer(
    raw_input,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
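
The padding and truncation steps can be sketched in plain Python. This is a simplified illustration on made-up token IDs, not the tokenizer's implementation: pad_and_truncate is a hypothetical helper, the pad ID of 0 is assumed from the usual BERT vocabulary, and unlike the real tokenizer this naive truncation does not preserve the trailing special token.

```python
PAD_ID = 0  # assumed ID of [PAD]


def pad_and_truncate(sequences, max_length):
    """Cut each sequence to max_length, then right-pad all to the longest remaining one."""
    truncated = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_and_truncate([[101, 7, 8, 9, 102], [101, 7, 102]], max_length=4)
print(batch)
```

Every row in the result has the same length, which is what allows the batch to be stacked into a single tensor.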

## AutoModel
HuggingFace Transformers provides a simple and unified way to load pretrained instances. This means you can load an AutoModel like you would load an AutoTokenizer. The only difference is selecting the correct AutoModel for the task. For text (or sequence) classification, you should load AutoModelForSequenceClassification:

In [10]:
from transformers import AutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)

> See the [task summary](https://huggingface.co/docs/transformers/task_summary) for tasks supported by an AutoModel class.

Now pass your preprocessed batch of inputs directly to the model. You just have to unpack the dictionary by adding **:

In [11]:
pt_outputs = pt_model(**pt_batch)
pt_outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-2.6222, -2.7745, -0.8967,  2.0137,  3.3064],
        [ 0.0064, -0.1258, -0.0503, -0.1655,  0.1329]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
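
If the ** syntax is unfamiliar, the following sketch shows what it does with an ordinary function. toy_forward is a hypothetical stand-in for the model call and only reports which keyword arguments it received.

```python
def toy_forward(input_ids, attention_mask=None, token_type_ids=None):
    """Stand-in for a model call; returns the names of the arguments that were passed."""
    received = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }
    return sorted(name for name, value in received.items() if value is not None)


toy_batch = {"input_ids": [[101, 102]], "attention_mask": [[1, 1]]}

# These two calls are equivalent: ** spreads the dictionary into keyword arguments.
assert toy_forward(**toy_batch) == toy_forward(input_ids=[[101, 102]], attention_mask=[[1, 1]])
print(toy_forward(**toy_batch))
```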

The model returns its raw scores in the logits attribute. All Transformers models return tensors before the final activation function (like softmax) because that function is often fused with the loss.

Then, we apply the softmax function to the logits to retrieve the probabilities:

In [12]:
from torch import nn

pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
pt_predictions

tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725],
        [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=<SoftmaxBackward0>)
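
To see what the softmax step does, here is a dependency-free sketch that recomputes the first row by hand. The logits are copied (rounded) from the model output above, so the result should agree with the first row of the tensor up to rounding. The helper is a plain-Python illustration, not how PyTorch implements it.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


first_row = [-2.6222, -2.7745, -0.8967, 2.0137, 3.3064]
probs = softmax(first_row)
print([round(p, 4) for p in probs])
print("predicted class index:", probs.index(max(probs)))
```

The largest probability falls on the last of the five classes, the same position as the largest logit, because softmax preserves the ordering of its inputs.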

## Save a model
Once your model is fine-tuned, you can save it with its tokenizer using PreTrainedModel.save_pretrained():

In [13]:
pt_save_directory = "./pt_save_pretrained"
tokenizer.save_pretrained(pt_save_directory)
pt_model.save_pretrained(pt_save_directory)

When you are ready to use the model again, reload it with PreTrainedModel.from_pretrained():

In [14]:
pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained")

# Custom model builds
You can modify the model’s configuration class to change how a model is built. The configuration specifies a model’s attributes, such as the number of hidden layers or attention heads.

You start from scratch when you initialize a model from a custom configuration class. The model attributes are randomly initialized, and you’ll need to train the model before you can use it to get meaningful results.

Start by importing AutoConfig, and then load the configuration of the pretrained model you want to modify. Within AutoConfig.from_pretrained(), you can specify the attribute you want to change, such as the number of attention heads.

In [15]:
from transformers import AutoConfig

my_config = AutoConfig.from_pretrained("distilbert-base-uncased", n_heads=12)
my_config

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.30.2",
  "vocab_size": 30522
}
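
When changing n_heads, keep in mind that multi-head attention splits the hidden size evenly across heads, so dim should be divisible by n_heads. The check below is a plain-Python sketch of that constraint using the dim value from the configuration above. The helper name head_size is hypothetical and is not part of the library.

```python
def head_size(dim, n_heads):
    """Return the per-head dimension, or raise if the heads do not divide dim evenly."""
    if dim % n_heads != 0:
        raise ValueError(f"dim={dim} is not divisible by n_heads={n_heads}")
    return dim // n_heads


print(head_size(768, 12))  # 64 dimensions per head

try:
    head_size(768, 10)
except ValueError as err:
    print("invalid configuration:", err)
```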

Create a model from your custom configuration with AutoModel.from_config().

In [16]:
from transformers import AutoModel

custom_model = AutoModel.from_config(my_config)
custom_model

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Li

Take a look at the [create a custom architecture guide](https://huggingface.co/docs/transformers/create_a_model) for more information about building custom configurations.

# Trainer: A PyTorch optimized training loop
All models are standard torch.nn.Module instances, so you can use them in any typical training loop. While you can write your own training loop, HuggingFace Transformers provides a Trainer class for PyTorch, which contains the basic training loop and adds support for features like **distributed training**, mixed precision, and more.

Depending on your task, you’ll typically pass the following parameters to Trainer:

1. A PreTrainedModel or a torch.nn.Module:

In [17]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
model

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.we

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

2. TrainingArguments contains the **training hyperparameters** you can change, like the learning rate, batch size, and the number of epochs to train for. The default values are used if you don’t specify any training arguments.

In [18]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./train_args",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
)
training_args

TrainingArguments(
_n_gpu=2,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=

3. A preprocessing class like a tokenizer, image processor, feature extractor, or processor:

In [19]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

4. Load a dataset:

In [24]:
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes")

Found cached dataset rotten_tomatoes (/home/zonghang/.cache/huggingface/datasets/rotten_tomatoes/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46)


  0%|          | 0/3 [00:00<?, ?it/s]

5. Create a function to tokenize the dataset, then apply it over the entire dataset with map.

In [25]:
def tokenize_dataset(dataset):
    return tokenizer(dataset["text"])

dataset = dataset.map(tokenize_dataset, batched=True)

Map:   0%|          | 0/8530 [00:00<?, ? examples/s]

Map:   0%|          | 0/1066 [00:00<?, ? examples/s]

Map:   0%|          | 0/1066 [00:00<?, ? examples/s]
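
With batched=True, map passes the function a dictionary of columns in which each value is a list covering many rows, rather than one example at a time. This is why tokenize_dataset can hand dataset["text"] straight to the tokenizer. The sketch below imitates that calling convention in plain Python. toy_map and count_words are hypothetical stand-ins, not the datasets implementation, and the sketch only handles functions that add new columns.

```python
def toy_map(columns, function, batch_size=2):
    """Apply function to slices of the columns and attach its output as new columns."""
    n_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            result.setdefault(name, []).extend(values)
    return result


def count_words(batch):
    # batch["text"] is a list of strings, one per row in the slice.
    return {"n_words": [len(text.split()) for text in batch["text"]]}


data = {"text": ["a fine film", "dull", "not bad at all"], "label": [1, 0, 1]}
print(toy_map(data, count_words))
```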

6. A DataCollatorWithPadding to create a batch of examples from your dataset.

In [26]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
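
The tokenize_dataset function above did not pad anything, so the tokenized examples have different lengths. DataCollatorWithPadding pads each batch to the length of its longest example at the moment the batch is formed. The sketch below shows that idea on made-up token IDs. toy_collate is a hypothetical helper, and the pad ID of 0 is assumed from the pad_token_id in the DistilBERT configuration printed earlier.

```python
def toy_collate(examples, pad_id=0):
    """Right-pad a list of token-ID lists to the longest one in this batch."""
    longest = max(len(ids) for ids in examples)
    return [ids + [pad_id] * (longest - len(ids)) for ids in examples]


dataset_ids = [[101, 5, 102], [101, 5, 6, 7, 8, 102], [101, 102], [101, 9, 102]]

# Collate in batches of two: each batch is only as wide as its own longest example.
batches = [toy_collate(dataset_ids[i:i + 2]) for i in range(0, len(dataset_ids), 2)]
for batch in batches:
    print(len(batch[0]), batch)
```

Padding per batch rather than to one global length keeps batches of short examples small, which saves computation.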

Now gather all these classes in Trainer. When you’re ready, call train() to start training.

In [27]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500  | 0.418         |
| 1000 | 0.2542        |




TrainOutput(global_step=1068, training_loss=0.3289355758424109, metrics={'train_runtime': 65.3118, 'train_samples_per_second': 261.208, 'train_steps_per_second': 16.352, 'total_flos': 214898859625128.0, 'train_loss': 0.3289355758424109, 'epoch': 2.0})
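
The global_step value of 1068 can be reproduced from numbers that appear earlier in this notebook. The sketch below is a back-of-the-envelope check, assuming the run used both GPUs reported as _n_gpu=2 and no gradient accumulation (gradient_accumulation_steps=1 in the printed arguments). The training split of rotten_tomatoes has 8530 examples.

```python
import math

train_examples = 8530        # size of the rotten_tomatoes train split
per_device_batch_size = 8    # per_device_train_batch_size
n_devices = 2                # _n_gpu in the TrainingArguments output (assumed all in use)
num_epochs = 2               # num_train_epochs

effective_batch_size = per_device_batch_size * n_devices
steps_per_epoch = math.ceil(train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs
print(effective_batch_size, steps_per_epoch, total_steps)  # 16 534 1068
```

On a single GPU the same settings would give an effective batch size of 8 and roughly twice as many steps.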

> For tasks - like translation or summarization - that use a sequence-to-sequence model, use the Seq2SeqTrainer and Seq2SeqTrainingArguments classes instead.