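# Background
> * Pre-trained models and language models are concepts seen from two different angles. A pre-trained model generally means a model that has already been trained on some task. It brings prior knowledge into the model's initialization (a better starting point for the parameters), which speeds up training on a downstream task or, put differently, constrains what the model learns there. A language model is about giving the model prior knowledge of human language. The two main objectives today are masked language modeling (the human ability to infer a missing piece of text from its surrounding context) and causal language modeling (the human ability to predict what comes next from what came before). By imitating part of human language understanding, these objectives push the model's understanding toward a human's and constrain its parameter space toward the range of human thinking, so that the model thinks more like a person. (There are presumably other human language abilities a model could imitate, so what has been modeled so far is probably only a half-finished picture.)
> * So for a downstream task, starting from a language model that already knows something about the downstream data, or is constrained toward it, gives a better final result (transfer learning). If the data a pre-trained model has seen differs a lot from your task data, the pre-trained model should be fine-tuned first. For example, if your dataset contains legal contracts or scientific articles, a vanilla Transformer model like BERT will typically treat the domain-specific words in your corpus as rare tokens, and the resulting performance may be less than satisfactory. After fine-tuning with the masked or next-token prediction objective, the model should reflect the characteristics of the target data. This is called domain adaptation. It usually only needs to be done once. Its purpose is only to familiarize the model with your data, so there is no need to push the training loss to its limit; close enough is fine.
> * A language model reflects how humans themselves understand language, so it only needs raw text. Task information is added later through the downstream task.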


In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '3'
os.environ["WANDB_MODE"] = "offline"
import transformers

transformers.logging.set_verbosity_info()



# Prepare the data

In [2]:
from datasets import load_dataset

imdb_dataset = load_dataset(path="christykoh/imdb_pt",split='test',cache_dir='./cache')

Using custom data configuration christykoh--imdb_pt-d17ce256d2a40640
Found cached dataset parquet (/home/zyl/disk/algorithms_ai/algorithms_ai/learning/cache/christykoh___parquet/christykoh--imdb_pt-d17ce256d2a40640/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


In [3]:
train_dataset,validation_dataset = imdb_dataset.select(range(1000)).train_test_split(test_size=0.1).values()

# Metrics
> * Perplexity (PPL)
> * Cross-entropy
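
The two metrics are directly related: perplexity is the exponential of the mean cross-entropy (measured in nats). The sketch below uses made-up probabilities that a hypothetical model assigns to the correct tokens; the helper names `cross_entropy` and `perplexity` are illustrative and not part of any library.

```python
import math


def cross_entropy(probs_of_correct_tokens):
    """Mean negative log-likelihood (in nats) of the correct tokens."""
    total = sum(math.log(p) for p in probs_of_correct_tokens)
    return -total / len(probs_of_correct_tokens)


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(mean_cross_entropy)


# Made-up probabilities assigned to the correct token at four positions
probs = [0.5, 0.25, 0.125, 0.5]
loss = cross_entropy(probs)
ppl = perplexity(loss)
print(f"cross-entropy: {loss:.4f}, perplexity: {ppl:.4f}")

# A model that guesses uniformly over V tokens has perplexity V
vocab_size = 30522
uniform_ppl = perplexity(cross_entropy([1 / vocab_size] * 10))
print(f"uniform-guess perplexity: {uniform_ppl:.1f}")
```

One way to read perplexity: it is the size of the vocabulary a uniform guesser would need in order to be as uncertain as the model.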

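# Masked language modeling
> * A masked language model hides some of the tokens in a text (15%) and is trained to predict the hidden content.
> * The main architecture is BERT-style.
> * In practice this is per-token prediction, where each token's output has the size of the vocabulary. Classical PPL is hard to apply here, so Hugging Face uses the cross-entropy loss in its place.
> * The fine-tuned language model adapts to the new dataset and should largely retain what it learned during pre-training.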
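
A minimal sketch of the masking step on a list of string tokens. It is a simplification: every selected token is replaced by `[MASK]`, whereas `DataCollatorForLanguageModeling` (used later in this notebook) replaces 80% of the selected tokens with the mask token, swaps 10% for random tokens, and leaves 10% unchanged. The function name `mask_tokens` is hypothetical. The label `-100` is the value that PyTorch's cross-entropy loss ignores by default, so only masked positions contribute to the loss.

```python
import random

MASK_TOKEN = "[MASK]"
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_tokens(tokens, mask_probability=0.15, rng=None):
    """Mask each token independently with the given probability.

    Returns (inputs, labels): labels keep the original token at masked
    positions and IGNORE_INDEX everywhere else.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_probability:
            inputs.append(MASK_TOKEN)
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(IGNORE_INDEX)
    return inputs, labels


tokens = "this movie was a great adventure from start to finish".split()
inputs, labels = mask_tokens(tokens, mask_probability=0.15, rng=random.Random(0))
print(inputs)
print(labels)
```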

## Load the model and tokenizer

In [19]:
from transformers import DistilBertForMaskedLM
model_checkpoint = "distilbert-base-uncased"
model = DistilBertForMaskedLM.from_pretrained(model_checkpoint)

loading configuration file config.json from cache at /home/zyl/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/config.json
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading weights file pytorch_model.bin from cache at /home/zyl/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForMaskedLM.

All the weights of DistilBertForMaskedLM w

In [20]:
model

DistilBertForMaskedLM(
  (activation): GELUActivation()
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inp

In [5]:
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained(model_checkpoint)

loading file vocab.txt from cache at /home/zyl/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /home/zyl/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/tokenizer_config.json
loading configuration file config.json from cache at /home/zyl/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6

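##  Convert the data into a format the model accepts
> * The most common approach treats each text as one sample and uses padding and truncation.
> * Hugging Face recommends another approach: tokenize every text first, concatenate the results into one long sequence, split that sequence into fixed-size chunks that become the samples, and either drop or pad the leftover tail.
> * The chunk size should not be too small; check the data when choosing it.
> * The leftover part can be kept, padded, or dropped. This can be implemented with the tokenizer option `return_overflowing_tokens=True`.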
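
The concatenate-then-chunk approach can be sketched on plain lists of token ids, with no tokenizer involved. The ids below are made up, and `group_into_chunks` is a hypothetical helper that returns the leftover tail separately so the caller can decide whether to keep, pad, or drop it.

```python
def group_into_chunks(tokenized_texts, chunk_size):
    """Concatenate token-id lists and split them into fixed-size chunks.

    Returns (chunks, leftover), where leftover holds the tail that is
    shorter than chunk_size.
    """
    concatenated = [tok for text in tokenized_texts for tok in text]
    usable_length = (len(concatenated) // chunk_size) * chunk_size
    chunks = [
        concatenated[i: i + chunk_size]
        for i in range(0, usable_length, chunk_size)
    ]
    leftover = concatenated[usable_length:]
    return chunks, leftover


# Three "tokenized" texts of different lengths (made-up ids)
texts = [[101, 7, 8, 102], [101, 9, 102], [101, 10, 11, 12, 102]]
chunks, leftover = group_into_chunks(texts, chunk_size=5)
print(chunks)
print(leftover)
```

Note that a chunk can span the boundary between two texts, which is why the chunk size should be chosen with the typical text length in mind.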

In [6]:
def process_data_for_language_model(examples, tokenizer, use_chunk=True, chunk_size=None, text_column='text', create_new_labels=True):
    if use_chunk:
        # Chunking changes the number of rows, so the map() call must pass
        # remove_columns with the existing column names.
        # Step 1: tokenize every text without truncation
        result = tokenizer(examples[text_column])
        if tokenizer.is_fast:
            # word_ids are only available from fast tokenizers
            result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]

        # Step 2: concatenate, then split into chunks
        if not chunk_size:
            chunk_size = tokenizer.model_max_length
        concatenated_examples = {k: sum(result[k], []) for k in result.keys()}  # Concatenate all texts
        total_length = len(concatenated_examples[list(result.keys())[0]])  # Compute length of concatenated texts
        total_length = (total_length // chunk_size) * chunk_size  # We drop the last chunk if it's smaller than chunk_size
        result = {
            k: [t[i: i + chunk_size] for i in range(0, total_length, chunk_size)]
            for k, t in concatenated_examples.items()
        }  # Split by chunks of chunk_size

        if create_new_labels:
            result["labels"] = result["input_ids"].copy()  # Create a new labels column
    else:
        # One sample per text, truncated to chunk_size
        # (max_length=None falls back to the tokenizer's model_max_length)
        result = tokenizer(examples[text_column], truncation=True, max_length=chunk_size)
    return result



In [7]:

train_dataset = train_dataset.map(process_data_for_language_model,
                                       fn_kwargs={"tokenizer": tokenizer, "chunk_size": 154,'use_chunk':True,'text_column':'text','create_new_labels':True},
                                       batched=True, remove_columns=train_dataset.column_names)

validation_dataset = validation_dataset.map(process_data_for_language_model,
                                       fn_kwargs={"tokenizer": tokenizer, "chunk_size": 154,'use_chunk':True,'text_column':'text','create_new_labels':True},
                                       batched=True, remove_columns=validation_dataset.column_names)

Token indices sequence length is longer than the specified maximum sequence length for this model (567 > 512). Running this sequence through the model will result in indexing errors


In [11]:
train_dataset[2]

{'input_ids': [7367,
  2226,
  16095,
  24110,
  3527,
  7367,
  4372,
  8663,
  6494,
  2213,
  6846,
  3672,
  2063,
  2053,
  3309,
  1025,
  3449,
  2050,
  1041,
  1037,
  10026,
  2063,
  1041,
  17540,
  2890,
  8529,
  3595,
  16843,
  1013,
  6187,
  2229,
  16137,
  10841,
  25226,
  2015,
  10861,
  3449,
  2063,
  2087,
  2527,
  10861,
  3449,
  2063,
  1041,
  14736,
  2015,
  2188,
  2213,
  2079,
  10861,
  10514,
  2050,
  10026,
  2063,
  1010,
  14163,
  9956,
  11498,
  1051,
  10975,
  10936,
  2121,
  3972,
  2050,
  1012,
  28681,
  2891,
  9808,
  4958,
  19565,
  20617,
  2015,
  23310,
  3286,
  1037,
  14163,
  6590,
  15287,
  2480,
  1041,
  3348,
  2080,
  1012,
  8823,
  9808,
  3595,
  16843,
  2015,
  11265,
  17811,
  7509,
  10938,
  28118,
  24110,
  3527,
  6366,
  2213,
  7367,
  2271,
  1051,
  10841,
  10483,
  1041,
  14017,
  15464,
  9808,
  9298,
  18349,
  2015,
  1012,
  1037,
  9353,
  7113,
  4424,
  1041,
  12990,
  2063,
  1010,
  16137

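## Batch the data and apply masking
> * Randomly masking the evaluation set makes the evaluation result fluctuate from run to run. This is acceptable: the language model only serves as a pre-trained model, so it does not need top accuracy, only a reasonable value. It is enough to see the loss level off.
> * Is whole-word masking better than masking individual tokens? (Open question.)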

In [None]:
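# Whole-word masking
# If you use the whole-word masking collator, also set remove_unused_columns=False so that the word_ids column is not dropped during training
# Note: word_ids only exist when the data was tokenized with a fast tokenizer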
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1  # the first word gets index 0, matching the indices from np.where(mask)
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)



In [8]:
# Training: random masking
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [None]:
# Fixed masking: mask the evaluation set once up front; this path needs explicit DataLoaders
def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}

# word_ids is only present when a fast tokenizer produced it; the collator cannot batch it
if "word_ids" in train_dataset.column_names:
    train_dataset = train_dataset.remove_columns(["word_ids"])
    validation_dataset = validation_dataset.remove_columns(["word_ids"])
eval_dataset = validation_dataset.map(
    insert_random_mask,
    batched=True,
    remove_columns=validation_dataset.column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
# Training batches are still masked randomly on the fly
train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
# Evaluation batches are already masked, so the default collator is enough
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

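## Evaluate, train, and evaluate again
> * Because the masking is random, the results fluctuate between runs; this is normal.
> * The evaluation after training uses the trained model, not the original pre-trained one: training updates `self.model` inside the trainer.
> * Masked LM is really token classification over the vocabulary, so the loss is the cross-entropy loss, and the exponential of that loss is reported as the perplexity.
> * fp16 can be used to speed up training; it works and has little effect on the results.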

In [9]:
from transformers import Trainer
from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(train_dataset) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=False,
    fp16=True,
    logging_steps=logging_steps,
    save_strategy='epoch'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using cuda_amp half precision backend


In [10]:
trainer.evaluate()
import math
print(f'Perplexity:{math.exp(trainer.evaluate()[ "eval_loss"])}')

***** Running Evaluation *****
  Num examples = 281
  Batch size = 64


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


***** Running Evaluation *****
  Num examples = 281
  Batch size = 64


Perplexity:138.00671736557376


In [11]:
trainer.train()

***** Running training *****
  Num examples = 2604
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 123
  Number of trainable parameters = 66985530


Epoch,Training Loss,Validation Loss
1,4.2101,3.760866
2,3.7134,3.453746
3,3.5264,3.355043


***** Running Evaluation *****
  Num examples = 281
  Batch size = 64
Saving model checkpoint to distilbert-base-uncased-finetuned-imdb/checkpoint-41
Configuration saved in distilbert-base-uncased-finetuned-imdb/checkpoint-41/config.json
Model weights saved in distilbert-base-uncased-finetuned-imdb/checkpoint-41/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-finetuned-imdb/checkpoint-41/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-finetuned-imdb/checkpoint-41/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 281
  Batch size = 64
Saving model checkpoint to distilbert-base-uncased-finetuned-imdb/checkpoint-82
Configuration saved in distilbert-base-uncased-finetuned-imdb/checkpoint-82/config.json
Model weights saved in distilbert-base-uncased-finetuned-imdb/checkpoint-82/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-finetuned-imdb/checkpoint-82/tokenizer_config.json
Special tokens fil

TrainOutput(global_step=123, training_loss=3.8075278600056968, metrics={'train_runtime': 43.8907, 'train_samples_per_second': 177.988, 'train_steps_per_second': 2.802, 'total_flos': 311479362732384.0, 'train_loss': 3.8075278600056968, 'epoch': 3.0})

In [12]:
trainer.evaluate()
print(f'Perplexity:{math.exp(trainer.evaluate()[ "eval_loss"])}')

***** Running Evaluation *****
  Num examples = 281
  Batch size = 64


***** Running Evaluation *****
  Num examples = 281
  Batch size = 64


Perplexity:29.385910858779788
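
The figures logged in this run illustrate two of the notes above: perplexity is just the exponential of the validation loss, and a separate `trainer.evaluate()` call draws a fresh random mask, so its result does not exactly match the value logged during training. The sketch below only re-uses numbers printed in this notebook; it does not re-run the model.

```python
import math

# Validation losses logged per epoch by the Trainer run above
validation_losses = {1: 3.760866, 2: 3.453746, 3: 3.355043}
perplexities = {epoch: math.exp(loss) for epoch, loss in validation_losses.items()}
for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity {ppl:.2f}")

# Perplexity printed by the separate trainer.evaluate() call after training
reported_after_training = 29.385910858779788
implied_loss = math.log(reported_after_training)
print(f"loss implied by the reported perplexity: {implied_loss:.4f}")
print(f"gap to the epoch-3 validation loss: {implied_loss - validation_losses[3]:.4f}")
```

The gap between the two losses is small compared with the drop from the pre-training perplexity of about 138, which is consistent with treating the fluctuation as harmless.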


## Speed up with Accelerate

In [None]:
from accelerate import Accelerator
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save the model and tokenizer locally
    # (the Hub upload is left out: no `repo` object is created in this notebook)
    output_dir = f"{model_name}-finetuned-imdb-accelerate"  # hypothetical directory name
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)

## Usage

In [13]:
from transformers import pipeline

checkpoint = 'distilbert-base-uncased'
# checkpoint = f"{checkpoint.split("/")[-1]}-finetuned-imdb"

mask_filler = pipeline(
    "fill-mask", model=checkpoint
)
text = "This is a great [MASK]."
preds = mask_filler(text)
print(preds)

loading configuration file config.json from cache at /home/zyl/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading configuration file config.json from cache at /home/zyl/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "ac

[{'score': 0.036511871963739395, 'token': 3066, 'token_str': 'deal', 'sequence': 'this is a great deal.'}, {'score': 0.02395874261856079, 'token': 3112, 'token_str': 'success', 'sequence': 'this is a great success.'}, {'score': 0.023744728416204453, 'token': 6172, 'token_str': 'adventure', 'sequence': 'this is a great adventure.'}, {'score': 0.01608501560986042, 'token': 2801, 'token_str': 'idea', 'sequence': 'this is a great idea.'}, {'score': 0.0108775170519948, 'token': 8658, 'token_str': 'feat', 'sequence': 'this is a great feat.'}]


In [15]:
checkpoint = 'distilbert-base-uncased'
checkpoint = f"{checkpoint.split('/')[-1]}-finetuned-imdb/checkpoint-123"

mask_filler = pipeline(
    "fill-mask", model=checkpoint
)
text = "This is a great [MASK]."
preds = mask_filler(text)
print(preds)

loading configuration file distilbert-base-uncased-finetuned-imdb/checkpoint-123/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-imdb/checkpoint-123",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading configuration file distilbert-base-uncased-finetuned-imdb/checkpoint-123/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-imdb/checkpoint-123",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dr

[{'score': 0.05909404158592224, 'token': 3066, 'token_str': 'deal', 'sequence': 'this is a great deal.'}, {'score': 0.033409856259822845, 'token': 6172, 'token_str': 'adventure', 'sequence': 'this is a great adventure.'}, {'score': 0.032685987651348114, 'token': 2801, 'token_str': 'idea', 'sequence': 'this is a great idea.'}, {'score': 0.016976814717054367, 'token': 3112, 'token_str': 'success', 'sequence': 'this is a great success.'}, {'score': 0.015000013634562492, 'token': 9467, 'token_str': 'shame', 'sequence': 'this is a great shame.'}]


## Apply the language model to a downstream task

In [17]:
checkpoint = 'distilbert-base-uncased'
checkpoint = f"{checkpoint.split('/')[-1]}-finetuned-imdb/checkpoint-123"
from transformers import DistilBertForSequenceClassification
model = DistilBertForSequenceClassification.from_pretrained(
    checkpoint, 
    cache_dir="/large_files/5T/huggingface_cache/model"
)

loading configuration file distilbert-base-uncased-finetuned-imdb/checkpoint-123/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading weights file distilbert-base-uncased-finetuned-imdb/checkpoint-123/pytorch_model.bin
Some weights of the model checkpoint at distilbert-base-uncased-finetuned-imdb/checkpoint-123 were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_nor

In [18]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [6]:
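import torch
# Note: this cell needs a masked-LM head, so `model` must be the model loaded earlier with
# DistilBertForMaskedLM.from_pretrained. The DistilBertForSequenceClassification loaded above
# returns class logits instead, so the [MASK] lookup below would fail with it.
text = "This is a great [MASK]."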
inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'


In [None]:
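# Freeze the encoder layers (the DistilBERT body is exposed as `model.distilbert`, not `model.bert`)
for param in model.distilbert.parameters():
    param.requires_grad = False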
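
To see what freezing does, here is a small self-contained sketch on a toy PyTorch module instead of the real DistilBERT model. `TinyClassifier`, its `encoder`, and its `classifier` are hypothetical stand-ins for the pre-trained body and the new task head. After freezing, only the head's parameters remain trainable and only they receive gradients.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 4)     # stand-in for the pre-trained encoder
        self.classifier = nn.Linear(4, 2)  # stand-in for the new task head

    def forward(self, x):
        return self.classifier(torch.relu(self.encoder(x)))


toy_model = TinyClassifier()

# Freeze the encoder: its parameters stop receiving gradients
for param in toy_model.encoder.parameters():
    param.requires_grad = False

trainable = [name for name, p in toy_model.named_parameters() if p.requires_grad]
print(trainable)

# Hand only the trainable parameters to the optimizer
optimizer = torch.optim.AdamW(
    (p for p in toy_model.parameters() if p.requires_grad), lr=2e-5
)

# One backward pass on random input to confirm which parameters get gradients
loss = toy_model(torch.randn(3, 8)).sum()
loss.backward()
print(toy_model.encoder.weight.grad is None)         # frozen: no gradient
print(toy_model.classifier.weight.grad is not None)  # trainable: has gradient
```

Freezing the body trains faster and keeps the adapted language model intact, at the cost of less capacity to fit the downstream task; whether that trade-off pays off depends on the task.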