# Hugging Face BERT Fine-Tuning with PyTorch

## 1. Download Pre-trained Model
- Download the pre-trained BERT model from the Hugging Face model hub.
- We will do fine-tuning on top of it for the sentiment analysis task.

In [8]:
import torch.nn
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "google-bert/bert-base-chinese"
cache_dir = "../local_models"
AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir)
AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


BertTokenizerFast(name_or_path='google-bert/bert-base-chinese', vocab_size=21128, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

## 2. Define Dataset class, for loading our custom dataset for further fine-tuning
- The dataset should be prepared in advance, and in different purpose, like train, validation, test, etc.
- There should more data for training, smaller size for validation and test.

In [9]:
from torch.utils.data import Dataset
from datasets import load_from_disk

class MyDataset(Dataset):
    def __init__(self, dataset_type, dataset_path):
        self.dataset = load_from_disk(dataset_path)
        if dataset_type == 'train':
            self.dataset = self.dataset["train"]
        elif dataset_type == 'validation':
            self.dataset = self.dataset["validation"]
        elif dataset_type == 'test':
            self.dataset = self.dataset["test"]

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        text = self.dataset[item]['text']
        label = self.dataset[item]['label']
        return text,label

dataset_train = MyDataset("train", r"../local_datasets/ChnSentiCorp")
for data in dataset_train[:5]:
    print(data)

dataset_validation = MyDataset("validation", r"../local_datasets/ChnSentiCorp")
for data in dataset_validation[:5]:
    print(data)

dataset_test = MyDataset("test", r"../local_datasets/ChnSentiCorp")
for data in dataset_test[:5]:
    print(data)

['选择珠江花园的原因就是方便，有电动扶梯直接到达海边，周围餐馆、食廊、商场、超市、摊位一应俱全。酒店装修一般，但还算整洁。 泳池在大堂的屋顶，因此很小，不过女儿倒是喜欢。 包的早餐是西式的，还算丰富。 服务吗，一般', '15.4寸笔记本的键盘确实爽，基本跟台式机差不多了，蛮喜欢数字小键盘，输数字特方便，样子也很美观，做工也相当不错', '房间太小。其他的都一般。。。。。。。。。', '1.接电源没有几分钟,电源适配器热的不行. 2.摄像头用不起来. 3.机盖的钢琴漆，手不能摸，一摸一个印. 4.硬盘分区不好办.', '今天才知道这书还有第6卷,真有点郁闷:为什么同一套书有两种版本呢?当当网是不是该跟出版社商量商量,单独出个第6卷,让我们的孩子不会有所遗憾。']
[1, 1, 0, 0, 1]
['這間酒店環境和服務態度亦算不錯,但房間空間太小~~不宣容納太大件行李~~且房間格調還可以~~ 中餐廳的廣東點心不太好吃~~要改善之~~~~但算價錢平宜~~可接受~~ 西餐廳格調都很好~~但吃的味道一般且令人等得太耐了~~要改善之~~', '<荐书> 推荐所有喜欢<红楼>的红迷们一定要收藏这本书,要知道当年我听说这本书的时候花很长时间去图书馆找和借都没能如愿,所以这次一看到当当有,马上买了,红迷们也要记得备货哦!', '商品的不足暂时还没发现，京东的订单处理速度实在.......周二就打包完成，周五才发货...', '２００１年来福州就住在这里，这次感觉房间就了点，温泉水还是有的．总的来说很满意．早餐简单了些．', '不错的上网本，外形很漂亮，操作系统应该是个很大的 卖点，电池还可以。整体上讲，作为一个上网本的定位，还是不错的。']
[1, 1, 0, 1, 1]
['这个宾馆比较陈旧了，特价的房间也很一般。总体来说一般', '怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片！开始还怀疑是不是赠送的个别现象，可是后来发现每张DVD后面都有！真不知道生产商怎么想的，我想看的是猫和老鼠，不是米老鼠！如果厂家是想赠送的话，那就全套米老鼠和唐老鸭都赠送，只在每张DVD后面添加一集算什么？？简直是画蛇添足！！', '还稍微重了点，可能是硬盘大的原故，还要再轻半斤就好了。其他要进一步验证。贴的几种膜气泡较多，用不了多久就要更换了，屏幕膜稍好点，但比没有要

## 3. Define downstream tasks model
- Extending the pretrained model for the fine-tuning

In [10]:
from transformers import BertModel
import torch

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {DEVICE}")

pretrained_model = (
    BertModel
    .from_pretrained(r"../local_models/models--google-bert--bert-base-chinese/snapshots/c30a6ed22ab4564dc1e3b2ecbf6e766b0611a33f")
    .to(DEVICE)
)
print(pretrained_model)

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(768, 2)

    def forward(self,input_ids,attention_mask,token_type_ids):
        # Freeze Pretrained model's parameters, don't engage in the fine-tuning train.
        with torch.no_grad():
            out = pretrained_model(input_ids=input_ids,attention_mask=attention_mask,token_type_ids=token_type_ids)
        # Incremental model
        out = self.fc(out.last_hidden_state[:,0])
        print(out)
        return out

Device: cpu
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(21128, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, i


## 4. Training


In [15]:
from torch.utils.data import DataLoader
from transformers import BertTokenizer,AdamW
import torch

# Previous step loaded the DEVICE already, otherwise, you can load it again.
# DEVICE = torch.device("cuba" if torch.cuda.is_available() else "cpu")

# tokenizer to encode the data
tokenizer = (
    BertTokenizer
    .from_pretrained(r"../local_models/models--google-bert--bert-base-chinese/snapshots/c30a6ed22ab4564dc1e3b2ecbf6e766b0611a33f")
)

# to encode the data while loading process
def tokenize_batches(batch):
    texts, labels = zip(*batch)
    encoded_data = tokenizer.batch_encode_plus(
        batch_text_or_text_pairs=texts,
        truncation=True,
        max_length=500,
        padding="max_length",
        return_tensors="pt",
        return_length=True
    )
    tensor_labels = torch.tensor(labels)
    return encoded_data["input_ids"], encoded_data["attention_mask"], encoded_data["token_type_ids"], tensor_labels

# loading the training dataset
train_data_loader = DataLoader (
    dataset_train,
    batch_size=10,
    shuffle=True,
    drop_last=True,
    collate_fn=tokenize_batches
)

# loading the validation dataset
validation_data_loader = DataLoader (
    dataset_validation,
    batch_size=10,
    shuffle=True,
    drop_last=True,
    collate_fn=tokenize_batches
)

EPOCH = 2 # This should be very large, like 30000, but for the demo, we set it to 3.
def run_training(data_loader=train_data_loader):
    print(DEVICE)
    model = MyModel().to(DEVICE)
    optimizer = AdamW(model.parameters(), lr=1e-5)
    loss_func = torch.nn.CrossEntropyLoss()

    best_validation_acc = 0.0
    for epoch in range(EPOCH):
        for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(data_loader):
            input_ids, attention_mask, token_type_ids = input_ids.to(DEVICE), attention_mask.to(DEVICE), token_type_ids.to(DEVICE)
            optimizer.zero_grad()
            out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            loss = loss_func(out, labels)
            loss.backward()
            optimizer.step()

            if i % 5 == 0:
                out = out.argmax(dim=1)
                acc = (out==labels).sum().item()/len(labels)
                print(f"epoch:{epoch},i:{i},loss:{loss.item()},acc:{acc}")

        model.eval()
        with torch.no_grad():
            validation_acc = 0.0
            validation_loss = 0.0
            for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(validation_data_loader):
                input_ids, attention_mask, token_type_ids = input_ids.to(DEVICE), attention_mask.to(DEVICE), token_type_ids.to(DEVICE)
                out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
                validation_loss += loss_func(out, labels)
                out = out.argmax(dim=1)
                validation_acc += (out==labels).sum().item()
            validation_loss /= len(data_loader)
            validation_acc /= len(data_loader)
            print(f"epoch:{epoch},validation_loss:{validation_loss},validation_acc:{validation_acc}")

            if validation_acc > best_validation_acc:
                best_validation_acc = validation_acc
                torch.save(model.state_dict(), "params/best.pth")
                print(f"epoch:{epoch},best model saved with acc:{best_validation_acc}")

        torch.save(model.state_dict(), "params/last.pth")
        print(f"epoch:{epoch},last model saved")

run_training()

cpu
tensor([[ 0.4191,  0.5826],
        [ 0.5100,  0.2283],
        [ 0.6806,  0.1740],
        [ 0.5401,  0.0717],
        [ 0.7938,  0.2121],
        [ 0.4158, -0.0502],
        [ 0.4433,  0.3058],
        [ 0.4691,  0.2624],
        [ 0.4991,  0.4126],
        [ 0.4049,  0.2262]], grad_fn=<AddmmBackward0>)
epoch:0,i:0,loss:0.6742721796035767,acc:0.6
tensor([[0.3732, 0.2311],
        [0.8252, 0.1867],
        [1.1950, 0.4029],
        [0.3667, 0.4796],
        [0.2121, 0.8947],
        [0.2040, 0.6280],
        [0.5729, 0.3083],
        [0.4459, 0.1432],
        [0.1355, 0.8690],
        [0.3535, 0.0284]], grad_fn=<AddmmBackward0>)
tensor([[0.0740, 1.0953],
        [0.2969, 0.5424],
        [0.0809, 0.7053],
        [0.4876, 0.3754],
        [0.4712, 0.2832],
        [0.9317, 0.4327],
        [0.5018, 0.2652],
        [0.1576, 0.2888],
        [0.2545, 0.8140],
        [0.6750, 0.4960]], grad_fn=<AddmmBackward0>)
tensor([[0.4221, 0.5800],
        [0.7264, 0.4563],
        [0.4078, 0.

## 5. Testing
- After training, we need to test the model on the test dataset.
- Load the generated parameters model and test it

In [17]:
from torch.utils.data import DataLoader
import torch

test_data_loader = DataLoader (
    dataset_test,
    batch_size=10,
    shuffle=True,
    drop_last=True,
    collate_fn=tokenize_batches
)

def run_testing(param_path="params/best.pth"):
    test_acc = 0.0
    total = 0
    model = MyModel().to(DEVICE)
    model.load_state_dict(torch.load(param_path))
    model.eval()
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(test_data_loader):
        input_ids, attention_mask, token_type_ids = input_ids.to(DEVICE), attention_mask.to(DEVICE), token_type_ids.to(DEVICE)
        out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        out = out.argmax(dim=1)
        test_acc += (out==labels).sum().item()
        print(i, (out==labels).sum().item())
        total += len(labels)
    print(f"test_acc:{test_acc/total}")

run_testing()
run_testing("params/last.pth")

  model.load_state_dict(torch.load(param_path))


tensor([[ 1.0775,  0.1732],
        [-0.6142,  1.0059],
        [ 0.6673,  0.3187],
        [ 0.9344, -0.2741],
        [ 0.7287,  0.1582],
        [ 0.3977,  0.2502],
        [ 0.7632, -0.2315],
        [ 1.1574, -0.1033],
        [ 0.2967,  0.2492],
        [ 0.0971,  0.8467]], grad_fn=<AddmmBackward0>)
0 8
tensor([[-0.2091,  1.0616],
        [ 0.9648,  0.0087],
        [ 0.7289, -0.1153],
        [ 0.4560,  0.5984],
        [ 0.2703,  0.6799],
        [ 0.3741,  0.7123],
        [-0.3790,  1.2492],
        [ 0.8157, -0.0017],
        [ 0.6699,  0.5919],
        [ 0.0181,  0.6633]], grad_fn=<AddmmBackward0>)
1 8
tensor([[ 0.3769,  1.0327],
        [ 0.8426, -0.2956],
        [ 0.0201,  0.7767],
        [ 0.1181,  0.7516],
        [ 0.7721,  0.2439],
        [ 0.4364,  0.2145],
        [ 0.2549,  0.8052],
        [ 0.0570,  0.9045],
        [ 0.4581,  0.4258],
        [-0.3587,  1.3696]], grad_fn=<AddmmBackward0>)
2 8
tensor([[ 1.0620,  0.0835],
        [-0.2211,  1.0710],
        [-0

KeyboardInterrupt: 

## Extra 1. Customize the encoding vocabulary
- To extend the default vocabulary of the pre-trained model, you can add new tokens to the vocabulary.
- After adding new tokens, you will need training the model again.

In [18]:
from transformers import BertTokenizer

# How tokenizer works
tokenizer = (
    BertTokenizer
    .from_pretrained(r"../local_models/models--google-bert--bert-base-chinese/snapshots/c30a6ed22ab4564dc1e3b2ecbf6e766b0611a33f")
)
previous_out = tokenizer.batch_encode_plus(
    batch_text_or_text_pairs=["阳光洒在大地上"],
    add_special_tokens=True,
    truncation=True,
    padding="max_length",
    max_length=20,
    return_length=None
)
print(previous_out["input_ids"][0])
print(tokenizer.decode(previous_out["input_ids"][0]))

# Get the vocabulary
vocab = tokenizer.vocab
print(len(vocab))
print('阳' in vocab)
print('光' in vocab)
print('阳光' in vocab)

# Add new tokens to the vocabulary
tokenizer.add_tokens(new_tokens=["阳光"])
vocab = tokenizer.get_vocab()
print(len(vocab))
print('阳' in vocab)
print('光' in vocab)
print('阳光' in vocab)

# Encode the sentence again
out = tokenizer.batch_encode_plus(
    batch_text_or_text_pairs=["阳光洒在大地上"],
    add_special_tokens=True,
    truncation=True,
    padding="max_length",
    max_length=20,
    return_length=None
)

print(previous_out["input_ids"][0])
print(out["input_ids"][0])
print(tokenizer.decode(out["input_ids"][0]))

[101, 7345, 1045, 3818, 1762, 1920, 1765, 677, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[CLS] 阳 光 洒 在 大 地 上 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
21128
True
True
False
21129
True
True
True
[101, 7345, 1045, 3818, 1762, 1920, 1765, 677, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[101, 21128, 3818, 1762, 1920, 1765, 677, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[CLS] 阳光 洒 在 大 地 上 [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]


## Extra 2. Bad Dataset Test
- If the dataset is not well-prepared, the model will not work as expected.

In [21]:
from torch.utils.data import Dataset
from datasets import load_dataset

class CsvDataset(Dataset):
    def __init__(self,file_path):
        #从磁盘加载csv数据
        self.dataset = load_dataset(path="csv",data_files=file_path,split="train")

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        text = self.dataset[item]["text"]
        label = self.dataset[item]["label"]

        return text,label

#TODO This dataset's labels are more than 2 classes, but the model is designed for binary classification.
dataset_bad = CsvDataset(f"../local_datasets/Weibo/test.csv")
for data in dataset_bad[:5]:
    print(data)

bad_data_loader = DataLoader (
    dataset_bad,
    batch_size=10,
    shuffle=True,
    drop_last=True,
    collate_fn=tokenize_batches
)

run_training(bad_data_loader)

['真的很开心啊！！！！！！！', '咳咳。。。。', '//@陈宝存:回复@梅海东messi:这种非守法的公民存在，足见我们法制建设的艰难，你懂法吗？', '也许有一天，你突然醒来，发现自己还在十几岁的年纪，年少时喜欢的男生就在你面前，看着你温暖地笑，说你是个傻瓜，告诉你你从没有失恋过，告诉你是他的唯一，告诉你后来的一切都没发生过，告诉你他其实一直爱你。', '独栋别墅跟农家乐的区别就是：内在装修好，各种设施齐备，别的嘛——完全没区别！']
[2, 7, 1, 7, 7]


TypeError: run_training() takes 0 positional arguments but 1 was given