# Hugging Face BERT Fine-Tuning with PyTorch

## 1. Fine-tuning Basics
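- Fine-tuning means further training a pretrained model so that it adapts to a specific downstream task.
- BERT learns general language patterns during pre-training and is then fine-tuned for tasks such as sentiment analysis or named entity recognition.
- During fine-tuning, the pretrained BERT layers may be frozen so that only the task-specific layers are trained, or all layers may be updated together. This tutorial shows how to fine-tune BERT for a sentiment analysis task.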

## 2. DataLoader
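- Data for a sentiment analysis task usually consists of texts and their corresponding sentiment labels.
- The Hugging Face `datasets` library makes it easy to load and process datasets.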

In [None]:
import torch
from datasets import load_dataset, load_from_disk

# Load a copy of the ChnSentiCorp dataset that was previously saved to disk
local_dataset = load_from_disk('demo/data/ChnSentiCorp')
print(local_dataset)

### 2.1 Dataset Format
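- The Hugging Face `datasets` library supports several dataset file formats, such as CSV, JSON, and Parquet.
- This example uses CSV. The CSV file should contain two columns: one with the text and one with the sentiment label.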
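
As a minimal sketch of that two-column layout, the snippet below writes a small CSV and reads it back using only Python's standard library. The column names `text` and `label` and the sample rows are assumptions chosen for illustration; they are not taken from ChnSentiCorp.

```python
import csv
import io

# Hypothetical sample rows: one text column and one integer label column
rows = [
    {"text": "The room was clean and quiet.", "label": 1},
    {"text": "The service was slow and rude.", "label": 0},
]

# Write the rows to an in-memory CSV buffer (a real file works the same way)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "label"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back; every field arrives as a string, so cast the label to int
buffer.seek(0)
loaded = [
    {"text": record["text"], "label": int(record["label"])}
    for record in csv.DictReader(buffer)
]
print(loaded)
```

Casting the label back to `int` matters because CSV stores every value as text.
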
### 2.2 Dataset Inspection
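- The dataset's `features` attribute shows the name and type of each column.
- Inspect the basic information about the dataset, such as its size, column names, and sample records.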

In [None]:
remote_dataset = load_dataset(path="NousResearch/hermes-function-calling-v1", split="train")
print(remote_dataset.features)
# to_csv writes the file and returns the number of characters or bytes written
bytes_written = remote_dataset.to_csv(path_or_buf='demo/data/hermes-function-calling-v1.csv')
print(bytes_written)


## 3. Making Dataset
- After loading, the dataset must be processed to fit the model's input format. This includes data cleaning and format conversion.
### 3.1 Dataset Column Selection
When creating a dataset, you can select which columns to include. Each column should match the model's input or output format. In this case, we need the 'text' and 'label' columns.
### 3.2 Dataset Information
After creating the dataset, you can inspect its metadata, such as the feature types, with `dataset.info`. The number of samples and the column names are available through `dataset.num_rows` and `dataset.column_names`, and indexing such as `dataset[0]` returns a data example.

In [None]:
from datasets import Dataset

dict_dataset = Dataset.from_dict({
    "text": ["I love Hugging Face", "I hate Hugging Face"],
    "label": [1, 0]
})
print(dict_dataset.info)

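dict_dataset.to_csv(path_or_buf='demo/data/dict_dataset.csv')
# Loading a single CSV file yields only a 'train' split, so split it here to
# obtain the 'test' split that the evaluation section relies on
dict_dataset_csv = load_dataset('csv', data_files='demo/data/dict_dataset.csv')['train'].train_test_split(test_size=0.5)
print(dict_dataset_csv)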

## 4. Vocab Dictionary
- Before fine-tuning the BERT model, the text in the dataset must be encoded with the model's own vocabulary. This step ensures that the input text is correctly converted into the model's input format.
### 4.1 Vocabulary (vocab)
- The BERT model uses a vocabulary (vocab) to convert text into the model's input format.
- The vocabulary maps every known token to an integer index. BERT uses WordPiece tokens, so an entry may be a whole word, a subword piece, or a single character.
- Text that cannot be matched to any vocabulary entry is mapped to the special `[UNK]` token, so the tokenizer must be loaded from the same checkpoint as the model.
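
The lookup described above can be sketched with a toy vocabulary. This is an illustration only: the vocabulary entries and their indices are hypothetical, and `wordpiece` is a simplified stand-in for the greedy longest-match-first splitting that BERT's WordPiece tokenizer performs, not the library's implementation.

```python
# Hypothetical toy vocabulary; a real BERT vocab has tens of thousands of entries
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "play": 4, "##ing": 5, "##ed": 6, "good": 7}

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to [UNK]
        pieces.append(match)
        start = end
    return pieces

tokens = [p for w in "playing good games".split() for p in wordpiece(w, vocab)]
ids = [vocab[t] for t in tokens]
print(tokens)  # ['play', '##ing', 'good', '[UNK]']
print(ids)     # [4, 5, 7, 1]
```

The word "games" has no matching piece in this toy vocabulary, so it falls back to `[UNK]`.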
### 4.2 Tokenization
- The tokenizer splits the text into tokens from the vocabulary and converts them into the corresponding indices.
- The maximum sequence length, the special tokens, and other preprocessing must be consistent with the settings used to pre-train BERT.
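
To make the length handling concrete, here is a minimal sketch of what truncation and `max_length` padding produce. The helper `encode` and the special-token IDs (`cls_id`, `sep_id`, `pad_id`) are hypothetical; a real tokenizer takes these IDs from its own vocabulary.

```python
def encode(token_ids, max_length, cls_id=2, sep_id=3, pad_id=0):
    """Add [CLS]/[SEP], truncate to max_length, then pad and build the mask."""
    body = token_ids[: max_length - 2]       # leave room for [CLS] and [SEP]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)    # 1 marks real tokens
    padding = max_length - len(input_ids)
    input_ids += [pad_id] * padding
    attention_mask += [0] * padding          # 0 marks padding positions
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short = encode([4, 5, 7], max_length=8)
long = encode([4, 5, 7, 4, 5, 7, 4, 5], max_length=6)
print(short)  # padded up to 8 positions
print(long)   # truncated down to 6 positions
```

Every encoded sequence ends up with the same length, which is what lets a batch be stacked into one tensor. The attention mask tells the model which positions to ignore.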

In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

dataset_map = dict_dataset_csv.map(lambda x: tokenizer(x['text'], padding='max_length', truncation=True), batched=True)

print(dataset_map)

## 5. Design Task-Specific Model
- Before fine-tuning the BERT model, you need to design a downstream model structure that fits the sentiment analysis task.
- This usually includes one or more fully connected layers to convert the feature vectors output by BERT into classification results.

### 5.1 Model Structure
- The downstream task model usually includes the following parts:
    - BERT Model: Used to generate context feature vectors for the text.
    - Dropout Layer: Used to prevent overfitting by randomly dropping some neurons to improve the model's generalization ability.
    - Fully Connected Layer: Used to map the output feature vectors of BERT to specific classification tasks.
### 5.2 Model Initialization
- The model is initialized by loading the pre-trained BERT model using the `BertModel.from_pretrained` method and initializing custom fully connected layers.
- When initializing the model, you need to define the appropriate output dimensions based on the downstream task requirements.
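
The dropout and fully connected parts can be sketched in plain Python to show what they compute. This is a simplified illustration, not the PyTorch implementation: the 4-dimensional feature vector, the weights, and the helper names `dropout` and `linear` are all hypothetical, and BERT-base would produce 768 features rather than 4.

```python
import random

def dropout(vector, p, training, rng):
    """Inverted dropout: zero each value with probability p and rescale the rest."""
    if not training or p == 0.0:
        return list(vector)  # dropout is a no-op at evaluation time
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in vector]

def linear(vector, weights, bias):
    """Map a feature vector to one logit per class: logits = W x + b."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

# Hypothetical pooled feature vector and head parameters
pooled = [0.5, -1.0, 2.0, 0.0]
weights = [[0.1, 0.2, 0.3, 0.4],    # row for class 0 (negative)
           [-0.1, 0.0, 0.5, 0.2]]   # row for class 1 (positive)
bias = [0.0, 0.1]

features = dropout(pooled, p=0.3, training=False, rng=random.Random(0))
logits = linear(features, weights, bias)
predicted = max(range(len(logits)), key=logits.__getitem__)
print(logits, predicted)
```

The output dimension of the linear layer equals the number of classes, which is why a binary sentiment task uses two output units.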

In [None]:
from transformers import BertModel
import torch.nn as nn

class SentimentAnalysisModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-chinese')
        self.dropout = nn.Dropout(0.3)
        self.linear = nn.Linear(768, 2)  # binary sentiment classification: 0 = negative, 1 = positive

    def forward(self, input_ids, attention_mask):
        _, pool_output = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=False
        )
        output = self.dropout(pool_output)
        return self.linear(output)

## 6. Customized Training Loop
- Once the model is designed, the training phase begins. A DataLoader feeds the data in batches, and an optimizer updates the model parameters.

### 6.1 DataLoader
Use DataLoader to implement batch data loading. DataLoader automatically handles batch processing and random shuffling of data to ensure training efficiency and data diversity.
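
The batching and shuffling behavior can be sketched without PyTorch. The generator `iterate_batches` below is a hypothetical stand-in for illustration. Unlike a real DataLoader, it uses a fixed seed and does not collate samples into tensors.

```python
import random

def iterate_batches(samples, batch_size, shuffle, seed=0):
    """Yield lists of samples, optionally in shuffled order."""
    indices = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

samples = list(range(10))
batches = list(iterate_batches(samples, batch_size=4, shuffle=True))
print(batches)
```

With 10 samples and a batch size of 4, the last batch holds only the 2 remaining samples, and every sample appears exactly once per pass.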
### 6.2 Optimizer
AdamW is an optimizer commonly used with BERT models. It combines Adam's adaptive learning rates with decoupled weight decay, which acts as regularization and helps reduce overfitting.
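
For intuition, a single AdamW update on one scalar parameter can be written out by hand. This is a sketch of the standard update rule under assumed hyperparameters, not the PyTorch source. The function name `adamw_step` and the `state` dictionary are hypothetical.

```python
import math

def adamw_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to a scalar parameter and return the new value."""
    beta1, beta2 = betas
    state["t"] += 1
    # Decoupled weight decay: shrink the parameter directly, not via the gradient
    param -= lr * weight_decay * param
    # Exponential moving averages of the gradient and of its square
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

state = {"t": 0, "m": 0.0, "v": 0.0}
w = adamw_step(1.0, grad=1.0, state=state, lr=0.1)
print(w)  # roughly 0.899: 0.001 from weight decay plus about 0.1 from the Adam step
```

The weight decay is applied separately from the gradient-based step, which is the difference between AdamW and Adam with L2 regularization.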
### 6.3 Training Loop
The training loop includes forward pass, loss calculation, backward pass, parameter update, etc. Each epoch traverses the entire dataset once, updating the model parameters. The loss value is usually tracked during training to determine the model's convergence.
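
The four steps can be seen in isolation on a toy problem. The sketch below trains a one-feature logistic regression with hand-written gradients and plain SGD. It is not BERT, and the data and learning rate are hypothetical, but the loop has the same forward, loss, backward, and update structure.

```python
import math

# Hypothetical 1-D toy data: the label is 1 when x is positive
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, lr = 0.0, 0.0, 0.5
history = []

for epoch in range(20):
    epoch_loss = 0.0
    for x, y in data:
        # Forward pass: logistic prediction
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        # Loss: binary cross-entropy for this sample
        epoch_loss += -math.log(p if y == 1 else 1.0 - p)
        # Backward pass: analytic gradients of the loss w.r.t. w and b
        grad_w, grad_b = (p - y) * x, (p - y)
        # Parameter update: plain SGD step
        w -= lr * grad_w
        b -= lr * grad_b
    history.append(epoch_loss / len(data))

print(f"first epoch loss {history[0]:.4f}, last epoch loss {history[-1]:.4f}")
```

Tracking the mean loss per epoch, as `history` does here, is the simplest way to check that training is converging.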


In [None]:
from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer

# Initialize the dataset
my_dataset = Dataset.from_dict({
    "input_ids": dataset_map['train']['input_ids'],
    "attention_mask": dataset_map['train']['attention_mask'],
    "labels": dict_dataset_csv['train']['label']
})
# Return PyTorch tensors so the default collate function stacks each column into a batch tensor
my_dataset.set_format(type='torch')
print(my_dataset)

# Initialize the data loader
data_loader = DataLoader(my_dataset, batch_size=16, shuffle=True)

# Initialize the model, optimizer, and loss function
model = SentimentAnalysisModel()
optimizer = AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

# Training loop
for epoch in range(3):
    model.train()
    for batch in data_loader:
        optimizer.zero_grad()
        output = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])
        loss = loss_fn(output, batch['labels'])
        loss.backward()
        optimizer.step()

        print(f'Epoch {epoch}, Loss: {loss.item()}')

## 7. Evaluation
- After training the model, evaluate its performance on the test set. Common metrics include accuracy, precision, recall, F1 score, etc.

### 7.1 Accuracy
- Accuracy is a basic metric for measuring the overall performance of a classification model.
- It is calculated by dividing the number of correctly classified samples by the total number of samples.
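
As a quick worked example of that formula, the sketch below computes accuracy by hand. The labels and predictions are hypothetical values chosen for illustration.

```python
def accuracy(true_labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predictions):
        raise ValueError("inputs must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predictions))
    return correct / len(true_labels)

# Hypothetical labels and predictions for five samples
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 3 of 5 correct -> 0.6
```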

In [None]:
from sklearn.metrics import accuracy_score

dataset_test = Dataset.from_dict({
    "input_ids": dataset_map['test']['input_ids'],
    "attention_mask": dataset_map['test']['attention_mask'],
    "labels": dict_dataset_csv['test']['label']
})
# Return PyTorch tensors so the default collate function stacks each column into a batch tensor
dataset_test.set_format(type='torch')

# Initialize the test data loader
test_loader = DataLoader(dataset_test, batch_size=16, shuffle=False)

model.eval()
predictions, true_labels = [], []

with torch.no_grad():
    for batch in test_loader:
        outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'])
        # The predicted class is the index of the largest logit
        _, preds = torch.max(outputs, dim=1)

        predictions.extend(preds.cpu().numpy())
        true_labels.extend(batch['labels'].cpu().numpy())

# Use a separate name so the imported accuracy_score function is not overwritten
accuracy = accuracy_score(true_labels, predictions)
print(f'Accuracy: {accuracy:.4f}')

### 7.2 Precision, Recall, and F1 Score
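- Precision and recall are two other important metrics for classification models. Precision is the fraction of samples predicted as a class that truly belong to it, and recall is the fraction of samples of a class that the model correctly identifies.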
- F1 score is the harmonic mean of precision and recall, usually used for evaluating imbalanced datasets.
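
These definitions can be checked by hand on a small example. The sketch below computes the three metrics for a single positive class. The helper `precision_recall_f1` and the sample labels are hypothetical, and the `average='weighted'` setting in scikit-learn additionally averages such per-class scores weighted by class frequency.

```python
def precision_recall_f1(true_labels, predictions, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(true_labels, predictions))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical labels and predictions: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")
```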

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(true_labels, predictions, average='weighted')
recall = recall_score(true_labels, predictions, average='weighted')
f1 = f1_score(true_labels, predictions, average='weighted')

print(f'Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}')


### 7.3 Result Analysis and Model Optimization
- By analyzing the results on the test set, you can identify the strengths and weaknesses of the model.
- For example, if the F1 score is low, it may be due to an imbalanced dataset, leading to poor performance on some categories.
- By adjusting hyperparameters, improving data preprocessing steps, or using more complex model structures, you can further improve model performance.

### 7.4 Save and Load Model
- To use the trained model in the future, save it as a file for later loading for inference or further fine-tuning.

In [None]:
# Save the model
torch.save(model.state_dict(), 'demo/model/sentiment_analysis_model.pth')

# Load the model
model = SentimentAnalysisModel()
model.load_state_dict(torch.load('demo/model/sentiment_analysis_model.pth'))
model.eval()

## 8. Summary
- This tutorial showed how to fine-tune Hugging Face's BERT model for Chinese sentiment analysis.
- We walked through the entire fine-tuning process step by step: loading the dataset, creating the Dataset, working with the vocabulary, designing the model, writing a custom training loop, and evaluating performance on the test set.
- After this tutorial, you should be able to follow the basic process of fine-tuning a pre-trained language model for a downstream task and apply it to practical NLP projects.