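Text classification is a fundamental task in natural language processing (NLP): assigning a piece of text to one of a set of predefined categories, such as spam/not spam or positive/negative. Transformer models have become widely used in NLP in recent years, and Hugging Face's open-source Transformers library provides many pretrained models and tools that make text classification straightforward to implement.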

In [1]:
!pip install transformers

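Download the dataset we need: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz . It contains 50,000 movie reviews, each labeled positive or negative. The dataset is also available on Kaggle. First, preprocess the dataset by reading its training and test splits into lists of texts and labels: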

In [1]:
from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text(encoding="utf-8"))
            # Compare strings with ==; `is` tests object identity and triggers a SyntaxWarning
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')
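
The reader above expects each split directory to contain `pos` and `neg` subfolders with one review per text file. As a quick sanity check, the sketch below builds a tiny stand-in directory and reads it back the same way. The reviews, file names, and the helpers `read_split` and `build_toy_split` are made up for illustration and are not part of the dataset or of any library.

```python
import tempfile
from pathlib import Path

def read_split(split_dir):
    """Read reviews from <split_dir>/pos and <split_dir>/neg; label pos=1, neg=0."""
    texts, labels = [], []
    for label_dir in ["pos", "neg"]:
        # Sort for a deterministic order; iterdir() order depends on the OS
        for text_file in sorted((Path(split_dir) / label_dir).iterdir()):
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(1 if label_dir == "pos" else 0)
    return texts, labels

def build_toy_split(root):
    """Create a miniature split with two hypothetical reviews per class."""
    samples = {"pos": ["Loved it.", "A delight."], "neg": ["Dull.", "Too long."]}
    for label_dir, reviews in samples.items():
        folder = Path(root) / label_dir
        folder.mkdir(parents=True)
        for i, review in enumerate(reviews):
            (folder / f"{i}_0.txt").write_text(review, encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    build_toy_split(tmp)
    toy_texts, toy_labels = read_split(tmp)

print(toy_texts)   # ['Loved it.', 'A delight.', 'Dull.', 'Too long.']
print(toy_labels)  # [1, 1, 0, 0]
```

Running the same check against the real `aclImdb/train` folder is a cheap way to confirm the archive was extracted where the notebook expects it.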

In [2]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
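
To make the 80/20 hold-out concrete, here is a plain-Python sketch of what such a split does: shuffle the example indices, then set aside a fraction for validation. It is only an illustration of the idea, not scikit-learn's implementation; `split_train_val` and its `seed` parameter are hypothetical names.

```python
import random

def split_train_val(texts, labels, val_fraction=0.2, seed=0):
    """Shuffle example indices, then hold out a `val_fraction` share for validation."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_val = int(round(len(indices) * val_fraction))
    train_idx = indices[: len(indices) - n_val]
    val_idx = indices[len(indices) - n_val:]

    def pick(items, idx):
        return [items[i] for i in idx]

    # Same return order as the call above: texts first, then labels
    return (pick(texts, train_idx), pick(texts, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

texts = [f"review {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]
tr_texts, va_texts, tr_labels, va_labels = split_train_val(texts, labels)
print(len(tr_texts), len(va_texts))  # 8 2
```

Texts and labels are selected with the same index lists, so every review keeps its own label after shuffling.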

In [3]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
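
The two flags passed to the tokenizer control sequence length: `truncation=True` cuts inputs that exceed the model's maximum length, and `padding=True` pads shorter inputs to the longest one in the batch, with an attention mask marking which positions are real tokens. The sketch below imitates that behavior on made-up token IDs. It is a simplification: `pad_and_truncate` is a hypothetical helper, and a real tokenizer also handles special tokens such as `[SEP]` when truncating, which this version does not.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max((len(ids) for ids in truncated), default=0)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token IDs for three inputs of different lengths
batch = [[101, 7592, 102], [101, 2023, 3185, 2001, 2307, 102], [101, 102]]
encoded = pad_and_truncate(batch, max_length=4)
print(encoded["input_ids"])
# [[101, 7592, 102, 0], [101, 2023, 3185, 2001], [101, 102, 0, 0]]
print(encoded["attention_mask"])
# [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 0, 0]]
```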

In [4]:
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

Training with a native PyTorch loop:

We can use Hugging Face's AutoModelForSequenceClassification to load the pretrained model:

In [5]:
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In the code above, we load the bert-base-uncased model and configure it for binary classification (`num_labels=2`).
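
With two labels, the classification head produces two raw scores (logits) per input, one for each class. The sketch below shows how such scores are usually turned into a prediction: softmax converts them to probabilities, and the class with the highest probability wins. The logit values are invented for illustration, and index 0 is taken to mean negative and index 1 positive, matching the labels assigned when the data was read.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return (class index, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

logits = [-1.2, 2.3]  # hypothetical scores for [negative, positive]
label, confidence = predict(logits)
print(label)                 # 1
print(round(confidence, 4))  # 0.9707
```

Softmax does not change which score is largest, so taking the argmax of the logits directly gives the same predicted class.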

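If you want a custom model, note that AutoModelForSequenceClassification is a factory class and cannot be subclassed and instantiated directly. Instead, subclass BertPreTrainedModel and build the encoder and classification head yourself:
```python
import torch
from transformers import BertModel, BertPreTrainedModel

class CustomModel(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)  # the pretrained encoder
        self.dropout = torch.nn.Dropout(config.hidden_dropout_prob)
        self.classifier = torch.nn.Linear(config.hidden_size, config.num_labels)
        self.post_init()  # initialize the newly added weights

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        # pooler_output is the transformed [CLS] representation
        pooled_output = self.dropout(outputs.pooler_output)
        return self.classifier(pooled_output)  # raw logits, one per label
```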

In [6]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
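
Cross-entropy is the loss being minimized here: for each example it is the negative log-probability that the model assigns to the correct class, averaged over the batch. A minimal plain-Python sketch of that formula follows; the helper names and logit values are illustrative only. In the training loop itself the loss is read from the model's output, since the model computes it internally when `labels` are passed.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class, via a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]

def mean_cross_entropy(batch_logits, batch_labels):
    """Average the per-example losses over a batch."""
    losses = [cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)

# An undecided model (equal scores) pays ln(2) on a two-class problem
print(round(cross_entropy([0.0, 0.0], 0), 4))   # 0.6931
# A confident, correct prediction costs almost nothing...
print(cross_entropy([-5.0, 5.0], 1) < 1e-3)     # True
# ...while a confident, wrong one is penalized heavily
print(cross_entropy([-5.0, 5.0], 0) > 9.0)      # True
```

The value ln(2) ≈ 0.693 is a useful reference point: a two-class model whose average loss stays near it is doing no better than guessing.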

In [7]:
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)  # evaluation does not need shuffling
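
A data loader groups example indices into batches, optionally shuffling them first. The following sketch mimics that grouping in plain Python to show how `batch_size` determines the number of batches per epoch. `iter_batches` is a hypothetical helper, not part of PyTorch.

```python
import math
import random

def iter_batches(n_examples, batch_size, shuffle=False, seed=0):
    """Yield lists of example indices, one list per batch."""
    order = list(range(n_examples))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]

batches = list(iter_batches(n_examples=10, batch_size=4))
print(batches)       # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(len(batches))  # 3, i.e. ceil(10 / 4)
```

With `batch_size=1`, as used above, the number of batches equals the number of examples. Larger batches mean fewer optimizer steps per epoch, and the last batch may be smaller than the rest.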

Now we can start training the model:

In [34]:
from tqdm import tqdm

for epoch in range(10):
    model.train()
    total_loss = 0
    for batch in tqdm(train_loader):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Passing labels makes the model compute the cross-entropy loss itself
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f'Epoch {epoch+1}, Loss: {total_loss / len(train_loader)}')

    model.eval()
    total_correct = 0
    with torch.no_grad():
        for batch in tqdm(test_loader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            _, predicted = torch.max(outputs.logits, 1)
            total_correct += (predicted == labels).sum().item()

    # Divide by the number of samples, not the number of batches
    accuracy = total_correct / len(test_loader.dataset)
    print(f'Epoch {epoch+1}, Test Accuracy: {accuracy:.4f}')

100%|██████████| 29/29 [00:03<00:00,  8.95it/s]
Epoch 1, Loss: 0.038864295126806045
100%|██████████| 33/33 [00:00<00:00, 47.15it/s]
Epoch 1, Test Accuracy: 0.4848
100%|██████████| 29/29 [00:03<00:00,  8.96it/s]
Epoch 2, Loss: 0.02929796355551687
100%|██████████| 33/33 [00:00<00:00, 46.62it/s]
Epoch 2, Test Accuracy: 0.4848
100%|██████████| 29/29 [00:03<00:00,  8.92it/s]
Epoch 3, Loss: 0.02734417317370916
100%|██████████| 33/33 [00:00<00:00, 46.76it/s]
Epoch 3, Test Accuracy: 0.4848
100%|██████████| 29/29 [00:03<00:00,  9.03it/s]
Epoch 4, Loss: 0.021551951910529672
100%|██████████| 33/33 [00:00<00:00, 46.33it/s]
Epoch 4, Test Accuracy: 0.5152
100%|██████████| 29/29 [00:03<00:00,  8.87it/s]
Epoch 5, Loss: 0.017725357384388817
100%|██████████| 33/33 [00:00<00:00, 46.11it/s]
Epoch 5, Test Accuracy: 0.4848
100%|██████████| 29/29 [00:03<00:00,  8.85it/s]
Epoch 6, Loss: 0.017422061365354676
100%|██████████| 33/33 [00:00<00:00, 45.86it/s]
Epoch 6, Test Accuracy: 0.4848
100%|██████████| 29/29 [00:03<00:00,  8.84it/s]
Epoch 7, Loss: 0.014557164322970242
100%|██████████| 33/33 [00:00<00:00, 45.64it/s]
Epoch 7, Test Accuracy: 0.4848
100%|██████████| 29/29 [00:03<00:00,  8.84it/s]
Epoch 8, Loss: 0.01457862925298255
100%|██████████| 33/33 [00:00<00:00, 45.78it/s]
Epoch 8, Test Accuracy: 0.5152
100%|██████████| 29/29 [00:03<00:00,  8.83it/s]
Epoch 9, Loss: 0.012392129789202892
100%|██████████| 33/33 [00:00<00:00, 45.68it/s]
Epoch 9, Test Accuracy: 0.5152
100%|██████████| 29/29 [00:03<00:00,  8.83it/s]
Epoch 10, Loss: 0.00959407995005363
100%|██████████| 33/33 [00:00<00:00, 43.79it/s]
Epoch 10, Test Accuracy: 0.5152

After training is complete, we can evaluate the model's performance:

In [35]:
model.eval()
total_correct = 0

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        # The predicted class is the index of the largest logit
        _, predicted = torch.max(outputs.logits, 1)
        total_correct += (predicted == labels).sum().item()

# Divide by the number of samples, not the number of batches
accuracy = total_correct / len(test_loader.dataset)
print(f'Test Accuracy: {accuracy:.4f}')

Test Accuracy: 0.5152
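
Accuracy is the number of correct predictions divided by the number of examples evaluated. The sketch below, using made-up predictions, counts examples as the batches go by so that the result does not depend on the batch size. The function name and the sample data are illustrative.

```python
def accuracy(batches):
    """batches: iterable of (predicted_labels, true_labels) pairs, one pair per batch."""
    correct = 0
    seen = 0
    for predicted, actual in batches:
        correct += sum(int(p == a) for p, a in zip(predicted, actual))
        seen += len(actual)
    return correct / seen if seen else 0.0

# Two hypothetical batches of sizes 4 and 2: 3 + 1 correct out of 6 examples
batches = [([1, 0, 1, 1], [1, 0, 0, 1]), ([0, 0], [0, 1])]
print(round(accuracy(batches), 4))  # 0.6667
```

Dividing the correct count by the number of batches gives the same value only when every batch holds exactly one example, as with `batch_size=1` here. On a balanced two-class test set, an accuracy near 0.5 is what random guessing would produce.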


Reference: https://huggingface.co/transformers/v3.4.0/custom_datasets.html