# BBC Text Classification with BERT (Multi-class)

This notebook adapts the workflow from the article *BERT Multi-class & Multi-label Text Classification in Practice* (original title: *Bert多分类&多标签文本分类实战*; run.py → data → model → train/eval/test) to **your dataset**: `/home/mywsl/Workspace/NLP/data/bbc_text_cls.csv`.

- Dataset: BBC-style news, columns: `text`, `labels`
- Task: **multi-class text classification**
- Model: `bert-base-uncased` + Linear head

> Tip: `CSV_PATH` is read from the environment variable of the same name and defaults to `/home/mywsl/Workspace/NLP/data/bbc_text_cls.csv`. If your copy of the CSV lives elsewhere, set that variable before running the Config cell.


## 0) Install dependencies (run once)
If you already have them, you can skip this cell.

In [1]:
# !pip -q install torch transformers==4.* scikit-learn pandas tqdm

## 1) Imports & Reproducibility

In [2]:
import os
import time
import random
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from tqdm.auto import tqdm

from transformers import BertTokenizerFast, BertModel, get_linear_schedule_with_warmup
from torch.optim import AdamW


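def set_seed(seed: int = 1):
    # Seed every RNG the notebook touches so runs are repeatable.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # avoid autotuned, possibly non-deterministic kernels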


set_seed(1)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device




device(type='cuda')
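
As a quick illustration of why seeding matters, the sketch below uses only Python's `random` module (the helper name `draw` is made up for this example). Reseeding with the same value reproduces the same sequence; `set_seed` applies the same idea to NumPy and PyTorch.

```python
import random


def draw(seed: int, n: int = 3):
    """Return n pseudo-random floats after seeding Python's RNG (illustrative helper)."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw(1)
second = draw(1)   # same seed -> identical sequence
other = draw(2)    # different seed -> a different sequence

print(first == second)
print(first == other)
```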

## 2) Config (mirrors the article's Config class)

In [5]:
class Config:
    def __init__(self):
        # --- data ---
        self.CSV_PATH = os.getenv('CSV_PATH', '/home/mywsl/Workspace/NLP/data/bbc_text_cls.csv')

        # --- model ---
        # English dataset -> use an English pretrained model
        self.bert_name = 'bert-base-uncased'

        # --- training ---
        self.num_epochs = 3
        self.batch_size = 32
        self.max_length = 128
        self.learning_rate = 5e-5
        self.weight_decay = 0.01
        self.require_improvement = 1000  # early-stop patience in steps

        # --- saving ---
        self.output_dir = './outputs_bbc_bert'
        os.makedirs(self.output_dir, exist_ok=True)
        self.save_path = os.path.join(self.output_dir, 'bert_bbc_cls.pt')

        # --- runtime ---
        self.device = device


cfg = Config()
cfg.__dict__


{'CSV_PATH': '/home/mywsl/Workspace/NLP/data/bbc_text_cls.csv',
 'bert_name': 'bert-base-uncased',
 'num_epochs': 3,
 'batch_size': 32,
 'max_length': 128,
 'learning_rate': 5e-05,
 'weight_decay': 0.01,
 'require_improvement': 1000,
 'output_dir': './outputs_bbc_bert',
 'save_path': './outputs_bbc_bert/bert_bbc_cls.pt',
 'device': device(type='cuda')}
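
The `CSV_PATH` lookup in `Config` can be tried in isolation. This is a minimal sketch of the same environment-variable fallback; `resolve_csv_path` and the override path `/tmp/my_bbc.csv` are hypothetical names used only for illustration.

```python
import os

# Default taken from the Config class above.
DEFAULT_CSV_PATH = '/home/mywsl/Workspace/NLP/data/bbc_text_cls.csv'


def resolve_csv_path(environ=None) -> str:
    """Return CSV_PATH from the given mapping, falling back to the default."""
    environ = os.environ if environ is None else environ
    return environ.get('CSV_PATH', DEFAULT_CSV_PATH)


# Hypothetical override path, for illustration only.
print(resolve_csv_path({'CSV_PATH': '/tmp/my_bbc.csv'}))
# No override present: the default is used.
print(resolve_csv_path({}))
```

Because `Config.__init__` reads the variable once, an override has to be in place (for example via `os.environ['CSV_PATH'] = ...` in an earlier cell) before `Config()` is instantiated.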

## 3) Load CSV dataset
Some `text` fields are quoted strings that span multiple lines; `pandas.read_csv` parses these correctly by default.

In [6]:
df = pd.read_csv(cfg.CSV_PATH)

# ensure expected columns
df.columns = [c.strip().strip('"') for c in df.columns]
assert 'text' in df.columns and 'labels' in df.columns

df = df.dropna(subset=['text', 'labels']).copy()
df['text'] = df['text'].astype(str)
df['labels'] = df['labels'].astype(str)

print('rows:', len(df))
df.head()


rows: 2225


| Unnamed: 0 | text | labels |
|---|---|---|
| 0 | Ad sales boost Time Warner profit\n\nQuarterly... | business |
| 1 | Dollar gains on Greenspan speech\n\nThe dollar... | business |
| 2 | Yukos unit buyer faces loan claim\n\nThe owner... | business |
| 3 | High fuel prices hit BA's profits\n\nBritish A... | business |
| 4 | Pernod takeover talk lifts Domecq\n\nShares in... | business |


## 4) Encode labels & split train/dev/test
We replicate the article's idea of separate train/dev/test sets, but generate them from a single CSV with two stratified splits (roughly 80/10/10).

In [8]:
le = LabelEncoder()
df['label_id'] = le.fit_transform(df['labels'])
class_list = list(le.classes_)
num_classes = len(class_list)

print('num_classes:', num_classes)
print('classes:', class_list)

train_df, tmp_df = train_test_split(
    df, test_size=0.2, random_state=1, stratify=df['label_id']
)
dev_df, test_df = train_test_split(
    tmp_df, test_size=0.5, random_state=1, stratify=tmp_df['label_id']
)

print('train/dev/test:', len(train_df), len(dev_df), len(test_df))


num_classes: 5
classes: ['business', 'entertainment', 'politics', 'sport', 'tech']
train/dev/test: 1780 222 223
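
The split sizes can be checked by hand. The sketch below assumes that the held-out count is the requested fraction rounded up, which is consistent with the sizes printed above; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples: int, test_size: float):
    """Sizes of the (kept, held-out) parts, assuming the held-out count is rounded up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


n_train, n_tmp = split_sizes(2225, 0.2)   # first split: train vs. temporary pool
n_dev, n_test = split_sizes(n_tmp, 0.5)   # second split: dev vs. test

print(n_train, n_dev, n_test)  # 1780 222 223
```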


## 5) Tokenizer

In [9]:
tokenizer = BertTokenizerFast.from_pretrained(cfg.bert_name)
tokenizer.vocab_size




30522

## 6) Dataset & DataLoader
This replaces the article's `build_dataset()` and custom iterator, using PyTorch `Dataset/DataLoader`.

In [10]:
class BBCDataset(Dataset):
    def __init__(self, df: pd.DataFrame, tokenizer: BertTokenizerFast, max_length: int):
        self.df = df.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx: int):
        text = self.df.loc[idx, 'text']
        label = int(self.df.loc[idx, 'label_id'])

        enc = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt',
        )

        return {
            'input_ids': enc['input_ids'].squeeze(0),
            'attention_mask': enc['attention_mask'].squeeze(0),
            'labels': torch.tensor(label, dtype=torch.long),
        }


train_loader = DataLoader(BBCDataset(train_df, tokenizer, cfg.max_length), batch_size=cfg.batch_size, shuffle=True)
dev_loader   = DataLoader(BBCDataset(dev_df, tokenizer, cfg.max_length),   batch_size=cfg.batch_size, shuffle=False)
test_loader  = DataLoader(BBCDataset(test_df, tokenizer, cfg.max_length),  batch_size=cfg.batch_size, shuffle=False)

batch = next(iter(train_loader))
{k: v.shape for k, v in batch.items()}


{'input_ids': torch.Size([32, 128]),
 'attention_mask': torch.Size([32, 128]),
 'labels': torch.Size([32])}
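
Every row has length 128 because the tokenizer truncates long articles and pads short ones. The following is a simplified pure-Python sketch of that behaviour for a single sequence; `pad_or_truncate` is a hypothetical helper and the word-piece ids are invented, while 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def pad_or_truncate(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, then right-pad with [PAD]."""
    body = token_ids[: max_length - 2]        # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad             # 0 = padding, ignored by attention
    return input_ids, attention_mask


# Invented word-piece ids, for illustration only.
short_ids, short_mask = pad_or_truncate([2000, 2001, 2002], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(2000, 2020)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=128`, anything past the first 126 word pieces of an article is discarded, so the classifier only sees the opening of longer stories.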

## 7) Model (BERT + Linear head)
Matches the article's `Model` class: take the pooled `[CLS]` representation and apply a linear classifier.

In [11]:
class BertForBBC(nn.Module):
    def __init__(self, bert_name: str, num_classes: int, finetune: bool = True):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        for p in self.bert.parameters():
            p.requires_grad = finetune
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        _, pooled = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=None,
            return_dict=False
        )
        logits = self.fc(pooled)
        return logits


model = BertForBBC(cfg.bert_name, num_classes, finetune=True).to(cfg.device)
model


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertForBBC(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine

## 8) Optimizer + Scheduler (AdamW + linear warmup)
Same idea as the article: weight decay + warmup schedule. We use `torch.optim.AdamW` for compatibility.

In [12]:
def build_optimizer_and_scheduler(model: nn.Module, cfg: Config, total_steps: int):
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    named_params = list(model.named_parameters())

    grouped = [
        {
            'params': [p for n, p in named_params if not any(nd in n for nd in no_decay)],
            'weight_decay': cfg.weight_decay,
        },
        {
            'params': [p for n, p in named_params if any(nd in n for nd in no_decay)],
            'weight_decay': 0.0,
        },
    ]

    optimizer = AdamW(grouped, lr=cfg.learning_rate)

    warmup_steps = int(total_steps * 0.1)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=warmup_steps,
        num_training_steps=total_steps,
    )
    return optimizer, scheduler


total_steps = len(train_loader) * cfg.num_epochs
# Built here only to inspect the step counts; train() creates its own optimizer and scheduler.
optimizer, scheduler = build_optimizer_and_scheduler(model, cfg, total_steps)
(total_steps, int(total_steps*0.1))


(168, 16)
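
To make the schedule concrete, here is a sketch of the piecewise-linear multiplier that a linear warmup/decay schedule applies to the base learning rate: a ramp from 0 to 1 over the warmup steps, then a decay back to 0 at the last step. `lr_factor` is a hypothetical stand-alone function written for this illustration, and the numbers are this notebook's settings (1780 training rows, batch size 32, 3 epochs).

```python
def lr_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)           # linear ramp 0 -> 1
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))  # linear decay 1 -> 0


batches_per_epoch = -(-1780 // 32)       # ceil(1780 / 32) = 56
total_steps = batches_per_epoch * 3      # 168
warmup_steps = int(total_steps * 0.1)    # 16
base_lr = 5e-5

for step in (0, 8, 16, 92, 168):
    print(step, base_lr * lr_factor(step, warmup_steps, total_steps))
```

The peak rate of `5e-5` is reached at step 16, and the rate is back to half of that at step 92, midway through the decay phase.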

## 9) Evaluation (accuracy + report + confusion)
Matches the article's `evaluate()` and `test()` outputs.

In [14]:
@torch.no_grad()
def evaluate(model: nn.Module, data_loader: DataLoader, cfg: Config, class_list=None, return_report: bool = False):
    model.eval()
    losses = []
    all_preds, all_labels = [], []

    for batch in data_loader:
        input_ids = batch['input_ids'].to(cfg.device)
        attention_mask = batch['attention_mask'].to(cfg.device)
        labels = batch['labels'].to(cfg.device)

        logits = model(input_ids, attention_mask)
        loss = F.cross_entropy(logits, labels)
        losses.append(loss.item())

        preds = torch.argmax(logits, dim=1)
        all_preds.append(preds.detach().cpu().numpy())
        all_labels.append(labels.detach().cpu().numpy())

    all_preds = np.concatenate(all_preds)
    all_labels = np.concatenate(all_labels)

    acc = metrics.accuracy_score(all_labels, all_preds)
    avg_loss = float(np.mean(losses))

    if return_report:
        report = metrics.classification_report(all_labels, all_preds, target_names=class_list, digits=4)
        confusion = metrics.confusion_matrix(all_labels, all_preds)
        return acc, avg_loss, report, confusion

    return acc, avg_loss
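
For intuition about the loss values that `evaluate()` reports, the sketch below computes cross-entropy for a single example in plain Python. It is an illustration with invented logits, not a replacement for `F.cross_entropy`, and the function name is reused here only for readability.

```python
import math


def cross_entropy(logits, label: int) -> float:
    """Negative log-probability of the true class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Invented logits for a 5-class problem.
uniform = [0.0] * 5
confident = [6.0, 0.0, 0.0, 0.0, 0.0]

print(cross_entropy(uniform, 0))     # ln(5): a model with no preference among 5 classes
print(cross_entropy(confident, 0))   # small when the true class dominates
print(cross_entropy(confident, 1))   # large when a wrong class dominates
```

A 5-class model that has learned nothing yet therefore sits near ln(5) ≈ 1.61, which is a useful reference point when reading the first lines of a training log.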


## 10) Train loop (with early stopping + best checkpoint)
This mirrors the article's train loop: evaluate on the dev set every 100 steps, save a checkpoint whenever dev loss improves, and stop early if it has not improved for `cfg.require_improvement` steps.
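
To see how the evaluation interval and the patience interact, the loop's bookkeeping can be replayed without a model. The sketch below is an illustration rather than the notebook's code: `simulate_schedule` is a hypothetical helper, the first call uses this run's settings with the two dev losses from its training log, and the second call uses made-up settings and losses.

```python
def simulate_schedule(total_steps, eval_every, patience, dev_losses):
    """Replay the loop's bookkeeping; dev_losses maps an eval step to its dev loss."""
    best = float('inf')
    last_improve = 0
    saved_at = []                                  # steps at which a checkpoint would be written
    for step in range(total_steps):
        if step % eval_every == 0 and dev_losses[step] < best:
            best, last_improve = dev_losses[step], step
            saved_at.append(step)
        if (step + 1) - last_improve > patience:   # same test as total_batch - last_improve
            return saved_at, step + 1, True        # stopped early
    return saved_at, total_steps, False


# This notebook's settings: 168 steps, eval every 100, patience 1000.
print(simulate_schedule(168, 100, 1000, {0: 1.73, 100: 0.09}))

# Made-up run: eval every 10 steps, patience 30, dev loss flat after step 20.
made_up = dict(zip(range(0, 168, 10), [1.0, 0.8, 0.6] + [0.6] * 14))
print(simulate_schedule(168, 10, 30, made_up))
```

With 168 total steps, evaluation fires only at steps 0 and 100, and a patience of 1000 steps can never be exceeded, so early stopping is effectively inactive for this run. A smaller evaluation interval and a smaller `cfg.require_improvement` would make both mechanisms matter.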

In [15]:
def train(model: nn.Module, train_loader: DataLoader, dev_loader: DataLoader, test_loader: DataLoader, cfg: Config, class_list):
    start_time = time.time()

    total_batch = 0
    best_dev_loss = float('inf')
    last_improve = 0
    stop_flag = False

    total_steps = len(train_loader) * cfg.num_epochs
    optimizer, scheduler = build_optimizer_and_scheduler(model, cfg, total_steps)

    for epoch in range(cfg.num_epochs):
        model.train()
        print(f"Epoch [{epoch+1}/{cfg.num_epochs}]")

        for batch in tqdm(train_loader, desc='Training', leave=False):
            input_ids = batch['input_ids'].to(cfg.device)
            attention_mask = batch['attention_mask'].to(cfg.device)
            labels = batch['labels'].to(cfg.device)

            logits = model(input_ids, attention_mask)
            loss = F.cross_entropy(logits, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()

            if total_batch % 100 == 0:
                preds = torch.argmax(logits, dim=1)
                train_acc = metrics.accuracy_score(labels.detach().cpu().numpy(), preds.detach().cpu().numpy())

                dev_acc, dev_loss = evaluate(model, dev_loader, cfg)

                improved = ''
                if dev_loss < best_dev_loss:
                    best_dev_loss = dev_loss
                    torch.save({
                        'model_state_dict': model.state_dict(),
                        'class_list': class_list,
                        'label_encoder_classes': class_list,
                        'bert_name': cfg.bert_name,
                        'max_length': cfg.max_length,
                    }, cfg.save_path)
                    improved = '*'
                    last_improve = total_batch

                elapsed = time.time() - start_time
                print(
                    f"Iter:{total_batch:>6} | "
                    f"TrainLoss:{loss.item():>5.2f} TrainAcc:{train_acc:>6.2%} | "
                    f"ValLoss:{dev_loss:>5.2f} ValAcc:{dev_acc:>6.2%} | "
                    f"Time:{elapsed:>.0f}s {improved}"
                )
                model.train()

            total_batch += 1

            if total_batch - last_improve > cfg.require_improvement:
                print('No improvement for a long time, stopping early...')
                stop_flag = True
                break

        if stop_flag:
            break

    print('=== Testing best checkpoint ===')
    ckpt = torch.load(cfg.save_path, map_location=cfg.device)
    model.load_state_dict(ckpt['model_state_dict'])

    test_acc, test_loss, report, confusion = evaluate(model, test_loader, cfg, class_list=class_list, return_report=True)
    print(f"Test Loss: {test_loss:.4f} | Test Acc: {test_acc:.4%}")
    print('Classification report:')
    print(report)
    print('Confusion matrix:')
    print(confusion)


train(model, train_loader, dev_loader, test_loader, cfg, class_list)


Epoch [1/3]
Iter:     0 | TrainLoss: 1.66 TrainAcc:31.25% | ValLoss: 1.73 ValAcc:16.67% | Time:5s *
Epoch [2/3]
Iter:   100 | TrainLoss: 0.04 TrainAcc:100.00% | ValLoss: 0.09 ValAcc:97.75% | Time:372s *
Epoch [3/3]

=== Testing best checkpoint ===
Test Loss: 0.1078 | Test Acc: 97.3094%
Classification report:
               precision    recall  f1-score   support

     business     1.0000    0.9216    0.9592        51
entertainment     1.0000    0.9487    0.9737        39
     politics     0.9333    1.0000    0.9655        42
        sport     1.0000    1.0000    1.0000        51
         tech     0.9302    1.0000    0.9639        40

     accuracy                         0.9731       223
    macro avg     0.9727    0.9741    0.9724       223
 weighted avg     0.9749    0.9731    0.9731       223

Confusion matrix:
[[47  0  3  0  1]
 [ 0 37  0  0  2]
 [ 0  0 42  0  0]
 [ 0  0  0 51  0]
 [ 0  0  0  0 40]]
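
The per-class figures in the report can be rederived from the confusion matrix, where rows are true classes and columns are predicted classes. The sketch below does this in plain Python with the matrix printed above; `per_class_metrics` is a hypothetical helper written for this check.

```python
classes = ['business', 'entertainment', 'politics', 'sport', 'tech']
confusion = [
    [47, 0, 3, 0, 1],
    [0, 37, 0, 0, 2],
    [0, 0, 42, 0, 0],
    [0, 0, 0, 51, 0],
    [0, 0, 0, 0, 40],
]  # rows = true class, columns = predicted class


def per_class_metrics(cm):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[r][k] for r in range(n))   # column sum: predicted as class k
        actual = sum(cm[k])                            # row sum: truly class k
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


metrics_by_class = per_class_metrics(confusion)
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(len(confusion))) / total

for name, (p, r, f1) in zip(classes, metrics_by_class):
    print(f"{name:>13}  {p:.4f}  {r:.4f}  {f1:.4f}")
print(f"accuracy {accuracy:.4f}")
```

All six errors involve `business` or `entertainment` articles: four `business` stories were predicted as `politics` or `tech`, and two `entertainment` stories as `tech`.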


## 11) Single-text prediction

In [16]:
@torch.no_grad()
def predict_one(text: str, model: nn.Module, tokenizer: BertTokenizerFast, cfg: Config, class_list):
    model.eval()

    enc = tokenizer(
        text,
        truncation=True,
        padding='max_length',
        max_length=cfg.max_length,
        return_tensors='pt'
    )

    input_ids = enc['input_ids'].to(cfg.device)
    attention_mask = enc['attention_mask'].to(cfg.device)

    logits = model(input_ids, attention_mask)
    probs = torch.softmax(logits, dim=1).squeeze(0).cpu().numpy()

    pred_id = int(probs.argmax())
    return pred_id, class_list[pred_id], float(probs[pred_id]), probs


text = 'Oil prices rise as traders fear supply disruptions'
pred_id, pred_label, conf, _ = predict_one(text, model, tokenizer, cfg, class_list)
(pred_id, pred_label, conf)


(0, 'business', 0.9633949398994446)
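
The confidence returned by `predict_one` is the softmax probability of the top class. A minimal pure-Python sketch of that last step follows; the logits are invented for illustration and are not the model's actual outputs.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class_list = ['business', 'entertainment', 'politics', 'sport', 'tech']
logits = [4.0, 0.5, 0.2, -1.0, 0.1]           # invented scores, one per class

probs = softmax(logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)   # argmax over classes

print(pred_id, class_list[pred_id], round(probs[pred_id], 4))
```

A high softmax value indicates a large margin between logits; it is not a calibrated probability, so treat it as a rough confidence signal.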

## 12) Notes for adapting to your local environment
- If you run on your own machine, set the `CSV_PATH` environment variable before creating `Config()` (it defaults to `/home/mywsl/Workspace/NLP/data/bbc_text_cls.csv`), or assign `cfg.CSV_PATH` before running the load cell.
- For faster experiments, try `distilbert-base-uncased` (change `cfg.bert_name`).
