# Text Classification

First Try: We fine-tune BERT as a text classifier, updating all of its weights (full fine-tuning).

1. Download the 20 newsgroups dataset from sklearn
2. Load pretrained BERT from Hugging Face
3. Train BERT as a classifier

Second Try: We train a text classifier on top of fixed embeddings from BERT.

1. Download the 20 newsgroups dataset from sklearn
2. Load pretrained BERT from Hugging Face and use it to embed the texts
3. Define a classifier
4. Train the classifier on the embedded texts

## 1. Load Data from sklearn and create Dataset and DataLoader objects

In [1]:
# Load the data from sklearn
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(data_home=".")

In [2]:
print(newsgroups.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

Classes                     20
Samples total            18846
Dimensionality               1
Features                  text

In [3]:
data = list(newsgroups.data)
targets = list(newsgroups.target)
classes = list(newsgroups.target_names)
num_classes = len(classes)

In [4]:
# Put the texts and labels into a pandas DataFrame
import pandas as pd
df = pd.DataFrame(data={"text":data, "label":targets})
df.head()

| | text | label |
|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 |


In [5]:
classes

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

## 2. Load pretrained BERT from hugging face

In [6]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
model_id = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_id) #, do_lower_case=True) #BertTokenizerFast?
model = BertForSequenceClassification.from_pretrained(model_id, num_labels=num_classes, 
                                                      output_attentions=False,
                                                      output_hidden_states=False)
model

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

## 3. Train BERT for Classification

In [7]:
len(df.text.values)

11314

In [8]:
#### Tokenization - this may take a while: 5 - 10 seconds
encoded_data = tokenizer.batch_encode_plus(df.text.values, 
                                           add_special_tokens=True, 
                                           return_attention_mask=True,
                                           #pad_to_max_length=True,
                                           padding='max_length',
                                           truncation=True,
                                           max_length=256,
                                           return_tensors="pt")
input_ids = encoded_data["input_ids"]
attention_masks = encoded_data["attention_mask"]

labels = torch.tensor(df.label.values) #dtype=torch.long
labels.dtype

torch.int64
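
To make the `padding='max_length'` and `truncation=True` arguments more concrete, here is a toy sketch of their effect on a list of token IDs. `pad_or_truncate` is a hypothetical helper written only for illustration, not the tokenizer's actual implementation. It assumes pad ID 0 (matching `padding_idx=0` in the model printout) and ignores the handling of special tokens.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy illustration: bring a list of token IDs to exactly max_length."""
    ids = list(token_ids[:max_length])      # truncation: drop everything past max_length
    attention_mask = [1] * len(ids)         # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad                 # padding: fill up with the pad ID
    attention_mask += [0] * n_pad           # 0 marks padding, which attention ignores
    return ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 2013, 1024, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every row ends up with the same length, which is why the tokenizer can return one rectangular tensor per field.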

In [14]:
df.text.values[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [13]:
input_ids[0]

tensor([  101,  2013,  1024,  3393,  2099,  2595,  3367,  1030, 11333,  2213,
         1012,  8529,  2094,  1012,  3968,  2226,  1006,  2073,  1005,  1055,
         2026,  2518,  1007,  3395,  1024,  2054,  2482,  2003,  2023,   999,
         1029,  1050,  3372,  2361,  1011, 14739,  1011,  3677,  1024, 10958,
         2278,  2509,  1012, 11333,  2213,  1012,  8529,  2094,  1012,  3968,
         2226,  3029,  1024,  2118,  1997,  5374,  1010,  2267,  2380,  3210,
         1024,  2321,  1045,  2001,  6603,  2065,  3087,  2041,  2045,  2071,
         4372,  7138,  2368,  2033,  2006,  2023,  2482,  1045,  2387,  1996,
         2060,  2154,  1012,  2009,  2001,  1037,  1016,  1011,  2341,  2998,
         2482,  1010,  2246,  2000,  2022,  2013,  1996,  2397, 20341,  1013,
         2220, 17549,  1012,  2009,  2001,  2170,  1037,  5318,  4115,  1012,
         1996,  4303,  2020,  2428,  2235,  1012,  1999,  2804,  1010,  1996,
         2392, 21519,  2001,  3584,  2013,  1996,  2717,  1997, 

In [18]:
#### Training Parameters
batch_size = 32
epochs = 5
learning_rate = 1e-5
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
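# No explicit loss function: BertForSequenceClassification computes the cross-entropy loss itself when labels are passed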
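
The loop in the next cell slices the tensors in their original order. One common alternative, sketched here with small random stand-in tensors (all `toy_*` names are hypothetical), is to wrap the tensors in a `TensorDataset` and let a `DataLoader` handle batching and per-epoch shuffling.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-ins for input_ids, attention_masks and labels (shapes are illustrative only)
toy_input_ids = torch.randint(0, 30522, (10, 8))
toy_attention_masks = torch.ones(10, 8, dtype=torch.long)
toy_labels = torch.randint(0, 20, (10,))

toy_dataset = TensorDataset(toy_input_ids, toy_attention_masks, toy_labels)
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)  # reshuffled every epoch

batch_shapes = []
for ids, mask, y in toy_loader:
    # In a real loop these would be moved to the device and passed to the model
    batch_shapes.append((tuple(ids.shape), tuple(mask.shape), tuple(y.shape)))
print(batch_shapes)
```

The last batch is smaller when the number of samples is not a multiple of the batch size, just as with manual slicing.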

In [19]:
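# Move the model to the device ("mps" is the Apple Silicon GPU; use "cuda" or "cpu" on other machines)
device = torch.device("mps")
model.to(device)
for epoch in range(epochs):
    model.train()
    total_loss = 0
    j = 0  # number of batches processed in this epoch
    for i in range(0, input_ids.size(0), batch_size): # start - stop - step
        j += 1
        optimizer.zero_grad()
        # Passing labels makes the model return the cross-entropy loss in outputs.loss
        outputs = model(input_ids[i:i+batch_size].to(device), 
                        attention_mask=attention_masks[i:i+batch_size].to(device),
                        labels=labels[i:i+batch_size].to(device))
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

        # Note: loss.item() is already the mean over the batch, so dividing the sum by the
        # number of samples reports roughly (mean loss / batch_size), not the mean loss itself
        if j%100 == 0:
            print(f'Epoch {epoch} {j} Training loss: {total_loss/input_ids[0:i+batch_size].size(0)}')

    print(f'Epoch {epoch} Training loss: {total_loss/input_ids.size(0)}')
    # Save the whole (pickled) model object after each epoch
    torch.save(model, f"bert_{epoch}.bin")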


Epoch 0 100 Training loss: 0.08699866659939288
Epoch 0 200 Training loss: 0.0741283438168466
Epoch 0 300 Training loss: 0.06370054557298621
Epoch 0 Training loss: 0.059356899381168574
Epoch 1 100 Training loss: 0.029636690225452185
Epoch 1 200 Training loss: 0.02620628282893449
Epoch 1 300 Training loss: 0.023800453391547006
Epoch 1 Training loss: 0.022785322856245602
Epoch 2 100 Training loss: 0.015473079290241004
Epoch 2 200 Training loss: 0.01399453843710944
Epoch 2 300 Training loss: 0.012954272259958089
Epoch 2 Training loss: 0.012418476688539432
Epoch 3 100 Training loss: 0.008922530626878143
Epoch 3 200 Training loss: 0.008337276668753476
Epoch 3 300 Training loss: 0.007830981435254215
Epoch 3 Training loss: 0.007551219072657169
Epoch 4 100 Training loss: 0.005645932378247381
Epoch 4 200 Training loss: 0.005474376626079902
Epoch 4 300 Training loss: 0.005101131113090863
Epoch 4 Training loss: 0.00492341618932914
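
The printed values are the sum of per-batch mean losses divided by the number of samples, so they are smaller than the actual mean loss by roughly a factor of the batch size. A small sketch of the conversion (`mean_batch_loss` is a hypothetical helper, and it assumes every batch except possibly the last holds `batch_size` samples):

```python
import math

def mean_batch_loss(reported_loss, num_samples, batch_size):
    """Convert 'sum of batch-mean losses / num_samples' into the mean loss per batch."""
    num_batches = math.ceil(num_samples / batch_size)
    total_loss = reported_loss * num_samples   # undo the division by the sample count
    return total_loss / num_batches

# Final value printed for epoch 4, with 11314 samples and batch_size = 32
epoch4 = mean_batch_loss(0.00492341618932914, num_samples=11314, batch_size=32)
print(round(epoch4, 3))
```

The rescaled number is the one to compare against cross-entropy values reported elsewhere.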


In [20]:
# Alternative: save and reload only the weights (state_dict) instead of the whole model object
#torch.save(model.state_dict(), f"bert_{epoch}.pt")
#model = BertForSequenceClassification.from_pretrained(model_id, num_labels=num_classes, 
#                                                      output_attentions=False,
#                                                      output_hidden_states=False)
#model.load_state_dict(torch.load(f"bert_{epoch}.pt", weights_only=True))
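
The state_dict route from the commented-out lines can be tried on a tiny stand-in model. This is a minimal sketch: `toy_model` and the file name are hypothetical, and an `nn.Linear` layer takes the place of the fine-tuned BERT model.

```python
import os
import tempfile
import torch
import torch.nn as nn

toy_model = nn.Linear(4, 2)                       # stand-in for the fine-tuned model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pt")  # hypothetical file name
    torch.save(toy_model.state_dict(), path)      # save only the tensors, not the class

    restored = nn.Linear(4, 2)                    # recreate the architecture first
    restored.load_state_dict(torch.load(path, weights_only=True))

same_weights = torch.equal(toy_model.weight, restored.weight)
print(same_weights)
```

Because only tensors are stored, the file can be loaded with `weights_only=True`, whereas a fully pickled model requires `weights_only=False`.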

In [15]:
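import torch
# weights_only=False is required because torch.save(model, ...) pickled the whole model object, not just a state_dict
model = torch.load("/Users/done/Documents/Hochschule/WS 2024/WI-Intelligente-Informationssysteme/archive/bert_4.bin", weights_only=False)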

In [None]:
#### Evaluate the model
# Note: input_ids and labels are the training data, so this reports training accuracy, not test accuracy
accuracy = 0  # running count of correct predictions
batch_size = 1000
device = torch.device("mps")
model.to(device)
model.eval()
with torch.no_grad():
    for i in range(0, input_ids.size(0), batch_size): # start - stop - step
        outputs = model(input_ids[i:i+batch_size].to(device), 
                        attention_mask=attention_masks[i:i+batch_size].to(device))
        predictions = torch.argmax(outputs.logits, dim=1)
        accuracy += torch.sum(predictions == labels[i:i+batch_size].to(device)).item()
        print(i,"/",input_ids.size(0),":",accuracy/labels[0:i+batch_size].size(0))
print("Accuracy:", accuracy/input_ids.size(0))

0 / 11314 : 0.971
1000 / 11314 : 0.976
2000 / 11314 : 0.9746666666666667
3000 / 11314 : 0.9715
4000 / 11314 : 0.9716
5000 / 11314 : 0.9721666666666666
6000 / 11314 : 0.9727142857142858
7000 / 11314 : 0.97275
8000 / 11314 : 0.9741111111111111
9000 / 11314 : 0.9754
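
The accuracy above is measured on the same posts the model was trained on. To estimate generalization, the same loop could be run on held-out data, for example the `subset="test"` part of `fetch_20newsgroups` after tokenizing it the same way. The sketch below shows only the batched accuracy computation on toy tensors. `batched_accuracy` and the identity "model" are hypothetical stand-ins.

```python
import torch

def batched_accuracy(predict_logits, inputs, labels, batch_size):
    """Fraction of correct argmax predictions, computed batch by batch."""
    correct = 0
    with torch.no_grad():
        for i in range(0, inputs.size(0), batch_size):
            logits = predict_logits(inputs[i:i + batch_size])
            correct += (logits.argmax(dim=1) == labels[i:i + batch_size]).sum().item()
    return correct / inputs.size(0)

# Toy stand-in "model": the logits are the inputs themselves
held_out_x = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
held_out_y = torch.tensor([1, 0, 0, 0])
acc = batched_accuracy(lambda x: x, held_out_x, held_out_y, batch_size=3)
print(acc)
```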


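# Second Try

Result: this approach does not work well. Test accuracy stays around 6% in the run below, close to the 5% chance level for 20 classes.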

In [12]:
# Load the data from sklearn
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(data_home=".")
data = list(newsgroups.data)
targets = list(newsgroups.target)
classes = list(newsgroups.target_names)
num_classes = len(classes)

In [13]:
# Build a pytorch custom Dataset
# A custom Dataset class must have these three methods: __init__ , __getitem__ , __len__
# see: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
import torch
from torch.utils.data import Dataset
from transformers import BertTokenizer, BertModel

model_id = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_id) #BertTokenizer
model = BertModel.from_pretrained(model_id)         #BertModel
model.eval()


def preprocess(text):
    # Tokenize a single text, padded or truncated to 256 tokens
    encoded_inputs = tokenizer(text, 
                               add_special_tokens=True, 
                               return_attention_mask=True,
                               padding="max_length", 
                               truncation=True, 
                               max_length=256, 
                               return_tensors='pt')
    
    with torch.no_grad():
        outputs = model(**encoded_inputs)
    
    # last_hidden_state: [batch_size, sequence_length, hidden_size] = [1, 256, 768]
    # Summing over the sequence dimension gives [1, 768]; note that the sum also includes padding positions
    return torch.sum(outputs.last_hidden_state, 1)


class NewsgroupDataset(Dataset):

    def __init__(self, data: list, labels: list, classes: list, preprocess=None):
        self.data = data
        self.labels = labels
        self.classes = classes
        self.preprocess = preprocess

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        text = self.data[idx]
        label = self.labels[idx]
        if self.preprocess:
            # Use the function passed to the constructor, not the global preprocess
            return self.preprocess(text), int(label)
        else:
            return text, int(label)

    def get_classes(self):
        return self.classes

    def num_classes(self):
        return len(self.classes)
        
dataset = NewsgroupDataset(data=data, labels=targets, classes=classes, preprocess=preprocess)

print(len(dataset))
x,y = dataset[0]
print(x.shape)

11314
torch.Size([1, 768])
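
The `preprocess` function sums over all 256 positions, padding included. A frequently used alternative is to average only over real tokens by using the attention mask. The following is a minimal sketch on toy tensors. `masked_mean_pool` is a hypothetical helper and not part of the notebook, and whether it improves the results here is untested.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over real tokens only (padding has mask 0)."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [batch, seq, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                   # [batch, hidden]
    counts = mask.sum(dim=1).clamp(min=1)                            # avoid division by zero
    return summed / counts

# Toy input: 1 sequence, 3 positions (the last one is padding), hidden size 2
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

In the notebook, the inputs would be `outputs.last_hidden_state` and `encoded_inputs["attention_mask"]`.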


In [14]:
# Split into test and train
train_size = int(0.7 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])
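
`random_split` draws a new random partition on every run. If a reproducible split is wanted, a seeded `torch.Generator` can be passed in. The sketch uses a small `TensorDataset` as a stand-in for the newsgroup dataset, and the seed value is arbitrary.

```python
import torch
from torch.utils.data import TensorDataset, random_split

toy_dataset = TensorDataset(torch.arange(10))   # stand-in for the NewsgroupDataset
train_size = int(0.7 * len(toy_dataset))
test_size = len(toy_dataset) - train_size

def split(seed):
    # A generator with a fixed seed makes the permutation repeatable
    generator = torch.Generator().manual_seed(seed)
    return random_split(toy_dataset, [train_size, test_size], generator=generator)

train_a, test_a = split(seed=42)
train_b, test_b = split(seed=42)
print(train_a.indices == train_b.indices)
```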

In [15]:
# Create DataLoader
from torch.utils.data import DataLoader
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True) 
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)  # shuffling is not needed for evaluation
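
Because `preprocess` is called inside `__getitem__`, BERT runs again for every text in every epoch. One possible way around this is to compute each embedding once and keep the results in a `TensorDataset`. In this sketch, `toy_embed` is a hypothetical stand-in for `preprocess`, and the sample texts and labels are made up.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

embed_calls = []  # records every call so the number of embedding passes can be checked

def toy_embed(text):
    """Stand-in for preprocess(text): returns a [1, 4] 'embedding'."""
    embed_calls.append(text)
    return torch.full((1, 4), float(len(text)))

texts = ["first post", "second", "third one"]
labels = [7, 4, 1]

# Compute every embedding exactly once, then drop the extra dimension: [N, 1, 4] -> [N, 4]
cached = torch.stack([toy_embed(t) for t in texts]).squeeze(1)
cached_dataset = TensorDataset(cached, torch.tensor(labels))
cached_loader = DataLoader(cached_dataset, batch_size=2, shuffle=True)

for epoch in range(3):        # iterating over several epochs triggers no further embedding calls
    for x, y in cached_loader:
        pass
print(len(embed_calls), cached.shape)
```

With the embeddings already two-dimensional per batch, the classifier output no longer needs to be squeezed.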

## 3. Define a classifier

In [16]:
import torch.nn as nn
device = torch.device("mps")
classifier = nn.Linear(768, dataset.num_classes())
classifier.to(device)
classifier

Linear(in_features=768, out_features=20, bias=True)

## 4. Train only the classifier

In [17]:
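loss_function = nn.CrossEntropyLoss()  
# Note: 2e-5 is a typical rate for fine-tuning BERT itself; a freshly initialized linear layer may need a larger one (e.g. 1e-3)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=2e-5) 
#see: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html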

In [18]:
import numpy as np

def train(classifier, optimizer, train_loader, loss_function, num_classes):
    # Train the classifier for one epoch on the BERT embeddings
    # (num_classes is kept for the existing call signature but is no longer needed)
    classifier.train()
    total_loss = 0
    j = 0  # number of batches processed
    for batch in train_loader:
        j += 1
        optimizer.zero_grad()
        embeddings, labels = batch
        outputs = classifier(embeddings.to(device))
        # [batch, 1, num_classes] -> [batch, num_classes]; squeezing only dim 1 keeps the batch dimension for a batch of size 1
        outputs = torch.squeeze(outputs, 1)
        # CrossEntropyLoss accepts integer class indices directly, so no one-hot encoding is needed
        loss = loss_function(outputs, labels.to(device))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        # loss.item() is a batch mean, so the running average is taken over batches
        print(f'Training loss: {total_loss/j}')
    print(f'Training loss: {total_loss/len(train_loader)}')
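
`nn.CrossEntropyLoss` can take its targets in two forms: integer class indices, or float class probabilities (available in recent PyTorch versions). One-hot vectors are a special case of the second form and give the same loss value as the plain indices, which this small check on random logits illustrates. The tensor values are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(5, 20)                  # [batch, num_classes]
labels = torch.tensor([7, 4, 4, 1, 14])      # integer class indices

loss_function = nn.CrossEntropyLoss()
loss_from_indices = loss_function(logits, labels)

# One-hot float targets are treated as class probabilities
one_hot = nn.functional.one_hot(labels, num_classes=20).float()
loss_from_one_hot = loss_function(logits, one_hot)

print(torch.allclose(loss_from_indices, loss_from_one_hot))
```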
    

In [38]:
def evaluate(classifier, test_loader, loss_function):
    classifier.eval()
    total_loss = 0
    total_acc = 0
    j = 0  # number of samples seen so far
    with torch.no_grad():
        for batch in test_loader:
            embeddings, labels = batch
            j += len(labels)
            outputs = classifier(embeddings.to(device))
            # [batch, 1, num_classes] -> [batch, num_classes]
            outputs = torch.squeeze(outputs, 1)
            labels = labels.to(device)
            # CrossEntropyLoss accepts integer class indices directly
            loss = loss_function(outputs, labels)
            total_loss += loss.item()
            predictions = torch.argmax(outputs, dim=1)
            total_acc += (predictions == labels).sum().item()
            # Note: the loss is a batch mean, so total_loss/j is roughly (mean loss / batch size)
            print(f'{j} of {len(test_dataset)}  Test loss: {total_loss/j} Test acc: {total_acc/j*100}%')
    print(f'Test loss: {total_loss/len(test_dataset)} Test acc: {total_acc/len(test_dataset)*100}%')

In [39]:
for epoch in range(3):
    print(f"####### EPOCH {epoch} #######")
    train(classifier, optimizer, train_loader, loss_function, num_classes=dataset.num_classes())
    evaluate(classifier, test_loader, loss_function)

####### EPOCH 0 #######
64 of 3395  Test loss: 0.42303887009620667 Test acc: 9.375%
128 of 3395  Test loss: 0.4134484529495239 Test acc: 10.15625%
192 of 3395  Test loss: 0.4286746084690094 Test acc: 7.8125%
256 of 3395  Test loss: 0.4171008840203285 Test acc: 8.59375%
320 of 3395  Test loss: 0.41969203352928164 Test acc: 7.187499999999999%
384 of 3395  Test loss: 0.4212421327829361 Test acc: 6.25%
448 of 3395  Test loss: 0.4302159547805786 Test acc: 5.580357142857143%
512 of 3395  Test loss: 0.42967483401298523 Test acc: 5.859375%
576 of 3395  Test loss: 0.42965347237057155 Test acc: 5.729166666666666%
640 of 3395  Test loss: 0.42942704558372496 Test acc: 5.9375%
704 of 3395  Test loss: 0.4317957813089544 Test acc: 6.392045454545454%
768 of 3395  Test loss: 0.43528714776039124 Test acc: 5.859375%
832 of 3395  Test loss: 0.4307731023201576 Test acc: 6.25%
896 of 3395  Test loss: 0.4327816388436726 Test acc: 6.138392857142857%
960 of 3395  Test loss: 0.4341835121313731 Test acc: 5.9375%

KeyboardInterrupt: 
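
To put the accuracy values above in context, it helps to compare them with trivial baselines: uniform guessing and always predicting the most frequent class. The sketch below uses a short made-up label list. In the notebook, the list `targets` would take its place, and `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

num_classes = 20
chance_level = 1 / num_classes            # uniform guessing over 20 classes

# Toy labels for illustration only
toy_labels = [7, 4, 4, 1, 14, 4, 7, 1]
baseline = majority_baseline_accuracy(toy_labels)
print(chance_level, baseline)
```

A classifier that has learned something useful should clearly beat both numbers.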

In [1]:
########### Backup #############
import torch
from transformers import BertTokenizer, BertModel
model_id = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_id) #, do_lower_case=True) #BertTokenizerFast?
model = BertModel.from_pretrained(model_id, output_hidden_states=False)
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

# Working with datasets

## PyTorch Datasets

Build a custom PyTorch dataset by subclassing the `Dataset` class. A custom Dataset class must implement these three methods: `__init__`, `__getitem__`, and `__len__`.

see: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
see: https://rumn.medium.com/how-to-create-a-custom-pytorch-dataset-with-a-csv-file-e64b89bc2dcc
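
As a minimal sketch of those three methods, the class below reads (text, label) rows from a CSV file. The class name `CsvTextDataset`, the column names `text` and `label`, and the sample rows are assumptions made for this example.

```python
import csv
import os
import tempfile
from torch.utils.data import Dataset

class CsvTextDataset(Dataset):
    """Minimal custom Dataset reading (text, label) rows from a CSV file."""

    def __init__(self, csv_path):
        with open(csv_path, newline="", encoding="utf-8") as f:
            self.rows = list(csv.DictReader(f))

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        return row["text"], int(row["label"])

# Write a tiny CSV file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "data.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows([["text", "label"], ["WHAT car is this!?", 7], ["Mac for sale", 4]])
    csv_dataset = CsvTextDataset(csv_path)

print(len(csv_dataset), csv_dataset[0])
```

All rows are read in `__init__`, so the dataset stays usable after the temporary file is gone.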

## Transformer Datasets

Use the `datasets` library from Hugging Face: `pip install datasets`

see: https://huggingface.co/docs/datasets/index

```
from datasets import load_dataset  
my_dataset = load_dataset("csv", data_files="data.csv")
split = my_dataset['train'].train_test_split(test_size=0.3, seed=42)
```

In [None]:
# see: https://medium.com/@lokaregns/fine-tuning-transformers-with-custom-dataset-classification-task-f261579ae068