# CPV Classifier POC

## 3.3 - Fine-Tune Transformers Classifier

A simple proof of concept for a transformer-based classifier of Common Procurement Vocabulary (CPV) codes, built on data from TheyBuyForYou.

See https://theybuyforyou.eu/ for background on TheyBuyForYou and http://data.tbfy.eu/ for information on the Knowledge Graph (KG) data created as part of that project. The KG data used in this proof of concept is made available under the license terms quoted below, and the same license therefore applies to the code and data in this repository.

> The KG data is provided under the Creative Commons BY-NC-SA 4.0 License, which allows you to use, share and adapt the data for non-commercial uses as long as you give appropriate credit and share any adapted data under the same license as the original. If you wish to use the data for commercial uses please contact the TheyBuyForYou project.

The full CPV listing included in this repo was downloaded from https://simap.ted.europa.eu/cpv.

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
import torch
import shelve

2022-06-09 23:18:19.591858: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


## Load data

In [2]:
with shelve.open("data/train_val.shelf") as db:
    sents_train = db["sents_train"]
    sents_val = db["sents_val"]
    cpv_train = db["cpv_train"]
    cpv_val = db["cpv_val"]
    label2id = db["label2id"]
    id2label = db["id2label"]

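The shelf file is assumed to have been written by an earlier step in this series, which is not shown here. As a rough sketch of the layout the loading cell expects, the round trip might look like the following. The sample sentences, the label strings, and the temporary path are illustrative assumptions, not the project's data.

```python
import os
import shelve
import tempfile

# Hypothetical toy data standing in for the real sentences and CPV labels.
sents = ["road resurfacing works", "laptop supply", "bridge repair"]
cpvs = ["45", "30", "45"]

# Map each distinct label string to a contiguous integer id, and back.
labels = sorted(set(cpvs))
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.shelf")
    # Write: labels are stored as integer class ids so they can become tensors later.
    with shelve.open(path) as db:
        db["sents_train"] = sents
        db["cpv_train"] = [label2id[c] for c in cpvs]
        db["label2id"] = label2id
        db["id2label"] = id2label
    # Read back, mirroring the loading cell above.
    with shelve.open(path) as db:
        loaded_sents = db["sents_train"]
        loaded_cpv = db["cpv_train"]
        loaded_id2label = db["id2label"]

decoded = [loaded_id2label[i] for i in loaded_cpv]
print(loaded_cpv)
print(decoded)
```

Keeping both `label2id` and `id2label` in the shelf means the same mapping can be attached to the model config later, so predictions can be decoded back to label strings.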
## Prepare model

In [3]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [4]:
train_encodings = tokenizer(sents_train, truncation=True, padding=True)
val_encodings = tokenizer(sents_val, truncation=True, padding=True)
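With `truncation=True` and `padding=True`, the sequences passed in one call are cut at the model's maximum length and then padded to the longest sequence in that call. The pure-Python sketch below illustrates the effect on token ids that are already available. It is a simplified illustration and not the tokenizer's implementation: `pad_and_truncate`, the toy ids, and `max_length=6` are assumptions made for demonstration.

```python
from typing import Dict, List

PAD_ID = 0  # id of [PAD] in the BERT uncased vocabulary


def pad_and_truncate(batch: List[List[int]], max_length: int) -> Dict[str, List[List[int]]]:
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    # Simplification: a real tokenizer keeps the closing special token when truncating.
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy id sequences of different lengths (arbitrary values).
toy = [[101, 2346, 102], [101, 2346, 2573, 2005, 1996, 2103, 2382, 102]]
enc = pad_and_truncate(toy, max_length=6)
print(enc["input_ids"])
print(enc["attention_mask"])
```

Because each split is tokenized in a single call here, every example in a split ends up padded to that split's longest (possibly truncated) sequence. If memory becomes a concern, padding per batch at collation time is a common alternative.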

In [5]:
class CPVDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = CPVDataset(train_encodings, cpv_train)
val_dataset = CPVDataset(val_encodings, cpv_val)

In [6]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(id2label))
model.config.id2label = id2label
model.config.label2id = label2id

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

## Fine Tuning

In [7]:
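# Use accuracy metric.
# Note: `load_metric` is deprecated in later `datasets` releases in favour of the `evaluate` library.
from datasets import load_metric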
import numpy as np

metric = load_metric("accuracy")

def compute_accuracy(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
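The metric above reduces to an argmax over the class dimension followed by a mean over examples. A minimal NumPy-only sketch of the same computation is shown below. `accuracy_from_logits` and the toy logits are hypothetical names and values, used only to make the arithmetic concrete.

```python
import numpy as np


def accuracy_from_logits(logits: np.ndarray, labels: np.ndarray) -> dict:
    """Top-1 accuracy: fraction of rows whose highest-scoring class equals the label."""
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float(np.mean(predictions == labels))}


# Toy logits for 4 examples over 3 classes (illustrative values only).
logits = np.array([
    [2.0, 0.1, -1.0],   # argmax is class 0
    [0.3, 0.2, 1.5],    # argmax is class 2
    [-0.5, 0.9, 0.4],   # argmax is class 1
    [1.1, 1.0, 0.2],    # argmax is class 0
])
labels = np.array([0, 2, 2, 0])

result = accuracy_from_logits(logits, labels)
print(result)  # three of the four predictions match the labels
```

The function returns a dict keyed by metric name, matching the shape `Trainer` expects from `compute_metrics`; the key then appears with an `eval_` prefix in the evaluation output.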

In [8]:
training_args = TrainingArguments(
    output_dir='./output',           # output directory
    num_train_epochs=15,             # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    save_total_limit=2,              # keep at most two checkpoints on disk
    save_steps=10_000,               # save a checkpoint every 10,000 steps
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10_000,            # log training loss every 10,000 steps
    evaluation_strategy="epoch",     # evaluate on the validation set after each epoch
    optim="adamw_torch",             # default optimizer is deprecated now
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    compute_metrics=compute_accuracy,
)

In [9]:
from datetime import datetime

start = datetime.now()
trainer.train()
finish = datetime.now()

print(f"Completed in {finish - start}")

***** Running training *****
  Num examples = 221288
  Num Epochs = 15
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 207465
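The "Total optimization steps" figure in the log can be reproduced from the example count, batch size, and epoch count. The sketch below assumes a single device, no gradient accumulation (both consistent with the log), and that the final partial batch of each epoch is kept.

```python
import math

num_examples = 221_288   # "Num examples" in the log above
batch_size = 16          # total train batch size reported in the log
num_epochs = 15
save_steps = 10_000      # checkpoint interval from the training arguments

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Number of periodic checkpoints written over the whole run.
num_checkpoints = total_steps // save_steps

print(steps_per_epoch)
print(total_steps)
print(num_checkpoints)
```

With `save_total_limit=2`, only the most recent of those periodic checkpoints are kept on disk at any time.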


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


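| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 2.2131        | 1.549688        | 0.57951  |
| 2     | 1.4834        | 1.307204        | 0.648446 |
| 3     | 0.9362        | 1.200928        | 0.699976 |
| 4     | 0.6919        | 1.190173        | 0.726777 |
| 5     | 0.5572        | 1.266526        | 0.740198 |
| 6     | 0.3344        | 1.342455        | 0.757646 |
| 7     | 0.2712        | 1.471016        | 0.764275 |
| 8     | 0.1957        | 1.635881        | 0.765861 |
| 9     | 0.1428        | 1.849004        | 0.773304 |
| 10    | 0.1284        | 2.022361        | 0.773182 |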
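In the epochs listed above, validation loss reaches its minimum early while accuracy keeps improving, so the "best" epoch depends on which criterion is used. The small sketch below picks the best row under each criterion. It covers only the ten epochs shown in the output, and the variable names are illustrative.

```python
# (epoch, training loss, validation loss, accuracy) copied from the results above.
history = [
    (1, 2.2131, 1.549688, 0.57951),
    (2, 1.4834, 1.307204, 0.648446),
    (3, 0.9362, 1.200928, 0.699976),
    (4, 0.6919, 1.190173, 0.726777),
    (5, 0.5572, 1.266526, 0.740198),
    (6, 0.3344, 1.342455, 0.757646),
    (7, 0.2712, 1.471016, 0.764275),
    (8, 0.1957, 1.635881, 0.765861),
    (9, 0.1428, 1.849004, 0.773304),
    (10, 0.1284, 2.022361, 0.773182),
]

# Best epoch by lowest validation loss versus by highest accuracy.
best_by_loss = min(history, key=lambda row: row[2])
best_by_accuracy = max(history, key=lambda row: row[3])

print("lowest validation loss at epoch", best_by_loss[0])
print("highest accuracy at epoch", best_by_accuracy[0])
```

A widening gap between falling training loss and rising validation loss is commonly read as a sign of overfitting, even when top-1 accuracy is still creeping upward.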


Saving model checkpoint to ./output/checkpoint-10000
Configuration saved in ./output/checkpoint-10000/config.json
Model weights saved in ./output/checkpoint-10000/pytorch_model.bin
Deleting older checkpoint [output/checkpoint-20000] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 24588
  Batch size = 64
Saving model checkpoint to ./output/checkpoint-20000
Configuration saved in ./output/checkpoint-20000/config.json
Model weights saved in ./output/checkpoint-20000/pytorch_model.bin
Deleting older checkpoint [output/checkpoint-30000] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 24588
  Batch size = 64
Saving model checkpoint to ./output/checkpoint-30000
Configuration saved in ./output/checkpoint-30000/config.json
Model weights saved in ./output/checkpoint-30000/pytorch_model.bin
Deleting older checkpoint [output/checkpoint-10000] due to args.save_total_limit
Saving model checkpoint to ./output/checkpoint-40000
Configuration sav

Completed in 8:47:47.526466


In [10]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 24588
  Batch size = 64


{'eval_loss': 2.3579022884368896,
 'eval_accuracy': 0.7824141857816821,
 'eval_runtime': 68.7178,
 'eval_samples_per_second': 357.811,
 'eval_steps_per_second': 5.603,
 'epoch': 15.0}
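The throughput figures in this result follow directly from the validation set size, the evaluation batch size, and the reported runtime. A short arithmetic check, assuming a single device and a final partial batch:

```python
import math

num_examples = 24_588    # "Num examples" in the evaluation log
batch_size = 64          # per_device_eval_batch_size
runtime_s = 68.7178      # eval_runtime from the result above

# Number of evaluation batches, counting the final partial one.
eval_steps = math.ceil(num_examples / batch_size)
samples_per_second = num_examples / runtime_s
steps_per_second = eval_steps / runtime_s

print(eval_steps)
print(round(samples_per_second, 3))
print(round(steps_per_second, 3))
```

Both rates agree with `eval_samples_per_second` and `eval_steps_per_second` in the output to the precision shown.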

In [11]:
model.save_pretrained('./models/transformers')

Configuration saved in ./models/transformers/config.json
Model weights saved in ./models/transformers/pytorch_model.bin
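Only the model weights and config are saved here; the tokenizer is not, so reloading for inference would presumably pair the saved model with the `distilbert-base-uncased` tokenizer again. Since `id2label` was set on `model.config`, it is written to `config.json` along with the rest of the configuration. Once logits are available, turning them into ranked labels is a softmax plus a lookup in that mapping. A minimal NumPy sketch follows, in which `top_k_labels`, the logits, and the three-entry `id2label` are hypothetical stand-ins rather than values from this model.

```python
import numpy as np


def top_k_labels(logits, id2label, k=2):
    """Return the k most probable (label, probability) pairs for one example."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the max before exponentiating for numerical stability.
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    # Indices sorted by descending probability, truncated to k.
    order = np.argsort(probs)[::-1][:k]
    return [(id2label[int(i)], float(probs[i])) for i in order]


# Hypothetical mapping and logits for a single input sentence.
id2label = {0: "30", 1: "45", 2: "72"}
ranked = top_k_labels([0.2, 2.5, 1.0], id2label, k=2)
print(ranked)
```

Returning a few ranked candidates instead of a single label can be useful when a human reviewer makes the final CPV assignment.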
