# CPV Classifier POC

## 2 - Fine-Tune Classifier

A simple proof-of-concept transformer-based classifier that predicts Common Procurement Vocabulary (CPV) codes from tender notice text, trained on data from TheyBuyForYou.

See https://theybuyforyou.eu/ for background on TheyBuyForYou and http://data.tbfy.eu/ for information on the Knowledge Graph (KG) data created as part of that project. The KG data used in this proof of concept is made available under the license terms quoted below, so the same license applies to the code and data in this repository.

> The KG data is provided under the Creative Commons BY-NC-SA 4.0 License, which allows you to use, share and adapt the data for non-commercial uses as long as you give appropriate credit and share any adapted data under the same license as the original. If you wish to use the data for commercial uses please contact the TheyBuyForYou project.

The full CPV listing included in this repo was downloaded from https://simap.ted.europa.eu/cpv.

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
import torch

## Load and prepare data

In [2]:
# Load every column as a string (CPV codes are identifiers, not numbers)
df = pd.read_json("data/training_data.json", dtype=str)
df

Unnamed: 0,title,mapped_cpv,sent1
0,"APMS Services for Maple Surgery, Cambridge for...",85120000,"Maple Surgery is located at Hanover Close, Cam..."
1,Provision of Compliance Auditing Services,79212000,Dublin Bus is seeking submissions from suitabl...
2,DBC (SF) Pilot of Robotic Process Automation,72220000,Dacorum Borough Council requires support from ...
3,Passenger Transport for 8 Passengers or Less w...,60000000,Passenger transport for 8 passengers or less w...
4,Denbighshire Schools ICT Network Framework,48000000,The aims of this contract is to provide a fram...
...,...,...,...
245879,Delivered Ready Prepared Meals,15890000,The range is for selection of standard ready p...
245880,Traffic Signals Planned and Unplanned Inspecti...,50230000,Renfrewshire Council require a suitably qualif...
245881,The Supply for the Development of Dudley Counc...,73000000,Dudley Council invites providers to submit a q...
245882,LIFE Welsh Raised Bogs — Framework for Peat Re...,16000000,Lot 2: Removal of invasive Scrub:NRW is intend...


In [3]:
unique_classifs = df.mapped_cpv.unique()
print(f"There are {len(unique_classifs)} classifications")

There are 225 classifications


In [4]:
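# Map each CPV code to an integer id, and each id back to its CPV code
label2id = {cpv: idx for idx, cpv in enumerate(unique_classifs)}
id2label = {idx: cpv for cpv, idx in label2id.items()}

# Sanity checks: the mapping is one-to-one and ids start at 0
assert len(label2id) == len(id2label)
assert min(id2label.keys()) == 0
assert set(label2id.values()) == set(id2label.keys())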
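Because `unique()` returns codes in order of first appearance, the ids assigned above depend on the row order of the training file. The following is a minimal sketch, assuming a mapping that stays stable across reshuffled data is wanted; the `build_label_maps` helper and the toy `codes` list are illustrative and not part of the original notebook.

```python
def build_label_maps(codes):
    """Return (label2id, id2label) with ids assigned in sorted code order."""
    ordered = sorted(set(codes))  # sorting removes the dependence on row order
    label2id = {code: idx for idx, code in enumerate(ordered)}
    id2label = {idx: code for code, idx in label2id.items()}
    return label2id, id2label


# Toy CPV codes taken from the rows displayed earlier, with one repeat.
codes = ["85120000", "79212000", "72220000", "79212000", "60000000"]
label2id, id2label = build_label_maps(codes)

# Round trip: every code maps to an id and back to itself.
assert all(id2label[label2id[c]] == c for c in codes)
print(label2id)
```

The same helper could be applied to `df.mapped_cpv` in place of the `enumerate(unique_classifs)` mapping, at the cost of different ids from those shown in this notebook's outputs.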

In [5]:
df["label"] = df.mapped_cpv.map(label2id)   # integer class id for each CPV code
df["text"] = df.title + "\n" + df.sent1     # model input: title and sent1 joined by a newline
df

Unnamed: 0,title,mapped_cpv,sent1,label,text
0,"APMS Services for Maple Surgery, Cambridge for...",85120000,"Maple Surgery is located at Hanover Close, Cam...",0,"APMS Services for Maple Surgery, Cambridge for..."
1,Provision of Compliance Auditing Services,79212000,Dublin Bus is seeking submissions from suitabl...,1,Provision of Compliance Auditing Services\nDub...
2,DBC (SF) Pilot of Robotic Process Automation,72220000,Dacorum Borough Council requires support from ...,2,DBC (SF) Pilot of Robotic Process Automation\n...
3,Passenger Transport for 8 Passengers or Less w...,60000000,Passenger transport for 8 passengers or less w...,3,Passenger Transport for 8 Passengers or Less w...
4,Denbighshire Schools ICT Network Framework,48000000,The aims of this contract is to provide a fram...,4,Denbighshire Schools ICT Network Framework\nTh...
...,...,...,...,...,...
245879,Delivered Ready Prepared Meals,15890000,The range is for selection of standard ready p...,193,Delivered Ready Prepared Meals\nThe range is f...
245880,Traffic Signals Planned and Unplanned Inspecti...,50230000,Renfrewshire Council require a suitably qualif...,202,Traffic Signals Planned and Unplanned Inspecti...
245881,The Supply for the Development of Dudley Counc...,73000000,Dudley Council invites providers to submit a q...,23,The Supply for the Development of Dudley Counc...
245882,LIFE Welsh Raised Bogs — Framework for Peat Re...,16000000,Lot 2: Removal of invasive Scrub:NRW is intend...,137,LIFE Welsh Raised Bogs — Framework for Peat Re...


In [6]:
sents_train, sents_val, cpv_train, cpv_val = train_test_split(
    list(df.text), list(df.label), test_size=0.1
)
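The split above is random and unseeded, so each run produces a different validation set. As a hedged sketch, `train_test_split` also accepts `random_state` and `stratify`, which make the split repeatable and keep class proportions similar in both parts. The toy `texts` and `labels` below stand in for `df.text` and `df.label`, and the seed value 42 is arbitrary. Note that `stratify` raises a `ValueError` if any class has fewer than two samples; whether that applies to the 225 classes here has not been checked.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-ins for df.text and df.label: two classes with 20 samples each.
texts = [f"notice {i}" for i in range(40)]
labels = [i % 2 for i in range(40)]

train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts,
    labels,
    test_size=0.1,
    random_state=42,   # fixed (arbitrary) seed so the split is repeatable
    stratify=labels,   # keep class proportions the same in both splits
)

print(Counter(train_labels), Counter(val_labels))
```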

## Prepare model

In [7]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [8]:
train_encodings = tokenizer(sents_train, truncation=True, padding=True)
val_encodings = tokenizer(sents_val, truncation=True, padding=True)
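With `padding=True` applied to the whole list, every sequence is padded to the longest sequence in that call (capped at the model maximum by `truncation=True`). A rough sketch of why padding per batch can be cheaper is shown below; the `padded_tokens` helper and the token counts are made up for illustration.

```python
def padded_tokens(lengths, batch_size, dynamic):
    """Count token slots after padding, either globally or per batch."""
    if not dynamic:
        # Global padding: every sequence grows to the single longest one.
        return max(lengths) * len(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Dynamic padding: each batch grows only to its own longest sequence.
        total += max(batch) * len(batch)
    return total


lengths = [12, 15, 9, 300, 20, 18, 25, 11]  # made-up token counts
static_cost = padded_tokens(lengths, batch_size=4, dynamic=False)
dynamic_cost = padded_tokens(lengths, batch_size=4, dynamic=True)
print(static_cost, dynamic_cost)
```

In Transformers, per-batch padding is what `DataCollatorWithPadding(tokenizer)` provides when passed to `Trainer` as `data_collator`; using it here would mean tokenizing without `padding=True`, which this notebook does not do and which has not been tested against it.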

In [9]:
class CPVDataset(torch.utils.data.Dataset):
    """Wraps tokenizer encodings and integer labels for use with Trainer."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Build one example: each encoding field (e.g. input_ids, attention_mask) plus its label
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = CPVDataset(train_encodings, cpv_train)
val_dataset = CPVDataset(val_encodings, cpv_val)

In [10]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(unique_classifs))
model.config.id2label = id2label
model.config.label2id = label2id

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

## Fine-tuning

In [11]:
training_args = TrainingArguments(
    output_dir='./output',           # output directory for checkpoints
    num_train_epochs=20,             # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    save_total_limit=2,              # keep at most 2 checkpoints on disk
    save_steps=10_000,               # save a checkpoint every 10,000 steps
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10_000,            # log the training loss every 10,000 steps
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

In [12]:
trainer.train()

Step,Training Loss
10000,2.219874
20000,1.501174
30000,1.225538
40000,0.957589
50000,0.715185
60000,0.587713
70000,0.496321
80000,0.360945
90000,0.303852
100000,0.268923


TrainOutput(global_step=276620, training_loss=0.3694330458302843)
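The reported `global_step=276620` can be cross-checked against the data size and batch settings. The arithmetic below is a back-of-the-envelope sketch, assuming training on a single device without gradient accumulation and a DataFrame of 245,883 rows (index 0 to 245882, as displayed earlier).

```python
import math

n_rows = 245_883   # rows 0..245882 in the DataFrame shown earlier
test_size = 0.1    # fraction held out by train_test_split
batch_size = 16    # per_device_train_batch_size
epochs = 20        # num_train_epochs

n_val = math.ceil(test_size * n_rows)              # scikit-learn rounds the test split up
n_train = n_rows - n_val
steps_per_epoch = math.ceil(n_train / batch_size)  # the last partial batch still counts as a step
total_steps = steps_per_epoch * epochs
print(n_train, steps_per_epoch, total_steps)       # 221294 13831 276620
```

Under those assumptions the total matches the `global_step` reported by `trainer.train()`.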

In [13]:
trainer.evaluate()

{'eval_loss': 2.6237409114837646, 'epoch': 20.0}
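The evaluation output contains only `eval_loss` because no `compute_metrics` function was given to `Trainer`. The gap between the average training loss (about 0.37) and the evaluation loss (about 2.62) may point to overfitting over 20 epochs, but a loss value alone is hard to interpret. A minimal sketch of an accuracy metric follows; the `compute_metrics` body and the toy arrays are illustrative, not part of the original notebook.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Return top-1 accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # highest-scoring class per example
    return {"accuracy": float((predictions == labels).mean())}


# Toy check: 3 examples, 4 classes, made-up logits.
toy_logits = np.array([[2.0, 0.1, 0.0, -1.0],
                       [0.0, 0.5, 3.0, 0.2],
                       [1.0, 0.9, 0.8, 0.7]])
toy_labels = np.array([0, 2, 1])
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

Passing `compute_metrics=compute_metrics` when constructing `Trainer` would make `trainer.evaluate()` report the value as `eval_accuracy` alongside `eval_loss`.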

In [14]:
model.save_pretrained('tbfy_cpv_model')
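`model.save_pretrained` writes the weights and the config (including the `id2label` mapping set earlier), but not the tokenizer; calling `tokenizer.save_pretrained('tbfy_cpv_model')` as well would keep the two together. To use the saved model, its output logits need to be turned back into CPV codes. The helper below is a hedged sketch of that last step using NumPy only; `top_k_cpv`, the toy mapping, and the toy logits are illustrative assumptions.

```python
import numpy as np


def top_k_cpv(logits, id2label, k=3):
    """Return the k most probable (cpv_code, probability) pairs for one example."""
    logits = np.asarray(logits, dtype=float)
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    probs = exp / exp.sum()              # softmax over the classes
    best = np.argsort(probs)[::-1][:k]   # indices of the k largest probabilities
    return [(id2label[int(i)], float(probs[i])) for i in best]


# Toy mapping and logits; in practice the logits would come from the fine-tuned model.
toy_id2label = {0: "85120000", 1: "79212000", 2: "72220000"}
result = top_k_cpv([0.2, 2.5, 1.0], toy_id2label, k=2)
print(result)
```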