# CPV Classifier POC

## 2 Prepare Training set

Set up the train/validation data once so that all later experiments run on exactly the same split.

See https://theybuyforyou.eu/ for background on TheyBuyForYou and http://data.tbfy.eu/ for information on the Knowledge Graph (KG) data created as part of that project. The KG data used in this proof of concept is made available under the license terms quoted below, so the same license applies to the code and data in this repository.

> The KG data is provided under the Creative Commons BY-NC-SA 4.0 License, which allows you to use, share and adapt the data for non-commercial uses as long as you give appropriate credit and share any adapted data under the same license as the original. If you wish to use the data for commercial uses please contact the TheyBuyForYou project.

The full CPV listing included in this repo was downloaded from https://simap.ted.europa.eu/cpv

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd
from collections import Counter
import shelve

## Load and prepare data

In [2]:
# Read every column as a string so CPV codes stay as text rather than integers.
df = pd.read_json("data/training_data.json", dtype=str)
df

Unnamed: 0,title,mapped_cpv,sent1
0,"APMS Services for Maple Surgery, Cambridge for...",85120000,"Maple Surgery is located at Hanover Close, Cam..."
1,Provision of Compliance Auditing Services,79212000,Dublin Bus is seeking submissions from suitabl...
2,DBC (SF) Pilot of Robotic Process Automation,72220000,Dacorum Borough Council requires support from ...
3,Passenger Transport for 8 Passengers or Less w...,60000000,Passenger transport for 8 passengers or less w...
4,Denbighshire Schools ICT Network Framework,48000000,The aims of this contract is to provide a fram...
...,...,...,...
245879,Delivered Ready Prepared Meals,15890000,The range is for selection of standard ready p...
245880,Traffic Signals Planned and Unplanned Inspecti...,50230000,Renfrewshire Council require a suitably qualif...
245881,The Supply for the Development of Dudley Counc...,73000000,Dudley Council invites providers to submit a q...
245882,LIFE Welsh Raised Bogs — Framework for Peat Re...,16000000,Lot 2: Removal of invasive Scrub:NRW is intend...


In [3]:
unique_classifs = df.mapped_cpv.unique()
print(f"There are {len(unique_classifs)} classifications")

There are 225 classifications


In [4]:
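# Assign each CPV code an integer id in order of first appearance,
# and build the inverse mapping for decoding predictions later.
label2id = {cpv: i for i, cpv in enumerate(unique_classifs)}
id2label = {i: cpv for cpv, i in label2id.items()}

# Sanity checks: the mapping is one-to-one and ids start at 0.
assert len(label2id) == len(id2label)
assert min(id2label.keys()) == 0
assert set(label2id.values()) == set(id2label.keys())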
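
Because `unique()` returns labels in order of first appearance, the ids depend on the row order of the input file, which is presumably why the mappings are saved together with the split at the end of this notebook. A minimal sketch of the round trip, using a few CPV codes from the table above and `dict.fromkeys` as a stand-in for `unique()` (the helper names `encode` and `decode` are hypothetical and not part of the notebook):

```python
# Illustrative only: a handful of codes standing in for df.mapped_cpv.
codes = ["85120000", "79212000", "72220000", "79212000", "85120000"]

# dict.fromkeys keeps first-appearance order, like Series.unique().
unique_codes = list(dict.fromkeys(codes))

label2id = {cpv: i for i, cpv in enumerate(unique_codes)}
id2label = {i: cpv for cpv, i in label2id.items()}


def encode(cpv_codes):
    """Map CPV code strings to integer ids (hypothetical helper)."""
    return [label2id[c] for c in cpv_codes]


def decode(ids):
    """Map integer ids back to CPV code strings (hypothetical helper)."""
    return [id2label[i] for i in ids]


ids = encode(codes)
print(ids)                   # [0, 1, 2, 1, 0]
print(decode(ids) == codes)  # True
```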

In [5]:
# Vectorised equivalents of a row-wise apply: map each CPV code to its id
# and join the title and first sentence into a single text field.
df["label"] = df["mapped_cpv"].map(label2id)
df["text"] = df["title"] + "\n" + df["sent1"]
df

Unnamed: 0,title,mapped_cpv,sent1,label,text
0,"APMS Services for Maple Surgery, Cambridge for...",85120000,"Maple Surgery is located at Hanover Close, Cam...",0,"APMS Services for Maple Surgery, Cambridge for..."
1,Provision of Compliance Auditing Services,79212000,Dublin Bus is seeking submissions from suitabl...,1,Provision of Compliance Auditing Services\nDub...
2,DBC (SF) Pilot of Robotic Process Automation,72220000,Dacorum Borough Council requires support from ...,2,DBC (SF) Pilot of Robotic Process Automation\n...
3,Passenger Transport for 8 Passengers or Less w...,60000000,Passenger transport for 8 passengers or less w...,3,Passenger Transport for 8 Passengers or Less w...
4,Denbighshire Schools ICT Network Framework,48000000,The aims of this contract is to provide a fram...,4,Denbighshire Schools ICT Network Framework\nTh...
...,...,...,...,...,...
245879,Delivered Ready Prepared Meals,15890000,The range is for selection of standard ready p...,193,Delivered Ready Prepared Meals\nThe range is f...
245880,Traffic Signals Planned and Unplanned Inspecti...,50230000,Renfrewshire Council require a suitably qualif...,202,Traffic Signals Planned and Unplanned Inspecti...
245881,The Supply for the Development of Dudley Counc...,73000000,Dudley Council invites providers to submit a q...,23,The Supply for the Development of Dudley Counc...
245882,LIFE Welsh Raised Bogs — Framework for Peat Re...,16000000,Lot 2: Removal of invasive Scrub:NRW is intend...,137,LIFE Welsh Raised Bogs — Framework for Peat Re...


In [6]:
def drop_single_examples(df):
    """Drop rows whose label occurs only once in the DataFrame."""
    insufficient_labels = [k for k, v in Counter(df.label).items() if v == 1]
    insufficient = df[df.label.isin(insufficient_labels)]
    return df.drop(insufficient.index)

In [7]:
df = drop_single_examples(df)
df.shape

(245876, 5)
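
Singletons are presumably removed because a stratified split needs at least two examples of every label: scikit-learn's `train_test_split` raises a `ValueError` when `stratify` is given a class with a single member. The same filter can be sketched on a plain list of labels. The function name `rare_labels` and the `min_count` parameter are hypothetical and not part of the notebook:

```python
from collections import Counter


def rare_labels(labels, min_count=2):
    """Return the set of labels that occur fewer than min_count times."""
    counts = Counter(labels)
    return {label for label, n in counts.items() if n < min_count}


# Toy labels: 1 and 3 each appear once, so they would break stratification.
labels = [0, 0, 1, 2, 2, 2, 3]
to_drop = rare_labels(labels)
kept = [label for label in labels if label not in to_drop]

print(sorted(to_drop))  # [1, 3]
print(kept)             # [0, 0, 2, 2, 2]
```

Raising `min_count` would also remove the labels that end up with no validation examples, at the cost of dropping more classes.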

In [8]:
# Stratified 90/10 split so each label has a similar share of train and val.
sents_train, sents_val, cpv_train, cpv_val = train_test_split(
    list(df.text), list(df.label), test_size=0.1, stratify=df.label
)
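
The split above draws a new random partition on every run, so the saved shelf is what keeps later experiments aligned. If the notebook itself should also be repeatable, `train_test_split` accepts a `random_state` seed. A minimal sketch on toy data (the seed value 42 and the toy lists are arbitrary assumptions, not taken from the notebook):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for df.text and df.label: 20 examples for each of 2 labels.
texts = [f"notice {i}" for i in range(40)]
labels = [i % 2 for i in range(40)]


def split(seed):
    """Stratified 90/10 split with a fixed seed."""
    return train_test_split(
        texts, labels, test_size=0.1, stratify=labels, random_state=seed
    )


first = split(42)
second = split(42)

# Same seed, same partition on every call.
print(first == second)   # True
# Stratification keeps both labels represented in the 4 validation rows.
print(sorted(first[3]))  # [0, 0, 1, 1]
```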

In [9]:
# Count the training labels once, then take the ten most and least frequent.
train_counts = Counter(cpv_train).most_common()
most_common = dict(train_counts[:10])
least_common = dict(train_counts[-10:])

In [10]:
val_counter = Counter(cpv_val)

print("Most Common labels")
for k,t in most_common.items():
    v = val_counter.get(k,0)
    print(f"{k}: train examples: {t}, val examples:{v}")
    
print("")
print("Least Common labels")
for k,t in least_common.items():
    v = val_counter.get(k,0)
    print(f"{k}: train examples: {t}, val examples:{v}")


Most Common labels
15: train examples: 9536, val examples:1060
48: train examples: 8295, val examples:922
3: train examples: 7979, val examples:887
24: train examples: 7907, val examples:879
9: train examples: 6138, val examples:682
4: train examples: 6008, val examples:668
16: train examples: 5761, val examples:640
65: train examples: 5391, val examples:599
13: train examples: 5052, val examples:561
14: train examples: 4870, val examples:541

Least Common labels
201: train examples: 88, val examples:10
209: train examples: 37, val examples:4
208: train examples: 19, val examples:2
213: train examples: 16, val examples:2
211: train examples: 8, val examples:1
215: train examples: 6, val examples:1
210: train examples: 3, val examples:0
218: train examples: 3, val examples:0
216: train examples: 2, val examples:0
220: train examples: 2, val examples:0
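
The last four labels have no validation examples at all: with `test_size=0.1`, a label with only two or three examples contributes fewer than one row to the validation set on average, so those labels cannot be evaluated on it. A small sketch for listing such labels, using counts printed above as toy input (the function name `labels_missing_from_val` and the `toy_` variables are hypothetical):

```python
from collections import Counter


def labels_missing_from_val(train_labels, val_labels):
    """Return labels that occur in train but never in val, sorted."""
    return sorted(set(train_labels) - set(val_labels))


# Toy input rebuilt from the counts printed above for six of the labels.
train_counts = {211: 8, 215: 6, 210: 3, 218: 3, 216: 2, 220: 2}
val_counts = {211: 1, 215: 1}

# Counter.elements() repeats each label as many times as its count.
toy_train = list(Counter(train_counts).elements())
toy_val = list(Counter(val_counts).elements())

print(labels_missing_from_val(toy_train, toy_val))  # [210, 216, 218, 220]
```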


In [11]:
# Persist the split and the label mappings so later notebooks load identical data.
with shelve.open("data/train_val.shelf") as f:
    f["sents_train"] = sents_train
    f["sents_val"] = sents_val
    f["cpv_train"] = cpv_train
    f["cpv_val"] = cpv_val
    f["label2id"] = label2id
    f["id2label"] = id2label
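
A later notebook can then read the same objects back instead of re-running the split. The sketch below writes a tiny shelf to a temporary directory so that it is self-contained; the sample values are illustrative, and in practice the path would be `data/train_val.shelf`. Note that, depending on the `dbm` backend, `shelve` may add an extension to the file name or create more than one file on disk.

```python
import os
import shelve
import tempfile

# Temporary location used only for this sketch.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "train_val.shelf")

# Write a miniature version of the shelf created above.
with shelve.open(path) as f:
    f["sents_train"] = ["Provision of Compliance Auditing Services"]
    f["cpv_train"] = [1]
    f["id2label"] = {1: "79212000"}

# Later notebook: open read-only and copy the values out.
with shelve.open(path, flag="r") as f:
    sents_train = f["sents_train"]
    cpv_train = f["cpv_train"]
    id2label = f["id2label"]

print(id2label[cpv_train[0]])  # 79212000
```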