## CPV Classifier POC

## 3.1 - Train SVM

This notebook trains a simple Support Vector Machine using scikit-learn's LinearSVC (a linear Support Vector Classifier).

See https://theybuyforyou.eu/ for background on TheyBuyForYou and http://data.tbfy.eu/ for information on the Knowledge Graph (KG) data created as part of that project. The KG data used in this proof of concept is made available under the license terms quoted below, so the same license applies to the code and data in this repository.

> The KG data is provided under the Creative Commons BY-NC-SA 4.0 License, which allows you to use, share and adapt the data for non-commercial uses as long as you give appropriate credit and share any adapted data under the same license as the original. If you wish to use the data for commercial uses please contact the TheyBuyForYou project.

The full CPV listing included in this repo was downloaded from https://simap.ted.europa.eu/cpv

In [1]:
import pandas as pd
import shelve

## Load data

The text is also converted to lowercase so that the comparison with the uncased model used in the transformers version is fair.

In [2]:
with shelve.open("data/train_val.shelf") as db:
    sents_train = db["sents_train"]
    sents_val = db["sents_val"]
    cpv_train = db["cpv_train"]
    cpv_val = db["cpv_val"]
    label2id = db["label2id"]
    id2label = db["id2label"]

In [3]:
sents_train = [x.lower() for x in sents_train]
sents_val = [x.lower() for x in sents_val]
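
As a side note, scikit-learn's `TfidfVectorizer` already lowercases its input by default (`lowercase=True`), so the explicit step above mainly documents the intent. A minimal sketch on two made-up sentences (not taken from the TBFY data) illustrates this:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy sentences that differ only in letter case
docs = ["Road Construction Works", "road construction works"]

vec = TfidfVectorizer()  # lowercase=True is the default
X = vec.fit_transform(docs)

# Both sentences share one lowercase vocabulary and get identical weights
vocab = sorted(vec.vocabulary_)
same_rows = bool(np.allclose(X[0].toarray(), X[1].toarray()))
print(vocab, same_rows)
```

Keeping the explicit `.lower()` call is still reasonable if the vectorizer settings might change later.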

## Prepare & Train model

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_df=0.9, max_features=5000, stop_words='english')

# Alias the data to conventional names (the *_test names hold the validation split)
X_train = tfidf.fit_transform(sents_train)
X_test = tfidf.transform(sents_val)
Y_train = cpv_train
Y_test = cpv_val
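
To make the vectorizer settings more concrete, the sketch below applies the same kind of configuration to a tiny made-up corpus. The sentences and the small `max_features=3` value are assumptions chosen for illustration only; the notebook itself uses `max_features=5000`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "supply" occurs in every document
docs = [
    "supply of office furniture",
    "supply of medical equipment",
    "supply of road salt",
    "supply of medical gloves",
]

vec = TfidfVectorizer(max_df=0.9, max_features=3, stop_words="english")
X = vec.fit_transform(docs)
vocab = set(vec.vocabulary_)

# "of" is dropped as an English stop word, "supply" is dropped because it
# appears in more than 90% of documents, and only the 3 most frequent
# remaining terms are kept as features.
print(sorted(vocab), X.shape)
```

Terms that appear almost everywhere carry little signal for separating classes, which is presumably why `max_df` is set below 1.0 here.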

In [5]:
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
from datetime import datetime

best_model = None
best_accuracy = 0

for i in [1,10,100,1_000]:
    svc = LinearSVC(max_iter = i)
    start = datetime.now()
    svc.fit(X_train, Y_train)
    finish = datetime.now()
    preds = svc.predict(X_test)
    accuracy = accuracy_score(Y_test, preds)  # signature is (y_true, y_pred)
    print(f"max_iter {i} // accuracy: {accuracy}. Completed in {finish - start} after {svc.n_iter_} iteration(s)")
    if accuracy > best_accuracy:
        best_accuracy=accuracy
        print("Saving new best model")
        best_model = svc

max_iter 1 // accuracy: 0.47173417927444283. Completed in 0:00:25.276647 after 1 iteration(s)
Saving new best model
max_iter 10 // accuracy: 0.6314869041809013. Completed in 0:01:46.374777 after 10 iteration(s)
Saving new best model
max_iter 100 // accuracy: 0.6339677891654466. Completed in 0:02:18.461116 after 56 iteration(s)
Saving new best model
max_iter 1000 // accuracy: 0.6339677891654466. Completed in 0:02:16.402993 after 56 iteration(s)
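
The output above suggests that the solver converged after 56 iterations, since raising `max_iter` from 100 to 1,000 changes neither the iteration count nor the accuracy. One way to check this programmatically is to watch for scikit-learn's `ConvergenceWarning`. The following is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration, not the CPV data); the `results` dictionary is a hypothetical helper.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import LinearSVC

# Synthetic stand-in data, only used to demonstrate the check
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

results = {}
for max_iter in [1, 1000]:
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        svc = LinearSVC(max_iter=max_iter, random_state=0).fit(X, y)
    # A ConvergenceWarning means the solver stopped at the iteration limit
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    results[max_iter] = (int(svc.n_iter_), hit_limit)
    print(f"max_iter={max_iter}: n_iter_={svc.n_iter_}, hit limit: {hit_limit}")
```

If the largest `max_iter` value still hits the limit, the candidate list in the loop probably needs to be extended.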


## Save best model

In [6]:
import pickle

# Persist the fitted vectorizer together with the best classifier
with open("models/svc/tfidf.pickle", "wb") as f:
    pickle.dump(tfidf, f)
with open("models/svc/svc.pickle", "wb") as f:
    pickle.dump(best_model, f)
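
For later inference, both pickles need to be loaded, because new text must pass through the same fitted vectorizer before it reaches the classifier. Below is a minimal round-trip sketch under stated assumptions: the toy sentences, the two-entry `id2label` mapping, and the temporary directory are hypothetical stand-ins for the notebook's data and the `models/svc/` folder.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the notebook's sentences and label ids
sents = ["road construction works", "road repair works",
         "medical equipment supply", "medical gloves supply"]
labels = [0, 0, 1, 1]
id2label = {0: "45000000", 1: "33000000"}  # illustrative CPV codes

tfidf = TfidfVectorizer().fit(sents)
svc = LinearSVC().fit(tfidf.transform(sents), labels)

with tempfile.TemporaryDirectory() as tmp:
    tfidf_path = Path(tmp) / "tfidf.pickle"
    svc_path = Path(tmp) / "svc.pickle"
    tfidf_path.write_bytes(pickle.dumps(tfidf))
    svc_path.write_bytes(pickle.dumps(svc))

    # Reload the vectorizer and the classifier as a pair
    loaded_tfidf = pickle.loads(tfidf_path.read_bytes())
    loaded_svc = pickle.loads(svc_path.read_bytes())

# Vectorize new text with the loaded vectorizer, then map the id to a label
pred_id = loaded_svc.predict(loaded_tfidf.transform(["road works"]))[0]
pred = id2label[int(pred_id)]
print(pred)
```

Pickled scikit-learn objects are generally only safe to load with a matching library version and from a trusted source.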
