# CPV Classifier POC

## 3 - Test Model

A simple transformer-based classifier proof of concept built on data from TheyBuyForYou.

See https://theybuyforyou.eu/ for background on TheyBuyForYou and http://data.tbfy.eu/ for information on the Knowledge Graph (KG) data created as part of that project. The knowledge graph data used in this proof of concept is made available under the license terms quoted below, so the same license applies to the code and data in this repository.

> The KG data is provided under the Creative Commons BY-NC-SA 4.0 License, which allows you to use, share and adapt the data for non-commercial uses as long as you give appropriate credit and share any adapted data under the same license as the original. If you wish to use the data for commercial uses please contact the TheyBuyForYou project.


## Load CPV list

The CPV list was originally downloaded from https://simap.ted.europa.eu/cpv.

In [1]:
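import csv

# Map each 8-character CPV code to its description.
cpv_list = {}

# newline="" is the csv module's recommended way to open files for csv.reader.
with open("data/cpv_listing.tsv", newline="") as csvfile:
    cpvreader = csv.reader(csvfile, delimiter="\t")
    for row in cpvreader:
        code = row[0][:8]  # keep only the first 8 characters of the code column
        desc = row[1]
        cpv_list[code] = desc

# an example CPV
cpv_list['03000000']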

'Agricultural, farming, fishing, forestry and related products'
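
The `row[0][:8]` slice in the loading cell keeps only the first eight characters of each code. The sketch below shows that normalisation in isolation, assuming the raw column holds codes in the official `NNNNNNNN-C` form (eight digits plus a check digit); the TSV file is not shown here, so that is an assumption. The helper names `normalize_cpv` and `cpv_division` are hypothetical and not part of this notebook. The second helper relies on the CPV convention that the first two digits identify the division.

```python
def normalize_cpv(raw_code: str) -> str:
    """Return the 8-digit part of a CPV code, dropping any '-C' check digit."""
    return raw_code.strip()[:8]


def cpv_division(code: str) -> str:
    """Return the division-level code: first two digits padded with zeros."""
    return normalize_cpv(code)[:2] + "000000"


# Illustrative inputs only; they are not read from data/cpv_listing.tsv.
print(normalize_cpv("03000000-1"))  # 03000000
print(cpv_division("79340000-9"))   # 79000000
```

Grouping predictions by division in this way could help when comparing labels at different levels of the CPV hierarchy.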

## Load model

In [2]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast, TextClassificationPipeline

In [3]:
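# The tokenizer comes from the base DistilBERT checkpoint;
# the classifier weights are loaded from 'tbfy_cpv_model'.
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('tbfy_cpv_model')
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)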

In [4]:
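def classify(text):
    # For a single string the pipeline returns a list with one dict
    # holding the top 'label' and its 'score'.
    res = pipe(text)
    # The predicted label is used directly as a key into cpv_list.
    cpv = cpv_list[res[0]['label']]
    print(f"'{text}' => {cpv} (score = {res[0]['score']})")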
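
`classify` prints its result, and it raises `KeyError` if the model emits a label that is missing from `cpv_list`. The sketch below is a variant that returns the values instead, assuming the pipeline output has the `[{'label': ..., 'score': ...}]` shape used above. The name `describe_prediction`, the fallback string, and the stand-in data are hypothetical, chosen so the example runs without the model.

```python
from typing import Dict, List, Tuple


def describe_prediction(res: List[dict], cpv_list: Dict[str, str]) -> Tuple[str, str, float]:
    """Turn a pipeline-style result into (code, description, score)."""
    top = res[0]
    label = top["label"]
    # Fall back to a placeholder instead of raising KeyError on unknown codes.
    desc = cpv_list.get(label, "<unknown CPV code>")
    return label, desc, float(top["score"])


# Stand-in data (hypothetical): mimics the pipeline output without loading the model.
example_cpv_list = {
    "03000000": "Agricultural, farming, fishing, forestry and related products",
}
example_res = [{"label": "03000000", "score": 0.75}]

print(describe_prediction(example_res, example_cpv_list))
```

Returning a tuple rather than printing makes the result easier to log, filter, or compare against expected labels.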

## Test on some examples

In [5]:
classify("mobile app")

'mobile app' => Software package and information systems (score = 0.9995172023773193)


In [6]:
classify("mobile phone")

'mobile phone' => Radio, television, communication, telecommunication and related equipment (score = 0.9999891519546509)


In [7]:
classify("mobile billboard")

'mobile billboard' => Advertising and marketing services (score = 0.5554304122924805)


In [8]:
classify("mobile office")

'mobile office' => Construction work (score = 0.9570050835609436)
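
The four scores above range from about 0.56 to nearly 1.0, so a caller may want to treat low-scoring predictions differently. The sketch below uses the scores printed above (rounded) and a cutoff of 0.9 that is hypothetical and chosen only for illustration. `needs_review` is a hypothetical helper, not part of this notebook.

```python
REVIEW_THRESHOLD = 0.9  # hypothetical cutoff, not tuned on any data


def needs_review(score: float, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Flag a prediction whose score falls below the threshold."""
    return score < threshold


# Scores copied (rounded) from the example outputs above.
observed_scores = {
    "mobile app": 0.9995,
    "mobile phone": 0.99999,
    "mobile billboard": 0.5554,
    "mobile office": 0.9570,
}

flagged = [text for text, score in observed_scores.items() if needs_review(score)]
print(flagged)  # ['mobile billboard']
```

A high score does not by itself mean the label is right, so a threshold like this is at most a first filter for manual checking.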
