# CPV Classifier POC

## 4 - Test Models

Compare the LinearSVC, Bi-LSTM and Transformer (DistilBERT) models on a set of example texts.

See https://theybuyforyou.eu/ for background on TheyBuyForYou and http://data.tbfy.eu/ for information on the Knowledge Graph (KG) data created as part of that project. The knowledge graph data used in this proof of concept is made available under the following license terms, and the same license therefore applies to the code and data in this repository.

> The KG data is provided under the Creative Commons BY-NC-SA 4.0 License, which allows you to use, share and adapt the data for non-commercial uses as long as you give appropriate credit and share any adapted data under the same license as the original. If you wish to use the data for commercial uses please contact the TheyBuyForYou project.


## Load CPV list

Originally downloaded from https://simap.ted.europa.eu/cpv

In [1]:
import csv

cpv_list = {}

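# newline="" lets the csv module handle line endings itself
with open("data/cpv_listing.tsv", newline="") as csvfile:
    cpvreader = csv.reader(csvfile, delimiter="\t")
    for row in cpvreader:
        code = row[0][:8]  # keep only the first 8 characters of the code
        desc = row[1]
        cpv_list[code] = desc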
        
# an example CPV
cpv_list['03000000']

'Agricultural, farming, fishing, forestry and related products'
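
CPV codes are hierarchical: the first two digits identify the division, the first three the group, the first four the class and the first five the category, with the remaining positions zero-filled. The full code also carries a check digit after a hyphen (for example `03000000-1`), which is presumably what the `[:8]` slice above removes. A minimal sketch of walking up that hierarchy follows; `cpv_ancestors` is a hypothetical helper for illustration and is not part of this notebook.

```python
def cpv_ancestors(code):
    """Return the broader CPV codes (division, group, class, category) above a code."""
    digits = code[:8]  # ignore any check-digit suffix such as "-9"
    ancestors = []
    for depth in (2, 3, 4, 5):
        # A parent keeps the first `depth` digits and zero-fills the rest.
        parent = digits[:depth].ljust(8, "0")
        if parent != digits and parent not in ancestors:
            ancestors.append(parent)
    return ancestors

print(cpv_ancestors("79340000-9"))
print(cpv_ancestors("03000000"))
```

A top-level division such as `03000000` has no ancestors, so the second call returns an empty list. Looking up each ancestor in `cpv_list` gives a readable path from the division down to the predicted code.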

In [2]:
import shelve

with shelve.open("data/train_val.shelf") as db:
    id2label = db["id2label"]

## Load transformers model

In [3]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast, TextClassificationPipeline

In [4]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('models/transformers')
pipe = TextClassificationPipeline(model=model,tokenizer=tokenizer)

In [5]:
def transformer_classify(text):
    res = pipe(text)
    cpv_code = res[0]['label']
    cpv = cpv_list[cpv_code]
    return f"{cpv_code} - {cpv[:60]} (score {res[0]['score']})"

## Load SVC model

In [6]:
import pickle

with open("models/svc/tfidf.pickle", "rb") as f:
    tfidf = pickle.load(f)
with open("models/svc/svc.pickle", "rb") as f:
    svc = pickle.load(f)

In [7]:
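import numpy as np

def softmax(x):
    # Subtract the maximum before exponentiating to avoid overflow; the result is mathematically unchanged.
    e_x = np.exp(x - np.max(x))
    return e_x / np.sum(e_x)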

In [8]:
def svc_classify(text):
    vals = tfidf.transform([text.lower()])
    preds = svc.decision_function(vals)[0]
    scores = softmax(preds)
    cpvid = np.argmax(scores)
    score = scores[cpvid]
    cpv_code = id2label[cpvid]
    cpv = cpv_list[cpv_code]
    return f"{cpv_code} - {cpv[:60]} (score {score})"
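
The LinearSVC scores in the test outputs later in this notebook are all below 0.04, while the two neural models often report values close to 1. That follows from how the score is built: `decision_function` returns signed margins rather than probabilities, and a softmax over many classes with modest margins spreads the mass thinly. The sketch below uses made-up margins for 100 classes (both the values and the class count are illustrative assumptions, not taken from the trained model) and plain Python instead of NumPy.

```python
import math

def softmax(xs):
    # Subtract the maximum first so exp() cannot overflow.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical margins: one clear winner at +1.0, 99 other classes at -1.0.
margins = [1.0] + [-1.0] * 99
scores = softmax(margins)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, round(scores[best], 3))
```

Even with an unambiguous winner, the top score here is only about 0.07. The LinearSVC score is therefore useful for ranking classes within one prediction, but it should not be compared directly with the Bi-LSTM or Transformer scores.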

## Load Bi-LSTM model

In [9]:
from tensorflow import keras

# Use distinct names so the DistilBERT `tokenizer` and `model` loaded above are not overwritten.
with open("models/bilstm/tokenizer.pickle", "rb") as f:
    lstm_tokenizer = pickle.load(f)
lstm_model = keras.models.load_model("models/bilstm")

In [10]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

def keras_classify(text):
    toks = lstm_tokenizer.texts_to_sequences([text.lower()])
    # Pad (or truncate) to a fixed length of 500 tokens
    padded_tokens = pad_sequences(toks, maxlen=500)
    result = lstm_model.predict(padded_tokens)[0]
    cpvid = np.argmax(result)
    score = result[cpvid]
    cpv_code = id2label[cpvid]
    cpv = cpv_list[cpv_code]
    return f"{cpv_code} - {cpv[:60]} (score {score})"
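
Two Keras defaults are worth keeping in mind when reading the Bi-LSTM results. `pad_sequences` pads and truncates at the start of the sequence by default (`padding='pre'`, `truncating='pre'`), and a Keras `Tokenizer` fitted without an `oov_token` silently drops words it has not seen. Whether the pickled tokenizer was fitted with an `oov_token` is not shown in this notebook, so the second point is an assumption. The pure-Python sketch below only emulates those behaviours with a hypothetical two-word vocabulary; `pad_pre` and `to_ids` are illustrative names, not Keras functions.

```python
def pad_pre(sequence, maxlen, value=0):
    """Emulate the pad_sequences defaults for one sequence."""
    kept = sequence[-maxlen:]  # 'pre' truncation keeps the last maxlen items
    return [value] * (maxlen - len(kept)) + kept  # 'pre' padding fills on the left

def to_ids(text, vocab):
    # Unknown words are skipped, as a Tokenizer without oov_token would do.
    return [vocab[w] for w in text.lower().split() if w in vocab]

vocab = {"mobile": 1, "phone": 2}  # hypothetical vocabulary
print(pad_pre(to_ids("mobile phone", vocab), maxlen=5))
print(pad_pre(to_ids("mbl phone", vocab), maxlen=5))
```

Under that assumption, a misspelling such as "mbl" would simply vanish from the input, leaving the model to classify on "phone" alone.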

## Test on some examples

In [11]:
def classify(text):
    print(f"Text:        {text}")
    print(f"LinearSVC:   {svc_classify(text)}")
    print(f"Bi-LSTM:     {keras_classify(text)}")
    print(f"Transformer: {transformer_classify(text)}")
    print()
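
The printed lines are easy to scan by eye, but agreement between models can also be summarized programmatically. Because models may pick different depths of the CPV tree, one option is to compare only the division (the first two digits). The sketch below assumes the predicted codes are available as plain strings; `division_agreement` is a hypothetical helper, and the three codes are copied from the "mobile phone" output in this notebook.

```python
from collections import Counter

def division_agreement(predictions):
    """Summarize model agreement at CPV division level (first two digits)."""
    divisions = {name: code[:2] for name, code in predictions.items()}
    majority, votes = Counter(divisions.values()).most_common(1)[0]
    return {
        "divisions": divisions,
        "majority": majority,
        "votes": votes,
        "unanimous": votes == len(predictions),
    }

preds = {"LinearSVC": "32000000", "Bi-LSTM": "64210000", "Transformer": "32000000"}
summary = division_agreement(preds)
print(summary["majority"], summary["votes"], summary["unanimous"])
```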

In [12]:
classify("mobile app")

Text:        mobile app
LinearSVC:   48000000 - Software package and information systems (score 0.019731817920954978)
Bi-LSTM:     48000000 - Software package and information systems (score 0.8161516785621643)
Transformer: 48000000 - Software package and information systems (score 0.9999947547912598)



In [13]:
classify("mobile billboard")

Text:        mobile billboard
LinearSVC:   32000000 - Radio, television, communication, telecommunication and rela (score 0.015232568402872291)
Bi-LSTM:     48000000 - Software package and information systems (score 0.12404028326272964)
Transformer: 79340000 - Advertising and marketing services (score 0.9911177158355713)



In [14]:
classify("mobile office")

Text:        mobile office
LinearSVC:   32000000 - Radio, television, communication, telecommunication and rela (score 0.013213690361260659)
Bi-LSTM:     30200000 - Computer equipment and supplies (score 0.42848795652389526)
Transformer: 79000000 - Business services: law, marketing, consulting, recruitment,  (score 0.49228811264038086)



In [15]:
classify("mobile phone")

Text:        mobile phone
LinearSVC:   32000000 - Radio, television, communication, telecommunication and rela (score 0.031065103217753638)
Bi-LSTM:     64210000 - Telephone and data transmission services (score 0.6192469596862793)
Transformer: 32000000 - Radio, television, communication, telecommunication and rela (score 0.9982514977455139)



In [16]:
# Spelling differences

classify("mobile phones")
classify("mbile phones")
classify("mbl phone")
classify("mobilephone")

Text:        mobile phones
LinearSVC:   32000000 - Radio, television, communication, telecommunication and rela (score 0.015232568402872291)
Bi-LSTM:     32000000 - Radio, television, communication, telecommunication and rela (score 0.8991973996162415)
Transformer: 32000000 - Radio, television, communication, telecommunication and rela (score 0.9999998807907104)

Text:        mbile phones
LinearSVC:   45000000 - Construction work (score 0.008864960738348727)
Bi-LSTM:     32000000 - Radio, television, communication, telecommunication and rela (score 0.3602462708950043)
Transformer: 32000000 - Radio, television, communication, telecommunication and rela (score 0.9999998807907104)

Text:        mbl phone
LinearSVC:   32000000 - Radio, television, communication, telecommunication and rela (score 0.02137280931172376)
Bi-LSTM:     73110000 - Research services (score 0.04001806303858757)
Transformer: 32000000 - Radio, television, communication, telecommunication and rela (score 0.999990463256

In [17]:
# Longer text

txt = "Stoneybridge Parish require a promotional and educational video for the village." + \
" This will introduce village locations such as our Stoney Bridge and Bus Shelter to an international audience."
classify(txt)

Text:        Stoneybridge Parish require a promotional and educational video for the village. This will introduce village locations such as our Stoney Bridge and Bus Shelter to an international audience.
LinearSVC:   79340000 - Advertising and marketing services (score 0.018177768791825664)
Bi-LSTM:     44000000 - Construction structures and materials; auxiliary products to (score 0.5624648332595825)
Transformer: 79340000 - Advertising and marketing services (score 0.9617910385131836)



In [18]:
txt = "Stoneybridge Parish seeks partners to produce a promotional and educational video for the village." + \
" This will introduce village locations such as our Stoney Bridge and Bus Shelter to an international audience."
classify(txt)

Text:        Stoneybridge Parish seeks partners to produce a promotional and educational video for the village. This will introduce village locations such as our Stoney Bridge and Bus Shelter to an international audience.
LinearSVC:   79340000 - Advertising and marketing services (score 0.016557606589711657)
Bi-LSTM:     44000000 - Construction structures and materials; auxiliary products to (score 0.756646454334259)
Transformer: 92000000 - Recreational, cultural and sporting services (score 0.9576200246810913)



In [19]:
txt = "Stoneybridge Parish Council seek a partner to produce a promotional and educational video for the village." + \
" This will introduce village locations such as our Stoney Bridge and Bus Shelter to an international audience."
classify(txt)

Text:        Stoneybridge Parish Council seek a partner to produce a promotional and educational video for the village. This will introduce village locations such as our Stoney Bridge and Bus Shelter to an international audience.
LinearSVC:   79340000 - Advertising and marketing services (score 0.013756549655601491)
Bi-LSTM:     03000000 - Agricultural, farming, fishing, forestry and related product (score 0.2669020891189575)
Transformer: 92000000 - Recreational, cultural and sporting services (score 0.9978830218315125)



In [20]:
txt = "Stoneybridge Parish Council seek a partner to produce a promotional video for the parish." + \
" The scope of this project includes recording key landmarks such as our stoney bridge and our bus shelter for posterity."
classify(txt)

Text:        Stoneybridge Parish Council seek a partner to produce a promotional video for the parish. The scope of this project includes recording key landmarks such as our stoney bridge and our bus shelter for posterity.
LinearSVC:   72210000 - Programming services of packaged software products (score 0.010672523824887882)
Bi-LSTM:     55000000 - Hotel, restaurant and retail trade services (score 0.7072287797927856)
Transformer: 92000000 - Recreational, cultural and sporting services (score 0.9881520867347717)

