## CPV Classifier POC

## 3.2 - Train Bi-LSTM

This notebook uses a bidirectional LSTM built with Keras (see https://keras.io/examples/nlp/bidirectional_lstm_imdb/, also https://huggingface.co/spaces/keras-io/bidirectional_lstm_imdb).

See https://theybuyforyou.eu/ for background on TheyBuyForYou and http://data.tbfy.eu/ for information on the Knowledge Graph (KG) data created as part of that project. The KG data used in this proof of concept is made available under the license terms quoted below, and the same license therefore applies to the code and data in this repository.

> The KG data is provided under the Creative Commons BY-NC-SA 4.0 License, which allows you to use, share and adapt the data for non-commercial uses as long as you give appropriate credit and share any adapted data under the same license as the original. If you wish to use the data for commercial uses please contact the TheyBuyForYou project.

The full CPV listing included in this repo was downloaded from https://simap.ted.europa.eu/cpv

In [1]:
import pandas as pd
import shelve

## Load data

The text is also converted to lowercase, so that the comparison with the uncased model used in the transformers version is fair.

In [2]:
with shelve.open("data/train_val.shelf") as db:
    sents_train = db["sents_train"]
    sents_val = db["sents_val"]
    cpv_train = db["cpv_train"]
    cpv_val = db["cpv_val"]
    label2id = db["label2id"]
    id2label = db["id2label"]
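The cell above expects a shelf with six keys. As a minimal sketch of how such a file could be written and read back, the snippet below uses made-up toy data; the sentences, label names, and temporary path are all hypothetical, and the real shelf is produced elsewhere.

```python
import os
import shelve
import tempfile

# Hypothetical toy data standing in for the real train/validation split
toy = {
    "sents_train": ["Supply of office furniture", "Road maintenance works"],
    "sents_val": ["Cleaning services for schools"],
    "cpv_train": [0, 1],  # integer class ids, one per training sentence
    "cpv_val": [2],
    "label2id": {"label_a": 0, "label_b": 1, "label_c": 2},
    "id2label": {0: "label_a", 1: "label_b", 2: "label_c"},
}

# Assumption: a throwaway location, not the notebook's data/ directory
shelf_path = os.path.join(tempfile.mkdtemp(), "train_val_demo.shelf")

# Write every entry under its own key
with shelve.open(shelf_path) as db:
    for key, value in toy.items():
        db[key] = value

# Read the entries back, as the notebook cell does
with shelve.open(shelf_path) as db:
    loaded = {key: db[key] for key in toy}

# Basic consistency checks worth running on any such shelf
assert len(loaded["sents_train"]) == len(loaded["cpv_train"])
assert all(loaded["id2label"][i] == name for name, i in loaded["label2id"].items())
```

The two checks at the end (one label per sentence, and `id2label` being the inverse of `label2id`) are cheap to run on the real shelf as well.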

In [3]:
sents_train = [x.lower() for x in sents_train]
sents_val = [x.lower() for x in sents_val]

## Prepare & Train model

In [4]:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_features = 20000  # Only consider the top 20k words
maxlen = 500  # Pad or truncate each item to 500 tokens (pad_sequences defaults keep the last 500)

In [5]:
# Input for variable-length sequences of integers
inputs = keras.Input(shape=(None,), dtype="int32")
# Embed each integer in a 300-dimensional vector
x = layers.Embedding(max_features, 300)(inputs)
# Add 2 bidirectional LSTMs
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
# Add a classifier
outputs = layers.Dense(len(label2id), activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.summary()


Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding (Embedding)        (None, None, 300)         6000000   
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128)         186880    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               98816     
_________________________________________________________________
dense (Dense)                (None, 225)               29025     
=================================================================
Total params: 6,314,721
Trainable params: 6,314,721
Non-trainable params: 0
_________________________________________________________________
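The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and one bias vector) and uses the sizes from the cells above; the helper name `lstm_params` is introduced here for illustration only.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

max_features, embed_dim, units, num_classes = 20000, 300, 64, 225

embedding = max_features * embed_dim              # one 300-d vector per vocabulary slot
bilstm_1 = 2 * lstm_params(embed_dim, units)      # forward + backward, fed by the embedding
bilstm_2 = 2 * lstm_params(2 * units, units)      # fed by the 128-d output of the first layer
dense = (2 * units) * num_classes + num_classes   # weights + biases of the classifier
total = embedding + bilstm_1 + bilstm_2 + dense

print(embedding, bilstm_1, bilstm_2, dense, total)
# 6000000 186880 98816 29025 6314721
```

About 95% of the parameters sit in the embedding layer, so the vocabulary size and embedding width dominate the model size.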


In [6]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(sents_train)

In [7]:
x_train = tokenizer.texts_to_sequences(sents_train)
x_val = tokenizer.texts_to_sequences(sents_val)
x_train = pad_sequences(x_train, maxlen=maxlen)
x_val = pad_sequences(x_val, maxlen=maxlen)
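With its default arguments, `pad_sequences` pads and truncates at the front of each sequence (`padding="pre"`, `truncating="pre"`). A pure-Python sketch of that behaviour for a single sequence follows; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the default pad_sequences behaviour for one sequence (maxlen > 0)."""
    seq = list(seq)[-maxlen:]                    # truncating="pre": keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # padding="pre": padding goes in front

print(pad_pre([5, 8, 2], maxlen=5))              # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Texts longer than `maxlen` therefore lose their opening tokens, not their closing ones. Passing `truncating="post"` to `pad_sequences` would keep the beginning of the text.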

In [8]:
num_classes = len(label2id)

y_train = keras.utils.to_categorical(cpv_train, num_classes=num_classes)
y_val = keras.utils.to_categorical(cpv_val, num_classes=num_classes)
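`to_categorical` turns each integer class id into a one-hot row of length `num_classes`, which is the target format that `categorical_crossentropy` expects. A small pure-Python sketch of the same transformation, with `one_hot` as a hypothetical stand-in and toy ids:

```python
def one_hot(class_ids, num_classes):
    """Return one row per class id, with 1.0 at the id's index and 0.0 elsewhere."""
    rows = []
    for class_id in class_ids:
        row = [0.0] * num_classes
        row[class_id] = 1.0
        rows.append(row)
    return rows

print(one_hot([0, 2, 1], num_classes=3))
# [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```

An alternative not used in this notebook is to keep the integer ids and compile with the `sparse_categorical_crossentropy` loss, which avoids materialising the one-hot matrix.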


## Train and Evaluate

In [9]:
from datetime import datetime

model.compile("adam", "categorical_crossentropy", metrics=["accuracy"])

start = datetime.now()
model.fit(x_train, y_train, batch_size=32, epochs=15, validation_data=(x_val, y_val))
finish = datetime.now()

print(f"Completed in {finish - start}")

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Completed in 2:02:48.193593
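After training, `model.predict(x_val)` would return one row of class probabilities per item, since the final layer is a softmax. The sketch below shows how such rows could be mapped back to labels with `id2label` and scored; the helper names, the probabilities, and the labels are all made up for illustration and are not results from this model.

```python
def decode_predictions(prob_rows, id2label):
    """Pick the highest-probability class in each row and map it to its label."""
    labels = []
    for probs in prob_rows:
        best_id = max(range(len(probs)), key=probs.__getitem__)
        labels.append(id2label[best_id])
    return labels

def accuracy(predicted, expected):
    """Fraction of positions where the predicted label matches the expected one."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

# Hypothetical toy values, not output from the trained model
id2label_demo = {0: "label_a", 1: "label_b", 2: "label_c"}
probs_demo = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]]

predicted = decode_predictions(probs_demo, id2label_demo)
print(predicted)  # ['label_b', 'label_a', 'label_c']
print(accuracy(predicted, ["label_b", "label_a", "label_a"]))  # 2 of 3 correct
```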


## Save model and tokenizer

In [10]:
import pickle
model.save("models/bilstm")
# Store the fitted tokenizer next to the model so both can be reloaded together
with open("models/bilstm/tokenizer.pickle", "wb") as f:
    pickle.dump(tokenizer, f)

INFO:tensorflow:Assets written to: models/bilstm/assets
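For later inference, the saved model and tokenizer need to be loaded as a pair, because the model only understands the integer ids produced by this particular tokenizer. A minimal sketch, assuming the same TensorFlow 2.x environment that wrote the files (newer Keras versions may need a different loading path); `load_tokenizer` and `load_bilstm` are hypothetical helper names.

```python
import pickle
from pathlib import Path

def load_tokenizer(model_dir):
    """Unpickle the tokenizer stored alongside the model."""
    with open(Path(model_dir) / "tokenizer.pickle", "rb") as f:
        return pickle.load(f)

def load_bilstm(model_dir="models/bilstm"):
    """Return (model, tokenizer) from the directory written by the cell above."""
    # Imported lazily so the tokenizer helper works without TensorFlow installed
    from tensorflow import keras

    model = keras.models.load_model(model_dir)
    return model, load_tokenizer(model_dir)
```

New text should go through the same steps as the training data before prediction: lowercasing, `texts_to_sequences`, then `pad_sequences` with the same `maxlen`. Only unpickle tokenizer files from a trusted source, since `pickle.load` can execute arbitrary code.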


