
Week 06 ‐ RDF Encoder and Bi‐encoder


Last time we created the training corpus for our RDF encoder. This week we use it to train the RDF encoder, and subsequently the bi-encoder for aligning its embeddings to text.

RDF Encoder

We will follow this blog to construct the scaffolding for training our model. As in the blog, we will use a small RoBERTa-like model with ~84M parameters.
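For reference, a configuration along these lines lands in that parameter range. It is a sketch only: the values mirror the blog, and the vocabulary size in particular depends on the tokenizer trained on our RDF corpus, so it is an assumption rather than our exact setting.

```python
# Sketch of a RoBERTa-like configuration in the ~84M-parameter range.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=52_000,            # assumption: size of the RDF tokenizer's vocabulary
    max_position_embeddings=514,
    num_hidden_layers=6,
    num_attention_heads=12,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,}")  # roughly 84M parameters with these settings
```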

For pre-training, we use Huggingface's example training script, which pretrains a language model with the masked language modelling (MLM) objective: the model is given inputs in which some tokens are randomly replaced with the special <mask> token, and it has to predict the correct token for each masked position. Other pretraining tasks, such as next sentence prediction or shuffled token detection, might also be suitable here, but MLM will do the job for now.
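The setup looks roughly like the sketch below, written against the Trainer API rather than the script itself. The corpus file, tokenizer path, batch size, and learning rate are assumptions for illustration.

```python
# Minimal MLM pretraining sketch, equivalent in spirit to Huggingface's example script.
# Assumptions: the linearised RDF corpus is a plain-text file (rdf_corpus.txt, one
# triple set per line) and a tokenizer has already been trained and saved to ./rdf-tokenizer.
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, RobertaConfig,
                          RobertaForMaskedLM, RobertaTokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("./rdf-tokenizer")
model = RobertaForMaskedLM(RobertaConfig(                 # same shape as the config sketched above
    vocab_size=tokenizer.vocab_size, max_position_embeddings=514,
    num_hidden_layers=6, num_attention_heads=12, type_vocab_size=1))

# Tokenise the corpus line by line.
dataset = load_dataset("text", data_files={"train": "rdf_corpus.txt"})["train"]
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
                      batched=True, remove_columns=["text"])

# The collator masks 15% of tokens; the model is trained to recover them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./rdf-encoder", max_steps=400_000,
                           per_device_train_batch_size=32, learning_rate=1e-4),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```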

After training for 400,000 steps, let's look at some of the results. For this, we give the model a string as input with the mask token in it, and see how it chooses to replace it.

"Brazil | capital | <mask>"

```
[{'score': 0.4451245069503784, 'token': 24435, 'token_str': ' Lisbon', 'sequence': 'Brazil | capital | Lisbon'},
 {'score': 0.4019171893596649, 'token': 2910, 'token_str': ' Brazil', 'sequence': 'Brazil | capital | Brazil'},
 {'score': 0.04149125516414642, 'token': 20698, 'token_str': ' Janeiro', 'sequence': 'Brazil | capital | Janeiro'},
 {'score': 0.012677792459726334, 'token': 5716, 'token_str': ' Rio', 'sequence': 'Brazil | capital | Rio'},
 {'score': 0.007810568902641535, 'token': 8947, 'token_str': ' Rome', 'sequence': 'Brazil | capital | Rome'}]
```
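Predictions like these come from the standard fill-mask pipeline; a minimal sketch, assuming the pretrained model and tokenizer were saved to ./rdf-encoder:

```python
# Query the pretrained RDF encoder with a masked triple (path is illustrative).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="./rdf-encoder", tokenizer="./rdf-encoder")
print(fill_mask("Brazil | capital | <mask>"))   # top predictions, as listed above
```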

The model doesn't encode factual "knowledge", but it has at least learned to suggest cities for this example. Good sign!

Now, let's use this with our bi-encoder.

Bi-encoder

For our bi-encoder, we need a text encoder and our RDF encoder. We use SentenceBERT's all-distilroberta-v1 model as the text encoder. For training, we use text-RDF pairs from WebNLG's English dataset as positive examples and randomly permuted pairs as negative examples. We then train the model with a contrastive loss, which minimizes the distance between the two embeddings if they form a positive pair and maximizes it (up to a margin) if they form a negative pair.
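A sketch of a single training step under these choices is below. The RDF-encoder path, the toy batch, the margin, and the learning rate are assumptions, and both towers are assumed to produce 768-dimensional embeddings.

```python
# Sketch of one contrastive training step for the two-tower bi-encoder.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

TEXT_MODEL = "sentence-transformers/all-distilroberta-v1"
RDF_MODEL = "./rdf-encoder"                                  # hypothetical save path

text_tok = AutoTokenizer.from_pretrained(TEXT_MODEL)
rdf_tok = AutoTokenizer.from_pretrained(RDF_MODEL)
text_enc = AutoModel.from_pretrained(TEXT_MODEL)
rdf_enc = AutoModel.from_pretrained(RDF_MODEL)

def embed(encoder, tokenizer, sentences):
    """Mean-pool the last hidden states into one vector per input."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state              # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)              # (B, H)

def contrastive_loss(text_emb, rdf_emb, labels, margin=0.5):
    """labels: 1 for aligned text-RDF pairs, 0 for randomly permuted pairs."""
    dist = 1 - F.cosine_similarity(text_emb, rdf_emb)        # cosine distance per pair
    positive = labels * dist.pow(2)                          # pull aligned pairs together
    negative = (1 - labels) * F.relu(margin - dist).pow(2)   # push permuted pairs past the margin
    return (positive + negative).mean()

optimizer = torch.optim.AdamW(list(text_enc.parameters()) + list(rdf_enc.parameters()), lr=2e-5)

# One toy batch: a positive WebNLG-style pair and a permuted negative.
texts = ["Adirondack Regional Airport is 507 metres above sea level",
         "Adirondack Regional Airport is 507 metres above sea level"]
rdfs = ["Adirondack_Regional_Airport | elevationAboveTheSeaLevel_(in_metres) | 507",
        "Brazil | capital | Brasília"]
labels = torch.tensor([1.0, 0.0])

loss = contrastive_loss(embed(text_enc, text_tok, texts), embed(rdf_enc, rdf_tok, rdfs), labels)
loss.backward()
optimizer.step()
```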

Now let's look at the results by measuring the cosine similarity between "Adirondack Regional Airport is 507 metres above sea level" and "Adirondack_Regional_Airport | elevationAboveTheSeaLevel_(in_metres) | 507", a pair that should have high similarity.
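This check can be run roughly as follows, assuming the two fine-tuned towers were saved separately (the paths are hypothetical) and using the same mean pooling as in the training sketch above:

```python
# Compare one text embedding with one RDF embedding via cosine similarity.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def embed(model_path, sentence):
    """Mean-pooled embedding of a single sentence from the given tower."""
    tok = AutoTokenizer.from_pretrained(model_path)
    enc = AutoModel.from_pretrained(model_path)
    batch = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)

text_emb = embed("./bi-encoder/text-tower",
                 "Adirondack Regional Airport is 507 metres above sea level")
rdf_emb = embed("./bi-encoder/rdf-tower",
                "Adirondack_Regional_Airport | elevationAboveTheSeaLevel_(in_metres) | 507")
print(F.cosine_similarity(text_emb, rdf_emb))
```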

tensor([0.0313])

Well, that's not a good sign. Something needs to change, and we will investigate what over the next week. Here are some ideas that could work:

  1. The basics: experimenting with different hyperparameters.
  2. Better negative sampling.
  3. Continued pretraining of an existing language model for the RDF encoder (domain adaptation rather than training from scratch), and then using the same language model as the text encoder.