In this notebook, we walk through a simple example that uses the PDPA dataset to illustrate how to fine-tune Golden Retriever.

##### Import relevant packages

In [1]:
import sys
sys.path.append("../..")

import os
import random
import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.model_selection import train_test_split

from src.models import GoldenRetriever
from src.encoders import USEEncoder
from src.data_handler.kb_handler import kb, kb_handler
from src.finetune.generators import hard_triplet_generator
from src.finetune.config import CONFIG

In [2]:
%load_ext autoreload
%autoreload 2

##### Import .csv file data

In [3]:
# Get df using kb_handler
kbh = kb_handler()

path_to_csv = "./../../data/pdpa.csv"
answer_col = "ans_str"
query_col = "query_str"
context_col = ""
kb_name = "pdpa"

pdpa_kb = kbh.parse_csv(path_to_csv, answer_col, query_col, context_col, kb_name)
pdpa_df = pdpa_kb.create_df()
pdpa_df.head()

| | query_string | processed_string | kb_name |
|---|---|---|---|
| 0 | What is personal data? | Organisations, General Personal data refers t... | pdpa |
| 1 | When did the PDPA come into force? | Organisations, General The PDPA was implement... | pdpa |
| 2 | What are the objectives of the PDPA? | Organisations, General The PDPA aims to safeg... | pdpa |
| 3 | How does the PDPA benefit business? | Organisations, General The PDPA will strength... | pdpa |
| 4 | How will the PDPA impact business costs? | Organisations, General The provisions of the ... | pdpa |


##### Fine-tuning

Train-test split

In [4]:
train_dict = dict()
test_dict = dict()

pdpa_id = pdpa_df.index.values
train_idx, test_idx = train_test_split(pdpa_id, test_size=0.4, random_state=100)

train_dict["pdpa"] = train_idx
test_dict["pdpa"] = test_idx

Fine-tuning uses a triplet loss. At each step, we mine hard triplets: for every query, we pick the incorrect response that is most similar to the query text and use it as the negative example. We observed that fine-tuning on these hard triplets improves model performance significantly.
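The mining step and the loss described above can be sketched in NumPy. This is an illustration only, not Golden Retriever's implementation: the helper names (`l2_normalize`, `mine_hard_negatives`, `triplet_loss`), the choice of cosine similarity, and the toy vectors are all assumptions. Only the margin of 0.3 mirrors the value passed to `gr.finetune` in this notebook.

```python
import numpy as np


def l2_normalize(x):
    # Scale each row to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mine_hard_negatives(query_vecs, response_vecs):
    # Assumption: response i is the correct answer for query i.
    sim = l2_normalize(query_vecs) @ l2_normalize(response_vecs).T
    # Mask the correct response so it can never be chosen as a negative.
    np.fill_diagonal(sim, -np.inf)
    # The hard negative is the most similar *incorrect* response.
    return sim.argmax(axis=1)


def triplet_loss(query_vecs, pos_vecs, neg_vecs, margin=0.3):
    # Cosine distance = 1 - cosine similarity (an assumed distance measure).
    q = l2_normalize(query_vecs)
    d_pos = 1.0 - np.sum(q * l2_normalize(pos_vecs), axis=1)
    d_neg = 1.0 - np.sum(q * l2_normalize(neg_vecs), axis=1)
    # Penalise triplets where the negative is not at least `margin` farther away.
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


# Toy 2-D "embeddings" standing in for encoder outputs (hypothetical data).
queries = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 2.0]])
responses = np.array([[1.0, 0.1], [0.1, 1.0], [1.0, 0.9]])

neg_idx = mine_hard_negatives(queries, responses)
loss = triplet_loss(queries, responses, responses[neg_idx], margin=0.3)
print("hard negative indices:", neg_idx)
print("mean triplet loss:", loss)
```

Hard negatives keep the loss informative: a randomly chosen negative is often already farther from the query than the margin requires, in which case the triplet contributes zero loss and no gradient.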

In [5]:
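# Initialise the encoder and wrap it in a GoldenRetriever model
use = USEEncoder()
gr = GoldenRetriever(use)

# Generator that yields (query, response, hard negative response) batches
train_dataset_loader = hard_triplet_generator(pdpa_df, train_dict, gr, CONFIG)

for q, r, neg_r in train_dataset_loader:

    # The response text is also passed as the context
    cost_mean_batch = gr.finetune(
        question=q, answer=r, context=r,
        neg_answer=neg_r, neg_answer_context=neg_r,
        margin=0.3, loss="triplet"
    )

    print("cost_mean_batch", cost_mean_batch)

    # Stop after one batch for this demonstration; remove to train on every batch
    break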

INFO:absl:Using C:\Users\Kenneth\AppData\Local\Temp\tfhub_modules to cache modules.


model initiated!
cost_mean_batch 0.29314637


In [6]:
encoded_text = gr.encoder.encode("Why do we need PDPA?", string_type="query")
encoded_text.shape

TensorShape([1, 512])

##### Export finetuned weights

In [7]:
save_dir = "./finetune_use"
# exist_ok=True lets the cell be re-run without raising FileExistsError
os.makedirs(save_dir, exist_ok=True)

In [8]:
gr.export_encoder(save_dir=save_dir)

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


INFO:tensorflow:Assets written to: ./finetune_use\assets


###### Restore weights and ensure that the encoded texts are the same

In [9]:
use_res = USEEncoder()
gr_res = GoldenRetriever(use_res)

model initiated!


In [10]:
gr_res.restore_encoder(save_dir=save_dir)

model initiated!


In [11]:
encoded_text_res = gr_res.encoder.encode("Why do we need PDPA?", string_type="query")
encoded_text_res.shape

TensorShape([1, 512])

To confirm that exporting and restoring the encoder weights worked, we assert that the two query encodings, `encoded_text` and `encoded_text_res`, are identical. The assertion passes silently when they match.

In [12]:
# Raises InvalidArgumentError if any element of the two tensors differs
tf.debugging.assert_equal(encoded_text, encoded_text_res)
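Exact equality is a strict test. If the restored model is ever run on different hardware or with a different library version, tiny floating-point differences could make it fail even though the weights were restored correctly. A tolerance-based comparison is one possible alternative, sketched below with NumPy. The helper `embeddings_match`, the tolerance of `1e-6`, and the random stand-in arrays are assumptions for illustration; in TensorFlow, `tf.debugging.assert_near` plays a similar role.

```python
import numpy as np


def embeddings_match(a, b, atol=1e-6):
    # Compare shapes first, then values within an absolute tolerance.
    a = np.asarray(a)
    b = np.asarray(b)
    return a.shape == b.shape and bool(np.allclose(a, b, rtol=0.0, atol=atol))


# Random stand-ins for the two (1, 512) encodings (hypothetical data).
rng = np.random.default_rng(0)
original = rng.normal(size=(1, 512)).astype(np.float32)
restored = original.copy()      # what a faithful restore should produce
perturbed = original + 1e-2     # what a failed restore might look like

print(embeddings_match(original, restored))
print(embeddings_match(original, perturbed))
```

With real tensors, `encoded_text.numpy()` and `encoded_text_res.numpy()` could be passed in place of the stand-in arrays.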