# KDD Tutorial: Search Ranking Model With Transformer, Session 2

## Introduction 

In this tutorial, you will learn how to train a search ranking model with a BERT-based model.

The task is to rank documents for a given query by query-document relevance. In the code below, the query and the ad title are tokenized together as a sentence pair and passed through a BERT-based encoder, which captures the semantic meaning of the text. The encoder's pooled output is fed to a linear classification layer that produces a relevance logit, and a sigmoid turns that logit into the probability that the document is relevant to the query. We then rank the documents for a query by this probability.
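The last step, turning model outputs into a ranking, can be sketched in a few lines. This is a minimal illustration, not the tutorial's own code: the ad titles and logits are made up, and `rank_documents` is a hypothetical helper name.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def rank_documents(documents, logits):
    """Return (document, probability) pairs sorted from most to least relevant."""
    scored = [(doc, sigmoid(logit)) for doc, logit in zip(documents, logits)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical ad titles and made-up logits for a single query.
docs = ["Caribbean Cruises", "Cruise Port Parking", "Office Chairs"]
logits = [0.3, 1.7, -2.1]

ranking = rank_documents(docs, logits)
for doc, prob in ranking:
    print(f"{prob:.3f}  {doc}")
```

Because the sigmoid is monotonic, sorting by probability gives the same order as sorting by logit; the probability is mainly useful when you also want to apply a threshold.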

You will also learn how to download the pre-trained model and access datasets using the Hugging Face libraries.

The dataset used in this tutorial is the Microsoft Query-Ad Matching (QADSM) dataset, which is part of the XGLUE benchmark. XGLUE is a benchmark dataset for cross-lingual pre-training, understanding and generation.

The classification accuracy reported for QADSM in the original XGLUE paper ([arXiv:2004.01401](https://arxiv.org/pdf/2004.01401.pdf)) is around 64 to 68 percent. You should be able to reach similar performance with this ranking model.


#### Sections

1. [Installation of libraries and load datasets](#section01)

2. [Dataset Information](#section02)

3. [Classes and functions](#section03)

4. [Parameters](#section04)

5. [Training and validation](#section05)

6. [Prediction](#section06)

7. [Evaluation](#section07)

8. [Ranking Demo](#section08)



## Installation of libraries and load datasets <a name="section01"></a>

In [None]:
! pip install datasets
from datasets import load_dataset


In [None]:
dataset = load_dataset('xglue', 'qadsm')


In [None]:
# !pip install datasets
!pip install transformers==3.1.0


In [None]:
import torch
import torch.nn as nn
import os
import matplotlib.pyplot as plt
import copy
import torch.optim as optim
import random
import numpy as np
import pandas as pd
from torch.utils.data import DataLoader, Dataset
from torch.cuda.amp import autocast, GradScaler
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModel, AdamW, get_linear_schedule_with_warmup

os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Dataset Information <a name="section02"></a>

In [None]:
dataset.shape

{'test.de': (10000, 4),
 'test.en': (10000, 4),
 'test.fr': (10000, 4),
 'train': (100000, 4),
 'validation.de': (10000, 4),
 'validation.en': (10000, 4),
 'validation.fr': (10000, 4)}

Generate the training, validation, and test sets. In this tutorial, we use the English validation and test data. Feel free to try the validation and test data in the other languages (German and French).

In [None]:
train = dataset['train']
val = dataset['validation.en']
test = dataset['test.en']
# Transform data into pandas dataframes
df_train_all = pd.DataFrame(train)
df_val = pd.DataFrame(val)
df_test = pd.DataFrame(test)

In [None]:
# To test the end-to-end code quickly on a sample of the training data, set train_with_full_set = 0.
# To use all the training data, set train_with_full_set = 1.
train_with_full_set = 1
if train_with_full_set:
  df_train = df_train_all
else:
  df_train = df_train_all.head(1000)

In [None]:
df_train.head()

| | query | ad_title | ad_description | relevance_label |
|---|---|---|---|---|
| 0 | cruise portland maine | New England Cruises | Your New England Cruise Awaits! Holland Americ... | 1 |
| 1 | transportation to cruise port miami | Holland America Line® | Explore Your World with Four Extraordinary Off... | 0 |
| 2 | transportation to cruise port miami | Holland America Line® | Cruise to Your Own Private Island In the Carib... | 1 |
| 3 | galveston cruise parking | Caribbean Cruises | Sign Up for Offers and Explore the Caribbean w... | 0 |
| 4 | cruise portland maine | Holland America Line® | Official Site - Sign Up for Special New Englan... | 1 |


## Classes and functions <a name="section03"></a>

In [None]:
class BuildDataset(Dataset):

    def __init__(self, data, maxlen, with_labels=True, bert_model='albert-base-v2'):

        self.data = data 
        #Initialize the tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(bert_model)  
        self.maxlen = maxlen
        self.with_labels = with_labels 

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):

        # Selecting query and ads title in the data frame
        query = str(self.data.loc[index, 'query'])
        document = str(self.data.loc[index, 'ad_title'])

        # Tokenize <query, ad_title>, get token ids, attention masks and token type ids
        encoded_pair = self.tokenizer(query, document, 
                                      padding='max_length',  # Pad to max_length
                                      truncation=True,  # Truncate to max_length
                                      max_length=self.maxlen,  
                                      return_tensors='pt')  # Return torch.Tensor objects
        
        token_ids = encoded_pair['input_ids'].squeeze(0)  
        attn_masks = encoded_pair['attention_mask'].squeeze(0)  
        token_type_ids = encoded_pair['token_type_ids'].squeeze(0)

        if self.with_labels:  # True if the dataset has labels
          label = self.data.loc[index, 'relevance_label']
          return token_ids, attn_masks, token_type_ids, label  
        else:
          return token_ids, attn_masks, token_type_ids
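To make the three tensors returned by `BuildDataset` more concrete, here is a toy sketch of how a BERT-style pair encoding is laid out. It is an illustration only: `encode_pair_toy` is a hypothetical helper that splits on whitespace and truncates naively, whereas the real tokenizer uses subword pieces and its own truncation strategy.

```python
def encode_pair_toy(query, document, max_length):
    """Toy stand-in for a BERT-style pair encoding (whitespace tokens, not real subwords)."""
    first = ["[CLS]"] + query.split() + ["[SEP]"]   # segment 0: the query
    second = document.split() + ["[SEP]"]           # segment 1: the document
    # Naive truncation for illustration; a real tokenizer preserves the special tokens.
    tokens = (first + second)[:max_length]
    token_type_ids = ([0] * len(first) + [1] * len(second))[:max_length]
    attention_mask = [1] * len(tokens)              # 1 = real token
    pad = max_length - len(tokens)
    tokens += ["[PAD]"] * pad
    token_type_ids += [0] * pad
    attention_mask += [0] * pad                     # 0 = padding, ignored by attention
    return tokens, attention_mask, token_type_ids

tokens, attention_mask, token_type_ids = encode_pair_toy(
    "galveston cruise parking", "Caribbean Cruises", max_length=10)
print(tokens)
print(attention_mask)
print(token_type_ids)
```

Every example comes out with the same length (`max_length`), which is what lets the `DataLoader` stack examples into a batch.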

In [None]:
class QueryDocumentClassifier(nn.Module):

    def __init__(self, bert_model="albert-base-v2", freeze_bert=False):
        super(QueryDocumentClassifier, self).__init__()
        #  Instantiating BERT-based model object
        self.bert_layer = AutoModel.from_pretrained(bert_model)
        
        # Read the encoder's hidden-state size from its config, so any of the supported checkpoints works
        hidden_size = self.bert_layer.config.hidden_size

        # Freeze bert layers, train the classification layer weights only.
        if freeze_bert:
            for p in self.bert_layer.parameters():
                p.requires_grad = False

        # Classification layer
        self.cls_layer = nn.Linear(hidden_size, 1)
        self.dropout = nn.Dropout(p=0.1)


    @autocast() 
    def forward(self, input_ids, attn_masks, token_type_ids):
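        # Feeding the inputs to the BERT-based model.
        # Note: the tuple unpacking below relies on the pinned transformers==3.1.0, where the model returns a plain tuple
        cont_reps, pooler_output = self.bert_layer(input_ids, attn_masks, token_type_ids)
        # Feeding the pooled [CLS] representation (pooler_output) to the classifier layer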
        logits = self.cls_layer(self.dropout(pooler_output))

        return logits

In [None]:
def set_seed(seed):
    """ Set all seeds to make results reproducible """
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    

def evaluate_loss(net, device, criterion, dataloader):
    net.eval()
    mean_loss = 0
    count = 0
    with torch.no_grad():
        for it, (seq, attn_masks, token_type_ids, labels) in enumerate(tqdm(dataloader)):
            seq, attn_masks, token_type_ids, labels = \
                seq.to(device), attn_masks.to(device), token_type_ids.to(device), labels.to(device)
            logits = net(seq, attn_masks, token_type_ids)
            mean_loss += criterion(logits.squeeze(-1), labels.float()).item()
            count += 1
    return mean_loss / count
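The criterion passed to `evaluate_loss` in this tutorial is `nn.BCEWithLogitsLoss`, which combines a sigmoid with binary cross-entropy. The sketch below is a plain-Python illustration of that loss for single values, using a common numerically stable formulation; it is not PyTorch's implementation, and the function names are hypothetical.

```python
import math

def bce_with_logits(logit, label):
    """Binary cross-entropy on a raw logit, written in a numerically stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def bce_naive(logit, label):
    """Textbook form: -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def mean_bce_with_logits(logits, labels):
    """Average the per-example losses, as the default 'mean' reduction does."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, labels)]
    return sum(losses) / len(losses)

# Made-up logits and labels.
example_logits = [2.0, -1.0, 0.0]
example_labels = [1, 0, 1]
print(mean_bce_with_logits(example_logits, example_labels))
print(bce_with_logits(0.0, 1))  # a logit of 0 (probability 0.5) gives ln 2
```

A model that always outputs a probability of 0.5 scores ln 2 (about 0.693) on this loss, which is a useful reference point when reading the validation losses printed during training.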

In [None]:
print("Create model folder...")
!mkdir models

Create model folder...


In [None]:
def train_bert(net, criterion, opti, lr, lr_scheduler, train_loader, val_loader, epochs, iters_to_accumulate):

    best_loss = np.inf
    best_ep = 1
    nb_iterations = len(train_loader)
    # print the training loss 5 times per epoch
    print_every = nb_iterations // 5  
    iters = []
    train_losses = []
    val_losses = []

    scaler = GradScaler()

    for ep in range(epochs):
        net.train()
        running_loss = 0.0
        for it, (seq, attn_masks, token_type_ids, labels) in enumerate(tqdm(train_loader)):
            # Converting to cuda tensors
            seq, attn_masks, token_type_ids, labels = \
                seq.to(device), attn_masks.to(device), token_type_ids.to(device), labels.to(device)
    
            # Enables autocasting for the forward pass (model + loss)
            with autocast():
                # Obtaining the logits from the model
                logits = net(seq, attn_masks, token_type_ids)
                # Computing loss
                loss = criterion(logits.squeeze(-1), labels.float())
                # Normalize the loss
                loss = loss / iters_to_accumulate  

            # Backpropagating the gradients
            # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
            scaler.scale(loss).backward()

            if (it + 1) % iters_to_accumulate == 0:
                # Optimization step
                # scaler.step() first unscales the gradients of the optimizer's assigned params.
                # If these gradients do not contain infs or NaNs, opti.step() is then called,
                # otherwise, opti.step() is skipped.
                scaler.step(opti)
                # Updates the scale for next iteration.
                scaler.update()
                # Adjust the learning rate based on the number of iterations.
                lr_scheduler.step()
                # Clear gradients
                opti.zero_grad()


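            # Note: `loss` was divided by iters_to_accumulate above, so the running loss printed below
            # is on a scale of 1/iters_to_accumulate relative to the validation loss
            running_loss += loss.item()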
            # Print the loss
            if (it + 1) % print_every == 0:  
                print()
                print("Iteration {}/{} of epoch {} complete. Loss : {} "
                      .format(it+1, nb_iterations, ep+1, running_loss / print_every))

                running_loss = 0.0

        # Compute validation loss
        val_loss = evaluate_loss(net, device, criterion, val_loader)  
        print()
        print("Epoch {} complete! Validation Loss : {}".format(ep+1, val_loss))

        if val_loss < best_loss:
            print("Best validation loss improved from {} to {}".format(best_loss, val_loss))
            print()
            net_copy = copy.deepcopy(net)  # save a copy of the model
            best_loss = val_loss
            best_ep = ep + 1

    # Saving the model
    path_to_model='models/{}_lr_{}_val_loss_{}_ep_{}.pt'.format(bert_model, lr, round(best_loss, 5), best_ep)
    torch.save(net_copy.state_dict(), path_to_model)
    print("The model has been saved in {}".format(path_to_model))

    del loss
    torch.cuda.empty_cache()

## Parameters <a name="section04"></a>

In [None]:
bert_model = "albert-base-v2"  # 'albert-base-v2', 'albert-large-v2', 'albert-xlarge-v2', 'albert-xxlarge-v2', 'bert-base-uncased', ...
freeze_bert = False  # if True, freeze the encoder weights and only update the classification layer weights
maxlen = 128  # maximum length of the tokenized input pair: longer inputs are truncated to "maxlen", shorter ones are padded
bs = 16  # batch size
iters_to_accumulate = 2  # gradient accumulation: gradients are summed over iters_to_accumulate batches, giving an effective batch size of bs * iters_to_accumulate. Set to 1 for no accumulation
lr = 2e-5  # learning rate
epochs = 4  # number of training epochs
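As a quick sanity check on these settings, the arithmetic below works out the number of batches per epoch, the effective batch size, and the number of optimizer updates. It is a sketch that assumes the full 100,000-example training split and a `DataLoader` that keeps the last partial batch (the default).

```python
n_train = 100_000          # size of the train split reported by dataset.shape
bs = 16                    # batch size
iters_to_accumulate = 2    # batches accumulated per optimizer step
epochs = 4

# Ceiling division: the DataLoader keeps a final partial batch by default.
batches_per_epoch = -(-n_train // bs)
effective_batch_size = bs * iters_to_accumulate
# One optimizer (and scheduler) step every iters_to_accumulate batches.
optimizer_steps = (batches_per_epoch // iters_to_accumulate) * epochs

print(batches_per_epoch)     # 6250
print(effective_batch_size)  # 32
print(optimizer_steps)       # 12500
```

The 6250 batches per epoch match the iteration count shown in the training log, and 12500 is the total number of steps the learning-rate scheduler is given.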

## Training and validation <a name="section05"></a>

In [None]:
#  Set all seeds to make reproducible results
set_seed(1)
# Creating instances of training and validation set
print("Reading training data...")
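# Pass bert_model by keyword: the third positional parameter of BuildDataset is with_labels
train_set = BuildDataset(df_train, maxlen, bert_model=bert_model)
print("Reading validation data...")
val_set = BuildDataset(df_val, maxlen, bert_model=bert_model)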
# Creating instances of training and validation dataloaders
train_loader = DataLoader(train_set, batch_size=bs, num_workers=5)
val_loader = DataLoader(val_set, batch_size=bs, num_workers=5)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = QueryDocumentClassifier(bert_model, freeze_bert=freeze_bert)

# if multiple GPUs
if torch.cuda.device_count() > 1:  
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    net = nn.DataParallel(net)

net.to(device)

criterion = nn.BCEWithLogitsLoss()
opti = AdamW(net.parameters(), lr=lr, weight_decay=1e-2)
# The number of steps for the warmup phase.
num_warmup_steps = 0
# The total number of training steps 
num_training_steps = epochs * len(train_loader) 
# Necessary to take into account Gradient accumulation 
t_total = (len(train_loader) // iters_to_accumulate) * epochs  
lr_scheduler = get_linear_schedule_with_warmup(optimizer=opti, num_warmup_steps=num_warmup_steps, num_training_steps=t_total)
train_bert(net, criterion, opti, lr, lr_scheduler, train_loader, val_loader, epochs, iters_to_accumulate)

Reading training data...
Reading validation data...

Iteration 1250/6250 of epoch 1 complete. Loss : 0.3405399726986885
Iteration 2500/6250 of epoch 1 complete. Loss : 0.3358336829543114
Iteration 3750/6250 of epoch 1 complete. Loss : 0.32608729009628296
Iteration 5000/6250 of epoch 1 complete. Loss : 0.3221067599058151
Iteration 6250/6250 of epoch 1 complete. Loss : 0.32646589955091476

Epoch 1 complete! Validation Loss : 0.6254636513710022
Best validation loss improved from inf to 0.6254636513710022

Iteration 1250/6250 of epoch 2 complete. Loss : 0.3137537691950798
Iteration 2500/6250 of epoch 2 complete. Loss : 0.3110032447338104
Iteration 3750/6250 of epoch 2 complete. Loss : 0.29456329785585406
Iteration 5000/6250 of epoch 2 complete. Loss : 0.2894338568925858
Iteration 6250/6250 of epoch 2 complete. Loss : 0.3005141176819801

Epoch 2 complete! Validation Loss : 0.6157049472808838
Best validation loss improved from 0.6254636513710022 to 0.6157049472808838

Iteration 1250/6250 of epoch 3 complete. Loss : 0.28032425280809403
Iteration 2500/6250 of epoch 3 complete. Loss : 0.27898313146829606
Iteration 3750/6250 of epoch 3 complete. Loss : 0.256544635540247
Iteration 5000/6250 of epoch 3 complete. Loss : 0.2522626137852669
Iteration 6250/6250 of epoch 3 complete. Loss : 0.26860654550790786

Epoch 3 complete! Validation Loss : 0.6330283178329468

Iteration 1250/6250 of epoch 4 complete. Loss : 0.24368842961788179
Iteration 2500/6250 of epoch 4 complete. Loss : 0.24844682307243346
Iteration 3750/6250 of epoch 4 complete. Loss : 0.22942082131505012
Iteration 5000/6250 of epoch 4 complete. Loss : 0.23279945095777513
Iteration 6250/6250 of epoch 4 complete. Loss : 0.25906161076426504

Epoch 4 complete! Validation Loss : 0.6549724748611451
The model has been saved in models/albert-base-v2_lr_2e-05_val_loss_0.6157_ep_2.pt


The model is saved in the "models" folder, which you can find under "Files" in the left sidebar of the Colab notebook.
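The training cell adjusts the learning rate with `get_linear_schedule_with_warmup`. The sketch below illustrates the multiplier such a schedule applies to the base learning rate, assuming it behaves as documented: a linear ramp from 0 to 1 over the warmup steps, then a linear decay to 0 at the last training step. `linear_schedule_factor` is a hypothetical helper written for this illustration, not the library function.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 down to 0 at num_training_steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5      # lr from the Parameters section
t_total = 12_500    # (6250 // 2) * 4 optimizer steps with the settings in this tutorial

for step in (0, 6_250, 12_500):
    factor = linear_schedule_factor(step, num_warmup_steps=0, num_training_steps=t_total)
    print(step, base_lr * factor)
```

With `num_warmup_steps = 0`, as in this tutorial, training starts at the full learning rate and the rate reaches zero on the final optimizer step.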

## Prediction <a name="section06"></a>
Predict the relevance score for each (query, ad title) pair in the test set and save the results to a file.

In [None]:
print("Create 'results' folder...")
!mkdir results

Create 'results' folder...


In [None]:
def get_probs_from_logits(logits):
    """
    apply sigmoid function, converts a tensor of logits into an array of probabilities
    """
    probs = torch.sigmoid(logits.unsqueeze(-1))
    return probs.detach().cpu().numpy()

def test_prediction(net, device, dataloader, with_labels=True, result_file="results/output.txt"):
    net.eval()
    w = open(result_file, 'w')
    probs_all = []

    with torch.no_grad():
        if with_labels:
            for seq, attn_masks, token_type_ids, _ in tqdm(dataloader):
                seq, attn_masks, token_type_ids = seq.to(device), attn_masks.to(device), token_type_ids.to(device)
                logits = net(seq, attn_masks, token_type_ids)
                probs = get_probs_from_logits(logits.squeeze(-1)).squeeze(-1)
                probs_all += probs.tolist()
        else:
            for seq, attn_masks, token_type_ids in tqdm(dataloader):
                seq, attn_masks, token_type_ids = seq.to(device), attn_masks.to(device), token_type_ids.to(device)
                logits = net(seq, attn_masks, token_type_ids)
                probs = get_probs_from_logits(logits.squeeze(-1)).squeeze(-1)
                probs_all += probs.tolist()

    w.writelines(str(prob)+'\n' for prob in probs_all)
    w.close()

In [None]:
path_to_model = '/content/models/albert-base-v2_lr_2e-05_val_loss_0.6157_ep_2.pt'
# You can add your trained model here
# path_to_model = '/content/models/...'  

path_to_output_file = 'results/output.txt'

print("Reading test data...")
test_set = BuildDataset(df_test, maxlen, bert_model=bert_model)  # keyword argument: the third positional parameter is with_labels
test_loader = DataLoader(test_set, batch_size=bs, num_workers=5)

model = QueryDocumentClassifier(bert_model)
if torch.cuda.device_count() > 1:  # if multiple GPUs
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)

print("Loading the weights of the model...")
model.load_state_dict(torch.load(path_to_model, map_location=device))  # map_location lets the checkpoint load on CPU-only machines too
model.to(device)

print("Predicting on test data...")
test_prediction(net=model, device=device, dataloader=test_loader, with_labels=True,  # set with_labels to False to get predictions on a dataset without labels
                result_file=path_to_output_file)

print("Predictions are available in : {}".format(path_to_output_file))

Reading test data...
Loading the weights of the model...
Predicting on test data...
Predictions are available in : results/output.txt





The predictions are saved in the "results" folder, which you can find under "Files" in the left sidebar of the Colab notebook.

## Evaluation <a name="section07"></a>
We now have a predicted relevance score for each (query, ad title) pair in the test set. We threshold the predicted scores and compare the resulting labels with the true labels to calculate accuracy, precision, recall, and F1.

In [None]:
# you can adjust this threshold for your own dataset
threshold = 0.6
# path to the file with prediction probabilities   
path_to_output_file = 'results/output.txt' 
# true labels 
labels_test = df_test['relevance_label']  
# prediction probabilities
probs_test = pd.read_csv(path_to_output_file, header=None)[0] 
# predicted labels using the above fixed threshold
preds_test=(probs_test>=threshold).astype('uint8') 


In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
preds = preds_test
labels = labels_test
precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
acc = accuracy_score(labels, preds)
print('accuracy', acc)
print('f1', f1)
print('precision', precision)
print('recall', recall)

accuracy 0.666
f1 0.6347331583552056
precision 0.7002895752895753
recall 0.5804
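To show what these metrics measure, the sketch below computes them by hand from thresholded probabilities. The labels and probabilities are made-up toy values, and `binary_metrics` is a hypothetical helper; the tutorial itself relies on scikit-learn for the real numbers.

```python
def binary_metrics(labels, probs, threshold):
    """Threshold probabilities and compute accuracy, precision, recall and F1."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy data: six examples, same threshold as the evaluation cell.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_probs = [0.9, 0.7, 0.65, 0.4, 0.2, 0.59]
metrics = binary_metrics(toy_labels, toy_probs, threshold=0.6)
print(metrics)

# F1 is the harmonic mean of precision and recall; the reported values are consistent with that.
reported_precision, reported_recall = 0.7002895752895753, 0.5804
f1_check = 2 * reported_precision * reported_recall / (reported_precision + reported_recall)
print(f1_check)
```

Raising the threshold generally trades recall for precision, so it is worth trying a few values on the validation set rather than on the test set.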


## Ranking Demo <a name="section08"></a>
For a quick demo, we use all the ad titles from the test set as the documents. Change the demo query to a query of your choice and see which ads the model predicts to be most relevant.

In [None]:
demo_query = 'black ops 2'
document = list(df_test['ad_title'])
query =  [demo_query]* len(document)
# create a dummy label column so the data has the structure BuildDataset expects when with_labels=True
label_dummy = ['nan']* len(document)
data = {'query':query,
       'ad_title':document,
        'relevance_label':label_dummy}

df_demo = pd.DataFrame(data)
demo_set = BuildDataset(df_demo, maxlen, bert_model=bert_model)  # keyword argument: the third positional parameter is with_labels
demo_loader = DataLoader(demo_set, batch_size=bs, num_workers=5)
test_prediction(net=model, device=device, dataloader=demo_loader, with_labels=True,  
                result_file='results/demo.txt')




In [None]:
# read the predicted relevance score of each ad to the demo query
probs_demo = pd.read_csv('results/demo.txt', header=None)[0]  
df_demo['predict_score'] = list(probs_demo)
# sort the data set based on the relevance score
df_demo_sorted = df_demo.sort_values(by='predict_score', ascending=False)
# print the top 20 ads
df_demo_sorted[:20]

| | query | ad_title | relevance_label | predict_score |
|---|---|---|---|---|
| 6218 | black ops 2 | Black Ops Game at Amazon | | 0.868652 |
| 53 | black ops 2 | Black Ops 2 Poster | | 0.848633 |
| 9308 | black ops 2 | Black Ops Game Guide | | 0.794434 |
| 6937 | black ops 2 | COD Black Ops 2 Cheats | | 0.793457 |
| 7799 | black ops 2 | COD Black Ops 2 Cheats | | 0.793457 |
| 5913 | black ops 2 | iWin® Games Official Site | | 0.761719 |
| 5915 | black ops 2 | iWin® Games Official Site | | 0.761719 |
| 4935 | black ops 2 | Full Movies (Watch Now) | | 0.760254 |
| 1722 | black ops 2 | full length movies | | 0.759277 |
| 2747 | black ops 2 | Create Your Own Games | | 0.752441 |
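The same ad title can appear several times in the test set, so the ranked list contains repeats. One possible refinement, sketched below, is to de-duplicate the titles before scoring and then keep the top results. The helper names are hypothetical, and `score_lookup` simply reuses scores copied from the table above as a stand-in for calling the model.

```python
def unique_in_order(items):
    """Drop repeated items while keeping the first-seen order."""
    return list(dict.fromkeys(items))

def top_k(scored, k):
    """Return the k highest-scoring (title, score) pairs."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

titles = [
    "Black Ops Game Guide",
    "COD Black Ops 2 Cheats",
    "Black Ops 2 Poster",
    "COD Black Ops 2 Cheats",   # duplicate ad title
    "Black Ops Game at Amazon",
]
unique_titles = unique_in_order(titles)

# Stand-in for the model: scores copied from the demo table.
score_lookup = {
    "Black Ops Game at Amazon": 0.868652,
    "Black Ops 2 Poster": 0.848633,
    "Black Ops Game Guide": 0.794434,
    "COD Black Ops 2 Cheats": 0.793457,
}
scored = [(title, score_lookup[title]) for title in unique_titles]
best = top_k(scored, k=3)
for title, score in best:
    print(f"{score:.6f}  {title}")
```

De-duplicating before scoring also reduces the number of (query, ad title) pairs the model has to process.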
