# Use Pretrained RoBERTa Model to Calculate Cosine Similarity
One might ask why we need a sentence pair classification model at all, when we could embed the texts with a BERT model and measure document similarity with cosine similarity. Note, however, that the question we are asking is whether a resume is suitable for an interview for a particular job description. That may or may not mean the two passages are similar in content. After all, they are written for different purposes.

In our particular training set, the labels are similarities between job titles, which may or may not correlate with how much the job descriptions differ from each other.

For these reasons, plain job-resume cosine similarity is a poor predictor of whether a resume is suitable for an interview, as the code below shows.

In [1]:
# simpletransformers only works with this version
!pip install transformers==3.0.2



In [2]:
# We load the same model that was previously trained on the GLUE STS-B benchmark
from simpletransformers.classification import ClassificationModel
train_args = {
    'reprocess_input_data': True,
    'evaluate_during_training': True,
    'evaluate_during_training_steps': 200,
    'max_seq_length': 512,
    'num_train_epochs': 5,
    'train_batch_size': 6,
    #'train_batch_size': 16,
    'wandb_project': 'vec2rec-roberta',
    'wandb_kwargs':{"id":id, "resume":True},
    #'no-cache':True,
    #'no_save':True,
    #'save_model_every_epoch':False,
    'save_eval_checkpoints':False,
    'save_steps':False,
    #'best_model_dir':"/kaggle/tmp/outputs/best_model"
    #'output_dir':"/kaggle/tmp/outputs/",
    'overwrite_output_dir':True,
    'use_early_stopping':True,
    'early_stopping_delta':0.001,
    'early_stopping_consider_epochs':True,
    'regression': True,
}
model = ClassificationModel('roberta', '../saved_models/outputs-sts-b/', num_labels=1, args=train_args, use_cuda=True)
model.args

ClassificationArgs(adam_epsilon=1e-08, best_model_dir='outputs/best_model', cache_dir='cache_dir/', custom_layer_parameters=[], custom_parameter_groups=[], train_custom_parameters_only=False, config={}, dataloader_num_workers=6, do_lower_case=False, early_stopping_consider_epochs=True, early_stopping_delta=0.001, early_stopping_metric='eval_loss', early_stopping_metric_minimize=True, early_stopping_patience=3, encoding=None, eval_batch_size=8, evaluate_during_training=True, evaluate_during_training_silent=True, evaluate_during_training_steps=200, evaluate_during_training_verbose=False, fp16=True, gradient_accumulation_steps=1, learning_rate=4e-05, local_rank=-1, logging_steps=50, manual_seed=None, max_grad_norm=1.0, max_seq_length=512, multiprocessing_chunksize=500, n_gpu=1, no_cache=False, no_save=False, num_train_epochs=5, output_dir='outputs/', overwrite_output_dir=True, process_count=6, reprocess_input_data=True, save_best_model=True, save_eval_checkpoints=False, save_model_every_e

In [3]:
import pandas as pd
suffix = "_submit"
train_df = pd.read_excel(f"../data/train_df{suffix}.xlsx")

In [4]:
train_df

Unnamed: 0.1,Unnamed: 0,title_job,text_a,title_res,text_b,labels
0,0,c/ c++ software engineer,LTX Credence Armenia LLC is looking for Softwa...,"c,c++ developer","Languages Known: C, C++, Data Structures, \nJa...",4.319074
1,4,c/ c++ software engineer,LTX Credence Armenia LLC is looking for Softwa...,"c,c++ developer","Languages Known: C, C++, Data Structures, \nJa...",4.319074
2,5,c/ c++ software engineer,Dom Daniel Armenia is looking for dynamic self...,"c,c++ developer","Languages Known: C, C++, Data Structures, \nJa...",4.319074
3,6,c/ c++ software engineer,LTX Credence Armenia LLC is looking for Softwa...,"c,c++ developer","Languages Known: C, C++, Data Structures, \nJa...",4.319074
4,10,java/j2ee developer,PointSource is seeking full time J2EE Develope...,sr. java / j2ee developer,Responsibilities: \n• Involved in Requirements...,4.280726
...,...,...,...,...,...,...
3839,7611,flash/ as3 developer,NY based social and mobile games startup is ge...,developer & solution analyst,"Domain Banking \nTechnology MVC, C#.net, JQuer...",1.575380
3840,7613,unix systems administrator,Administration of Corporate Unix servers Plan...,senior developer/ business analyst,12-Sep-2012 Present Senior Developer/ Business...,0.273131
3841,7614,unix systems administrator,Administration of corporate Unix Solaris Linu...,senior developer/ business analyst,12-Sep-2012 Present Senior Developer/ Business...,0.273131
3842,7615,unix systems administrator,ArmenTel is seeking for candidates to fulfill ...,senior developer/ business analyst,12-Sep-2012 Present Senior Developer/ Business...,0.273131


In [5]:
# ClassificationModel does not define a function for embedding, so this calls the underlying
# Hugging Face RoBERTa encoder directly and averages the token vectors of each text
# to summarize it as a single embedding.
import numpy as np
import torch
def encode_sentences(sentence_pair, model=model):
    model.model.to("cuda")
    input_ids = model.tokenizer.batch_encode_plus(sentence_pair, add_special_tokens=True, max_length=model.args.max_seq_length, padding=True, truncation=True)["input_ids"]
    input_ids_tensor = torch.LongTensor(input_ids).cuda()
    # Inference only, so no gradients are needed
    with torch.no_grad():
        token_embeddings = model.model.roberta(input_ids_tensor)[0]  # (n_sentences, seq_len, hidden_size)
    # Average over the token axis; the returned array has shape (hidden_size, n_sentences)
    return token_embeddings.mean(dim=1).cpu().numpy().T
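
The function above averages over every token position, including any padding positions that `padding=True` adds to the shorter text. As a point of comparison, here is a minimal NumPy-only sketch of a mask-aware mean on made-up toy arrays. The names `masked_mean_pool`, `token_embeddings` and `attention_mask` are hypothetical and are not part of simpletransformers; this is an illustration under those assumptions, not the method used in this notebook.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings: (n_sentences, seq_len, hidden_size)
    attention_mask:   (n_sentences, seq_len), 1 for real tokens, 0 for padding
    returns:          (n_sentences, hidden_size)
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against empty rows
    return summed / counts

# Toy batch (hypothetical values): 2 sentences, 3 token positions, hidden size 2.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(token_embeddings, attention_mask)  # padding ignored
plain = token_embeddings.mean(axis=1)                        # padding included
print(pooled)
print(plain)
```

In the toy batch the padded position drags the plain mean of the first sentence far away from the masked mean, while the second sentence, which has no padding, is unaffected.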

In [6]:
# Cosine similarity ranges from -1 to 1. The lines below rescale it linearly to the 0-5 range of the labels
# so that the two are comparable, then compute the absolute error for each row.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
train_df["similarity"] = train_df.apply(lambda x: cosine_similarity(encode_sentences([x.text_a, x.text_b]))[0,1], axis=1)
train_df["similarity"] = (train_df["similarity"]+1)/2*5
train_df["error"] = abs(train_df["labels"] - train_df["similarity"])
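
The arithmetic in this cell can be checked on a toy example. The sketch below uses NumPy only, with made-up vectors and a made-up label; `pairwise_cosine` and `rescale_to_label_range` are hypothetical helper names. scikit-learn's `cosine_similarity` compares the rows of its input (one row per sample), so the sketch keeps one row per text. Since `encode_sentences` above returns one column per sentence, it is worth checking that the `[0, 1]` entry compares the two texts rather than two embedding dimensions.

```python
import numpy as np

def pairwise_cosine(rows):
    """Cosine similarity between every pair of rows (one row per text)."""
    normed = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    return normed @ normed.T

def rescale_to_label_range(sim, max_label=5.0):
    """Map a cosine similarity in [-1, 1] linearly onto [0, max_label]."""
    return (sim + 1.0) / 2.0 * max_label

# Hypothetical embeddings: row 0 is a job description, row 1 is a resume.
embeddings = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
sim = pairwise_cosine(embeddings)[0, 1]  # approximately 0.5
score = rescale_to_label_range(sim)      # approximately 3.75
label = 4.0                              # made-up label on the 0-5 scale
error = abs(label - score)               # approximately 0.25
print(sim, score, error)
```

The endpoints behave as expected under this mapping: a similarity of -1 becomes 0 and a similarity of 1 becomes 5.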

In [7]:
train_df

Unnamed: 0.1,Unnamed: 0,title_job,text_a,title_res,text_b,labels,similarity,error
0,0,c/ c++ software engineer,LTX Credence Armenia LLC is looking for Softwa...,"c,c++ developer","Languages Known: C, C++, Data Structures, \nJa...",4.319074,0.040377,4.278697
1,4,c/ c++ software engineer,LTX Credence Armenia LLC is looking for Softwa...,"c,c++ developer","Languages Known: C, C++, Data Structures, \nJa...",4.319074,0.000242,4.318832
2,5,c/ c++ software engineer,Dom Daniel Armenia is looking for dynamic self...,"c,c++ developer","Languages Known: C, C++, Data Structures, \nJa...",4.319074,0.001559,4.317515
3,6,c/ c++ software engineer,LTX Credence Armenia LLC is looking for Softwa...,"c,c++ developer","Languages Known: C, C++, Data Structures, \nJa...",4.319074,0.000173,4.318901
4,10,java/j2ee developer,PointSource is seeking full time J2EE Develope...,sr. java / j2ee developer,Responsibilities: \n• Involved in Requirements...,4.280726,0.097656,4.183071
...,...,...,...,...,...,...,...,...
3839,7611,flash/ as3 developer,NY based social and mobile games startup is ge...,developer & solution analyst,"Domain Banking \nTechnology MVC, C#.net, JQuer...",1.575380,0.035516,1.539864
3840,7613,unix systems administrator,Administration of Corporate Unix servers Plan...,senior developer/ business analyst,12-Sep-2012 Present Senior Developer/ Business...,0.273131,0.010762,0.262369
3841,7614,unix systems administrator,Administration of corporate Unix Solaris Linu...,senior developer/ business analyst,12-Sep-2012 Present Senior Developer/ Business...,0.273131,0.005544,0.267587
3842,7615,unix systems administrator,ArmenTel is seeking for candidates to fulfill ...,senior developer/ business analyst,12-Sep-2012 Present Senior Developer/ Business...,0.273131,0.029544,0.243587


In [8]:
# The mean absolute error is 2.13: on average the cosine-based score is off by 2.13 on a scale of 0 to 5,
# which is simply unacceptable.
# In another trial not shown here, the final RoBERTa model we trained gave an error of 1.58
# when cosine similarity was used for the comparison. That is much better than the STS-B model,
# but still not acceptable.
train_df.similarity.describe(), train_df.error.describe()

(count    3844.000000
 mean        0.034367
 std         0.105729
 min         0.000000
 25%         0.003032
 50%         0.015195
 75%         0.046644
 max         4.920129
 Name: similarity, dtype: float64,
 count    3844.000000
 mean        2.127326
 std         0.859403
 min         0.001106
 25%         2.108098
 50%         2.372054
 75%         2.542854
 max         4.772132
 Name: error, dtype: float64)