# Introduction

This document demonstrates a proof of concept for a job skill/requirement classifier that annotates a given job description (JD) with a set of relevant skills and requirements.

- The dataset:
  * **ads-50k.json**: 50K job descriptions from Seek, with unstructured text content in HTML format
  * **ads-50k-events.csv**: a many-to-many relationship recording the events of candidates applying for advertised jobs
- The requirements:
  * **Provide analyses and a solution** that would allow ads to be annotated with the skills, responsibilities and/or other requirements a successful candidate needs to perform in the role.





# Observations

1. Each JD is provided as a page of unstructured text; no labels or tags for skills/responsibilities/requirements are available.
2. The JD text provides rich information about the job, the skills required, the responsibilities and the requirements. Phrases extracted from the JD text are useful and could be considered as candidates for skills/responsibilities/requirements.
3. The majority of JDs share a similar format: a **section header** followed by **bullet points** listing the relevant items. This structure helps to identify the role of each part of a JD.
4. By following the header of each section, it is possible to identify the key information mentioned under it.
5. A large number of JDs are written free-style, without the **section header** and **bullet point** structure.
6. The relationship between job candidates and job ads could provide further information about which job ads are relevant to the same professions or industries. This is useful for grouping ads and identifying the characteristics of each group.
7. For simplicity, skills/responsibilities/requirements are referred to as tags or labels from here on.

# Analysis
The problem of automatically annotating JDs with labels can be formulated in two ways:

1. A pattern-based approach
  - Patterns are designed and run over each JD to extract label candidates.
  - Rules alone are used to extract high-quality labels. These labels can serve as seeds to (1) design new patterns or (2) extract further label candidates.
  - Existing lists of labels might be partially available from Seek/LinkedIn or other external sources. These could be used to improve the quality of the patterns and of the extraction.
  - The association between JD text and the extracted labels can be used to create ground-truth training data for a supervised annotation model.

2. A supervised learning approach - eXtreme Multi-label Classification
  - Using the data from (1), each JD is given a set of labels derived from patterns and existing knowledge.
  - This data is used as the ground truth to train a multi-label classifier that assigns tags to JDs.

# Proposal

As a proof of concept (POC), the following proposal has been undertaken:

1. Use a pattern-based approach to identify the unstructured text where labels are listed. This is done by following high-quality job ads that are well structured with a **"section header"**.

2. Use a pattern-based approach to identify the **good quality key-phrases** mentioned in the text from (1). Redundancy/co-location is used to qualify the extraction.
Due to the limited scope of this POC, no external data has been used. However, the quality of the extraction could be improved significantly if external data were available, such as a list of skills from the job market.

3. Data from (1) and (2) is used to create a ground-truth dataset that is fed into a multi-label classifier for the annotation.

4. An extreme multi-label classifier is proposed, which takes advantage of a model pre-trained on general text and adapts it to multi-label classification on the job-ad text domain. The model is built on BERT pre-trained models.

# Discussion
1. The proposed solution demonstrates the ability to build a multi-label skill classifier from scratch using a bootstrapping approach.
2. The solution can annotate a JD with labels that come directly from its text or from a similar JD.
3. The quality of the ground-truth dataset could be improved by using an existing list of skills. Rule-based or semantic matching could play an important role in supporting this.
4. The relationship between labels could be utilised to improve the quality of label selection. This could be done by using the relationship between job ads via resumes and job-ad event logs.
5. The solution can be scaled to support batch processing and real-time prediction, and lays a foundation for downstream applications.
6. There are many directions in which this POC could be extended to improve accuracy.




# 1. Map Google Drive to the notebook to save data for reproducibility

In [1]:
# Mount Google Drive in the Colab notebook to store the results

from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive/git/

# Check out our git repo tailored for the Seek dataset
#!git checkout https://job-skill-prediction



Mounted at /content/drive
/content/drive/MyDrive/git


# [2. Install dependencies for ground-truth dataset extraction](#depencency_1)

Dependencies for ground-truth dataset extraction: BeautifulSoup to parse the JD HTML, and textacy (built on spaCy) to extract key-phrases.

In [None]:
!pip install beautifulsoup4
!pip install textacy
!python -m spacy download en_core_web_sm

import json
import re
from bs4 import BeautifulSoup
import traceback
from tqdm import tqdm
import textacy
from textacy import extract
from textacy.extract import keyterms as kt  # provides kt.textrank, used in section 3.2
import pandas as pd

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
     |████████████████████████████████| 13.9 MB 23.4 MB/s 
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# [3. Process raw dataset and generate candidate skills/requirements](#preprocess_raw_dataset)

This process applies a rule-based, heuristic approach to select a good-quality set of skills/requirements/responsibilities. Given a JD, the following steps are undertaken:
1. Scan the JD and select the **section headers** that are potentially the headings of skills and requirements
2. Extract the **paragraphs of text** that potentially contain skills and requirements
3. Apply **key-phrase extraction** to identify the significant mentions in the listed skill/requirement text.
4. Select the **final set of skills/requirements/responsibilities** that are considered the 'tags' or 'labels' for the JD.

After this process, each JD has a list of labels. Because of the strict selection process, these can be treated as the '**ground truth**' for the task of automatically identifying skills from a given JD. This task is formulated as an eXtreme Multi-label Classification problem, which is presented in section 5.
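
Steps 1 and 2 can be illustrated with a small standard-library sketch. It is not the notebook's implementation (which uses BeautifulSoup and is shown in section 3.2); the keyword subset, the function names `split_sections` and `skill_sections`, and the sample HTML are all assumptions made for illustration.

```python
import re

# Hypothetical keyword subset; the notebook defines its own list in section 3.2.
SKILL_KEYWORDS = ["requirement", "about you", "responsibili", "skill", "experience"]

# Headers are assumed to be marked up as <strong>...</strong>.
HEADER_RE = re.compile(r"<strong>(.*?)</strong>", re.IGNORECASE | re.DOTALL)


def split_sections(html):
    """Return (header, body) pairs, where body is the markup between one
    <strong> header and the next (or the end of the document)."""
    matches = list(HEADER_RE.finditer(html))
    sections = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        sections.append((match.group(1).strip(), html[match.end():end]))
    return sections


def skill_sections(html, keywords=SKILL_KEYWORDS, max_header_words=4):
    """Keep only short headers that contain one of the keywords."""
    selected = []
    for header, body in split_sections(html):
        if len(header.split()) > max_header_words:
            continue  # long <strong> runs are usually emphasis, not headers
        if any(keyword in header.lower() for keyword in keywords):
            selected.append((header, body))
    return selected


sample = (
    "<p><strong>About us</strong></p><p>We build things.</p>"
    "<p><strong>Skills and experience</strong></p>"
    "<ul><li>Python</li><li>SQL</li></ul>"
    "<p><strong>How to apply</strong></p><p>Click apply.</p>"
)
for header, body in skill_sections(sample):
    print(header, "->", body)
```

Only the "Skills and experience" block is kept here, because it is the only short header containing one of the keywords.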


## [3.1 Define helper functions](#helper_function)

These helper functions support the processing in the sections below.

In [2]:
import re

def get_file_size(raw_dataset_path):
  size = 0
  with open(raw_dataset_path,"rt") as f:
    for line in f:
      size+=1
  return size


def clean_html_text(text):
  """
    Lower-case the text and strip HTML tags, inserting '. ' where a closing
    </p>, </li> or <br/> directly follows a word character.
  """
  if text is None or len(text)<=0:
    return ""

  # remove html tag (raw strings avoid invalid-escape warnings in the patterns)
  text = re.sub(r"</strong>","",text.lower()) 
  text = re.sub(r"<\w+>","",text)
  # handling those without ending line with 'dot'
  text = re.sub(r"(\w)</p>",r"\1. ",text)
  text = re.sub(r"(\w)</li>",r"\1. ",text)
  text = re.sub(r"(\w)<br/>",r"\1. ",text)
  # remove the remaining html tag
  text = re.sub(r"</\w+>","",text)
  text = re.sub(r"<\w+/>","",text)

  # remove 'tab'
  text = re.sub(r"\t"," ", text)

  # remove special html characters
  text = re.sub(r"&nbsp;"," ", text)
  text = re.sub(r"&amp;","and", text)
  text = re.sub(r"\n"," ", text)
  text = re.sub(r"&rsquo;"," ", text)
  text = re.sub(r"&rdquo;"," ", text)
  text = re.sub(r"&ldquo;"," ", text)
  return text


# test with sample text
text="""
</p>
<p>To be considered for this position, you will require:</p>
<ul>
<li>Proven experience in estimation within a construction/building environment</li>
<li>Degree qualified or equivalent (Degree/Diploma) preferably in Quantity Surveying</li>
<li>Demonstrated ability to analyse, evaluate and interpret a range of complex and technical documents, including relevant regulatory, legislative, and licensing requirements, codes and standards, plans, drawings and specifications, invitations to tender, contracts and procurement reports, and bills of quantities</li>
<li>Exhibit excellent communication and interpersonal skills</li>
<li>Experience in managing estimating teams (if applying for a Manager role)</li>
<li>Cost Planning experience and capability (relevant to building projects)</li>
<li>Ability to measure documentation provided by client during RFT to create a Bill of Quantities</li>
<li>Understand how to fill in the &ldquo;gaps&rdquo; in tender documentation</li>
<li>Can create a Builder&rsquo;s Bill of Quantities from the measure</li>
<li>Understands how to use Cost X and OST to measure and create the Bill</li>
<li>Has experience in using BIM models to measure Bills (preferable)</li>
</ul>
<p>The successful candidate will be rewarded with the opportunity to work on diverse and challenging projects and on-going professional development within a supportive and encouraging team environment.</p>
<p><em>We support diversity in the workplace. Women, Aboriginal and Torres Strait Islanders and people with a multicultural background are strongly encouraged to apply.</em></p>
<p><em>Please note: This role is being sourced through CPB Contractors directly and we will not accept applications via external recruitment agencies.</em></p></HTML>
"""
clean_html_text(text)

'  to be considered for this position, you will require:  proven experience in estimation within a construction/building environment.  degree qualified or equivalent (degree/diploma) preferably in quantity surveying.  demonstrated ability to analyse, evaluate and interpret a range of complex and technical documents, including relevant regulatory, legislative, and licensing requirements, codes and standards, plans, drawings and specifications, invitations to tender, contracts and procurement reports, and bills of quantities.  exhibit excellent communication and interpersonal skills.  experience in managing estimating teams (if applying for a manager role) cost planning experience and capability (relevant to building projects) ability to measure documentation provided by client during rft to create a bill of quantities.  understand how to fill in the  gaps  in tender documentation.  can create a builder s bill of quantities from the measure.  understands how to use cost x and ost to meas

## [3.2 Enrich the current raw dataset with extras info from the extraction](#dataset_enrichment)

1. This process extracts a list of labels from a given JD
2. The output is a dataset file with additional data fields: "skill_content", "skill_list" and "skill_with_weight"
  - **skill_content**: a list of pairs of **section header** and the **text content** right below it.
  - **skill_list**: a list of labels extracted under the corresponding **section header**. One JD might have multiple sections, and hence multiple lists of skills/requirements/responsibilities
  - **skill_with_weight**: the final list of labels, aggregated across the sections of a JD. Each label carries its own weight
3. Check **ads-50k-with-skills.json** for details of the extraction for each JD. Only JDs with high-quality information are selected as ground truth; the rest are left out of the training dataset.
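
The aggregation behind **skill_with_weight** can be sketched as follows: a key-phrase that appears under several section headers has its scores summed, so repeated mentions rank higher. This is a simplified stand-in for the pandas `groupby(...).sum()` used in the cell below; the function name `aggregate_skill_weights`, the input shape and the scores are made up for illustration.

```python
from collections import defaultdict


def aggregate_skill_weights(scored_sections):
    """Sum the scores of key-phrases that occur under several section headers.

    scored_sections: list of (heading, [(phrase, score), ...]) pairs.
    Returns [[phrase, total_score], ...] sorted by descending total score.
    """
    totals = defaultdict(float)
    for _heading, scored_phrases in scored_sections:
        for phrase, score in scored_phrases:
            totals[phrase] += score
    return sorted(
        ([phrase, total] for phrase, total in totals.items()),
        key=lambda item: item[1],
        reverse=True,
    )


# Made-up scores, for illustration only.
per_heading = [
    ("about you", [("communication skill", 0.25), ("cost planning", 0.125)]),
    ("requirements", [("communication skill", 0.25), ("quantity surveying", 0.375)]),
]
skill_with_weight = aggregate_skill_weights(per_heading)
print(skill_with_weight)
```

In this toy input, "communication skill" is mentioned under both headings, so it ends up with the highest combined weight.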

In [None]:

# Keywords that potentially appear in the header of a relevant JD section.
# They are matched as substrings of the lower-cased header; this short list is enough to demonstrate the concept
# and could be expanded based on further observation.
skill_keyword_list = ["the role", # e.g. <strong>the role includes:</strong> or <strong>about the role:</strong>
                      "requirement", # also matches "requirements"
                      "what we require", 
                      "about you", 
                      "responsibili", # matches "responsibility" and "responsibilities"
                      "need", 
                      "skill", 
                      "experience", 
                      "candidate"
                      ]
raw_dataset_path = "seek_training_data_generation/ads-50k.json" #input raw dataset
extracted_dataset_path = "seek_training_data_generation/ads-50k-with-skills.json" #output enriched dataset


size = get_file_size(raw_dataset_path)

f = open(raw_dataset_path)
g = open(extracted_dataset_path, 'a')

# Process extract skill text for each JD
for idx,line in tqdm(enumerate(f),total=size):
  data = json.loads(line)
  content = data["content"]

  #print(f"-------Processing id {idx}--------\n")

  #1. IDENTIFY SKILL CONTENT WHICH INCLUDE SKILL HEADER AND SKILL PARAGRAPH
  soup = BeautifulSoup(content, 'html.parser')
  
  skill_heading_text_pairs = [] # A list of tuples, each holding the heading text and its skill paragraphs

  pair_dict= {} # maps each header to the next header, so the text in between can be extracted
  header_list = [] # the list of headers found in the JD
  result_list = [] # the selected headers that potentially contain skills and requirements

  for strong in soup.find_all('strong'): #identify heading
    text = strong.text
    
    if len(text.split()) >=5: # Any header with 5 or more words tends to be a false alarm
      continue

    header_list.append(text)
    # Create a mapping to mark the start and end of a block 
    if len(header_list)>=2:
      pair_dict[header_list[-2]] = header_list[-1]

    
    for keyword in skill_keyword_list: #check if a skills/requirements is a part of the header
      if keyword in text.lower():
        result_list.append(text)
        break #
  

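  for keyword in result_list: #obtain skill paragraph from selected heading text
    rs = None
    # Capture everything between this header and the next one (or the end of the JD)
    if keyword in pair_dict:
      pattern = rf"<strong>{keyword}</strong>([\w\W]+)<strong>{pair_dict[keyword]}</strong>"
    else:
      pattern = rf"<strong>{keyword}</strong>([\w\W]+)"
    
    #print(f"pattern: {pattern}")

    try:
      rs = re.search(pattern, content)
    except re.error:
      # The header text is used as a raw regex, so a header containing regex
      # metacharacters can fail to compile; such headers are skipped
      pass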

    if rs is not None: 
      skill_text = rs.group(1)
      skill_text = clean_html_text(skill_text)

      skill_heading = clean_html_text(keyword)
      if len(skill_heading)>0 and len(skill_text)>0:
        skill_heading_text_pairs.append((skill_heading,skill_text))
        #print(f"{skill_heading}={skill_text}")

  data["skill_content"] = skill_heading_text_pairs #obtain skill content for each JD

  # 2. EXTRACTING KEYWORD FROM TEXT REPRESENTED FOR SKILLS/REQUIREMENTS

  data["skill_list"] = []
  skill_list = []
  skill_value_list = []
  for skill_pair in  skill_heading_text_pairs:
    # for each pair of "skill_heading" and skill_paragraph, extract list of keywords using Spacy
    heading = skill_pair[0]
    text = skill_pair[1]
    doc = textacy.make_spacy_doc(text,"en_core_web_sm")
    
    #print(f"heading={heading}")
    #print(f"text={text}")
    skill_tuples = kt.textrank(doc, normalize="lemma", topn=10)
    #print(f"skills={skill_tuples}\n")

    skill_list_per_heading = [v[0] for v in skill_tuples]
    
    # Add a list of skills based on each heading
    data["skill_list"].append((heading,skill_list_per_heading))

    #combine from all heading
    if len(skill_list) <= 0:
      skill_list = [v[0] for v in skill_tuples]
      skill_value_list = [v[1] for v in skill_tuples]
    else:
      skill_list.extend([v[0] for v in skill_tuples])
      skill_value_list.extend([v[1] for v in skill_tuples])

  # 3. CALCULATE AND PROVIDE THE FINAL OUTPUT DATA
  
  # Aggregate (sum) the weights of key-phrases that appear more than once
  df = pd.DataFrame({"skill_name":skill_list, 'skill_value': skill_value_list})
  df = df.groupby("skill_name").sum().reset_index().sort_values("skill_value",ascending=False)

  # Add skill list with weights
  data["skill_with_weight"] = df.values.tolist()

  #print(data["skill_list"])
  #print(data["skill_with_weight"])

  del data["content"]
  del data["metadata"]
  del data["abstract"]

  g.write(json.dumps(data))
  g.write("\n")

  #if idx >=10:
  #  break

f.close()
g.close()


100%|█████████▉| 49753/50000 [16:50<00:05, 49.05it/s]

# [4. Create a ground-truth dataset to train a mini-supervised skill classifier](#create_ground_true_dataset)

In this section, a ground-truth dataset is created from the original raw dataset by selecting the JDs with good-quality label extraction.

The following steps are undertaken:
1. Create a **global ground-truth dataset** based on the processed JDs with skills (labels). 
**Location**=bert_extreme_multilabel_classification/pybert/dataset/seek_dataset/dataset.csv
2. Generate **train/validation datasets** used to train a multi-skill classifier. 
**Location**=bert_extreme_multilabel_classification/pybert/dataset/seek_dataset/{seek_dataset.train.pkl,seek_dataset.valid.pkl}
3. Generate **a list of global skills/requirements** from the given Seek dataset. 
**Location**=bert_extreme_multilabel_classification/pybert/dataset/seek_dataset/skill_list.csv
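
As a rough illustration of steps 1 and 3, the sketch below builds a global label list with a frequency cut-off and encodes one JD's labels as a multi-hot vector. It uses plain Python rather than the pandas/NumPy code in the cells below; the function names and the toy data are assumptions, and `min_count=2` is chosen only so that the toy example filters something (the notebook's `remove_outlier` keeps labels seen at least 10 times).

```python
from collections import Counter


def build_label_index(label_lists, min_count=10):
    """Keep labels seen in at least `min_count` JDs and give each a column index."""
    counts = Counter(label for labels in label_lists for label in set(labels))
    kept = sorted(label for label, count in counts.items() if count >= min_count)
    return {label: idx for idx, label in enumerate(kept)}


def to_multi_hot(labels, label2index):
    """Encode one JD's labels as a 0/1 vector; labels outside the global list are ignored."""
    vector = [0] * len(label2index)
    for label in labels:
        idx = label2index.get(label)
        if idx is not None:
            vector[idx] = 1
    return vector


# Toy label lists, one per JD (hypothetical).
jd_labels = [["python", "sql"], ["python", "excel"], ["sql", "python"]]
label2index = build_label_index(jd_labels, min_count=2)
row = to_multi_hot(["sql", "excel"], label2index)
print(label2index, row)
```

Each row of the ground-truth dataset then pairs a section's text with such a vector.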

In [3]:
import numpy as np
def remove_outlier(df):
  d = df[df["skill_value_count"]>=10]
  #d = d[d["skill_value_sum"]>=2]
  return d

In [None]:
extracted_dataset_path = "seek_training_data_generation/ads-50k-with-skills.json"
label_path = "bert_extreme_multilabel_classification/pybert/dataset/seek_dataset/skill_list.csv"
output_dataset_path = "bert_extreme_multilabel_classification/pybert/dataset/seek_dataset/dataset.csv"




size = get_file_size(extracted_dataset_path)

f = open(extracted_dataset_path, 'rb')

# Collect the (skill_name, skill_value) rows of every JD
skill_frames = []
for idx,line in tqdm(enumerate(f),total=size):
  if len(line)==1: #b'\n'
    continue
  try:
    data = json.loads(line)
  except ValueError:
    #print(line)
    continue # skip malformed lines instead of reusing the previous record
  if len(data["skill_with_weight"])>0:
    skill_frames.append(pd.DataFrame(data["skill_with_weight"], columns=["skill_name","skill_value"]))
f.close()

# DataFrame.append was removed in pandas 2.0; concatenate once instead
df_skills = pd.concat(skill_frames, ignore_index=True)


df_skills = df_skills.groupby("skill_name").agg(["sum","count"]).reset_index(level="skill_name")
df_skills.columns = ["_".join(a) for a in df_skills.columns.to_flat_index()]

#remove outlier 
df_skills = remove_outlier(df_skills)

df_skills = df_skills.sort_values(["skill_name_"],ascending=True)

#output label file
df_skills[["skill_name_"]].to_csv(label_path,header=False, index=False)
global_label_list = list(df_skills["skill_name_"])

# create mapping
label2index = dict()
index2label = dict()
for idx,label in enumerate(global_label_list):
  label2index[label]=idx
  index2label[idx]=label

n_label = len(global_label_list)

#output global dataset include text and label

dataset = []

f = open(extracted_dataset_path, 'rb')
# Build one (text, label vector) row per skill section
for idx,line in tqdm(enumerate(f),total=size):
  if len(line)==1: #b'\n'
    continue
  try:
    data = json.loads(line)
  except ValueError:
    print(line)
    continue # skip malformed lines instead of reusing the previous record
  if len(data["skill_with_weight"])>0:
      d = pd.DataFrame(data["skill_with_weight"], columns=["skill_name","skill_value"])
      label_list = d["skill_name"]
      # create vector for label
      label_vector = np.zeros(n_label, dtype=int)
      for label in label_list:
        if label in label2index: # dict lookup; same result as scanning global_label_list
          label_vector[label2index[label]] = 1

      # create text content; keep the JD only if at least one (but not every) label is set
      if len(set(label_vector))==2:
        for skill_content_pair in data["skill_content"]:
              text = skill_content_pair[1]
              dataset.append([text] + list(label_vector))
f.close()

# Output dataset for later training
pd.DataFrame(dataset).to_csv(output_dataset_path)
print(f"output_dataset_path={output_dataset_path}")


100%|██████████| 33385/33385 [00:26<00:00, 1245.63it/s]
100%|██████████| 33385/33385 [00:19<00:00, 1753.98it/s]


output_dataset_path=bert_extreme_multilabel_classification/pybert/dataset/seek_dataset/dataset.csv


In [None]:
print(f"obtained {len(dataset)} ground true for the dataset")
print(f"obtained {n_label} labels")


obtained 17850 ground true for the dataset
obtained 2014 labels


# [5. Train an eXtreme Multi-Label Classifier to predict labels for a given JD using the BERT framework](#train_bert_xmlc)

Given a JD, the task is to predict the list of skills/requirements suitable for the job. The following steps are undertaken:
1. Formulate the problem as an extreme multi-label classification problem
2. Adapt a base BERT framework for multi-label text classification to train a classifier that predicts a list of tags for a given JD
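
To make the multi-label formulation concrete: unlike single-label classification, each label gets its own independent yes/no decision, so a JD can receive any number of tags. The sketch below shows one common way to express this, with a per-label sigmoid and a binary cross-entropy loss. It is an assumption about the general technique, not a description of the adapted framework's internals; the logits, targets and 0.5 threshold are made up.

```python
import math


def sigmoid(x):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(logits, targets):
    """Mean binary cross-entropy over all labels of one example."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for logit, target in zip(logits, targets):
        p = sigmoid(logit)
        total += -(target * math.log(p + eps) + (1 - target) * math.log(1 - p + eps))
    return total / len(logits)


def predict_labels(logits, labels, threshold=0.5):
    """Return every label whose probability reaches the threshold."""
    return [label for logit, label in zip(logits, labels) if sigmoid(logit) >= threshold]


labels = ["communication skill", "python", "sql"]
logits = [2.0, -1.0, 0.5]   # hypothetical model outputs, one per label
targets = [1, 0, 1]         # multi-hot ground truth for the same JD

print(predict_labels(logits, labels))
print(binary_cross_entropy(logits, targets))
```

With thousands of labels, ranking the labels by probability and keeping the top k is often more practical than a fixed threshold.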

**Requirements**:
- Place the BERT pre-trained model in pybert/pretrain/bert/base-uncased/. The pre-trained models and config can be downloaded from [Bert-Multi-Label-Text-Classification](https://github.com/lonePatient/Bert-Multi-Label-Text-Classification)
- Check the config file in pybert/config to make sure it matches the current dataset, for example the number of labels.

In [2]:


# change to the current directory
%cd /content/drive/MyDrive/git/job-skill-prediction/bert_extreme_multilabel_classification

# INSTALL REQUIRED PACKAGES
!pip3 install -r requirements.txt

/content/drive/MyDrive/git/job-skill-prediction/bert_extreme_multilabel_classification
Collecting scikit-learn==0.21.3
  Downloading scikit_learn-0.21.3-cp37-cp37m-manylinux1_x86_64.whl (6.7 MB)
     |████████████████████████████████| 6.7 MB 5.1 MB/s 
Collecting pytorch-transformers==1.2.0
  Downloading pytorch_transformers-1.2.0-py3-none-any.whl (176 kB)
     |████████████████████████████████| 176 kB 69.4 MB/s 
Collecting matplotlib==3.1.1
  Downloading matplotlib-3.1.1-cp37-cp37m-manylinux1_x86_64.whl (13.1 MB)
     |████████████████████████████████| 13.1 MB 63.1 MB/s 
Collecting tensorboard==1.15.0
  Downloading tensorboard-1.15.0-py3-none-any.whl (3.8 MB)
     |████████████████████████████████| 3.8 MB 64.6 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 60.7 MB/s 
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.

## 5.1 Generating train/validation dataset from the ground-truth dataset

- The ground-truth dataset generated in the previous step is in CSV format; it needs to be pre-processed and converted with the labels encoded.
- The dataset is split into train/validation sets with an 85%/15% ratio.
- The output is stored in **pybert/dataset/seek_dataset** for subsequent training/validation.
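
The 85%/15% split can be pictured with the sketch below. It is only an illustration of a seeded shuffle-and-cut split; how `run_bert.py --do_data` actually shuffles or stratifies the data is not shown in this notebook, and the function name `train_valid_split` and the toy rows are hypothetical.

```python
import random


def train_valid_split(rows, valid_size=0.15, seed=42):
    """Shuffle the rows with a fixed seed, then cut off `valid_size` of them for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_size)
    return shuffled[n_valid:], shuffled[:n_valid]


# Toy rows of (text, multi-hot label vector).
rows = [(f"jd text {i}", [i % 2, 1]) for i in range(20)]
train, valid = train_valid_split(rows, valid_size=0.15)
print(len(train), len(valid))
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing validation metrics across training runs.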

In [None]:
!python3 run_bert.py --do_data --data_name seek_dataset --do_lower_case --valid_size 0.15 


Training/evaluation parameters Namespace(adam_epsilon=1e-08, arch='bert', data_name='seek_dataset', do_data=True, do_lower_case=True, epochs=10, eval_batch_size=4, eval_max_seq_len=256, fp16=False, fp16_opt_level='O1', grad_clip=1.0, gradient_accumulation_steps=1, learning_rate=0.0001, local_rank=-1, loss_scale=0, mode='min', monitor='valid_loss', n_gpu='0', predict_idx='0', predict_labels=False, resume_path='', save_best=False, seed=42, sorted=1, test=False, test_path='', train=False, train_batch_size=4, train_max_seq_len=256, valid_size=0.15, warmup_proportion=0.1, weight_decay=0.01)
split raw data into train and valid

## [5.2 Start training the multi-label classifier to predict skills/requirements/responsibilities](#start_training_bert_xmlc)

- Start training on seek_dataset for 10 epochs (`--epochs 10`)
- Train on a GPU to reduce the training time

In [4]:
!python3 run_bert.py --train --data_name seek_dataset --do_lower_case --epochs 10 --save_best


Training/evaluation parameters Namespace(adam_epsilon=1e-08, arch='bert', data_name='seek_dataset', do_data=False, do_lower_case=True, epochs=10, eval_batch_size=4, eval_max_seq_len=256, fp16=False, fp16_opt_level='O1', grad_clip=1.0, gradient_accumulation_steps=1, learning_rate=0.0001, local_rank=-1, loss_scale=0, mode='min', monitor='valid_loss', n_gpu='0', predict_idx='0', predict_labels=False, resume_path='', save_best=False, seed=42, sorted=1, test=False, test_path='', train=True, train_batch_size=4, train_max_seq_len=256, valid_size=0.05, warmup_proportion=0.1, weight_decay=0.01)
*** Example ***
guid: train-0
tokens: [CLS] we are looking for someone to focus purely on recruiting permanent staff in the early childhood ed ##uca ##ton sector . the role will include : building relationships with range of existing and prospective clients . identify their issues and help provide the solution . attending client visits and understanding client requirements . representing pulse child care

# 6. Start predicting on samples from the dataset

- The sections below demonstrate real-time prediction on several JDs and show the predicted labels

## 6.1 Prepare script for predicting

In [3]:
from pybert.test.predictor import Predictor
from pybert.io.bert_processor import BertProcessor
from torch.utils.data import SequentialSampler
from torch.utils.data import DataLoader
from pybert.model.nn.bert_for_multi_label import BertForMultiLable
from pybert.configs.basic_config import config
from pybert.common.tools import init_logger, logger
from pathlib import Path


import warnings


warnings.filterwarnings("ignore")

# Get the processor ready
processor = BertProcessor(vocab_path=config['bert_vocab_path'], do_lower_case=False)

# get label list ready 
idx2word = {}
for (w,i) in processor.tokenizer.vocab.items():
    idx2word[i] = w

label_list = processor.get_labels(label_path=config['data_label_path'])

idx2label = {i: label for i, label in enumerate(label_list)}



# Loading trained model
if False:
    args.test_path = Path(args.test_path)
    model = BertForMultiLable.from_pretrained(args.test_path, num_labels=len(label_list))
else:
    #model = BertForMultiLable.from_pretrained(config['bert_model_dir'], num_labels=len(label_list))
    trained_model_folder = Path("/content/drive/MyDrive/git/job-skill-prediction/bert_extreme_multilabel_classification/pybert/output/checkpoints/bert/checkpoint-epoch-10")
    model = BertForMultiLable.from_pretrained(trained_model_folder, num_labels=len(label_list))

    
# Freeze the BERT encoder (the PyTorch attribute is `requires_grad`)
for p in model.bert.parameters():
    p.requires_grad = False



In [4]:
from pathlib import Path
config['test_path'] = Path('pybert/dataset/seek_dataset/seek_dataset.valid.pkl')

In [5]:
# Get the data ready for testing
test_data = processor.get_test(config['test_path'])
test_examples = processor.create_examples(lines=test_data, example_type='test', cached_examples_file=config[
                                                                    'data_dir'] / f"cached_test_examples_bert")
test_features = processor.create_features(examples=test_examples, max_seq_len=256, cached_features_file=config[
                                                                    'data_dir'] / "cached_test_features_{}_{}".format(
                                                256, 'bert'
                                            ))
test_dataset = processor.create_dataset(test_features)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=4)

# Start predicting

predictor = Predictor(model=model,
                      logger=logger,
                      n_gpu='0',
                      i2w = idx2word,
                      i2l = idx2label)


result = predictor.predict(data=test_dataloader)


 

NDCG@5: 6.519689674740005
NDCG@10: 8.66387888228148
NDCG@30: 12.813805196363525
NDCG@50: 14.66098760198134
NDCG@100: 17.12330679664693
Recall@5: 5.508213742843484
Recall@10: 9.230181928141176
Recall@30: 19.210030039882376
Recall@50: 24.776817710548634
Recall@100: 33.21054718296621
EIM: 140.70826961603663
RIIM: 11.954470329340197
REIM: 15.165385562066454
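
For readers unfamiliar with the ranking metrics reported above, the sketch below gives the textbook definitions of Recall@k and NDCG@k with binary relevance. It is an assumption that the `Predictor` follows these definitions (its implementation is not shown here, and its figures appear to be expressed as percentages); the ranked list and ground truth are made up.

```python
import math


def recall_at_k(ranked_labels, true_labels, k):
    """Fraction of the true labels that appear among the top-k predictions."""
    if not true_labels:
        return 0.0
    hits = sum(1 for label in ranked_labels[:k] if label in true_labels)
    return hits / len(true_labels)


def ndcg_at_k(ranked_labels, true_labels, k):
    """Normalised discounted cumulative gain with binary relevance."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, label in enumerate(ranked_labels[:k])
        if label in true_labels
    )
    ideal_hits = min(k, len(true_labels))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Labels sorted by predicted probability, highest first (hypothetical).
ranked = ["python", "excel", "sql", "communication skill"]
truth = {"python", "sql"}

print(recall_at_k(ranked, truth, 1), recall_at_k(ranked, truth, 3))
print(ndcg_at_k(ranked, truth, 3))
```

Recall@k only asks whether a true label made it into the top k, while NDCG@k also rewards placing true labels nearer the top of the ranking.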
