# Description

Fasttext framework is used in the POC to demonstrate a different angle to build a classifier to predict labels from a large set of labels.

Refer to [bert_xmlc.ipynb](https://github.com/bimhud/job_skill_prediction/blob/main/bert-xmlc.ipynb) documents for more info about the context and data required for this POC.

This document is to outline:
1. How to convert the existing ground-true dataset to Fasttext format to be trained uisng Fasttext framework
2. Train a extreme multi-label classifier to classify job skills/requirements/responsibilites using Fasttext

## 1. Install and configure FastText



In [1]:
# Mount to perm folder
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

Mounted at /content/drive


In [12]:
# Download fastext
!cd /content/drive/MyDrive/git && wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip && unzip v0.9.2.zip 
!ln -s fastText-0.9.2 fasttext
!cd /content/drive/MyDrive/git/fasttext && make  && pip install  .


make: Nothing to be done for 'opt'.
Processing /content/drive/MyDrive/git/fastText-0.9.2
[33m  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.[0m
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... [?25l[?25hdone
  Created wheel for fasttext: filename=fasttext-0.9.2-cp37-cp37m-linux_x86_64.whl size=3127608 sha256=946d28719540a2e5ca092a79d9e3c97bec44aadc6081374d07534391777bbc4d
  Stored in directory: /root/.cache/pip/wheels/7c/90/81/393cc839ac5ff15498ba20aef75931b668231f5d1682e47fe9
Successfully built fasttext
Installing collected packages: fasttext
Successfully installed fasttext-0.9.2


## 2. Generate Dataset with Fasttext format

- From the previous work, a global ground-true dataset has been generated. This dataset is required to convert to Fasttext format
- Train/Dev/Test dataset are also generated

In [None]:
!mkdir /content/drive/MyDrive/git/fasttext_dataset
%cd /content/drive/MyDrive/git/fasttext_dataset

# Using dataset previous extracted, this dataset is converted to be used with Fasttext
label_file = "/content/drive/MyDrive/git/job-skill-prediction/bert_extreme_multilabel_classification/pybert/dataset/seek_dataset/skill_list.csv"
dataset_file = "/content/drive/MyDrive/git/job-skill-prediction/bert_extreme_multilabel_classification/pybert/dataset/seek_dataset/dataset.csv"

import numpy as np
import re
label_list = list(pd.read_csv(label_file,header=None)[0])


In [None]:
# Process to obtain sample list for datasets

dataset = pd.read_csv(dataset_file, header=0, index_col=0)
sample_list = []
for row_id, row in dataset.iterrows():
  row = list(row)
  text = row[0]
  #print(text)

  label_value_indicators = row[1:]

  tag_list = [label_list[idx] for idx,value in enumerate(label_value_indicators) if int(value)==1]

  if len(tag_list)<=0:
    continue 

  tag_list = [f"__label__{'-'.join(v.split())}" for v in tag_list]
  fasttext_sample = f"{' '.join(tag_list)} {text}"

  sample_list.append(fasttext_sample)


In [None]:
# Split to train, validation and testing dataset 

from sklearn.model_selection import train_test_split
train,val = train_test_split(sample_list, test_size=0.3, random_state=7, shuffle=True, stratify=None)
val,test = train_test_split(test, test_size=0.3, random_state=7, shuffle=True, stratify=None)
print(len(train))
print(len(val))
print(len(test))

12495
3373
1446


In [None]:
train_dataset_file = "/content/drive/MyDrive/git/fasttext_dataset/fasttext_train.csv"
val_dataset_file = "/content/drive/MyDrive/git/fasttext_dataset/fasttext_val.csv"
test_dataset_file = "/content/drive/MyDrive/git/fasttext_dataset/fasttext_test.csv"


with open(train_dataset_file, "wt") as g:
  g.write("\n".join(train)) 

with open(val_dataset_file, "wt") as g:
  g.write("\n".join(val)) 

with open(test_dataset_file, "wt") as g:
  g.write("\n".join(test)) 

## 3. Start Training multi-label classifier with fasttext

Download Wikipedia embedding to support for training. Using the pre-train word2vec from Wikipedia would help to improve the performance for this model.

In [2]:
!cd /content/drive/MyDrive/git/fasttext_dataset && wget -c https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec

--2022-01-25 10:36:30--  https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 104.22.74.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6597238061 (6.1G) [binary/octet-stream]
Saving to: ‘wiki.en.vec’


2022-01-25 10:39:21 (36.9 MB/s) - ‘wiki.en.vec’ saved [6597238061/6597238061]



Train Fasttext model using pretrain model from Wiki and optimised using onevsall loss function. Training is demonstrated using 100 iterations

In [None]:
# Optimise for multi-lael
import fasttext
wiki_pretrain_vector_file = '/content/drive/MyDrive/git/fasttext_dataset/wiki.en.vec'

# Training multi-label with one vs all to provide the true probability for each label
model = fasttext.train_supervised(input=train_dataset_file, epoch=100, loss='ova', wordNgrams=2, pretrainedVectors= wiki_pretrain_vector_file, dim=300)
model.save_model("/content/drive/MyDrive/git/fasttext_dataset/seek_skill_fasttext_model.bin")


In [16]:
# Run a test over the validation dataset to predict top k=5 labels for each prediction.
# The number of sample, recall@5 and prediction@5 are listed below
model.test(val_dataset_file, k=5)

(3373, 0.3495404684257338, 0.2763324426944171)

## 4. Load and run prediction with new data

In [13]:
# Loading the model for testing on prediction phase 
import fasttext
model  = fasttext.load_model("/content/drive/MyDrive/git/fasttext_dataset/seek_skill_fasttext_model.bin")



In [35]:
# Test on a new text, the label and the probability for each listed below.
text = """Proven ability to work collaboratively to develop and maintain strong stakeholder relationships to achieve business outcomes; support decision making and influence others in the pursuit of project/business objectives."""
model.predict(text,k=7)

(('__label__strong-relationship',
  '__label__great-opportunity',
  '__label__strong-working-relationship',
  '__label__similar-role',
  '__label__business-process',
  '__label__business-solution',
  '__label__strong-experience'),
 array([0.99909896, 0.98360699, 0.97772384, 0.91965252, 0.84797776,
        0.78267252, 0.77730989]))