<h1>Named Entity Recognition (NER) for job data: Annotations from documents</h1>
<h3>Adel Rahmani</h3>
<hr style="height:5px;border:none;color:#333;background-color:#333;" />

<div style="background-color:#FBEFFB;">
<hr style="height:5px;border:none;color:#333;background-color:#333;" />
<h3>Licence</h3>
<p>Copyright (C) 2022  Adel Rahmani

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.</p>
<hr style="height:5px;border:none;color:#333;background-color:#333;" />
</div>

<div style="background-color:#F2FBEF;">
<h2><font color=#04B404>Summary</font></h2>
This notebook and its associated code use online job data sources to create annotations for occupation-related entities, which can then be used to train a NER model.

The Python library <a href="https://spacy.io/">spaCy</a> is first used to build a rule-based NER model; that model then produces the annotated data needed to train a machine learning NER model.
</div>
<hr>

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import swifter
import string
import regex
import time 
import hashlib

import pandas as pd
import numpy as np

import json

from collections import defaultdict

import spacy
from spacy.util import minibatch, compounding, filter_spans
from spacy.matcher import Matcher, PhraseMatcher
from spacy.training.example import Example
from spacy.tokens import DocBin, Span
from spacy.pipeline import EntityRuler
from spacy import displacy
from spacy.language import Language

from datetime import datetime
from pathlib import Path
from tqdm.notebook import tqdm_notebook
from sklearn.model_selection import train_test_split

import warnings
warnings.simplefilter('ignore')

from ner_utils import (
    save_spacy_ner_data_to_disk, 
    load_spacy_data_from_csv, 
    compute_elapsed_time,
    create_spacy_lang_component, 
    AVAILABLE_SPACY_MODELS, 
    create_model_and_add_rules, 
    add_custom_ner_pipelines, 
    create_spacy_components_from_dict,
    save_ner_regex_to_json, 
    load_ner_regex_from_json, 
    extract_ent_location, 
    entities_found, 
    build_annotations_from_sentences, 
    build_annotations_from_docs, 
    pattern_punctuation, 
    pattern_multiple_stars, 
    pattern_multiple_blanks, 
    get_annotation_metadata
    )

from IPython.display import display_html, HTML, display

----
# The Data

This pipeline uses the Adzuna data to construct an annotated data set for Named Entity Recognition (NER) for job ads.
- [Kaggle Adzuna](https://www.kaggle.com/c/job-salary-prediction/data) data containing over 200,000 job ads, mostly from the UK.

The data is preprocessed in the `NER_1_Data_Processing.ipynb` notebook.

---
## Loading the preprocessed Adzuna data

In [3]:
DATA_HASH = 'fba836ee1bdf4fda32004145ffe1eeb8d3c6b5f1'

In [4]:
DATA_DIR = Path(f'./experiments/data_{DATA_HASH}')
DATA_DIR.mkdir(parents=True, exist_ok=True)

In [5]:
source = DATA_DIR/f'Adzuna_job_data_{DATA_HASH}_docs.parq'

In [6]:
data = pd.read_parquet(source)
data.head(2)

| | Id | Title | FullDescription | LocationRaw | LocationNormalized | ContractType | ContractTime | Company | Category | SourceName |
|---|---|---|---|---|---|---|---|---|---|---|
| 82 | 36757414 | Primary Teachers | Are you a qualified/newly qualified teacher lo... | South Yorkshire | South Yorkshire | | contract | Vision for Education | Teaching Jobs | cv-library.co.uk |
| 121 | 44452524 | Chef De Partie | Chef De Partie Rustic Italian AA Rosette resta... | Colchester Essex South East | UK | | | Clear Selection | Hospitality & Catering Jobs | caterer.com |


In [7]:
source_sentences = DATA_DIR/f'Adzuna_job_data_{DATA_HASH}_sents.parq'
data_sentences = pd.read_parquet(source_sentences)

In [8]:
data_sentences.head(2)

| | doc_id | sent_id | sentence | sent_len |
|---|---|---|---|---|
| 0 | 36757414 | 0 | Are you a qualified/newly qualified teacher lo... | 98 |
| 1 | 36757414 | 1 | Do you want a new challenge with varied work, ... | 81 |


----
# Create regex patterns for training data

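Let's sort the job titles and employers from longest to shortest. The names are later joined into a single regex alternation, which tries its alternatives from left to right, so longer names must come before shorter names that start the same way (e.g. a multi-word title before the single word it begins with).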

In [9]:
clean_titles = sorted(data.Title.unique(), key=len, reverse=True)
clean_titles[:10]

['Electronic Fire and Security Solutions Field Sales',
 'Qualified Person Quality Assurance Manager',
 'Mechanical Systems Certification Engineer',
 'Oracle Supply Chain Functional Consultant',
 'Graduate Trainee Recruitment Consultant',
 'Hospitality Vocational Learning Advisor',
 'Intermediate Electrical Design Engineer',
 'Estate Agent Senior Sales Negotiator',
 'Principal Mechanical Design Engineer',
 'Senior Mobile Applications Developer']
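
Why the ordering matters: at each position, a regex alternation keeps the first alternative that succeeds, so a short title listed before a longer one that starts the same way wins the match. The sketch below is only an illustration. It uses the standard-library `re` module (the notebook itself uses `regex`), two titles that occur in the data, and a hypothetical helper named `build_pattern`.

```python
import re

titles = ["Chef", "Chef De Partie"]
text = "Chef De Partie wanted for a rosette restaurant"

def build_pattern(names):
    # Join the names into one alternation, one capture group per name.
    return re.compile("|".join(f"({re.escape(n)})" for n in names))

shortest_first = build_pattern(sorted(titles, key=len))
longest_first = build_pattern(sorted(titles, key=len, reverse=True))

print(shortest_first.search(text).group())  # Chef
print(longest_first.search(text).group())   # Chef De Partie
```

With the shortest title first, only `Chef` would be annotated and the rest of the title would be lost.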

In [10]:
print({item.lower() for item in clean_titles if len(item.split())==1})

{'analyst', 'inspector', 'accountant', 'administrator', 'solicitor', 'optometrist', 'auditor', 'cashier', 'ombudsman', 'estimator', 'cook', 'engineer', 'carpenter', 'bookkeeper', 'butcher', 'joiner', 'driver', 'wireman', 'receptionist', 'machinist', 'chef', 'plumber', 'labourer', 'manager', 'underwriter', 'buyer', 'copywriter', 'paralegal', 'sommelier', 'electrician', 'caretaker', 'secretary', 'designer', 'developer', 'scheduler', 'toolmaker', 'technician', 'trainer', 'merchandiser', 'supervisor', 'planner', 'welder', 'draughtsman', 'housekeeper', 'cleaner', 'fitter'}


In [11]:
# Raw string so that \s and \w reach the regex engine unchanged.
p = regex.compile(r'\s?(\w*ly)\s+', flags=regex.I)
print({p.findall(item)[0].lower() for item in clean_titles if p.search(item)})

{'early', 'supply', 'family'}


In [12]:
employers = sorted(data.Company.unique(), key=len, reverse=True)
employers[:5]

['LA International Computer Consultants Ltd',
 'Strata Construction Consulting UK Ltd',
 'Robinson Keane Finance Professionals',
 'Trickett Ames Recruitment Solutions',
 'Specialist Recruitment Partners LTD']

---
## Mitigation of data leakage
Because the annotations are produced by regexes built from the job titles and employers, the same patterns can end up in both the downstream training and test sets.
To limit this leakage, let's split the data so that the regexes for EMPLOYER __or__ OCCUP (doing both at once is harder) are not shared between the two data sets.

In [13]:
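def train_test_split_on_feature(data, feature='Title', random_state=1, train_size=0.8, source=None):
    """Split on the unique values of `feature` so that no value is shared between the
    training and test rows, then save the data with a boolean `training` column."""
    if source is None:
        raise ValueError("`source` (the path of the input file) is needed to name the output file.")

    # Split the unique feature values, not the rows themselves.
    feat_train, feat_test = train_test_split(
        data[feature].value_counts().index.values,
        train_size=train_size,
        random_state=random_state
    )

    # Rows whose feature value was assigned to the training side.
    selection = data[feature].isin(feat_train)

    ntrain = selection.sum()
    ntest = (~selection).sum()

    suffix = f"_split_{feature}_RS{random_state}_{str(train_size).replace('.','_')}_{ntrain}_{ntest}.parq"
    out = source.stem + suffix
    print(f"Saving file data split on {feature} to {out}.")
    data.assign(training=selection).to_parquet(DATA_DIR/out)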

In [14]:
train_test_split_on_feature(data, feature='Title', random_state=1, train_size=0.8, source=source)

Saving file data split on Title to Adzuna_job_data_fba836ee1bdf4fda32004145ffe1eeb8d3c6b5f1_docs_split_Title_RS1_0_8_8137_2250.parq.


In [15]:
train_test_split_on_feature(data, feature='Company', random_state=1, train_size=0.8, source=source)

Saving file data split on Company to Adzuna_job_data_fba836ee1bdf4fda32004145ffe1eeb8d3c6b5f1_docs_split_Company_RS1_0_8_8064_2323.parq.


In [16]:
!ls {DATA_DIR}

Adzuna_job_data_fba836ee1bdf4fda32004145ffe1eeb8d3c6b5f1_docs.parq
Adzuna_job_data_fba836ee1bdf4fda32004145ffe1eeb8d3c6b5f1_docs_split_Company_RS1_0_8_8064_2323.parq
Adzuna_job_data_fba836ee1bdf4fda32004145ffe1eeb8d3c6b5f1_docs_split_Title_RS1_0_8_8137_2250.parq
Adzuna_job_data_fba836ee1bdf4fda32004145ffe1eeb8d3c6b5f1_sents.parq
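
A toy illustration of splitting on a feature rather than on rows. It is not the function used in this notebook: it relies only on the standard library, the rows are made up, and `split_on_feature` is a hypothetical name. The point is that whole groups of rows sharing a title go to one side, so the two sets of titles cannot overlap.

```python
import random

rows = [
    {"Id": 1, "Title": "Chef"},
    {"Id": 2, "Title": "Chef"},
    {"Id": 3, "Title": "Plumber"},
    {"Id": 4, "Title": "Welder"},
    {"Id": 5, "Title": "Welder"},
    {"Id": 6, "Title": "Cook"},
    {"Id": 7, "Title": "Driver"},
]

def split_on_feature(rows, feature, train_size=0.8, seed=1):
    """Assign whole groups of rows (sharing a feature value) to train or test."""
    values = sorted({row[feature] for row in rows})
    random.Random(seed).shuffle(values)
    n_train = int(len(values) * train_size)
    train_values = set(values[:n_train])
    train = [row for row in rows if row[feature] in train_values]
    test = [row for row in rows if row[feature] not in train_values]
    return train, test

train, test = split_on_feature(rows, "Title")
train_titles = {row["Title"] for row in train}
test_titles = {row["Title"] for row in test}
print(train_titles.isdisjoint(test_titles))  # True
```

Because the split is made over unique values, the row proportions only approximate `train_size`: the Title split saved in this notebook holds 8137 training rows and 2250 test rows.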


## Create and save the regex patterns to json

In [17]:
flags = regex.M #|regex.I

# Test the flag bit directly rather than relying on the string form of the flags.
REGEX_CASE = 'uncased' if flags & regex.I else 'cased'
print(REGEX_CASE)

# try to handle simple plurals using the s? regex at the end of title
regex_job_titles = '|'.join([f"({t.title()}s?)" for t in clean_titles])

# Made the decision to be case sensitive for the job title
pattern_job_title = regex.compile(regex_job_titles, flags=flags)

regex_employer = '|'.join([f"({t})" for t in employers])

# Made the decision to be case sensitive for the employer
pattern_employer = regex.compile(regex_employer, flags=flags)

cased
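
A small sketch of how such an alternation behaves, under stated assumptions: it uses the standard-library `re` module rather than `regex`, the titles and the sentence are made up, and the call to `re.escape` is an addition that the notebook does not make (it guards against titles containing regex metacharacters such as `+`).

```python
import re

titles = ["Primary Teacher", "Chef de partie", "C++ Developer"]
titles = sorted(titles, key=len, reverse=True)

# Title-case each name, escape it, and allow an optional trailing "s".
pattern = re.compile(
    "|".join(f"({re.escape(t.title())}s?)" for t in titles),
    flags=re.M,
)

text = ("We need Primary Teachers, a Chef De Partie and a C++ Developer. "
        "No primary teacher.")
found = [m.group() for m in pattern.finditer(text)]
print(found)  # ['Primary Teachers', 'Chef De Partie', 'C++ Developer']
```

The lower-case `primary teacher` is not matched, which is the consequence of keeping the patterns case sensitive.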


In [18]:
pattern_dict = {
    'employer': pattern_employer,
    'occup'   : pattern_job_title
}

save_ner_regex_to_json(pattern_dict, file=DATA_DIR/f'./spacy_ner_component_{REGEX_CASE}.json')

## Load the regex patterns from json

In [19]:
# pat_dict = load_ner_regex_from_json(DATA_DIR/f'spacy_ner_component_{REGEX_CASE}.json')

# for k in pat_dict:
#     assert pat_dict[k].pattern == pattern_dict[k].pattern
#     assert pat_dict[k].flags == pattern_dict[k].flags

## Create the spacy language components from the regexes

In [20]:
custom_spacy_components = create_spacy_components_from_dict(pattern_dict)
custom_spacy_components

{'find_employer': <function ner_utils.create_spacy_lang_component.<locals>.find_entity(doc)>,
 'find_occup': <function ner_utils.create_spacy_lang_component.<locals>.find_entity(doc)>}
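
The implementation of `create_spacy_lang_component` lives in `ner_utils` and is not shown here. As a rough idea of what a regex-backed spaCy component can look like, here is a minimal sketch. The component name `find_occup_sketch`, the pattern, and the sentence are hypothetical; a blank English pipeline is used so that no pretrained model is needed.

```python
import re

import spacy
from spacy.language import Language
from spacy.util import filter_spans

# Hypothetical pattern; the notebook builds its patterns from the job titles.
PATTERN = re.compile(r"Chef De Partie|Head Chef")

@Language.component("find_occup_sketch")
def find_occup_sketch(doc):
    spans = []
    for m in PATTERN.finditer(doc.text):
        # char_span returns None when the match does not align with token boundaries.
        span = doc.char_span(m.start(), m.end(), label="OCCUP")
        if span is not None:
            spans.append(span)
    # filter_spans drops overlapping spans, preferring the longest ones.
    doc.ents = filter_spans(list(doc.ents) + spans)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("find_occup_sketch")

doc = nlp("Looking for a Chef De Partie to assist the Head Chef.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

Placing such a component before `ner` in a pretrained pipeline, as in the pipeline listing further down, lets the statistical model keep the rule-based entities and add its own around them.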

## Remove NER entities to focus on our custom ones (optional)

In [21]:
ENT_LABELS_REMOVE = {
            'CARDINAL', 
            'DATE', 
            'EVENT', 
            'FAC', 
            'LAW', 
            'MONEY', 
            'NORP', 
            'ORDINAL', 
            # 'ORG', 
            'PERCENT', 
            'PERSON', 
            # 'GPE',
            'PRODUCT', 
            'QUANTITY', 
            'TIME', 
            'WORK_OF_ART'
        }

@Language.component('ner_removal')
def ner_removal(doc):
    # Build a new tuple instead of removing from a list while iterating over it,
    # which would skip the entity that follows each removed one.
    doc.ents = tuple(ent for ent in doc.ents if ent.label_ not in ENT_LABELS_REMOVE)
    return doc

----
# Testing the rules by adding them to a pretrained NER model

## Loading the model

In [22]:
model_type = 'sm'
model = create_model_and_add_rules(model_type, custom_spacy_components, 
                                   disable=[]
                                  )
model.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'senter',
 'attribute_ruler',
 'lemmatizer',
 'find_employer',
 'find_occup',
 'ner']

## Setting up the options for displacy visualisation

In [23]:
colors = {
    "EMPLOYER": "#DB81E2", 
    "OCCUP": "#AEE8EC", 
    "GPE": "#ECE3AE", 
    "ORG": "#ECBEAE", 
    "PERSON":"#17B4C2", 
    # "OCCUP": "#9017C2", 
    # "OCCUP": "#878787", 
    # "OCCUP": "#0A6DF5", 
    # "OCCUP": "#1F541D"
}
options = {"ents": ["EMPLOYER", "OCCUP", "GPE", "ORG"], "colors":colors}

In [24]:
texts = data.FullDescription.values

In [25]:
np.random.seed(1)

for t in np.random.choice(texts, size=2):
    doc = model(t)
    displacy.render(doc, style='ent', options=options)
    print()







----
# Building the training data

For each ad we use the rule-based NER model with our custom entities to annotate the text.

We then collect the type and location of each entity and build an annotation data set in the format expected by spaCy.

To do that, we need to be able to extract the location of the entities from the processed text.
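
As an illustration of the target format, the sketch below builds a `(text, {"entities": [(start, end, label), ...]})` pair directly from regex matches, where `start` and `end` are character offsets into the text. It is not the notebook's `extract_ent_location` (which reads the entities from a spaCy `Doc`); the function name `annotate` and the two patterns are assumptions, and overlapping matches are not handled. The sentence comes from the Adzuna sample used in this notebook.

```python
import re

# Hypothetical patterns standing in for the employer and job title regexes.
patterns = {
    "EMPLOYER": re.compile(r"Vision for Education"),
    "OCCUP": re.compile(r"Primary Teachers?"),
}

def annotate(text, patterns):
    """Return (text, {"entities": [(start, end, label), ...]}) sorted by position."""
    entities = [
        (m.start(), m.end(), label)
        for label, pattern in patterns.items()
        for m in pattern.finditer(text)
    ]
    return text, {"entities": sorted(entities)}

sentence = ("If so, Vision for Education can help We are currently looking for "
            "enthusiastic and dedicated KS and KS Primary Teachers for a number "
            "of schools across the area.")
text, annotation = annotate(sentence, patterns)
print(annotation)
# {'entities': [(7, 27, 'EMPLOYER'), (103, 119, 'OCCUP')]}
```

Slicing the text with the offsets recovers the entity, e.g. `sentence[7:27]` is `'Vision for Education'`.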

<div style="background-color:#F7F2E0;">
<h4> <font color=MediumVioletRed>Note:</font> </h4>
<p>We can either pass the whole ad to the model, or break it up into sentences and only pass those sentences which contain the entities we care about. 
    
Both strategies have pros and cons and some experimentation is required.</p>
</div>

#### Locating the entities

##### Example

In [26]:
annotations = extract_ent_location(texts[1], nlp=model, entities = {'OCCUP','EMPLOYER','ORG','GPE'})
annotations

('Chef De Partie Rustic Italian AA Rosette restaurant looking for young Chef De Partie to join creative brigade just north of Colchester. My client has an accolade winning, highly reputable country inn, boasting a fantastic rosette awarded restaurant serving brilliant rustic Italian cuisine. The best thing about it? The chefs in the brigade are a fantastic bunch. With a mild tempered Italian Head Chef that is willing to teach what he knows, this role of Chef De Partie is a very sought after position. Looking for a Chef De Partie that is willing to learn and grow as a professional, staying away from the pizza/pasta half hearted meals and cook with entirely fresh and often the finest locally sourced produce in a traditional manner. As Chef De Partie you will learn be able to make your own menu adjustments, following in the steps of the Head Chef and experimenting with flavours but never overdoing it. With all chefs from Chef De Partie to Senior Sous Chef being aided in whatever they need

In [27]:
entities_found(annotations, method='ALL')

True

----
### Building the annotated data from individual sentences

##### Example

In [28]:
build_annotations_from_sentences(data=data[:2], nlp=model, entities = {'OCCUP','EMPLOYER','ORG'}, method='ANY')

  0%|          | 0/2 [00:00<?, ?it/s]

([('If so, Vision for Education can help We are currently looking for enthusiastic and dedicated KS and KS Primary Teachers for a number of schools across the area.',
   {'entities': [(7, 27, 'EMPLOYER'), (103, 119, 'OCCUP')]},
   36757414),
  ('Vision for Education was started in by a group of like minded individuals with a desire for providing a quality service to customers.',
   {'entities': [(0, 20, 'EMPLOYER')]},
   36757414),
  ('You need look no further than Vision for Education for a professional, friendly service, provided by an experienced team for all your requirements.',
   {'entities': [(30, 50, 'EMPLOYER')]},
   36757414),
  ('Chef De Partie Rustic Italian AA Rosette restaurant looking for young Chef De Partie to join creative brigade just north of Colchester.',
   {'entities': [(0, 14, 'OCCUP'), (70, 84, 'OCCUP'), (124, 134, 'ORG')]},
   44452524),
  ('With a mild tempered Italian Head Chef that is willing to teach what he knows, this role of Chef De Partie is a very sou
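
In the sentence-level output, the offsets are relative to each sentence rather than to the whole ad. The sketch below shows one way to go from a document-level annotation to sentence-level ones. It is not the code of `build_annotations_from_sentences`: the sentence splitting is a naive regex rule instead of spaCy's sentence segmentation, and the text, the entities, and the function name `split_annotation_by_sentence` are made up for the example.

```python
import re

def split_annotation_by_sentence(text, entities):
    """Turn one document-level annotation into sentence-level annotations.

    A sentence is taken to run up to a ., ! or ? followed by whitespace or
    the end of the text. Sentences without entities are dropped.
    """
    results = []
    for m in re.finditer(r"\S.*?(?:[.!?](?=\s|$)|$)", text, flags=re.S):
        s_start, s_end = m.span()
        # Keep entities fully inside the sentence, shifted to sentence coordinates.
        local = [
            (start - s_start, end - s_start, label)
            for start, end, label in entities
            if s_start <= start and end <= s_end
        ]
        if local:
            results.append((m.group(), {"entities": local}))
    return results

doc_text = ("Vision for Education is hiring. We value teamwork. "
            "Chef De Partie wanted in Colchester.")
doc_entities = [(0, 20, "EMPLOYER"), (51, 65, "OCCUP"), (76, 86, "GPE")]

sentence_annotations = split_annotation_by_sentence(doc_text, doc_entities)
for sentence, annotation in sentence_annotations:
    print(sentence, annotation)
```

The middle sentence contains no entity and is dropped, which mirrors the idea of keeping only the sentences that contain the entities of interest.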

----
### Building the annotated data from whole ads

##### Example

In [29]:
build_annotations_from_docs(data=data[:2], nlp=model, entities = {'OCCUP','EMPLOYER','ORG','GPE'})

  0%|          | 0/2 [00:00<?, ?it/s]

([('Are you a qualified/newly qualified teacher looking for supply work in and around South Yorkshire? Do you want a new challenge with varied work, flexibility and great rates of pay? If so, Vision for Education can help We are currently looking for enthusiastic and dedicated KS and KS Primary Teachers for a number of schools across the area. Candidates must have an enthusiasm for teaching, a good knowledge of the national curriculum and excellent classroom and behaviour management skills. It is essential that you hold a valid, recognised teaching qualification and ideally you will have 6 weeks recent experience of teaching in the UK. We must also be able to contact your past school to obtain a reference. Vision for Education was started in by a group of like minded individuals with a desire for providing a quality service to customers. Our promise is to serve the education community, be it Teachers, Schools or Students alike to the highest possible standards. If you are looking for a

----
# Creating the annotated data from sentences
Selecting sentences which contain specific entities.

----
# Creating the annotated data from whole ads

In [30]:
%%time 

MODEL_TYPE = 'sm'
METHOD = 'ANY'
# entities={'EMPLOYER', 'OCCUP'}
entities = {'OCCUP','EMPLOYER','ORG','GPE'}

entities = set(sorted(entities))

ANNOT_PATH = DATA_DIR/f"annotations_spacy_{MODEL_TYPE.upper()}"
ANNOT_PATH.mkdir(parents=True, exist_ok=True)

model = create_model_and_add_rules(MODEL_TYPE, custom_spacy_components)

selection = data
NDOCS = len(selection)

annotations, elapsed_time = build_annotations_from_docs(data=selection, entities=entities, nlp=model, method=METHOD)

print(f"Elapsed time: {elapsed_time}.")
print(f"Collected {len(annotations)} annotations.")

# tstamp = datetime.strftime(datetime.today(), format='%Y%m%d_%H%M')

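# Sort at the point of use: a set has no stable order, so the file name would otherwise vary between runs.
suffix=f'Adzuna_{"_".join(sorted(entities))}_{METHOD}_{len(annotations)}_docs_{REGEX_CASE}'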

save_spacy_ner_data_to_disk(annotations, 
                            path=ANNOT_PATH,
                            suffix=suffix)

metadata = get_annotation_metadata(annotations)
metadata.to_parquet(ANNOT_PATH/f'ner_annotation_metadata_{suffix}.parq')

  0%|          | 0/10387 [00:00<?, ?it/s]

Elapsed time: 0d0h5m54s.
Collected 10387 annotations.
CPU times: user 5min 46s, sys: 12.6 s, total: 5min 59s
Wall time: 12min 26s


In [31]:
annotations[0]

('Are you a qualified/newly qualified teacher looking for supply work in and around South Yorkshire? Do you want a new challenge with varied work, flexibility and great rates of pay? If so, Vision for Education can help We are currently looking for enthusiastic and dedicated KS and KS Primary Teachers for a number of schools across the area. Candidates must have an enthusiasm for teaching, a good knowledge of the national curriculum and excellent classroom and behaviour management skills. It is essential that you hold a valid, recognised teaching qualification and ideally you will have 6 weeks recent experience of teaching in the UK. We must also be able to contact your past school to obtain a reference. Vision for Education was started in by a group of like minded individuals with a desire for providing a quality service to customers. Our promise is to serve the education community, be it Teachers, Schools or Students alike to the highest possible standards. If you are looking for a p