# spaCy NER model

Source: https://www.machinelearningplus.com/nlp/training-custom-ner-model-in-spacy/

Maybe useful: https://spacy.io/usage/training/#ner

In [1]:
# Load spaCy model and check if it has NER
import spacy
nlp = spacy.load('en_core_web_sm')

# Make sure this includes 'ner'
nlp.pipe_names

['tagger', 'parser', 'ner']

In [2]:
# Perform default NER on a vacancy
# The text is split over two adjacent string literals, which Python joins into one string
vacancy_text = (
  'You know better than anyone how to bind other people to you, people for whom you can mean something. A great new assignment for jobseekers and the right match for their sourcing issue for clients. If you are also curious about market developments, would you like to hear more about projects within the industry and are you able to translate this information into opportunities for Brunel, then a role as a Sales Consultant is perfect for you! About this position As a Sales Consultant you always have something to do. Your main goal is to make the best match between clients and candidates, and that involves a lot. Your work does not stop at finding and connecting both parties. You are also responsible for expanding and maintaining your own network of candidates and clients. That means that you are in constant contact with both parties. Keeping an overview and keeping different balls in the air is no problem for you. Your focus area will be on specialists and organizations within the Northern Netherlands Industry. Together with your team you operate the fields: Maintenance & Asset Management, Industrial Automation, Supply Chain & Logistics and Innovation & Development. Your colleagues are all commercial, enterprising and ambitious. Together we aim for the best result and recognisability in the market.  About you In short, you have a mega palette of tasks and responsibilities. It is therefore important that you can keep an overview. And we ask more of you. So you have at least: A college degree from a commercial or technical-related study At least 1 year of experience in sales and preferably in job placement A lot of ambition to grow in your profession A representative attitude Strongly developed communication skills and a lot of persuasiveness And a valid driver\'s license B.  What we offer Give a little, take a little. You bring your expertise with you, and we provide a salary that matches it. '
  'You will also receive a laptop, smartphone and company car from us. And you have the chance to win interesting bonuses! We also arrange that you get a discount on the gym, cultural trips, insurance and your pension premium. And we don\'t take it overnight either with regard to your professional growth. You follow a tailor-made training program from the outset that is provided by renowned institutes. About us Don\'t be surprised if you hear a colleague in the office talking about how our location in Singapore handles certain matters, or if your supervisor makes a call in fluent German. We work from 44 countries around the world and we are proud of that! We would never have grown this big if we didn\'t just go for the best match between professional and company every day. And you are an indispensable link in this. So do you step into Brunel\'s world? Apply immediately!'
)

doc = nlp(vacancy_text)
for ent in doc.ents:
  print(ent.text, ent.label_)

Brunel PERSON
the Northern Netherlands Industry ORG
Maintenance & Asset Management ORG
Innovation & Development ORG
At least 1 year DATE
Strongly ORG
Singapore GPE
German NORP
44 CARDINAL
Brunel PERSON
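
The default labels above contain nothing resembling a skill, and some spans look mislabeled (the company `Brunel` as PERSON, the adverb `Strongly` as ORG). As a small illustration, the printed pairs can be tallied per label. The `default_entities` list below is just the output above copied by hand; it is not something spaCy returns in this form.

```python
from collections import Counter

# (text, label) pairs copied by hand from the output of the default model
default_entities = [
    ("Brunel", "PERSON"),
    ("the Northern Netherlands Industry", "ORG"),
    ("Maintenance & Asset Management", "ORG"),
    ("Innovation & Development", "ORG"),
    ("At least 1 year", "DATE"),
    ("Strongly", "ORG"),
    ("Singapore", "GPE"),
    ("German", "NORP"),
    ("44", "CARDINAL"),
    ("Brunel", "PERSON"),
]

# Count how often each label occurs
label_counts = Counter(label for _, label in default_entities)
print(label_counts.most_common())
```

None of the labels is `SKILL`, which is why a custom label is added and trained in the following cells.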


In [3]:
# Get the NER pipeline
ner = nlp.get_pipe('ner')

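# Add the new label and get an optimizer that continues from the pretrained weights
ner.add_label('SKILL')
optimizer = nlp.resume_training()
# Remember the transition names so the saved model can be checked after reloading
move_names = list(ner.move_names)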


# Disable pipeline components we don't need to change
pipe_exceptions = ['ner', 'trf_wordpiecer', 'trf_tok2vec']
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In [5]:
# Function to train NER pipeline to recognize skills
import random
import time
from spacy.util import minibatch, compounding
from pathlib import Path


k = 30
sizes = compounding(1.0, 4.0, 1.001)
drop = 0.5


def train_ner(train_data, iterations=k, sizes=sizes, drop=drop, optimizer=optimizer):
  """Update the NER component on train_data (spaCy v2-style training loop)."""
  # Only the NER weights should change, so disable the other components
  with nlp.disable_pipes(*unaffected_pipes):
    for i in range(iterations):
      print('Iteration', i)
      start = time.time()

      # Shuffle examples before every iteration (shuffles train_data in place)
      random.shuffle(train_data)
      losses = {}

      # Batch up the examples using spaCy's minibatch
      batches = minibatch(train_data, size=sizes)

      for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
          texts,          # batch of texts
          annotations,    # batch of annotations
          sgd=optimizer,
          drop=drop,      # dropout - make it harder to memorise data
          losses=losses,  # running total, accumulated across batches
        )
        print('Losses', losses)

      end = time.time()
      print('Time:', end - start)
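
The batch sizes come from `compounding(1.0, 4.0, 1.001)`, which yields a value that starts at 1.0 and is multiplied by 1.001 on every step until it is capped at 4.0; `minibatch` truncates each value to an integer. The sketch below re-implements that schedule in plain Python purely for illustration. `compounding_sketch` and `batch_sizes` are hypothetical helper names, not spaCy functions.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop.

    Assumes start <= stop and compound >= 1, as in the cell above.
    """
    value = start
    while True:
        yield value
        value = min(value * compound, stop)


def batch_sizes(n_batches, start=1.0, stop=4.0, compound=1.001):
    """Integer batch sizes for the first n_batches batches."""
    schedule = compounding_sketch(start, stop, compound)
    return [int(size) for size in islice(schedule, n_batches)]


sizes_preview = batch_sizes(3000)
# Index of the first batch that holds more than one example
first_batch_of_two = sizes_preview.index(2)
print(sizes_preview[0], first_batch_of_two, sizes_preview[-1])
```

With a compound rate this close to 1, the batch size stays at 1 for several hundred batches before it reaches 2. Because `sizes` is created once at module level, the schedule keeps advancing across iterations rather than restarting.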

In [6]:
# Load the annotated data that the model will be trained on
# The training file can be found at https://github.com/dimaknyaz/DataProcessing/blob/main/DataFiles/training_set_1.txt
import json
with open("training_set_1.txt", "r") as fp:
  data = json.load(fp)

data[0:10]

[[' if you are also curious about market developments, would you like to hear more about projects within the industry and are you able to translate this information into opportunities for brunel, then a role as a sales consultant is perfect for you',
  {'entities': [[210, 226, 'SKILL']]}],
 [' together with your team you operate the fields: maintenance & asset management, industrial automation, supply chain & logistics and innovation & development',
  {'entities': [[63, 79, 'SKILL']]}],
 [' so you have at least: a college degree from a commercial or technical related study at least 1 year of experience in sales and preferably in job placement a lot of ambition to grow in your profession a representative attitude strongly developed communication skills and a lot of persuasiveness and a valid driver license b',
  {'entities': [[236, 259, 'SKILL']]}],
 [' so you have at least: a college degree from a commercial or technical related study at least 1 year of experience in sales and preferab
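
Each item is a `[text, {'entities': [[start, end, label], ...]}]` pair, where `start` and `end` are character offsets into `text`. A quick sanity check, sketched below, is to slice the text with those offsets and look at what comes out. `entity_texts` and `misaligned_entities` are hypothetical helpers, and the whitespace rule is only an assumed heuristic for spotting offsets that are off by a character.

```python
def entity_texts(example):
    """Return the (surface text, label) pair for every annotated span."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


def misaligned_entities(example):
    """Return spans that are empty, out of range, or begin/end on whitespace."""
    text, annotations = example
    problems = []
    for start, end, label in annotations["entities"]:
        in_range = 0 <= start < end <= len(text)
        span = text[start:end]
        if not in_range or span != span.strip():
            problems.append((start, end, label))
    return problems


# Second item of the data shown above
example = [
    " together with your team you operate the fields: maintenance & asset management, industrial automation, supply chain & logistics and innovation & development",
    {"entities": [[63, 79, "SKILL"]]},
]
print(entity_texts(example))
print(misaligned_entities(example))
```

For this item the offsets 63 to 79 select `asset management`, and no span is flagged.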

In [None]:
len(data)

7187

In [7]:
import math

# Hold out the last 5% of the examples (in file order) for testing
x = math.floor(len(data) * .95)

train_data = data[:x]
test_data = data[x:]
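
The split above takes the last 5% of the file as the test set. If the file happens to be ordered (for example, by vacancy), that tail may not be representative, so a shuffled split with a fixed seed is a possible alternative. This is only a sketch: `split_data` is a hypothetical helper and the seed value is arbitrary.

```python
import random


def split_data(examples, train_fraction=0.95, seed=0):
    """Shuffle a copy of examples reproducibly and split it into train/test."""
    shuffled = list(examples)              # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, so the result is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy usage; with the notebook's data this would be split_data(data)
toy_train, toy_test = split_data(list(range(8)), train_fraction=0.75)
print(len(toy_train), len(toy_test))
```

Note that the data shown earlier contains the same sentence more than once with different spans, so a random split can place copies of one sentence in both sets.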

In [8]:
# Finally we can train the model
train_ner(train_data)

Streaming output truncated to the last 5000 lines.
Losses {'ner': 11436.820758163929}
Losses {'ner': 11503.36825209856}
Losses {'ner': 11566.597240149975}
Losses {'ner': 11626.497311294079}
Losses {'ner': 11747.667285621166}
Losses {'ner': 11795.49426907301}
Losses {'ner': 11859.783612430096}
Losses {'ner': 11902.221711099148}
Losses {'ner': 12128.202755868435}
Losses {'ner': 12190.413777768612}
Losses {'ner': 12373.53767246008}
Losses {'ner': 12438.921762883663}
Losses {'ner': 12507.54422801733}
Losses {'ner': 12577.82922643423}
Losses {'ner': 12684.34888547659}
Losses {'ner': 12938.003191888332}
Losses {'ner': 13028.210271298885}
Losses {'ner': 13082.51035118103}
Losses {'ner': 13155.829234361649}
Losses {'ner': 13196.76461815834}
Losses {'ner': 13564.309920549393}
Losses {'ner': 13691.452726602554}
Losses {'ner': 13867.581153154373}
Losses {'ner': 13971.320274591446}
Losses {'ner': 14067.164564371109}
Losses {'ner': 14211.67554306984}
Losses {'ner': 14269.55982875824}
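
The values above keep rising because `nlp.update` adds each batch's loss to the running total in the `losses` dict, which is reset only at the start of an iteration. To see the loss of individual batches, consecutive totals can be subtracted. The sketch below parses log lines in the format shown above; `per_batch_losses` is a hypothetical helper.

```python
import ast


def per_batch_losses(log_lines, key="ner"):
    """Turn cumulative 'Losses {...}' log lines into per-batch differences."""
    prefix = "Losses "
    totals = []
    for line in log_lines:
        if line.startswith(prefix):
            # The remainder of the line is a Python dict literal
            totals.append(ast.literal_eval(line[len(prefix):])[key])
    return [later - earlier for earlier, later in zip(totals, totals[1:])]


# First three lines of the output above
log_lines = [
    "Losses {'ner': 11436.820758163929}",
    "Losses {'ner': 11503.36825209856}",
    "Losses {'ner': 11566.597240149975}",
]
print(per_batch_losses(log_lines))
```

A negative difference marks the boundary between two iterations, where the running total starts again from zero.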


In [14]:
# Test if the NER model works
for line in test_data:
  doc = nlp(line[0])

  print(line[0])
  for ent in doc.ents:
    print(' > ', ent)

 you will work closely with the other healthcare technology project leader, information manager, the healthcare managers and team managers, the colleagues of espria
   initiating and leading projects with the aim of implementing one or more healthcare technology applications
 with our flexible, modular prefab construction method, we are uniquely able to translate wishes into practical and innovative solutions that we can realize
 >  innovative solutions
 you work closely with the project manager and you are a project team together with the production manager
 >  project manager
 commercial attitude and communication and social skills
 >  social skills
 with our flexible, modular prefab construction method, we are uniquely able to translate wishes into practical and innovative solutions that we can realize
 >  innovative solutions
 job requirements mbo + / hbo working & thinking level excellent command of the dutch language in word and writing command of the english language, spoken & w
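
Reading the predictions by eye shows that some lines get no entity at all. To put a number on that, predicted spans can be compared with the held-out annotations using exact-match precision, recall and F1. The sketch below works on plain `(start, end, label)` tuples; `span_prf` is a hypothetical helper, and exact matching is an assumed scoring rule.

```python
def span_prf(gold_spans, predicted_spans):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy spans; in the notebook the gold spans would come from line[1]['entities']
# (converted to tuples) and the predictions from
# (ent.start_char, ent.end_char, ent.label_) for ent in doc.ents
gold = [(0, 5, "SKILL"), (10, 15, "SKILL")]
predicted = [(0, 5, "SKILL"), (20, 25, "SKILL")]
print(span_prf(gold, predicted))
```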

In [None]:
# Save the model

# Output directory
from pathlib import Path
output_dir = Path('/content/')

# Saving the model to the output directory (create it if it does not exist)
output_dir.mkdir(parents=True, exist_ok=True)
nlp.meta['name'] = 'my_ner'  # rename model
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

# Loading the model from the directory
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
assert nlp2.get_pipe("ner").move_names == move_names
doc2 = nlp2(' Dosa is an extremely famous south Indian dish')
for ent in doc2.ents:
  print(ent.label_, ent.text)