## Named Entity Recognition

In this notebook I show how to train a Named Entity Recognition (NER) model that extracts the relevant skills from texts about an employee or applicant. In the next step I want to use the detected skills to create employee and project competence profiles.

The training data was generated with ChatGPT by prompting it to create Jira stories in the fields of Cloud, NLP and Computer Vision. The datasets were then labeled using the NER Annotator: https://tecoholic.github.io/ner-annotator/
As the annotation process is very time-consuming, I only annotated texts for these three fields. The Python NLP library spaCy is used to create a custom NER model that can detect skill entities in texts (resumes, Jira stories, project descriptions, training courses).
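
To make the expected input concrete, here is a minimal sketch of an annotation file in the shape the loading loop further down reads: a top-level `annotations` list of `[text, {"entities": [[start, end, label], ...]}]` pairs. The sample sentence, the label name `SKILL`, the `classes` key and the helper `labeled_substrings` are assumptions for illustration, not taken from the actual datasets.

```python
import json

# Hypothetical miniature export. Only the "annotations" key is used by the
# notebook; "classes", the sentence and the label "SKILL" are assumptions.
sample_export = json.loads("""
{
  "classes": ["SKILL"],
  "annotations": [
    ["Deploy the service on Kubernetes with Docker.",
     {"entities": [[22, 32, "SKILL"], [38, 44, "SKILL"]]}]
  ]
}
""")


def labeled_substrings(export):
    """Return (substring, label) pairs so character offsets can be checked by eye."""
    pairs = []
    for text, annot in export["annotations"]:
        for start, end, label in annot["entities"]:
            pairs.append((text[start:end], label))
    return pairs


print(labeled_substrings(sample_export))
```

Printing the substrings behind each offset pair is a quick way to spot annotations whose boundaries are off by a character before they reach training.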


In [None]:
import json
import os

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en")  # blank English pipeline; only its tokenizer is used here
db = DocBin()  # collects the annotated docs so they can be saved to disk

First we iterate over the folder labeled_entities, where all the datasets are located. Each dataset is converted into spaCy Doc objects, and the result is stored as "./training_data_skills.spacy".

In [None]:
folder_path = 'C:/Users/SEPA/topic_modeling/labeled_entities'  # Replace with the path to your folder

# Iterate over each file in the folder
for filename in os.listdir(folder_path):
    # Use a context manager so the file handle is closed after reading
    with open(os.path.join(folder_path, filename), encoding="utf-8") as f:
        TRAIN_DATA = json.load(f)

    # text is the raw text, annot holds its labeled entities
    for text, annot in tqdm(TRAIN_DATA['annotations']):
        # print(text)   # the raw text
        # print(annot)  # the annotated entities
        doc = nlp.make_doc(text)
        ents = []
        # Each entity is a (start, end, label) triple of character offsets into text
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                # No complete token lies inside the given offsets
                print("Skipping entity")
            else:
                ents.append(span)
        doc.ents = ents
        db.add(doc)

db.to_disk("./training_data_skills.spacy") 
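
The "Skipping entity" branch is easier to understand with a small example. The sketch below shows how `doc.char_span` behaves when the annotated offsets do not line up with token boundaries; the sentence and the label `SKILL` are assumptions for illustration.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp.make_doc("Experience with Kubernetes and Docker.")

# Offsets 16-26 cover the token "Kubernetes" exactly.
exact = doc.char_span(16, 26, label="SKILL", alignment_mode="contract")

# Offsets 16-24 cut the token short ("Kubernet"). "contract" keeps only
# whole tokens inside the range, finds none, and returns None.
truncated = doc.char_span(16, 24, label="SKILL", alignment_mode="contract")

# "expand" instead widens the range to the enclosing token boundaries.
expanded = doc.char_span(16, 24, label="SKILL", alignment_mode="expand")

print(exact, truncated, expanded)
```

With `"contract"`, a sloppy annotation is dropped rather than silently widened, so a large number of "Skipping entity" messages is a hint that the offsets in the labeled data deserve a second look.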

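Run this command to create the config file for model training. The leading `!` executes it as a shell command from within the notebook; omit it when running the command directly in a terminal.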

In [None]:
! python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency

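Run this command to start the model training. Note that the same file is passed as both training and development data, so the evaluation scores reported during training are measured on examples the model has already seen and will be optimistic.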

In [None]:
! python -m spacy train config.cfg --output ./ --paths.train ./training_data_skills.spacy --paths.dev ./training_data_skills.spacy
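
Because the command above passes the same file to `--paths.train` and `--paths.dev`, one possible refinement is to hold out part of the annotated examples for evaluation. The following is a minimal sketch, not part of the original workflow; the helper `split_annotations`, the 20% dev fraction and the placeholder examples are assumptions.

```python
import random


def split_annotations(annotations, dev_fraction=0.2, seed=0):
    """Shuffle annotated examples and split them into (train, dev) lists."""
    shuffled = list(annotations)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Placeholder examples in the [text, {"entities": [...]}] shape the loading loop reads.
examples = [[f"story {i}", {"entities": []}] for i in range(10)]
train, dev = split_annotations(examples)
print(len(train), len(dev))
```

Each of the two lists could then be converted with the same `DocBin` loop and written to its own `.spacy` file, with the two file paths passed to `--paths.train` and `--paths.dev` respectively.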

Now we load the best model and use it to detect the skills in a short description of my technical profile.

In [2]:
from spacy import displacy

nlp_ner = spacy.load("C:/Users/SEPA/topic_modeling/model-best")

doc = nlp_ner('''I have several years of experience with NLP and MLOps. Here I implemented a Text Classification Algorithm with BERT Algorithm. Moreover I have worked with AWS, Kubernetes and Docker.''')
displacy.render(doc, style="ent", jupyter=True)

As one can see, the trained NER model was able to identify all skills in the short description of my technical profile. In the next step we will store the skills detected for applicants in order to map applicants and employees to skill clusters.
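
As a starting point for that next step, the detected entities can be read from `doc.ents` and collected into a plain list. The sketch below is an assumption about how this might look: the helper `extract_skills` is hypothetical, and a blank pipeline with a rule-based `entity_ruler` and made-up patterns stands in for the trained model so that the example runs without it.

```python
import spacy


def extract_skills(doc):
    """Return the unique entity texts of a processed doc, in order of first appearance."""
    seen = set()
    skills = []
    for ent in doc.ents:
        key = ent.text.lower()  # compare case-insensitively
        if key not in seen:
            seen.add(key)
            skills.append(ent.text)
    return skills


# Stand-in for the trained model (assumption): a blank pipeline with a
# rule-based entity_ruler and hypothetical patterns.
nlp_demo = spacy.blank("en")
ruler = nlp_demo.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "SKILL", "pattern": "AWS"},
    {"label": "SKILL", "pattern": "Kubernetes"},
    {"label": "SKILL", "pattern": "Docker"},
])

demo_doc = nlp_demo("I have worked with AWS, Kubernetes and Docker. I use Docker daily.")
print(extract_skills(demo_doc))
```

With the trained model, the same function could be called as `extract_skills(nlp_ner(text))`, giving one skill list per applicant or employee to feed into the clustering.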