### Named Entity Recognition (NER) with spaCy v3.0
The goal of this notebook is to show the full end-to-end process for Named Entity Recognition (NER) with spaCy.

- We will read the data set line by line.
- We will label the texts with entities using the Doccano labeling tool. This is a manual process.
- We will save the labels in a text file as JSONL.
- We will use spaCy's neural network model to train a new statistical model.
- We will save the model.
- We will create a spaCy NLP pipeline and use the new model to detect custom entities in text it has not seen before.
- Finally, we will use pattern matching instead of a deep learning model and compare the two methods.
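
As a preview of the last step, here is a minimal sketch of rule-based entity matching with spaCy's `entity_ruler` component. It assumes a blank English pipeline (tokenizer only) rather than the trained model, and the two patterns and the example sentence are hypothetical, chosen only for illustration.

```python
import spacy

# A blank English pipeline: tokenizer only, no statistical components.
nlp = spacy.blank("en")

# The entity_ruler component assigns entities from phrase or token patterns.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Hypothetical patterns, for illustration only.
    {"label": "MALWARE", "pattern": "MAZE"},                 # exact phrase match
    {"label": "MALWARE", "pattern": [{"LOWER": "icedid"}]},  # case-insensitive token match
])

doc = nlp("Several groups have deployed MAZE and ICEDID.")
matches = [(ent.text, ent.label_) for ent in doc.ents]
print(matches)
```

Rules like these are precise for names that are known in advance, but unlike the statistical model they cannot generalize to names that are not listed.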

### Load a spaCy model and check if it has NER
#### spaCy's NER model
spaCy's Named Entity Recognition system features a sophisticated word embedding strategy using subword features and "**Bloom**" embeddings, a deep convolutional neural network with residual connections, and a novel transition-based approach to named entity parsing.

The system is designed to give a good balance of efficiency, accuracy and adaptability. 
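
The Bloom embedding idea can be illustrated without spaCy. The sketch below is not spaCy's implementation: it assumes MD5 as a stand-in hash function, a 50-row table, and four hashes per key, all picked for illustration. Each feature string is hashed several times, each hash selects one row of a small table, and the selected rows are summed. Two features rarely share all of their rows, so they usually get distinct vectors even though the table is much smaller than the vocabulary.

```python
import hashlib
import random
from typing import List

N_ROWS = 50    # deliberately far smaller than a real vocabulary
WIDTH = 4      # embedding dimensions
N_HASHES = 4   # hash functions per key (an assumption for this sketch)

# Randomly initialized table; in a real model these rows would be learned.
rng = random.Random(0)
table = [[rng.uniform(-1.0, 1.0) for _ in range(WIDTH)] for _ in range(N_ROWS)]


def rows_for(key: str) -> List[int]:
    """Map a string feature to N_HASHES row indices using seeded hashes."""
    rows = []
    for seed in range(N_HASHES):
        digest = hashlib.md5(f"{seed}:{key}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % N_ROWS)
    return rows


def bloom_embed(key: str) -> List[float]:
    """Sum the table rows selected by each hash of the key."""
    vector = [0.0] * WIDTH
    for row in rows_for(key):
        for i, value in enumerate(table[row]):
            vector[i] += value
    return vector


# Whole-word and subword features are embedded the same way.
for feature in ("maze", "ransomware", "prefix=ran", "suffix=are"):
    print(feature, rows_for(feature), [round(x, 3) for x in bloom_embed(feature)])
```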

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

### Load labeled data and convert it to spaCy format

In [2]:
import json
import pandas as pd

ids = []
entities = []

with open(r"label_config.json", "r", encoding='UTF8') as json_file:
    json_data = json.load(json_file)
    for item in json_data:
        ids.append(item['id'])
        entities.append(item['text'])

annotations_dict = {"id": ids, "entities": entities}
annotations_df = pd.DataFrame(annotations_dict)
annotations_df

| | id | entities |
|---|---|---|
| 0 | 1 | ATTACK-METHOD |
| 1 | 2 | TARGET |
| 2 | 3 | LOCATION |
| 3 | 4 | MALWARE |
| 4 | 5 | THREAT-ACTOR |
| 5 | 6 | SOFTWARE |
| 6 | 7 | MITIGATION |
| 7 | 8 | VULNERABILITY |
| 8 | 9 | EMAIL |
| 9 | 10 | IP |
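
The conversion cell that follows maps a label id to its name by row position (label id minus one), which only holds while the ids are contiguous and start at 1. A dictionary keyed by id is one way to make the lookup independent of row order. The sketch below uses an inline stand-in for `label_config.json` containing three of the labels above, with id 3 left out on purpose; the `"UNKNOWN"` fallback is an assumption, not something the notebook does.

```python
import json

# Inline stand-in for label_config.json (three of the labels listed above).
config_text = (
    '[{"id": 1, "text": "ATTACK-METHOD"},'
    ' {"id": 2, "text": "TARGET"},'
    ' {"id": 4, "text": "MALWARE"}]'
)

label_config = json.loads(config_text)

# Key the lookup on the id itself rather than on the row position.
id_to_label = {item["id"]: item["text"] for item in label_config}

print(id_to_label[4])                   # works even though id 3 is missing
print(id_to_label.get(99, "UNKNOWN"))   # hypothetical fallback for unmapped ids
```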


In [3]:
import json

labeled_data = []
cnt = 0
with open(r"labeled_sample.jsonl", "r", encoding='UTF8') as read_file:
    for line in read_file:
        cnt += 1
        data = json.loads(line)
        print("Line", cnt, ":", data['text'])
        entities = []
        for annotation in data['annotations']:
            # Label ids in label_config.json start at 1, so id - 1 is the
            # DataFrame row (this assumes the ids are contiguous and in order).
            label_name = annotations_df.at[annotation['label'] - 1, 'entities']
            entities.append((annotation['start_offset'],
                             annotation['end_offset'],
                             label_name))
        line2tuple = (data['text'], {"entities": entities})
        print(line2tuple)
        labeled_data.append(line2tuple)
        print('\n')

Line 1 : Mandiant Advanced Practices (AP) closely tracks the shifting tactics, techniques, and procedures (TTPs) of financially motivated groups who severely disrupt organizations with ransomware.
('Mandiant Advanced Practices (AP) closely tracks the shifting tactics, techniques, and procedures (TTPs) of financially motivated groups who severely disrupt organizations with ransomware.', {'entities': []})


Line 2 : In May 2020, FireEye released a blog post detailing intrusion tradecraft associated with the deployment of MAZE.
('In May 2020, FireEye released a blog post detailing intrusion tradecraft associated with the deployment of MAZE.', {'entities': [(13, 20, 'ORGANIZATION'), (107, 111, 'MALWARE')]})


Line 3 : As of publishing this post, we track 11 distinct groups that have deployed MAZE ransomware.
('As of publishing this post, we track 11 distinct groups that have deployed MAZE ransomware.', {'entities': [(75, 90, 'MALWARE')]})


Line 4 : At the close of 2020, we noticed a shift
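
Before training, it can help to check that the character offsets really cover the intended text. The helper below is a hypothetical utility, not part of spaCy or Doccano: it slices each span out of the sentence and rejects offsets that are out of range or that overlap, since as far as I know spaCy's NER expects non-overlapping entity spans.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_offset, end_offset, label)


def describe_spans(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Return (surface text, label) for each span, rejecting bad offsets."""
    described = []
    previous_end = 0
    for start, end, label in sorted(spans):
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"offsets out of range: {(start, end, label)}")
        if start < previous_end:
            raise ValueError(f"overlapping span: {(start, end, label)}")
        described.append((text[start:end], label))
        previous_end = end
    return described


# Line 3 of the labeled sample, with the annotation printed above.
text = ("As of publishing this post, we track 11 distinct groups "
        "that have deployed MAZE ransomware.")
print(describe_spans(text, [(75, 90, "MALWARE")]))
```

For line 3 this shows that the span `(75, 90)` covers "MAZE ransomware", not just "MAZE".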

### Load the pre-existing spaCy model

In [4]:
import spacy
nlp = spacy.load('en_core_web_sm')

### Getting the pipeline component

In [5]:
ner = nlp.get_pipe("ner")

In [6]:
TRAIN_DATA = labeled_data

In [7]:
# Adding labels to the `ner`

for _, annotations in TRAIN_DATA:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

In [8]:
# List the pipeline components to disable during training (those that should not change)
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [
    pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions
]
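
In spaCy v3, `nlp.disable_pipes` is kept for backwards compatibility and `nlp.select_pipes` is the documented way to do the same thing. The sketch below assumes a blank pipeline with a `sentencizer` and an untrained `ner` component, so it runs without downloading a model. It shows that only the enabled component is active inside the `with` block and that the full pipeline is restored afterwards.

```python
import spacy

# Small stand-in pipeline (an assumption; the notebook uses en_core_web_sm).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("ner")

# Enable only the component that should be updated; all others are disabled.
with nlp.select_pipes(enable=["ner"]):
    active_inside = list(nlp.pipe_names)

# Leaving the block restores the disabled components.
active_after = list(nlp.pipe_names)
print(active_inside, active_after)
```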

In [9]:
# Import requirements
import random
from spacy.util import minibatch, compounding
from spacy.training import Example
from pathlib import Path

# TRAINING THE MODEL
with nlp.disable_pipes(*unaffected_pipes):

    # Training for 30 iterations
    for iteration in range(30):

        # Shuffle the examples before every iteration
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch up the examples using spaCy's minibatch
        for batch in spacy.util.minibatch(TRAIN_DATA, size=2):
            for text, annotations in batch:
                # create Example
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                # Update the model; `losses` accumulates across the updates of one iteration
                nlp.update([example], losses=losses, drop=0.3)
                print("Losses", losses)

Losses {'ner': 7.754671955929029}
Losses {'ner': 7.754671955929029}
Losses {'ner': 11.6114246576788}
Losses {'ner': 17.308477852939163}
Losses {'ner': 19.67530075747458}
Losses {'ner': 27.372035390301303}
Losses {'ner': 27.372035390301303}
Losses {'ner': 29.75980579624453}
Losses {'ner': 32.65172636093184}
Losses {'ner': 32.65172636093184}
Losses {'ner': 35.00051299409803}
Losses {'ner': 39.98519660085188}
Losses {'ner': 45.74817852341403}
Losses {'ner': 47.74725037471931}
Losses {'ner': 52.97011756300898}
Losses {'ner': 52.97011756300898}
Losses {'ner': 55.871965386877214}
Losses {'ner': 60.88145060642174}
Losses {'ner': 66.02366481079312}
Losses {'ner': 67.72809617773797}
Losses {'ner': 71.31605297434258}
Losses {'ner': 75.16755402179982}
Losses {'ner': 82.02955926933149}
Losses {'ner': 83.2746723672821}
Losses {'ner': 87.09133966951886}
Losses {'ner': 91.39538011809479}
Losses {'ner': 94.67441936236557}
Losses {'ner': 99.36691964569836}
Losses {'ner': 103.20304532107086}
...

Losses {'ner': 17.35007008243946}
Losses {'ner': 17.35012260312257}
Losses {'ner': 2.951455539138168}
Losses {'ner': 2.9514572375230217}
Losses {'ner': 3.948098714640091}
Losses {'ner': 3.9481203709554435}
Losses {'ner': 5.5945623202827}
Losses {'ner': 6.521442396006777}
Losses {'ner': 6.556777273893087}
Losses {'ner': 6.556841132707466}
Losses {'ner': 7.447379279115291}
Losses {'ner': 7.447388833764082}
Losses {'ner': 9.167488958887374}
Losses {'ner': 10.47050669959735}
Losses {'ner': 10.470768696257068}
Losses {'ner': 10.470816807111817}
Losses {'ner': 10.471247503985877}
Losses {'ner': 10.628573257257921}
Losses {'ner': 10.628573257257921}
Losses {'ner': 10.628738187434244}
Losses {'ner': 12.587865059253181}
Losses {'ner': 14.157563349389436}
Losses {'ner': 15.261038924830673}
Losses {'ner': 15.261038924830673}
Losses {'ner': 15.261066866961963}
Losses {'ner': 17.21075879490021}
Losses {'ner': 17.21075879490021}
Losses {'ner': 18.42365935971536}
Losses {'ner': 18.42365935971536}
...

Losses {'ner': 1.5996894186252393}
Losses {'ner': 1.5996894186252393}
Losses {'ner': 1.599700364340805}
Losses {'ner': 1.599700364340805}
Losses {'ner': 1.5997005927854977}
Losses {'ner': 1.5997005927854977}
Losses {'ner': 2.5034069723099623}
Losses {'ner': 3.94144435751055}
Losses {'ner': 3.941450280614004}
Losses {'ner': 4.274473956752483}
Losses {'ner': 4.274473956752483}
Losses {'ner': 4.274481849383279}
Losses {'ner': 4.2744832257501075}
Losses {'ner': 4.297328055271355}
Losses {'ner': 4.297461118463444}
Losses {'ner': 4.297461424583636}
Losses {'ner': 4.3022215645779625}
Losses {'ner': 4.302816157668287}
Losses {'ner': 4.302816667469737}
Losses {'ner': 4.3028169451345315}
Losses {'ner': 4.302817058008503}
Losses {'ner': 4.302819960571007}
Losses {'ner': 4.303540827699145}
Losses {'ner': 4.303540993145325}
Losses {'ner': 4.311294560551308}
Losses {'ner': 4.311385793170346}
Losses {'ner': 4.380775430425292}
Losses {'ner': 1.204735235895479e-08}
Losses {'ner': 1.3853718126815922e-08

Losses {'ner': 1.578874138078617}
Losses {'ner': 1.578874146746873}
Losses {'ner': 1.5788746763213344}
Losses {'ner': 5.931448482784134e-10}
Losses {'ner': 2.5103399402689785e-05}
Losses {'ner': 2.5103399402689785e-05}
Losses {'ner': 3.304572255163289e-05}
Losses {'ner': 3.304572255163289e-05}
Losses {'ner': 3.304572255163289e-05}
Losses {'ner': 3.304605670247782e-05}
Losses {'ner': 3.3048399883683884e-05}
Losses {'ner': 3.304879634204364e-05}
Losses {'ner': 0.0005099024838985582}
Losses {'ner': 0.0005287984463332445}
Losses {'ner': 0.000528832828514491}
Losses {'ner': 0.0005310392935256749}
Losses {'ner': 0.0005310392935256749}
Losses {'ner': 0.0005313949181061739}
Losses {'ner': 0.0005319254903558716}
Losses {'ner': 0.0005319257951791329}
Losses {'ner': 0.0006139424352080905}
Losses {'ner': 23.149330824308368}
Losses {'ner': 23.149330949467313}
Losses {'ner': 23.149330953613585}
Losses {'ner': 23.15084659175391}
Losses {'ner': 23.150846591887806}
Losses {'ner': 23.150846592618826}
...
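
The printed values climb within each iteration because the same `losses` dictionary is passed to every `nlp.update` call and is only reset when a new iteration starts, so each line is a running total rather than a per-example loss. The sketch below is a variant, not the notebook's method: it assumes a blank pipeline instead of `en_core_web_sm`, uses two sentences from the labeled sample, passes each whole batch to `nlp.update`, and reports one total per epoch.

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch

# Two sentences from the labeled sample above, with their annotations.
TRAIN_DATA = [
    ("In May 2020, FireEye released a blog post detailing intrusion tradecraft "
     "associated with the deployment of MAZE.",
     {"entities": [(13, 20, "ORGANIZATION"), (107, 111, "MALWARE")]}),
    ("As of publishing this post, we track 11 distinct groups that have deployed "
     "MAZE ransomware.",
     {"entities": [(75, 90, "MALWARE")]}),
]

# Blank pipeline with a fresh NER component (an assumption to keep this small).
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _start, _end, label in annotations["entities"]:
        ner.add_label(label)

# Build the Example objects once instead of on every update.
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TRAIN_DATA]
optimizer = nlp.initialize(lambda: examples)

epoch_losses = []
for epoch in range(5):
    random.shuffle(examples)
    losses = {}  # reset once per epoch
    for batch in minibatch(examples, size=2):
        # One update per batch rather than one per example.
        nlp.update(batch, sgd=optimizer, drop=0.3, losses=losses)
    epoch_losses.append(losses["ner"])
    print(f"epoch {epoch}: ner loss {losses['ner']:.3f}")
```

Comparing the per-epoch totals makes it easier to see whether training is converging than scanning the running totals printed after every example.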

In [10]:
# Testing the model
doc = nlp(
    "AP created UNC2198 based on a single intrusion in June 2020 involving ICEDID, BEACON, SYSTEMBC and WINDARC. UNC2198 compromised 32 systems in 26 hours during this incident; however, ransomware was not deployed."
)
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

Entities [('AP', 'MALWARE'), ('ICEDID', 'MALWARE'), ('BEACON', 'MALWARE'), ('SYSTEMBC', 'MALWARE'), ('WINDARC', 'MALWARE'), ('26 hours', 'TIME'), ('ransomware', 'MALWARE')]


In [None]:
import spacy
from spacy import displacy

displacy.serve(doc, style="ent")




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...



127.0.0.1 - - [23/Apr/2021 04:33:33] "GET / HTTP/1.1" 200 2715
127.0.0.1 - - [23/Apr/2021 04:33:33] "GET /favicon.ico HTTP/1.1" 200 2715
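
`displacy.serve` starts a web server and blocks the cell until it is stopped. As an alternative sketch, `displacy.render` returns the markup directly. The example below assumes displaCy's manual mode, where entities are passed as plain dictionaries, so it needs no trained model; the sentence and offsets are made up for illustration.

```python
from spacy import displacy

text = "UNC2198 deployed ICEDID and BEACON."

# Build entity dictionaries by locating each name in the text.
ents = []
for name in ("ICEDID", "BEACON"):
    start = text.find(name)
    ents.append({"start": start, "end": start + len(name), "label": "MALWARE"})

html = displacy.render(
    {"text": text, "ents": ents, "title": None},
    style="ent",
    manual=True,    # input is a dict, not a Doc
    jupyter=False,  # return the markup instead of displaying it
)
print(html[:80])
```

Inside a notebook, calling `displacy.render(doc, style="ent")` on the `doc` from the test cell should display the highlighted entities inline, since Jupyter mode is detected automatically when `jupyter` is left unset.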
