**Building a Custom Named Entity Recognition Model Using spaCy**

In [48]:
! pip install -U spacy -q

In [1]:
!python -m spacy info


spaCy version    3.8.7                         
Location         /usr/local/lib/python3.12/dist-packages/spacy
Platform         Linux-6.1.123+-x86_64-with-glibc2.35
Python version   3.12.11                       
Pipelines        en_core_web_sm (3.8.0)        



In [50]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
import json

nlp = spacy.blank("en")
db = DocBin()

In [10]:
import json

# Load the annotation export; the context manager closes the file handle.
with open('/content/annotations.json', encoding='utf-8') as f:
    TRAIN_DATA = json.load(f)

In [53]:
TRAIN_DATA

{'classes': ['DATE', 'MONEY', 'PERCENT', 'NAME', 'ORG'],
 'annotations': [['On March 15, 2023, Samantha Green, a financial analyst at Goldman Sachs, presented the company’s quarterly earnings report. The firm reported a revenue increase of 12% compared to the previous quarter, largely attributed to strong performance in its asset management division. Goldman Sachs also disclosed that it had invested $3.2 billion into sustainable energy projects, partnering with organizations like Tesla and NextEra Energy.\r',
   {'entities': [[3, 17, 'DATE'],
     [19, 33, 'NAME'],
     [58, 71, 'ORG'],
     [164, 167, 'PERCENT'],
     [327, 339, 'MONEY'],
     [409, 414, 'ORG'],
     [419, 433, 'ORG']]}],
  ['\r', {'entities': []}],
  ['During the same conference, Michael Chen, CEO of BrightFuture Capital, announced a merger valued at $7.5 billion with Orion Holdings, scheduled to finalize by October 2023. Analysts believe this could boost BrightFuture Capital’s market share by as much as 18% over the
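Each record pairs a text with `[start, end, label]` character offsets, so one quick way to audit an export before converting it is to slice the text with those offsets and look at what comes out. The sketch below is a minimal illustration and not part of the original notebook: `describe_entities` and `sample_record` are hypothetical names, and the sample is a shortened copy of the first record shown above.

```python
def describe_entities(text, entities):
    """Slice `text` with each [start, end, label] triple.

    Returns (label, substring, has_edge_whitespace) tuples; edge whitespace
    is a common sign that an offset is off by one.
    """
    rows = []
    for start, end, label in entities:
        snippet = text[start:end]
        rows.append((label, snippet, snippet != snippet.strip()))
    return rows


# Shortened copy of the first training record (offsets unchanged).
sample_record = [
    "On March 15, 2023, Samantha Green, a financial analyst at Goldman Sachs",
    {"entities": [[3, 17, "DATE"], [19, 33, "NAME"], [58, 71, "ORG"]]},
]

sample_text, sample_annot = sample_record
rows = describe_entities(sample_text, sample_annot["entities"])
for label, snippet, suspicious in rows:
    print(f"{label:8} {snippet!r} edge_whitespace={suspicious}")
```

Records whose text is only whitespace, such as the `'\r'` entry above, carry no entities and could be filtered with a check like `text.strip()` before conversion.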

In [16]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en")
db = DocBin()

for text, annot in tqdm(TRAIN_DATA['annotations']):
    doc = nlp.make_doc(text)
    ents = []
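    for start, end, label in annot["entities"]:
        # "contract" snaps misaligned offsets inward to token boundaries
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            # No whole token fits inside [start, end); report what was dropped
            print(f"Skipping entity: {text[start:end]!r} ({label})")
        else:
            ents.append(span)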
    doc.ents = ents
    db.add(doc)

db.to_disk("./annotations_data.spacy") # save the docbin object

100%|██████████| 9/9 [00:00<00:00, 979.75it/s]
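The conversion loop passes `alignment_mode="contract"` to `Doc.char_span`, which matters when annotation offsets do not sit exactly on token boundaries. The following sketch compares the three modes spaCy offers on a made-up sentence. The sentence, the offsets, and the helper name `align` are illustrative assumptions, not data from the notebook.

```python
import spacy

nlp_demo = spacy.blank("en")  # tokenizer only, no trained components
demo_doc = nlp_demo.make_doc("Goldman Sachs reported earnings")


def align(doc, start, end, mode):
    """Return the text of the aligned span, or None if no span is produced."""
    span = doc.char_span(start, end, label="ORG", alignment_mode=mode)
    return None if span is None else span.text


# (0, 14) includes the trailing space; (0, 12) cuts off the final "s".
results = {
    (offsets, mode): align(demo_doc, *offsets, mode)
    for offsets in [(0, 14), (0, 12)]
    for mode in ["strict", "contract", "expand"]
}
for key, value in results.items():
    print(key, "->", value)
```

With spaCy 3.x, `strict` returns `None` for both offset pairs, while `contract` yields "Goldman Sachs" for (0, 14) and only "Goldman" for (0, 12). Contracting avoids a crash on sloppy offsets, but it can silently shorten an entity, so printing the skipped or shortened spans is worth the extra output.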


In [17]:
# Load the validation export; the context manager closes the file handle.
with open('/content/validate.json', encoding='utf-8') as f:
    VALIDATE_DATA = json.load(f)

In [18]:
VALIDATE_DATA

{'classes': ['DATE', 'MONEY', 'PERCENT', 'NAME', 'ORG'],
 'annotations': [['On March 15, 2023, Samantha Green, a financial analyst at Goldman Sachs, presented the company’s quarterly earnings report. The firm reported a revenue increase of 12% compared to the previous quarter, largely attributed to strong performance in its asset management division. Goldman Sachs also disclosed that it had invested $3.2 billion into sustainable energy projects, partnering with organizations like Tesla and NextEra Energy.\r',
   {'entities': [[3, 17, 'DATE'],
     [19, 33, 'NAME'],
     [58, 71, 'ORG'],
     [164, 167, 'PERCENT'],
     [327, 339, 'MONEY'],
     [409, 414, 'ORG'],
     [419, 433, 'ORG']]}]]}
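The single validation record shown above has the same text as the first record in `TRAIN_DATA`. When dev texts also appear in the training set, evaluation scores largely measure memorization rather than generalization, so it can be worth checking for overlap before training. A minimal sketch, assuming both files follow the `{'annotations': [[text, {...}], ...]}` layout shown in this notebook; `find_shared_texts` and the two toy record lists are hypothetical:

```python
def find_shared_texts(train_records, dev_records):
    """Return dev texts that also occur (ignoring edge whitespace) in train."""
    train_texts = {text.strip() for text, _ in train_records}
    return [text for text, _ in dev_records if text.strip() in train_texts]


# Toy records in the same [text, {"entities": [...]}] shape as the notebook data.
toy_train = [
    ["Goldman Sachs reported earnings.\r", {"entities": [[0, 13, "ORG"]]}],
    ["Tesla opened a new plant.", {"entities": [[0, 5, "ORG"]]}],
]
toy_dev = [
    ["Goldman Sachs reported earnings.", {"entities": [[0, 13, "ORG"]]}],
    ["Orion Holdings hired a new CFO.", {"entities": [[0, 14, "ORG"]]}],
]

shared = find_shared_texts(toy_train, toy_dev)
print(f"{len(shared)} of {len(toy_dev)} dev texts also appear in train")
```

In the notebook this would be called as `find_shared_texts(TRAIN_DATA['annotations'], VALIDATE_DATA['annotations'])`.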

In [19]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

nlp = spacy.blank("en")
db = DocBin()

for text, annot in tqdm(VALIDATE_DATA['annotations']):
    doc = nlp.make_doc(text)
    ents = []
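    for start, end, label in annot["entities"]:
        # "contract" snaps misaligned offsets inward to token boundaries
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            # No whole token fits inside [start, end); report what was dropped
            print(f"Skipping entity: {text[start:end]!r} ({label})")
        else:
            ents.append(span)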
    doc.ents = ents
    db.add(doc)

db.to_disk("./validate_data.spacy") # save the docbin object

100%|██████████| 1/1 [00:00<00:00, 856.68it/s]


In [20]:
! python -m spacy init config config.cfg --lang en --pipeline ner --optimize accuracy


⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: accuracy
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [28]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 400.7/400.7 MB 4.6 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [29]:
! python -m spacy train config.cfg --output ./ --paths.train "./annotations_data.spacy" --paths.dev "./validate_data.spacy"


ℹ Saving to output directory: .
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     25.91    0.00    0.00    0.00    0.00
 59     200         61.31   1247.49  100.00  100.00  100.00    1.00
134     400          0.00      0.00  100.00  100.00  100.00    1.00
232     600          0.00      0.00  100.00  100.00  100.00    1.00
332     800          0.00      0.00  100.00  100.00  100.00    1.00
504    1000          0.00      0.00  100.00  100.00  100.00    1.00
704    1200          0.00      0.00  100.00  100.00  100.00    1.00
904    1400          0.00      0.00  100.00  100.00  100.00    1.00
1104    1600          0.00      0.00  100.00  100.00  100.00    1.00
1304    1800          0.00      0.00  100.00  100.

In [31]:
nlp_ner = spacy.load("/content/model-best")


In [34]:
doc = nlp_ner('''On March 15, 2023, Samantha Green, a financial analyst at Goldman Sachs, presented the company’s quarterly earnings report. The firm reported a revenue increase of 12% compared to the previous quarter, largely attributed to strong performance in its asset management division. Goldman Sachs also disclosed that it had invested $3.2 billion into sustainable energy projects, partnering with organizations like Tesla and NextEra Energy. During the same conference, Michael Chen, CEO of BrightFuture Capital, announced a merger valued at $7.5 billion with Orion Holdings, scheduled to finalize by October 2023. Analysts believe this could boost BrightFuture Capital’s market share by as much as 18% over the next fiscal year. Meanwhile, philanthropist Linda Rodriguez donated $1 million to the World Health Organization to support global vaccination efforts. The funds will be distributed starting December 1, 2023, focusing on underdeveloped regions in Africa and Southeast Asia. In a related note, Apple Inc. reported a profit of $22 billion for the last quarter, with iPhone sales accounting for 65% of its total revenue. According to CFO David Liu, the company is also planning to increase R&D spending by 8% in the upcoming year. Overall, the financial sector saw significant growth, with experts predicting that investments in green technology could rise by 25% in 2024. As markets adapt, leaders like Elena Petrova of Morgan Stanley emphasize the importance of innovation and sustainability in long-term growth strategies.''')

In [35]:
from spacy import displacy

# Highlight the predicted entities inline in the notebook
displacy.render(doc, style="ent", jupyter=True)
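displaCy gives a visual check; for a plain-text listing, the entities can also be read from `doc.ents`. The sketch below is illustrative: `entity_rows` is a hypothetical helper, and `demo_doc` is a hand-labeled stand-in built with a blank pipeline so the snippet runs without the trained model. In the notebook, the `doc` returned by `nlp_ner` would be passed instead.

```python
import spacy
from spacy.tokens import Span


def entity_rows(doc):
    """Return (text, label, start_char, end_char) for every entity in doc."""
    return [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]


# Stand-in document: the ORG span is set by hand, not predicted by a model.
nlp_demo = spacy.blank("en")
demo_doc = nlp_demo.make_doc("Goldman Sachs reported earnings")
demo_doc.ents = [Span(demo_doc, 0, 2, label="ORG")]

rows = entity_rows(demo_doc)
for row in rows:
    print(row)
```

The character offsets in each row use the same convention as the `[start, end, label]` triples in the annotation files, which makes predictions easy to compare against the gold data.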