<a href="https://colab.research.google.com/github/abhimanyu1214/Entity-Detection-in-Text/blob/main/Entity_Detection_in_Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Before developing and training your own NER model, it's worth your time to first consider the requirements of your project and try out some of the preexisting off-the-shelf NER models to see if they can do the job for you. Preexisting NER models have the advantage of being ready to test in a few lines of code and are in some cases designed around being fast and robust in a production setting.

If your project only needs to identify basic entity types such as people, organizations, and locations, I encourage you to first test it with the existing NER models from spaCy, Stanford (Stanza), and Flair.

The following code cell shows how to retrieve entity tags for some text using spaCy, which comes pre-installed on Colab.


In [13]:
import spacy

# Load spaCy's small English pipeline (spacy.load loads an installed model; it does not download one)
nlp = spacy.load("en_core_web_sm")

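# Process text using the spaCy model. Note that `doc` is reassigned by the
# second call, so only the second passage's entities are printed below.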
doc = nlp("Google LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, a search engine, cloud computing, software, and hardware. It is considered one of the Big Five technology companies in the U.S. information technology industry, alongside Amazon, Facebook, Apple, and Microsoft.")
doc = nlp("Google was founded in September 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock. They incorporated Google as a California privately held company on September 4, 1998. Google was then reincorporated in Delaware on October 22, 2002.")
# Display the entities found by the model, and the type of each.
print('{:<12}  {:}\n'.format('Entity', 'Type'))

# For each entity found...
for ent in doc.ents:
    
    # Print the entity text `ent.text` and its label `ent.label_`.
    print('{:<12}  {:}'.format(ent.text, ent.label_))

Entity        Type

Google        ORG
September 1998  DATE
Larry Page    PERSON
Sergey Brin   PERSON
Ph.D.         WORK_OF_ART
Stanford University  ORG
California    GPE
about 14 percent  PERCENT
56 percent    PERCENT
Google        ORG
California    GPE
September 4, 1998  DATE
Google        ORG
Delaware      GPE
October 22, 2002  DATE
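The fixed width of 12 in the format string above is narrower than several of the entities (for example `Stanford University`), which is why the columns in the output drift. A minimal sketch of one way to keep them aligned is to size the first column from the longest entity. The `rows` list below is copied by hand from a few lines of the output above rather than produced by spaCy, and `format_table` is a hypothetical helper name, not part of spaCy.

```python
def format_table(rows, headers=("Entity", "Type")):
    """Return aligned two-column text for (entity_text, label) pairs."""
    # Size the first column from the longest entry, header included.
    width = max(len(text) for text, _ in list(rows) + [headers])
    lines = ["{:<{w}}  {}".format(headers[0], headers[1], w=width), ""]
    for text, label in rows:
        lines.append("{:<{w}}  {}".format(text, label, w=width))
    return "\n".join(lines)


# Hand-copied subset of the spaCy output above (illustrative only).
rows = [
    ("Google", "ORG"),
    ("September 1998", "DATE"),
    ("Larry Page", "PERSON"),
    ("Stanford University", "ORG"),
    ("California", "GPE"),
]

print(format_table(rows))
```

With a real spaCy `doc`, the same helper could be fed `[(ent.text, ent.label_) for ent in doc.ents]`.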


For Named Entity Recognition (NER) tasks, we wanted to provide some practical guidance and resources for building your own NER application, since fine-tuning BERT may not be the best solution for every NER application.

In this post, we will:

1. Discuss when it might be appropriate to use an off-the-shelf library vs training / fine-tuning your own model.
2. Point you to some popular libraries for performing NER tagging and share some quick-start examples.
3. Share some resources we've found comparing and benchmarking different NER tools.


When to Fine-Tune

In some cases, these off-the-shelf libraries won't be the best solution for your project. You might have:

    Specific entity types that are not included in the off-the-shelf versions
    A different kind of text corpus from what the off-the-shelf models are trained on
    Very high accuracy or recall requirements
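One way to check whether an off-the-shelf model meets an accuracy or recall requirement is to compare its predictions with a small hand-labeled sample. The sketch below computes exact-match, entity-level precision, recall, and F1. The function name `entity_prf` is hypothetical, the `gold` set is an invented example of hand labels, and matching on `(text, label)` pairs is a simplifying assumption; a fuller evaluation would usually match on character spans.

```python
def entity_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (text, label) pairs."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented hand labels for illustration (not from a real annotated corpus).
gold = {
    ("Larry Page", "PERSON"),
    ("Sergey Brin", "PERSON"),
    ("Stanford University", "ORG"),
    ("California", "GPE"),
}

# Illustrative predictions: three matches, one miss, one spurious entity.
predicted = {
    ("Larry Page", "PERSON"),
    ("Sergey Brin", "PERSON"),
    ("Stanford University", "ORG"),
    ("Ph.D.", "WORK_OF_ART"),
}

precision, recall, f1 = entity_prf(gold, predicted)
print("precision={:.2f} recall={:.2f} f1={:.2f}".format(precision, recall, f1))
```

If these numbers fall short of what the project needs on a representative sample, that is a signal that fine-tuning may be worth the extra cost.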

In general, fine-tuning BERT (or variants of BERT) on your dataset will yield a highly accurate tagger, and with less training data required than training a custom model from scratch.

The biggest caveat, however, is that BERT models are large, and typically warrant GPU acceleration. Working with GPUs can be expensive, and BERT will be slower to run on text than tools like spaCy.
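Speed differences like this are easy to measure before committing to a tool. The following is a rough sketch of a timing harness that accepts any tagging function; `measure_throughput` and `dummy_tagger` are hypothetical names, and the dummy tagger is only a stand-in for a real model call (for example a spaCy `nlp` object), so the number it prints says nothing about any actual library.

```python
import time


def measure_throughput(tag_fn, texts, repeats=3):
    """Return the best observed texts-per-second for tag_fn over texts."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            tag_fn(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very coarse clocks.
    return len(texts) / best if best > 0 else float("inf")


def dummy_tagger(text):
    # Stand-in for a real model: returns the capitalized tokens.
    return [tok for tok in text.split() if tok[:1].isupper()]


texts = ["Google was founded by Larry Page and Sergey Brin."] * 200
rate = measure_throughput(dummy_tagger, texts)
print("{:.0f} texts/second".format(rate))
```

Running the same harness over each candidate on the same texts and hardware gives a like-for-like comparison to weigh against accuracy and cost.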

So consider your production requirements for speed, accuracy, and cost before going straight to BERT!

Resources & Experiments

There is, of course, a lot more that can be said about these different NLP toolkits. Our goal for this post was to simply make sure that you were aware of them, and the reasons you might use them over BERT.

The following are some articles which we found informative, containing experiments, summaries, benchmarks, and other comparisons of different NER tools. They're worth looking through if you'd like to get a sense of NER pipelines and the power of existing NER tools. Below each article, we've highlighted the main points.

In [14]:
from flair.data import Sentence
from flair.models import SequenceTagger

# Make a sentence
sentence = Sentence("The company's rapid growth since incorporation has triggered a chain of products, acquisitions, and partnerships beyond Google's core search engine (Google Search). It offers services designed for work and productivity (Google Docs, Google Sheets, and Google Slides), email (Gmail), scheduling and time management (Google Calendar), cloud storage (Google Drive), instant messaging and video chat (Duo, Hangouts, Chat, and Meet), language translation (Google Translate), mapping and navigation (Google Maps, Waze, Google Earth, and Street View), podcast hosting (Google Podcasts), video sharing (YouTube), blog publishing (Blogger), note-taking (Google Keep and Google Jamboard), and photo organizing and editing (Google Photos). ")

# Load the NER tagger
# This file is around 1.5 GB so will take a little while to load.
tagger = SequenceTagger.load('ner-ontonotes')

# Run NER over sentence
tagger.predict(sentence)

# Retrieve the entities found by the tagger.
entity_dict = sentence.to_dict(tag_type='ner')

# Display the entities, and the type(s) of each.
print('\n{:<12}  {:}\n'.format('Entity', 'Type(s)'))

# For each entity...
for entity in entity_dict['entities']:
    
    # Print the entity text and its labels. Flair supports multiple labels
    # per entity, and includes a confidence score.
    print('{:<12}  {:}'.format(entity["text"], str(entity["labels"])))

2021-01-28 02:28:50,436 loading file /root/.flair/models/en-ner-ontonotes-v0.4.pt

Entity        Type(s)

Google        [ORG (0.9999)]
Google Search  [ORG (0.6656)]
Google Docs   [ORG (0.7732)]
Google Sheets  [ORG (0.4692)]
Google Slides  [ORG (0.6962)]
Google Calendar  [ORG (0.4058)]
Google Drive  [PRODUCT (0.725)]
Duo           [PRODUCT (0.8896)]
Hangouts      [PRODUCT (0.6412)]
Meet          [ORG (0.9559)]
Google        [ORG (0.6744)]
Translate     [PRODUCT (0.5829)]
Google Maps   [PRODUCT (0.7484)]
Waze          [ORG (0.8816)]
Google Earth  [ORG (0.6109)]
Street View   [ORG (0.7651)]
Google Podcasts  [ORG (0.618)]
YouTube       [ORG (0.8468)]
Blogger       [ORG (0.7347)]
Google        [ORG (0.9501)]
Jamboard      [PRODUCT (0.8827)]
Google Photos  [ORG (0.641)]
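The confidence scores in the output above range from about 0.41 to nearly 1.0, so one simple post-processing step is to drop low-confidence predictions. The sketch below works on `(text, label, score)` tuples copied by hand from a few lines of that output; `filter_by_confidence` is a hypothetical helper, and the 0.6 threshold is an arbitrary value chosen for illustration. A high score does not guarantee a correct label, so any threshold should be checked against labeled examples.

```python
def filter_by_confidence(entities, threshold):
    """Keep (text, label, score) tuples whose score is at least threshold."""
    return [ent for ent in entities if ent[2] >= threshold]


# Hand-copied subset of the Flair output above (illustrative only).
entities = [
    ("Google", "ORG", 0.9999),
    ("Google Sheets", "ORG", 0.4692),
    ("Google Calendar", "ORG", 0.4058),
    ("Google Drive", "PRODUCT", 0.725),
    ("Duo", "PRODUCT", 0.8896),
    ("Translate", "PRODUCT", 0.5829),
]

# 0.6 is an arbitrary cut-off for this sketch, not a recommended value.
kept = filter_by_confidence(entities, threshold=0.6)
for text, label, score in kept:
    print("{:<14}  {:<8}  {:.4f}".format(text, label, score))
```

Raising the threshold generally trades recall for precision, which ties back to the accuracy and recall requirements discussed earlier.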


Flair needs to be installed before running the example above. The following cell installs the latest version directly from GitHub.

In [3]:
!pip install --upgrade git+https://github.com/flairNLP/flair.git

Collecting git+https://github.com/flairNLP/flair.git
  Cloning https://github.com/flairNLP/flair.git to /tmp/pip-req-build-hq4t_nkq
  Running command git clone -q https://github.com/flairNLP/flair.git /tmp/pip-req-build-hq4t_nkq
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting ftfy
  Downloading https://files.pythonhosted.org/packages/ff/e2/3b51c53dffb1e52d9210ebc01f1fb9f2f6eba9b3201fa971fd3946643c71/ftfy-5.8.tar.gz (64kB)
Collecting huggingface-hub
  Downloading https://files.pythonhosted.org/packages/b6/81/522aaa0e08224477c7fad38546003b2b83ee7f76e4b6150977fb5d595276/huggingface_hub-0.0.1-py3-none-any.whl
Collecting sqlitedict>=1.6.0
  Downloading https://files.pythonhosted.org/packages/5c/2d/b1d99e9ad157dd7de9cd0d36a8a5876b13b55e4b75f7498bc96035fb4e96/sqlitedict-1.7.0.tar.gz
Collecting sentence

Next, install Stanza, the Stanford NLP Group's Python library.

In [5]:
!pip install stanza

Collecting stanza
  Downloading https://files.pythonhosted.org/packages/50/ae/a70a58ce6b4e2daad538688806ee0f238dbe601954582a74ea57cde6c532/stanza-1.2-py3-none-any.whl (282kB)

In [18]:
import stanza

# This downloads the English models for the neural pipeline
stanza.download('en')     

# This sets up a default neural pipeline in English
nlp = stanza.Pipeline('en') 

# Process a sentence.
doc = nlp(" The Google company leads the development of the Android mobile operating system, the Google Chrome web browser, and Chrome OS, a lightweight operating system based on the Chrome browser. Google has moved increasingly into hardware; from 2010 to 2015, it partnered with major electronics manufacturers in the production of its Nexus devices, and it released multiple hardware products in October 2016, including the Google Pixel line of smartphones, Google Home smart speaker, Google Wifi mesh wireless router, and Google Daydream virtual reality headset. Google has also experimented with becoming an Internet carrier (Google Fiber, Google Fi, and Google Station).")


# Display the text and type of entities the model found
print('\n{:<12}  {:}\n'.format('Entity', 'Type'))

# For each entity...
for entity in doc.entities:

    # Print the text and its type.
    print('{:<12}  {:}'.format(entity.text, entity.type))

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.2.0.json: 128kB [00:00, 29.8MB/s]                    
2021-01-28 02:34:35 INFO: Downloading default packages for language: en (English)...
2021-01-28 02:34:37 INFO: File exists: /root/stanza_resources/en/default.zip.
2021-01-28 02:34:42 INFO: Finished downloading models and saved to /root/stanza_resources.
2021-01-28 02:34:42 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| pos       | combined  |
| lemma     | combined  |
| depparse  | combined  |
| sentiment | sstplus   |
| ner       | ontonotes |

2021-01-28 02:34:42 INFO: Use device: cpu
2021-01-28 02:34:42 INFO: Loading: tokenize
2021-01-28 02:34:42 INFO: Loading: pos
2021-01-28 02:34:42 INFO: Loading: lemma
2021-01-28 02:34:42 INFO: Loading: depparse
2021-01-28 02:34:42 INFO: Loading: sentiment
2021-01-28 02:34:43 INFO: Loading: ner
2021-01-28 02:34:43 


Entity        Type

Google        ORG
Android       PRODUCT
Google Chrome  PRODUCT
Chrome OS     PRODUCT
Chrome        PRODUCT
Google        ORG
2010 to 2015  DATE
Nexus         ORG
October 2016  DATE
Google Pixel  PRODUCT
Google Home   PRODUCT
Google Wifi   PRODUCT
Google Daydream  PRODUCT
Google        ORG
Google Fiber  ORG
Google Fi     PRODUCT
Google Station  ORG
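Since spaCy and Stanza expose entities through slightly different attributes (`ent.label_` versus `entity.type`, as used in the cells above), it can help to reduce both to plain `(text, label)` pairs before comparing tools on the same text. The sketch below does that with two small hypothetical adapters. The `SimpleNamespace` objects are stand-ins so the example runs without either library installed; the Stanza-like entries are copied from the output above, while the spaCy-like entries are invented placeholders and not real spaCy predictions.

```python
from types import SimpleNamespace


def from_spacy(ents):
    """Convert spaCy-style entities (.text, .label_) to (text, label) pairs."""
    return [(e.text, e.label_) for e in ents]


def from_stanza(ents):
    """Convert Stanza-style entities (.text, .type) to (text, label) pairs."""
    return [(e.text, e.type) for e in ents]


# Stand-ins mimicking Stanza entities, copied from the output above.
stanza_like = [
    SimpleNamespace(text="Google", type="ORG"),
    SimpleNamespace(text="Android", type="PRODUCT"),
    SimpleNamespace(text="Google Chrome", type="PRODUCT"),
]

# Invented placeholders mimicking spaCy entities (NOT real spaCy output).
spacy_like = [
    SimpleNamespace(text="Google", label_="ORG"),
    SimpleNamespace(text="Android", label_="ORG"),
]

stanza_pairs = set(from_stanza(stanza_like))
spacy_pairs = set(from_spacy(spacy_like))

print("Agree:      ", sorted(stanza_pairs & spacy_pairs))
print("Stanza only:", sorted(stanza_pairs - spacy_pairs))
print("spaCy only: ", sorted(spacy_pairs - stanza_pairs))
```

With the real libraries, `from_spacy(doc.ents)` and `from_stanza(doc.entities)` would take the place of the stand-in lists.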
