Skip to content

Named Entity Recognition locates and classifies named entities in a sequence [ open source ]

License

Notifications You must be signed in to change notification settings

bluetickconsultants/named-entity-recognition

Repository files navigation

Named Entity Recognition



The goal of Named Entity Recognition is to locate and classify named entities in a sequence. The named entities are pre-defined categories chosen according to the use case such as names of people, organizations, places, codes, time notations, monetary values, etc. Essentially, NER aims to assign a class to each token (usually a single word) in a sequence. Because of this, NER is also referred to as token classification.

Model developing Process

For this we are using pre-trained models from simple transformers which build over Hugging face Library .

Implemetation Process

The process of performing Named Entity Recognition in Simple Transformers does not deviate from the standard pattern.

 1 . Initialize a NERModel
 2 . Train the model with train_model()
 3 . Evaluate the model with eval_model()
 4 . Make predictions on (unlabelled) data with predict()

Supported Model Types using simple transformers

      1 .     ALBERT	       = albert
      2 .     BERT	         = bert
      3 .     BERTweet       = bertweet
      4 .     BigBird	       = bigbird
      5 .     CamemBERT      = camembert
      6 .     DeBERTa	       = deberta
      7 .     DeBERTa	       = deberta
      8 .     DeBERTaV2	     = deberta-v2
      9 .     DistilBERT     = distilbert
      10 .    ELECTRA        = electra
      11 .    HerBERT        = herbert
      12 .    LayoutLM       = layoutlm
      13 .    Longformer     = longformer
      14 .    MobileBERT     = mobilebert
      15 .    MPNet          = mpnet
      16 .    RoBERTa        = roberta
      17 .    SqueezeBert    = squeezebert
      18 .    XLM            = xlm
      19 .    XLM-RoBERTa    = xlmroberta
      20 .    XLNet          = xlnet

The above models completely uses concept encoders and decoders

Data Formats

A DataFrame containing the 3 columns sentence_id, words, labels. Each value in words will have a corresponding labels value. The sentence_id determines which words belong to a given sentence. I.e. the words from the same sequence should be assigned the same unique sentence_id.

Named entity recognition depends on the lables . Model can develop in different lable format here we are using an couple of lablels ['O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim', 'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve', 'I-eve', 'I-nat']

Explanation of the labels


  • O = Outside of a named entity
  • B-MIS = Beginning of a miscellaneous entity right after another miscellaneous entity
  • I-MIS = Miscellaneous entity
  • B-PER = Beginning of a person’s name right after another person’s name
  • I-PER = Person’s name
  • B-ORG = Beginning of an organisation right after another organisation
  • I-ORG = Organisation
  • B-LOC = Beginning of a location right after another location
  • I-LOC = Location

Model used for Named entity recognition

Bert-Based-cased

Best Arguments used
           1 . Arguments.num_train_epochs = 3
           2 . Arguments.train_batch_size = 32
           3 . Arguments.eval_batch_size = 32
           4 . Arguments.learning_rate = 4e-5
           5 . Arguments.max_seq_length = 128
           6 . Arguments.adam_epsilon = 1e-8
           7 . Arguments.do_lower_case = True
           8 . Arguments.n_gpu = 1
           9 . Arguments.overwrite_output_dir = True

Follow technical report.docx file for complete explanation about models used and for dataset https://www.kaggle.com/bhallaakshit/ner-dataset

Other Projects

To view all other open source projects visit

Author

Bluetick Consultants LLP

About

Named Entity Recognition locates and classifies named entities in a sequence [ open source ]

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages