# Bio - Named Entity Recognition
Preprocess -> Extract basic word features -> POS tagging -> extract sentence level features -> Entity recognition

The goal of NER is to label each word, or group of words, in a text sequence with an entity name or with "no entity". This assumes that entities do not overlap. The BioNLP 2004 dataset (`tner/bionlp2004`) contains tokenised text sequences, each paired with a list of integer tags, one per token, where 0 signifies no entity. The integers map to tag names. Every entity tag name begins with ‘B-’ or ‘I-’, marking the first token of a new entity and a token inside an entity respectively (Jurafsky & Martin, 2021). The tokens are used as features and the tags as the labels for training the model.

To improve the performance of NER we can add further features such as part-of-speech (POS) tags. POS tagging assigns each token in a document a tag that describes its grammatical properties. Using this as an additional feature preserves some information about sentence structure and each token’s grammatical role in the sentence, and allows patterns that relate to sentence structure to be identified. This should improve performance for entities that contain multiple words with distinct POS attributes, such as nouns. Bi-grams can also be used as an additional feature. These allow the context of a token, based on its neighbours, to be preserved, which should improve recognition when an entity depends on neighbouring words, such as ‘in’ followed by an entity.

For the implementation, the data is processed by adding POS tags together with the previous and next word of each token (a similar idea to using bi-grams). A conditional random field (CRF) model is instantiated. It uses neighbouring tags and features to calculate tag probabilities and identify named entities (Griffiths & Steyvers, 2004). The model is then trained on the training data, with the tokens as input and the tag names as labels. The NLTK library is used for modelling and pre-processing: `pos_tag` for POS tagging and `CRFTagger` for the model. Punctuation, suffix and previous/next-word features are implemented in the code. For the F1 scores, a function computes precision, recall and F1 for each entity type at span level.
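
To make the B-/I- scheme concrete, the sketch below converts entity spans into BIO tags. It is only an illustration: the helper name `spans_to_bio` and the `(start, end, entity_type)` span format are assumptions, not part of the dataset or of NLTK. The example sentence and its entities are taken from the first test instance of the dataset.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, entity_type) spans into one BIO tag per token.

    `end` is exclusive and the spans are assumed not to overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        # The first token of an entity gets B-, every later token gets I-
        tags[start] = f"B-{entity_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"
    return tags


tokens = ["Number", "of", "glucocorticoid", "receptors", "in", "lymphocytes", "."]
spans = [(2, 4, "protein"), (5, 6, "cell_type")]

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token:15s} {tag}")
```

A two-token entity such as "glucocorticoid receptors" therefore becomes `B-protein` followed by `I-protein`, while every token outside an entity is tagged `O`.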

In [2]:
#Loading the data
%load_ext autoreload
%autoreload 2

# Use HuggingFace's datasets library
from datasets import load_dataset
import numpy as np

# Loading the BIONLP 2004 dataset
dataset = load_dataset(
    "tner/bionlp2004", 
    cache_dir='./data_cache'
)

print(f'The dataset is a dictionary with {len(dataset)} splits: \n\n{dataset}')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Using the latest cached version of the module from /Users/C289216/.cache/huggingface/modules/datasets_modules/datasets/tner--bionlp2004/9f41d3f0270b773c2762dee333ae36c29331e2216114a57081f77639fdb5e904 (last modified on Wed Apr  5 15:55:51 2023) since it couldn't be found locally at tner/bionlp2004., or remotely on the Hugging Face Hub.
Reusing dataset bio_nlp2004 (./data_cache/tner___bio_nlp2004/bionlp2004/1.0.0/9f41d3f0270b773c2762dee333ae36c29331e2216114a57081f77639fdb5e904)



The dataset is a dictionary with 3 splits: 

DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 16619
    })
    validation: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 1927
    })
    test: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 3856
    })
})


In [3]:
train_sentences_ner = [item['tokens'] for item in dataset['train']]
train_labels_ner = [[str(tag) for tag in item['tags']] for item in dataset['train']]

val_sentences_ner = [item['tokens'] for item in dataset['validation']]
val_labels_ner = [[str(tag) for tag in item['tags']] for item in dataset['validation']]

test_sentences_ner = [item['tokens'] for item in dataset['test']]
test_labels_ner = [[str(tag) for tag in item['tags']] for item in dataset['test']]

In [4]:
print(f'Number of training sentences = {len(train_sentences_ner)}')
print(f'Number of validation sentences = {len(val_sentences_ner)}')
print(f'Number of test sentences = {len(test_sentences_ner)}')

Number of training sentences = 16619
Number of validation sentences = 1927
Number of test sentences = 3856


In [5]:
print(f'What does one instance look like from the training set? \n\n{train_sentences_ner[234]}')
print(f'...and here is its corresponding label \n\n{train_labels_ner[234]}')

What does one instance look like from the training set? 

['Hence', ',', 'PPAR', 'can', 'positively', 'or', 'negatively', 'influence', 'TH', 'action', 'depending', 'on', 'TRE', 'structure', 'and', 'THR', 'isotype', '.']
...and here is its corresponding label 

['0', '0', '3', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '3', '4', '0']


In [6]:
print(f'Unique labels: {np.unique(np.concatenate(train_labels_ner))}')

Unique labels: ['0' '1' '10' '2' '3' '4' '5' '6' '7' '8' '9']


In [7]:
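# Note: despite the names, `id2label` maps tag name -> id string, and the inverted
# `label2id` below maps id string -> tag name. The rest of the notebook uses
# `label2id` to decode the dataset's ids into tag names.
id2label = {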
    "O": "0",
    "B-DNA": "1",
    "I-DNA": "2",
    "B-protein": "3",
    "I-protein": "4",
    "B-cell_type": "5",
    "I-cell_type": "6",
    "B-cell_line": "7",
    "I-cell_line": "8",
    "B-RNA": "9",
    "I-RNA": "10"
}

label2id = {v:k for k, v in id2label.items()}
print(label2id)

{'0': 'O', '1': 'B-DNA', '2': 'I-DNA', '3': 'B-protein', '4': 'I-protein', '5': 'B-cell_type', '6': 'I-cell_type', '7': 'B-cell_line', '8': 'I-cell_line', '9': 'B-RNA', '10': 'I-RNA'}
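
As a quick check of the mapping, the sketch below decodes the integer tags of training instance 234 (shown earlier) into tag names. The dictionary name `ID_TO_TAG` is an assumption introduced here for readability; it holds the same id-to-name pairs as the mapping printed above, but with integer keys.

```python
# Same id -> tag-name pairs as printed above, keyed by integer id
ID_TO_TAG = {
    0: "O",
    1: "B-DNA", 2: "I-DNA",
    3: "B-protein", 4: "I-protein",
    5: "B-cell_type", 6: "I-cell_type",
    7: "B-cell_line", 8: "I-cell_line",
    9: "B-RNA", 10: "I-RNA",
}

# Training instance 234 and its tags, copied from the output above
tokens = ['Hence', ',', 'PPAR', 'can', 'positively', 'or', 'negatively', 'influence',
          'TH', 'action', 'depending', 'on', 'TRE', 'structure', 'and', 'THR',
          'isotype', '.']
tag_ids = [0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3, 4, 0]

decoded = [(token, ID_TO_TAG[tag_id]) for token, tag_id in zip(tokens, tag_ids)]

# Keep only the tokens that are part of an entity
entity_tokens = [(token, tag) for token, tag in decoded if tag != "O"]
print(entity_tokens)
```

The sentence contains three entities: the proteins "PPAR" and "THR isotype", and the DNA entity "TRE".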


In [8]:
train_sentences_ner[0]

['Since',
 'HUVECs',
 'released',
 'superoxide',
 'anions',
 'in',
 'response',
 'to',
 'TNF',
 ',',
 'and',
 'H2O2',
 'induces',
 'VCAM-1',
 ',',
 'PDTC',
 'may',
 'act',
 'as',
 'a',
 'radical',
 'scavenger',
 '.']

In [9]:
# Pair each token with its tag name: one list of (word, tag) tuples per sentence.
# `label2id` maps the id strings in the dataset to tag names such as 'B-protein'.
train_set = []
for sentence, labels in zip(train_sentences_ner, train_labels_ner):
    train_set.append([(word, label2id[label]) for word, label in zip(sentence, labels)])

# Build the test set in the same (word, tag) format
test_set = []
for sentence, labels in zip(test_sentences_ner, test_labels_ner):
    test_set.append([(word, label2id[label]) for word, label in zip(sentence, labels)])


In [10]:
# replace the values of each array in test_labels_ner with the corresponding values in the label2id dictionary
test_tags = [[label2id[label] for label in labels] for labels in test_labels_ner]
test_tags[0]

['O',
 'O',
 'B-protein',
 'I-protein',
 'O',
 'B-cell_type',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O']

In [11]:
test_tokens = test_sentences_ner
test_tokens[0]

['Number',
 'of',
 'glucocorticoid',
 'receptors',
 'in',
 'lymphocytes',
 'and',
 'their',
 'sensitivity',
 'to',
 'hormone',
 'action',
 '.']

In [12]:
import nltk

# Train a CRF NER tagger
def train_CRF_NER_tagger(train_set):
    tagger = nltk.tag.CRFTagger()
    tagger.train(train_set, 'model.crf.tagger')
    return tagger  # return the trained model

tagger = train_CRF_NER_tagger(train_set)

In [None]:
predicted_tags = tagger.tag_sents(test_tokens)

In [None]:
def extract_spans(tagged_sents):
    """Collect entity spans as (start, end, sentence_index) tuples, grouped by entity type.

    `end` is exclusive. A span is closed when an 'O' tag or the end of the sentence is reached.
    """
    spans = {}

    for sidx, sent in enumerate(tagged_sents):
        start = -1
        entity_type = None
        for i, (tok, lab) in enumerate(sent):
            if lab.startswith('B-'):
                # Note: if the previous entity is still open (no 'O' in between),
                # it is overwritten here rather than saved.
                start = i
                end = i + 1
                entity_type = lab[2:]
            elif lab.startswith('I-'):
                end = i + 1
            elif lab == 'O' and start >= 0:
                spans.setdefault(entity_type, []).append((start, end, sidx))
                start = -1
        # An entity may run to the last token of the sentence, so close any open span here
        if start >= 0:
            spans.setdefault(entity_type, []).append((start, end, sidx))

    return spans


def cal_span_level_f1(test_sents, test_sents_with_pred):
    # get a list of spans from the test set labels
    gold_spans = extract_spans(test_sents)

    # get a list of spans predicted by our tagger
    pred_spans = extract_spans(test_sents_with_pred)
    
    # compute the metrics for each class:
    f1_per_class = []
    
    ne_types = gold_spans.keys()  # get the list of named entity types (not the tags)
    
    for ne_type in ne_types:
        # entity types that were never predicted have no entry in pred_spans
        pred = pred_spans.get(ne_type, [])
        gold = gold_spans[ne_type]

        # count true positives, false positives and false negatives
        true_pos = 0
        false_pos = 0

        for span in pred:
            if span in gold:
                true_pos += 1
            else:
                false_pos += 1

        false_neg = 0
        for span in gold:
            if span not in pred:
                false_neg += 1
                
        if true_pos + false_pos == 0:
            precision = 0
        else:
            precision = true_pos / float(true_pos + false_pos)
            
        if true_pos + false_neg == 0:
            recall = 0
        else:
            recall = true_pos / float(true_pos + false_neg)
        
        if precision + recall == 0:
            f1 = 0
        else:
            f1 = 2 * precision * recall / (precision + recall)
            
        f1_per_class.append(f1)
        print(f'F1 score for class {ne_type} = {f1}')
        
    print(f'Macro-average f1 score = {np.mean(f1_per_class)}')

cal_span_level_f1(test_set, predicted_tags)

F1 score for class protein = 0.6529513555085561
F1 score for class cell_type = 0.6308724832214766
F1 score for class DNA = 0.5829145728643217
F1 score for class cell_line = 0.4780793319415449
F1 score for class RNA = 0.6017699115044248
Macro-average f1 score = 0.5893175310080648
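
The span-level scoring can be illustrated on a toy example. The sketch below is a compact restatement of the same precision/recall/F1 arithmetic for a single entity type, using sets instead of lists; the function name `span_f1` and the toy spans are made up for illustration. A predicted span only counts as correct if its start, end and sentence index all match a gold span exactly.

```python
def span_f1(gold_spans, pred_spans):
    """Return (precision, recall, f1) for exact-match spans of one entity type."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_pos = len(gold & pred)  # spans that match exactly

    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy spans in the (start, end, sentence_index) format used above
gold = [(2, 4, 0), (5, 6, 0), (1, 3, 1)]
pred = [(2, 4, 0), (5, 7, 0), (1, 3, 1), (6, 7, 1)]

precision, recall, f1 = span_f1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the four predicted spans are exact matches, so precision is 2/4 and recall is 2/3, giving an F1 of 4/7. The predicted span `(5, 7, 0)` overlaps the gold span `(5, 6, 0)` but receives no partial credit.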


## Including NER Features: POS Tagging

In [None]:
import nltk
import unicodedata

class CustomCRFTagger(nltk.tag.CRFTagger):
    _current_tokens = None

    def _get_features(self, tokens, idx):
        """
        The function extracts features for a token.
        """
        token = tokens[idx]
        feature_list = []

        if not token:
            return feature_list

        # Punctuation
        punc_cat = set(["Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po"])
        if all(unicodedata.category(x) in punc_cat for x in token):
            feature_list.append("PUNCTUATION")

        # Suffix up to length 3
        if len(token) > 1:
            feature_list.append("SUF_" + token[-1:])
        if len(token) > 2:
            feature_list.append("SUF_" + token[-2:])
        if len(token) > 3:
            feature_list.append("SUF_" + token[-3:])

        # Current word
        feature_list.append("WORD_" + token)

        # Previous word
        if idx > 0:
            feature_list.append("PREVWORD_" + tokens[idx-1])
        # Next word
        if idx < len(tokens)-1:
            feature_list.append("NEXTWORD_" + tokens[idx+1])

        return feature_list


class CRFTaggerWithPOS(CustomCRFTagger):
    # Cache of the most recently POS-tagged sentence, so each sentence is tagged only once
    _current_tokens = None
    _pos_tagged_tokens = None

    def _get_features(self, tokens, index):
        """
        Extract the features for a token and append the POS tag as an additional feature.
        """
        basic_features = super()._get_features(tokens, index)

        # POS-tag the whole sentence the first time one of its tokens is seen, then reuse the result
        if tokens != self._current_tokens:
            self._pos_tagged_tokens = nltk.pos_tag(tokens)
            self._current_tokens = tokens

        # Add the POS tag as a feature, prefixed so it cannot be confused with other features
        basic_features.append("POS_" + self._pos_tagged_tokens[index][1])

        return basic_features
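
To see what these features look like for a concrete token, the sketch below restates the word-level part of the extractor as a standalone function and applies it to the start of training instance 234. The function name `token_features` is an assumption made for this illustration. The POS feature is left out so that the example runs without downloading NLTK's tagger resources.

```python
import unicodedata

# Unicode general categories that count as punctuation
PUNCT_CATEGORIES = {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po"}


def token_features(tokens, idx):
    """Standalone restatement of the word-level features (without the POS tag)."""
    token = tokens[idx]
    features = []
    if not token:
        return features

    # Token made up entirely of punctuation characters
    if all(unicodedata.category(ch) in PUNCT_CATEGORIES for ch in token):
        features.append("PUNCTUATION")

    # Suffixes of length 1 to 3, only when the token is longer than the suffix
    for n in (1, 2, 3):
        if len(token) > n:
            features.append("SUF_" + token[-n:])

    # The word itself and its immediate neighbours
    features.append("WORD_" + token)
    if idx > 0:
        features.append("PREVWORD_" + tokens[idx - 1])
    if idx < len(tokens) - 1:
        features.append("NEXTWORD_" + tokens[idx + 1])
    return features


tokens = ["Hence", ",", "PPAR", "can"]
print(token_features(tokens, 2))
# ['SUF_R', 'SUF_AR', 'SUF_PAR', 'WORD_PPAR', 'PREVWORD_,', 'NEXTWORD_can']
```

Each feature is a plain string, and the CRF learns a weight for every combination of feature string and tag. The neighbouring-word features are what give the model the bi-gram style context described in the introduction.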


In [None]:
def train_CRF_NER_tagger_with_POS(train_set):
    tagger = CRFTaggerWithPOS()
    # Save to a separate file so the baseline model trained earlier is not overwritten
    tagger.train(train_set, 'model.crf.pos.tagger')
    return tagger


# Train the model
tagger = train_CRF_NER_tagger_with_POS(train_set)

# Generate predictions
predicted_tags = tagger.tag_sents([[t[0] for t in sentence] for sentence in test_set])

# Now score the model (the function prints the per-class and macro-average F1 scores)
cal_span_level_f1(test_set, predicted_tags)

F1 score for class protein = 0.6850019033117625
F1 score for class cell_type = 0.7015918958031838
F1 score for class DNA = 0.6393606393606395
F1 score for class cell_line = 0.5413687436159346
F1 score for class RNA = 0.6206896551724138
Macro-average f1 score = 0.6376025674527868


## Evaluation 
Adding the extra features improved the F1 score for every entity type and raised the macro-average F1 from 0.589 to 0.638. The largest gains were for cell_type (0.631 to 0.702) and cell_line (0.478 to 0.541). This is consistent with the hypothesis that the grammatical information from POS tagging helps entity identification, although the second model also adds previous- and next-word features, so the gain cannot be attributed to POS tags alone. In both models ‘cell_line’ was the worst-performing entity type.

This approach to NER has given somewhat positive results. One area where the model would perform poorly is with entities that did not appear in the training data. To tackle this, a hand-crafted dictionary listing known protein, cell_type, DNA, cell_line and RNA entities could be used as an additional feature in training. This would make the model much more capable of handling entities that are outside the vocabulary of the training set. Another approach would be to use a pretrained language model such as BERT. These models have been trained on vast corpora of text and can better handle complex patterns in text.

Overall, NER is a good fit for this task because biological names are comparatively distinct, which makes them easier to identify. This solution could also be applied to similar published texts.
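
One way the dictionary idea could be turned into a feature is sketched below. This is not part of the notebook's implementation: the function name `gazetteer_features`, the `GAZ_` prefix and the two dictionary entries are all hypothetical, and a real dictionary would need to be compiled from a curated resource. The function checks whether a token belongs to any phrase of up to `max_len` tokens that appears in the dictionary.

```python
def gazetteer_features(tokens, idx, gazetteer, max_len=3):
    """Return 'GAZ_<type>' for each entity type whose dictionary contains
    a phrase (of up to max_len tokens) that covers position idx."""
    lowered = [token.lower() for token in tokens]
    features = set()
    for length in range(1, max_len + 1):
        # Every window of `length` tokens that contains position idx
        first = max(0, idx - length + 1)
        last = min(idx, len(tokens) - length)
        for start in range(first, last + 1):
            phrase = tuple(lowered[start:start + length])
            for entity_type, phrases in gazetteer.items():
                if phrase in phrases:
                    features.add("GAZ_" + entity_type)
    return sorted(features)


# Hypothetical dictionary: entity type -> set of lower-cased token tuples
gazetteer = {
    "protein": {("glucocorticoid", "receptors")},
    "cell_type": {("lymphocytes",)},
}

tokens = ["Number", "of", "glucocorticoid", "receptors", "in", "lymphocytes"]
for idx, token in enumerate(tokens):
    print(token, gazetteer_features(tokens, idx, gazetteer))
```

The returned strings could be appended to the feature list inside `_get_features`, in the same way as the POS tag. Because the feature depends on the dictionary rather than on the training vocabulary, it can fire for entity names the model never saw during training.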