<a href="https://colab.research.google.com/github/ahmedwasfey/NER-from-HTML/blob/main/notebooks/NER_Task_v0_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## installing dependencies 

In [None]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))
    !nvidia-smi

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

No GPU available, using the CPU instead.


In [None]:
!pip install sentencepiece
!git clone https://github.com/huggingface/transformers
!cd transformers && pip install .
!pip install nervaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97
Cloning into 'transformers'...
remote: Enumerating objects: 121942, done.
remote: Counting objects: 100% (339/339), done.
remote: Compressing objects: 100% (243/243), done.
remote: Total 121942 (delta 167), reused 192 (delta 72), pack-reused 121603
Receiving objects: 100% (121942/121942), 116.13 MiB | 23.55 MiB/s, done.
Resolving deltas: 100% (91156/91156), done.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing /content/transformers
  Installing build dependencies ... don

## Creating the dataset

### loading data

In [None]:
import pandas as pd
import numpy as np
# `tqdm_notebook` is deprecated; the notebook progress bar lives in `tqdm.notebook`.
from tqdm.notebook import tqdm


In [None]:
from collections import Counter
import matplotlib.pyplot as plt

reading the data from its 'txt' file


In [None]:
import json
with open(r"/content/drive/MyDrive/tahaluf/news_sample_ner.txt" , 'r') as fb:
  data = fb.read()
print(len(data), len(data.split()))

64225 7881


### EDA on the Data

cleaning the data using regex

In [None]:
import re
CLEAN_PATTERN = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

raw_data = re.sub(CLEAN_PATTERN, "", data)
len(raw_data), raw_data[100:500]

(42902,
 "ATED: daylight hours.\n\n\n   Shortly after Fossett's launching Monday his competitors sent\nhim telegrams of congratulation.\n\n   The British balloon, called the Virgin Global Challenger, is to\nbe flown by Richard Branson, chairman of Virgin Atlantic Airways;\nPer Lindstrand, chairman of Lindstrand Balloons Ltd. of Oswestry,\nEngland, and an Irish balloonist, Rory McCarthy.\n\n   Branson and Lindstrand, w")



Cleaning the text this way causes the following problems:

*   we can no longer separate the text into paragraphs or sentences
*   so we cannot extract the annotations for each sentence/paragraph
*   we keep unnecessary metadata for every article, which we can consider noise
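The first two problems can be seen on a small example. The sketch below uses a hypothetical two-paragraph snippet written in the same markup as the corpus (the snippet itself is not taken from the data file); it only illustrates that stripping every tag at once discards both the `<p>` boundaries and the entity labels.

```python
import re

CLEAN_PATTERN = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

# Hypothetical snippet in the corpus markup (an assumption, not real data).
sample = (
    '<TEXT>\n<p>\n   First paragraph about <ENAMEX TYPE="PERSON">Fossett</ENAMEX>.\n'
    '<p>\n   Second paragraph about <ENAMEX TYPE="LOCATION">England</ENAMEX>.\n</TEXT>'
)

# Stripping all tags at once leaves one blob without paragraph markers or labels.
stripped = CLEAN_PATTERN.sub("", sample)

# Splitting on <p> before stripping keeps the paragraph boundaries.
paragraphs = [p.strip() for p in sample.split("<p>")[1:]]

print(repr(stripped))
print(len(paragraphs))
```

This is why the later cells split on the tags first and only then remove the markup.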



experiments for extracting the entities 

In [None]:
for matching in re.findall("<ENAMEX TYPE=\".+?\">.+?</ENAMEX>?", data)[:5]:
  # print(matching)
  print(re.findall("\".+\"", matching)[0][1:-1], re.findall(">.+<", matching)[0][1:-1])

PERSON Fossett
ORGANIZATION Virgin
PERSON Richard Branson
ORGANIZATION Virgin Atlantic Airways
PERSON Per Lindstrand


making sure the entities are extracted correctly: the `>?` and `>*` variants of the closing-tag pattern should return the same matches

In [None]:
re.findall("<ENAMEX TYPE=\".+?\">.+?</ENAMEX>?", data) == re.findall("<ENAMEX TYPE=\".+?\">.+?</ENAMEX>*", data)

True

We cannot split the data on dots, because many abbreviations inside a sentence also end with a dot.

In [None]:
# Raw string so that `\.` is passed to the regex engine as a literal dot.
re.findall(r"[A-Z][a-z]+\.", data)

['Ltd.',
 'Dec.',
 'Ariz.',
 'Del.',
 'Calif.',
 'Mo.',
 'Mass.',
 'Pa.',
 'Del.',
 'Co.',
 'St.',
 'Pa.',
 'Feb.',
 'Co.',
 'Dominicans.',
 'Corp.',
 'Dec.',
 'Germans.',
 'Gen.',
 'Maj.',
 'Dominican.',
 'Md.',
 'Capt.',
 'Dec.',
 'Inc.',
 'Feb.',
 'Co.',
 'Tomcat.',
 'Jan.',
 'Feb.',
 'Adm.',
 'Corp.',
 'Feb.',
 'Maj.']

We cannot use dots to separate the sentences, or even `\n`, because lines also break in the middle of a sentence.

In [None]:
for s in raw_data.split(".")[:50]:print(s,'\n')
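A minimal sketch of the problem, using one sentence from the corpus with the markup removed by hand: splitting on dots cuts the sentence after the abbreviation `Ltd.`, so a single sentence comes back as two fragments.

```python
# One sentence from the corpus, with the markup removed by hand.
sentence = (
    "Per Lindstrand, chairman of Lindstrand Balloons Ltd. of Oswestry, "
    "England, and an Irish balloonist, Rory McCarthy."
)

# Naive sentence splitting on dots.
pieces = [p.strip() for p in sentence.split(".") if p.strip()]

for piece in pieces:
    print(repr(piece))
```

The first fragment ends at `Ltd`, which is not a sentence boundary.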

all the text of the articles is in \<PREAMBLE\> and \<TEXT\> headers

In [None]:
body_pattern = re.compile(r"<PREAMBLE>")#re.compile(r"<p>.*\n*.+\n*.*\.<")
re.findall(body_pattern, data)

['<PREAMBLE>',
 '<PREAMBLE>',
 '<PREAMBLE>',
 '<PREAMBLE>',
 '<PREAMBLE>',
 '<PREAMBLE>',
 '<PREAMBLE>',
 '<PREAMBLE>',
 '<PREAMBLE>',
 '<PREAMBLE>']

### extracting raw text 

sample of the data

In [None]:
data[:300]

'<DOC>\n<DOCID> nyt960108.0493 </DOCID>\n<STORYID cat=a pri=u> A5852 </STORYID>\n<SLUG fv=sci-z> BC-BALLOON-RACE-2ndTAKE- </SLUG>\n<DATE> <TIMEX TYPE="DATE">01-08</TIMEX> </DATE>\n<NWORDS> 0745 </NWORDS>\n<PREAMBLE>\nBC-BALLOON-RACE-2ndTAKE-NYT\nUNDATED: daylight hours.</PREAMBLE>\n<TEXT>\n<p>\n   Shortly after'

The body of each article sits in the \<PREAMBLE\> and \<TEXT\> sections.
Inside \<TEXT\>, every paragraph starts with a \<p\> tag.

In [None]:
raw_data =[]
for body in data.split("<PREAMBLE>"):
  if "</PREAMBLE>" in body  :
    for segment in body.split("</PREAMBLE>"):
      if "<TEXT>" in segment :
        for paragraph in segment.split("<p>"):
          raw_data.append(paragraph)
      else :
        raw_data.append(segment)
raw_data[:5]

['\nBC-BALLOON-RACE-2ndTAKE-NYT\nUNDATED: daylight hours.',
 '\n<TEXT>\n',
 '\n   Shortly after <ENAMEX TYPE="PERSON">Fossett</ENAMEX>\'s launching <TIMEX TYPE="DATE">Monday</TIMEX> his competitors sent\nhim telegrams of congratulation.\n',
 '\n   The British balloon, called the <ENAMEX TYPE="ORGANIZATION">Virgin</ENAMEX> Global Challenger, is to\nbe flown by <ENAMEX TYPE="PERSON">Richard Branson</ENAMEX>, chairman of <ENAMEX TYPE="ORGANIZATION">Virgin Atlantic Airways</ENAMEX>;\n<ENAMEX TYPE="PERSON">Per Lindstrand</ENAMEX>, chairman of <ENAMEX TYPE="ORGANIZATION">Lindstrand Balloons Ltd.</ENAMEX> of <ENAMEX TYPE="LOCATION">Oswestry</ENAMEX>,\n<ENAMEX TYPE="LOCATION">England</ENAMEX>, and an Irish balloonist, <ENAMEX TYPE="PERSON">Rory McCarthy</ENAMEX>.\n',
 '\n   <ENAMEX TYPE="PERSON">Branson</ENAMEX> and <ENAMEX TYPE="PERSON">Lindstrand</ENAMEX>, who have set several ballooning records,\nwere the first pilots of hot-air balloons to cross both the\n<ENAMEX TYPE="LOCATION">Atlantic O

saving the raw clean text to a txt file 
* requirement #1

In [None]:
import re

def remove_html(text):
    # Replace tags and HTML character entities with a space.
    html_pattern = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return html_pattern.sub(' ', text)

def clean_text(text):
  text = remove_html(text)
  text = text.replace("\n", " ").strip()
  return text

html_clean_raw_data = []
with open(r"/content/drive/MyDrive/tahaluf/html_free_data.txt", "w") as fp:
  for element in raw_data:
    clean_element = clean_text(element)
    # Skip empty segments; writing outside this check would repeat the previous line.
    if clean_element != "":
      html_clean_raw_data.append(clean_element)
      fp.write(clean_element + "\n")
len(html_clean_raw_data)

183

### extracting annotated text

This cell extracts the annotated entities for every paragraph in the following format:
```
'entities': [{'label': 'PER', 'word': 'Henk', 'start': 24, 'end': 28},
   {'label': 'PER', 'word': 'Brink', 'start': 29, 'end': 34},
   {'label': 'ORG', 'word': 'Unicef', 'start': 75, 'end': 81},
   ]
```
Side note: I first tried keeping each multi-word entity unsplit (for example, treating `Virgin Atlantic Airways` as a single token with a single label rather than 3 tokens with 3 labels), but I think splitting each entity into words gives a more accurate evaluation.
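To make the word-level format concrete, here is a minimal sketch of the splitting step on its own. The helper name `split_span_into_words` is hypothetical, and the sketch assumes the words inside an entity are separated by single spaces; the offsets for `Virgin Atlantic Airways` are the ones that appear in the paragraph shown further down.

```python
def split_span_into_words(label, text, start):
    """Split one entity span into word-level entities with character offsets.

    Assumes the words inside the span are separated by single spaces.
    """
    entities = []
    offset = start
    for word in text.split(" "):
        entities.append({"label": label, "word": word,
                         "start": offset, "end": offset + len(word)})
        offset += len(word) + 1  # move past the word and the single space after it
    return entities

word_entities = split_span_into_words("ORG", "Virgin Atlantic Airways", 110)
for entity in word_entities:
    print(entity)
```

Each word carries the label of the whole span, so a partially recognised entity still earns credit for the words that were found.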





In [None]:
annotated_data =[]
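def extract_entities(text, clean_paragraph):
  # `clean_paragraph` is the tag-free text; the name avoids shadowing the clean_text() function.
  entities = []
  for expression_matching in re.findall(r'<ENAMEX TYPE=".+?">.+?</ENAMEX>?', text):
    entity_sentence =  re.findall(">.+<", expression_matching)[0][1:-1]
    # Escape the entity text so characters such as "." in "Ltd." are matched literally.
    matches = list(re.finditer(re.escape(entity_sentence), clean_paragraph))
    words = entity_sentence.split()
    # Keep the first three letters of the type: PERSON -> PER, ORGANIZATION -> ORG, LOCATION -> LOC.
    label = re.findall("\".+\"", expression_matching)[0][1:-1][:3]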
    # label_type ="B-"
    for matching in matches :
      prev_word = ""
      for word in words : 
        start = matching.span()[0]+len(prev_word)
        # print(start, word)
        entity = {"label": label , 
            "word":word ,
            "start" :start  ,
            "end" : start+len(word)
            }
        prev_word += word +" "
        if entity not in entities :
          entities.append(entity)
  return entities
def format_paragraph_entities(segment):
  # print(segment)
  clean_segment = clean_text(segment)
  if clean_segment =="":
    return None
  return { 
      "paragraph":clean_segment, 
      "entities": extract_entities(segment , clean_segment)
  }

for segment in raw_data :
  annotated_element= format_paragraph_entities(segment)
  if annotated_element is not None :  annotated_data.append(annotated_element)
  # print(annotated_data[-1])
  # print([ annotated_data[-1]["paragraph"][x["start"]:x["end"]] for x in annotated_data[-1]["entities"]])




In [None]:
annotated_data[:20]

[{'paragraph': 'BC-BALLOON-RACE-2ndTAKE-NYT UNDATED: daylight hours.',
  'entities': []},
 {'paragraph': "Shortly after  Fossett 's launching  Monday  his competitors sent him telegrams of congratulation.",
  'entities': [{'label': 'PER', 'word': 'Fossett', 'start': 15, 'end': 22}]},
 {'paragraph': 'The British balloon, called the  Virgin  Global Challenger, is to be flown by  Richard Branson , chairman of  Virgin Atlantic Airways ;  Per Lindstrand , chairman of  Lindstrand Balloons Ltd.  of  Oswestry ,  England , and an Irish balloonist,  Rory McCarthy .',
  'entities': [{'label': 'ORG', 'word': 'Virgin', 'start': 33, 'end': 39},
   {'label': 'ORG', 'word': 'Virgin', 'start': 110, 'end': 116},
   {'label': 'PER', 'word': 'Richard', 'start': 79, 'end': 86},
   {'label': 'PER', 'word': 'Branson', 'start': 87, 'end': 94},
   {'label': 'ORG', 'word': 'Atlantic', 'start': 117, 'end': 125},
   {'label': 'ORG', 'word': 'Airways', 'start': 126, 'end': 133},
   {'label': 'PER', 'word': 'Per', 

In [None]:
i = 4
s , e = annotated_data[i]["entities"][0]["start"] , annotated_data[i]["entities"][0]["end"]
print(annotated_data[i])
annotated_data[i]["paragraph"][s:e]


{'paragraph': 'Lindstrand  said that because of unfavorable weather patterns over  England  he and his colleagues had decided to launch their  Virgin  Global Challenger from a military airfield at  Marrakech ,  Morocco .', 'entities': [{'label': 'PER', 'word': 'Lindstrand', 'start': 0, 'end': 10}, {'label': 'LOC', 'word': 'England', 'start': 68, 'end': 75}, {'label': 'ORG', 'word': 'Virgin', 'start': 128, 'end': 134}, {'label': 'LOC', 'word': 'Marrakech', 'start': 183, 'end': 192}, {'label': 'LOC', 'word': 'Morocco', 'start': 196, 'end': 203}]}


'Lindstrand'
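The spot check above can be extended to every entity. The sketch below is one possible way to do it; the helper `misaligned_entities` is hypothetical, and the record is the second paragraph of `annotated_data` copied from the output above, plus a deliberately shifted copy to show what a failure looks like.

```python
def misaligned_entities(record):
    """Return the entities whose offsets do not slice back to their word."""
    paragraph = record["paragraph"]
    return [e for e in record["entities"]
            if paragraph[e["start"]:e["end"]] != e["word"]]

record = {
    "paragraph": "Shortly after  Fossett 's launching  Monday  his competitors "
                 "sent him telegrams of congratulation.",
    "entities": [{"label": "PER", "word": "Fossett", "start": 15, "end": 22}],
}

# Same paragraph with the offsets shifted by one character on purpose.
shifted = {
    "paragraph": record["paragraph"],
    "entities": [{"label": "PER", "word": "Fossett", "start": 14, "end": 21}],
}

print(misaligned_entities(record))
print(misaligned_entities(shifted))
```

Running such a check over all of `annotated_data` before evaluation would catch offset bugs early, since the evaluation compares spans by their `start` and `end` values.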

## Transformer based NER

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)


[{'entity': 'B-PER', 'score': 0.9990139, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.999645, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]


In [None]:
given_tags= ["PER" , "ORG", "LOC"] #[
#     "B-PER",  
#     "I-PER",  
#     "B-LOC" ,
#     "I-LOC" ,
#     "B-ORG" ,
#     "I-ORG" 
# ]

### post-processing: formatting the Hugging Face pipeline output into the same entity format used above



```
[{'word': 'British', 'label': 'MISC', 'start': 4, 'end': 11},
  {'word': 'Virgin', 'label': 'MISC', 'start': 33, 'end': 39},
  {'word': 'Global', 'label': 'MISC', 'start': 41, 'end': 47}]
```



In [None]:
def predict_paragraph_transformers(element):
  """Run the NER pipeline and merge WordPiece sub-tokens ("##...") into whole words."""
  results = nlp(element)
  post_processed_results = []
  word = ""
  entity = ""
  start, end = 0, 0

  def store_current_word():
    # Keep the word only if its label, without the B-/I- prefix, is one we evaluate.
    if word != "" and entity[2:] in given_tags:
      post_processed_results.append({
          "word": word,
          "label": entity[2:],
          "start": start,
          "end": end})

  for result in results:
    if result["word"].startswith("##"):
      # Continuation piece: append it to the current word and extend its span.
      word += result["word"][2:]
      end = start + len(word)
    else:
      # A new word starts: store the finished one, then reset the state.
      store_current_word()
      word = result["word"]
      entity = result["entity"]
      start = result["start"]
      end = result["end"]
  # Store the last word, including the case where the pipeline returned a single token.
  store_current_word()
  return post_processed_results
predict_paragraph_transformers(annotated_data[2]["paragraph"]) , annotated_data[2]

([{'word': 'Richard', 'label': 'PER', 'start': 79, 'end': 86},
  {'word': 'Branson', 'label': 'PER', 'start': 87, 'end': 94},
  {'word': 'Virgin', 'label': 'ORG', 'start': 110, 'end': 116},
  {'word': 'Atlantic', 'label': 'ORG', 'start': 117, 'end': 125},
  {'word': 'Airways', 'label': 'ORG', 'start': 126, 'end': 133},
  {'word': 'Per', 'label': 'PER', 'start': 137, 'end': 140},
  {'word': 'Lindstrand', 'label': 'PER', 'start': 141, 'end': 151},
  {'word': 'Lindstrand', 'label': 'ORG', 'start': 167, 'end': 177},
  {'word': 'Balloons', 'label': 'ORG', 'start': 178, 'end': 186},
  {'word': 'Ltd', 'label': 'ORG', 'start': 187, 'end': 190},
  {'word': 'Oswestry', 'label': 'LOC', 'start': 197, 'end': 205},
  {'word': 'England', 'label': 'LOC', 'start': 209, 'end': 216},
  {'word': 'Rory', 'label': 'PER', 'start': 245, 'end': 249},
  {'word': 'McCarthy', 'label': 'PER', 'start': 250, 'end': 258}],
 {'paragraph': 'The British balloon, called the  Virgin  Global Challenger, is to be flown by  
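The sub-token merging can be tried without loading the model. The sketch below runs the same idea on a hand-written token list; the way `Oswestry` is split into pieces here is an assumption for illustration, not the tokenizer's actual output, and `merge_wordpieces` is a hypothetical helper.

```python
def merge_wordpieces(tokens):
    """Merge WordPiece continuation tokens ("##...") into the preceding word."""
    merged = []
    for token in tokens:
        if token["word"].startswith("##") and merged:
            # Continuation piece: extend the previous word and its end offset.
            merged[-1]["word"] += token["word"][2:]
            merged[-1]["end"] = token["end"]
        else:
            merged.append({"word": token["word"],
                           "label": token["entity"][2:],  # drop the B-/I- prefix
                           "start": token["start"],
                           "end": token["end"]})
    return merged

# Hand-written tokens in the shape returned by the "ner" pipeline (assumed split).
tokens = [
    {"word": "Os", "entity": "B-LOC", "start": 197, "end": 199},
    {"word": "##west", "entity": "I-LOC", "start": 199, "end": 203},
    {"word": "##ry", "entity": "I-LOC", "start": 203, "end": 205},
    {"word": "England", "entity": "B-LOC", "start": 209, "end": 216},
]

merged = merge_wordpieces(tokens)
print(merged)
```

The `transformers` token-classification pipeline also accepts an `aggregation_strategy` argument that groups sub-tokens itself; it groups whole entities rather than single words, so it may not match the word-level format used here without further splitting.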

### now we are ready to predict the entities for all the paragraphs we have 

In [None]:
for idx , paragraph in enumerate(annotated_data) :
  annotated_data[idx]["predicted_entities"] = predict_paragraph_transformers(paragraph["paragraph"])

## Transformer Evaluation 
Using https://github.com/MantisAI/nervaluate to produce a more detailed classification report.

In [None]:
true = [x["entities"] for x in annotated_data]
pred = [x["predicted_entities"] for x in annotated_data]
from nervaluate import Evaluator

evaluator = Evaluator(true, pred, tags=given_tags)

# Returns overall metrics and metrics for each tag

results, results_per_tag = evaluator.evaluate()

results

{'ent_type': {'correct': 545,
  'incorrect': 30,
  'partial': 0,
  'missed': 84,
  'spurious': 147,
  'possible': 659,
  'actual': 722,
  'precision': 0.7548476454293629,
  'recall': 0.8270106221547799,
  'f1': 0.7892831281679942},
 'partial': {'correct': 517,
  'incorrect': 0,
  'partial': 58,
  'missed': 84,
  'spurious': 147,
  'possible': 659,
  'actual': 722,
  'precision': 0.7562326869806094,
  'recall': 0.8285280728376327,
  'f1': 0.7907313540912382},
 'strict': {'correct': 502,
  'incorrect': 73,
  'partial': 0,
  'missed': 84,
  'spurious': 147,
  'possible': 659,
  'actual': 722,
  'precision': 0.6952908587257618,
  'recall': 0.7617602427921093,
  'f1': 0.7270094134685011},
 'exact': {'correct': 517,
  'incorrect': 58,
  'partial': 0,
  'missed': 84,
  'spurious': 147,
  'possible': 659,
  'actual': 722,
  'precision': 0.7160664819944599,
  'recall': 0.7845220030349014,
  'f1': 0.7487328023171615}}
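The precision, recall and F1 values follow from the five counts in each block. The sketch below is an assumption about the arithmetic (a partial match counted as half a correct one in the `partial` scheme), checked against the numbers printed above rather than taken from the library's source; the helper `scores` is hypothetical.

```python
def scores(counts, partial_weight=0.0):
    """Recompute precision, recall and F1 from nervaluate-style counts."""
    # Gold entities: everything that should have been found.
    possible = (counts["correct"] + counts["incorrect"]
                + counts["partial"] + counts["missed"])
    # Predicted entities: everything the model produced.
    actual = (counts["correct"] + counts["incorrect"]
              + counts["partial"] + counts["spurious"])
    credit = counts["correct"] + partial_weight * counts["partial"]
    precision = credit / actual
    recall = credit / possible
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts copied from the `ent_type` and `partial` blocks above.
ent_type = {"correct": 545, "incorrect": 30, "partial": 0, "missed": 84, "spurious": 147}
partial = {"correct": 517, "incorrect": 0, "partial": 58, "missed": 84, "spurious": 147}

print(scores(ent_type))
print(scores(partial, partial_weight=0.5))
```

With these counts, `possible` is 659 and `actual` is 722 in both schemes, which matches the report.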

#### results for the experiment before splitting each entity into its separate words 
In that experiment an entity such as `john smith` was treated as a single entity and a single token, not as two tokens/words.

{'ent_type': {'correct': 46,
  'incorrect': 0,
  'partial': 0,
  'missed': 7,
  'spurious': 7,
  'possible': 53,
  'actual': 53,
  'precision': 0.8679245283018868,
  'recall': 0.8679245283018868,
  'f1': 0.8679245283018869},
 'partial': {'correct': 37,
  'incorrect': 0,
  'partial': 9,
  'missed': 7,
  'spurious': 7,
  'possible': 53,
  'actual': 53,
  'precision': 0.7830188679245284,
  'recall': 0.7830188679245284,
  'f1': 0.7830188679245284},
 'strict': {'correct': 37,
  'incorrect': 9,
  'partial': 0,
  'missed': 7,
  'spurious': 7,
  'possible': 53,
  'actual': 53,
  'precision': 0.6981132075471698,
  'recall': 0.6981132075471698,
  'f1': 0.6981132075471698},
 'exact': {'correct': 37,
  'incorrect': 9,
  'partial': 0,
  'missed': 7,
  'spurious': 7,
  'possible': 53,
  'actual': 53,
  'precision': 0.6981132075471698,
  'recall': 0.6981132075471698,
  'f1': 0.6981132075471698}}

### results per tag

In [None]:
results_per_tag

{'PER': {'ent_type': {'correct': 120,
   'incorrect': 10,
   'partial': 0,
   'missed': 24,
   'spurious': 16,
   'possible': 154,
   'actual': 146,
   'precision': 0.821917808219178,
   'recall': 0.7792207792207793,
   'f1': 0.7999999999999999},
  'partial': {'correct': 110,
   'incorrect': 0,
   'partial': 20,
   'missed': 24,
   'spurious': 16,
   'possible': 154,
   'actual': 146,
   'precision': 0.821917808219178,
   'recall': 0.7792207792207793,
   'f1': 0.7999999999999999},
  'strict': {'correct': 109,
   'incorrect': 21,
   'partial': 0,
   'missed': 24,
   'spurious': 16,
   'possible': 154,
   'actual': 146,
   'precision': 0.7465753424657534,
   'recall': 0.7077922077922078,
   'f1': 0.7266666666666666},
  'exact': {'correct': 110,
   'incorrect': 20,
   'partial': 0,
   'missed': 24,
   'spurious': 16,
   'possible': 154,
   'actual': 146,
   'precision': 0.7534246575342466,
   'recall': 0.7142857142857143,
   'f1': 0.7333333333333334}},
 'ORG': {'ent_type': {'correct': 236

#### results per tag also without splitting 

{'PERSON': {'ent_type': {'correct': 24,
   'incorrect': 0,
   'partial': 0,
   'missed': 2,
   'spurious': 3,
   'possible': 26,
   'actual': 27,
   'precision': 0.8888888888888888,
   'recall': 0.9230769230769231,
   'f1': 0.9056603773584906},
  'partial': {'correct': 20,
   'incorrect': 0,
   'partial': 4,
   'missed': 2,
   'spurious': 3,
   'possible': 26,
   'actual': 27,
   'precision': 0.8148148148148148,
   'recall': 0.8461538461538461,
   'f1': 0.830188679245283},
  'strict': {'correct': 20,
   'incorrect': 4,
   'partial': 0,
   'missed': 2,
   'spurious': 3,
   'possible': 26,
   'actual': 27,
   'precision': 0.7407407407407407,
   'recall': 0.7692307692307693,
   'f1': 0.7547169811320754},
  'exact': {'correct': 20,
   'incorrect': 4,
   'partial': 0,
   'missed': 2,
   'spurious': 3,
   'possible': 26,
   'actual': 27,
   'precision': 0.7407407407407407,
   'recall': 0.7692307692307693,
   'f1': 0.7547169811320754}},
 'LOCATION': {'ent_type': {'correct': 14,
   'incorrect': 0,
   'partial': 0,
   'missed': 0,
   'spurious': 3,
   'possible': 14,
   'actual': 17,
   'precision': 0.8235294117647058,
   'recall': 1.0,
   'f1': 0.9032258064516129},
  'partial': {'correct': 12,
   'incorrect': 0,
   'partial': 2,
   'missed': 0,
   'spurious': 3,
   'possible': 14,
   'actual': 17,
   'precision': 0.7647058823529411,
   'recall': 0.9285714285714286,
   'f1': 0.8387096774193549},
  'strict': {'correct': 12,
   'incorrect': 2,
   'partial': 0,
   'missed': 0,
   'spurious': 3,
   'possible': 14,
   'actual': 17,
   'precision': 0.7058823529411765,
   'recall': 0.8571428571428571,
   'f1': 0.7741935483870968},
  'exact': {'correct': 12,
   'incorrect': 2,
   'partial': 0,
   'missed': 0,
   'spurious': 3,
   'possible': 14,
   'actual': 17,
   'precision': 0.7058823529411765,
   'recall': 0.8571428571428571,
   'f1': 0.7741935483870968}},
 'ORGANIZATION': {'ent_type': {'correct': 8,
   'incorrect': 0,
   'partial': 0,
   'missed': 5,
   'spurious': 1,
   'possible': 13,
   'actual': 9,
   'precision': 0.8888888888888888,
   'recall': 0.6153846153846154,
   'f1': 0.7272727272727274},
  'partial': {'correct': 5,
   'incorrect': 0,
   'partial': 3,
   'missed': 5,
   'spurious': 1,
   'possible': 13,
   'actual': 9,
   'precision': 0.7222222222222222,
   'recall': 0.5,
   'f1': 0.5909090909090908},
  'strict': {'correct': 5,
   'incorrect': 3,
   'partial': 0,
   'missed': 5,
   'spurious': 1,
   'possible': 13,
   'actual': 9,
   'precision': 0.5555555555555556,
   'recall': 0.38461538461538464,
   'f1': 0.4545454545454546},
  'exact': {'correct': 5,
   'incorrect': 3,
   'partial': 0,
   'missed': 5,
   'spurious': 1,
   'possible': 13,
   'actual': 9,
   'precision': 0.5555555555555556,
   'recall': 0.38461538461538464,
   'f1': 0.4545454545454546}}}

## Statistical Models Based NER 

In [None]:
import spacy

statistical_nlp = spacy.load("en_core_web_sm")
doc = statistical_nlp(annotated_data[2]["paragraph"])

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
annotated_data[2]['entities']

British 4 11 NORP
Richard Branson 79 94 PERSON
Virgin Atlantic Airways 110 133 ORG
Lindstrand 141 151 FAC
Lindstrand Balloons Ltd. 167 191 ORG
England 209 216 GPE
Irish 226 231 NORP
Rory McCarthy 245 258 PERSON


[{'label': 'ORG', 'word': 'Virgin', 'start': 33, 'end': 39},
 {'label': 'ORG', 'word': 'Virgin', 'start': 110, 'end': 116},
 {'label': 'PER', 'word': 'Richard', 'start': 79, 'end': 86},
 {'label': 'PER', 'word': 'Branson', 'start': 87, 'end': 94},
 {'label': 'ORG', 'word': 'Atlantic', 'start': 117, 'end': 125},
 {'label': 'ORG', 'word': 'Airways', 'start': 126, 'end': 133},
 {'label': 'PER', 'word': 'Per', 'start': 137, 'end': 140},
 {'label': 'PER', 'word': 'Lindstrand', 'start': 141, 'end': 151},
 {'label': 'ORG', 'word': 'Lindstrand', 'start': 167, 'end': 177},
 {'label': 'ORG', 'word': 'Balloons', 'start': 178, 'end': 186},
 {'label': 'ORG', 'word': 'Ltd.', 'start': 187, 'end': 191},
 {'label': 'LOC', 'word': 'Oswestry', 'start': 197, 'end': 205},
 {'label': 'LOC', 'word': 'England', 'start': 209, 'end': 216},
 {'label': 'PER', 'word': 'Rory', 'start': 245, 'end': 249},
 {'label': 'PER', 'word': 'McCarthy', 'start': 250, 'end': 258}]

In [None]:
map_tags ={
    "GPE": "LOC",
    "PERSON" :"PER"
}
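spaCy's label set is larger than the three tags evaluated here, so each label is either renamed or dropped. A minimal sketch of that decision, assuming the same `map_tags` and `given_tags` values as in this notebook; the helper `normalise_label` is hypothetical.

```python
given_tags = ["PER", "ORG", "LOC"]
map_tags = {"GPE": "LOC", "PERSON": "PER"}

def normalise_label(spacy_label):
    """Map a spaCy label onto the evaluated tag set, or return None to drop it."""
    label = map_tags.get(spacy_label, spacy_label)
    return label if label in given_tags else None

# Labels seen in the spaCy output above.
for spacy_label in ["NORP", "PERSON", "ORG", "FAC", "GPE"]:
    print(spacy_label, "->", normalise_label(spacy_label))
```

Labels such as `NORP` and `FAC` have no counterpart in the gold annotations, so they are discarded rather than counted as spurious predictions.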

Convert the spaCy output to the same format we decided to use above.

In [None]:
for idx , element in enumerate(annotated_data) :
  entities =[]
  for ent in statistical_nlp(element['paragraph']).ents :
    prev = ""
    entity = ent.label_
    if entity in map_tags : entity = map_tags[ent.label_]
    for word in ent.text.split(): 
      if entity in given_tags :entities.append({
          "word" : word , 
          "label" : entity , 
          "start" : ent.start_char + len(prev),
          "end" : ent.start_char + len(prev) + len(word)
      })    
      prev += word + " "
  annotated_data[idx]["stats_predicted_entities"]= entities


In [None]:
annotated_data[2]

{'paragraph': 'The British balloon, called the  Virgin  Global Challenger, is to be flown by  Richard Branson , chairman of  Virgin Atlantic Airways ;  Per Lindstrand , chairman of  Lindstrand Balloons Ltd.  of  Oswestry ,  England , and an Irish balloonist,  Rory McCarthy .',
 'entities': [{'label': 'ORG', 'word': 'Virgin', 'start': 33, 'end': 39},
  {'label': 'ORG', 'word': 'Virgin', 'start': 110, 'end': 116},
  {'label': 'PER', 'word': 'Richard', 'start': 79, 'end': 86},
  {'label': 'PER', 'word': 'Branson', 'start': 87, 'end': 94},
  {'label': 'ORG', 'word': 'Atlantic', 'start': 117, 'end': 125},
  {'label': 'ORG', 'word': 'Airways', 'start': 126, 'end': 133},
  {'label': 'PER', 'word': 'Per', 'start': 137, 'end': 140},
  {'label': 'PER', 'word': 'Lindstrand', 'start': 141, 'end': 151},
  {'label': 'ORG', 'word': 'Lindstrand', 'start': 167, 'end': 177},
  {'label': 'ORG', 'word': 'Balloons', 'start': 178, 'end': 186},
  {'label': 'ORG', 'word': 'Ltd.', 'start': 187, 'end': 191},
  

## Statistical Method Evaluation

In [None]:
stat_pred = [x["stats_predicted_entities"] for x in annotated_data]

evaluator = Evaluator(true, stat_pred, tags=given_tags)

# Returns overall metrics and metrics for each tag

results, results_per_tag = evaluator.evaluate()

results

{'ent_type': {'correct': 541,
  'incorrect': 51,
  'partial': 0,
  'missed': 67,
  'spurious': 125,
  'possible': 659,
  'actual': 717,
  'precision': 0.7545327754532776,
  'recall': 0.8209408194233687,
  'f1': 0.7863372093023256},
 'partial': {'correct': 561,
  'incorrect': 0,
  'partial': 31,
  'missed': 67,
  'spurious': 125,
  'possible': 659,
  'actual': 717,
  'precision': 0.8040446304044631,
  'recall': 0.8748103186646434,
  'f1': 0.8379360465116279},
 'strict': {'correct': 511,
  'incorrect': 81,
  'partial': 0,
  'missed': 67,
  'spurious': 125,
  'possible': 659,
  'actual': 717,
  'precision': 0.7126917712691772,
  'recall': 0.7754172989377845,
  'f1': 0.7427325581395349},
 'exact': {'correct': 561,
  'incorrect': 31,
  'partial': 0,
  'missed': 67,
  'spurious': 125,
  'possible': 659,
  'actual': 717,
  'precision': 0.7824267782426778,
  'recall': 0.8512898330804249,
  'f1': 0.815406976744186}}
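The two models can be compared side by side. The sketch below copies the overall F1 values from the two `results` outputs in this notebook and prints which model scores higher under each scheme; it is a reading aid, not an additional experiment.

```python
# Overall F1 values copied from the two `results` outputs in this notebook.
f1_scores = {
    "bert-base-NER": {"ent_type": 0.7892831281679942, "partial": 0.7907313540912382,
                      "strict": 0.7270094134685011, "exact": 0.7487328023171615},
    "en_core_web_sm": {"ent_type": 0.7863372093023256, "partial": 0.8379360465116279,
                       "strict": 0.7427325581395349, "exact": 0.815406976744186},
}

best_per_scheme = {}
for scheme in ["ent_type", "partial", "strict", "exact"]:
    # Pick the model with the highest F1 for this evaluation scheme.
    best_per_scheme[scheme] = max(f1_scores, key=lambda name: f1_scores[name][scheme])
    row = "  ".join(f"{name}={f1_scores[name][scheme]:.3f}" for name in f1_scores)
    print(f"{scheme:<9} {row}  best={best_per_scheme[scheme]}")
```

On this sample the transformer is ahead only on `ent_type`, and by a small margin.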

In [None]:
results_per_tag

{'PER': {'ent_type': {'correct': 115,
   'incorrect': 21,
   'partial': 0,
   'missed': 18,
   'spurious': 27,
   'possible': 154,
   'actual': 163,
   'precision': 0.7055214723926381,
   'recall': 0.7467532467532467,
   'f1': 0.7255520504731863},
  'partial': {'correct': 135,
   'incorrect': 0,
   'partial': 1,
   'missed': 18,
   'spurious': 27,
   'possible': 154,
   'actual': 163,
   'precision': 0.8312883435582822,
   'recall': 0.8798701298701299,
   'f1': 0.8548895899053627},
  'strict': {'correct': 115,
   'incorrect': 21,
   'partial': 0,
   'missed': 18,
   'spurious': 27,
   'possible': 154,
   'actual': 163,
   'precision': 0.7055214723926381,
   'recall': 0.7467532467532467,
   'f1': 0.7255520504731863},
  'exact': {'correct': 135,
   'incorrect': 1,
   'partial': 0,
   'missed': 18,
   'spurious': 27,
   'possible': 154,
   'actual': 163,
   'precision': 0.8282208588957055,
   'recall': 0.8766233766233766,
   'f1': 0.8517350157728707}},
 'ORG': {'ent_type': {'correct': 259