# Utility

This part of the assignment defines the mapping from spaCy entity labels to CoNLL labels. The mapping is applied after the file has been read with the function *read_corpus_conll*, defined in *conll.py*, which is imported below.

The class *Tokenizer* loads the vocabulary and implements the tokenization rule: every part of the sentence separated from the next one by a white space is treated as a separate word.

In [229]:
%%capture
!wget -c https://github.com/esrel/NLU.Lab.2021/raw/master/src/conll2003.zip
!unzip -o ./conll2003.zip
!wget -c https://raw.githubusercontent.com/esrel/NLU.Lab.2021/master/src/conll.py

import spacy
from conll import read_corpus_conll, get_chunks, evaluate
from spacy.tokens import Doc

spacy_to_conll = {
    "": "",
    "PERSON": "PER",
    "NORP": "MISC",
    "FAC": "MISC",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "PRODUCT": "MISC",
    "EVENT": "MISC",
    "WORK_OF_ART": "MISC",
    "LAW": "MISC",
    "LANGUAGE": "MISC",
    "DATE": "MISC",
    "TIME": "MISC",
    "PERCENT": "MISC",
    "MONEY": "MISC",
    "QUANTITY": "MISC",
    "ORDINAL": "MISC",
    "CARDINAL":  "MISC",
}

conll_corpus = read_corpus_conll('test.txt')
nlp = spacy.load("en_core_web_sm")

class Tokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
    def __call__(self, text):
        words = text.split(" ")
        return Doc(self.vocab, words=words)
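
As a minimal sketch of how the mapping is meant to be combined with spaCy's IOB codes, the snippet below builds a CoNLL-style tag from an IOB code and an entity type. The helper name `to_conll_tag` and the reduced dictionary `SPACY_TO_CONLL_SUBSET` are illustrative assumptions; they are not part of *conll.py* or of spaCy.

```python
# Illustrative subset of the mapping defined above (hypothetical name).
SPACY_TO_CONLL_SUBSET = {
    "": "",
    "PERSON": "PER",
    "GPE": "LOC",
    "ORG": "ORG",
    "DATE": "MISC",
}

def to_conll_tag(iob, ent_type, mapping=SPACY_TO_CONLL_SUBSET):
    """Join an IOB code ("B", "I" or "O") with the mapped CoNLL label."""
    label = mapping[ent_type]
    # Tokens outside any entity have an empty type and keep the bare "O" code.
    return iob if label == "" else iob + "-" + label

print(to_conll_tag("B", "PERSON"))  # B-PER
print(to_conll_tag("I", "GPE"))     # I-LOC
print(to_conll_tag("O", ""))        # O
```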

# Exercise 1


## Exercise 1.1 - functions

Exercise: Evaluate spaCy NER on CoNLL 2003 data (provided): report token-level performance (per class and total) accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy)



The function *true_predicted* takes as input a corpus, which is a list of lists. Following the format of the CoNLL 2003 dataset and the way the Tokenizer was defined above, I iterate over the whole corpus (over each sentence and each element within a sentence) to extract both the tokens and the named entity tags.
Tokens and named entity tags are each appended to a list, to be used in the following loop.

In the last loop, for each token of the current sentence as processed by spaCy, I check whether the mapping of its entity type is the empty string; if it is not, I append the mapped label to the IOB code of the token, separated by a hyphen.

Then, for each token, I append the resulting tag to the list of predictions, and I append to the list of correct labels the named entity tag of the current token as defined by the CoNLL 2003 dataset.

Finally, I return both the list of predictions and the list of correct labels, in this order, which will be used to evaluate the performance.


In [230]:
nlp.tokenizer = Tokenizer(nlp.vocab)

def true_predicted(corpus):
  y_pred = []
  y_true = []
  for sentence in corpus:
    # Extract the tokens and the gold named entity tags of the sentence
    tokens_list = []
    ne_tags_list = []
    for element in sentence:
      splitted_el = element[0].split()
      tokens_list.append(splitted_el[0])
      ne_tags_list.append(splitted_el[3])
    doc = nlp(" ".join(tokens_list))
    # Build the CoNLL-style tag of each token from its IOB code and mapped type
    token_index = 0
    for token in doc:
      ent_iob_conll = token.ent_iob_
      if spacy_to_conll[token.ent_type_] != "":
        ent_iob_conll += "-" + spacy_to_conll[token.ent_type_]
      y_pred.append(ent_iob_conll)
      y_true.append(ne_tags_list[token_index])
      token_index += 1
  # Predictions come first, gold labels second
  return (y_pred, y_true)
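
The indexing used above relies on the four-column layout of the CoNLL 2003 files (token, POS tag, syntactic chunk tag, named entity tag). Below is a small sketch of that extraction step, assuming each element returned by *read_corpus_conll* is a one-item tuple holding the raw line. The helper name `split_sentence` is hypothetical and the sample rows are written by hand in the same format.

```python
def split_sentence(sentence):
    """Return (tokens, ne_tags) for a sentence given as one-item tuples of raw lines."""
    tokens, ne_tags = [], []
    for element in sentence:
        fields = element[0].split()  # token, POS tag, chunk tag, NE tag
        tokens.append(fields[0])
        ne_tags.append(fields[3])
    return tokens, ne_tags

# Hand-written rows in the assumed format.
sample = [
    ("Peter NNP B-NP B-PER",),
    ("Blackburn NNP I-NP I-PER",),
    ("left VBD B-VP O",),
]
tokens, ne_tags = split_sentence(sample)
print(" ".join(tokens))  # Peter Blackburn left
print(ne_tags)           # ['B-PER', 'I-PER', 'O']
```

Joining the tokens with a single space and splitting on a single space again yields the same tokens, which keeps the spaCy tokens aligned one-to-one with the gold tags.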
  



The function *produce_report* prints the classification report for the predicted labels against the correct ones, using the *classification_report* function of the sklearn library.


In [231]:
from sklearn.metrics import classification_report

def produce_report(true_label, predicted_label):
  print(classification_report(true_label, predicted_label))
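
To make the precision and recall columns of the report concrete, here is a minimal pure-Python sketch of per-tag precision and recall computed from two aligned tag lists. It is not how sklearn implements `classification_report`; the function name `per_tag_scores` and the toy lists are assumptions for illustration.

```python
from collections import Counter

def per_tag_scores(y_true, y_pred):
    """Return {tag: (precision, recall)} for two aligned lists of tags."""
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    true_count = Counter(y_true)  # support of each tag
    pred_count = Counter(y_pred)  # how often each tag was predicted
    scores = {}
    for tag in sorted(set(y_true) | set(y_pred)):
        precision = correct[tag] / pred_count[tag] if pred_count[tag] else 0.0
        recall = correct[tag] / true_count[tag] if true_count[tag] else 0.0
        scores[tag] = (precision, recall)
    return scores

gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "O", "O", "B-LOC", "B-LOC"]
for tag, (p, r) in per_tag_scores(gold, pred).items():
    print(tag, round(p, 2), round(r, 2))
```

Precision and recall depend on which list is treated as the gold standard: passing the two lists in the opposite order interchanges the two values for every tag.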

## Exercise 1.2 - functions

Exercise: Evaluate spaCy NER on CoNLL 2003 data (provided), report CoNLL chunk-level performance (per class and total);
precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total

The function *produce_report_chunk* receives as input the correct labels and the predicted ones and evaluates them with the function *evaluate* defined in *conll.py*. It then uses the *Pandas* library to return the results as a table, which makes them easier to read.

In [232]:
import pandas as pd

def produce_report_chunk(true_label, predicted_label):
  evaluation = evaluate(true_label, predicted_label)
  evaluation_pd = pd.DataFrame.from_dict(evaluation, orient='index')
  return evaluation_pd

The function *true_predicted_chunk* receives the same input as the previously defined *true_predicted*, and its first part is identical. The main difference lies in the last loop: here we no longer deal with single tags but with chunks, so each sentence is kept as a separate list of (token, tag) pairs. Note that this function returns the correct labels first and the predictions second.

In [233]:
def true_predicted_chunk(corpus):
  y_pred = []
  y_true = []
  for sentence in corpus:
    # Extract the tokens and the gold named entity tags of the sentence
    tokens_list = []
    ne_tags_list = []
    for element in sentence:
      splitted_el = element[0].split()
      tokens_list.append(splitted_el[0])
      ne_tags_list.append(splitted_el[3])

    doc = nlp(" ".join(tokens_list))

    token_index = 0

    # Each sentence is stored as its own list of (token, tag) pairs
    tmp_pred = []
    tmp_true = []
    for token in doc:
      ent_iob_conll = token.ent_iob_
      if spacy_to_conll[token.ent_type_] != "":
        ent_iob_conll += "-" + spacy_to_conll[token.ent_type_]
      tmp_pred.append((token.text, ent_iob_conll))
      tmp_true.append((token.text, ne_tags_list[token_index]))
      token_index += 1
    y_pred.append(tmp_pred)
    y_true.append(tmp_true)
  # Gold labels come first, predictions second
  return (y_true, y_pred)
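
Chunk-level scoring differs from tag-level scoring in that an entity counts as correct only if both its boundaries and its type match. The sketch below illustrates the idea on plain tag lists. It is a simplified stand-in for *evaluate* from *conll.py*, whose internals are not shown here, and the names `extract_chunks` and `chunk_scores` are hypothetical.

```python
def extract_chunks(tags):
    """Return a set of (start, end, label) spans from a list of IOB tags."""
    chunks, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing chunk
        inside = tag.startswith("I-") and tag[2:] == label
        if start is not None and not inside:
            chunks.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return chunks

def chunk_scores(true_tags, pred_tags):
    """Return (precision, recall, f1) over exactly matching chunks."""
    gold, pred = extract_chunks(true_tags), extract_chunks(pred_tags)
    hits = len(gold & pred)
    p = hits / len(pred) if pred else 0.0
    r = hits / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-LOC"]
print(chunk_scores(gold, pred))  # (0.5, 0.5, 0.5)
```

In this toy case three of the four tags are correct, yet only one of the two gold chunks is recovered, because the truncated person name does not match the gold span.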

## Exercise 1.1 - Test

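Here the functions *true_predicted* and *produce_report* are tested: *true_predicted* provides the predictions and the correct labels, and *produce_report* evaluates the predictions against the correct labels. Since *true_predicted* returns the predictions first, they are unpacked in that order. In the report reproduced below, which was generated with the two lists in the opposite order, the precision and recall columns are interchanged and the support column counts predicted tags.

In [234]:
predicted_label, true_label = true_predicted(conll_corpus)
produce_report(true_label, predicted_label)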

              precision    recall  f1-score   support

       B-LOC       0.70      0.78      0.74      1514
      B-MISC       0.56      0.11      0.18      3668
       B-ORG       0.30      0.50      0.38      1008
       B-PER       0.61      0.79      0.69      1253
       I-LOC       0.62      0.60      0.61       266
      I-MISC       0.40      0.05      0.09      1640
       I-ORG       0.52      0.42      0.46      1034
       I-PER       0.76      0.82      0.78      1072
           O       0.86      0.94      0.90     35211

    accuracy                           0.81     46666
   macro avg       0.59      0.56      0.54     46666
weighted avg       0.79      0.81      0.78     46666



## Exercise 1.2 - Test

Here the functions *true_predicted_chunk* and *produce_report_chunk* are tested: *true_predicted_chunk* provides the correct labels and the predictions for the chunks, and *produce_report_chunk* evaluates the predictions.

In [235]:
chunk_label, chunk_pred= true_predicted_chunk(conll_corpus)
evaluation_table = produce_report_chunk(chunk_label, chunk_pred)
evaluation_table.round(decimals=2)


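|       | p    | r    | f    | s    |
|-------|------|------|------|------|
| LOC   | 0.77 | 0.70 | 0.73 | 1668 |
| ORG   | 0.45 | 0.27 | 0.34 | 1661 |
| MISC  | 0.11 | 0.55 | 0.18 | 702  |
| PER   | 0.76 | 0.59 | 0.66 | 1617 |
| total | 0.40 | 0.52 | 0.45 | 5648 |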


# Exercise 2

## Exercise 2 - functions

EXERCISE: Grouping of Entities. Write a function to group recognized named entities using noun_chunks method of spaCy. Analyze the groups in terms of most frequent combinations (i.e. NER types that go together).

In the following code section, two functions are defined: *get_list_of_lists* and *get_frequencies*.

The first one groups the entities, starting from all the noun chunks extracted from the doc.

We then iterate over all the entities in the doc. The function relies on the boolean *in_chunk*, which tells us whether we are inside a chunk: it is *False* by default, it becomes *True* as soon as an entity is found to be inside a chunk, and it is reset to *False* when the current entity turns out to lie outside the chunk that contained the previous entity (i.e. the entity of the previous iteration).

The second function, *get_frequencies*, simply iterates over the entire list of lists of entities to count the frequency of each group.

In [236]:
from collections import defaultdict

def get_list_of_lists(text=None, doc=None):
  if doc is None:
    # A fresh pipeline is loaded here, so the default spaCy tokenizer is used
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
  chunks = list(doc.noun_chunks)

  list_of_lists = []
  chunk_index = 0
  in_chunk = False  # True while the previous entity lies in chunks[chunk_index]

  for ent in doc.ents:
    # Move past the chunks that end before the current entity starts
    while chunk_index < len(chunks) and chunks[chunk_index].end <= ent.start:
      chunk_index += 1
      in_chunk = False
    # The current entity belongs to a chunk only if the chunk fully contains it
    if chunk_index < len(chunks) and chunks[chunk_index].start <= ent.start and ent.end <= chunks[chunk_index].end:
      if in_chunk:
        # Same chunk as the previous entity: extend the last group
        list_of_lists[-1].append(ent.label_)
      else:
        list_of_lists.append([ent.label_])
        in_chunk = True
    # In this case, the current entity does not belong to a chunk
    else:
      list_of_lists.append([ent.label_])
      in_chunk = False
  return list_of_lists

def get_frequencies(list_of_lists):
  dict_of_ent = defaultdict(int)
  for ent in list_of_lists:
      key = str(ent)
      dict_of_ent[key] = dict_of_ent[key]+1
  return dict_of_ent
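
The grouping idea can be illustrated without running spaCy by representing spans as plain `(start, end, label)` tuples with an exclusive end. This is only a sketch: the function name `group_by_chunk`, the tuple representation, and the token offsets below are assumptions written by hand for the test sentence, not output produced by the pipeline.

```python
def group_by_chunk(entities, chunks):
    """Group entity labels that fall inside the same chunk.

    entities: list of (start, end, label), sorted by start, end exclusive.
    chunks:   list of (start, end), sorted by start, end exclusive.
    """
    groups = []
    last_chunk = None  # chunk that contained the previous entity, if any
    for start, end, label in entities:
        # Find the chunk that fully contains the entity, if there is one.
        current = next((c for c in chunks if c[0] <= start and end <= c[1]), None)
        if current is not None and current == last_chunk:
            groups[-1].append(label)
        else:
            groups.append([label])
        last_chunk = current
    return groups

# Hand-written offsets for: Apple 's Steve Jobs died in 2011 in Palo Alto , California .
entities = [(0, 1, "ORG"), (2, 4, "PERSON"), (6, 7, "DATE"), (8, 10, "GPE"), (11, 12, "GPE")]
chunks = [(0, 4), (8, 10), (11, 12)]
print(group_by_chunk(entities, chunks))
# [['ORG', 'PERSON'], ['DATE'], ['GPE'], ['GPE']]
```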

The function *sort_dict* converts the dictionary of frequencies into a list of (group, frequency) pairs and sorts it by decreasing frequency.

In [237]:
def sort_dict(dic):
  # Use the dictionary passed as argument, not a global variable
  list_of_freq = dic.items()
  sorted_list = sorted(list_of_freq, key=lambda x: x[1], reverse=True)
  return sorted_list
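
An alternative sketch of the counting and sorting steps uses `collections.Counter` with tuples as keys, which avoids converting each list to a string. The function name `count_groups` is hypothetical; `Counter.most_common` returns the pairs sorted by decreasing count.

```python
from collections import Counter

def count_groups(list_of_lists):
    """Count how often each combination of entity labels occurs."""
    # Lists are not hashable, so each group is converted to a tuple.
    return Counter(tuple(group) for group in list_of_lists)

groups = [["ORG", "PERSON"], ["DATE"], ["GPE"], ["GPE"]]
for combination, count in count_groups(groups).most_common():
    print(combination, count)
```

Keeping the keys as tuples also makes it possible to inspect the individual labels of each combination later, which a string key does not allow directly.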

## Exercise 2 - Test

Here we simply show how the previously defined functions work.

We take a simple test sentence ("Apple's Steve Jobs died in 2011 in Palo Alto, California."), group its entities with the function *get_list_of_lists*, and then measure the frequency of each group of entities with the function *get_frequencies*.

Finally, we sort the dictionary returned by *get_frequencies* with the previously defined function *sort_dict*.

In [238]:
test = "Apple's Steve Jobs died in 2011 in Palo Alto, California."

list_of_lists = get_list_of_lists(text=test)
print(list_of_lists)
dic_of_ent = get_frequencies(list_of_lists)
print("This is the list of entities and their frequencies")
print(dic_of_ent)

print("This is the sorted list of entities according to their frequencies")
sorted_list_of_freq = sort_dict(dic_of_ent)
print(sorted_list_of_freq)


[['ORG', 'PERSON'], ['DATE'], ['GPE'], ['GPE']]
This is the list of entities and their frequencies
defaultdict(<class 'int'>, {"['ORG', 'PERSON']": 1, "['DATE']": 1, "['GPE']": 2})
This is the sorted list of entities according to their frequencies
[("['GPE']", 2), ("['ORG', 'PERSON']", 1), ("['DATE']", 1)]


# Exercise 3

## Exercise 3 - functions

EXERCISE: One of the possible post-processing steps is to fix segmentation errors. Write a function that extends the entity span to cover the full noun-compounds. Make use of compound dependency relation.

In the function *post_processing*, given the corpus as input, we first apply the preprocessing step already described in the previous sections. We then iterate over all the entities of the current sentence (as can be seen from the way the loops are nested), storing each entity in a list of entities and each of its tokens in a list of tokens. For every token in that list of tokens, we inspect the token itself and its children.

To expand the entity over a noun compound, we need to find each token that has a "compound" dependency relation and whose entity type is the empty string.

To update the doc, we use the *set_ents* method of the spaCy *Doc* class, which sets the named entities in the document. Since the entities are given as a sequence of *Span* objects, we need to build a new *Span*, so we compute its beginning and end indices and assign it the proper label, which is the one given to the current token (set to the label of the last element in the list of entities).

It is important to notice that the inner loop runs over the children of the token provided by the outer loop, which in turn is taken from the list of tokens of the entities found so far.

Moreover, each child token that has been processed (i.e. whose *ent_type_* has been updated) is added to that list, so that its own children are also checked for further compound dependency relations and the span can be extended again.

In [239]:
from spacy.tokens import Span

nlp.tokenizer = Tokenizer(nlp.vocab)

def post_processing(conll_corpus):
  for sentence in conll_corpus:
    tokens_list = []
    for element in sentence:
      splitted_el = element[0].split()
      tokens_list.append(splitted_el[0])
    doc = nlp(" ".join(tokens_list))

    old_t_list = []
    entities_list = []
    for e in doc.ents:
      for tok in e:
        old_t_list.append(tok)
      entities_list.append(e)
      # old_t_list grows while it is iterated, so newly absorbed tokens are visited too
      for t in old_t_list:
        utility = [t]
        for child in t.children:
          utility.append(child)
        for item in utility:
          # Only a token immediately before or after the current span can extend it;
          # tokens already inside the span always carry an entity type
          if item.i == entities_list[-1].start - 1 or item.i == entities_list[-1].end:
            if item.ent_type_ == '' and item.dep_ == 'compound':
              item.ent_type_ = entities_list[-1].label_
              old_t_list.append(item)
              beg_index = min(entities_list[-1].start, item.i)
              end_index = max(entities_list[-1].end - 1, item.i)
              entities_list[-1] = Span(doc, beg_index, end_index + 1, item.ent_type)
    doc.set_ents(entities_list)
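
The core of the span extension can be sketched on toy data, without spaCy objects. Here each token is a dictionary with hand-written `dep` and `head` fields, and an entity is a `(start, end, label)` tuple with an exclusive end. All of these names, as well as `extend_with_compounds`, are assumptions made for illustration and do not reproduce spaCy's API.

```python
def extend_with_compounds(tokens, entity, tagged):
    """Grow an entity span over adjacent, untagged compound modifiers.

    tokens: list of dicts with "dep" (dependency label) and "head" (index of the head).
    entity: (start, end, label) with an exclusive end.
    tagged: set of token indices that already belong to some entity.
    """
    start, end, label = entity
    changed = True
    while changed:
        changed = False
        for i in (start - 1, end):  # tokens just before and just after the span
            if 0 <= i < len(tokens) and i not in tagged:
                tok = tokens[i]
                # The neighbour must be a compound modifier of a token inside the span.
                if tok["dep"] == "compound" and start <= tok["head"] < end:
                    start, end = min(start, i), max(end, i + 1)
                    tagged.add(i)
                    changed = True
    return start, end, label

# Hand-written parse for: Tokyo Stock Exchange fell
tokens = [
    {"text": "Tokyo", "dep": "compound", "head": 2},
    {"text": "Stock", "dep": "compound", "head": 2},
    {"text": "Exchange", "dep": "nsubj", "head": 3},
    {"text": "fell", "dep": "ROOT", "head": 3},
]
# Suppose only "Exchange" was recognized as an entity.
print(extend_with_compounds(tokens, (2, 3, "ORG"), {2}))  # (0, 3, 'ORG')
```

The loop repeats until no neighbour qualifies, which mirrors the way newly absorbed tokens are revisited so that chains of compound modifiers are covered.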


## Exercise 3 - Test

We call the previously defined *post_processing* function, passing the *conll_corpus* as parameter.

In [240]:
post_processing(conll_corpus)