
# University of Trento, Master in Artificial Intelligence Systems - Natural Language Understanding Course, Second Assignment



*   Student Name: Ali Akay
*   Course: Natural Language Understanding

*   Professor: [Giuseppe Riccardi](http://disi.unitn.it/~riccardi/)
*   Lab: https://github.com/esrel/NLU.Lab.2021


## Requirements

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
import spacy 
import pandas as pd
from spacy.matcher import Matcher
from spacy.tokens import Token, Doc
from collections import Counter
!pip install seqeval
from seqeval.metrics import accuracy_score
!pip install spacy_conll

Collecting seqeval
  Downloading https://files.pythonhosted.org/packages/9d/2d/233c79d5b4e5ab1dbf111242299153f3caddddbb691219f363ad55ce783d/seqeval-1.2.2.tar.gz (43kB)
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-cp37-none-any.whl size=16172 sha256=9ab58ea88709b0fdca687bad2eef90b760778d1e163fb427685f8c134ffdf70a
  Stored in directory: /root/.cache/pip/wheels/52/df/1b/45d75646c37428f7e626214704a0e35bd3cfc32eda37e59e5f
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2
Collecting spacy_conll

## Download Data 

In [2]:
!git clone https://github.com/esrel/NLU.Lab.2021.git

Cloning into 'NLU.Lab.2021'...
remote: Enumerating objects: 67, done.
remote: Counting objects: 100% (67/67), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 67 (delta 17), reused 58 (delta 13), pack-reused 0
Unpacking objects: 100% (67/67), done.


In [3]:
import zipfile
with zipfile.ZipFile("/content/NLU.Lab.2021/src/conll2003.zip", 'r') as zip_ref:
    zip_ref.extractall("/content")

### **1st Problem-**  Evaluate spaCy NER on CoNLL 2003 data

First, I wrote a function that loads the corpus text file into a list of sentences, each represented as a dictionary.

In [4]:
def load_sentences(filepath):
    """
    Load sentences (separated by newlines) from dataset

    Parameters
    ----------
    filepath : str
        path to corpus file

    Returns
    -------
    List of sentences represented as dictionaries

    """
    
    sentences, tok, pos, chunk, ne = [], [], [], [], []

    with open(filepath, 'r') as f:
        for line in f.readlines():
            if line == ('-DOCSTART- -X- -X- O\n') or line == '\n':
               # Sentence as a sequence of tokens, POS, chunk and NE tags
                sentence = dict({'TOKENS' : [], 'POS' : [], 'CHUNK_TAG' : [], 'NE' : [], 'SEQ' : []})
                sentence['TOKENS'] = tok
                sentence['POS'] = pos
                sentence['CHUNK_TAG'] = chunk
                sentence['NE'] = ne
               
                # Once a sentence is processed append it to the list of sentences
                sentences.append(sentence)
               
                # Reset sentence information
                tok = []
                pos= []
                chunk = []
                ne = []
            else:
                l = line.split(' ')
               
                # Append info for next word
                tok.append(l[0])
                pos.append(l[1])
                chunk.append(l[2])
                ne.append(l[3].strip('\n'))
    
    return sentences
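
The loader above also appends an empty sentence for every `-DOCSTART-` line and for consecutive blank lines, which is why empty entries are filtered out later. As a minimal sketch (not the loader used for the results in this notebook), a variant that skips empty sentences could look like this. The name `parse_conll` and the sample lines are only illustrative.

```python
def parse_conll(lines):
    """Group CoNLL-2003 lines into sentence dictionaries, skipping empty ones."""
    def new_sentence():
        return {"TOKENS": [], "POS": [], "CHUNK_TAG": [], "NE": []}

    sentences, current = [], new_sentence()
    for line in lines:
        fields = line.split()
        # A blank line or a -DOCSTART- marker closes the current sentence
        if not fields or fields[0] == "-DOCSTART-":
            if current["TOKENS"]:
                sentences.append(current)
                current = new_sentence()
            continue
        # Assumes the four CoNLL-2003 columns: token, POS, chunk tag, NE tag
        token, pos, chunk, ne = fields
        current["TOKENS"].append(token)
        current["POS"].append(pos)
        current["CHUNK_TAG"].append(chunk)
        current["NE"].append(ne)
    # Keep the last sentence even if the input does not end with a blank line
    if current["TOKENS"]:
        sentences.append(current)
    return sentences


# Illustrative input in the same layout as test.txt
sample = [
    "-DOCSTART- -X- -X- O\n",
    "\n",
    "Nadim NNP B-NP B-PER\n",
    "Ladki NNP I-NP I-PER\n",
    "\n",
    "AL-AIN NNP B-NP B-LOC",
]
parsed = parse_conll(sample)
print([s["TOKENS"] for s in parsed])
```

An open file object can be passed directly as `lines`. Sentence indices would then differ from the ones used in the rest of this notebook (for example `test_data[4]`), which assume the original loader.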

In [5]:
train_data=load_sentences("train.txt")
test_data=load_sentences("test.txt")

As the example below shows, each sentence dictionary holds CHUNK_TAG, NE, POS, SEQ and TOKENS. In this assignment we only use the tokens and the named-entity tags. The **conver_data_to_df()** function collects them from the dictionaries into a DataFrame, which I use throughout to inspect and process the data more easily.

In [6]:
test_data[4]

{'CHUNK_TAG': ['B-NP', 'O', 'B-NP', 'I-NP', 'I-NP', 'I-NP'],
 'NE': ['B-LOC', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O'],
 'POS': ['NNP', ',', 'NNP', 'NNP', 'NNPS', 'CD'],
 'SEQ': [],
 'TOKENS': ['AL-AIN', ',', 'United', 'Arab', 'Emirates', '1996-12-06']}

In [7]:
def conver_data_to_df(data):

  """
  Parameters
  ----------
  data : list of sentence dictionaries (e.g. test_data)

  Returns
  -------
  DataFrame with one row per token: the lowercased word and its NE tag.
  """

  test_data_sen=pd.DataFrame()
  test_data_sen=pd.DataFrame(pd.DataFrame(data)["TOKENS"][0])
  #append tokens in dataframe
  for i in range(len(data)):
    b=pd.DataFrame(pd.DataFrame(data)["TOKENS"][i])
    test_data_sen=pd.concat([test_data_sen,b])
  test_data_sen.reset_index(drop=True,inplace=True)
  test_data_sen.columns=["Word"]
  test_data_sen=test_data_sen.apply(lambda x: x.astype(str).str.lower())

  #append NE in dataframe
  data_NE=pd.DataFrame()
  data_NE=pd.DataFrame(pd.DataFrame(data)["NE"][0])
  for i in range(len(data)):
    b=pd.DataFrame(pd.DataFrame(data)["NE"][i])
    data_NE=pd.concat([data_NE,b])
  data_NE.reset_index(drop=True,inplace=True)
  data_NE.columns=["NE_TAG"]
  #combine two dataframe
  data_final=pd.concat([test_data_sen,data_NE],axis=1)
  return data_final
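
The function above rebuilds a DataFrame from `data` on every iteration and concatenates sentence by sentence. As a hedged alternative, the same rows can be collected in one pass and converted once at the end. The helper name `flatten_sentences` is hypothetical, and the sketch uses plain Python so that it runs without pandas.

```python
def flatten_sentences(sentences):
    """Return one (lowercased word, NE tag) pair per token, in corpus order."""
    return [
        (token.lower(), tag)
        for sentence in sentences
        for token, tag in zip(sentence["TOKENS"], sentence["NE"])
    ]


# Two sentences in the dictionary format produced by load_sentences();
# the empty one mimics the entries created by -DOCSTART- lines
example = [
    {"TOKENS": [], "NE": []},
    {"TOKENS": ["AL-AIN", ",", "United", "Arab", "Emirates", "1996-12-06"],
     "NE": ["B-LOC", "O", "B-LOC", "I-LOC", "I-LOC", "O"]},
]
rows = flatten_sentences(example)
print(rows[:3])
```

The list can then be turned into the same two-column layout with `pd.DataFrame(rows, columns=["Word", "NE_TAG"])`.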

In this assignment I only work on **"test.txt"**.

In [8]:
test_data_df=conver_data_to_df(test_data)

In [9]:
test_data_df

Unnamed: 0,Word,NE_TAG
0,soccer,O
1,-,O
2,japan,B-LOC
3,get,O
4,lucky,O
...,...,...
46430,younger,O
46431,brother,O
46432,",",O
46433,bobby,B-PER


In [10]:
#train_data_df=conver_data_to_df(train_data)
#train_data_df

In [11]:
#train_data_df.to_csv("train_data.csv")
#test_data_df.to_csv("test_data.csv")

For the first question we have to predict named entities with spaCy. To do that, we first rebuild each sentence from its tokens and then run spaCy on it to get the named-entity tags. The **create_sentences_NE_data()** function rebuilds the sentences from the tokens.

In [12]:
def create_sentences_NE_data(data):
  """
  Parameters
  ----------
  data : list of sentence dictionaries (e.g. test_data);
  the sentences are rebuilt from the TOKENS entries.

  Returns
  -------
  list of sentences
  list of NE tag lists
  """

  datasentences=[]
  for i in range(len(data)):
    datasentences.append(' '.join(word for word in pd.DataFrame(data)["TOKENS"][i])) #convert tokens to a sentence.
  
  data_ne=[]
  for i in range(len(data)):
    data_ne.append(pd.DataFrame(data)["NE"][i])
  return datasentences,data_ne

In [13]:
test_data_sent,test_data_ne=create_sentences_NE_data(test_data)

In [14]:
# Drop the empty sentences and empty tag lists (created by -DOCSTART- and blank lines).
test_data_sent = list(filter(None, test_data_sent))
test_data_ne = list(filter(None, test_data_ne))

In [15]:
test_data_sent[2]

'AL-AIN , United Arab Emirates 1996-12-06'

In [16]:
test_data_ne[2]

['B-LOC', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O']

In [17]:
print(len(test_data_sent))
print(len(test_data_ne))

3453
3453


With a custom WhitespaceTokenizer, spaCy tokenizes each sentence exactly as in the CoNLL token format: the text is simply split on single spaces, which reverses the ' '.join() used to build the sentences.

In [18]:
import spacy
from spacy.tokens import Doc
from spacy_conll import init_parser,ConllFormatter

class WhitespaceTokenizer:
    def __init__(self, vocab): self.vocab = vocab
    def __call__(self, text):
        words = text.split(" ")                       
        return Doc(self.vocab, words=words)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
conllformatter = ConllFormatter(nlp)
nlp.add_pipe(conllformatter, last=True)
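
The reason the whitespace tokenizer keeps spaCy's tokens aligned with the CoNLL tokens can be checked without spaCy: joining on a single space and splitting on a single space are inverse operations as long as no token is empty or contains a space. The helper name `roundtrips` is only illustrative.

```python
def roundtrips(tokens):
    """True if joining on ' ' and splitting on ' ' gives back the same tokens."""
    return " ".join(tokens).split(" ") == tokens


conll_tokens = ["AL-AIN", ",", "United", "Arab", "Emirates", "1996-12-06"]
print(roundtrips(conll_tokens))

# A token that itself contains a space would break the alignment
print(roundtrips(["New York", "City"]))
```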

spaCy uses a different label set for named entities, so its labels must be converted to the CoNLL label set before evaluation.

In [19]:
def replace_values_in_string(text, args_dict):
    # Map a spaCy label to its CoNLL label with an exact lookup;
    # empty or unknown labels are returned unchanged
    return str(args_dict.get(text, text))

dic_ne = {   
    "ORG": "ORG",
    "PERSON": "PER",
    "EVENT": "MISC",
    "NORP": "MISC",
    "FAC": "MISC",
    "WORK_OF_ART": "MISC",
    "PRODUCT": "MISC",
    "LAW": "MISC",
    "LANGUAGE": "MISC",
    "DATE": "MISC",
    "TIME": "MISC",
    "PERCENT": "MISC",
    "ORDINAL": "MISC",
    "CARDINAL":  "MISC",
    "MONEY": "MISC",
    "QUANTITY": "MISC",
    "GPE": "LOC",
    "LOC": "LOC"
}
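
As a small sketch of how a spaCy IOB prefix and entity type can be combined into a CoNLL tag in one step, the function below does the lookup and handles tokens outside any entity. The names `SPACY_TO_CONLL` and `to_conll_tag` and the `default` parameter are assumptions for illustration, not part of the assignment code.

```python
# Hypothetical reduced mapping: labels not listed here fall back to `default`
SPACY_TO_CONLL = {"PERSON": "PER", "ORG": "ORG", "GPE": "LOC", "LOC": "LOC"}


def to_conll_tag(iob, ent_type, mapping=SPACY_TO_CONLL, default="MISC"):
    """Build a CoNLL tag such as 'B-LOC' from an IOB prefix and a spaCy label.

    If `default` is None, labels missing from `mapping` are tagged 'O'.
    """
    if iob not in ("B", "I") or not ent_type:
        return "O"
    label = mapping.get(ent_type, default)
    if label is None:
        return "O"
    return f"{iob}-{label}"


print(to_conll_tag("B", "GPE"))
print(to_conll_tag("I", "DATE"))
print(to_conll_tag("B", "DATE", default=None))
print(to_conll_tag("O", ""))
```

In the reference data shown in this notebook, "1996-12-06" is tagged O, so mapping DATE to MISC produces a false positive there. Passing `default=None` is one alternative that may be worth testing; its effect on the scores has not been measured here.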


I use two lists. **ent_list_sen** collects the (token, tag) pairs of one sentence and is then appended to **ent_list**, which holds all sentences. Finally, I convert the list to a DataFrame.

In [31]:
def spacy_prediction(sentences_data):
  """
  Parameters
  ----------
  sentences_data : list of sentences, e.g. test_data_sent

  Returns
  -------
  DataFrame with each token and the NE tag predicted by spaCy.
  """

  # one list of [token, tag] pairs per sentence
  ent_list=[]
  for i in range(len(sentences_data)):
    ent_list_sen=[]

    text=sentences_data[i]
    # I lowercase every sentence, aiming for more precise predictions
    text=text.lower()
    doc = nlp(text)
    for token in doc:
      ent_list_sen.append([token.text,token.ent_iob_ + '-' + replace_values_in_string(token.ent_type_, dic_ne)])

    ent_list.append(ent_list_sen)

  # NOTE: the DataFrame is seeded with ent_list[0] and the loop below also starts at 0,
  # so the tokens of the first sentence appear twice in the result
  spacy_ne_pred=pd.DataFrame(ent_list[0])

  # Build the DataFrame from the collected lists
  for i in range(len(ent_list)):
    sent_df=pd.DataFrame(ent_list[i])
    spacy_ne_pred=pd.concat([spacy_ne_pred,sent_df])
  spacy_ne_pred.reset_index(drop=True,inplace=True)
  spacy_ne_pred.columns=["Word","NE_pred"]
  return spacy_ne_pred

spacy_pred=spacy_prediction(test_data_sent)
spacy_pred

Unnamed: 0,Word,NE_pred
0,soccer,O-
1,-,O-
2,japan,B-LOC
3,get,O-
4,lucky,O-
...,...,...
46442,younger,O-
46443,brother,O-
46444,",",O-
46445,bobby,B-PER


In [32]:
#Convert "O-" to "O"
spacy_pred['NE_pred'] = spacy_pred['NE_pred'].str.replace('O-','O')

Now we have the spaCy predictions. As you can see, the two DataFrames have the same format.

In [33]:
spacy_pred.head()

Unnamed: 0,Word,NE_pred
0,soccer,O
1,-,O
2,japan,B-LOC
3,get,O
4,lucky,O


In [34]:
test_data_df.head()

Unnamed: 0,Word,NE_TAG
0,soccer,O
1,-,O
2,japan,B-LOC
3,get,O
4,lucky,O


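To evaluate, I merge the two DataFrames on "Word", so that the token, its reference NE tag and the spaCy prediction sit in one DataFrame, which I then use for the token-level evaluation. Note that merging on "Word" and then calling drop_duplicates(subset=['Word']) keeps a single row per distinct word, so the scores below are computed over distinct lowercased word forms (about 8.5k rows), not over every token of the test set.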

In [35]:
data=pd.merge(test_data_df,spacy_pred,on=["Word"]).drop_duplicates(subset=['Word'])
data.reset_index(drop=True,inplace=True)
data

Unnamed: 0,Word,NE_TAG,NE_pred
0,soccer,O,O
1,-,O,O
2,japan,B-LOC,B-LOC
3,get,O,O
4,lucky,O,O
...,...,...,...
8543,well-fancied,O,O
8544,lanky,O,O
8545,1966,B-MISC,I-MISC
8546,younger,O,O


In [36]:
data[15:25]

Unnamed: 0,Word,NE_TAG,NE_pred
15,united,B-LOC,B-LOC
16,arab,I-LOC,I-LOC
17,emirates,I-LOC,I-LOC
18,1996-12-06,O,B-MISC
19,began,O,O
20,the,O,O
21,defence,O,O
22,of,O,O
23,their,O,O
24,asian,B-MISC,B-MISC


In [37]:
data.NE_TAG.unique()

array(['O', 'B-LOC', 'B-PER', 'I-PER', 'I-LOC', 'B-MISC', 'I-MISC',
       'B-ORG', 'I-ORG'], dtype=object)

In [38]:
data.NE_pred.unique()

array(['O', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER',
       'B-ORG', 'I-ORG'], dtype=object)

In [39]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score

In [40]:
print("Token-level Accuracy",accuracy_score(data.NE_TAG, data.NE_pred))

Token-level Accuracy 0.6720870379036031
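
A token-level score over every token needs the reference and predicted tags aligned by position rather than by word. The following is a minimal sketch of such a comparison, assuming one list of (token, tag) pairs per sentence on both sides. The function name `token_accuracy` is hypothetical, and the example reuses a sentence from this notebook, so the score printed is for that sentence only.

```python
def token_accuracy(ref, hyp):
    """Share of tokens whose predicted tag equals the reference tag.

    ref, hyp: lists of sentences, each a list of (token, tag) pairs.
    """
    total = correct = 0
    for ref_sent, hyp_sent in zip(ref, hyp):
        if len(ref_sent) != len(hyp_sent):
            raise ValueError("reference and hypothesis are not token-aligned")
        for (_, ref_tag), (_, hyp_tag) in zip(ref_sent, hyp_sent):
            total += 1
            correct += ref_tag == hyp_tag
    return correct / total if total else 0.0


ref_example = [[("AL-AIN", "B-LOC"), (",", "O"), ("United", "B-LOC"),
                ("Arab", "I-LOC"), ("Emirates", "I-LOC"), ("1996-12-06", "O")]]
hyp_example = [[("AL-AIN", "B-LOC"), (",", "O"), ("United", "B-LOC"),
                ("Arab", "I-LOC"), ("Emirates", "I-LOC"), ("1996-12-06", "B-MISC")]]
acc = token_accuracy(ref_example, hyp_example)
print(round(acc, 3))
```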


In [41]:
# sklearn returns the per-class scores in sorted label order, so the label list is sorted as well
labels_ne = sorted(data.NE_TAG.unique().tolist())
precision, recall, fscore, support = score(data.NE_TAG.tolist(), data.NE_pred.tolist(), labels=labels_ne)
results_df=pd.DataFrame({"Labels":labels_ne,"Precision":precision,"Recall":recall,"Fscore":fscore,"Support":support})
results_df

Unnamed: 0,Labels,Precision,Recall,Fscore,Support
0,B-LOC,0.638655,0.484076,0.550725,314
1,B-MISC,0.0789755,0.350711,0.12892,211
2,B-ORG,0.510638,0.193159,0.280292,497
3,B-PER,0.7,0.457849,0.553603,688
4,I-LOC,0.488889,0.407407,0.444444,54
5,I-MISC,0.028777,0.177778,0.0495356,45
6,I-ORG,0.350785,0.304545,0.326034,220
7,I-PER,0.752747,0.578873,0.654459,710
8,O,0.810573,0.791875,0.801115,5809


### Chunk-Level Evaluation

In [42]:
test_data=pd.DataFrame(test_data)
test_data.head()

Unnamed: 0,TOKENS,POS,CHUNK_TAG,NE,SEQ
0,[],[],[],[],[]
1,[],[],[],[],[]
2,"[SOCCER, -, JAPAN, GET, LUCKY, WIN, ,, CHINA, ...","[NN, :, NNP, VB, NNP, NNP, ,, NNP, IN, DT, NN, .]","[B-NP, O, B-NP, B-VP, B-NP, I-NP, O, B-NP, B-P...","[O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O]",[]
3,"[Nadim, Ladki]","[NNP, NNP]","[B-NP, I-NP]","[B-PER, I-PER]",[]
4,"[AL-AIN, ,, United, Arab, Emirates, 1996-12-06]","[NNP, ,, NNP, NNP, NNPS, CD]","[B-NP, O, B-NP, I-NP, I-NP, I-NP]","[B-LOC, O, B-LOC, I-LOC, I-LOC, O]",[]


For the chunk-level evaluation I used the conll.evaluate function. It requires the "ref" and "hyp" data in the same format: one list of (token, tag) pairs per sentence.

In [43]:
#Collect ref data using for loop.
test_data=test_data[["TOKENS","NE"]]
ref=[]
for i in range(len(test_data["TOKENS"])):
  text=test_data["TOKENS"][i]
  ne=test_data["NE"][i]
  a=[]
  for j in range(len(text)):
    a.append((text[j],ne[j]))
  ref.append(a)

In [44]:
#dropping null lists
ref = list(filter(None, ref))

In [45]:
ref[61]

[('9.', 'O'),
 ('Johann', 'B-PER'),
 ('Gregoire', 'I-PER'),
 ('(', 'O'),
 ('France', 'B-LOC'),
 (')', 'O'),
 ('22.58', 'O')]

In [46]:
#example hyp for a sentence
d=[]
doc=nlp(test_data_sent[61])
for token in doc:
  d.append((token.text,token.ent_iob_ + '-' + replace_values_in_string(token.ent_type_, dic_ne)))
d

[('9.', 'O-'),
 ('Johann', 'B-PER'),
 ('Gregoire', 'I-PER'),
 ('(', 'O-'),
 ('France', 'B-LOC'),
 (')', 'O-'),
 ('22.58', 'B-MISC')]

With the loop below, I create the hyp data: I run spaCy on every sentence, collect the (token, tag) pairs, and append them to the list. Unlike spacy_prediction(), this loop keeps the original casing of the sentences.

In [47]:
import numpy as np
hyp=[]
for i in range(len(test_data_sent)):
    text=test_data_sent[i]
    c=[]
    doc = nlp(text)
    for token in doc:
        c.append((token.text,token.ent_iob_ + '-' + replace_values_in_string(token.ent_type_, dic_ne) if token.ent_type_ != "" else "O" ))
    hyp.append(c)

In [48]:
hyp[2]

[('AL-AIN', 'B-LOC'),
 (',', 'O'),
 ('United', 'B-LOC'),
 ('Arab', 'I-LOC'),
 ('Emirates', 'I-LOC'),
 ('1996-12-06', 'B-MISC')]

In [49]:
ref[2]

[('AL-AIN', 'B-LOC'),
 (',', 'O'),
 ('United', 'B-LOC'),
 ('Arab', 'I-LOC'),
 ('Emirates', 'I-LOC'),
 ('1996-12-06', 'O')]

In [51]:
import conll
results = conll.evaluate(ref, hyp)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

Unnamed: 0,p,r,f,s
PER,0.724,0.577,0.642,1617
ORG,0.441,0.297,0.355,1661
MISC,0.094,0.551,0.161,702
LOC,0.771,0.668,0.716,1668
total,0.368,0.518,0.43,5648
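
To make explicit what a chunk-level score measures, here is a hedged sketch of span-based precision, recall and F1: an entity counts as correct only if its start, end and type all match the reference. This is not the implementation of `conll.evaluate` from the lab repository; `extract_spans` and `span_prf` are hypothetical names, and the example scores a single sentence from this notebook.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in an IOB tag list.

    `end` is exclusive. An I- tag without a matching open span starts a new one.
    """
    spans, start, kind = set(), None, None
    # The trailing "O" sentinel closes a span that reaches the end of the sentence
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, label = tag.partition("-")
        # Close the open span on O, on B-, or when the type changes
        if start is not None and (prefix != "I" or label != kind):
            spans.add((start, i, kind))
            start, kind = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, kind = i, label
    return spans


def span_prf(ref_tags, hyp_tags):
    """Micro-averaged precision, recall and F1 over exact span matches."""
    tp = n_ref = n_hyp = 0
    for ref_sent, hyp_sent in zip(ref_tags, hyp_tags):
        ref_spans, hyp_spans = extract_spans(ref_sent), extract_spans(hyp_sent)
        tp += len(ref_spans & hyp_spans)
        n_ref += len(ref_spans)
        n_hyp += len(hyp_spans)
    precision = tp / n_hyp if n_hyp else 0.0
    recall = tp / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


ref_tags = [["B-LOC", "O", "B-LOC", "I-LOC", "I-LOC", "O"]]
hyp_tags = [["B-LOC", "O", "B-LOC", "I-LOC", "I-LOC", "B-MISC"]]
p, r, f = span_prf(ref_tags, hyp_tags)
print(round(p, 3), round(r, 3), round(f, 3))
```

In this example both LOC spans are found, but the extra MISC span on the date lowers precision, which is the same pattern as the low MISC precision in the table above.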


## **2nd Problem**- Grouping of Entities. Write a function to group recognized named entities using noun_chunks method of spaCy. Analyze the groups in terms of most frequent combinations (i.e. NER types that go together).

For this question I wrote a function that collects the entity labels found in the noun chunks of each sentence. Then I use Counter from the collections module to get the frequency of each group of labels.

In [52]:
len(test_data_sent)

3453

Example of the loop on the first five sentences:

In [66]:
test_data_sent[:5]

['SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT .',
 'Nadim Ladki',
 'AL-AIN , United Arab Emirates 1996-12-06',
 'Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday .',
 'But China saw their luck desert them in the second match of the group , crashing to a surprise 2-0 defeat to newcomers Uzbekistan .']

In [67]:
  chunk=[] 
  for i in range(len(test_data_sent[:5])):
    doc = nlp(test_data_sent[i])
    for nc in doc.noun_chunks: 
      c_l=[]
      for ent in nc.ents:
        for token in ent:
          c_l.append(token.ent_type_ if token.ent_type_ != "" else "O" )
    chunk.append(c_l)

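In this output, ['GPE', 'GPE', 'GPE'] is "United Arab Emirates" and ['DATE'] is "Friday" in the 4th sentence. Each group has one label per token. Since chunk.append(c_l) runs once per sentence, after the noun-chunk loop, only the last noun chunk of each sentence is kept.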

In [69]:
chunk

[[], [], ['GPE', 'GPE', 'GPE'], ['DATE'], ['GPE']]

In [70]:
def group_entities(sentences_data):
  """
  Parameters
  ----------
  sentences_data : list of sentences, e.g. test_data_sent

  Returns
  -------
  list of entity-label groups (one label per token), with empty groups removed.
  """
  chunk=[] 
  for i in range(len(sentences_data)):
    doc = nlp(sentences_data[i])
    for nc in doc.noun_chunks: 
      c_l=[]
      for ent in nc.ents:
        for token in ent:
          c_l.append(token.ent_type_ if token.ent_type_ != "" else "O" )
    chunk.append(c_l)

  #dropping nulls.
  chunk_list = []
  for val in chunk:
      if val != [] :
          chunk_list.append(val)
  return chunk_list

In [71]:
chunk=group_entities(test_data_sent)
chunk[:5]

[['GPE', 'GPE', 'GPE'], ['DATE'], ['GPE'], ['ORDINAL'], ['GPE']]

In [72]:
from collections import Counter
chunk_frequency = Counter(map(tuple, chunk))
chunk_frequency.most_common(20)

[(('GPE',), 488),
 (('ORG',), 189),
 (('DATE',), 175),
 (('PERSON', 'PERSON'), 137),
 (('CARDINAL',), 105),
 (('PERSON',), 64),
 (('GPE', 'GPE'), 59),
 (('NORP',), 54),
 (('ORG', 'ORG'), 48),
 (('PERSON', 'PERSON', 'PERSON'), 27),
 (('ORDINAL',), 21),
 (('DATE', 'DATE'), 20),
 (('ORG', 'ORG', 'ORG', 'ORG'), 18),
 (('ORG', 'ORG', 'ORG'), 15),
 (('PERCENT', 'PERCENT'), 13),
 (('CARDINAL', 'GPE'), 12),
 (('ORG', 'ORG', 'ORG', 'ORG', 'ORG'), 11),
 (('CARDINAL', 'CARDINAL'), 9),
 (('MONEY', 'MONEY'), 9),
 (('TIME', 'TIME'), 8)]
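
The counts above have one label per token, so ('PERSON', 'PERSON') is a single two-token name rather than two entities. As a sketch of an alternative grouping that records every noun chunk and one label per entity, the function below works on plain offsets. With spaCy these could be taken from `nc.start`, `nc.end` and `ent.start`, `ent.end`, `ent.label_`. The name `group_by_noun_chunk` and all offsets in the example are made up for illustration.

```python
from collections import Counter


def group_by_noun_chunk(noun_chunks, entities):
    """Group entity labels by the noun chunk that contains them.

    noun_chunks: list of (start, end) token offsets, end exclusive.
    entities: list of (start, end, label) token offsets, end exclusive.
    Returns one tuple of labels per noun chunk that holds at least one entity.
    """
    groups = []
    for nc_start, nc_end in noun_chunks:
        labels = tuple(
            label
            for start, end, label in entities
            if nc_start <= start and end <= nc_end
        )
        # Append inside the loop so that every noun chunk is recorded
        if labels:
            groups.append(labels)
    return groups


# Made-up spans for one sentence
example_chunks = [(0, 1), (2, 4), (5, 9), (12, 13)]
example_entities = [(0, 1, "GPE"), (5, 6, "CARDINAL"), (6, 7, "NORP"), (12, 13, "GPE")]
groups = group_by_noun_chunk(example_chunks, example_entities)
print(Counter(groups).most_common())
```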

## **3rd Problem**- One of the possible post-processing steps is to fix segmentation errors. Write a function that extends the entity span to cover the full noun-compounds. Make use of compound dependency relation.

For this problem my logic is the following: if a token has a 'compound' dependency, check whether it belongs to the same entity as its head; if it does not, assign it the head's entity type to extend the span. If the token has no compound dependency, it is kept unchanged. I arrived at this solution by trial and error, and I am not sure that it is correct.

The child token must have a compound dependency relation with its parent (head).

In [74]:
def seg(sentences_data):
  """
  Parameters
  ----------
  sentences_data : list of sentences, e.g. test_data_sent

  Returns
  -------
  list of (token, tag) pairs per sentence, with entity spans extended over compounds.
  """
  hypp=[]
  for i in range(len(sentences_data)):
    doc=nlp(sentences_data[i])
    d=[]
    for token in doc:
      if (token.ent_type_=='') and (token.dep_=='compound')  and (token.head.ent_type_!=''):
        token.ent_type_ = token.head.ent_type_
        if (token.i < token.head.i): # the compound token precedes its head, so it opens the span
          d.append((token.text,"B-"+replace_values_in_string(token.ent_type_, dic_ne) if token.head.ent_type_ != "" else "O"))
        else:
          d.append((token.text, "I-"+replace_values_in_string(token.ent_type_, dic_ne) if token.ent_type_ != "" else "O"))   
      else: #use without making change
        d.append((token.text,token.ent_iob_ + '-' + replace_values_in_string(token.ent_type_, dic_ne) if token.ent_type_ != "" else "O"))
    hypp.append(d)
  return hypp


In [75]:
hypp=seg(test_data_sent)

In [76]:
hypp[2]

[('AL-AIN', 'B-LOC'),
 (',', 'O'),
 ('United', 'B-LOC'),
 ('Arab', 'I-LOC'),
 ('Emirates', 'I-LOC'),
 ('1996-12-06', 'B-MISC')]

In [77]:
ref[2]

[('AL-AIN', 'B-LOC'),
 (',', 'O'),
 ('United', 'B-LOC'),
 ('Arab', 'I-LOC'),
 ('Emirates', 'I-LOC'),
 ('1996-12-06', 'O')]

In [78]:
results = conll.evaluate(ref, hypp)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

Unnamed: 0,p,r,f,s
PER,0.625,0.577,0.6,1617
ORG,0.422,0.298,0.349,1661
MISC,0.093,0.553,0.159,702
LOC,0.746,0.668,0.705,1668
total,0.352,0.519,0.419,5648


In [79]:
results = conll.evaluate(ref, hyp)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

Unnamed: 0,p,r,f,s
PER,0.724,0.577,0.642,1617
ORG,0.441,0.297,0.355,1661
MISC,0.094,0.551,0.161,702
LOC,0.771,0.668,0.716,1668
total,0.368,0.518,0.43,5648


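As a result, the segmentation fix did not improve the scores: total F1 drops slightly from 0.430 to 0.419, and precision falls for PER (0.724 to 0.625) and LOC (0.771 to 0.746), while recall stays almost the same. A likely reason is that a compound token placed before its head is tagged B- while the head keeps its own B- tag, so the extended span is still evaluated as two separate entities.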
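
One way to keep an extended span in one piece is to attach the compound tokens first and only then re-derive the IOB prefixes, so that a head which used to open the entity becomes I- when a compound token is attached directly before it. The sketch below shows this on plain dictionaries instead of spaCy tokens. It is a simplified illustration under stated assumptions, not a tested replacement for seg(): the function name, the dictionary keys and the parse in the example are hypothetical, and its effect on the scores has not been measured.

```python
def extend_with_compounds(tokens):
    """Extend entity spans over compound tokens and return IOB tags.

    tokens: list of dicts with keys
      'iob'  : 'B', 'I' or 'O' as predicted by the NER,
      'type' : entity type ('' if none),
      'dep'  : dependency label,
      'head' : index of the head token in the same list.
    """
    types = [t["type"] for t in tokens]
    added = [False] * len(tokens)
    # Repeat until stable so that chains of compounds are covered as well
    changed = True
    while changed:
        changed = False
        for i, tok in enumerate(tokens):
            if not types[i] and tok["dep"] == "compound" and types[tok["head"]]:
                types[i] = types[tok["head"]]
                added[i] = True
                changed = True

    tags = []
    for i, tok in enumerate(tokens):
        if not types[i]:
            tags.append("O")
            continue
        prev_same = i > 0 and types[i - 1] == types[i]
        if added[i]:
            # Simplification: an attached token joins a same-type neighbour on its left
            prefix = "I" if prev_same else "B"
        elif tok["iob"] == "B" and prev_same and added[i - 1]:
            # The entity now starts at the attached compound token
            prefix = "I"
        else:
            prefix = tok["iob"]
        tags.append(f"{prefix}-{types[i]}")
    return tags


# Hypothetical parse of "the World Cup", with "Cup" alone recognized as MISC
example_tokens = [
    {"text": "the", "iob": "O", "type": "", "dep": "det", "head": 2},
    {"text": "World", "iob": "O", "type": "", "dep": "compound", "head": 2},
    {"text": "Cup", "iob": "B", "type": "MISC", "dep": "pobj", "head": 2},
]
extended = extend_with_compounds(example_tokens)
print(extended)
```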