# **Create Annotation Files**

## **Author:** Gema De Vargas Romero

## **Master Thesis:** "Development of a Named Entity Recognition System to automatically assign tumor morphology entity mentions to health-related documents in Spanish." 

The aim of this notebook is to construct the annotation files for the predictions. For every machine learning method employed, it will:

1. take the predicted labels for each clinical case
2. construct the corresponding BRAT annotations
3. store the annotations in one file per clinical case of the test dataset, using the same name as the corresponding txt file
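
As a point of reference for step 2, each text-bound line of a BRAT `.ann` file has the form `ID<TAB>TYPE START END<TAB>TEXT`, where `END` is exclusive. The following is a minimal sketch of that format; `brat_line` is a hypothetical helper that is not part of this notebook, and the example values are taken from the output shown further down.

```python
def brat_line(entity_id, entity_type, start, end, text):
    """Format one text-bound BRAT annotation (the end offset is exclusive)."""
    return "T{}\t{} {} {}\t{}\n".format(entity_id, entity_type, start, end, text)


# Example values copied from the first annotation of the first clinical case.
line = brat_line(1, "MORFOLOGIA_NEOPLASIA", 767, 779, "hipernefroma")
print(repr(line))
```

Because the end offset is exclusive, `END - START` equals the length of the annotated text.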

In [None]:
from google.colab import drive
drive.mount("/content/drive/")

path='drive/My Drive/Ejemplos NER - TFM/'
!ls 'drive/My Drive/Ejemplos NER - TFM/'

Mounted at /content/drive/
 bert
 data
 dev_set
 dev_set2
'Dictionary based NER (spacy).ipynb'
'Ehealth_Dictionary based NER (spacy).ipynb'
 last_step_cantemist.ipynb
 last_step_cantemist_TEST.ipynb
 NER_by_BERT_Cantemist_BIOESV.ipynb
 NER_by_BERT_Cantemist_Competicion.ipynb
 NER_by_BERT_Cantemist.ipynb
 NER_by_BI_LSTM_CRF_Cantemist_BIOESV_2.ipynb
 NER_by_BI_LSTM_CRF_Cantemist_BIOESV.ipynb
 NER_by_BI_LSTM_CRF_Cantemist_Competicion.ipynb
 NER_by_BI_LSTM_CRF_Cantemist.ipynb
 NER_by_CRF_C

### **Load libraries**

In [None]:
import pandas as pd
import numpy as np
import pickle as pkl
# Library spacy
!pip install -U spacy 
#!python -m spacy validate
!python -m spacy download es_core_news_lg
import spacy

# nlp = spacy.load("es") # no longer works with updated version of spacy 2.3.1
import es_core_news_lg
nlp = es_core_news_lg.load()

Collecting spacy
  Downloading https://files.pythonhosted.org/packages/10/b5/c7a92c7ce5d4b353b70b4b5b4385687206c8b230ddfe08746ab0fd310a3a/spacy-2.3.2-cp36-cp36m-manylinux1_x86_64.whl (9.9MB)
Collecting thinc==7.4.1
  Downloading https://files.pythonhosted.org/packages/10/ae/ef3ae5e93639c0ef8e3eb32e3c18341e511b3c515fcfc603f4b808087651/thinc-7.4.1-cp36-cp36m-manylinux1_x86_64.whl (2.1MB)
Installing collected packages: thinc, spacy
  Found existing installation: thinc 7.4.0
    Uninstalling thinc-7.4.0:
      Successfully uninstalled thinc-7.4.0
  Found existing installation: spacy 2.2.4
    Uninstalling spacy-2.2.4:
      Successfully uninstalled spacy-2.2.4
Successfully installed spacy-2.3.2 thinc-7.4.1
Collecting es_core_news_lg==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_lg-2.3.1/es_core_news_lg-2.3.1.

### **Read the names of the files**

In [None]:
# read the names of the txt files (the with block closes the file on exit)
with open(path+'data/files_txt_test', 'rb') as file:
  files_txt_test = pkl.load(file)


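For simplicity, the prediction files of every method are loaded into variables with the same names (`new_tokens_cc`, `new_labels_cc` and `new_start_pos_cc`). Reusing the names avoids repeating the code that constructs the annotation files once per method.

This way, only one of the following "Read ... prediction files" sections should be run before the Results section, depending on the method whose annotation files are being generated.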
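
The three pickle files follow the same naming in every method folder, so the loading step could be sketched as a single helper. `load_predictions` is a hypothetical function, not part of the original notebook; it only assumes the file names used in the cells below.

```python
import os
import pickle as pkl


def load_predictions(predictions_dir):
    """Load the tokens, labels and start positions pickled in one directory."""
    loaded = []
    for name in ("new_tokens_cc", "new_labels_cc", "new_start_pos_cc"):
        # The with block closes the file, so no explicit close() is needed.
        with open(os.path.join(predictions_dir, name), "rb") as f:
            loaded.append(pkl.load(f))
    return tuple(loaded)
```

With the `path` variable of this notebook, a call would look like `new_tokens_cc, new_labels_cc, new_start_pos_cc = load_predictions(path + 'results_CRF/predictions')`.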

### **Read CRF prediction files**

In [None]:
with open(path+'results_CRF/predictions/new_tokens_cc', 'rb') as file:
  new_tokens_cc = pkl.load(file)

with open(path+'results_CRF/predictions/new_labels_cc', 'rb') as file:
  new_labels_cc = pkl.load(file)

with open(path+'results_CRF/predictions/new_start_pos_cc', 'rb') as file:
  new_start_pos_cc = pkl.load(file)

In [None]:
print("Number of sentences in complete set: %d" %len(new_tokens_cc))

Number of sentences in complete set: 5232


In [None]:
# Example:
for token, label, new_start in zip(new_tokens_cc[0][0], new_labels_cc[0][0], new_start_pos_cc[0][0]):
    print("{}\t{}\t{}".format(label, token,new_start))

O	paciente	0
O	mujer	9
O	,	14
O	75	16
O	anos	19
O	consulta	24
O	el	33
O	4-6-2003	36
O	,	44
O	refiriendo	46
O	como	57
O	antecedentes	62
O	personales	75
O	:	85
O	alergia	87
O	a	95
O	salicilatos	97
O	.	108


### **Read BILSTM approach 1 prediction files**

In [None]:
with open(path+'results_BILSTM_ap1/predictions/new_tokens_cc', 'rb') as file:
  new_tokens_cc = pkl.load(file)

with open(path+'results_BILSTM_ap1/predictions/new_labels_cc', 'rb') as file:
  new_labels_cc = pkl.load(file)

with open(path+'results_BILSTM_ap1/predictions/new_start_pos_cc', 'rb') as file:
  new_start_pos_cc = pkl.load(file)

In [None]:
print("Number of sentences in complete set: %d" %len(new_tokens_cc))

Number of sentences in complete set: 5232


In [None]:
# Example:
for token, label, new_start in zip(new_tokens_cc[0][0], new_labels_cc[0][0], new_start_pos_cc[0][0]):
    print("{}\t{}\t{}".format(label, token,new_start))

O	Paciente	0
O	mujer	9
O	,	14
O	75	16
O	años	19
O	consulta	24
O	el	33
O	4-6-2003	36
O	,	44
O	refiriendo	46
O	como	57
O	antecedentes	62
O	personales	75
O	:	85
O	Alergia	87
O	a	95
O	salicilatos	97
O	.	108


### **Read BILSTM approach 2 prediction files**

NOTE: the predictions of this approach were stored in different folders based on the subset of files.
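
Since every subset holds a list of clinical cases, the complete set is simply the concatenation of the subsets in order. A minimal sketch of that step, with a hypothetical helper `concat_subsets` and toy data standing in for the real pickles:

```python
from itertools import chain


def concat_subsets(subsets):
    """Concatenate per-subset lists of clinical cases, preserving their order."""
    return list(chain.from_iterable(subsets))


# Toy data: each clinical case is a list of sentences, each sentence a list of tokens.
subset1 = [[["Paciente", "mujer"]]]        # one clinical case
subset2 = [[["A", "los"]], [["fiebre"]]]   # two clinical cases
all_cases = concat_subsets([subset1, subset2])
print(len(all_cases))  # 3
```

The same helper would be applied separately to the tokens, the labels and the start positions, so that the three lists stay aligned.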

In [None]:
!ls 'drive/My Drive/Ejemplos NER - TFM/results_BILSTM_ap2/predictions/'


subset1  subset2  subset3  subset4  subset5  subset6


In [None]:
with open(path+'results_BILSTM_ap2/predictions/subset1/new_tokens_cc', 'rb') as file:
  new_tokens_cc1 = pkl.load(file)

with open(path+'results_BILSTM_ap2/predictions/subset1/new_labels_cc', 'rb') as file:
  new_labels_cc1 = pkl.load(file)

with open(path+'results_BILSTM_ap2/predictions/subset1/new_start_pos_cc', 'rb') as file:
  new_start_pos_cc1 = pkl.load(file)

In [None]:
with open(path+'results_BILSTM_ap2/predictions/subset2/new_tokens_cc', 'rb') as file:
  new_tokens_cc2 = pkl.load(file)

with open(path+'results_BILSTM_ap2/predictions/subset2/new_labels_cc', 'rb') as file:
  new_labels_cc2 = pkl.load(file)

with open(path+'results_BILSTM_ap2/predictions/subset2/new_start_pos_cc', 'rb') as file:
  new_start_pos_cc2 = pkl.load(file)

In [None]:
with open(path+'results_BILSTM_ap2/predictions/subset3/new_tokens_cc', 'rb') as file:
  new_tokens_cc3 = pkl.load(file)

with open(path+'results_BILSTM_ap2/predictions/subset3/new_labels_cc', 'rb') as file:
  new_labels_cc3 = pkl.load(file)

with open(path+'results_BILSTM_ap2/predictions/subset3/new_start_pos_cc', 'rb') as file:
  new_start_pos_cc3 = pkl.load(file)

In [None]:
with open(path+'results_BILSTM_ap2/predictions/subset4/new_tokens_cc', 'rb') as file:
  new_tokens_cc4 = pkl.load(file)

with open(path+'results_BILSTM_ap2/predictions/subset4/new_labels_cc', 'rb') as file:
  new_labels_cc4 = pkl.load(file)

with open(path+'results_BILSTM_ap2/predictions/subset4/new_start_pos_cc', 'rb') as file:
  new_start_pos_cc4 = pkl.load(file)

In [None]:
with open(path+'results_BILSTM_ap2/predictions/subset5/new_tokens_cc', 'rb') as file:
  new_tokens_cc5 = pkl.load(file)

with open(path+'results_BILSTM_ap2/predictions/subset5/new_labels_cc', 'rb') as file:
  new_labels_cc5 = pkl.load(file)

with open(path+'results_BILSTM_ap2/predictions/subset5/new_start_pos_cc', 'rb') as file:
  new_start_pos_cc5 = pkl.load(file)

In [None]:
print("Number of sentences in subset 1: %d" %len(new_tokens_cc1))
print("Number of sentences in subset 2: %d" %len(new_tokens_cc2))
print("Number of sentences in subset 3: %d" %len(new_tokens_cc3))
print("Number of sentences in subset 4: %d" %len(new_tokens_cc4))
print("Number of sentences in subset 5: %d" %len(new_tokens_cc5))

Number of sentences in subset 1: 1000
Number of sentences in subset 2: 1000
Number of sentences in subset 3: 1000
Number of sentences in subset 4: 1000
Number of sentences in subset 5: 1232


In [None]:
new_tokens_cc = new_tokens_cc1.copy()
new_tokens_cc.extend(new_tokens_cc2)
new_tokens_cc.extend(new_tokens_cc3)
new_tokens_cc.extend(new_tokens_cc4)
new_tokens_cc.extend(new_tokens_cc5)

new_labels_cc = new_labels_cc1.copy()
new_labels_cc.extend(new_labels_cc2)
new_labels_cc.extend(new_labels_cc3)
new_labels_cc.extend(new_labels_cc4)
new_labels_cc.extend(new_labels_cc5)

new_start_pos_cc = new_start_pos_cc1.copy()
new_start_pos_cc.extend(new_start_pos_cc2)
new_start_pos_cc.extend(new_start_pos_cc3)
new_start_pos_cc.extend(new_start_pos_cc4)
new_start_pos_cc.extend(new_start_pos_cc5)

In [None]:
print("Number of sentences in subset 1: %d" %len(new_tokens_cc1))
print("Number of sentences in subset 2: %d" %len(new_tokens_cc2))
print("Number of sentences in subset 3: %d" %len(new_tokens_cc3))
print("Number of sentences in subset 4: %d" %len(new_tokens_cc4))
print("Number of sentences in subset 5: %d" %len(new_tokens_cc5))
print("Number of sentences in complete set: %d" %len(new_tokens_cc))

Number of sentences in subset 1: 1000
Number of sentences in subset 2: 1000
Number of sentences in subset 3: 1000
Number of sentences in subset 4: 1000
Number of sentences in subset 5: 1232
Number of sentences in complete set: 5232


In [None]:
# Example:
for token, label, new_start in zip(new_tokens_cc[0][0], new_labels_cc[0][0], new_start_pos_cc[0][0]):
    print("{}\t{}\t{}".format(label, token,new_start))

O	Paciente	0
O	mujer	9
O	,	14
O	75	16
O	años	19
O	consulta	24
O	el	33
O	4-6-2003	36
O	,	44
O	refiriendo	46
O	como	57
O	antecedentes	62
O	personales	75
O	:	85
O	Alergia	87
O	a	95
O	salicilatos	97
O	.	108


### **Read BILSTM approach 3 prediction files**

NOTE: the predictions of this approach were stored in different folders based on the subset of files.

In [None]:
!ls 'drive/My Drive/Ejemplos NER - TFM/results_BILSTM_ap3/predictions/'


subset1  subset2  subset3  subset4  subset5


In [None]:
with open(path+'results_BILSTM_ap3/predictions/subset1/new_tokens_cc', 'rb') as file:
  new_tokens_cc1 = pkl.load(file)

with open(path+'results_BILSTM_ap3/predictions/subset1/new_labels_cc', 'rb') as file:
  new_labels_cc1 = pkl.load(file)

with open(path+'results_BILSTM_ap3/predictions/subset1/new_start_pos_cc', 'rb') as file:
  new_start_pos_cc1 = pkl.load(file)

In [None]:
with open(path+'results_BILSTM_ap3/predictions/subset2/new_tokens_cc', 'rb') as file:
  new_tokens_cc2 = pkl.load(file)

with open(path+'results_BILSTM_ap3/predictions/subset2/new_labels_cc', 'rb') as file:
  new_labels_cc2 = pkl.load(file)

with open(path+'results_BILSTM_ap3/predictions/subset2/new_start_pos_cc', 'rb') as file:
  new_start_pos_cc2 = pkl.load(file)

In [None]:
with open(path+'results_BILSTM_ap3/predictions/subset3/new_tokens_cc', 'rb') as file:
  new_tokens_cc3 = pkl.load(file)

with open(path+'results_BILSTM_ap3/predictions/subset3/new_labels_cc', 'rb') as file:
  new_labels_cc3 = pkl.load(file)

with open(path+'results_BILSTM_ap3/predictions/subset3/new_start_pos_cc', 'rb') as file:
  new_start_pos_cc3 = pkl.load(file)

In [None]:
with open(path+'results_BILSTM_ap3/predictions/subset4/new_tokens_cc', 'rb') as file:
  new_tokens_cc4 = pkl.load(file)

with open(path+'results_BILSTM_ap3/predictions/subset4/new_labels_cc', 'rb') as file:
  new_labels_cc4 = pkl.load(file)

with open(path+'results_BILSTM_ap3/predictions/subset4/new_start_pos_cc', 'rb') as file:
  new_start_pos_cc4 = pkl.load(file)

In [None]:
with open(path+'results_BILSTM_ap3/predictions/subset5/new_tokens_cc', 'rb') as file:
  new_tokens_cc5 = pkl.load(file)

with open(path+'results_BILSTM_ap3/predictions/subset5/new_labels_cc', 'rb') as file:
  new_labels_cc5 = pkl.load(file)

with open(path+'results_BILSTM_ap3/predictions/subset5/new_start_pos_cc', 'rb') as file:
  new_start_pos_cc5 = pkl.load(file)

In [None]:
print("Number of sentences in subset 1: %d" %len(new_tokens_cc1))
print("Number of sentences in subset 2: %d" %len(new_tokens_cc2))
print("Number of sentences in subset 3: %d" %len(new_tokens_cc3))
print("Number of sentences in subset 4: %d" %len(new_tokens_cc4))
print("Number of sentences in subset 5: %d" %len(new_tokens_cc5))

Number of sentences in subset 1: 1000
Number of sentences in subset 2: 1000
Number of sentences in subset 3: 1000
Number of sentences in subset 4: 1000
Number of sentences in subset 5: 1232


In [None]:
new_tokens_cc = new_tokens_cc1.copy()
new_tokens_cc.extend(new_tokens_cc2)
new_tokens_cc.extend(new_tokens_cc3)
new_tokens_cc.extend(new_tokens_cc4)
new_tokens_cc.extend(new_tokens_cc5)

new_labels_cc = new_labels_cc1.copy()
new_labels_cc.extend(new_labels_cc2)
new_labels_cc.extend(new_labels_cc3)
new_labels_cc.extend(new_labels_cc4)
new_labels_cc.extend(new_labels_cc5)

new_start_pos_cc = new_start_pos_cc1.copy()
new_start_pos_cc.extend(new_start_pos_cc2)
new_start_pos_cc.extend(new_start_pos_cc3)
new_start_pos_cc.extend(new_start_pos_cc4)
new_start_pos_cc.extend(new_start_pos_cc5)

In [None]:
print("Number of sentences in subset 1: %d" %len(new_tokens_cc1))
print("Number of sentences in subset 2: %d" %len(new_tokens_cc2))
print("Number of sentences in subset 3: %d" %len(new_tokens_cc3))
print("Number of sentences in subset 4: %d" %len(new_tokens_cc4))
print("Number of sentences in subset 5: %d" %len(new_tokens_cc5))
print("Number of sentences in complete set: %d" %len(new_tokens_cc))

Number of sentences in subset 1: 1000
Number of sentences in subset 2: 1000
Number of sentences in subset 3: 1000
Number of sentences in subset 4: 1000
Number of sentences in subset 5: 1232
Number of sentences in complete set: 5232


In [None]:
# Example:
for token, label, new_start in zip(new_tokens_cc[0][1], new_labels_cc[0][1], new_start_pos_cc[0][1]):
    print("{}\t{}\t{}".format(label, token,new_start))

O	A	110
O	los	112
O	59	116
O	años	119
O	fué	124
O	diagnosticada	128
O	de	142
O	fiebre	145
O	de	152
O	probable	155
O	etiología	164
O	específica	174
O	,	184
O	tratada	186
O	con	194
O	tuberculostáticos	198
O	,	215
O	según	217
O	pauta	223
O	habitual	229
O	.	237


### **Read BERT prediction files**

In [None]:
with open(path+'results_bert2/predictions/new_tokens_cc', 'rb') as file:
  new_tokens_cc = pkl.load(file)

with open(path+'results_bert2/predictions/new_labels_cc', 'rb') as file:
  new_labels_cc = pkl.load(file)

with open(path+'results_bert2/predictions/new_start_pos_cc', 'rb') as file:
  new_start_pos_cc = pkl.load(file)

In [None]:
print("Number of sentences in complete set: %d" %len(new_tokens_cc))

Number of sentences in complete set: 5232


In [None]:
# Example:
for token, label, new_start in zip(new_tokens_cc[0][0], new_labels_cc[0][0], new_start_pos_cc[0][0]):
    print("{}\t{}\t{}".format(label, token,new_start))

O	Paciente	0
O	mujer	9
O	,	14
O	75	16
O	años	19
O	consulta	24
O	el	33
O	4	36
O	-	36
O	6	36
O	-	36
O	2003	36
O	,	44
O	refiriendo	46
O	como	57
O	antecedentes	62
O	personales	75
O	:	85
O	Alergia	87
O	a	95
O	salicilatos	97
O	.	108


### **Read test files**

In [None]:
# TEST FILES:
with open(path+'data/sentences_test', 'rb') as file:
  sentences_test = pkl.load(file)

with open(path+'data/sentences_test_by_cc', 'rb') as file:
  sentences_test_by_cc = pkl.load(file)

df_data_test = pd.read_csv(path+'data/df_data_test.csv')

## **Results**

The cells in this section must be executed after reading the prediction files of one method.
For example, read the CRF prediction files and then execute these cells; repeat the process for every other method.

### **Construct the results dataframe**


In [None]:
print("Number of files in the dataset: %d" %len(new_tokens_cc))

file_index = 0
sent_index = 0
data_words = []

for cc in range(len(new_tokens_cc)): # iterate over clinical cases
  file_index = file_index + 1

  for sent in range(len(new_tokens_cc[cc])): # iterate over sentences of a clinical case
    sent_index = sent_index + 1
    for word in range(len(new_tokens_cc[cc][sent])):
      data_words.append((file_index,sent_index,new_tokens_cc[cc][sent][word], new_start_pos_cc[cc][sent][word], 
                         new_labels_cc[cc][sent][word]))

df_data_results = pd.DataFrame(data_words, columns = ["File_Index","Sentence_Index", "Word", "New_Start_Char_position", "New_Label"])

Number of files in the dataset: 5232


In [None]:
df_data_results

Unnamed: 0,File_Index,Sentence_Index,Word,New_Start_Char_position,New_Label
0,1,1,Paciente,0,O
1,1,1,mujer,9,O
2,1,1,",",14,O
3,1,1,75,16,O
4,1,1,años,19,O
...,...,...,...,...,...
2104646,5232,88839,433,6226,O
2104647,5232,88839,U,6230,O
2104648,5232,88839,/,6231,O
2104649,5232,88839,ml,6232,O


BERT tokenization can produce sentences with more or fewer tokens than the original tokenization. However, merging on the new start character position is still reliable. The original tokens that have no counterpart in the BERT output carry no semantic weight, so it is not crucial for them to be merged: they end up with a missing label, which is treated like 'O' when the annotations are built.

On the other hand, the BERT tokenizer may have split a word into several subwords that share the same start position (see the date 4-6-2003 in the BERT example above). In that case the original word is matched once per subword, which is why the merged dataframe has more rows than the original one and why duplicates are dropped after the merge.
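
The alignment logic can be illustrated without pandas. The sketch below is a pure-Python stand-in for a left join on the start position followed by keeping the first match; `align_labels` is a hypothetical helper, and it works on a single sentence so the file and sentence indices are left out of the key.

```python
def align_labels(original_tokens, predicted_tokens):
    """Left-join original tokens with predicted labels on the start offset.

    original_tokens: list of (word, start)
    predicted_tokens: list of (subword, start, label)
    Returns one (word, start, label) per original token; label is None if unmatched.
    """
    first_label = {}
    for _, start, label in predicted_tokens:
        # Several subwords may share a start offset: keep the first one only.
        first_label.setdefault(start, label)
    return [(word, start, first_label.get(start)) for word, start in original_tokens]


original = [("el", 33), ("4-6-2003", 36), (",", 44)]
predicted = [("el", 33, "O"), ("4", 36, "O"), ("-", 36, "O"),
             ("6", 36, "O"), ("-", 36, "O"), ("2003", 36, "O")]
aligned = align_labels(original, predicted)
print(aligned)
```

The date is matched by five subwords but appears once in the result, and the comma, which has no prediction in this toy input, keeps an empty label.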

#### **Original dataframe**

In [None]:
df_data_test

Unnamed: 0,File_Index,Sentence_Index,Word,POS,Start_Char_position
0,1,1,Paciente,PROPN,0
1,1,1,mujer,NOUN,9
2,1,1,",",PUNCT,14
3,1,1,75,NUM,16
4,1,1,años,NOUN,19
...,...,...,...,...,...
2167543,5232,88839,433,NUM,6226
2167544,5232,88839,U,PROPN,6230
2167545,5232,88839,/,PUNCT,6231
2167546,5232,88839,ml,NOUN,6232


### **Merge the dataframe of the results with the original dataframe**

The aim of this operation is to obtain the position of each token in the whole clinical case.

In [None]:
df_data_results2 = df_data_test.merge(df_data_results[['File_Index','Sentence_Index','Word','New_Start_Char_position','New_Label']],
                                      how = 'left', left_on=["File_Index","Sentence_Index", "Start_Char_position"], 
                                      right_on=["File_Index","Sentence_Index","New_Start_Char_position"], suffixes=('','_new'))

In [None]:
df_data_results2

Unnamed: 0,File_Index,Sentence_Index,Word,POS,Start_Char_position,Word_new,New_Start_Char_position,New_Label
0,1,1,Paciente,PROPN,0,Paciente,0.0,O
1,1,1,mujer,NOUN,9,mujer,9.0,O
2,1,1,",",PUNCT,14,",",14.0,O
3,1,1,75,NUM,16,75,16.0,O
4,1,1,años,NOUN,19,años,19.0,O
...,...,...,...,...,...,...,...,...
2233130,5232,88839,433,NUM,6226,433,6226.0,O
2233131,5232,88839,U,PROPN,6230,U,6230.0,O
2233132,5232,88839,/,PUNCT,6231,/,6231.0,O
2233133,5232,88839,ml,NOUN,6232,ml,6232.0,O


In [None]:
df_data_results2_n = df_data_results2.drop_duplicates(['File_Index','Sentence_Index','Start_Char_position'], keep = 'first')
df_data_results2_n = df_data_results2_n[['File_Index','Sentence_Index','Word','Start_Char_position','New_Label']]
#df_data_results2_n = df_data_results2_n.dropna()

In [None]:
df_data_results2_n

Unnamed: 0,File_Index,Sentence_Index,Word,Start_Char_position,New_Label
0,1,1,Paciente,0,O
1,1,1,mujer,9,O
2,1,1,",",14,O
3,1,1,75,16,O
4,1,1,años,19,O
...,...,...,...,...,...
2233130,5232,88839,433,6226,O
2233131,5232,88839,U,6230,O
2233132,5232,88839,/,6231,O
2233133,5232,88839,ml,6232,O


In [None]:
num_BMOR = len(df_data_results2_n[df_data_results2_n['New_Label']=='B-MOR'])
num_IMOR = len(df_data_results2_n[df_data_results2_n['New_Label']=='I-MOR'])
num_EMOR = len(df_data_results2_n[df_data_results2_n['New_Label']=='E-MOR'])
num_SMOR = len(df_data_results2_n[df_data_results2_n['New_Label']=='S-MOR'])
num_VMOR = len(df_data_results2_n[df_data_results2_n['New_Label']=='V-MOR'])

print("Number of B-MOR entities: %d" %num_BMOR)
print("Number of I-MOR entities: %d" %num_IMOR)
print("Number of E-MOR entities: %d" %num_EMOR)
print("Number of S-MOR entities: %d" %num_SMOR)
print("Number of V-MOR entities: %d" %num_VMOR)

print("\nTotal number of identified entity words: %d" %(num_BMOR + num_IMOR + num_EMOR + num_SMOR + num_VMOR))

Number of B-MOR entities: 5297
Number of I-MOR entities: 8172
Number of E-MOR entities: 5202
Number of S-MOR entities: 7628
Number of V-MOR entities: 0

Total number of identified entity words: 26299


In [None]:
Number of B-MOR entities: 5840
Number of I-MOR entities: 8674
Number of E-MOR entities: 5820
Number of S-MOR entities: 7616
Number of V-MOR entities: 0

Total number of identified entity words: 27950

In [None]:
class sentence(object):
    """Group the tokens of one clinical case into sentences of (word, start, label) tuples."""
    def __init__(self, df):
        self.n_sent = 0  # position of the next sentence returned by get_text
        self.df = df
        self.empty = False
        agg = lambda s : [(w, p, l) for w, p, l in zip(s['Word'].values.tolist(),
                                                       s['Start_Char_position'].values.tolist(),
                                                       s['New_Label'].values.tolist())]
        self.grouped = self.df.groupby("Sentence_Index").apply(agg)
        self.sentences = [s for s in self.grouped]

    def get_text(self):
        # Return the next sentence, or None once all sentences have been returned.
        if self.n_sent >= len(self.sentences):
            return None
        s = self.sentences[self.n_sent]
        self.n_sent += 1
        return s

In [None]:
# by clinical case
getter_by_cc = df_data_results2_n.groupby("File_Index").apply(sentence)

sentences_by_cc = []

for getter_i in getter_by_cc: # iterating over all the files
  sentences_by_cc.append(getter_i.sentences)

  

['T1', 'MORFOLOGIA_NEOPLASIA 2719 2740', 'Carcinoma microcítico\n']

In [None]:
ann = []
for cc_sentences in sentences_by_cc: # iterate over clinical cases
  entity_counter = 0
  ann_f = []
  for sent in cc_sentences: # iterate over the sentences of a clinical case
    i = 0
    while i < len(sent):
      word, start_pos, label = sent[i]
      i += 1
      # tokens without a label (NaN after the merge) are treated like 'O'
      if str(label) == 'nan' or label == 'O':
        continue

      # a labelled token opens a new entity
      entity_counter += 1
      text = word
      end_pos = start_pos + len(word) # exclusive end offset

      # absorb the following tokens until an 'O', a missing label or a new 'B-MOR'
      while i < len(sent):
        next_word, _, next_label = sent[i]
        if str(next_label) == 'nan' or next_label == 'O' or next_label == 'B-MOR':
          break
        # NOTE: tokens are assumed to be separated by exactly one space
        text = text + ' ' + next_word
        end_pos = end_pos + 1 + len(next_word)
        i += 1

      ann_f.append(['T'+str(entity_counter),'MORFOLOGIA_NEOPLASIA '+str(start_pos)+' '+str(end_pos),text+'\n'])

  ann.append(ann_f)

In [None]:
ann[2]

[['T1', 'MORFOLOGIA_NEOPLASIA 659 669', 'neoplásica\n'],
 ['T2', 'MORFOLOGIA_NEOPLASIA 779 801', 'ADK próstata Gleason 6\n']]
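
Because the entity text is rebuilt by joining tokens with a single space, it may be worth checking the spans against the original document. The following is a minimal sketch, assuming the text of the clinical case is available as a string; `check_offsets` and the toy document are hypothetical and not part of the notebook.

```python
def check_offsets(document_text, annotations):
    """Return the ids of the annotations whose text differs from document_text[start:end]."""
    mismatches = []
    for ann_id, type_and_span, text in annotations:
        _, start, end = type_and_span.split(" ")
        if document_text[int(start):int(end)] != text.rstrip("\n"):
            mismatches.append(ann_id)
    return mismatches


# Toy document with a double space inside the second entity.
doc = "Lesion tumoral: carcinoma  ductal"
annotations = [
    ["T1", "MORFOLOGIA_NEOPLASIA 7 14", "tumoral\n"],
    ["T2", "MORFOLOGIA_NEOPLASIA 16 32", "carcinoma ductal\n"],
]
print(check_offsets(doc, annotations))  # ['T2']
```

The second annotation is flagged because the two tokens are separated by two spaces in the document, so the span computed under the single-space assumption is one character short.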

In [None]:
sentences_cc = ["".join(["\t".join([word for word in sent]) for sent in cc]) for cc in ann]

In [None]:
sentences_cc[0]

'T1\tMORFOLOGIA_NEOPLASIA 767 779\thipernefroma\nT2\tMORFOLOGIA_NEOPLASIA 1284 1296\thipernefroma\nT3\tMORFOLOGIA_NEOPLASIA 1660 1670\ttumoración\nT4\tMORFOLOGIA_NEOPLASIA 1718 1734\tnefroma quístico\nT5\tMORFOLOGIA_NEOPLASIA 1783 1790\ttumoral\n'

In [None]:
sentences_cc[-1]

'T1\tMORFOLOGIA_NEOPLASIA 393 421\tcarcinoma ductal infiltrante\nT2\tMORFOLOGIA_NEOPLASIA 440 446\tT2N0M0\nT3\tMORFOLOGIA_NEOPLASIA 677 684\ttumoral\nT4\tMORFOLOGIA_NEOPLASIA 714 744\tmetastásicas supraclaviculares\nT5\tMORFOLOGIA_NEOPLASIA 1098 1105\ttumoral\nT6\tMORFOLOGIA_NEOPLASIA 1218 1246\tcarcinoma ductal infiltrante\nT7\tMORFOLOGIA_NEOPLASIA 1464 1483\tmetástasis M1 óseas\nT8\tMORFOLOGIA_NEOPLASIA 1596 1614\tafectación anexial\nT9\tMORFOLOGIA_NEOPLASIA 1884 1910\tcáncer de mama metastásico\nT10\tMORFOLOGIA_NEOPLASIA 1912 1915\tCMm\nT11\tMORFOLOGIA_NEOPLASIA 3350 3397\tCarcinoma de mama ductal infiltrante estadio IV\nT12\tMORFOLOGIA_NEOPLASIA 4171 4178\tgrado I\nT13\tMORFOLOGIA_NEOPLASIA 6198 6209\tmetastásica\n'

In [None]:
files_txt_test[0] 

'drive/My Drive/Ejemplos NER - TFM/test-background-set-to-publish/S0004-06142005000100009-1.txt'

In [None]:
len(sentences_cc)

5232

### **Store the annotation files**

Set `method_name` to the output folder of the method whose predictions were loaded.

In [None]:
method_name = 'results_CRF/ann/'
#method_name = 'results_BILSTM_ap1/ann/'
#method_name = 'results_BILSTM_ap2/ann/'
#method_name = 'results_BILSTM_ap3/ann/'
#method_name = 'results_bert/annotations/'

In [None]:
import os

for cc in range(len(sentences_cc)):
  # keep the name of the txt file and replace its extension with .ann
  file_name = os.path.splitext(os.path.basename(files_txt_test[cc]))[0]
  file_name_full = path + method_name + file_name + '.ann'
  with open(file_name_full, 'w', encoding='utf-8') as file:
    file.write(sentences_cc[cc])
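
As a sanity check, a stored file can be read back and parsed. The sketch below assumes that every entity is a single contiguous span, as produced in this notebook; `parse_ann` is a hypothetical helper, and the sample content is copied from the `ann[2]` output above.

```python
def parse_ann(content):
    """Parse text-bound BRAT lines into (id, type, start, end, text) tuples."""
    entities = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue  # skip empty lines and annotations that are not text-bound
        ann_id, type_and_span, text = line.split("\t", 2)
        ann_type, start, end = type_and_span.split(" ")
        entities.append((ann_id, ann_type, int(start), int(end), text))
    return entities


content = ("T1\tMORFOLOGIA_NEOPLASIA 659 669\tneoplásica\n"
           "T2\tMORFOLOGIA_NEOPLASIA 779 801\tADK próstata Gleason 6\n")
entities = parse_ann(content)
for entity in entities:
    print(entity)
```

For every parsed entity, `end - start` should equal the length of its text, which gives a quick check that the offsets were written with an exclusive end.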
