<a href="https://colab.research.google.com/github/ajtamayoh/NLP-CIC-WFU-Contribution-to-LivingNER-shared-task-2022/blob/main/NLP_CIC_WFU_Contribution_to_LivingNER_shared_task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP-CIC-WFU contribution to LivingNER shared task 2022 

This notebook contains the source code for the paper:

### Multilingual BERT for Spanish Named Entity Recognition and Normalization in Clinical Cases

Authors:

Antonio Tamayo (ajtamayo2019@ipn.cic.mx, ajtamayoh@gmail.com)

Diego A. Burgos (burgosda@wfu.edu)

Alexander Gelbukh (gelbukh@gelbukh.com)

For bugs or questions related to the code, do not hesitate to contact us (Antonio Tamayo: ajtamayoh@gmail.com)

If you use this code please cite our work:



# Requirements

To run this code, first download the dataset from: https://drive.google.com/drive/folders/1Tn7h3RMez23aF-iyD1YLQH4VxMNSq5sf?usp=sharing

Then unzip the download and upload the folder named "Dataset" to the root of your Google Drive.
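Once Drive is mounted, it can help to confirm that the upload landed where the later cells expect it. The sketch below checks for the two paths this notebook reads; the helper name `missing_dataset_paths` is hypothetical, and the Drive root in the usage comment assumes the default Colab mount point.

```python
import os

# Paths (relative to the Drive root) that later cells of this notebook read.
EXPECTED = [
    "Dataset/training_valid_test_background_multilingual/training/subtask1-NER/training_entities_subtask1.tsv",
    "Dataset/training_valid_test_background_multilingual/training/text-files",
]

def missing_dataset_paths(drive_root, expected=EXPECTED):
    """Return the expected paths that do not exist under drive_root."""
    return [p for p in expected if not os.path.exists(os.path.join(drive_root, p))]

# Usage in Colab, after mounting Drive (an empty list means nothing is missing):
# print(missing_dataset_paths("/content/drive/MyDrive"))
```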

## About the infrastructure

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu Jun  9 23:01:05 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    32W / 250W |   8319MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


## Connecting to Google drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Subtask 1 - NER

## Exploring & Preprocessing Data

In [None]:
import pandas as pd
import numpy as np
import spacy

In [None]:
livingner_subtask1_training = pd.read_csv("/content/drive/MyDrive/Dataset/training_valid_test_background_multilingual/training/subtask1-NER/training_entities_subtask1.tsv", delimiter="\t")
livingner_subtask1_training.head()

Unnamed: 0,filename,mark,label,off0,off1,span
0,32032497_ES,T1,HUMAN,112,118,hombre
1,32032497_ES,T2,HUMAN,1025,1033,paciente
2,32032497_ES,T3,HUMAN,1098,1106,paciente
3,32032497_ES,T4,HUMAN,1395,1403,paciente
4,32032497_ES,T5,SPECIES,1075,1084,2019-nCoV


In [None]:
livingner_subtask1_training["label"].unique()

array(['HUMAN', 'SPECIES'], dtype=object)

In [None]:
text_files_path = "/content/drive/MyDrive/Dataset/training_valid_test_background_multilingual/training/text-files"

In [None]:
# Print one clinical case; the context manager closes the file afterwards.
with open(text_files_path + "/" + livingner_subtask1_training.iloc[6,0] + ".txt", "r", encoding="UTF-8") as f:
  for l in f:
    print(l)

﻿El 21 de enero de 2020, ingresó en el Hospital Popular de Wuwei un hombre de 47 años con fiebre sin foco y tos de 7 días de evolución. El paciente refirió que había tenido fiebre (hasta un máximo de 39,3 °C), tos productiva con esputo blanco, congestión nasal, rinorrea, mareo, fatiga, opresión torácica y náuseas, pero sin dolor torácico, irritación de garganta ni problemas respiratorios. Informó que había llegado a la ciudad de Wuwei en coche el 18 de enero, procedente de Wuhan. El paciente tenía antecedentes de hipertensión de grado 2 y diabetes de tipo 2 y era fumador desde los 27 años; no refirió antecedentes de alcoholismo. El 23, 29 y 30 de enero se tomaron frotis nasofaríngeos, de acuerdo con las directrices del CDC. Tras introducir un frotis nasofaríngeo en las fosas nasales, se gira sobre la mucosa nasofaríngea, de 10 a 15 segundos y, luego, se retira; finalmente se inserta en un tubo estéril con un medio de transporte vírico. Las muestras se analizaron mediante RT-PCR. Se det

In [None]:
#Clinical cases
HCs = {}
for fname in livingner_subtask1_training["filename"]:
  with open(text_files_path + "/" + fname + ".txt", "r", encoding="UTF-8") as f:
    HCs.update({fname: f.read()})

In [None]:
len(HCs)

1000

In [None]:
#Entities
ENT = {}
entities = []
fn = livingner_subtask1_training["filename"][0]
for fname, ent, label in zip(livingner_subtask1_training["filename"], livingner_subtask1_training["span"], livingner_subtask1_training["label"]):
    if fname!=fn:
      entities = []
    entities.append((ent,label))
    ENT.update({fname: entities})
    fn = fname

In [None]:
len(ENT)

1000
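The entity loop above relies on all rows of a file being contiguous in the TSV: if a filename reappeared after rows of another file, its earlier entities would be replaced. A grouping that does not depend on row order can be sketched with `dict.setdefault`. The helper name `group_entities` and the toy rows are made up for illustration.

```python
def group_entities(rows):
    """Group (filename, span, label) rows into {filename: [(span, label), ...]}."""
    grouped = {}
    for fname, span, label in rows:
        grouped.setdefault(fname, []).append((span, label))
    return grouped

# Toy rows (made up); note that "doc_A" appears again after "doc_B".
rows = [
    ("doc_A", "paciente", "HUMAN"),
    ("doc_B", "virus", "SPECIES"),
    ("doc_A", "hombre", "HUMAN"),
]
ENT_demo = group_entities(rows)
print(ENT_demo["doc_A"])  # [('paciente', 'HUMAN'), ('hombre', 'HUMAN')]
```

With the DataFrame of this notebook, the rows could be produced by zipping its `filename`, `span`, and `label` columns.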

In [None]:
HCs["32032497_ES"]

'El 1 de enero de 2020, ingresó en el Union Hospital (facultad de medicina Tongji, Wuhan, provincia de Hubei) un hombre de 42 años con hipertermia (39,6 °C), tos y que refería fatiga de una semana de evolución. A la auscultación, se percibieron ruidos respiratorios bilaterales con estertores húmedos en las bases de ambos pulmones. Las pruebas analíticas mostraron leucocitopenia (cifra de leucocitos: 2,88 3 109/L) y linfocitosis (cifra de linfocitos: 0,90 3 109/L). La fórmula leucocitaria mostró un 56,6% de neutrófilos, un 32,1% de linfocitos y un 10,2% de monocitos. Varias pruebas analíticas adicionales dieron resultados anormales, como proteína C-reactiva (158,95 mg/L; intervalo normal: 0-10 mg/L), velocidad de sedimentación globular (38 mm/h; valor normal: 20 mm/h), proteína amiloide A sérica (607,1 mg/L; valor normal: 10 mg/L), aspartato-aminotransferasa (53 U/L; intervalo normal: 8-40 U/L) y alanina-aminotransferasa (60 U/L; intervalo normal: 5-40 U/L). La PCR de fluorescencia en t

In [None]:
ENT["32032497_ES"]

[('hombre', 'HUMAN'),
 ('paciente', 'HUMAN'),
 ('paciente', 'HUMAN'),
 ('paciente', 'HUMAN'),
 ('2019-nCoV', 'SPECIES'),
 ('antivirales', 'SPECIES')]

## Tokenization using SpaCy

In [None]:
from spacy.lang.es import Spanish
nlp = Spanish()
# Create a Tokenizer with the default settings for Spanish
# including punctuation rules and exceptions
tokenizer_spacy = nlp.tokenizer

In [None]:
HCs_tokenized = []
for hc in HCs:
    hl = []
    tokens = tokenizer_spacy(HCs[hc])
    #tokens = HCs[hc].split(" ") #The simplest option. It was not used in our work.
    for t in tokens:
        hl.append(str(t))
    HCs_tokenized.append(hl)

In [None]:
len(HCs_tokenized)

1000

In [None]:
#HCs_tokenized[0]

In [None]:
Ent_tokenized = []
for ent in ENT:
    Tks = []
    for e in ENT[ent]:
      sl = []
      tokens = tokenizer_spacy(e[0])
      #tokens = e.split(" ")
      for t in tokens:
          sl.append(str(t))
      Tks.append((sl,e[1]))
    Ent_tokenized.append(Tks)

In [None]:
len(Ent_tokenized)

1000

In [None]:
Ent_tokenized[4]

[(['enfermera', 'de', 'urgencias'], 'HUMAN'),
 (['hombre'], 'HUMAN'),
 (['médico', 'de', 'urgencias'], 'HUMAN'),
 (['paciente'], 'HUMAN'),
 (['paciente'], 'HUMAN'),
 (['paciente'], 'HUMAN'),
 (['pacientes'], 'HUMAN'),
 (['virus', 'respiratorios'], 'SPECIES'),
 (['personas'], 'HUMAN'),
 (['nCoV-19'], 'SPECIES'),
 (['nCoV-19'], 'SPECIES'),
 (['nCoV-19'], 'SPECIES'),
 (['nCoV-19'], 'SPECIES'),
 (['nCoV-19'], 'SPECIES'),
 (['operador', '1'], 'HUMAN'),
 (['operador', '2'], 'HUMAN'),
 (['operador', '1'], 'HUMAN'),
 (['operador', '2'], 'HUMAN'),
 (['operadores'], 'HUMAN'),
 (['contactos'], 'HUMAN')]

## Tagging Data with IO scheme

In [None]:
def find_idx(list_to_check, item_to_find):
    indices = []
    for idx, value in enumerate(list_to_check):
        if value == item_to_find:
            indices.append(idx)
    return indices

In [None]:
# Generate a list of the alphabet in Python with a for loop
Calphabet = []
alphabet = []
for i in range(97, 123):
    Calphabet.append(chr(i).upper())
    alphabet.append(chr(i))
print(alphabet)
print(Calphabet)

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']


In [None]:
import sys
numbers = ['1','2','3','4','5','6','7','8','9','0']
# Tokens too ambiguous to tag on their own when they occur more than once in a case.
connectors = ['-', 'de', 'del', 'y', 'la', 'las', 'los']
labels_tokenized = []
for idx, (hct, et) in enumerate(zip(HCs_tokenized, Ent_tokenized)):
    # Labels: 0->'O'; 1->'Species'; 2->'Human'
    labels = [0] * len(hct)

    # For Entities
    for enf in et:
        tag = 1 if enf[1] == 'SPECIES' else 2
        for e in enf[0]:
            indices = find_idx(hct, e)
            if len(indices) == 0:
                # Entity token not found among the tokens of the clinical case.
                # Two known tokenization mismatches are skipped; anything else stops the run.
                if e == "VIH)-1" or e == "VIH)-1/2":
                    continue
                print(hct)
                print(et)
                print(enf)
                print(e)
                print(idx)
                sys.exit(0)
            if len(indices) > 1:
                for i in indices:
                    # Tag an ambiguous token only if the preceding token is already tagged.
                    if (e in connectors or e in numbers or e in Calphabet) and labels[i-1] == 0:
                        continue
                    labels[i] = tag
            else:
                labels[indices[0]] = tag

    labels_tokenized.append(labels)
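To make the matching rule concrete, here is a compact sketch of the same heuristic applied to a made-up token list. Every occurrence of an entity token is tagged, except that connectors, digits, and single capital letters occurring more than once are tagged only when the preceding token is already tagged. The function name `tag_tokens` and the example are illustrative; unlike the cell above, tokens that are not found are simply ignored here.

```python
AMBIGUOUS = (
    {"-", "de", "del", "y", "la", "las", "los"}
    | set("0123456789")
    | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
)

def tag_tokens(tokens, entities):
    """Label tokens with 0 (O), 1 (Species) or 2 (Human) by string matching."""
    labels = [0] * len(tokens)
    for ent_tokens, ent_label in entities:
        tag = 1 if ent_label == "SPECIES" else 2
        for e in ent_tokens:
            positions = [i for i, t in enumerate(tokens) if t == e]
            for i in positions:
                # With several matches, an ambiguous token is tagged only
                # when the token before it already carries a tag.
                if len(positions) > 1 and e in AMBIGUOUS and labels[i - 1] == 0:
                    continue
                labels[i] = tag
    return labels

tokens = ["servicio", "de", "urgencias", "un", "médico", "de", "urgencias"]
entities = [(["médico", "de", "urgencias"], "HUMAN")]
demo_labels = tag_tokens(tokens, entities)
print(demo_labels)  # [0, 0, 2, 0, 2, 2, 2]
```

In this toy case the first `urgencias` is tagged although it is not part of the annotated mention, because matching is done on token strings rather than on the `off0`/`off1` character offsets.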

In [None]:
livingner_subtask1_training["filename"].unique()[3]

'32161941_ES'

In [None]:
HCs["32161941_ES"]

'La madre del paciente neonatal es una mujer embarazada de 34 años que vive cerca del mercado mayorista de marisco de Huanan (a unos 1,2 km de distancia), en Wuhan. No ha visitado el mercado durante su embarazo y su familia no tiene casos confirmados ni presuntos de COVID-19, pero en la misma comunidad en la que vive se han diagnosticado más de 15 personas. Tiene antecedentes de hipotiroidismo de 4 años de evolución y se ha tratado con fármacos por vía oral; no tiene antecedentes de hipertensión, diabetes ni cardiopatías. Tuvo un aborto en 2016 a causa de alteraciones cromosómicas. Es alérgica a la penicilina y a las cefalosporinas de primera generación (positivo en pruebas cutáneas).\n\nA las 20:00 h del 1 de febrero de 2020, la mujer, de 40 semanas de gestación, presentó una pequeña hemorragia vaginal y dolor en la región abdominal inferior. Dos horas después, presentó fiebre (37,8 °C) y acudió al centro de asistencia maternoinfantil de Wuhan. Como tenía fiebre, fue derivada al consu

In [None]:
#HCs_tokenized[40]

In [None]:
j = 4
for i in range(len(HCs_tokenized[j])):
  print(str(HCs_tokenized[j][i]) + "\t" + str(labels_tokenized[j][i]))

En	0
nuestro	0
servicio	0
de	0
urgencias	2
se	0
presentó	0
un	0
hombre	2
de	2
52	0
años	0
que	0
reportó	0
fiebre	0
,	0
tos	0
,	0
astenia	0
,	0
cefalea	0
,	0
mialgia	0
y	0
fotofobia	0
de	0
una	0
semana	0
de	0
evolución	0
.	0
Afirmó	0
no	0
haber	0
viajado	0
durante	0
los	0
últimos	0
meses	0
,	0
pero	0
informó	0
de	0
contactos	2
con	0
diversas	0
personas	2
chinas	0
(	0
ninguna	0
de	0
ellas	0
con	0
antecedentes	0
de	0
infección	0
por	0
nCoV-19	1
)	0
e	0
italianas	0
procedentes	0
de	0
la	0
ciudad	0
de	0
Bérgamo	0
(	0
norte	0
de	0
Italia	0
)	0
,	0
actualmente	0
considerada	0
zona	0
de	0
alto	0
riesgo	0
para	0
las	0
infecciones	0
por	0
nCoV-19	1
por	0
parte	0
del	0
ministerio	0
de	0
salud	0
de	0
Italia	0
.	0

	0
Su	0
saturación	0
de	0
oxígeno	0
con	0
aire	0
ambiental	0
era	0
de	0
90	0
%	0
y	0
se	0
inició	0
oxigenoterapia	0
de	0
bajo	0
flujo	0
.	0
A	0
la	0
exploración	0
,	0
el	0
paciente	2
tenía	0
aspecto	0
enfermo	0
y	0
disneico	0
;	0
a	0
la	0
auscultación	0
se	0
percibieron	0
crepitantes	0
b

## Validating tokenization and alignment with the IO tags (S: Species, H: Human, O: Outside).




In [None]:
flag = 0
for st, lt in zip(HCs_tokenized, labels_tokenized):
    if len(st) != len(lt):
        print(st)
        print(lt)
        flag = 1
if flag==0:
    print("Everything is aligned!")

Everything is aligned!


## Sentence tokenization

In [None]:
sent_tokenized = []
label_sent_tokenized = []
for ht, lht in zip(HCs_tokenized, labels_tokenized):
  st = []; lbst = []
  for h, l in zip(ht,lht):
    if h != ".":
      st.append(h)
      lbst.append(l)
    else:
      st.append(".")
      lbst.append(0)
      sent_tokenized.append(st)
      label_sent_tokenized.append(lbst)
      st = []; lbst = []
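The splitting rule above closes a sentence at every `"."` token and resets the buffers at the start of each clinical case, so tokens that follow the last period of a case are not emitted. The sketch below isolates that rule on a toy input and adds an optional `keep_tail` switch; the function name and the switch are hypothetical additions, not part of the original pipeline.

```python
def split_on_period(tokens, labels, keep_tail=False):
    """Split parallel token/label lists into sentences ending at each "." token."""
    sentences, sentence_labels = [], []
    cur_t, cur_l = [], []
    for tok, lab in zip(tokens, labels):
        cur_t.append(tok)
        # As in the cell above, the period itself is always labeled 0.
        cur_l.append(0 if tok == "." else lab)
        if tok == ".":
            sentences.append(cur_t)
            sentence_labels.append(cur_l)
            cur_t, cur_l = [], []
    if keep_tail and cur_t:
        # Tokens after the last period (e.g. a case that ends without one).
        sentences.append(cur_t)
        sentence_labels.append(cur_l)
    return sentences, sentence_labels

toks = ["Tos", "y", "fiebre", ".", "Paciente", "estable"]
labs = [0, 0, 0, 0, 2, 0]
print(split_on_period(toks, labs)[0])
# [['Tos', 'y', 'fiebre', '.']]
print(split_on_period(toks, labs, keep_tail=True)[0])
# [['Tos', 'y', 'fiebre', '.'], ['Paciente', 'estable']]
```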

In [None]:
len(sent_tokenized)

27388

In [None]:
sent_tokenized[0]

['El',
 '1',
 'de',
 'enero',
 'de',
 '2020',
 ',',
 'ingresó',
 'en',
 'el',
 'Union',
 'Hospital',
 '(',
 'facultad',
 'de',
 'medicina',
 'Tongji',
 ',',
 'Wuhan',
 ',',
 'provincia',
 'de',
 'Hubei',
 ')',
 'un',
 'hombre',
 'de',
 '42',
 'años',
 'con',
 'hipertermia',
 '(',
 '39,6',
 '°',
 'C',
 ')',
 ',',
 'tos',
 'y',
 'que',
 'refería',
 'fatiga',
 'de',
 'una',
 'semana',
 'de',
 'evolución',
 '.']

In [None]:
len(label_sent_tokenized)

27388

In [None]:
label_sent_tokenized[0]

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 2,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

## An approximation to paragraph-based approach

### (Maximum 5 sentences per paragraph)

In [None]:
paragraph_tokenized = []
label_paragraph_tokenized = []
p_tokenized = []
lp_tokenized = []
count_sents = 0
for sentt, lsentt in zip(sent_tokenized, label_sent_tokenized):
  if count_sents < 5:
    p_tokenized = p_tokenized + sentt
    lp_tokenized = lp_tokenized + lsentt
  else:
    paragraph_tokenized.append(p_tokenized)
    label_paragraph_tokenized.append(lp_tokenized)
    count_sents = 0
    p_tokenized = []
    lp_tokenized = []
  count_sents+=1

In [None]:
len(paragraph_tokenized)

5477

In [None]:
len(label_paragraph_tokenized)

5477
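In the grouping loop above, the sentence that arrives once the counter has reached 5 triggers the flush but is not itself added to any group, and a trailing partial group is never appended. For comparison, a variant that keeps every sentence could be sketched as follows. The name `group_sentences` is hypothetical, and, like the loop above, it groups across clinical-case boundaries.

```python
def group_sentences(sentences, sentence_labels, max_sents=5):
    """Concatenate consecutive sentences into groups of at most max_sents."""
    paragraphs, paragraph_labels = [], []
    for start in range(0, len(sentences), max_sents):
        chunk = sentences[start:start + max_sents]
        chunk_labels = sentence_labels[start:start + max_sents]
        paragraphs.append([tok for sent in chunk for tok in sent])
        paragraph_labels.append([lab for sent in chunk_labels for lab in sent])
    return paragraphs, paragraph_labels

# Twelve toy sentences of two tokens each.
sents = [[f"s{i}", "."] for i in range(12)]
labs = [[0, 0] for _ in range(12)]
paras, para_labs = group_sentences(sents, labs)
print([len(p) // 2 for p in paras])  # sentences per group: [5, 5, 2]
```

Using this variant would change the number of groups and their contents, so the counts printed in this notebook would no longer apply.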

# Species and Human entities identification (NER) as a token classification problem

## Install the Transformers and Datasets libraries to run this notebook.

In [None]:
!pip install datasets transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.2.2-py3-none-any.whl (346 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.19.4-py3-none-any.whl (4.2 MB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting aiohttp

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting accelerate
  Downloading accelerate-0.9.0-py3-none-any.whl (106 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.9.0
Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


## Building the Dataset

In [None]:
#dic = {"tokens": HCs_tokenized, "ner_tags": labels_tokenized} #For the whole clinical case.
dic = {"tokens": sent_tokenized, "ner_tags": label_sent_tokenized} # Option used in our paper: one example per sentence (split at "." tokens) instead of per whole clinical case.
#dic = {"tokens": paragraph_tokenized, "ner_tags": label_paragraph_tokenized} # One example per approximate paragraph (up to 5 sentences) instead of per whole clinical case.

In [None]:
from datasets import Dataset, DatasetDict
dataset = Dataset.from_dict(dic)

In [None]:
dataset

Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 5477
})

In [None]:
#For training, validation, and test partitions
"""
#Train, val, test partitions
train_test = dataset.train_test_split()
test_val = train_test['test'].train_test_split()
raw_datasets = DatasetDict({
    'train': train_test['train'],
    'validation': test_val['train'],
    'test': test_val['test']
    })
"""

#Just for training and validation partitions
train_test = dataset.train_test_split()
raw_datasets = DatasetDict({
    'train': train_test['train'],
    'validation': train_test['test']
    })


In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 4107
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 1370
    })
})
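`train_test_split()` shuffles the rows, and the call above sets neither `test_size` nor `seed`, so the partition differs between runs; the sizes shown correspond to a 75/25 split. The idea of a fixed-seed split is sketched below with the standard library only. `seeded_split` is a hypothetical helper, not the `datasets` implementation. In the notebook itself, passing `seed=` (and optionally `test_size=`) to `train_test_split` serves the same purpose.

```python
import random

def seeded_split(n_rows, test_fraction=0.25, seed=42):
    """Return (train_indices, test_indices) for a reproducible shuffled split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed, same permutation
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, val_idx = seeded_split(20)
print(len(train_idx), len(val_idx))  # 15 5
```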

In [None]:
raw_datasets["train"][0]["ner_tags"]

[0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [None]:
raw_datasets['train']

Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 4107
})

In [None]:
label_names = ['O','S','H']
label_names

['O', 'S', 'H']

In [None]:
words = raw_datasets["train"][0]["tokens"]
labels = [int(n) for n in raw_datasets["train"][0]["ner_tags"]]
#labels = raw_datasets["train"][0]["pos_tags"]
#labels = raw_datasets["train"][0]["chunk_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

No se observan BAAR . No desarrollo de micobacterias . 
 • Radiografía de tórax : Índice cardiotorácico normal . No imagen de infiltrado ni condensaciones . 
O  O  O        S    O O  O          O  S             O O O O           O  O     O O      O              O      O O  O      O  O          O  O              O 


## Loading mBERT as a pre-trained model

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

https://huggingface.co/bert-base-multilingual-cased/resolve/main/tokenizer_config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpd7b3qe68


Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

storing https://huggingface.co/bert-base-multilingual-cased/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/f55e7a2ad4f8d0fff2733b3f79777e1e99247f2e4583703e92ce74453af8c235.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f
creating metadata file for /root/.cache/huggingface/transformers/f55e7a2ad4f8d0fff2733b3f79777e1e99247f2e4583703e92ce74453af8c235.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f
https://huggingface.co/bert-base-multilingual-cased/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpk7uqwo47


Downloading:   0%|          | 0.00/625 [00:00<?, ?B/s]

storing https://huggingface.co/bert-base-multilingual-cased/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/6c4a5d81a58c9791cdf76a09bce1b5abfb9cf958aebada51200f4515403e5d08.0fe59f3f4f1335dadeb4bce8b8146199d9083512b50d07323c1c319f96df450c
creating metadata file for /root/.cache/huggingface/transformers/6c4a5d81a58c9791cdf76a09bce1b5abfb9cf958aebada51200f4515403e5d08.0fe59f3f4f1335dadeb4bce8b8146199d9083512b50d07323c1c319f96df450c
loading configuration file https://huggingface.co/bert-base-multilingual-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/6c4a5d81a58c9791cdf76a09bce1b5abfb9cf958aebada51200f4515403e5d08.0fe59f3f4f1335dadeb4bce8b8146199d9083512b50d07323c1c319f96df450c
Model config BertConfig {
  "_name_or_path": "bert-base-multilingual-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidde

Downloading:   0%|          | 0.00/972k [00:00<?, ?B/s]

storing https://huggingface.co/bert-base-multilingual-cased/resolve/main/vocab.txt in cache at /root/.cache/huggingface/transformers/eff018e45de5364a8368df1f2df3461d506e2a111e9dd50af1fae061cd460ead.6c5b6600e968f4b5e08c86d8891ea99e51537fc2bf251435fb46922e8f7a7b29
creating metadata file for /root/.cache/huggingface/transformers/eff018e45de5364a8368df1f2df3461d506e2a111e9dd50af1fae061cd460ead.6c5b6600e968f4b5e08c86d8891ea99e51537fc2bf251435fb46922e8f7a7b29
https://huggingface.co/bert-base-multilingual-cased/resolve/main/tokenizer.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp_1f8s16i


Downloading:   0%|          | 0.00/1.87M [00:00<?, ?B/s]

storing https://huggingface.co/bert-base-multilingual-cased/resolve/main/tokenizer.json in cache at /root/.cache/huggingface/transformers/46880f3b0081fda494a4e15b05787692aa4c1e21e0ff2428ba8b14d4eda0784d.b33e51591f94f17c238ee9b1fac75b96ff2678cbaed6e108feadb3449d18dc24
creating metadata file for /root/.cache/huggingface/transformers/46880f3b0081fda494a4e15b05787692aa4c1e21e0ff2428ba8b14d4eda0784d.b33e51591f94f17c238ee9b1fac75b96ff2678cbaed6e108feadb3449d18dc24
loading file https://huggingface.co/bert-base-multilingual-cased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/eff018e45de5364a8368df1f2df3461d506e2a111e9dd50af1fae061cd460ead.6c5b6600e968f4b5e08c86d8891ea99e51537fc2bf251435fb46922e8f7a7b29
loading file https://huggingface.co/bert-base-multilingual-cased/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/46880f3b0081fda494a4e15b05787692aa4c1e21e0ff2428ba8b14d4eda0784d.b33e51591f94f17c238ee9b1fac75b96ff2678cbaed6e108feadb3449

In [None]:
tokenizer.is_fast

True

In [None]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
inputs.tokens()

['[CLS]',
 'No',
 'se',
 'observa',
 '##n',
 'BA',
 '##AR',
 '.',
 'No',
 'desarrollo',
 'de',
 'mic',
 '##oba',
 '##cter',
 '##ias',
 '.',
 '•',
 'Radio',
 '##grafía',
 'de',
 'tó',
 '##rax',
 ':',
 'Í',
 '##ndi',
 '##ce',
 'card',
 '##iot',
 '##or',
 '##áci',
 '##co',
 'normal',
 '.',
 'No',
 'imagen',
 'de',
 'in',
 '##fil',
 '##tra',
 '##do',
 'ni',
 'conde',
 '##nsa',
 '##ciones',
 '.',
 '[SEP]']

In [None]:
inputs.word_ids()

[None,
 0,
 1,
 2,
 2,
 3,
 3,
 4,
 5,
 6,
 7,
 8,
 8,
 8,
 8,
 9,
 11,
 12,
 12,
 13,
 14,
 14,
 15,
 16,
 16,
 16,
 17,
 17,
 17,
 17,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 23,
 23,
 23,
 24,
 25,
 25,
 25,
 26,
 None]

In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
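            # Carried over from a B-/I- label layout: odd ids are incremented, so with
            # this notebook's ids (0: O, 1: S, 2: H) continuation sub-tokens of an S word get id 2.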
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[-100, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
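In the printed alignment above, the second sub-token of a word labeled 1 (`S`) receives 2, which is the id of `H` in this notebook's `label_names`; the odd-to-even increment comes from B-/I- label layouts. For comparison, an alignment that keeps the word's own label on every sub-token, which is one way to stay within an IO scheme, could be sketched as follows. The name `align_labels_io` is hypothetical, and this variant is not the one used to produce the results in this notebook.

```python
def align_labels_io(labels, word_ids, ignore_index=-100):
    """Give every sub-token the label of its word; special tokens get ignore_index."""
    return [ignore_index if w is None else labels[w] for w in word_ids]

# Toy input: three words, the second one split into two sub-tokens.
word_labels = [0, 1, 0]
toy_word_ids = [None, 0, 1, 1, 2, None]
aligned = align_labels_io(word_labels, toy_word_ids)
print(aligned)  # [-100, 0, 1, 1, 0, -100]
```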


In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [None]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

  0%|          | 0/5 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

tensor([[-100,    0,    0,    0,    0,    1,    2,    0,    0,    0,    0,    1,
            2,    2,    2,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100],
        [-100,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0, 

In [None]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]


In [None]:
!pip install seqeval

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from datasets import load_metric

metric = load_metric("seqeval")

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]
labels

['O',
 'O',
 'O',
 'S',
 'O',
 'O',
 'O',
 'O',
 'S',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O']

In [None]:
predictions = labels.copy()
predictions[2] = "O"
metric.compute(predictions=[predictions], references=[labels])

{'_': {'f1': 1.0, 'number': 2, 'precision': 1.0, 'recall': 1.0},
 'overall_accuracy': 1.0,
 'overall_f1': 1.0,
 'overall_precision': 1.0,
 'overall_recall': 1.0}

In [None]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
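The filtering step inside `compute_metrics` can be looked at in isolation. The sketch below works on plain lists of already arg-maxed ids instead of logits, drops every position whose gold id is -100, and maps the remaining ids to label strings. The helper name `to_label_sequences` and the toy values are hypothetical.

```python
def to_label_sequences(pred_ids, label_ids, label_names, ignore_index=-100):
    """Drop positions whose gold id is ignore_index and map ids to label strings."""
    true_labels = [
        [label_names[l] for l in row if l != ignore_index] for row in label_ids
    ]
    true_predictions = [
        [label_names[p] for p, l in zip(pred_row, row) if l != ignore_index]
        for pred_row, row in zip(pred_ids, label_ids)
    ]
    return true_predictions, true_labels

names = ["O", "S", "H"]
gold = [[-100, 0, 1, 1, 2, -100]]   # -100 marks special tokens
pred = [[2, 0, 1, 0, 2, 0]]         # predictions at ignored positions are discarded
preds_str, gold_str = to_label_sequences(pred, gold, names)
print(preds_str)  # [['O', 'S', 'O', 'H']]
print(gold_str)   # [['O', 'S', 'S', 'H']]
```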

In [None]:
id2label = {str(i): label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [None]:
id2label

{'0': 'O', '1': 'S', '2': 'H'}

In [None]:
label2id

{'H': '2', 'O': '0', 'S': '1'}
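The mappings above store the ids as strings on both sides. A variant with integer ids is sketched here as an alternative; whether it changes anything downstream for a given `transformers` version is not verified here, so treat it as an assumption to test rather than a required fix. The `_int` suffixed names are hypothetical.

```python
label_names_demo = ["O", "S", "H"]

# Integer id -> label, and label -> integer id.
id2label_int = dict(enumerate(label_names_demo))
label2id_int = {label: i for i, label in id2label_int.items()}

print(id2label_int)  # {0: 'O', 1: 'S', 2: 'H'}
print(label2id_int)  # {'O': 0, 'S': 1, 'H': 2}
```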

## Changing the head of prediction for NER under the IO scheme

In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(    
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
    num_labels = 3,
)

loading configuration file https://huggingface.co/bert-base-multilingual-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/6c4a5d81a58c9791cdf76a09bce1b5abfb9cf958aebada51200f4515403e5d08.0fe59f3f4f1335dadeb4bce8b8146199d9083512b50d07323c1c319f96df450c
Model config BertConfig {
  "_name_or_path": "bert-base-multilingual-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "S",
    "2": "H"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "H": "2",
    "O": "0",
    "S": "1"
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_n

Downloading:   0%|          | 0.00/681M [00:00<?, ?B/s]

storing https://huggingface.co/bert-base-multilingual-cased/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/0a3fd51713dcbb4def175c7f85bddc995d5976ce1dde327f99104e4d33069f17.aa7be4c79d76f4066d9b354496ea477c9ee39c5d889156dd1efb680643c2b052
creating metadata file for /root/.cache/huggingface/transformers/0a3fd51713dcbb4def175c7f85bddc995d5976ce1dde327f99104e4d33069f17.aa7be4c79d76f4066d9b354496ea477c9ee39c5d889156dd1efb680643c2b052
loading weights file https://huggingface.co/bert-base-multilingual-cased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/0a3fd51713dcbb4def175c7f85bddc995d5976ce1dde327f99104e4d33069f17.aa7be4c79d76f4066d9b354496ea477c9ee39c5d889156dd1efb680643c2b052
Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship

In [None]:
model.config.num_labels

3

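The config dump above lists three labels: `O`, `S` and `H` (later cells map `S` to `SPECIES` and `H` to `HUMAN`). The following is a minimal sketch of how such `id2label` / `label2id` mappings can be built so that they mirror the dump, where ids appear as strings. The name `label_names` is hypothetical; the notebook defines its own mappings in an earlier cell.

```python
# Hypothetical label list; its order fixes the ids shown in the config dump.
label_names = ["O", "S", "H"]

# Ids are stored as strings, as they appear in the dump above.
id2label = {str(i): name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in id2label.items()}

print(id2label)
print(label2id)
print(len(id2label))  # number of classes expected by the classification head
```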
In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    "NLP-CIC-WFU_Clinical_Cases_NER_Sents_tokenized_mBERT_cased_fine_tuned",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    num_train_epochs=7,
    weight_decay=0.01,
    push_to_hub=True,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


## Hugging Face Authentication

If you want to save your own model and make it available online, we strongly recommend signing up at https://huggingface.co/

You will need to set up git: adapt your email and name in the following cell.

In [None]:
!git config --global user.email "your_email"
!git config --global user.name "your_username"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store[0m


## Fine-tuning mBERT for NER

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

Cloning https://huggingface.co/ajtamayoh/NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned into local empty directory.
***** Running training *****
  Num examples = 4107
  Num Epochs = 7
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3598


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.0693,0.04159,0.948512,0.649229,0.77084,0.98843
2,0.0367,0.039589,0.939114,0.67098,0.78272,0.989238
3,0.0283,0.038541,0.938837,0.688877,0.794664,0.989933
4,0.0222,0.042248,0.945552,0.678965,0.790385,0.989845
5,0.0182,0.045671,0.934944,0.692456,0.795634,0.990095
6,0.013,0.04844,0.894663,0.706222,0.789352,0.98986
7,0.0084,0.053715,0.858522,0.710077,0.777275,0.989287


***** Running Evaluation *****
  Num examples = 1370
  Batch size = 8
Saving model checkpoint to NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned/checkpoint-514
Configuration saved in NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned/checkpoint-514/config.json
Model weights saved in NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned/checkpoint-514/pytorch_model.bin
tokenizer config file saved in NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned/checkpoint-514/tokenizer_config.json
Special tokens file saved in NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned/checkpoint-514/special_tokens_map.json
tokenizer config file saved in NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned/tokenizer_config.json
Special tokens file saved in NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned/special_tokens_map.json
***** Running Evaluati

TrainOutput(global_step=3598, training_loss=0.02750393009437595, metrics={'train_runtime': 1052.0302, 'train_samples_per_second': 27.327, 'train_steps_per_second': 3.42, 'total_flos': 3673367002806606.0, 'train_loss': 0.02750393009437595, 'epoch': 7.0})

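The figures in the training log can be cross-checked with simple arithmetic. The sketch below assumes a single device and no gradient accumulation, which is what the log reports (`Gradient Accumulation steps = 1`); the variable names are illustrative only.

```python
import math

num_examples = 4107       # "Num examples" in the training log
batch_size = 8            # "Instantaneous batch size per device"
num_epochs = 7            # num_train_epochs in TrainingArguments
train_runtime = 1052.0302 # seconds, from TrainOutput

# The last, partially filled batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch)  # 514, matching the first checkpoint name (checkpoint-514)
print(total_steps)      # 3598, matching "Total optimization steps"
print(round(samples_per_second, 3), round(steps_per_second, 2))
```

With these values, one checkpoint is saved every 514 steps because `save_strategy="epoch"` is set.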
## Saving the fine-tuned model to the Hugging Face Hub (requires prior authentication)

In [None]:
trainer.push_to_hub(commit_message="Fine-tuning completed")

Saving model checkpoint to NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned
Configuration saved in NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned/config.json
Model weights saved in NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned/pytorch_model.bin
tokenizer config file saved in NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned/tokenizer_config.json
Special tokens file saved in NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


remote: Enforcing permissions...        
remote: Allowed refs: all
To https://huggingface.co/ajtamayoh/NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned
   577e064..6016158  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Token Classification', 'type': 'token-classification'}, 'metrics': [{'name': 'Precision', 'type': 'precision', 'value': 0.8585219707057257}, {'name': 'Recall', 'type': 'recall', 'value': 0.7100770925110133}, {'name': 'F1', 'type': 'f1', 'value': 0.7772754671488848}, {'name': 'Accuracy', 'type': 'accuracy', 'value': 0.9892868689441663}]}
remote: Enforcing permissions...        
remote: Allowed refs: all        
To https://huggingface.co/ajtamayoh/NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned
   6016158..ed75ffc  main -> main



'https://huggingface.co/ajtamayoh/NLP-CIC-WFU_Clinical_Cases_NER_Paragraph_Tokenized_mBERT_cased_fine_tuned/commit/601615881c21b387ab613b97aacc56f7e3aadcdb'

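The metrics attached to the push (precision 0.8585, recall 0.7101, F1 0.7773) match the epoch 7 row of the training table, that is, the last epoch. The sketch below re-derives F1 from the reported precision and recall and looks for the epoch with the highest validation F1. The rows are copied from the table above; the helper name `f1_score` is introduced here for illustration.

```python
# (epoch, precision, recall, reported F1) copied from the training table
rows = [
    (1, 0.948512, 0.649229, 0.770840),
    (2, 0.939114, 0.670980, 0.782720),
    (3, 0.938837, 0.688877, 0.794664),
    (4, 0.945552, 0.678965, 0.790385),
    (5, 0.934944, 0.692456, 0.795634),
    (6, 0.894663, 0.706222, 0.789352),
    (7, 0.858522, 0.710077, 0.777275),
]

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Largest gap between the recomputed and the reported F1 (rounding only).
max_gap = max(abs(f1_score(p, r) - reported) for _, p, r, reported in rows)

# Epoch with the highest reported validation F1.
best_epoch = max(rows, key=lambda row: row[3])[0]

print(max_gap < 1e-5)
print(best_epoch)
```

In this run the highest validation F1 is reached at epoch 5, while the uploaded weights are those of epoch 7. If the best checkpoint is preferred, `TrainingArguments` offers `load_best_model_at_end=True` together with `metric_for_best_model`; this is an option to consider, not something the notebook configures.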
## Analyzing predictions

In [None]:
import numpy as np

predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

preds = np.argmax(predictions.predictions, axis=-1)


***** Running Prediction *****
  Num examples = 250
  Batch size = 8




(250, 512, 3) (250, 512)

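The prediction array has shape `(250, 512, 3)`: one score per class for each of the 512 token positions of each example. Taking the argmax over the last axis yields one class id per token, hence the shape `(250, 512)`. The following plain-Python sketch reproduces what `np.argmax(..., axis=-1)` does on a tiny made-up input; the function name and the toy values are illustrative only.

```python
def argmax_last_axis(logits):
    """Return, for every token, the index of its highest-scoring class."""
    return [
        [max(range(len(token)), key=token.__getitem__) for token in sequence]
        for sequence in logits
    ]

# Toy logits with shape (2 sequences, 3 tokens, 3 classes); values are made up.
toy_logits = [
    [[2.0, 0.1, -1.0], [0.2, 3.1, 0.0], [0.0, 0.5, 4.2]],
    [[1.5, 1.0, 0.3], [-0.2, 0.1, 2.2], [5.0, 0.0, 0.0]],
]

toy_preds = argmax_last_axis(toy_logits)
print(toy_preds)  # [[0, 1, 2], [0, 2, 0]]
```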

In [None]:
"""
i=0
print(raw_datasets["validation"][i]['tokens'])
for j in range(len(preds[i])):
  print(raw_datasets["validation"][i]['ner_tags'][j], "\t", preds[i][j])
print(' '.join(raw_datasets["validation"][i]['tokens']))
"""

## Loading the model for inference

In [None]:
from transformers import pipeline

#Replace this with your own checkpoint. If all the previous cells ran successfully, the model should be available
#in your Hugging Face account under the name: NLP-CIC-WFU_Clinical_Cases_NER_Sents_tokenized_mBERT_cased_fine_tuned
model_checkpoint = "ajtamayoh/NLP-CIC-WFU_Clinical_Cases_NER_Sents_tokenized_mBERT_cased_fine_tuned"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)


## Some examples

In [None]:
#pred = token_classifier("El 2 de febrero de 2020, una mujer de 28 años y 30 semanas de gestación acudió a un consultorio de enfermedades infecciosas del hospital municipal de Suzhóu refiriendo fiebre intermitente de una semana de evolución. Informó de que había llegado a Suzhóu el 24 de enero tras visitar a su familia en Wuhan tres semanas antes. Teniendo en cuenta los antecedentes del paciente de fiebre y de desplazamiento a Wuhan, se recogieron dos frotis faríngeos, que dieron resultado negativo para el SARS-CoV-2 al ser analizados con el kit recomendado por el Centro de Control de Enfermedades chino (BioGerm, Shanghái, China) y de acuerdo con las directrices de la OMS para la RT-PCR cuantitativa. Una TAC torácica el día 4 de febrero mostró consolidaciones parcheadas subpleurales en el lado izquierdo y opacidades en vidrio esmerilado en el lado derecho. Se le puso en aislamiento en el mismo hospital. El 6 de febrero, los resultados de la segunda RT-PCR del esputo de la paciente para detectar SARS-CoV-2 fueron positivos, de modo que se transfirió a la unidad de cuidados intensivos (UCI) en una planta de enfermedades infecciosas del Hospital Asociado de Enfermedades Infecciosas de la universidad de Suchow, el centro hospitalario de referencia para la COVID-19 en Suzhóu. Al ingreso, la exploración física mostró una temperatura de 36,2 °C, presión arterial de 95/64 mmHg, pulso de 92 l.p.m., y saturación de oxígeno de 97% con mascarilla de Venturi a 5 litros por minuto de oxígeno. La auscultación pulmonar reveló roncus en el campo pulmonar inferior izquierdo. Otros resultados analíticos fueron: cifra de leucocitos de 10,60*109/L; cifra de neutrófilos de 9,14*109/L; cifra de linfocitos de 0,86*109/L; seroalbúmina de 24,6 g/L, proteína C-reactiva de 19,6 mg/L, dímero D de 840 ug/L, procalcitonina (PCT) de 0,288 ng/ml; lactato-deshidrogenasa (LDH) de 544 U/L; prohormona N-terminal del péptido natriurético cerebral (NT-proBNP) de 318 pg/ml. 
#Las concentraciones de creatinina y aminotransferasa estaban dentro del intervalo normal. Una ecografía fetal mostró un feto intrauterino con características anatómicas normales de unas 30 semanas de gestación.")
#pred = token_classifier("En cuanto a la radiografía de tórax y al ecocardiograma no presentaron alteraciones. En la ecografía abdominal se observó una discreta hepatomegalia. Los hemocultivos y las serologías frente a Salmonella, Brucella, Rickettsia, Coxiella, Borrelia, Leishmania, Parvovirus B19, Citomegalovirus, virus de Epstein-Barr y herpes simplex, Toxoplasma y sífilis fueron negativos.")
#pred = token_classifier('Una niña de 55 días, por lo demás sana, con lactancia mixta, se puso enferma con rinorrea y tos seca el 28 de enero de 2020. Ingresó en nuestro hospital el 2 de febrero de 2020. Antes del inicio de los síntomas, entre el 16 y el 24 de enero, sus padres la habían llevado a Lu"an (provincia de Hubei) para una fiesta familiar. En dicha fiesta, el tío y la tía de la niña, residentes en Wuhan, presentaron tos y fiebre. Luego, el 31 de enero de 2020, se diagnosticó COVID-19 a los padres de la niña, a partir de sus síntomas, de radiografías torácicas y de pruebas de ácido nucleico de frotis faríngeos. El frotis nasofaríngeo obtenido de la niña también resultó positivo para el síndrome respiratorio agudo grave de coronavirus (SARS-CoV-2) en una RT-PCR en tiempo real.')
#pred = token_classifier("En la bioquímica mostró unos niveles de glucosa, iones, CPK, urea y creatinina normales. GOT/GPT 140/176 U/L; LDH 691 U/L; PCR 12,24 mg/dL. En la gasometría venosa se detectó un pH 7,41; pCO2 34 mmHg; HCO3 16 mEq/dL. En el sedimento de orina presentaba indicios de proteinuria. La radiografía de tórax y la ecografía abdominal fueron normales. La gota gruesa y el frotis de sangre periférica fueron negativos, así como los hemocultivos seriados. La serología frente a los virus de la hepatitis A, B y C, Coxiella, Borrellia, Rickettsia, dengue y VIH fue negativa.")
pred = token_classifier('Un mes después comenzó con episodios ocasionales de visión borrosa y picor en el ojo izquierdo. \nLa propia familia y posteriormente el servicio de Oftalmología apreciaron en el ojo de la paciente un elemento con forma de "gusano" de aproximadamente 1 cm de longitud, que se desplazaba bajo la conjuntiva.\n\nSe realizó un hemograma que mostró ligera eosinofilia (700/μl) y se estudió la presencia de microfilarias, en preparaciones directas y tras concentración (técnica de Knott) de sangre periférica extraída a las 12:00 h, siendo negativa. \nLa serología frente a filaria mediante ELISA utilizando como antígeno un extracto crudo de parásito adulto de Brugia malaya y Onchocerca volvulus (Centro Nacional de Microbiología, Majadahonda) presentó el siguiente patrón de anticuerpos: IgG total, IgG1 e IgG3: positivas; IgG2 e IgG4: negativas.\n\nAnte la sospecha diagnóstica de loiasis, recibió tratamiento con dietilcarbamazina a dosis progresivas los primeros 3 días, y continuando hasta completar 21 días de tratamiento, con lo que desapareció el exantema y el prurito de muslos y no volvió a presentar sintomatología ocular.')
pred

[{'end': 114,
  'entity_group': 'H',
  'score': 0.9997085,
  'start': 107,
  'word': 'familia'},
 {'end': 195,
  'entity_group': 'H',
  'score': 0.99993587,
  'start': 187,
  'word': 'paciente'},
 {'end': 224,
  'entity_group': 'S',
  'score': 0.999673,
  'start': 222,
  'word': 'gu'},
 {'end': 228,
  'entity_group': 'H',
  'score': 0.99980944,
  'start': 224,
  'word': '##sano'},
 {'end': 403,
  'entity_group': 'S',
  'score': 0.9998394,
  'start': 398,
  'word': 'micro'},
 {'end': 411,
  'entity_group': 'H',
  'score': 0.99972594,
  'start': 403,
  'word': '##filarias'},
 {'end': 568,
  'entity_group': 'S',
  'score': 0.99974626,
  'start': 564,
  'word': 'fila'},
 {'end': 571,
  'entity_group': 'H',
  'score': 0.99970335,
  'start': 568,
  'word': '##ria'},
 {'end': 636,
  'entity_group': 'S',
  'score': 0.9998555,
  'start': 633,
  'word': 'par'},
 {'end': 641,
  'entity_group': 'H',
  'score': 0.9997669,
  'start': 636,
  'word': '##ásito'},
 {'end': 654,
  'entity_group': 'S',
  

In [None]:
val_path = "/content/drive/MyDrive/Dataset/training_valid_test_background_multilingual/valid/text-files/"
test_path = "/content/drive/MyDrive/Dataset/training_valid_test_background_multilingual/test_background/text-files/"

In [None]:
livingner_subtask1_validation = pd.read_csv("/content/drive/MyDrive/Dataset/training_valid_test_background_multilingual/valid/subtask1-NER/validation_entities_subtask1.tsv", delimiter="\t")
livingner_subtask1_validation.head()

Unnamed: 0,filename,mark,label,off0,off1,span
0,32119083_ES,T1,HUMAN,287,294,familia
1,32119083_ES,T2,HUMAN,4827,4831,hijo
2,32119083_ES,T3,HUMAN,3948,3953,madre
3,32119083_ES,T4,HUMAN,4499,4504,madre
4,32119083_ES,T5,HUMAN,4567,4572,madre


In [None]:
i = 0
with open(val_path + livingner_subtask1_validation["filename"].unique()[i] + ".txt", "r", encoding="UTF-8") as fval:
  pred = token_classifier(fval.read())
pred

[{'end': 34,
  'entity_group': 'H',
  'score': 0.99941397,
  'start': 29,
  'word': 'mujer'},
 {'end': 294,
  'entity_group': 'H',
  'score': 0.9996308,
  'start': 287,
  'word': 'familia'},
 {'end': 372,
  'entity_group': 'H',
  'score': 0.9996573,
  'start': 364,
  'word': 'paciente'},
 {'end': 489,
  'entity_group': 'S',
  'score': 0.9982133,
  'start': 486,
  'word': 'SAR'},
 {'end': 490,
  'entity_group': 'H',
  'score': 0.99902165,
  'start': 489,
  'word': '##S'},
 {'end': 493,
  'entity_group': 'S',
  'score': 0.9974661,
  'start': 490,
  'word': '- Co'},
 {'end': 496,
  'entity_group': 'H',
  'score': 0.9994777,
  'start': 493,
  'word': '##V - 2'},
 {'end': 969,
  'entity_group': 'H',
  'score': 0.9996933,
  'start': 961,
  'word': 'paciente'},
 {'end': 987,
  'entity_group': 'S',
  'score': 0.9979159,
  'start': 984,
  'word': 'SAR'},
 {'end': 988,
  'entity_group': 'H',
  'score': 0.9986725,
  'start': 987,
  'word': '##S'},
 {'end': 991,
  'entity_group': 'S',
  'score': 0

## Post-Processing

In [None]:
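def grouping_entities(pred):
  """Merge word pieces and adjacent same-label entities returned by the pipeline."""
  import re
  output = []
  for e in pred:
    if "##" not in e['word']:
      output.append(e)
    else:
      # Word-piece continuation: glue it to the previous entity when the offsets touch.
      try:
        if e['start'] == (output[-1]['end']):
          output[-1]['word'] = output[-1]['word']+re.sub("##","",e['word'])
          output[-1]['end'] = e['end']
      except IndexError:
        # There is no previous entity to attach the piece to.
        pass

    # Same label and exactly one character (a space) after the entity before the last one: join with a space.
    try:
      if (e['entity_group'] == "S" or e['entity_group'] == "H") and (e['start'] == (output[-2]['end']+1)) and (e['entity_group'] == output[-2]['entity_group']):
        output[-2]['word'] = output[-2]['word']+" "+e['word']
        output[-2]['end'] = e['end']
        output.pop(-1)
    except IndexError:
      pass

    # Offsets touch (no gap): join without a space.
    try:
      if e['start'] == (output[-2]['end']):
        output[-2]['word'] = output[-2]['word']+e['word']
        output[-2]['end'] = e['end']
        output.pop(-1)
    except IndexError:
      pass

  return output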


In [None]:
grouping_entities(pred)

[{'end': 114,
  'entity_group': 'H',
  'score': 0.9997085,
  'start': 107,
  'word': 'familia'},
 {'end': 195,
  'entity_group': 'H',
  'score': 0.99993587,
  'start': 187,
  'word': 'paciente'},
 {'end': 228,
  'entity_group': 'S',
  'score': 0.999673,
  'start': 222,
  'word': 'gusano'},
 {'end': 411,
  'entity_group': 'S',
  'score': 0.9998394,
  'start': 398,
  'word': 'microfilarias'},
 {'end': 571,
  'entity_group': 'S',
  'score': 0.99974626,
  'start': 564,
  'word': 'filaria'},
 {'end': 665,
  'entity_group': 'S',
  'score': 0.9998555,
  'start': 633,
  'word': 'parásito adulto de Brugia malaya'},
 {'end': 687,
  'entity_group': 'S',
  'score': 0.9998512,
  'start': 668,
  'word': 'Onchocerca volvulus'}]

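The raw pipeline output splits words such as "gusano" into `gu` (label `S`) and `##sano` (label `H`), and the grouped output shows them rejoined under the label of the first piece. The sketch below isolates only that first rule, on the offsets shown in the output above. It is a simplified stand-in, not the notebook's `grouping_entities`: the function name `merge_wordpieces` is hypothetical, it tests for a leading `##` rather than `##` anywhere in the word, and it does not implement the joining of neighbouring same-label entities.

```python
def merge_wordpieces(entities):
    """Attach '##' continuation pieces to the preceding entity when offsets touch."""
    merged = []
    for ent in entities:
        if not ent["word"].startswith("##"):
            merged.append(dict(ent))  # copy so the input list stays untouched
        elif merged and ent["start"] == merged[-1]["end"]:
            # Continuation piece: extend the previous entity, keep its label.
            merged[-1]["word"] += ent["word"][2:]
            merged[-1]["end"] = ent["end"]
        # A '##' piece with no adjacent predecessor is dropped.
    return merged

sample = [
    {"entity_group": "S", "word": "gu", "start": 222, "end": 224},
    {"entity_group": "H", "word": "##sano", "start": 224, "end": 228},
    {"entity_group": "S", "word": "micro", "start": 398, "end": 403},
    {"entity_group": "H", "word": "##filarias", "start": 403, "end": 411},
]

result = merge_wordpieces(sample)
for ent in result:
    print(ent["entity_group"], ent["start"], ent["end"], ent["word"])
```

Keeping the label of the first piece is a design choice of the rule: the continuation piece's own label is ignored.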
## Inference on validation dataset

In [None]:
from os import listdir
val_file_names = listdir(val_path)
#for individual testing
#val_file_names = ['caso_clinico_medtropical103.txt']
#val_file_names = ['32168162_ES.txt']

In [None]:
val_file_names[0]

'caso_clinico_medicina_interna424.txt'

### Using whole clinical cases

In [None]:
print("Processing...")
import re

f = open("/content/drive/MyDrive/LivingNER_subtask1NER_validation_predictions_NLP-CIC-WFU_Clinical_Cases_NER_mBERT_cased_fine_tuned.tsv", "w", encoding="UTF-8")
f.write("filename\tmark\tlabel\toff0\toff1\tspan\n")

for filename in val_file_names:
  print(f"Text: {filename}", end="\r")
  with open(val_path + filename, "r", encoding="UTF-8") as fval:
    lista_spans = []
    hc = fval.read()
    pred = token_classifier(hc)
    pred_grouped = grouping_entities(pred)
    t = 1
    for p in pred_grouped:

      off0 = int(p['start'])
      off1 = int(p['end'])
      if p['entity_group'] == 'S':
        label = 'SPECIES'
      else:
        label = 'HUMAN'
      span = hc[off0:off1]
      
      spant = span
      span = re.sub("^, |^\. |^: |^; |^\( |^\) ","",span)
      
      if span != spant:
        off0 = off0+2
      else:
        span = re.sub("^,|^\.|^:|^;|^\(|^\)","",span)
        if span != spant:
          off0 = off0+1
 

      if "\n" in span:
        span = re.sub("\n"," ",span)

      if " - " in span:
        span = re.sub(" - ","-",span)
        off1 = off1-2

      if "- " in span:
        span = re.sub("- ","-",span)
        off1 = off1-2

      if " -" in span:
        span = re.sub(" -","-",span)
        off1 = off1-2

      if "( " in span:
        span = re.sub("\( ","(",span)
        off1 = off1-1

      if " )" in span:
        span = re.sub(" \)",")",span)
        off1 = off1-1

      if span.endswith(" y") :
        span = span[:-2]
        off1 = off1-2
      
      if span.endswith(" de") or span.endswith(" en"):
        span = span[:-3]
        off1 = off1-3

      if span.endswith(" por") or span.endswith(" con"):
        span = span[:-4]
        off1 = off1-4

      if span.endswith(".") or span.endswith(",") or span.endswith(";") or span.endswith(":") or span.endswith("–") or span.endswith("-"):
        span = span[:-1]
        off1 = off1-1

      if span.endswith(" .") or span.endswith(" ,") or span.endswith(" ;") or span.endswith(" :") or span.endswith(" –") or span.endswith(" -"):
        span = span[:-2]
        off1 = off1-2

      pattern = r"^[a-z|á|é|í|ó|ú|/]{0,3}$|^[0-9]+$|^[A-Z]$"
      match = re.findall(pattern, span)
      if len(match) > 0 and match[0] != 'tía' and match[0] != 'tío':
        continue

      if span not in lista_spans:
        # Find all indices of 'span'
        indices = [index for index in range(len(hc)) if hc.startswith(span, index)]
        #print(indices)
        for ind in indices:
          off0 = ind
          off1 = ind+len(span)
          f.write(filename[:-4]+"\t"+"T"+str(t)+"\t"+label+"\t"+str(off0)+"\t"+str(off1)+"\t"+span+"\n")
          #print(filename[:-4]+"\t"+"T"+str(t)+"\t"+label+"\t"+str(off0)+"\t"+str(off1)+"\t"+span+"\n")
          t+=1

        lista_spans.append(span)
f.close()
print("Completo.")

Processing...
Completo.

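In the cell above, the offsets that end up in the TSV file do not come from the model directly: once a span has been cleaned, every occurrence of that string in the full text `hc` is located and written as a separate row. The sketch below shows the same search with `str.find`; the helper name `find_all`, the document id `doc1` and the sample sentence are made up for illustration.

```python
def find_all(text, span):
    """Return the start offset of every occurrence of span in text (overlaps included)."""
    if not span:
        return []
    offsets = []
    start = text.find(span)
    while start != -1:
        offsets.append(start)
        start = text.find(span, start + 1)
    return offsets

text = "La madre y el hijo. La madre refiere fiebre."
span = "madre"
offsets = find_all(text, span)

# One TSV-like row per occurrence: filename, mark, label, off0, off1, span.
rows = [
    ("doc1", f"T{n}", "HUMAN", off, off + len(span), span)
    for n, off in enumerate(offsets, start=1)
]
for row in rows:
    print("\t".join(str(field) for field in row))
```

Note that this is plain substring matching, so a span is also found inside longer words (for example `hijo` inside `hijos`), and occurrences the model did not tag are written too.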

### Using the texts partitioned 

In [None]:
print("Processing...")
import re
f = open("/content/drive/MyDrive/LivingNER_subtask1NER_validation_predictions_Texts_Partitioned_pl_1_NLP-CIC-WFU_Clinical_Cases_NER_Sents_Tokenized_mBERT_cased_fine_tuned.tsv", "w", encoding="UTF-8")
f.write("filename\tmark\tlabel\toff0\toff1\tspan\n")

for filename in val_file_names:
  print(f"Text: {filename}", end="\r")
  with open(val_path + filename, "r", encoding="UTF-8") as fval:
    lista_spans = []
    hc = fval.read()
    #Partitioning the texts.
    t = 1
    pattern = r'\. |\.\n|\. \n'
    sents = re.split(pattern, hc)

    paragraphs = []
    prgh = ""
    
    #Paragraphs of "pl" sentences each, without overlap.
    #Change the parameters pl and step here to reproduce the different experiments carried out in our work.
    ant = 0
    step = 3
    pl = 3 #paragraph length
    for lpr in range(pl,len(sents),step):
      paragraphs.append(" ".join(sents[ant:lpr]))
      ant = lpr
    #Remaining sentences (this also covers texts with at most "pl" sentences, for which the loop never runs)
    if ant < len(sents):
      paragraphs.append(" ".join(sents[ant:]))

    #Inference per paragraphs
    for ph in paragraphs: 

      pred = token_classifier(ph)
      pred_grouped = grouping_entities(pred)
  
      for p in pred_grouped:

        off0 = int(p['start'])
        off1 = int(p['end'])
        if p['entity_group'] == 'S':
          label = 'SPECIES'
        else:
          label = 'HUMAN'
        
        span = p['word']

        span = re.sub("^, |^,|^\. |^\.|^: |^:|^; |^;|^\( |^\(|^\) |^\)","",span)

        if "\n" in span:
          span = re.sub("\n"," ",span)

        if " - " in span:
          span = re.sub(" - ","-",span)

        if "- " in span:
          span = re.sub("- ","-",span)

        if " -" in span:
          span = re.sub(" -","-",span)

        if "( " in span:
          span = re.sub("\( ","(",span)

        if " )" in span:
          span = re.sub(" \)",")",span)

        if span.endswith(" y") :
          span = span[:-2]
        
        if span.endswith(" de") or span.endswith(" en"):
          span = span[:-3]

        if span.endswith(" por") or span.endswith(" con"):
          span = span[:-4]

        if span.endswith(".") or span.endswith(",") or span.endswith(";") or span.endswith(":") or span.endswith("–") or span.endswith("-"):
          span = span[:-1]

        if span.endswith(" .") or span.endswith(" ,") or span.endswith(" ;") or span.endswith(" :") or span.endswith(" –") or span.endswith(" -"):
          span = span[:-2]

        pattern = r"^[a-z|á|é|í|ó|ú|/]{0,3}$|^[0-9]+$|^[A-Z]$"
        match = re.findall(pattern, span)
        if len(match) > 0 and match[0] != 'tía' and match[0] != 'tío':
          continue

        if span not in lista_spans:
          # Find all indices of 'span'
          indices = [index for index in range(len(hc)) if hc.startswith(span, index)]
          #print(indices)
          for ind in indices:
            off0 = ind
            off1 = ind+len(span)
            f.write(filename[:-4]+"\t"+"T"+str(t)+"\t"+label+"\t"+str(off0)+"\t"+str(off1)+"\t"+span+"\n")
            #print(filename[:-4]+"\t"+"T"+str(t)+"\t"+label+"\t"+str(off0)+"\t"+str(off1)+"\t"+span+"\n")
            t+=1

          lista_spans.append(span)
f.close()
print("Completo.")

Processing...
Completo.

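The partitioning step in the cell above splits each clinical case into sentences and regroups them into chunks of `pl` sentences before running the pipeline on each chunk. A compact sketch of the non-overlapping case (`step == pl`) follows; the function name `partition_text` and the sample text are illustrative, while the split pattern is the one used in the cell.

```python
import re

def partition_text(text, pl=3):
    """Split text into sentences and regroup them into chunks of pl sentences."""
    sents = re.split(r"\. |\.\n|\. \n", text)
    return [" ".join(sents[i:i + pl]) for i in range(0, len(sents), pl)]

text = "Uno. Dos. Tres. Cuatro. Cinco"
chunks = partition_text(text, pl=3)
print(chunks)  # ['Uno Dos Tres', 'Cuatro Cinco']
```

Because the split consumes the sentence-final periods, character offsets inside a chunk no longer line up with the original document. That is presumably why the cell searches each cleaned span again in the full text instead of reusing the pipeline offsets.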

## Inference on test dataset

In [None]:
from os import listdir
test_file_names = listdir(test_path)

In [None]:
i = 1
with open(test_path + test_file_names[i], "r", encoding="UTF-8") as ftest:
  pred = token_classifier(ftest.read())
pred

[{'end': 5,
  'entity_group': 'H',
  'score': 0.999522,
  'start': 0,
  'word': 'Varón'},
 {'end': 37,
  'entity_group': 'H',
  'score': 0.99865144,
  'start': 33,
  'word': 'hijo'},
 {'end': 47,
  'entity_group': 'H',
  'score': 0.9995177,
  'start': 41,
  'word': 'padres'},
 {'end': 106,
  'entity_group': 'H',
  'score': 0.9992803,
  'start': 98,
  'word': 'hermanos'},
 {'end': 1702,
  'entity_group': 'H',
  'score': 0.99990165,
  'start': 1694,
  'word': 'paciente'}]

In [None]:
len(test_file_names)

13472

In [None]:
#This cell is used to pass the test data in batches. It is set to the first 1000 test clinical cases; there are 13472 in total.
test_files_part = test_file_names[0:1000]

In [None]:
len(test_files_part)

972

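Since the test set is processed in batches, the slice boundaries for all batches can be generated up front instead of editing the indices by hand. This is a minimal sketch assuming batches of 1000 files and the 13472 files reported above; the helper name `batch_slices` is hypothetical.

```python
def batch_slices(n_items, batch_size=1000):
    """Yield (start, stop) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

slices = list(batch_slices(13472))
print(len(slices))            # 14
print(slices[0], slices[-1])  # (0, 1000) (13000, 13472)
```

Each pair can then be used as `test_file_names[start:stop]` to build `test_files_part` for one run.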
### Using whole clinical cases

In [None]:
print("Processing...")
import re
f = open("/content/drive/MyDrive/LivingNER_subtask1NER_test_predictions_NLP-CIC-WFU_Clinical_Cases_NER_mBERT_cased_fine_tuned.tsv", "w", encoding="UTF-8")
f.write("filename\tmark\tlabel\toff0\toff1\tspan\n")
for filename in test_files_part:
  print(f"Text: {filename}", end="\r")
  with open(test_path + filename, "r", encoding="UTF-8") as ftest:
    lista_spans = []
    hc = ftest.read()
    pred = token_classifier(hc)
    pred_grouped = grouping_entities(pred)
    t = 1
    for p in pred_grouped:

      off0 = int(p['start'])
      off1 = int(p['end'])
      if p['entity_group'] == 'S':
        label = 'SPECIES'
      else:
        label = 'HUMAN'
      span = hc[off0:off1]

      spant = span
      span = re.sub("^, |^\. |^: |^; |^\( |^\) ","",span)
      
      if span != spant:
        off0 = off0+2
      else:
        span = re.sub("^,|^\.|^:|^;|^\(|^\)","",span)
        if span != spant:
          off0 = off0+1

      if "\n" in span:
        span = re.sub("\n"," ",span)

      if " - " in span:
        span = re.sub(" - ","-",span)
        off1 = off1-2

      if "- " in span:
        span = re.sub("- ","-",span)
        off1 = off1-2

      if " -" in span:
        span = re.sub(" -","-",span)
        off1 = off1-2

      if "( " in span:
        span = re.sub("\( ","(",span)
        off1 = off1-1

      if " )" in span:
        span = re.sub(" \)",")",span)
        off1 = off1-1

      if span.endswith(" y") :
        span = span[:-2]
        off1 = off1-2

      if span.endswith(" de") or span.endswith(" en"):
        span = span[:-3]
        off1 = off1-3

      if span.endswith(" por") or span.endswith(" con"):
        span = span[:-4]
        off1 = off1-4

      if span.endswith(".") or span.endswith(",") or span.endswith(";") or span.endswith(":") or span.endswith("–") or span.endswith("-"):
        span = span[:-1]
        off1 = off1-1

      if span.endswith(" .") or span.endswith(" ,") or span.endswith(" ;") or span.endswith(" :") or span.endswith(" –") or span.endswith(" -"):
        span = span[:-2]
        off1 = off1-2

      pattern = r"^[a-z|á|é|í|ó|ú|/]{0,3}$|^[0-9]+$|^[A-Z]$"
      match = re.findall(pattern, span)
      if len(match) > 0 and match[0] != 'tía' and match[0] != 'tío':
        continue

      if span not in lista_spans:
        # Find all indices of 'span'
        indices = [index for index in range(len(hc)) if hc.startswith(span, index)]

        for ind in indices:
          off0 = ind
          off1 = ind+len(span)
          f.write(filename[:-4]+"\t"+"T"+str(t)+"\t"+label+"\t"+str(off0)+"\t"+str(off1)+"\t"+span+"\n")
          #print(filename[:-4]+"\t"+"T"+str(t)+"\t"+label+"\t"+str(off0)+"\t"+str(off1)+"\t"+span+"\n")
          t+=1

        lista_spans.append(span)
f.close()
print("Completo.")

### Using the texts partitioned

In [None]:
print("Processing...")
import re

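#Append mode ("a") lets successive batches accumulate in the same file; note that the header line below is then written once per run.
f = open("/content/drive/MyDrive/LivingNER_subtask1NER_test_predictions_Texts_Partitioned_pl_3_NLP-CIC-WFU_Clinical_Cases_NER_Sents_Tokenized_mBERT_cased_fine_tuned.tsv", "a", encoding="UTF-8")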
f.write("filename\tmark\tlabel\toff0\toff1\tspan\n")

for filename in test_files_part:
  print(f"Text: {filename}", end="\r")
  with open(test_path + filename, "r", encoding="UTF-8") as ftest:
    lista_spans = []
    hc = ftest.read()
    #Partitioning the texts.
    t = 1
    pattern = r'\. |\.\n|\. \n'
    sents = re.split(pattern, hc)

    paragraphs = []
    prgh = ""

    #Paragraphs of "pl" sentences each, without overlap.
    #Change the parameters pl and step here to reproduce the different experiments carried out in our work.
    ant = 0
    step = 3
    pl = 3 #paragraph length
    for lpr in range(pl,len(sents),step):
      paragraphs.append(" ".join(sents[ant:lpr]))
      ant = lpr
    #Remaining sentences (this also covers texts with at most "pl" sentences, for which the loop never runs)
    if ant < len(sents):
      paragraphs.append(" ".join(sents[ant:]))
  
    #Inference per paragraphs
    for ph in paragraphs: 

      pred = token_classifier(ph)
      pred_grouped = grouping_entities(pred)
  
      for p in pred_grouped:

        off0 = int(p['start'])
        off1 = int(p['end'])
        if p['entity_group'] == 'S':
          label = 'SPECIES'
        else:
          label = 'HUMAN'

        span = p['word']

        span = re.sub("^, |^,|^\. |^\.|^: |^:|^; |^;|^\( |^\(|^\) |^\)","",span)

        if "\n" in span:
          span = re.sub("\n"," ",span)

        if " - " in span:
          span = re.sub(" - ","-",span)

        if "- " in span:
          span = re.sub("- ","-",span)

        if " -" in span:
          span = re.sub(" -","-",span)

        if "( " in span:
          span = re.sub("\( ","(",span)

        if " )" in span:
          span = re.sub(" \)",")",span)

        if span.endswith(" y") :
          span = span[:-2]

        if span.endswith(" de") or span.endswith(" en"):
          span = span[:-3]

        if span.endswith(" por") or span.endswith(" con"):
          span = span[:-4]

        if span.endswith(".") or span.endswith(",") or span.endswith(";") or span.endswith(":") or span.endswith("–") or span.endswith("-"):
          span = span[:-1]

        if span.endswith(" .") or span.endswith(" ,") or span.endswith(" ;") or span.endswith(" :") or span.endswith(" –") or span.endswith(" -"):
          span = span[:-2]

        pattern = r"^[a-z|á|é|í|ó|ú|/]{0,3}$|^[0-9]+$|^[A-Z]$"
        match = re.findall(pattern, span)
        if len(match) > 0 and match[0] != 'tía' and match[0] != 'tío':
          continue

        if span not in lista_spans:
          # Find all indices of 'span'
          indices = [index for index in range(len(hc)) if hc.startswith(span, index)]

          for ind in indices:
            off0 = ind
            off1 = ind+len(span)
            f.write(filename[:-4]+"\t"+"T"+str(t)+"\t"+label+"\t"+str(off0)+"\t"+str(off1)+"\t"+span+"\n")
            #print(filename[:-4]+"\t"+"T"+str(t)+"\t"+label+"\t"+str(off0)+"\t"+str(off1)+"\t"+span+"\n")
            t+=1

          lista_spans.append(span)
f.close()
print("Completo.")

Processing...
Completo.

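All four inference cells discard very short spans with the same regular expression before writing a row. The sketch below wraps that filter in a small function to show which spans are skipped; the pattern is copied from the cells, while the names `FILTER`, `KEEP_ANYWAY` and `is_discarded` are introduced here for illustration. It uses `re.search` in place of `re.findall`, which should give the same decision for spans without a trailing newline.

```python
import re

# Pattern copied from the inference cells. Inside the character class the "|"
# is a literal pipe character, not an alternation.
FILTER = r"^[a-z|á|é|í|ó|ú|/]{0,3}$|^[0-9]+$|^[A-Z]$"
KEEP_ANYWAY = {"tía", "tío"}  # short kinship terms the cells keep explicitly

def is_discarded(span):
    """True if the inference loop would skip this span."""
    return re.search(FILTER, span) is not None and span not in KEEP_ANYWAY

for candidate in ["de", "123", "A", "", "tía", "hijo", "VIH"]:
    print(repr(candidate), is_discarded(candidate))
```

In short: empty spans, lowercase strings of up to three characters, pure numbers and single capital letters are dropped, while `tía`, `tío`, longer words and uppercase acronyms such as `VIH` pass through.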

# Subtask 2 - Normalization

## Loading datasets

In [None]:
livingner_subtask2_training = pd.read_csv("/content/drive/MyDrive/Dataset/training_valid_test_background_multilingual/training/subtask2-Norm/training_entities_subtask2.tsv", delimiter="\t")
livingner_subtask2_training.head()

Unnamed: 0,filename,mark,label,off0,off1,span,isH,isN,iscomplex,code
0,32032497_ES,T1,HUMAN,112,118,hombre,False,False,False,9606
1,32032497_ES,T2,HUMAN,1025,1033,paciente,False,False,False,9606
2,32032497_ES,T3,HUMAN,1098,1106,paciente,False,False,False,9606
3,32032497_ES,T4,HUMAN,1395,1403,paciente,False,False,False,9606
4,32032497_ES,T5,SPECIES,1075,1084,2019-nCoV,False,False,False,2697049


In [None]:
livingner_subtask2_validation = pd.read_csv("/content/drive/MyDrive/Dataset/training_valid_test_background_multilingual/valid/subtask2-Norm/validation_entities_subtask2.tsv", delimiter="\t")
livingner_subtask2_validation.head()

Unnamed: 0,filename,mark,label,off0,off1,span,isH,isN,iscomplex,code
0,32119083_ES,T1,HUMAN,287,294,familia,False,False,False,9606
1,32119083_ES,T2,HUMAN,4827,4831,hijo,False,False,False,9606
2,32119083_ES,T3,HUMAN,3948,3953,madre,False,False,False,9606
3,32119083_ES,T4,HUMAN,4499,4504,madre,False,False,False,9606
4,32119083_ES,T5,HUMAN,4567,4572,madre,False,False,False,9606


## Loading the NCBI Taxonomy

In [None]:
NCBITax_tr = livingner_subtask2_training[['span','code']]
NCBITax_vl = livingner_subtask2_validation[['span','code']]
NCBITax = pd.concat([NCBITax_tr, NCBITax_vl], axis=0) 
NCBITax.head()

Unnamed: 0,span,code
0,hombre,9606
1,paciente,9606
2,paciente,9606
3,paciente,9606
4,2019-nCoV,2697049


In [None]:
NCBITax.shape

(23203, 2)

In [None]:
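# Map each annotated span to its code. When a span occurs with several
# codes, the last row seen overwrites the earlier ones.
NCBI_dic = {}
for s, c in zip(NCBITax['span'], NCBITax['code']):
  NCBI_dic[s] = c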

In [None]:
len(NCBI_dic)

3792
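
The 23,203 annotated rows collapse to 3,792 keys because repeated spans share one dictionary entry, and a span annotated with more than one code keeps only the last code seen. As an alternative to that behaviour (not what the notebook does), the most frequent code per span could be kept instead. The sketch below compares both policies on toy rows; `most_frequent_code` is a hypothetical helper and the rows are invented for illustration.

```python
from collections import Counter, defaultdict


def most_frequent_code(spans, codes):
    """Map each span to the code it was annotated with most often."""
    counts = defaultdict(Counter)
    for span, code in zip(spans, codes):
        counts[span][str(code)] += 1
    return {span: c.most_common(1)[0][0] for span, c in counts.items()}


# Toy annotations: 'virus' appears with two different codes.
spans = ["paciente", "virus", "virus", "virus"]
codes = ["9606", "10239", "10239", "2697049"]

last_wins = dict(zip(spans, codes))        # same policy as the loop above
majority = most_frequent_code(spans, codes)

print(last_wins["virus"])  # 2697049
print(majority["virus"])   # 10239
```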

In [None]:
# Add the Spanish names from the NCBI Taxonomy resource. Entries loaded here
# overwrite annotated spans that have the same key.
with open("/content/drive/MyDrive/Dataset/training_valid_test_background_multilingual/Resources/ncbi-taxo-names-spanish_v2.dmp", 'r', encoding='UTF-8') as NCBI_Taxonomia:
  for t in NCBI_Taxonomia:
    line = t.split('\t')
    code = line[0]      # first field: taxonomy code
    #spanEn = line[1]
    spanEs = line[-1]   # last field: Spanish name
    #NCBI_dic.update({spanEn: code})
    NCBI_dic[spanEs] = code

In [None]:
len(NCBI_dic)

3299314

In [None]:
NCBI_dic["SARS-CoV-2"]

'2697049'
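
The loading loop keys the dictionary on the last tab-separated field of each line. When that field is the final one on the line, `split('\t')` leaves the line terminator attached to it, so a defensive variant strips whitespace before storing the name. The sketch below assumes a simple `code<TAB>...<TAB>Spanish name` layout, which the source does not confirm; `parse_taxonomy_lines` is a hypothetical helper and the sample lines are toy data.

```python
def parse_taxonomy_lines(lines):
    """Build a name -> code dictionary from tab-separated lines.

    Assumes the code is the first field and the name is the last field.
    """
    name_to_code = {}
    for raw in lines:
        fields = raw.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        code = fields[0].strip()
        name = fields[-1].strip()
        if name:
            name_to_code[name] = code
    return name_to_code


# Toy lines standing in for the taxonomy file.
sample = ["2697049\tSARS-CoV-2\n", "562\tEscherichia coli\n", "\n"]
taxonomy = parse_taxonomy_lines(sample)
print(taxonomy["SARS-CoV-2"])  # 2697049
```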

## Normalization on validation dataset

In [None]:
output_task_1 = pd.read_csv("/content/drive/MyDrive/LivingNER_subtask1NER_validation_predictions_Texts_Partitioned_pl_3_NLP-CIC-WFU_Clinical_Cases_NER_Sents_Tokenized_mBERT_cased_fine_tuned.tsv", delimiter="\t")
output_task_1.head()

Unnamed: 0,filename,mark,label,off0,off1,span
0,caso_clinico_medicina_interna424,T1,HUMAN,0,5,Varón
1,caso_clinico_medicina_interna424,T2,HUMAN,88,98,personales
2,caso_clinico_medicina_interna424,T3,HUMAN,521,529,paciente
3,caso_clinico_medicina_interna424,T4,HUMAN,1668,1676,paciente
4,caso_clinico_medicina_interna424,T5,HUMAN,1738,1746,paciente


In [None]:
# HUMAN mentions always map to NCBI taxon 9606 (Homo sapiens). Any other
# mention is looked up by its exact span text, with '_NOCODE_' as fallback.
with open("/content/drive/MyDrive/LivingNER_subtask2-Norm_validation_predictions.tsv", "w", encoding="UTF-8") as f:
  f.write("filename\tmark\tlabel\toff0\toff1\tspan\tNCBITax\n")

  for fn, m, l, off0, off1, span in zip(output_task_1['filename'],output_task_1['mark'],output_task_1['label'],output_task_1['off0'],output_task_1['off1'],output_task_1['span']):
    if l == "HUMAN":
      code = "9606"
    else:
      # dict.get replaces the bare except; str() covers codes parsed as integers.
      code = str(NCBI_dic.get(span, '_NOCODE_'))
    f.write(fn+'\t'+m+'\t'+l+'\t'+str(off0)+'\t'+str(off1)+'\t'+span+'\t'+code+'\n')
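
The normalization rule in this cell can be read as a small function: HUMAN mentions always receive 9606, other mentions receive the dictionary code for the exact span string, and anything missing becomes `_NOCODE_`. The sketch below restates that rule under this reading; `normalize_span` is a hypothetical name and the one-entry dictionary is toy data.

```python
HUMAN_CODE = "9606"   # NCBI taxon for Homo sapiens, as used in the notebook
NO_CODE = "_NOCODE_"  # placeholder written when no code is found


def normalize_span(label, span, name_to_code):
    """Return the taxonomy code for one predicted mention."""
    if label == "HUMAN":
        return HUMAN_CODE
    # Exact string match only: case and spacing must agree with the key.
    return str(name_to_code.get(span, NO_CODE))


toy_dic = {"SARS-CoV-2": "2697049"}
print(normalize_span("HUMAN", "paciente", toy_dic))      # 9606
print(normalize_span("SPECIES", "SARS-CoV-2", toy_dic))  # 2697049
print(normalize_span("SPECIES", "sars-cov-2", toy_dic))  # _NOCODE_
```

Because the lookup is an exact match, a mention that differs from the dictionary key only in case, as in the last call, is left without a code.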

## Normalization on test dataset

In [None]:
output_task_1 = pd.read_csv("/content/drive/MyDrive/LivingNER_subtask1NER_test_predictions_Texts_Partitioned_pl_3_NLP-CIC-WFU_Clinical_Cases_NER_Sents_Tokenized_mBERT_cased_fine_tuned.tsv", delimiter="\t")
output_task_1.head()

Unnamed: 0,filename,mark,label,off0,off1,span
0,cc_onco1576,T1,HUMAN,10,15,Varón
1,cc_onco1576,T2,HUMAN,511,519,paciente
2,cc_onco1576,T3,HUMAN,2876,2884,paciente
3,cc_onco1576,T4,HUMAN,3167,3175,paciente
4,cc_onco1576,T5,HUMAN,3856,3864,paciente


In [None]:
# Same rule as for validation: HUMAN mentions map to 9606 (Homo sapiens);
# other mentions are looked up by exact span text, else '_NOCODE_'.
with open("/content/drive/MyDrive/LivingNER_subtask2-Norm_test_predictions.tsv", "w", encoding="UTF-8") as f:
  f.write("filename\tmark\tlabel\toff0\toff1\tspan\tNCBITax\n")

  for fn, m, l, off0, off1, span in zip(output_task_1['filename'],output_task_1['mark'],output_task_1['label'],output_task_1['off0'],output_task_1['off1'],output_task_1['span']):
    if l == "HUMAN":
      code = "9606"
    else:
      # dict.get replaces the bare except; str() covers codes parsed as integers.
      code = str(NCBI_dic.get(span, '_NOCODE_'))
    f.write(fn+'\t'+m+'\t'+l+'\t'+str(off0)+'\t'+str(off1)+'\t'+span+'\t'+code+'\n')
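
Once a predictions file has been written, one quick sanity check is the share of mentions per label that ended up as `_NOCODE_`. The sketch below computes that share from any text stream with the same seven-column header; `nocode_rate_by_label` is a hypothetical helper, and the three sample rows are toy data rather than actual results.

```python
import csv
import io
from collections import Counter


def nocode_rate_by_label(tsv_stream):
    """Return {label: fraction of rows whose NCBITax is '_NOCODE_'}."""
    reader = csv.DictReader(tsv_stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    total, missing = Counter(), Counter()
    for row in reader:
        total[row["label"]] += 1
        if row["NCBITax"] == "_NOCODE_":
            missing[row["label"]] += 1
    return {label: missing[label] / total[label] for label in total}


# Toy rows in the same layout as the predictions file.
sample = (
    "filename\tmark\tlabel\toff0\toff1\tspan\tNCBITax\n"
    "doc1\tT1\tHUMAN\t0\t5\tVarón\t9606\n"
    "doc1\tT2\tSPECIES\t10\t20\tSARS-CoV-2\t2697049\n"
    "doc1\tT3\tSPECIES\t30\t35\tvirus\t_NOCODE_\n"
)
rates = nocode_rate_by_label(io.StringIO(sample))
print(rates)  # {'HUMAN': 0.0, 'SPECIES': 0.5}
```

To run it on a real file, pass an open file handle (opened with `encoding="UTF-8"` and `newline=""`) instead of the `io.StringIO` object.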