<a href="https://colab.research.google.com/github/ajtamayoh/NLP-CIC-WFU-Contribution-to-SocialDisNER-shared-task-2022/blob/main/Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP-CIC-WFU contribution to SocialDisNER 2022

This notebook contains the source code for the paper:

### NLP-CIC-WFU at SocialDisNER: Disease Mention Extraction in Spanish Tweets Using Transfer Learning and Search by Propagation

Authors:

Antonio Tamayo (ajtamayo2019@ipn.cic.mx, ajtamayoh@gmail.com)

Diego A. Burgos (burgosda@wfu.edu)

Alexander Gelbukh (gelbukh@gelbukh.com)

For bugs or questions related to the code, do not hesitate to contact us (Antonio Tamayo: ajtamayoh@gmail.com)

If you use this code please cite our work:



# Requirements

To run this code, first download the dataset from: https://drive.google.com/drive/folders/1q6eZwL7sNTupRQW_bNkISvxY9t-zBsdu?usp=sharing

Then create a folder called "Dataset" in the root of your Google Drive and upload to it both downloaded folders and the file mentions.tsv.

Once the dataset is in place, [open this notebook in Colab](https://colab.research.google.com/github/ajtamayoh/NLP-CIC-WFU-Contribution-to-SocialDisNER-shared-task-2022/blob/main/Code.ipynb) and save a copy in your Drive.
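
Before running the cells below, it can help to confirm that the Drive folder has the layout the notebook expects. The following is a minimal sketch, not part of the original code: the helper name `missing_dataset_entries` is hypothetical, and the list of entries is inferred from the paths this notebook reads later (`mentions.tsv`, the training and validation text folders, and the test text folder).

```python
from pathlib import Path

# Entries read later in this notebook (inferred from its paths; adjust if your copy differs).
EXPECTED_ENTRIES = [
    "mentions.tsv",
    "train-valid-txt-files/training",
    "train-valid-txt-files/validation",
    "test-data/test-data-txt-files",
]

def missing_dataset_entries(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]

# In Colab, after mounting Drive:
# print(missing_dataset_entries("/content/drive/MyDrive/Dataset"))
```

An empty list means every expected file and folder was found.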

## About the infrastructure

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Jul  5 00:23:54 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    27W / 250W |      2MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


## Connecting to Google drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Exploring & Preprocessing Data

In [None]:
import pandas as pd
import numpy as np
import spacy

In [None]:
socialdisner_training = pd.read_csv("/content/drive/MyDrive/Dataset/mentions.tsv", delimiter="\t")
socialdisner_training.head()

Unnamed: 0,tweets_id,begin,end,type,extraction
0,1357198223706894339,12,19,ENFERMEDAD,alergia
1,1357198223706894339,21,26,ENFERMEDAD,covid
2,1357198223706894339,29,34,ENFERMEDAD,gripe
3,1494382407398662147,17,25,ENFERMEDAD,DIABETES
4,1494382407398662147,114,122,ENFERMEDAD,diabetes


In [None]:
text_files_path = "/content/drive/MyDrive/Dataset/train-valid-txt-files/training/"

In [None]:
# Use a context manager so the file is closed after reading
with open(text_files_path + str(socialdisner_training.iloc[1,0]) + ".txt", "r", encoding="UTF-8") as f:
  for l in f:
    print(l)

No sé si es alergia, covid o gripe, ya solo espero si sobrevivo o no







In [None]:
socialdisner_training["tweets_id"].unique().shape

(7475,)

In [None]:
#Tweets
Tws = {}
Tws_ids_training = []
for fname in socialdisner_training["tweets_id"].unique():
  try:
    with open(text_files_path + str(fname) + ".txt", "r", encoding="UTF-8") as f:
      Tws.update({fname: f.read()})
      Tws_ids_training.append(fname)
  except FileNotFoundError:
    pass  # skip ids that have no text file in the training folder

In [None]:
len(Tws)

4975
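
The two counts above differ: `mentions.tsv` lists 7475 unique tweet ids, but only 4975 texts were loaded, so the `try`/`except` silently skipped 2500 ids. One plausible reason (an assumption, not stated in this notebook) is that the file also covers tweets stored outside the training folder. A small sketch for making the skipped ids visible, using a hypothetical helper `split_by_availability`:

```python
from pathlib import Path

def split_by_availability(tweet_ids, folder):
    """Partition tweet ids into those with and without a `<id>.txt` file in `folder`."""
    folder = Path(folder)
    found, missing = [], []
    for tweet_id in tweet_ids:
        target = found if (folder / f"{tweet_id}.txt").is_file() else missing
        target.append(tweet_id)
    return found, missing

# Hypothetical usage with the notebook's variables:
# found, missing = split_by_availability(
#     socialdisner_training["tweets_id"].unique(), text_files_path)
# print(len(found), len(missing))
```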

In [None]:
#Diseases
ENF = {}
enfermedades = []
fn = socialdisner_training["tweets_id"][0]
for fname, enf in zip(socialdisner_training["tweets_id"], socialdisner_training["extraction"]):
    try: 
      if Tws[fname]: #To take only the diseases in the training dataset
        if fname!=fn:
          enfermedades = []
        enfermedades.append(enf)
        ENF.update({fname: enfermedades})
        fn = fname
    except:
      pass

In [None]:
len(ENF)

4975

In [None]:
Tws[1494382407398662147]

'ADOLESCENTES CON DIABETES\n\nHola!\nMe llamo Elisenda, tengo 17 años y estoy haciendo mi trabajo de recerca sobre la diabetes, ya que yo también soy diabética.\nEl trabajo en concreto va sobre los adolescentes con diabet...\n\nLeer más https://t.co/cLmT3CzWnV\n\n#diabetes #diabetES P https://t.co/KqX4HKVjW3\n\n\n'

In [None]:
ENF[1494382407398662147]

['DIABETES', 'diabetes', 'diabética', 'diabetes', 'diabet', 'diabetES']

## Tokenization using SpaCy

In [None]:
from spacy.lang.es import Spanish
nlp = Spanish()
# Create a Tokenizer with the default settings for Spanish
# including punctuation rules and exceptions
tokenizer_spacy = nlp.tokenizer

In [None]:
Tws_tokenized = []
for tw in Tws:
    tx = []
    tokens = tokenizer_spacy(Tws[tw])
    #tokens = Tws[tw].split(" ") #The simplest option. It was not used in our work.
    for t in tokens:
        tx.append(str(t))
    Tws_tokenized.append(tx)

In [None]:
len(Tws_tokenized)

4975

In [None]:
Tws_tokenized[1]

['ADOLESCENTES',
 'CON',
 'DIABETES',
 '\n\n',
 'Hola',
 '!',
 '\n',
 'Me',
 'llamo',
 'Elisenda',
 ',',
 'tengo',
 '17',
 'años',
 'y',
 'estoy',
 'haciendo',
 'mi',
 'trabajo',
 'de',
 'recerca',
 'sobre',
 'la',
 'diabetes',
 ',',
 'ya',
 'que',
 'yo',
 'también',
 'soy',
 'diabética',
 '.',
 '\n',
 'El',
 'trabajo',
 'en',
 'concreto',
 'va',
 'sobre',
 'los',
 'adolescentes',
 'con',
 'diabet',
 '...',
 '\n\n',
 'Leer',
 'más',
 'https://t.co/cLmT3CzWnV',
 '\n\n',
 '#',
 'diabetes',
 '#',
 'diabetES',
 'P',
 'https://t.co/KqX4HKVjW3',
 '\n\n\n']

In [None]:
Ent_tokenized = []
for enf in ENF:
    Tks = []
    for e in ENF[enf]:
      sl = []
      tokens = tokenizer_spacy(e)
      #tokens = e.split(" ")
      for t in tokens:
          sl.append(str(t))
      Tks.append(sl)
    Ent_tokenized.append(Tks)

In [None]:
len(Ent_tokenized)

4975

In [None]:
Ent_tokenized[1]

[['DIABETES'],
 ['diabetes'],
 ['diabética'],
 ['diabetes'],
 ['diabet'],
 ['diabetES']]

## Tagging Data with BIO scheme

In [None]:
def find_idx(list_to_check, item_to_find):
    indices = []
    for idx, value in enumerate(list_to_check):
        if value == item_to_find:
            indices.append(idx)
    return indices

In [None]:
import sys
labels_tokenized = []
idx =-1
for hct, et in zip(Tws_tokenized, Ent_tokenized):
    idx+=1
    labels = []
    for i in range(len(hct)):
        #Labels: 0->'O'; 1->'B'; 2->'I'
        #labels.append('O')
        labels.append(0)

    #For Entities (Diseases|Enfermedades)
    for enf in et:
      first = True
      for e in enf:
          if first == True:
              try:
                #labels[hct.index(e)] = 'B'
                #labels[posLab] = 'B'
                indices = find_idx(hct, e)
                if len(indices) > 1:
                  for id in indices:
                      labels[id] = 1
                else:
                  labels[hct.index(e)] = 1
                
                first = False   
              except:
                first = False   
                continue
          else:
              try:
                #labels[hct.index(e)] = 'I'
                #labels[posLab] = 'I'
                indices = find_idx(hct, e)
                if len(indices) > 1:
                  for id in indices:
                      if labels[id-1] != 0:
                        labels[id] = 2
                else:
                  labels[hct.index(e)] = 2
              except:
                continue

    labels_tokenized.append(labels)
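
The loop above matches mention tokens against tweet tokens by string, so the `begin` and `end` columns of `mentions.tsv` are not used. As a hedged alternative (not the method used for the paper), tags can be derived from the character offsets directly. The sketch below uses a simple regex tokenizer instead of spaCy to stay self-contained; the function name `bio_tags_from_offsets` is hypothetical.

```python
import re

def bio_tags_from_offsets(text, mentions):
    """Tag tokens with 0 (O), 1 (B), 2 (I) from character spans.

    `mentions` is a list of (begin, end) character offsets, end exclusive.
    """
    # Simple regex tokenizer (an assumption; the notebook uses spaCy instead).
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]
    tags = [0] * len(tokens)
    for begin, end in mentions:
        first = True
        for i, (_, start, stop) in enumerate(tokens):
            # A token belongs to the mention if it lies fully inside the span.
            if start >= begin and stop <= end:
                tags[i] = 1 if first else 2
                first = False
    return [tok for tok, _, _ in tokens], tags

text = "No sé si es alergia, covid o gripe, ya solo espero si sobrevivo o no"
tokens, tags = bio_tags_from_offsets(text, [(12, 19), (21, 26), (29, 34)])
for token, tag in zip(tokens, tags):
    print(token, tag, sep="\t")
```

The offsets in the example are the ones shown for tweet 1357198223706894339 in the `mentions.tsv` preview. Only the annotated occurrence of each mention is tagged, which is the main difference from string matching.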

In [None]:
print(Tws[Tws_ids_training[48]])
print(Tws_ids_training[48])

No fui fumador, pero tengo cáncer de pulmón, y pienso que sí tan sólo dejará de toser por un día, sería maravilloso. Cuatro años tosiendo mañana, tarde y noche es desgastante. Sólo los que tenemos cáncer de pulmón sabemos lo que esto significa. #Elcancerdepulmónimporta



1144065562240327681


In [None]:
j = 48
for i in range(len(Tws_tokenized[j])):
  print(str(Tws_tokenized[j][i]) + "\t" + str(labels_tokenized[j][i]))

No	0
fui	0
fumador	1
,	0
pero	0
tengo	0
cáncer	1
de	2
pulmón	2
,	0
y	0
pienso	0
que	0
sí	0
tan	0
sólo	0
dejará	0
de	0
toser	0
por	0
un	0
día	0
,	0
sería	0
maravilloso	0
.	0
Cuatro	0
años	0
tosiendo	0
mañana	0
,	0
tarde	0
y	0
noche	0
es	0
desgastante	0
.	0
Sólo	0
los	0
que	0
tenemos	0
cáncer	1
de	2
pulmón	2
sabemos	0
lo	0
que	0
esto	0
significa	0
.	0
#	0
Elcancerdepulmónimporta	0



	0


## Validating tokenization and alignment with the BIO tags.




In [None]:
flag = 0
for st, lt in zip(Tws_tokenized, labels_tokenized):
    if len(st) != len(lt):
        print(st)
        print(lt)
        flag = 1
if flag==0:
    print("Everything is aligned!")

Everything is aligned!


## Sentence tokenization: an alternative (we ultimately used whole tweets as samples)

In [None]:
sent_tokenized = []
label_sent_tokenized = []
for ht, lht in zip(Tws_tokenized, labels_tokenized):
  st = []; lbst = []
  for h, l in zip(ht,lht):
    if h != ".":
      st.append(h)
      lbst.append(l)
    else:
      st.append(".")
      lbst.append(0)
      sent_tokenized.append(st)
      label_sent_tokenized.append(lbst)
      st = []; lbst = []

In [None]:
len(sent_tokenized)

6870

In [None]:
sent_tokenized[0]

['ADOLESCENTES',
 'CON',
 'DIABETES',
 '\n\n',
 'Hola',
 '!',
 '\n',
 'Me',
 'llamo',
 'Elisenda',
 ',',
 'tengo',
 '17',
 'años',
 'y',
 'estoy',
 'haciendo',
 'mi',
 'trabajo',
 'de',
 'recerca',
 'sobre',
 'la',
 'diabetes',
 ',',
 'ya',
 'que',
 'yo',
 'también',
 'soy',
 'diabética',
 '.']

In [None]:
len(label_sent_tokenized)

6870

In [None]:
label_sent_tokenized[0]

[0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0]

# Disease mentions identification as a Token classification problem

## Install the Transformers and Datasets libraries to run this notebook.

In [None]:
!pip install datasets transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting accelerate
  Downloading accelerate-0.10.0-py3-none-any.whl (117 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.10.0
Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 62 not upgraded.


## Building the Dataset

In [None]:
dic = {"tokens": Tws_tokenized, "ner_tags": labels_tokenized} #Whole tweets as samples. We used this option for our paper.
#dic = {"tokens": sent_tokenized, "ner_tags": label_sent_tokenized} #Use this option to check model performance on sentences split at "." instead of whole tweets.

In [None]:
from datasets import Dataset, DatasetDict
dataset = Dataset.from_dict(dic)

In [None]:
dataset

Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 4975
})

In [None]:
#For training, validation, and test partitions
"""
#Train, val, test partitions
train_test = dataset.train_test_split()
test_val = train_test['test'].train_test_split()
raw_datasets = DatasetDict({
    'train': train_test['train'],
    'validation': test_val['train'],
    'test': test_val['test']
    })
"""

#Just for training and validation partitions
train_test = dataset.train_test_split()
raw_datasets = DatasetDict({
    'train': train_test['train'],
    'validation': train_test['test']
    })


In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 3731
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 1244
    })
})
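
The split above relies on the default test fraction, which yields 3731 training and 1244 validation rows out of 4975, and it is re-drawn at random on each run. The sketch below reproduces that arithmetic and shows a seeded shuffle using only the standard library. It illustrates the idea rather than the `datasets` implementation, and the helper `split_indices` and the seed value are hypothetical. In the notebook itself, passing a `seed` argument to `train_test_split` should serve the same purpose.

```python
import math
import random

def split_indices(n_rows, test_size=0.25, seed=42):
    """Shuffle row indices with a fixed seed and cut off a held-out portion."""
    n_test = math.ceil(n_rows * test_size)
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, val_idx = split_indices(4975)
print(len(train_idx), len(val_idx))  # prints: 3731 1244
```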

In [None]:
raw_datasets["train"][0]["ner_tags"]

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 2,
 2,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [None]:
raw_datasets['train']

Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 3731
})

In [None]:
label_names = ['O','B','I']    #BIO scheme
#label_names = ['O','I']    #IO scheme
label_names

['O', 'B', 'I']

In [None]:
words = raw_datasets["train"][0]["tokens"]
labels = [int(n) for n in raw_datasets["train"][0]["ner_tags"]]
#labels = raw_datasets["train"][0]["pos_tags"]
#labels = raw_datasets["train"][0]["chunk_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

A nuestra Ministra de Sanidad le entra la risa cuando le preguntan cuando se incluirá al dietista-nutricionista en atención primaria . Mientras tanto , las enfermedades no transmisibles , como el cáncer , la obesidad o la diabetes , siguen siendo la principal causa de muerte evitable https://t.co/cJBg68Uv5 M 


 
O O       O        O  O       O  O     O  O    O      O  O         O      O  O        O  O                      O  O        O        O O        O     O O   B            I  I             O O    O  B      O O  B        O O  B        O O      O      O  O         O     O  O      O        O                      O O   


## Loading mBERT as a pre-trained model

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "ajtamayoh/NER_EHR_Spanish_model_Mulitlingual_BERT"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

In [None]:
tokenizer.is_fast

True

In [None]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
inputs.tokens()

['[CLS]',
 'A',
 'nuestra',
 'Ministra',
 'de',
 'San',
 '##idad',
 'le',
 'entra',
 'la',
 'ri',
 '##sa',
 'cuando',
 'le',
 'pregunta',
 '##n',
 'cuando',
 'se',
 'incluir',
 '##á',
 'al',
 'diet',
 '##ista',
 '-',
 'nu',
 '##trici',
 '##onista',
 'en',
 'atención',
 'primaria',
 '.',
 'Mientras',
 'tanto',
 ',',
 'las',
 'enfermedades',
 'no',
 'trans',
 '##misi',
 '##bles',
 ',',
 'como',
 'el',
 'cáncer',
 ',',
 'la',
 'ob',
 '##esi',
 '##dad',
 'o',
 'la',
 'diabetes',
 ',',
 'siguen',
 'siendo',
 'la',
 'principal',
 'causa',
 'de',
 'muerte',
 'evit',
 '##able',
 'https',
 ':',
 '/',
 '/',
 't',
 '.',
 'co',
 '/',
 'c',
 '##J',
 '##B',
 '##g',
 '##6',
 '##8',
 '##U',
 '##v',
 '##5',
 'M',
 '[SEP]']

In [None]:
inputs.word_ids()

[None,
 0,
 1,
 2,
 3,
 4,
 4,
 5,
 6,
 7,
 8,
 8,
 9,
 10,
 11,
 11,
 12,
 13,
 14,
 14,
 15,
 16,
 16,
 16,
 16,
 16,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 27,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 34,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 46,
 47,
 47,
 47,
 47,
 47,
 47,
 47,
 47,
 47,
 47,
 47,
 47,
 47,
 47,
 47,
 47,
 47,
 48,
 None]

In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 1, 0, 0, 1, 2, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
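
`align_labels_with_tokens` propagates a word's label to all of its sub-tokens, turning `B` into `I` on continuations. A common alternative, shown here only as a sketch and not used in this notebook, labels the first sub-token of each word and masks the rest with `-100` so they are ignored by the loss. The function name is hypothetical.

```python
def align_labels_first_subtoken(labels, word_ids, ignore_index=-100):
    """Keep the word label on the first sub-token; mask special tokens and continuations."""
    new_labels = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None or word_id == previous_word:
            # Special token ([CLS], [SEP]) or continuation of the previous word
            new_labels.append(ignore_index)
        else:
            new_labels.append(labels[word_id])
        previous_word = word_id
    return new_labels

# Toy input: [CLS] ob ##esi ##dad grave [SEP]
# Word labels: B (1) for "obesidad", O (0) for "grave".
print(align_labels_first_subtoken([1, 0], [None, 0, 0, 0, 1, None]))
# prints: [-100, 1, -100, -100, 0, -100]
```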


In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [None]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)




In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

tensor([[-100,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    1,
            2,    2,    2,    2,    0,    0,    0,    1,    0,    0,    1,    2,
            2,    0,    0,    1,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0, -100],
        [-100,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    1,    2,    0,    0,    0,    0,
            1,    0,    0,    0,    0,    0, 

In [None]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 1, 0, 0, 1, 2, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]


In [None]:
!pip install seqeval

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from datasets import load_metric

metric = load_metric("seqeval")

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]
labels

['O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B',
 'I',
 'I',
 'O',
 'O',
 'O',
 'B',
 'O',
 'O',
 'B',
 'O',
 'O',
 'B',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O']

In [None]:
predictions = labels.copy()
predictions[2] = "I"
metric.compute(predictions=[predictions], references=[labels])

{'_': {'f1': 0.888888888888889, 'number': 4, 'precision': 0.8, 'recall': 1.0},
 'overall_accuracy': 0.98,
 'overall_f1': 0.888888888888889,
 'overall_precision': 0.8,
 'overall_recall': 1.0}
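
The numbers above follow from entity-level counting: the stray `I` at position 2 forms a fifth predicted entity, so 4 of 5 predicted entities are correct (precision 0.8) and all 4 reference entities are found (recall 1.0). Token accuracy is 49/50 = 0.98. The sketch below reproduces this with a simplified span extractor. It assumes lenient chunking in which an `I` after `O` starts a new entity, and it is not the seqeval implementation.

```python
def extract_spans(tags):
    """Return (start, end) spans from B/I/O tags; an 'I' after 'O' opens a span."""
    spans = []
    start = None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if start is not None and tag in ("O", "B"):
            spans.append((start, i))
            start = None
        if tag == "B" or (tag == "I" and start is None):
            start = i
    return spans

def span_scores(reference, predicted):
    """Entity-level precision, recall and F1 based on exact span matches."""
    ref, pred = set(extract_spans(reference)), set(extract_spans(predicted))
    correct = len(ref & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Same tag sequence as the example above: entities at 25-27, 31, 34 and 37.
labels = ["O"] * 50
labels[25] = "B"
labels[26] = labels[27] = "I"
for i in (31, 34, 37):
    labels[i] = "B"

predictions = labels.copy()
predictions[2] = "I"
print(span_scores(labels, predictions))
```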

In [None]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

In [None]:
id2label = {str(i): label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [None]:
id2label

{'0': 'O', '1': 'B', '2': 'I'}

In [None]:
label2id

{'B': '1', 'I': '2', 'O': '0'}

## Changing the head of prediction for Disease Mentions Identification under the BIO scheme

In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(    
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
    num_labels = 3,   #for BIO scheme
)

In [None]:
model.config.num_labels

3

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    "NLP-CIC-WFU_SocialDisNER_fine_tuned_NER_EHR_Spanish_model_Mulitlingual_BERT_v2",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    num_train_epochs=7,
    weight_decay=0.01,
    push_to_hub=True,
)
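
No batch size is set here, so the run uses a per-device batch size of 8, as reported in the training log further down. With 3731 training examples and 7 epochs, the step count in that log can be checked by hand. A minimal sketch, assuming a single device, no gradient accumulation, and that the last partial batch is kept (the helper name is hypothetical):

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

print(math.ceil(3731 / 8))                   # steps per epoch: 467
print(total_optimization_steps(3731, 8, 7))  # total steps: 3269
```

The per-epoch value also matches the name of the first saved checkpoint, `checkpoint-467`.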

## Fine-tuning mBERT for Disease mentions identification

## Hugging Face Authentication

If you want to save your own model and make it available online, we strongly recommend signing up at: https://huggingface.co/

You will need to set up git: replace the email and name in the following cell with your own.

In [None]:
!git config --global user.email "ajtamayoh@gmail.com"
!git config --global user.name "ajtamayoh"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

/content/NLP-CIC-WFU_SocialDisNER_fine_tuned_NER_EHR_Spanish_model_Mulitlingual_BERT_v3 is already a clone of https://huggingface.co/ajtamayoh/NLP-CIC-WFU_SocialDisNER_fine_tuned_NER_EHR_Spanish_model_Mulitlingual_BERT_v3. Make sure you pull the latest changes with `repo.git_pull()`.
***** Running training *****
  Num examples = 3731
  Num Epochs = 7
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 3269


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,0.084678,0.852901,0.83974,0.846269,0.974932
2,0.100700,0.08511,0.849301,0.871142,0.860083,0.974589
3,0.055300,0.096344,0.842051,0.880347,0.860773,0.976352
4,0.031800,0.112114,0.860484,0.876557,0.868446,0.976591
5,0.019800,0.125298,0.860032,0.874932,0.867418,0.976715
6,0.010300,0.130126,0.857827,0.872225,0.864966,0.977078
7,0.006800,0.144355,0.852795,0.879805,0.866089,0.976384


***** Running Evaluation *****
  Num examples = 1244
  Batch size = 8
Saving model checkpoint to NLP-CIC-WFU_SocialDisNER_fine_tuned_NER_EHR_Spanish_model_Mulitlingual_BERT_v3/checkpoint-467
Configuration saved in NLP-CIC-WFU_SocialDisNER_fine_tuned_NER_EHR_Spanish_model_Mulitlingual_BERT_v3/checkpoint-467/config.json
Model weights saved in NLP-CIC-WFU_SocialDisNER_fine_tuned_NER_EHR_Spanish_model_Mulitlingual_BERT_v3/checkpoint-467/pytorch_model.bin
tokenizer config file saved in NLP-CIC-WFU_SocialDisNER_fine_tuned_NER_EHR_Spanish_model_Mulitlingual_BERT_v3/checkpoint-467/tokenizer_config.json
Special tokens file saved in NLP-CIC-WFU_SocialDisNER_fine_tuned_NER_EHR_Spanish_model_Mulitlingual_BERT_v3/checkpoint-467/special_tokens_map.json
tokenizer config file saved in NLP-CIC-WFU_SocialDisNER_fine_tuned_NER_EHR_Spanish_model_Mulitlingual_BERT_v3/tokenizer_config.json
Special tokens file saved in NLP-CIC-WFU_SocialDisNER_fine_tuned_NER_EHR_Spanish_model_Mulitlingual_BERT_v3/special_tok

TrainOutput(global_step=3269, training_loss=0.034707054846802414, metrics={'train_runtime': 571.4813, 'train_samples_per_second': 45.701, 'train_steps_per_second': 5.72, 'total_flos': 1401874877491380.0, 'train_loss': 0.034707054846802414, 'epoch': 7.0})
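
In the per-epoch log above, validation loss is lowest after the first epoch and rises afterwards, while F1 peaks at epoch 4. The sketch below reads the best epoch off the logged values. The numbers are copied from the log; the variable names are illustrative. If you want the trainer to keep the best checkpoint rather than the last one, `TrainingArguments` offers `load_best_model_at_end` and `metric_for_best_model`; this notebook does not set them.

```python
# (epoch, validation_loss, f1) copied from the training log above
history = [
    (1, 0.084678, 0.846269),
    (2, 0.085110, 0.860083),
    (3, 0.096344, 0.860773),
    (4, 0.112114, 0.868446),
    (5, 0.125298, 0.867418),
    (6, 0.130126, 0.864966),
    (7, 0.144355, 0.866089),
]

best_f1_epoch = max(history, key=lambda row: row[2])[0]
lowest_loss_epoch = min(history, key=lambda row: row[1])[0]
print(best_f1_epoch, lowest_loss_epoch)  # prints: 4 1
```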

## Saving the fine-tuned model at Hugging Face (It requires previous authentication)

In [None]:
trainer.push_to_hub(commit_message="Training complete")

## Analyzing predictions

In [None]:
import numpy as np

predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

preds = np.argmax(predictions.predictions, axis=-1)


In [None]:
i=0
print(raw_datasets["validation"][i]['tokens'])
for j in range(len(preds[i])):
  print(raw_datasets["validation"][i]['ner_tags'][j], "\t", preds[i][j])
print(' '.join(raw_datasets["validation"][i]['tokens']))

## Loading the model for inference

In [None]:
from transformers import pipeline

#Replace this with your own checkpoint. If you have run all the previous cells successfully, the model should be available in your Hugging Face account under the name: NLP-CIC-WFU_SocialDisNER_fine_tuned_NER_EHR_Spanish_model_Mulitlingual_BERT_v2
model_checkpoint = "ajtamayoh/NLP-CIC-WFU_SocialDisNER_fine_tuned_NER_EHR_Spanish_model_Mulitlingual_BERT_v2"

token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)


In [None]:
#pred = token_classifier("¿Qué probabilidad hay de no pillar la gripe si me paso la vida en el hospital?")
pred = token_classifier("Wanna collaborate in breast cancer research? Quieres colaborar en nuestra lucha contra el cáncer de mama? https://t.co/P53hAhcdb5 #cancer #breastcancer #cancerdemama https://t.co/wP9pVG41UW")
pred

In [None]:
val_path = "/content/drive/MyDrive/Dataset/train-valid-txt-files/validation/"

In [None]:
from os import listdir
val_file_names = listdir(val_path)

In [None]:
i = 14
with open(val_path + val_file_names[i], "r", encoding="UTF-8") as ftest:
  pred = token_classifier(ftest.read())
pred

## Post-Processing

In [None]:
def grouping_entities(pred):
  import re
  output = []
  for e in pred:
    if "##" not in e['word']:
      output.append(e)
    else:
      try:
        if e['start'] == (output[-1]['end']):
          output[-1]['word'] = output[-1]['word']+re.sub("##","",e['word'])
          output[-1]['end'] = e['end']
      except:
        pass
    
    try:
      if (e['entity_group'] == "B" or e['entity_group'] == "I") and (e['start'] == (output[-2]['end']+1)):
        output[-2]['word'] = output[-2]['word']+" "+e['word']
        output[-2]['end'] = e['end']
        output.pop(-1)
    except:
      pass
    
    try:
      if e['start'] == (output[-2]['end']):
        output[-2]['word'] = output[-2]['word']+e['word']
        output[-2]['end'] = e['end']
        output.pop(-1)
    except:
      pass

  return output


In [None]:
grouping_entities(pred)
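
The first step of `grouping_entities` glues WordPiece continuation pieces (those starting with `##`) back onto the preceding entity. The sketch below isolates that step on hand-made input shaped like token-classification pipeline output. It is a simplified illustration, not a replacement for the function above: the name `merge_wordpieces` and the example values are hypothetical, and, unlike the original, a continuation piece with no adjacent predecessor is kept as-is instead of being dropped.

```python
def merge_wordpieces(entities):
    """Glue '##' continuation pieces onto the preceding entity when they are adjacent."""
    merged = []
    for ent in entities:
        word = ent["word"]
        if word.startswith("##") and merged and ent["start"] == merged[-1]["end"]:
            merged[-1]["word"] += word[2:]
            merged[-1]["end"] = ent["end"]
        else:
            merged.append(dict(ent))  # copy so the input list is left untouched
    return merged

# Illustrative values only; real pipeline output also carries a 'score' field.
pieces = [
    {"entity_group": "B", "word": "ob", "start": 3, "end": 5},
    {"entity_group": "I", "word": "##esidad", "start": 5, "end": 11},
    {"entity_group": "B", "word": "diabetes", "start": 17, "end": 25},
]
print(merge_wordpieces(pieces))
```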

## Predictions on validation dataset

In [None]:
def delete_accents(s):
  l = [('á', 'a'), ('é','e'), ('í','i'), ('ó','o'), ('ú','u')]
  for v in l:
    s = s.replace(v[0],v[1])
  return s
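
`delete_accents` covers the five lowercase accented vowels, which is enough when the text is lowercased first, as it is below. Because the results are later used to compute character offsets, any replacement must keep the string length unchanged. A broader variant can be sketched with `unicodedata`, mapping each character to the first code point of its decomposition so that exactly one character comes out per character in. This is an illustrative alternative, not the function used for the paper. Note that it also folds `ñ` to `n` and `ü` to `u`, which `delete_accents` does not.

```python
import unicodedata

def fold_accents(text):
    """Replace each accented character by its base letter, one character in, one out."""
    folded = []
    for ch in text:
        # NFD splits 'á' into 'a' plus a combining accent; keep only the base character.
        folded.append(unicodedata.normalize("NFD", ch)[0])
    return "".join(folded)

sample = "Cáncer de pulmón"
print(fold_accents(sample))
print(len(fold_accents(sample)) == len(sample))
```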

In [None]:
print("Processing...")
import re
f = open("/content/drive/MyDrive/dev_predictions_SocialDisNER_model_v2.tsv", "w", encoding="UTF-8")
f.write("tweets_id\tbegin\tend\ttype\textraction\n")
for fname in val_file_names:
  print(f"Text: {fname}", end="\r")
  with open(val_path + fname, "r", encoding="UTF-8") as fval:
    lista_spans = []
    offs = []
    hc = fval.read()
    pred = token_classifier(hc)
    pred_grouped = grouping_entities(pred)
    t = 1
    for p in pred_grouped:

      off0 = int(p['start'])
      off1 = int(p['end'])
      span = p['word']

      span = re.sub(r"^, |^,|^\. |^\.|^: |^:|^; |^;|^\( |^\(|^\) |^\)","",span)

      if "\n" in span:
        span = re.sub("\n"," ",span)

      if " - " in span:
        span = re.sub(" - ","-",span)

      if "- " in span:
        span = re.sub("- ","-",span)

      if " -" in span:
        span = re.sub(" -","-",span)

      if "( " in span:
        span = re.sub(r"\( ","(",span)

      if " )" in span:
        span = re.sub(r" \)",")",span)

      if span.endswith(" y") :
        span = span[:-2]
        
      if span.endswith(" de") or span.endswith(" en"):
        span = span[:-3]

      if span.endswith(" por") or span.endswith(" con") or span.endswith(" del"):
        span = span[:-4]

      if span.endswith(".") or span.endswith(",") or span.endswith(";") or span.endswith(":") or span.endswith("–") or span.endswith("-") or span.endswith("!"):
        span = span[:-1]

      if span.endswith(" .") or span.endswith(" ,") or span.endswith(" ;") or span.endswith(" :") or span.endswith(" –") or span.endswith(" -"):
        span = span[:-2]

      if span.startswith("#"):
        span = span[1:]

      if span.startswith("🔸 "):
        span = span[2:]

      pattern = r"^[a-z|á|é|í|ó|ú|/]{0,2}$|^[0-9]+$|^[A-Z]$|^#$| a$| el$| la$|^🔸$"
      match = re.findall(pattern, span)
      if len(match) > 0:
        continue

      if span not in lista_spans:
        #For multiword spans
        mwspan = "".join(span.split())
        if span != mwspan:
          spans = [span, mwspan]
        else:
          spans = [span]
        # Find all indices of 'span'
        for sp in spans:
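          # Normalize the text and the span once, not once per character index
          hc_norm = delete_accents(hc.lower())
          sp_norm = delete_accents(sp.lower())
          indices = [index for index in range(len(hc)) if hc_norm.startswith(sp_norm, index)]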
          for ind in indices:
            off0 = ind
            off1 = ind+len(sp)
            extraction = hc[off0:off1]
            match = re.findall(pattern, extraction)
            no_subsumed = True
            for of in offs:
              if off0 == of[0] and off1 < of[1]:
                no_subsumed = False
            output = fname[:-4]+"\t"+str(off0)+"\t"+str(off1)+"\t"+"ENFERMEDAD"+"\t"+extraction+"\n"
            if len(match) == 0 and no_subsumed:
              f.write(output)
              t+=1
              offs.append((off0,off1))

          lista_spans.append(sp)
          
f.close()
print("Done.")
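
The inner loops of the cell above implement the search by propagation: once the model predicts a span, every occurrence of that span in the tweet is reported, ignoring case and acute accents, and a whitespace-free variant of multiword spans is also searched so that hashtags can match. The following is a minimal, self-contained sketch of that idea only. The names `normalize`, `span_variants` and `propagate_span` are hypothetical, the example tweet is made up, and the filtering pattern and the subsumption check of the original cell are left out.

```python
_ACCENT_TABLE = str.maketrans("áéíóú", "aeiou")

def normalize(s):
    # Same idea as delete_accents(s.lower()): lowercase, then strip acute accents
    return s.lower().translate(_ACCENT_TABLE)

def span_variants(span):
    # A multiword span is also searched with its whitespace removed
    joined = "".join(span.split())
    return [span] if joined == span else [span, joined]

def propagate_span(text, span):
    """Return (begin, end, extraction) for every occurrence of span in text."""
    text_norm = normalize(text)
    span_norm = normalize(span)
    hits = []
    if not span_norm:
        return hits
    pos = text_norm.find(span_norm)
    while pos != -1:
        end = pos + len(span)
        hits.append((pos, end, text[pos:end]))  # extraction keeps the original casing
        pos = text_norm.find(span_norm, pos + 1)  # step by one to allow overlaps
    return hits

# Made-up example tweet (not taken from the dataset)
tweet = "Lucha contra el Cáncer de mama #CancerDeMama"
results = []
for variant in span_variants("cáncer de mama"):
    results.extend(propagate_span(tweet, variant))

for begin, end, extraction in results:
    print(begin, end, extraction)
```

Here the predicted span matches the capitalized mention in the text, and its whitespace-free variant matches the hashtag, so both offsets would be written to the predictions file.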

## Predictions on test dataset

In [None]:
test_path = "/content/drive/MyDrive/Dataset/test-data/test-data-txt-files/"

In [None]:
from os import listdir
test_file_names = listdir(test_path)

In [None]:
len(test_file_names)

23430

In [None]:
print("Processing...")
import re
f = open("/content/drive/MyDrive/test_predictions.tsv", "w", encoding="UTF-8")
f.write("tweets_id\tbegin\tend\ttype\textraction\n")
for fname in test_file_names:
  print(f"Text: {fname}", end="\r")
  with open(test_path + fname, "r", encoding="UTF-8") as fval:
    lista_spans = []
    offs = []
    hc = fval.read()
    pred = token_classifier(hc)
    pred_grouped = grouping_entities(pred)
    t = 1
    for p in pred_grouped:

      off0 = int(p['start'])
      off1 = int(p['end'])
      span = p['word']

      # Raw strings avoid invalid-escape warnings for \. \( and \)
      span = re.sub(r"^, |^,|^\. |^\.|^: |^:|^; |^;|^\( |^\(|^\) |^\)","",span)

      if "\n" in span:
        span = re.sub("\n"," ",span)

      if " - " in span:
        span = re.sub(" - ","-",span)

      if "- " in span:
        span = re.sub("- ","-",span)

      if " -" in span:
        span = re.sub(" -","-",span)

      if "( " in span:
        span = re.sub(r"\( ","(",span)

      if " )" in span:
        span = re.sub(r" \)",")",span)

      if span.endswith(" y") :
        span = span[:-2]
        
      if span.endswith(" de") or span.endswith(" en"):
        span = span[:-3]

      if span.endswith(" por") or span.endswith(" con") or span.endswith(" del"):
        span = span[:-4]

      if span.endswith(".") or span.endswith(",") or span.endswith(";") or span.endswith(":") or span.endswith("–") or span.endswith("-") or span.endswith("!"):
        span = span[:-1]

      if span.endswith(" .") or span.endswith(" ,") or span.endswith(" ;") or span.endswith(" :") or span.endswith(" –") or span.endswith(" -"):
        span = span[:-2]

      if span.startswith("#"):
        span = span[1:]

      if span.startswith("🔸 "):
        span = span[2:]

      pattern = r"^[a-z|á|é|í|ó|ú|/]{0,2}$|^[0-9]+$|^[A-Z]$|^#$| a$| el$| la$|^🔸$"
      match = re.findall(pattern, span)
      if len(match) > 0:
        continue

      if span not in lista_spans:
        #For multiword spans
        mwspan = "".join(span.split())
        if span != mwspan:
          spans = [span, mwspan]
        else:
          spans = [span]
        # Find all indices of 'span'
        for sp in spans:
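          # Normalize the text and the span once, not once per character index
          hc_norm = delete_accents(hc.lower())
          sp_norm = delete_accents(sp.lower())
          indices = [index for index in range(len(hc)) if hc_norm.startswith(sp_norm, index)]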
          for ind in indices:
            off0 = ind
            off1 = ind+len(sp)
            extraction = hc[off0:off1]
            match = re.findall(pattern, extraction)
            no_subsumed = True
            for of in offs:
              if off0 == of[0] and off1 < of[1]:
                no_subsumed = False
            output = fname[:-4]+"\t"+str(off0)+"\t"+str(off1)+"\t"+"ENFERMEDAD"+"\t"+extraction+"\n"
            if len(match) == 0 and no_subsumed:
              f.write(output)
              t+=1
              offs.append((off0,off1))

          lista_spans.append(sp)
          
f.close()
print("Done.")

Processing...
Done.
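
The validation and test cells apply the same span cleanup rules before searching the text. As a sketch of how those rules could be shared and tested in isolation, the helper below applies the same steps in the same order. `clean_span` is a hypothetical name that does not exist in the notebook, and the example inputs are made up.

```python
import re

def clean_span(span):
    """Hypothetical helper: the cleanup steps of the prediction cells, in the same order."""
    # Drop one leading punctuation mark and an optional following space
    span = re.sub(r"^[,.:;()] ?", "", span)
    # Line breaks become spaces; spaces around hyphens and inside parentheses go away
    for old, new in (("\n", " "), (" - ", "-"), ("- ", "-"), (" -", "-"),
                     ("( ", "("), (" )", ")")):
        span = span.replace(old, new)
    # Trailing function words are checked group by group, once each
    for group in ((" y",), (" de", " en"), (" por", " con", " del")):
        if span.endswith(group):
            span = span[:-len(group[0])]
    # Trailing punctuation: first bare, then preceded by a space
    if span.endswith((".", ",", ";", ":", "–", "-", "!")):
        span = span[:-1]
    if span.endswith((" .", " ,", " ;", " :", " –", " -")):
        span = span[:-2]
    # Leading hashtag sign and bullet emoji
    if span.startswith("#"):
        span = span[1:]
    if span.startswith("🔸 "):
        span = span[2:]
    return span

# Made-up model outputs (not taken from the dataset)
for raw in [", diabetes tipo 2 y", "#covid - 19.", "cáncer de"]:
    print(repr(raw), "->", repr(clean_span(raw)))
```

Keeping the rules in one function would let both cells call the same code, so a rule changed for validation could not silently diverge from the one used on the test set.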
