<a href="https://colab.research.google.com/github/ajtamayoh/Data_Mining_in_the_Medical_Field_in_Spanish/blob/main/Procedures_Identification_shared_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Mining in the Medical Field Through Negation Scope Detection and Named Entity Recognition

This notebook contains the source code for the experiments in:

### Procedures Identification

Authors:

Antonio Tamayo (ajtamayo2019@ipn.cic.mx, ajtamayoh@gmail.com)

Alexander Gelbukh (gelbukh@gelbukh.com)

For bugs or questions related to the code, do not hesitate to contact us (Antonio Tamayo: ajtamayoh@gmail.com)

If you use this code please cite our work:

Coming soon ...


# Requirements

To run this code, download the dataset (MedProcNER_training.json) from: [download dataset](https://github.com/ajtamayoh/Data_Mining_in_the_Medical_Field_in_Spanish/tree/main/Procedures/Dataset)

Then create a folder called "Datasets" in the root of your Google Drive and upload the downloaded file there.

Once the dataset is ready to use, [open this notebook in Colab](https://colab.research.google.com/drive/1L3_eeh9znNzxhxf03AyZl3aypsaOb6Dv?authuser=1#scrollTo=6S9L_KErP3yM) and save a copy in your Drive.

## About the infrastructure

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

## Connecting to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Exploring & Preprocessing Data

In [None]:
import pandas as pd
import numpy as np
import spacy

# Procedure mentions identification as a Token classification problem

## Install the Transformers and Datasets libraries to run this notebook.

In [None]:
!pip install datasets transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs
!pip install seqeval

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting accelerate
  Downloading accelerate-0.19.0-py3-none-any.whl (219 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.19.0
Reading package lists... Done
Building dependency tree
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 28 not upgraded.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval

## Hugging Face Authentication

If you want to save your own model and make it available online we strongly recommend signing up at: https://huggingface.co/

You will need to set up git: replace the email and name placeholders in the following cell with your own.

In [None]:
!git config --global user.email "your_email"
!git config --global user.name "your_name"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Loading the Preprocessed Dataset

In [None]:
from datasets import load_dataset
import json

# MedProcNER dataset (preprocessed for the BIO tagging scheme)
MedProcNER_dataset_train = load_dataset("json", data_files="/content/drive/MyDrive/Datasets/MedProcNER_training.json", field="data")

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-92ff6189c97625b3/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-92ff6189c97625b3/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]
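The `field="data"` argument implies that the JSON file keeps its records under a top-level `data` key, and the loaded features are `tokens` and `ner_tags`. Below is a minimal sketch of a sanity check for that layout, assuming each record holds two parallel lists. The sample records and the helper name `check_records` are illustrative and are not taken from the dataset.

```python
import json

# Hypothetical two-record sample mimicking the assumed file layout.
sample_json = json.dumps({
    "data": [
        {"tokens": ["Se", "realiza", "radiografía", "de", "tórax", "."],
         "ner_tags": [0, 0, 1, 2, 2, 0]},
        {"tokens": ["Sin", "hallazgos", "."], "ner_tags": [0, 0, 0]},
    ]
}, ensure_ascii=False)


def check_records(raw, allowed_tags=(0, 1, 2)):
    """Return the number of records after validating their structure."""
    records = json.loads(raw)["data"]
    for idx, rec in enumerate(records):
        tokens, tags = rec["tokens"], rec["ner_tags"]
        if len(tokens) != len(tags):
            raise ValueError(f"record {idx}: {len(tokens)} tokens vs {len(tags)} tags")
        # Tags may be stored as ints or numeric strings, so normalise first.
        if any(int(t) not in allowed_tags for t in tags):
            raise ValueError(f"record {idx}: tag outside {allowed_tags}")
    return len(records)


n_records = check_records(sample_json)
print(n_records)
```

Running a check like this on the real file before training can surface misaligned token and tag lists early.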

In [None]:
from datasets import DatasetDict

# Split the training data into training and validation partitions
train_test = MedProcNER_dataset_train["train"].train_test_split()
raw_datasets = DatasetDict({
    'train': train_test['train'],
    'validation': train_test['test']
    })

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 9052
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 3018
    })
})
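`train_test_split()` is called without arguments, so it shuffles the rows and holds out 25% of them by default. Because no seed is given, the partition (and therefore the sentence at index 0) can differ between runs; passing `seed=` to `train_test_split` should make it repeatable. The pure-Python sketch below only illustrates the idea and the size arithmetic. The helper `seeded_split` is hypothetical and does not reproduce the library's shuffling.

```python
import math
import random


def seeded_split(n_rows, test_size=0.25, seed=42):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_rows)
    return indices[n_test:], indices[:n_test]


# 12070 rows is the total implied by the two split sizes shown above.
train_idx, test_idx = seeded_split(9052 + 3018)
print(len(train_idx), len(test_idx))

# The same seed always yields the same partition.
assert seeded_split(100, seed=7) == seeded_split(100, seed=7)
```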

In [None]:
raw_datasets["train"][0]["ner_tags"]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [None]:
raw_datasets['train']

Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 9052
})

In [None]:
label_names = ['O','B','I']
label_names

['O', 'B', 'I']

In [None]:
words = raw_datasets["train"][0]["tokens"]
labels = [int(n) for n in raw_datasets["train"][0]["ner_tags"]]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

Se encontraron los siguientes cambios mutacionales en el gen PAH : c.165delT ( p.Phe55fs ) / c.q62G > A ( p.Val388Met ) , siendo ambos hijos únicamente portadores de la mutación p.Phe55fs . 
O  O           O   O          O       O            O  O  O   O   O O         O O         O O O      O O O O           O O O      O     O     O          O          O  O  O        O         O 


## Loading the biomedical-clinical Spanish RoBERTa as a pre-trained model

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

Downloading (…)okenizer_config.json:   0%|          | 0.00/1.27k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.22M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/540k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

In [None]:
tokenizer.is_fast

True

In [None]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
inputs.tokens()

['<s>',
 'ĠSe',
 'Ġencontraron',
 'Ġlos',
 'Ġsiguientes',
 'Ġcambios',
 'Ġmu',
 't',
 'acionales',
 'Ġen',
 'Ġel',
 'Ġgen',
 'ĠPA',
 'H',
 'Ġ',
 ':',
 'Ġc',
 '.',
 '165',
 'del',
 'T',
 'Ġ(',
 'Ġp',
 '.',
 'P',
 'he',
 '55',
 'fs',
 'Ġ)',
 'Ġ/',
 'Ġc',
 '.',
 'q',
 '62',
 'G',
 'Ġ>',
 'ĠA',
 'Ġ(',
 'Ġp',
 '.',
 'Val',
 '38',
 '8',
 'Met',
 'Ġ)',
 'Ġ,',
 'Ġsiendo',
 'Ġambos',
 'Ġhijos',
 'ĠÃºnicamente',
 'Ġportadores',
 'Ġde',
 'Ġla',
 'ĠmutaciÃ³n',
 'Ġp',
 '.',
 'P',
 'he',
 '55',
 'fs',
 'Ġ.',
 '</s>']

In [None]:
inputs.word_ids()

[None,
 0,
 1,
 2,
 3,
 4,
 5,
 5,
 5,
 6,
 7,
 8,
 9,
 9,
 10,
 10,
 11,
 11,
 11,
 11,
 11,
 12,
 13,
 13,
 13,
 13,
 13,
 13,
 14,
 15,
 16,
 16,
 16,
 16,
 16,
 17,
 18,
 19,
 20,
 20,
 20,
 20,
 20,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 31,
 31,
 31,
 31,
 31,
 32,
 None]
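`word_ids()` returns, for each subword token, the index of the source word it came from (`None` for special tokens). Here is a small sketch of how that mapping can be used to regroup subword tokens, with the first tokens of the outputs above hard-coded as input. The helper `group_by_word` is illustrative.

```python
from collections import OrderedDict

# First tokens and word ids copied from the outputs above.
tokens = ['<s>', 'ĠSe', 'Ġencontraron', 'Ġlos', 'Ġsiguientes', 'Ġcambios',
          'Ġmu', 't', 'acionales', 'Ġen']
word_ids = [None, 0, 1, 2, 3, 4, 5, 5, 5, 6]


def group_by_word(tokens, word_ids):
    """Collect the subword tokens that belong to each source word."""
    groups = OrderedDict()
    for token, word_id in zip(tokens, word_ids):
        if word_id is None:  # special tokens such as <s> and </s>
            continue
        groups.setdefault(word_id, []).append(token)
    return groups


groups = group_by_word(tokens, word_ids)
print(groups[5])  # the three pieces of "mutacionales"
```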

In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # A continuation token of a word labelled B (odd id 1) gets I (id 2)
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
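The sentence above contains only `O` tags, so it does not exercise the B-to-I conversion. The sketch below runs the same function on a made-up mention; the labels and `word_ids` are assumptions chosen for illustration, not tokenizer output.

```python
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # First subword token of a new word (or a special token)
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Later subword token of the same word: B (1) becomes I (2)
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels


# Hypothetical example: "radiografía de tórax" tagged B I I, with the first
# and last words each split into two subword tokens.
labels = [1, 2, 2]
word_ids = [None, 0, 0, 1, 2, 2, None]
aligned = align_labels_with_tokens(labels, word_ids)
print(aligned)  # [-100, 1, 2, 2, 2, 2, -100]
```

Only the first subword of the first word keeps `B`; every other subword of the mention is tagged `I`, and the special tokens are masked with -100.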


In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [None]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

Map:   0%|          | 0/9052 [00:00<?, ? examples/s]

Map:   0%|          | 0/3018 [00:00<?, ? examples/s]

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0, -100],
        [-100,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100]])

In [None]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
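The two outputs above differ only in the trailing -100 values that the collator adds so that every label list in a batch has the same length. PyTorch's cross-entropy loss ignores the index -100 by default, so padded positions do not contribute to the loss. Below is a minimal sketch of that padding step, assuming right padding; `pad_labels` is an illustrative helper, not the collator's implementation.

```python
def pad_labels(batch_labels, pad_id=-100):
    """Right-pad every label list to the length of the longest one."""
    max_len = max(len(labels) for labels in batch_labels)
    return [labels + [pad_id] * (max_len - len(labels)) for labels in batch_labels]


# Two made-up label sequences of different lengths.
batch = [[-100, 0, 1, 2, 0, -100], [-100, 0, 0, -100]]
padded = pad_labels(batch)
print(padded)  # [[-100, 0, 1, 2, 0, -100], [-100, 0, 0, -100, -100, -100]]
```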


In [None]:
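from datasets import load_metric

# load_metric is deprecated in newer releases of datasets; the replacement is
# evaluate.load("seqeval") from the separate evaluate package.
metric = load_metric("seqeval")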


Downloading builder script:   0%|          | 0.00/2.47k [00:00<?, ?B/s]

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]
labels

['O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O']

In [None]:
predictions = labels.copy()
predictions[2] = "O"
metric.compute(predictions=[predictions], references=[labels])

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)


{'overall_precision': 0.0,
 'overall_recall': 0.0,
 'overall_f1': 0.0,
 'overall_accuracy': 1.0}
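Precision, recall and F1 are 0.0 here while accuracy is 1.0 because the sentence contains no entities: accuracy is computed per token, whereas the other three scores are computed over entity spans, and there are none to match. The sketch below illustrates span-level scoring for a single sentence with bare `B`/`I`/`O` tags. It is a simplified approximation written for this notebook's label set, not seqeval's implementation, and the helper names are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (start, end) entity spans in a bare B/I/O sequence."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        if tag in ("B", "O") and start is not None:
            spans.add((start, i))
            start = None
        # A span opens at B, or at an I that is not inside a span.
        if tag == "B" or (tag == "I" and start is None):
            start = i
    return spans


def entity_prf(references, predictions):
    """Span-level precision, recall and F1; 0.0 when a denominator is empty."""
    true_spans = extract_spans(references)
    pred_spans = extract_spans(predictions)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


refs = ["O", "B", "I", "O", "B"]
preds = ["O", "B", "I", "O", "O"]
scores = entity_prf(refs, preds)
print(scores)

# An all-O sentence has no spans, so every span-level score is 0.0.
print(entity_prf(["O"] * 5, ["O"] * 5))
```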

In [None]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
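The two list comprehensions in `compute_metrics` drop every position labelled -100 before scoring, so special tokens and padding never reach the metric. Here is a dependency-free sketch of the same filtering on toy data; the logits and the helpers `argmax` and `decode` are made up for illustration.

```python
label_names = ["O", "B", "I"]


def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def decode(logits, labels):
    """Convert logits and label ids into tag strings, skipping -100."""
    true_tags, pred_tags = [], []
    for sent_logits, sent_labels in zip(logits, labels):
        kept = [(argmax(tok), lab)
                for tok, lab in zip(sent_logits, sent_labels) if lab != -100]
        pred_tags.append([label_names[p] for p, _ in kept])
        true_tags.append([label_names[l] for _, l in kept])
    return true_tags, pred_tags


# One toy sentence: <s>, two real tokens, </s>.
logits = [[[2.0, 0.1, 0.1], [0.2, 3.0, 0.1], [0.1, 0.3, 2.5], [1.5, 0.2, 0.1]]]
labels = [[-100, 1, 2, -100]]
true_tags, pred_tags = decode(logits, labels)
print(true_tags, pred_tags)
```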

In [None]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [None]:
id2label

{0: 'O', 1: 'B', 2: 'I'}

In [None]:
label2id

{'O': 0, 'B': 1, 'I': 2}

## Changing the prediction head for Procedure Mentions Identification under the BIO scheme

In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
    num_labels = 3,
)

Downloading pytorch_model.bin:   0%|          | 0.00/504M [00:00<?, ?B/s]

Some weights of the model checkpoint at PlanTL-GOB-ES/roberta-base-biomedical-clinical-es were not used when initializing RobertaForTokenClassification: ['lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.bias', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at PlanTL-GOB-ES/roberta-base-biomedical-clinical-es and are newly initialized: ['classifier

In [None]:
model.config.num_labels

3

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    "Procedures_Identification_RoBERTa_fine_tuned",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=7,
    weight_decay=0.1,
    push_to_hub=True,
)
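No batch size is set in these arguments, so the `Trainer` default applies. As a rough check, and assuming the default per-device batch size of 8 on a single device, the number of optimisation steps can be estimated from the training split size and should agree with the `global_step` that the training run reports.

```python
import math

train_rows = 9052   # size of the training split shown earlier
batch_size = 8      # assumed: Trainer default, one device
epochs = 7          # num_train_epochs above

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1132 7924
```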

## Fine-tuning Transformer-based model for Procedure mentions identification

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

Cloning https://huggingface.co/ajtamayoh/Procedures_Identification_RoBERTa_fine_tuned into local empty directory.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.1142,0.109419,0.638699,0.741692,0.686353,0.960497
2,0.087,0.109915,0.682935,0.755665,0.717461,0.962544
3,0.05,0.127917,0.71338,0.765106,0.738338,0.965024
4,0.0308,0.151248,0.672031,0.78852,0.72563,0.961117
5,0.0201,0.164339,0.72536,0.780967,0.752137,0.966708
6,0.0128,0.179849,0.713941,0.783233,0.746984,0.96555
7,0.0095,0.181961,0.720906,0.781344,0.749909,0.96589


TrainOutput(global_step=7924, training_loss=0.048752946242246284, metrics={'train_runtime': 1300.9882, 'train_samples_per_second': 48.705, 'train_steps_per_second': 6.091, 'total_flos': 1954407175332048.0, 'train_loss': 0.048752946242246284, 'epoch': 7.0})
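The table shows validation loss rising after the first epoch while F1 keeps improving until epoch 5, so the last checkpoint is not the best one by F1. A short sketch that picks the best epoch from the logged values follows; the numbers are copied from the table above. `TrainingArguments` also offers `load_best_model_at_end=True` together with `metric_for_best_model="f1"`, which may be worth trying if the best checkpoint should be the one that is kept.

```python
import csv
import io

# Per-epoch results copied from the table above.
results_csv = """Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.1142,0.109419,0.638699,0.741692,0.686353,0.960497
2,0.087,0.109915,0.682935,0.755665,0.717461,0.962544
3,0.05,0.127917,0.71338,0.765106,0.738338,0.965024
4,0.0308,0.151248,0.672031,0.78852,0.72563,0.961117
5,0.0201,0.164339,0.72536,0.780967,0.752137,0.966708
6,0.0128,0.179849,0.713941,0.783233,0.746984,0.96555
7,0.0095,0.181961,0.720906,0.781344,0.749909,0.96589
"""

rows = list(csv.DictReader(io.StringIO(results_csv)))
best_f1 = max(rows, key=lambda r: float(r["F1"]))
lowest_loss = min(rows, key=lambda r: float(r["Validation Loss"]))

print("best F1:", best_f1["Epoch"], best_f1["F1"])
print("lowest validation loss:", lowest_loss["Epoch"], lowest_loss["Validation Loss"])
```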

## Saving the fine-tuned model to the Hugging Face Hub (requires prior authentication)

In [None]:
trainer.push_to_hub(commit_message="Training complete")

Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file pytorch_model.bin:   0%|          | 1.00/478M [00:00<?, ?B/s]

Upload file runs/May25_15-59-42_9871cf2e21b3/events.out.tfevents.1685030399.9871cf2e21b3.2480.0:   0%|        …

To https://huggingface.co/ajtamayoh/Procedures_Identification_RoBERTa_fine_tuned
   8818654..0d11e58  main -> main

To https://huggingface.co/ajtamayoh/Procedures_Identification_RoBERTa_fine_tuned
   0d11e58..6f15411  main -> main



'https://huggingface.co/ajtamayoh/Procedures_Identification_RoBERTa_fine_tuned/commit/0d11e585673075ca668647c0e9f222f923e7988f'

## Loading the model for inference

In [None]:
from transformers import pipeline

model_checkpoint = "your_username/Procedures_Identification_RoBERTa_fine_tuned"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)

Downloading (…)lve/main/config.json:   0%|          | 0.00/879 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/502M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/1.35k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/894k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/540k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.32M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/957 [00:00<?, ?B/s]

In [None]:
pred = token_classifier("La paciente no ha requerido hemotransfusión.")
pred

[{'entity_group': 'B',
  'score': 0.9991001,
  'word': ' ure',
  'start': 32,
  'end': 35},
 {'entity_group': 'I',
  'score': 0.9992242,
  'word': 'trografía',
  'start': 35,
  'end': 44}]

In [None]:
test_path = "Path_to_text_files"

In [None]:
import os
test_files = os.listdir(test_path)
for f in test_files:
  with open(test_path+'/'+f, "r", encoding="UTF-8") as ftest:
    pred = token_classifier(ftest.read())
  print(pred)
  break

[{'entity_group': 'B', 'score': 0.99752146, 'word': ' analgésicos', 'start': 535, 'end': 546}, {'entity_group': 'I', 'score': 0.61860216, 'word': ' orales', 'start': 547, 'end': 553}, {'entity_group': 'B', 'score': 0.98207575, 'word': ' exploración', 'start': 560, 'end': 571}, {'entity_group': 'B', 'score': 0.99856466, 'word': ' palpación', 'start': 608, 'end': 617}, {'entity_group': 'I', 'score': 0.99786675, 'word': ' del borde costal izquierdo', 'start': 618, 'end': 644}, {'entity_group': 'B', 'score': 0.9893059, 'word': ' maniobra', 'start': 748, 'end': 756}, {'entity_group': 'I', 'score': 0.99869996, 'word': ' del gancho', 'start': 757, 'end': 767}, {'entity_group': 'B', 'score': 0.9742329, 'word': ' exploración', 'start': 807, 'end': 818}, {'entity_group': 'B', 'score': 0.9994836, 'word': ' radiografía', 'start': 865, 'end': 876}, {'entity_group': 'I', 'score': 0.9984913, 'word': ' de tórax', 'start': 877, 'end': 885}, {'entity_group': 'B', 'score': 0.9710995, 'word': ' tratamient

## Post-Processing

In [None]:
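import re

def grouping_entities(pred):
  """Merge adjacent entity groups returned by the pipeline into full mentions."""
  output = []
  for e in pred:
    if "##" not in e['word']:
      output.append(e)
    else:
      # WordPiece continuation piece ("##..."): glue it to the previous entity.
      # Byte-level BPE tokenizers such as RoBERTa's do not emit "##", so this
      # branch only matters for BERT-style checkpoints.
      try:
        if e['start'] == (output[-1]['end']):
          output[-1]['word'] = output[-1]['word']+re.sub("##","",e['word'])
          output[-1]['end'] = e['end']
      except IndexError:
        pass

    # The current group starts one character after the previous one ends
    # (a single separator between them): merge it into the previous entity.
    try:
      if (e['entity_group'] == "B" or e['entity_group'] == "I") and (e['start'] == (output[-2]['end']+1)):
        output[-2]['word'] = output[-2]['word']+" "+e['word']
        output[-2]['end'] = e['end']
        output.pop(-1)
    except IndexError:
      pass

    # The current group starts exactly where the previous one ends: merge it
    # without inserting a separator.
    try:
      if e['start'] == (output[-2]['end']):
        output[-2]['word'] = output[-2]['word']+e['word']
        output[-2]['end'] = e['end']
        output.pop(-1)
    except IndexError:
      pass

  return output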


In [None]:
grouping_entities(pred)

[{'entity_group': 'B',
  'score': 0.99752146,
  'word': ' analgésicos  orales',
  'start': 535,
  'end': 553},
 {'entity_group': 'B',
  'score': 0.98207575,
  'word': ' exploración',
  'start': 560,
  'end': 571},
 {'entity_group': 'B',
  'score': 0.99856466,
  'word': ' palpación  del borde costal izquierdo',
  'start': 608,
  'end': 644},
 {'entity_group': 'B',
  'score': 0.9893059,
  'word': ' maniobra  del gancho',
  'start': 748,
  'end': 767},
 {'entity_group': 'B',
  'score': 0.9742329,
  'word': ' exploración',
  'start': 807,
  'end': 818},
 {'entity_group': 'B',
  'score': 0.9994836,
  'word': ' radiografía  de tórax',
  'start': 865,
  'end': 885},
 {'entity_group': 'B',
  'score': 0.9710995,
  'word': ' tratamiento  antiinflamatorio',
  'start': 900,
  'end': 928},
 {'entity_group': 'B',
  'score': 0.9983222,
  'word': ' extirpación  bajo anestesia general de la unión de la 10a costilla izquierda con la 11a',
  'start': 1022,
  'end': 1108}]
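The merged `word` values above contain double spaces because each pipeline `word` already starts with a space and another one is added during merging. One possible alternative, sketched below under the assumption that the original text is available, is to merge only the offsets and then slice the text. The helper `merge_bi_groups`, the sample sentence and the predictions are hypothetical.

```python
def merge_bi_groups(pred, text, max_gap=1):
    """Attach each 'I' group to the preceding mention when they are adjacent."""
    merged = []
    for ent in pred:
        adjacent = merged and ent["start"] - merged[-1]["end"] <= max_gap
        if ent["entity_group"] == "I" and adjacent:
            merged[-1]["end"] = ent["end"]  # extend the open mention
        else:
            merged.append({"start": ent["start"], "end": ent["end"]})
    for mention in merged:
        # Slice the source text so spacing matches the original exactly.
        mention["word"] = text[mention["start"]:mention["end"]]
    return merged


text = "Se pautan analgésicos orales y exploración física."


def span_of(word):
    """Character offsets of `word` in the sample text."""
    start = text.index(word)
    return start, start + len(word)


# Made-up pipeline-style predictions for the sample sentence.
sample_pred = []
for group, word in [("B", "analgésicos"), ("I", "orales"), ("B", "exploración")]:
    start, end = span_of(word)
    sample_pred.append({"entity_group": group, "start": start, "end": end})

mentions = merge_bi_groups(sample_pred, text)
print([m["word"] for m in mentions])
```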

## Predictions on test datasets

In [None]:
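import os
import re

print("Processing...")
# Make sure the output folder exists before opening the results file.
os.makedirs("/content/drive/MyDrive/Results", exist_ok=True)
f = open("/content/drive/MyDrive/Results/Test_Results.tsv", "w", encoding="UTF-8")
f.write("filename\tlabel\tstart_span\tend_span\ttext\n")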
for fl in test_files:
  with open(test_path + '/' + fl, "r", encoding="UTF-8") as ftest:
    hc = ftest.read()
    pred = token_classifier(hc)
    pred_grouped = grouping_entities(pred)
    t = 1
    for p in pred_grouped:

      start_span = int(p['start'])
      end_span = int(p['end'])
      span = hc[start_span:end_span]

      if span in [".", ",", ";", ":", '"', "-", "a", "de", "por", "in", "que", "da", "di", "se", "Las", "re", "sin", " ", "(", ")", "y"]:
        continue

      if "\n" in span:
        span = re.sub("\n"," ",span)

      if " - " in span:
        span = re.sub(" - ","-",span)
        end_span = end_span-2

      if "( " in span:
        span = re.sub(r"\( ","(",span)
        end_span = end_span-1

      if " )" in span:
        span = re.sub(r" \)",")",span)
        end_span = end_span-1

      if span.endswith(" y") :
        span = span[:-2]
        end_span = end_span-2

      if span.endswith(" de") or span.endswith(" en"):
        span = span[:-3]
        end_span = end_span-3

      if span.endswith(" por") or span.endswith(" con"):
        span = span[:-4]
        end_span = end_span-4

      if span.endswith(".") or span.endswith(",") or span.endswith(";") or span.endswith(":") or span.endswith("–") or span.endswith("-"):
        span = span[:-1]
        end_span = end_span-1

      if span.endswith(" .") or span.endswith(" ,") or span.endswith(" ;") or span.endswith(" :") or span.endswith(" –") or span.endswith(" -"):
        span = span[:-2]
        end_span = end_span-2

      f.write(fl[:-4]+"\t"+"PROCEDIMIENTO"+"\t"+str(start_span)+"\t"+str(end_span)+"\t"+span+"\n")

      t+=1
f.close()
print("Completed!")

Processing...
Completed!
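The cell above trims trailing connectors and punctuation and shifts `end_span` accordingly. A compact variant of that idea is sketched below: it only moves the end offset, so `text[start:end]` always equals the written text, and it writes rows with the standard `csv` module. The helper `trim_span`, the suffix list, the sample note and the file name are assumptions for illustration.

```python
import csv
import io

# Trailing fragments to strip, mirroring the endings handled in the cell above.
TRAILING = (" y", " de", " en", " por", " con", ".", ",", ";", ":", "-", "–", " ")


def trim_span(text, start, end):
    """Move `end` left while text[start:end] ends in a connector or punctuation."""
    trimmed = True
    while trimmed and end > start:
        trimmed = False
        for suffix in TRAILING:
            if text[start:end].endswith(suffix):
                end -= len(suffix)
                trimmed = True
                break
    return start, end


# Hypothetical note and raw prediction offsets.
note = "Se realiza radiografía de tórax y , con buen resultado."
raw_start = note.index("radiografía")
raw_end = note.index(",") + 1

start, end = trim_span(note, raw_start, raw_end)

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["filename", "label", "start_span", "end_span", "text"])
writer.writerow(["example_file", "PROCEDIMIENTO", start, end, note[start:end]])
print(buffer.getvalue())
```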
