<a href="https://colab.research.google.com/github/ajtamayoh/ClinicalTextMining/blob/main/Organisms_Identification_shared_code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clinical Text Mining in Spanish (Organisms)

This notebook contains the source code for the paper:

### Clinical Text Mining in Spanish Enhanced by Negation Detection and Named Entity Recognition

Authors:

Antonio Tamayo (ajtamayo2019@ipn.cic.mx, ajtamayoh@gmail.com)

Diego A. Burgos (burgosda@wfu.edu)

Alexander Gelbukh (gelbukh@gelbukh.com)

For bugs or questions about the code, please contact Antonio Tamayo (ajtamayoh@gmail.com).

If you use this code, please cite our work:

Coming soon ...



# Requirements

To run this code, you need to download the dataset (three files: LivingNER_training.json, LivingNER_validation.json, and LivingNER_testing.json) from: [download dataset](https://github.com/ajtamayoh/ClinicalTextMining/tree/main/Organisms/Dataset)

Then create a folder called "Datasets" in the root of your Google Drive and upload the three downloaded files there.

Once the dataset is in place, [open this notebook in Colab](https://colab.research.google.com/github/ajtamayoh/ClinicalTextMining/blob/main/Organisms_Identification_shared_code.ipynb) and save a copy in your Drive.
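
Before running the loading cells, it can help to confirm that the three files are where the notebook expects them. The following is a minimal sketch, assuming Drive is mounted at `/content/drive` (see "Connecting to Google drive") and that the validation file is named `LivingNER_validation.json`, as in the loading cell. `missing_dataset_files` and `EXPECTED_FILES` are hypothetical names, not part of the original code.

```python
from pathlib import Path

# File names taken from the loading cell of this notebook.
EXPECTED_FILES = (
    "LivingNER_training.json",
    "LivingNER_validation.json",
    "LivingNER_testing.json",
)

def missing_dataset_files(folder):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]

# Run this only after Google Drive has been mounted.
missing = missing_dataset_files("/content/drive/MyDrive/Datasets")
print("Missing files:", missing if missing else "none")
```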

## About the infrastructure

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

## Install the Transformers and Datasets libraries to run this notebook.

In [None]:
!pip install datasets transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs
!pip install seqeval

## Hugging Face Authentication

If you want to save your own model and make it available online, we strongly recommend signing up at: https://huggingface.co/

You will need to set up Git; adapt your email and name in the following cell.

In [None]:
!git config --global user.email "your_email"
!git config --global user.name "your_name"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Connecting to Google drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Exploring & Preprocessing Data

In [None]:
import pandas as pd
import numpy as np
import spacy

# Organisms identification as a Token classification problem

## Loading the Preprocessed Dataset

In [None]:
from datasets import load_dataset
import json

# LivingNER dataset (preprocessed for the BIO scheme)
LivingNER_dataset_train = load_dataset("json", data_files="/content/drive/MyDrive/Datasets/LivingNER_training.json", field="data")
LivingNER_dataset_val = load_dataset("json", data_files="/content/drive/MyDrive/Datasets/LivingNER_validation.json", field="data")
LivingNER_dataset_test = load_dataset("json", data_files="/content/drive/MyDrive/Datasets/LivingNER_testing.json", field="data")
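
The `field="data"` argument tells `load_dataset` to read the records from a top-level `data` key of each JSON file. Judging from the features reported further down (`tokens` and `ner_tags`), each record presumably holds one tokenized sentence and one label id per token. The sketch below writes and reads back a toy file of that assumed shape. The sentences, the file name `toy_LivingNER.json`, and the choice of integer label ids are made up for illustration; the real files may differ.

```python
import json
import os
import tempfile

# Assumed layout: {"data": [{"tokens": [...], "ner_tags": [...]}, ...]}
toy = {
    "data": [
        {"tokens": ["Un", "hombre", "con", "tos"], "ner_tags": [0, 1, 0, 0]},
        {"tokens": ["Cultivo", "de", "Candida"], "ner_tags": [0, 0, 2]},
    ]
}

toy_path = os.path.join(tempfile.mkdtemp(), "toy_LivingNER.json")
with open(toy_path, "w", encoding="utf-8") as fh:
    json.dump(toy, fh, ensure_ascii=False)

with open(toy_path, encoding="utf-8") as fh:
    records = json.load(fh)["data"]

# Sanity check: every record needs exactly one label per token.
for rec in records:
    assert len(rec["tokens"]) == len(rec["ner_tags"])
print(len(records), "records; first:", records[0])
```

Passing `toy_path` to `load_dataset("json", data_files=toy_path, field="data")` should then expose these records under the default `train` split, which is why the next cell reads `['train']` from each of the three loaded objects.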

In [None]:
from datasets import DatasetDict

# Training, validation, and Testing partitions from LivingNER
raw_datasets = DatasetDict({
    'train': LivingNER_dataset_train['train'],
    'validation': LivingNER_dataset_val['train'],
    'test': LivingNER_dataset_test['train']
    })

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['ner_tags', 'tokens'],
        num_rows: 27810
    })
    validation: Dataset({
        features: ['ner_tags', 'tokens'],
        num_rows: 12480
    })
    test: Dataset({
        features: ['ner_tags', 'tokens'],
        num_rows: 12955
    })
})

In [None]:
raw_datasets['train']

Dataset({
    features: ['ner_tags', 'tokens'],
    num_rows: 20857
})

In [None]:
label_names = ['O','HUMAN','SPECIES']
label_names

['O', 'HUMAN', 'SPECIES']

In [None]:
words = raw_datasets["train"][0]["tokens"]
labels = [int(n) for n in raw_datasets["train"][0]["ner_tags"]]
#labels = raw_datasets["train"][0]["pos_tags"]
#labels = raw_datasets["train"][0]["chunk_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

El 1 de enero de 2020 , ingresó en el Union Hospital ( facultad de medicina Tongji , Wuhan , provincia de Hubei ) un hombre de 42 años con hipertermia ( 39,6 °C ) , tos y que refería fatiga de una semana de evolución . 
O  O O  O     O  O    O O       O  O  O     O        O O        O  O        O      O O     O O         O  O     O O  HUMAN  O  O  O    O   O           O O    O  O O O   O O   O       O      O  O   O      O  O         O 


## Loading mBERT as a pre-trained model

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

In [None]:
tokenizer.is_fast

True

In [None]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
inputs.tokens()

In [None]:
inputs.word_ids()

In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
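        else:
            # Same word as previous token
            label = labels[word_id]
            # Rule inherited from B-/I- label sets (odd id = B-XXX, next id = I-XXX).
            # With this notebook's labels (0=O, 1=HUMAN, 2=SPECIES) it relabels the
            # continuation sub-tokens of a HUMAN word as SPECIES.
            if label % 2 == 1:
                label += 1
            new_labels.append(label)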

    return new_labels
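
To see what the alignment does with this notebook's three-label set, here is a small worked example. The function is repeated so the snippet runs on its own. The `word_ids` list is made up: it stands for `[CLS]`, three words of which the second is split into two sub-tokens, and `[SEP]`.

```python
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # First sub-token of a word (or a special token after a word)
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Continuation sub-token of the same word
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels

# Word-level labels: O, HUMAN, O  (0=O, 1=HUMAN, 2=SPECIES)
word_labels = [0, 1, 0]
# Hypothetical tokenizer output: [CLS] w0 w1 w1 w2 [SEP]
toy_word_ids = [None, 0, 1, 1, 2, None]

aligned = align_labels_with_tokens(word_labels, toy_word_ids)
print(aligned)  # [-100, 0, 1, 2, 0, -100]
```

The second sub-token of the HUMAN word receives id 2, which is SPECIES in this label set. This is consistent with the inference output later in the notebook, where `Pac` is tagged HUMAN and `##iente` is tagged SPECIES before `grouping_entities` merges the pieces.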

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [None]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

In [None]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

In [None]:
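from datasets import load_metric

# Note: load_metric has been removed from recent releases of the datasets library;
# with those versions, install the evaluate package and use evaluate.load("seqeval")
metric = load_metric("seqeval")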

In [None]:
labels = raw_datasets["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]
labels

In [None]:
predictions = labels.copy()
predictions[2] = "O"
metric.compute(predictions=[predictions], references=[labels])

In [None]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
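
The filtering inside `compute_metrics` can be traced by hand on a toy batch. The sketch below reproduces only that step in plain Python (no NumPy, no seqeval) so it runs anywhere. The logits and labels are invented, and `argmax` is a small hypothetical helper standing in for `np.argmax(..., axis=-1)`.

```python
label_names = ["O", "HUMAN", "SPECIES"]

# One toy sequence of four tokens with three class scores each.
logits = [[
    [2.0, 0.1, 0.1],  # [CLS], ignored because its label is -100
    [0.2, 3.0, 0.1],  # highest score at index 1 (HUMAN)
    [0.1, 0.3, 1.5],  # highest score at index 2 (SPECIES)
    [1.0, 0.0, 0.0],  # [SEP], ignored because its label is -100
]]
labels = [[-100, 1, 0, -100]]

def argmax(scores):
    """Index of the largest score (stand-in for np.argmax on the last axis)."""
    return max(range(len(scores)), key=scores.__getitem__)

predictions = [[argmax(tok) for tok in seq] for seq in logits]

# Drop positions labelled -100 and map ids to label names.
true_labels = [[label_names[l] for l in seq if l != -100] for seq in labels]
true_predictions = [
    [label_names[p] for p, l in zip(pred_seq, lab_seq) if l != -100]
    for pred_seq, lab_seq in zip(predictions, labels)
]

print(true_labels)       # [['HUMAN', 'O']]
print(true_predictions)  # [['HUMAN', 'SPECIES']]
```

Only the two real tokens survive the filter, so special tokens and padding never influence the reported precision, recall, or F1.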

In [None]:
id2label = {str(i): label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [None]:
id2label

{'0': 'O', '1': 'HUMAN', '2': 'SPECIES'}

In [None]:
label2id

{'O': '0', 'HUMAN': '1', 'SPECIES': '2'}

## Changing the head of prediction for Organism Mentions Identification under the BIO scheme

In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(    
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
    num_labels = 3,
)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at 

In [None]:
model.config.num_labels

3

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    
    "Species_Identification_mBERT_fine_tuned_Train_Test_your_identifier",
    
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=7,
    weight_decay=0.1,
    push_to_hub=True,
)

## Fine-tuning the Transformer-based model for organism mention identification

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    #eval_dataset=tokenized_datasets["validation"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

## Saving the fine-tuned model to the Hugging Face Hub (requires prior authentication)

In [None]:
trainer.push_to_hub(commit_message="Training complete")

## Loading the model for inference

In [None]:
from transformers import pipeline

model_checkpoint = "ajtamayoh/Species_Identification_mBERT_fine_tuned_Train_Test_your_identifier"

token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)

In [None]:
pred = token_classifier("El Paciente hipertenso no presenta fiebre ni infección.")
pred

[{'entity_group': 'HUMAN',
  'score': 0.9996307,
  'word': 'Pac',
  'start': 3,
  'end': 6},
 {'entity_group': 'SPECIES',
  'score': 0.99972814,
  'word': '##iente',
  'start': 6,
  'end': 11}]

In [None]:
test_path = "Path_to_test_files"

## Post-Processing

In [None]:
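def grouping_entities(pred):
  """Merge WordPiece fragments and adjacent same-label words into full mentions."""
  import re
  output = []
  for e in pred:
    # RoBERTa-style tokenizers prefix words with a space; strip it
    if e['word'].startswith(' '):
      e['word'] = e['word'][1:]

    if "##" not in e['word']:
      output.append(e)
    else:
      # Continuation piece: glue it to the previous entity when contiguous
      try:
        if e['start'] == (output[-1]['end']):
          output[-1]['word'] = output[-1]['word']+re.sub("##","",e['word'])
          output[-1]['end'] = e['end']
      except IndexError:
        # No previous entity to attach to
        pass

    # Same label and exactly one character (a space) after the entity before
    # the last one: join the two with a space
    try:
      if (e['entity_group'] == "SPECIES" or e['entity_group'] == "HUMAN") and (e['start'] == (output[-2]['end']+1)) and (e['entity_group'] == output[-2]['entity_group']):
        output[-2]['word'] = output[-2]['word']+" "+e['word']
        output[-2]['end'] = e['end']
        output.pop(-1)
    except IndexError:
      pass

    # Starts exactly where the entity before the last one ends: join without a space
    try:
      if e['start'] == (output[-2]['end']):
        output[-2]['word'] = output[-2]['word']+e['word']
        output[-2]['end'] = e['end']
        output.pop(-1)
    except IndexError:
      pass

  return output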


In [None]:
grouping_entities(pred)

[{'entity_group': 'HUMAN',
  'score': 0.9996307,
  'word': 'Paciente',
  'start': 3,
  'end': 11}]
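
The two main merging rules can also be read in isolation. The function below is a simplified, hypothetical restatement written for illustration, not the notebook's `grouping_entities`: it glues `##` continuation pieces to the previous entity and joins same-label entities separated by a single character, but it omits the third rule (joining entities that touch without a space) and works on copies instead of modifying the input. The three-piece tokenization of "Candida albicans" is invented.

```python
def merge_entities(entities):
    """Simplified sketch of WordPiece and adjacent-word merging."""
    merged = []
    for ent in entities:
        ent = dict(ent)  # work on a copy, leave the input untouched
        prev = merged[-1] if merged else None
        if ent["word"].startswith("##"):
            # Continuation piece: attach only if it starts where prev ends.
            if prev is not None and ent["start"] == prev["end"]:
                prev["word"] += ent["word"][2:]
                prev["end"] = ent["end"]
            continue
        if (prev is not None
                and ent["entity_group"] == prev["entity_group"]
                and ent["start"] == prev["end"] + 1):
            # Same label, one character apart: treat as one multi-word mention.
            prev["word"] += " " + ent["word"]
            prev["end"] = ent["end"]
            continue
        merged.append(ent)
    return merged

# Hypothetical pipeline output for the text "Candida albicans".
toy_pred = [
    {"entity_group": "SPECIES", "word": "Candida", "start": 0, "end": 7},
    {"entity_group": "SPECIES", "word": "alb", "start": 8, "end": 11},
    {"entity_group": "SPECIES", "word": "##icans", "start": 11, "end": 16},
]
print(merge_entities(toy_pred))
```

On this input the three pieces collapse into a single SPECIES mention, "Candida albicans", spanning offsets 0 to 16.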

## Predictions on test datasets

In [None]:
import os
test_files = os.listdir(test_path)

In [None]:
print("Processing...")
import re
# The "Results" folder must already exist in your Drive; open() does not create it
f = open("/content/drive/MyDrive/Results/test_predictions_mBERT_ParTNER-1.tsv", "w", encoding="UTF-8")

f.write("filename\tmark\tlabel\toff0\toff1\tspan\n")

for fl in test_files:
  with open(test_path +'/'+ fl, "r", encoding="UTF-8") as ftest:
    lista_spans = []
    hc = ftest.read()
    #Partitioning the texts.
    t = 1
    pattern = r'\. |\.\n|\. \n'
    sents = re.split(pattern, hc)

    paragraphs = []
    prgh = ""

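    # Non-overlapping paragraphs of 'pl' sentences each (step == pl)
    ant = 0
    step = 1
    pl = 1 # paragraph length (in sentences)
    for lpr in range(pl,len(sents),step):
      paragraphs.append(" ".join(sents[ant:lpr]))
      ant = lpr
    # Append the remaining sentences; using 'ant' also covers texts with no more
    # than 'pl' sentences, where the loop never runs and 'lpr' would be unset or stale
    if ant < len(sents):
      paragraphs.append(" ".join(sents[ant:]))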
    
    
    '''
    #Paragraphs with 1-sentence window
    ant = 0
    step = 3
    pl = 5 #paragraph length
    for lpr in range(pl,len(sents),step):
      paragraphs.append(" ".join(sents[ant:lpr]))  
      ant+=step
    '''
  
    #Inference per paragraphs
    for ph in paragraphs: 

      pred = token_classifier(ph)
      pred_grouped = grouping_entities(pred)
  
      for p in pred_grouped:

        off0 = int(p['start'])
        off1 = int(p['end'])
        if p['entity_group'] == 'SPECIES':
          label = 'SPECIES'
        else:
          label = 'HUMAN'
        #span = hc[off0:off1]
        span = p['word'] #for mBERT

        
        # Raw string avoids invalid-escape warnings for \. \( \)
        span = re.sub(r"^, |^,|^\. |^\.|^: |^:|^; |^;|^\( |^\(|^\) |^\)","",span)

        if "\n" in span:
          span = re.sub("\n"," ",span)

        if " - " in span:
          span = re.sub(" - ","-",span)

        if "- " in span:
          span = re.sub("- ","-",span)

        if " -" in span:
          span = re.sub(" -","-",span)

        if "( " in span:
          span = re.sub(r"\( ","(",span)

        if " )" in span:
          span = re.sub(r" \)",")",span)

        if span.endswith(" y") :
          span = span[:-2]

        if span.endswith(" de") or span.endswith(" en"):
          span = span[:-3]

        if span.endswith(" por") or span.endswith(" con"):
          span = span[:-4]

        if span.endswith(".") or span.endswith(",") or span.endswith(";") or span.endswith(":") or span.endswith("–") or span.endswith("-"):
          span = span[:-1]

        if span.endswith(" .") or span.endswith(" ,") or span.endswith(" ;") or span.endswith(" :") or span.endswith(" –") or span.endswith(" -"):
          span = span[:-2]

        #if span in ["",".", ",", ";", ":", '"', "-", "a", "de", "por", "in", "que", "da", "di", "se", "Las", "re", "sin", "en", "(", ")", "la", "y", "con", "o", "E", "ba", "mic", "su", "no", "S", "sa", "P", "co"] or span in Calphabet or span in alphabet or span in numbers:
          #continue

        pattern = r"^[a-z|á|é|í|ó|ú|/]{0,3}$|^[0-9]+$|^[A-Z]$"
        match = re.findall(pattern, span)
        if len(match) > 0 and match[0] != 'tía' and match[0] != 'tío':
          continue

        if span not in lista_spans:
          # Find all indices of 'span'
          indices = [index for index in range(len(hc)) if hc.startswith(span, index)]
          #print(indices)
          for ind in indices:
            off0 = ind
            off1 = ind+len(span)
            f.write(fl[:-4]+"\t"+"T"+str(t)+"\t"+label+"\t"+str(off0)+"\t"+str(off1)+"\t"+span+"\n")
            #print(filename[:-4]+"\t"+"T"+str(t)+"\t"+label+"\t"+str(off0)+"\t"+str(off1)+"\t"+span+"\n")
            t+=1

          lista_spans.append(span)
f.close()
print("Done.")

Processing...
Done.
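
Two steps of the prediction cell are easier to follow in isolation: splitting a clinical text into sentence chunks, and recovering document-level offsets for a predicted span by searching for every occurrence of it. The sketch below restates both under simplifying assumptions. `chunk_sentences` and `find_all_offsets` are hypothetical helper names, the example text is invented, and only the non-overlapping case (step equal to paragraph length) is covered.

```python
import re

# Same sentence-boundary pattern as in the prediction cell.
SENT_BOUNDARY = r'\. |\.\n|\. \n'

def chunk_sentences(text, pl=1):
    """Split `text` at sentence boundaries and join every `pl` sentences."""
    sents = re.split(SENT_BOUNDARY, text)
    return [" ".join(sents[i:i + pl]) for i in range(0, len(sents), pl)]

def find_all_offsets(text, span):
    """Return (start, end) for every occurrence of `span` in `text`."""
    if not span:
        return []
    offsets = []
    start = text.find(span)
    while start != -1:
        offsets.append((start, start + len(span)))
        # Advance by one character so overlapping matches are kept too.
        start = text.find(span, start + 1)
    return offsets

text = "Paciente con fiebre. Cultivo positivo para Candida.\nPaciente estable."
print(chunk_sentences(text))
for off0, off1 in find_all_offsets(text, "Paciente"):
    print(off0, off1, text[off0:off1])
```

Spans are matched as plain substrings, so a short span also matches inside longer words. The prediction cell behaves the same way, which is one reason it discards very short spans before writing them to the TSV file.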
