# FINAL SUBMISSION: TEXT

## AUTHORS: Antonio Gonzalez Suarez, Fernando Revuelta San Emeterio and Agustín Rodríguez Agudo


## INTRODUCTION
 + Automatic generation of clinical text is a fundamental task in medicine, with applications ranging from documenting patient histories to drafting complete medical reports. In recent years, pretrained language models have proven to be powerful tools for this challenge. However, to perform well in a medical context, these models need to be adapted to data drawn from medical reports.

 + The goal of this practical assignment is to explore and apply fine-tuning of a language model using data from medical reports. By starting from a pretrained language model, we benefit from its general knowledge of language and then adjust it to a more specialized medical domain.

In [2]:
# Requirements

!pip install transformers==4.26.0
!pip install datasets
!pip install gradio
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.26.0
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 27.8 MB/s eta 0:00:00
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.29.2
    Uninstalling transformers-4.29.2:
      Successfully uninstalled transformers-4.29.2
Successfully installed transformers-4.26.0


## INITIAL SETUP

In [3]:
# Imports

from google.colab import drive 
import os
import sys

import pandas as pd
from sklearn.model_selection import train_test_split
import re
from functools import reduce

from datasets import DatasetDict, Dataset, load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, DefaultDataCollator, AutoModelForMaskedLM, AutoModelForQuestionAnswering, pipeline, Trainer, TrainingArguments, EarlyStoppingCallback

import gradio as gr

In [4]:
# Mount drive

drive.mount('/content/drive') 

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# Colab Path set up

os.chdir("/content/drive/My Drive") 
sys.path.append("/content/drive/My Drive/") 
    
# Root directory of the project 
ROOT_DIR = os.path.abspath(".") 

* Reading and transforming the medical dataset.

In [6]:
# Load data

# Load the transcriptions from the CSV file into a pandas DataFrame
data = pd.read_csv("./lawLanguageModel/data/mtsamples.csv")
data = data[['transcription']]
data.columns = ['text']
data.dropna(inplace=True)
# Split each transcription into sentences and flatten them into one row per sentence
data['text'] = list(map(lambda x: x.split('.'), data['text']))
flat = reduce(lambda x, y: x + y, data['text'])
data = pd.DataFrame({'text': flat})

In [7]:
# Clean text
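def clean_function(example):
  # Replace selected punctuation with spaces (raw strings avoid invalid-escape warnings)
  new_text = re.sub(r'([:,.+\-%])', ' ', example)
  # Collapse runs of whitespace into a single space
  new_text = re.sub(r'\s{1,}', ' ', new_text)
  return new_text.strip()

data['text'] = list(map(clean_function, data['text']))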
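As a quick illustration of the two preprocessing cells above, the following sketch applies the same split-and-clean steps to a single made-up transcription. The sample string is hypothetical (it is not taken from `mtsamples.csv`), and the final filter that drops empty fragments is an extra step that the notebook does not perform.

```python
import re

def clean_function(example: str) -> str:
    # Replace the punctuation handled in the notebook with spaces
    new_text = re.sub(r'([:,.+\-%])', ' ', example)
    # Collapse runs of whitespace into a single space
    new_text = re.sub(r'\s+', ' ', new_text)
    return new_text.strip()

# Hypothetical transcription, used only for illustration
sample = "BP: 120/80, pulse 72. Patient reports 50% less pain."

# Split on periods (as the notebook does), then clean each fragment
sentences = [clean_function(s) for s in sample.split('.')]

# Extra step (assumption, not in the notebook): drop empty fragments
sentences = [s for s in sentences if s]

print(sentences)
```

Splitting on every period is a crude sentence splitter: it also cuts decimals and abbreviations, which may be worth keeping in mind when inspecting the resulting rows.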

In [8]:
# Convert the DataFrame to a Hugging Face Dataset

dataset = Dataset.from_pandas(data)
train_dataset, val_dataset = train_test_split(dataset, test_size=0.2)
train_dataset = Dataset.from_dict(train_dataset)
val_dataset = Dataset.from_dict(val_dataset)
dataset = DatasetDict({'train': train_dataset, 'validation': val_dataset})

* *BERT base cased* (`bert-base-cased`) is a widely used pretrained model in natural language processing. Some of the reasons for choosing it:
  + Contextual word representations: in `bert-base-cased`, each word is represented taking into account the words around it, which better captures its meaning in context. This contextual representation is especially useful in tasks that depend on understanding and producing high-quality text, such as generating medical reports.

  + Broad vocabulary coverage: `bert-base-cased` uses a subword tokenizer that splits words into smaller units, so it can handle a large vocabulary efficiently. This is especially useful in the medical domain, which is full of specific technical and scientific terms.
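To make the subword idea concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy used by WordPiece-style tokenizers. The vocabulary `toy_vocab` and the function `wordpiece_tokenize` are hypothetical and exist only for this illustration; the real `bert-base-cased` vocabulary has 28,996 entries and may split the same word differently.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word maps to the unknown token
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (assumption, not BERT's real vocabulary)
toy_vocab = {"card", "##io", "##my", "##opathy", "heart"}

print(wordpiece_tokenize("cardiomyopathy", toy_vocab))
print(wordpiece_tokenize("heart", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A rare medical term is thus covered by a handful of known pieces instead of collapsing into a single unknown token.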

In [9]:
# Load pre-trained model

pretrained_model = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
model = AutoModelForMaskedLM.from_pretrained(pretrained_model)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
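The `DataCollatorForLanguageModeling` created above with `mlm_probability=0.15` prepares masked-language-modeling batches: roughly 15% of the tokens are selected as prediction targets, and of those about 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged. The sketch below imitates that rule on a plain list of strings. It is a simplification under stated assumptions: `mask_tokens` and `toy_vocab` are hypothetical names, it works on words rather than token ids, and it marks non-target positions with `None` where the real collator uses the label `-100`.

```python
import random

def mask_tokens(tokens, vocab, mlm_probability=0.15, mask_token="[MASK]", seed=0):
    """Return (inputs, labels) following the 80/10/10 masking rule."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)  # None = position is not a prediction target
    for i, tok in enumerate(tokens):
        if rng.random() >= mlm_probability:
            continue  # token not selected
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_token         # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

# Hypothetical sentence and vocabulary, for illustration only
tokens = "the patient was admitted with chest pain and shortness of breath".split()
toy_vocab = ["heart", "lung", "fever", "dose"]

inputs, labels = mask_tokens(tokens, toy_vocab, mlm_probability=0.15, seed=0)
print(inputs)
print(labels)
```

Because the collator applies masking when each batch is built, the same sentence can be masked differently from one epoch to the next.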


In [10]:
# Tokenize

def tokenize_function(examples):
  return tokenizer(examples['text'], truncation = True, max_length = 150)

dataset = dataset.map(tokenize_function, batched = True)

Map:   0%|          | 0/181236 [00:00<?, ? examples/s]

Map:   0%|          | 0/45309 [00:00<?, ? examples/s]

+ The training set is capped because training on such a large dataset takes considerable time. The size is chosen according to the available computational resources.

In [11]:
# Limit training
dataset['train'] = Dataset.from_dict(dataset['train'][:5000])

In [12]:
# Train Args
training_args = TrainingArguments(
    # Output
    output_dir="./lawLanguageModel/tmp/ml_model",
    overwrite_output_dir=True,
    # Gradient config
    num_train_epochs=50,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=5e-5,
    weight_decay = 0,
    warmup_steps = 100,
    no_cuda = False,
    # Checkpoint config
    evaluation_strategy='steps',
    save_steps=500,
    save_total_limit=1,
    load_best_model_at_end=True
    # metric_for_best_model: defaults to the loss; define a compute_metrics function to use another metric
)

# Fine-tune Model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10, early_stopping_threshold=0.01)]
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: text. If text are not expected by `BertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 5000
  Num Epochs = 50
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 7850
  Number of trainable parameters = 108340804
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss
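The "Total optimization steps" figure in the log follows directly from the dataset size, batch size, and number of epochs. The helper below is a small sketch of that arithmetic, assuming a single device and no gradient accumulation (as reported in the log); `total_optimization_steps` is a name introduced here for illustration, not a `transformers` function.

```python
import math

def total_optimization_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    # The last, partially filled batch still counts as one step
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# Masked language model run: 5000 examples, batch size 32, 50 epochs
mlm_steps = total_optimization_steps(5000, 32, 50)

# Question answering run: 4000 examples, batch size 32, 10 epochs
qa_steps = total_optimization_steps(4000, 32, 10)

print(mlm_steps, qa_steps)
```

These values match the 7850 and 1250 steps reported by the two training logs in this notebook.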


In [None]:
# Save Model

trainer.save_model("./lawLanguageModel/tmp/ml_model/")

In [13]:
# Inference

mask_filler = pipeline("fill-mask", model="./lawLanguageModel/tmp/ml_model/", tokenizer = tokenizer)
# mask_filler = pipeline("fill-mask", model="bert-base-cased", tokenizer = tokenizer) # Different result -> Trained
model.to('cpu')  # Move the model back to the CPU (CUDA was used during training)
predicts = [el['sequence'] for el in mask_filler("Thank you so much [MASK], you are a nice person")]
for sequence in predicts:
  print(sequence)

loading configuration file ./lawLanguageModel/tmp/ml_model/config.json
Model config BertConfig {
  "_name_or_path": "./lawLanguageModel/tmp/ml_model/",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.26.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}


Thank you so much Doctor, you are a nice person
Thank you so much Miss, you are a nice person
Thank you so much sir, you are a nice person
Thank you so much honey, you are a nice person
Thank you so much Mother, you are a nice person


 + Again, we limit the data we work with because of time and computational constraints.

In [14]:
# QA Dataset

squad = load_dataset("squad", split="train[:5000]")
# squad = load_dataset("squad", split="train")
squad = squad.train_test_split(test_size=0.2)



+ Each example in the dataset provides the question, the answer, and the character position where the answer starts. We tokenize it and look for the first and last tokens that contain the answer, handling the case in which the answer is truncated and therefore no longer available after tokenization.
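The character-to-token mapping described above can be sketched without any tokenizer library. In this simplified version, whitespace-separated words stand in for tokens, and their `(start, end)` character offsets play the role of the `offset_mapping` returned by a fast tokenizer. The helper names `token_offsets` and `answer_token_span` are hypothetical, and truncation is simulated by slicing the offsets list.

```python
import re

def token_offsets(text):
    """(start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def answer_token_span(offsets, start_char, end_char):
    """Indices of the first and last token covering the answer, or (0, 0)."""
    # Answer not fully inside the available tokens (e.g. truncated): label (0, 0)
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return 0, 0
    # Last token that starts at or before the answer start
    start_tok = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    # First token that ends at or after the answer end
    end_tok = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return start_tok, end_tok

context = "The human body contains about 60% water."
answer = "60%"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = token_offsets(context)
full_span = answer_token_span(offsets, start_char, end_char)

# Simulate truncation: keep only the first four tokens, which cuts off the answer
truncated_span = answer_token_span(offsets[:4], start_char, end_char)

print(full_span, truncated_span)
```

In the full context the answer maps to token 5 as both start and end; once the context is truncated before the answer, the span falls back to `(0, 0)`.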

In [15]:
# QA Preprocess

def preprocess_function(examples):
    # Tokenize
    questions = list(map(lambda x: x.strip(), examples["question"]))
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=200,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )
    # Prepare iterations
    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []
    # Iterate through the offset mapping of each example
    for i, offset in enumerate(offset_mapping):
        # Get character mapping
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        # Token type (q, a or special tokens such as padding)
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context tokens (first token with 1, and last token with 1, before padding)
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context (truncated on either side), label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            # Search answer start token
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            # Search answer end token
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    # Add answer start and end token
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
data_collator = DefaultDataCollator()

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [16]:
# Load QA Model

model = AutoModelForQuestionAnswering.from_pretrained("./lawLanguageModel/tmp/ml_model/")

loading configuration file ./lawLanguageModel/tmp/ml_model/config.json
Model config BertConfig {
  "_name_or_path": "./lawLanguageModel/tmp/ml_model/",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.26.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file ./lawLanguageModel/tmp/ml_model/pytorch_model.bin
Some weights of the model checkpoint at ./lawLanguageModel/tmp/ml_model/ were not used when initializing BertForQuestionAnswering: ['cls.predictions.decod

In [17]:
# Train Args
training_args = TrainingArguments(
   # Output
    output_dir="./lawLanguageModel/tmp/qa_model",
    overwrite_output_dir=True,
    # Gradient config
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    weight_decay = 0,
    warmup_steps = 100,
    no_cuda = False,
    # Checkpoint config
    evaluation_strategy='steps',
    save_steps=500,
    save_total_limit=1,
    load_best_model_at_end=True
    # metric_for_best_model: defaults to the loss; define a compute_metrics function if another metric is needed
)

# Fine-tune Model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=squad["train"],
    eval_dataset=squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10, early_stopping_threshold=0.01)]
)

trainer.train()

using `logging_steps` to initialize `eval_steps` to 500
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 4000
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1250
  Number of trainable parameters = 107721218


Step,Training Loss,Validation Loss


In [None]:
# Save model

trainer.save_model("./lawLanguageModel/tmp/qa_model/")

In [18]:
# Inference 
model.to('cpu') # Move the model back to the CPU (CUDA was used during training)
question_answerer = pipeline("question-answering", model="./lawLanguageModel/tmp/qa_model/")

def inference(context: str, question: str, question_answerer = question_answerer) -> str:
  return question_answerer(question=question, context=context)['answer']

loading configuration file ./lawLanguageModel/tmp/qa_model/config.json
Model config BertConfig {
  "_name_or_path": "./lawLanguageModel/tmp/qa_model/",
  "architectures": [
    "BertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.26.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}


In [20]:
# Example

context = "BLOOM has 176 billion parameters and can generate text in 46 natural languages and 13 programming languages."
question = "How many programming languages does BLOOM support?"
inference(context, question, question_answerer)

'13'

In [21]:
# Topic example

context = "The human body contains about 60% water. Water is essential for a variety of bodily functions, including regulating body temperature, transporting nutrients and oxygen to cells, and removing waste from the body. "
question = "How much water does the human body contain?"
inference(context, question, question_answerer)

'60%'

In [22]:
# Demo

demo = gr.Interface(fn=inference, inputs=["text", "text"], outputs="text")
demo.launch()  

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.





In [27]:
# Evaluation
# Query: Chat GPT
'''
Write in a json format a health or medical context, a question about that context and an answer to that question, such that the answer to that question is contained in the context:
{'context': ' ', 'question': ' ', 'answer': ' '}
'''


model.to('cpu')
question_answerer = pipeline("question-answering", model="./lawLanguageModel/tmp/qa_model/")
test_dataset = [{
    "context": "Cardiovascular disease (CVD) is a group of conditions that affect the heart and blood vessels. The most common type of CVD is coronary artery disease, which is caused by a buildup of plaque in the arteries that supply blood to the heart. Other types of CVD include heart failure, stroke, and peripheral artery disease. Risk factors for CVD include smoking, high blood pressure, high cholesterol, diabetes, and a family history of the disease.",
    "question": "What are the risk factors for cardiovascular disease?",
    "answer": "Risk factors for CVD include smoking, high blood pressure, high cholesterol, diabetes, and a family history of the disease."
}]
list(map(lambda x: (question_answerer(question=x['question'], context=x['context'])['answer'], x['answer']), test_dataset))

[('Cardiovascular disease (CVD',
  'Risk factors for CVD include smoking, high blood pressure, high cholesterol, diabetes, and a family history of the disease.')]
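The cell above only prints the predicted and expected answers side by side. One common way to turn such pairs into numbers is SQuAD-style exact match and token-level F1. The sketch below is an illustration under stated assumptions, not part of the original evaluation: `normalize`, `exact_match`, and `f1_score` are names introduced here, and the normalization (lowercasing, removing punctuation and English articles) follows the usual SQuAD convention.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    # Count tokens shared by prediction and reference (with multiplicity)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Prediction and reference taken from the output of the evaluation cell
pred = "Cardiovascular disease (CVD"
gold = ("Risk factors for CVD include smoking, high blood pressure, "
        "high cholesterol, diabetes, and a family history of the disease.")

em = exact_match(pred, gold)
f1 = f1_score(pred, gold)
print(em, round(f1, 3))
```

For this pair the exact match is 0 and the F1 is about 0.2, since only "CVD" and "disease" are shared between the prediction and the reference.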

+ Example of use through the GPT-4 chat API

In [None]:
"""
import json
import os
import openai

# Read the API key from the environment instead of hard-coding it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

system_intel = "You are GPT-4, answer my questions as if you were an expert in the field."
prompt = "Write in a json format a health or medical context, a question about that context and an answer to that question, such that the answer to that question is contained in the context:{'context': ' ', 'question': ' ', 'answer': ' '}"

# Function that calls the GPT-4 API
def ask_GPT4(system_intel, prompt): 
    result = openai.ChatCompletion.create(model="gpt-4",
                                 messages=[{"role": "system", "content": system_intel},
                                           {"role": "user", "content": prompt}])
    return result['choices'][0]['message']['content']

# Call the function above. The reply is a string, so it is parsed into a dict first
# (assumes the model returns valid JSON; a reply with single-quoted keys needs different parsing)
test_dataset = [json.loads(ask_GPT4(system_intel, prompt))]
list(map(lambda x: (question_answerer(question=x['question'], context=x['context'])['answer'], x['answer']), test_dataset))
"""

"""
Returns something like this:
{
  'context': 'Diabetes mellitus is a metabolic disease characterized by high blood sugar levels over a prolonged period. It is caused by the body's inability to produce or properly use insulin, which is essential in regulating blood sugar levels. Type 1 diabetes is typically caused by an autoimmune reaction, while Type 2 diabetes is primarily caused by lifestyle factors and genetics. Common symptoms of diabetes include increased thirst, frequent urination, extreme fatigue, and slow-healing wounds.',
  'question': 'What are the common symptoms of diabetes mellitus?',
  'answer': 'Increased thirst, frequent urination, extreme fatigue, and slow-healing wounds.'
}

"""
