# This notebook trains a model that recognizes finer-grained entities within the ADMIN entity

The model is fine-tuned on 80% of the hand-labeled data and tested on the remaining 20%. The labeled text contains only the subset that was previously tagged with the ADMIN label.

In [1]:
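import sys

# Make the project's local preprocessing module importable
sys.path.append('../datos/procesamiento')
from corpus import Corpus

# Empty corpus that will hold the CoNLL token/entity sequences
datos_conll = Corpus()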

In [2]:
# BIO tag ids for the sub-entities of ADMIN
ner_dict = {
    'O': 0,
    'B-CANT': 1,
    'I-CANT': 2,
    'B-UND': 3,
    'I-UND': 4,
    'B-VIA_ADMIN': 5,
    'I-VIA_ADMIN': 6,
}

datos_conll.entidades = ner_dict
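To make the tag scheme concrete, here is a minimal sketch of how a sequence of tag ids from `ner_dict` could be turned back into labeled spans. The helper `decode_spans` and the sample tokens are hypothetical; they are not part of the project's `corpus` or `auxfunctions` modules.

```python
# Illustrative sketch: `decode_spans` and the sample sentence are assumptions,
# not code from the project's own modules.
ner_dict = {'O': 0, 'B-CANT': 1, 'I-CANT': 2, 'B-UND': 3, 'I-UND': 4,
            'B-VIA_ADMIN': 5, 'I-VIA_ADMIN': 6}

# Inverse mapping: tag id -> tag name
id2label = {idx: tag for tag, idx in ner_dict.items()}


def decode_spans(tokens, tag_ids):
    """Group BIO tag ids into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = id2label[tag_id]
        if tag.startswith('B-'):
            # A B- tag always opens a new span
            if current_type is not None:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith('I-') and current_type == tag[2:]:
            # An I- tag of the same type extends the open span
            current_tokens.append(token)
        else:
            # 'O' (or a stray I- tag) closes any open span
            if current_type is not None:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans


# Made-up example sequence and tags
tokens = ['1', 'comprimido', 'vía', 'oral']
tag_ids = [1, 3, 5, 6]
print(decode_spans(tokens, tag_ids))
# → [('CANT', '1'), ('UND', 'comprimido'), ('VIA_ADMIN', 'vía oral')]
```

The same `id2label` mapping, together with `ner_dict` as `label2id`, could also be passed to `from_pretrained` so that the saved config carries readable tag names instead of generic `LABEL_n` entries; the notebook itself does not do this.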

In [3]:
# Load the four hand-labeled CoNLL files (s1 to s4) into the corpus
for i in range(1, 5):
    datos_conll.load_conll(f'../datos/Etiquetado/corpus_admins_s{i}_etiquetados.conll')

Added 251 token-entity sequences to the corpus
Added 250 token-entity sequences to the corpus
Added 252 token-entity sequences to the corpus
Added 251 token-entity sequences to the corpus


In [4]:
HF_data_mini = datos_conll.to_HF_dataset()

# Hold out 20% of the sequences for testing; the fixed seed makes the split reproducible
HF_dataset = HF_data_mini.train_test_split(test_size=0.2, seed=0)
HF_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 803
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 201
    })
})
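The row counts above follow from the four load messages and the 20% test share. A short arithmetic check, assuming the test share is rounded up and the remainder goes to the training split:

```python
import math

# Sequence counts reported when the four CoNLL files were loaded
loaded = [251, 250, 252, 251]
total = sum(loaded)

test_size = 0.2
# Assumption: the test share is rounded up and the rest is used for training
n_test = math.ceil(test_size * total)
n_train = total - n_test

print(total, n_train, n_test)
# → 1004 803 201
```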

---

In [5]:
from transformers import AutoTokenizer
from auxfunctions import tokenize_and_align_labels

MODEL = "plncmm/bert-clinical-scratch-wl-es"

tokenizer = AutoTokenizer.from_pretrained(MODEL)


def process(examples):
    # Tokenize a batch and align the word-level NER tags with the sub-word tokens
    return tokenize_and_align_labels(examples, tokenizer)


tokenized_data_mini = HF_dataset.map(process, batched=True)
# Drop the raw columns that the model does not consume
tokenized_data_mini = tokenized_data_mini.remove_columns(['id', 'tokens', 'ner_tags'])
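The body of `tokenize_and_align_labels` lives in `auxfunctions` and is not shown here. The sketch below illustrates one common convention for this kind of alignment, not necessarily the one the project uses: the first sub-word piece of each word keeps the word's tag, while special tokens and continuation pieces receive `-100`, which PyTorch's cross-entropy loss ignores by default. The function name `align_labels` and the sample input are hypothetical.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Expand word-level labels to sub-word tokens.

    word_ids: for each sub-word token, the index of the word it came from,
              or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens carry no label
            aligned.append(ignore_index)
        elif word_id != previous:
            # First piece of a word keeps the word's label
            aligned.append(word_labels[word_id])
        else:
            # Continuation pieces are masked out of the loss
            aligned.append(ignore_index)
        previous = word_id
    return aligned


# Hypothetical case: two words tagged [1, 3], the second split into three pieces
word_ids = [None, 0, 1, 1, 1, None]
print(align_labels(word_ids, [1, 3]))
# → [-100, 1, 3, -100, -100, -100]
```

With a fast tokenizer, a list like `word_ids` is what the `word_ids()` method of the tokenizer output returns for each example.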


In [6]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model_mini = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=len(ner_dict))

Some weights of the model checkpoint at plncmm/bert-clinical-scratch-wl-es were not used when initializing BertForTokenClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at plncmm/bert-clinical-s

In [None]:
training_args = TrainingArguments(
    output_dir = "./results",
    evaluation_strategy = "epoch",
    learning_rate = 2e-5,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 16,
    num_train_epochs = 20,
    weight_decay = 0.01,
)



In [8]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

trainer = Trainer(
    model=model_mini,
    args=training_args,
    train_dataset = tokenized_data_mini["train"],
    eval_dataset = tokenized_data_mini["test"],
    tokenizer=tokenizer,
    data_collator = data_collator,
)
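As a rough picture of what the data collator contributes, the sketch below pads every sequence in a batch to the longest one and pads the labels with `-100` so the padded positions do not count toward the loss. It is a plain-Python illustration under stated assumptions, not the library's implementation: the real collator returns tensors, and the token ids here are invented. The value `pad_token_id=1` is taken from the model config printed later in this notebook.

```python
def pad_batch(features, pad_token_id, label_pad_id=-100):
    """Right-pad input ids, attention masks and labels to the batch maximum."""
    max_len = max(len(f['input_ids']) for f in features)
    batch = {'input_ids': [], 'attention_mask': [], 'labels': []}
    for f in features:
        n_pad = max_len - len(f['input_ids'])
        batch['input_ids'].append(f['input_ids'] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch['attention_mask'].append([1] * len(f['input_ids']) + [0] * n_pad)
        # Padded label positions are ignored by the loss
        batch['labels'].append(f['labels'] + [label_pad_id] * n_pad)
    return batch


# Invented token ids; only the shapes matter here
features = [
    {'input_ids': [4, 10, 11, 5], 'labels': [-100, 1, 3, -100]},
    {'input_ids': [4, 12, 5], 'labels': [-100, 5, -100]},
]
batch = pad_batch(features, pad_token_id=1)
print(batch['labels'])
# → [[-100, 1, 3, -100], [-100, 5, -100, -100]]
```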

In [9]:
trainer.train()

***** Running training *****
  Num examples = 803
  Num Epochs = 20
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1020



You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.2831179201602936, 'eval_runtime': 0.2135, 'eval_samples_per_second': 941.452, 'eval_steps_per_second': 60.89, 'epoch': 1.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.28510886430740356, 'eval_runtime': 0.2145, 'eval_samples_per_second': 936.912, 'eval_steps_per_second': 60.596, 'epoch': 2.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.23784422874450684, 'eval_runtime': 0.2161, 'eval_samples_per_second': 930.332, 'eval_steps_per_second': 60.171, 'epoch': 3.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.29607969522476196, 'eval_runtime': 0.2218, 'eval_samples_per_second': 906.101, 'eval_steps_per_second': 58.604, 'epoch': 4.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.2632068693637848, 'eval_runtime': 0.2235, 'eval_samples_per_second': 899.13, 'eval_steps_per_second': 58.153, 'epoch': 5.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.2801365554332733, 'eval_runtime': 0.2287, 'eval_samples_per_second': 878.844, 'eval_steps_per_second': 56.841, 'epoch': 6.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.26943403482437134, 'eval_runtime': 0.2311, 'eval_samples_per_second': 869.618, 'eval_steps_per_second': 56.244, 'epoch': 7.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.2451435923576355, 'eval_runtime': 0.2265, 'eval_samples_per_second': 887.522, 'eval_steps_per_second': 57.402, 'epoch': 8.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.2474113553762436, 'eval_runtime': 0.2494, 'eval_samples_per_second': 806.079, 'eval_steps_per_second': 52.134, 'epoch': 9.0}


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json


{'loss': 0.1074, 'learning_rate': 1.0196078431372549e-05, 'epoch': 9.8}


Model weights saved in ./results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.27362945675849915, 'eval_runtime': 0.2282, 'eval_samples_per_second': 880.988, 'eval_steps_per_second': 56.979, 'epoch': 10.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.29599815607070923, 'eval_runtime': 0.2354, 'eval_samples_per_second': 853.99, 'eval_steps_per_second': 55.233, 'epoch': 11.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.3107137382030487, 'eval_runtime': 0.2431, 'eval_samples_per_second': 826.674, 'eval_steps_per_second': 53.466, 'epoch': 12.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.2953023314476013, 'eval_runtime': 0.2301, 'eval_samples_per_second': 873.658, 'eval_steps_per_second': 56.505, 'epoch': 13.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.31270602345466614, 'eval_runtime': 0.2238, 'eval_samples_per_second': 897.945, 'eval_steps_per_second': 58.076, 'epoch': 14.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.2970276176929474, 'eval_runtime': 0.2474, 'eval_samples_per_second': 812.527, 'eval_steps_per_second': 52.551, 'epoch': 15.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.2996915876865387, 'eval_runtime': 0.226, 'eval_samples_per_second': 889.412, 'eval_steps_per_second': 57.524, 'epoch': 16.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.307156503200531, 'eval_runtime': 0.2281, 'eval_samples_per_second': 881.284, 'eval_steps_per_second': 56.998, 'epoch': 17.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.31138694286346436, 'eval_runtime': 0.2263, 'eval_samples_per_second': 888.355, 'eval_steps_per_second': 57.456, 'epoch': 18.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.30846288800239563, 'eval_runtime': 0.2481, 'eval_samples_per_second': 810.115, 'eval_steps_per_second': 52.395, 'epoch': 19.0}


Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json


{'loss': 0.0376, 'learning_rate': 3.921568627450981e-07, 'epoch': 19.61}


Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1000/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 201
  Batch size = 16





Training completed. Do not forget to share your model on huggingface.co/models =)




{'eval_loss': 0.30861106514930725, 'eval_runtime': 0.2273, 'eval_samples_per_second': 884.22, 'eval_steps_per_second': 57.188, 'epoch': 20.0}
{'train_runtime': 103.3946, 'train_samples_per_second': 155.327, 'train_steps_per_second': 9.865, 'train_loss': 0.0719288045869154, 'epoch': 20.0}


TrainOutput(global_step=1020, training_loss=0.0719288045869154, metrics={'train_runtime': 103.3946, 'train_samples_per_second': 155.327, 'train_steps_per_second': 9.865, 'train_loss': 0.0719288045869154, 'epoch': 20.0})
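The step counts, epoch fractions and learning rates in the log above can be reproduced by hand. The sketch assumes a single device without gradient accumulation (as the log reports) and the `Trainer` default of a linear learning-rate decay to zero with no warmup; the helper `linear_lr` is only illustrative.

```python
import math

num_train_examples = 803
batch_size = 16
num_epochs = 20
base_lr = 2e-5

# The last, smaller batch of each epoch still counts as one optimization step
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step):
    """Learning rate after `step` updates under linear decay to zero (no warmup)."""
    return base_lr * (total_steps - step) / total_steps


print(steps_per_epoch, total_steps)
for step in (500, 1000):
    # Step number, fractional epoch, and learning rate at that step
    print(step, round(step / steps_per_epoch, 2), linear_lr(step))
```

Under these assumptions the values agree with the two training log entries at steps 500 and 1000 (epochs 9.8 and 19.61, learning rates of about 1.02e-05 and 3.92e-07).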

In [10]:
trainer.save_model("bert-clinical-scratch-wl-es-NER-prescription-mini")

Saving model checkpoint to bert-clinical-scratch-wl-es-NER-prescription-mini
Configuration saved in bert-clinical-scratch-wl-es-NER-prescription-mini/config.json
Model weights saved in bert-clinical-scratch-wl-es-NER-prescription-mini/pytorch_model.bin
tokenizer config file saved in bert-clinical-scratch-wl-es-NER-prescription-mini/tokenizer_config.json
Special tokens file saved in bert-clinical-scratch-wl-es-NER-prescription-mini/special_tokens_map.json


In [11]:
model_mini = AutoModelForTokenClassification.from_pretrained("bert-clinical-scratch-wl-es-NER-prescription-mini")

loading configuration file bert-clinical-scratch-wl-es-NER-prescription-mini/config.json
Model config BertConfig {
  "_name_or_path": "bert-clinical-scratch-wl-es-NER-prescription-mini",
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_typ

In [12]:
from auxfunctions import eval_text, map_entities, calculate_metrics

# Gold tags and model predictions for every sequence in the test split
y_test = [row['ner_tags'] for row in HF_dataset['test']]
y_preds = [list(eval_text(row['tokens'], tokenizer, model_mini)) for row in HF_dataset['test']]

calculate_metrics(y_preds, y_test, ner_dict=ner_dict)

Evaluation results
	 f1: 0.94 | precision: 0.92 | recall: 0.95


(0.9225700164744646, 0.9491525423728814, 0.9356725146198831)
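The returned tuple appears to be ordered as (precision, recall, f1); this ordering is inferred from the printed summary rather than from the source of `calculate_metrics`. A quick check that the third value is the harmonic mean of the first two:

```python
# Values copied from the tuple returned by calculate_metrics above;
# the (precision, recall, f1) ordering is an inference, not documented behavior.
precision, recall, f1_reported = (0.9225700164744646, 0.9491525423728814, 0.9356725146198831)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
# → 0.92 0.95 0.94
```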

---

In [16]:
MODEL = "ccarvajal/beto-emoji"
folder = "bert-clinical-scratch-wl-es-NER-prescription"

try:
    # Reuse the local copy if it already exists
    tokenizer = AutoTokenizer.from_pretrained(folder)
    model_mini = AutoModelForTokenClassification.from_pretrained(folder, num_labels=len(ner_dict), ignore_mismatched_sizes=True)
except (OSError, ValueError):
    # A missing folder usually surfaces as OSError, so catch it alongside ValueError.
    # Fall back to downloading MODEL and caching it in the local folder.
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    tokenizer.save_pretrained(folder)
    model_mini = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=len(ner_dict), ignore_mismatched_sizes=True)
    model_mini.save_pretrained(folder)

loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading configuration file bert-clinical-scratch-wl-es-NER-prescription/config.json
Model config BertConfig {
  "_name_or_path": "bert-clinical-scratch-wl-es-NER-prescription",
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "mo

In [17]:
training_args = TrainingArguments(
    output_dir = "./results",
    evaluation_strategy = "epoch",
    learning_rate = 2e-5,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size = 16,
    num_train_epochs = 20,
    weight_decay = 0.01,
)

trainer = Trainer(
    model=model_mini,
    args=training_args,
    train_dataset = tokenized_data_mini["train"],
    eval_dataset = tokenized_data_mini["test"],
    tokenizer=tokenizer,
    data_collator = data_collator,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [18]:
trainer.train()

***** Running training *****
  Num examples = 803
  Num Epochs = 20
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1020



***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.22868359088897705, 'eval_runtime': 0.21, 'eval_samples_per_second': 957.33, 'eval_steps_per_second': 61.917, 'epoch': 1.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.24253275990486145, 'eval_runtime': 0.212, 'eval_samples_per_second': 948.318, 'eval_steps_per_second': 61.334, 'epoch': 2.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.21249742805957794, 'eval_runtime': 0.2164, 'eval_samples_per_second': 929.015, 'eval_steps_per_second': 60.086, 'epoch': 3.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.2826649844646454, 'eval_runtime': 0.2138, 'eval_samples_per_second': 940.129, 'eval_steps_per_second': 60.804, 'epoch': 4.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.26883289217948914, 'eval_runtime': 0.2191, 'eval_samples_per_second': 917.34, 'eval_steps_per_second': 59.33, 'epoch': 5.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.28622210025787354, 'eval_runtime': 0.215, 'eval_samples_per_second': 934.702, 'eval_steps_per_second': 60.453, 'epoch': 6.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.24464352428913116, 'eval_runtime': 0.2189, 'eval_samples_per_second': 918.166, 'eval_steps_per_second': 59.384, 'epoch': 7.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.2451096475124359, 'eval_runtime': 0.2323, 'eval_samples_per_second': 865.185, 'eval_steps_per_second': 55.957, 'epoch': 8.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.24365155398845673, 'eval_runtime': 0.2214, 'eval_samples_per_second': 907.756, 'eval_steps_per_second': 58.711, 'epoch': 9.0}


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json


{'loss': 0.094, 'learning_rate': 1.0196078431372549e-05, 'epoch': 9.8}


Model weights saved in ./results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.25818634033203125, 'eval_runtime': 0.2244, 'eval_samples_per_second': 895.599, 'eval_steps_per_second': 57.924, 'epoch': 10.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.28662988543510437, 'eval_runtime': 0.222, 'eval_samples_per_second': 905.423, 'eval_steps_per_second': 58.56, 'epoch': 11.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.30096009373664856, 'eval_runtime': 0.2241, 'eval_samples_per_second': 896.883, 'eval_steps_per_second': 58.007, 'epoch': 12.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.2921050786972046, 'eval_runtime': 0.2305, 'eval_samples_per_second': 872.125, 'eval_steps_per_second': 56.406, 'epoch': 13.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.30789169669151306, 'eval_runtime': 0.2298, 'eval_samples_per_second': 874.531, 'eval_steps_per_second': 56.562, 'epoch': 14.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.29233843088150024, 'eval_runtime': 0.2274, 'eval_samples_per_second': 883.96, 'eval_steps_per_second': 57.172, 'epoch': 15.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.2875967025756836, 'eval_runtime': 0.2263, 'eval_samples_per_second': 888.39, 'eval_steps_per_second': 57.458, 'epoch': 16.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.3000867962837219, 'eval_runtime': 0.2267, 'eval_samples_per_second': 886.508, 'eval_steps_per_second': 57.336, 'epoch': 17.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.3093368709087372, 'eval_runtime': 0.2231, 'eval_samples_per_second': 900.874, 'eval_steps_per_second': 58.265, 'epoch': 18.0}


***** Running Evaluation *****
  Num examples = 201
  Batch size = 16



{'eval_loss': 0.30833378434181213, 'eval_runtime': 0.2319, 'eval_samples_per_second': 866.75, 'eval_steps_per_second': 56.058, 'epoch': 19.0}


Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json


{'loss': 0.0365, 'learning_rate': 3.921568627450981e-07, 'epoch': 19.61}


Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1000/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 201
  Batch size = 16





Training completed. Do not forget to share your model on huggingface.co/models =)




{'eval_loss': 0.30688950419425964, 'eval_runtime': 0.2237, 'eval_samples_per_second': 898.399, 'eval_steps_per_second': 58.105, 'epoch': 20.0}
{'train_runtime': 99.2278, 'train_samples_per_second': 161.85, 'train_steps_per_second': 10.279, 'train_loss': 0.0648114346406039, 'epoch': 20.0}


TrainOutput(global_step=1020, training_loss=0.0648114346406039, metrics={'train_runtime': 99.2278, 'train_samples_per_second': 161.85, 'train_steps_per_second': 10.279, 'train_loss': 0.0648114346406039, 'epoch': 20.0})
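One way to read the evaluation log above is to look for the epoch with the lowest `eval_loss`. The values below are copied from this run's log and rounded to four decimals; the snippet only locates the minimum.

```python
# eval_loss per epoch for this run, rounded to four decimals (epochs 1 to 20)
eval_loss = [
    0.2287, 0.2425, 0.2125, 0.2827, 0.2688,
    0.2862, 0.2446, 0.2451, 0.2437, 0.2582,
    0.2866, 0.3010, 0.2921, 0.3079, 0.2923,
    0.2876, 0.3001, 0.3093, 0.3083, 0.3069,
]

# Epochs are 1-based, list indices are 0-based
best_epoch = min(range(len(eval_loss)), key=eval_loss.__getitem__) + 1
print(best_epoch, eval_loss[best_epoch - 1])
# → 3 0.2125
```

The evaluation loss is lowest at epoch 3 and ends near 0.31, while the logged training loss falls from 0.094 to 0.0365, a pattern that is consistent with overfitting in the later epochs. Options such as `load_best_model_at_end` in `TrainingArguments` or `EarlyStoppingCallback` might be worth trying here; the notebook does not use them.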

In [19]:
trainer.save_model("bert-clinical-scratch-wl-es-NER-prescription-mini")

model_mini = AutoModelForTokenClassification.from_pretrained("bert-clinical-scratch-wl-es-NER-prescription-mini")

y_test = [row['ner_tags'] for row in HF_dataset['test']]
y_preds = [list(eval_text(row['tokens'],tokenizer,model_mini)) for row in HF_dataset['test']]

calculate_metrics(y_preds,y_test,ner_dict=ner_dict)

Saving model checkpoint to bert-clinical-scratch-wl-es-NER-prescription-mini
Configuration saved in bert-clinical-scratch-wl-es-NER-prescription-mini/config.json
Model weights saved in bert-clinical-scratch-wl-es-NER-prescription-mini/pytorch_model.bin
tokenizer config file saved in bert-clinical-scratch-wl-es-NER-prescription-mini/tokenizer_config.json
Special tokens file saved in bert-clinical-scratch-wl-es-NER-prescription-mini/special_tokens_map.json
loading configuration file bert-clinical-scratch-wl-es-NER-prescription-mini/config.json
Model config BertConfig {
  "_name_or_path": "bert-clinical-scratch-wl-es-NER-prescription-mini",
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",

Evaluation results
	 f1: 0.94 | precision: 0.93 | recall: 0.95


(0.9289256198347108, 0.9525423728813559, 0.9405857740585774)

In [20]:
trainer.save_model("bert-clinical-scratch-wl-es-NER-ADMIN")

Saving model checkpoint to bert-clinical-scratch-wl-es-NER-ADMIN
Configuration saved in bert-clinical-scratch-wl-es-NER-ADMIN/config.json
Model weights saved in bert-clinical-scratch-wl-es-NER-ADMIN/pytorch_model.bin
tokenizer config file saved in bert-clinical-scratch-wl-es-NER-ADMIN/tokenizer_config.json
Special tokens file saved in bert-clinical-scratch-wl-es-NER-ADMIN/special_tokens_map.json


In [21]:
from huggingface_hub import notebook_login

notebook_login()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Login successful
Your token has been saved to /home/camilo/.huggingface/token


In [22]:
import os

# Silence the tokenizers fork warning shown above
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [23]:
model_mini.push_to_hub("beto-prescripciones-medicas-ADMIN")
tokenizer.push_to_hub("beto-prescripciones-medicas-ADMIN")

Configuration saved in /tmp/tmpbxtxy9to/config.json
Model weights saved in /tmp/tmpbxtxy9to/pytorch_model.bin
Uploading the following files to ccarvajal/beto-prescripciones-medicas-ADMIN: config.json,pytorch_model.bin
tokenizer config file saved in /tmp/tmp1mtfflnn/tokenizer_config.json
Special tokens file saved in /tmp/tmp1mtfflnn/special_tokens_map.json
Uploading the following files to ccarvajal/beto-prescripciones-medicas-ADMIN: tokenizer.json,vocab.txt,tokenizer_config.json,special_tokens_map.json


CommitInfo(commit_url='https://huggingface.co/ccarvajal/beto-prescripciones-medicas-ADMIN/commit/819fa4d244d112b5cabe11ad253238875ad09b1a', commit_message='Upload tokenizer', commit_description='', oid='819fa4d244d112b5cabe11ad253238875ad09b1a', pr_url=None, pr_revision=None, pr_num=None)

The model is now available on the Hugging Face Hub [at this link](https://huggingface.co/ccarvajal/beto-prescripciones-medicas-ADMIN). The MODEL ID is "ccarvajal/beto-prescripciones-medicas-ADMIN".
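A minimal sketch of how the published model might be used follows. Because the saved config shown earlier lists generic `LABEL_n` names, the sketch maps them back to the tags in `ner_dict`. The helpers `readable_label` and `tag_text` are hypothetical, and `tag_text` needs the `transformers` package and network access on first use.

```python
MODEL_ID = "ccarvajal/beto-prescripciones-medicas-ADMIN"

ner_dict = {'O': 0, 'B-CANT': 1, 'I-CANT': 2, 'B-UND': 3, 'I-UND': 4,
            'B-VIA_ADMIN': 5, 'I-VIA_ADMIN': 6}

# Generic config label ("LABEL_3") -> tag name ("B-UND")
id2label = {f"LABEL_{idx}": tag for tag, idx in ner_dict.items()}


def readable_label(generic_label):
    """Translate a generic LABEL_n name into the notebook's tag name."""
    return id2label.get(generic_label, generic_label)


def tag_text(text):
    """Run the published model on `text` and return (word piece, tag) pairs."""
    # Imported lazily so the mapping helpers work without transformers installed
    from transformers import pipeline

    ner = pipeline("token-classification", model=MODEL_ID)
    return [(item["word"], readable_label(item["entity"])) for item in ner(text)]


print(readable_label("LABEL_5"))
# → B-VIA_ADMIN
```

Calling `tag_text` on a prescription fragment would return one pair per sub-word piece; grouping those pieces into full entities is left out of this sketch.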

In [24]:
import shutil

# Remove the intermediate local model folder now that the model is on the Hub
shutil.rmtree("bert-clinical-scratch-wl-es-NER-prescription-mini")