# NLP with HuggingFace
## Processing data for NLP
### Downloading datasets

In [1]:
%%capture 
!pip install datasets transformers evaluate

We will use the MRPC (Microsoft Research Paraphrase Corpus) dataset, one of the 10 datasets that make up the GLUE benchmark. GLUE is used to measure the performance of ML models across 10 different text classification tasks.
In other words, we select the `mrpc` subset of the `glue` dataset.

In [2]:
from datasets import load_dataset

ds = load_dataset("glue", "mrpc")

Downloading and preparing dataset glue/mrpc to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.



In [3]:
ex = ds['train'][400]

In [4]:
ex

{'sentence1': 'U.S. Agriculture Secretary Ann Veneman , who announced Tuesdays ban , also said Washington would send a technical team to Canada to help .',
 'sentence2': "U.S. Agriculture Secretary Ann Veneman , who announced yesterday 's ban , also said Washington would send a technical team to Canada to assist in the Canadian situation .",
 'label': 1,
 'idx': 446}

In [5]:
labels = ds['train'].features['label']

In [6]:
labels.int2str(1)

'equivalent'

### Tokenization

In [7]:
from transformers import AutoTokenizer

repo_id = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(repo_id)


A tokenizer converts text into numbers so that the model can process it. This step is called encoding.
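As a rough illustration of encoding, the sketch below maps words to integer IDs using a tiny made-up vocabulary. The vocabulary, the IDs, and the `encode`/`decode` helpers are assumptions for illustration only; a real tokenizer such as the one loaded above works on subword units and uses a much larger learned vocabulary.

```python
# Toy word-level encoder; the vocabulary and IDs are invented for illustration.
TOY_VOCAB = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3, "down": 4}
ID_TO_TOKEN = {idx: token for token, idx in TOY_VOCAB.items()}


def encode(text):
    """Lowercase, split on whitespace, and look each word up in the vocabulary."""
    return [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in text.lower().split()]


def decode(ids):
    """Map IDs back to tokens (unknown words cannot be recovered)."""
    return [ID_TO_TOKEN[i] for i in ids]


ids = encode("The cat sat quietly")
print(ids)          # [1, 2, 3, 0]
print(decode(ids))  # ['the', 'cat', 'sat', '[UNK]']
```

Words missing from the vocabulary collapse to `[UNK]`, which is one reason real tokenizers split rare words into smaller known pieces instead.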

In [8]:
tokenized_sentence_1 = tokenizer(ds['train']['sentence1'][2])
tokenized_sentence_1

{'input_ids': [101, 2027, 2018, 2405, 2019, 15147, 2006, 1996, 4274, 2006, 2238, 2184, 1010, 5378, 1996, 6636, 2005, 5096, 1010, 2002, 2794, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [9]:
inputs = tokenizer('This is the first','This is the second')
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 102, 2023, 2003, 1996, 2117, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [10]:
tokenizer.convert_ids_to_tokens(inputs['input_ids'])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 '[SEP]']
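The pair encoding above can be rebuilt by hand to see where each field comes from. This is a minimal sketch, not the tokenizer's actual implementation: the seven-entry vocabulary is copied from the outputs above, and `encode_pair` is a hypothetical helper that only handles words in that vocabulary.

```python
# IDs copied from the bert-base-uncased outputs shown above.
VOCAB = {"[CLS]": 101, "[SEP]": 102, "this": 2023, "is": 2003,
         "the": 1996, "first": 2034, "second": 2117}


def encode_pair(first, second):
    """Build [CLS] first [SEP] second [SEP] plus segment IDs and attention mask."""
    a = ["[CLS]"] + first.lower().split() + ["[SEP]"]
    b = second.lower().split() + ["[SEP]"]
    return {
        "input_ids": [VOCAB[t] for t in a + b],
        # 0 marks the first sentence (with [CLS] and its [SEP]), 1 the second.
        "token_type_ids": [0] * len(a) + [1] * len(b),
        # Nothing is padded here, so every position is attended to.
        "attention_mask": [1] * (len(a) + len(b)),
    }


pair = encode_pair("This is the first", "This is the second")
print(pair["input_ids"])
# [101, 2023, 2003, 1996, 2034, 102, 2023, 2003, 1996, 2117, 102]
print(pair["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```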

In [11]:
repo_id = "distilroberta-base"

tokenizer = AutoTokenizer.from_pretrained(repo_id)


In [12]:
def tokenize_fn(ex):
  return tokenizer(ex['sentence1'],ex['sentence2'],truncation=True)

In [13]:
prepared_ds = ds.map(tokenize_fn,batched=True)

  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]
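With `batched=True`, `tokenize_fn` receives a dictionary of lists (many examples at once) rather than a single example. The sketch below imitates that behaviour in plain Python, assuming the default `batch_size` of 1000 in `Dataset.map`; `iter_batches` is a hypothetical helper, not part of the `datasets` API. The batch counts it prints agree with the 4, 1, and 2 steps shown in the progress bars above.

```python
import math

# Split sizes taken from the download log earlier in this notebook.
SPLIT_SIZES = {"train": 3668, "validation": 408, "test": 1725}
BATCH_SIZE = 1000  # assumed default batch size of Dataset.map


def iter_batches(columns, batch_size):
    """Yield dict-of-lists slices, the shape a batched map function receives."""
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, batch_size):
        yield {name: values[start:start + batch_size] for name, values in columns.items()}


for split, size in SPLIT_SIZES.items():
    print(split, math.ceil(size / BATCH_SIZE))  # train 4, validation 1, test 2

# Tiny demonstration with made-up sentences and a batch size of 2.
toy = {"sentence1": ["a", "b", "c"], "sentence2": ["x", "y", "z"]}
batches = list(iter_batches(toy, batch_size=2))
print(batches[0])  # {'sentence1': ['a', 'b'], 'sentence2': ['x', 'y']}
```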

### Data Collator: Dynamic Padding

Our tensors need to be rectangular, that is, every example in a batch must have the same length. However, texts do not necessarily have the same length.

To handle this we use padding. Padding makes all our sentences the same length by adding a special token, called the padding token, to the shorter sentences. For example, if we have 10 sentences with 10 words each and 1 sentence with 20 words, padding ensures that all sentences have 20 words. With dynamic padding this is done per batch, so each batch is only padded up to the length of its own longest example.
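A minimal sketch of dynamic padding in plain Python follows. It pads each batch only up to its own longest sequence, which is the idea behind `DataCollatorWithPadding`; the helper name `pad_batch` and the padding ID of 0 are assumptions for illustration (the real collator takes the padding ID from the tokenizer and returns tensors).

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 = real token, 0 = padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7], [8, 9], [10, 11, 12, 13]])
print(batch["input_ids"])
# [[5, 6, 7, 0], [8, 9, 0, 0], [10, 11, 12, 13]]
print(batch["attention_mask"])
# [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
```

Padding per batch rather than to one global maximum keeps short batches short, which saves computation on padding tokens.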

In [14]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Training and Evaluation

In [21]:
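import evaluate
import numpy as np

# Load the GLUE/MRPC metric once, instead of on every evaluation call
metric = evaluate.load("glue", "mrpc")

def compute_metrics(eval_pred):
  # eval_pred holds the model logits and the reference labels
  logits, labels = eval_pred
  # The predicted class is the index of the largest logit
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)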
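To make the metric computation concrete, the sketch below redoes it by hand on made-up logits: it takes the index of the largest logit in each row (what `np.argmax(logits, axis=-1)` does) and then computes accuracy and F1, the two scores that appear for MRPC in the training output further down. The logits, labels, and helper names are invented for illustration, and F1 is computed for the positive class (label 1), which is an assumption about how the metric is defined.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy_and_f1(predictions, references):
    """Accuracy plus F1 for the positive class (label 1)."""
    pairs = list(zip(predictions, references))
    tp = sum(p == 1 and r == 1 for p, r in pairs)
    fp = sum(p == 1 and r == 0 for p, r in pairs)
    fn = sum(p == 0 and r == 1 for p, r in pairs)
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Made-up logits for four sentence pairs: [not equivalent, equivalent].
logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
labels = [1, 0, 0, 1]

predictions = argmax_rows(logits)
print(predictions)                           # [1, 0, 1, 0]
print(accuracy_and_f1(predictions, labels))  # {'accuracy': 0.5, 'f1': 0.5}
```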

### Setting up the Trainer

In [16]:
from transformers import AutoModelForSequenceClassification

# Class names for MRPC, read from the dataset features
label_names = ds['train'].features['label'].names

model = AutoModelForSequenceClassification.from_pretrained(
    repo_id,
    num_labels = len(label_names),
    # Map integer class ids to names and back
    id2label = {i: c for i, c in enumerate(label_names)},
    label2id = {c: i for i, c in enumerate(label_names)}
)


Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.weight'
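The two mappings passed to the model translate between class indices and readable names. A small sketch with integer indices, assuming the MRPC label names are `not_equivalent` and `equivalent` (only `equivalent` for index 1 is confirmed by the `int2str` output earlier in this notebook):

```python
# Assumed MRPC class names; index 1 -> 'equivalent' matches labels.int2str(1) above.
label_names = ["not_equivalent", "equivalent"]

id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in enumerate(label_names)}

print(id2label)  # {0: 'not_equivalent', 1: 'equivalent'}
print(label2id)  # {'not_equivalent': 0, 'equivalent': 1}

# Turning a predicted class index back into a readable label.
predicted_index = 1
print(id2label[predicted_index])  # equivalent
```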

In [22]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir = './distilroberta-base-mrpc-glu-cristian-agudelo',
    evaluation_strategy = 'steps',
    num_train_epochs = 3,
    #push_to_hub_organization='platzi'
    push_to_hub = True,
    load_best_model_at_end = True
)

using `logging_steps` to initialize `eval_steps` to 500
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [18]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
    
Token: 
Add token as git credential? (Y/n) y
Token is valid.
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credenti

In [23]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset = prepared_ds['train'],
    eval_dataset = prepared_ds['validation'],
    data_collator = data_collator,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics
)

/content/./distilroberta-base-mrpc-glu-cristian-agudelo is already a clone of https://huggingface.co/agudelozc/distilroberta-base-mrpc-glu-cristian-agudelo. Make sure you pull the latest changes with `repo.git_pull()`.


### Training

In [24]:
train_results = trainer.train()
trainer.save_model()
trainer.log_metrics('train', train_results.metrics)
trainer.save_metrics('train', train_results.metrics)


The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, sentence1, sentence2. If idx, sentence1, sentence2 are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1377
  Number of trainable parameters = 82119938


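| Step | Training Loss | Validation Loss | Accuracy | F1 |
|------|---------------|-----------------|----------|----------|
| 500 | 0.285 | 0.895872 | 0.840686 | 0.884547 |
| 1000 | 0.2653 | 0.913127 | 0.821078 | 0.871252 |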
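The step counts in the log can be checked with a little arithmetic. The sketch below uses only numbers from the output above (3668 examples, batch size 8, 3 epochs) and the 500-step evaluation interval reported when `TrainingArguments` was created; it assumes a single device and no gradient accumulation, as the log indicates.

```python
import math

num_examples = 3668   # "Num examples" in the training log
batch_size = 8        # "Total train batch size"
num_epochs = 3
eval_steps = 500      # from "using `logging_steps` to initialize `eval_steps` to 500"

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))

print(steps_per_epoch)  # 459
print(total_steps)      # 1377
print(eval_points)      # [500, 1000]
```

This is why only two evaluation rows are reported: the run ends at step 1377, before a third evaluation at step 1500 would take place.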


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, sentence1, sentence2. If idx, sentence1, sentence2 are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8



Saving model checkpoint to ./distilroberta-base-mrpc-glu-cristian-agudelo/checkpoint-500
Configuration saved in ./distilroberta-base-mrpc-glu-cristian-agudelo/checkpoint-500/config.json
Model weights saved in ./distilroberta-base-mrpc-glu-cristian-agudelo/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./distilroberta-base-mrpc-glu-cristian-agudelo/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./distilroberta-base-mrpc-glu-cristian-agudelo/checkpoint-500/special_tokens_map.json
tokenizer config file saved in ./distilroberta-base-mrpc-glu-cristian-agudelo/tokenizer_config.json
Special tokens file saved in ./distilroberta-base-mrpc-glu-cristian-agudelo/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, sentence1, sentence2. If idx, sentence1, sentence2 are not expected by `RobertaForSequenceClassification.forward`,  you can saf

***** train metrics *****
  epoch                    =        3.0
  total_flos               =   191920GF
  train_loss               =     0.2353
  train_runtime            = 0:02:42.88
  train_samples_per_second =     67.558
  train_steps_per_second   =      8.454


### Evaluation

In [25]:
metrics = trainer.evaluate(prepared_ds['validation'])
trainer.log_metrics('eval',metrics)
trainer.save_metrics('eval',metrics)

The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: idx, sentence1, sentence2. If idx, sentence1, sentence2 are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8


***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.8407
  eval_f1                 =     0.8845
  eval_loss               =     0.8959
  eval_runtime            = 0:00:01.25
  eval_samples_per_second =    324.948
  eval_steps_per_second   =     40.618
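These evaluation numbers match the step-500 results from training rather than the later evaluation at step 1000, which is consistent with `load_best_model_at_end = True`. The sketch below reproduces that selection by hand, assuming the best checkpoint is the one with the lowest validation loss (believed to be the default criterion when `metric_for_best_model` is not set); the `history` list is simply the training results retyped.

```python
# Evaluation results copied from the training output above.
history = [
    {"step": 500, "eval_loss": 0.895872, "accuracy": 0.840686, "f1": 0.884547},
    {"step": 1000, "eval_loss": 0.913127, "accuracy": 0.821078, "f1": 0.871252},
]

# Assumption: "best" means lowest validation loss.
best = min(history, key=lambda row: row["eval_loss"])

print(best["step"])                 # 500
print(round(best["accuracy"], 4))   # 0.8407
print(round(best["f1"], 4))         # 0.8845
print(round(best["eval_loss"], 4))  # 0.8959
```

Note that the step-500 checkpoint also has the higher accuracy and F1 here, so selecting on either of those metrics would have given the same checkpoint in this run.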
