## 1- Preprocessing

### 1.1 Load Dataset

In this notebook we use the Hugging Face Datasets library to load and process the dataset used to train and validate our model.

In [2]:
from datasets import load_dataset

# Load the MRPC dataset from the GLUE benchmark
raw_datasets = load_dataset("glue", "mrpc")

# Different ways to inspect the dataset: all splits, one split, its features, one example
print(raw_datasets)
print(raw_datasets['train'])
print(raw_datasets['train'].features)
print(raw_datasets['train'][16])

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})
{'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], id=None), 'idx': Value(dtype='int32', id=None)}
{'sentence1': 'As well as the dolphin scheme , the chaos has allowed foreign companies to engage in damaging logging and fishing operations without proper monitoring or export controls .', 'sentence2': 'Internal chaos has allowed foreign companies to set up damaging commercial logging and fishing operations without proper monitoring or

### 1.2 Tokenize dataset

The Datasets library stores data with Apache Arrow, which keeps memory usage low. To keep the data as a Dataset object while we process it, we must use the Dataset methods.

This example uses a dataset of sentence pairs, where the task is to determine whether the two sentences have the same meaning. Before going into detail, let's check how the tokenizer handles two sentences at once.

In [3]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# The tokenizer uses token_type_ids to tell the two sentences apart within a single sequence
tokenized_dataset = tokenizer(
    raw_datasets["train"][0]["sentence1"],
    raw_datasets["train"][0]["sentence2"],
    padding=True,
    truncation=True,
)
# The tokenizer adds [CLS] and [SEP]; token_type_ids mark which tokens belong to each sentence
print(tokenizer.convert_ids_to_tokens(tokenized_dataset['input_ids']))
print(tokenized_dataset['input_ids'])
print(tokenized_dataset['token_type_ids'])

['[CLS]', 'am', '##ro', '##zi', 'accused', 'his', 'brother', ',', 'whom', 'he', 'called', '"', 'the', 'witness', '"', ',', 'of', 'deliberately', 'di', '##stor', '##ting', 'his', 'evidence', '.', '[SEP]', 'referring', 'to', 'him', 'as', 'only', '"', 'the', 'witness', '"', ',', 'am', '##ro', '##zi', 'accused', 'his', 'brother', 'of', 'deliberately', 'di', '##stor', '##ting', 'his', 'evidence', '.', '[SEP]']
[101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
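To make the role of `token_type_ids` concrete, here is a minimal, library-free sketch that groups tokens by segment id. The token list is a shortened, hypothetical example (not the real output above), and `split_by_segment` is an illustrative helper, not part of the Transformers API.

```python
# Hypothetical, shortened sentence pair; not the real tokenizer output above.
tokens = ["[CLS]", "the", "cat", "sat", "[SEP]", "a", "cat", "was", "sitting", "[SEP]"]
token_type_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def split_by_segment(tokens, token_type_ids):
    """Group tokens by segment id (0 = first sentence, 1 = second sentence)."""
    segments = {0: [], 1: []}
    for token, segment_id in zip(tokens, token_type_ids):
        segments[segment_id].append(token)
    return segments[0], segments[1]


first, second = split_by_segment(tokens, token_type_ids)
print(first)   # ['[CLS]', 'the', 'cat', 'sat', '[SEP]']
print(second)  # ['a', 'cat', 'was', 'sitting', '[SEP]']
```

Note that `[CLS]` and the first `[SEP]` carry segment id 0, while the closing `[SEP]` carries segment id 1, which matches the pattern in the real output.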


In [13]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

#### 1.2.1 The map function
Unlike the previous cell, which tokenizes the whole training set into in-memory Python lists, the map function keeps the result as a Dataset object and uses far less memory.

In [4]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

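# Each example has a different length because no padding has been applied yet.
# Slicing the first 100 rows once avoids reloading the whole column on every iteration.
token_id_length = [len(ids) for ids in tokenized_datasets['train'][:100]['input_ids']]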
print(token_id_length)

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-0b1db9b23b727341.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-605feb9995750299.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-81324fca8758f5b3.arrow


[50, 59, 47, 67, 59, 50, 62, 32, 45, 60, 51, 47, 42, 61, 53, 44, 53, 79, 57, 70, 63, 35, 54, 64, 52, 47, 68, 58, 60, 35, 43, 34, 48, 65, 27, 73, 31, 50, 36, 61, 57, 54, 41, 64, 53, 38, 68, 45, 57, 39, 36, 68, 63, 47, 37, 62, 59, 58, 50, 33, 61, 34, 71, 64, 74, 30, 54, 53, 72, 70, 44, 58, 78, 40, 60, 50, 55, 31, 62, 46, 58, 70, 49, 49, 42, 34, 70, 50, 34, 65, 49, 39, 53, 37, 28, 70, 66, 68, 62, 62]
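With `batched=True`, the function passed to `map` receives a dictionary of lists covering several examples at once, rather than a single example. The sketch below imitates that calling convention in plain Python. It is a simplified stand-in, not how the Datasets library is implemented: `batched_map` and `toy_tokenize` are hypothetical names, and the whitespace split only stands in for a real tokenizer.

```python
def toy_tokenize(batch):
    # Stand-in for a real tokenizer: whitespace-split each sentence pair.
    pairs = zip(batch["sentence1"], batch["sentence2"])
    return {"tokens": [s1.split() + s2.split() for s1, s2 in pairs]}


def batched_map(columns, function, batch_size=2):
    """Call `function` on slices of a column dict and merge the new columns in."""
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict of lists, one list per column.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}


# Hypothetical toy data with the same column names as MRPC.
toy_data = {
    "sentence1": ["a b", "c d e", "f"],
    "sentence2": ["g", "h i", "j k"],
}
mapped = batched_map(toy_data, toy_tokenize, batch_size=2)
print([len(t) for t in mapped["tokens"]])  # [3, 5, 3]
```

As in the real output, the new column has one entry per example and the entries differ in length.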


### 1.3 Dynamic padding

As we saw earlier, the tokenizer works but returns input IDs of different lengths, while training needs batches in which every sequence has the same shape. That is why Hugging Face provides the DataCollatorWithPadding class, which pads each batch to the length of its longest sequence.

In [6]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

samples = tokenized_datasets["train"][:8]
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'sentence1': TensorShape([8]),
 'sentence2': TensorShape([8]),
 'idx': TensorShape([8]),
 'input_ids': TensorShape([8, 67]),
 'token_type_ids': TensorShape([8, 67]),
 'attention_mask': TensorShape([8, 67]),
 'labels': TensorShape([8])}
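The idea behind dynamic padding can be sketched without any library: pad each sequence only up to the longest one in its own batch. The helper `pad_batch` below is hypothetical and only illustrates the principle. It assumes a padding id of 0 and uses dummy token ids, with the lengths of the first 8 training examples taken from the list printed in 1.2.1.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Lengths of the first 8 training examples, from the list printed in 1.2.1.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
# Dummy token ids (all 1s); only the lengths matter here.
toy_batch = pad_batch([[1] * n for n in lengths])
print(len(toy_batch["input_ids"]), len(toy_batch["input_ids"][0]))  # 8 67
```

The resulting shape, 8 by 67, agrees with the `input_ids` shape reported by the data collator above.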

## 2- All in one step

We could iterate over the dataset and build the batches ourselves, but the Hugging Face Datasets library provides the to_tf_dataset method, which lets us select specific columns, set the label column, and optionally pass a collation function.

In [7]:

tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)
tf_train_dataset

<PrefetchDataset shapes: ({input_ids: (None, None), token_type_ids: (None, None), attention_mask: (None, None)}, (None,)), types: ({input_ids: tf.int64, token_type_ids: tf.int64, attention_mask: tf.int64}, tf.int64)>

## 3- Training the model
### 3.1 Default training

In [30]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [31]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model.compile(optimizer="adam",
              loss=SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

model.fit(tf_train_dataset,
          validation_data=tf_validation_dataset)




<keras.callbacks.History at 0x7f1739910bb0>

### 3.2 Hyperparameter tuning

In [None]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.optimizers import Adam
from transformers import TFAutoModelForSequenceClassification
import tensorflow as tf

num_epochs = 3
# Total training steps = batches per epoch (3668 examples / batch size 8, rounded up) * number of epochs
num_train_steps = len(tf_train_dataset) * num_epochs

# Learning rate schedule: with the default power of 1.0, PolynomialDecay decays linearly from 5e-5 to 0
lr_scheduler = PolynomialDecay(initial_learning_rate=5e-5, 
                               end_learning_rate=0.0, 
                               decay_steps=num_train_steps)

# define the optimizer
opt = Adam(learning_rate=lr_scheduler)

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

model.compile(optimizer=opt, 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=["accuracy"])

model.fit(tf_train_dataset, 
          validation_data=tf_validation_dataset, 
          epochs=num_epochs)


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
Epoch 2/3
  5/459 [..............................] - ETA: 27:02 - loss: 0.3550 - accuracy: 0.8750
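To see what the schedule in the cell above does, here is a plain-Python sketch of the polynomial decay formula with power 1.0 (linear decay). The function `linear_decay` is a hypothetical re-implementation for illustration only, not the Keras class itself. The step count assumes the last partial batch is kept, which matches the 459 steps per epoch shown in the progress output.

```python
import math


def linear_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=1377, power=1.0):
    """Polynomial decay; with power=1.0 the learning rate falls linearly."""
    step = min(step, decay_steps)  # hold the final value once decay_steps is reached
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr


steps_per_epoch = math.ceil(3668 / 8)   # 459 batches per epoch
num_train_steps = steps_per_epoch * 3   # 1377 steps over 3 epochs

for step in (0, num_train_steps // 2, num_train_steps):
    print(step, linear_decay(step, decay_steps=num_train_steps))
```

The learning rate starts at 5e-5, is roughly half that value midway through training, and reaches 0 at the final step.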

In [None]:
preds = model.predict(tf_validation_dataset)["logits"]
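The logits are raw, unnormalized scores with one value per class. A common next step is to take the argmax of each row to get a predicted class, and to apply a softmax when probabilities are wanted. The sketch below shows both in plain Python. The logits and labels are hypothetical values made up for illustration, not real model output, and the label names come from the dataset features printed in 1.1.

```python
import math


def softmax(logits):
    """Turn one row of logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value in the list."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits and labels for 4 sentence pairs; not real model output.
toy_logits = [[2.0, -1.0], [-0.5, 1.5], [0.3, 0.1], [-2.0, 2.0]]
toy_labels = [0, 1, 1, 1]
label_names = ["not_equivalent", "equivalent"]  # from the dataset features in 1.1

class_preds = [argmax(row) for row in toy_logits]
accuracy = sum(p == y for p, y in zip(class_preds, toy_labels)) / len(toy_labels)

print(class_preds)                            # [0, 1, 0, 1]
print([label_names[p] for p in class_preds])
print(accuracy)                               # 0.75
print(softmax(toy_logits[0]))
```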

In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline

model_name = 'qanastek/XLMRoberta-Alexa-Intents-Classification'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [8]:
classifier = TextClassificationPipeline(model=model, tokenizer=tokenizer)

res = classifier("How can i cook the rice?")
print(res)

[{'label': 'cooking_recipe', 'score': 0.9999580383300781}]


## 
- back up the code
- sandbox access
- dataset search
- intents/sub-intents - utterance