## Hugging Face NLP Course Notes

### 1.- Tokenizers 

In [1]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)



In [12]:
raw_inputs = [
    "TGS sucks a lot.",
    "Tell me that you like me, without telling me that you like me.",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")
print(inputs)

{'input_ids': <tf.Tensor: shape=(2, 17), dtype=int32, numpy=
array([[  101,  1056,  5620, 19237,  1037,  2843,  1012,   102,     0,
            0,     0,     0,     0,     0,     0,     0,     0],
       [  101,  2425,  2033,  2008,  2017,  2066,  2033,  1010,  2302,
         4129,  2033,  2008,  2017,  2066,  2033,  1012,   102]])>, 'attention_mask': <tf.Tensor: shape=(2, 17), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])>}
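
The padded `input_ids` and the `attention_mask` printed above can be reproduced by hand. The following is a minimal sketch, assuming a pad ID of 0 (as seen in the output); `pad_batch` is a hypothetical helper written for these notes, not part of the Transformers API.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded IDs copied from the printed output above.
short = [101, 1056, 5620, 19237, 1037, 2843, 1012, 102]
long = [101, 2425, 2033, 2008, 2017, 2066, 2033, 1010, 2302,
        4129, 2033, 2008, 2017, 2066, 2033, 1012, 102]

padded = pad_batch([short, long])
print(padded["input_ids"][0])
print(padded["attention_mask"][0])
```

The shorter sentence gains nine pad IDs, so both rows end up with 17 entries, matching the `shape=(2, 17)` reported by the tokenizer.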


### 2.- Model

In [13]:
from transformers import TFAutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModel.from_pretrained(checkpoint)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


The vector output by the Transformer module is usually large. It generally has three dimensions:

- Batch size: the number of sequences processed at a time (2 in our example).
- Sequence length: the length of the numerical representation of the sequence (17 in our example, after padding).
- Hidden size: the vector dimension of each model input (768 for this checkpoint).

In [14]:
outputs = model(inputs)
print(outputs.last_hidden_state.shape)

(2, 17, 768)
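
To make the three dimensions concrete, here is a small sketch that builds a zero-filled stand-in for `last_hidden_state` using plain nested lists. `shape_of` is a hypothetical helper, and the zeros are placeholders rather than real model activations.

```python
def shape_of(nested):
    """Return the shape of a regular nested list, e.g. (2, 17, 768)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


batch_size, sequence_length, hidden_size = 2, 17, 768

# Zero-filled stand-in for last_hidden_state; real values come from the model.
hidden = [[[0.0] * hidden_size for _ in range(sequence_length)]
          for _ in range(batch_size)]
print(shape_of(hidden))

# There is one hidden vector per token: this is the one for token 0 of sequence 0.
first_token_vector = hidden[0][0]
print(len(first_token_vector))
```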


In [15]:
from transformers import TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(inputs)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [16]:
print(outputs.logits.shape)

(2, 2)


In [17]:
outputs

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[ 2.8571773, -2.4997537],
       [-3.199863 ,  3.3846471]], dtype=float32)>, hidden_states=None, attentions=None)

### 3.- Postprocessing the output

In [18]:
import tensorflow as tf

predictions = tf.math.softmax(outputs.logits, axis=-1)
print(predictions)

tf.Tensor(
[[0.9953068  0.00469323]
 [0.0013797  0.9986203 ]], shape=(2, 2), dtype=float32)
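
As a sanity check, the softmax step can be reproduced with the standard library alone. This sketch hard-codes the logits printed earlier; `softmax` here is a hand-written helper, not the TensorFlow function.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above.
logits = [[2.8571773, -2.4997537], [-3.199863, 3.3846471]]
probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the values agree with the `tf.math.softmax` result to within rounding.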


In [19]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
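
Putting the probabilities and `id2label` together gives a readable prediction per sentence. A minimal sketch, with the values copied from the outputs above and `to_labels` as a hypothetical helper:

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
predictions = [[0.9953068, 0.00469323], [0.0013797, 0.9986203]]


def to_labels(probabilities, id2label):
    """Pick the most likely class per row and return (label, probability) pairs."""
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        results.append((id2label[best], row[best]))
    return results


labelled = to_labels(predictions, id2label)
for label, score in labelled:
    print(label, round(score, 4))
```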

## Creating Transformers

In [20]:
from transformers import BertConfig, TFBertModel

In [21]:
# Building the config
config = BertConfig()

# Building the model from the config
model = TFBertModel(config)

In [22]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.32.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



In [23]:
model = TFBertModel.from_pretrained("bert-base-cased")

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel we

In [24]:
sequences = ["Hello!", "Cool.", "Nice!"]

In [26]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

In [27]:
import tensorflow as tf

model_inputs = tf.constant(encoded_sequences)

In [28]:
output = model(model_inputs)

## Tokenization

Translating text to numbers is known as encoding. It is a two-step process: tokenization, followed by conversion of the tokens to input IDs.

In [31]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "I’ve been waiting for a HuggingFace course my whole life."
tokens = tokenizer.tokenize(sequence)

print(tokens)

['I', '’', 've', 'been', 'waiting', 'for', 'a', 'Hu', '##gging', '##F', '##ace', 'course', 'my', 'whole', 'life', '.']


In [33]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[146, 787, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 119]
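
The two steps above (split into tokens, then look up IDs) can be sketched without the library. The code below is a simplified illustration of greedy longest-match WordPiece splitting, assuming a tiny vocabulary whose entries and IDs are copied from the outputs above; the `[UNK]` ID of 100 is an assumption, and the real tokenizer also handles normalization and punctuation.

```python
# Tiny vocabulary: token -> ID pairs taken from the printed tokens and IDs above.
# The [UNK] entry and its ID are an assumption for illustration.
TOY_VOCAB = {"[UNK]": 100, "Hu": 20164, "##gging": 10932, "##F": 2271,
             "##ace": 7954, "course": 1736}


def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


def convert_tokens_to_ids(tokens, vocab):
    """Step 2: map each token to its vocabulary ID."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


tokens = wordpiece_tokenize("HuggingFace", TOY_VOCAB)
ids = convert_tokens_to_ids(tokens, TOY_VOCAB)
print(tokens)
print(ids)
```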


In [34]:
decoded_string = tokenizer.decode(ids)
print(decoded_string)

I ’ ve been waiting for a HuggingFace course my whole life.
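
Decoding goes the other way: continuation pieces marked with `##` are glued back onto the preceding token. A rough sketch, using a hypothetical `decode_tokens` helper that works on tokens rather than IDs and skips the punctuation clean-up the real `decode` performs:

```python
def decode_tokens(tokens):
    """Join WordPiece tokens, gluing '##' continuation pieces to the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


tokens = ["I", "’", "ve", "been", "waiting", "for", "a", "Hu", "##gging",
          "##F", "##ace", "course", "my", "whole", "life", "."]
decoded = decode_tokens(tokens)
print(decoded)
```

Unlike the library output above, this sketch leaves a space before the final period, because it does not tidy up spacing around punctuation.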


### Handling Multiple Sequences

In [37]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = tf.constant(ids)
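# The course marks this call as failing because input_ids has no batch dimension;
# in this run it returned logits anyway (see the output below).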
model(input_ids)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-2.7276192,  2.8789363]], dtype=float32)>, hidden_states=None, attentions=None)

In [38]:
tokenized_inputs = tokenizer(sequence, return_tensors="tf")
print(tokenized_inputs["input_ids"])

tf.Tensor(
[[  101  1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026
   2878  2166  1012   102]], shape=(1, 16), dtype=int32)


In [39]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = tf.constant([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


Input IDs: tf.Tensor(
[[ 1045  1005  2310  2042  3403  2005  1037 17662 12172  2607  2026  2878
   2166  1012]], shape=(1, 14), dtype=int32)
Logits: tf.Tensor([[-2.7276192  2.8789363]], shape=(1, 2), dtype=float32)
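
Comparing the two ID tensors above, calling the tokenizer directly gives 16 IDs while the manual route gives 14: the tokenizer adds 101 (`[CLS]`) at the start and 102 (`[SEP]`) at the end, and it also adds the batch dimension. A minimal sketch of both adjustments, with `build_model_input` as a hypothetical helper:

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] IDs, as seen in the tokenizer output above


def build_model_input(ids):
    """Wrap one sequence in special tokens and add a batch dimension."""
    return [[CLS_ID] + ids + [SEP_ID]]


# IDs produced by tokenize() + convert_tokens_to_ids(), copied from the output above.
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607,
       2026, 2878, 2166, 1012]
batch = build_model_input(ids)
print(len(batch), len(batch[0]))
```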


In [40]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

In [41]:
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

In [42]:
model_inputs

{'input_ids': <tf.Tensor: shape=(3, 4), dtype=int32, numpy=
array([[ 101, 7592,  999,  102],
       [ 101, 4658, 1012,  102],
       [ 101, 3835,  999,  102]])>, 'attention_mask': <tf.Tensor: shape=(3, 4), dtype=int32, numpy=
array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]])>}

In [43]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="tf")
output = model(**tokens)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [44]:
output

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-1.5606955,  1.6122804],
       [-3.6183178,  3.9137495]], dtype=float32)>, hidden_states=None, attentions=None)

In [45]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
result = tokenizer.tokenize("Hello!")

In [46]:
result

['Hello', '!']

## Processing Data

In [47]:
import tensorflow as tf
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = dict(tokenizer(sequences, padding=True, truncation=True, return_tensors="tf"))

# This is new
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
labels = tf.convert_to_tensor([1, 1])
model.train_on_batch(batch, labels)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


14.910134315490723

In [48]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets



DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [54]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [61]:
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

In [62]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [63]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']
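
The layout of a sentence pair is `[CLS] sentence1 [SEP] sentence2 [SEP]`, with `token_type_ids` set to 0 for the first segment and 1 for the second. This can be sketched by hand from the IDs printed above; `encode_pair` is a hypothetical helper, not the tokenizer's own method.

```python
CLS_ID, SEP_ID = 101, 102


def encode_pair(ids_a, ids_b):
    """Build input_ids, token_type_ids and attention_mask for a sentence pair."""
    first = [CLS_ID] + ids_a + [SEP_ID]   # segment 0 includes [CLS] and the first [SEP]
    second = ids_b + [SEP_ID]             # segment 1 includes the closing [SEP]
    input_ids = first + second
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(first) + [1] * len(second),
        "attention_mask": [1] * len(input_ids),
    }


# Token IDs for the two sentences, without special tokens (from the output above).
sentence_1 = [2023, 2003, 1996, 2034, 6251, 1012]  # this is the first sentence .
sentence_2 = [2023, 2003, 1996, 2117, 2028, 1012]  # this is the second one .
encoded = encode_pair(sentence_1, sentence_2)
print(encoded["token_type_ids"])
```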

In [64]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

In [65]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

In [66]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets



DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [67]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

In [69]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

In [70]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': TensorShape([8, 67]),
 'token_type_ids': TensorShape([8, 67]),
 'attention_mask': TensorShape([8, 67]),
 'labels': TensorShape([8])}
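
Dynamic padding means each batch is padded only to its own longest sequence (67 here) rather than to the longest sequence in the whole dataset. The sketch below mimics that behaviour with plain lists, assuming a pad value of 0 for every padded feature. `collate` is a hypothetical stand-in for `DataCollatorWithPadding`, and the token IDs are placeholders.

```python
PAD_VALUES = {"input_ids": 0, "token_type_ids": 0, "attention_mask": 0}


def collate(features):
    """Pad each sequence feature to the longest sequence in this batch."""
    longest = max(len(ids) for ids in features["input_ids"])
    batch = {}
    for name, column in features.items():
        if name in PAD_VALUES:
            pad = PAD_VALUES[name]
            batch[name] = [row + [pad] * (longest - len(row)) for row in column]
        else:
            # Labels need no padding; the real collator renames "label" to "labels".
            batch["labels" if name == "label" else name] = list(column)
    return batch


lengths = [50, 59, 47, 67, 59, 50, 62, 32]  # sequence lengths from the output above
samples = {
    "input_ids": [[1] * n for n in lengths],       # placeholder IDs, not real tokens
    "token_type_ids": [[0] * n for n in lengths],
    "attention_mask": [[1] * n for n in lengths],
    "label": [1] * len(lengths),                   # placeholder labels
}
batch = collate(samples)
shapes = {k: (len(v), len(v[0])) if isinstance(v[0], list) else (len(v),)
          for k, v in batch.items()}
print(shapes)
```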

In [71]:
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 


In [72]:
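# NOTE: the tokenizer above was loaded from "bert-base-uncased" (vocabulary of 30522 tokens),
# but this checkpoint's embedding table has only 28996 entries. The mismatch triggers the
# InvalidArgumentError raised by model.fit() below; use "bert-base-uncased" here to avoid it.
checkpoint = "bert-base-cased"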
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # the model outputs logits, i.e. values before the softmax layer
model.compile(optimizer="adam", loss=loss)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
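
The `from_logits=True` flag matters because the loss has to apply the softmax itself. The following sketch shows what sparse categorical cross-entropy computes for a single example when it is given raw logits. It is a hand-written illustration rather than the Keras implementation, and the second set of logits is reused from the sentiment example earlier in these notes.

```python
import math


def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy for one example, computed directly from raw logits."""
    peak = max(logits)
    # log(sum(exp(logits))), computed stably; this is the softmax normaliser.
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Negative log-probability of the true class.
    return log_sum_exp - logits[label]


# An undecided two-class model gives ln(2) regardless of the label.
uniform_loss = sparse_categorical_crossentropy([0.0, 0.0], 1)
# A confident, correct prediction gives a loss close to zero.
confident_loss = sparse_categorical_crossentropy([-3.199863, 3.3846471], 1)
print(round(uniform_loss, 4), round(confident_loss, 4))
```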


In [74]:
model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
)

  5/459 [..............................] - ETA: 24:43 - loss: 0.7697

InvalidArgumentError: Graph execution error:

Detected at node 'tf_bert_for_sequence_classification_1/bert/embeddings/assert_less/Assert/Assert' defined at (most recent call last):
    File "C:\Users\Francisco.Colina\AppData\Local\Temp\ipykernel_21432\3352150589.py", line 1, in <module>
      model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\training.py", line 1742, in fit
      tmp_logs = self.train_function(iterator)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\training.py", line 1338, in train_function
      return step_function(self, iterator)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\training.py", line 1322, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\training.py", line 1303, in run_step
      outputs = model.train_step(data)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\modeling_tf_utils.py", line 1637, in train_step
      y_pred = self(x, training=True)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\training.py", line 569, in __call__
      return super().__call__(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\base_layer.py", line 1150, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\modeling_tf_utils.py", line 1557, in run_call_with_unpacked_inputs
      "method added in TF 2.8. If you want the original HF compute_loss, please call "
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 1569, in call
      outputs = self.bert(
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\base_layer.py", line 1150, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\modeling_tf_utils.py", line 1557, in run_call_with_unpacked_inputs
      "method added in TF 2.8. If you want the original HF compute_loss, please call "
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 780, in call
      embedding_output = self.embeddings(
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\base_layer.py", line 1150, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 201, in call
      if input_ids is not None:
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 202, in call
      check_embeddings_within_bounds(input_ids, self.config.vocab_size)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\tf_utils.py", line 161, in check_embeddings_within_bounds
      tf.debugging.assert_less(
Node: 'tf_bert_for_sequence_classification_1/bert/embeddings/assert_less/Assert/Assert'
assertion failed: [The maximum value of input_ids (Tensor(\"tf_bert_for_sequence_classification_1/bert/embeddings/Max:0\", shape=(), dtype=int32)) must be smaller than the embedding layer\'s input dimension (28996). The likely cause is some problem at tokenization time.] [Condition x < y did not hold element-wise:] [x (tf_bert_for_sequence_classification_1/Cast:0) = ] [[101 1996 2132...]...] [y (tf_bert_for_sequence_classification_1/bert/embeddings/Cast/x:0) = ] [28996]
	 [[{{node tf_bert_for_sequence_classification_1/bert/embeddings/assert_less/Assert/Assert}}]] [Op:__inference_train_function_96840]
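
The assertion above says that some input ID is not smaller than the embedding size of 28996. That is what one would expect when IDs come from the `bert-base-uncased` tokenizer (30522 entries) but the model is `bert-base-cased`. A small sketch of the same bounds check, with a made-up offending ID for illustration:

```python
def ids_out_of_bounds(input_ids, vocab_size):
    """Return the IDs that fall outside the embedding table."""
    return [i for row in input_ids for i in row if not 0 <= i < vocab_size]


UNCASED_VOCAB = 30522  # bert-base-uncased, the tokenizer checkpoint used above
CASED_VOCAB = 28996    # bert-base-cased, the embedding size named in the error

# 29000 is a made-up ID: valid for the uncased vocabulary, too large for the cased one.
batch = [[101, 1996, 2132, 29000, 102]]

print(ids_out_of_bounds(batch, CASED_VOCAB))
print(ids_out_of_bounds(batch, UNCASED_VOCAB))
```

Running a check like this on a tokenized batch before calling `model.fit` is one way to catch a tokenizer/model mismatch early.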

In [75]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import PolynomialDecay

batch_size = 8
num_epochs = 3
# The number of training steps is the number of batches per epoch multiplied by the
# number of epochs. tf_train_dataset is a batched tf.data.Dataset, not the original
# Hugging Face Dataset, so its len() is already the number of batches per epoch
# (459 here: 3668 samples in batches of 8, with a final partial batch).
num_train_steps = len(tf_train_dataset) * num_epochs
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)

opt = Adam(learning_rate=lr_scheduler)
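
For intuition, the schedule can be computed by hand. The sketch below assumes the default `power=1.0` of `PolynomialDecay` (a linear decay) and no cycling. `polynomial_decay` is a hand-written helper, and the sample count of 3668 comes from the MRPC train split shown earlier.

```python
import math


def polynomial_decay(step, decay_steps, initial_lr=5e-5, end_lr=0.0, power=1.0):
    """Learning rate at a given step; with power=1.0 this is a linear ramp down."""
    step = min(step, decay_steps)  # the rate stays at end_lr after decay_steps
    fraction = 1 - step / decay_steps
    return (initial_lr - end_lr) * fraction ** power + end_lr


num_samples, batch_size, num_epochs = 3668, 8, 3
steps_per_epoch = math.ceil(num_samples / batch_size)  # the last batch is partial
num_train_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, num_train_steps)
for step in (0, num_train_steps // 2, num_train_steps):
    print(step, polynomial_decay(step, num_train_steps))
```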

In [76]:
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [77]:
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)

Epoch 1/3


InvalidArgumentError: Graph execution error:

Detected at node 'tf_bert_for_sequence_classification_2/bert/embeddings/assert_less/Assert/Assert' defined at (most recent call last):
    File "C:\Users\Francisco.Colina\AppData\Local\Temp\ipykernel_21432\1411177282.py", line 1, in <module>
      model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\training.py", line 1742, in fit
      tmp_logs = self.train_function(iterator)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\training.py", line 1338, in train_function
      return step_function(self, iterator)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\training.py", line 1322, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\training.py", line 1303, in run_step
      outputs = model.train_step(data)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\modeling_tf_utils.py", line 1637, in train_step
      y_pred = self(x, training=True)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\training.py", line 569, in __call__
      return super().__call__(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\base_layer.py", line 1150, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\modeling_tf_utils.py", line 1557, in run_call_with_unpacked_inputs
      "method added in TF 2.8. If you want the original HF compute_loss, please call "
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 1569, in call
      outputs = self.bert(
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\base_layer.py", line 1150, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\modeling_tf_utils.py", line 1557, in run_call_with_unpacked_inputs
      "method added in TF 2.8. If you want the original HF compute_loss, please call "
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 780, in call
      embedding_output = self.embeddings(
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\engine\base_layer.py", line 1150, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\keras\src\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 201, in call
      if input_ids is not None:
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\models\bert\modeling_tf_bert.py", line 202, in call
      check_embeddings_within_bounds(input_ids, self.config.vocab_size)
    File "c:\Users\Francisco.Colina\Documents\Code\ArtPrompt\.venv\lib\site-packages\transformers\tf_utils.py", line 161, in check_embeddings_within_bounds
      tf.debugging.assert_less(
Node: 'tf_bert_for_sequence_classification_2/bert/embeddings/assert_less/Assert/Assert'
assertion failed: [The maximum value of input_ids (Tensor(\"tf_bert_for_sequence_classification_2/bert/embeddings/Max:0\", shape=(), dtype=int32)) must be smaller than the embedding layer\'s input dimension (28996). The likely cause is some problem at tokenization time.] [Condition x < y did not hold element-wise:] [x (tf_bert_for_sequence_classification_2/Cast:0) = ] [[101 8769 1999...]...] [y (tf_bert_for_sequence_classification_2/bert/embeddings/Cast/x:0) = ] [28996]
	 [[{{node tf_bert_for_sequence_classification_2/bert/embeddings/assert_less/Assert/Assert}}]] [Op:__inference_train_function_134109]
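
Note on the error above: the assertion says that every value in `input_ids` must be smaller than the embedding layer's input dimension (28996 here), and the message itself points at tokenization. One possible cause, stated here as an assumption rather than a confirmed diagnosis, is that the tokenizer and the model were loaded from different checkpoints, so the tokenizer emits ids the model's embedding table does not cover. The sketch below shows the bounds check in plain Python; the helper name `ids_within_vocab` and both sample batches are hypothetical and not taken from the notebook.

```python
def ids_within_vocab(batch, vocab_size):
    """Return True when every token id can index an embedding table with `vocab_size` rows."""
    # Flatten the batch so padded and unpadded sequences are handled alike.
    flat = [token_id for seq in batch for token_id in seq]
    return all(0 <= token_id < vocab_size for token_id in flat)


# 28996 is the embedding size reported in the traceback above.
VOCAB_SIZE = 28996

# Hypothetical id sequences, for illustration only.
good_batch = [[101, 8769, 1999, 102], [101, 2425, 102, 0]]
bad_batch = [[101, 29000, 102]]

print(ids_within_vocab(good_batch, VOCAB_SIZE))
print(ids_within_vocab(bad_batch, VOCAB_SIZE))
```

In the notebook, the same comparison could be made between the largest id in the tokenized dataset and `model.config.vocab_size` before calling `model.fit`.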

In [79]:
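# CamembertTokenizer is backed by SentencePiece, so the `sentencepiece` package
# must be installed (pip install sentencepiece) before this cell can run.
from transformers import CamembertTokenizer, TFCamembertForMaskedLM

# Load the tokenizer and the masked-language model from the same checkpoint.
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = TFCamembertForMaskedLM.from_pretrained("camembert-base")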

ImportError: 
CamembertTokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.
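
Note on the error above: the import fails because the `sentencepiece` package is not installed in the active environment. A minimal sketch of a pre-flight check follows, using only the standard library; the helper name `missing_packages` is hypothetical. After installing the package (for example with `pip install sentencepiece`), the kernel may need a restart, as the message says.

```python
import importlib.util


def missing_packages(names):
    """Return the entries of `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Check the dependency named in the ImportError before loading the tokenizer.
missing = missing_packages(["sentencepiece"])
if missing:
    print("Install before loading the tokenizer:", ", ".join(missing))
else:
    print("sentencepiece is available")
```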
