# Training Hugging Face Models

Up to this point, we've used data and models from the Hugging Face hub unmodified. In this section, we will fine-tune a pretrained Hugging Face model through transfer learning, using Hugging Face data sets, tokenizers, and pretrained models.

First, install the Hugging Face `transformers` library if needed. The Hugging Face `datasets` library is also required.
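
The installation cell is not shown here. As a minimal sketch, the check below reports which packages are still missing from the current environment. The package list is an assumption based on the imports used later in this section, and `missing_packages` is a hypothetical helper, not part of any library.

```python
import importlib.util

# Packages imported later in this section (assumed list).
REQUIRED = ["transformers", "datasets", "tensorflow"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_packages(REQUIRED)
if missing:
    # In a notebook, the equivalent is a cell such as: !pip install <names>
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```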


Next, we load the emotion data set from the Hugging Face hub. Emotion is a data set of English Twitter messages, each labeled with one of six basic emotions: anger, fear, joy, love, sadness, and surprise.

In [33]:
# HIDE OUTPUT
from datasets import load_dataset

emotions = load_dataset("emotion")


Using the latest cached version of the module from C:\Users\yoda\.cache\huggingface\modules\datasets_modules\datasets\emotion\cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd (last modified on Tue Apr 18 14:48:41 2023) since it couldn't be found locally at emotion., or remotely on the Hugging Face Hub.
No config specified, defaulting to: emotion/split
Found cached dataset emotion (C:/Users/yoda/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd)



You can see a single observation from the training data set here. This observation includes both the text sample and the assigned emotion label. The label is a numeric index representing the assigned emotion.

In [34]:
emotions['train'][2]

{'text': 'im grabbing a minute to post i feel greedy wrong', 'label': 3}

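We can display the label names in index order. The position of each name in the list is its numeric label, so the label 3 in the sample above corresponds to anger.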

In [35]:
emotions['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}
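
To make the index-to-name mapping concrete, here is a small sketch that uses the list of names copied from the output above. The helper `label_to_name` is hypothetical. On the loaded data set, `emotions['train'].features['label'].int2str(3)` should give the same result.

```python
# Label names copied from the ClassLabel output above, in index order.
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def label_to_name(index):
    """Map a numeric label index to its emotion name."""
    return label_names[index]


# The sample shown earlier carries label 3.
print(label_to_name(3))  # anger
```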

Next, we use Hugging Face tokenizers and data sets together. The following code tokenizes the entire emotion data set, converting each text into subword tokens that a transformer can use for either inference or training.

In [36]:
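# HIDE OUTPUT
from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)


def tokenize(rows):
    # Pad every text to the model's maximum length and truncate longer ones.
    return tokenizer(rows['text'], padding="max_length", truncation=True)


# Reset any custom output format so map() receives plain Python objects.
emotions.set_format(type=None)

# batched=True passes groups of rows to tokenize() at once.
tokenized_datasets = emotions.map(tokenize, batched=True)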


Map:   0%|          | 0/16000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]
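
The `padding="max_length"` and `truncation=True` arguments give every row the same length. The following toy function sketches the idea. It is not the tokenizer's implementation: the token ids and the pad id of 0 are made up for illustration, and a real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy version of padding="max_length" combined with truncation=True.

    Returns fixed-length input_ids plus an attention_mask that marks
    real tokens with 1 and padding with 0.
    """
    kept = token_ids[:max_length]        # truncation
    n_pad = max_length - len(kept)       # how much padding is needed
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Hypothetical token ids, chosen only for illustration.
short_row = pad_or_truncate([101, 7592, 102], max_length=5)
long_row = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_row)
print(long_row)
```

The attention mask is what lets the model ignore the padded positions.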

We will use the Hugging Face **DefaultDataCollator** to batch the tokenized emotion data into TensorFlow tensors that we can use to fine-tune a neural network.

In [37]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")
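
Conceptually, a collator takes a list of individual rows and combines them into one batch. The sketch below shows only that regrouping step with plain Python lists. The `collate` function and the sample rows are hypothetical, and the real collator additionally converts each column to a tensor.

```python
def collate(rows):
    """Toy collator: turn a list of row dicts into one dict of columns."""
    keys = rows[0].keys()
    return {key: [row[key] for row in rows] for key in keys}


# Two made-up, already-tokenized rows.
rows = [
    {"input_ids": [101, 7592, 102, 0], "attention_mask": [1, 1, 1, 0], "label": 3},
    {"input_ids": [101, 2054, 2003, 102], "attention_mask": [1, 1, 1, 1], "label": 1},
]
batch = collate(rows)
print(batch["label"])
print(batch["attention_mask"])
```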

Now we generate shuffled training and evaluation data sets. Despite the `small_` prefix in the variable names, the code uses the full training and test splits.

In [38]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42)
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42)


In [39]:
small_train_dataset

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 16000
})

We can now generate the TensorFlow data sets. We specify which columns map to the input features and which column holds the labels. The training data set is built with `shuffle=True`; the validation data set does not need shuffling, so it uses `shuffle=False`.

In [40]:
tf_train_dataset = small_train_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["label"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = small_eval_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["label"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)
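
With a batch size of 8, the number of batches per epoch follows directly from the row counts reported earlier (16,000 training rows and 2,000 test rows). A quick sketch of that arithmetic, using a hypothetical helper:

```python
import math


def batches_per_epoch(num_rows, batch_size):
    """Batches needed to cover num_rows once (the last batch may be partial)."""
    return math.ceil(num_rows / batch_size)


print(batches_per_epoch(16000, 8))  # 2000
print(batches_per_epoch(2000, 8))   # 250
```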


We will now load the DistilBERT model for sequence classification, with six output labels, one per emotion. Fine-tuning will adjust the pretrained weights to predict the emotion of each text.

In [42]:
# HIDE OUTPUT
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=6)


We now train the neural network. Because the network is already pretrained, we use a small learning rate.

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

model.fit(tf_train_dataset, validation_data=tf_validation_dataset,
          epochs=5)
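
The loss is configured with `from_logits=True` because the classification head outputs raw scores rather than probabilities. As a rough sketch of what this loss computes for a single example, the plain-Python version below applies a softmax to the logits and takes the negative log probability of the true class. The function names and logit values are hypothetical, and this is not the Keras implementation.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example, given raw logits and an integer label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def predicted_class(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical logits for the six emotion classes.
untrained = [0.0] * 6                        # no preference between classes
confident = [0.1, 0.2, 0.0, 4.0, 0.3, 0.1]   # strongly favors index 3

print(sparse_ce_from_logits(untrained, 3))
print(sparse_ce_from_logits(confident, 3))
print(predicted_class(confident))
```

When all six logits are equal, the loss is ln(6), about 1.79, which is a useful reference point for the first training steps of a six-class problem.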
