
# Datasets, Trainers

This notebook covers essential tools from Hugging Face, including handling datasets, custom datasets, data batching, data manipulation with Pandas, data collators, the Trainer API, and accelerators. Each section contains implementation examples and explanations.

## Datasets

Hugging Face's `datasets` library offers a quick and efficient way to access a wide array of NLP datasets.

In [1]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset(path='glue', name='mrpc')
print(dataset)

# Accessing different splits
train_dataset = dataset['train']
valid_dataset = dataset['validation']
test_dataset = dataset['test']

# Example: Print the first example from the training dataset
print(train_dataset[0])



DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0}


The output shows a `DatasetDict` object from the Hugging Face `datasets` library, typically containing several datasets organized by their split names such as `train`, `validation`, and `test`. Here's a detailed breakdown of its components and the structure:

- **DatasetDict**: This is a Python dictionary-like container that groups multiple `Dataset` objects, each corresponding to a different subset of the data, usually differentiated by the data split (e.g., training, validation, testing). This structure is particularly useful for managing datasets that inherently come with predefined splits for different stages of model training and evaluation.

1. **Keys (Splits)**:
   - `train`
   - `validation`
   - `test`

2. **Dataset**:
   - Each entry under a key is a `Dataset` object, which itself is a structured collection of data points (rows).
   - The data is stored in Apache Arrow format and memory-mapped from disk, so large datasets can be used without loading everything into RAM.

3. **Features**:
   - Each `Dataset` has a schema describing its features, which are the columns of the dataset.

4. **Num_rows**:
   - Indicates the number of rows (data points) available in each dataset split.

The `DatasetDict` structure is particularly important for:
- **Modularization**: It keeps different data splits well-organized and easily accessible through keys.
- **Scalability**: Each split is memory-mapped from its on-disk Arrow cache rather than loaded into RAM, so very large datasets remain manageable.
- **Flexibility**: It provides built-in methods for common data operations, such as `shuffle`, `filter`, and `map`, which are applied to every split at once.

### Converting to Pandas DataFrames

A `Dataset` can be converted to a Pandas DataFrame with `to_pandas()`, which is convenient for inspection and ad hoc analysis.

In [2]:
# Convert Hugging Face dataset to Pandas DataFrame
df_test = test_dataset.to_pandas()
df_test.head()

|   | sentence1 | sentence2 | label | idx |
|---|-----------|-----------|-------|-----|
| 0 | PCCW 's chief operating officer , Mike Butcher... | Current Chief Operating Officer Mike Butcher a... | 1 | 0 |
| 1 | The world 's two largest automakers said their... | Domestic sales at both GM and No. 2 Ford Motor... | 1 | 1 |
| 2 | According to the federal Centers for Disease C... | The Centers for Disease Control and Prevention... | 1 | 2 |
| 3 | A tropical storm rapidly developed in the Gulf... | A tropical storm rapidly developed in the Gulf... | 0 | 3 |
| 4 | The company didn 't detail the costs of the re... | But company officials expect the costs of the ... | 0 | 4 |


## Data Collators

In the Hugging Face ecosystem, data collators are utilities designed to handle the common preprocessing steps required to prepare batches of data for model training.

A data collator takes a list of individual examples and assembles them into a single batch in which every tensor has the same shape. What exactly happens depends on the collator; for `DataCollatorWithPadding`, used in the example below, the main steps are:

1. **Padding**: Pad input sequences to the longest sequence in the batch, or to a fixed length when `padding='max_length'` and `max_length` are set, so that all tensors in the batch have the same size.
2. **Extending Attention Masks**: Pad the `attention_mask` with zeros over the padded positions so the model ignores them, which is critical for transformer models that use attention mechanisms.
3. **Tensor Conversion**: Return the batch as framework tensors (for example PyTorch tensors with `return_tensors='pt'`).
4. **Label Renaming**: Rename a `label` column to `labels`, the argument name the models expect.

Truncation and special tokens (like BERT’s `[CLS]` and `[SEP]`) are not the collator's job: the tokenizer applies them earlier, when the text is encoded.

Data collators in Hugging Face are often used with the `DataLoader` class from PyTorch or similar batching utilities in other frameworks to automate the process of generating batches from a dataset.
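To make the padding step concrete, here is a minimal pure-Python sketch of what a padding collator does to one batch. It is an illustration only, not the `transformers` implementation: the function name `pad_batch` is hypothetical, and the default `pad_id=0` assumes the `[PAD]` ID of `bert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-ID lists to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep mask value 1; padded positions get 0
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two already-tokenized sequences of different lengths (illustrative IDs)
batch = pad_batch([[101, 7592, 2088, 102], [101, 2460, 102]])
print(batch["input_ids"])       # [[101, 7592, 2088, 102], [101, 2460, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Because the target length is computed per batch, a batch of short sequences stays short instead of being padded to a global maximum.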

Consider a dataset of text sequences of varying lengths that we want to batch for training using dynamic padding. The example shows how `DataCollatorWithPadding` pads each batch only to the length of its longest sequence, which keeps the padded length small and reduces wasted computation and GPU memory.

In [3]:
from transformers import AutoTokenizer, DataCollatorWithPadding
from torch.utils.data import DataLoader, Dataset
import torch

# Define a tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Sample text data of varying lengths
texts = [
    "Hello world",
    "Hugging Face is revolutionizing AI.",
    "They are making natural language processing accessible to developers.",
    "Here is a very long sentence, purely created to demonstrate how dynamic padding works within the context of training deep learning models.",
    "Short!"
]

# Tokenize the data properly, ensuring all necessary fields are included
tokenized_texts = [tokenizer(text, add_special_tokens=True, return_attention_mask=True, truncation=True, padding=False) for text in texts]

# Define the dataset class
class TextDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, idx):
        # Directly return the dictionary of inputs for each item
        return {key: torch.tensor(self.encodings[idx][key]) for key in self.encodings[idx]}

# Initialize the dataset
dataset = TextDataset(tokenized_texts)

# Initialize the data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors='pt')

# Create a DataLoader
data_loader = DataLoader(dataset, batch_size=2, collate_fn=data_collator)

# Iterate through the DataLoader
for batch in data_loader:
    print("Batch Input IDs shape:", batch['input_ids'].shape)
    print("Batch Attention Mask shape:", batch['attention_mask'].shape)



Batch Input IDs shape: torch.Size([2, 9])
Batch Attention Mask shape: torch.Size([2, 9])
Batch Input IDs shape: torch.Size([2, 27])
Batch Attention Mask shape: torch.Size([2, 27])
Batch Input IDs shape: torch.Size([1, 4])
Batch Attention Mask shape: torch.Size([1, 4])


- **Tokenization**: Each text is tokenized using a pre-trained tokenizer. This typically involves converting texts into a sequence of token IDs.
- **Data Collator Initialization**: `DataCollatorWithPadding` is initialized with the same tokenizer to ensure that the padding uses the tokenizer’s padding token ID.
- **DataLoader**: The `DataLoader` is set up with the `data_collator` as its `collate_fn`, which means that each batch produced by the DataLoader will be processed by the data collator. This process involves dynamically padding the sequences so that all sequences in a batch are the same length.
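- **Batch Processing**: When iterating over the DataLoader, each batch is automatically padded to the length of the longest sequence in that batch, and the attention masks are extended with zeros over the padded positions. The printed shapes confirm this: the three batches are padded to 9, 27, and 4 tokens respectively.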
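The saving from dynamic padding can be estimated with a few lines of arithmetic. The sketch below counts how many token positions reach the model under per-batch padding versus padding everything to a fixed length. The helper `padded_tokens`, the fixed length of 128, and the third sentence's length of 12 are assumptions made for illustration; the other lengths are chosen to be consistent with the batch shapes printed above.

```python
def padded_tokens(lengths, batch_size, fixed_length=None):
    """Count token positions fed to the model after padding, batch by batch."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        # Dynamic padding uses the longest sequence in the batch as the target
        target = fixed_length if fixed_length is not None else max(chunk)
        total += target * len(chunk)
    return total


lengths = [4, 9, 12, 27, 4]  # token counts per sentence (12 is hypothetical)
dynamic = padded_tokens(lengths, batch_size=2)
fixed = padded_tokens(lengths, batch_size=2, fixed_length=128)
print(dynamic, fixed)  # 76 640
```

The dynamic total, 76, equals 2×9 + 2×27 + 1×4, the sizes of the three batches shown in the output.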

## Custom Datasets

Creating custom datasets involves defining a dataset structure that Hugging Face's `datasets` library can process. Custom datasets can be created from various sources, such as Pandas dataframes, using the `from_pandas` method.

In [4]:
from datasets import Dataset
import pandas as pd

data = {'text': ['Hello, world!', 'Hugging Face is amazing'], 'label': [0, 1]}
df = pd.DataFrame(data)

# Convert a pandas dataframe to a Hugging Face dataset
custom_dataset = Dataset.from_pandas(df)
print(custom_dataset)

Dataset({
    features: ['text', 'label'],
    num_rows: 2
})


## The Trainer API

The `Trainer` API in Hugging Face's Transformers library is designed to abstract away much of the boilerplate code typically required for training transformer models. It provides a convenient interface for fitting models on datasets and handles the training loop, device placement (CPU or GPU), logging, and checkpointing.

1. **Abstraction of Training Loops**: The `Trainer` handles the intricacies of the training loops internally. Users don't need to manually code the steps for each epoch, batch processing, gradient calculation, or updates.

2. **Automatic Device Management**: It automatically moves models, inputs, and outputs to the appropriate devices (CPU or GPU), based on availability and configuration.

3. **Built-In Training Features**: Includes support for mixed precision training, multi-GPU training, and even TPU support, simplifying the process of scaling up training to larger models and datasets.

4. **Metrics and Logging**: Integrates seamlessly with TensorBoard and other logging frameworks to monitor training progress. This includes automatic logging of loss, evaluation metrics, and other useful training information.

5. **Pre-Configured Optimizations**: Comes with pre-configured optimizers and learning rate schedules, which can be customized easily through its flexible argument structure.
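By default, evaluation reports only the loss; additional metrics are supplied through a `compute_metrics` callable passed to `Trainer`. The following is a minimal, dependency-free sketch of an accuracy metric. The name `compute_accuracy` is hypothetical, and the function assumes it receives a `(logits, labels)` pair, which is how `Trainer` passes predictions to `compute_metrics`.

```python
def compute_accuracy(eval_pred):
    """Return accuracy for a (logits, labels) pair."""
    logits, labels = eval_pred
    correct = 0
    for row, label in zip(logits, labels):
        # Predicted class = index of the largest logit in the row
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += int(predicted == label)
    return {"accuracy": correct / len(labels)}


# Toy check: two-class logits for three examples, two of which are correct
print(compute_accuracy(([[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]], [1, 0, 0])))
```

It would be wired in as `Trainer(..., compute_metrics=compute_accuracy)`, after which the metric appears alongside the validation loss.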

**Main and Critical Arguments of `TrainingArguments`**

- **output_dir**: Where to store the output files (e.g., model checkpoints).
- **num_train_epochs**: Total number of training epochs.
- **per_device_train_batch_size** and **per_device_eval_batch_size**: Batch sizes for training and evaluation, respectively, allowing for different settings during training and evaluation phases.
- **warmup_steps**: Number of initial steps over which the learning rate ramps up from 0 to its configured value, which helps stabilize the early phase of training.
- **weight_decay**: Weight decay coefficient applied by the optimizer, a regularizer that helps reduce overfitting.
- **logging_dir**: Directory for storing logs, useful for tracking the training process and for later analysis with tools like TensorBoard.
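To see what `warmup_steps` does to the learning rate, here is a small sketch of a linear warmup followed by linear decay, the schedule shape `Trainer` uses by default. The function `linear_warmup_lr` is a hypothetical helper written for this illustration, and the step counts and base learning rate are example values.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Example values: 500 warmup steps out of 1500 total, base rate 5e-5
for step in (0, 250, 500, 1000, 1500):
    print(step, linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=1500))
```

The rate starts at zero, reaches its peak exactly when warmup ends, and then falls back to zero at the final step.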

Let's create an example using the BERT model for a sentiment classification task on the IMDB dataset. This will illustrate how the `Trainer` API facilitates model training:

In [6]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

# Load the dataset
dataset = load_dataset('imdb')

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Preprocess the text data
def preprocess_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)

# Tokenize the dataset using the map function
tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Load the model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    evaluation_strategy="epoch"  # renamed to eval_strategy in newer transformers releases
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

# Run training
trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.3014 | 0.270388 |




TrainOutput(global_step=1563, training_loss=0.3665847448835904, metrics={'train_runtime': 583.5635, 'train_samples_per_second': 42.84, 'train_steps_per_second': 2.678, 'total_flos': 1644444096000000.0, 'train_loss': 0.3665847448835904, 'epoch': 1.0})

- **Dataset Loading and Tokenization**: The IMDB dataset is loaded and tokenized using the BERT tokenizer. The preprocessing step pads or truncates every input to a uniform length of 128 tokens, so examples can be stacked into batches directly.

- **Model Setup**: A pre-trained BERT model is loaded and configured for sequence classification with two labels (positive and negative sentiments).

- **TrainingArguments Configuration**: Critical training parameters are set up, including the directory for outputs, batch size, warmup steps, and the logging directory.

- **Trainer Initialization and Execution**: The `Trainer` is set up with the model, training arguments, and the datasets, and the model is trained using a simple call to `trainer.train()`.
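The reported `global_step=1563` can be sanity-checked with simple arithmetic. The IMDB training split has 25,000 reviews and the per-device batch size is 8, so the step count depends on how many devices shared each step. The helper `optimizer_steps` below is hypothetical, and the two-device reading is an inference from the numbers, not something the notebook states.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, num_devices=1, epochs=1):
    """Number of optimizer steps, assuming no gradient accumulation."""
    # The last, possibly smaller, batch still counts as a step
    steps_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return steps_per_epoch * epochs


print(optimizer_steps(25_000, 8, num_devices=1))  # 3125
print(optimizer_steps(25_000, 8, num_devices=2))  # 1563
```

A single device would give 3125 steps, so the logged value is consistent with a run on two devices and an effective batch size of 16.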


## What Are Accelerators?

In the context of machine learning, particularly with the Hugging Face ecosystem, "accelerators" refer to tools that simplify and optimize the deployment of computation across different hardware configurations, such as CPUs, GPUs, and TPUs. The `Accelerator` class from Hugging Face's `accelerate` library is designed to abstract away most of the boilerplate code associated with device placement and parallel execution.

The `Accelerator` class handles:
1. **Device Placement**: It moves your models and data to the appropriate device (CPU, GPU, TPU).
2. **Distributed Computing**: It simplifies the setup for running models on multiple GPUs or TPUs with minimal changes to the codebase.
3. **Mixed Precision**: When enabled (for example with `Accelerator(mixed_precision='fp16')`), it applies automatic mixed precision to make training faster and more memory efficient, typically with little or no loss of accuracy.
4. **Configuration-Driven Setup**: It reads the launch configuration (for example from `accelerate config` and `accelerate launch`), so the same script adapts to the available hardware without code changes.

By using `Accelerator`, developers can:
- Write hardware-agnostic code: Write once and run on any configuration (single GPU, multi-GPU, TPU, etc.).
- Easily implement distributed training: Simplify the complexity typically associated with scaling up training to multiple devices.
- Improve performance: Utilize optimizations like mixed precision automatically.
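One way to picture distributed training is that each process receives a different share of the batches. The sketch below is a simplified model of that idea, not the `accelerate` implementation: the helper `shard_batches` is hypothetical, and it ignores details such as how leftover batches are evened out across processes.

```python
def shard_batches(batches, num_processes, process_index):
    """Give each process every num_processes-th batch, starting at its own index."""
    return [b for i, b in enumerate(batches) if i % num_processes == process_index]


batches = [[0, 1], [2, 3], [4, 5], [6, 7]]  # four batches of example indices
for rank in range(2):
    print(rank, shard_batches(batches, num_processes=2, process_index=rank))
```

With two processes, each optimizer step consumes two batches at once, so the effective batch size is the per-process batch size multiplied by the number of processes.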

Now let's create an example that uses the `Accelerator` tool to train a model on the IMDB dataset for sentiment classification.

In [14]:
from transformers import BertTokenizer, BertForSequenceClassification, DataCollatorWithPadding
from datasets import load_dataset
import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW
from accelerate import Accelerator

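# Load the first 10% of the training split.
# Caveat: the IMDB train split appears to be ordered by label, so this slice
# likely contains a single class; shuffle before slicing for a meaningful run.
dataset = load_dataset('imdb', split='train[:10%]')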

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenization function
def preprocess_function(examples):
    # Truncate only; DataCollatorWithPadding pads each batch dynamically,
    # so padding to max_length here would be redundant
    return tokenizer(examples['text'], truncation=True, max_length=512)

# Apply preprocessing to tokenize dataset
tokenized_data = dataset.map(preprocess_function, remove_columns=['text'], batched=True)

# create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# DataLoader that shuffles the examples and pads each batch via the collator
data_loader = DataLoader(tokenized_data, batch_size=2, shuffle=True, collate_fn=data_collator)

# Load BERT for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Define the optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Initialize Accelerator
accelerator = Accelerator()
model, optimizer, data_loader = accelerator.prepare(model, optimizer, data_loader)

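# Training loop
model.train()  # from_pretrained returns the model in eval mode; enable dropout for training
for epoch in range(1):  # One epoch for demonstration
    for i, batch in enumerate(data_loader):
        # The prepared DataLoader already places each batch on the right device
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)  # used in place of loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if i % 20 == 0:
            print(f"Loss: {loss.item()}")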


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loss: 0.8140441179275513
Loss: 0.014169598929584026
Loss: 0.0032063524704426527
Loss: 0.0017942762933671474
Loss: 0.0013011569390073419
Loss: 0.0010204471182078123
Loss: 0.0008431096794083714
Loss: 0.0006774751236662269
Loss: 0.00046975305303931236
Loss: 0.00035428686533123255
Loss: 0.00030852516647428274
Loss: 0.00023880728986114264
Loss: 0.00020591284555848688
Loss: 0.00017510310863144696
Loss: 0.00014232576359063387
Loss: 0.00013135997869540006
Loss: 0.00010662701970431954
Loss: 9.25617350731045e-05
Loss: 9.101210162043571e-05
Loss: 7.599303353345022e-05
Loss: 7.396662840619683e-05
Loss: 7.104620453901589e-05
Loss: 6.329812458716333e-05
Loss: 5.829164365422912e-05
Loss: 5.721882189391181e-05
Loss: 5.358314228942618e-05
Loss: 5.096067616250366e-05
Loss: 4.869580880040303e-05
Loss: 4.708655978902243e-05
Loss: 4.4404474465409294e-05
Loss: 4.1722378227859735e-05
Loss: 4.404685751069337e-05
Loss: 3.8980677345534787e-05
Loss: 3.832505171885714e-05
Loss: 3.7252208130666986e-05
Loss: 3.5642

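- **Dataset Loading and Processing**: Loads the first 10% of the IMDB training split and processes it with a tokenizer to make it suitable for BERT. Because that split appears to be ordered by label, the slice is likely single-class, which would explain why the loss above collapses toward zero so quickly; shuffle before slicing for a meaningful experiment.
- **Model and Optimizer Setup**: A BERT model is initialized along with an AdamW optimizer.
- **Accelerator Integration**: The model, optimizer, and data loader are passed through `accelerator.prepare()`. This step is crucial as it places the model and batches on the appropriate hardware and applies any configured optimizations (such as mixed precision, which is off unless enabled).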
- **Training Loop**: The model is trained with a simple loop, using `Accelerator` to handle device-specific operations and optimizations transparently.