To feed the three pickle files (training, validation, and test) to the OPT model, load the tokenized data from each file, wrap it in a PyTorch Dataset and DataLoader, and iterate over the loaders during training and evaluation. Batching keeps per-step GPU memory manageable, which matters most with large datasets.

Below is an outline of the process, integrating the loading of pickle files into a training loop with PyTorch:

Step 1: Define a Dataset

First, you need to define a custom PyTorch dataset that can load your tokenized data from the pickle files:

In [2]:
import torch
from torch.utils.data import Dataset, DataLoader
import pickle

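class WritingPromptsDataset(Dataset):
    """Serves pre-tokenized examples stored in a pickle file."""

    def __init__(self, pickle_file):
        # Only unpickle files you created yourself: pickle can execute arbitrary code.
        with open(pickle_file, 'rb') as f:
            self.data = pickle.load(f)

    def __len__(self):
        return len(self.data['input_ids'])

    def __getitem__(self, idx):
        # torch.as_tensor also accepts values that are already tensors,
        # without the copy warning that torch.tensor raises for them.
        item = {key: torch.as_tensor(val[idx]) for key, val in self.data.items()}
        return item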


This dataset loads all of the tokenized data into memory at once, which should be fine given the batch-wise processing approach used earlier. If your dataset is extremely large, you might need a more memory-efficient loading strategy.
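
The class above indexes `self.data['input_ids']` and then `val[idx]` for every key, so it appears to expect a dict that maps each field name to a sequence of per-example values. The following is a minimal sketch under that assumption. The `PickledTokensDataset` name, the toy token IDs, and the temporary file are illustrative and not part of the original data.

```python
import os
import pickle
import tempfile

import torch
from torch.utils.data import Dataset


class PickledTokensDataset(Dataset):
    """Hypothetical stand-in that mirrors the dataset class above."""

    def __init__(self, pickle_file):
        with open(pickle_file, "rb") as f:
            self.data = pickle.load(f)

    def __len__(self):
        return len(self.data["input_ids"])

    def __getitem__(self, idx):
        return {key: torch.as_tensor(val[idx]) for key, val in self.data.items()}


# Assumed layout: one list per key, every example padded to the same length (4 here).
toy_data = {
    "input_ids": [[2, 10, 11, 1], [2, 12, 1, 1], [2, 13, 14, 15]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_tokens.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_data, f)
    # The file is read fully in __init__, so it can be deleted afterwards.
    toy_dataset = PickledTokensDataset(path)

first = toy_dataset[0]
print(len(toy_dataset), first["input_ids"].shape, first["input_ids"].dtype)
```

If your pickle files use a different layout (for example, a list of per-example dicts), `__len__` and `__getitem__` would need to change accordingly.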

Step 2: Create DataLoaders

Next, create DataLoader instances for each of your datasets. The DataLoader allows you to specify a batch size and whether to shuffle the data, among other parameters:

In [3]:
batch_size = 2  # Adjust based on your GPU memory

train_dataset = WritingPromptsDataset('data/hd/prepro/tokenized/tokenized_train_data.pkl')
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

valid_dataset = WritingPromptsDataset('data/hd/prepro/tokenized/tokenized_valid_data.pkl')
valid_loader = DataLoader(valid_dataset, batch_size=batch_size)

test_dataset = WritingPromptsDataset('data/hd/prepro/tokenized/tokenized_test_data.pkl')
test_loader = DataLoader(test_dataset, batch_size=batch_size)
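
Before training, it can help to pull a single batch and check its shape. The sketch below uses a hypothetical `ToyTokens` dataset with random token IDs rather than the real pickle files, and it assumes every example is already padded to the same length, because the default DataLoader collation stacks tensors and fails on ragged sequences.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyTokens(Dataset):
    """Hypothetical dataset: 5 examples, each already padded to length 8."""

    def __init__(self):
        self.input_ids = torch.randint(0, 100, (5, 8))
        self.attention_mask = torch.ones(5, 8, dtype=torch.long)

    def __len__(self):
        return self.input_ids.shape[0]

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
        }


loader = DataLoader(ToyTokens(), batch_size=2)

# The default collate function stacks each dict field into a (batch, seq_len) tensor.
batch = next(iter(loader))
shapes = {key: tuple(value.shape) for key, value in batch.items()}
num_batches = len(loader)  # the last batch is smaller unless drop_last=True

print(shapes, num_batches)
```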




Step 3: Training Loop

With the DataLoader set up, you can iterate over the data in your training loop. Here’s a simplified version to illustrate how it integrates with the model training:

In [4]:
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer
from transformers import OPTForCausalLM

model_name = "facebook/opt-350m"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = OPTForCausalLM.from_pretrained(model_name).to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 1

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        # Move every tensor in the batch to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}
        # For causal LM the inputs double as labels; the model shifts them internally
        outputs = model(**batch, labels=batch["input_ids"])
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        print(f"Loss: {loss.item()}")




OutOfMemoryError: CUDA out of memory. Tried to allocate 394.00 MiB. GPU 0 has a total capacity of 10.91 GiB of which 261.25 MiB is free. Including non-PyTorch memory, this process has 10.65 GiB memory in use. Of the allocated memory 10.37 GiB is allocated by PyTorch, and 122.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
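
An error like the one above means the model weights, optimizer state, and activations together did not fit on the GPU. One widely used mitigation, which is not part of the original code, is to lower `batch_size` and accumulate gradients over several batches before each optimizer step, so the effective batch size stays the same. The sketch below shows only the accumulation pattern, using a tiny stand-in `nn.Linear` model and random data; `accumulation_steps` is an illustrative value.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in model and data; in the real loop these would be the OPT model and train_loader.
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
micro_batches = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(8)]

accumulation_steps = 4  # effective batch size = 2 * 4 = 8 examples per update
optimizer_steps = 0

model.train()
optimizer.zero_grad()
for step, (features, targets) in enumerate(micro_batches, start=1):
    loss = nn.functional.mse_loss(model(features), targets)
    # Scale the loss so the accumulated gradient is an average, not a sum.
    (loss / accumulation_steps).backward()
    if step % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1

print(f"Optimizer updates: {optimizer_steps}")
```

Accumulation does not reduce the memory needed for a single example, so very long sequences may still need to be truncated at tokenization time.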

Note:

    Memory Management: Depending on the size of your data and the capacity of your GPUs, you may need to adjust the batch size to prevent out-of-memory errors.
    Model Inputs and Labels: This example assumes that the input_ids in your batch can serve both as the inputs to the model and the labels for calculating loss, which is typical for causal language models where the task is to predict the next token. Adjust this according to your specific task requirements.
    Evaluation and Testing: The example focuses on the training loop. Don't forget to implement an evaluation loop for your validation and test datasets to monitor the model's performance and generalizability on unseen data.
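
On the point about inputs and labels: when batches contain padding, using `input_ids` directly as labels also trains the model to predict pad tokens. A common adjustment is to set the label at padded positions to -100, the index that PyTorch's cross-entropy loss ignores by default and that the Transformers language-modeling loss skips. The sketch below assumes each batch carries an `attention_mask` with 0 at padded positions; `mask_padding_labels` is a hypothetical helper name and the token IDs are made up.

```python
import torch

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids and replace padded positions with IGNORE_INDEX."""
    labels = input_ids.clone()  # clone so the model inputs stay untouched
    labels[attention_mask == 0] = IGNORE_INDEX
    return labels


input_ids = torch.tensor([[2, 10, 11, 1], [2, 12, 1, 1]])
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

In the training loop this would replace `labels=batch["input_ids"]` with `labels=mask_padding_labels(batch["input_ids"], batch["attention_mask"])`, provided the pickle files include the attention mask.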

This setup should help you incorporate your tokenized data from pickle files into the model training process with PyTorch and the Transformers library.
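
For the evaluation loop mentioned in the notes, one possible shape is sketched below. To keep it runnable without downloading weights, it uses a hypothetical `TinyCausalLM` stand-in that returns an object with a `.loss` attribute, as a Transformers model does; `evaluate` is likewise an illustrative helper, not a library function. With the real setup you would pass the OPT model and `valid_loader` or `test_loader` instead.

```python
import math
from types import SimpleNamespace

import torch
from torch import nn
from torch.utils.data import DataLoader


class TinyCausalLM(nn.Module):
    """Hypothetical stand-in exposing the same `.loss` interface as a causal LM."""

    def __init__(self, vocab_size=50, hidden_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids, labels):
        logits = self.head(self.embed(input_ids))
        # Predict token t+1 from position t.
        loss = nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            labels[:, 1:].reshape(-1),
        )
        return SimpleNamespace(loss=loss, logits=logits)


def evaluate(model, loader, device):
    """Return the mean per-batch loss over a loader, without updating weights."""
    model.eval()  # disables dropout and similar training-only behavior
    total_loss, num_batches = 0.0, 0
    with torch.no_grad():  # no gradients needed, which also saves memory
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch, labels=batch["input_ids"])
            total_loss += outputs.loss.item()
            num_batches += 1
    return total_loss / max(num_batches, 1)


torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toy_model = TinyCausalLM().to(device)
toy_examples = [{"input_ids": torch.randint(0, 50, (8,))} for _ in range(6)]
toy_loader = DataLoader(toy_examples, batch_size=2)

val_loss = evaluate(toy_model, toy_loader, device)
perplexity = math.exp(val_loss)
print(f"Validation loss: {val_loss:.3f}, perplexity: {perplexity:.1f}")
```

Averaging per-batch losses is only an approximation of the per-token average when batches differ in size or padding. Remember to call `model.train()` again before resuming training.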

## Utilizing GPUs with Transformers Library

The transformers library runs on top of PyTorch or TensorFlow, both of which can use GPUs when they are available and properly configured. Here’s how to check that your setup is ready for GPU usage:

In [2]:
import torch
print(torch.cuda.is_available())  # Should return True if CUDA is properly set up
print(torch.cuda.device_count())  # Should return the number of GPUs available


True
2


Specify Device for Model and Data: To use GPUs, you need to move your models and data onto the GPU. This is usually done by specifying the device:

In [None]:
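import torch
from transformers import AutoTokenizer, OPTForCausalLM

model_name = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = OPTForCausalLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # Moves the model's parameters to the device in place

# Inputs must live on the same device as the model before the forward pass.
# The prompt here is only an illustration.
inputs = tokenizer("Once upon a time", return_tensors="pt").to(device)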

Data Parallelism for Multi-GPU Utilization: If you have multiple GPUs, you can wrap the model in torch.nn.DataParallel to split each training batch across them. Tokenization and data preparation are generally CPU-bound, so they do not benefit from this, but batched processing as described above still helps manage memory usage.

In [None]:
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)

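With this wrapper, each input batch is split across your GPUs and the outputs are gathered on the first device. The PyTorch documentation recommends torch.nn.parallel.DistributedDataParallel over DataParallel for multi-GPU training; DataParallel is simply the smallest change to a single-process script. Effective multi-GPU training also involves considerations around batch size, memory usage, and data loading that may require further tuning.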
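
One practical consequence of wrapping a model in DataParallel: when the model computes the loss inside `forward`, each GPU returns its own loss and the gathered result is typically a vector with one entry per GPU, so it has to be reduced before calling `backward()`. The sketch below illustrates this with a hypothetical `TinyLossModel` on random data; it also runs on a CPU-only or single-GPU machine, where the model is left unwrapped and `.mean()` on the scalar loss is a no-op.

```python
import torch
from torch import nn


class TinyLossModel(nn.Module):
    """Hypothetical model that returns its loss from forward, like a Transformers model."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 1)

    def forward(self, features, targets):
        return nn.functional.mse_loss(self.linear(features), targets)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyLossModel().to(device)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

features = torch.randn(8, 4, device=device)
targets = torch.randn(8, 1, device=device)

loss = model(features, targets)
loss = loss.mean()  # collapses per-GPU losses to a scalar; no-op for a single device
loss.backward()

# The original module sits under .module once it is wrapped.
base_model = model.module if isinstance(model, nn.DataParallel) else model
print(loss.dim(), base_model.linear.weight.grad is not None)
```

In the training loop above, the equivalent change would be `loss = outputs.loss.mean()`.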

By processing data in smaller chunks and ensuring your setup is configured to utilize available GPU resources, you can manage memory usage more effectively and speed up processing and training tasks.