In [11]:
!nvidia-smi

Sat May 11 07:59:54 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   50C    P8              2W /  150W |      10MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## datasets

A `Dataset` object wraps an Apache Arrow table, which allows fast zero-copy reads from the dataset's arrays into PyTorch tensors.

In [17]:
from datasets import Dataset

data = [[1, 2], [3, 4]]

ds = Dataset.from_dict({ "data": data })

To load a `Dataset` onto the CUDA device, pass `device` to `with_format`:

In [18]:
import torch

has_cuda = torch.cuda.is_available()

if has_cuda:
  print("CUDA is available")
else:
  print("CUDA is not available")

device = torch.device("cuda" if has_cuda else "cpu")
torch_ds = ds.with_format("torch", device=device)

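torch_ds[0]  # index the formatted view; `ds` itself still returns plain Python lists

CUDA is available


{'data': tensor([1, 2], device='cuda:0')}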

### Data loading

Like `torch.utils.data.Dataset` objects, a `Dataset` can be passed directly to a PyTorch `DataLoader`:

https://huggingface.co/docs/datasets/en/use_with_pytorch#data-loading

In [19]:
import numpy as np
from datasets import Dataset
from torch.utils.data import DataLoader

data = np.random.rand(16)
label = np.random.randint(0, 2, size=16)

ds = Dataset.from_dict({"data": data, "label": label}).with_format("torch")
dataloader = DataLoader(ds, batch_size=4)

for batch in dataloader:
    print(batch)

{'data': tensor([0.3826, 0.7325, 0.6480, 0.7994]), 'label': tensor([1, 1, 0, 1])}
{'data': tensor([0.1497, 0.1252, 0.4785, 0.6125]), 'label': tensor([0, 1, 0, 0])}
{'data': tensor([0.9730, 0.2870, 0.8643, 0.6671]), 'label': tensor([1, 1, 1, 0])}
{'data': tensor([0.3033, 0.2434, 0.3117, 0.9448]), 'label': tensor([1, 0, 1, 1])}


### Load text data

Text files are one of the most common file types for storing a dataset. By default, 🤗 Datasets reads a text file line by line, so each line becomes one example in the dataset.

https://huggingface.co/docs/datasets/en/nlp_load

In [10]:
from datasets import load_dataset

dataset = load_dataset("text", data_files={
  "train": ["/data/000_sample_data.txt"],
  "test": ["/data/000_sample_data.txt"]
})

Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 9457.28it/s]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 2200.58it/s]
Generating train split: 10 examples [00:00, 7688.92 examples/s]
Generating test split: 10 examples [00:00, 9504.43 examples/s]
