## Know your datasets

- `Dataset`: Typically used when your data fits in memory and you can load all of it at once.
- `IterableDataset`: Designed for cases where your dataset is too large to fit entirely in memory and you need to load data dynamically, for example from files or data streams.
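The two access patterns can be sketched without the library. The classes below are hypothetical stand-ins, not the Hugging Face implementations: `ToyMapDataset` supports `len()` and random access by index, while `ToyIterableDataset` can only be walked from the start.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]


class ToyMapDataset:
    """Hypothetical stand-in for a map-style dataset: all rows are addressable."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, index: int) -> Row:
        # Random access: any row can be fetched directly by position.
        return self._rows[index]


class ToyIterableDataset:
    """Hypothetical stand-in for an iterable dataset: rows are produced on demand."""

    def __init__(self, make_rows: Callable[[], Iterator[Row]]) -> None:
        # Store a factory so each iteration restarts the stream from the beginning.
        self._make_rows = make_rows

    def __iter__(self) -> Iterator[Row]:
        return self._make_rows()


def stream_rows() -> Iterator[Row]:
    # Stands in for reading examples one by one from files or a network stream.
    for i in range(1_000_000):
        yield {"text": f"review {i}", "label": i % 2}


map_ds = ToyMapDataset([{"text": "good", "label": 1}, {"text": "bad", "label": 0}])
iter_ds = ToyIterableDataset(stream_rows)

last_row = map_ds[len(map_ds) - 1]    # direct jump to the last row
first_streamed = next(iter(iter_ds))  # only the first row is ever generated
```

The iterable version never builds the million rows in memory. The cost is that there is no `len()` and no way to jump to row `i` without walking the stream.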

### Dataset

In [1]:
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes", split="train")


In [2]:
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

#### Indexing

In [3]:
# indexing by row position
dataset[0]

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}

In [5]:
# indexing by the column name
dataset['text'][0]

'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'

In [11]:
%%time

dataset[0]['text']

CPU times: user 123 µs, sys: 15 µs, total: 138 µs
Wall time: 129 µs


'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'

In [12]:
%%time

dataset['text'][0]

CPU times: user 4.27 ms, sys: 37 µs, total: 4.31 ms
Wall time: 4.44 ms


'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
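One plausible reading of the two timings above: `dataset[0]['text']` fetches a single row and then picks a field, whereas `dataset['text'][0]` first materializes the whole `text` column and only then picks one element. The toy columnar table below is a hypothetical illustration of that difference, not the library's code. It counts how many values each access path touches. Whether column access is actually eager depends on the library version, so treat this as an assumption.

```python
from typing import Dict, List


class ToyColumnarTable:
    """Hypothetical columnar store that counts how many values it materializes."""

    def __init__(self, columns: Dict[str, List[object]]) -> None:
        self._columns = columns
        self.values_materialized = 0

    def row(self, index: int) -> Dict[str, object]:
        # Row access touches one value per column.
        self.values_materialized += len(self._columns)
        return {name: values[index] for name, values in self._columns.items()}

    def column(self, name: str) -> List[object]:
        # Eager column access copies every value in the column.
        self.values_materialized += len(self._columns[name])
        return list(self._columns[name])


n_rows = 8530  # same row count as the train split shown above
table = ToyColumnarTable({
    "text": [f"review {i}" for i in range(n_rows)],
    "label": [i % 2 for i in range(n_rows)],
})

row_first = table.row(0)["text"]
cost_row_first = table.values_materialized  # 2: one value per column

table.values_materialized = 0
column_first = table.column("text")[0]
cost_column_first = table.values_materialized  # 8530: the whole column

print(cost_row_first, cost_column_first)
```

Both paths return the same string, but the column-first path does work proportional to the number of rows. When only a few examples are needed, indexing the row first is the cheaper habit.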

#### Slicing

In [13]:
dataset[:3]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'effective but too-tepid biopic'],
 'label': [1, 1, 1]}
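As the output shows, slicing returns a dictionary of columns rather than a list of rows. If per-example dictionaries are more convenient, the batch can be transposed in plain Python. `columns_to_rows` is a hypothetical helper name, and `batch` below imitates the shape of the output above with the first two texts shortened.

```python
from typing import Any, Dict, List


def columns_to_rows(batch: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Transpose {'col': [v0, v1, ...]} into [{'col': v0}, {'col': v1}, ...]."""
    names = list(batch)
    # zip(*columns) walks the columns in lockstep, yielding one tuple per row.
    return [dict(zip(names, values)) for values in zip(*batch.values())]


# Same shape as what dataset[:3] returns (first two texts shortened here).
batch = {
    "text": [
        "the rock is destined ...",
        "the gorgeously elaborate ...",
        "effective but too-tepid biopic",
    ],
    "label": [1, 1, 1],
}

rows = columns_to_rows(batch)
print(rows[2])
```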

### IterableDataset

In [14]:
from datasets import load_dataset

iterable_dataset = load_dataset("rotten_tomatoes", split="train", streaming=True)

for example in iterable_dataset:
    print(example)
    break

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}


You can also create an `IterableDataset` from an existing `Dataset`:

In [15]:
dataset = load_dataset("rotten_tomatoes", split="train")
iterable_dataset = dataset.to_iterable_dataset()

In [16]:
next(iter(iterable_dataset))

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}

To get a subset with a fixed number of examples, use `IterableDataset.take()`:

In [17]:
list(iterable_dataset.take(3))

[{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'label': 1},
 {'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'label': 1},
 {'text': 'effective but too-tepid biopic', 'label': 1}]

Unlike slicing a `Dataset`, which returns a dictionary of columns right away, `IterableDataset.take()` returns a new `IterableDataset`.
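The practical consequence is laziness: a slice is computed immediately, while `take()` only describes a shorter stream, and nothing is read until that stream is iterated. The sketch below imitates this with `itertools.islice` over a generator. `toy_take` and `counting_stream` are hypothetical names, and this is not how the library implements `take()`.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

# Mutable counter recording how many examples the source actually produced.
counter = {"read": 0}


def counting_stream() -> Iterator[Dict[str, object]]:
    """Hypothetical source that records every example it yields."""
    for i in range(8530):
        counter["read"] += 1
        yield {"text": f"review {i}", "label": 1}


def toy_take(stream: Iterable[Dict[str, object]], n: int) -> Iterator[Dict[str, object]]:
    # islice wraps the stream lazily; no example is pulled until iteration starts.
    return islice(stream, n)


head = toy_take(counting_stream(), 3)
read_before = counter["read"]  # 0: building the shorter stream read nothing

first_three: List[Dict[str, object]] = list(head)
read_after = counter["read"]   # 3: only the requested examples were produced
```

This is why the notebook wraps `iterable_dataset.take(3)` in `list(...)`: the wrapping call is what actually pulls the three examples.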