# 🤗 Datasets

## Know your dataset

Please see the [Know your dataset tutorial](https://huggingface.co/docs/datasets/access#know-your-dataset) for 🤗 Datasets.

##### In a nutshell...

> * A `Dataset` provides fast random access to the rows, and memory-mapping so that loading even large datasets only uses a relatively small amount of device memory
> * ... for really, really big datasets that won’t even fit on disk or in memory, an `IterableDataset` allows you to access and use the dataset without waiting for it to download completely

In [1]:
import json

from datasets import load_dataset

### `Dataset`

In [2]:
# load_dataset was already imported in cell [1].

dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
type(dataset)

datasets.arrow_dataset.Dataset

#### Indexing

In [3]:
# Pretty-print the first row; a bare `dataset[0]` returns the same dict.
print(json.dumps(dataset[0], indent=2))

{
  "text": "the rock is destined to be the 21st century's new \" conan \" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .",
  "label": 1
}


In [4]:
# Negative indices count from the end, so -1 is the last row.
print(json.dumps(dataset[-1], indent=2))

{
  "text": "things really get weird , though not particularly scary : the movie is all portent and no content .",
  "label": 0
}


In [5]:
dataset.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

> Indexing order matters, especially when working with large audio and image datasets. Indexing by the column name first (`dataset["text"][0]`) returns all the values in the column and then picks the value at that position, whereas indexing by the row first (`dataset[0]["text"]`) only has to load that row. <span style="background-color:#33FFFF">For large datasets, it may be slower to index by the column name first.</span>

#### Slicing

In [6]:
# Slicing returns a dict of columns (each a list), not a list of rows.
print(json.dumps(dataset[:3], indent=2))

{
  "text": [
    "the rock is destined to be the 21st century's new \" conan \" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .",
    "the gorgeously elaborate continuation of \" the lord of the rings \" trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .",
    "effective but too-tepid biopic"
  ],
  "label": [
    1,
    1,
    1
  ]
}


In [7]:
# Rows 3, 4, and 5 (the stop index is exclusive).
print(json.dumps(dataset[3:6], indent=2))

{
  "text": [
    "if you sometimes like to go to the movies to have fun , wasabi is a good place to start .",
    "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
    "the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game ."
  ],
  "label": [
    1,
    1,
    1
  ]
}


### `IterableDataset`

> An `IterableDataset` is loaded when you set `streaming=True` in `load_dataset()`

In [8]:
# load_dataset was already imported in cell [1].

iterable_dataset = load_dataset("ethz/food101", split="train", streaming=True)

# Examples are fetched lazily while iterating; stop after the first one.
for example in iterable_dataset:
    print(example)
    break

{'image': <PIL.Image.Image image mode=RGB size=384x512 at 0x7FDDB58167C0>, 'label': 6}


In [9]:
type(iterable_dataset)

datasets.iterable_dataset.IterableDataset

In [10]:
list(iterable_dataset.take(3))

[{'image': <PIL.Image.Image image mode=RGB size=384x512>, 'label': 6},
 {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512>,
  'label': 6},
 {'image': <PIL.Image.Image image mode=RGB size=512x383>, 'label': 6}]