# Datasets

To train any machine learning model, we need data. Hugging Face has a great Python library for working with datasets.

The ```datasets``` Python library is efficient at loading and preprocessing data, the two main operations we want to perform.

There are thousands of publicly available datasets on the [hub](link-to-the-hub.com), mainly divided into NLP, Vision, Audio, and Tabular categories. Depending on the task, you may need additional preprocessing steps to prepare the data for training, but don't worry: the ```datasets``` package makes working with data simple.



In [1]:
!pip install datasets==3.5.0


## Inspect datasets

As mentioned above, ```datasets``` is great for loading data. We can load any dataset from the hub with a single line of code!

But before we download a dataset, which may be several GB in size, it's often helpful to get information about it, such as its description, features, labels, splits, and size.

All this information is stored in a ```DatasetInfo``` object. Below is a snapshot of all available attributes:

```python
DatasetInfo(
    description: str = <factory>,
    citation: str = <factory>,
    homepage: str = <factory>,
    license: str = <factory>,
    features: Optional[datasets.features.features.Features] = None,
    post_processed: Optional[datasets.info.PostProcessedInfo] = None,
    supervised_keys: Optional[datasets.info.SupervisedKeysData] = None,
    task_templates: Optional[List[datasets.tasks.base.TaskTemplate]] = None,
    builder_name: Optional[str] = None,
    dataset_name: Optional[str] = None,
    config_name: Optional[str] = None,
    version: Union[str, datasets.utils.version.Version, NoneType] = None,
    splits: Optional[dict] = None,
    download_checksums: Optional[dict] = None,
    download_size: Optional[int] = None,
    post_processing_size: Optional[int] = None,
    dataset_size: Optional[int] = None,
    size_in_bytes: Optional[int] = None,
) -> None
```

To get information about a dataset on the hub, we use the ```load_dataset_builder``` function.

In [2]:
from datasets import load_dataset_builder

Let's start by inspecting the MMLU dataset hosted at ```lighteval/mmlu```, a well-known benchmark of multiple-choice questions used to evaluate LLMs.

In [3]:
ds_builder = load_dataset_builder('lighteval/mmlu', name='all')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/39.7k [00:00<?, ?B/s]

mmlu.py:   0%|          | 0.00/5.76k [00:00<?, ?B/s]

In [4]:
ds_builder.info

DatasetInfo(description='This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge, covering 57 tasks including elementary mathematics, US history, computer science, law, and more.\n', citation='@article{hendryckstest2021,\n      title={Measuring Massive Multitask Language Understanding},\n      author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},\n      journal={Proceedings of the International Conference on Learning Representations (ICLR)},\n      year={2021}\n    }\n', homepage='https://github.com/hendrycks/test', license='', features={'question': Value(dtype='string', id=None), 'subject': Value(dtype='string', id=None), 'choices': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'answer': ClassLabel(names=['A', 'B', 'C', 'D'], id=None)}, post_processed=None, supervised_keys=None, builder_name='parquet', dataset_name='mmlu', config_name='all

In [5]:
ds_builder.info.description

'This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge, covering 57 tasks including elementary mathematics, US history, computer science, law, and more.\n'

The `features` attribute describes what information is available and how it is structured.

In the example below, the `question` field is a string, whereas the `answer` field is a class label with four possible values `[A, B, C, D]`. This tells us what kind of data we are dealing with before we download the dataset.

In [6]:
ds_builder.info.features

{'question': Value(dtype='string', id=None),
 'subject': Value(dtype='string', id=None),
 'choices': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'answer': ClassLabel(names=['A', 'B', 'C', 'D'], id=None)}

In [7]:
ds_builder.info.download_size # size in bytes

166184960
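A raw byte count is hard to read at a glance. As a small convenience, a helper like the hypothetical `format_size` below (not part of `datasets`) converts it to binary units:

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"


# The download size reported above for lighteval/mmlu.
print(format_size(166184960))  # 158.5 MiB
```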

Most datasets already have predefined splits for training, validation, and testing. We can take a look at the size of each split as follows:

In [8]:
for split_info in ds_builder.info.splits.values():
    print(split_info)

SplitInfo(name='auxiliary_train', num_bytes=161000625, num_examples=99842, shard_lengths=None, dataset_name='mmlu')
SplitInfo(name='test', num_bytes=6967453, num_examples=14042, shard_lengths=None, dataset_name='mmlu')
SplitInfo(name='validation', num_bytes=763484, num_examples=1531, shard_lengths=None, dataset_name='mmlu')
SplitInfo(name='dev', num_bytes=125353, num_examples=285, shard_lengths=None, dataset_name='mmlu')


Let's try a vision dataset now. The `google-research-datasets/conceptual_captions` dataset consists of ~3.3M images annotated with captions. We can use it to train an image-to-text model that generates a caption from an image.

In [9]:
ds_builder = load_dataset_builder('google-research-datasets/conceptual_captions')

README.md:   0%|          | 0.00/14.2k [00:00<?, ?B/s]

In [10]:
ds_builder.info.features

{'image_url': Value(dtype='string', id=None),
 'caption': Value(dtype='string', id=None)}

In [11]:
ds_builder.info.download_size # size in bytes

375258708

In [12]:
for split_info in ds_builder.info.splits.values():
    print(split_info)

SplitInfo(name='train', num_bytes=584517500, num_examples=3318333, shard_lengths=None, dataset_name=None)
SplitInfo(name='validation', num_bytes=2698710, num_examples=15840, shard_lengths=None, dataset_name=None)


## Load datasets
The ```load_dataset``` function lets us load data from many different formats in a single line of code!
### From the hub
To load a dataset from the hub, we just need to specify the name of the dataset, as before!

To keep this example simple, let's load a smaller dataset for movie review classification.

In [13]:
from datasets import load_dataset

In [14]:
dataset = load_dataset('cornell-movie-review-data/rotten_tomatoes')

README.md:   0%|          | 0.00/7.46k [00:00<?, ?B/s]

train.parquet:   0%|          | 0.00/699k [00:00<?, ?B/s]

validation.parquet:   0%|          | 0.00/90.0k [00:00<?, ?B/s]

test.parquet:   0%|          | 0.00/92.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8530 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1066 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1066 [00:00<?, ? examples/s]

The function returns either a `Dataset` or a `DatasetDict`. A `DatasetDict` is a collection of `Dataset` objects that can be accessed by key, just like a dictionary (hashmap) in Python. The `Dataset` object is where all the magic happens: it is where the data is actually stored and manipulated.

In [15]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [16]:
type(dataset) # DatasetDict

datasets.dataset_dict.DatasetDict

In [17]:
# DatasetDict is a subclass of Python's dict,
# which means that we can perform any dictionary-like operation
assert isinstance(dataset, dict)

In [18]:
dataset['train'] # get a dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

In [19]:
dataset.keys() # key iterator

dict_keys(['train', 'validation', 'test'])

In [20]:
dataset.values() # values iterator

dict_values([Dataset({
    features: ['text', 'label'],
    num_rows: 8530
}), Dataset({
    features: ['text', 'label'],
    num_rows: 1066
}), Dataset({
    features: ['text', 'label'],
    num_rows: 1066
})])

Let's analyze the `Dataset` object first and come back to `DatasetDict` and its use cases later.

### Datasets

In theory, a dataset is just an indexable object, which means we can write ```dataset[i]``` to get the i-th item of the dataset. Each item is structured as a Python dictionary whose keys are the available `features`.

Beyond that, Hugging Face datasets let us easily manipulate and process data, which is often required for training and evaluating models.

In [21]:
train_ds = dataset['train']

#### Indexing

In [22]:
train_ds.features # describe the stored data

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

In [23]:
train_ds[0]

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}

In [24]:
assert isinstance(train_ds[0], dict)

In [25]:
train_ds[-1]

{'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .',
 'label': 0}

We can also index by feature name to get a Python list of all items in that column, for example:

In [26]:
train_ds['text']

['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
 'effective but too-tepid biopic',
 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
 "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
 'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .',
 'offers that rare combination of entertainment and education .',
 'perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions .',
 "steers turns i

Finally, we can index by row and column to get a specific value.

In [27]:
train_ds[10]['label']

1

In [28]:
train_ds['label'][10]

1

In [29]:
assert train_ds[10]['label'] == train_ds['label'][10]

However, the order of indexing matters! If we index by column and then by row, the whole column is first loaded into a Python list before we pick a single item, whereas if we index by row and then by column, only that row is loaded and we read a single feature from it.

Even with this small dataset, we can verify that column-first indexing is much slower, and the gap will grow with larger datasets.

In [30]:
import time

start_time = time.time()
text = train_ds[0]["text"]
end_time = time.time()
print(f"Elapsed time: {end_time - start_time:.4f} seconds")

start_time = time.time()
text = train_ds["text"][0]
end_time = time.time()
print(f"Elapsed time: {end_time - start_time:.4f} seconds")

Elapsed time: 0.0003 seconds
Elapsed time: 0.0133 seconds


#### Slicing
Slicing returns a slice, or subset, of the dataset, which is often useful for visualization purposes or for splitting the dataset into contiguous chunks.

In [31]:
train_ds[:3]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'effective but too-tepid biopic'],
 'label': [1, 1, 1]}

In [32]:
train_ds[10:20]

{'text': ['this is a film well worth seeing , talking and singing heads and all .',
  'what really surprises about wisegirls is its low-key quality and genuine tenderness .',
  '( wendigo is ) why we go to the cinema : to be fed through the eye , the heart , the mind .',
  'one of the greatest family-oriented , fantasy-adventure movies ever .',
  'ultimately , it ponders the reasons we need stories so much .',
  "an utterly compelling 'who wrote it' in which the reputation of the most famous author who ever lived comes into question .",
  'illuminating if overly talky documentary .',
  'a masterpiece four years in the making .',
  "the movie's ripe , enrapturing beauty will tempt those willing to probe its inscrutable mysteries .",
  'offers a breath of the fresh air of true sophistication .'],
 'label': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [33]:
# the output is NOT a dataset, instead, slicing a dataset returns a dictionary!
type(train_ds[30: 50])

dict

#### Select and Filter
There are two ways to pick rows from a dataset: `.select()` and `.filter()`

- `select(indices: List[int])`: returns rows according to a list of indices

In [34]:
small_dataset = train_ds.select([1, 50, 3])

In [35]:
assert len(small_dataset) == 3

In [36]:
small_dataset[-1]

{'text': 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
 'label': 1}

In [37]:
assert small_dataset[1] == train_ds[50]

- `filter(condition: Callable)`: returns rows that match a specified condition

For example, let's keep only the positive movie reviews.

In [38]:
positive_sent_ds = train_ds.filter(lambda row: row['label'] == 1)

Filter:   0%|          | 0/8530 [00:00<?, ? examples/s]

In [39]:
positive_sent_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 4265
})

In [40]:
assert all(1 == label for label in positive_sent_ds['label'])

We can also access the row indices inside the condition by passing `with_indices=True`:

In [41]:
positive_sent_and_odd_idx_ds = train_ds.filter(lambda row, idx: row['label'] == 1 and idx % 2 == 1, with_indices=True)

Filter:   0%|          | 0/8530 [00:00<?, ? examples/s]

In [42]:
positive_sent_and_odd_idx_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 2132
})

### Map

The most powerful method of `Dataset` is `.map()`. It allows us to apply Python functions to each item in the dataset, individually or in batches. Under the hood, this method takes care of speeding up the processing function, building batches, using multiple cores, etc.

To start, let's write a function which removes stopwords from a text.

Stop words are common words that occur with high frequency in a language but carry little information about the meaning of a phrase.

Examples of common stop words include:
a, the, and, or, of, on, this, we, were, is, not, …

Thankfully, the `nltk` module provides a list of English stop words for us.

In [43]:
import nltk
from nltk.corpus import stopwords

In [44]:
nltk.download('stopwords')
stopwords_list = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [45]:
stopwords_list

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [46]:
len(stopwords_list) # there are ~200 stopwords

198

Now that we have a list of words we want to remove, we need to write a function that:
1. Splits the text into a list of words (tokenization)
2. Removes the words in `stopwords_list`
3. Joins the remaining words into a single text

Tokenization is an important part of NLP, but to keep things simple, we will use nltk's `wordpunct_tokenize` function, which simply splits the text into words and punctuation marks. Nothing fancy for now.

In [47]:
from nltk.tokenize import wordpunct_tokenize
# wordpunct_tokenize is regex-based, so this download should not be strictly required for it
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [48]:
wordpunct_tokenize('Hello, This function will divide the text into words!. Can you guess what the result would be?')

['Hello',
 ',',
 'This',
 'function',
 'will',
 'divide',
 'the',
 'text',
 'into',
 'words',
 '!.',
 'Can',
 'you',
 'guess',
 'what',
 'the',
 'result',
 'would',
 'be',
 '?']

Great! It seems to be working just fine.

In [49]:
def remove_stop_words(text: str) -> str:
    words = wordpunct_tokenize(text)

    words_without_sw = [
        word for word in words if word not in stopwords_list
    ]

    return ' '.join(words_without_sw)

In [50]:
assert remove_stop_words('there is a dog in the street') == 'dog street'
assert remove_stop_words('hello there, I love hugging face') == 'hello , I love hugging face'
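Note that the second assertion keeps `I`: the NLTK list is lowercase and the comparison is case-sensitive. If that is not the desired behavior, one possible variant lowercases each word before the lookup and uses a `set` for faster membership tests. The sketch below is self-contained: `STOP_WORDS` is a tiny made-up subset, and the regular expression is assumed to mirror what `wordpunct_tokenize` does.

```python
import re

# Tiny hypothetical stop-word set for illustration; the notebook uses NLTK's full list.
STOP_WORDS = {"there", "is", "a", "in", "the", "i"}

# Runs of word characters, or runs of punctuation (assumed to mirror wordpunct_tokenize).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def remove_stop_words_ci(text: str) -> str:
    """Drop stop words case-insensitively; set lookup is O(1) on average."""
    words = TOKEN_PATTERN.findall(text)
    kept = [word for word in words if word.lower() not in STOP_WORDS]
    return " ".join(kept)


print(remove_stop_words_ci("hello there, I love hugging face"))
```

Whether to fold case depends on the task; for some models capitalization carries useful signal.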

Now, let's apply this function to our dataset! We will write a function `remove_stop_words_pipeline_indiviually` which takes a row (dictionary) as input and returns a dictionary with the processed text.

Finally, we just have to call the `.map()` method with our processing function.

In [51]:
from typing import Dict, Any, List
def remove_stop_words_pipeline_indiviually(example: Dict[str, Any]) -> Dict[str, str]:
    return {'processed_text': remove_stop_words(example['text'])}

In [52]:
processed_train_ds = train_ds.map(remove_stop_words_pipeline_indiviually)

Map:   0%|          | 0/8530 [00:00<?, ? examples/s]

In [53]:
processed_train_ds

Dataset({
    features: ['text', 'label', 'processed_text'],
    num_rows: 8530
})

In [54]:
processed_train_ds[0]['text']

'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'

In [55]:
processed_train_ds[0]['processed_text']

'rock destined 21st century \' new " conan " \' going make splash even greater arnold schwarzenegger , jean - claud van damme steven segal .'

In [56]:
# we can use the .remove_columns method to remove the original text field
# this operation is not inplace!
processed_train_ds.remove_columns('text')

Dataset({
    features: ['label', 'processed_text'],
    num_rows: 8530
})

#### Multiprocessing
We can speed this up by using multiple cores. Set the `num_proc` parameter in `.map()` to the number of processes to use:

In [57]:
processed_train_ds = train_ds.map(remove_stop_words_pipeline_indiviually, num_proc=2)

Map (num_proc=2):   0%|          | 0/8530 [00:00<?, ? examples/s]

In [58]:
processed_train_ds

Dataset({
    features: ['text', 'label', 'processed_text'],
    num_rows: 8530
})

#### Batch processing
`.map()` can also operate on batches of items. To use this feature, set `batched=True` and set `batch_size` to the number of items in each batch; by default, `batch_size=1_000`.

Our function `remove_stop_words_pipeline_indiviually` works on individual rows. To enable batch processing, we modify our function to process multiple texts instead of a single item at a time.

In [59]:
def remove_stop_words_pipeline_batches(example: Dict[str, List[Any]]) -> Dict[str, List[str]]:
    processed_texts = [remove_stop_words(text) for text in example['text']]
    return {'processed_text': processed_texts}

In [60]:
processed_train_ds = train_ds.map(
    remove_stop_words_pipeline_batches,
    batched=True,
    batch_size=1_000)

Map:   0%|          | 0/8530 [00:00<?, ? examples/s]

In [61]:
processed_train_ds

Dataset({
    features: ['text', 'label', 'processed_text'],
    num_rows: 8530
})

### Process multiple splits

Most datasets have precomputed train, validation, and test splits, and many processing operations must be applied to each split independently. Instead of calling `.map()` on each split, we can process them all with a single call to `DatasetDict.map()`.

We already have a `DatasetDict` object with our train, validation, and test splits, which was initialized at the beginning of the notebook.

In [62]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

Let's remove stop words from the text in all splits by calling `.map()` as before.

In [63]:
processed_dataset = dataset.map(
    remove_stop_words_pipeline_batches,
    batched=True,
    batch_size=1_000)

Map:   0%|          | 0/1066 [00:00<?, ? examples/s]

Map:   0%|          | 0/1066 [00:00<?, ? examples/s]

All splits were processed with a single call!

In [64]:
processed_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'processed_text'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 'processed_text'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 'processed_text'],
        num_rows: 1066
    })
})

In this notebook we covered most of the important functionality of datasets; however, there are many more things we can do with them. Feel free to explore the full [documentation](link-to-doc.com) to fully exploit the power of datasets.