# Time to slice and dice

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [2]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]

Most of the time, the data you work with <font color='blue'>won't be perfectly prepared</font> for <font color='blue'>training models</font>. In this section we'll explore the various features that 🤗 Datasets provides to clean up your datasets.

## Slicing and dicing our data

Similar to <font color='blue'>Pandas</font>, 🤗 Datasets provides <font color='blue'>several functions</font> to <font color='blue'>manipulate</font> the contents of <font color='blue'>`Dataset`</font> and <font color='blue'>`DatasetDict`</font> objects. We already encountered the <font color='blue'>`Dataset.map()`</font> method in [Chapter 3](https://huggingface.co/course/chapter3), and in this section we'll explore some of the other functions at our disposal.

For this example we'll use the [Drug Review Dataset](https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29) that's hosted on the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), which contains <font color='blue'>patient reviews</font> on <font color='blue'>various drugs</font>, along with the <font color='blue'>condition being treated</font> and a <font color='blue'>10-star rating</font> of the patient's satisfaction.

First we need to download and extract the data, which can be done with the `wget` and `unzip` commands:

In [3]:
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip

--2025-02-25 03:39:56--  https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘drugsCom_raw.zip’

drugsCom_raw.zip        [         <=>        ]  41.00M  2.09MB/s    in 39s     

2025-02-25 03:40:37 (1.05 MB/s) - ‘drugsCom_raw.zip’ saved [42989872]

Archive:  drugsCom_raw.zip
  inflating: drugsComTest_raw.tsv    
  inflating: drugsComTrain_raw.tsv   



Since <font color='blue'>TSV</font> is just a variant of <font color='blue'>CSV</font> that uses <font color='blue'>tabs instead of commas</font> as the <font color='blue'>separator</font>, we can load these files by using the <font color='blue'>`csv`</font> loading script and specifying the <font color='blue'>`delimiter` argument</font> in the `load_dataset()` function as follows:

In [4]:
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]
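To see why the `delimiter` argument matters, here is a minimal, library-free sketch using Python's standard `csv` module. The rows are made up for illustration and are not taken from the dataset; the point is only that a comma inside a field survives intact when the separator is a tab.

```python
import csv
import io

# Hypothetical TSV snippet (made-up rows, same kind of columns as the reviews)
tsv_text = (
    "drugName\tcondition\trating\n"
    "Naproxen\tGout, Acute\t10\n"
    "Mobic\tInflammatory Conditions\t9\n"
)

# With delimiter="\t", commas inside a field are kept as ordinary text
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
print(rows[0]["condition"])
```

Note that the standard `csv` module returns every value as a string, so any type conversion is left to you in this sketch.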

A <font color='blue'>good practice</font> when doing any sort of <font color='blue'>data analysis</font> is to <font color='blue'>grab a small random sample</font> to get a quick feel for the type of data you're working with. In 🤗 Datasets, we can create a <font color='blue'>random sample</font> by <font color='blue'>chaining</font> the <font color='blue'>`Dataset.shuffle()`</font> and <font color='blue'>`Dataset.select()`</font> functions together:

In [6]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

Note that we've <font color='blue'>fixed</font> the <font color='blue'>seed</font> in <font color='blue'>`Dataset.shuffle()`</font> for reproducibility purposes. <font color='blue'>`Dataset.select()`</font> expects an <font color='blue'>iterable of indices</font>, so we've passed <font color='blue'>`range(1000)`</font> to grab the <font color='blue'>first 1,000 examples</font> from the shuffled dataset. From this sample we can already see a few quirks in our dataset:

* The <font color='blue'>`Unnamed: 0` column</font> looks suspiciously like an anonymized ID for each patient.
* The <font color='blue'>`condition` column</font> includes a mix of uppercase and lowercase labels.
* The <font color='blue'>reviews</font> are of <font color='blue'>varying length</font> and contain a <font color='blue'>mix of carriage return/newline line separators</font> (`\r\n`) as well as HTML character codes like `&#039;`.

Let's see how we can use 🤗 Datasets to deal with each of these issues. To <font color='blue'>test</font> the <font color='blue'>patient ID hypothesis</font> for the `Unnamed: 0` column, we can use the <font color='blue'>`Dataset.unique()` function</font> to verify that the number of IDs matches the number of rows in each split:

In [7]:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))
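The check above boils down to comparing the number of distinct values with the number of rows. A plain-Python sketch of the same idea, using a hypothetical helper named `all_unique` and a few IDs from the sample shown earlier, might look like this:

```python
def all_unique(values):
    """Return True when no value appears more than once."""
    values = list(values)
    return len(set(values)) == len(values)


print(all_unique([87571, 178045, 80482]))  # three distinct IDs
print(all_unique([87571, 178045, 87571]))  # the first ID is repeated
```

Keep in mind that uniqueness is only consistent with the ID hypothesis; a plain row index would pass the same check.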

This seems to <font color='blue'>confirm our hypothesis</font>, so let's <font color='blue'>clean up the dataset</font> a bit by <font color='blue'>renaming</font> the <font color='blue'>`Unnamed: 0` column</font> to <font color='blue'>something</font> a bit <font color='blue'>more interpretable</font>. We can use the `DatasetDict.rename_column()` function to rename the column across both splits in one go:

In [8]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

<Tip>

✏️ **Try it out!** Use the <font color='blue'>`Dataset.unique()` function</font> to <font color='blue'>find the number of unique drugs and conditions</font> in the training and test sets.

</Tip>

In [9]:
# Exercise
training_set = drug_dataset["train"]
num_unique_drug_names = len(training_set.unique(column='drugName'))
print("Number of unique drug names in train:", num_unique_drug_names)

num_unique_conditions = len(training_set.unique(column='condition'))
print("Number of unique conditions in train:", num_unique_conditions)

Number of unique drug names in train: 3436
Number of unique conditions in train: 885


In [10]:
test_set = drug_dataset["test"]
num_unique_drug_names = len(test_set.unique(column='drugName'))
print("Number of unique drug names in test:", num_unique_drug_names)

num_unique_conditions = len(test_set.unique(column='condition'))
print("Number of unique conditions in test:", num_unique_conditions)

Number of unique drug names in test: 2637
Number of unique conditions in test: 709


Next, let's <font color='blue'>normalize</font> all the <font color='blue'>`condition` labels</font> using <font color='blue'>`Dataset.map()`</font>. As we did with tokenization in [Chapter 3](https://huggingface.co/course/chapter3), we can define a simple function that can be applied across all the rows of each split in `drug_dataset`:

In [11]:
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


drug_dataset.map(lowercase_condition)

Map:   0%|          | 0/161297 [00:00<?, ? examples/s]

AttributeError: 'NoneType' object has no attribute 'lower'

Oh no, we've run into a problem with our map function! From the error we can infer that <font color='blue'>some</font> of the <font color='blue'>entries</font> in the <font color='blue'>`condition` column</font> are <font color='blue'>`None`</font>, which <font color='blue'>cannot be lowercased</font> as they're not strings. Let's <font color='blue'>drop these rows</font> using `Dataset.filter()`, which works in a similar way to `Dataset.map()` and expects a function that receives a single example of the dataset. Instead of writing an explicit function like:

In [12]:
def filter_nones(x):
    return x["condition"] is not None

and then running `drug_dataset.filter(filter_nones)`, we can do this in <font color='blue'>one line</font> using a <font color='blue'>_lambda function_</font>. In Python, lambda functions are small functions that you can define without explicitly naming them. They take the general form:

```
lambda <arguments> : <expression>
```

where `lambda` is one of Python's special [keywords](https://docs.python.org/3/reference/lexical_analysis.html#keywords), `<arguments>` is a list/set of comma-separated values that define the inputs to the function, and `<expression>` represents the operations you wish to execute. For example, we can define a <font color='blue'>lambda function</font> that <font color='blue'>squares a number</font> as follows:

In [13]:
(lambda x: x * x)(3)

9

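Note that to apply the function to an input, we wrapped both the function and the input in parentheses. Lambda functions can also take multiple arguments, separated by commas. For example, we can compute the area of a triangle as follows: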

In [None]:
(lambda base, height: 0.5 * base * height)(4, 8)

16.0

Lambda functions are handy when you want to define small, single-use functions (for more information about them, we recommend reading the excellent [Real Python tutorial](https://realpython.com/python-lambda/) by Andre Burgaud). In the 🤗 Datasets context, we can use lambda functions to define simple map and filter operations, so let's use this trick to eliminate the `None` entries in our dataset:

In [14]:
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

Filter:   0%|          | 0/161297 [00:00<?, ? examples/s]

Filter:   0%|          | 0/53766 [00:00<?, ? examples/s]
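As a library-free illustration of the same filter-then-map pattern, the sketch below applies the identical lambda to a small list of dictionaries with Python's built-in `filter()`. The rows are invented for the example, and this is not how 🤗 Datasets implements filtering internally.

```python
# Made-up rows standing in for dataset examples
rows = [
    {"drugName": "A", "condition": "Pain"},
    {"drugName": "B", "condition": None},
    {"drugName": "C", "condition": "ADHD"},
]

# Keep only the rows whose condition is present
kept = list(filter(lambda x: x["condition"] is not None, rows))

# Lowercasing is now safe because no None values remain
normalized = [{**row, "condition": row["condition"].lower()} for row in kept]
print([row["condition"] for row in normalized])
```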

With the `None` entries removed, we can normalize our `condition` column:

In [15]:
drug_dataset = drug_dataset.map(lowercase_condition)
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]

Map:   0%|          | 0/160398 [00:00<?, ? examples/s]

Map:   0%|          | 0/53471 [00:00<?, ? examples/s]

['left ventricular dysfunction', 'adhd', 'birth control']

It works! Now that we've cleaned up the labels, let's take a look at cleaning up the reviews themselves.

## Creating new columns

Whenever you're dealing with <font color='blue'>customer reviews</font>, a good practice is to <font color='blue'>check the number of words</font> in <font color='blue'>each review</font>. A review might be just a single word like "Great!" or a full-blown essay with thousands of words, and depending on the use case you'll need to handle these extremes differently. To compute the number of words in each review, we'll use a rough heuristic based on splitting each text by whitespace.

Let's define a function that counts the number of words in each review:


In [16]:
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}
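To get a feel for how rough the whitespace heuristic is, here is a small standalone sketch. The helper name `count_words` and the sample strings are made up for illustration; the behaviour shown is that of `str.split()` called without arguments.

```python
def count_words(text):
    # str.split() with no arguments splits on any run of whitespace,
    # including "\r\n" and repeated spaces
    return len(text.split())


samples = ['"Great"', "It works\r\nwell   for me", "side-effects were mild"]
lengths = [count_words(sample) for sample in samples]
print(lengths)
```

Punctuation stays attached to words and hyphenated terms count once, which is why this is only an approximation of the true word count.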

Unlike our `lowercase_condition()` function, `compute_review_length()` returns a <font color='blue'>dictionary</font> whose key does not correspond to one of the column names in the dataset. In this case, when `compute_review_length()` is passed to `Dataset.map()`, it will be applied to all the rows in the dataset to create a new `review_length` column:

In [17]:
drug_dataset = drug_dataset.map(compute_review_length)
# Inspect the first training example
drug_dataset["train"][0]

Map:   0%|          | 0/160398 [00:00<?, ? examples/s]

Map:   0%|          | 0/53471 [00:00<?, ? examples/s]

{'patient_id': 206461,
 'drugName': 'Valsartan',
 'condition': 'left ventricular dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27,
 'review_length': 17}

As expected, we can see a `review_length` column has been added to our training set. We can <font color='blue'>sort</font> this new column with <font color='blue'>`Dataset.sort()`</font> to see what the extreme values look like:

In [18]:
drug_dataset["train"].sort("review_length")[:3]

{'patient_id': [111469, 13653, 53602],
 'drugName': ['Ledipasvir / sofosbuvir',
  'Amphetamine / dextroamphetamine',
  'Alesse'],
 'condition': ['hepatitis c', 'adhd', 'birth control'],
 'review': ['"Headache"', '"Great"', '"Awesome"'],
 'rating': [10.0, 10.0, 10.0],
 'date': ['February 3, 2015', 'October 20, 2009', 'November 23, 2015'],
 'usefulCount': [41, 3, 0],
 'review_length': [1, 1, 1]}

As we suspected, <font color='blue'>some reviews</font> contain <font color='blue'>just a single word</font>, which, although it may be okay for sentiment analysis, would not be informative if we want to predict the condition.

<Tip>

🙋 An <font color='blue'>alternative way</font> to add new columns to a <font color='blue'>dataset</font> is with the <font color='blue'>`Dataset.add_column()` function</font>. This allows you to provide the <font color='blue'>column</font> as a <font color='blue'>Python list</font> or <font color='blue'>NumPy array</font> and can be handy in situations where `Dataset.map()` is not well suited for your analysis.

</Tip>

Let's use the <font color='blue'>`Dataset.filter()` function</font> to <font color='blue'>remove reviews</font> that contain <font color='blue'>30 words or fewer</font>. Similarly to what we did with the `condition` column, we can filter out the very short reviews by requiring that the reviews have a length above this threshold:

In [None]:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)
print(drug_dataset.num_rows)

Filter:   0%|          | 0/160398 [00:00<?, ? examples/s]

Filter:   0%|          | 0/53471 [00:00<?, ? examples/s]

{'train': 138514, 'test': 46108}


As you can see, this has <font color='blue'>removed around 14%</font> of the <font color='blue'>reviews</font> from the training and test sets.
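The exact fraction of removed reviews can be worked out from the row counts reported by the progress bars and by `num_rows` above. The short calculation below uses only those printed numbers:

```python
# Row counts before and after the length filter, copied from the outputs above
before = {"train": 160398, "test": 53471}
after = {"train": 138514, "test": 46108}

removed = {split: 1 - after[split] / before[split] for split in before}
for split, fraction in removed.items():
    print(f"{split}: {fraction:.1%} of reviews removed")
```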

<Tip>

✏️ **Try it out!** Use the `Dataset.sort()` function to inspect the reviews with the largest numbers of words. See the [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.sort) to see which argument you need to use to sort the reviews by length in descending order.

</Tip>

In [None]:
# Exercise
drug_dataset["train"].sort("review_length", reverse=True)[:1]

{'patient_id': [121004],
 'drugName': ['Venlafaxine'],
 'condition': ['migraine'],
 'review': ['"Two and a half months ago I was prescribed Venlafaxine to help prevent chronic migraines.\r\nIt did help the migraines (reduced them by almost half), but with it came a host of side effects that were far worse than the problem I was trying to get rid of.\r\nHaving now come off of the stuff, I would not recommend anyone ever use Venlafaxine unless they suffer from extreme / suicidal depression. I mean extreme in the most emphatic sense of the word. \r\nBefore trying Venlafaxine, I was a writer. While on Venlafaxine, I could barely write or speak or communicate at all. More than that, I just didn\'t want to. Not normal for a usually outgoing extrovert.\r\nNow, I\'m beginning to write again - but my ability to speak and converse with others has deteriorated by about 95%. Writing these words is taking forever; keeping up in conversation with even one person is impossible, and I barely see the p

The last thing we need to deal with is the <font color='blue'>presence of HTML character codes</font> in our <font color='blue'>reviews</font>. We can use <font color='blue'>Python's `html` module</font> to <font color='blue'>unescape</font> these <font color='blue'>characters</font>, like so:

In [None]:
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

We'll use <font color='blue'>`Dataset.map()`</font> to <font color='blue'>unescape all the HTML characters</font> in our corpus:

In [None]:
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]
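As a quick standalone check of what `html.unescape()` does, the sketch below runs it on a few made-up strings. It handles decimal and hexadecimal character references as well as named entities, and leaves text without entities unchanged.

```python
import html

# Made-up strings covering decimal, hexadecimal, and named references
samples = [
    "I&#039;m fine",
    "I&#x27;m fine",
    "5 &gt; 3 &amp; 2 &lt; 4",
    "no entities here",
]
cleaned = [html.unescape(sample) for sample in samples]
print(cleaned)
```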

As you can see, the <font color='blue'>`Dataset.map()` method</font> is <font color='blue'>quite useful</font> for <font color='blue'>processing data</font> -- and we haven't even scratched the surface of everything it can do!

## The `map()` method's superpowers

The <font color='blue'>`Dataset.map()`</font> method takes a <font color='blue'>`batched` argument</font> that, if set to <font color='blue'>`True`</font>, causes it to <font color='blue'>send a batch of examples</font> to the <font color='blue'>map function</font> at once (the batch size is configurable but defaults to 1,000). For instance, the previous map function that unescaped all the HTML took a bit of time to run (you can read the time taken from the progress bars). We can <font color='blue'>speed this up</font> by <font color='blue'>processing several elements</font> at the same time using a <font color='blue'>list comprehension</font>.

When you specify <font color='blue'>`batched=True`</font> the function receives a <font color='blue'>dictionary</font> with the <font color='blue'>fields of the dataset</font>, but <font color='blue'>each value</font> is now a <font color='blue'>_list of values_</font>, and not just a single value. The <font color='blue'>return value</font> of the function passed to <font color='blue'>`Dataset.map()`</font> should have the <font color='blue'>same</font> structure: a dictionary with the fields we want to update or add to our dataset, each mapped to a list of values. For example, here is another way to unescape all HTML characters, but using `batched=True`:

In [None]:
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

If you're running this code in a notebook, you'll see that this command executes way faster than the previous one. And it's not because our reviews have already been HTML-unescaped -- if you re-execute the instruction from the previous section (without `batched=True`), it will take the same amount of time as before. This is because <font color='blue'>list comprehensions</font> are <font color='blue'>usually faster</font> than <font color='blue'>executing</font> the <font color='blue'>same code</font> in a <font color='blue'>`for` loop</font>, and we also gain some <font color='blue'>performance</font> by <font color='blue'>accessing lots of elements</font> at the <font color='blue'>same time</font> instead of one by one.
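The difference between the two calling conventions can be sketched without the library. In the toy example below, `unescape_example` and `unescape_batch` are hypothetical names: the first receives one row as a dictionary of single values, the second receives a dictionary of lists and must return lists of the same length.

```python
import html


def unescape_example(example):
    # Per-example contract: one value in, one value out
    return {"review": html.unescape(example["review"])}


def unescape_batch(batch):
    # Batched contract: a list of values in, a list of values out
    return {"review": [html.unescape(text) for text in batch["review"]]}


rows = [{"review": "I&#039;m fine"}, {"review": "5 &gt; 3"}]
batch = {"review": [row["review"] for row in rows]}

per_example = [unescape_example(row)["review"] for row in rows]
batched = unescape_batch(batch)["review"]
print(per_example == batched)
```

Both paths produce the same text; the batched version simply makes one function call for many rows.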

Using <font color='blue'>`Dataset.map()`</font> with <font color='blue'>`batched=True`</font> will be essential to <font color='blue'>unlock the speed</font> of the <font color='blue'>"fast" tokenizers</font> that we'll encounter in [Chapter 6](https://huggingface.co/course/chapter6), which can quickly tokenize big lists of texts. For instance, to tokenize all the drug reviews with a fast tokenizer, we could use a function like this:

In [20]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

As you saw in [Chapter 3](https://huggingface.co/course/chapter3), we can pass <font color='blue'>one or several examples</font> to the <font color='blue'>tokenizer</font>, so we can use this function with or without `batched=True`. Let's take this opportunity to compare the performance of the different options. In a notebook, you can time a one-line instruction by adding `%time` before the line of code you wish to measure:

In [None]:
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

CPU times: user 1min 34s, sys: 827 ms, total: 1min 35s
Wall time: 1min 7s


You can also time a whole cell by putting `%%time` at the beginning of the cell. The reference timing reported for this instruction in the table below is 10.8s (it's the number written after "Wall time"); the run shown above took 1min 7s, so expect the figure to vary with your hardware.
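The `%time` and `%%time` magics only exist inside IPython and Jupyter. In a plain Python script, a rough equivalent of the wall-time measurement can be sketched with `time.perf_counter()`; the helper name `timed` is made up for this example.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return its result together with the elapsed wall time in seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Any callable works here; summing a range is just a stand-in workload
result, seconds = timed(sum, range(1_000_000))
print(f"Wall time: {seconds:.3f}s")
```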

<Tip>

✏️ **Try it out!** Execute the same instruction with and without `batched=True`, then try it with a slow tokenizer (add `use_fast=False` in the `AutoTokenizer.from_pretrained()` method) so you can see what numbers you get on your hardware.

</Tip>

In [21]:
# Exercise
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=False)

Map:   0%|          | 0/160398 [00:00<?, ? examples/s]

Map:   0%|          | 0/53471 [00:00<?, ? examples/s]

CPU times: user 6min 27s, sys: 4.35 s, total: 6min 31s
Wall time: 7min 7s


In [22]:
#Exercise
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/160398 [00:00<?, ? examples/s]

Map:   0%|          | 0/53471 [00:00<?, ? examples/s]

CPU times: user 1min 51s, sys: 951 ms, total: 1min 52s
Wall time: 1min 26s


Here are the results we obtained with and without batching, with a fast and a slow tokenizer:

Options         | Fast tokenizer | Slow tokenizer
:--------------:|:--------------:|:-------------:
`batched=True`  | 10.8s          | 4min41s
`batched=False` | 59.2s          | 5min3s

This means that using a <font color='blue'>fast tokenizer</font> with the <font color='blue'>`batched=True`</font> option is <font color='blue'>almost 30 times faster</font> than its slow counterpart with no batching -- this is truly amazing! That's the main reason why <font color='blue'>fast tokenizers</font> are the <font color='blue'>default</font> when using <font color='blue'>`AutoTokenizer`</font> (and why they are called "fast"). They're able to achieve such a speedup because behind the scenes the tokenization code is executed in Rust, which is a language that makes it easy to parallelize code execution.

<font color='blue'>Parallelization</font> is also the <font color='blue'>reason</font> for the nearly <font color='blue'>6x speedup</font> the <font color='blue'>fast tokenizer achieves</font> with batching: you can't parallelize a single tokenization operation, but when you want to tokenize lots of texts at the same time you can just split the execution across several processes, each responsible for its own texts.
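The speedups quoted above can be checked directly from the timings in the table. The sketch below parses the table's strings into seconds with a small hypothetical helper, `to_seconds`, and computes the two ratios.

```python
import re


def to_seconds(text):
    """Convert strings such as '10.8s' or '4min41s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)min)?([\d.]+)s", text)
    minutes = int(match.group(1) or 0)
    return 60 * minutes + float(match.group(2))


# (tokenizer, batched) -> timing string copied from the table above
timings = {
    ("fast", True): "10.8s",
    ("slow", True): "4min41s",
    ("fast", False): "59.2s",
    ("slow", False): "5min3s",
}
seconds = {key: to_seconds(value) for key, value in timings.items()}

fast_vs_slow = seconds[("slow", False)] / seconds[("fast", True)]
batching_gain = seconds[("fast", False)] / seconds[("fast", True)]
print(f"fast+batched vs slow+unbatched: {fast_vs_slow:.1f}x")
print(f"fast tokenizer, batched vs unbatched: {batching_gain:.1f}x")
```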

<font color='blue'>`Dataset.map()`</font> also has some <font color='blue'>parallelization capabilities</font> of its own. Since they are not backed by Rust, they won't let a slow tokenizer catch up with a fast one, but they can still be helpful (especially if you're using a tokenizer that doesn't have a fast version). To <font color='blue'>enable multiprocessing</font>, use the <font color='blue'>`num_proc` argument</font> and <font color='blue'>specify</font> the <font color='blue'>number of processes</font> to use in your call to `Dataset.map()`:

In [None]:
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)


def slow_tokenize_function(examples):
    return slow_tokenizer(examples["review"], truncation=True)


tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)

Map (num_proc=8):   0%|          | 0/138514 [00:00<?, ? examples/s]

Map (num_proc=8):   0%|          | 0/46108 [00:00<?, ? examples/s]

You can experiment a little with timing to determine the optimal number of processes to use; in our case 8 seemed to produce the best speed gain. Here are the numbers we got with and without multiprocessing:

Options         | Fast tokenizer | Slow tokenizer
:--------------:|:--------------:|:-------------:
`batched=True`  | 10.8s          | 4min41s
`batched=False` | 59.2s          | 5min3s
`batched=True`, `num_proc=8`  | 6.52s          | 41.3s
`batched=False`, `num_proc=8` | 9.49s          | 45.2s

Those are much more reasonable results for the slow tokenizer, but the performance of the fast tokenizer was also substantially improved. Note, however, that won't always be the case -- for values of `num_proc` other than 8, our tests showed that it was faster to use `batched=True` without that option. In general, we don't recommend using Python multiprocessing for fast tokenizers with `batched=True`.

<Tip>

Using `num_proc` to speed up your processing is usually a great idea, as long as the function you are using is not already doing some kind of multiprocessing of its own.

</Tip>

All of this functionality condensed into a single method is already pretty amazing, but there's more! With <font color='blue'>`Dataset.map()`</font> and <font color='blue'>`batched=True`</font> you can <font color='blue'>change the number of elements</font> in <font color='blue'>your dataset</font>. This is super useful in many situations where you want to create several training features from one example, and we will need to do this as part of the preprocessing for several of the NLP tasks we'll undertake in [Chapter 7](https://huggingface.co/course/chapter7).

<Tip>

💡 In machine learning, an <font color='blue'>_example_</font> is usually <font color='blue'>defined</font> as the <font color='blue'>set of _features_</font> that we feed to the model. In some contexts, these <font color='blue'>features</font> will be the <font color='blue'>set of columns in a `Dataset`</font>, but in others (like here and for question answering), multiple features can be extracted from a single example and belong to a single column.

</Tip>

Let's have a look at how it works! Here we will <font color='blue'>tokenize our examples</font> and <font color='blue'>truncate them to a maximum length of 128</font>, but we will ask the tokenizer to return *all* the chunks of the texts instead of just the first one. This can be done with `return_overflowing_tokens=True`:

In [None]:
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

Let's test this on one example before using `Dataset.map()` on the whole dataset:

In [None]:
result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]

[128, 49]

So, our first example in the training set became two features because it was tokenized to more than the maximum number of tokens we specified: the first one of length 128 and the second one of length 49. Now let's do this for all elements of the dataset!
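The chunking idea can be illustrated in plain Python. This is a deliberately simplified sketch: it ignores the special tokens that the real tokenizer adds to every chunk, and the 177-element input is a hypothetical stand-in chosen so that the chunk lengths match the output above.

```python
def split_into_chunks(token_ids, max_length):
    """Split a list of token IDs into consecutive chunks of at most max_length."""
    return [
        token_ids[start : start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


# Hypothetical sequence of 177 token IDs (128 + 49)
chunks = split_into_chunks(list(range(177)), max_length=128)
print([len(chunk) for chunk in chunks])
```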

In [None]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

ArrowInvalid: Column 8 named input_ids expected length 1000 but got length 1463

Oh no! That didn't work! Why not? Looking at the error message will give us a clue: there is a <font color='blue'>mismatch in the lengths of one of the columns</font>, one being of length 1,463 and the other of length 1,000. If you've looked at the `Dataset.map()` [documentation](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map), you may recall that 1,000 is the default batch size, that is, the number of samples passed at once to the function that we are mapping; here those 1,000 examples gave 1,463 new features, resulting in a shape error.

The problem is that we're trying to <font color='blue'>mix two different datasets of different sizes</font>: the `drug_dataset` columns will have a certain number of examples (the 1,000 in our error), but the `tokenized_dataset` we are building will have more (the 1,463 in the error message; it is more than 1,000 because we are tokenizing long reviews into more than one example by using `return_overflowing_tokens=True`). That doesn't work for a `Dataset`, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset. We can do the former with the `remove_columns` argument:

In [None]:
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

Now this works without error. We can check that our new dataset has many more elements than the original dataset by comparing the lengths:

In [None]:
len(tokenized_dataset["train"]), len(drug_dataset["train"])

(206772, 138514)
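The constraint behind the earlier error is simply that every column of a table must have the same number of rows. A toy version of that check, with a hypothetical helper `check_columns` and made-up data, could look like the following; it is an illustration of the rule, not the validation code Arrow actually runs.

```python
def check_columns(columns):
    """Raise ValueError unless every column has the same number of rows."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return True


# Two old rows produced three new features: mixing them is invalid
old_and_new = {"review": ["a", "b"], "input_ids": [[1], [2], [3]]}
# Dropping the old column leaves a consistent table
new_only = {"input_ids": [[1], [2], [3]]}

print(check_columns(new_only))
try:
    check_columns(old_and_new)
except ValueError as error:
    print(error)
```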

We mentioned that we can also deal with the mismatched length problem by <font color='blue'>making the old columns the same size as the new ones</font>. To do this, we will need the <font color='blue'>`overflow_to_sample_mapping` field</font> the tokenizer returns when we set `return_overflowing_tokens=True`. It gives us a mapping from a new feature index to the index of the sample it originated from. Using this, we can associate each key present in our original dataset with a list of values of the right size by repeating the values of each example as many times as it generates new features:

In [None]:
def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result
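To see what the loop over `examples.items()` does, here is the same list comprehension applied to a tiny made-up batch. The `sample_map` below is hypothetical: it says that the first example produced two features, the second one, and the third three.

```python
# Made-up batch of three original examples
examples = {"patient_id": [10, 11, 12], "rating": [9.0, 5.0, 8.0]}

# Hypothetical mapping from each new feature to the example it came from
sample_map = [0, 0, 1, 2, 2, 2]

# Repeat every original value once per feature generated from its example
expanded = {
    key: [values[i] for i in sample_map] for key, values in examples.items()
}
print(expanded["patient_id"])
```

After this step every column has as many entries as there are new features, so the old and new columns can live in the same table.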

We can see it works with `Dataset.map()` without us needing to remove the old columns:

In [None]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset

Map:   0%|          | 0/138514 [00:00<?, ? examples/s]

Map:   0%|          | 0/46108 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})

We get the same number of training features as before, but here we've kept all the old fields. If you need them for some post-processing after applying your model, you might want to use this approach.

You've now seen how 🤗 Datasets can be used to preprocess a dataset in various ways. Although the processing functions of 🤗 Datasets will cover most of your model training needs, there may be times when you'll need to switch to Pandas to access more powerful features, like `DataFrame.groupby()` or high-level APIs for visualization. Fortunately, 🤗 Datasets is designed to be interoperable with libraries such as Pandas, NumPy, PyTorch, TensorFlow, and JAX. Let's take a look at how this works.

## From `Datasets` to `DataFrames` and back

To enable the conversion between various third-party libraries, 🤗 Datasets provides a <font color='blue'>`Dataset.set_format()` function</font>. This function only <font color='blue'>changes</font> the <font color='blue'>_output format_</font> of the <font color='blue'>dataset</font>, so you can easily switch to another format without affecting the underlying _data format_, which is Apache Arrow. The formatting is done in place. To demonstrate, let's convert our dataset to Pandas:

In [25]:
drug_dataset.set_format("pandas")

Now when we access elements of the dataset we get a `pandas.DataFrame` instead of a dictionary:

In [26]:
drug_dataset["train"][:3]

Unnamed: 0,patient_id,drugName,condition,review,rating,date,usefulCount,review_length
0,206461,Valsartan,left ventricular dysfunction,"""It has no side effect, I take it in combinati...",9.0,"May 20, 2012",27,17
1,95260,Guanfacine,adhd,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192,141
2,92703,Lybrel,birth control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17,134


Let's create a `pandas.DataFrame` for the whole training set by selecting all the elements of `drug_dataset["train"]`:

In [27]:
train_df = drug_dataset["train"][:]

<Tip>

🚨 Under the hood, `Dataset.set_format()` changes the return format for the dataset's `__getitem__()` dunder method. This means that when we want to create a new object like `train_df` from a `Dataset` in the `"pandas"` format, we need to slice the whole dataset to obtain a `pandas.DataFrame`. You can verify for yourself that the type of `drug_dataset["train"]` is `Dataset`, irrespective of the output format.

</Tip>


From here we can use all the Pandas functionality that we want. For example, we can do fancy chaining to compute the class distribution among the `condition` entries:

In [28]:
frequencies = (
    train_df["condition"]
    .value_counts()
    .rename_axis("condition")
    .reset_index(name="frequency")
)
frequencies.head()

Unnamed: 0,condition,frequency
0,birth control,28788
1,depression,9069
2,pain,6145
3,anxiety,5904
4,acne,5588



And once we're done with our Pandas analysis, we can always <font color='blue'>create a new `Dataset` object</font> by using the <font color='blue'>`Dataset.from_pandas()` function</font> as follows:

In [29]:
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 884
})

<Tip>

✏️ **Try it out!** Compute the average rating per drug and store the result in a new `Dataset`.

</Tip>

In [31]:
# Exercise

# Convert dataset to a Pandas DataFrame
train_df = drug_dataset["train"].to_pandas()

# Compute the average rating per drug
avg_rating = train_df.groupby("drugName")["rating"].mean().to_frame().reset_index()
avg_rating.head()

# Create a new Dataset from the Pandas DataFrame
avg_rating_dataset = Dataset.from_pandas(avg_rating)
avg_rating_dataset

Dataset({
    features: ['drugName', 'rating'],
    num_rows: 3431
})

In [32]:
avg_rating_dataset.to_pandas().head()

Unnamed: 0,drugName,rating
0,A + D Cracked Skin Relief,10.0
1,A / B Otic,10.0
2,Abacavir / dolutegravir / lamivudine,8.211538
3,Abacavir / lamivudine / zidovudine,9.0
4,Abatacept,7.157895


This wraps up our tour of the various preprocessing techniques available in 🤗 Datasets. To round out the section, let's <font color='blue'>create a validation set</font> to <font color='blue'>prepare the dataset for training</font> a classifier. Before doing so, we'll reset the output format of `drug_dataset` from `"pandas"` back to the default, which returns plain Python objects (the underlying data stays in Arrow either way):

In [None]:
drug_dataset.reset_format()

## Creating a validation set

Although we have a test set we could use for evaluation, it's a good practice to leave the test set untouched and  <font color='blue'>create a separate validation set</font> during development. Once you are happy with the performance of your models on the validation set, you can do a final sanity check on the test set. This process helps  <font color='blue'>mitigate the risk that you'll overfit</font> to the  <font color='blue'>test set</font> and deploy a model that fails on real-world data.

🤗 Datasets provides a `Dataset.train_test_split()` function that is modeled on the `train_test_split()` utility from `scikit-learn`. Let's use it to split our training set into `train` and `validation` splits (we set the `seed` argument for reproducibility):

In [None]:
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

Great, we've now prepared a dataset that's ready for training some models on! In [section 5](https://huggingface.co/course/chapter5/5) we'll show you how to upload datasets to the Hugging Face Hub, but for now let's cap off our analysis by looking at a few ways you can save datasets on your local machine.

## Saving a dataset

Although 🤗 Datasets will cache every downloaded dataset and the operations performed on it, there are times when you'll want to  <font color='blue'>save a dataset to disk</font> (e.g., in case the cache gets deleted). As shown in the table below, 🤗 Datasets provides three main functions to save your dataset in different formats:

| Data format |        Function        |
| :---------: | :--------------------: |
|    Arrow    | `Dataset.save_to_disk()` |
|     CSV     |    `Dataset.to_csv()`    |
|    JSON     |   `Dataset.to_json()`    |

For example, let's save our cleaned dataset in the Arrow format:

In [None]:
drug_dataset_clean.save_to_disk("drug-reviews")

Saving the dataset (0/1 shards):   0%|          | 0/110811 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/27703 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/46108 [00:00<?, ? examples/s]

This will create a directory with the following structure:

```
drug-reviews/
├── dataset_dict.json
├── test
│   ├── dataset.arrow
│   ├── dataset_info.json
│   └── state.json
├── train
│   ├── dataset.arrow
│   ├── dataset_info.json
│   ├── indices.arrow
│   └── state.json
└── validation
    ├── dataset.arrow
    ├── dataset_info.json
    ├── indices.arrow
    └── state.json
```

where we can see that  <font color='blue'>each split</font> is  <font color='blue'>associated</font> with its own  <font color='blue'>*dataset.arrow* table</font>, and some metadata in *dataset_info.json* and *state.json*. You can think of the  <font color='blue'>Arrow format</font> as a  <font color='blue'>fancy table of columns and rows</font> that is  <font color='blue'>optimized for building high-performance applications</font> that process and transport large datasets.

Once the dataset is saved, we can load it by using the `load_from_disk()` function as follows:

In [None]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("drug-reviews")
drug_dataset_reloaded

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

For the CSV and JSON formats, we have to store each split as a separate file. One way to do this is by iterating over the keys and values in the `DatasetDict` object:

In [None]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")

Creating json from Arrow format:   0%|          | 0/111 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/28 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/47 [00:00<?, ?ba/s]

This saves each split in [JSON Lines format](https://jsonlines.org), where each row in the dataset is stored as a single line of JSON. Here's what the first example looks like:

In [None]:
!head -n 1 drug-reviews-train.jsonl

{"patient_id":89879,"drugName":"Cyclosporine","condition":"keratoconjunctivitis sicca","review":"\"I have used Restasis for about a year now and have seen almost no progress.  For most of my life I've had red and bothersome eyes. After trying various eye drops, my doctor recommended Restasis.  He said it typically takes 3 to 6 months for it to really kick in but it never did kick in.  When I put the drops in it burns my eyes for the first 30 - 40 minutes.  I've talked with my doctor about this and he said it is normal but should go away after some time, but it hasn't. Every year around spring time my eyes get terrible irritated  and this year has been the same (maybe even worse than other years) even though I've been using Restasis for a year now. The only difference I notice was for the first couple weeks, but now I'm ready to move on.\"","rating":2.0,"date":"April 20, 2013","usefulCount":69,"review_length":147}

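Since every line is an independent JSON document, a JSON Lines file can be produced and parsed with nothing more than the standard library. The sketch below uses two invented rows and an in-memory buffer in place of a real file:

```python
import io
import json

# Two made-up rows with a subset of the dataset's columns.
rows = [
    {"drugName": "Cyclosporine", "rating": 2.0, "review_length": 147},
    {"drugName": "Valsartan", "rating": 9.0, "review_length": 17},
]

# Write: one JSON object per line.
buffer = io.StringIO()
for row in rows:
    buffer.write(json.dumps(row) + "\n")

# Read: parse each line independently of the others.
lines = buffer.getvalue().splitlines()
parsed = [json.loads(line) for line in lines]

print(len(lines), parsed[0]["drugName"])  # 2 Cyclosporine
```

This line-by-line structure is what lets tools like `head` show a complete example without reading the rest of the file.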

We can then use the techniques from [section 2](https://huggingface.co/course/chapter5/2) to load the JSON files as follows:

In [None]:
data_files = {
    "train": "drug-reviews-train.jsonl",
    "validation": "drug-reviews-validation.jsonl",
    "test": "drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

And that's it for our excursion into data wrangling with 🤗 Datasets! Now that we have a cleaned dataset for training a model on, here are a few ideas that you could try out:

1. Use the techniques from [Chapter 3](https://huggingface.co/course/chapter3) to train a classifier that can predict the patient condition based on the drug review.
2. Use the `summarization` pipeline from [Chapter 1](https://huggingface.co/course/chapter1) to generate summaries of the reviews.

Next, we'll take a look at how 🤗 Datasets can enable you to work with huge datasets without blowing up your laptop!


In [48]:
# Exercise: summarize a few reviews with the `summarization` pipeline

from transformers import pipeline

# device=0 assumes a CUDA GPU is available; use device=-1 to run on the CPU
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device=0)

# Select 5 sample reviews from the training set
# (slice the rows first, then pick the column, to avoid loading every review)
sample_reviews = drug_dataset["train"][15:20]["review"]

# do_sample=True means the summaries can differ from run to run
summarized_reviews = [
    summarizer(review, max_length=100, min_length=10, do_sample=True, early_stopping=True)[0]["summary_text"]
    for review in sample_reviews
]

for i, (original, summary) in enumerate(zip(sample_reviews, summarized_reviews), start=1):
    print(f"Review {i}:\nOriginal: {original}\nSummary: {summary}\n{'-'*80}")

Device set to use cuda:0


Review 1:
Original: "I have been taking Saxenda since July 2016.  I had severe nausea for about a month once I got up to the 2.6 dosage.  It has since subsided and the only side effect I notice now is the dry mouth.  I make sure to drink  2.5 litres of water a day (about 10 glasses).  This helps with the weight loss as well as the constipation.  I have been reducing my dose to find a comfortable spot where I am still losing weight but don&#039;t feel like I am over medicating.  For me, 1.8 is working very well.  I also feel wearing a Fitbit has really helped.  I can track my food, water, exercise and steps - it keeps me moving more.  When this started I could barely walk the length of myself without getting winded - I have lost 58 lbs so far."
Summary:  "I have been taking Saxenda since July 2016.  I make sure to drink  2.5 litres of water a day (about 10 glasses) This helps with the weight loss as well as the constipation . I have been reducing my dose to find a comfortable spot where