<a href="https://colab.research.google.com/github/almutareb/huggingface-nlp-demo/blob/main/working_with_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

🤗 Datasets provides loading scripts to handle the loading of local and remote datasets. It supports several common data formats, such as:

* CSV & TSV (csv): load_dataset("csv", data_files="my_file.csv")
* Text files (text): load_dataset("text", data_files="my_file.txt")
* JSON & JSON Lines (json): load_dataset("json", data_files="my_file.jsonl")
* Pickled DataFrames (pandas): load_dataset("pandas", data_files="my_dataframe.pkl")

For this example we’ll use the SQuAD-it dataset, which is a large-scale dataset for question answering in Italian.

The training and test splits are hosted on GitHub, so we can download them with a simple wget command:

In [None]:
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

In [None]:
!gzip -dkv SQuAD_it-*.json.gz

In [None]:
!pip install datasets

To load a JSON file with the load_dataset() function, we just need to know if we’re dealing with ordinary JSON (similar to a nested dictionary) or JSON Lines (line-separated JSON). Like many question answering datasets, SQuAD-it uses the nested format, with all the text stored in a data field. This means we can load the dataset by specifying the field argument as follows:

In [None]:
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")
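
To make the difference between the two layouts concrete, here is a minimal sketch that uses only Python's standard json module. The toy records and variable names are made up for illustration. It shows that nested JSON keeps every record under one top-level key (data, as in SQuAD-it), while JSON Lines stores one record per line with no wrapper.

```python
import json

# Hypothetical toy records, only loosely shaped like SQuAD-style entries
records = [{"title": "Terremoto"}, {"title": "Roma"}]

# Ordinary (nested) JSON: one document, records stored under a top-level field
nested_text = json.dumps({"version": "1.1", "data": records})
nested_rows = json.loads(nested_text)["data"]  # plays the role of field="data"

# JSON Lines: one self-contained JSON object per line, no wrapping field
jsonl_text = "\n".join(json.dumps(record) for record in records)
jsonl_rows = [json.loads(line) for line in jsonl_text.splitlines()]

# Both layouts carry the same rows once parsed
print(nested_rows == jsonl_rows)
```

With JSON Lines there is no wrapping key to unpack, so the field argument would not be needed for a file in that layout.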

By default, loading local files creates a DatasetDict object with a train split. We can see this by inspecting the squad_it_dataset object:

In [None]:
squad_it_dataset

In [None]:
squad_it_dataset["train"][0]

Great, we’ve loaded our first local dataset! But while this worked for the training set, what we really want is to include both the train and test splits in a single DatasetDict object so we can apply Dataset.map() functions across both splits at once. To do this, we can provide a dictionary to the data_files argument that maps each split name to a file associated with that split:

In [None]:
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

Now we can apply various preprocessing techniques to clean up the data, tokenize the text, and so on.

The loading scripts in 🤗 Datasets actually support automatic decompression of the input files, so we could have skipped the use of gzip by pointing the data_files argument directly to the compressed files:

In [None]:
data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

This can be useful if you don’t want to manually decompress many GZIP files. The automatic decompression also applies to other common formats like ZIP and TAR, so you just need to point data_files to the compressed files and you’re good to go!
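
As a rough illustration of what reading a compressed file directly means, the sketch below writes a tiny gzip-compressed JSON file and reads it back without ever creating an uncompressed copy on disk. It uses only the standard library, the file name and payload are hypothetical, and it is not meant to describe how 🤗 Datasets implements decompression internally.

```python
import gzip
import json
import os
import tempfile

payload = {"data": [{"title": "example"}]}  # hypothetical toy content

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.json.gz")

    # Write a gzip-compressed JSON file
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(payload, f)

    # Read it back directly; the bytes are decompressed on the fly
    with gzip.open(path, "rt", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded["data"])
```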

Loading remote files is just as simple as loading local ones! Instead of providing a path to local files, we point the data_files argument of load_dataset() to one or more URLs where the remote files are stored. For example, for the SQuAD-it dataset hosted on GitHub, we can just point data_files to the SQuAD_it-*.json.gz URLs as follows:

In [None]:
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

In [None]:
squad_it_dataset["test"][0]

Most of the time, the data you work with won’t be perfectly prepared for training models. In the following section we’ll explore the various features that 🤗 Datasets provides to clean up your datasets.

# Slicing and dicing the data

Similar to Pandas, 🤗 Datasets provides several functions to manipulate the contents of Dataset and DatasetDict objects.

For this example we’ll use the Drug Review Dataset that’s hosted on the UC Irvine Machine Learning Repository, which contains patient reviews on various drugs, along with the condition being treated and a 10-star rating of the patient’s satisfaction.

First we need to download and extract the data, which can be done with the wget and unzip commands:

In [None]:
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip

Since TSV is just a variant of CSV that uses tabs instead of commas as the separator, we can load these files by using the csv loading script and specifying the delimiter argument in the load_dataset() function as follows:

In [None]:
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
drug_dataset
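
To see what the delimiter changes, here is a small standalone sketch that uses Python's built-in csv module instead of 🤗 Datasets. The two rows are invented for illustration. Because the parser splits on tabs, the comma inside the first review stays part of the text.

```python
import csv
import io

# Hypothetical TSV content; note the comma inside the first review
tsv_text = "drugName\treview\nDrugA\tWorked well, no side effects\nDrugB\tNot for me\n"

# delimiter="\t" tells the parser to split fields on tabs rather than commas
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

print(rows[0]["review"])
```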

A good practice when doing any sort of data analysis is to grab a small random sample to get a quick feel for the type of data you’re working with. In 🤗 Datasets, we can create a random sample by chaining the Dataset.shuffle() and Dataset.select() functions together:

In [None]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(100))
# peek at the first 3 examples
drug_sample[:3]

From this sample we can already see a few quirks in our dataset:

* The Unnamed: 0 column looks suspiciously like an anonymized ID for each patient.
* The condition column includes a mix of uppercase and lowercase labels.
* The reviews are of varying length and contain a mix of Python line separators (\r\n) as well as HTML character codes like &\#039;.

Let’s see how we can use 🤗 Datasets to deal with each of these issues. To test the patient ID hypothesis for the Unnamed: 0 column, we can use the Dataset.unique() function to verify that the number of IDs matches the number of rows in each split:

In [None]:
for split in drug_dataset.keys():
  assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

This seems to confirm our hypothesis, so let’s clean up the dataset a bit by renaming the Unnamed: 0 column to something a bit more interpretable. We can use the DatasetDict.rename_column() function to rename the column across both splits in one go:

In [None]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

Let’s also use Dataset.unique() to find the number of unique drugs and conditions. The cells below do this for the training set; the same calls work on the test set.

In [None]:
len(drug_dataset["train"].unique("drugName"))

In [None]:
len(drug_dataset["train"].unique("condition"))

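Some rows have no value in the condition column, and these entries are loaded as None, which would make string operations such as lowercasing fail. In the 🤗 Datasets context, we can use lambda functions to define simple map and filter operations, so let’s use one to eliminate the None entries in our dataset: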

In [None]:
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

In Python, lambda functions are small functions that you can define without explicitly naming them. They take the general form:

> lambda \<arguments> : \<expression>

where lambda is one of Python’s special keywords, \<arguments> is a list/set of comma-separated values that define the inputs to the function, and \<expression> represents the operations you wish to execute.
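
A few plain-Python examples may help here. The names square, triangle_area, and keep are only illustrative. The last lambda is the same predicate passed to filter() above, applied to ordinary dictionaries.

```python
# A lambda with one argument, bound to a name so we can call it
square = lambda x: x * x
print(square(3))

# Multiple arguments are separated by commas
triangle_area = lambda base, height: 0.5 * base * height
print(triangle_area(4, 8))

# The filter predicate from above, applied to plain dictionaries
keep = lambda x: x["condition"] is not None
rows = [{"condition": "Acne"}, {"condition": None}]
kept_rows = [row for row in rows if keep(row)]
print(kept_rows)
```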

With the None entries removed, we can normalize our condition column using Dataset.map(). We can define a simple function that can be applied across all the rows of each split in drug_dataset:

In [None]:
def lowercase_condition(example):
  return {"condition": example["condition"].lower()}

# map() returns a new object, so reassign it to keep the lowercased column
drug_dataset = drug_dataset.map(lowercase_condition)
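
The order of the two steps matters. The sketch below reproduces the situation on plain dictionaries, with invented rows: calling .lower() on a None condition raises an AttributeError, so the None entries have to be filtered out before the function is mapped.

```python
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}

# Hypothetical rows: one valid label and one missing value
rows = [{"condition": "Left Ventricular Dysfunction"}, {"condition": None}]

# None has no .lower() method, so mapping before filtering fails
try:
    lowercase_condition(rows[1])
    failed = False
except AttributeError:
    failed = True

# Filter first, then map: the same order used on the dataset
cleaned = [lowercase_condition(row) for row in rows if row["condition"] is not None]
print(failed, cleaned)
```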

Now that we’ve cleaned up the labels, let’s take a look at cleaning up the reviews themselves.

# Creating new columns

Whenever you’re dealing with customer reviews, a good practice is to check the number of words in each review. A review might be just a single word like “Great!” or a full-blown essay with thousands of words, and depending on the use case you’ll need to handle these extremes differently. To compute the number of words in each review, we’ll use a rough heuristic based on splitting each text by whitespace.

Let’s define a simple function that counts the number of words in each review:

In [None]:
def compute_review_length(example):
  return {"review_length": len(example["review"].split())}

Unlike our lowercase_condition() function, compute_review_length() returns a dictionary whose key does not correspond to one of the column names in the dataset. In this case, when compute_review_length() is passed to Dataset.map(), it will be applied to all the rows in the dataset to create a new review_length column:

In [None]:
drug_dataset = drug_dataset.map(compute_review_length)
# inspect the first training example
drug_dataset["train"][0]
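
To get a feel for how rough the whitespace heuristic is, we can call the same function on a few made-up reviews wrapped in plain dictionaries. Runs of spaces and line separators count as a single split point, while punctuation stays attached to its word.

```python
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

# Hypothetical reviews chosen to show edge cases of splitting on whitespace
samples = [
    "Great!",
    "No  side\r\neffects at all",
    "well-tolerated, would recommend",
]

lengths = [compute_review_length({"review": text})["review_length"] for text in samples]
print(lengths)
```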

We can sort the dataset by this new column with Dataset.sort() to see what the extreme values look like:

In [None]:
drug_dataset["train"].sort("review_length")[:3]

Looks like some reviews contain just a single word, which, although it may be okay for sentiment analysis, would not be informative if we want to predict the condition.

Let’s use the Dataset.filter() function to remove reviews that contain 30 words or fewer. Similarly to what we did with the condition column, we can filter out the very short reviews by requiring that the reviews have a length above this threshold:

In [None]:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)
print(drug_dataset.num_rows)

This has removed around 15% of the reviews from our original training and test sets.

The last thing we need to deal with is the presence of HTML character codes in our reviews. We can use Python’s html module to unescape these characters, like so:

In [None]:
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)

We’ll use Dataset.map() to unescape all the HTML characters in our corpus:

In [None]:
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

The Dataset.map() method takes a batched argument that, if set to True, causes it to send a batch of examples to the map function at once (the batch size is configurable but defaults to 1,000).

When you specify batched=True, the function receives a dictionary with the fields of the dataset, but each value is now a list of values rather than a single value. The function passed to Dataset.map() should return the same kind of structure: a dictionary with the fields we want to update or add to our dataset, each mapped to a list of values. For example, here is another way to unescape all HTML characters, but using batched=True:

In [None]:
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)
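
The difference between the two calling conventions is easiest to see on plain dictionaries. In this sketch the example and batch values are invented, and unescape_single and unescape_batched are illustrative names for the two styles of function.

```python
import html

# One example: each column maps to a single value
example = {"review": "I&#039;m fine", "rating": 9.0}

# A batch: each column maps to a list of values
batch = {"review": ["I&#039;m fine", "It&#039;s ok"], "rating": [9.0, 6.0]}

def unescape_single(example):
    # Shape expected without batched=True: one string in, one string out
    return {"review": html.unescape(example["review"])}

def unescape_batched(batch):
    # Shape expected with batched=True: a list in, a list of the same length out
    return {"review": [html.unescape(text) for text in batch["review"]]}

print(unescape_single(example))
print(unescape_batched(batch))
```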

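This command runs much faster than the previous one, and not because our reviews have already been HTML-unescaped. The speedup comes from handling many examples per function call instead of calling the function once per row.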

Using Dataset.map() with batched=True will be essential to unlock the speed of the “fast” tokenizers, which can quickly tokenize big lists of texts. For instance, to tokenize all the drug reviews with a fast tokenizer, we could use a function like this:

In [None]:
!pip install transformers

In [None]:
%%time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
  return tokenizer(examples["review"], truncation=True)

Let’s take this opportunity to compare the performance of the different options. In a notebook, you can time a one-line instruction by adding %time before the line of code you wish to measure:

In [None]:
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

You can also time a whole cell by putting %%time at the beginning of the cell.

Fast tokenizers run their tokenization code in Rust, which makes it easy to parallelize execution. Parallelization is the reason for the nearly 6x speedup the fast tokenizer achieves with batching: you can’t parallelize a single tokenization operation, but when you want to tokenize lots of texts at the same time you can just split the execution across several processes, each responsible for its own texts.

Dataset.map() also has some parallelization capabilities of its own. Since they are not backed by Rust, they won’t let a slow tokenizer catch up with a fast one, but they can still be helpful (especially if you’re using a tokenizer that doesn’t have a fast version). To enable multiprocessing, use the num_proc argument and specify the number of processes to use in your call to Dataset.map():

In [None]:
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

def slow_tokenize_function(examples):
  return slow_tokenizer(examples["review"], truncation=True)

tokenized_dataset = drug_dataset.map(slow_tokenize_function, batched=True, num_proc=8)

In general, we don’t recommend using Python multiprocessing for fast tokenizers with batched=True.

Using num_proc to speed up your processing is usually a great idea, as long as the function you are using is not already doing some kind of multiprocessing of its own.
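
The split-process-recombine pattern behind this kind of parallelism can be sketched in a few lines. This is only an illustration of the idea, not of how Dataset.map() is implemented. The names toy_tokenize and split_into_shards are hypothetical, the tokenizer is a whitespace stand-in, and a thread pool is used purely to keep the example lightweight, whereas num_proc uses separate processes.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_tokenize(texts):
    # Stand-in for a real tokenizer: split each text on whitespace
    return [text.split() for text in texts]

def split_into_shards(items, num_shards):
    # Contiguous shards whose sizes differ by at most one
    size, extra = divmod(len(items), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        end = start + size + (1 if i < extra else 0)
        shards.append(items[start:end])
        start = end
    return shards

texts = [f"review number {i}" for i in range(10)]
shards = split_into_shards(texts, num_shards=4)

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves shard order, so results can be concatenated directly
    results = list(pool.map(toy_tokenize, shards))

tokenized = [tokens for shard in results for tokens in shard]
print(len(tokenized))
```

Each worker only ever sees its own shard, which is why this approach helps most when the function being applied does no parallel work of its own.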

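All of this functionality condensed into a single method is already pretty amazing, but there’s more! With Dataset.map() and batched=True you can change the number of elements in your dataset. This is super useful in many situations where you want to create several training features from one example. The function below tokenizes each review, truncates it to at most 128 tokens, and uses return_overflowing_tokens=True to return the overflow as additional chunks instead of discarding it: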

In [None]:
def tokenize_and_split(examples):
  return tokenizer(
      examples["review"],
      truncation=True,
      max_length=128,
      return_overflowing_tokens=True,
  )

Let's test the function on one example first, before using map to apply it to the whole dataset:

In [None]:
result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]

So, our first example in the training set became two features because it was tokenized to more than the maximum number of tokens we specified: the first one of length 128 and the second one of length 49. Now let’s do this for all elements of the dataset!

In [None]:
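# remove the old columns: their length no longer matches the number of features returned
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names)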
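
To see why the number of rows changes, here is a simplified sketch of the chunking step on plain lists of token ids. The helpers chunk_tokens and split_batch are hypothetical, and the sketch ignores the special tokens a real tokenizer adds to every chunk, so real lengths will not match a naive split exactly. The overflow_to_sample_mapping key mirrors the field that fast tokenizers return alongside the chunks when return_overflowing_tokens=True is set, which records the input example each output row came from.

```python
def chunk_tokens(token_ids, max_length):
    # Split one long sequence into consecutive pieces of at most max_length
    return [token_ids[i : i + max_length] for i in range(0, len(token_ids), max_length)]

def split_batch(batch_of_token_ids, max_length=128):
    input_ids, sample_mapping = [], []
    for sample_index, token_ids in enumerate(batch_of_token_ids):
        for chunk in chunk_tokens(token_ids, max_length):
            input_ids.append(chunk)
            # Remember which input example each output row came from
            sample_mapping.append(sample_index)
    return {"input_ids": input_ids, "overflow_to_sample_mapping": sample_mapping}

# Two hypothetical inputs with 177 and 50 token ids
batch = [list(range(177)), list(range(50))]
result = split_batch(batch)

print([len(ids) for ids in result["input_ids"]])
print(result["overflow_to_sample_mapping"])
```

Two inputs become three outputs here, so any column carried over from the input batch would need to be repeated according to the mapping or dropped before the lengths can line up again.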