# Dataset Quick Tour

Let’s have a quick look at the 🤗datasets library. This library has three main features:

* It provides a very efficient way to load and process data from raw files (CSV/JSON/text) or in-memory data (python dict, pandas dataframe) with a special focus on memory efficiency and speed. 
* It provides a very simple way to access and share datasets with the research and practitioner communities (over 130 NLP datasets are already accessible in one line with the library as we’ll see below).
* It was designed with a particular focus on interoperabilty with frameworks like pandas, NumPy, PyTorch and TensorFlow.

In [104]:
import datasets
from datasets import list_datasets
datasets_list = list_datasets()
len(datasets_list)

515

## 1. Loading a dataset

### 1.1 Loading a pre-built dataset

The list of available datasets and their descriptions are available [[here]](https://huggingface.co/datasets).

In [12]:
from datasets import load_dataset
dataset = load_dataset('glue', 'mrpc', split='train')

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)


In [13]:
# behaving like a list
len(dataset)

3668

In [14]:
# can be indexed
dataset[0]

{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

In [5]:
dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'idx': Value(dtype='int32', id=None)}

In [9]:
# get the class names to output meaningful label in prediction
dataset.features['label'].names

['not_equivalent', 'equivalent']

### 1.2 Spliting the dataset

This method is adapted from scikit-learn celebrated `train_test_split` method. The splits will be shuffled by default. You can deactivate this behavior by setting `shuffle=False` in the arguments of `datasets.Dataset.train_test_split()`.

In [22]:
temp = dataset.train_test_split(test_size=0.1)
print(temp)
train_dataset = temp['train']
test_dataset = temp['test']

Loading cached split indices for dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4/cache-df8919e1b406b9af.arrow and /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4/cache-d7b7d3495a20dc51.arrow


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3301
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 367
    })
})


### 1.3 Concatenate several datasets
When you have several datasets.Dataset objects that share the same column types, you can create a new datasets.Dataset object that is the concatenation of them:

In [25]:
from datasets import concatenate_datasets
print(len(train_dataset), len(test_dataset))
assert train_dataset.features.type == test_dataset.features.type
dataset = concatenate_datasets([train_dataset, test_dataset])
print(len(dataset))

3301 367
3668


### 1.4 Loading dataset from local file or in-memory object

If you want to create a datasets.Dataset from local CSV, JSON, text or pandas files instead of a community provided dataset, you can use one of the `csv`, `json`, `text` or `pandas` builder. They all accept a variety of file paths as inputs: a path to a single file, a list of paths to files or a dict of paths to files for each split. Here are some examples to load from CSV files:

```
>>> from datasets import load_dataset
>>> dataset = load_dataset('csv', data_files='my_file.csv')
>>> dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])
>>> dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'],
>>>                                           'test': 'my_test_file.csv'})
```

**WARNING:** Watch out [[this issue]](https://github.com/huggingface/datasets/issues/743).

#### 1.4.1 Loading from CSV file

<div class="alert alert-block alert-info">
<b>Tip:</b> 
f you don’t provide a split argument to datasets.load_dataset(), this method will return a dictionary containing a datasets for each split in the dataset. So to access the dataset from a single CSV/JSON file, you need to specify the split at **train**.
</div>

In [None]:
from datasets import load_dataset
dataset = load_dataset('csv', script_version="master", data_files='test_data.csv')

In [8]:
# this doesn't work
dataset[0]

KeyError: 0

In [9]:
# this does
dataset['train'][0]

{'text': 'I hate google', 'label': 'negative'}

In [10]:
# or this one
dataset = load_dataset('csv', script_version="master", data_files='test_data.csv', split='train')
dataset[0]

Using custom data configuration default
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-3b6254ff4dd403e5/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2)


{'text': 'I hate google', 'label': 'negative'}

#### 1.4.2 Loading from JSON file

The most efficient format is to have JSON files consisting of multiple JSON objects, one per line, representing individual data rows:
    
```
{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}
```

In this case, interesting features are provided out-of-the-box by the Apache Arrow backend:

* multi-threaded reading
* automatic decompression of input files (based on the filename extension, such as `my_data.json.gz`)
* sophisticated type inference (see below)

To export a pandas DF to such format, use the following command:

```
df.to_json('test_data.json', orient='records', lines=True)
```

You can load such a dataset direcly with:

In [11]:
from datasets import load_dataset
dataset = load_dataset('json', data_files='test_data.json', split='train')
dataset[0]

Using custom data configuration default
Reusing dataset json (/root/.cache/huggingface/datasets/json/default-fec0ccfda8a48221/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514)


{'text': 'I hate google', 'label': 'negative'}

#### 1.4.3 Loading from in-momory objects

They are quick and easy. However, it doesn't have the benefit of low-memory usage since the whole dataset needs to be loaded in the first place. This is suitable for smaller datasets.

In [86]:
# from python dict
from datasets import Dataset
my_dict = {'id': [0, 1, 2],
           'name': ['mary', 'bob', 'eve'],
           'age': [24, 53, 19]}

dataset_dict = Dataset.from_dict(my_dict)
dataset_dict[0]

{'id': 0, 'name': 'mary', 'age': 24}

In [87]:
# from pandas df
from datasets import Dataset
import pandas as pd
df = pd.read_csv('test_data.csv')
dataset_df = Dataset.from_pandas(df)
dataset_df[0]

{'text': 'I hate google', 'label': 'negative'}

In [96]:
df.to_json('test_data.json', orient='records', lines=True)

## 2. Fine-tuning a deep-learning model

In the rest of this quick-tour we will use this dataset to fine-tune a Bert model on the sentence pair classification task of Paraphrase Classification. Let’s have a quick look at our task.

As you can see from the above features, the labels are a `datasets.ClassLabel` instance with two classes: `not_equivalent` and `equivalent`.

We can print one example of each class using `datasets.Dataset.filter()` and a name-to-integer conversion method of the feature `datasets.ClassLabel` called `datasets.ClassLabel.str2int()` (that we detail these methods in processing and exploring):

In [1]:
from datasets import load_dataset
train_dataset, test_dataset = load_dataset('glue', 'mrpc', split=['train', 'validation'])

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)


In [2]:
test_dataset.filter(lambda example: example['label'] == test_dataset.features['label'].str2int('equivalent'))[0]

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4/cache-20cd1c87f24775b4.arrow


{'idx': 9,
 'label': 1,
 'sentence1': "He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .",
 'sentence2': '" The foodservice pie business does not fit our long-term growth strategy .'}

Now our goal will be to train a model which can predict the correct label (`not_equivalent` or `equivalent`) from a pair of sentence.

Let’s import a pretrained Bert model and its tokenizer using 🤗transformers.

In [3]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

### 2.1 Tokenizing the dataset
The first step is to tokenize our sentences in order to build sequences of integers that our model can digest from the pairs of sequences. Bert’s tokenizer knows how to do that and we can simply feed it with a pair of sentences as inputs to generate the right inputs for our model.

Check out [the other notebook](https://github.com/YipingNUS/huggingface-learning-notes/blob/master/Getting%20Started.ipynb) for details regarding tokenizers.

In [4]:
tokenizer(train_dataset[0]['sentence1'], train_dataset[0]['sentence2'])

{'input_ids': [101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In our case, we want to tokenize our full dataset, so we will use a method called `datasets.Dataset.map()` to apply the encoding process to their whole dataset. To be sure we can easily build tensors batches for our model, we will truncate and pad the inputs to the max length of our model.

The function you provide to `datasets.Dataset.map()` should accept an input with the format of an item of the dataset: `function(dataset[0])` and return a python dict.

In [5]:
def encode(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], truncation=True, padding='max_length')

train_dataset = train_dataset.map(encode, batched=True)
test_dataset = train_dataset.map(encode, batched=True)

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4/cache-96d180898128a3c3.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4/cache-bc1538d8f1aed734.arrow


In [6]:
# the outputs of the tokenizer are added as additional columns
train_dataset[0].keys()

dict_keys(['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'])

<div class="alert alert-block alert-info">
<b>Tip:</b> 
Note that this is not the most efficient padding strategy, we could also avoid padding at this stage and use tokenizer.pad as the collate_fn method in the torch.utils.data.DataLoader further below.
</div>

### 2.2 Formatting the dataset

Now that we have encoded our dataset, we want to use it in a `torch.Dataloader` or a `tf.data.Dataset` and use it to train our model.

To be able to train our model with this dataset and PyTorch, we will need to do three modifications:

* rename our `label` column in `labels` which is the expected input name for labels in `BertForSequenceClassification` or `TFBertForSequenceClassification`,
* get pytorch (or tensorflow) tensors out of our `datasets.Dataset`, instead of python objects, and
* filter the columns to return only the subset of the columns that we need for our model inputs (`input_ids`, `token_type_ids` and `attention_mask`).

<div class="alert alert-block alert-info">
<b>Tip:</b> 
We don’t want the columns sentence1 or sentence2 as inputs to train our model, but we could still want to keep them in the dataset, for instance for the evaluation of the model. 🤗datasets let you control the output format of datasets.Dataset.__getitem__() to just mask them as detailed in exploring.
</div>

The first modification is just a matter of renaming the column as follow (we could have done it during the tokenization process as well:

In [7]:
train_dataset = train_dataset.map(lambda examples: {'labels': examples['label']}, batched=True)
test_dataset = test_dataset.map(lambda examples: {'labels': examples['label']}, batched=True)

Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4/cache-a956b99b23e13ffb.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4/cache-623a4e90a6c42513.arrow


The two other modifications can be handled by the `datasets.Dataset.set_format()` method which will convert, on the fly, the returned output from `datasets.Dataset.__getitem__()` to filter the unwanted columns and convert python objects in PyTorch tensors.

Here is how we can apply the right format to our dataset using `datasets.Dataset.set_format()` and wrap it in a `torch.utils.data.DataLoader` or a `tf.data.Dataset`:

In [8]:
import torch
train_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
test_dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

In [9]:
dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32)
next(iter(dataloader))

  return torch.tensor(x, **format_kwargs)


{'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'input_ids': tensor([[  101,  7277,  2180,  ...,     0,     0,     0],
         [  101, 10684,  2599,  ...,     0,     0,     0],
         [  101,  1220,  1125,  ...,     0,     0,     0],
         ...,
         [  101,  1109,  2026,  ...,     0,     0,     0],
         [  101, 22263,  1107,  ...,     0,     0,     0],
         [  101,   142,  1813,  ...,     0,     0,     0]]),
 'labels': tensor([1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1,
         1, 1, 0, 0, 1, 1, 1, 0]),
 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]])}

### 2.3 Fine-tuning the model using HuggingFace Trainer

In [13]:
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

training_args = TrainingArguments(
    output_dir='/storage/yiping-transformer/glue-mrpc',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    evaluation_strategy='epoch',
    fp16=True,
    logging_dir='/storage/yiping-transformer/logs',            # directory for storing logs
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset            # evaluation dataset
)

In [14]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.093457,0.974646,0.98114,0.984534,0.977769
2,No log,0.090195,0.977372,0.983044,0.993804,0.972514
3,0.157023,0.028031,0.993184,0.994951,0.994348,0.995554


TrainOutput(global_step=690, training_loss=0.1463607124660326)

### 3 Saving a processed dataset on disk and reload it

Once you have your final dataset you can save it on your disk and reuse it later using datasets.load_from_disk. Saving a dataset creates a directory with various files:

* arrow files: they contain your dataset’s data
* dataset_info.json: contains the description, citations, etc. of the dataset
* state.json: contains the list of the arrow files and other informations like the dataset format type, if any (torch or tensorflow for example)

In [26]:
test_dataset.save_to_disk("data/mprc")

from datasets import load_from_disk
reloaded_encoded_dataset = load_from_disk("data/mprc")

In [28]:
len(reloaded_encoded_dataset)

367