# Tutorial 1: Dataset

This tutorial shows how the dataset-related functionality in **EasyTPP** works.


First, we install the package.

In [None]:
!pip install easy_tpp

Currently, there are two ways to load the preprocessed datasets:
- Copy the pickle files from [Google Drive](https://drive.google.com/drive/folders/1f8k82-NL6KFKuNMsUwozmbzDSFycYvz7).
- Load the JSON files from [HuggingFace](https://huggingface.co/easytpp).

The first option will be deprecated in the future; the second is recommended.


## Load pickle data files

If we use the pickle files as the source, we download the data files, put them under a data folder, specify their paths in the config file, and run the training and prediction pipeline.


Taking the taxi dataset as an example, the config looks like this:

```
data:
  taxi:
    data_format: pickle
    train_dir:  ./data/taxi/train.pkl
    valid_dir:  ./data/taxi/dev.pkl
    test_dir:  ./data/taxi/test.pkl
```
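
Before pointing the config at a downloaded pickle file, it can help to check what the file actually contains. The sketch below is a minimal, standard-library-only way to do that. The helper `summarize_pickle` and the stand-in file contents are hypothetical and only illustrate the idea; the real layout of the EasyTPP pickle files is not described in this tutorial, so inspect your own files rather than relying on the demo structure. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path


def summarize_pickle(path):
    """Load a pickle file and describe its top-level structure."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    summary = {"type": type(obj).__name__}
    if isinstance(obj, dict):
        summary["keys"] = list(obj.keys())
    elif isinstance(obj, (list, tuple)):
        summary["length"] = len(obj)
    return summary


# Stand-in file with made-up contents so the sketch runs on its own.
# In practice, pass the real path instead, e.g. ./data/taxi/train.pkl.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "train.pkl"
    with open(demo_path, "wb") as f:
        pickle.dump({"hypothetical_key": [1, 2, 3]}, f)
    demo_summary = summarize_pickle(demo_path)

print(demo_summary)
```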

See [experiment_config](https://github.com/ant-research/EasyTemporalPointProcess/blob/main/examples/configs/experiment_config.yaml) for the full example.



## Load json data files


The recommended way is to load the data from HuggingFace, where all datasets have been preprocessed into JSON format and are hosted in the [EasyTPP Repo](https://huggingface.co/easytpp).

We use the HuggingFace `datasets` API to download and inspect the dataset directly.

In [None]:
from datasets import load_dataset

# We choose the taxi dataset because it is relatively small.
dataset = load_dataset('easytpp/taxi')

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['time_since_last_event', 'type_event', 'time_since_start', 'dim_process', 'seq_len', 'seq_idx'],
        num_rows: 1400
    })
    validation: Dataset({
        features: ['time_since_last_event', 'type_event', 'time_since_start', 'dim_process', 'seq_len', 'seq_idx'],
        num_rows: 200
    })
    test: Dataset({
        features: ['time_since_last_event', 'type_event', 'time_since_start', 'dim_process', 'seq_len', 'seq_idx'],
        num_rows: 400
    })
})

In [None]:
dataset['train']['type_event'][0]
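
Each row of the dataset is one event sequence, described by the six features listed above. As a rough illustration of how these fields might relate to each other, the sketch below runs a few consistency checks on a single record. The record is hand-made with invented values (it is not taken from the taxi dataset), `check_record` is a hypothetical helper, and the two conventions it tests are assumptions inferred from the field names: that event type ids lie in `[0, dim_process)`, and that `time_since_start` is the running sum of `time_since_last_event`.

```python
from itertools import accumulate

# Hypothetical record using the feature names shown above; values are made up.
record = {
    "seq_idx": 0,
    "seq_len": 4,
    "dim_process": 10,
    "type_event": [3, 7, 3, 1],
    "time_since_last_event": [0.0, 0.5, 0.25, 1.0],
    "time_since_start": [0.0, 0.5, 0.75, 1.75],
}


def check_record(rec, tol=1e-6):
    """Return a list of human-readable problems found in one sequence record."""
    problems = []
    n = rec["seq_len"]
    # Every per-event list should have one entry per event.
    for key in ("type_event", "time_since_last_event", "time_since_start"):
        if len(rec[key]) != n:
            problems.append(f"{key} has length {len(rec[key])}, expected {n}")
    # Assumption: event type ids are zero-based and smaller than dim_process.
    if any(not 0 <= k < rec["dim_process"] for k in rec["type_event"]):
        problems.append("type_event contains an id outside [0, dim_process)")
    # Assumption: time_since_start is the cumulative sum of the inter-event gaps.
    running = accumulate(rec["time_since_last_event"])
    if any(abs(a - b) > tol for a, b in zip(running, rec["time_since_start"])):
        problems.append("time_since_start is not the running sum of the gaps")
    return problems


problems = check_record(record)
print(problems or "record looks consistent")
```

To try this on real data, a row such as `dataset['train'][0]` could be passed to `check_record` in place of the hand-made record; if a check fails there, the assumption behind it may simply not hold for that dataset.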

To use this loading path in the training/evaluation pipeline, we similarly put the name of the HuggingFace repo in the config file, e.g.:

```
data:
  taxi:
    data_format: json
    train_dir:  easytpp/taxi
    valid_dir:  easytpp/taxi
    test_dir:  easytpp/taxi
```

Note that we can also manually put the local paths of the JSON files in the config:

```
data:
  taxi:
    data_format: json
    train_dir:  ./data/taxi/train.json
    valid_dir:  ./data/taxi/dev.json
    test_dir:  ./data/taxi/test.json
```
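
Since the same three keys can hold either a HuggingFace repo name or a local file path, a small pre-flight check can catch a mistyped local path before the pipeline starts. The sketch below mirrors the YAML above as a plain Python dict to stay dependency-free. The helper `missing_local_files` is hypothetical, and its rule for telling the two cases apart (a value with a file suffix such as `.json` is a local file, anything else is a repo name) is an assumption made for this example, not a description of how EasyTPP resolves these entries.

```python
from pathlib import Path

# The local-file config from above, written as a plain dict.
config = {
    "data": {
        "taxi": {
            "data_format": "json",
            "train_dir": "./data/taxi/train.json",
            "valid_dir": "./data/taxi/dev.json",
            "test_dir": "./data/taxi/test.json",
        }
    }
}


def missing_local_files(dataset_cfg):
    """Return the split entries that look like local files but do not exist."""
    missing = {}
    for key in ("train_dir", "valid_dir", "test_dir"):
        value = dataset_cfg[key]
        # Assumed heuristic: a file suffix means a local path; values without
        # one (e.g. 'easytpp/taxi') are treated as repo names and skipped.
        if Path(value).suffix and not Path(value).is_file():
            missing[key] = value
    return missing


missing = missing_local_files(config["data"]["taxi"])
print(missing)
```

An empty dict means every entry that looks like a local file was found on disk; any other result lists the keys whose paths need fixing.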