<a href="https://colab.research.google.com/github/anhatsingh/hugging-face-learning/blob/main/datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Playing with Dataset Methods

In [17]:
!pip install datasets

from pprint import pprint



#### Downloading the dataset

Find the dataset on the Hugging Face Hub and pass its identifier, such as `stanfordnlp/imdb`, to the `load_dataset` function from the `datasets` module.

A downloaded dataset may be large and split across multiple files, which can be in formats such as JSON, CSV, or Parquet.

In [3]:
from datasets import load_dataset
imdb_dataset = load_dataset("stanfordnlp/imdb")
print(imdb_dataset)

README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


#### Extracting Training/Test Dataset

Since `imdb_dataset` is a `DatasetDict`, which for our purposes behaves like a normal Python dictionary keyed by split name, we can extract a split with the subscript operator `[]`.

In [4]:
imdb_train_dataset = imdb_dataset['train']
print(imdb_train_dataset)

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})


#### Removing a dataset we don't need

In [5]:
_ = imdb_dataset.pop('unsupervised')

print(imdb_dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})


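#### **Loading** only the dataset parts we need

What if the dataset is huge and we only need one part of it, such as the train or test split? Passing `split` makes `load_dataset` return just that split as a `Dataset` rather than a `DatasetDict`. Note that the files of the other splits may still be downloaded: the progress log for `split="train+test"` in the concatenation section of this notebook also lists the `unsupervised` file.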

In [6]:
train_split = load_dataset("stanfordnlp/imdb", split="train")
print(train_split)

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})


#### Creating a Validation Dataset from the training set

In [7]:
small_ds = train_split.train_test_split(test_size = 0.2)
print(small_ds)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})


#### Saving a loaded dataset to disk
The source data may come in different formats such as CSV, JSON, or Parquet. We can save it to disk in the common Arrow format, which is memory-mapped on load and can therefore be loaded efficiently later on.

In [9]:
small_ds['train'].save_to_disk("data/train")
small_ds['test'].save_to_disk("data/test")

Saving the dataset (0/1 shards):   0%|          | 0/20000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/5000 [00:00<?, ? examples/s]

#### What if the datasets are stored locally as, say, CSV files?

We can load such files with Hugging Face `datasets`, but `train.csv` and `test.csv` must have the same columns (features) with the same column names. If `test.csv` has no labels, make sure it still has a `label` column, with every value left empty.


Note that when `data_files` is a list, all the CSV files are combined into a single dataset under the `train` split; that form is mainly meant for data that is split across many part files. To keep train and test separate, map each split name to its file instead.

In [None]:
files = {'train': 'data/train.csv', 'test': 'data/test.csv'}  # one entry per split
local_dataset = load_dataset("csv", data_files = files)
print(local_dataset)

##### Loading the saved Arrow files

In [12]:
from datasets import load_from_disk

train = load_from_disk("data/train")
print(train)

Dataset({
    features: ['text', 'label'],
    num_rows: 20000
})


### Accessing Records in a dataset

In [18]:
idx = 1000

# look at the record at index 1000 (the 1001st record, since indexing is zero-based)
rec = imdb_dataset['train'][idx]
pprint(rec)

{'label': 0,
 'text': 'Although I have to admit I laughed more watching this movie than the '
         'last few comedies I saw.<br /><br />The budget must have consisted '
         'of pocket change from the actors. The production values are so low '
         'that they actual made it kind of fun to watch. Reminds me of the '
         'Robot Monster made up of a guy in a gorilla suit with a cardboard '
         'diving helmet on.<br /><br />In one scene a hapless victim gets '
         'their arm and leg cut off. Geez, hard to believe but the Black '
         'Knight scene from Holy Grail was more realistic. I kept wondering '
         'why the victim didn\'t start shouting " None Shall Pass" and " It\'s '
         'only a flesh wound, I\'ve had worse". It was one of the funniest '
         'scenes I\'ve seen in the past year.<br /><br />The "gladiator/demon" '
         'was a stitch too. Between the horribly cheap costume and the geeky '
         'look of the guy in it the end result

#### Extracting subsets of the dataset

In [19]:
indices = [1000, 1001, 1050, 1090]
recs = imdb_dataset['train'].select(indices) # select accepts any iterable of integer row indices

pprint(recs)

Dataset({
    features: ['text', 'label'],
    num_rows: 4
})


## Collection of Datasets

Sometimes a dataset is really a collection of many datasets, called subsets or configurations (for example, the `wmt14` dataset contains one subset per language pair). How do we check which subsets and splits are available in a dataset?

In [20]:
from datasets import get_dataset_config_names, get_dataset_split_names

print(get_dataset_config_names("wmt/wmt14"))
print(get_dataset_split_names("wmt/wmt14", "hi-en"))

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

['cs-en', 'de-en', 'fr-en', 'hi-en', 'ru-en']
['train', 'validation', 'test']


#### Loading only Hindi-English Translation dataset from wmt14

In [21]:
translation_dataset = load_dataset(path="wmt/wmt14", name="hi-en")
print(translation_dataset)

train-00000-of-00001.parquet:   0%|          | 0.00/992k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/85.8k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/506k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/32863 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/520 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2507 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 32863
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 520
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2507
    })
})


#### Get the dataset as a combination, not split into 3 parts

In [22]:
raw_dataset = load_dataset(path="wmt/wmt14", name="hi-en", split="train+test+validation")
print(raw_dataset)

Dataset({
    features: ['translation'],
    num_rows: 35890
})


### Seeing the features in a dataset

In [24]:
pprint(imdb_dataset['train'].features)
pprint(translation_dataset['train'].features)

{'label': ClassLabel(names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
{'translation': Translation(languages=['hi', 'en'], id=None)}


## Loading Dataset 3 -> Microsoft Research Paraphrase Corpus (MRPC) under GLUE dataset

MRPC is part of the GLUE dataset; it is loaded the same way as before, with the subset name `mrpc` passed as the second argument.

In [25]:
mrpc_dataset = load_dataset('glue', 'mrpc', split='train')
print(mrpc_dataset)

README.md:   0%|          | 0.00/35.3k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/649k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/75.7k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/308k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})


In [26]:
pprint(mrpc_dataset.features)

{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None)}


## Filtering Datasets

*   The filter function is applied to every record of the dataset, and only the records for which it returns `True` are kept


Let us keep only those reviews in the IMDB dataset that contain at least 100 words (splitting on spaces), and remove the rest.



In [27]:
print("Before Filtering")
print("-" * 20)
print(imdb_dataset)

filtered_imdb_dataset = imdb_dataset.filter(lambda record: len(record['text'].split(' ')) >= 100)
print("After Filtering")
print("-" * 20)
print(filtered_imdb_dataset)

Before Filtering
--------------------
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})


Filter:   0%|          | 0/25000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/25000 [00:00<?, ? examples/s]

After Filtering
--------------------
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 22074
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 21909
    })
})


## Map Method

Suppose we want to add the prefix "IMDB:" to the text of every record in the IMDB dataset. The `map` method applies a function to each record, so we use it here.

In [29]:
def add_prefix(record):
  record['text'] = "IMDB:" + record['text']
  return record


imdb_prefixed_dataset = imdb_dataset.map(add_prefix)
pprint(imdb_prefixed_dataset['train'][1234]['text'])  # index the row first to avoid loading the whole column

("IMDB:They really can't get stupider than this film dealing with 3 losers who "
 'try to capture the college spirit during the annual spring break festivities '
 'at many of our higher schools of learning. The problem is that these losers '
 'try to do this 15 years after their college years when one is assigned to '
 'watch over the daughter of a woman senator being groomed to be the next vice '
 'president.<br /><br />Trouble is that her daughter is anything but popular, '
 'but of course she comes out of all that. The girls go through drunken rages, '
 "exotic dancing and other absolute nonsense.<br /><br />It really can't get "
 'much worse than this awful film.')


## Concatenating Datasets


*   Datasets must have the same features (column names and types); here each one is loaded as a single `Dataset` with all of its splits merged
*   Usually done for pretraining



In [31]:
imdb_dataset_whole = load_dataset("imdb", split="train+test")
print(imdb_dataset_whole)
print(imdb_dataset_whole.features)

README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'label'],
    num_rows: 50000
})
{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}


In [35]:
print(get_dataset_split_names("cornell-movie-review-data/rotten_tomatoes"))

rt_dataset_whole = load_dataset("cornell-movie-review-data/rotten_tomatoes", split='all')
print(rt_dataset_whole)
print(rt_dataset_whole.features)

['train', 'validation', 'test']


train.parquet:   0%|          | 0.00/699k [00:00<?, ?B/s]

validation.parquet:   0%|          | 0.00/90.0k [00:00<?, ?B/s]

test.parquet:   0%|          | 0.00/92.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8530 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1066 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1066 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'label'],
    num_rows: 10662
})
{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}
