# Text Processing Benchmark

> This module contains benchmarks for `TextDataController` and `TextDataControllerStreaming`

- skip_showdoc: true
- skip_exec: true

In [None]:
!conda list | grep 'datasets\|transformers\|torch'

datasets                  2.14.4                   pypi_0    pypi
pytorch-ignite            0.4.11                   pypi_0    pypi
pytorch-lightning         2.0.1.post0              pypi_0    pypi
torch                     2.0.1+cu118              pypi_0    pypi
torchaudio                2.0.2+cu118              pypi_0    pypi
torchmetrics              1.1.1                    pypi_0    pypi
torchvision               0.15.2+cu118             pypi_0    pypi
transformers              4.31.0                   pypi_0    pypi


In [None]:
# !conda list | grep 'datasets\|transformers'
# datasets                  2.11.0                   pypi_0    pypi
# transformers              4.28.1                   pypi_0    pypi

In [None]:
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from that_nlp_library.text_main import *
from importlib.machinery import SourceFileLoader
from datasets import load_dataset,enable_caching,disable_caching
from transformers import RobertaTokenizer
import os
import time
from underthesea import text_normalize
import nlpaug.augmenter.char as nac
from functools import partial
import random
from memory_profiler import memory_usage

In [None]:
disable_caching() # disable huggingface caching to get a fair benchmark

In [None]:
def benchmarking(tdc,bs,tokenizer,n=10,shuffle_trn=True):
    """Time processing + tokenization, then time iterating over the training set.

    n: number of batches of size `bs` to iterate over; None iterates over all items.
    """
    time1 = time.time()
    tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=shuffle_trn)
    time2 = time.time() 
    print(f'Time it takes to process + tokenize training texts: {(time2-time1):.3f} s')
    # Read items one by one and stop once n batches' worth of items have been seen
    for i,v in enumerate(tdc.main_ddict['train']):
        if n is not None and i==bs*n: break
    time3 = time.time()
    if n is not None:
        print(f'Time it takes to go through {n*bs} items: {(time3-time2):.3f} s')
    else:
        print(f'Time it takes to go through all items: {(time3-time2):.3f} s')

def benchmarking_and_memory_usage(tdc,bs,tokenizer,n=10,shuffle_trn=True):
    """Run `benchmarking` under memory_profiler and print the largest memory sample."""
    mem_usage = memory_usage((benchmarking,[tdc,bs,tokenizer,n,shuffle_trn]))
    print(f'Maximum memory usage: {max(mem_usage):.3f} MiB')
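
The same measure-time-then-report-peak-memory pattern can be tried without any third-party package. The sketch below is only an illustration: it swaps `memory_profiler.memory_usage` for the standard library's `tracemalloc`, which tracks Python-level allocations rather than whole-process memory, so its numbers are not comparable with the MiB figures reported in this notebook. The names `time_and_peak_memory` and `build_squares` are hypothetical.

```python
import time
import tracemalloc

def time_and_peak_memory(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak

def build_squares(n):
    # Placeholder workload standing in for process_and_tokenize
    return [i * i for i in range(n)]

result, elapsed, peak = time_and_peak_memory(build_squares, 100_000)
print(f'Elapsed: {elapsed:.3f} s, peak traced memory: {peak / 2**20:.3f} MiB')
```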


In [None]:
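def nlp_aug_stochastic(x,aug=None,p=0.5):
    """Return the augmented text with probability `p`, otherwise the original text."""
    results = aug.augment(x)
    # Single string: `augment` returns a list, so take its first element
    if not isinstance(x,list): return results[0] if random.random()<p else x
    # List of strings: choose augmented vs. original independently for each item
    return [a if random.random()<p else b for a,b in zip(results,x)]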

aug = nac.KeyboardAug(aug_char_max=3,aug_char_p=0.1,aug_word_p=0.07)
nearby_aug_func = partial(nlp_aug_stochastic,aug=aug,p=0.5)
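
To see what the stochastic wrapper does without installing `nlpaug`, here is a minimal sketch that reuses the same function with a hypothetical stand-in augmenter, `UpperAug`, which simply uppercases text and returns a list (mirroring how the wrapper above indexes `results[0]`). With `p=1.0` every text is augmented, with `p=0.0` none is, and with `p=0.5` each item is decided independently.

```python
import random
from functools import partial

class UpperAug:
    """Hypothetical stand-in augmenter: uppercases text and always returns a list."""
    def augment(self, x):
        texts = x if isinstance(x, list) else [x]
        return [t.upper() for t in texts]

def nlp_aug_stochastic(x, aug=None, p=0.5):
    results = aug.augment(x)
    if not isinstance(x, list):
        return results[0] if random.random() < p else x
    return [a if random.random() < p else b for a, b in zip(results, x)]

always = partial(nlp_aug_stochastic, aug=UpperAug(), p=1.0)
never = partial(nlp_aug_stochastic, aug=UpperAug(), p=0.0)
half = partial(nlp_aug_stochastic, aug=UpperAug(), p=0.5)

print(always('nice dress'))          # NICE DRESS
print(never(['soft', 'too small']))  # ['soft', 'too small']

reviews = ['soft', 'too small', 'great fit', 'runs large']
mixed = half(reviews)                # each item is augmented or kept at random
print(mixed)
```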

## Benchmark on medium-size dataset (~117k rows)

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)
len(dset)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


117430

In [None]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

In [None]:
bs=128

### Without iterable dataset

With filter

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=False,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Filter (num_proc=4):   0%|          | 0/117430 [00:00<?, ? examples/s]

Filter (num_proc=4):   0%|          | 0/113205 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Time it takes to process + tokenize training texts: 14.940 s
Time it takes to go through 1280 items: 0.155 s
Maximum memory usage: 825.723 MiB


With filter + metadatas concatenation

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=False,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Filter (num_proc=4):   0%|          | 0/117430 [00:00<?, ? examples/s]

Filter (num_proc=4):   0%|          | 0/113205 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Time it takes to process + tokenize training texts: 15.741 s
Time it takes to go through 1280 items: 0.168 s
Maximum memory usage: 857.930 MiB


With filter + metadatas concatenation + content transformation + content augmentation

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=False,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Filter (num_proc=4):   0%|          | 0/117430 [00:00<?, ? examples/s]

Filter (num_proc=4):   0%|          | 0/113205 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Time it takes to process + tokenize training texts: 35.980 s
Time it takes to go through 1280 items: 0.176 s
Maximum memory usage: 893.555 MiB


With filter + metadatas concatenation + content transformation + content augmentation + no shuffling

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=False,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer,shuffle_trn=False)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Filter (num_proc=4):   0%|          | 0/117430 [00:00<?, ? examples/s]

Filter (num_proc=4):   0%|          | 0/113205 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Time it takes to process + tokenize training texts: 35.534 s
Time it takes to go through 1280 items: 0.180 s
Maximum memory usage: 892.668 MiB


With filter + metadatas concatenation + content transformation + content augmentation + higher batch size

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         val_ratio=None,
                         batch_size=512,
                         seed=42,
                         convert_training_to_iterable=False,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,512,tokenizer)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Filter (num_proc=4):   0%|          | 0/117430 [00:00<?, ? examples/s]

Filter (num_proc=4):   0%|          | 0/113205 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Time it takes to process + tokenize training texts: 35.427 s
Time it takes to go through 5120 items: 0.746 s
Maximum memory usage: 794.441 MiB


### With iterable dataset

With filter

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=True,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Filter (num_proc=4):   0%|          | 0/117430 [00:00<?, ? examples/s]

Filter (num_proc=4):   0%|          | 0/113205 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Time it takes to process + tokenize training texts: 2.888 s
Time it takes to go through 1280 items: 0.571 s
Maximum memory usage: 752.379 MiB


With filter + metadatas concatenation

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=True,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Filter (num_proc=4):   0%|          | 0/117430 [00:00<?, ? examples/s]

Filter (num_proc=4):   0%|          | 0/113205 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Time it takes to process + tokenize training texts: 2.615 s
Time it takes to go through 1280 items: 0.547 s
Maximum memory usage: 804.832 MiB


With filter + metadatas concatenation + content transformation + content augmentation

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=True,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Filter (num_proc=4):   0%|          | 0/117430 [00:00<?, ? examples/s]

Filter (num_proc=4):   0%|          | 0/113205 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Time it takes to process + tokenize training texts: 22.078 s
Time it takes to go through 1280 items: 0.606 s
Maximum memory usage: 857.551 MiB


With filter + metadatas concatenation + content transformation + content augmentation + no shuffling

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=True,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer,shuffle_trn=False)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Filter (num_proc=4):   0%|          | 0/117430 [00:00<?, ? examples/s]

Filter (num_proc=4):   0%|          | 0/113205 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Time it takes to process + tokenize training texts: 22.369 s
Time it takes to go through 1280 items: 0.543 s
Maximum memory usage: 857.930 MiB


With filter + metadatas concatenation + content transformation + content augmentation + higher batch size

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         val_ratio=None,
                         batch_size=512,
                         seed=42,
                         convert_training_to_iterable=True,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,512,tokenizer)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Filter (num_proc=4):   0%|          | 0/117430 [00:00<?, ? examples/s]

Filter (num_proc=4):   0%|          | 0/113205 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Time it takes to process + tokenize training texts: 21.482 s
Time it takes to go through 5120 items: 2.150 s
Maximum memory usage: 752.199 MiB


### With streaming (v1)

With filter

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=True)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=True,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer)

Time it takes to process + tokenize training texts: 0.002 s
Time it takes to go through 1280 items: 1.327 s
Maximum memory usage: 686.387 MiB


With filter + metadatas concatenation

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=True)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
                         metadatas=['Title','Division Name'],
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=True,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer)

Time it takes to process + tokenize training texts: 0.002 s
Time it takes to go through 1280 items: 1.470 s
Maximum memory usage: 803.281 MiB


With filter + metadatas concatenation + content transformation + content augmentation

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=True)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=True,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer)

Time it takes to process + tokenize training texts: 0.082 s
Time it takes to go through 1280 items: 95.631 s
Maximum memory usage: 6908.953 MiB


With filter + metadatas concatenation + content transformation + content augmentation + no shuffling

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=True)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=True,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer,shuffle_trn=False)

Time it takes to process + tokenize training texts: 0.078 s
Time it takes to go through 1280 items: 11.870 s
Maximum memory usage: 6892.258 MiB


With filter + metadatas concatenation + content transformation + content augmentation + higher batch size

In [None]:
# dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
#                     split='train',
#                     streaming=True)

# tdc = TextDataController(dset,
#                          main_text='Review Text',
#                          label_names='Department Name',
#                          filter_dict={'Review Text': lambda x: x is not None,
#                                       'Department Name': lambda x: x is not None,
#                                      },
#                          class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
#                          metadatas=['Title','Division Name'],
#                          content_transformations=[text_normalize,str.lower],
#                          content_augmentations= [nearby_aug_func,str.lower], 
#                          val_ratio=None,
#                          batch_size=512,
#                          seed=42,
#                          convert_training_to_iterable=True,
#                          verbose=False
#                         )
# benchmarking_and_memory_usage(tdc,512,tokenizer,shuffle_trn=False)

### With streaming (v2)

In [None]:
# Redefine the helpers without the `shuffle_trn` argument for the streaming controller
def benchmarking(tdc,bs,tokenizer,n=10):
    """Time processing + tokenization, then time iterating over the training set."""
    time1 = time.time()
    tdc.process_and_tokenize(tokenizer,max_length=512)
    time2 = time.time() 
    print(f'Time it takes to process + tokenize training texts: {(time2-time1):.3f} s')
    # Read items one by one and stop once n batches' worth of items have been seen
    for i,v in enumerate(tdc.main_ddict['train']):
        if n is not None and i==bs*n: break
    time3 = time.time()
    if n is not None:
        print(f'Time it takes to go through {n*bs} items: {(time3-time2):.3f} s')
    else:
        print(f'Time it takes to go through all items: {(time3-time2):.3f} s')

def benchmarking_and_memory_usage(tdc,bs,tokenizer,n=10):
    """Run `benchmarking` under memory_profiler and print the largest memory sample."""
    mem_usage = memory_usage((benchmarking,[tdc,bs,tokenizer,n]))
    print(f'Maximum memory usage: {max(mem_usage):.3f} MiB')


With filter

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=True)

tdc = TextDataControllerStreaming(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
                         batch_size=bs,
                         seed=42,
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer)

Time it takes to process + tokenize training texts: 0.808 s
Time it takes to go through 1280 items: 0.639 s
Maximum memory usage: 672.266 MiB


With filter + metadatas concatenation

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=True)

tdc = TextDataControllerStreaming(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
                         metadatas=['Title','Division Name'],
                         batch_size=bs,
                         seed=42,
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer)

Time it takes to process + tokenize training texts: 0.818 s
Time it takes to go through 1280 items: 0.568 s
Maximum memory usage: 679.590 MiB


With filter + metadatas concatenation + content transformation + content augmentation

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=True)

tdc = TextDataControllerStreaming(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         batch_size=bs,
                         seed=42,
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer)

Time it takes to process + tokenize training texts: 0.826 s
Time it takes to go through 1280 items: 1.599 s
Maximum memory usage: 679.723 MiB


With filter + metadatas concatenation + content transformation + content augmentation + higher batch size

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=True)

tdc = TextDataControllerStreaming(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         class_names_predefined=['Bottoms', 'Dresses', 'Intimate', 'Jackets', 'Tops', 'Trend'],
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         batch_size=512,
                         seed=42,
                        )
benchmarking_and_memory_usage(tdc,512,tokenizer)

Time it takes to process + tokenize training texts: 0.835 s
Time it takes to go through 5120 items: 5.734 s
Maximum memory usage: 677.559 MiB


### Test the effect of batch size and num_proc

Text processing and tokenization are the most time-consuming steps, so we check how different batch sizes and `num_proc` values affect their running time.

In [None]:
bs=16

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=False,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer,n=None,shuffle_trn=False)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Filter (num_proc=4):   0%|          | 0/117430 [00:00<?, ? examples/s]

Filter (num_proc=4):   0%|          | 0/113205 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map:   0%|          | 0/113140 [00:00<?, ? examples/s]

Time it takes to process + tokenize training texts: 71.757 s
Time it takes to go through all items: 15.128 s
Maximum memory usage: 1041.410 MiB


In [None]:
bs=128

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=False,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer,n=None,shuffle_trn=False)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Filter (num_proc=4):   0%|          | 0/117430 [00:00<?, ? examples/s]

Filter (num_proc=4):   0%|          | 0/113205 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map:   0%|          | 0/113140 [00:00<?, ? examples/s]

Time it takes to process + tokenize training texts: 60.165 s
Time it takes to go through all items: 15.950 s
Maximum memory usage: 831.129 MiB


In [None]:
bs=128*10

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=False,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer,n=None,shuffle_trn=False)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Filter (num_proc=4):   0%|          | 0/117430 [00:00<?, ? examples/s]

Filter (num_proc=4):   0%|          | 0/113205 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map:   0%|          | 0/113140 [00:00<?, ? examples/s]

Time it takes to process + tokenize training texts: 58.748 s
Time it takes to go through all items: 16.631 s
Maximum memory usage: 845.074 MiB


In [None]:
bs=128*10

dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=False,
                         num_proc=16,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer,n=None,shuffle_trn=False)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Filter (num_proc=16):   0%|          | 0/117430 [00:00<?, ? examples/s]

Filter (num_proc=16):   0%|          | 0/113205 [00:00<?, ? examples/s]

Map (num_proc=16):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=16):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=16):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=16):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=16):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map (num_proc=16):   0%|          | 0/113140 [00:00<?, ? examples/s]

Map:   0%|          | 0/113140 [00:00<?, ? examples/s]

Time it takes to process + tokenize training texts: 47.417 s
Time it takes to go through all items: 16.684 s
Maximum memory usage: 1009.738 MiB


Conclusion: increasing both batch size and `num_proc` helps reduce the processing + tokenization time, but the relationship between batch size, `num_proc`, and running time is not linear.
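To make the non-linearity concrete, here is a minimal sketch that recomputes the speedups from the four "process + tokenize" timings printed above. The `(batch_size, num_proc)` keys and the `timings` / `speedups` names are only illustrative; `num_proc=4` for the first three runs is read off the progress bars. These are single runs on one machine, so treat the ratios as rough indications rather than precise measurements.

```python
# Timings (seconds) copied from the process + tokenize outputs above.
# Keys are (batch_size, num_proc).
timings = {
    (16, 4): 71.757,
    (128, 4): 60.165,
    (1280, 4): 58.748,
    (1280, 16): 47.417,
}

# Use the smallest configuration as the baseline.
baseline = timings[(16, 4)]
speedups = {cfg: round(baseline / t, 3) for cfg, t in timings.items()}

for (bs, num_proc), s in speedups.items():
    print(f"batch_size={bs:<5} num_proc={num_proc:<3} speedup vs. baseline: {s:.3f}x")
```

An 80x larger batch size gives only about a 1.2x speedup here, and going from 4 to 16 processes on top of that adds a similarly modest gain.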

## Improving processing time with caching

The worst processing time is recorded with a non-iterable training set and the following preprocessing: 2-column filtering, 2-column metadatas, 2 content transformations, and 2 content augmentations. The total preprocessing time is ~62 s for the 117k-row dataset. However, this setup gives the best data iteration time: 0.183 s to go through 1280 items.

With caching, we can significantly reduce the preprocessing time: you only need to run the preprocessing once, and all subsequent calls take advantage of the cached result.
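The idea of "process once, reuse afterwards" can be illustrated with a toy, in-memory sketch. This is not how Hugging Face `datasets` implements its cache; `_cache`, `fingerprint`, and `process_with_cache` are hypothetical names used only to show the principle of keying a processed result by its inputs and processing steps.

```python
import hashlib
import json

_cache = {}  # hypothetical in-memory stand-in for on-disk cache files


def fingerprint(rows, step_names):
    """Hash the input rows together with the names of the processing steps."""
    payload = json.dumps({"rows": rows, "steps": step_names}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def process_with_cache(rows, steps):
    """Apply `steps` in order to every row, reusing a cached result if one exists."""
    key = fingerprint(rows, [s.__name__ for s in steps])
    if key in _cache:
        return _cache[key], True   # cache hit: nothing is recomputed
    out = rows
    for step in steps:
        out = [step(r) for r in out]
    _cache[key] = out
    return out, False              # cache miss: result computed and stored


rows = ["Love This DRESS ", "Runs Small"]
steps = [str.strip, str.lower]
first, hit1 = process_with_cache(rows, steps)    # first call does the work
second, hit2 = process_with_cache(rows, steps)   # second call reuses the result
print(first, hit1, hit2)
```

Changing either the rows or the list of steps produces a different key, so the work is redone only when something actually changed.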

In [None]:
enable_caching()

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv' for i in range(5)],
                    split='train',
                    streaming=False)

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         val_ratio=None,
                         batch_size=bs,
                         seed=42,
                         convert_training_to_iterable=False,
                         verbose=False
                        )
benchmarking_and_memory_usage(tdc,bs,tokenizer)

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)
Loading cached processed dataset at /home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-a8e48b2fdcc1675b_*_of_00004.arrow
Loading cached processed dataset at /home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-7f67ed2247bad412_*_of_00004.arrow
Loading cached processed dataset at /home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-8895dee11a0750d6_*_of_00004.arrow
Loading cached processed dataset at /home/quan/.cache/huggingface/datasets/csv/sample_data-b5f53892a1b938ad/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec

Time it takes to process + tokenize training texts: 1.471 s
Time it takes to go through 1280 items: 0.176 s
Maximum memory usage: 874.715 MiB


## Conclusion

With a CPU batch size of 128 and data iteration over 1280 items (10 batches):

1. Time to process + tokenize. Unit: seconds

|                      | Filtering | + 2-column metadatas | + 2 tfms and 2 augs | + no train shuffling |
|----------------------|-----------|----------------------|---------------------|----------------------|
| no iterable training | 37.038    | 40.147               | 62.309              | 59.452               |
| iterable training    | 2.85      | 2.623                | 22.31               | 22.421               |
| streaming            | 0.002     | 0.002                | 0.084               | 0.08                 |

2. Time to loop through 1280 items (10 batches). Unit: seconds

|                      | Filtering | + 2-column metadatas | + 2 tfms and 2 augs | + no train shuffling |
|----------------------|-----------|----------------------|---------------------|----------------------|
| no iterable training | 0.155     | 0.181                | 0.183               | 0.184                |
| iterable training    | 0.464     | 0.544                | 0.562               | 0.474                |
| streaming            | 1.244     | 1.365                | 95.443              | 11.529               |

3. Maximum memory usage. Unit: MiB

|                      | Filtering | + 2-column metadatas | + 2 tfms and 2 augs | + no train shuffling |
|----------------------|-----------|----------------------|---------------------|----------------------|
| no iterable training | 762.734   | 806.473              | 859.008             | 867.031              |
| iterable training    | 799.742   | 838.613              | 891.176             | 892                  |
| streaming            | 752.238   | 829.074              | 6955.02             | 6841.391             |
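The first two tables trade off against each other: setups that are cheap up front tend to be expensive to iterate. As a rough sketch, the snippet below adds the process + tokenize time to the 10-batch iteration time for the last two columns. The dictionary and function names are illustrative only, and the totals cover just the first 1280 items, so they say nothing definitive about a full epoch.

```python
# Values (seconds) copied from tables 1 and 2 above.
# Each entry is (process + tokenize time, time to loop through 1280 items).
with_tfms_and_augs = {
    "no iterable training": (62.309, 0.183),
    "iterable training": (22.31, 0.562),
    "streaming": (0.084, 95.443),
}
no_train_shuffling = {
    "no iterable training": (59.452, 0.184),
    "iterable training": (22.421, 0.474),
    "streaming": (0.08, 11.529),
}


def totals(column):
    """Sum the up-front and iteration cost for each setup."""
    return {setup: round(prep + loop, 3) for setup, (prep, loop) in column.items()}


totals_shuffled = totals(with_tfms_and_augs)
totals_unshuffled = totals(no_train_shuffling)

print(f"{'setup':<22} {'shuffled':>10} {'unshuffled':>12}")
for setup in with_tfms_and_augs:
    print(f"{setup:<22} {totals_shuffled[setup]:>8.3f} s {totals_unshuffled[setup]:>10.3f} s")
```

For streaming, almost all of the cost sits in iteration, which is why turning off training shuffling changes its total so much more than it does for the other two setups.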

## Tips and tricks

- For non-streaming data, the best way to minimize processing and iteration time is:
    - Use non-iterable training (which means don't turn training set into an Iterable Dataset)
    - Turn on dataset caching, and run the processing step once for it to be cached
- If caching is not an option, then use iterable training (turn the training set into an Iterable Dataset)
- The more content transformations and augmentations you add, the slower processing and iteration become. This is especially true for streaming data
- For streaming data, which might be the slowest option, here are a few ways to speed up the whole pipeline:
    - Define a validation split in your dataset ahead of time; don't use the validation split functionality of `TextDataController`
    - Minimize the number of content transformations and content augmentations
    - Turn off `shuffle_trn`
    - Set a smaller CPU batch size. E.g. on my machine with 64 GB of RAM and this 117k-row dataset, I can only set the batch size up to 200 to avoid memory errors
