<a href="https://colab.research.google.com/github/Zarasim/LLM_projects/blob/main/course/chapter5/section4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Big data? 🤗 Datasets to the rescue!

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
from datasets import load_dataset

In [20]:
# With streaming=True, load_dataset returns an IterableDataset that yields examples lazily
dataset = load_dataset("oscar", "unshuffled_deduplicated_af", split="train", streaming=True, trust_remote_code=True)

In [21]:
import psutil

# Process.memory_info is expressed in bytes, so convert to megabytes
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")

RAM used: 675.41 MB


In [24]:
dataset

IterableDataset({
    features: ['id', 'text'],
    n_shards: 1
})

In [26]:
next(iter(dataset))

{'id': 0,
 'text': "aanlyn markte as gevolg van ons voortgesette 'n begrip opsie handel sakeplan pdf terwyl ons steeds die gereelde ons binêre opsies handel"}

In [18]:
# A streamed dataset cannot be indexed, so the following line would raise an error
# dataset[0]
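
Since indexing is not available, one way to reach a specific position is to iterate up to it. The sketch below uses `itertools.islice` from the standard library, which works on any iterable and so should also work on a streamed dataset. The names `nth_example` and `fake_stream` are hypothetical helpers introduced only for illustration; `fake_stream` stands in for the real dataset so the snippet runs without a download.

```python
from itertools import islice


def nth_example(stream, n):
    """Return the n-th item (0-based) of an iterable by consuming it up to n.

    Raises StopIteration if the stream has fewer than n + 1 items.
    """
    return next(islice(stream, n, n + 1))


def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields dict examples lazily."""
    for i in range(10):
        yield {"id": i, "text": f"example {i}"}


third = nth_example(fake_stream(), 3)
print(third)
```

Note that this is a linear scan: reaching position n means reading every example before it, which is the price of not holding the corpus in memory.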

The elements from a streamed dataset can be processed on the fly using IterableDataset.map(), which is useful during training if you need to tokenize the inputs. The process is exactly the same as the one we used to tokenize our dataset in Chapter 3, with the only difference being that outputs are returned one by one:

In [27]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("almanach/camembert-base")
tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"]))
next(iter(tokenized_dataset))

{'id': 0,
 'text': "aanlyn markte as gevolg van ons voortgesette 'n begrip opsie handel sakeplan pdf terwyl ons steeds die gereelde ons binêre opsies handel",
 'input_ids': [5,
  33,
  364,
  23908,
  21,
  10963,
  487,
  33,
  10,
  6844,
  4821,
  383,
  5770,
  91,
  10,
  7316,
  8607,
  5697,
  1606,
  21,
  11,
  255,
  2446,
  15961,
  4134,
  6634,
  22870,
  647,
  77,
  2332,
  5295,
  9650,
  18319,
  640,
  4034,
  91,
  10,
  16590,
  816,
  10,
  18,
  521,
  11488,
  35,
  647,
  234,
  91,
  10,
  16445,
  3355,
  346,
  4134,
  10,
  2364,
  22870,
  647,
  6],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1]}

💡 To speed up tokenization with streaming you can pass batched=True, as we saw in the last section. It will process the examples batch by batch; the default batch size is 1,000 and can be specified with the batch_size argument.
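
To make the batching idea concrete, here is a pure-Python sketch of what a batched map over a stream might look like conceptually. It is not the library's implementation: `batched_map` and `count_words` are hypothetical names, and `count_words` is a cheap stand-in for a tokenizer call. The function groups examples into column-wise batches (a dict of lists), applies the function once per batch, and then yields the results one example at a time.

```python
from itertools import islice


def batched_map(stream, fn, batch_size=1000):
    """Apply fn to column-wise batches of examples, yielding results one by one."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            break
        # Turn a list of dicts into a dict of lists, the shape a batched function receives
        batch = {key: [example[key] for example in chunk] for key in chunk[0]}
        # Merge the new columns returned by fn with the original columns
        merged = {**batch, **fn(batch)}
        for i in range(len(chunk)):
            yield {key: values[i] for key, values in merged.items()}


def count_words(batch):
    """Hypothetical stand-in for a tokenizer: adds one new column per batch."""
    return {"n_words": [len(text.split()) for text in batch["text"]]}


examples = [{"id": i, "text": "woord " * (i + 1)} for i in range(5)]
results = list(batched_map(examples, count_words, batch_size=2))
print(results[0])
```

The function is called three times here (batches of 2, 2, and 1) instead of five, which is where the speed-up comes from when the per-call overhead is significant.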

In [28]:
shuffled_dataset = dataset.shuffle(buffer_size=100, seed=42)
next(iter(shuffled_dataset))

{'id': 8,
 'text': "Ons blog bestaan om God alle eer te gee. Ons wil vroue wees wat lewe volgens God se riglyne in Sy woord. Ons wil vir mekaar getuig van God se seen en van Sy voorskrifte. Op so 'n manier wil ons ‘n ondersteuning netwerk vorm van vroue vir vroue om mekaar te bemoedig, versterk en op te skerp in die volmaakte wil van God vir ons lewens en gesinne.\nDie komitee bestaan uit 5 vroue wat aktief lid is van die bespreking groep. Elke 6 maande word 'n nuwe komitee saamgestel."}
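
The buffer-based shuffling used in the call above can be illustrated with a small pure-Python sketch. This is a simplified model written under the assumption that the buffer is filled first and then one random slot is swapped out per incoming example; `buffered_shuffle` is a hypothetical name, and the sketch will not reproduce the exact order that the library yields for the same seed.

```python
import random


def buffered_shuffle(stream, buffer_size, seed=None):
    """Approximately shuffle a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before yielding anything
            buffer.append(item)
            continue
        # Yield a random buffered item and put the incoming item in its slot
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: shuffle and flush what is left in the buffer
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(1000), buffer_size=100, seed=42))
print(shuffled[:5])
```

A larger buffer gives a result closer to a full shuffle at the cost of more memory; with a buffer of 100, the first example yielded can only come from the first 100 items of the stream.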

Unlike Dataset.shuffle(), IterableDataset.shuffle() only shuffles the elements within a buffer of buffer_size examples. In this example, it drew a random example from a buffer holding the first 100 examples of the stream. Once an example is yielded, its spot in the buffer is filled with the next example in the corpus (i.e., the 101st example in the case above).

You can also select elements from a streamed dataset using the IterableDataset.take() and IterableDataset.skip() functions, which act in a similar way to Dataset.select(). For example, to select the first 5 examples in the OSCAR Afrikaans subset loaded above, we can do the following:

In [31]:
dataset_head = dataset.take(5)
list(dataset_head)

[{'id': 0,
  'text': "aanlyn markte as gevolg van ons voortgesette 'n begrip opsie handel sakeplan pdf terwyl ons steeds die gereelde ons binêre opsies handel"},
 {'id': 1,
  'text': 'Die ekonomiese posisie van die Asiaat in Suid-Afrika en enkele ander gebiede in Afrika 3 2 1 0 1 0 0'},
 {'id': 2,
  'text': 'Nadat dit duidelik geword het dat die Regering die aangeleentheid nie verder sou voer nie, het die Volksraad die ANC-regering se miskenning van internasionaal-aanvaarde regte en verpligtinge en die vergrype teen ons volk, onder die aandag van die internasionale gemeenskap gebring.'},
 {'id': 3,
  'text': "Liefste Alicia, Graag wil ek jou uit my hart bedank dat jy my in 'n prinsessie omtower het! Asook die meisies wat jou gehelp het. Jy het die dag ekstra spesiaal gemaak deur jou sagte persoonlikheid, jou perfekte timing, jou glimlag en jou opgewondenheid vir ons sprokiesdag, dankie dat jy kon deel wees van ons mooiste dag! Groete, Nadia"},
 {'id': 4,
  'text': "En Ek sal jou die sl

In [32]:
# Skip the first 1,000 examples and include the rest in the training set
train_dataset = dataset.skip(1000)
# Take the first 1,000 examples for the validation set
validation_dataset = dataset.take(1000)
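
As a sanity check on this kind of split, the sketch below mimics take and skip with `itertools.islice` on a small stand-in stream and confirms that the two parts do not overlap. `make_stream` is a hypothetical generator used in place of the real dataset, and the split size of 5 is chosen only to keep the example small.

```python
from itertools import islice


def make_stream(n=20):
    """Hypothetical stand-in for a streamed dataset with n examples."""
    for i in range(n):
        yield {"id": i}


# Analogous to take(5): the first 5 examples
validation = list(islice(make_stream(), 5))
# Analogous to skip(5): everything after the first 5 examples
train = list(islice(make_stream(), 5, None))

validation_ids = {example["id"] for example in validation}
train_ids = {example["id"] for example in train}
print(len(validation), len(train), validation_ids.isdisjoint(train_ids))
```

Each split starts from a fresh pass over the stream, which mirrors how the two streamed splits above are each derived from the same underlying dataset object.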