**LLM Workshop 2024 by Sebastian Raschka**

This code is based on *Build a Large Language Model (From Scratch)*, [https://github.com/rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)

<br>
<br>
<br>
<br>

# 2) Understanding LLM Input Data

Packages that are being used in this notebook:

In [1]:
!pip3 install torch tiktoken



In [2]:
from importlib.metadata import version


print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.6.0
tiktoken version: 0.9.0


- This notebook provides a brief overview of the data preparation and sampling procedures to get input data "ready" for an LLM
- Understanding what the input data looks like is a great first step towards understanding how LLMs work

<img src="https://camo.githubusercontent.com/590a463dcb825375473c9fd366013e86204589d68be0bd0207d43b158ba10558/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830325f636f6d707265737365642f30312e776562703f74696d657374616d703d31" width="700px">

<br>
<br>
<br>
<br>

# 2.1 Tokenizing text

- In this section, we tokenize text, which means breaking text into smaller units, such as individual words and punctuation characters

<img src="https://camo.githubusercontent.com/b92bf8c18c5d51258b4a8a55d9612fd1a2eb5f3b6c6a79fc0a1d7b1ba59a2e99/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830325f636f6d707265737365642f30342e77656270" width="600px">

Link to the text: https://drive.google.com/file/d/1H_ZU_35t3sqg9LklLau-twiWKNQraonx/view?usp=sharing

- Load raw text we want to work with
- [The Verdict by Edith Wharton](https://en.wikisource.org/wiki/The_Verdict) is a public domain short story

In [3]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of character:", len(raw_text))
print(raw_text[:99])

Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


- The goal is to tokenize and embed this text for an LLM
- Let's develop a simple tokenizer based on some simple sample text that we can then later apply to the text above

<img src="https://camo.githubusercontent.com/241f7a302c33bc1e8156e7d0b153caae8728f2c9cd03884487c05d931fd88be2/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830325f636f6d707265737365642f30352e77656270" width="600px">

- The following regular expression will split on whitespaces and punctuation

#### Exercise

Write code that uses a regular expression to split the text wherever one of the following characters occurs:

- ,
- .
- :
- ;
- ?
- _
- !
- "
- (
- )
- '

or:

- "--"

or:

- " " (any whitespace character)


In [4]:
import re

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item for item in preprocessed if item]
print(preprocessed[:38])

['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius', '--', 'though', ' ', 'a', ' ', 'good', ' ', 'fellow', ' ', 'enough', '--', 'so', ' ', 'it', ' ', 'was', ' ', 'no', ' ']


In [5]:
print("Number of tokens:", len(preprocessed))

Number of tokens: 8405
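
To make the role of the capture group concrete, here is a minimal sketch on a made-up sentence. The sample string and the `split_tokens` helper are illustrative names, not part of the workshop code. Because the whole pattern is wrapped in parentheses, `re.split` keeps the delimiters in the result instead of discarding them. Unlike the cell above, this variant also drops whitespace-only items, which is what the tokenizer class in section 2.2 does:

```python
import re

# Same pattern as in the cell above; the outer parentheses form a capture
# group, so re.split returns the delimiters as separate items.
PATTERN = r'([,.:;?_!"()\']|--|\s)'


def split_tokens(text):
    """Split text on punctuation, double dashes, and whitespace."""
    pieces = re.split(PATTERN, text)
    # Drop empty strings and whitespace-only items
    return [p.strip() for p in pieces if p.strip()]


sample = "Hello, world. Is this-- a test?"  # illustrative sample text
print(split_tokens(sample))
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Whether to keep whitespace tokens is a design choice: keeping them allows an exact reconstruction of the text, while dropping them gives a smaller vocabulary and shorter token sequences.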


<br>
<br>
<br>
<br>

# 2.2 Converting tokens into token IDs

- Next, we convert the text tokens into token IDs that we can process via embedding layers later
- For this we first need to build a vocabulary

<img src="https://camo.githubusercontent.com/bf01ba4b1b924633325cda845feac84ae1a3f154db5098f5d70e90470ff4484e/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830325f636f6d707265737365642f30362e77656270" width="900px">

- The vocabulary contains the unique tokens in the input text (words, punctuation marks, and, in this notebook, whitespace characters)

In [6]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

1132


#### Exercise

Next, you will create the vocabulary. The simplest way is to use `all_words` to build a dictionary of the form `{word1: 0, word2: 1, ...}`.

In [7]:
vocab = {token:integer for integer,token in enumerate(all_words)}

- Below are the first 51 entries in this vocabulary (IDs 0 through 50):

In [8]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('\n', 0)
(' ', 1)
('!', 2)
('"', 3)
("'", 4)
('(', 5)
(')', 6)
(',', 7)
('--', 8)
('.', 9)
(':', 10)
(';', 11)
('?', 12)
('A', 13)
('Ah', 14)
('Among', 15)
('And', 16)
('Are', 17)
('Arrt', 18)
('As', 19)
('At', 20)
('Be', 21)
('Begin', 22)
('Burlington', 23)
('But', 24)
('By', 25)
('Carlo', 26)
('Chicago', 27)
('Claude', 28)
('Come', 29)
('Croft', 30)
('Destroyed', 31)
('Devonshire', 32)
('Don', 33)
('Dubarry', 34)
('Emperors', 35)
('Florence', 36)
('For', 37)
('Gallery', 38)
('Gideon', 39)
('Gisburn', 40)
('Gisburns', 41)
('Grafton', 42)
('Greek', 43)
('Grindle', 44)
('Grindles', 45)
('HAD', 46)
('Had', 47)
('Hang', 48)
('Has', 49)
('He', 50)
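
The ordering in the listing above follows from `sorted`, which compares strings by Unicode code point: whitespace and punctuation come first, then uppercase letters, then lowercase letters. The sketch below repeats the vocabulary construction on a toy token list (the list is made up for illustration) and also builds the inverse mapping that decoding needs:

```python
# Toy token list (illustrative); in the notebook this would be `preprocessed`
tokens = ["the", "cat", ",", "the", "dog", "."]

# Sorting the unique tokens makes the ID assignment reproducible
unique_tokens = sorted(set(tokens))
str_to_int = {token: idx for idx, token in enumerate(unique_tokens)}

# The inverse mapping turns IDs back into tokens
int_to_str = {idx: token for token, idx in str_to_int.items()}

print(str_to_int)
# {',': 0, '.': 1, 'cat': 2, 'dog': 3, 'the': 4}

ids = [str_to_int[t] for t in tokens]
print(ids)
# [4, 2, 0, 4, 3, 1]
```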


- Below, we illustrate the tokenization of a short sample text using a small vocabulary:

<img src="https://camo.githubusercontent.com/8955d3aea45dc06f156d0579f7f3302c27b6635e649c301dbab33427b2d8d2a8/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830325f636f6d707265737365642f30372e776562703f313233" width="600px">

- Let's now put it all together into a tokenizer class

As you can imagine, in real-world projects this is not done with loose functions in a Jupyter notebook. Instead, we write classes that organize and modularize the code so that it can be reused.

Next, you will create a class called `SimpleTokenizerV1` with an `__init__(self, vocab)` method that stores the vocabulary so that we can convert from token to ID (int) and from ID (int) to token.

It will also have an `encode(self, text)` method that returns the IDs of the input text. Implementing this method is straightforward if you build on what we did earlier.

Finally, there will be a `decode(self, ids)` method that converts IDs back into text. In this method, use `self.int_to_str` to map IDs to tokens, and then join the tokens with spaces to obtain a string.

The last line, `text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)`, removes the whitespace **before** the punctuation characters listed in the first argument of `re.sub(...)`.
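
As a quick illustration of that `re.sub` call, the sketch below joins a hand-written token list with spaces and then applies the same substitution. The token list is made up for this example. Note that the pattern only removes whitespace in front of the listed characters, so the space after an opening quote or after an apostrophe is left in place:

```python
import re

# Illustrative tokens, as they might come out of the decoder's lookup step
tokens = ['"', "It", "'", "s", "fine", ",", "he", "said", "."]

joined = " ".join(tokens)
# \s+ matches the whitespace, the capture group matches the punctuation,
# and r'\1' keeps only the punctuation.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)   # " It ' s fine , he said .
print(cleaned)  # " It' s fine, he said.
```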

In [9]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

- The `encode` function turns text into token IDs
- The `decode` function turns token IDs back into text

<img src="https://camo.githubusercontent.com/b324e29fe9d3d4191a9200d6a08983eef4d3f835cff85ce1ee4aceb47117891a/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830325f636f6d707265737365642f30382e776562703f313233" width="600px">

- We can use the tokenizer to encode (that is, tokenize) texts into integers
- These integers can then be embedded (later) as input for the LLM

In [10]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know,"
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[3, 58, 4, 852, 990, 604, 535, 748, 7, 1128, 598, 7, 3, 69, 9, 40, 853, 1110, 756, 795, 9]


- We can decode the integers back into text

In [11]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [12]:
tokenizer.decode(tokenizer.encode(text))

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

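If everything went well, these last two cells should return the same string. It matches the contents of `text` except for whitespace: `decode` joins the tokens with single spaces, so the line break and indentation are lost and extra spaces appear after the opening quote and the apostrophe.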
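
The round trip only works for tokens that are already in the vocabulary, because `encode` looks every token up in a dictionary. The sketch below shows the failure mode on a tiny made-up corpus. The helper names `tokenize`, `build_vocab`, and `encode` are illustrative stand-ins for the class above, not part of the workshop code:

```python
import re

PATTERN = r'([,.:;?_!"()\']|--|\s)'


def tokenize(text):
    """Split text into tokens, dropping whitespace-only items."""
    return [t.strip() for t in re.split(PATTERN, text) if t.strip()]


def build_vocab(text):
    """Map each unique token to an integer ID, in sorted order."""
    return {tok: i for i, tok in enumerate(sorted(set(tokenize(text))))}


def encode(text, vocab):
    # A plain dictionary lookup raises KeyError for unseen tokens
    return [vocab[tok] for tok in tokenize(text)]


vocab = build_vocab("the cat sat on the mat.")
print(encode("the cat sat.", vocab))  # [5, 1, 4, 0]

try:
    encode("the dog sat.", vocab)
except KeyError as err:
    print("Unknown token:", err)  # Unknown token: 'dog'
```

This limitation is one motivation for subword tokenizers such as the BPE tokenizer in the next section, which can fall back to smaller units for unseen words.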

<br>
<br>
<br>
<br>

# 2.3 BytePair encoding

Byte-Pair Encoding (BPE) is a technique that encodes text at the subword level. Its basic procedure is:

1. Start from a vocabulary containing all individual characters
2. Repeatedly add a new token by merging the pair of adjacent tokens that occurs most frequently in the training text
3. Continue until the vocabulary reaches a predefined size (a hyperparameter)
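
The three steps can be sketched in plain Python as follows. This is a simplified, character-level illustration, not the byte-level implementation used by GPT-2 or `tiktoken`. The word frequencies form an assumed toy corpus, chosen so that the learned merges match the example that follows, and the function names (`count_pairs`, `apply_merge`, `train_bpe`) are hypothetical:

```python
from collections import Counter


def count_pairs(word_freqs, splits):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts


def apply_merge(pair, splits):
    """Replace every occurrence of `pair` with the merged symbol."""
    left, right = pair
    merged_splits = {}
    for word, symbols in splits.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_splits[word] = out
    return merged_splits


def train_bpe(word_freqs, vocab_size):
    # Step 1: start from the individual characters
    splits = {word: list(word) for word in word_freqs}
    vocab = sorted({ch for word in word_freqs for ch in word})
    merges = []
    # Step 3: stop once the target vocabulary size is reached
    while len(vocab) < vocab_size:
        counts = count_pairs(word_freqs, splits)
        if not counts:
            break
        # Step 2: merge the most frequent pair (ties: first one encountered)
        best = max(counts, key=counts.get)
        splits = apply_merge(best, splits)
        merges.append(best)
        vocab.append(best[0] + best[1])
    return vocab, merges


# Assumed toy corpus: word -> number of occurrences
toy_corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab, merges = train_bpe(toy_corpus, vocab_size=10)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(vocab)   # ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
```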

For example:

1. We start from a base vocabulary of single characters: `b, g, h, n, p, s, u`
2. The following merges are learned:

```
("u", "g") -> "ug"
("u", "n") -> "un"
("h", "ug") -> "hug"
```

The word "bug" will be tokenized as ["b", "ug"]. In contrast, "mug" will be tokenized as ["[UNK]", "ug"], because the letter "m" is not part of the base vocabulary. Likewise, the word "thug" will be tokenized as ["[UNK]", "hug"]: the letter "t" is not in the base vocabulary, and applying the merge rules first merges "u" and "g" and then "h" and "ug".
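
The tokenization of these three words can be reproduced with a short sketch. It assumes a base vocabulary that contains the letters the example needs (`b`, `g`, `h`, `n`, `p`, `s`, `u`) and applies the three merge rules in the order in which they were learned. The function name `bpe_tokenize` is hypothetical:

```python
# Assumed base vocabulary and the merge rules from the example above
BASE_VOCAB = {"b", "g", "h", "n", "p", "s", "u"}
MERGES = [("u", "g"), ("u", "n"), ("h", "ug")]


def bpe_tokenize(word, base_vocab=BASE_VOCAB, merges=MERGES, unk="[UNK]"):
    """Tokenize one word by applying the merges in learned order."""
    # Characters outside the base vocabulary become the unknown token
    symbols = [ch if ch in base_vocab else unk for ch in word]
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


for word in ["bug", "mug", "thug"]:
    print(word, "->", bpe_tokenize(word))
# bug -> ['b', 'ug']
# mug -> ['[UNK]', 'ug']
# thug -> ['[UNK]', 'hug']
```

Byte-level BPE tokenizers such as GPT-2's avoid the `[UNK]` token altogether, because every possible byte is part of the base vocabulary.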

This is a much more sophisticated tokenization method than the simple tokenizer above, and efficient implementations of it are fast. In fact, BPE is the tokenization method used to train GPT-2, GPT-3, and ChatGPT, among others.

Here is a tutorial that explains in detail how it works: https://huggingface.co/learn/nlp-course/es/chapter6/5.

And here is another one by Sebastian Raschka that walks through an implementation step by step: https://sebastianraschka.com/blog/2025/bpe-from-scratch.html.

Since implementing it can be complicated, we will use an open-source Python library: `tiktoken` (https://github.com/openai/tiktoken), which implements the BPE algorithm very efficiently in Rust.


More info:
- GPT-2 used BytePair encoding (BPE) as its tokenizer
- It allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words
- For instance, if GPT-2's vocabulary doesn't have the word "unfamiliarword," it might tokenize it as ["unfam", "iliar", "word"] or some other subword breakdown, depending on its trained BPE merges
- The original BPE tokenizer can be found here: [https://github.com/openai/gpt-2/blob/master/src/encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py)
- In this lecture, we are using the BPE tokenizer from OpenAI's open-source [tiktoken](https://github.com/openai/tiktoken) library, which implements its core algorithms in Rust to improve computational performance
- (Based on an analysis [here](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb), I found that `tiktoken` is approx. 3x faster than the original tokenizer and 6x faster than an equivalent tokenizer in Hugging Face)

In [13]:
# pip install tiktoken

In [14]:
import importlib.metadata
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


In [15]:
tokenizer = tiktoken.get_encoding("gpt2")

In [16]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [17]:
strings = tokenizer.decode(integers)

print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


- BPE tokenizers break down unknown words into subwords and individual characters:

<img src="https://camo.githubusercontent.com/5938dff392e5cb7404d2636e4d7157fceb4c36ecf57a2173001bd3edf22234da/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830325f636f6d707265737365642f31312e77656270" width="600px">

In [18]:
tokenizer.encode("Akwirw ier", allowed_special={"<|endoftext|>"})

[33901, 86, 343, 86, 220, 959]

The `allowed_special` argument lets the special token `<|endoftext|>` be encoded as a single special token when it appears in the input text. Since "Akwirw ier" doesn't contain `<|endoftext|>`, the argument has no effect on the encoding of this particular input.

The `allowed_special` parameter is useful when you deliberately want certain special tokens in your input to be encoded as such, without raising errors or having them split into regular tokens.

By default, `tiktoken` raises an error when the input contains the text of a special token that has not been explicitly allowed. This is a safety measure against accidentally encoding special tokens, which might have unintended effects on model behavior.

<br>
<br>
<br>
<br>

# 2.4 Data sampling with a sliding window

Finally, let's look at the format of the training data. GPT (Generative Pre-trained Transformer) models learn to predict the next word given an input sequence.

The training data will therefore look like the following image.

<img src="https://camo.githubusercontent.com/b6245f4e6c64740c06f71ddd30d6495342b37315f0fd3556a0dc511be009a61f/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830325f636f6d707265737365642f31322e77656270" width="600px">

- For this, we use a sliding window approach that steps through the entire text, changing the position by +1

In the end, we will have a batch of data like this:

```
input_batch = [
    word1 word2 word3
    word2 word3 word4
    ...
    wordN-2 wordN-1 wordN
]
```

<img src="https://camo.githubusercontent.com/9c738e75095f70d3dc4f6b3630008dd67607b5fa92e3bf776b0ed2cbb68db299/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830325f636f6d707265737365642f31332e776562703f313233" width="900px">

However, notice that we are repeating the same words many times. In practice, the stride is therefore usually set equal to the context length, so that consecutive windows do not overlap.

Keep in mind that we are talking about the inputs here. The labels are still the next word for each element of the batch:

```
input_batch = [
    "<start-of-sequence>" palabra1 palabra2
    word1 word2 word3
    word2 word3 word4
    ...
    wordN-2 wordN-1 wordN
]
```

```
targets = [
    word3
    word4
    word5
    ...
    "<end-of-sequence>"
]
```
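
The effect of the stride is easiest to see on a list of plain integers. The sketch below is a framework-free approximation of the windowing logic used by the `GPTDatasetV1` class defined later in this notebook: each window holds `max_length` tokens, the input is the window without its last token, and the target is the window without its first token. The function name `make_windows` and the integer "token IDs" are illustrative:

```python
def make_windows(token_ids, max_length, stride):
    """Return (inputs, targets) built from sliding windows over token_ids."""
    inputs, targets = [], []
    for start in range(0, len(token_ids) - max_length, stride):
        window = token_ids[start:start + max_length]
        inputs.append(window[:-1])   # all tokens except the last
        targets.append(window[1:])   # the same tokens shifted by one
    return inputs, targets


token_ids = list(range(10))  # stand-in for real token IDs

# Stride equal to the window length: windows do not overlap
inputs, targets = make_windows(token_ids, max_length=4, stride=4)
print(inputs)   # [[0, 1, 2], [4, 5, 6]]
print(targets)  # [[1, 2, 3], [5, 6, 7]]

# Stride of 1: consecutive windows share all but one token
inputs_s1, targets_s1 = make_windows(token_ids, max_length=4, stride=1)
print(len(inputs_s1))  # 6
print(inputs_s1[:3])   # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```

A smaller stride produces more training samples from the same text at the cost of heavy overlap between them.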

<img src="https://camo.githubusercontent.com/181fa38c6bcf2259633e9a15874d189bf7a9f5ec1ca7b161f521a03ed27ec086/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830325f636f6d707265737365642f31342e77656270" width="600px">

Let's first have a look at our text:

In [19]:
print(raw_text[:100])

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


Now, we will use the function `create_dataloader_v1` in `supplementary.py` to create a `DataLoader`, a PyTorch object that lets us load the data efficiently in batches to train our model.

It is standard PyTorch code for creating a `DataLoader`; you can see it here as well:

```
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader
```

In [20]:
import torch
from torch.utils.data import Dataset
import tiktoken

class GPTDatasetV1(Dataset):
    def __init__(self, text, tokenizer, max_length, stride):
        """
        Args:
            text (str): The raw text.
            tokenizer: The tokenizer (e.g., from tiktoken.get_encoding("gpt2")).
            max_length (int): The number of tokens in each sliding window;
                inputs and targets each contain max_length - 1 tokens.
            stride (int): The step of the sliding window.
        """
        self.tokenizer = tokenizer
        self.tokens = tokenizer.encode(text, allowed_special=set())
        self.max_length = max_length
        self.stride = stride
        self.samples = []
        
        # Create samples using a sliding window approach
        for i in range(0, len(self.tokens) - max_length, stride):
            self.samples.append(self.tokens[i : i + max_length])
            
    def __len__(self):
        return len(self.samples)
    
    def __getitem__(self, idx):
        tokens = self.samples[idx]
        # For language modeling, input is tokens[:-1], target is tokens[1:]
        input_ids = torch.tensor(tokens[:-1], dtype=torch.long)
        target_ids = torch.tensor(tokens[1:], dtype=torch.long)
        return input_ids, target_ids

In [21]:
from torch.utils.data import DataLoader

def create_dataloader_v1(txt, batch_size=4, max_length=256,
                        stride=128, shuffle=True, drop_last=True,
                        num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [22]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885],
        [ 1807,  3619,   402],
        [10899,  2138,   257],
        [15632,   438,  2016],
        [  922,  5891,  1576],
        [  568,   340,   373],
        [ 1049,  5975,   284],
        [  284,  3285,   326]])

Targets:
 tensor([[ 367, 2885, 1464],
        [3619,  402,  271],
        [2138,  257, 7026],
        [ 438, 2016,  257],
        [5891, 1576,  438],
        [ 340,  373,  645],
        [5975,  284,  502],
        [3285,  326,   11]])


Let's look at the contents of `inputs`:

In [23]:
for vector in inputs:
    strings = tokenizer.decode(vector.numpy())
    print(strings)

I HAD
 thought Jack G
burn rather a
 genius--though
 good fellow enough
so it was
 great surprise to
 to hear that


And now the `targets`:

In [24]:
for vector in targets:
    strings = tokenizer.decode(vector.numpy())
    print(strings)

 HAD always
 Jack Gis
 rather a cheap
--though a
 fellow enough--
 it was no
 surprise to me
 hear that,


**Notice anything strange?**

As I told you, the targets hold the word to predict, right? Why, then, does each target row contain 3 tokens rather than 1?


This is because transformers predict the likely next token for every input token, and they do so **for all tokens at once**. This is why we say that they take the context within the *context window* into account. In other words, this is what the data really looks like:

```
input_batch = [
    "<start-of-sequence>" palabra1 palabra2
    word1 word2 word3
    word2 word3 word4
    ...
    wordN-2 wordN-1 wordN
]
```

```
targets = [
    word1 word2 word3
    word2 word3 word4
    word3 word4 word5
    ...
    wordN-1 wordN "<end-of-sequence>"
]
```
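
To see why each target row has as many tokens as its input row, it helps to unroll one window into its individual prediction tasks. The sketch below uses the first four tokens of the text as strings (`'I'`, `' H'`, `'AD'`, `' always'`) and lists, for every input position, the context the model can see and the token it should predict. It assumes the causal setup of GPT-style models, where each position only attends to itself and earlier positions:

```python
# First window of the text, written as token strings for readability
window = ["I", " H", "AD", " always"]

inputs = window[:-1]   # ['I', ' H', 'AD']
targets = window[1:]   # [' H', 'AD', ' always']

# One prediction task per input position: context so far -> next token
pairs = [(inputs[: i + 1], targets[i]) for i in range(len(inputs))]

for context, target in pairs:
    print(repr("".join(context)), "->", repr(target))
# 'I' -> ' H'
# 'I H' -> 'AD'
# 'I HAD' -> ' always'
```

A single row of the batch therefore yields several training signals at once, one per position.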

Now let's repeat this with `stride=1` to see the difference:

In [25]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=1, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885],
        [  367,  2885,  1464],
        [ 2885,  1464,  1807],
        [ 1464,  1807,  3619],
        [ 1807,  3619,   402],
        [ 3619,   402,   271],
        [  402,   271, 10899],
        [  271, 10899,  2138]])

Targets:
 tensor([[  367,  2885,  1464],
        [ 2885,  1464,  1807],
        [ 1464,  1807,  3619],
        [ 1807,  3619,   402],
        [ 3619,   402,   271],
        [  402,   271, 10899],
        [  271, 10899,  2138],
        [10899,  2138,   257]])


In [26]:
for vector in inputs:
    strings = tokenizer.decode(vector.numpy())
    print(strings)

I HAD
 HAD always
AD always thought
 always thought Jack
 thought Jack G
 Jack Gis
 Gisburn
isburn rather


In [27]:
for vector in targets:
    strings = tokenizer.decode(vector.numpy())
    print(strings)

 HAD always
AD always thought
 always thought Jack
 thought Jack G
 Jack Gis
 Gisburn
isburn rather
burn rather a


Notice how words are split into subwords:

In [28]:
print(inputs[1])

tensor([ 367, 2885, 1464])


In [29]:
for i, token in enumerate(inputs[1]):
    print(f"token #{i}: '{tokenizer.decode([token.numpy()])}'")

token #0: ' H'
token #1: 'AD'
token #2: ' always'


# **Notice anything interesting?**

Notice that we are encoding the spaces! GPT models encode a space as part of the token, and it is displayed as the special symbol `Ġ`. This is model-dependent: BERT, for example, does not encode spaces. Both, however, encode punctuation marks. Keep in mind that the tokenizer has to be able to reconstruct the original text, including spaces and punctuation.

Example tokenization of `"Hello, how are  you?"` (with two spaces between "are" and "you", as the character offsets show):

BERT: `[('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]`

GPT: `[('Hello', (0, 5)), (',', (5, 6)), ('Ġhow', (6, 10)), ('Ġare', (10, 14)), ('Ġ', (14, 15)), ('Ġyou', (15, 19)), ('?', (19, 20))]`
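
The `Ġ` symbol is not arbitrary. The sketch below is adapted, as far as I can tell faithfully, from the byte-to-unicode table in GPT-2's original `encoder.py` (linked in section 2.3). Treat it as an illustration rather than a reference implementation. Printable bytes map to themselves, and all remaining bytes, including the space, are shifted to code points starting at 256 so that every byte has a visible character:

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that are already printable keep their own character:
    # '!'..'~', '¡'..'¬', and '®'..'ÿ'
    bs = (
        list(range(0x21, 0x7F))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes (control characters, space, ...) are
            # assigned code points 256, 257, ... in increasing byte order
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


byte_encoder = bytes_to_unicode()
print(byte_encoder[ord(" ")])   # Ġ
print(byte_encoder[ord("\n")])  # Ċ

text = " how are"
print("".join(byte_encoder[b] for b in text.encode("utf-8")))  # ĠhowĠare
```

The space is byte 32 and the 33rd byte without a printable character of its own, so it lands on code point 256 + 32 = 288, which is `Ġ`.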

<br>
<br>
<br>
<br>

# **Graded** exercise: Prepare your favorite dataset

If you want to try something similar to `the-verdict.txt` but in Spanish, you can use the stories in this Hugging Face dataset: https://huggingface.co/datasets/Fernandoefg/cuentos_es (more info here: https://www.linkedin.com/pulse/dataset-de-cuentos-en-espa%C3%B1ol-fernando-fuentes-gallegos-ssuyc/).

In [30]:
!pip3 install datasets



In [31]:
from datasets import load_dataset

# Load the Spanish cuentos dataset from Hugging Face
dataset = load_dataset("Fernandoefg/cuentos_es")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['author', 'country', 'years', 'title', 'category', 'content'],
        num_rows: 7239
    })
})


In [32]:
# Extract text from a chosen split (e.g. "train") and combine them
texts = dataset["train"]["content"]
spanish_text = "\n".join(texts)

In [33]:
print("Total Spanish characters:", len(spanish_text))
print(spanish_text[:300])

Total Spanish characters: 140347979
En un reino vivía una vez un comerciante con su mujer y su única hija, llamada Basilisa la Hermosa. Al cumplir la niña los ocho años se puso enferma su madre, y presintiendo su próxima muerte llamó a Basilisa, le dio una muñeca y le dijo:
-Escúchame, hijita mía, y acuérdate bien de mis últimas palab


In [34]:
# Create a dataloader with desired hyperparameters
dataloader_es = create_dataloader_v1(spanish_text, batch_size=8, max_length=32, stride=16, shuffle=False)

data_iter = iter(dataloader_es)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[ 4834,   555,   302,  2879,   410,   452, 29690,   555,    64,  1569,
            89,   555,   401,   263,   979, 12427,   369,   424,   285, 23577,
           263,   331,   424,  6184,   118,    77,  3970, 16836,    64,    11,
         32660],
        [  369,   424,   285, 23577,   263,   331,   424,  6184,   118,    77,
          3970, 16836,    64,    11, 32660,   321,  4763, 32520,  9160,  8591,
         18113,  8546,    13,   978, 10973,   489,   343,  8591, 37628, 30644,
         22346],
        [ 4763, 32520,  9160,  8591, 18113,  8546,    13,   978, 10973,   489,
           343,  8591, 37628, 30644, 22346,   267,  6679,   257, 12654,   418,
           384,  4192,    78,   551,  2232,  2611,   424,  8805,   260,    11,
           331],
        [ 6679,   257, 12654,   418,   384,  4192,    78,   551,  2232,  2611,
           424,  8805,   260,    11,   331,   906,   600,    72, 31110,   424,
           778, 10205,    87,  8083,   285, 15573,   660, 32660,   321,