<a href="https://colab.research.google.com/github/dpalacioj/llm-to-production/blob/feature_transformers/1_Static_Dynamic_Batching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers
!pip install datasets
!pip install torch
!pip install tqdm


In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import torch
from tqdm.auto import tqdm

**Official documentation**

* `pad_token` (str or tokenizers.AddedToken, optional) — A special token used to make arrays of tokens the same size for batching purpose. Will then be ignored by attention mechanisms or loss computation. Will be associated to self.pad_token and self.pad_token_id.

* `eos_token` (str or tokenizers.AddedToken, optional) — A special token representing the end of a sentence. Will be associated to `self.eos_token` and `self.eos_token_id`.

`padding_side` specifies on which side padding is added when a sequence is shorter than the longest one in the batch. With `'left'`, the padding is prepended. This matters for causal models, which continue generating from the last position, so the real tokens must sit at the end of the sequence.

`eos_token` marks the end of a sequence. GPT-2-style tokenizers such as this one ship without a padding token, so the EOS token is reused as `pad_token`. The padded positions are ignored only when the attention mask is passed to the model along with the token ids.
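To make the effect of `padding_side="left"` concrete, here is a minimal pure-Python sketch of what left padding produces. The helper `pad_left` and the toy token ids are hypothetical illustrations, not part of the `transformers` API; 50256 is GPT-2's EOS id, assumed here as the padding id.

```python
def pad_left(sequences, pad_id):
    """Left-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Padding goes in front, so the real tokens stay at the end
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padded positions, 1 marks real tokens
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


PAD_ID = 50256  # GPT-2's EOS id, reused here as the padding id (assumption)
ids, mask = pad_left([[11, 12, 13], [21]], PAD_ID)
print(ids)   # [[11, 12, 13], [50256, 50256, 21]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

With left padding the last column always holds a real token, so generation continues directly from the prompt. The mask is what tells the model to skip the padded positions.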

In [3]:
model = AutoModelForCausalLM.from_pretrained("TheFuzzyScientist/diabloGPT_open-instruct").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token


In [4]:
dataset = load_dataset("hakurei/open-instruct-v1", split="train")


Generating train split:   0%|          | 0/498813 [00:00<?, ? examples/s]

In [5]:
dataset = dataset.to_pandas()
dataset.head(2)

|   | output | input | instruction |
|---|--------|-------|-------------|
| 0 | 1. Eat a balanced diet and make sure to includ... |  | Give three tips for staying healthy. |
| 1 | The three primary colors are red, blue, and ye... |  | What are the three primary colors? |


## Text Generation Functions

The following function generates a response to a prompt and then returns only the text up to the first period. Because the decoded output begins with the prompt, the returned string includes the prompt as well.

In [26]:
def generate_text(prompt):
    # Call the tokenizer directly so the attention mask is returned along with the ids
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(model.device)
    print("Inputs (tokens):", inputs["input_ids"])

    outputs = model.generate(
        **inputs,
        max_length=150,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
    )
    print("Outputs (generated tokens):", outputs)
    print("\n" + "="*40 + "\n")

    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)  # Decoded text (prompt + completion)
    print("Generated (before slicing):", generated)
    print("\n" + "="*40 + "\n")

    # Keep the text up to and including the first "."; if there is none, return everything
    end = generated.find(".")
    return generated[: end + 1] if end != -1 else generated

In [27]:
generate_text("What is the best criteria to choose a hobbie?")


Inputs (tokens): tensor([[ 2061,   318,   262,  1266,  9987,   284,  3853,   257, 32724, 12590,
            30]], device='cuda:0')
Outputs (generated tokens): tensor([[ 2061,   318,   262,  1266,  9987,   284,  3853,   257, 32724, 12590,
            30,   220,   383,  1266,  9987,   329, 11236,   257, 32724, 12590,
          8338,   319,   534,  4661,   290, 15387,    13,  2773,   286,   262,
           749,  1593,  5087,   284,  2074,   618, 17246,   257, 17073, 12590,
          2291,    25,   198,    12, 20737,   284,  2193,   649,  4678,   290,
          7605,   198,    12,  3862,  7901,   198,    12, 15401,   290,  4067,
           198,    12, 13397,   460,   307,   281,  2071,    11,   355,   617,
           661,   743,   407,   423,  1895,   284,   257,   922,  1171,  9358,
          1080,    13, 12032,    11,   340,   338,  1593,   284,  2074,   262,
          1575,   286,  2877,   287,   257, 17073,   494,    11,   355,   867,
          4736,   389,  5140,  1969,  1978,    13, 

'What is the best criteria to choose a hobbie?  The best criteria for choosing a hobbie depends on your goals and preferences.'
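The slicing step relies on `str.find`, which returns -1 when no period is present; an unguarded slice `text[: -1 + 1]` would then be empty. A minimal standalone sketch of a guarded version follows. The helper name `first_sentence` is hypothetical and not part of the notebook.

```python
def first_sentence(text):
    """Return text up to and including the first period, or all of it if there is none."""
    end = text.find(".")
    # str.find returns -1 when "." is absent; fall back to the full text in that case
    return text[: end + 1] if end != -1 else text


with_period = first_sentence("Q? It depends on your goals. More text follows.")
without_period = first_sentence("no period here")
print(with_period)     # Q? It depends on your goals.
print(without_period)  # no period here
```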

### Text Generation Demo

#### Generate Text in Batches

Next, we build a batched version of the same `generate_text` function.

In [35]:
def batch_generate_texts(prompts):
    # Keep the full encoding: generate() needs the attention mask to ignore the left padding
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    print("Inputs (tokens):", inputs["input_ids"])

    outputs = model.generate(
        **inputs,
        max_length=100,
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.2,
    )

    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

In [36]:
print(batch_generate_texts(dataset["instruction"][:1].tolist()))

Inputs (tokens): tensor([[23318,  1115,  9040,   329, 10589,  5448,    13]], device='cuda:0')
["Give three tips for staying healthy.  1. Eat a balanced diet.\n2. Exercise regularly.\n3. Get enough sleep.\n4. Drink plenty of water.\n5. Avoid smoking and excessive alcohol consumption.\n6. Limit your intake of processed foods, sugary drinks, and unhealthy fats.\n7. Make sure to get adequate rest.\n8. Take regular breaks throughout the day.\n9. Don't smoke or drink alcohol.\n10. Try to limit"]


In [37]:
print(batch_generate_texts(dataset["instruction"][:20].tolist()))

Inputs (tokens): tensor([[50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 23318,
          1115,  9040,   329, 10589,  5448,    13],
        [50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,  2061,
           389,   262,  1115,  4165,  7577,    30],
        [50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 24564,  4892,
           262,  4645,   286,   281, 22037,    13],
        [50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,  2437,
           460,   356,  4646,  1633, 12231,    30],
        [   47,  1186,   437,   345,   389,   257,  1628,  4706,   286,   257,
          5103,  1664,    13, 39373,  4892,   257,   640,   618,

#### Dynamic Batching

In [38]:
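def batch_generate_tokens(tokens):
    input_ids = torch.stack(tokens)
    # Rebuild the attention mask; this assumes the pad/EOS id never occurs inside a prompt
    attention_mask = (input_ids != tokenizer.pad_token_id).long()
    outputs = model.generate(input_ids, attention_mask=attention_mask, max_length=64, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)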

In [41]:
def dynamic_batching(prompts, max_tokens, is_pretokenized=False):
    # Tokenize first unless the prompts are already token tensors
    if not is_pretokenized:
        # Tokenize the prompts, pad them to a common length, and move them to the model's device
        tokenized_texts = tokenizer(prompts, return_tensors="pt", padding=True)["input_ids"].to(model.device)
    else:
        tokenized_texts = prompts

    current_batch = []  # Sequences in the current batch
    current_batch_size = 0  # Size of the current batch, in tokens

    for tokenized_text in tokenized_texts:
        # If adding the next sequence would exceed max_tokens, process the current batch
        if current_batch_size + len(tokenized_text) > max_tokens and current_batch:
            yield batch_generate_tokens(current_batch)  # Generate for the current batch
            current_batch, current_batch_size = [], 0  # Start a new batch

        # Add the sequence to the current batch and update its size
        current_batch.append(tokenized_text)
        current_batch_size += len(tokenized_text)

    # Process the final batch if it is not empty
    if current_batch:
        yield batch_generate_tokens(current_batch)

In [42]:
# Create a generator that processes 40 dataset instructions, repeated 1000 times, in batches of up to 3200 tokens

generator = dynamic_batching(dataset["instruction"][:40].tolist() * 1000, 3200)
generator

<generator object dynamic_batching at 0x790f0d7ca6c0>
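The batching rule inside `dynamic_batching` can be exercised without a model. The sketch below applies the same rule (flush the current batch when the next item would exceed the token budget) to plain lengths. The function name `batch_by_token_budget` and the example lengths are hypothetical, chosen only for illustration.

```python
def batch_by_token_budget(lengths, max_tokens):
    """Yield lists of item indices whose summed lengths stay within max_tokens."""
    current, current_size = [], 0
    for index, length in enumerate(lengths):
        # Flush the current batch if adding this item would exceed the budget
        if current and current_size + length > max_tokens:
            yield current
            current, current_size = [], 0
        current.append(index)
        current_size += length
    # Emit whatever is left over
    if current:
        yield current


batches = list(batch_by_token_budget([26, 26, 26, 26, 26], max_tokens=80))
print(batches)  # [[0, 1, 2], [3, 4]]
```

Like the notebook's function, this sketch is lazy: nothing runs until the generator is iterated. An item longer than the budget still ends up in a batch of its own rather than being dropped.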

In [46]:
from contextlib import contextmanager
import time

# Measure the execution time of a code block
@contextmanager
def track_time():
    start = time.perf_counter()
    try:
        yield
    finally:
        # Report the elapsed time even if the block raises
        print(f"Execution time: {time.perf_counter() - start} seconds")

# Time a full pass over the generator, consuming every batch of predictions
with track_time():
    for batch_predictions in tqdm(generator):
        pass

# Group prompts by token length so that no batch needs padding
def sort_batches(prompts, max_tokens):
    # Tokenize without padding so the true lengths can be compared
    tokenized_texts = tokenizer(prompts, padding=False)["input_ids"]
    sorted_tokens = sorted(tokenized_texts, key=len)  # Sort the token lists by length

    # Bucket the tokenized texts by exact length
    sorted_batches = {}
    for sorted_token in sorted_tokens:
        sorted_batches.setdefault(len(sorted_token), []).append(sorted_token)

    # Run dynamic batching within each same-length bucket
    for sorted_batch in sorted_batches.values():
        tensor_batch = torch.tensor(sorted_batch, device=model.device)
        yield from dynamic_batching(tensor_batch, max_tokens=max_tokens, is_pretokenized=True)


generator = sort_batches(dataset["instruction"][:10].tolist(), 320)

# Time the run while printing the size and contents of each batch of predictions.
with track_time():
    for batch_predictions in tqdm(generator):
        print("Batch size:", len(batch_predictions))
        print("\n" + "="*40 + "\n")
        print("Batch predictions:", batch_predictions)

0it [00:00, ?it/s]

Execution time: 0.06367015838623047 seconds


0it [00:00, ?it/s]

Batch size: 1


Batch predictions: ['How did Julius Caesar die?  Julius Caesar was assassinated by his enemies. He was assassinated by his enemies because he was a tyrant and he wanted to rule the Roman Empire. He was assassinated by his enemies because he was a tyrant and he wanted to rule the Roman Empire. He was assassinated by his enemies because he was']
Batch size: 4


Batch predictions: ['Give three tips for staying healthy.  1. Eat a balanced diet.\n2. Exercise regularly.\n3. Get enough sleep.\n4. Get enough rest.\n5. Get enough sleep.\n6. Eat a variety of healthy foods.\n7. Drink plenty of water.\n8. Get', 'What are the three primary colors?  The three primary colors are red, blue, and yellow. Red is the primary color, blue is the secondary color, and yellow is the tertiary color. Red is the primary color, blue is the secondary color, and yellow is the tertiary color. Red is the primary', 'How can we reduce air pollution?  - Use public transportation instead of driving.\n- Use