# ```generate()``` and 🤗accelerate
This notebook is for people using ```model.generate()``` with multiple GPUs. I ❤️ 🤗 accelerate but some things were not obvious to me. 

I wish I had these examples earlier when starting out. In my opinion, the most important functions to know are `split_between_processes` and `gather_object`.

## 0. Test

In [6]:
from accelerate import notebook_launcher

def test():
    from accelerate import Accelerator
    accelerator = Accelerator()
    print(accelerator.distributed_type)

notebook_launcher(test, num_processes=2)  

Launching training on 2 GPUs.
DistributedType.MULTI_GPU
DistributedType.MULTI_GPU


## 1. Hello world
The simplest example: create a string on each GPU and collect them all using ```gather_object()```.

Change ```num_processes``` to the number of GPUs in your system.

In [24]:
from accelerate import notebook_launcher

def hello_world():
    from accelerate import Accelerator
    from accelerate.utils import gather_object

    accelerator = Accelerator()

    message= [f"Hello this is GPU {accelerator.process_index}"]
    messages=gather_object(message)

    accelerator.print(messages)

notebook_launcher(hello_world, num_processes=5)   

Launching training on 5 GPUs.
['Hello this is GPU 0', 'Hello this is GPU 1', 'Hello this is GPU 2', 'Hello this is GPU 3', 'Hello this is GPU 4']
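
The output above arrives as one flat list, ordered by process index. A GPU-free sketch of that behaviour follows; `simulate_gather_object` is a hypothetical helper written for illustration, not part of accelerate, and it assumes each process hands over a list (which is why `message` is wrapped in `[...]` in the cell above).

```python
# GPU-free sketch: each inner list stands in for what one process passes to gather_object().
def simulate_gather_object(per_process_objects):
    """Concatenate the per-process lists in process order (hypothetical helper)."""
    return [item for process_list in per_process_objects for item in process_list]

# One single-element list per simulated process, as in the hello-world cell.
per_process = [[f"Hello this is GPU {rank}"] for rank in range(3)]
print(simulate_gather_object(per_process))

# Processes may contribute different numbers of items; the result is still one flat list.
print(simulate_gather_object([["a", "b"], ["c"]]))
```

The second call shows that the per-process lists do not need to have the same length, which matters once the prompts are split unevenly.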


## 2. Multi-GPU text-generation
* load a model on each GPU
* distribute the prompts with ```split_between_processes```
* have each GPU ```generate```
* gather and output the result
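
To get a feel for the split-and-gather steps listed above without any GPUs, here is a plain-Python sketch. `simulate_split` is a hypothetical stand-in that cuts the list into contiguous, near-equal chunks; how accelerate distributes a remainder may differ between versions, so treat the exact chunk sizes as an assumption.

```python
def simulate_split(items, num_processes):
    """Hypothetical stand-in: contiguous, near-equal chunks, one per process."""
    base, extras = divmod(len(items), num_processes)
    chunks, start = [], 0
    for rank in range(num_processes):
        # in this sketch, the first `extras` processes take one additional item
        size = base + (1 if rank < extras else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

prompts_all = [f"prompt {i}" for i in range(13)]      # 13 prompts, as in the cell below
chunks = simulate_split(prompts_all, num_processes=5)
print([len(c) for c in chunks])                       # chunk size per simulated process

# "Gathering" the chunks in process order restores the original prompt order.
gathered = [p for chunk in chunks for p in chunk]
print(gathered == prompts_all)
```

Because the chunks are contiguous and gathered in process order, the outputs line up with the original prompt list.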

In [7]:
from accelerate import notebook_launcher

def hello_world():
    from accelerate import Accelerator
    from accelerate.utils import gather_object
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch
    
    # https://www.penguin.co.uk/articles/2022/04/best-first-lines-in-books
    prompts_all=[
        "The King is dead. Long live the Queen.",
        "Once there were four children whose names were Peter, Susan, Edmund, and Lucy. This story",
        "The story so far: in the beginning, the universe was created. This has made a lot of people very angry",
        "It was a queer, sultry summer, the summer they electrocuted the Rosenbergs, and I didn’t know what",
        "We were somewhere around Barstow on the edge of the desert when the drugs began to take hold.",
        "It was a bright cold day in April, and the clocks were striking thirteen.",
        "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
        "The snow in the mountains was melting and Bunny had been dead for several weeks before we came to understand the gravity of",
        "The sweat wis lashing oafay Sick Boy; he wis trembling.",
        "124 was spiteful. Full of Baby's venom.",
        "As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",
        "When Mary Lennox was sent to Misselthwaite Manor to live with her uncle everybody said she was",
        "I write this sitting in the kitchen sink.",
    ]

    accelerator = Accelerator()

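    model_path="models/llama2-7b"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,    
        # pin the full model to this process's GPU; "auto" does not take the process index into account
        device_map={"": accelerator.process_index},
        torch_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)   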

    with accelerator.split_between_processes(prompts_all) as prompts:
        outputs=[]
        for prompt in prompts:
            prompt_tokenized=tokenizer(prompt, return_tensors="pt").to(accelerator.device)
            output_tokenized = model.generate(**prompt_tokenized, max_new_tokens=50)
            outputs.append(tokenizer.decode(output_tokenized[0]))

    outputs_gathered=gather_object(outputs)

    for output in outputs_gathered:
        accelerator.print(output)

    # write the file once, from the main process only
    if accelerator.is_main_process:
        with open('outputs.txt','w') as file:
            file.write('\n\n'.join(outputs_gathered))

notebook_launcher(hello_world, num_processes=5)

Launching training on 5 GPUs.


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> The King is dead. Long live the Queen.
The King is dead. Long live the Queen.
The King is dead. Long live the King.
The King is dead. Long live the King.
The King is dead. Long live the King.
The King is dead.
<s> Once there were four children whose names were Peter, Susan, Edmund, and Lucy. This story is about Lucy.
Lucy was the youngest of the four children. She was eleven years old, and she was very small for her age. She was also very curious.
One day, Lucy was playing in the garden with her brother
<s> The story so far: in the beginning, the universe was created. This has made a lot of people very angry and been widely regarded as a bad move.
The story so far: in the beginning, the universe was created. This has made a lot of people very angry and been widely regarded as a bad move. Then God said, "Let there be
<s> It was a queer, sultry summer, the summer they electrocuted the Rosenbergs, and I didn’t know what I was doing in New York. I was 23 years old, my wife was pregnant

## 3. Generate while training on multiple GPUs
This one adds a bit more: here we train a tiny LLM and see how its output changes during training. These are the steps:
* load ```Locutusque/TinyMistral-248M```
* load ```timdettmers/openassistant-guanaco```
* make up a few random ```prompts```
* add a ```TrainerCallback``` to evaluate after each epoch
    * distribute ```prompts``` among the GPUs
    * on each GPU: split the received prompts into batches (```bs=8``` in the code below)
    * batched inference with ```generate()```
    * collect outputs using ```gather_object```
    * log, print, whatever
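
The batching step, and the left padding that the callback below applies before ```generate()```, can be sketched in plain Python. `make_batches` and `left_pad` are hypothetical helpers and the token ids are made up; the real tokenizer call below additionally pads to a multiple of 8. Padding goes on the left so that every prompt ends at the last position and the generated tokens continue directly from it.

```python
def make_batches(items, bs):
    """Split a list into consecutive batches of at most `bs` items."""
    return [items[i:i + bs] for i in range(0, len(items), bs)]

def left_pad(token_lists, pad_id):
    """Left-pad token id lists to a common length and build the attention mask."""
    max_len = max(len(t) for t in token_lists)
    input_ids = [[pad_id] * (max_len - len(t)) + t for t in token_lists]
    attention_mask = [[0] * (max_len - len(t)) + [1] * len(t) for t in token_lists]
    return input_ids, attention_mask

# 19 prompts on one simulated GPU, batch size 8: two full batches and one partial batch
prompts = [f"prompt {i}" for i in range(19)]
batches = make_batches(prompts, bs=8)
print([len(b) for b in batches])

# made-up token ids for two prompts of different length
fake_tokens = [[5, 6, 7], [8, 9]]
ids, mask = left_pad(fake_tokens, pad_id=0)
print(ids)
print(mask)
```

The attention mask marks the padded positions with 0 so the model ignores them during generation.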

In [13]:
from accelerate import notebook_launcher

def hello_world():
    from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, TrainingArguments, Trainer, TrainerControl, TrainerCallback, TrainerState
    from accelerate import Accelerator
    from accelerate.utils import gather_object
    from datasets import load_dataset
    import torch
    import os
    
    accelerator = Accelerator()
    
    # Load model and tokenizer
    model_path="Locutusque/TinyMistral-248M"
    model = AutoModelForCausalLM.from_pretrained(
        model_path,    
        torch_dtype=torch.bfloat16,
    )
    model.config.use_cache = False
    tokenizer = AutoTokenizer.from_pretrained(model_path)   
    
    # Load and tokenize dataset
    dataset = load_dataset("timdettmers/openassistant-guanaco")
    
    def tokenize(element):
        return tokenizer(element["text"], truncation=True, max_length=512, add_special_tokens=False)
    dataset_tokenized = dataset.map(tokenize, batched=True, num_proc=os.cpu_count(), remove_columns=["text"])
    
    # collate function - to transform list of dictionaries [ {input_ids: [123, ..]}, {.. ] to single batch dictionary { input_ids: [..], labels: [..], attention_mask: [..] }
    def collate(elements):
        tokenlist=[e["input_ids"] for e in elements]
        tokens_maxlen=max([len(t) for t in tokenlist])
    
        input_ids,labels,attention_masks = [],[],[]
        for tokens in tokenlist:
            pad_len=tokens_maxlen-len(tokens)
    
            # pad input_ids with pad_token, labels with ignore_index (-100) and set attention_mask 1 where content otherwise 0
            input_ids.append( tokens + [tokenizer.pad_token_id]*pad_len )   
            labels.append( tokens + [-100]*pad_len )    
            attention_masks.append( [1]*len(tokens) + [0]*pad_len ) 
    
        batch={
            "input_ids": torch.tensor(input_ids),
            "labels": torch.tensor(labels),
            "attention_mask": torch.tensor(attention_masks)
        }
        return batch
    
    # List of prompts for evaluation
    prompt_template="### Human: {}\n### Assistant:"
    questions = [ 
        "Hello! Who are you? Introduce yourself please",
        "How much is 2+2? Think step by step",
        "What is on your mind?",
        "Define artificial general intelligence",
        ] * 10 # expand. not creative enough for more
    prompts = [ prompt_template.format(q) for q in questions ]
    
    # Callback class for generation during training
    class GenerateEvalCallback(TrainerCallback):
        def __init__(self, prompts, accelerator):
            self.prompts_all=prompts
            self.accelerator=accelerator   # use the instance passed in instead of creating a new one
        
        # left pad for inference and tokenize
        def prepare_prompts(self, prompts, tokenizer):
            tokenizer.padding_side="left"     
            prompts_tok=tokenizer(
                prompts, 
                return_tensors="pt", 
                padding='longest', 
                truncation=False, 
                pad_to_multiple_of=8,
                add_special_tokens=False).to("cuda")
            tokenizer.padding_side="right"
    
            return prompts_tok
    
        def on_evaluate(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, model, tokenizer, eval_dataloader, **kwargs):
            model.eval()
            model.config.use_cache = True
    
            # split questions among GPUs
            with self.accelerator.split_between_processes(self.prompts_all) as prompts:
                # batched inference on each GPU
                bs=8
                batches=[prompts[i:i + bs] for i in range(0, len(prompts), bs)]  
                outputs=[]   # outputs per GPU
                for prompt_batch in batches:
                    prompts_tok=self.prepare_prompts(prompt_batch, tokenizer)
                    with torch.no_grad():
                        outputs_tok=model.generate(**prompts_tok, max_new_tokens=30).to("cpu")
                    # decode each sequence after dropping its pad tokens
                    outputs.extend([
                        tokenizer.decode(t[t!=tokenizer.pad_token_id])
                        for t in outputs_tok
                        ])
            outputs_gathered=gather_object(outputs)  # collect results from all GPUs

            # print a few to console
            accelerator.print(f"EPOCH {state.epoch:0.2f}:")
            for output in outputs_gathered[:5]:  
                accelerator.print(output)

            # write all to file
            if accelerator.is_main_process:
                with open(f"outputs_epoch-{state.epoch:0.2f}.txt",'w') as file:
                    file.write('\n\n'.join(outputs_gathered))
    
            model.config.use_cache = False
            return control
    
    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=16,
        evaluation_strategy="epoch",
        logging_steps=1,
        num_train_epochs=4,
        learning_rate=0.001,
        ddp_find_unused_parameters=False,
    )
    
    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        args=args,
        data_collator=collate,
        train_dataset=dataset_tokenized["train"],
        eval_dataset=dataset_tokenized["test"],
    )
    
    trainer.add_callback(
        GenerateEvalCallback(
            prompts=prompts,
            accelerator=accelerator,
        ))
    
    trainer.train()

notebook_launcher(hello_world, num_processes=5)

Launching training on 5 GPUs.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.
Failed to detect the name of this notebook, you can set it manually with th

Epoch,Training Loss,Validation Loss
1,2.7482,3.525146
2,1.8143,3.376099
3,1.1947,3.40565
4,0.853,3.433367


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


EPOCH 1.00:
### Human: Hello! Who are you? Introduce yourself please
### Assistant: Hello! I am Open Assistant, a chat-based chat-based chat-based chat-based chat-based chat-based chat-based chat
### Human: How much is 2+2? Think step by step
### Assistant: 1+2 = 1

The average height of a sphere depends on the size of the Earth and the distance between the Earth and the
### Human: What is on your mind?
### Assistant: I am a computer program that is designed to provide a high-quality, high-quality, high-quality, high-quality, high-quality
### Human: Define artificial general intelligence
### Assistant: La idea de un concepto de concepto de concepto de concepto de concepto de concepto de concepto de concepto de concepto
### Human: Hello! Who are you? Introduce yourself please
### Assistant: Hello! I am Open Assistant, a chat-based chat-based chat-based chat-based chat-based chat-based chat-based chat


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


EPOCH 2.00:
### Human: Hello! Who are you? Introduce yourself please
### Assistant: I am Open Assistant, a chat based chat based chat based on the open source chat. I am a chat based chat based on the open source chat
### Human: How much is 2+2? Think step by step
### Assistant: 3+2 = 2

The average height in a 2 is 2.

The average height in a 2 is
### Human: What is on your mind?
### Assistant: I am a language model, and I am trained on a large corpus of text from a large corpus of text from a large corpus of
### Human: Define artificial general intelligence
### Assistant: La capacidad de la inteligencia artificial (IA) es un concepto fundamental en la que se utiliza para entender cómo se util
### Human: Hello! Who are you? Introduce yourself please
### Assistant: I am Open Assistant, a chat based chat based chat based on the open source chat. I am a chat based chat based on the open source chat


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


EPOCH 3.00:
### Human: Hello! Who are you? Introduce yourself please
### Assistant: I am an open source AI language model trained by humans. I am trained on a large corpus of text data, which is generated by a large
### Human: How much is 2+2? Think step by step
### Assistant: 3 + 2 = 3

There are 100 000 000 0000 
### Human: What is on your mind?
### Assistant: I am an AI language model,

I am designed to answer questions and provide helpful answers to your questions.
I am designed to respond to
### Human: Define artificial general intelligence
### Assistant: El concepto de artificial intelligence (AI) es una pregunta que ha sido debatida en muchas empresas, pero con much
### Human: Hello! Who are you? Introduce yourself please
### Assistant: I am an open source AI language model trained by humans. I am trained on a large corpus of text data, which is generated by a large


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


EPOCH 4.00:
### Human: Hello! Who are you? Introduce yourself please
### Assistant: Hello! I am Open Assistant, a chat based assistant that can answer questions, help with tasks, and more.### Human: I'm looking
### Human: How much is 2+2? Think step by step
### Assistant: 2+2 is the average size of the average person. 

However, this is just a rough estimate and can vary depending on factors
### Human: What is on your mind?
### Assistant: I am not a sentient AI and I do not have the ability to experience emotions. 

I do not have the ability to experience emotions
### Human: Define artificial general intelligence
### Assistant: Human: Hola, soy un modelo de lenguaje entrenado por la comunidad. ¿En qué puedo ayudarte
### Human: Hello! Who are you? Introduce yourself please
### Assistant: Hello! I am Open Assistant, a chat based assistant that can answer questions, help with tasks, and more.### Human: I'm looking


In [19]:
!head outputs_*

==> outputs_epoch-1.00.txt <==
### Human: Hello! Who are you? Introduce yourself please
### Assistant: Hello! I am Open Assistant, a chat-based chat-based chat-based chat-based chat-based chat-based chat-based chat

### Human: How much is 2+2? Think step by step
### Assistant: 1+2 = 1

The average height of a sphere depends on the size of the Earth and the distance between the Earth and the

### Human: What is on your mind?
### Assistant: I am a computer program that is designed to provide a high-quality, high-quality, high-quality, high-quality, high-quality

==> outputs_epoch-2.00.txt <==
### Human: Hello! Who are you? Introduce yourself please
### Assistant: I am Open Assistant, a chat based chat based chat based on the open source chat. I am a chat based chat based on the open source chat

### Human: How much is 2+2? Think step by step
### Assistant: 3+2 = 2

The average height in a 2 is 2.

The average height in a 2 is


==> outputs_epoch-3.00.txt <==
### Human: Hello! Who are you