# Batching, multi-GPU, and multi-node inference for large data and large models

We've seen how to run LLM inference with a high degree of control over the model inputs and outputs. The goal of this last notebook is to discuss ways to scale up the inference process to large data and large models.

There are three primary tools we will use:
1. Batching
2. Multi-GPU inference
3. Multi-node inference

We'll discuss each of these in turn.

## Batching

Batching means processing multiple inputs in a single forward pass instead of one at a time. It is a common technique in deep learning because the GPU can work on all inputs in the batch in parallel. The `transformers` library has built-in support for batching, and we can use it to speed up inference with minimal code changes.

First, we'll load a collection of texts that we want to process with an LLM. Then we'll process them in batches and compare how long that takes against processing them one at a time.
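Texts in a batch rarely have the same number of tokens, so shorter ones have to be padded to a common length. The tokenizer used later in this notebook is configured with `padding_side='left'`; for decoder-only models this keeps the real tokens next to the position where generation starts. The following is a minimal sketch of the idea in plain Python, not the tokenizer's actual implementation. The helper name `left_pad` and the toy token ids are made up for illustration.

```python
from typing import List, Sequence, Tuple


def left_pad(
    sequences: Sequence[Sequence[int]], pad_id: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad token-id sequences on the left to a common length.

    Returns (input_ids, attention_mask). The mask is 1 for real tokens
    and 0 for padding, so the model can ignore the padded positions.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


# Toy token ids (hypothetical, for illustration only)
ids, mask = left_pad([[5, 6, 7], [8, 9]], pad_id=0)
print(ids)   # [[5, 6, 7], [0, 8, 9]]
print(mask)  # [[1, 1, 1], [0, 1, 1]]
```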

In [23]:
# Get a list of texts from the 20 newsgroups dataset
# Each text is a post from a newsgroup
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='test')['data'][:64]
print(f'Number of documents: {len(docs)}')
for i, doc in enumerate(docs[:3]):
    print(f'\n\nDOCUMENT {i+1}:\n{doc}\n')

Number of documents: 64


DOCUMENT 1:
From: v064mb9k@ubvmsd.cc.buffalo.edu (NEIL B. GANDLER)
Subject: Need info on 88-89 Bonneville
Organization: University at Buffalo
Lines: 10
News-Software: VAX/VMS VNEWS 1.41
Nntp-Posting-Host: ubvmsd.cc.buffalo.edu


 I am a little confused on all of the models of the 88-89 bonnevilles.
I have heard of the LE SE LSE SSE SSEI. Could someone tell me the
differences are far as features or performance. I am also curious to
know what the book value is for prefereably the 89 model. And how much
less than book value can you usually get them for. In other words how
much are they in demand this time of year. I have heard that the mid-spring
early summer is the best time to buy.

			Neil Gandler




DOCUMENT 2:
From: Rick Miller <rick@ee.uwm.edu>
Subject: X-Face?
Organization: Just me.
Lines: 17
Distribution: world
NNTP-Posting-Host: 129.89.2.33
Summary: Go ahead... swamp me.  <EEP!>

I'm not familiar at all with the format of these "X-Face:" thingies, but
a

Suppose we want some piece of information about each of these newsgroup posts, and what we want cannot be easily extracted in an automated way using traditional NLP techniques. An LLM might be a good choice for such a task.

For example, we might want a one-sentence summary of each post. We can craft a prompt that asks the model to generate such a summary.

In [31]:
import time
start = time.time()  # Note: this timer also covers the imports and model loading below

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from tqdm import tqdm

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", padding_side='left')
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", device_map="auto")

device = model.device

system_prompt = "The user will supply a post from an online newsgroup. Summarize the post in a single, very short sentence."

# Define a function that will generate summaries for a batch of posts
def generate_summaries(texts, batch_size=8):
    results = []
    total_batches = (len(texts) + batch_size - 1) // batch_size
    with tqdm(total=total_batches, desc="Processing batches", leave=True, bar_format="{l_bar}{bar} | {n_fmt}/{total_fmt}") as pbar:
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            batch_messages = [[{"role": "system", "content": system_prompt}, {"role": "user", "content": text}] for text in batch]
                        
            # Tokenize the messages using chat template
            model_inputs = tokenizer.apply_chat_template(
                batch_messages,
                add_generation_prompt=True,
                return_tensors="pt",
                padding=True,
                return_dict=True,
            ).to(device)

            # Generate continuations for the whole batch (no gradients needed for inference)
            with torch.no_grad():
                outputs = model.generate(
                    **model_inputs,
                    max_new_tokens=100,
                    return_dict_in_generate=True,
                    pad_token_id=tokenizer.eos_token_id
                )
            
            # Decode output
            prompt_length = model_inputs["input_ids"].shape[1]
            generated_sequences = outputs.sequences[:, prompt_length:]
            decoded_outputs = tokenizer.batch_decode(generated_sequences, skip_special_tokens=True)
            results.extend(decoded_outputs)

            pbar.update(1)
    return results

# Generate summaries for the documents
summaries = generate_summaries(docs, batch_size=32)

end = time.time()
print(f"Total time taken: {end - start:.2f} seconds")


Processing batches: 100%|███████ | 2/2

Total time taken: 24.86 seconds
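To compare batched processing against processing one text at a time, the same function can be timed at several batch sizes. The helper below is a sketch under stated assumptions: `time_batch_sizes` is a hypothetical name, and `fake_summarize` is a stand-in so the example runs without a model. Unlike the timing above, it measures only the call itself, not model loading.

```python
import time
from typing import Callable, Dict, List, Sequence


def time_batch_sizes(
    summarize: Callable[[List[str], int], List[str]],
    texts: List[str],
    batch_sizes: Sequence[int],
) -> Dict[int, float]:
    """Return wall-clock seconds of summarize(texts, batch_size) per batch size."""
    timings = {}
    for batch_size in batch_sizes:
        start = time.perf_counter()
        summarize(texts, batch_size)
        timings[batch_size] = time.perf_counter() - start
    return timings


# Stand-in for generate_summaries so the sketch runs without a model
def fake_summarize(texts: List[str], batch_size: int) -> List[str]:
    return [text[:10] for text in texts]


timings = time_batch_sizes(fake_summarize, ["some post"] * 8, batch_sizes=[1, 8])
for batch_size, seconds in timings.items():
    print(f"batch_size={batch_size}: {seconds:.4f} s")
```

With the model loaded, the same helper could be called as `time_batch_sizes(generate_summaries, docs, [1, 32])`. The actual speedup depends on the hardware and on how much the text lengths vary within a batch.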





In [32]:
summaries

['The 1988-1989 Bonneville is a model with three different configurations: LE, SE, and SSE, differing in features and performance.',
 'The sender, Rick Miller, is seeking an X-Face header from the user.',
 'The author agrees that strong atheists assert the nonexistence of God.',
 "The post is a critical commentary on the Saudi government and its human rights record, specifically targeting the country's clergy and government.",
 'Jon Livesey argues that there is no objective moral system.',
 'The post from the online newsgroup is a discussion about Candida, a fungus, and its potential relationship to mucocutaneous candidiasis, a condition that can cause irritation in various parts of the body.',
 'A man who accumulates wealth slowly makes it grow.',
 'A user is seeking a "Word Perfect" EXE file for a Greek and Hebrew lexicon.',
 'A server connection is dynamically closed and reopened for each display server.',
 'Jennifer Urso suggests using Aldus Photostyler to convert negatives into co

In [5]:
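# Clear the model from memory before loading a larger one
import gc
import torch

if "model" in globals():  # guard avoids a NameError if the model was never loaded in this session
    del model
gc.collect()              # drop lingering Python-side references
torch.cuda.empty_cache()  # release cached GPU memory held by PyTorch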

## Multi-GPU inference

Thankfully, `transformers` makes multi-GPU inference easy.

There are different kinds of parallelism you might want to use with multiple GPUs. For example, if you just want to speed up your LLM inference and your model fits on a single GPU, you can use *data parallelism*, in which each GPU holds a full copy of the model and processes a different slice of the data.
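Data parallelism starts with splitting the inputs so that each GPU gets its own share. The sketch below shows only that splitting step in plain Python; the helper name `shard` is hypothetical, and loading one model copy per GPU is left out.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard(items: Sequence[T], num_shards: int) -> List[List[T]]:
    """Split items into num_shards strided shards (item i goes to shard i % num_shards)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return [list(items[rank::num_shards]) for rank in range(num_shards)]


posts = [f"post {i}" for i in range(10)]
for rank, part in enumerate(shard(posts, num_shards=4)):
    # In a real run, worker `rank` would hold its own model copy on GPU `rank`
    print(rank, part)
```

Strided shards differ in size by at most one item, which keeps the work roughly balanced when the inputs are of similar length.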

If your model is too large to fit on a single GPU, you can use *model parallelism*, in which each GPU holds a different part of the model. Luckily, `transformers` makes model parallelism easy: set `device_map` when loading the model.
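A quick back-of-the-envelope estimate shows when model parallelism becomes necessary. The sketch below counts only the memory for the weights and ignores activations and the KV cache, so real requirements are higher. The round figure of 70 billion parameters and the 80 GB GPU size are assumptions for illustration.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, bytes_per_param: int, gpu_memory_gb: float) -> int:
    """Lower bound on GPU count: enough total memory to hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


# Assumed: ~70e9 parameters and 80 GB per GPU
for name, nbytes in [("float32", 4), ("bfloat16", 2)]:
    gb = weight_memory_gb(70e9, nbytes)
    gpus = min_gpus_for_weights(70e9, nbytes, gpu_memory_gb=80)
    print(f"{name}: ~{gb:.0f} GB of weights, at least {gpus} x 80 GB GPUs")
```

Under these assumptions the weights alone need about 140 GB in bfloat16 and about 280 GB in float32, so the data type used when loading the model has a large effect on how many GPUs are needed.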

In [6]:
import time
start = time.time()

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from tqdm import tqdm

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct", padding_side='left')
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.3-70B-Instruct", device_map="auto")

device = model.device

system_prompt = "The user will supply a post from an online newsgroup. Summarize the post in a single, very short sentence."

# Generate summaries for the documents
summaries = generate_summaries(docs, batch_size=1)


## Multi-node inference

What if you have a model that's so big it won't fit on a single node, even a node with multiple GPUs? Then you will need to use multi-node inference. This is a more advanced topic, and requires a bit more setup. `transformers` alone won't cut it any longer.

We will use the `deepspeed` library to help us with multi-node inference.
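DeepSpeed's launcher is usually told which machines to use through a hostfile, with one line per node giving its hostname and the number of GPU slots. The sketch below only builds that file's text; the helper name `make_hostfile` and the node names are hypothetical, and it assumes the nodes can reach each other over passwordless SSH.

```python
from typing import Dict


def make_hostfile(gpus_per_node: Dict[str, int]) -> str:
    """Build hostfile text with one '<hostname> slots=<num_gpus>' line per node."""
    lines = [f"{host} slots={num_gpus}" for host, num_gpus in gpus_per_node.items()]
    return "\n".join(lines) + "\n"


# Hypothetical cluster: two nodes with four GPUs each
print(make_hostfile({"node-1": 4, "node-2": 4}), end="")
```

The resulting text could be saved to a file and passed to the launcher, for example `deepspeed --hostfile=hostfile your_script.py`, where `your_script.py` is a placeholder for the inference script.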