## Ensemble learning on transformer-based models

In [1]:
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from torch import nn, optim
import torch.nn.functional as F
import torch

In [2]:
from safetensors.torch import save_file
from safetensors.torch import load_file

In [3]:
device = ("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")

### Authenticate HuggingFace

In [4]:
from huggingface_hub import login
from google.colab import userdata

# Read the Hugging Face token from Colab's secret store (key "HF_TOKEN")
login(token=userdata.get("HF_TOKEN"), add_to_git_credential=True)

### Model Selection

Hugging Face Transformers provides access to a wide variety of pre-trained models for Natural Language Processing (NLP) tasks like text generation, classification, translation, and more. These models are built using different architectures, including:

- **BERT (Bidirectional Encoder Representations from Transformers)**: A pre-trained model designed for understanding the context in text by looking at both directions (left and right). Used for tasks like classification and question answering.

- **GPT (Generative Pre-trained Transformer)**: Focused on text generation, GPT models predict the next word in a sequence, making them ideal for tasks like conversation and text completion.

- **T5 (Text-to-Text Transfer Transformer)**: A versatile model that converts all NLP tasks into a text-to-text format, applicable for translation, summarization, and more.

- **RoBERTa (A Robustly Optimized BERT Pretraining Approach)**: An optimized version of BERT with better training techniques for improved performance on various NLP tasks.

Each model can be fine-tuned for specific use cases or used directly in applications, and they come with easy integration through the Hugging Face transformers library.

### Model Selection and Setup

**Purpose**: The `model_name` variable holds the string `"microsoft/Phi-3-mini-4k-instruct"`, the identifier of the pretrained model to load from the Hugging Face model hub.

**Model**: `microsoft/Phi-3-mini-4k-instruct` is a language model developed by Microsoft and tuned for instruction-following tasks.

---

```Python
tokenizer = AutoTokenizer.from_pretrained(...)
```

**Purpose**: This line loads the tokenizer associated with the microsoft/Phi-3-mini-4k-instruct model.

**How it works**:
- The tokenizer converts input text into tokens, each represented by an integer ID that the model can process.
- The `from_pretrained()` method downloads the pretrained tokenizer for the given model name, or reads it from the local cache if it is already there.
- `trust_remote_code=True`, when passed, permits running custom Python code shipped in the model repository, for example special tokenization logic. The cells below omit it, so only the tokenizer classes built into Transformers are used.
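To make the text-to-token step concrete, here is a minimal sketch of a greedy longest-match subword tokenizer. The vocabulary `TOY_VOCAB` and the helpers `encode` and `decode` are hypothetical and far simpler than the real Phi-3 tokenizer; the point is only the round trip from text to IDs and back.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of entries.
TOY_VOCAB = {"<unk>": 0, "al": 1, "pha": 2, "bet": 3, "ical": 4, "ly": 5, "a": 6, "b": 7}
ID_TO_TOKEN = {i: tok for tok, i in TOY_VOCAB.items()}


def encode(text: str) -> list[int]:
    """Greedy longest-match segmentation of `text` into vocabulary IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            # No entry matches: emit the unknown token and skip one character.
            ids.append(TOY_VOCAB["<unk>"])
            pos += 1
    return ids


def decode(ids: list[int]) -> str:
    """Map IDs back to their token strings and concatenate them."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


ids = encode("alphabetically")
print(ids)
print(decode(ids))
```

Two tokenizers with different vocabularies will generally split the same word into different pieces and assign unrelated IDs, which matters later when outputs from several models are combined.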

---

```Python
model = AutoModelForCausalLM.from_pretrained(...)
```

**Purpose**: This line loads the pretrained model itself.

**How it works**:
- `AutoModelForCausalLM` loads a causal language model, which generates text autoregressively: each new token is predicted from the tokens that precede it.
- `from_pretrained()` fetches the model weights and configuration from the Hugging Face model hub, or from the local cache if they are already downloaded.
- `trust_remote_code=True`, when passed, permits loading a custom model implementation shipped in the model repository. The cells below omit it and rely on the implementations built into Transformers.
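The autoregressive loop can be illustrated without downloading any weights. The sketch below replaces the neural network with a hypothetical lookup table, `NEXT_TOKEN_PROBS`, and `greedy_generate` is an illustrative name rather than a Transformers API. A real causal LM conditions on the whole prefix, not only on the last token as this toy table does.

```python
# Hypothetical next-token distributions, keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "<s>": {"The": 0.7, "A": 0.3},
    "The": {"capital": 0.6, "city": 0.4},
    "capital": {"is": 0.9, "was": 0.1},
    "is": {"Paris": 0.8, "Rome": 0.2},
    "Paris": {"</s>": 1.0},
}


def greedy_generate(start: str = "<s>", max_new_tokens: int = 10) -> list[str]:
    """Repeatedly append the most probable next token until </s> or the length cap."""
    sequence = [start]
    for _ in range(max_new_tokens):
        distribution = NEXT_TOKEN_PROBS.get(sequence[-1])
        if distribution is None:
            break  # no known continuation for this token
        next_token = max(distribution, key=distribution.get)
        if next_token == "</s>":
            break  # end-of-sequence token stops generation
        sequence.append(next_token)
    return sequence[1:]  # drop the start symbol


print(greedy_generate())
```

Each iteration feeds the sequence produced so far back in as context, which is the same pattern `generate()` follows with a real model.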

In [19]:
model_name = "microsoft/Phi-3-mini-4k-instruct"
model_one = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer_one = AutoTokenizer.from_pretrained(model_name)


In [52]:
model_name = "meta-llama/Llama-3.2-3B-Instruct"
model_two = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer_two = AutoTokenizer.from_pretrained(model_name)


In [9]:
model_name = "Qwen/Qwen2.5-3B-Instruct"
model_three = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer_three = AutoTokenizer.from_pretrained(model_name)


### Setup the test prompt

In [5]:
test_case = "alhabetically"
w1 = tokenizer_one(test_case, return_tensors="pt")
w2 = tokenizer_two(test_case, return_tensors="pt")
# w3 = tokenizer_three(test_case, return_tensors="pt")

In [6]:
print(f"phi: {w1.input_ids}")
print(f"llama: {w2.input_ids}")
# print(f"qwen: {w3.input_ids}")

phi: tensor([[ 394, 7308,  300, 1711]])
llama: tensor([[128000,    278,     71,  10448,   2740]])


In [27]:
# https://huggingface.co/docs/transformers/main/en/model_doc/llama#transformers.LlamaForCausalLM
# Pass the attention mask explicitly; generate() cannot infer it when the pad and EOS tokens coincide
generate_ids = model_one.generate(w1.input_ids, attention_mask=w1.attention_mask, max_length=30)
tokenizer_one.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]


'alhabetically.\n\n    Args:\n    - words: a list of strings representing words\n\n    Returns:\n    -'

### Test Case: Text summarization task

In [6]:
demo_text = """
Gaius Julius Caesar[a] (12 July 100 BC – 15 March 44 BC) was a Roman general and statesman. A member of the First Triumvirate, Caesar led the Roman armies in the Gallic Wars before defeating his political rival Pompey in a civil war, and subsequently became dictator from 49 BC until his assassination in 44 BC. He played a critical role in the events that led to the demise of the Roman Republic and the rise of the Roman Empire.
In 60 BC, Caesar, Crassus, and Pompey formed the First Triumvirate, an informal political alliance that dominated Roman politics for several years. Their attempts to amass political power were opposed by many in the Senate, among them Cato the Younger with the private support of Cicero. Caesar rose to become one of the most powerful politicians in the Roman Republic through a string of military victories in the Gallic Wars, completed by 51 BC, which greatly extended Roman territory. During this time he both invaded Britain and built a bridge across the river Rhine. These achievements and the support of his veteran army threatened to eclipse the standing of Pompey, who had realigned himself with the Senate after the death of Crassus in 53 BC. With the Gallic Wars concluded, the Senate ordered Caesar to step down from his military command and return to Rome. In 49 BC, Caesar openly defied the Senate's authority by crossing the Rubicon and marching towards Rome at the head of an army.[3] This began Caesar's civil war, which he won, leaving him in a position of near-unchallenged power and influence in 45 BC.
After assuming control of government, Caesar began a programme of social and governmental reform, including the creation of the Julian calendar. He gave citizenship to many residents of far regions of the Roman Republic. He initiated land reforms to support his veterans and initiated an enormous building programme. In early 44 BC, he was proclaimed "dictator for life" (dictator perpetuo). Fearful of his power and domination of the state, a group of senators led by Brutus and Cassius assassinated Caesar on the Ides of March (15 March) 44 BC. A new series of civil wars broke out and the constitutional government of the Republic was never fully restored. Caesar's great-nephew and adopted heir Octavian, later known as Augustus, rose to sole power after defeating his opponents in the last civil war of the Roman Republic. Octavian set about solidifying his power, and the era of the Roman Empire began.
"""

In [17]:
# Check if `pad_token_id` is set, and if not, set it to `eos_token_id`
if tokenizer_two.pad_token_id is None:
    tokenizer_two.pad_token_id = tokenizer_two.eos_token_id

In [18]:
# Define the summarization prompt
prompt = f"Summarize the following text:\n\n{demo_text}\n\nSummary:"

# Tokenize the prompt and move the resulting tensors to the selected device
inputs = tokenizer_two(prompt, return_tensors="pt", padding=True, truncation=True).to(device)

In [19]:
# Load model onto GPU
model_two.to(device)

# Generate the summary
summary_ids = model_two.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=700,
    min_new_tokens=50,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=2
)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


In [20]:
summary_ids

tensor([[128000,   9370,   5730,  ...,  39995,    402,    811]],
       device='cuda:0')

In [22]:
# Decode the full output sequence; for a causal LM it contains the prompt followed by the generated summary
tokenizer_two.decode(summary_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

'Summarize the following text:\n\n\nGaius Julius Caesar[a] (12 July 100 BC – 15 March 44 BC) was a Roman general and statesman. A member of the First Triumvirate, Caesar led the Roman armies in the Gallic Wars before defeating his political rival Pompey in a civil war, and subsequently became dictator from 49 BC until his assassination in 44 BC. He played a critical role in the events that led to the demise of the Roman Republic and the rise of the Roman Empire.\nIn 60 BC, Caesar, Crassus, and Pompey formed the First Triumvirate, an informal political alliance that dominated Roman politics for several years. Their attempts to amass political power were opposed by many in the Senate, among them Cato the Younger with the private support of Cicero. Caesar rose to become one of the most powerful politicians in the Roman Republic through a string of military victories in the Gallic Wars, completed by 51 BC, which greatly extended Roman territory. During this time he both invaded Britain and

In [25]:
# Move model back to CPU to free up GPU memory
model_two.to("cpu")

# Clear the CUDA memory cache after use of model
torch.cuda.empty_cache()

### Get the top-k tokens

In [22]:
def get_top_k_two(model, tokenizer, prompt, k: int = 10):
    """
    Get the top-k next-token probabilities from the model for a given prompt.

    :param model: The causal language model (e.g., Llama or Phi-3).
    :param tokenizer: The tokenizer corresponding to the model.
    :param prompt: The input text whose next-token distribution is computed.
    :param k: The number of top tokens to retrieve.
    :return: A list of (token, probability) tuples, most probable first.
    """
    # Tokenize the prompt and place the tensors on the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Forward pass without gradient tracking
    with torch.no_grad():
        logits = model(**inputs).logits  # Shape: [batch_size, seq_len, vocab_size]

    # The logits at the last position define the distribution over the next token
    last_token_probabilities = F.softmax(logits[0, -1, :], dim=-1).cpu().numpy()

    # Indices of the k largest probabilities, in descending order
    top_k_indices = last_token_probabilities.argsort()[-k:][::-1]
    top_k_probs = last_token_probabilities[top_k_indices]
    top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_indices)

    # Combine tokens and probabilities into tuples for readability
    top_k_results = list(zip(top_k_tokens, top_k_probs))

    return top_k_results
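The two numerical steps in `get_top_k_two`, a softmax followed by a top-k selection, can be checked by hand on a small vector. The sketch below uses only the standard library; the four-token vocabulary and the logits are made up for illustration.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(tokens: list[str], probs: list[float], k: int) -> list[tuple[str, float]]:
    """Return the k (token, probability) pairs with the highest probability."""
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up vocabulary and logits for illustration.
vocab = ["Paris", "The", "London", "\n"]
logits = [3.0, 1.0, 0.5, 1.5]

probs = softmax(logits)
for token, p in top_k(vocab, probs, k=2):
    print(f"{token!r}: {p:.3f}")
```

Softmax preserves the ordering of the logits, so the top-k tokens could also be selected on raw logits; the softmax is only needed to report normalized probabilities.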

In [25]:
m = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
t = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

prompt = "What is the capital of France?"
top_k_tokens = get_top_k_two(model=m, tokenizer=t, prompt=prompt, k=5)


In [26]:
top_k_tokens

[('ĠParis', 0.54434246),
 ('ĠThe', 0.07591423),
 ('ĠĊ', 0.06929976),
 ('Ġ', 0.056263898),
 ('Ġ(', 0.05607919)]

In [29]:
del m
del t

In [28]:
m = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
t = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

prompt = "What is the capital of France?"
top_k_tokens = get_top_k_two(model=m, tokenizer=t, prompt=prompt, k=5)


In [30]:
top_k_tokens

[('<0x0A>', 0.44091272),
 ('▁The', 0.16846289),
 ('▁', 0.074185655),
 ('▁France', 0.070011795),
 ('▁A', 0.023982516)]
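The two outputs above use different surface markers for whitespace. Byte-level BPE vocabularies such as Llama 3.2's write a leading space as `Ġ` and a newline as `Ċ`, while SentencePiece-style vocabularies such as Phi-3's write a space as `▁` and fall back to byte tokens like `<0x0A>` for a newline. The sketch below maps these markers to plain text so tokens from both models can be compared. `normalize_token` is a hypothetical helper that handles only the markers visible above; in practice, `tokenizer.decode` on the token ID is the more robust route.

```python
import re

# SentencePiece byte-fallback tokens look like '<0x0A>' and encode a single byte.
BYTE_FALLBACK = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def normalize_token(token: str) -> str:
    """Map a raw vocabulary entry to the text it stands for (markers seen above only)."""
    match = BYTE_FALLBACK.fullmatch(token)
    if match:
        # Only meaningful for ASCII bytes; higher bytes are fragments of UTF-8 sequences.
        return chr(int(match.group(1), 16))
    return (
        token.replace("\u0120", " ")   # 'Ġ': byte-level BPE marker for a space
             .replace("\u010a", "\n")  # 'Ċ': byte-level BPE marker for a newline
             .replace("\u2581", " ")   # '▁': SentencePiece marker for a space
    )


llama_top = ["ĠParis", "ĠThe", "ĠĊ"]
phi_top = ["<0x0A>", "▁The", "▁France"]

print([normalize_token(t) for t in llama_top])
print([normalize_token(t) for t in phi_top])
```

After normalization, `ĠThe` and `▁The` both become `" The"`, so the two models' candidates can be matched by text rather than by ID.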

In [31]:
del m
del t


### Get Computed Probabilities

### Ensemble Result for Simple token averaging

In [14]:
def ensemble_generate(models, tokenizers, prompt, top_k=10, device='cpu'):
    """
    Ensemble prediction by averaging the logits of multiple models index by index.

    Caveats: this is only meaningful when all models share one vocabulary, because
    token ID i must denote the same token in every model. Truncating to the smallest
    sequence length and vocabulary size makes the shapes match but does not align
    tokens across different tokenizers.

    :param models: List of Hugging Face models (pre-loaded).
    :param tokenizers: List of tokenizers corresponding to the models.
    :param prompt: The input prompt for text generation.
    :param top_k: The number of top tokens to return from the averaged distribution.
    :param device: Device to run models on ('cuda' or 'cpu').
    :return: A list of (token, probability) tuples for the top-k next-token predictions.
    """
    models = [model.to(device) for model in models]  # Move models to the right device

    # Running sum of logits across models
    total_logits = None

    with torch.no_grad():
        for model, tokenizer in zip(models, tokenizers):  # Each model gets its own tokenizer
            inputs = tokenizer(prompt, return_tensors="pt").to(device)
            logits = model(**inputs).logits  # Shape: [batch_size, seq_len, vocab_size]

            if total_logits is None:
                total_logits = logits
            else:
                # Truncate both tensors to the smallest sequence length and vocabulary size.
                # Note: if a tokenizer yields a longer sequence (e.g. it prepends a BOS token),
                # the truncated tensor no longer ends at that model's final prompt position.
                min_seq_len = min(total_logits.shape[1], logits.shape[1])
                min_vocab_size = min(total_logits.shape[2], logits.shape[2])
                total_logits = total_logits[:, :min_seq_len, :min_vocab_size]
                logits = logits[:, :min_seq_len, :min_vocab_size]

                total_logits = total_logits + logits  # Sum the logits from each model

    # Average the logits across models
    averaged_logits = total_logits / len(models)

    # Token IDs are decoded with the first tokenizer
    tokenizer = tokenizers[0]

    # Softmax over the averaged logits at the last position
    last_token_probs = F.softmax(averaged_logits[0, -1, :], dim=-1).cpu().numpy()
    top_k_indices = last_token_probs.argsort()[-top_k:][::-1]
    top_k_probs = last_token_probs[top_k_indices]
    top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_indices)

    # Combine tokens and probabilities into tuples
    top_k_results = list(zip(top_k_tokens, top_k_probs))

    return top_k_results
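Averaging logits and averaging probabilities are different operations. The softmax of the mean logits is proportional to the geometric mean of the models' distributions, whereas averaging the softmax outputs gives their arithmetic mean. `ensemble_generate` sums logits, so it follows the first route. A minimal sketch with made-up logits from two hypothetical models that share a three-token vocabulary:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits over the same three-token vocabulary.
logits_a = [4.0, 0.0, 0.0]  # model A is confident in token 0
logits_b = [0.0, 1.0, 0.0]  # model B mildly prefers token 1

# Option 1: average the logits, then apply softmax (what ensemble_generate does).
mean_logits = [(a + b) / 2 for a, b in zip(logits_a, logits_b)]
from_logits = softmax(mean_logits)

# Option 2: apply softmax per model, then average the probabilities.
probs_a, probs_b = softmax(logits_a), softmax(logits_b)
from_probs = [(a + b) / 2 for a, b in zip(probs_a, probs_b)]

print("logit average:      ", [round(p, 3) for p in from_logits])
print("probability average:", [round(p, 3) for p in from_probs])
```

Logit averaging lets a single confident model dominate less when the others assign a token very low probability, while probability averaging behaves like a mixture of the models; which is preferable depends on the task.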

In [9]:
models = [
    AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct"),
    AutoModelForCausalLM.from_pretrained("gpt2"),
    AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
]


In [12]:
tokenizers = [
    AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct"),
    AutoTokenizer.from_pretrained("gpt2"),
    AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
]


In [15]:
prompt = "What is the capital of France?"

# Get the top-k predictions from the ensemble
top_k_tokens = ensemble_generate(models, tokenizers, prompt, top_k=5, device='cpu')
print(top_k_tokens)

[('Ċ', 0.876983), ('Ġwhat', 0.007054818), (',', 0.006559979), ('?', 0.0050917566), ('Ġto', 0.0030755154)]
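Index-wise averaging is only meaningful when every model maps the same ID to the same token, which does not hold for Llama 3.2 and the GPT-2 family. One possible workaround, sketched here under stated assumptions, is to combine the models in text space: take each model's top-k list, map the tokens to plain text, and average probabilities by token string. The function `combine_top_k` is hypothetical, and the inputs reuse the rounded Llama and Phi-3 probabilities printed earlier in this notebook.

```python
from collections import defaultdict


def combine_top_k(per_model_top_k: list[dict[str, float]], k: int = 5) -> list[tuple[str, float]]:
    """Average probabilities by token text across models.

    A token absent from a model's top-k list contributes 0 for that model,
    so the scores are lower bounds on the true averaged probability.
    """
    totals: dict[str, float] = defaultdict(float)
    for top_k in per_model_top_k:
        for text, prob in top_k.items():
            totals[text] += prob
    averaged = {text: total / len(per_model_top_k) for text, total in totals.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)[:k]


# Rounded values from the Llama and Phi-3 top-5 outputs, with whitespace
# markers already mapped to plain text.
llama = {" Paris": 0.544, " The": 0.076, " \n": 0.069, " ": 0.056, " (": 0.056}
phi = {"\n": 0.441, " The": 0.168, " ": 0.074, " France": 0.070, " A": 0.024}

for text, prob in combine_top_k([llama, phi], k=3):
    print(f"{text!r}: {prob:.3f}")
```

This remains an approximation: tokenizers segment text differently, so a word that is a single token in one model may span several tokens in another and will not line up by string alone.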


In [16]:
del models
del tokenizers