## Ensemble learning on transformer-based models

In [1]:
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from torch import nn, optim
import torch.nn.functional as F
import torch

In [7]:
from safetensors.torch import save_file
from safetensors.torch import load_file

In [8]:
device = ("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")

### Authenticate with Hugging Face

In [3]:
from huggingface_hub import login
from google.colab import userdata

# Read the Hugging Face token from the Colab secret named "HF_TOKEN"
login(token=userdata.get("HF_TOKEN"), add_to_git_credential=True)

### Model Selection

Hugging Face Transformers provides access to a wide variety of pre-trained models for Natural Language Processing (NLP) tasks like text generation, classification, translation, and more. These models are built using different architectures, including:

- **BERT (Bidirectional Encoder Representations from Transformers)**: A pre-trained model designed for understanding the context in text by looking at both directions (left and right). Used for tasks like classification and question answering.

- **GPT (Generative Pre-trained Transformer)**: Focused on text generation, GPT models predict the next word in a sequence, making them ideal for tasks like conversation and text completion.

- **T5 (Text-to-Text Transfer Transformer)**: A versatile model that converts all NLP tasks into a text-to-text format, applicable for translation, summarization, and more.

- **RoBERTa (A Robustly Optimized BERT Pretraining Approach)**: A variant of BERT with a revised training recipe (longer training on more data, larger batches, dynamic masking, and no next-sentence-prediction objective) that improves performance on various NLP tasks.

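Each model can be fine-tuned for specific use cases or used directly in applications through the Hugging Face transformers library. The three models loaded in this notebook (Phi-3 mini, Llama 3.2 3B Instruct, and Qwen2.5 3B Instruct) are all decoder-only, GPT-style causal language models.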

### Model Selection and Setup

**Purpose**: The model_name variable holds the identifier of the pretrained model to load from Hugging Face's model hub, for example "microsoft/Phi-3-mini-4k-instruct". The loading cells below reassign it for each of the three models.

**Model**: microsoft/Phi-3-mini-4k-instruct is a language model developed by Microsoft and tuned for instruction-following tasks; the 4k in its name refers to its 4,096-token context window.

---

```Python
tokenizer = AutoTokenizer.from_pretrained(...)
```

**Purpose**: This line loads the tokenizer associated with the microsoft/Phi-3-mini-4k-instruct model.

**How it works**:
- The tokenizer converts input text (e.g., natural language) into tokens, usually subword pieces, and maps each token to an integer ID that the model takes as input.
- The from_pretrained() method fetches the pretrained tokenizer (if not already cached locally) using the specified model name.
- trust_remote_code=True, when passed, allows transformers to execute custom Python code shipped in the model repository, which some models need for special tokenization logic. Enable it only for repositories you trust. The loading cells in this notebook do not pass it.
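
As a minimal sketch of the text-to-ID mapping described above, the following splits a word into the longest pieces found in a tiny vocabulary. The vocabulary, its IDs, and the helper name toy_tokenize are hypothetical and chosen only for illustration; real tokenizers use learned subword vocabularies with tens of thousands of entries and more elaborate splitting rules.

```python
# Hypothetical toy vocabulary: subword piece -> integer ID (not a real model's vocabulary)
TOY_VOCAB = {"al": 0, "pha": 1, "bet": 2, "ical": 3, "ly": 4, "a": 5, "b": 6, "<unk>": 7}


def toy_tokenize(text: str, vocab: dict) -> list:
    """Split text greedily into the longest pieces found in vocab and return their IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate piece first, then shrink it
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            # No piece matched: emit the unknown token and skip one character
            ids.append(vocab["<unk>"])
            pos += 1
    return ids


token_ids = toy_tokenize("alphabetically", TOY_VOCAB)
print(token_ids)
```

Because every tokenizer has its own vocabulary, the same string generally maps to different IDs (and a different number of tokens) under different models.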

---

```Python
model = AutoModelForCausalLM.from_pretrained(...)
```

**Purpose**: This line loads the pretrained model itself.

**How it works**:
- AutoModelForCausalLM loads a causal language model, meaning it is designed to generate text in an autoregressive fashion, where each token depends on the previously generated tokens.
- from_pretrained() fetches the model weights and configuration from Hugging Face’s model hub (or from the local cache if it's already downloaded).
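- trust_remote_code=True, when passed, allows transformers to execute a custom model implementation shipped in the model repository. Enable it only for repositories you trust. The loading cells below do not pass it.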
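
The autoregressive loop described above can be sketched with a toy score table in place of a real model. Everything here (the table TOY_SCORES, the helper greedy_generate, the start and end markers) is a hypothetical illustration. One simplification to note: this toy conditions only on the previous token, whereas a causal language model conditions on all preceding tokens.

```python
# Hypothetical next-token scores: previous token -> {candidate token: score}
TOY_SCORES = {
    "<s>": {"the": 2.0, "a": 1.0},
    "the": {"cat": 1.5, "dog": 1.2},
    "cat": {"sat": 2.5, "</s>": 0.5},
    "sat": {"</s>": 3.0},
}


def greedy_generate(scores: dict, start: str = "<s>", max_new_tokens: int = 10) -> list:
    """Generate tokens one at a time, always picking the highest-scoring candidate."""
    sequence = [start]
    for _ in range(max_new_tokens):
        candidates = scores.get(sequence[-1])
        if not candidates:
            break  # no known continuation for this token
        next_token = max(candidates, key=candidates.get)
        sequence.append(next_token)
        if next_token == "</s>":
            break  # end-of-sequence marker reached
    return sequence


generated = greedy_generate(TOY_SCORES)
print(generated)
```

Each new token is appended to the sequence and becomes part of the context for the next step, which is the behaviour generate() implements for the real models.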

In [3]:
model_name = "microsoft/Phi-3-mini-4k-instruct"
model_one = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer_one = AutoTokenizer.from_pretrained(model_name)


In [6]:
model_name = "meta-llama/Llama-3.2-3B-Instruct"
model_two = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer_two = AutoTokenizer.from_pretrained(model_name)


In [9]:
model_name = "Qwen/Qwen2.5-3B-Instruct"
model_three = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer_three = AutoTokenizer.from_pretrained(model_name)


### Set up the test prompt

In [5]:
test_case = "alhabetically"
w1 = tokenizer_one(test_case, return_tensors="pt")
w2 = tokenizer_two(test_case, return_tensors="pt")
# w3 = tokenizer_three(test_case, return_tensors="pt")

In [6]:
print(f"phi: {w1.input_ids}")
print(f"llama: {w2.input_ids}")
# print(f"qwen: {w3.input_ids}")

phi: tensor([[ 394, 7308,  300, 1711]])
llama: tensor([[128000,    278,     71,  10448,   2740]])


In [27]:
# https://huggingface.co/docs/transformers/main/en/model_doc/llama#transformers.LlamaForCausalLM
generate_ids = model_one.generate(w1.input_ids, max_length=30)
tokenizer_one.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


'alhabetically.\n\n    Args:\n    - words: a list of strings representing words\n\n    Returns:\n    -'
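
The warning above asks for an attention mask. As a rough sketch of what that mask is, the example below pads two token ID lists to the same length and marks real tokens with 1 and padding with 0. The helper name pad_batch, the pad ID, and the token IDs are hypothetical, and right-padding is used only for readability. In this notebook the tokenizer output w1 should already carry an attention_mask (the default behaviour of Hugging Face tokenizers, stated here as an assumption), so passing attention_mask=w1.attention_mask to generate() is the usual way to address the warning.

```python
def pad_batch(sequences: list, pad_id: int = 0):
    """Right-pad token ID lists to equal length and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical token IDs for two prompts of different length
ids, mask = pad_batch([[11, 12, 13, 14], [21, 22]], pad_id=0)
print(ids)
print(mask)
```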

### Small text summarization task

In [11]:
demo_text = """
Gaius Julius Caesar[a] (12 July 100 BC – 15 March 44 BC) was a Roman general and statesman. A member of the First Triumvirate, Caesar led the Roman armies in the Gallic Wars before defeating his political rival Pompey in a civil war, and subsequently became dictator from 49 BC until his assassination in 44 BC. He played a critical role in the events that led to the demise of the Roman Republic and the rise of the Roman Empire.
In 60 BC, Caesar, Crassus, and Pompey formed the First Triumvirate, an informal political alliance that dominated Roman politics for several years. Their attempts to amass political power were opposed by many in the Senate, among them Cato the Younger with the private support of Cicero. Caesar rose to become one of the most powerful politicians in the Roman Republic through a string of military victories in the Gallic Wars, completed by 51 BC, which greatly extended Roman territory. During this time he both invaded Britain and built a bridge across the river Rhine. These achievements and the support of his veteran army threatened to eclipse the standing of Pompey, who had realigned himself with the Senate after the death of Crassus in 53 BC. With the Gallic Wars concluded, the Senate ordered Caesar to step down from his military command and return to Rome. In 49 BC, Caesar openly defied the Senate's authority by crossing the Rubicon and marching towards Rome at the head of an army.[3] This began Caesar's civil war, which he won, leaving him in a position of near-unchallenged power and influence in 45 BC.
After assuming control of government, Caesar began a programme of social and governmental reform, including the creation of the Julian calendar. He gave citizenship to many residents of far regions of the Roman Republic. He initiated land reforms to support his veterans and initiated an enormous building programme. In early 44 BC, he was proclaimed "dictator for life" (dictator perpetuo). Fearful of his power and domination of the state, a group of senators led by Brutus and Cassius assassinated Caesar on the Ides of March (15 March) 44 BC. A new series of civil wars broke out and the constitutional government of the Republic was never fully restored. Caesar's great-nephew and adopted heir Octavian, later known as Augustus, rose to sole power after defeating his opponents in the last civil war of the Roman Republic. Octavian set about solidifying his power, and the era of the Roman Empire began.
"""

In [12]:
model_two.to(device)

# Define the summarization prompt
prompt = f"Summarize the following text:\n\n{demo_text}\n\nSummary:"

# Tokenize the input
inputs = tokenizer_two(prompt, return_tensors="pt").to(device)

# Generate the summary
summary_ids = model_two.generate(
    inputs.input_ids,
    max_new_tokens=700,
    min_new_tokens=50,
    num_beams=5,
    early_stopping=True,
    no_repeat_ngram_size=2
)

# Decode and print the summary
summary = tokenizer_two.decode(summary_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(summary)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Summarize the following text:


Gaius Julius Caesar[a] (12 July 100 BC – 15 March 44 BC) was a Roman general and statesman. A member of the First Triumvirate, Caesar led the Roman armies in the Gallic Wars before defeating his political rival Pompey in a civil war, and subsequently became dictator from 49 BC until his assassination in 44 BC. He played a critical role in the events that led to the demise of the Roman Republic and the rise of the Roman Empire.

In 60 BC, Caesar, Crassus, and Pompey formed the First Triumvirate, an informal political alliance that dominated Roman politics for several years. Their attempts to amass political power were opposed by many in the Senate, among them Cato the Younger with the private support of Cicero. Caesar rose to become one of the most powerful politicians in the Roman Republic through a string of military victories in the Gallic Wars, completed by 51 BC, which greatly extended Roman territory. During this time he both invaded Britain and bui
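
The printed output above begins by repeating the prompt. For decoder-only models, generate() returns the prompt tokens followed by the newly generated ones, so decoding the full sequence echoes the input. A common remedy is to slice off the prompt before decoding. The sketch below shows that slicing on plain lists with hypothetical token IDs; the helper name strip_prompt is also an assumption. With the tensors from the cell above, the equivalent slice would be summary_ids[0][inputs.input_ids.shape[1]:].

```python
def strip_prompt(generated_ids: list, prompt_length: int) -> list:
    """Return only the tokens that were generated after the prompt."""
    return generated_ids[prompt_length:]


prompt_ids = [101, 102, 103]            # hypothetical prompt token IDs
generated_ids = prompt_ids + [7, 8, 9]  # shape of generate() output: prompt + continuation
new_tokens = strip_prompt(generated_ids, len(prompt_ids))
print(new_tokens)
```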

### Get the top-k tokens

In [None]:
# Function to get top-k tokens and probabilities
def get_top_k(model, prompt: str, k: int = 10):
    # Tokenize the prompt (uses the notebook-level `tokenizer` that matches `model`)
    # and move the tensors to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Get model outputs
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    # Convert logits to probabilities
    probabilities = F.softmax(logits, dim=-1)

    # Get the probabilities for the last token in the sequence
    last_token_probabilities = probabilities[0, -1, :]

    # Get the top-k probabilities and token indices, highest first.
    # torch.topk replaces argsort()[-k:][::-1], because tensors do not support negative-step slicing.
    top_k_probs, top_k_indices = torch.topk(last_token_probabilities, k)
    top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_indices.tolist())

    return list(zip(top_k_tokens, top_k_probs.tolist()))
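
To make the softmax and top-k steps of get_top_k concrete, here is a small worked sketch that uses only the Python standard library. The token list, the logits, and the helper names softmax and top_k_pairs are hypothetical; in the notebook, F.softmax and the tensor operations play these roles on the model's real logits.

```python
import heapq
import math


def softmax(logits: list) -> list:
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_pairs(probs: list, tokens: list, k: int) -> list:
    """Return the k (token, probability) pairs with the highest probability, best first."""
    return heapq.nlargest(k, zip(tokens, probs), key=lambda pair: pair[1])


# Hypothetical vocabulary and last-position logits
toy_tokens = ["the", "cat", "sat", "mat"]
toy_logits = [2.0, 0.5, 1.0, -1.0]

toy_probs = softmax(toy_logits)
print(top_k_pairs(toy_probs, toy_tokens, k=2))
```

Softmax preserves the ordering of the logits, so the top-k tokens are the same whether they are selected before or after the conversion; only the reported values differ.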

In [None]:
def get_top_k_two(model, prompt, k: int = 10):
    """
    Get the top-k token probabilities from the model output.

    :param model: The causal language model to query.
    :param prompt: Tensor of input token IDs (the tokenizer's input_ids), not a string or a BatchEncoding.
    :param k: The number of top tokens to retrieve.
    :return: A list of tuples containing the top-k tokens and their probabilities.
    """
    # Get the model outputs
    with torch.no_grad():
        outputs = model(prompt)
        logits = outputs.logits

    probabilities = F.softmax(logits, dim=-1)           # Convert logits to probabilities
    last_token_probabilities = probabilities[0, -1, :]  # Get the probabilities for the last token

    # Get the top-k probabilities and token indices, highest first
    top_k_probs, top_k_indices = torch.topk(last_token_probabilities, k)

    # Uses the notebook-level `tokenizer`, which must match `model`
    top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_indices.tolist())

    return list(zip(top_k_tokens, top_k_probs.tolist()))

In [None]:
# The model expects token IDs, not text. Calling model(prompt) with the raw string raises
# "TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not str",
# so tokenize the prompt first.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Get the model outputs
with torch.no_grad():
    outputs = model(input_ids)
    logits = outputs.logits

probabilities = F.softmax(logits, dim=-1)           # Convert logits to probabilities
last_token_probabilities = probabilities[0, -1, :]  # Get the probabilities for the last token

print(probabilities)

# Get the top 10 probabilities and token indices, highest first
top_k = 10
top_k_probs, top_k_indices = torch.topk(last_token_probabilities, top_k)
top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_indices.tolist())

list_k_props = list(zip(top_k_tokens, top_k_probs.tolist()))

### Get Computed Probabilities

In [None]:
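# Pass the input_ids tensor rather than the whole BatchEncoding. Passing the encoding itself raises
# "TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not BatchEncoding".
top_k = get_top_k_two(model, prompt_tokenized.input_ids)

# get_top_k_two returns a list of (token, probability) tuples
for token, prob in top_k:
    print(f"Token: {token}, Probability: {prob:.4f}")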

### Ensemble Result for Simple token averaging
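
One possible reading of simple token averaging is sketched below: take each model's next-token distribution, key it by token string, and average the probabilities across models. This is an illustration under stated assumptions, not necessarily the method this notebook goes on to use. The helper name average_token_probs and all three distributions are hypothetical. Because the three tokenizers split text differently, token strings only partly overlap in practice; here a token missing from a model's list simply counts as probability 0 for that model.

```python
from collections import defaultdict


def average_token_probs(distributions: list) -> dict:
    """Average next-token probabilities across models, keyed by token string.

    A token missing from a model's distribution counts as probability 0 for that model.
    """
    totals = defaultdict(float)
    for dist in distributions:
        for token, prob in dist.items():
            totals[token] += prob
    n_models = len(distributions)
    return {token: total / n_models for token, total in totals.items()}


# Hypothetical top-k outputs from three models for the same prompt
phi_top = {"Rome": 0.6, "Gaul": 0.3, "Egypt": 0.1}
llama_top = {"Rome": 0.5, "Gaul": 0.25, "Italy": 0.25}
qwen_top = {"Rome": 0.7, "Italy": 0.2, "Gaul": 0.1}

ensemble = average_token_probs([phi_top, llama_top, qwen_top])
best_token = max(ensemble, key=ensemble.get)
print(best_token, round(ensemble[best_token], 4))
```

Averaging truncated top-k lists underestimates tokens that fall just outside one model's top k, so a larger k (or the full distributions) gives a more faithful average at higher cost.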