### What is a Transformer?
A transformer is a type of neural network architecture that excels at processing sequential data (like text, time series, or even music) by using a mechanism called "attention" to understand relationships between different parts of the sequence.

### What are Transformers Used For?

Natural Language Processing (NLP):

- **Language Translation**
- **Text Generation**
- **Question Answering**
- **Text Summarization**

### Key Differences Between Transformers and Feed-Forward Neural Networks (FFNN)

Feed-forward neural networks, the most basic kind of neural network, struggle with text inputs of varying length.

---

**Input Representation:**
In a feed-forward neural network (FFNN), you need to convert the entire sentence into a **fixed-size input**. This can be done by:
- **Summing or averaging word embeddings**.
- **Using a "bag of words" (BoW) representation**, where you count the occurrence of words in the sentence and feed it into an FFNN.
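As a rough illustration of the averaging option, the sketch below maps sentences of different lengths to vectors of the same size. The embedding values and the `average_embedding` helper are made up for this example; a real model would use learned embeddings with many more dimensions.

```python
# Toy 3-dimensional word embeddings (made-up values, for illustration only).
EMBEDDINGS = {
    "the": [0.1, 0.0, 0.2],
    "movie": [0.7, 0.3, 0.1],
    "was": [0.0, 0.1, 0.0],
    "not": [-0.5, 0.4, 0.9],
    "fantastic": [0.9, 0.8, 0.2],
}


def average_embedding(sentence):
    """Average the word vectors so any sentence maps to one fixed-size vector."""
    vectors = [EMBEDDINGS[word] for word in sentence.lower().split()]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


short = average_embedding("the movie was fantastic")
longer = average_embedding("the movie was not fantastic")

# Both sentences end up as vectors of the same length, whatever their word count.
print(len(short), len(longer))
```

The fixed size is what lets an FFNN consume the sentence, but the average no longer records which word came where.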

---

**Challenges with FFNN:**

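- **Loss of Sequence Information**:
  - The order of words in the sentence is lost, because summing, averaging, or counting words gives the same result for any ordering. For example, the FFNN cannot distinguish between:
    - *"The movie was not bad, it was fantastic."* (positive)
    - *"The movie was not fantastic, it was bad."* (negative)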

- **Cannot Handle Variable-Length Sentences**:
  - FFNN requires fixed-size input, so it struggles with sentences of varying lengths.

- **Contextual Dependencies**:
  - Sentiment depends on relationships between words (e.g., *"not"* modifies *"fantastic"*). FFNN cannot capture these dependencies.
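The loss of word order is easy to reproduce. In this minimal sketch, `bag_of_words` is a hypothetical helper written for the example (not a library function), and the two sentences are chosen so that they contain exactly the same words in a different order.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count word occurrences, ignoring case and basic punctuation."""
    cleaned = sentence.lower().replace(",", "").replace(".", "")
    return Counter(cleaned.split())


positive = bag_of_words("The movie was not bad, it was fantastic.")
negative = bag_of_words("The movie was not fantastic, it was bad.")

# The two sentences have opposite sentiment but identical word counts,
# so a model that only sees the counts receives the same input for both.
print(positive == negative)  # True
```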


Let's run through the transformer architecture and how training is performed. Honestly, that's all you need to know for most tasks, unless you are building and training a transformer yourself. Most NLP tasks can be handled by existing pre-trained transformers, which are available by the hundreds on Hugging Face.

### Architecture of Transformer

#### Layer 1 Embeddings
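Convert each word (token) into an embedding vector and add a positional encoding, which tells the model where the token sits in the sequence (e.g. the first word or the end of the sentence).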
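One common way to build the positional signal is the sinusoidal scheme from the original Transformer paper; the sketch below assumes that scheme, although many models (BART included) learn their position embeddings instead. The function names `positional_encoding` and `add_positions` are made up for this example.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding: even dimensions use sin, odd dimensions use cos."""
    encoding = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_positions(token_vectors):
    """Add the position signal to each token embedding, element by element."""
    return [
        [x + p for x, p in zip(vec, positional_encoding(pos, len(vec)))]
        for pos, vec in enumerate(token_vectors)
    ]


# The same word embedding placed at positions 0 and 1 (made-up values).
same_word_twice = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
with_positions = add_positions(same_word_twice)

# After adding positions, the two copies are no longer identical.
print(with_positions[0] != with_positions[1])  # True
```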

#### Layer 2 Multi-head attention

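Take a sentence such as "The cat sits on the mat". For every word, the attention layer computes three vectors:

- Query (Q): Represents what the word is looking for (e.g., for "cat": "what action does the cat do?")
- Key (K): Each word's "advertisement" of what information it holds
- Value (V): The actual information content each word contributes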

Query vectors look for specific patterns:

- Q("cat") looks for its verb/action
- Q("sits") looks for its subject
- Q("on") looks for its object

Key vectors advertise their roles:

- K("cat") advertises "I am a subject noun"
- K("sits") advertises "I am an action verb"
- K("mat") advertises "I am an object noun"

Value vectors contain the actual semantic content:

- V("cat") contains features about "feline, animal, subject"
- V("sits") contains features about "sitting action, present tense"
- V("mat") contains features about "floor covering, object"

So when the model processes "cat":

1. Its Query vector will have a high dot product with the Key vector of "sits" (because cats do actions)
2. This high attention score means it will incorporate more of the Value vector from "sits"
3. The Value vector contains the actual semantic information about sitting
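The steps above correspond to scaled dot-product attention for a single query. The sketch below is a minimal plain-Python version; the key, value, and query numbers are invented so that the query for "cat" lines up with the key for "sits", whereas in a real model all three come from learned projections of the embeddings.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


tokens = ["cat", "sits", "mat"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.0]]  # hypothetical K vectors
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # hypothetical V vectors
query_cat = [0.0, 4.0]  # hypothetical Q("cat"), aligned with K("sits")

weights, output = attention(query_cat, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.3f}")
```

Because the query aligns with the key of "sits", most of the weight lands on that token, and the output vector is dominated by its value vector.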

#### Layer 3 Feed Forward Neural network

**After attention for "cat":**
- Has information about "sits" (verb relationship)
- Has information about "the" (article relationship)

**The FFN then:**
1. Expands this combined context into a larger hidden dimension (weight matrix W1)
2. Applies a non-linearity (ReLU)
3. Projects it back into useful features of the original size (weight matrix W2)
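The three steps can be sketched as two matrix multiplications with a ReLU in between. The weights below are made up and tiny (2 inputs, 4 hidden units, 2 outputs) purely to keep the arithmetic readable; real models learn these matrices and typically use a hidden size several times larger than the input.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Keep positive values, replace negative values with zero."""
    return [max(0.0, x) for x in vector]


def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: expand with W1, apply ReLU, project back with W2."""
    hidden = relu([h + b for h, b in zip(matvec(w1, x), b1)])
    return [o + b for o, b in zip(matvec(w2, hidden), b2)]


# Hypothetical weights: expand 2 -> 4, then project 4 -> 2.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
B2 = [0.0, 0.0]

token_after_attention = [0.5, -2.0]
features = feed_forward(token_after_attention, W1, B1, W2, B2)
print(features)
```

The same function is applied to every token position independently; mixing information between positions is left to the attention layer.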

### Text Generation

In [None]:
!pip install transformers

In [1]:
from transformers import BartForConditionalGeneration, BartTokenizer
import torch

# Load BART model and tokenizer once
model_name = "facebook/bart-large"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)




In [4]:
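def generate_text(prompt, max_length=100):
    # Note: facebook/bart-large is a denoising seq2seq checkpoint, not a model
    # trained for open-ended continuation, so the output stays close to the prompt.
    inputs = tokenizer(prompt, return_tensors="pt", max_length=1024, truncation=True)
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        min_length=10,
        temperature=0.2,  # only takes effect when do_sample=True; ignored here
        num_return_sequences=1
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)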

prompt = "I have a dream that"
print("\nGeneration:")
print(generate_text(prompt))


Generation:
I have a dream for you.Posted by
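The `temperature` argument passed to `generate` rescales the model's scores before a token is sampled. As a rough sketch of that idea (the logits below are invented and `softmax_with_temperature` is a helper written for this example, not a library function), a low temperature concentrates probability on the top-scoring token, while a temperature of 1.0 leaves the distribution unchanged.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a softmax."""
    scaled = [logit / temperature for logit in logits]
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

sharp = softmax_with_temperature(logits, 0.2)  # low temperature
plain = softmax_with_temperature(logits, 1.0)  # unchanged distribution

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in plain])
```

In the Hugging Face `generate` API this only matters when sampling is enabled with `do_sample=True`; with greedy or plain beam search the highest-scoring continuation is chosen regardless of temperature.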


### Summarization

In [5]:
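def summarize_text(text, max_length=40, min_length=20):
    # Note: facebook/bart-large is the base checkpoint and is not fine-tuned for
    # summarization, so it tends to copy its input. A summarization checkpoint
    # such as facebook/bart-large-cnn is usually a better fit for this task.
    inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        num_beams=4
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

text = """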
    The Internet is a global network of interconnected computers that use standardized communication protocols 
    to share resources and information. It emerged from ARPANET in the late 1960s as a project of the United 
    States Department of Defense. Today, it connects billions of devices worldwide and has revolutionized 
    communication, commerce, entertainment, and countless other aspects of modern life.
    """
print("\nSummarization:")
print(summarize_text(text))


Summarization:
The Internet    The Internet is a global network of interconnected computers that use standardized communication protocols   .   to share resources and information. It emerged from ARPANET


### Question Answering

In [6]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch

# Load RoBERTa model and tokenizer
model_name = "deepset/roberta-base-squad2"  # Fine-tuned on the SQuAD 2.0 dataset
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

def answer_question(context, question):
    # Tokenize input text
    inputs = tokenizer(
        question,
        context,
        return_tensors="pt",
        max_length=512,
        truncation=True,
        padding=True
    )
    
    # Get model outputs (gradients are not needed for inference)
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Get the most likely beginning and end of answer
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits)
    
    # Convert tokens back to string
    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(
            inputs["input_ids"][0][answer_start:answer_end + 1]
        )
    )
    
    return answer

# Test examples
context = "Claude is an AI assistant created by Anthropic in 2022."
question = "Who created Claude?"
print("\nQuestion Answering:")
print(answer_question(context, question))

# More test examples
contexts = [
    "The Eiffel Tower was constructed between 1887 and 1889 and was designed by engineer Gustave Eiffel.",
    "Python was created by Guido van Rossum and was first released in 1991. It was named after the TV show Monty Python's Flying Circus.",
    "Nehru became the prime minister of India after India gained independence in 1947"
]

questions = [
    "When was the Eiffel Tower built?",
    "Why was Python named Python?",
    "Who was the first prime minister of India?"
]

for context, question in zip(contexts, questions):
    print("\nContext:", context)
    print("Question:", question)
    print("Answer:", answer_question(context, question))


Question Answering:
 Anthropic

Context: The Eiffel Tower was constructed between 1887 and 1889 and was designed by engineer Gustave Eiffel.
Question: When was the Eiffel Tower built?
Answer:  between 1887 and 1889


## Fine-tuning an existing pre-trained transformer

Fine-tuning is especially useful for NLP tasks like text generation, where it can bring a model up to date on new information, help it answer questions correctly, or make it adopt a certain style. Here we fine-tune the BART transformer on Shakespeare's text so that it learns the Shakespearean way of speaking and generates text in that manner.

In [22]:
import accelerate
print(accelerate.__version__)

1.0.1


In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration, TrainingArguments, Trainer
from datasets import Dataset
import torch

# Set device to CPU
device = torch.device("cpu")
print(f"Using device: {device}")

# Load model and tokenizer
model_name = "facebook/bart-base"  # Using smaller model for faster training
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name).to(device)

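# Load Shakespeare text, skipping blank lines so they do not become empty training examples
with open('shakespeare.txt', 'r', encoding='utf-8') as f:
    shakespeare_text = [line for line in f.read().split('\n') if line.strip()]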

# Create dataset
dataset = Dataset.from_dict({
    'text': shakespeare_text
})

# Preprocessing function to add inputs and labels
def preprocess_function(examples):
    inputs = tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=128
    )
    # Use the same input as label for reconstruction training.
    # Padding positions are set to -100 so the loss ignores them.
    inputs['labels'] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in ids]
        for ids in inputs['input_ids']
    ]
    return inputs

print("Tokenizing dataset...")
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset.column_names
)

# Training arguments for CPU
training_args = TrainingArguments(
    output_dir="./shakespeare-bart",
    num_train_epochs=3,
    per_device_train_batch_size=8,  # Reduced batch size for CPU
    save_steps=1000,
    save_total_limit=2,
    learning_rate=2e-5,
    warmup_steps=500,
    logging_steps=100,
    gradient_accumulation_steps=4,
    fp16=False,  # Disable mixed precision for CPU
    optim="adamw_torch",
    use_cpu=True  # Explicitly disable GPU (replaces the deprecated no_cuda flag)
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

print("Starting training...")
trainer.train()

# Save the fine-tuned model
model.save_pretrained("./shakespeare-bart-finetuned")
tokenizer.save_pretrained("./shakespeare-bart-finetuned")

# Function to test text generation
def generate_shakespeare(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt", max_length=128, truncation=True).to(device)
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_beams=5,
        do_sample=True,  # required for temperature, top_k and top_p to take effect
        temperature=0.7,
        no_repeat_ngram_size=2,
        top_k=50,
        top_p=0.95,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test generation with various prompts
test_prompts = [
    "To be, or not to be",
    "The course of true love",
    "All the world's a",
    "What light through yonder",
]

print("\nTesting generation with various prompts:")
for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    print("Generated text:")
    print(generate_shakespeare(prompt))
