# Experimenting with PyTorch and Transformers

This notebook is a playground for experimenting with PyTorch and Transformers. The goal is to understand how these libraries work and how to apply them to different tasks.

# Basics of Transformers

# Architecture of a Transformer Model

## Building a Multi-Head Attention Sublayer

In [None]:
import torch

# Initialize the input vectors (4 inputs, each of dimension d_model = 3)
x = torch.tensor([[3, 0, 0], [1, 1, 0], [5, 6, 1], [2, 2, 4]])
print("Input ", x)
print("Number of inputs:", x.shape[0])
print("d_model:", x.shape[1])

# Initialize the key, query and value weight matrices (each d_model x d_k = 3 x 4)
weight_key = torch.tensor([[0, 0, 0, 0], [2, 1, 1, 2], [0, 1, 2, 0]])
print("\nkey weight matrix", weight_key)

weight_query = torch.tensor([[1, 0, 1, 0], [0, 2, 1, 0], [0, 3, 2, 1]])
print("\nquery weight matrix", weight_query)

weight_value = torch.tensor([[0, 2, 3, 1], [0, 3, 2, 4], [1, 0, 5, 2]])
print("\nvalue weight matrix", weight_value)

# Project the input vectors with the key, query and value weight matrices
k = x @ weight_key  # keys, one row per input
q = x @ weight_query  # queries, one row per input
v = x @ weight_value  # values, one row per input

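# Scale the raw attention scores by sqrt(d_k); d_k = 4 here, so the divisor is 2
d_k = k.shape[1]
attention_scores = (q @ k.transpose(0, 1)) / d_k ** 0.5
print("\nattention_scores", attention_scores)

# Normalize each row with softmax so the weights for one input sum to 1
attention_weights = torch.softmax(attention_scores, dim=-1)
print("\nattention_weights", attention_weights)

# Weight every value vector by its attention weight
# (the list holds 4 weighted value vectors for each of the 4 inputs, in order)
attention_values = []
for i in range(x.shape[0]):  # loop over the inputs (queries)
    for j in range(k.shape[0]):  # loop over the keys/values
        attention = attention_weights[i][j] * v[j]
        attention_values.append(attention)
print("\nattention_values:\n", attention_values)

# Sum the weighted value vectors of one input to get one row of the output matrix.
# Repeat for the other inputs (only the first two rows are computed here).
n = x.shape[0]
o1 = torch.stack(attention_values[0:n]).sum(dim=0)
o2 = torch.stack(attention_values[n:2 * n]).sum(dim=0)
print("\noutput rows for inputs 1 and 2:\n", o1, "\n", o2)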
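
The cell above walks through a single attention head. As a rough sketch of how several heads could be combined into one sublayer, the function below splits the projected queries, keys and values into `num_heads` slices, runs scaled dot-product attention on each slice, concatenates the results, and applies an output projection. The function name `multi_head_attention`, the toy sizes, and the random weights are all assumptions made for illustration; they are not taken from the notebook.

```python
import torch

def multi_head_attention(inputs, w_q, w_k, w_v, w_o, num_heads):
    """Illustrative multi-head self-attention for one unbatched sequence."""
    seq_len, d_model = inputs.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(0, 1)

    q = split_heads(inputs @ w_q)
    k = split_heads(inputs @ w_k)
    v = split_heads(inputs @ w_v)

    # Scaled dot-product attention, computed for every head at once
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5  # (num_heads, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)
    heads = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate the heads and mix them with the output projection
    concat = heads.transpose(0, 1).reshape(seq_len, d_model)
    return concat @ w_o, weights

torch.manual_seed(0)  # random toy weights, for illustration only
mh_inputs = torch.randn(4, 8)  # 4 inputs, d_model = 8 (assumed sizes)
mh_w_q, mh_w_k, mh_w_v, mh_w_o = (torch.randn(8, 8) for _ in range(4))
mh_output, mh_weights = multi_head_attention(
    mh_inputs, mh_w_q, mh_w_k, mh_w_v, mh_w_o, num_heads=2
)
print("multi-head output shape:", mh_output.shape)
print("per-head weight shape:", mh_weights.shape)
```

The `mh_` prefix keeps these demo variables from overwriting `x`, `q`, `k` and `v` from the cell above.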

# Fine-Tuning BERT Models

# Pretraining RoBERTa from Scratch

# Downstream NLP Tasks using Transformers

# Machine Translation with Transformers

# The Rise of Transformers with GPT-3

# Text Summarization with Transformers

# Tokenizers and Datasets

# Semantic Labeling with BERT-based Transformers

# Question Answering with Transformers

# Sentiment Analysis with Transformers

# Fake News Detection with Transformers

# Interpreting Black-Box Transformer Models

# Task-Agnostic Transformer Models (Non-NLP)

# Transformer Models as Copilots

# Summary

# Questions and Answers

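**Q: What are attention heads in transformer models?**
A: Attention heads are parallel attention computations inside one attention sublayer. Each head has its own learned query, key and value projections and works in a lower-dimensional subspace of the model dimension, so different heads can focus on different parts of the input and learn different patterns and relationships. The head outputs are concatenated and projected back to the model dimension.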
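
To make the idea of separate heads concrete, the sketch below uses PyTorch's built-in `torch.nn.MultiheadAttention` and asks it to return one attention-weight matrix per head instead of the average. The layer sizes and the random input are assumptions chosen for illustration, and the `average_attn_weights` argument assumes a reasonably recent PyTorch release (1.11 or later).

```python
import torch
from torch import nn

torch.manual_seed(0)  # random toy data, for illustration only

# Assumed toy sizes: model dimension 8 split across 2 heads
attention_layer = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
sequence = torch.randn(1, 4, 8)  # (batch, sequence length, model dimension)

with torch.no_grad():
    # Self-attention: the same tensor serves as query, key and value
    attended, per_head_weights = attention_layer(
        sequence, sequence, sequence,
        need_weights=True,
        average_attn_weights=False,  # keep one weight matrix per head
    )

print("output shape:", attended.shape)              # (batch, seq_len, embed_dim)
print("per-head weights:", per_head_weights.shape)  # (batch, heads, seq_len, seq_len)
print("head 0 weights:\n", per_head_weights[0, 0])
print("head 1 weights:\n", per_head_weights[0, 1])
```

Each head prints a different 4 x 4 weight matrix for the same input, because each head applies its own slice of the learned projections.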

**Q: Why is the attention mechanism so important in the development of transformer models?**
A: It lets every position in the input attend directly to every other position, and it computes this for all positions at once with matrix multiplications. This is more efficient on parallel hardware than the sequential, step-by-step processing of RNNs and LSTMs, and the attention heads are independent of each other, so they can also be computed in parallel. This parallel processing is what allows transformer models to scale to larger datasets and more complex tasks.
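
The contrast can be sketched in a few lines. The recurrent part below has to loop over time steps because each step needs the previous hidden state, while the attention part handles all positions in one pair of matrix multiplications. This is a simplified illustration under assumed toy sizes: the attention part uses the raw inputs as queries, keys and values, without the learned projections a real transformer layer would apply.

```python
import torch
from torch import nn

torch.manual_seed(0)  # random toy data, for illustration only
seq_len, d_model = 6, 8  # assumed toy sizes
token_vectors = torch.randn(seq_len, d_model)

# Recurrent processing: step t cannot start before step t-1 has finished
rnn_cell = nn.RNNCell(input_size=d_model, hidden_size=d_model)
hidden = torch.zeros(1, d_model)
rnn_states = []
with torch.no_grad():
    for t in range(seq_len):
        hidden = rnn_cell(token_vectors[t].unsqueeze(0), hidden)
        rnn_states.append(hidden)
rnn_states = torch.cat(rnn_states)  # (seq_len, d_model)

# Self-attention: scores for every pair of positions in one matrix product
pair_scores = token_vectors @ token_vectors.T / d_model ** 0.5  # (seq_len, seq_len)
attention_out = torch.softmax(pair_scores, dim=-1) @ token_vectors  # (seq_len, d_model)

print("RNN states:", rnn_states.shape)
print("attention output:", attention_out.shape)
```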
 
**Q: Can transformer models only be applied to NLP tasks?**
A: No, transformer models can be applied to a wide range of tasks, including image recognition, speech recognition, and other tasks whose input can be represented as a sequence. An image, for example, can be split into patches that are treated like the tokens of a sentence.
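
As a minimal sketch of how an image could be fed to a transformer, the code below cuts a random image into non-overlapping patches, flattens each patch into a vector, projects it to the model dimension, and passes the resulting sequence through one standard encoder layer. The image size, patch size and layer sizes are assumptions for illustration, and the sketch leaves out the position embeddings and classification token that a full vision transformer would normally add.

```python
import torch
from torch import nn

torch.manual_seed(0)  # random toy data, for illustration only

image = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width), assumed size
patch_size, d_model = 8, 64        # assumed toy sizes

# Cut the image into non-overlapping patches: (1, 3, 4, 4, 8, 8)
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# Flatten every patch into one vector: (1, 16, 3 * 8 * 8) = (1, 16, 192)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

# Project patches to the model dimension and run one encoder layer
patch_embedding = nn.Linear(3 * patch_size * patch_size, d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder_layer.eval()  # disable dropout for a deterministic forward pass

with torch.no_grad():
    encoded_patches = encoder_layer(patch_embedding(patches))

print("patch sequence:", patches.shape)
print("encoded patches:", encoded_patches.shape)
```

From the encoder's point of view, the 16 patch vectors play the same role as 16 token embeddings in a sentence.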

**Q: What is the basic architecture of a transformer model?**

**Q: What are foundation models?**

**Q: What are the different components in a standard transformer model?**

**Q: What are some techniques used to train transformer models?**

**Q: What are some techniques used to fine-tune transformer models?**

**Q: What were the steps to pretrain RoBERTa models?**

**Q: How can transformer models be used for machine translation (NLP)?**

**Q: How can transformer models be used for image recognition?**

**Q: Explain the similarities and differences between OpenAI's GPT-2 and GPT-3.**

**Q: Explain the concept of the T5 transformer model.**

**Q: Explain the architecture of the T5 transformer model.**

**Q: How is the quality of data improved for the T5 transformer model?**

**Q: Explain how transformer models are able to "understand" the context of text.**

**Q: Explain how transformer models are able to "understand" long text and display reasoning skills.**

**Q: By which methods have transformers improved sentiment analysis?**

**Q: How can transformers be used to understand different perspectives in text?**

**Q: What are some hidden details in transformer models that are not often discussed?**

**Q: What are some properties of advanced transformer models?**

**Q: What are some different transformers for vision tasks?**

**Q: How are vision transformers tested, for example for image generation?**
  

