# Transformers

We've finally learned about attention, the core focus of the 2017 paper "Attention is All You Need". Let's review the concept of attention.

## Attention, Generally
Attention allows models to learn *what* to pay attention to in our input. 

<img src="https://drive.google.com/uc?export=view&id=1_5R7BB7i7R4P2RIVRt2YsLLJRhha68aN" alt="Q" width = "600"/>

In an image, this might look like learning which pixels to pay attention to when classifying an image as a cat or not a cat.


In a sequence model it looks like figuring out which words in the sequence we should pay attention to in order to *contextualize* our current word.

<img src="https://drive.google.com/uc?export=view&id=1QW1NmR2BG5t9QQpr0g0kZntMpaTt49er" alt="Q" width = "600"/>


## Self-Attention in Transformers

In Transformers we use **Scaled Dot Product Attention** in order to learn which words in a sequence to pay attention to.

### Context Aware Vectors
We're going to create a form of **context aware vectors** which take a weighted average of all the words in the sequence to create a new vector that contains not only information from our current word but ALSO from the words around it. This can help us understand words whose meaning depends on context, for example the different senses of "bank" or of "mark".

The general formula for creating these context aware vectors is:

$$x'_i = \sum_{j = 1}^N w_{ji} \cdot x_j$$

where $w_{ji}$ are weights that tell us how much of word vector $j$ to add to our context aware vector for word $i$. 

For example if we have the sentence "the fish swam near the bank" we might give the words "fish" and "swam" high weights because they help us understand that we are referring to a "bank" (the edge of a river) not a "bank" (the financial institution).
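To make the weighted average concrete, here is a minimal NumPy sketch for the "bank" example. The 2-dimensional embeddings and the weights are made-up numbers chosen for illustration only; in a real model both are learned.

```python
import numpy as np

# Toy 2-dimensional embeddings (made-up numbers) for
# "the fish swam near the bank", one row per word
X = np.array([
    [0.1, 0.0],  # the
    [0.9, 0.2],  # fish
    [0.8, 0.3],  # swam
    [0.2, 0.1],  # near
    [0.1, 0.0],  # the
    [0.5, 0.9],  # bank
])

# Hypothetical weights w_{ji} for i = "bank": "fish" and "swam" get
# high weights. They are non-negative and sum to 1.
w_bank = np.array([0.05, 0.30, 0.30, 0.05, 0.05, 0.25])

# x'_i = sum_j w_{ji} * x_j
x_bank_context = (w_bank[:, None] * X).sum(axis=0)
print(x_bank_context)
```

The new vector for "bank" is pulled toward the vectors for "fish" and "swam", which is exactly the contextualizing effect we want.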

### Queries, Keys, and Values

In order to generate the weights for our Transformer attention mechanism, we use **Queries** and **Keys**. Most textbooks and tutorials give you the analogy of querying a database, where we compare queries (what we want) to keys (what options we can choose from) and see which keys are most similar to our queries. But what's really happening is a *bunch* of matrix multiplications.

The general form of attention is:

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$ are our weights that allow us to create context aware vectors. The Queries, Keys and Values are just matrices. We create them by projecting a matrix of embedding vectors into a new space (this is just a fancy way of saying multiply our matrix of embeddings $E$ by a weight matrix $W$).

$$Q = E \cdot W_Q$$
$$K = E \cdot W_K$$
$$V = E \cdot W_V$$
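As a quick shape check, the three projections can be sketched in NumPy. The sizes (`seq_len`, `d_model`, `d_k`) are hypothetical, and the weight matrices are random here, whereas a real Transformer learns them during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 6 words, embedding size 8, query/key/value size 4
seq_len, d_model, d_k = 6, 8, 4

E = rng.normal(size=(seq_len, d_model))   # one embedding per row
W_Q = rng.normal(size=(d_model, d_k))     # random stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = E @ W_Q   # (seq_len, d_k)
K = E @ W_K   # (seq_len, d_k)
V = E @ W_V   # (seq_len, d_k)

print(Q.shape, K.shape, V.shape)
```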

We take the Query ($Q$) and Key ($K$) matrices and multiply them together (as $QK^T$). Each entry of the result is a dot product that sums $d_k$ terms, so its values tend to have more variance as $d_k$ grows. To counteract this, we **scale** the result by dividing by the square root of $d_k$ (the dimension of our query and key vectors).
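We can check the variance argument with a small simulation. This sketch assumes the entries of the query and key vectors are independent standard normal values, which is a simplification of what happens in a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000
variances = {}

for d_k in (4, 64, 512):
    # Random query and key vectors with unit-variance entries
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))

    dots = (q * k).sum(axis=1)      # one dot product per row
    scaled = dots / np.sqrt(d_k)    # the scaling used in attention

    variances[d_k] = (dots.var(), scaled.var())
    print(f"d_k={d_k:4d}  var(q.k)={dots.var():8.2f}  "
          f"var(q.k / sqrt(d_k))={scaled.var():.2f}")
```

Under this assumption the variance of the raw dot products grows roughly in proportion to $d_k$, while the scaled version stays close to 1 no matter how large $d_k$ is. That keeps the inputs to the softmax in a reasonable range.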

We pass these values through a softmax function (which takes a vector of values and rescales them into a vector of probabilities that add up to 1). These are our weights for creating our context aware vectors. We multiply these weights by our Value ($V$) vectors to complete the attention mechanism.
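Putting the pieces together, here is a minimal NumPy sketch of the attention formula for a single sequence. The function names and the random inputs are illustrative assumptions; real implementations also handle batches, masking, and multiple heads.

```python
import numpy as np

def softmax(z, axis=-1):
    """Rescale values along `axis` into probabilities that sum to 1."""
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))   # 5 words, d_k = 4 (hypothetical sizes)
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 3))   # value vectors of size 3

context, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=1))    # one sum per word
print(context.shape)
```

Row $i$ of `weights` holds the weights used to build the context aware vector for word $i$, and row $i$ of `context` is that vector.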

# Tracking our Vectors through Multi-Headed Attention


<img src="https://drive.google.com/uc?export=view&id=1WJa7zpjJISaRe-EAX1WRKMZGZ9W7R7FJ" alt="Q" height = "400"/>
<img src="https://drive.google.com/uc?export=view&id=11pjW0_p80lAjEaxEPaUXJePuy9uoJWwi" alt="Q" height = "400"/>

[worksheet](https://github.com/cmparlettpelleriti/CPSC393ParlettPelleriti/blob/main/Extras/Activities/Transformer%20Dimensions%20Worksheet.pdf)
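If you'd like to check shapes in code, here is a rough NumPy sketch of multi-headed self-attention for one sequence. The sizes are hypothetical (and not necessarily the ones used in the worksheet), the weights are random rather than learned, and it follows the standard recipe of running each head separately, concatenating the results, and applying a final output projection `W_O`.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical sizes: 6 words, model size 8, split across 2 heads
seq_len, d_model, num_heads = 6, 8, 2
d_head = d_model // num_heads                 # 4 dimensions per head

E = rng.normal(size=(seq_len, d_model))       # (6, 8) input embeddings

head_outputs = []
for h in range(num_heads):
    # Each head has its own (random, stand-in) projection matrices
    W_Q = rng.normal(size=(d_model, d_head))
    W_K = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_head))

    Q, K, V = E @ W_Q, E @ W_K, E @ W_V           # each (6, 4)
    weights = softmax(Q @ K.T / np.sqrt(d_head))  # (6, 6)
    head_outputs.append(weights @ V)              # (6, 4)

concat = np.concatenate(head_outputs, axis=-1)    # (6, 8)
W_O = rng.normal(size=(d_model, d_model))
output = concat @ W_O                             # (6, 8), same shape as E

print(E.shape, concat.shape, output.shape)
```

Because the output has the same shape as the input, these blocks can be stacked one after another.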



# Using a Transformer for Transfer Learning

While many people do still build transformers from scratch, the main way most people interact with transformers is by using pre-trained transformers for their tasks. In the following code, we'll use the famous `BERT` model on the [CoLA dataset](https://nyu-mll.github.io/CoLA/) to help us build a classification model that classifies sentences as grammatical or not grammatical.

In [1]:
# install huggingface datasets and pre-trained transformers
! pip install transformers datasets

# load packages
from datasets import load_dataset
import numpy as np
from transformers import AutoTokenizer
from transformers import TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam

# load the CoLA data set which classifies sentences as grammatical or not
dataset = load_dataset("glue", "cola")

Let's look at an example of a grammatical and a non-grammatical sentence from our data:

In [2]:
# look at examples from the data
print("Example of grammatical sentence    : ", dataset["train"]["sentence"][19])
print("Example of non-grammatical sentence: ", dataset["train"]["sentence"][20])

Example of grammatical sentence    :  The professor talked us into a stupor.
Example of non-grammatical sentence:  The professor talked us.


Next, we'll load in our pre-trained tokenizer (which processes our raw text data for us) and our pre-trained model.

In [3]:
# tokenize and standardize data
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") # create function to tokenize
tokenized_data = tokenizer(dataset["train"]["sentence"], return_tensors="np", padding=True) # tokenize our data

# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras
tokenized_data = dict(tokenized_data)

# turn labels of grammatical (1) and non-grammatical (0) into a np array
labels = np.array(dataset["train"]["label"])  # Label is already an array of 0 and 1
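To get a feel for what `padding=True` does, here is a small NumPy sketch that pads sequences of different lengths to a common length and builds the matching attention mask. The token ids and the padding id are hypothetical stand-ins; the real tokenizer looks them up in its vocabulary and does all of this for us.

```python
import numpy as np

# Hypothetical token ids for three sentences of different lengths
token_ids = [
    [101, 7, 8, 9, 102],
    [101, 7, 102],
    [101, 7, 8, 102],
]
pad_id = 0  # assumed padding id for this sketch

lengths = [len(ids) for ids in token_ids]
max_len = max(lengths)

# Pad every sequence on the right up to the longest one
input_ids = np.array([ids + [pad_id] * (max_len - len(ids)) for ids in token_ids])

# 1 marks a real token, 0 marks padding the model should ignore
attention_mask = (np.arange(max_len)[None, :] < np.array(lengths)[:, None]).astype(int)

print(input_ids)
print(attention_mask)
```

Padding lets us stack all the sentences into one rectangular array, and the mask tells the model which positions are just filler.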

In [4]:
# Load our pre-trained model (we compile it in a later cell)
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")

model.summary()

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108310272 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 108,311,810
Trainable params: 108,311,810
Non-trainable params: 0
_________________________________________________________________


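Now we can compile and train our model on the CoLA data, and see how it does! Note that the evaluation below uses the same training data the model was fit on, so the accuracy it reports is optimistic; the `validation` split of the dataset would give a fairer estimate.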

In [7]:
# Lower learning rates are often better for fine-tuning transformers
model.compile(optimizer=Adam(3e-5), metrics = ["accuracy"])

model.fit(tokenized_data, labels)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.




<keras.callbacks.History at 0x7efb99268e50>

In [8]:
model.evaluate(tokenized_data, labels)



[0.1220550537109375, 0.9638638496398926]

# Learn More About Common Transformer Models

- [BERT](https://huggingface.co/blog/bert-101)
- [GPT](https://jalammar.github.io/illustrated-gpt2/)

# Try Different Pre-Trained Models

[This Model](https://huggingface.co/spaces/Tuana/should-i-follow#lighttheme) takes a person's tweets in order to summarize what they tweet about. Try this with some of your favorite people on Twitter.

[This Model](https://huggingface.co/distilbert-base-uncased) can predict masked words. Try feeding it some sentences like "I could fight [MASK] raccoons" or "[MASK] is the best undergraduate major".

[This Model](https://huggingface.co/spaces/songweig/rich-text-to-image) can create images for you based on text. 

[This Model](https://huggingface.co/spaces/ECCV2022/dis-background-removal) can remove the background from pictures.

[This Model](https://huggingface.co/spaces/trl-lib/stack-llama) can answer questions!