# Large language models (LLMs): Part II

Author: Archit Vasan, including materials on LLMs by Varuni Sastri, and discussion/editorial work by Taylor Childers, Carlo Graziani, Bethany Lusch, and Venkat Vishwanath (Argonne)

Some inspiration from the blog post "The Illustrated Transformer" by Jay Alammar, highly recommended reading.

Before you begin, make sure that you have your environment set up and your repo refreshed, as described in previous lessons, and reviewed in the accompanying 'Readme.md' file. Make sure that you select the kernel 'datascience/conda-2023-10-04' at the top-left of the Jupyter notebook.

## How are LLMs used in practice?

### Typically using Huggingface and the `transformers` library

<img src="images/hf-logo-with-title.png" alt="Drawing" style="width: 300px;"/>

Several tools and libraries are available for working with large language models. In this tutorial we will look at `transformers`, a popular library for natural language understanding and generation tasks, built on top of PyTorch and TensorFlow.

HuggingFace is a platform and community that provides open-source libraries, tools, and resources such as pre-trained models and datasets.

Refer to the following links for more information:

https://huggingface.co/docs/hub/index
https://huggingface.co/docs/transformers/en/index

### Huggingface pipeline

**(1) Installations and imports**

In [None]:
!pip install transformers -U
!pip install torch -U

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig

**(2) Set up the prompt**

A "prompt" is the input or query provided to a language model. It guides generation by supplying the context from which the model produces coherent and relevant text.

In [4]:
input_text = "My dog really wanted to"

Here we use a simple prompt, but prompt engineering is a growing field of research.
For those curious about prompt engineering, this notebook is a great introduction:

https://github.com/argonne-lcf/llm-workshop/blob/main/tutorials/02-basic-prompt-engineering/Tutorial_02_Prompt_Engineering.ipynb

**(3) Set up the Huggingface pipeline**

Commonly, Huggingface pipelines are used to set up the model and tokenizer in a single line of code.

In [7]:
from transformers import pipeline
generator = pipeline("text-generation", model="openai-community/gpt2")

**(4) Load prompt into pipeline and we can see what my dog wanted to do...**

`max_length` caps the total length of each generated sequence in tokens (the prompt counts toward it), and `num_return_sequences` is the number of sequences returned.

In [None]:
generator(input_text, max_length=30, num_return_sequences=5)

My dog really doesn't like his dad and is pretty big...

## What is going on under the hood?

There are two components that are "black-boxes" here:
1. The method for tokenization
2. The model that generates novel text.

Carlo Graziani already gave a great explanation of tokenization last week and how this affects embeddings (https://github.com/argonne-lcf/ai-science-training-series/blob/main/04_intro_to_llms/Sequential_Data_Models.ipynb)

Let's print the config for the gpt2 model and find out how the model works!

In [15]:
model_name = 'openai-community/gpt2'
config = AutoConfig.from_pretrained(model_name)  # loads only the configuration, not the model weights
print(config)

GPT2Config {
  "_name_or_path": "openai-community/gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.38.1",
  "use_cache": true,
  "vocab_size": 50257
}



The config lists `GPT2LMHeadModel` as the model's architecture. Let's inspect it further...

In [17]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
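
As a rough cross-check of the printout above, the layer shapes can be turned into a parameter count. This is a sketch under stated assumptions, not output from the notebook: it assumes that `"n_inner": null` means the MLP width defaults to `4 * n_embd`, that `c_attn` maps `n_embd` to `3 * n_embd` (queries, keys and values stacked), and that `lm_head` shares its weight matrix with `wte`. The function name `gpt2_parameter_count` is made up for this illustration.

```python
def gpt2_parameter_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12):
    """Count trainable parameters from the config values printed above."""
    n_inner = 4 * n_embd  # assumption: "n_inner": null falls back to 4 * n_embd

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias vector

    layer_norm = 2 * n_embd  # one scale and one shift per channel
    embeddings = vocab_size * n_embd + n_positions * n_embd  # wte + wpe
    attention = linear(n_embd, 3 * n_embd) + linear(n_embd, n_embd)  # c_attn + c_proj
    mlp = linear(n_embd, n_inner) + linear(n_inner, n_embd)  # c_fc + c_proj
    block = 2 * layer_norm + attention + mlp  # ln_1, attn, ln_2, mlp

    # assumption: lm_head reuses the wte matrix (weight tying), so it adds nothing
    return embeddings + n_layer * block + layer_norm  # last term is ln_f

total = gpt2_parameter_count()
print(f"{total:,} parameters")  # 124,439,808
```

Under these assumptions the count comes to roughly 124M, most of it in the 12 Transformer blocks and the token embedding table.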


## General elements of an LLM

GPT-2 is an example of the popular Transformer-Decoder architecture. 

Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.

<img src="images/decoder_only_block.png" alt="Drawing" style="width: 200px;"/>
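
To make "can only access the words positioned before it" concrete, here is a minimal sketch in plain Python (no PyTorch, so the arithmetic stays visible). The scores are hypothetical and the helper name `causal_softmax` is made up for this illustration. The `Head` class later in this notebook achieves the same effect by filling future positions with `-inf` before the softmax.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only look at positions 0..i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                       # drop the scores of future positions
        peak = max(visible)                          # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # future positions get weight 0, which is what a -inf score would produce
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Hypothetical raw attention scores for a 3-token sequence (all equal for clarity).
toy_scores = [[1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]]

toy_weights = causal_softmax(toy_scores)
for row in toy_weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.33, 0.33, 0.33]
```

The first token can only attend to itself, the second splits its attention over two tokens, and so on. This is what makes the model auto-regressive.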

The elements of LLMs are:
* Tokenization of inputs: covered in https://github.com/argonne-lcf/ai-science-training-series/blob/main/04_intro_to_llms/Sequential_Data_Models.ipynb
* Token embedding: covered in https://github.com/argonne-lcf/ai-science-training-series/blob/main/04_intro_to_llms/Sequential_Data_Models.ipynb
* Positional encoding
* Multi-head attention
* Feed forward networks: covered in https://github.com/argonne-lcf/ai-science-training-series/blob/main/02_intro_neural_networks/01_introduction_mnist.ipynb
* Output layer

We are going to take a closer look at positional encoding, masked multi-head attention and the output layer in this lecture.

**Let's set up all the imports we need**

In [23]:
## IMPORTS AND HYPERPARAMETERS

import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)


<torch._C.Generator at 0x11348ebf0>

### Positional encoding

**Positional encoding** accounts for the order of the words in the input sequence.

Consider the following two sentences to see why this is important:

``The man ate the sandwich.``

``The sandwich ate the man.``

Clearly, these are two vastly different situations even though they have the same words.

Transformers differentiate between these situations by adding a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word.

<img src="images/transformer_positional_encoding_vectors.png" alt="Drawing" style="width: 500px;"/>

Image credit: https://jalammar.github.io/illustrated-transformer/

First, we set up the token embedding using ``nn.Embedding`` in PyTorch (``vocab_size`` and ``n_embd`` are defined in the data-loading and hyperparameter cells further down, so run those first):

In [24]:
token_embedding_table = nn.Embedding(vocab_size, n_embd)

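We set up the position embedding in the same way as the token embedding, using ``nn.Embedding`` with one learned vector per position (``block_size`` comes from the hyperparameter cell further down). This is a simple learned embedding; more elaborate schemes, such as sinusoidal positional encoding, are also used.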

For an explanation of different positional encodings, refer to this post: https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/

In [25]:
position_embedding_table = nn.Embedding(block_size, n_embd)

Then the position embedding is simply added to the token embedding, so each vector carries both what the token is and where it sits in the sequence.

In [None]:
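x = torch.tensor([1,3,15,4,7,1,4,9])  # 8 example token ids
x = token_embedding_table(x)          # (8, n_embd)
print("Token embedding alone")
print(x)
# one position vector per token: index by the sequence length (8), not block_size,
# otherwise the shapes (8, n_embd) and (block_size, n_embd) do not match
x = x + position_embedding_table(torch.arange(x.shape[0]))
print("\nToken embedding with position")
print(x)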
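
For comparison with the learned position embedding used here, the sinusoidal alternative can be sketched in a few lines of plain Python. This is an illustration under assumptions, not part of the notebook's model: the function name `sinusoidal_position_encoding` is made up, and the base of 10000 follows the usual convention for this scheme.

```python
import math

def sinusoidal_position_encoding(num_positions, dim, base=10000.0):
    """Return a num_positions x dim table of fixed (not learned) position vectors."""
    table = []
    for pos in range(num_positions):
        row = []
        for j in range(dim):
            # channels 2i and 2i+1 share one frequency; even ones use sin, odd ones cos
            angle = pos / base ** ((j - j % 2) / dim)
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_position_encoding(num_positions=8, dim=4)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 in alternating channels
```

Unlike `nn.Embedding`, this table has no trainable parameters; it would be added to the token embeddings in the same way.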

#### TODO: Add visualization of token vs positional encoding

### Self-attention mechanisms

Now I will explain self-attention at a high level.

Say the following sentence is an input sentence we want to translate using an LLM:

**“The animal didn't cross the street because it was too tired”**

When the model processes the word “it”, self-attention associates “it” with “animal”.

As the model processes each word in the input sequence, self-attention looks at other positions in the input sequence for clues that lead to a better encoding of this word.


<img src="images/transformer_self-attention_visualization.png" alt="Drawing" style="width: 300px;"/>

Image credit: https://jalammar.github.io/illustrated-transformer/

For self-attention there are 5 general steps:

1. Generate query, key and value vectors for each word
2. Score every word against every other word in the input sentence by taking the dot product of one word's query vector with the other word's key vector.
3. Divide the scores by the square root of the dimension of the key vectors to stabilize the gradients. This is then passed through a softmax operation.
4. Multiply each value vector by the softmax score.
5. Sum up the weighted value vectors to produce the output.

<img src="images/self-attention-output.png" alt="Drawing" style="width: 450px;"/>

Image credit: https://jalammar.github.io/illustrated-transformer/
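
The five steps above are often written compactly as scaled dot-product attention. In the notation below (an assumption of this note, not used elsewhere in the notebook), $Q$, $K$ and $V$ stack the query, key and value vectors of all words as rows, and $d_k$ is the dimension of the key vectors:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Here $Q K^{\top}$ is step 2, the division by $\sqrt{d_k}$ and the row-wise softmax are step 3, and the final multiplication by $V$ performs steps 4 and 5 together.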

Let's see how attention is performed in the code.

In [39]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# We want the attention weights `wei` to be data dependent: each token gathers
# information from the other tokens, weighted by how relevant they are to it.
# (No causal mask is applied in this cell, so every token sees the whole sequence.)

head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16): each of the B*T tokens produces its key independently and in parallel
q = query(x) # (B, T, 16)
v = value(x) # (B, T, 16)

wei = q @ k.transpose(-2, -1) * head_size**-0.5 # (B, T, 16) @ (B, 16, T) -> (B, T, T)
wei = F.softmax(wei, dim=-1) # exponentiate and normalize so that each row is a distribution summing to 1;
                             # row t says how much information token t aggregates from every token

out = wei @ v # weighted sum of the value vectors: (B, T, T) @ (B, T, 16) -> (B, T, 16)


In [None]:
print("Softmax")
print(wei[0])
print("\nOutput")
print(out[0])

### Multi-head attention

In practice, multiple attention heads are used, which
1. Expands the model’s ability to focus on different positions and prevents the attention from being dominated by the word itself.
2. Gives the attention layer multiple “representation subspaces”, since each head has its own set of Query/Key/Value weight matrices.

<img src="images/transformer_multi-headed_self-attention-recap.png" alt="Drawing" style="width: 700px;"/>

Image credit: https://jalammar.github.io/illustrated-transformer/
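
In symbols, with $X$ the matrix of input vectors, $h$ heads, per-head weight matrices $W_i^Q$, $W_i^K$, $W_i^V$ and an output projection $W^O$ (notation assumed for this note), each head runs the scaled dot-product attention from the five steps above on its own projections, and the results are concatenated:

$$
\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(X W_i^Q)(X W_i^K)^{\top}}{\sqrt{d_k}}\right) X W_i^V,
\qquad
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O
$$

With the hyperparameters used later in this notebook (`n_embd = 64`, `n_head = 4`), each head works in a 16-dimensional subspace, and concatenating the four heads gives back a 64-dimensional vector per token.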

### Let's see attention mechanisms in action!

The attention mechanisms can be significantly more complex as the number of heads increases!

In [None]:
!pip install ipywidgets

We are going to use the powerful visualization tool bertviz, which allows an interactive experience of the attention mechanisms. Normally these mechanisms are abstracted away but this will allow us to inspect our model in more detail.

In [None]:
!pip install bertviz

Let's load in the model, GPT2 and look at the attention mechanisms. 

**Hint... click on the different blocks in the visualization to see the attention**

In [22]:
from transformers import AutoTokenizer, AutoModel, utils, AutoModelForCausalLM

from bertviz import model_view
utils.logging.set_verbosity_error()  # Suppress standard warnings

model_name = 'openai-community/gpt2'
input_text = "No, I am your father"  
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(input_text, return_tensors='pt')  # Tokenize input text
outputs = model(inputs)  # Run model
attention = outputs[-1]  # Retrieve attention from model outputs
tokens = tokenizer.convert_ids_to_tokens(inputs[0])  # Convert input ids to token strings
model_view(attention, tokens)  # Display model view

## Let's put the pieces together to train a mini-LLM from scratch

### Set up hyperparameters:

In [42]:
# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4 ## so head_size = 16
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

<torch._C.Generator at 0x11348ebf0>

### Load in data and create train and test datasets

We're going to be using the tiny Shakespeare dataset. 
The data is tokenized with a simple character-level tokenizer: each unique character is mapped to an integer.
The data is then split 9:1 into a training set and a held-out validation set, so that we can check the model on text it was not trained on.

In [46]:
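# -o saves the download as input.txt; without it curl only prints the text to the notebook output
!curl -o input.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt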
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
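
To see what the `x` and `y` returned by `get_batch` mean, here is a toy sketch in plain Python for a single sequence. The names prefixed with `toy_` and the example string are hypothetical and chosen so that they do not overwrite the notebook's own variables.

```python
toy_text = "hello world"   # hypothetical stand-in for the Shakespeare text
toy_block = 4              # hypothetical context length, kept tiny for readability

toy_stoi = {ch: i for i, ch in enumerate(sorted(set(toy_text)))}
toy_data = [toy_stoi[ch] for ch in toy_text]

start = 0                                            # get_batch draws this offset at random
toy_x = toy_data[start : start + toy_block]
toy_y = toy_data[start + 1 : start + toy_block + 1]  # same window, shifted right by one

# Each prefix of toy_x is one training example; its label is the matching entry of toy_y.
examples = [(toy_x[: t + 1], toy_y[t]) for t in range(toy_block)]
for context, target in examples:
    print(context, "->", target)
```

A single window of length `block_size` therefore yields `block_size` next-token prediction problems, one per position, which is why the targets are just the inputs shifted by one character.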

### Set up the components of the LLM: 
* MultiHeadAttention
* FeedForward Network

In [None]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape   # C is n_embd here, not head_size
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size), the key dimension
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # causal mask: hide future positions
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T, head_size)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd), # Projection layer going back into the residual pathway
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


### Combine components into the Transformer block

In [None]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # Communication
        x = x + self.ffwd(self.ln2(x))  # Computation
        return x

### Set up the Transformer model 
This is a combination of the Token embeddings, Positional embeddings, Transformer block and an output block.

Let's briefly discuss the output block.

#### How do we turn the output of the decoder stack into a word?

Using the final Linear layer and a Softmax Layer.

The Linear layer projects the vector produced by the stack of decoders into a larger vector called a logits vector.

If our model knows 10,000 unique English words learned from its training dataset, the logits vector is 10,000 cells wide: each cell corresponds to the score of a unique word.

The softmax layer turns those scores into probabilities. In the simplest (greedy) scheme, the cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step. The `generate` method below instead samples the next token from these probabilities with `torch.multinomial`.

<img src="images/transformer_decoder_output_softmax.png" alt="Drawing" style="width: 500px;"/>

Image credit: https://jalammar.github.io/illustrated-transformer/
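
The step from logits to an output token can be sketched in plain Python. The four-word vocabulary and the logit values are hypothetical, chosen only to show the difference between always taking the most probable token and sampling from the distribution.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-word vocabulary and logits, for illustration only.
toy_vocab = ["a", "am", "student", "I"]
toy_logits = [0.5, 2.0, 1.0, -1.0]
toy_probs = softmax(toy_logits)

# Greedy decoding: always take the most probable token.
greedy = toy_vocab[toy_probs.index(max(toy_probs))]

# Sampling: draw a token in proportion to its probability.
rng = random.Random(0)                        # seeded so the draw is repeatable
sampled = rng.choices(toy_vocab, weights=toy_probs, k=1)[0]

print([round(p, 3) for p in toy_probs], greedy, sampled)
```

Greedy decoding is deterministic, while sampling can return any token with non-zero probability, which usually gives more varied text.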

In [None]:
# small GPT-style decoder-only Transformer (despite the class name, this is not a bigram model)
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position embeddings feed a stack of Transformer blocks; lm_head maps the result to next-token logits
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx



### Training loop

Our model produces a probability distribution over the next token at each position. We want to compare these probability distributions to the ground truth, i.e. the token that actually comes next.
To compare them, we measure the difference with a loss such as cross-entropy or Kullback–Leibler divergence. Training then uses back-propagation to compute the gradients of this loss, and the optimizer updates the weights to reduce it.

A more complex situation is translating the sentence: “je suis étudiant” into “i am a student” as can be seen in the example:

<img src="images/output_target_probability_distributions.png" alt="Drawing" style="width: 500px;"/>

In this example, we are using cross entropy loss.
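
When the target is a single correct token, cross-entropy reduces to minus the log of the probability the model assigned to that token. The sketch below shows this in plain Python with a hypothetical vocabulary of 50 tokens; `F.cross_entropy` in the training code works on logits and averages over a batch, but the per-token quantity is the same idea.

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy against a one-hot target: minus the log-probability of the true token."""
    return -math.log(probs[target_index])

toy_vocab_size = 50                                   # hypothetical vocabulary size
uniform = [1.0 / toy_vocab_size] * toy_vocab_size     # a model that guesses blindly
confident = [0.9] + [0.1 / (toy_vocab_size - 1)] * (toy_vocab_size - 1)

print(cross_entropy(uniform, 0))     # equals ln(50): the baseline before any learning
print(cross_entropy(confident, 0))   # small loss: most probability mass is on the target
print(cross_entropy(confident, 1))   # large loss: the target was given little probability
```

This gives a useful sanity check for training: at initialization the loss should sit near the log of the vocabulary size and fall from there.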

In [None]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
    
model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))