### Some terms you need to know to understand Large Language Models
  - **logits**
  - softmax
  - cross entropy
  - one hot encoding
  - encoding/embedding
  - tokens, tokenizer
  - attention
    - multi-head attention
  - positional encoding
    - rotary positional encoding
  - transformer
  - optimizer
  - Important players in the AI Ecosystem
  - How to train/fine tune!

### Most of these terms came about because of odd historical glitches. We're stuck with them because EVERYBODY uses them.

#### Assume you have a list of common names in a simple text file, one name per line. Let's call this file `names.txt`
#### Assume you want to use the information in this text file to create more names,
#### with the proviso that the new names you make must "seem similar" to the names in `names.txt`
#### One simple way to do this is to use probabilities!
#### Assume that you want to measure $P(l_{j+1} \mid l_{j})$, e.g. given the letter 'e', what is the likelihood that the next letter is some particular letter (e.g. 'r')?
#### In other words, start with a random letter, then repeatedly pick the next letter at random according to the probability distribution you glean from `names.txt`

In [None]:
# first load names.txt and split it into lines (one name per line)

In [None]:
words = open('names.txt', 'r').read().splitlines()

In [None]:
words[:10]

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from tinytorch import *
from subroutines import *

In [None]:
import math  # used further down for math.e and math.log
import matplotlib.pyplot as plt
import numpy as np

from matplotlib.patches import Ellipse
from matplotlib.text import OffsetFrom


In [None]:
N = torch.zeros((27, 27), dtype=torch.int32)

In [None]:
# get the list of all characters used in 'names.txt' 
chars = sorted(list(set(''.join(words))))
# to make things look prettier, use '.' as the first 'token'
stoi = {s:i for i,s in enumerate(['.'] + chars)}
# stoi is string to int, itos is int to string 
itos = {i:s for s,i in stoi.items()}

In [None]:
# now iterate through all the words, and simply count how many times a given transition occurs
# i.e. given 'e', how many times does each of the other letters occur?
# The '.' is to signify that we are either starting or ending a name.
for w in words:
  # pad each name with '.' at the start and end
  chs = ['.'] + list(w) + ['.']
  # for each pair of characters
  for ch1, ch2 in zip(chs, chs[1:]):
    ix1 = stoi[ch1]
    ix2 = stoi[ch2]
    N[ix1, ix2] += 1

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues')
for i in range(27):
    for j in range(27):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha="center", va="bottom", color='gray')
        plt.text(j, i, N[i, j].item(), ha="center", va="top", color='gray')
plt.axis('off');

### Remember we used 0 as the start/stop marker, i.e. '.'
### So N[0] holds the counts of how many times each letter appeared as the first letter of a name,
#### i.e. directly after the starting '.' (N[0, 0] is zero because no name starts with '.'!)

In [None]:
print(hprt(N[0]))

In [None]:
N[0,stoi['a']],N[0,stoi['b']],N[0,stoi['c']],N[0,stoi['z']]

In [None]:
list(map(lambda x: x.item(), [N[0,stoi['a']],N[0,stoi['b']],N[0,stoi['c']],N[0,stoi['z']]]))

In [None]:
# here is how we convert raw counts to probabilities
p = N[0].float()
p = p / p.sum()
print(hprt(ps=p.shape, p=p))

### Notice the `0.` in the first column. Zeros are generally bad because $\log(0) = -\infty$

### Below, we are drawing a random sample from the array `p`.
#### `torch.multinomial` returns a tensor of indices (of shape `(num_samples,)`), each drawn with probability proportional to its entry in `p`
#### `replacement` is `False` by default, which means that once an index is chosen, it will not be chosen again.
#### i.e. `replacement=True` allows the same letter (i.e. the same index into `p`) to be chosen multiple times
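
To make the `replacement` flag concrete, here is a minimal sketch. The four-entry distribution and the seed are made up for illustration and are not taken from `names.txt`.

```python
import torch

# Hypothetical distribution over 4 indices; index 0 has zero probability.
p = torch.tensor([0.0, 0.5, 0.3, 0.2])

g = torch.Generator().manual_seed(0)  # arbitrary seed, only for repeatability

# With replacement: the same index may be drawn more than once.
with_repl = torch.multinomial(p, num_samples=10, replacement=True, generator=g)

# Without replacement: every drawn index is distinct, so num_samples
# cannot exceed the number of non-zero entries in p (3 here).
without_repl = torch.multinomial(p, num_samples=3, replacement=False, generator=g)

print(with_repl.shape, with_repl.tolist())
print(without_repl.shape, sorted(without_repl.tolist()))
```

Index 0 never shows up in either result because its probability is zero, which is the same reason a name can never start with '.' when sampling from `N[0]`.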

In [None]:
torch.manual_seed(1337)
PP = torch.multinomial(p, num_samples=3, replacement=True)

In [None]:
PP

In [None]:
for idx in PP.tolist():
  print(f'{idx=} {itos[idx]=}')

## In LLMs, the network generates a probability for every possible next token (or word)
  - Alert! LOGITS and SOFTMAX ahead!
## Then we call `torch.multinomial` to select the next token!
### So are we done? Can we just use these probabilities to generate cool looking names?

In [None]:
g = torch.Generator().manual_seed(2147483647)
def genNames(N, count):
  for i in range(count):
    ix = 0
    dst = []
    ids = []
    while True:
      p = N[ix].float()
      p = p / p.sum()
      ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
      ids.append(ix)
      dst.append(itos[ix])
      # print(f'{ix=} {itos[ix]=}')
      if ix == 0:
        break
    print(f"{len(ids)=} name={''.join(dst)}")
genNames(N, 10)

### They are kind of like names, but not that great!
#### Still, they are better than names generated from a uniform probability matrix, where every letter is equally likely

In [None]:
FakeN = torch.ones(27,27,dtype = torch.float32) / 27.0
assert N.shape == FakeN.shape
print(N[0]); print(FakeN[0])
genNames(FakeN, 5)

## In LLMs, the probability matrix (much like `p` above) is typically generated by using `Softmax(LOGITS)`
  - <https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html#softmax>
  - <https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html#torch-nn-functional-softmax>
  - `torch.nn.Softmax(dim=None)` is defined as $\displaystyle \mathrm{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{L} \exp(x_j)}$
    -  $\exp(x)$ is of course $e^x$
    -  $x_i$ is the $i$'th element of $x$ i.e. `x[i]`
    -  $L$ is the length of the tensor (assuming a 1D tensor)
    -  The output always sums to 1.0
  - In other words: first exponentiate each element of the tensor (separately), then divide each result by the sum of all of the exponentiated elements
  - Careful! pytorch-isms are sometimes confusing.
    - `nn.Softmax` is an `nn.Module` (think of it as a function).
    - When an instance of this module is "called" with a `tensor`, it relays the call (and the argument) to `torch.nn.functional.softmax`
    - `nn.functional.softmax` calls out to the `C++` back end, which does the actual work!
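
The definition above can be checked against both pytorch entry points. This is a small sketch with arbitrary example values; the by-hand version follows the formula literally and skips the numerical-stability tricks a real implementation would use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([1.0, 2.0, 3.0])  # arbitrary example values

# 1. By hand: exponentiate each element, divide by the sum of the exponentials.
manual = x.exp() / x.exp().sum()

# 2. The functional form.
functional = F.softmax(x, dim=0)

# 3. The module form: build the module first, then call it like a function.
module = nn.Softmax(dim=0)(x)

print(manual, functional, module)
print(functional.sum())  # the probabilities add up to 1 (up to float rounding)
```

All three give the same tensor; which one to use is mostly a question of whether you want a layer object (module) or a plain function call.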

## So WTF is **logits** ??
  - It's similar in concept to the `N` matrix above. But instead of counts, think of it as being something close to $\log(\text{probability})$ or $\log(\text{counts})$
  - Why do we do this? Because a neural network outputs arbitrary real numbers (positive or negative), not counts, and it is numerically much easier to work with values in a small range
  - Think of `logits` as a weird mangling of *log-probability*
    - Strictly speaking, the logit of a probability $p$ is the log-odds $\log\frac{p}{1-p}$; in deep learning the word is used loosely for unnormalized log-probabilities
  -  Historically, the tensor (the output of some layer of a neural network) that is fed into the `Softmax()` function is called `logits`
  -  Because neural networks like to work with small numbers, roughly between -1.0 and 1.0, it's cumbersome to represent counts of events directly.
  -  Instead, we pretend that these small values are (unnormalized) log(probabilities), so that we can push them into Softmax to turn the logits into a tensor of probabilities
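
One way to see the "log of counts" reading is to go the other way: take some counts, take their log, and check that softmax recovers the normalized counts. The counts below are invented for illustration.

```python
import torch
import torch.nn.functional as F

# Made-up counts, standing in for one row of the N matrix.
counts = torch.tensor([4.0, 1.0, 3.0, 2.0])
p_from_counts = counts / counts.sum()

# Treat log(counts) as the logits.
logits = counts.log()
p_from_logits = F.softmax(logits, dim=0)

# Adding a constant to every logit does not change the probabilities,
# which is why logits are only *unnormalized* log-probabilities.
p_shifted = F.softmax(logits + 5.0, dim=0)

print(p_from_counts)
print(p_from_logits)
print(p_shifted)
```

The three printed tensors agree up to float rounding, so exponentiating logits really does play the role of "counts" in the bigram example.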

### Cross Entropy
  - This is used during the training phase to calculate the "error", usually called `Loss`
  - GOAL: maximize the likelihood of the data w.r.t. the model parameters (statistical modeling)
  - The likelihood of the data is the product of the probabilities of the individual tokens, and log(a*b*c) = log(a) + log(b) + log(c), so taking the log turns that product into a sum
  - Equivalent to maximizing the log likelihood (because log is monotonic)
  - Equivalent to minimizing the negative log likelihood
  - Equivalent to minimizing the average negative log likelihood
  - Assume you have expected answers `Y`, an int tensor of shape (B), where each integer denotes a specific token (set B=1 for the simple case)
  - Assume you have a logits tensor of shape (B, R) where R is the number of tokens (or "classes")
  - Then, `counts = logits.exp()` (elementwise exponentiation)
  - Then, `probabilities = counts/counts.sum(...)` (dividing each row of counts by that row's total number of "events")
  - ``` 
    def CrossEntropy(logits, Y):
      B = Y.size()[0]  # batch size
      counts = logits.exp()  # akin to N above
      probs = counts / counts.sum(1, keepdims=True) # this is the result of softmax!
      # average negative log of the probability assigned to each correct token
      loss = -probs[torch.arange(B), Y].log().mean()
      return loss
    ```
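  - So what is `probs[torch.arange(B), Y]` doing?
    - For each row `b` of `probs`, it picks out the single entry `probs[b, Y[b]]`, i.e. the probability the model assigned to the correct token, giving a tensor of shape (B)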
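
The indexing expression `probs[torch.arange(B), Y]` is easier to follow on a tiny example. The probabilities and targets below are invented; the explicit loop and the `gather` call are just two other ways of writing the same selection.

```python
import torch

# Two rows (B=2) of made-up probabilities over 3 classes.
probs = torch.tensor([[0.1, 0.2, 0.7],
                      [0.3, 0.4, 0.3]])
Y = torch.tensor([2, 0])  # correct class for each row
B = Y.shape[0]

# Pairs up row indices [0, 1] with column indices [2, 0].
picked = probs[torch.arange(B), Y]  # entries probs[0, 2] and probs[1, 0]

# The same thing written as an explicit loop ...
looped = torch.stack([probs[b, Y[b]] for b in range(B)])

# ... and with gather along the class dimension.
gathered = probs.gather(1, Y.view(-1, 1)).squeeze(1)

loss = -picked.log().mean()
print(picked, looped, gathered, loss)
```

Only the probability of the correct class enters the loss; the other entries of each row matter only through the normalization.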

### one-hot-encoding
  - When you have an ordered list of things that you can choose from, you can take advantage of pytorch to specify which item you want.
  - Assume you have $R$ elements you can choose from. You can specify which one by using the *one hot* encoding: a vector of length $R$ that is 1 at the chosen index and 0 everywhere else.
  - This one hot vector can be used to select the `j`th row or column from a matrix `Embed` (of shape (R, C)) by doing
  - `OneHot = F.one_hot(torch.tensor(j), num_classes=R).float()` (`F.one_hot` returns integers, so convert to the dtype of `Embed` before multiplying)
  - `SelectedRow = OneHot @ Embed` (  (R) @ (R,C) --> (1,R) @  (R,C) = (1,C) --> tensor of shape (C)   )
  - You can select the `j`th column with a one hot vector of length C instead: `Embed @ OneHotC`, where `OneHotC = F.one_hot(torch.tensor(j), num_classes=C).float()` ( (R,C) @ (C) --> (R,C) @ (C,1) = (R,1) --> tensor of shape (R) )
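
A short sketch of the row and column selection described above. The matrix sizes and the index `j` are arbitrary, and `Embed` here is just a counting matrix so the selected values are easy to read.

```python
import torch
import torch.nn.functional as F

R, C = 4, 3  # arbitrary sizes for illustration
Embed = torch.arange(R * C, dtype=torch.float32).view(R, C)
j = 2

# Row selection: one-hot of length R on the left.
row_onehot = F.one_hot(torch.tensor(j), num_classes=R).float()
selected_row = row_onehot @ Embed  # shape (C,)

# Column selection: one-hot of length C on the right.
col_onehot = F.one_hot(torch.tensor(j), num_classes=C).float()
selected_col = Embed @ col_onehot  # shape (R,)

print(selected_row, Embed[j])      # same values
print(selected_col, Embed[:, j])   # same values
```

In practice plain indexing (`Embed[j]`) does the same job without the multiplication; the one-hot form is mainly useful for seeing that a lookup is a special case of a matrix product.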

In [None]:
AR =  torch.arange(0, 5)
OH = F.one_hot(AR % 3)
OH2 = F.one_hot(AR % 3, num_classes=5)
print(vprt(AR=AR, Results=hprt(OH=OH, OH2=OH2)))

### Embedding
  - <https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embedding>
  - In LLMs, information about a *token* (think of it as a word, for now) is kept in a large 1-D tensor, the token's *embedding vector*.
  - Think of it as either a line from the origin pointing out into an n-dimensional space, or a point in an n-dimensional space
  - LLMs typically know about anywhere from 30000 to ~120K tokens (words, or some similar concept)
  - Stacking one embedding vector per token results in an *Embedding* (a tensor) of shape `(nVocab, nEmbed)`
    - `Embedding` matrix is typically in row-major memory layout
  - `nEmbed` is typically at least 768 (base GPT-2, BERT-base, etc.), and 4096 for the 7B/8B Llama models (1, 2, 3)
  - `nVocab` is usually around 30000 or more (e.g. 32000 for Llama 1 and 2)
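
For reference, this is roughly what the lookup looks like with `torch.nn.Embedding`. The sizes are tiny and arbitrary, and the token ids are made up; a real model would use the sizes listed above.

```python
import torch
import torch.nn as nn

nVocab, nEmbed = 5, 2  # toy sizes, far smaller than any real LLM
torch.manual_seed(0)   # arbitrary seed

emb = nn.Embedding(nVocab, nEmbed)  # holds a (nVocab, nEmbed) weight matrix
ids = torch.tensor([1, 2, 3])       # hypothetical token ids

x = emb(ids)                  # one embedding vector per id, shape (3, nEmbed)
same = emb.weight[ids]        # plain row indexing gives the same rows

print(emb.weight.shape, x.shape)
print(torch.equal(x, same))
```

The module is essentially a trainable matrix plus row indexing, which is why the rest of this notebook can get away with a raw tensor called `Embed`.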

### The Role of the Tokenizer in LLMs
  - The tokenizer turns a sentence into a sequence of integer ids, one per token
  - Each integer is then used to index into the embedding matrix to get the `embedding vector` for that token
  - Each word (i.e. token) in the input sentence is thus transformed into its embedding vector.
  - These vectors (which represent the input words) are then fed into the neural network!
  - There are complexities in how many tokens are fed into the NN at once (more on that later!)

In [None]:
nVocab, nEmbed  = (5,2)
Vocab = ['a', 'b', 'c', 'd', '.' ] # this is a really stupid language!

In [None]:
Embed1 = torch.arange(nEmbed*nVocab).view(nVocab,-1)

In [None]:
print(hprt(Embed1))

In [None]:
nBatch=3

In [None]:
torch.manual_seed(1337)
Embed = torch.randn((nVocab, nEmbed))

In [None]:
print(hprt(Embed))

In [None]:
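# note: this reuses stoi built from names.txt ('.' -> 0, 'a' -> 1, ...),
# not the position of the character in Vocab, so it only works here for
# characters whose id is below nVocab
def dumTok(inp):
  return torch.tensor([ stoi[k] for k in inp ])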

In [None]:
Inp = dumTok('abc')

In [None]:
Inp

In [None]:
x = Embed[Inp]
# the above indexes Embed with [Inp[0]], [Inp[1]], [Inp[2]]
# i.e. it stacks row[Inp[0]], row[Inp[1]], row[Inp[2]] into a tensor of shape (3, nEmbed)

In [None]:
print(printTensors(inp=Inp.view(-1,1), Embed=Embed, x=x))

## The above tensor `x` (one embedding vector per input token) is typically what is fed into the neural network

## Now that we've seen what goes INTO a LLM, let's do a simple example of what happens at the output end

In [None]:
torch.manual_seed(42)
logits = torch.randn((nBatch, nVocab))

In [None]:
print(printTensors(logits=logits))

In [None]:
print(printTensors(softmax=F.softmax(logits, dim=1)))

In [None]:
# let's assume the correct token Y is 3 for all three cases
Y = torch.tensor([3,3,3])

In [None]:
loss = F.cross_entropy(logits, Y)

In [None]:
loss.item()

In [None]:
counts = logits.exp()

In [None]:
probs = counts / counts.sum(1,keepdims=True)

In [None]:
print(hprt(probs))

In [None]:
CC = torch.arange(nBatch)
ACC = probs[CC, Y]
## the above accesses probs at [CC[0], Y[0]], [CC[1], Y[1]], [CC[2], Y[2]]
## and generates a tensor of shape (nBatch)

In [None]:
print(printTensors(CC=CC,ACC=ACC))

In [None]:
ACC.log()

In [None]:
ACC.log().mean()

In [None]:
loss2 = -ACC.log().mean().item()

In [None]:
# floating point roundoff between F.cross_entropy() and our test
print(loss, loss2, (loss -loss2).item())

## Plot of the -log(probability) from 0.0001 to 1.0
### A loss of 1.0 corresponds to a probability of $\frac{1}{e} \approx 0.368$
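
A few reference points on the $-\log(p)$ curve can be computed directly. This is a small sketch using only the standard library; `nll` is a helper name introduced here for illustration, and the class counts are examples (27 matches the bigram table above, 32000 is a typical vocabulary size).

```python
import math

def nll(p):
    """Negative log likelihood of a single event with probability p."""
    return -math.log(p)

# Probability 1/e gives a loss of 1 (up to float rounding).
loss_at_one_over_e = nll(1 / math.e)

# A model that guesses uniformly over R classes has loss log(R).
uniform_losses = {R: nll(1 / R) for R in (2, 27, 32000)}

print(loss_at_one_over_e)
for R, loss in uniform_losses.items():
    print(f"R={R}: uniform-guess loss = {loss:.4f}")
```

The uniform-guess value is a handy baseline: a freshly initialized model should start near `log(nVocab)`, and anything well below that means the model has learned something.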

In [None]:
delta=0.0001
xs = np.arange(delta, 1.0, delta)
ys = -np.log(xs)
fig, ax = plt.subplots()
line = ax.plot(xs, ys)
#ax.axis('equal')
ax.grid(True, which='both')
plt.title('Probability vs -log(P)')
plt.ylabel('-log(probability)')
plt.xlabel('probability 0.0 - 1.0')
ax.axhline(y=1.0, color='y')
ax.axvline(x=1/math.e, color='red')
ax.axhline(y=0, color='g')
ax.axvline(x=0, color='g')
ax.annotate(f'loss=1.0', xy=(1/math.e, 1.0),  xycoords='data', xytext=(0.2, 0.5), textcoords='axes fraction',
            arrowprops=dict(facecolor='black', shrink=0.05,width=0.1,headwidth=5.0),
            horizontalalignment='right',
            verticalalignment='bottom')
ax.annotate(f'x=1/{math.e:0.4f}={1/math.e:0.4f}', xy=(1/math.e, 1.0),  xycoords='data', xytext=(.9, 0.5), textcoords='axes fraction',
            arrowprops=dict(facecolor='black', shrink=0.05,width=0.1, headwidth=5.0),
            horizontalalignment='right',
            verticalalignment='bottom')

### So why cross entropy?
  - LLMs are fed one or more tensors picked from the embedding matrix.
  - They then produce the output tensor called `logits`
  - LLMs have a fixed number of tokens that they can produce at any given step (classification!)
  - To pick one, they call `torch.multinomial(F.softmax(logits, dim=-1), num_samples=1, replacement=True)`
  - `Softmax` returns a tensor of probabilities, each in the range (0 .. 1.0)
  - `multinomial` randomly picks a token according to the probability tensor (which is of shape (nBatch, nVocab))
### Characteristics of Cross Entropy
  - A loss of `1.0` corresponds to a probability of $\frac{1}{e}$ for the correct token; this is just a consequence of using the natural log, not a meaningful threshold
  - Softmax takes care of normalizing the logits.
  - Is the loss related to which class has the max probability? (no, not really)
  - Does not work well if the number of classes is small!
  - IMPORTANT! We don't have the **TRUE** (or desired!) probability distribution (i.e. the *expected* probability distribution)!
  - All we have is the **ACTUAL** probability distribution (from the logits!) and the single correct token for each position
  - Cross Entropy Loss measures the overall likelihood of the $i$th token being the correct value.
    - It doesn't take into account things like whether the $i$th generated token was the most likely (i.e. had the highest probability)
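
The point that the loss ignores which class has the highest probability can be checked with two invented distributions. Both give the correct class (index 0) a probability of 0.3, but only in the second one is index 0 the most likely class.

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0])  # the correct class, made up for this example

# Same probability (0.3) for the target, different winners.
probs_a = torch.tensor([[0.3, 0.5, 0.1, 0.1]])    # argmax is class 1
probs_b = torch.tensor([[0.3, 0.25, 0.25, 0.2]])  # argmax is class 0

# log(probs) works as logits here because softmax(log(p)) == p
# whenever p already sums to 1.
loss_a = F.cross_entropy(probs_a.log(), target)
loss_b = F.cross_entropy(probs_b.log(), target)

print(probs_a.argmax(dim=1), loss_a)
print(probs_b.argmax(dim=1), loss_b)
```

Both losses come out as $-\log(0.3)$, so a greedy decoder would get one case wrong and the other right while the loss cannot tell them apart.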
### Strange issue with LLMs
  - Because of multinomial, the NN can 'accidentally' choose either the right answer or one of the `nVocab-1` *wrong* answers at each time step.
  - However! It is not always the case that choosing the "best" token at each step results in the best overall answer!
  - In this case, CrossEntropyLoss is the best we can do!
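
To see how often sampling departs from the single most likely token, here is a sketch for one time step. The logits and seed are made up, and it only covers a single step, not a full generation loop.

```python
import torch
import torch.nn.functional as F

# Made-up logits for one time step over a 5-token vocabulary.
logits = torch.tensor([2.0, 1.0, 0.5, 0.0, -1.0])
probs = F.softmax(logits, dim=0)

# Greedy choice: always take the most likely token.
greedy = torch.argmax(probs).item()

# Sampling: draw many times and see how often each token is picked.
g = torch.Generator().manual_seed(0)  # arbitrary seed
samples = torch.multinomial(probs, num_samples=1000, replacement=True, generator=g)
freq = torch.bincount(samples, minlength=probs.numel()).float() / samples.numel()

print("probabilities:", probs)
print("greedy pick:  ", greedy)
print("sampled freq: ", freq)
print("fraction not greedy:", (samples != greedy).float().mean().item())
```

With these logits the top token has a probability of a bit over one half, so a large share of the sampled tokens are not the greedy pick, which is exactly the 'accidental' behaviour described above.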

In [None]:
# simple case, only one row in logits
aa = torch.tensor([[ 0, 1, 1, 0, 0.0]])
aaP = F.softmax(aa, dim=1)
bb = torch.tensor([0]).long()
cc = F.cross_entropy(aa, bb)
print(printTensors(CE=cc, softmax=aaP, log=aaP.log())) # sum=aaP.sum(), 

In [None]:
# Example of target with multiple rows in logits
torch.manual_seed(86)
logits = torch.randn(nBatch, nVocab, requires_grad=True)
target = torch.empty(nBatch, dtype=torch.long).random_(5)
print(hprt(target=target,logits=logits, softmax=F.softmax(logits, dim=1)))
loss11 = F.cross_entropy(logits, target)
SV=F.softmax(logits, dim=1)
Sel=SV[torch.arange(nBatch), target]
print(printTensors(loss=loss11, loss2=-Sel.log().mean(), Sel=Sel))

In [None]:
-np.log(1/2)

In [None]:
-math.log(1/3)

In [None]:
-math.log(1/4)