# CNN ResNet Explainer

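A ResNet (residual network) is a convolutional network built from blocks that add each block's input back to its output through a skip connection, which makes deep stacks of layers easier to train.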

To show how a CNN works, we'll use the opening text of the [linear algebra wiki page](https://en.wikipedia.org/wiki/Linear_algebra) and the [LU decomposition wiki page](https://en.wikipedia.org/wiki/LU_decomposition), since the topic is fitting and the text contains some non-standard patterns. The task is simpler than in many other notebooks in this repository: we have a few sentences of text and want to predict some other text, in this case the next token.

## Text Prep/Tokenization

We'll start with a common preprocessing step: tokenizing the data. This converts the string text into an array of integers that can be used during the training loop. I've built a very simple byte-pair encoding (BPE) tokenizer whose vocabulary holds each unique character that appears plus the top 5 merges. This keeps our vocab size small and manageable for this example; production vocabularies are typically much larger, often in the 100K+ range. A great library for this is `tiktoken`. At encode time, the tokenizer repeatedly applies the merges it learned during training, so frequent character sequences collapse into single tokens, and each token is then replaced by its integer id. This way we turn the text into a numeric array to simplify computing.
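
To make the merge step concrete, here is a minimal standalone sketch of a single BPE merge in plain Python. The helper name `merge_most_common_pair` and the toy string are illustrative assumptions; they are not part of the tokenizer class used in this notebook.

```python
from collections import Counter

def merge_most_common_pair(seq):
    """Merge the most frequent adjacent pair in seq (one BPE training step)."""
    counts = Counter(zip(seq, seq[1:]))
    if not counts:
        return seq, None
    (a, b), _ = counts.most_common(1)[0]
    out, i = [], 0
    while i < len(seq):
        # Replace each non-overlapping occurrence of (a, b) with the merged token
        if i < len(seq) - 1 and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out, (a, b)

toy_seq, toy_pair = merge_most_common_pair(list("aaabdaaabac"))
print(toy_pair)  # the pair that was merged
print(toy_seq)   # the sequence after one merge
```

Repeating this step `num_merges` times, and recording each merged pair in order, gives the merge list that the tokenizer later replays when encoding.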

In [1]:
import torch
from collections import Counter
import torch.nn as nn

In [2]:
class SimpleBPETokenizer:
    def __init__(self, num_merges=5, eot_token='<|endoftext|>'):
        self.num_merges = num_merges
        self.eot_token = eot_token
        self.eot_id = None
        self.merges = []
        self.pair_ranks = {}
        self.vocab = {}
        self.id_to_token = {}

    def _add_token(self, tok):
        if tok in self.vocab:
            return self.vocab[tok]
        i = len(self.vocab)
        self.vocab[tok] = i
        self.id_to_token[i] = tok
        return i

    def _get_bigrams(self, seq):
        for i in range(len(seq) - 1):
            yield (seq[i], seq[i + 1])

    def _merge_once(self, seq, pair):
        a, b = pair
        out = []
        i = 0
        while i < len(seq):
            if i < len(seq) - 1 and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        return out

    def train(self, corpus):
        # corpus: list[str]
        text = ''.join(corpus).lower()
        seq = list(text)
        merges = []
        for _ in range(self.num_merges):
            counts = Counter(self._get_bigrams(seq))
            if not counts: break
            best_pair, _ = counts.most_common(1)[0]
            merges.append(best_pair)
            seq = self._merge_once(seq, best_pair)
        self.merges = merges
        self.pair_ranks = {p: i for i, p in enumerate(self.merges)}

        self.vocab = {}
        self.id_to_token = {}
        for ch in sorted(set(text)):
            self._add_token(ch)
        for a, b in self.merges:
            self._add_token(a + b)
        self.eot_id = self._add_token(self.eot_token)

    def encode(self, text, force_last_eot=True):
        # treat literal eot marker as special; remove it from content
        if self.eot_token in text:
            text = text.replace(self.eot_token, '')
        seq = list(text)

        # make sure all seen base chars exist
        for ch in set(seq):
            if ch not in self.vocab:
                self._add_token(ch)

        # greedy BPE using learned pair ranks
        if self.merges:
            while True:
                best_pair, best_rank = None, None
                for p in self._get_bigrams(seq):
                    r = self.pair_ranks.get(p)
                    if r is not None and (best_rank is None or r < best_rank):
                        best_pair, best_rank = p, r
                if best_pair is None:
                    break
                seq = self._merge_once(seq, best_pair)

        # ensure all tokens in seq exist in vocab (e.g., if new chars appeared)
        for tok in seq:
            if tok not in self.vocab:
                self._add_token(tok)

        ids = [self.vocab[tok] for tok in seq]

        # FORCE: append EOT id if not already last
        if force_last_eot:
            if not ids or ids[-1] != self.eot_id:
                ids.append(self.eot_id)

        return ids

    def decode(self, ids):
        # drop trailing EOT if present
        if ids and self.eot_id is not None and ids[-1] == self.eot_id:
            ids = ids[:-1]
        toks = [self.id_to_token[i] for i in ids]
        return ''.join(toks)


In [3]:
raw_example_1 = r'''Linear algebra is central to almost all areas of mathematics. For instance, linear algebra is fundamental in modern presentations of geometry, including for defining basic objects such as lines, planes and rotations. Also, functional analysis, a branch of mathematical analysis, may be viewed as the application of linear algebra to function spaces.'''
raw_example_2 = r'''In numerical analysis and linear algebra, lower–upper (LU) decomposition or factorization factors a matrix as the product of a lower triangular matrix and an upper triangular matrix (see matrix multiplication and matrix decomposition).'''


In [4]:
tok = SimpleBPETokenizer(num_merges=5)
tok.train([raw_example_1,raw_example_2])
tok.merges

[(' ', 'a'), ('a', 't'), ('i', 'n'), (' ', 'm'), ('i', 'o')]

In [5]:
tok.vocab

{' ': 0,
 '(': 1,
 ')': 2,
 ',': 3,
 '.': 4,
 'a': 5,
 'b': 6,
 'c': 7,
 'd': 8,
 'e': 9,
 'f': 10,
 'g': 11,
 'h': 12,
 'i': 13,
 'j': 14,
 'l': 15,
 'm': 16,
 'n': 17,
 'o': 18,
 'p': 19,
 'r': 20,
 's': 21,
 't': 22,
 'u': 23,
 'v': 24,
 'w': 25,
 'x': 26,
 'y': 27,
 'z': 28,
 '–': 29,
 ' a': 30,
 'at': 31,
 'in': 32,
 ' m': 33,
 'io': 34,
 '<|endoftext|>': 35}

In [6]:
vocab_size = len(tok.vocab)
vocab_size

36

In [7]:
eot = tok.eot_id
tokens = []
for example in [raw_example_1, raw_example_2]:
    tokens.extend([eot])
    tokens.extend(tok.encode(example.lower()))
all_tokens = torch.tensor(tokens, dtype=torch.long)
all_tokens

tensor([35, 15, 32,  9,  5, 20, 30, 15, 11,  9,  6, 20,  5,  0, 13, 21,  0,  7,
         9, 17, 22, 20,  5, 15,  0, 22, 18, 30, 15, 16, 18, 21, 22, 30, 15, 15,
        30, 20,  9,  5, 21,  0, 18, 10, 33, 31, 12,  9, 16, 31, 13,  7, 21,  4,
         0, 10, 18, 20,  0, 32, 21, 22,  5, 17,  7,  9,  3,  0, 15, 32,  9,  5,
        20, 30, 15, 11,  9,  6, 20,  5,  0, 13, 21,  0, 10, 23, 17,  8,  5, 16,
         9, 17, 22,  5, 15,  0, 32, 33, 18,  8,  9, 20, 17,  0, 19, 20,  9, 21,
         9, 17, 22, 31, 34, 17, 21,  0, 18, 10,  0, 11,  9, 18, 16,  9, 22, 20,
        27,  3,  0, 32,  7, 15, 23,  8, 32, 11,  0, 10, 18, 20,  0,  8,  9, 10,
        32, 32, 11,  0,  6,  5, 21, 13,  7,  0, 18,  6, 14,  9,  7, 22, 21,  0,
        21, 23,  7, 12, 30, 21,  0, 15, 32,  9, 21,  3,  0, 19, 15,  5, 17,  9,
        21, 30, 17,  8,  0, 20, 18, 22, 31, 34, 17, 21,  4, 30, 15, 21, 18,  3,
         0, 10, 23, 17,  7, 22, 34, 17,  5, 15, 30, 17,  5, 15, 27, 21, 13, 21,
         3, 30,  0,  6, 20,  5, 17,  7, 

# Modeling

A machine learning model's forward pass takes the token ids, runs several layers of linear algebra on them, and then "predicts" the next token. When the weights are still untrained (as in this example), the prediction is gibberish. Training turns that noise into pattern during the "backward pass", as you'll see. We'll show 3 steps that are focused on training:
1. **Data Loading** `x, y = train_loader.next_batch()` - this step pulls enough tokens from the raw data to complete a forward and backward pass. If the model is inference only, this step is replaced with taking in the inference input and preparing it the same way for the forward pass.
2. **Forward Pass** `logits, loss = model(x, y)` - uses the data and the model architecture to predict the next token. When training, we also compare the prediction against the expected token to get the loss; in inference, we use the logits to complete the inference task.
3. **Back Propagation, aka Backward Pass & Training** `loss.backward(); optimizer.step()` - uses gradients to work out which parameters most affect the loss (the gap between the forward pass's prediction and the correct token from the data loading step), and then makes very minor adjustments to those parameters in the hope of improving future predictions.

Then we'll show a final **Forward Pass** with the weights updated in step 3.
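
The three steps can be sketched on a toy problem with a single weight. This is only an illustration in plain Python, not the model built in this notebook: the model `pred = w * x`, the squared-error loss, and the learning rate are all assumptions chosen so the arithmetic is easy to follow.

```python
# Toy model (hypothetical): pred = w * x, with one trainable weight w.
w = 0.0
lr = 0.1
x_in, y_true = 2.0, 6.0   # "data loading": one example; the ideal weight is 3.0

losses = []
for step in range(5):
    pred = w * x_in                      # forward pass
    loss = (pred - y_true) ** 2          # compare prediction with the expected value
    grad = 2 * (pred - y_true) * x_in    # backward pass: d(loss)/d(w)
    w -= lr * grad                       # optimizer step: small move against the gradient
    losses.append(loss)

print(losses)
print(w)
```

Each iteration shrinks the loss and moves `w` toward 3.0, which is the same loop a real model runs with millions of weights instead of one.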

## Data Loading

First, we need enough data to run the forward and backward passes. In real practice the total dataset is likely too big to hold in memory at once, so we read just enough of it to run the passes, leaving memory and compute for the passes themselves instead of for holding static data.
To determine how much data we need, we have to fix the batch size and the model's context length. These also form 2 of the 3 dimensions of the initial matrix.
- **Batch Size (B)** - This is the number of examples you'll train on in a single pass. 
- **Context Length (T)** - This is the max number of tokens that a model can use in a single pass to generate the next token. If an example is below this length, it can be padded.
  
*Ideally both B and T are powers of 2 (or at least multiples of a power of 2) so they map nicely onto chip architecture. This is a common theme across the board.*

In [8]:
B_batch = 2 # Batch
T_context = 8 # context length

Next, we need to pull enough tokens for the forward pass from our long `all_tokens` tensor. To fill `B` batch rows of context length `T`, we need `B*T` tokens. Since training predicts the next token given the previous tokens in context, we also need 1 more token at the end so that the last position in the last batch row has a next token to validate against.

In [9]:
current_position = 0
tok_for_training = all_tokens[current_position:current_position + B_batch*T_context +1 ]
tok_for_training

tensor([35, 15, 32,  9,  5, 20, 30, 15, 11,  9,  6, 20,  5,  0, 13, 21,  0])

Now that we have our initial tokens, we need to convert them into matrices that are ready for training. In this step we'll create our batches and set up two arrays: 1/ the input tokens, `x`, and 2/ the output tokens, `y`, that they should produce. To create each example in the batch, every `T` tokens are placed into their own row.

Recall that training takes in a sequence of tokens the length of the context and then predicts the next token, and that when we extracted `tok_for_training` we added 1 extra token so we can evaluate the prediction for the last example. Because of this, the input, `x`, is every token except the last one: `[:-1]`.

It might be natural to think the output `y` would then just be the last token. But that would waste valuable training signal. Yes, there is the example that fills the context `T`, but `tok_for_training` also holds enough tokens that any shorter context of length `n`, where `n<T`, can be used for training too, since we have token `n+1` available. You can think of the following example:

sentence: `Hi I am learning`. This sentence contains the following "next tokens" that can be learned:
1. x: Hi I am  | y: learning
2. x: Hi I     | y: am
3. x: Hi       | y: I

Because we have this triangle to create, our `y` can be much larger. We can start with the second token and go all the way to the extra element we added for the last example: `[1:]`.
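
The triangle can be spelled out on the first `T + 1` tokens of our data. This is a minimal sketch in plain Python lists; the names `chunk`, `x_row`, `y_row`, and `pairs` are illustrative assumptions, and the token ids are copied from the start of `all_tokens`.

```python
T = 8
chunk = [35, 15, 32, 9, 5, 20, 30, 15, 11]   # first T + 1 token ids

x_row = chunk[:-1]   # inputs: everything except the last token
y_row = chunk[1:]    # targets: the same tokens shifted left by one

# Position t pairs the context x_row[:t+1] with the target y_row[t]
pairs = [(x_row[: t + 1], y_row[t]) for t in range(T)]
for context, target in pairs:
    print(context, "->", target)
```

One row of `T` tokens therefore yields `T` (context, next token) pairs, one for every context length from 1 up to `T`.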


We will now put this together and do the following:
1. Extract the input `x` and then split it into an example for each batch `B`
2. Extract the output `y` and then split it into an example for each batch `B`

*Note: `view` can take `-1`, which lets PyTorch infer that dimension so we would not need to pass in `T`. Given how many matrices we'll work with, though, we want to control the dimensions explicitly and error out if they do not match our expectations.*

In [10]:
x=tok_for_training[:-1].view(B_batch, T_context)
x

tensor([[35, 15, 32,  9,  5, 20, 30, 15],
        [11,  9,  6, 20,  5,  0, 13, 21]])

In [11]:
y=tok_for_training[1:].view(B_batch, T_context)
y

tensor([[15, 32,  9,  5, 20, 30, 15, 11],
        [ 9,  6, 20,  5,  0, 13, 21,  0]])

## Forward pass

In [12]:
B_batch, T_context

(2, 8)

In [13]:
n_embd = 6 # level of embedding of input tokens
n_embd, vocab_size

(6, 36)

**Embedding Projection**

In [14]:
wte = nn.Embedding(vocab_size, n_embd)
torch.nn.init.constant_(wte.weight, 0.250)
wte.weight

Parameter containing:
tensor([[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0

In [15]:
x = wte(x)
x.shape, x

(torch.Size([2, 8, 6]),
 tensor([[[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500]],
 
         [[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500]]],
        gra

**We'll use the embedding dimension as the channel dimension for the convolution, so we reshape from `[B,T,C]` to `[B,C,1,T]`.**

In [16]:
x = x.permute(0,2,1) # [B,C,T]
x = x.unsqueeze(2)  # [B,C,1,T]
x.size(), x

(torch.Size([2, 6, 1, 8]),
 tensor([[[[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500]],
 
          [[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500]],
 
          [[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500]],
 
          [[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500]],
 
          [[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500]],
 
          [[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500]]],
 
 
         [[[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500]],
 
          [[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500]],
 
          [[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500]],
 
          [[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500]],
 
          [[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500]],
 
          [[0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500]]]],
        gr

### First convolution block

In [17]:
import torch.nn.functional as F

In [18]:
in_channel = n_embd
out_channel = n_embd
in_channel, out_channel

(6, 6)

In [19]:
kernal_height = 1
kernal_width = 3
stride_height = 1
stride_width = 1
padding_height = 0
padding_width = 1
{'kernal': (kernal_height, kernal_width),
 'stride': (stride_height, stride_width),
 'padding': (padding_height, padding_width)}


{'kernal': (1, 3), 'stride': (1, 1), 'padding': (0, 1)}

In [20]:
## weight layer for convolution (similar to linear, just more explicit)
conv_stride1 = nn.Parameter(
    torch.empty(out_channel, in_channel, kernal_height, kernal_width), 
    requires_grad=True)

In [21]:
# initialize every kernel row as 0.1, 0.2, 0.3 so the impact of each weight is easy to see
with torch.no_grad():
    pattern = torch.tensor([0.1,0.2,0.3]).view(1,1,1,kernal_width).expand(conv_stride1.size()).clone()
    conv_stride1.copy_(pattern)
conv_stride1.size(), conv_stride1

(torch.Size([6, 6, 1, 3]),
 Parameter containing:
 tensor([[[[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]]],
 
 
         [[[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]]],
 
 
         [[[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]]],
 
 
         [[[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0.2000, 0.3000]],
 
          [[0.1000, 0

**Run Convolution**

In [22]:
batch, channel, height, width = x.size()
x.size(), batch, channel, height, width, 

(torch.Size([2, 6, 1, 8]), 2, 6, 1, 8)

`unfold` (aka "im2col") takes a 4-D tensor of shape $(N,C,H,W)$, extracts every sliding local block of size $(k_h,k_w)$, and flattens each block into a column (the last dimension) of a 3-D output. The result has shape $(N,C*k_h*k_w,L)$ where $L = H_{\text{out}} W_{\text{out}}$ and
$$
H_{\text{out}}=\left\lfloor\frac{H+2p_h-d_h(k_h-1)-1}{s_h}+1\right\rfloor,\quad
W_{\text{out}}=\left\lfloor\frac{W+2p_w-d_w(k_w-1)-1}{s_w}+1\right\rfloor.
$$

Here $p$ = padding, $s$ = stride, $d$ = dilation. Patches are flattened in channel-major order, then row-major within each channel, and the patches themselves are ordered left→right, top→bottom.
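
As a quick check of the output-size formula, the sketch below evaluates it for the settings used in this notebook. The helper name `conv_out_size` is an assumption made for illustration; it is not a PyTorch function.

```python
def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a convolution/unfold."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

print(conv_out_size(8, kernel=3, stride=1, padding=1))  # width with stride 1
print(conv_out_size(8, kernel=3, stride=2, padding=1))  # width with stride 2
print(conv_out_size(1, kernel=1, stride=1, padding=0))  # height
```

With a kernel of 3 and padding of 1, a stride of 1 keeps the width at 8, while a stride of 2 halves it to 4; the height stays at 1 in both cases.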

Mental model: `unfold` linearizes all local receptive fields so you can do per-patch operations with a single batched matrix multiply. Convolution is exactly this with shared weights.
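
To see that equivalence on something small, here is a sketch in plain Python for a single channel and a 1-D signal. The helper names `unfold_1d` and `conv_direct` and the toy numbers are assumptions for illustration. Here "convolution" means the cross-correlation that deep learning libraries compute, and each patch is stored as a row of a list rather than as a tensor column.

```python
def unfold_1d(signal, kernel_size, padding=0, stride=1):
    """Return every sliding window of the zero-padded signal as its own patch."""
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    last_start = len(padded) - kernel_size
    return [padded[s:s + kernel_size] for s in range(0, last_start + 1, stride)]

def conv_direct(signal, kernel, padding=0, stride=1):
    """Reference implementation: slide the kernel and accumulate products."""
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    out = []
    for s in range(0, len(padded) - len(kernel) + 1, stride):
        acc = 0.0
        for j, w in enumerate(kernel):
            acc += w * padded[s + j]
        out.append(acc)
    return out

signal = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 0.0, -1.0]

patches = unfold_1d(signal, kernel_size=3, padding=1)
# One dot product per patch is the same work as one matrix multiply over all patches
via_unfold = [sum(w * v for w, v in zip(kernel, p)) for p in patches]
direct = conv_direct(signal, kernel, padding=1)

print(patches)
print(via_unfold)
print(direct)
```

Both routes give the same numbers; the unfold route simply rearranges the data first so that the sliding loop disappears into a single matrix product.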


**Manual unfold** We'll first build the unfold by hand, then call `F.unfold`, show the two are equal, and use `F.unfold` going forward. Unfold pads the input, then uses the stride to slide a window over the last 2 dimensions, turning each window into a column.

Recall that we go from $(N,C,H,W)$ to $(N,C*k_h*k_w,L)$. Let's calculate the dimensions.

In [23]:
c_khw = channel*kernal_height*kernal_width

height_out = (height + 2*padding_height - 1*(kernal_height-1) - 1)//stride_height + 1   # = 1
width_out = (width + 2*padding_width - 1*(kernal_width-1) - 1)//stride_width + 1   # = 8
L = height_out * width_out

print(f'width out {width_out}, height out {height_out}, final dimension ({batch},{c_khw},{L})')

width out 8, height out 1, final dimension (2,18,8)


**Padding** is (0,1), we pad on both sides so we get output of `[2, 6, 1+0, 8+2]`

In [24]:
# pad last dim by (width, width) and 2nd to last by (height, height). width = 1, height = 0
x_pad = F.pad(x, pad=(padding_width,padding_width,padding_height,padding_height))

x_pad.size(), x_pad #total size and show first example in batch 

(torch.Size([2, 6, 1, 10]),
 tensor([[[[0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500,
            0.2500, 0.0000]],
 
          [[0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500,
            0.2500, 0.0000]],
 
          [[0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500,
            0.2500, 0.0000]],
 
          [[0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500,
            0.2500, 0.0000]],
 
          [[0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500,
            0.2500, 0.0000]],
 
          [[0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500,
            0.2500, 0.0000]]],
 
 
         [[[0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500,
            0.2500, 0.0000]],
 
          [[0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500,
            0.2500, 0.0000]],
 
          [[0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500,
            0.2500, 0.0000]],

**Now show the sliding window**

In [25]:
manual_cols = []
for j in range(width_out):  # slide across the output width: 8 positions
    # Extract the kernel (1,3) window at width slice [j : j+3] for all channels.
    # j maxes out at 7, so 7+3 == 10, the full width of x_pad.
    # H is 1, so there is no need to slide along that dimension.
    patch = x_pad[:, :, 0:kernal_height, j:j+kernal_width]        # (2,6,1,3)

    # flatten each batch entry's patch into a single row
    col = patch.reshape(batch, c_khw) # shape to [2,18]

    if j==0: 
        print('first patch and cols')
        print(f'patch:{ patch.size()}')
        print(patch) # print first patch 
        print(f'col:{ col.size()}')
        print(col) # print first col

    manual_cols.append(col)

# turn all the rows in the list into columns while maintaining the batch
manual_unfold = torch.stack(manual_cols, dim=2)  # (N, 18, 8)
manual_unfold.size(), manual_unfold

first patch and cols
patch:torch.Size([2, 6, 1, 3])
tensor([[[[0.0000, 0.2500, 0.2500]],

         [[0.0000, 0.2500, 0.2500]],

         [[0.0000, 0.2500, 0.2500]],

         [[0.0000, 0.2500, 0.2500]],

         [[0.0000, 0.2500, 0.2500]],

         [[0.0000, 0.2500, 0.2500]]],


        [[[0.0000, 0.2500, 0.2500]],

         [[0.0000, 0.2500, 0.2500]],

         [[0.0000, 0.2500, 0.2500]],

         [[0.0000, 0.2500, 0.2500]],

         [[0.0000, 0.2500, 0.2500]],

         [[0.0000, 0.2500, 0.2500]]]], grad_fn=<SliceBackward0>)
col:torch.Size([2, 18])
tensor([[0.0000, 0.2500, 0.2500, 0.0000, 0.2500, 0.2500, 0.0000, 0.2500, 0.2500,
         0.0000, 0.2500, 0.2500, 0.0000, 0.2500, 0.2500, 0.0000, 0.2500, 0.2500],
        [0.0000, 0.2500, 0.2500, 0.0000, 0.2500, 0.2500, 0.0000, 0.2500, 0.2500,
         0.0000, 0.2500, 0.2500, 0.0000, 0.2500, 0.2500, 0.0000, 0.2500, 0.2500]],
       grad_fn=<UnsafeViewBackward0>)


(torch.Size([2, 18, 8]),
 tensor([[[0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.0000],
          [0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.0000],
          [0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.0000],
          [0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.0000],
          [0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.25

**More efficient way** PyTorch has a built-in function for this, `F.unfold`. We'll show that it matches the manual version and use it from here on; it handles the padding for us, so no separate `F.pad` call is needed.

In [26]:
unfolded = F.unfold(x, 
		kernel_size=(kernal_height, kernal_width),  # (1,3)
		padding=(padding_height, padding_width), #(0,1)
		stride=(stride_height, stride_width))#(1,1)
unfolded.size() , unfolded

(torch.Size([2, 18, 8]),
 tensor([[[0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.0000],
          [0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.0000],
          [0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.0000],
          [0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.0000],
          [0.0000, 0.2500, 0.2500, 0.2500, 0.2500, 0.2500, 0.25

**Compare now**

In [27]:
print("manual equals unfold:", torch.allclose(unfolded, manual_unfold))

manual equals unfold: True


**Dot Product of Weight by sliding window**

In [28]:
# Flatten the weights into a 2-D matrix of `out_channel x (in_channel*k_h*k_w)`, i.e. `6 x 18`, so each row lines up with one unfolded column
conv_1_weigth = conv_stride1.view(out_channel, -1) # [6,6,1,3] > [6,18]
conv_1_weigth.size(), conv_1_weigth

(torch.Size([6, 18]),
 tensor([[0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000,
          0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000],
         [0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000,
          0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000],
         [0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000,
          0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000],
         [0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000,
          0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000],
         [0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000,
          0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000],
         [0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000,
          0.1000, 0.2000, 0.3000, 0.1000, 0.2000, 0.3000, 0.1000, 0.200

In [31]:
# [6, 18] matrix-multiplied with [2, 18, 8], resulting in [2, 6, 8]
# The weight broadcasts across both batch entries (shared weights), so each one gets the same [6, 18] @ [18, 8] product
out = conv_1_weigth @ unfolded
out.size(), out

(torch.Size([2, 6, 8]),
 tensor([[[0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500],
          [0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500],
          [0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500],
          [0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500],
          [0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500],
          [0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500]],
 
         [[0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500],
          [0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500],
          [0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500],
          [0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500],
          [0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500],
          [0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500]]],
        grad_fn=<CloneBackward0>))
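
The three distinct values in this output (0.75, 0.9, 0.45) can be reproduced by hand. The sketch below is plain Python arithmetic using the constants from this notebook (embedding value 0.25, kernel 0.1/0.2/0.3, 6 input channels); the variable names are illustrative assumptions.

```python
n_channels = 6
x_val = 0.25
kernel = [0.1, 0.2, 0.3]

def patch_sum(window):
    """Dot product of the kernel with one window, summed over all input channels."""
    return n_channels * sum(w * v for w, v in zip(kernel, window))

left_edge = patch_sum([0.0, x_val, x_val])    # first position: left neighbour is padding
interior = patch_sum([x_val, x_val, x_val])   # middle positions: full window
right_edge = patch_sum([x_val, x_val, 0.0])   # last position: right neighbour is padding

print(round(left_edge, 4), round(interior, 4), round(right_edge, 4))
```

The two edges differ because the zero padding cancels a different kernel weight on each side: 0.1 on the left edge and 0.3 on the right edge.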

In [32]:
# reinsert the height dimension to go back to [B,C,1,T] = [2,6,1,8]
out = out.view(batch,out_channel, height_out, width_out)
out.size(), out

(torch.Size([2, 6, 1, 8]),
 tensor([[[[0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500]],
 
          [[0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500]],
 
          [[0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500]],
 
          [[0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500]],
 
          [[0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500]],
 
          [[0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500]]],
 
 
         [[[0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500]],
 
          [[0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500]],
 
          [[0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500]],
 
          [[0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500]],
 
          [[0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500]],
 
          [[0.7500, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.9000, 0.4500]]]],
        gr

### Batch Norm

In [33]:
bn_a = nn.BatchNorm2d(n_embd)
bn_a.weight, bn_a.bias

(Parameter containing:
 tensor([1., 1., 1., 1., 1., 1.], requires_grad=True),
 Parameter containing:
 tensor([0., 0., 0., 0., 0., 0.], requires_grad=True))
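
In training mode, batch norm normalizes each channel using the mean and (biased) variance taken over the batch and spatial positions, then applies the weight and bias shown above (1 and 0 here, so they change nothing). The sketch below redoes that arithmetic for one channel in plain Python, using the convolution output values from the previous step. It assumes the default `eps` of `1e-5` for `nn.BatchNorm2d`, and the variable names are illustrative.

```python
import math

# One channel's values across both batch entries (2 x 8 positions)
channel_values = [0.75, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.45] * 2
eps = 1e-5  # assumed default eps of nn.BatchNorm2d

mean = sum(channel_values) / len(channel_values)
# Biased variance (divide by N), which is what is used to normalize during training
var = sum((v - mean) ** 2 for v in channel_values) / len(channel_values)

normalized = [(v - mean) / math.sqrt(var + eps) for v in channel_values]
print(round(mean, 4), round(var, 4))
print([round(v, 4) for v in normalized[:8]])
```

The mean is 0.825 and the standard deviation is about 0.15, so the interior value 0.9 lands at roughly +0.5 and the right-edge value 0.45 at roughly -2.5, matching what `bn_a(out)` prints.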

In [34]:
out = bn_a(out)
out.size(), out

(torch.Size([2, 6, 1, 8]),
 tensor([[[[-0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,
            -2.4994]],
 
          [[-0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,
            -2.4994]],
 
          [[-0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,
            -2.4994]],
 
          [[-0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,
            -2.4994]],
 
          [[-0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,
            -2.4994]],
 
          [[-0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,
            -2.4994]]],
 
 
         [[[-0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,
            -2.4994]],
 
          [[-0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,
            -2.4994]],
 
          [[-0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,
            -2.4994]],
 
          [[-0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4999,  0.4

### First ReLU

In [35]:
out = F.relu(out) 
out.size(), out

(torch.Size([2, 6, 1, 8]),
 tensor([[[[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],
 
          [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],
 
          [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],
 
          [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],
 
          [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],
 
          [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]]],
 
 
         [[[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],
 
          [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],
 
          [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],
 
          [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],
 
          [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],
 
          [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]]]],
        gr

### Second convolution that downsamples using a stride of 2 

In [36]:
c2_in_channel = n_embd
c2_out_channel = n_embd
c2_in_channel, c2_out_channel

(6, 6)

In [37]:
c2_kernal_height = 1
c2_kernal_width = 3
c2_stride_height = 1
c2_stride_width = 2
c2_padding_height = 0
c2_padding_width = 1
{'kernal': (c2_kernal_height, c2_kernal_width),
 'stride': (c2_stride_height, c2_stride_width),
 'padding': (c2_padding_height, c2_padding_width)}


{'kernal': (1, 3), 'stride': (1, 2), 'padding': (0, 1)}

In [38]:
## weight layer for convolution (similar to linear, just more explicit)
conv_stride2 = nn.Parameter(
    torch.empty(c2_out_channel, c2_in_channel, c2_kernal_height, c2_kernal_width), 
    requires_grad=True)

In [39]:
# initialize every kernel row as 0.3, 0.2, 0.1 (the reverse of the first convolution) so the impact of each weight is easy to see
with torch.no_grad():
    c2_pattern = torch.tensor([0.3,0.2,0.1]).view(1,1,1,c2_kernal_width).expand(conv_stride2.size()).clone()
    conv_stride2.copy_(c2_pattern)
conv_stride2.size(), conv_stride2

(torch.Size([6, 6, 1, 3]),
 Parameter containing:
 tensor([[[[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]]],
 
 
         [[[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]]],
 
 
         [[[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]]],
 
 
         [[[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0.2000, 0.1000]],
 
          [[0.3000, 0

### Second convolution, see that the dimensions change

In [40]:

c2_batch, c2_channel, c2_height, c2_width = out.size()
out.size(), c2_batch, c2_channel, c2_height, c2_width

(torch.Size([2, 6, 1, 8]), 2, 6, 1, 8)

In [42]:
c2_c_khw = c2_channel*c2_kernal_height*c2_kernal_width

c2_height_out = (c2_height + 2*c2_padding_height - 1*(c2_kernal_height-1) - 1)//c2_stride_height + 1   # = 1, 
c2_width_out = (c2_width + 2*c2_padding_width - 1*(c2_kernal_width-1) - 1)//c2_stride_width + 1   # = 4
c2_L = c2_height_out * c2_width_out

print(f'First Conv: width out {width_out}, height out {height_out}, final dimension ({batch},{c_khw},{L})')
print(f'This  Conv: width out {c2_width_out}, height out {c2_height_out}, final dimension ({c2_batch},{c2_c_khw},{c2_L})')

First Conv: width out 8, height out 1, final dimension (2,18,8)
This  Conv: width out 4, height out 1, final dimension (2,18,4)


**Notice** how the width is cut in half because the stride is 2. We'll have to deal with this in the residual step, where the block input has to be brought to the same shape as this output before the two can be added.
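
The size arithmetic in the cell above can be wrapped in a small helper. This is a minimal sketch; `conv_out_size` is a hypothetical name used only for illustration, and it assumes the usual floor-division output-size formula with dilation 1 by default.

```python
def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one axis of a convolution (floor division)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

# Width axis: 8 tokens, kernel 3, padding 1
first_conv_width = conv_out_size(8, kernel=3, stride=1, padding=1)   # stride 1 keeps the width
second_conv_width = conv_out_size(8, kernel=3, stride=2, padding=1)  # stride 2 halves it
# Height axis: 1 row, kernel 1, no padding
height_out = conv_out_size(1, kernel=1, stride=1, padding=0)

print(first_conv_width, second_conv_width, height_out)  # 8 4 1
```

With a stride of 2 the kernel only lands on every other position, which is why 8 input positions become 4 output positions.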

In [46]:
(c2_kernal_height, c2_kernal_width), (c2_padding_height, c2_padding_width), (c2_stride_height, c2_stride_width)

((1, 3), (0, 1), (1, 2))

In [48]:
out

tensor([[[[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],

         [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],

         [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],

         [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],

         [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],

         [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]]],


        [[[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],

         [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],

         [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],

         [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],

         [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]],

         [[0.0000, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.4999, 0.0000]]]],
       grad_fn=<ReluBackward0>)

In [47]:
c2_unfolded = F.unfold(out,
                       kernel_size=(c2_kernal_height, c2_kernal_width),  # (1, 3)
                       padding=(c2_padding_height, c2_padding_width),    # (0, 1)
                       stride=(c2_stride_height, c2_stride_width))       # (1, 2)
c2_unfolded.size(), c2_unfolded

(torch.Size([2, 18, 4]),
 tensor([[[0.0000, 0.4999, 0.4999, 0.4999],
          [0.0000, 0.4999, 0.4999, 0.4999],
          [0.4999, 0.4999, 0.4999, 0.0000],
          [0.0000, 0.4999, 0.4999, 0.4999],
          [0.0000, 0.4999, 0.4999, 0.4999],
          [0.4999, 0.4999, 0.4999, 0.0000],
          [0.0000, 0.4999, 0.4999, 0.4999],
          [0.0000, 0.4999, 0.4999, 0.4999],
          [0.4999, 0.4999, 0.4999, 0.0000],
          [0.0000, 0.4999, 0.4999, 0.4999],
          [0.0000, 0.4999, 0.4999, 0.4999],
          [0.4999, 0.4999, 0.4999, 0.0000],
          [0.0000, 0.4999, 0.4999, 0.4999],
          [0.0000, 0.4999, 0.4999, 0.4999],
          [0.4999, 0.4999, 0.4999, 0.0000],
          [0.0000, 0.4999, 0.4999, 0.4999],
          [0.0000, 0.4999, 0.4999, 0.4999],
          [0.4999, 0.4999, 0.4999, 0.0000]],
 
         [[0.0000, 0.4999, 0.4999, 0.4999],
          [0.0000, 0.4999, 0.4999, 0.4999],
          [0.4999, 0.4999, 0.4999, 0.0000],
          [0.0000, 0.4999, 0.4999, 0.4999],
    

### Convolution dot product

In [55]:
# Flatten each output channel's kernel into one row: [out_channel, in_channel*kH*kW] = [6, 18], which matches the 18 rows of the unfolded input
conv_2_weigth = conv_stride2.view(c2_out_channel, -1) # [6,6,1,3] > [6,18]
conv_2_weigth.size(), conv_2_weigth

(torch.Size([6, 18]),
 tensor([[0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000,
          0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000],
         [0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000,
          0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000],
         [0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000,
          0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000],
         [0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000,
          0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000],
         [0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000,
          0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000],
         [0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000,
          0.3000, 0.2000, 0.1000, 0.3000, 0.2000, 0.1000, 0.3000, 0.200

In [57]:
# [6, 18] matrix-multiplied with [2, 18, 4] gives [2, 6, 4]
# The weight is broadcast across the 2 batch items (shared weight), so the result is [2, 6, 4]
out = conv_2_weigth @ c2_unfolded
out.size(), out

(torch.Size([2, 6, 4]),
 tensor([[[0.2999, 1.7996, 1.7996, 1.4997],
          [0.2999, 1.7996, 1.7996, 1.4997],
          [0.2999, 1.7996, 1.7996, 1.4997],
          [0.2999, 1.7996, 1.7996, 1.4997],
          [0.2999, 1.7996, 1.7996, 1.4997],
          [0.2999, 1.7996, 1.7996, 1.4997]],
 
         [[0.2999, 1.7996, 1.7996, 1.4997],
          [0.2999, 1.7996, 1.7996, 1.4997],
          [0.2999, 1.7996, 1.7996, 1.4997],
          [0.2999, 1.7996, 1.7996, 1.4997],
          [0.2999, 1.7996, 1.7996, 1.4997],
          [0.2999, 1.7996, 1.7996, 1.4997]]], grad_fn=<CloneBackward0>))
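
As a sanity check, the unfold-then-matmul steps above should give the same numbers as PyTorch's built-in convolution. Here's a minimal sketch of that comparison. It uses random inputs and weights (an assumption for illustration, not the notebook's values) with the same shapes and hyperparameters as this second convolution.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x_demo = torch.randn(2, 6, 1, 8)   # [B, C, H, W], same shape as `out` before this conv
w_demo = torch.randn(6, 6, 1, 3)   # [C_out, C_in, kH, kW]

# Manual path: unfold into columns, then one matmul with the flattened kernel.
cols = F.unfold(x_demo, kernel_size=(1, 3), padding=(0, 1), stride=(1, 2))  # [2, 18, 4]
manual = (w_demo.view(6, -1) @ cols).view(2, 6, 1, 4)

# Reference path: the built-in convolution with the same hyperparameters.
reference = F.conv2d(x_demo, w_demo, bias=None, stride=(1, 2), padding=(0, 1))

print(cols.shape, manual.shape)
print(torch.allclose(manual, reference, atol=1e-5))
```

The two paths agree because `F.unfold` lays out each sliding window as a column of `C*kH*kW` values, in the same order that `view(C_out, -1)` flattens the kernel.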

In [65]:
# insert the height dimension again to get [B, C, 1, T/2]; the width is halved because we took a stride of 2
out = out.view(c2_batch,c2_out_channel, c2_height_out, c2_width_out)
out.size(), out

(torch.Size([2, 6, 1, 4]),
 tensor([[[[0.2999, 1.7996, 1.7996, 1.4997]],
 
          [[0.2999, 1.7996, 1.7996, 1.4997]],
 
          [[0.2999, 1.7996, 1.7996, 1.4997]],
 
          [[0.2999, 1.7996, 1.7996, 1.4997]],
 
          [[0.2999, 1.7996, 1.7996, 1.4997]],
 
          [[0.2999, 1.7996, 1.7996, 1.4997]]],
 
 
         [[[0.2999, 1.7996, 1.7996, 1.4997]],
 
          [[0.2999, 1.7996, 1.7996, 1.4997]],
 
          [[0.2999, 1.7996, 1.7996, 1.4997]],
 
          [[0.2999, 1.7996, 1.7996, 1.4997]],
 
          [[0.2999, 1.7996, 1.7996, 1.4997]],
 
          [[0.2999, 1.7996, 1.7996, 1.4997]]]], grad_fn=<ViewBackward0>))

### Batch Norm #2 

In [66]:
bn_b = nn.BatchNorm2d(n_embd)   
bn_b.weight, bn_b.bias

(Parameter containing:
 tensor([1., 1., 1., 1., 1., 1.], requires_grad=True),
 Parameter containing:
 tensor([0., 0., 0., 0., 0., 0.], requires_grad=True))

In [67]:
out = bn_b(out)
out.size(), out

(torch.Size([2, 6, 1, 4]),
 tensor([[[[-1.6977,  0.7276,  0.7276,  0.2425]],
 
          [[-1.6977,  0.7276,  0.7276,  0.2425]],
 
          [[-1.6977,  0.7276,  0.7276,  0.2425]],
 
          [[-1.6977,  0.7276,  0.7276,  0.2425]],
 
          [[-1.6977,  0.7276,  0.7276,  0.2425]],
 
          [[-1.6977,  0.7276,  0.7276,  0.2425]]],
 
 
         [[[-1.6977,  0.7276,  0.7276,  0.2425]],
 
          [[-1.6977,  0.7276,  0.7276,  0.2425]],
 
          [[-1.6977,  0.7276,  0.7276,  0.2425]],
 
          [[-1.6977,  0.7276,  0.7276,  0.2425]],
 
          [[-1.6977,  0.7276,  0.7276,  0.2425]],
 
          [[-1.6977,  0.7276,  0.7276,  0.2425]]]],
        grad_fn=<NativeBatchNormBackward0>))
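
To see where these numbers come from, here's a minimal sketch of the batch norm arithmetic for a single channel in plain Python. It assumes training mode, weight 1, bias 0, and the default `eps` of `1e-5`, and it starts from the rounded convolution values printed earlier, so the last digit can differ slightly.

```python
import math

# One channel's values across batch and width (both batch items are identical here),
# copied from the rounded convolution output.
channel_values = [0.2999, 1.7996, 1.7996, 1.4997] * 2
eps = 1e-5  # assumed default eps of nn.BatchNorm2d

mean = sum(channel_values) / len(channel_values)
var = sum((v - mean) ** 2 for v in channel_values) / len(channel_values)  # biased variance
normalized = [(v - mean) / math.sqrt(var + eps) for v in channel_values]

print([round(v, 3) for v in normalized[:4]])
```

Each channel is normalized with statistics taken over the batch and width dimensions together, which is why every channel ends up with the same four values.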

## Residual connection
To add the block input `x` back in, we first need to convolve `x` so that it matches the shape of `out`.

We do this with a convolution that uses a 1x1 kernel and a stride of 2.

In [75]:
res_in_channel = n_embd
res_out_channel = n_embd
res_in_channel, res_out_channel

(6, 6)

In [76]:
res_kernal_height = 1
res_kernal_width = 1
res_stride_height = 1
res_stride_width = 2
res_padding_height = 0
res_padding_width = 0
{'kernal': (res_kernal_height, res_kernal_width),
 'stride': (res_stride_height, res_stride_width),
 'padding': (res_padding_height, res_padding_width)}


{'kernal': (1, 1), 'stride': (1, 2), 'padding': (0, 0)}

In [88]:
## weight layer for convolution (similar to linear, just more explicit)
res_conv_1x1 = nn.Parameter(
    torch.empty(res_out_channel, res_in_channel, res_kernal_height, res_kernal_width), 
    requires_grad=True)

In [89]:
# initialize every weight to 0.05 so the impact of the weights is easy to see
with torch.no_grad():
    res_pattern = torch.tensor([0.05]).view(1,1,1,res_kernal_width).expand(res_conv_1x1.size()).clone()
    res_conv_1x1.copy_(res_pattern)
res_conv_1x1.size(), res_conv_1x1

(torch.Size([6, 6, 1, 1]),
 Parameter containing:
 tensor([[[[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]]],
 
 
         [[[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]]],
 
 
         [[[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]]],
 
 
         [[[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]]],
 
 
         [[[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]]],
 
 
         [[[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]],
 
          [[0.0500]]]], requires_grad=True))

### Residual connection with 1x1 convolution

In [90]:
res_batch, res_channel, res_height, res_width = x.size()
res_batch, res_channel, res_height, res_width

(2, 6, 1, 8)

In [92]:
res_c_khw = res_channel*res_kernal_height*res_kernal_width

res_height_out = (res_height + 2*res_padding_height - 1*(res_kernal_height-1) - 1)//res_stride_height + 1   # = 1
res_width_out = (res_width + 2*res_padding_width - 1*(res_kernal_width-1) - 1)//res_stride_width + 1   # = 4
res_L = res_height_out * res_width_out

print(f'Second  Conv: width out {c2_width_out}, height out {c2_height_out}, final dimension ({c2_batch},{c2_c_khw},{c2_L})')
print(f'This  Conv: width out {res_width_out}, height out {res_height_out}, final dimension ({res_batch},{res_c_khw},{res_L})')

Second  Conv: width out 4, height out 1, final dimension (2,18,4)
This  Conv: width out 4, height out 1, final dimension (2,6,4)


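**The unfolded matrix here has 6 rows instead of 18, one third as many, because the kernel is 1x1 instead of 1x3.** That's fine: once each path is multiplied by its weights and reshaped, both end up with the same shape and can be added together.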

In [91]:
x_unfolded = F.unfold(x,
                      kernel_size=(res_kernal_height, res_kernal_width),  # (1, 1)
                      padding=(res_padding_height, res_padding_width),    # (0, 0)
                      stride=(res_stride_height, res_stride_width))       # (1, 2)
x_unfolded.size(), x_unfolded

(torch.Size([2, 6, 4]),
 tensor([[[0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500]],
 
         [[0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500],
          [0.2500, 0.2500, 0.2500, 0.2500]]], grad_fn=<Im2ColBackward0>))

In [96]:
# Flatten each output channel's kernel into one row: [out_channel, in_channel*kH*kW] = [6, 6], which matches the 6 rows of the unfolded input
res_weigth = res_conv_1x1.view(res_out_channel, -1) # [6,6,1,1] > [6,6]
res_weigth.size(), res_weigth

(torch.Size([6, 6]),
 tensor([[0.0500, 0.0500, 0.0500, 0.0500, 0.0500, 0.0500],
         [0.0500, 0.0500, 0.0500, 0.0500, 0.0500, 0.0500],
         [0.0500, 0.0500, 0.0500, 0.0500, 0.0500, 0.0500],
         [0.0500, 0.0500, 0.0500, 0.0500, 0.0500, 0.0500],
         [0.0500, 0.0500, 0.0500, 0.0500, 0.0500, 0.0500],
         [0.0500, 0.0500, 0.0500, 0.0500, 0.0500, 0.0500]],
        grad_fn=<ViewBackward0>))

In [97]:
# The weight is broadcast across the 2 batch items (shared weight), so the result is [2, 6, 4]
identity = res_weigth @ x_unfolded
identity.size(), identity

(torch.Size([2, 6, 4]),
 tensor([[[0.0750, 0.0750, 0.0750, 0.0750],
          [0.0750, 0.0750, 0.0750, 0.0750],
          [0.0750, 0.0750, 0.0750, 0.0750],
          [0.0750, 0.0750, 0.0750, 0.0750],
          [0.0750, 0.0750, 0.0750, 0.0750],
          [0.0750, 0.0750, 0.0750, 0.0750]],
 
         [[0.0750, 0.0750, 0.0750, 0.0750],
          [0.0750, 0.0750, 0.0750, 0.0750],
          [0.0750, 0.0750, 0.0750, 0.0750],
          [0.0750, 0.0750, 0.0750, 0.0750],
          [0.0750, 0.0750, 0.0750, 0.0750],
          [0.0750, 0.0750, 0.0750, 0.0750]]], grad_fn=<CloneBackward0>))

In [98]:
# insert the height dimension again to get [B, C, 1, T/2]; the width is halved because we took a stride of 2
identity = identity.view(res_batch,res_out_channel, res_height_out, res_width_out)
identity.size(), identity

(torch.Size([2, 6, 1, 4]),
 tensor([[[[0.0750, 0.0750, 0.0750, 0.0750]],
 
          [[0.0750, 0.0750, 0.0750, 0.0750]],
 
          [[0.0750, 0.0750, 0.0750, 0.0750]],
 
          [[0.0750, 0.0750, 0.0750, 0.0750]],
 
          [[0.0750, 0.0750, 0.0750, 0.0750]],
 
          [[0.0750, 0.0750, 0.0750, 0.0750]]],
 
 
         [[[0.0750, 0.0750, 0.0750, 0.0750]],
 
          [[0.0750, 0.0750, 0.0750, 0.0750]],
 
          [[0.0750, 0.0750, 0.0750, 0.0750]],
 
          [[0.0750, 0.0750, 0.0750, 0.0750]],
 
          [[0.0750, 0.0750, 0.0750, 0.0750]],
 
          [[0.0750, 0.0750, 0.0750, 0.0750]]]], grad_fn=<ViewBackward0>))
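
A 1x1 convolution with a stride of 2 boils down to two things: keep every other position, and mix the channels at each kept position. Here's a minimal plain-Python sketch of that. Only positions 0, 2, 4, 6 of `x` are visible in the unfold output above (0.25), so the odd positions are filled with a placeholder value of 9.0 (an assumption) to show that the stride skips them.

```python
n_channels, width, stride = 6, 8, 2

# One batch item: 6 channels x 8 positions. Even positions hold 0.25 (from the
# unfold output); odd positions hold a placeholder that should never be read.
x_item = [[0.25 if t % stride == 0 else 9.0 for t in range(width)]
          for _ in range(n_channels)]
weight = [[0.05] * n_channels for _ in range(n_channels)]  # [C_out][C_in], the 1x1 kernel

# For each output channel and each kept position, sum over the input channels.
identity_demo = [
    [sum(weight[o][c] * x_item[c][t] for c in range(n_channels))
     for t in range(0, width, stride)]
    for o in range(n_channels)
]

print(len(identity_demo), len(identity_demo[0]))   # 6 4
print([round(v, 4) for v in identity_demo[0]])     # [0.075, 0.075, 0.075, 0.075]
```

Each output value is 6 channels x 0.05 x 0.25 = 0.075, matching the `identity` tensor above.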

## Residual connection sum
Our identity connection and our output now have the same size, so we can add them element-wise.

In [100]:
out.size(), identity.size()

(torch.Size([2, 6, 1, 4]), torch.Size([2, 6, 1, 4]))

In [101]:
x = out + identity
x.size(), x

(torch.Size([2, 6, 1, 4]),
 tensor([[[[-1.6227,  0.8026,  0.8026,  0.3175]],
 
          [[-1.6227,  0.8026,  0.8026,  0.3175]],
 
          [[-1.6227,  0.8026,  0.8026,  0.3175]],
 
          [[-1.6227,  0.8026,  0.8026,  0.3175]],
 
          [[-1.6227,  0.8026,  0.8026,  0.3175]],
 
          [[-1.6227,  0.8026,  0.8026,  0.3175]]],
 
 
         [[[-1.6227,  0.8026,  0.8026,  0.3175]],
 
          [[-1.6227,  0.8026,  0.8026,  0.3175]],
 
          [[-1.6227,  0.8026,  0.8026,  0.3175]],
 
          [[-1.6227,  0.8026,  0.8026,  0.3175]],
 
          [[-1.6227,  0.8026,  0.8026,  0.3175]],
 
          [[-1.6227,  0.8026,  0.8026,  0.3175]]]], grad_fn=<AddBackward0>))

### ReLU

In [102]:
x = F.relu(x)
x.size(), x

(torch.Size([2, 6, 1, 4]),
 tensor([[[[0.0000, 0.8026, 0.8026, 0.3175]],
 
          [[0.0000, 0.8026, 0.8026, 0.3175]],
 
          [[0.0000, 0.8026, 0.8026, 0.3175]],
 
          [[0.0000, 0.8026, 0.8026, 0.3175]],
 
          [[0.0000, 0.8026, 0.8026, 0.3175]],
 
          [[0.0000, 0.8026, 0.8026, 0.3175]]],
 
 
         [[[0.0000, 0.8026, 0.8026, 0.3175]],
 
          [[0.0000, 0.8026, 0.8026, 0.3175]],
 
          [[0.0000, 0.8026, 0.8026, 0.3175]],
 
          [[0.0000, 0.8026, 0.8026, 0.3175]],
 
          [[0.0000, 0.8026, 0.8026, 0.3175]],
 
          [[0.0000, 0.8026, 0.8026, 0.3175]]]], grad_fn=<ReluBackward0>))

### Remove the extra dimension we added

In [105]:
x = x.squeeze(2)
x.size(), x

(torch.Size([2, 6, 4]),
 tensor([[[0.0000, 0.8026, 0.8026, 0.3175],
          [0.0000, 0.8026, 0.8026, 0.3175],
          [0.0000, 0.8026, 0.8026, 0.3175],
          [0.0000, 0.8026, 0.8026, 0.3175],
          [0.0000, 0.8026, 0.8026, 0.3175],
          [0.0000, 0.8026, 0.8026, 0.3175]],
 
         [[0.0000, 0.8026, 0.8026, 0.3175],
          [0.0000, 0.8026, 0.8026, 0.3175],
          [0.0000, 0.8026, 0.8026, 0.3175],
          [0.0000, 0.8026, 0.8026, 0.3175],
          [0.0000, 0.8026, 0.8026, 0.3175],
          [0.0000, 0.8026, 0.8026, 0.3175]]], grad_fn=<SqueezeBackward1>))

### Flip our channel and context back

In [106]:
x = x.permute(0,2,1)
x.size(), x

(torch.Size([2, 4, 6]),
 tensor([[[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
          [0.8026, 0.8026, 0.8026, 0.8026, 0.8026, 0.8026],
          [0.8026, 0.8026, 0.8026, 0.8026, 0.8026, 0.8026],
          [0.3175, 0.3175, 0.3175, 0.3175, 0.3175, 0.3175]],
 
         [[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
          [0.8026, 0.8026, 0.8026, 0.8026, 0.8026, 0.8026],
          [0.8026, 0.8026, 0.8026, 0.8026, 0.8026, 0.8026],
          [0.3175, 0.3175, 0.3175, 0.3175, 0.3175, 0.3175]]],
        grad_fn=<PermuteBackward0>))
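
For comparison, here's a minimal sketch of the same block written with PyTorch's built-in layers instead of manual unfold and matmul. `ResidualBlock1D` is a hypothetical name. The first convolution's settings (1x3 kernel, stride 1, padding 1) and the reshaping on the way in are assumptions inferred from the shapes printed in this walkthrough, and the weights are left at their default random initialization rather than the fixed patterns used above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock1D(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, n_embd):
        super().__init__()
        # Assumed first conv: 1x3 kernel, stride 1, padding 1 (keeps the width at T).
        self.conv1 = nn.Conv2d(n_embd, n_embd, kernel_size=(1, 3),
                               stride=(1, 1), padding=(0, 1), bias=False)
        self.bn1 = nn.BatchNorm2d(n_embd)
        # Second conv: 1x3 kernel, stride 2 along the width, which halves T.
        self.conv2 = nn.Conv2d(n_embd, n_embd, kernel_size=(1, 3),
                               stride=(1, 2), padding=(0, 1), bias=False)
        self.bn2 = nn.BatchNorm2d(n_embd)
        # Shortcut: 1x1 kernel, stride 2, so the input matches the halved width.
        self.shortcut = nn.Conv2d(n_embd, n_embd, kernel_size=(1, 1),
                                  stride=(1, 2), bias=False)

    def forward(self, x):                        # x: [B, T, C]
        x = x.permute(0, 2, 1).unsqueeze(2)      # [B, C, 1, T]
        out = F.relu(self.bn1(self.conv1(x)))    # [B, C, 1, T]
        out = self.bn2(self.conv2(out))          # [B, C, 1, T/2]
        out = F.relu(out + self.shortcut(x))     # residual sum, then ReLU
        return out.squeeze(2).permute(0, 2, 1)   # [B, T/2, C]

block = ResidualBlock1D(n_embd=6)
block_out = block(torch.randn(2, 8, 6))
print(block_out.shape)  # torch.Size([2, 4, 6])
```

The output shape `[2, 4, 6]` matches what we reached step by step: the context length is halved and the channel count is unchanged.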

## Final Linear Projection

`self.head = nn.Linear(n_embd, vocab_size, bias=False)`

In [107]:
lm_head = nn.Linear(n_embd, vocab_size, bias=False)
torch.nn.init.ones_(lm_head.weight)
lm_head.weight.size(), lm_head.weight

(torch.Size([36, 6]),
 Parameter containing:
 tensor([[1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.],
         [

### logits

In [108]:
logits = lm_head(x)

logits.shape, logits

(torch.Size([2, 4, 36]),
 tensor([[[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
           0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
           0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
           0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
           0.0000, 0.0000, 0.0000, 0.0000],
          [4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156,
           4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156,
           4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156,
           4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156,
           4.8156, 4.8156, 4.8156, 4.8156],
          [4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156,
           4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156,
           4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156,
           4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8
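
Because the head's weights are all ones and there is no bias, every logit is just the sum of the 6 channel values at that position. Here's a minimal plain-Python sketch of that arithmetic for one batch item, using the values from the permuted output above.

```python
vocab_size, n_embd = 36, 6

# One batch item after the block: 4 positions x 6 channels.
x_item = [[0.0] * n_embd, [0.8026] * n_embd, [0.8026] * n_embd, [0.3175] * n_embd]
lm_weight = [[1.0] * n_embd for _ in range(vocab_size)]  # [V][C], all ones

# logits[t][v] = sum over c of x[t][c] * W[v][c]  (no bias)
logits_demo = [[sum(xc * wc for xc, wc in zip(row, w_row)) for w_row in lm_weight]
               for row in x_item]

print(len(logits_demo), len(logits_demo[0]))        # 4 36
print([round(row[0], 4) for row in logits_demo])    # [0.0, 4.8156, 4.8156, 1.905]
```

Every row of logits is constant across the vocabulary, so the model assigns the same probability to every token at this point.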

### Loss

In [117]:
# logits: [B, T_out, V], Y: [B, T]
B, T_out, V = logits.shape
B, T_out, V 

(2, 4, 36)

In [122]:
x.size(1)

4

In [125]:
## adjust y for the stride of 2: keep every other target so y lines up with the T_out output positions
s = T_context // T_out 
centers = torch.arange(T_out, device=logits.device) * s                 # [0, 2, 4, 6]
y_aligned = y.gather(1, centers.expand(B, -1)) 
y, centers, y_aligned

(tensor([[15, 32,  9,  5, 20, 30, 15, 11],
         [ 9,  6, 20,  5,  0, 13, 21,  0]]),
 tensor([0, 2, 4, 6]),
 tensor([[15,  9, 20, 15],
         [ 9, 20,  0, 21]]))
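
The `gather` call above picks out the targets at the stride-aligned positions. Here's the same selection as a minimal plain-Python sketch, using the `y` values printed above.

```python
y_rows = [[15, 32, 9, 5, 20, 30, 15, 11],
          [9, 6, 20, 5, 0, 13, 21, 0]]
T_context, T_out = 8, 4

s = T_context // T_out                        # 2, the stride between kept targets
centers = [t * s for t in range(T_out)]       # [0, 2, 4, 6]
y_aligned_demo = [[row[i] for i in centers] for row in y_rows]

print(centers)          # [0, 2, 4, 6]
print(y_aligned_demo)   # [[15, 9, 20, 15], [9, 20, 0, 21]]
```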

In [127]:
y_flat = y_aligned.reshape(B*T_out)
y_flat.size(), y_flat

(torch.Size([8]), tensor([15,  9, 20, 15,  9, 20,  0, 21]))

In [128]:
logits_flat = logits.reshape(B*T_out, V)
logits_flat.shape, logits_flat

(torch.Size([8, 36]),
 tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156,
          4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156,
          4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156,
          4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156],
         [4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156,
          4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156,
          4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156,
          4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 4.8156, 

In [129]:
loss = F.cross_entropy(logits_flat, y_flat)
loss.shape, loss

(torch.Size([]), tensor(3.5835, grad_fn=<NllLossBackward0>))
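
The loss of 3.5835 is no coincidence. Every row of logits is constant across the vocabulary, so the softmax is uniform and the cross-entropy is `ln(36)` no matter what the target is. Here's a minimal plain-Python sketch of that check. The `cross_entropy` helper is a hypothetical stand-in for `F.cross_entropy` with mean reduction, and the fourth logit value (1.905) is computed as 6 x 0.3175.

```python
import math

def cross_entropy(logits_row, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits_row)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits_row))
    return log_sum_exp - logits_row[target]

V = 36
# 8 rows (2 batch items x 4 positions), each constant across the vocabulary.
rows = [[0.0] * V, [4.8156] * V, [4.8156] * V, [1.905] * V] * 2
targets = [15, 9, 20, 15, 9, 20, 0, 21]

loss_demo = sum(cross_entropy(r, t) for r, t in zip(rows, targets)) / len(rows)
print(round(loss_demo, 4), round(math.log(V), 4))  # 3.5835 3.5835
```

This is the loss you'd expect from a model that hasn't learned anything yet: it's guessing uniformly over the 36 tokens.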