# CNN ResNet Explainer

Convolutional neural networks (CNNs) learn patterns in data by sliding small learnable filters across the input to spot local features, much like short “n-gram” features in text, and combining them into higher-level signals. CNNs are most commonly used for spatially structured data such as images or video, because they efficiently learn local patterns like edges, textures, and shapes. They are also used in natural language processing and time-series tasks, where the same sliding filters capture local dependencies in text or signal data. Modern architectures extend CNNs to higher-level tasks such as object detection, segmentation, and even audio or biological sequence modeling.

For our example we will take text and use our embedding layer to add a second dimension for the CNN to learn across. Recall that a discrete convolution of two sequences $a$ and $b$ sums the products of every pair of elements whose indices add up to the output index $n$:
$$
(a * b)_n = \sum_{\substack{i,j\\ i+j=n}} a_i \cdot b_j
$$
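To make the formula concrete, here is a minimal pure-Python sketch of a full 1-D discrete convolution. The function name `discrete_convolution` and the sample values are illustrative assumptions, not part of the model built in this notebook.

```python
def discrete_convolution(a, b):
    """Full 1-D discrete convolution: out[n] = sum of a[i] * b[j] over all i + j == n."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


signal = [1.0, 2.0, 3.0]   # illustrative values
kernel = [0.0, 1.0, 0.5]   # illustrative values
result = discrete_convolution(signal, kernel)
print(result)  # [0.0, 1.0, 2.5, 4.0, 1.5]
```

Each output index `n` collects every product whose input indices add up to `n`, which is why the result is longer than either input.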

In our example, the embedding of the input sequence is $A$, and the learnable kernel weights are $B$ (written $W$ in the formulas that follow). We pad $A$ so that the output has the same length as the input. Each output value is the dot product between a local patch of $A$ and the kernel. The stride controls how far the kernel window moves along $A$ between outputs, which is how we can downsample $A$.

Because the kernel slides over $A$ without being flipped, the calculation we actually run is not a true convolution but a close relative called the 2-D discrete cross-correlation. With the input reshaped to $[B,C,1,T]$ (here $B$ is the batch size) and a $1\times k$ kernel, each output token index $t$ is
$$
y_{t}=\sum_{c=1}^{C}\sum_{u=0}^{k-1} W_{c,u}\, x_{c,\,t+u}
$$
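The cross-correlation above can be sketched in a few lines of pure Python for a single example. This is a minimal illustration with no padding and a stride of 1; the function name `cross_correlate_1xk` and the toy values are assumptions made for the sketch, and indices are 0-based rather than the 1-based channel index in the formula.

```python
def cross_correlate_1xk(x, w):
    """x: C rows of T values; w: C rows of k weights.

    Returns T - k + 1 outputs (no padding, stride 1). The kernel is not
    flipped, which is what distinguishes this from a true convolution.
    """
    num_channels, length, k = len(x), len(x[0]), len(w[0])
    return [
        sum(w[c][u] * x[c][t + u] for c in range(num_channels) for u in range(k))
        for t in range(length - k + 1)
    ]


x = [[1, 2, 3, 4], [10, 20, 30, 40]]   # C = 2 channels, T = 4 tokens
w = [[1, 0], [0, 1]]                   # k = 2: take channel 0 at t and channel 1 at t + 1
y = cross_correlate_1xk(x, w)
print(y)  # [21, 32, 43]
```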


To help show how CNNs work, we'll use the note letters (in C major) of 3 popular songs: [Hot Cross Buns](https://en.wikipedia.org/wiki/Hot_Cross_Buns_(song)), [Twinkle Twinkle Little Star](https://en.wikipedia.org/wiki/Twinkle,_Twinkle,_Little_Star), and [Happy Birthday To You](https://en.wikipedia.org/wiki/Happy_Birthday_to_You). 

In today's notebook we'll take in 2 different examples and predict the next note for each of them. In other notebooks you may have seen us make many predictions per example in each batch inside a loop. Because here the whole input × embedding matrix serves as a single 2-D input to the CNN, each example in the batch yields just one prediction. 

## Text Prep/Tokenization

We'll start with a common preprocessing step: tokenizing the data. This converts the string text into an array of integers that can be used during the training loop. I've built a very simple byte-pair encoding (BPE) tokenizer whose vocabulary holds each unique character that appears plus the most frequent merges (5 by default, 6 in the cells below). This keeps our vocab size small and manageable for this example; production tokenizers often have vocabularies of 100K+ tokens. A great library for this is `tiktoken`. Tokenization repeatedly replaces adjacent tokens that match a learned merge with a single combined token, then maps each token to the integer that represents it. This way we turn the text into a numeric array to simplify computing.
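Before the full tokenizer, a toy sketch of a single BPE merge step may help: count adjacent pairs, pick the most frequent one, and replace it everywhere with a joined token. The helper names `most_common_pair` and `merge_pair` and the string `"ggaggc"` are assumptions for illustration only.

```python
from collections import Counter


def most_common_pair(seq):
    """Return the most frequent adjacent pair in seq together with its count."""
    counts = Counter(zip(seq, seq[1:]))
    return counts.most_common(1)[0]


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of pair with the joined token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


seq = list("ggaggc")                 # illustrative note string
pair, count = most_common_pair(seq)  # ('g', 'g') appears twice
merged = merge_pair(seq, pair)
print(pair, count, merged)  # ('g', 'g') 2 ['gg', 'a', 'gg', 'c']
```

Training repeats this step a fixed number of times, recording each chosen pair in order.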

In [1]:
import torch
from collections import Counter
import torch.nn as nn
import torch.nn.functional as F

In [2]:
class SimpleBPETokenizer:
    def __init__(self, num_merges=5, eot_token='<|endoftext|>'):
        self.num_merges = num_merges
        self.eot_token = eot_token
        self.eot_id = None
        self.merges = []
        self.pair_ranks = {}
        self.vocab = {}
        self.id_to_token = {}

    def _add_token(self, tok):
        if tok in self.vocab:
            return self.vocab[tok]
        i = len(self.vocab)
        self.vocab[tok] = i
        self.id_to_token[i] = tok
        return i

    def _get_bigrams(self, seq):
        for i in range(len(seq) - 1):
            yield (seq[i], seq[i + 1])

    def _merge_once(self, seq, pair):
        a, b = pair
        out = []
        i = 0
        while i < len(seq):
            if i < len(seq) - 1 and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        return out

    def train(self, corpus):
        # corpus: list[str]
        text = ''.join(corpus).lower()
        seq = list(text)
        merges = []
        for _ in range(self.num_merges):
            counts = Counter(self._get_bigrams(seq))
            if not counts: break
            best_pair, _ = counts.most_common(1)[0]
            merges.append(best_pair)
            seq = self._merge_once(seq, best_pair)
        self.merges = merges
        self.pair_ranks = {p: i for i, p in enumerate(self.merges)}

        self.vocab = {}
        self.id_to_token = {}
        for ch in sorted(set(text)):
            self._add_token(ch)
        for a, b in self.merges:
            self._add_token(a + b)
        self.eot_id = self._add_token(self.eot_token)

    def encode(self, text, force_last_eot=True):
        # treat literal eot marker as special; remove it from content
        if self.eot_token in text:
            text = text.replace(self.eot_token, '')
        seq = list(text)

        # make sure all seen base chars exist
        for ch in set(seq):
            if ch not in self.vocab:
                self._add_token(ch)

        # greedy BPE using learned pair ranks
        if self.merges:
            while True:
                best_pair, best_rank = None, None
                for p in self._get_bigrams(seq):
                    r = self.pair_ranks.get(p)
                    if r is not None and (best_rank is None or r < best_rank):
                        best_pair, best_rank = p, r
                if best_pair is None:
                    break
                seq = self._merge_once(seq, best_pair)

        # ensure all tokens in seq exist in vocab (e.g., if new chars appeared)
        for tok in seq:
            if tok not in self.vocab:
                self._add_token(tok)

        ids = [self.vocab[tok] for tok in seq]

        # FORCE: append EOT id if not already last
        if force_last_eot:
            if not ids or ids[-1] != self.eot_id:
                ids.append(self.eot_id)

        return ids

    def decode(self, ids):
        # drop trailing EOT if present
        if ids and self.eot_id is not None and ids[-1] == self.eot_id:
            ids = ids[:-1]
        toks = [self.id_to_token[i] for i in ids]
        return ''.join(toks)


In [3]:
twinkle_twinkle = r'CCGGAAG,FFEEDDC,GGFFEED,GGFFEED,CCGGAAG,FFEEDDC'
hot_cross_buns = r'EDC,EDC,CCCC,DDDD,EDC'
happy_birthday = r'GGAGCB,GGAGDC,GGGECBA,FFECDC'

In [4]:
tok = SimpleBPETokenizer(num_merges=6)
examples = [twinkle_twinkle,hot_cross_buns, happy_birthday]
tok.train(examples)
tok.merges

[('g', 'g'), ('e', 'd'), ('c', 'c'), ('f', 'f'), ('ff', 'e'), ('gg', 'a')]

In [5]:
tok.vocab

{',': 0,
 'a': 1,
 'b': 2,
 'c': 3,
 'd': 4,
 'e': 5,
 'f': 6,
 'g': 7,
 'gg': 8,
 'ed': 9,
 'cc': 10,
 'ff': 11,
 'ffe': 12,
 'gga': 13,
 '<|endoftext|>': 14}

In [6]:
vocab_size = len(tok.vocab)
vocab_size

15

In [7]:
eot = tok.eot_id
tokens = []
for example in examples:
    tokens.extend([eot])
    tokens.extend(tok.encode(example.lower()))
all_tokens = torch.tensor(tokens, dtype=torch.long)
all_tokens

tensor([14, 10, 13,  1,  7,  0, 12,  9,  4,  3,  0,  8, 12,  9,  0,  8, 12,  9,
         0, 10, 13,  1,  7,  0, 12,  9,  4,  3, 14, 14,  9,  3,  0,  9,  3,  0,
        10, 10,  0,  4,  4,  4,  4,  0,  9,  3, 14, 14, 13,  7,  3,  2,  0, 13,
         7,  4,  3,  0,  8,  7,  5,  3,  2,  1,  0, 12,  3,  4,  3, 14])

# Modeling

The forward pass of a machine learning model takes the token ids, runs them through several layers of linear algebra, and then "predicts" the probability that each token in the vocab comes next. When the weights are still noise (as you will see in this example), the predictions are gibberish. Training turns that noise into pattern during the "backward pass", as you'll see. We'll show 3 steps that are focused on training:
1. **Data Loading** `x, y = train_loader.next_batch()` - this step pulls enough tokens from the raw data to complete a forward pass and loss calculation.  If the model is inference only, this step is replaced with taking in the inference input and preparing it the same way for the forward pass.
2. **Forward Pass** `logits, loss = model(x, y)` - using the data and the model architecture, we run a prediction for the tokens. When training we also compare against the expected tokens to get the loss; in inference, we use the logits to complete the inference task.
3. **Back Propagation, aka Backward Pass & Training** `loss.backward(); optimizer.step()` - using gradients (derivatives of the loss with respect to each parameter) to find which parameters most affected the prediction error measured against the targets from the data loading step, and then making very small adjustments to those parameters in the hope of improving future predictions.

Then we'll show a final **Forward Pass** with the weights updated in step 3. 
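The three steps can be sketched end to end on a deliberately tiny problem. The following is a pure-Python illustration, not the CNN from this notebook: it assumes a one-weight model `y = w * x`, a mean squared error loss, and a hand-derived gradient in place of `loss.backward()`.

```python
# Toy data drawn from y = 2x; the model has a single learnable weight w.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = 0.0
learning_rate = 0.05


def forward(w, batch):
    """Forward pass: predictions plus mean squared error loss."""
    preds = [w * x for x, _ in batch]
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
    return preds, loss


def backward(w, batch):
    """Backward pass: derivative of the loss above with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


batch = data                         # 1. data loading
_, loss_before = forward(w, batch)   # 2. forward pass
grad = backward(w, batch)            # 3. backward pass ...
w -= learning_rate * grad            #    ... followed by the parameter update
_, loss_after = forward(w, batch)    # final forward pass with the updated weight
print(loss_before, loss_after)
```

The loss after the update is lower than before it, which is the whole point of the training step.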

## Data Loading

To start, we need to get enough data to run the forward and backward passes.  In real practice the total dataset is likely too big to hold in memory all at once, so we read just enough of it into memory to run the passes, leaving memory and compute for the passes instead of for holding static data. 
First we have to identify the batch size and the model context length to determine how much data we need.  These dimensions also form 2 of the 3 dimensions of the initial matrix.
- **Batch Size (B)** - This is the number of examples you'll train on in a single pass. 
- **Context Length (T)** - This is the max number of tokens that a model can use in a single pass to generate the next token. If an example is below this length, it can be padded.
  
*Ideally both B and T are powers of 2 to work nicely with chip architecture. This is a common theme across the board.*

In [8]:
B_batch = 2 # Batch
T_context = 8 # context length

To start, we need to pull enough tokens from our long `all_tokens` tensor for the forward pass. To fill `B_batch` rows of `T_context` tokens each, we need `B*T` tokens. Since our training will attempt to predict the next token after the context, we also need 1 more token at the end so that the last row in the batch has a next token to validate against. 

In [9]:
current_position = 0
tok_for_training = all_tokens[current_position:current_position + B_batch*T_context +1 ]
tok_for_training

tensor([14, 10, 13,  1,  7,  0, 12,  9,  4,  3,  0,  8, 12,  9,  0,  8, 12])

Now that we have our initial tokens to train on, we need to convert them to a matrix that's ready for training. In this step we'll create our batches and set up two different arrays: 1/ the input tokens, `x`, and 2/ the output tokens, `y`, that they should predict. To create each example in the batch, every `T` tokens will be placed into its own row. 

Recall that training takes in a string of tokens the length of the context and then predicts the next token, and that when we extracted `tok_for_training` we added 1 extra token so that we can evaluate the prediction for the last example. Because of this, the input, `x`, is every token except the last one, `[:-1]`.  


Finally, for `y` we need one token for every row in the batch: the token immediately following that row's context. These sit at positions `b*T_context` for `b = 1, …, B_batch`, which is what the slice `[T_context::T_context]` selects. 

We will now put this together and do the following:
1. Extract the input `x` and then split it into an example for each batch `B`
2. Extract the output `y` and then split it into an example for each batch `B`

*Note: View can take `-1` which allows the matrix to infer the dimension so we do not need to pass in `T`, but given how many matrices we'll work with we want to make sure we're controlling the dimensions or erroring out if they do not match our expectations.*

In [10]:
x=tok_for_training[:-1].view(B_batch, T_context)
x.size(), x

(torch.Size([2, 8]),
 tensor([[14, 10, 13,  1,  7,  0, 12,  9],
         [ 4,  3,  0,  8, 12,  9,  0,  8]]))

In [11]:
tok_for_training

tensor([14, 10, 13,  1,  7,  0, 12,  9,  4,  3,  0,  8, 12,  9,  0,  8, 12])

In [12]:
y=tok_for_training[T_context::T_context].view(B_batch, 1)
y.size(), y

(torch.Size([2, 1]),
 tensor([[ 4],
         [12]]))
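As a sanity check on the slicing, the same split can be written with plain Python lists. This sketch reuses the 17 tokens printed above; the names `x_rows` and `y_rows` are illustrative.

```python
tokens = [14, 10, 13, 1, 7, 0, 12, 9, 4, 3, 0, 8, 12, 9, 0, 8, 12]  # B*T + 1 tokens
batch_size, context_length = 2, 8

# Inputs: drop the final token, then cut the rest into rows of length T.
inputs = tokens[:-1]
x_rows = [inputs[b * context_length:(b + 1) * context_length] for b in range(batch_size)]

# Targets: the single token that immediately follows each row.
y_rows = [tokens[(b + 1) * context_length] for b in range(batch_size)]

print(x_rows)  # [[14, 10, 13, 1, 7, 0, 12, 9], [4, 3, 0, 8, 12, 9, 0, 8]]
print(y_rows)  # [4, 12]
```

Note that the target for the first row is also the first input token of the second row, and the extra token we pulled is the target for the last row.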

## Forward pass

<img src="explainer_screenshots/cnn/full_network.png" width="200">


During training, the forward pass of the CNN we've built takes a string of tokens in and predicts the likelihood of the next token for each example in the batch. This is different from the other models we've used, because each example produces only a single prediction. This is mainly because CNNs do best with multi-dimensional data, so we're hacking our text input for this explainer by using `text x embedding` as our 2 dimensions, instead of an image or other 2-D data. 

This explainer for the forward pass is focused on training, where we'll pass in the input `x`, carry that input through the layers, and generate a matrix of scores for each token being the next one, something we call `logits` (a softmax turns them into probabilities). Since this is a CNN, we will pass each example through several convolution layers and even show downsampling, which reduces our matrix size. 

At the end of the forward pass we then compare the probability in the logits to the actual next token in `y` and calculate `loss` based on the difference. This difference is what we'll then use in the backprop/training steps.  

*Note that we will do some layer initialization to simplify following along.  In reality layers are often initialized to normal distribution with some adjustments made for parameter sizes, called Kaiming normal, to keep the weights properly noisy.  We will not cover initialization in this series*

In [13]:
B_batch, T_context

(2, 8)

### Input Layer

<img src="explainer_screenshots/cnn/input_layer.png" width="200">

We'll first create an initial **embedding layer** for our input tokens. Recall that this is the layer that adds the second dimension to our text examples. We supply only token embeddings here, though if we wanted to add more learning capability we could also add positional embeddings.  Since CNNs generally take in multi-dimensional examples and then use multi-dimensional patches for learning in the convolutional layers, position is generally left out, since the goal is to learn patterns in the data regardless of where they occur. We will initialize the embedding weights to small, distinct values so that the convolutions are easy to follow.  The embedding table has shape `vocab_size X n_embd`, so each token has its own row of `n_embd` weights.  The larger the embedding dimension, the more complex the patterns the model can learn. 

After the embedding layer we'll insert a fourth dimension of size 1 to better suit our convolutional layers.

**Embedding** 

To start we'll initialize our embeddings with an incrementing pattern, `W[i,j] = 0.01*(1+i+j)`, so that we can see how the values change through our convolutions. We'll also set our embedding dimension to 6 so that we can see how the convolution combines values across the embedding dimension.  You'll see that because our `x` plucks out different embedding rows, we quickly move away from the nicely ordered initial embeddings.  
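An embedding lookup is just row selection from a table. The following pure-Python sketch assumes the same sizes (15 tokens, 6 dimensions) and the incrementing pattern used in this notebook's embedding cell; the names `embedding_table` and `embed` are illustrative.

```python
vocab_size, n_embd = 15, 6

# Incrementing pattern: W[i][j] = 0.01 * (1 + i + j)
embedding_table = [[0.01 * (1 + i + j) for j in range(n_embd)] for i in range(vocab_size)]


def embed(token_ids):
    """Embedding lookup: return one table row per token id."""
    return [embedding_table[t] for t in token_ids]


rows = embed([14, 10, 13])   # first three tokens of the first example
print([round(v, 2) for v in rows[0]])  # [0.15, 0.16, 0.17, 0.18, 0.19, 0.2]
```

Because token ids arrive in arbitrary order, the looked-up rows no longer increase smoothly from one position to the next.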

In [14]:
n_embd = 6 # level of embedding of input tokens
n_embd, vocab_size

(6, 15)

In [15]:
wte = nn.Embedding(vocab_size, n_embd)
with torch.no_grad(): # initialize to W[i,j] = 0.01*(1+i+j) for easy following 
    vs, d = wte.num_embeddings, wte.embedding_dim
    rows = torch.arange(vs).unsqueeze(1)  # (vs,1)
    cols = torch.arange(d).unsqueeze(0)  # (1,d)
    pattern = 0.01*(1 + rows + cols)  # W[i,j] = 0.01*(1+i+j)
    wte.weight.copy_(pattern)
wte.weight

Parameter containing:
tensor([[0.0100, 0.0200, 0.0300, 0.0400, 0.0500, 0.0600],
        [0.0200, 0.0300, 0.0400, 0.0500, 0.0600, 0.0700],
        [0.0300, 0.0400, 0.0500, 0.0600, 0.0700, 0.0800],
        [0.0400, 0.0500, 0.0600, 0.0700, 0.0800, 0.0900],
        [0.0500, 0.0600, 0.0700, 0.0800, 0.0900, 0.1000],
        [0.0600, 0.0700, 0.0800, 0.0900, 0.1000, 0.1100],
        [0.0700, 0.0800, 0.0900, 0.1000, 0.1100, 0.1200],
        [0.0800, 0.0900, 0.1000, 0.1100, 0.1200, 0.1300],
        [0.0900, 0.1000, 0.1100, 0.1200, 0.1300, 0.1400],
        [0.1000, 0.1100, 0.1200, 0.1300, 0.1400, 0.1500],
        [0.1100, 0.1200, 0.1300, 0.1400, 0.1500, 0.1600],
        [0.1200, 0.1300, 0.1400, 0.1500, 0.1600, 0.1700],
        [0.1300, 0.1400, 0.1500, 0.1600, 0.1700, 0.1800],
        [0.1400, 0.1500, 0.1600, 0.1700, 0.1800, 0.1900],
        [0.1500, 0.1600, 0.1700, 0.1800, 0.1900, 0.2000]], requires_grad=True)

In [16]:
x = wte(x)
x.shape, x

(torch.Size([2, 8, 6]),
 tensor([[[0.1500, 0.1600, 0.1700, 0.1800, 0.1900, 0.2000],
          [0.1100, 0.1200, 0.1300, 0.1400, 0.1500, 0.1600],
          [0.1400, 0.1500, 0.1600, 0.1700, 0.1800, 0.1900],
          [0.0200, 0.0300, 0.0400, 0.0500, 0.0600, 0.0700],
          [0.0800, 0.0900, 0.1000, 0.1100, 0.1200, 0.1300],
          [0.0100, 0.0200, 0.0300, 0.0400, 0.0500, 0.0600],
          [0.1300, 0.1400, 0.1500, 0.1600, 0.1700, 0.1800],
          [0.1000, 0.1100, 0.1200, 0.1300, 0.1400, 0.1500]],
 
         [[0.0500, 0.0600, 0.0700, 0.0800, 0.0900, 0.1000],
          [0.0400, 0.0500, 0.0600, 0.0700, 0.0800, 0.0900],
          [0.0100, 0.0200, 0.0300, 0.0400, 0.0500, 0.0600],
          [0.0900, 0.1000, 0.1100, 0.1200, 0.1300, 0.1400],
          [0.1300, 0.1400, 0.1500, 0.1600, 0.1700, 0.1800],
          [0.1000, 0.1100, 0.1200, 0.1300, 0.1400, 0.1500],
          [0.0100, 0.0200, 0.0300, 0.0400, 0.0500, 0.0600],
          [0.0900, 0.1000, 0.1100, 0.1200, 0.1300, 0.1400]]],
        gra

### Add Dimension

We projected our input tokens `x` that was `[B×T]` into the embedding to get `[B×T×C]` so that we now have our `T×C` for each batch. To run our convolution per batch, though, we also need a spatial dimension for the kernel to slide over. PyTorch-style convolution layers expect tensors in `[B, C, H, W]` (channels-first), where the kernel slides over `H,W` while mixing across `C`. Because of this we add a singleton spatial dimension and reorder axes. With this process, the embedding dimension `C` becomes the channels and the token axis `T` becomes the width to slide across:

`[B, T, C]  →  [B, C, T]  →  [B, C, 1, T]`

The convolution we show is a `1×k` convolution, which slides only along our tokens `T` and aggregates over all `C` channels at each position.
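To see what the axis reordering does to the values, here is a pure-Python sketch for a single example using nested lists. The helper name `to_channels_first` and the tiny `T = 3`, `C = 2` input are assumptions for illustration.

```python
def to_channels_first(x_tc):
    """[T][C] -> [C][1][T]: transpose, then wrap each channel row in a height-1 axis."""
    x_ct = [list(channel) for channel in zip(*x_tc)]  # [C][T]
    return [[row] for row in x_ct]                    # [C][1][T]


# One example with T = 3 tokens (rows) and C = 2 embedding channels (columns).
x_tc = [[0.15, 0.16],
        [0.11, 0.12],
        [0.14, 0.15]]
x_c1t = to_channels_first(x_tc)
print(x_c1t)  # [[[0.15, 0.11, 0.14]], [[0.16, 0.12, 0.15]]]
```

After the reorder, each channel holds one embedding dimension laid out along the token axis, which is the axis the kernel slides over.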

In [17]:
x = x.permute(0,2,1) # [B,C,T]
x = x.unsqueeze(2)  # [B,C,1,T]
x.size(), x

(torch.Size([2, 6, 1, 8]),
 tensor([[[[0.1500, 0.1100, 0.1400, 0.0200, 0.0800, 0.0100, 0.1300, 0.1000]],
 
          [[0.1600, 0.1200, 0.1500, 0.0300, 0.0900, 0.0200, 0.1400, 0.1100]],
 
          [[0.1700, 0.1300, 0.1600, 0.0400, 0.1000, 0.0300, 0.1500, 0.1200]],
 
          [[0.1800, 0.1400, 0.1700, 0.0500, 0.1100, 0.0400, 0.1600, 0.1300]],
 
          [[0.1900, 0.1500, 0.1800, 0.0600, 0.1200, 0.0500, 0.1700, 0.1400]],
 
          [[0.2000, 0.1600, 0.1900, 0.0700, 0.1300, 0.0600, 0.1800, 0.1500]]],
 
 
         [[[0.0500, 0.0400, 0.0100, 0.0900, 0.1300, 0.1000, 0.0100, 0.0900]],
 
          [[0.0600, 0.0500, 0.0200, 0.1000, 0.1400, 0.1100, 0.0200, 0.1000]],
 
          [[0.0700, 0.0600, 0.0300, 0.1100, 0.1500, 0.1200, 0.0300, 0.1100]],
 
          [[0.0800, 0.0700, 0.0400, 0.1200, 0.1600, 0.1300, 0.0400, 0.1200]],
 
          [[0.0900, 0.0800, 0.0500, 0.1300, 0.1700, 0.1400, 0.0500, 0.1300]],
 
          [[0.1000, 0.0900, 0.0600, 0.1400, 0.1800, 0.1500, 0.0600, 0.1400]]]],
        gr

### Convolution Block

<img src="explainer_screenshots/cnn/convolutional_layers.png" width="400">

As is common in CNNs, we use multiple convolution layers with normalization and nonlinearity to learn increasingly expressive features from the input. Each convolution “looks” at a local patch whose size and stride we choose; stacking layers sequentially, together with residual skips, lets the model capture richer patterns and relationships.

In our model, the input to the convolution is $[B,C,1,T]$ with a $1\times k$ kernel. The convolution runs as a 2-D discrete cross-correlation along the token axis. For output channel $m$, with stride $s$, padding $p$, and bias $b^{m}$,
$$
y^{m}_{t}=\sum_{c=1}^{C}\sum_{u=0}^{k-1} W^{m}_{c,u}x_{c,ts+u-p}+b^{m}.
$$

Under the hood we:

1. Build the matrix of local patches $P\in\mathbb{R}^{(Ck)\times L}$ by extracting all sliding $1\times k$ windows; $L$ is the number of output positions.
2. Flatten the kernel bank into $W_{\text{flat}}\in\mathbb{R}^{C_{\text{out}}\times (Ck)}$.
3. Compute all positions at once: $Y = W_{\text{flat}}\,P \in \mathbb{R}^{C_{\text{out}}\times L}$ independently for each batch element, then reshape back to $[B,C_{\text{out}},1,T_{\text{out}}]$.
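The three steps above can be sketched for one batch element in pure Python. This is a minimal illustration that leaves out the bias term; the helper names (`pad_width`, `unfold`, `matmul`) and the toy input and weights are assumptions, not the notebook's tensors.

```python
def pad_width(x, p):
    """Zero-pad each channel row of x ([C][T]) with p zeros on both ends."""
    return [[0.0] * p + row + [0.0] * p for row in x]


def unfold(x_padded, k, stride):
    """Build P with C*k rows and L columns; column l is the flattened window at l*stride."""
    starts = range(0, len(x_padded[0]) - k + 1, stride)
    return [[row[s + u] for s in starts] for row in x_padded for u in range(k)]


def matmul(a, b):
    """Plain matrix product of a ([n][m]) and b ([m][p])."""
    return [[sum(a_i * b[i][j] for i, a_i in enumerate(a_row)) for j in range(len(b[0]))]
            for a_row in a]


x = [[1.0, 2.0, 3.0, 4.0],                      # input channel 0
     [10.0, 20.0, 30.0, 40.0]]                  # input channel 1
weights = [[[1.0, 2.0, 1.0], [0.0, 0.0, 0.0]],  # output channel 0 reads only input channel 0
           [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]  # output channel 1 copies input channel 1
k, stride, padding = 3, 1, 1

patches = unfold(pad_width(x, padding), k, stride)                           # step 1: (C*k) x L = 6 x 4
w_flat = [[w for channel in out_ch for w in channel] for out_ch in weights]  # step 2: C_out x (C*k) = 2 x 6
y = matmul(w_flat, patches)                                                  # step 3: C_out x L = 2 x 4
print(y)  # [[4.0, 8.0, 12.0, 11.0], [10.0, 20.0, 30.0, 40.0]]
```

The flattening order matters: both the patches and the weights are laid out channel by channel, so matching positions line up in the product.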

We interleave batch normalization and ReLU to stabilize activations, improve gradient flow, and add nonlinearity. 
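As a rough sketch of what these two operations do to one channel, the snippet below normalizes a list of activations and then applies ReLU. It is a simplification: it omits the learned scale and shift, and in a real layer the statistics are taken per channel across the whole batch and all positions. The function names and input values are illustrative.

```python
import math


def batch_norm(values, eps=1e-5):
    """Normalize one channel's values to zero mean and (almost) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def relu(values):
    """Zero out negative activations and keep positive ones unchanged."""
    return [max(0.0, v) for v in values]


channel = [4.0, 8.0, 12.0, 11.0]   # illustrative activations
normalized = batch_norm(channel)
activated = relu(normalized)
print([round(v, 3) for v in normalized])
print([round(v, 3) for v in activated])
```

Values below the channel mean become negative after normalization and are then zeroed by ReLU, which is where the nonlinearity comes from.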

The second convolution in the block downsamples with stride 2, reducing the token length $T\to\lceil T/2\rceil$. This both cuts compute and expands the effective receptive field of subsequent layers, helping the model capture longer-range patterns over the sequence.


Finally, as a nod to ResNets, the convolutional block also uses a residual path. For this path we add a projected skip $S(x)$ to the main path $F(x)$, yielding $y=F(x)+S(x)$.  Since we downsample on the main path, the residual path uses a $1\times 1$ projection with stride 2 so that its dimensions match those of the main path's output. 
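A small pure-Python sketch shows why the strided $1\times 1$ projection makes the two paths line up. It assumes a single channel, a $1\times 3$ stride-2 convolution standing in for the whole main path (normalization and ReLU omitted), and made-up weights; `conv_1xk` is a hypothetical helper.

```python
def conv_1xk(x, weights, stride=1, padding=0):
    """Cross-correlate x ([C_in][T]) with weights ([C_out][C_in][k]); returns [C_out][T_out]."""
    padded = [[0.0] * padding + row + [0.0] * padding for row in x]
    k = len(weights[0][0])
    starts = range(0, len(padded[0]) - k + 1, stride)
    return [[sum(w_c[u] * padded[c][s + u] for c, w_c in enumerate(w_out) for u in range(k))
             for s in starts]
            for w_out in weights]


x = [[1.0, 2.0, 3.0, 4.0, 5.0]]        # C_in = 1, T = 5
main_weights = [[[1.0, 1.0, 1.0]]]     # 1x3 kernel, stand-in for the main path F
skip_weights = [[[1.0]]]               # 1x1 projection for the skip path S

main = conv_1xk(x, main_weights, stride=2, padding=1)   # T_out = ceil(5 / 2) = 3
skip = conv_1xk(x, skip_weights, stride=2, padding=0)   # T_out = 3 as well
y = [[m + s for m, s in zip(m_row, s_row)] for m_row, s_row in zip(main, skip)]
print(main, skip, y)  # [[3.0, 9.0, 9.0]] [[1.0, 3.0, 5.0]] [[4.0, 12.0, 14.0]]
```

Without the stride on the skip path, `skip` would have 5 positions and could not be added to the 3 positions of `main`.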

#### Convolution Block - 1x3 Conv

##### 1x3 Conv - Initialize weights
Our first convolution uses a kernel size of `(1,3)`, a stride of `(1,1)`, and padding at both the start and end of the token dimension so that we can slide across all entries. For this first convolution layer we'll go step by step, showing how the convolution is built.  

To start, we will set up our weights with shape `[C_out, C_in, k_h, k_w]`: one kernel per output channel, spanning every input channel (currently our embedding dimension) and every kernel position. Giving each kernel position its own weight lets the layer learn which parts of the window matter most for our final prediction. 

We'll also initialize our weights to a simple repeating pattern so that we can see the impact clearly as they interact with our input.

In [18]:
c1_kernal_height = 1
c1_kernal_width = 3
c1_stride_height = 1
c1_stride_width = 1
c1_padding_height = 0
c1_padding_width = 1
{'conv 1 kernal': (c1_kernal_height, c1_kernal_width),
 'conv 1 stride': (c1_stride_height, c1_stride_width),
 'conv 1 padding': (c1_padding_height, c1_padding_width)}


{'conv 1 kernal': (1, 3), 'conv 1 stride': (1, 1), 'conv 1 padding': (0, 1)}

In [19]:
## weight layer for convolution (similar to linear, just more explicit)
conv1 = nn.Parameter(
    torch.empty(n_embd, n_embd, c1_kernal_height, c1_kernal_width), 
    requires_grad=True)

In [68]:
# initialize every kernel to (0.001, 0.002, 0.001) for an easier view of the weight impact
with torch.no_grad():
    c1_pattern = torch.tensor([0.001,0.002,0.001]).view(1,1,1,c1_kernal_width).expand(conv1.size()).clone()
    conv1.copy_(c1_pattern)
conv1.size(), conv1

(torch.Size([6, 6, 1, 3]),
 Parameter containing:
 tensor([[[[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]]],
 
 
         [[[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]]],
 
 
         [[[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]]],
 
 
         [[[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0.0020, 0.0010]],
 
          [[0.0010, 0

**Run Convolution**

Now we'll calculate the 2-D discrete cross-correlation of our weights and input `x`.  Since we know we have a residual connection, we'll branch `x` and rejoin it after the convolutional block. For our convolutional layer, in our step-by-step guide we'll do the following: 
1. Since we have padding, pad the token axis of each channel.
2. Flatten, or **unfold**, each sliding kernel-sized block within the spatial dimensions of the input into a column (i.e., last dimension) of a 3-D output tensor of shape $(N,C*k_h*k_w,L)$.
3. Flatten our weights so that the same weights are reused across every example in the batch, meaning our learning benefits from all of them. 
4. Multiply the flattened weights by the unfolded input and reshape the result back to our batch and channels.

##### 1x3 Conv - Step-by-step unfolding
In particular we'll focus on step #2, as this creates a sliding view that extracts a kernel-sized window at each position across our input. Because each window becomes a column, when we compute $W_{flat} \cdot X_{unfolded}$ each output is the sum of a row of weights times one flattened window of the input. Mentally, **unfold** linearizes all local receptive fields so you can do per-patch operations with a single batched matrix multiply. Convolution is exactly this with weights shared across every position.  After walking through step by step, we'll show you `F.unfold`, a function that does the padding and unfolding for you, and use it from there on out. 
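As a quick check of the claim that convolution is unfold plus a shared matrix multiply, the sketch below compares the two on random tensors with the same shapes as this notebook (`[2, 6, 1, 8]` input, `[6, 6, 1, 3]` weights). The random values and variable names are assumptions; only the shapes and the kernel, stride, and padding settings come from the walkthrough.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 6, 1, 8)        # [B, C, 1, T]
weight = torch.randn(6, 6, 1, 3)   # [C_out, C_in, k_h, k_w]

# Unfold: every padded 1x3 window becomes one column of length C*k_h*k_w = 18.
patches = F.unfold(x, kernel_size=(1, 3), padding=(0, 1), stride=(1, 1))  # [2, 18, 8]

# Flatten the kernels and multiply; broadcasting reuses w_flat for every batch element.
w_flat = weight.reshape(6, -1)                      # [6, 18]
y_unfold = (w_flat @ patches).reshape(2, 6, 1, 8)   # [2, 6, 8] -> [B, C_out, 1, T]

# Reference: the built-in convolution with the same settings and no bias.
y_conv = F.conv2d(x, weight, stride=(1, 1), padding=(0, 1))
matches = torch.allclose(y_unfold, y_conv, atol=1e-5)
print(patches.shape, matches)
```

The two results agree up to floating-point tolerance, so the manual steps that follow compute the same thing as a standard convolution layer.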

In [27]:
batch = B_batch
channel = n_embd
height = 1
width = T_context
x.size(),batch, channel, height, width, 

(torch.Size([2, 6, 1, 8]), 2, 6, 1, 8)

**Calculate expected unfolded dimensions**  

Since we're doing the unfolding manually, we need to calculate the expected dimensions for our loop.  
Recall that we expect to go from $(B,C,1,T)$ to $(B,C*k_h*k_w,L)$, where $L$ is the flattened output height and width, as follows (`//` is integer division, and the factor 1 in front of each kernel term is the dilation): 
$$
\begin{align}
height_{out} &= (height + 2*pad_h - 1*(kernel_h-1) -1)\ //\ stride_{h} + 1\\ 
width_{out} &= (width + 2*pad_w - 1*(kernel_w-1) -1)\ //\ stride_{w} + 1\\
L &= height_{out} * width_{out}
\end{align}
$$
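The output-size formula can be wrapped in a small helper, shown here as a sketch. Note the final `+ 1`, which counts the window at the starting position. The function name `conv_output_size` is an assumption, and the stride-2 line uses assumed settings (same kernel and padding) to preview downsampling.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Number of output positions along one spatial axis."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


# Settings for the first convolution in this walkthrough: T = 8, 1x3 kernel, stride 1, padding 1.
width_out = conv_output_size(8, kernel=3, stride=1, padding=1)
height_out = conv_output_size(1, kernel=1, stride=1, padding=0)

# Assumed settings for a stride-2 layer with the same kernel and padding.
width_out_stride2 = conv_output_size(8, kernel=3, stride=2, padding=1)

print(width_out, height_out, width_out * height_out, width_out_stride2)  # 8 1 8 4
```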

In [31]:
c1_khw = channel*c1_kernal_height*c1_kernal_width

c1_height_out = (height + 2*c1_padding_height - 1*(c1_kernal_height-1) - 1)//c1_stride_height + 1   # = 1
c1_width_out = (width + 2*c1_padding_width - 1*(c1_kernal_width-1) - 1)//c1_stride_width + 1   # = 8
c1_L = c1_height_out * c1_width_out

print(f'width out {c1_width_out}, height out {c1_height_out}, final dimension ({batch},{c1_khw},{c1_L})')

width out 8, height out 1, final dimension (2,18,8)


**Padding** 

We first start by padding.  Since we're using a kernel of `(1,3)` with a stride of `(1,1)`, we need to pad both the start and end of the tokens so that the output keeps the same length as the input. Padding here simply adds `0`, though other values can be used.  When we pad on both sides we get an output of `[2, 6, 1+0, 8+2]`.

In [32]:
# pad last dim by (width, width) and 2nd to last by (height, height). width = 1, height = 0
c1_x_pad = F.pad(x, pad=(c1_padding_width,c1_padding_width,c1_padding_height,c1_padding_height))

c1_x_pad.size(), c1_x_pad #total size and show first example in batch 

(torch.Size([2, 6, 1, 10]),
 tensor([[[[0.0000, 0.1500, 0.1100, 0.1400, 0.0200, 0.0800, 0.0100, 0.1300,
            0.1000, 0.0000]],
 
          [[0.0000, 0.1600, 0.1200, 0.1500, 0.0300, 0.0900, 0.0200, 0.1400,
            0.1100, 0.0000]],
 
          [[0.0000, 0.1700, 0.1300, 0.1600, 0.0400, 0.1000, 0.0300, 0.1500,
            0.1200, 0.0000]],
 
          [[0.0000, 0.1800, 0.1400, 0.1700, 0.0500, 0.1100, 0.0400, 0.1600,
            0.1300, 0.0000]],
 
          [[0.0000, 0.1900, 0.1500, 0.1800, 0.0600, 0.1200, 0.0500, 0.1700,
            0.1400, 0.0000]],
 
          [[0.0000, 0.2000, 0.1600, 0.1900, 0.0700, 0.1300, 0.0600, 0.1800,
            0.1500, 0.0000]]],
 
 
         [[[0.0000, 0.0500, 0.0400, 0.0100, 0.0900, 0.1300, 0.1000, 0.0100,
            0.0900, 0.0000]],
 
          [[0.0000, 0.0600, 0.0500, 0.0200, 0.1000, 0.1400, 0.1100, 0.0200,
            0.1000, 0.0000]],
 
          [[0.0000, 0.0700, 0.0600, 0.0300, 0.1100, 0.1500, 0.1200, 0.0300,
            0.1100, 0.0000]],

**Manual Unfolding - First Stride** 

Now we will manually unfold our padded input.  The process of unfolding flattens each sliding kernel-sized block within the spatial dimensions of the input into a column (i.e., last dimension) of a 3-D output tensor of shape $(N,C*k_h*k_w,L)$.

We'll first start by pulling out the first patch.  Since we have a kernel of `(1,3)`, we pull out the first 3 tokens along the width for each channel in each batch element. 

In [44]:
step = 0
patch = c1_x_pad[:, :, 0:c1_kernal_height, step:step+c1_kernal_width]  # height always starts at 0; only the width window moves
patch.size(), patch

(torch.Size([2, 6, 1, 3]),
 tensor([[[[0.0000, 0.1500, 0.1100]],
 
          [[0.0000, 0.1600, 0.1200]],
 
          [[0.0000, 0.1700, 0.1300]],
 
          [[0.0000, 0.1800, 0.1400]],
 
          [[0.0000, 0.1900, 0.1500]],
 
          [[0.0000, 0.2000, 0.1600]]],
 
 
         [[[0.0000, 0.0500, 0.0400]],
 
          [[0.0000, 0.0600, 0.0500]],
 
          [[0.0000, 0.0700, 0.0600]],
 
          [[0.0000, 0.0800, 0.0700]],
 
          [[0.0000, 0.0900, 0.0800]],
 
          [[0.0000, 0.1000, 0.0900]]]], grad_fn=<SliceBackward0>))

Now we need to stack our channels together. Eventually we want a matrix multiply in which each row of flattened weights meets one flattened patch, so we flatten each example's patch (all channels and kernel positions) into a single vector of length `C*k_h*k_w`.  
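The flattening order is what makes the later multiply work: position `c*k + u` of the flattened patch must line up with position `c*k + u` of the flattened kernel. Here is a pure-Python sketch using the first two channels of the first patch and the `(0.001, 0.002, 0.001)` kernel pattern from this notebook; `flatten_patch` is an illustrative helper name.

```python
def flatten_patch(patch):
    """Flatten a [C][k] patch channel by channel, so index c * k + u holds patch[c][u]."""
    return [value for channel in patch for value in channel]


# First padded 1x3 window of the first example, first two channels only.
patch = [[0.00, 0.15, 0.11],
         [0.00, 0.16, 0.12]]
kernel = [[0.001, 0.002, 0.001],
          [0.001, 0.002, 0.001]]

col = flatten_patch(patch)
w_row = flatten_patch(kernel)
output = sum(w * v for w, v in zip(w_row, col))   # one output value for this window
print(col)               # [0.0, 0.15, 0.11, 0.0, 0.16, 0.12]
print(round(output, 5))  # 0.00085
```

With all 6 channels the same idea applies; the flattened vectors simply have 18 entries instead of 6.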

In [45]:
col = patch.reshape(batch, c1_khw)
col.size(), col

(torch.Size([2, 18]),
 tensor([[0.0000, 0.1500, 0.1100, 0.0000, 0.1600, 0.1200, 0.0000, 0.1700, 0.1300,
          0.0000, 0.1800, 0.1400, 0.0000, 0.1900, 0.1500, 0.0000, 0.2000, 0.1600],
         [0.0000, 0.0500, 0.0400, 0.0000, 0.0600, 0.0500, 0.0000, 0.0700, 0.0600,
          0.0000, 0.0800, 0.0700, 0.0000, 0.0900, 0.0800, 0.0000, 0.1000, 0.0900]],
        grad_fn=<UnsafeViewBackward0>))

Finally we want to make sure to save this since this is just the first pass of the patch. Let's create a list for now and store them. After we complete all the strides we can reshape our final output of the unfolded step to make each entry a column. 

In [46]:
manual_cols = []
manual_cols.append(col)

**Manual Unfolding - Second Stride** 

We now need to move our patch by the stride amount, in this case `(1,1)`. Using a stride of 1 on both dimensions ensures that we continue covering every input token in the example. As you'll see in future convolutions, changing the stride can downsample an input.  Let's start by again extracting the patch. You'll see that the window just shifted to the right by 1 and took the next 3 columns of our input.

In [47]:
step = 1
patch = c1_x_pad[:, :, 0:c1_kernal_height, step:step+c1_kernal_width]
patch.size(), patch

(torch.Size([2, 6, 1, 3]),
 tensor([[[[0.1500, 0.1100, 0.1400]],
 
          [[0.1600, 0.1200, 0.1500]],
 
          [[0.1700, 0.1300, 0.1600]],
 
          [[0.1800, 0.1400, 0.1700]],
 
          [[0.1900, 0.1500, 0.1800]],
 
          [[0.2000, 0.1600, 0.1900]]],
 
 
         [[[0.0500, 0.0400, 0.0100]],
 
          [[0.0600, 0.0500, 0.0200]],
 
          [[0.0700, 0.0600, 0.0300]],
 
          [[0.0800, 0.0700, 0.0400]],
 
          [[0.0900, 0.0800, 0.0500]],
 
          [[0.1000, 0.0900, 0.0600]]]], grad_fn=<SliceBackward0>))

We'll again flatten this the same way as before.

In [48]:
col = patch.reshape(batch, c1_khw)
col.size(), col

(torch.Size([2, 18]),
 tensor([[0.1500, 0.1100, 0.1400, 0.1600, 0.1200, 0.1500, 0.1700, 0.1300, 0.1600,
          0.1800, 0.1400, 0.1700, 0.1900, 0.1500, 0.1800, 0.2000, 0.1600, 0.1900],
         [0.0500, 0.0400, 0.0100, 0.0600, 0.0500, 0.0200, 0.0700, 0.0600, 0.0300,
          0.0800, 0.0700, 0.0400, 0.0900, 0.0800, 0.0500, 0.1000, 0.0900, 0.0600]],
        grad_fn=<UnsafeViewBackward0>))

Now we add it to our list. We can see that the list already holds the entries for our first 2 steps.

In [49]:
manual_cols.append(col)
manual_cols

[tensor([[0.0000, 0.1500, 0.1100, 0.0000, 0.1600, 0.1200, 0.0000, 0.1700, 0.1300,
          0.0000, 0.1800, 0.1400, 0.0000, 0.1900, 0.1500, 0.0000, 0.2000, 0.1600],
         [0.0000, 0.0500, 0.0400, 0.0000, 0.0600, 0.0500, 0.0000, 0.0700, 0.0600,
          0.0000, 0.0800, 0.0700, 0.0000, 0.0900, 0.0800, 0.0000, 0.1000, 0.0900]],
        grad_fn=<UnsafeViewBackward0>),
 tensor([[0.1500, 0.1100, 0.1400, 0.1600, 0.1200, 0.1500, 0.1700, 0.1300, 0.1600,
          0.1800, 0.1400, 0.1700, 0.1900, 0.1500, 0.1800, 0.2000, 0.1600, 0.1900],
         [0.0500, 0.0400, 0.0100, 0.0600, 0.0500, 0.0200, 0.0700, 0.0600, 0.0300,
          0.0800, 0.0700, 0.0400, 0.0900, 0.0800, 0.0500, 0.1000, 0.0900, 0.0600]],
        grad_fn=<UnsafeViewBackward0>)]

**Manual Unfolding - Remaining Strides**

We'll now loop through the remaining steps for the manual unfolding to fill in the rest of the list.  This is the same set of steps done before, just in a loop but appending to the same list.  We'll start from 2 onward since we already did steps 0 and 1. 

In [52]:
for step in range(2, c1_width_out):
    print(f'executing stride {step}')
    # extract the patch for this step
    patch = c1_x_pad[:, :, 0:c1_kernal_height, step:step+c1_kernal_width]        # (2,6,1,3)

    # flatten each batch entry's patch into a row
    col = patch.reshape(batch, c1_khw) # shape to [2,18]

    manual_cols.append(col)

manual_cols

executing stride 2
executing stride 3
executing stride 4
executing stride 5
executing stride 6
executing stride 7


[tensor([[0.0000, 0.1500, 0.1100, 0.0000, 0.1600, 0.1200, 0.0000, 0.1700, 0.1300,
          0.0000, 0.1800, 0.1400, 0.0000, 0.1900, 0.1500, 0.0000, 0.2000, 0.1600],
         [0.0000, 0.0500, 0.0400, 0.0000, 0.0600, 0.0500, 0.0000, 0.0700, 0.0600,
          0.0000, 0.0800, 0.0700, 0.0000, 0.0900, 0.0800, 0.0000, 0.1000, 0.0900]],
        grad_fn=<UnsafeViewBackward0>),
 tensor([[0.1500, 0.1100, 0.1400, 0.1600, 0.1200, 0.1500, 0.1700, 0.1300, 0.1600,
          0.1800, 0.1400, 0.1700, 0.1900, 0.1500, 0.1800, 0.2000, 0.1600, 0.1900],
         [0.0500, 0.0400, 0.0100, 0.0600, 0.0500, 0.0200, 0.0700, 0.0600, 0.0300,
          0.0800, 0.0700, 0.0400, 0.0900, 0.0800, 0.0500, 0.1000, 0.0900, 0.0600]],
        grad_fn=<UnsafeViewBackward0>),
 tensor([[0.1100, 0.1400, 0.0200, 0.1200, 0.1500, 0.0300, 0.1300, 0.1600, 0.0400,
          0.1400, 0.1700, 0.0500, 0.1500, 0.1800, 0.0600, 0.1600, 0.1900, 0.0700],
         [0.0400, 0.0100, 0.0900, 0.0500, 0.0200, 0.1000, 0.0600, 0.0300, 0.1100,
          0

**Manual Unfolding - Flatten List**

Now that we've completed the patch extraction, we have a list of 8 tensors, each of shape `(2,18)`. We want to create a new tensor that keeps the batch of 2 and turns each length-18 row into a column. We'll use `torch.stack` along a new last dimension, which results in a `(2,18,8)` tensor, just like we calculated.

In [53]:
# turn all the rows in the list into columns while maintaining the batch
manual_unfold = torch.stack(manual_cols, dim=2)  # (N, 18, 8)
manual_unfold.size(), manual_unfold

(torch.Size([2, 18, 8]),
 tensor([[[0.0000, 0.1500, 0.1100, 0.1400, 0.0200, 0.0800, 0.0100, 0.1300],
          [0.1500, 0.1100, 0.1400, 0.0200, 0.0800, 0.0100, 0.1300, 0.1000],
          [0.1100, 0.1400, 0.0200, 0.0800, 0.0100, 0.1300, 0.1000, 0.0000],
          [0.0000, 0.1600, 0.1200, 0.1500, 0.0300, 0.0900, 0.0200, 0.1400],
          [0.1600, 0.1200, 0.1500, 0.0300, 0.0900, 0.0200, 0.1400, 0.1100],
          [0.1200, 0.1500, 0.0300, 0.0900, 0.0200, 0.1400, 0.1100, 0.0000],
          [0.0000, 0.1700, 0.1300, 0.1600, 0.0400, 0.1000, 0.0300, 0.1500],
          [0.1700, 0.1300, 0.1600, 0.0400, 0.1000, 0.0300, 0.1500, 0.1200],
          [0.1300, 0.1600, 0.0400, 0.1000, 0.0300, 0.1500, 0.1200, 0.0000],
          [0.0000, 0.1800, 0.1400, 0.1700, 0.0500, 0.1100, 0.0400, 0.1600],
          [0.1800, 0.1400, 0.1700, 0.0500, 0.1100, 0.0400, 0.1600, 0.1300],
          [0.1400, 0.1700, 0.0500, 0.1100, 0.0400, 0.1600, 0.1300, 0.0000],
          [0.0000, 0.1900, 0.1500, 0.1800, 0.0600, 0.1200, 0.05
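
The manual steps above (pad, slice, flatten, stack) can be wrapped into one helper. This is a minimal sketch, not the notebook's own code: the function name `manual_unfold_1xk`, its parameters, and the toy input are assumptions made for illustration, and it only handles inputs with a height of 1.

```python
import torch
import torch.nn.functional as F

def manual_unfold_1xk(x, kernel_width, pad_width, stride_width=1):
    """Unfold a [B, C, 1, T] tensor into [B, C*kernel_width, L] patch columns."""
    batch, channels, height, _ = x.shape
    assert height == 1, "this sketch only handles a height of 1"
    # zero-pad only the width (token) dimension: (left, right)
    x_pad = F.pad(x, (pad_width, pad_width))
    padded_width = x_pad.shape[-1]
    # number of window positions along the width
    width_out = (padded_width - kernel_width) // stride_width + 1
    cols = []
    for step in range(width_out):
        start = step * stride_width
        patch = x_pad[:, :, :, start:start + kernel_width]            # [B, C, 1, k]
        cols.append(patch.reshape(batch, channels * kernel_width))    # [B, C*k]
    # each flattened patch becomes one column of the result
    return torch.stack(cols, dim=2)                                   # [B, C*k, L]

# hypothetical toy input with the same shape as the walkthrough's x
toy = torch.arange(2 * 6 * 1 * 8, dtype=torch.float32).reshape(2, 6, 1, 8)
toy_unfolded = manual_unfold_1xk(toy, kernel_width=3, pad_width=1)
print(toy_unfolded.shape)  # torch.Size([2, 18, 8])
```

Passing `stride_width=2` to the same helper would visit only every other window position, which is how a later convolution halves the sequence length.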

**Unfolding - Efficiently**

While the above is great for demonstration purposes, it eats up a lot of time and code space. Let's switch to PyTorch's `F.unfold` function. It performs the same steps as above: padding, patch extraction, reshaping, and stacking.

Let's set up our unfold of the original input `x`. We'll also compare the previous output `manual_unfold` with this function's output to demonstrate that they are in fact equal, so we can use `F.unfold` going forward.

In [62]:
c1_unfolded = F.unfold(x, 
		kernel_size=(c1_kernal_height, c1_kernal_width),  # (1,3)
		padding=(c1_padding_height, c1_padding_width), #(0,1)
		stride=(c1_stride_height, c1_stride_width))#(1,1)

print("manual equals unfold:", torch.allclose(c1_unfolded, manual_unfold))
c1_unfolded.size() , c1_unfolded

manual equals unfold: True


(torch.Size([2, 18, 8]),
 tensor([[[0.0000, 0.1500, 0.1100, 0.1400, 0.0200, 0.0800, 0.0100, 0.1300],
          [0.1500, 0.1100, 0.1400, 0.0200, 0.0800, 0.0100, 0.1300, 0.1000],
          [0.1100, 0.1400, 0.0200, 0.0800, 0.0100, 0.1300, 0.1000, 0.0000],
          [0.0000, 0.1600, 0.1200, 0.1500, 0.0300, 0.0900, 0.0200, 0.1400],
          [0.1600, 0.1200, 0.1500, 0.0300, 0.0900, 0.0200, 0.1400, 0.1100],
          [0.1200, 0.1500, 0.0300, 0.0900, 0.0200, 0.1400, 0.1100, 0.0000],
          [0.0000, 0.1700, 0.1300, 0.1600, 0.0400, 0.1000, 0.0300, 0.1500],
          [0.1700, 0.1300, 0.1600, 0.0400, 0.1000, 0.0300, 0.1500, 0.1200],
          [0.1300, 0.1600, 0.0400, 0.1000, 0.0300, 0.1500, 0.1200, 0.0000],
          [0.0000, 0.1800, 0.1400, 0.1700, 0.0500, 0.1100, 0.0400, 0.1600],
          [0.1800, 0.1400, 0.1700, 0.0500, 0.1100, 0.0400, 0.1600, 0.1300],
          [0.1400, 0.1700, 0.0500, 0.1100, 0.0400, 0.1600, 0.1300, 0.0000],
          [0.0000, 0.1900, 0.1500, 0.1800, 0.0600, 0.1200, 0.05
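
To read the 18 rows of the unfolded tensor, it can help to map a row index back to its input channel and its offset inside the kernel window. The sketch below assumes a random toy input of the same shape as `x`; the names `toy`, `toy_pad`, `row`, and `col` are illustrative only.

```python
import torch
import torch.nn.functional as F

channels, kernel_width, tokens = 6, 3, 8
toy = torch.rand(2, channels, 1, tokens)
toy_pad = F.pad(toy, (1, 1))                      # zero-pad the token axis by 1 on each side
toy_unfolded = F.unfold(toy, kernel_size=(1, kernel_width), padding=(0, 1))

row, col = 7, 4                                   # any row in [0, 18) and column in [0, 8)
channel = row // kernel_width                     # 7 // 3 = 2
offset = row % kernel_width                       # 7 % 3  = 1
# with stride 1, column `col` is the window that starts at padded position `col`
same = torch.equal(toy_unfolded[:, row, col], toy_pad[:, channel, 0, col + offset])
print(channel, offset, same)  # 2 1 True
```

So rows 0 to 2 hold the three kernel positions for channel 0, rows 3 to 5 hold channel 1, and so on, which matches the repeating groups of three rows in the output above.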

##### 1x3 Conv - $W\cdot X_{unfolded}$

Now that we have our unfolded patches, we can let our network decide how much of the patch, and which part of the patch, influences our output. To do this we take the matrix product of the weight with the unfolded input. We do have an issue though, since our weight is `[6,6,1,3]` but our input is `[2,18,8]`. We solve this by flattening the last three dimensions of the weight (input channel, kernel height, kernel width) into one, which gives a `[6,18]` tensor that we can multiply.

You might now be asking, "What about the batch dimension of 2?" We want the 2 batch entries to share the same weight, so we don't add a batch dimension to the weight. Instead we rely on PyTorch, which automatically broadcasts the same matrix across each batch entry in the input.

In [70]:
conv1_weigth = conv1.view(n_embd, -1) # [6,6,1,3] > [6,18]
conv1_weigth.size(), conv1_weigth

(torch.Size([6, 18]),
 tensor([[0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010,
          0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010],
         [0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010,
          0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010],
         [0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010,
          0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010],
         [0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010,
          0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010],
         [0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010,
          0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010],
         [0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010,
          0.0010, 0.0020, 0.0010, 0.0010, 0.0020, 0.0010, 0.0010, 0.002

Now that we have the weights in the shape we want, we're ready to multiply them with the unfolded input. Because every row of our weight matrix is the same, every output channel (row) of the result will be equal. The final output is `[2,6,8]`: the 18 patch values are reduced to one value per output channel at each of the 8 positions. Also note that the batch dimension is maintained as the weight is broadcast across the batches.

In [72]:
out = conv1_weigth @ c1_unfolded
out.size(), out

(torch.Size([2, 6, 8]),
 tensor([[[0.0029, 0.0037, 0.0031, 0.0022, 0.0017, 0.0020, 0.0028, 0.0024],
          [0.0029, 0.0037, 0.0031, 0.0022, 0.0017, 0.0020, 0.0028, 0.0024],
          [0.0029, 0.0037, 0.0031, 0.0022, 0.0017, 0.0020, 0.0028, 0.0024],
          [0.0029, 0.0037, 0.0031, 0.0022, 0.0017, 0.0020, 0.0028, 0.0024],
          [0.0029, 0.0037, 0.0031, 0.0022, 0.0017, 0.0020, 0.0028, 0.0024],
          [0.0029, 0.0037, 0.0031, 0.0022, 0.0017, 0.0020, 0.0028, 0.0024]],
 
         [[0.0013, 0.0014, 0.0015, 0.0025, 0.0033, 0.0026, 0.0019, 0.0016],
          [0.0013, 0.0014, 0.0015, 0.0025, 0.0033, 0.0026, 0.0019, 0.0016],
          [0.0013, 0.0014, 0.0015, 0.0025, 0.0033, 0.0026, 0.0019, 0.0016],
          [0.0013, 0.0014, 0.0015, 0.0025, 0.0033, 0.0026, 0.0019, 0.0016],
          [0.0013, 0.0014, 0.0015, 0.0025, 0.0033, 0.0026, 0.0019, 0.0016],
          [0.0013, 0.0014, 0.0015, 0.0025, 0.0033, 0.0026, 0.0019, 0.0016]]],
        grad_fn=<CloneBackward0>))
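
As a sanity check on the unfold-then-multiply approach, the result can be compared against PyTorch's built-in `F.conv2d`. This is a minimal sketch on random toy tensors (the names `toy_x` and `toy_w` are assumptions, not the notebook's variables), with no bias term, matching the walkthrough.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
toy_x = torch.rand(2, 6, 1, 8)          # [B, C_in, 1, T]
toy_w = torch.rand(6, 6, 1, 3)          # [C_out, C_in, kH, kW]

# unfold + matmul, following the same steps as the walkthrough
cols = F.unfold(toy_x, kernel_size=(1, 3), padding=(0, 1), stride=(1, 1))  # [2, 18, 8]
via_matmul = (toy_w.view(6, -1) @ cols).view(2, 6, 1, 8)

# built-in convolution with the same weight and no bias
via_conv2d = F.conv2d(toy_x, toy_w, bias=None, stride=(1, 1), padding=(0, 1))

# the two routes should agree up to floating-point rounding
print(via_matmul.shape, torch.allclose(via_matmul, via_conv2d, atol=1e-5))
```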

Finally, we need to reshape the last dimension back into our target output height and width. Since our height is 1, this just inserts another dimension of size 1, and the values look the same.

In [74]:
out = out.view(batch,n_embd, c1_height_out, c1_width_out)
out.size(), out

(torch.Size([2, 6, 1, 8]),
 tensor([[[[0.0029, 0.0037, 0.0031, 0.0022, 0.0017, 0.0020, 0.0028, 0.0024]],
 
          [[0.0029, 0.0037, 0.0031, 0.0022, 0.0017, 0.0020, 0.0028, 0.0024]],
 
          [[0.0029, 0.0037, 0.0031, 0.0022, 0.0017, 0.0020, 0.0028, 0.0024]],
 
          [[0.0029, 0.0037, 0.0031, 0.0022, 0.0017, 0.0020, 0.0028, 0.0024]],
 
          [[0.0029, 0.0037, 0.0031, 0.0022, 0.0017, 0.0020, 0.0028, 0.0024]],
 
          [[0.0029, 0.0037, 0.0031, 0.0022, 0.0017, 0.0020, 0.0028, 0.0024]]],
 
 
         [[[0.0013, 0.0014, 0.0015, 0.0025, 0.0033, 0.0026, 0.0019, 0.0016]],
 
          [[0.0013, 0.0014, 0.0015, 0.0025, 0.0033, 0.0026, 0.0019, 0.0016]],
 
          [[0.0013, 0.0014, 0.0015, 0.0025, 0.0033, 0.0026, 0.0019, 0.0016]],
 
          [[0.0013, 0.0014, 0.0015, 0.0025, 0.0033, 0.0026, 0.0019, 0.0016]],
 
          [[0.0013, 0.0014, 0.0015, 0.0025, 0.0033, 0.0026, 0.0019, 0.0016]],
 
          [[0.0013, 0.0014, 0.0015, 0.0025, 0.0033, 0.0026, 0.0019, 0.0016]]]],
        gr

#### Convolution Block - First Batch Norm

In [None]:
bn_a = nn.BatchNorm2d(n_embd)
bn_a.weight, bn_a.bias

In [None]:
out = bn_a(out)
out.size(), out
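
To make the batch norm step less of a black box, here is a minimal sketch of what `nn.BatchNorm2d` computes in training mode: each channel is normalized with the mean and biased variance taken over the batch, height, and width axes, then scaled by `weight` and shifted by `bias` (1 and 0 for a fresh layer). The toy input and the names `toy`, `bn`, and `manual` are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy = torch.rand(2, 6, 1, 8)                     # [B, C, H, W]
bn = nn.BatchNorm2d(6)                           # fresh layer: weight = 1, bias = 0
bn.train()                                       # training mode uses batch statistics

# per-channel statistics over the batch, height and width axes
mean = toy.mean(dim=(0, 2, 3), keepdim=True)
var = ((toy - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)   # biased variance
manual = (toy - mean) / torch.sqrt(var + bn.eps)
manual = manual * bn.weight.view(1, -1, 1, 1) + bn.bias.view(1, -1, 1, 1)

print(torch.allclose(bn(toy), manual, atol=1e-5))
```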

### First ReLU

In [None]:
out = F.relu(out) 
out.size(), out
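
ReLU simply replaces every negative value with 0 and leaves the rest untouched. A tiny sketch on a hand-made tensor (the values are made up for illustration):

```python
import torch
import torch.nn.functional as F

toy = torch.tensor([[-1.5, -0.2, 0.0, 0.3, 2.0]])
print(F.relu(toy))
# ReLU is the same as clamping from below at zero
print(torch.equal(F.relu(toy), torch.clamp(toy, min=0)))  # True
```

Since batch norm centers each channel around 0, a good share of the values coming out of the previous step will typically be negative and get zeroed here.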

### Second convolution that downsamples using a stride of 2 

In [None]:
c2_in_channel = n_embd
c2_out_channel = n_embd
c2_in_channel, c2_out_channel

In [None]:
c2_kernal_height = 1
c2_kernal_width = 3
c2_stride_height = 1
c2_stride_width = 2
c2_padding_height = 0
c2_padding_width = 1
{'kernal': (c2_kernal_height, c2_kernal_width),
 'stride': (c2_stride_height, c2_stride_width),
 'padding': (c2_padding_height, c2_padding_width)}


In [None]:
## weight layer for convolution (similar to linear, just more explicit)
conv_stride2 = nn.Parameter(
    torch.empty(c2_out_channel, c2_in_channel, c2_kernal_height, c2_kernal_width), 
    requires_grad=True)

In [None]:
# initialize each kernel row as 0.3, 0.2, 0.1 for easier view of the weight impact
with torch.no_grad():
    c2_pattern = torch.tensor([0.3,0.2,0.1]).view(1,1,1,c2_kernal_width).expand(conv_stride2.size()).clone()
    conv_stride2.copy_(c2_pattern)
conv_stride2.size(), conv_stride2

### Second convolution, see that the dimensions change

In [None]:

c2_batch, c2_channel, c2_height, c2_width = out.size()
out.size(), c2_batch, c2_channel, c2_height, c2_width, 

In [None]:
c2_c_khw = c2_channel*c2_kernal_height*c2_kernal_width

c2_height_out = (c2_height + 2*c2_padding_height - 1*(c2_kernal_height-1) - 1)//c2_stride_height + 1   # = 1, 
c2_width_out = (c2_width + 2*c2_padding_width - 1*(c2_kernal_width-1) - 1)//c2_stride_width + 1   # = 4
c2_L = c2_height_out * c2_width_out

print(f'First Conv: width out {c1_width_out}, height out {c1_height_out}, final dimension ({batch},{c1_khw},{c1_height_out * c1_width_out})')
print(f'This  Conv: width out {c2_width_out}, height out {c2_height_out}, final dimension ({c2_batch},{c2_c_khw},{c2_L})')

**Notice** how the width is cut in half (from 8 to 4). This is because the stride is 2. We'll have to deal with this in the residual step so that the input embedding matches this output.
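
The output-size arithmetic used above can be captured in a small helper and checked against `F.unfold`. This is a sketch only: the function name `conv_out_len` and the toy tensor are assumptions, and the formula is the same one used in the cell above with a dilation of 1.

```python
import torch
import torch.nn.functional as F

def conv_out_len(size, kernel, padding=0, stride=1, dilation=1):
    """Output length along one axis of a convolution."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

print(conv_out_len(8, kernel=3, padding=1, stride=1))  # 8  (first conv keeps the length)
print(conv_out_len(8, kernel=3, padding=1, stride=2))  # 4  (second conv halves it)
print(conv_out_len(8, kernel=1, padding=0, stride=2))  # 4  (1x1 conv with stride 2)

# the number of unfolded columns follows the same formula
toy = torch.rand(2, 6, 1, 8)
cols = F.unfold(toy, kernel_size=(1, 3), padding=(0, 1), stride=(1, 2))
print(cols.shape)  # torch.Size([2, 18, 4])
```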

In [None]:
(c2_kernal_height, c2_kernal_width), (c2_padding_height, c2_padding_width), (c2_stride_height, c2_stride_width)

In [None]:
out

In [None]:
c2_unfolded = F.unfold(out, 
		kernel_size=(c2_kernal_height, c2_kernal_width),  # (1,3)
		padding=(c2_padding_height, c2_padding_width), #(0,1)
		stride=(c2_stride_height, c2_stride_width))#(1,2)
c2_unfolded.size() , c2_unfolded

### Convolution dot product

In [None]:
# Flatten the weight into a 2-D matrix of [out_channel, in_channel*kH*kW] = [6, 18] so it matches the 18 rows of the unfolded input
conv_2_weigth = conv_stride2.view(c2_out_channel, -1) # [6,6,1,3] > [6,18]
conv_2_weigth.size(), conv_2_weigth

In [None]:
# [6, 18] matrix product with [2, 18, 4] results in [2, 6, 4]
# The weight is broadcast across the 2 batches (shared weight)
out = conv_2_weigth @ c2_unfolded
out.size(), out

In [None]:
# insert the height dimension to get [B,C,1,T/2], since we took a stride of 2
out = out.view(c2_batch,c2_out_channel, c2_height_out, c2_width_out)
out.size(), out

### Batch Norm #2 

In [None]:
bn_b = nn.BatchNorm2d(n_embd)   
bn_b.weight, bn_b.bias

In [None]:
out = bn_b(out)
out.size(), out

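## Residual connection

We now bring the original input `x` back in. Because `out` was downsampled, we first need to convolve `x` so that it matches the dimensions of `out`.

We do this with a convolution that uses a 1x1 kernel and a stride of 2.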

In [None]:
res_in_channel = n_embd
res_out_channel = n_embd
res_in_channel, res_out_channel

In [None]:
res_kernal_height = 1
res_kernal_width = 1
res_stride_height = 1
res_stride_width = 2
res_padding_height = 0
res_padding_width = 0
{'kernal': (res_kernal_height, res_kernal_width),
 'stride': (res_stride_height, res_stride_width),
 'padding': (res_padding_height, res_padding_width)}


In [None]:
## weight layer for convolution (similar to linear, just more explicit)
res_conv_1x1 = nn.Parameter(
    torch.empty(res_out_channel, res_in_channel, res_kernal_height, res_kernal_width), 
    requires_grad=True)

In [None]:
# initialize every weight as 0.05 for easier view of the weight impact
with torch.no_grad():
    res_pattern = torch.tensor([0.05]).view(1,1,1,res_kernal_width).expand(res_conv_1x1.size()).clone()
    res_conv_1x1.copy_(res_pattern)
res_conv_1x1.size(), res_conv_1x1

### Residual connection with 1x1 convolution

In [None]:
res_batch, res_channel, res_height, res_width = x.size()
res_batch, res_channel, res_height, res_width

In [None]:
res_c_khw = res_channel*res_kernal_height*res_kernal_width

res_height_out = (res_height + 2*res_padding_height - 1*(res_kernal_height-1) - 1)//res_stride_height + 1   # = 1
res_width_out = (res_width + 2*res_padding_width - 1*(res_kernal_width-1) - 1)//res_stride_width + 1   # = 4
res_L = res_height_out * res_width_out

print(f'Second Conv: width out {c2_width_out}, height out {c2_height_out}, final dimension ({c2_batch},{c2_c_khw},{c2_L})')
print(f'This   Conv: width out {res_width_out}, height out {res_height_out}, final dimension ({res_batch},{res_c_khw},{res_L})')

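**The unfolded patch dimension is one third the size (6 instead of 18)** because the kernel is 1x1 rather than 1x3. We still need to add the two results together, and that works out: multiplying by the `[6,6]` weight gives `[2,6,4]`, which we then reshape to the same shape as `out`.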

In [None]:
x_unfolded = F.unfold(x, 
		kernel_size=(res_kernal_height, res_kernal_width),  # (1,1)
		padding=(res_padding_height, res_padding_width), #(0,0)
		stride=(res_stride_height, res_stride_width))#(1,2)
x_unfolded.size() , x_unfolded

In [None]:
# Flatten the weight into a 2-D matrix of [out_channel, in_channel*kH*kW] = [6, 6] so it matches the 6 rows of the unfolded input
res_weigth = res_conv_1x1.view(res_out_channel, -1) # [6,6,1,1] > [6,6]
res_weigth.size(), res_weigth

In [None]:
# The weight is broadcast across the 2 batches (shared weight), so the result is [2, 6, 4]
identity = res_weigth @ x_unfolded
identity.size(), identity

In [None]:
# insert the height dimension to get [B,C,1,T/2], since we took a stride of 2
identity = identity.view(res_batch,res_out_channel, res_height_out, res_width_out)
identity.size(), identity
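
Another way to see what a 1x1 convolution with a stride of 2 does: it keeps every other token and mixes the channels at each kept position. The sketch below checks this on random toy tensors; `toy_x`, `toy_w`, and the `einsum` formulation are illustrative assumptions rather than the notebook's method.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
toy_x = torch.rand(2, 6, 1, 8)                  # [B, C_in, 1, T]
toy_w = torch.rand(6, 6, 1, 1)                  # [C_out, C_in, 1, 1]

# built-in 1x1 convolution with stride 2 along the token axis
via_conv2d = F.conv2d(toy_x, toy_w, stride=(1, 2))              # [2, 6, 1, 4]

# same thing by hand: keep tokens 0, 2, 4, 6, then mix channels
kept = toy_x[:, :, :, ::2]                                       # [2, 6, 1, 4]
via_slice = torch.einsum('oc,bcht->boht', toy_w.view(6, 6), kept)

print(via_slice.shape, torch.allclose(via_conv2d, via_slice, atol=1e-5))
```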

## Residual connection sum
We now have the same size for our identity connection and our output, so we can add them element-wise.

In [None]:
out.size(), identity.size()

In [None]:
x = out + identity
x.size(), x
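
For comparison, the whole block walked through above can be written compactly with PyTorch's built-in layers. This is a hypothetical sketch: the class name `ToyResidualBlock` and its attribute names are assumptions, and the weights here are randomly initialized rather than set to the fixed patterns used in the walkthrough.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyResidualBlock(nn.Module):
    """Hypothetical block: conv-BN-ReLU-conv-BN plus a 1x1 strided shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=(1, 3),
                               stride=(1, 1), padding=(0, 1), bias=False)
        self.bn_a = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=(1, 3),
                               stride=(1, 2), padding=(0, 1), bias=False)
        self.bn_b = nn.BatchNorm2d(channels)
        # the 1x1 convolution downsamples the input so it can be added to `out`
        self.shortcut = nn.Conv2d(channels, channels, kernel_size=(1, 1),
                                  stride=(1, 2), bias=False)

    def forward(self, x):
        out = F.relu(self.bn_a(self.conv1(x)))   # [B, C, 1, T]
        out = self.bn_b(self.conv2(out))         # [B, C, 1, T/2]
        identity = self.shortcut(x)              # [B, C, 1, T/2]
        return out + identity

block = ToyResidualBlock(6)
print(block(torch.rand(2, 6, 1, 8)).shape)  # torch.Size([2, 6, 1, 4])
```

Many ResNet implementations also apply a ReLU after the sum; it is left out here to mirror the cells above.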

### Adaptive Average Pooling
`nn.AdaptiveAvgPool2d` applies 2D adaptive average pooling over an input composed of several input planes.

Since we're treating the whole sequence as 1 example rather than 8 separate tokens, we now need to collapse the token dimension into a single summary value per channel. This means reducing from `[2,6,1,4]` to `[2,6,1,1]`.

With an output size of `(1, 1)`, it is equivalent to `x.mean(dim=(2,3), keepdim=True)`.

In [None]:
gap2d = nn.AdaptiveAvgPool2d((1, 1))

In [None]:
x = gap2d(x)
x.size(), x
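
The equivalence between adaptive average pooling to `(1, 1)` and a plain mean can be checked directly. A minimal sketch on a random toy tensor (the names `toy`, `pooled`, and `by_mean` are illustrative):

```python
import torch
import torch.nn as nn

toy = torch.rand(2, 6, 1, 4)                        # [B, C, 1, T/2]
pooled = nn.AdaptiveAvgPool2d((1, 1))(toy)          # [2, 6, 1, 1]
by_mean = toy.mean(dim=(2, 3), keepdim=True)        # [2, 6, 1, 1]
print(pooled.shape, torch.allclose(pooled, by_mean))
```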

### Remove the extra dimension we added

In [None]:
x = x.squeeze(2)
x.size(), x

### Flip our channel and context dimensions back

In [None]:
x = x.permute(0,2,1)
x.size(), x

## Final Linear Projection

`self.head = nn.Linear(n_embd, vocab_size, bias=False)`

In [None]:
lm_head = nn.Linear(n_embd, vocab_size, bias=False)
torch.nn.init.constant_(lm_head.weight,0.01)
lm_head.weight.size(), lm_head.weight

### Logits

In [None]:
logits = lm_head(x)

logits.shape, logits
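
Because the head has no bias, the logits are just `x @ W^T`. And since every weight was set to the constant 0.01, each vocabulary entry receives the same score, namely 0.01 times the sum of the features. The sketch below illustrates both points with assumed toy sizes (`n_embd_toy` and `vocab_toy` are made up for the example).

```python
import torch
import torch.nn as nn

n_embd_toy, vocab_toy = 6, 5                      # assumed sizes for illustration
head = nn.Linear(n_embd_toy, vocab_toy, bias=False)
nn.init.constant_(head.weight, 0.01)

toy = torch.rand(2, 1, n_embd_toy)                # [B, 1, n_embd]
toy_logits = head(toy)                            # [B, 1, vocab]

# without a bias, the layer is just x @ W^T
print(torch.allclose(toy_logits, toy @ head.weight.T))

# every vocabulary entry gets the same score: 0.01 * sum of the features
expected = 0.01 * toy.sum(-1, keepdim=True).expand(-1, -1, vocab_toy)
print(torch.allclose(toy_logits, expected, atol=1e-6))
```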

### Loss

In [None]:
y_flat = y.view(-1)
y_flat.shape, y_flat

In [None]:
logits_flat = logits.view(-1, logits.size(-1))
logits_flat.shape, logits_flat

In [None]:
loss = F.cross_entropy(logits_flat, y_flat)
loss.shape, loss
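
Cross-entropy is the mean negative log-softmax probability at each target index. When all logits in a row are identical, as they are with the constant head weights used here, the softmax is uniform and the loss equals the natural log of the vocabulary size, whatever the targets are. A minimal sketch with an assumed toy vocabulary of 5 and made-up targets:

```python
import math
import torch
import torch.nn.functional as F

vocab_toy = 5                                       # assumed vocabulary size
toy_logits = torch.full((2, vocab_toy), 0.03)       # identical score for every class
toy_targets = torch.tensor([1, 4])                  # made-up target indices

toy_loss = F.cross_entropy(toy_logits, toy_targets)

# cross-entropy by hand: mean negative log-softmax at the target index
log_probs = F.log_softmax(toy_logits, dim=-1)
manual = -log_probs[torch.arange(2), toy_targets].mean()

# all three numbers should agree up to rounding
print(toy_loss.item(), manual.item(), math.log(vocab_toy))
```

This gives a useful baseline: an untrained model that spreads probability evenly should start near `log(vocab_size)`, and training should push the loss below it.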