### Review Shape

![shape](https://cdn-images-1.medium.com/max/2000/1*_D5ZvufDS38WkhK9rK32hQ.jpeg)

### 3D tensors are common in machine learning (ML) applications, especially for sequences, batches, or multi-channel data
- General shape: [batch_size, seq_len, feature_dim]
- This is the layout used in all of the scenarios below; only the names of the axes change.

### Common Scenarios
- NLP (Natural Language Processing)
    - Shape: [batch_size, seq_len, embedding_dim]
    - Example: Word embeddings for a batch of sentences.
        - batch_size: number of sentences
        - seq_len: number of words per sentence
        - embedding_dim: size of word embeddings (e.g., 300 or 768)

- Time Series / Sequence Models
    - Shape: [batch_size, time_steps, features]
    - Example: Predicting stock prices from multivariate time series.
        - Each sample has time steps (e.g., 30 days)
        - Each time step has features (e.g., open, high, low, volume)

- Audio Processing
    - Shape: [batch_size, num_frames, feature_dim]
    - Example: Spectrograms or MFCC features.
        - Each audio clip is broken into frames (e.g., 100 ms)
        - Each frame has a feature vector

- Video Data
    - Shape: [batch_size, num_frames, flattened_image_features]
    - Example: Extracting features per frame using a CNN, then feeding them to an RNN for action recognition.

- Transformer Models (Self-Attention)
    - Shape: [batch_size, seq_len, model_dim]
    - Used throughout attention mechanisms to compute queries, keys, and values.
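
As a quick illustration of the `[batch_size, seq_len, feature_dim]` layout, the sketch below builds a random tensor and indexes into it. The sizes (4, 10, 8) are arbitrary values chosen for this example and do not come from any particular model.

```python
import torch

# Hypothetical sizes, chosen only for illustration
batch_size, seq_len, feature_dim = 4, 10, 8

x = torch.randn(batch_size, seq_len, feature_dim)

print(x.shape)        # torch.Size([4, 10, 8])
print(x[0].shape)     # one sequence: torch.Size([10, 8])
print(x[0, 0].shape)  # one position's feature vector: torch.Size([8])
```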

### Python list comprehension

In [32]:
nums = [1, 2, 3, 4, 5]  # avoid naming a variable `list`: it shadows the built-in
[i**2 for i in nums]

[1, 4, 9, 16, 25]
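
A comprehension can also filter items with a trailing `if`. The sketch below is one possible illustration; the variable names are made up for this example. It also shows the equivalent explicit loop for comparison.

```python
nums = [1, 2, 3, 4, 5]

# Keep only the even numbers, then square them
even_squares = [n**2 for n in nums if n % 2 == 0]
print(even_squares)  # [4, 16]

# The same result with an explicit loop
result = []
for n in nums:
    if n % 2 == 0:
        result.append(n**2)
print(result)  # [4, 16]
```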

### anonymous function (lambda)
- A lambda is a short way to write a small function without giving it a name.
- The lambda in the cell below is equivalent to this named function:

`def square(x):`
    `return x ** 2`

In [33]:
square = lambda x: x ** 2
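
Lambdas are most useful as short arguments to other functions. The following is a small sketch using only built-ins; the `words` list is invented for the example.

```python
square = lambda x: x ** 2
print(square(4))  # 16

# Lambdas are handy as short, throwaway arguments to other functions
words = ["tensor", "cat", "stack"]
by_length = sorted(words, key=lambda w: len(w))
print(by_length)  # ['cat', 'stack', 'tensor']

squares = list(map(lambda x: x ** 2, [1, 2, 3]))
print(squares)  # [1, 4, 9]
```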


### torch.randint: generate random integers drawn from a specified range
`torch.randint(low=0, high, size, dtype=None, layout=torch.strided, device=None, requires_grad=False)`
- `low` is inclusive (default 0) and `high` is exclusive.

In [34]:
# Create a 2x3 tensor of random integers from 0 to 10 (excluding 10)
import torch
x = torch.randint(0, 10, (2, 3))
x

tensor([[7, 0, 7],
        [0, 8, 8]])

In [35]:
x = torch.randint(10, (2,))
x

tensor([0, 5])
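
The outputs above change on every run. If repeatable numbers are needed, one option is to fix the seed with `torch.manual_seed` first, as in this sketch (the seed value 0 is arbitrary):

```python
import torch

torch.manual_seed(0)  # fix the seed so the draw is repeatable
x1 = torch.randint(0, 10, (2, 3))

torch.manual_seed(0)  # same seed, same numbers
x2 = torch.randint(0, 10, (2, 3))

print(torch.equal(x1, x2))  # True

# low is inclusive, high is exclusive
print((x1 >= 0).all().item(), (x1 < 10).all().item())  # True True
```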

### swap two dimensions of a tensor

`torch.transpose(input, dim0, dim1)`
- `input`: The input tensor.
- `dim0, dim1`: The two dimensions to swap.
- For a 2D tensor, `dim0 = 0` (rows) and `dim1 = 1` (columns) swaps rows and columns, which is the matrix transpose from linear algebra.

![transpose](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*HRWWBxD3H0rkO4r5J64dVg.png)

In [36]:
import torch

a = torch.tensor([[1, 2],
                  [3, 4],
                  [5, 6]])  # Shape: (3, 2)

b = torch.transpose(a, 0, 1)
print(b)
# Output:
# tensor([[1, 3, 5],
#         [2, 4, 6]])
print(b.shape)  # torch.Size([2, 3])


tensor([[1, 3, 5],
        [2, 4, 6]])
torch.Size([2, 3])


In [37]:
a = torch.arange(0, 24).reshape(2, 3, 4)
print(a)


tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])


In [38]:
torch.transpose(a, 1,2) # shape (2, 4, 3)

tensor([[[ 0,  4,  8],
         [ 1,  5,  9],
         [ 2,  6, 10],
         [ 3,  7, 11]],

        [[12, 16, 20],
         [13, 17, 21],
         [14, 18, 22],
         [15, 19, 23]]])

In [39]:
# equivalent to torch.transpose(a, 1, 2)
torch.transpose(a, -1, -2)  # shape (2, 4, 3)
# -1 refers to the last dimension (dim 2, size 4),
# -2 refers to the second-to-last dimension (dim 1, size 3).

tensor([[[ 0,  4,  8],
         [ 1,  5,  9],
         [ 2,  6, 10],
         [ 3,  7, 11]],

        [[12, 16, 20],
         [13, 17, 21],
         [14, 18, 22],
         [15, 19, 23]]])
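
A common use of transposing the last two dimensions is a batched matrix product between two `(B, T, C)` tensors. The sketch below assumes hypothetical tensors named `q` and `k` with arbitrary sizes; it only demonstrates the shapes involved.

```python
import torch

B, T, C = 2, 3, 4  # hypothetical batch, sequence, and feature sizes
q = torch.randn(B, T, C)
k = torch.randn(B, T, C)

# Swap the last two dims of k: (B, T, C) -> (B, C, T)
k_t = k.transpose(-2, -1)

# Batched matrix multiply: (B, T, C) @ (B, C, T) -> (B, T, T)
scores = q @ k_t
print(scores.shape)  # torch.Size([2, 3, 3])
```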

### shape 

In [40]:
a = torch.arange(0, 24).reshape(2, 3, 4)
a.shape[-1]

4

In [41]:
B, T, C = a.shape # unpack
print(B, T, C)

2 3 4


In [42]:
a.view(-1, C)

tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]])

In [43]:
a.view(-1, )

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23])
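
One caveat when combining `view` with `torch.transpose`: a transposed tensor shares memory with the original and is usually not contiguous, so `view` can fail on it. The sketch below shows the failure and two common workarounds, using a small tensor made up for the example.

```python
import torch

a = torch.arange(0, 6).reshape(3, 2)
b = torch.transpose(a, 0, 1)  # shape (2, 3), shares storage with a

print(b.is_contiguous())  # False

# view() needs a compatible memory layout, so it fails on the transposed tensor
view_failed = False
try:
    b.view(-1)
except RuntimeError:
    view_failed = True
print(view_failed)  # True

# Either copy into contiguous memory first, or let reshape() copy when needed
flat1 = b.contiguous().view(-1)
flat2 = b.reshape(-1)
print(flat1)  # tensor([0, 2, 4, 1, 3, 5])
```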

### torch.stack (add a new dimension)
Often used to combine training samples into a batch.

- torch.stack — Stack along a new dimension
- Think: Create a new axis (like stacking flat sheets into a pile).
- Requirement: Tensors must have exactly the same shape.


In [None]:
a = torch.tensor([[1, 2, 3],
                 [4, 5, 6]]) # shape (2,3)
b = torch.tensor([[7, 8, 9],
                 [10, 11, 12]])

batch=torch.stack([a, b], dim = 0) # dim=0 is default. shape (2,2,3)
batch

tensor([[[ 1,  2,  3],
         [ 4,  5,  6]],

        [[ 7,  8,  9],
         [10, 11, 12]]])

### cat and stack
- torch.cat — Concatenate along an existing dimension
- Think: Extend an axis (like adding more rows to a table).
- Requirement: Tensors must have the same shape in all dimensions except the one you're concatenating along.

![Cat](https://user-images.githubusercontent.com/111734605/235976058-d23f9b75-401c-4547-9e17-6655f3baf957.png)

In [74]:
a = torch.tensor([[1, 2, 3],
                 [4, 5, 6]])
b = torch.tensor([[7, 8, 9],
                 [10, 11, 12]])

torch.cat([a, b], dim = 0)

tensor([[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9],
        [10, 11, 12]])

In [75]:
torch.cat([a, b], dim = 1)

tensor([[ 1,  2,  3,  7,  8,  9],
        [ 4,  5,  6, 10, 11, 12]])
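
To make the difference concrete, the sketch below applies both functions to the same two tensors and compares the resulting shapes. It also shows one way to think of `stack`: add a size-1 axis to each tensor, then concatenate.

```python
import torch

a = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])      # shape (2, 3)
b = torch.tensor([[7, 8, 9],
                  [10, 11, 12]])   # shape (2, 3)

stacked = torch.stack([a, b], dim=0)  # new leading axis
joined = torch.cat([a, b], dim=0)     # existing axis 0 grows

print(stacked.shape)  # torch.Size([2, 2, 3])
print(joined.shape)   # torch.Size([4, 3])

# stack is equivalent to adding a size-1 axis to each tensor, then concatenating
same = torch.cat([a.unsqueeze(0), b.unsqueeze(0)], dim=0)
print(torch.equal(stacked, same))  # True
```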

### torch zeros and ones

In [46]:
torch.zeros(2)

tensor([0., 0.])

In [47]:
torch.zeros(2,2)

tensor([[0., 0.],
        [0., 0.]])

In [48]:
torch.ones(2,2)

tensor([[1., 1.],
        [1., 1.]])

### F.softmax

In [49]:
# A 1D tensor of logits
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])
probs = F.softmax(logits, dim=0)
print(probs)


tensor([0.6590, 0.2424, 0.0986])


In PyTorch, pass `dim` to tell softmax which axis to normalize over (omitting it is deprecated and triggers a warning). This tensor has only one axis (axis 0), so use `dim=0`; `dim=-1` is equivalent here.

In [50]:
# A batch of 2 samples, each with 3 class scores (logits)
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [1.0, 3.0, 0.2]])

# Apply softmax along dim=1 (columns — the class dimension)
probs = F.softmax(logits, dim=1)

print(probs)

tensor([[0.6590, 0.2424, 0.0986],
        [0.1131, 0.8360, 0.0508]])


In [51]:
F.softmax(logits, dim=-1) # same as probs = F.softmax(logits, dim=1)

tensor([[0.6590, 0.2424, 0.0986],
        [0.1131, 0.8360, 0.0508]])

- `dim=-1` applies softmax over the last dimension of the tensor, no matter how many dimensions it has.
- This is a convenient, general way to normalize over the class (or feature) axis, especially for batched inputs.
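
To see where the numbers come from, here is a plain-Python sketch of the softmax formula, exp(x_i) / sum_j exp(x_j). It uses a hand-written helper rather than `F.softmax`; the helper name is made up for this example.

```python
import math

def softmax(logits):
    """Plain-Python softmax over a list of numbers (illustrative helper)."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 4) for p in probs])  # [0.659, 0.2424, 0.0986]
```

The values match the `F.softmax` output for the same logits, and they sum to 1 up to floating-point rounding.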

### Masks
- `torch.tril(input, diagonal=0)`
- It returns the lower triangular part of a matrix (or batch of matrices), setting elements above the specified diagonal to zero.

In [52]:
import torch

a = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

lower = torch.tril(a)
print(lower)


tensor([[1, 0, 0],
        [4, 5, 0],
        [7, 8, 9]])


In [53]:
torch.tril(torch.ones(3,3))

tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
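
The `diagonal` argument shifts which diagonal is kept, and `torch.triu` is the upper-triangular counterpart. A short sketch on the same 3x3 matrix:

```python
import torch

a = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

# diagonal=-1 also zeroes the main diagonal
strict_lower = torch.tril(a, diagonal=-1)
print(strict_lower)
# tensor([[0, 0, 0],
#         [4, 0, 0],
#         [7, 8, 0]])

# triu keeps the upper triangle instead
upper = torch.triu(a)
print(upper)
# tensor([[1, 2, 3],
#         [0, 5, 6],
#         [0, 0, 9]])
```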

### A tensor that is part of the model and used for computation, but not learned: `register_buffer('tril', tensor)`
- `register_buffer` is a PyTorch method used inside an `nn.Module` subclass to store a tensor as part of the model without treating it as a learnable parameter (i.e., it is not updated by `optimizer.step()`).
- A buffer:
    - Is not a parameter (won't be updated during training)
    - Is needed for the forward pass or internal logic
    - Stays with the model: it moves to the GPU with it and is saved/loaded with the weights
- Why is it needed?
    - Constant tensors such as masks, position encodings, or identity matrices are used in the forward pass but should not be updated during training. You still want them to:
        - Move with the model (`.cuda()` or `.to(device)`)
        - Save/load with the model (`.state_dict()`)

In [63]:
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self, n):
        super().__init__()
        tril = torch.tril(torch.ones(n, n))
        self.register_buffer('tril', tril)  # Register buffer

    def forward(self, x):
        # Use the registered lower-triangular mask
        return x * self.tril

model = MyModel(4)
print(model.tril)  # Access the registered buffer


tensor([[1., 0., 0., 0.],
        [1., 1., 0., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.]])
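
The sketch below checks these properties on a small, hypothetical module (`MaskHolder` and the buffer name `scratch` are made up for illustration). It also uses the `persistent=False` option, which keeps a buffer on the module but leaves it out of `state_dict()`.

```python
import torch
import torch.nn as nn

class MaskHolder(nn.Module):  # hypothetical module, for illustration only
    def __init__(self, n):
        super().__init__()
        self.register_buffer('tril', torch.tril(torch.ones(n, n)))
        # persistent=False keeps the buffer on the module but out of state_dict
        self.register_buffer('scratch', torch.zeros(n), persistent=False)

m = MaskHolder(3)

num_params = len(list(m.parameters()))
buffer_names = [name for name, _ in m.named_buffers()]
saved_keys = list(m.state_dict().keys())

print(num_params)    # 0: buffers are not parameters
print(buffer_names)  # ['tril', 'scratch']
print(saved_keys)    # ['tril']
```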


### nn.Embedding: a lookup table that maps integer indices to dense vectors 
- it's commonly used to convert tokens (like words or characters) into embeddings for use in neural networks.
- `nn.Embedding(num_embeddings, embedding_dim)`
    - num_embeddings: total number of unique tokens (e.g., vocabulary size)
    - embedding_dim: size of each embedding vector (e.g., 100-dimensional vector)

In [78]:
import torch
import torch.nn as nn

# Define embedding: vocab size = 10, embedding size = 4
embedding = nn.Embedding(10, 4)

# Input: token indices (batch of 2 sequences with 3 tokens each)
input = torch.tensor([[1, 2, 4], [4, 3, 2]]) # shape (2, 3)

# Output: shape (2, 3, 4) → each token index becomes a vector with 4 features
output = embedding(input) # shape (2, 3, 4)
print(output)


tensor([[[-1.0356, -0.4012,  0.2564,  1.4654],
         [-0.3390, -0.5554,  1.3577,  0.2175],
         [-0.9895, -0.1303, -0.0827, -0.8324]],

        [[-0.9895, -0.1303, -0.0827, -0.8324],
         [-0.6834,  0.9849,  0.7157,  1.7701],
         [-0.3390, -0.5554,  1.3577,  0.2175]]], grad_fn=<EmbeddingBackward0>)
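
Note that token 4 appears twice in the input and produces the same vector both times. The lookup amounts to indexing rows of the layer's weight matrix, which the following sketch checks (the actual values are random and differ per run):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)
idx = torch.tensor([[1, 2, 4], [4, 3, 2]])  # shape (2, 3)

out = embedding(idx)
print(out.shape)  # torch.Size([2, 3, 4])

# The lookup is the same as indexing rows of the weight matrix
print(embedding.weight.shape)                    # torch.Size([10, 4])
print(torch.equal(out, embedding.weight[idx]))   # True

# The same index always maps to the same vector
print(torch.equal(out[0, 2], out[1, 0]))  # True (both are token 4)
```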


### torch.multinomial: sample indices based on probabilities

- `torch.multinomial(input, num_samples, replacement=False)`
- input: A 1D or 2D tensor of non-negative values (probabilities or weights; each row needs a non-zero sum but does not have to sum to 1)
- num_samples: How many samples to draw (per row for a 2D input)
- replacement: Whether to sample with or without replacement


In [72]:
weights = torch.tensor([0.1, 0.3, 0.6])
sample = torch.multinomial(weights, num_samples=1)
print(sample)


tensor([2])


In [76]:
weights = torch.tensor([
    [0.1, 0.9],
    [0.8, 0.2]
])
samples = torch.multinomial(weights, 1)
print(samples) # shape (2, 1)


tensor([[0],
        [0]])
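
A single draw does not say much about the distribution. One way to check that indices are drawn in proportion to the weights is to sample many times with replacement and count, as in this sketch (the seed and sample count are arbitrary choices):

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable
weights = torch.tensor([0.1, 0.3, 0.6])

n = 10_000
samples = torch.multinomial(weights, num_samples=n, replacement=True)

# Fraction of draws that landed on each index
freq = torch.bincount(samples, minlength=3).float() / n
print(freq)  # close to tensor([0.1, 0.3, 0.6]); exact values depend on the seed
```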


### computing averages across rows (implementation 1)

In [177]:
T = 3
tril = torch.tril(torch.ones(T, T))             # lower-triangular matrix of ones
tril = tril / torch.sum(tril, 1, keepdim=True)  # normalize each row so it sums to 1
a = torch.randint(0, 10, (T, T)).float()
tril

tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])

In [180]:
a

tensor([[9., 5., 2.],
        [4., 3., 8.],
        [8., 9., 8.]])

In [181]:
tril@a

tensor([[9.0000, 5.0000, 2.0000],
        [6.5000, 4.0000, 5.0000],
        [7.0000, 5.6667, 6.0000]])
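
Row `t` of the result is the average of rows `0..t` of `a`. The sketch below reproduces the same numbers with a plain-Python loop instead of a matrix multiply, using the values of `a` printed above:

```python
a = [[9.0, 5.0, 2.0],
     [4.0, 3.0, 8.0],
     [8.0, 9.0, 8.0]]

T = len(a)
running_avg = []
for t in range(T):
    rows = a[: t + 1]  # rows 0..t
    # column-wise mean over the rows seen so far
    running_avg.append([sum(col) / len(rows) for col in zip(*rows)])

for row in running_avg:
    print([round(v, 4) for v in row])
# [9.0, 5.0, 2.0]
# [6.5, 4.0, 5.0]
# [7.0, 5.6667, 6.0]
```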

### computing average attention across rows (implementation 2)

In [173]:
# tensor.masked_fill(mask, value)
# sets elements of a tensor to a given value where the mask is True.

import torch
import torch.nn.functional as F

T = 3
# self-attention weights (raw logits)
w = torch.randint(0, 10, (T, T)).float()  # (T, T)
print("attention logits\n", w)

# mask so that self-attention cannot look at later characters
tril = torch.tril(torch.ones(T, T))
# print(tril == 0)  # equivalent to tril[:T, :T] == 0, shape (T, T)

# Set positions where tril == 0 (future characters) to -inf,
# so each position attends only to itself and earlier characters
w = w.masked_fill(tril == 0, float('-inf'))
print("masked attention\n", w)
print("masked attention after softmax\n", F.softmax(w, dim=-1))


attention logits
 tensor([[5., 0., 9.],
        [4., 1., 7.],
        [4., 3., 6.]])
masked attention
 tensor([[5., -inf, -inf],
        [4., 1., -inf],
        [4., 3., 6.]])
masked attention after softmax
 tensor([[1.0000, 0.0000, 0.0000],
        [0.9526, 0.0474, 0.0000],
        [0.1142, 0.0420, 0.8438]])
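
The link to implementation 1 becomes visible when all logits are equal. With all-zero logits, masking followed by softmax gives exactly the row-normalized lower-triangular matrix, i.e. a uniform average over the current and earlier positions. A minimal sketch of that check:

```python
import torch
import torch.nn.functional as F

T = 3
tril = torch.tril(torch.ones(T, T))

# Implementation 1: normalize the rows of the lower-triangular matrix
avg1 = tril / tril.sum(1, keepdim=True)

# Implementation 2 with all-zero logits: mask the future, then softmax
w = torch.zeros(T, T).masked_fill(tril == 0, float('-inf'))
avg2 = F.softmax(w, dim=-1)

print(avg2)
# tensor([[1.0000, 0.0000, 0.0000],
#         [0.5000, 0.5000, 0.0000],
#         [0.3333, 0.3333, 0.3333]])
print(torch.allclose(avg1, avg2))  # True
```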


### LayerNorm: normalize all features of each input (e.g., token)
- LayerNorm re-centers and rescales each token's representation so that it is easier for the next layer to work with, regardless of what happened in earlier layers.
- It centers (zero mean) and scales (unit variance) each token's features, so the activations passed to the next layer stay in a stable range.

In [2]:
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
layer_norm = nn.LayerNorm(x.shape[1])  # Normalize across last dimension
output = layer_norm(x)
print(output)


tensor([[-1.2247,  0.0000,  1.2247],
        [-1.2247,  0.0000,  1.2247]], grad_fn=<NativeLayerNormBackward0>)


In [3]:
output.mean(dim=1)

tensor([0., 0.], grad_fn=<MeanBackward1>)

In [4]:
output.std(dim=1)

tensor([1.2247, 1.2247], grad_fn=<StdBackward0>)

In [8]:
output[0,:].std()

tensor(1.2247, grad_fn=<StdBackward0>)
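
The standard deviation above prints as 1.2247 rather than 1 because `Tensor.std` uses the unbiased estimator (dividing by N - 1) by default, while LayerNorm normalizes with the biased variance (dividing by N). The sketch below recomputes the standard deviation with `unbiased=False` and reproduces the normalization by hand, assuming the default `eps=1e-5` and the initial affine parameters (weight 1, bias 0).

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
layer_norm = nn.LayerNorm(x.shape[1])
output = layer_norm(x)

# Population standard deviation (divide by N): approximately 1
pop_std = output.std(dim=1, unbiased=False)
print(pop_std)

# Manual version of the same computation, assuming the default eps=1e-5
# and the initial affine parameters (weight=1, bias=0)
mean = x.mean(dim=1, keepdim=True)
var = x.var(dim=1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + 1e-5)
print(torch.allclose(output, manual, atol=1e-6))  # True
```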