<a href="https://colab.research.google.com/github/Yashcode007/pytorch/blob/main/pytorch_gpt_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import torch
print(torch.__version__)
print("GPU Available:", torch.cuda.is_available())

2.6.0+cu124
GPU Available: True


Tensors are multi-dimensional arrays (like NumPy arrays) that can also run on a GPU and support automatic differentiation (autograd).
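
A minimal sketch of what autograd support means in practice. The function y = x^2 and the value x = 3 are illustrative choices, not something from the original notes.

```python
import torch

# requires_grad=True asks PyTorch to record operations on this tensor.
x = torch.tensor(3.0, requires_grad=True)

# y = x^2, so the derivative dy/dx = 2x, which is 6 at x = 3.
y = x ** 2

# backward() computes dy/dx and stores it in x.grad.
y.backward()

print("y:", y.item())
print("dy/dx:", x.grad.item())
```

The same mechanism, applied to millions of weights at once, is what lets a model compute how to adjust every parameter during training.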

In [None]:
scalar = torch.tensor(5)
print(scalar)

tensor(5)


In [None]:
#1 dimension
vector = torch.tensor([1,2,3])
print(vector)

tensor([1, 2, 3])


In [None]:
vector.shape

torch.Size([3])

In [None]:
#2 Dimension
matrix = torch.tensor([[1,2],[3,4]])
print(matrix)

tensor([[1, 2],
        [3, 4]])


In [None]:
matrix.shape

torch.Size([2, 2])

In [None]:
#3 Dimension
tensor3d = torch.rand(2,3,4)  #2 blocks of 3x4 matrix
print(tensor3d.shape)

torch.Size([2, 3, 4])


In [None]:
tensor3d

tensor([[[0.9838, 0.0349, 0.9940, 0.7765],
         [0.3012, 0.2138, 0.3181, 0.8706],
         [0.9217, 0.8107, 0.6103, 0.5807]],

        [[0.4957, 0.0520, 0.1324, 0.3200],
         [0.0212, 0.2656, 0.1971, 0.0360],
         [0.2589, 0.9403, 0.6068, 0.3493]]])

In [None]:
row_vector = vector.view(1,3)
col_vector = vector.view(3,1)

In [None]:
row_vector.shape

torch.Size([1, 3])

In [None]:
col_vector.shape

torch.Size([3, 1])

In [None]:
print("vector.ndim:", vector.ndim)
print("row_vector.ndim:", row_vector.ndim)
print("col_vector.ndim:", col_vector.ndim)

vector.ndim: 1
row_vector.ndim: 2
col_vector.ndim: 2


What do LLMs actually take as input?
Most LLMs (like GPT, BERT, LLaMA) receive token IDs of shape [batch_size, sequence_length]. The embedding layer then turns each ID into a vector, so the tensor that flows through the rest of the model is shaped like this:
Shape ->
[batch_size, sequence_length, embedding_dim]

batch_size -> how many sequences (sentences) you feed in at once
sequence_length -> how many tokens per sequence
embedding_dim -> how big each token vector is (e.g. 256, 768, 1024)

In [None]:
llm_input = torch.rand(2,5,4) # 2 sentences , 5 tokens each , embedding_size 4
print("LLM Input Tensor:\n", llm_input)
print("Shape:", llm_input.shape)

LLM Input Tensor:
 tensor([[[0.2859, 0.2708, 0.8687, 0.6443],
         [0.0397, 0.9256, 0.3220, 0.3820],
         [0.9727, 0.3922, 0.5970, 0.2150],
         [0.2278, 0.2654, 0.5593, 0.4859],
         [0.0033, 0.2276, 0.3299, 0.0431]],

        [[0.6066, 0.4954, 0.3526, 0.4957],
         [0.2711, 0.8903, 0.8998, 0.2803],
         [0.8758, 0.5222, 0.0750, 0.3358],
         [0.0547, 0.0412, 0.3838, 0.1640],
         [0.5227, 0.8826, 0.0132, 0.3425]]])
Shape: torch.Size([2, 5, 4])
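
In a real model the [batch_size, sequence_length, embedding_dim] tensor is not random: it comes from looking up token IDs in an embedding table. A small sketch of that step, assuming a toy vocabulary of 10 tokens and made-up token IDs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes: 10 tokens in the vocabulary, 4 numbers per token.
vocab_size, embedding_dim = 10, 4
embedding = nn.Embedding(vocab_size, embedding_dim)

# 2 sentences, 5 token IDs each (made-up IDs, all below vocab_size).
token_ids = torch.tensor([[1, 4, 2, 7, 0],
                          [3, 3, 9, 5, 0]])

# Each ID is replaced by its learned vector.
embedded = embedding(token_ids)

print("token_ids shape:", token_ids.shape)   # [batch_size, sequence_length]
print("embedded shape:", embedded.shape)     # [batch_size, sequence_length, embedding_dim]
```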


What if the sentences have different lengths or embedding dims?
Sentences of different lengths are padded: the shorter one gets padding tokens appended until both have the same length.

The embedding dim cannot differ between sentences, because it is fixed by the model (768 for BERT-base and GPT-2 small, for example).

The Hugging Face tokenizer handles padding when you pass padding=True, and return_tensors="pt" makes it return PyTorch tensors.
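
To see what padding does without a tokenizer, here is a rough sketch using PyTorch's own `pad_sequence`. The token IDs are example values, and using 0 as the padding ID is an assumption that matches BERT's vocabulary but not every model's.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two sentences as token IDs, with different lengths (5 and 4).
seq_a = torch.tensor([101, 1045, 2293, 5572, 102])
seq_b = torch.tensor([101, 19081, 2600, 102])

# Pad the shorter sequence with 0 so both fit in one [batch, seq_len] tensor.
padded = pad_sequence([seq_a, seq_b], batch_first=True, padding_value=0)

# 1 marks a real token, 0 marks padding.
attention_mask = (padded != 0).long()

print(padded)
print(attention_mask)
```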

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(["I love tea", "Transformers rock"], padding=True, return_tensors="pt")
print(tokens['input_ids'].shape)

torch.Size([2, 5])


In [None]:
tokens

{'input_ids': tensor([[  101,  1045,  2293,  5572,   102],
        [  101, 19081,  2600,   102,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0]])}
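
The attention_mask in the output above tells the model which positions are real tokens and which are padding. A small sketch of one common use, averaging token vectors while ignoring padded positions. The `hidden` tensor is a random stand-in for model outputs, not real BERT activations.

```python
import torch

# Mask copied from the tokenizer output: the second sentence has one padded slot.
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 1, 0]])

# Number of real tokens per sentence.
lengths = attention_mask.sum(dim=1)
print("lengths:", lengths)

torch.manual_seed(0)
hidden = torch.rand(2, 5, 4)  # stand-in for [batch, seq_len, embed_dim] outputs

# Expand the mask to [batch, seq_len, 1] so it broadcasts over embed_dim.
mask = attention_mask.unsqueeze(-1).float()

# Sum only the real tokens, then divide by how many there were.
mean_pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print("mean_pooled shape:", mean_pooled.shape)
```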

In [None]:
type(tokens)

Tensor Operations + GPU Usage

In [None]:
import torch

a = torch.tensor([[1.,2.],[3.,4.]])
b = torch.tensor([[10.,20.],[30.,40.]])

print(a+b)
print(a-b)
print(torch.matmul(a,b.T))

tensor([[11., 22.],
        [33., 44.]])
tensor([[ -9., -18.],
        [-27., -36.]])
tensor([[ 50., 110.],
        [110., 250.]])
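
The same operations run on a GPU once the tensors live there. A minimal sketch that falls back to the CPU when no GPU is available; the variable names `device`, `product` and `result` are illustrative.

```python
import torch

# Pick the GPU if there is one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create one tensor directly on the device, move the other with .to().
a = torch.tensor([[1., 2.], [3., 4.]], device=device)
b = torch.tensor([[10., 20.], [30., 40.]]).to(device)

# Both operands must be on the same device. @ is shorthand for torch.matmul.
product = a @ b.T
print("computed on:", product.device)

# Bring the result back to the CPU, e.g. before converting to NumPy.
result = product.cpu()
print(result)
```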


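Matrix multiplication matters because the core layers of a neural network are built on it:
1) Embedding layer (a table lookup, equivalent to multiplying a one-hot vector by the embedding matrix)
2) Linear (fully connected) layer
3) Self-attention (query, key and value projections, plus the attention scores)

Most of them do some form of:
output = input @ weight_matrix + bias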
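
A quick check of this formula against `nn.Linear`, as a sketch with arbitrary sizes (4 input features, 3 output features). Note that PyTorch stores the weight as [out_features, in_features], so the manual version multiplies by its transpose.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(in_features=4, out_features=3)
x_in = torch.rand(2, 4)  # batch of 2 inputs, 4 features each

# What the layer computes.
out_layer = layer(x_in)

# The same thing written out: input @ weight^T + bias.
out_manual = x_in @ layer.weight.T + layer.bias

print("weight shape:", layer.weight.shape)
print("outputs match:", torch.allclose(out_layer, out_manual, atol=1e-6))
```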

In [None]:
x = torch.rand(2,3)
print(x.shape)

# reshaped = x.view(3,2)
# print(reshaped.shape)

torch.Size([2, 3])


In [None]:
x

tensor([[0.6458, 0.1860, 0.9687],
        [0.1091, 0.9286, 0.6414]])

Reshaping changes how the data is arranged across dimensions, but not the data itself (the total number of elements stays the same).

We reshape because many parts of a model expect inputs in specific shapes.

For example:

| Use case                           | Shape change                                          |
| ---------------------------------- | ----------------------------------------------------- |
| Batch processing (stack sequences) | [batch, seq_len, embed_dim]                           |
| Flatten before linear layers       | [batch, channels, height, width] → [batch, features] |
| Squeeze singleton dims             | [3, 1] → [3]                                          |
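
A short sketch of these reshaping cases. The sizes (8 images of 3x4x4) are arbitrary, and the last part shows one difference between `view` and `reshape` that is easy to trip over.

```python
import torch

# Flatten before a linear layer: [batch, channels, height, width] -> [batch, features]
images = torch.rand(8, 3, 4, 4)
flat = images.flatten(start_dim=1)
print(flat.shape)  # 3 * 4 * 4 = 48 features per image

# Squeeze a singleton dim: [3, 1] -> [3], and put it back with unsqueeze.
col = torch.tensor([[1.], [2.], [3.]])
squeezed = col.squeeze(1)
restored = squeezed.unsqueeze(1)
print(squeezed.shape, restored.shape)

# view() needs contiguous memory; a transposed tensor is not contiguous,
# so reshape() (which copies when it has to) is the safer call here.
t = torch.arange(6).view(2, 3).T
print(t.is_contiguous())
r = t.reshape(6)
print(r)
```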

In [None]:
reshaped = x.view(3,2)
print(reshaped.shape)

torch.Size([3, 2])


In [None]:
reshaped

tensor([[0.6458, 0.1860],
        [0.9687, 0.1091],
        [0.9286, 0.6414]])

In [None]:
word_embed = torch.tensor([1.0,2.0,3.0])
print(word_embed.shape)

print(word_embed.view(1,3))
print(word_embed.view(3,1))

torch.Size([3])
tensor([[1., 2., 3.]])
tensor([[1.],
        [2.],
        [3.]])


# Heart of Deep Learning and PyTorch

Weights -> learnable parameters that control the strength of connection between input and output features in a neural network

For example,
input = [x1,x2,x3]
weights = [w1,w2,w3]

The weighted sum will be:
z = w1*x1 + w2*x2 + w3*x3

During training, backpropagation computes how the error changes with each weight, and the optimizer adjusts the weights to reduce that error.

Bias is a learnable scalar (or vector) that is added to the weighted sum to shift the output.

Mathematically:
z = w.x + b

It is like c in y=mx+c: it lets the model move the line up and down.

Linear Transformation (y=mx+c)

In neural networks:-
x = input
m = weight(w)
c = bias(b)

Output
y=w.x+b

If we apply an activation (like ReLU or sigmoid), written σ here:
y=σ(w⋅x+b)
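
The formulas above, worked through with numbers. The weight and bias values are made up for illustration; in a real model they would be learned.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])    # input  [x1, x2, x3]
w = torch.tensor([0.5, -1.0, 2.0])   # hypothetical weights [w1, w2, w3]
b = torch.tensor(0.5)                # hypothetical bias

# Weighted sum plus bias: 0.5*1 + (-1.0)*2 + 2.0*3 + 0.5 = 5.0
z = torch.dot(w, x) + b
print("z:", z.item())

# Applying an activation on top of the linear part.
y_relu = torch.relu(z)        # max(0, 5.0) = 5.0
y_sigmoid = torch.sigmoid(z)  # 1 / (1 + exp(-5)), roughly 0.9933
print("relu:", y_relu.item())
print("sigmoid:", round(y_sigmoid.item(), 4))
```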

Practical Context:-
Imagine you're predicting house prices.

Each feature like square footage, number of rooms, etc. has a weight.

Bias allows your prediction to adjust upward or downward, even when all inputs are zero.

The model learns the best weights and bias to match real house prices.
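
A toy version of this idea, as a sketch rather than a realistic pricing model. The data is invented: prices are generated from price = 2 * size + 1 (arbitrary units), so we can check that training recovers a weight near 2 and a bias near 1.

```python
import torch

# Hypothetical data: one feature (house size) and a price built from a known rule.
size = torch.tensor([[0.0], [1.0], [2.0], [3.0]])
price = 2.0 * size + 1.0

# Start from zero and let gradient descent find the weight and bias.
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.05  # learning rate

for step in range(1000):
    pred = size * w + b                      # linear model: y = w*x + b
    loss = ((pred - price) ** 2).mean()      # mean squared error
    loss.backward()                          # backpropagation fills w.grad and b.grad
    with torch.no_grad():
        w -= lr * w.grad                     # move against the gradient
        b -= lr * b.grad
        w.grad.zero_()                       # reset for the next step
        b.grad.zero_()

print("learned weight:", round(w.item(), 3))
print("learned bias:", round(b.item(), 3))
```

In practice you would use `torch.optim` and `nn.Linear` instead of updating the tensors by hand; the manual loop is only here to make each step visible.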



## Activation Function

An activation function transforms a neuron's weighted sum and decides how strongly the neuron fires. It adds non-linearity to the model so that it can learn complex patterns. Without it, no matter how many layers we stack, the whole network would still be a single linear transformation.
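
The claim about stacked linear layers can be checked directly. In this sketch (layer sizes are arbitrary), two `nn.Linear` layers with no activation between them give exactly the same output as one combined linear map.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lin1 = nn.Linear(4, 8)
lin2 = nn.Linear(8, 3)
x_in = torch.rand(5, 4)

# Two linear layers stacked with nothing in between.
stacked = lin2(lin1(x_in))

# Collapse them by hand: W = W2 @ W1 and c = W2 @ b1 + b2.
W = lin2.weight @ lin1.weight
c = lin2.weight @ lin1.bias + lin2.bias
single = x_in @ W.T + c

print("same as one linear layer:", torch.allclose(stacked, single, atol=1e-5))

# With a ReLU in between, the network is no longer a single linear map,
# so this output will generally differ from `single`.
with_relu = lin2(torch.relu(lin1(x_in)))
print(with_relu.shape)
```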

## ReLU(Rectified Linear Unit)
ReLU(x) = max(0,x)

Positive -> Same as input
Negative or 0 -> 0

Why use it -
- Extremely simple and fast
- Helps avoid vanishing gradients in deep networks
- Usually the default for hidden layers

## Sigmoid

Formula-> sigmoid(x) = 1/(1+exp(-x))
Output Range -> (0,1)

When to use:-
- Converts any real number into a probability-like value
- Useful in the output layer for binary classification, when you want a yes/no prediction

## Tanh
Formula-> tanh(x) = (exp(x) - exp(-x))/(exp(x) + exp(-x))

Output Range -> (-1,1)
Similar to sigmoid, but centered at 0
Use it when the input data is centered around 0 and you need smooth transitions. Its gradient near 0 is larger than sigmoid's (1 versus 0.25), so gradients flow better.

| Activation  | Output Range | Use In              | When                                             |
| ----------- | ------------ | ------------------- | ------------------------------------------------ |
| **ReLU**    | \[0, ∞)      | Hidden Layers       | **Default** for CNNs and feedforward nets, fast, sparse |
| **Sigmoid** | (0, 1)       | Output Layer        | Binary classification                            |
| **Tanh**    | (-1, 1)      | RNNs / older models | Data centered around 0                           |
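
The three activations side by side, with each built-in compared against the formula given above. The input values are arbitrary.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Built-in versions.
relu_out = torch.relu(x)
sigmoid_out = torch.sigmoid(x)
tanh_out = torch.tanh(x)

# The same functions written from their formulas.
manual_relu = torch.maximum(x, torch.zeros_like(x))            # max(0, x)
manual_sigmoid = 1 / (1 + torch.exp(-x))                       # 1 / (1 + exp(-x))
manual_tanh = (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))

print("relu:   ", relu_out)
print("sigmoid:", sigmoid_out)
print("tanh:   ", tanh_out)
```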


Q) Why do neural networks need to stack layers?

Think of each neuron as learning a simple pattern. Stacking layers lets later layers combine those patterns into more complex ones:

Layer 1 - edges, corners (in images)

Layer 2 - curves, shapes

Layer 3 - objects

Layer 4 - concepts

Layer 8+ - meaning, sentiment, intent (LLMs)
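
In code, stacking layers usually looks like this. A minimal sketch with `nn.Sequential`; the layer sizes (16 inputs, 32 hidden units, 2 outputs) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# Three linear layers with a ReLU after each hidden layer.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)

batch = torch.rand(4, 16)   # 4 examples, 16 features each
logits = model(batch)       # each layer feeds the next
print(logits.shape)

# Count the learnable weights and biases across all layers.
num_params = sum(p.numel() for p in model.parameters())
print("parameters:", num_params)
```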




ReLU is used a lot in convolutional and simple feedforward nets, but models like GPT use GELU (Gaussian Error Linear Unit) in the feedforward layers of each transformer block, because it is smooth, non-linear, and keeps small negative values (important for language representations).
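
A quick comparison of GELU and ReLU on a few arbitrary inputs, to show what "keeps small negative values" means. This uses PyTorch's default exact GELU; some models use a tanh approximation instead.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0, -0.1, 0.0, 1.0, 3.0])

relu_out = F.relu(x)   # every negative input becomes exactly 0
gelu_out = F.gelu(x)   # negative inputs are shrunk, not zeroed

print("relu:", relu_out)
print("gelu:", gelu_out)
```

For x = -1, ReLU outputs 0 while GELU outputs roughly -0.159, so a little of the negative signal still passes through.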

## Real-world Analogy
- Linear layers are like arranging blocks in straight lines

- ReLU/tanh allow curves, turns and new dimensions

- Weights are how much of each block you use (or how strong the signal is)

- Backpropagation is editing your drawing based on feedback