# DAY 3: What made transformers so successful? On a serious note, the world is different now.


📌 Problem with Basic Seq2Seq Models (Without Attention)
In a traditional Sequence-to-Sequence (Seq2Seq) model (like an RNN-based encoder-decoder), the encoder processes the input sequence and compresses it into a single fixed-size context vector (the last hidden state).

🚨 What’s the issue?

This fixed-size context vector acts as a bottleneck: it must store all the information from the input sequence, even if the sequence is very long.
This leads to poor performance on long sentences, because the decoder struggles to recover information about earlier words.
The model tends to forget earlier parts of the sentence when generating longer outputs.
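
To make the bottleneck concrete, here is a minimal NumPy sketch of an encoder that returns only its last hidden state. Everything in it is an assumption for illustration: the weights are random and untrained, the sizes are made up, and `W_hh` is deliberately rescaled to be small so that old information fades quickly. The helpers `encode` and `first_token_influence` are hypothetical names, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3

# Hypothetical, untrained weights (assumption: W_hh is rescaled so that its
# largest singular value is 0.5, which makes old information fade quickly)
W_xh = rng.normal(size=(hidden_size, input_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))
W_hh *= 0.5 / np.linalg.norm(W_hh, 2)
b_h = np.zeros(hidden_size)

def encode(inputs):
    """Run the RNN over the whole sequence and return the last hidden state."""
    h = np.zeros(hidden_size)
    for x_t in inputs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
    return h  # this single vector is the "context vector"

def first_token_influence(length):
    """How far the context vector moves when only the first input changes."""
    seq = rng.normal(size=(length, input_size))
    perturbed = seq.copy()
    perturbed[0] += 1.0
    return np.linalg.norm(encode(seq) - encode(perturbed))

short_context = encode(rng.normal(size=(5, input_size)))
long_context = encode(rng.normal(size=(50, input_size)))
print(short_context.shape, long_context.shape)  # (4,) (4,)

influence_short = first_token_influence(2)
influence_long = first_token_influence(50)
print("influence of first input, length 2: ", influence_short)
print("influence of first input, length 50:", influence_long)
```

A 5-step input and a 50-step input both end up in a vector of the same size, and under these assumed weights the first input barely affects the context vector of the long sequence. A trained model behaves differently in detail, but this is the effect the points above describe.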

# The basic RNN cell equation can be represented as follows:
# h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
# where:
# - h_t is the hidden state at time t
# - h_{t-1} is the hidden state at time t-1
# - x_t is the input at time t
# - W_hh is the weight matrix for the hidden state
# - W_xh is the weight matrix for the input
# - b_h is the bias term
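
As a rough sketch of the equation above, the snippet below applies it twice, feeding the inputs 1 and 2 one step at a time. The weights are hand-picked toy values (an assumption, not learned parameters), the hidden size of 2 is arbitrary, and `rnn_step` is a hypothetical helper name.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One application of h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

# Hypothetical toy weights, chosen by hand for illustration
W_xh = np.array([[0.5], [-0.3]])           # shape (hidden_size, input_size) = (2, 1)
W_hh = np.array([[0.1, 0.2], [0.0, 0.4]])  # shape (hidden_size, hidden_size) = (2, 2)
b_h = np.zeros(2)

h = np.zeros(2)  # h_0: the hidden state before any input is seen
states = []
for x in [1.0, 2.0]:
    # The same weights are reused at every time step
    h = rnn_step(np.array([x]), h, W_xh, W_hh, b_h)
    states.append(h)
    print(h)
```

Because `h_0` is zero, the first hidden state reduces to `tanh(W_xh * x_1)`. The second hidden state depends on both `x_2` and the first hidden state, which is how earlier inputs are carried forward.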


# We are implementing a basic RNN cell in PyTorch. This will help us understand how an RNN processes a sequence over time.


In [1]:
import torch
import torch.nn as nn

In [2]:
import numpy as np

# Generate a simple sequence of numbers
sequence = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Prepare input-output pairs
X = []
y = []
for i in range(len(sequence) - 2):  # We need at least 2 previous values
    X.append([sequence[i], sequence[i+1]])  # Input is two previous numbers
    y.append(sequence[i+2])  # Target is the next number

# Convert to NumPy arrays
X = np.array(X)
y = np.array(y)

# Print dataset
print("Inputs:\n", X)
print("Targets:\n", y)


Inputs:
 [[1 2]
 [2 3]
 [3 4]
 [4 5]
 [5 6]
 [6 7]
 [7 8]
 [8 9]]
Targets:
 [ 3  4  5  6  7  8  9 10]
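
The same input/target pairs can also be built without an explicit loop. The sketch below is one possible alternative, assuming NumPy 1.20 or newer for `sliding_window_view`. The helper `make_windows` and its `window` parameter are hypothetical names introduced here.

```python
import numpy as np

def make_windows(sequence, window=2):
    """Split a 1-D sequence into (inputs, targets) using a sliding window."""
    sequence = np.asarray(sequence)
    # Each row holds `window` inputs followed by the value to predict
    rows = np.lib.stride_tricks.sliding_window_view(sequence, window + 1)
    return rows[:, :window], rows[:, window]

X_win, y_win = make_windows(np.arange(1, 11), window=2)
print(X_win.shape, y_win.shape)  # (8, 2) (8,)
print(X_win[0], y_win[0])        # [1 2] 3
```

Changing `window` gives longer histories per sample without rewriting the loop.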


In [3]:
X

array([[1, 2],
       [2, 3],
       [3, 4],
       [4, 5],
       [5, 6],
       [6, 7],
       [7, 8],
       [8, 9]])

In [4]:
y

array([ 3,  4,  5,  6,  7,  8,  9, 10])

In [5]:
import numpy as np

# Generate simple sequence data
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8], [8, 9]])
y = np.array([[3], [4], [5], [6], [7], [8], [9], [10]])

print("Inputs:\n", X)
print("Targets:\n", y)


Inputs:
 [[1 2]
 [2 3]
 [3 4]
 [4 5]
 [5 6]
 [6 7]
 [7 8]
 [8 9]]
Targets:
 [[ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]]


In [6]:
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)



Shape of X: (8, 2)
Shape of y: (8, 1)
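
The targets changed from shape `(8,)` in the earlier cell to `(8, 1)` here. A small sketch of how to get the column shape without retyping the values, and why mixing the two shapes is risky (the names `y_flat` and `y_col` are just for illustration):

```python
import numpy as np

y_flat = np.array([3, 4, 5, 6, 7, 8, 9, 10])  # shape (8,)
y_col = y_flat.reshape(-1, 1)                  # shape (8, 1): one target per row
print(y_flat.shape, y_col.shape)               # (8,) (8, 1)

# Subtracting a (8,) array from a (8, 1) array broadcasts to (8, 8),
# which silently gives the wrong result when computing errors.
diff = y_col - y_flat
print(diff.shape)                              # (8, 8)
```

Keeping predictions and targets both as `(8, 1)` avoids this broadcasting surprise later on.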


In [22]:
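# Forward pass by hand: one hidden unit (tanh) followed by a linear output
np.random.seed(42)  # fixed seed so the printed values are reproducible
# Weight for the first input feature
w1=np.random.randn(1,1)
print(w1)
print(w1.shape)
W1=w1.T  # transposing a (1, 1) array leaves it unchanged
print(W1.shape)
# Weight for the second input feature
w2=np.random.randn(1,1)
print(w2)
print(w2.shape)
W2=w2.T
print(W2.shape)
# Biases added to each weighted input
b1=np.random.randn(1,1)
print(b1)
print(b1.shape)
b2=np.random.randn(1,1)
print(b2)
print(b2.shape)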



# First layer: each feature of the first sample X[0] is multiplied by its own weight and shifted by its own bias
before_activation_1_1=((X[0][0]*W1)+b1)
print(before_activation_1_1)
print(before_activation_1_1.shape)
before_activation_1_2=((X[0][1]*W2)+b2)
print(before_activation_1_2)
print(before_activation_1_2.shape)

# Sum both contributions and apply tanh to get the hidden activation
after_activation_1_full=np.tanh(before_activation_1_1 + before_activation_1_2)
print(after_activation_1_full)
print(after_activation_1_full.shape)

W3=np.random.randn(1,1)
print(W3)
print(W3.shape)
W3=W3.T
print(W3.shape)
b3=np.random.randn(1,1)
print(b3)
print(b3.shape)
# Final (output) layer
before_activation_2_1=((after_activation_1_full*W3)+b3)
print(before_activation_2_1)
print(before_activation_2_1.shape)

# No activation function on the final layer, since this is a regression problem
after_activation_2_1=(before_activation_2_1)
print(after_activation_2_1)
print(after_activation_2_1.shape)

[[0.49671415]]
(1, 1)
(1, 1)
[[-0.1382643]]
(1, 1)
(1, 1)
[[0.64768854]]
(1, 1)
[[1.52302986]]
(1, 1)
[[1.14440269]]
(1, 1)
[[1.24650125]]
(1, 1)
[[0.98337764]]
(1, 1)
[[-0.23415337]]
(1, 1)
(1, 1)
[[-0.23413696]]
(1, 1)
[[-0.46439815]]
(1, 1)
[[-0.46439815]]
(1, 1)
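
The cell above pushes only the first sample `X[0]` through the network. As a sketch, the same computation can be written for all eight samples at once with matrix operations. It redraws the same six random values in the same order, so the first prediction should match the value printed above. Using mean squared error as the loss is an assumption here (a common choice for regression), and the names `W_in`, `hidden`, `pred` and `mse` are introduced only for this example.

```python
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8], [8, 9]])
y = np.array([[3], [4], [5], [6], [7], [8], [9], [10]])

np.random.seed(42)
# Same draw order as the cell above: w1, w2, b1, b2, W3, b3
w1, w2, b1, b2, W3, b3 = (np.random.randn(1, 1) for _ in range(6))

W_in = np.vstack([w1, w2])            # shape (2, 1): one weight per input feature
hidden = np.tanh(X @ W_in + b1 + b2)  # shape (8, 1): hidden activation per sample
pred = hidden @ W3 + b3               # shape (8, 1): linear output, no activation

# Assumed loss for this regression task: mean squared error
mse = np.mean((pred - y) ** 2)

print(pred[0, 0])  # about -0.4644, the same as the single-sample result above
print(pred.shape)  # (8, 1)
print("MSE with random weights:", mse)
```

With untrained random weights the error is large, as expected. Training would adjust `W_in`, `W3` and the biases to reduce it.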
