# Problem: Implement a Sequence-to-Sequence Model with Attention

### Problem Statement
Implement a **Sequence-to-Sequence (Seq2Seq) model with Attention** by completing the required sections. The model consists of an **Encoder** that processes input sequences and a **Decoder** with an attention mechanism that generates output sequences.

### Requirements

1. **Encoder Class**:
   - **Layers**:
     - Use an embedding layer to map input tokens to dense vectors.
     - Use an LSTM layer to capture temporal dependencies in the sequence.
   - **Forward Pass**:
     - Pass the input sequence through the embedding layer.
     - Feed the embedded sequence into the LSTM.
     - Return the LSTM outputs and the final hidden and cell states.

2. **Decoder with Attention**:
   - **Layers**:
     - Use an embedding layer to map target tokens to dense vectors.
     - Implement an attention mechanism to compute attention weights between the encoder outputs and the current decoder hidden state.
     - Use an LSTM layer to update the decoder state from the context vector (from attention) combined with the embedded input.
     - Use a fully connected output layer to predict the next token.
   - **Forward Pass**:
     - Process the input through the embedding layer.
     - Compute attention weights using the decoder hidden state and encoder outputs.
     - Calculate the context vector by applying the attention weights to the encoder outputs.
     - Combine the context vector with the embedded input.
     - Feed the combined representation into the LSTM.
     - Pass the LSTM output through a fully connected layer to predict the next token.
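
The attention steps in the decoder's forward pass (score, normalize, weighted sum) can be sketched in isolation. The sketch below assumes dot-product scoring, which is one common choice and not necessarily the one a solution must use; the helper name `dot_product_attention` and the tensor sizes are hypothetical and chosen only for illustration.

```python
import torch

def dot_product_attention(decoder_hidden, encoder_outputs):
    """Hypothetical helper: one attention step with dot-product scoring.

    decoder_hidden:  (batch, hidden_dim)          current decoder state
    encoder_outputs: (batch, src_len, hidden_dim) one vector per source token
    """
    # Score each source position against the decoder state -> (batch, src_len)
    scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)
    # Normalize the scores into weights that sum to 1 over source positions
    weights = torch.softmax(scores, dim=1)
    # Weighted sum of encoder outputs -> (batch, hidden_dim)
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights

torch.manual_seed(0)
encoder_outputs = torch.randn(4, 10, 64)  # batch=4, src_len=10, hidden_dim=64 (illustrative sizes)
decoder_hidden = torch.randn(4, 64)
context, weights = dot_product_attention(decoder_hidden, encoder_outputs)
print(context.shape, weights.shape)  # torch.Size([4, 64]) torch.Size([4, 10])
```

Because the scores are computed against the encoder outputs themselves, this variant works for any source length. The context vector would then be concatenated with the embedded input before the LSTM step.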


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

In [2]:
# Define the Encoder
class Encoder(nn.Module):
    def __init__(self, input_dim, embed_dim, hidden_dim, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, x):
        # x: (batch, src_len) token indices
        embedded = self.embedding(x)  # (batch, src_len, embed_dim)
        # outputs: (batch, src_len, hidden_dim); hidden, cell: (num_layers, batch, hidden_dim)
        outputs, (hidden, cell) = self.lstm(embedded)
        return outputs, (hidden, cell)

# Define the Decoder with Attention
class Decoder(nn.Module):
    def __init__(self, output_dim, embed_dim, hidden_dim, num_layers, src_seq_length):
        super().__init__()
        self.embedding = nn.Embedding(output_dim, embed_dim)
        # Produces one score per source position, so the source length is fixed at src_seq_length
        self.attention = nn.Linear(hidden_dim + embed_dim, src_seq_length)
        self.attention_combine = nn.Linear(hidden_dim + embed_dim, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, encoder_outputs, hidden, cell):
        # x: (batch,) token indices for the current step
        x = x.unsqueeze(1)  # Add sequence dimension -> (batch, 1)
        embedded = self.embedding(x)  # (batch, 1, embed_dim)

        # Attention mechanism: score the source positions from the embedded input
        # and the top-layer hidden state
        attention_input = torch.cat((embedded.squeeze(1), hidden[-1]), dim=1)
        attention_weights = torch.softmax(self.attention(attention_input), dim=1)  # (batch, src_seq_length)
        context_vector = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs)  # (batch, 1, hidden_dim)

        # Combine context and embedded input
        combined = torch.cat((embedded.squeeze(1), context_vector.squeeze(1)), dim=1)
        combined = torch.tanh(self.attention_combine(combined)).unsqueeze(1)  # (batch, 1, embed_dim)

        # LSTM and output
        lstm_out, (hidden, cell) = self.lstm(combined, (hidden, cell))
        output = self.fc_out(lstm_out.squeeze(1))  # (batch, output_dim) logits
        return output, hidden, cell
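
The attention step of this `Decoder` can be traced with random tensors to check the shapes. The sketch below is a standalone illustration, not part of the notebook: it rebuilds only the `attention` layer with the notebook's sizes, and the names `longer_outputs` and `length_mismatch` are hypothetical. It also shows a consequence of scoring with a `Linear` layer whose output size is `src_seq_length`: the weights only line up with encoder outputs of exactly that length.

```python
import torch
import torch.nn as nn

batch, src_len, embed_dim, hidden_dim, num_layers = 16, 10, 32, 64, 2

attention = nn.Linear(hidden_dim + embed_dim, src_len)  # same sizes as Decoder.attention

embedded = torch.randn(batch, 1, embed_dim)          # embedded decoder input
hidden = torch.randn(num_layers, batch, hidden_dim)  # decoder hidden state, all layers
encoder_outputs = torch.randn(batch, src_len, hidden_dim)

# Scores come from the embedded input and the top-layer hidden state only
attention_input = torch.cat((embedded.squeeze(1), hidden[-1]), dim=1)        # (16, 96)
attention_weights = torch.softmax(attention(attention_input), dim=1)         # (16, 10)
context_vector = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs)  # (16, 1, 64)
print(attention_input.shape, attention_weights.shape, context_vector.shape)

# Encoder outputs of a different length no longer match the 10 attention weights
longer_outputs = torch.randn(batch, src_len + 2, hidden_dim)
try:
    torch.bmm(attention_weights.unsqueeze(1), longer_outputs)
    length_mismatch = False
except RuntimeError:
    length_mismatch = True
print(length_mismatch)  # True
```

Variable-length sources would need padding to `src_seq_length` or a scoring function that depends on the encoder outputs themselves.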

In [3]:
# Define synthetic training data
torch.manual_seed(42)
src_vocab_size = 20
tgt_vocab_size = 20
src_seq_length = 10
tgt_seq_length = 12
batch_size = 16

src_data = torch.randint(0, src_vocab_size, (batch_size, src_seq_length))
tgt_data = torch.randint(0, tgt_vocab_size, (batch_size, tgt_seq_length))

# Initialize models, loss function, and optimizer
input_dim = src_vocab_size
output_dim = tgt_vocab_size
embed_dim = 32
hidden_dim = 64
num_layers = 2

encoder = Encoder(input_dim, embed_dim, hidden_dim, num_layers)
decoder = Decoder(output_dim, embed_dim, hidden_dim, num_layers, src_seq_length)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=0.001)

In [4]:
# Training loop
epochs = 100
for epoch in range(epochs):
    encoder_outputs, (hidden, cell) = encoder(src_data)
    loss = 0  # Summed over all target time steps
    decoder_input = torch.zeros(batch_size, dtype=torch.long)  # Start token (index 0, also a regular vocabulary index here)

    for t in range(tgt_seq_length):
        output, hidden, cell = decoder(decoder_input, encoder_outputs, hidden, cell)
        loss += criterion(output, tgt_data[:, t])
        decoder_input = tgt_data[:, t]  # Teacher forcing: feed the ground-truth token as the next input

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Log progress every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch + 1}/{epochs}] - Loss: {loss.item():.4f}")

Epoch [10/100] - Loss: 35.5304
Epoch [20/100] - Loss: 34.7664
Epoch [30/100] - Loss: 33.6247
Epoch [40/100] - Loss: 30.9979
Epoch [50/100] - Loss: 27.3896
Epoch [60/100] - Loss: 24.1525
Epoch [70/100] - Loss: 21.2032
Epoch [80/100] - Loss: 18.6953
Epoch [90/100] - Loss: 16.5154
Epoch [100/100] - Loss: 14.5446
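
To read these numbers, it helps to know that the loop adds up one cross-entropy term per target step, so the logged value is a sum over `tgt_seq_length = 12` steps (each term is averaged over the batch by the default `CrossEntropyLoss` reduction). A rough baseline, assuming a model that guesses uniformly over the 20 target tokens, can be worked out as follows; the variable names are illustrative.

```python
import math

tgt_vocab_size = 20
tgt_seq_length = 12

# Cross-entropy of a uniform guess over the vocabulary, per token
chance_per_token = math.log(tgt_vocab_size)
# The training loop sums the per-step losses over the target sequence
chance_summed = chance_per_token * tgt_seq_length

final_summed = 14.5446  # last value in the training log
final_per_token = final_summed / tgt_seq_length

print(f"{chance_per_token:.2f} {chance_summed:.2f} {final_per_token:.2f}")
# 3.00 35.95 1.21
```

The loss at epoch 10 (35.53) is therefore still close to chance, and the final value corresponds to roughly 1.21 per token. Since the targets are random integers, any improvement reflects memorization of the 16 training pairs instead of a learnable mapping.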


In [5]:
# Test the sequence-to-sequence model with new input (greedy decoding)
test_input = torch.randint(0, src_vocab_size, (1, src_seq_length))
with torch.no_grad():
    encoder_outputs, (hidden, cell) = encoder(test_input)
    decoder_input = torch.zeros(1, dtype=torch.long)  # Start token
    output_sequence = []

    for _ in range(tgt_seq_length):
        output, hidden, cell = decoder(decoder_input, encoder_outputs, hidden, cell)
        predicted = output.argmax(1)  # Greedy choice: most likely next token
        output_sequence.append(predicted.item())
        decoder_input = predicted  # Feed the prediction back in as the next input

    print(f"Input: {test_input.tolist()}, Output: {output_sequence}")

Input: [[3, 18, 4, 11, 8, 17, 12, 7, 18, 1]], Output: [13, 13, 2, 2, 2, 12, 12, 7, 7, 12, 12, 12]
