# Programming Task Description

## Programming Task: Implementing a Character-Level GPT Model

### Introduction
In this task, you will write a Python script using PyTorch that implements a simplified GPT (Generative Pre-trained Transformer) model for character-level language modeling. The model is trained on the text in input.txt to predict the next character in a sequence and to generate new text from a given context. The architecture follows the decoder part of the transformer from the paper "Attention Is All You Need" by Vaswani et al., and relies on masked multi-head self-attention so that each prediction depends only on previous positions.

## Task Description
Your goal is to write a Python script (or Jupyter notebook) that:

1. Reads and processes the text from input.txt.
2. Implements a decoder-only transformer model with manual attention mechanisms.
3. Trains the model on the processed data.
4. Generates new text using the trained model.

You will use PyTorch and implement the attention mechanism from scratch, following the decoder structure outlined in the "Attention Is All You Need" paper.

### Step-by-step Guide

1. Data Preparation
* Read all text from input.txt using UTF-8 encoding.
* Create a sorted list of unique characters (vocabulary) from the text.
* Build two dictionaries:
    * stoi: Maps characters to integers (e.g., 'a' -> 0).
    * itos: Maps integers to characters (e.g., 0 -> 'a').
* Define functions:
    * encode(s): Converts a string to a list of integers using stoi.
    * decode(l): Converts a list of integers to a string using itos.
* Encode the entire text into a tensor of integers using torch.tensor.
* Split the data: 90% for training, 10% for validation.
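
The data preparation steps above can be sketched as follows. This is a minimal illustration rather than a reference solution: the short `text` string is a stand-in so the block runs without input.txt, and the variable names `chars`, `train_data`, and `val_data` are assumptions, not names required by the task.

```python
import torch

# Stand-in corpus so the sketch runs on its own. The task itself reads the file:
#   with open("input.txt", "r", encoding="utf-8") as f:
#       text = f.read()
text = "hello world, hello transformer"

# The sorted list of unique characters is the vocabulary.
chars = sorted(set(text))
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}  # integer -> character


def encode(s):
    """Convert a string to a list of integer token ids."""
    return [stoi[c] for c in s]


def decode(l):
    """Convert a list of integer token ids back to a string."""
    return "".join(itos[i] for i in l)


# Encode the whole corpus, then split 90% / 10% without shuffling.
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
```

Keeping the split contiguous (no shuffling) means the validation set is text the model has not seen in any training window.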

2. Data Loading
* Implement a function get_batch(split):
    * Input: split is either 'train' or 'val'.
    * Select the appropriate dataset (training or validation).
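    * Randomly sample batch_size starting indices, ensuring that each start leaves room for block_size input tokens plus the one-position shift needed for the targets (that is, start + block_size + 1 must not exceed the length of the dataset).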
    * Return:
        * x: A tensor of shape (batch_size, block_size) with input sequences.
        * y: A tensor of shape (batch_size, block_size) with target sequences (shifted by one position).
    * Move both tensors to the device (CPU or GPU).
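
One possible shape for get_batch is sketched below. The `torch.arange` tensors are stand-ins for the encoded training and validation data, and the small `batch_size` and `block_size` values are chosen only so the example runs instantly; the task itself uses the values listed under Hyperparameters.

```python
import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Small values for illustration; the task uses batch_size = 64, block_size = 256.
batch_size = 4
block_size = 8

# Stand-in token streams; in the task these come from the 90/10 split of input.txt.
train_data = torch.arange(100, dtype=torch.long)
val_data = torch.arange(100, 120, dtype=torch.long)


def get_batch(split):
    data = train_data if split == "train" else val_data
    # The highest valid start is len(data) - block_size - 1, so that the
    # target window data[i + 1 : i + block_size + 1] stays in range.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)


xb, yb = get_batch("train")
print(xb.shape, yb.shape)
```

Because y is x shifted by one position, every position t in a sequence provides its own training example: predict y[t] from x[0..t].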

3. Model Architecture
* Implement the following components in a decoder-only transformer:
    * Embedding Layers:
        * Token embedding: nn.Embedding(vocab_size, n_embd) for character indices.
        * Position embedding: nn.Embedding(block_size, n_embd) for positions 0 to block_size-1.
    * Transformer Blocks:
        * Each block includes:
            * Masked Multi-Head Self-Attention:
                * Implement manually (do not use nn.MultiheadAttention).
                * For each head:
                    * Linear layers for queries (Q), keys (K), and values (V).
                    * Scaled dot-product attention: attention = softmax((Q @ K.T) / sqrt(d_k)) @ V.
                    * Mask future positions with a lower triangular matrix (e.g., tril) by setting future weights to -inf before softmax.
                * Concatenate heads and apply a projection layer.
            * Feed-Forward Network: nn.Linear(n_embd, 4 * n_embd) → ReLU → nn.Linear(4 * n_embd, n_embd).
            * Layer Normalization: Apply nn.LayerNorm(n_embd) before each sub-layer (pre-norm).
            * Residual Connections: Add input to output of each sub-layer.
        * Use n_layer blocks in sequence.
    * Final Layers:
        * nn.LayerNorm(n_embd) for final normalization.
        * nn.Linear(n_embd, vocab_size) to produce logits.
* Define a GPTLanguageModel class with:
    * forward(idx, targets=None): Computes logits and loss (if targets provided).
    * generate(idx, max_new_tokens): Autoregressively generates new tokens.

4. Training
* Use the AdamW optimizer with learning_rate = 3e-4.
* Train for max_iters = 5000 iterations.
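* Periodically estimate and print the training and validation losses, for example by averaging the loss over several batches from each split.
* Compute the loss using F.cross_entropy on flattened logits and targets, i.e., logits reshaped to (B*T, vocab_size) and targets reshaped to (B*T).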
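
A training loop along these lines might look like the sketch below. To stay self-contained and fast it makes several assumptions that differ from the task: `BigramStandIn` is a hypothetical placeholder with the same forward signature as GPTLanguageModel, the data is a synthetic repeating pattern, and `max_iters`, `learning_rate`, `eval_interval`, and `eval_iters` are shrunk or invented for the demo (the task specifies max_iters = 5000 and learning_rate = 3e-4).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Demo values only; the learning rate is larger than the task's 3e-4 so the
# toy model moves noticeably within 200 steps.
max_iters, eval_interval, eval_iters = 200, 100, 20
learning_rate = 1e-2
batch_size, block_size, vocab_size = 16, 8, 10

# Stand-in data: a repeating pattern, so the next token is predictable.
data = torch.arange(2000) % vocab_size
train_data, val_data = data[:1800], data[1800:]


def get_batch(split):
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i : i + block_size] for i in ix])
    y = torch.stack([d[i + 1 : i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)


class BigramStandIn(nn.Module):
    """Placeholder for GPTLanguageModel with the same forward signature."""

    def __init__(self):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            # Flatten so cross_entropy sees (N, C) logits and (N,) targets.
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss


@torch.no_grad()
def estimate_loss(model):
    """Average the loss over several batches per split to reduce noise."""
    out = {}
    model.eval()  # disables dropout while evaluating
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            _, loss = model(*get_batch(split))
            losses[k] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out


model = BigramStandIn().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
history = []

for it in range(max_iters):
    if it % eval_interval == 0 or it == max_iters - 1:
        losses = estimate_loss(model)
        history.append(losses)
        print(f"step {it}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    xb, yb = get_batch("train")
    _, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```

Averaging over several batches gives a smoother estimate than the loss of a single training batch, and switching to eval mode keeps dropout from distorting the reported numbers.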

5. Text Generation
* Implement generate(idx, max_new_tokens):
    * Start with an initial context idx (shape (B, T)).
    * For max_new_tokens steps:
        * Crop idx to the last block_size tokens.
        * Get logits from forward.
        * Apply softmax to the last time step’s logits to get probabilities.
        * Sample the next token using torch.multinomial.
        * Append the sampled token to idx.
    * Return the extended sequence.
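
The generation loop can be sketched as below. `TinyModel` is a hypothetical stand-in that only mimics the forward signature (it returns logits and a loss of None); in the task, generate would be a method of GPTLanguageModel. The small `block_size` and `vocab_size` are likewise demo values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
block_size, vocab_size = 8, 10  # demo values, not the task's hyperparameters


class TinyModel(nn.Module):
    """Stand-in with the same forward signature as GPTLanguageModel."""

    def __init__(self):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # A real model indexes a position table of size block_size,
        # so inputs longer than block_size would fail there.
        assert idx.shape[1] <= block_size
        return self.table(idx), None

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]          # crop to the last block_size tokens
            logits, _ = self(idx_cond)               # (B, T, vocab_size)
            logits = logits[:, -1, :]                # last time step: (B, vocab_size)
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T + 1)
        return idx


model = TinyModel().eval()
context = torch.zeros((1, 1), dtype=torch.long)  # start from token id 0
out = model.generate(context, max_new_tokens=20)
print(out.shape)
```

With the real model, the output ids would be turned back into text with decode(out[0].tolist()). Sampling with torch.multinomial, rather than always taking the most likely token, is what keeps the generated text from collapsing into repetition.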

### Hyperparameters
Use these values:

* batch_size = 64
* block_size = 256
* n_embd = 384
* n_head = 6
* n_layer = 6
* dropout = 0.2
* learning_rate = 3e-4
* max_iters = 5000
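
As a quick sanity check on these values, the sketch below writes them as module-level constants and derives a few quantities from them. The names `head_size`, `tokens_per_batch`, and `approx_block_params` are illustrative additions, and the parameter estimate deliberately ignores biases, LayerNorm weights, and the embedding tables.

```python
batch_size = 64        # sequences per batch
block_size = 256       # maximum context length in characters
n_embd = 384           # embedding width
n_head = 6             # attention heads per block
n_layer = 6            # transformer blocks
dropout = 0.2
learning_rate = 3e-4
max_iters = 5000

# The embedding width must split evenly across heads.
assert n_embd % n_head == 0
head_size = n_embd // n_head                 # 64 dimensions per head

tokens_per_batch = batch_size * block_size   # 16384 characters per step

# Rough weight count: about 4 * n_embd^2 for attention (Q, K, V, projection)
# plus 8 * n_embd^2 for the feed-forward network, per block.
approx_block_params = 12 * n_embd**2 * n_layer   # about 10.6 million

print(head_size, tokens_per_batch, approx_block_params)
```

A model of this size trains much faster on a GPU; on a CPU it may be practical to shrink these values while debugging and restore them for the final run.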

### Understanding the Decoder
The "Attention Is All You Need" paper describes a transformer with an encoder and a decoder. This task focuses on the decoder-only architecture used in GPT:

* Masked Self-Attention: Ensures the model only attends to previous positions in the sequence, making it autoregressive. This is achieved by masking future tokens in the attention computation, as shown below:

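$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}} + \mathrm{mask}\right) V$

where $\mathrm{mask}$ is $-\infty$ at future positions and $0$ elsewhere, and $d_k$ is the dimension of the keys in each head.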

* Decoder Role: In the original paper, the decoder generates output sequences while attending to the encoder’s output. Here, without an encoder, it uses self-attention on the input sequence alone, predicting the next token step-by-step.
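
To make the masked self-attention formula concrete, the sketch below applies it to small random matrices for a single head and no batch dimension. The sizes `T = 4` and `d_k = 8` are arbitrary, and the additive mask is built to match the formula literally; filling the scores with -inf via masked_fill gives the same result.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d_k = 4, 8  # sequence length and key dimension (arbitrary demo sizes)
q = torch.randn(T, d_k)
k = torch.randn(T, d_k)
v = torch.randn(T, d_k)

# Scaled dot-product scores: row t holds the scores of query t against every key.
scores = q @ k.T / d_k ** 0.5  # (T, T)

# Additive mask: 0 on and below the diagonal, -inf above it (future positions).
tril = torch.tril(torch.ones(T, T))
mask = torch.zeros(T, T).masked_fill(tril == 0, float("-inf"))

weights = F.softmax(scores + mask, dim=-1)  # each row sums to 1
out = weights @ v                           # (T, d_k)

print(weights)
```

The printed matrix is lower triangular: position 0 can attend only to itself, position 1 to positions 0 and 1, and so on, which is what makes the model autoregressive.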

### Additional Notes
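* Manual Attention: Implement attention from scratch to understand its mechanics. Do not use pre-built attention modules such as nn.MultiheadAttention or F.scaled_dot_product_attention; basic layers such as nn.Linear, nn.Embedding, nn.LayerNorm, and nn.Dropout are fine.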
* Masking: Use a lower triangular matrix (e.g., torch.tril) to mask future positions.
* Device Handling: Set device = 'cuda' if torch.cuda.is_available() else 'cpu' and move tensors/models accordingly.
* Dropout: Apply nn.Dropout(dropout) in attention and feed-forward layers for regularization.

### Deliverables
A Python script (or Jupyter notebook) that:
* Implements all steps above.
* Prints training and validation losses at a regular interval, for example every 100 or 500 iterations (the exact interval is up to you).
* Generates and prints a 500-character sample after training.

### Evaluation Criteria
* Correct data preparation and batch loading.
* Accurate implementation of the transformer model, especially masked self-attention.
* Successful training with decreasing loss.
* Generation of text that is reasonably coherent for a character-level model.