# Building a Tiny Transformer Language Model using TensorFlow

In this lab, you'll implement a simplified Transformer-based language model using TensorFlow and Keras. You'll work with a real-world text (from a PDF or a webpage) and build, train, and experiment with a tiny language model.

**Note:** This model is intentionally small and trained on limited data. The generated text might be funny or incomplete, but the goal is to understand the building blocks of Transformers.

## 1. Setup and Import Libraries

Make sure you have the required packages installed:

```bash
pip install tensorflow numpy PyPDF2 beautifulsoup4 requests
```

Now, import the libraries:

In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import math

print('TensorFlow version:', tf.__version__)

TensorFlow version: 2.15.0


## 2. Define Model Components

We'll build the following components:

- **Positional Encoding**
- **Multi-Head Self-Attention**
- **Feed-Forward Network**
- **Transformer Block**
- **Tiny Transformer Language Model**

### 2.1 Positional Encoding

This layer adds positional information to token embeddings.

Mathematical formula:

![Positional encoding formula](https://github.com/amiraliz93/Lab2_week1/blob/main/Pics/position%20formula.PNG)
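
In case the image does not render: the code below computes the standard sinusoidal encoding, so the formula is presumably the following (here $pos$ is the token position and $i$ indexes pairs of embedding dimensions; these symbol names are an assumption, not taken from the image):

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

In the code, `div_term` equals $10000^{-2i/d_{\text{model}}}$, written as $\exp\!\left(-2i \cdot \ln(10000)/d_{\text{model}}\right)$.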




In [None]:
class PositionalEncoding(layers.Layer):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.d_model = d_model  # int: embedding dimension (must be even for the sin/cos interleaving below)

        # Create a matrix of shape (max_len, d_model) with positional encodings
        pos_enc = np.zeros((max_len, d_model))
        pos = np.arange(max_len)[:, np.newaxis]
        div_term = np.exp(np.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
        pos_enc[:, 0::2] = np.sin(pos * div_term)
        pos_enc[:, 1::2] = np.cos(pos * div_term)
        pos_enc = pos_enc[np.newaxis, ...]  # Shape: (1, max_len, d_model)
        self.pos_enc = tf.cast(pos_enc, dtype=tf.float32)

    def call(self, x):
        # x shape: (batch_size, seq_len, d_model)
        seq_len = tf.shape(x)[1]
        return x + self.pos_enc[:, :seq_len, :]

# Test the PositionalEncoding layer
sample_pe = PositionalEncoding(d_model=16, max_len=50)
print('PositionalEncoding created successfully')


PositionalEncoding created successfully
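
To see what the layer actually adds to the embeddings, here is a minimal NumPy-only sketch that rebuilds the same table outside Keras. The helper name `sinusoidal_table` is hypothetical and used only for illustration.

```python
import math
import numpy as np

def sinusoidal_table(max_len, d_model):
    """Return a (max_len, d_model) table of sinusoidal encodings (d_model must be even)."""
    pos = np.arange(max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(pos * div_term)  # even dimensions use sine
    table[:, 1::2] = np.cos(pos * div_term)  # odd dimensions use cosine
    return table

table = sinusoidal_table(max_len=50, d_model=16)
print(table.shape)   # (50, 16)
print(table[0, :4])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Every value lies in [-1, 1], and each position gets a distinct pattern, which is what lets the attention layers tell positions apart.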



### 2.2 Multi-Head Self-Attention

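This layer implements multi-head self-attention: the input is projected to queries, keys, and values, split across `num_heads` heads, combined with scaled dot-product attention, and then merged back into a single `d_model`-sized output.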
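
Before reading the Keras layer, it may help to see scaled dot-product attention for a single head in plain NumPy. This is a minimal sketch, not the lab's implementation; the function names and the toy shapes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, depth) for a single head
    depth = q.shape[-1]
    logits = q @ k.T / np.sqrt(depth)   # (seq_len, seq_len) similarity scores
    weights = softmax(logits, axis=-1)  # each row is a distribution over positions
    return weights @ v, weights         # weighted average of the value vectors

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 4))
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)             # (5, 4)
print(weights.sum(axis=-1))  # every row sums to 1
```

The multi-head layer runs this computation in parallel on `num_heads` slices of size `d_model // num_heads` and concatenates the results.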

In [None]:
class MultiHeadSelfAttention(layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadSelfAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_model = d_model
        self.depth = d_model // num_heads

        self.wq = layers.Dense(d_model)
        self.wk = layers.Dense(d_model)
        self.wv = layers.Dense(d_model)
        self.dense = layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        # x shape: (batch_size, seq_len, d_model)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        # Transpose to shape: (batch_size, num_heads, seq_len, depth)
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, x):
        batch_size = tf.shape(x)[0]
        q = self.wq(x)  # (batch_size, seq_len, d_model)
        k = self.wk(x)
        v = self.wv(x)

        # Split heads
        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len, depth)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

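        # Scaled dot-product attention
        matmul_qk = tf.matmul(q, k, transpose_b=True)  # (batch_size, num_heads, seq_len, seq_len)
        dk = tf.cast(self.depth, tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

        # Causal mask: position i may attend only to positions <= i. Without it,
        # the model could read the next token it is being trained to predict.
        seq_len = tf.shape(x)[1]
        causal_mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)  # lower-triangular
        scaled_attention_logits += (1.0 - causal_mask) * -1e9

        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
        scaled_attention = tf.matmul(attention_weights, v)  # (batch_size, num_heads, seq_len, depth)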

        # Concatenate heads
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        output = self.dense(concat_attention)
        return output

# Test MultiHeadSelfAttention
sample_mha = MultiHeadSelfAttention(d_model=16, num_heads=4)
dummy_input = tf.random.uniform((1, 10, 16))
print('MultiHeadSelfAttention output shape:', sample_mha(dummy_input).shape)
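
Next-token language models typically restrict attention with a causal mask, so that position `i` cannot see later positions. The NumPy sketch below illustrates that idea on its own, independent of the layer above; `causal_softmax` is a hypothetical helper name.

```python
import numpy as np

def causal_softmax(logits):
    # logits: (seq_len, seq_len) attention scores for one head
    seq_len = logits.shape[-1]
    mask = np.tril(np.ones((seq_len, seq_len)))  # 1 on and below the diagonal
    masked = np.where(mask == 1, logits, -1e9)   # future positions get a very low score
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)

# With all-zero scores, each position spreads its attention evenly over
# itself and the positions before it, and gives zero weight to the future.
causal_weights = causal_softmax(np.zeros((4, 4)))
print(np.round(causal_weights, 2))
```

The first row puts all of its weight on position 0, the second row splits it between positions 0 and 1, and so on.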

### 2.3 Feed-Forward Network

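A simple two-layer feed-forward network, applied to each position independently: a ReLU layer of width `d_ff` followed by a linear projection back to `d_model`.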

In [None]:
class FeedForward(layers.Layer):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.dense1 = layers.Dense(d_ff, activation='relu')
        self.dense2 = layers.Dense(d_model)

    def call(self, x):
        return self.dense2(self.dense1(x))

# Test FeedForward
sample_ff = FeedForward(d_model=16, d_ff=32)
print('FeedForward output shape:', sample_ff(dummy_input).shape)

### 2.4 Transformer Block

This block combines the multi-head self-attention and feed-forward network with residual connections and layer normalization.
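
The "residual connection followed by layer normalization" pattern can be sketched in a few lines of NumPy. This is a simplified illustration: it omits the learnable scale and offset that `LayerNormalization` has, and the random matrix `w` is a hypothetical stand-in for the attention or feed-forward sublayer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    # Add the sublayer output to its input, then normalize the sum
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))   # (seq_len, d_model)
w = rng.normal(size=(16, 16))   # hypothetical stand-in for a sublayer
block_out = residual_block(x, lambda h: h @ w)
print(block_out.shape)  # (10, 16)
print(np.allclose(block_out.mean(axis=-1), 0.0, atol=1e-6))  # True
```

The residual path lets the input pass through unchanged when the sublayer contributes little, which tends to make stacks of blocks easier to train.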

In [None]:
class TransformerBlock(layers.Layer):
    def __init__(self, d_model, num_heads, d_ff, dropout_rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = MultiHeadSelfAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)

    def call(self, x, training=False):
        attn_output = self.att(x)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

# Test TransformerBlock
sample_tb = TransformerBlock(d_model=16, num_heads=4, d_ff=32)
print('TransformerBlock output shape:', sample_tb(dummy_input, training=False).shape)

### 2.5 Tiny Transformer Language Model

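This model stacks a token embedding, the positional encoding, and several Transformer blocks, then applies a final dense layer that outputs one logit per vocabulary word at every position to predict the next token.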

In [None]:
class TinyTransformerLM(keras.Model):
    def __init__(self, vocab_size, d_model, num_heads, d_ff, num_layers, max_len):
        super(TinyTransformerLM, self).__init__()
        self.token_embedding = layers.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        self.transformer_layers = [TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)]
        self.dropout = layers.Dropout(0.1)
        self.final_layer = layers.Dense(vocab_size)

    def call(self, x, training=False):
        # x shape: (batch, seq_len)
        x = self.token_embedding(x)
        x = self.pos_encoding(x)
        for layer in self.transformer_layers:
            x = layer(x, training=training)
        x = self.dropout(x, training=training)
        logits = self.final_layer(x)
        return logits

# The model will be used later for training.

## 3. Preprocessing a Real-World Text

For this lab, you can extract text from a webpage. The example below shows how to do that.

### 3.1 Extracting Text from a Webpage

We'll use **Requests** and **BeautifulSoup** to extract text from a webpage.

In [None]:
import requests
from bs4 import BeautifulSoup

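def extract_text_from_webpage(url):
    # Fail fast on network problems or HTTP errors instead of parsing an error page
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup.get_text()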

text_data = extract_text_from_webpage("https://gist.githubusercontent.com/sgsinclair/f895f2b37cdee761ac08e4ed8cc83d58/raw/b675c976913a939b325f7f2ee4ab4b3b396edd35/CharlesDickens-OliverTwist.txt")

### 3.2 Preprocessing and Tokenization

For simplicity, we'll use word-level tokenization. In practice, you might use more advanced tokenizers.
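
To make the scheme concrete, here is a toy run of the same lowercase-and-split tokenization on a short made-up sentence (the `toy_` names and the sentence are illustrative only). Note that splitting on whitespace leaves punctuation attached to words, so `sat.` and `sat` would be different vocabulary entries.

```python
toy_text = "The cat sat. The cat ran."
toy_tokens = toy_text.lower().split()

# Vocabulary and lookup tables, built the same way as in the lab
toy_vocab = sorted(set(toy_tokens))
toy_word2idx = {word: idx for idx, word in enumerate(toy_vocab)}
toy_idx2word = {idx: word for word, idx in toy_word2idx.items()}

toy_encoded = [toy_word2idx[word] for word in toy_tokens]
toy_decoded = " ".join(toy_idx2word[idx] for idx in toy_encoded)

print(toy_vocab)    # ['cat', 'ran.', 'sat.', 'the']
print(toy_encoded)  # [3, 0, 2, 3, 0, 1]
print(toy_decoded)  # the cat sat. the cat ran.
```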

In [None]:
# Basic preprocessing: lowercasing and splitting into words
text_data = text_data.lower()
tokens = text_data.split()

# Build vocabulary (sorted set of words)
vocab = sorted(set(tokens))
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}

# Encode the entire text into a list of integers
encoded_text = [word2idx[word] for word in tokens]

print('Vocabulary size:', len(vocab))
print('Sample encoded text:', encoded_text[:10])

### 3.3 Creating Training Examples

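We'll use a sliding window approach. Each input is a window of `sequence_length` tokens, and its target is the same window shifted one token to the right, so the model learns to predict the next token at every position.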
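
A toy example with made-up token ids shows how the windows line up (the `toy_` names and values are illustrative only):

```python
toy_encoded = [10, 11, 12, 13, 14, 15]
toy_sequence_length = 3

toy_inputs, toy_targets = [], []
for i in range(len(toy_encoded) - toy_sequence_length):
    # Input window starting at i, target is the same window shifted by one
    toy_inputs.append(toy_encoded[i:i + toy_sequence_length])
    toy_targets.append(toy_encoded[i + 1:i + toy_sequence_length + 1])

for inp, tgt in zip(toy_inputs, toy_targets):
    print(inp, '->', tgt)
# [10, 11, 12] -> [11, 12, 13]
# [11, 12, 13] -> [12, 13, 14]
# [12, 13, 14] -> [13, 14, 15]
```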

In [None]:
sequence_length = 10  # Adjust as needed
inputs = []
targets = []

for i in range(len(encoded_text) - sequence_length):
    inputs.append(encoded_text[i:i+sequence_length])
    targets.append(encoded_text[i+1:i+sequence_length+1])

inputs = np.array(inputs)
targets = np.array(targets)

print('Number of training samples:', inputs.shape[0])

### 3.4 Creating a tf.data.Dataset

Now, we create a dataset for training.

In [None]:
batch_size = 64
buffer_size = 10000

dataset = tf.data.Dataset.from_tensor_slices((inputs, targets))
dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)

print('Dataset ready.')

## 4. Training the Model

Set hyperparameters, compile the model, and train it.

In [None]:
# Hyperparameters
vocab_size = len(vocab)
d_model = 128
num_heads = 4
d_ff = 256
num_layers = 2
max_len = sequence_length

# Instantiate the model
model = TinyTransformerLM(vocab_size, d_model, num_heads, d_ff, num_layers, max_len)

# Compile the model
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss=loss_object)

# Train the model
epochs = 5  # Increase if needed
model.fit(dataset, epochs=epochs)
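
To help interpret the loss values printed during training, the NumPy sketch below reimplements what sparse categorical cross-entropy on logits computes. It is an illustrative reimplementation, not Keras code, and the vocabulary size of 1000 is a made-up value. A model that predicts every word with equal probability scores `ln(vocab_size)`, so a training loss below that baseline means the model has learned something.

```python
import math
import numpy as np

def sparse_ce_from_logits(logits, targets):
    # logits: (n, vocab_size) raw scores, targets: (n,) integer class ids
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Average negative log-probability assigned to the correct tokens
    return float(-log_probs[np.arange(len(targets)), targets].mean())

toy_vocab_size = 1000                            # hypothetical vocabulary size
uniform_logits = np.zeros((3, toy_vocab_size))   # equal score for every word
toy_target_ids = np.array([1, 5, 42])

baseline_loss = sparse_ce_from_logits(uniform_logits, toy_target_ids)
print(round(baseline_loss, 3), round(math.log(toy_vocab_size), 3))  # 6.908 6.908
```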

## 5. Generating Text

After training, use the model to generate text from a seed phrase.

In [None]:
def generate_text(model, start_text, gen_length=100):
    # Tokenize the seed text, keeping only words that are in the vocabulary
    tokens = start_text.lower().split()
    input_ids = [word2idx[word] for word in tokens if word in word2idx]
    if not input_ids:
        raise ValueError("None of the seed words are in the vocabulary.")

    generated = tokens.copy()
    last_word = tokens[-1]
    for _ in range(gen_length):
        # Feed at most the last `sequence_length` tokens (the model's max_len)
        input_seq = tf.constant([input_ids[-sequence_length:]], dtype=tf.int32)
        predictions = model(input_seq, training=False)  # (1, seq_len, vocab_size)
        last_logits = predictions[0, -1, :]
        # Sample the next token id from the predicted distribution
        predicted_id = int(tf.random.categorical(tf.expand_dims(last_logits, 0), num_samples=1)[0, 0])
        new_word = idx2word[predicted_id]
        # Skip immediate repetitions of the same word in the output
        if new_word != last_word:
            generated.append(new_word)
        last_word = new_word
        # Keep ids as plain Python ints to avoid mixing int32 and int64 tensors
        input_ids.append(predicted_id)

    return " ".join(generated)

# Generate text using a seed phrase
generated_text = generate_text(model, start_text="Oliver leaned his head upon his hand and")
print(generated_text)
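
The generation loop samples the next word from the model's predicted distribution rather than always taking the most likely word. The NumPy sketch below contrasts the two and adds a `temperature` parameter, which is not part of the lab code and is introduced here only as an assumption to show how sampling can be made more or less random.

```python
import numpy as np

def next_token_probs(logits, temperature=1.0):
    # Lower temperature sharpens the distribution, higher temperature flattens it
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max()
    probs = np.exp(scaled)
    return probs / probs.sum()

rng = np.random.default_rng(0)
toy_logits = np.array([2.0, 1.0, 0.1])  # made-up scores for a 3-word vocabulary

probs = next_token_probs(toy_logits)
sampled_id = int(rng.choice(len(probs), p=probs))  # random draw, as in the lab
greedy_id = int(np.argmax(toy_logits))             # always the top-scoring word
sharp_probs = next_token_probs(toy_logits, temperature=0.1)

print(np.round(probs, 3))
print(np.round(sharp_probs, 3))
print(greedy_id)  # 0
```

Greedy decoding tends to repeat itself on small models, while sampling gives more varied (and sometimes less coherent) text.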

## 6. Discussion & Experimentation

- Try changing hyperparameters (like `num_layers`, `d_model`, `sequence_length`, etc.)
- Experiment with different text data (e.g., a chapter from a public-domain book)
- Discuss how scaling the model and dataset affects performance
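
When experimenting with model size, a rough parameter count can guide the discussion. The sketch below estimates trainable parameters for the architecture defined in this lab (embedding, per-block attention and feed-forward layers with biases, two layer norms, and the output layer). The function name and the vocabulary size of 20,000 are assumptions for illustration; `model.summary()` reports the exact count for a built model.

```python
def estimate_trainable_params(vocab_size, d_model, d_ff, num_layers):
    embedding = vocab_size * d_model
    attention = 4 * (d_model * d_model + d_model)            # wq, wk, wv, output dense
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                          # scale and offset, twice
    block = attention + feed_forward + layer_norms
    output_layer = d_model * vocab_size + vocab_size
    return embedding + num_layers * block + output_layer

# Hyperparameters from section 4, with a hypothetical vocabulary size
total = estimate_trainable_params(vocab_size=20000, d_model=128, d_ff=256, num_layers=2)
blocks_only = 2 * (4 * (128 * 128 + 128) + (128 * 256 + 256) + (256 * 128 + 128) + 4 * 128)
print(total)        # 5404960
print(blocks_only)  # 264960
```

Under these assumptions the embedding and output layers dominate the count, so with word-level tokenization the vocabulary size matters more than the number of Transformer blocks.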

Happy coding and experimenting!