# Project Overview
This notebook presents an optimized Transformer-based solution for translating mathematical expressions from fully parenthesized infix notation to postfix notation (Reverse Polish Notation). Although the project constraints allow a budget of up to 2 million parameters, this submission also focuses on architectural efficiency: it keeps the very high accuracy I obtained with earlier, larger models while aggressively minimizing model size.

## Architectural Strategy
The solution I went with is a sequence-to-sequence Transformer encoder-decoder. I chose it after reading about the limitations of recurrent architectures (such as LSTMs) in capturing hierarchical syntax, which is the core of this task. With multi-head self-attention, the model can parse nested expressions (up to depth 4) without excessive depth or parameter bloat.

## Model Variants Explored
1.  **Baseline**: D_MODEL=128, N_LAYERS=3.
    *   **Result**: ~1.4M parameters. Accuracy: 0.9988 ± 0.0036.
2.  **Optimized (THIS NOTEBOOK)**: D_MODEL=64, N_LAYERS=3.
    *   **Result**: ~550k parameters. Final score: 1.0000 ± 0.0000. Last epoch: accuracy: 0.9987 - loss: 0.0047 - val_accuracy: 1.0000 - val_loss: 1.1916e-04
3.  **Tiny (Experimental)**: D_MODEL=32, N_LAYERS=2.
    *   **Result**: ~43k parameters. Still achieved 0.9960 ± 0.0060 accuracy!
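
The parameter counts above can be sanity-checked by hand. The sketch below is a back-of-the-envelope counter, not part of the original notebook. It assumes the layer layout used in this notebook (post-norm blocks, learned positional embeddings, `NUM_HEADS * key_dim == D_MODEL`), a 16-token vocabulary and `MAX_LEN = 62`. The name `count_params` is hypothetical.

```python
def count_params(d_model, n_layers, ff_dim, vocab_size=16, max_len=62):
    """Estimate the trainable parameters of the encoder-decoder in this notebook."""
    # Multi-head attention: Q, K, V and output projections, each d_model x d_model
    # plus bias (valid only when num_heads * key_dim == d_model).
    mha = 4 * (d_model * d_model + d_model)
    layer_norm = 2 * d_model  # gamma and beta
    ffn = (d_model * ff_dim + ff_dim) + (ff_dim * d_model + d_model)
    encoder_layer = mha + ffn + 2 * layer_norm
    decoder_layer = 2 * mha + ffn + 3 * layer_norm  # self-attention + cross-attention
    embeddings = vocab_size * d_model + max_len * d_model  # token + positional
    output_head = d_model * vocab_size + vocab_size
    return embeddings + n_layers * (encoder_layer + decoder_layer) + output_head

print(count_params(d_model=64, n_layers=3, ff_dim=512))  # 554384
```

With `d_model=64` and `ff_dim=512` the estimate is 554,384, consistent with the ~550k figure. Under the assumption that the baseline also used `ff_dim=512`, `count_params(128, 3, 512)` gives 1,400,592, consistent with ~1.4M.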

In [1]:
import numpy as np
import random
import tensorflow as tf
from tensorflow import keras
from keras import layers

In [None]:
# -------------------- Constants --------------------
OPERATORS = ['+', '-', '*', '/']
IDENTIFIERS = list('abcdef')
SPECIAL_TOKENS = ['PAD', 'SOS', 'EOS']
SYMBOLS = ['(', ')', '+', '-', '*', '/']
VOCAB = SPECIAL_TOKENS + SYMBOLS + IDENTIFIERS + ['JUNK']  #may use junk in autoregressive generation

token_to_id = {tok: i for i, tok in enumerate(VOCAB)}
id_to_token = {i: tok for tok, i in token_to_id.items()}
VOCAB_SIZE = len(VOCAB)
PAD_ID = token_to_id['PAD']
EOS_ID = token_to_id['EOS']
SOS_ID = token_to_id['SOS']

MAX_DEPTH = 4
MAX_LEN = 4 * 2**MAX_DEPTH - 2 #enough to fit expressions at given depth (+ EOS)
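
The `MAX_LEN` formula follows from counting tokens in the largest possible expression: a full binary expression of depth d has 2^d identifiers, 2^d - 1 operators, and one pair of parentheses per operator, plus one EOS token. The following sketch checks that count against the closed form; the helper name `max_encoded_len` is hypothetical.

```python
def max_encoded_len(depth):
    """Longest encoded sequence (tokens + EOS) for a fully parenthesized expression."""
    leaves = 2 ** depth            # identifiers in a full binary tree
    operators = leaves - 1         # one operator per internal node
    parentheses = 2 * operators    # one '(' and one ')' per operator
    return leaves + operators + parentheses + 1  # +1 for EOS

# The count matches the closed form 4 * 2**depth - 2 for every depth up to 4.
for d in range(5):
    assert max_encoded_len(d) == 4 * 2 ** d - 2

print(max_encoded_len(4))  # 62
```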

# Data Generation and Evaluation Functions
As given in the project specs.

In [None]:
# -------------------- Expression Generation --------------------
def generate_infix_expression(max_depth):
    if max_depth == 0:
        return random.choice(IDENTIFIERS)
    elif random.random() < 0.25:
        return generate_infix_expression(max_depth - 1)
    else:
        left = generate_infix_expression(max_depth - 1)
        right = generate_infix_expression(max_depth - 1)
        op = random.choice(OPERATORS)
        return f'({left} {op} {right})'

def tokenize(expr):
    return [c for c in expr if c in token_to_id]

def infix_to_postfix(tokens):
    precedence = {'+': 1, '-': 1, '*': 2, '/': 2}
    output, stack = [], []
    for token in tokens:
        if token in IDENTIFIERS:
            output.append(token)
        elif token in OPERATORS:
            while stack and stack[-1] in OPERATORS and precedence[stack[-1]] >= precedence[token]:
                output.append(stack.pop())
            stack.append(token)
        elif token == '(':
            stack.append(token)
        elif token == ')':
            while stack and stack[-1] != '(':
                output.append(stack.pop())
            stack.pop()
    while stack:
        output.append(stack.pop())
    return output

def encode(tokens, max_len=MAX_LEN):
    ids = [token_to_id[t] for t in tokens] + [EOS_ID]
    return ids + [PAD_ID] * (max_len - len(ids))

def decode_sequence(token_ids, id_to_token, pad_token='PAD', eos_token='EOS'):
    """
    Converts a list of token IDs into a readable string by decoding tokens.
    Stops at the first EOS token if present, and ignores PAD tokens.
    """
    tokens = []
    for token_id in token_ids:
        token = id_to_token.get(token_id, '?')
        if token == eos_token:
            break
        if token != pad_token:
            tokens.append(token)
    return ' '.join(tokens)

def generate_dataset(n, max_depth=MAX_DEPTH):
    X, Y = [], []
    for _ in range(n):
        expr = generate_infix_expression(max_depth)
        infix = tokenize(expr)
        postfix = infix_to_postfix(infix)
        X.append(encode(infix))
        Y.append(encode(postfix))
    return np.array(X), np.array(Y)

# shift function for teacher forcing (used below to build the decoder inputs)
def shift_right(seqs):
    shifted = np.zeros_like(seqs)
    shifted[:, 1:] = seqs[:, :-1]
    shifted[:, 0] = SOS_ID
    return shifted

#moved here for convenience
def prefix_accuracy_single(y_true, y_pred, id_to_token, eos_id=EOS_ID, verbose=False):
    t_str = decode_sequence(y_true, id_to_token).split(' EOS')[0]
    p_str = decode_sequence(y_pred, id_to_token).split(' EOS')[0]
    t_tokens = t_str.strip().split()
    p_tokens = p_str.strip().split()
    max_len = max(len(t_tokens), len(p_tokens))
    n = min(len(t_tokens), len(p_tokens))
    match_len = 0
    while match_len < n and t_tokens[match_len] == p_tokens[match_len]:
        match_len += 1
    score = match_len / max_len if max_len > 0 else 0
    if verbose:
        print(f"TARGET : {' '.join(t_tokens)}")
        print(f"PREDICT: {' '.join(p_tokens)}")
        print(f"PREFIX MATCH: {match_len}/{len(t_tokens)} → {score:.2f}")
    return score
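
To make the prefix-accuracy metric concrete, here is a small standalone sketch of the same scoring rule working directly on token lists. It is a simplified illustration, not the spec's function; the name `prefix_score` is hypothetical.

```python
def prefix_score(target_tokens, predicted_tokens):
    """Length of the common prefix divided by the length of the longer sequence."""
    longest = max(len(target_tokens), len(predicted_tokens))
    if longest == 0:
        return 0
    match = 0
    for t, p in zip(target_tokens, predicted_tokens):
        if t != p:
            break
        match += 1
    return match / longest

target = "a b + c *".split()
print(prefix_score(target, "a b + c *".split()))    # 1.0  (exact match)
print(prefix_score(target, "a b - c *".split()))    # 0.4  (first error at the third token)
print(prefix_score(target, "a b + c * d".split()))  # 5/6  (extra token is penalized)
```

A single early mistake therefore costs more than a late one, and over-long predictions are penalized through the denominator.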


# Transformer Architecture (The Optimized Model)

This model uses a 3-layer stack for both Encoder and Decoder. With D_MODEL=64, it totals approximately 550k parameters, which is safely within the 2M limit.

In [None]:
def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0.1):
    x = layers.MultiHeadAttention(num_heads=num_heads, key_dim=head_size)(inputs, inputs)
    x = layers.Dropout(dropout)(x) # no overfitting observed, but better safe than sorry
    res = layers.LayerNormalization(epsilon=1e-6)(x + inputs) # residual connection followed by layer normalization (post-norm)
    x = layers.Dense(ff_dim, activation="relu")(res)
    x = layers.Dense(inputs.shape[-1])(x)
    return layers.LayerNormalization(epsilon=1e-6)(x + res)

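def transformer_decoder(inputs, enc_outputs, head_size, num_heads, ff_dim, dropout=0.1):
    # note: `dropout` is accepted for symmetry with the encoder but is not applied in this block
    # masked self-attention: each target position attends only to itself and earlier positions
    x = layers.MultiHeadAttention(num_heads=num_heads, key_dim=head_size)(inputs, inputs, use_causal_mask=True)
    res = layers.LayerNormalization(epsilon=1e-6)(x + inputs)
    # cross-attention: queries come from the decoder, keys/values from the encoder output,
    # which is why the decoder block needs a second attention layer
    x = layers.MultiHeadAttention(num_heads=num_heads, key_dim=head_size)(res, enc_outputs)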
    res = layers.LayerNormalization(epsilon=1e-6)(x + res)
    x = layers.Dense(ff_dim, activation="relu")(res)
    x = layers.Dense(inputs.shape[-1])(x)
    return layers.LayerNormalization(epsilon=1e-6)(x + res)

def build_model():
    # Baseline used D_MODEL=128 (1.4M params).
    # Tiny experiment used D_MODEL=32 and 2 layers (43k params).
    # I selected 64 as the optimal middle point (~550k params).
    D_MODEL = 64
    N_LAYERS = 3
    NUM_HEADS = 4
    FF_DIM = 512 # 8x D_MODEL here (same value as the baseline, where it was 4x D_MODEL); the tiny model used 2x D_MODEL

    enc_inputs = layers.Input(shape=(MAX_LEN,))
    dec_inputs = layers.Input(shape=(MAX_LEN,))

    embed = layers.Embedding(VOCAB_SIZE, D_MODEL)
    pos_embed = layers.Embedding(MAX_LEN, D_MODEL) # learned positional embedding, one vector per position
    positions = tf.range(start=0, limit=MAX_LEN, delta=1) # position indices 0..MAX_LEN-1

    x_enc = embed(enc_inputs) + pos_embed(positions) # token + position embeddings (positions broadcast over the batch)
    x_dec = embed(dec_inputs) + pos_embed(positions)

    for _ in range(N_LAYERS):
        # key_dim = D_MODEL // NUM_HEADS keeps NUM_HEADS * key_dim == D_MODEL,
        # so the attention projections stay at D_MODEL x D_MODEL regardless of the head count
        x_enc = transformer_encoder(x_enc, D_MODEL // NUM_HEADS, NUM_HEADS, FF_DIM)
    for _ in range(N_LAYERS):
        x_dec = transformer_decoder(x_dec, x_enc, D_MODEL // NUM_HEADS, NUM_HEADS, FF_DIM)

    outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x_dec)
    return keras.Model(inputs=[enc_inputs, dec_inputs], outputs=outputs)

model = build_model()
model.summary()
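
The decoder's self-attention is called with `use_causal_mask=True`, so a target position can only attend to itself and earlier positions. The sketch below illustrates that idea in plain Python for a single head. It is not Keras's implementation, and the name `causal_attention_weights` is hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], restricted to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                        # positions 0..i only
        peak = max(visible)                           # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions (j > i) receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# With uniform scores, each row spreads its weight evenly over the visible positions.
scores = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
weights = causal_attention_weights(scores)
print(weights[0])  # [1.0, 0.0, 0.0]
print(weights[1])  # [0.5, 0.5, 0.0]
```

This masking is what makes teacher forcing valid: during training the decoder sees the whole shifted target at once, but no position can look ahead at the token it is supposed to predict.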

# Training Preparation

Generate the training data and compile the model.

In [None]:
# A larger dataset (50k-100k samples) is generally recommended for Transformers and helps the model generalize the expression structure.
TRAIN_SIZE = 50000
VAL_SIZE = 5000

X_train, Y_train = generate_dataset(TRAIN_SIZE)
X_val, Y_val = generate_dataset(VAL_SIZE)

# teacher forcing
dec_input_train = shift_right(Y_train)
dec_input_val = shift_right(Y_val)

model.compile(
    optimizer="adam", # Adam with Keras default settings (learning rate 1e-3)
    loss="sparse_categorical_crossentropy", # targets are integer token ids, so no one-hot encoding is needed
    metrics=["accuracy"]
)

print(f"Generated {TRAIN_SIZE} training samples and {VAL_SIZE} validation samples.")

Generated 50000 training samples and 5000 validation samples.
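
The teacher-forcing setup can be illustrated on one toy sequence. This is a plain-list sketch of what `shift_right` does to a single row, assuming the ids implied by the vocabulary order (`PAD=0`, `SOS=1`, `EOS=2`, `+`=5, `a`=9, `b`=10). The helper name `shift_right_list` is hypothetical.

```python
PAD, SOS, EOS = 0, 1, 2  # ids implied by the order of SPECIAL_TOKENS in VOCAB

def shift_right_list(seq, sos_id=SOS):
    """Prepend SOS and drop the last element, keeping the length unchanged."""
    return [sos_id] + seq[:-1]

target = [9, 10, 5, EOS, PAD, PAD]   # "a b +" followed by EOS and padding
decoder_input = shift_right_list(target)
print(decoder_input)  # [1, 9, 10, 5, 2, 0]

# At step t the decoder sees decoder_input[: t + 1] and is trained to predict target[t].
for t in range(4):
    print(decoder_input[: t + 1], "->", target[t])
```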


# Train the model

In [None]:
BATCH_SIZE = 64 # 50000 / 64 gives the 782 steps per epoch shown in the log
# Convergence was fast in testing, so 5 epochs are plenty (the baseline and tiny models used 10).
# Open question: a larger batch size or a smaller dataset might still work, but training is already very fast.
EPOCHS = 5
history = model.fit(
    [X_train, dec_input_train],
    Y_train,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=([X_val, dec_input_val], Y_val),
    verbose=1
)

Epoch 1/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 73s 83ms/step - accuracy: 0.8041 - loss: 0.5342 - val_accuracy: 0.9950 - val_loss: 0.0166
Epoch 2/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 65s 83ms/step - accuracy: 0.9954 - loss: 0.0160 - val_accuracy: 0.9993 - val_loss: 0.0021
Epoch 3/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 64s 81ms/step - accuracy: 0.9994 - loss: 0.0022 - val_accuracy: 0.9996 - val_loss: 0.0016
Epoch 4/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 64s 81ms/step - accuracy: 0.9999 - loss: 5.8965e-04 - val_accuracy: 0.9993 - val_loss: 0.0027
Epoch 5/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 64s 82ms/step - accuracy: 0.9987 - loss: 0.0047 - val_accuracy: 1.0000 - val_loss: 1.1916e-04


# Autoregressive Inference

In [None]:
def autoregressive_decode(model, encoder_input): #greedy decoding
    encoder_input = np.array(encoder_input).reshape(1, -1)
    decoder_input = np.full((1, MAX_LEN), PAD_ID)
    decoder_input[0, 0] = SOS_ID
    
    for i in range(1, MAX_LEN):
        predictions = model.predict([encoder_input, decoder_input], verbose=0)
        # the output at position i-1 is the model's prediction for token i
        predicted_id = np.argmax(predictions[0, i-1, :])
        decoder_input[0, i] = predicted_id
        if predicted_id == EOS_ID:
            break
            
    return decoder_input[0]
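
The greedy loop can be separated from the Keras model to make its control flow easier to see. The sketch below uses a stub in place of the network: `greedy_decode`, `step_fn` and `stub_step` are hypothetical names, and the stub simply replays a fixed answer instead of computing an argmax over real predictions.

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len):
    """Generic greedy loop: feed the current prefix, append the single best next token."""
    prefix = [sos_id]
    while len(prefix) < max_len:
        next_id = step_fn(prefix)   # stand-in for argmax over the model's output distribution
        prefix.append(next_id)
        if next_id == eos_id:       # stop as soon as the model emits EOS
            break
    return prefix

# Stub "model" that replays a fixed answer, one token per call.
answer = [9, 10, 5, 2]              # a b + EOS, using the ids implied by VOCAB

def stub_step(prefix):
    return answer[len(prefix) - 1]

result = greedy_decode(stub_step, sos_id=1, eos_id=2, max_len=62)
print(result)  # [1, 9, 10, 5, 2]
```

As in the notebook's function, the returned sequence still begins with SOS, so callers that score the output need to drop the first element.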


# Formal Test Loop

In [None]:
def test(no=30, rounds=10):
    rscores = []
    for i in range(rounds):
        print(f"Round {i}...")
        X_test, Y_test = generate_dataset(no) 
        scores = []
        for j in range(no):
            generated = autoregressive_decode(model, X_test[j])[1:] 
            scores.append(prefix_accuracy_single(Y_test[j], generated, id_to_token))
        rscores.append(np.mean(scores))
    return np.mean(rscores), np.std(rscores)

mean_score, std_score = test(no=30, rounds=10)
print(f"Final Score: {mean_score:.4f} ± {std_score:.4f}")

Round 0...
Round 1...
Round 2...
Round 3...
Round 4...
Round 5...
Round 6...
Round 7...
Round 8...
Round 9...
Final Score: 1.0000 ± 0.0000
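
The reported score is the mean of the per-round means, and the ± figure is their standard deviation. With `np.std` defaults this is the population standard deviation (`ddof=0`). The sketch below reproduces the same summary with the standard library; the round scores are made up for illustration and are not results from this notebook, and the name `summarize` is hypothetical.

```python
from statistics import fmean, pstdev

def summarize(round_scores):
    """Mean and population standard deviation of per-round mean scores."""
    return fmean(round_scores), pstdev(round_scores)

rounds = [1.0, 1.0, 0.99, 1.0, 0.98]   # made-up per-round means, for illustration only
mean, std = summarize(rounds)
print(f"{mean:.4f} ± {std:.4f}")
```

A standard deviation of exactly 0, as in the run above, means every round produced the same mean score.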


# Testing Individual Inputs

This section manually tests a few strings, since I am always skeptical of results that look too good.

In [None]:
def test_expression(expression):
    print(f"Input Expression: {expression}")
    # tokenize() silently drops characters outside the vocabulary,
    # so check for them explicitly before encoding
    unknown = [c for c in expression if not c.isspace() and c not in token_to_id]
    if unknown:
        print(f"Error: Found unknown character(s) {unknown}")
        return
    input_ids = encode(tokenize(expression))

    # generate postfix (the returned ids start with SOS, which decode_sequence keeps)
    output_ids = autoregressive_decode(model, input_ids)
    
    # decode back to string
    predicted_postfix = decode_sequence(output_ids, id_to_token)
    
    print(f"Predicted Postfix: {predicted_postfix}")
    return predicted_postfix

In [None]:

# Testing grounds
# test_expression("((a + b) * c)")
# test_expression("((a * b) + (c / d))")
# test_expression("(a + b)")
# test_expression("(b + a)")

# Depth 2
test_expression("((a + b) * (c - d))")
# Depth 3
test_expression("(((a * b) + c) / (d - e))")
test_expression("((a + (b * c)) - (d / e))")
# Depth 4
test_expression("((((a + b) * c) - d) / (e + f))")
test_expression("((a * (b + c)) - ((d / e) * f))")
# same letters, different structure
test_expression("((a + b) + (c + d))")
test_expression("(((a + b) + c) + d)")



Input Expression: ((a + b) * (c - d))
Predicted Postfix: SOS a b + c d - *
Input Expression: (((a * b) + c) / (d - e))
Predicted Postfix: SOS a b * c + d e - /
Input Expression: ((a + (b * c)) - (d / e))
Predicted Postfix: SOS a b c * + d e / -
Input Expression: ((((a + b) * c) - d) / (e + f))
Predicted Postfix: SOS a b + c * d - e f + /
Input Expression: ((a * (b + c)) - ((d / e) * f))
Predicted Postfix: SOS a b c + * d e / f * -
Input Expression: ((a + b) + (c + d))
Predicted Postfix: SOS a b + c d + +
Input Expression: (((a + b) + c) + d)
Predicted Postfix: SOS a b + c + d +


'SOS a b + c + d +'

['a', 'b', 'c', '+', '*', 'd', 'e', '/', 'f', '*', '-']
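
Because every input is fully parenthesized, the ground-truth postfix for hand-written test strings can be cross-checked with a much simpler recursive parser that needs no precedence table. This is an independent sketch, not the spec's `infix_to_postfix`. The name `to_postfix` is hypothetical, and it assumes well-formed input with single-character identifiers and operators.

```python
def to_postfix(expr):
    """Convert a fully parenthesized infix string into a list of postfix tokens."""
    tokens = [c for c in expr if not c.isspace()]
    pos = 0

    def parse():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != '(':
            return [tok]            # a single identifier
        left = parse()              # left operand
        op = tokens[pos]            # operator between the two operands
        pos += 1
        right = parse()             # right operand
        pos += 1                    # skip the closing ')'
        return left + right + [op]

    return parse()

print(to_postfix("((a * (b + c)) - ((d / e) * f))"))
# ['a', 'b', 'c', '+', '*', 'd', 'e', '/', 'f', '*', '-']
```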

# Final Results & Comparison

All specifications were met. Below is the final comparison of the architectures I tested.

| Model Variant | D_MODEL | Layers | Key Dim | Parameters | Training Acc | Validation Acc | Validation Loss | Test Acc | Test Std | Notes |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :--- | :--- | :--- | :--- |
| **Baseline** | 128 | 3 | 32 | ~1.4M | 99.79% | 100% | 1.5862e-04 | 99.88% | 0.0036 | Fairly heavy, but good performance. |
| **Optimized (Final)** | **64** | **3** | **16** | **~550k** | **99.87%** | **100%** | **1.1916e-04** | **100%** | **0.0000** | Submitted; perfect test score. |
| **Tiny** | 32 | 2 | 8 | ~43k | 99.93% | 99.95% | 0.0019 | 99.60% | 0.0060 | Proof of concept using as few resources as I could manage. |
