# Abstract

Transformer models achieve state-of-the-art performance in NLP but incur substantial computational and memory costs that scale with depth. This study investigates whether recurrent transformers, which reuse layers across multiple iterations, can match the performance of standard transformers while using significantly fewer parameters. We compare a baseline transformer (6 layers, ~26M parameters) against a recurrent transformer (3 shared layers × 2 iterations, ~11M parameters), both incorporating modern architectural components: Flash Attention, SwiGLU activation, Rotary Position Embedding (RoPE), and RMSNorm. Experiments on sentiment classification (SST-2, Yelp) and multi-domain review classification show that the recurrent architecture achieves comparable or superior accuracy while reducing parameter count by approximately 58%. The recurrent model is particularly strong on longer sequences, suggesting that iterative refinement helps capture extended contextual dependencies. We further evaluate deployment efficiency through FP16 quantization experiments using a width-matched recurrent variant (~18.8M parameters), finding that both architectures maintain accuracy under half-precision inference while halving model size. The width-matched recurrent transformer with FP16 quantization achieves a 64% total size reduction compared to the full-precision baseline, offering the smallest deployment footprint without accuracy degradation. These findings support the hypothesis that weight sharing through recurrence can substitute for distinct layer stacking in encoder-based classification, with practical implications for deploying transformer models in resource-constrained environments.


# Introduction

Transformer models have become the dominant architecture in natural language processing, achieving state-of-the-art results across a wide range of tasks. However, their computational and memory costs scale with model depth, creating deployment challenges in resource-constrained environments. This tension between model capacity and efficiency has motivated research into parameter-efficient architectures that can match the performance of larger models while using substantially fewer parameters.

## Background

Recent advances in depth-recurrent transformer architectures offer a promising direction for model compression. The Universal Transformer [1] introduced the concept of applying the same transformation block iteratively, effectively trading parameters for computation. More recent work has extended this idea: the Mixture-of-Recursions (MoR) framework [2] combines parameter sharing with adaptive computation, dynamically allocating depth per token. Similarly, the Tiny Recursive Model (TRM) [5] demonstrates that a compact ~7M parameter network with recursive reasoning can outperform much larger models on complex tasks like ARC-AGI benchmarks.

These developments are particularly relevant in the context of reasoning-intensive tasks, where deeper processing has been shown to improve performance. By reusing layers across multiple iterations, recurrent transformers can achieve an effective depth that exceeds their physical layer count, enabling smaller models to develop richer representations without the parameter overhead of traditional deep networks.

Modern transformer implementations also benefit from architectural innovations such as Rotary Position Embedding (RoPE) [3], which encodes relative positional information directly into attention computations, and SwiGLU activation functions [4], which provide improved gradient flow through gated linear units. These components have become standard in efficient transformer designs.

## Research Objective

This study investigates whether a recurrent transformer with shared weights can achieve classification performance comparable to a standard (non-recurrent) transformer baseline while using significantly fewer parameters. We additionally examine how both architectures respond to post-training quantization, a critical consideration for practical deployment. Specifically, we compare:

1. A **baseline transformer** with 6 distinct layers (~26M parameters)
2. A **recurrent transformer** with 3 shared layers applied over 2 iterations (~11M parameters)
3. **FP16-quantized versions** of both architectures to evaluate deployment efficiency

Both architectures incorporate identical modern components (Flash Attention, SwiGLU, RoPE, RMSNorm) to ensure that performance differences reflect the structural choice of depth stacking versus iterative refinement rather than implementation details.

## Contributions

Our experiments across sentiment classification (SST-2, Yelp) and multi-domain classification tasks demonstrate that:

1. The recurrent transformer achieves comparable or superior accuracy to the baseline while reducing parameter count by approximately 58%.

2. Both architectures maintain full accuracy under FP16 quantization, with the FP16 recurrent model achieving the smallest deployment footprint (36 MB vs. 99 MB for the FP32 baseline).

3. The compound benefit of weight sharing and quantization enables a **64% total size reduction** without accuracy degradation, supporting practical deployment in memory-constrained environments.

These findings support the hypothesis that iterative depth can serve as an effective substitute for distinct layers in encoder-based classification tasks.


# Theory

This section introduces the key architectural components that differentiate our transformer implementations from the original design. Both the baseline and recurrent models share these modern enhancements.

## SwiGLU Activation

The feed-forward network in each transformer layer uses SwiGLU activation instead of the standard ReLU or GELU:

$$\text{SwiGLU}(x) = \text{Swish}(W_1 x) \odot (W_2 x)$$

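where $\text{Swish}_\beta(x) = x \cdot \sigma(\beta x)$ and $\odot$ denotes element-wise multiplication. SiLU is the special case with $\beta = 1$. Unlike a standard FFN with two projection matrices, the SwiGLU feed-forward network uses three: $W_1$ (gate) and $W_2$ (value) appear in the expression above, and an output projection $W_3$ maps the gated result back to the model dimension, so that $\text{FFN}(x) = W_3\,\text{SwiGLU}(x)$. The gating mechanism improves expressiveness while maintaining smooth gradients for negative inputs.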
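As a minimal sketch of the gated feed-forward computation, the following pure-Python example takes $W_3$ to be the output projection, as in the GLU-variant formulation [4]. The helper names (`silu`, `matvec`, `swiglu_ffn`), the toy weights, and the omission of bias terms are illustrative assumptions, not the implementation used in the experiments.

```python
import math

def silu(z: float) -> float:
    """Swish with beta = 1: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W1, W2, W3):
    """Gated FFN: W3 @ (silu(W1 @ x) * (W2 @ x)). Bias terms are omitted."""
    gate = [silu(g) for g in matvec(W1, x)]
    value = matvec(W2, x)
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(W3, hidden)

# Toy sizes: model width 2, intermediate width 3 (hypothetical values).
x = [1.0, -2.0]
W1 = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
y = swiglu_ffn(x, W1, W2, W3)
```

The output `y` has the same width as `x`, while the element-wise product happens in the wider intermediate space.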

## Rotary Position Embedding (RoPE)

RoPE encodes positional information by applying rotations to the **query (Q) and key (K) vectors only**; value vectors remain unrotated. A $d$-dimensional vector (with $d$ even) is partitioned into $d/2$ pairs, each rotated at a different frequency:

$$\theta_i = 10000^{-2i/d}, \quad i \in \{0, 1, \ldots, \tfrac{d}{2}-1\}$$

Each consecutive pair $(x_{2i}, x_{2i+1})$ is treated as a complex number and rotated by angle $m\theta_i$ at position $m$:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$

**Key Property:** When computing attention $q_m^T k_n$, the rotation matrices satisfy $(R_m q)^T (R_n k) = q^T R_{n-m} k$. This follows from orthogonality ($R_m^T = R_{-m}$) and composition ($R_m R_n = R_{m+n}$) properties of rotation matrices. The absolute positions cancel out, leaving only the relative distance $(n-m)$, which enables better generalization to longer sequences.
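The relative-position property can be checked numerically. The sketch below rotates consecutive pairs with the frequencies $\theta_i$ defined above and compares two attention scores that share the same offset $n - m$ but sit at different absolute positions. The function names and the toy vectors are illustrative assumptions.

```python
import math

def rope(x, m, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by angle m * theta_i."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = math.cos(m * theta), math.sin(m * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset n - m = 3 at two different absolute positions.
score_a = dot(rope(q, 2), rope(k, 5))
score_b = dot(rope(q, 40), rope(k, 43))
```

Up to floating-point error, `score_a` and `score_b` agree, since only the offset between the two positions enters the dot product.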

## RMSNorm

We use RMSNorm instead of LayerNorm for normalization:

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma$$

RMSNorm omits the mean-centering step of LayerNorm, using only root-mean-square normalization with a learnable scale parameter $\gamma$. This simplification provides approximately 10–15% faster computation while achieving similar training dynamics.
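A minimal sketch of the normalization formula above follows. The value of `eps` and the unit `gamma` are assumptions chosen for illustration.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by gamma."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]

x = [3.0, -4.0, 0.0, 5.0]
y = rms_norm(x, gamma=[1.0] * 4)

# The output has (approximately) unit mean square but is not mean-centered.
mean_sq_y = sum(v * v for v in y) / len(y)
mean_y = sum(y) / len(y)
```

Here `mean_sq_y` is close to 1 while `mean_y` stays non-zero, which reflects the omitted mean-centering step.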


# Datasets & Data Preparation

To assess the generalization capability and computational efficiency of the recurrent transformer architecture, three review-classification datasets were employed: the Stanford Sentiment Treebank (SST-2), Yelp Reviews, and a composite Multi-Domain corpus. These datasets collectively span short-sequence, long-sequence, and cross-domain linguistic settings; the first two are labeled by sentiment and the third by source domain. For comparability, all datasets were standardized to identical training, validation, and test sizes (54,576 / 6,822 / 6,823).

## Dataset Characteristics

### SST-2 (Short-Sequence Domain)

SST-2 consists of concise movie review excerpts, with sequence lengths concentrated between 0 and 50 tokens. This dataset emphasizes sentiment cues embedded in short syntactic patterns. Label distributions across all splits exhibit a mild positive skew.

Table 1. SST-2 Label Distribution

| Split | Label 0 (%) | Label 1 (%) | n |
|-------|-------------|-------------|---------|
| Train | 44.11 | 55.89 | 54,576 |
| Validation | 44.91 | 55.09 | 6,822 |
| Test | 45.04 | 54.96 | 6,823 |

### Yelp Reviews (Long-Sequence Domain)

Yelp reviews include substantially longer paragraphs, with some sequences reaching approximately 1,000 tokens. This dataset enables evaluating the model’s ability to capture long-range dependencies. The class distribution is nearly symmetric, minimizing confounding effects arising from label imbalance.

Table 2. Yelp Label Distribution

| Split | Label 0 (%) | Label 1 (%) | n |
|-------|-------------|-------------|---------|
| Train | 50.08 | 49.92 | 54,576 |
| Validation | 51.28 | 48.72 | 6,822 |
| Test | 51.08 | 48.92 | 6,823 |

### Multi-Domain Dataset (Composite Setting)

This dataset integrates samples drawn from local business reviews, movie reviews, and online shopping reviews. Category proportions were strictly controlled to maintain balanced domain representation across splits. Text lengths span a wide range, including many long sequences.

Table 3. Multi-Domain Category Distribution

| Split | Local business (%) | Movie review (%) | Online shopping (%) | n |
|-------|---------------------|------------------|----------------------|---------|
| Train | 33.33 | 33.33 | 33.33 | 54,576 |
| Validation | 33.33 | 33.33 | 33.33 | 6,822 |
| Test | 33.33 | 33.33 | 33.33 | 6,823 |

## Preprocessing Pipeline

A unified text preprocessing procedure was applied to all datasets to ensure consistency and reduce non-semantic noise before tokenization. The steps were as follows:

1. Normalization of text, including lowercasing and standardization of whitespace to remove redundant spacing or line breaks.  
2. Removal of non-linguistic artifacts such as HTML tags and URL patterns commonly found in web-derived data.  
3. Standardization of punctuation by collapsing repeated punctuation marks (e.g., sequences of exclamation points or ellipses) into single instances, preventing artificial inflation of sequence length.

This preprocessing strategy ensures that models operate over semantically meaningful content and reduces variability introduced by formatting irregularities or source-specific artifacts.
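The three steps above can be sketched with the standard `re` module. The specific regular expressions and the function name `preprocess` are assumptions made for illustration; the patterns actually used are not specified in the text.

```python
import re

def preprocess(text: str) -> str:
    # Step 1: lowercase and collapse whitespace runs (including line breaks).
    text = re.sub(r"\s+", " ", text.lower()).strip()
    # Step 2: drop HTML tags and URL patterns.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)
    # Step 3: collapse repeated punctuation marks into a single instance.
    text = re.sub(r"([!?.,;:])\1+", r"\1", text)
    # Removing tags and URLs can leave double spaces, so tidy once more.
    return re.sub(r"\s+", " ", text).strip()

sample = "GREAT food!!!<br/>Visit   https://example.com now...\nSo good"
cleaned = preprocess(sample)
print(cleaned)  # great food! visit now. so good
```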

# Model Architectures

We evaluate two encoder-only transformer architectures designed to isolate the effect of iterative recurrent refinement versus standard depth stacking. Both models utilize identical architectural components—Flash Attention, SwiGLU feed-forward networks, rotary positional embeddings (RoPE), and RMSNorm—to ensure a controlled comparison.


## Baseline Transformer

The baseline follows a conventional pre-norm transformer encoder with 6 layers and a hidden dimensionality of 384. Table 4 summarizes the architectural configuration.

Table 4. Baseline Transformer Configuration

| Component | Value |
|----------|-------|
| Number of layers | 6 |
| Hidden dimension | 384 |
| Attention heads | 6 |
| FFN intermediate size | 1536 |
| Total parameters | 26M |

Each layer consists of a multi-head self-attention block followed by a feed-forward network (FFN), both wrapped with pre-norm residual connections:

$$
\begin{aligned}
h'_l &= h_{l-1} + \mathrm{MHA}(\mathrm{Norm}_1(h_{l-1})) \\
h_l &= h'_l + \mathrm{FFN}(\mathrm{Norm}_2(h'_l))
\end{aligned}
$$

Classification is performed using the final hidden state corresponding to the [CLS] token [6]:

$$
\hat{y} = \mathrm{softmax}(W_c \, h_L[0])
$$



## Recurrent Transformer

The recurrent architecture employs iterative refinement to achieve an effective depth comparable to the baseline while using substantially fewer parameters. Instead of stacking 6 distinct layers, the model uses 3 shared layers unrolled for 2 iterations.

Table 5. Recurrent Transformer Configuration


| Component | Value |
|----------|-------|
| Physical layers | 3 |
| Recurrent iterations | 2 |
| Hidden dimension | 256 |
| Attention heads | 4 |
| FFN intermediate size | 1024 |
| Total parameters | 11M |

Let $h^{(r)}$ denote the hidden representation at iteration $r$ and $F$ the shared 3-layer block. Each iteration applies $F$ and adds a scaled skip connection from its input:

$$
h^{(r+1)} = F(h^{(r)}) + \alpha \, h^{(r)}, \quad \alpha = 0.5
$$

Although only 3 layers are instantiated, the effective depth equals:

$$
D_{\mathrm{eff}} = N_{\mathrm{layers}} \times N_{\mathrm{iterations}} = 6
$$

This design follows the iterative weight-sharing approach of prior models such as the Universal Transformer [1] and enables parameter-efficient depth scaling.
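The update rule and the effective-depth count can be illustrated with a small sketch. The three lambdas are toy stand-ins for transformer layers (an assumption made to keep the example self-contained), and `run_recurrent` is a hypothetical helper name.

```python
def run_recurrent(h, shared_layers, n_iterations=2, alpha=0.5):
    """Apply the shared block F repeatedly: h <- F(h) + alpha * h."""
    layer_calls = 0
    for _ in range(n_iterations):
        out = h
        for layer in shared_layers:      # the same layers are reused every iteration
            out = layer(out)
            layer_calls += 1
        h = [o + alpha * v for o, v in zip(out, h)]
    return h, layer_calls

# Toy stand-ins for transformer layers (hypothetical; real layers would use
# attention and an FFN with their own internal residual connections).
shared = [lambda v: [0.5 * a for a in v],
          lambda v: [a + 1.0 for a in v],
          lambda v: [2.0 * a for a in v]]

h_out, depth = run_recurrent([1.0, -1.0], shared)
print(depth)  # 6
```

Only three layer objects exist, yet six layer applications occur per forward pass, which is the effective depth $D_{\mathrm{eff}}$.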



## Rationale for Fixed Configurations

To ensure a meaningful and controlled comparison, the following design choices are matched across both models:

1. Effective depth equivalence: Both architectures apply 6 transformer layers per forward pass.
2. Controlled parameter budget: The recurrent model reduces parameters by approximately 58% while maintaining comparable representational depth.
3. Consistent scaling principles: Both models use a 4× FFN expansion ratio and a head dimension of 64.

The two models are not matched in width: the recurrent model uses a hidden dimension of 256 versus 384 for the baseline, so its parameter reduction reflects both weight sharing and reduced width. The width-matched comparison in the quantization analysis isolates the effect of weight sharing alone.
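The scaling ratios and the parameter reduction can be verified with a few lines of arithmetic. The configuration values come from Tables 4 and 5, and the exact parameter counts are those reported in the SST-2 results table; the dictionary layout and helper names are illustrative.

```python
baseline = {"hidden": 384, "heads": 6, "ffn": 1536, "params": 25_912_706}
recurrent = {"hidden": 256, "heads": 4, "ffn": 1024, "params": 10_972_162}

def head_dim(cfg):
    return cfg["hidden"] // cfg["heads"]

def ffn_ratio(cfg):
    return cfg["ffn"] / cfg["hidden"]

reduction = 1.0 - recurrent["params"] / baseline["params"]

print(head_dim(baseline), head_dim(recurrent))    # 64 64
print(ffn_ratio(baseline), ffn_ratio(recurrent))  # 4.0 4.0
print(f"{reduction:.1%}")                         # 57.7%
```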



## Shared Architectural Components

1. Flash Attention for memory-efficient attention computation  
2. SwiGLU feed-forward networks  
3. Rotary Positional Embeddings (RoPE)  
4. RMSNorm for stable pre-norm training dynamics  

By controlling for these components, any observed performance differences can be attributed primarily to the structural distinction between stacked depth and recurrent iterative depth.


## Loss Function and Optimization Objective


We optimize both models using the standard Cross-Entropy loss. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, where $x_i$ is the input sequence and $y_i$ is the ground truth label, the training objective is to minimize:

$$
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C \mathbb{1}(y_i = c) \log P(c | x_i; \theta)
$$

where $C$ is the number of classes, $\mathbb{1}(\cdot)$ is the indicator function, and $P(c | x_i; \theta)$ is the probability predicted by the model (after Softmax).
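As a worked example of this objective, the sketch below computes the mean cross-entropy for a two-sample batch from raw logits. Class labels are 0-indexed here (the formula indexes classes from 1), and the logits are made-up values.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, labels):
    """Mean negative log-probability of the true class over the batch."""
    losses = [-math.log(softmax(z)[y]) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

logits = [[2.0, 0.0], [0.0, 0.0]]   # two samples, C = 2 classes
labels = [0, 1]
loss = cross_entropy(logits, labels)
```

The first sample is classified confidently and correctly and contributes $\log(1 + e^{-2})$; the second has uniform probabilities and contributes $\log 2$.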

# Experiments & Results

## Training Protocol

We implemented all models using PyTorch and trained them under identical conditions to ensure a fair comparison. The specific protocol is as follows, and the same training strategy is applied consistently across all experiments and data subsets.

1. Optimization Configuration  

We utilize the AdamW optimizer with an initial learning rate of $3 \times 10^{-5}$ and a batch size of 16. To maintain training stability and prevent gradient explosion, particularly in the recurrent layers, we apply gradient clipping with a maximum norm of 1.0.

2. Adaptive Scheduling

To facilitate convergence, we employ a ReduceLROnPlateau scheduler. The learning rate is dynamically decayed by a factor of 0.5 whenever the validation loss fails to improve for 2 consecutive epochs.

3. Early Stopping and Model Selection

We implement early stopping to prevent overfitting, terminating training if validation loss does not improve by a margin of $10^{-3}$ for 3 consecutive epochs. The final evaluation uses the model checkpoint that achieved the lowest validation loss, rather than the final training state.
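The interaction between learning-rate decay, early stopping, and checkpoint selection can be illustrated by replaying a validation-loss curve. This is a simplified pure-Python stand-in that follows the wording of the protocol above; it is not PyTorch's scheduler, whose patience semantics may differ slightly, and the function name and loss values are hypothetical.

```python
def simulate_schedule(val_losses, lr=3e-5, factor=0.5, lr_patience=2,
                      stop_patience=3, min_delta=1e-3):
    """Replay a loss curve; return (best_epoch, final_lr, stopped_at)."""
    lowest, best_epoch = float("inf"), None   # checkpoint selection
    stop_ref = float("inf")                   # reference loss for early stopping
    plateau = stale = 0
    for epoch, loss in enumerate(val_losses):
        # Scheduler: any new minimum resets the plateau counter.
        if loss < lowest:
            lowest, best_epoch, plateau = loss, epoch, 0
        else:
            plateau += 1
            if plateau >= lr_patience:
                lr, plateau = lr * factor, 0
        # Early stopping: only improvements larger than min_delta count.
        if loss < stop_ref - min_delta:
            stop_ref, stale = loss, 0
        else:
            stale += 1
            if stale >= stop_patience:
                return best_epoch, lr, epoch
    return best_epoch, lr, None

curve = [0.60, 0.50, 0.45, 0.46, 0.47, 0.4495, 0.48]
best_epoch, final_lr, stopped_at = simulate_schedule(curve)
print(best_epoch, final_lr, stopped_at)  # 5 1.5e-05 5
```

In this curve the loss at epoch 5 is a new minimum, so it becomes the selected checkpoint, but the improvement is smaller than the $10^{-3}$ margin and training still stops there.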


### Evaluation Methodology

We evaluate the trained models on the held-out test set to assess both predictive performance and computational efficiency. Our evaluation framework consists of three key components:

1. Classification Performance

To measure the model's ability to correctly categorize sequences, we report standard classification metrics: Accuracy, Precision, Recall, and F1-Score. These metrics provide a comprehensive view of the model's predictive power, balancing the trade-off between false positives and false negatives.

2. Parameter Efficiency

We quantify the spatial complexity of each architecture by calculating the total number of trainable parameters and the corresponding memory footprint (in MB, assuming 32-bit precision). This allows us to verify the hypothesis that the recurrent architecture significantly reduces model size while maintaining capacity.

3. Inference Latency

To assess temporal efficiency, we measure the real-world inference speed. We calculate the average inference time per sample (in milliseconds) over the entire test set. This metric captures the computational cost of the recurrent unrolling process compared to the standard stacked layers of the baseline.
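The classification metrics and the per-sample latency measurement can be sketched as follows. The binary, positive-class definition of precision and recall is an assumption (the averaging mode is not stated above), and `binary_metrics` and `ms_per_sample` are hypothetical helper names.

```python
import time

def binary_metrics(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

def ms_per_sample(predict, samples):
    """Average wall-clock time per sample in milliseconds."""
    start = time.perf_counter()
    for s in samples:
        predict(s)
    return 1000.0 * (time.perf_counter() - start) / len(samples)

acc, prec, rec, f1 = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
latency = ms_per_sample(lambda s: s, [1, 2, 3])
```

When timing a model on a GPU, the device would need to be synchronized before reading the clock, since kernel launches are asynchronous.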

## Experimental Design & Results

### SST-2 Benchmark Evaluation (Full Dataset)

We evaluate the baseline transformer and the recurrent transformer on the full SST-2 dataset to examine their predictive performance, parameter efficiency, and inference characteristics. Table 6 summarizes the quantitative results, while Figure 1 visualizes the accuracy–latency trade-off using bubble size to represent overall model storage cost.

Across all key metrics, the recurrent model provides a favorable balance between compactness and predictive quality. Although it contains less than half the parameters of the baseline model (11.0M vs. 25.9M; 41.9 MB vs. 98.8 MB), it achieves higher accuracy (0.7993 vs. 0.7901) and an improved F1 score (0.8031 vs. 0.7946). Similar gains are observed in precision and recall, suggesting that recurrent depth-sharing does not compromise representational capacity on short-sequence sentiment classification tasks such as SST-2.

As illustrated in Figure 1, the recurrent model occupies a more desirable position along the accuracy–efficiency frontier, delivering better accuracy at less than half the model size.

Overall, these results indicate that the recurrent transformer provides a more parameter-efficient alternative to the conventional transformer for SST-2 sentiment analysis, achieving superior or comparable performance with a significantly smaller memory footprint, at the cost of slightly higher per-sample latency (0.1795 ms vs. 0.1659 ms).

Table 6. SST-2 Benchmark Performance

| Model      | Parameters | Size (MB) | Accuracy | F1     | Precision | Recall | Inference (ms) |
|------------|------------|-----------|----------|--------|-----------|--------|----------------|
| Baseline   | 25,912,706 | 98.85     | 0.7901   | 0.7946 | 0.7919    | 0.7973 | 0.1659         |
| Recurrent  | 10,972,162 | 41.86     | 0.7993   | 0.8031 | 0.8022    | 0.8041 | 0.1795         |

Figure 1. Inference Speed vs. Accuracy (Bubble size = model storage cost)


![](images/sst2_inference_accuracy_plot.png){width=70%}



### Data Size Sensitivity (50% / 10% SST-2)

### Length-Based Sensitivity (Short vs Long on SST-2)

To study how input length influences model performance, we extract the shortest 30% and longest 30% of SST-2 samples and train each model separately on short-only and long-only subsets. The distributions shown in Figure 3 illustrate a clear separation between the two regimes: short subsets contain predominantly 1–2 token sequences, whereas long subsets span 20–50 tokens and exhibit substantially higher lexical diversity.
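The subset construction can be sketched as a sort by word count followed by slicing. Whitespace word counting, the integer `percent` argument, and the toy corpus are assumptions made for illustration.

```python
def length_subsets(texts, percent=30):
    """Return (shortest, longest) subsets by whitespace word count."""
    ranked = sorted(texts, key=lambda t: len(t.split()))
    k = len(ranked) * percent // 100
    return ranked[:k], ranked[len(ranked) - k:]

corpus = ["good", "bad film", "a", "not my favourite movie at all",
          "great", "slow but ultimately rewarding", "fine", "meh",
          "one of the best films this year", "okay story"]
short, long_ = length_subsets(corpus)
print(short)  # ['good', 'a', 'great']
```

The middle 40% of samples falls into neither subset, which keeps the two length regimes clearly separated.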

Figure 3. Word Count Distributions for Short and Long SST-2 Subsets  
![](images/sst2_length_subsets_histograms.png){width=90%}

Model performance on the long-text subset is summarized in Table 9. Despite using fewer than half the parameters, the recurrent transformer slightly outperforms the baseline in accuracy (0.8710 vs. 0.8666) and yields higher F1 (0.8795 vs. 0.8679) and recall (0.8867 vs. 0.8260), at the cost of lower precision (0.8723 vs. 0.9144). This suggests that recurrent depth-sharing may provide an advantage when modeling extended contextual dependencies. Its inference latency is about 26% higher (0.3920 ms vs. 0.3115 ms), but the improvement in predictive performance combined with a significantly smaller model footprint indicates a favorable efficiency–performance trade-off.

Table 9. Long-Sequence SST-2 Subset Performance

| Model      | Parameters | Size (MB) | Accuracy | F1     | Precision | Recall | Inference (ms) |
|------------|------------|-----------|----------|--------|-----------|--------|----------------|
| Baseline   | 25,912,706 | 98.85     | 0.8666   | 0.8679 | 0.9144    | 0.8260 | 0.3115         |
| Recurrent  | 10,972,162 | 41.86     | 0.8710   | 0.8795 | 0.8723    | 0.8867 | 0.3920         |

On the short-only subset (1–2 token inputs), both models show reduced performance due to the absence of contextual structure. The baseline attains slightly higher accuracy, while the recurrent model achieves higher recall with nearly identical F1 scores, indicating that under minimal context, the two architectures behave similarly and differ mainly in precision–recall trade-offs.


Table 10. Short-Sequence SST-2 Subset Performance

| Model      | Parameters | Size (MB) | Accuracy | F1     | Precision | Recall | Inference (ms) |
|------------|------------|-----------|----------|--------|-----------|--------|----------------|
| Baseline   | 25,912,706 | 98.85     | 0.7610   | 0.7739 | 0.7971    | 0.7520 | 0.2985         |
| Recurrent  | 10,972,162 | 41.86     | 0.7522   | 0.7732 | 0.7701    | 0.7763 | 0.2959         |


### Cross-Domain Architectural Consistency Analysis on Yelp

### Multi-Domain Review Classification (3-class)

We extend our evaluation to a three-class domain classification task (Movie, Yelp, Amazon) to assess whether the models can distinguish stylistic and distributional differences across review sources, beyond simple sentiment polarity. By matching each domain’s data size to SST-2, this setting provides a balanced and more challenging multi-class benchmark for comparing the Baseline and Recurrent Transformers, offering clearer insight into architectural differences.

Both models achieve near-perfect performance on the multi-domain three-class task, with the recurrent transformer showing a slight but consistent improvement in accuracy and F1, as summarized in Table 12.


Table 12. Multi-Domain 3-Class Classification Results

| Model      | Parameters | Size (MB) | Accuracy | F1     | Precision | Recall | Inference (ms) |
|------------|------------|-----------|----------|--------|-----------|--------|----------------|
| Baseline   | 25,913,091 | 98.85     | 0.9840   | 0.9840 | 0.9841    | 0.9840 | 0.3304         |
| Recurrent  | 10,972,419 | 41.86     | 0.9865   | 0.9865 | 0.9865    | 0.9865 | 0.3228         |


### Deployment Analysis: Quantization Comparison

While our main experiments demonstrate that recurrent transformers achieve comparable accuracy with fewer parameters, practical deployment often involves additional optimization through quantization. We conducted a focused comparison to evaluate how both architectures perform under FP16 (half-precision) quantization, a standard technique for reducing model size and accelerating inference on modern GPUs.

#### Experimental Setup

For this analysis, we use a **wider recurrent model** (hidden_size=384) compared to our main experiments (hidden_size=256). This ensures both architectures have identical width, isolating the effect of recurrent weight sharing on quantization sensitivity rather than differences in model capacity.

| Configuration | Baseline | Recurrent (Quantization Exp.) | Recurrent (Main Exp.) |
|---------------|----------|------------------------------|----------------------|
| Hidden size | 384 | 384 | 256 |
| Layers | 6 | 3 × 2 iter | 3 × 2 iter |
| Parameters | 25.9M | 18.8M | 11.0M |

We compare four deployment configurations:

- **Baseline FP32**: Full precision standard transformer (6 layers)
- **Baseline FP16**: Half-precision quantized baseline
- **Recurrent FP32**: Full precision recurrent transformer (3 layers × 2 iterations)
- **Recurrent FP16**: Half-precision quantized recurrent

#### Results

| Model | Precision | Parameters | Size (MB) | Accuracy | F1 | Inference (ms) |
|-------|-----------|------------|-----------|----------|-----|----------------|
| Baseline (FP32) | FP32 | 25.9M | 98.9 | 0.918 | 0.918 | 0.28 |
| Baseline (FP16) | FP16 | 25.9M | 49.4 | 0.918 | 0.918 | 0.14 |
| Recurrent (FP32) | FP32 | 18.8M | 71.8 | 0.918 | 0.918 | 0.27 |
| Recurrent (FP16) | FP16 | 18.8M | 35.9 | 0.918 | 0.918 | 0.13 |

#### Key Findings

1. **Quantization Robustness**: Both architectures maintain identical accuracy (91.8%) after FP16 quantization, demonstrating that neither the baseline nor the recurrent approach is more sensitive to precision reduction at this level.

2. **Size Efficiency**: FP16 quantization provides exactly 2× size reduction for both architectures:
   - Baseline: 98.9 MB → 49.4 MB
   - Recurrent: 71.8 MB → 35.9 MB
   
3. **Compound Benefits**: The recurrent architecture's weight sharing combines multiplicatively with quantization benefits. **Recurrent FP16 (35.9 MB)** achieves a **64% size reduction** compared to Baseline FP32 (98.9 MB) while maintaining equivalent accuracy.

4. **Inference Speed**: FP16 quantization approximately halves inference time for both architectures due to reduced memory bandwidth and GPU tensor core utilization. Notably, the recurrent model's iterative computation does not significantly impact FP16 inference speed.
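The size figures follow directly from parameter counts and bytes per parameter. The sketch below assumes that reported sizes are in units of $2^{20}$ bytes, which is inferred from the arithmetic matching the baseline figure rather than stated in the text; `size_mb` is a hypothetical helper.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

def size_mb(n_params, precision):
    """Weight storage in units of 2**20 bytes; ignores buffers and metadata."""
    return n_params * BYTES_PER_PARAM[precision] / 2**20

baseline_params = 25_912_706                       # from the SST-2 results table
print(round(size_mb(baseline_params, "fp32"), 2))  # 98.85
print(round(size_mb(baseline_params, "fp16"), 2))  # 49.42

# Combined reduction, using the reported sizes in MB.
reduction = 1.0 - 35.9 / 98.9
print(f"{reduction:.1%}")                          # 63.7%
```

The combined figure rounds to the 64% reported above, and the remaining fraction corresponds to the 36% storage footprint of the recurrent FP16 configuration.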

#### Deployment Recommendations

| Priority | Recommended Configuration | Size | Rationale |
|----------|--------------------------|------|-----------|
| **Latency** | Baseline FP16 | 49 MB | Single-pass inference; latency on par with Recurrent FP16 (0.14 ms vs. 0.13 ms) |
| **Balance** | Recurrent FP32 | 72 MB | Fewer parameters, full precision |
| **Memory** | Recurrent FP16 | 36 MB | Smallest footprint, maintains accuracy |

These results suggest that for memory-constrained deployments, the recurrent transformer with FP16 quantization offers the most efficient configuration, achieving the same accuracy as the full-precision baseline while requiring only 36% of its storage footprint.

#### Discussion: Task Saturation

The uniform accuracy (91.8%) across all configurations warrants discussion. This convergence may reflect:

1. **Benchmark Saturation**: SST-2 is a well-studied binary sentiment task where many architectures achieve 90-93% accuracy. State-of-the-art models on the GLUE leaderboard reach 97%+, suggesting our models are within expected range but below ceiling.

2. **Task Simplicity**: Binary sentiment classification may not fully stress architectural differences that emerge in more complex tasks (multi-class, longer sequences, reasoning).

The consistent performance across quantization levels is actually a positive finding—it demonstrates that FP16 deployment introduces no measurable degradation for this task, validating the practical viability of half-precision inference for both architectures.



# Conclusions


# References

[1] Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., & Kaiser, Ł. (2018). Universal Transformers. *arXiv preprint arXiv:1807.03819*.

[2] Bae, S., et al. (2025). Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-level Computation. *arXiv preprint arXiv:2507.10524*.

[3] Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. *arXiv preprint arXiv:2104.09864*.

[4] Shazeer, N. (2020). GLU Variants Improve Transformer. *arXiv preprint arXiv:2002.05202*.

[5] Jolicoeur-Martineau, A. (2025). Less is More: Recursive Reasoning with Tiny Networks. *arXiv preprint arXiv:2510.04871*.

[6] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. *Proceedings of NAACL-HLT 2019*, 4171–4186.