**<span style="color:#B3E5FC">Background Description</span>**

On the Brain-to-Text '24 dataset, large gains in accuracy were obtained with the following techniques:

1. Using an ensemble of 10 neural networks with different seeds. 

2. Decoding the outputs of each seed with a 5-gram language model using beam search, followed by a second-pass rescoring stage.

3. Obtaining the top 100 beams from each seed and selecting the best beam with OPT 6.7B, resulting in a single best hypothesis for each seed.

4. Fine-tuning an LLM to produce the intended transcription given the best hypothesis from each seed.
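To make step 3 concrete, here is a minimal sketch of n-best selection for a single seed. The exact scoring rule used in the original pipeline is not stated here, so the weighted sum of log-scores, the weights `alpha` and `beta`, the `Hypothesis` fields, and the toy scores are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    acoustic_score: float  # log-score from the acoustic model (assumed field)
    ngram_score: float     # log-score from the n-gram LM (assumed field)
    llm_score: float       # log-score from the LLM, e.g. OPT 6.7B (assumed field)


def select_best(nbest, alpha=0.5, beta=0.5):
    """Return the hypothesis with the highest weighted sum of log-scores.

    alpha and beta are hypothetical interpolation weights.
    """
    def total(h):
        return h.acoustic_score + alpha * h.ngram_score + beta * h.llm_score

    return max(nbest, key=total)


# Toy n-best list with made-up scores (in practice: the top 100 beams).
nbest = [
    Hypothesis("i want to go home", -12.0, -20.0, -15.0),
    Hypothesis("i want to go hum", -11.5, -26.0, -24.0),
]
best = select_best(nbest)
print(best.text)
```

Running this once per seed would give the ten single-best hypotheses that the fine-tuned LLM in step 4 receives as input.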

**<span style="color:#B3E5FC">Areas for Improvement</span>**

1. The 5-gram LM with second-pass rescoring is computationally very demanding and requires a server with over 300 GB of RAM. It is also unclear whether latency is an issue with the 5-gram LM. Card et al., 2025 appear to have used a 5-gram LM + rescoring + OPT 6.7B for online evaluations, so this may not be a critical issue.

2. LLM fine-tuning relies on the entire sentence being decoded, and no previous method has applied it in a streaming fashion. 

3. The acoustic model is developed and tuned using PER, whereas the metric of interest is WER. Optimizing for WER while training the acoustic model appears challenging given the complexity and resource demands of this setup: the 5-gram LM must be loaded, decoding the entire validation set adds latency, and the LLM can only be fine-tuned after the acoustic model has finished training. This is not ideal because PER is an imperfect proxy for WER, which may lead to suboptimal hyperparameter choices for the acoustic model.
    * The relationship between WER and PER appears to be complex and modulated by validation CTC loss. A low validation CTC loss means that the summed probability of all valid alignments is high, whereas PER measures the edit distance between the correct and decoded sequences after CTC greedy decoding rules are applied.
    * A valuable data point would be the WER of a consistency-regularized CTC model trained for 250 epochs. Such a model appears to obtain a lower PER and validation CTC loss than my current best model.
        * The WER for three versions of the Transformer is displayed in Table 1 in the cell below. Although adding the consistency-regularized CTC term lowers PER relative to the 250-epoch, 5-layer model and yields the lowest CTC loss of the three, the resulting model has a slightly (likely negligibly) higher WER than either of the other two Transformer versions.
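The distinction between CTC loss and PER drawn above can be illustrated with a small, self-contained sketch of the PER side: greedy CTC decoding (collapse repeats, then drop blanks) followed by an edit-distance computation. The blank index of 0 and the toy frame labels are assumptions for illustration, not values from the actual model.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeated labels, then drop blanks."""
    decoded = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, computed row by row."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        row = [i]
        for j, h in enumerate(hyp, start=1):
            row.append(min(prev_row[j] + 1,              # deletion
                           row[j - 1] + 1,               # insertion
                           prev_row[j - 1] + (r != h)))  # substitution or match
        prev_row = row
    return prev_row[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


# Per-frame argmax labels (toy values); the blank between the two 3s
# keeps them as separate phonemes after collapsing.
frames = [0, 3, 3, 0, 3, 5, 5, 0, 0, 7]
hyp = ctc_greedy_decode(frames)
ref = [3, 5, 7]
per = phoneme_error_rate(ref, hyp)
print(hyp, per)
```

PER depends only on the per-frame argmax path, while the CTC loss sums probability over every valid alignment, so the two metrics can move independently.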
    



In [1]:
from helper_functions import process_and_display_results, conduct_paired_ttest

# Result files and the display name used for each model in Table 1.
process_and_display_results(
    [
        'research_data/neurips_transformer_time_masked.pkl',
        'research_data/transformer_short_training_fixed.pkl',
        'research_data/time_masked_transformer_cr-ctc_0.2.pkl',
    ],
    [
        'Transformer 600 epochs + 7 layers',
        'Transformer 250 Epochs + 5 layers',
        'Transformer with CR-CTC 0.2',
    ],
)

print("\n Table 1: N denotes the number of seeds. PER, CTC loss, and 3-gram WER denote the mean values across seeds on the validation dataset. \n The CR-CTC model is trained with the consistency regularized CTC loss (Yao et al., 2024, ICLR)")

| Model Name                        |   N |    PER |   CTC Loss |   3-gram WER |
|:----------------------------------|----:|-------:|-----------:|-------------:|
| Transformer 600 epochs + 7 layers |  10 | 0.1403 |     0.8477 |       0.174  |
| Transformer 250 Epochs + 5 layers |  10 | 0.1569 |     0.7336 |       0.1715 |
| Transformer with CR-CTC 0.2       |  10 | 0.1458 |     0.5839 |       0.1764 |

 Table 1: N denotes the number of seeds. PER, CTC loss, and 3-gram WER denote the mean values across seeds on the validation dataset. 
 The CR-CTC model is trained with the consistency regularized CTC loss (Yao et al., 2024, ICLR)


In [2]:
conduct_paired_ttest('research_data/time_masked_transformer_cr-ctc_0.2.pkl', 'research_data/transformer_short_training_fixed.pkl')

Stats test between time_masked_transformer_cr-ctc_0.2 and transformer_short_training_fixed:

PER - t value : -13.7043, p value:  0.0000
WER - t value :  6.1760, p value:  0.0002
CTC Loss - t value : -40.7995, p value:  0.0000
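For reference, the statistic reported above can be sketched as follows. The internals of `conduct_paired_ttest` are not shown in this notebook, so this is an assumption about what it computes: a standard paired t-test in which the two models are paired by seed. The function `paired_t_statistic` and the per-seed values are hypothetical placeholders, not the actual results.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """t statistic of a paired t-test on per-seed differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n))


# Made-up per-seed WERs for two models (placeholders only).
wer_model_a = [0.176, 0.178, 0.175, 0.177]
wer_model_b = [0.171, 0.172, 0.172, 0.171]
t_value = paired_t_statistic(wer_model_a, wer_model_b)
print(f"t value: {t_value:.4f}")
```

Under this convention a positive t value means the first model has the higher mean, which matches the signs in the output above (the CR-CTC model has lower PER and CTC loss but higher WER). With N seeds the test has N - 1 degrees of freedom; `scipy.stats.ttest_rel` returns both the statistic and the p-value.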
