### Summary of Current Research Progress


**<span style="color:#B3E5FC">Background Description</span>**

On the Brain-to-Text '24 Dataset, impressive gains in accuracy were made using the following techniques. 

1. Using an ensemble of 10 neural networks with different seeds. 

2. Decoding the outputs of each of these seeds with a 5-gram language model using beam search, followed by a second-pass rescoring stage.

3. Obtaining the top 100 beams from each seed and selecting the best beam with OPT 6.7B, resulting in a single best hypothesis for each seed.

4. Fine-tuning an LLM to produce the intended transcription given the best hypothesis from each seed.
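
Step 3 above can be illustrated with a minimal sketch. It assumes that each beam carries a decoder score (acoustic + n-gram) and an LLM log-probability, and that the two are combined with an interpolation weight `alpha`. The `Hypothesis` class, the `alpha` value, and all scores are hypothetical toy values, not the actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    decoder_score: float  # acoustic + n-gram log score from beam search (toy value)
    llm_score: float      # log-probability assigned by the rescoring LLM (toy value)


def select_best(nbest, alpha=0.5):
    """Return the hypothesis maximizing decoder_score + alpha * llm_score.

    alpha is an assumed interpolation weight, not a value from the source.
    """
    return max(nbest, key=lambda h: h.decoder_score + alpha * h.llm_score)


def best_per_seed(nbest_by_seed, alpha=0.5):
    """Reduce each seed's n-best list to a single best transcription."""
    return {seed: select_best(nbest, alpha).text
            for seed, nbest in nbest_by_seed.items()}


# Toy n-best lists for two seeds (the real setup uses 100 beams per seed).
nbest_by_seed = {
    0: [Hypothesis("i want two go home", -10.0, -30.0),
        Hypothesis("i want to go home", -10.5, -20.0)],
    1: [Hypothesis("i want to go home", -9.0, -20.0),
        Hypothesis("i won to go home", -9.2, -35.0)],
}

best = best_per_seed(nbest_by_seed)
print(best)
```

The per-seed outputs in `best` are what step 4 would pass to the fine-tuned LLM as input.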

**<span style="color:#B3E5FC">Areas for Improvement</span>**

1. The 5-gram LM + second-pass rescoring is very resource-intensive, requiring a server with over 300 GB of RAM. It is also unclear whether the 5-gram LM introduces problematic latency. Card et al., 2025 appear to have used a 5-gram LM + rescoring + OPT 6.7B for online evaluations, so latency may not be a critical issue.

2. LLM fine-tuning relies on the entire sentence being decoded, and no previous method has applied it in a streaming fashion. 

3. The acoustic model is tuned on PER, whereas the metric of interest is WER. Given the complexity and resource demands of this setup, optimizing for WER while training the acoustic model appears challenging: the 5-gram LM must be loaded, decoding the entire validation set adds latency, and the LLM can only be fine-tuned after the acoustic model has finished training. This is not ideal because PER is an imperfect proxy for WER, which may lead to suboptimal hyperparameter choices for the acoustic model.
    * The relationship between WER and PER appears to be complex and modulated by validation CTC loss. A low validation CTC loss means that the summed probability of all valid alignments of the correct sequence is high, whereas PER measures the edit distance between the correct sequence and the sequence obtained by CTC greedy decoding.
    * A valuable data point would be the WER of a consistency-regularized CTC model trained for 250 epochs. Such a model appears to obtain a lower PER and validation CTC loss than my current best model.
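
To make the PER definition concrete, here is a minimal sketch of CTC greedy decoding followed by an edit-distance error rate. It assumes the blank symbol has index 0 and that per-frame argmax IDs are already available. The function names and toy sequences are illustrative, not taken from the actual training code.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeats, then drop blanks (CTC greedy rule)."""
    decoded = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        row = [i]
        for j, h in enumerate(hyp, start=1):
            row.append(min(prev_row[j] + 1,              # deletion
                           row[j - 1] + 1,               # insertion
                           prev_row[j - 1] + (r != h)))  # substitution or match
        prev_row = row
    return prev_row[-1]


def error_rate(refs, hyps):
    """Total edits divided by total reference length (PER or WER)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_len = sum(len(r) for r in refs)
    return total_edits / total_len


# Toy example: per-frame argmax phoneme IDs, with 0 as the blank.
frames = [0, 3, 3, 0, 3, 5, 5, 0]
decoded = ctc_greedy_decode(frames)
per = error_rate([[3, 5, 5]], [decoded])

# The same error-rate function gives WER when the tokens are words.
wer = error_rate(["i want to go home".split()], ["i want two go home".split()])
print(decoded, per, wer)
```

PER depends only on the single greedy path, while CTC loss sums over all valid alignments, so the two can diverge, which is consistent with their different relationships to WER.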
    
    



In [29]:
from helper_functions import process_and_display_results

process_and_display_results(['research_data/neurips_transformer_time_masked.pkl', 'research_data/transformer_short_training_fixed.pkl'], 
                            ['Transformer 600 epochs + 7 layers'], ['Transformer 250 Epochs + 5 layers'])

| Model Name                        |    PER |   3-gram WER |   CTC Loss |   N |
|:----------------------------------|-------:|-------------:|-----------:|----:|
| Transformer 600 epochs + 7 layers | 0.1408 |       0.1769 |     0.8555 |   4 |


In [17]:
import pandas as pd
pd.read_pickle('research_data/neurips_transformer_time_masked.pkl')

{'neurips_transformer_time_masked': {'PER': [0.13788122652800133,
   0.1451446494160373,
   0.13726218480458915,
   0.1428335603152986],
  'WER': [0.17875977450445535,
   0.18112384069830878,
   0.1685761047463175,
   0.1789416257501364],
  'CTC Loss': [0.8373662474711591,
   0.8862333237659827,
   0.8195294758713131,
   0.8788532746131399]}}
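
The summary table appears to report the mean of each metric across the N = 4 seeds. The sketch below recomputes those means from the raw values printed above. The `summary` dictionary is an illustrative name, and the sample standard deviation is added only as an indication of seed-to-seed spread.

```python
from statistics import mean, stdev

# Per-seed values copied from the pickle output above.
results = {
    'neurips_transformer_time_masked': {
        'PER': [0.13788122652800133, 0.1451446494160373,
                0.13726218480458915, 0.1428335603152986],
        'WER': [0.17875977450445535, 0.18112384069830878,
                0.1685761047463175, 0.1789416257501364],
        'CTC Loss': [0.8373662474711591, 0.8862333237659827,
                     0.8195294758713131, 0.8788532746131399],
    }
}

summary = {}
for model, metrics in results.items():
    summary[model] = {
        name: {'mean': round(mean(vals), 4),
               'std': round(stdev(vals), 4),  # sample standard deviation
               'n': len(vals)}
        for name, vals in metrics.items()
    }

for model, metrics in summary.items():
    for name, stats in metrics.items():
        print(f"{model} | {name}: {stats['mean']} +/- {stats['std']} (N={stats['n']})")
```

The rounded means match the PER, 3-gram WER, and CTC Loss columns of the table.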
