# Evaluating Additional Modeling Pipelines
We should also compare performance on the evaluation data against other readily available phonetic transcription options, to determine whether fine-tuning your own model is worth the effort.
The three options we consider here are:
- [Wav2vec2 fine-tuned on TIMIT](https://huggingface.co/elgeish/wav2vec2-large-lv60-timit-asr) as the speech recognition model, followed by [Epitran](https://github.com/dmort27/epitran) to convert orthography to IPA. The TIMIT corpus is a high-quality corpus of read English speech.
- [Allosaurus](https://github.com/xinjli/allosaurus) is a pre-trained universal phone recognizer that claims to recognize phones in more than 2000 languages. 
- [Whisper](https://openai.com/index/whisper/) is a state-of-the-art sequence-to-sequence speech recognition model released by OpenAI. As with Wav2vec2, its orthographic output is converted to IPA with Epitran. Details about the different model releases are available at https://github.com/openai/whisper/blob/main/model-card.md. There are multilingual and English-only versions.

These evaluations only need to be run and computed once. 

## Additional installation step for Epitran
To use Epitran for English, you also need to install [Flite](https://github.com/festvox/flite). See the Epitran installation note at https://github.com/dmort27/epitran?tab=readme-ov-file#installation-of-flite-for-english-g2p. I installed Flite on my Mac as follows:

```bash
$ git clone http://github.com/festvox/flite
$ cd flite
$ ./configure && make
$ sudo make install
$ cd testsuite
$ make lex_lookup
$ sudo cp lex_lookup /usr/local/bin
```
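Before running the Epitran cells, it can help to confirm that the `lex_lookup` binary copied in the last step is actually discoverable. The following is a minimal sketch using only the standard library; the helper name `find_lex_lookup` is hypothetical and is not part of Epitran or Flite.

```python
import shutil


def find_lex_lookup():
    """Return the full path to lex_lookup if it is on PATH, else None."""
    return shutil.which("lex_lookup")


lex_lookup_path = find_lex_lookup()
if lex_lookup_path is None:
    # Epitran's English G2P depends on this binary, so flag it early
    print("lex_lookup not found on PATH; check the Flite installation steps")
else:
    print("Found lex_lookup at", lex_lookup_path)
```

If the binary is missing, re-check that `/usr/local/bin` (or wherever you copied it) is on your `PATH`.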



In [1]:
import itertools
import time

import allosaurus.app
import allosaurus.bin.download_model
import epitran
import transformers
from tqdm import tqdm


from multipa.data_utils import load_buckeye_split
from multipa.evaluate import ModelEvaluator, preprocess_test_data, DETAILED_PREDICTIONS_CSV_SUFFIX



In [2]:
def allosaurus_predict(test_dataset, model="eng2102", phone_inventory="ipa"):
    """Transcribe each audio file with Allosaurus and return space-free phone strings."""
    print("Evaluating allosaurus. Model:", model, "Phone inventory:", phone_inventory)
    model_predictions = []
    # Use a separate name so the model identifier string is not shadowed
    recognizer = allosaurus.app.read_recognizer(model)
    start = time.time()
    for audio in tqdm(test_dataset["audio"]):
        prediction = recognizer.recognize(audio["path"], phone_inventory)
        # Allosaurus separates phones with spaces; remove them before scoring
        prediction = prediction.replace(" ", "")
        model_predictions.append(prediction)
    end = time.time()
    print("Eval time in seconds:", end - start)
    return model_predictions

def hf_model_to_epitran_predict(model_name, test_dataset):
    """Transcribe audio to orthography with a Hugging Face ASR model, then convert to IPA with Epitran."""
    print("Building pipeline and downloading model")
    pipe = transformers.pipeline("automatic-speech-recognition", model=model_name)
    print("Predicting with", model_name)
    # The reported time covers both speech recognition and transliteration
    start = time.time()
    orthography_predictions = [d["text"] for d in pipe(test_dataset["audio"])]
    epi = epitran.Epitran("eng-Latn")
    print("Transliterating with Epitran")
    ipa_predictions = []
    for pred in tqdm(orthography_predictions):
        # Remove spaces so predictions are space-free before scoring
        result = epi.transliterate(pred).replace(" ", "")
        ipa_predictions.append(result)
    end = time.time()
    print("Eval time in seconds:", end - start)
    return ipa_predictions
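The runs recorded in this notebook emit a warning that a hardware accelerator is available but no `device` argument was passed, so the models ran on CPU. A minimal sketch of choosing a device string is below. The helper `pick_device` is hypothetical, and passing its result as `device=` to `transformers.pipeline` is a suggested change, not what produced the outputs in this notebook.

```python
def pick_device():
    """Return a device string for transformers.pipeline, falling back to CPU."""
    try:
        import torch
    except ImportError:
        # Without torch there is nothing to accelerate
        return "cpu"
    if torch.cuda.is_available():
        return "cuda:0"
    # Apple Silicon GPUs are exposed through the MPS backend in recent torch versions
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print("Using device:", device)

# Hypothetical usage inside hf_model_to_epitran_predict:
# pipe = transformers.pipeline(
#     "automatic-speech-recognition", model=model_name, device=device
# )
```

Timings obtained on an accelerator would not be comparable to the CPU timings reported below, so note which device was used alongside any results.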

In [3]:
input_data = load_buckeye_split("../data/buckeye", "test")
# Snippet of transcriptions
# Note that there don't appear to be any empty transcriptions,
# so this notebook skips looking at hallucinations
print("Data Preview")
print(input_data)
print(input_data[0])

non_empty_test_data, empty_test_data = preprocess_test_data(input_data, is_remove_space=True)

model_evaluator = ModelEvaluator()

Resolving data files: 100%|██████████| 37566/37566 [00:00<00:00, 230004.87it/s]
Resolving data files: 100%|██████████| 10160/10160 [00:00<00:00, 29160.30it/s]
Resolving data files: 100%|██████████| 11212/11212 [00:00<00:00, 689326.40it/s]


Data Preview
Dataset({
    features: ['audio', 'utterance_id', 'duration', 'buckeye_transcript', 'text', 'ipa', 'speaker_id', 'speaker_gender', 'speaker_age_range', 'interviewer_gender', 'file_path', '__index_level_0__'],
    num_rows: 5079
})
{'audio': {'bytes': None, 'path': '/Users/virginia/workspace/multipa/data/buckeye/test/s2501a_Utt0.wav'}, 'utterance_id': 's2501a_Utt0', 'duration': 0.925981, 'buckeye_transcript': 'f ao r f ay v', 'text': 'four five', 'ipa': 'f ɔ ɹ f aɪ v', 'speaker_id': 'S25', 'speaker_gender': 'f', 'speaker_age_range': 'o', 'interviewer_gender': 'm', 'file_path': 'data/buckeye/test/s2501a_Utt0.wav', '__index_level_0__': 0}


100%|██████████| 5079/5079 [00:00<00:00, 14530.44ex/s]
100%|██████████| 6/6 [00:02<00:00,  2.40ba/s]


Number of test examples with empty transcriptions: 0
Dataset({
    features: ['audio', 'utterance_id', 'duration', 'buckeye_transcript', 'text', 'ipa', 'speaker_id', 'speaker_gender', 'speaker_age_range', 'interviewer_gender', 'file_path', '__index_level_0__'],
    num_rows: 0
})


100%|██████████| 6/6 [00:01<00:00,  5.15ba/s]


In [4]:
models = [
    "openai/whisper-large-v3-turbo",
    "openai/whisper-large-v3",
    "openai/whisper-medium.en",
    "elgeish/wav2vec2-large-lv60-timit-asr",
]
for m in models:
    # Epitran
    epitran_predictions = hf_model_to_epitran_predict(m, non_empty_test_data)
    model_name = f"{m}_to_epitran".replace("/", "_")
    epitran_detailed_csv = f"{model_name}_{DETAILED_PREDICTIONS_CSV_SUFFIX}"
    metrics = model_evaluator.eval_non_empty_transcriptions(
        model_name, epitran_predictions, non_empty_test_data["ipa"]
    )
    detailed_results = non_empty_test_data.add_column(
        "prediction", epitran_predictions
    ).remove_columns(["audio"])
    for k in ["phone_error_rates", "phone_feature_error_rates", "feature_error_rates"]:
        detailed_results = detailed_results.add_column(k, metrics[k])
    detailed_results.remove_columns(["__index_level_0__"]).to_csv(
        epitran_detailed_csv, index=False
    )

Building pipeline and downloading model


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Predicting with openai/whisper-large-v3-turbo


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Transliterating with Epitran


100%|██████████| 5079/5079 [07:24<00:00, 11.42it/s]


Eval time in seconds: 23151.309972286224


Flattening the indices: 100%|██████████| 6/6 [00:00<00:00, 16.71ba/s]
  block_group = [InMemoryTable(cls._concat_blocks(list(block_group), axis=axis))]
  table = cls._concat_blocks(blocks, axis=0)
Creating CSV from Arrow format: 100%|██████████| 6/6 [00:00<00:00, 95.77ba/s]


Building pipeline and downloading model


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Predicting with openai/whisper-large-v3




Transliterating with Epitran


100%|██████████| 5079/5079 [04:47<00:00, 17.70it/s]


Eval time in seconds: 37418.261837005615


Flattening the indices: 100%|██████████| 6/6 [00:00<00:00, 15.26ba/s]
  block_group = [InMemoryTable(cls._concat_blocks(list(block_group), axis=axis))]
  table = cls._concat_blocks(blocks, axis=0)
Creating CSV from Arrow format: 100%|██████████| 6/6 [00:00<00:00, 95.00ba/s]


Building pipeline and downloading model


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Predicting with openai/whisper-medium.en




Transliterating with Epitran


100%|██████████| 5079/5079 [10:48<00:00,  7.83it/s]   


Eval time in seconds: 14702.21756529808


Flattening the indices: 100%|██████████| 6/6 [00:00<00:00,  7.89ba/s]
  block_group = [InMemoryTable(cls._concat_blocks(list(block_group), axis=axis))]
  table = cls._concat_blocks(blocks, axis=0)
Creating CSV from Arrow format: 100%|██████████| 6/6 [00:00<00:00, 55.67ba/s]


Building pipeline and downloading model


Some weights of the model checkpoint at elgeish/wav2vec2-large-lv60-timit-asr were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at elgeish/wav2vec2-large-lv60-timit-asr and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probab

Predicting with elgeish/wav2vec2-large-lv60-timit-asr
Transliterating with Epitran


100%|██████████| 5079/5079 [12:24<00:00,  6.82it/s]


Eval time in seconds: 2219.9671170711517


Flattening the indices: 100%|██████████| 6/6 [00:00<00:00, 15.72ba/s]
  block_group = [InMemoryTable(cls._concat_blocks(list(block_group), axis=axis))]
  table = cls._concat_blocks(blocks, axis=0)
Creating CSV from Arrow format: 100%|██████████| 6/6 [00:00<00:00, 97.09ba/s]
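The `Eval time in seconds` values printed above are easier to compare as hours and minutes. The sketch below copies the four timings from the output and formats them with the standard library; the helper name `format_elapsed` is hypothetical. Each timing covers both speech recognition and Epitran transliteration, and per the pipeline warning these runs were on CPU.

```python
from datetime import timedelta

# Timings copied from the "Eval time in seconds" lines in the output above
eval_seconds = {
    "openai/whisper-large-v3-turbo": 23151.309972286224,
    "openai/whisper-large-v3": 37418.261837005615,
    "openai/whisper-medium.en": 14702.21756529808,
    "elgeish/wav2vec2-large-lv60-timit-asr": 2219.9671170711517,
}


def format_elapsed(seconds):
    """Format a duration in seconds as H:MM:SS, rounded to the nearest second."""
    return str(timedelta(seconds=round(seconds)))


readable = {name: format_elapsed(s) for name, s in eval_seconds.items()}
for name, text in readable.items():
    print(f"{name}: {text}")
```

Read this way, the Whisper pipelines took roughly 4 to 10 hours each, while the Wav2vec2 pipeline finished in about 37 minutes.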


In [5]:
# Define models and phone inventory to test
allosaurus_models = ["uni2005", "eng2102"]
phone_inventory = ["ipa", "eng"]

# Download models
for m in allosaurus_models:
    allosaurus.bin.download_model.download_model(m)

# Predict and check against gold standard
for model, pi in itertools.product(allosaurus_models, phone_inventory):
    model_predictions = allosaurus_predict(non_empty_test_data, model, pi)
    model_name = f"allosaurus_{model}_{pi}"
    detailed_results_csv = f"{model_name}_{DETAILED_PREDICTIONS_CSV_SUFFIX}"
    metrics = model_evaluator.eval_non_empty_transcriptions(model_name, model_predictions, non_empty_test_data["ipa"])
    detailed_results = non_empty_test_data.add_column(
        "prediction", model_predictions
    ).remove_columns(["audio"])
    for k in ["phone_error_rates", "phone_feature_error_rates", "feature_error_rates"]:
        detailed_results = detailed_results.add_column(k, metrics[k])
    detailed_results.remove_columns(["__index_level_0__"]).to_csv(detailed_results_csv, index=False)



  model_state_dict = torch.load(str(path), map_location=torch.device('cpu'))


Evaluating allosaurus. Model: uni2005 Phone inventory: ipa


  return (feature - spk_mean)/spk_std
100%|██████████| 5079/5079 [29:31<00:00,  2.87it/s]    


Eval time in seconds: 1774.002643108368


Flattening the indices: 100%|██████████| 6/6 [00:00<00:00, 17.01ba/s]
  block_group = [InMemoryTable(cls._concat_blocks(list(block_group), axis=axis))]
  table = cls._concat_blocks(blocks, axis=0)
Creating CSV from Arrow format: 100%|██████████| 6/6 [00:00<00:00, 107.37ba/s]
  model_state_dict = torch.load(str(path), map_location=torch.device('cpu'))


Evaluating allosaurus. Model: uni2005 Phone inventory: eng


100%|██████████| 5079/5079 [17:11<00:00,  4.93it/s]


Eval time in seconds: 1033.3425288200378


Flattening the indices: 100%|██████████| 6/6 [00:00<00:00, 16.64ba/s]
Creating CSV from Arrow format: 100%|██████████| 6/6 [00:00<00:00, 101.72ba/s]


Evaluating allosaurus. Model: eng2102 Phone inventory: ipa


100%|██████████| 5079/5079 [23:28<00:00,  3.61it/s]


Eval time in seconds: 1410.6358399391174


Flattening the indices: 100%|██████████| 6/6 [00:00<00:00, 16.45ba/s]
Creating CSV from Arrow format: 100%|██████████| 6/6 [00:00<00:00, 107.43ba/s]


Evaluating allosaurus. Model: eng2102 Phone inventory: eng


100%|██████████| 5079/5079 [25:59<00:00,  3.26it/s]   


Eval time in seconds: 1561.5043210983276


Flattening the indices: 100%|██████████| 6/6 [00:00<00:00, 10.78ba/s]
Creating CSV from Arrow format: 100%|██████████| 6/6 [00:00<00:00, 62.67ba/s]


In [6]:
# Write all results to file for comparison
model_evaluator.to_csv("epitran_allosaurus_eval.csv")
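The per-utterance `phone_error_rates` column is computed inside `multipa.evaluate`, which is not shown in this notebook. As a rough illustration only, the sketch below assumes the common definition of phone error rate: the Levenshtein distance between the predicted and reference phone sequences, divided by the reference length. The function names are hypothetical, and the actual `ModelEvaluator` implementation may tokenize or normalize differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(ref_phones, hyp_phones):
    """Edit distance normalized by the number of reference phones."""
    if not ref_phones:
        raise ValueError("reference must contain at least one phone")
    return edit_distance(ref_phones, hyp_phones) / len(ref_phones)


# Reference phones from the data preview ('four five'); the hypothesis is made up
reference = "f ɔ ɹ f aɪ v".split()
hypothesis = "f o ɹ f aɪ v".split()
per = phone_error_rate(reference, hypothesis)
print(f"Phone error rate: {per:.3f}")
```

Splitting on spaces keeps the diphthong `aɪ` as a single phone. Scoring space-free strings character by character, as a simplification, would count it as two symbols.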