# Evaluation of Additional Modeling Pipelines
We should also compare performance on the evaluation data (Buckeye test split) with other readily available phonetic transcription options, to determine whether fine-tuning your own model is worth the effort. 
The three options we consider here are:
- [Allosaurus](https://github.com/xinjli/allosaurus) is a pre-trained universal phone recognizer that claims to recognize phones in more than 2000 languages. 
- [Whisper](https://openai.com/index/whisper/) is a state-of-the-art sequence-to-sequence speech recognition model released by OpenAI. Details about the different model releases are available at https://github.com/openai/whisper/blob/main/model-card.md. There are multilingual and English-only versions. Because Whisper outputs orthographic text, we pass its transcriptions through grapheme-to-phoneme (G2P) conversion with Epitran to obtain IPA.
- [excalibur12/wav2vec2-large-lv60_phoneme-timit_english_timit-4k](https://huggingface.co/excalibur12/wav2vec2-large-lv60_phoneme-timit_english_timit-4k) is a wav2vec2 model fine-tuned on TIMIT data. Because it uses the original TIMIT phonemes, we post-process using [phonecodes](https://pypi.org/project/phonecodes/) to convert predictions to IPA. 
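
Conceptually, the phonecodes post-processing step is a lookup from the model's label set to IPA, optionally followed by removing spaces. The following is a minimal sketch of that idea, not the phonecodes implementation: the table `ARPABET_TO_IPA` and the function `arpabet_to_ipa` are hypothetical names, and the table only covers the symbols of the first Buckeye test utterance shown later in this notebook.

```python
# Hypothetical, hand-written table covering only the symbols of one utterance.
# The notebook itself relies on the phonecodes package for the full conversion.
ARPABET_TO_IPA = {
    "f": "f",
    "ao": "ɔ",
    "r": "ɹ",
    "ay": "aɪ",
    "v": "v",
}


def arpabet_to_ipa(transcript: str, remove_spaces: bool = True) -> str:
    """Convert a space-separated ARPABET-style transcript to IPA."""
    phones = [ARPABET_TO_IPA[symbol] for symbol in transcript.split()]
    separator = "" if remove_spaces else " "
    return separator.join(phones)


print(arpabet_to_ipa("f ao r f ay v", remove_spaces=False))  # f ɔ ɹ f aɪ v
print(arpabet_to_ipa("f ao r f ay v"))  # fɔɹfaɪv
```

The second call mirrors what the `IS_REMOVE_SPACES` option does to the reference transcriptions before scoring.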

These evaluations only need to be run and computed once. 

## Additional installation step for Epitran
To use Epitran for English, you also need to install [Flite](https://github.com/festvox/flite). See the Epitran note at https://github.com/dmort27/epitran?tab=readme-ov-file#installation-of-flite-for-english-g2p. I installed Flite on my Mac as follows:

```bash
$ git clone http://github.com/festvox/flite
$ cd flite
$ ./configure && make
$ sudo make install
$ cd testsuite
$ make lex_lookup
$ sudo cp lex_lookup /usr/local/bin
```
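
Before running the Whisper-to-Epitran cells, it may be worth confirming that the `lex_lookup` binary installed above is visible from Python. This is a small sketch using only the standard library; `find_lex_lookup` is a hypothetical helper, not part of Epitran or multipa.

```python
import shutil


def find_lex_lookup():
    """Return the path to the lex_lookup executable, or None if it is not on PATH."""
    return shutil.which("lex_lookup")


lex_lookup_path = find_lex_lookup()
if lex_lookup_path is None:
    print("lex_lookup not found on PATH; Epitran's English G2P will likely fail.")
else:
    print(f"Found lex_lookup at {lex_lookup_path}")
```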



In [None]:
import itertools
from pathlib import Path

from multipa.data_utils import load_buckeye_split
import multipa.evaluation
import multipa.evaluation_extras

VERBOSE_RESULTS_DIR = Path("../data/evaluation_results/detailed_predictions")
AGGREGATE_METRICS_CSV = Path("../data/evaluation_results/aggregate_metrics/epitran_allosaurus_eval.csv")
EDIT_DIST_DIR = Path("../data/evaluation_results/edit_distances/")

# Post-processing options
IS_REMOVE_SPACES = True
IS_NORMALIZE_IPA = True
NUM_PROC = 8  # For HuggingFace dataset map and filter
DEVICE = 0  # Passed as `device` to the HuggingFace-based prediction functions


In [None]:
input_data = load_buckeye_split("../data/buckeye", "test")
# Snippet of transcriptions
# Note that there don't appear to be any empty transcriptions,
# so this notebook skips looking at hallucinations
print("Data Preview")
print(input_data)
print(input_data[0])

non_empty_test_data, empty_test_data = multipa.evaluation.preprocess_test_data(input_data, is_remove_space=IS_REMOVE_SPACES, is_normalize_ipa=IS_NORMALIZE_IPA, num_proc=NUM_PROC)

print("Test data with speech transcriptions")
print(non_empty_test_data)
print(non_empty_test_data[0])
print("Test data without speech")
print(empty_test_data)

model_evaluator = multipa.evaluation.ModelEvaluator()

Data Preview
Dataset({
    features: ['utterance_id', 'duration', 'buckeye_transcript', 'text', 'ipa', 'speaker_id', 'speaker_gender', 'speaker_age_range', 'interviewer_gender', 'file_path', 'audio', '__index_level_0__'],
    num_rows: 5079
})
{'utterance_id': 's2501a_Utt0', 'duration': 0.925981, 'buckeye_transcript': 'f ao r f ay v', 'text': 'four five', 'ipa': 'f ɔ ɹ f aɪ v', 'speaker_id': 'S25', 'speaker_gender': 'f', 'speaker_age_range': 'o', 'interviewer_gender': 'm', 'file_path': 'data/buckeye/test/s2501a_Utt0.wav', 'audio': {'bytes': None, 'path': '/Users/virginia/workspace/multipa/data/buckeye/test/s2501a_Utt0.wav'}, '__index_level_0__': 0}


Map (num_proc=8): 100%|██████████| 5079/5079 [00:00<00:00, 12737.47 examples/s]
Filter (num_proc=8): 100%|██████████| 5079/5079 [00:01<00:00, 3036.88 examples/s]
Filter (num_proc=8): 100%|██████████| 5079/5079 [00:01<00:00, 4529.48 examples/s]


Test data with speech transcriptions
Dataset({
    features: ['utterance_id', 'duration', 'buckeye_transcript', 'text', 'ipa', 'speaker_id', 'speaker_gender', 'speaker_age_range', 'interviewer_gender', 'file_path', 'audio', '__index_level_0__'],
    num_rows: 5079
})
{'utterance_id': 's2501a_Utt0', 'duration': 0.925981, 'buckeye_transcript': 'f ao r f ay v', 'text': 'four five', 'ipa': 'fɔɹfaɪv', 'speaker_id': 'S25', 'speaker_gender': 'f', 'speaker_age_range': 'o', 'interviewer_gender': 'm', 'file_path': 'data/buckeye/test/s2501a_Utt0.wav', 'audio': {'path': '/Users/virginia/workspace/multipa/data/buckeye/test/s2501a_Utt0.wav', 'array': array([-0.00997925, -0.01052856, -0.00958252, ...,  0.00085449,
        0.00061035,  0.00042725]), 'sampling_rate': 16000}, '__index_level_0__': 0}
Test data without speech
Dataset({
    features: ['utterance_id', 'duration', 'buckeye_transcript', 'text', 'ipa', 'speaker_id', 'speaker_gender', 'speaker_age_range', 'interviewer_gender', 'file_path', 'aud

In [None]:
models = [
    "openai/whisper-large-v3-turbo",
    #"openai/whisper-large-v3",
    "openai/whisper-medium.en",
]
for m in models:
    # Epitran
    epitran_predictions = multipa.evaluation_extras.hf_model_to_epitran_predict(m, non_empty_test_data, device=DEVICE, num_proc=NUM_PROC, is_remove_spaces=IS_REMOVE_SPACES, is_normalize_ipa=IS_NORMALIZE_IPA)
    model_name = f"{m}_to_epitran".replace("/", "_")
    metrics = model_evaluator.eval_non_empty_transcriptions(
        model_name, epitran_predictions[multipa.evaluation.PREDICTION_KEY], non_empty_test_data["ipa"]
    )
    multipa.evaluation.write_detailed_prediction_results(VERBOSE_RESULTS_DIR, model_name, non_empty_test_data, epitran_predictions, metrics)
    model_evaluator.write_edit_distance_results(model_name, EDIT_DIST_DIR)

Building pipeline and downloading model
Predicting with openai/whisper-large-v3-turbo


You have passed language=english, but also have set `forced_decoder_ids` to [[1, None], [2, 50360]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of language=english.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [None]:
# Define models and phone inventory to test
# allosaurus_models = ["uni2005", "eng2102"]
# phone_inventory = ["ipa", "eng"]

allosaurus_models = ["eng2102"]
phone_inventory = ["eng"]

# Predict and check against gold standard
for model, pi in itertools.product(allosaurus_models, phone_inventory):
    model_predictions = multipa.evaluation_extras.allosaurus_predict(non_empty_test_data, model, pi, is_remove_spaces=IS_REMOVE_SPACES, is_normalize_ipa=IS_NORMALIZE_IPA, num_proc=NUM_PROC)
    model_name = f"allosaurus_{model}_{pi}"
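    # PREDICTION_KEY and write_detailed_prediction_results live in multipa.evaluation,
    # which is imported by module name only, so both must be fully qualified.
    metrics = model_evaluator.eval_non_empty_transcriptions(
        model_name, model_predictions[multipa.evaluation.PREDICTION_KEY], non_empty_test_data["ipa"]
    )
    multipa.evaluation.write_detailed_prediction_results(
        VERBOSE_RESULTS_DIR, model_name, non_empty_test_data, model_predictions, metrics
    )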
    model_evaluator.write_edit_distance_results(model_name, EDIT_DIST_DIR)


In [None]:
hf_to_phonecodes_models = [("excalibur12/wav2vec2-large-lv60_phoneme-timit_english_timit-4k", "timit", "ipa")]

for model_name, in_code, out_code in hf_to_phonecodes_models:
    model_predictions = multipa.evaluation_extras.hf_to_phonecodes(non_empty_test_data, model_name, in_code, out_code, is_remove_spaces=IS_REMOVE_SPACES, is_normalize_ipa=IS_NORMALIZE_IPA, device=DEVICE, num_proc=NUM_PROC)
    print(model_predictions)
    metrics = model_evaluator.eval_non_empty_transcriptions(
        model_name,
        model_predictions[multipa.evaluation.PREDICTION_KEY],
        non_empty_test_data["ipa"])
    multipa.evaluation.write_detailed_prediction_results(
        VERBOSE_RESULTS_DIR, multipa.evaluation.clean_model_name(model_name), non_empty_test_data, model_predictions, metrics
    )
    model_evaluator.write_edit_distance_results(model_name, EDIT_DIST_DIR)


In [None]:
# Write all results to file for comparison
model_evaluator.to_csv(AGGREGATE_METRICS_CSV)
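
For intuition about the edit distance outputs written above, here is a minimal sketch of a character-level Levenshtein distance and the error rate derived from it. This is an illustration under assumptions, not multipa's implementation: `edit_distance` and `error_rate` are hypothetical names, the prediction string is made up, and the real evaluator may tokenize phones differently (for example, treating a diphthong or a symbol with diacritics as one unit).

```python
def edit_distance(reference: str, prediction: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(prediction) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, pred_char in enumerate(prediction, start=1):
            substitution = previous[j - 1] + (ref_char != pred_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate(reference: str, prediction: str) -> float:
    """Edit distance normalized by reference length."""
    if not reference:
        # Empty references need separate handling, which is why the notebook
        # splits the test data into non-empty and empty transcriptions.
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, prediction) / len(reference)


reference = "fɔɹfaɪv"   # gold IPA for the first test utterance
prediction = "foɹfaɪ"   # made-up prediction: one substitution, one deletion
print(edit_distance(reference, prediction))  # 2
```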