# Evaluating Additional Modeling Pipelines
We should also compare performance on the evaluation data against other readily available phonetic transcription options, to determine whether fine-tuning your own model is worth the effort.
The two options we consider here are:
- Wav2vec2 fine-tuned on TIMIT ([elgeish/wav2vec2-large-lv60-timit-asr](https://huggingface.co/elgeish/wav2vec2-large-lv60-timit-asr)) as the speech recognition model, followed by [Epitran](https://github.com/dmort27/epitran) to convert orthography to IPA. The TIMIT corpus is a high-quality corpus of read English speech.
- [Allosaurus](https://github.com/xinjli/allosaurus), a pre-trained universal phone recognizer that claims to recognize phones in more than 2000 languages.

These evaluations only need to be run once.
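
For intuition about what the comparison measures, here is a minimal sketch of a phone error rate: the edit distance between predicted and reference phones, divided by the reference length. It assumes phones are separated by spaces, as in the `ipa` field of the dataset. The functions `edit_distance` and `phone_error_rate` are hypothetical illustrations; the metrics actually reported in this notebook come from `multipa.evaluate`, which may tokenize and score differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(reference, prediction):
    """Edit distance over space-separated phones, normalized by reference length."""
    ref_phones = reference.split()
    hyp_phones = prediction.split()
    if not ref_phones:
        raise ValueError("reference must contain at least one phone")
    return edit_distance(ref_phones, hyp_phones) / len(ref_phones)


# Reference taken from the first test utterance; the prediction is made up
# for illustration and drops one phone.
reference = "f ɔ ɹ f aɪ v"
prediction = "f ɔ f aɪ v"
print(phone_error_rate(reference, prediction))
```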

## Additional installation step for Epitran
To use Epitran for English, you also need to install [Flite](https://github.com/festvox/flite). See the Epitran note at https://github.com/dmort27/epitran?tab=readme-ov-file#installation-of-flite-for-english-g2p. I installed Flite on my Mac as follows:

```bash
$ git clone http://github.com/festvox/flite
$ cd flite
$ ./configure && make
$ sudo make install
$ cd testsuite
$ make lex_lookup
$ sudo cp lex_lookup /usr/local/bin
```
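
Before running the Epitran cells, it can help to confirm that the `lex_lookup` binary copied above is discoverable. This is a small sketch using only the standard library; `find_lex_lookup` is a hypothetical helper name, and the check assumes Epitran locates `lex_lookup` through your `PATH`.

```python
import shutil


def find_lex_lookup():
    """Return the full path to Flite's lex_lookup binary, or None if it is not on PATH."""
    return shutil.which("lex_lookup")


path = find_lex_lookup()
if path is None:
    print("lex_lookup not found on PATH; install Flite before using Epitran for English.")
else:
    print("lex_lookup found at", path)
```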



In [1]:
import itertools
import time

import allosaurus.app
import allosaurus.bin.download_model
import epitran
import transformers
from tqdm import tqdm


from multipa.data_utils import load_buckeye_split
from multipa.evaluate import ModelEvaluator, preprocess_test_data, DETAILED_PREDICTIONS_CSV_SUFFIX

In [2]:
def allosaurus_predict(test_dataset, model="eng2102", phone_inventory="ipa"):
    """Transcribe each audio file with Allosaurus and return space-free phone strings."""
    print("Evaluating allosaurus. Model:", model, "Phone inventory:", phone_inventory)
    model_predictions = []
    # Use a separate name so the model identifier string is not shadowed
    recognizer = allosaurus.app.read_recognizer(model)
    start = time.time()
    for audio in tqdm(test_dataset["audio"]):
        prediction = recognizer.recognize(audio["path"], phone_inventory)
        # Strip the spaces Allosaurus puts between phones
        prediction = prediction.replace(" ", "")
        model_predictions.append(prediction)
    end = time.time()
    print("Eval time in seconds:", end - start)
    return model_predictions

def wav2vec_to_epitran_predict(test_dataset):
    """Transcribe audio to English orthography with wav2vec2, then convert to IPA with Epitran."""
    print("Building pipeline and downloading model")
    pipe = transformers.pipeline("automatic-speech-recognition", model="elgeish/wav2vec2-large-lv60-timit-asr")
    print("Predicting with wav2vec")
    # The timer covers both wav2vec inference and Epitran transliteration
    start = time.time()
    orthography_predictions = [d["text"] for d in pipe(test_dataset["audio"])]
    epi = epitran.Epitran("eng-Latn")
    print("Transliterating with Epitran")
    ipa_predictions = []
    for pred in tqdm(orthography_predictions):
        # Remove spaces between words in the IPA output
        result = epi.transliterate(pred).replace(" ", "")
        ipa_predictions.append(result)
    end = time.time()
    print("Eval time in seconds:", end - start)
    return ipa_predictions

In [3]:
input_data = load_buckeye_split("../data/buckeye", "test")
# Snippet of transcriptions
# Note that there don't appear to be any empty transcriptions,
# so this notebook skips looking at hallucinations
print("Data Preview")
print(input_data)
print(input_data[0])

non_empty_test_data, empty_test_data = preprocess_test_data(input_data, is_remove_space=True)

model_evaluator = ModelEvaluator()

Resolving data files:   0%|          | 0/36010 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/10160 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/11212 [00:00<?, ?it/s]



Data Preview
Dataset({
    features: ['audio', 'utterance_id', 'duration', 'buckeye_transcript', 'text', 'ipa', 'speaker_id', 'speaker_gender', 'speaker_age_range', 'interviewer_gender', 'file_path', '__index_level_0__'],
    num_rows: 5079
})
{'audio': {'bytes': None, 'path': '/Users/virginia/workspace/multipa/data/buckeye/test/s2501a_Utt0.wav'}, 'utterance_id': 's2501a_Utt0', 'duration': 0.925981, 'buckeye_transcript': 'f ao r f ay v', 'text': 'four five', 'ipa': 'f ɔ ɹ f aɪ v', 'speaker_id': 'S25', 'speaker_gender': 'f', 'speaker_age_range': 'o', 'interviewer_gender': 'm', 'file_path': 'data/buckeye/test/s2501a_Utt0.wav', '__index_level_0__': 0}


  0%|          | 0/5079 [00:00<?, ?ex/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

Number of test examples with empty transcriptions: 0
Dataset({
    features: ['audio', 'utterance_id', 'duration', 'buckeye_transcript', 'text', 'ipa', 'speaker_id', 'speaker_gender', 'speaker_age_range', 'interviewer_gender', 'file_path', '__index_level_0__'],
    num_rows: 0
})


  0%|          | 0/6 [00:00<?, ?ba/s]

In [4]:
# Epitran
epitran_predictions = wav2vec_to_epitran_predict(non_empty_test_data)
model_name = "wav2vec_to_epitran"
epitran_detailed_csv = f"{model_name}_{DETAILED_PREDICTIONS_CSV_SUFFIX}"
metrics = model_evaluator.eval_non_empty_transcriptions(model_name, epitran_predictions, non_empty_test_data["ipa"])
detailed_results = non_empty_test_data.add_column("prediction", epitran_predictions).remove_columns(["audio"])
for k in ["phone_error_rates", "phone_feature_error_rates", "feature_error_rates"]:
    detailed_results = detailed_results.add_column(k, metrics[k])
detailed_results.remove_columns(["__index_level_0__"]).to_csv(epitran_detailed_csv, index=False)


Building pipeline and downloading model




Predicting with wav2vec
Transliterating with Epitran


100%|██████████| 5079/5079 [03:32<00:00, 23.93it/s]


Eval time in seconds: 2035.2930929660797


Flattening the indices:   0%|          | 0/6 [00:00<?, ?ba/s]

Creating CSV from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

1961394

In [5]:
# Define models and phone inventory to test
allosaurus_models = ["uni2005", "eng2102"]
phone_inventory = ["ipa", "eng"]

# Download models
for m in allosaurus_models:
    allosaurus.bin.download_model.download_model(m)

# Predict and check against gold standard
for model, pi in itertools.product(allosaurus_models, phone_inventory):
    model_predictions = allosaurus_predict(non_empty_test_data, model, pi)
    model_name = f"allosaurus_{model}_{pi}"
    detailed_results_csv = f"{model_name}_{DETAILED_PREDICTIONS_CSV_SUFFIX}"
    metrics = model_evaluator.eval_non_empty_transcriptions(model_name, model_predictions, non_empty_test_data["ipa"])
    detailed_results = non_empty_test_data.add_column("prediction", model_predictions).remove_columns(["audio"])
    for k in ["phone_error_rates", "phone_feature_error_rates", "feature_error_rates"]:
        detailed_results = detailed_results.add_column(k, metrics[k])
    detailed_results.remove_columns(["__index_level_0__"]).to_csv(detailed_results_csv, index=False)



Evaluating allosaurus. Model: uni2005 Phone inventory: ipa


  return (feature - spk_mean)/spk_std
100%|██████████| 5079/5079 [58:04<00:00,  1.46it/s]    


Eval time in seconds: 3486.4274678230286


Flattening the indices:   0%|          | 0/6 [00:00<?, ?ba/s]

Creating CSV from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

Evaluating allosaurus. Model: uni2005 Phone inventory: eng


  return (feature - spk_mean)/spk_std
100%|██████████| 5079/5079 [1:21:28<00:00,  1.04it/s]    


Eval time in seconds: 4891.501588821411


Flattening the indices:   0%|          | 0/6 [00:00<?, ?ba/s]

Creating CSV from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

Evaluating allosaurus. Model: eng2102 Phone inventory: ipa


  return (feature - spk_mean)/spk_std
100%|██████████| 5079/5079 [2:23:31<00:00,  1.70s/it]     


Eval time in seconds: 8613.858610153198


Flattening the indices:   0%|          | 0/6 [00:00<?, ?ba/s]

Creating CSV from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

Evaluating allosaurus. Model: eng2102 Phone inventory: eng


  return (feature - spk_mean)/spk_std
100%|██████████| 5079/5079 [2:37:40<00:00,  1.86s/it]     


Eval time in seconds: 9463.133610010147


Flattening the indices:   0%|          | 0/6 [00:00<?, ?ba/s]

Creating CSV from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

In [6]:
# Write all results to file for comparison
model_evaluator.to_csv("epitran_allosaurus_eval.csv")
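
Besides accuracy, the logs above give a rough sense of speed. The sketch below converts the logged "Eval time in seconds" values into utterances per second over the 5079 test utterances. The dictionary keys follow the `model_name` values used in this notebook, and `utterances_per_second` is a hypothetical helper. These timings come from a single run on one machine, so treat the ranking as indicative only.

```python
# Wall-clock evaluation times (seconds), copied from the "Eval time in seconds" logs above
eval_seconds = {
    "wav2vec_to_epitran": 2035.2930929660797,
    "allosaurus_uni2005_ipa": 3486.4274678230286,
    "allosaurus_uni2005_eng": 4891.501588821411,
    "allosaurus_eng2102_ipa": 8613.858610153198,
    "allosaurus_eng2102_eng": 9463.133610010147,
}
NUM_UTTERANCES = 5079  # rows in the Buckeye test split shown in the data preview


def utterances_per_second(seconds, num_utterances=NUM_UTTERANCES):
    """Average number of utterances processed per second of wall-clock time."""
    return num_utterances / seconds


throughput = {name: utterances_per_second(s) for name, s in eval_seconds.items()}

# Print pipelines from fastest to slowest
for name, rate in sorted(throughput.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.2f} utterances/s")
```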