# Upload experimental models to Hugging Face Model repositories
This notebook is a helper for uploading trained models to the Hugging Face Hub. It lets you attach README information describing each experiment at upload time, so every model repository is documented.

*First*: Make sure that your Hugging Face Hub token is available to the notebook, for example by logging in on the command line via `huggingface-cli login`.
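
As a quick pre-flight check, the sketch below looks for a token in the two places it most commonly lives. It is a minimal illustration and not part of the original notebook: `find_hf_token_source` is a hypothetical helper, and it only checks the `HF_TOKEN` environment variable and the `token` file under `HF_HOME` (assumed here to default to `~/.cache/huggingface`), so other login mechanisms will not be detected.

```python
import os
from pathlib import Path


def find_hf_token_source(env=None, token_file=None):
    """Return a short description of where a Hub token was found, or None.

    Assumption: only the HF_TOKEN environment variable and the token file
    written by `huggingface-cli login` are checked.
    """
    env = os.environ if env is None else env
    if token_file is None:
        # Assumed default cache location; HF_HOME overrides it when set.
        hf_home = env.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
        token_file = Path(hf_home) / "token"
    if env.get("HF_TOKEN"):
        return "HF_TOKEN environment variable"
    token_file = Path(token_file)
    if token_file.is_file() and token_file.read_text().strip():
        return f"token file at {token_file}"
    return None


source = find_hf_token_source()
print(source or "No token found; run `huggingface-cli login` first.")
```

A `None` result does not prove that uploads will fail, only that neither of the two checked locations holds a token.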

In [1]:
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError
import transformers

from pathlib import Path



In [None]:
# Root folder containing one subfolder per trained model
MODEL_ROOT = Path("../data/models/")
ALL_MODELS_README = """
---
license: mit
language:
- en
pipeline_tag: automatic-speech-recognition
---
# About 
This model was created to support experiments for evaluating phonetic transcription 
with the Buckeye corpus as part of https://github.com/ginic/multipa. 
This is a version of facebook/wav2vec2-large-xlsr-53 fine-tuned on a specific subset of the Buckeye corpus.
For details about specific model parameters, please view the config.json here or 
training scripts in the scripts/buckeye_experiments folder of the GitHub repository. 

# Experiment Details
"""

# Specific sets of experiments have more details. I just copied these from the EXPERIMENT_LOG.md 
README_MAPPINGS = {
#     # This was the best hyperparam tuned model & these model parameters were used for all other experiments
#     "hyperparam_tuning_1":"""The best performing model from hyperparameter tuning experiments (batch size, learning rate, base model to fine-tune). Vary the random seed to select training data while keeping an even 50/50 gender split to measure statistical significance of changing training data selection. Retrain with the same model parameters, but different data seeding to measure statistical significance of data seed, keeping 50/50 gender split. 

# Goals: 
# - Choose initial hyperparameters (batch size, learning rate, base model to fine-tune) based on validation set performance
# - Establish whether data variation with the same gender makeup is statistically significant in changing performance on the test set (first data_seed experiment)
# """,
#     "data_seed_bs64": """Vary the random seed to select training data while keeping an even 50/50 gender split to measure statistical significance of changing training data selection. Retrain with the same model parameters, but different data seeding to measure statistical significance of data seed, keeping 50/50 gender split. 

# Goals: 
# - Establish whether data variation with the same gender makeup is statistically significant in changing performance on the test set

# Params to vary:
# - training data seed (--train_seed): [91, 114, 771, 503]
# """,

#     "gender_split": """Still training with a total amount of data equal to half the full training data (4000 examples), vary the gender split 30/70, but draw examples from all individuals. Do 5 models for each gender split with the same model parameters but different data seeds. 

# Goals: 
# - Determine how differences in the gender split of the training data affect performance

# Params to vary: 
# - percent female (--percent_female) [0.3, 0.7]
# - training seed (--train_seed)
# """, 

#     "vary_individuals": """These experiments keep the total amount of data equal to half the training data with the gender split 50/50, but further exclude certain speakers completely using the --speaker_restriction argument. This allows us to restrict speakers included in training data in any way. For the purposes of these experiments, we are focused on the age demographic of the speakers.

# For reference, the speakers and their demographics included in the training data are as follows where the speaker age range 'y' means under 30 and 'o' means over 40: 

# | speaker_id | speaker_gender | speaker_age_range | 
# | ---------- | -------------- | ----------------- |
# | S01 | f | y |
# | S04 | f | y | 
# | S08 | f | y | 
# | S09 | f | y | 
# | S12 | f | y | 
# | S21 | f | y | 
# | S02 | f | o |
# | S05 | f | o | 
# | S07 | f | o | 
# | S14 | f | o | 
# | S16 | f | o |
# | S17 | f | o | 
# | S06 | m | y | 
# | S11 | m | y | 
# | S13 | m | y | 
# | S15 | m | y | 
# | S28 | m | y | 
# | S30 | m | y |
# | S03 | m | o | 
# | S10 | m | o | 
# | S19 | m | o |
# | S22 | m | o |
# | S24 | m | o | 


# Goals: 
# - Determine how variety of speakers in the training data affects performance

# Params to vary: 
# - training seed (--train_seed)
# - demographic make up of training data by age, using --speaker_restriction 
#     - Experiments `young_only`: only individuals under 30, S01 S04 S08 S09 S12 S21 S06 S11 S13 S15 S28 S30
#     - Experiments `old_only`: only individuals over 40, S02 S05 S07 S14 S16 S17 S03 S10 S19 S22 S24
# """
    "full_dataset": """The entire Buckeye corpus, including the sets that were held out for validation and/or testing in our original experiments with Buckeye, are used to train the model. 
The "full_dataset_train_val" used both training and validation splits in model training and the "full_dataset_train_val_test" used all splits (training, validation, test) in model training. As a result, the held-out test set of Buckeye should not be used to establish performance benchmarks for these models.

Goals: 
- Include the largest amount of training data possible. 
- Can be used with a different corpus (e.g. TIMIT, Speech Accent Archive) for evaluation to test generalization to other dialects and language varieties. 
"""


}
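
The way these two pieces are combined into one README can be sketched in isolation. This is a minimal illustration that mirrors the first-match prefix lookup used by the upload loop; `build_readme`, `SHARED_HEADER` and `DETAILS_BY_PREFIX` are hypothetical names, and the strings are shortened stand-ins for `ALL_MODELS_README` and `README_MAPPINGS`, not the real text.

```python
# Shortened stand-ins for ALL_MODELS_README and README_MAPPINGS (illustrative text only).
SHARED_HEADER = "---\nlicense: mit\n---\n# About\n...\n\n# Experiment Details\n"
DETAILS_BY_PREFIX = {
    "full_dataset": "Trained on the entire Buckeye corpus.\n",
    "gender_split": "Varies the gender split of the training data.\n",
}


def build_readme(folder_name, header, details_by_prefix):
    """Return header + details for the first prefix matching folder_name, or None."""
    for prefix, details in details_by_prefix.items():
        if folder_name.startswith(prefix):
            return header + details
    return None


readme = build_readme("full_dataset_train_val", SHARED_HEADER, DETAILS_BY_PREFIX)
print(readme)
```

Because the first matching key wins, a key that is a prefix of another key would shadow it if it appears earlier in the dictionary, so the keys are best kept mutually distinct.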

In [None]:
api = HfApi()
for model_folder in MODEL_ROOT.iterdir():
    if model_folder.is_dir():
        for prefix in README_MAPPINGS:
            if model_folder.name.startswith(prefix):
                print(f"Model {model_folder} matches prefix '{prefix}'.")
                hub_name = f"ginic/{model_folder.name}_wav2vec2-large-xlsr-53-buckeye-ipa"

                # Combine the shared header with the experiment-specific details
                full_readme = "".join([ALL_MODELS_README, README_MAPPINGS[prefix]])
                model_to_upload = model_folder / "wav2vec2-large-xlsr-53-buckeye-ipa"
                readme_path = model_to_upload / "README.md"
                readme_path.write_text(full_readme)

                # Load the local checkpoint as a pipeline and push it to the Hub
                model_pipeline = transformers.pipeline("automatic-speech-recognition", model=model_to_upload)
                print("Uploading to hub as:", hub_name)
                model_pipeline.push_to_hub(hub_name)

                # Upload the README separately so the repository documents the experiment
                print("Uploading README for", hub_name)
                api.upload_file(
                    path_or_fileobj=readme_path,
                    path_in_repo="README.md",
                    repo_id=hub_name,
                    repo_type="model",
                )

                # Don't look at other prefix keys, the model is already uploaded
                break



Model ../data/models/vary_individuals_old_only_1 matches prefix 'vary_individuals'.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Uploading to hub as: ginic/vary_individuals_old_only_1_wav2vec2-large-xlsr-53-buckeye-ipa


model.safetensors: 100%|██████████| 1.26G/1.26G [00:23<00:00, 53.9MB/s]


Uploading README for ginic/vary_individuals_old_only_1_wav2vec2-large-xlsr-53-buckeye-ipa
Model ../data/models/vary_individuals_old_only_2 matches prefix 'vary_individuals'.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Uploading to hub as: ginic/vary_individuals_old_only_2_wav2vec2-large-xlsr-53-buckeye-ipa


model.safetensors: 100%|██████████| 1.26G/1.26G [00:22<00:00, 55.9MB/s]


Uploading README for ginic/vary_individuals_old_only_2_wav2vec2-large-xlsr-53-buckeye-ipa
Model ../data/models/vary_individuals_old_only_3 matches prefix 'vary_individuals'.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Uploading to hub as: ginic/vary_individuals_old_only_3_wav2vec2-large-xlsr-53-buckeye-ipa


model.safetensors: 100%|██████████| 1.26G/1.26G [00:28<00:00, 43.9MB/s]


Uploading README for ginic/vary_individuals_old_only_3_wav2vec2-large-xlsr-53-buckeye-ipa
Model ../data/models/vary_individuals_young_only_1 matches prefix 'vary_individuals'.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Uploading to hub as: ginic/vary_individuals_young_only_1_wav2vec2-large-xlsr-53-buckeye-ipa


model.safetensors: 100%|██████████| 1.26G/1.26G [00:23<00:00, 54.4MB/s]


Uploading README for ginic/vary_individuals_young_only_1_wav2vec2-large-xlsr-53-buckeye-ipa
Model ../data/models/vary_individuals_young_only_2 matches prefix 'vary_individuals'.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Uploading to hub as: ginic/vary_individuals_young_only_2_wav2vec2-large-xlsr-53-buckeye-ipa


model.safetensors: 100%|██████████| 1.26G/1.26G [00:26<00:00, 47.1MB/s]


Uploading README for ginic/vary_individuals_young_only_2_wav2vec2-large-xlsr-53-buckeye-ipa
Model ../data/models/vary_individuals_young_only_3 matches prefix 'vary_individuals'.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Uploading to hub as: ginic/vary_individuals_young_only_3_wav2vec2-large-xlsr-53-buckeye-ipa


model.safetensors: 100%|██████████| 1.26G/1.26G [00:23<00:00, 53.3MB/s]


Uploading README for ginic/vary_individuals_young_only_3_wav2vec2-large-xlsr-53-buckeye-ipa
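
Since each upload transfers more than a gigabyte, it can help to preview which folders would be uploaded and under which repository names before running the loop. The sketch below is one possible dry run under stated assumptions: `plan_uploads` is a hypothetical helper, the folder names in the demonstration are made up, and nothing is sent to the Hub.

```python
import tempfile
from pathlib import Path

MODEL_SUBFOLDER = "wav2vec2-large-xlsr-53-buckeye-ipa"


def plan_uploads(model_root, prefixes, namespace="ginic"):
    """List (local model path, hub repo id) pairs without uploading anything."""
    plan = []
    for model_folder in sorted(Path(model_root).iterdir()):
        local_model = model_folder / MODEL_SUBFOLDER
        if not local_model.is_dir():
            continue  # skip files and folders that hold no trained model
        if any(model_folder.name.startswith(p) for p in prefixes):
            repo_id = f"{namespace}/{model_folder.name}_{MODEL_SUBFOLDER}"
            plan.append((local_model, repo_id))
    return plan


# Demonstration on a throwaway directory tree (hypothetical folder names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["full_dataset_train_val", "gender_split_30_female_1"]:
        (Path(tmp) / name / MODEL_SUBFOLDER).mkdir(parents=True)
    demo_plan = plan_uploads(tmp, prefixes=["full_dataset"])
    for local_model, repo_id in demo_plan:
        print(local_model.parent.name, "->", repo_id)
```

In the notebook itself, the call would be `plan_uploads(MODEL_ROOT, README_MAPPINGS)`, which iterates over the dictionary keys as prefixes.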


In [4]:
# Sanity check that the upload worked and the model from the hub can be used for inference
import datasets

# Stream two examples and resample the audio to 16 kHz
dataset = datasets.load_dataset("MLCommons/peoples_speech", split="train", streaming=True).take(2)
dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
examples = list(dataset)  # materialize once so the stream is not read twice
print(examples)
pipe = transformers.pipeline("automatic-speech-recognition", model="ginic/vary_individuals_old_only_1_wav2vec2-large-xlsr-53-buckeye-ipa")
for example in examples:
    pred = pipe(example["audio"])
    print("actual text:", example["text"])
    print("prediction:", pred)



Using custom data configuration clean-88debf4c1ba2b8fa


[{'id': '07282016HFUUforum_SLASH_07-28-2016_HFUUforum_DOT_mp3_00000.flac', 'audio': {'path': '07282016HFUUforum_SLASH_07-28-2016_HFUUforum_DOT_mp3_00000.flac', 'array': array([ 0.14205933,  0.20620728,  0.27151489, ...,  0.00402832,
       -0.00628662, -0.01422119]), 'sampling_rate': 16000}, 'duration_ms': 14920, 'text': "i wanted this to share a few things but i'm going to not share as much as i wanted to share because we are starting late i'd like to get this thing going so we all get home at a decent hour this this election is very important to"}, {'id': '07282016HFUUforum_SLASH_07-28-2016_HFUUforum_DOT_mp3_00001.flac', 'audio': {'path': '07282016HFUUforum_SLASH_07-28-2016_HFUUforum_DOT_mp3_00001.flac', 'array': array([-0.01480103,  0.05319214, -0.0105896 , ..., -0.02996826,
        0.06680298,  0.0071106 ]), 'sampling_rate': 16000}, 'duration_ms': 14530, 'text': "state we support agriculture to the tune of point four percent no way i made a mistake this year they lowered it from po

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


actual text: i wanted this to share a few things but i'm going to not share as much as i wanted to share because we are starting late i'd like to get this thing going so we all get home at a decent hour this this election is very important to
prediction: {'text': 'ɑwɑɾ̃ɪtzɪzdʒɪʃɛɹfjuθɪŋzbʌɾʌmɡʌɾ̃ʌznɑtʃɛɹʌzmʌtʃzʌwɑñɪdɪʃɛɹbɪkʌzwiɑɹstɑɹɾɪɡliadlaɪktɪɡɛttðɪsθɪŋɡoʊʌnsʌwiɡɔɡɛɾhoʊmʌɾʌdisʌnaʊɹ̩ʌmðɪsðɪsʌlɛkʃɪnɪzʌmvɛɹiɪmpɔɹʔn̩tu'}
actual text: state we support agriculture to the tune of point four percent no way i made a mistake this year they lowered it from point four percent to point three eight percent and in the same breath they're saying food
prediction: {'text': 'steɪwisʌpoʊɹɾæɡɹ̩kʌltʃɹ̩tɪðɪtunʌvpɔɪnfɔɹpɹ̩sɛnoʊnoʊwɪaɪmɪɾʌmʌsteɪkðɪʃjɪɹ̩ðeɪloʊɹ̩dɪtfɹʌmpɔɪntfoʊɹpɹ̩sɛntʌpɔɪntθɹieɪpɹ̩sɛɛɾ̃ɪnðʌseɪmɡɹɛθðɛɹ̩seɪmfuts'}
