In [24]:
from transformers import pipeline
import soundfile as sf
import os
from datasets import load_dataset
import numpy as np

To run inference with the prototype, we will use Google's FLEURS dataset, which contains short audio recordings of people speaking. To keep this demonstration small, we will use only 100 samples.

In [25]:
# Load the English (en_us) configuration of FLEURS
fleurs_asr = load_dataset("google/fleurs", "en_us")

# Take the audio of the first 100 training samples
audio_inputs = fleurs_asr["train"][:100]["audio"]

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
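
Each entry in `audio_inputs` is a dict that holds the decoded waveform under `array` and its sample rate under `sampling_rate`. As a quick sanity check before running any model, a minimal sketch like the one below computes a clip's duration from those two fields. The helper name `clip_duration_seconds` and the synthetic sample are illustrative assumptions, not part of the original notebook.

```python
def clip_duration_seconds(sample):
    """Return the clip length in seconds: number of samples / samples per second."""
    return len(sample["array"]) / sample["sampling_rate"]


# Synthetic stand-in for one decoded dataset entry (hypothetical values):
# 32,000 samples at 16 kHz is a two-second clip.
fake_sample = {"array": [0.0] * 32000, "sampling_rate": 16000}
print(clip_duration_seconds(fake_sample))  # 2.0
```

On the real data, `clip_duration_seconds(audio_inputs[0])` should report the length of the first recording.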


Next, we will load the off-the-shelf models that make up the pipeline: Whisper for automatic speech recognition (ASR), OPUS-MT for machine translation, and MMS for text-to-speech (TTS).

In [26]:
# ASR English
audio2text = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny.en",
    chunk_length_s=30,
)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [27]:
# Translator EN to ES
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es", device='cpu')



In [28]:
# Spanish text to speech
text2audio = pipeline("text-to-speech", model="ylacombe/mms-spa-finetuned-chilean-monospeaker")

Some weights of VitsModel were not initialized from the model checkpoint at ylacombe/mms-spa-finetuned-chilean-monospeaker and are newly initialized: ['flow.flows.0.wavenet.in_layers.0.parametrizations.weight.original0', 'flow.flows.0.wavenet.in_layers.0.parametrizations.weight.original1', 'flow.flows.0.wavenet.in_layers.1.parametrizations.weight.original0', 'flow.flows.0.wavenet.in_layers.1.parametrizations.weight.original1', 'flow.flows.0.wavenet.in_layers.2.parametrizations.weight.original0', 'flow.flows.0.wavenet.in_layers.2.parametrizations.weight.original1', 'flow.flows.0.wavenet.in_layers.3.parametrizations.weight.original0', 'flow.flows.0.wavenet.in_layers.3.parametrizations.weight.original1', 'flow.flows.0.wavenet.res_skip_layers.0.parametrizations.weight.original0', 'flow.flows.0.wavenet.res_skip_layers.0.parametrizations.weight.original1', 'flow.flows.0.wavenet.res_skip_layers.1.parametrizations.weight.original0', 'flow.flows.0.wavenet.res_skip_layers.1.parametrizations.weig

The function below ties the three models together into an audio-to-audio translation pipeline. It has to pass the audio data to the ASR model in the format that model expects and write the synthesized speech to disk in the correct format.

In [33]:
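def translate(audio_sample, output_path):
    # Wrap the decoded dataset sample in the dict format the ASR pipeline expects
    audio_data = {
        "raw": np.array(audio_sample["array"]),  # The audio waveform
        "sampling_rate": audio_sample["sampling_rate"]  # The sampling rate of the audio
    }
    # English speech -> English text
    english_text = audio2text(audio_data, batch_size=8)
    # English text -> Spanish text (the translator returns a list with one dict per input)
    spanish_text = translator(english_text['text'])
    spanish_text = spanish_text[0]['translation_text']
    # Spanish text -> Spanish speech
    speech = text2audio(spanish_text)

    # Ensure the output directory exists (dirname is empty for a bare filename)
    output_dir = os.path.dirname(output_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

    sf.write(
        output_path, speech["audio"].squeeze(), samplerate=speech["sampling_rate"]
    )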
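
To make the hand-offs between the three stages explicit, here is a minimal sketch of the same data flow with the models replaced by stub functions. The stubs (`fake_asr`, `fake_translator`, `fake_tts`) and the helper `run_stages` are hypothetical. They only mimic the output shapes used above: the ASR stage returns a dict with a `text` key, the translator returns a list of dicts with a `translation_text` key, and the TTS stage returns a dict with `audio` and `sampling_rate`.

```python
def run_stages(audio_data, asr, mt, tts):
    """Chain ASR -> translation -> TTS, unpacking each stage's output."""
    english_text = asr(audio_data)["text"]                   # dict with a "text" key
    spanish_text = mt(english_text)[0]["translation_text"]   # list with one dict per input
    speech = tts(spanish_text)                               # dict with "audio" and "sampling_rate"
    return english_text, spanish_text, speech


# Hypothetical stand-ins that return fixed values in the same shapes.
def fake_asr(audio_data):
    return {"text": "hello world"}


def fake_translator(text):
    return [{"translation_text": "hola mundo"}]


def fake_tts(text):
    return {"audio": [[0.0, 0.1, -0.1]], "sampling_rate": 16000}


english, spanish, speech = run_stages(
    {"raw": [0.0] * 16000, "sampling_rate": 16000},
    fake_asr, fake_translator, fake_tts,
)
print(english, "->", spanish, "| output rate:", speech["sampling_rate"])
```

Passing the stages in as arguments makes it possible to check the unpacking logic without downloading any model.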

Finally, we run the pipeline on every sample and store the translated audio files in an output folder.

In [35]:
output_directory = 'prototype_outputs'

for i, audio_input in enumerate(audio_inputs):
    print(f'translating audio {i}')
    output_path = os.path.join(output_directory, f'translated_audio_{i}.wav')
    translate(audio_input, output_path)

translating audio 0
translating audio 1
translating audio 2
translating audio 3
translating audio 4
translating audio 5
translating audio 6
translating audio 7
translating audio 8
translating audio 9
translating audio 10
translating audio 11
translating audio 12
translating audio 13
translating audio 14
translating audio 15
translating audio 16
translating audio 17
translating audio 18
translating audio 19
translating audio 20
translating audio 21
translating audio 22
translating audio 23
translating audio 24
translating audio 25
translating audio 26
translating audio 27
translating audio 28
translating audio 29
translating audio 30
translating audio 31
translating audio 32
translating audio 33
translating audio 34
translating audio 35
translating audio 36
translating audio 37
translating audio 38
translating audio 39
translating audio 40
translating audio 41
translating audio 42
translating audio 43
translating audio 44
translating audio 45
translating audio 46
translating audio 47
tr