What is your question?
I am experiencing an issue with the pretrained neural network facebook/mms-tts-deu. When generating speech, it sometimes alternates between male and female voices, making the output unclear. How can I resolve this issue and generate speech with a single, consistent voice?
Initially, I used the following code and generated audio on CPU:
from transformers import VitsModel, AutoTokenizer
import torch

def generate_audio(transcription, language):
    model_paths = {
        "en": "/home/igor/NEURALNETWORK/facebook_mms_tts_eng",
        "de": "/home/igor/NEURALNETWORK/facebook_mms_tts_deu"
    }
    model_path = model_paths.get(language)
    # Loading model and tokenizer
    model = VitsModel.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # Example text
    text = transcription
    # Tokenizing input and generating waveform
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs).waveform
    return output[0].numpy(), model.config.sampling_rate
The model generated good audio, but sometimes both male and female voices appear alternately. Here, I will upload an example audio file.
Then, I conducted experiments on Google Colab and tried changing speaker IDs:
import torch
from transformers import VitsModel, AutoTokenizer
import scipy.io.wavfile
import os
from huggingface_hub import login
# Authenticate to Hugging Face
login("##############################")
# Load model and tokenizer
model = VitsModel.from_pretrained("facebook/mms-tts-deu")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-deu")
# Check if the model supports multiple speakers
config = model.config
supports_speaker_id = hasattr(config, 'num_speakers') and config.num_speakers > 1
# Example text input
text = "Hallo, wie geht es Ihnen heute?"
# Tokenize input
inputs = tokenizer(text, return_tensors="pt")
# Set speaker_id if supported
if supports_speaker_id:
    speaker_id = torch.tensor([1])  # Example: 0 - male voice, 1 - female voice
    inputs["speaker_id"] = speaker_id
# Generate waveform
with torch.no_grad():
    output = model(**inputs).waveform
# Ensure the output is in the right shape and format for audio playback
waveform = output.squeeze().numpy()
# Define the output path
output_dir = "/content/gdrive/MyDrive/TextToSpeech/TestSpeech"
output_path = os.path.join(output_dir, "output.wav")
# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)
# Save waveform as .wav file
scipy.io.wavfile.write(output_path, rate=model.config.sampling_rate, data=waveform)
print(f"Audio saved at: {output_path}")
# Playback the audio in the notebook
from IPython.display import Audio
Audio(waveform, rate=model.config.sampling_rate)
I tried setting the speaker ID to 0 and to 1, but the output was always a male voice and I could not generate a female one. How can I control the voice in this model?
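One thing that may be worth checking first is whether the `if supports_speaker_id:` branch in the Colab script runs at all, since `speaker_id` is only passed when the config reports more than one speaker. The following is a minimal sketch of that check in isolation. The helper name `supports_multiple_speakers` is hypothetical, and the `SimpleNamespace` objects are stand-ins with made-up values for `model.config`, used so the snippet runs without downloading a checkpoint.

```python
from types import SimpleNamespace


def supports_multiple_speakers(config):
    """Return True only if the config declares more than one speaker."""
    # getattr with a default avoids an AttributeError when the field is absent
    return getattr(config, "num_speakers", 1) > 1


# Stand-in configs (hypothetical values, not read from a real checkpoint)
single = SimpleNamespace(num_speakers=1)
multi = SimpleNamespace(num_speakers=4)
missing = SimpleNamespace()

for name, cfg in [("single", single), ("multi", multi), ("missing", missing)]:
    print(name, supports_multiple_speakers(cfg))
```

With the real model, the same helper could be called as `supports_multiple_speakers(model.config)`. If it reports `False`, the script above never sets `speaker_id`, so switching between 0 and 1 would not be expected to change the output.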
What's your environment?
fairseq Version: 1.0
PyTorch Version: Tried different versions
OS: Linux
How you installed fairseq: pip
Build command you used (if compiling from source): N/A
Python version: Tried different versions (3.9, 3.10)
CUDA/cuDNN version: N/A (CPU usage)
GPU models and configuration: N/A (CPU usage)
Any other relevant information: None