Facebook/mms-tts-deu speaks two voices at once, male and female #5498

Open
moseich opened this issue May 16, 2024 · 0 comments
moseich commented May 16, 2024

What is your question?
I am experiencing an issue with the pretrained neural network facebook/mms-tts-deu. When generating speech, it sometimes alternates between male and female voices, making the output unclear. How can I resolve this issue and generate speech with a single, consistent voice?

Initially, I used the following code and generated audio on CPU:

from transformers import VitsModel, AutoTokenizer
import torch

def generate_audio(transcription, language):
    model_paths = {
        "en": "/home/igor/NEURALNETWORK/facebook_mms_tts_eng",
        "de": "/home/igor/NEURALNETWORK/facebook_mms_tts_deu"
    }
    model_path = model_paths.get(language)
    if model_path is None:
        raise ValueError(f"Unsupported language: {language}")

    # Load model and tokenizer from the local checkpoint directory
    model = VitsModel.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Tokenize the input text and generate the waveform
    inputs = tokenizer(transcription, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs).waveform

    return output[0].numpy(), model.config.sampling_rate

The model generated good audio overall, but sometimes male and female voices appeared alternately within the same output. I will attach an example audio file.

Then, I conducted experiments on Google Colab and tried changing speaker IDs:

import torch
from transformers import VitsModel, AutoTokenizer
import scipy.io.wavfile
import os
from huggingface_hub import login

# Authenticate to Hugging Face
login("##############################")

# Load model and tokenizer
model = VitsModel.from_pretrained("facebook/mms-tts-deu")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-deu")

# Check if the model supports multiple speakers
config = model.config
supports_speaker_id = hasattr(config, 'num_speakers') and config.num_speakers > 1

# Example text input
text = "Hallo, wie geht es Ihnen heute?"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt")

# Set speaker_id if supported
if supports_speaker_id:
    speaker_id = torch.tensor([1])  # Example: 0 - male voice, 1 - female voice
    inputs["speaker_id"] = speaker_id

# Generate waveform
with torch.no_grad():
    output = model(**inputs).waveform

# Ensure the output is in the right shape and format for audio playback
waveform = output.squeeze().numpy()

# Define the output path
output_dir = "/content/gdrive/MyDrive/TextToSpeech/TestSpeech"
output_path = os.path.join(output_dir, "output.wav")

# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)

# Save waveform as .wav file
scipy.io.wavfile.write(output_path, rate=model.config.sampling_rate, data=waveform)

print(f"Audio saved at: {output_path}")

# Playback the audio in the notebook
from IPython.display import Audio
Audio(waveform, rate=model.config.sampling_rate)

I tried speaker IDs 0 and 1, but the output was always a male voice; I could not generate a female voice. How can I control the voice in this model?
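One quick check that may explain this: the MMS-TTS checkpoints are, to my understanding, released as one single-speaker model per language, so `num_speakers` should be 1 and a `speaker_id` would simply have no speaker embedding to select. A small sketch that downloads only the config to verify:

```python
from transformers import AutoConfig

# Fetch just the model config (a small JSON file) to inspect speaker support.
config = AutoConfig.from_pretrained("facebook/mms-tts-deu")

# If num_speakers is 1, the checkpoint has no speaker embedding table,
# so passing speaker_id cannot change the voice.
print(config.num_speakers)
```

If this prints 1, the alternating voices come from the model's stochastic sampling rather than from any speaker selection, and controlling the voice would require a different (multi-speaker) model or fine-tuning.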

What's your environment?
fairseq Version: 1.0
PyTorch Version: Tried different versions
OS: Linux
How you installed fairseq: pip
Build command you used (if compiling from source): N/A
Python version: Tried different versions (3.9, 3.10)
CUDA/cuDNN version: N/A (CPU usage)
GPU models and configuration: N/A (CPU usage)
Any other relevant information: None
