
[Bug] Truncated audio when generating speech #1522

Closed · koutheir opened this issue Apr 21, 2022 · 6 comments
Labels: bug (Something isn't working)

koutheir commented Apr 21, 2022
Describe the bug

When generating speech audio from the following text, the generated file contains only truncated speech: the audio cuts off before all the sentences are pronounced.

Text:

Multiple debugging information entries may share the same abbreviation table entry. Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.

Generated speech audio: tts_output.zip

To Reproduce

$ tts --text 'Multiple debugging information entries may share the same abbreviation table entry. Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.'

Expected behavior

Generation of speech audio for the full input text.

Logs

$ tts --text 'Multiple debugging information entries may share the same abbreviation table entry. Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.'
 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: Tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: hifigan
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
Removing weight norm...
 > Text: Multiple debugging information entries may share the same abbreviation table entry. Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.
 > Text splitted to sentences.
['Multiple debugging information entries may share the same abbreviation table entry.', 'Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.']
   > Decoder stopped with `max_decoder_steps` 500
 > Processing time: 5.151856899261475
 > Real-time factor: 0.4295162001993176
 > Saving output to tts_output.wav

Environment

{
    "CUDA": {
        "GPU": [
            "Quadro RTX 5000 with Max-Q Design"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu102",
        "TTS": "0.6.2",
        "numpy": "1.19.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "",
        "python": "3.9.12",
        "version": "#1 SMP PREEMPT Debian 5.16.18-1 (2022-03-29)"
    }
}

Additional context

No response

koutheir added the bug label on Apr 21, 2022
erogol (Member) commented Apr 21, 2022

There is a limit on the sentence length for the default model. Either split the longest sentence into two or use a different model.
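
For context, the cap is the decoder step limit max_decoder_steps, visible in the log line "Decoder stopped with `max_decoder_steps` 500" above; it bounds decoder steps, not characters or words. A rough conversion to seconds of audio, assuming one mel frame per decoder step (the log reports reduction rate r = 1); treat this as a back-of-the-envelope estimate, not an exact guarantee:

    # Approximate audio length that fits within max_decoder_steps.
    # All values are taken from the log output above.
    max_decoder_steps = 500  # default decoder step cap for this model
    r = 1                    # mel frames emitted per decoder step
    hop_length = 256         # audio samples per mel frame
    sample_rate = 22050      # Hz

    max_seconds = max_decoder_steps * r * hop_length / sample_rate
    print(f"~{max_seconds:.1f} s of audio per sentence")  # ~5.8 s

Alternative models can be listed with tts --list_models.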

erogol closed this as completed Apr 21, 2022
koutheir (Author) commented

There is a limit on the sentence length for the default model.

  • What is that exact limit?
  • How can one find this information for a particular model?
  • How is the length calculated (character count, word count, etc.)?

koutheir (Author) commented May 3, 2022

@erogol, please reopen this issue as there are still pending questions about it, and I don't think I'm the only person asking them.

koutheir (Author) commented

@JRMeyer, can you please reopen this issue, or at least answer the questions above? @erogol doesn't seem to be responsive.

a-t-0 commented Jul 9, 2022

I would also like to know the information @koutheir asked for :)

* What is that exact limit?
* How can one find this information for a particular model?
* How is the length calculated (character count, word count, etc.)?

So it would be nice if the issue could be reopened.

FYI, here is a work-around given in this issue:
Add:

"max_decoder_steps": 5000

at the end of the config file of the (default) model, in:

~/.local/share/tts/vocoder_models--en--ljspeech--hifigan_v2/config.json

Note, JSON expects a comma at the end of a line if the line is followed by another property. So you have to add a comma to what was previously the last property, and make sure you don't put a trailing comma after your own "max_decoder_steps": 5000 entry. TL;DR: make the last lines look like:

    // PATHS
    "output_path": "/home/erogol/gdrive/Trainings/sam/",
    // Custom limit made larger
    "max_decoder_steps": 5000
}
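
If you'd rather patch the file programmatically, here is a minimal sketch (hypothetical, not an official tool; it assumes the downloaded config.json is plain JSON without // comments, unlike the training-config excerpt above, so back the file up first):

    import json
    from pathlib import Path

    # Hypothetical snippet: raise max_decoder_steps in a downloaded model config.
    # Assumes the file is plain JSON (no // comments).
    config_path = Path.home() / ".local/share/tts/vocoder_models--en--ljspeech--hifigan_v2/config.json"

    config = json.loads(config_path.read_text())
    config["max_decoder_steps"] = 5000
    config_path.write_text(json.dumps(config, indent=4))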

a-t-0 commented Jul 9, 2022

Here is a small script that splits a text into separate sentences, converts each sentence to speech, and merges the results into a single .mp3 file.
First install everything with this installer.sh:

pip install TTS
#git clone git@github.com:coqui-ai/TTS.git
#cd TTS
#pip install -e .[all,dev,notebooks]  # Select the relevant extras
mkdir -p "output/path/"
tts --text "Text for TTS" --out_path output/path/speech.wav

# The TTS does not allow for long sentences, so to work around this, 
# you could increase the permitted max_decoder_steps.
# Source: https://github.com/coqui-ai/TTS/issues/1333

# Open the configuration of the default model, 
# (which is en--ljspeech--hifigan_v2), located at:
# ~/.local/share/tts/vocoder_models--en--ljspeech--hifigan_v2/config.json

# Then add/ensure it contains:
# "max_decoder_steps": 5000

# In short, make the last lines of the config.json look like:
#    // PATHS
#    "output_path": "/home/erogol/gdrive/Trainings/sam/",
#    // Custom limit made larger
#    "max_decoder_steps": 5000
#}

# Install text parsing packages:
pip install nltk
pip install pydub

# Do a test run.
the_text='"'$(cat text.txt)'"'
tts --text "$the_text" --out_path output/path/text.wav

Run it with:

chmod +x installer.sh
./installer.sh

Then add that max_decoder_steps manually, like described above.
Then you can create a text.txt with your complete text, and run:

python text2speech.py

Below is the text2speech.py content:

""" Browse into audio/ dir and run:
python text2speech.py
to convert the audio files into a .wav file"""



import os
import re
import subprocess
from pydub import AudioSegment


def split_into_sentences_using_nlp():
    # Alternative sentence splitter using nltk (not used in the main flow below).
    import nltk.data
    nltk.download("punkt")  # fetch the sentence tokenizer data non-interactively

    tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
    with open("text.txt") as fp:
        data = fp.read()
    sentences = "\n-----\n".join(tokenizer.tokenize(data))
    return sentences


def load_txt_from_file(filename):
    with open(filename) as fp:
        return fp.read()


def split_into_sentences(text):
    # Specify regex values (raw strings avoid invalid-escape warnings).
    alphabets = r"([A-Za-z])"
    prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    websites = r"[.](com|net|org|io|gov)"

    # Perform conversion.
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]  # drop the empty chunk left after the final <stop>
    sentences = [s.strip() for s in sentences]
    return sentences

def convert_sentences_to_wav_files(filename: str,output_dir:str ,sentences: list):
    soundbite_filepaths=[]
    for i,sentence in enumerate(sentences):
        soundbite_filepath=f"{output_dir}/{filename}_{i}.wav"
        command=f'tts --text "{sentence}" --out_path {soundbite_filepath}'
        print(f'command={command}')
        subprocess.Popen(command, shell=True, stdout=subprocess.PIPE).stdout.read()
        soundbite_filepaths.append(soundbite_filepath)
    return soundbite_filepaths
    
def install_ffmpeg_if_needed(format):
    if format =="mp3":
        subprocess.Popen("yes | sudo apt install ffmpeg", shell=True, stdout=subprocess.PIPE).stdout.read()

def merge_wav_files_into_single_mp3(format: str,output_dir:str,output_filename:str,soundbite_filepaths:list):
    if format not in ["mp3","wav"]:
        raise Exception(f"Format:{format} not supported.")
    install_ffmpeg_if_needed(format)

    output_filepath=f"{output_dir}/{output_filename}.{format}"
    print(f'output_filepath={output_filepath}')

    
    combined_sounds=None
    for soundbite_filepath in soundbite_filepaths:
        print(f'soundbite_filepath={soundbite_filepath}')
        some_sound=AudioSegment.from_wav(soundbite_filepath)
        if combined_sounds is None:
            combined_sounds=some_sound    
        else:
            combined_sounds=combined_sounds+some_sound

    #combined_sounds = sound1 + sound2
    #combined_sounds.export("/output/path.wav", format="wav")
    #combined_sounds.export("/output/path.mp3", format="mp3")
    combined_sounds.export(output_filepath, format=format)

def get_output_files(output_dir,soundbite_filename):
    soundbite_filepaths=[]
    for i in range(0,10000):
        soundbite_filepath=f"{output_dir}/{soundbite_filename}_{i}.wav"
        if os.path.isfile(soundbite_filepath):
            soundbite_filepaths.append(soundbite_filepath)
    return soundbite_filepaths

def merge_without_converting(extension, output_dir,output_filename,soundbite_filename):
    soundbite_filepaths=get_output_files(output_dir,soundbite_filename)
    print(f'soundbite_filepaths={soundbite_filepaths}')
    merge_wav_files_into_single_mp3(extension,output_dir,output_filename,soundbite_filepaths)
    exit()


# Specify the audio output dir.
output_dir="output"
soundbite_filename="soundbite"
output_filename="Spoken_text"
extension="mp3"
os.makedirs(output_dir, exist_ok=True)  # ensure the output dir exists

# TODO: Clear out output directory before starting.

# TODO: allow manually overwriting a single soundbite without converting the entire text.

# Optional: If you already generated the separate .wav soundbites and would like to merge.
# merge_without_converting(extension, output_dir,output_filename,soundbite_filename)

# Load the presentation text from file.
text=load_txt_from_file("text.txt")
# Separate the text into smaller sentences.
sentences=split_into_sentences(text)
print(f'sentences={sentences}')

# TODO: Verify the sentences are short enough.

# Convert the sentences into .wav files
soundbite_filepaths= convert_sentences_to_wav_files(soundbite_filename,output_dir,sentences)

# Merge the .wav files into a single .wav file
merge_wav_files_into_single_mp3(extension,output_dir,output_filename,soundbite_filepaths)

Here is the related repo: https://github.com/a-t-0/icons_poster
