
[Bug] Truncated audio when generating speech #1522

Closed · koutheir opened this issue Apr 21, 2022 · 6 comments
Labels: bug (Something isn't working)

koutheir commented Apr 21, 2022
Describe the bug

When generating speech audio from the following text, the generated file contains only truncated speech: the audio cuts off before all the sentences are pronounced.

Text:

Multiple debugging information entries may share the same abbreviation table entry. Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.

Generated speech audio: tts_output.zip

To Reproduce

$ tts --text 'Multiple debugging information entries may share the same abbreviation table entry. Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.'

Expected behavior

Generation of speech audio for the full input text.

Logs

$ tts --text 'Multiple debugging information entries may share the same abbreviation table entry. Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.'
 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: Tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: hifigan
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
Removing weight norm...
 > Text: Multiple debugging information entries may share the same abbreviation table entry. Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.
 > Text splitted to sentences.
['Multiple debugging information entries may share the same abbreviation table entry.', 'Each compilation unit is associated with a particular abbreviation table, but multiple compilation units may share the same table.']
   > Decoder stopped with `max_decoder_steps` 500
 > Processing time: 5.151856899261475
 > Real-time factor: 0.4295162001993176
 > Saving output to tts_output.wav

Environment

{
    "CUDA": {
        "GPU": [
            "Quadro RTX 5000 with Max-Q Design"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu102",
        "TTS": "0.6.2",
        "numpy": "1.19.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "",
        "python": "3.9.12",
        "version": "#1 SMP PREEMPT Debian 5.16.18-1 (2022-03-29)"
    }
}

Additional context

No response

koutheir added the bug label on Apr 21, 2022
erogol (Member) commented Apr 21, 2022

There is a limit on the sentence length for the default model. Either split the longest sentence into two or use a different model.
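
For context, the cap is the decoder step limit max_decoder_steps, visible in the log line "Decoder stopped with `max_decoder_steps` 500" above; it bounds decoder steps, not characters or words. A rough conversion to seconds of audio, assuming one mel frame per decoder step (the log reports reduction rate r = 1); treat this as a back-of-the-envelope estimate, not an exact guarantee:

    # Approximate audio length that fits within max_decoder_steps.
    # All values are taken from the log output above.
    max_decoder_steps = 500  # default decoder step cap for this model
    r = 1                    # mel frames emitted per decoder step
    hop_length = 256         # audio samples per mel frame
    sample_rate = 22050      # Hz

    max_seconds = max_decoder_steps * r * hop_length / sample_rate
    print(f"~{max_seconds:.1f} s of audio per sentence")  # ~5.8 s

Alternative models can be listed with tts --list_models.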

erogol closed this as completed Apr 21, 2022
koutheir (Author) commented

There is a limit on the sentence length for the default model.

  • What is that exact limit?
  • How can one find this information for a particular model?
  • How is the length calculated (character count, word count, etc.)?

koutheir (Author) commented May 3, 2022

@erogol, please reopen this issue as there are still pending questions about it, and I don't think I'm the only person asking them.

koutheir (Author) commented

@JRMeyer, can you please reopen this issue, or at least answer the questions above? @erogol doesn't seem to be responsive.

a-t-0 commented Jul 9, 2022

I would also like to know the information @koutheir asked for :)

* What is that exact limit?
* How can one find this information for a particular model?
* How is the length calculated (character count, word count, etc.)?

So it would be nice if the issue could be reopened.

FYI, here is a work-around given in this issue:
Add:

"max_decoder_steps": 5000

at the end of the config file of the (default) model, in:

~/.local/share/tts/vocoder_models--en--ljspeech--hifigan_v2/config.json

Note, JSON expects a comma at the end of a line if the line is followed by another property. So you have to add a comma to what was previously the last property, and make sure you don't put a trailing comma after your own "max_decoder_steps": 5000 entry. TL;DR: make the last lines look like:

    // PATHS
    "output_path": "/home/erogol/gdrive/Trainings/sam/",
    // Custom limit made larger
    "max_decoder_steps": 5000
}
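
If you'd rather patch the file programmatically, here is a minimal sketch (hypothetical, not an official tool; it assumes the downloaded config.json is plain JSON without // comments, unlike the training-config excerpt above, so back the file up first):

    import json
    from pathlib import Path

    # Hypothetical snippet: raise max_decoder_steps in a downloaded model config.
    # Assumes the file is plain JSON (no // comments).
    config_path = Path.home() / ".local/share/tts/vocoder_models--en--ljspeech--hifigan_v2/config.json"

    config = json.loads(config_path.read_text())
    config["max_decoder_steps"] = 5000
    config_path.write_text(json.dumps(config, indent=4))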

a-t-0 commented Jul 9, 2022

Here is a small script that splits a text into separate sentences, converts each sentence to speech, and merges the results into a single .mp3 file.
First install everything with this installer.sh:

pip install TTS
#git clone git@github.com:coqui-ai/TTS.git
#cd TTS
#pip install -e .[all,dev,notebooks]  # Select the relevant extras
mkdir -p "output/path/"
tts --text "Text for TTS" --out_path output/path/speech.wav

# The TTS does not allow for long sentences, so to work around this, 
# you could increase the permitted max_decoder_steps.
# Source: https://github.com/coqui-ai/TTS/issues/1333

# Open the configuration of the default model, 
# (which is en--ljspeech--hifigan_v2), located at:
# ~/.local/share/tts/vocoder_models--en--ljspeech--hifigan_v2/config.json

# Then add/ensure it contains:
# "max_decoder_steps": 5000

# In short, make the last lines of the config.json look like:
#    // PATHS
#    "output_path": "/home/erogol/gdrive/Trainings/sam/",
#    // Custom limit made larger
#    "max_decoder_steps": 5000
#}

# Install text parsing packages:
pip install nltk
pip install pydub

# Do a test run.
the_text='"'$(cat text.txt)'"'
tts --text "$the_text" --out_path output/path/text.wav

Run it with:

chmod +x installer.sh
./installer.sh

Then add that max_decoder_steps manually, like described above.
Then you can create a text.txt with your complete text, and run:

python text2speech.py

Below is the text2speech.py content:

""" Browse into audio/ dir and run:
python text2speech.py
to convert the audio files into a .wav file"""



import os
import re
import subprocess
from pydub import AudioSegment


def split_into_sentences_using_nlp():
    # Alternative sentence splitter using nltk (not used in the main flow below).
    import nltk.data
    nltk.download("punkt")  # fetch the sentence tokenizer data non-interactively

    tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
    with open("text.txt") as fp:
        data = fp.read()
    sentences = "\n-----\n".join(tokenizer.tokenize(data))
    return sentences


def load_txt_from_file(filename):
    with open(filename) as fp:
        return fp.read()


def split_into_sentences(text):
    # Specify regex values (raw strings avoid invalid-escape warnings).
    alphabets = r"([A-Za-z])"
    prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    websites = r"[.](com|net|org|io|gov)"

    # Perform conversion.
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]  # drop the empty chunk left after the final <stop>
    sentences = [s.strip() for s in sentences]
    return sentences

def convert_sentences_to_wav_files(filename: str,output_dir:str ,sentences: list):
    soundbite_filepaths=[]
    for i,sentence in enumerate(sentences):
        soundbite_filepath=f"{output_dir}/{filename}_{i}.wav"
        command=f'tts --text "{sentence}" --out_path {soundbite_filepath}'
        print(f'command={command}')
        subprocess.Popen(command, shell=True, stdout=subprocess.PIPE).stdout.read()
        soundbite_filepaths.append(soundbite_filepath)
    return soundbite_filepaths
    
def install_ffmpeg_if_needed(format):
    if format =="mp3":
        subprocess.Popen("yes | sudo apt install ffmpeg", shell=True, stdout=subprocess.PIPE).stdout.read()

def merge_wav_files_into_single_mp3(format: str,output_dir:str,output_filename:str,soundbite_filepaths:list):
    if format not in ["mp3","wav"]:
        raise Exception(f"Format:{format} not supported.")
    install_ffmpeg_if_needed(format)

    output_filepath=f"{output_dir}/{output_filename}.{format}"
    print(f'output_filepath={output_filepath}')

    
    combined_sounds=None
    for soundbite_filepath in soundbite_filepaths:
        print(f'soundbite_filepath={soundbite_filepath}')
        some_sound=AudioSegment.from_wav(soundbite_filepath)
        if combined_sounds is None:
            combined_sounds=some_sound    
        else:
            combined_sounds=combined_sounds+some_sound

    #combined_sounds = sound1 + sound2
    #combined_sounds.export("/output/path.wav", format="wav")
    #combined_sounds.export("/output/path.mp3", format="mp3")
    combined_sounds.export(output_filepath, format=format)

def get_output_files(output_dir,soundbite_filename):
    soundbite_filepaths=[]
    for i in range(0,10000):
        soundbite_filepath=f"{output_dir}/{soundbite_filename}_{i}.wav"
        if os.path.isfile(soundbite_filepath):
            soundbite_filepaths.append(soundbite_filepath)
    return soundbite_filepaths

def merge_without_converting(extension, output_dir,output_filename,soundbite_filename):
    soundbite_filepaths=get_output_files(output_dir,soundbite_filename)
    print(f'soundbite_filepaths={soundbite_filepaths}')
    merge_wav_files_into_single_mp3(extension,output_dir,output_filename,soundbite_filepaths)
    exit()


# Specify the audio output dir.
output_dir="output"
soundbite_filename="soundbite"
output_filename="Spoken_text"
extension="mp3"
os.makedirs(output_dir, exist_ok=True)  # ensure the output dir exists

# TODO: Clear out output directory before starting.

# TODO: allow manually overwriting a single soundbite without converting the entire text.

# Optional: If you already generated the separate .wav soundbites and would like to merge.
# merge_without_converting(extension, output_dir,output_filename,soundbite_filename)

# Load the presentation text from file.
text=load_txt_from_file("text.txt")
# Separate the text into smaller sentences.
sentences=split_into_sentences(text)
print(f'sentences={sentences}')

# TODO: Verify the sentences are short enough.

# Convert the sentences into .wav files
soundbite_filepaths= convert_sentences_to_wav_files(soundbite_filename,output_dir,sentences)

# Merge the .wav files into a single .wav file
merge_wav_files_into_single_mp3(extension,output_dir,output_filename,soundbite_filepaths)

Here is the related repo: https://github.com/a-t-0/icons_poster
