This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Real Time Streaming #6
Comments
Do you mean audio streaming decoupled from the main text-generation-webui wav handling? A TTS-engine-independent solution inside the webui would be best, but as a workaround it could be implemented in alltalk_tts or any TTS extension, without returning any audio chunks to the webui, by streaming the audio directly with a library like sounddevice. One disadvantage would be that the user has no control to pause, stop or continue the playback. |
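The pause/stop limitation mentioned above could in principle be handled inside the extension itself. Below is a minimal sketch, assuming the TTS engine yields raw audio chunks as bytes. `ChunkPlayer` and its `sink` callable are hypothetical names for illustration; in a real setup the sink might be the `write` method of a sounddevice `RawOutputStream`, which is an assumption and not something taken from alltalk_tts.

```python
import queue
import threading
from typing import Callable, Iterable, Optional


class ChunkPlayer:
    """Plays audio chunks on a background thread with pause/resume/stop.

    `sink` is any callable that accepts one chunk of bytes. It is a
    hypothetical stand-in for a real audio output (for example the write
    method of an open output stream).
    """

    def __init__(self, sink: Callable[[bytes], None], max_buffered: int = 8):
        self._sink = sink
        # Bounded queue: the producer blocks when playback falls behind.
        self._queue: "queue.Queue[Optional[bytes]]" = queue.Queue(maxsize=max_buffered)
        self._resume = threading.Event()
        self._resume.set()  # start in the "playing" state
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        while True:
            chunk = self._queue.get()
            if chunk is None:  # sentinel: no more audio
                return
            self._resume.wait()  # blocks here while paused
            if self._stop.is_set():
                continue  # discard remaining chunks after stop()
            self._sink(chunk)

    def feed(self, chunks: Iterable[bytes]) -> None:
        """Queue chunks as the TTS engine produces them."""
        for chunk in chunks:
            if self._stop.is_set():
                break
            self._queue.put(chunk)
        self._queue.put(None)

    def pause(self) -> None:
        self._resume.clear()

    def resume(self) -> None:
        self._resume.set()

    def stop(self) -> None:
        self._stop.set()
        self._resume.set()  # wake the thread if it is paused

    def join(self, timeout: Optional[float] = None) -> None:
        self._thread.join(timeout)
```

The bounded queue keeps generation from running far ahead of playback. The controls would still need to be wired to something the user can reach, which is the harder part of the problem.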
I can implement it in text-generation-webui later. The main thing I am trying to achieve is generating an instantly playable .wav file that can be streamed in chunks, so we can achieve real-time TTS. The key part is streaming the raw audio to stdout as it's produced. |
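One common way to make a .wav playable before it is complete is to write the header first with placeholder length fields, then append PCM data as it arrives. The sketch below assumes 16-bit mono PCM at 24000 Hz (an assumption about the engine's output, not something stated in this thread). `fake_tts_chunks` is a hypothetical stand-in that produces a sine tone instead of speech. Whether a given player accepts a header with an unknown length varies, so treat this as a starting point.

```python
import math
import struct
from typing import BinaryIO, Iterable, Iterator


def streaming_wav_header(sample_rate: int, channels: int = 1, sample_width: int = 2) -> bytes:
    """Build a 44-byte PCM WAV header whose size fields are placeholders."""
    unknown = 0xFFFFFFFF  # total length is not known while streaming
    byte_rate = sample_rate * channels * sample_width
    block_align = channels * sample_width
    return (
        b"RIFF" + struct.pack("<I", unknown) + b"WAVE"
        + b"fmt " + struct.pack(
            "<IHHIIHH",
            16,                # size of the fmt chunk
            1,                 # format tag 1 = uncompressed PCM
            channels,
            sample_rate,
            byte_rate,
            block_align,
            sample_width * 8,  # bits per sample
        )
        + b"data" + struct.pack("<I", unknown)
    )


def fake_tts_chunks(num_chunks: int = 5, samples_per_chunk: int = 2400,
                    sample_rate: int = 24000, freq: float = 440.0) -> Iterator[bytes]:
    """Hypothetical stand-in for a TTS engine: yields 16-bit PCM sine chunks."""
    for c in range(num_chunks):
        start = c * samples_per_chunk
        samples = [
            int(0.3 * 32767 * math.sin(2 * math.pi * freq * (start + i) / sample_rate))
            for i in range(samples_per_chunk)
        ]
        yield struct.pack(f"<{samples_per_chunk}h", *samples)


def stream_wav(chunks: Iterable[bytes], out: BinaryIO, sample_rate: int = 24000) -> int:
    """Write the header, then each chunk as soon as it is available.

    Returns the number of PCM bytes written (header excluded).
    """
    out.write(streaming_wav_header(sample_rate))
    out.flush()
    total = 0
    for chunk in chunks:
        out.write(chunk)
        out.flush()  # push the chunk to the reader immediately
        total += len(chunk)
    return total
```

Calling `stream_wav(fake_tts_chunks(), sys.stdout.buffer)` (after `import sys`) would send the stream to stdout, where it could be piped into a player that reads WAV data from stdin.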
Streaming is possible with https://github.com/KoljaB/RealtimeTTS, though that is another step down the line. My current workload is re-tidying all the documentation, both on this GitHub and within the app, and catching a few minor bugs/issues. Then I'm working on the new API for 3rd party/standalone use, which is 70-80% completed. From there, I'll look at options for other TTS engines and features such as the above. However, it's worth noting there is a memory overhead for this, and there will be coding around certain things like the LowVRAM option. The two are not incompatible as such, but you would be shuffling the TTS model between VRAM and system RAM all the time, resulting in zero gain and probably a lot of complaints about speed. |
There are a few more things to consider:
|
@Sascha353 great overview, and great find on https://docs.coqui.ai/en/dev/models/xtts.html#streaming-manually |
@Sascha353 @mercuryyy Should be easy to implement into the addon. is out of scope, due to the LowVram "incompatibility"? |
@Sascha353 @mercuryyy Let me ask you both a question, as this also has considerations. Where would you both want the streaming output to be played? E.g. within Text-generation-webui's interface as it generates content? Over the API and back to your own player of some kind? Over the API and through a built-in Python-based player that runs within the AllTalk Python process? |
@erew123 First choice would be "Over the API and back to your own player of some kind", sort of a live-streamed .wav |
I tend to aim for the "best" option first and reduce the scope if needed, based on feasibility and resources. In my opinion, it would be best if the audio output is sent to TG-webui, as there is already a player in gradio where the user can interact with the audio file/stream. This makes it generally more accessible and understandable for the user, as no other output/player is introduced. I know the gradio player is capable of supporting a wav stream (it is used in the Coqui space). However, I don't think TG-webui is ready to receive and handle audio chunks, as receiving and working with one full wav file coming from the TTS engine is obviously vastly different from handling a stream of incoming audio chunks. I described some of the challenges already in my FR here. As a proof of concept, probably the easiest approach would be to stream directly using a library like sounddevice or PyAudio. In that case the TTS extension should have auto play disabled, so that the audio is not played once by the streaming feature and then again in the webui after the full wav is generated and transferred. It's just my opinion, but I would not introduce another UI to control the stream. The user is working inside the webui, and usability and immersion drop if you have to switch apps, windows, tabs etc. |
Is it possible, when a TTS command is executed, to stream the results in chunks to something like a Temp_stream.wav file that is playable immediately, while it is still being created?
For example, say I am synthesizing 100 words and it takes 4 seconds. I want the audio to start playing at the point the TTS command is executed, so that it is essentially real time.
Piper does this - https://github.com/rhasspy/piper
echo 'This sentence is spoken first. This sentence is synthesized while the first sentence is spoken.' |
./piper --model en_US-lessac-medium.onnx --output-raw |
aplay -r 22050 -f S16_LE -t raw -
The models in Coqui_tts are better but take longer to run. If we can stream the output, that wouldn't matter and we could do real time.
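The Piper example suggests one simple route to lower latency: synthesize sentence by sentence, so the first sentence can play while the next is being generated. A minimal sketch follows, assuming a `synthesize` callable that turns one sentence into audio bytes. That callable is hypothetical and stands in for whatever TTS engine is used. The regex split is deliberately naive and will mis-split abbreviations.

```python
import re
from typing import Callable, Iterator, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize_by_sentence(text: str,
                           synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Lazily yield audio for one sentence at a time.

    Because this is a generator, a sentence is only synthesized when the
    consumer asks for it, so playback of the first sentence can begin
    before the rest of the text has been processed.
    """
    for sentence in split_sentences(text):
        yield synthesize(sentence)
```

With this shape, the delay before the first audio depends on the length of the first sentence and not on the whole text. True overlap of playback and synthesis additionally needs the consumer to play on a separate thread or process, as the shell pipeline does.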