Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge of coqui voices broke stuff #126

Closed
aedocw opened this issue Dec 23, 2023 · 13 comments · Fixed by #127
Closed

Merge of coqui voices broke stuff #126

aedocw opened this issue Dec 23, 2023 · 13 comments · Fixed by #127
Assignees
Labels
bug Something isn't working

Comments

@aedocw
Copy link
Owner

aedocw commented Dec 23, 2023

Due to lack of complete testing, the merge that made studio voices work also broke a bunch of other stuff. This merge fixes that, and also includes a test script that I will use in the future to validate a few common use cases. It would be nice to add some real tests into CI, but the test runners do not have GPU, so would not be that useful.

@aedocw aedocw self-assigned this Dec 23, 2023
@aedocw aedocw added the bug Something isn't working label Dec 23, 2023
@aedocw aedocw linked a pull request Dec 23, 2023 that will close this issue
@danielw97
Copy link

Hi again,
If you'd rather have a separate issue for this let me know, although in my testing just now after pulling your most recent commit I'm getting the following error, as xtts is getting sent a bigger text chunk than it can handle I believe.
This is using one of the coqui studio voices, btw.

Error: ❗ XTTS can only generate text with a maximum of 400 tokens. ... Retrying (0 retries left)
This is a longer paragraph, although using a finetuned model last week with the same book didn't have this problem.
Thanks for all of your work.

@aedocw
Copy link
Owner Author

aedocw commented Dec 23, 2023 via email

@danielw97
Copy link

Other than that everything seems to be working, I wonder is it possible to encorperate the same segmenting code that is used with xtts as I assume the limits are the same (if it isn't already)?
I think this is an outlier with a longer paragraph that xtts handled fine although I can see how it may cause issues.

@danielw97
Copy link

Also, the same text processing at least in my mind should be used, as I believe this is basically xtts under the hood unless I am incorrect.

@aedocw
Copy link
Owner Author

aedocw commented Dec 24, 2023

I'm going to reopen this one until things are sorted out.

The difference has to do with how I'm calling XTTS between these two ways. The way I use the xtts cloning model is with their inference streaming approach but the docs don't indicate how you can use that with their voices.

@aedocw aedocw reopened this Dec 24, 2023
@danielw97
Copy link

Okay, thanks.
Not a rush especially this time of year with the holidays of course, although wanted to let you know the errors I was seeing.
Maybe they've not implemented the streaming approach with their studio voices, although might be worth asking on the Discord as it might just not be documented.

@aedocw
Copy link
Owner Author

aedocw commented Dec 24, 2023

Haha yes I have asked on discord, no answer yet though (and someone else just asked the same question today).

I put a potential fix in the branch "fixes" if you want to try it out when you get a chance.

As far as the holidays, it's OK, this is relaxing and I always sleep better after fixing some bugs :)

Thanks, and happy holidays to you too!

@danielw97
Copy link

danielw97 commented Dec 24, 2023

Thanks, I've got some time this evening and will test this now.
Edit: I've run the troublesome paragraph and that seems to have fixed it, appreciate your quick work on this.

@aedocw
Copy link
Owner Author

aedocw commented Dec 24, 2023

Thanks, I appreciate your testing and all your feedback!

You should see this, indicating the right xtts version:

 > tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.                                                                       
 > Using model: xtts ```

@danielw97
Copy link

Yes, that's what I got in the end.
I made a silly mistake the first time specifying --model instead of --engine although that was fixed fairly quickly, all good now though.

@aedocw
Copy link
Owner Author

aedocw commented Dec 24, 2023

Excellent! I'll figure out what's going on with other languages hopefully tonight and merge this branch.

@aedocw
Copy link
Owner Author

aedocw commented Dec 24, 2023

Found the problem with Coqui voices, reading plain text (rather than epub), and specifying a language other than english. On line 164 I replace all periods with commas if language != en, and that seems to break something along the way (maybe it confuses the segmenter that breaks everything up into individual sentences).

Replacing periods with commas did seem to help for non-english languages where it would seem to always pronounce the period at the end of sentences as "dot" or some variation of that. I changed that to happen now just before the sentence is sent to TTS, hopefully it is still effective for other languages.

@aedocw
Copy link
Owner Author

aedocw commented Dec 24, 2023

I believe things are all fixed now, please log bugs as always :)

@aedocw aedocw closed this as completed Dec 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants