
Integration with RVC #229

Closed · gshawn3 opened this issue May 16, 2024 · 28 comments

Comments

gshawn3 commented May 16, 2024

First of all, thank you for this wonderful project. I've been playing around with it this past week, both in standalone mode and as a text-generation-webui extension, and it's all working very well. The documentation is top-notch as well!

I noticed some lines of code mentioning "RVC Injection" here:

// RVC injection

Is this working currently, or is that a feature that is still being worked on? I would really love to pass the generated audio through RVC, because it makes voices sound ten times more accurate, even after fine-tuning the XTTS model. If that's not currently possible, please consider adding this functionality in the future. Thanks, and good luck with the V2 update!

erew123 (Owner) commented May 16, 2024

Hi @gshawn3

That file is actually part of SillyTavern; it's only there because I had to include it with the original AllTalk SillyTavern PR submission. In other words, the code you are pointing at has nothing to do with AllTalk itself.

As for RVC, do you mean using a Retrieval-based Voice Conversion model (as opposed to XTTS models etc.)?

Thanks

gshawn3 (Author) commented May 16, 2024

Ah, that makes sense. Sorry, I should have taken a closer look at the code before asking the question.

And yes, that's right! A common pipeline for inferencing AI voices is to generate a sample with a fine-tuned XTTS / Tortoise / etc. model and then route that sample through RVC. It increases the likeness of any cloned voice by a huge amount. See, for example, the first 30 seconds of https://www.youtube.com/watch?v=IcpRfHod1ic.
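A minimal sketch of that two-stage pipeline in Python. Everything here is hypothetical glue code: generate_tts_audio and rvc_convert are placeholder names standing in for whichever TTS API and RVC inference entry point you actually use; neither is a real function in AllTalk or the RVC projects.

def generate_tts_audio(text, out_path):
    """Placeholder: call your TTS engine (fine-tuned XTTS / Tortoise / etc.)
    here and write the generated speech to out_path."""
    raise NotImplementedError

def rvc_convert(in_path, voice_pth, voice_index, out_path):
    """Placeholder: run RVC speech-to-speech conversion on in_path using
    the per-voice .pth/.index files, writing the result to out_path."""
    raise NotImplementedError

# Stage 1: text -> speech with the fine-tuned TTS model
generate_tts_audio("Hello there.", "tts_output.wav")

# Stage 2: speech -> speech through RVC for closer voice likeness
rvc_convert("tts_output.wav", "voice.pth", "voice.index", "final_output.wav")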

bobcate commented May 16, 2024

I just set up RVC so I could have Expression-Based Dynamic Voice, and only then did I wonder: I'm using AllTalk with both a narrator and a character voice, so how is it even going to work with more than one voice? 😄

And it worked! But the narrator and character voices were combined into one.
It's still good, though. The voice coming from RVC belongs to the same person (model), but the intonation differs between narration text and dialogue text, so it's very much usable: rvc-io.zip

erew123 (Owner) commented May 16, 2024

@gshawn3 @bobcate I think it looks like something I could include. It mostly just seems to be another layer of transcoding to add, and I've already added transcoding to five audio types into v2, so I think it would be possible.

As you are both using it already, I'm just trying to wrap my head around a few bits, as it would help me code/build something in the future. Are you actually using a custom RVC model, e.g. one fine-tuned for the conversations you run, or is it using the base models (I think these are hubert_base.pt and rmvpe.pt)? And/or is there a need to select different RVC models during this process?

I was intending to add just the base RVC models anyway, but obviously what you are describing above is a pipeline, which is slightly different from just loading a model in.

Thanks

bobcate commented May 17, 2024

I'm using the base model(s). There is hubert_base.pt (renamed to hubert_rvc.pt, 185 MB) and rmvpe.pt (176 MB).
I don't know the technical details, but from my observation:

  • As you said, it's another layer: TTS converts text to speech, then that speech is sent to RVC to convert speech to speech (see the sketch after this list).
  • Hubert is the actual model and rmvpe is an option.
  • rmvpe makes the voice a little, just barely, better, but it takes 1 second longer to convert since the rmvpe model reloads every single time.
  • I chose dio and it's still better than pure TTS. Quite fast, too: it takes about half a second to convert two short sentences.
  • And then there are the voice files: a .pth and an .index file for a specific voice. Their sizes vary by voice; the one I'm using is 110 MB in total. The voices are easy to acquire.
  • If I understood correctly, the fine-tuning, if I can call it that in this case, must be done on the voice files to achieve the emotional voice I mentioned: you need one voice file, with the correct emotion, per expression. But that's something for the user to worry about.
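Pulling those observations together, a hypothetical sketch of the speech-to-speech layer. The function names here are placeholders for whatever the real loader and inference entry points are, not an API from AllTalk or the RVC projects; the f0 method strings mirror the options described above.

def load_hubert(path="hubert_rvc.pt"):
    """Placeholder: load the base HuBERT feature-extraction model
    that all voices share."""
    raise NotImplementedError

def convert_voice(in_path, hubert, f0_method, voice_pth, voice_index, out_path):
    """Placeholder: RVC speech-to-speech conversion. f0_method picks the
    pitch extractor: "rmvpe" is slightly better but slower (it reloads on
    each call); "dio" is fast and still beats plain TTS output."""
    raise NotImplementedError

hubert = load_hubert()  # base model, loaded once and shared by all voices
# Per-voice files: sizes vary by voice (~110 MB total in the case above)
convert_voice("tts_output.wav", hubert, "dio", "voice.pth", "voice.index", "rvc_output.wav")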

I hope you can add it so we can use RVC on both (narrator + character) voices separately. Even if that doesn't work, it would still be better to have it integrated in AllTalk, because we would be using one less extension and would not need the ST-extras server.

Dolyfin commented May 17, 2024

Two months ago I trained an XTTSv2 model using alltalk_tts as a replacement for a regular TTS > RVC pipeline.

If you have the capability of fine-tuning on two medium-to-large datasets, you can theoretically get optimal results with two or even more voices by just swapping out the reference clone audio:

Fine-tune the model with a dataset of characters 1 and 2.
Run inference with an audio sample of character 1 or 2 (setting them as character or narrator).

The above isn't a solution for the majority who can't train/fine-tune their own models (TTS or RVC). However, if you use AllTalk for the API, you can likely just pipe the AllTalk API output into the RVC Gradio API in the webui to achieve similar results. The latency will be huge, though.
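A minimal sketch of the reference-swap idea using Coqui TTS's high-level Python API. TTS and tts_to_file are real Coqui calls, but the model name, language, and file paths are assumptions; a fine-tuned checkpoint would be loaded from your own model files rather than the stock model name.

from TTS.api import TTS  # Coqui TTS high-level API

# One XTTSv2 model, fine-tuned on both characters' datasets.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Same model, two voices: only the reference clone audio changes.
tts.tts_to_file(text="Narration line.", speaker_wav="character1_sample.wav",
                language="en", file_path="narrator_out.wav")
tts.tts_to_file(text="Dialogue line.", speaker_wav="character2_sample.wav",
                language="en", file_path="character_out.wav")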

Mixomo commented May 18, 2024

I'd like to add my support to this request for RVC in standalone mode! 🙏🙏🙏
Looking forward to this thread; thanks in advance!

gshawn3 (Author) commented May 20, 2024

> @gshawn3 @bobcate I think it looks like something I could include. It mostly just seems to be another layer of transcoding to add, and I've already added transcoding to five audio types into v2, so I think it would be possible.
>
> As you are both using it already, I'm just trying to wrap my head around a few bits, as it would help me code/build something in the future. Are you actually using a custom RVC model, e.g. one fine-tuned for the conversations you run, or is it using the base models (I think these are hubert_base.pt and rmvpe.pt)? And/or is there a need to select different RVC models during this process?
>
> I was intending to add just the base RVC models anyway, but obviously what you are describing above is a pipeline, which is slightly different from just loading a model in.
>
> Thanks

Sorry for the late reply. Personally, I do use a finetuned model. Training a voice in RVC only takes 15-20 minutes on consumer hardware, and it makes a huge difference in the quality of the output. I actually used the exact same dataset to train RVC that I used when finetuning XTTSv2 with AllTalkTTS. The output from XTTSv2 + RVC is basically indistinguishable from my real voice. Thanks again for looking into this!

erew123 (Owner) commented May 21, 2024

I have a good news/bad news type of scenario. Probably best to start with the good news...

So I looked further into RVC, and I've written enough code to, in theory, load a model and process audio with RVC. Writing that and figuring it out took a good while. That's the good news.

The bad news is that there is a compatibility issue between Fairseq (a package you need to install, which loads the hubert model) and various versions of Python.

I found someone who has written an updated version of Fairseq, https://github.com/VarunGumma/fairseq, but it won't compile (I'll explain more in a moment).

Also, there is a feature request with the RVC-Project (RVC-Project/Retrieval-based-Voice-Conversion-WebUI#2036) to move to Fairseq2. (It may be possible I can do this in some way myself, but I'll need a bit more time on this.)

So, back to the updated Fairseq I found: it won't compile (at least on Windows) because:

OSError: [WinError 126] The specified module could not be found. Error loading "C:\Users\useraccount\AppData\Local\Temp\pip-build-env-gitiuk5h\overlay\Lib\site-packages\torch\lib\shm.dll" or one of its dependencies.

Which is one hell of a rabbit hole....

pytorch/pytorch#125109
facebookresearch/fairseq#5012

etc.

The long and short of it is that somewhere across the many Python versions from 3.11 upwards, a bug was introduced with certain versions of PyTorch that causes this issue: PyTorch is not always looking in the correct places for the files required to compile things if they use or need access to the MKL libraries.

From my research, it's something the PyTorch developers are looking at, but it is not resolved yet. That leaves several open questions: which versions of Python will they fix it for? What will the fixed PyTorch version be, given that can cause other dependency conflicts (with things like DeepSpeed)? When might other things like text-generation-webui upgrade to a working PyTorch version (and how do we deal with any Python versions that may not be supported)? And a whole host of other questions.
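As an aside, a quick way to check whether a given PyTorch build can see MKL at all. torch.backends.mkl.is_available() is a standard PyTorch call; this only reports availability, it won't fix the compile issue described above.

import torch

# Report the installed PyTorch/CUDA versions and whether this build
# can see the MKL libraries that the failing compile step depends on.
print("PyTorch:", torch.__version__)
print("CUDA:", torch.version.cuda)
print("MKL available:", torch.backends.mkl.is_available())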

So I'm not saying this is dead in the water; after all, I have written the base code to at least attempt it. But what I am saying is that I'll take a look at Fairseq2, and if there's no dice there, then it's really down to PyTorch: when their developers say they have fixed it, and what the resulting mess looks like.

As such, I'm going to close the ticket for now, put it in the Feature Requests, and it's going to be one of those I keep an eye on.

erew123 closed this as completed May 21, 2024
erew123 (Owner) commented May 21, 2024

Scrub the Fairseq2 option...

[screenshot: Fairseq2 platform support]

No Windows support, and a mess on other OSes.

gshawn3 (Author) commented May 21, 2024

Thank you for looking into it.

I'm not sure if this could be helpful, but here is another TTS project that somewhat recently added integration with RVC. It looks like they just import it as a submodule, and after that there's not a ton of code needed for inference:

JarodMica/ai-voice-cloning@b7879cc

Mixomo commented May 22, 2024

Hey, here is another XTTS project with a web UI and RVC.

https://github.com/daswer123/xtts-webui

It includes many very useful features, such as a model switcher, among others...

erew123 (Owner) commented May 22, 2024

Hi Everyone..

Bad news/good news.....

Bad news: I looked over lots of other projects that use RVC, and they all have the same problems with Python versions, PyTorch versions, Fairseq, compiling the loader, and so on and so forth. Honestly, the whole thing is a rabbit hole! Even some of the premier projects out there are dropping back to Python 3.9... there is literally no way around it, and it's a mess.

Unless, that is, you spend 20-ish hours re-writing ALL the model loaders, handling, etc., and the code is damn complicated! But...

[screenshot: test RVC interface]

Probably another 20+ hours to tidy up all the code and actually integrate it, set up downloaders/model management etc., and hope that I can get all the requirements files to work correctly! What you see in the screenshots is just a test interface that I was using to debug/try to make it work.

And just to be clear, this will be an AllTalk v2 feature... I still have a few bits to do before I can release that as a BETA... so watch this space.

It handles all the methods though.

[screenshot]

erew123 (Owner) commented May 22, 2024

Apologies for my spelling mistakes in that last post... I've been looking at code far too long to spell correctly!

Mixomo commented May 23, 2024

@erew123 Ooh, don't worry, and of course, no rush! You're doing all of this out of love! Do you have a tip jar? I'd love to support you on this wonderful project as soon as I can!

brentjohnston commented May 23, 2024

Edit: I just noticed your screenshot and comments at the bottom of the post above; not sure if my comment below is still relevant, but I can try to help out with the coding stuff if you need it, just let me know. Here is a direct link to the repo that converts XTTS to RVC.

Original comment:
Not sure if this would help, but it seems like this guy is manually putting a trained Coqui 2.0.2 output into an RVC webui at the end of the video. Would it be possible to have the all_talk extension somehow just send the .mp3 to a separate all_talk RVC extension based on the code from the repo he is using?

It uses Python 3.10.6, I think. I'm guessing you are already trying to do this, as you mentioned the autoloader stuff; just making sure. RVC does seem to really improve the results at the end of that video.

Maybe after alltalk_tts makes the .mp3, it could use some of the code from this repo plus additional code to automatically pull the .mp3 and run it through an "alltalk RVC" sort of thing. I could try to help with this and pick up where you left off. Did you upload the files somewhere? I'm also not sure which specific files you were working on.

erew123 (Owner) commented May 23, 2024

@Mixomo If you want to tip, you are welcome to. There is a Ko-fi link on the right-hand side of the repo's front page on GitHub.

@brentjohnston You would be welcome to go through the V2 when I have it up. I'm happy to take criticism or additions to the code :) The re-write of RVC I've done may, in theory, allow it to work on Python 3.12 and 3.13 now... though you'll need to manually compile Fairseq for it; it shouldn't bug out/error like all the other versions do past Python 3.11. Though I am still mid-development, so who knows, something yet may happen to make me swallow my words!

Still, I've at least made a settings interface for RVC now! Now on with the chunk of work to make the rest of it work (model downloaders/management, API calls, logic trees for identifying when/where to run RVC calls, making the narrator function work with it, and a large tidy-up of the code. And that's just RVC).

[screenshot: RVC settings interface]

brentjohnston commented May 23, 2024

That looks great! Yeah, I trained the Star Trek TNG computer voice and made an interface for it. I was fairly happy with the voice and the AllTalk finetune, but after putting the .mp3s through the extra RVC fine-tune I'm now mind-blown by the accuracy.

It beats my ElevenLabs v2 version for sure. I'm using Dragon NaturallySpeaking to type for me, and I have a Dragon "custom voice command" where I just say "make it so" and it presses Enter for me.

https://old.reddit.com/r/Oobabooga/comments/1bj7tx4/guide_the_easiest_way_to_modify_oobabooga_colors/
[screenshot: lcars-oobabooga]

Can't wait to see RVC added, thanks for working on this.

erew123 (Owner) commented May 24, 2024

@brentjohnston Hey that's awesome work there! I bet that took some time?

You asked about merging models... Well, hmm. It's not something Coqui supported, or that I can find anyone ever having done. But, saying that, in theory, if you had two finetunes and they were built from the same base model, e.g. both 2.0.2 or both 2.0.3, then it should be possible. I just have no idea what the resulting outcome would be! And the merged model (if it works) may need a very small, very low learning rate extra finetune with both sets of training data afterwards, just to re-bump the merged weights for the fine-tuned data.

So a VERY VERY hypothetical script would look like this: (I don't think I can specify VERY VERY enough there)

import os
from pathlib import Path

import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import gradio as gr

try:
    import deepspeed
    deepspeed_enabled = True
except ImportError:
    deepspeed_enabled = False

device = "cuda" if torch.cuda.is_available() else "cpu"

def merge_models(model1, model2, merge_type='weighted_avg', alpha=0.5):
    """
    Merge two XTTS models using different merging techniques.

    Args:
        model1 (Xtts): The first loaded XTTS model.
        model2 (Xtts): The second loaded XTTS model.
        merge_type (str): The merging technique to use. Options: 'weighted_avg', 'interpolate'.
        alpha (float): The weight factor for the weighted average or interpolation.

    Returns:
        Xtts: The merged XTTS model.
    """
    merged_model = Xtts(model1.config)  # Create a new instance of the Xtts class

    if merge_type == 'weighted_avg':
        # Perform weighted average of model parameters
        for param1, param2, param_merged in zip(model1.parameters(), model2.parameters(), merged_model.parameters()):
            param_merged.data = alpha * param1.data + (1 - alpha) * param2.data
    elif merge_type == 'interpolate':
        # Perform interpolation of model parameters
        for param1, param2, param_merged in zip(model1.parameters(), model2.parameters(), merged_model.parameters()):
            param_merged.data = torch.lerp(param1.data, param2.data, alpha)
    else:
        raise ValueError(f"Invalid merge type: {merge_type}")

    return merged_model

def merge_models_interface(model1_path, model2_path, merge_type, alpha, output_path):
    # Gradio textboxes pass plain strings; convert them to Path objects so
    # the "/" path joins below work.
    model1_path = Path(model1_path)
    model2_path = Path(model2_path)

    # Load model1
    config1 = XttsConfig()
    config1_path = model1_path / "config.json"
    vocab1_path_dir = model1_path / "vocab.json"
    checkpoint1_dir = model1_path
    config1.load_json(str(config1_path))
    model1 = Xtts.init_from_config(config1)
    model1.load_checkpoint(
        config1,
        checkpoint_dir=str(checkpoint1_dir),
        vocab_path=str(vocab1_path_dir),
        use_deepspeed=deepspeed_enabled,
    )
    model1.to(device)

    # Load model2
    config2 = XttsConfig()
    config2_path = model2_path / "config.json"
    vocab2_path_dir = model2_path / "vocab.json"
    checkpoint2_dir = model2_path
    config2.load_json(str(config2_path))
    model2 = Xtts.init_from_config(config2)
    model2.load_checkpoint(
        config2,
        checkpoint_dir=str(checkpoint2_dir),
        vocab_path=str(vocab2_path_dir),
        use_deepspeed=deepspeed_enabled,
    )
    model2.to(device)

    # Merge models
    merged_model = merge_models(model1, model2, merge_type=merge_type, alpha=alpha)

    # Create the output directory if it doesn't exist
    output_dir = os.path.dirname(output_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

    # Save the merged model
    merged_model.save_checkpoint(output_path)

    return f"Merged model saved at: {output_path}"

iface = gr.Interface(
    fn=merge_models_interface,
    inputs=[
        gr.components.Textbox(label="Path to Model 1", value="c:\mymodels\mymodelfolder_1"),
        gr.components.Textbox(label="Path to Model 2", value="c:\mymodels\mymodelfolder_2"),
        gr.components.Radio(["weighted_avg", "interpolate"], value="weighted_avg", label="Merge Type"),
        gr.components.Slider(minimum=0, maximum=1, step=0.1, value=0.5, label="Alpha"),
        gr.components.Textbox(label="Output Path", value="c:\mymodels\mymodeloutputfolder"),
    ],
    outputs=gr.components.Textbox(label="Result"),
    title="Merge XTTS Models",
    description="Merge two XTTS models using different merging techniques.",
)

if __name__ == "__main__":
    iface.launch()

This would obviously have to be run in the AllTalk Python environment too (or one that has TTS built in). FYI, I have not debugged, tested, or tried this script. I honestly cannot say what the resulting model would be like, or whether it will work at all.

But if you wanted to give it a go... you could. Umm, I would copy some model folders somewhere safe and play around that way. It will create the output folder path specified, if it doesn't exist.

No idea what merging rates would be good, but 0.5 (50/50) would probably be the most sensible.
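One detail worth flagging in the script above, from reading the merge code rather than testing it: the two merge types weight alpha in opposite directions. 'weighted_avg' computes alpha * model1 + (1 - alpha) * model2, while 'interpolate' uses torch.lerp(param1, param2, alpha) = param1 + alpha * (param2 - param1), i.e. (1 - alpha) * model1 + alpha * model2. At alpha = 0.5 the two are identical; at any other alpha, the same value favours model 1 in one mode and model 2 in the other.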

It's highly possible, though, that you end up with a model that just doesn't work. It could take a lot of testing to figure out whether this process works, or whether there are other things that need doing to make it work. And who knows, there may be some real quirks of Coqui's models that require a really deep dive into their code to figure out.

This is the start of the source code for XTTS:

https://docs.coqui.ai/en/latest/_modules/TTS/tts/models/xtts.html#

brentjohnston commented May 24, 2024 via email

brentjohnston commented May 24, 2024 via email

erew123 (Owner) commented May 24, 2024

@brentjohnston I'm not sure a 2.0.2 model would blend with a 2.0.3 model. The reason being, they introduced three entirely new languages in 2.0.3 that didn't exist in 2.0.2 (one of those languages, Hindi, is undocumented). So the underlying dataset of the models isn't the same, to a potentially large extent, along with the models' config.json files. As such, I don't know how well a merge of two such finetuned models would work out; I'm not sure it would be able to pick up on just the fine-tuned part as being the differences to merge. At least, those are my rough thoughts. In other words, it might be like trying to merge an SD 1.5 model with an SD 2.0 model: they are just too far apart for it to work.

erew123 (Owner) commented May 24, 2024

Everyone will be happy to hear that RVC is now working with the narrator function. I can tell you all it kicked my ass getting this working. Beyond the initial re-write of RVC to get it working with Python 3.11+, I then spent 8-10 hours figuring out how to deal with variable model index sizes and get the best quality vs. performance (aka, lots of complicated maths I'm not too sure I understand). However, I got there in the end, and I've actually set up a whole new RVC feature: it gives you a trade-off between performance and quality.

[screenshot: RVC performance/quality setting]

And after that, integrating it with the narrator and dealing with the coding fallout from that (plus deleting one small line of code that screwed the thing up for an hour) took me the best part of 7 hours. (I hate having to re-write and work with the narrator on things; it gets so damn complicated.) But... it's done... it works... I'm going to make a backup of the code before some accident happens.

So that's a very big thing out of the way. Hopefully lots of the other bits I want to do are smooth going and I can get a beta out soon (it will probably have been tested only on Windows by that point).

[screenshot]

Dolyfin commented May 25, 2024

Thank you for the hard work. Although, I can't fathom how much VRAM this would need locally with just one GPU, as I remember it using 5 GB+ in the RVC webui. You might even be able to quantise the models, although I've yet to see anyone do that with XTTS and RVC.

Would love to see some early latency testing and vram usage.

erew123 (Owner) commented May 25, 2024

@Dolyfin As I have so much code to punch my way through at the moment, for now I'm allocating my time to that and to getting something out for people to test/try. That said, RVC seemed to add about an extra 1 GB to 1.3 GB of VRAM use. The first RVC generation (see the images above) is slower if it's loading a new *.pth file into VRAM (which adds a short load time). The training index size setting I have introduced impacts generation time (you can see the index size used listed in the image above). As best I understand it, different *.pth files and their associated *.index files (if you are using index files) have indexes of different lengths: some could be 20,000 entries long, some 80,000. The more of the index you use, the better the sample generation, but the more processing time is required, hence I've given everyone the option to set what they like, and that will affect latency.

Hope that gives some information.

Thanks

erew123 (Owner) commented May 25, 2024

@Dolyfin I've managed to grab a brief moment to show you how the indexing affects generation times etc.

I know the index on this file is about 76,000 entries (the most it can be indexed). Others I have come with indexes of around 40,000; it varies by file. The indexing function I've introduced means "use at most this much of the index", hence setting it at 20,000, 40,000, 60,000 and 80,000 (the last being over the amount this file can be indexed by). You can see a relatively linear relationship: the more of the index you use, the longer it takes, but the higher the quality of the end result. I'm not yet sure if there is a good middle ground, but above 30,000 I'm not noticing much difference, unless you want studio-quality audio.
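To make that concrete, a toy numpy sketch of what a "maximum index size" cap does conceptually: retrieval searches only the first N entries of the voice's feature index, so a smaller cap gives a faster but coarser nearest-neighbour match. This illustrates the idea only; it is not AllTalk's or RVC's actual retrieval code, and the shapes are made up.

import numpy as np

rng = np.random.default_rng(0)
index = rng.standard_normal((76000, 256))  # toy stand-in for a voice's .index features
query = rng.standard_normal(256)           # one frame's HuBERT-style feature vector

def nearest_neighbour(query, index, max_rows):
    """Search only the first max_rows entries of the index."""
    searched = index[:max_rows]  # numpy clamps max_rows to the index length
    dists = np.linalg.norm(searched - query, axis=1)
    best = int(np.argmin(dists))
    return best, dists[best]

for cap in (20000, 40000, 60000, 80000):  # 80000 exceeds the index, as in the test above
    best, dist = nearest_neighbour(query, index, cap)
    print(f"cap={cap:>6}: best row {best}, distance {dist:.3f}")

Search time grows roughly linearly with the cap, and the best match can only improve as more of the index is searched, which matches the linear time/quality pattern described above.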

[screenshot: generation times at each index size]

ibrah3m commented Jun 18, 2024

Are there any updates regarding this, or a workaround like XTTS-RVC?

Couldn't we make a plugin that pipelines into Applio (RVC)? They did it with ElevenLabs by using the API to generate the TTS and then running inference with Applio RVC.

Is there a Discord server for this project? I seriously like it and hope to find a workaround ASAP.

Also, I couldn't really fine-tune my audio; the progress wasn't advancing.

gshawn3 (Author) commented Jun 19, 2024

@ibrah3m Indeed there is. RVC integration has been implemented in the upcoming AllTalk TTS V2. I had a chance to test it briefly and it works great. Check out #245.
