
[feature] Ability to use AnyGPT for speech/text/image/music multimodality #47

Open
kabachuha opened this issue May 16, 2024 · 2 comments


kabachuha commented May 16, 2024

AnyGPT is quite a promising project, released two months before GPT-4o.

It is a versatile multimodal LLaMA-based model that can take not only images as input, but also non-transcribed speech (for example, for voice cloning) and music. Its output is likewise speech, images, and music in token form, which is fed into specialized decoder models (conditioned implicitly, e.g. via unCLIP embeddings rather than text prompts for Stable Diffusion) to generate the final media.

*AnyGPT demo*
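
Roughly, here's a minimal sketch of that loop (the delimiter tokens and the router are my own illustration, not the actual AnyGPT code):

```python
# Illustrative sketch only -- the token names and routing are assumptions,
# not the real AnyGPT vocabulary. The core idea: every modality lives in
# one discrete-token stream, delimited by special tokens, and each span
# is handed off to its own decoder (vocoder, unCLIP, music decoder).

SPEECH_BOS, SPEECH_EOS = "<sosp>", "<eosp>"  # assumed speech delimiters
IMAGE_BOS, IMAGE_EOS = "<soim>", "<eoim>"    # assumed image delimiters

def route_tokens(stream: list[str]) -> dict[str, list[str]]:
    """Split a mixed-modality LLM output stream into per-modality spans."""
    spans: dict[str, list[str]] = {"text": [], "speech": [], "image": []}
    mode = "text"
    for tok in stream:
        if tok == SPEECH_BOS:
            mode = "speech"
        elif tok == IMAGE_BOS:
            mode = "image"
        elif tok in (SPEECH_EOS, IMAGE_EOS):
            mode = "text"
        else:
            spans[mode].append(tok)
    return spans

# A mixed reply: some text, then speech tokens destined for the vocoder.
reply = ["Sure", "!", SPEECH_BOS, "s_102", "s_7", "s_411", SPEECH_EOS]
print(route_tokens(reply))
# {'text': ['Sure', '!'], 'speech': ['s_102', 's_7', 's_411'], 'image': []}
```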

I think such a concept could improve the GPT-4o-like experience, although the encoder/decoder backends may need adjusting to make generation faster.

See the project page: https://junzhan2000.github.io/AnyGPT.github.io/

https://github.com/OpenMOSS/AnyGPT

P.S. I think it would be a much better addition than just giving it vision via the legacy LLaVA:

- [ ] Give GLaDOS vision via [LLaVA](https://llava-vl.github.io/)

dnhkng (Owner) commented May 16, 2024

The model is really cool!

I'm not sure how useful those modalities would be, though. For example, it could generate music, or it could just as easily use function calling and pick a track on Spotify. The same goes for image generation.
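
I.e. something like this, using the common OpenAI-style tool schema (the play_spotify_track handler is hypothetical):

```python
# Sketch of the function-calling alternative: the model emits a structured
# call instead of audio tokens, and the host app does the actual playback.
# The schema follows the widely used OpenAI-style tool format; the handler
# name and Spotify integration are hypothetical.

play_track_tool = {
    "type": "function",
    "function": {
        "name": "play_spotify_track",
        "description": "Search Spotify and start playback of the best match.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Artist, track title, or mood to search for",
                }
            },
            "required": ["query"],
        },
    },
}
```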

kabachuha (Author) commented:

As I understand it, it's still quite a general-purpose model, so function calling should work with it as well (maybe with some extra tuning, if their multimodal training overwrote the normal instruction-following too much).

The best thing is that the model acquires a better semantic understanding of how words, sounds, music, and images connect with each other, so it may be worth exploring its "creative soul" :)

Decoders like Stable Diffusion can run very fast now on decent GPUs if you use things like LCMs (Latent Consistency Models) and other tricks that allow few-step or one-step diffusion.
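
For example, the standard LCM-LoRA recipe in diffusers gets Stable Diffusion down to ~4 sampling steps (the model IDs here are just the ones from the diffusers docs, swap in whatever checkpoint GLaDOS ends up using):

```python
import torch
from diffusers import AutoPipelineForText2Image, LCMScheduler

# LCM-LoRA recipe: swap in the LCM scheduler, load the distillation LoRA,
# then sample in ~4 steps with guidance close to 1.0.
pipe = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

image = pipe(
    "a robot assistant, studio lighting",
    num_inference_steps=4,   # vs. 25-50 for vanilla SD
    guidance_scale=1.0,      # LCM works best with little to no CFG
).images[0]
image.save("lcm_demo.png")
```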
