
[feature] Ability to use AnyGPT for speech/text/image/music multimodality #47

Open
kabachuha opened this issue May 16, 2024 · 2 comments


kabachuha commented May 16, 2024

AnyGPT is quite a promising project, released two months before GPT-4o.

It is a versatile multimodal LLaMA-based model that can take not only images as input, but also non-transcribed speech (for example, for voice cloning) and music. Its output is likewise speech, images, and music in token form, which is fed into specialized decoder models (conditioned implicitly, e.g. via unCLIP embeddings rather than text prompts for Stable Diffusion) to generate the final media.

*AnyGPT demo*
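
Roughly, here's a minimal sketch of that loop (the delimiter tokens and the router are my own illustration, not the actual AnyGPT code):

```python
# Illustrative sketch only -- the token names and routing are assumptions,
# not the real AnyGPT vocabulary. The core idea: every modality lives in
# one discrete-token stream, delimited by special tokens, and each span
# is handed off to its own decoder (vocoder, unCLIP, music decoder).

SPEECH_BOS, SPEECH_EOS = "<sosp>", "<eosp>"  # assumed speech delimiters
IMAGE_BOS, IMAGE_EOS = "<soim>", "<eoim>"    # assumed image delimiters

def route_tokens(stream: list[str]) -> dict[str, list[str]]:
    """Split a mixed-modality LLM output stream into per-modality spans."""
    spans: dict[str, list[str]] = {"text": [], "speech": [], "image": []}
    mode = "text"
    for tok in stream:
        if tok == SPEECH_BOS:
            mode = "speech"
        elif tok == IMAGE_BOS:
            mode = "image"
        elif tok in (SPEECH_EOS, IMAGE_EOS):
            mode = "text"
        else:
            spans[mode].append(tok)
    return spans

# A mixed reply: some text, then speech tokens destined for the vocoder.
reply = ["Sure", "!", SPEECH_BOS, "s_102", "s_7", "s_411", SPEECH_EOS]
print(route_tokens(reply))
# {'text': ['Sure', '!'], 'speech': ['s_102', 's_7', 's_411'], 'image': []}
```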

I think such a concept could improve the GPT-4o-like experience, although the encoder/decoder backends may need adjusting to make generation faster.

See the project page: https://junzhan2000.github.io/AnyGPT.github.io/

https://github.com/OpenMOSS/AnyGPT

P.S. I think it would be a much better addition than just giving it vision via the legacy LLaVA:

- [ ] Give GLaDOS vision via [LLaVA](https://llava-vl.github.io/)

dnhkng (Owner) commented May 16, 2024

The model is really cool!

I'm not sure how useful those modalities would be, though. For example, it could generate music, or it could just as easily use function calling and pick a track on Spotify. The same goes for image generation.
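
I.e. something like this, using the common OpenAI-style tool schema (the play_spotify_track handler is hypothetical):

```python
# Sketch of the function-calling alternative: the model emits a structured
# call instead of audio tokens, and the host app does the actual playback.
# The schema follows the widely used OpenAI-style tool format; the handler
# name and Spotify integration are hypothetical.

play_track_tool = {
    "type": "function",
    "function": {
        "name": "play_spotify_track",
        "description": "Search Spotify and start playback of the best match.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Artist, track title, or mood to search for",
                }
            },
            "required": ["query"],
        },
    },
}
```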

kabachuha (Author) commented:

As I understand it, it's still quite a general-purpose model, so function calling should work with it as well (maybe with some extra tuning, if their multimodal training overwrote the normal instruction-following too much).

The best thing is that the model acquires a better semantic understanding of how words, sounds, music, and images connect with each other, so it may be worth exploring its "creative soul" :)

Decoders like Stable Diffusion can run very fast now on decent GPUs if you use things like LCMs (Latent Consistency Models) and other tricks that allow few-step or one-step diffusion.
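
For example, the standard LCM-LoRA recipe in diffusers gets Stable Diffusion down to ~4 sampling steps (the model IDs here are just the ones from the diffusers docs, swap in whatever checkpoint GLaDOS ends up using):

```python
import torch
from diffusers import AutoPipelineForText2Image, LCMScheduler

# LCM-LoRA recipe: swap in the LCM scheduler, load the distillation LoRA,
# then sample in ~4 steps with guidance close to 1.0.
pipe = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

image = pipe(
    "a robot assistant, studio lighting",
    num_inference_steps=4,   # vs. 25-50 for vanilla SD
    guidance_scale=1.0,      # LCM works best with little to no CFG
).images[0]
image.save("lcm_demo.png")
```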
