Video-to-audio AI
- Insert Description
- AudioSet processing (see dedicated repo): I wrote code to download video-audio pairs from YouTube and store them on AWS S3, together with the strongly labelled annotations from AudioSet (100k+ videos).
- Label augmentation with GPT: I augment the AudioSet labels to identify the sound-emitting objects and classify each sound as a sound effect (SFX) or ambience (AMB), by repeatedly calling the OpenAI API and applying majority voting to the responses.
- I use the ImageBind model (fork repo) to generate embeddings. ImageBind is a multimodal encoder that maps video, audio and text into the same embedding space.
- I migrate the embeddings to Pinecone to build a vector database. Pipeline
- Semantic search
- Eval
- Streamlit app
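
The majority-voting step in the label augmentation can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `classify_once` is a hypothetical stand-in for a single OpenAI call that returns "SFX" or "AMB" for a label, and the number of calls and the tie handling (returning `None`) are assumptions.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable, Optional


def majority_vote(votes: list[str]) -> Optional[str]:
    """Return the most frequent vote, or None if there are no votes or a tie."""
    if not votes:
        return None
    ranked = Counter(votes).most_common(2)
    # A tie for first place means there is no clear majority.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def label_with_voting(
    audioset_label: str,
    classify_once: Callable[[str], str],  # hypothetical wrapper around one LLM call
    n_calls: int = 5,                     # assumed number of repeated calls
) -> Optional[str]:
    """Query the classifier n_calls times and keep the majority answer."""
    votes = [classify_once(audioset_label).strip().upper() for _ in range(n_calls)]
    return majority_vote(votes)
```

Using an odd number of calls with two classes avoids ties whenever every response is a valid class name; the `None` result covers the remaining cases, which can then be reviewed by hand.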
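
To illustrate what semantic search over a shared embedding space means, here is a small in-memory sketch. It is only a stand-in: in the project the vectors come from ImageBind and the search runs in Pinecone, whereas the three-dimensional vectors, the file names, and the choice of cosine similarity below are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=3):
    """Return the k (item_id, score) pairs most similar to the query vector."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: made-up audio embeddings keyed by hypothetical file names.
toy_index = {
    "dog_bark.wav": [1.0, 0.0, 0.0],
    "rain.wav": [0.0, 1.0, 0.0],
    "dog_growl.wav": [0.9, 0.1, 0.0],
}

# A made-up query embedding, e.g. of a video frame or a text prompt.
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
```

Because all modalities share one space, the query vector can come from video or text while the indexed vectors come from audio; the ranking logic stays the same.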