# Multimodal AI YouTube Chatbot
This project builds a multimodal AI chatbot that can answer questions about YouTube videos using both speech and text input. It combines Whisper for transcription, a retrieval-augmented generation (RAG) pipeline for context-aware responses, and LangChain agents for orchestration. Users interact through a simple Gradio interface, making video content searchable, accessible, and easier to explore.

In [42]:
# Make the project's utils package (one level up) importable from this notebook
import sys
sys.path.append("..")

In [43]:
from dotenv import load_dotenv
import os

# Load variables from a local .env file, then read the OpenAI key from the environment
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

In [44]:
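import os
import subprocess

video_url = "https://www.youtube.com/watch?v=SN-vBnWj6e8"
output_path = "../data/eleo_audio.mp3"

# Extract the audio track as MP3; an argument list avoids shell-quoting issues with the URL
subprocess.run(["yt-dlp", "-x", "--audio-format", "mp3", "-o", output_path, video_url]).returncode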




[youtube] Extracting URL: https://www.youtube.com/watch?v=SN-vBnWj6e8
[youtube] SN-vBnWj6e8: Downloading webpage
[youtube] SN-vBnWj6e8: Downloading tv client config
[youtube] SN-vBnWj6e8: Downloading tv player API JSON
[youtube] SN-vBnWj6e8: Downloading ios player API JSON
[youtube] SN-vBnWj6e8: Downloading m3u8 information
[info] SN-vBnWj6e8: Downloading 1 format(s): 251
[download] ../data/eleo_audio.mp3 has already been downloaded
[ExtractAudio] Not converting audio ../data/eleo_audio.mp3; file is already in target format mp3


0

In [45]:
# Check if the file was created successfully
if os.path.exists(output_path):
    print(f"Audio file downloaded successfully: {output_path}")
else:
    print("Failed to download the audio file.")

Audio file downloaded successfully: ../data/eleo_audio.mp3


In [46]:
# Install if not already installed
#!pip install openai
#!pip install python-dotenv
#!pip install langchain

### Transcribe with Whisper
At first it seemed sensible to use youtube-transcript-api, but for this task we cannot rely on auto-generated captions, because the content is not entirely in English. Whisper is therefore the better choice.

In [47]:
from utils.whisper import transcribe_audio

transcript_text = transcribe_audio(
    file_path="../data/eleo_audio.mp3",
    save_path="../data/eleo_transcript.txt"
)

print(transcript_text[:300])


Hallo! Hi! Ich werde jetzt multitasken, ein sehr deutsches Wort. Also, ich werde jetzt mehrere Dinge gleichzeitig machen. Ich werde jetzt eure Fragen beantworten. Ihr habt mir nämlich sehr viele Fragen auf Instagram und hier auf dem YouTube-Kanal gestellt. Danke für die ganzen Fragen. Und die werde 
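The `transcribe_audio` helper used above lives in the project's `utils` package and its source is not shown in this notebook. As a rough sketch only, assuming it wraps OpenAI's hosted Whisper model (`whisper-1`) through the `openai` Python client, a helper of this kind could look like the following. The name `transcribe_audio_sketch` and the `client` and `model` parameters are hypothetical and were introduced for illustration; the real helper may differ, for example by running Whisper locally.

```python
from pathlib import Path
from typing import Optional


def transcribe_audio_sketch(
    file_path: str,
    save_path: Optional[str] = None,
    client=None,
    model: str = "whisper-1",
) -> str:
    """Transcribe an audio file and optionally save the text.

    Hypothetical stand-in for utils.whisper.transcribe_audio.
    Assumes the OpenAI audio transcription endpoint is used.
    """
    if client is None:
        # Imported lazily so the function can be exercised with a stub client.
        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # The endpoint expects a binary file handle.
    with open(file_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)

    text = result.text

    if save_path is not None:
        out = Path(save_path)
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(text, encoding="utf-8")

    return text
```

Called as `transcribe_audio_sketch("../data/eleo_audio.mp3", save_path="../data/eleo_transcript.txt")`, it would mirror the cell above. Accepting an optional `client` keeps the function easy to test without network access.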


### Add metadata for structure
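The next cell relies on a project helper, `extract_youtube_metadata`, whose source is not part of this notebook. A minimal sketch of what such a helper might do, assuming it uses the yt-dlp Python API to fetch video information without downloading any media, is shown here. The names `select_fields` and `extract_metadata_sketch` are hypothetical, and only the three fields visible in the output below are kept; the real helper may return more.

```python
import json
from pathlib import Path
from typing import Optional


def select_fields(info: dict) -> dict:
    """Keep a small, stable subset of the yt-dlp info dictionary."""
    return {
        "video_id": info.get("id"),
        "title": info.get("title"),
        "description": info.get("description"),
    }


def extract_metadata_sketch(video_url: str, save_path: Optional[str] = None) -> dict:
    """Hypothetical stand-in for utils.metadata_extract.extract_youtube_metadata."""
    # Imported lazily so select_fields can be used without yt-dlp installed.
    import yt_dlp

    options = {"quiet": True, "skip_download": True}
    with yt_dlp.YoutubeDL(options) as ydl:
        # download=False returns the metadata only.
        info = ydl.extract_info(video_url, download=False)

    metadata = select_fields(info)

    if save_path is not None:
        out = Path(save_path)
        out.parent.mkdir(parents=True, exist_ok=True)
        # ensure_ascii=False keeps umlauts and emoji readable in the JSON file.
        out.write_text(
            json.dumps(metadata, ensure_ascii=False, indent=2), encoding="utf-8"
        )

    return metadata
```

Separating the field selection from the network call keeps the part that shapes the metadata easy to check on its own.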



In [52]:
from utils.metadata_extract import extract_youtube_metadata

video_url = "https://www.youtube.com/watch?v=SN-vBnWj6e8"
save_path = "../data/eleo_metadata.json"

metadata = extract_youtube_metadata(video_url, save_path=save_path)

print("✅ Metadata:")
print(metadata)


✅ Metadata:
{'video_id': 'SN-vBnWj6e8', 'title': 'How to Learn German Fast & Enjoy the Process – Q&A with a Native Teacher', 'description': "Let’s multitask: I answer your german learning questions – while baking cookies!\n👉 Want to grow your German vocab? https://link.fluentu.com/eleos-corner-3\n📬 Sign up for Eleo's Deutsch Post! bit.ly/3GhGxsi\n🤝 Get in touch with other learners: https://www.instagram.com/eleos_corner/\n\n🤎 DANKE for supporting my work with a Coffee: https://buymeacoffee.com/eleoscorner \n📺Watch this Video next: https://www.youtube.com/watch?v=es0iwHlAmR8\n\n💬 Lass uns in den Kommentaren austauschen!\n👉 Hast du Lust, mich in einem Livestream zu treffen?\n👉 Und hast du noch weitere Fragen?\n\n🥣 Das Rezept für die Müsli-Kekse\n🍪 Zutaten: 150\u202fg Müsli, 100\u202fg Mehl, 2–3 EL Honig, 1 TL Zimt, 1 Prise Salz, 1 Ei, 4 EL Öl, Optional: 1–2 EL Zucker, Schokostückchen oder 1 EL Erdnussbutter \n👉 So geht’s: Alles gut miteinander vermengen, kleine Häufchen auf ein Backblech