## PropHero AI Chatbot

The PropHero AI Chatbot is an intelligent assistant that helps users explore PropHero's approach to real estate investing. It answers questions about property opportunities and how PropHero can help them invest with confidence.

**Key Objectives Delivered**:
1) Integrate speech recognition to convert video audio to text
2) Retrieve YouTube video content and store it

**1. Data Acquisition and Pre-Processing Pipeline Setup**

1.1 Ingestion Pipeline: Download the raw audio from YouTube and turn it into transcript text via Whisper (STT)

In [1]:
import os
import re
import subprocess
import whisper
import json

# ------------ VIDEO CONFIGURATION ------------
VIDEO_LIST = [
    {
        "id": "prophero_video_1",
        "url": "https://www.youtube.com/watch?v=ED3eypjlfrY",
        "title": "PropHero – Intro Video 1"
    },
    {
        "id": "prophero_video_2",
        "url": "https://www.youtube.com/watch?v=uxF2IObEzZg",
        "title": "PropHero – Intro Video 2"
    },
    {
        "id": "prophero_video_3",
        "url": "https://www.youtube.com/watch?v=5Kca3nOrefY",
        "title": "PropHero – Intro Video 3"
    },
]
AUDIO_DIR = "data/audio"
TRANSCRIPT_DIR = "data/transcripts"
MODEL_SIZE = "small" 

os.makedirs(AUDIO_DIR, exist_ok=True)
os.makedirs(TRANSCRIPT_DIR, exist_ok=True)

# 1) DOWNLOAD AUDIO FROM YOUTUBE
def download_audio(url, temp_filename):
    cmd = [
        "yt-dlp",
        "-f", "bestaudio/best",
        "-o", temp_filename,
        url,
    ]
    print(f" Downloading audio from YouTube: {url}")
    subprocess.run(cmd, check=True)
    print("   Audio downloaded as:", temp_filename)

# 2) CONVERT TO MP3
def convert_to_mp3(input_path, output_path):
    if os.path.exists(output_path):
        os.remove(output_path)

    print(" Converting to mp3 using ffmpeg...")
    cmd = [
        "ffmpeg",
        "-y",
        "-i",
        input_path,
        output_path,
    ]
    subprocess.run(cmd, check=True)
    print("   Converted to:", output_path)

# 3) BASIC TEXT CLEANING
def basic_clean(text: str) -> str:
    # Remove some common noise tokens (you can add more later)
    noise_tokens = ["[Music]", "(Music)", "[Applause]", "(Applause)"]
    for token in noise_tokens:
        text = text.replace(token, " ")

    # Replace line breaks by spaces
    text = text.replace("\n", " ")

    # Collapse multiple spaces into a single one
    text = re.sub(r"\s+", " ", text)

    # Strip leading / trailing spaces
    text = text.strip()

    return text

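# 4) RUN WHISPER AND RETURN FULL RESULT
_WHISPER_MODELS = {}  # cache: load each model size only once, not once per video

def transcribe_audio_with_segments(audio_path, model_size="small"):
    if model_size not in _WHISPER_MODELS:
        print(f" Loading Whisper model: {model_size} ...")
        _WHISPER_MODELS[model_size] = whisper.load_model(model_size)
    model = _WHISPER_MODELS[model_size]
    print(" Transcribing (with timestamps)...")
    result = model.transcribe(audio_path, language="en")
    print("   Transcription done.")
    return result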

# 5a) SAVE PLAIN TEXT
def save_transcript_txt(text, filename):
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)
    print(f" TXT transcript saved to {filename}")

# 5b) SAVE JSON WITH METADATA + CLEANED SEGMENTS
def save_transcript_json(segments, filename, video_meta):
    """
    segments: result["segments"] from Whisper
    we will keep: start, end, CLEANED text
    and wrap them with video metadata so Notebook 2 can use it easily.
    """
    cleaned_segments = []
    for seg in segments:
        cleaned_text = basic_clean(seg["text"])
        if not cleaned_text:
            # Skip segments that are empty after cleaning
            continue
        item = {
            "start": round(seg["start"], 2),
            "end": round(seg["end"], 2),
            "text": cleaned_text,
        }
        cleaned_segments.append(item)

    data = {
        "video_id": video_meta["id"],
        "title": video_meta["title"],
        "url": video_meta["url"],
        "segments": cleaned_segments,
    }

    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

    print(f" JSON transcript (with metadata) saved to {filename}")
    print(f"   Total segments after cleaning: {len(cleaned_segments)}")

# ------------ MAIN PIPELINE (MULTI-VIDEO) ------------

for video in VIDEO_LIST:
    url = video["url"]
    vid = video["id"]

    print("\n" + "=" * 60)
    print(f"Processing video: {vid} | {video['title']}")
    print("=" * 60)

    # Build file paths so each video has its own files
    temp_file = os.path.join(AUDIO_DIR, f"{vid}_raw.m4a")
    audio_file = os.path.join(AUDIO_DIR, f"{vid}.mp3")
    txt_output = os.path.join(TRANSCRIPT_DIR, f"{vid}.txt")
    json_output = os.path.join(TRANSCRIPT_DIR, f"{vid}.json")

    try:
        # 1. Download audio
        download_audio(url, temp_file)

        # 2. Convert to MP3
        convert_to_mp3(temp_file, audio_file)

        # 3. Transcribe (get full result)
        result = transcribe_audio_with_segments(audio_file, MODEL_SIZE)
        full_text = basic_clean(result["text"])
        segments = result["segments"]

        # 4a. Save cleaned plain text transcript
        save_transcript_txt(full_text, txt_output)

        # 4b. Save JSON with timestamps + metadata (already cleaned)
        save_transcript_json(segments, json_output, video)

        # 5. Show a short preview
        print("\n Preview of cleaned transcript (first 400 characters):\n")
        print(full_text[:400])
        print("\n Finished video:", vid)

    except subprocess.CalledProcessError as e:
        print(" There was an error running yt-dlp or ffmpeg. Details:")
        print(e)

    except Exception as e:
        print(" Unexpected error:", e)

    finally:
        # Optional: cleanup of temporary raw file
        if os.path.exists(temp_file):
            os.remove(temp_file)
            print(" Removed temporary file:", temp_file)



Processing video: prophero_video_1 | PropHero – Intro Video 1
 Downloading audio from YouTube: https://www.youtube.com/watch?v=ED3eypjlfrY
   Audio downloaded as: data/audio\prophero_video_1_raw.m4a
 Converting to mp3 using ffmpeg...
   Converted to: data/audio\prophero_video_1.mp3
 Loading Whisper model: small ...
 Transcribing (with timestamps)...




   Transcription done.
 TXT transcript saved to data/transcripts\prophero_video_1.txt
 JSON transcript (with metadata) saved to data/transcripts\prophero_video_1.json
   Total segments after cleaning: 90

 Preview of cleaned transcript (first 400 characters):

Hi, my name is Michael Roger. I'm one of the co-founders of Prop Hero. In this video, I'm going to show you how we are using data to find the best deals. All right, let's start with the absolute basics. In a very, very simplified view, a great investment property is the property that will increase in value, that's capital growth, and that will provide you with strong cash flows. That's rental retu

 Finished video: prophero_video_1
 Removed temporary file: data/audio\prophero_video_1_raw.m4a

Processing video: prophero_video_2 | PropHero – Intro Video 2
 Downloading audio from YouTube: https://www.youtube.com/watch?v=uxF2IObEzZg
   Audio downloaded as: data/audio\prophero_video_2_raw.m4a
 Converting to mp3 using ffmpeg...
   Conv

**Summary**

In Notebook 1, I built the ingestion pipeline that converts multiple PropHero YouTube videos into clean, structured transcripts.

First, I define the list of videos with their IDs, URLs, and titles, and set up folders to store the audio files and transcripts.
For each video, I download the best available audio using yt-dlp, convert it to MP3 with ffmpeg, and then run Whisper to generate a transcript.
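Because the pipeline shells out to `yt-dlp` and `ffmpeg`, a missing tool only surfaces as an error in the middle of a run. A minimal sketch of a pre-flight check, assuming both tools are expected on the system `PATH` (the helper name `missing_tools` is hypothetical and not part of the notebook):

```python
import shutil


def missing_tools(tools=("yt-dlp", "ffmpeg")):
    """Return the names of required command-line tools that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Run the check before starting the main loop over the videos
missing = missing_tools()
if missing:
    print("Missing external tools:", ", ".join(missing))
else:
    print("All external tools found.")
```

Running this once at the top of the notebook makes a setup problem visible before any download starts.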

I apply a basic cleaning function to remove noise tokens such as [Music] and [Applause] and to normalize whitespace. Then I save two versions of each transcript: 1) a plain .txt file for human inspection and 2) a .json file with metadata. The JSON contains the cleaned segments with timestamps plus the video ID, title, and URL.
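To make the JSON layout concrete, here is a small sketch that loads a transcript file and prints its segments with readable timestamps. The helpers `load_transcript` and `format_timestamp` are hypothetical, and the sample data is made up for illustration (it is not real pipeline output). Only the keys `video_id`, `title`, `url`, and `segments` come from the pipeline above.

```python
import json
import os
import tempfile


def load_transcript(path):
    """Load a transcript JSON file and check that the expected keys exist."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for key in ("video_id", "title", "url", "segments"):
        if key not in data:
            raise KeyError(f"Missing key in transcript: {key}")
    return data


def format_timestamp(seconds):
    """Convert a time in seconds to an MM:SS string."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


# Made-up sample that mirrors the structure written by save_transcript_json
sample = {
    "video_id": "sample_video",
    "title": "Sample video",
    "url": "https://example.com/sample",
    "segments": [
        {"start": 0.0, "end": 4.5, "text": "First example segment."},
        {"start": 64.5, "end": 70.25, "text": "Second example segment."},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=4)
    data = load_transcript(sample_path)

for seg in data["segments"]:
    start = format_timestamp(seg["start"])
    end = format_timestamp(seg["end"])
    print(f"[{start} - {end}] {seg['text']}")
```

In practice the path would point at one of the files in `data/transcripts` instead of a temporary sample.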

This JSON format is what I'll use in Notebook 2 to create chunks, embeddings, and build the vector database for my RAG system. So this notebook is essentially the "data ingestion and cleaning" step of my multimodal chatbot.
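As a preview of the chunking step, the sketch below shows one possible way to group consecutive segments into larger chunks while keeping the start and end timestamps. This is an assumption about how the chunking could work, not the actual Notebook 2 code: the function names and the `max_chars` limit are hypothetical.

```python
def merge_segments(segs):
    """Merge a list of segments into one chunk spanning their time range."""
    return {
        "start": segs[0]["start"],
        "end": segs[-1]["end"],
        "text": " ".join(s["text"] for s in segs),
    }


def chunk_segments(segments, max_chars=500):
    """Group consecutive segments into chunks of at most max_chars characters.

    A single segment longer than max_chars becomes its own chunk.
    """
    chunks = []
    current = []
    current_len = 0
    for seg in segments:
        seg_len = len(seg["text"])
        # +1 accounts for the space used to join segments
        if current and current_len + 1 + seg_len > max_chars:
            chunks.append(merge_segments(current))
            current, current_len = [], 0
        current_len = current_len + 1 + seg_len if current else seg_len
        current.append(seg)
    if current:
        chunks.append(merge_segments(current))
    return chunks


# Made-up segments for illustration only
example_segments = [
    {"start": 0.0, "end": 2.0, "text": "aaaa"},
    {"start": 2.0, "end": 4.0, "text": "bbbb"},
    {"start": 4.0, "end": 6.0, "text": "cccc"},
]
example_chunks = chunk_segments(example_segments, max_chars=9)
for chunk in example_chunks:
    print(chunk)
```

Keeping `start` and `end` on each chunk means a chatbot answer can later point back to the exact moment in the video.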