**Key Objectives Delivered**:
1) **Video Data Indexing** --> Loaded the structured YouTube transcripts generated in Notebook 1 for further processing. 
2) **Blog Ingestion** --> Retrieved and cleaned text from multiple PropHero blog URLs. 
3) **Text Cleaning** --> Applied preprocessing helper functions to normalize whitespace and strip leftover HTML tags.
4) **Text Chunking** --> Split both YouTube transcripts and blog articles into overlapping chunks with custom character-based helpers (chunk_by_length for transcripts, chunk_long_text for blogs). 
- YouTube --> chunk size = 600 / overlap = 100 (characters). 
- Blogs --> chunk size = 800 / overlap = 100 (characters).
5) **Embeddings Creation** --> Generated semantic vector embeddings for all chunks using sentence-transformers/all-MiniLM-L6-v2.
6) **Chroma Vector DB** --> Stored the embeddings (videos + blogs) in a Chroma collection for later retrieval in the QA system. 

**2. Indexing Base Builder RAG System** 

2.1 Loading all .json files from data/transcripts (videos)

In [None]:
!pip install "sentence-transformers==2.7.0" \
             "transformers==4.39.3" \
             "huggingface_hub==0.20.0" \
             "protobuf<=3.20.3"



In [None]:
import os
import json
import glob
import pandas as pd

TRANSCRIPT_DIR = "data/transcripts"

# 1. Find all JSON transcript files (sorted for a reproducible order)
json_files = sorted(glob.glob(os.path.join(TRANSCRIPT_DIR, "*.json")))

print(f"Found {len(json_files)} transcript files in {TRANSCRIPT_DIR}:")
for path in json_files:
    print(" -", os.path.basename(path))

if not json_files:
    raise FileNotFoundError(
        f"No JSON transcript files found in {TRANSCRIPT_DIR}. "
        "Run Notebook 1 first to generate them."
    )

# 2. Quick overview of transcripts (for sanity check)
summary = []
for path in json_files:
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    summary.append({
        "video_id": data["video_id"],
        "title": data["title"],
        "url": data["url"],
        "num_segments": len(data["segments"])
    })

df_summary = pd.DataFrame(summary)
df_summary


Found 3 transcript files in data/transcripts:
 - prophero_video_1.json
 - prophero_video_2.json
 - prophero_video_3.json


| | video_id | title | url | num_segments |
|---|---|---|---|---|
| 0 | prophero_video_1 | PropHero – Intro Video 1 | https://www.youtube.com/watch?v=ED3eypjlfrY | 90 |
| 1 | prophero_video_2 | PropHero – Intro Video 2 | https://www.youtube.com/watch?v=uxF2IObEzZg | 32 |
| 2 | prophero_video_3 | PropHero – Intro Video 3 | https://www.youtube.com/watch?v=5Kca3nOrefY | 64 |


2.2 Blog ingestion configuration

In [None]:
import requests
from bs4 import BeautifulSoup

BLOG_LIST = [
    {
        "id": "blog_mistakes",
        "url": "https://www.prophero.com/the-most-common-property-investment-mistakes-that-can-easily-be-avoided/",
        "title": "The Most Common Property Investment Mistakes That Can Easily Be Avoided"
    },
    {
        "id": "blog_capital_gains",
        "url": "https://www.prophero.com/capital-gains-101-a-simplified-guide-for-smart-property-investing/",
        "title": "Capital Gains 101: A Simplified Guide for Smart Property Investing"
    },
    {
        "id": "blog_rental_yield",
        "url": "https://www.prophero.com/are-you-calculating-rental-yield-correctly/",
        "title": "Are You Calculating Rental Yield Correctly?"
    },
    {
        "id": "blog_property_vs_shares",
        "url": "https://www.prophero.com/property-vs-shares-whats-the-difference-which-is-better-for-you/",
        "title": "Property vs Shares: What’s the Difference & Which is Better for You?"
    },
]

2.3 Helper functions: *basic_clean, fetch_blog_text, chunk_long_text*

 - basic_clean = cleans the text by removing line breaks and extra spaces.
 - fetch_blog_text = downloads a PropHero blog article and extracts the text of the page body.
 - chunk_long_text = splits a long article into smaller overlapping pieces.

In [None]:
import re

def basic_clean(text: str) -> str:
    """
    Simple cleaning function:
    - remove extra line breaks
    - collapse multiple spaces
    - strip spaces at the beginning and end
    """
    text = text.replace("\n", " ")
    text = re.sub(r"\s+", " ", text)
    text = text.strip()
    return text


def fetch_blog_text(url: str) -> str:
    """
    Download a PropHero blog article and return (roughly) the main text content.
    """
    print(f" Fetching blog: {url}")
    response = requests.get(url, timeout=30)  # fail instead of hanging on a stalled connection
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    body = soup.body
    if body is None:
        print(" No <body> found in page, returning empty text.")
        return ""

    raw_text = body.get_text(separator=" ")
    cleaned_text = basic_clean(raw_text)
    return cleaned_text


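def chunk_long_text(text: str, chunk_size: int = 800, overlap: int = 100):
    """
    Split a long piece of text into overlapping chunks, measured in characters.
    Consecutive chunks share `overlap` characters.
    """
    if not 0 <= overlap < chunk_size:
        # Otherwise start_idx would never advance and the loop would not terminate.
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
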
    chunks = []
    start_idx = 0
    text_len = len(text)

    while start_idx < text_len:
        end_idx = min(start_idx + chunk_size, text_len)
        chunk_text = text[start_idx:end_idx].strip()

        if chunk_text:
            chunks.append(chunk_text)

        if end_idx == text_len:
            break

        start_idx = end_idx - overlap

    return chunks


The reason we don't chunk YouTube and blog content with the same function is that they arrive in different formats: YouTube transcripts are already split into timestamped segments, while each blog is one big blob of text.
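
To make the character-based overlap concrete, here is a minimal sketch of the same sliding-window idea that chunk_long_text uses, run on a toy string. The function name `sliding_chunks` and the tiny sizes are illustrative assumptions, not part of the notebook, and unlike chunk_long_text this version does not strip whitespace from each chunk.

```python
def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Illustrative sliding window: each chunk repeats the last `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, overlap=1)
print(demo)  # ['abcd', 'defg', 'ghij']
```

With the blog settings used in this notebook (800 / 100), the window would advance 700 characters per chunk.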

In [None]:
import re

def preprocess_text(text: str) -> str:
    """
    Light preprocessing before chunking:
    - Remove HTML tags
    - Normalize whitespace
    (No lemmatization, no aggressive changes to words)
    """
    if not isinstance(text, str):
        return ""

    # 1) Strip leading/trailing spaces
    text = text.strip()

    # 2) Remove HTML tags like <p>, <br>, etc.
    text = re.sub(r"<[^>]+>", " ", text)

    # 3) Normalize multiple spaces/newlines into a single space
    text = re.sub(r"\s+", " ", text).strip()

    return text


2.4 Chunk YouTube Transcripts


 Chunk *(grouping the transcript segments into overlapping pieces of limited size, which helps retrieval later)* / Embed *(converting every chunk into a vector with sentence-transformers/all-MiniLM-L6-v2, so that we can do semantic search)* / Store in the vector DB. 
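
As a rough illustration of what "semantic search" means once chunks are vectors, the sketch below ranks three toy vectors against a query by cosine similarity. The 3-dimensional vectors and their labels are made up for the example; the real embeddings in this notebook have 384 dimensions and come from the model, not from hand-written numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" standing in for real chunk vectors
toy_chunks = {
    "rental yield": [0.9, 0.1, 0.0],
    "capital gains": [0.1, 0.9, 0.0],
    "team intro": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]  # a query vector pointing mostly towards "rental yield"

ranked = sorted(
    toy_chunks,
    key=lambda name: cosine_similarity(query, toy_chunks[name]),
    reverse=True,
)
print(ranked)  # ['rental yield', 'capital gains', 'team intro']
```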

In [None]:
import os
import json
import pandas as pd

CHUNK_SIZE = 600
OVERLAP = 100

def chunk_by_length(segments, chunk_size=600, overlap=100):
    """
    Merge small transcript segments into larger overlapping chunks.
    """
    chunks = []
    current_chunk = ""
    current_start = None
    current_end = None

    for seg in segments:
        text = seg["text"].strip()
        start = seg["start"]
        end = seg["end"]

        if current_chunk == "":
            current_chunk = text
            current_start = start
            current_end = end
            continue

        if len(current_chunk) + len(text) + 1 > chunk_size:
            chunks.append({
                "text": current_chunk.strip(),
                "start": round(current_start, 2),
                "end": round(current_end, 2)
            })

            
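            # Seed the next chunk with the last `overlap` characters of the
            # previous one. The slice is character-based, so it can begin mid-word.
            overlap_text = current_chunk[-overlap:] if overlap > 0 else ""
            current_chunk = overlap_text + " " + text
            # Timestamps track the new segment, not the carried-over overlap text.
            current_start = start
            current_end = end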
        else:
            current_chunk += " " + text
            current_end = end

    if current_chunk:
        chunks.append({
            "text": current_chunk.strip(),
            "start": round(current_start, 2),
            "end": round(current_end, 2)
        })

    return chunks


# === Loop through all transcript JSON files ===
TRANSCRIPT_DIR = "data/transcripts"
# Sort so chunk order (and the index-based chunk IDs built later) is reproducible
json_files = sorted(f for f in os.listdir(TRANSCRIPT_DIR) if f.endswith(".json"))

all_video_chunks = []

for file_name in json_files:
    path = os.path.join(TRANSCRIPT_DIR, file_name)
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    video_id = data["video_id"]
    title = data["title"]
    url = data["url"]
    segments = data["segments"]

    print(f" Processing {video_id} ({len(segments)} segments)")

    # Apply light preprocessing (strip HTML tags, normalize whitespace) to each segment
    for seg in segments:
        seg["text"] = preprocess_text(seg["text"])

    chunks = chunk_by_length(segments, CHUNK_SIZE, OVERLAP)
    print(f" Created {len(chunks)} chunks for {video_id}")

    # Save chunks with metadata
    for ch in chunks:
        all_video_chunks.append({
            "source_type": "video",
            "video_id": video_id,
            "title": title,
            "url": url,
            "start": ch["start"],
            "end": ch["end"],
            "text": ch["text"]
        })

print(f"\n Total chunks from all videos: {len(all_video_chunks)}")
df_video_chunks = pd.DataFrame(all_video_chunks)
df_video_chunks.head(3)


 Processing prophero_video_1 (90 segments)
 Created 11 chunks for prophero_video_1
 Processing prophero_video_2 (32 segments)
 Created 5 chunks for prophero_video_2
 Processing prophero_video_3 (64 segments)
 Created 12 chunks for prophero_video_3

 Total chunks from all videos: 28


| | source_type | video_id | title | url | start | end | text |
|---|---|---|---|---|---|---|---|
| 0 | video | prophero_video_1 | PropHero – Intro Video 1 | https://www.youtube.com/watch?v=ED3eypjlfrY | 0.0 | 42.24 | Hi, my name is Michael Roger. I'm one of the c... |
| 1 | video | prophero_video_1 | PropHero – Intro Video 1 | https://www.youtube.com/watch?v=ED3eypjlfrY | 42.24 | 72.56 | rowth is about investing in an area where in t... |
| 2 | video | prophero_video_1 | PropHero – Intro Video 1 | https://www.youtube.com/watch?v=ED3eypjlfrY | 72.56 | 104.48 | e improvements. So think about renovation, add... |


2.5 Chunk Blogs

In [None]:
BLOG_CHUNK_SIZE = 800  
BLOG_OVERLAP = 100

all_blog_chunks = [] 

for blog in BLOG_LIST:
    blog_id = blog["id"]
    url = blog["url"]
    title = blog["title"]

    try:
        # 1. Fetch the blog article
        blog_text = fetch_blog_text(url)
        if not blog_text:
            print(f" Empty content for blog {blog_id}, skipping.")
            continue

        # 2. Apply preprocessing (strip HTML tags, normalize whitespace; no lemmatization)
        clean_blog_text = preprocess_text(blog_text)

        # 3. Split the clean text into overlapping chunks
        blog_chunks = chunk_long_text(
            clean_blog_text,
            chunk_size=BLOG_CHUNK_SIZE,
            overlap=BLOG_OVERLAP
        )

        print(f" Blog '{title}' → created {len(blog_chunks)} chunks")

        # 4. Store chunks with metadata
        for i, chunk_text in enumerate(blog_chunks):
            all_blog_chunks.append({
                "source_type": "blog",
                "video_id": blog_id,  
                "title": title,
                "url": url,
                "start": None,
                "end": None,
                "text": chunk_text
            })

    except Exception as e:
        print(f" Error while processing blog '{title}': {e}")

print(f"\n Total blog chunks created: {len(all_blog_chunks)}")

# Optional: preview
import pandas as pd
df_blog_chunks = pd.DataFrame(all_blog_chunks)
df_blog_chunks.head(3)


 Fetching blog: https://www.prophero.com/the-most-common-property-investment-mistakes-that-can-easily-be-avoided/
 Blog 'The Most Common Property Investment Mistakes That Can Easily Be Avoided' → created 5 chunks
 Fetching blog: https://www.prophero.com/capital-gains-101-a-simplified-guide-for-smart-property-investing/
 Blog 'Capital Gains 101: A Simplified Guide for Smart Property Investing' → created 6 chunks
 Fetching blog: https://www.prophero.com/are-you-calculating-rental-yield-correctly/
 Blog 'Are You Calculating Rental Yield Correctly?' → created 4 chunks
 Fetching blog: https://www.prophero.com/property-vs-shares-whats-the-difference-which-is-better-for-you/
 Blog 'Property vs Shares: What’s the Difference & Which is Better for You?' → created 6 chunks

 Total blog chunks created: 21


| | source_type | video_id | title | url | start | end | text |
|---|---|---|---|---|---|---|---|
| 0 | blog | blog_mistakes | The Most Common Property Investment Mistakes T... | https://www.prophero.com/the-most-common-prope... | | | How it works About us Data and AI Press and Bl... |
| 1 | blog | blog_mistakes | The Most Common Property Investment Mistakes T... | https://www.prophero.com/the-most-common-prope... | | | t. Mistake #1. Using your emotions to guide yo... |
| 2 | blog | blog_mistakes | The Most Common Property Investment Mistakes T... | https://www.prophero.com/the-most-common-prope... | | | ke #2: You do not have to invest in property w... |


We chunk YouTube videos and blogs separately because the video transcripts arrive already semi-chunked (as timestamped segments), whereas the blogs are continuous text with no existing segmentation. Both chunkers cut on character counts, which is why some chunk previews above begin mid-word.
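
Both chunkers cut on raw character offsets, so several chunk previews in the outputs above start mid-word (for example "rowth is about" or "ke #2"). One possible refinement, not used in this notebook, is to trim the overlap so it starts at a word boundary. The helper name `word_aligned_tail` is hypothetical, and the sketch assumes the text has already been normalized to single spaces, as preprocess_text does.

```python
def word_aligned_tail(text: str, overlap: int) -> str:
    """Return about the last `overlap` characters of `text`, starting at a word boundary."""
    if overlap <= 0:
        return ""
    if len(text) <= overlap:
        return text  # the whole text fits, so nothing is cut mid-word
    cut = len(text) - overlap
    tail = text[cut:]
    # If the cut falls exactly between two words, only stray whitespace needs removing.
    if text[cut - 1].isspace() or text[cut].isspace():
        return tail.lstrip()
    # Otherwise the tail begins with a partial word: drop it.
    first_space = tail.find(" ")
    return tail[first_space + 1:] if first_space != -1 else ""


sentence = "capital growth is about investing"
print(sentence[-20:])                   # h is about investing
print(word_aligned_tail(sentence, 20))  # is about investing
```

The trade-off is that the overlap becomes slightly shorter than requested, in exchange for chunks that begin with a whole word.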

2.6 Merging YouTube + Blog Chunks into one dataset

In [None]:
all_chunks = list(all_video_chunks) + list(all_blog_chunks)

num_video_chunks = len(all_video_chunks)
num_blog_chunks = len(all_blog_chunks)
num_total_chunks = len(all_chunks)

print(f" Video chunks: {num_video_chunks}")
print(f" Blog chunks : {num_blog_chunks}")
print(f" Total chunks: {num_total_chunks}")

# Create DataFrame for exploration
df_all_chunks = pd.DataFrame(all_chunks)

# Preview a few random examples
df_all_chunks.sample(5, random_state=42)


 Video chunks: 28
 Blog chunks : 21
 Total chunks: 49


| | source_type | video_id | title | url | start | end | text |
|---|---|---|---|---|---|---|---|
| 13 | video | prophero_video_2 | PropHero – Intro Video 2 | https://www.youtube.com/watch?v=uxF2IObEzZg | 53.28 | 74.72 | led on bringing the best talent to the team. W... |
| 45 | blog | blog_property_vs_shares | Property vs Shares: What’s the Difference & Wh... | https://www.prophero.com/property-vs-shares-wh... | | | 0 minute webinar here. This will give you tips... |
| 47 | blog | blog_property_vs_shares | Property vs Shares: What’s the Difference & Wh... | https://www.prophero.com/property-vs-shares-wh... | | | onth. The stock market is more volatile The st... |
| 44 | blog | blog_property_vs_shares | Property vs Shares: What’s the Difference & Wh... | https://www.prophero.com/property-vs-shares-wh... | | | difference for you and help you choose the rig... |
| 17 | video | prophero_video_3 | PropHero – Intro Video 3 | https://www.youtube.com/watch?v=5Kca3nOrefY | 42.0 | 78.24 | perties or do proper due diligence. Number two... |


In [None]:
import os
import json

os.makedirs("data/processed", exist_ok=True)

# Save as JSON (keeps full structure)
json_path = "data/processed/prophero_all_chunks.json"
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(all_chunks, f, ensure_ascii=False, indent=2)

# Saving it into CSV for quick viewing in Excel
csv_path = "data/processed/prophero_all_chunks.csv"
df_all_chunks.to_csv(csv_path, index=False)

print(" Saved merged chunks to:")
print(" -", json_path)
print(" -", csv_path)

 Saved merged chunks to:
 - data/processed/prophero_all_chunks.json
 - data/processed/prophero_all_chunks.csv


2.7 Creating the embedding for all chunks

In [None]:
import shutil

# Delete any existing Chroma store so that re-running the notebook rebuilds the index from scratch
shutil.rmtree("data/chroma_db", ignore_errors=True)

In [None]:
from sentence_transformers import SentenceTransformer

emb_model_name = "sentence-transformers/all-MiniLM-L6-v2"
emb_model = SentenceTransformer(emb_model_name)

print(f"Loaded embedding model: {emb_model_name}")


Loaded embedding model: sentence-transformers/all-MiniLM-L6-v2


In [None]:
texts = [c["text"] for c in all_chunks]
print(f"Number of chunks to embed: {len(texts)}")

embeddings = emb_model.encode(texts, show_progress_bar=True)

print("Embeddings shape:", embeddings.shape)

Number of chunks to embed: 49


Batches: 100%|██████████| 2/2 [00:12<00:00,  6.25s/it]

Embeddings shape: (49, 384)


sentence-transformers/all-MiniLM-L6-v2 is the "sweet spot" model for this project:

- Good semantic quality: it performs well on typical semantic similarity tasks (matching a question to an answer passage, comparing paragraphs), which is what RAG retrieval needs.

- Fast & lightweight: embeddings are quick to create, memory usage is low (384-dimensional vectors), and the model runs comfortably on a normal laptop.
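
As a back-of-the-envelope check on the "lightweight" claim, the arithmetic below reproduces the numbers seen in the output above. It assumes a batch size of 32 (believed to be the default of `SentenceTransformer.encode`) and 4-byte float32 values; both are assumptions worth verifying against your installed version.

```python
import math

num_chunks = 49       # from "Number of chunks to embed: 49"
embedding_dim = 384   # second dimension of the embeddings array
batch_size = 32       # assumption: default batch size of encode()
bytes_per_value = 4   # assumption: embeddings are stored as float32

num_batches = math.ceil(num_chunks / batch_size)
total_bytes = num_chunks * embedding_dim * bytes_per_value

print(num_batches)         # 2
print(total_bytes)         # 75264
print(total_bytes / 1024)  # 73.5
```

Two batches is consistent with the `2/2` shown in the progress bar, and the full embedding matrix for this knowledge base takes roughly 73.5 KiB.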

2.8 Store chunks + embeddings in Chroma DB

Chroma stores not only the embeddings but also metadata like source type, video/blog ID, title, URL and timestamps.
Chroma’s metadata layer only accepts primitive types (strings, numbers, booleans), so we had to ensure we never send None values.

We built a build_metadata helper that cleans each metadata dictionary and removes any None values before insertion.
Then we use a PersistentClient to save all embeddings and metadata in a local folder (data/chroma_db), so our chatbot’s knowledge base is permanently stored and can be reused in later notebooks and in the UI.
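
To illustrate the effect of dropping None values, the sketch below applies the same dictionary filter to a video-style and a blog-style metadata record. The helper name `drop_none` and the shortened records are illustrative assumptions, not the notebook's actual helper.

```python
def drop_none(meta: dict) -> dict:
    """Keep only the keys whose value is not None."""
    return {key: value for key, value in meta.items() if value is not None}


video_meta = drop_none({"source_type": "video", "start": 42.24, "end": 72.56})
blog_meta = drop_none({"source_type": "blog", "start": None, "end": None})

print(video_meta)  # {'source_type': 'video', 'start': 42.24, 'end': 72.56}
print(blog_meta)   # {'source_type': 'blog'}

# Blog records end up without "start"/"end" keys, so code that reads the
# metadata later should use .get() rather than indexing directly.
print(blog_meta.get("start"))  # None
```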

In [None]:
def build_metadata(chunk: dict, idx: int) -> dict:
    """
    Build metadata for each chunk, including a unique chunk_id.
    """

    raw_meta = {
        "source_type": chunk.get("source_type"),
        "video_id": chunk.get("video_id"),
        "title": chunk.get("title"),
        "url": chunk.get("url"),
        "start": float(chunk["start"]) if chunk.get("start") is not None else None,
        "end": float(chunk["end"]) if chunk.get("end") is not None else None,
        # Unique ID: source type + video/blog ID + position in all_chunks
        "chunk_id": f"{chunk.get('source_type')}_{chunk.get('video_id')}_{idx}"
    }

    return {k: v for k, v in raw_meta.items() if v is not None}


In [None]:
import chromadb

persist_dir = "data/chroma_db"

client = chromadb.PersistentClient(path=persist_dir)

collection_name = "prophero_knowledge"
collection = client.get_or_create_collection(
    name=collection_name,
    metadata={"hnsw:space": "cosine"} 
)

documents = [c["text"] for c in all_chunks]
metadatas = [build_metadata(c, i) for i, c in enumerate(all_chunks)]
ids = [m["chunk_id"] for m in metadatas]  

collection.add(
    ids=ids,
    documents=documents,
    metadatas=metadatas,
    embeddings=embeddings.tolist(),
)

print("Saved", len(ids), "chunks correctly.")

Saved 49 chunks correctly.


**Summary**


In Notebook 2, I focused on indexing, cleaning, and embedding all textual data (YouTube transcripts + PropHero blogs) to prepare it for the Question-Answering stage of the RAG system.

First, I re-used the cleaned video transcripts created in Notebook 1 and ingested several PropHero blog posts.
I applied helper functions that performed additional cleaning to improve the quality of embeddings.

Next, I chunked both data sources into overlapping segments with custom character-based helpers (chunk_by_length for the transcripts, chunk_long_text for the blogs), so that each piece keeps some surrounding context.
I then encoded all chunks into vector embeddings using the sentence-transformers/all-MiniLM-L6-v2 model and stored them together in a persistent Chroma database.

This notebook therefore represents the indexing and embedding stage of the project pipeline: it transforms all PropHero textual data (videos and blogs) into a structured, searchable vector format that powers the final Question-Answering system.