<a href="https://colab.research.google.com/github/bnsreenu/python_for_microscopists/blob/master/358_recommender_system_for_digitalsreeni_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://youtu.be/kWwLVwNwW1k

# Building an Educational Video Knowledge Graph with LLMs and Embeddings

This project builds a knowledge graph for educational video recommendations by combining Large Language Models (LLMs) with semantic embeddings. The system extracts key concepts, difficulty levels, prerequisites, and learning outcomes from each video, links videos through those attributes, and supports semantic search and personalized learning path generation. The LLM handles structured information extraction while the embeddings capture semantic similarity, so the hybrid can return contextually relevant recommendations even for complex or ambiguous queries.
<p>
The pipeline has four stages: an LLM processes video metadata to extract structured information, an embedding model generates vectors for semantic search, the extracted data is assembled into a knowledge graph, and queries are answered through semantic search, LLM-based query understanding, or pattern matching as a fallback. Compared with traditional NLP methods (e.g., using spaCy), this approach aims for deeper contextual understanding, concept expansion, and more robust query handling.

### Preparation: Getting the Colab system ready

In [None]:
# Change the runtime type to GPU and verify the CUDA version
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


In [None]:
# Mount Google drive to save data
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Install llama-cpp-python with CUDA support; this pinned version works with Colab
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python==0.2.38

Collecting llama-cpp-python==0.2.38
  Downloading llama_cpp_python-0.2.38.tar.gz (10.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.2.38)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.38-cp311-cp311-linux_x86_64.whl size=9878611 sha256=2f

In [None]:
# Install other dependencies
!pip install sentence-transformers networkx pyvis

Collecting pyvis
  Downloading pyvis-0.3.2-py3-none-any.whl.metadata (1.7 kB)
Collecting jedi>=0.16 (from ipython>=5.3.0->pyvis)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence

**Download appropriate llama model and save to Google drive for future use**

This is a one-time action, so comment the download command out after the initial download.

In [None]:
#### NOT PREFERRED (if you don't have Hugging Face authentication)
# Download Llama model (smaller quantized version works well on Colab). The following one may require authentication.
#!wget -c https://huggingface.co/TheBloke/Llama-3-8B-Instruct-GGUF/resolve/main/llama-3-8b-instruct.Q4_K_S.gguf -O models/llama-3-8b-instruct.gguf

### PREFERRED #######
# Download an alternative model that doesn't require authentication
#!wget -c https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf -O /content/drive/MyDrive/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# I also experimented with the TinyLlama model, but it consistently failed to follow the instruction to produce JSON in the format we requested.

### 0. Import modules

In [None]:
import pandas as pd
import numpy as np
import networkx as nx
import json
import re
import os
import sqlite3
import time
from typing import List, Dict, Tuple, Set
from pyvis.network import Network
import torch
from sentence_transformers import SentenceTransformer
from concurrent.futures import ThreadPoolExecutor
import matplotlib.pyplot as plt
import io
import base64
from tqdm.notebook import tqdm



# Create output directories
OUTPUT_DIR = "/content/drive/MyDrive/data/knowledge_graph_results"
os.makedirs(os.path.join(OUTPUT_DIR, "visualizations"), exist_ok=True)
os.makedirs(os.path.join(OUTPUT_DIR, "database"), exist_ok=True)

### 1. Model Setup and Initialization

The first step is to set up the GPU and initialize the LLM and the embedding model. We load a quantized Mistral 7B Instruct model (about 4 GB, small enough for Colab) through llama-cpp-python and configure its context window, batch size, number of threads, and GPU offloading. We also initialize an embedding model from sentence-transformers, which we need later for semantic search and for finding conceptually related videos without exact keyword matches.
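
To illustrate why embeddings can find related videos without exact keyword matches, here is a small sketch of ranking by cosine similarity. The three-dimensional vectors and video titles are made up for illustration; in the notebook the vectors would come from the sentence-transformers model, and `cosine_similarity` and `rank_by_similarity` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Toy stand-ins for real embeddings (assumption: values invented for the demo)
toy_embeddings = {
    "image segmentation with U-Net": [0.9, 0.1, 0.0],
    "thresholding microscope images": [0.8, 0.3, 0.1],
    "reading CSV files with pandas": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # stands in for an embedded user query

def rank_by_similarity(query_vec, embeddings):
    """Return (title, score) pairs sorted from most to least similar."""
    scored = [(title, cosine_similarity(query_vec, vec))
              for title, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rank_by_similarity(query_vector, toy_embeddings)
for title, score in ranked:
    print(f"{score:.3f}  {title}")
```

The two image-related titles score far higher than the pandas one even though none of them shares a keyword with a query, which is the behaviour semantic search relies on.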

In [None]:

def setup_llm_gpu():
    """
    Install and setup llama-cpp-python with proper GPU support
    """
    import subprocess
    from IPython.display import clear_output

    print("Setting up LLM with GPU support...")

    # Check GPU availability
    gpu_available = torch.cuda.is_available()
    if gpu_available:
        gpu_name = torch.cuda.get_device_name(0)
        print(f"GPU detected: {gpu_name}")

        # Uninstall existing llama-cpp-python
        subprocess.run("pip uninstall -y llama-cpp-python", shell=True)

        # Install with CUDA support
        if torch.cuda.get_device_capability()[0] >= 7:  # For newer GPUs (Compute capability ≥ 7.0)
            print("Installing llama-cpp-python with CUDA support (optimized for modern GPUs)...")
            subprocess.run(
                "CMAKE_ARGS=\"-DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES=all\" pip install llama-cpp-python==0.2.38",
                shell=True
            )
        else:
            print("Installing llama-cpp-python with basic CUDA support...")
            subprocess.run(
                "CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" pip install llama-cpp-python==0.2.38",
                shell=True
            )

        clear_output()
        print("✅ llama-cpp-python installed with GPU support")
    else:
        print("⚠️ No GPU detected. Performance will be limited.")

    # Set environment variables for optimizing GPU memory usage
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'

    # Verify installation
    from llama_cpp import Llama
    print("LLM setup complete. Ready to initialize model.")


def initialize_llm(model_path=None, model_type="mistral", use_gpu=True):
    """
    Initialize the Llama model with proper GPU acceleration

    """
    from llama_cpp import Llama

    # Define default model paths
    model_paths = {
        "mistral": "/content/drive/MyDrive/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",
        "llama": "/content/drive/MyDrive/models/llama-3-8b-instruct.Q4_K_S.gguf",
        "tiny": "/content/drive/MyDrive/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"
    }

    # Use provided path or default based on model_type
    if model_path is None:
        model_path = model_paths.get(model_type)
        if model_path is None:
            raise ValueError(f"Unknown model type: {model_type}")

    print(f"Initializing {model_type} model...")

    # Configure GPU usage
    gpu_available = torch.cuda.is_available()
    if gpu_available and use_gpu:
        n_gpu_layers = -1  # Use all layers on GPU
        print("Using GPU acceleration for model inference")
    else:
        n_gpu_layers = 0
        print("Using CPU only for model inference")

    # Handle model file existence
    if not os.path.exists(model_path):
        raise FileNotFoundError(f"Model file not found: {model_path}")

    # Initialize model with optimized settings
    try:
        llm = Llama(
            model_path=model_path,
            n_ctx=4096,             # Larger context window
            n_batch=512,            # Optimized batch size
            n_threads=8,            # More CPU threads
            n_gpu_layers=n_gpu_layers,
            verbose=True
        )

        # Test the model with a quick query
        print("Testing model...")
        start_time = time.time()
        result = llm("Hello from DigitalSreeni!", max_tokens=20)
        end_time = time.time()
        print(f"Model responded in {end_time - start_time:.2f} seconds.")

        return llm
    except Exception as e:
        print(f"Error initializing model: {e}")
        raise


def initialize_embedding_model():
    """
    Initialize the sentence transformer model for embeddings
    """

    print("Initializing embedding model...")
    try:
        # Using an efficient model for sentence embeddings
        model = SentenceTransformer('all-MiniLM-L6-v2')
        print("✅ Embedding model loaded successfully")
        return model
    except Exception as e:
        print(f"Error initializing embedding model: {e}")
        raise

### 2. Data Loading

Next, we define a function to load the data from a CSV file. It also cleans the data: the video duration is converted from milliseconds to minutes, the publish timestamp is parsed into a datetime, and missing values in the text fields are replaced with empty strings. This gives the later steps consistent formatting and sensible units for duration.
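
The clean-up steps can be checked on a tiny in-memory CSV before running them on the real export. This sketch uses only the standard library rather than pandas, the two sample rows are invented, and `load_rows` is a hypothetical name; the column names match those used in this notebook.

```python
import csv
import io

# Invented sample data using the notebook's column names
sample_csv = io.StringIO(
    "Video Title (Original),Video Description (Original),Approx Duration (ms)\n"
    "01 - Introduction to Python,,600000\n"
    "25 - Advanced image segmentation,Covers U-Net,1530000\n"
)

def load_rows(handle):
    """Read rows, add duration in minutes, and normalize text fields."""
    rows = []
    for row in csv.DictReader(handle):
        # Milliseconds -> minutes: divide by 1000 (to seconds), then by 60
        row["duration_minutes"] = int(row["Approx Duration (ms)"]) / (1000 * 60)
        for col in ("Video Title (Original)", "Video Description (Original)"):
            row[col] = (row.get(col) or "").strip()
        rows.append(row)
    return rows

rows = load_rows(sample_csv)
for row in rows:
    print(row["Video Title (Original)"], row["duration_minutes"])
```

A video of 600000 ms comes out as 10.0 minutes, and the missing description becomes an empty string instead of a missing value.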

In [None]:
def load_video_data(csv_path: str) -> pd.DataFrame:
    """
    Load and preprocess video data from CSV - return pandas dataframe
    Clean up time and standardize text fields.

    """
    print(f"Loading video data from {csv_path}...")
    df = pd.read_csv(csv_path)

    # Convert duration from milliseconds to minutes
    if 'Approx Duration (ms)' in df.columns:
        df['duration_minutes'] = df['Approx Duration (ms)'] / (1000 * 60)

    # Convert timestamp to datetime if present
    if 'Video Publish Timestamp' in df.columns:
        df['publish_date'] = pd.to_datetime(df['Video Publish Timestamp'])

    # Clean and standardize text fields
    text_columns = ['Video Title (Original)', 'Video Description (Original)']
    for col in text_columns:
        if col in df.columns:
            df[col] = df[col].fillna('').astype(str)

    print(f"Loaded {len(df)} videos")
    return df

### 3. Entity Extraction

With our data loaded, we now extract key entities from each video's metadata. Helper functions extract topics from titles, determine the specific main topic, and infer difficulty levels. The main extraction function uses the LLM to analyze each video's title and description, extracting structured information including main topics, difficulty level, prerequisites, and learning outcomes. This gives us a rich understanding of what each video teaches and its place in a learning sequence.
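
LLM responses often wrap the requested JSON in extra text, so the extraction step needs a tolerant parser with a fallback. The sketch below isolates that idea; `parse_llm_json` is a hypothetical helper, and the sample response is invented rather than real model output.

```python
import json

def parse_llm_json(response_text, fallback):
    """Extract the outermost {...} block from an LLM response.

    Returns `fallback` when no braces are found, the text between them
    is not valid JSON, or the parsed value is not a JSON object.
    """
    start = response_text.find("{")
    end = response_text.rfind("}") + 1  # rfind gives -1 when missing, so end == 0
    if start == -1 or end <= start:
        return fallback
    try:
        parsed = json.loads(response_text[start:end])
    except json.JSONDecodeError:
        return fallback
    return parsed if isinstance(parsed, dict) else fallback

# Invented example of a chatty response around the JSON payload
response = (
    "Sure! Here is the JSON:\n"
    '{"main_topics": ["numpy", "arrays"], "difficulty": "beginner"}\n'
    "Hope this helps."
)
default = {"main_topics": [], "difficulty": "intermediate"}

print(parse_llm_json(response, default))
print(parse_llm_json("I could not analyze this video.", default))
```

In the notebook the fallback is built from the title (topics and difficulty inferred by the helper functions), so every video ends up with usable metadata even when the model output cannot be parsed.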

In [None]:
def extract_topics_from_title(title: str) -> List[str]:
    """
    Extract potential topics from video title
    """
    # Remove a leading series number (e.g. "12 - ") and common filler phrases
    cleaned_title = re.sub(r'^\d+\s*-\s*', '', title.lower())
    cleaned_title = re.sub(r'what is|how to|learn|basics of|introduction to', '', cleaned_title)

    # Split into words and remove stop words
    words = re.findall(r'\b[a-z]{3,}\b', cleaned_title)
    stop_words = {'and', 'the', 'for', 'with', 'this', 'that', 'from', 'using', 'your', 'you', 'can', 'will', 'are'}
    words = [w for w in words if w not in stop_words]

    # Try to extract noun phrases (2-3 word combinations)
    phrases = []
    for i in range(len(words)-1):
        phrases.append(f"{words[i]} {words[i+1]}")

    for i in range(len(words)-2):
        phrases.append(f"{words[i]} {words[i+1]} {words[i+2]}")

    # Combine words and phrases, with preference to phrases
    topics = []
    if phrases:
        topics.extend(phrases[:2])  # Add top 2 phrases if available

    # Add individual words to reach at least 3 topics
    while len(topics) < 3 and words:
        word = words.pop(0)
        if not any(word in topic for topic in topics):
            topics.append(word)

    # If we still don't have enough topics, add some general ones based on common patterns
    if "python" in title.lower():
        if len(topics) < 3:
            topics.append("python programming")
    if "image" in title.lower():
        if len(topics) < 3:
            topics.append("image processing")

    # Ensure at least one topic
    if not topics:
        topics = ["programming concepts"]

    return topics[:3]  # Return up to 3 topics

def extract_topic_from_title(title: str, position: int = 0) -> str:
    """
    Extract a specific topic from the title based on position
    """
    topics = extract_topics_from_title(title)
    if position < len(topics):
        return topics[position]
    elif topics:
        return topics[0]
    else:
        return "programming concepts"

def infer_difficulty(title: str) -> str:
    """
    Infer difficulty level from title
    (beginner, intermediate, or advanced)
    """
    title_lower = title.lower()

    # Check for explicit indicators
    if any(word in title_lower for word in ['introduction', 'basics', 'beginner', 'what is', 'getting started']):
        return "beginner"
    elif any(word in title_lower for word in ['advanced', 'expert', 'complex', 'mastering']):
        return "advanced"

    # Check for number in series
    number_match = re.search(r'^(\d+)', title_lower)
    if number_match:
        num = int(number_match.group(1))
        if num <= 10:
            return "beginner"
        elif num <= 20:
            return "intermediate"
        else:
            return "advanced"

    # Default
    return "intermediate"


def create_batch_prompts(video_data: pd.DataFrame, batch_size: int = 3) -> List[Tuple[List[int], str]]:
    """
    Create batched prompts for more efficient processing
    """
    batched_prompts = []

    for i in range(0, len(video_data), batch_size):
        batch_indices = list(range(i, min(i + batch_size, len(video_data))))
        batch_data = video_data.iloc[batch_indices]

        # Create prompt for the batch with more explicit instructions
        prompt = "You are an educational content analyzer specialized in extracting structured information. Analyze these videos:\n\n"

        for j, (_, row) in enumerate(batch_data.iterrows()):
            title = row['Video Title (Original)']
            description = row.get('Video Description (Original)', '')

            # Limit description length to avoid exceeding context window
            if description and len(description) > 500:
                description = description[:500] + "..."

            prompt += f"VIDEO {j+1}:\n"
            prompt += f"TITLE: {title}\n"
            prompt += f"DESCRIPTION: {description}\n\n"

        prompt += """For EACH video above, you must identify:
1. Main topics: Extract 3-5 key concepts covered in the video
2. Difficulty level: Classify as exactly one of: "beginner", "intermediate", or "advanced"
3. Prerequisites: List 1-3 concepts a viewer should understand before watching
4. Learning outcomes: List 2-4 specific skills or knowledge the viewer will gain

IMPORTANT: Your response must be a valid JSON array with the following exact structure:
[
  {
    "video": 1,
    "main_topics": ["topic1", "topic2", "topic3"],
    "difficulty": "beginner",
    "prerequisites": ["prereq1", "prereq2"],
    "learning_outcomes": ["outcome1", "outcome2"]
  },
  {
    "video": 2,
    "main_topics": ["topic1", "topic2", "topic3"],
    "difficulty": "intermediate",
    "prerequisites": ["prereq1", "prereq2"],
    "learning_outcomes": ["outcome1", "outcome2"]
  }
]

Make sure you include all videos (numbered 1 through """ + str(len(batch_data)) + """) and all required fields.
Always maintain valid JSON structure. Your entire response must be a valid JSON array.
"""

        batched_prompts.append((batch_indices, prompt))

    return batched_prompts


def extract_video_entities(llm, video_data: pd.DataFrame, process_in_batches: bool = False, batch_size: int = 3, show_first_video_llm_output: bool = False) -> Dict[int, Dict]:
    """
    Extract entities from video data using LLM, with option for individual or batch processing
    """
    results = {}

    if process_in_batches:
        # Process in batches (less reliable but potentially faster)
        batched_prompts = create_batch_prompts(video_data, batch_size)
        results = extract_entities_batch(llm, video_data, batched_prompts)
    else:
        # Process videos one by one (more reliable)
        total_videos = len(video_data)

        for idx in tqdm(range(total_videos), desc="Processing videos"):
            try:
                print(f"\nProcessing video {idx+1}/{total_videos}: {video_data.iloc[idx]['Video Title (Original)'][:30]}...")

                # Create prompt for single video with template for structured completion
                title = video_data.iloc[idx]['Video Title (Original)']
                description = video_data.iloc[idx].get('Video Description (Original)', '')

                # Limit description length for context window
                if len(description) > 500:
                    description = description[:500] + "..."

                # Use a structured prompt with placeholders to help model complete correctly
                prompt = f"""You are an AI trained to extract information from educational videos.

For the following video:
TITLE: {title}
DESCRIPTION: {description}

Complete this JSON template by filling in the information between the brackets.
Do not change the template structure, only replace the text inside [brackets].
When not sure, make your best guess based on the title and description.

{{
  "main_topics": [
    "[topic1]",
    "[topic2]",
    "[topic3]"
  ],
  "difficulty": "[beginner/intermediate/advanced]",
  "prerequisites": [
    "[prerequisite1]",
    "[prerequisite2]"
  ],
  "learning_outcomes": [
    "[outcome1]",
    "[outcome2]",
    "[outcome3]"
  ]
}}

Only provide the completed JSON, no additional text."""

                # Process with LLM
                response = llm(
                    prompt,
                    max_tokens=2000,
                    temperature=0.1,
                    top_p=0.95,
                    stop=["```"]
                )

                response_text = response["choices"][0]["text"].strip()

                # Print LLM output for the first video if requested
                if idx == 0 and show_first_video_llm_output:
                    print("\n==== EXAMPLE LLM OUTPUT FOR FIRST VIDEO ====")
                    print(f"TITLE: {title}")
                    print(f"DESCRIPTION: {description[:100]}...")
                    print("\nLLM RESPONSE:")
                    print(response_text)
                    print("==========================================\n")

                # Attempt to extract JSON
                json_start = response_text.find('{')
                json_end = response_text.rfind('}') + 1

                if json_start != -1 and json_end > json_start:
                    json_str = response_text[json_start:json_end]

                    # Replace placeholder values in the template
                    json_str = json_str.replace("[topic1]", extract_topic_from_title(title, 0))
                    json_str = json_str.replace("[topic2]", extract_topic_from_title(title, 1))
                    json_str = json_str.replace("[topic3]", extract_topic_from_title(title, 2))
                    json_str = json_str.replace("[beginner/intermediate/advanced]", infer_difficulty(title))
                    json_str = json_str.replace("[prerequisite1]", "basic programming knowledge")
                    json_str = json_str.replace("[prerequisite2]", "computer basics")
                    json_str = json_str.replace("[outcome1]", f"understand {extract_topic_from_title(title, 0)}")
                    json_str = json_str.replace("[outcome2]", f"apply {extract_topic_from_title(title, 0)} techniques")
                    json_str = json_str.replace("[outcome3]", "solve related problems")

                    # Clean up any remaining placeholders
                    json_str = re.sub(r'\["?\[.*?\]"?\]', '[]', json_str)
                    # Match only quoted "[placeholder]" strings so that real single-line JSON arrays are left intact
                    json_str = re.sub(r'"\[[^\]"]*\]"', '""', json_str)

                    try:
                        # Try to parse the cleaned JSON
                        video_result = json.loads(json_str)

                        # Process the result and remove any remaining brackets
                        main_topics = [topic.replace("[", "").replace("]", "") for topic in video_result.get("main_topics", [])]
                        main_topics = [topic for topic in main_topics if topic and not topic.startswith('[') and not topic.endswith(']')]

                        # Accept only the three allowed labels; otherwise infer the level from the title
                        difficulty = str(video_result.get("difficulty", "")).strip().lower()
                        if difficulty not in ("beginner", "intermediate", "advanced"):
                            difficulty = infer_difficulty(title)

                        prerequisites = [prereq.replace("[", "").replace("]", "") for prereq in video_result.get("prerequisites", [])]
                        prerequisites = [prereq for prereq in prerequisites if prereq and not prereq.startswith('[') and not prereq.endswith(']')]

                        learning_outcomes = [outcome.replace("[", "").replace("]", "") for outcome in video_result.get("learning_outcomes", [])]
                        learning_outcomes = [outcome for outcome in learning_outcomes if outcome and not outcome.startswith('[') and not outcome.endswith(']')]

                        # Store the cleaned results
                        results[idx] = {
                            "main_topics": main_topics if main_topics else extract_topics_from_title(title),
                            "difficulty": difficulty,
                            "prerequisites": prerequisites if prerequisites else ["basic programming knowledge"],
                            "learning_outcomes": learning_outcomes if learning_outcomes else [f"understand {extract_topic_from_title(title, 0)}"]
                        }

                        print(f"Successfully processed video {idx}")

                    except json.JSONDecodeError:
                        print(f"Invalid JSON for video {idx}, using title-based extraction")
                        results[idx] = {
                            "main_topics": extract_topics_from_title(title),
                            "difficulty": infer_difficulty(title),
                            "prerequisites": ["basic programming knowledge"],
                            "learning_outcomes": [f"understand {extract_topic_from_title(title, 0)}"]
                        }

                else:
                    print(f"No JSON found in response for video {idx}")
                    print(f"Response: {response_text[:20]}...")

                    # Extract from title directly
                    results[idx] = {
                        "main_topics": extract_topics_from_title(title),
                        "difficulty": infer_difficulty(title),
                        "prerequisites": ["basic programming knowledge"],
                        "learning_outcomes": [f"understand {extract_topic_from_title(title, 0)}"]
                    }

            except Exception as e:
                print(f"Error processing video {idx}: {str(e)}")

                # Add default values based on title analysis
                title = video_data.iloc[idx]['Video Title (Original)']
                results[idx] = {
                    "main_topics": extract_topics_from_title(title),
                    "difficulty": infer_difficulty(title),
                    "prerequisites": ["basic programming knowledge"],
                    "learning_outcomes": [f"understand {extract_topic_from_title(title, 0)}"]
                }

    return results

### 4. Relationship Extraction

After identifying individual video entities, we need to understand how concepts relate to each other. The relationship extraction function uses the LLM to identify meaningful connections between topics, such as when one concept is a prerequisite for another or when concepts build upon each other. We then standardize these relationship types to ensure consistency in our knowledge graph.
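
Because the relationships come from free-form LLM output, it can help to validate them before they reach the graph. The sketch below is one possible approach, not the notebook's own code: `clean_relationships` is a hypothetical helper, the sample data is invented, and normalizing unknown types to `related_to` and clamping strength to the 0.1 to 1.0 range requested in the prompt are assumptions.

```python
ALLOWED_TYPES = {"prerequisite_for", "builds_upon", "related_to", "applies"}

def clean_relationships(raw, known_topics):
    """Keep only well-formed relationships between known topics."""
    cleaned = []
    for rel in raw.get("relationships", []):
        if not isinstance(rel, dict):
            continue
        source, target = rel.get("source"), rel.get("target")
        if not isinstance(source, str) or not isinstance(target, str):
            continue
        # Drop self-loops and topics the LLM invented
        if source == target or source not in known_topics or target not in known_topics:
            continue
        rel_type = str(rel.get("type", "")).strip().lower().replace(" ", "_")
        if rel_type not in ALLOWED_TYPES:
            rel_type = "related_to"  # assumption: unknown types become generic links
        try:
            strength = float(rel.get("strength", 0.5))
        except (TypeError, ValueError):
            strength = 0.5
        strength = min(1.0, max(0.1, strength))  # clamp to the requested range
        cleaned.append({"source": source, "target": target,
                        "type": rel_type, "strength": strength})
    return cleaned

# Invented LLM output with typical problems
raw = {"relationships": [
    {"source": "python basics", "target": "image processing",
     "type": "Prerequisite For", "strength": 7},
    {"source": "python basics", "target": "quantum computing",
     "type": "related_to", "strength": 0.5},
    {"source": "image processing", "target": "segmentation",
     "type": "inspires", "strength": "0.4"},
]}
known = {"python basics", "image processing", "segmentation"}
cleaned = clean_relationships(raw, known)
print(cleaned)
```

Here the out-of-range strength is clamped, the relationship to an unknown topic is dropped, and the unrecognized type falls back to a generic link.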

In [None]:
def extract_relationships_with_llm(llm, video_entities: Dict[int, Dict], video_data: pd.DataFrame) -> Dict[str, List[Dict]]:
    """
    Extract relationships between concepts using LLM
    """
    # Collect all unique topics
    all_topics = set()
    for video_id, data in video_entities.items():
        all_topics.update(data.get("main_topics", []))
        all_topics.update(data.get("prerequisites", []))

    # Convert to a sorted list (so truncation is reproducible) and limit the number to avoid excessive token usage
    topic_list = sorted(all_topics)
    if len(topic_list) > 30:
        print(f"Limiting from {len(topic_list)} to 30 topics for relationship analysis")
        topic_list = topic_list[:30]

    # Create prompt for relationship extraction
    prompt = f"""You are an expert in educational content organization.

I have extracted the following topics from a series of educational videos:
{', '.join(topic_list)}

Please identify meaningful relationships between these topics. For each relationship, specify:
1. The source topic
2. The target topic
3. The relationship type (prerequisite_for, builds_upon, related_to, applies)
4. The strength of the relationship (a float between 0.1 and 1.0)

Only include relationships that truly exist. Not every topic needs to be connected to others.

Format your response as a JSON object:
{{
  "relationships": [
    {{
      "source": "topic1",
      "target": "topic2",
      "type": "prerequisite_for",
      "strength": 0.9
    }},
    ...
  ]
}}
"""

    # Call LLM
    response = llm(prompt, max_tokens=4000, temperature=0.2)
    response_text = response["choices"][0]["text"]

    # Extract JSON
    json_start = response_text.find('{')
    json_end = response_text.rfind('}') + 1

    if json_start == -1 or json_end == 0:
        print("Failed to get valid JSON for relationships")
        return {"relationships": []}

    json_str = response_text[json_start:json_end]

    try:
        relationships_data = json.loads(json_str)
        return relationships_data
    except json.JSONDecodeError:
        print("Invalid JSON for relationships")
        return {"relationships": []}


def map_relationship_type(rel_type: str) -> str:
    """
    Map relationship types from LLM to standard edge types
    """
    rel_type = rel_type.lower()

    if rel_type in ['prerequisite_for', 'prerequisite']:
        return 'prerequisite_for'
    elif rel_type in ['builds_upon', 'builds on', 'extends']:
        return 'builds_upon'
    elif rel_type in ['related_to', 'related']:
        return 'related'
    elif rel_type in ['applies', 'uses', 'implements']:
        return 'applies'
    else:
        return 'related'

### 5. Knowledge Graph Construction

Now we build the actual knowledge graph structure using NetworkX. The main graph building function creates nodes for each video with all extracted metadata as attributes. Then we add connections between videos based on shared topics, prerequisites, and the concept relationships identified in the previous step. This creates a rich network of interconnected educational content.

**Note about adding video connections**

How we connect videos is one of the most important design decisions in the knowledge graph. The `add_video_connections` function creates three types of connections. First, it finds topics shared by two videos and weights the edge by their Jaccard similarity; when the difficulty levels differ, the easier video becomes a `prerequisite_for` the harder one, otherwise the two are marked `related`. Second, it checks whether topics covered in one video match prerequisites listed in another and adds strong "explicit_prerequisite" connections. Finally, it applies the concept relationships identified by the LLM to further strengthen connections between videos covering related concepts. The resulting network reflects both content similarity and logical learning progressions.
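
The first connection type can be tried in isolation on two hand-written videos. This is a simplified sketch of the shared-topic rule only; the video dictionaries are invented, and `jaccard` and `propose_edge` are hypothetical helpers rather than functions from the notebook.

```python
DIFFICULTY_ORDER = ["beginner", "intermediate", "advanced"]

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def propose_edge(video_a, video_b, min_similarity=0.1):
    """Return (source, target, edge_type, weight), or None if too dissimilar."""
    similarity = jaccard(set(video_a["topics"]), set(video_b["topics"]))
    if similarity < min_similarity:
        return None
    rank_a = DIFFICULTY_ORDER.index(video_a["difficulty"])
    rank_b = DIFFICULTY_ORDER.index(video_b["difficulty"])
    if rank_a < rank_b:   # a is easier, so it comes first
        return (video_a["id"], video_b["id"], "prerequisite_for", similarity)
    if rank_a > rank_b:   # b is easier, so the direction flips
        return (video_b["id"], video_a["id"], "prerequisite_for", similarity)
    return (video_a["id"], video_b["id"], "related", similarity)

# Invented example videos
intro = {"id": 0, "topics": ["numpy", "arrays", "python"], "difficulty": "beginner"}
deep = {"id": 1, "topics": ["arrays", "python", "image processing"], "difficulty": "advanced"}
other = {"id": 2, "topics": ["sql"], "difficulty": "beginner"}

print(propose_edge(intro, deep))
print(propose_edge(intro, other))
```

The two videos share 2 of 4 distinct topics, so the edge weight is 0.5 and it points from the beginner video to the advanced one; videos with no shared topics get no edge.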

In [None]:
#Note: We are only adding videos as nodes. Ideally you also add topic nodes.
def build_llm_knowledge_graph(video_data: pd.DataFrame, video_entities: Dict[int, Dict], relationships: Dict) -> nx.DiGraph:
    """
    Build knowledge graph with LLM-extracted entities and relationships
    """
    print("Building knowledge graph...")
    G = nx.DiGraph()

    # Add only video nodes - no separate topic nodes
    for idx, row in video_data.iterrows():
        if idx not in video_entities:
            continue

        entities = video_entities[idx]

        # Add video node with all metadata as attributes
        G.add_node(
            idx,
            title=row['Video Title (Original)'],
            description=row.get('Video Description (Original)', ''),
            difficulty=entities.get('difficulty', 'intermediate').lower(),  # Ensure lowercase
            duration=row.get('Approx Duration (ms)', 0),
            duration_minutes=row.get('duration_minutes', 0),
            topics=entities.get('main_topics', []),
            prerequisites=entities.get('prerequisites', []),
            learning_outcomes=entities.get('learning_outcomes', []),
            node_type='video'  # Keep track that this is a video node
        )

    # Add connections between videos based on shared topics and prerequisites
    add_video_connections(G, relationships)  #This function is defined below.

    print(f"Knowledge graph built with {len(G.nodes())} nodes and {len(G.edges())} edges")
    return G


def add_video_connections(G: nx.DiGraph, relationships: Dict) -> None:
    """
    Add connections between videos based on shared topics, prerequisites, and difficulty
    """
    video_nodes = list(G.nodes())

    # Define standard difficulty levels for comparison
    diff_levels = ['beginner', 'intermediate', 'advanced']

    # First pass: Calculate topic similarity between videos
    for i, v1 in enumerate(video_nodes):
        for v2 in video_nodes[i+1:]:
            v1_data = G.nodes[v1]
            v2_data = G.nodes[v2]

            # Calculate shared topics
            v1_topics = set(v1_data.get('topics', []))
            v2_topics = set(v2_data.get('topics', []))
            common_topics = v1_topics.intersection(v2_topics)

            if not common_topics:
                continue

            # Calculate similarity score (Jaccard similarity)
            similarity = len(common_topics) / len(v1_topics.union(v2_topics))

            # Only connect if there's significant similarity
            if similarity < 0.1:
                continue

            # Get difficulty levels (standardized to lowercase)
            v1_diff = v1_data.get('difficulty', 'intermediate').lower()
            v2_diff = v2_data.get('difficulty', 'intermediate').lower()

            # Map to standard difficulties
            if v1_diff not in diff_levels:
                v1_diff = 'intermediate'
            if v2_diff not in diff_levels:
                v2_diff = 'intermediate'

            v1_diff_idx = diff_levels.index(v1_diff)
            v2_diff_idx = diff_levels.index(v2_diff)

            # Determine relationship type based on difficulty
            if v1_diff_idx < v2_diff_idx:
                # v1 is easier than v2
                G.add_edge(
                    v1, v2,
                    type='prerequisite_for',
                    weight=similarity,
                    shared_topics=list(common_topics)
                )
            elif v1_diff_idx > v2_diff_idx:
                # v2 is easier than v1
                G.add_edge(
                    v2, v1,
                    type='prerequisite_for',
                    weight=similarity,
                    shared_topics=list(common_topics)
                )
            else:
                # Same difficulty - connect based on relationship strength
                G.add_edge(
                    v1, v2,
                    type='related',
                    weight=similarity,
                    shared_topics=list(common_topics)
                )

    # Second pass: Connect based on explicit prerequisites
    for v1 in video_nodes:
        v1_data = G.nodes[v1]
        v1_prereqs = set(v1_data.get('prerequisites', []))

        for v2 in video_nodes:
            if v1 == v2:
                continue

            v2_data = G.nodes[v2]
            v2_topics = set(v2_data.get('topics', []))

            # If any of v1's prerequisites are in v2's topics, v2 is a prerequisite for v1
            prereq_matches = v1_prereqs.intersection(v2_topics)
            if prereq_matches:
                # Strength based on how many prerequisites match
                strength = len(prereq_matches) / len(v1_prereqs) if v1_prereqs else 0.5

                G.add_edge(
                    v2, v1,
                    type='explicit_prerequisite',
                    weight=min(1.0, strength + 0.2),  # Boost explicit prerequisites
                    matched_prereqs=list(prereq_matches)
                )

    # Apply concept relationships from LLM to strengthen existing edges
    for rel in relationships.get('relationships', []):
        source_concept = rel.get('source', '').lower()
        target_concept = rel.get('target', '').lower()
        rel_type = rel.get('type', '')
        rel_strength = rel.get('strength', 0.5)

        # Find videos containing these concepts
        source_videos = []
        target_videos = []

        for node in video_nodes:
            node_topics = [t.lower() for t in G.nodes[node].get('topics', [])]

            if any(source_concept in topic for topic in node_topics):
                source_videos.append(node)

            if any(target_concept in topic for topic in node_topics):
                target_videos.append(node)

        # Connect videos based on concept relationships
        for s_vid in source_videos:
            for t_vid in target_videos:
                if s_vid != t_vid:
                    # Check if edge already exists
                    if G.has_edge(s_vid, t_vid):
                        # Update weight if the new relationship is stronger
                        current_weight = G.edges[s_vid, t_vid]['weight']
                        if rel_strength > current_weight:
                            G.edges[s_vid, t_vid]['weight'] = rel_strength
                            G.edges[s_vid, t_vid]['type'] = map_relationship_type(rel_type)
                    else:
                        # Add new edge if it doesn't exist
                        G.add_edge(
                            s_vid, t_vid,
                            type=map_relationship_type(rel_type),
                            weight=rel_strength,
                            concept_relationship=f"{source_concept} -> {target_concept}"
                        )
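
As a small illustration of the first pass in `add_video_connections`, the sketch below computes the Jaccard similarity of two topic sets and picks an edge direction from the difficulty levels. The helper names `jaccard` and `edge_for` and the sample topics are invented for this example; the 0.1 threshold and the three difficulty levels are taken from the function above.

```python
DIFF_LEVELS = ['beginner', 'intermediate', 'advanced']

def jaccard(a, b):
    """Jaccard similarity: |A & B| / |A | B| (0.0 when both sets are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def edge_for(v1, d1, v2, d2, similarity, threshold=0.1):
    """Return (source, target, type), or None if similarity is below threshold."""
    if similarity < threshold:
        return None
    # Unknown difficulty labels fall back to 'intermediate'
    i1 = DIFF_LEVELS.index(d1 if d1 in DIFF_LEVELS else 'intermediate')
    i2 = DIFF_LEVELS.index(d2 if d2 in DIFF_LEVELS else 'intermediate')
    if i1 < i2:
        return (v1, v2, 'prerequisite_for')   # v1 is easier, so it comes first
    if i1 > i2:
        return (v2, v1, 'prerequisite_for')   # v2 is easier, so it comes first
    return (v1, v2, 'related')                # same difficulty

sim = jaccard(['python', 'numpy', 'images'], ['numpy', 'images', 'segmentation'])
print(sim)                                              # 2 shared / 4 total = 0.5
print(edge_for('A', 'advanced', 'B', 'beginner', sim))  # ('B', 'A', 'prerequisite_for')
```

The edge always points from the easier video to the harder one, so following edges forward gives a plausible viewing order.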




### 6. Embedding Generation

To enable semantic search, we generate dense vector embeddings for each video using the sentence-transformers model. We create rich text representations that combine all video metadata (title, description, topics, prerequisites, and learning outcomes) to capture the full semantic meaning of each video. These embeddings are then saved to disk for future use, avoiding the need to regenerate them each time.
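
The sketch below mirrors the two ideas in this step, composing one text per video from its metadata and encoding the texts in fixed-size batches, without loading a real model. `FakeEncoder` is a stand-in invented for this example (it returns a trivial two-number vector per text), and `compose_text` and `embed_in_batches` are hypothetical helper names; in the notebook, the model returned by `initialize_embedding_model()` plays the encoder's role.

```python
def compose_text(meta):
    """Join the metadata fields into a single string for embedding."""
    return (
        f"{meta.get('title', '')}. {meta.get('description', '')}. "
        f"Topics: {' '.join(meta.get('topics', []))}. "
        f"Prerequisites: {' '.join(meta.get('prerequisites', []))}. "
        f"Learning outcomes: {' '.join(meta.get('learning_outcomes', []))}"
    )

class FakeEncoder:
    """Hypothetical stand-in for an embedding model with an encode() method."""
    def encode(self, texts):
        # One toy vector per text: [character count, word count]
        return [[float(len(t)), float(len(t.split()))] for t in texts]

def embed_in_batches(model, texts_by_id, batch_size=32):
    """Encode texts batch by batch and return {video_id: vector}."""
    ids = list(texts_by_id)
    embeddings = {}
    for start in range(0, len(ids), batch_size):
        batch_ids = ids[start:start + batch_size]
        vectors = model.encode([texts_by_id[i] for i in batch_ids])
        embeddings.update(zip(batch_ids, vectors))
    return embeddings

videos = {
    0: {'title': 'Intro to Python', 'topics': ['python', 'basics']},
    1: {'title': 'Image thresholding', 'topics': ['images'], 'prerequisites': ['python']},
    2: {'title': 'U-Net segmentation', 'topics': ['deep learning']},
}
texts = {vid: compose_text(meta) for vid, meta in videos.items()}
embeddings = embed_in_batches(FakeEncoder(), texts, batch_size=2)
print(len(embeddings))  # one vector per video
```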

In [None]:

def generate_video_embeddings(embedding_model, G):
    """
    Generate embeddings for all videos in the knowledge graph
    """
    import numpy as np
    from tqdm.notebook import tqdm

    print("Generating embeddings for videos...")
    video_embeddings = {}

    # Create texts for embedding
    embedding_texts = {}
    for node_id in G.nodes():
        node_data = G.nodes[node_id]
        if node_data.get('node_type') == 'video':
            # Create rich text representation including all metadata
            title = node_data.get('title', '')
            description = node_data.get('description', '')
            topics = ' '.join(node_data.get('topics', []))
            prerequisites = ' '.join(node_data.get('prerequisites', []))
            outcomes = ' '.join(node_data.get('learning_outcomes', []))

            # Combine all text data for richer embedding
            text_for_embedding = f"{title}. {description}. Topics: {topics}. Prerequisites: {prerequisites}. Learning outcomes: {outcomes}"
            embedding_texts[node_id] = text_for_embedding

    # Generate embeddings in batches to avoid memory issues
    batch_size = 32
    node_ids = list(embedding_texts.keys())

    for i in tqdm(range(0, len(node_ids), batch_size)):
        batch_ids = node_ids[i:i+batch_size]
        batch_texts = [embedding_texts[node_id] for node_id in batch_ids]

        # Generate embeddings
        batch_embeddings = embedding_model.encode(batch_texts)

        # Store embeddings
        for j, node_id in enumerate(batch_ids):
            video_embeddings[node_id] = batch_embeddings[j]

    print(f"Generated embeddings for {len(video_embeddings)} videos")
    return video_embeddings

def save_embeddings(video_embeddings, filepath):
    """
    Save video embeddings to a pickle file
    """
    import os
    import pickle

    # Create the parent directory if the path has one and it doesn't exist
    # (os.makedirs('') raises an error for a bare file name)
    directory = os.path.dirname(filepath)
    if directory:
        os.makedirs(directory, exist_ok=True)

    # Save embeddings
    with open(filepath, 'wb') as f:
        pickle.dump(video_embeddings, f)

    print(f"Embeddings saved to {filepath}")

### 7. Graph Storage

With our knowledge graph built and embeddings generated, we save everything to persistent storage. We save the graph both as a pickle file for easy Python reloading and to an SQLite database for potential integration with other systems. This ensures our processed data is available for future use without having to recreate the graph.
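
SQLite has no list column type, which is why list attributes such as topics are stored as JSON text. The sketch below shows that round trip on an in-memory database; the three-column table and the sample node are simplified for illustration and are not the notebook's full schema.

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database, nothing is written to disk
conn.execute('CREATE TABLE nodes (id TEXT PRIMARY KEY, title TEXT, topics TEXT)')

node = {'id': 12, 'title': 'Intro to Python', 'topics': ['python', 'basics']}
conn.execute(
    'INSERT INTO nodes (id, title, topics) VALUES (?, ?, ?)',
    (str(node['id']), node['title'], json.dumps(node['topics'])),  # list -> JSON text
)
conn.commit()

row = conn.execute('SELECT id, title, topics FROM nodes').fetchone()
restored = {'id': row[0], 'title': row[1], 'topics': json.loads(row[2])}  # JSON text -> list
conn.close()

print(restored)
```

The list survives the round trip, but the ID comes back as the string `'12'` rather than the integer `12`, because it was written with `str()` into a TEXT column.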

In [None]:
def save_graph_pickle(G: nx.DiGraph, filepath: str) -> None:
    """
    Save NetworkX graph as a pickle file for later loading using Python's built-in pickle

    """
    # Create the parent directory if the path has one and it doesn't exist
    directory = os.path.dirname(filepath)
    if directory:
        os.makedirs(directory, exist_ok=True)

    try:
        # Save graph using Python's built-in pickle
        import pickle
        with open(filepath, 'wb') as f:
            pickle.dump(G, f)
        print(f"Knowledge graph saved as pickle to: {filepath}")
    except Exception as e:
        print(f"Error saving graph pickle: {str(e)}")



def save_knowledge_graph_to_db(G: nx.DiGraph, db_path='knowledge_graph.db'):
    """
    Save knowledge graph to SQLite database
    """
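    # Create the parent directory if the path has one and it doesn't exist
    # (the default 'knowledge_graph.db' has no directory part, and os.makedirs('') raises)
    directory = os.path.dirname(db_path)
    if directory:
        os.makedirs(directory, exist_ok=True)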

    print(f"Saving knowledge graph to database: {db_path}")

    # Create or connect to database
    conn = sqlite3.connect(db_path)
    c = conn.cursor()

    # Create tables
    c.execute('DROP TABLE IF EXISTS nodes')
    c.execute('DROP TABLE IF EXISTS edges')

    c.execute('''
    CREATE TABLE nodes (
        id TEXT PRIMARY KEY,
        label TEXT,
        type TEXT,
        title TEXT,
        description TEXT,
        difficulty TEXT,
        duration INTEGER,
        topics TEXT,
        prerequisites TEXT,
        learning_outcomes TEXT
    )
    ''')

    c.execute('''
    CREATE TABLE edges (
        source TEXT,
        target TEXT,
        type TEXT,
        weight REAL,
        PRIMARY KEY (source, target)
    )
    ''')

    # Insert nodes
    for node_id in G.nodes():
        node_data = G.nodes[node_id]

        # Standardize difficulty to lowercase
        difficulty = node_data.get('difficulty', '')
        if difficulty:
            difficulty = difficulty.lower()

        # Convert list attributes to JSON
        topics = json.dumps(node_data.get('topics', []))
        prerequisites = json.dumps(node_data.get('prerequisites', []))
        learning_outcomes = json.dumps(node_data.get('learning_outcomes', []))

        c.execute('''
        INSERT INTO nodes (id, label, type, title, description, difficulty, duration, topics, prerequisites, learning_outcomes)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ''', (
            str(node_id),
            node_data.get('label', ''),
            node_data.get('node_type', ''),
            node_data.get('title', ''),
            node_data.get('description', ''),
            difficulty,
            node_data.get('duration', 0),
            topics,
            prerequisites,
            learning_outcomes
        ))

    # Insert edges
    for source, target, data in G.edges(data=True):
        c.execute('''
        INSERT INTO edges (source, target, type, weight)
        VALUES (?, ?, ?, ?)
        ''', (
            str(source),
            str(target),
            data.get('type', ''),
            data.get('weight', 0.0)
        ))

    # Commit and close
    conn.commit()
    conn.close()

    print(f"Knowledge graph saved to database: {db_path} ({len(G.nodes())} nodes, {len(G.edges())} edges)")

    # Return database path for reference
    return db_path

### 8. Visualization Functions

To understand the structure of our knowledge graph, we create interactive visualizations using PyVis. The main visualization shows all videos and their relationships, while topic-specific visualizations focus on videos related to particular subjects. These visual representations help us better understand the connections between educational content.

In [None]:


def visualize_knowledge_graph(G: nx.DiGraph, filename='knowledge_graph.html'):
    """
    Create interactive visualization of the knowledge graph
    """
    output_path = os.path.join(OUTPUT_DIR, 'visualizations', filename)

    # Create directory if it doesn't exist
    os.makedirs(os.path.dirname(output_path), exist_ok=True)

    # Create a PyVis network
    net = Network(height='750px', width='100%', bgcolor='#ffffff',
                 font_color='#000000', directed=True)

    # Set physics layout options
    net.force_atlas_2based()
    net.show_buttons(filter_=['physics'])

    # Color mapping for different difficulties
    color_map = {
        'beginner': '#90EE90',      # light green
        'intermediate': '#ADD8E6',   # light blue
        'advanced': '#FFB6C1'        # light pink
    }

    # Add nodes (only video nodes in this version)
    for node_id in G.nodes():
        node_data = G.nodes[node_id]
        title = node_data.get('title', 'Unknown Video')
        difficulty = node_data.get('difficulty', 'intermediate').lower()
        duration = node_data.get('duration_minutes', 0)
        topics = ', '.join(node_data.get('topics', []))

        # Use default color if difficulty not in map
        if difficulty not in color_map:
            difficulty = 'intermediate'

        hover_text = f"""
        Title: {title}
        Difficulty: {difficulty}
        Duration: {duration:.1f} min
        Topics: {topics}
        """

        net.add_node(
            node_id,
            label=title[:20] + "..." if len(title) > 20 else title,
            title=hover_text,
            color=color_map[difficulty],
            shape='dot',
            size=15
        )

    # Edge color mapping
    edge_colors = {
        'prerequisite_for': '#FF0000',       # red
        'explicit_prerequisite': '#8B0000',  # dark red
        'builds_upon': '#0000FF',            # blue
        'related': '#A0A0A0',                # gray
        'applies': '#008000'                 # green
    }

    # Add edges
    for edge in G.edges(data=True):
        source, target, data = edge
        edge_type = data.get('type', 'related')
        weight = data.get('weight', 0.5)

        # Create descriptive title based on relationship type
        if edge_type == 'prerequisite_for' or edge_type == 'explicit_prerequisite':
            title = f"Watch {G.nodes[source]['title']} before {G.nodes[target]['title']}"
        elif edge_type == 'builds_upon':
            title = f"{G.nodes[target]['title']} builds upon {G.nodes[source]['title']}"
        elif edge_type == 'related':
            title = f"Related videos with shared topics: {', '.join(data.get('shared_topics', []))}"
        else:
            title = f"Type: {edge_type}, Weight: {weight:.2f}"

        net.add_edge(
            source,
            target,
            title=title,
            color=edge_colors.get(edge_type, '#A0A0A0'),
            width=weight * 3,
            arrows='to'
        )

    # Save the network
    net.save_graph(output_path)
    print(f"Knowledge graph visualization saved to {output_path}")

    return output_path


def visualize_topic_subgraph(G: nx.DiGraph, topic: str, filename=None):
    """
    Visualize a subgraph of videos related to a specific topic
    """
    if filename is None:
        filename = f'{topic.replace(" ", "_")}_subgraph.html'

    output_path = os.path.join(OUTPUT_DIR, 'visualizations', filename)

    # Create directory if it doesn't exist
    os.makedirs(os.path.dirname(output_path), exist_ok=True)

    # Find videos related to this topic
    related_videos = []
    for node_id in G.nodes():
        node_data = G.nodes[node_id]
        if topic.lower() in [t.lower() for t in node_data.get('topics', [])]:
            related_videos.append(node_id)

    if not related_videos:
        print(f"No videos found for topic '{topic}'")
        return None

    # Create subgraph with just the related videos and their connections
    subgraph = G.subgraph(related_videos)

    # Create visualization
    net = Network(height='750px', width='100%', bgcolor='#ffffff', directed=True)

    # Color map for difficulties
    color_map = {
        'beginner': '#90EE90',      # light green
        'intermediate': '#ADD8E6',   # light blue
        'advanced': '#FFB6C1'        # light pink
    }

    # Add nodes
    for node_id in subgraph.nodes():
        node_data = subgraph.nodes[node_id]
        title = node_data.get('title', 'Unknown')
        difficulty = node_data.get('difficulty', 'intermediate').lower()

        # Use default color if difficulty not in map
        if difficulty not in color_map:
            difficulty = 'intermediate'

        net.add_node(
            node_id,
            label=title[:20] + "..." if len(title) > 20 else title,
            title=f"Title: {title}\nDifficulty: {difficulty}\nTopics: {', '.join(node_data.get('topics', []))}",
            color=color_map[difficulty]
        )

    # Add edges
    for u, v, data in subgraph.edges(data=True):
        net.add_edge(
            u, v,
            title=f"Type: {data.get('type', 'related')}\nWeight: {data.get('weight', 0.5):.2f}",
            width=data.get('weight', 0.5) * 3
        )

    # Save visualization
    net.save_graph(output_path)
    print(f"Topic subgraph saved to {output_path}")

    return output_path

### 9. Main Graph Construction Process

The main function orchestrates all the previous steps into a complete pipeline. It handles loading data, extracting entities and relationships, building the graph, generating embeddings, saving everything to disk, and creating visualizations. This function serves as the entry point for building the entire knowledge graph system from scratch.

In [None]:


def main(csv_path, model_type="tiny", use_gpu=True, process_in_batches=False, output_dir=None, max_videos=None, show_first_video_llm_output=False, generate_embeddings=True):
    """
    Main execution function

    Parameters:
    -----------
    csv_path : str
        Path to CSV file with video data
    model_type : str
        Type of model to use ('mistral', 'llama', or 'tiny')
    use_gpu : bool
        Whether to use GPU acceleration
    process_in_batches : bool
        Whether to process videos in batches (Default: False - process one at a time)
    output_dir : str
        Custom output directory (defaults to /content/drive/MyDrive/data/knowledge_graph_results if None)
    max_videos : int, optional
        Maximum number of videos to process (processes all if None)
    show_first_video_llm_output : bool
        Whether to print the LLM output for the first video
    generate_embeddings : bool
        Whether to generate and save embeddings for semantic search

    Returns:
    --------
    Tuple[nx.DiGraph, object, object, Dict]
        Knowledge graph, LLM, embedding model, and video embeddings
    """
    # Step 1: Set output directory
    global OUTPUT_DIR
    if output_dir:
        OUTPUT_DIR = output_dir
    else:
        OUTPUT_DIR = "/content/drive/MyDrive/data/knowledge_graph_results"

    # Create output directories (may be redundant with the ones created at the beginning of the file)
    os.makedirs(os.path.join(OUTPUT_DIR, "visualizations"), exist_ok=True)
    os.makedirs(os.path.join(OUTPUT_DIR, "database"), exist_ok=True)
    os.makedirs(os.path.join(OUTPUT_DIR, "graph"), exist_ok=True)
    os.makedirs(os.path.join(OUTPUT_DIR, "embeddings"), exist_ok=True)

    print(f"Results will be saved to: {OUTPUT_DIR}")

    # Step 2: Setup LLM with GPU support
    setup_llm_gpu()

    # Step 3: Initialize LLM
    llm = initialize_llm(model_type=model_type, use_gpu=use_gpu)

    # Step 4: Load video data
    video_data = load_video_data(csv_path)

    # Limit number of videos if specified
    if max_videos is not None and max_videos > 0 and max_videos < len(video_data):
        print(f"Limiting processing to first {max_videos} videos (out of {len(video_data)} total)")
        video_data = video_data.iloc[:max_videos].copy()

    # Step 5: Process videos individually (more reliable) or in batches
    print("Extracting entities from videos...")
    video_entities = extract_video_entities(
        llm,
        video_data,
        process_in_batches=process_in_batches,
        show_first_video_llm_output=show_first_video_llm_output
    )

    # Step 6: Extract relationships between concepts
    print("Extracting relationships between concepts...")
    relationships = extract_relationships_with_llm(llm, video_entities, video_data)

    # Step 7: Build knowledge graph
    G = build_llm_knowledge_graph(video_data, video_entities, relationships)

    # Step 8: Save knowledge graph
    db_path = os.path.join(OUTPUT_DIR, 'database', 'knowledge_graph.db')
    save_knowledge_graph_to_db(G, db_path)

    # Step 9: Save NetworkX graph for direct reloading
    graph_path = os.path.join(OUTPUT_DIR, 'graph', 'knowledge_graph.pickle')
    save_graph_pickle(G, graph_path)

    # Step 10: Generate and save embeddings for semantic search
    embedding_model = None
    video_embeddings = None

    if generate_embeddings:
        try:
            print("Initializing embedding model for semantic search...")
            embedding_model = initialize_embedding_model()

            print("Generating video embeddings...")
            video_embeddings = generate_video_embeddings(embedding_model, G)

            # Save embeddings (save_embeddings prints the destination path)
            embeddings_path = os.path.join(OUTPUT_DIR, 'embeddings', 'video_embeddings.pickle')
            save_embeddings(video_embeddings, embeddings_path)
        except Exception as e:
            print(f"Error generating embeddings: {str(e)}")

    # Step 11: Create visualizations
    visualize_knowledge_graph(G)

    # Step 12: Generate example learning path
    example_goal = "Learn Python for bioimage analysis"
    try:
        learning_path = generate_learning_path(llm, G, example_goal)
        print(f"Example learning path generated for: '{example_goal}'")
    except Exception as e:
        print(f"Error generating example learning path: {str(e)}")

    print("System ready! You can now query the knowledge graph.")
    return G, llm, embedding_model, video_embeddings



### 10. Loading Previously Built System

Once we've built and saved our knowledge graph system, we need functions to load it back. These functions load the graph from pickle or database, load saved embeddings, and initialize the LLM. The comprehensive load_knowledge_graph_system function brings everything together, loading the graph, LLM, and embeddings in one step for convenient querying.
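
One detail to watch when mixing the two storage formats: the SQLite table stores node IDs as text, so a graph reloaded from the database has string IDs, while pickled embeddings keep whatever keys they were generated with. If the original IDs were integers, lookups such as `node_id in G.nodes()` will then miss. The sketch below re-keys an embeddings dictionary to match a set of node IDs by string form; `align_embedding_keys` is a hypothetical helper and the sample data is made up.

```python
def align_embedding_keys(video_embeddings, graph_node_ids):
    """Re-key embeddings so their keys match the graph's node IDs by string form."""
    by_str = {str(node_id): node_id for node_id in graph_node_ids}
    aligned = {}
    for key, vector in video_embeddings.items():
        node_id = by_str.get(str(key))
        if node_id is not None:          # drop embeddings with no matching node
            aligned[node_id] = vector
    return aligned

embeddings = {0: [0.1, 0.2], 1: [0.3, 0.4], 7: [0.5, 0.6]}  # integer keys
db_node_ids = ['0', '1', '2']                               # string IDs, as read from SQLite
aligned = align_embedding_keys(embeddings, db_node_ids)
print(sorted(aligned))
```

Loading the graph from the pickle file avoids the mismatch entirely, since pickle preserves the original ID types.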

In [None]:

def load_graph_pickle(filepath: str) -> nx.DiGraph:
    """
    Load NetworkX graph from a pickle file using Python's built-in pickle

    """
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Graph pickle not found at: {filepath}")

    try:
        # Load graph using Python's built-in pickle
        import pickle
        with open(filepath, 'rb') as f:
            G = pickle.load(f)
        print(f"Knowledge graph loaded from pickle: {filepath}")
        return G
    except Exception as e:
        raise RuntimeError(f"Error loading graph pickle: {str(e)}")


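def load_knowledge_graph_from_db(db_path='knowledge_graph.db'):
    """
    Load knowledge graph from SQLite database

    Note: db_path is a file name resolved relative to OUTPUT_DIR/database.
    Node IDs come back as strings because they were saved with str(), and
    attributes that are not stored in the tables (e.g. duration_minutes on
    nodes, shared_topics on edges) are not restored.
    """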
    db_path = os.path.join(OUTPUT_DIR, 'database', db_path)

    if not os.path.exists(db_path):
        raise FileNotFoundError(f"Database not found: {db_path}")

    # Connect to database
    conn = sqlite3.connect(db_path)
    c = conn.cursor()

    # Create new graph
    G = nx.DiGraph()

    # Load nodes
    c.execute('SELECT * FROM nodes')
    for row in c.fetchall():
        node_id = row[0]

        # Parse JSON fields
        topics = json.loads(row[7]) if row[7] else []
        prerequisites = json.loads(row[8]) if row[8] else []
        learning_outcomes = json.loads(row[9]) if row[9] else []

        # Add node with all attributes
        G.add_node(
            node_id,
            label=row[1],
            node_type=row[2],
            title=row[3],
            description=row[4],
            difficulty=row[5],
            duration=row[6],
            topics=topics,
            prerequisites=prerequisites,
            learning_outcomes=learning_outcomes
        )

    # Load edges
    c.execute('SELECT * FROM edges')
    for row in c.fetchall():
        source = row[0]
        target = row[1]
        edge_type = row[2]
        weight = row[3]

        G.add_edge(
            source,
            target,
            type=edge_type,
            weight=weight
        )

    # Close connection
    conn.close()

    print(f"Knowledge graph loaded from database: {db_path}")
    return G


def load_embeddings(filepath):
    """
    Load video embeddings from a file

    """
    import pickle
    import os

    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Embeddings file not found: {filepath}")

    with open(filepath, 'rb') as f:
        video_embeddings = pickle.load(f)

    print(f"Loaded embeddings for {len(video_embeddings)} videos")
    return video_embeddings


def load_knowledge_graph_system(graph_path, embedding_path=None, model_type="mistral", use_gpu=True):
    """
    Load the knowledge graph, embeddings, and initialize the LLM for querying

    model_type : ('mistral', 'llama', or 'tiny')

    """
    import pickle
    import os

    # Step 1: Load the graph
    print(f"Loading knowledge graph from: {graph_path}")
    with open(graph_path, 'rb') as f:
        G = pickle.load(f)
    print(f"Graph loaded with {len(G.nodes())} nodes and {len(G.edges())} edges")

    # Step 2: Setup LLM
    setup_llm_gpu()  # Make sure GPU support is set up

    # Step 3: Initialize LLM
    llm = initialize_llm(model_type=model_type, use_gpu=use_gpu)
    print(f"LLM ({model_type}) initialized and ready")

    # Step 4: Load embeddings if path provided
    embedding_model = None
    video_embeddings = None

    if embedding_path and os.path.exists(embedding_path):
        try:
            # Initialize embedding model
            print("Initializing embedding model...")
            embedding_model = initialize_embedding_model()

            # Load saved embeddings
            print(f"Loading embeddings from: {embedding_path}")
            with open(embedding_path, 'rb') as f:
                video_embeddings = pickle.load(f)
            print(f"Loaded embeddings for {len(video_embeddings)} videos")
        except Exception as e:
            print(f"Error loading embeddings: {str(e)}")
            print("Will continue without embeddings capability")
    else:
        print("No embedding path provided or file not found. Semantic search will not be available.")

    print("Knowledge graph system loaded and ready for queries!")
    return G, llm, embedding_model, video_embeddings

"""
# Example usage:
graph_path = "/content/drive/MyDrive/data/knowledge_graph_results/graph/knowledge_graph.pickle"
embeddings_path = "/content/drive/MyDrive/data/knowledge_graph_results/embeddings/video_embeddings.pickle"

# Load the system once
G, llm, embedding_model, video_embeddings = load_knowledge_graph_system(
    graph_path=graph_path,
    embedding_path=embeddings_path,
    model_type="mistral",
    use_gpu=True
)
"""


### 11. Semantic Search

With our system loaded, we can now perform semantic searches. The semantic search function finds videos similar to a query by comparing embeddings. Additional functions expand keywords and use the LLM to enrich queries with related concepts. This allows us to find relevant videos even when queries don't exactly match the terms in the video metadata.
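
To make the ranking step concrete, here is a minimal, dependency-free sketch of cosine-similarity search over a handful of toy vectors. The three-dimensional vectors and video names are invented for illustration; real sentence embeddings have hundreds of dimensions, and the notebook uses scikit-learn's `cosine_similarity` rather than the hand-written `cosine` helper below.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), or 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query_vec, video_vecs, top_k=2):
    """Return the top_k (video_id, score) pairs, highest similarity first."""
    scores = {vid: cosine(query_vec, vec) for vid, vec in video_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]

video_vecs = {
    'python_basics': [1.0, 0.0, 0.0],
    'image_filters': [0.0, 1.0, 0.0],
    'python_images': [1.0, 1.0, 0.0],
}
results = rank([1.0, 0.2, 0.0], video_vecs)
for vid, score in results:
    print(f"{vid}: {score:.3f}")
```

Because cosine similarity depends only on the angle between vectors, a long description and a short one can still score as close matches when they point in the same direction.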

In [None]:
def semantic_search(query, embedding_model, video_embeddings, G, top_k=10):
    """
    Search for videos semantically similar to the query
    """
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Generate embedding for the query
    query_embedding = embedding_model.encode([query])[0]

    # Calculate cosine similarity between the query and all videos in one call
    node_ids = list(video_embeddings.keys())
    if not node_ids:
        return []
    embedding_matrix = np.array([video_embeddings[node_id] for node_id in node_ids])
    scores = cosine_similarity([query_embedding], embedding_matrix)[0]
    similarities = dict(zip(node_ids, scores))

    # Sort by similarity and get top_k results
    top_nodes = sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:top_k]

    # Format results
    results = []
    for node_id, similarity in top_nodes:
        if node_id in G.nodes():
            node_data = G.nodes[node_id]
            results.append({
                "id": node_id,
                "title": node_data.get('title', 'Unknown'),
                "difficulty": node_data.get('difficulty', 'intermediate'),
                "topics": node_data.get('topics', []),
                "similarity": float(similarity),
                "duration_minutes": node_data.get('duration_minutes', 0)
            })

    return results

def expand_keywords(query):
    """
    Simple function to expand keywords in a query with related terms
    """
    query = query.lower()
    expanded = []

    # Extract main keywords (words longer than 3 chars, plus the short term 'ai')
    stop_words = ['show', 'find', 'about', 'related', 'videos', 'with']
    main_keywords = [w for w in query.split()
                     if (len(w) > 3 or w == 'ai') and w not in stop_words]
    expanded.extend(main_keywords)

    # Add domain-specific expansions
    for keyword in main_keywords:
        if 'python' in keyword:
            expanded.extend(['programming', 'coding', 'development'])
        if 'image' in keyword:
            expanded.extend(['processing', 'analysis', 'computer vision'])
        if 'bio' in keyword or 'medical' in keyword:
            expanded.extend(['microscopy', 'analysis', 'biology', 'healthcare'])
        # Match 'ai' as a whole word; a substring test would also fire on
        # words such as 'training' or 'explain'
        if 'deep' in keyword or 'learning' in keyword or keyword == 'ai':
            expanded.extend(['neural', 'network', 'machine learning', 'artificial intelligence'])
        if 'data' in keyword:
            expanded.extend(['analysis', 'science', 'visualization'])

    # Remove duplicates while preserving order
    seen = set()
    expanded = [x for x in expanded if not (x in seen or seen.add(x))]

    return expanded



def expand_query_concepts(llm, query, domain="programming and data science"):
    """
    Use LLM to expand query with related concepts
    """
    prompt = f"""You are an expert in {domain}. Analyze this query:
"{query}"

Extract the main concept/topic and identify 3-5 closely related concepts that would help find relevant educational videos.
For example, if someone asks about "deep learning", related concepts might include "neural networks", "TensorFlow", "PyTorch", etc.

Format your response as a JSON object:
{{
  "main_topic": "the main concept",
  "related_concepts": ["concept1", "concept2", "concept3"],
  "expanded_query": "a more detailed query including related concepts"
}}
"""

    try:
        response = llm(prompt, max_tokens=1000, temperature=0.2)
        response_text = response["choices"][0]["text"]

        # Find JSON in response
        import json
        import re

        # Extract JSON
        json_match = re.search(r'\{.*\}', response_text, re.DOTALL)
        if json_match:
            expansion_data = json.loads(json_match.group(0))
            return expansion_data
        else:
            print("Could not extract JSON from query expansion response")
            return {
                "main_topic": query,
                "related_concepts": [],
                "expanded_query": query
            }
    except Exception as e:
        print(f"Error in query expansion: {str(e)}")
        return {
            "main_topic": query,
            "related_concepts": [],
            "expanded_query": query
        }



### 12. Query Processing

Our handle_user_query function provides a three-tier approach to query processing: first attempting semantic search with embeddings, then using LLM-based query understanding if needed, and finally falling back to pattern matching if both fail. The load_and_query function combines loading the graph and running a query in one convenient step.
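
The three-tier idea can be sketched independently of any model: try each strategy in order and return the first one that yields results. In the sketch below, `three_tier_query` and the three toy handlers are hypothetical and only imitate the behavior of each tier; the real function additionally requires the best semantic match to exceed a similarity of 0.4 before accepting the first tier.

```python
def three_tier_query(query, tiers):
    """Try each (name, handler) pair in order; return the first non-empty result."""
    for name, handler in tiers:
        try:
            result = handler(query)
        except Exception as exc:
            print(f"{name} tier failed: {exc}")
            continue
        if result:
            return {'method': name, 'videos': result}
    return {'method': 'none', 'videos': []}

def embedding_tier(query):
    return []  # pretend no video cleared the similarity threshold

def llm_tier(query):
    raise ValueError("could not parse JSON from the model output")

def pattern_tier(query):
    titles = ['Python basics', 'Image segmentation']
    words = query.lower().split()
    return [t for t in titles if any(w in t.lower() for w in words)]

answer = three_tier_query(
    'python tutorials',
    [('embedding', embedding_tier), ('llm', llm_tier), ('pattern', pattern_tier)],
)
print(answer)
```

Here the first tier returns nothing and the second raises, so the answer comes from pattern matching. Ordering the tiers from most to least capable keeps the cheap fallback from masking better results.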

In [None]:

def handle_user_query(llm, G, query, embedding_model=None, video_embeddings=None):
    """
    Enhanced query function using LLM for understanding and embeddings for semantic search
    """
    print(f"Processing query: '{query}'")

    # First attempt: Use semantic search if embeddings are available
    if embedding_model is not None and video_embeddings is not None:
        semantic_results = semantic_search(query, embedding_model, video_embeddings, G, top_k=20)

        # If we find good semantic matches, return them
        if semantic_results and semantic_results[0]['similarity'] > 0.4:
            return {
                "type": "semantic_search",
                "query": query,
                "videos": semantic_results,
                "method": "embedding"
            }

    # Second attempt: Use LLM to understand the query and expand concepts
    if llm is not None:
        try:
            # Parse query with LLM to extract structured information
            prompt = f"""You are an AI assistant helping users find educational videos.
Analyze this query: "{query}"

Extract the following information:
1. Search type: What kind of search is this? (topic exploration, learning path, finding prerequisites, etc.)
2. Main topic: What is the main subject or concept being asked about?
3. Difficulty level: Is a specific difficulty level mentioned? (beginner, intermediate, advanced, or none)
4. Related concepts: What other concepts might be relevant to this query?

Format your response as valid JSON, and ONLY JSON with no additional text:
{{
  "search_type": "topic_exploration",
  "main_topic": "python",
  "difficulty": "beginner",
  "related_concepts": ["programming", "coding", "python basics"]
}}
"""

            response = llm(prompt, max_tokens=1000, temperature=0.1)
            response_text = response["choices"][0]["text"]

            # Extract JSON - improved error handling
            import json
            import re

            # Clean the response text to improve JSON parsing success
            # Remove any text before the first '{' and after the last '}'
            json_match = re.search(r'(\{.*\})', response_text, re.DOTALL)

            if json_match:
                json_str = json_match.group(1)
                # Further clean the JSON string to handle common issues
                json_str = re.sub(r'[\n\r\t]', ' ', json_str)  # Replace newlines and tabs with spaces
                json_str = re.sub(r',\s*([}\]])', r'\1', json_str)  # Remove trailing commas before } or ]

                try:
                    query_info = json.loads(json_str)

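                    # Extract key information (the LLM may return null or "none" for missing fields)
                    main_topic = (query_info.get('main_topic') or '').lower()
                    difficulty = (query_info.get('difficulty') or '').lower()
                    if difficulty in ('none', 'null', 'any'):
                        difficulty = ''  # No specific level requested, so do not filter on difficulty
                    related_concepts = [c.lower() for c in (query_info.get('related_concepts') or [])]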

                    # Find videos matching the criteria
                    matched_videos = []
                    for node_id in G.nodes():
                        node_data = G.nodes[node_id]

                        # Skip if not a video node
                        if node_data.get('node_type') != 'video':
                            continue

                        # Prepare for matching
                        node_topics = [t.lower() for t in node_data.get('topics', [])]
                        node_title = node_data.get('title', '').lower()
                        node_desc = node_data.get('description', '').lower()
                        node_difficulty = node_data.get('difficulty', '').lower()

                        # Match criteria
                        topic_match = False
                        if main_topic:
                            # Check direct topic match
                            topic_match = any(main_topic in t for t in node_topics) or main_topic in node_title or main_topic in node_desc

                            # If no direct match, check for related concepts
                            if not topic_match and related_concepts:
                                for concept in related_concepts:
                                    if any(concept in t for t in node_topics) or concept in node_title or concept in node_desc:
                                        topic_match = True
                                        break
                        else:
                            # If no topic specified, consider it a match
                            topic_match = True

                        # Difficulty match
                        diff_match = not difficulty or node_difficulty == difficulty

                        # If both criteria match, add to results
                        if topic_match and diff_match:
                            # Calculate relevance score
                            relevance = 0.0

                            # Higher score for direct topic matches
                            if main_topic and any(main_topic in t for t in node_topics):
                                relevance += 1.0
                            elif main_topic and main_topic in node_title:
                                relevance += 0.8
                            elif main_topic and main_topic in node_desc:
                                relevance += 0.6

                            # Add smaller scores for related concept matches
                            for concept in related_concepts:
                                if any(concept in t for t in node_topics):
                                    relevance += 0.4
                                elif concept in node_title:
                                    relevance += 0.3
                                elif concept in node_desc:
                                    relevance += 0.2

                            # Add video to results
                            matched_videos.append({
                                "id": node_id,
                                "title": node_data.get('title', ''),
                                "difficulty": node_data.get('difficulty', ''),
                                "duration_minutes": node_data.get('duration_minutes', 0),
                                "topics": node_data.get('topics', []),
                                "relevance": relevance
                            })

                    # Sort by relevance score
                    matched_videos.sort(key=lambda x: x.get('relevance', 0), reverse=True)

                    # Return results
                    if matched_videos:
                        return {
                            "type": query_info.get('search_type', 'topic_exploration'),
                            "topic": main_topic,
                            "difficulty": difficulty,
                            "related_concepts": related_concepts,
                            "videos": matched_videos[:20],  # Limit to top 20
                            "method": "llm"
                        }
                except json.JSONDecodeError as e:
                    print(f"JSON parsing error: {e} in string: {json_str[:50]}...")
            else:
                print("Could not extract JSON from LLM response")

        except Exception as e:
            print(f"Error using LLM for query understanding: {str(e)}")
            # Fall back to pattern matching

    # Third attempt (fallback): Use expanded keyword matching
    expanded_keywords = expand_keywords(query)
    print(f"Using expanded keyword matching with: {', '.join(expanded_keywords)}")

    filtered_videos = []
    for node_id in G.nodes():
        node_data = G.nodes[node_id]

        # Skip if not a valid node
        if not all(k in node_data for k in ['title', 'topics']):
            continue

        node_topics = [t.lower() for t in node_data.get('topics', [])]
        node_title = node_data.get('title', '').lower()
        node_desc = node_data.get('description', '').lower()

        # Match any of the expanded keywords
        match_score = 0
        for keyword in expanded_keywords:
            if any(keyword in t.lower() for t in node_topics):
                match_score += 2  # Higher weight for topic matches
            if keyword in node_title:
                match_score += 1.5  # Medium weight for title matches
            if keyword in node_desc:
                match_score += 1  # Lower weight for description matches

        if match_score > 0:
            filtered_videos.append({
                "id": node_id,
                "title": node_data.get('title', ''),
                "difficulty": node_data.get('difficulty', ''),
                "duration_minutes": node_data.get('duration_minutes', 0),
                "topics": node_data.get('topics', []),
                "match_score": match_score
            })

    # Sort by match score
    filtered_videos.sort(key=lambda x: x.get('match_score', 0), reverse=True)

    # Extract main topic from query for response formatting
    query_words = query.lower().split()
    main_topic = ' '.join([w for w in query_words if len(w) > 3 and w not in ['show', 'find', 'about', 'related', 'videos', 'with']])

    print(f"Found {len(filtered_videos)} matching videos")
    return {
        "type": "keyword_search",
        "topic": main_topic,
        "related_concepts": expanded_keywords[1:],  # Skip the first which is usually the main topic
        "videos": filtered_videos[:20],  # Limit to top 20
        "method": "expanded_keywords"
    }


def load_and_query(graph_path, query, embedding_path=None):
    """
    Load a graph from pickle and run a query on it

    """
    # Load graph
    import pickle
    print(f"Loading graph from: {graph_path}")
    with open(graph_path, 'rb') as f:
        G = pickle.load(f)
    print(f"Graph loaded with {len(G.nodes())} nodes and {len(G.edges())} edges")

    # Load embeddings if available
    embedding_model = None
    video_embeddings = None
    if embedding_path:
        try:
            # Initialize embedding model
            from sentence_transformers import SentenceTransformer
            embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

            # Load embeddings
            with open(embedding_path, 'rb') as f:
                video_embeddings = pickle.load(f)
            print(f"Loaded embeddings for {len(video_embeddings)} videos")
        except Exception as e:
            print(f"Error loading embeddings: {str(e)}")

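    # Run query (no LLM is passed, so only semantic search and keyword matching are used)
    result = handle_user_query(None, G, query, embedding_model, video_embeddings)

    # Print results summary (the keyword fallback returns type "keyword_search")
    if result.get("type") in ("topic_exploration", "keyword_search"):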
        print(f"\nFound {len(result.get('videos', []))} videos about {result.get('topic', 'unknown topic')}")
        for i, video in enumerate(result.get('videos', [])[:5]):  # Show top 5
            print(f"{i+1}. {video.get('title', 'Unknown')} ({video.get('difficulty', 'unknown')})")
    elif result.get("type") == "semantic_search":
        print(f"\nFound {len(result.get('videos', []))} videos semantically related to '{query}'")
        for i, video in enumerate(result.get('videos', [])[:5]):  # Show top 5
            print(f"{i+1}. {video.get('title', 'Unknown')} (similarity: {video.get('similarity', 0):.2f})")

    return result
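
As a worked example of the keyword fallback's scoring, the sketch below applies the same weights as the third tier (2 for a topic match, 1.5 for a title match, 1 for a description match) to a single made-up video. The function name `keyword_match_score` and the sample data are hypothetical.

```python
def keyword_match_score(keywords, topics, title, description):
    """Score one video with the fallback tier's weights: topic 2, title 1.5, description 1."""
    topics = [t.lower() for t in topics]
    title = title.lower()
    description = description.lower()

    score = 0.0
    for keyword in keywords:
        if any(keyword in t for t in topics):
            score += 2.0   # Substring match in any topic
        if keyword in title:
            score += 1.5   # Substring match in the title
        if keyword in description:
            score += 1.0   # Substring match in the description
    return score

score = keyword_match_score(
    keywords=["segmentation", "unet"],
    topics=["Image segmentation", "Deep learning"],
    title="U-Net for semantic segmentation",
    description="Train a unet model in Keras.",
)
# "segmentation": topic (2.0) + title (1.5); "unet": description only (1.0)
print(score)  # 4.5
```

Note that matching is by substring, so "unet" does not match the hyphenated "U-Net" in the title. Spelling variants like this are one reason the semantic tier is tried first.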



### 13. Learning Path Generation

Generating personalized learning paths is one of the system's most useful features. generate_learning_path_query asks the LLM to order the available topics into a logical sequence for a given learning goal, find_videos_for_learning_path collects the videos that cover each topic, format_learning_path assembles the result into a structured output with summary statistics, and visualize_learning_path renders the learning journey as an interactive graph. Together they help users move from beginner to advanced content in a logical progression.
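
Before prompting the LLM, the topic list is trimmed to the entries that share the most words with the goal, so the prompt stays within the context window. A minimal sketch of that word-overlap filter follows; the function name `rank_topics_by_overlap` and the toy topics are hypothetical.

```python
def rank_topics_by_overlap(goal, topics, max_topics):
    """Keep the max_topics topics that share the most words with the goal."""
    goal_words = set(goal.lower().split())
    scored = [(len(goal_words & set(topic.lower().split())), topic) for topic in topics]
    # list.sort is stable, so topics with equal overlap keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [topic for _, topic in scored[:max_topics]]

topics = ["Keras basics", "Image segmentation", "Python basics", "Random forest"]
print(rank_topics_by_overlap("learn image segmentation with python", topics, 2))
# ['Image segmentation', 'Python basics']
```

Whole-word overlap is a coarse relevance signal: a goal that says "segment" would not match a topic named "segmentation", so relevant topics can be cut when the list is long.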

In [None]:

def generate_learning_path_query(llm, goal: str, available_topics: List[str]) -> List[str]:
    """
    Generate a learning path for a specific goal using LLM
    """
    # Limit the number of topics to avoid exceeding context window
    max_topics = 100  # Maximum number of topics to include in prompt

    if len(available_topics) > max_topics:
        print(f"Limiting from {len(available_topics)} to {max_topics} topics for learning path generation")

        # Filter topics by relevance to the goal
        # Simple filtering: check for word overlap with the goal
        goal_words = set(goal.lower().split())

        # Calculate relevance score based on word overlap
        topic_scores = []
        for topic in available_topics:
            topic_words = set(topic.lower().split())
            overlap = len(goal_words.intersection(topic_words))
            topic_scores.append((topic, overlap))

        # Sort topics by relevance score and take top ones
        topic_scores.sort(key=lambda x: x[1], reverse=True)
        filtered_topics = [t[0] for t in topic_scores[:max_topics]]
    else:
        filtered_topics = available_topics

    # Create prompt for learning path generation (more concise version)
    prompt = f"""You are an educational content curator. Create a learning path for:
"{goal}"

Available topics: {', '.join(filtered_topics)}

Order topics in a logical progression from basic to advanced. Include only relevant topics.

Format your response as JSON:
{{
  "learning_path": [
    {{
      "topic": "topic1",
      "reason": "Brief explanation"
    }},
    {{
      "topic": "topic2",
      "reason": "Brief explanation"
    }}
  ]
}}

Select topics from the exact names listed in the available topics.
"""

    # Get response from LLM
    try:
        response = llm(prompt, max_tokens=2048, temperature=0.3)
        response_text = response["choices"][0]["text"]

        # Extract JSON (json is imported locally, as in the other helpers in this notebook)
        import json
        json_start = response_text.find('{')
        json_end = response_text.rfind('}') + 1

        if json_start == -1 or json_end == 0:
            print("Failed to get valid JSON for learning path")
            return []

        json_str = response_text[json_start:json_end]

        try:
            path_data = json.loads(json_str)
            return path_data.get("learning_path", [])
        except json.JSONDecodeError:
            print("Invalid JSON for learning path")
            return []

    except Exception as e:
        print(f"Error generating learning path: {str(e)}")

        # Create a simple fallback path based on the goal
        fallback_path = []
        goal_lower = goal.lower()

        # Add some basic topics based on keywords in the goal
        if "python" in goal_lower:
            fallback_path.append({"topic": "Python basics", "reason": "Fundamental programming concepts"})
        if "image" in goal_lower or "bio" in goal_lower:
            fallback_path.append({"topic": "Image processing", "reason": "Core concepts for working with images"})
        if "analysis" in goal_lower:
            fallback_path.append({"topic": "Data analysis", "reason": "Techniques for analyzing data"})

        return fallback_path




def find_videos_for_learning_path(G: nx.DiGraph, learning_path: List[Dict]) -> List[Dict]:
    """
    Find videos for each topic in the learning path

    """
    path_videos = []

    for path_item in learning_path:
        topic = path_item.get("topic", "")
        reason = path_item.get("reason", "")

        # Find related videos
        related_videos = []

        for node in G.nodes():
            if G.nodes[node].get('node_type') == 'video':
                video_topics = G.nodes[node].get('topics', [])
                if topic in video_topics:
                    # Found a video covering this topic
                    video_data = G.nodes[node]
                    related_videos.append({
                        "id": node,
                        "title": video_data.get('title', 'Unknown Video'),
                        "difficulty": video_data.get('difficulty', 'intermediate'),
                        "duration_minutes": video_data.get('duration_minutes', 0)
                    })

        # Sort videos by difficulty (unrecognized labels are treated as intermediate)
        difficulty_order = {"beginner": 0, "intermediate": 1, "advanced": 2}
        related_videos.sort(key=lambda x: difficulty_order.get(x["difficulty"], 1))

        path_videos.append({
            "topic": topic,
            "reason": reason,
            "videos": related_videos
        })

    return path_videos


def format_learning_path(path_with_videos: List[Dict]) -> Dict:
    """
    Format learning path with videos into a structured output

    """
    # Calculate overall statistics
    total_videos = sum(len(stage["videos"]) for stage in path_with_videos)
    total_duration = sum(
        sum(video["duration_minutes"] for video in stage["videos"])
        for stage in path_with_videos
    )

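    # Count videos by difficulty (labels outside the three known levels get their own bucket)
    difficulty_breakdown = {"beginner": 0, "intermediate": 0, "advanced": 0}
    for stage in path_with_videos:
        for video in stage["videos"]:
            level = video["difficulty"]
            difficulty_breakdown[level] = difficulty_breakdown.get(level, 0) + 1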

    # Format the path
    formatted_path = {
        "total_videos": total_videos,
        "total_duration_minutes": total_duration,
        "difficulty_breakdown": difficulty_breakdown,
        "topics_covered": [stage["topic"] for stage in path_with_videos],
        "stages": []
    }

    # Add detailed stages
    step_counter = 1
    for stage in path_with_videos:
        stage_data = {
            "topic": stage["topic"],
            "reason": stage["reason"],
            "videos": []
        }

        for video in stage["videos"]:
            video_data = {
                "step": step_counter,
                "id": video["id"],
                "title": video["title"],
                "difficulty": video["difficulty"],
                "duration_minutes": video["duration_minutes"]
            }
            stage_data["videos"].append(video_data)
            step_counter += 1

        formatted_path["stages"].append(stage_data)

    return formatted_path


def generate_learning_path(llm, G: nx.DiGraph, goal: str) -> Dict:
    """
    Generate a complete learning path for a specific goal

    """
    print(f"Generating learning path for: '{goal}'")


    # Collect unique topics from video nodes
    available_topics = set()
    for node in G.nodes():
        if G.nodes[node].get('node_type') == 'video':
            available_topics.update(G.nodes[node].get('topics', []))
    available_topics = list(available_topics)

    # Generate learning path using LLM
    learning_path = generate_learning_path_query(llm, goal, available_topics)

    if not learning_path:
        print("Failed to generate learning path")
        return None

    # Find videos for each topic in the path
    path_with_videos = find_videos_for_learning_path(G, learning_path)

    # Format the complete path
    formatted_path = format_learning_path(path_with_videos)

    # Visualize the path
    filename = f"learning_path_{goal.replace(' ', '_').replace('/', '_')}.html"
    viz_path = visualize_learning_path(G, formatted_path, filename)

    # Add visualization path to result
    formatted_path["visualization"] = viz_path

    return formatted_path


def visualize_learning_path(G: nx.DiGraph, path: Dict, filename='learning_path.html'):
    """
    Create interactive visualization of a learning path

    """
    output_path = os.path.join(OUTPUT_DIR, 'visualizations', filename)

    # Create a PyVis network
    net = Network(height='750px', width='100%', bgcolor='#ffffff',
                 directed=True)

    # Set physics layout options
    net.set_options("""
    {
      "physics": {
        "hierarchicalRepulsion": {
          "centralGravity": 0.0,
          "springLength": 100,
          "springConstant": 0.01,
          "nodeDistance": 120
        },
        "solver": "hierarchicalRepulsion",
        "stabilization": {
          "iterations": 100
        }
      },
      "layout": {
        "hierarchical": {
          "enabled": true,
          "direction": "LR",
          "sortMethod": "directed",
          "levelSeparation": 150
        }
      }
    }
    """)

    # Color mapping
    color_map = {
        'beginner': '#90EE90',      # light green
        'intermediate': '#ADD8E6',   # light blue
        'advanced': '#FFB6C1',       # light pink
        'topic': '#FFA500'          # orange
    }

    # Add all nodes and edges
    nodes_added = set()

    # First add topic nodes
    for i, topic in enumerate(path["topics_covered"]):
        topic_id = f"topic_{i}"
        net.add_node(
            topic_id,
            label=topic,
            title=f"Stage {i+1}: {topic}",
            color=color_map['topic'],
            shape='diamond',
            size=20,
            level=i  # For hierarchical layout
        )
        nodes_added.add(topic_id)

    # Add video nodes connected to topics
    for i, stage in enumerate(path["stages"]):
        topic_id = f"topic_{i}"

        # Add videos for this stage
        for j, video in enumerate(stage["videos"]):
            video_node_id = video["id"]

            # Skip if already added
            if video_node_id in nodes_added:
                continue

            # Add video node
            net.add_node(
                video_node_id,
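                label=f"{video['step']}. {video['title'][:20]}" + ("..." if len(video['title']) > 20 else ""),
                title=f"Step {video['step']}: {video['title']} ({video['duration_minutes']:.1f} min)",
                color=color_map.get(video["difficulty"], color_map['intermediate']),  # Unknown labels use the intermediate color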
                shape='dot',
                size=15,
                level=i  # Same level as its topic
            )
            nodes_added.add(video_node_id)

            # Connect topic to video
            net.add_edge(
                topic_id,
                video_node_id,
                width=2,
                arrows='to'
            )

    # Add edges between topics to show progression
    for i in range(len(path["topics_covered"]) - 1):
        topic_id1 = f"topic_{i}"
        topic_id2 = f"topic_{i+1}"

        net.add_edge(
            topic_id1,
            topic_id2,
            width=3,
            color="#000000",
            arrows='to'
        )

    # Save the network
    net.save_graph(output_path)
    print(f"Learning path visualization saved to {output_path}")

    return output_path
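
The learning path functions above sort, count, and color videos by their difficulty label. Since those labels come from LLM extraction, they may vary in case or fall outside the three expected levels. One possible way to handle this, sketched under the assumption that unknown labels should default to "intermediate", is to normalize labels once before use. The helper `normalize_difficulty` and the sample records are hypothetical.

```python
from collections import Counter

KNOWN_LEVELS = ("beginner", "intermediate", "advanced")

def normalize_difficulty(label, default="intermediate"):
    """Map a raw difficulty label onto one of the known levels."""
    cleaned = str(label or "").strip().lower()
    return cleaned if cleaned in KNOWN_LEVELS else default

# Hypothetical raw labels as an LLM might produce them
videos = [
    {"difficulty": "Beginner"},
    {"difficulty": "expert"},
    {"difficulty": None},
    {"difficulty": "advanced "},
]

breakdown = Counter(normalize_difficulty(v["difficulty"]) for v in videos)
print(dict(breakdown))
```

Applying the normalization when nodes are added to the graph, rather than in each consumer, would keep sorting, counting, and color lookup consistent.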



### 14. Concept-Based Learning Paths

Our most advanced feature generates concept-based learning paths that focus on key concepts rather than just video sequences. The generate_concept_based_learning_path function uses the LLM to identify the key concepts needed for a particular goal, along with their prerequisites, and then finds up to five videos per concept, using semantic search when embeddings are available and topic or title matching otherwise. The visualize_concept_learning_path function then creates an interactive representation of this concept-based path.
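
The prompt asks the LLM to list concepts so that prerequisites come first, but the model's ordering is not checked. As a hedged sketch of one way to enforce it, the example below reorders concepts with a topological sort from the standard library's `graphlib` module (Python 3.9+). The function name `order_by_prerequisites` and the sample concepts are hypothetical and not part of the notebook.

```python
from graphlib import CycleError, TopologicalSorter

def order_by_prerequisites(concepts):
    """Return concept names ordered so each one follows its prerequisites."""
    names = {c["concept"] for c in concepts}
    # Map each concept to its prerequisites, ignoring any the LLM did not list as concepts
    graph = {
        c["concept"]: {p for p in c.get("prerequisites", []) if p in names}
        for c in concepts
    }
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        # Circular prerequisites: fall back to the order the LLM gave
        return [c["concept"] for c in concepts]

# Hypothetical LLM output listed in the wrong order
concepts = [
    {"concept": "CNNs", "prerequisites": ["Neural networks"]},
    {"concept": "Neural networks", "prerequisites": ["Python"]},
    {"concept": "Python", "prerequisites": []},
]
print(order_by_prerequisites(concepts))
# ['Python', 'Neural networks', 'CNNs']
```

The resulting order could also supply the `level` of each concept node in the visualization, so the left-to-right layout follows the prerequisite chain instead of list position.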

In [None]:


def generate_concept_based_learning_path(llm, G, goal, embedding_model=None, video_embeddings=None):
    """
    Generate a concept-based learning path using semantic similarity
    """
    print(f"Generating concept-based learning path for: '{goal}'")

    # Step 1: Use LLM to extract concepts and their relationships
    prompt = f"""You are an educational content expert. Create a learning path for:
"{goal}"

First, identify the key concepts needed to achieve this goal, in a logical learning progression from fundamental to advanced.
For each concept, provide:
1. A short title (1-3 words)
2. Why this concept is important for the goal
3. What prerequisite concepts should be learned first (if any)

Format your response as a JSON object:
{{
  "concepts": [
    {{
      "concept": "concept1",
      "importance": "Why this concept matters",
      "prerequisites": []
    }},
    {{
      "concept": "concept2",
      "importance": "Why this concept matters",
      "prerequisites": ["concept1"]
    }},
    {{
      "concept": "concept3",
      "importance": "Why this concept matters",
      "prerequisites": ["concept1", "concept2"]
    }}
  ]
}}

Order concepts so prerequisites come before the concepts that require them.
"""

    try:
        # Get LLM response
        response = llm(prompt, max_tokens=2048, temperature=0.3)
        response_text = response["choices"][0]["text"]

        # Extract JSON
        import json
        import re

        # Find JSON in response
        json_match = re.search(r'\{.*\}', response_text, re.DOTALL)
        if not json_match:
            print("Could not extract JSON from learning path response")
            return None

        concepts_data = json.loads(json_match.group(0))
        concepts = concepts_data.get("concepts", [])

        # Step 2: Find videos for each concept using semantic search or concept matching
        concept_videos = []

        for concept in concepts:
            concept_title = concept.get("concept", "")
            importance = concept.get("importance", "")
            prerequisites = concept.get("prerequisites", [])

            # Find videos for this concept
            if embedding_model is not None and video_embeddings is not None:
                # Use semantic search to find videos
                videos = semantic_search(
                    concept_title,
                    embedding_model,
                    video_embeddings,
                    G,
                    top_k=5  # Limit to top 5 videos per concept
                )
            else:
                # Fall back to concept matching
                videos = []
                for node_id in G.nodes():
                    node_data = G.nodes[node_id]
                    if node_data.get('node_type') != 'video':
                        continue

                    # Match by topic or title
                    node_topics = node_data.get('topics', [])
                    node_title = node_data.get('title', '')

                    if any(concept_title.lower() in topic.lower() for topic in node_topics) or concept_title.lower() in node_title.lower():
                        videos.append({
                            "id": node_id,
                            "title": node_title,
                            "difficulty": node_data.get('difficulty', 'intermediate'),
                            "duration_minutes": node_data.get('duration_minutes', 0)
                        })

                # Sort by difficulty (beginner first)
                difficulty_order = {"beginner": 0, "intermediate": 1, "advanced": 2}
                videos = sorted(videos, key=lambda x: difficulty_order.get(x.get("difficulty", "intermediate"), 1))[:5]

            # Add to path
            concept_videos.append({
                "concept": concept_title,
                "importance": importance,
                "prerequisites": prerequisites,
                "videos": videos
            })

        # Step 3: Format learning path
        concept_based_path = {
            "goal": goal,
            "total_concepts": len(concept_videos),
            "total_videos": sum(len(c.get("videos", [])) for c in concept_videos),
            "concepts": concept_videos
        }

        # Calculate total duration
        total_duration = 0
        for concept in concept_videos:
            for video in concept.get("videos", []):
                total_duration += video.get("duration_minutes", 0)

        concept_based_path["total_duration_minutes"] = total_duration

        # Step 4: Create visualization
        filename = f"concept_path_{goal.replace(' ', '_').replace('/', '_')}.html"
        viz_path = visualize_concept_learning_path(G, concept_based_path, filename)
        concept_based_path["visualization"] = viz_path

        return concept_based_path

    except Exception as e:
        print(f"Error generating concept-based learning path: {str(e)}")
        return None



def visualize_concept_learning_path(G, path, filename='concept_learning_path.html'):
    """
    Create interactive visualization of a concept-based learning path
    """
    output_path = os.path.join(OUTPUT_DIR, 'visualizations', filename)

    # Create a PyVis network
    net = Network(height='750px', width='100%', bgcolor='#ffffff', directed=True)

    # Set physics layout options
    net.set_options("""
    {
      "physics": {
        "hierarchicalRepulsion": {
          "centralGravity": 0.0,
          "springLength": 120,
          "springConstant": 0.01,
          "nodeDistance": 150
        },
        "solver": "hierarchicalRepulsion",
        "stabilization": {
          "iterations": 100
        }
      },
      "layout": {
        "hierarchical": {
          "enabled": true,
          "direction": "LR",
          "sortMethod": "directed",
          "levelSeparation": 200
        }
      }
    }
    """)

    # Color mapping
    color_map = {
        'beginner': '#90EE90',      # light green
        'intermediate': '#ADD8E6',   # light blue
        'advanced': '#FFB6C1',       # light pink
        'concept': '#FFA500'        # orange
    }

    # Add all nodes and edges
    nodes_added = set()

    # First add concept nodes
    concepts = path.get("concepts", [])
    for i, concept_data in enumerate(concepts):
        concept = concept_data.get("concept", "")
        importance = concept_data.get("importance", "")

        # Create unique ID for concept node
        concept_id = f"concept_{i}"

        # Add concept node
        net.add_node(
            concept_id,
            label=concept,
            title=f"Concept: {concept}\nImportance: {importance}",
            color=color_map['concept'],
            shape='diamond',
            size=25,
            level=i  # For hierarchical layout
        )
        nodes_added.add(concept_id)

    # Add prerequisite connections between concepts
    for i, concept_data in enumerate(concepts):
        concept_id = f"concept_{i}"
        prerequisites = concept_data.get("prerequisites", [])

        for prereq in prerequisites:
            # Find the prerequisite concept ID
            for j, c in enumerate(concepts):
                if c.get("concept", "") == prereq:
                    prereq_id = f"concept_{j}"

                    # Add edge from prerequisite to this concept
                    net.add_edge(
                        prereq_id,
                        concept_id,
                        width=3,
                        color="#000000",
                        arrows='to',
                        title="Prerequisite"
                    )
                    break

    # Add video nodes for each concept
    for i, concept_data in enumerate(concepts):
        concept_id = f"concept_{i}"
        videos = concept_data.get("videos", [])

        # Add video nodes
        for j, video in enumerate(videos):
            # Use a per-concept node ID so that a video recommended for several
            # concepts is drawn under each of them
            video_id = f"{concept_id}_video_{j}"

            # Get difficulty
            difficulty = video.get("difficulty", "intermediate")
            if difficulty not in color_map:
                difficulty = "intermediate"

            # Add video node
            net.add_node(
                video_id,
                label=video.get("title", "")[:25] + "..." if len(video.get("title", "")) > 25 else video.get("title", ""),
                title=f"{video.get('title', '')}\nDifficulty: {difficulty}\nDuration: {video.get('duration_minutes', 0):.1f} min",
                color=color_map[difficulty],
                shape='dot',
                size=15,
                level=i  # Same level as its concept
            )
            nodes_added.add(video_id)

            # Connect concept to video
            net.add_edge(
                concept_id,
                video_id,
                width=1.5,
                arrows='to',
                title="Teaches"
            )

    # Save the network
    net.save_graph(output_path)
    print(f"Concept learning path visualization saved to {output_path}")

    return output_path






### 15. Building the Knowledge Graph System

In this section, we initialize the full knowledge graph building process. We call the main function to process our educational videos, extract structured information using the LLM, build the knowledge graph, generate embeddings, and save everything to disk. This step handles the end-to-end pipeline from raw CSV data to a fully functional knowledge graph system ready for querying and visualizing.

In [None]:
# Initialize with embedding generation
G, llm, embedding_model, video_embeddings = main(
    "/content/drive/MyDrive/data/video_recommender/combined_videos.csv",
    model_type="mistral",
    generate_embeddings=True,
    max_videos=10,  # Integer to limit how many videos are processed, or None for no limit
    show_first_video_llm_output=True,
    output_dir="/content/drive/MyDrive/data/knowledge_graph_results"
)

# Save paths for reference
graph_path = "/content/drive/MyDrive/data/knowledge_graph_results/graph/knowledge_graph.pickle"
embeddings_path = "/content/drive/MyDrive/data/knowledge_graph_results/embeddings/video_embeddings.pickle"


✅ llama-cpp-python installed with GPU support
LLM setup complete. Ready to initialize model.
Initializing mistral model...
Using GPU acceleration for model inference


AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'llama.context_length': '32768', 'general.na

Testing model...
Model responded in 1.07 seconds.
Loading video data from /content/drive/MyDrive/data/video_recommender/combined_videos.csv...
Loaded 438 videos
Limiting processing to first 10 videos (out of 438 total)
Extracting entities from videos...


Processing videos:   0%|          | 0/10 [00:00<?, ?it/s]


Processing video 1/10: 01 - Why do you need to learn ...


Llama.generate: prefix-match hit



==== EXAMPLE LLM OUTPUT FOR FIRST VIDEO ====
TITLE: 01 - Why do you need to learn programming?
DESCRIPTION: If you are a student or researcher in any field, you'll eventually run into the need to learn to cod...

LLM RESPONSE:
{
  "main_topics": [
    "The importance of learning programming",
    "Benefits of learning programming for students and researchers",
    "Career advancement through programming"
  ],
  "difficulty": "beginner",
  "prerequisites": [],
  "learning_outcomes": [
    "Understanding why learning programming is essential",
    "Recognizing the benefits of programming for personal and professional growth",
    "Identifying potential career opportunities through programming"
  ]
}

Successfully processed video 0

Processing video 2/10: 02 - What is programming?...


Llama.generate: prefix-match hit


Successfully processed video 1

Processing video 3/10: 03 - What is command prompt?...


Llama.generate: prefix-match hit


Successfully processed video 2

Processing video 4/10: 04 - What is a digital image?...


Llama.generate: prefix-match hit


Successfully processed video 3

Processing video 5/10: 05 - What is Python?...


Llama.generate: prefix-match hit


Successfully processed video 4

Processing video 6/10: 06 - Python basics - IDE & ope...


Llama.generate: prefix-match hit


Successfully processed video 5

Processing video 7/10: 07 - Python basics - logical o...


Llama.generate: prefix-match hit


Successfully processed video 6



Llama.generate: prefix-match hit


Successfully processed video 7

Processing video 9/10: 09 - if else elif statements i...


Llama.generate: prefix-match hit


Successfully processed video 8

Processing video 10/10: 10 - lists tuples and dictiona...


Llama.generate: prefix-match hit


Successfully processed video 9
Extracting relationships between concepts...
Limiting from 32 to 30 topics for relationship analysis


Llama.generate: prefix-match hit


Building knowledge graph...
Knowledge graph built with 10 nodes and 7 edges
Saving knowledge graph to database: /content/drive/MyDrive/data/knowledge_graph_results/database/knowledge_graph.db
Knowledge graph saved to database: /content/drive/MyDrive/data/knowledge_graph_results/database/knowledge_graph.db (10 nodes, 7 edges)
Knowledge graph saved as pickle to: /content/drive/MyDrive/data/knowledge_graph_results/graph/knowledge_graph.pickle
Initializing embedding model for semantic search...
Initializing embedding model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



✅ Embedding model loaded successfully
Generating video embeddings...
Generating embeddings for videos...


  0%|          | 0/1 [00:00<?, ?it/s]

Generated embeddings for 10 videos
Embeddings saved to /content/drive/MyDrive/data/knowledge_graph_results/embeddings/video_embeddings.pickle
Embeddings saved to /content/drive/MyDrive/data/knowledge_graph_results/embeddings/video_embeddings.pickle
Knowledge graph visualization saved to /content/drive/MyDrive/data/knowledge_graph_results/visualizations/knowledge_graph.html
Generating learning path for: 'Learn Python for bioimage analysis'


Llama.generate: prefix-match hit


Learning path visualization saved to /content/drive/MyDrive/data/knowledge_graph_results/visualizations/learning_path_Learn_Python_for_bioimage_analysis.html
Example learning path generated for: 'Learn Python for bioimage analysis'
System ready! You can now query the knowledge graph.


### 16. Demonstration of Key Features

After building our knowledge graph system, we demonstrate its key capabilities through a series of examples. We showcase semantic search using embeddings to find conceptually related videos without exact keyword matches, LLM-based query understanding with concept expansion to enrich search results, filtering by difficulty level for targeted learning, and generating personalized concept-based learning paths. These examples highlight the power of combining LLMs and embeddings for educational content discovery.

**Semantic Search with Embeddings**

Our first example demonstrates semantic search: finding videos related to a topic such as U-net without requiring exact keyword matches. Dense vector embeddings capture the meaning of each video's text, so relevant content can be identified by conceptual similarity rather than by shared keywords. This lets users discover useful content even when their terminology differs from the video titles or descriptions. Note that in the run below, which uses the 10-video graph built above, the LLM response fails JSON parsing and the system falls back to keyword matching, which finds no U-net videos among those ten. The same query against the full 438-video graph later in this notebook returns relevant results.
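
As a minimal sketch of the idea, the snippet below ranks videos by cosine similarity between a query vector and per-video vectors. The three-dimensional vectors, the toy titles, and the helper names `cosine_similarity` and `rank_by_similarity` are hypothetical and for illustration only; in the notebook, the vectors come from the embedding model and the actual search function may work differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, video_vecs, top_k=5):
    """Return the top_k (title, score) pairs, highest similarity first."""
    scored = [(title, cosine_similarity(query_vec, vec))
              for title, vec in video_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors (assumption): real embeddings have hundreds of dimensions.
toy_embeddings = {
    "Image segmentation with U-Net": [0.9, 0.1, 0.0],
    "What is Python?": [0.0, 0.2, 0.9],
    "Semantic segmentation basics": [0.8, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

for title, score in rank_by_similarity(toy_query, toy_embeddings, top_k=2):
    print(f"{title} (similarity: {score:.2f})")
```

Because the score depends only on vector direction, a video can rank highly without sharing any literal keyword with the query.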

In [None]:
# Example 1: Semantic search with embeddings
print("\n\n=== Example 1: Semantic Search with Embeddings ===")
print("Searching for videos related to U-net without requiring exact keyword matches...")
result1 = handle_user_query(llm, G, "Videos related to U-net", embedding_model, video_embeddings)
print("\nSemantic search results:")
for i, video in enumerate(result1.get('videos', [])[:5]):
    print(f"{i+1}. {video.get('title', 'Unknown')} (similarity: {video.get('similarity', 0):.2f})")



=== Example 1: Semantic Search with Embeddings ===
Searching for videos related to U-net without requiring exact keyword matches...
Processing query: 'Videos related to U-net'


Llama.generate: prefix-match hit


JSON parsing error: Extra data: line 1 column 186 (char 185) in string: {   "search_type": "learning_path",   "main_topic"...
Using expanded keyword matching with: u-net
Found 0 matching videos

Semantic search results:
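
The `Extra data` error in the output above is what `json.loads` raises when a valid JSON document is followed by additional text. One possible mitigation, sketched here as an assumption and not as the notebook's actual parsing code, is to decode only the first JSON object with `json.JSONDecoder.raw_decode`. The helper name `extract_first_json` and the sample response string are hypothetical.

```python
import json

def extract_first_json(text):
    """Return the first JSON object found in `text`, or None if none can be decoded."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON document starting at `start`
            # and ignores whatever text follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None

# Hypothetical LLM response: valid JSON followed by trailing commentary.
sample_response = (
    '{"search_type": "learning_path", "main_topic": "U-net"} '
    "Here is some extra commentary."
)
parsed = extract_first_json(sample_response)
print(parsed)
```

With a helper like this, trailing text after the JSON object would no longer force a fallback to keyword matching, although malformed JSON would still need separate handling.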


**LLM-based Query with Concept Expansion**

Next, we show how LLM-based query understanding can expand a search with related concepts. For a query about advanced deep learning techniques for image segmentation, the system is intended to identify related concepts such as neural networks and semantic segmentation, so that results can include videos described with different but related terminology. This helps bridge the gap between how users phrase queries and how content is described. Expansion is not guaranteed for every query: in the recorded run against the full graph later in this notebook, the main topic and related concepts came back empty.
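
The effect of concept expansion can be sketched with plain substring matching. In this hypothetical example, the main topic and related concepts are hard-coded where the notebook would obtain them from the LLM, and `expand_and_match` is an illustrative helper, not a function from the notebook. The `topics` key mirrors the node attribute read elsewhere in this notebook.

```python
def expand_and_match(main_topic, related_concepts, videos):
    """Return videos whose topics mention the main topic or any related concept."""
    terms = [main_topic.lower()] + [c.lower() for c in related_concepts]
    matches = []
    for video in videos:
        topics_text = " ".join(video.get("topics", [])).lower()
        hits = [term for term in terms if term in topics_text]
        if hits:
            matches.append({"title": video.get("title", "Unknown"),
                            "matched_terms": hits})
    return matches

# Toy catalog (assumption): titles and topics are made up for illustration.
toy_videos = [
    {"title": "Video A", "topics": ["Semantic segmentation with U-Net"]},
    {"title": "Video B", "topics": ["Python lists and tuples"]},
    {"title": "Video C", "topics": ["Training neural networks"]},
]

results = expand_and_match(
    "image segmentation",
    ["semantic segmentation", "neural networks"],
    toy_videos,
)
for match in results:
    print(match["title"], "matched on", match["matched_terms"])
```

Neither matching video contains the literal phrase "image segmentation"; both are found only through the related concepts, which is the gap that expansion is meant to close.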

In [None]:
# Example 2: LLM-based query with concept expansion
print("\n\n=== Example 2: LLM-based Query with Concept Expansion ===")
print("Demonstrating how the system expands queries with related concepts...")
result2 = handle_user_query(llm, G, "Show me advanced deep learning techniques for image segmentation", embedding_model, video_embeddings)
print("\nLLM query results:")
print(f"Main topic: {result2.get('topic')}")
print(f"Related concepts: {', '.join(result2.get('related_concepts', []))}")
for i, video in enumerate(result2.get('videos', [])[:5]):
    print(f"{i+1}. {video.get('title', 'Unknown')} ({video.get('difficulty', 'unknown')})")

**Filtering Content by Difficulty Level**

The third example filters content by difficulty level so that users can find videos appropriate for their current skill level. Each video carries a difficulty label (beginner, intermediate, or advanced) assigned during LLM extraction, and a query such as "beginner videos on Python" returns content with the matching label. This helps users avoid material that is either too basic or too advanced for their needs.
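
Stripped of the LLM and embedding steps, the filtering itself is a comparison against each video's `difficulty` label. The sketch below assumes videos are dictionaries with `title` and `difficulty` keys, as in the example code in this section; `filter_by_difficulty` and the toy data are hypothetical.

```python
def filter_by_difficulty(videos, level):
    """Keep only videos whose difficulty label equals `level` (case-insensitive)."""
    wanted = level.lower()
    return [v for v in videos
            if str(v.get("difficulty", "unknown")).lower() == wanted]

# Toy data (assumption): labels would normally come from the LLM extraction step.
toy_videos = [
    {"title": "Intro to Python", "difficulty": "beginner"},
    {"title": "U-Net from scratch", "difficulty": "advanced"},
    {"title": "Lists and tuples", "difficulty": "Beginner"},
    {"title": "Untitled lecture"},  # no label, treated as "unknown"
]

beginner_videos = filter_by_difficulty(toy_videos, "beginner")
for i, video in enumerate(beginner_videos):
    print(f"{i+1}. {video['title']} ({video['difficulty']})")
```

Applying the filter after similarity ranking keeps the ranked order intact, since the list comprehension preserves input order.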

In [None]:
# Example 3: Simple query for beginner content
print("\n\n=== Example 3: Simple Query for Beginner Content ===")
print("Filtering videos by difficulty level...")
result3 = handle_user_query(llm, G, "Show me beginner videos on Python", embedding_model, video_embeddings)
print("\nQuery results:")
print(f"Method used: {result3.get('method', 'unknown')}")
for i, video in enumerate(result3.get('videos', [])[:5]):
    print(f"{i+1}. {video.get('title', 'Unknown')} ({video.get('difficulty', 'unknown')})")



=== Example 3: Simple Query for Beginner Content ===
Filtering videos by difficulty level...
Processing query: 'Show me beginner videos on Python'

Query results:
Method used: embedding
1. 10 - lists tuples and dictionaries (beginner)
2. 05 - What is Python? (beginner)
3. 07 - Python basics - logical operators and basic math (beginner)
4. 06 - Python basics - IDE & operators (beginner)
5. 03 - What is command prompt? (beginner)


**Concept-Based Learning Paths**

The most advanced feature generates concept-based learning paths for complex learning goals. For a goal like "Master computer vision for medical image analysis," the system identifies the key concepts a learner needs to understand, arranges them in a logical progression, and finds relevant videos for each concept. The result is a personalized learning path that leads from foundational concepts to advanced applications.
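
To make the shape of the result concrete, the sketch below assembles a summary from an ordered list of concepts. The dictionary keys (`concepts`, `concept`, `videos`, `duration_minutes`, `total_concepts`, `total_videos`, `total_duration_minutes`) mirror those read by the example code in this section. The helper `summarize_concept_path` and the toy data are hypothetical, and the real `generate_concept_based_learning_path` also involves the LLM and embeddings, which are omitted here.

```python
def summarize_concept_path(concepts):
    """Compute totals for an ordered list of concepts, each with its videos."""
    total_videos = sum(len(c.get("videos", [])) for c in concepts)
    total_minutes = sum(
        v.get("duration_minutes", 0)
        for c in concepts
        for v in c.get("videos", [])
    )
    return {
        "concepts": concepts,
        "total_concepts": len(concepts),
        "total_videos": total_videos,
        "total_duration_minutes": total_minutes,
    }

# Toy path (assumption): concepts are already in learning order.
toy_concepts = [
    {"concept": "Image Processing",
     "videos": [{"title": "Video A", "duration_minutes": 12.5},
                {"title": "Video B", "duration_minutes": 20.0}]},
    {"concept": "Image Segmentation",
     "videos": [{"title": "Video C", "duration_minutes": 15.0}]},
]

toy_path = summarize_concept_path(toy_concepts)
print(f"Total concepts: {toy_path.get('total_concepts')}")
print(f"Total videos: {toy_path.get('total_videos')}")
print(f"Total duration: {toy_path.get('total_duration_minutes', 0):.1f} minutes")
```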

In [None]:

# Example 4: Generate concept-based learning path
print("\n\n=== Example 4: Generate Concept-Based Learning Path ===")
print("Creating a personalized learning journey across multiple concepts...")
concept_path = generate_concept_based_learning_path(llm, G, "Master computer vision for medical image analysis", embedding_model, video_embeddings)

# Print concept path summary
print("\nConcept-based learning path summary:")
print(f"Total concepts: {concept_path.get('total_concepts')}")
print(f"Total videos: {concept_path.get('total_videos')}")
print(f"Total duration: {concept_path.get('total_duration_minutes', 0):.1f} minutes")

print("\nConcepts in order:")
for i, concept in enumerate(concept_path.get('concepts', [])):
    print(f"{i+1}. {concept.get('concept')} ({len(concept.get('videos', []))} videos)")
    if i < 3:  # Show videos for first 3 concepts only
        for j, video in enumerate(concept.get('videos', [])[:2]):  # Show only first 2 videos per concept
            print(f"   - {video.get('title')}")

print("\n\n=== Example 5: Loading Pre-built Knowledge Graph System ===")
print("Demonstrating how to load a previously built system...")
print("(Note: In a real scenario, you would run this in a new session after closing the previous one)")

**Loading Pre-built Knowledge Graph Systems**

The final example shows how to load a previously built knowledge graph system without rebuilding it from scratch. This matters in practice because building the knowledge graph is a one-time process, while querying it happens many times. The graph and embeddings are read from disk and the LLM and embedding model are re-initialized, so the system can be reused in a new session without repeating the expensive extraction and building steps.
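
The save-once, load-many pattern can be sketched with the standard `pickle` module. This is an assumption about the storage format based on the `.pickle` file extensions used in this notebook; the helpers `save_pickle` and `load_pickle` and the toy embeddings are hypothetical, and the notebook's own `load_knowledge_graph_system` may do more than this.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Serialize `obj` to `path`, creating parent directories if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load a pickled object. Only unpickle files from sources you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy stand-in (assumption) for the video-id -> embedding mapping.
toy_embeddings = {"video_0": [0.1, 0.2, 0.3], "video_1": [0.4, 0.5, 0.6]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings", "video_embeddings.pickle")
    save_pickle(toy_embeddings, path)   # done once, after building
    restored = load_pickle(path)        # done in every later session

print(f"Loaded embeddings for {len(restored)} videos")
```

Unpickling can execute arbitrary code, so files like these should only be loaded when they come from your own earlier runs.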

In [None]:
# Loading a pre-built knowledge graph system and running the examples

# Define paths to pre-built knowledge graph components (here, files from a full 438-video build)
GRAPH_PATH = "/content/drive/MyDrive/data/knowledge_graph_results/graph/full_knowledge_graph.pickle"
EMBEDDINGS_PATH = "/content/drive/MyDrive/data/knowledge_graph_results/embeddings/full_video_embeddings.pickle"

# Step 1: Load the pre-built knowledge graph system using the existing function
print("\n\n=== Loading Pre-built Knowledge Graph System ===")
G, llm, embedding_model, video_embeddings = load_knowledge_graph_system(
    graph_path=GRAPH_PATH,
    embedding_path=EMBEDDINGS_PATH,
    model_type="mistral",  # Can use "tiny" for faster loading
    use_gpu=True
)

✅ llama-cpp-python installed with GPU support
LLM setup complete. Ready to initialize model.
Initializing mistral model...
Using GPU acceleration for model inference



Testing model...
Model responded in 0.74 seconds.
LLM (mistral) initialized and ready
Initializing embedding model...
Initializing embedding model...
✅ Embedding model loaded successfully
Loading embeddings from: /content/drive/MyDrive/data/knowledge_graph_results/embeddings/full_video_embeddings.pickle
Loaded embeddings for 438 videos
Knowledge graph system loaded and ready for queries!


In [None]:
# Step 2: Display some basic statistics about the graph
print("\n=== Knowledge Graph Statistics ===")
print(f"Total nodes: {len(G.nodes())}")
print(f"Total edges: {len(G.edges())}")

# Count videos by difficulty
difficulty_counts = {'beginner': 0, 'intermediate': 0, 'advanced': 0, 'unknown': 0}
topics = set()

for node_id in G.nodes():
    node_data = G.nodes[node_id]

    # Count by difficulty
    difficulty = node_data.get('difficulty', 'unknown')
    if difficulty in difficulty_counts:
        difficulty_counts[difficulty] += 1
    else:
        difficulty_counts['unknown'] += 1

    # Collect all topics
    topics.update(node_data.get('topics', []))

print(f"Unique topics: {len(topics)}")
print("\nVideos by difficulty:")
for difficulty, count in difficulty_counts.items():
    print(f"  {difficulty}: {count}")


=== Knowledge Graph Statistics ===
Total nodes: 438
Total edges: 3842
Unique topics: 831

Videos by difficulty:
  beginner: 19
  intermediate: 414
  advanced: 2
  unknown: 3
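
The counts above are heavily skewed toward the intermediate label, which is worth keeping in mind when filtering by difficulty. As a quick check, the shares can be computed directly from the printed numbers; the variable names below are illustrative.

```python
from collections import Counter

# Counts copied from the statistics output above.
counts_from_output = Counter(
    {"beginner": 19, "intermediate": 414, "advanced": 2, "unknown": 3}
)

total = sum(counts_from_output.values())
shares = {level: 100 * count / total
          for level, count in counts_from_output.items()}

print(f"Total videos: {total}")
for level, pct in shares.items():
    print(f"  {level}: {pct:.1f}%")
```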


In [None]:
# Step 3: Run the same example queries as before, now against the full graph
print("\n\n=== Example 1: Semantic Search with Embeddings ===")
print("Searching for videos related to U-net without requiring exact keyword matches...")
result1 = handle_user_query(llm, G, "Videos related to U-net", embedding_model, video_embeddings)
print("\nSemantic search results:")
for i, video in enumerate(result1.get('videos', [])[:5]):
    print(f"{i+1}. {video.get('title', 'Unknown')} (similarity: {video.get('similarity', 0):.2f})")

print("\n\n=== Example 2: LLM-based Query with Concept Expansion ===")
print("Demonstrating how the system expands queries with related concepts...")
result2 = handle_user_query(llm, G, "Show me advanced deep learning techniques for image segmentation", embedding_model, video_embeddings)
print("\nLLM query results:")
print(f"Main topic: {result2.get('topic')}")
print(f"Related concepts: {', '.join(result2.get('related_concepts', []))}")
for i, video in enumerate(result2.get('videos', [])[:5]):
    print(f"{i+1}. {video.get('title', 'Unknown')} ({video.get('difficulty', 'unknown')})")

print("\n\n=== Example 3: Simple Query for Beginner Content ===")
print("Filtering videos by difficulty level...")
result3 = handle_user_query(llm, G, "Show me beginner videos on Python", embedding_model, video_embeddings)
print("\nQuery results:")
print(f"Method used: {result3.get('method', 'unknown')}")
for i, video in enumerate(result3.get('videos', [])[:5]):
    print(f"{i+1}. {video.get('title', 'Unknown')} ({video.get('difficulty', 'unknown')})")



=== Example 1: Semantic Search with Embeddings ===
Searching for videos related to U-net without requiring exact keyword matches...
Processing query: 'Videos related to U-net'

Semantic search results:
1. 236 - Pre-training U-net using autoencoders - Part 2 - Generating encoder weights for U-net (similarity: 0.51)
2. 73 - Image Segmentation using U-Net - Part1 (What is U-net?) (similarity: 0.49)
3. 77 - Image Segmentation using U-Net - Part 5 (Understanding the data) (similarity: 0.49)
4. 78 - Image Segmentation using U-Net - Part 6 (Running the code and understanding results) (similarity: 0.48)
5. 226 - U-Net vs Attention U-Net vs Attention Residual U-Net - should you care? (similarity: 0.47)


=== Example 2: LLM-based Query with Concept Expansion ===
Demonstrating how the system expands queries with related concepts...
Processing query: 'Show me advanced deep learning techniques for image segmentation'

LLM query results:
Main topic: None
Related concepts: 
1. 159 - Convolutional f

In [None]:
# Step 4: Generate a concept-based learning path
print("\n\n=== Example 4: Generate Concept-Based Learning Path ===")
print("Creating a personalized learning journey across multiple concepts...")
concept_path = generate_concept_based_learning_path(llm, G, "Master computer vision for scientific image analysis", embedding_model, video_embeddings)

# Print concept path summary
print("\nConcept-based learning path summary:")
print(f"Total concepts: {concept_path.get('total_concepts')}")
print(f"Total videos: {concept_path.get('total_videos')}")
print(f"Total duration: {concept_path.get('total_duration_minutes', 0):.1f} minutes")

print("\nConcepts in order:")
# Get only first 5 concepts
for i, concept in enumerate(concept_path.get('concepts', [])[:5]):
    print(f"{i+1}. {concept.get('concept')} ({len(concept.get('videos', []))} videos)")
    if i < 3:  # Show videos for first 3 concepts only
        for j, video in enumerate(concept.get('videos', [])[:2]):  # Show only first 2 videos per concept
            print(f"   - {video.get('title')}")

print("\nKnowledge graph system loaded and examples completed!")



=== Example 4: Generate Concept-Based Learning Path ===
Creating a personalized learning journey across multiple concepts...
Generating concept-based learning path for: 'Master computer vision for scientific image analysis'


Llama.generate: prefix-match hit


Concept learning path visualization saved to /content/drive/MyDrive/data/knowledge_graph_results/visualizations/concept_path_Master_computer_vision_for_scientific_image_analysis.html

Concept-based learning path summary:
Total concepts: 9
Total videos: 45
Total duration: 810.3 minutes

Concepts in order:
1. Image Processing (5 videos)
   - 19 - image processing using scipy in Python
   - 28 - Thresholding and morphological operations using openCV in Python
2. Pixel Values (5 videos)
   - Python tips and tricks - 8:  Working with RGB (and Hex) masks for semantic segmentation
   - White balancing your pictures using python
3. Image Filters (5 videos)
   - 106 - Image filters using discrete Fourier transform (DFT)
   - 103 - Edge filters for image processing
4. Edge Detection (5 videos)
5. Image Segmentation (5 videos)

Knowledge graph system loaded and examples completed!
