# Code Plagiarism Detection - Phase 2: Interactive Testing
This notebook provides interactive functions to test plagiarism detection using four different approaches.

In [14]:
import os
import json
import numpy as np
import pickle
import re
from pathlib import Path
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

# Embeddings and search
from sentence_transformers import SentenceTransformer
import faiss
from rank_bm25 import BM25Okapi


# LLM - Gemini
import google.generativeai as genai

# Setup
BASE_DIR = Path('.')
DATA_DIR = BASE_DIR / 'data'
INDEX_DIR = BASE_DIR / 'indexes'

# API key from environment
gemini_api_key = os.getenv('GEMINI_API_KEY')
if not gemini_api_key:
    raise ValueError("GEMINI_API_KEY environment variable not set")


genai.configure(api_key=gemini_api_key)
client = genai.GenerativeModel('gemini-2.5-flash')

print("✓ Libraries loaded")

✓ Libraries loaded


## Load Pre-built Indexes
This cell loads all indexes created by 01_indexing.ipynb:

1. FAISS Index (faiss_index.bin)
   - 399 function embeddings
   - Used by: detect_embedding(), detect_rag(), detect_hybrid_rag()

2. Function Metadata (function_metadata.pkl)
   - Function names, code, docstrings, repository info
   - Maps index positions to actual code

3. BM25 Index (bm25_index.pkl)
   - Lexical matching index
   - Used by: detect_hybrid_rag()

4. CodeBERT Model (microsoft/codebert-base)
   - Same model used in Phase 1 for consistent embeddings
   - Used to encode query code

Does not re-index; loads existing artifacts
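
The retrieval helpers later in this notebook L2-normalize the query before searching the FAISS index. Assuming the index was built as an inner-product index over normalized embeddings (this is an assumption, since the index type is set in 01_indexing.ipynb and not shown here), the returned scores are cosine similarities. A minimal numpy-only sketch of that equivalence, using made-up 4-dimensional vectors and hypothetical helper names:

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (numpy analogue of faiss.normalize_L2)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def top_k_cosine(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k corpus rows most similar to the query."""
    q = l2_normalize(query.reshape(1, -1))
    c = l2_normalize(corpus)
    scores = (c @ q.T).ravel()            # inner product of unit vectors = cosine
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return order, scores[order]

# Toy "embeddings", invented for illustration only
corpus = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
], dtype="float32")
query = np.array([2.0, 0.0, 0.0, 0.0], dtype="float32")

indices, scores = top_k_cosine(query, corpus, k=2)
print(indices, scores)
```

Because both sides are normalized, the query's magnitude (2.0 here) does not affect the ranking; only its direction does.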

In [15]:
print("Loading indexes and metadata...")

# Load embedding model
embedding_model = SentenceTransformer('microsoft/codebert-base')

# Load FAISS index
faiss_index = faiss.read_index(str(INDEX_DIR / 'faiss_index.bin'))
embeddings = np.load(INDEX_DIR / 'embeddings.npy')

# Load function metadata
with open(INDEX_DIR / 'function_metadata.pkl', 'rb') as f:
    function_metadata = pickle.load(f)

# Load BM25 index
with open(INDEX_DIR / 'bm25_index.pkl', 'rb') as f:
    bm25_index = pickle.load(f)

with open(INDEX_DIR / 'tokenized_corpus.pkl', 'rb') as f:
    tokenized_corpus = pickle.load(f)

print(f"✓ Loaded FAISS index with {faiss_index.ntotal} vectors")
print(f"✓ Loaded {len(function_metadata)} function metadata entries")
print(f"✓ Loaded BM25 index with {len(tokenized_corpus)} documents")

Loading indexes and metadata...


No sentence-transformers model found with name microsoft/codebert-base. Creating a new one with mean pooling.


✓ Loaded FAISS index with 399 vectors
✓ Loaded 399 function metadata entries
✓ Loaded BM25 index with 399 documents


## Helper Functions
These functions support the four main detection methods:

1. tokenize_code()
   - Tokenizes code for BM25 lexical matching
   - Keeps underscores in identifiers (variable_name stays intact)

2. retrieve_with_embeddings()
   - Semantic search using CodeBERT embeddings
   - Returns top-k similar functions with cosine similarity scores
   - Used by: detect_embedding(), detect_rag()

3. retrieve_with_bm25()
   - Lexical search using BM25 algorithm
   - Returns top-k functions based on keyword overlap
   - Used by: detect_hybrid_rag()

4. hybrid_retrieve()
   - Combines embeddings + BM25 with weighted fusion
   - alpha parameter controls weight (default 0.5 = equal weight)
   - Normalizes and fuses scores from both methods
   - Used by: detect_hybrid_rag()

5. call_llm()
   - Wrapper for Gemini API calls
   - Adds system instruction for consistent LLM behavior
   - Error handling for API failures
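
To make the tokenization behavior concrete, here is a small standalone sketch using the same regular expression as tokenize_code(). The function name `tokenize` and the sample snippet are illustrative only:

```python
import re
from typing import List

def tokenize(code: str) -> List[str]:
    """Lowercase the code and split it into word tokens.

    Word characters include underscores, so snake_case identifiers
    stay whole; operators and punctuation are dropped.
    """
    return re.findall(r"\b\w+\b", code.lower())

snippet = "Total_Sum = compute_total(values) + 1"
tokens = tokenize(snippet)
print(tokens)
```

Since identifiers are kept as single tokens, renaming `total_sum` to something else removes that token from the lexical match entirely, which is one reason BM25 is paired with embeddings later in the notebook.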

In [16]:
def tokenize_code(code: str) -> List[str]:
    """Tokenize code for BM25."""
    tokens = re.findall(r'\b\w+\b', code.lower())
    return tokens

def retrieve_with_embeddings(query_code: str, k: int = 5) -> List[Tuple[Dict, float]]:
    """
    Retrieve top-k similar functions using embedding-based search.
    
    Args:
        query_code: Code snippet to search for
        k: Number of results to return
    
    Returns:
        List of (function_metadata, similarity_score) tuples
    """
    # Encode query
    query_embedding = embedding_model.encode([query_code])
    query_embedding = np.array(query_embedding).astype('float32')
    faiss.normalize_L2(query_embedding)
    
    # Search
    similarities, indices = faiss_index.search(query_embedding, k)
    
    # Prepare results (FAISS pads with index -1 when fewer than k vectors exist)
    results = []
    for idx, sim in zip(indices[0], similarities[0]):
        if idx < 0:
            continue
        results.append((function_metadata[idx], float(sim)))
    
    return results

def retrieve_with_bm25(query_code: str, k: int = 5) -> List[Tuple[Dict, float]]:
    """
    Retrieve top-k similar functions using BM25 lexical search.
    
    Args:
        query_code: Code snippet to search for
        k: Number of results to return
    
    Returns:
        List of (function_metadata, bm25_score) tuples
    """
    # Tokenize query
    query_tokens = tokenize_code(query_code)
    
    # Get BM25 scores
    scores = bm25_index.get_scores(query_tokens)
    
    # Get top-k indices
    top_k_indices = np.argsort(scores)[::-1][:k]
    
    # Prepare results
    results = []
    for idx in top_k_indices:
        results.append((function_metadata[idx], float(scores[idx])))
    
    return results

def hybrid_retrieve(query_code: str, k: int = 5, alpha: float = 0.5) -> List[Tuple[Dict, float]]:
    """
    Retrieve using hybrid approach combining embeddings and BM25.
    
    Args:
        query_code: Code snippet to search for
        k: Number of results to return
        alpha: Weight for embedding scores (1-alpha for BM25)
    
    Returns:
        List of (function_metadata, combined_score) tuples
    """
    # Get results from both methods
    embedding_results = retrieve_with_embeddings(query_code, k=k*2)
    bm25_results = retrieve_with_bm25(query_code, k=k*2)
    
    # Normalize scores
    def normalize_scores(results):
        # Min-max scale to [0, 1]; guard against empty input
        if not results:
            return []
        scores = [score for _, score in results]
        lo, hi = min(scores), max(scores)
        if hi > lo:
            normalized = [(meta, (score - lo) / (hi - lo)) 
                         for meta, score in results]
        else:
            normalized = [(meta, 1.0) for meta, _ in results]
        return normalized
    
    embedding_results = normalize_scores(embedding_results)
    bm25_results = normalize_scores(bm25_results)
    
    # Combine scores
    combined_scores = {}
    
    for meta, score in embedding_results:
        func_id = meta['id']
        combined_scores[func_id] = combined_scores.get(func_id, 0) + alpha * score
    
    for meta, score in bm25_results:
        func_id = meta['id']
        combined_scores[func_id] = combined_scores.get(func_id, 0) + (1 - alpha) * score
    
    # Sort by combined score
    sorted_ids = sorted(combined_scores.keys(), key=lambda x: combined_scores[x], reverse=True)
    
    # Get metadata for top-k
    results = []
    for func_id in sorted_ids[:k]:
        meta = next(f for f in function_metadata if f['id'] == func_id)
        results.append((meta, combined_scores[func_id]))
    
    return results

def call_llm(prompt: str, max_tokens: int = 500) -> str:
    """
    Call Gemini API with prompt.
    
    Args:
        prompt: The prompt to send
        max_tokens: Intended response-length cap (currently not forwarded to the Gemini API)
    
    Returns:
        LLM response text
    """
    try:
        # Add system instruction to the beginning of prompt
        full_prompt = "You are an expert code analyst specializing in plagiarism detection.\n\n" + prompt
        response = client.generate_content(full_prompt)
        return response.text.strip()
    except Exception as e:
        return f"Error calling LLM: {str(e)}"

print("✓ Helper functions defined")

✓ Helper functions defined


## Method 1: Pure Embedding Search
Implementation Strategy:
- Uses ONLY embeddings (no LLM)
- Combines multiple signals for robust detection:
  1. Embedding similarity (semantic understanding)
  2. Lexical similarity (exact text matching via SequenceMatcher)
  3. Code length ratio (filters false positives)

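Detection Logic (checked in this order):
- Early rejection: top embedding score < 0.88 → not plagiarism
- Trivial-code filter: query has fewer than 5 non-blank lines → not plagiarism
- High lexical match (> 0.7) → plagiarism, with the lexical ratio as confidence
- Embedding > 0.95, lexical > 0.4, and line-count ratio > 0.7 → plagiarism
- Embedding > 0.92 and lexical > 0.5 → plagiarism
- Otherwise → not plagiarism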

Threshold Selection Rationale:
- 0.85-0.95 embedding threshold range (tuned for code similarity)
- Lexical similarity helps catch variable-renamed plagiarism
- Length ratio prevents matching snippets from different-sized functions
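
The two non-embedding signals can be tried in isolation. The sketch below computes the SequenceMatcher ratio and the non-blank line-count ratio for a function and a variable-renamed copy; the helper names and the renamed sample are made up for illustration and are not part of the notebook's API:

```python
from difflib import SequenceMatcher

def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] via difflib."""
    return SequenceMatcher(None, a, b).ratio()

def non_blank_lines(code: str) -> int:
    """Count lines that contain something other than whitespace."""
    return len([line for line in code.split("\n") if line.strip()])

def line_ratio(a: str, b: str) -> float:
    """Shorter length divided by longer length (1.0 means equal size)."""
    la, lb = non_blank_lines(a), non_blank_lines(b)
    return min(la, lb) / max(la, lb) if max(la, lb) else 0.0

original = """def calculate_sum(numbers):
    total = 0
    for num in numbers:
        total += num
    return total
"""

# Same logic with every identifier renamed (hypothetical plagiarized copy)
renamed = """def add_all(values):
    acc = 0
    for v in values:
        acc += v
    return acc
"""

print(f"lexical similarity: {lexical_similarity(original, renamed):.2f}")
print(f"line ratio:         {line_ratio(original, renamed):.2f}")
```

Printing both values for a few known pairs from your own corpus is a quick way to sanity-check the thresholds before relying on them.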

Returns:
- is_plagiarized: Boolean
- confidence: 0.0-1.0 score
- best_match: Top matching function from corpus
- top_matches: Top 3 candidates for inspection

In [27]:
def detect_embedding(query_code: str, threshold: float = 0.88, k: int = 5) -> Dict:
    """
    Embedding-based detection with false-positive filtering.

    `threshold` is the early-rejection cutoff on the top embedding score.
    """
    from difflib import SequenceMatcher
    
    results = retrieve_with_embeddings(query_code, k=k)
    
    if not results:
        return _negative_result()
    
    best_match, embedding_score = results[0]
    
    # Early rejection: if the best match is too dissimilar, stop here
    if embedding_score < threshold:  # Adjust this threshold based on your data
        return _negative_result()
    
    lexical_similarity = SequenceMatcher(None, query_code, best_match['code']).ratio()
    query_lines = len([l for l in query_code.split('\n') if l.strip()])
    match_lines = len([l for l in best_match['code'].split('\n') if l.strip()])
    
    # Trivial code filter
    if query_lines < 5:
        return _negative_result()
    
    # High lexical = plagiarism
    if lexical_similarity > 0.7:
        return _positive_result(best_match, lexical_similarity, results)
    
    # High embedding + moderate lexical + similar length
    if embedding_score > 0.95 and lexical_similarity > 0.4:
        line_ratio = min(query_lines, match_lines) / max(query_lines, match_lines)
        if line_ratio > 0.7:
            confidence = (embedding_score * 0.5 + lexical_similarity * 0.5)
            return _positive_result(best_match, confidence, results)
    
    # Medium-high both scores
    if embedding_score > 0.92 and lexical_similarity > 0.5:
        confidence = (lexical_similarity * 0.6 + embedding_score * 0.4)
        return _positive_result(best_match, confidence, results)
    
    # Default: not plagiarism
    return _negative_result()


def _negative_result():
    return {
        'method': 'embedding_search',
        'is_plagiarized': False,
        'confidence': 0.0,
        'best_match': {'id': None, 'name': None, 'repo': None},  # same keys as a positive result
        'top_matches': []
    }


def _positive_result(best_match, confidence, results):
    return {
        'method': 'embedding_search',
        'is_plagiarized': True,
        'confidence': confidence,
        'best_match': {
            'id': best_match['id'],
            'name': best_match['name'],
            'repo': best_match['repo']
        },
        'top_matches': [
            {'id': meta['id'], 'name': meta['name'], 'similarity': score}
            for meta, score in results[:3]
        ]
    }


# Test example
test_code = """
def calculate_sum(numbers):
    total = 0
    for num in numbers:
        total += num
    return total
"""

result = detect_embedding(test_code)
print("\nExample - Pure Embedding Search:")
print(f"Is Plagiarized: {result['is_plagiarized']}")
print(f"Confidence: {result['confidence']:.3f}")

# Safely access best_match
best_match = result['best_match']
if best_match and best_match.get('name'):
    print(f"Best Match: {best_match['name']}")
else:
    print("Best Match: None")


Example - Pure Embedding Search:
Is Plagiarized: True
Confidence: 0.738
Best Match: base_check


## Method 2: Direct LLM Analysis
Implementation Strategy:
- Provides broad context (up to 20 reference functions, each truncated to 500 characters) to the LLM
- LLM analyzes the query against this larger candidate pool
- Tests what LLMs can achieve with broad context vs. the narrower retrieval used in Methods 3 and 4

Process:
1. Pre-retrieve top 20 candidates (reduces context to fit token limits)
2. Build context with reference functions
3. Ask LLM to determine plagiarism with structured output format
4. Parse response into structured result

Prompt Engineering:
- Instructs LLM to consider variable renaming, comment removal, refactoring
- Requires specific output format for consistent parsing:
  PLAGIARIZED: YES/NO
  CONFIDENCE: 0.0-1.0
  MATCH: function number
  REASON: explanation

Robust Parsing:
- Handles LLM output variations
- Fallback to default values if parsing fails
- Normalizes confidence scores (handles 0-1 or 0-100 scales)
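
As a compact illustration of the parsing rules listed above, the following sketch extracts the four fields with regular expressions. It is a simplified alternative to the split-based parsing used in the cells below, not a replacement for it; `parse_verdict` and the sample response are hypothetical:

```python
import re
from typing import Dict

def parse_verdict(response: str) -> Dict[str, object]:
    """Parse the four-field verdict; missing fields keep conservative defaults."""
    result: Dict[str, object] = {
        "is_plagiarized": False,
        "confidence": 0.0,
        "match_num": None,
        "reason": "Unable to parse response",
    }

    m = re.search(r"PLAGIARIZED:\s*\[?\s*(YES|NO)\b", response, re.IGNORECASE)
    if m:
        result["is_plagiarized"] = m.group(1).upper() == "YES"

    m = re.search(r"CONFIDENCE:\s*\[?\s*(\d+(?:\.\d+)?)", response, re.IGNORECASE)
    if m:
        value = float(m.group(1))
        # Treat values above 1 as percentages (e.g. 85 -> 0.85)
        result["confidence"] = value / 100.0 if value > 1.0 else value

    # Only look for a number on the MATCH line itself, never on later lines
    m = re.search(r"MATCH:[^\n\d]*(\d+)", response, re.IGNORECASE)
    if m:
        result["match_num"] = int(m.group(1))

    m = re.search(r"REASON:\s*(.+)", response, re.IGNORECASE)
    if m:
        result["reason"] = m.group(1).strip()

    return result

sample = """PLAGIARIZED: YES
CONFIDENCE: 85
MATCH: Function 2
REASON: Same loop structure with renamed variables."""

parsed = parse_verdict(sample)
print(parsed)
```

Restricting the MATCH pattern to a single line avoids picking up a stray number from the REASON text when the model answers `MATCH: NONE`.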

Trade-offs:
- ✅ Pros: Contextual understanding, catches subtle similarities
- ❌ Cons: Token limits restrict full corpus, slower, API costs

In [18]:
def detect_llm(query_code: str, max_context_functions: int = 20) -> Dict:
    """
    Detect plagiarism using direct LLM analysis with full context.
    
    Args:
        query_code: Code snippet to check
        max_context_functions: Maximum number of reference functions to include
    
    Returns:
        Dictionary containing detection results
    """
    # First, get most relevant functions to reduce context size
    candidates = retrieve_with_embeddings(query_code, k=max_context_functions)
    
    if not candidates:
        return {
            'method': 'direct_llm',
            'is_plagiarized': False,
            'confidence': 0.0,
            'best_match': {'id': None, 'name': None, 'repo': None},
            'reason': 'No candidate functions retrieved',
            'raw_response': ''
        }
    
    # Build context with reference functions
    context = "Reference code functions:\n\n"
    for i, (meta, _) in enumerate(candidates[:max_context_functions]):
        context += f"Function {i+1}: {meta['name']}\n"
        context += f"From: {meta['repo']}\n"
        context += f"{meta['code'][:500]}\n\n"  # Limit length
    
    # Create prompt (call_llm already prepends the system instruction)
    prompt = f"""{context}

Query code to analyze:
{query_code}

Analyze if the query code is plagiarized from any of the reference functions.
Consider variable renaming, comment removal, and minor refactoring as signs of plagiarism.

Respond in this exact format:
PLAGIARIZED: [YES/NO]
CONFIDENCE: [0.0-1.0]
MATCH: [function number or NONE]
REASON: [brief explanation]
"""
    
    # Call LLM
    response = call_llm(prompt, max_tokens=300)
    
    # Default values
    is_plagiarized = False
    confidence = 0.5
    best_match = candidates[0][0] if candidates else None
    reason = "Unable to parse response"
    
    # Parse the response; on any error, keep the defaults above
    try:
        if 'PLAGIARIZED:' in response:
            plagiarized_part = response.split('PLAGIARIZED:')[1].split('\n')[0].upper()
            is_plagiarized = 'YES' in plagiarized_part
        
        if 'CONFIDENCE:' in response:
            confidence_str = response.split('CONFIDENCE:')[1].split('\n')[0].strip()
            conf_match = re.findall(r'\d+\.?\d*', confidence_str)
            if conf_match:
                confidence = float(conf_match[0])
                if confidence > 1.0:
                    confidence = confidence / 100.0
        
        if 'MATCH:' in response:
            match_str = response.split('MATCH:')[1].split('\n')[0].strip()
            if 'NONE' not in match_str.upper():
                match_nums = re.findall(r'\d+', match_str)
                if match_nums:
                    match_num = int(match_nums[0])
                    if 1 <= match_num <= len(candidates):
                        best_match = candidates[match_num - 1][0]
        
        if 'REASON:' in response:
            reason = response.split('REASON:')[1].strip()
    except Exception as e:
        print(f"    Warning: Error parsing LLM response: {str(e)}")
        print(f"    Raw response: {response[:200]}...")
    
    # Ensure best_match is not None
    if best_match is None and candidates:
        best_match = candidates[0][0]
    
    return {
        'method': 'direct_llm',
        'is_plagiarized': is_plagiarized,
        'confidence': confidence,
        'best_match': {
            'id': best_match['id'] if best_match else None,
            'name': best_match['name'] if best_match else None,
            'repo': best_match['repo'] if best_match else None
        } if best_match else {'id': None, 'name': None, 'repo': None},
        'reason': reason,
        'raw_response': response
    }

# Test example
result = detect_llm(test_code)
print("\nExample - Direct LLM Analysis:")
print(f"Is Plagiarized: {result['is_plagiarized']}")
print(f"Confidence: {result['confidence']:.3f}")
print(f"Reason: {result['reason'][:100]}...")


Example - Direct LLM Analysis:
Is Plagiarized: False
Confidence: 1.000
Reason: The query code calculates the sum of numbers in a list, which is a very fundamental and common progr...


## Method 3: Standard RAG
Retrieval-Augmented Generation Architecture:
1. Retrieval Stage: Use embeddings to find top-k relevant functions
2. Augmentation Stage: Build context with retrieved code
3. Generation Stage: LLM analyzes query against retrieved context
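
The three stages can be sketched end to end without any external services. In the toy version below, token-overlap (Jaccard) similarity stands in for the embedding search and `fake_llm` stands in for the Gemini call; both are placeholders chosen so the example runs offline, and all names are hypothetical:

```python
from typing import Callable, Dict, List, Tuple

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for the embedding search."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve(query: str, corpus: List[Dict], k: int) -> List[Tuple[Dict, float]]:
    """Stage 1: score every document and keep the top k."""
    scored = [(doc, jaccard(query, doc["code"])) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

def build_prompt(query: str, retrieved: List[Tuple[Dict, float]]) -> str:
    """Stage 2: put the retrieved code and the query into one prompt."""
    parts = ["Retrieved reference functions:\n"]
    for i, (doc, score) in enumerate(retrieved, start=1):
        parts.append(
            f"Function {i}: {doc['name']} (similarity: {score:.3f})\n{doc['code']}\n"
        )
    parts.append(f"Query code to analyze:\n{query}\n")
    return "\n".join(parts)

def rag_detect(query: str, corpus: List[Dict],
               llm: Callable[[str], str], k: int = 2) -> str:
    """Stage 3: hand the augmented prompt to the model and return its answer."""
    retrieved = retrieve(query, corpus, k)
    prompt = build_prompt(query, retrieved)
    return llm(prompt)

corpus = [
    {"name": "add", "code": "def add(a, b): return a + b"},
    {"name": "mul", "code": "def mul(a, b): return a * b"},
    {"name": "greet", "code": "def greet(name): print(name)"},
]

def fake_llm(prompt: str) -> str:
    # Placeholder for a real model call; always returns a fixed verdict
    return "PLAGIARIZED: NO\nCONFIDENCE: 0.9\nMATCH: NONE\nREASON: stub"

print(rag_detect("def add(x, y): return x + y", corpus, fake_llm))
```

Keeping the three stages as separate functions makes it straightforward to swap the retriever (as Method 4 does) without touching prompt construction or parsing.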

Key Differences from Method 2:
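- Fewer retrieved documents (k=5 vs. 20), each passed in full rather than truncated to 500 characters
- Similarity scores are shown to the LLM alongside each function
- More focused context, which is expected to help the LLM (to be checked in evaluation)
- Standard RAG pattern: retrieve first, then reason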

Prompt Design:
- Explicit instructions to look for:
  • Identical logic with cosmetic changes
  • Variable/function renaming
  • Comment removal
  • Whitespace modifications
  
- Structured output format (same as Method 2)

Advantages over Pure Embedding:
- LLM understands semantic equivalence despite syntactic differences
- Can explain WHY code is similar (interpretability)

Advantages over Direct LLM:
- Focused context is expected to improve accuracy
- Scalable to larger corpora (retrieval filters irrelevant code)
- Fewer candidates per call, which usually means lower token usage, latency, and cost

In [19]:
def detect_rag(query_code: str, k: int = 5) -> Dict:
    """
    Detect plagiarism using standard RAG (Retrieval-Augmented Generation).
    
    Args:
        query_code: Code snippet to check
        k: Number of functions to retrieve
    
    Returns:
        Dictionary containing detection results
    """
    # Retrieve relevant functions using embeddings
    retrieved = retrieve_with_embeddings(query_code, k=k)
    
    if not retrieved:
        return {
            'method': 'standard_rag',
            'is_plagiarized': False,
            'confidence': 0.0,
            'best_match': {'id': None, 'name': None, 'repo': None},
            'retrieved_count': 0,
            'reason': 'No functions retrieved',
            'raw_response': ''
        }
    
    # Build context
    context = "Retrieved reference functions:\n\n"
    for i, (meta, score) in enumerate(retrieved):
        context += f"Function {i+1}: {meta['name']} (similarity: {score:.3f})\n"
        context += f"Repository: {meta['repo']}\n"
        context += f"Code:\n{meta['code']}\n\n"
    
    # Create prompt (call_llm already prepends the system instruction)
    prompt = f"""{context}

Query code to analyze:
{query_code}

Determine if the query code is plagiarized from any of the retrieved functions.
Look for identical logic despite superficial changes like:
- Variable/function renaming
- Comment/docstring removal
- Whitespace changes
- Minor reordering that preserves logic

Respond in this exact format:
PLAGIARIZED: [YES/NO]
CONFIDENCE: [0.0-1.0]
MATCH: [function number or NONE]
REASON: [brief explanation]
"""
    
    # Call LLM
    response = call_llm(prompt, max_tokens=300)
    
    # Default values, used when parsing fails
    is_plagiarized = False
    confidence = 0.0
    best_match = retrieved[0][0] if retrieved else None
    reason = "Unable to parse response"
    
    # Parse the response; on any error, keep the defaults above
    try:
        if 'PLAGIARIZED:' in response:
            plagiarized_part = response.split('PLAGIARIZED:')[1].split('\n')[0].upper()
            is_plagiarized = 'YES' in plagiarized_part
        
        if 'CONFIDENCE:' in response:
            confidence_str = response.split('CONFIDENCE:')[1].split('\n')[0].strip()
            conf_match = re.findall(r'\d+\.?\d*', confidence_str)
            if conf_match:
                confidence = float(conf_match[0])
                # If confidence is given as percentage (e.g., 85), convert to 0-1 scale
                if confidence > 1.0:
                    confidence = confidence / 100.0
        else:
            # Fallback to similarity score
            confidence = retrieved[0][1] if retrieved else 0.0
        
        if 'MATCH:' in response:
            match_str = response.split('MATCH:')[1].split('\n')[0].strip()
            if 'NONE' not in match_str.upper():
                match_nums = re.findall(r'\d+', match_str)
                if match_nums:
                    match_num = int(match_nums[0])
                    if 1 <= match_num <= len(retrieved):
                        best_match = retrieved[match_num - 1][0]
        
        if 'REASON:' in response:
            reason = response.split('REASON:')[1].strip()
    except Exception as e:
        print(f"    Warning: Error parsing LLM response: {str(e)}")
        print(f"    Raw response: {response[:200]}...")
        # Keep default values
    
    # Ensure best_match is not None
    if best_match is None and retrieved:
        best_match = retrieved[0][0]
    
    return {
        'method': 'standard_rag',
        'is_plagiarized': is_plagiarized,
        'confidence': confidence,
        'best_match': {
            'id': best_match['id'] if best_match else None,
            'name': best_match['name'] if best_match else None,
            'repo': best_match['repo'] if best_match else None
        } if best_match else {'id': None, 'name': None, 'repo': None},
        'retrieved_count': len(retrieved),
        'reason': reason,
        'raw_response': response
    }

# Test example
result = detect_rag(test_code)
print("\nExample - Standard RAG:")
print(f"Is Plagiarized: {result['is_plagiarized']}")
print(f"Confidence: {result['confidence']:.3f}")
print(f"Retrieved: {result['retrieved_count']} functions")


Example - Standard RAG:
Is Plagiarized: False
Confidence: 1.000
Retrieved: 5 functions


## Method 4: Hybrid RAG
Hybrid Retrieval Architecture:
- Combines dense (embeddings) + sparse (BM25) retrieval
- Fusion strategy: Weighted score combination (alpha parameter)

Why Hybrid?
1. Embeddings capture semantic similarity
2. BM25 captures lexical/keyword matching
3. Together: More robust than either alone

Fusion Process:
1. Retrieve with embeddings → get top 2k results
2. Retrieve with BM25 → get top 2k results
3. Normalize scores to [0,1] range
4. Combine: score = alpha * embedding + (1-alpha) * bm25
5. Rank by combined score → select top k
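
The fusion steps above can be sketched on toy scores. The function IDs and score values below are invented for illustration; as in hybrid_retrieve(), a function that appears in only one result list contributes 0 for the missing component:

```python
from typing import Dict

def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}

def fuse(dense: Dict[str, float], sparse: Dict[str, float],
         alpha: float = 0.5) -> Dict[str, float]:
    """Weighted sum of normalized dense and sparse scores."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    combined = {}
    for key in set(dense_n) | set(sparse_n):
        combined[key] = (alpha * dense_n.get(key, 0.0)
                         + (1 - alpha) * sparse_n.get(key, 0.0))
    return combined

dense = {"f1": 0.95, "f2": 0.90, "f3": 0.80}   # cosine-like scores (toy data)
sparse = {"f2": 12.0, "f3": 7.0, "f4": 2.0}    # BM25-like scores (toy data)

for alpha in (0.2, 0.5, 0.8):
    fused = fuse(dense, sparse, alpha)
    ranking = sorted(fused, key=fused.get, reverse=True)
    print(alpha, ranking)
```

With this toy data, `f2` ranks first at alpha = 0.5 because it scores well in both lists, while `f1` overtakes it at alpha = 0.8, where the embedding score dominates. Min-max normalization is needed because raw BM25 scores are unbounded and would otherwise swamp the cosine scores.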

Alpha Parameter (default=0.5):
- 0.5: Equal weight to semantic + lexical
- >0.5: Favor semantic similarity
- <0.5: Favor keyword matching
- Will be ablated in evaluation (Phase 3)

Robust LLM Response Parsing:
- Primary: Structured format parsing
- Fallback: Keyword-based confidence estimation
- Handles malformed responses gracefully

When Hybrid Helps:
- Plagiarism with variable renaming: BM25 still matches tokens that survive renaming (keywords, API calls, literals)
- Semantic refactoring: Embeddings catch logic similarity
- Each retriever covers cases the other can miss

In [None]:
def detect_hybrid_rag(query_code: str, k: int = 5, alpha: float = 0.5) -> Dict:
    """
    Detect plagiarism using hybrid RAG (dense + sparse retrieval).
    
    Args:
        query_code: Code snippet to check
        k: Number of functions to retrieve
        alpha: Weight for embedding scores (1-alpha for BM25)
    
    Returns:
        Dictionary containing detection results
    """
    # Retrieve using hybrid approach
    retrieved = hybrid_retrieve(query_code, k=k, alpha=alpha)
    
    if not retrieved:
        return {
            'method': 'hybrid_rag',
            'is_plagiarized': False,
            'confidence': 0.0,
            'best_match': {'id': None, 'name': None, 'repo': None},
            'retrieved_count': 0,
            'fusion_alpha': alpha,
            'reason': 'No similar code found in corpus',
            'raw_response': 'No retrieval results'
        }
    
    # Build context
    context = "Retrieved reference functions (hybrid search):\n\n"
    for i, (meta, score) in enumerate(retrieved):
        context += f"Function {i+1}: {meta['name']} (score: {score:.3f})\n"
        context += f"Repository: {meta['repo']}\n"
        context += f"Code:\n{meta['code'][:500]}...\n\n"  # Limit code length
    
    # Create prompt
    prompt = f"""{context}

Query code to analyze:
{query_code}

Determine if the query code is plagiarized from any of the retrieved functions.
These functions were retrieved using both semantic similarity and lexical matching.

Look for:
- Identical or very similar logic flow
- Same algorithm/approach with cosmetic changes
- Variable renaming but same structure
- Comment removal or modification

Respond in this EXACT format (no extra text):
PLAGIARIZED: YES or NO
CONFIDENCE: 0.85
MATCH: 1 or NONE
REASON: Brief explanation here
"""
    
    # Call LLM
    try:
        response = call_llm(prompt, max_tokens=300)
    except Exception as e:
        print(f"❌ LLM call failed: {e}")
        response = ""
    
    # Robust parsing with fallbacks
    is_plagiarized = False
    confidence = 0.0
    match_num = None
    reason = "Unable to parse LLM response"
    
    # Parse PLAGIARIZED
    try:
        if 'PLAGIARIZED:' in response:
            plagiarized_line = response.split('PLAGIARIZED:')[1].split('\n')[0].strip().upper()
            is_plagiarized = 'YES' in plagiarized_line
        else:
            # Fallback: look for yes/no in response
            response_upper = response.upper()
            if 'YES' in response_upper and 'PLAGIARIZED' in response_upper:
                is_plagiarized = True
    except Exception as e:
        print(f"⚠️  Error parsing PLAGIARIZED: {e}")
    
    # Parse CONFIDENCE
    try:
        if 'CONFIDENCE:' in response:
            confidence_line = response.split('CONFIDENCE:')[1].split('\n')[0].strip()
            # Extract float
            conf_match = re.search(r'(\d+\.?\d*)', confidence_line)
            if conf_match:
                confidence = float(conf_match.group(1))
                # Normalize if needed
                if confidence > 1.0:
                    confidence = confidence / 100.0
        else:
            # Fallback: use top retrieval score
            confidence = retrieved[0][1] if retrieved else 0.0
    except Exception as e:
        print(f"⚠️  Error parsing CONFIDENCE: {e}")
        confidence = retrieved[0][1] if retrieved else 0.0
    
    # Parse MATCH
    try:
        if 'MATCH:' in response:
            match_line = response.split('MATCH:')[1].split('\n')[0].strip().upper()
            if 'NONE' not in match_line:
                match_search = re.search(r'(\d+)', match_line)
                if match_search:
                    match_num = int(match_search.group(1))
    except Exception as e:
        print(f"⚠️  Error parsing MATCH: {e}")
    
    # Parse REASON
    try:
        if 'REASON:' in response:
            reason = response.split('REASON:')[1].strip()
            # Clean up if there are multiple lines
            reason = reason.split('\n')[0] if '\n' in reason else reason
        else:
            reason = "LLM response did not follow expected format"
    except Exception as e:
        print(f"⚠️  Error parsing REASON: {e}")
    
    # Determine best match
    best_match = None
    if match_num and 1 <= match_num <= len(retrieved):
        best_match = retrieved[match_num - 1][0]
    elif retrieved:
        best_match = retrieved[0][0]  # Default to top result
    
    return {
        'method': 'hybrid_rag',
        'is_plagiarized': is_plagiarized,
        'confidence': confidence,
        'best_match': {
            'id': best_match['id'] if best_match else None,
            'name': best_match['name'] if best_match else None,
            'repo': best_match['repo'] if best_match else None,
            'code': best_match['code'][:200] + '...' if best_match else None
        },
        'retrieved_count': len(retrieved),
        'fusion_alpha': alpha,
        'reason': reason,
        'raw_response': response
    }


# Alternative: More flexible parsing function
def parse_llm_response(response: str, retrieved: List, default_confidence: float = 0.0) -> Dict:
    """
    Robustly parse LLM response with multiple fallback strategies.
    """
    result = {
        'is_plagiarized': False,
        'confidence': default_confidence,
        'match_num': None,
        'reason': 'Unable to parse response'
    }
    
    if not response:
        return result
    
    # Strategy 1: Exact format parsing
    lines = response.strip().split('\n')
    for line in lines:
        line = line.strip()
        
        if line.startswith('PLAGIARIZED:'):
            result['is_plagiarized'] = 'YES' in line.upper()
        
        elif line.startswith('CONFIDENCE:'):
            try:
                nums = re.findall(r'(\d+\.?\d*)', line)
                if nums:
                    conf = float(nums[0])
                    result['confidence'] = conf if conf <= 1.0 else conf / 100.0
            except:
                pass
        
        elif line.startswith('MATCH:'):
            try:
                if 'NONE' not in line.upper():
                    nums = re.findall(r'(\d+)', line)
                    if nums:
                        result['match_num'] = int(nums[0])
            except:
                pass
        
        elif line.startswith('REASON:'):
            result['reason'] = line.replace('REASON:', '').strip()
    
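    # Strategy 2: Keyword-based fallback, applied only when no numeric
    # CONFIDENCE line was found (comparing against default_confidence would
    # misfire whenever the LLM's value happens to equal the default)
    if not re.search(r'^\s*CONFIDENCE:.*\d', response, re.MULTILINE):
        response_lower = response.lower()
        
        # Check specific phrases before broader ones: 'similar' is a substring
        # of 'somewhat similar', so the broader test must come later
        if 'not plagiarized' in response_lower:
            result['confidence'] = 0.2
            result['is_plagiarized'] = False
        elif 'very similar' in response_lower or 'identical' in response_lower:
            result['confidence'] = 0.9
            result['is_plagiarized'] = True
        elif 'somewhat similar' in response_lower:
            result['confidence'] = 0.5
        elif 'similar' in response_lower or 'likely' in response_lower:
            result['confidence'] = 0.7
            result['is_plagiarized'] = True
        elif 'different' in response_lower:
            result['confidence'] = 0.2
            result['is_plagiarized'] = False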
    
    return result


# Updated version using the parser
def detect_hybrid_rag_v2(query_code: str, k: int = 5, alpha: float = 0.5) -> Dict:
    """
    Detect plagiarism using hybrid RAG with robust parsing.
    """
    retrieved = hybrid_retrieve(query_code, k=k, alpha=alpha)
    
    if not retrieved:
        return {
            'method': 'hybrid_rag',
            'is_plagiarized': False,
            'confidence': 0.0,
            'best_match': {'id': None, 'name': None, 'repo': None},
            'retrieved_count': 0,
            'fusion_alpha': alpha,
            'reason': 'No similar code found',
            'raw_response': ''
        }
    
    # Build shorter context to avoid token limits
    context = "Compare the query code with these reference functions:\n\n"
    for i, (meta, score) in enumerate(retrieved[:3]):  # Limit to top 3
        context += f"{i+1}. {meta['name']} (similarity: {score:.2f})\n"
        context += f"{meta['code'][:300]}...\n\n"
    
    prompt = f"""{context}

Query code:
{query_code[:500]}

Is the query code plagiarized from any reference function?

Answer in this format:
PLAGIARIZED: YES or NO
CONFIDENCE: 0.0 to 1.0
MATCH: function number or NONE
REASON: one sentence explanation
"""
    
    try:
        response = call_llm(prompt, max_tokens=200)
        print(f"\n[LLM Response]\n{response}\n")
    except Exception as e:
        print(f"❌ LLM Error: {e}")
        response = ""
    
    # Parse with fallback
    parsed = parse_llm_response(
        response, 
        retrieved, 
        default_confidence=retrieved[0][1]
    )
    
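    # Get best match. Only the top 3 candidates were shown to the LLM,
    # so a valid match number cannot exceed min(3, len(retrieved)).
    if parsed['match_num'] and 1 <= parsed['match_num'] <= min(3, len(retrieved)):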
        best_match = retrieved[parsed['match_num'] - 1][0]
    else:
        best_match = retrieved[0][0]
    
    return {
        'method': 'hybrid_rag',
        'is_plagiarized': parsed['is_plagiarized'],
        'confidence': parsed['confidence'],
        'best_match': {
            'id': best_match['id'],
            'name': best_match['name'],
            'repo': best_match['repo'],
            'code_preview': best_match['code'][:150] + '...'
        },
        'retrieved_count': len(retrieved),
        'fusion_alpha': alpha,
        'reason': parsed['reason'],
        'raw_response': response
    }


# Test with error handling
try:
    result = detect_hybrid_rag_v2(test_code)
    print("\n✅ Hybrid RAG Result:")
    print(f"  Plagiarized: {result['is_plagiarized']}")
    print(f"  Confidence: {result['confidence']:.3f}")
    print(f"  Best Match: {result['best_match']['name']}")
    print(f"  Reason: {result['reason']}")
except Exception as e:
    print(f"❌ Detection failed: {e}")
    import traceback
    traceback.print_exc()


[LLM Response]
PLAGIARIZED: NO
CONFIDENCE: 1.0
MATCH: NONE
REASON: The query code implements a basic sum accumulation pattern, which is a fundamental programming concept and not a unique or complex algorithm derived from the provided reference functions.


✅ Hybrid RAG Result:
  Plagiarized: False
  Confidence: 1.000
  Best Match: base_check
  Reason: The query code implements a basic sum accumulation pattern, which is a fundamental programming concept and not a unique or complex algorithm derived from the provided reference functions.
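
The prompt above asks for a fixed four-field answer (PLAGIARIZED, CONFIDENCE, MATCH, REASON). As a minimal sketch of how such a reply could be read field by field, the helper below uses a regular expression. The name `parse_structured_answer` and its return keys are hypothetical; this is not the notebook's `parse_llm_response`, and it has no keyword fallback.

```python
import re
from typing import Dict, Optional

# Hypothetical helper for illustration only; not the notebook's parse_llm_response.
# Matches lines such as "CONFIDENCE: 0.8" anywhere in the reply.
FIELD_RE = re.compile(
    r"^[ \t]*(PLAGIARIZED|CONFIDENCE|MATCH|REASON)[ \t]*:[ \t]*(.*)$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_structured_answer(text: str) -> Dict[str, Optional[object]]:
    """Extract the four fields; any field that is missing or malformed becomes None."""
    fields = {key.upper(): value.strip() for key, value in FIELD_RE.findall(text)}

    verdict = fields.get("PLAGIARIZED", "").upper()
    if verdict.startswith("YES"):
        is_plagiarized: Optional[bool] = True
    elif verdict.startswith("NO"):
        is_plagiarized = False
    else:
        is_plagiarized = None

    try:
        # Clamp to [0, 1] in case the model answers outside the requested range.
        confidence: Optional[float] = min(1.0, max(0.0, float(fields["CONFIDENCE"])))
    except (KeyError, ValueError):
        confidence = None

    # "NONE" or any non-numeric answer yields no match number.
    number = re.match(r"\d+", fields.get("MATCH", ""))
    match_num = int(number.group()) if number else None

    return {
        "is_plagiarized": is_plagiarized,
        "confidence": confidence,
        "match_num": match_num,
        "reason": fields.get("REASON") or None,
    }


sample = "PLAGIARIZED: NO\nCONFIDENCE: 1.0\nMATCH: NONE\nREASON: Basic sum pattern."
parsed = parse_structured_answer(sample)
```

Returning None for unreadable fields lets the caller decide when to fall back to something else, such as the retrieval score.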


## Interactive Testing Interface
This cell provides a unified interface for testing all methods:

compare_all_methods(query_code):
- Runs all 4 detection methods sequentially
- Returns a dict of results keyed by method name
- Prints a per-method summary (verdict, confidence, best match)

Individual Method Usage:
- detect_embedding(code): Method 1
- detect_llm(code): Method 2
- detect_rag(code): Method 3
- detect_hybrid_rag(code): Method 4

Output Format (all methods):
{
  'method': str,
  'is_plagiarized': bool,
  'confidence': float,
  'best_match': {   # may be None when nothing is retrieved
    'id': str,
    'name': str,
    'repo': str
  },
  'reason': str,  # LLM methods only
  ...
}

Requirement met: every function can be called independently, so 03_evaluation.ipynb can run each method on its own.
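
Because the evaluation notebook relies on this shared output format, a small sanity check can be handy. The following is a minimal sketch, assuming the field names listed above. The helper `validate_result` is hypothetical and not part of the notebook, and `best_match` is allowed to be None because the no-retrieval path returns it that way.

```python
from numbers import Real
from typing import Any, Dict, List


def validate_result(result: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a detection result (empty if none)."""
    problems: List[str] = []

    # Required top-level keys from the output format.
    for key in ("method", "is_plagiarized", "confidence"):
        if key not in result:
            problems.append(f"missing key: {key}")

    if "method" in result and not isinstance(result["method"], str):
        problems.append("method must be a str")

    if "is_plagiarized" in result and not isinstance(result["is_plagiarized"], bool):
        problems.append("is_plagiarized must be a bool")

    if "confidence" in result:
        confidence = result["confidence"]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(confidence, bool) or not isinstance(confidence, Real):
            problems.append("confidence must be a number")
        elif not 0.0 <= confidence <= 1.0:
            problems.append("confidence must be in [0, 1]")

    # best_match may be None; when present it needs id, name, and repo.
    best_match = result.get("best_match")
    if best_match is not None:
        for key in ("id", "name", "repo"):
            if key not in best_match:
                problems.append(f"best_match missing key: {key}")

    return problems


example = {
    "method": "hybrid_rag",
    "is_plagiarized": False,
    "confidence": 0.0,
    "best_match": None,
}
issues = validate_result(example)
```

An empty list means the result matches the format; otherwise each entry names one violation.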

In [21]:
def compare_all_methods(query_code: str) -> Dict:
    """
    Run all four detection methods and compare results.
    
    Args:
        query_code: Code snippet to analyze
    
    Returns:
        Dictionary with results from all methods
    """
    print("Running all detection methods...\n")
    
    results = {}
    
    print("[1/4] Pure Embedding Search...")
    results['embedding'] = detect_embedding(query_code)
    
    print("[2/4] Direct LLM Analysis...")
    results['llm'] = detect_llm(query_code)
    
    print("[3/4] Standard RAG...")
    results['rag'] = detect_rag(query_code)
    
    print("[4/4] Hybrid RAG...")
    results['hybrid_rag'] = detect_hybrid_rag(query_code)
    
    print("\n✓ All methods completed")
    
    # Print comparison
    print("\n" + "="*60)
    print("COMPARISON OF ALL METHODS")
    print("="*60)
    
    for method_name, result in results.items():
        # best_match is None when a method retrieves nothing, so guard the lookup
        best_match = result.get('best_match') or {}
        print(f"\n{method_name.upper()}:")
        print(f"  Plagiarized: {result['is_plagiarized']}")
        print(f"  Confidence: {result['confidence']:.3f}")
        print(f"  Best Match: {best_match.get('name')}")
    
    return results

# Example usage
print("\n" + "="*60)
print("INTERACTIVE TESTING READY")
print("="*60)
print("\nAvailable functions:")
print("  - detect_embedding(code)")
print("  - detect_llm(code)")
print("  - detect_rag(code)")
print("  - detect_hybrid_rag(code)")
print("  - compare_all_methods(code)")
print("\nExample:")
print('  result = detect_rag("""your code here""")')


INTERACTIVE TESTING READY

Available functions:
  - detect_embedding(code)
  - detect_llm(code)
  - detect_rag(code)
  - detect_hybrid_rag(code)
  - compare_all_methods(code)

Example:
  result = detect_rag("""your code here""")


In [None]:
sample_code = """
def add_numbers(a, b):
    return a + b
"""

# Test the RAG method
result_rag = detect_rag(sample_code)
print(result_rag)


In [None]:
sample_code = """
def multiply_numbers(a, b):
    return a * b
"""

results = compare_all_methods(sample_code)


Running all detection methods...

[1/4] Pure Embedding Search...
[2/4] Direct LLM Analysis...
[3/4] Standard RAG...
[4/4] Hybrid RAG...

[DEBUG] LLM Response:
PLAGIARIZED: NO
CONFIDENCE: 1.00
MATCH: NONE
REASON: The query code performs a very basic arithmetic multiplication. None of the reference functions perform multiplication or share any similar logic, algorithm, or structure with the query code. The query code is a fundamental operation that would be independently written.


✓ All methods completed

COMPARISON OF ALL METHODS

EMBEDDING:
  Plagiarized: False
  Confidence: 0.000
  Best Match: None

LLM:
  Plagiarized: False
  Confidence: 1.000
  Best Match: concat

RAG:
  Plagiarized: False
  Confidence: 1.000
  Best Match: concat

HYBRID_RAG:
  Plagiarized: False
  Confidence: 1.000
  Best Match: concat
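
In the comparison above, the embedding method reports 0.000 and the LLM-based methods report 1.000 for the same "not plagiarized" verdict, so the confidence values are not directly comparable. The sketch below shows one way to put them on a common scale. It rests on an assumption the notebook does not confirm: that the LLM-based methods report certainty in their verdict, while the embedding method reports a similarity-like score. The names `plagiarism_score`, `summarize`, and `VERDICT_CONFIDENCE_METHODS` are hypothetical.

```python
from typing import Any, Dict, Set

# Assumption: these methods report how certain they are of their verdict,
# not how likely plagiarism is. The embedding method is treated as a score.
VERDICT_CONFIDENCE_METHODS: Set[str] = {"llm", "rag", "hybrid_rag"}


def plagiarism_score(method: str, result: Dict[str, Any]) -> float:
    """Map a result to a value in [0, 1] where higher means more likely plagiarized."""
    confidence = float(result["confidence"])
    if method in VERDICT_CONFIDENCE_METHODS and not result["is_plagiarized"]:
        # A confident "NO" becomes a low plagiarism score.
        return 1.0 - confidence
    return confidence


def summarize(results: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Collect comparable scores and a simple majority vote across methods."""
    scores = {name: plagiarism_score(name, res) for name, res in results.items()}
    votes = sum(1 for res in results.values() if res["is_plagiarized"])
    return {
        "scores": scores,
        "votes_for_plagiarism": votes,
        "majority_plagiarized": votes * 2 > len(results),
    }


# Values copied from the comparison output above.
example = {
    "embedding": {"is_plagiarized": False, "confidence": 0.0},
    "llm": {"is_plagiarized": False, "confidence": 1.0},
    "rag": {"is_plagiarized": False, "confidence": 1.0},
    "hybrid_rag": {"is_plagiarized": False, "confidence": 1.0},
}
summary = summarize(example)
```

Under that assumption, all four methods map to a plagiarism score of 0.0 for this query, which matches their unanimous verdict.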
