# Code Plagiarism Detection - Phase 1: Indexing and Data Preparation
This notebook handles all data preparation for the plagiarism detection system:

- Collects Python code from 5+ GitHub repositories
- Extracts individual functions from source files
- Builds FAISS semantic index using CodeBERT embeddings
- Builds BM25 lexical index for hybrid retrieval
- Creates labeled test dataset with 30+ examples (positive and negative cases)
- Saves all indexes and datasets for use by other notebooks

Security Note: GitHub token is loaded from environment variables (not hardcoded) to comply with security requirements.
Required environment variable: GITHUB_TOKEN

In [1]:
import os
import json
import re
import ast
import numpy as np
import pandas as pd
from pathlib import Path
from typing import List, Dict, Tuple
from tqdm.notebook import tqdm
import pickle
import warnings
warnings.filterwarnings('ignore')

# Embedding and search
from sentence_transformers import SentenceTransformer
import faiss
from rank_bm25 import BM25Okapi

# GitHub access
from github import Github
import requests

# GitHub authentication
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN")
if not GITHUB_TOKEN:
    raise ValueError("GitHub token not found! Set GITHUB_TOKEN as an environment variable.")

# Headers for authenticated requests
HEADERS = {"Authorization": f"token {GITHUB_TOKEN}"}

# Authenticate PyGithub (optional, if you use PyGithub library)
g = Github(GITHUB_TOKEN)


# Setup paths
BASE_DIR = Path('.')
DATA_DIR = BASE_DIR / 'data'
CORPUS_DIR = DATA_DIR / 'reference_corpus'
INDEX_DIR = BASE_DIR / 'indexes'
RESULTS_DIR = BASE_DIR / 'results'

# Create directories
for dir_path in [DATA_DIR, CORPUS_DIR, INDEX_DIR, RESULTS_DIR]:
    dir_path.mkdir(exist_ok=True, parents=True)

print("✓ Environment setup complete")

✓ Environment setup complete


## 1. Data Collection from GitHub
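This section queries 5 GitHub repositories (the stated minimum) for Python code, capped at 50 files per repository. In the run recorded below, karan/Projects-Solutions and zhiwehu/Python-programming-exercises returned 0 files, so the saved corpus draws on 3 repositories.

Key Features:
- Recursive traversal of repository directories
- Skips directories whose path contains 'test', '__pycache__', or '.git' (substring match on directories only; individual test files elsewhere are kept)
- Authenticates with the GitHub API for a higher rate limit
- Saves raw corpus with metadata (repo, path, URL) for traceability

Output: data/reference_corpus/raw_corpus.json (150 Python files)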
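
The directory filter in this section matches substrings of the path, which would also skip a directory such as `contest/` or `latest/`. A minimal sketch of a component-based alternative follows. The names `SKIP_DIRS` and `should_skip` are hypothetical, and the test-file naming patterns are an assumption, not something the notebook defines.

```python
from pathlib import PurePosixPath

# Hypothetical skip list mirroring the directory names the notebook filters on
SKIP_DIRS = {"test", "tests", "__pycache__", ".git"}


def should_skip(path: str) -> bool:
    """Return True if a path component or the file name marks a test/system path."""
    parts = PurePosixPath(path).parts
    # Compare whole components so 'latest/' or 'contest/' are not skipped by accident
    if any(part in SKIP_DIRS for part in parts):
        return True
    name = parts[-1] if parts else ""
    # Assumed test-file naming conventions (not taken from the notebook)
    return name.startswith("test_") or name.endswith("_test.py")


for candidate in ["sorts/tests/test_bubble.py", "contest/latest_solution.py"]:
    print(candidate, should_skip(candidate))
```

Applying a check like this to both directories and files would make the "filters out test files" behaviour explicit, at the cost of changing which files end up in the corpus.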

In [2]:

# Selected repositories - popular Python projects with diverse functionality
REPOSITORIES = [
    "TheAlgorithms/Python",              # Algorithm implementations (sorting, searching, etc.)
    "karan/Projects-Solutions",          # Common programming project solutions
    "geekcomputers/Python",              # Simple Python scripts and projects
    "zhiwehu/Python-programming-exercises",  # Programming exercises with solutions
    "trekhleb/learn-python"              # Python learning examples
]

def download_repository_code(repo_name: str, max_files: int = 50) -> List[Dict]:
    """
    Download Python files from a GitHub repository.
    
    Args:
        repo_name: GitHub repository in format 'owner/repo'
        max_files: Maximum number of files to download
    
    Returns:
        List of dictionaries containing file information
    """
    print(f"\nDownloading from {repo_name}...")
    
    # Use GitHub API
    base_url = f"https://api.github.com/repos/{repo_name}/contents"
    files_data = []
    
    def fetch_python_files(url: str, path: str = ""):
        """Recursively fetch Python files from repository."""
        if len(files_data) >= max_files:
            return
        
        try:
            response = requests.get(url, headers=HEADERS)
            response.raise_for_status()
            contents = response.json()
            
            for item in contents:
                if len(files_data) >= max_files:
                    break
                
                if item['type'] == 'file' and item['name'].endswith('.py'):
                    # Download file content
                    file_response = requests.get(item['download_url'], headers=HEADERS)
                    if file_response.status_code == 200:
                        files_data.append({
                            'repo': repo_name,
                            'path': item['path'],
                            'content': file_response.text,
                            'url': item['html_url']
                        })
                        
                elif item['type'] == 'dir' and not any(skip in item['path'] for skip in ['test', 'tests', '__pycache__', '.git']):
                    # Recursively explore subdirectories (avoid test directories)
                    fetch_python_files(item['url'], item['path'])
                    
        except Exception as e:
            print(f"Error fetching {url}: {e}")
    
    fetch_python_files(base_url)
    print(f"  Downloaded {len(files_data)} files")
    return files_data

# Download code from all repositories
all_files = []
for repo in REPOSITORIES:
    repo_files = download_repository_code(repo)
    all_files.extend(repo_files)

print(f"\n✓ Total files collected: {len(all_files)}")

# Save raw corpus
corpus_file = CORPUS_DIR / 'raw_corpus.json'
with open(corpus_file, 'w', encoding='utf-8') as f:
    json.dump(all_files, f, indent=2)
print(f"✓ Saved to {corpus_file}")


Downloading from TheAlgorithms/Python...
  Downloaded 50 files

Downloading from karan/Projects-Solutions...
  Downloaded 0 files

Downloading from geekcomputers/Python...
  Downloaded 50 files

Downloading from zhiwehu/Python-programming-exercises...
  Downloaded 0 files

Downloading from trekhleb/learn-python...
  Downloaded 50 files

✓ Total files collected: 150
✓ Saved to data\reference_corpus\raw_corpus.json


## 2. Code Chunking: Extract Functions
Uses Python's AST module to parse source code and extract individual functions.

Chunking Decision: Function-level (not file-level)
- Rationale: Plagiarism typically occurs at function level, not entire files
- Filters: Excludes trivial functions (< 50 characters)
- Metadata: Preserves function name, docstring, line numbers, and source repository

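Each function gets an ID of the form "repo::filepath::function_name". Same-named methods in different classes of one file share an ID, so IDs are not strictly unique.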

Output: data/reference_corpus/functions_corpus.json (399 functions across 3 repositories)
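
As a sketch of how function-level chunking with the AST module can keep IDs distinct, the example below builds a class-qualified name and also picks up `async def` functions. This is an illustrative variant, not the notebook's extractor; `extract_functions` and the `qualname` field are hypothetical names.

```python
import ast
from typing import Dict, List


def extract_functions(code: str) -> List[Dict]:
    """Collect sync and async functions with a class-qualified name."""
    tree = ast.parse(code)
    found: List[Dict] = []

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                qualname = f"{prefix}{child.name}"
                found.append({
                    "qualname": qualname,
                    # get_source_segment returns the source text of the node
                    "code": ast.get_source_segment(code, child),
                    "line_start": child.lineno,
                    "line_end": child.end_lineno,
                })
                # Nested functions are qualified by their enclosing function
                visit(child, f"{qualname}.")
            elif isinstance(child, ast.ClassDef):
                visit(child, f"{prefix}{child.name}.")
            else:
                visit(child, prefix)

    visit(tree, "")
    return found


sample = '''
class Stack:
    def push(self, x):
        self.items.append(x)

class Queue:
    def push(self, x):
        self.items.insert(0, x)

async def fetch(url):
    return url
'''
names = [f["qualname"] for f in extract_functions(sample)]
print(names)  # ['Stack.push', 'Queue.push', 'fetch']
```

With plain function names, both `push` methods would map to the same ID; the qualified name keeps them apart.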

In [3]:
def extract_functions_from_code(code: str, file_path: str, repo: str) -> List[Dict]:
    """
    Extract individual functions from Python code.
    
    Args:
        code: Python source code
        file_path: Path to the file
        repo: Repository name
    
    Returns:
        List of function dictionaries
    """
    functions = []
    
    try:
        tree = ast.parse(code)
        
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                # Extract function source code
                function_lines = code.split('\n')[node.lineno-1:node.end_lineno]
                function_code = '\n'.join(function_lines)
                
                # Skip very short functions (likely trivial)
                if len(function_code.strip()) < 50:
                    continue
                
                # Extract docstring if present
                docstring = ast.get_docstring(node) or ""
                
                functions.append({
                    'id': f"{repo}::{file_path}::{node.name}",
                    'repo': repo,
                    'file': file_path,
                    'name': node.name,
                    'code': function_code,
                    'docstring': docstring,
                    'line_start': node.lineno,
                    'line_end': node.end_lineno
                })
    except SyntaxError:
        # Skip files with syntax errors
        pass
    except Exception:
        # Skip files with other parsing errors
        pass
    
    return functions

# Extract all functions from corpus
print("Extracting functions from code files...")
all_functions = []

for file_data in tqdm(all_files):
    functions = extract_functions_from_code(
        file_data['content'],
        file_data['path'],
        file_data['repo']
    )
    all_functions.extend(functions)

print(f"\n✓ Extracted {len(all_functions)} functions")

# Save function corpus
functions_file = CORPUS_DIR / 'functions_corpus.json'
with open(functions_file, 'w', encoding='utf-8') as f:
    json.dump(all_functions, f, indent=2)
print(f"✓ Saved to {functions_file}")

# Display statistics
df_stats = pd.DataFrame(all_functions)
print("\nCorpus Statistics:")
print(df_stats.groupby('repo').size())

Extracting functions from code files...


✓ Extracted 399 functions
✓ Saved to data\reference_corpus\functions_corpus.json

Corpus Statistics:
repo
TheAlgorithms/Python      96
geekcomputers/Python     182
trekhleb/learn-python    121
dtype: int64


## 3. Build Embedding Index
SEMANTIC EMBEDDING INDEX (Dense Retrieval)

Builds FAISS index for pure embedding search and RAG systems (Homework Phase 2).

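Model: CodeBERT (microsoft/codebert-base), loaded through SentenceTransformer
- Pre-trained on code from GitHub
- Not a native sentence-transformers model, so the library wraps it with mean pooling over token embeddings (see the warning in the cell output)
- Embedding dimension: 768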

Input Representation: Concatenation of function name, docstring, and code
- Captures both semantic meaning and syntactic structure

Index Type: FAISS IndexFlatIP (exact cosine similarity)
- L2-normalized vectors → inner product = cosine similarity
- Will be used for: detect_embedding(), detect_rag(), detect_hybrid_rag()

Output Files:
- indexes/faiss_index.bin (searchable index)
- indexes/embeddings.npy (L2-normalized embeddings, saved after normalization)
- indexes/function_metadata.pkl (function details for retrieval results)
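
The claim that inner product equals cosine similarity for L2-normalized vectors can be checked on a toy pair. The sketch below uses plain Python so it runs without FAISS; the vectors are made up for illustration.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

# Inner product of the normalized vectors, which is what IndexFlatIP scores
inner = dot(l2_normalize(a), l2_normalize(b))
print(round(inner, 6), round(cosine(a, b), 6))  # 0.96 0.96
```

This is why the query vector must be normalized the same way at search time; otherwise the scores returned by the index are no longer cosine similarities.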

In [4]:
# Initialize embedding model
print("Loading embedding model...")
embedding_model = SentenceTransformer('microsoft/codebert-base')
print("✓ Model loaded")

# Generate embeddings for all functions
print("\nGenerating embeddings...")
function_texts = [f"{func['name']}\n{func['docstring']}\n{func['code']}" for func in all_functions]
embeddings = embedding_model.encode(
    function_texts,
    show_progress_bar=True,
    batch_size=32
)
embeddings = np.array(embeddings).astype('float32')
print(f"✓ Generated embeddings shape: {embeddings.shape}")

# Normalize embeddings for cosine similarity
faiss.normalize_L2(embeddings)

# Build FAISS index
print("\nBuilding FAISS index...")
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner product = cosine similarity for normalized vectors
index.add(embeddings)
print(f"✓ Index built with {index.ntotal} vectors")

# Save index and metadata
faiss.write_index(index, str(INDEX_DIR / 'faiss_index.bin'))
np.save(INDEX_DIR / 'embeddings.npy', embeddings)
with open(INDEX_DIR / 'function_metadata.pkl', 'wb') as f:
    pickle.dump(all_functions, f)

print("✓ Saved FAISS index and metadata")

Loading embedding model...


No sentence-transformers model found with name microsoft/codebert-base. Creating a new one with mean pooling.


✓ Model loaded

Generating embeddings...


✓ Generated embeddings shape: (399, 768)

Building FAISS index...
✓ Index built with 399 vectors
✓ Saved FAISS index and metadata


## 4. Build BM25 Index
LEXICAL INDEX (Sparse Retrieval - BM25)

Builds BM25 index for hybrid RAG system (Homework Phase 2).

Why BM25?
- Captures exact lexical matches (variable names, keywords)
- Complements semantic search by finding syntactically similar code
- Effective for copied code that keeps the original tokens, even where embedding similarity is low

Tokenization Strategy:
- Lowercase normalization
- Splits on non-alphanumeric characters (preserves underscores in identifiers)
- Simple but effective for code tokens
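
To make the tokenization strategy concrete, the sketch below applies the same regular expression used in this notebook to a short, made-up snippet. Note that operators and punctuation are dropped and that lowercasing merges camelCase variants.

```python
import re
from typing import List


def tokenize_code(code: str) -> List[str]:
    """Lowercase the code and keep runs of letters, digits, and underscores."""
    return re.findall(r'\b\w+\b', code.lower())


snippet = "def binary_search(arr, target_value):\n    return arr[0] == target_value"
tokens = tokenize_code(snippet)
print(tokens)
# ['def', 'binary_search', 'arr', 'target_value', 'return', 'arr', '0', 'target_value']

# Lowercasing means 'maxValue' and 'maxvalue' become the same token
print(tokenize_code("maxValue = MAXVALUE"))  # ['maxvalue', 'maxvalue']
```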

Will be combined with FAISS embeddings in detect_hybrid_rag() using score fusion.

Output Files:
- indexes/bm25_index.pkl (BM25 model)
- indexes/tokenized_corpus.pkl (tokenized documents)
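
The score fusion mentioned above is implemented in a later notebook and is not shown here, so its exact form is unknown. One common option is a weighted sum of min-max normalized scores, sketched below. The function names, the `alpha` weight, and the toy scores are all hypothetical.

```python
from typing import List


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(dense: List[float], sparse: List[float], alpha: float = 0.5) -> List[float]:
    """Weighted sum of normalized dense (embedding) and sparse (BM25) scores.

    alpha is an assumed weight on the dense score; 1 - alpha goes to BM25.
    """
    d, s = min_max(dense), min_max(sparse)
    return [alpha * x + (1 - alpha) * y for x, y in zip(d, s)]


# Toy scores for three candidate functions (illustrative values only)
dense = [0.91, 0.40, 0.75]   # cosine similarities
sparse = [2.0, 12.0, 7.0]    # BM25 scores

fused = fuse_scores(dense, sparse, alpha=0.6)
best = max(range(len(fused)), key=fused.__getitem__)
print(best)  # 2
```

Normalizing first matters because BM25 scores are unbounded while cosine similarities lie in [-1, 1]. In this toy case the candidate ranked second by both retrievers wins after fusion.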

In [5]:
def tokenize_code(code: str) -> List[str]:
    """Tokenize code for BM25 indexing."""
    # Split on non-alphanumeric characters but keep underscores
    tokens = re.findall(r'\b\w+\b', code.lower())
    return tokens

print("Building BM25 index...")
tokenized_corpus = [tokenize_code(func['code']) for func in tqdm(all_functions)]
bm25_index = BM25Okapi(tokenized_corpus)

# Save BM25 index
with open(INDEX_DIR / 'bm25_index.pkl', 'wb') as f:
    pickle.dump(bm25_index, f)
with open(INDEX_DIR / 'tokenized_corpus.pkl', 'wb') as f:
    pickle.dump(tokenized_corpus, f)

print("✓ BM25 index built and saved")

Building BM25 index...


✓ BM25 index built and saved


## 5. Create Test Dataset with Plagiarism Examples
POSITIVE EXAMPLES (15 cases - Plagiarism):
Realistic transformations applied to actual repository functions:

1. Variable Renaming (5 cases):
   - Changes: data→input_data, result→output, temp→tmp_var, etc.
   - Rationale: Common plagiarism tactic to avoid detection

2. Comment/Docstring Removal (5 cases):
   - Strips all comments and docstrings
   - Rationale: Students often remove documentation from copied code

3. Minor Refactoring (5 cases):
   - Combines variable renaming + comment removal + whitespace changes
   - Rationale: More sophisticated plagiarism attempts
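
The transformations listed above can also be applied to the syntax tree rather than to raw text, which leaves string literals and attribute names untouched and keeps the result runnable. The sketch below is an alternative illustration, not the notebook's implementation. It assumes Python 3.9+ for `ast.unparse`, and `transform`, `_Renamer`, and `strip_docstrings` are hypothetical names.

```python
import ast
from typing import Dict


class _Renamer(ast.NodeTransformer):
    """Rename variables and parameters according to a mapping."""

    def __init__(self, mapping: Dict[str, str]):
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def strip_docstrings(tree: ast.AST) -> ast.AST:
    """Drop a leading string-literal statement from modules, classes, functions."""
    scopes = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if isinstance(node, scopes) and node.body:
            first = node.body[0]
            if (isinstance(first, ast.Expr)
                    and isinstance(first.value, ast.Constant)
                    and isinstance(first.value.value, str)):
                # Keep the body non-empty so the code stays valid
                node.body = node.body[1:] or [ast.Pass()]
    return tree


def transform(code: str, mapping: Dict[str, str]) -> str:
    """Rename identifiers and remove docstrings; comments vanish on unparse."""
    tree = strip_docstrings(ast.parse(code))
    tree = _Renamer(mapping).visit(tree)
    return ast.unparse(ast.fix_missing_locations(tree))


original = '''
def total(data):
    """Sum the values."""
    # accumulate
    result = 0
    for item in data:
        result += item  # add
    return result
'''
rewritten = transform(original, {"data": "input_data", "result": "output", "item": "element"})
print(rewritten)
```

One limitation to keep in mind: keyword arguments at call sites (`f(data=1)`) are not renamed here, so renaming a parameter can break callers that pass it by name.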

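NEGATIVE EXAMPLES (15 cases - Non-Plagiarism):
- Pairs of functions from DIFFERENT repositories
- Pairs are drawn at random; the code checks only that the two repositories differ, not that the problems are unrelated
- Rationale: Functions from separate projects are unlikely to share an implementation
- Caveat: The query is an unmodified corpus function, so it is a negative only with respect to its paired 'original_id'; a corpus-wide search will still find the query's own source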

Dataset Properties:
- Balanced: 50% positive, 50% negative
- Labeled: Each case has 'label' field (1=plagiarism, 0=original)
- Traceable: 'original_id' links to source function in corpus
- Documented: 'transformation' field records modification type

Output: data/test_dataset.json (30 test cases)

This dataset will be used for evaluation in 03_evaluation.ipynb.
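
The dataset properties listed above (balanced, labeled, traceable, documented) can be verified before evaluation with a small check. The sketch below assumes the field names written by this notebook; `check_dataset` is a hypothetical helper and the toy cases are illustrative, not taken from the real dataset.

```python
from collections import Counter
from typing import Dict, List

# Field names as written to test_dataset.json by this notebook
REQUIRED_FIELDS = {"id", "query_code", "original_id", "label", "transformation"}


def check_dataset(cases: List[Dict]) -> Dict[int, int]:
    """Validate required fields and labels; return the number of cases per label."""
    for case in cases:
        missing = REQUIRED_FIELDS - set(case)
        if missing:
            raise ValueError(f"{case.get('id', '?')}: missing {sorted(missing)}")
        if case["label"] not in (0, 1):
            raise ValueError(f"{case['id']}: label must be 0 or 1")
    return dict(Counter(case["label"] for case in cases))


# Toy cases for illustration only
toy_cases = [
    {"id": "pos_0", "query_code": "def f(): ...", "original_id": "r::a.py::f",
     "label": 1, "transformation": "variable_renaming"},
    {"id": "neg_0", "query_code": "def g(): ...", "original_id": "r2::b.py::h",
     "label": 0, "transformation": "none"},
]
counts = check_dataset(toy_cases)
print(counts)  # {1: 1, 0: 1}
```

On the real file, equal counts for labels 0 and 1 would confirm the 50/50 balance.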

In [6]:
import random
random.seed(42)
np.random.seed(42)

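def apply_variable_renaming(code: str) -> str:
    """Rename common identifiers in code to simulate plagiarism."""
    # Purely textual whole-word substitution: it also rewrites matches inside
    # strings and comments, and renaming the builtins 'list' and 'dict' can
    # leave the transformed code non-runnable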
    replacements = {
        'data': 'input_data',
        'result': 'output',
        'temp': 'tmp_var',
        'value': 'val',
        'item': 'element',
        'index': 'idx',
        'count': 'counter',
        'list': 'lst',
        'dict': 'dct',
    }
    
    modified = code
    for old, new in replacements.items():
        modified = re.sub(r'\b' + old + r'\b', new, modified)
    
    return modified

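def remove_comments_and_docstrings(code: str) -> str:
    """Remove comments and docstrings from code (line-based heuristic)."""
    lines = code.split('\n')
    cleaned_lines = []
    in_docstring = False
    
    for line in lines:
        stripped = line.strip()
        
        # Count triple-quote delimiters on this line. An odd count opens or
        # closes a multi-line docstring; an even count (e.g. a one-line
        # docstring) leaves the state unchanged. Either way the line is dropped.
        delimiters = stripped.count('"""') + stripped.count("'''")
        if delimiters:
            if delimiters % 2 == 1:
                in_docstring = not in_docstring
            continue
        
        # Skip lines in docstrings or comments
        if in_docstring or stripped.startswith('#'):
            continue
        
        # Remove inline comments (naive: also truncates at a '#' inside a string literal)
        if '#' in line:
            line = line[:line.index('#')]
        
        cleaned_lines.append(line)
    
    return '\n'.join(cleaned_lines)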

def add_extra_whitespace(code: str) -> str:
    """Add extra whitespace to code."""
    lines = code.split('\n')
    modified_lines = []
    
    for line in lines:
        modified_lines.append(line)
        if random.random() < 0.2:  # 20% chance to add blank line
            modified_lines.append('')
    
    return '\n'.join(modified_lines)

def minor_refactoring(code: str) -> str:
    """Apply minor refactoring changes."""
    # Combine multiple transformations
    code = apply_variable_renaming(code)
    code = remove_comments_and_docstrings(code)
    code = add_extra_whitespace(code)
    return code

# Select functions for test cases
# Filter to get substantial functions
substantial_functions = [f for f in all_functions if len(f['code']) > 200]
print(f"Selected {len(substantial_functions)} substantial functions for test dataset")

# Create positive examples (plagiarized)
num_positive = 15
positive_examples = []

selected_originals = random.sample(substantial_functions, num_positive)

for i, original in enumerate(selected_originals):
    # Apply different transformation types
    transformation_type = i % 3
    
    if transformation_type == 0:
        plagiarized_code = apply_variable_renaming(original['code'])
        transform_desc = "variable_renaming"
    elif transformation_type == 1:
        plagiarized_code = remove_comments_and_docstrings(original['code'])
        transform_desc = "comment_removal"
    else:
        plagiarized_code = minor_refactoring(original['code'])
        transform_desc = "minor_refactoring"
    
    positive_examples.append({
        'id': f"pos_{i}",
        'query_code': plagiarized_code,
        'original_id': original['id'],
        'label': 1,  # Plagiarized
        'transformation': transform_desc
    })

print(f"✓ Created {len(positive_examples)} positive examples")

# Create negative examples (original code)
num_negative = 15
negative_examples = []

# Negative examples: functions from unrelated domains
for i in range(num_negative):
    # Select two random functions from different repositories
    func1, func2 = random.sample(substantial_functions, 2)
    
    while func1['repo'] == func2['repo']:
        func1, func2 = random.sample(substantial_functions, 2)
    
    negative_examples.append({
        'id': f"neg_{i}",
        'query_code': func1['code'],
        'original_id': func2['id'],  # Different function
        'label': 0,  # Not plagiarized
        'transformation': 'none'
    })

print(f"✓ Created {len(negative_examples)} negative examples")

# Combine and shuffle
test_dataset = positive_examples + negative_examples
random.shuffle(test_dataset)

# Save test dataset
test_file = DATA_DIR / 'test_dataset.json'
with open(test_file, 'w', encoding='utf-8') as f:
    json.dump(test_dataset, f, indent=2)

print(f"\n✓ Test dataset created with {len(test_dataset)} examples")
print(f"  - Positive (plagiarized): {len(positive_examples)}")
print(f"  - Negative (original): {len(negative_examples)}")
print(f"✓ Saved to {test_file}")

Selected 283 substantial functions for test dataset
✓ Created 15 positive examples
✓ Created 15 negative examples

✓ Test dataset created with 30 examples
  - Positive (plagiarized): 15
  - Negative (original): 15
✓ Saved to data\test_dataset.json


## 6. Summary and Index Verification


In [7]:
print("=" * 60)
print("INDEXING COMPLETE - SUMMARY")
print("=" * 60)

print(f"\n📁 Data Collection:")
print(f"  - Repositories: {len(REPOSITORIES)}")
print(f"  - Files downloaded: {len(all_files)}")
print(f"  - Functions extracted: {len(all_functions)}")

print(f"\n🔍 Indexes Built:")
print(f"  - FAISS embedding index: {index.ntotal} vectors")
print(f"  - BM25 lexical index: {len(tokenized_corpus)} documents")

print(f"\n📊 Test Dataset:")
print(f"  - Total test cases: {len(test_dataset)}")
print(f"  - Positive examples: {len(positive_examples)}")
print(f"  - Negative examples: {len(negative_examples)}")

print(f"\n💾 Files Saved:")
print(f"  - {CORPUS_DIR}/raw_corpus.json")
print(f"  - {CORPUS_DIR}/functions_corpus.json")
print(f"  - {INDEX_DIR}/faiss_index.bin")
print(f"  - {INDEX_DIR}/embeddings.npy")
print(f"  - {INDEX_DIR}/function_metadata.pkl")
print(f"  - {INDEX_DIR}/bm25_index.pkl")
print(f"  - {INDEX_DIR}/tokenized_corpus.pkl")
print(f"  - {DATA_DIR}/test_dataset.json")

print("\n✅ All indexing tasks completed successfully!")
print("📝 Ready for interactive testing (02_interactive.ipynb)")

INDEXING COMPLETE - SUMMARY

📁 Data Collection:
  - Repositories: 5
  - Files downloaded: 150
  - Functions extracted: 399

🔍 Indexes Built:
  - FAISS embedding index: 399 vectors
  - BM25 lexical index: 399 documents

📊 Test Dataset:
  - Total test cases: 30
  - Positive examples: 15
  - Negative examples: 15

💾 Files Saved:
  - data\reference_corpus/raw_corpus.json
  - data\reference_corpus/functions_corpus.json
  - indexes/faiss_index.bin
  - indexes/embeddings.npy
  - indexes/function_metadata.pkl
  - indexes/bm25_index.pkl
  - indexes/tokenized_corpus.pkl
  - data/test_dataset.json

✅ All indexing tasks completed successfully!
📝 Ready for interactive testing (02_interactive.ipynb)
