# Code Similarity, AI Detection, and Plagiarism Analysis Notebook

This notebook provides a comprehensive pipeline for analyzing Python code submissions using advanced machine learning and NLP techniques. It is designed for:

- **AI-generated code detection**: Uses GPT-2 language model to estimate code perplexity and pattern analysis to identify AI-generated code.
- **Plagiarism detection**: Compares code against common algorithmic patterns and known sources using fast hashing and similarity matching.
- **Batch and parallel analysis**: Supports efficient batch processing and parallel execution for large-scale code review.
- **Performance metrics**: Includes benchmarking for single and batch analysis, with cache optimization for repeated code checks.
- **Device selection**: Automatically detects and utilizes GPU (CUDA) if available for faster inference, otherwise falls back to CPU.
- **Model saving**: Demonstrates saving HuggingFace models and tokenizers for production deployment.

## Key Components
- **OptimizedAIGeneratedCodeDetector**: Singleton class for AI detection using GPT-2, with caching and fast pattern analysis.
- **OptimizedPlagiarismDetector**: Detects plagiarism by matching code against pre-computed hashes and known patterns.
- **OptimizedCodeChecker**: Integrates AI and plagiarism detection, supports parallel and batch analysis, and provides actionable recommendations.
- **Test Suite**: Performance tests for single and batch code analysis, including suspicious, plagiarized, and original code examples.

## Usage
1. **Check Python environment and device**: Ensure required libraries are installed and GPU is available for optimal performance.
2. **Import libraries and classes**: All dependencies are imported and classes are defined for immediate use.
3. **Run analysis**: Use the provided test suite or custom code snippets to analyze for AI generation and plagiarism.
4. **Save models**: Save trained or pre-trained models for future use or deployment.

## Requirements
- Python 3.8+
- PyTorch
- Transformers (HuggingFace)
- NumPy, requests, pymongo

---

> **Note:** This notebook is optimized for speed, scalability, and production-readiness. All detection logic is modular and can be integrated into backend services or automated code review pipelines.

In [1]:
import sys
print(sys.executable)

e:\project\ML\syntax_env\python.exe


### Device Selection and GPU Availability

This notebook is optimized to run on systems with a CUDA-enabled GPU for faster code analysis and AI detection. The device check below will automatically detect and display the available GPU. If no GPU is found, the notebook will default to CPU execution.

- **Why GPU?** GPU acceleration significantly speeds up model inference and batch processing, making large-scale code review much more efficient.
- **Automatic Detection:** The code cell checks for GPU availability and prints the device name if found, or notifies you if only CPU is available.

> **Recommendation:** For best performance, run this notebook on a machine with a CUDA-enabled GPU.

In [7]:
import torch
gpu_available = torch.cuda.is_available()
if gpu_available:
    print(f"GPU available: {torch.cuda.get_device_name(torch.cuda.current_device())}")
else:
    print("No GPU available")

GPU available: NVIDIA GeForce GTX 1650


### Project Dependencies

This notebook uses a combination of **data processing**, **AI/ML**, **NLP**, and **database** libraries for code analysis and detection tasks.

**Core Libraries:**
- `numpy` – Numerical computations and array operations  
- `torch` – PyTorch for deep learning and model inference  
- `transformers` – Hugging Face models and tokenizers (`AutoTokenizer`, `AutoModel`, `GPT2LMHeadModel`, `GPT2Tokenizer`)  

**Code Analysis & Processing:**
- `re` – Regular expressions for pattern matching  
- `ast` – Abstract Syntax Tree parsing for code inspection  
- `hashlib` – Generating hashes for code fingerprinting  
- `difflib` – Sequence matching for similarity detection  

**Performance & Utilities:**
- `functools.lru_cache` – Caching for faster repeated computations  
- `concurrent.futures.ThreadPoolExecutor` – Parallel execution  
- `warnings` – Suppressing unnecessary warnings   


> ⚡ **Tip:** Importing these libraries upfront ensures smooth execution for code analysis, AI detection, and database operations.


In [4]:
import numpy as np
import torch
import re
import ast
import hashlib
import difflib
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from functools import lru_cache
from concurrent.futures import ThreadPoolExecutor
from typing import Dict,List,Optional
import warnings
warnings.filterwarnings('ignore')

---

## AI-Generated Code Detection Logic

This section explains the logic and methodology behind detecting AI-generated code submissions:

- **Perplexity Analysis**: Utilizes the GPT-2 language model to calculate the perplexity of code snippets. Lower perplexity values often indicate AI-generated code due to the model's familiarity with such patterns.
- **Pattern Recognition**: Analyzes code structure, variable naming conventions, comment ratios, and indentation consistency to identify characteristics typical of AI-generated code.
- **Caching for Speed**: Implements LRU caching to accelerate repeated analysis and improve scalability for large datasets.
- **Singleton Model Loading**: Ensures models are loaded only once per session, reducing memory usage and initialization time.
- **Batch and Parallel Processing**: Supports efficient batch analysis and parallel execution for rapid review of multiple code samples.

> This modular AI detection logic is designed for integration into automated code review systems, online judges, and educational platforms to help maintain code authenticity and integrity.

In [2]:
class OptimizedAIGeneratedCodeDetector:
    """Optimized AI detection with caching and faster inference"""
    
    _instance = None
    
    def __new__(cls):
        """Singleton pattern - load models only once"""
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._initialized = False
        return cls._instance
    
    def __init__(self):
        if self._initialized:
            return
            
        # Load models once and keep in memory
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        print(f"GPT-2 model loaded on {self.device}.")
        
        self.gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2').to(self.device)
        self.gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.gpt2_tokenizer.pad_token = self.gpt2_tokenizer.eos_token
        self.gpt2_model.eval() # model in eval mode for better inference.
        
        self._initialized = True
        print("GPT-2 model loaded succesfully!.")
    

    def calculate_perplexity(self,code):
        """Calculate perplexity score of given code.
           Score : Low:- AI likely, High:- Human likely.
        """

        try:
            code_lines=[line.strip() for line in code.split('\n') if line.strip()]
            code_text=' '.join(code_lines)

            encodings=self.gpt2_tokenizer(
                code_text,
                return_tensors='pt',
                trucation=True,
                max_length=512
            ).to(self.device)

            with torch.no_grad():
                outputs=self.gpt2_model(**encodings,labels=encodings['inputs_ids'])
                perplexity=torch.exp(outputs.loss)

            return float(perplexity)
        
        except Exception as e:
            print(f"Eroor calculation failed!!! : {e}")
            return 0.0
        

    def extract_ast_features(self, code):
        """Extract AST structural features"""
        features = {}
        
        try:
            tree = ast.parse(code)
            
            node_types = [type(node).__name__ for node in ast.walk(tree)]
            total_nodes = len(node_types)
            features['total_nodes'] = total_nodes
            features['unique_node_ratio'] = len(set(node_types)) / max(total_nodes, 1)
            
            def get_depths(node, depth=0):
                depths = [depth]
                for child in ast.iter_child_nodes(node):
                    depths.extend(get_depths(child, depth + 1))
                return depths
            
            depths = get_depths(tree)
            features['max_depth'] = max(depths) if depths else 0
            features['depth_variance'] = np.var(depths) if len(depths) > 1 else 0
            
            num_conditions = sum(1 for _ in ast.walk(tree) if isinstance(_, ast.If))
            num_loops = sum(1 for _ in ast.walk(tree) if isinstance(_, (ast.For, ast.While)))
            num_try = sum(1 for _ in ast.walk(tree) if isinstance(_, ast.Try))
            
            features['num_conditions'] = num_conditions
            features['num_loops'] = num_loops
            features['cyclomatic_complexity'] = 1 + num_conditions + num_loops
            
            features['is_perfectly_balanced'] = (
                features['depth_variance'] < 1.0 and 
                features['cyclomatic_complexity'] < 4
            )
            features['has_error_handling'] = num_try > 0
            features['parse_success'] = True
            
        except:
            features['parse_success'] = False
            features['parse_error'] = True
        
        return features    

    def analyze_code_patterns(self, code):
        """Analyze code patterns (naming, comments, style)"""
        features = {}
        
        lines = code.split('\n')
        total_lines = len(lines)
        
        comment_lines = sum(1 for line in lines if line.strip().startswith('#'))
        docstring_count = len(re.findall(r'"""[\s\S]*?"""|\'\'\'[\s\S]*?\'\'\'', code))
        features['comment_ratio'] = comment_lines / max(total_lines, 1)
        features['has_docstrings'] = docstring_count > 0
        
        try:
            var_pattern = r'\b[a-zA-Z_][a-zA-Z0-9_]*\b'
            var_names = re.findall(var_pattern, code)
            keywords = {'def', 'class', 'if', 'else', 'for', 'while', 'return', 'import'}
            var_names = [v for v in var_names if v not in keywords]
            
            if var_names:
                avg_name_length = np.mean([len(name) for name in var_names])
                long_names = sum(1 for name in var_names if len(name) >= 10)
                
                features['avg_var_name_length'] = avg_name_length
                features['long_name_ratio'] = long_names / len(var_names)
            else:
                features['avg_var_name_length'] = 0
                features['long_name_ratio'] = 0
        except:
            features['avg_var_name_length'] = 0
            features['long_name_ratio'] = 0
        
        indents = [len(line) - len(line.lstrip()) for line in lines if line.strip()]
        if indents:
            features['indent_consistency'] = 1 - (np.std(indents) / (np.mean(indents) + 1))
        else:
            features['indent_consistency'] = 0
        
        return features


    def detect_ai_generated(self, code):
        """
        Main AI detection method
        Combines perplexity (60%), AST (25%), patterns (15%)
        """
        
        perplexity = self.calculate_perplexity(code)
        ast_features = self.extract_ast_features(code)
        pattern_features = self.analyze_code_patterns(code)
        
        print(f"\n{'='*70}")
        print(f"AI DETECTION ANALYSIS")
        print(f"{'='*70}")
        
        # PERPLEXITY SCORING (60% weight)
        print(f"\n1. PERPLEXITY ANALYSIS")
        print(f"   Value: {perplexity:.2f}")
        
        if perplexity < 10:
            perplexity_score = 0.60
            perplexity_level = "CRITICAL"
            interpretation = "EXTREMELY LOW - Very strong AI signature"
        elif perplexity < 20:
            perplexity_score = 0.50
            perplexity_level = "HIGH"
            interpretation = "Very low - Strong AI indicator"
        elif perplexity < 35:
            perplexity_score = 0.35
            perplexity_level = "MODERATE-HIGH"
            interpretation = "Low - Significant AI suspicion"
        elif perplexity < 60:
            perplexity_score = 0.20
            perplexity_level = "MODERATE"
            interpretation = "Moderate - Some AI patterns"
        elif perplexity < 100:
            perplexity_score = 0.10
            perplexity_level = "LOW"
            interpretation = "Higher - Leaning human"
        else:
            perplexity_score = 0.0
            perplexity_level = "NONE"
            interpretation = "High - Human-like variability"
        
        print(f"   Level: {perplexity_level}")
        print(f"   {interpretation}")
        print(f"   Score: {perplexity_score:.3f} / 0.600")
        
        # AST SCORING (25% weight)
        print(f"\n2. AST STRUCTURE ANALYSIS")
        print(f"   Balanced: {ast_features.get('is_perfectly_balanced', False)}")
        print(f"   Depth Variance: {ast_features.get('depth_variance', 0):.2f}")
        print(f"   Complexity: {ast_features.get('cyclomatic_complexity', 0)}")
        
        ast_score = 0.0
        
        if not ast_features.get('parse_error', False):
            if ast_features.get('is_perfectly_balanced', False):
                ast_score += 0.15
                print(f"   Perfect balance detected (+0.15)")
            
            if ast_features.get('depth_variance', 10) < 1.5:
                ast_score += 0.10
                print(f"   Low depth variance (+0.10)")
            
            if ast_features.get('unique_node_ratio', 1.0) < 0.35:
                ast_score += 0.05
        
        ast_score = min(ast_score, 0.25)
        print(f"   Score: {ast_score:.3f} / 0.250")
        
        # PATTERN SCORING (15% weight)
        print(f"\n3. PATTERN ANALYSIS")
        print(f"   Avg Variable Length: {pattern_features.get('avg_var_name_length', 0):.1f}")
        print(f"   Comment Ratio: {pattern_features.get('comment_ratio', 0):.2f}")
        print(f"   Indent Consistency: {pattern_features.get('indent_consistency', 0):.2f}")
        
        pattern_score = 0.0
        
        if pattern_features.get('avg_var_name_length', 0) > 15:
            pattern_score += 0.08
            print(f"   Very long variable names (+0.08)")
        elif pattern_features.get('avg_var_name_length', 0) > 10:
            pattern_score += 0.04
        
        if pattern_features.get('has_docstrings', False) and pattern_features.get('comment_ratio', 0) > 0.3:
            pattern_score += 0.07
            print(f"   Excessive documentation (+0.07)")
        
        if pattern_features.get('indent_consistency', 0) > 0.98:
            pattern_score += 0.05
        
        pattern_score = min(pattern_score, 0.15)
        print(f"   Score: {pattern_score:.3f} / 0.150")
        
        # FINAL SCORING
        total_score = perplexity_score + ast_score + pattern_score
        ai_probability = min(total_score, 1.0)
        
        print(f"\n{'='*70}")
        print(f"TOTAL AI PROBABILITY: {ai_probability:.3f} ({ai_probability*100:.1f}%)")
        print(f"{'='*70}")
        
        if ai_probability >= 0.65:
            verdict = "HIGH CONFIDENCE: Code is very likely AI-generated"
            risk_level = "HIGH"
        elif ai_probability >= 0.45:
            verdict = "MODERATE CONFIDENCE: Code shows significant AI patterns"
            risk_level = "MEDIUM"
        elif ai_probability >= 0.30:
            verdict = "LOW CONFIDENCE: Some AI-like patterns detected"
            risk_level = "LOW"
        else:
            verdict = "CLEAN: Code appears human-written"
            risk_level = "NONE"
        
        conflict_detected = False
        if perplexity < 15 and ast_score < 0.10:
            conflict_detected = True
            verdict += " [CONFLICT: Low perplexity but human-like structure]"
            print(f"\nCONFLICT DETECTED:")
            print(f"  Perplexity signals strong AI ({perplexity:.2f})")
            print(f"  But AST structure appears human-like")
        
        return {
            'ai_probability': round(ai_probability, 3),
            'is_ai_generated': ai_probability >= 0.45,
            'risk_level': risk_level,
            'verdict': verdict,
            'conflict_detected': conflict_detected,
            'score_breakdown': {
                'perplexity_score': round(perplexity_score, 3),
                'ast_score': round(ast_score, 3),
                'pattern_score': round(pattern_score, 3)
            },
            'metrics': {
                'perplexity': round(perplexity, 2),
                'perplexity_level': perplexity_level,
                'ast_features': ast_features,
                'pattern_features': pattern_features
            }
        }
    
    


---

## Plagiarism Detection Logic

This section details the approach used for detecting plagiarism in code submissions:

- **Common Algorithm Patterns**: The detector checks submitted code against a set of well-known algorithmic patterns (e.g., quicksort, binary search, bubble sort, merge sort, recursive Fibonacci).
- **Hash-Based Matching**: Each pattern is pre-processed and stored as a hash for fast exact matching. If a code snippet matches a known pattern hash, it is flagged as an exact match.
- **Similarity Analysis**: For non-exact matches, the detector uses sequence similarity to compare code structure and logic, flagging submissions with high similarity scores.
- **Source Attribution**: When plagiarism is detected, the system reports the sources (e.g., StackOverflow, GitHub, LeetCode) where the matching pattern is commonly found.
- **Performance**: The logic is optimized for speed using caching and early termination, making it suitable for large-scale automated code review.

> This modular approach allows for easy extension with new patterns and sources, and can be integrated into backend services for real-time plagiarism detection.

In [3]:
class OptimizedCodeNormalizer:
    """Code normalization for plagiarism detection"""
    
    @staticmethod
    def normalize_code(code: str, level: str = 'medium') -> str:
        """
        Normalize code at different levels
        
        Args:
            code: Source code
            level: 'light', 'medium', 'aggressive'
        """
        if level == 'light':
            return OptimizedCodeNormalizer._light_normalize(code)
        elif level == 'medium':
            return OptimizedCodeNormalizer._medium_normalize(code)
        else:
            return OptimizedCodeNormalizer._aggressive_normalize(code)
    
    @staticmethod
    def _light_normalize(code: str) -> str:
        """Remove comments and whitespace"""
        code = re.sub(r'#.*$', '', code, flags=re.MULTILINE)
        code = re.sub(r'"""[\s\S]*?"""', '', code)
        code = re.sub(r"'''[\s\S]*?'''", '', code)
        code = re.sub(r'\s+', ' ', code).strip()
        return code
    
    @staticmethod
    def _medium_normalize(code: str) -> str:
        """Normalize and rename variables"""
        code = OptimizedCodeNormalizer._light_normalize(code)
        
        try:
            tree = ast.parse(code)
            
            class VariableRenamer(ast.NodeTransformer):
                def __init__(self):
                    self.var_map = {}
                    self.counter = 0
                    self.builtins = {'print', 'len', 'range', 'str', 'int', 'float', 
                                   'list', 'dict', 'set', 'sum', 'max', 'min', 'input'}
                
                def visit_Name(self, node):
                    if node.id not in self.builtins:
                        if node.id not in self.var_map:
                            self.var_map[node.id] = f'v{self.counter}'
                            self.counter += 1
                        node.id = self.var_map[node.id]
                    return node
                
                def visit_FunctionDef(self, node):
                    if node.name not in self.var_map:
                        self.var_map[node.name] = f'f{self.counter}'
                        self.counter += 1
                    node.name = self.var_map[node.name]
                    self.generic_visit(node)
                    return node
            
            renamer = VariableRenamer()
            tree = renamer.visit(tree)
            code = ast.unparse(tree)
        except:
            pass
        
        return code
    
    @staticmethod
    def _aggressive_normalize(code: str) -> str:
        """Full normalization"""
        code = OptimizedCodeNormalizer._medium_normalize(code)
        code = re.sub(r'\s+', '', code)
        code = code.lower()
        return code
    
    

In [8]:
class OptimizedPlagiarismDetector:
    
    """Optimized plagiarism detection with faster matching"""
    
    def __init__(self):
        self.normalizer=OptimizedCodeNormalizer()
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.pattern_hashes = {}
        self.patterns_by_hash = {}
        self.pattern_database={}
        self._load_common_snippets()
        

    def _load_common_snippets(self):

        """L=Common algorithms patterns"""

        patterns = {
            'quicksort_basic': {
                'code': 'def quicksort(arr): if len(arr) <= 1: return arr',
                'category':'sorting',
                'sources': ['stackoverflow', 'github']
            },
            'fibonacci_recursive': {
                'code': 'def fibonacci(n): if n <= 1: return n return fibonacci(n-1) + fibonacci(n-2)',
                'category': 'dynamic_programming',
                'sources': ['common_algorithm']
            },
            'binary_search': {
                'code': 'def binary_search(arr, target): left = 0 right = len(arr) - 1',
                'category': 'searching',
                'sources': ['leetcode', 'github']
            },
            'bubble_sort': {
                'code': 'def bubble_sort(arr): for i in range(len(arr)): for j in range(len(arr)-i-1):',
                'category':'sorting',
                'sources': ['stackoverflow']
            },
            'merge_sort': {
                'code': 'def merge_sort(arr): if len(arr) > 1: mid = len(arr) // 2',
                'category':'sorting',
                'sources': ['github', 'common_algorithm']
            },
            'bfs': {
                'code': 'def bfs(graph, start): visited = set() queue = [start] while queue:',
                'category': 'graph',
                'sources': ['stackoverflow']
            }
        }
        
        # Pre-compute all hashes
        for name, info in patterns.items():
            normalized = self.normalizer.normalize_code(info['code'], level='medium')
            pattern_hash = hashlib.md5(normalized.encode()).hexdigest()
            self.pattern_hashes[name] = pattern_hash
            self.patterns_by_hash[pattern_hash] = {
                'name': name,
                'code':info['code'],
                'normalized':normalized,
                'category': info['category'],
                'sources': info['sources'],
            }
            self.pattern_database[name]=info
    
    @lru_cache(maxsize=2000)
    def _cached_normalize(self, code:str,level:str)->str:
        """Cached normalization """
        
        return self.normalizer.normalize_code(code,level)
    
    def check_pattern_databse(self, code:str)->List[dict]:
        
        matches=[]

        for level in ['medium', 'aggressive']:
            normalized_code = self._cached_normalize(code, level)
            code_hash = hashlib.md5(normalized_code.encode()).hexdigest()
            
            if code_hash in self.patterns_by_hash:
                pattern_info = self.patterns_by_hash[code_hash]
                matches.append({
                    'pattern_name': pattern_info['name'],
                    'category': pattern_info['category'],
                    'similarity': 1.0,
                    'sources': pattern_info['sources'],
                    'match_type': 'exact',
                    'normalization_level': level
                })
                continue
            
            for pattern_hash, pattern_info in self.patterns_by_hash.items():
                pattern_normalized = pattern_info['normalized']
                
                len_ratio = len(normalized_code) / max(len(pattern_normalized), 1)
                if len_ratio < 0.5 or len_ratio > 2.0:
                    continue
                
                similarity = difflib.SequenceMatcher(
                    None, 
                    normalized_code, 
                    pattern_normalized
                ).ratio()
                
                if similarity > 0.75:
                    matches.append({
                        'pattern_name': pattern_info['name'],
                        'category': pattern_info['category'],
                        'similarity': similarity,
                        'sources': pattern_info['sources'],
                        'match_type': 'similar',
                        'normalization_level': level
                    })
        
        seen = set()
        unique_matches = []
        for match in matches:
            key = (match['pattern_name'], match['match_type'])
            if key not in seen:
                seen.add(key)
                unique_matches.append(match)
        
        return unique_matches
        
    
    def compare_submissions(self, code1: str, code2: str) -> Dict:
        """Compare two submissions"""
        results = {}
        
        for level in ['light', 'medium', 'aggressive']:
            norm1 = self._cached_normalize(code1, level)
            norm2 = self._cached_normalize(code2, level)
            
            similarity = difflib.SequenceMatcher(None, norm1, norm2).ratio()
            results[f'{level}_similarity'] = similarity
        
        try:
            tree1 = ast.parse(code1)
            tree2 = ast.parse(code2)
            
            dump1 = ast.dump(tree1)
            dump2 = ast.dump(tree2)
            
            structural_similarity = difflib.SequenceMatcher(None, dump1, dump2).ratio()
            results['structural_similarity'] = structural_similarity
        except:
            results['structural_similarity'] = 0.0
        
        results['max_similarity'] = max(
            results.get('light_similarity', 0),
            results.get('medium_similarity', 0),
            results.get('aggressive_similarity', 0)
        )
        
        results['is_similar'] = results['max_similarity'] > 0.85
        
        return results



    def detect_plagiarism(self, code: str, submission_database: Optional[List[Dict]] = None) -> Dict:
        """
        Main plagiarism detection method
        
        Args:
            code: Code to analyze
            submission_database: Optional list of previous submissions
        """
        results = {
            'is_plagiarized': False,
            'confidence': 0.0,
            'pattern_matches': [],
            'submission_matches': [],
            'sources_found': [],
            'risk_level': 'NONE'
        }
        
        pattern_matches = self.check_pattern_database(code)
        results['pattern_matches'] = pattern_matches
        
        if pattern_matches:
            max_pattern_similarity = max(m['similarity'] for m in pattern_matches)
            results['sources_found'] = list(set(
                source for match in pattern_matches for source in match['sources']
            ))
            
            if max_pattern_similarity >= 0.95:
                results['is_plagiarized'] = True
                results['confidence'] = max_pattern_similarity
                results['risk_level'] = 'HIGH'
                results['reason'] = "Exact copy of common pattern"
            elif max_pattern_similarity >= 0.85:
                results['is_plagiarized'] = True
                results['confidence'] = max_pattern_similarity
                results['risk_level'] = 'MEDIUM'
                results['reason'] = "Very similar to common pattern"
        
        if submission_database:
            for submission in submission_database:
                comparison = self.compare_submissions(code, submission['code'])
                
                if comparison['max_similarity'] > 0.85:
                    results['submission_matches'].append({
                        'user_id': submission.get('user_id', 'unknown'),
                        'timestamp': submission.get('timestamp', 'unknown'),
                        'similarity': comparison['max_similarity'],
                        'details': comparison
                    })
                    
                    if comparison['max_similarity'] > results['confidence']:
                        results['is_plagiarized'] = True
                        results['confidence'] = comparison['max_similarity']
                        results['risk_level'] = 'HIGH' if comparison['max_similarity'] > 0.95 else 'MEDIUM'
                        results['reason'] = "Matches another submission"
        
        if not results['is_plagiarized'] and results['pattern_matches']:
            avg_similarity = np.mean([m['similarity'] for m in pattern_matches])
            if avg_similarity > 0.6:
                results['risk_level'] = 'LOW'
                results['reason'] = "Uses common algorithm patterns (acceptable)"
        
        return results
    

---

## Code Analysis and Recommendation Engine

This section introduces the integrated code analysis engine, which combines AI-generated code detection and plagiarism detection to provide a comprehensive assessment of code submissions:

- **Parallel Processing**: The engine runs AI and plagiarism checks in parallel for faster results, making it suitable for batch analysis and large datasets.
- **Suspiciousness Scoring**: Each code snippet is evaluated for signs of AI generation and plagiarism, with an overall suspiciousness score and detailed breakdown.
- **Actionable Recommendations**: Based on the analysis, the engine provides clear recommendations, such as flagging high-risk submissions, suggesting manual review, or confirming clean code.
- **Batch Support**: Multiple code samples can be analyzed simultaneously, with results aggregated for efficient review.
- **Cache Management**: Built-in cache clearing methods help manage memory and maintain performance during repeated or large-scale analysis.

> This modular recommendation engine can be integrated into automated code review systems, online judges, or educational platforms to enhance code integrity and fairness.

In [13]:

class OptimizedCodeChecker:
    
    
    def __init__(self):
        self.ai_detector = OptimizedAIGeneratedCodeDetector()
        self.plagiarism_detector = OptimizedPlagiarismDetector()
        self.executor=ThreadPoolExecutor(max_workers=2)
    
    def analyze_code(self, code:str, submission_database:Optional[List[Dict]]=None,parallel:bool=True)->Dict:
        
        if parallel:
            
            future_ai = self.executor.submit(self.ai_detector.detect_ai_generated, code)
            future_plag = self.executor.submit(self.plagiarism_detector.detect_plagiarism, code,submission_database)
            
            ai_result = future_ai.result()
            plag_result = future_plag.result()
        else:
            ai_result = self.ai_detector.detect_ai_generated(code)
            plag_result = self.plagiarism_detector.detect_plagiarism(code)
        
        overall_suspicious = (
            ai_result['is_ai_generated'] or 
            plag_result['is_plagiarized']
        )

        risk_levels={'NONE': 0, 'LOW': 1, 'MEDIUM': 2, 'HIGH': 3}
        ai_risk = risk_levels.get(ai_result['risk_level'], 0)
        plag_risk = risk_levels.get(plag_result['risk_level'], 0)
        
        overall_risk_value = max(ai_risk, plag_risk)
        overall_risk = [k for k, v in risk_levels.items() if v == overall_risk_value][0]


        return{
            'overall_suspicious': overall_suspicious,
            'overall_risk_level':overall_risk,
            'ai_detection': ai_result,
            'plagiarism_detection': plag_result,
            'recommendation': self._generate_recommendation(ai_result, plag_result),
            'action':self._determine_action(ai_result,plag_result)
        }
    
    def analyze_batch(self, codes:List[str])->List[Dict]:
        
        results = []
        with ThreadPoolExecutor(max_workers=4) as executor:
            futures = [executor.submit(self.analyze_code, code,None, False) for code in codes]
            results = [future.result() for future in futures]
        
        return results
    
    def _get_recommendation(self, ai_result:Dict, plag_result:Dict)->str:
        """Fast recommendation generation"""
        
        if ai_result['is_ai_generated'] and plag_result['is_plagiarized']:
            return (f"CRITICAL: Code shows both AI generation "
                   f"(confidence: {ai_result['ai_probability']:.1%}) "
                   f"and plagiarism (similarity: {plag_result['confidence']:.1%})")
        
        elif ai_result['is_ai_generated']:
            return (f"AI DETECTED: Code likely AI-generated "
                   f"(confidence: {ai_result['ai_probability']:.1%})")
        
        elif plag_result['is_plagiarized']:
            sources = plag_result.get('sources_found', [])
            if sources:
                source_str = ', '.join(sources)
                return f"PLAGIARISM DETECTED: Matches known sources ({source_str})"
            else:
                return "PLAGIARISM DETECTED: Matches other submissions"
        
        elif ai_result['ai_probability'] > 0.4:
            return "MODERATE RISK: Some AI-like patterns detected"
        
        elif plag_result['risk_level'] == 'LOW':
            return "ACCEPTABLE: Uses common algorithm patterns"
        
        else:
            return "CLEAN: No significant issues detected"
    
    def _determine_action(self, ai_result: Dict, plag_result: Dict) -> str:
        """Determine action"""
        
        if ai_result['is_ai_generated'] and plag_result['is_plagiarized']:
            return "BLOCK_AND_REPORT"
        elif plag_result['is_plagiarized'] and plag_result['risk_level'] == 'HIGH':
            return "INVESTIGATE"
        elif ai_result['is_ai_generated'] and ai_result['risk_level'] == 'HIGH':
            return "FLAG_FOR_REVIEW"
        elif ai_result['ai_probability'] > 0.4 or plag_result['risk_level'] != 'NONE':
            return "MONITOR"
        else:
            return "ACCEPT"
        

    def clear_cache(self):
        """Clear LRU caches to free memory"""
        if hasattr(self.plagiarism_detector, '_cached_normalize'):
            self.plagiarism_detector._cached_normalize.cache_clear()



---

## Test Cases and Performance Metrics

This section provides a suite of test cases to validate the code analysis engine and benchmark its performance:

- **Test Coverage**: Includes examples of suspicious AI-like code, common algorithms (potential plagiarism), and original code to demonstrate detection capabilities.
- **Single and Batch Analysis**: Measures the time taken for individual and batch code analysis, highlighting the speedup from parallel processing.
- **Performance Reporting**: Outputs suspiciousness, AI probability, plagiarism status, and recommendations for each test case, along with timing metrics.
- **Cache Efficiency**: (Commented) Optionally tests cache performance for repeated analysis, showing the benefits of caching in large-scale scenarios.

> Use these tests to ensure the reliability and efficiency of the detection pipeline before deploying in production or integrating with automated review systems.

In [14]:
if __name__ == "__main__":
    checker = OptimizedCodeChecker()
    
    test_code = '''
def bubble_sort(arr):
    for i in range(len(arr)):
        for j in range(len(arr) - i - 1):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]
    return arr
    '''
    
    print("="*70)
    print("COMPREHENSIVE CODE ANALYSIS TEST")
    print("="*70)
    
    result = checker.analyze_code(test_code)
    
    print(f"\n\nOVERALL ASSESSMENT:")
    print(f"  Suspicious: {result['overall_suspicious']}")
    print(f"  Risk Level: {result['overall_risk_level']}")
    print(f"  Action: {result['action']}")
    print(f"  Recommendation: {result['recommendation']}")
    
    print(f"\nAI DETECTION:")
    print(f"  Probability: {result['ai_detection']['ai_probability']:.1%}")
    print(f"  Risk: {result['ai_detection']['risk_level']}")
    
    print(f"\nPLAGIARISM DETECTION:")
    print(f"  Is Plagiarized: {result['plagiarism_detection']['is_plagiarized']}")
    print(f"  Confidence: {result['plagiarism_detection']['confidence']:.1%}")
    print(f"  Risk: {result['plagiarism_detection']['risk_level']}")
    
    print("="*70)

Keyword arguments {'trucation': True} not recognized.


COMPREHENSIVE CODE ANALYSIS TEST
Eroor calculation failed!!! : 'inputs_ids'

AI DETECTION ANALYSIS

1. PERPLEXITY ANALYSIS
   Value: 0.00
   Level: CRITICAL
   EXTREMELY LOW - Very strong AI signature
   Score: 0.600 / 0.600

2. AST STRUCTURE ANALYSIS
   Balanced: False
   Depth Variance: 4.81
   Complexity: 4
   Score: 0.050 / 0.250

3. PATTERN ANALYSIS
   Avg Variable Length: 2.7
   Comment Ratio: 0.00
   Indent Consistency: 0.36
   Score: 0.000 / 0.150

TOTAL AI PROBABILITY: 0.650 (65.0%)

CONFLICT DETECTED:
  Perplexity signals strong AI (0.00)
  But AST structure appears human-like


AttributeError: 'OptimizedPlagiarismDetector' object has no attribute 'check_pattern_database'

---

## Model Saving and Deployment

This section demonstrates how to save the optimized GPT-2 model and tokenizer in HuggingFace format for future use or production deployment. Saving models in this way allows for easy loading, sharing, and integration into backend services or cloud environments.

- **save_dir**: The directory where the model and tokenizer will be stored.
- **gpt2_model.save_pretrained(save_dir)**: Saves the model weights and configuration.
- **gpt2_tokenizer.save_pretrained(save_dir)**: Saves the tokenizer files for consistent preprocessing.

> After saving, you can reload the model and tokenizer using `from_pretrained(save_dir)` in any compatible environment.

**Tip:** Always version your saved models and document the training or fine-tuning process for reproducibility and auditability.

In [None]:
# Save the optimized model and tokenizer
save_dir = "../models"                 # choose a folder name
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# save everything in Hugging-Face format
gpt2_model.save_pretrained(save_dir)
gpt2_tokenizer.save_pretrained(save_dir)

print(f"Model and tokenizer saved in {save_dir}")


Model and tokenizer saved in ../models
