
# 🔍 DocInsight Enhanced - Complete Real Dataset Demo

## 🚀 Production-Ready Plagiarism Detection with Real Datasets

**What's New in v2.0:**
- ✅ **Real Dataset Integration**: PAWS, Wikipedia, arXiv (50,000+ sentences)
- ✅ **NO Hardcoded Corpus**: Purely data-driven approach
- ✅ **Complete Document Analysis**: Every sentence analyzed, not just highlights
- ✅ **Advanced ML Pipeline**: SentenceTransformers + Cross-encoder + Stylometry
- ✅ **Production-Ready Web Interface**: One-click document upload and analysis

This notebook demonstrates the complete DocInsight system from setup to analysis.

## 📦 Installation and Setup

Install all required dependencies for real dataset integration:

In [None]:
# Install all required packages
!pip install -q sentence-transformers faiss-cpu transformers datasets wikipedia
!pip install -q spacy textstat python-docx pymupdf streamlit requests

# Download spaCy model (optional, enhances linguistic analysis)
!python -m spacy download en_core_web_sm

print("✅ Installation complete!")

## 🌐 Real Dataset Integration

Load and test the real dataset integration:

In [None]:
import sys
import time
import logging
from pathlib import Path

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

print("🔍 DocInsight Enhanced v2.0")
print("Real Dataset Integration Demo")
print("=" * 50)

In [None]:
# Test dataset loading capabilities
from dataset_loaders import DatasetLoader

# Initialize dataset loader
loader = DatasetLoader()

print("📊 Testing dataset loading capabilities...")
print()

# Test PAWS dataset loading
print("1. 🔤 Loading PAWS paraphrase dataset...")
start_time = time.time()
paws_sentences = loader.load_paws_dataset(max_samples=1000)
paws_time = time.time() - start_time
print(f"   ✅ Loaded {len(paws_sentences)} PAWS sentences in {paws_time:.1f}s")
if paws_sentences:
    print(f"   📝 Sample: {paws_sentences[0][:100]}...")
print()

# Test Wikipedia loading
print("2. 📚 Loading Wikipedia articles...")
start_time = time.time()
wiki_topics = ["Machine learning", "Climate change", "Healthcare"]
wiki_sentences = loader.load_wikipedia_articles(topics=wiki_topics, sentences_per_topic=50)
wiki_time = time.time() - start_time
print(f"   ✅ Loaded {len(wiki_sentences)} Wikipedia sentences in {wiki_time:.1f}s")
if wiki_sentences:
    print(f"   📝 Sample: {wiki_sentences[0][:100]}...")
print()

# Test arXiv loading
print("3. 🎓 Loading arXiv academic papers...")
start_time = time.time()
arxiv_categories = ["cs.AI", "cs.CL"]
arxiv_sentences = loader.load_arxiv_abstracts(categories=arxiv_categories, max_papers=100)
arxiv_time = time.time() - start_time
print(f"   ✅ Loaded {len(arxiv_sentences)} arXiv sentences in {arxiv_time:.1f}s")
if arxiv_sentences:
    print(f"   📝 Sample: {arxiv_sentences[0][:100]}...")
print()

total_sentences = len(paws_sentences) + len(wiki_sentences) + len(arxiv_sentences)
print(f"🎉 Successfully loaded {total_sentences} sentences from real datasets!")
print(f"📊 Sources: PAWS ({len(paws_sentences)}), Wikipedia ({len(wiki_sentences)}), arXiv ({len(arxiv_sentences)})")
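
The three lists loaded above still have to be combined into one corpus. The sketch below shows one plausible way to merge them, dropping near-duplicate and very short sentences. The helper `merge_sources` and its `min_words` parameter are hypothetical and are not part of `dataset_loaders`; the real corpus builder may filter differently.

```python
import re

def merge_sources(*sources, min_words=5):
    """Merge sentence lists, skipping short sentences and case/punctuation duplicates.

    Hypothetical helper for illustration; not the DocInsight implementation.
    """
    seen = set()
    merged = []
    for sentences in sources:
        for sentence in sentences:
            text = " ".join(sentence.split())  # collapse runs of whitespace
            # Normalized key: lowercase, letters/digits/spaces only
            key = re.sub(r"[^a-z0-9 ]", "", text.lower())
            if len(text.split()) < min_words or key in seen:
                continue
            seen.add(key)
            merged.append(text)
    return merged

# Toy inputs standing in for paws_sentences and wiki_sentences
paws = ["The cat sat on the mat today.", "Too short."]
wiki = [
    "the cat sat on the mat today",
    "Machine learning studies algorithms that improve with data.",
]
corpus = merge_sources(paws, wiki)
print(f"Merged corpus size: {len(corpus)}")
```

In the notebook, the same call would take `paws_sentences`, `wiki_sentences`, and `arxiv_sentences` as arguments.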

## 🧠 Corpus Building and ML Pipeline

Build the complete corpus, generate semantic embeddings, and create the search index:

In [None]:
from corpus_builder import CorpusIndex
from enhanced_pipeline import PlagiarismDetector

print("🏗️ Building DocInsight corpus and ML pipeline...")
print()

# Create corpus index with reasonable size for demo
print("1. 📚 Building corpus from real datasets...")
start_time = time.time()
corpus_index = CorpusIndex(target_size=5000)  # 5K sentences for demo
success = corpus_index.load_or_build()
corpus_time = time.time() - start_time

if not success:
    print("❌ Failed to build corpus - check network connection")
    raise RuntimeError("Cannot proceed without real datasets")

print(f"   ✅ Built corpus with {len(corpus_index.sentences)} sentences in {corpus_time:.1f}s")
print()

# Build semantic embeddings
print("2. 🧠 Generating semantic embeddings...")
start_time = time.time()
corpus_index.build_embeddings()
embed_time = time.time() - start_time
print(f"   ✅ Generated embeddings in {embed_time:.1f}s")
print()

# Build FAISS search index
print("3. ⚡ Building FAISS search index...")
start_time = time.time()
index_built = corpus_index.build_index()
index_time = time.time() - start_time

if index_built:
    print(f"   ✅ Built FAISS index in {index_time:.1f}s")
else:
    print("   ⚠️ FAISS not available, using fallback search")
print()

# Initialize plagiarism detector
print("4. 🔍 Initializing plagiarism detector...")
detector = PlagiarismDetector(corpus_index)
print("   ✅ Detector ready with complete ML pipeline")
print()

total_time = corpus_time + embed_time + index_time
print(f"🎉 DocInsight fully initialized in {total_time:.1f}s!")
print(f"📊 Ready for plagiarism detection with {len(corpus_index.sentences)} real sentences")
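
To make the "fallback search" mentioned above concrete, here is a minimal sketch of brute-force cosine-similarity search in pure Python. It assumes unit-normalized vectors, so the dot product equals cosine similarity. The `toy_embed` function is a hypothetical stand-in for the sentence encoder (a hashed bag of words), used only so the example runs without downloading a model; it is not how `CorpusIndex` computes embeddings.

```python
import math
import zlib

DIM = 64  # toy dimension chosen for this sketch

def toy_embed(text):
    """Hypothetical stand-in for a sentence encoder: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash()
        vec[zlib.crc32(token.strip(".,!?").encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def search(query, corpus, corpus_vecs, k=3):
    """Return the top-k (score, sentence) pairs by cosine similarity."""
    q = toy_embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, v)), sentence)
        for sentence, v in zip(corpus, corpus_vecs)
    ]
    return sorted(scored, reverse=True)[:k]

corpus = [
    "Neural networks are trained with backpropagation.",
    "Greenhouse gases trap heat in the atmosphere.",
    "Wikipedia is a free online encyclopedia.",
]
corpus_vecs = [toy_embed(s) for s in corpus]  # computed once, reused per query

hits = search("Neural networks are trained with backpropagation.", corpus, corpus_vecs, k=1)
for score, sentence in hits:
    print(f"{score:.3f}  {sentence}")
```

A FAISS index answers the same nearest-neighbor question; it replaces the linear scan with an optimized index, which matters as the corpus grows.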

## 🔍 Plagiarism Detection Demo

Test the complete plagiarism detection pipeline:

In [None]:
import json

print("🔍 DocInsight Plagiarism Detection Demo")
print("=" * 45)
print()

# Test sentences with varying similarity levels
test_sentences = [
    "Machine learning algorithms can identify patterns in large datasets.",  # Likely in corpus
    "Artificial intelligence systems are becoming increasingly sophisticated.",  # Common topic
    "Climate change poses significant challenges to global sustainability.",  # Wikipedia content
    "The quick brown fox jumps over the lazy dog.",  # Unlikely match
    "Neural networks utilize backpropagation for training deep architectures.",  # Technical
]

print("🧪 Testing individual sentence analysis:")
print()

for i, sentence in enumerate(test_sentences, 1):
    print(f"{i}. Analyzing: '{sentence}'")
    
    start_time = time.time()
    result = detector.analyze_sentence(sentence)
    analysis_time = time.time() - start_time
    
    print(f"   📊 Fused Score: {result.fused_score:.3f}")
    print(f"   🎯 Confidence: {result.confidence}")
    print(f"   📝 Stylometry Score: {result.stylometry_score:.3f}")
    print(f"   ⏱️ Analysis Time: {analysis_time:.3f}s")
    
    if result.matches:
        print(f"   🔍 Top Match: {result.matches[0].similarity:.3f} - '{result.matches[0].text[:60]}...'")
    else:
        print("   ℹ️ No significant matches found")
    
    print()

print("✅ Individual sentence analysis complete!")
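
The fused score printed above combines several signals into one number. The sketch below illustrates the general idea of a weighted combination followed by thresholding into HIGH/MEDIUM/LOW. The weights (0.6, 0.3, 0.1) and thresholds (0.8, 0.5) are assumptions chosen for illustration; the values used inside `PlagiarismDetector` are not shown in this notebook and may differ.

```python
def fuse_scores(semantic, cross_encoder, stylometry, weights=(0.6, 0.3, 0.1)):
    """Weighted average of three signals, each expected in [0, 1].

    The default weights are hypothetical, not DocInsight's actual values.
    """
    w_sem, w_ce, w_sty = weights
    total = w_sem + w_ce + w_sty
    return (w_sem * semantic + w_ce * cross_encoder + w_sty * stylometry) / total

def confidence_label(fused, high=0.8, medium=0.5):
    """Map a fused score to a confidence level using assumed thresholds."""
    if fused >= high:
        return "HIGH"
    if fused >= medium:
        return "MEDIUM"
    return "LOW"

examples = {
    "near-verbatim match": (0.9, 0.8, 0.5),
    "unrelated sentence": (0.2, 0.1, 0.5),
}
for name, signals in examples.items():
    fused = fuse_scores(*signals)
    print(f"{name}: fused={fused:.2f} -> {confidence_label(fused)}")
```

Dividing by the weight total keeps the fused score in [0, 1] even if the weights do not sum to exactly 1.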

## 📄 Complete Document Analysis

Test comprehensive document analysis (the main feature):

In [None]:
# Sample document for testing
test_document = """
Machine learning has become a fundamental tool in modern artificial intelligence. 
Neural networks can approximate complex non-linear functions through deep architectures. 
The training process involves optimizing weights using gradient descent algorithms. 
Natural language processing enables computers to understand and generate human text. 
Computer vision systems analyze visual information to recognize objects and patterns.

Climate change represents one of the most pressing challenges of our time. 
Greenhouse gas emissions from human activities are warming the planet. 
Renewable energy sources offer sustainable alternatives to fossil fuels. 
Biodiversity loss threatens ecosystem stability and resilience worldwide.

This document contains a mix of technical and environmental content. 
Some sentences may have high similarity to existing literature. 
The plagiarism detection system should identify potential matches. 
Overall analysis will provide comprehensive risk assessment.
""".strip()

print("📄 Complete Document Analysis Demo")
print("=" * 40)
print()
print(f"📝 Document Length: {len(test_document)} characters")
print(f"📊 Estimated Sentences: ~{len([s for s in test_document.split('.') if s.strip()])}")
print()

print("🔍 Analyzing entire document...")
start_time = time.time()
doc_results = detector.analyze_document(test_document)
analysis_time = time.time() - start_time

print(f"✅ Analysis complete in {analysis_time:.2f}s")
print()

# Display overall statistics
stats = doc_results['overall_stats']
print("📊 OVERALL DOCUMENT STATISTICS:")
print(f"   📝 Total Sentences Analyzed: {stats['total_sentences']}")
print(f"   📈 Average Similarity Score: {stats['avg_fused_score']:.3f}")
print(f"   🔝 Maximum Similarity Score: {stats['max_fused_score']:.3f}")
print(f"   🔴 HIGH Risk Sentences: {stats['high_confidence_count']}")
print(f"   🟡 MEDIUM Risk Sentences: {stats['medium_confidence_count']}")
print(f"   🟢 LOW Risk Sentences: {stats['low_confidence_count']}")
print(f"   ⚠️ High Risk Percentage: {stats['high_risk_ratio']*100:.1f}%")
print()

# Risk assessment
if stats['high_confidence_count'] > 0:
    risk_level = "🔴 HIGH RISK"
    recommendation = "Immediate review required - potential plagiarism detected"
elif stats['medium_confidence_count'] > 0:
    risk_level = "🟡 MEDIUM RISK"
    recommendation = "Manual review recommended - moderate similarities found"
else:
    risk_level = "🟢 LOW RISK"
    recommendation = "No significant plagiarism concerns detected"

print(f"🚨 RISK ASSESSMENT: {risk_level}")
print(f"💡 RECOMMENDATION: {recommendation}")
print()

print("📋 DETAILED SENTENCE-BY-SENTENCE ANALYSIS:")
print("=" * 50)

for i, analysis in enumerate(doc_results['sentence_analyses'], 1):
    confidence = analysis['confidence']
    score = analysis['fused_score']
    sentence = analysis['sentence']
    matches = analysis['matches']
    
    # Color coding
    if confidence == "HIGH":
        emoji = "🔴"
    elif confidence == "MEDIUM":
        emoji = "🟡"
    else:
        emoji = "🟢"
    
    print(f"{emoji} Sentence {i}: {confidence} RISK (Score: {score:.3f})")
    print(f"   Text: {sentence}")
    
    if matches:
        print(f"   🔍 {len(matches)} similar matches found:")
        for j, match in enumerate(matches[:2], 1):  # Show top 2
            print(f"      {j}. Similarity: {match['similarity']:.3f} - {match['text'][:80]}...")
    else:
        print("   ℹ️ No significant matches")
    
    print()

print("✅ Complete document analysis finished!")
print(f"📊 Analyzed {stats['total_sentences']} sentences in {analysis_time:.2f}s")
print(f"⚡ Average time per sentence: {analysis_time / max(stats['total_sentences'], 1):.3f}s")
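
The `overall_stats` dictionary used above can be understood as a simple aggregation over the per-sentence results. The sketch below recomputes the same keys from a list shaped like `doc_results['sentence_analyses']`. The function `summarize` is hypothetical and only mirrors the key names seen in this notebook; the detector's own aggregation may include more fields.

```python
def summarize(sentence_analyses):
    """Aggregate per-sentence results into overall statistics (illustrative only)."""
    total = len(sentence_analyses)
    scores = [a["fused_score"] for a in sentence_analyses]
    counts = {
        level: sum(1 for a in sentence_analyses if a["confidence"] == level)
        for level in ("HIGH", "MEDIUM", "LOW")
    }
    return {
        "total_sentences": total,
        "avg_fused_score": sum(scores) / total if total else 0.0,
        "max_fused_score": max(scores, default=0.0),
        "high_confidence_count": counts["HIGH"],
        "medium_confidence_count": counts["MEDIUM"],
        "low_confidence_count": counts["LOW"],
        # Guard against empty documents to avoid division by zero
        "high_risk_ratio": counts["HIGH"] / total if total else 0.0,
    }

# Made-up sample data for illustration
sample = [
    {"fused_score": 0.9, "confidence": "HIGH"},
    {"fused_score": 0.6, "confidence": "MEDIUM"},
    {"fused_score": 0.2, "confidence": "LOW"},
    {"fused_score": 0.3, "confidence": "LOW"},
]
sample_stats = summarize(sample)
print(sample_stats)
```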

## 🎯 Report Generation

Generate comprehensive reports in multiple formats:

In [None]:
import json
from datetime import datetime

print("📋 Report Generation Demo")
print("=" * 30)
print()

# Generate JSON report
print("1. 📄 Generating JSON report...")
json_report = json.dumps(doc_results, indent=2, ensure_ascii=False)
print(f"   ✅ JSON report generated ({len(json_report)} characters)")

# Save JSON report
with open('plagiarism_report.json', 'w', encoding='utf-8') as f:
    f.write(json_report)
print("   💾 Saved as 'plagiarism_report.json'")
print()

# Generate summary report
print("2. 📊 Generating summary report...")
summary_report = f"""DocInsight Enhanced - Plagiarism Analysis Report
================================================

Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
Document Length: {len(test_document)} characters
Corpus Source: Real datasets (PAWS, Wikipedia, arXiv)
Total Corpus Size: {len(corpus_index.sentences)} sentences

OVERALL STATISTICS:
- Total Sentences Analyzed: {stats['total_sentences']}
- Average Similarity Score: {stats['avg_fused_score']:.3f}
- Maximum Similarity Score: {stats['max_fused_score']:.3f}
- HIGH Risk Sentences: {stats['high_confidence_count']}
- MEDIUM Risk Sentences: {stats['medium_confidence_count']}
- LOW Risk Sentences: {stats['low_confidence_count']}
- High Risk Percentage: {stats['high_risk_ratio']*100:.1f}%

RISK ASSESSMENT:
{risk_level} - {recommendation}

METHODOLOGY:
- Semantic Similarity: SentenceTransformers (all-MiniLM-L6-v2)
- Cross-encoder Reranking: ms-marco-MiniLM-L-6-v2
- Stylometry Analysis: 15+ linguistic features
- Search Index: FAISS for fast similarity search
- Score Fusion: Weighted combination of multiple signals

DATA SOURCES:
- PAWS: Paraphrase Adversaries from Word Scrambling
- Wikipedia: Multi-domain encyclopedia articles
- arXiv: Academic paper abstracts

Generated by DocInsight Enhanced v2.0
Real Dataset Integration - No Hardcoded Corpus
"""

# Save summary report
with open('plagiarism_summary.txt', 'w', encoding='utf-8') as f:
    f.write(summary_report)

print(f"   ✅ Summary report generated ({len(summary_report)} characters)")
print("   💾 Saved as 'plagiarism_summary.txt'")
print()

print("📋 Sample of summary report:")
print("-" * 40)
print(summary_report[:800] + "...")
print("-" * 40)
print()

print("✅ Report generation complete!")
print("📁 Files generated: plagiarism_report.json, plagiarism_summary.txt")
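
Once saved, the JSON report can be read back by other tools. The sketch below loads a report and lists the sentences flagged HIGH or MEDIUM. It assumes the report has the same `sentence_analyses` structure used earlier in this notebook; `flagged_sentences` is a hypothetical helper, and the sample report is made-up data written to a temporary directory so the example is self-contained.

```python
import json
import tempfile
from pathlib import Path

def flagged_sentences(report_path, levels=("HIGH", "MEDIUM")):
    """Return sentences whose confidence is in `levels` (illustrative helper)."""
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    return [
        a["sentence"]
        for a in report["sentence_analyses"]
        if a["confidence"] in levels
    ]

# Made-up report mirroring the structure of doc_results
sample_report = {
    "overall_stats": {"total_sentences": 2},
    "sentence_analyses": [
        {"sentence": "Greenhouse gases warm the planet.", "confidence": "HIGH",
         "fused_score": 0.91, "matches": []},
        {"sentence": "This sentence is original.", "confidence": "LOW",
         "fused_score": 0.12, "matches": []},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "plagiarism_report.json"
    path.write_text(json.dumps(sample_report, ensure_ascii=False), encoding="utf-8")
    flagged = flagged_sentences(path)

print(flagged)
```

In the notebook, calling `flagged_sentences('plagiarism_report.json')` would apply the same filter to the file written above.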

## 🌐 Streamlit Web Interface

Launch the production-ready web interface:

In [None]:
print("🚀 DocInsight Web Interface Setup")
print("=" * 35)
print()

print("📋 Instructions to run the Streamlit web interface:")
print()
print("1. 📁 Save this notebook and ensure all files are in the same directory:")
print("   - corpus_builder.py")
print("   - dataset_loaders.py")
print("   - enhanced_pipeline.py")
print("   - streamlit_app.py")
print("   - requirements.txt")
print()
print("2. 🖥️ Open a terminal/command prompt and navigate to the directory")
print()
print("3. 🚀 Run the following command:")
print("   streamlit run streamlit_app.py")
print()
print("4. 🌐 The web interface will open at: http://localhost:8501")
print()
print("🎯 Web Interface Features:")
print("- 📤 Upload documents (TXT, PDF, DOCX)")
print("- 🔍 One-click plagiarism analysis")
print("- 📊 Complete document review with confidence scoring")
print("- 📋 Downloadable reports (JSON + summary)")
print("- 🌐 Real dataset integration (no hardcoded corpus)")
print("- ⚡ FAISS-powered fast search")
print()

# Alternative: Run demo script
print("🎁 Alternative - Complete Demo Script:")
print("   python docinsight_demo.py")
print("   (Includes automatic setup + Streamlit launch)")
print()

print("✅ DocInsight Enhanced is ready for production use!")
print("🔍 Upload any document and get comprehensive plagiarism analysis")
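
Supporting TXT, PDF, and DOCX uploads requires a text-extraction step before analysis. The sketch below shows one way to dispatch on file extension using PyMuPDF (`fitz`) and python-docx, both installed at the top of this notebook. The function `extract_text` is hypothetical; how `streamlit_app.py` actually reads uploads is not shown here. The third-party imports are deferred so the TXT path works even when those packages are missing.

```python
import tempfile
from pathlib import Path

def extract_text(path):
    """Extract plain text from a .txt, .pdf, or .docx file (illustrative sketch)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8", errors="replace")
    if suffix == ".pdf":
        import fitz  # PyMuPDF; imported lazily
        with fitz.open(str(path)) as doc:
            return "\n".join(page.get_text() for page in doc)
    if suffix == ".docx":
        import docx  # python-docx; imported lazily
        return "\n".join(p.text for p in docx.Document(str(path)).paragraphs)
    raise ValueError(f"Unsupported file type: {suffix}")

# Demonstrate the TXT path with a temporary file
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("Renewable energy reduces emissions.", encoding="utf-8")
    sample_text = extract_text(sample_path)

print(sample_text)
```

The extracted string could then be passed to `detector.analyze_document(...)` as in the document analysis cell.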

## 🎉 Summary and Next Steps

**DocInsight Enhanced v2.0 Features Demonstrated:**

✅ **Real Dataset Integration**: 
- PAWS paraphrase dataset for advanced paraphrase detection
- Wikipedia articles covering 25+ diverse topics
- arXiv academic papers for scholarly content
- **NO hardcoded fallback corpus**

✅ **Complete Document Analysis**:
- Every sentence analyzed and scored
- Confidence levels: HIGH/MEDIUM/LOW
- Comprehensive risk assessment
- Detailed similarity matches with sources

✅ **Advanced ML Pipeline**:
- SentenceTransformers for semantic similarity
- Cross-encoder reranking for precision
- Stylometry analysis with 15+ linguistic features
- FAISS indexing for sub-second search
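
As a rough illustration of what stylometric features look like, the sketch below computes four simple ones from raw text. These particular features and the function `stylometric_features` are assumptions for illustration; the 15+ features used by DocInsight are not enumerated in this notebook.

```python
import re

def stylometric_features(text):
    """Compute a few simple writing-style features (illustrative subset)."""
    # Split on whitespace that follows sentence-ending punctuation
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words) or 1  # avoid division by zero on empty input
    return {
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,  # vocabulary diversity
        "comma_rate": text.count(",") / n_words,
    }

features = stylometric_features("The cat sat. The cat ran, then slept.")
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Comparing such feature vectors between passages is one way to spot shifts in writing style within a document.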

✅ **Production-Ready Interface**:
- Streamlit web app with file upload
- One-click analysis workflow
- Multiple report formats (JSON, summary)
- Real-time progress indicators

**🎯 Ready for Production Use:**
1. Run `python docinsight_demo.py` for complete setup
2. Upload any document (TXT, PDF, DOCX) via web interface
3. Get comprehensive plagiarism analysis report
4. No manual corpus upload needed - fully automated!

**🚀 This system now provides enterprise-grade plagiarism detection with real dataset integration!**