# 🎓 Academic Paper Classifier v7.0

**Automatically classify academic papers by discipline, subfield, and research methodology**

This notebook provides a complete, production-ready classification system for academic papers. Simply input a title and abstract, and get back the paper's:

- **🎯 Discipline**: Computer Science (CS), Information Systems (IS), or Information Technology (IT)
- **🔬 Subfield**: Specialized area within the discipline (e.g., AI/ML, Security, DevOps)
- **📊 Methodology**: Research approach (Qualitative, Quantitative, or Mixed)

## 🚀 Quick Start
1. **Run the cells in Sections 1 and 2** (Setup & Core Functions)
2. **Jump to Section 3** for examples  
3. **Use Section 4** for your own papers

---


## 📚 Classification Categories Explained

### 🎯 Disciplines
| **Code** | **Name** | **Focus** |
|----------|----------|-----------|
| **CS** | Computer Science | Algorithms, systems, computational theory |
| **IS** | Information Systems | Business processes, organizational technology |
| **IT** | Information Technology | Infrastructure, operations, deployment |

### 🔬 Subfields by Discipline

#### Computer Science (CS)
**AI/ML** (AI/Machine Learning), **CLOUDCS** (Cloud Computing), **CV** (Computer Vision), **NLP** (Natural Language Processing), **SE** (Software Engineering), **SEC** (Security)

#### Information Systems (IS)  
**BPM** (Business Process Management), **DT** (Digital Transformation), **GOV** (IT Governance), **HIS** (Health Information Systems), **KM** (Knowledge Management)

#### Information Technology (IT)
**CLOUDIT** (Cloud IT), **DEVOPS** (DevOps: Development and Operations), **EMERGING** (Emerging Technologies), **RISK** (Risk Management)

### 📊 Research Methodologies
| **Code** | **Name** | **Description** |
|----------|----------|-----------------|
| **QUAL** | Qualitative | Interviews, case studies, ethnography |
| **QUANT** | Quantitative | Experiments, surveys, statistical analysis |
| **MIXED** | Mixed Methods | Combination of qualitative and quantitative |
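
The categories above form a small two-level taxonomy: each subfield belongs to exactly one discipline, and methodology is independent of both. A minimal sketch of that structure as plain Python data follows; the names `TAXONOMY`, `METHODOLOGIES`, and `is_valid_label` are illustrative and do not appear elsewhere in this notebook.

```python
# Taxonomy from the tables above, expressed as plain Python data.
# TAXONOMY, METHODOLOGIES and is_valid_label are hypothetical helper names.
TAXONOMY = {
    "CS": ["AI/ML", "CLOUDCS", "CV", "NLP", "SE", "SEC"],
    "IS": ["BPM", "DT", "GOV", "HIS", "KM"],
    "IT": ["CLOUDIT", "DEVOPS", "EMERGING", "RISK"],
}
METHODOLOGIES = ["QUAL", "QUANT", "MIXED"]


def is_valid_label(discipline: str, subfield: str, methodology: str) -> bool:
    """Return True when the triple is consistent with the taxonomy."""
    return (
        subfield in TAXONOMY.get(discipline, [])
        and methodology in METHODOLOGIES
    )


print(is_valid_label("CS", "NLP", "QUANT"))  # NLP is listed under CS
print(is_valid_label("IT", "NLP", "QUANT"))  # NLP is not an IT subfield
```

A check like this can be handy when hand-labelling papers or reviewing classifier output, since a subfield paired with the wrong discipline is easy to miss by eye.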

---


# 🛠️ Section 1: Setup

## Step 1: Working Directory Setup
Run this cell first to ensure the notebook can find all model files.


In [2]:
# 🔧 Working Directory Setup
import os
import sys

print(f"📂 Current working directory: {os.getcwd()}")

# Navigate to project root if needed
if not os.path.exists('Artefacts/current'):
    possible_paths = [
        '.',  # Already in project root
        '..',  # One level up
        '../..',  # Two levels up
        '../../..',  # Three levels up
        '/Users/aanandprabhu/Desktop/development/NLP-Project',  # Machine-specific absolute path; change for your setup
    ]
    
    project_root = None
    for path in possible_paths:
        if os.path.exists(os.path.join(path, 'Artefacts/current')):
            project_root = os.path.abspath(path)
            break
    
    if project_root:
        os.chdir(project_root)
        print(f"✅ Changed to project root: {os.getcwd()}")
    else:
        print("❌ Could not find project root directory!")

# Verify we can see the required files
if os.path.exists('Artefacts/current'):
    print("✅ Found Artefacts/current directory")
    pkl_files = []
    for root, dirs, files in os.walk('Artefacts/current'):
        pkl_files.extend([os.path.join(root, f) for f in files if f.endswith('.pkl')])
    print(f"✅ Found {len(pkl_files)} .pkl files")
else:
    print("❌ Cannot find Artefacts/current directory")


📂 Current working directory: /Users/aanandprabhu/Desktop/development/NLP-Project/Notebooks
✅ Changed to project root: /Users/aanandprabhu/Desktop/development/NLP-Project
✅ Found Artefacts/current directory
✅ Found 10 .pkl files
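
The cell above tries a fixed list of relative paths plus one machine-specific absolute path. As an alternative sketch, the project root can be located by walking upward from the current directory until the marker folder is found, which avoids hard-coding any path. The function name `find_project_root` is an assumption made for illustration; only the `Artefacts/current` marker comes from this notebook.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "Artefacts/current") -> Optional[Path]:
    """Walk upward from `start` and return the first directory containing `marker`."""
    start = start.resolve()
    # Check the starting directory itself, then each ancestor in turn.
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    return None


root = find_project_root(Path.cwd())
print(root if root is not None else "project root not found")
```

If a root is found, `os.chdir(root)` would then play the same role as in the cell above.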


## Step 2: Import Libraries, Define Classes & Load Models
This cell imports the required libraries, defines the `OptimizedFeatureExtractor` class, and loads the five models. The class must be defined before loading because the pickled models reference it by name.


In [3]:
# Import required libraries
import joblib
import pandas as pd
import numpy as np
import json
from typing import Dict, List, Tuple, Optional, Union
import warnings
warnings.filterwarnings('ignore')

# For progress bars
from tqdm import tqdm

# Required imports for model loading
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from scipy.sparse import hstack

# Define the OptimizedFeatureExtractor class (required for model loading)
class OptimizedFeatureExtractor(BaseEstimator, TransformerMixin):
    """Optimized feature extractor for academic paper classification."""
    
    def __init__(self):
        self.vectorizers = {
            'tfidf': TfidfVectorizer(
                max_features=3000,
                ngram_range=(1, 2),
                stop_words='english',
                lowercase=True,
                min_df=2,
                max_df=0.95
            )
        }
        self.is_fitted = False
    
    def fit(self, X, y=None):
        """Fit the feature extractor."""
        for name, vectorizer in self.vectorizers.items():
            vectorizer.fit(X)
        self.is_fitted = True
        return self
    
    def transform(self, X):
        """Transform texts to features."""
        if not self.is_fitted:
            raise ValueError("Feature extractor must be fitted first")
        
        features = []
        for name, vectorizer in self.vectorizers.items():
            features.append(vectorizer.transform(X))
        
        if len(features) == 1:
            return features[0]
        else:
            return hstack(features)
    
    def fit_transform(self, X, y=None):
        """Fit and transform in one step."""
        return self.fit(X, y).transform(X)
    
    def get_feature_names(self):
        """Get all feature names"""
        names = []
        for name, vectorizer in self.vectorizers.items():
            names.extend([f"{name}_{n}" for n in vectorizer.get_feature_names_out()])
        return names

# Define model paths
MODEL_PATHS = {
    'discipline': 'Artefacts/current/discipline_classifier_v7/discipline_pipeline_v7.pkl',
    'cs_subfield': 'Artefacts/current/cs_subfield_classifier_v7/cs_subfield_pipeline_v7.pkl',
    'is_subfield': 'Artefacts/current/is_subfield_classifier_v7/is_subfield_pipeline_v7.pkl',
    'it_subfield': 'Artefacts/current/it_subfield_classifier_v7/it_subfield_pipeline_v7.pkl',
    'methodology': 'Artefacts/current/methodology_classifier_v7/methodology_pipeline_v7.pkl'
}

# Load all models
print("🤖 Loading classification models...")
models = {}
successful_loads = 0

for model_name, path in MODEL_PATHS.items():
    try:
        models[model_name] = joblib.load(path)
        print(f"✅ Loaded {model_name}")
        successful_loads += 1
    except Exception as e:
        print(f"❌ Error loading {model_name}: {e}")
        # Try alternative artifacts approach
        artifact_path = f'Artefacts/current/{model_name}_classifier_v7/artifacts_v7.pkl'
        try:
            artifacts = joblib.load(artifact_path)
            models[model_name] = {
                'feature_extractor': artifacts['feature_extractor'],
                'model': artifacts['model'],
                'label_encoder': artifacts['label_encoder']
            }
            print(f"✅ Loaded {model_name} from artifacts")
            successful_loads += 1
        except Exception as e2:
            print(f"❌ Failed to load {model_name} artifacts: {e2}")
            models[model_name] = None

print(f"\n🎉 Model loading complete! Successfully loaded {successful_loads}/{len(MODEL_PATHS)} models")

if successful_loads == 0:
    print("⚠️  No models loaded successfully. Please check that you're in the correct directory and that model files exist.")
elif successful_loads < len(MODEL_PATHS):
    print("⚠️  Some models failed to load. Classification may not work for all disciplines.")
else:
    print("🚀 All models loaded successfully! Ready for classification.")


🤖 Loading classification models...
✅ Loaded discipline
✅ Loaded cs_subfield
✅ Loaded is_subfield
✅ Loaded it_subfield
✅ Loaded methodology

🎉 Model loading complete! Successfully loaded 5/5 models
🚀 All models loaded successfully! Ready for classification.
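
The classification functions in Section 2 index each loaded object by the keys `feature_extractor`, `model`, and `label_encoder`. The sketch below checks for those keys right after loading, so a mismatched file fails early with a readable message. It assumes every pickle deserializes to a dict with that layout (the fallback branch above builds exactly such a dict); `REQUIRED_KEYS` and `missing_bundle_keys` are hypothetical names.

```python
from typing import Any, List

# Keys that the later classification code expects in each loaded bundle.
REQUIRED_KEYS = ("feature_extractor", "model", "label_encoder")


def missing_bundle_keys(bundle: Any) -> List[str]:
    """List the required keys that a loaded model bundle lacks.

    A non-dict object (for example None or a bare estimator) is
    reported as missing every key.
    """
    if not isinstance(bundle, dict):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if key not in bundle]


# Stand-in objects; in the notebook these would be the values of `models`.
example_models = {
    "discipline": {"feature_extractor": object(), "model": object(), "label_encoder": object()},
    "methodology": {"model": object()},
    "cs_subfield": None,
}
for name, bundle in example_models.items():
    problems = missing_bundle_keys(bundle)
    print(name, "ok" if not problems else f"missing: {problems}")
```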


# 🧠 Section 2: Core Classification Functions

## Step 3: Define Classification Functions
These functions handle the actual paper classification with confidence scores and layman explanations.


In [4]:
def preprocess_text(text: str) -> str:
    """Clean and preprocess text input."""
    if pd.isna(text) or text is None:
        return ""
    return str(text).strip()

def combine_title_abstract(title: str, abstract: str) -> str:
    """Combine title and abstract with proper formatting."""
    title = preprocess_text(title)
    abstract = preprocess_text(abstract)
    combined = f"{title} {abstract}".strip()
    if not combined:
        combined = "No content available"
    return combined

def get_prediction_confidence(model_dict, X) -> Tuple[str, float, np.ndarray]:
    """Get prediction with confidence score from model dictionary."""
    feature_extractor = model_dict['feature_extractor']
    model = model_dict['model']
    label_encoder = model_dict['label_encoder']
    
    # Transform input to features
    X_features = feature_extractor.transform(X)
    
    # Get prediction and probabilities
    prediction_encoded = model.predict(X_features)[0]
    prediction = label_encoder.inverse_transform([prediction_encoded])[0]
    
    # Get probability scores
    if hasattr(model, 'predict_proba'):
        proba = model.predict_proba(X_features)[0]
        confidence = float(np.max(proba))
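    else:
        # No probability estimates: return a one-hot vector aligned with
        # label_encoder.classes_ so callers that zip classes with proba stay correct
        proba = (np.asarray(label_encoder.classes_) == prediction).astype(float)
        confidence = 1.0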
    
    return prediction, confidence, proba

def generate_layman_explanation(discipline, subfield, methodology, confidences):
    """Generate a simple, layman-friendly explanation of the classification."""
    
    # Discipline explanations
    discipline_explanations = {
        'CS': "focuses on technical computer science topics like algorithms and software systems",
        'IS': "focuses on how technology is used in business and organizational contexts", 
        'IT': "focuses on practical technology implementation and infrastructure management"
    }
    
    # Subfield explanations
    subfield_explanations = {
        # CS subfields
        'AI/ML': "dealing with artificial intelligence and machine learning",
        'CLOUDCS': "about cloud computing systems and distributed computing",
        'CV': "about computer vision and image processing",
        'NLP': "about natural language processing and text analysis",
        'SE': "about software engineering and development practices",
        'SEC': "about cybersecurity and information security",
        
        # IS subfields
        'BPM': "about business process management and workflow optimization",
        'DT': "about digital transformation in organizations",
        'GOV': "about IT governance and technology policy",
        'HIS': "about health information systems and medical technology",
        'KM': "about knowledge management and information systems",
        
        # IT subfields
        'CLOUDIT': "about cloud infrastructure and deployment",
        'DEVOPS': "about development operations and automation",
        'EMERGING': "about emerging technologies like IoT or blockchain",
        'RISK': "about IT risk management and security operations"
    }
    
    # Methodology explanations
    methodology_explanations = {
        'QUAL': "using qualitative research methods (interviews, case studies, observations)",
        'QUANT': "using quantitative research methods (experiments, surveys, statistical analysis)",
        'MIXED': "using a combination of both qualitative and quantitative research methods"
    }
    
    # Build the explanation
    disc_conf = confidences.get('discipline', 0)
    
    explanation = f"This paper {discipline_explanations.get(discipline, 'appears to be academic research')}"
    
    if subfield and subfield not in ('Unknown', 'Error'):
        explanation += f", specifically {subfield_explanations.get(subfield, f'in the {subfield} area')}"
    
    if methodology and methodology not in ('Unknown', 'Error'):
        explanation += f", {methodology_explanations.get(methodology, f'using {methodology} methods')}"
    
    explanation += "."
    
    # Add confidence note if low
    if disc_conf < 0.7:
        explanation += " (Note: The classifier has some uncertainty about this classification.)"
    
    return explanation

print("✅ Core functions defined successfully!")


✅ Core functions defined successfully!
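
`get_prediction_confidence` reports the largest class probability as the confidence score. A complementary signal is the gap between the best and second-best class: a small gap suggests the model was torn between two labels even when the top score looks acceptable. The following is a sketch only, using pure Python and made-up numbers; `rank_classes` and `top_two_margin` are hypothetical helpers, not part of the notebook.

```python
from typing import Dict, List, Tuple


def rank_classes(distribution: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort class labels by probability, highest first."""
    return sorted(distribution.items(), key=lambda item: item[1], reverse=True)


def top_two_margin(distribution: Dict[str, float]) -> float:
    """Gap between the best and second-best class probability."""
    ranked = rank_classes(distribution)
    if len(ranked) < 2:
        # With zero or one class there is no runner-up to compare against.
        return ranked[0][1] if ranked else 0.0
    return ranked[0][1] - ranked[1][1]


# Made-up distribution mapping class label -> probability.
example = {"QUAL": 0.54, "QUANT": 0.40, "MIXED": 0.06}
print(rank_classes(example)[0][0])
print(round(top_two_margin(example), 2))
```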


In [5]:
def classify_paper(title: str, abstract: str, include_layman_explanation: bool = True) -> Dict:
    """
    Classify a single academic paper.
    
    Args:
        title: Paper title
        abstract: Paper abstract
        include_layman_explanation: Whether to include a simple explanation
    
    Returns:
        Dictionary with classification results including layman explanation
    """
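    # Normalise inputs first so None/NaN values cannot break len() or slicing
    title = preprocess_text(title)
    abstract = preprocess_text(abstract)
    
    results = {
        'input': {'title': title, 'abstract': abstract[:200] + '...' if len(abstract) > 200 else abstract},
        'predictions': {}, 'confidence_scores': {}, 'probability_distributions': {},
        'warnings': []
    }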
    
    # Combine title and abstract
    combined_text = combine_title_abstract(title, abstract)
    
    if combined_text == "No content available":
        results['warnings'].append("Empty or missing title and abstract")
        results['predictions'] = {'discipline': 'Unknown', 'subfield': 'Unknown', 'methodology': 'Unknown'}
        if include_layman_explanation:
            results['layman_explanation'] = "Cannot classify this paper because no title or abstract was provided."
        return results
    
    # Step 1: Predict Discipline
    if models['discipline'] is not None:
        disc_pred, disc_conf, disc_proba = get_prediction_confidence(models['discipline'], [combined_text])
        results['predictions']['discipline'] = disc_pred
        results['confidence_scores']['discipline'] = disc_conf
        
        if 'label_encoder' in models['discipline']:
            classes = models['discipline']['label_encoder'].classes_
            results['probability_distributions']['discipline'] = {
                cls: float(prob) for cls, prob in zip(classes, disc_proba)
            }
    else:
        results['predictions']['discipline'] = 'Error'
        results['warnings'].append("Discipline classifier not loaded")
        disc_pred = None
    
    # Step 2: Predict Subfield based on Discipline
    if disc_pred in ['CS', 'IS', 'IT']:
        subfield_model_name = f"{disc_pred.lower()}_subfield"
        if models[subfield_model_name] is not None:
            sub_pred, sub_conf, sub_proba = get_prediction_confidence(models[subfield_model_name], [combined_text])
            results['predictions']['subfield'] = sub_pred
            results['confidence_scores']['subfield'] = sub_conf
            
            if 'label_encoder' in models[subfield_model_name]:
                classes = models[subfield_model_name]['label_encoder'].classes_
                results['probability_distributions']['subfield'] = {
                    cls: float(prob) for cls, prob in zip(classes, sub_proba)
                }
        else:
            results['predictions']['subfield'] = 'Error'
            results['warnings'].append(f"{subfield_model_name} classifier not loaded")
    else:
        results['predictions']['subfield'] = 'Unknown'
    
    # Step 3: Predict Methodology
    if models['methodology'] is not None:
        meth_pred, meth_conf, meth_proba = get_prediction_confidence(models['methodology'], [combined_text])
        results['predictions']['methodology'] = meth_pred
        results['confidence_scores']['methodology'] = meth_conf
        
        if 'label_encoder' in models['methodology']:
            classes = models['methodology']['label_encoder'].classes_
            results['probability_distributions']['methodology'] = {
                cls: float(prob) for cls, prob in zip(classes, meth_proba)
            }
    else:
        results['predictions']['methodology'] = 'Error'
        results['warnings'].append("Methodology classifier not loaded")
    
    # Add layman's explanation
    if include_layman_explanation:
        results['layman_explanation'] = generate_layman_explanation(
            results['predictions']['discipline'],
            results['predictions']['subfield'],
            results['predictions']['methodology'],
            results['confidence_scores']
        )
    
    return results

def classify_batch(df: pd.DataFrame, title_col: str = 'Title', abstract_col: str = 'Abstract', 
                  show_progress: bool = True) -> pd.DataFrame:
    """Classify multiple papers from a DataFrame."""
    results_list = []
    
    iterator = tqdm(df.iterrows(), total=len(df)) if show_progress else df.iterrows()
    
    for idx, row in iterator:
        title = row.get(title_col, '')
        abstract = row.get(abstract_col, '')
        
        result = classify_paper(title, abstract)
        
        flat_result = {
            'index': idx,
            'title': title,
            'abstract': str(abstract)[:100] + '...' if len(str(abstract)) > 100 else abstract,
            'discipline': result['predictions'].get('discipline', 'Unknown'),
            'discipline_confidence': result['confidence_scores'].get('discipline', 0),
            'subfield': result['predictions'].get('subfield', 'Unknown'),
            'subfield_confidence': result['confidence_scores'].get('subfield', 0),
            'methodology': result['predictions'].get('methodology', 'Unknown'),
            'methodology_confidence': result['confidence_scores'].get('methodology', 0),
            'layman_explanation': result.get('layman_explanation', ''),
            'warnings': '; '.join(result['warnings']) if result['warnings'] else 'None'
        }
        
        results_list.append(flat_result)
    
    return pd.DataFrame(results_list)

print("✅ Main classification functions ready!")


✅ Main classification functions ready!
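
`classify_paper` works as a two-stage cascade: the discipline prediction selects which subfield model runs, while methodology is predicted independently. The routing step can be sketched in isolation with stub classifiers. Everything here is a stand-in: `classify_cascade` and the keyword rules are hypothetical, and unlike the notebook this sketch folds the "model not loaded" case into `Unknown` rather than reporting `Error`.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for a loaded model: maps text to a label.
Classifier = Callable[[str], str]


def classify_cascade(text: str,
                     discipline_clf: Classifier,
                     subfield_clfs: Dict[str, Optional[Classifier]]) -> Dict[str, str]:
    """Two-stage prediction: discipline first, then that discipline's subfield model."""
    discipline = discipline_clf(text)
    # Same naming scheme as the notebook's model keys, e.g. "cs_subfield".
    subfield_clf = subfield_clfs.get(f"{discipline.lower()}_subfield")
    subfield = subfield_clf(text) if subfield_clf is not None else "Unknown"
    return {"discipline": discipline, "subfield": subfield}


def stub_discipline(text: str) -> str:
    """Toy keyword rule, only to make the routing visible."""
    return "IT" if "pipeline" in text.lower() else "CS"


stub_subfields = {
    "cs_subfield": lambda text: "NLP",
    "it_subfield": lambda text: "DEVOPS",
    "is_subfield": None,  # simulates a model that failed to load
}

print(classify_cascade("Testing in CI pipelines", stub_discipline, stub_subfields))
print(classify_cascade("Transformers for text", stub_discipline, stub_subfields))
```

One consequence of this design is that a wrong discipline prediction always yields a wrong subfield, because the paper is scored only by the selected discipline's subfield model.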


# 🎯 Section 3: Examples & Usage

## Single Paper Classification Examples
Try these examples to see how the classifier works with different types of papers.


In [6]:
# Example 1: CS/NLP/QUANT
print("🔬 Example 1: Natural Language Processing Research")
print("=" * 60)

result1 = classify_paper(
    title="Deep Learning for Natural Language Processing: A Survey",
    abstract="""This paper provides a comprehensive survey of deep learning techniques 
    applied to natural language processing tasks. We review recent advances in neural 
    architectures including transformers, BERT, and GPT models. Our analysis covers 
    both theoretical foundations and practical applications, with empirical comparisons 
    of model performance across various benchmarks."""
)

print(f"📊 Classification Results:")
print(f"Discipline: {result1['predictions']['discipline']} (confidence: {result1['confidence_scores'].get('discipline', 0):.2f})")
print(f"Subfield: {result1['predictions']['subfield']} (confidence: {result1['confidence_scores'].get('subfield', 0):.2f})")
print(f"Methodology: {result1['predictions']['methodology']} (confidence: {result1['confidence_scores'].get('methodology', 0):.2f})")
print(f"\n💬 Plain English: {result1['layman_explanation']}")

print("\n" + "=" * 60)

# Example 2: IT/DEVOPS/MIXED
print("⚙️ Example 2: DevOps Infrastructure Research")

result2 = classify_paper(
    title="Automated Testing in Continuous Integration Pipelines: A Mixed-Methods Analysis",
    abstract="""This study evaluates automated testing practices in CI/CD pipelines through 
    a comprehensive mixed-methods approach. We collected performance metrics from 150 production 
    deployments across 10 organizations, measuring test execution times, failure rates, and 
    deployment frequency. Additionally, we conducted semi-structured interviews with 30 DevOps 
    engineers to understand their experiences with test automation challenges and best practices."""
)

print(f"📊 Classification Results:")
print(f"Discipline: {result2['predictions']['discipline']} (confidence: {result2['confidence_scores'].get('discipline', 0):.2f})")
print(f"Subfield: {result2['predictions']['subfield']} (confidence: {result2['confidence_scores'].get('subfield', 0):.2f})")
print(f"Methodology: {result2['predictions']['methodology']} (confidence: {result2['confidence_scores'].get('methodology', 0):.2f})")
print(f"\n💬 Plain English: {result2['layman_explanation']}")

print("\n" + "=" * 60)

# Example 3: IS/GOV/QUAL
print("🏛️ Example 3: Government IT Research")

result3 = classify_paper(
    title="Barriers to Digital Transformation in Municipal Government: A Qualitative Investigation",
    abstract="""This qualitative study examines organizational and political barriers to digital 
    transformation in local government. Through ethnographic observation and in-depth interviews 
    with 45 municipal officials across three cities, we identify key challenges including legacy 
    system dependencies, bureaucratic resistance, and citizen privacy concerns. Our grounded theory 
    analysis reveals how institutional factors shape technology adoption decisions in public sector contexts."""
)

print(f"📊 Classification Results:")
print(f"Discipline: {result3['predictions']['discipline']} (confidence: {result3['confidence_scores'].get('discipline', 0):.2f})")
print(f"Subfield: {result3['predictions']['subfield']} (confidence: {result3['confidence_scores'].get('subfield', 0):.2f})")
print(f"Methodology: {result3['predictions']['methodology']} (confidence: {result3['confidence_scores'].get('methodology', 0):.2f})")
print(f"\n💬 Plain English: {result3['layman_explanation']}")

print("\n🎯 These examples show how the classifier handles different disciplines, subfields, and methodologies!")


🔬 Example 1: Natural Language Processing Research
📊 Classification Results:
Discipline: CS (confidence: 1.00)
Subfield: NLP (confidence: 0.96)
Methodology: QUANT (confidence: 0.83)

💬 Plain English: This paper focuses on technical computer science topics like algorithms and software systems, specifically about natural language processing and text analysis, using quantitative research methods (experiments, surveys, statistical analysis).

⚙️ Example 2: DevOps Infrastructure Research
📊 Classification Results:
Discipline: IT (confidence: 0.99)
Subfield: DEVOPS (confidence: 1.00)
Methodology: MIXED (confidence: 0.84)

💬 Plain English: This paper focuses on practical technology implementation and infrastructure management, specifically about development operations and automation, using a combination of both qualitative and quantitative research methods.

🏛️ Example 3: Government IT Research
📊 Classification Results:
Discipline: IS (confidence: 1.00)
Subfield: GOV (confidence: 0.90)
Methodol

## Batch Processing Example
Classify multiple papers at once from a DataFrame. The first cell below is a quick smoke test of `classify_paper`; the batch example follows it.


In [7]:
# 🧪 Test to verify everything is working
print("🧪 Testing classification system...")

try:
    test_result = classify_paper(
        title="Test Classification",
        abstract="This is a simple test to verify the classification system is working properly."
    )
    print("✅ Classification system is working!")
    print(f"Test result: {test_result['predictions']['discipline']}/{test_result['predictions']['subfield']}/{test_result['predictions']['methodology']}")
    print(f"💬 Test explanation: {test_result['layman_explanation']}")
except Exception as e:
    print(f"❌ Classification test failed: {e}")
    print("Please check that the setup cells in Sections 1 and 2 ran successfully.")

print("\n🎯 Ready for examples and your own papers!")


🧪 Testing classification system...
✅ Classification system is working!
Test result: CS/SE/QUANT
💬 Test explanation: This paper focuses on technical computer science topics like algorithms and software systems, specifically about software engineering and development practices, using quantitative research methods (experiments, surveys, statistical analysis). (Note: The classifier has some uncertainty about this classification.)

🎯 Ready for examples and your own papers!


In [8]:
# Create sample dataset for batch processing
sample_papers = pd.DataFrame([
    {
        'Title': 'Machine Learning in Healthcare: A Systematic Review',
        'Abstract': 'This systematic review examines the application of machine learning techniques in healthcare settings. We analyzed 150 papers published between 2018-2023, identifying key trends and challenges in clinical decision support systems.'
    },
    {
        'Title': 'Enterprise Resource Planning Implementation: A Case Study',
        'Abstract': 'This paper presents an in-depth case study of ERP implementation in a Fortune 500 company. Through interviews with 30 stakeholders and analysis of organizational documents, we identify critical success factors.'
    },
    {
        'Title': 'Network Security in IoT Devices: Vulnerabilities and Solutions',
        'Abstract': 'We present a comprehensive analysis of security vulnerabilities in Internet of Things devices. Our study includes penetration testing of 50 commercial IoT devices and proposes a new security framework.'
    }
])

print("📊 Batch Classification Results:")
print("=" * 80)

# Classify all papers
batch_results = classify_batch(sample_papers, show_progress=False)

# Display results with layman explanations
for idx, row in batch_results.iterrows():
    print(f"\n📄 Paper {idx + 1}: {row['title'][:50]}...")
    print(f"   Classification: {row['discipline']}/{row['subfield']}/{row['methodology']}")
    print(f"   Confidence: D:{row['discipline_confidence']:.2f} | S:{row['subfield_confidence']:.2f} | M:{row['methodology_confidence']:.2f}")
    print(f"   💬 Plain English: {row['layman_explanation']}")

print(f"\n✅ Successfully classified {len(batch_results)} papers!")


📊 Batch Classification Results:

📄 Paper 1: Machine Learning in Healthcare: A Systematic Revie...
   Classification: CS/AI/ML/QUAL
   Confidence: D:0.94 | S:0.97 | M:0.82
   💬 Plain English: This paper focuses on technical computer science topics like algorithms and software systems, specifically dealing with artificial intelligence and machine learning, using qualitative research methods (interviews, case studies, observations).

📄 Paper 2: Enterprise Resource Planning Implementation: A Cas...
   Classification: IS/BPM/QUAL
   Confidence: D:1.00 | S:1.00 | M:0.54
   💬 Plain English: This paper focuses on how technology is used in business and organizational contexts, specifically about business process management and workflow optimization, using qualitative research methods (interviews, case studies, observations).

📄 Paper 3: Network Security in IoT Devices: Vulnerabilities a...
   Classification: CS/SEC/QUANT
   Confidence: D:0.78 | S:0.98 | M:0.62
   💬 Plain English: This paper foc
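
The uncertainty note in `generate_layman_explanation` looks only at the discipline confidence, so a row such as Paper 2 (methodology confidence 0.54) passes without comment. A sketch of a per-row check across all three scores follows. The 0.7 threshold mirrors the one used for the discipline note; `CONFIDENCE_KEYS` and `needs_review` are hypothetical names, and plain dicts stand in for DataFrame rows.

```python
from typing import Dict, List

# Column names produced by classify_batch for the three confidence scores.
CONFIDENCE_KEYS = ("discipline_confidence", "subfield_confidence", "methodology_confidence")


def needs_review(row: Dict[str, float], threshold: float = 0.7) -> List[str]:
    """Return the confidence fields in `row` that fall below `threshold`.

    A missing field counts as 0.0 and is therefore flagged.
    """
    return [key for key in CONFIDENCE_KEYS if row.get(key, 0.0) < threshold]


# Confidence values copied from the first two papers in the batch output above.
rows = [
    {"discipline_confidence": 0.94, "subfield_confidence": 0.97, "methodology_confidence": 0.82},
    {"discipline_confidence": 1.00, "subfield_confidence": 1.00, "methodology_confidence": 0.54},
]
for number, row in enumerate(rows, start=1):
    print(number, needs_review(row))
```

With the real output, `batch_results.to_dict("records")` yields dicts of this shape, one per paper.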

# 🚀 Section 4: Classify Your Own Papers

## Option A: Single Paper Classification
Ready to test! Sample data is provided below - just run the cell to see classification results.

**To use your own paper**: Modify the variables in the cell below and run it.


In [9]:
# 📝 ENTER YOUR PAPER DETAILS HERE
# (Sample data provided below - replace with your own paper!)
your_title = "Machine Learning Approaches for Cybersecurity Threat Detection in IoT Networks"
your_abstract = """Internet of Things (IoT) networks face increasing cybersecurity threats due to their distributed nature and resource constraints. This study proposes a novel machine learning framework for real-time threat detection in IoT environments. We developed and evaluated three ML models: Random Forest, Support Vector Machines, and Deep Neural Networks using a dataset of 50,000 network traffic samples from simulated IoT deployments. Our experimental results demonstrate that the ensemble approach combining all three models achieves 96.7% accuracy in detecting malicious activities, including DDoS attacks, malware propagation, and unauthorized access attempts. The proposed system operates with low latency (average 15ms response time) making it suitable for real-time deployment. Feature importance analysis reveals that packet size distribution, connection frequency, and payload entropy are the most significant indicators of malicious behavior. This research contributes to the field by providing a practical and efficient solution for securing IoT networks against evolving cyber threats."""

# Classify your paper
print("🔍 Classifying your paper...")
print("=" * 60)

your_result = classify_paper(your_title, your_abstract)

print(f"📄 Title: {your_title}")
print(f"\n📊 Classification Results:")
print(f"Discipline: {your_result['predictions']['discipline']} (confidence: {your_result['confidence_scores'].get('discipline', 0):.2f})")
print(f"Subfield: {your_result['predictions']['subfield']} (confidence: {your_result['confidence_scores'].get('subfield', 0):.2f})")
print(f"Methodology: {your_result['predictions']['methodology']} (confidence: {your_result['confidence_scores'].get('methodology', 0):.2f})")

print(f"\n💬 Plain English Explanation:")
print(f"{your_result['layman_explanation']}")

if your_result['warnings']:
    print(f"\n⚠️ Warnings: {'; '.join(your_result['warnings'])}")

# Show confidence breakdown
print(f"\n📈 Detailed Confidence Scores:")
for level, conf in your_result['confidence_scores'].items():
    status = "🟢 High" if conf > 0.8 else "🟡 Medium" if conf > 0.6 else "🔴 Low"
    print(f"  {level.capitalize()}: {conf:.3f} ({status})")


🔍 Classifying your paper...
📄 Title: Machine Learning Approaches for Cybersecurity Threat Detection in IoT Networks

📊 Classification Results:
Discipline: CS (confidence: 1.00)
Subfield: SEC (confidence: 1.00)
Methodology: QUANT (confidence: 1.00)

💬 Plain English Explanation:
This paper focuses on technical computer science topics like algorithms and software systems, specifically about cybersecurity and information security, using quantitative research methods (experiments, surveys, statistical analysis).

📈 Detailed Confidence Scores:
  Discipline: 0.999 (🟢 High)
  Subfield: 0.997 (🟢 High)
  Methodology: 0.997 (🟢 High)
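
The High / Medium / Low labels come from a chained conditional expression with cut-offs at 0.8 and 0.6. Written as a small function, the same thresholds are easier to read and reuse. This is a sketch: the name `confidence_band` is an assumption, and the emoji markers are omitted.

```python
def confidence_band(confidence: float) -> str:
    """Map a confidence score to the High / Medium / Low bands used above."""
    if confidence > 0.8:
        return "High"
    if confidence > 0.6:
        return "Medium"
    return "Low"


for score in (0.999, 0.8, 0.62, 0.54):
    print(f"{score:.3f} -> {confidence_band(score)}")
```

Both comparisons are strict, so a score of exactly 0.8 lands in Medium and exactly 0.6 lands in Low, matching the original expression.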


## Option B: Batch Classification from CSV
Load and classify papers from your own CSV file.


In [10]:
# 📁 LOAD YOUR CSV FILE HERE
# Uncomment and modify the path below to use your own data

# your_csv_path = 'path/to/your/papers.csv'
# your_papers = pd.read_csv(your_csv_path)

# # Classify all papers (adjust column names if needed)
# your_results = classify_batch(
#     your_papers,
#     title_col='Title',     # Change this to your title column name
#     abstract_col='Abstract' # Change this to your abstract column name
# )

# # Save results
# your_results.to_csv('classified_papers.csv', index=False)
# print(f"✅ Classified {len(your_results)} papers and saved to 'classified_papers.csv'")

# # Display summary
# print("\n📊 Classification Summary:")
# print(f"Disciplines: {your_results['discipline'].value_counts().to_dict()}")
# print(f"Methodologies: {your_results['methodology'].value_counts().to_dict()}")

print("💡 Uncomment the code above and modify the path to classify your own CSV file!")
print("📋 Your CSV should have columns for 'Title' and 'Abstract'")
print("🎯 Results will include layman explanations for each paper")


💡 Uncomment the code above and modify the path to classify your own CSV file!
📋 Your CSV should have columns for 'Title' and 'Abstract'
🎯 Results will include layman explanations for each paper
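
`classify_batch` reads each column with `row.get(col, '')`, so a misspelled or missing column name does not raise an error; every paper simply comes back as `Unknown`. Checking the header before classifying catches this early. The sketch below uses only the standard library and an in-memory sample; `missing_columns` is a hypothetical helper.

```python
import csv
import io
from typing import Iterable, List, TextIO


def missing_columns(handle: TextIO, required: Iterable[str]) -> List[str]:
    """Return the required column names absent from the CSV header row."""
    reader = csv.reader(handle)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# In-memory samples standing in for a file opened with open(path, newline="").
sample = io.StringIO("Title,Abstract,Year\nA paper,Some abstract,2023\n")
print(missing_columns(sample, ["Title", "Abstract"]))

sample_bad = io.StringIO("title,summary\nA paper,Some abstract\n")
print(missing_columns(sample_bad, ["Title", "Abstract"]))
```

The comparison is case-sensitive, which matches how pandas resolves column names.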


# 📊 Section 5: Advanced Features

## Interactive Classification 
Ready to test! Sample data is provided below - just run the cell to see classification results.

**To use your own paper**: Modify the variables in the cell below and run it.

**Note**: `input()` prompts block execution until answered and are unavailable when the notebook is executed non-interactively (for example with `nbconvert`), so this uses a variable-based approach instead.


In [11]:
# 🎪 Simple Interactive Classification
# Note: input() prompts block execution and are unavailable in non-interactive runs
# Instead, just modify the variables below and run this cell!

print("🎓 Interactive Classification - Modify variables below and run this cell!")
print("=" * 75)

# 📝 MODIFY THESE VARIABLES FOR YOUR PAPER:
# (Sample data provided below - replace with your own paper!)
interactive_title = "Blockchain-based Supply Chain Management: A Systematic Review of Security and Transparency Challenges"
interactive_abstract = """This paper presents a comprehensive systematic review of blockchain technology applications in supply chain management, focusing on security and transparency challenges. We analyzed 127 peer-reviewed papers published between 2018-2023 from major databases including IEEE Xplore, ACM Digital Library, and ScienceDirect. Our methodology involved a structured literature review protocol with predefined inclusion and exclusion criteria. The review identifies key security vulnerabilities in blockchain-based supply chains, including smart contract exploits, consensus mechanism attacks, and privacy concerns. We found that while blockchain technology offers significant improvements in supply chain transparency and traceability, implementation challenges persist regarding scalability, interoperability, and energy consumption. The study contributes to the field by providing a taxonomic classification of security threats and proposing a framework for evaluating blockchain solutions in supply chain contexts. Our findings suggest that hybrid blockchain architectures combined with traditional database systems offer the most practical approach for current enterprise implementations."""

# 🔍 Classify the paper
print(f"📄 Classifying: {interactive_title}")
print("-" * 50)

if interactive_title and len(interactive_title.strip()) > 5:
    try:
        result = classify_paper(interactive_title, interactive_abstract)
        
        print("📊 Classification Results:")
        print(f"Discipline: {result['predictions']['discipline']} (confidence: {result['confidence_scores'].get('discipline', 0):.2%})")
        print(f"Subfield: {result['predictions']['subfield']} (confidence: {result['confidence_scores'].get('subfield', 0):.2%})")
        print(f"Methodology: {result['predictions']['methodology']} (confidence: {result['confidence_scores'].get('methodology', 0):.2%})")
        
        print("\n💬 Plain English Explanation:")
        print(result['layman_explanation'])
        
        # Show confidence breakdown
        print("\n📈 Confidence Breakdown:")
        for level, conf in result['confidence_scores'].items():
            status = "🟢 High" if conf > 0.8 else "🟡 Medium" if conf > 0.6 else "🔴 Low"
            print(f"  {level.capitalize()}: {conf:.3f} ({status})")
            
    except Exception as e:
        print(f"❌ Classification failed: {e}")
else:
    print("💡 Please enter a valid paper title (more than 5 characters) and run again!")

print("\n🔄 To classify another paper: modify the variables above and run this cell again!")


🎓 Interactive Classification - Modify variables below and run this cell!
===========================================================================
📄 Classifying: Blockchain-based Supply Chain Management: A Systematic Review of Security and Transparency Challenges
--------------------------------------------------
📊 Classification Results:
Discipline: IS (confidence: 63.25%)
Subfield: BPM (confidence: 99.71%)
Methodology: QUAL (confidence: 52.99%)

💬 Plain English Explanation:
This paper focuses on how technology is used in business and organizational contexts, specifically about business process management and workflow optimization, using qualitative research methods (interviews, case studies, observations). (Note: The classifier has some uncertainty about this classification.)

📈 Confidence Breakdown:
  Discipline: 0.633 (🟡 Medium)
  Subfield: 0.997 (🟢 High)
  Methodology: 0.530 (🔴 Low)

🔄 To classify another paper: modify the variables above and run this cell again!
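
The confidence breakdown in the cell above uses two cutoffs: above 0.8 is High, above 0.6 is Medium, and anything else is Low. A minimal sketch of the same banding as reusable functions follows; `confidence_band` and `levels_needing_review` are hypothetical names, not part of the notebook, and the scores are copied from the sample output.

```python
def confidence_band(conf, high=0.8, medium=0.6):
    """Label a confidence score using the cutoffs from the interactive cell."""
    if conf > high:
        return "High"
    if conf > medium:
        return "Medium"
    return "Low"


def levels_needing_review(confidence_scores, medium=0.6):
    """Return the prediction levels whose confidence falls in the Low band."""
    return [level for level, conf in confidence_scores.items() if conf <= medium]


# Scores copied from the sample output above
sample_scores = {"discipline": 0.633, "subfield": 0.997, "methodology": 0.530}
for level, conf in sample_scores.items():
    print(f"{level}: {conf:.3f} ({confidence_band(conf)})")
print("Review manually:", levels_needing_review(sample_scores))
# Review manually: ['methodology']
```

For the sample paper, only the methodology prediction falls in the Low band, so that is the label most worth checking by hand.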


## Summary & Key Features

### ✅ What This Notebook Provides:

1. **🎯 Hierarchical Classification**: Discipline → Subfield → Methodology
2. **📊 Confidence Scores**: For each prediction level  
3. **💬 Layman Explanations**: Simple, plain-English interpretations
4. **🔄 Batch Processing**: Efficient classification of multiple papers
5. **📁 CSV Export**: Save results for further analysis
6. **🎪 Variable-Based Mode**: Easy classification by modifying variables

### 🔧 Technical Details:

- **Models**: Pre-trained v7.0 XGBoost classifiers  
- **Features**: 3,000 TF-IDF features per classifier
- **Setup**: 5 cells for complete initialization
- **Validation**: Proper train/validation/test splits
- **Anti-leakage**: No data contamination between sets
- **Performance**: 92.32% discipline accuracy, 86.92% methodology accuracy

### 🎓 Example Classifications:

- **CS/NLP/QUANT**: Natural language processing with empirical analysis
- **IT/DEVOPS/MIXED**: Infrastructure research with combined methods  
- **IS/GOV/QUAL**: IT governance research with qualitative approaches
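
The slash-separated labels above can be checked against the category tables at the top of this notebook. The sketch below is an illustration only: `parse_label` is a hypothetical helper, and whether the classifier itself enforces that a subfield belongs to its discipline is an assumption, not something the notebook states.

```python
DISCIPLINES = {
    "CS": "Computer Science",
    "IS": "Information Systems",
    "IT": "Information Technology",
}
SUBFIELDS = {
    "CS": {"AI/ML", "CLOUDCS", "CV", "NLP", "SE", "SEC"},
    "IS": {"BPM", "DT", "GOV", "HIS", "KM"},
    "IT": {"CLOUDIT", "DEVOPS", "EMERGING", "RISK"},
}
METHODOLOGIES = {"QUAL", "QUANT", "MIXED"}


def parse_label(label):
    """Split 'DISCIPLINE/SUBFIELD/METHODOLOGY' and validate each part."""
    # Split from both ends so a subfield such as 'AI/ML' keeps its slash.
    discipline, rest = label.split("/", 1)
    subfield, methodology = rest.rsplit("/", 1)
    if discipline not in DISCIPLINES:
        raise ValueError(f"Unknown discipline: {discipline}")
    if subfield not in SUBFIELDS[discipline]:
        raise ValueError(f"{subfield} is not a {discipline} subfield")
    if methodology not in METHODOLOGIES:
        raise ValueError(f"Unknown methodology: {methodology}")
    return discipline, subfield, methodology


print(parse_label("IS/GOV/QUAL"))     # ('IS', 'GOV', 'QUAL')
print(parse_label("CS/AI/ML/QUANT"))  # ('CS', 'AI/ML', 'QUANT')
```

A label that pairs a subfield with the wrong discipline, such as `IS/NLP/QUAL`, raises a `ValueError`.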

---

**🚀 Ready to classify your papers? Run cells 1-5 first, then start with Section 3 examples or jump to Section 4 for your own papers!**
