JobGuard: Neuro-Symbolic Fraud Detection System

📋 Table of Contents

📌 Executive Overview
🔬 Deep Dive: The Neuro-Symbolic Architecture
✨ Key Features & Capabilities
📚 Frameworks & Libraries
📂 Project Structure
⚙️ Installation & Setup
👤 Author
📜 License

📌 Executive Overview

The Problem: The rise of remote work has led to a 300% increase in online recruitment fraud. Traditional spam filters, which rely on simple keyword matching (e.g., blocking "Bitcoin"), are easily bypassed by sophisticated scammers who use corporate jargon and legitimate-looking templates.

The Solution: JobGuard is a next-generation deception detection system designed to identify fraudulent job postings with human-like reasoning. It represents a paradigm shift from simple "Pattern Matching" to "Intent Understanding."

Core Philosophy: The system employs a Neuro-Symbolic AI approach:

The "Neural" Brain (Deep Learning): Uses Transformers (BERT) to "read between the lines," detecting the subtle tone of desperation, vagueness, or unprofessionalism that rule-based systems miss.
The "Symbolic" Brain (Hard Logic): Uses strict Rule Engines and Regex to catch specific technical red flags (e.g., Crypto wallets, Instant Payment apps, Raw Code injections) that probabilistic models might overlook.

By fusing these two worlds, JobGuard achieves a robust defense against both "Zero-Day" novel scams and classic "Copy-Paste" fraud.

🔬 Deep Dive: The Neuro-Symbolic Architecture

The system architecture is designed as a Multi-Stage Pipeline that mimics the cognitive process of a human fraud analyst. Data flows through four distinct phases before a final verdict is rendered.

Phase 1: The Gatekeeper (Sanitization & Validation)

Before any AI inference, the raw input is rigorously cleaned to prevent adversarial attacks.

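Unicode Normalization (NFKC): Neutralizes obfuscation in which scammers swap in compatibility characters (e.g., full-width 'ａ' instead of 'a', or the ligature 'ﬁ' instead of 'fi') to bypass keyword filters. NFKC on its own leaves cross-script homoglyphs such as Cyrillic 'а' (U+0430) unchanged, so catching those requires an additional confusables mapping.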
Gibberish Detection (WordFreq): Rejects inputs whose words fail the Zipf Frequency Test (e.g., "asdfghjkl"), so that model inference is not wasted on garbage data.
Code Injection Block: Uses structural regex to detect and reject raw source code (C++/Java/Python) pasted into the description field.
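
A minimal sketch of the Gatekeeper's normalization and code-rejection steps, using only the Python standard library. The function names and the three regex patterns are illustrative assumptions; the project's actual rule set is not shown in this README, and the WordFreq gibberish check is omitted so the sketch has no third-party dependencies.

```python
import re
import unicodedata

# Hypothetical structural patterns (assumptions, not the project's real rules).
CODE_PATTERNS = [
    re.compile(r"#include\s*<\w+(\.h)?>"),                    # C/C++ header include
    re.compile(r"\bpublic\s+static\s+void\s+main\b"),         # Java entry point
    re.compile(r"^\s*def\s+\w+\s*\(.*\)\s*:", re.MULTILINE),  # Python function definition
]


def normalize(text: str) -> str:
    """Fold compatibility characters (full-width forms, ligatures) via NFKC."""
    return unicodedata.normalize("NFKC", text)


def looks_like_code(text: str) -> bool:
    """Return True if any structural code pattern appears in the text."""
    return any(pattern.search(text) for pattern in CODE_PATTERNS)


def gatekeeper(raw: str) -> str:
    """Normalize the input and reject empty text or pasted source code."""
    cleaned = normalize(raw).strip()
    if not cleaned:
        raise ValueError("empty input")
    if looks_like_code(cleaned):
        raise ValueError("raw source code detected")
    return cleaned


# Full-width "BIT" folds to plain ASCII, so keyword filters see it again.
print(normalize("\uff22\uff29\uff34"))  # BIT
```

Cross-script look-alikes are not handled here; they would need a separate character mapping on top of NFKC.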

Phase 2: Feature Extraction (The Sensors)

The cleaned text is then analyzed by three independent feature extractors:

Linguistic Features (BERT Fusion):
- The text is tokenized and passed through a fine-tuned DistilBERT model.
- Output: A dense 768-dimensional embedding vector representing the semantic context.
Structural Anomalies (Autoencoder + IsoForest):
- Deep Autoencoder: Attempts to compress and reconstruct the text features. High reconstruction error (MSE) indicates the text deviates from the "Norm" of legitimate corporate postings.
- Isolation Forest: A tree-based outlier detector that flags metadata anomalies (e.g., descriptions that are suspiciously short or lack punctuation).
Heuristic Features:
- 10 handcrafted features including Urgency Score, Capitalization Ratio, Symbol Density, and Emoji Professionalism Score.
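
As a rough illustration of the heuristic features, the sketch below computes three of them in plain Python. The exact definitions and the urgency word list are assumptions made for this example; the project's own formulas may differ.

```python
# Hypothetical urgency lexicon (an assumption for illustration only).
URGENCY_TERMS = ("urgent", "immediately", "act now", "limited slots", "hurry")


def capitalization_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are uppercase."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def symbol_density(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    symbols = sum(1 for c in text if not c.isalnum() and not c.isspace())
    return symbols / len(text)


def urgency_score(text: str) -> int:
    """Count occurrences of urgency phrases, case-insensitively."""
    lowered = text.lower()
    return sum(lowered.count(term) for term in URGENCY_TERMS)


def heuristic_features(text: str) -> dict:
    """Bundle the features into a dict that a downstream model could consume."""
    return {
        "capitalization_ratio": capitalization_ratio(text),
        "symbol_density": symbol_density(text),
        "urgency_score": urgency_score(text),
    }
```

Each feature is a plain number, so it can be scaled and concatenated with the embedding and anomaly scores before classification.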

Phase 3: The Supervised Committee (Voting Ensemble)

The extracted features are fed into meta_ensemble.pkl, a Soft Voting Classifier composed of three diverse algorithms. Combining different model families reduces the risk that any single model's overfitting dominates the verdict.

1. Multi-Layer Perceptron (MLP)

Role: Non-Linear Pattern Recognition.
Logic: It captures complex interactions, such as "High Salary" + "No Experience" = "Fraud".

2. Gradient Boosting Machine (GBM)

Role: Rule-Based Decision Making.
Logic: An ensemble of decision trees that excels at handling tabular data and cutting through noise to find critical "Red Flags."

3. Logistic Regression

Role: The Calibrator.
Logic: A linear probabilistic model that provides a stable baseline. It helps keep the final probability score (0-100%) close to well-calibrated.

Ensemble Logic:

$$ P_{ensemble} = \frac{1}{3} (P_{MLP} + P_{GBM} + P_{LogReg}) $$
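
The averaging formula can be worked through in a few lines of Python. This is only a sketch of equal-weight soft voting: the example probabilities and the 0.5 decision threshold are assumptions, not values taken from the project.

```python
from statistics import fmean


def soft_vote(probabilities):
    """Average the per-model fraud probabilities with equal weights."""
    if not probabilities:
        raise ValueError("need at least one model probability")
    return fmean(probabilities)


def verdict(p_ensemble, threshold=0.5):
    """Map the ensemble probability to a label (threshold is an assumed cut-off)."""
    return "FRAUD" if p_ensemble >= threshold else "LEGIT"


# Example inputs in the order P_MLP, P_GBM, P_LogReg.
p = soft_vote([0.90, 0.60, 0.30])
print(round(p, 2), verdict(p))  # 0.6 FRAUD
```

Here the confident MLP vote is tempered by the more cautious Logistic Regression vote, which is the intended effect of averaging.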

Phase 4: The Parallel Inference Expert (S-BERT)

Running alongside the ensemble is the Semantic Knowledge Base.

🧠 Sentence-BERT (S-BERT)

Role: Inference-Only Semantic Search.
Logic: It compares the input text against a pre-computed database of 50+ Known Scam Concepts (e.g., "Pay for training", "No interview").
The Safety Valve: Because S-BERT compares whole-sentence meaning rather than isolated keywords, it can differentiate between "No interview required" (Scam) and "Interview in person" (Legit) using negative constraints.
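
One plausible reading of the anchor matching with negative constraints is sketched below. To keep it runnable without downloading a model, toy 3-dimensional vectors stand in for real sentence embeddings; in the actual system these would come from the Sentence-Transformers model. The function names, the 0.7 threshold, and the rule "flag only if the scam anchor beats every legitimate anchor" are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_scam_concept(query, scam_anchors, legit_anchors, threshold=0.7):
    """Return the best-matching scam concept, or None.

    A match counts only if it clears the threshold AND is stronger than the
    closest legitimate (negative-constraint) anchor.
    """
    best_name, best_sim = max(
        ((name, cosine(query, vec)) for name, vec in scam_anchors.items()),
        key=lambda pair: pair[1],
        default=(None, 0.0),
    )
    legit_sim = max((cosine(query, vec) for vec in legit_anchors.values()), default=0.0)
    if best_name is not None and best_sim >= threshold and best_sim > legit_sim:
        return best_name
    return None


# Toy stand-ins for embeddings (not real model output).
SCAM = {"No interview": [1.0, 0.0, 0.2]}
LEGIT = {"Interview in person": [0.0, 1.0, 0.2]}

print(match_scam_concept([0.9, 0.1, 0.2], SCAM, LEGIT))  # No interview
print(match_scam_concept([0.1, 0.9, 0.2], SCAM, LEGIT))  # None
```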

✨ Key Features & Capabilities

🛡️ Advanced Threat Detection

Zero-Day Scam Protection: The Unsupervised Autoencoder detects novel scams that don't contain any known "bad words" but statistically look like fraud.
Crypto & Wallet Awareness: Explicitly flags attempts to solicit payment via Bitcoin (BTC), Ethereum (ETH), or USDT. It distinguishes between the word "Crypto" in a job title vs. a payment method.
Instant Payment Block: Detects requests for UPI, GPay, Paytm, or Zelle, which are standard indicators of a recruitment scam.
Multi-Currency Parsing: Robust regex handles salaries in USD, INR, EUR, GBP, AUD, etc., ensuring accurate financial analysis.
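
The symbolic checks above might look roughly like the following. The regexes are illustrative assumptions rather than the project's actual patterns: a crypto mention counts as a payment request only when a payment verb appears shortly before it, which is one simple way to ignore "Crypto" in a job title.

```python
import re

# A payment verb followed within 60 characters (same sentence) by a coin name.
CRYPTO_PAYMENT = re.compile(
    r"\b(?:pay|send|transfer|deposit)\w*\b[^.\n]{0,60}\b(?:bitcoin|btc|ethereum|eth|usdt)\b",
    re.IGNORECASE,
)
INSTANT_PAYMENT = re.compile(r"\b(?:upi|gpay|google pay|paytm|zelle)\b", re.IGNORECASE)

# Currency code or symbol, then an amount with optional thousands separators.
SALARY = re.compile(
    r"(?P<cur>USD|INR|EUR|GBP|AUD|[$₹€£])\s?(?P<amount>\d[\d,]*(?:\.\d+)?)",
    re.IGNORECASE,
)
# Assumption: a bare "$" is treated as USD, although AUD and others share it.
SYMBOLS = {"$": "USD", "₹": "INR", "€": "EUR", "£": "GBP"}


def payment_flags(text: str) -> list:
    """Return the names of payment red flags found in the text."""
    flags = []
    if CRYPTO_PAYMENT.search(text):
        flags.append("crypto_payment")
    if INSTANT_PAYMENT.search(text):
        flags.append("instant_payment")
    return flags


def parse_salaries(text: str) -> list:
    """Extract (currency_code, amount) pairs from free text."""
    results = []
    for match in SALARY.finditer(text):
        code = match.group("cur").upper()
        code = SYMBOLS.get(code, code)
        amount = float(match.group("amount").replace(",", ""))
        results.append((code, amount))
    return results
```

For example, "Crypto Analyst at a fintech startup" raises no flag, while "Send a $50 deposit in BTC to start" is flagged as a crypto payment request.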

⚙️ Corporate Intelligence

Jargon Whitelist: Preloaded with 500+ corporate terms (SaaS, Kubernetes, CI/CD, ROI) to prevent false positives on technical JDs. It knows that "Python" is a skill, not a snake.
Domain Reputation: Automatically flags free email providers (Gmail, Yahoo) and URL shorteners (bit.ly) when used in official contact fields.
Professionalism Metrics: Analyzes emoji density and punctuation patterns to flag unprofessional behavior typical of MLM schemes.
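
A small sketch of the domain reputation check, using only the standard library. Gmail, Yahoo, and bit.ly come from the description above; the other entries in the two sets, and the function names, are assumptions added for illustration.

```python
from urllib.parse import urlparse

# Entries beyond gmail.com, yahoo.com, and bit.ly are illustrative assumptions.
FREE_EMAIL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com"}
URL_SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}


def email_is_free_provider(email: str) -> bool:
    """Return True if the address belongs to a known free email provider."""
    if "@" not in email:
        return False
    domain = email.rsplit("@", 1)[-1].strip().lower()
    return domain in FREE_EMAIL_DOMAINS


def url_is_shortened(url: str) -> bool:
    """Return True if the URL's host is a known link shortener."""
    # urlparse only fills in the hostname when a scheme is present.
    candidate = url if "://" in url else "https://" + url
    host = (urlparse(candidate).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in URL_SHORTENERS
```

A posting whose official contact is "hr.acme@gmail.com" or whose application link is "bit.ly/3xYz" would be flagged by these two checks.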

⚡ Production Engineering

Real-Time Inference: The entire pipeline (cleaning -> BERT -> Ensemble -> S-BERT) executes in under 200 ms on a standard CPU.
Dockerized: Fully containerized environment ensuring reproducibility. System-level dependencies (libenchant) are handled automatically.
Faster Cold Starts: The Docker image preloads Hugging Face model assets during build to reduce startup-time network fetches.
Smart Caching: SHA-256 hashing of inputs ensures instant results for repeated queries.
Live Telemetry: Color-coded, real-time system logs visible only to administrators for debugging and monitoring.
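
The Smart Caching idea can be sketched as a dictionary keyed by the SHA-256 digest of the input. This is a simplified, in-memory illustration; the names are hypothetical, and a real deployment would likely bound the cache size or use an external store.

```python
import hashlib

_cache = {}


def cache_key(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoded input text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cached_predict(text: str, predict):
    """Run `predict` once per distinct input and reuse the stored result."""
    key = cache_key(text)
    if key not in _cache:
        _cache[key] = predict(text)
    return _cache[key]
```

Hashing keeps the keys a fixed 64 characters regardless of how long the job description is, and avoids storing raw posting text as dictionary keys.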

📚 Frameworks & Libraries

The system is built upon a robust stack of industry-standard libraries, each chosen for a specific purpose:

| Library | Category | Purpose in JobGuard |
| --- | --- | --- |
| Scikit-Learn | Machine Learning | Orchestrates the Voting Ensemble. Provides the MLP, Gradient Boosting, Logistic Regression, and Isolation Forest algorithms. |
| PyTorch | Deep Learning | Powers the BERT Fusion model and S-BERT inference. Chosen for its dynamic computation graph and seamless Hugging Face integration. |
| TensorFlow / Keras | Deep Learning | Powers the Deep Autoencoder. Used for its efficient static graph execution in anomaly detection tasks. |
| Hugging Face Transformers | NLP | Loads the pre-trained `distilbert-base-uncased` tokenizer and model weights. |
| Sentence-Transformers | NLP | Facilitates the semantic similarity search using `all-MiniLM-L12-v2` for the anchor-based detection system. |
| Flask | Backend Framework | Serves the REST API, manages user sessions, and renders the Jinja2 templates for the dashboard. |
| LIME | Explainable AI | Generates local perturbations to explain why a specific text was flagged as fraud (e.g., highlighting the word "Telegram"). |
| PyEnchant | Text Processing | Wraps the C-based `Enchant` library to perform high-speed dictionary validation (Gibberish detection). |
| WordFreq | Text Processing | Validation using Zipf frequency to distinguish between typos, slang, and true gibberish. |
| Gunicorn | Production Server | A WSGI HTTP Server used to run Flask in production environments (like Docker containers). |

📂 Project Structure

JobGuard_Root/
│
├── app.py                            # Main Application (Neuro-Symbolic Engine)
├── Dockerfile                        # Container Configuration
├── pyproject.toml                    # Python Dependencies
├── uv.lock                           # Locked Dependency Graph
├── packages.txt                      # System Dependencies (libenchant)
│
├── models/                           # Serialized AI Models
│   ├── best_bert_fusion.pth          # Fine-tuned PyTorch Model
│   ├── autoencoder.keras             # Anomaly Detector
│   ├── iso_forest.pkl                # Outlier Detector
│   ├── meta_ensemble.pkl             # Voting Classifier (MLP+GBM+LR)
│   └── *_scaler.pkl                  # Data Normalizers
│
├── static/                           # Frontend Assets
│   ├── css/style.css                 # Cyberpunk/Glassmorphism UI
│   ├── js/script.js                  # Dashboard Async Logic
│   └── js/login.js                   # Auth & Animation Logic
│
└── templates/                        # HTML Views
    ├── index.html                    # Main Dashboard
    └── login.html                    # Secure Login Gateway

⚙️ Installation & Setup

Option A: Docker (Recommended)

# 1. Build the container
docker build -t jobguard .

# 2. Run the application
docker run -p 7860:7860 jobguard

The first image build is heavier because it warms the Hugging Face model cache into the image. Subsequent container starts are faster and less network-dependent.

Option B: Manual Setup

Prerequisites: You must install libenchant-2-dev on your system.

# Ubuntu/Debian
sudo apt-get install libenchant-2-dev

# Install Python dependencies
uv sync

# Run
uv run python app.py

👤 Author

Yogeshwaran

Project: JobGuard AI (Final Year Project 2026)
Institution: Panimalar Engineering College
Focus: AI Security & Fraud Detection

📜 License

Distributed under the MIT License. See LICENSE for more information.

Disclaimer: This tool is an assistive AI. While it achieves high accuracy (98% on test sets), final hiring decisions should always involve human verification.
