- Executive Overview
- Deep Dive: The Neuro-Symbolic Architecture
- Key Features & Capabilities
- Frameworks & Libraries
- Project Structure
- Installation & Setup
- Author
- License
The Problem: The rise of remote work has led to a 300% increase in online recruitment fraud. Traditional spam filters, which rely on simple keyword matching (e.g., blocking "Bitcoin"), are easily bypassed by sophisticated scammers who use corporate jargon and legitimate-looking templates.
The Solution: JobGuard is a next-generation deception detection system designed to identify fraudulent job postings with human-like reasoning. It represents a paradigm shift from simple "Pattern Matching" to "Intent Understanding."
Core Philosophy: The system employs a Neuro-Symbolic AI approach:
- The "Neural" Brain (Deep Learning): Uses Transformers (BERT) to read between the lines," detecting the subtle tone of desperation, vagueness, or unprofessionalism that rule-based systems miss.
- The "Symbolic" Brain (Hard Logic): Uses strict Rule Engines and Regex to catch specific technical red flags (e.g., Crypto wallets, Instant Payment apps, Raw Code injections) that probabilistic models might overlook.
By fusing these two worlds, JobGuard achieves a robust defense against both "Zero-Day" novel scams and classic "Copy-Paste" fraud.
The system architecture is designed as a Multi-Stage Pipeline that mimics the cognitive process of a human fraud analyst. Data flows through four distinct phases before a final verdict is rendered.
Before any AI inference, the raw input is rigorously cleaned to prevent adversarial attacks.
- Unicode Normalization (NFKC): Neutralizes "homoglyph attacks" where scammers use look-alike characters (e.g., Cyrillic 'а' instead of Latin 'a') to bypass keyword filters.
- Gibberish Detection (WordFreq): Rejects inputs that fail the Zipf Frequency Test (e.g., "asdfghjkl"), ensuring expensive model inference isn't wasted on garbage input.
- Code Injection Block: Uses structural regex to detect and reject raw source code (C++/Java/Python) pasted into the description field.
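The Stage-1 checks above can be sketched roughly as follows. Two caveats: NFKC alone only folds compatibility forms (e.g., fullwidth 'ａ' to 'a'), so a small hypothetical `CONFUSABLES` table stands in for cross-script homoglyph folding, and the dictionary-ratio check is a simplified stand-in for the real WordFreq Zipf test.

```python
import re
import unicodedata

# Hypothetical confusables table: NFKC does NOT fold cross-script
# look-alikes, so Cyrillic homoglyphs are mapped explicitly here.
CONFUSABLES = {"а": "a", "е": "e", "о": "o", "р": "p", "с": "c", "х": "x"}

# Illustrative structural signatures of pasted source code.
CODE_PATTERN = re.compile(r"#include\s*<|def \w+\(|public\s+static\s+void")

def sanitize(text: str) -> str:
    """Fold compatibility forms (NFKC), then map known homoglyphs."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)

def looks_like_gibberish(text: str, known_words=frozenset(
        "we are hiring a remote python developer for our team".split())) -> bool:
    """Stand-in for the Zipf test: reject input where too few tokens
    are recognizable dictionary words (tiny word set for illustration)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return True
    recognized = sum(1 for t in tokens if t in known_words)
    return recognized / len(tokens) < 0.3

def is_code_injection(text: str) -> bool:
    return bool(CODE_PATTERN.search(text))
```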
The cleaned text is then analyzed by three independent feature extractors:
- Linguistic Features (BERT Fusion):
- The text is tokenized and passed through a fine-tuned DistilBERT model.
- Output: A dense 768-dimensional embedding vector representing the semantic context.
- Structural Anomalies (Autoencoder + IsoForest):
- Deep Autoencoder: Attempts to compress and reconstruct the text features. High reconstruction error (MSE) indicates the text deviates from the "Norm" of legitimate corporate postings.
- Isolation Forest: A tree-based outlier detector that flags metadata anomalies (e.g., descriptions that are suspiciously short or lack punctuation).
- Heuristic Features:
- 10 handcrafted features including Urgency Score, Capitalization Ratio, Symbol Density, and Emoji Professionalism Score.
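A few of these handcrafted heuristics could look like the sketch below. The feature names come from the list above, but the exact formulas and the urgency term list are illustrative assumptions, not the project's actual definitions.

```python
import re

# Illustrative term list; the real Urgency Score likely uses a larger lexicon.
URGENCY_TERMS = {"urgent", "immediately", "act", "now", "limited", "hurry"}

def heuristic_features(text: str) -> dict:
    letters = [c for c in text if c.isalpha()]
    tokens = re.findall(r"\w+", text.lower())
    return {
        # Fraction of urgency-laden words.
        "urgency_score": sum(t in URGENCY_TERMS for t in tokens) / max(len(tokens), 1),
        # Share of uppercase letters ("APPLY NOW!!!" scores high).
        "capitalization_ratio": sum(c.isupper() for c in letters) / max(len(letters), 1),
        # Non-alphanumeric, non-space characters per character of input.
        "symbol_density": sum(not c.isalnum() and not c.isspace() for c in text) / max(len(text), 1),
    }
```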
The extracted features are fed into meta_ensemble.pkl, a Soft Voting Classifier composed of three diverse algorithms. This ensures robustness against overfitting.
**MLP (Multi-Layer Perceptron)**
- Role: Non-Linear Pattern Recognition.
- Logic: It captures complex feature interactions, such as "High Salary" + "No Experience" = "Fraud".
**Gradient Boosting**
- Role: Rule-Based Decision Making.
- Logic: An ensemble of decision trees that excels at handling tabular data and cutting through noise to find critical "Red Flags."
**Logistic Regression**
- Role: The Calibrator.
- Logic: A linear probabilistic model that provides a stable baseline. It ensures the final probability score (0-100%) is mathematically well-calibrated.
Ensemble Logic: The final fraud score is the soft-voting average of the three models' predicted class probabilities.
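A minimal sketch of such a soft-voting ensemble with scikit-learn; the hyperparameters here are placeholders, not the project's tuned values.

```python
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def build_meta_ensemble() -> VotingClassifier:
    # voting="soft" averages each member's class probabilities instead of
    # taking a majority vote, yielding a smooth 0-100% fraud score.
    return VotingClassifier(
        estimators=[
            ("mlp", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)),
            ("gbm", GradientBoostingClassifier(n_estimators=100)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",
    )
```

After fitting on the stacked feature vectors, `clf.predict_proba(X)[:, 1]` would give the fraud probability that the Logistic Regression member helps keep well-calibrated.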
Running alongside the ensemble is the Semantic Knowledge Base.
- Role: Inference-Only Semantic Search.
- Logic: It compares the input text against a pre-computed database of 50+ Known Scam Concepts (e.g., "Pay for training", "No interview").
- The Safety Valve: Unlike the other models, S-BERT is context-aware. It can differentiate between "No interview required" (Scam) and "Interview in person" (Legit) using negative constraints.
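The anchor-matching step reduces to cosine similarity against the pre-computed concept vectors. In the real pipeline those vectors would come from `SentenceTransformer('all-MiniLM-L12-v2').encode(...)`; they are passed in as plain arrays here so the matching logic stands alone.

```python
import numpy as np

def top_scam_match(text_vec: np.ndarray,
                   anchor_vecs: np.ndarray,
                   anchor_labels: list[str]) -> tuple[str, float]:
    """Return the closest known scam concept and its cosine similarity."""
    # Normalize anchors and query so the dot product equals cosine similarity.
    a = anchor_vecs / np.linalg.norm(anchor_vecs, axis=1, keepdims=True)
    t = text_vec / np.linalg.norm(text_vec)
    sims = a @ t
    best = int(np.argmax(sims))
    return anchor_labels[best], float(sims[best])
```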
- Zero-Day Scam Protection: The Unsupervised Autoencoder detects novel scams that don't contain any known "bad words" but statistically look like fraud.
- Crypto & Wallet Awareness: Explicitly flags attempts to solicit payment via Bitcoin (BTC), Ethereum (ETH), or USDT. It distinguishes between the word "Crypto" in a job title vs. a payment method.
- Instant Payment Block: Detects requests for UPI, GPay, Paytm, or Zelle, which are standard indicators of a recruitment scam.
- Multi-Currency Parsing: Robust regex handles salaries in USD, INR, EUR, GBP, AUD, etc., ensuring accurate financial analysis.
- Jargon Whitelist: Pre-trained on 500+ corporate terms (SaaS, Kubernetes, CI/CD, ROI) to prevent false positives on technical JDs. It knows that "Python" is a skill, not a snake.
- Domain Reputation: Automatically flags free email providers (Gmail, Yahoo) and URL shorteners (bit.ly) when used in official contact fields.
- Professionalism Metrics: Analyzes emoji density and punctuation patterns to flag unprofessional behavior typical of MLM schemes.
- Real-Time Inference: The entire pipeline (cleaning -> BERT -> Ensemble -> S-BERT) executes in < 200ms on a standard CPU.
- Dockerized: Fully containerized environment ensuring reproducibility. System-level dependencies (libenchant) are handled automatically.
- Faster Cold Starts: The Docker image preloads Hugging Face model assets during build to reduce startup-time network fetches.
- Smart Caching: SHA-256 hashing of inputs ensures instant results for repeated queries.
- Live Telemetry: Color-coded, real-time system logs visible only to administrators for debugging and monitoring.
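The symbolic payment checks and the cache keying above might be sketched as follows; the regex patterns are illustrative, not the project's actual rule set.

```python
import hashlib
import re

# Crude proximity pattern: a payment verb followed shortly by a crypto ticker.
# This is what lets "deposit $50 in BTC" flag while "Crypto Analyst" does not.
CRYPTO_PAYMENT = re.compile(
    r"\b(pay|send|deposit|transfer)\b[^.\n]{0,60}\b(btc|bitcoin|eth|ethereum|usdt)\b",
    re.IGNORECASE,
)
INSTANT_PAYMENT = re.compile(r"\b(upi|gpay|paytm|zelle)\b", re.IGNORECASE)

def payment_red_flags(text: str) -> list[str]:
    flags = []
    if CRYPTO_PAYMENT.search(text):
        flags.append("crypto_payment_request")
    if INSTANT_PAYMENT.search(text):
        flags.append("instant_payment_request")
    return flags

def cache_key(text: str) -> str:
    # Identical inputs hash to the same key, so repeat queries hit the cache.
    return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
```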
The system is built upon a robust stack of industry-standard libraries, each chosen for a specific purpose:
| Library | Category | Purpose in JobGuard |
|---|---|---|
| Scikit-Learn | Machine Learning | Orchestrates the Voting Ensemble. Provides the MLP, Gradient Boosting, Logistic Regression, and Isolation Forest algorithms. |
| PyTorch | Deep Learning | Powers the BERT Fusion model and S-BERT inference. Chosen for its dynamic computation graph and seamless Hugging Face integration. |
| TensorFlow / Keras | Deep Learning | Powers the Deep Autoencoder. Used for its efficient static graph execution in anomaly detection tasks. |
| Hugging Face Transformers | NLP | Loads the pre-trained distilbert-base-uncased tokenizer and model weights. |
| Sentence-Transformers | NLP | Facilitates the semantic similarity search using all-MiniLM-L12-v2 for the anchor-based detection system. |
| Flask | Backend Framework | Serves the REST API, manages user sessions, and renders the Jinja2 templates for the dashboard. |
| LIME | Explainable AI | Generates local perturbations to explain why a specific text was flagged as fraud (e.g., highlighting the word "Telegram"). |
| PyEnchant | Text Processing | Wraps the C-based Enchant library to perform high-speed dictionary validation (Gibberish detection). |
| WordFreq | Text Processing | Validation using Zipf frequency to distinguish between typos, slang, and true gibberish. |
| Gunicorn | Production Server | A WSGI HTTP Server used to run Flask in production environments (like Docker containers). |
```
JobGuard_Root/
│
├── app.py                     # Main Application (Neuro-Symbolic Engine)
├── Dockerfile                 # Container Configuration
├── pyproject.toml             # Python Dependencies
├── uv.lock                    # Locked Dependency Graph
├── packages.txt               # System Dependencies (libenchant)
│
├── models/                    # Serialized AI Models
│   ├── best_bert_fusion.pth   # Fine-tuned PyTorch Model
│   ├── autoencoder.keras      # Anomaly Detector
│   ├── iso_forest.pkl         # Outlier Detector
│   ├── meta_ensemble.pkl      # Voting Classifier (MLP+GBM+LR)
│   └── *_scaler.pkl           # Data Normalizers
│
├── static/                    # Frontend Assets
│   ├── css/style.css          # Cyberpunk/Glassmorphism UI
│   ├── js/script.js           # Dashboard Async Logic
│   └── js/login.js            # Auth & Animation Logic
│
└── templates/                 # HTML Views
    ├── index.html             # Main Dashboard
    └── login.html             # Secure Login Gateway
```

```bash
# 1. Build the container
docker build -t jobguard .

# 2. Run the application
docker run -p 7860:7860 jobguard
```

The first image build is heavier because it warms the Hugging Face model cache into the image. Subsequent container starts are faster and less network-dependent.
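A hypothetical Dockerfile along these lines would produce the described build-time preloading. Apart from `distilbert-base-uncased`, `all-MiniLM-L12-v2`, port 7860, Gunicorn, and libenchant (all named elsewhere in this README), every detail below is an assumption, not the project's actual Dockerfile.

```dockerfile
FROM python:3.11-slim

# System dependency for PyEnchant (see packages.txt)
RUN apt-get update && apt-get install -y --no-install-recommends \
    libenchant-2-dev && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY . .
RUN pip install --no-cache-dir .

# Warm the Hugging Face cache at build time so container start-up
# does not hit the network.
RUN python -c "from transformers import AutoTokenizer, AutoModel; \
    AutoTokenizer.from_pretrained('distilbert-base-uncased'); \
    AutoModel.from_pretrained('distilbert-base-uncased')"
RUN python -c "from sentence_transformers import SentenceTransformer; \
    SentenceTransformer('all-MiniLM-L12-v2')"

EXPOSE 7860
CMD ["gunicorn", "--bind", "0.0.0.0:7860", "app:app"]
```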
Prerequisites: You must install libenchant-2-dev on your system.

```bash
# Ubuntu/Debian
sudo apt-get install libenchant-2-dev

# Install Python dependencies
uv sync

# Run
uv run python app.py
```

Yogeshwaran
- Project: JobGuard AI (Final Year Project 2026)
- Institution: Panimalar Engineering College
- Focus: AI Security & Fraud Detection
Distributed under the MIT License. See LICENSE for more information.
Disclaimer: This tool is an assistive AI. While it achieves high accuracy (98% on test sets), final hiring decisions should always involve human verification.