A Naive Bayes language detector optimized for short, informal text like SMS and chat messages. Built with TypeScript and powered by TF-IDF vectorization and Gaussian Naive Bayes classification.
| Code | Language | Flag | Slang Terms | Regional Support |
|---|---|---|---|---|
| `en` | English | 🇺🇸🇬🇧 | ~1,300+ | US, UK, Gen-Z, Gaming, AAVE, texting |
| `es` | Spanish | 🇪🇸🇲🇽 | ~1,800+ | Mexico, Spain, Argentina, Colombia, Venezuela, Chile, Caribbean |
| `fr` | French | 🇫🇷 | ~300+ | Standard French, SMS abbreviations (mdr, ptdr, slt, tkt) |
| `it` | Italian | 🇮🇹 | ~350+ | Standard Italian, regional variants (cmq, tvb, xke) |
| `pt` | Portuguese | 🇧🇷🇵🇹 | ~500+ | Brazilian (pt-BR) & European (pt-PT) Portuguese |
| `de` | German | 🇩🇪🇦🇹🇨🇭 | ~400+ | Standard German, Austrian, Swiss German, youth slang (Jugendsprache) |
Total: ~4,600+ slang terms across all languages for improved informal text detection.
- ✅ Optimized for short text: Works well with SMS and chat messages (1-50 words)
- ✅ Handles informal language: Supports slang, abbreviations, and texting patterns
- ✅ Multi-language support: 6 languages with regional variations
- ✅ Language filtering: Restrict detection to specific languages with "neither" detection
- ✅ Fast inference: <5ms per detection, suitable for real-time applications
- ✅ TypeScript support: Full type definitions included
- ✅ Slang dictionary fallback: Comprehensive detection for ambiguous cases
- ✅ Zero dependencies at runtime: Lightweight and self-contained
npm install naive-bayes-language-detector

import { getDetector } from 'naive-bayes-language-detector';
// Load the pre-trained model
const detector = getDetector(
'./node_modules/naive-bayes-language-detector/dist/models/language-model.json',
);
// Detect language
const result = detector.detect('Hola, ¿cómo estás?');
console.log(result);
// {
// language: 'es',
// confidence: 0.95,
// isReliable: true,
// probabilities: { es: 0.95, en: 0.01, fr: 0.02, it: 0.01, pt: 0.005, de: 0.005 },
// source: 'ml'
// }
// Batch detection
const results = detector.detectBatch(['hello', 'hola', 'bonjour', 'ciao', 'oi']);

Get or create a singleton detector instance.
const detector = getDetector('./models/language-model.json');

Detect the language of a single text.
interface DetectionResult {
language: string; // Detected language code ('en', 'es', 'fr', 'it', 'pt', 'de')
confidence: number; // Confidence score (0-1)
isReliable: boolean; // True if confidence > 0.7
probabilities?: Record<string, number>; // Probability per language
source?: 'ml' | 'slang' | 'slang-override' | 'combined';
}

Detect languages for multiple texts efficiently.
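For example, a minimal sketch assuming `detectBatch` returns one `DetectionResult` per input, in order (the outputs in the comments are illustrative):

```typescript
const texts = ['ngl that movie was bussin', 'q onda wey', 'mdr trop drole'];
const results = detector.detectBatch(texts);

results.forEach((result, i) => {
  // Each entry mirrors detect(): language, confidence, isReliable, etc.
  console.log(`${texts[i]} -> ${result.language} (${result.confidence.toFixed(2)})`);
});
```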
Restrict detection to specific languages only. Useful when you only care about certain languages.
// Only detect English or Spanish
detector.setAllowedLanguages(['en', 'es']);
// With fast mode for better performance (skips "neither" detection)
detector.setAllowedLanguages(['en', 'es'], { fastMode: true });

Options:
| Option | Type | Default | Description |
|---|---|---|---|
| `fastMode` | `boolean` | `false` | When `true`, only computes probabilities for the allowed languages (faster, but no "neither" detection) |
Behavior:
| Scenario | `fastMode: false` (default) | `fastMode: true` |
|---|---|---|
| Text matches an allowed language | High confidence, `isReliable: true` | High confidence, `isReliable: true` |
| Text doesn't match allowed languages | Low confidence, `isReliable: false` | High confidence (re-normalized) |
detector.setAllowedLanguages(['en', 'es']);
// Spanish text - detected correctly
detector.detect('Hola amigo');
// { language: 'es', confidence: 0.95, isReliable: true }
// French text with en/es filter - "neither" case
detector.detect('Bonjour!');
// { language: 'en', confidence: 0.12, isReliable: false }
// Low confidence indicates the text doesn't really match the allowed languages
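With `fastMode: true`, probabilities are re-normalized over the allowed languages only, so out-of-set text can still come back looking confident (a sketch; the exact numbers and winning language are illustrative):

```typescript
detector.setAllowedLanguages(['en', 'es'], { fastMode: true });

// French text with an en/es filter and fast mode enabled
detector.detect('Bonjour!');
// Probabilities are re-normalized over en/es only, so the result may report
// high confidence and isReliable: true even though the text is French.
```

If you need the "neither" signal, keep the default `fastMode: false` and check `isReliable`.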
Remove language restrictions and detect all supported languages again.

detector.clearAllowedLanguages();
// Now detects all 6 languages

Get the currently allowed languages. Returns `null` if all languages are allowed.
Get whether fast mode is enabled.
Reset the singleton instance (useful for testing).
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Input Text │ ──▶ │ Text Normalizer │ ──▶ │TF-IDF Vectorizer│
└─────────────────┘ └─────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Detection │ ◀── │ Slang Detection │ ◀── │ Naive Bayes │
│ Result │ │ (fallback) │ │ Classifier │
└─────────────────┘ └─────────────────┘ └─────────────────┘
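The stages map onto the exported components documented below. A simplified sketch of how they compose, assuming the vectorizer and classifier have already been fitted or loaded (the slang-dictionary fallback is omitted for brevity):

```typescript
import {
  normalizeText,
  TfidfVectorizer,
  NaiveBayesClassifier,
} from 'naive-bayes-language-detector';

// Sketch of the ML path: normalize -> vectorize -> classify.
function runPipeline(
  vectorizer: TfidfVectorizer,
  classifier: NaiveBayesClassifier,
  text: string,
) {
  const normalized = normalizeText(text);           // lowercase, strip URLs/emails/phones
  const vector = vectorizer.transform(normalized);  // character n-gram TF-IDF features
  return classifier.predict(vector);                // per-language prediction
}
```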
Converts text to numerical vectors using character n-grams (2-5 characters).
import { TfidfVectorizer } from 'naive-bayes-language-detector';
const vectorizer = new TfidfVectorizer({
minN: 2, // Minimum n-gram size
maxN: 5, // Maximum n-gram size
maxFeatures: 5000, // Vocabulary limit
});
vectorizer.fit(trainingTexts);
const vector = vectorizer.transform('hello world');

Gaussian Naive Bayes classifier for language prediction.
import { NaiveBayesClassifier } from 'naive-bayes-language-detector';
const classifier = new NaiveBayesClassifier();
classifier.fit(vectors, labels);
const prediction = classifier.predict(vector);

For short/ambiguous texts, the detector uses comprehensive slang dictionaries (see the example after the table):
| Language | Examples |
|---|---|
| English | lol, bruh, ngl, fr, lowkey, bussin, innit, mate |
| Spanish | wey, neta, chido, parce, bacano, po, cachai |
| French | mdr, ptdr, slt, tkt, jsp, bcp, cv |
| Italian | cmq, tvb, xke, nn, qlc, grz |
| Portuguese | kkk, blz, vlw, tmj, mano, bora, fixe |
| German | digga, krass, geil, oida, leiwand, hdl, vllt |
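The `source` field on the result indicates which path produced the answer (see the `DetectionResult` interface above). The exact value depends on the input and the model, so the comment below is illustrative:

```typescript
const slangResult = detector.detect('mdr jsp tkt');
console.log(slangResult.language, slangResult.source);
// For slang-heavy texts like this, source is typically 'slang',
// 'slang-override', or 'combined' rather than plain 'ml'.
```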
npm run download-data

Downloads data from multiple sources:
| Source | Description | Link |
|---|---|---|
| Tatoeba | Community-sourced sentence pairs | tatoeba.org |
| OpenSubtitles | Movie and TV subtitles | opus.nlpl.eu |
| Leipzig Corpora | Web and news text | uni-leipzig.de |
| TED2020 | TED talk transcripts | opus.nlpl.eu |
| QED | Educational content | opus.nlpl.eu |
| Ubuntu | Technical support dialogues | opus.nlpl.eu |
npm run prepare-data

Processes raw data, filters by length, and removes duplicates.
npm run train

Trains a TF-IDF + Naive Bayes model using batch processing and saves it to `models/language-model.json`.
npm run evaluate

Runs the model against 959 test cases and reports accuracy.
Interactive mode:
npm run evaluate -- -i

import { normalizeText, augmentText } from 'naive-bayes-language-detector';
// Normalize text (lowercase, remove URLs, emails, phone numbers)
const normalized = normalizeText('Hello World! https://example.com');
// 'hello world'
// Augment for training (creates variations with abbreviations)
const variations = augmentText('porque no vienes', 'es');
// ['porque no vienes', 'xq no vienes', ...]

language-detector/
├── src/ # TypeScript source
│ ├── index.ts # Main exports
│ ├── types/ # Type definitions
│ ├── utils/ # Utilities (normalization, n-grams, slang)
│ └── inference/ # ML components (vectorizer, classifier, detector)
├── dist/ # Compiled JavaScript (CommonJS)
├── test/ # Mocha + Chai test files
├── scripts/ # Training and evaluation scripts
├── models/ # Pre-trained model
│ └── language-model.json
└── data/ # Training data (not included in npm package)
import type {
LanguageCode, // 'en' | 'es' | 'fr' | 'it' | 'pt' | 'de'
DetectionResult,
DetectionSource,
SlangDetectionResult,
PredictionResult,
VectorizerOptions,
VectorizerData,
ClassifierData,
ModelData,
AllowedLanguagesOptions, // Options for setAllowedLanguages()
} from 'naive-bayes-language-detector';
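These types can be used to write strongly typed helpers around the detector. For example (a sketch; the filtering rule itself is just an illustration):

```typescript
import type { DetectionResult, LanguageCode } from 'naive-bayes-language-detector';

// Keep only reliable detections for a given set of target languages.
function filterReliable(
  results: DetectionResult[],
  targets: LanguageCode[],
): DetectionResult[] {
  return results.filter(
    (r) => r.isReliable && targets.includes(r.language as LanguageCode),
  );
}
```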
# Install dependencies
npm install
# Build TypeScript
npm run build
# Run tests
npm test
# Run tests with coverage
npm run coverage
# Lint code
npm run lint
npm run lint:fix
# Training workflow
npm run download-data
npm run prepare-data
npm run train
npm run evaluate

| Technology | Purpose |
|---|---|
| TypeScript | Type-safe development |
| Node.js | Runtime environment |
| Mocha + Chai | Testing framework |
| ESLint + Prettier | Code quality |
| Husky | Git hooks |
| Airbnb Style Guide | Code style |
This project uses Husky for Git hooks:
- `pre-commit`: Runs `lint-staged` to lint and format staged `.ts` files

# Hooks are automatically installed when you run npm install
npm install

- Node.js >= 20
| Metric | Value |
|---|---|
| Inference time | <5ms per text |
| Model size | ~1.7MB (JSON) |
| Accuracy | 100% (959 test cases) |
| Memory usage | ~50MB loaded |
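To sanity-check the inference-time figure on your own hardware, a rough micro-benchmark (sample texts and run count here are arbitrary):

```typescript
import { getDetector } from 'naive-bayes-language-detector';

const detector = getDetector(
  './node_modules/naive-bayes-language-detector/dist/models/language-model.json',
);

const samples = ['how r u doing', 'q onda wey', 'slt tkt jsp', 'cmq tvb', 'blz mano', 'vllt digga'];
const runs = 1000;

const start = performance.now();
for (let i = 0; i < runs; i += 1) {
  detector.detect(samples[i % samples.length]);
}
console.log(`avg ${((performance.now() - start) / runs).toFixed(2)} ms per detection`);
```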
Contributions are welcome! Please ensure:
- All tests pass (`npm test`)
- Code is linted (`npm run lint`)
- New features include tests
Made with ❤️ for the messaging community