Language Detector

A Naive Bayes language detector optimized for short, informal text like SMS and chat messages. Built with TypeScript and powered by TF-IDF vectorization and Gaussian Naive Bayes classification.

Supported Languages

Code	Language	Flag	Slang Terms	Regional Support
`en`	English	🇺🇸🇬🇧	~1,300+	US, UK, Gen-Z, Gaming, AAVE, texting
`es`	Spanish	🇪🇸🇲🇽	~1,800+	Mexico, Spain, Argentina, Colombia, Venezuela, Chile, Caribbean
`fr`	French	🇫🇷	~300+	Standard French, SMS abbreviations (mdr, ptdr, slt, tkt)
`it`	Italian	🇮🇹	~350+	Standard Italian, regional variants (cmq, tvb, xke)
`pt`	Portuguese	🇧🇷🇵🇹	~500+	Brazilian (pt-BR) & European (pt-PT) Portuguese
`de`	German	🇩🇪🇦🇹🇨🇭	~400+	Standard German, Austrian, Swiss German, youth slang (Jugendsprache)

Total: ~4,600+ slang terms across all languages for improved informal text detection.

Features

✅ Optimized for short text: Works well with SMS and chat messages (1-50 words)
✅ Handles informal language: Supports slang, abbreviations, and texting patterns
✅ Multi-language support: 6 languages with regional variations
✅ Language filtering: Restrict detection to specific languages; text outside the allowed set is flagged as "neither" through a low confidence score
✅ Fast inference: <5ms per detection, suitable for real-time applications
✅ TypeScript support: Full type definitions included
✅ Slang dictionary fallback: Dictionary lookup helps resolve short or ambiguous inputs
✅ Zero dependencies at runtime: Lightweight and self-contained

Installation

npm install naive-bayes-language-detector

Quick Start

import { getDetector } from 'naive-bayes-language-detector';

// Load the pre-trained model
const detector = getDetector(
   './node_modules/naive-bayes-language-detector/dist/models/language-model.json',
);

// Detect language
const result = detector.detect('Hola, ¿cómo estás?');
console.log(result);
// {
//   language: 'es',
//   confidence: 0.95,
//   isReliable: true,
//   probabilities: { es: 0.95, en: 0.01, fr: 0.02, it: 0.01, pt: 0.005, de: 0.005 },
//   source: 'ml'
// }

// Batch detection
const results = detector.detectBatch(['hello', 'hola', 'bonjour', 'ciao', 'oi']);

API Reference

`getDetector(modelPath: string): LanguageDetector`

Get or create a singleton detector instance.

const detector = getDetector('./models/language-model.json');

`LanguageDetector.detect(text: string): DetectionResult`

Detect the language of a single text.

interface DetectionResult {
   language: string; // Detected language code ('en', 'es', 'fr', 'it', 'pt', 'de')
   confidence: number; // Confidence score (0-1)
   isReliable: boolean; // True if confidence > 0.7
   probabilities?: Record<string, number>; // Probability per language
   source?: 'ml' | 'slang' | 'slang-override' | 'combined';
}
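The fields above can be consumed roughly as follows. This is a minimal sketch: `languageOrFallback` and `rankedLanguages` are hypothetical helpers, not part of the package, and the interface is copied locally so the snippet runs on its own.

```typescript
// Local copy of the documented result shape, so the sketch runs on its own.
interface DetectionResult {
   language: string;
   confidence: number;
   isReliable: boolean;
   probabilities?: Record<string, number>;
   source?: 'ml' | 'slang' | 'slang-override' | 'combined';
}

// Hypothetical helper (not part of the package): trust the detected language
// only when the result is flagged reliable, otherwise use a fallback code.
function languageOrFallback(detection: DetectionResult, fallback: string): string {
   return detection.isReliable ? detection.language : fallback;
}

// Hypothetical helper: list languages by descending probability, or return an
// empty array when the optional `probabilities` field is absent.
function rankedLanguages(detection: DetectionResult): string[] {
   const probabilities = detection.probabilities;
   if (probabilities === undefined) return [];
   return Object.keys(probabilities).sort((a, b) => probabilities[b] - probabilities[a]);
}
```

Checking `isReliable` before acting on `language` is a reasonable default for very short messages, where confidence tends to be lowest.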

`LanguageDetector.detectBatch(texts: string[]): DetectionResult[]`

Detect languages for multiple texts efficiently.

`LanguageDetector.setAllowedLanguages(languages, options?): this`

Restrict detection to a given set of languages. The method returns the detector itself (`this`), so calls can be chained.

// Only detect English or Spanish
detector.setAllowedLanguages(['en', 'es']);

// With fast mode for better performance (skips "neither" detection)
detector.setAllowedLanguages(['en', 'es'], { fastMode: true });

Options:

Option	Type	Default	Description
`fastMode`	`boolean`	`false`	When `true`, only computes probabilities for allowed languages (faster but no "neither" detection)

Behavior:

Scenario	`fastMode: false` (default)	`fastMode: true`
Text matches allowed language	High confidence, `isReliable: true`	High confidence, `isReliable: true`
Text doesn't match allowed languages	Low confidence, `isReliable: false`	High confidence (re-normalized)

detector.setAllowedLanguages(['en', 'es']);

// Spanish text - detected correctly
detector.detect('Hola amigo');
// { language: 'es', confidence: 0.95, isReliable: true }

// French text with en/es filter - "neither" case
detector.detect('Bonjour!');
// { language: 'en', confidence: 0.12, isReliable: false }
// Low confidence indicates text doesn't really match allowed languages
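The difference between the two modes can be illustrated with a small self-contained sketch. It assumes the filter works on a full probability distribution and that fast mode is equivalent to re-normalizing over the allowed languages; `restrictToAllowed` is a hypothetical function, and the package's internals may differ.

```typescript
interface FilteredGuess {
   language: string;
   confidence: number;
   isReliable: boolean;
}

// Hypothetical sketch of language filtering, not the package's implementation.
function restrictToAllowed(
   probabilities: Record<string, number>,
   allowed: string[],
   fastMode: boolean,
): FilteredGuess {
   if (allowed.length === 0) throw new Error('allowed must not be empty');
   let best = allowed[0];
   let allowedMass = 0;
   for (const code of allowed) {
      const p = probabilities[code] || 0;
      allowedMass += p;
      if (p > (probabilities[best] || 0)) best = code;
   }
   const raw = probabilities[best] || 0;
   // Default mode keeps the raw probability, so text in a disallowed language
   // yields a low score. Fast mode re-normalizes over the allowed set only.
   const confidence = fastMode && allowedMass > 0 ? raw / allowedMass : raw;
   // 0.7 is the reliability threshold documented for DetectionResult.
   return { language: best, confidence, isReliable: confidence > 0.7 };
}
```

With a distribution such as `{ fr: 0.82, en: 0.15, es: 0.03 }` and `['en', 'es']` allowed, the default mode reports `en` at 0.15 (unreliable), while fast mode reports `en` at roughly 0.83, which is why fast mode cannot signal the "neither" case.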

`LanguageDetector.clearAllowedLanguages(): this`

Remove language restrictions and detect all supported languages again.

detector.clearAllowedLanguages();
// Now detects all 6 languages

`LanguageDetector.allowedLanguages: string[] | null`

Get the currently allowed languages. Returns null if all languages are allowed.

`LanguageDetector.fastMode: boolean`

Get whether fast mode is enabled.

`resetDetector(): void`

Reset the singleton instance (useful for testing).
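For readers unfamiliar with the pattern, a get-or-create singleton with a reset hook can be sketched as below. `SketchDetector`, `getSketchDetector`, and `resetSketchDetector` are hypothetical stand-ins; in particular, how the real `getDetector` treats a different `modelPath` on later calls is not documented here, and this sketch simply ignores it.

```typescript
// Hypothetical stand-in for the real detector class.
class SketchDetector {
   readonly modelPath: string;

   constructor(modelPath: string) {
      this.modelPath = modelPath;
   }
}

let sharedDetector: SketchDetector | null = null;

// Create the instance on first use, then keep returning the same one.
function getSketchDetector(modelPath: string): SketchDetector {
   if (sharedDetector === null) {
      sharedDetector = new SketchDetector(modelPath);
   }
   return sharedDetector;
}

// Drop the cached instance so the next call builds a fresh one.
function resetSketchDetector(): void {
   sharedDetector = null;
}
```

In a test suite, calling the reset function in a `beforeEach` or `afterEach` hook keeps state such as allowed languages from leaking between tests.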

How It Works

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Input Text    │ ──▶ │ Text Normalizer │ ──▶ │TF-IDF Vectorizer│
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Detection      │ ◀── │ Slang Detection │ ◀── │ Naive Bayes     │
│  Result         │     │ (fallback)      │     │ Classifier      │
└─────────────────┘     └─────────────────┘     └─────────────────┘

TF-IDF Vectorizer

Converts text to numerical vectors using character n-grams (2-5 characters).

import { TfidfVectorizer } from 'naive-bayes-language-detector';

const vectorizer = new TfidfVectorizer({
   minN: 2, // Minimum n-gram size
   maxN: 5, // Maximum n-gram size
   maxFeatures: 5000, // Vocabulary limit
});

// trainingTexts: the training sentences (assumed here to be a string[])
vectorizer.fit(trainingTexts);
const vector = vectorizer.transform('hello world');
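To make the idea of character n-grams and TF-IDF weighting concrete, here is a self-contained sketch. It is not the package's `TfidfVectorizer`: the function names are hypothetical, and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is a common convention assumed for illustration.

```typescript
// All substrings of length minN..maxN, shortest first.
function charNgrams(text: string, minN: number, maxN: number): string[] {
   const grams: string[] = [];
   for (let n = minN; n <= maxN; n += 1) {
      for (let i = 0; i + n <= text.length; i += 1) {
         grams.push(text.slice(i, i + n));
      }
   }
   return grams;
}

// Smoothed inverse document frequency per n-gram (assumed formula).
function inverseDocumentFrequency(documents: string[][]): Record<string, number> {
   const documentCounts: Record<string, number> = Object.create(null);
   for (const grams of documents) {
      const seen: Record<string, boolean> = Object.create(null);
      for (const gram of grams) {
         if (!seen[gram]) {
            seen[gram] = true;
            documentCounts[gram] = (documentCounts[gram] || 0) + 1;
         }
      }
   }
   const idf: Record<string, number> = Object.create(null);
   for (const gram of Object.keys(documentCounts)) {
      idf[gram] = Math.log((1 + documents.length) / (1 + documentCounts[gram])) + 1;
   }
   return idf;
}

// Term frequency (count / total grams) times IDF, for grams in the vocabulary.
function tfidfWeights(grams: string[], idf: Record<string, number>): Record<string, number> {
   const weights: Record<string, number> = Object.create(null);
   for (const gram of grams) {
      if (idf[gram] !== undefined) {
         weights[gram] = (weights[gram] || 0) + idf[gram] / grams.length;
      }
   }
   return weights;
}
```

An n-gram shared by many training texts (such as `ho` in both "hola" and "hora") receives a lower weight than one that appears in a single text, which is what lets language-specific character sequences stand out.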

Naive Bayes Classifier

Gaussian Naive Bayes classifier for language prediction.

import { NaiveBayesClassifier } from 'naive-bayes-language-detector';

const classifier = new NaiveBayesClassifier();
// vectors: feature vectors for the training texts; labels: their language codes
classifier.fit(vectors, labels);
const prediction = classifier.predict(vector);
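The Gaussian Naive Bayes idea can be sketched in a few lines: estimate a per-class mean and variance for each feature, then pick the class with the highest log-posterior. The code below is an illustration under those textbook assumptions, with hypothetical names and a small variance floor; it is not the package's `NaiveBayesClassifier`.

```typescript
interface ClassStats {
   prior: number;
   mean: number[];
   variance: number[];
}

const VARIANCE_FLOOR = 1e-9; // assumed smoothing term to avoid division by zero

function fitGaussianNB(vectors: number[][], labels: string[]): Record<string, ClassStats> {
   const grouped: Record<string, number[][]> = Object.create(null);
   vectors.forEach((row, i) => {
      grouped[labels[i]] = grouped[labels[i]] || [];
      grouped[labels[i]].push(row);
   });
   const model: Record<string, ClassStats> = Object.create(null);
   for (const label of Object.keys(grouped)) {
      const rows = grouped[label];
      const mean: number[] = [];
      const variance: number[] = [];
      for (let d = 0; d < rows[0].length; d += 1) {
         const m = rows.reduce((sum, row) => sum + row[d], 0) / rows.length;
         const v = rows.reduce((sum, row) => sum + (row[d] - m) ** 2, 0) / rows.length;
         mean.push(m);
         variance.push(v + VARIANCE_FLOOR);
      }
      model[label] = { prior: rows.length / vectors.length, mean, variance };
   }
   return model;
}

function predictGaussianNB(model: Record<string, ClassStats>, vector: number[]): string {
   let bestLabel = '';
   let bestScore = -Infinity;
   for (const label of Object.keys(model)) {
      const stats = model[label];
      // log prior + sum of per-feature Gaussian log-densities
      let score = Math.log(stats.prior);
      for (let d = 0; d < vector.length; d += 1) {
         const diff = vector[d] - stats.mean[d];
         score += -0.5 * Math.log(2 * Math.PI * stats.variance[d]) - (diff * diff) / (2 * stats.variance[d]);
      }
      if (score > bestScore) {
         bestScore = score;
         bestLabel = label;
      }
   }
   return bestLabel;
}
```

Working in log space avoids numerical underflow when thousands of n-gram features are multiplied together.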

Slang Detection

For short or ambiguous texts, the detector falls back to per-language slang dictionaries:

Language	Examples
English	lol, bruh, ngl, fr, lowkey, bussin, innit, mate
Spanish	wey, neta, chido, parce, bacano, po, cachai
French	mdr, ptdr, slt, tkt, jsp, bcp, cv
Italian	cmq, tvb, xke, nn, qlc, grz
Portuguese	kkk, blz, vlw, tmj, mano, bora, fixe
German	digga, krass, geil, oida, leiwand, hdl, vllt
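A dictionary fallback of this kind can be approximated by counting how many tokens match each language's list. The sketch below uses a handful of the example terms from the table; `guessFromSlang` and its tie handling are assumptions for illustration, not the package's slang detector.

```typescript
// A few terms per language, taken from the examples in the table above.
const SLANG_SAMPLES: Record<string, string[]> = {
   en: ['lol', 'bruh', 'ngl', 'lowkey', 'bussin', 'innit', 'mate'],
   es: ['wey', 'neta', 'chido', 'parce', 'bacano', 'cachai'],
   fr: ['mdr', 'ptdr', 'slt', 'tkt', 'jsp', 'bcp'],
   it: ['cmq', 'tvb', 'xke', 'qlc', 'grz'],
   pt: ['kkk', 'blz', 'vlw', 'tmj', 'bora', 'fixe'],
   de: ['digga', 'krass', 'geil', 'oida', 'leiwand', 'hdl', 'vllt'],
};

interface SlangGuess {
   language: string;
   matches: number;
}

// Hypothetical sketch: returns the language with the most slang hits,
// or null when nothing matches or the top count is tied.
function guessFromSlang(text: string): SlangGuess | null {
   const tokens = text
      .toLowerCase()
      .split(/[^a-z0-9à-ÿ]+/)
      .filter((token) => token.length > 0);
   let best: SlangGuess | null = null;
   let tied = false;
   for (const language of Object.keys(SLANG_SAMPLES)) {
      const terms = SLANG_SAMPLES[language];
      const matches = tokens.filter((token) => terms.indexOf(token) !== -1).length;
      if (matches > 0) {
         if (best === null || matches > best.matches) {
            best = { language, matches };
            tied = false;
         } else if (matches === best.matches) {
            tied = true;
         }
      }
   }
   return tied ? null : best;
}
```

Returning null on a tie leaves the decision to the statistical model, which matches the role of the dictionary as a fallback rather than the primary signal.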

Training Your Own Model

1. Download Training Data

npm run download-data

Downloads data from multiple sources:

Source	Description	Link
Tatoeba	Community-sourced sentence pairs	tatoeba.org
OpenSubtitles	Movie and TV subtitles	opus.nlpl.eu
Leipzig Corpora	Web and news text	uni-leipzig.de
TED2020	TED talk transcripts	opus.nlpl.eu
QED	Educational content	opus.nlpl.eu
Ubuntu	Technical support dialogues	opus.nlpl.eu

2. Prepare Data

npm run prepare-data

Processes raw data, filters by length, and removes duplicates.
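The length filter and de-duplication described above might look like the following sketch. The `Sample` shape, the character thresholds, and the case-insensitive duplicate key are all assumptions; the actual script may apply different rules.

```typescript
interface Sample {
   text: string;
   label: string;
}

// Hypothetical sketch of the preparation step: trim, drop texts outside the
// length bounds, and keep only the first copy of each (label, text) pair.
function prepareSamples(samples: Sample[], minChars: number, maxChars: number): Sample[] {
   const seen: Record<string, boolean> = Object.create(null);
   const kept: Sample[] = [];
   for (const sample of samples) {
      const text = sample.text.trim();
      const key = `${sample.label}\u0000${text.toLowerCase()}`;
      const inRange = text.length >= minChars && text.length <= maxChars;
      if (inRange && !seen[key]) {
         seen[key] = true;
         kept.push({ text, label: sample.label });
      }
   }
   return kept;
}
```

Including the label in the duplicate key keeps identical strings that are valid in two languages (for example a shared loanword) from being collapsed into one.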

3. Train Model

npm run train

Trains a TF-IDF + Naive Bayes model using batch processing and saves it to models/language-model.json.

4. Evaluate Model

npm run evaluate

Runs the model against 959 test cases and reports accuracy.

Interactive mode:

npm run evaluate -- -i
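An accuracy report of the kind the evaluation step produces can be computed as in the sketch below. The `LabeledCase` shape and `scoreCases` function are hypothetical, and `detect` is passed in as a plain function so the snippet does not depend on the package.

```typescript
interface LabeledCase {
   text: string;
   expected: string;
}

interface AccuracyReport {
   overall: number;
   perLanguage: Record<string, number>;
}

// Hypothetical sketch: fraction of cases where the detected language equals
// the expected one, overall and broken down by expected language.
function scoreCases(cases: LabeledCase[], detect: (text: string) => string): AccuracyReport {
   const totals: Record<string, number> = Object.create(null);
   const hits: Record<string, number> = Object.create(null);
   let correct = 0;
   for (const testCase of cases) {
      const ok = detect(testCase.text) === testCase.expected;
      totals[testCase.expected] = (totals[testCase.expected] || 0) + 1;
      hits[testCase.expected] = (hits[testCase.expected] || 0) + (ok ? 1 : 0);
      if (ok) correct += 1;
   }
   const perLanguage: Record<string, number> = Object.create(null);
   for (const code of Object.keys(totals)) {
      perLanguage[code] = hits[code] / totals[code];
   }
   return { overall: cases.length === 0 ? 0 : correct / cases.length, perLanguage };
}
```

A per-language breakdown is useful because a high overall score can hide weak results for a language with few test cases.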

Text Normalization

import { normalizeText, augmentText } from 'naive-bayes-language-detector';

// Normalize text (lowercase, remove URLs, emails, phone numbers)
const normalized = normalizeText('Hello World! https://example.com');
// 'hello world'

// Augment for training (creates variations with abbreviations)
const variations = augmentText('porque no vienes', 'es');
// ['porque no vienes', 'xq no vienes', ...]
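The two steps can be approximated with a few regular expressions and a lookup table. This is a rough sketch with hypothetical names: the exact patterns, the punctuation handling, and the abbreviation table (only the `porque` to `xq` pair comes from the example above) are assumptions, not the package's rules.

```typescript
// Hypothetical normalization sketch: lowercase, strip URLs, emails,
// phone-like digit runs and basic punctuation, then collapse whitespace.
function normalizeSketch(text: string): string {
   return text
      .toLowerCase()
      .replace(/https?:\/\/\S+/g, ' ') // URLs
      .replace(/\S+@\S+\.\S+/g, ' ') // email addresses
      .replace(/\+?\d[\d\s().-]{6,}\d/g, ' ') // phone-like digit runs
      .replace(/[!?.,;:"()]/g, ' ') // assumed: basic punctuation is dropped too
      .replace(/\s+/g, ' ')
      .trim();
}

// Assumed abbreviation table; only the Spanish pair appears in this README.
const ABBREVIATIONS: Record<string, Record<string, string>> = {
   es: { porque: 'xq' },
};

// Hypothetical augmentation sketch: the original text plus one variation per
// word that has a known abbreviation.
function augmentSketch(text: string, lang: string): string[] {
   const table = ABBREVIATIONS[lang] || {};
   const words = text.split(' ');
   const variations = [text];
   words.forEach((word, i) => {
      if (Object.prototype.hasOwnProperty.call(table, word)) {
         const copy = words.slice();
         copy[i] = table[word];
         variations.push(copy.join(' '));
      }
   });
   return variations;
}
```

Augmenting training data with abbreviated forms exposes the model to the spellings that actually occur in SMS and chat, rather than only the standard forms found in corpora.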

Project Structure

language-detector/
├── src/                    # TypeScript source
│   ├── index.ts           # Main exports
│   ├── types/             # Type definitions
│   ├── utils/             # Utilities (normalization, n-grams, slang)
│   └── inference/         # ML components (vectorizer, classifier, detector)
├── dist/                  # Compiled JavaScript (CommonJS)
├── test/                  # Mocha + Chai test files
├── scripts/               # Training and evaluation scripts
├── models/                # Pre-trained model
│   └── language-model.json
└── data/                  # Training data (not included in npm package)

Type Exports

import type {
   LanguageCode, // 'en' | 'es' | 'fr' | 'it' | 'pt' | 'de'
   DetectionResult,
   DetectionSource,
   SlangDetectionResult,
   PredictionResult,
   VectorizerOptions,
   VectorizerData,
   ClassifierData,
   ModelData,
   AllowedLanguagesOptions, // Options for setAllowedLanguages()
} from 'naive-bayes-language-detector';

Development

# Install dependencies
npm install

# Build TypeScript
npm run build

# Run tests
npm test

# Run tests with coverage
npm run coverage

# Lint code
npm run lint
npm run lint:fix

# Training workflow
npm run download-data
npm run prepare-data
npm run train
npm run evaluate

Tech Stack

Technology	Purpose
TypeScript	Type-safe development
Node.js	Runtime environment
Mocha + Chai	Testing framework
ESLint + Prettier	Code quality
Husky	Git hooks
Airbnb Style Guide	Code style

Git Hooks

This project uses Husky for Git hooks:

pre-commit: Runs lint-staged to lint and format staged .ts files

# Hooks are automatically installed when you run npm install
npm install

Requirements

Node.js >= 20

Performance

Metric	Value
Inference time	<5ms per text
Model size	~1.7MB (JSON)
Accuracy	100% on the 959 evaluation test cases
Memory usage	~50MB loaded
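To check the inference-time figure on your own hardware, a simple timing loop is usually enough. The helper below is a hypothetical sketch that averages wall-clock time over many runs; it takes any function, so a call such as `() => detector.detect('hola')` can be passed in once a detector is loaded.

```typescript
// Hypothetical helper: average wall-clock milliseconds per call of `task`.
// Date.now() has millisecond resolution, so use a large `runs` value.
function averageMs(task: () => void, runs: number): number {
   const started = Date.now();
   for (let i = 0; i < runs; i += 1) {
      task();
   }
   return (Date.now() - started) / runs;
}

// Example (assumes `detector` was created with getDetector):
// const perCall = averageMs(() => detector.detect('hola, que tal'), 10000);
```

Running a few hundred warm-up calls before measuring gives a more stable number, since the first calls include JIT compilation.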

License

ISC

Contributing

Contributions are welcome! Please ensure:

All tests pass (npm test)
Code is linted (npm run lint)
New features include tests

Maintainers

@aaochoa

Made with ❤️ for the messaging community
