This is a Naive Bayes language detector npm module that provides fast (~1-5ms) language detection with 95-98% accuracy for short, informal text like SMS and chat messages.

aaochoa/language-detector

Language Detector

A Naive Bayes language detector optimized for short, informal text like SMS and chat messages. Built with TypeScript and powered by TF-IDF vectorization and Gaussian Naive Bayes classification.

Supported Languages

| Code | Language | Flag | Slang Terms | Regional Support |
| --- | --- | --- | --- | --- |
| en | English | 🇺🇸🇬🇧 | ~1,300+ | US, UK, Gen-Z, Gaming, AAVE, texting |
| es | Spanish | 🇪🇸🇲🇽 | ~1,800+ | Mexico, Spain, Argentina, Colombia, Venezuela, Chile, Caribbean |
| fr | French | 🇫🇷 | ~300+ | Standard French, SMS abbreviations (mdr, ptdr, slt, tkt) |
| it | Italian | 🇮🇹 | ~350+ | Standard Italian, regional variants (cmq, tvb, xke) |
| pt | Portuguese | 🇧🇷🇵🇹 | ~500+ | Brazilian (pt-BR) & European (pt-PT) Portuguese |
| de | German | 🇩🇪🇦🇹🇨🇭 | ~400+ | Standard German, Austrian, Swiss German, youth slang (Jugendsprache) |

Total: ~4,600+ slang terms across all languages for improved informal text detection.

Features

  • Optimized for short text: Works well with SMS and chat messages (1-50 words)
  • Handles informal language: Supports slang, abbreviations, and texting patterns
  • Multi-language support: 6 languages with regional variations
  • Language filtering: Restrict detection to specific languages; text outside them is flagged as "neither" through a low confidence score
  • Fast inference: <5ms per detection, suitable for real-time applications
  • TypeScript support: Full type definitions included
  • Slang dictionary fallback: Slang dictionaries back up the classifier on short or ambiguous input
  • Zero dependencies at runtime: Lightweight and self-contained

Installation

npm install naive-bayes-language-detector

Quick Start

import { getDetector } from 'naive-bayes-language-detector';

// Load the pre-trained model
const detector = getDetector(
   './node_modules/naive-bayes-language-detector/dist/models/language-model.json',
);

// Detect language
const result = detector.detect('Hola, ¿cómo estás?');
console.log(result);
// {
//   language: 'es',
//   confidence: 0.95,
//   isReliable: true,
//   probabilities: { es: 0.95, en: 0.01, fr: 0.02, it: 0.01, pt: 0.005, de: 0.005 },
//   source: 'ml'
// }

// Batch detection
const results = detector.detectBatch(['hello', 'hola', 'bonjour', 'ciao', 'oi']);

API Reference

getDetector(modelPath: string): LanguageDetector

Get or create a singleton detector instance.

const detector = getDetector('./models/language-model.json');

LanguageDetector.detect(text: string): DetectionResult

Detect the language of a single text.

interface DetectionResult {
   language: string; // Detected language code ('en', 'es', 'fr', 'it', 'pt', 'de')
   confidence: number; // Confidence score (0-1)
   isReliable: boolean; // True if confidence > 0.7
   probabilities?: Record<string, number>; // Probability per language
   source?: 'ml' | 'slang' | 'slang-override' | 'combined';
}
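
The result fields can be consumed along the following lines. This is a minimal sketch: the interface is copied locally so the snippet runs on its own, and `resolveLanguage` and `topCandidates` are hypothetical helpers, not part of the package.

```typescript
// Local copy of the result shape documented above, so the sketch is self-contained.
interface DetectionResult {
   language: string;
   confidence: number;
   isReliable: boolean;
   probabilities?: Record<string, number>;
   source?: 'ml' | 'slang' | 'slang-override' | 'combined';
}

// Hypothetical helper: use a fallback language when the detector
// marks its own result as unreliable.
function resolveLanguage(result: DetectionResult, fallback: string): string {
   return result.isReliable ? result.language : fallback;
}

// Hypothetical helper: language codes ordered by probability, highest first.
// Returns an empty list when no probabilities were attached to the result.
function topCandidates(result: DetectionResult, limit: number): string[] {
   return Object.entries(result.probabilities ?? {})
      .sort(([, a], [, b]) => b - a)
      .slice(0, limit)
      .map(([code]) => code);
}

// Sample values taken from the Quick Start output.
const sampleResult: DetectionResult = {
   language: 'es',
   confidence: 0.95,
   isReliable: true,
   probabilities: { es: 0.95, en: 0.01, fr: 0.02, it: 0.01, pt: 0.005, de: 0.005 },
   source: 'ml',
};

const chosen = resolveLanguage(sampleResult, 'en');
const runnersUp = topCandidates(sampleResult, 2);
```

Checking `isReliable` before acting on `language` keeps low-confidence guesses from being treated as facts downstream.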

LanguageDetector.detectBatch(texts: string[]): DetectionResult[]

Detect languages for multiple texts efficiently.

LanguageDetector.setAllowedLanguages(languages, options?): this

Restrict detection to a subset of the supported languages, for example when your users only write in English or Spanish.

// Only detect English or Spanish
detector.setAllowedLanguages(['en', 'es']);

// With fast mode for better performance (skips "neither" detection)
detector.setAllowedLanguages(['en', 'es'], { fastMode: true });

Options:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| fastMode | boolean | false | When true, only computes probabilities for allowed languages (faster but no "neither" detection) |

Behavior:

| Scenario | fastMode: false (default) | fastMode: true |
| --- | --- | --- |
| Text matches allowed language | High confidence, isReliable: true | High confidence, isReliable: true |
| Text doesn't match allowed languages | Low confidence, isReliable: false | High confidence (re-normalized) |

detector.setAllowedLanguages(['en', 'es']);

// Spanish text - detected correctly
detector.detect('Hola amigo');
// { language: 'es', confidence: 0.95, isReliable: true }

// French text with en/es filter - "neither" case
detector.detect('Bonjour!');
// { language: 'en', confidence: 0.12, isReliable: false }
// Low confidence indicates text doesn't really match allowed languages
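
The difference between the two modes can be illustrated with a small sketch. It assumes that default mode keeps each probability relative to all six languages, while fast mode re-normalizes over the allowed subset, as the behavior table suggests. The function names and the probability values are made up for illustration and do not come from the package.

```typescript
type Probabilities = Record<string, number>;

// Default mode (assumed): pick the best allowed language but keep its
// probability relative to ALL languages, so off-list text scores low.
// Expects a non-empty list of allowed codes.
function pickAllowed(all: Probabilities, allowed: string[]): { language: string; confidence: number } {
   let best = allowed[0];
   for (const code of allowed) {
      if ((all[code] ?? 0) > (all[best] ?? 0)) best = code;
   }
   return { language: best, confidence: all[best] ?? 0 };
}

// Fast mode (assumed): only allowed languages are scored, which is
// equivalent to re-normalizing their probabilities to sum to 1.
function pickAllowedFast(all: Probabilities, allowed: string[]): { language: string; confidence: number } {
   const total = allowed.reduce((sum, code) => sum + (all[code] ?? 0), 0);
   const picked = pickAllowed(all, allowed);
   return { language: picked.language, confidence: total > 0 ? picked.confidence / total : 0 };
}

// Made-up probabilities for a French message.
const frenchLike: Probabilities = { fr: 0.8, en: 0.12, es: 0.04, it: 0.02, pt: 0.01, de: 0.01 };

const defaultPick = pickAllowed(frenchLike, ['en', 'es']); // confidence stays at 0.12
const fastPick = pickAllowedFast(frenchLike, ['en', 'es']); // confidence rises to about 0.75
```

Under these assumptions the same French input looks unreliable in default mode and confident in fast mode, which is why fast mode cannot signal the "neither" case.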

LanguageDetector.clearAllowedLanguages(): this

Remove language restrictions and detect all supported languages again.

detector.clearAllowedLanguages();
// Now detects all 6 languages

LanguageDetector.allowedLanguages: string[] | null

Get the currently allowed languages. Returns null if all languages are allowed.

LanguageDetector.fastMode: boolean

Get whether fast mode is enabled.

resetDetector(): void

Reset the singleton instance (useful for testing).
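
The get-or-create and reset pair follows the usual singleton pattern. The sketch below uses a stand-in class and hypothetical function names. It also assumes that a second call with a different path returns the existing instance, which this README does not confirm.

```typescript
// Stand-in for the real detector; it only remembers which model path it was given.
class StubDetector {
   readonly modelPath: string;

   constructor(modelPath: string) {
      this.modelPath = modelPath;
   }
}

let sharedInstance: StubDetector | null = null;

// Create the instance on first use, then keep returning the same one.
function getSharedDetector(modelPath: string): StubDetector {
   if (sharedInstance === null) {
      sharedInstance = new StubDetector(modelPath);
   }
   return sharedInstance;
}

// Drop the cached instance so the next call builds a fresh one.
function resetSharedDetector(): void {
   sharedInstance = null;
}

const first = getSharedDetector('a.json');
const second = getSharedDetector('b.json'); // same object as `first` in this sketch
resetSharedDetector();
const third = getSharedDetector('b.json'); // new object after the reset
```

Resetting between test cases keeps state such as allowed languages from leaking from one test into the next.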

How It Works

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Input Text    │ ──▶ │ Text Normalizer │ ──▶ │TF-IDF Vectorizer│
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Detection      │ ◀── │ Slang Detection │ ◀── │ Naive Bayes     │
│  Result         │     │ (fallback)      │     │ Classifier      │
└─────────────────┘     └─────────────────┘     └─────────────────┘

TF-IDF Vectorizer

Converts text to numerical vectors using character n-grams (2-5 characters).

import { TfidfVectorizer } from 'naive-bayes-language-detector';

const vectorizer = new TfidfVectorizer({
   minN: 2, // Minimum n-gram size
   maxN: 5, // Maximum n-gram size
   maxFeatures: 5000, // Vocabulary limit
});

vectorizer.fit(trainingTexts);
const vector = vectorizer.transform('hello world');
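
Character n-gram extraction, the step that precedes TF-IDF weighting, can be sketched as follows. `charNgramCounts` is a hypothetical standalone function, not the package's internal implementation, and it covers counting only (no vocabulary limit or IDF weighting).

```typescript
// Collect every substring of length minN..maxN and count its occurrences.
function charNgramCounts(text: string, minN = 2, maxN = 5): Map<string, number> {
   const counts = new Map<string, number>();
   const chars = Array.from(text); // split by code point rather than UTF-16 unit
   for (let n = minN; n <= maxN; n += 1) {
      for (let i = 0; i + n <= chars.length; i += 1) {
         const gram = chars.slice(i, i + n).join('');
         counts.set(gram, (counts.get(gram) ?? 0) + 1);
      }
   }
   return counts;
}

// 'hola' with sizes 2-3 yields the bigrams 'ho', 'ol', 'la'
// and the trigrams 'hol', 'ola'.
const holaGrams = charNgramCounts('hola', 2, 3);
```

Character n-grams suit short messages because even a single word produces several features, and misspellings or abbreviations still share most of their n-grams with the standard spelling.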

Naive Bayes Classifier

Gaussian Naive Bayes classifier for language prediction.

import { NaiveBayesClassifier } from 'naive-bayes-language-detector';

const classifier = new NaiveBayesClassifier();
classifier.fit(vectors, labels);
const prediction = classifier.predict(vector);
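
Gaussian Naive Bayes scoring in general works as sketched below: each class stores a prior plus a per-feature mean and variance, and the class with the highest log score wins. The `ClassStats` layout, the function names, and the toy numbers are assumptions for illustration, not the package's model format.

```typescript
// Per-class statistics a Gaussian Naive Bayes model would store (assumed layout).
interface ClassStats {
   prior: number; // P(class)
   mean: number[]; // per-feature mean
   variance: number[]; // per-feature variance, must be > 0
}

// log P(class) + sum over features of log N(x_i | mean_i, variance_i)
function logScore(x: number[], stats: ClassStats): number {
   let score = Math.log(stats.prior);
   for (let i = 0; i < x.length; i += 1) {
      const diff = x[i] - stats.mean[i];
      score += -0.5 * Math.log(2 * Math.PI * stats.variance[i]) - (diff * diff) / (2 * stats.variance[i]);
   }
   return score;
}

// Convert log scores into probabilities with a numerically stable softmax.
function posterior(x: number[], model: Record<string, ClassStats>): Record<string, number> {
   const codes = Object.keys(model);
   const scores = codes.map((code) => logScore(x, model[code]));
   const maxScore = Math.max(...scores);
   const exps = scores.map((s) => Math.exp(s - maxScore));
   const total = exps.reduce((a, b) => a + b, 0);
   const out: Record<string, number> = {};
   codes.forEach((code, i) => {
      out[code] = exps[i] / total;
   });
   return out;
}

// Toy two-feature model with made-up numbers.
const toyModel: Record<string, ClassStats> = {
   en: { prior: 0.5, mean: [0.8, 0.1], variance: [0.05, 0.05] },
   es: { prior: 0.5, mean: [0.1, 0.8], variance: [0.05, 0.05] },
};

// The input sits close to the 'en' means, so 'en' receives most of the mass.
const toyProbs = posterior([0.7, 0.2], toyModel);
```

Working in log space avoids underflow when thousands of per-feature likelihoods are multiplied together.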

Slang Detection

For short or ambiguous texts, the detector falls back to per-language slang dictionaries:

| Language | Examples |
| --- | --- |
| English | lol, bruh, ngl, fr, lowkey, bussin, innit, mate |
| Spanish | wey, neta, chido, parce, bacano, po, cachai |
| French | mdr, ptdr, slt, tkt, jsp, bcp, cv |
| Italian | cmq, tvb, xke, nn, qlc, grz |
| Portuguese | kkk, blz, vlw, tmj, mano, bora, fixe |
| German | digga, krass, geil, oida, leiwand, hdl, vllt |
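
A dictionary lookup of this kind can be sketched as follows. The word lists are a small excerpt of the examples above, and `guessBySlang` with its tie handling is a hypothetical illustration. The package's real dictionaries and the way it combines slang hits with classifier output are not shown in this README.

```typescript
// Small excerpt of the examples listed above; the real dictionaries are far larger.
const slangSample: Record<string, string[]> = {
   en: ['lol', 'bruh', 'ngl', 'lowkey', 'innit'],
   es: ['wey', 'neta', 'chido', 'parce', 'cachai'],
   fr: ['mdr', 'ptdr', 'slt', 'tkt', 'jsp'],
   it: ['cmq', 'tvb', 'xke', 'qlc', 'grz'],
   pt: ['kkk', 'blz', 'vlw', 'tmj', 'bora'],
   de: ['digga', 'krass', 'geil', 'oida', 'vllt'],
};

// Count slang hits per language and return the language with the most hits,
// or null when nothing matches or the top languages tie.
function guessBySlang(text: string): string | null {
   const tokens = text.toLowerCase().split(/[\s.,!?;:¿¡]+/).filter(Boolean);
   let best: string | null = null;
   let bestHits = 0;
   let tied = false;
   for (const [code, terms] of Object.entries(slangSample)) {
      const hits = tokens.filter((token) => terms.includes(token)).length;
      if (hits > bestHits) {
         best = code;
         bestHits = hits;
         tied = false;
      } else if (hits === bestHits && hits > 0) {
         tied = true;
      }
   }
   return tied ? null : best;
}

const slangGuess = guessBySlang('mdr tkt');
```

Returning null on ties or misses leaves the decision to the classifier, which fits the fallback role shown in the architecture diagram.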

Training Your Own Model

1. Download Training Data

npm run download-data

Downloads data from multiple sources:

| Source | Description | Link |
| --- | --- | --- |
| Tatoeba | Community-sourced sentence pairs | tatoeba.org |
| OpenSubtitles | Movie and TV subtitles | opus.nlpl.eu |
| Leipzig Corpora | Web and news text | uni-leipzig.de |
| TED2020 | TED talk transcripts | opus.nlpl.eu |
| QED | Educational content | opus.nlpl.eu |
| Ubuntu | Technical support dialogues | opus.nlpl.eu |

2. Prepare Data

npm run prepare-data

Processes raw data, filters by length, and removes duplicates.
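
The length filter and de-duplication can be pictured roughly as below. This is a sketch, not the actual script: the `Sample` shape and function name are hypothetical, the 1-50 word bounds are borrowed from the Features list, and treating case-insensitive matches as duplicates is an assumption.

```typescript
interface Sample {
   text: string;
   label: string; // language code
}

// Keep samples whose word count is within [minWords, maxWords] and drop
// repeats of the same (label, text) pair after trimming and lowercasing.
function prepareSamples(samples: Sample[], minWords = 1, maxWords = 50): Sample[] {
   const seen = new Set<string>();
   const kept: Sample[] = [];
   for (const sample of samples) {
      const text = sample.text.trim();
      const words = text.split(/\s+/).filter(Boolean).length;
      if (words < minWords || words > maxWords) continue;
      const key = `${sample.label}\u0000${text.toLowerCase()}`;
      if (seen.has(key)) continue;
      seen.add(key);
      kept.push({ text, label: sample.label });
   }
   return kept;
}

const prepared = prepareSamples([
   { text: 'hola', label: 'es' },
   { text: ' Hola ', label: 'es' }, // duplicate after trimming and lowercasing
   { text: '', label: 'en' }, // dropped: zero words
   { text: 'hello', label: 'en' },
]);
```

Including the label in the de-duplication key keeps words that are valid in two languages from being removed from one of them.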

3. Train Model

npm run train

Trains a TF-IDF + Naive Bayes model using batch processing and saves to models/language-model.json.

4. Evaluate Model

npm run evaluate

Runs the model against 959 test cases and reports accuracy.

Interactive mode:

npm run evaluate -- -i
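
The reported accuracy is the share of test cases whose predicted language matches the expected one. The sketch below shows that calculation with a stub predictor. The `TestCase` shape, the function names, and the sample data are hypothetical and are not taken from the evaluation script.

```typescript
interface TestCase {
   text: string;
   expected: string; // expected language code
}

// Fraction of cases where the predictor returns the expected language.
function accuracy(cases: TestCase[], predict: (text: string) => string): number {
   if (cases.length === 0) return 0;
   const correct = cases.filter((c) => predict(c.text) === c.expected).length;
   return correct / cases.length;
}

// Stub predictor backed by a lookup table; it gets 'bonjour' wrong on purpose.
const stubAnswers: Record<string, string> = { hello: 'en', hola: 'es', bonjour: 'en', ciao: 'it' };
const stubPredict = (text: string): string => stubAnswers[text] ?? 'en';

const toyCases: TestCase[] = [
   { text: 'hello', expected: 'en' },
   { text: 'hola', expected: 'es' },
   { text: 'bonjour', expected: 'fr' },
   { text: 'ciao', expected: 'it' },
];

const toyAccuracy = accuracy(toyCases, stubPredict); // 3 of 4 correct
```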

Text Normalization

import { normalizeText, augmentText } from 'naive-bayes-language-detector';

// Normalize text (lowercase, remove URLs, emails, phone numbers)
const normalized = normalizeText('Hello World! https://example.com');
// 'hello world'

// Augment for training (creates variations with abbreviations)
const variations = augmentText('porque no vienes', 'es');
// ['porque no vienes', 'xq no vienes', ...]
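
The kind of cleanup described above can be approximated with a few regular expressions. `normalizeSketch` is an illustrative stand-in and the patterns are assumptions. The package's `normalizeText` may use different rules, and this sketch does not strip punctuation.

```typescript
// Illustrative only: lowercases the text, then removes URLs, email addresses,
// and phone-like digit runs before collapsing whitespace.
function normalizeSketch(text: string): string {
   return text
      .toLowerCase()
      .replace(/https?:\/\/\S+/g, ' ') // URLs
      .replace(/\S+@\S+\.\S+/g, ' ') // email addresses
      .replace(/\+?\d[\d\s().-]{6,}\d/g, ' ') // phone-like digit runs
      .replace(/\s+/g, ' ') // collapse whitespace
      .trim();
}

const cleaned = normalizeSketch('Hello World https://example.com');
```

Removing URLs, addresses, and numbers before vectorization keeps language-neutral tokens from diluting the character n-gram features.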

Project Structure

language-detector/
├── src/                    # TypeScript source
│   ├── index.ts           # Main exports
│   ├── types/             # Type definitions
│   ├── utils/             # Utilities (normalization, n-grams, slang)
│   └── inference/         # ML components (vectorizer, classifier, detector)
├── dist/                  # Compiled JavaScript (CommonJS)
├── test/                  # Mocha + Chai test files
├── scripts/               # Training and evaluation scripts
├── models/                # Pre-trained model
│   └── language-model.json
└── data/                  # Training data (not included in npm package)

Type Exports

import type {
   LanguageCode, // 'en' | 'es' | 'fr' | 'it' | 'pt' | 'de'
   DetectionResult,
   DetectionSource,
   SlangDetectionResult,
   PredictionResult,
   VectorizerOptions,
   VectorizerData,
   ClassifierData,
   ModelData,
   AllowedLanguagesOptions, // Options for setAllowedLanguages()
} from 'naive-bayes-language-detector';

Development

# Install dependencies
npm install

# Build TypeScript
npm run build

# Run tests
npm test

# Run tests with coverage
npm run coverage

# Lint code
npm run lint
npm run lint:fix

# Training workflow
npm run download-data
npm run prepare-data
npm run train
npm run evaluate

Tech Stack

| Technology | Purpose |
| --- | --- |
| TypeScript | Type-safe development |
| Node.js | Runtime environment |
| Mocha + Chai | Testing framework |
| ESLint + Prettier | Code quality |
| Husky | Git hooks |
| Airbnb Style Guide | Code style |

Git Hooks

This project uses Husky for Git hooks:

  • pre-commit: Runs lint-staged to lint and format staged .ts files

# Hooks are automatically installed when you run npm install
npm install

Requirements

Performance

| Metric | Value |
| --- | --- |
| Inference time | <5ms per text |
| Model size | ~1.7MB (JSON) |
| Accuracy | 100% on the 959-case evaluation set (npm run evaluate) |
| Memory usage | ~50MB loaded |

License

ISC

Contributing

Contributions are welcome! Please ensure:

  1. All tests pass (npm test)
  2. Code is linted (npm run lint)
  3. New features include tests

Maintainers


Made with ❤️ for the messaging community
