Closed
Labels: enhancement (New feature or request), priority: low (Low priority - nice to have)
Description
Summary
Provide a command-line executable for training classifiers and classifying text, with support for downloading pre-trained models from community repositories.
Motivation
A CLI tool would enable:
- Training classifiers from shell scripts or CI/CD pipelines
- Quick classification without writing Ruby code
- Integration with other Unix tools via pipes
- Batch processing of files
- Sharing and reusing pre-trained models from community repositories
Model Registry (like Homebrew taps)
Pre-trained models are stored in public GitHub repositories that anyone can contribute to:
- Default registry: github.com/cardmagic/classifier-models
- Custom registries: Any GitHub repo with the same structure
Model Syntax
# From default registry (cardmagic/classifier-models)
classifier -r sentiment "I love this!"
# From custom registry (@user/repo:model)
classifier -r @person/foobar:sentiment "I love this!"
# Pull from default registry
classifier pull sentiment
# Pull from custom registry
classifier pull @person/foobar:sentiment
# Pull ALL models from a registry
classifier pull @person/foobar
Repository Structure
classifier-models/
├── models/
│ ├── spam-filter.json # Bayes spam classifier
│ ├── sentiment.json # Sentiment classifier
│ └── language-detect.json # Language detection
├── models.json # Index of available models
└── README.md
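Given this layout, download locations can be derived mechanically from a registry name and a model name. The following is a minimal sketch, assuming downloads go through GitHub raw URLs and that the registry's default branch is `main` (the branch name is an assumption; the proposal does not specify one). `raw_model_url` and `raw_index_url` are hypothetical helper names.

```ruby
# Hypothetical helpers: map a registry and model name to raw download URLs.
# Assumes the default branch is "main"; the proposal does not name a branch.
DEFAULT_REGISTRY = "cardmagic/classifier-models"

# URL of a single model file under models/ in the registry repository.
def raw_model_url(model, registry: DEFAULT_REGISTRY, branch: "main")
  "https://raw.githubusercontent.com/#{registry}/#{branch}/models/#{model}.json"
end

# URL of the models.json index at the repository root.
def raw_index_url(registry: DEFAULT_REGISTRY, branch: "main")
  "https://raw.githubusercontent.com/#{registry}/#{branch}/models.json"
end

puts raw_model_url("spam-filter")
puts raw_model_url("sentiment", registry: "person/foobar")
puts raw_index_url
```

Keeping the branch as a keyword argument leaves room for registries that use a different default branch.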
Model Index (models.json)
{
  "version": "1.0.0",
  "models": {
    "spam-filter": {
      "description": "Email spam detection trained on SpamAssassin corpus",
      "type": "bayes",
      "categories": ["spam", "ham"],
      "file": "models/spam-filter.json",
      "version": "1.0.0",
      "author": "cardmagic",
      "size": "245KB"
    },
    "sentiment": {
      "description": "Positive/negative sentiment analysis",
      "type": "bayes",
      "categories": ["positive", "negative", "neutral"],
      "file": "models/sentiment.json",
      "version": "1.2.0",
      "author": "contributor",
      "size": "1.2MB"
    }
  }
}
Commands for Model Management
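The listing and search commands in this section could be driven directly by the model index. Here is a minimal sketch, assuming the models.json shape shown earlier; `list_models` is a hypothetical helper, not an existing part of the gem, and the sample index is trimmed to the fields it reads.

```ruby
require "json"

# Sample index in the proposed shape, trimmed to the fields used below.
INDEX_JSON = <<~JSON
  {
    "version": "1.0.0",
    "models": {
      "spam-filter": {
        "description": "Email spam detection trained on SpamAssassin corpus",
        "type": "bayes",
        "size": "245KB"
      },
      "sentiment": {
        "description": "Positive/negative sentiment analysis",
        "type": "bayes",
        "size": "1.2MB"
      }
    }
  }
JSON

# Hypothetical helper: returns one display line per model, optionally
# filtered by a case-insensitive search term on name or description.
def list_models(index, search: nil)
  models = index.fetch("models")
  if search
    term = search.downcase
    models = models.select do |name, meta|
      name.downcase.include?(term) || meta["description"].to_s.downcase.include?(term)
    end
  end
  models.map do |name, meta|
    format("%-16s %s (%s, %s)", name, meta["description"], meta["type"], meta["size"])
  end
end

index = JSON.parse(INDEX_JSON)
puts list_models(index)                 # all models
puts list_models(index, search: "spam") # filtered, like `classifier models --search spam`
```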
Pull (download models)
# Download from default registry
classifier pull spam-filter
# => Downloading spam-filter from cardmagic/classifier-models...
# => Saved to ~/.classifier/models/spam-filter.json
# Download from custom registry
classifier pull @person/foobar:sentiment
# => Downloading sentiment from person/foobar...
# => Saved to ~/.classifier/models/@person/foobar/sentiment.json
# Download ALL models from a registry
classifier pull @person/foobar
# => Downloading 5 models from person/foobar...
# => Saved to ~/.classifier/models/@person/foobar/
# Download to specific location
classifier pull spam-filter -o ./classifier.json
Models (list available)
# List models in default registry
classifier models
# => spam-filter Email spam detection (bayes, 245KB)
# => sentiment Positive/negative sentiment (bayes, 1.2MB)
# => language-detect Language detection (bayes, 890KB)
# List models in custom registry
classifier models @person/foobar
# => sentiment Custom sentiment model (bayes, 500KB)
# => topic Topic classifier (lsi, 2.1MB)
# Search models
classifier models --search spam
# => spam-filter Email spam detection (bayes, 245KB)
# Show model details
classifier models spam-filter
# => Name: spam-filter
# => Description: Email spam detection trained on SpamAssassin corpus
# => Type: bayes
# => Categories: spam, ham
# => Version: 1.0.0
# => Author: cardmagic
# => Size: 245KB
Push (contribute a model)
# Publish to default registry (opens PR)
classifier push ./classifier.json --name "my-classifier" --description "My custom classifier"
# => Creating pull request to cardmagic/classifier-models...
# => PR #42 created: https://github.com/cardmagic/classifier-models/pull/42
Using Remote Models
# Use model directly (downloads and caches automatically)
classifier -r spam-filter "Is this spam?"
# => Downloads if not cached, then classifies
# => spam
# Use model from custom registry
classifier -r @person/foobar:sentiment "I love this product!"
# => positive
# Cache location
~/.classifier/models/spam-filter.json # Default registry
~/.classifier/models/@person/foobar/sentiment.json # Custom registry
~/.classifier/registry.json # Cached model index
Proposed CLI (Full)
Classifying (default action)
# Classify with local model
classifier "Is this spam?"
# => ham
# Use a remote model directly
classifier -r spam-filter "Buy now! Limited offer!"
# => spam
# Use model from custom registry
classifier -r @person/foobar:sentiment "Great product!"
# => positive
# From stdin
echo "Buy now!" | classifier -r spam-filter
# => spam
# Show probabilities
classifier -r spam-filter -p "Buy now!"
# => spam:0.92 ham:0.08
Training
classifier train spam spam_emails.txt
classifier train ham email1.txt email2.txt
cat corpus/*.txt | classifier train spam
Info
classifier info
# => Type: bayes
# => Categories: spam, ham
# => Documents: 1,234
Fit (Logistic Regression)
classifier -m lr train spam spam.txt
classifier -m lr train ham ham.txt
classifier fit # Required before classifying
Search (LSI only)
classifier search "machine learning concepts"
# => articles/neural_networks.txt:0.89
Related (LSI only)
classifier related articles/ruby.txt
# => articles/python.txt:0.82
Options
Global Options
-f, --file FILE Model file (default: ./classifier.json)
-m, --model TYPE Classifier type: bayes, lsi, knn, lr (default: bayes)
-r, --remote MODEL Use remote model: name or @user/repo:name
-p Show probabilities
-n, --count N Number of results for search/related (default: 10)
-q Quiet mode
-v, --version Show version
-h, --help Show help
KNN Options
-k, --neighbors N Number of neighbors (default: 5)
--weighted Use distance-weighted voting
Logistic Regression Options
--learning-rate N Learning rate (default: 0.1)
--regularization N L2 regularization (default: 0.01)
--max-iterations N Maximum iterations (default: 100)
Environment Variables
CLASSIFIER_FILE=./model.json # Default model path
CLASSIFIER_MODEL=bayes # Default classifier type
CLASSIFIER_REGISTRY=cardmagic/classifier-models # Default registry
CLASSIFIER_CACHE=~/.classifier # Cache directory
Command Summary
| Command | Description |
|---|---|
| `classifier "text"` | Classify text (default action) |
| `classifier -r model "text"` | Classify using remote model |
| `classifier -r @user/repo:model "text"` | Classify using model from custom registry |
| `classifier train <cat> [files]` | Train a category |
| `classifier info` | Show model details |
| `classifier fit` | Fit model (LR only) |
| `classifier search "query"` | Semantic search (LSI only) |
| `classifier related <item>` | Find related docs (LSI only) |
| `classifier models` | List models in default registry |
| `classifier models @user/repo` | List models in custom registry |
| `classifier pull <model>` | Download model from default registry |
| `classifier pull @user/repo:model` | Download model from custom registry |
| `classifier pull @user/repo` | Download ALL models from registry |
| `classifier push <file>` | Contribute model to default registry |
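One way the global options and environment variables in this proposal could be wired together with `optparse` is sketched below. It covers only a subset of the flags; `parse_cli` is a hypothetical name, and the precedence it implements (command-line flag over environment variable over built-in default) is an assumption rather than something the proposal states.

```ruby
require "optparse"

# Hypothetical parser for a subset of the proposed global options.
# Defaults fall back to CLASSIFIER_FILE / CLASSIFIER_MODEL when set.
def parse_cli(argv, env = ENV)
  options = {
    file: env.fetch("CLASSIFIER_FILE", "./classifier.json"),
    model: env.fetch("CLASSIFIER_MODEL", "bayes"),
    remote: nil,
    probabilities: false,
    count: 10
  }

  parser = OptionParser.new do |opts|
    opts.banner = "Usage: classifier [options] [command] [args]"
    opts.on("-f", "--file FILE", "Model file") { |v| options[:file] = v }
    # The array restricts -m to the listed classifier types.
    opts.on("-m", "--model TYPE", %w[bayes lsi knn lr], "Classifier type") { |v| options[:model] = v }
    opts.on("-r", "--remote MODEL", "Remote model") { |v| options[:remote] = v }
    opts.on("-p", "Show probabilities") { options[:probabilities] = true }
    opts.on("-n", "--count N", Integer, "Number of results") { |v| options[:count] = v }
  end

  rest = parser.parse(argv) # non-destructive; returns the non-option arguments
  [options, rest]
end

options, rest = parse_cli(["-r", "spam-filter", "-p", "Buy now!"], {})
p options
p rest

begin
  parse_cli(["--model", "svm"], {})
rescue OptionParser::ParseError => e
  # A real executable would exit with status 2 (usage error) here.
  warn "usage error: #{e.message}"
end
```

Passing the environment in as a plain hash keeps the function easy to test without mutating `ENV`.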
Examples
# Quick start with pre-trained model
classifier -r spam-filter "Meeting tomorrow at 3pm"
# => ham
# Use community model from another repo
classifier -r @nlp-models/classifiers:sentiment "I love this!"
# => positive
# Download all models from a repo
classifier pull @company/internal-models
# => Downloaded 8 models to ~/.classifier/models/@company/internal-models/
# Train your own and share
classifier train positive good_reviews.txt
classifier train negative bad_reviews.txt
classifier push ./classifier.json --name "product-reviews"
# => PR created!
Implementation Notes
- Use `optparse` (stdlib) - no external dependencies
- Exit codes: 0 success, 1 error, 2 usage error
- Registry uses GitHub raw URLs for downloads
- Cache models in `~/.classifier/models/`
- Model paths: `~/.classifier/models/<model>.json` or `~/.classifier/models/@user/repo/<model>.json`
- `push` command uses `gh` CLI or GitHub API to create PR
- Model index cached locally, refreshed on `classifier models --refresh`
- Parse `@user/repo:model` syntax to extract registry and model name
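The reference parsing and cache layout in these notes can be sketched as follows. This is an illustration under stated assumptions, not the gem's implementation: `ModelRef`, `parse_model_ref`, and `cache_path` are hypothetical names, and the regular expression assumes user, repo, and model names contain no slashes, colons, or whitespace.

```ruby
# Hypothetical parser for the proposed model reference syntax:
#   "sentiment"                -> default registry, one model
#   "@person/foobar:sentiment" -> custom registry, one model
#   "@person/foobar"           -> custom registry, all models (model is nil)
DEFAULT_REGISTRY = "cardmagic/classifier-models"
MODEL_REF = %r{\A@(?<registry>[^/:\s]+/[^/:\s]+)(?::(?<model>[^/:\s]+))?\z}

ModelRef = Struct.new(:registry, :model)

def parse_model_ref(ref)
  if (m = MODEL_REF.match(ref))
    ModelRef.new(m[:registry], m[:model])
  elsif ref.match?(/\A[^@\/:\s]+\z/)
    ModelRef.new(DEFAULT_REGISTRY, ref)
  else
    raise ArgumentError, "invalid model reference: #{ref.inspect}"
  end
end

# Cache path following the layout in the notes above. Default-registry
# models sit directly under models/; others under models/@user/repo/.
def cache_path(ref, cache_dir: File.join(Dir.home, ".classifier"))
  parts = [cache_dir, "models"]
  parts << "@#{ref.registry}" unless ref.registry == DEFAULT_REGISTRY
  parts << "#{ref.model}.json" if ref.model
  File.join(*parts)
end

ref = parse_model_ref("@person/foobar:sentiment")
puts ref.registry
puts ref.model
puts cache_path(ref, cache_dir: "/tmp/classifier-cache")
```

Returning a `nil` model for a bare `@user/repo` reference lets `pull` distinguish "download everything" from a single-model request without a separate flag.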
Related
- jekyll/classifier-reborn#99: Executable for training with a persistent data store
- #122: Add keywords CLI tool for text vectorization (TF-IDF CLI tool)
- Model registry: https://github.com/cardmagic/classifier-models