Add CLI executable for training and classification #119

@cardmagic

Summary

Provide a command-line executable for training classifiers and classifying text, with support for downloading pre-trained models from community repositories.

Motivation

A CLI tool would enable:

  • Training classifiers from shell scripts or CI/CD pipelines
  • Quick classification without writing Ruby code
  • Integration with other Unix tools via pipes
  • Batch processing of files
  • Sharing and reusing pre-trained models from community repositories

Model Registry (like Homebrew taps)

Pre-trained models are stored in public GitHub repositories that anyone can contribute to:

  • Default registry: github.com/cardmagic/classifier-models
  • Custom registries: Any GitHub repo with the same structure

Model Syntax

# From default registry (cardmagic/classifier-models)
classifier -r sentiment "I love this!"

# From custom registry (@user/repo:model)
classifier -r @person/foobar:sentiment "I love this!"

# Pull from default registry
classifier pull sentiment

# Pull from custom registry
classifier pull @person/foobar:sentiment

# Pull ALL models from a registry
classifier pull @person/foobar
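
The reference syntax above could be parsed along these lines. This is a minimal sketch, not the proposed implementation: `ModelRef` and `parse_model_ref` are hypothetical names, and the rule that a bare `@user/repo` means "the whole registry" is inferred from the pull examples.

```ruby
DEFAULT_REGISTRY = "cardmagic/classifier-models"

# Hypothetical value object: model is nil when the reference names a whole registry.
ModelRef = Struct.new(:registry, :model)

# Hypothetical parser for "name", "@user/repo:name" and "@user/repo".
def parse_model_ref(ref, default_registry: DEFAULT_REGISTRY)
  if (m = ref.match(%r{\A@([^/\s:]+/[^/\s:]+)(?::([^/\s:]+))?\z}))
    ModelRef.new(m[1], m[2])           # custom registry, optional model name
  elsif ref.match?(%r{\A[^@/\s:]+\z})
    ModelRef.new(default_registry, ref) # bare name resolves to the default registry
  else
    raise ArgumentError, "invalid model reference: #{ref.inspect}"
  end
end

ref = parse_model_ref("@person/foobar:sentiment")
puts ref.registry # prints person/foobar
puts ref.model    # prints sentiment
```

Returning `nil` for the model name lets `pull` distinguish "one model" from "all models" without a second parser.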

Repository Structure

classifier-models/
├── models/
│   ├── spam-filter.json           # Bayes spam classifier
│   ├── sentiment.json             # Sentiment classifier
│   └── language-detect.json       # Language detection
├── models.json                    # Index of available models
└── README.md

Model Index (models.json)

{
  "version": "1.0.0",
  "models": {
    "spam-filter": {
      "description": "Email spam detection trained on SpamAssassin corpus",
      "type": "bayes",
      "categories": ["spam", "ham"],
      "file": "models/spam-filter.json",
      "version": "1.0.0",
      "author": "cardmagic",
      "size": "245KB"
    },
    "sentiment": {
      "description": "Positive/negative sentiment analysis",
      "type": "bayes",
      "categories": ["positive", "negative", "neutral"],
      "file": "models/sentiment.json",
      "version": "1.2.0",
      "author": "contributor",
      "size": "1.2MB"
    }
  }
}
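
Reading the index needs only the stdlib `json` module. The sketch below inlines a trimmed copy of the index so it runs on its own; `find_model` is a hypothetical helper, and raising `KeyError` for an unknown model is an assumption about how errors might be reported.

```ruby
require "json"

# Trimmed copy of the index above, inlined so the sketch is self-contained.
INDEX_JSON = <<~JSON
  {
    "version": "1.0.0",
    "models": {
      "spam-filter": {
        "type": "bayes",
        "categories": ["spam", "ham"],
        "file": "models/spam-filter.json"
      }
    }
  }
JSON

# Hypothetical lookup: return the index entry for a model or fail clearly.
def find_model(index, name)
  index.fetch("models", {}).fetch(name) do
    raise KeyError, "unknown model: #{name}"
  end
end

index = JSON.parse(INDEX_JSON)
entry = find_model(index, "spam-filter")
puts entry["file"]                  # prints models/spam-filter.json
puts entry["categories"].join(", ") # prints spam, ham
```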

Commands for Model Management

Pull (download models)

# Download from default registry
classifier pull spam-filter
# => Downloading spam-filter from cardmagic/classifier-models...
# => Saved to ~/.classifier/models/spam-filter.json

# Download from custom registry
classifier pull @person/foobar:sentiment
# => Downloading sentiment from person/foobar...
# => Saved to ~/.classifier/models/@person/foobar/sentiment.json

# Download ALL models from a registry
classifier pull @person/foobar
# => Downloading 5 models from person/foobar...
# => Saved to ~/.classifier/models/@person/foobar/

# Download to specific location
classifier pull spam-filter -o ./classifier.json

Models (list available)

# List models in default registry
classifier models
# => spam-filter      Email spam detection (bayes, 245KB)
# => sentiment        Positive/negative sentiment (bayes, 1.2MB)
# => language-detect  Language detection (bayes, 890KB)

# List models in custom registry
classifier models @person/foobar
# => sentiment        Custom sentiment model (bayes, 500KB)
# => topic            Topic classifier (lsi, 2.1MB)

# Search models
classifier models --search spam
# => spam-filter      Email spam detection (bayes, 245KB)

# Show model details
classifier models spam-filter
# => Name: spam-filter
# => Description: Email spam detection trained on SpamAssassin corpus
# => Type: bayes
# => Categories: spam, ham
# => Version: 1.0.0
# => Author: cardmagic
# => Size: 245KB
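
The listing format above can be produced directly from the `models` hash of the index. A rough sketch: `model_lines` is a hypothetical name, and matching `--search` against both name and description (case-insensitively) is an assumption, since the proposal does not define the match rule.

```ruby
# Hypothetical formatter for `classifier models`; `models` is the "models"
# hash from models.json. The --search match rule here is an assumption.
def model_lines(models, search: nil)
  needle = search&.downcase
  matches = models.select do |name, meta|
    needle.nil? ||
      name.downcase.include?(needle) ||
      meta["description"].to_s.downcase.include?(needle)
  end
  # Pad names to the longest matching name plus two spaces.
  width = matches.keys.map(&:length).max.to_i + 2
  matches.map do |name, meta|
    format("%-#{width}s%s (%s, %s)", name, meta["description"], meta["type"], meta["size"])
  end
end

models = {
  "spam-filter" => { "description" => "Email spam detection", "type" => "bayes", "size" => "245KB" },
  "sentiment"   => { "description" => "Positive/negative sentiment", "type" => "bayes", "size" => "1.2MB" }
}
puts model_lines(models)
puts model_lines(models, search: "spam")
```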

Push (contribute a model)

# Publish to default registry (opens PR)
classifier push ./classifier.json --name "my-classifier" --description "My custom classifier"
# => Creating pull request to cardmagic/classifier-models...
# => PR #42 created: https://github.com/cardmagic/classifier-models/pull/42

Using Remote Models

# Use model directly (downloads and caches automatically)
classifier -r spam-filter "Is this spam?"
# (downloads the model if it is not cached, then classifies)
# => spam

# Use model from custom registry
classifier -r @person/foobar:sentiment "I love this product!"
# => positive

# Cache location
~/.classifier/models/spam-filter.json              # Default registry
~/.classifier/models/@person/foobar/sentiment.json # Custom registry
~/.classifier/registry.json                        # Cached model index
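
The cache layout above maps to a small path helper. A minimal sketch, assuming a hypothetical `cache_path` function that receives the registry as `user/repo` (or `nil` for the default registry):

```ruby
# Hypothetical helper: where a downloaded model would be cached.
# registry is "user/repo" for custom registries, nil for the default one.
def cache_path(model, registry: nil, cache_dir: File.join(Dir.home, ".classifier"))
  parts = [cache_dir, "models"]
  parts << "@#{registry}" if registry
  File.join(*parts, "#{model}.json")
end

puts cache_path("spam-filter", cache_dir: "/tmp/classifier")
# prints /tmp/classifier/models/spam-filter.json
puts cache_path("sentiment", registry: "person/foobar", cache_dir: "/tmp/classifier")
# prints /tmp/classifier/models/@person/foobar/sentiment.json
```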

Proposed CLI (Full)

Classifying (default action)

# Classify with local model
classifier "Is this spam?"
# => ham

# Use a remote model directly
classifier -r spam-filter "Buy now! Limited offer!"
# => spam

# Use model from custom registry
classifier -r @person/foobar:sentiment "Great product!"
# => positive

# From stdin
echo "Buy now!" | classifier -r spam-filter
# => spam

# Show probabilities
classifier -r spam-filter -p "Buy now!"
# => spam:0.92 ham:0.08
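
Supporting both an argument and a pipe comes down to choosing the input source. The sketch below shows one way to do that; `input_text` is a hypothetical name, and joining multiple arguments with spaces is an assumption.

```ruby
require "stringio"

# Hypothetical helper: prefer text given as arguments, else read piped stdin.
def input_text(args, stdin: $stdin)
  return args.join(" ") unless args.empty?
  return nil if stdin.tty? # nothing piped in, so there is nothing to classify

  text = stdin.read.strip
  text.empty? ? nil : text
end

input_text(["Buy now!"])                          # => "Buy now!"
input_text([], stdin: StringIO.new("Buy now!\n")) # => "Buy now!"
```

A `nil` result gives the caller a clear signal to print usage and exit with the usage-error code.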

Training

classifier train spam spam_emails.txt
classifier train ham email1.txt email2.txt
cat corpus/*.txt | classifier train spam

Info

classifier info
# => Type: bayes
# => Categories: spam, ham
# => Documents: 1,234

Fit (Logistic Regression)

classifier -m lr train spam spam.txt
classifier -m lr train ham ham.txt
classifier fit                    # Required before classifying

Search (LSI only)

classifier search "machine learning concepts"
# => articles/neural_networks.txt:0.89

Related (LSI only)

classifier related articles/ruby.txt
# => articles/python.txt:0.82

Options

Global Options

-f, --file FILE        Model file (default: ./classifier.json)
-m, --model TYPE       Classifier type: bayes, lsi, knn, lr (default: bayes)
-r, --remote MODEL     Use remote model: name or @user/repo:name
-p                     Show probabilities
-n, --count N          Number of results for search/related (default: 10)
-q                     Quiet mode
-v, --version          Show version
-h, --help             Show help

KNN Options

-k, --neighbors N      Number of neighbors (default: 5)
--weighted             Use distance-weighted voting

Logistic Regression Options

--learning-rate N      Learning rate (default: 0.1)
--regularization N     L2 regularization (default: 0.01)
--max-iterations N     Maximum iterations (default: 100)
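
The options above fit `optparse` from the stdlib. The following is a sketch only: `parse_options` and the keys of the returned hash are hypothetical, and the defaults are copied from the option descriptions. `-h`/`--help` is provided by `OptionParser` itself.

```ruby
require "optparse"

# Hypothetical option parsing; returns [options, remaining_args].
def parse_options(argv)
  opts = {
    file: "./classifier.json", model: "bayes", count: 10,
    probabilities: false, quiet: false,
    neighbors: 5, weighted: false,
    learning_rate: 0.1, regularization: 0.01, max_iterations: 100
  }
  parser = OptionParser.new do |o|
    o.banner = "Usage: classifier [options] [command] [args]"
    o.on("-f", "--file FILE", "Model file") { |v| opts[:file] = v }
    # An Array argument restricts the accepted values.
    o.on("-m", "--model TYPE", %w[bayes lsi knn lr], "Classifier type") { |v| opts[:model] = v }
    o.on("-r", "--remote MODEL", "Remote model: name or @user/repo:name") { |v| opts[:remote] = v }
    o.on("-p", "Show probabilities") { opts[:probabilities] = true }
    o.on("-n", "--count N", Integer, "Number of results") { |v| opts[:count] = v }
    o.on("-q", "Quiet mode") { opts[:quiet] = true }
    o.on("-k", "--neighbors N", Integer, "Number of neighbors") { |v| opts[:neighbors] = v }
    o.on("--weighted", "Distance-weighted voting") { opts[:weighted] = true }
    o.on("--learning-rate N", Float, "Learning rate") { |v| opts[:learning_rate] = v }
    o.on("--regularization N", Float, "L2 regularization") { |v| opts[:regularization] = v }
    o.on("--max-iterations N", Integer, "Maximum iterations") { |v| opts[:max_iterations] = v }
  end
  rest = parser.parse(argv) # non-destructive; returns the non-option arguments
  [opts, rest]
end

opts, rest = parse_options(["-m", "lr", "train", "spam", "spam.txt"])
puts opts[:model]   # prints lr
puts rest.join(" ") # prints train spam spam.txt
```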

Environment Variables

CLASSIFIER_FILE=./model.json                       # Default model path
CLASSIFIER_MODEL=bayes                             # Default classifier type
CLASSIFIER_REGISTRY=cardmagic/classifier-models    # Default registry
CLASSIFIER_CACHE=~/.classifier                     # Cache directory
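
Resolving these variables against built-in defaults could look like the sketch below. `settings` and its keys are hypothetical, and the fallback values are taken from the option defaults and registry described earlier; how command-line flags should override them is left out.

```ruby
# Fallbacks used when a variable is not set (taken from the proposal's defaults).
DEFAULTS = {
  "CLASSIFIER_FILE" => "./classifier.json",
  "CLASSIFIER_MODEL" => "bayes",
  "CLASSIFIER_REGISTRY" => "cardmagic/classifier-models",
  "CLASSIFIER_CACHE" => "~/.classifier"
}.freeze

# Hypothetical resolver; env is injectable so it can be exercised without
# touching the real process environment.
def settings(env = ENV)
  {
    file: env.fetch("CLASSIFIER_FILE", DEFAULTS["CLASSIFIER_FILE"]),
    model: env.fetch("CLASSIFIER_MODEL", DEFAULTS["CLASSIFIER_MODEL"]),
    registry: env.fetch("CLASSIFIER_REGISTRY", DEFAULTS["CLASSIFIER_REGISTRY"]),
    # expand_path turns a leading "~" into the user's home directory
    cache: File.expand_path(env.fetch("CLASSIFIER_CACHE", DEFAULTS["CLASSIFIER_CACHE"]))
  }
end

config = settings("CLASSIFIER_MODEL" => "lr", "CLASSIFIER_CACHE" => "/tmp/classifier")
puts config[:model] # prints lr
puts config[:file]  # prints ./classifier.json
```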

Command Summary

Command                                 Description
classifier "text"                       Classify text (default action)
classifier -r model "text"              Classify using remote model
classifier -r @user/repo:model "text"   Classify using model from custom registry
classifier train <cat> [files]          Train a category
classifier info                         Show model details
classifier fit                          Fit model (LR only)
classifier search "query"               Semantic search (LSI only)
classifier related <item>               Find related docs (LSI only)
classifier models                       List models in default registry
classifier models @user/repo            List models in custom registry
classifier pull <model>                 Download model from default registry
classifier pull @user/repo:model        Download model from custom registry
classifier pull @user/repo              Download ALL models from registry
classifier push <file>                  Contribute model to default registry

Examples

# Quick start with pre-trained model
classifier -r spam-filter "Meeting tomorrow at 3pm"
# => ham

# Use community model from another repo
classifier -r @nlp-models/classifiers:sentiment "I love this!"
# => positive

# Download all models from a repo
classifier pull @company/internal-models
# => Downloaded 8 models to ~/.classifier/models/@company/internal-models/

# Train your own and share
classifier train positive good_reviews.txt
classifier train negative bad_reviews.txt
classifier push ./classifier.json --name "product-reviews"
# => PR created!

Implementation Notes

  • Use optparse (stdlib) - no external dependencies
  • Exit codes: 0 success, 1 error, 2 usage error
  • Registry uses GitHub raw URLs for downloads
  • Cache models in ~/.classifier/models/
  • Model paths: ~/.classifier/models/<model>.json or ~/.classifier/models/@user/repo/<model>.json
  • The push command uses the gh CLI or the GitHub API to create a PR
  • Cache the model index locally (~/.classifier/registry.json); refresh it with classifier models --refresh
  • Parse @user/repo:model syntax to extract registry and model name
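
Two of the notes above, raw download URLs and exit codes, can be sketched as follows. The names `raw_url` and `exit_code_for` are hypothetical, the `main` branch is an assumption (a registry could use another default branch), and treating only `OptionParser::ParseError` as a usage error is one possible reading of "usage error".

```ruby
require "optparse"

EXIT_SUCCESS = 0
EXIT_ERROR = 1
EXIT_USAGE = 2

# Build a raw-content URL for a file in a registry repo.
# Assumption: the registry's files live on a branch named "main".
def raw_url(registry, path, branch: "main")
  "https://raw.githubusercontent.com/#{registry}/#{branch}/#{path}"
end

# Run a block and translate failures into the exit codes listed above.
def exit_code_for
  yield
  EXIT_SUCCESS
rescue OptionParser::ParseError => e
  warn "classifier: #{e.message}"
  EXIT_USAGE
rescue StandardError => e
  warn "classifier: #{e.message}"
  EXIT_ERROR
end

puts raw_url("cardmagic/classifier-models", "models/spam-filter.json")
# prints https://raw.githubusercontent.com/cardmagic/classifier-models/main/models/spam-filter.json
```

An executable could then end with `exit exit_code_for { ... }` so that every command shares the same error handling.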

Labels

enhancement (New feature or request), priority: low (Low priority - nice to have)
