
Implement proper Laplace smoothing in Bayes classifier #64

@cardmagic

Description

Summary

The current smoothing in the Bayesian classifier uses an ad-hoc magic number instead of proper probabilistic smoothing.

Problem Location

File: lib/classifier/bayes.rb:75

s = category_words.key?(word) ? category_words[word] : 0.1  # Magic number

Why This Matters

  • The hardcoded 0.1 violates proper probabilistic foundations
  • Smoothing factor should be proportional to vocabulary size (standard Laplace smoothing uses α=1)
  • Classification accuracy degrades on sparse vocabularies
  • Can produce pathological results when vocabulary size varies significantly
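The last point is easy to demonstrate numerically. The sketch below (made-up counts, not real training data) compares the 0.1 floor against add-one smoothing for one category: with Laplace smoothing the per-word probabilities form a valid distribution, while the ad-hoc floor makes them sum to more than 1.

```ruby
# Hypothetical counts for one category; "rails" is an unseen word.
counts = { "ruby" => 4, "gem" => 1 }
vocab  = ["ruby", "gem", "rails"]
total  = counts.values.sum          # 5 observed tokens
alpha  = 1.0                        # Laplace smoothing parameter

# Add-one smoothing: P(w|c) = (count + α) / (total + α * |V|)
laplace = vocab.map { |w| (counts.fetch(w, 0) + alpha) / (total + alpha * vocab.size) }

# Ad-hoc 0.1 floor, as in bayes.rb today
magic = vocab.map { |w| counts.fetch(w, 0.1) / total.to_f }

laplace.sum  # a proper distribution: sums to 1.0
magic.sum    # sums to 1.02 — not a valid probability distribution
```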

Proposed Fix

Implement proper add-k (Laplace) smoothing. Sketch below; it assumes the gem's existing @categories and @category_word_count instance variables and the String#word_hash tokenizer:

def classifications(text)
  score = Hash.new(0.0)
  word_hash = text.word_hash  # tokenizer from the gem's String extensions
  vocab_size = @categories.values.flat_map(&:keys).uniq.size
  alpha = 1.0  # Laplace smoothing parameter (add-one)

  @categories.each do |category, category_words|
    total = @category_word_count[category] + (alpha * vocab_size)
    word_hash.each_key do |word|
      # P(word|category) = (count + α) / (total + α * vocab_size)
      s = (category_words[word] || 0) + alpha
      score[category.to_s] += Math.log(s / total)
    end
  end
  score
end
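To see the smoothed scoring loop end to end, here is a standalone version with the instance variables replaced by locals and toy training counts (all names and data here are illustrative, not the gem's real output):

```ruby
# Toy trained state: category => word counts (assumed data).
categories = {
  "Interesting"   => { "ruby" => 3, "classifier" => 2 },
  "Uninteresting" => { "spam" => 4, "offer" => 1 }
}
category_totals = categories.transform_values { |h| h.values.sum }
vocab_size = categories.values.flat_map(&:keys).uniq.size
alpha = 1.0

word_hash = { "ruby" => 1, "classifier" => 1 }  # tokenized input text
scores = Hash.new(0.0)

categories.each do |category, category_words|
  total = category_totals[category] + (alpha * vocab_size)
  word_hash.each_key do |word|
    s = (category_words[word] || 0) + alpha
    scores[category] += Math.log(s / total)
  end
end

scores.max_by { |_, v| v }.first  # => "Interesting"
```

Note that unseen words contribute a finite log(α / total) rather than a vocabulary-independent constant, which is what keeps scores comparable across categories of different sizes.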

Benefits

  • Mathematically sound probabilistic model
  • Better accuracy on small training sets
  • Configurable smoothing parameter for tuning

Impact

Severity: High - affects classification correctness
