Fix for classifying a doc when no training has been done. Fixed tests for HBase, memory, and file system storage, but couldn't get Cassandra tests running under Ruby 1.9.3.
commit 8ae62a137cac5b65e3bde69b8b2e3bd2fdca9798 1 parent 7070bb3
@bborn bborn authored
14 README.rdoc
@@ -9,7 +9,7 @@ First, install HBase/Hadoop, Mongo, or Cassandra (>= 0.7.0-rc2). Then, install
gem install hbaserb
# or
gem install cassandra
- # or
+ # or
gem install mongo
If you're using HBase, make sure the HBase Thrift interface has been started as well. Then:
@@ -35,7 +35,7 @@ Using the naive Bayes classifier:
# This will return the most likely class (as symbol)
puts c.classify "This is some spammy text"
- # This will return Hash with classes as keys and
+ # This will return Hash with classes as keys and
# membership probability as values
puts c.classifications "This is some spammy text"
@@ -54,13 +54,13 @@ Using the naive Bayes classifier:
== KL Diverence Classifier
There is a Kullback–Leibler divergence classifier as well. KL divergence is a distance measure (though not a true metric because it does not satisfy the triangle inequality). The KL classifier simply measures the relative entropy between the text you want to classify and each of the classes. The class with the shortest "distance" is the best class. You may find that for a especially large corpus it may be slightly faster to use this classifier (since prior probablities are never calculated, only likelihoods).
-The API is the same as the NaiveBayesClassifier, except rather than calling "classifications" if you want actual numbers you call "distances".
+The API is the same as the NaiveBayesClassifier, except rather than calling "classifications" if you want actual numbers you call "distances".
require 'rubygems'
require 'ankusa'
require 'ankusa/hbase_storage'
- # connect to HBase
+ # connect to HBase
storage = Ankusa::HBaseStorage.new 'localhost'
c = Ankusa::KLDivergenceClassifier.new storage
@@ -72,7 +72,7 @@ The API is the same as the NaiveBayesClassifier, except rather than calling "cla
# This will return the most likely class (as symbol)
puts c.classify "This is some spammy text"
- # This will return Hash with classes as keys and
+ # This will return Hash with classes as keys and
# distances >= 0 as values
puts c.distances "This is some spammy text"
@@ -104,13 +104,13 @@ HBase storage:
For Cassandra storage:
* You will need Cassandra version 0.7.0-rc2 or greater.
-* You will need to set a max number classes since current implementation of the Ruby Cassandra client doesn't support table scans.
+* You will need to set a max number classes since current implementation of the Ruby Cassandra client doesn't support table scans.
* Prior to using the Cassandra storage you will need to run the following command from the cassandra-cli: "create keyspace ankusa with replication_factor = 1". This should be fixed with a new release candidate for Cassandra.
To use the Cassandra storage class:
require 'ankusa/cassandra_storage'
# defaults: host='127.0.0.1', port=9160, keyspace = 'ankusa', max_classes = 100
- storage = Ankusa::HBaseStorage.new host, port, keyspace, max_classes
+ storage = Ankusa::CassandraStorage.new host, port, keyspace, max_classes
For MongoDB storage:
require 'ankusa/mongo_db_storage'
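
The README text in this diff describes the KL classifier as picking the class with the smallest relative entropy to the document. A minimal sketch of that idea follows. It is not Ankusa's implementation: the `kl_divergence` helper and the hand-written word distributions are hypothetical, and it assumes every word in the document has a non-zero probability in each class (real code would need smoothing).

```ruby
# Illustrative only: kl_divergence and the sample distributions below are
# hypothetical and are not part of the Ankusa API.
#
# D(P || Q) = sum over words of P(word) * log(P(word) / Q(word))
# Assumes dist_q has a non-zero entry for every word in dist_p;
# Hash#fetch raises KeyError otherwise.
def kl_divergence(dist_p, dist_q)
  dist_p.inject(0.0) do |total, (word, prob)|
    next total if prob.zero? # a zero-probability term contributes nothing
    total + prob * Math.log(prob / dist_q.fetch(word))
  end
end

# Word distribution of the text to classify.
doc = { :spam => 0.5, :great => 0.5 }

# Per-class word distributions (made-up numbers).
class_dists = {
  :spam => { :spam => 0.6, :great => 0.3, :word => 0.1 },
  :good => { :spam => 0.1, :great => 0.2, :word => 0.7 }
}

# "Distance" from the document to each class; all values are >= 0.
distances = {}
class_dists.each { |klass, dist| distances[klass] = kl_divergence(doc, dist) }

# The class with the smallest divergence is the best match.
best = distances.min_by { |_, d| d }.first
puts best
```

Note that the measure is asymmetric: `kl_divergence(a, b)` generally differs from `kl_divergence(b, a)`, which is another reason it is not a true metric.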
2  Rakefile
@@ -23,7 +23,7 @@ Rake::TestTask.new("test_memory") { |t|
desc "Run all unit tests with HBase storage"
Rake::TestTask.new("test_hbase") { |t|
t.libs += ["lib", "."]
- t.test_files = FileList['test/hasher_test.rb']
+ t.test_files = FileList['test/hasher_test.rb', 'test/hbase_classifier_test.rb']
t.verbose = true
}
2  lib/ankusa/classifier.rb
@@ -49,7 +49,7 @@ def get_word_probs(word, classnames)
probs = Hash.new 0
@storage.get_word_counts(word).each { |k,v| probs[k] = v if classnames.include? k }
vs = vocab_sizes
- classnames.each { |cn|
+ classnames.each { |cn|
# if we've never seen the class, the word prob is 0
next unless vs.has_key? cn
10 lib/ankusa/hbase_storage.rb
@@ -26,7 +26,7 @@ def reset
drop_tables
init_tables
end
-
+
def drop_tables
freq_table.delete
summary_table.delete
@@ -69,10 +69,12 @@ def get_total_word_count(klass)
@klass_word_counts[klass] = summary_table.get(klass, "totals:wordcount").first.to_i64.to_f
}
end
-
+
def get_doc_count(klass)
@klass_doc_counts.fetch(klass) {
- @klass_doc_counts[klass] = summary_table.get(klass, "totals:doccount").first.to_i64.to_f
+ totals = summary_table.get(klass, "totals:doccount")
+ totals = (totals.size === 0) ? 0 : totals.first.to_i64.to_f
+ @klass_doc_counts[klass] = totals
}
end
@@ -83,7 +85,7 @@ def incr_word_count(klass, word, count)
if size == count
summary_table.atomic_increment klass, "totals:vocabsize"
elsif size == 0
- summary_table.atomic_increment klass, "totals:vocabsize", -1
+ summary_table.atomic_increment klass, "totals:vocabsize", -1
end
size
end
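
The `get_doc_count` change above guards against an empty result when a class has never been trained, in which case `.first` would be `nil`. The pattern can be sketched in isolation as below. `FakeSummaryTable` and `DocCounts` are hypothetical stand-ins, not hbaserb or Ankusa classes, and plain integers replace the `to_i64` decoding used in the real storage code.

```ruby
# Hypothetical stand-in for the summary table: returns an empty array for
# rows that do not exist, which is the behaviour the guard assumes.
class FakeSummaryTable
  def initialize(rows)
    @rows = rows
  end

  def get(row, column)
    @rows.fetch([row, column], [])
  end
end

# Hypothetical wrapper showing the memoize-with-fetch pattern from the diff.
class DocCounts
  def initialize(table)
    @table = table
    @cache = {}
  end

  def get_doc_count(klass)
    # Hash#fetch runs the block only on a cache miss.
    @cache.fetch(klass) {
      cells = @table.get(klass, "totals:doccount")
      # Without this check, cells.first is nil for an untrained class.
      @cache[klass] = cells.empty? ? 0.0 : cells.first.to_f
    }
  end
end

table = FakeSummaryTable.new([:spam, "totals:doccount"] => [2])
counts = DocCounts.new(table)
puts counts.get_doc_count(:spam)
puts counts.get_doc_count(:never_trained)
```

This sketch returns a Float in both branches; the committed code returns the Integer `0` in the empty case and a Float otherwise, which works in the later arithmetic because the caller converts with `to_f`.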
27 lib/ankusa/naive_bayes.rb
@@ -6,9 +6,16 @@ class NaiveBayesClassifier
def classify(text, classes=nil)
# return the most probable class
- log_likelihoods(text, classes).sort_by { |c| -c[1] }.first.first
+
+ result = log_likelihoods(text, classes)
+ if result.values.uniq.size. === 1
+ # unless all classes are equally likely, then return nil
+ return nil
+ else
+ result.sort_by { |c| -c[1] }.first.first
+ end
end
-
+
# Classes is an array of classes to look at
def classifications(text, classnames=nil)
result = log_likelihoods text, classnames
@@ -17,8 +24,10 @@ def classifications(text, classnames=nil)
}
# normalize to get probs
- sum = result.values.inject { |x,y| x+y }
- result.keys.each { |k| result[k] = result[k] / sum }
+ sum = result.values.inject{ |x,y| x+y }
+ result.keys.each { |k|
+ result[k] = result[k] / sum
+ } unless sum.zero?
result
end
@@ -29,7 +38,7 @@ def log_likelihoods(text, classnames=nil)
TextHash.new(text).each { |word, count|
probs = get_word_probs(word, classnames)
- classnames.each { |k|
+ classnames.each { |k|
# log likelihood should be negative infinity if we've never seen the klass
result[k] += probs[k] > 0 ? (Math.log(probs[k]) * count) : -INFTY
}
@@ -37,9 +46,11 @@ def log_likelihoods(text, classnames=nil)
# add the prior
doc_counts = doc_count_totals.select { |k,v| classnames.include? k }.map { |k,v| v }
- doc_count_total = (doc_counts.inject { |x,y| x+y } + classnames.length).to_f
- classnames.each { |k|
- result[k] += Math.log((@storage.get_doc_count(k) + 1).to_f / doc_count_total)
+
+ doc_count_total = (doc_counts.inject(0){ |x,y| x+y } + classnames.length).to_f
+
+ classnames.each { |k|
+ result[k] += Math.log((@storage.get_doc_count(k) + 1).to_f / doc_count_total)
}
result
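
The three guards in this file share one cause: with no training data the sums are taken over empty or all-equal collections. A small standalone sketch of each guard is given below. The `normalize` and `most_likely` helpers are hypothetical names for illustration and do not exist in Ankusa.

```ruby
# inject without an initial value returns nil for an empty array, so the
# later "+ classnames.length" would raise NoMethodError on nil.
empty_without_seed = [].inject { |x, y| x + y }
empty_with_seed    = [].inject(0) { |x, y| x + y }

# Hypothetical helper: divide each score by the total, but leave the scores
# untouched when the total is zero to avoid dividing by zero.
def normalize(scores)
  sum = scores.values.inject(0.0) { |x, y| x + y }
  return scores if sum.zero?
  normalized = {}
  scores.each { |k, v| normalized[k] = v / sum }
  normalized
end

# Hypothetical helper: return nil when every class has the same score (or
# there are no classes), since no class is more likely than another.
def most_likely(log_likelihoods)
  return nil if log_likelihoods.values.uniq.size <= 1
  log_likelihoods.max_by { |_, v| v }.first
end

puts normalize(:spam => 1.0, :good => 3.0).inspect
puts most_likely(:spam => -1.0, :good => -2.0).inspect
puts most_likely(:spam => -1.0, :good => -1.0).inspect
```

One consequence of the equal-scores rule is that `classify` returns `nil` for any tie across all classes, not only for the untrained case.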
17 test/classifier_base.rb
@@ -4,14 +4,14 @@ module ClassifierBase
def train
@classifier.train :spam, "spam and great spam" # spam:2 great:1
@classifier.train :good, "words for processing" # word:1 process:1
- @classifier.train :good, "good word" # word:1 good:1
+ @classifier.train :good, "good word" # word:1 good:1
end
def test_train
counts = @storage.get_word_counts(:spam)
assert_equal counts[:spam], 2
counts = @storage.get_word_counts(:word)
- assert_equal counts[:good], 2
+ assert_equal counts[:good], 2
assert_equal @storage.get_total_word_count(:good), 4
assert_equal @storage.get_doc_count(:good), 2
assert_equal @storage.get_total_word_count(:spam), 3
@@ -41,6 +41,17 @@ def setup
train
end
+ def test_untrained
+ @storage.reset
+
+ string = "spam is tastey"
+
+ hash = {:spam => 0, :good => 0}
+ assert_equal hash, @classifier.classifications(string)
+ assert_equal nil, @classifier.classify(string)
+ end
+
+
def test_probs
spamlog = Math.log(3.0 / 5.0) + Math.log(1.0 / 5.0) + Math.log(2.0 / 5.0)
goodlog = Math.log(1.0 / 7.0) + Math.log(1.0 / 7.0) + Math.log(3.0 / 5.0)
@@ -64,7 +75,7 @@ def test_probs
@classifier.train :somethingelse, "this is something else entirely spam"
cs = @classifier.classifications("spam is tastey", [:spam, :good])
assert_equal cs[:spam], spam
- assert_equal cs[:good], good
+ assert_equal cs[:good], good
# test for class we didn't train on
cs = @classifier.classifications("spam is super tastey if you are a zombie", [:spam, :nothing])
6 test/hbase_classifier_test.rb
@@ -1,11 +1,11 @@
require File.join File.dirname(__FILE__), 'classifier_base'
require 'ankusa/hbase_storage'
-module HBaseClassifierBase
+module HBaseClassifierBase
def initialize(name)
@freq_tablename = "ankusa_word_frequencies_test"
- @sum_tablename = "ankusa_summary_test"
- @storage = Ankusa::HBaseStorage.new CONFIG['hbase_host'], CONFIG['hbase_port'], @freq_tablename, @sum_tablename
+ @sum_tablename = "ankusa_summary_test"
+ @storage = Ankusa::HBaseStorage.new CONFIG['hbase_host'], CONFIG['hbase_port'], @freq_tablename, @sum_tablename
@freq_table = @storage.hbase.get_table(@freq_tablename)
@sum_table = @storage.hbase.get_table(@sum_tablename)
super(name)