Replace check method to see if a string is numeric
The old code used the strictly correct Float(word) method to see if a string
was numeric. This works reliably with all sorts of edge-case data, but it
is very slow.

Since we have already parsed out a lot of possibilities during word
atomisation (e.g. decimal numbers like 123.45 have already been split
into "123" and "45"), we do not need this level of edge-case
surety.

Therefore we can just do a simple regex check to see if the string is
all numerals or not.

In tests on 1000 emails (single-threaded), the run time was reduced
from 2.4 seconds to 1.4 seconds.

Since we have traded edge-case reliability for speed, we can no longer
leave this as a String class monkey-patch, so move it into a method that
will only be called by Ankusa itself.
rurounijones committed Jun 3, 2014
1 parent 8b223a0 commit 2d3d916
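
The speed trade-off described in the commit message can be explored with a small benchmark. This is only a sketch: `float_numeric?`, `digits_only?` and the word list are illustrative names and data, not part of Ankusa, and the timings will vary by machine and Ruby version. A likely reason for the difference is that `Float()` raises and rescues an exception for every non-numeric word.

```ruby
require 'benchmark'

# Strict check, equivalent to the removed String#numeric? monkey-patch.
def float_numeric?(word)
  true if Float(word) rescue false
end

# Cheap check: true only when every character is an ASCII digit.
def digits_only?(word)
  !!(word =~ /\A\d+\z/)
end

# Hypothetical sample: ordinary words mixed with a few all-digit tokens.
words = %w[hello fishing robot14 mother 123 45 21675] * 10_000

float_time = Benchmark.realtime { words.each { |w| float_numeric?(w) } }
regex_time = Benchmark.realtime { words.each { |w| digits_only?(w) } }

puts format("Float(): %.3fs, regex: %.3fs", float_time, regex_time)
```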
Showing 3 changed files with 27 additions and 15 deletions.
4 changes: 0 additions & 4 deletions lib/ankusa/extensions.rb
@@ -1,8 +1,4 @@
 class String
-  def numeric?
-    true if Float(self) rescue false
-  end
-
   def to_ascii
     encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "").force_encoding('UTF-8') rescue ""
   end
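
To make the "edge-case reliability" that is being given up concrete, the sketch below compares the removed `Float()`-based check with an all-digits regex on a few inputs. The helper names are hypothetical. `Float()` accepts decimals, signs and exponent notation, none of which an all-digits pattern matches; the commit relies on atomisation having split such tokens beforehand.

```ruby
# Same logic as the removed String#numeric?, written as a plain method.
def float_numeric?(word)
  true if Float(word) rescue false
end

# All-digits check (hypothetical helper name).
def digits_only?(word)
  !!(word =~ /\A\d+\z/)
end

# Print both verdicts side by side for a few sample tokens.
%w[21675 123.45 -7 1e5 robot14].each do |word|
  puts "#{word.ljust(8)} Float(): #{float_numeric?(word)}  regex: #{digits_only?(word)}"
end
```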
15 changes: 12 additions & 3 deletions lib/ankusa/hasher.rb
@@ -3,7 +3,7 @@

 module Ankusa

   class TextHash < Hash
     attr_reader :word_count

     def initialize(text=nil, stem=true)
@@ -19,14 +19,14 @@ def self.atomize(text)

     # word should be only alphanum chars at this point
     def self.valid_word?(word)
-      not (Ankusa::STOPWORDS.include?(word) || word.length < 3 || word.numeric?)
+      not (Ankusa::STOPWORDS.include?(word) || word.length < 3 || self.numeric_word?(word))
     end

     def add_text(text)
       if text.instance_of? Array
         text.each { |t| add_text t }
       else
         # replace dashes with spaces, then get rid of non-word/non-space characters,
         # then split by space to get words
         words = TextHash.atomize text
         words.each { |word| add_word(word) if TextHash.valid_word?(word) }
@@ -42,6 +42,15 @@ def add_word(word)
       key = word.intern
       store key, fetch(key, 0)+1
     end
+
+    # Due to the character filtering that takes place in atomisation
+    # this method should never receive something that could be a
+    # negative number, float etc.
+    # Therefore we can dispense with the SLOW Float(word) method and
+    # just do a simple regex, anchored so it matches all-digit words only.
+    def self.numeric_word?(word)
+      word.match(/\A\d+\z/)
+    end
   end

 end
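
Anchoring matters for a check like `numeric_word?`: a pattern without `\A` and `\z`, such as `/\d+/`, matches any word that merely contains a digit, so a token like "robot14" would be rejected as numeric. The following sketch illustrates the difference. The stopword list and the `valid_word` lambda are hypothetical stand-ins for `Ankusa::STOPWORDS` and `TextHash.valid_word?`.

```ruby
# Hypothetical stand-in for Ankusa::STOPWORDS.
stopwords = %w[accordingly appropriate the and]

unanchored = /\d+/      # matches any word that contains a digit
anchored   = /\A\d+\z/  # matches only words made up entirely of digits

# Mirrors the shape of TextHash.valid_word?, with the pattern passed in.
valid_word = lambda do |word, pattern|
  !(stopwords.include?(word) || word.length < 3 || word.match(pattern))
end

%w[mother robot14 21675 the].each do |word|
  puts "#{word}: unanchored=#{valid_word.call(word, unanchored)} " \
       "anchored=#{valid_word.call(word, anchored)}"
end
```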
23 changes: 15 additions & 8 deletions test/hasher_test.rb
@@ -1,25 +1,32 @@
 require File.join File.dirname(__FILE__), 'helper'

 class HasherTest < Test::Unit::TestCase
-  def test_stemming
+  def setup
     string = "Words word a the at fish fishing fishes? /^/ The at a of! @#$!"
     @text_hash = Ankusa::TextHash.new string
     @array = Ankusa::TextHash.new [string]
+  end

+  def test_stemming
     assert_equal @text_hash.length, 2
     assert_equal @text_hash.word_count, 5

     assert_equal @array.length, 2
     assert_equal @array.word_count, 5
   end

+  def test_atomization
+    string = "Hello 123,45 My-name! is Robot14 123.45 @#$!"
+    @array = Ankusa::TextHash.atomize string
+
+    assert_equal %w{hello 123 45 my name is robot14 123 45}, @array
+  end
+
   def test_valid_word
-    assert (not Ankusa::TextHash.valid_word? "accordingly")
-    assert (not Ankusa::TextHash.valid_word? "appropriate")
-    assert Ankusa::TextHash.valid_word? "^*&@"
-    assert Ankusa::TextHash.valid_word? "mother"
-    assert (not Ankusa::TextHash.valid_word? "21675")
+    assert !Ankusa::TextHash.valid_word?("accordingly")
+    assert !Ankusa::TextHash.valid_word?("appropriate")
+    assert Ankusa::TextHash.valid_word?("^*&@")
+    assert Ankusa::TextHash.valid_word?("mother")
+    assert !Ankusa::TextHash.valid_word?("21675")
+    assert !Ankusa::TextHash.valid_word?("00000")
   end
 end
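
The body of `TextHash.atomize` is not part of this diff, so the sketch below is a hypothetical re-implementation based only on the code comment in hasher.rb ("replace dashes with spaces, then get rid of non-word/non-space characters, then split by space") and on the expectation in `test_atomization`. It assumes punctuation is replaced by a space rather than deleted, since the test expects "123,45" to become two tokens. The sample string is single-quoted here because `#$!` inside a double-quoted Ruby string interpolates the `$!` global.

```ruby
# Hypothetical atomiser; NOT the actual Ankusa implementation.
def atomize(text)
  text.downcase
      .tr('-', ' ')          # replace dashes with spaces
      .gsub(/[^\w\s]/, ' ')  # turn remaining non-word characters into spaces
      .split                 # split on whitespace, dropping empty strings
end

p atomize('Hello 123,45 My-name! is Robot14 123.45 @#$!')
```

Because decimal and comma-separated numbers arrive already split into all-digit tokens, a digits-only check is enough to filter them out afterwards.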
