Replace check method to see if a string is numeric
The old code used the very correct Float(word) method to check whether a string was numeric. This works reliably with all sorts of edge-case data, but it is very slow. Since word atomisation has already parsed out many possibilities (e.g. decimal numbers like 123.45 have already been split into "123" and "45"), we do not need this level of edge-case certainty. A simple regex check for whether the string is all numerals is therefore enough. In tests on 1000 emails (single-threaded), run time dropped from 2.4 seconds to 1.4 seconds. Because this trades edge-case reliability for speed, the check can no longer remain a String class monkey-patch, so it moves into a method that is only called by Ankusa itself.
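As a rough illustration of the trade-off described in the commit message, the sketch below contrasts a Float-based check with a digits-only regex check. The method names `numeric_via_float?` and `all_digits?` are hypothetical and are not taken from the Ankusa source; the implementation in this commit may differ.

```ruby
# Hypothetical helper: the thorough but slower check.
# Float() raises for anything it cannot parse as a number.
def numeric_via_float?(word)
  Float(word)
  true
rescue ArgumentError, TypeError
  false
end

# Hypothetical helper: the cheaper check.
# True only when the whole string consists of ASCII digits.
def all_digits?(word)
  !!(word =~ /\A[0-9]+\z/)
end
```

With these definitions, "123.45" passes the Float-based check but not the digits-only one. That is acceptable here only because atomisation has already split such input into "123" and "45" before the check runs.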
1 parent 8b223a0, commit 2d3d916
Showing 3 changed files with 27 additions and 15 deletions.
| Original file line | Diff line change |
|---|---|
| `@@ -1,25 +1,32 @@` | |
| `require File.join File.dirname(__FILE__), 'helper'` | `require File.join File.dirname(__FILE__), 'helper'` |
| | |
| `class HasherTest < Test::Unit::TestCase` | `class HasherTest < Test::Unit::TestCase` |
| `def setup` | `def test_stemming` |
| `string = "Words word a the at fish fishing fishes? /^/ The at a of! @#$!"` | `string = "Words word a the at fish fishing fishes? /^/ The at a of! @#$!"` |
| `@text_hash = Ankusa::TextHash.new string` | `@text_hash = Ankusa::TextHash.new string` |
| `@array = Ankusa::TextHash.new [string]` | `@array = Ankusa::TextHash.new [string]` |
| `end` | |
| | |
| `def test_stemming` | |
| `assert_equal @text_hash.length, 2` | `assert_equal @text_hash.length, 2` |
| `assert_equal @text_hash.word_count, 5` | `assert_equal @text_hash.word_count, 5` |
| | |
| `assert_equal @array.length, 2` | `assert_equal @array.length, 2` |
| `assert_equal @array.word_count, 5` | `assert_equal @array.word_count, 5` |
| `end` | `end` |
| | |
| | `def test_atomization` |
| | `string = "Hello 123,45 My-name! is Robot14 123.45 @#$!"` |
| | `@array = Ankusa::TextHash.atomize string` |
| | |
| | `assert_equal %w{hello 123 45 my name is robot14 123 45}, @array` |
| | `end` |
| | |
| `def test_valid_word` | `def test_valid_word` |
| `assert (not Ankusa::TextHash.valid_word? "accordingly")` | `assert !Ankusa::TextHash.valid_word?("accordingly")` |
| `assert (not Ankusa::TextHash.valid_word? "appropriate")` | `assert !Ankusa::TextHash.valid_word?("appropriate")` |
| `assert Ankusa::TextHash.valid_word? "^*&@"` | `assert Ankusa::TextHash.valid_word?("^*&@")` |
| `assert Ankusa::TextHash.valid_word? "mother"` | `assert Ankusa::TextHash.valid_word?("mother")` |
| `assert (not Ankusa::TextHash.valid_word? "21675")` | `assert !Ankusa::TextHash.valid_word?("21675")` |
| | `assert !Ankusa::TextHash.valid_word?("00000")` |
| `end` | `end` |
| `end` | `end` |