Implementation of Moses Charikar's simhashes in Ruby
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
ext/string_hashing
lib
test
.gitignore
LICENSE
README.rdoc
Rakefile
simhash.gemspec

README.rdoc

Absctract

This is implementation of Moses Charikar’s simhashes in Ruby.

Usage

When you have a string and want to calculate it's simhash, you should

my_string.simhash

By default it will generate 64-bit integer - that is simhash for this string

It's always better to tokenize string before simhashing. It's as simple as

my_string.simhash(:split_by => / /)

This will generate 64-bit integer based, but will split string into words before. It's handy when you need to calculate similarity of strings based on word usage. You can split string as you like: by letters/sentences/specific letter-combinations, etc.

my_string.simhash(:split_by => /./, :hashbits => 512)

Sometimes you might need longer simhash (finding similarity for very long strings is a good example). You can set length of result hash by passing hashbits parameter. This example will return 512-bit simhash for your string splitted by sentences.

Advanced usage

It's useful to clean your string before simhashing. But it's useful not to clean, too.

Here are examples:

my_string.simhash(:stop_words => true) # here we clean

This will find stop-words in your string and remove them before simhashing. Stop-words are “the”, “not”, “about”, etc. Currently we remove only Russian and English stop-words.

my_string.simhash(:preserve_punctuation => true) # here we not

This will not remove punctuation before simhashing. Yes, we remove all dots, commas, etc. after splitting string to words by default. Because different punctiation does not mean difference in general. If you not agree you can turn this default off.

Installation

As usual:

gem install simhash

But if you have GNU MP library, simhash will work faster! To check out which version is used, type:

Simhash::DEFAULT_STRING_HASH_METHOD It should return symbol. If symbol ends with “rb”, your simhash is slow. If you want make it faster, install GNU MP.