Easy-to-use Java similarity algorithms for text and number-series.
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
src
.gitignore
.travis.settings.xml
.travis.yml
LICENSE
README.md
pom.xml

README.md

Similarity Search algorithms in Java

Build Status License

This library contains easy-to-use and high-performant nearest-neighbor-search algorithms (as specified in "Mining of Massive Datasets", Cambridge University Press, Rajaraman, A., & Ullman, J. D.) implemented in Java, which can be used to determine the similarity between text Strings or sets of numbers (any Collection that contains inheritors of java.lang.Number).

To achieve a higher performance, all these algorithms are implemented so that multiple operations (counters, hashing, etc...) are executed in parallel (using threads spawned by an ExecutorService).

Quick Start

Maven

<dependency>
  <groupId>com.edduarte</groupId>
  <artifactId>similarity-search-java</artifactId>
  <version>0.0.1</version>
</dependency>

Gradle

dependencies {
  compile 'com.edduarte:similarity-search-java:0.0.1'
}

Usage

If you just need an easy way to figure out how similar two strings or number-sets are, use this:

// for strings
double similarity = Similarity.jaccard().of(string1, string2);

// for number sets
double similarity = Similarity.jaccard().of(set1, set2);

This will return a similarity coefficient, a value between 0 and 1 where 1 means the two strings or sets are exactly equal and where 0 means they are disjoint.

In the case of string similarity, you might need to set an appropriate shingle length based on the size of your strings. A shingle length of 2 is used by default, which works well for short strings. The shingle length should be 5 if your strings have the size of an email (e.g. 430 characters) or 8 if they have the size of an average Wikipedia article (e.g. 7800 characters):

// only needed for strings
double similarity = Similarity.jaccard()
    .withShingleLength(5)
    .of(string1, string2);

Jaccard is an exact approach, so this will always return the most accurate, gold-standard result but at a slower speed than other approaches. Jaccard similarity should be enough for most use-cases, but if you're dealing with a massive number of operations in parallel (big data, data-streams), you should go for a probabilistic approach like Minhashing or LSH, detailed in the "Advanced" section below.

Advanced

Every Similarity operation is available through the simple fluent / builder interface in the Similarity class.

Jaccard

// optional parameters
double similarity = Similarity.jaccard()

    // Length of n-gram shingles that are used for
    // comparison (used for strings only).
    .withShingleLength(5)

    // An executor where the kshingling and signature 
    // processing tasks are spawned. If nothing is
    // provided then it launches a new executor with
    // the cached thread pool.
    .withExecutor(executorService)

    .of(string1, string2);

Minhashing

// for strings
double similarity = Similarity.minhash().of(string1, string2);

// for number sets
double similarity = Similarity.minhash().of(set1, set2);

// optional parameters
double similarity = Similarity.minhash()

    // Length of n-gram shingles that are used when
    // generating signatures (used for strings only).
    .withShingleLength(5)

    // The size of the generated signatures, which are
    // compared to determine similarity.
    .withSignatureSize(100)

    // The hashing algorithm used to hash shingles to
    // signatures (used for strings only).
    .withHashMethod(HashMethod.Murmur3)

    // Number of unique elements in both sets (used for
    // sets only). For example, if set1=[4, 5, 6, 7, 8]
    // and set2=[7, 8, 9, 10], this value should be 7. If
    // nothing is provided, this value is determined in 
    // pre-processing.
    .withNumberOfElements(14)

    // An executor where the kshingling and signature 
    // processing tasks are spawned. If nothing is
    // provided then it launches a new executor with the
    // cached thread pool.
    .withExecutor(executorService)

    .of(string1, string2);

LSH

Minhashing is the fastest of the implemented approaches, but returns a non-deterministic index that can be worsened when using inappropriate shingle lengths or signature sizes. If you need the performance of a probabilistic approach but the deterministic index values of Jaccard similarity, you can use Minhashing with Locality-Sensitive Hashing:

// for strings
double similarity = Similarity.lsh().of(string1, string2);

// for number sets
double similarity = Similarity.lsh().of(set1, set2);

// optional parameters
double similarity = Similarity.lsh()

    // Length of n-gram shingles that are used when
    // generating signatures (used for strings only).
    .withShingleLength(5)

    // The number of bands and rows where the minhash
    // signatures will be organized.
    .withNumberOfBands(20).withNumberOfRows(5)

    // A threshold S that balances the number of false
    // positives and false negatives.
    .withThreshold(0.5)

    // The hashing algorithm used to hash shingles to
    // signatures (used for strings only).
    .withHashMethod(HashMethod.Murmur3)

    // Number of unique elements in both sets (used for
    // sets only). For example, if set1=[4, 5, 6, 7, 8]
    // and set2=[7, 8, 9, 10], this value should be 7. If
    // nothing is provided, this value is determined in 
    // pre-processing.
    .withNumberOfElements(14)

    // An executor where the kshingling and signature 
    // processing tasks are spawned. If nothing is
    // provided then it launches a new executor with the
    // cached thread pool.
    .withExecutor(executorService)

    .of(string1, string2);

This will return the Jaccard similarity coefficient for strings / sets that are considered to be candidate pairs, or return 0 if they are not candidate pairs. In other words, this will perform the expensive Jaccard coefficient call only for a smaller subset of a dataset and ignore elements that are too dissimilar. This also means that the result for candidate pairs will be deterministic while the result for non-candidate pairs will be non-deterministic.

Internal classes

So far the code samples have shown how to use the builder pattern available in the Similarity interface. However, you can instead instantiate the internal classes that implement the Similarity interface.

// example values
int shingleLength = 2;
int signatureSize = 100;
HashMethod hashMethod = HashMethod.Murmur3;
int n = 5;
int bands = 20;
int rows = 5;
double threshold = 0.5;

// string similarity
StringSimilarity stringSimilarity = new JaccardStringSimilarity(s1, s2, shingleLength, executor);
StringSimilarity stringSimilarity = new MinHashStringSimilarity(s1, s2, shingleLength, signatureSize, hashMethod, executor);
StringSimilarity stringSimilarity = new LSHStringSimilarity(s1, s2, shingleLength, bands, rows, threshold, hashMethod, executor);
double similarity = stringSimilarity.getAsDouble();

// set similarity
SetSimilarity setSimilarity = new JaccardSetSimilarity(c1, c2);
SetSimilarity setSimilarity = new MinHashSetSimilarity(c1, c2, n, signatureSize, executor);
SetSimilarity setSimilarity = new LSHSetSimilarity(c1, c2, n, bands, rows, threshold, executor);
double similarity = setSimilarity.getAsDouble();

Do note that while the Similarity builder pattern ensures that the provided number sets are copied, so that the computation of the similarity index is not affected by external threads that perform changes to the original sets in parallel, these internal classes do NOT, and you should ensure this condition yourself. This is intentional, so that these internal classes provide a higher performance variant of Similarity instantiation.

You can also use a number of classes that correspond to each step of the implemented similarity search algorithms, and use them in your application at your own accord. Each of the converter classes below returns a result that could, for example, be stored in a database / cache for later use.

// example values
String exampleString = "example string";
Set<Integer> exampleSet = Set.of(1, 2, 3, 4, 5);
int shingleLength = 2;
int signatureSize = 100;
HashMethod hashMethod = HashMethod.Murmur3;
int n = 5;
int bands = 20;
int rows = 5;

// generate shingles
KShingler c1 = new KShingler(shingleLength);
List<CharSequence> shingles = c1.apply(exampleString).call();

// get jaccard similarity coefficient
double stringSimilarity = Similarity.jaccardIndex(shingles1, shingles2);
double setSimilarity    = Similarity.jaccardIndex(exampleSet1, exampleSet2);

// get signatures from shingles
KShingles2SignatureConverter c2 = new KShingles2SignatureConverter(hashMethod, signatureSize);
int[] stringSignature = c2.apply(shingles).call();

// generate a universal-hash signature for sets
Set2SignatureConverter c3 = new Set2SignatureConverter(n, signatureSize);
int[] setSignature = c3.apply(exampleSet).call();

// get minhash similarity coefficient
double stringSimilarity = Similarity.signatureIndex(stringSignature1, stringSignature2);
double setSimilarity    = Similarity.signatureIndex(setSignature1, setSignature2);

// convert signatures to bands
Signature2BandsConverter c4 = new Signature2BandsConverter(bands, rows);
int[] stringBands = c4.apply(stringSignature).call();
int[] setBands = c4.apply(setSignature).call();

// determine if there are any candidate pairs
boolean isCandidatePair = Similarity.isCandidatePair(stringBands1, stringBands2);
boolean isCandidatePair = Similarity.isCandidatePair(setBands1, setBands2);

Note that all of the Converter classes above return a Callable, which can be submitted into any Executor in order to trigger multiple conversion calls in parallel. Below is an example of how to obtain the string similarity between two strings, with k-shingling for both strings being forked as parallel tasks and then joined to compute the Jaccard index:

KShingler kShingler = new KShingler(shingleLength);
Future<List<CharSequence>> shingles1 = exec.submit(kShingler.apply("example string 1"));
Future<List<CharSequence>> shingles2 = exec.submit(kShingler.apply("example string 2"));

double stringSimilarity = Similarity.jaccardIndex(shingles1.get(), shingles2.get());

Projects using this library

You can see this library in use at https://github.com/vokter/vokter.

License

Copyright 2018 Eduardo Duarte

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.