Random-Index Vectoring in Java. Written in the voice of Mister Torgue.
Random-Index Vectoring is a memory-efficient way to perform textual analysis across corpora of extreme scale. It does this by using contextual word co-occurrence as a stand-in for word meaning.
Go read about it:
http://www.ercim.eu/publication/Ercim_News/enw50/sahlgren.html
If that doesn't give you enough of a primer, do the google. Sahlgren has done a lot of work on this and is a reliable source. So is Paradis.
If you want to peg to a particular release, you can get it from Maven:
<dependency>
  <groupId>com.github.druidgreeneyes</groupId>
  <artifactId>rivet-core</artifactId>
  <version>1.0.2</version>
</dependency>
You can also get it from jitpack.io, if you prefer:
<repository>
  <id>jitpack.io</id>
  <url>https://jitpack.io</url>
</repository>
...
<dependency>
  <groupId>com.github.druidgreeneyes</groupId>
  <artifactId>rivet-core.java</artifactId>
  <version>v1.0.2</version>
</dependency>
And if you want to pull whatever's currently in development, you can do that too (through jitpack), though it's guaranteed to break sometimes.
<dependency>
  <groupId>com.github.druidgreeneyes</groupId>
  <artifactId>rivet-core.java</artifactId>
  <version>-SNAPSHOT</version>
</dependency>
The core functionality is in rivet.core.labels, where you'll probably use MTJRIV or MapRIV. There are a number of other implementations of the RIV interface, but those two have so far proven to be the fastest by a pretty significant margin, so you may as well start there. If you see flaws in those or in one of the other implementations, and/or you know how they can be done better, feel free to open an issue with your suggestion, or a PR with a fix.
If you want some examples of what to do with it, or some basic use without having to roll your own, check out rivet.core.extras. Currently there's an implementation of text shingling (sketched below) and one of untrained word-by-word analysis, but the real power in RIVs is going to be in cluster- and database-backed big-data applications, and for that you'll probably want to just use labels as your base library and build into Spark or MapReduce or Tez or something like that. You can see how it works in Spark by looking at my rivet-cluster.java repo.
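To give you a feel for the shingling idea without digging into the extras package, here's a rough sketch built on the labels API alone; the width and offset parameters are made up for illustration, and the real implementation in rivet.core.extras has its own entry points, so treat this as a sketch of the technique rather than that API.

// Sketch only: encode a text as the sum of the RIVs of its overlapping character
// windows ("shingles"). width and offset are illustrative values, not library defaults.
public static RIV shingleSketch(final String text, final int size, final int nnz,
                                final int width, final int offset) {
    final RIV result = MTJRIV.empty(size);
    for (int i = 0; i + width <= text.length(); i += offset)
        // each shingle gets its own deterministic RIV, exactly like a word does
        result.destructiveAdd(MTJRIV.generate(size, nnz, text.substring(i, i + width)));
    return result;
}

Two texts encoded this way can then be compared with similarityTo, same as in the example below.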
If you have any questions, let me know. If you hit a bug, open an issue; include the full stack trace and the relevant code, or I probably won't even look. Or if you know me, just hit me up on Skype or Google Chat and I'll see what I can do.
Random Index Vectors give you a conveniently scalable way to mathematically represent blocks of text. The easiest way to do this is going to look something like the following:
public void example() {
    final String[] documents = DOCUMENTS; // import your documents here.
    final int size = 8000; // this is the width of the sparse vectors we'll be making
    final int nnz = 4; // this is the number of non-zero elements you want each one to start with
    final RIV[] rivs = new RIV[documents.length];
    int fill = 0;
    for (final String text : documents) {
        final RIV riv = MTJRIV.empty(size);
        for (final String word : text.split("\\W+"))
            riv.destructiveAdd(MTJRIV.generate(size, nnz, word));
        rivs[fill++] = riv;
    }
    /**
     * Now we have a collection of RIVs, each of which represents the document at the
     * same index in the original collection of texts.
     *
     * Using this, we can go through the list and see how similar each one is to each
     * other one. This only builds half of the matrix; a.similarityTo(a) will always be
     * 1 and a.similarityTo(b) will always be equal to b.similarityTo(a), so we skip
     * comparing RIVs to themselves and we ensure that one RIV is only ever compared
     * to another RIV once.
     **/
    final double[][] sims = new double[rivs.length][rivs.length];
    for (int c = 0; c < rivs.length; c++)
        for (int i = c + 1; i < rivs.length; i++)
            sims[c][i] = rivs[c].similarityTo(rivs[i]);
    /**
     * Now we can go through the matrix and find the (say) 10 pairs of documents
     * that are most similar to one another without being identical
     * (a.similarityTo(b) != 1).
     **/
    final double[][] pairs = new double[10][3]; // a collection of [index, index, similarity] triples
    for (int r = 0; r < sims.length; r++)
        for (int c = r + 1; c < sims[r].length; c++)
            if (!Util.doubleEquals(sims[r][c], 1.0)) {
                // find the weakest entry currently held and replace it if this pair beats it
                int min = 0;
                for (int x = 1; x < pairs.length; x++)
                    if (pairs[x][2] < pairs[min][2])
                        min = x;
                if (pairs[min][2] < sims[r][c]) {
                    pairs[min][0] = r;
                    pairs[min][1] = c;
                    pairs[min][2] = sims[r][c];
                }
            }
    System.out.println("Top 10 most similar documents:");
    for (final double[] pair : pairs) {
        final String a = documents[(int) pair[0]].substring(0, 20) + "...";
        final String b = documents[(int) pair[1]].substring(0, 20) + "...";
        System.out.println(a + " <=> " + b + ": " + pair[2]);
    }
}
There are a number of other things you can do, and other ways you can use RIVs to represent text. For example, you can build a lexicon (a map of words to the corpus-wide L2 RIV associated with each word, built up over time by adding together the RIVs of all words that occur within a 2- to 4-word window of the word in question), and use it to (say) find synonyms, or label documents with topics associated with words that have good similarity to the document in question. If you want to get wacky with it (and you don't mind the high compute cost of doing so), you can encode the ordering of words within sentences or context windows by using RIV.permute(n), where n is the given word's location in the context relative to the word it's being added to.

Things I am working on or have worked on include clustering documents based on cosine similarity, finding and eliminating near-duplicates across corpora, and inferring flat and hierarchical topic models from data. I stood up a small website using a Python version of this library, and was able to do similarity comparisons between 100k-word texts in two minutes and change on a 2nd-generation Raspberry Pi. It would probably be faster if I had done it in Java, but I wanted to muck about with Python, so I did. I most recently implemented Hilbert transformations on RIVs, so for big corpora you should be able to sort by Hilbert key (using Hilbert.getHilbertKey(riv), or ImmutableRIV.getHilbertKey()) and use that for some nifty tricks.
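Since the lexicon idea comes up a lot, here's a rough sketch of what building one can look like, reusing the documents/size/nnz variables and the calls from the example above; the 2-word window, the tokenization, and the method shape are my own choices for illustration, not anything the library dictates.

// Rough sketch, not library code: build a word -> context-RIV lexicon with a
// 2-word co-occurrence window. Needs java.util.Map and java.util.HashMap.
public static Map<String, RIV> buildLexicon(final String[] documents,
                                            final int size, final int nnz) {
    final Map<String, RIV> lexicon = new HashMap<>();
    for (final String text : documents) {
        final String[] words = text.split("\\W+");
        for (int i = 0; i < words.length; i++) {
            final RIV wordRiv = lexicon.computeIfAbsent(words[i], w -> MTJRIV.empty(size));
            // fold in the RIV of every word within 2 positions of words[i];
            // to encode word order as well (at extra cost), you could permute each
            // neighbor by its relative position first, e.g.
            // MTJRIV.generate(size, nnz, words[j]).permute(j - i)
            for (int j = Math.max(0, i - 2); j <= Math.min(words.length - 1, i + 2); j++)
                if (j != i)
                    wordRiv.destructiveAdd(MTJRIV.generate(size, nnz, words[j]));
        }
    }
    return lexicon;
}

From there, lexicon.get(wordA).similarityTo(lexicon.get(wordB)) is your crude synonym detector.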
The first couple of things you'll have to figure out, mostly through trial and error, will be what size and what nnz give you the best balance between how accurate your RIVs are and how many your hardware can process in a reasonable span of time. In general, 8000/4 seems to be an acceptable starting point; I know people who are using 16k/24, and I have heard of people using sizes as big as 100k. This will mostly depend on the size of your corpus and the power of your hardware. Beyond that, there are a number of decisions you can make to customize your RIV experience, so to speak. Permutations will (in theory) let you pack more real information into a RIV, but they carry a sizable cost in terms of RAM and CPU usage. If you build a lexicon, you'll have to decide what qualifies as the surrounding context for your words; I have used windows of 2 and 3, and I have used the entire sentence, on the theory that every piece I leave out is lost information that could theoretically inflect the meaning of the word. I have found that with reasonably sized RIVs (8k to 16k wide), using the whole sentence seems to propagate too much noise into the data, and makes even utterly unrelated things have cosine similarities of 0.6 or greater.
The importance of cleaning your input text cannot be overstated; I ran this on a collection of raw emails and the results were utter garbage, and no matter what tricks I tried when processing the RIVs, I didn't really get better results until I got cleaner texts to work with. I have mixed feelings about stopwording and lemmatizing. In theory, they reduce the scale of and noise in the input data, but ultimately some of the original information is lost in the process. In my ideal world, I would handle raw text and extract meaning from that, because theoretically that is the most accurate way to do this. But stopwords especially add an incredible amount of noise: they collide with everything, and as a result they tend to muddy up all the results. You can mitigate this somewhat by taking the mean vector of a lexicon or of a corpus (that is, the sum of all constituent RIVs divided by the number of constituent RIVs) and subtracting that from all constituents, but in practice this has not proven to be as helpful as I had originally hoped.
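For what it's worth, that mean-vector trick looks something like the sketch below. Fair warning: the divide and subtract calls are my guesses at what the non-destructive arithmetic ops are called, so check the RIV interface in rivet.core.labels for the real names before leaning on this.

// Sketch of mean-vector subtraction over a lexicon (see the lexicon sketch above).
// CAVEAT: divide(...) and subtract(...) are hypothetical method names.
final RIV sum = MTJRIV.empty(size);
for (final RIV riv : lexicon.values())
    sum.destructiveAdd(riv);
final RIV mean = sum.divide(lexicon.size());                 // hypothetical
final RIV centered = lexicon.get("someWord").subtract(mean); // hypothetical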
Finally, be aware of the distinction between operations and their destructive counterparts. The destructive version will always be faster, but it is faster because it operates in place on the calling RIV, and the previous state of that RIV is lost. So, for example, if you want to find out whether rivA is equal to rivB times 2 (I dunno why you would, but for the sake of example...), use rivA.equals(rivB.multiply(2)) and NOT rivA.equals(rivB.destructiveMult(2)). Doing it the second way will permanently alter rivB and throw all your further processing out of whack. If you're adding a bunch of RIVs together (and you will, because that's the whole point), the best compromise between speed and safety is to do something like I did above:
RIV[] rivs;                   // assume this is already populated
RIV res = MTJRIV.empty(size); // same size as the RIVs you're adding
for (final RIV riv : rivs)
    res.destructiveAdd(riv);
or, if your operation is big enough that it's worth using Java streams, you can just do a reduce operation:
Stream<RIV> rivs;             // assume this is already populated
RIV res = rivs.reduce(MTJRIV.empty(size), RIV::destructiveAdd);
Either way, you're starting with an empty RIV and making destructive modifications to that, instead of to something else you might want to use later. (Just don't try that reduce on a parallel stream; mutating the identity breaks the assumptions parallel reduction makes.)
As for which RIV implementation you should use: I don't have a real answer; this library is a work in progress. MapRIV has had the most work done on it, but quantity of work doesn't bear a consistent relationship with quality of product. Test coverage is identical for each, so one should be as stable as the next. As far as speed goes, here are the most recent results of running the above example 250 times against each of the following RIV classes and averaging the execution times:
Windows, Intel Core i7 quad-core, 8 threads, 3.6 GHz:
class com.github.druidgreeneyes.rivet.core.labels.MTJRIV: 1.217055s
class com.github.druidgreeneyes.rivet.core.labels.ArrayRIV: 1.329052s
class com.github.druidgreeneyes.rivet.core.labels.MapRIV: 3.252606s
class com.github.druidgreeneyes.rivet.core.labels.ColtRIV: 4.813925s
Ubuntu, Intel Core i5 mobile dual-core, 4 threads, 1.7 GHz:
class com.github.druidgreeneyes.rivet.core.labels.ArrayRIV: 2.476869s
class com.github.druidgreeneyes.rivet.core.labels.MTJRIV: 2.537074s
class com.github.druidgreeneyes.rivet.core.labels.MapRIV: 5.231158s
class com.github.druidgreeneyes.rivet.core.labels.ColtRIV: 8.462583s
Note that HPPCRIV is not included here because it takes far longer in this test than seems reasonable; I've seen benchmarks that put HPPC pretty high on relative speed ratings (http://java-performance.info/hashmap-overview-jdk-fastutil-goldman-sachs-hppc-koloboke-trove-january-2015/). The discrepancy is more likely caused by misuse than by anything inherent in the library itself (i.e. it's probably my fault), so I'm going to keep working at it in hopes that I can get it right.
Note also that the input data is the 500-odd text files found in resource/test/hilbert/data, most of which are pretty small.
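If you want to reproduce something in the neighborhood of these numbers on your own hardware, a dead-simple timing loop is enough. This is just a sketch, not the harness that produced the table above:

// Naive timing sketch: run a piece of work `runs` times and report the average in seconds.
public static double averageSeconds(final Runnable work, final int runs) {
    long total = 0L;
    for (int i = 0; i < runs; i++) {
        final long start = System.nanoTime();
        work.run();
        total += System.nanoTime() - start;
    }
    return total / (runs * 1_000_000_000.0);
}
// usage: averageSeconds(() -> example(), 250), swapping the RIV class used inside
// example() for each run.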