Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reduce lattice combination steps #34

Open
garfieldnate opened this issue Jan 10, 2015 · 9 comments
Open

reduce lattice combination steps #34

garfieldnate opened this issue Jan 10, 2015 · 9 comments

Comments

@garfieldnate
Copy link
Owner

The Java version manages to combine lattices much more quickly by consolidating common supracontexts after each combination, not just after the final combination. That would be a good optimization here.

@shixilun
Copy link

Can you give a specific example of this?

@garfieldnate
Copy link
Owner Author

Sure. Take this set of items, split into three sublattices:

ab bc cd
bc ab cb
cd ac bd

The XS code does this (0 means null set):

ab x bc x cd -> 0
ab x bc x cb -> b
ab x bc x bc -> b
ab x ab x cd -> 0
ab x ab x cb -> b
...

So it iterates over every combination of every element. The Java one first combines the two columns:

ab x bc -> ab
ab x ab -> ab
ab x ac -> a
bc x bc -> bc
bc x ab -> b
bc x ac -> c
cd x bc -> c
cd x ab -> 0
cd x ac -> c

It then consolidates the identical results and keeps track of counts. So the current results are

ab*2, c*3, a*1, b*1

Then we do the combination of this set with the third colum, multiplying through the numbers:

ab*2 x cd -> 0
ab*2 x cb -> b*2
ab*2 x bd -> b*2
a*1 x cd -> 0
a*1 x cb -> 0
a*1 x bd -> 0
bc*1 x cd -> c*1
bc*1 x cb -> bc*1
bc*1 x bd -> b*1
b*1 x cd -> 0
b*1 x cb -> b*1
b*1 x b -> b*1
cx3 x c -> c*3
cx3 x cb -> c*3
cx3 x bd -> 0

So when you see a count of 2 in the first column, you're doing in one step what the naive algorithm would do in 2 steps, and so on for larger numbers. In the data I've tested with I found the intermittent steps to have numbers in the hundreds or thousands, so this really does save a lot of time.

Still, it's slow in the long run, so I am in search of something better. The fast set intersection problem.

@shixilun
Copy link

shixilun commented Jun 4, 2015

This description of the XS code isn't completely accurate.

If, for example, the intersection of an element from sublattice one and an element from sublattice two is empty, no more intersections are attempted with elements from sublattices three and four. (Look for the continue statements labeled /* intersection is empty */.)

Only if there are no empty intersections among elements of the first three sublattices would "it iterate over every combination of every element."

Do you have any statistics about how much coalescing occurs?

@garfieldnate
Copy link
Owner Author

Yes, the XS stops early for empty combinations, but combinations are still built as many times as their count. Funny, I started writing down statistics but left a big TODO there. I'll see if I can get something together.

@garfieldnate
Copy link
Owner Author

I can send you the data/script if you'd like to confirm with AM::Parallel (I exposed a bug in Algorithm::AM doing this).
I'm using this 35-feature, 683-item set recording symptoms and diagnoses of problems in soybeans. I remove item number 16 from the set and classify it. With Java AM in takes 4-10 seconds (variance probably caching). With AM::Parallel or Algorithm::AM it takes 40 to 50 seconds.

The soy bean lattice is split into 6 sublattices of 5 features each. The numbers of supracontexts in each are 14, 26, 7, 32, 11, 24, and 22. My notes from when I wrote the algorithm say that abortion of empty combinations didn't happen until combining the the last or second-to-last sets. So the worst-case was the norm for this set. So, assuming second to last is where empty combinations are stopped, and ordering by smallest to biggest sets, you get 7*11*14*22*24*26 = 14,798,784 combinations. From there, whatever number of those passed then need to be combined with the 32 in the last set. The total count for this particular test item is 338,317,342,300 (hmm, I'm messing up some math here somewhere).

I stuck some print statements into the combining algorithm to see how much combination there was:

11*32 = 352(219)
14*219 = 3066(1155)
26*1155 = 30030(3941)
22*7 = 154(89)
24*3941 = 94584(11810)
final: 89*11810 = 1051090(150)

A set of 11 combined with a set of 32 required 352 combinations, but yielded only 219 unique sets, etc. This really made the calculation much faster. The total number of pointers for this example is 338,317,342,300. Note that the total number of combinations probably differs each time because the order is basically random (determined by threads; I believe I tried ordering smallest to biggest but as-soon-as-possible processing with threads was faster; filling of initial supras with subs is also parallel to this). It looks like the total number of combinations is about 1.2 million. Note also that it had to store about 100K sets in memory, which isn't bad; time is a much more precious resource, here.

@garfieldnate garfieldnate mentioned this issue Jun 6, 2015
@shixilun
Copy link

shixilun commented Jun 8, 2015

I would appreciate having the data checked in here somewhere.

Did you use a specially hacked version of AM::Parallel with seven sublattices?

I have given some thought about how to do combining as you propose within AM::Parallel, but nothing simple comes to mind yet.

@garfieldnate
Copy link
Owner Author

Yes, sorry about that. I added the dataset. The quickest way to test this (I fixed that bug so you don't have to use AM::Parallel)) is:

perl Makefile.PL
make (or dmake or whatever)
perl -Mblib bin/analogize.pl --project datasets/soybean --format commas --print statistical_summary

That won't give you exact time though, so you could use this instead:

# Run soybean. It takes a while!
use strict;
use warnings;
use Algorithm::AM;

use FindBin qw($Bin);

my $dataset = dataset_from_file(
    path => "$Bin/soybean/data",
    format => 'commas',
    unknown => '?',
);
my $testset = dataset_from_file(
    path => "$Bin/soybean/test",
    format => 'commas',
    unknown => '?',
);
my $test_item = $testset->get_item(0);
my $am = Algorithm::AM->new(training_set => $dataset);

my $start = time;
my $result = $am->classify($test_item);
my $end = time;
print ${ $result->statistical_summary };
print '' . ($end - $start) . " seconds";

The differing number of lattices applies to the Java version only. Since the zero-item supras were not found until the last or second to last combination when the lattice was broke in 7, I assume that the XS must not be finding them until the last combination, meaning full search. (This could be checked easily, though).

The Java implementation was very simple: for each combination, keep the results in a hash map, and if duplicates are produced then just adjust the count of the existing one. It wasn't bad because I could override equals and hashCode. Perl only allows strings to be hash keys... Maybe an external library would be required to make this simple.

@shixilun
Copy link

shixilun commented Jun 9, 2015

I'm not afraid of using AM::Parallel; I happen to be quite familiar with it :) I would just have to patch the bug you found regarding initializing the counting arrays.

However, I don't think the comparison is completely fair unless both versions are using seven sublattices.

Any sequence of bytes can be a hash key in Perl; it need not be a PV. Since the intersection is contained in the array intersectlist, and you know the length of that list, you can use that as the hash key. Intersections always store the entries in order, so you don't have to worry about two intersections having the same elements but in different orders.

I also believe we can rewrite the filling and combining steps to take advantage of multi-core processors, without the hassle of pthreads or some other parallelization library.

@garfieldnate
Copy link
Owner Author

I guess I should have known the author would be familiar with it:) The version on the AM website has been patched.

I'll see about checking stats using 4 lattices. Allowing a variable number of lattices would be a nice improvement here, as well, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants