This repository has been archived by the owner on Sep 20, 2022. It is now read-only.

[HIVEMALL-75] Support Sparse Vector Format as the input of RandomForest #51

Closed
wants to merge 62 commits into from

Conversation

myui
Member

@myui myui commented Feb 24, 2017

What changes were proposed in this pull request?

Adds support for sparse vectors as the input of RandomForest.

What type of PR is it?

Improvement

What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-75

How was this patch tested?

Unit tests and manual testing.

@coveralls

coveralls commented Feb 25, 2017

Coverage Status

Coverage decreased (-0.3%) to 36.092% when pulling 7e8b29b on myui:HIVEMALL-75 into 19d472b on apache:master.

@coveralls

coveralls commented Feb 27, 2017

Coverage Status

Coverage decreased (-0.3%) to 36.093% when pulling 051e855 on myui:HIVEMALL-75 into 19d472b on apache:master.

@coveralls

coveralls commented Feb 27, 2017

Coverage Status

Coverage decreased (-0.009%) to 36.342% when pulling d5dfe6c on myui:HIVEMALL-75 into 19d472b on apache:master.

@myui myui changed the title from [WIP][HIVEMALL-75] Support Sparse Vector Format as the input of RandomForest to [HIVEMALL-75] Support Sparse Vector Format as the input of RandomForest on Feb 27, 2017
@myui
Member Author

myui commented Mar 3, 2017

Memory usage needs to be reduced for large sparse inputs. WIP...

Caused by: java.lang.OutOfMemoryError: Java heap space
        at smile.sort.QuickSort.sort(QuickSort.java:576)
        at hivemall.smile.utils.SmileExtUtils.sort(SmileExtUtils.java:135)
        at hivemall.smile.classification.RandomForestClassifierUDTF.train(RandomForestClassifierUDTF.java:340)
        at hivemall.smile.classification.RandomForestClassifierUDTF.close(RandomForestClassifierUDTF.java:291)
        at org.apache.hadoop.hive.ql.exec.UDTFOperator.closeOp(UDTFOperator.java:152)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:682)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:696)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:696)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:696)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:398)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:186)
        ... 14 more

@myui
Member Author

myui commented Mar 3, 2017

@maropu

To reduce memory consumption, I'm considering replacing int[][] order with CompressedIntStream[].

https://github.com/myui/incubator-hivemall/blob/7dba520cb709326edffb945dd925876433200a93/core/src/main/java/hivemall/smile/classification/RandomForestClassifierUDTF.java#L343

CompressedIntStream should support streaming construction that holds the int sequence as a byte[], as well as streaming decompression in
https://github.com/myui/incubator-hivemall/blob/7dba520cb709326edffb945dd925876433200a93/core/src/main/java/hivemall/smile/classification/DecisionTree.java#L551

Do you know of a good algorithm or library for fast integer sequence compression/decompression?

JavaFastPFOR does not support streaming decompression.

@maropu
Member

maropu commented Mar 4, 2017

blosc provides random access on compressed data (and, of course, you can also read the data sequentially from head to end), but there is no official Java wrapper: http://blosc.org/.

As another option, you can use bitshuffle + snappy in snappy-java (this is not differential coding, though). In that case, you would need to write a bit of code yourself to compress the data block by block. You could then read the compressed data from the head, one block at a time.

In short, AFAIK there is no existing library that fits your case.
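
The block-by-block idea can be sketched as follows. This is only an illustration under stated assumptions: java.util.zip Deflate stands in for bitshuffle + snappy so that the example needs nothing beyond the JDK, and the class name BlockwiseIntStore and its methods are hypothetical, not part of Hivemall or snappy-java.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration of block-by-block compression.
// Deflate is a stand-in here for bitshuffle + snappy.
final class BlockwiseIntStore {
    private final List<byte[]> blocks = new ArrayList<>();  // one compressed byte[] per block
    private final List<Integer> counts = new ArrayList<>(); // number of ints held by each block

    BlockwiseIntStore(int[] values, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        for (int start = 0; start < values.length; start += blockSize) {
            int count = Math.min(blockSize, values.length - start);
            ByteBuffer raw = ByteBuffer.allocate(count * Integer.BYTES);
            raw.asIntBuffer().put(values, start, count);
            blocks.add(deflate(raw.array()));
            counts.add(count);
        }
    }

    int blockCount() {
        return blocks.size();
    }

    // Decompresses a single block; all other blocks stay compressed.
    int[] readBlock(int index) {
        int count = counts.get(index);
        byte[] raw = inflate(blocks.get(index), count * Integer.BYTES);
        int[] out = new int[count];
        ByteBuffer.wrap(raw).asIntBuffer().get(out);
        return out;
    }

    private static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    private static byte[] inflate(byte[] packed, int rawLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        byte[] raw = new byte[rawLength];
        try {
            int off = 0;
            while (off < rawLength) {
                int n = inflater.inflate(raw, off, rawLength - off);
                if (n == 0) {
                    throw new IllegalStateException("truncated block");
                }
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return raw;
    }
}
```

Reading the blocks in order with readBlock(0), readBlock(1), and so on gives sequential access while keeping at most one decompressed block in memory at a time; the block size trades compression ratio against that working-set size.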

@coveralls

Coverage Status

Coverage increased (+1.1%) to 37.482% when pulling 7dba520 on myui:HIVEMALL-75 into 19d472b on apache:master.

@coveralls

coveralls commented Mar 4, 2017

Coverage Status

Coverage increased (+0.8%) to 37.185% when pulling 7dba520 on myui:HIVEMALL-75 into 19d472b on apache:master.

@myui
Member Author

myui commented Mar 5, 2017

@maropu In this case, random access is not a requirement.

One option is to use Deflate with DataInputStream/DataOutputStream, although deserialization should be as fast as possible.

The idea is to compress int[] to byte[] and to decompress the byte[] through sequential readInt() calls rather than batch decompression, so that the whole sequence is never materialized in memory at once.
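
A minimal sketch of the Deflate option described above, using only JDK classes. The class name DeflateIntCodec and its two methods are hypothetical and are not Hivemall's CompressedIntStream; writing the length as the first int is also just an assumption made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for illustration; not Hivemall's actual CompressedIntStream.
final class DeflateIntCodec {

    // Compresses an int[] into a Deflate-compressed byte[].
    // The first int written is the element count (an assumption of this sketch).
    static byte[] compress(int[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try (DataOutputStream out =
                new DataOutputStream(new DeflaterOutputStream(bytes, deflater))) {
            out.writeInt(values.length);
            for (int v : values) {
                out.writeInt(v);
            }
        } finally {
            deflater.end(); // a caller-supplied Deflater is not released by the stream
        }
        return bytes.toByteArray();
    }

    // Opens a stream that inflates lazily, so ints are decoded one readInt() at a time.
    static DataInputStream open(byte[] compressed) {
        return new DataInputStream(
            new InflaterInputStream(new ByteArrayInputStream(compressed)));
    }

    public static void main(String[] args) throws IOException {
        int[] order = {0, 3, 1, 2, 7, 5};
        byte[] packed = compress(order);
        try (DataInputStream in = open(packed)) {
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                System.out.println(in.readInt());
            }
        }
    }
}
```

Only the compressed byte[] and the inflater's small internal buffer are held during reading, which is the point of preferring sequential readInt() over batch decompression. Whether Deflate is fast enough for this use is not something the sketch measures.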

@myui
Member Author

myui commented Mar 6, 2017

@maropu I introduced a compressed IntStream for sparse inputs.

@coveralls

coveralls commented Mar 6, 2017

Coverage Status

Coverage increased (+1.1%) to 37.449% when pulling 9a02b9c on myui:HIVEMALL-75 into 19d472b on apache:master.

@coveralls

coveralls commented Mar 6, 2017

Coverage Status

Coverage increased (+1.09%) to 37.444% when pulling 9a02b9c on myui:HIVEMALL-75 into 19d472b on apache:master.

@coveralls

Coverage Status

Coverage increased (+1.09%) to 37.44% when pulling 9a02b9c on myui:HIVEMALL-75 into 19d472b on apache:master.

@coveralls

coveralls commented Mar 22, 2017

Coverage Status

Coverage increased (+0.3%) to 37.037% when pulling 50df900 on myui:HIVEMALL-75 into f7fc304 on apache:master.

@myui
Member Author

myui commented Mar 22, 2017

CSRMatrix#eachInRow is too slow ...

6d749df044b24b53907e24cb7071afd0
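
For readers unfamiliar with the access pattern under discussion, here is a minimal sketch of a CSR (compressed sparse row) matrix with a callback-style eachInRow next to a direct loop over the same arrays. TinyCsrMatrix and EntryVisitor are hypothetical names, not Hivemall's CSRMatrix, and the sketch makes no claim about where the time actually went in the profile.

```java
// Hypothetical minimal CSR matrix for illustration; not Hivemall's CSRMatrix.
final class TinyCsrMatrix {

    /** Callback invoked once per stored (non-zero) entry. */
    interface EntryVisitor {
        void visit(int col, double value);
    }

    private final int[] rowPointers;   // entries of row i live in [rowPointers[i], rowPointers[i + 1])
    private final int[] columnIndices; // column index of each stored entry
    private final double[] values;     // value of each stored entry

    TinyCsrMatrix(int[] rowPointers, int[] columnIndices, double[] values) {
        this.rowPointers = rowPointers;
        this.columnIndices = columnIndices;
        this.values = values;
    }

    int numRows() {
        return rowPointers.length - 1;
    }

    // Callback style: one interface call per stored entry of the row.
    void eachInRow(int row, EntryVisitor visitor) {
        for (int i = rowPointers[row], end = rowPointers[row + 1]; i < end; i++) {
            visitor.visit(columnIndices[i], values[i]);
        }
    }

    // Direct style: the same traversal written as a plain loop, with no callback object.
    double rowSum(int row) {
        double sum = 0.0;
        for (int i = rowPointers[row], end = rowPointers[row + 1]; i < end; i++) {
            sum += values[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // 3 x 3 matrix: row 0 = {0: 1.0, 2: 2.0}, row 1 is empty, row 2 = {1: 3.0}
        TinyCsrMatrix m = new TinyCsrMatrix(
            new int[] {0, 2, 2, 3}, new int[] {0, 2, 1}, new double[] {1.0, 2.0, 3.0});
        for (int r = 0; r < m.numRows(); r++) {
            final int row = r;
            m.eachInRow(r, (col, v) -> System.out.println("(" + row + ", " + col + ") = " + v));
        }
    }
}
```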

@myui
Member Author

myui commented Mar 22, 2017

IntArraySet.add() became the bottleneck after revising the anonymous class invocations in eachInRow().

b75bdd4f980ca1c8f996d511a325c425

Refactored to use RoaringBitmap and IntReservoirSampler.

d520699f0e299405e577936273ac14c3
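
As background on the sampler mentioned above, the following is a sketch of classic reservoir sampling (Algorithm R) over a stream of ints. It is an assumption that IntReservoirSampler follows this general scheme; the class IntReservoirSketch below is hypothetical and is not Hivemall's implementation. RoaringBitmap is left out so that the example depends only on the JDK.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of reservoir sampling (Algorithm R); not Hivemall's IntReservoirSampler.
final class IntReservoirSketch {
    private final int[] samples;
    private final Random rnd;
    private int seen; // number of items offered so far (this sketch assumes fewer than 2^31 - 1)

    IntReservoirSketch(int k, long seed) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.samples = new int[k];
        this.rnd = new Random(seed);
    }

    void add(int item) {
        if (seen < samples.length) {
            // Fill phase: the first k items are always kept.
            samples[seen] = item;
        } else {
            // Replace phase: keep the new item with probability k / (seen + 1).
            int j = rnd.nextInt(seen + 1); // uniform in [0, seen]
            if (j < samples.length) {
                samples[j] = item;
            }
        }
        seen++;
    }

    // Returns a copy of the current sample (shorter than k if fewer items were offered).
    int[] getSamples() {
        return Arrays.copyOf(samples, Math.min(seen, samples.length));
    }

    public static void main(String[] args) {
        IntReservoirSketch sampler = new IntReservoirSketch(5, 42L);
        for (int i = 0; i < 1000; i++) {
            sampler.add(i);
        }
        System.out.println(Arrays.toString(sampler.getSamples()));
    }
}
```

The sampler needs O(k) memory regardless of how many items pass through, and each add() is a constant-time array write rather than a set insertion.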

@coveralls

Coverage Status

Coverage increased (+0.3%) to 37.018% when pulling a0ebfe1 on myui:HIVEMALL-75 into f7fc304 on apache:master.

@myui
Member Author

myui commented Mar 23, 2017

Need to sync with Smile's current prediction scheme for better accuracy.

haifengl/smile@444d8bb
haifengl/smile@4a6368c
szilard/benchm-ml#32

@coveralls

Coverage Status

Coverage increased (+0.1%) to 36.842% when pulling 4933a48 on myui:HIVEMALL-75 into f7fc304 on apache:master.

@coveralls

coveralls commented Apr 6, 2017

Coverage Status

Coverage increased (+0.1%) to 36.842% when pulling 4933a48 on myui:HIVEMALL-75 into f7fc304 on apache:master.

@asfgit asfgit closed this in 8dc3a02 Apr 9, 2017
@myui
Member Author

myui commented Apr 9, 2017

Merged. Documentation and model conversion pull requests will follow in [HIVEMALL-75-2].

takuti pushed a commit to takuti/incubator-hivemall that referenced this pull request Apr 21, 2017