This repository has been archived by the owner on Sep 20, 2022. It is now read-only.

[WIP][HIVEMALL-73] Reduce memory usages of each_top_k #47

Closed
wants to merge 1 commit into from


Member

@myui myui commented Feb 17, 2017

What changes were proposed in this pull request?

Reduce memory usage of each_top_k

What type of PR is it?

Improvement

What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-73

How was this patch tested?

manual tests

Member Author

myui commented Feb 17, 2017

@maropu Could you review this PR?

This PR resolves an OutOfMemoryError in drainQueue that occurs when k is very large.

2017-02-16 05:56:22,378 FATAL [Thread-4] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
    at hivemall.tools.EachTopKUDTF.drainQueue(EachTopKUDTF.java:182)
    at hivemall.tools.EachTopKUDTF.close(EachTopKUDTF.java:215)
    at org.apache.hadoop.hive.ql.exec.UDTFOperator.closeOp(UDTFOperator.java:143)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:577)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:588)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:588)
    at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:318)
    at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)

@myui myui changed the title [HIVEMALL-73] Reduce memory usages of each_top_k [WIP][HIVEMALL-73] Reduce memory usages of each_top_k Feb 17, 2017
Member Author

myui commented Feb 17, 2017

Oops, this PR contains a bug. The output must be emitted in the reverse of the order in which _queue.poll() returns the tuples.

for (int i = queueSize - 1; i >= 0; i--) {
   TupleWithKey tuple = tuples[i];
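To illustrate why the reverse order matters, here is a minimal sketch, not the actual EachTopKUDTF code. It assumes a bounded min-heap holding plain double scores (the class and method names are hypothetical). Because poll() on a min-heap returns the smallest retained element first, the drained values have to be written from the back of the array to come out in descending order.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Hypothetical illustration; TopKDrainSketch and topK are not part of Hivemall.
public class TopKDrainSketch {

    /** Returns the k largest scores in descending order. */
    static double[] topK(double[] scores, int k) {
        // Min-heap bounded to k entries: the head is the smallest retained score.
        PriorityQueue<Double> queue = new PriorityQueue<>();
        for (double s : scores) {
            queue.offer(s);
            if (queue.size() > k) {
                queue.poll(); // evict the current minimum
            }
        }
        // poll() yields ascending order, so fill the array from the back
        // to obtain descending (best-first) output.
        final int queueSize = queue.size();
        double[] out = new double[queueSize];
        for (int i = queueSize - 1; i >= 0; i--) {
            out[i] = queue.poll();
        }
        return out;
    }

    public static void main(String[] args) {
        double[] top = topK(new double[] {0.3, 0.9, 0.1, 0.7, 0.5}, 3);
        System.out.println(Arrays.toString(top)); // prints [0.9, 0.7, 0.5]
    }
}
```

Filling the array in the poll() order instead would emit 0.5, 0.7, 0.9, which is the kind of ordering bug described above.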

Member Author

myui commented Feb 17, 2017

Hmm... this issue is hard to cope with. Any good ideas, @maropu?

Member Author

myui commented Feb 17, 2017

@maropu I found that the behavior of each_top_k on Spark differs slightly from the one on Hive in the ranking scheme at
https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/tools/EachTopKUDTF.java#L198

The Hive version provides a dense_rank, but the Spark version does not.
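For readers unfamiliar with the term, the following is a minimal sketch of dense ranking over scores already sorted in descending order. The thread does not say which scheme the Spark version uses, so the positional numbering below is shown only as a contrast and is an assumption. All names are hypothetical and not taken from Hivemall.

```java
import java.util.Arrays;

// Hypothetical illustration; DenseRankSketch is not part of Hivemall.
public class DenseRankSketch {

    /** Dense rank: tied scores share a rank and no rank numbers are skipped. */
    static int[] denseRank(double[] sortedDesc) {
        int[] ranks = new int[sortedDesc.length];
        int rank = 0;
        for (int i = 0; i < sortedDesc.length; i++) {
            // Advance the rank only when the score changes.
            if (i == 0 || sortedDesc[i] != sortedDesc[i - 1]) {
                rank++;
            }
            ranks[i] = rank;
        }
        return ranks;
    }

    /** Positional numbering: each row gets its 1-based position, ties included. */
    static int[] positionalRank(double[] sortedDesc) {
        int[] ranks = new int[sortedDesc.length];
        for (int i = 0; i < sortedDesc.length; i++) {
            ranks[i] = i + 1;
        }
        return ranks;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.7, 0.7, 0.5};
        System.out.println(Arrays.toString(denseRank(scores)));      // prints [1, 2, 2, 3]
        System.out.println(Arrays.toString(positionalRank(scores))); // prints [1, 2, 3, 4]
    }
}
```

With dense ranking, a query asking for the top k ranks can return more than k rows when scores tie, which is one way the two schemes can produce different output.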


coveralls commented Feb 17, 2017

Coverage Status

Coverage increased (+0.02%) to 35.969% when pulling 13304c5 on myui:HIVEMALL-73 into bcae153 on apache:master.

@myui myui closed this Feb 20, 2017