[WIP][HIVEMALL-73] Reduce memory usages of each_top_k #47

myui · 2017-02-17T07:40:47Z

What changes were proposed in this pull request?

Reduce memory usage of each_top_k

What type of PR is it?

Improvement

What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-73

How was this patch tested?

manual tests

myui · 2017-02-17T07:42:04Z

@maropu Could you review this PR?

This PR resolves an OutOfMemoryError in drainQueue that occurs when k is very large.

2017-02-16 05:56:22,378 FATAL [Thread-4] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
    at hivemall.tools.EachTopKUDTF.drainQueue(EachTopKUDTF.java:182)
    at hivemall.tools.EachTopKUDTF.close(EachTopKUDTF.java:215)
    at org.apache.hadoop.hive.ql.exec.UDTFOperator.closeOp(UDTFOperator.java:143)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:577)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:588)
    at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:588)
    at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:318)
    at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)

myui · 2017-02-17T08:02:16Z

Oops, this PR contains a bug. The output must be emitted in the reverse of the order returned by _queue.poll().

for (int i = queueSize - 1; i >= 0; i--) {
   TupleWithKey tuple = tuples[i];
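
A minimal sketch of why the iteration order matters, assuming the queue behaves like a min-heap that retains the top-k scores. It uses java.util.PriorityQueue rather than Hivemall's own queue class, and the names DrainOrderSketch and drainDescending are hypothetical. poll() on such a queue returns the smallest retained score first, so the polled array has to be read from the back to emit the best score first.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

final class DrainOrderSketch {

    // Drains a min-heap and returns its elements in descending order.
    // Assumption: the queue orders elements by natural (ascending) order.
    static double[] drainDescending(PriorityQueue<Double> queue) {
        final int queueSize = queue.size();
        final double[] polled = new double[queueSize];
        // poll() yields the smallest element first, so polled[] is ascending.
        for (int i = 0; i < queueSize; i++) {
            polled[i] = queue.poll();
        }
        // Read the array from the back so the largest score comes out first.
        final double[] out = new double[queueSize];
        for (int i = queueSize - 1, j = 0; i >= 0; i--, j++) {
            out[j] = polled[i];
        }
        return out;
    }

    public static void main(String[] args) {
        PriorityQueue<Double> queue = new PriorityQueue<>();
        queue.add(0.3);
        queue.add(0.9);
        queue.add(0.1);
        queue.add(0.5);
        System.out.println(Arrays.toString(drainDescending(queue)));
    }
}
```

Iterating forward over the polled array would instead emit the lowest of the retained scores first, which is the bug described above.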

myui · 2017-02-17T10:48:22Z

Hmm, this issue is hard to cope with. Any good ideas, @maropu?

myui · 2017-02-17T10:57:35Z

@maropu I found that the ranking scheme of each_top_k on Spark differs slightly from the Hive one in
https://github.com/apache/incubator-hivemall/blob/master/core/src/main/java/hivemall/tools/EachTopKUDTF.java#L198

The Hive version provides a dense_rank, but the Spark version does not.

incubator-hivemall/spark/spark-2.0/src/main/scala/org/apache/spark/sql/catalyst/expressions/EachTopK.scala

Line 101 in 72d6a62

new JoinedRow(InternalRow(1 + index), row)
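
To make the difference concrete, here is a small sketch comparing a position-based rank (1 + index, as in the Spark line quoted above) with a dense rank. It assumes that dense_rank means what SQL's DENSE_RANK means: tied scores share a rank and the next distinct score gets the next integer. The class and method names are hypothetical and are not taken from either code base.

```java
import java.util.Arrays;

final class RankSchemeSketch {

    // Position-based rank: 1 + index, so tied scores still get different ranks.
    static int[] indexRank(double[] sortedDesc) {
        final int[] ranks = new int[sortedDesc.length];
        for (int i = 0; i < sortedDesc.length; i++) {
            ranks[i] = 1 + i;
        }
        return ranks;
    }

    // Dense rank: tied scores share a rank; the rank grows by one
    // only when the score changes. Input must already be sorted.
    static int[] denseRank(double[] sortedDesc) {
        final int[] ranks = new int[sortedDesc.length];
        int rank = 0;
        for (int i = 0; i < sortedDesc.length; i++) {
            if (i == 0 || sortedDesc[i] != sortedDesc[i - 1]) {
                rank++;
            }
            ranks[i] = rank;
        }
        return ranks;
    }

    public static void main(String[] args) {
        final double[] scores = {0.9, 0.9, 0.5, 0.1};
        System.out.println(Arrays.toString(indexRank(scores))); // [1, 2, 3, 4]
        System.out.println(Arrays.toString(denseRank(scores))); // [1, 1, 2, 3]
    }
}
```

The two schemes agree only when all scores within a group are distinct, so queries that filter on the rank column can return different rows on Hive and Spark when ties occur.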

coveralls · 2017-02-17T21:44:41Z

Coverage increased (+0.02%) to 35.969% when pulling 13304c5 on myui:HIVEMALL-73 into bcae153 on apache:master.

[HIVEMALL-73] Reduce memory usages of each_top_k

13304c5

myui changed the title ~~[HIVEMALL-73] Reduce memory usages of each_top_k~~ [WIP][HIVEMALL-73] Reduce memory usages of each_top_k Feb 17, 2017

myui closed this Feb 20, 2017
