SKIPME Streaming iterable #14

markhamstra · 2014-05-28T22:58:31Z

No description provided.

Conflicts: core/src/main/scala/org/apache/spark/rdd/RDD.scala

Conflicts: core/src/main/scala/org/apache/spark/rdd/RDD.scala, core/src/main/scala/org/apache/spark/util/RDDiterable.scala

SKIPME Streaming iterable

Support the `!` boolean logic operator as an equivalent of NOT in SQL, for example: `select * from for_test where !(col1 > col2)`

Author: YanTangZhai <hakeemzhai@tencent.com>
Author: Michael Armbrust <michael@databricks.com>

Closes apache#3555 from YanTangZhai/SPARK-4692 and squashes the following commits:

1a9f605 [YanTangZhai] Update HiveQuerySuite.scala
7c03c68 [YanTangZhai] Merge pull request alteryx#23 from apache/master
992046e [YanTangZhai] Update HiveQuerySuite.scala
ea618f4 [YanTangZhai] Update HiveQuerySuite.scala
192411d [YanTangZhai] Merge pull request alteryx#17 from YanTangZhai/master
e4c2c0a [YanTangZhai] Merge pull request alteryx#15 from apache/master
1e1ebb4 [YanTangZhai] Update HiveQuerySuite.scala
efc4210 [YanTangZhai] Update HiveQuerySuite.scala
bd2c444 [YanTangZhai] Update HiveQuerySuite.scala
1893956 [YanTangZhai] Merge pull request alteryx#14 from marmbrus/pr/3555
59e4de9 [Michael Armbrust] make hive test
718afeb [YanTangZhai] Merge pull request alteryx#12 from apache/master
950b21e [YanTangZhai] Update HiveQuerySuite.scala
74175b4 [YanTangZhai] Update HiveQuerySuite.scala
92242c7 [YanTangZhai] Update HiveQl.scala
6e643f8 [YanTangZhai] Merge pull request alteryx#11 from apache/master
e249846 [YanTangZhai] Merge pull request alteryx#10 from apache/master
d26d982 [YanTangZhai] Merge pull request alteryx#9 from apache/master
76d4027 [YanTangZhai] Merge pull request alteryx#8 from apache/master
03b62b0 [YanTangZhai] Merge pull request alteryx#7 from apache/master
8a00106 [YanTangZhai] Merge pull request alteryx#6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master

This PR tries to speed up some Python tests:

```
tests.py                  144s -> 103s    -41s
mllib/classification.py    24s ->  17s     -7s
mllib/regression.py        27s ->  15s    -12s
mllib/tree.py              27s ->  13s    -14s
mllib/tests.py             64s ->  31s    -33s
streaming/tests.py        185s ->  84s   -101s
```

Considering Python 3, the total saving will be 558s (almost 10 minutes), because core and streaming run three times and mllib runs twice.

During testing, it shows the time used for each test file:

```
Run core tests ...
Running test: pyspark/rdd.py ... ok (22s)
Running test: pyspark/context.py ... ok (16s)
Running test: pyspark/conf.py ... ok (4s)
Running test: pyspark/broadcast.py ... ok (4s)
Running test: pyspark/accumulators.py ... ok (4s)
Running test: pyspark/serializers.py ... ok (6s)
Running test: pyspark/profiler.py ... ok (5s)
Running test: pyspark/shuffle.py ... ok (1s)
Running test: pyspark/tests.py ... ok (103s)
144s
```

Author: Reynold Xin <rxin@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes apache#5605 from rxin/python-tests-speed and squashes the following commits:

d08542d [Reynold Xin] Merge pull request alteryx#14 from mengxr/SPARK-6953
89321ee [Xiangrui Meng] fix seed in tests
3ad2387 [Reynold Xin] Merge pull request apache#5427 from davies/python_tests
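The 558s total can be reproduced from the per-file savings in the timing table. This is a minimal sketch: the run counts (core and streaming three times, mllib twice) come from the description, while the grouping of files into suites and the variable names are illustrative assumptions.

```python
# Per-file savings in seconds, taken from the timing table in the description.
# The grouping of files into suites is an assumption made for illustration.
savings = {
    "core": {"tests.py": 144 - 103},
    "mllib": {
        "mllib/classification.py": 24 - 17,
        "mllib/regression.py": 27 - 15,
        "mllib/tree.py": 27 - 13,
        "mllib/tests.py": 64 - 31,
    },
    "streaming": {"streaming/tests.py": 185 - 84},
}

# Number of times each suite runs, as stated in the description.
runs = {"core": 3, "mllib": 2, "streaming": 3}

# Each suite's saving counts once per run.
total_saving = sum(
    runs[suite] * sum(files.values()) for suite, files in savings.items()
)
print(total_saving)  # 558
```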

…into a single batch.

SQL

```
select * from tableA join tableB on (a > 3 and b = d) or (a > 3 and b = e)
```

Plan before the change

```
== Optimized Logical Plan ==
Project [a#293,b#294,c#295,d#296,e#297]
 Join Inner, Some(((a#293 > 3) && ((b#294 = d#296) || (b#294 = e#297))))
  MetastoreRelation default, tablea, None
  MetastoreRelation default, tableb, None
```

Plan after the change

```
== Optimized Logical Plan ==
Project [a#293,b#294,c#295,d#296,e#297]
 Join Inner, Some(((b#294 = d#296) || (b#294 = e#297)))
  Filter (a#293 > 3)
   MetastoreRelation default, tablea, None
  MetastoreRelation default, tableb, None
```

CombineLimits produces Limit(If(LessThan(ne, le), ne, le), grandChild), and LessThan is handled in BooleanSimplification, so CombineLimits must run before BooleanSimplification, and BooleanSimplification must run before PushPredicateThroughJoin.

Author: Zhongshuai Pei <799203320@qq.com>
Author: DoingDone9 <799203320@qq.com>

Closes apache#6351 from DoingDone9/master and squashes the following commits:

20de7be [Zhongshuai Pei] Update Optimizer.scala
7bc7d28 [Zhongshuai Pei] Merge pull request alteryx#17 from apache/master
0ba5f42 [Zhongshuai Pei] Update Optimizer.scala
f8b9314 [Zhongshuai Pei] Update FilterPushdownSuite.scala
c529d9f [Zhongshuai Pei] Update FilterPushdownSuite.scala
ae3af6d [Zhongshuai Pei] Update FilterPushdownSuite.scala
a04ffae [Zhongshuai Pei] Update Optimizer.scala
11beb61 [Zhongshuai Pei] Update FilterPushdownSuite.scala
f2ee5fe [Zhongshuai Pei] Update Optimizer.scala
be6b1d5 [Zhongshuai Pei] Update Optimizer.scala
b01e622 [Zhongshuai Pei] Merge pull request alteryx#15 from apache/master
8df716a [Zhongshuai Pei] Update FilterPushdownSuite.scala
d98bc35 [Zhongshuai Pei] Update FilterPushdownSuite.scala
fa65718 [Zhongshuai Pei] Update Optimizer.scala
ab8e9a6 [Zhongshuai Pei] Merge pull request alteryx#14 from apache/master
14952e2 [Zhongshuai Pei] Merge pull request alteryx#13 from apache/master
f03fe7f [Zhongshuai Pei] Merge pull request alteryx#12 from apache/master
f12fa50 [Zhongshuai Pei] Merge pull request alteryx#10 from apache/master
f61210c [Zhongshuai Pei] Merge pull request alteryx#9 from apache/master
34b1a9a [Zhongshuai Pei] Merge pull request alteryx#8 from apache/master
802261c [DoingDone9] Merge pull request alteryx#7 from apache/master
d00303b [DoingDone9] Merge pull request alteryx#6 from apache/master
98b134f [DoingDone9] Merge pull request alteryx#5 from apache/master
161cae3 [DoingDone9] Merge pull request alteryx#4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
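The plan change rests on the identity (p AND q) OR (p AND r) = p AND (q OR r): once the shared conjunct `a > 3` is factored out of the join condition, it refers only to tableA and can be pushed below the join as a filter. The sketch below illustrates that factoring in plain Python with two-valued booleans. It is not Catalyst code; the tuple-based expression encoding and the helpers `conjuncts`, `join_with`, `factor_common`, and `evaluate` are hypothetical names introduced for illustration.

```python
from itertools import product

def conjuncts(expr):
    """Split a nested ("and", l, r) expression into a flat list of conjuncts."""
    if isinstance(expr, tuple) and expr[0] == "and":
        return conjuncts(expr[1]) + conjuncts(expr[2])
    return [expr]

def join_with(op, terms):
    """Rebuild a left-nested binary expression from a non-empty list of terms."""
    result = terms[0]
    for term in terms[1:]:
        result = (op, result, term)
    return result

def factor_common(left, right):
    """Rewrite (left OR right), pulling out the conjuncts both sides share."""
    lhs, rhs = conjuncts(left), conjuncts(right)
    common = [c for c in lhs if c in rhs]
    rest_l = [c for c in lhs if c not in common]
    rest_r = [c for c in rhs if c not in common]
    if not common:
        return ("or", left, right)
    if not rest_l or not rest_r:
        # One side consists only of shared conjuncts, so it absorbs the other.
        return join_with("and", common)
    return ("and", join_with("and", common),
            ("or", join_with("and", rest_l), join_with("and", rest_r)))

def evaluate(expr, env):
    """Evaluate an expression whose leaves are predicate names looked up in env."""
    if isinstance(expr, tuple):
        op, l, r = expr
        if op == "and":
            return evaluate(l, env) and evaluate(r, env)
        return evaluate(l, env) or evaluate(r, env)
    return env[expr]

# Leaves stand for the predicates in the example query.
p, q, r = "a > 3", "b = d", "b = e"
original = ("or", ("and", p, q), ("and", p, r))
rewritten = factor_common(("and", p, q), ("and", p, r))
print(rewritten)  # ('and', 'a > 3', ('or', 'b = d', 'b = e'))

# Exhaustive truth-table check that the rewrite preserves the condition.
equivalent = all(
    evaluate(original, dict(zip((p, q, r), values)))
    == evaluate(rewritten, dict(zip((p, q, r), values)))
    for values in product([False, True], repeat=3)
)
print(equivalent)  # True
```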

…gle batch

## What changes were proposed in this pull request?

This PR supports multiple Python UDFs within a single batch and also improves performance.

```python
>>> from pyspark.sql.types import IntegerType
>>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType())
>>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType())
>>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)]
+- OneRowRelation$

== Analyzed Logical Plan ==
double(add(1, 2)): int, add(double(2), 1): int
Project [double(add(1, 2))#14,add(double(2), 1)#15]
+- Project [double(add(1, 2))#14,add(double(2), 1)#15]
   +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
      +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
         +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
            +- OneRowRelation$

== Optimized Logical Plan ==
Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
+- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
   +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
      +- OneRowRelation$

== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
:     +- INPUT
+- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18]
   +- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
      +- Scan OneRowRelation[]
```

## How was this patch tested?

Added new tests.

The following script was used to benchmark 1, 2, and 3 UDFs:

```
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

df = sqlContext.range(1, 1 << 23, 1, 4)
double = F.udf(lambda x: x * 2, LongType())
print df.select(double(df.id)).count()
print df.select(double(df.id), double(df.id + 1)).count()
print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count()
```

Here are the results:

N | Before | After | Speedup
---- | ------------ | ------------- | ------
1 | 22 s | 7 s | 3.1X
2 | 38 s | 13 s | 2.9X
3 | 58 s | 16 s | 3.6X

This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python processes before this patch and 4 processes after it. After this patch, multiple UDFs use less memory than before (less buffering).

Author: Davies Liu <davies@databricks.com>

Closes apache#12057 from davies/multi_udfs.
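The speedup column and the process counts follow from simple arithmetic on the reported numbers. The sketch below recomputes them. The timings come from the results table; the worker model (one Python process per UDF per concurrent task before the patch, one per task after it) is an assumption that is consistent with the reported 12 and 4, not something the description states explicitly.

```python
# (before, after) wall-clock seconds for N UDFs, from the results table.
timings = {1: (22, 7), 2: (38, 13), 3: (58, 16)}

# Speedup is the ratio of the old runtime to the new one, to one decimal place.
speedups = {n: round(before / after, 1) for n, (before, after) in timings.items()}
print(speedups)  # {1: 3.1, 2: 2.9, 3: 3.6}

# Assumed worker model: 4 concurrent tasks (4 CPUs, 4 partitions).
tasks = 4
udfs = 3
workers_before = tasks * udfs  # one Python process per UDF per task
workers_after = tasks          # one Python process per task evaluates all UDFs
print(workers_before, workers_after)  # 12 4
```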

* Added service name as prefix to executor pods to be able to tell them apart from kubectl output
* Addressed comments

tbfenet added 2 commits May 28, 2014 15:35

streaming iterable

0324bfd

Conflicts: core/src/main/scala/org/apache/spark/rdd/RDD.scala

updated streaming iterable

4f51bdf

Conflicts: core/src/main/scala/org/apache/spark/rdd/RDD.scala, core/src/main/scala/org/apache/spark/util/RDDiterable.scala

markhamstra assigned jhartlaub May 28, 2014

jhartlaub added a commit that referenced this pull request May 31, 2014

Merge pull request #14 from markhamstra/streaming-iterable

ab96fc3

SKIPME Streaming iterable

jhartlaub merged commit ab96fc3 into alteryx:branch-0.9-csd May 31, 2014

markhamstra pushed a commit to markhamstra/spark that referenced this pull request Nov 7, 2017

Added service name as prefix to executor pods (alteryx#14)

0bcc391

* Added service name as prefix to executor pods to be able to tell them apart from kubectl output
* Addressed comments
