
Csd branch 0.9 merge #9

Closed
wants to merge 2,175 commits

Conversation

jhartlaub

@markhamstra - initial merge of Spark 0.9 for review.

rxin and others added 30 commits January 11, 2014 12:35
Fix a small problem in ALS where configuration didn't work
Revert PR 381

This PR missed a bunch of test cases that require "spark.cleaner.ttl". I think that is what is causing the test failures on Jenkins right now (though it's a bit hard to tell because the DNS for cs.berkeley.edu is down).

I'm submitting this to see if it fixes Jenkins. I did try just patching the individual tests, but it was taking a really long time because there are so many of them, so for now I'm just seeing if a revert works.
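For context, the tests in question rely on "spark.cleaner.ttl" being set through the ordinary configuration mechanism; a minimal PySpark sketch (the property name comes from the text above, while the master, app name, and TTL value are illustrative):

```python
from pyspark import SparkConf, SparkContext

# Enable periodic cleanup of old metadata by setting a TTL (in seconds)
# before the context starts. Values here are illustrative only.
conf = (SparkConf()
        .setMaster("local[2]")
        .setAppName("cleaner-ttl-demo")
        .set("spark.cleaner.ttl", "3600"))
sc = SparkContext(conf=conf)
```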
Fix UI bug introduced in alteryx#244.

The 'duration' field was incorrectly renamed to 'task time' in the table that
lists stages.
Minor update for clone writables and more documentation.
- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to match the style of our other
  algorithms better and in particular make it easier to call from Java
  (added builder pattern, removed default value in train method)
- Updated Python MLlib functions to not require a SparkContext; we can
  get that from the RDD the user gives
- Added a toString method in LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before
  they could only be run individually through each file)
We've used camel case in other Spark methods, so it felt reasonable to
keep using it here and make the code match Scala/Java as much as
possible. Note that parameter names matter in Python because they allow
passing optional parameters by name.
Also fixes the main methods of a few other algorithms to print the final model.
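As a hedged illustration of the naming point: because the Python wrapper keeps the Scala camelCase parameter names, optional parameters can be passed by keyword. The `lambda_` keyword name is taken from later releases and assumed for the 0.9-era wrapper added here, not from the diff:

```python
from pyspark import SparkContext
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext("local", "mllib-naming-demo")
data = sc.parallelize([
    LabeledPoint(0.0, [1.0, 0.0]),
    LabeledPoint(1.0, [0.0, 1.0]),
])
# The optional smoothing parameter is passed by name; `lambda_` is assumed
# to be its Python-side name (it is in later MLlib releases).
model = NaiveBayes.train(data, lambda_=1.0)
print(model)
```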
…ected[streaming] to private[streaming] in StreamingContext and DStream. Added waitForStop to StreamingContext, and StreamingContextSuite.
This helps in case the exception happens while serializing a record to
be sent to Java, which would otherwise leave the stream to Java in an
inconsistent state where PythonRDD can't read the error.
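A minimal sketch of the pattern, with hypothetical stand-in helpers rather than the real worker code:

```python
import traceback

def process(record):
    # Stand-in for the real per-record work; a non-string input raises here,
    # mimicking a serialization/depickling failure partway through the stream.
    return record.upper()

def worker_loop(records, out):
    # Sketch of the fix's shape (not the actual pyspark/worker.py): a single
    # try/except around the whole loop, so any failure is emitted as one
    # well-formed error record instead of a half-written record that the
    # Java side cannot parse.
    try:
        for record in records:
            out.append(process(record))
    except Exception:
        out.append(("PYTHON_EXCEPTION", traceback.format_exc()))

out = []
worker_loop(["ok", 42], out)  # 42 has no .upper(), triggering the handler
print(out[-1][0])             # PYTHON_EXCEPTION
```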
- Only change simple return statements at the end of a method
- Ignore the complex if-else check
- Ignore the ones inside synchronized
…spark.streaming to org.apache.spark.streaming.dstream.
mengxr and others added 8 commits February 21, 2014 22:44
The current doc suggests Spark doesn't support accumulators of type `Long`, which is wrong.

JIRA: https://spark-project.atlassian.net/browse/SPARK-1117
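For reference, the accumulator pattern the docs describe, as a minimal PySpark sketch (Python ints are arbitrary precision, so they also cover the `Long` case):

```python
from pyspark import SparkContext

sc = SparkContext("local", "accumulator-demo")
acc = sc.accumulator(0)  # works for arbitrarily large counts in Python
sc.parallelize(range(100)).foreach(lambda x: acc.add(x))
print(acc.value)  # 4950
```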

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#631 from mengxr/acc and squashes the following commits:

45ecd25 [Xiangrui Meng] update accumulator docs
(cherry picked from commit aaec7d4)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
As reported in https://spark-project.atlassian.net/browse/SPARK-1055

"The used Spark version in the .../base/Dockerfile is stale on 0.8.1 and should be updated to 0.9.x to match the release."

Author: CodingCat <zhunansjtu@gmail.com>
Author: Nan Zhu <CodingCat@users.noreply.github.com>

Closes apache#634 from CodingCat/SPARK-1055 and squashes the following commits:

cb7330e [Nan Zhu] Update Dockerfile
adf8259 [CodingCat] fix the SCALA_VERSION and SPARK_VERSION in docker file

(cherry picked from commit 1aa4f8a)
Signed-off-by: Aaron Davidson <aaron@databricks.com>
In the previous code, if you had a failing map stage and then tried to
run reduce stages on it repeatedly, the first reduce stage would fail
correctly, but the later ones would mistakenly believe that all map
outputs are available and start failing infinitely with fetch failures
from "null".
A recent PR that added Java vs Scala tabs for streaming also
inadvertently added some bad code to a document.ready handler, breaking
our other handler that manages scrolling to anchors correctly with the
floating top bar. As a result the section title ended up always being
hidden below the top bar. This removes the unnecessary JavaScript code.

Author: Matei Zaharia <matei@databricks.com>

Closes alteryx#3 from mateiz/doc-links and squashes the following commits:

e2a3488 [Matei Zaharia] SPARK-1135: fix broken anchors in docs
This surrounds the complete worker code in a try/except block so we catch any error that occurs. An example would be the depickling failing for some reason.

@JoshRosen

Author: Bouke van der Bijl <boukevanderbijl@gmail.com>

Closes apache#644 from bouk/catch-depickling-errors and squashes the following commits:

f0f67cc [Bouke van der Bijl] Lol indentation
0e4d504 [Bouke van der Bijl] Surround the complete python worker with the try block
(cherry picked from commit 12738c1)

Signed-off-by: Josh Rosen <joshrosen@apache.org>
Conflicts:
	CHANGES.txt
	README.md
	assembly/pom.xml
	bagel/pom.xml
	bin/spark-class
	bin/spark-class2.cmd
	core/pom.xml
	core/src/main/java/org/apache/spark/network/netty/FileServerHandler.java
	core/src/main/java/org/apache/spark/network/netty/PathResolver.java
	core/src/main/scala/org/apache/spark/Aggregator.scala
	core/src/main/scala/org/apache/spark/CacheManager.scala
	core/src/main/scala/org/apache/spark/FutureAction.scala
	core/src/main/scala/org/apache/spark/InterruptibleIterator.scala
	core/src/main/scala/org/apache/spark/MapOutputTracker.scala
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/main/scala/org/apache/spark/TaskEndReason.scala
	core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala
	core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala
	core/src/main/scala/org/apache/spark/api/java/function/FlatMapFunction.scala
	core/src/main/scala/org/apache/spark/api/java/function/FlatMapFunction2.scala
	core/src/main/scala/org/apache/spark/api/java/function/Function.java
	core/src/main/scala/org/apache/spark/api/java/function/Function2.java
	core/src/main/scala/org/apache/spark/api/java/function/Function3.java
	core/src/main/scala/org/apache/spark/api/java/function/Function4.java
	core/src/main/scala/org/apache/spark/api/java/function/PairFlatMapFunction.java
	core/src/main/scala/org/apache/spark/api/java/function/PairFunction.java
	core/src/main/scala/org/apache/spark/api/java/function/WrappedFunction3.scala
	core/src/main/scala/org/apache/spark/broadcast/BitTorrentBroadcast.scala
	core/src/main/scala/org/apache/spark/broadcast/Broadcast.scala
	core/src/main/scala/org/apache/spark/broadcast/HttpBroadcast.scala
	core/src/main/scala/org/apache/spark/broadcast/MultiTracker.scala
	core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala
	core/src/main/scala/org/apache/spark/broadcast/TreeBroadcast.scala
	core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala
	core/src/main/scala/org/apache/spark/deploy/FaultToleranceTest.scala
	core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala
	core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
	core/src/main/scala/org/apache/spark/deploy/client/Client.scala
	core/src/main/scala/org/apache/spark/deploy/client/TestClient.scala
	core/src/main/scala/org/apache/spark/deploy/master/ApplicationInfo.scala
	core/src/main/scala/org/apache/spark/deploy/master/ApplicationState.scala
	core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala
	core/src/main/scala/org/apache/spark/deploy/master/Master.scala
	core/src/main/scala/org/apache/spark/deploy/master/PersistenceEngine.scala
	core/src/main/scala/org/apache/spark/deploy/master/RecoveryState.scala
	core/src/main/scala/org/apache/spark/deploy/master/SparkZooKeeperSession.scala
	core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala
	core/src/main/scala/org/apache/spark/deploy/master/WorkerState.scala
	core/src/main/scala/org/apache/spark/deploy/master/ZooKeeperLeaderElectionAgent.scala
	core/src/main/scala/org/apache/spark/deploy/master/ZooKeeperPersistenceEngine.scala
	core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala
	core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
	core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
	core/src/main/scala/org/apache/spark/executor/Executor.scala
	core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala
	core/src/main/scala/org/apache/spark/network/netty/ShuffleCopier.scala
	core/src/main/scala/org/apache/spark/network/netty/ShuffleSender.scala
	core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala
	core/src/main/scala/org/apache/spark/rdd/BlockRDD.scala
	core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
	core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala
	core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
	core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
	core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithIndexRDD.scala
	core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala
	core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
	core/src/main/scala/org/apache/spark/rdd/RDD.scala
	core/src/main/scala/org/apache/spark/rdd/ShuffledRDD.scala
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
	core/src/main/scala/org/apache/spark/scheduler/DAGSchedulerEvent.scala
	core/src/main/scala/org/apache/spark/scheduler/DAGSchedulerSource.scala
	core/src/main/scala/org/apache/spark/scheduler/JobLogger.scala
	core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala
	core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala
	core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala
	core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
	core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala
	core/src/main/scala/org/apache/spark/scheduler/SparkListenerBus.scala
	core/src/main/scala/org/apache/spark/scheduler/StageInfo.scala
	core/src/main/scala/org/apache/spark/scheduler/TaskInfo.scala
	core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala
	core/src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala
	core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
	core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/SimrSchedulerBackend.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/SparkDeploySchedulerBackend.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala
	core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala
	core/src/main/scala/org/apache/spark/scheduler/local/LocalTaskSetManager.scala
	core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala
	core/src/main/scala/org/apache/spark/storage/BlockId.scala
	core/src/main/scala/org/apache/spark/storage/BlockManager.scala
	core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
	core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala
	core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala
	core/src/main/scala/org/apache/spark/storage/MemoryStore.scala
	core/src/main/scala/org/apache/spark/storage/ShuffleBlockManager.scala
	core/src/main/scala/org/apache/spark/storage/StoragePerfTester.scala
	core/src/main/scala/org/apache/spark/ui/UIWorkloadGenerator.scala
	core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala
	core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala
	core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala
	core/src/main/scala/org/apache/spark/util/Utils.scala
	core/src/main/scala/org/apache/spark/util/collection/BitSet.scala
	core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
	core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
	core/src/main/scala/org/apache/spark/util/collection/PrimitiveKeyOpenHashMap.scala
	core/src/main/scala/org/apache/spark/util/collection/PrimitiveVector.scala
	core/src/test/scala/org/apache/spark/CheckpointSuite.scala
	core/src/test/scala/org/apache/spark/DistributedSuite.scala
	core/src/test/scala/org/apache/spark/FileServerSuite.scala
	core/src/test/scala/org/apache/spark/JavaAPISuite.java
	core/src/test/scala/org/apache/spark/deploy/JsonProtocolSuite.scala
	core/src/test/scala/org/apache/spark/deploy/worker/ExecutorRunnerTest.scala
	core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala
	core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala
	core/src/test/scala/org/apache/spark/scheduler/ClusterSchedulerSuite.scala
	core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
	core/src/test/scala/org/apache/spark/scheduler/FakeTask.scala
	core/src/test/scala/org/apache/spark/scheduler/JobLoggerSuite.scala
	core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
	core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
	core/src/test/scala/org/apache/spark/scheduler/local/LocalSchedulerSuite.scala
	core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala
	core/src/test/scala/org/apache/spark/storage/BlockIdSuite.scala
	core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala
	core/src/test/scala/org/apache/spark/storage/DiskBlockManagerSuite.scala
	core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala
	core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala
	docker/spark-test/base/Dockerfile
	docs/_config.yml
	docs/_layouts/global.html
	docs/building-with-maven.md
	docs/configuration.md
	docs/mllib-guide.md
	docs/python-programming-guide.md
	docs/running-on-yarn.md
	docs/streaming-programming-guide.md
	docs/tuning.md
	ec2/spark_ec2.py
	examples/pom.xml
	examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala
	examples/src/main/scala/org/apache/spark/streaming/examples/clickstream/PageViewGenerator.scala
	graphx/src/test/scala/org/apache/spark/graphx/lib/SVDPlusPlusSuite.scala
	mllib/pom.xml
	mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
	mllib/src/test/java/org/apache/spark/mllib/recommendation/JavaALSSuite.java
	mllib/src/test/scala/org/apache/spark/mllib/recommendation/ALSSuite.scala
	pom.xml
	project/SparkBuild.scala
	python/pyspark/context.py
	python/pyspark/rdd.py
	python/pyspark/shell.py
	python/pyspark/tests.py
	repl/pom.xml
	repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
	repl/src/main/scala/org/apache/spark/repl/SparkIMain.scala
	repl/src/test/scala/org/apache/spark/repl/ReplSuite.scala
	sbin/stop-slaves.sh
	streaming/pom.xml
	streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
	streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala
	streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala
	streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
	streaming/src/main/scala/org/apache/spark/streaming/dstream/NetworkInputDStream.scala
	streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala
	streaming/src/main/scala/org/apache/spark/streaming/dstream/TransformedDStream.scala
	streaming/src/main/scala/org/apache/spark/streaming/receivers/ActorReceiver.scala
	streaming/src/main/scala/org/apache/spark/streaming/scheduler/NetworkInputTracker.scala
	streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
	streaming/src/test/java/org/apache/spark/streaming/JavaTestUtils.scala
	streaming/src/test/java/org/apache/spark/streaming/LocalJavaStreamingContext.java
	streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala
	streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
	streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
	tools/pom.xml
	yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
	yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/WorkerRunnable.scala
	yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
	yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
	yarn/pom.xml
	yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@markhamstra

Do we want to go straight into master-csd, or should there be a separate branch instead?

@jhartlaub (Author)

Closing to resolve some conflicts.

@jhartlaub jhartlaub closed this May 27, 2014
@jhartlaub (Author)

yeah, I think we could go into a separate branch and merge into master-csd later (to hedge our bets)

@markhamstra

I just went through the diff between jhartlaub:csd-branch-0.9-merge and clearstorydata:csd-2.10. I went through quickly and may have missed something, but what I find is that many of the differences are either completely inconsequential or are inconsequential to us; but for those that are consequential, I believe that we should be using the csd-2.10 version. Several of those are post-0.9.1 bug fixes, including my recent SPY-302 work; and the diff between csd-2.10 and Apache branch-0.9 is smaller.

I think that you can easily substitute csd-2.10 for csd-branch-0.9-merge, and that we should use csd-2.10.

@jhartlaub (Author)

ok, I can do that, no problem.

xiajunluan pushed a commit to xiajunluan/spark that referenced this pull request May 30, 2014
StandaloneSchedulerBackend instead of the smaller IDs used within Spark
(that lack the application name).

This was reported by ClearStory in
alteryx/spark#9.

Also fixed some messages that said slave instead of executor.
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Aug 24, 2014
…tion improvements

Author: Michael Armbrust <michael@databricks.com>
Author: Gregory Owen <greowen@gmail.com>

Closes apache#1935 from marmbrus/countDistinctPartial and squashes the following commits:

5c7848d [Michael Armbrust] turn off caching in the constructor
8074a80 [Michael Armbrust] fix tests
32d216f [Michael Armbrust] reynolds comments
c122cca [Michael Armbrust] Address comments, add tests
b2e8ef3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
fae38f4 [Michael Armbrust] Fix style
fdca896 [Michael Armbrust] cleanup
93d0f64 [Michael Armbrust] metastore concurrency fix.
db44a30 [Michael Armbrust] JIT hax.
3868f6c [Michael Armbrust] Merge pull request alteryx#9 from GregOwen/countDistinctPartial
c9e67de [Gregory Owen] Made SpecificRow and types serializable by Kryo
2b46c4b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
8ff6402 [Michael Armbrust] Add specific row.
58d15f1 [Michael Armbrust] disable codegen logging
87d101d [Michael Armbrust] Fix isNullAt bug
abee26d [Michael Armbrust] WIP
27984d0 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
57ae3b1 [Michael Armbrust] Fix order dependent test
b3d0f64 [Michael Armbrust] Add golden files.
c1f7114 [Michael Armbrust] Improve tests / fix serialization.
f31b8ad [Michael Armbrust] more fixes
38c7449 [Michael Armbrust] comments and style
9153652 [Michael Armbrust] better toString
d494598 [Michael Armbrust] Fix tests now that the planner is better
41fbd1d [Michael Armbrust] Never try and create an empty hash set.
050bb97 [Michael Armbrust] Skip no-arg constructors for kryo,
bd08239 [Michael Armbrust] WIP
213ada8 [Michael Armbrust] First draft of partially aggregated and code generated count distinct / max
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Aug 25, 2014
…tion improvements

Author: Michael Armbrust <michael@databricks.com>
Author: Gregory Owen <greowen@gmail.com>

Closes apache#1935 from marmbrus/countDistinctPartial and squashes the following commits:

5c7848d [Michael Armbrust] turn off caching in the constructor
8074a80 [Michael Armbrust] fix tests
32d216f [Michael Armbrust] reynolds comments
c122cca [Michael Armbrust] Address comments, add tests
b2e8ef3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
fae38f4 [Michael Armbrust] Fix style
fdca896 [Michael Armbrust] cleanup
93d0f64 [Michael Armbrust] metastore concurrency fix.
db44a30 [Michael Armbrust] JIT hax.
3868f6c [Michael Armbrust] Merge pull request alteryx#9 from GregOwen/countDistinctPartial
c9e67de [Gregory Owen] Made SpecificRow and types serializable by Kryo
2b46c4b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
8ff6402 [Michael Armbrust] Add specific row.
58d15f1 [Michael Armbrust] disable codegen logging
87d101d [Michael Armbrust] Fix isNullAt bug
abee26d [Michael Armbrust] WIP
27984d0 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
57ae3b1 [Michael Armbrust] Fix order dependent test
b3d0f64 [Michael Armbrust] Add golden files.
c1f7114 [Michael Armbrust] Improve tests / fix serialization.
f31b8ad [Michael Armbrust] more fixes
38c7449 [Michael Armbrust] comments and style
9153652 [Michael Armbrust] better toString
d494598 [Michael Armbrust] Fix tests now that the planner is better
41fbd1d [Michael Armbrust] Never try and create an empty hash set.
050bb97 [Michael Armbrust] Skip no-arg constructors for kryo,
bd08239 [Michael Armbrust] WIP
213ada8 [Michael Armbrust] First draft of partially aggregated and code generated count distinct / max

(cherry picked from commit 7e191fe)
Signed-off-by: Michael Armbrust <michael@databricks.com>
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Dec 10, 2014
…if sql has null

```
val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
val nrdd = jhc.hql("select null from spark_test.for_test")
println(nrdd.schema)
```
Then the error is thrown as follows:
```
scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$)
  at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)
```

Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>
Author: Michael Armbrust <michael@databricks.com>

Closes apache#3538 from YanTangZhai/MatchNullType and squashes the following commits:

e052dff [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
4b4bb34 [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
896c7b7 [yantangzhai] fix NullType MatchError in JavaSchemaRDD when sql has null
6e643f8 [YanTangZhai] Merge pull request alteryx#11 from apache/master
e249846 [YanTangZhai] Merge pull request alteryx#10 from apache/master
d26d982 [YanTangZhai] Merge pull request alteryx#9 from apache/master
76d4027 [YanTangZhai] Merge pull request alteryx#8 from apache/master
03b62b0 [YanTangZhai] Merge pull request alteryx#7 from apache/master
8a00106 [YanTangZhai] Merge pull request alteryx#6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master

(cherry picked from commit 1066427)
Signed-off-by: Michael Armbrust <michael@databricks.com>
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Jan 1, 2015
…if sql has null

```
val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
val nrdd = jhc.hql("select null from spark_test.for_test")
println(nrdd.schema)
```
Then the error is thrown as follows:
```
scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$)
  at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)
```

Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>
Author: Michael Armbrust <michael@databricks.com>

Closes apache#3538 from YanTangZhai/MatchNullType and squashes the following commits:

e052dff [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
4b4bb34 [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
896c7b7 [yantangzhai] fix NullType MatchError in JavaSchemaRDD when sql has null
6e643f8 [YanTangZhai] Merge pull request alteryx#11 from apache/master
e249846 [YanTangZhai] Merge pull request alteryx#10 from apache/master
d26d982 [YanTangZhai] Merge pull request alteryx#9 from apache/master
76d4027 [YanTangZhai] Merge pull request alteryx#8 from apache/master
03b62b0 [YanTangZhai] Merge pull request alteryx#7 from apache/master
8a00106 [YanTangZhai] Merge pull request alteryx#6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Jan 1, 2015
…ins an empty AttributeSet() references

The sql "select * from spark_test::for_test where abs(20141202) is not null" has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)) and
partitionKeyIds=AttributeSet(). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)). Then the exception "java.lang.IllegalArgumentException: requirement failed: Partition pruning predicates only supported for partitioned tables." is thrown.
The sql "select * from spark_test::for_test_partitioned_table where abs(20141202) is not null and type_id=11 and platform = 3" with partitioned key insert_date has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202), (type_id#12 = 11), (platform#8 = 3)) and partitionKeyIds=AttributeSet(insert_date#24). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)).

Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>

Closes apache#3556 from YanTangZhai/SPARK-4693 and squashes the following commits:

620ebe3 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
37cfdf5 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
70a3544 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
efa9b03 [YanTangZhai] Update HiveQuerySuite.scala
72accf1 [YanTangZhai] Update HiveQuerySuite.scala
e572b9a [YanTangZhai] Update HiveStrategies.scala
6e643f8 [YanTangZhai] Merge pull request alteryx#11 from apache/master
e249846 [YanTangZhai] Merge pull request alteryx#10 from apache/master
d26d982 [YanTangZhai] Merge pull request alteryx#9 from apache/master
76d4027 [YanTangZhai] Merge pull request alteryx#8 from apache/master
03b62b0 [YanTangZhai] Merge pull request alteryx#7 from apache/master
8a00106 [YanTangZhai] Merge pull request alteryx#6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Jan 1, 2015
…askTracker to reduce the chance of communication problems

Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of communication problems

Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>

Closes apache#3785 from YanTangZhai/SPARK-4946 and squashes the following commits:

9ca6541 [yantangzhai] [SPARK-4946] [CORE] Using AkkaUtils.askWithReply in MapOutputTracker.askTracker to reduce the chance of the communicating problem
e4c2c0a [YanTangZhai] Merge pull request alteryx#15 from apache/master
718afeb [YanTangZhai] Merge pull request alteryx#12 from apache/master
6e643f8 [YanTangZhai] Merge pull request alteryx#11 from apache/master
e249846 [YanTangZhai] Merge pull request alteryx#10 from apache/master
d26d982 [YanTangZhai] Merge pull request alteryx#9 from apache/master
76d4027 [YanTangZhai] Merge pull request alteryx#8 from apache/master
03b62b0 [YanTangZhai] Merge pull request alteryx#7 from apache/master
8a00106 [YanTangZhai] Merge pull request alteryx#6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Jan 9, 2015
When debugging Spark Streaming applications it is necessary to monitor certain metrics that are not shown in the Spark application UI. For example, what is the average processing time of batches? What is the scheduling delay? Is the system able to process data as fast as it is receiving it? How many records am I receiving through my receivers?

While the StreamingListener interface introduced in 0.9 provided some of this information, it could only be accessed programmatically. A UI that shows information specific to streaming applications is necessary for easier debugging. This PR introduces such a UI, showing various statistics related to the streaming application. Here is a screenshot of the UI running on my local machine.

http://i.imgur.com/1ooDGhm.png

This UI is integrated into the Spark UI running on port 4040.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Andrew Or <andrewor14@gmail.com>

Closes apache#290 from tdas/streaming-web-ui and squashes the following commits:

fc73ca5 [Tathagata Das] Merge pull request alteryx#9 from andrewor14/ui-refactor
642dd88 [Andrew Or] Merge SparkUISuite.scala into UISuite.scala
eb30517 [Andrew Or] Merge github.com:apache/spark into ui-refactor
f4f4cbe [Tathagata Das] More minor fixes.
34bb364 [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui
252c566 [Tathagata Das] Merge pull request alteryx#8 from andrewor14/ui-refactor
e038b4b [Tathagata Das] Addressed Patrick's comments.
125a054 [Andrew Or] Disable serving static resources with gzip
90feb8d [Andrew Or] Address Patrick's comments
89dae36 [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui
72fe256 [Tathagata Das] Merge pull request alteryx#6 from andrewor14/ui-refactor
2fc09c8 [Tathagata Das] Added binary check exclusions
aa396d4 [Andrew Or] Rename tabs and pages (No more IndexPage.scala)
f8e1053 [Tathagata Das] Added Spark and Streaming UI unit tests.
caa5e05 [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui
585cd65 [Tathagata Das] Merge pull request alteryx#5 from andrewor14/ui-refactor
914b8ff [Tathagata Das] Moved utils functions to UIUtils.
548c98c [Andrew Or] Wide refactoring of WebUI, UITab, and UIPage (see commit message)
6de06b0 [Tathagata Das] Merge remote-tracking branch 'apache/master' into streaming-web-ui
ee6543f [Tathagata Das] Minor changes based on Andrew's comments.
fa760fe [Tathagata Das] Fixed long line.
1c0bcef [Tathagata Das] Refactored streaming UI into two files.
1af239b [Tathagata Das] Changed streaming UI to attach itself as a tab with the Spark UI.
827e81a [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui
168fe86 [Tathagata Das] Merge pull request #2 from andrewor14/ui-refactor
3e986f8 [Tathagata Das] Merge remote-tracking branch 'apache/master' into streaming-web-ui
c78c92d [Andrew Or] Remove outdated comment
8f7323b [Andrew Or] End of file new lines, indentation, and imports (minor)
0d61ee8 [Andrew Or] Merge branch 'streaming-web-ui' of github.com:tdas/spark into ui-refactor
9a48fa1 [Andrew Or] Allow adding tabs to SparkUI dynamically + add example
61358e3 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-web-ui
53be2c5 [Tathagata Das] Minor style updates.
ed25dfc [Andrew Or] Generalize SparkUI header to display tabs dynamically
a37ad4f [Andrew Or] Comments, imports and formatting (minor)
cd000b0 [Andrew Or] Merge github.com:apache/spark into ui-refactor
7d57444 [Andrew Or] Refactoring the UI interface to add flexibility
aef4dd5 [Tathagata Das] Added Apache licenses.
db27bad [Tathagata Das] Added last batch processing time to StreamingUI.
4d86e98 [Tathagata Das] Added basic stats to the StreamingUI and refactored the UI to a Page to make it easier to transition to using SparkUI later.
93f1c69 [Tathagata Das] Added network receiver information to the Streaming UI.
56cc7fb [Tathagata Das] First cut implementation of Streaming UI.
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Jan 12, 2015
Support the ! boolean logic operator, equivalent to NOT, in SQL as follows:
```
select * from for_test where !(col1 > col2)
```

Author: YanTangZhai <hakeemzhai@tencent.com>
Author: Michael Armbrust <michael@databricks.com>

Closes apache#3555 from YanTangZhai/SPARK-4692 and squashes the following commits:

1a9f605 [YanTangZhai] Update HiveQuerySuite.scala
7c03c68 [YanTangZhai] Merge pull request alteryx#23 from apache/master
992046e [YanTangZhai] Update HiveQuerySuite.scala
ea618f4 [YanTangZhai] Update HiveQuerySuite.scala
192411d [YanTangZhai] Merge pull request alteryx#17 from YanTangZhai/master
e4c2c0a [YanTangZhai] Merge pull request alteryx#15 from apache/master
1e1ebb4 [YanTangZhai] Update HiveQuerySuite.scala
efc4210 [YanTangZhai] Update HiveQuerySuite.scala
bd2c444 [YanTangZhai] Update HiveQuerySuite.scala
1893956 [YanTangZhai] Merge pull request alteryx#14 from marmbrus/pr/3555
59e4de9 [Michael Armbrust] make hive test
718afeb [YanTangZhai] Merge pull request alteryx#12 from apache/master
950b21e [YanTangZhai] Update HiveQuerySuite.scala
74175b4 [YanTangZhai] Update HiveQuerySuite.scala
92242c7 [YanTangZhai] Update HiveQl.scala
6e643f8 [YanTangZhai] Merge pull request alteryx#11 from apache/master
e249846 [YanTangZhai] Merge pull request alteryx#10 from apache/master
d26d982 [YanTangZhai] Merge pull request alteryx#9 from apache/master
76d4027 [YanTangZhai] Merge pull request alteryx#8 from apache/master
03b62b0 [YanTangZhai] Merge pull request alteryx#7 from apache/master
8a00106 [YanTangZhai] Merge pull request alteryx#6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
markhamstra pushed a commit to markhamstra/spark that referenced this pull request May 12, 2015
SQL
```
select key from (select key from src limit 100) t2 limit 10
```
Optimized Logical Plan before modifying
```
== Optimized Logical Plan ==
Limit 10
 Limit 100
  Project key#3
   MetastoreRelation default, src, None
```
Optimized Logical Plan after modifying
```
== Optimized Logical Plan ==
Limit 10
 Project [key#1]
  MetastoreRelation default, src, None
```

Author: Zhongshuai Pei <799203320@qq.com>
Author: DoingDone9 <799203320@qq.com>

Closes apache#5770 from DoingDone9/limitOptimizer and squashes the following commits:

c68eaa7 [Zhongshuai Pei] Update CombiningLimitsSuite.scala
97e18cf [Zhongshuai Pei] Update Optimizer.scala
19ab875 [Zhongshuai Pei] Update CombiningLimitsSuite.scala
7db4566 [Zhongshuai Pei] Update CombiningLimitsSuite.scala
e2a491d [Zhongshuai Pei] Update Optimizer.scala
f03fe7f [Zhongshuai Pei] Merge pull request alteryx#12 from apache/master
f12fa50 [Zhongshuai Pei] Merge pull request alteryx#10 from apache/master
f61210c [Zhongshuai Pei] Merge pull request alteryx#9 from apache/master
34b1a9a [Zhongshuai Pei] Merge pull request alteryx#8 from apache/master
802261c [DoingDone9] Merge pull request alteryx#7 from apache/master
d00303b [DoingDone9] Merge pull request alteryx#6 from apache/master
98b134f [DoingDone9] Merge pull request alteryx#5 from apache/master
161cae3 [DoingDone9] Merge pull request alteryx#4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
markhamstra pushed a commit to markhamstra/spark that referenced this pull request May 12, 2015
SQL
```
select key from (select key,value from t1 limit 100) t2 limit 10
```
Optimized Logical Plan before modifying
```
== Optimized Logical Plan ==
Limit 10
  Project key#228
    Limit 100
      MetastoreRelation default, t1, None
```
Optimized Logical Plan after modifying
```
== Optimized Logical Plan ==
Limit 10
  Limit 100
    Project key#228
      MetastoreRelation default, t1, None
```
After this, we can combine limits.

Author: Zhongshuai Pei <799203320@qq.com>
Author: DoingDone9 <799203320@qq.com>

Closes apache#5797 from DoingDone9/ProjectLimit and squashes the following commits:

70d0fca [Zhongshuai Pei] Update FilterPushdownSuite.scala
dc83ae9 [Zhongshuai Pei] Update FilterPushdownSuite.scala
485c61c [Zhongshuai Pei] Update Optimizer.scala
f03fe7f [Zhongshuai Pei] Merge pull request alteryx#12 from apache/master
f12fa50 [Zhongshuai Pei] Merge pull request alteryx#10 from apache/master
f61210c [Zhongshuai Pei] Merge pull request alteryx#9 from apache/master
34b1a9a [Zhongshuai Pei] Merge pull request alteryx#8 from apache/master
802261c [DoingDone9] Merge pull request alteryx#7 from apache/master
d00303b [DoingDone9] Merge pull request alteryx#6 from apache/master
98b134f [DoingDone9] Merge pull request alteryx#5 from apache/master
161cae3 [DoingDone9] Merge pull request alteryx#4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
markhamstra pushed a commit to markhamstra/spark that referenced this pull request May 12, 2015
…" into true or false directly

SQL
```
select key from src where 3 in (4, 5);
```
Before
```
== Optimized Logical Plan ==
Project [key#12]
 Filter 3 INSET (5,4)
  MetastoreRelation default, src, None
```

After
```
== Optimized Logical Plan ==
LocalRelation [key#228], []
```

Author: Zhongshuai Pei <799203320@qq.com>
Author: DoingDone9 <799203320@qq.com>

Closes apache#5972 from DoingDone9/InToFalse and squashes the following commits:

4c722a2 [Zhongshuai Pei] Update predicates.scala
abe2bbb [Zhongshuai Pei] Update Optimizer.scala
fa461a5 [Zhongshuai Pei] Update Optimizer.scala
e34c28a [Zhongshuai Pei] Update predicates.scala
24739bd [Zhongshuai Pei] Update ConstantFoldingSuite.scala
f4dbf50 [Zhongshuai Pei] Update ConstantFoldingSuite.scala
35ceb7a [Zhongshuai Pei] Update Optimizer.scala
36c194e [Zhongshuai Pei] Update Optimizer.scala
2e8f6ca [Zhongshuai Pei] Update Optimizer.scala
14952e2 [Zhongshuai Pei] Merge pull request alteryx#13 from apache/master
f03fe7f [Zhongshuai Pei] Merge pull request alteryx#12 from apache/master
f12fa50 [Zhongshuai Pei] Merge pull request alteryx#10 from apache/master
f61210c [Zhongshuai Pei] Merge pull request alteryx#9 from apache/master
34b1a9a [Zhongshuai Pei] Merge pull request alteryx#8 from apache/master
802261c [DoingDone9] Merge pull request alteryx#7 from apache/master
d00303b [DoingDone9] Merge pull request alteryx#6 from apache/master
98b134f [DoingDone9] Merge pull request alteryx#5 from apache/master
161cae3 [DoingDone9] Merge pull request alteryx#4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
markhamstra pushed a commit to markhamstra/spark that referenced this pull request May 12, 2015
…" into true or false directly

SQL
```
select key from src where 3 in (4, 5);
```
Before
```
== Optimized Logical Plan ==
Project [key#12]
 Filter 3 INSET (5,4)
  MetastoreRelation default, src, None
```

After
```
== Optimized Logical Plan ==
LocalRelation [key#228], []
```

Author: Zhongshuai Pei <799203320@qq.com>
Author: DoingDone9 <799203320@qq.com>

Closes apache#5972 from DoingDone9/InToFalse and squashes the following commits:

4c722a2 [Zhongshuai Pei] Update predicates.scala
abe2bbb [Zhongshuai Pei] Update Optimizer.scala
fa461a5 [Zhongshuai Pei] Update Optimizer.scala
e34c28a [Zhongshuai Pei] Update predicates.scala
24739bd [Zhongshuai Pei] Update ConstantFoldingSuite.scala
f4dbf50 [Zhongshuai Pei] Update ConstantFoldingSuite.scala
35ceb7a [Zhongshuai Pei] Update Optimizer.scala
36c194e [Zhongshuai Pei] Update Optimizer.scala
2e8f6ca [Zhongshuai Pei] Update Optimizer.scala
14952e2 [Zhongshuai Pei] Merge pull request alteryx#13 from apache/master
f03fe7f [Zhongshuai Pei] Merge pull request alteryx#12 from apache/master
f12fa50 [Zhongshuai Pei] Merge pull request alteryx#10 from apache/master
f61210c [Zhongshuai Pei] Merge pull request alteryx#9 from apache/master
34b1a9a [Zhongshuai Pei] Merge pull request alteryx#8 from apache/master
802261c [DoingDone9] Merge pull request alteryx#7 from apache/master
d00303b [DoingDone9] Merge pull request alteryx#6 from apache/master
98b134f [DoingDone9] Merge pull request alteryx#5 from apache/master
161cae3 [DoingDone9] Merge pull request alteryx#4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master

(cherry picked from commit 4b5e1fe)
Signed-off-by: Michael Armbrust <michael@databricks.com>
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Jul 19, 2015
…into a single batch.

SQL
```
select * from tableA join tableB on (a > 3 and b = d) or (a > 3 and b = e)
```
Plan before modification
```
== Optimized Logical Plan ==
Project [a#293,b#294,c#295,d#296,e#297]
 Join Inner, Some(((a#293 > 3) && ((b#294 = d#296) || (b#294 = e#297))))
  MetastoreRelation default, tablea, None
  MetastoreRelation default, tableb, None
```
Plan after modification
```
== Optimized Logical Plan ==
Project [a#293,b#294,c#295,d#296,e#297]
 Join Inner, Some(((b#294 = d#296) || (b#294 = e#297)))
  Filter (a#293 > 3)
   MetastoreRelation default, tablea, None
  MetastoreRelation default, tableb, None
```

CombineLimits rewrites to Limit(If(LessThan(ne, le), ne, le), grandChild), and LessThan is handled by BooleanSimplification, so CombineLimits must come before BooleanSimplification, and BooleanSimplification must come before PushPredicateThroughJoin.

Author: Zhongshuai Pei <799203320@qq.com>
Author: DoingDone9 <799203320@qq.com>

Closes apache#6351 from DoingDone9/master and squashes the following commits:

20de7be [Zhongshuai Pei] Update Optimizer.scala
7bc7d28 [Zhongshuai Pei] Merge pull request alteryx#17 from apache/master
0ba5f42 [Zhongshuai Pei] Update Optimizer.scala
f8b9314 [Zhongshuai Pei] Update FilterPushdownSuite.scala
c529d9f [Zhongshuai Pei] Update FilterPushdownSuite.scala
ae3af6d [Zhongshuai Pei] Update FilterPushdownSuite.scala
a04ffae [Zhongshuai Pei] Update Optimizer.scala
11beb61 [Zhongshuai Pei] Update FilterPushdownSuite.scala
f2ee5fe [Zhongshuai Pei] Update Optimizer.scala
be6b1d5 [Zhongshuai Pei] Update Optimizer.scala
b01e622 [Zhongshuai Pei] Merge pull request alteryx#15 from apache/master
8df716a [Zhongshuai Pei] Update FilterPushdownSuite.scala
d98bc35 [Zhongshuai Pei] Update FilterPushdownSuite.scala
fa65718 [Zhongshuai Pei] Update Optimizer.scala
ab8e9a6 [Zhongshuai Pei] Merge pull request alteryx#14 from apache/master
14952e2 [Zhongshuai Pei] Merge pull request alteryx#13 from apache/master
f03fe7f [Zhongshuai Pei] Merge pull request alteryx#12 from apache/master
f12fa50 [Zhongshuai Pei] Merge pull request alteryx#10 from apache/master
f61210c [Zhongshuai Pei] Merge pull request alteryx#9 from apache/master
34b1a9a [Zhongshuai Pei] Merge pull request alteryx#8 from apache/master
802261c [DoingDone9] Merge pull request alteryx#7 from apache/master
d00303b [DoingDone9] Merge pull request alteryx#6 from apache/master
98b134f [DoingDone9] Merge pull request alteryx#5 from apache/master
161cae3 [DoingDone9] Merge pull request alteryx#4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Apr 27, 2016
## What changes were proposed in this pull request?

This PR brings the support for chained Python UDFs, for example

```sql
select udf1(udf2(a))
select udf1(udf2(a) + 3)
select udf1(udf2(a) + udf3(b))
```

Directly chained unary Python UDFs are also put in a single batch of Python UDFs; others may require multiple batches.

For example,
```python
>>> sqlContext.sql("select double(double(1))").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF#10 AS double(double(1))#9]
:     +- INPUT
+- !BatchPythonEvaluation double(double(1)), [pythonUDF#10]
   +- Scan OneRowRelation[]
>>> sqlContext.sql("select double(double(1) + double(2))").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16]
:     +- INPUT
+- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19]
   +- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18]
      +- !BatchPythonEvaluation double(1), [pythonUDF#17]
         +- Scan OneRowRelation[]
```
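The plans above presuppose a Python UDF named `double` that is not defined in this PR text; a hedged sketch of one plausible registration (the `registerFunction` call shape is from the 1.x-era `SQLContext` API, assumed applicable here):

```python
from pyspark.sql import SQLContext
from pyspark.sql.types import IntegerType

sqlContext = SQLContext(sc)  # assumes an existing SparkContext named `sc`
# Hypothetical definition of the `double` UDF used in the explain() output.
sqlContext.registerFunction("double", lambda x: x * 2, IntegerType())
sqlContext.sql("select double(double(1))").explain()
```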

TODO: will support multiple unrelated Python UDFs in one batch (another PR).

## How was this patch tested?

Added new unit tests for chained UDFs.

Author: Davies Liu <davies@databricks.com>

Closes apache#12014 from davies/py_udfs.
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Jun 14, 2016
## What changes were proposed in this pull request?

This PR aims to optimize GroupExpressions by removing repeated expressions. `RemoveRepetitionFromGroupExpressions` is added.

**Before**
```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, (1 + a#0)#7, (A#0 + 1)#8, (1 + A#0)#9, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6,(1 + a#0) AS (1 + a#0)#7,(A#0 + 1) AS (A#0 + 1)#8,(1 + A#0) AS (1 + A#0)#9], functions=[], output=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]
```

**After**
```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6], functions=[], output=[(a#0 + 1)#6])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]
```

## How was this patch tested?

Pass the Jenkins tests (with a new testcase)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#12590 from dongjoon-hyun/SPARK-14830.