-
Notifications
You must be signed in to change notification settings - Fork 29.1k
[SPARK-14390][GraphX] Make initialization step in Pregel optional. #12159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Member
|
Jenkins, test this please. |
|
Test build #54908 has finished for PR 12159 at commit
|
This change modifies the "assembly/" module to just copy needed dependencies to its build directory, and modifies the packaging script to pick those up (and remove duplicate jars packages in the examples module). I also made some minor adjustments to dependencies to remove some test jars from the final packaging, and remove jars that conflict with each other when packaged separately (e.g. servlet api). Also note that this change restores guava in applications' classpaths, even though it's still shaded inside Spark. This is now needed for the Hadoop libraries that are packaged with Spark, which now are not processed by the shade plugin. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes apache#11796 from vanzin/SPARK-13579.
## What changes were proposed in this pull request? Remove sbt-idea plugin as importing sbt project provides much better support. Author: Luciano Resende <lresende@apache.org> Closes apache#12151 from lresende/SPARK-14366.
Use PartitionerAwareUnionRDD when possbile for optimizing shuffling and preserving the partitioner. Author: Guillaume Poulin <poulin.guillaume@gmail.com> Closes apache#10382 from gpoulin/dstream_union_optimisation.
With the addition of StreamExecution (ContinuousQuery) to Datasets, data will become unbounded. With unbounded data, the execution of some methods and operations will not make sense, e.g. `Dataset.count()`. A simple API is required to check whether the data in a Dataset is bounded or unbounded. This will allow users to check whether their Dataset is in streaming mode or not. ML algorithms may check if the data is unbounded and throw an exception for example. The implementation of this method is simple, however naming it is the challenge. Some possible names for this method are: - isStreaming - isContinuous - isBounded - isUnbounded I've gone with `isStreaming` for now. We can change it before Spark 2.0 if we decide to come up with a different name. For that reason I've marked it as `Experimental` Author: Burak Yavuz <brkyvz@gmail.com> Closes apache#12080 from brkyvz/is-streaming.
…oncrete types ## What changes were proposed in this pull request? In spark.ml, GBT and RandomForest expose the trait DecisionTreeModel in the trees method, but they should not since it is a private trait (and not ready to be made public). It will also be more useful to users if we return the concrete types. This PR: return concrete types The MIMA checks appear to be OK with this change. ## How was this patch tested? Existing unit tests Author: Joseph K. Bradley <joseph@databricks.com> Closes apache#12158 from jkbradley/hide-dtm.
…case unit. ## What changes were proposed in this pull request? This fix tries to address the issue in PySpark where `spark.python.worker.memory` could only be configured with a lower case unit (`k`, `m`, `g`, `t`). This fix allows the upper case unit (`K`, `M`, `G`, `T`) to be used as well. This is to conform to the JVM memory string as is specified in the documentation . ## How was this patch tested? This fix adds additional test to cover the changes. Author: Yong Tang <yong.tang.github@outlook.com> Closes apache#12163 from yongtang/SPARK-14368.
## What changes were proposed in this pull request? This adds the corresponding Java static functions for built-in typed aggregates already exposed in Scala. ## How was this patch tested? Unit tests. rxin Author: Eric Liang <ekl@databricks.com> Closes apache#12168 from ericl/sc-2794.
…mand ## What changes were proposed in this pull request? This PR adds Native execution of SHOW TBLPROPERTIES command. Command Syntax: ``` SQL SHOW TBLPROPERTIES table_name[(property_key_literal)] ``` ## How was this patch tested? Tests added in HiveComandSuiie and DDLCommandSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes apache#12133 from dilipbiswal/dkb_show_tblproperties.
…/DDL in SQL Context. #### What changes were proposed in this pull request? Currently, the weird error messages are issued if we use Hive Context-only operations in SQL Context. For example, - When calling `Drop Table` in SQL Context, we got the following message: ``` Expected exception org.apache.spark.sql.catalyst.parser.ParseException to be thrown, but java.lang.ClassCastException was thrown. ``` - When calling `Script Transform` in SQL Context, we got the message: ``` assertion failed: No plan for ScriptTransformation [key#9,value#10], cat, [tKey#155,tValue#156], null +- LogicalRDD [key#9,value#10], MapPartitionsRDD[3] at beforeAll at BeforeAndAfterAll.scala:187 ``` Updates: Based on the investigation from hvanhovell , the root cause is `visitChildren`, which is the default implementation. It always returns the result of the last defined context child. After merging the code changes from hvanhovell , it works! Thank you hvanhovell ! #### How was this patch tested? A few test cases are added. Not sure if the same issue exist for the other operators/DDL/DML. hvanhovell Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Herman van Hovell <hvanhovell@questtec.nl> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes apache#12134 from gatorsmile/hiveParserCommand.
Member
|
Jenkins, test this please. |
|
Test build #55401 has finished for PR 12159 at commit
|
…ndrade/spark into pregel-optional-initmessage # Conflicts: # graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala # graphx/src/main/scala/org/apache/spark/graphx/lib/ConnectedComponents.scala
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Suppose a sendMsg function depends on the state of a edge's vertices to send messages, and those Pregel messages update such state. In this scenario, initialMsg will initially enforce the same message on all vertices, effectively removing a custom initialization one may have per vertex.
To deal with this situation, we must define a dummy initMsg (i.e. None), and all Pregel functions must be modified to handle this type of message. A simpler and less cumbersome solution is to make initMsg and the initialization step in Pregel optional.
How was this patch tested?
No new tests were added. Previous functionality was kept.