[SPARK-1994][SQL] Weird data corruption bug when running Spark SQL on data in HDFS #1004
Conversation
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15522/
test this please
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
@rxin I added https://issues.apache.org/jira/browse/SPARK-2068 to track other places where we need to fix this, but we should probably just merge this one right away.
How much does the closure size increase by?
Is there an easy way to measure that? Either way it was wrong before and I don't think making it possible to plan
I'm going to merge this. You can test this easily by looking at the log. Spark tells you the size of each task closure and how long it takes to serialize it in the info log.
Merged in master & branch-1.0. |
One reason we had to add @transient lazy val is the lack of an init method that runs on each partition for operators. I think there are benefits to adding one: it makes object initialization clear and explicit, and then you can probably avoid this problem.
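The @transient lazy val pitfall described above can be reproduced with plain Java serialization: the lazy val's cached value is dropped when the object is shipped, so each deserialized copy re-runs the initializer independently. A minimal sketch (the `Operator` class here is hypothetical, not actual Spark code):

```scala
import java.io._
import java.util.UUID

// Hypothetical operator: the @transient lazy val's cached value is dropped
// during serialization, so every deserialized copy re-runs the initializer.
class Operator extends Serializable {
  @transient lazy val exprId: String = UUID.randomUUID().toString
}

object TransientLazyDemo {
  // Serialize and deserialize an object with plain Java serialization,
  // standing in for shipping a task closure to a worker.
  def roundTrip[T <: Serializable](obj: T): T = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    in.readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    val op = new Operator
    val worker1 = roundTrip(op)
    val worker2 = roundTrip(op)
    // Each copy generates its own id, so an assumption that the id is
    // globally unique breaks once the operator runs on multiple workers.
    println(worker1.exprId == worker2.exprId) // false (with overwhelming probability)
  }
}
```

An explicit per-partition init method, as suggested above, would make this re-initialization step visible instead of hiding it inside lazy val semantics.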
… data in HDFS

Basically there is a race condition (possibly a Scala bug?) when these values are recomputed on all of the slaves that results in an incorrect projection being generated (possibly because the GUID uniqueness contract is broken?).

In general we should probably enforce that all expression planning occurs on the driver, as is now occurring here.

Author: Michael Armbrust <michael@databricks.com>

Closes #1004 from marmbrus/fixAggBug and squashes the following commits:

e0c116c [Michael Armbrust] Compute aggregate expression during planning instead of lazily on workers.

(cherry picked from commit a6c72ab)
Signed-off-by: Reynold Xin <rxin@apache.org>
Basically there is a race condition (possibly a Scala bug?) when these values are recomputed on all of the slaves that results in an incorrect projection being generated (possibly because the GUID uniqueness contract is broken?).
In general we should probably enforce that all expression planning occurs on the driver, as is now occurring here.
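The shape of this fix can be sketched in isolation: replace a lazy val that every worker re-derives with a plain val that is materialized once on the driver during planning, so workers receive the finished value inside the task closure. The names below (`AggPlanBefore`, `AggPlanAfter`) are illustrative, not the actual Spark classes:

```scala
import java.io._
import scala.util.Random

// Illustrative "before": a @transient lazy val that every worker re-derives
// after deserialization -- any nondeterminism in the derivation diverges.
case class AggPlanBefore(rawExprs: Seq[String]) extends Serializable {
  @transient lazy val boundExprs: Seq[String] =
    rawExprs.map(e => s"$e#${Random.nextInt(1000000)}") // fresh ids per evaluation
}

// Illustrative "after": a plain val computed once, on the driver, during
// planning; the materialized result is serialized into the task closure.
case class AggPlanAfter(rawExprs: Seq[String]) extends Serializable {
  val boundExprs: Seq[String] =
    rawExprs.map(e => s"$e#${Random.nextInt(1000000)}")
}

object PlanDemo {
  // Plain Java serialization round trip, standing in for shipping a task.
  def roundTrip[T <: Serializable](obj: T): T = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    in.readObject().asInstanceOf[T]
  }
}
```

With the "after" shape, every round-tripped copy agrees with the driver's `boundExprs`; with the "before" shape, each copy derives its own, which is exactly the inconsistent-projection symptom described above.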