
SPARK-1100: prevent Spark from overwriting directory silently #11

Closed
wants to merge 4 commits

Conversation

CodingCat
Contributor

Thanks to Diana Carroll for reporting this issue (https://spark-project.atlassian.net/browse/SPARK-1100).

The current saveAsTextFile/saveAsSequenceFile will silently overwrite the output directory if it already exists. This behaviour is undesirable because:

silently overwriting existing data is not user-friendly

if the partition count differs between two write operations, the output directory ends up containing results from both runs

My fix includes:

add new API variants with a flag that lets the user choose whether to overwrite the directory (see the usage sketch below):
if the flag is set to true, the output directory is deleted first and the new data is then written, so the directory never contains results from multiple runs;

if the flag is set to false, Spark throws an exception if the output directory already exists

update the Java API accordingly

the default behaviour remains overwriting
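
A minimal usage sketch of the proposed flag (the `overwrite` parameter name and the paths are illustrative assumptions, not the exact signature added in this patch; it assumes an existing SparkContext `sc`):

```
// Hypothetical usage of the new flag-based API; names are assumptions.
val pairs = sc.parallelize(1 to 100).map(i => (i, i * i))

// flag = true: delete the existing output directory first, then write,
// so the directory never mixes results from different runs
pairs.saveAsTextFile("hdfs:///tmp/output", overwrite = true)

// flag = false: throw an exception because hdfs:///tmp/output already exists
pairs.saveAsTextFile("hdfs:///tmp/output", overwrite = false)
```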

Two questions:

should we deprecate the old APIs that lack such a flag?

I noticed that Spark Streaming also calls these APIs; I assume the related streaming code doesn't need to change? @tdas

@AmplabJenkins

Can one of the admins verify this patch?

1 similar comment
@AmplabJenkins

Can one of the admins verify this patch?

marmbrus referenced this pull request in marmbrus/spark Feb 27, 2014
A bunch of work mostly to ensure catalyst does not return wrong answers.
@pwendell
Contributor

Jenkins, test this please.

@@ -77,7 +74,6 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)])
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializerClass: String = null): RDD[(K, C)] = {
require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
Contributor

why did you remove this? We want to require this here.

@AmplabJenkins

Merged build triggered.

1 similar comment
@AmplabJenkins

Merged build triggered.

@@ -618,10 +619,6 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)])
attemptNumber)
val hadoopContext = newTaskAttemptContext(wrappedConf.value, attemptId)
val format = outputFormatClass.newInstance
format match {
Contributor

ah I think somehow you may have created a diff or something based on some older code... this also got updated recently and I think we want to keep it.

Contributor Author

yeah, I think so...I just synced the code

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

One or more automated tests failed
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12910/

@CodingCat
Contributor Author

I changed the code and tested it locally; mind reviewing it again? @pwendell

@@ -712,6 +713,10 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)])
logDebug("Saving as hadoop file of type (" + keyClass.getSimpleName + ", " +
valueClass.getSimpleName + ")")

val path = new Path(conf.get("mapred.output.dir"))
Contributor

I'm not totally sure mapred.output.dir is always going to be set here. It might be safer to only do this when the format is a FileOutputFormat (e.g. check whether the output format is an instance of FileOutputFormat). Then get the path by casting it to a FileOutputFormat and calling getOutputPath. That seems a bit safer to me.
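
A rough sketch of that suggestion (the helper name is illustrative; this assumes the classic `org.apache.hadoop.mapred` API that saveAsHadoopFile uses):

```
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf}

// Only look up the output path when the format is file-based;
// FileOutputFormat.getOutputPath returns null if "mapred.output.dir" is unset.
def outputPathIfFileBased(conf: JobConf, outputFormatClass: Class[_]): Option[Path] = {
  if (classOf[FileOutputFormat[_, _]].isAssignableFrom(outputFormatClass)) {
    Option(FileOutputFormat.getOutputPath(conf))
  } else {
    None
  }
}
```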

Contributor Author

Yes, you are right!

@CodingCat
Contributor Author

@pwendell, I just updated the code.

In the latest update, I made the check applicable only to FileOutputFormat. The difference from your suggestion is that I also check whether the user has set mapred.output.dir, because the internal implementation of getOutputPath is based on that property:

/**
 * Get the {@link Path} to the output directory for the map-reduce job.
 *
 * @return the {@link Path} to the output directory for the map-reduce job.
 * @see FileOutputFormat#getWorkOutputPath(JobConf)
 */
public static Path getOutputPath(JobConf conf) {
  String name = conf.get("mapred.output.dir");
  return name == null ? null : new Path(name);
}

@@ -712,6 +713,16 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)])
logDebug("Saving as hadoop file of type (" + keyClass.getSimpleName + ", " +
valueClass.getSimpleName + ")")

if (outputFormatClass.isInstanceOf[FileOutputFormat[_, _]]) {
val outputPath = conf.get("mapred.output.dir")
Contributor

So I looked into this more and it turns out the fileSystem parameter is totally ignored by FileOutputFormat. So I think we can just do this:

 if (outputFormatClass.isInstanceOf[FileOutputFormat[_, _]]) {
      // FileOutputFormat ignores the filesystem parameter
      val ignoredFs = FileSystem.get(conf)
      conf.getOutputFormat.checkOutputSpecs(ignoredFs, conf)
    }

http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1/org/apache/hadoop/mapred/FileOutputFormat.java#FileOutputFormat.checkOutputSpecs%28org.apache.hadoop.fs.FileSystem%2Corg.apache.hadoop.mapred.JobConf%29
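
For reference, a hedged sketch of what that call does at runtime: with the classic mapred API, checkOutputSpecs raises an exception when the output directory already exists (or when none is configured), which is exactly the non-silent failure this PR is after.

```
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.mapred.JobConf

// Sketch only: FileOutputFormat.checkOutputSpecs throws
// org.apache.hadoop.mapred.FileAlreadyExistsException if the output directory
// exists, and InvalidJobConfException if no output directory is configured.
def assertOutputSpecs(conf: JobConf): Unit = {
  val ignoredFs = FileSystem.get(conf) // ignored by FileOutputFormat
  conf.getOutputFormat.checkOutputSpecs(ignoredFs, conf)
}
```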

@CodingCat
Contributor Author

@pwendell Thank you again! Just updated the code

@@ -642,7 +643,7 @@ class PairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)])
val jobTaskContext = newTaskAttemptContext(wrappedConf.value, jobAttemptId)
val jobCommitter = jobFormat.getOutputCommitter(jobTaskContext)
jobCommitter.setupJob(jobTaskContext)
val count = self.context.runJob(self, writeShard _).sum
self.context.runJob(self, writeShard _).sum
Contributor

sorry just noticed this - is there any need for .sum here now?

@pwendell
Contributor

pwendell commented Mar 1, 2014

Hey @CodingCat found a tiny issue but otherwise LGTM - if you patch it I can merge.

@CodingCat
Contributor Author

@pwendell done

@pwendell
Contributor

pwendell commented Mar 2, 2014

Jenkins, test this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12945/

@pwendell
Contributor

pwendell commented Mar 2, 2014

Thanks, merged into master.

gengliangwang referenced this pull request in gengliangwang/spark Nov 15, 2018
…/`to_avro`

Previously in from_avro/to_avro, we override the method `simpleString` and `sql` for the string output. However, the override only affects the alias naming:
```
Project [from_avro('col,
...
, (mode,PERMISSIVE)) AS from_avro(col, struct<col1:bigint,col2:double>, Map(mode -> PERMISSIVE))#11]
```
It only makes the alias name quite long: `from_avro(col, struct<col1:bigint,col2:double>, Map(mode -> PERMISSIVE))`.

We should follow `from_csv`/`from_json` here and override only the method `prettyName`, which gives a clean alias name:

```
... AS from_avro(col)#11
```

Manual check

Closes apache#22890 from gengliangwang/revise_from_to_avro.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
asfgit pushed a commit that referenced this pull request Nov 16, 2018
…from_avro`/`to_avro`

Back port #22890 to branch-2.4.
It is a bug fix for this issue:
https://issues.apache.org/jira/browse/SPARK-26063

## What changes were proposed in this pull request?

Previously in from_avro/to_avro, we override the method `simpleString` and `sql` for the string output. However, the override only affects the alias naming:
```
Project [from_avro('col,
...
, (mode,PERMISSIVE)) AS from_avro(col, struct<col1:bigint,col2:double>, Map(mode -> PERMISSIVE))#11]
```
It only makes the alias name quite long: `from_avro(col, struct<col1:bigint,col2:double>, Map(mode -> PERMISSIVE))`.

We should follow `from_csv`/`from_json` here and override only the method `prettyName`, which gives a clean alias name:

```
... AS from_avro(col)#11
```

## How was this patch tested?

Manual check

Closes #23047 from gengliangwang/backport_avro_pretty_name.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
hn5092 added a commit to hn5092/spark that referenced this pull request Dec 27, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…/`to_avro`

## What changes were proposed in this pull request?

Previously in from_avro/to_avro, we override the method `simpleString` and `sql` for the string output. However, the override only affects the alias naming:
```
Project [from_avro('col,
...
, (mode,PERMISSIVE)) AS from_avro(col, struct<col1:bigint,col2:double>, Map(mode -> PERMISSIVE))apache#11]
```
It only makes the alias name quite long: `from_avro(col, struct<col1:bigint,col2:double>, Map(mode -> PERMISSIVE))`.

We should follow `from_csv`/`from_json` here and override only the method `prettyName`, which gives a clean alias name:

```
... AS from_avro(col)apache#11
```

## How was this patch tested?

Manual check

Closes apache#22890 from gengliangwang/revise_from_to_avro.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
rdblue pushed a commit to rdblue/spark that referenced this pull request Apr 3, 2019
…pressions

This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below:

```
    val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil)
    val sc = spark.sparkContext
    val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12)))
    val df = spark.createDataFrame(rdd, inputSchema)

    // Works correctly since no nested decimal expression is involved
    // Expected result type: (26, 6) * (26, 6) = (38, 12)
    df.select($"col" * $"col").explain(true)
    df.select($"col" * $"col").printSchema()

    // Gives a wrong result since there is a nested decimal expression that should be visited first
    // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18)
    df.select($"col" * $"col" * $"col").explain(true)
    df.select($"col" * $"col" * $"col").printSchema()
```

The example above gives the following output:

```
// Correct result without sub-expressions
== Parsed Logical Plan ==
'Project [('col * 'col) AS (col * col)#4]
+- LogicalRDD [col#1]

== Analyzed Logical Plan ==
(col * col): decimal(38,12)
Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4]
+- LogicalRDD [col#1]

== Optimized Logical Plan ==
Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
+- LogicalRDD [col#1]

== Physical Plan ==
*Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
+- Scan ExistingRDD[col#1]

// Schema
root
 |-- (col * col): decimal(38,12) (nullable = true)

// Incorrect result with sub-expressions
== Parsed Logical Plan ==
'Project [(('col * 'col) * 'col) AS ((col * col) * col)apache#11]
+- LogicalRDD [col#1]

== Analyzed Logical Plan ==
((col * col) * col): decimal(38,12)
Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)apache#11]
+- LogicalRDD [col#1]

== Optimized Logical Plan ==
Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)apache#11]
+- LogicalRDD [col#1]

== Physical Plan ==
*Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)apache#11]
+- Scan ExistingRDD[col#1]

// Schema
root
 |-- ((col * col) * col): decimal(38,12) (nullable = true)
```

This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios.

Author: aokolnychyi <anton.okolnychyi@sap.com>

Closes apache#18583 from aokolnychyi/spark-21332.

(cherry picked from commit 0be5fb4)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
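
A quick, hedged arithmetic check of the result types quoted above, using the usual decimal multiplication rule (precision = p1 + p2 + 1, scale = s1 + s2, both capped at 38; the precision-loss adjustment Spark also applies is ignored here for simplicity):

```
// Simplified sketch of decimal multiplication typing; not the actual Spark rule.
def multiplyType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) =
  (math.min(38, p1 + p2 + 1), math.min(38, s1 + s2))

val inner = multiplyType(26, 6, 26, 6)              // (38, 12)
val outer = multiplyType(inner._1, inner._2, 26, 6) // (38, 18): the type expected after this fix
```
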
hn5092 added a commit to hn5092/spark that referenced this pull request Apr 25, 2019
kai-chi referenced this pull request in kai-chi/spark Jul 23, 2019
…from_avro`/`to_avro`

Back port apache#22890 to branch-2.4.
It is a bug fix for this issue:
https://issues.apache.org/jira/browse/SPARK-26063

## What changes were proposed in this pull request?

Previously in from_avro/to_avro, we override the method `simpleString` and `sql` for the string output. However, the override only affects the alias naming:
```
Project [from_avro('col,
...
, (mode,PERMISSIVE)) AS from_avro(col, struct<col1:bigint,col2:double>, Map(mode -> PERMISSIVE))logicalclocks#11]
```
It only makes the alias name quite long: `from_avro(col, struct<col1:bigint,col2:double>, Map(mode -> PERMISSIVE))`.

We should follow `from_csv`/`from_json` here and override only the method `prettyName`, which gives a clean alias name:

```
... AS from_avro(col)logicalclocks#11
```

## How was this patch tested?

Manual check

Closes apache#23047 from gengliangwang/backport_avro_pretty_name.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
kai-chi referenced this pull request in kai-chi/spark Aug 1, 2019
…from_avro`/`to_avro`

Back port apache#22890 to branch-2.4.
It is a bug fix for this issue:
https://issues.apache.org/jira/browse/SPARK-26063

## What changes were proposed in this pull request?

Previously in from_avro/to_avro, we override the method `simpleString` and `sql` for the string output. However, the override only affects the alias naming:
```
Project [from_avro('col,
...
, (mode,PERMISSIVE)) AS from_avro(col, struct<col1:bigint,col2:double>, Map(mode -> PERMISSIVE))logicalclocks#11]
```
It only makes the alias name quite long: `from_avro(col, struct<col1:bigint,col2:double>, Map(mode -> PERMISSIVE))`.

We should follow `from_csv`/`from_json` here and override only the method `prettyName`, which gives a clean alias name:

```
... AS from_avro(col)logicalclocks#11
```

## How was this patch tested?

Manual check

Closes apache#23047 from gengliangwang/backport_avro_pretty_name.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
SirOibaf pushed a commit to SirOibaf/spark that referenced this pull request Sep 18, 2019
[HOPSWORKS-1081] Upgrade Spark to 2.4.3
cloud-fan pushed a commit that referenced this pull request Dec 27, 2019
### Why are the changes needed?
`EnsureRequirements` adds a `ShuffleExchangeExec` (RangePartitioning) after the Sort if a `RoundRobinPartitioning` is behind it. This causes two shuffles, and the number of partitions in the final stage is not the number specified by `RoundRobinPartitioning`.

**Example SQL**
```
SELECT /*+ REPARTITION(5) */ * FROM test ORDER BY a
```

**BEFORE**
```
== Physical Plan ==
*(1) Sort [a#0 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 200), true, [id=#11]
   +- Exchange RoundRobinPartitioning(5), false, [id=#9]
      +- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1]
```

**AFTER**
```
== Physical Plan ==
*(1) Sort [a#0 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 5), true, [id=#11]
   +- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1]
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Run the existing test suites and add a new test for this.

Closes #26946 from stczwd/RoundRobinPartitioning.

Lead-authored-by: lijunqing <lijunqing@baidu.com>
Co-authored-by: stczwd <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
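
A hedged way to reproduce the plans above from Scala (the table setup here is an assumption; the commit message uses a Hive table named `test`):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-hint-demo").getOrCreate()

// Illustrative table; any two-column table named `test` will do.
spark.range(10).selectExpr("id AS a", "id AS b").write.saveAsTable("test")

// Before the fix: RoundRobinPartitioning(5) followed by a 200-partition range
// exchange; after the fix: a single range exchange with 5 partitions.
spark.sql("SELECT /*+ REPARTITION(5) */ * FROM test ORDER BY a").explain()
```
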
fqaiser94 pushed a commit to fqaiser94/spark that referenced this pull request Mar 30, 2020
### Why are the changes needed?
`EnsureRequirements` adds a `ShuffleExchangeExec` (RangePartitioning) after the Sort if a `RoundRobinPartitioning` is behind it. This causes two shuffles, and the number of partitions in the final stage is not the number specified by `RoundRobinPartitioning`.

**Example SQL**
```
SELECT /*+ REPARTITION(5) */ * FROM test ORDER BY a
```

**BEFORE**
```
== Physical Plan ==
*(1) Sort [a#0 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 200), true, [id=apache#11]
   +- Exchange RoundRobinPartitioning(5), false, [id=apache#9]
      +- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1]
```

**AFTER**
```
== Physical Plan ==
*(1) Sort [a#0 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 5), true, [id=apache#11]
   +- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1]
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Run the existing test suites and add a new test for this.

Closes apache#26946 from stczwd/RoundRobinPartitioning.

Lead-authored-by: lijunqing <lijunqing@baidu.com>
Co-authored-by: stczwd <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Wayne1c pushed a commit to Wayne1c/spark that referenced this pull request Apr 15, 2020
Wayne1c added a commit to Wayne1c/spark that referenced this pull request Apr 15, 2020
ringtail added a commit to ringtail/spark that referenced this pull request May 9, 2020
microbearz pushed a commit to microbearz/spark that referenced this pull request Jan 13, 2021
redsanket pushed a commit to redsanket/spark that referenced this pull request Feb 16, 2021
[YSPARK-1261] : Updating spark-starter to run with shc-core jar built…
woyumen4597 pushed a commit to woyumen4597/spark that referenced this pull request Mar 19, 2021
RolatZhang pushed a commit to RolatZhang/spark that referenced this pull request Mar 18, 2022
ahshahid added a commit to perf-apt/wildfire that referenced this pull request Jan 12, 2024
* [SPARK-33641][SQL][DOC][FOLLOW-UP] Add migration guide for CHAR VARCHAR types

### What changes were proposed in this pull request?

Add migration guide for CHAR VARCHAR types

### Why are the changes needed?

for migration

### Does this PR introduce _any_ user-facing change?

doc change

### How was this patch tested?

passing ci

Closes apache#30654 from yaooqinn/SPARK-33641-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-33669] Wrong error message from YARN application state monitor when sc.stop in yarn client mode

### What changes were proposed in this pull request?
This change makes InterruptedIOException be treated like InterruptedException when closing YarnClientSchedulerBackend, so no "YARN application has exited unexpectedly xxx" error is logged.

### Why are the changes needed?
For YARN client mode, when stopping YarnClientSchedulerBackend, Spark first tries to interrupt the YARN application monitor thread. In MonitorThread.run() it catches InterruptedException in order to respond gracefully to the stop request.

But the client.monitorApplication method can also throw InterruptedIOException while a Hadoop RPC call is in flight. In that case, MonitorThread does not know it was interrupted: a YARN application failure is returned and "Failed to contact YARN for application xxxxx; YARN application has exited unexpectedly with state xxxxx" is logged at error level, which confuses users.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Very simple patch; it seems no additional test is needed?

Closes apache#30617 from sqlwindspeaker/yarn-client-interrupt-monitor.

Authored-by: suqilong <suqilong@qiyi.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
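
A hedged sketch of the pattern described above; the method and the logged message come from the commit text, while the surrounding code is illustrative:

```
import java.io.InterruptedIOException

// Illustrative monitor body: a deliberate interrupt during sc.stop() should not
// be reported as an unexpected YARN exit, whether it surfaces as
// InterruptedException or as InterruptedIOException from an in-flight RPC.
def runMonitor(monitorApplication: () => Unit, logError: String => Unit): Unit = {
  try {
    monitorApplication() // blocking Hadoop RPC call
  } catch {
    case _: InterruptedException | _: InterruptedIOException =>
      () // interrupted while stopping; exit the monitor quietly
    case e: Exception =>
      logError(s"YARN application has exited unexpectedly: ${e.getMessage}")
  }
}
```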

* [SPARK-33655][SQL] Improve performance of processing FETCH_PRIOR

### What changes were proposed in this pull request?
Currently, when a client requests FETCH_PRIOR from the Thrift server, the server re-iterates from the start position. Because the Thrift server caches a query result in an array when the THRIFTSERVER_INCREMENTAL_COLLECT feature is off, FETCH_PRIOR can be implemented without re-iterating the result. A trait FetchIterator is added in order to separate the iterator and array implementations. FetchIterator also supports moving the cursor to an absolute position, which will be useful for implementing FETCH_RELATIVE and FETCH_ABSOLUTE.

### Why are the changes needed?
For better performance of Thriftserver.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
FetchIteratorSuite

Closes apache#30600 from Dooyoung-Hwang/refactor_with_fetch_iterator.

Authored-by: Dooyoung Hwang <dooyoung.hwang@sk.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
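
A hedged sketch of the idea (the names and shape are illustrative, not the actual trait added in the patch):

```
// Toy cursor over a cached result array: FETCH_PRIOR and absolute positioning
// become index arithmetic instead of re-iterating from the start of the result.
class ArrayFetchIterator[A](rows: Array[A]) {
  private var pos = 0

  def fetchAbsolute(target: Int): Unit = {
    pos = math.max(0, math.min(target, rows.length))
  }

  def fetchNext(n: Int): Seq[A] = {
    val out = rows.slice(pos, pos + n).toSeq
    pos += out.length
    out
  }

  def fetchPrior(n: Int): Seq[A] = {
    val start = math.max(0, pos - n)
    val out = rows.slice(start, pos).toSeq
    pos = start
    out
  }
}
```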

* [SPARK-33719][DOC] Add make_date/make_timestamp/make_interval into the doc of ANSI Compliance

### What changes were proposed in this pull request?

Add make_date/make_timestamp/make_interval into the doc of ANSI Compliance

### Why are the changes needed?

Users will know that these functions throw runtime exceptions under ANSI mode if the result is not valid.
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build doc and check it in browser:
![image](https://user-images.githubusercontent.com/1097932/101608930-34a79e80-39bb-11eb-9294-9d9b8c3f6faa.png)

Closes apache#30683 from gengliangwang/improveDoc.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

* [SPARK-33071][SPARK-33536][SQL][FOLLOW-UP] Rename deniedMetadataKeys to nonInheritableMetadataKeys in Alias

### What changes were proposed in this pull request?

This PR is a followup of apache#30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.

### Why are the changes needed?

To make it easier to maintain and read.

### Does this PR introduce _any_ user-facing change?

No. This is rather a code cleanup.

### How was this patch tested?

Ran the unittests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them.

Closes apache#30682 from HyukjinKwon/SPARK-33071-SPARK-33536.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

* [SPARK-33722][SQL] Handle DELETE in ReplaceNullWithFalseInPredicate

### What changes were proposed in this pull request?

This PR adds `DeleteFromTable` to supported plans in `ReplaceNullWithFalseInPredicate`.

### Why are the changes needed?

This change allows Spark to optimize delete conditions like we optimize filters.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR extends the existing test cases to also cover `DeleteFromTable`.

Closes apache#30688 from aokolnychyi/spark-33722.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

Co-authored-by: Kent Yao <yaooqinn@hotmail.com>
Co-authored-by: suqilong <suqilong@qiyi.com>
Co-authored-by: Dooyoung Hwang <dooyoung.hwang@sk.com>
Co-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Feb 23, 2024
…n properly

### What changes were proposed in this pull request?
Make `ResolveRelations` handle plan id properly

### Why are the changes needed?
bug fix for Spark Connect, it won't affect classic Spark SQL

before this PR:
```
from pyspark.sql import functions as sf

spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2")

df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")

join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index==df3.id)

join2.schema
```

fails with
```
AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
```

That is because the existing plan caching in `ResolveRelations` doesn't work with Spark Connect:

```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 '[#12]Join LeftOuter, '`==`('index, 'id)                     '[#12]Join LeftOuter, '`==`('index, 'id)
!:- '[#9]UnresolvedRelation [test_table_1], [], false         :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!+- '[#11]Project ['index, 'value_2]                          :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
!   +- '[#10]Join Inner, '`==`('id, 'index)                   +- '[#11]Project ['index, 'value_2]
!      :- '[#7]UnresolvedRelation [test_table_1], [], false      +- '[#10]Join Inner, '`==`('id, 'index)
!      +- '[#8]UnresolvedRelation [test_table_2], [], false         :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!                                                                   :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
!                                                                   +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
!                                                                      +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false

Can not resolve 'id with plan 7
```

`[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached one
```
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
   +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
```

### Does this PR introduce _any_ user-facing change?
yes, bug fix

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
ci

Closes #45214 from zhengruifeng/connect_fix_read_join.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
ericm-db referenced this pull request in ericm-db/spark Mar 5, 2024
…n properly

### What changes were proposed in this pull request?
Make `ResolveRelations` handle plan id properly

### Why are the changes needed?
bug fix for Spark Connect, it won't affect classic Spark SQL

before this PR:
```
from pyspark.sql import functions as sf

spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2")

df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")

join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index==df3.id)

join2.schema
```

fails with
```
AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
```

That is because the existing plan caching in `ResolveRelations` doesn't work with Spark Connect:

```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 '[#12]Join LeftOuter, '`==`('index, 'id)                     '[#12]Join LeftOuter, '`==`('index, 'id)
!:- '[#9]UnresolvedRelation [test_table_1], [], false         :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!+- '[#11]Project ['index, 'value_2]                          :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
!   +- '[#10]Join Inner, '`==`('id, 'index)                   +- '[#11]Project ['index, 'value_2]
!      :- '[#7]UnresolvedRelation [test_table_1], [], false      +- '[#10]Join Inner, '`==`('id, 'index)
!      +- '[#8]UnresolvedRelation [test_table_2], [], false         :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!                                                                   :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
!                                                                   +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
!                                                                      +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false

Can not resolve 'id with plan 7
```

`[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached one
```
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
   +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
```

### Does this PR introduce _any_ user-facing change?
yes, bug fix

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
ci

Closes apache#45214 from zhengruifeng/connect_fix_read_join.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
zhengruifeng added a commit that referenced this pull request Apr 30, 2024
…plan properly

### What changes were proposed in this pull request?
Make `ResolveRelations` handle plan id properly

cherry-pick bugfix #45214 to 3.5

### Why are the changes needed?
bug fix for Spark Connect, it won't affect classic Spark SQL

before this PR:
```
from pyspark.sql import functions as sf

spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2")

df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")

join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index==df3.id)

join2.schema
```

fails with
```
AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
```

That is because the existing plan caching in `ResolveRelations` doesn't work with Spark Connect:

```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 '[#12]Join LeftOuter, '`==`('index, 'id)                     '[#12]Join LeftOuter, '`==`('index, 'id)
!:- '[#9]UnresolvedRelation [test_table_1], [], false         :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!+- '[#11]Project ['index, 'value_2]                          :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
!   +- '[#10]Join Inner, '`==`('id, 'index)                   +- '[#11]Project ['index, 'value_2]
!      :- '[#7]UnresolvedRelation [test_table_1], [], false      +- '[#10]Join Inner, '`==`('id, 'index)
!      +- '[#8]UnresolvedRelation [test_table_2], [], false         :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!                                                                   :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
!                                                                   +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
!                                                                      +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false

Can not resolve 'id with plan 7
```

`[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached one
```
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
   +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
```

### Does this PR introduce _any_ user-facing change?
yes, bug fix

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
ci

Closes #46291 from zhengruifeng/connect_fix_read_join_35.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
zhengruifeng added a commit that referenced this pull request Apr 30, 2024
…plan properly

### What changes were proposed in this pull request?
Make `ResolveRelations` handle plan id properly

cherry-pick bugfix #45214 to 3.4

### Why are the changes needed?
bug fix for Spark Connect, it won't affect classic Spark SQL

before this PR:
```
from pyspark.sql import functions as sf

spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2")

df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")

join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index==df3.id)

join2.schema
```

fails with
```
AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
```

That is because the existing plan caching in `ResolveRelations` doesn't work with Spark Connect:

```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 '[#12]Join LeftOuter, '`==`('index, 'id)                     '[#12]Join LeftOuter, '`==`('index, 'id)
!:- '[#9]UnresolvedRelation [test_table_1], [], false         :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!+- '[#11]Project ['index, 'value_2]                          :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
!   +- '[#10]Join Inner, '`==`('id, 'index)                   +- '[#11]Project ['index, 'value_2]
!      :- '[#7]UnresolvedRelation [test_table_1], [], false      +- '[#10]Join Inner, '`==`('id, 'index)
!      +- '[#8]UnresolvedRelation [test_table_2], [], false         :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!                                                                   :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
!                                                                   +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
!                                                                      +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false

Can not resolve 'id with plan 7
```

`[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached one
```
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
   +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
```

### Does this PR introduce _any_ user-facing change?
yes, bug fix

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
ci

Closes #46290 from zhengruifeng/connect_fix_read_join_34.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>