[SPARK-33813][SQL][3.0] Fix the issue that JDBC source can't treat MS SQL Server's spatial types #31289

Closed
wants to merge 1,280 commits
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Sep 8, 2020

  1. [SPARK-31511][FOLLOW-UP][TEST][SQL] Make BytesToBytesMap iterators thread-safe
    
    ### What changes were proposed in this pull request?
    Before SPARK-31511 was fixed, `BytesToBytesMap` iterator() was not thread-safe and could cause data inaccuracy.
    We need to add a unit test.
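
    A minimal sketch of the kind of concurrent-iteration check such a test performs (illustrative only; the real suite builds the map through a TaskMemoryManager, which is omitted here):
    ```scala
    import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
    import java.util.concurrent.atomic.AtomicLong

    // Hypothetical stand-in for a populated BytesToBytesMap.
    val entries = (0L until 100000L).toVector
    def iterateAll(): Long = entries.iterator.map(_ => 1L).sum

    val threads = 4
    val latch = new CountDownLatch(threads)
    val total = new AtomicLong(0)
    val pool = Executors.newFixedThreadPool(threads)
    (1 to threads).foreach { _ =>
      pool.submit(new Runnable {
        override def run(): Unit = {
          // Each thread gets its own iterator; a thread-safe iterator() must
          // let every thread observe the full set of entries.
          total.addAndGet(iterateAll())
          latch.countDown()
        }
      })
    }
    latch.await(1, TimeUnit.MINUTES)
    pool.shutdown()
    assert(total.get() == threads.toLong * entries.size)
    ```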
    
    ### Why are the changes needed?
    Increase test coverage to ensure that iterator() is thread-safe.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Added a unit test.
    
    Closes apache#29669 from cxzl25/SPARK-31511-test.
    
    Authored-by: sychen <sychen@ctrip.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit bd3dc2f)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cxzl25 authored and cloud-fan committed Sep 8, 2020
    Commit 4656ee5
  2. [SPARK-32753][SQL][3.0] Only copy tags to node with no tags

    This PR backports apache#29593 to branch-3.0
    
    ### What changes were proposed in this pull request?
    Only copy tags to nodes with no tags when transforming plans.
    
    ### Why are the changes needed?
    cloud-fan [made a good point](apache#29593 (comment)) that it doesn't make sense to append tags to existing nodes when nodes are removed. That will cause such bugs as duplicate rows when deduplicating and repartitioning by the same column with AQE.
    
    ```
    spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
    val df = spark.sql("select id from v1 group by id distribute by id")
    println(df.collect().toArray.mkString(","))
    println(df.queryExecution.executedPlan)
    
    // With AQE
    [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
    AdaptiveSparkPlan(isFinalPlan=true)
    +- CustomShuffleReader local
       +- ShuffleQueryStage 0
          +- Exchange hashpartitioning(id#183L, 10), true
             +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
                +- Union
                   :- *(1) Range (0, 10, step=1, splits=2)
                   +- *(2) Range (0, 10, step=1, splits=2)
    
    // Without AQE
    [4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
    *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
    +- Exchange hashpartitioning(id#206L, 10), true
       +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
          +- Union
             :- *(1) Range (0, 10, step=1, splits=2)
             +- *(2) Range (0, 10, step=1, splits=2)
    ```
    
    It's too expensive to detect node removal, so we make a compromise and only copy tags to nodes with no tags.
    
    ### Does this PR introduce any user-facing change?
    Yes. Fix a bug.
    
    ### How was this patch tested?
    Add test.
    
    Closes apache#29665 from manuzhang/spark-32753-3.0.
    
    Authored-by: manuzhang <owenzhang1990@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    manuzhang authored and cloud-fan committed Sep 8, 2020
    Commit 9b39e4b
  3. [SPARK-32815][ML][3.0] Fix LibSVM data source loading error on file paths with glob metacharacters
    
    ### What changes were proposed in this pull request?
    In the PR, I propose to fix an issue with LibSVM datasource when both of the following are true:
    * no user specified schema
    * some file paths contain escaped glob metacharacters, such as `[``]`, `{``}`, `*` etc.
    
    The fix is a backport of apache#29670, and it is based on another bug fix for CSV/JSON datasources apache#29659.
    
    ### Why are the changes needed?
    To fix the issue when queries like the following try to read from paths such as `[abc]`:
    ```scala
    spark.read.format("libsvm").load("""/tmp/\[abc\].csv""").show
    ```
    but would end up hitting an exception:
    ```
    Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
    org.apache.spark.sql.AnalysisException: Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
    	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:770)
    	at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:373)
    	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    	at scala.util.Success.$anonfun$map$1(Try.scala:255)
    	at scala.util.Success.map(Try.scala:213)
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Yes
    
    ### How was this patch tested?
    Added UT to `LibSVMRelationSuite`.
    
    Closes apache#29675 from MaxGekk/globbing-paths-when-inferring-schema-ml-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Sep 8, 2020
    Commit 8c0b9cb
  4. [SPARK-32638][SQL][3.0] Corrects references when adding aliases in WidenSetOperationTypes
    
    ### What changes were proposed in this pull request?
    
    This PR intends to fix a bug where references can be missing when adding aliases to widen data types in `WidenSetOperationTypes`. For example,
    ```
    CREATE OR REPLACE TEMPORARY VIEW t3 AS VALUES (decimal(1)) tbl(v);
    SELECT t.v FROM (
      SELECT v FROM t3
      UNION ALL
      SELECT v + v AS v FROM t3
    ) t;
    
    org.apache.spark.sql.AnalysisException: Resolved attribute(s) v#1 missing from v#3 in operator !Project [v#1]. Attribute(s) with the same name appear in the operation: v. Please check if the right attribute(s) are used.;;
    !Project [v#1]  <------ the reference got missing
    +- SubqueryAlias t
       +- Union
          :- Project [cast(v#1 as decimal(11,0)) AS v#3]
          :  +- Project [v#1]
          :     +- SubqueryAlias t3
          :        +- SubqueryAlias tbl
          :           +- LocalRelation [v#1]
          +- Project [v#2]
             +- Project [CheckOverflow((promote_precision(cast(v#1 as decimal(11,0))) + promote_precision(cast(v#1 as decimal(11,0)))), DecimalType(11,0), true) AS v#2]
                +- SubqueryAlias t3
                   +- SubqueryAlias tbl
                      +- LocalRelation [v#1]
    ```
    In this case, `WidenSetOperationTypes` added the alias `cast(v#1 as decimal(11,0)) AS v#3`, and then the reference in the top `Project` went missing. This PR corrects the reference (the `exprId` and the widened `dataType`) after adding aliases in the rule.
    
    This backport for 3.0 comes from apache#29485 and apache#29643
    
    ### Why are the changes needed?
    
    bugfixes
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Added unit tests
    
    Closes apache#29680 from maropu/SPARK-32638-BRANCH3.0.
    
    Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
    Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    cloud-fan and maropu committed Sep 8, 2020
    Commit 3f20f14

Commits on Sep 9, 2020

  1. [SPARK-32824][CORE] Improve the error message when the user forgets the .amount in a resource config
    
    ### What changes were proposed in this pull request?
    
    If the user forgets to specify .amount on a resource config like spark.executor.resource.gpu, the error message thrown is very confusing:
    
    ```
    ERROR SparkContext: Error initializing SparkContext.java.lang.StringIndexOutOfBoundsException: String index out of range:
    -1 at java.lang.String.substring(String.java:1967) at
    ```
    
    This change throws a readable error instead.
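
    For reference, a hedged sketch of the config shapes involved; the key names follow the standard `spark.executor.resource.<name>.*` pattern, and the discovery-script path is only a placeholder:
    ```scala
    import org.apache.spark.SparkConf

    // Missing ".amount": this is the shape that used to trigger the confusing
    // StringIndexOutOfBoundsException at SparkContext initialization.
    val incomplete = new SparkConf().set("spark.executor.resource.gpu", "1")

    // Complete resource config with an explicit amount.
    val complete = new SparkConf()
      .set("spark.executor.resource.gpu.amount", "1")
      .set("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
    ```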
    
    ### Why are the changes needed?
    
    confusing error for users
    ### Does this PR introduce _any_ user-facing change?
    
    just error message
    
    ### How was this patch tested?
    
    Tested manually on standalone cluster
    
    Closes apache#29685 from tgravescs/SPARK-32824.
    
    Authored-by: Thomas Graves <tgraves@nvidia.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit e8634d8)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    tgravescs authored and HyukjinKwon committed Sep 9, 2020
    Commit e86d90b
  2. [SPARK-32823][WEB UI] Fix the master ui resources reporting

    ### What changes were proposed in this pull request?
    
    Fixes the master UI so that resource totals are summed properly across multiple workers, e.g. the field:
    Resources in use: 0 / 8 gpu
    
    The bug here is that it was creating MutableResourceInfo and then reducing with the + operator. The + operator in MutableResourceInfo simply adds the addresses from one to the addresses of the other, but it uses a HashSet, so if the addresses are the same you lose the correct amount. For example, if worker1 has GPU addresses 0,1,2,3 and worker2 has addresses 0,1,2,3, you only see 4 total GPUs when there are 8.
    
    In this case we don't really need to create the MutableResourceInfo at all, because we just want the sums for used and total, so the use of it is removed. The other uses of it are per worker, so those should be fine.
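
    A small illustration of why set-union summing undercounts, independent of the Spark classes involved:
    ```scala
    // Each worker reports 4 GPUs, but both use the local address labels "0".."3".
    val worker1Gpus = Set("0", "1", "2", "3")
    val worker2Gpus = Set("0", "1", "2", "3")

    // Union collapses identical labels: 4 instead of 8.
    val mergedBySet = (worker1Gpus ++ worker2Gpus).size        // 4

    // Summing the per-worker counts gives the intended cluster-wide total.
    val mergedByCount = worker1Gpus.size + worker2Gpus.size    // 8
    ```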
    
    ### Why are the changes needed?
    
    fix UI
    
    ### Does this PR introduce _any_ user-facing change?
    
    UI
    
    ### How was this patch tested?
    
    tested manually on standalone cluster with multiple workers and multiple GPUs and multiple fpgas
    
    Closes apache#29683 from tgravescs/SPARK-32823.
    
    Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
    Co-authored-by: Thomas Graves <tgraves@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 514bf56)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    2 people authored and HyukjinKwon committed Sep 9, 2020
    Commit 86b9dd9
  3. [SPARK-32813][SQL] Get default config of ParquetSource vectorized reader if no active SparkSession
    
    ### What changes were proposed in this pull request?
    
    If no active SparkSession is available, let `FileSourceScanExec.needsUnsafeRowConversion` look at default SQL config of ParquetSource vectorized reader instead of failing the query execution.
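
    A hedged sketch of the fallback pattern described above; the real code lives in `FileSourceScanExec`, and the helper name here is illustrative:
    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.internal.SQLConf

    // Prefer the active session's SQL config, but fall back to the default
    // SQLConf value instead of failing when no SparkSession is active.
    def parquetVectorizedReaderEnabled: Boolean =
      SparkSession.getActiveSession
        .map(_.sessionState.conf.parquetVectorizedReaderEnabled)
        .getOrElse(SQLConf.get.parquetVectorizedReaderEnabled)
    ```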
    
    ### Why are the changes needed?
    
    Fix a bug that if no active SparkSession is available, file-based data source scan for Parquet Source will throw exception.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, this change fixes the bug.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes apache#29667 from viirya/SPARK-32813.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit de0dc52)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    viirya authored and HyukjinKwon committed Sep 9, 2020
    Commit 4c0f9d8
  4. [SPARK-32810][SQL][TESTS][FOLLOWUP][3.0] Check path globbing in JSON/CSV datasources v1 and v2
    
    ### What changes were proposed in this pull request?
    In the PR, I propose to move the test `SPARK-32810: CSV and JSON data sources should be able to read files with escaped glob metacharacter in the paths` from `DataFrameReaderWriterSuite` to `CSVSuite` and to `JsonSuite`. This allows running the same test in `CSVv1Suite`/`CSVv2Suite` and in `JsonV1Suite`/`JsonV2Suite`.
    
    ### Why are the changes needed?
    To improve test coverage by checking JSON/CSV datasources v1 and v2.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running affected test suites:
    ```
    $ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.csv.*"
    $ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.json.*"
    ```
    
    Closes apache#29690 from MaxGekk/globbing-paths-when-inferring-schema-dsv2-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Sep 9, 2020
    Commit 837843b
  5. [SPARK-32794][SS] Fixed rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 sources
    
    ### What changes were proposed in this pull request?
    
    Make MicroBatchExecution explicitly call `getBatch` when the start and end offsets are the same.
    
    ### Why are the changes needed?
    
    Structured Streaming micro-batch engine has the contract with V1 data sources that, after a restart, it will call `source.getBatch()` on the last batch attempted before the restart. However, a very rare combination of sequences violates this contract. It occurs only when
    - The streaming query has specific types of stateful operations with watermarks (e.g., aggregation in append, mapGroupsWithState with timeouts).
        - These queries can execute a batch even without new data when the previous updates the watermark and the stateful ops are such that the new watermark can cause new output/cleanup. Such batches are called no-data-batches.
    - The last batch before termination was an incomplete no-data-batch. Upon restart, the micro-batch engine fails to call `source.getBatch` when attempting to re-execute the incomplete no-data-batch.
    
    This occurs because a no-data-batch has the same start and end offsets, and when a batch is executed, if the start and end offsets are the same then calling `source.getBatch` is skipped, as it is assumed the generated plan will be empty. This only affects V1 data sources, which rely on this invariant to detect in the source whether the query is being started from scratch or restarted.
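
    A hedged, simplified sketch of the contract from the source's point of view; the types below are stand-ins, not the real streaming `Offset`/`Source` classes:
    ```scala
    // Stand-in offset type for illustration.
    final case class FakeOffset(value: Long)

    var lastGetBatchCall: Option[(Option[FakeOffset], FakeOffset)] = None

    def getBatch(start: Option[FakeOffset], end: FakeOffset): Unit = {
      // A V1 source uses this call to detect whether the query starts from
      // scratch (start == None) or is restarted at a known offset.
      lastGetBatchCall = Some((start, end))
    }

    // After the fix, the engine calls getBatch on restart even when the
    // re-executed batch is a no-data-batch with start == end.
    getBatch(Some(FakeOffset(42)), FakeOffset(42))
    assert(lastGetBatchCall.contains((Some(FakeOffset(42)), FakeOffset(42))))
    ```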
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    New unit test with a mock v1 source that fails without the fix.
    
    Closes apache#29696 from tdas/SPARK-32794-3.0.
    
    Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
    tdas committed Sep 9, 2020
    Commit e632e7c

Commits on Sep 10, 2020

  1. [SPARK-32836][SS][TESTS] Fix DataStreamReaderWriterSuite to check writer options correctly
    
    ### What changes were proposed in this pull request?
    
    This PR aims to fix the test coverage at `DataStreamReaderWriterSuite`.
    
    ### Why are the changes needed?
    
    Currently, the test case checks `DataStreamReader` options instead of `DataStreamWriter` options.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the revised test case.
    
    Closes apache#29701 from dongjoon-hyun/SPARK-32836.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 06a9945)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Sep 10, 2020
    Commit 5a81f60
  2. [SPARK-32832][SS] Use CaseInsensitiveMap for DataStreamReader/Writer options
    
    This PR aims to fix non-deterministic behavior of DataStreamReader/Writer options like the following.
    ```scala
    scala> spark.readStream.format("parquet").option("paTh", "1").option("PATH", "2").option("Path", "3").option("patH", "4").option("path", "5").load()
    org.apache.spark.sql.AnalysisException: Path does not exist: 1;
    ```
    
    This will make the behavior deterministic.
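
    A small sketch of the property the fix relies on, assuming the catalyst `CaseInsensitiveMap` keeps its usual case-insensitive lookup:
    ```scala
    import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

    val options = CaseInsensitiveMap(Map("PATH" -> "5"))

    // Lookups ignore case, so the reader resolves one deterministic value
    // no matter how the caller spelled the option name.
    assert(options.get("path").contains("5"))
    assert(options.get("paTh").contains("5"))
    ```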
    
    Yes, but the previous behavior was non-deterministic.

    Pass the newly added test cases.
    
    Closes apache#29702 from dongjoon-hyun/SPARK-32832.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 2f85f95)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Sep 10, 2020
    Commit 44acb5a
  3. [SPARK-32819][SQL][3.0] ignoreNullability parameter should be effective recursively
    
    ### What changes were proposed in this pull request?
    
    This patch proposes to check `ignoreNullability` parameter recursively in `equalsStructurally` method. This backports apache#29698 to branch-3.0.
    
    ### Why are the changes needed?
    
    `equalsStructurally` is used to check type equality. We can optionally ask it to ignore the nullability check. But the parameter `ignoreNullability` is not passed recursively down to nested types, so it produces a confusing error like:
    
    ```
    data type mismatch: argument 3 requires array<array<string>> type, however ... is of array<array<string>> type.
    ```
    
    when running the query `select aggregate(split('abcdefgh',''), array(array('')), (acc, x) -> array(array( x ) ) )`.
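
    An illustrative-only re-implementation of such a structural check with the flag propagated into nested types (the real logic lives in `DataType.equalsStructurally`):
    ```scala
    import org.apache.spark.sql.types._

    def structurallyEqual(a: DataType, b: DataType, ignoreNullability: Boolean): Boolean =
      (a, b) match {
        case (ArrayType(ea, na), ArrayType(eb, nb)) =>
          (ignoreNullability || na == nb) && structurallyEqual(ea, eb, ignoreNullability)
        case (MapType(ka, va, na), MapType(kb, vb, nb)) =>
          (ignoreNullability || na == nb) &&
            structurallyEqual(ka, kb, ignoreNullability) &&
            structurallyEqual(va, vb, ignoreNullability)
        case (StructType(fa), StructType(fb)) =>
          fa.length == fb.length && fa.zip(fb).forall { case (x, y) =>
            (ignoreNullability || x.nullable == y.nullable) &&
              structurallyEqual(x.dataType, y.dataType, ignoreNullability)
          }
        case _ => a == b
      }

    // Nested nullability differences are ignored when requested.
    val left  = ArrayType(ArrayType(StringType, containsNull = true), containsNull = true)
    val right = ArrayType(ArrayType(StringType, containsNull = false), containsNull = true)
    assert(structurallyEqual(left, right, ignoreNullability = true))
    ```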
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, fixed a bug when running user query.
    
    ### How was this patch tested?
    
    Unit tests.
    
    Closes apache#29705 from viirya/SPARK-32819-3.0.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    viirya authored and HyukjinKwon committed Sep 10, 2020
    Commit 5708045

Commits on Sep 11, 2020

  1. [SPARK-32840][SQL][3.0] Invalid interval value can happen to be just adhesive with the unit
    
    This PR backports apache#29708 to 3.0.
    
    ### What changes were proposed in this pull request?
    In this PR, we add an upfront check on STRING-form interval values before parsing multiple-unit intervals, and fail directly if the interval value contains alphabetic characters, to prevent correctness issues like `interval '1 day 2' day` = `3 days`.
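
    A rough sketch of the kind of upfront check described, assuming the STRING-form value has already been extracted (the real implementation throws a ParseException):
    ```scala
    // Reject a STRING-form interval value such as "1 day 2" whose trailing
    // number would otherwise be silently glued onto the following unit.
    def checkIntervalValue(value: String): Unit = {
      if (value.exists(_.isLetter)) {
        throw new IllegalArgumentException(
          s"Cannot parse interval value '$value': it must not contain alphabetic characters")
      }
    }

    checkIntervalValue("1")          // ok
    // checkIntervalValue("1 day 2") // fails instead of yielding 3 days
    ```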
    
    ### Why are the changes needed?
    
    fix correctness issue
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. In Spark 3.0.0, `interval '1 day 2' day` = `3 days`, but now we fail with a ParseException.

    ### How was this patch tested?
    
    add a test.
    
    Closes apache#29716 from yaooqinn/SPARK-32840-30.
    
    Authored-by: Kent Yao <yaooqinn@hotmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    yaooqinn authored and maropu committed Sep 11, 2020
    Commit 4fdd818
  2. [SPARK-32677][SQL][DOCS][MINOR] Improve code comment in CreateFunctionCommand
    
    ### What changes were proposed in this pull request?
    
    We made a mistake in apache#29502, as there is no code comment to explain why we can't load the UDF class when creating functions. This PR improves the code comment.
    
    ### Why are the changes needed?
    
    To avoid making the same mistake.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    N/A
    
    Closes apache#29713 from cloud-fan/comment.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 328d81a)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    cloud-fan authored and maropu committed Sep 11, 2020
    Commit cf14897
  3. [SPARK-32845][SS][TESTS] Add sinkParameter to check sink options robustly in DataStreamReaderWriterSuite
    
    This PR aims to add `sinkParameter`  to check sink options robustly and independently in DataStreamReaderWriterSuite
    
    `LastOptions.parameters` is designed to catch three cases: `sourceSchema`, `createSource`, `createSink`. However, `StreamQuery.stop` invokes `queryExecutionThread.join`, `runStream`, and `createSource` immediately, and resets the options stored by `createSink`.

    To catch the `createSink` options, the test suite currently tries a workaround pattern. However, we have sometimes observed flakiness in this pattern. If we track the `createSink` options separately, we don't need this workaround and can eliminate this flakiness.
    
    ```scala
    val query = df.writeStream.
       ...
       .start()
    assert(LastOptions.parameters(..))
    query.stop()
    ```
    
    No. This is a test-only change.
    
    Pass the newly updated test case.
    
    Closes apache#29730 from dongjoon-hyun/SPARK-32845.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit b4be6a6)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Sep 11, 2020
    Commit ec45d10

Commits on Sep 12, 2020

  1. [SPARK-32779][SQL][FOLLOW-UP] Delete Unused code

    ### What changes were proposed in this pull request?
    Follow-up PR as per the review comments in [29649](https://github.com/apache/spark/pull/29649/files/8d45542e915bea1b321f42988b407091065a2539#r487140171)
    
    ### Why are the changes needed?
    Delete the unused code
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing UT
    
    Closes apache#29736 from sandeep-katta/deadlockfollowup.
    
    Authored-by: sandeep.katta <sandeep.katta2007@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 2009f95)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sandeep-katta authored and dongjoon-hyun committed Sep 12, 2020
    Commit 2e04689

Commits on Sep 13, 2020

  1. [SPARK-32865][DOC] python section in quickstart page doesn't display SPARK_VERSION correctly
    
    ### What changes were proposed in this pull request?
    
    In https://github.com/apache/spark/blame/master/docs/quick-start.md#L402, it should be `{{site.SPARK_VERSION}}` rather than `{site.SPARK_VERSION}`
    
    ### Why are the changes needed?
    
    SPARK_VERSION isn't displayed correctly, as shown below
    
    ![image](https://user-images.githubusercontent.com/1892692/93006726-d03c8680-f514-11ea-85e3-1d7cfb682ef2.png)
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    tested locally, as shown below
    
    ![image](https://user-images.githubusercontent.com/1892692/93006712-a6835f80-f514-11ea-8d78-6831c9d65265.png)
    
    Closes apache#29738 from bowenli86/doc.
    
    Authored-by: bowen.li <bowenli86@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 0549c20)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    bowenli86 authored and dongjoon-hyun committed Sep 13, 2020
    Commit d4d2f5c

Commits on Sep 14, 2020

  1. [SPARK-32876][SQL] Change default fallback versions to 3.0.1 and 2.4.7 in HiveExternalCatalogVersionsSuite
    
    ### What changes were proposed in this pull request?
    
    The Jenkins job fails to get the versions. This was fixed by adding temporary fallbacks at apache#28536.
    This still doesn't work without the temporary fallbacks. See apache#29694
    
    This PR adds new fallbacks since 2.3 is EOL and Spark 3.0.1 and 2.4.7 are released.
    
    ### Why are the changes needed?
    
    To test correctly in Jenkins.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only
    
    ### How was this patch tested?
    
    Jenkins and GitHub Actions builds should test.
    
    Closes apache#29748 from HyukjinKwon/SPARK-32876.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 0696f04)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    HyukjinKwon authored and dongjoon-hyun committed Sep 14, 2020
    Commit 828603d
  2. [SPARK-32872][CORE] Prevent BytesToBytesMap at MAX_CAPACITY from exceeding growth threshold
    
    ### What changes were proposed in this pull request?
    
    When BytesToBytesMap is at `MAX_CAPACITY` and reaches its growth threshold, `numKeys >= growthThreshold` is true but `longArray.size() / 2 < MAX_CAPACITY` is false. This correctly prevents the map from growing, but `canGrowArray` incorrectly remains true. Therefore the map keeps accepting new keys and exceeds its growth threshold. If we attempt to spill the map in this state, the UnsafeKVExternalSorter will not be able to reuse the long array for sorting. By this point the task has typically consumed all available memory, so the allocation of the new pointer array is likely to fail.
    
    This PR fixes the issue by setting `canGrowArray` to false in this case. This prevents the map from accepting new elements when it cannot grow to accommodate them.
    
    ### Why are the changes needed?
    
    Without this change, hash aggregations will fail when the number of groups per task is greater than `MAX_CAPACITY / 2 = 2^28` (approximately 268 million), and when the grouping aggregation is the only memory-consuming operator in its stage.
    
    For example, the final aggregation in `SELECT COUNT(DISTINCT id) FROM tbl` fails when `tbl` contains 1 billion distinct values and when `spark.sql.shuffle.partitions=1`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Reproducing this issue requires building a very large BytesToBytesMap. Because this is infeasible to do in a unit test, this PR was tested manually by adding the following test to AbstractBytesToBytesMapSuite. Before this PR, the test fails in 8.5 minutes. With this PR, the test passes in 1.5 minutes.
    
    ```java
    public abstract class AbstractBytesToBytesMapSuite {
      // ...
      @Test
      public void respectGrowthThresholdAtMaxCapacity() {
        TestMemoryManager memoryManager2 =
            new TestMemoryManager(
                new SparkConf()
                .set(package$.MODULE$.MEMORY_OFFHEAP_ENABLED(), true)
                .set(package$.MODULE$.MEMORY_OFFHEAP_SIZE(), 25600 * 1024 * 1024L)
                .set(package$.MODULE$.SHUFFLE_SPILL_COMPRESS(), false)
                .set(package$.MODULE$.SHUFFLE_COMPRESS(), false));
        TaskMemoryManager taskMemoryManager2 = new TaskMemoryManager(memoryManager2, 0);
        final long pageSizeBytes = 8000000 + 8; // 8 bytes for end-of-page marker
        final BytesToBytesMap map = new BytesToBytesMap(taskMemoryManager2, 1024, pageSizeBytes);
    
        try {
          // Insert keys into the map until it stops accepting new keys.
          for (long i = 0; i < BytesToBytesMap.MAX_CAPACITY; i++) {
            if (i % (1024 * 1024) == 0) System.out.println("Inserting element " + i);
            final long[] value = new long[]{i};
            BytesToBytesMap.Location loc = map.lookup(value, Platform.LONG_ARRAY_OFFSET, 8);
            Assert.assertFalse(loc.isDefined());
            boolean success =
                loc.append(value, Platform.LONG_ARRAY_OFFSET, 8, value, Platform.LONG_ARRAY_OFFSET, 8);
            if (!success) break;
          }
    
          // The map should grow to its max capacity.
          long capacity = map.getArray().size() / 2;
          Assert.assertTrue(capacity == BytesToBytesMap.MAX_CAPACITY);
    
          // The map should stop accepting new keys once it has reached its growth
          // threshold, which is half the max capacity.
          Assert.assertTrue(map.numKeys() == BytesToBytesMap.MAX_CAPACITY / 2);
    
          map.free();
        } finally {
          map.free();
        }
      }
    }
    ```
    
    Closes apache#29744 from ankurdave/SPARK-32872.
    
    Authored-by: Ankur Dave <ankurdave@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 72550c3)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    ankurdave authored and dongjoon-hyun committed Sep 14, 2020
    Commit 990d49a

Commits on Sep 15, 2020

  1. [SPARK-32715][CORE] Fix memory leak when failed to store pieces of broadcast
    
    ### What changes were proposed in this pull request?
    In TorrentBroadcast.scala
    ```scala
    L133: if (!blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, tellMaster = false))
    L137: TorrentBroadcast.blockifyObject(value, blockSize, SparkEnv.get.serializer, compressionCodec)
    L147: if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, tellMaster = true))
    ```
    If the original value is saved successfully (TorrentBroadcast.scala: L133) but the subsequent `blockifyObject()` (L137) or store-piece (L147) steps fail, there is no opportunity to release the broadcast from memory.

    This patch removes all pieces of the broadcast when it fails to blockify the value or fails to store some pieces of a broadcast.
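
    A hedged sketch of the cleanup pattern, with `storePieces` and `removeAllPieces` as hypothetical stand-ins for the real block-manager calls:
    ```scala
    // Hypothetical helpers standing in for blockifyObject/putBytes and for the
    // cleanup this patch adds; the real code operates on the BlockManager.
    def storePieces(value: AnyRef): Unit = ???
    def removeAllPieces(broadcastId: Long): Unit = ???

    def writeBlocks(broadcastId: Long, value: AnyRef): Unit = {
      try {
        storePieces(value)
      } catch {
        case e: Throwable =>
          // If blockifying or storing any piece fails, drop whatever pieces
          // were already stored so the broadcast does not leak memory.
          removeAllPieces(broadcastId)
          throw e
      }
    }
    ```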
    
    ### Why are the changes needed?
    We use Spark thrift-server as a long-running service. A bad query submitted a heavy BroadcastNestedLoopJoin operation and made the driver full GC. We killed the bad query, but we found the driver's memory usage was still high and full GCs were still frequent. By investigating a GC dump and the logs, we found that the broadcast may leak memory.
    
    > 2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure)
    2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc):
    116G->112G(170G), 184.9121920 secs]
    [Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: 116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)]
    1: 676531691 72035438432 [B
    2: 676502528 32472121344 org.apache.spark.sql.catalyst.expressions.UnsafeRow
    3: 99551 12018117568 [Ljava.lang.Object;
    4: 26570 4349629040 [I
    5: 6 3264536688 [Lorg.apache.spark.sql.catalyst.InternalRow;
    6: 1708819 256299456 [C
    7: 2338 179615208 [J
    8: 1703669 54517408 java.lang.String
    9: 103860 34896960 org.apache.spark.status.TaskDataWrapper
    10: 177396 25545024 java.net.URI
    ...
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Manually tested. This UT is hard to write and the patch is straightforward.
    
    Closes apache#29558 from LantaoJin/SPARK-32715.
    
    Authored-by: LantaoJin <jinlantao@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 7a9b066)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    LantaoJin authored and dongjoon-hyun committed Sep 15, 2020
    Commit fe6ff15

Commits on Sep 16, 2020

  1. [SPARK-32688][SQL][TEST] Add special values to LiteralGenerator for float and double
    
    ### What changes were proposed in this pull request?
    
    The `LiteralGenerator` for float and double datatypes was supposed to yield special values (NaN, +-inf) among others, but the `Gen.chooseNum` method does not yield values that are outside the defined range. `Gen.chooseNum` for a wide range of floats and doubles also does not yield values in the "everyday" range, as stated in typelevel/scalacheck#113.

    There is a similar class, `RandomDataGenerator`, that is used in some other tests. Added `-0.0` and `-0.0f` as special values there too.
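
    A sketch of the generator shape being described, mixing explicit special values into `Gen.chooseNum` with ScalaCheck (names are illustrative, not the exact `LiteralGenerator` code):
    ```scala
    import org.scalacheck.Gen

    // Gen.chooseNum alone does not produce these boundary values, so list
    // them explicitly and mix them in.
    val specialDoubles: Gen[Double] = Gen.oneOf(
      Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity,
      Double.MaxValue, Double.MinValue, 0.0, -0.0)

    val anyDouble: Gen[Double] = Gen.frequency(
      9 -> Gen.chooseNum(-1e9, 1e9),  // "everyday" range
      1 -> specialDoubles)            // NaN, +-inf, +-0.0, extremes
    ```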
    
    These changes revealed an inconsistency with the equality check between `-0.0` and `0.0`.
    
    ### Why are the changes needed?
    
    The `LiteralGenerator` is mostly used in the `checkConsistencyBetweenInterpretedAndCodegen` method in `MathExpressionsSuite`. This change would have caught the bug fixed in apache#29495 .
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Locally reverted apache#29495 and verified that the existing test cases caught the bug.
    
    Closes apache#29515 from tanelk/SPARK-32688.
    
    Authored-by: Tanel Kiis <tanel.kiis@gmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 6051755)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    tanelk authored and maropu committed Sep 16, 2020
    Commit cb6a0d0
  2. [SPARK-32888][DOCS] Add user document about header flag and RDD as path for reading CSV
    
    ### What changes were proposed in this pull request?
    
    This proposes to enhance the user documentation of the API for loading a Dataset of strings storing CSV rows. If the header option is set to true, the API removes all lines that are identical to the header.
    
    ### Why are the changes needed?
    
    This behavior can confuse users. We should explicitly document it.
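
    For reference, a small usage sketch of the API being documented, assuming an active `spark` session:
    ```scala
    import spark.implicits._

    // A Dataset[String] of CSV rows in which the header line repeats.
    val csvRows = Seq("name,age", "alice,29", "name,age", "bob,31").toDS()

    // With header=true, every line equal to the header is removed, not just
    // the first one -- the behavior the documentation now calls out.
    spark.read.option("header", "true").csv(csvRows).show()
    // Expected: only the alice and bob rows remain.
    ```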
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. Only doc change.
    
    ### How was this patch tested?
    
    Only doc change.
    
    Closes apache#29765 from viirya/SPARK-32888.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 550c1c9)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    viirya authored and HyukjinKwon committed Sep 16, 2020
    Commit 75a225e
  3. [SPARK-32897][PYTHON] Don't show a deprecation warning at SparkSession.builder.getOrCreate
    
    ### What changes were proposed in this pull request?
    
    In PySpark shell, if you call `SparkSession.builder.getOrCreate` as below:
    
    ```python
    import warnings
    from pyspark.sql import SparkSession, SQLContext
    warnings.simplefilter('always', DeprecationWarning)
    spark.stop()
    SparkSession.builder.getOrCreate()
    ```
    
    it shows the deprecation warning as below:
    
    ```
    /.../spark/python/pyspark/sql/context.py:72: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
      DeprecationWarning)
    ```
    
    via https://github.com/apache/spark/blob/d3304268d3046116d39ec3d54a8e319dce188f36/python/pyspark/sql/session.py#L222
    
    We shouldn't print the deprecation warning from it. This is the only place ^.
    
    ### Why are the changes needed?
    
    To avoid mistakenly informing users that `SparkSession.builder.getOrCreate` is deprecated.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it won't show a deprecation warning to end users for calling `SparkSession.builder.getOrCreate`.
    
    ### How was this patch tested?
    
    Manually tested as above.
    
    Closes apache#29768 from HyukjinKwon/SPARK-32897.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
    (cherry picked from commit 657e39a)
    Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
    HyukjinKwon authored and ueshin committed Sep 16, 2020
    Commit aa9563e

Commits on Sep 17, 2020

  1. [SPARK-32900][CORE] Allow UnsafeExternalSorter to spill when there are nulls
    
    ### What changes were proposed in this pull request?
    
    This PR changes the way `UnsafeExternalSorter.SpillableIterator` checks whether it has spilled already, by checking whether `inMemSorter` is null. It also allows it to spill other `UnsafeSorterIterator`s than `UnsafeInMemorySorter.SortedIterator`.
    
    ### Why are the changes needed?
    
    Before this PR `UnsafeExternalSorter.SpillableIterator` could not spill when there are NULLs in the input and radix sorting is used. Currently, Spark determines whether UnsafeExternalSorter.SpillableIterator has not spilled yet by checking whether `upstream` is an instance of `UnsafeInMemorySorter.SortedIterator`. When radix sorting is used and there are NULLs in the input however, `upstream` will be an instance of `UnsafeExternalSorter.ChainedIterator` instead, and Spark will assume that the `SpillableIterator` iterator has spilled already, and therefore cannot spill again when it's supposed to spill.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    A test was added to `UnsafeExternalSorterSuite` (and therefore also to `UnsafeExternalSorterRadixSortSuite`). I manually confirmed that the test failed in `UnsafeExternalSorterRadixSortSuite` without this patch.
    
    Closes apache#29772 from tomvanbussel/SPARK-32900.
    
    Authored-by: Tom van Bussel <tom.vanbussel@databricks.com>
    Signed-off-by: herman <herman@databricks.com>
    (cherry picked from commit e5e54a3)
    Signed-off-by: herman <herman@databricks.com>
    tomvanbussel authored and hvanhovell committed Sep 17, 2020
    Commit 2e94d9a
  2. [SPARK-32887][DOC] Correct the typo for SHOW TABLE

    ### What changes were proposed in this pull request?
    Correct the typo in Show Table document
    
    ### Why are the changes needed?
    The current documentation of SHOW TABLE results in a parse error, so it is misleading to users
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, the SHOW TABLE documentation is corrected now
    
    ### How was this patch tested?
    NA
    
    Closes apache#29758 from Udbhav30/showtable.
    
    Authored-by: Udbhav30 <u.agrawal30@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 88e87bc)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Udbhav30 authored and dongjoon-hyun committed Sep 17, 2020
    Commit b3b6f38
  3. [SPARK-32738][CORE][3.0] Should reduce the number of active threads if fatal error happens in `Inbox.process`
    
    This is a backport for [pr#29580](apache#29580) to branch 3.0.
    
    ### What changes were proposed in this pull request?
    
    Processing for `ThreadSafeRpcEndpoint` is controlled by `numActiveThreads` in `Inbox`. Now if any fatal error happens during `Inbox.process`, `numActiveThreads` is not reduced. Then other threads cannot process messages in that inbox, which causes the endpoint to "hang". For other types of endpoints, we should also keep `numActiveThreads` correct.
    
    This problem is more serious in previous Spark 2.x versions since the driver, executor and block manager endpoints are all thread safe endpoints.
    
    To fix this, we should reduce the number of active threads if a fatal error happens in `Inbox.process`.
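
    A minimal sketch of the intended control flow, with the inbox internals reduced to a counter (illustrative only, not the real Inbox code):
    ```scala
    final class ToyInbox {
      private var numActiveThreads = 0

      def process(handle: () => Unit): Unit = {
        synchronized { numActiveThreads += 1 }
        try {
          handle()
        } finally {
          // The fix: decrement even when handle() throws a fatal error, so
          // other threads can still process messages from this inbox.
          synchronized { numActiveThreads -= 1 }
        }
      }

      def activeThreads: Int = synchronized(numActiveThreads)
    }
    ```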
    
    ### Why are the changes needed?
    
    `numActiveThreads` is not correct when fatal error happens and will cause the described problem.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Add a new test.
    
    Closes apache#29763 from wzhfy/deal_with_fatal_error_3.0.
    
    Authored-by: Zhenhua Wang <wzh_zju@163.com>
    Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
    wzhfy authored and Mridul Muralidharan committed Sep 17, 2020
    Commit 17a5195
  4. [SPARK-32635][SQL] Fix foldable propagation

    ### What changes were proposed in this pull request?
    This PR rewrites `FoldablePropagation` rule to replace attribute references in a node with foldables coming only from the node's children.
    
    Before this PR, in the case of this example (with the setting `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation`):
    ```scala
    val a = Seq("1").toDF("col1").withColumn("col2", lit("1"))
    val b = Seq("2").toDF("col1").withColumn("col2", lit("2"))
    val aub = a.union(b)
    val c = aub.filter($"col1" === "2").cache()
    val d = Seq("2").toDF( "col4")
    val r = d.join(aub, $"col2" === $"col4").select("col4")
    val l = c.select("col2")
    val df = l.join(r, $"col2" === $"col4", "LeftOuter")
    df.show()
    ```
    foldable propagation happens incorrectly:
    ```
     Join LeftOuter, (col2#6 = col4#34)                                                              Join LeftOuter, (col2#6 = col4#34)
    !:- Project [col2#6]                                                                             :- Project [1 AS col2#6]
     :  +- InMemoryRelation [col1#4, col2#6], StorageLevel(disk, memory, deserialized, 1 replicas)   :  +- InMemoryRelation [col1#4, col2#6], StorageLevel(disk, memory, deserialized, 1 replicas)
     :        +- Union                                                                               :        +- Union
     :           :- *(1) Project [value#1 AS col1#4, 1 AS col2#6]                                    :           :- *(1) Project [value#1 AS col1#4, 1 AS col2#6]
     :           :  +- *(1) Filter (isnotnull(value#1) AND (value#1 = 2))                            :           :  +- *(1) Filter (isnotnull(value#1) AND (value#1 = 2))
     :           :     +- *(1) LocalTableScan [value#1]                                              :           :     +- *(1) LocalTableScan [value#1]
     :           +- *(2) Project [value#10 AS col1#13, 2 AS col2#15]                                 :           +- *(2) Project [value#10 AS col1#13, 2 AS col2#15]
     :              +- *(2) Filter (isnotnull(value#10) AND (value#10 = 2))                          :              +- *(2) Filter (isnotnull(value#10) AND (value#10 = 2))
     :                 +- *(2) LocalTableScan [value#10]                                             :                 +- *(2) LocalTableScan [value#10]
     +- Project [col4#34]                                                                            +- Project [col4#34]
        +- Join Inner, (col2#6 = col4#34)                                                               +- Join Inner, (col2#6 = col4#34)
           :- Project [value#31 AS col4#34]                                                                :- Project [value#31 AS col4#34]
           :  +- LocalRelation [value#31]                                                                  :  +- LocalRelation [value#31]
           +- Project [col2#6]                                                                             +- Project [col2#6]
              +- Union false, false                                                                           +- Union false, false
                 :- Project [1 AS col2#6]                                                                        :- Project [1 AS col2#6]
                 :  +- LocalRelation [value#1]                                                                   :  +- LocalRelation [value#1]
                 +- Project [2 AS col2#15]                                                                       +- Project [2 AS col2#15]
                    +- LocalRelation [value#10]                                                                     +- LocalRelation [value#10]
    
    ```
    and so the result is wrong:
    ```
    +----+----+
    |col2|col4|
    +----+----+
    |   1|null|
    +----+----+
    ```
    
    After this PR foldable propagation will not happen incorrectly and the result is correct:
    ```
    +----+----+
    |col2|col4|
    +----+----+
    |   2|   2|
    +----+----+
    ```
    
    ### Why are the changes needed?
    To fix a correctness issue.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, fixes a correctness issue.
    
    ### How was this patch tested?
    Existing and new UTs.
    
    Closes apache#29771 from peter-toth/SPARK-32635-fix-foldable-propagation.
    
    Authored-by: Peter Toth <peter.toth@gmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 4ced588)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    peter-toth authored and maropu committed Sep 17, 2020
    Commit ecc2f5d

Commits on Sep 18, 2020

  1. [SPARK-32908][SQL] Fix target error calculation in percentile_approx()

    ### What changes were proposed in this pull request?
    1. Change the target error calculation according to the paper [Space-Efficient Online Computation of Quantile Summaries](http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf), which states that the error is `e = max(gi, deltai)/2` (see page 59). There is also a clear explanation in [ε-approximate quantiles](http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/08-Quantile/Greenwald.html#proofprop1).
    2. Added a test to check different accuracies.
    3. Added an input CSV file `percentile_approx-input.csv.bz2` to the resource folder `sql/catalyst/src/main/resources` for the test.
    
    ### Why are the changes needed?
    To fix incorrect percentile calculation, see an example in SPARK-32908.
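
    For context, a small usage sketch showing the accuracy parameter whose error bound the fix recalculates, assuming an active `spark` session:
    ```scala
    import spark.implicits._

    val df = (1 to 1000).toDF("value")

    // percentile_approx(col, percentage, accuracy): a larger accuracy gives a
    // smaller target error, now computed as e = max(g_i, delta_i) / 2.
    df.selectExpr(
      "percentile_approx(value, 0.5, 100)   AS p50_low_accuracy",
      "percentile_approx(value, 0.5, 10000) AS p50_high_accuracy"
    ).show()
    ```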
    
    ### Does this PR introduce _any_ user-facing change?
    Yes
    
    ### How was this patch tested?
    - By running existing tests in `QuantileSummariesSuite` and in `ApproximatePercentileQuerySuite`.
    - Added new test `SPARK-32908: maximum target error in percentile_approx` to `ApproximatePercentileQuerySuite`.
    
    Closes apache#29784 from MaxGekk/fix-percentile_approx-2.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 75dd864)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Sep 18, 2020
    Commit 5581a92
  2. [SPARK-32906][SQL] Struct field names should not change after normalizing floats
    
    ### What changes were proposed in this pull request?
    
    This PR intends to fix a minor bug when normalizing floats for struct types;
    ```
    scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec
    scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k")
    scala> val agg = df.distinct()
    scala> agg.explain()
    == Physical Plan ==
    *(2) HashAggregate(keys=[k#40], functions=[])
    +- Exchange hashpartitioning(k#40, 200), true, [id=apache#62]
       +- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) null else named_struct(col1, knownfloatingpointnormalized(normalizenanandzero(k#40._1)))) AS k#40], functions=[])
          +- *(1) LocalTableScan [k#40]
    
    scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: HashAggregateExec => a.output.head }
    scala> aggOutput.foreach { attr => println(attr.prettyJson) }
    ### Final Aggregate ###
    [ {
      "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
      "num-children" : 0,
      "name" : "k",
      "dataType" : {
        "type" : "struct",
        "fields" : [ {
          "name" : "_1",
                    ^^^
          "type" : "double",
          "nullable" : false,
          "metadata" : { }
        } ]
      },
      "nullable" : true,
      "metadata" : { },
      "exprId" : {
        "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
        "id" : 40,
        "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
      },
      "qualifier" : [ ]
    } ]
    
    ### Partial Aggregate ###
    [ {
      "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
      "num-children" : 0,
      "name" : "k",
      "dataType" : {
        "type" : "struct",
        "fields" : [ {
          "name" : "col1",
                    ^^^^
          "type" : "double",
          "nullable" : true,
          "metadata" : { }
        } ]
      },
      "nullable" : true,
      "metadata" : { },
      "exprId" : {
        "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
        "id" : 40,
        "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
      },
      "qualifier" : [ ]
    } ]
    ```
    
    ### Why are the changes needed?
    
    bugfix.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added tests.
    
    Closes apache#29780 from maropu/FixBugInNormalizedFloatingNumbers.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
    (cherry picked from commit b49aaa3)
    Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
    maropu authored and viirya committed Sep 18, 2020
    Commit 2d55de5
  3. [SPARK-32905][CORE][YARN] ApplicationMaster fails to receive UpdateDelegationTokens message
    
    ### What changes were proposed in this pull request?
    
    With a long-running application in kerberized mode, the AMEndpoint handles the `UpdateDelegationTokens` message incorrectly: it is a one-way message that should be handled in the `receive` function.
    
    ```java
    20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, launching executors on 0 of them.
    20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error
    org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive'
    	at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
    	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
    	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
    	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
    	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error
    org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive'
    	at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
    	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
    	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
    	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
    	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    ```
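
    A hedged sketch of the distinction: one-way messages (sent with `send()`) arrive at `receive`, while ask-style messages arrive at `receiveAndReply`. The types below are toy stand-ins, not the real Spark RPC classes:
    ```scala
    // Toy model of the RPC endpoint contract.
    case class TokensUpdated(tokens: Array[Byte])

    trait ToyEndpoint {
      def receive: PartialFunction[Any, Unit]
      def receiveAndReply(reply: Any => Unit): PartialFunction[Any, Unit]
    }

    class SketchAMEndpoint extends ToyEndpoint {
      // One-way messages must be matched here; before the fix the token-update
      // case was only handled on the ask path and never matched, producing the
      // "does not implement 'receive'" error shown above.
      override def receive: PartialFunction[Any, Unit] = {
        case TokensUpdated(_) => () // update delegation tokens here
      }

      // Ask-style messages are matched here and must reply.
      override def receiveAndReply(reply: Any => Unit): PartialFunction[Any, Unit] = {
        case msg => reply(s"unexpected ask: $msg")
      }
    }
    ```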
    
    ### Why are the changes needed?
    
    Bugfix: without a proper token refresher, long-running apps can potentially fail in a kerberized cluster.
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    Passing jenkins
    
    and verify manually
    
    I am running the sub-module `kyuubi-spark-sql-engine` of https://github.com/yaooqinn/kyuubi
    
    The simplest way to reproduce the bug and verify this fix is to follow these steps
    
    #### 1 build the `kyuubi-spark-sql-engine` module
    ```
    mvn clean package -pl :kyuubi-spark-sql-engine
    ```
    #### 2. config the spark with Kerberos settings towards your secured cluster
    
    #### 3. start it in the background
    ```
    nohup bin/spark-submit --class org.apache.kyuubi.engine.spark.SparkSQLEngine ../kyuubi-spark-sql-engine-1.0.0-SNAPSHOT.jar > kyuubi.log &
    ```
    
    #### 4. check the AM log and see
    
    "Updating delegation tokens ..." for SUCCESS
    
    "Inbox: Ignoring error ...... does not implement 'receive'" for FAILURE
    
    Closes apache#29777 from yaooqinn/SPARK-32905.
    
    Authored-by: Kent Yao <yaooqinn@hotmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 9e9d4b6)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    yaooqinn authored and cloud-fan committed Sep 18, 2020
    Commit ffcd757
  4. [SPARK-32930][CORE] Replace deprecated isFile/isDirectory methods

    ### What changes were proposed in this pull request?
    
    This PR aims to replace deprecated `isFile` and `isDirectory` methods.
    
    ```diff
    - fs.isDirectory(hadoopPath)
    + fs.getFileStatus(hadoopPath).isDirectory
    ```
    
    ```diff
    - fs.isFile(new Path(inProgressLog))
    + fs.getFileStatus(new Path(inProgressLog)).isFile
    ```
    
    ### Why are the changes needed?
    
    It shows deprecation warnings.
    
    - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2-hive-2.3/1244/consoleFull
    
    ```
    [warn] /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2-hive-2.3/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:815: method isFile in class FileSystem is deprecated: see corresponding Javadoc for more information.
    [warn]             if (!fs.isFile(new Path(inProgressLog))) {
    ```
    
    ```
    [warn] /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2-hive-2.3/core/src/main/scala/org/apache/spark/SparkContext.scala:1884: method isDirectory in class FileSystem is deprecated: see corresponding Javadoc for more information.
    [warn]           if (fs.isDirectory(hadoopPath)) {
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the Jenkins.
    
    Closes apache#29796 from williamhyun/filesystem.
    
    Authored-by: William Hyun <williamhyun3@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 7892887)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    williamhyun authored and HyukjinKwon committed Sep 18, 2020
    Commit 20cd7bb
  5. [SPARK-32635][SQL][FOLLOW-UP] Add a new test case in catalyst module

    ### What changes were proposed in this pull request?
    This is a follow-up PR to apache#29771 and just adds a new test case.
    
    ### Why are the changes needed?
    To have better test coverage.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    New UT.
    
    Closes apache#29802 from peter-toth/SPARK-32635-fix-foldable-propagation-followup.
    
    Authored-by: Peter Toth <peter.toth@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 3309a2b)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    peter-toth authored and dongjoon-hyun committed Sep 18, 2020
    Commit 7746c20
  6. [SPARK-32898][CORE] Fix wrong executorRunTime when task killed before…

    … real start
    
    ### What changes were proposed in this pull request?
    
    Only calculate the executorRunTime when taskStartTimeNs > 0. Otherwise, set executorRunTime to 0.
    
    ### Why are the changes needed?
    
    bug fix.
    
    It's possible for a task to be killed (e.g., by another successful attempt) before it reaches `taskStartTimeNs = System.nanoTime()`. In this case, `taskStartTimeNs` is still 0 since it has never really been initialized, and we would get a wrong executorRunTime by calculating `System.nanoTime() - taskStartTimeNs`.
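
    For illustration, a minimal sketch of the guard (hypothetical variable names, not the actual `Executor` code):

    ```scala
    // Report 0 run time when the task never actually started.
    val taskStartTimeNs: Long = 0L  // stays 0 if the task was killed before it really started
    val executorRunTimeNs: Long =
      if (taskStartTimeNs > 0) System.nanoTime() - taskStartTimeNs else 0L
    ```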
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, users will see the correct executorRunTime.
    
    ### How was this patch tested?
    
    Pass existing tests.
    
    Closes apache#29789 from Ngone51/fix-SPARK-32898.
    
    Authored-by: yi.wu <yi.wu@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit f1dc479)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Ngone51 authored and dongjoon-hyun committed Sep 18, 2020
    Commit 03fb144

Commits on Sep 21, 2020

  1. [SPARK-32886][WEBUI] fix 'undefined' link in event timeline view

    ### What changes were proposed in this pull request?
    
    Fix ".../jobs/undefined" link from "Event Timeline" in jobs page. Job page link in "Event Timeline" view is constructed by fetching job page link defined in job list below. when job count exceeds page size of job table, only links of jobs in job table can be fetched from page. Other jobs' link would be 'undefined', and links of them in "Event Timeline" are broken, they are redirected to some wired URL like ".../jobs/undefined". This PR is fixing this wrong link issue. With this PR, job link in "Event Timeline" view would always redirect to correct job page.
    
    ### Why are the changes needed?
    
    Wrong link (".../jobs/undefined") in "Event Timeline" of jobs page. for example, the first job in below page is not in table below, as job count(116) exceeds page size(100). When clicking it's item in "Event Timeline", page is redirected to ".../jobs/undefined", which is wrong. Links in "Event Timeline" should always be correct.
    ![undefinedlink](https://user-images.githubusercontent.com/10524738/93184779-83fa6d80-f6f1-11ea-8a80-1a304ca9cbb2.JPG)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Manually tested.
    
    Closes apache#29757 from zhli1142015/fix-link-event-timeline-view.
    
    Authored-by: Zhen Li <zhli@microsoft.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit d01594e)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    zhli1142015 authored and srowen committed Sep 21, 2020
    Commit 0a4b668
  2. [SPARK-32718][SQL][3.0] Remove unnecessary keywords for interval units

    Backport apache#29560 to 3.0, as it's kind of a bug fix for the ANSI mode. People can't use `year`,  `month`, etc. functions under ANSI mode.
    
    Closes apache#29823 from cloud-fan/backport.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    cloud-fan authored and dongjoon-hyun committed Sep 21, 2020
    Commit b27bbbb

Commits on Sep 22, 2020

  1. [SPARK-32659][SQL][FOLLOWUP][3.0] Broadcast Array instead of Set in I…

    …nSubqueryExec
    
    ### What changes were proposed in this pull request?
    
    This is a followup of apache#29475.
    
    This PR updates the code to broadcast the Array instead of Set, which was the behavior before apache#29475
    
    ### Why are the changes needed?
    
    The size of a Set can be much bigger than that of an Array. It's safer to keep the behavior the same as before and build the set on the executor side.
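
    A minimal sketch of the idea (a toy RDD pipeline in a spark-shell style session, not the actual `InSubqueryExec` code):

    ```scala
    // Assumes `spark` is a live SparkSession, as in spark-shell.
    val sc = spark.sparkContext
    val values: Array[Int] = Array(1, 2, 3)        // compact form shipped to executors
    val broadcastValues = sc.broadcast(values)     // broadcast the Array, not a Set
    val filtered = sc.parallelize(Seq(1, 4, 3, 7)).mapPartitions { iter =>
      val set = broadcastValues.value.toSet        // the Set is built on the executor side
      iter.filter(set.contains)
    }
    filtered.collect()                             // Array(1, 3)
    ```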
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    existing tests
    
    Closes apache#29840 from cloud-fan/backport.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    cloud-fan authored and dongjoon-hyun committed Sep 22, 2020
    Commit 8a481d8

Commits on Sep 23, 2020

  1. [MINOR][SQL][3.0] Improve examples for percentile_approx()

    ### What changes were proposed in this pull request?
    In the PR, I propose to replace current examples for `percentile_approx()` with **only one** input value by example **with multiple values** in the input column.
    
    ### Why are the changes needed?
    Current examples are pretty trivial and don't demonstrate the function's behaviour on a sequence of values.
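
    For example, an illustrative query over multiple input values (assuming an active SparkSession `spark`; not necessarily the exact example added by this PR):

    ```scala
    // Returns an element that approximates the median of the four input values.
    spark.sql(
      "SELECT percentile_approx(col, 0.5) FROM VALUES (0), (1), (2), (10) AS tab(col)"
    ).show()
    ```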
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    - by running `ExpressionInfoSuite`
    - `./dev/scalastyle`
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit b53da23)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#29848 from MaxGekk/example-percentile_approx-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Sep 23, 2020
    Commit 58124bd
  2. [SPARK-32306][SQL][DOCS][3.0] Clarify the result of `percentile_appro…

    …x()`
    
    ### What changes were proposed in this pull request?
    More precise description of the result of the `percentile_approx()` function and its synonym `approx_percentile()`. The proposed sentence clarifies that the function returns **one of the elements** (or an array of elements) from the input column.
    
    ### Why are the changes needed?
    To improve Spark docs and avoid misunderstanding of the function behavior.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    `./dev/scalastyle`
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
    (cherry picked from commit 7c14f17)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#29845 from MaxGekk/doc-percentile_approx-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Sep 23, 2020
    Commit 542dc97

Commits on Sep 24, 2020

  1. [SPARK-32977][SQL][DOCS] Fix JavaDoc on Default Save Mode

    ### What changes were proposed in this pull request?
    
    The default save mode is always `ErrorIfExists`, regardless of the DataSource version. Fixing the JavaDoc to reflect this.
    
    ### Why are the changes needed?
    
    To fix documentation
    
    ### Does this PR introduce _any_ user-facing change?
    
    Doc change.
    
    ### How was this patch tested?
    
    Manual.
    
    Closes apache#29853 from RussellSpitzer/SPARK-32977.
    
    Authored-by: Russell Spitzer <russell.spitzer@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit b3f0087)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    RussellSpitzer authored and dongjoon-hyun committed Sep 24, 2020
    Commit 21b6b69

Commits on Sep 25, 2020

  1. [SPARK-32877][SQL][TEST] Add test for Hive UDF complex decimal type

    ### What changes were proposed in this pull request?
    
    Add test to cover Hive UDF whose input contains complex decimal type.
    Add comment to explain why we can't make `HiveSimpleUDF` extend `ImplicitTypeCasts`.
    
    ### Why are the changes needed?
    
    For better test coverage of the Hive behaviors that we are (or are not) compatible with.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Add test.
    
    Closes apache#29863 from ulysses-you/SPARK-32877-test.
    
    Authored-by: ulysses <youxiduo@weidian.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit f2fc966)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    ulysses-you authored and dongjoon-hyun committed Sep 25, 2020
    Commit 4b84e57

Commits on Sep 26, 2020

  1. [SPARK-32999][SQL] Use Utils.getSimpleName to avoid hitting Malformed…

    … class name in TreeNode
    
    ### What changes were proposed in this pull request?
    
    Use `Utils.getSimpleName` to avoid hitting `Malformed class name` error in `TreeNode`.
    
    ### Why are the changes needed?
    
    On older JDK versions (e.g. JDK8u), nested Scala classes may trigger `java.lang.Class.getSimpleName` to throw an `java.lang.InternalError: Malformed class name` error.
    
    Similar to apache#29050, we should use  Spark's `Utils.getSimpleName` utility function in place of `Class.getSimpleName` to avoid hitting the issue.
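
    A hedged sketch of the defensive pattern (the real `Utils.getSimpleName` may differ in details):

    ```scala
    // Fall back to the fully-qualified name when getSimpleName throws on some JDK8u builds.
    def safeSimpleName(cls: Class[_]): String =
      try cls.getSimpleName
      catch {
        case _: InternalError => cls.getName  // "Malformed class name" on nested Scala classes
      }
    ```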
    
    ### Does this PR introduce _any_ user-facing change?
    
    Fixes a bug that throws an error when invoking `TreeNode.nodeName`, otherwise no changes.
    
    ### How was this patch tested?
    
    Added new unit test case in `TreeNodeSuite`. Note that the test case assumes the test code can trigger the expected error, otherwise it'll skip the test safely, for compatibility with newer JDKs.
    
    Manually tested on JDK8u and JDK11u and observed expected behavior:
    - JDK8u: the test case triggers the "Malformed class name" issue and the fix works;
    - JDK11u: the test case does not trigger the "Malformed class name" issue, and the test case is safely skipped.
    
    Closes apache#29875 from rednaxelafx/spark-32999-getsimplename.
    
    Authored-by: Kris Mok <kris.mok@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 9a155d4)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    rednaxelafx authored and dongjoon-hyun committed Sep 26, 2020
    Commit 4425c3a

Commits on Sep 29, 2020

  1. [SPARK-33015][SQL] Compute the current date only once

    ### What changes were proposed in this pull request?
    Compute the current date at the specified time zone using timestamp taken at the start of query evaluation.
    
    ### Why are the changes needed?
    According to the doc for [current_date()](http://spark.apache.org/docs/latest/api/sql/#current_date), the current date should be computed at the start of query evaluation but it can be computed multiple times. As a consequence of that, the function can return different values if the query is executed at the border of two dates.
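
    An illustration of the expected behaviour (assuming an active SparkSession `spark`; all occurrences within one query should yield the same value):

    ```scala
    // Both columns must contain the same date, even if the query runs across midnight,
    // because current_date() is evaluated once at the start of query evaluation.
    spark.sql("SELECT current_date() AS d1, current_date() AS d2").show()
    ```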
    
    ### Does this PR introduce _any_ user-facing change?
    Yes
    
    ### How was this patch tested?
    By existing test suites `ComputeCurrentTimeSuite` and `DateExpressionsSuite`.
    
    Closes apache#29889 from MaxGekk/fix-current_date.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 68cd567)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Sep 29, 2020
    Commit 424f16e
  2. [MINOR][DOCS] Document when current_date and current_timestamp ar…

    …e evaluated
    
    ### What changes were proposed in this pull request?
    Explicitly document that `current_date` and `current_timestamp` are evaluated at the start of query evaluation, and that all calls of `current_date`/`current_timestamp` within the same query return the same value.
    
    ### Why are the changes needed?
    Users could expect that `current_date` and `current_timestamp` return the current date/timestamp at the moment of query execution but in fact the functions are folded by the optimizer at the start of query evaluation:
    https://github.com/apache/spark/blob/0df8dd60733066076967f0525210bbdb5e12415a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L71-L91
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    by running `./dev/scalastyle`.
    
    Closes apache#29892 from MaxGekk/doc-current_date.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 1b60ff5)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Sep 29, 2020
    Commit 118de10
  3. [SPARK-33021][PYTHON][TESTS] Move functions related test cases into t…

    …est_functions.py
    
    Move functions-related test cases from `test_context.py` to `test_functions.py`.
    
    To group the similar test cases.
    
    Nope, test-only.
    
    Jenkins and GitHub Actions should test.
    
    Closes apache#29898 from HyukjinKwon/SPARK-33021.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    HyukjinKwon committed Sep 29, 2020
    Commit 97d8634
  4. [SPARK-33015][SQL][FOLLOWUP][3.0] Use millisToDays() in the ComputeCu…

    …rrentTime rule
    
    ### What changes were proposed in this pull request?
    Use `millisToDays()` instead of `microsToDays()` because the former one is not available in `branch-3.0`.
    
    ### Why are the changes needed?
    To fix the build failure:
    ```
    [ERROR] [Error] /home/jenkins/workspace/spark-branch-3.0-maven-snapshots/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala:85: value microsToDays is not a member of object org.apache.spark.sql.catalyst.util.DateTimeUtils
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running `./build/sbt clean package` and `ComputeCurrentTimeSuite`.
    
    Closes apache#29901 from MaxGekk/fix-current_date-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Sep 29, 2020
    Commit 2160dc5
  5. [SPARK-32901][CORE] Do not allocate memory while spilling UnsafeExter…

    …nalSorter
    
    ### What changes were proposed in this pull request?
    
    This PR changes `UnsafeExternalSorter` to no longer allocate any memory while spilling. In particular it removes the allocation of a new pointer array in `UnsafeInMemorySorter`. Instead the new pointer array is allocated whenever the next record is inserted into the sorter.
    
    ### Why are the changes needed?
    
    Without this change the `UnsafeExternalSorter` could throw an OOM while spilling. The following sequence of events would have triggered an OOM:
    
    1. `UnsafeExternalSorter` runs out of space in its pointer array and attempts to allocate a new large array to replace the old one.
    2. `TaskMemoryManager` tries to allocate the memory backing the new large array using `MemoryManager`, but `MemoryManager` is only willing to return most but not all of the memory requested.
    3. `TaskMemoryManager` asks `UnsafeExternalSorter` to spill, which causes `UnsafeExternalSorter` to spill the current run to disk, to free its record pages and to reset its `UnsafeInMemorySorter`.
    4. `UnsafeInMemorySorter` frees the old pointer array, and tries to allocate a new small pointer array.
    5. `TaskMemoryManager` tries to allocate the memory backing the small array using `MemoryManager`, but `MemoryManager` is unwilling to give it any memory, as the `TaskMemoryManager` is still holding on to the memory it got for the new large array.
    6. `TaskMemoryManager` again asks `UnsafeExternalSorter` to spill, but this time there is nothing to spill.
    7. `UnsafeInMemorySorter` receives less memory than it requested, and causes a `SparkOutOfMemoryError` to be thrown, which causes the current task to fail.
    
    With the changes in the PR the following will happen instead:
    
    1. `UnsafeExternalSorter` runs out of space in its pointer array and attempts to allocate a new large array to replace the old one.
    2. `TaskMemoryManager` tries to allocate the memory backing the new large array using `MemoryManager`, but `MemoryManager` is only willing to return most but not all of the memory requested.
    3. `TaskMemoryManager` asks `UnsafeExternalSorter` to spill, which causes `UnsafeExternalSorter` to spill the current run to disk, to free its record pages and to reset its `UnsafeInMemorySorter`.
    4. `UnsafeInMemorySorter` frees the old pointer array.
    5. `TaskMemoryManager` returns control to `UnsafeExternalSorter.growPointerArrayIfNecessary` (either by returning the new large array or by throwing a `SparkOutOfMemoryError`).
    6. `UnsafeExternalSorter` either frees the new large array or it ignores the `SparkOutOfMemoryError` depending on what happened in the previous step.
    7. `UnsafeExternalSorter` successfully allocates a new small pointer array and operation continues as normal.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Tests were added in `UnsafeExternalSorterSuite` and `UnsafeInMemorySorterSuite`.
    
    Closes apache#29785 from tomvanbussel/SPARK-32901.
    
    Authored-by: Tom van Bussel <tom.vanbussel@databricks.com>
    Signed-off-by: herman <herman@databricks.com>
    (cherry picked from commit f167002)
    Signed-off-by: herman <herman@databricks.com>
    tomvanbussel authored and hvanhovell committed Sep 29, 2020
    Commit d3cc564
  6. [MINOR][DOCS] Fixing log message for better clarity

    Fixing log message for better clarity.
    
    Closes apache#29870 from akshatb1/master.
    
    Lead-authored-by: Akshat Bordia <akshat.bordia31@gmail.com>
    Co-authored-by: Akshat Bordia <akshat.bordia@citrix.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit 7766fd1)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    2 people authored and srowen committed Sep 29, 2020
    Commit 39bfae2
  7. [SPARK-33018][SQL] Fix estimate statistics issue if child has 0 bytes

    ### What changes were proposed in this pull request?
    
    This pr fix estimate statistics issue if child has 0 bytes.
    
    ### Why are the changes needed?
    The `sizeInBytes` can be `0` when AQE and CBO are enabled (`spark.sql.adaptive.enabled`=true, `spark.sql.cbo.enabled`=true and `spark.sql.cbo.planStats.enabled`=true). This will generate an incorrect BroadcastJoin, resulting in a driver OOM. For example:
    ![SPARK-33018](https://user-images.githubusercontent.com/5399861/94457606-647e3d00-01e7-11eb-85ee-812ae6efe7bb.jpg)
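
    For reference, the configuration combination under which the zero-byte estimate described above can surface:

    ```scala
    // The settings quoted above, set programmatically on an existing SparkSession.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.cbo.enabled", "true")
    spark.conf.set("spark.sql.cbo.planStats.enabled", "true")
    ```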
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manual test.
    
    Closes apache#29894 from wangyum/SPARK-33018.
    
    Authored-by: Yuming Wang <yumwang@ebay.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 711d8dd)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    wangyum authored and cloud-fan committed Sep 29, 2020
    Commit ae8b35a
  8. [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.al…

    …gorithm.version=1 by default
    
    ### What changes were proposed in this pull request?
    
    Apache Spark 3.1's default Hadoop profile is `hadoop-3.2`. Instead of only documenting a warning, this PR aims to use a consistent and safer version of the Apache Hadoop file output committer algorithm, which is `v1`. This will prevent a silent correctness regression during migration from Apache Spark 2.4/3.0 to Apache Spark 3.1.0. Of course, if there is a user-provided configuration, `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2`, that will still be used.
    
    ### Why are the changes needed?
    
    Apache Spark provides multiple distributions with Hadoop 2.7 and Hadoop 3.2. `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` depends on the Hadoop version. Apache Hadoop 3.0 switches the default algorithm from `v1` to `v2` and now there exists a discussion to remove `v2`. We had better provide a consistent default behavior of `v1` across various Spark distributions.
    
    - [MAPREDUCE-7282](https://issues.apache.org/jira/browse/MAPREDUCE-7282) MR v2 commit algorithm should be deprecated and not the default
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. This changes the default behavior. Users can override this conf.
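
    A hedged example of how a user could opt back into the previous behaviour (the setting name is as quoted above):

    ```scala
    import org.apache.spark.sql.SparkSession

    // Explicitly request the v2 committer if its performance trade-off is preferred.
    val spark = SparkSession.builder()
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()
    ```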
    
    ### How was this patch tested?
    
    Manual.
    
    **BEFORE (spark-3.0.1-bin-hadoop3.2)**
    ```scala
    scala> sc.version
    res0: String = 3.0.1
    
    scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version")
    res1: String = 2
    ```
    
    **AFTER**
    ```scala
    scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version")
    res0: String = 1
    ```
    
    Closes apache#29895 from dongjoon-hyun/SPARK-DEFAUT-COMMITTER.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit cc06266)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Sep 29, 2020
    Commit f3b80f8

Commits on Sep 30, 2020

  1. [SPARK-31753][SQL][DOCS][FOLLOW-UP] Add missing keywords in the SQL docs

    ### What changes were proposed in this pull request?
    Update the sql-ref docs; the following keywords will be added in this PR (an illustrative DDL follows the list).
    
    CLUSTERED BY
    SORTED BY
    INTO num_buckets BUCKETS
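
    An illustrative DDL using these keywords (hypothetical table and column names, not taken from the docs change itself; assumes an active SparkSession `spark`):

    ```scala
    spark.sql("""
      CREATE TABLE students (id INT, name STRING)
      USING parquet
      CLUSTERED BY (id) SORTED BY (name) INTO 4 BUCKETS
    """)
    ```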
    
    ### Why are the changes needed?
    Let more users know how to use these SQL keywords.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    ![image](https://user-images.githubusercontent.com/46367746/94428281-0a6b8080-01c3-11eb-9ff3-899f8da602ca.png)
    ![image](https://user-images.githubusercontent.com/46367746/94428285-0d667100-01c3-11eb-8a54-90e7641d917b.png)
    ![image](https://user-images.githubusercontent.com/46367746/94428288-0f303480-01c3-11eb-9e1d-023538aa6e2d.png)
    
    ### How was this patch tested?
    generate html test
    
    Closes apache#29883 from GuoPhilipse/add-sql-missing-keywords.
    
    Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
    Co-authored-by: GuoPhilipse <guofei_ok@126.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 3bdbb55)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    2 people authored and maropu committed Sep 30, 2020
    Commit db6ba04

Commits on Oct 1, 2020

  1. [SQL][DOC][MINOR] Corrects input table names in the examples of CREAT…

    …E FUNCTION doc
    
    ### What changes were proposed in this pull request?
    Fix Typo
    
    ### Why are the changes needed?
    To maintain consistency: the correct table name should be used in the SELECT command.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. Now the CREATE FUNCTION doc shows the correct table name.
    
    ### How was this patch tested?
    Manually. Doc changes.
    
    Closes apache#29920 from iRakson/fixTypo.
    
    Authored-by: iRakson <raksonrakesh@gmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit d3dbe1a)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    iRakson authored and maropu committed Oct 1, 2020
    Commit bc29602

Commits on Oct 2, 2020

  1. [SPARK-32996][WEB-UI][3.0] Handle empty ExecutorMetrics in ExecutorMe…

    …tricsJsonSerializer
    
    ### What changes were proposed in this pull request?
    This is a backport PR for branch-3.0. This change was raised to the `master` branch in apache#29872.
    
    When `peakMemoryMetrics` in `ExecutorSummary` is `Option.empty`, the `ExecutorMetricsJsonSerializer#serialize` method does not execute the `jsonGenerator.writeObject` method. This causes the JSON to be generated with the `peakMemoryMetrics` key added to the serialized string, but no corresponding value.
    This causes an error to be thrown when it is the next key `attributes`' turn to be added to the JSON:
    `com.fasterxml.jackson.core.JsonGenerationException: Can not write a field name, expecting a value`
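
    A hedged sketch of the kind of fix, using a generic custom Jackson serializer (illustrative only, not Spark's actual `ExecutorMetricsJsonSerializer`):

    ```scala
    import com.fasterxml.jackson.core.JsonGenerator
    import com.fasterxml.jackson.databind.{JsonSerializer, SerializerProvider}

    // Always emit a value (null for None) so the generator never ends up with a
    // field name that has no matching value.
    class OptionSerializer extends JsonSerializer[Option[AnyRef]] {
      override def serialize(value: Option[AnyRef], gen: JsonGenerator,
          serializers: SerializerProvider): Unit = value match {
        case Some(v) => gen.writeObject(v)
        case None    => gen.writeNull()
      }
    }
    ```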
    
    ### Why are the changes needed?
    At the start of the Spark job, if `peakMemoryMetrics` is `Option.empty`, then it causes
    a `com.fasterxml.jackson.core.JsonGenerationException` to be thrown when we navigate to the Executors tab in Spark UI.
    Complete stacktrace:
    
    > com.fasterxml.jackson.core.JsonGenerationException: Can not write a field name, expecting a value
    > 	at com.fasterxml.jackson.core.JsonGenerator._reportError(JsonGenerator.java:2080)
    > 	at com.fasterxml.jackson.core.json.WriterBasedJsonGenerator.writeFieldName(WriterBasedJsonGenerator.java:161)
    > 	at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:725)
    > 	at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:721)
    > 	at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:166)
    > 	at com.fasterxml.jackson.databind.ser.std.CollectionSerializer.serializeContents(CollectionSerializer.java:145)
    > 	at com.fasterxml.jackson.module.scala.ser.IterableSerializer.serializeContents(IterableSerializerModule.scala:26)
    > 	at com.fasterxml.jackson.module.scala.ser.IterableSerializer.serializeContents$(IterableSerializerModule.scala:25)
    > 	at com.fasterxml.jackson.module.scala.ser.UnresolvedIterableSerializer.serializeContents(IterableSerializerModule.scala:54)
    > 	at com.fasterxml.jackson.module.scala.ser.UnresolvedIterableSerializer.serializeContents(IterableSerializerModule.scala:54)
    > 	at com.fasterxml.jackson.databind.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:250)
    > 	at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
    > 	at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
    > 	at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:4094)
    > 	at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:3404)
    > 	at org.apache.spark.ui.exec.ExecutorsPage.allExecutorsDataScript$1(ExecutorsTab.scala:64)
    > 	at org.apache.spark.ui.exec.ExecutorsPage.render(ExecutorsTab.scala:76)
    > 	at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89)
    > 	at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80)
    > 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
    > 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    > 	at org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
    > 	at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
    > 	at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
    > 	at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
    > 	at org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
    > 	at org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
    > 	at org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
    > 	at org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
    > 	at org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
    > 	at org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
    > 	at org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
    > 	at org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    > 	at org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)
    > 	at org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
    > 	at org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    > 	at org.sparkproject.jetty.server.Server.handle(Server.java:505)
    > 	at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:370)
    > 	at org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
    > 	at org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
    > 	at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:103)
    > 	at org.sparkproject.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
    > 	at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    > 	at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
    > 	at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    > 	at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
    > 	at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
    > 	at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
    > 	at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
    > 	at java.base/java.lang.Thread.run(Thread.java:834)
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Unit test
    
    Closes apache#29914 from shrutig/SPARK-32996-3.0.
    
    Authored-by: Shruti Gumma <shruti_gumma@apple.com>
    Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
    shrutig authored and viirya committed Oct 2, 2020
    Commit 41e1919
  2. [SPARK-33051][INFRA][R] Uses setup-r to install R in GitHub Actions b…

    …uild
    
    ### What changes were proposed in this pull request?
    
    In SPARK-32493, the R installation was switched to manual installation because setup-r was broken. This seems to be fixed upstream, so we had better switch it back.
    
    ### Why are the changes needed?
    
    To avoid maintaining the installation steps ourselves.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    GitHub Actions build in this PR should test it.
    
    Closes apache#29931 from HyukjinKwon/recover-r-build.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit b205be5)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Oct 2, 2020
    Commit 31684d6

Commits on Oct 3, 2020

  1. [SPARK-33043][ML] Handle spark.driver.maxResultSize=0 in RowMatrix he…

    …uristic computation
    
    ### What changes were proposed in this pull request?
    
    RowMatrix contains a computation based on spark.driver.maxResultSize. However, when this value is set to 0, the computation fails (log of 0). The fix is simply to correctly handle this setting, which means unlimited result size, by using a tree depth of 1 in the RowMatrix method.
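
    A hedged sketch of the guard (an illustrative heuristic, not the exact RowMatrix formula):

    ```scala
    // Treat maxResultSize == 0 (unlimited) specially instead of feeding it into a log.
    def treeAggregateDepth(maxResultSizeBytes: Long, estimatedResultBytes: Long): Int =
      if (maxResultSizeBytes <= 0L) {
        1  // unlimited driver result size: a flat aggregation is fine
      } else {
        val ratio = estimatedResultBytes.toDouble / maxResultSizeBytes
        math.max(1, math.ceil(math.log(ratio) / math.log(2)).toInt)
      }
    ```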
    
    ### Why are the changes needed?
    
    Simple bug fix to make several Spark ML functions which use RowMatrix run correctly in this case.
    
    ### Does this PR introduce _any_ user-facing change?
    
    None other than the bug fix, of course.
    
    ### How was this patch tested?
    
    Existing RowMatrix tests plus a new test.
    
    Closes apache#29925 from srowen/SPARK-33043.
    
    Authored-by: Sean Owen <srowen@gmail.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit f86171a)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    srowen committed Oct 3, 2020
    Commit c9b6271

Commits on Oct 4, 2020

  1. [SPARK-33065][TESTS] Expand the stack size of a thread in a test in L…

    …ocalityPlacementStrategySuite for Java 11 with sbt
    
    ### What changes were proposed in this pull request?
    
    This PR fixes an issue that a test in `LocalityPlacementStrategySuite` fails with Java 11 due to `StackOverflowError`.
    
    ```
    [info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (170 milliseconds)
    [info]   StackOverflowError should not be thrown; however, got:
    [info]
    [info]   java.lang.StackOverflowError
    [info]          at java.base/java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1012)
    [info]          at java.base/java.util.concurrent.ConcurrentHashMap.putIfAbsent(ConcurrentHashMap.java:1541)
    [info]          at java.base/java.lang.ClassLoader.getClassLoadingLock(ClassLoader.java:668)
    [info]          at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:591)
    [info]          at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579)
    [info]          at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    [info]          at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    ```
    
    The solution is to expand the stack size of a thread in the test from 32KB to 256KB.
    Currently, the stack size is specified as 32KB but the actual stack size can be greater than 32KB.
    According to the HotSpot code, the platform's minimum stack size takes precedence over the specified size.
    
    Java 8: https://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/c92ba514724d/src/os/linux/vm/os_linux.cpp#l900
    Java 11: https://hg.openjdk.java.net/jdk-updates/jdk11u/file/73edf743a93a/src/hotspot/os/posix/os_posix.cpp#l1555
    
    For Linux on x86_64, the minimum stack size seems to be 224KB and 136KB for Java 8 and Java 11 respectively. So, the actual stack size should be 224KB rather than 32KB for Java 8 on x86_64/Linux.
    As the test passes for Java 8 but doesn't for Java 11, 224KB is enough while 136KB is not.
    So I think specifying 256KB is reasonable for the new stack size.
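
    A minimal sketch of how a thread's stack size is requested (the `Runnable` body is a placeholder, not the actual test code):

    ```scala
    // The fourth constructor argument is the requested stack size in bytes;
    // the JVM may silently round it up to its platform minimum.
    val body = new Runnable { override def run(): Unit = { /* recursive test body */ } }
    val requestedStackSize: Long = 256 * 1024
    val thread = new Thread(null, body, "allocation-test-thread", requestedStackSize)
    thread.start()
    thread.join()
    ```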
    
    ### Why are the changes needed?
    
    To pass the test for Java 11.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Following command with Java 11.
    ```
    build/sbt -Pyarn clean package "testOnly org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite"
    ```
    
    Closes apache#29943 from sarutak/fix-stack-size.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit fab5321)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    sarutak authored and dongjoon-hyun committed Oct 4, 2020
    Commit 75003fc

Commits on Oct 6, 2020

  1. [SPARK-33069][INFRA] Skip test result report if no JUnit XML files ar…

    …e found
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to skip test reporting ("Report test results") if no JUnit XML files are found.
    
    Currently, we're running and skipping the tests dynamically. For example,
    - if there are only changes in SparkR at the underlying commit, it only runs the SparkR tests, skips the other tests, and generates JUnit XML files for the SparkR test cases.
    - if there are only changes in `docs` at the underlying commit, the build skips all tests except linters and does not generate any JUnit XML files.
    
    When test reporting ("Report test results") job is triggered after the main build ("Build and test
    ") is finished, and there are no JUnit XML files found, it reports the case as a failure. See https://github.com/apache/spark/runs/1196184007 as an example.
    
    This PR works around it by simply skipping the testing report when there are no JUnit XML files are found.
    Please see apache#29906 (comment) for more details.
    
    ### Why are the changes needed?
    
    To avoid false alarm for test results.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    Manually tested in my fork.
    
    Positive case:
    
    https://github.com/HyukjinKwon/spark/runs/1208624679?check_suite_focus=true
    https://github.com/HyukjinKwon/spark/actions/runs/288996327
    
    Negative case:
    
    https://github.com/HyukjinKwon/spark/runs/1208229838?check_suite_focus=true
    https://github.com/HyukjinKwon/spark/actions/runs/289000058
    
    Closes apache#29946 from HyukjinKwon/test-junit-files.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit a0aa8f3)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Oct 6, 2020
    Commit 46a62ca
  2. [SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conve…

    …rsion failures
    
    ### What changes were proposed in this pull request?
    
    This improves error handling when a failure in conversion from Pandas to Arrow occurs. And fixes tests to be compatible with upcoming Arrow 2.0.0 release.
    
    ### Why are the changes needed?
    
    Current tests will fail with Arrow 2.0.0 because of a change in the error message when the schema is invalid. For these cases, the current error message also includes information on disabling the safe conversion config, which is mainly meant for floating point truncation and overflow. The tests have been updated to use a message that is shown for both past and upcoming Arrow versions.
    
    If the user enters an invalid schema, the error produced by pyarrow is not consistent and is either `TypeError` or `ArrowInvalid`, with the latter being caught and re-raised as a `RuntimeError` with the extra info.
    
    The error handling is improved by:
    
    - narrowing the exception type to `TypeError`s, of which `ArrowInvalid` is a subclass and which is what is raised on safe conversion failures.
    - The exception is only raised with additional information on disabling "spark.sql.execution.pandas.convertToArrowArraySafely" if it is enabled in the first place.
    - The original exception is chained to better show it to the user.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the error re-raised changes from a RuntimeError to a ValueError, which better categorizes this type of error and in-line with the original Arrow error.
    
    ### How was this patch tested?
    
    Existing tests, using pyarrow 1.0.1 and 2.0.0-snapshot
    
    Closes apache#29951 from BryanCutler/arrow-better-handle-pandas-errors-SPARK-33073.
    
    Authored-by: Bryan Cutler <cutlerb@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 0812d6c)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    BryanCutler authored and HyukjinKwon committed Oct 6, 2020
    Commit 4f71231
  3. [SPARK-27428][CORE][TEST] Increase receive buffer size used in Statsd…

    …SinkSuite
    
    ### What changes were proposed in this pull request?
    
    Increase size of socket receive buffer in these tests.
    
    ### Why are the changes needed?
    
    The socket receive buffer size set in this test was too small for
    the StatsdSinkSuite tests to run reliably on some systems. For a
    test in this suite to run reliably the buffer needs to be large
    enough to hold all the data in the packets being sent in a test
    along with any additional kernel or protocol overhead. The amount
    of kernel overhead per packet can vary from system to system but is
    typically far higher than the protocol overhead.
    
    If the receive buffer is too small and fills up then packets are
    silently dropped. This leads to the test failing with a timeout.
    
    If the socket defaults to a larger receive buffer (normally true)
    then we should keep that size.
    
    As well as increasing the minimum buffer size I've also decoupled
    the datagram packet buffer size from the receive buffer size. The
    receive buffer should in general be far larger to account for the
    fact that multiple packets might be buffered, as well as the
    aforementioned overhead. Any truncated data in individual packets
    will be picked up by the tests.
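
    A hedged sketch of the pattern described above (the sizes are illustrative, not the suite's actual values):

    ```scala
    import java.net.{DatagramPacket, DatagramSocket}

    val socket = new DatagramSocket()
    val minReceiveBufferSize = 4 * 1024 * 1024
    // Only grow the receive buffer; keep a larger OS default if one is already set.
    if (socket.getReceiveBufferSize < minReceiveBufferSize) {
      socket.setReceiveBufferSize(minReceiveBufferSize)  // a hint; the kernel may cap it
    }
    // The per-packet buffer is decoupled from (and much smaller than) the receive buffer.
    val packetBuffer = new Array[Byte](8192)
    val packet = new DatagramPacket(packetBuffer, packetBuffer.length)
    ```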
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, this only affects the tests.
    
    ### How was this patch tested?
    Existing tests on IBM Z and x86.
    
    Closes apache#29819 from mundaym/fix-statsd.
    
    Authored-by: Michael Munday <mike.munday@ibm.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit b5e4b8c)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    mundaym authored and srowen committed Oct 6, 2020
    Commit d51b8d6

Commits on Oct 7, 2020

  1. Revert "[SPARK-33073][PYTHON] Improve error handling on Pandas to Arr…

    …ow conversion failures"
    
    This reverts commit 4f71231.
    HyukjinKwon committed Oct 7, 2020
    Commit 2076abc
  2. [SPARK-33035][SQL][3.0] Updates the obsoleted entries of attribute ma…

    …pping in QueryPlan#transformUpWithNewOutput
    
    ### What changes were proposed in this pull request?
    
    This PR intends to fix corner-case bugs in the `QueryPlan#transformUpWithNewOutput` that is used to propagate updated `ExprId`s in a bottom-up way. Let's say we have a rule to simply assign new `ExprId`s in a projection list like this;
    ```scala
    case class TestRule extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUpWithNewOutput {
        case p @ Project(projList, _) =>
          val newPlan = p.copy(projectList = projList.map { _.transform {
            // Assigns a new `ExprId` for references
            case a: AttributeReference => Alias(a, a.name)()
          }}.asInstanceOf[Seq[NamedExpression]])
    
          val attrMapping = p.output.zip(newPlan.output)
          newPlan -> attrMapping
      }
    }
    ```
    Then, this rule is applied into a plan below;
    ```scala
    (3) Project [a#5, b#6]
    +- (2) Project [a#5, b#6]
       +- (1) Project [a#5, b#6]
          +- LocalRelation <empty>, [a#5, b#6]
    ```
    In the first transformation, the rule assigns new `ExprId`s in `(1) Project` (e.g., a#5 AS a#7, b#6 AS b#8). In the second transformation, the rule corrects the input references of `(2) Project`  first by using attribute mapping given from `(1) Project` (a#5->a#7 and b#6->b#8) and then assigns new `ExprId`s (e.g., a#7 AS a#9, b#8 AS b#10). But, in the third transformation, the rule fails because it tries to correct the references of `(3) Project` by using incorrect attribute mapping (a#7->a#9 and b#8->b#10) even though the correct one is a#5->a#9 and b#6->b#10. To fix this issue, this PR modified the code to update the attribute mapping entries that are obsoleted by generated entries in a given rule.
    
    This is the backport of apache#29911.
    
    ### Why are the changes needed?
    
    bugfix.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added tests in `QueryPlanSuite`.
    
    Closes apache#29953 from maropu/SPARK-33035-FOLLOWUP.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    maropu authored and dongjoon-hyun committed Oct 7, 2020
    Commit 23207fc
  3. [SPARK-33073][PYTHON][3.0] Improve error handling on Pandas to Arrow …

    …conversion failures
    
    ### What changes were proposed in this pull request?
    
    This improves error handling when a failure in conversion from Pandas to Arrow occurs. And fixes tests to be compatible with upcoming Arrow 2.0.0 release.
    
    ### Why are the changes needed?
    
    Current tests will fail with Arrow 2.0.0 because of a change in the error message when the schema is invalid. For these cases, the current error message also includes information on disabling the safe conversion config, which is mainly meant for floating point truncation and overflow. The tests have been updated to use a message that is shown for both past and upcoming Arrow versions.
    
    If the user enters an invalid schema, the error produced by pyarrow is not consistent and is either `TypeError` or `ArrowInvalid`, with the latter being caught and re-raised as a `RuntimeError` with the extra info.
    
    The error handling is improved by:
    
    - narrowing the exception type to `TypeError`s, of which `ArrowInvalid` is a subclass and which is what is raised on safe conversion failures.
    - The exception is only raised with additional information on disabling "spark.sql.execution.pandas.convertToArrowArraySafely" if it is enabled in the first place.
    - The original exception is chained to better show it to the user (only for Spark 3.1+ which requires Python 3)
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the error re-raised changes from a RuntimeError to a ValueError, which better categorizes this type of error and in-line with the original Arrow error.
    
    ### How was this patch tested?
    
    Existing tests, using pyarrow 1.0.1 and 2.0.0-snapshot, and Python 2 with 0.15.1
    
    Closes apache#29962 from BryanCutler/arrow-better-handle-pandas-errors-SPARK-33073-branch-3.0.
    
    Authored-by: Bryan Cutler <cutlerb@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    BryanCutler authored and HyukjinKwon committed Oct 7, 2020
    Commit 7981f67
  4. [SPARK-32067][K8S] Use unique ConfigMap name for executor pod template

    ### What changes were proposed in this pull request?
    
    The pod template configmap always had the same name. This PR makes it unique.
    
    ### Why are the changes needed?
    
    If you schedule 2 Spark jobs, they will both use the same configmap name, which will result in conflicts. This PR fixes that.
    
    **BEFORE**
    ```
    $ kubectl get cm --all-namespaces -w | grep podspec
    podspec-configmap                              1      65s
    ```
    
    **AFTER**
    ```
    $ kubectl get cm --all-namespaces -w | grep podspec
    aaece65ef82e4a30b7b7800aad600d4f   spark-test-app-aac9f37502b2ca55-driver-podspec-conf-map   1      0s
    ```
    
    This can be seen when running the integration tests
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Unit tests and the integration tests test if this works
    
    Closes apache#29934 from stijndehaes/bugfix/SPARK-32067-unique-name-for-template-configmap.
    
    Authored-by: Stijn De Haes <stijndehaes@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 3099fd9)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    stijndehaes authored and dongjoon-hyun committed Oct 7, 2020
    Commit 45475af

Commits on Oct 8, 2020

  1. [SPARK-33089][SQL] make avro format propagate Hadoop config from DS o…

    …ptions to underlying HDFS file system
    
    ### What changes were proposed in this pull request?
    
    In `AvroUtils`'s `inferSchema()`, propagate Hadoop config from DS options to underlying HDFS file system.
    
    ### Why are the changes needed?
    
    There is a bug that when running:
    ```scala
    spark.read.format("avro").options(conf).load(path)
    ```
    The underlying file system will not receive the `conf` options.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    unit test added
    
    Closes apache#29971 from yuningzh-db/avro_options.
    
    Authored-by: Yuning Zhang <yuning.zhang@databricks.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit bbc887b)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    yuningzh-db authored and HyukjinKwon committed Oct 8, 2020
    Commit a7e4318
  2. [SPARK-33091][SQL] Avoid using map instead of foreach to avoid potent…

    …ial side effect at callers of OrcUtils.readCatalystSchema
    
    ### What changes were proposed in this pull request?
    
    This is a kind of followup of SPARK-32646. A new JIRA was filed to control the fixed versions properly.

    When you use `map`, it might be lazily evaluated and not executed. To avoid this, we had better use `foreach`. See also SPARK-16694. The current code does not look like it causes any bug for now, but it is best to fix it to avoid potential issues.
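
    A small illustration of the pitfall with a lazy collection (plain Scala, not the `OrcUtils` call sites themselves):

    ```scala
    val items = Seq(1, 2, 3).view   // a lazy view: transformations are deferred
    var sum = 0
    items.map(x => sum += x)        // lazy: the body may never run unless the view is forced
    println(sum)                    // 0
    items.foreach(x => sum += x)    // eager: the side effect runs immediately
    println(sum)                    // 6
    ```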
    
    ### Why are the changes needed?
    
    To avoid potential issues from `map` being lazy and not executed.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Ran related tests. CI in this PR should verify.
    
    Closes apache#29974 from HyukjinKwon/SPARK-32646.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 5effa8e)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    HyukjinKwon authored and maropu committed Oct 8, 2020
    Commit 782ab8e
  3. [SPARK-33096][K8S] Use LinkedHashMap instead of Map for newlyCreatedE…

    …xecutors
    
    ### What changes were proposed in this pull request?
    
    This PR aims to use `LinkedHashMap` instead of `Map` for `newlyCreatedExecutors`.
    
    ### Why are the changes needed?
    
    This makes log messages (INFO/DEBUG) more readable. This is helpful when `spark.kubernetes.allocation.batch.size` is large and especially when K8s dynamic allocation is used.
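
    A small illustration of why `LinkedHashMap` helps here (toy data, not the actual allocator state):

    ```scala
    import scala.collection.mutable

    // LinkedHashMap preserves insertion order, so iterating over the pending executor
    // ids logs them as 1,2,...,10 instead of in hash order.
    val newlyCreatedExecutors = mutable.LinkedHashMap[Long, Long]()
    (1L to 10L).foreach(id => newlyCreatedExecutors(id) = System.currentTimeMillis())
    println(newlyCreatedExecutors.keys.mkString(","))  // 1,2,3,4,5,6,7,8,9,10
    ```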
    
    **BEFORE**
    ```
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 8 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 2 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 5 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 4 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 7 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 10 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 9 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 3 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 6 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:24:21 INFO ExecutorPodsAllocator: Deleting 9 excess pod requests (5,10,6,9,2,7,3,8,4).
    ```
    
    **AFTER**
    ```
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 2 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 3 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 4 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 5 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 6 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 7 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 8 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 9 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 10 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
    20/10/08 10:25:17 INFO ExecutorPodsAllocator: Deleting 9 excess pod requests (2,3,4,5,6,7,8,9,10).
    ```
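
    The ordering difference comes purely from the map implementation; a minimal sketch of the behaviour in plain Scala (not the Spark code):
    ```scala
    import scala.collection.mutable

    // LinkedHashMap preserves insertion order, so iterating over executor ids
    // yields them in creation order; a plain HashMap gives an arbitrary order.
    val plain  = mutable.HashMap.empty[Long, String]
    val linked = mutable.LinkedHashMap.empty[Long, String]
    (2L to 10L).foreach { id =>
      plain(id)  = s"executor-$id"
      linked(id) = s"executor-$id"
    }
    println(plain.keys.mkString(","))   // arbitrary order, e.g. 5,10,6,9,2,7,3,8,4
    println(linked.keys.mkString(","))  // insertion order: 2,3,4,5,6,7,8,9,10
    ```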
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CI or `build/sbt -Pkubernetes "kubernetes/test"`
    
    Closes apache#29979 from dongjoon-hyun/SPARK-K8S-LOG.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 4987db8)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Oct 8, 2020
    Commit: c1b660e

Commits on Oct 9, 2020

  1. [SPARK-33101][ML][3.0] Make LibSVM format propagate Hadoop config fro…

    …m DS options to underlying HDFS file system
    
    ### What changes were proposed in this pull request?
    Propagate LibSVM options to Hadoop configs in the LibSVM datasource.
    
    ### Why are the changes needed?
    There is a bug that when running:
    ```scala
    spark.read.format("libsvm").options(conf).load(path)
    ```
    The underlying file system will not receive the `conf` options.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, for example, users should be able to read files from Azure Data Lake successfully:
    ```scala
    def hadoopConf1() = Map[String, String](
      s"fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
      s"fs.adl.oauth2.client.id" -> dbutils.secrets.get(scope = "...", key = "..."),
      s"fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "...", key = "..."),
      s"fs.adl.oauth2.refresh.url" -> s"https://login.microsoftonline.com/.../oauth2/token")
    val df = spark.read.format("libsvm").options(hadoopConf1).load("adl://....azuredatalakestore.net/foldersp1/...")
    ```
    and not get the following exception because the settings above are not propagated to the filesystem:
    ```java
    java.lang.IllegalArgumentException: No value for fs.adl.oauth2.access.token.provider found in conf file.
    	at ....adl.AdlFileSystem.getNonEmptyVal(AdlFileSystem.java:820)
    	at ....adl.AdlFileSystem.getCustomAccessTokenProvider(AdlFileSystem.java:220)
    	at ....adl.AdlFileSystem.getAccessTokenProvider(AdlFileSystem.java:257)
    	at ....adl.AdlFileSystem.initialize(AdlFileSystem.java:164)
    	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    ```
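
    Conceptually, "propagating DS options" means handing each option over to the Hadoop configuration that the file system is built from; a simplified sketch (assumed shape, not the actual Spark code):
    ```scala
    import org.apache.hadoop.conf.Configuration

    // Copy every datasource option onto a Hadoop Configuration so that the
    // underlying FileSystem sees settings such as the ADLS credentials above.
    def withOptions(base: Configuration, options: Map[String, String]): Configuration = {
      val conf = new Configuration(base)
      options.foreach { case (k, v) => conf.set(k, v) }
      conf
    }
    ```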
    
    ### How was this patch tested?
    Added UT to `LibSVMRelationSuite`.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 1234c66)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#29986 from MaxGekk/ml-option-propagation-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Oct 9, 2020
    Commit: dcffa56
  2. [SPARK-33094][SQL][3.0] Make ORC format propagate Hadoop config from …

    …DS options to underlying HDFS file system
    
    ### What changes were proposed in this pull request?
    Propagate ORC options to Hadoop configs in Hive `OrcFileFormat` and in the regular ORC datasource.
    
    ### Why are the changes needed?
    There is a bug that when running:
    ```scala
    spark.read.format("orc").options(conf).load(path)
    ```
    The underlying file system will not receive the conf options.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes
    
    ### How was this patch tested?
    Added UT to `OrcSourceSuite`.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit c5f6af9)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#29985 from MaxGekk/orc-option-propagation-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Oct 9, 2020
    Commit: 9892b3e

Commits on Oct 12, 2020

  1. [SPARK-33118][SQL] CREATE TEMPORARY TABLE fails with location

    We have a problem when you use CREATE TEMPORARY TABLE with LOCATION
    
    ```scala
    spark.range(3).write.parquet("/tmp/testspark1")
    
    sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path '/tmp/testspark1')")
    sql("CREATE TEMPORARY TABLE t USING parquet LOCATION '/tmp/testspark1'")
    ```
    ```scala
    org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
      at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
      at scala.Option.getOrElse(Option.scala:189)
      at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
      at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
      at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
      at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
      at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
      at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
      at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
      at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
      at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
      at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
      at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
      at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
      at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
      at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
      at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
      at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
      at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
      at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
      at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
    ```
    This bug was introduced by SPARK-30507.
    In the SQL parser, `visitCreateTable` --> `visitCreateTableClauses` --> `cleanTableOptions` extracts the path from the options, but in this case `CreateTempViewUsing` needs the path in the options map.
    
    This PR fixes the problem.
    
    No
    
    Unit testing and manual testing
    
    Closes apache#30014 from planga82/bugfix/SPARK-33118_create_temp_table_location.
    
    Authored-by: Pablo <pablo.langa@stratio.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 819f12e)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    pablolanga-stratio authored and dongjoon-hyun committed Oct 12, 2020
    Commit: 0601fc7

Commits on Oct 13, 2020

  1. [SPARK-33115][BUILD][DOCS] Fix javadoc errors in kvstore and `unsaf…

    …e` modules
    
    ### What changes were proposed in this pull request?
    
    Fix Javadoc generation errors in `kvstore` and `unsafe` modules according to error message hints.
    
    ### Why are the changes needed?
    
    Fixes `doc` task failures which prevented other tasks successful executions (eg `publishLocal` task depends on `doc` task).
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    The meaning of the Javadoc text stays the same.
    
    ### How was this patch tested?
    
    Run `build/sbt kvstore/Compile/doc`, `build/sbt unsafe/Compile/doc` and `build/sbt doc` without errors.
    
    Closes apache#30007 from gemelen/feature/doc-task-fix.
    
    Authored-by: Denis Pyshev <git@gemelen.net>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 1b0875b)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    gemelen authored and HyukjinKwon committed Oct 13, 2020
    Commit: 9430ae6

Commits on Oct 14, 2020

  1. [SPARK-33134][SQL][3.0] Return partial results only for root JSON obj…

    …ects
    
    ### What changes were proposed in this pull request?
    In the PR, I propose to restrict the partial result feature to root JSON objects only. The JSON datasource as well as `from_json()` will return `null` for malformed nested JSON objects.
    
    ### Why are the changes needed?
    1. To not raise an exception to users in the PERMISSIVE mode
    2. To fix a regression and to have the same behavior as Spark 2.4.x
    3. The current implementation of partial results is supposed to work only for root (top-level) JSON objects, and it is not tested for malformed nested complex JSON fields.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. Before the changes, the code below:
    ```scala
        val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 123456}]""").toDF("events")
        val event = new StructType().add("playerId", LongType).add("cards", ArrayType(new StructType().add("id", LongType).add("rank", StringType)))
        val pokerhand_events = pokerhand_raw.select(from_json($"events", ArrayType(event)).as("event"))
        pokerhand_events.show
    ```
    throws the exception even in the default **PERMISSIVE** mode:
    ```java
    java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
      at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
      at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
      at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
    ```
    
    After the changes:
    ```
    +-----+
    |event|
    +-----+
    | null|
    +-----+
    ```
    
    ### How was this patch tested?
    Added a test to `JsonFunctionsSuite`.
    
    Closes apache#30032 from MaxGekk/json-skip-row-wrong-schema-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Oct 14, 2020
    Commit: 205b65e
  2. [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand…

    ….outputResolved
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to fix a bug where `DataType.equalsIgnoreCompatibleNullability` is called with mistakenly swapped parameters in `V2WriteCommand.outputResolved`. The parameters of `DataType.equalsIgnoreCompatibleNullability` are `from` and `to`, so the right order of the matching variables is `inAttr` and `outAttr`.
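
    To see why the argument order matters, here is a simplified, hypothetical model of the nullability check (not Spark's implementation):
    ```scala
    // Writing data of type `from` into a column of type `to` is fine when the
    // types match and `to` is at least as nullable as `from`.
    case class Col(nullable: Boolean)
    def compatible(from: Col, to: Col): Boolean = !from.nullable || to.nullable

    val nonNullableData = Col(nullable = false)
    val nullableTable   = Col(nullable = true)
    println(compatible(nonNullableData, nullableTable)) // true: correct argument order
    println(compatible(nullableTable, nonNullableData)) // false: swapped order rejects a valid write
    ```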
    
    ### Why are the changes needed?
    
    Spark throws an AnalysisException due to an unresolved operator in v2 writes; the operator is unresolved because the parameters passed to `DataType.equalsIgnoreCompatibleNullability` in `outputResolved` have been swapped.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, end users no longer hit an unresolved-operator error in v2 writes when they try to write a dataframe containing non-nullable complex types to a table whose matching complex types are nullable.
    
    ### How was this patch tested?
    
    New UT added.
    
    Closes apache#30033 from HeartSaVioR/SPARK-33136.
    
    Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 8e5cb1d)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    HeartSaVioR authored and dongjoon-hyun committed Oct 14, 2020
    Commit: 2ebea13

Commits on Oct 15, 2020

  1. [SPARK-33146][CORE] Check for non-fatal errors when loading new appli…

    …cations in SHS
    
    ### What changes were proposed in this pull request?
    
    Adds an additional check for non-fatal errors when attempting to add a new entry to the history server application listing.
    
    ### Why are the changes needed?
    
    A bad rolling event log folder (missing appstatus file or no log files) would cause no applications to be loaded by the Spark history server. Figuring out why invalid event log folders are created in the first place will be addressed in separate issues, this just lets the history server skip the invalid folder and successfully load all the valid applications.
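
    The shape of the change, as a minimal sketch (the helper name and signature are illustrative assumptions, not the actual SHS code):
    ```scala
    import scala.util.control.NonFatal

    // A non-fatal failure on one event-log entry is logged and skipped instead
    // of aborting the whole listing scan.
    def loadAll(entries: Seq[String])(loadOne: String => Unit): Unit =
      entries.foreach { entry =>
        try loadOne(entry)
        catch {
          case NonFatal(e) =>
            println(s"Skipping invalid event log entry $entry: ${e.getMessage}")
        }
      }
    ```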
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    New UT
    
    Closes apache#30037 from Kimahriman/bug/rolling-log-crashing-history.
    
    Authored-by: Adam Binford <adam.binford@radiantsolutions.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    (cherry picked from commit 9ab0ec4)
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Adam Binford authored and HeartSaVioR committed Oct 15, 2020
    Commit: d9669bd
  2. [SPARK-33153][SQL][TESTS] Ignore Spark 2.4 in HiveExternalCatalogVers…

    …ionsSuite on Python 3.8/3.9
    
    ### What changes were proposed in this pull request?
    
    This PR aims to ignore Apache Spark 2.4.x distribution in HiveExternalCatalogVersionsSuite if Python version is 3.8 or 3.9.
    
    ### Why are the changes needed?
    
    Currently, `HiveExternalCatalogVersionsSuite` is broken on the latest OS like `Ubuntu 20.04` because its default Python version is 3.8. PySpark 2.4.x doesn't work on Python 3.8 due to SPARK-29536.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually.
    ```
    $ python3 --version
    Python 3.8.5
    
    $ build/sbt "hive/testOnly *.HiveExternalCatalogVersionsSuite"
    ...
    [info] All tests passed.
    [info] Passed: Total 1, Failed 0, Errors 0, Passed 1
    ```
    
    Closes apache#30044 from dongjoon-hyun/SPARK-33153.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit ec34a00)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Oct 15, 2020
    Commit: 0b7b811
  3. Revert "[SPARK-33146][CORE] Check for non-fatal errors when loading n…

    …ew applications in SHS"
    
    This reverts commit d9669bd.
    dongjoon-hyun committed Oct 15, 2020
    Commit: e40c147

Commits on Oct 16, 2020

  1. [SPARK-33163][SQL][TESTS] Check the metadata key 'org.apache.spark.le…

    …gacyDateTime' in Avro/Parquet files
    
    ### What changes were proposed in this pull request?
    Added a couple tests to `AvroSuite` and to `ParquetIOSuite` to check that the metadata key 'org.apache.spark.legacyDateTime' is written correctly depending on the SQL configs:
    - spark.sql.legacy.avro.datetimeRebaseModeInWrite
    - spark.sql.legacy.parquet.datetimeRebaseModeInWrite
    
    This is a follow up apache#28137.
    
    ### Why are the changes needed?
    1. To improve test coverage
    2. To make sure that the metadata key is actually saved to Avro/Parquet files
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running the added tests:
    ```
    $ build/sbt "testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite"
    $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV1Suite"
    $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV2Suite"
    ```
    
    Closes apache#30061 from MaxGekk/parquet-test-metakey.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 38c05af)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Oct 16, 2020
    Commit: d0f1120
  2. [SPARK-33165][SQL][TEST] Remove dependencies(scalatest,scalactic) fro…

    …m Benchmark
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to remove `assert` from `Benchmark` to make it easier to run benchmark code via `spark-submit`.
    
    ### Why are the changes needed?
    
    Since the current `Benchmark` (`master` and `branch-3.0`) has `assert`, we need to pass the proper jars of `scalatest` and `scalactic`;
     - scalatest-core_2.12-3.2.0.jar
     - scalatest-compatible-3.2.0.jar
     - scalactic_2.12-3.0.jar
    ```
    ./bin/spark-submit --jars scalatest-core_2.12-3.2.0.jar,scalatest-compatible-3.2.0.jar,scalactic_2.12-3.0.jar,./sql/catalyst/target/spark-catalyst_2.12-3.1.0-SNAPSHOT-tests.jar,./core/target/spark-core_2.12-3.1.0-SNAPSHOT-tests.jar --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark ./sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar --data-location /tmp/tpcds-sf1
    ```
    
    This update lets developers submit benchmark code without these dependencies;
    ```
    ./bin/spark-submit --jars ./sql/catalyst/target/spark-catalyst_2.12-3.1.0-SNAPSHOT-tests.jar,./core/target/spark-core_2.12-3.1.0-SNAPSHOT-tests.jar --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark ./sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar --data-location /tmp/tpcds-sf1
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually checked.
    
    Closes apache#30064 from maropu/RemoveDepInBenchmark.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit a5c17de)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    maropu authored and HyukjinKwon committed Oct 16, 2020
    Commit: 160f458
  3. [SPARK-32761][SQL][3.0] Allow aggregating multiple foldable distinct …

    …expressions
    
    ### What changes were proposed in this pull request?
    For queries with multiple foldable distinct columns, since they will be eliminated during
    execution, it's not mandatory to let `RewriteDistinctAggregates` handle this case. And
    in the current code, `RewriteDistinctAggregates` *does* miss some "aggregating with
    multiple foldable distinct expressions" cases.
    For example: `select count(distinct 2), count(distinct 2, 3)` will be missed.
    
    But in the planner, this will trigger an error that "multiple distinct expressions" are not allowed.
    As the foldable distinct columns can be eliminated finally, we can allow this in the aggregation
    planner check.
    
    ### Why are the changes needed?
    bug fix
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    added test case
    
    Authored-by: Linhong Liu <linhong.liu@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit a410658)
    
    Closes apache#30052 from linhongliu-db/SPARK-32761-3.0.
    
    Authored-by: Linhong Liu <linhong.liu@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    linhongliu-db authored and cloud-fan committed Oct 16, 2020
    Commit: 37d6b3c
  4. [SPARK-33165][SQL][TESTS][FOLLOW-UP] Use scala.Predef.assert instead

    ### What changes were proposed in this pull request?
    
    This PR proposes to use `scala.Predef.assert` instead of `org.scalatest.Assertions.assert` removed at apache#30064
    
    ### Why are the changes needed?
    
    Just to keep the same behaviour.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only
    
    ### How was this patch tested?
    
    Recover the existing asserts.
    
    Closes apache#30065 from HyukjinKwon/SPARK-33165.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Oct 16, 2020
    Commit: 698ac6a
  5. [SPARK-33171][INFRA] Mark ParquetV*FilterSuite/ParquetV*SchemaPruning…

    …Suite as ExtendedSQLTest
    
    ### What changes were proposed in this pull request?
    
    This PR aims to mark the following test suites as `ExtendedSQLTest`.
    - ParquetV1FilterSuite/ParquetV2FilterSuite
    - ParquetV1SchemaPruningSuite/ParquetV2SchemaPruningSuite
    
    ### Why are the changes needed?
    
    Currently, `sql - other tests` is the longest job. This PR will move the above tests to `sql - slow tests` job.
    
    **BEFORE**
    - https://github.com/apache/spark/runs/1264150802 (1 hour 37 minutes)
    
    **AFTER**
    - https://github.com/apache/spark/pull/30068/checks?check_run_id=1265879896 (1 hour 21 minutes)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the Github Action with the reduced time.
    
    Closes apache#30068 from dongjoon-hyun/MOVE3.
    
    Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun and dongjoon-hyun committed Oct 16, 2020
    Commit: b66bd79
  6. [SPARK-32436][CORE] Initialize numNonEmptyBlocks in HighlyCompressedM…

    …apStatus.readExternal
    
    ### What changes were proposed in this pull request?
    
    This PR aims to initialize `numNonEmptyBlocks` in `HighlyCompressedMapStatus.readExternal`.
    
    In Scala 2.12, this is initialized to `-1` via the following.
    ```scala
    protected def this() = this(null, -1, null, -1, null, -1)  // For deserialization only
    ```
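
    For context, a simplified illustration of why the explicit initialization helps (not the real Spark class):
    ```scala
    import java.io.{Externalizable, ObjectInput, ObjectOutput}

    // A field that is not part of the serialized form is re-initialized inside
    // readExternal instead of relying on constructor defaults, which this commit
    // reports behave differently between Scala 2.12 and 2.13.
    class Status(private var numNonEmptyBlocks: Int, private var avgSize: Long)
        extends Externalizable {
      def this() = this(-1, -1L) // for deserialization only

      override def writeExternal(out: ObjectOutput): Unit = out.writeLong(avgSize)

      override def readExternal(in: ObjectInput): Unit = {
        numNonEmptyBlocks = -1   // the fix: initialize explicitly
        avgSize = in.readLong()
      }
    }
    ```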
    
    ### Why are the changes needed?
    
    In Scala 2.13, this causes several UT failures because `HighlyCompressedMapStatus.readExternal` doesn't initialize this field. The following is one example.
    
    - org.apache.spark.scheduler.MapStatusSuite
    ```
    MapStatusSuite:
    - compressSize
    - decompressSize
    *** RUN ABORTED ***
      java.lang.NoSuchFieldError: numNonEmptyBlocks
      at org.apache.spark.scheduler.HighlyCompressedMapStatus.<init>(MapStatus.scala:181)
      at org.apache.spark.scheduler.HighlyCompressedMapStatus$.apply(MapStatus.scala:281)
      at org.apache.spark.scheduler.MapStatus$.apply(MapStatus.scala:73)
      at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$8(MapStatusSuite.scala:64)
      at scala.runtime.java8.JFunction1$mcVD$sp.apply(JFunction1$mcVD$sp.scala:18)
      at scala.collection.immutable.List.foreach(List.scala:333)
      at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$7(MapStatusSuite.scala:61)
      at scala.runtime.java8.JFunction1$mcVJ$sp.apply(JFunction1$mcVJ$sp.scala:18)
      at scala.collection.immutable.List.foreach(List.scala:333)
      at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$6(MapStatusSuite.scala:60)
      ...
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. This is a private class.
    
    ### How was this patch tested?
    
    1. Pass the GitHub Action or Jenkins with the existing tests.
    2. Test with Scala-2.13 with `MapStatusSuite`.
    ```
    $ dev/change-scala-version.sh 2.13
    $ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.scheduler.MapStatusSuite
    ...
    MapStatusSuite:
    - compressSize
    - decompressSize
    - MapStatus should never report non-empty blocks' sizes as 0
    - large tasks should use org.apache.spark.scheduler.HighlyCompressedMapStatus
    - HighlyCompressedMapStatus: estimated size should be the average non-empty block size
    - SPARK-22540: ensure HighlyCompressedMapStatus calculates correct avgSize
    - RoaringBitmap: runOptimize succeeded
    - RoaringBitmap: runOptimize failed
    - Blocks which are bigger than SHUFFLE_ACCURATE_BLOCK_THRESHOLD should not be underestimated.
    - SPARK-21133 HighlyCompressedMapStatus#writeExternal throws NPE
    Run completed in 7 seconds, 971 milliseconds.
    Total number of tests run: 10
    Suites: completed 2, aborted 0
    Tests: succeeded 10, failed 0, canceled 0, ignored 0, pending 0
    All tests passed.
    ```
    
    Closes apache#29231 from dongjoon-hyun/SPARK-32436.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit f9f1867)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Oct 16, 2020
    Commit: 1bec8a3

Commits on Oct 17, 2020

  1. [SPARK-33131][SQL][3.0] Fix grouping sets with having clause can not …

    …resolve qualified col name
    
    This is [apache#30029](apache#30029) backport for branch-3.0.
    
    ### What changes were proposed in this pull request?
    
    Correct the resolution of having clause.
    
    ### Why are the changes needed?
    
    Grouping sets construct a new aggregate that loses the qualified name of the grouping expression. Here is an example:
    ```
    -- Works resolved by `ResolveReferences`
    select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = 1
    
    -- Works because of the extra expression c1
    select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1
    
    -- Failed
    select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1
    ```
    
    It works with `Aggregate` without grouping sets through `ResolveReferences`, but grouping sets do not work since the exprId has been changed.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, bug fix.
    
    ### How was this patch tested?
    
    add test.
    
    Closes apache#30077 from ulysses-you/SPARK-33131-branch-3.0.
    
    Authored-by: ulysses <youxiduo@weidian.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    ulysses-you authored and dongjoon-hyun committed Oct 17, 2020
    Commit: fab10f0

Commits on Oct 18, 2020

  1. [SPARK-33170][SQL] Add SQL config to control fast-fail behavior in Fi…

    …leFormatWriter
    
    ### What changes were proposed in this pull request?
    
    This patch proposes to add a config to control the fast-fail behavior in FileFormatWriter, set to false by default.
    
    ### Why are the changes needed?
    
    In SPARK-29649, we catch `FileAlreadyExistsException` in `FileFormatWriter` and fail fast for the task set to prevent task retry.
    
    Per the latest discussion, it is important to be able to keep the original behavior, which is to retry tasks even when `FileAlreadyExistsException` is thrown, because `FileAlreadyExistsException` could be recoverable in some cases.
    
    We are going to add a config that controls this behavior; it defaults to false, so fast-fail is disabled by default.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. By default, the task in FileFormatWriter will retry even if `FileAlreadyExistsException` is thrown. This is the behavior before Spark 3.0. Users can control the fast-fail behavior by enabling it.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes apache#30073 from viirya/SPARK-33170.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 3010e90)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    viirya authored and dongjoon-hyun committed Oct 18, 2020
    Commit: 56a60ca
  2. [MINOR][DOCS][EXAMPLE] Fix the Python manual_load_options_csv example

    ### What changes were proposed in this pull request?
    This pull request changes the `sep` parameter's value from `:` to `;` in the example of `examples/src/main/python/sql/datasource.py`. This code snippet is shown on the Spark SQL Guide documentation. The `sep` parameter's value should be `;` since the data in https://github.com/apache/spark/blob/master/examples/src/main/resources/people.csv is separated by `;`.
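
    For reference, the same read expressed in Scala (a sketch assuming an active `SparkSession` named `spark`; the options mirror the Python example in the guide):
    ```scala
    // Read the semicolon-separated CSV used by the example.
    val people = spark.read
      .format("csv")
      .option("sep", ";")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("examples/src/main/resources/people.csv")
    people.show()
    ```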
    
    ### Why are the changes needed?
    To fix the example code so that it can be executed properly.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes.
    This code snippet is shown on the Spark SQL Guide documentation: https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options
    
    ### How was this patch tested?
    By building the documentation and checking the Spark SQL Guide documentation manually in the local environment.
    
    Closes apache#30082 from kjmrknsn/fix-example-python-datasource.
    
    Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    kjmrknsn authored and HyukjinKwon committed Oct 18, 2020
    Commit: 7e65b12
  3. [SPARK-33176][K8S] Use 11-jre-slim as default in K8s Dockerfile

    ### What changes were proposed in this pull request?
    
    This PR aims to use `openjdk:11-jre-slim` as default in K8s Dockerfile.
    
    ### Why are the changes needed?
    
    Although Apache Spark supports both Java8/Java11, there is a difference.
    
    1. A Java8-built distribution can run on both Java8 and Java11
    2. A Java11-built distribution can run on Java11, but not on Java8.
    
    In short, we had better use Java11 in Dockerfile to embrace both cases without any issues.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. This will remove the chance of user frustration when they build with JDK11 and build the image without overriding the Java base image.
    
    ### How was this patch tested?
    
    Pass the K8s IT.
    
    Closes apache#30083 from dongjoon-hyun/SPARK-33176.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Oct 18, 2020
    Commit: 05fbbb1

Commits on Oct 19, 2020

  1. [SPARK-33123][INFRA] Ignore GitHub only changes in Amplab Jenkins build

    ### What changes were proposed in this pull request?
    
    This PR aims to ignore GitHub only changes in Amplab Jenkins build.
    
    ### Why are the changes needed?
    
    This will save server resources.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, this is a dev-only change.
    
    ### How was this patch tested?
    
    Manually. I used the following doctest during testing and removed it at the clean-up.
    
    E2E tests:
    
    ```
    cd dev
    cat test.py
    ```
    
    ```python
    import importlib
    runtests = importlib.import_module("run-tests")
    print([x.name for x in runtests.determine_modules_for_files([".github/workflows/build_and_test.yml"])])
    ```
    
    ```python
    $ GITHUB_ACTIONS=1 python test.py
    ['root']
    $ python test.py
    []
    ```
    
    Unittests:
    
    ```bash
    $ GITHUB_ACTIONS=1 python3 -m doctest dev/run-tests.py
    $ python3 -m doctest dev/run-tests.py
    ```
    
    Closes apache#30020 from williamhyun/SPARK-33123.
    
    Lead-authored-by: William Hyun <williamhyun3@gmail.com>
    Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit e6c53c2)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    williamhyun and HyukjinKwon committed Oct 19, 2020
    Commit: 0bff1f6
  2. [SPARK-32557][CORE] Logging and swallowing the exception per entry in…

    … History server
    
    ### What changes were proposed in this pull request?
    This PR adds a try catch wrapping the History server scan logic to log and swallow the exception per entry.
    
    ### Why are the changes needed?
    As discussed in apache#29350 , one entry failure shouldn't affect others.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manually tested.
    
    Closes apache#29374 from yanxiaole/SPARK-32557.
    
    Authored-by: Yan Xiaole <xiaole.yan@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    yanxiaole authored and HeartSaVioR committed Oct 19, 2020
    Commit: 15ed312
  3. Revert "Revert "[SPARK-33146][CORE] Check for non-fatal errors when l…

    …oading new applications in SHS""
    
    This reverts commit e40c147.
    HeartSaVioR committed Oct 19, 2020
    Commit: 02f80cf
  4. Revert "[SPARK-33069][INFRA] Skip test result report if no JUnit XML …

    …files are found"
    
    This reverts commit 46a62ca.
    HyukjinKwon committed Oct 19, 2020
    Commit: b1d5a08

Commits on Oct 20, 2020

  1. [SPARK-33181][SQL][DOCS] Document Load Table Directly from File in SQ…

    …L Select Reference
    
    ### What changes were proposed in this pull request?
    
    Add the link to the feature: "Run SQL on files directly" to SQL reference documentation page
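
    A minimal sketch of the feature the new link points to (the path and the active `SparkSession` named `spark` are assumptions):
    ```scala
    // "Run SQL on files directly": the FROM item is <format>.`<path>`.
    spark.sql("SELECT * FROM parquet.`/path/to/data.parquet`").show()
    ```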
    
    ### Why are the changes needed?
    
    To make SQL Reference complete
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Previously, reading files directly in SQL was not included in the documentation (https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select.html) and was not listed in from_items. The new link is added to the SELECT statement documentation, as shown below:
    
    ![image](https://user-images.githubusercontent.com/16770242/96517999-c34f3900-121e-11eb-8d56-c4ba0432855e.png)
    ![image](https://user-images.githubusercontent.com/16770242/96518808-8126f700-1220-11eb-8c98-fb398eee0330.png)
    
    ### How was this patch tested?
    
    Manually built and tested
    
    Closes apache#30095 from liaoaoyuan97/master.
    
    Authored-by: liaoaoyuan97 <al3468@columbia.edu>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit f65a244)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    liaoaoyuan97 authored and HyukjinKwon committed Oct 20, 2020
    Commit: c3af7c6
  2. [SPARK-33190][INFRA][TESTS] Set upper bound of PyArrow version in Git…

    …Hub Actions
    
    A new PyArrow release was uploaded to PyPI today (https://pypi.org/project/pyarrow/), and some tests fail with PyArrow 2.0.0+:
    
    ```
    ======================================================================
    ERROR [0.774s]: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 595, in test_grouped_over_window_with_key
        .select('id', 'result').collect()
      File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in collect
        sock_info = self._jdf.collectToPython()
      File "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
        raise converted from None
    pyspark.sql.utils.PythonException:
      An exception was thrown from the Python worker. Please see the stack trace below.
    Traceback (most recent call last):
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, in main
        process()
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, in process
        serializer.dump_stream(out_iter, outfile)
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 255, in dump_stream
        return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 81, in dump_stream
        for batch in iterator:
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 248, in init_stream_yield_batches
        for series in iterator:
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, in mapper
        return f(keys, vals)
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, in <lambda>
        return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, in wrapped
        result = f(key, pd.concat(value_series, axis=1))
      File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in wrapper
        return f(*args, **kwargs)
      File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 590, in f
        "{} != {}".format(expected_key[i][1], window_range)
    AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'end': datetime.datetime(2018, 3, 20, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>)}
    ```
    
    https://github.com/apache/spark/runs/1278917457
    
    This PR proposes to set the upper bound of PyArrow in GitHub Actions build. This should be removed when we properly support PyArrow 2.0.0+ (SPARK-33189).
    
    To make build pass.
    
    No, dev-only.
    
    GitHub Actions in this build will test it out.
    
    Closes apache#30098 from HyukjinKwon/hot-fix-test.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit eb9966b)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Oct 20, 2020
    Commit: 3b5b533
  3. [MINOR][DOCS] Fix the description about to_avro and from_avro functions

    ### What changes were proposed in this pull request?
    This pull request changes the description about `to_avro` and `from_avro` functions to include Python as a supported language as the functions have been supported in Python since Apache Spark 3.0.0 [[SPARK-26856](https://issues.apache.org/jira/browse/SPARK-26856)].
    
    ### Why are the changes needed?
    Same as above.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. The description changed by this pull request is on https://spark.apache.org/docs/latest/sql-data-sources-avro.html#to_avro-and-from_avro.
    
    ### How was this patch tested?
    Tested manually by building and checking the document in the local environment.
    
    Closes apache#30105 from kjmrknsn/fix-docs-sql-data-sources-avro.
    
    Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 46ad325)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    kjmrknsn authored and HyukjinKwon committed Oct 20, 2020
    Commit: 4373c71

Commits on Oct 21, 2020

  1. [SPARK-33189][PYTHON][TESTS] Add env var to tests for legacy nested t…

    …imestamps in pyarrow
    
    Add an environment variable `PYARROW_IGNORE_TIMEZONE` to pyspark tests in run-tests.py to use legacy nested timestamp behavior. This means that when converting arrow to pandas, nested timestamps with timezones will have the timezone localized during conversion.
    
    The default behavior was changed in PyArrow 2.0.0 to propagate timezone information. Using the environment variable enables testing with newer versions of pyarrow until the issue can be fixed in SPARK-32285.
    
    No
    
    Existing tests
    
    Closes apache#30111 from BryanCutler/arrow-enable-legacy-nested-timestamps-SPARK-33189.
    
    Authored-by: Bryan Cutler <cutlerb@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 47a6568)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    BryanCutler authored and HyukjinKwon committed Oct 21, 2020
    Commit: 5e33155
  2. [SPARK-32785][SQL][DOCS][FOLLOWUP][3.0] Update migration guide for in…

    …complete interval literals
    
    ### What changes were proposed in this pull request?
    
    Address comments  apache#29635 (comment) to improve migration guide
    
    ### Why are the changes needed?
    
    improve migration guide
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, only a doc update.
    
    ### How was this patch tested?
    
    passing GitHub action
    
    Closes apache#30117 from yaooqinn/SPARK-32785-F30.
    
    Authored-by: Kent Yao <yaooqinn@hotmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    yaooqinn authored and maropu committed Oct 21, 2020
    Commit: a36b3c4

Commits on Oct 22, 2020

  1. [SPARK-33189][FOLLOWUP][3.0] Fix syntax error in python/run-tests.py

    ### What changes were proposed in this pull request?
    
    This PR aims to fix syntax error.
    
    ### Why are the changes needed?
    
    ```
    ========================================================================
    Running Python style checks
    ========================================================================
    pycodestyle checks failed.
    *** Error compiling './python/run-tests.py'...
      File "./python/run-tests.py", line 80
        'PYARROW_IGNORE_TIMEZONE': '1',
                                ^
    SyntaxError: invalid syntax
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the Jenkins.
    
    Closes apache#30125 from dongjoon-hyun/SPARK-33189-2.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Oct 22, 2020
    Commit: e31fe6c
  2. [SPARK-32247][INFRA] Install and test scipy with PyPy in GitHub Actions

    ### What changes were proposed in this pull request?
    
    This PR proposes to install `scipy` as well in PyPy. It will test several ML specific test cases in PyPy as well. For example, https://github.com/apache/spark/blob/31a16fbb405a19dc3eb732347e0e1f873b16971d/python/pyspark/mllib/tests/test_linalg.py#L487
    
    It was not installed when GitHub Actions build was added because it failed to install for an unknown reason. Seems like it's fixed in the latest scipy.
    
    ### Why are the changes needed?
    
    To improve test coverage.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    GitHub Actions build in this PR will test it out.
    
    Closes apache#30054 from HyukjinKwon/SPARK-32247.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    HyukjinKwon committed Oct 22, 2020
    Commit: 933dc6c

Commits on Oct 24, 2020

  1. [SPARK-30821][K8S] Handle executor failure with multiple containers

    Handle executor failure with multiple containers
    
    Added a spark property spark.kubernetes.executor.checkAllContainers,
    with default being false. When it's true, the executor snapshot will
    take all containers in the executor into consideration when deciding
    whether the executor is in "Running" state, if the pod restart policy is
    "Never". Also, added the new spark property to the doc.
    
    ### What changes were proposed in this pull request?
    
    Checking of all containers in the executor pod when reporting executor status, if the `spark.kubernetes.executor.checkAllContainers` property is set to true.
    
    ### Why are the changes needed?
    
    Currently, a pod remains "running" as long as there is at least one running container. This prevents Spark from noticing when a container has failed in an executor pod with multiple containers. With this change, user can configure the behavior to be different. Namely, if any container in the executor pod has failed, either the executor process or one of its sidecars, the pod is considered to be failed, and it will be rescheduled.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, new spark property added.
    User is now able to choose whether to turn on this feature using the `spark.kubernetes.executor.checkAllContainers` property.
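
    A usage sketch of the new property (illustrative only; the property name is the one introduced by this commit):
    ```scala
    import org.apache.spark.SparkConf

    // Opt in to checking every container in the executor pod.
    val conf = new SparkConf()
      .set("spark.kubernetes.executor.checkAllContainers", "true")
    ```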
    
    ### How was this patch tested?
    
    Unit test was added and all passed.
    I tried to run integration test by following the instruction [here](https://spark.apache.org/developer-tools.html) (section "Testing K8S") and also [here](https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/README.md), but I wasn't able to run it smoothly as it fails to talk with minikube cluster. Maybe it's because my minikube version is too new (I'm using v1.13.1)...? Since I've been trying it for two days and still can't make it work, I decided to submit this PR and hopefully the Jenkins test will pass.
    
    Closes apache#29924 from huskysun/exec-sidecar-failure.
    
    Authored-by: Shiqi Sun <s.sun@salesforce.com>
    Signed-off-by: Holden Karau <hkarau@apple.com>
    (cherry picked from commit f659527)
    Signed-off-by: Holden Karau <hkarau@apple.com>
    huskysun authored and holdenk committed Oct 24, 2020
    Commit: f7c7f4f

Commits on Oct 25, 2020

  1. [SPARK-33228][SQL] Don't uncache data when replacing a view having th…

    …e same logical plan
    
    ### What changes were proposed in this pull request?
    
    SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the cache when replacing an existing view. But this change drops the cache even when replacing a view having the same logical plan. A sequence of queries to reproduce this is as follows;
    ```
    // Spark v2.4.6+
    scala> val df = spark.range(1).selectExpr("id a", "id b")
    scala> df.cache()
    scala> df.explain()
    == Physical Plan ==
    *(1) ColumnarToRow
    +- InMemoryTableScan [a#2L, b#3L]
          +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
                +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
                   +- *(1) Range (0, 1, step=1, splits=4)
    
    scala> df.createOrReplaceTempView("t")
    scala> sql("select * from t").explain()
    == Physical Plan ==
    *(1) ColumnarToRow
    +- InMemoryTableScan [a#2L, b#3L]
          +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
                +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
                   +- *(1) Range (0, 1, step=1, splits=4)
    
    // If one re-runs the same query `df.createOrReplaceTempView("t")`, the cache's swept away
    scala> df.createOrReplaceTempView("t")
    scala> sql("select * from t").explain()
    == Physical Plan ==
    *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
    +- *(1) Range (0, 1, step=1, splits=4)
    
    // Until v2.4.6
    scala> val df = spark.range(1).selectExpr("id a", "id b")
    scala> df.cache()
    scala> df.createOrReplaceTempView("t")
    scala> sql("select * from t").explain()
    20/10/23 22:33:42 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
    == Physical Plan ==
    *(1) InMemoryTableScan [a#2L, b#3L]
       +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
             +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
                +- *(1) Range (0, 1, step=1, splits=4)
    
    scala> df.createOrReplaceTempView("t")
    scala> sql("select * from t").explain()
    == Physical Plan ==
    *(1) InMemoryTableScan [a#2L, b#3L]
       +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
             +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
                +- *(1) Range (0, 1, step=1, splits=4)
    ```
    
    ### Why are the changes needed?
    
    bugfix.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added tests.
    
    Closes apache#30140 from maropu/FixBugInReplaceView.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 87b4984)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    maropu authored and dongjoon-hyun committed Oct 25, 2020
    Commit: 80716d1

Commits on Oct 26, 2020

  1. [SPARK-33197][SQL] Make changes to spark.sql.analyzer.maxIterations t…

    …ake effect at runtime
    
    ### What changes were proposed in this pull request?
    
    Make changes to `spark.sql.analyzer.maxIterations` take effect at runtime.
    
    ### Why are the changes needed?
    
    `spark.sql.analyzer.maxIterations` is not a static conf. However, before this patch, changing `spark.sql.analyzer.maxIterations` at runtime does not take effect.
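
    For example (a sketch assuming an active `SparkSession` named `spark`):
    ```scala
    // With this patch, the analyzer picks up the session value on the next query
    // instead of the value captured earlier.
    spark.conf.set("spark.sql.analyzer.maxIterations", "50")
    println(spark.conf.get("spark.sql.analyzer.maxIterations"))  // 50
    ```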
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Before this patch, changing `spark.sql.analyzer.maxIterations` at runtime does not take effect.
    
    ### How was this patch tested?
    
    modified unit test
    
    Closes apache#30108 from yuningzh-db/dynamic-analyzer-max-iterations.
    
    Authored-by: Yuning Zhang <yuning.zhang@databricks.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit a21945c)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    yuningzh-db authored and maropu committed Oct 26, 2020
    Commit: 590ccb3
  2. [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.s…

    …ql.sources.writeJobUUID"
    
    ### What changes were proposed in this pull request?
    
    This reinstates the old option `spark.sql.sources.write.jobUUID` to set a unique jobId in the jobconf so that Hadoop MR committers have a unique ID which is (a) consistent across tasks and workers and (b) not brittle compared to generated-timestamp job IDs. The latter match the format that JobID requires, but as they are generated per-thread, they may not always be unique within a cluster.
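
    From a committer author's point of view, reading the ID could look roughly like this (a hypothetical sketch; the property name follows this commit's title and should be treated as an assumption):
    ```scala
    import org.apache.hadoop.mapreduce.JobContext

    // Look up the unique job UUID that Spark sets in the job configuration.
    def jobUuid(context: JobContext): Option[String] =
      Option(context.getConfiguration.get("spark.sql.sources.writeJobUUID"))
    ```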
    
    ### Why are the changes needed?
    
    If a committer (e.g. the S3A staging committer) uses the job-attempt ID as its unique ID, then any two jobs started within the same second have the same ID and can clash.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Good Q. It is "developer-facing" in the context of anyone writing a committer. But it reinstates a property which was in Spark 1.x and "went away"
    
    ### How was this patch tested?
    
    Testing: no test here. You'd have to create a new committer which extracted the value in both job and task(s) and verified consistency. That is possible (with a task output whose records contained the UUID), but it would be pretty convoluted and a high maintenance cost.
    
    Because it's trying to address a race condition, it's hard to regenerate the problem downstream and so verify a fix in a test run...I'll just look at the logs to see what temporary dir is being used in the cluster FS and verify it's a UUID
    
    Closes apache#30141 from steveloughran/SPARK-33230-jobId.
    
    Authored-by: Steve Loughran <stevel@cloudera.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 02fa19f)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    steveloughran authored and dongjoon-hyun committed Oct 26, 2020
    Commit: 22392be

Commits on Oct 27, 2020

  1. [SPARK-33260][SQL] Fix incorrect results from SortExec when sortOrder…

    … is Stream
    
    ### What changes were proposed in this pull request?
    
    The following query produces incorrect results. The query has two essential features: (1) it contains a string aggregate, resulting in a `SortExec` node, and (2) it contains a duplicate grouping key, causing `RemoveRepetitionFromGroupExpressions` to produce a sort order stored as a `Stream`.
    
    ```sql
    SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
    FROM table_4
    GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
    ```
    
    When the sort order is stored as a `Stream`, the line `ordering.map(_.child.genCode(ctx))` in `GenerateOrdering#createOrderKeys()` produces unpredictable side effects to `ctx`. This is because `genCode(ctx)` modifies `ctx`. When ordering is a `Stream`, the modifications will not happen immediately as intended, but will instead occur lazily when the returned `Stream` is used later.
    
    Similar bugs have occurred at least three times in the past: https://issues.apache.org/jira/browse/SPARK-24500, https://issues.apache.org/jira/browse/SPARK-25767, https://issues.apache.org/jira/browse/SPARK-26680.
    
    The fix is to check if `ordering` is a `Stream` and force the modifications to happen immediately if so.
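
    A standalone sketch (plain Scala, not the Spark source) of why mapping over a `Stream` defers its side effects until the result is traversed:

    ```scala
    // Only the head is evaluated eagerly by Stream.map; the remaining side
    // effects run later, when the stream is actually traversed.
    var calls = 0
    val mapped = Stream(1, 2, 3).map { x => calls += 1; x * 2 }
    println(calls)   // 1 -- only the head has been evaluated so far
    mapped.toList    // forces the rest of the stream
    println(calls)   // 3 -- the remaining side effects happened lazily
    ```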
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added a unit test for `SortExec` where `sortOrder` is a `Stream`. The test previously failed and now passes.
    
    Closes apache#30160 from ankurdave/SPARK-33260.
    
    Authored-by: Ankur Dave <ankurdave@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 3f2a2b5)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    ankurdave authored and dongjoon-hyun committed Oct 27, 2020
  2. [SPARK-33246][SQL][DOCS] Correct documentation for null semantics of …

    …"NULL AND False"
    
    ### What changes were proposed in this pull request?
    
    The documentation of the Spark SQL null semantics states that "NULL AND False" yields NULL.  This is incorrect.  "NULL AND False" yields False.
    
    ```
    Seq[(java.lang.Boolean, java.lang.Boolean)](
      (null, false)
    )
      .toDF("left_operand", "right_operand")
      .withColumn("AND", 'left_operand && 'right_operand)
      .show(truncate = false)
    
    +------------+-------------+-----+
    |left_operand|right_operand|AND  |
    +------------+-------------+-----+
    |null        |false        |false|
    +------------+-------------+-----+
    ```
    
    I propose the documentation be updated to reflect that "NULL AND False" yields False.
    
    This contribution is my original work and I license it to the project under the project’s open source license.
    
    ### Why are the changes needed?
    
    This change improves the accuracy of the documentation.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes.  This PR introduces a fix to the documentation.
    
    ### How was this patch tested?
    
    Since this is only a documentation change, no tests were added.
    
    Closes apache#30161 from stwhit/SPARK-33246.
    
    Authored-by: Stuart White <stuart@spotright.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 7d11d97)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    Stuart White authored and maropu committed Oct 27, 2020
  3. [SPARK-32090][SQL] Improve UserDefinedType.equal() to make it be symm…

    …etrical
    
    ### What changes were proposed in this pull request?
    
    This PR fixes `UserDefinedType.equal()` by comparing the UDT classes instead of checking `acceptsType()`.
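
    A minimal sketch (illustrative types, not the actual Spark classes) of the symmetric check this moves to, comparing user classes rather than calling `acceptsType()`:

    ```scala
    // Symmetric equality: comparing the two user classes gives the same answer
    // regardless of operand order, unlike an isAssignableFrom-style check.
    trait Udt { def userClass: Class[_] }

    def udtEquals(a: Udt, b: Udt): Boolean =
      a.userClass == b.userClass
    ```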
    
    ### Why are the changes needed?
    
    It's weird that an equality comparison between two UDT types can give a different result when the operands are switched:
    
    ```scala
    // ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
    val udt1 = new ExampleBaseTypeUDT
    val udt2 = new ExampleSubTypeUDT
    println(udt1 == udt2) // true
    println(udt2 == udt1) // false
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes.
    
    Before:
    ```scala
    // ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
    val udt1 = new ExampleBaseTypeUDT
    val udt2 = new ExampleSubTypeUDT
    println(udt1 == udt2) // true
    println(udt2 == udt1) // false
    ```
    
    After:
    ```scala
    // ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
    val udt1 = new ExampleBaseTypeUDT
    val udt2 = new ExampleSubTypeUDT
    println(udt1 == udt2) // false
    println(udt2 == udt1) // false
    ```
    
    ### How was this patch tested?
    
    Added a unit test.
    
    Closes apache#28923 from Ngone51/fix-udt-equal.
    
    Authored-by: yi.wu <yi.wu@databricks.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    Ngone51 authored and maropu committed Oct 27, 2020

Commits on Oct 28, 2020

  1. [SPARK-33264][SQL][DOCS] Add a dedicated page for SQL-on-file in SQL …

    …documents
    
    ### What changes were proposed in this pull request?
    
    This PR intends to add a dedicated page for SQL-on-file in SQL documents.
    This comes from the comment: https://github.com/apache/spark/pull/30095/files#r508965149
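
    For reference, the feature the new page documents lets you query a file directly without registering a table first; a small spark-shell example (the path is illustrative):

    ```scala
    // SQL-on-file: the data source name and the file path form the "table".
    spark.sql("SELECT * FROM parquet.`/tmp/events`").show()
    ```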
    
    ### Why are the changes needed?
    
    For better documentations.
    
    ### Does this PR introduce _any_ user-facing change?
    
    <img width="544" alt="Screen Shot 2020-10-28 at 9 56 59" src="https://user-images.githubusercontent.com/692303/97378051-c1fbcb80-1904-11eb-86c0-a88c5269d41c.png">
    
    ### How was this patch tested?
    
    N/A
    
    Closes apache#30165 from maropu/DocForFile.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit c2bea04)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    maropu committed Oct 28, 2020
  2. [SPARK-33208][SQL] Update the document of SparkSession#sql

    Change-Id: I82db1f9e8f667573aa3a03e05152cbed0ea7686b
    
    ### What changes were proposed in this pull request?
    Update the documentation of SparkSession#sql to mention that this API eagerly runs DDL/DML commands, but not SELECT queries.
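
    A small spark-shell illustration of the documented behavior (the table name is illustrative):

    ```scala
    spark.sql("CREATE TABLE t(i INT) USING parquet")  // DDL runs eagerly here
    val df = spark.sql("SELECT * FROM t")             // SELECT is lazy: no job yet
    df.collect()                                      // the query executes here
    ```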
    
    ### Why are the changes needed?
    To clarify the behavior of SparkSession#sql.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Not needed.
    
    Closes apache#30168 from waitinfuture/master.
    
    Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit b26ae98)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    waitinfuture authored and cloud-fan committed Oct 28, 2020
  3. [SPARK-33267][SQL] Fix NPE issue on 'In' filter when one of values co…

    …ntains null
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to fix the NPE issue on the `In` filter when one of the values contains null. In a real case, you can trigger this issue when you try to push down a filter with `in (..., null)` against a V2 source table. `DataSourceStrategy` caches the mapping (filter instance -> expression) in a HashMap, which uses the hash code of the key, hence it can trigger the NPE.
    
    ### Why are the changes needed?
    
    This is an obvious bug, as the `In` filter doesn't account for null values when calculating its hash code.
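
    A standalone sketch of the failure mode (plain Scala, not the Spark source): hashing a value array that contains null blows up unless the null is handled explicitly.

    ```scala
    val values: Array[Any] = Array(1, 2, null)

    // values.map(_.hashCode()) throws a NullPointerException on the null entry.
    // A null-safe variant treats null as a fixed hash contribution:
    val hash = values.map(v => if (v == null) 0 else v.hashCode()).toSeq.hashCode()
    ```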
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. Previously, a query having `null` in an "in" condition against a data source V2 table supporting filter push-down failed with an NPE, whereas after this PR the query will not fail.
    
    ### How was this patch tested?
    
    UT added. The new UT fails without the PR and passes with the PR.
    
    Closes apache#30170 from HeartSaVioR/SPARK-33267.
    
    Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit a744fea)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    HeartSaVioR authored and dongjoon-hyun committed Oct 28, 2020
  4. [SPARK-32119][CORE][3.0] ExecutorPlugin doesn't work with Standalone …

    …Cluster and Kubernetes with --jars
    
    ### What changes were proposed in this pull request?
    
    This is a backport PR for branch-3.0.
    
    This PR changes Executor to load jars and files added by --jars and --files on Executor initialization.
    To avoid downloading those jars/files twice, they are associated with `startTime` as their upload timestamp.
    
    ### Why are the changes needed?
    
    ExecutorPlugin can't work with Standalone Cluster and Kubernetes
    when a jar which contains the plugins, and files used by the plugins, are added via the --jars and --files options of spark-submit.
    
    This is because jars and files added by --jars and --files are not loaded on Executor initialization.
    I confirmed it works with YARN because jars/files are distributed via the distributed cache.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. jars/files added by --jars and --files are downloaded on each executor on initialization.
    
    ### How was this patch tested?
    
    Added a new testcase.
    
    Closes apache#29621 from sarutak/fix-plugin-issue-3.0.
    
    Lead-authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Co-authored-by: Luca Canali <luca.canali@cern.ch>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    2 people authored and dongjoon-hyun committed Oct 28, 2020

Commits on Oct 29, 2020

  1. [SQL][MINOR] Update from_unixtime doc

    This PR fixes the from_unixtime documentation to show that fmt is an optional parameter.
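
    For reference, both call forms are valid (spark-shell):

    ```scala
    spark.sql("SELECT from_unixtime(0)").show()                // default format
    spark.sql("SELECT from_unixtime(0, 'yyyy-MM-dd')").show()  // explicit fmt
    ```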
    
    Yes, documentation update.
    **Before change:**
    ![image](https://user-images.githubusercontent.com/4176173/97497659-18c6cc80-1928-11eb-93d8-453ef627ac7c.png)
    
    **After change:**
    ![image](https://user-images.githubusercontent.com/4176173/97496153-c5537f00-1925-11eb-8102-457e85e019d5.png)
    
    Style check using: ./dev/run-tests
    Manual check and screenshotting with: ./sql/create-docs.sh
    Manual verification of behavior with latest spark-sql binary.
    
    Closes apache#30176 from Obbay2/from_unixtime_doc.
    
    Authored-by: Nathan Wreggit <obbay2@hotmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    Obbay2 authored and HyukjinKwon committed Oct 29, 2020

Commits on Oct 30, 2020

  1. [SPARK-33292][SQL] Make Literal ArrayBasedMapData string representati…

    …on disambiguous
    
    ### What changes were proposed in this pull request?
    
    This PR aims to wrap `ArrayBasedMapData` literal representation with `map(...)`.
    
    ### Why are the changes needed?
    
    A literal ArrayBasedMapData has an inconsistent string representation between the `LogicalPlan` and the `Optimized Logical Plan`/`Physical Plan`. Also, the representation in the `Optimized Logical Plan` and `Physical Plan` is ambiguous, e.g. `[1 AS a#0, keys: [key1], values: [value1] AS b#1]`.
    
    **BEFORE**
    ```scala
    scala> spark.version
    res0: String = 2.4.7
    
    scala> sql("SELECT 1 a, map('key1', 'value1') b").explain(true)
    == Parsed Logical Plan ==
    'Project [1 AS a#0, 'map(key1, value1) AS b#1]
    +- OneRowRelation
    
    == Analyzed Logical Plan ==
    a: int, b: map<string,string>
    Project [1 AS a#0, map(key1, value1) AS b#1]
    +- OneRowRelation
    
    == Optimized Logical Plan ==
    Project [1 AS a#0, keys: [key1], values: [value1] AS b#1]
    +- OneRowRelation
    
    == Physical Plan ==
    *(1) Project [1 AS a#0, keys: [key1], values: [value1] AS b#1]
    +- Scan OneRowRelation[]
    ```
    
    **AFTER**
    ```scala
    scala> spark.version
    res0: String = 3.1.0-SNAPSHOT
    
    scala> sql("SELECT 1 a, map('key1', 'value1') b").explain(true)
    == Parsed Logical Plan ==
    'Project [1 AS a#4, 'map(key1, value1) AS b#5]
    +- OneRowRelation
    
    == Analyzed Logical Plan ==
    a: int, b: map<string,string>
    Project [1 AS a#4, map(key1, value1) AS b#5]
    +- OneRowRelation
    
    == Optimized Logical Plan ==
    Project [1 AS a#4, map(keys: [key1], values: [value1]) AS b#5]
    +- OneRowRelation
    
    == Physical Plan ==
    *(1) Project [1 AS a#4, map(keys: [key1], values: [value1]) AS b#5]
    +- *(1) Scan OneRowRelation[]
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. This changes the query plan's string representation in the `explain` command and the UI. However, this is a bug fix.
    
    ### How was this patch tested?
    
    Pass the CI with the newly added test case.
    
    Closes apache#30190 from dongjoon-hyun/SPARK-33292.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 838791b)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Oct 30, 2020
  2. [SPARK-33268][SQL][PYTHON][3.0] Fix bugs for casting data from/to Pyt…

    …honUserDefinedType
    
    ### What changes were proposed in this pull request?
    
    This PR intends to fix bugs when casting data from/to PythonUserDefinedType. A sequence of queries to reproduce this issue is as follows:
    ```
    >>> from pyspark.sql import Row
    >>> from pyspark.sql.functions import col
    >>> from pyspark.sql.types import *
    >>> from pyspark.testing.sqlutils import *
    >>>
    >>> row = Row(point=ExamplePoint(1.0, 2.0))
    >>> df = spark.createDataFrame([row])
    >>> df.select(col("point").cast(PythonOnlyUDT()))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/dataframe.py", line 1402, in select
        jdf = self._jdf.select(self._jcols(*cols))
      File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
      File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/utils.py", line 111, in deco
        return f(*a, **kw)
      File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o44.select.
    : java.lang.NullPointerException
    	at org.apache.spark.sql.types.UserDefinedType.acceptsType(UserDefinedType.scala:84)
    	at org.apache.spark.sql.catalyst.expressions.Cast$.canCast(Cast.scala:96)
    	at org.apache.spark.sql.catalyst.expressions.CastBase.checkInputDataTypes(Cast.scala:267)
    	at org.apache.spark.sql.catalyst.expressions.CastBase.resolved$lzycompute(Cast.scala:290)
    	at org.apache.spark.sql.catalyst.expressions.CastBase.resolved(Cast.scala:290)
    ```
    The root cause of this issue is that, since `PythonUserDefinedType#userClass` is always null, `isAssignableFrom` in `UserDefinedType#acceptsType` throws a NullPointerException. To fix it, this PR defines `acceptsType` in `PythonUserDefinedType` and filters out the null case in `UserDefinedType#acceptsType`.
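
    A minimal sketch of the null guard described above (illustrative signature, not the actual Spark source):

    ```scala
    // acceptsType must not dereference a null userClass.
    def acceptsType(thisUserClass: Class[_], otherUserClass: Class[_]): Boolean =
      thisUserClass != null && otherUserClass != null &&
        thisUserClass.isAssignableFrom(otherUserClass)
    ```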
    
    This backport comes from apache#30169.
    
    ### Why are the changes needed?
    
    Bug fixes.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added tests.
    
    Closes apache#30191 from maropu/SPARK-33268-BRANCH3.0.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    maropu authored and dongjoon-hyun committed Oct 30, 2020
  3. [SPARK-33183][SQL][3.0] Fix Optimizer rule EliminateSorts and add a p…

    …hysical rule to remove redundant sorts
    
    Backport apache#30093 for branch-3.0. I've updated the configuration version to 2.4.8.
    
    ### What changes were proposed in this pull request?
    This PR aims to fix a correctness bug in the optimizer rule EliminateSorts. It also adds a new physical rule to remove redundant sorts that cannot be eliminated in the Optimizer rule after the bugfix.
    
    ### Why are the changes needed?
    A global sort should not be eliminated even if its child is ordered since we don't know if its child ordering is global or local. For example, in the following scenario, the first sort shouldn't be removed because it has a stronger guarantee than the second sort even if the sort orders are the same for both sorts.
    ```
    Sort(orders, global = True, ...)
      Sort(orders, global = False, ...)
    ```
    Since there is no straightforward way to identify whether a node's output ordering is local or global, we should not remove a global sort even if its child is already ordered.
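
    A small illustration of the invariant (assuming a spark-shell session): the outer global sort must survive even though its child already produces a per-partition ordering.

    ```scala
    val df = spark.range(100)
      .repartition(4)
      .sortWithinPartitions("id")  // local ordering only
      .sort("id")                  // global ordering -- must not be eliminated

    df.explain()  // the plan should still contain the global Sort node
    ```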
    
    ### Does this PR introduce any user-facing change?
    Yes
    
    ### How was this patch tested?
    Unit tests
    
    Closes apache#30195 from allisonwang-db/SPARK-33183-branch-3.0.
    
    Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    allisonwang-db authored and cloud-fan committed Oct 30, 2020

Commits on Oct 31, 2020

  1. [SPARK-33290][SQL] REFRESH TABLE should invalidate cache even though …

    …the table itself may not be cached
    
    ### What changes were proposed in this pull request?
    
    In `CatalogImpl.refreshTable`, this moves the `uncacheQuery` call out of the condition `if (cache.nonEmpty)` so that it will be called whether the table itself is cached or not.
    
    ### Why are the changes needed?
    
    In the case like the following:
    ```sql
    CREATE TABLE t ...;
    CREATE VIEW t1 AS SELECT * FROM t;
    REFRESH TABLE t;
    ```
    
    If the table `t` is refreshed, the view `t1` which is depending on `t` will not be invalidated. This could lead to incorrect result and is similar to [SPARK-19765](https://issues.apache.org/jira/browse/SPARK-19765).
    
    On the other hand, if we have:
    
    ```sql
    CREATE TABLE t ...;
    CACHE TABLE t;
    CREATE VIEW t1 AS SELECT * FROM t;
    REFRESH TABLE t;
    ```
    
    Then the view `t1` will be refreshed. The behavior is somewhat inconsistent.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. With this change, any cache that depends on the refreshed table will be invalidated. Previously this only happened if the table itself was cached.
    
    ### How was this patch tested?
    
    Added a new UT for the case.
    
    Closes apache#30187 from sunchao/SPARK-33290.
    
    Authored-by: Chao Sun <sunchao@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 32b78d3)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    sunchao authored and dongjoon-hyun committed Oct 31, 2020
  2. [SPARK-33306][SQL] Timezone is needed when cast date to string

    ### What changes were proposed in this pull request?
    When `spark.sql.legacy.typeCoercion.datetimeToString.enabled` is enabled, Spark will cast a date to a string when comparing a date with a string. In Spark 3, a timezone is needed when casting a date to a string, as in https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L309.
    
    However, the timezone may not be set because `CastBase.needsTimeZone` returns false for this kind of cast.
    
    A simple way to reproduce this is
    ```
    spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled=true
    
    ```
    when we execute the following sql,
    ```
    select a.d1 from
    (select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a
    join
    (select concat('2000-01-0', id) as d2 from range(1, 2)) b
    on a.d1 = b.d2
    ```
    it will throw
    ```
    java.util.NoSuchElementException: None.get
      at scala.None$.get(Option.scala:529)
      at scala.None$.get(Option.scala:527)
      at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56)
      at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56)
      at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253)
      at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253)
      at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287)
      at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287)
    ```
    
    ### Why are the changes needed?
    As described above, this is a bug.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Add more UT
    
    Closes apache#30213 from WangGuangxin/SPARK-33306.
    
    Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 69c27f4)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    WangGuangxin authored and dongjoon-hyun committed Oct 31, 2020

Commits on Nov 2, 2020

  1. [SPARK-33277][PYSPARK][SQL][3.0] Use ContextAwareIterator to stop con…

    …suming after the task ends
    
    ### What changes were proposed in this pull request?
    
    This is a backport of apache#30177.
    
    As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.
    
    ### Why are the changes needed?
    
    A Python/Pandas UDF right after an off-heap vectorized reader could cause an executor crash.
    
    E.g.,:
    
    ```py
    spark.range(0, 100000, 1, 1).write.parquet(path)
    
    spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
    
    def f(x):
        return 0
    
    fUdf = udf(f, LongType())
    
    spark.read.parquet(path).select(fUdf('id')).head()
    ```
    
    This is because the Python evaluation consumes the parent iterator in a separate thread, and it can consume more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, this can cause a segmentation fault which crashes the executor.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added tests, and manually.
    
    Closes apache#30217 from ueshin/issues/SPARK-33277/3.0/python_pandas_udf.
    
    Authored-by: Takuya UESHIN <ueshin@databricks.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    ueshin authored and HyukjinKwon committed Nov 2, 2020
  2. [SPARK-33313][TESTS][R][3.0][2.4] Add testthat 3.x support

    ### What changes were proposed in this pull request?
    
    This PR proposes to port back apache#30219 but keeps testthat 1.x support in other branches.
    
    This PR modifies `R/pkg/tests/run-all.R` by:
    
    - ~Removing `testthat` 1.x support, as Jenkins has been upgraded to 2.x with SPARK-30637 and this code is no longer relevant.~
    - Add `testthat` 3.x support to avoid CI failures.
    
    ### Why are the changes needed?
    
    An internal API that is currently used has been removed in the latest `testthat` release.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Tests executed against `testthat == 2.3.2` and `testthat == 3.0.0`
    
    Closes apache#30220 from HyukjinKwon/SPARK-33313.
    
    Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
    Co-authored-by: zero323 <mszymkiewicz@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon and zero323 committed Nov 2, 2020

Commits on Nov 3, 2020

  1. [SPARK-33156][INFRA][3.0] Upgrade GithubAction image from 18.04 to 20.04

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade `Github Action` runner image from `Ubuntu 18.04 (LTS)` to `Ubuntu 20.04 (LTS)`.
    
    ### Why are the changes needed?
    
    `ubuntu-latest` in `GitHub Action` is still `Ubuntu 18.04 (LTS)`.
    - https://github.com/actions/virtual-environments#available-environments
    
    This upgrade will prepare for the AmpLab Jenkins upgrade.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the `Github Action` in this PR.
    
    Closes apache#30231 from dongjoon-hyun/SPARK-33156-3.0.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Nov 3, 2020
  2. [SPARK-24266][K8S][3.0] Restart the watcher when we receive a version…

    … changed from k8s
    
    ### What changes were proposed in this pull request?
    
    This is a straight application of apache#28423 onto branch-3.0
    
    Restart the watcher when it fails with an HTTP_GONE code from the Kubernetes API, which means a resource version has changed.
    
    For more relevant information see here: fabric8io/kubernetes-client#1075
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    This was tested in apache#28423 by running spark-submit to a k8s cluster.
    
    Closes apache#29533 from jkleckner/backport-SPARK-24266-to-branch-3.0.
    
    Authored-by: Stijn De Haes <stijndehaes@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    stijndehaes authored and dongjoon-hyun committed Nov 3, 2020
  3. [SPARK-33284][WEB-UI] In the Storage UI page, clicking any field to s…

    …ort the table will cause the header content to be lost
    
    ### What changes were proposed in this pull request?
    In the old version of Spark, sorting on the Storage UI page works as expected, but in the new version sorting causes the header content to be lost, so this PR fixes that bug.
    
    ### Why are the changes needed?
    
    The header of the table on the page is similar to the following; **note that each th contains a span element**:
    
    ```html
    <thead>
        <tr>
            ....
            <th width="" class="">
                  <span data-toggle="tooltip" title="" data-original-title="StorageLevel displays where the persisted RDD is stored, format of the persisted RDD (serialized or de-serialized) andreplication factor of the persisted RDD">
                    Storage Level
                  </span>
            </th>
           .....
        </tr>
    </thead>
    ```
    
    Since [PR#26136](apache#26136), if a `th` in the table itself contains a `span` element, the `span` will be deleted after clicking to sort, and the original header content will be lost.
    
    There are three problems in `sorttable.js`:
    
    1. `sortrevind.class = "sorttable_sortrevind"` in [sorttable.js#107](https://github.com/apache/spark/blob/9d5e48ea95d1c3017a51ff69584f32a18901b2b5/core/src/main/resources/org/apache/spark/ui/static/sorttable.js#L107) and `sortfwdind.class = "sorttable_sortfwdind"` in [sorttable.js#125](https://github.com/apache/spark/blob/9d5e48ea95d1c3017a51ff69584f32a18901b2b5/core/src/main/resources/org/apache/spark/ui/static/sorttable.js#L125):
    the sorttable_xx value should be assigned to `className` instead of `class`, as the JavaScript uses `rowlists[j].className.search` rather than `rowlists[j].class.search` to determine whether the component has a sorting flag or not.
    2. `rowlists[j].className.search(/\sorttable_sortrevind\b/)` in [sorttable.js#120](https://github.com/apache/spark/blob/9d5e48ea95d1c3017a51ff69584f32a18901b2b5/core/src/main/resources/org/apache/spark/ui/static/sorttable.js#L120) was wrong. The original intention is to search whether `className` contains the word `sorttable_sortrevind`, but the expression is wrong: it should be `\bsorttable_sortrevind\b` instead of `\sorttable_sortrevind\b`.
    3. The if check in the following code snippet ([sorttable.js#141](https://github.com/apache/spark/blob/9d5e48ea95d1c3017a51ff69584f32a18901b2b5/core/src/main/resources/org/apache/spark/ui/static/sorttable.js#L141)) was wrong. **If the `search` function does not find the target, it returns -1, but Boolean(-1) actually evaluates to true.** This statement causes spans to be deleted even if they do not contain `sorttable_sortfwdind` or `sorttable_sortrevind`.
    ```javascript
    rowlists = this.parentNode.getElementsByTagName("span");
    for (var j=0; j < rowlists.length; j++) {
                  if (rowlists[j].className.search(/\bsorttable_sortfwdind\b/)
                      || rowlists[j].className.search(/\sorttable_sortrevind\b/) ) {
                      rowlists[j].parentNode.removeChild(rowlists[j]);
                  }
              }
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    NO.
    
    ### How was this patch tested?
    The manual test result of the ui page is as below:
    
    ![fix sorted](https://user-images.githubusercontent.com/52202080/97543194-daeaa680-1a02-11eb-8b11-8109c3e4e9a3.gif)
    
    Closes apache#30182 from akiyamaneko/ui_storage_sort_error.
    
    Authored-by: neko <echohlne@gmail.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit 56c623e)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    echohlne authored and srowen committed Nov 3, 2020

Commits on Nov 4, 2020

  1. [SPARK-33333][BUILD][3.0] Upgrade Jetty to 9.4.28.v20200408

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade Jetty to 9.4.28.v20200408 in `branch-3.0` for `Apache Spark 3.0.2` like `master` branch `Apache Spark 3.1`.
    
    ### Why are the changes needed?
    
    To bring the bug fixes.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    Closes apache#30240 from dongjoon-hyun/SPARK-33333.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
    dongjoon-hyun authored and viirya committed Nov 4, 2020
  2. [SPARK-33338][SQL] GROUP BY using literal map should not fail

    ### What changes were proposed in this pull request?
    
    This PR aims to make `semanticEquals` work correctly on `GetMapValue` expressions having literal maps with `ArrayBasedMapData` and `GenericArrayData`.
    
    ### Why are the changes needed?
    
    This is a regression from Apache Spark 1.6.x.
    ```scala
    scala> sc.version
    res1: String = 1.6.3
    
    scala> sqlContext.sql("SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]").show
    +---+
    |_c0|
    +---+
    | v1|
    +---+
    ```
    
    Apache Spark 2.x ~ 3.0.1 raises a `RuntimeException` for the following queries.
    ```sql
    CREATE TABLE t USING ORC AS SELECT map('k1', 'v1') m, 'k1' k
    SELECT map('k1', 'v1')[k] FROM t GROUP BY 1
    SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]
    SELECT map('k1', 'v1')[k] a FROM t GROUP BY a
    ```
    
    **BEFORE**
    ```scala
    Caused by: java.lang.RuntimeException: Couldn't find k#3 in [keys: [k1], values: [v1][k#3]apache#6]
    	at scala.sys.package$.error(package.scala:27)
    	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:85)
    	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:79)
    	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
    ```
    
    **AFTER**
    ```sql
    spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY 1;
    v1
    Time taken: 1.278 seconds, Fetched 1 row(s)
    spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k];
    v1
    Time taken: 0.313 seconds, Fetched 1 row(s)
    spark-sql> SELECT map('k1', 'v1')[k] a FROM t GROUP BY a;
    v1
    Time taken: 0.265 seconds, Fetched 1 row(s)
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs with the newly added test case.
    
    Closes apache#30246 from dongjoon-hyun/SPARK-33338.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 42c0b17)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Nov 4, 2020

Commits on Nov 5, 2020

  1. [SPARK-33162][INFRA][3.0] Use pre-built image at GitHub Action PySpar…

    …k jobs
    
    ### What changes were proposed in this pull request?
    
    This is a backport of apache#30059 .
    
    This PR aims to use a pre-built image for the GitHub Action PySpark jobs. To isolate the changes, the `pyspark` jobs are split from the main job. The docker image is built by the following.
    
    | Item                   | URL                |
    | --------------- | ------------- |
    | Dockerfile         | https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/blob/main/Dockerfile |
    | Builder               | https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/blob/main/.github/workflows/build.yml |
    | Image Location | https://hub.docker.com/r/dongjoon/apache-spark-github-action-image |
    
    Please note that:
    1. The community will still use `build_and_test.yml` to add new features, as we did until now. The `Dockerfile` will be updated regularly.
    2. When Apache Spark gets an official docker repository location, we will use it.
    3. Also, it would be best to keep this docker file and builder script in a new Apache Spark dev branch instead of an outside GitHub repository.
    
    ### Why are the changes needed?
    
    This will reduce the Python and its package installation time.
    
    **BEFORE (branch-3.0)**
    ![Screen Shot 2020-11-04 at 2 28 49 PM](https://user-images.githubusercontent.com/9700541/98174664-17f2e500-1eaa-11eb-9222-018eead9c418.png)
    
    **AFTER (branch-3.0)**
    ![Screen Shot 2020-11-04 at 2 29 43 PM](https://user-images.githubusercontent.com/9700541/98174758-378a0d80-1eaa-11eb-8e6a-929158c2fea3.png)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the GitHub Action on this PR without `package installation steps`.
    
    Closes apache#30253 from dongjoon-hyun/GHA-3.0.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Nov 5, 2020
  2. [SPARK-33239][INFRA][3.0] Use pre-built image at GitHub Action SparkR…

    … job
    
    ### What changes were proposed in this pull request?
    
    This is a backport of apache#30066 .
    
    This PR aims to use a pre-built image for Github Action SparkR job.
    
    ### Why are the changes needed?
    
    This will reduce the execution time and the flakiness.
    
    **BEFORE (branch-3.0: 21 minutes 7 seconds)**
    ![Screen Shot 2020-11-04 at 8 53 50 PM](https://user-images.githubusercontent.com/9700541/98199386-e39a1b80-1edf-11eb-8dec-c6819ebb3f0d.png)
    
    **AFTER**
    No R and R package installation steps.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the GitHub Action `sparkr` job in this PR.
    
    Closes apache#30258 from dongjoon-hyun/SPARK-33239-3.0.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Nov 5, 2020
  3. Revert "[SPARK-33277][PYSPARK][SQL][3.0] Use ContextAwareIterator to …

    …stop consuming after the task ends"
    
    This reverts commit 92ba08d.
    HyukjinKwon committed Nov 5, 2020
  4. [MINOR][SS][DOCS] Update join type in stream static joins code examples

    ### What changes were proposed in this pull request?
    Update join type in stream static joins code examples in structured streaming programming guide.
    1) Scala, Java and Python examples have a common issue.
        The join keyword is "right_join"; it should be "left_outer".
    
        _Reasons:_
        a) This code snippet is an example of a "left outer join", as the streaming df is on the left and the static df is on the right. Also, a right outer join between a stream df (left) and a static df (right) is not supported.
        b) The keywords "right_join"/"left_join" are unsupported; they should be "right_outer"/"left_outer".
    
    So, all of these code snippets have been updated to "left_outer".
    
    2) The R example is correct, but it shows a "right_outer" join with the static df on the left and the streaming df on the right.
    It is changed to "left_outer" to make it consistent with the other three examples in Scala, Java and Python.
    
    ### Why are the changes needed?
    To fix the mistake in example code of documentation.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, it is a user-facing change (but documentation update only).
    
    **Screenshots 1: Scala/Java/python example (similar issue)**
    _Before:_
    <img width="941" alt="Screenshot 2020-11-05 at 12 16 09 AM" src="https://user-images.githubusercontent.com/62717942/98155351-19e59400-1efc-11eb-8142-e6a25a5e6497.png">
    
    _After:_
    <img width="922" alt="Screenshot 2020-11-05 at 12 17 12 AM" src="https://user-images.githubusercontent.com/62717942/98155503-5d400280-1efc-11eb-96e1-5ba0f3c35c82.png">
    
    **Screenshots 2: R example (Make it consistent with above change)**
    _Before:_
    <img width="896" alt="Screenshot 2020-11-05 at 12 19 57 AM" src="https://user-images.githubusercontent.com/62717942/98155685-ac863300-1efc-11eb-93bc-b7ca4dd34634.png">
    
    _After:_
    <img width="919" alt="Screenshot 2020-11-05 at 12 20 51 AM" src="https://user-images.githubusercontent.com/62717942/98155739-c0ca3000-1efc-11eb-8f95-a7538fa784b7.png">
    
    ### How was this patch tested?
    The change was tested locally.
    1) cd docs/
        SKIP_API=1 jekyll build
    2) Verify docs/_site/structured-streaming-programming-guide.html file in browser.
    
    Closes apache#30252 from sarveshdave1/doc-update-stream-static-joins.
    
    Authored-by: Sarvesh Dave <sarveshdave1@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    (cherry picked from commit e66201b)
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    sarveshdave1 authored and HeartSaVioR committed Nov 5, 2020
  5. [SPARK-33362][SQL] skipSchemaResolution should still require query to…

    … be resolved
    
    ### What changes were proposed in this pull request?
    
    Fix a small bug in `V2WriteCommand.resolved`. It should always require the `table` and `query` to be resolved.
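
    A sketch of the intended invariant (illustrative names, not the exact Spark source): even when schema resolution is skipped, the table and the query must themselves be resolved.

    ```scala
    def resolved(tableResolved: Boolean,
                 queryResolved: Boolean,
                 outputResolved: Boolean,
                 skipSchemaResolution: Boolean): Boolean =
      tableResolved && queryResolved && (skipSchemaResolution || outputResolved)
    ```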
    
    ### Why are the changes needed?
    
    To prevent potential bugs where we skip resolving the input query.
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    a new test
    
    Closes apache#30265 from cloud-fan/ds-minor-2.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 26ea417)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    cloud-fan authored and dongjoon-hyun committed Nov 5, 2020

Commits on Nov 8, 2020

  1. [SPARK-32860][DOCS][SQL] Updating documentation about map support in …

    …Encoders
    
    ### What changes were proposed in this pull request?
    
    Javadocs updated for the encoder to include maps as a collection type
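
    For reference, a map-typed field is handled by the built-in encoders (spark-shell, with `spark.implicits._` in scope; the case class is illustrative):

    ```scala
    import spark.implicits._

    case class Event(id: Long, tags: Map[String, String])

    val ds = Seq(Event(1L, Map("env" -> "prod"))).toDS()
    ds.show(false)
    ```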
    
    ### Why are the changes needed?
    
    The javadocs were not updated along with the fix for SPARK-16706.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the javadocs are updated
    
    ### How was this patch tested?
    
    sbt was run to ensure it meets scalastyle
    
    Closes apache#30274 from hannahkamundson/SPARK-32860.
    
    Lead-authored-by: Hannah Amundson <amundson.hannah@heb.com>
    Co-authored-by: Hannah <48397717+hannahkamundson@users.noreply.github.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 1090b1b)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    2 people authored and HyukjinKwon committed Nov 8, 2020

Commits on Nov 9, 2020

  1. [SPARK-33371][PYTHON][3.0] Update setup.py and tests for Python 3.9

    ### What changes were proposed in this pull request?
    
    This PR is a backport of apache#30277
    
    This PR proposes to fix PySpark to officially support Python 3.9. The main code already works; we should just note that we support Python 3.9.
    
    Also, this PR includes some minor fixes in the test code:
    - `Thread.isAlive` is removed in Python 3.9, and `Thread.is_alive` exists in Python 3.6+, see https://docs.python.org/3/whatsnew/3.9.html#removed
    - Fixed `TaskContextTestsWithWorkerReuse.test_barrier_with_python_worker_reuse` and `TaskContextTests.test_barrier` to be less flaky. These become more flaky in Python 3.9 for some reason.
    
    NOTE that PyArrow does not support Python 3.9 yet.
    
    ### Why are the changes needed?
    
    To officially support Python 3.9.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it officially supports Python 3.9.
    
    ### How was this patch tested?
    
    Manually ran the tests:
    
    ```
    $  ./run-tests --python-executable=python
    Running PySpark tests. Output is in /.../spark/python/unit-tests.log
    Will test against the following Python executables: ['python']
    Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-resource', 'pyspark-sql', 'pyspark-streaming']
    python python_implementation is CPython
    python version is: Python 3.9.0
    Starting test(python): pyspark.ml.tests.test_base
    Starting test(python): pyspark.ml.tests.test_evaluation
    Starting test(python): pyspark.ml.tests.test_algorithms
    Starting test(python): pyspark.ml.tests.test_feature
    Finished test(python): pyspark.ml.tests.test_base (12s)
    Starting test(python): pyspark.ml.tests.test_image
    Finished test(python): pyspark.ml.tests.test_evaluation (15s)
    Starting test(python): pyspark.ml.tests.test_linalg
    Finished test(python): pyspark.ml.tests.test_feature (25s)
    Starting test(python): pyspark.ml.tests.test_param
    Finished test(python): pyspark.ml.tests.test_image (17s)
    Starting test(python): pyspark.ml.tests.test_persistence
    Finished test(python): pyspark.ml.tests.test_param (17s)
    Starting test(python): pyspark.ml.tests.test_pipeline
    Finished test(python): pyspark.ml.tests.test_linalg (30s)
    Starting test(python): pyspark.ml.tests.test_stat
    Finished test(python): pyspark.ml.tests.test_pipeline (6s)
    Starting test(python): pyspark.ml.tests.test_training_summary
    Finished test(python): pyspark.ml.tests.test_stat (12s)
    Starting test(python): pyspark.ml.tests.test_tuning
    Finished test(python): pyspark.ml.tests.test_algorithms (68s)
    Starting test(python): pyspark.ml.tests.test_wrapper
    Finished test(python): pyspark.ml.tests.test_persistence (51s)
    Starting test(python): pyspark.mllib.tests.test_algorithms
    Finished test(python): pyspark.ml.tests.test_training_summary (33s)
    Starting test(python): pyspark.mllib.tests.test_feature
    Finished test(python): pyspark.ml.tests.test_wrapper (19s)
    Starting test(python): pyspark.mllib.tests.test_linalg
    Finished test(python): pyspark.mllib.tests.test_feature (26s)
    Starting test(python): pyspark.mllib.tests.test_stat
    Finished test(python): pyspark.mllib.tests.test_stat (22s)
    Starting test(python): pyspark.mllib.tests.test_streaming_algorithms
    Finished test(python): pyspark.mllib.tests.test_algorithms (53s)
    Starting test(python): pyspark.mllib.tests.test_util
    Finished test(python): pyspark.mllib.tests.test_linalg (54s)
    Starting test(python): pyspark.sql.tests.test_arrow
    Finished test(python): pyspark.sql.tests.test_arrow (0s) ... 61 tests were skipped
    Starting test(python): pyspark.sql.tests.test_catalog
    Finished test(python): pyspark.mllib.tests.test_util (11s)
    Starting test(python): pyspark.sql.tests.test_column
    Finished test(python): pyspark.sql.tests.test_catalog (16s)
    Starting test(python): pyspark.sql.tests.test_conf
    Finished test(python): pyspark.sql.tests.test_column (17s)
    Starting test(python): pyspark.sql.tests.test_context
    Finished test(python): pyspark.sql.tests.test_context (6s) ... 3 tests were skipped
    Starting test(python): pyspark.sql.tests.test_dataframe
    Finished test(python): pyspark.sql.tests.test_conf (11s)
    Starting test(python): pyspark.sql.tests.test_datasources
    Finished test(python): pyspark.sql.tests.test_datasources (19s)
    Starting test(python): pyspark.sql.tests.test_functions
    Finished test(python): pyspark.sql.tests.test_dataframe (35s) ... 3 tests were skipped
    Starting test(python): pyspark.sql.tests.test_group
    Finished test(python): pyspark.sql.tests.test_functions (32s)
    Starting test(python): pyspark.sql.tests.test_pandas_cogrouped_map
    Finished test(python): pyspark.sql.tests.test_pandas_cogrouped_map (1s) ... 15 tests were skipped
    Starting test(python): pyspark.sql.tests.test_pandas_grouped_map
    Finished test(python): pyspark.sql.tests.test_group (19s)
    Starting test(python): pyspark.sql.tests.test_pandas_map
    Finished test(python): pyspark.sql.tests.test_pandas_grouped_map (0s) ... 21 tests were skipped
    Starting test(python): pyspark.sql.tests.test_pandas_udf
    Finished test(python): pyspark.sql.tests.test_pandas_map (0s) ... 6 tests were skipped
    Starting test(python): pyspark.sql.tests.test_pandas_udf_grouped_agg
    Finished test(python): pyspark.sql.tests.test_pandas_udf (0s) ... 6 tests were skipped
    Starting test(python): pyspark.sql.tests.test_pandas_udf_scalar
    Finished test(python): pyspark.sql.tests.test_pandas_udf_grouped_agg (0s) ... 13 tests were skipped
    Starting test(python): pyspark.sql.tests.test_pandas_udf_typehints
    Finished test(python): pyspark.sql.tests.test_pandas_udf_scalar (0s) ... 50 tests were skipped
    Starting test(python): pyspark.sql.tests.test_pandas_udf_window
    Finished test(python): pyspark.sql.tests.test_pandas_udf_typehints (0s) ... 10 tests were skipped
    Starting test(python): pyspark.sql.tests.test_readwriter
    Finished test(python): pyspark.sql.tests.test_pandas_udf_window (0s) ... 14 tests were skipped
    Starting test(python): pyspark.sql.tests.test_serde
    Finished test(python): pyspark.sql.tests.test_serde (19s)
    Starting test(python): pyspark.sql.tests.test_session
    Finished test(python): pyspark.mllib.tests.test_streaming_algorithms (120s)
    Starting test(python): pyspark.sql.tests.test_streaming
    Finished test(python): pyspark.sql.tests.test_readwriter (25s)
    Starting test(python): pyspark.sql.tests.test_types
    Finished test(python): pyspark.ml.tests.test_tuning (208s)
    Starting test(python): pyspark.sql.tests.test_udf
    Finished test(python): pyspark.sql.tests.test_session (31s)
    Starting test(python): pyspark.sql.tests.test_utils
    Finished test(python): pyspark.sql.tests.test_streaming (35s)
    Starting test(python): pyspark.streaming.tests.test_context
    Finished test(python): pyspark.sql.tests.test_types (34s)
    Starting test(python): pyspark.streaming.tests.test_dstream
    Finished test(python): pyspark.sql.tests.test_utils (14s)
    Starting test(python): pyspark.streaming.tests.test_kinesis
    Finished test(python): pyspark.streaming.tests.test_kinesis (0s) ... 2 tests were skipped
    Starting test(python): pyspark.streaming.tests.test_listener
    Finished test(python): pyspark.streaming.tests.test_listener (11s)
    Starting test(python): pyspark.tests.test_appsubmit
    Finished test(python): pyspark.sql.tests.test_udf (39s)
    Starting test(python): pyspark.tests.test_broadcast
    Finished test(python): pyspark.streaming.tests.test_context (23s)
    Starting test(python): pyspark.tests.test_conf
    Finished test(python): pyspark.tests.test_conf (15s)
    Starting test(python): pyspark.tests.test_context
    Finished test(python): pyspark.tests.test_broadcast (33s)
    Starting test(python): pyspark.tests.test_daemon
    Finished test(python): pyspark.tests.test_daemon (5s)
    Starting test(python): pyspark.tests.test_install_spark
    Finished test(python): pyspark.tests.test_context (44s)
    Starting test(python): pyspark.tests.test_join
    Finished test(python): pyspark.tests.test_appsubmit (68s)
    Starting test(python): pyspark.tests.test_profiler
    Finished test(python): pyspark.tests.test_join (7s)
    Starting test(python): pyspark.tests.test_rdd
    Finished test(python): pyspark.tests.test_profiler (9s)
    Starting test(python): pyspark.tests.test_rddbarrier
    Finished test(python): pyspark.tests.test_rddbarrier (7s)
    Starting test(python): pyspark.tests.test_readwrite
    Finished test(python): pyspark.streaming.tests.test_dstream (107s)
    Starting test(python): pyspark.tests.test_serializers
    Finished test(python): pyspark.tests.test_serializers (8s)
    Starting test(python): pyspark.tests.test_shuffle
    Finished test(python): pyspark.tests.test_readwrite (14s)
    Starting test(python): pyspark.tests.test_taskcontext
    Finished test(python): pyspark.tests.test_install_spark (65s)
    Starting test(python): pyspark.tests.test_util
    Finished test(python): pyspark.tests.test_shuffle (8s)
    Starting test(python): pyspark.tests.test_worker
    Finished test(python): pyspark.tests.test_util (5s)
    Starting test(python): pyspark.accumulators
    Finished test(python): pyspark.accumulators (5s)
    Starting test(python): pyspark.broadcast
    Finished test(python): pyspark.broadcast (6s)
    Starting test(python): pyspark.conf
    Finished test(python): pyspark.tests.test_worker (14s)
    Starting test(python): pyspark.context
    Finished test(python): pyspark.conf (4s)
    Starting test(python): pyspark.ml.classification
    Finished test(python): pyspark.tests.test_rdd (60s)
    Starting test(python): pyspark.ml.clustering
    Finished test(python): pyspark.context (21s)
    Starting test(python): pyspark.ml.evaluation
    Finished test(python): pyspark.tests.test_taskcontext (69s)
    Starting test(python): pyspark.ml.feature
    Finished test(python): pyspark.ml.evaluation (26s)
    Starting test(python): pyspark.ml.fpm
    Finished test(python): pyspark.ml.clustering (45s)
    Starting test(python): pyspark.ml.functions
    Finished test(python): pyspark.ml.fpm (24s)
    Starting test(python): pyspark.ml.image
    Finished test(python): pyspark.ml.functions (17s)
    Starting test(python): pyspark.ml.linalg.__init__
    Finished test(python): pyspark.ml.linalg.__init__ (0s)
    Starting test(python): pyspark.ml.recommendation
    Finished test(python): pyspark.ml.classification (74s)
    Starting test(python): pyspark.ml.regression
    Finished test(python): pyspark.ml.image (8s)
    Starting test(python): pyspark.ml.stat
    Finished test(python): pyspark.ml.stat (29s)
    Starting test(python): pyspark.ml.tuning
    Finished test(python): pyspark.ml.regression (53s)
    Starting test(python): pyspark.mllib.classification
    Finished test(python): pyspark.ml.tuning (35s)
    Starting test(python): pyspark.mllib.clustering
    Finished test(python): pyspark.ml.feature (103s)
    Starting test(python): pyspark.mllib.evaluation
    Finished test(python): pyspark.mllib.classification (33s)
    Starting test(python): pyspark.mllib.feature
    Finished test(python): pyspark.mllib.evaluation (21s)
    Starting test(python): pyspark.mllib.fpm
    Finished test(python): pyspark.ml.recommendation (103s)
    Starting test(python): pyspark.mllib.linalg.__init__
    Finished test(python): pyspark.mllib.linalg.__init__ (1s)
    Starting test(python): pyspark.mllib.linalg.distributed
    Finished test(python): pyspark.mllib.feature (26s)
    Starting test(python): pyspark.mllib.random
    Finished test(python): pyspark.mllib.fpm (23s)
    Starting test(python): pyspark.mllib.recommendation
    Finished test(python): pyspark.mllib.clustering (50s)
    Starting test(python): pyspark.mllib.regression
    Finished test(python): pyspark.mllib.random (13s)
    Starting test(python): pyspark.mllib.stat.KernelDensity
    Finished test(python): pyspark.mllib.stat.KernelDensity (1s)
    Starting test(python): pyspark.mllib.stat._statistics
    Finished test(python): pyspark.mllib.linalg.distributed (42s)
    Starting test(python): pyspark.mllib.tree
    Finished test(python): pyspark.mllib.stat._statistics (19s)
    Starting test(python): pyspark.mllib.util
    Finished test(python): pyspark.mllib.regression (33s)
    Starting test(python): pyspark.profiler
    Finished test(python): pyspark.mllib.recommendation (36s)
    Starting test(python): pyspark.rdd
    Finished test(python): pyspark.profiler (9s)
    Starting test(python): pyspark.resource.tests.test_resources
    Finished test(python): pyspark.mllib.tree (19s)
    Starting test(python): pyspark.serializers
    Finished test(python): pyspark.mllib.util (21s)
    Starting test(python): pyspark.shuffle
    Finished test(python): pyspark.resource.tests.test_resources (9s)
    Starting test(python): pyspark.sql.avro.functions
    Finished test(python): pyspark.shuffle (1s)
    Starting test(python): pyspark.sql.catalog
    Finished test(python): pyspark.rdd (22s)
    Starting test(python): pyspark.sql.column
    Finished test(python): pyspark.serializers (12s)
    Starting test(python): pyspark.sql.conf
    Finished test(python): pyspark.sql.conf (6s)
    Starting test(python): pyspark.sql.context
    Finished test(python): pyspark.sql.catalog (14s)
    Starting test(python): pyspark.sql.dataframe
    Finished test(python): pyspark.sql.avro.functions (15s)
    Starting test(python): pyspark.sql.functions
    Finished test(python): pyspark.sql.column (24s)
    Starting test(python): pyspark.sql.group
    Finished test(python): pyspark.sql.context (20s)
    Starting test(python): pyspark.sql.pandas.conversion
    Finished test(python): pyspark.sql.pandas.conversion (13s)
    Starting test(python): pyspark.sql.pandas.group_ops
    Finished test(python): pyspark.sql.group (36s)
    Starting test(python): pyspark.sql.pandas.map_ops
    Finished test(python): pyspark.sql.pandas.group_ops (21s)
    Starting test(python): pyspark.sql.pandas.serializers
    Finished test(python): pyspark.sql.pandas.serializers (0s)
    Starting test(python): pyspark.sql.pandas.typehints
    Finished test(python): pyspark.sql.pandas.typehints (0s)
    Starting test(python): pyspark.sql.pandas.types
    Finished test(python): pyspark.sql.pandas.types (0s)
    Starting test(python): pyspark.sql.pandas.utils
    Finished test(python): pyspark.sql.pandas.utils (0s)
    Starting test(python): pyspark.sql.readwriter
    Finished test(python): pyspark.sql.dataframe (56s)
    Starting test(python): pyspark.sql.session
    Finished test(python): pyspark.sql.functions (57s)
    Starting test(python): pyspark.sql.streaming
    Finished test(python): pyspark.sql.pandas.map_ops (12s)
    Starting test(python): pyspark.sql.types
    Finished test(python): pyspark.sql.types (10s)
    Starting test(python): pyspark.sql.udf
    Finished test(python): pyspark.sql.streaming (16s)
    Starting test(python): pyspark.sql.window
    Finished test(python): pyspark.sql.session (19s)
    Starting test(python): pyspark.streaming.util
    Finished test(python): pyspark.streaming.util (0s)
    Starting test(python): pyspark.util
    Finished test(python): pyspark.util (0s)
    Finished test(python): pyspark.sql.readwriter (24s)
    Finished test(python): pyspark.sql.udf (13s)
    Finished test(python): pyspark.sql.window (14s)
    Tests passed in 780 seconds
    
    ```
    
    Closes apache#30288 from HyukjinKwon/SPARK-33371-3.0.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    HyukjinKwon authored and dongjoon-hyun committed Nov 9, 2020
    Configuration menu
    Copy the full SHA
    808dd8f View commit details
    Browse the repository at this point in the history
  2. [SPARK-33372][SQL] Fix InSet bucket pruning

    ### What changes were proposed in this pull request?
    
    This PR fixes `InSet` bucket pruning, because its values are plain Catalyst values rather than `Literal` expressions:
    https://github.com/apache/spark/blob/cbd3fdea62dab73fc4a96702de8fd1f07722da66/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L253-L255
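    For reference, a minimal sketch of the difference between the two predicate shapes (illustrative only; the attribute and values here are made up, not the PR's test case):
    
    ```scala
    import org.apache.spark.sql.catalyst.expressions.{AttributeReference, In, InSet, Literal}
    import org.apache.spark.sql.types.LongType
    
    val a = AttributeReference("a", LongType)()
    // In keeps its values as Literal child expressions...
    val inPredicate = In(a, Seq(Literal(1L), Literal(2L)))
    // ...while InSet (produced by OptimizeIn for long value lists) keeps raw Catalyst
    // values in `hset`, so bucket pruning must not pattern-match on Literal here.
    val inSetPredicate = InSet(a, Set(1L, 2L))
    ```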
    
    ### Why are the changes needed?
    
    Fix bug.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit test and manual test:
    
    ```scala
    spark.sql("select id as a, id as b from range(10000)").write.bucketBy(100, "a").saveAsTable("t")
    spark.sql("select * from t where a in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)").show
    ```
    
    Before this PR | After this PR
    -- | --
    ![image](https://user-images.githubusercontent.com/5399861/98380788-fb120980-2083-11eb-8fae-4e21ad873e9b.png) | ![image](https://user-images.githubusercontent.com/5399861/98381095-5ba14680-2084-11eb-82ca-2d780c85305c.png)
    
    Closes apache#30279 from wangyum/SPARK-33372.
    
    Authored-by: Yuming Wang <yumwang@ebay.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 69799c5)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    wangyum authored and cloud-fan committed Nov 9, 2020
    Configuration menu
    Copy the full SHA
    c157fa3 View commit details
    Browse the repository at this point in the history

Commits on Nov 10, 2020

  1. [SPARK-33397][YARN][DOC] Fix generating md to html for available-patt…

    …erns-for-shs-custom-executor-log-url
    
    ### What changes were proposed in this pull request?
    
    1. replace `{{}}` with `&#123;&#123;&#125;&#125;`
    2. using `<code></code>` in td-tag
    
    ### Why are the changes needed?
    
    to fix this.
    ![image](https://user-images.githubusercontent.com/8326978/98544155-8c74bc00-22ce-11eb-8889-8dacb726b762.png)
    
    ### Does this PR introduce _any_ user-facing change?
    
    yes, you will see the correct online doc with this change
    
    ![image](https://user-images.githubusercontent.com/8326978/98545256-2e48d880-22d0-11eb-9dd9-b8cae3df8659.png)
    
    ### How was this patch tested?
    
    shown as the above pic via jekyll serve.
    
    Closes apache#30298 from yaooqinn/SPARK-33397.
    
    Authored-by: Kent Yao <yaooqinn@hotmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 036c11b)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    yaooqinn authored and maropu committed Nov 10, 2020
    Configuration menu
    Copy the full SHA
    a418495 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33405][BUILD][3.0] Upgrade commons-compress to 1.20

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade `commons-compress` from 1.8 to 1.20.
    
    ### Why are the changes needed?
    
    - https://commons.apache.org/proper/commons-compress/security-reports.html
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    Closes apache#30305 from dongjoon-hyun/SPARK-33405-3.0.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Nov 10, 2020
    Configuration menu
    Copy the full SHA
    1aa8f4f View commit details
    Browse the repository at this point in the history
  3. [SPARK-33391][SQL] element_at with CreateArray not respect one based …

    …index
    
    ### What changes were proposed in this pull request?
    
    element_at with CreateArray does not respect the one-based index.
    
    Repro steps:
    
    ```
    var df = spark.sql("select element_at(array(3, 2, 1), 0)")
    df.printSchema()
    
    df = spark.sql("select element_at(array(3, 2, 1), 1)")
    df.printSchema()
    
    df = spark.sql("select element_at(array(3, 2, 1), 2)")
    df.printSchema()
    
    df = spark.sql("select element_at(array(3, 2, 1), 3)")
    df.printSchema()
    
    root
     |-- element_at(array(3, 2, 1), 0): integer (nullable = false)
    
    root
     |-- element_at(array(3, 2, 1), 1): integer (nullable = false)
    
    root
     |-- element_at(array(3, 2, 1), 2): integer (nullable = false)
    
    root
     |-- element_at(array(3, 2, 1), 3): integer (nullable = true)
    
    The correct nullability should be:
    index 0: true  (out of bounds, so the result is nullable by default)
    index 1: false
    index 2: false
    index 3: false
    
    ```
    
    Expression evaluation respects the one-based index, but the nullability check in `computeNullabilityFromArray` computes it with a zero-based index.
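    
    A minimal sketch of the intended nullability computation (illustrative helper, not the actual `computeNullabilityFromArray` code):
    
    ```scala
    // Given the nullabilities of the CreateArray children and a literal one-based index,
    // the result is nullable only if the index is out of bounds or the addressed child
    // is itself nullable.
    def nullabilityForOneBasedIndex(childNullable: Seq[Boolean], index: Int): Boolean =
      if (index < 1 || index > childNullable.length) true   // out of bounds => null result
      else childNullable(index - 1)                          // one-based -> zero-based
    ```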
    
    ### Why are the changes needed?
    
    Correctness issue.
    
    ### Does this PR introduce any user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added UT and existing UT.
    
    Closes apache#30296 from leanken/leanken-SPARK-33391.
    
    Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit e3a768d)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    leanken-zz authored and cloud-fan committed Nov 10, 2020
    Configuration menu
    Copy the full SHA
    b905d65 View commit details
    Browse the repository at this point in the history
  4. [SPARK-33339][PYTHON] Pyspark application will hang due to non Except…

    …ion error
    
    ### What changes were proposed in this pull request?
    
    When a SystemExit exception occurs during processing, the Python worker exits abnormally while the executor task is still waiting to read from the socket, causing it to hang.
    The SystemExit exception may be raised by the user's code, but Spark should at least throw an error to remind the user instead of getting stuck.
    We can run a simple test to reproduce this case:
    
    ```
    from pyspark.sql import SparkSession
    def err(line):
      raise SystemExit
    spark = SparkSession.builder.appName("test").getOrCreate()
    spark.sparkContext.parallelize(range(1,2), 2).map(err).collect()
    spark.stop()
    ```
    
    ### Why are the changes needed?
    
    To make sure a PySpark application won't hang if there's a non-Exception error in the Python worker.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    added a new test and also manually tested the case above
    
    Closes apache#30248 from li36909/pyspark.
    
    Lead-authored-by: lrz <lrz@lrzdeMacBook-Pro.local>
    Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 27bb40b)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    lrz and HyukjinKwon committed Nov 10, 2020
    Configuration menu
    Copy the full SHA
    4a1c143 View commit details
    Browse the repository at this point in the history

Commits on Nov 11, 2020

  1. [SPARK-33417][SQL][TEST] Correct the behaviour of query filters in TP…

    …CDSQueryBenchmark
    
    ### What changes were proposed in this pull request?
    
    This PR intends to fix the behaviour of query filters in `TPCDSQueryBenchmark`. We can use the option `--query-filter` to select which TPCDS queries to run, e.g., `--query-filter q6,q8,q13`. But the current master handles this option incorrectly: for example, if we pass `--query-filter q6` intending to run only the TPCDS q6 query, `TPCDSQueryBenchmark` runs both `q6` and `q6-v2.7`, because the `filterQueries` method does not respect the name suffix. So there is currently no way to run TPCDS q6 only.
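    
    A rough sketch of suffix-aware filtering (illustrative signature; the real `filterQueries` lives in `TPCDSQueryBenchmark`):
    
    ```scala
    // When filtering the "-v2.7" query set, the caller passes nameSuffix = "-v2.7" and the
    // base query names; a variant is then selected only if the filter names it explicitly
    // (e.g. "q6-v2.7"), so a plain "q6" in the filter no longer matches "q6-v2.7".
    def filterQueries(
        origQueries: Seq[String],
        queryFilter: Set[String],
        nameSuffix: String = ""): Seq[String] = {
      if (queryFilter.isEmpty) {
        origQueries
      } else if (nameSuffix.nonEmpty) {
        origQueries.filter(name => queryFilter.contains(s"$name$nameSuffix"))
      } else {
        origQueries.filter(queryFilter.contains)
      }
    }
    ```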
    
    ### Why are the changes needed?
    
    Bugfix.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually checked.
    
    Closes apache#30324 from maropu/FilterBugInTPCDSQueryBenchmark.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 4b36797)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    maropu committed Nov 11, 2020
    Configuration menu
    Copy the full SHA
    577dbb9 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33412][SQL][3.0] OverwriteByExpression should resolve its dele…

    …te condition based on the table relation not the input query
    
    backport apache#30318 to 3.0
    
    Closes apache#30328 from cloud-fan/backport.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan committed Nov 11, 2020
    Configuration menu
    Copy the full SHA
    1e2984b View commit details
    Browse the repository at this point in the history
  3. [SPARK-33402][CORE] Jobs launched in same second have duplicate MapRe…

    …duce JobIDs
    
    ### What changes were proposed in this pull request?
    
    1. Applies the SQL changes in SPARK-33230 to SparkHadoopWriter, so that `rdd.saveAsNewAPIHadoopDataset` passes in a unique job UUID in `spark.sql.sources.writeJobUUID`
    1. `SparkHadoopWriterUtils.createJobTrackerID` generates a JobID by appending a random long number to the supplied timestamp to ensure the probability of a collision is near-zero (a minimal sketch follows this list).
    1. With tests of uniqueness, round trips and negative jobID rejection.
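    
    A minimal sketch of the ID generation described in item 2 (format and helper name are illustrative, not the exact code added by this PR):
    
    ```scala
    import java.text.SimpleDateFormat
    import java.util.{Date, Locale}
    import scala.util.Random
    
    // Append a non-negative random long to the second-granularity timestamp, so two jobs
    // started within the same second still get distinct job tracker IDs.
    def createUniqueJobTrackerID(startTime: Date): String = {
      val base = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(startTime)
      val rand = Random.nextLong() & Long.MaxValue   // mask the sign bit => non-negative
      s"$base$rand"
    }
    ```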
    
    ### Why are the changes needed?
    
    Without this, if more than one job is started in the same second, any committer that expects application attempt IDs to be unique is at risk of clashing with the other jobs.
    
    With the fix,
    
    * those committers which use the ID set in `spark.sql.sources.writeJobUUID` as a priority ID will pick that up instead and so be unique.
    * committers which use the Hadoop JobID for unique paths and filenames will get the randomly generated jobID. Assuming all clocks in a cluster are in sync, the probability of a collision between two jobs launched in the same second drops from 1 to 1/(2^63).
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit tests
    
    There's a new test suite SparkHadoopWriterUtilsSuite which creates jobID, verifies they are unique even for the same timestamp and that they can be marshalled to string and parsed back in the hadoop code, which contains some (brittle) assumptions about the format of job IDs.
    
    Functional Integration Tests
    
    1. Hadoop-trunk built with [HADOOP-17318], publishing to local maven repository
    1. Spark built with hadoop.version=3.4.0-SNAPSHOT to pick up these JARs.
    1. Spark + Object store integration tests at [https://github.com/hortonworks-spark/cloud-integration](https://github.com/hortonworks-spark/cloud-integration) were built against that local spark version
    1. And executed against AWS london.
    
    The tests were run with `fs.s3a.committer.require.uuid=true`, so the s3a committers fail fast if they don't get a job ID down. This showed that `rdd.saveAsNewAPIHadoopDataset` wasn't setting the UUID option. It again uses the current Date value for an app attempt, which is not guaranteed to be unique.
    
    With the change applied to spark, the relevant tests work, therefore the committers are getting unique job IDs.
    
    Closes apache#30319 from steveloughran/BUG/SPARK-33402-jobuuid.
    
    Authored-by: Steve Loughran <stevel@cloudera.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 318a173)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    steveloughran authored and dongjoon-hyun committed Nov 11, 2020
    Configuration menu
    Copy the full SHA
    3edec10 View commit details
    Browse the repository at this point in the history
  4. [SPARK-33404][SQL][3.0] Fix incorrect results in date_trunc expression

    **Backport apache#30303 to 3.0**
    
    ### What changes were proposed in this pull request?
    The following query produces incorrect results:
    ```
    SELECT date_trunc('minute', '1769-10-17 17:10:02')
    ```
    Spark currently incorrectly returns
    ```
    1769-10-17 17:10:02
    ```
    against the expected return value of
    ```
    1769-10-17 17:10:00
    ```
    **Steps to repro**
    Run the following commands in spark-shell:
    ```
    spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
    spark.sql("SELECT date_trunc('minute', '1769-10-17 17:10:02')").show()
    ```
    This happens as `truncTimestamp` in package `org.apache.spark.sql.catalyst.util.DateTimeUtils` incorrectly assumes that time zone offsets can never have the granularity of a second and thus does not account for time zone adjustment when truncating the given timestamp to `minute`.
    This assumption is currently used when truncating the timestamps to `microsecond, millisecond, second, or minute`.
    
    This PR fixes this issue and always uses time zone knowledge when truncating timestamps regardless of the truncation unit.
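    
    For intuition, a small sketch with the JDK time API (not Spark's internal `DateTimeUtils`), assuming the session time zone is America/Los_Angeles; before standardized time zones, its local-mean-time offset was not a whole number of minutes, which is why even second/minute truncation needs zone knowledge:
    
    ```scala
    import java.time.{ZoneId, ZonedDateTime}
    import java.time.temporal.ChronoUnit
    
    val zone = ZoneId.of("America/Los_Angeles")
    val ts = ZonedDateTime.of(1769, 10, 17, 17, 10, 2, 0, zone)
    // Truncation is applied to the local (zone-adjusted) wall-clock time.
    val truncated = ts.truncatedTo(ChronoUnit.MINUTES)  // 1769-10-17T17:10 in that zone
    ```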
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added new tests to `DateTimeUtilsSuite` which previously failed and pass now.
    
    Closes apache#30339 from utkarsh39/date_trunc_fix_3.0.
    
    Authored-by: Utkarsh <utkarsh.agarwal@databricks.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    utkarsh39 authored and dongjoon-hyun committed Nov 11, 2020
    Configuration menu
    Copy the full SHA
    00be83a View commit details
    Browse the repository at this point in the history

Commits on Nov 12, 2020

  1. [SPARK-33408][K8S][R][3.0] Use R 3.6.3 in K8s R image

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade K8s R image to use R 3.6.3 which is the same version installed in Jenkins Servers.
    
    ### Why are the changes needed?
    
    Jenkins Server is using `R 3.6.3`.
    ```
    + SPARK_HOME=/home/jenkins/workspace/SparkPullRequestBuilder-K8s
    + /usr/bin/R CMD check --as-cran --no-tests SparkR_3.1.0.tar.gz
    * using log directory ‘/home/jenkins/workspace/SparkPullRequestBuilder-K8s/R/SparkR.Rcheck’
    * using R version 3.6.3 (2020-02-29)
    ```
    
    The OpenJDK docker image is using `R 3.5.2 (2018-12-20)`, which is old, and currently `spark-3.0.1` fails to run SparkR.
    ```
    $ cd spark-3.0.1-bin-hadoop3.2
    
    $ bin/docker-image-tool.sh -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile -n build
    
    $ bin/spark-submit --master k8s://https://192.168.64.49:8443 --deploy-mode cluster --conf spark.kubernetes.container.image=spark-r:latest local:///opt/spark/examples/src/main/r/dataframe.R
    
    $ k logs dataframe-r-b1c14b75b0c09eeb-driver
    ...
    + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.4 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.RRunner local:///opt/spark/examples/src/main/r/dataframe.R
    20/11/10 06:03:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    log4j:WARN No appenders could be found for logger (io.netty.util.internal.logging.InternalLoggerFactory).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    Error: package or namespace load failed for ‘SparkR’ in rbind(info, getNamespaceInfo(env, "S3methods")):
     number of columns of matrices must match (see arg 2)
    In addition: Warning message:
    package ‘SparkR’ was built under R version 4.0.2
    Execution halted
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    ### How was this patch tested?
    
    Pass K8s IT.
    
    Closes apache#30310 from dongjoon-hyun/SPARK-33408.
    
    Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    2 people authored and HyukjinKwon committed Nov 12, 2020
    Configuration menu
    Copy the full SHA
    2eadedc View commit details
    Browse the repository at this point in the history
  2. [MINOR][DOC] spark.executor.memoryOverhead is not cluster-mode only

    ### What changes were proposed in this pull request?
    
    Remove "in cluster mode" from the description of `spark.executor.memoryOverhead`
    
    ### Why are the changes needed?
    
    Fix a correctness issue in the documentation.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, users will no longer be confused by the description of `spark.executor.memoryOverhead`.
    
    ### How was this patch tested?
    
    pass GA doc generation
    
    Closes apache#30311 from yaooqinn/minordoc.
    
    Authored-by: Kent Yao <yaooqinn@hotmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 4335af0)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    yaooqinn authored and maropu committed Nov 12, 2020
    Configuration menu
    Copy the full SHA
    5ee76e6 View commit details
    Browse the repository at this point in the history

Commits on Nov 13, 2020

  1. [SPARK-33435][SQL][3.0] DSv2: REFRESH TABLE should invalidate caches …

    …referencing the table
    
    ### What changes were proposed in this pull request?
    
    This is a backport for PR apache#30359.
    
    This changes `RefreshTableExec` in DSv2 to also invalidate caches that reference the target table to be refreshed. The change itself is similar to what's done in apache#30211. Note, though, that since we currently don't support caching a DSv2 table directly, this doesn't add recache logic as in the DSv1 implementation. I marked it as a TODO for now.
    
    Note there are some conflicts in the backport: in branch-3.0's `DataSourceV2Strategy` we don't have a `ResolvedTable` when analyzing `RefreshTable`, so instead `RefreshTableExec` loads the table from the catalog if it exists; the rest is the same.
    
    ### Why are the changes needed?
    
    Currently the behavior in DSv1 and DSv2 is inconsistent w.r.t. refreshing a table: in DSv1 we invalidate both the metadata cache and all table caches related to the table, but in DSv2 we only do the former. This addresses the issue and makes the behavior consistent.
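    
    A minimal usage-level illustration (assuming a DSv2 catalog registered as `testcat` and an existing table `testcat.ns.tbl`; the names are made up):
    
    ```scala
    // Cache a query that reads the v2 table.
    spark.sql("CACHE TABLE cached_v2 AS SELECT * FROM testcat.ns.tbl")
    
    // After this change, refreshing the table also invalidates `cached_v2`,
    // because the cached plan references the refreshed table.
    spark.sql("REFRESH TABLE testcat.ns.tbl")
    ```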
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, now refreshing a v2 table also invalidate all the related caches.
    
    ### How was this patch tested?
    
    Added a new UT.
    
    Closes apache#30360 from sunchao/SPARK-33435-branch-3.0.
    
    Authored-by: Chao Sun <sunchao@apple.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sunchao authored and dongjoon-hyun committed Nov 13, 2020
    Configuration menu
    Copy the full SHA
    e684720 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33439][INFRA] Use SERIAL_SBT_TESTS=1 for SQL modules

    ### What changes were proposed in this pull request?
    
    This PR aims to decrease the parallelism of `SQL` module like `Hive` module.
    
    ### Why are the changes needed?
    
    GitHub Action `sql - slow tests` become flaky.
    - https://github.com/apache/spark/runs/1393670291
    - https://github.com/apache/spark/runs/1393088031
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. This is a dev-only change.
    Although this will increase the running time, it's better than flakiness.
    
    ### How was this patch tested?
    
    Pass the GitHub Action stably.
    
    Closes apache#30365 from dongjoon-hyun/SPARK-33439.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit a70a2b0)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Nov 13, 2020
    Configuration menu
    Copy the full SHA
    921daa8 View commit details
    Browse the repository at this point in the history

Commits on Nov 16, 2020

  1. [SPARK-33358][SQL] Return code when command process failed

    Exit the Spark SQL CLI processing loop if one of the commands (sub SQL statements) fails to process.
    
    This is a regression at Apache Spark 3.0.0.
    
    ```
    $ cat 1.sql
    select * from nonexistent_table;
    select 2;
    ```
    
    **Apache Spark 2.4.7**
    ```
    spark-2.4.7-bin-hadoop2.7:$ bin/spark-sql -f 1.sql
    20/11/15 16:14:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Error in query: Table or view not found: nonexistent_table; line 1 pos 14
    ```
    
    **Apache Spark 3.0.1**
    ```
    $ bin/spark-sql -f 1.sql
    Error in query: Table or view not found: nonexistent_table; line 1 pos 14;
    'Project [*]
    +- 'UnresolvedRelation [nonexistent_table]
    
    2
    Time taken: 2.786 seconds, Fetched 1 row(s)
    ```
    
    **Apache Hive 1.2.2**
    ```
    apache-hive-1.2.2-bin:$ bin/hive -f 1.sql
    
    Logging initialized using configuration in jar:file:/Users/dongjoon/APACHE/hive-release/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
    FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'nonexistent_table'
    ```
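    
    The intended behaviour can be sketched roughly as follows (illustrative helper; the real change is in the Spark SQL CLI driver's statement-processing loop):
    
    ```scala
    // Process statements in order and stop at the first failure, propagating a
    // non-zero return code instead of continuing with the remaining statements.
    def processStatements(statements: Seq[String], run: String => Int): Int =
      statements.foldLeft(0)((rc, stmt) => if (rc != 0) rc else run(stmt))
    ```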
    
    Yes. This is a fix of regression.
    
    Pass the UT.
    
    Closes apache#30263 from artiship/SPARK-33358.
    
    Authored-by: artiship <meilziner@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 1ae6d64)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    artiship authored and dongjoon-hyun committed Nov 16, 2020
    Configuration menu
    Copy the full SHA
    45bdb58 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33451][DOCS] Change to 'spark.sql.adaptive.skewJoin.skewedPart…

    …itionThresholdInBytes' in documentation
    
    ### What changes were proposed in this pull request?
    
    In the 'Optimizing Skew Join' section of the following two pages:
    1. [https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html](https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html)
    2. [https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html](https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html)
    
    The configuration 'spark.sql.adaptive.skewedPartitionThresholdInBytes' should be changed to 'spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes'; the former is missing the 'skewJoin' part.
    
    ### Why are the changes needed?
    
    To document the correct name of configuration
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, this is a user-facing doc change.
    
    ### How was this patch tested?
    
    Jenkins / CI builds in this PR.
    
    Closes apache#30376 from aof00/doc_change.
    
    Authored-by: aof00 <x14562573449@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 0933f1c)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    Southwest16 authored and HyukjinKwon committed Nov 16, 2020
    Configuration menu
    Copy the full SHA
    265363d View commit details
    Browse the repository at this point in the history

Commits on Nov 17, 2020

  1. [MINOR][GRAPHX][3.0] Correct typos in the sub-modules: graphx, extern…

    …al, and examples
    
    ### What changes were proposed in this pull request?
    
    This PR intends to fix typos in the sub-modules: graphx, external, and examples.
    Split per holdenk apache#30323 (comment)
    
    NOTE: The misspellings have been reported at jsoref@706a726#commitcomment-44064356
    
    Backport of apache#30326
    
    ### Why are the changes needed?
    
    Misspelled words make it harder to read / understand content.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    No testing was performed
    
    Closes apache#30342 from jsoref/branch-3.0-30326.
    
    Authored-by: Josh Soref <jsoref@users.noreply.github.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    jsoref authored and maropu committed Nov 17, 2020
    Configuration menu
    Copy the full SHA
    26c0404 View commit details
    Browse the repository at this point in the history

Commits on Nov 19, 2020

  1. [SPARK-33464][INFRA][3.0] Add/remove (un)necessary cache and restruct…

    …ure GitHub Actions yaml
    
    ### What changes were proposed in this pull request?
    
    This PR backports apache#30391. Note that it's a partial backport.
    
    This PR proposes:
    - Add `~/.sbt` directory into the build cache, see also sbt/sbt#3681
    - ~Move `hadoop-2` below to put up together with `java-11` and `scala-213`, see apache#30391 (comment)
    - Remove unnecessary `.m2` cache if you run SBT tests only.
    - Remove `rm ~/.m2/repository/org/apache/spark`. If you don't `sbt publishLocal` or `mvn install`, we don't need to care about it.
    - ~Use Java 8 in Scala 2.13 build. We can switch the Java version to 11 used for release later.~
    - Add caches into linters. The linter scripts use `sbt` (for example, in `./dev/lint-scala`) and `mvn` (for example, in `./dev/lint-java`). The Jekyll build also requires `sbt package`, see: https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L160-L161. We need full caches here for SBT, Maven and the build tools.
    - Use the same syntax of Java version, 1.8 -> 8.
    
    ### Why are the changes needed?
    
    - Remove unnecessary stuff
    - Cache what we can in the build
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    It will be tested in GitHub Actions build at the current PR
    
    Closes apache#30416 from HyukjinKwon/SPARK-33464-3.0.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    HyukjinKwon authored and dongjoon-hyun committed Nov 19, 2020
    Configuration menu
    Copy the full SHA
    c301d9c View commit details
    Browse the repository at this point in the history
  2. [SPARK-27421][SQL] Fix filter for int column and value class java.lan…

    …g.String when pruning partition column
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the "filter for int column and value class java.lang.String" failure when pruning a partition column.
    
    How to reproduce this issue:
    ```scala
    spark.sql("CREATE table test (name STRING) partitioned by (id int) STORED AS PARQUET")
    spark.sql("CREATE VIEW test_view as select cast(id as string) as id, name from test")
    spark.sql("SELECT * FROM test_view WHERE id = '0'").explain
    ```
    ```
    20/11/15 06:19:01 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_partitions_by_filter : db=default tbl=test
    20/11/15 06:19:01 INFO MetaStoreDirectSql: Unable to push down SQL filter: Cannot push down filter for int column and value class java.lang.String
    20/11/15 06:19:01 ERROR SparkSQLDriver: Failed in [SELECT * FROM test_view WHERE id = '0']
    java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
     at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
     at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:745)
     at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
     at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
     at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
     at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
     at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:743)
    ```
    
    ### Why are the changes needed?
    
    Fix bug.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes apache#30380 from wangyum/SPARK-27421.
    
    Authored-by: Yuming Wang <yumwang@ebay.com>
    Signed-off-by: Yuming Wang <yumwang@ebay.com>
    (cherry picked from commit 014e1fb)
    Signed-off-by: Yuming Wang <yumwang@ebay.com>
    wangyum committed Nov 19, 2020
    Configuration menu
    Copy the full SHA
    1101938 View commit details
    Browse the repository at this point in the history
  3. [SPARK-33483][INFRA][TESTS][3.0] Fix rat exclusion patterns and add a…

    … LICENSE
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the RAT exclusion rule which originated from SPARK-1144 (Apache Spark 1.0).
    
    ### Why are the changes needed?
    
    This prevents the situation like apache#30415.
    
    Currently, it misses the `catalog` directory due to the `.log` rule.
    ```
    $ dev/check-license
    Could not find Apache license headers in the following files:
     !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java
     !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CI with the new rule.
    
    Closes apache#30424 from dongjoon-hyun/SPARK-33483-3.0.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Nov 19, 2020
    Configuration menu
    Copy the full SHA
    6b7172b View commit details
    Browse the repository at this point in the history

Commits on Nov 20, 2020

  1. [SPARK-33422][DOC] Fix the correct display of left menu item

    ### What changes were proposed in this pull request?
    Limit the height of the left menu area so that a vertical scroll bar is displayed.
    
    ### Why are the changes needed?
    
    The bottom menu item cannot be displayed when the left menu tree is long
    
    ### Does this PR introduce any user-facing change?
    
    Yes, when there are more menu items than fit in the view, you can reach them by pulling down the vertical scroll bar.
    
    before:
    ![image](https://user-images.githubusercontent.com/28332082/98805115-16995d80-2452-11eb-933a-3b72c14bea78.png)
    
    after:
    ![image](https://user-images.githubusercontent.com/28332082/98805418-7e4fa880-2452-11eb-9a9b-8d265078297c.png)
    
    ### How was this patch tested?
    NA
    
    Closes apache#30335 from liucht-inspur/master.
    
    Authored-by: liucht <liucht@inspur.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit cbc8be2)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    liucht-inspur authored and HyukjinKwon committed Nov 20, 2020
    Configuration menu
    Copy the full SHA
    d7c2dae View commit details
    Browse the repository at this point in the history
  2. [SPARK-33472][SQL][3.0] Adjust RemoveRedundantSorts rule order

    Backport apache#30373 for branch-3.0.
    
    ### What changes were proposed in this pull request?
    This PR switches the order of the rules `RemoveRedundantSorts` and `EnsureRequirements` so that `EnsureRequirements` is invoked before `RemoveRedundantSorts`, to avoid an IllegalArgumentException when instantiating `PartitioningCollection`.
    
    ### Why are the changes needed?
    `RemoveRedundantSorts` rule uses SparkPlan's `outputPartitioning` to check whether a sort node is redundant. Currently, it is added before `EnsureRequirements`. Since `PartitioningCollection` requires left and right partitioning to have the same number of partitions, which is not necessarily true before applying `EnsureRequirements`, the rule can fail with the following exception:
    ```
    IllegalArgumentException: requirement failed: PartitioningCollection requires all of its partitionings have the same numPartitions.
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Unit test
    
    Closes apache#30438 from allisonwang-db/spark-33472-3.0.
    
    Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    allisonwang-db authored and dongjoon-hyun committed Nov 20, 2020
    Configuration menu
    Copy the full SHA
    1e525c1 View commit details
    Browse the repository at this point in the history

Commits on Nov 23, 2020

  1. [MINOR][INFRA] Suppress warning in check-license

    ### What changes were proposed in this pull request?
    This PR aims to suppress the warning `File exists` in check-license
    
    ### Why are the changes needed?
    
    **BEFORE**
    ```
    % dev/check-license
    Attempting to fetch rat
    RAT checks passed.
    
    % dev/check-license
    mkdir: target: File exists
    RAT checks passed.
    ```
    
    **AFTER**
    ```
    % dev/check-license
    Attempting to fetch rat
    RAT checks passed.
    
    % dev/check-license
    RAT checks passed.
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manually do dev/check-license twice.
    
    Closes apache#30460 from williamhyun/checklicense.
    
    Authored-by: William Hyun <williamhyun3@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit a459238)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    williamhyun authored and HyukjinKwon committed Nov 23, 2020
    Configuration menu
    Copy the full SHA
    b70584f View commit details
    Browse the repository at this point in the history

Commits on Nov 24, 2020

  1. [SPARK-33524][SQL][TESTS] Change InMemoryTable not to use Tuple.has…

    …hCode for `BucketTransform`
    
    This PR aims to change `InMemoryTable` not to use `Tuple.hashCode` for `BucketTransform`.
    
    SPARK-32168 made `InMemoryTable` handle `BucketTransform` as a hash of a `Tuple`, which depends on the Scala version.
    - https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala#L159
    
    **Scala 2.12.10**
    ```scala
    $ bin/scala
    Welcome to Scala 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272).
    Type in expressions for evaluation. Or try :help.
    
    scala> (1, 1).hashCode
    res0: Int = -2074071657
    ```
    
    **Scala 2.13.3**
    ```scala
    Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_272).
    Type in expressions for evaluation. Or try :help.
    
    scala> (1, 1).hashCode
    val res0: Int = -1669302457
    ```
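    
    Since the tuple hash differs across Scala versions, a version-independent bucket computation could look like the following sketch (illustrative only; not necessarily the exact approach taken here):
    
    ```scala
    // Combine the column values with a hand-rolled hash instead of Tuple.hashCode,
    // so the bucket id is stable across Scala 2.12 and 2.13.
    def stableBucket(values: Seq[Any], numBuckets: Int): Int = {
      val h = values.foldLeft(0)((acc, v) => 31 * acc + (if (v == null) 0 else v.##))
      ((h % numBuckets) + numBuckets) % numBuckets   // keep the result in [0, numBuckets)
    }
    ```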
    
    Yes. This is a correctness issue.
    
    Pass the UT with both Scala 2.12/2.13.
    
    Closes apache#30477 from dongjoon-hyun/SPARK-33524.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 8380e00)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    dongjoon-hyun committed Nov 24, 2020
    Configuration menu
    Copy the full SHA
    200417e View commit details
    Browse the repository at this point in the history
  2. [SPARK-33535][INFRA][TESTS] Export LANG to en_US.UTF-8 in run-tests-j…

    …enkins script
    
    ### What changes were proposed in this pull request?
    It seems that Jenkins test tasks in many PRs have failed. The failed cases include:
    
    -  `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V1 get binary type`
    - `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V2 get binary type`
    - `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V3 get binary type`
    - `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V4 get binary type`
    - `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V5 get binary type`
    
    The error message as follows:
    
    ```
    Error Messageorg.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�]("Stacktracesbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�]("
    	at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
    	at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
    	at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
    	at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
    	at org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26(SparkThriftServerProtocolVersionsSuite.scala:302)
    ```
    
    But they pass in GitHub Actions, so this may be related to the `LANG` of the Jenkins build machine. This PR adds `export LANG="en_US.UTF-8"` to the `run-tests-jenkins` script.
    
    ### Why are the changes needed?
    Ensure LANG in Jenkins test process is `en_US.UTF-8` to pass `HIVE_CLI_SERVICE_PROTOCOL_VX` related tests
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Jenkins tests pass
    
    Closes apache#30487 from LuciferYang/SPARK-33535.
    
    Authored-by: yangjie01 <yangjie01@baidu.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 048a982)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    LuciferYang authored and dongjoon-hyun committed Nov 24, 2020
    Configuration menu
    Copy the full SHA
    efae8b6 View commit details
    Browse the repository at this point in the history

Commits on Nov 25, 2020

  1. [SPARK-33565][PYTHON][BUILD][3.0] Remove py38 spark3

    ### What changes were proposed in this pull request?
    Remove Python 3.8 from python/run-tests.py to stop build breaks.
    
    ### Why are the changes needed?
    the python tests are running against the bare-bones system install of python3, rather than an anaconda environment.
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    jenkins
    
    see also apache#30506
    
    Closes apache#30509 from shaneknapp/remove-py38-spark3.
    
    Authored-by: shane knapp <incomplete@gmail.com>
    Signed-off-by: shane knapp <incomplete@gmail.com>
    shaneknapp committed Nov 25, 2020
    Configuration menu
    Copy the full SHA
    8eedc41 View commit details
    Browse the repository at this point in the history

Commits on Nov 26, 2020

  1. [SPARK-33565][INFRA][FOLLOW-UP][3.0] Keep the test coverage with Pyth…

    …on 3.8 in GitHub Actions
    
    ### What changes were proposed in this pull request?
    
    This is a backport PR of apache#30510
    
    This PR proposes to keep the test coverage with Python 3.8 in GitHub Actions. It is not tested for now in Jenkins due to an env issue.
    
    **Before this change in GitHub Actions:**
    
    ```
    ========================================================================
    Running PySpark tests
    ========================================================================
    Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log
    Will test against the following Python executables: ['/usr/bin/python3', 'python2.7', 'pypy3']
    ...
    ```
    
    **After this change in GitHub Actions:**
    
    ```
    
    ========================================================================
    Running PySpark tests
    ========================================================================
    Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log
    Will test against the following Python executables: ['python3.8', 'python2.7', 'pypy3']
    ```
    
    ### Why are the changes needed?
    
    To keep the test coverage with Python 3.8 in GitHub Actions.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    GitHub Actions in this build will test.
    
    Closes apache#30511 from HyukjinKwon/SPARK-33565-3.0.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Nov 26, 2020
    Configuration menu
    Copy the full SHA
    7503c4a View commit details
    Browse the repository at this point in the history

Commits on Nov 29, 2020

  1. [SPARK-33585][SQL][DOCS] Fix the comment for SQLContext.tables() an…

    …d mention the `database` column
    
    ### What changes were proposed in this pull request?
    Change the comments for `SQLContext.tables()` to "The returned DataFrame has three columns, database, tableName and isTemporary".
    
    ### Why are the changes needed?
    Currently, the comment mentions only 2 columns, but `tables()` actually returns 3:
    ```scala
    scala> spark.range(10).createOrReplaceTempView("view1")
    scala> val tables = spark.sqlContext.tables()
    tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field]
    
    scala> tables.printSchema
    root
     |-- database: string (nullable = false)
     |-- tableName: string (nullable = false)
     |-- isTemporary: boolean (nullable = false)
    
    scala> tables.show
    +--------+---------+-----------+
    |database|tableName|isTemporary|
    +--------+---------+-----------+
    | default|       t1|      false|
    | default|       t2|      false|
    | default|      ymd|      false|
    |        |    view1|       true|
    +--------+---------+-----------+
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running `./dev/scalastyle`
    
    Closes apache#30526 from MaxGekk/sqlcontext-tables-doc.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit a088a80)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Nov 29, 2020
    Configuration menu
    Copy the full SHA
    f67f80b View commit details
    Browse the repository at this point in the history

Commits on Nov 30, 2020

  1. [SPARK-33579][UI] Fix executor blank page behind proxy

    ### What changes were proposed in this pull request?
    
    Fix some "hardcoded" API URLs in the Web UI.
    More specifically, we avoid the use of `location.origin` when constructing URLs for internal API calls within the JavaScript.
    Instead, we use the `apiRoot` global variable.
    
    ### Why are the changes needed?
    
    On one hand, it allows us to build relative URLs. On the other hand, `apiRoot` reflects the Spark property `spark.ui.proxyBase` which can be set to change the root path of the Web UI.
    
    If `spark.ui.proxyBase` is actually set, the original URLs become incorrect, and we end up with a blank executors page.
    I encountered this bug when accessing the Web UI behind a proxy (in my case a Kubernetes Ingress).
    
    See the following link for more context:
    jupyterhub/jupyter-server-proxy#57 (comment)
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, as all the changes introduced are in the JavaScript for the Web UI.
    
    ### How was this patch tested?
    I modified/debugged the JavaScript as in the commit with the help of the developer tools in Google Chrome, while accessing the Web UI of my Spark app behind my k8s ingress.
    
    Closes apache#30523 from pgillet/fix-executors-blank-page-behind-proxy.
    
    Authored-by: Pascal Gillet <pascal.gillet@stack-labs.com>
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    (cherry picked from commit 6e5446e)
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Pascal Gillet authored and sarutak committed Nov 30, 2020
    Configuration menu
    Copy the full SHA
    f6638cf View commit details
    Browse the repository at this point in the history
  2. [SPARK-33588][SQL][3.0] Respect the spark.sql.caseSensitive config …

    …while resolving partition spec in v1 `SHOW TABLE EXTENDED`
    
    ### What changes were proposed in this pull request?
    Perform partition spec normalization in `ShowTablesCommand` according to the table schema before getting partitions from the catalog. The normalization via `PartitioningUtils.normalizePartitionSpec()` adjusts the column names in partition specification, w.r.t. the real partition column names and case sensitivity.
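    
    A simplified sketch of the normalization (illustrative; the real code goes through `PartitioningUtils.normalizePartitionSpec` and the analyzer's resolver):
    
    ```scala
    // Map user-supplied partition keys onto the table's partition column names,
    // matching case-insensitively when spark.sql.caseSensitive is false.
    def normalizeSpec(
        spec: Map[String, String],
        partitionColumns: Seq[String],
        caseSensitive: Boolean): Map[String, String] =
      spec.map { case (key, value) =>
        val resolved = partitionColumns
          .find(c => if (caseSensitive) c == key else c.equalsIgnoreCase(key))
          .getOrElse(throw new IllegalArgumentException(s"$key is not a partition column"))
        resolved -> value
      }
    ```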
    
    ### Why are the changes needed?
    Even when `spark.sql.caseSensitive` is `false` which is the default value, v1 `SHOW TABLE EXTENDED` is case sensitive:
    ```sql
    spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
             > USING parquet
             > partitioned by (year, month);
    spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
    spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1);
    Error in query: Partition spec is invalid. The spec (YEAR, Month) must match the partition spec (year, month) defined in table '`default`.`tbl1`';
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, the `SHOW TABLE EXTENDED` command respects the SQL config. And for example above, it returns correct result:
    ```sql
    spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1);
    default	tbl1	false	Partition Values: [year=2015, month=1]
    Location: file:/Users/maximgekk/spark-warehouse/tbl1/year=2015/month=1
    Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
    InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
    OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
    Storage Properties: [serialization.format=1, path=file:/Users/maximgekk/spark-warehouse/tbl1]
    Partition Parameters: {transient_lastDdlTime=1606595118, totalSize=623, numFiles=1}
    Created Time: Sat Nov 28 23:25:18 MSK 2020
    Last Access: UNKNOWN
    Partition Statistics: 623 bytes
    ```
    
    ### How was this patch tested?
    By running the modified test suite via:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DDLSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 0054fc9)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#30549 from MaxGekk/show-table-case-sensitive-spec-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Nov 30, 2020
    Configuration menu
    Copy the full SHA
    03291c8 View commit details
    Browse the repository at this point in the history
  3. [SPARK-33440][CORE] Use current timestamp with warning log in HadoopF…

    …SDelegationTokenProvider when the issue date for token is not set up properly
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to use current timestamp with warning log when the issue date for token is not set up properly. The next section will explain the rationalization with details.
    
    ### Why are the changes needed?
    
    Unfortunately, not every implementation respects the `issue date` in `AbstractDelegationTokenIdentifier`, which Spark relies on in its calculation. The default value of the issue date is 0L, which is far from the actual issue date; this breaks the logic for calculating the next renewal date under some circumstances, leading to a 0 interval (i.e. immediate) when rescheduling token renewal.
    
    In HadoopFSDelegationTokenProvider, Spark calculates token renewal interval as below:
    
    https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala#L123-L134
    
    The interval is calculated as `token.renew() - identifier.getIssueDate`, which is correct as long as both `token.renew()` and `identifier.getIssueDate` produce correct values, but it goes wrong when `identifier.getIssueDate` returns 0L (the default value), like below:
    
    ```
    20/10/13 06:34:19 INFO security.HadoopFSDelegationTokenProvider: Renewal interval is 1603175657000 for token S3ADelegationToken/IDBroker
    20/10/13 06:34:19 INFO security.HadoopFSDelegationTokenProvider: Renewal interval is 86400048 for token HDFS_DELEGATION_TOKEN
    ```
    
    Fortunately we pick the minimum value as a safety guard (so in this case, `86400048` is picked up), but that safety guard has an unintended bad impact in this case.
    
    https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala#L58-L71
    
    Spark takes the interval calculated above (the "minimum" of the intervals) and blindly adds it to the token's issue date to calculate the next renewal date for the token, then picks the "minimum" value again. In the problematic case, the value would be `86400048` (86400048 + 0), which is far smaller than the current timestamp.
    
    https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala#L228-L234
    
    The current timestamp is then subtracted from the next renewal date to get the interval, which is multiplied by the configured ratio to produce the final schedule interval. In the problematic case, this value goes negative.
    
    https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala#L180-L188
    
    There's a safety guard that disallows a negative value, but it simply clamps to 0, meaning "schedule immediately". This triggers the next calculation of the next renewal date and schedule interval, which leads to the same behaviour, so the delegation token is updated immediately and continuously.
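    
    A worked numeric sketch of the failure mode, using the interval from the log above (the current timestamp is approximate and the variable names are illustrative):
    
    ```scala
    val now              = 1602570000000L                   // ~2020-10-13, current time in ms
    val minRenewInterval = 86400048L                        // ~1 day, the "minimum" interval picked above
    val badIssueDate     = 0L                               // identifier.getIssueDate left at its default
    val nextRenewalDate  = badIssueDate + minRenewInterval  // 86400048 => early January 1970
    val scheduleDelay    = nextRenewalDate - now            // hugely negative => clamped to 0
    // => the renewal is scheduled immediately, which recomputes the same values and
    //    schedules again, producing the tight renewal loop. Falling back to the current
    //    timestamp (with a warning) when the issue date is 0L breaks this loop.
    ```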
    
    As we fetch the token just before the calculation happens, the actual issue date is likely only slightly earlier, so it's not that dangerous to use the current timestamp as the issue date for a token whose issue date has not been set up properly. Still, it's better not to leave the token implementation as it is, so we log a warning to let end users consult the token implementer.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. End users won't encounter the tight token-renewal scheduling loop after this PR. From the end user's perspective, there's nothing they need to change.
    
    ### How was this patch tested?
    
    Manually tested with problematic environment.
    
    Closes apache#30366 from HeartSaVioR/SPARK-33440.
    
    Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    (cherry picked from commit f5d2165)
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    HeartSaVioR committed Nov 30, 2020
    Configuration menu
    Copy the full SHA
    242581f View commit details
    Browse the repository at this point in the history

Commits on Dec 1, 2020

  1. [SPARK-33611][UI] Avoid encoding twice on the query parameter of rewr…

    …itten proxy URL
    
    ### What changes were proposed in this pull request?
    
    When running Spark behind a reverse proxy(e.g. Nginx, Apache HTTP server), the request URL can be encoded twice if we pass the query string directly to the constructor of `java.net.URI`:
    ```
    > val uri = "http://localhost:8081/test"
    > val query = "order%5B0%5D%5Bcolumn%5D=0"  // query string of URL from the reverse proxy
    > val rewrittenURI = URI.create(uri.toString())
    
    > new URI(rewrittenURI.getScheme(),
          rewrittenURI.getAuthority(),
          rewrittenURI.getPath(),
          query,
          rewrittenURI.getFragment()).toString
    result: http://localhost:8081/test?order%255B0%255D%255Bcolumn%255D=0
    ```
    
    In Spark's stage page, the URL of "/taskTable" contains the query parameter order[0][dir]. After encoding twice, the query parameter becomes `order%255B0%255D%255Bdir%255D` and it will be decoded as `order%5B0%5D%5Bdir%5D` instead of `order[0][dir]`. As a result, there will be a NullPointerException from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176
    Other than that, the other parameters may not work as expected after being encoded twice.
    
    This PR is to fix the bug by calling the method `URI.create(String URL)` directly. This convenience method can avoid encoding twice on the query parameter.
    ```
    > val uri = "http://localhost:8081/test"
    > val query = "order%5B0%5D%5Bcolumn%5D=0"
    > URI.create(s"$uri?$query").toString
    result: http://localhost:8081/test?order%5B0%5D%5Bcolumn%5D=0
    
    > URI.create(s"$uri?$query").getQuery
    result: order[0][column]=0
    ```
    
    ### Why are the changes needed?
    
    Fix a potential bug when Spark's reverse proxy is enabled.
    The bug itself is similar to apache#29271.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Add a new unit test.
    Also, Manual UI testing for master, worker and app UI with an nginx proxy
    
    Spark config:
    ```
    spark.ui.port 8080
    spark.ui.reverseProxy=true
    spark.ui.reverseProxyUrl=/path/to/spark/
    ```
    nginx config:
    ```
    server {
        listen 9000;
        set $SPARK_MASTER http://127.0.0.1:8080;
        # split spark UI path into prefix and local path within master UI
        location ~ ^(/path/to/spark/) {
            # strip prefix when forwarding request
            rewrite /path/to/spark(/.*) $1  break;
            #rewrite /path/to/spark/ "/" ;
            # forward to spark master UI
            proxy_pass $SPARK_MASTER;
            proxy_intercept_errors on;
            error_page 301 302 307 = handle_redirects;
        }
        location handle_redirects {
            set $saved_redirect_location '$upstream_http_location';
            proxy_pass $saved_redirect_location;
        }
    }
    ```
    
    Closes apache#30552 from gengliangwang/decodeProxyRedirect.
    
    Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
    Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
    (cherry picked from commit 5d0045e)
    Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
    gengliangwang committed Dec 1, 2020
    Configuration menu
    Copy the full SHA
    6abfeb6 View commit details
    Browse the repository at this point in the history

Commits on Dec 2, 2020

  1. [SPARK-33504][CORE] The application log in the Spark history server c…

    …ontains sensitive attributes should be redacted
    
    ### What changes were proposed in this pull request?
    Make sure that sensitive attributes are redacted in the history server log.
    
    ### Why are the changes needed?
    We found that sensitive attributes such as passwords in SparkListenerJobStart and SparkListenerStageSubmitted events were not redacted, so they could be viewed directly.
    The screenshot can be viewed in the attachment of JIRA SPARK-33504.
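
    A hedged, standalone sketch of the redaction idea (this is not Spark's internal `Utils.redact`; the key pattern is illustrative):

    ```scala
    import scala.util.matching.Regex

    // Hedged sketch: mask values whose keys look sensitive before the event is
    // written to the event log.
    val sensitiveKeys: Regex = "(?i)secret|password|token".r

    def redactProperties(props: Map[String, String]): Map[String, String] =
      props.map { case (key, value) =>
        if (sensitiveKeys.findFirstIn(key).isDefined) key -> "*********(redacted)" else key -> value
      }

    // redactProperties(Map("spark.ssl.keyPassword" -> "hunter2", "spark.app.name" -> "demo"))
    // => Map(spark.ssl.keyPassword -> *********(redacted), spark.app.name -> demo)
    ```
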
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    Manual tests work well; I have also added a unit test case.
    
    Closes apache#30446 from akiyamaneko/eventlog_unredact.
    
    Authored-by: neko <echohlne@gmail.com>
    Signed-off-by: Thomas Graves <tgraves@apache.org>
    (cherry picked from commit 28dad1b)
    Signed-off-by: Thomas Graves <tgraves@apache.org>
    echohlne authored and tgravescs committed Dec 2, 2020
    Configuration menu
    Copy the full SHA
    e59179b View commit details
    Browse the repository at this point in the history
  2. Revert "[SPARK-33504][CORE] The application log in the Spark history …

    …server contains sensitive attributes should be redacted"
    
    ### What changes were proposed in this pull request?
    
    Revert SPARK-33504 on branch-3.0 due to a compilation error. Original PR: apache#30446
    
    This reverts commit e59179b.
    
    ### Why are the changes needed?
    
    ### Does this PR introduce _any_ user-facing change?
    
    ### How was this patch tested?
    
    Closes apache#30576 from tgravescs/revert33504.
    
    Authored-by: Thomas Graves <tgraves@nvidia.com>
    Signed-off-by: Thomas Graves <tgraves@apache.org>
    tgravescs committed Dec 2, 2020
    Configuration menu
    Copy the full SHA
    3fb9f6f View commit details
    Browse the repository at this point in the history
  3. [SPARK-33631][DOCS][TEST] Clean up spark.core.connection.ack.wait.tim…

    …eout from configuration.md
    
    SPARK-9767 removed `ConnectionManager` and related files; the configuration `spark.core.connection.ack.wait.timeout` previously used by `ConnectionManager` is no longer used by other Spark code, but it still exists in `configuration.md`.
    
    So this PR cleans up the unused configuration item `spark.core.connection.ack.wait.timeout` from `configuration.md`.
    
    Clean up useless configuration from `configuration.md`.
    
    No
    
    Pass the Jenkins or GitHub Action
    
    Closes apache#30569 from LuciferYang/SPARK-33631.
    
    Authored-by: yangjie01 <yangjie01@baidu.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 92bfbcb)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    LuciferYang authored and dongjoon-hyun committed Dec 2, 2020
    Configuration menu
    Copy the full SHA
    6f4587a View commit details
    Browse the repository at this point in the history

Commits on Dec 3, 2020

  1. [SPARK-33636][PYTHON][ML][3.0] Add labelsArray to PySpark StringIndexer

    ### What changes were proposed in this pull request?
    
    This is a followup to add missing `labelsArray` to PySpark `StringIndexer`.
    
    ### Why are the changes needed?
    
    `labelsArray` is for the multi-column case of `StringIndexer`. We should provide this accessor on the PySpark side too.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, `labelsArray` was missing in PySpark `StringIndexer` in Spark 3.0.
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes apache#30580 from viirya/SPARK-33636.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
    viirya committed Dec 3, 2020
    Configuration menu
    Copy the full SHA
    13ca88c View commit details
    Browse the repository at this point in the history
  2. [SPARK-33629][PYTHON] Make spark.buffer.size configuration visible on…

    … driver side
    
    `spark.buffer.size` is not applied on the driver side in PySpark. In this PR I've fixed this issue.
    
    Apply the mentioned config on driver side.
    
    No.
    
    Existing unit tests + manually.
    
    Added the following code temporarily:
    ```
    def local_connect_and_auth(port, auth_secret):
    ...
                sock.connect(sa)
                print("SPARK_BUFFER_SIZE: %d" % int(os.environ.get("SPARK_BUFFER_SIZE", 65536))) <- This is the addition
                sockfile = sock.makefile("rwb", int(os.environ.get("SPARK_BUFFER_SIZE", 65536)))
    ...
    ```
    
    Test:
    ```
    
    echo "spark.buffer.size 10000" >> conf/spark-defaults.conf
    
    $ ./bin/pyspark
    Python 3.8.5 (default, Jul 21 2020, 10:48:26)
    [Clang 11.0.3 (clang-1103.0.32.62)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    20/12/03 13:38:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    20/12/03 13:38:14 WARN SparkEnv: I/O encryption enabled without RPC encryption: keys will be visible on the wire.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
          /_/
    
    Using Python version 3.8.5 (default, Jul 21 2020 10:48:26)
    Spark context Web UI available at http://192.168.0.189:4040
    Spark context available as 'sc' (master = local[*], app id = local-1606999094506).
    SparkSession available as 'spark'.
    >>> sc.setLogLevel("TRACE")
    >>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
    ...
    SPARK_BUFFER_SIZE: 10000
    ...
    [[0], [2], [3], [4], [6]]
    >>>
    ```
    
    Closes apache#30592 from gaborgsomogyi/SPARK-33629.
    
    Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit bd71186)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    gaborgsomogyi authored and HyukjinKwon committed Dec 3, 2020
    Configuration menu
    Copy the full SHA
    c4318a1 View commit details
    Browse the repository at this point in the history

Commits on Dec 4, 2020

  1. [SPARK-33660][DOCS][SS] Fix Kafka Headers Documentation

    ### What changes were proposed in this pull request?
    
    Update the Kafka headers documentation; the type is no longer a map but an array (see the example below).
    
    [jira](https://issues.apache.org/jira/browse/SPARK-33660)
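
    For reference, a hedged example of reading headers with the documented array type (broker and topic names are placeholders; the `includeHeaders` option and the `headers` column are assumed to match the Kafka source docs):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("kafka-headers-example").getOrCreate()

    // headers is ARRAY<STRUCT<key: STRING, value: BINARY>>, i.e. an array of
    // (key, value) pairs rather than a map.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "topic1")
      .option("includeHeaders", "true")
      .load()
      .select(col("key"), col("value"), col("headers"))
    ```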
    
    ### Why are the changes needed?
    To help users
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    
    It is only documentation
    
    Closes apache#30605 from Gschiavon/SPARK-33660-fix-kafka-headers-documentation.
    
    Authored-by: german <germanschiavon@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    (cherry picked from commit d671e05)
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Gschiavon authored and HeartSaVioR committed Dec 4, 2020
    Configuration menu
    Copy the full SHA
    6121c8f View commit details
    Browse the repository at this point in the history
  2. [SPARK-33571][SQL][DOCS][3.0] Add a ref to INT96 config from the doc …

    …for `spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read`
    
    ### What changes were proposed in this pull request?
    For the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead`, improve their descriptions by:
    1. Explicitly document on which parquet types, those configs influence on
    2. Refer to corresponding configs for `INT96`
    
    ### Why are the changes needed?
    To avoid user confusions like reposted in SPARK-33571, and make the config description more precise.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running `./dev/scalastyle`.
    
    Closes apache#30604 from MaxGekk/clarify-rebase-docs-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Dec 4, 2020
    Configuration menu
    Copy the full SHA
    8743571 View commit details
    Browse the repository at this point in the history

Commits on Dec 6, 2020

  1. [MINOR] Fix string interpolation in CommandUtils.scala and KafkaDataC…

    …onsumer.scala
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to fix a string interpolation in `CommandUtils.scala` and `KafkaDataConsumer.scala`.
    
    ### Why are the changes needed?
    
    To fix a string interpolation bug.
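
    For illustration, a hedged sketch of the kind of bug being fixed (not the actual lines from `CommandUtils.scala` or `KafkaDataConsumer.scala`): a string literal that uses `$`-placeholders without the `s` interpolator is emitted verbatim.

    ```scala
    val partition = 7

    // Bug: without the `s` interpolator, the placeholder is not substituted.
    val wrong = "Failed to fetch offset for partition $partition"

    // Fix: prefix the literal with `s` so $partition is interpolated.
    val right = s"Failed to fetch offset for partition $partition"

    println(wrong) // Failed to fetch offset for partition $partition
    println(right) // Failed to fetch offset for partition 7
    ```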
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the string will be correctly constructed.
    
    ### How was this patch tested?
    
    Existing tests since they were used in exception/log messages.
    
    Closes apache#30609 from imback82/fix_cache_str_interporlation.
    
    Authored-by: Terry Kim <yuminkim@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 154f604)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    imback82 authored and HyukjinKwon committed Dec 6, 2020
    Configuration menu
    Copy the full SHA
    66b1bdb View commit details
    Browse the repository at this point in the history
  2. [SPARK-33667][SQL][3.0] Respect the spark.sql.caseSensitive config …

    …while resolving partition spec in v1 `SHOW PARTITIONS`
    
    ### What changes were proposed in this pull request?
    Preprocess the partition spec passed to the V1 SHOW PARTITIONS implementation `ShowPartitionsCommand`, and normalize the passed spec according to the partition columns w.r.t the case sensitivity flag  **spark.sql.caseSensitive**.
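
    A hedged sketch of the normalization step (simplified and standalone, not the actual Spark rule; the error message is illustrative):

    ```scala
    // Hedged sketch: resolve each user-supplied partition key against the table's
    // partition columns, honoring the case-sensitivity flag.
    def normalizePartitionSpec(
        spec: Map[String, String],
        partCols: Seq[String],
        caseSensitive: Boolean): Map[String, String] = {
      spec.map { case (key, value) =>
        val resolved = partCols
          .find(c => if (caseSensitive) c == key else c.equalsIgnoreCase(key))
          .getOrElse(throw new IllegalArgumentException(
            s"Non-partitioning column: $key (expected one of ${partCols.mkString(", ")})"))
        resolved -> value
      }
    }

    // normalizePartitionSpec(Map("YEAR" -> "2015", "Month" -> "1"), Seq("year", "month"), caseSensitive = false)
    // => Map(year -> 2015, month -> 1)
    ```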
    
    ### Why are the changes needed?
    V1 SHOW PARTITIONS is in fact case sensitive, and doesn't respect the SQL config **spark.sql.caseSensitive**, which is false by default. For instance:
    ```sql
    spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
             > USING parquet
             > PARTITIONED BY (year, month);
    spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
    spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
    Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW PARTITIONS;
    ```
    The `SHOW PARTITIONS` command must show the partition `year = 2015, month = 1` specified by `YEAR = 2015, Month = 1`.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, the command above works as expected:
    ```sql
    spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
    year=2015/month=1
    ```
    
    ### How was this patch tested?
    By running the affected test suites:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite"
    ```
    
    Closes apache#30626 from MaxGekk/show-partitions-case-sensitivity-test-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Dec 6, 2020
    Configuration menu
    Copy the full SHA
    a11a07a View commit details
    Browse the repository at this point in the history

Commits on Dec 7, 2020

  1. [SPARK-33675][INFRA][3.0] Add GitHub Action job to publish snapshot

    ### What changes were proposed in this pull request?
    
    This PR aims to add `GitHub Action` job to publish snapshot from `branch-3.0`.
    
    https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.12/3.0.2-SNAPSHOT/
    
    ### Why are the changes needed?
    
    This will remove our maintenance burden for `branch-3.0` and will stop automatically when we don't have any commit on `branch-3.0`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    N/A
    
    Closes apache#30630 from dongjoon-hyun/SPARK-33675-3.0.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Dec 7, 2020
    Configuration menu
    Copy the full SHA
    8029d66 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33681][K8S][TESTS][3.0] Increase K8s IT timeout to 3 minutes

    ### What changes were proposed in this pull request?
    
    This PR aims to increase the timeout of K8s integration test of `branch-3.0/2.4` from 2 minutes to 3 minutes which is consistent with `master/branch-3.1`.
    
    ### Why are the changes needed?
    
    This will reduce the chance of this kind of failure.
    - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36905/console
    ```
    ...
      20/12/07 00:11:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
      20/12/07 00:11:38 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
      " did not contain "PySpark Worker Memory Check is: True" The application did not complete.. (KubernetesSuite.scala:249)
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the K8s IT Jenkins job.
    
    Closes apache#30632 from dongjoon-hyun/SPARK-33681.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Dec 7, 2020
    Configuration menu
    Copy the full SHA
    313a460 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    1bb37dc View commit details
    Browse the repository at this point in the history
  4. [SPARK-33592][ML][PYTHON][3.0] Backport Fix: Pyspark ML Validator par…

    …ams in estimatorParamMaps may be lost after saving and reloading
    
    Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading
    
    When saving validator estimatorParamMaps, check all nested stages in the tuned estimator to get the correct param parent.
    
    Two typical cases to manually test:
    ~~~python
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
    lr = LogisticRegression()
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
    
    paramGrid = ParamGridBuilder() \
        .addGrid(hashingTF.numFeatures, [10, 100]) \
        .addGrid(lr.maxIter, [100, 200]) \
        .build()
    tvs = TrainValidationSplit(estimator=pipeline,
                               estimatorParamMaps=paramGrid,
                               evaluator=MulticlassClassificationEvaluator())
    
    tvs.save(tvsPath)
    loadedTvs = TrainValidationSplit.load(tvsPath)
    
    ~~~
    
    ~~~python
    lr = LogisticRegression()
    ova = OneVsRest(classifier=lr)
    grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
    evaluator = MulticlassClassificationEvaluator()
    tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator)
    
    tvs.save(tvsPath)
    loadedTvs = TrainValidationSplit.load(tvsPath)
    
    ~~~
    
    Bug fix.
    
    No
    
    Unit test.
    
    Closes apache#30539 from WeichenXu123/fix_tuning_param_maps_io.
    
    Authored-by: Weichen Xu <weichen.xu@databricks.com>
    Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
    (cherry picked from commit 8016123)
    Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
    
    ### What changes were proposed in this pull request?
    
    ### Why are the changes needed?
    
    ### Does this PR introduce _any_ user-facing change?
    
    ### How was this patch tested?
    
    Closes apache#30590 from WeichenXu123/SPARK-33592-bp-3.0.
    
    Authored-by: Weichen Xu <weichen.xu@databricks.com>
    Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
    WeichenXu123 committed Dec 7, 2020
    Configuration menu
    Copy the full SHA
    8acbe5b View commit details
    Browse the repository at this point in the history
  5. [SPARK-33670][SQL][3.0] Verify the partition provider is Hive in v1 S…

    …HOW TABLE EXTENDED
    
    ### What changes were proposed in this pull request?
    Invoke the check `DDLUtils.verifyPartitionProviderIsHive()` from V1 implementation of `SHOW TABLE EXTENDED` when partition specs are specified.
    
    This PR is a kind of follow-up to apache#16373 and apache#15515.
    
    ### Why are the changes needed?
    To output a user-friendly error with a recommendation like
    **"
    ... partition metadata is not stored in the Hive metastore. To import this information into the metastore, run `msck repair table tableName`
    "**
    instead of silently outputting an empty result.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes.
    
    ### How was this patch tested?
    By running the affected test suites, in particular:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveCatalogedDDLSuite"
    $ build/sbt -Phive-2.3 -Phive-thriftserver "hive/test:testOnly *PartitionProviderCompatibilitySuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 29096a8)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#30640 from MaxGekk/show-table-extended-verifyPartitionProviderIsHive-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Dec 7, 2020
    Configuration menu
    Copy the full SHA
    9555658 View commit details
    Browse the repository at this point in the history

Commits on Dec 8, 2020

  1. [SPARK-32680][SQL][3.0] Don't Preprocess V2 CTAS with Unresolved Query

    ### What changes were proposed in this pull request?
    The analyzer rule `PreprocessTableCreation` preprocesses table-creation-related logical plans. But for
    CTAS, if the sub-query can't be resolved, preprocessing it will cause "Invalid call to toAttribute on unresolved
    object" (instead of a user-friendly error message: "table or view not found").
    This PR fixes this wrong preprocessing for CTAS using the V2 catalog.
    
    ### Why are the changes needed?
    bug fix
    
    ### Does this PR introduce _any_ user-facing change?
    The error message for CTAS with a non-exists table changed from:
    `UnresolvedException: Invalid call to toAttribute on unresolved object, tree: xxx` to
    `AnalysisException: Table or view not found: xxx`
    
    ### How was this patch tested?
    added test
    
    Closes apache#30649 from linhongliu-db/fix-ctas-3.0.
    
    Authored-by: Linhong Liu <linhong.liu@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    linhongliu-db authored and cloud-fan committed Dec 8, 2020
    Configuration menu
    Copy the full SHA
    46a0ec5 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains a…

    …ny escapeChar
    
    ### What changes were proposed in this pull request?
    `LikeSimplification` rule does not work correctly for many cases that have patterns containing escape characters, for example:
    
    `SELECT s LIKE 'm%aca' ESCAPE '%' FROM t`
    `SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t`
    
    For simplicity, this PR simply skips this rule if `pattern` contains any `escapeChar`.
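
    A hedged sketch of the guard (simplified; the real check lives in the `LikeSimplification` rule):

    ```scala
    // Hedged sketch: only attempt pattern simplification when the pattern does not
    // contain the escape character at all.
    def canSimplify(pattern: String, escapeChar: Char): Boolean =
      !pattern.exists(_ == escapeChar)

    // canSimplify("m%aca", '%')  => false (skip the rule)
    // canSimplify("abc%", '\\')  => true  (safe to simplify)
    ```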
    
    ### Why are the changes needed?
    The result can be corrupted otherwise.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added Unit test.
    
    Closes apache#30625 from luluorta/SPARK-33677.
    
    Authored-by: luluorta <luluorta@gmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    (cherry picked from commit 99613cd)
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    luluorta authored and maropu committed Dec 8, 2020
    Configuration menu
    Copy the full SHA
    ea7c2a1 View commit details
    Browse the repository at this point in the history
  3. [SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++

    ### What changes were proposed in this pull request?
    
    Currently, Spark treats 0.0 and -0.0 semantically equal, while it still retains the difference between them so that users can see -0.0 when displaying the data set.
    
    The comparison expressions in Spark take care of the special floating numbers and implement the correct semantic. However, Spark doesn't always use these comparison expressions to compare values, and we need to normalize the special floating numbers before comparing them in these places:
    1. GROUP BY
    2. join keys
    3. window partition keys
    
    This PR fixes one more place that compares values without using comparison expressions: HyperLogLog++
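
    A hedged sketch of the normalization applied before hashing values (standalone, not Spark's internal rule):

    ```scala
    // Hedged sketch: collapse -0.0 into 0.0 and all NaN bit patterns into one
    // canonical NaN so that semantically equal values hash identically in
    // HyperLogLog++.
    def normalize(d: Double): Double = {
      if (d == 0.0d) 0.0d            // -0.0 == 0.0, so this maps -0.0 to +0.0
      else if (d.isNaN) Double.NaN   // one canonical NaN
      else d
    }

    // java.lang.Double.doubleToLongBits(normalize(-0.0d)) ==
    //   java.lang.Double.doubleToLongBits(normalize(0.0d))  // true after normalization
    ```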
    
    ### Why are the changes needed?
    
    Fix the query result
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the result of HyperLogLog++ becomes correct now.
    
    ### How was this patch tested?
    
    a new test case, and a few more test cases that pass before this PR to improve test coverage.
    
    Closes apache#30673 from cloud-fan/bug.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 6fd2345)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    cloud-fan authored and dongjoon-hyun committed Dec 8, 2020
    Configuration menu
    Copy the full SHA
    eae6a3e View commit details
    Browse the repository at this point in the history

Commits on Dec 10, 2020

  1. [SPARK-33727][K8S] Fall back from gnupg.net to openpgp.org

    ### What changes were proposed in this pull request?
    
    While building R docker image if we can't fetch the key from gnupg.net fall back to openpgp.org
    
    ### Why are the changes needed?
    
    gnupg.net key servers are flaky and sometimes fail to resolve or return keys.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Tried to add key on my desktop, it failed, then tried to add key with openpgp.org and it succeed.
    
    Closes apache#30696 from holdenk/SPARK-33727-gnupg-server-is-flaky.
    
    Authored-by: Holden Karau <hkarau@apple.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 991b797)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    holdenk authored and HyukjinKwon committed Dec 10, 2020
    Configuration menu
    Copy the full SHA
    a4c5e54 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33725][BUILD][3.0] Upgrade snappy-java to 1.1.8.2

    ### What changes were proposed in this pull request?
    
    This upgrades snappy-java to 1.1.8.2.
    
    ### Why are the changes needed?
    
    Minor version upgrade that includes:
    
    - [Fixed](xerial/snappy-java#265) an initialization issue when using a recent Mac OS X version
    - Support Apple Silicon (M1, Mac-aarch64)
    - Fixed the pure-java Snappy fallback logic when no native library for your platform is found.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes apache#30698 from viirya/upgrade-snappy-3.0.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    viirya authored and HyukjinKwon committed Dec 10, 2020
    Configuration menu
    Copy the full SHA
    2921c4e View commit details
    Browse the repository at this point in the history
  3. [SPARK-33732][K8S][TESTS][3.0] Kubernetes integration tests doesn't w…

    …ork with Minikube 1.9+
    
    ### What changes were proposed in this pull request?
    
    This is a backport of apache#30700 .
    
    This PR changes `Minikube.scala` for Kubernetes integration tests to work with Minikube 1.9+.
    `Minikube.scala` assumes that `apiserver.key` and `apiserver.crt` are in `~/.minikube/`.
    But as of Minikube 1.9, they are in `~/.minikube/profiles/<profile>`.
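
    A hedged sketch of the path resolution (file names and locations follow the description above; not the exact `Minikube.scala` code):

    ```scala
    import java.nio.file.{Files, Path, Paths}

    // Hedged sketch: prefer the per-profile location used by Minikube 1.9+ and
    // fall back to the legacy location under ~/.minikube/.
    def apiServerCert(minikubeHome: String, profile: String): Path = {
      val newLocation = Paths.get(minikubeHome, "profiles", profile, "apiserver.crt")
      val legacyLocation = Paths.get(minikubeHome, "apiserver.crt")
      if (Files.exists(newLocation)) newLocation else legacyLocation
    }
    ```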
    
    ### Why are the changes needed?
    
    Currently, Kubernetes integration tests don't work with Minikube 1.9+.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    I confirmed the following test passes.
    ```
    $ build/sbt -Pkubernetes -Pkubernetes-integration-tests package 'kubernetes-integration-tests/testOnly -- -z "SparkPi with no"'
    ```
    
    Closes apache#30702 from sarutak/minikube-1.9-branch-3.0.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sarutak authored and dongjoon-hyun committed Dec 10, 2020
    Configuration menu
    Copy the full SHA
    83af036 View commit details
    Browse the repository at this point in the history

Commits on Dec 11, 2020

  1. [SPARK-33749][BUILD][PYTHON] Exclude target directory in pycodestyle …

    …and flake8
    
    Once you build and run K8S tests, Python lint fails as below:
    
    ```bash
    $ ./dev/lint-python
    ```
    
    Before this PR:
    
    ```
    starting python compilation test...
    python compilation succeeded.
    
    downloading pycodestyle from https://raw.githubusercontent.com/PyCQA/pycodestyle/2.6.0/pycodestyle.py...
    starting pycodestyle test...
    pycodestyle checks failed:
    ./resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/python/pyspark/cloudpickle/cloudpickle.py:15:101: E501 line too long (105 > 100 characters)
    ./resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/python/docs/source/conf.py:60:101: E501 line too long (124 > 100 characters)
    ...
    ```
    
    After this PR:
    
    ```
    starting python compilation test...
    python compilation succeeded.
    
    downloading pycodestyle from https://raw.githubusercontent.com/PyCQA/pycodestyle/2.6.0/pycodestyle.py...
    starting pycodestyle test...
    pycodestyle checks passed.
    
    starting flake8 test...
    flake8 checks passed.
    
    starting mypy test...
    mypy checks passed.
    
    starting sphinx-build tests...
    sphinx-build checks passed.
    ```
    
    This PR excludes target directory to avoid such cases in the future.
    
    To make it easier to run linters
    
    No, dev-only.
    
    Manually tested via running `./dev/lint-python`.
    
    Closes apache#30718 from HyukjinKwon/SPARK-33749.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit cd7a306)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Dec 11, 2020
    Configuration menu
    Copy the full SHA
    728bdb7 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33740][SQL][3.0] hadoop configs in hive-site.xml can override…

    … pre-existing hadoop ones
    
    Backport  apache#30709 to 3.0
    
    ### What changes were proposed in this pull request?
    
    `org.apache.hadoop.conf.Configuration#setIfUnset` will ignore those with defaults too.
    
    ### Why are the changes needed?
    
    Fix a regression.
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    new tests
    
    Closes apache#30720 from yaooqinn/SPARK-33740-30.
    
    Authored-by: Kent Yao <yaooqinn@hotmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    yaooqinn authored and dongjoon-hyun committed Dec 11, 2020
    Configuration menu
    Copy the full SHA
    9439e11 View commit details
    Browse the repository at this point in the history
  3. [SPARK-33757][INFRA][R] Fix the R dependencies build error on GitHub …

    …Actions and AppVeyor
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the R dependencies build error on GitHub Actions and AppVeyor.
    The reason seems to be that the `usethis` package was updated on 2020/12/10.
    https://cran.r-project.org/web/packages/usethis/index.html
    
    ### Why are the changes needed?
    
    To keep the build clean.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Should be done by GitHub Actions.
    
    Closes apache#30737 from sarutak/fix-r-dependencies-build-error.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit fb2e3af)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    sarutak authored and HyukjinKwon committed Dec 11, 2020
    Configuration menu
    Copy the full SHA
    2534165 View commit details
    Browse the repository at this point in the history
  4. [SPARK-33742][SQL][3.0] Throw PartitionsAlreadyExistException from Hi…

    …veExternalCatalog.createPartitions()
    
    ### What changes were proposed in this pull request?
    Throw `PartitionsAlreadyExistException` from `createPartitions()` in Hive external catalog when a partition exists. Currently, `HiveExternalCatalog.createPartitions()` throws `AlreadyExistsException` wrapped by `AnalysisException`.
    
    In the PR, I propose to catch `AlreadyExistsException` in `HiveClientImpl` and replace it by `PartitionsAlreadyExistException`.
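
    A hedged sketch of the translation (the exception classes below are stand-ins for the real Hive/Spark classes, and the wrapper is not the actual `HiveClientImpl` code):

    ```scala
    class AlreadyExistsException(msg: String) extends RuntimeException(msg)
    class PartitionsAlreadyExistException(db: String, table: String, specs: Seq[Map[String, String]])
      extends RuntimeException(
        s"The following partitions already exist in table '$table' database '$db': ${specs.mkString(", ")}")

    // Hedged sketch: translate the metastore exception into the one the in-memory
    // catalogs throw, so callers see a consistent error type.
    def createPartitions(db: String, table: String, specs: Seq[Map[String, String]])(doCreate: => Unit): Unit = {
      try doCreate catch {
        case _: AlreadyExistsException =>
          throw new PartitionsAlreadyExistException(db, table, specs)
      }
    }
    ```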
    
    ### Why are the changes needed?
    The behaviour of Hive external catalog deviates from V1/V2 in-memory catalogs that throw `PartitionsAlreadyExistException`. To improve user experience with Spark SQL, it would be better to throw the same exception.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes
    
    ### How was this patch tested?
    By running existing test suites:
    ```
    $ build/sbt -Phive -Phive-thriftserver "hive/test:testOnly org.apache.spark.sql.hive.client.VersionsSuite"
    $ build/sbt -Phive -Phive-thriftserver "hive/test:testOnly org.apache.spark.sql.hive.execution.HiveDDLSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit fab2995)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#30730 from MaxGekk/hive-partition-exception-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    MaxGekk authored and dongjoon-hyun committed Dec 11, 2020
    Configuration menu
    Copy the full SHA
    fe38821 View commit details
    Browse the repository at this point in the history

Commits on Dec 13, 2020

  1. [MINOR][UI] Correct JobPage's skipped/pending tableHeaderId

    ### What changes were proposed in this pull request?
    
    Currently, the Spark Web UI job page's header links for pending/skipped stages are inconsistent with their statuses. See the picture below:
    ![image](https://user-images.githubusercontent.com/9404831/101998894-1e843180-3c8c-11eb-8d94-10df9edb68e7.png)
    
    ### Why are the changes needed?
    
    The code determining the `pendingOrSkippedTableId` has the wrong logic. As explained in the code:
    > If the job is completed, then any pending stages are displayed as "skipped" [code pointer](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala#L266)
    
    This PR fixes the logic for `pendingOrSkippedTableId` which aligns with the stage statuses.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Verified that header link is consistent with stage status with the fix.
    
    Closes apache#30749 from linzebing/ui_bug.
    
    Authored-by: linzebing <linzebing1995@gmail.com>
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    (cherry picked from commit 0277fdd)
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    linzebing authored and sarutak committed Dec 13, 2020
    Configuration menu
    Copy the full SHA
    14e77ab View commit details
    Browse the repository at this point in the history

Commits on Dec 14, 2020

  1. [SPARK-33757][INFRA][R][FOLLOWUP] Provide more simple solution

    ### What changes were proposed in this pull request?
    
    This PR proposes a better solution for the R build failure on GitHub Actions.
    The issue is solved in apache#30737 but I noticed the following two things.
    
    * We can use the latest `usethis` if we install additional libraries on the GitHub Actions environment.
    * For tests on AppVeyor, `usethis` is not necessary, so I partially revert the previous change.
    
    ### Why are the changes needed?
    
    For a simpler solution.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Confirmed on GitHub Actions and AppVeyor on my account.
    
    Closes apache#30753 from sarutak/followup-SPARK-33757.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit b135db3)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    sarutak authored and dongjoon-hyun committed Dec 14, 2020
    Configuration menu
    Copy the full SHA
    7cd1aab View commit details
    Browse the repository at this point in the history
  2. [SPARK-33770][SQL][TESTS][3.1][3.0] Fix the `ALTER TABLE .. DROP PART…

    …ITION` tests that delete files out of partition path
    
    ### What changes were proposed in this pull request?
    Modify the tests that add partitions with `LOCATION`, and where the number of nested folders in `LOCATION` doesn't match the number of partitioned columns. In that case, `ALTER TABLE .. DROP PARTITION` tries to access (delete) a folder outside of the "base" path in `LOCATION`.
    
    The problem belongs to Hive's MetaStore method `drop_partition_common`:
    https://github.com/apache/hive/blob/8696c82d07d303b6dbb69b4d443ab6f2b241b251/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L4876
    which tries to delete empty partition sub-folders recursively, starting from the deepest partition sub-folder up to the base folder. In the case when the number of sub-folders is not equal to the number of partitioned columns `part_vals.size()`, the method will try to list and delete folders outside of the base path.
    
    ### Why are the changes needed?
    To fix test failures like apache#30643 (comment):
    ```
    org.apache.spark.sql.hive.execution.command.AlterTableAddPartitionSuite.ALTER TABLE .. ADD PARTITION Hive V1: SPARK-33521: universal type conversions of partition values
    sbt.ForkMain$ForkError: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-832cb19c-65fd-41f3-ae0b-937d76c07897 does not exist;
    	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
    	at org.apache.spark.sql.hive.HiveExternalCatalog.dropPartitions(HiveExternalCatalog.scala:1014)
    ...
    Caused by: sbt.ForkMain$ForkError: org.apache.hadoop.hive.metastore.api.MetaException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-832cb19c-65fd-41f3-ae0b-937d76c07897 does not exist
    	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_partition_with_environment_context(HiveMetaStore.java:3381)
    	at sun.reflect.GeneratedMethodAccessor304.invoke(Unknown Source)
    ```
    
    The issue can be reproduced by the following steps:
    1. Create a base folder, for example: `/Users/maximgekk/tmp/part-location`
    2. Create a sub-folder in the base folder and drop permissions for it:
    ```
    $ mkdir /Users/maximgekk/tmp/part-location/aaa
    $ chmod a-rwx /Users/maximgekk/tmp/part-location/aaa
    $ ls -al /Users/maximgekk/tmp/part-location
    total 0
    drwxr-xr-x   3 maximgekk  staff    96 Dec 13 18:42 .
    drwxr-xr-x  33 maximgekk  staff  1056 Dec 13 18:32 ..
    d---------   2 maximgekk  staff    64 Dec 13 18:42 aaa
    ```
    3. Create a table with a partition folder in the base folder:
    ```sql
    spark-sql> create table tbl (id int) partitioned by (part0 int, part1 int);
    spark-sql> alter table tbl add partition (part0=1,part1=2) location '/Users/maximgekk/tmp/part-location/tbl';
    ```
    4. Try to drop this partition:
    ```
    spark-sql> alter table tbl drop partition (part0=1,part1=2);
    20/12/13 18:46:07 ERROR HiveClientImpl:
    ======================
    Attempt to drop the partition specs in table 'tbl' database 'default':
    Map(part0 -> 1, part1 -> 2)
    In this attempt, the following partitions have been dropped successfully:
    
    The remaining partitions have not been dropped:
    [1, 2]
    ======================
    
    Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Error accessing file:/Users/maximgekk/tmp/part-location/aaa;
    org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Error accessing file:/Users/maximgekk/tmp/part-location/aaa;
    ```
    The command fails because it tries to access the sub-folder `aaa`, which is outside of the partition path `/Users/maximgekk/tmp/part-location/tbl`.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    By running the affected tests from local IDEA which does not have access to folders out of partition paths.
    
    Lead-authored-by: Max Gekk <max.gekk@gmail.com>
    Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 9160d59)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#30756 from MaxGekk/fix-drop-partition-location-3.1.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Dec 14, 2020
    Configuration menu
    Copy the full SHA
    d652b47 View commit details
    Browse the repository at this point in the history

Commits on Dec 16, 2020

  1. [SPARK-33786][SQL][3.0] The storage level for a cache should be respe…

    …cted when a table name is altered
    
    ### What changes were proposed in this pull request?
    
    This is a back port of apache#30774.
    
    This PR proposes to retain the cache's storage level when a table name is altered by `ALTER TABLE ... RENAME TO ...`.
    
    ### Why are the changes needed?
    
    Currently, when a table name is altered, the table's cache is refreshed (if it exists), but the storage level is not retained. For example:
    ```scala
            def getStorageLevel(tableName: String): StorageLevel = {
              val table = spark.table(tableName)
              val cachedData = spark.sharedState.cacheManager.lookupCachedData(table).get
              cachedData.cachedRepresentation.cacheBuilder.storageLevel
            }
    
            Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath)
            sql(s"CREATE TABLE old USING parquet LOCATION '${path.toURI}'")
            sql("CACHE TABLE old OPTIONS('storageLevel' 'MEMORY_ONLY')")
            val oldStorageLevel = getStorageLevel("old")
    
            sql("ALTER TABLE old RENAME TO new")
            val newStorageLevel = getStorageLevel("new")
    ```
    `oldStorageLevel` will be `StorageLevel(memory, deserialized, 1 replicas)` whereas `newStorageLevel` will be `StorageLevel(disk, memory, deserialized, 1 replicas)`, which is the default storage level.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, now the storage level for the cache will be retained.
    
    ### How was this patch tested?
    
    Added a unit test.
    
    Closes apache#30793 from imback82/alter_table_rename_cache_fix_3.0.
    
    Authored-by: Terry Kim <yuminkim@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    imback82 authored and cloud-fan committed Dec 16, 2020
    Configuration menu
    Copy the full SHA
    f2c8079 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33788][SQL][3.1][3.0][2.4] Throw NoSuchPartitionsException fro…

    …m HiveExternalCatalog.dropPartitions()
    
    ### What changes were proposed in this pull request?
    Throw `NoSuchPartitionsException` from `ALTER TABLE .. DROP PARTITION` for non-existing partitions of a table in the V1 Hive external catalog.
    
    ### Why are the changes needed?
    The behaviour of Hive external catalog deviates from V1/V2 in-memory catalogs that throw `NoSuchPartitionsException`. To improve user experience with Spark SQL, it would be better to throw the same exception.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, the command throws `NoSuchPartitionsException` instead of the general exception `AnalysisException`.
    
    ### How was this patch tested?
    By running new UT via:
    ```
    $ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveDDLSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 3dfdcf4)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#30802 from MaxGekk/hive-drop-partition-exception-3.1.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Dec 16, 2020
    Configuration menu
    Copy the full SHA
    a77b70d View commit details
    Browse the repository at this point in the history
  3. [SPARK-33793][TESTS][3.0] Introduce withExecutor to ensure proper cle…

    …anup in tests
    
    Backport of: apache#30783
    
    ### What changes were proposed in this pull request?
    This PR introduces a helper method `withExecutor` that handles the creation of an Executor object and ensures that it is always stopped in a finally block. The tests in ExecutorSuite have been refactored to use this method.
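
    A hedged sketch of the pattern (the real helper lives in Spark's test code and takes Spark-specific `Executor` arguments; `Executor` below is a stand-in trait):

    ```scala
    trait Executor { def stop(): Unit }

    // Hedged sketch: build the executor, run the test body, and guarantee cleanup
    // in a finally block even if the body throws.
    def withExecutor[T](newExecutor: => Executor)(body: Executor => T): T = {
      val executor = newExecutor
      try {
        body(executor)
      } finally {
        executor.stop()
      }
    }
    ```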
    
    ### Why are the changes needed?
    Recently it was discovered that leaked Executors (which are not explicitly stopped after a test) can cause other tests to fail because the JVM is killed after 10 min. It is therefore crucial that tests always stop the Executor. By introducing this helper method, a simple pattern is established that can be easily adopted in new tests, which reduces the risk of regressions.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Run the ExecutorSuite locally.
    
    Closes apache#30801 from sander-goos/SPARK-33793-close-executors-3.0.
    
    Authored-by: Sander Goos <sander.goos@databricks.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    sander-goos authored and HyukjinKwon committed Dec 16, 2020
    Configuration menu
    Copy the full SHA
    da272f7 View commit details
    Browse the repository at this point in the history

Commits on Dec 17, 2020

  1. [SPARK-33733][SQL][3.0] PullOutNondeterministic should check and coll…

    …ect deterministic field
    
    backport [apache#30703](apache#30703) for branch-3.0.
    
    ### What changes were proposed in this pull request?
    
    The deterministic field is wider than `NonDeterministic`; we should keep the same range between pull-out and check analysis.
    
    ### Why are the changes needed?
    
    For example
    ```
    select * from values(1), (4) as t(c1) order by java_method('java.lang.Math', 'abs', c1)
    ```
    
    We will get an exception since `java_method`'s deterministic field is false but it is not a `NonDeterministic`
    ```
    Exception in thread "main" org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in
    Project, Filter, Aggregate or Window, found:
     java_method('java.lang.Math', 'abs', t.`c1`) ASC NULLS FIRST
    in operator Sort [java_method(java.lang.Math, abs, c1#1) ASC NULLS FIRST], true
                   ;;
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes.
    
    ### How was this patch tested?
    
    Add test.
    
    Closes apache#30771 from ulysses-you/SPARK-33733-branch-3.0.
    
    Authored-by: ulysses-you <ulyssesyou18@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    ulysses-you authored and cloud-fan committed Dec 17, 2020
    Configuration menu
    Copy the full SHA
    b86ea0f View commit details
    Browse the repository at this point in the history
  2. [SPARK-33819][CORE][3.0] SingleFileEventLogFileReader/RollingEventLog…

    …FilesFileReader should be `package private`
    
    ### What changes were proposed in this pull request?
    
    This PR aims to convert `EventLogFileReader`'s derived classes into `package private`.
    - SingleFileEventLogFileReader
    - RollingEventLogFilesFileReader
    
    `EventLogFileReader` itself is used in `scheduler` module during tests.
    
    ### Why are the changes needed?
    
    These classes were designed to be internal. This PR hides them explicitly to reduce the maintenance burden.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, but these were exposed accidentally.
    
    ### How was this patch tested?
    
    Pass CIs.
    
    Closes apache#30820 from dongjoon-hyun/SPARK-33819-3.0.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Dec 17, 2020
    Configuration menu
    Copy the full SHA
    cd683b3 View commit details
    Browse the repository at this point in the history
  3. [SPARK-33774][UI][CORE] "Back to Master" returns 500 error in Standalo…

    …ne cluster
    
    ### What changes were proposed in this pull request?
    
    Initialize the `masterWebUiUrl` with `webUi.webUrl` instead of the `masterPublicAddress`.
    
    ### Why are the changes needed?
    
    Since [SPARK-21642](https://issues.apache.org/jira/browse/SPARK-21642), `WebUI` has changed from `localHostName` to `localCanonicalHostName` as the hostname to set up the web UI. However, the `masterPublicAddress` is from `RpcEnv`'s host address, which still uses `localHostName`. As a result, it returns the wrong Master web URL to the Worker.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, when users click "Back to Master" in the Worker page:
    
    Before this PR:
    
    <img width="3258" alt="WeChat4acbfd163f51c76a5f9bc388c7479785" src="https://user-images.githubusercontent.com/16397174/102057951-b9664280-3e29-11eb-8749-5ee293902bdf.png">
    
    After this PR:
    
    ![image](https://user-images.githubusercontent.com/16397174/102058016-d438b700-3e29-11eb-8641-a23a6b2f542e.png)
    
    (Return to the Master page successfully.)
    
    ### How was this patch tested?
    
    Tested manually.
    
    Closes apache#30759 from Ngone51/fix-back-to-master.
    
    Authored-by: yi.wu <yi.wu@databricks.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit 34e4d87)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    Ngone51 authored and srowen committed Dec 17, 2020
    Configuration menu
    Copy the full SHA
    99eb027 View commit details
    Browse the repository at this point in the history

Commits on Dec 18, 2020

  1. [SPARK-33822][SQL] Use the CastSupport.cast method in HashJoin

    ### What changes were proposed in this pull request?
    
    This PR intends to fix the bug that throws an unsupported exception when running [the TPCDS q5](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q5.sql) with AQE enabled ([this option is enabled by default now via SPARK-33679](apache@031c5ef)):
    ```
    java.lang.UnsupportedOperationException: BroadcastExchange does not support the execute() code path.
      at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:189)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      at org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:60)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      at org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321)
      at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:397)
      at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:118)
      at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:185)
      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      ...
    ```
    
    I've checked the AQE code and found that `EnsureRequirements` wrongly puts `BroadcastExchange` on top of `BroadcastQueryStage` in the `reOptimize` phase, as follows:
    ```
    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=apache#2183]
      +- BroadcastQueryStage 2
        +- ReusedExchange [d_date_sk#1086], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=apache#1963]
    ```
    A root cause is that a `Cast` class in a required child's distribution does not have a `timeZoneId` field (`timeZoneId=None`), and a `Cast` class in `child.outputPartitioning` has it. So, this difference can make the distribution requirement check fail in `EnsureRequirements`:
    https://github.com/apache/spark/blob/1e85707738a830d33598ca267a6740b3f06b1861/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L47-L50
    
    The `Cast` class that does not have a `timeZoneId` field is generated in the `HashJoin` object. To fix this issue, this PR proposes to use the `CastSupport.cast` method there.
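
    A conceptual illustration of the mismatch (heavily simplified, not Spark's `Cast` class): two casts that are semantically the same but differ in an optional field do not compare equal, so an equality-based distribution check fails.

    ```scala
    // Hedged sketch: a stand-in case class showing why the requirement check fails.
    case class Cast(child: String, dataType: String, timeZoneId: Option[String])

    val required = Cast("input[0]", "bigint", timeZoneId = None)        // built in HashJoin
    val provided = Cast("input[0]", "bigint", timeZoneId = Some("UTC")) // from outputPartitioning

    assert(required != provided) // the mismatch that made EnsureRequirements re-add an exchange
    ```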
    
    ### Why are the changes needed?
    
    Bugfix.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually checked that q5 passed.
    
    Closes apache#30818 from maropu/BugfixInAQE.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 51ef443)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    maropu authored and dongjoon-hyun committed Dec 18, 2020
    Configuration menu
    Copy the full SHA
    3ef6827 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    5ce0b9f View commit details
    Browse the repository at this point in the history
  3. [SPARK-33822][SQL][3.0] Use the CastSupport.cast method in HashJoin

    ### What changes were proposed in this pull request?
    
    This PR intends to fix the bug that throws an unsupported exception when running [the TPCDS q5](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q5.sql) with AQE enabled ([this option is enabled by default now via SPARK-33679](apache@031c5ef)):
    ```
    java.lang.UnsupportedOperationException: BroadcastExchange does not support the execute() code path.
      at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:189)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      at org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:60)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      at org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321)
      at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:397)
      at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:118)
      at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:185)
      at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      ...
    ```
    
    I've checked the AQE code and found that `EnsureRequirements` wrongly puts a `BroadcastExchange` on top of a `BroadcastQueryStage` in the `reOptimize` phase, as follows:
    ```
    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=apache#2183]
      +- BroadcastQueryStage 2
        +- ReusedExchange [d_date_sk#1086], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=apache#1963]
    ```
    The root cause is that the `Cast` expression in the required child distribution does not have a `timeZoneId` (`timeZoneId=None`), while the `Cast` in `child.outputPartitioning` does. This difference makes the distribution requirement check fail in `EnsureRequirements`:
    https://github.com/apache/spark/blob/1e85707738a830d33598ca267a6740b3f06b1861/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L47-L50
    
    The `Cast` class that does not have a `timeZoneId` field is generated in the `HashJoin` object. To fix this issue, this PR proposes to use the `CastSupport.cast` method there.
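    
    For illustration, here is a minimal standalone sketch of the mismatch (this is not the `HashJoin` code itself; `Literal(1)` and the types are placeholders):
    
    ```scala
    // A Cast built without a timeZoneId is not equal to a Cast built with one,
    // even though both describe the same conversion, so the required and actual
    // distributions stop matching in EnsureRequirements.
    import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
    import org.apache.spark.sql.internal.SQLConf
    import org.apache.spark.sql.types.LongType
    
    val withoutTz = Cast(Literal(1), LongType)  // timeZoneId = None
    val withTz = Cast(Literal(1), LongType, Option(SQLConf.get.sessionLocalTimeZone))
    assert(withoutTz != withTz)  // CastSupport.cast always fills in the time zone
    ```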
    
    This is a backport PR for apache#30818.
    
    ### Why are the changes needed?
    
    Bugfix.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually checked that q5 passed with AQE enabled.
    
    Closes apache#30830 from maropu/SPARK-33822-BRANCH3.0.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    maropu authored and dongjoon-hyun committed Dec 18, 2020
    Configuration menu
    Copy the full SHA
    1615b0e View commit details
    Browse the repository at this point in the history
  4. [SPARK-33831][UI] Update to jetty 9.4.34

    Update Jetty to 9.4.34
    
    Picks up fixes and improvements, including a possible CVE fix.
    
    https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
    https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102
    
    No.
    
    Existing tests.
    
    Closes apache#30828 from srowen/SPARK-33831.
    
    Authored-by: Sean Owen <srowen@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    (cherry picked from commit 131a23d)
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    srowen authored and dongjoon-hyun committed Dec 18, 2020
    Configuration menu
    Copy the full SHA
    faf4a0e View commit details
    Browse the repository at this point in the history
  5. [SPARK-33593][SQL][3.0] Vector reader got incorrect data with binary …

    …partition value
    
    ### What changes were proposed in this pull request?
    
    Currently, when the Parquet vectorized reader is enabled, using a binary type as a partition column returns an incorrect value, as shown by the UT below:
    ```scala
    test("Parquet vector reader incorrect with binary partition value") {
      Seq(false, true).foreach(tag => {
        withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
          withTable("t1") {
            sql(
              """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
                | USING PARQUET PARTITIONED BY (part)""".stripMargin)
            sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
            if (tag) {
              checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
                Row("a", "Spark SQL", ""))
            } else {
              checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
                Row("a", "Spark SQL", "Spark SQL"))
            }
          }
        }
      })
    }
    ```
    
    ### Why are the changes needed?
    Fix a data correctness issue.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added UT
    
    Closes apache#30839 from AngersZhuuuu/SPARK-33593-3.0.
    
    Authored-by: angerszhu <angers.zhu@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    AngersZhuuuu authored and dongjoon-hyun committed Dec 18, 2020
    Configuration menu
    Copy the full SHA
    f67c3c2 View commit details
    Browse the repository at this point in the history
  6. [SPARK-33841][CORE][3.0] Fix issue with jobs disappearing intermitten…

    …tly from the SHS under high load
    
    ### What changes were proposed in this pull request?
    
    Mark SHS event log entries that were `processing` at the beginning of the `checkForLogs` run as not stale, and check for this mark before deleting an event log. This fixes the issue where a job was displayed in the SHS, disappeared after some time, and then showed up again several minutes later.
    
    ### Why are the changes needed?
    
    The issue is caused by [SPARK-29043](https://issues.apache.org/jira/browse/SPARK-29043), which is designated to improve the concurrent performance of the History Server. The [change](https://github.com/apache/spark/pull/25797/files#) breaks the ["app deletion" logic](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R563) because of missing proper synchronization for `processing` event log entries. Since SHS now [filters out](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462) all `processing` event log entries, such entries do not have a chance to be [updated with the new `lastProcessed`](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R472) time and thus any entity that completes processing right after [filtering](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462) and before [the check for stale entities](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R560) will be identified as stale and will be deleted from the UI until the next `checkForLogs` run. This is because [updated `lastProcessed` time is used as criteria](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R557), and event log entries that missed to be updated with a new time, will match that criteria.
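    
    In rough pseudocode, the guard described above looks like this (a toy model with hypothetical names, not the actual FsHistoryProvider code):
    
    ```scala
    // An entry that finishes processing between the filtering step and the stale
    // check keeps an old lastProcessed timestamp; remembering which entries were
    // "processing" at the start of the run prevents deleting them in that run.
    case class LogEntry(path: String, lastProcessed: Long)
    
    def staleEntries(
        entries: Seq[LogEntry],
        scanStartTime: Long,
        processingAtStart: Set[String]): Seq[LogEntry] =
      entries.filter(e => e.lastProcessed < scanStartTime && !processingAtStart.contains(e.path))
    ```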
    
    The issue can be reproduced by generating a large number of event logs and uploading them to the SHS event log directory on S3. Around 800 (82.6 MB) copies of an event log file were created using the [shs-monitor](https://github.com/vladhlinsky/shs-monitor) script. The SHS then showed strange behavior when counting the total number of applications: at first, the number increased as expected, but on the next page refresh the total decreased. No errors were logged by SHS.
    
    241 entities are displayed at `20:50:42`:
    ![1-241-entities-at-20-50](https://user-images.githubusercontent.com/61428392/102611539-c2138d00-4137-11eb-9bbd-d77b22041f3b.png)
    203 entities are displayed at `20:52:17`:
    ![2-203-entities-at-20-52](https://user-images.githubusercontent.com/61428392/102611561-cdff4f00-4137-11eb-91ed-7405fe58a695.png)
    The number of loaded applications over time:
    ![4-loaded-applications](https://user-images.githubusercontent.com/61428392/102611586-d8b9e400-4137-11eb-8747-4007fc5469de.png)
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, SHS users won't face the behavior when the number of displayed applications decreases periodically.
    
    ### How was this patch tested?
    
    Tested using [shs-monitor](https://github.com/vladhlinsky/shs-monitor) script:
    * Build SHS with the proposed change
    * Download Hadoop AWS and AWS Java SDK
    * Prepare S3 bucket and user for programmatic access, grant required roles to the user. Get access key and secret key
    * Configure SHS to read event logs from S3
    * Start [monitor](https://github.com/vladhlinsky/shs-monitor/blob/main/monitor.sh) script to query SHS API
    * Run 8 [producers](https://github.com/vladhlinsky/shs-monitor/blob/main/producer.sh) for ~10 mins, create 805(83.1 MB) event log copies
    * Wait for SHS to load all the applications
    * Verify that the number of loaded applications increases continuously over time
    ![5-loaded-applications-fixed](https://user-images.githubusercontent.com/61428392/102617363-bf1d9a00-4141-11eb-9bae-f982d02fd30f.png)
    
    For more details, please refer to the [shs-monitor](https://github.com/vladhlinsky/shs-monitor) repository.
    
    Closes apache#30842 from vladhlinsky/SPARK-33841-branch-3.0.
    
    Authored-by: Vlad Glinsky <vladhlinsky@gmail.com>
    Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
    vhlinskyi authored and dongjoon-hyun committed Dec 18, 2020
    Configuration menu
    Copy the full SHA
    7881622 View commit details
    Browse the repository at this point in the history

Commits on Dec 20, 2020

  1. [SPARK-33756][SQL] Make BytesToBytesMap's MapIterator idempotent

    ### What changes were proposed in this pull request?
    Make the `hasNext` method of BytesToBytesMap's MapIterator idempotent
    
    ### Why are the changes needed?
    `hasNext` may be called multiple times; if not guarded, a second call after reaching the end of the iterator throws a NoSuchElementException.
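    
    As an illustration of the pattern (a simplified sketch, not the actual `BytesToBytesMap` code), an idempotent `hasNext` caches the lookahead instead of advancing underlying state on every call:
    
    ```scala
    // Repeated hasNext calls after exhaustion keep returning false instead of
    // advancing the underlying iterator and failing.
    class IdempotentIterator[A](underlying: Iterator[A]) extends Iterator[A] {
      private var lookahead: Option[A] = None
    
      override def hasNext: Boolean = lookahead.isDefined || {
        if (underlying.hasNext) { lookahead = Some(underlying.next()); true } else false
      }
    
      override def next(): A = {
        if (!hasNext) throw new NoSuchElementException("iterator exhausted")
        val value = lookahead.get
        lookahead = None
        value
      }
    }
    ```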
    
    ### Does this PR introduce _any_ user-facing change?
    NO.
    
    ### How was this patch tested?
    Update a unit test to cover this case.
    
    Closes apache#30728 from advancedxy/SPARK-33756.
    
    Authored-by: Xianjin YE <advancedxy@gmail.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit 1339168)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    advancedxy authored and srowen committed Dec 20, 2020
    Configuration menu
    Copy the full SHA
    faf8dd5 View commit details
    Browse the repository at this point in the history

Commits on Dec 21, 2020

  1. [SPARK-33853][SQL] EXPLAIN CODEGEN and BenchmarkQueryTest don't show …

    …subquery code
    
    ### What changes were proposed in this pull request?
    
    This PR fixes an issue that `EXPLAIN CODEGEN` and `BenchmarkQueryTest` don't show the corresponding code for subqueries.
    
    The following example is about `EXPLAIN CODEGEN`.
    ```
    spark.conf.set("spark.sql.adaptive.enabled", "false")
    val df = spark.range(1, 100)
    df.createTempView("df")
    spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("CODEGEN")
    
    scala> spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("CODEGEN")
    Found 1 WholeStageCodegen subtrees.
    == Subtree 1 / 1 (maxMethodCodeSize:55; maxConstantPoolSize:97(0.15% used); numInnerClasses:0) ==
    *(1) Project [Subquery scalar-subquery#3, [id=apache#24] AS scalarsubquery()#5L]
    :  +- Subquery scalar-subquery#3, [id=apache#24]
    :     +- *(2) HashAggregate(keys=[], functions=[min(id#0L)], output=[v#2L])
    :        +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=apache#20]
    :           +- *(1) HashAggregate(keys=[], functions=[partial_min(id#0L)], output=[min#8L])
    :              +- *(1) Range (1, 100, step=1, splits=12)
    +- *(1) Scan OneRowRelation[]
    
    Generated code:
    /* 001 */ public Object generate(Object[] references) {
    /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
    /* 003 */ }
    /* 004 */
    /* 005 */ // codegenStageId=1
    /* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
    /* 007 */   private Object[] references;
    /* 008 */   private scala.collection.Iterator[] inputs;
    /* 009 */   private scala.collection.Iterator rdd_input_0;
    /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] project_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];
    /* 011 */
    /* 012 */   public GeneratedIteratorForCodegenStage1(Object[] references) {
    /* 013 */     this.references = references;
    /* 014 */   }
    /* 015 */
    /* 016 */   public void init(int index, scala.collection.Iterator[] inputs) {
    /* 017 */     partitionIndex = index;
    /* 018 */     this.inputs = inputs;
    /* 019 */     rdd_input_0 = inputs[0];
    /* 020 */     project_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
    /* 021 */
    /* 022 */   }
    /* 023 */
    /* 024 */   private void project_doConsume_0() throws java.io.IOException {
    /* 025 */     // common sub-expressions
    /* 026 */
    /* 027 */     project_mutableStateArray_0[0].reset();
    /* 028 */
    /* 029 */     if (false) {
    /* 030 */       project_mutableStateArray_0[0].setNullAt(0);
    /* 031 */     } else {
    /* 032 */       project_mutableStateArray_0[0].write(0, 1L);
    /* 033 */     }
    /* 034 */     append((project_mutableStateArray_0[0].getRow()));
    /* 035 */
    /* 036 */   }
    /* 037 */
    /* 038 */   protected void processNext() throws java.io.IOException {
    /* 039 */     while ( rdd_input_0.hasNext()) {
    /* 040 */       InternalRow rdd_row_0 = (InternalRow) rdd_input_0.next();
    /* 041 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
    /* 042 */       project_doConsume_0();
    /* 043 */       if (shouldStop()) return;
    /* 044 */     }
    /* 045 */   }
    /* 046 */
    /* 047 */ }
    ```
    
    After this change, the corresponding code for subqueries is shown.
    ```
    Found 3 WholeStageCodegen subtrees.
    == Subtree 1 / 3 (maxMethodCodeSize:282; maxConstantPoolSize:206(0.31% used); numInnerClasses:0) ==
    *(1) HashAggregate(keys=[], functions=[partial_min(id#0L)], output=[min#8L])
    +- *(1) Range (1, 100, step=1, splits=12)
    
    Generated code:
    /* 001 */ public Object generate(Object[] references) {
    /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
    /* 003 */ }
    /* 004 */
    /* 005 */ // codegenStageId=1
    /* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
    /* 007 */   private Object[] references;
    /* 008 */   private scala.collection.Iterator[] inputs;
    /* 009 */   private boolean agg_initAgg_0;
    /* 010 */   private boolean agg_bufIsNull_0;
    /* 011 */   private long agg_bufValue_0;
    /* 012 */   private boolean range_initRange_0;
    /* 013 */   private long range_nextIndex_0;
    /* 014 */   private TaskContext range_taskContext_0;
    /* 015 */   private InputMetrics range_inputMetrics_0;
    /* 016 */   private long range_batchEnd_0;
    /* 017 */   private long range_numElementsTodo_0;
    /* 018 */   private boolean agg_agg_isNull_2_0;
    /* 019 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[3];
    /* 020 */
    /* 021 */   public GeneratedIteratorForCodegenStage1(Object[] references) {
    /* 022 */     this.references = references;
    /* 023 */   }
    /* 024 */
    /* 025 */   public void init(int index, scala.collection.Iterator[] inputs) {
    /* 026 */     partitionIndex = index;
    /* 027 */     this.inputs = inputs;
    /* 028 */
    /* 029 */     range_taskContext_0 = TaskContext.get();
    /* 030 */     range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
    /* 031 */     range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
    /* 032 */     range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
    /* 033 */     range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
    /* 034 */
    /* 035 */   }
    /* 036 */
    /* 037 */   private void agg_doAggregateWithoutKey_0() throws java.io.IOException {
    /* 038 */     // initialize aggregation buffer
    /* 039 */     agg_bufIsNull_0 = true;
    /* 040 */     agg_bufValue_0 = -1L;
    /* 041 */
    /* 042 */     // initialize Range
    /* 043 */     if (!range_initRange_0) {
    /* 044 */       range_initRange_0 = true;
    /* 045 */       initRange(partitionIndex);
    /* 046 */     }
    /* 047 */
    /* 048 */     while (true) {
    /* 049 */       if (range_nextIndex_0 == range_batchEnd_0) {
    /* 050 */         long range_nextBatchTodo_0;
    /* 051 */         if (range_numElementsTodo_0 > 1000L) {
    /* 052 */           range_nextBatchTodo_0 = 1000L;
    /* 053 */           range_numElementsTodo_0 -= 1000L;
    /* 054 */         } else {
    /* 055 */           range_nextBatchTodo_0 = range_numElementsTodo_0;
    /* 056 */           range_numElementsTodo_0 = 0;
    /* 057 */           if (range_nextBatchTodo_0 == 0) break;
    /* 058 */         }
    /* 059 */         range_batchEnd_0 += range_nextBatchTodo_0 * 1L;
    /* 060 */       }
    /* 061 */
    /* 062 */       int range_localEnd_0 = (int)((range_batchEnd_0 - range_nextIndex_0) / 1L);
    /* 063 */       for (int range_localIdx_0 = 0; range_localIdx_0 < range_localEnd_0; range_localIdx_0++) {
    /* 064 */         long range_value_0 = ((long)range_localIdx_0 * 1L) + range_nextIndex_0;
    /* 065 */
    /* 066 */         agg_doConsume_0(range_value_0);
    /* 067 */
    /* 068 */         // shouldStop check is eliminated
    /* 069 */       }
    /* 070 */       range_nextIndex_0 = range_batchEnd_0;
    /* 071 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localEnd_0);
    /* 072 */       range_inputMetrics_0.incRecordsRead(range_localEnd_0);
    /* 073 */       range_taskContext_0.killTaskIfInterrupted();
    /* 074 */     }
    /* 075 */
    /* 076 */   }
    /* 077 */
    /* 078 */   private void initRange(int idx) {
    /* 079 */     java.math.BigInteger index = java.math.BigInteger.valueOf(idx);
    /* 080 */     java.math.BigInteger numSlice = java.math.BigInteger.valueOf(12L);
    /* 081 */     java.math.BigInteger numElement = java.math.BigInteger.valueOf(99L);
    /* 082 */     java.math.BigInteger step = java.math.BigInteger.valueOf(1L);
    /* 083 */     java.math.BigInteger start = java.math.BigInteger.valueOf(1L);
    /* 084 */     long partitionEnd;
    /* 085 */
    /* 086 */     java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start);
    /* 087 */     if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
    /* 088 */       range_nextIndex_0 = Long.MAX_VALUE;
    /* 089 */     } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
    /* 090 */       range_nextIndex_0 = Long.MIN_VALUE;
    /* 091 */     } else {
    /* 092 */       range_nextIndex_0 = st.longValue();
    /* 093 */     }
    /* 094 */     range_batchEnd_0 = range_nextIndex_0;
    /* 095 */
    /* 096 */     java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice)
    /* 097 */     .multiply(step).add(start);
    /* 098 */     if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
    /* 099 */       partitionEnd = Long.MAX_VALUE;
    /* 100 */     } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
    /* 101 */       partitionEnd = Long.MIN_VALUE;
    /* 102 */     } else {
    /* 103 */       partitionEnd = end.longValue();
    /* 104 */     }
    /* 105 */
    /* 106 */     java.math.BigInteger startToEnd = java.math.BigInteger.valueOf(partitionEnd).subtract(
    /* 107 */       java.math.BigInteger.valueOf(range_nextIndex_0));
    /* 108 */     range_numElementsTodo_0  = startToEnd.divide(step).longValue();
    /* 109 */     if (range_numElementsTodo_0 < 0) {
    /* 110 */       range_numElementsTodo_0 = 0;
    /* 111 */     } else if (startToEnd.remainder(step).compareTo(java.math.BigInteger.valueOf(0L)) != 0) {
    /* 112 */       range_numElementsTodo_0++;
    /* 113 */     }
    /* 114 */   }
    /* 115 */
    /* 116 */   private void agg_doConsume_0(long agg_expr_0_0) throws java.io.IOException {
    /* 117 */     // do aggregate
    /* 118 */     // common sub-expressions
    /* 119 */
    /* 120 */     // evaluate aggregate functions and update aggregation buffers
    /* 121 */
    /* 122 */     agg_agg_isNull_2_0 = true;
    /* 123 */     long agg_value_2 = -1L;
    /* 124 */
    /* 125 */     if (!agg_bufIsNull_0 && (agg_agg_isNull_2_0 ||
    /* 126 */         agg_value_2 > agg_bufValue_0)) {
    /* 127 */       agg_agg_isNull_2_0 = false;
    /* 128 */       agg_value_2 = agg_bufValue_0;
    /* 129 */     }
    /* 130 */
    /* 131 */     if (!false && (agg_agg_isNull_2_0 ||
    /* 132 */         agg_value_2 > agg_expr_0_0)) {
    /* 133 */       agg_agg_isNull_2_0 = false;
    /* 134 */       agg_value_2 = agg_expr_0_0;
    /* 135 */     }
    /* 136 */
    /* 137 */     agg_bufIsNull_0 = agg_agg_isNull_2_0;
    /* 138 */     agg_bufValue_0 = agg_value_2;
    /* 139 */
    /* 140 */   }
    /* 141 */
    /* 142 */   protected void processNext() throws java.io.IOException {
    /* 143 */     while (!agg_initAgg_0) {
    /* 144 */       agg_initAgg_0 = true;
    /* 145 */       long agg_beforeAgg_0 = System.nanoTime();
    /* 146 */       agg_doAggregateWithoutKey_0();
    /* 147 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[2] /* aggTime */).add((System.nanoTime() - agg_beforeAgg_0) / 1000000);
    /* 148 */
    /* 149 */       // output the result
    /* 150 */
    /* 151 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[1] /* numOutputRows */).add(1);
    /* 152 */       range_mutableStateArray_0[2].reset();
    /* 153 */
    /* 154 */       range_mutableStateArray_0[2].zeroOutNullBytes();
    /* 155 */
    /* 156 */       if (agg_bufIsNull_0) {
    /* 157 */         range_mutableStateArray_0[2].setNullAt(0);
    /* 158 */       } else {
    /* 159 */         range_mutableStateArray_0[2].write(0, agg_bufValue_0);
    /* 160 */       }
    /* 161 */       append((range_mutableStateArray_0[2].getRow()));
    /* 162 */     }
    /* 163 */   }
    /* 164 */
    /* 165 */ }
    ```
    
    ### Why are the changes needed?
    
    For better debuggability.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. After this change, users can see subquery code by `EXPLAIN CODEGEN`.
    
    ### How was this patch tested?
    
    New test.
    
    Closes apache#30859 from sarutak/explain-codegen-subqueries.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit f4e1069)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    sarutak authored and dongjoon-hyun committed Dec 21, 2020
    Configuration menu
    Copy the full SHA
    78dbb4a View commit details
    Browse the repository at this point in the history
  2. [SPARK-33869][PYTHON][SQL][TESTS] Have a separate metastore directory…

    … for each PySpark test job
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to give each PySpark test job its own metastore directory, to avoid potential conflicts in catalog operations.
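    
    As a rough sketch of the idea (the configuration key and Derby URL below are assumptions for illustration, not the exact change to the test scripts), each job can point at its own Derby metastore directory:
    
    ```scala
    // Hypothetical sketch: give each test job a private metastore_db so concurrent
    // jobs do not race on the same Derby database.
    import java.nio.file.Files
    import org.apache.spark.sql.SparkSession
    
    val metastoreDir = Files.createTempDirectory("metastore_").toString
    val spark = SparkSession.builder()
      .master("local[2]")
      .config("javax.jdo.option.ConnectionURL",
        s"jdbc:derby:;databaseName=$metastoreDir/metastore_db;create=true")
      .enableHiveSupport()
      .getOrCreate()
    ```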
    
    ### Why are the changes needed?
    
    To make PySpark tests less flaky.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    Manually tested by trying some sleeps in apache#30873.
    
    Closes apache#30875 from HyukjinKwon/SPARK-33869.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 38bbcca)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    HyukjinKwon authored and dongjoon-hyun committed Dec 21, 2020
    Configuration menu
    Copy the full SHA
    c9fe712 View commit details
    Browse the repository at this point in the history

Commits on Dec 22, 2020

  1. [SPARK-28863][SQL][FOLLOWUP][3.0] Make sure optimized plan will not b…

    …e re-analyzed
    
    backport apache#30777 to 3.0
    
    ----------
    
    ### What changes were proposed in this pull request?
    
    It's a known issue that re-analyzing an optimized plan can lead to various problems. We made several attempts to prevent it, but the current solution, `AlreadyOptimized`, is still not 100% safe, as people can inject catalyst rules that call the analyzer directly.
    
    This PR proposes a simpler and safer idea: set the `analyzed` flag to true after optimization, and have the analyzer skip plans whose `analyzed` flag is already true.
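    
    A toy model of the guard (hypothetical names, not the real `QueryExecution`/`Analyzer` code):
    
    ```scala
    // Once a plan is flagged as analyzed, the analyzer returns it untouched
    // instead of re-running its rules on an already-optimized plan.
    case class Plan(analyzed: Boolean)
    
    def analyze(plan: Plan): Plan =
      if (plan.analyzed) plan                 // skip: already analyzed (or optimized)
      else plan.copy(analyzed = true)         // run the analysis rules, then mark
    ```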
    
    ### Why are the changes needed?
    
    make the code simpler and safer
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    existing tests.
    
    Closes apache#30872 from cloud-fan/ds.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    cloud-fan authored and HyukjinKwon committed Dec 22, 2020
    Configuration menu
    Copy the full SHA
    0820beb View commit details
    Browse the repository at this point in the history
  2. [SPARK-33860][SQL] Make CatalystTypeConverters.convertToCatalyst matc…

    …h special Array value
    
    ### What changes were proposed in this pull request?
    
    Add cases to match arrays whose element type is primitive.
    
    ### Why are the changes needed?
    
    We get an exception when using `Literal.create(Array(1, 2, 3), ArrayType(IntegerType))`:
    ```
    Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to array<int>, but class int[] found.
    	at scala.Predef$.require(Predef.scala:281)
    	at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:215)
    	at org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:292)
    	at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:140)
    ```
    The same problem occurs for other arrays whose element type is primitive.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes.
    
    ### How was this patch tested?
    
    Add test.
    
    Closes apache#30868 from ulysses-you/SPARK-33860.
    
    Authored-by: ulysses-you <ulyssesyou18@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 1dd63dc)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    ulysses-you authored and HyukjinKwon committed Dec 22, 2020
    Configuration menu
    Copy the full SHA
    7af54fd View commit details
    Browse the repository at this point in the history
  3. [BUILD][MINOR] Do not publish snapshots from forks

    ### What changes were proposed in this pull request?
    The GitHub workflow `Publish Snapshot` publishes the master and 3.1 branches via Nexus. For this, the workflow uses the `secrets.NEXUS_USER` and `secrets.NEXUS_PW` secrets. These are not available in forks, where this workflow therefore fails every day:
    
    - https://github.com/G-Research/spark/actions/runs/431626797
    - https://github.com/G-Research/spark/actions/runs/433153049
    - https://github.com/G-Research/spark/actions/runs/434680048
    - https://github.com/G-Research/spark/actions/runs/436958780
    
    ### Why are the changes needed?
    Avoid attempting to publish snapshots from forked repositories.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Code review only.
    
    Closes apache#30884 from EnricoMi/branch-do-not-publish-snapshots-from-forks.
    
    Authored-by: Enrico Minack <github@enrico.minack.dev>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 1d45025)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    EnricoMi authored and HyukjinKwon committed Dec 22, 2020
    Configuration menu
    Copy the full SHA
    73f5626 View commit details
    Browse the repository at this point in the history

Commits on Dec 23, 2020

  1. Revert "[SPARK-33860][SQL] Make CatalystTypeConverters.convertToCatal…

    …yst match special Array value"
    
    This reverts commit 7af54fd.
    HyukjinKwon committed Dec 23, 2020
    Configuration menu
    Copy the full SHA
    4299a48 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33891][DOCS][CORE] Update dynamic allocation related documents

    ### What changes were proposed in this pull request?
    
    This PR aims to update the followings.
    - Remove the outdated requirement for `spark.shuffle.service.enabled` in `configuration.md`
    - Dynamic allocation section in `job-scheduling.md`
    
    ### Why are the changes needed?
    
    To make the document up-to-date.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, it's a documentation update.
    
    ### How was this patch tested?
    
    Manual.
    
    **BEFORE**
    ![Screen Shot 2020-12-23 at 2 22 04 AM](https://user-images.githubusercontent.com/9700541/102986441-ae647f80-44c5-11eb-97a3-87c2d368952a.png)
    ![Screen Shot 2020-12-23 at 2 22 34 AM](https://user-images.githubusercontent.com/9700541/102986473-bcb29b80-44c5-11eb-8eae-6802001c6dfa.png)
    
    **AFTER**
    ![Screen Shot 2020-12-23 at 2 25 36 AM](https://user-images.githubusercontent.com/9700541/102986767-2df24e80-44c6-11eb-8540-e74856a4c313.png)
    ![Screen Shot 2020-12-23 at 2 21 13 AM](https://user-images.githubusercontent.com/9700541/102986366-8e34c080-44c5-11eb-8054-1efd07c9458c.png)
    
    Closes apache#30906 from dongjoon-hyun/SPARK-33891.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 47d1aa4)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Dec 23, 2020
    Configuration menu
    Copy the full SHA
    8c4e166 View commit details
    Browse the repository at this point in the history
  3. [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consumin…

    …g after the task ends
    
    ### What changes were proposed in this pull request?
    
    This is a retry of apache#30177.
    
    This is not a complete fix, but a complete one would take a long time (apache#30242).
    As discussed offline, at least using `ContextAwareIterator` should be helpful enough for many cases.
    
    As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.
    
    ### Why are the changes needed?
    
    Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
    
    E.g.,:
    
    ```py
    spark.range(0, 100000, 1, 1).write.parquet(path)
    
    spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
    
    def f(x):
        return 0
    
    fUdf = udf(f, LongType())
    
    spark.read.parquet(path).select(fUdf('id')).head()
    ```
    
    This is because the Python evaluation consumes the parent iterator in a separate thread, and it can keep consuming data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, this can cause a segmentation fault that crashes the executor.
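    
    A minimal sketch of the wrapper's idea (close to, but not necessarily identical to, the class added in this PR):
    
    ```scala
    // Stop yielding elements once the task has completed or been interrupted, so a
    // consumer running in another thread cannot read past the task's lifetime.
    import org.apache.spark.TaskContext
    
    class ContextAwareIterator[T](context: TaskContext, delegate: Iterator[T])
      extends Iterator[T] {
    
      override def hasNext: Boolean =
        !context.isCompleted() && !context.isInterrupted() && delegate.hasNext
    
      override def next(): T = delegate.next()
    }
    ```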
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added tests, and manually.
    
    Closes apache#30899 from ueshin/issues/SPARK-33277/context_aware_iterator.
    
    Authored-by: Takuya UESHIN <ueshin@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 5c9b421)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    ueshin authored and dongjoon-hyun committed Dec 23, 2020
    Configuration menu
    Copy the full SHA
    83adba7 View commit details
    Browse the repository at this point in the history

Commits on Dec 24, 2020

  1. [SPARK-33900][WEBUI] Show shuffle read size / records correctly when …

    …only remotebytesread is available
    
    ### What changes were proposed in this pull request?
    Shuffle Read Size / Records should also be displayed when remoteBytesRead > 0 and localBytesRead = 0.
    
    current:
    ![image](https://user-images.githubusercontent.com/3898450/103079421-c4ca2280-460e-11eb-9e2f-49d35b5d324d.png)
    fix:
    ![image](https://user-images.githubusercontent.com/3898450/103079439-cc89c700-460e-11eb-9a41-6b2882980d11.png)
    
    ### Why are the changes needed?
    At present, the page only displays Shuffle Read Size / Records when localBytesRead > 0.
    When there is only remote reading, these metrics cannot be seen on the stage page.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    manual test
    
    Closes apache#30916 from cxzl25/SPARK-33900.
    
    Authored-by: sychen <sychen@ctrip.com>
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    (cherry picked from commit 700f5ab)
    Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    cxzl25 authored and sarutak committed Dec 24, 2020
    Configuration menu
    Copy the full SHA
    1445129 View commit details
    Browse the repository at this point in the history

Commits on Dec 27, 2020

  1. [SPARK-33911][SQL][DOCS][3.0] Update the SQL migration guide about ch…

    …anges in `HiveClientImpl`
    
    ### What changes were proposed in this pull request?
    Update the SQL migration guide about the changes made by:
    - apache#30778
    - apache#30711
    
    ### Why are the changes needed?
    To inform users about the recent changes in the upcoming releases.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    N/A
    
    Closes apache#30932 from MaxGekk/sql-migr-guide-hiveclientimpl-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Dec 27, 2020
    Configuration menu
    Copy the full SHA
    65dd1d0 View commit details
    Browse the repository at this point in the history

Commits on Dec 30, 2020

  1. [MINOR][SS] Call fetchEarliestOffsets when it is necessary

    ### What changes were proposed in this pull request?
    
    This minor patch makes two variables that are computed by calling `fetchEarliestOffsets` `lazy`, because these values are not always needed.
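    
    The pattern in a nutshell (illustrative only; the names below are not the actual Kafka source fields):
    
    ```scala
    // A lazy val defers the expensive Kafka RPC until the value is first read, so
    // code paths that never touch it skip the call entirely.
    def fetchEarliestOffsets(): Map[Int, Long] = {
      println("expensive Kafka RPC")  // stands in for the real admin/consumer call
      Map(0 -> 0L)
    }
    
    lazy val earliestOffsets: Map[Int, Long] = fetchEarliestOffsets()
    ```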
    
    ### Why are the changes needed?
    
    To avoid unnecessary Kafka RPC calls.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Unit test.
    
    Closes apache#30969 from viirya/ss-minor3.
    
    Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 4a669f5)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    viirya authored and HyukjinKwon committed Dec 30, 2020
    Configuration menu
    Copy the full SHA
    91a2260 View commit details
    Browse the repository at this point in the history

Commits on Dec 31, 2020

  1. [SPARK-33942][DOCS] Remove hiveClientCalls.count in CodeGenerator

    … metrics docs
    
    ### What changes were proposed in this pull request?
    Removed **hiveClientCalls.count** from the CodeGenerator metrics under Component instance = Executor.
    
    ### Why are the changes needed?
    The monitoring documentation displayed wrong information about this metric. I had set up metrics logging in Graphite following the referenced documentation, but this metric was not being reported, so I had to check whether the issue was at my application's end, in the Spark code, or in the documentation. The documentation had the wrong info.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manual, checked it on my forked repository feature branch [SPARK-33942](https://github.com/coderbond007/spark/blob/SPARK-33942/docs/monitoring.md)
    
    Closes apache#30976 from coderbond007/SPARK-33942.
    
    Authored-by: Pradyumn Agrawal (pradyumn.ag) <pradyumn.ag@media.net>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 13e8c28)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Pradyumn Agrawal (pradyumn.ag) authored and dongjoon-hyun committed Dec 31, 2020
    Configuration menu
    Copy the full SHA
    b156c1f View commit details
    Browse the repository at this point in the history

Commits on Jan 1, 2021

  1. [SPARK-33931][INFRA][3.0] Recover GitHub Action build_and_test job

    ### What changes were proposed in this pull request?
    
    This is a backport of apache#30959 .
    This PR aims to recover GitHub Action `build_and_test` job.
    
    ### Why are the changes needed?
    
    Currently, the `build_and_test` job fails to start (in master/branch-3.1 at least) because of the following:
    ```
    r-lib/actions/setup-rv1 is not allowed to be used in apache/spark.
    Actions in this workflow must be: created by GitHub, verified in the GitHub Marketplace,
    within a repository owned by apache or match the following:
    adoptopenjdk/*, apache/*, gradle/wrapper-validation-action.
    ```
    - https://github.com/apache/spark/actions/runs/449826457
    
    ![Screen Shot 2020-12-28 at 10 06 11 PM](https://user-images.githubusercontent.com/9700541/103262174-f1f13a80-4958-11eb-8ceb-631527155775.png)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. This is a test infra.
    
    ### How was this patch tested?
    
    To check GitHub Action `build_and_test` job on this PR.
    
    Closes apache#30986 from dongjoon-hyun/SPARK-33931-3.0.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun committed Jan 1, 2021
    Configuration menu
    Copy the full SHA
    39867a8 View commit details
    Browse the repository at this point in the history

Commits on Jan 3, 2021

  1. [SPARK-33963][SQL] Canonicalize HiveTableRelation w/o table stats

    ### What changes were proposed in this pull request?
    Skip table stats when canonicalizing `HiveTableRelation`.
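    
    A toy illustration of the canonicalization change (hypothetical types, not the real `HiveTableRelation`):
    
    ```scala
    // Canonicalization drops fields that should not affect plan equality, so two
    // relations differing only in statistics compare equal and cache lookups hit.
    case class Stats(sizeInBytes: BigInt)
    case class Relation(table: String, stats: Option[Stats]) {
      def canonicalized: Relation = copy(stats = None)
    }
    
    assert(Relation("t", Some(Stats(10))).canonicalized == Relation("t", None).canonicalized)
    ```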
    
    ### Why are the changes needed?
    The changes fix a regression compared to Spark 3.0; see SPARK-33963.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, Spark behaves as it did in version 3.0.1.
    
    ### How was this patch tested?
    By running new UT:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
    ```
    
    Closes apache#30995 from MaxGekk/fix-caching-hive-table.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit fc7d016)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Jan 3, 2021
    Configuration menu
    Copy the full SHA
    dda431a View commit details
    Browse the repository at this point in the history
  2. [SPARK-33398] Fix loading tree models prior to Spark 3.0

    ### What changes were proposed in this pull request?
    In https://github.com/apache/spark/pull/21632/files#diff-0fdae8a6782091746ed20ea43f77b639f9c6a5f072dd2f600fcf9a7b37db4f47, a new field `rawCount` was added to `NodeData`, which means a tree model trained in 2.4 cannot be loaded in 3.0/3.1/master.
    The `rawCount` field is only used in training, not in `transform`/`predict`/`featureImportance`, so this PR simply sets it to -1L.
    
    ### Why are the changes needed?
    To support loading old tree models in 3.0/3.1/master.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added test suites.
    
    Closes apache#30889 from zhengruifeng/fix_tree_load.
    
    Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit 6b7527e)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    zhengruifeng authored and srowen committed Jan 3, 2021
    Configuration menu
    Copy the full SHA
    9f1bf4e View commit details
    Browse the repository at this point in the history

Commits on Jan 4, 2021

  1. [SPARK-33950][SQL][3.1][3.0] Refresh cache in v1 `ALTER TABLE .. DROP…

    … PARTITION`
    
    ### What changes were proposed in this pull request?
    Invoke `refreshTable()` from `AlterTableDropPartitionCommand.run()` after dropping partitions. In particular, this re-creates the cache associated with the modified table.
    
    ### Why are the changes needed?
    This fixes the issues portrayed by the example:
    ```sql
    spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0);
    spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0;
    spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1;
    spark-sql> CACHE TABLE tbl1;
    spark-sql> SELECT * FROM tbl1;
    0	0
    1	1
    spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0);
    spark-sql> SELECT * FROM tbl1;
    0	0
    1	1
    ```
    The last query must not return `0	0` since that row was deleted by the previous command.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes for the example above:
    ```sql
    ...
    spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0);
    spark-sql> SELECT * FROM tbl1;
    1	1
    ```
    
    ### How was this patch tested?
    By running the affected test suites:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 67195d0)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#31006 from MaxGekk/drop-partition-refresh-cache-3.1.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit eef0e4c)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    MaxGekk authored and dongjoon-hyun committed Jan 4, 2021
    Configuration menu
    Copy the full SHA
    e882c90 View commit details
    Browse the repository at this point in the history

Commits on Jan 5, 2021

  1. [SPARK-34000][CORE] Fix stageAttemptToNumSpeculativeTasks java.util.N…

    …oSuchElementException
    
    ### What changes were proposed in this pull request?
    As the log below shows, Stage 600 can be removed from `stageAttemptToNumSpeculativeTasks` by `onStageCompleted()`, but the speculative task 306.1 in stage 600 then throws `NoSuchElementException` when it enters `onTaskEnd()`.
    ```
    21/01/04 03:00:32,259 WARN [task-result-getter-2] scheduler.TaskSetManager:69 : Lost task 306.1 in stage 600.0 (TID 283610, hdc49-mcc10-01-0510-4108-039-tess0097.stratus.rno.ebay.com, executor 27): TaskKilled (another attempt succeeded)
    21/01/04 03:00:32,259 INFO [task-result-getter-2] scheduler.TaskSetManager:57 : Task 306.1 in stage 600.0 (TID 283610) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the
    previous stage needs to be re-run, or because a different copy of the task has already succeeded).
    21/01/04 03:00:32,259 INFO [task-result-getter-2] cluster.YarnClusterScheduler:57 : Removed TaskSet 600.0, whose tasks have all completed, from pool default
    21/01/04 03:00:32,259 INFO [HiveServer2-Handler-Pool: Thread-5853] thriftserver.SparkExecuteStatementOperation:190 : Returning result set with 50 rows from offsets [5378600, 5378650) with 1fe245f8-a7f9-4ec0-bcb5-8cf324cbbb47
    21/01/04 03:00:32,260 ERROR [spark-listener-group-executorManagement] scheduler.AsyncEventQueue:94 : Listener ExecutorAllocationListener threw an exception
    java.util.NoSuchElementException: key not found: Stage 600 (Attempt 0)
            at scala.collection.MapLike.default(MapLike.scala:235)
            at scala.collection.MapLike.default$(MapLike.scala:234)
            at scala.collection.AbstractMap.default(Map.scala:63)
            at scala.collection.mutable.HashMap.apply(HashMap.scala:69)
            at org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:621)
            at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
            at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
            at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
            at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
            at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
            at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
            at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:116)
            at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:116)
            at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
            at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102)
            at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:97)
            at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320)
            at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:97)
    ```
    
    ### Why are the changes needed?
    To avoid throwing the java.util.NoSuchElementException
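    
    A hedged sketch of the defensive lookup (names simplified relative to the real `ExecutorAllocationListener`):
    
    ```scala
    // Using `get` instead of `apply` tolerates a stage that onStageCompleted has
    // already removed, so a late speculative-task event is dropped instead of
    // throwing NoSuchElementException.
    import scala.collection.mutable
    
    val stageAttemptToNumSpeculativeTasks = mutable.HashMap[(Int, Int), Int]()
    val stageAttempt = (600, 0)
    
    stageAttemptToNumSpeculativeTasks.get(stageAttempt) match {
      case Some(n) => stageAttemptToNumSpeculativeTasks(stageAttempt) = n - 1
      case None    => // stage already completed and cleaned up; ignore the late event
    }
    ```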
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    This is a protective patch; it is not easy to reproduce in a UT because the event order is not fixed in an async queue.
    
    Closes apache#31025 from LantaoJin/SPARK-34000.
    
    Authored-by: LantaoJin <jinlantao@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit a7d3fcd)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    LantaoJin authored and dongjoon-hyun committed Jan 5, 2021
    Configuration menu
    Copy the full SHA
    36e845b View commit details
    Browse the repository at this point in the history
  2. [SPARK-34010][SQL][DODCS] Use python3 instead of python in SQL docume…

    …ntation build
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to use python3 instead of python in SQL documentation build.
    After SPARK-29672, we use `sql/create-docs.sh` everywhere in Spark dev. We should fix it in `sql/create-docs.sh` too.
    This blocks release because the release container does not have `python` but only `python3`.
    
    ### Why are the changes needed?
    
    To unblock the release.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, dev-only.
    
    ### How was this patch tested?
    
    I manually ran the script
    
    Closes apache#31041 from HyukjinKwon/SPARK-34010.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 8d09f96)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Jan 5, 2021
    Configuration menu
    Copy the full SHA
    7a2f4da View commit details
    Browse the repository at this point in the history
  3. [SPARK-33935][SQL][3.0] Fix CBO cost function

    ### What changes were proposed in this pull request?
    
    Changed the cost function in CBO to match documentation.
    
    ### Why are the changes needed?
    
    The parameter `spark.sql.cbo.joinReorder.card.weight` is documented as:
    ```
    The weight of cardinality (number of rows) for plan cost comparison in join reorder: rows * weight + size * (1 - weight).
    ```
    The implementation in `JoinReorderDP.betterThan` does not match this documentation:
    ```
    def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
          if (other.planCost.card == 0 || other.planCost.size == 0) {
            false
          } else {
            val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card)
            val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size)
            relativeRows * conf.joinReorderCardWeight +
              relativeSize * (1 - conf.joinReorderCardWeight) < 1
          }
        }
    ```
    
    This different implementation has an unfortunate consequence:
    given two plans A and B, both A betterThan B and B betterThan A might return false. This happens when one plan has many rows with small sizes and the other has few rows with large sizes.
    
    Example values that exhibit this phenomenon with the default weight value (0.7):
    A.card = 500, B.card = 300
    A.size = 30, B.size = 80
    Both A betterThan B and B betterThan A would have a score above 1 and would return false.
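    
    Plugging these numbers into both formulas (standalone arithmetic, not the `JoinReorderDP` code):
    
    ```scala
    val weight = 0.7
    def relativeScore(card: Double, size: Double, otherCard: Double, otherSize: Double): Double =
      (card / otherCard) * weight + (size / otherSize) * (1 - weight)
    
    relativeScore(500, 30, 300, 80)  // ~1.28 > 1, so "A betterThan B" is false
    relativeScore(300, 80, 500, 30)  // ~1.22 > 1, so "B betterThan A" is false
    
    // With the documented cost rows * weight + size * (1 - weight):
    // cost(A) = 500 * 0.7 + 30 * 0.3 = 359, cost(B) = 300 * 0.7 + 80 * 0.3 = 234,
    // so B would be preferred.
    ```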
    
    This happens with several of the TPCDS queries.
    
    The new implementation does not have this behavior.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    New and existing UTs
    
    Closes apache#31042 from tanelk/SPARK-33935_cbo_cost_function_3.0.
    
    Authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Tanel Kiis authored and dongjoon-hyun committed Jan 5, 2021
    Configuration menu
    Copy the full SHA
    1179b8b View commit details
    Browse the repository at this point in the history
  4. [SPARK-33844][SQL][3.0] InsertIntoHiveDir command should check col na…

    …me too
    
    ### What changes were proposed in this pull request?
    
    In hive-1.2.1, the Hive serde simply splits `serdeConstants.LIST_COLUMNS` and `serdeConstants.LIST_COLUMN_TYPES` on commas.
    
    When we run the following UT with Spark 2.4:
    ```
      test("insert overwrite directory with comma col name") {
        withTempDir { dir =>
          val path = dir.toURI.getPath
    
          val v1 =
            s"""
               | INSERT OVERWRITE DIRECTORY '${path}'
               | STORED AS TEXTFILE
               | SELECT 1 as a, 'c' as b, if(1 = 1, "true", "false")
             """.stripMargin
    
          sql(v1).explain(true)
    
          sql(v1).show()
        }
      }
    ```
    it fails as below, since a column name contains `,` and the column-name and column-type lists then differ in size:
    ```
    19:56:05.618 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter:  [ angerszhu ] Aborting job dd774f18-93fa-431f-9468-3534c7d8acda.
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 5 elements while columns.types has 3 elements!
    	at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
    	at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
    	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
    	at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:119)
    	at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
    	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
    	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
    	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:287)
    	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:219)
    	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:218)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:121)
    	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:461)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    
    ```
    
    Since hive-2.3, COLUMN_NAME_DELIMITER is set to a special char when a column name contains ',':
    https://github.com/apache/hive/blob/6f4c35c9e904d226451c465effdc5bfd31d395a0/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java#L1180-L1188
    https://github.com/apache/hive/blob/6f4c35c9e904d226451c465effdc5bfd31d395a0/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java#L1044-L1075
    
    And in script transform, we already parse column names to avoid this problem:
    https://github.com/apache/spark/blob/554600c2af0dbc8979955807658fafef5dc66c08/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationExec.scala#L257-L261
    
    So `InsertIntoHiveDirCommand` should do the same thing. I have verified that this method makes spark-2.4 work well.
    
    ### Why are the changes needed?
    Safer use of the Hive serde.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    
    Closes apache#31038 from AngersZhuuuu/SPARK-33844-3.0.
    
    Authored-by: angerszhu <angers.zhu@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    AngersZhuuuu authored and cloud-fan committed Jan 5, 2021
    Configuration menu
    Copy the full SHA
    9ba6db9 View commit details
    Browse the repository at this point in the history

Commits on Jan 6, 2021

  1. [SPARK-33635][SS] Adjust the order of check in KafkaTokenUtil.needTok…

    …enUpdate to remedy perf regression
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to adjust the order of checks in KafkaTokenUtil.needTokenUpdate so that short-circuiting applies to the non-delegation-token cases (insecure, and secured without a delegation token), which largely remedies the performance regression.
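    
    The general shape of the change, heavily simplified (the helpers and the config key below are hypothetical placeholders, not the actual `KafkaTokenUtil` members):
    
    ```scala
    // Checking the cheap "is a delegation token even configured?" condition first
    // lets && short-circuit before the costly token lookup on the common
    // non-delegation-token path, which is hit for every poll.
    def delegationTokenConfigured(params: Map[String, String]): Boolean =
      params.contains("spark.kafka.clusters.<id>.auth.bootstrap.servers")  // placeholder key
    def tokenNeedsRefresh(): Boolean = { Thread.sleep(5); false }          // stands in for the costly check
    
    def needTokenUpdate(params: Map[String, String]): Boolean =
      delegationTokenConfigured(params) && tokenNeedsRefresh()
    ```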
    
    ### Why are the changes needed?
    
    There is a serious performance regression between Spark 2.4 and Spark 3.0 on the read path of the Kafka data source.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manually ran a reproducer (https://github.com/codegorillauk/spark-kafka-read, modified to just count instead of writing to a Kafka topic) and measured the time.
    
    > the branch applying the change with adding measurement
    
    https://github.com/HeartSaVioR/spark/commits/debug-SPARK-33635-v3.0.1
    
    > the branch only adding measurement
    
    https://github.com/HeartSaVioR/spark/commits/debug-original-ver-SPARK-33635-v3.0.1
    
    > the result (before the fix)
    
    count: 10280000
    Took 41.634007047 secs
    
    21/01/06 13:16:07 INFO KafkaDataConsumer: debug ver. 17-original
    21/01/06 13:16:07 INFO KafkaDataConsumer: Total time taken to retrieve: 82118 ms
    
    > the result (after the fix)
    
    count: 10280000
    Took 7.964058475 secs
    
    21/01/06 13:08:22 INFO KafkaDataConsumer: debug ver. 17
    21/01/06 13:08:22 INFO KafkaDataConsumer: Total time taken to retrieve: 987 ms
    
    Closes apache#31056 from HeartSaVioR/SPARK-33635.
    
    Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit fa93090)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    HeartSaVioR authored and dongjoon-hyun committed Jan 6, 2021
    Configuration menu
    Copy the full SHA
    98cb0cd View commit details
    Browse the repository at this point in the history
  2. [SPARK-33029][CORE][WEBUI][3.0] Fix the UI executor page incorrectly …

    …marking the driver as blacklisted
    
    This is a backport of apache#30954
    
    ### What changes were proposed in this pull request?
    Filter out the driver entity when updating the exclusion status of live executors (including the driver), so the driver won't be marked as blacklisted in the UI even if the node that hosts the driver has been marked as blacklisted.
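
    A minimal sketch of the filtering idea, using illustrative types rather than Spark's internal live entities:
    
    ```scala
    // When a host is excluded, flip the exclusion flag only for real executors on that host;
    // the driver entity is filtered out and stays active in the UI.
    case class LiveEntity(id: String, host: String, var excluded: Boolean = false)
    
    def markHostExcluded(entities: Seq[LiveEntity], excludedHost: String): Unit =
      entities
        .filter(e => e.host == excludedHost && e.id != "driver") // "driver" stands in for the driver entity id
        .foreach(_.excluded = true)
    ```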
    
    ### Why are the changes needed?
    Before this change, if we run Spark in standalone mode with spark.blacklist.enabled=true, the driver will be marked as blacklisted when the host that hosts the driver has been marked as blacklisted. This is incorrect because the exclude list feature excludes executors only, and the driver is still active.
    ![image](https://user-images.githubusercontent.com/26694233/103732959-3494c180-4fae-11eb-9da0-2c906309ea83.png)
    After the fix, the driver won't be marked as blacklisted.
    ![image](https://user-images.githubusercontent.com/26694233/103732974-3fe7ed00-4fae-11eb-90d1-7ee44d4ed7c9.png)
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manual test. Reopen the UI and see the driver is no longer marked as blacklisted.
    
    Closes apache#31057 from baohe-zhang/SPARK-33029-3.0.
    
    Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Baohe Zhang authored and dongjoon-hyun committed Jan 6, 2021
    Configuration menu
    Copy the full SHA
    403bca4 View commit details
    Browse the repository at this point in the history
  3. [SPARK-34012][SQL][3.0] Keep behavior consistent when conf `spark.sql…

    …legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide
    
    ### What changes were proposed in this pull request?
    In apache#22696 we made HAVING without GROUP BY mean a global aggregate.
    But since we previously treated HAVING as a `Filter`, that caused a lot of analysis errors. After apache#28294 we use `UnresolvedHaving` instead of `Filter` to solve that problem, but it broke the original behavior of treating `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true`.
    This PR fixes this issue and adds a UT.
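
    A small Scala example of the behavior this backport preserves; the config key is taken from the PR title, and it is assumed here that the flag can be toggled on a local session:
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder().master("local[*]").appName("having-legacy").getOrCreate()
    spark.conf.set("spark.sql.legacy.parser.havingWithoutGroupByAsWhere", "true")
    // With the legacy flag on, HAVING without GROUP BY is treated as WHERE again:
    spark.sql("SELECT 1 FROM range(10) HAVING true").show() // 10 rows, no global aggregate
    ```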
    
    NOTE: This backport comes from apache#31039
    
    ### Why are the changes needed?
    Keep behavior consistent with the migration guide.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    added UT
    
    Closes apache#31049 from AngersZhuuuu/SPARK-34012-3.0.
    
    Authored-by: angerszhu <angers.zhu@gmail.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    AngersZhuuuu authored and maropu committed Jan 6, 2021
    Configuration menu
    Copy the full SHA
    aaa3dcc View commit details
    Browse the repository at this point in the history
  4. [SPARK-34011][SQL][3.1][3.0] Refresh cache in `ALTER TABLE .. RENAME …

    …TO PARTITION`
    
    ### What changes were proposed in this pull request?
    1. Invoke `refreshTable()` from `AlterTableRenamePartitionCommand.run()` after partitions renaming. In particular, this re-creates the cache associated with the modified table.
    2. Refresh the cache associated with tables from v2 table catalogs in the `ALTER TABLE .. RENAME TO PARTITION` command.
    
    ### Why are the changes needed?
    This fixes the issues portrayed by the example:
    ```sql
    spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0);
    spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0;
    spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1;
    spark-sql> CACHE TABLE tbl1;
    spark-sql> SELECT * FROM tbl1;
    0	0
    1	1
    spark-sql> ALTER TABLE tbl1 PARTITION (part0=0) RENAME TO PARTITION (part0=2);
    spark-sql> SELECT * FROM tbl1;
    0	0
    1	1
    ```
    The last query must not return `0	0` since `0  0` was renamed by the previous command.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes for the example above:
    ```sql
    ...
    spark-sql> ALTER TABLE tbl1 PARTITION (part0=0) RENAME TO PARTITION (part0=2);
    spark-sql> SELECT * FROM tbl1;
    0	2
    1	1
    ```
    
    ### How was this patch tested?
    By running the affected test suite:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
    ```
    
    Closes apache#31060 from MaxGekk/rename-partition-refresh-cache-3.1.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit f18d68a)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    MaxGekk authored and dongjoon-hyun committed Jan 6, 2021
    Configuration menu
    Copy the full SHA
    c9c3d6f View commit details
    Browse the repository at this point in the history

Commits on Jan 8, 2021

  1. [SPARK-33100][SQL][3.0] Ignore a semicolon inside a bracketed comment…

    … in spark-sql
    
    ### What changes were proposed in this pull request?
    Currently, spark-sql does not support parsing SQL statements that contain bracketed comments.
    For the sql statements:
    ```
    /* SELECT 'test'; */
    SELECT 'test';
    ```
    Would be split into two statements:
    The first one: `/* SELECT 'test'`
    The second one: `*/ SELECT 'test'`
    
    Then it would throw an exception because the first one is illegal.
    In this PR, we ignore the content in bracketed comments while splitting the sql statements.
    Besides, we ignore the comment without any content.
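
    A toy splitter illustrating the rule; the real spark-sql CLI code also has to handle quotes and line comments, so this is only a sketch:
    
    ```scala
    // Split on ';' only when we are not inside a bracketed /* ... */ comment.
    def splitStatements(sql: String): Seq[String] = {
      val out = collection.mutable.Buffer.empty[String]
      val cur = new StringBuilder
      var depth = 0 // nesting level of bracketed comments
      var i = 0
      while (i < sql.length) {
        if (sql.startsWith("/*", i)) { depth += 1; cur.append("/*"); i += 2 }
        else if (sql.startsWith("*/", i) && depth > 0) { depth -= 1; cur.append("*/"); i += 2 }
        else if (sql(i) == ';' && depth == 0) { out += cur.toString.trim; cur.clear(); i += 1 }
        else { cur.append(sql(i)); i += 1 }
      }
      if (cur.toString.trim.nonEmpty) out += cur.toString.trim
      out.toSeq.filter(_.nonEmpty)
    }
    
    // splitStatements("/* SELECT 'test'; */ SELECT 'test';") == Seq("/* SELECT 'test'; */ SELECT 'test'")
    ```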
    
    NOTE: This backport comes from apache#29982
    
    ### Why are the changes needed?
    Spark-sql might split statements that are inside bracketed comments, which is not correct.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Added UT.
    
    Closes apache#31033 from turboFei/SPARK-33100.
    
    Authored-by: fwang12 <fwang12@ebay.com>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    turboFei authored and maropu committed Jan 8, 2021
    Configuration menu
    Copy the full SHA
    e7d5344 View commit details
    Browse the repository at this point in the history

Commits on Jan 11, 2021

  1. [SPARK-34055][SQL][3.0] Refresh cache in ALTER TABLE .. ADD PARTITION

    ### What changes were proposed in this pull request?
    Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. ADD PARTITION`.
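
    A minimal sketch of where the refresh call sits (a simplified helper, not the actual command implementation):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    def addPartitionAndRefresh(spark: SparkSession, table: String): Unit = {
      // ... mutate the catalog (add the partition) here ...
      spark.catalog.refreshTable(table) // re-caches the table so cached plans see the new partition
    }
    ```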
    
    ### Why are the changes needed?
    This fixes the issues portrayed by the example:
    ```sql
    spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
    spark-sql> insert into tbl partition (part=0) select 0;
    spark-sql> cache table tbl;
    spark-sql> select * from tbl;
    0	0
    spark-sql> show table extended like 'tbl' partition(part=0);
    default	tbl	false	Partition Values: [part=0]
    Location: file:/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0
    ...
    ```
    Create new partition by copying the existing one:
    ```
    $ cp -r /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1
    ```
    ```sql
    spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
    spark-sql> select * from tbl;
    0	0
    ```
    
    The last query must return `0	1` since it has been added by `ALTER TABLE .. ADD PARTITION`.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes for the example above:
    ```sql
    ...
    spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
    spark-sql> select * from tbl;
    0	0
    0	1
    ```
    
    ### How was this patch tested?
    By running the affected test suite:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
    ```
    
    Closes apache#31116 from MaxGekk/add-partition-refresh-cache-2-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Jan 11, 2021
    Configuration menu
    Copy the full SHA
    471a089 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33591][SQL][3.0] Recognize null in partition spec values

    ### What changes were proposed in this pull request?
    1. Recognize `null` while parsing partition specs, and put `null` instead of `"null"` as partition values.
    2. For the V1 catalog: replace `null` with `__HIVE_DEFAULT_PARTITION__` (see the sketch after this list).
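
    A minimal sketch of the V1 value mapping (names are illustrative):
    
    ```scala
    // A parsed partition value of SQL NULL (None) is stored as Hive's default partition name.
    val HiveDefaultPartition = "__HIVE_DEFAULT_PARTITION__"
    
    def toV1PartitionValue(parsed: Option[String]): String =
      parsed.getOrElse(HiveDefaultPartition) // None <=> the user wrote PARTITION (p1 = null)
    ```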
    
    ### Why are the changes needed?
    Currently, `null` in partition specs is recognized as the `"null"` string which could lead to incorrect results, for example:
    ```sql
    spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
    spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
    spark-sql> SELECT isnull(p1) FROM tbl5;
    false
    ```
    Even though we inserted a row into the partition with the `null` value, **the resulting table doesn't contain `null`**.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, the example above works as expected:
    ```sql
    spark-sql> SELECT isnull(p1) FROM tbl5;
    true
    ```
    
    ### How was this patch tested?
    By running the affected test suites:
    ```
    $ build/sbt -Phive -Phive-thriftserver "test:testOnly *SQLQuerySuite"
    $ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 157b72a)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#31095 from MaxGekk/partition-spec-value-null-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Jan 11, 2021
    Configuration menu
    Copy the full SHA
    16cab5c View commit details
    Browse the repository at this point in the history
  3. [MINOR][3.1][3.0] Improve flaky NaiveBayes test

    ### What changes were proposed in this pull request?
    
    The current test may sometimes fail under a different BLAS library due to an absTol check, with errors like:
    ```
    Expected 0.7 and 0.6485507246376814 to be within 0.05 using absolute tolerance...
    
    ```
    
    * Change absTol to relTol: an `absTol` of 0.05 is a big difference in some cases (such as comparing 0.1 and 0.05)
    * Remove the `exp` when comparing params. The `exp` will amplify the relative error.
    
    ### Why are the changes needed?
    Flaky test
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    N/A
    
    Closes apache#31004 from WeichenXu123/improve_bayes_tests.
    
    Authored-by: Weichen Xu <weichen.xu@databricks.com>
    Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
    
    Closes apache#31123 from WeichenXu123/bp-3.1-nb-test.
    
    Authored-by: Weichen Xu <weichen.xu@databricks.com>
    Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
    (cherry picked from commit d33f0d4)
    Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
    WeichenXu123 authored and zhengruifeng committed Jan 11, 2021
    Configuration menu
    Copy the full SHA
    4cbc177 View commit details
    Browse the repository at this point in the history
  4. [SPARK-34060][SQL][3.0] Fix Hive table caching while updating stats b…

    …y `ALTER TABLE .. DROP PARTITION`
    
    ### What changes were proposed in this pull request?
    Fix canonicalization of `HiveTableRelation` by normalizing `CatalogTable`, and exclude table stats and temporary fields from the canonicalized plan.
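
    A minimal sketch of the normalization idea, using a simplified case class in place of Spark's `CatalogTable`:
    
    ```scala
    // Fields that change on every stats update are cleared before plans are compared,
    // so updating stats (e.g. after DROP PARTITION) does not change the cache key.
    case class TableMeta(name: String, sizeInBytes: Option[Long], createTime: Long)
    
    def normalizedForCanonicalization(t: TableMeta): TableMeta =
      t.copy(sizeInBytes = None, createTime = -1L) // exclude stats and temporary fields
    ```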
    
    ### Why are the changes needed?
    This fixes the issue demonstrated by the example below:
    ```scala
    scala> spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true)
    scala> sql(s"CREATE TABLE tbl (id int, part int) USING hive PARTITIONED BY (part)")
    scala> sql("INSERT INTO tbl PARTITION (part=0) SELECT 0")
    scala> sql("INSERT INTO tbl PARTITION (part=1) SELECT 1")
    scala> sql("CACHE TABLE tbl")
    scala> sql("SELECT * FROM tbl").show(false)
    +---+----+
    |id |part|
    +---+----+
    |0  |0   |
    |1  |1   |
    +---+----+
    
    scala> spark.catalog.isCached("tbl")
    scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
    scala> spark.catalog.isCached("tbl")
    res19: Boolean = false
    ```
    `ALTER TABLE .. DROP PARTITION` must keep the table in the cache.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, the drop partition command keeps the table in the cache while updating table stats:
    ```scala
    scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
    scala> spark.catalog.isCached("tbl")
    res19: Boolean = true
    ```
    
    ### How was this patch tested?
    By running new UT:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowCreateTableSuite"
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit d97e991)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#31126 from MaxGekk/fix-caching-hive-table-2-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Jan 11, 2021
    Configuration menu
    Copy the full SHA
    ecfa015 View commit details
    Browse the repository at this point in the history

Commits on Jan 12, 2021

  1. [SPARK-34059][SQL][CORE][3.0] Use for/foreach rather than map to make…

    … sure execute it eagerly
    
    ### What changes were proposed in this pull request?
    
    This is a backport of apache#31110. I ran intelliJ inspection again in this branch.
    
    This PR is basically a followup of apache#14332.
    Calling `map` alone might leave it not executed due to lazy evaluation, e.g.)
    
    ```
    scala> val foo = Seq(1,2,3)
    foo: Seq[Int] = List(1, 2, 3)
    
    scala> foo.map(println)
    1
    2
    3
    res0: Seq[Unit] = List((), (), ())
    
    scala> foo.view.map(println)
    res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...)
    
    scala> foo.view.foreach(println)
    1
    2
    3
    ```
    
    We should use `foreach` instead to make sure it's executed where the output is unused or `Unit`.
    
    ### Why are the changes needed?
    
    To prevent the potential issues by not executing `map`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, the current code does not look like it causes any problem for now.
    
    ### How was this patch tested?
    
    I found these items by running IntelliJ inspection, double-checked them one by one, and fixed them. Ideally, these should be all instances across the codebase.
    
    Closes apache#31138 from HyukjinKwon/SPARK-34059-3.0.
    
    Authored-by: HyukjinKwon <gurwls223@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    HyukjinKwon committed Jan 12, 2021
    Configuration menu
    Copy the full SHA
    27c03b6 View commit details
    Browse the repository at this point in the history
  2. [SPARK-31952][SQL][3.0] Fix incorrect memory spill metric when doing …

    …Aggregate
    
    ### What changes were proposed in this pull request?
    
    This PR takes over apache#28780.
    
    1. Counted the spilled memory size when creating the `UnsafeExternalSorter` with the existing `InMemorySorter`
    
    2. Accumulate the `totalSpillBytes` when merging two `UnsafeExternalSorter`
    
    ### Why are the changes needed?
    
    As mentioned in apache#28780:
    
    > It happens when hash aggregate downgrades to sort based aggregate.
    `UnsafeExternalSorter.createWithExistingInMemorySorter` calls spill on an `InMemorySorter` immediately, but the memory pointed by `InMemorySorter` is acquired by outside `BytesToBytesMap`, instead the allocatedPages in `UnsafeExternalSorter`. So the memory spill bytes metric is always 0, but disk bytes spill metric is right.
    
    Besides, this PR also fixes the `UnsafeExternalSorter.merge` by accumulating the `totalSpillBytes` of two sorters. Thus, we can report the correct spilled size in `HashAggregateExec.finishAggregate`.
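
    A minimal sketch of the accounting described above, using an illustrative class rather than the real `UnsafeExternalSorter`:
    
    ```scala
    // Two places where spilled bytes were previously lost:
    //  (1) when a sorter is created from an existing in-memory sorter whose pages were
    //      acquired elsewhere (e.g. by BytesToBytesMap), and
    //  (2) when two sorters are merged and the absorbed sorter's counter was dropped.
    class SpillingSorter(var totalSpillBytes: Long = 0L) {
      def countSpillFromExistingInMemorySorter(spilledBytes: Long): Unit =
        totalSpillBytes += spilledBytes
    
      def merge(other: SpillingSorter): Unit =
        totalSpillBytes += other.totalSpillBytes
    }
    ```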
    
    Issues can be reproduced by the following step by checking the SQL metrics in UI:
    
    ```
    bin/spark-shell --driver-memory 512m --executor-memory 512m --executor-cores 1 --conf "spark.default.parallelism=1"
    scala> sql("select id, count(1) from range(10000000) group by id").write.csv("/tmp/result.json")
    ```
    
    Before:
    
    <img width="200" alt="WeChatfe5146180d91015e03b9a27852e9a443" src="https://user-images.githubusercontent.com/16397174/103625414-e6fc6280-4f75-11eb-8b93-c55095bdb5b8.png">
    
    After:
    
    <img width="200" alt="WeChat42ab0e73c5fbc3b14c12ab85d232071d" src="https://user-images.githubusercontent.com/16397174/103625420-e8c62600-4f75-11eb-8e1f-6f5e8ab561b9.png">
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, users can see the correct spill metrics after this PR.
    
    ### How was this patch tested?
    
    Tested manually and added UTs.
    
    Closes apache#31140 from Ngone51/cp-spark-31952.
    
    Authored-by: yi.wu <yi.wu@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Ngone51 authored and dongjoon-hyun committed Jan 12, 2021
    Configuration menu
    Copy the full SHA
    a30d20f View commit details
    Browse the repository at this point in the history
  3. [SPARK-32691][3.0] Bump commons-crypto to v1.1.0

    The package commons-crypto 1.0.0 doesn't support the aarch64 platform, so change to v1.1.0.
    
    ### What changes were proposed in this pull request?
    Update the package commons-crypto to v1.1.0 to support aarch64 platform
    - https://issues.apache.org/jira/browse/CRYPTO-139
    
    NOTE: This backport comes from apache#30275
    
    ### Why are the changes needed?
    The commons-crypto 1.0.0 package available in the Maven repository doesn't support the aarch64 platform.
    CryptoRandomFactory.getCryptoRandom(properties).nextBytes(iv) takes a long time when NettyBlockRpcServer
    receives block data from a client; if the time exceeds the default value of 120s, an IOException is raised and the client
    retries replicating the block data to other executors. But in fact the replication is complete,
    so the replication count becomes incorrect.
    This makes DistributedSuite tests pass.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass the CIs.
    
    Closes apache#31078 from huangtianhua/origin/branch-3.0.
    
    Authored-by: huangtianhua <huangtianhua223@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    huangtianhua authored and dongjoon-hyun committed Jan 12, 2021
    Configuration menu
    Copy the full SHA
    7cfc45b View commit details
    Browse the repository at this point in the history

Commits on Jan 13, 2021

  1. [SPARK-34084][SQL][3.0] Fix auto updating of table stats in `ALTER TA…

    …BLE .. ADD PARTITION`
    
    ### What changes were proposed in this pull request?
    Fix an issue in `ALTER TABLE .. ADD PARTITION` which happens when:
    - A table doesn't have stats
    - `spark.sql.statistics.size.autoUpdate.enabled` is `true`
    
    In that case, `ALTER TABLE .. ADD PARTITION` does not update table stats automatically.
    
    ### Why are the changes needed?
    The changes fix the issue demonstrated by the example:
    ```sql
    spark-sql> create table tbl (col0 int, part int) partitioned by (part);
    spark-sql> insert into tbl partition (part = 0) select 0;
    spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true;
    spark-sql> alter table tbl add partition (part = 1);
    ```
    the `add partition` command should update table stats but it does not. There are no stats in the output of:
    ```
    spark-sql> describe table extended tbl;
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes, `ALTER TABLE .. ADD PARTITION` updates stats even when a table doesn't have stats before the command:
    ```sql
    spark-sql> alter table tbl add partition (part = 1);
    spark-sql> describe table extended tbl;
    col0	int	NULL
    part	int	NULL
    # Partition Information
    # col_name	data_type	comment
    part	int	NULL
    
    # Detailed Table Information
    ...
    Statistics	2 bytes
    ```
    
    ### How was this patch tested?
    By running new UT and existing test suites:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.StatisticsSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit 6c04795)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#31158 from MaxGekk/fix-stats-in-add-partition-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Jan 13, 2021
    Configuration menu
    Copy the full SHA
    0c4fdea View commit details
    Browse the repository at this point in the history

Commits on Jan 14, 2021

  1. [SPARK-34103][INFRA] Fix MiMaExcludes by moving SPARK-23429 from 2.4 …

    …to 3.0
    
    ### What changes were proposed in this pull request?
    
    This PR aims to fix `MiMaExcludes` rule by moving SPARK-23429 from 2.4 to 3.0.
    
    ### Why are the changes needed?
    
    SPARK-23429 was added at Apache Spark 3.0.0.
    This should land on `master` and `branch-3.1` and `branch-3.0`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the MiMa rule.
    
    Closes apache#31174 from dongjoon-hyun/SPARK-34103.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 9e93fdb)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Jan 14, 2021
    Configuration menu
    Copy the full SHA
    dbc18d6 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33557][CORE][MESOS][3.0] Ensure the relationship between STORA…

    …GE_BLOCKMANAGER_HEARTBEAT_TIMEOUT and NETWORK_TIMEOUT
    
    ### What changes were proposed in this pull request?
    As described in SPARK-33557, `HeartbeatReceiver` and `MesosCoarseGrainedSchedulerBackend` will always use `Network.NETWORK_TIMEOUT.defaultValueString` as the value of `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` when `NETWORK_TIMEOUT` is configured but `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` is not, which is different from the relationship described in `configuration.md`.
    
    To fix this problem, the main changes of this PR are as follows (a sketch follows the list):
    
    - Remove the explicit default value of `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT`
    
    - Use the actual value of `NETWORK_TIMEOUT` as `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` when `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` is not configured, in `HeartbeatReceiver` and `MesosCoarseGrainedSchedulerBackend`
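
    A minimal sketch of the fallback rule; the heartbeat config key name below is an assumption for illustration, while `spark.network.timeout` and its `120s` default are real:
    
    ```scala
    import org.apache.spark.SparkConf
    
    // If the heartbeat timeout is not set explicitly, fall back to the *actual* network
    // timeout rather than the network timeout's default string.
    def blockManagerHeartbeatTimeoutMs(conf: SparkConf): Long =
      conf.getTimeAsMs(
        "spark.storage.blockManagerHeartbeatTimeoutMs", // assumed key name
        conf.get("spark.network.timeout", "120s"))
    ```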
    
    ### Why are the changes needed?
    To ensure the relationship between `NETWORK_TIMEOUT` and  `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` as we described in `configuration.md`
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    
    - Pass the Jenkins or GitHub Action
    
    - Manually tested by configuring `NETWORK_TIMEOUT` and `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` locally
    
    Closes apache#31175 from dongjoon-hyun/SPARK-33557.
    
    Authored-by: yangjie01 <yangjie01@baidu.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    LuciferYang authored and HyukjinKwon committed Jan 14, 2021
    Configuration menu
    Copy the full SHA
    fcd10a6 View commit details
    Browse the repository at this point in the history

Commits on Jan 15, 2021

  1. [SPARK-34118][CORE][SQL][3.0] Replaces filter and check for emptiness…

    … with exists or forall
    
    ### What changes were proposed in this pull request?
    This PR uses `exists` or `forall` to simplify `filter` + emptiness checks; the result is semantically equivalent but simpler. The rules are as follows (a quick sanity check follows the list):
    
    - `seq.filter(p).size == 0` -> `!seq.exists(p)`
    - `seq.filter(p).length > 0` -> `seq.exists(p)`
    - `seq.filterNot(p).isEmpty` -> `seq.forall(p)`
    - `seq.filterNot(p).nonEmpty` -> `!seq.forall(p)`
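
    A quick sanity check of the rewrite rules on a sample predicate:
    
    ```scala
    val seq = Seq(1, 2, 3)
    val p = (n: Int) => n > 2
    
    assert((seq.filter(p).size == 0) == !seq.exists(p))
    assert((seq.filter(p).length > 0) == seq.exists(p))
    assert(seq.filterNot(p).isEmpty == seq.forall(p))
    assert(seq.filterNot(p).nonEmpty == !seq.forall(p))
    ```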
    
    ### Why are the changes needed?
    Code simplification.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Pass the Jenkins or GitHub Action
    
    Closes apache#31190 from LuciferYang/SPARK-34118-30.
    
    Authored-by: yangjie01 <yangjie01@baidu.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    LuciferYang authored and HyukjinKwon committed Jan 15, 2021
    Configuration menu
    Copy the full SHA
    dc1816d View commit details
    Browse the repository at this point in the history
  2. [SPARK-33790][CORE][3.0] Reduce the rpc call of getFileStatus in Sing…

    …leFileEventLogFileReader
    
    ### What changes were proposed in this pull request?
    `FsHistoryProvider#checkForLogs` already has a `FileStatus` when constructing `SingleFileEventLogFileReader`, so there is no need to fetch the `FileStatus` again in `SingleFileEventLogFileReader#fileSizeForLastIndex`.
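
    A minimal sketch of the reuse pattern with an illustrative reader class (not the actual `SingleFileEventLogFileReader`):
    
    ```scala
    import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
    
    // Reuse the FileStatus obtained while listing the log directory; only fall back to
    // an extra getFileStatus RPC when no status was passed in.
    class EventLogReader(fs: FileSystem, path: Path, listedStatus: Option[FileStatus]) {
      def fileSize: Long = listedStatus.getOrElse(fs.getFileStatus(path)).getLen
    }
    ```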
    
    ### Why are the changes needed?
    This can eliminate a lot of RPC calls and improve the speed of the history server.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing unit tests.
    
    Closes apache#31187 from HeartSaVioR/SPARK-33790-branch-3.0.
    
    Lead-authored-by: sychen <sychen@ctrip.com>
    Co-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
    Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
    cxzl25 and HeartSaVioR committed Jan 15, 2021
    Configuration menu
    Copy the full SHA
    d81f482 View commit details
    Browse the repository at this point in the history
  3. [SPARK-32598][SCHEDULER] Fix missing driver logs under UI App-Executo…

    …rs tab in standalone cluster mode
    
    ### What changes were proposed in this pull request?
    Fix [SPARK-32598] (missing driver logs under the UI Application Details > Executors tab in standalone cluster mode).
    
    The direct bug is: the original author forgot to implement `getDriverLogUrls` in `StandaloneSchedulerBackend`
    
    https://github.com/apache/spark/blob/1de272f98d0ff22d0dd151797f22b8faf310963a/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala#L70-L75
    
    So we set the driver log URLs as environment variables in `DriverRunner` and retrieve them in `StandaloneSchedulerBackend`.
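
    A minimal sketch of the retrieval side; the environment variable names below are hypothetical placeholders, not the ones defined by the fix:
    
    ```scala
    // The driver runner publishes its log locations via the environment; the standalone
    // scheduler backend turns them into the map returned by getDriverLogUrls.
    def driverLogUrlsFromEnv(env: Map[String, String]): Option[Map[String, String]] =
      for {
        out <- env.get("SPARK_DRIVER_LOG_URL_STDOUT") // hypothetical variable name
        err <- env.get("SPARK_DRIVER_LOG_URL_STDERR") // hypothetical variable name
      } yield Map("stdout" -> out, "stderr" -> err)
    ```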
    
    ### Why are the changes needed?
    Fix bug  [SPARK-32598].
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. Users will now see driver logs (standalone cluster mode) under the UI Application Details > Executors tab.
    
    Before:
    ![image](https://user-images.githubusercontent.com/17903517/93901055-b5de8600-fd28-11ea-879a-d97e6f70cc6e.png)
    
    After:
    ![image](https://user-images.githubusercontent.com/17903517/93901080-baa33a00-fd28-11ea-8895-3787c5efbf88.png)
    
    ### How was this patch tested?
    Re-checked the real case in [SPARK-32598] and confirmed this user-facing bug is fixed.
    
    Closes apache#29644 from KevinSmile/kw-dev-master.
    
    Authored-by: KevinSmile <kevinwang013@hotmail.com>
    Signed-off-by: Sean Owen <srowen@gmail.com>
    (cherry picked from commit c75c29d)
    Signed-off-by: Sean Owen <srowen@gmail.com>
    KevinSmile authored and srowen committed Jan 15, 2021
    Configuration menu
    Copy the full SHA
    70fa108 View commit details
    Browse the repository at this point in the history
  4. [SPARK-33711][K8S][3.0] Avoid race condition between POD lifecycle ma…

    …nager and scheduler backend
    
    ### What changes were proposed in this pull request?
    
    Missing POD detection is extended with a timestamp (and time limit) based check to avoid wrongly flagging PODs as missing.
    
    The two new timestamps:
    - `fullSnapshotTs` is introduced for the `ExecutorPodsSnapshot` which only updated by the pod polling snapshot source
    - `registrationTs` is introduced for the `ExecutorData` and it is initialized at the executor registration at the scheduler backend
    
    Moreover a new config `spark.kubernetes.executor.missingPodDetectDelta` is used to specify the accepted delta between the two.
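
    A minimal sketch of the time-based guard; the timestamp and config names come from the description above, but the logic is illustrative rather than the actual lifecycle manager code:
    
    ```scala
    final case class ExecutorRegistration(registrationTs: Long)
    
    // Only trust "missing from snapshot" when the snapshot is a full one taken sufficiently
    // after the executor registered; otherwise registration may simply have raced the poll.
    def canBeDeclaredMissing(fullSnapshotTs: Long,
                             exec: ExecutorRegistration,
                             missingPodDetectDeltaMs: Long): Boolean =
      fullSnapshotTs - exec.registrationTs > missingPodDetectDeltaMs
    ```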
    
    ### Why are the changes needed?
    
    Watching a POD (`ExecutorPodsWatchSnapshotSource`) only informs about single POD changes. This could wrongly lead the executor POD lifecycle manager to detect missing PODs (PODs known by the scheduler backend but missing from POD snapshots).
    
    A key indicator of this error is seeing this log message:
    
    > "The executor with ID [some_id] was not found in the cluster but we didn't get a reason why. Marking the executor as failed. The executor may have been deleted but the driver missed the deletion event."
    
    So one of the problems is running the missing POD detection check even when only a single POD has changed, without a full, consistent snapshot of all the PODs (see `ExecutorPodsPollingSnapshotSource`).
    The other problem is the race between the executor POD lifecycle manager and the scheduler backend: even with a full snapshot, registration at the scheduler backend could precede the snapshot polling (and the processing of those polled snapshots).
    
    ### Does this PR introduce any user-facing change?
    
    Yes. When a POD is missing, the reason message explaining the executor's exit is extended with both timestamps (the polling time and the executor registration time), and the new config is mentioned as well.
    
    ### How was this patch tested?
    
    The existing unit tests are extended.
    
    (cherry picked from commit 6bd7a62)
    
    Closes apache#31195 from attilapiros/SPARK-33711-branch-3.0.
    
    Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    attilapiros authored and dongjoon-hyun committed Jan 15, 2021
    Configuration menu
    Copy the full SHA
    f7591e5 View commit details
    Browse the repository at this point in the history

Commits on Jan 16, 2021

  1. [SPARK-34060][SQL][FOLLOWUP] Preserve serializability of canonicalize…

    …d CatalogTable
    
    ### What changes were proposed in this pull request?
    Replace `toMap` with `map(identity).toMap` while getting the canonicalized representation of `CatalogTable`. `CatalogTable` became non-serializable after apache#31112 due to the usage of `filterKeys`. The workaround was taken from scala/bug#7005.
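
    A small plain-Scala demonstration of the underlying pitfall (not Spark code):
    
    ```scala
    import java.io.{ByteArrayOutputStream, ObjectOutputStream}
    
    // filterKeys returns a lazy, non-serializable view (MapLike$$anon$1 in Scala 2.12);
    // map(identity).toMap materializes it into a plain serializable immutable.Map.
    val view = Map("k1" -> "v1", "k2" -> "v2").filterKeys(_ != "k2")
    val safe = view.map(identity).toMap
    
    val oos = new ObjectOutputStream(new ByteArrayOutputStream())
    oos.writeObject(safe) // fine; writing `view` itself may throw NotSerializableException
    ```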
    
    ### Why are the changes needed?
    This prevents the errors like:
    ```
    [info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
    [info]   Cause: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Should not.
    
    ### How was this patch tested?
    By running the test suite affected by apache#31112:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
    ```
    
    Closes apache#31197 from MaxGekk/fix-caching-hive-table-2-followup.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit c3d81fb)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    MaxGekk authored and dongjoon-hyun committed Jan 16, 2021
    Configuration menu
    Copy the full SHA
    1ab0f02 View commit details
    Browse the repository at this point in the history
  2. [MINOR][DOCS] Update Parquet website link

    ### What changes were proposed in this pull request?
    This PR aims to update the Parquet website link from http://parquet.io to https://parquet.apache.org
    
    ### Why are the changes needed?
    The old website goes to the incubator site.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    N/A
    
    Closes apache#31208 from williamhyun/minor-parquet.
    
    Authored-by: William Hyun <williamhyun3@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit 1cf09b7)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    williamhyun authored and dongjoon-hyun committed Jan 16, 2021
    Configuration menu
    Copy the full SHA
    403a4ac View commit details
    Browse the repository at this point in the history

Commits on Jan 18, 2021

  1. [MINOR][DOCS] Fix typos in sql-ref-datatypes.md

    ### What changes were proposed in this pull request?
    Fixing typos in the docs sql-ref-datatypes.md.
    
    ### Why are the changes needed?
    To display '<element_type>' correctly.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manually run jekyll.
    
    before this fix
    ![image](https://user-images.githubusercontent.com/2217224/104865408-3df33600-597f-11eb-857b-c6223ff9159a.png)
    
    after this fix
    ![image](https://user-images.githubusercontent.com/2217224/104865458-62e7a900-597f-11eb-8a21-6d838eecaaf2.png)
    
    Closes apache#31221 from kariya-mitsuru/fix-typo.
    
    Authored-by: Mitsuru Kariya <Mitsuru.Kariya@oss.nttdata.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 536a725)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    kariya-mitsuru authored and HyukjinKwon committed Jan 18, 2021
    Configuration menu
    Copy the full SHA
    d8ce224 View commit details
    Browse the repository at this point in the history
  2. [SPARK-33819][CORE][FOLLOWUP][3.0] Restore the constructor of SingleF…

    …ileEventLogFileReader to remove Mima exclusion
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to remove Mima exclusion via restoring the old constructor of SingleFileEventLogFileReader. This partially adopts the remaining parts of apache#30814 which was excluded while porting back.
    
    ### Why are the changes needed?
    
    To remove unnecessary Mima exclusion.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass CIs.
    
    Closes apache#31225 from HeartSaVioR/SPARK-33819-followup-branch-3.0.
    
    Authored-by: Dongjoon Hyun <dongjoon@apache.org>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    dongjoon-hyun authored and HyukjinKwon committed Jan 18, 2021
    Configuration menu
    Copy the full SHA
    70c0bc9 View commit details
    Browse the repository at this point in the history

Commits on Jan 19, 2021

  1. [SPARK-34027][SQL][3.0] Refresh cache in `ALTER TABLE .. RECOVER PART…

    …ITIONS`
    
    ### What changes were proposed in this pull request?
    Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. RECOVER PARTITIONS`.
    
    ### Why are the changes needed?
    This fixes the issues portrayed by the example:
    ```sql
    spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
    spark-sql> insert into tbl partition (part=0) select 0;
    spark-sql> cache table tbl;
    spark-sql> select * from tbl;
    0	0
    spark-sql> show table extended like 'tbl' partition(part=0);
    default	tbl	false	Partition Values: [part=0]
    Location: file:/Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0
    ...
    ```
    Create new partition by copying the existing one:
    ```
    $ cp -r /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=1
    ```
    ```sql
    spark-sql> alter table tbl recover partitions;
    spark-sql> select * from tbl;
    0	0
    ```
    
    The last query must return `0	1` since it has been recovered by `ALTER TABLE .. RECOVER PARTITIONS`.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. After the changes for the example above:
    ```sql
    ...
    spark-sql> alter table tbl recover partitions;
    spark-sql> select * from tbl;
    0	0
    0	1
    ```
    
    ### How was this patch tested?
    By running the affected test suite:
    ```
    $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
    $ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveSchemaInferenceSuite"
    ```
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit dee596e)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#31236 from MaxGekk/recover-partitions-refresh-cache-3.0.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    MaxGekk authored and cloud-fan committed Jan 19, 2021
    Configuration menu
    Copy the full SHA
    f705b65 View commit details
    Browse the repository at this point in the history
  2. [SPARK-34153][SQL][3.1][3.0] Remove unused getRawTable() from `Hive…

    …ExternalCatalog.alterPartitions()`
    
    Remove unused call of `getRawTable()` from `HiveExternalCatalog.alterPartitions()`.
    
    It reduces the number of calls to Hive External catalog.
    
    No
    
    By existing test suites.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit bea10a6)
    Signed-off-by: Max Gekk <max.gekk@gmail.com>
    
    Closes apache#31241 from MaxGekk/remove-getRawTable-from-alterPartitions-3.1.
    
    Authored-by: Max Gekk <max.gekk@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit 246ff31)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    MaxGekk authored and HyukjinKwon committed Jan 19, 2021
    Configuration menu
    Copy the full SHA
    67b9f6c View commit details
    Browse the repository at this point in the history

Commits on Jan 20, 2021

  1. [MINOR][ML] Increase the timeout for StreamingLinearRegressionSuite t…

    …o 60s
    
    ### What changes were proposed in this pull request?
    
    Increase the timeout for StreamingLinearRegressionSuite to 60s to deflake the test.
    
    ### Why are the changes needed?
    
    Reduce merge conflict.
    
    ### Does this PR introduce _any_ user-facing change?
    
    ### How was this patch tested?
    
    Closes apache#31248 from liangz1/increase-timeout.
    
    Authored-by: Liang Zhang <liang.zhang@databricks.com>
    Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
    (cherry picked from commit f7ff7ff)
    Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
    liangz1 authored and WeichenXu123 committed Jan 20, 2021
    Configuration menu
    Copy the full SHA
    5a93bcb View commit details
    Browse the repository at this point in the history
  2. [SPARK-34115][CORE] Check SPARK_TESTING as lazy val to avoid slowdown

    ### What changes were proposed in this pull request?
    Check SPARK_TESTING as lazy val to avoid slow down when there are many environment variables
    
    ### Why are the changes needed?
    If there are many environment variables, sys.env is very slow. As Utils.isTesting is called very often during DataFrame optimization, this can slow down evaluation considerably.
    
    An example for triggering the problem can be found in the bug ticket https://issues.apache.org/jira/browse/SPARK-34115
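
    A minimal sketch of the shape of the change (an illustrative object; the exact checks inside `Utils.isTesting` are assumed):
    
    ```scala
    object TestingFlag {
      // Read the environment once and cache the result instead of scanning sys.env on every call.
      private lazy val cached: Boolean =
        sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")
    
      def isTesting: Boolean = cached
    }
    ```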
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    With the example provided in the ticket.
    
    Closes apache#31244 from nob13/bug/34115.
    
    Lead-authored-by: Norbert Schultz <norbert.schultz@reactivecore.de>
    Co-authored-by: Norbert Schultz <noschultz@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit c3d8352)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    2 people authored and HyukjinKwon committed Jan 20, 2021
    Configuration menu
    Copy the full SHA
    b5b1da9 View commit details
    Browse the repository at this point in the history
  3. [SPARK-34178][SQL] Copy tags for the new node created by MultiInstanc…

    …eRelation.newInstance
    
    ### What changes were proposed in this pull request?
    
    Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`.
    
    ### Why are the changes needed?
    
    ```scala
    val df = spark.range(2)
    df.join(df, df("id") <=> df("id")).show()
    ```
    
    For this query, it's supposed to be non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition:
    
    https://github.com/apache/spark/blob/537a49fc0966b0b289b67ac9c6ea20093165b0da/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala#L125
    
    However, `DetectAmbiguousSelfJoin` cannot apply this detection because the right-side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tag info for the copied node.
    
    Fortunately, the query is still considered a non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left-side plan and the reference is the same as the left-side plan. However, this is not the expected behavior, only a coincidence.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Updated a unit test
    
    Closes apache#31260 from Ngone51/fix-missing-tags.
    
    Authored-by: yi.wu <yi.wu@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    (cherry picked from commit f498977)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    Ngone51 authored and cloud-fan committed Jan 20, 2021
    Configuration menu
    Copy the full SHA
    89443ab View commit details
    Browse the repository at this point in the history

Commits on Jan 21, 2021

  1. [MINOR][TESTS] Increase tolerance to 0.2 for NaiveBayesSuite

    ### What changes were proposed in this pull request?
    This test fails flakily. I found it failing in 1 out of 80 runs.
    ```
      Expected -0.35667494393873245 and -0.41914521201224453 to be within 0.15 using relative tolerance.
    ```
    Increasing the relative tolerance to 0.2 should reduce flakiness.
    ```
    0.2 * 0.35667494393873245 = 0.071 > 0.062 = |-0.35667494393873245 - (-0.41914521201224453)|
    ```
    
    ### Why are the changes needed?
    
    ### Does this PR introduce _any_ user-facing change?
    
    ### How was this patch tested?
    
    Closes apache#31266 from Loquats/NaiveBayesSuite-reltol.
    
    Authored-by: Andy Zhang <yue.zhang@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit c8c70d5)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    (cherry picked from commit dad201e)
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    Loquats authored and dongjoon-hyun committed Jan 21, 2021
    Configuration menu
    Copy the full SHA
    4690063 View commit details
    Browse the repository at this point in the history
  2. Revert "[SPARK-34178][SQL] Copy tags for the new node created by Mult…

    …iInstanceRelation.newInstance"
    
    This reverts commit 89443ab.
    HyukjinKwon committed Jan 21, 2021
    Configuration menu
    Copy the full SHA
    4e80f8c View commit details
    Browse the repository at this point in the history
  3. [SPARK-34181][DOC] Update Prerequisites for build doc of ruby 3.0 issue

    ### What changes were proposed in this pull request?
    When the Ruby version is 3.0, the Jekyll server fails with
    ```
    yi.zhu$ SKIP_API=1 jekyll serve --watch
    Configuration file: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs/_config.yml
                Source: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs
           Destination: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs/_site
     Incremental build: disabled. Enable with --incremental
          Generating...
                        done in 5.085 seconds.
     Auto-regeneration: enabled for '/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs'
                        ------------------------------------------------
          Jekyll 4.2.0   Please append `--trace` to the `serve` command
                         for any additional information or backtrace.
                        ------------------------------------------------
    <internal:/usr/local/Cellar/ruby/3.0.0_1/lib/ruby/3.0.0/rubygems/core_ext/kernel_require.rb>:85:in `require': cannot load such file -- webrick (LoadError)
    	from <internal:/usr/local/Cellar/ruby/3.0.0_1/lib/ruby/3.0.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve/servlet.rb:3:in `<top (required)>'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:179:in `require_relative'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:179:in `setup'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:100:in `process'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `block in process_with_graceful_fail'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `each'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `process_with_graceful_fail'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:86:in `block (2 levels) in init_with_program'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `block in execute'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `each'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `execute'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/program.rb:44:in `go'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary.rb:21:in `program'
    	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/exe/jekyll:15:in `<top (required)>'
    	from /usr/local/bin/jekyll:23:in `load'
    	from /usr/local/bin/jekyll:23:in `<main>'
    ```
    
    This issue is solved in jekyll/jekyll#8523
    
    ### Why are the changes needed?
    Fix build issue
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Not needed.
    
    Closes apache#31263 from AngersZhuuuu/SPARK-34181.
    
    Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
    Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    (cherry picked from commit faa4f0c)
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    AngersZhuuuu authored and HyukjinKwon committed Jan 21, 2021
    Configuration menu
    Copy the full SHA
    785998b View commit details
    Browse the repository at this point in the history

Commits on Jan 22, 2021

  1. [SPARK-33813][SQL] Fix the issue that JDBC source can't treat MS SQL …

    …Server's spatial types
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the issue that reading tables which contain spatial data types from MS SQL Server fails.
    MS SQL Server supports two non-standard spatial JDBC types, `geometry` and `geography`, but Spark SQL can't handle them:
    
    ```
    java.sql.SQLException: Unrecognized SQL type -157
     at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
     at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
     at scala.Option.getOrElse(Option.scala:189)
     at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
     at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
     at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
     at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
     at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
     at scala.Option.getOrElse(Option.scala:189)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
     at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381)
    ```
    
    Considering what the [data type mapping](https://docs.microsoft.com/ja-jp/sql/connect/jdbc/using-basic-data-types?view=sql-server-ver15) says, I think those spatial types can be mapped to Catalyst's `BinaryType`.
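
    A hedged sketch of the same mapping expressed as a user-registered `JdbcDialect`; the built-in fix presumably lives in Spark's MS SQL Server dialect, so this is only an illustration of mapping the spatial type names to `BinaryType`:
    
    ```scala
    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
    import org.apache.spark.sql.types.{BinaryType, DataType, MetadataBuilder}
    
    object MsSqlServerSpatialMapping extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")
    
      // Map the non-standard spatial types to Catalyst's BinaryType; defer everything else.
      override def getCatalystType(
          sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
        if (typeName == "geometry" || typeName == "geography") Some(BinaryType) else None
    }
    
    // JdbcDialects.registerDialect(MsSqlServerSpatialMapping) // register before reading via JDBC
    ```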
    
    ### Why are the changes needed?
    
    To provide better support.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. MS SQL Server users can use `geometry` and `geography` types in datasource tables.
    
    ### How was this patch tested?
    
    New test case added to `MsSqlServerIntegrationSuite`.
    
    Closes apache#31283 from sarutak/mssql-spatial-types.
    
    Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    sarutak committed Jan 22, 2021
    Configuration menu
    Copy the full SHA
    c59a423 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    c36fe77 View commit details
    Browse the repository at this point in the history