
[SPARK-33813][SQL][3.0] Fix the issue that JDBC source can't treat MS SQL Server's spatial types #31289

Closed
wants to merge 1,280 commits into from

Conversation


@sarutak sarutak commented Jan 22, 2021

What changes were proposed in this pull request?

This PR fixes the issue that reading tables which contain spatial datatypes from MS SQL Server fails.
MS SQL Server supports two non-standard spatial JDBC types, geometry and geography, but Spark SQL can't handle them:

java.sql.SQLException: Unrecognized SQL type -157
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
 at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
 at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
 at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
 at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381)

Considering what the data type mapping says, I think those spatial types can be mapped to Catalyst's BinaryType.
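
For reference, here is a minimal usage sketch of reading such a table through the JDBC source once the mapping is in place, assuming a `spark` session (e.g. in spark-shell). The connection URL, credentials, and table name below are placeholders, not part of this PR:

```scala
// Hypothetical connection details; with the fix, geometry/geography columns
// simply surface as BinaryType in the resulting schema.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://localhost:1433;databaseName=testdb")
  .option("dbtable", "dbo.spatial_table")
  .option("user", "sa")
  .option("password", "<password>")
  .load()
df.printSchema()  // spatial columns are reported as binary
```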

Why are the changes needed?

To provide better support.

Does this PR introduce any user-facing change?

Yes. MS SQL Server users can use geometry and geography types in datasource tables.

How was this patch tested?

New test case added to MsSqlServerIntegrationSuite.

cxzl25 and others added 30 commits September 8, 2020 11:54
…read-safe

### What changes were proposed in this pull request?
Before SPARK-31511 is fixed, `BytesToBytesMap` iterator() is not thread-safe and may cause data inaccuracy.
We need to add a unit test.

### Why are the changes needed?
Increase test coverage to ensure that iterator() is thread-safe.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added a unit test.

Closes apache#29669 from cxzl25/SPARK-31511-test.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit bd3dc2f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This PR backports apache#29593 to branch-3.0

### What changes were proposed in this pull request?
Only copy tags to nodes that have no tags when transforming plans.

### Why are the changes needed?
cloud-fan [made a good point](apache#29593 (comment)) that it doesn't make sense to append tags to existing nodes when nodes are removed. That will cause such bugs as duplicate rows when deduplicating and repartitioning by the same column with AQE.

```
spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
val df = spark.sql("select id from v1 group by id distribute by id")
println(df.collect().toArray.mkString(","))
println(df.queryExecution.executedPlan)

// With AQE
[4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
AdaptiveSparkPlan(isFinalPlan=true)
+- CustomShuffleReader local
   +- ShuffleQueryStage 0
      +- Exchange hashpartitioning(id#183L, 10), true
         +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
            +- Union
               :- *(1) Range (0, 10, step=1, splits=2)
               +- *(2) Range (0, 10, step=1, splits=2)

// Without AQE
[4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
*(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
+- Exchange hashpartitioning(id#206L, 10), true
   +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
      +- Union
         :- *(1) Range (0, 10, step=1, splits=2)
         +- *(2) Range (0, 10, step=1, splits=2)
```

It's too expensive to detect node removal, so we compromise by only copying tags to nodes that have no tags.
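
A minimal sketch of that compromise (toy types; the real logic lives in Spark's TreeNode):

```scala
import scala.collection.mutable

// Toy node: copy tags from `other` only when this node has none of its own, so
// tags attached to removed nodes don't pile up on or overwrite their replacements.
class ToyNode {
  val tags: mutable.Map[String, Any] = mutable.Map.empty

  def copyTagsFrom(other: ToyNode): Unit = {
    if (tags.isEmpty) {
      tags ++= other.tags
    }
  }
}
```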

### Does this PR introduce any user-facing change?
Yes. Fix a bug.

### How was this patch tested?
Add test.

Closes apache#29665 from manuzhang/spark-32753-3.0.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…aths with glob metacharacters

### What changes were proposed in this pull request?
In the PR, I propose to fix an issue with LibSVM datasource when both of the following are true:
* no user specified schema
* some file paths contain escaped glob metacharacters, such as `[``]`, `{``}`, `*` etc.

The fix is a backport of apache#29670, and it is based on another bug fix for CSV/JSON datasources apache#29659.

### Why are the changes needed?
To fix the issue when the following query tries to read from the path `[abc]`:
```scala
spark.read.format("libsvm").load("""/tmp/\[abc\].csv""").show
```
but would end up hitting an exception:
```
Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
org.apache.spark.sql.AnalysisException: Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:770)
	at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:373)
	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
	at scala.util.Success.$anonfun$map$1(Try.scala:255)
	at scala.util.Success.map(Try.scala:213)
```

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added UT to `LibSVMRelationSuite`.

Closes apache#29675 from MaxGekk/globbing-paths-when-inferring-schema-ml-3.0.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…denSetOperationTypes

### What changes were proposed in this pull request?

This PR intends to fix a bug where references can be missing when adding aliases to widen data types in `WidenSetOperationTypes`. For example,
```
CREATE OR REPLACE TEMPORARY VIEW t3 AS VALUES (decimal(1)) tbl(v);
SELECT t.v FROM (
  SELECT v FROM t3
  UNION ALL
  SELECT v + v AS v FROM t3
) t;

org.apache.spark.sql.AnalysisException: Resolved attribute(s) v#1 missing from v#3 in operator !Project [v#1]. Attribute(s) with the same name appear in the operation: v. Please check if the right attribute(s) are used.;;
!Project [v#1]  <------ the reference got missing
+- SubqueryAlias t
   +- Union
      :- Project [cast(v#1 as decimal(11,0)) AS v#3]
      :  +- Project [v#1]
      :     +- SubqueryAlias t3
      :        +- SubqueryAlias tbl
      :           +- LocalRelation [v#1]
      +- Project [v#2]
         +- Project [CheckOverflow((promote_precision(cast(v#1 as decimal(11,0))) + promote_precision(cast(v#1 as decimal(11,0)))), DecimalType(11,0), true) AS v#2]
            +- SubqueryAlias t3
               +- SubqueryAlias tbl
                  +- LocalRelation [v#1]
```
In this case, `WidenSetOperationTypes` added the alias `cast(v#1 as decimal(11,0)) AS v#3`, and then the reference in the top `Project` went missing. This PR corrects the reference (`exprId` and widened `dataType`) after adding aliases in the rule.

This backport for 3.0 comes from apache#29485 and apache#29643

### Why are the changes needed?

bugfixes

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests

Closes apache#29680 from maropu/SPARK-32638-BRANCH3.0.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…he .amount in a resource config

### What changes were proposed in this pull request?

If the user forgets to specify .amount on a resource config like spark.executor.resource.gpu, the error message thrown is very confusing:

```
ERROR SparkContext: Error initializing SparkContext.
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
	at java.lang.String.substring(String.java:1967)
	at
```

This change makes Spark throw a readable error instead.
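
For reference, a sketch of the expected configuration shape; the values and the discovery-script path are placeholders:

```scala
import org.apache.spark.SparkConf

// The ".amount" suffix is required. A key like "spark.executor.resource.gpu"
// without it is what used to trigger the StringIndexOutOfBoundsException above.
val conf = new SparkConf()
  .set("spark.executor.resource.gpu.amount", "1")
  .set("spark.executor.resource.gpu.discoveryScript", "/path/to/getGpusResources.sh")
```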

### Why are the changes needed?

confusing error for users
### Does this PR introduce _any_ user-facing change?

just error message

### How was this patch tested?

Tested manually on standalone cluster

Closes apache#29685 from tgravescs/SPARK-32824.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit e8634d8)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

Fixes the master UI for properly summing the resources total across multiple workers.
field:
Resources in use: 0 / 8 gpu

The bug here is that it was creating `MutableResourceInfo` and then reducing with the `+` operator. The `+` operator in `MutableResourceInfo` simply adds the addresses from one to the addresses of the other. But it uses a `HashSet`, so if the addresses are the same you lose the correct amount; i.e., if worker1 has GPU addresses 0,1,2,3 and worker2 has addresses 0,1,2,3, then you only see 4 total GPUs when there are 8.

In this case we don't really need to create the `MutableResourceInfo` at all, because we just want the sums for used and total, so its use is removed. The other uses of it are per worker, so those should be fine.
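
A minimal sketch of why merging the address sets undercounts, with plain Scala sets standing in for the resource info:

```scala
// Two workers each expose GPUs addressed 0-3; a set union collapses the
// duplicate addresses and reports 4 instead of 8.
val worker1Gpus = Set("0", "1", "2", "3")
val worker2Gpus = Set("0", "1", "2", "3")
println((worker1Gpus ++ worker2Gpus).size)      // 4 -- the buggy total
println(worker1Gpus.size + worker2Gpus.size)    // 8 -- summing per-worker counts is correct
```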

### Why are the changes needed?

fix UI

### Does this PR introduce _any_ user-facing change?

UI

### How was this patch tested?

Tested manually on a standalone cluster with multiple workers, multiple GPUs, and multiple FPGAs.

Closes apache#29683 from tgravescs/SPARK-32823.

Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 514bf56)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…der if no active SparkSession

### What changes were proposed in this pull request?

If no active SparkSession is available, let `FileSourceScanExec.needsUnsafeRowConversion` look at the default SQL config for the Parquet vectorized reader instead of failing the query execution.
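
A sketch of the fallback idea (simplified; not the actual `FileSourceScanExec` change):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

// Prefer the active session's setting; fall back to the default SQLConf instead
// of throwing when no active session exists.
def parquetVectorizedReaderEnabled: Boolean =
  SparkSession.getActiveSession
    .map(_.conf.get("spark.sql.parquet.enableVectorizedReader", "true").toBoolean)
    .getOrElse(SQLConf.get.parquetVectorizedReaderEnabled)
```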

### Why are the changes needed?

Fix a bug that if no active SparkSession is available, file-based data source scan for Parquet Source will throw exception.

### Does this PR introduce _any_ user-facing change?

Yes, this change fixes the bug.

### How was this patch tested?

Unit test.

Closes apache#29667 from viirya/SPARK-32813.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit de0dc52)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…CSV datasources v1 and v2

### What changes were proposed in this pull request?
In the PR, I propose to move the test `SPARK-32810: CSV and JSON data sources should be able to read files with escaped glob metacharacter in the paths` from `DataFrameReaderWriterSuite` to `CSVSuite` and to `JsonSuite`. This will allow running the same test in `CSVv1Suite`/`CSVv2Suite` and in `JsonV1Suite`/`JsonV2Suite`.

### Why are the changes needed?
To improve test coverage by checking JSON/CSV datasources v1 and v2.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running affected test suites:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.csv.*"
$ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.json.*"
```

Closes apache#29690 from MaxGekk/globbing-paths-when-inferring-schema-dsv2-3.0.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…with some stateful queries + no-data-batches + V1 sources

### What changes were proposed in this pull request?

Make MicroBatchExecution explicitly call `getBatch` when the start and end offsets are the same.

### Why are the changes needed?

Structured Streaming micro-batch engine has the contract with V1 data sources that, after a restart, it will call `source.getBatch()` on the last batch attempted before the restart. However, a very rare combination of sequences violates this contract. It occurs only when
- The streaming query has specific types of stateful operations with watermarks (e.g., aggregation in append, mapGroupsWithState with timeouts).
    - These queries can execute a batch even without new data when the previous batch updates the watermark and the stateful ops are such that the new watermark can cause new output/cleanup. Such batches are called no-data-batches.
- The last batch before termination was an incomplete no-data-batch. Upon restart, the micro-batch engine fails to call `source.getBatch` when attempting to re-execute the incomplete no-data-batch.

This occurs because no-data-batches have the same start and end offsets, and when a batch is executed, if the start and end offsets are the same then calling `source.getBatch` is skipped as it is assumed the generated plan will be empty. This only affects V1 data sources, which rely on this invariant to detect in the source whether the query is being started from scratch or restarted.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New unit test with a mock v1 source that fails without the fix.

Closes apache#29696 from tdas/SPARK-32794-3.0.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
…ter options correctly

### What changes were proposed in this pull request?

This PR aims to fix the test coverage at `DataStreamReaderWriterSuite`.

### Why are the changes needed?

Currently, the test case checks `DataStreamReader` options instead of `DataStreamWriter` options.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the revised test case.

Closes apache#29701 from dongjoon-hyun/SPARK-32836.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 06a9945)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…options

This PR aims to fix non-deterministic behavior of DataStreamReader/Writer options like the following.
```scala
scala> spark.readStream.format("parquet").option("paTh", "1").option("PATH", "2").option("Path", "3").option("patH", "4").option("path", "5").load()
org.apache.spark.sql.AnalysisException: Path does not exist: 1;
```

This will make the behavior deterministic.

Yes, but the previous behavior was non-deterministic.

Pass the newly added test cases.

Closes apache#29702 from dongjoon-hyun/SPARK-32832.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 2f85f95)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ve recursively

### What changes were proposed in this pull request?

This patch proposes to check `ignoreNullability` parameter recursively in `equalsStructurally` method. This backports apache#29698 to branch-3.0.

### Why are the changes needed?

`equalsStructurally` is used to check type equality. We can optionally ask it to ignore the nullability check, but the parameter `ignoreNullability` is not passed recursively down to nested types, so it produces a weird error like:

```
data type mismatch: argument 3 requires array<array<string>> type, however ... is of array<array<string>> type.
```

when running the query `select aggregate(split('abcdefgh',''), array(array('')), (acc, x) -> array(array( x ) ) )`.
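
A minimal sketch of passing the flag down recursively, simplified to arrays only (the real method also handles structs and maps):

```scala
import org.apache.spark.sql.types._

def equalsStructurally(a: DataType, b: DataType, ignoreNullability: Boolean): Boolean =
  (a, b) match {
    case (ArrayType(ae, an), ArrayType(be, bn)) =>
      // The flag must be forwarded to the element comparison, not dropped here.
      (ignoreNullability || an == bn) && equalsStructurally(ae, be, ignoreNullability)
    case _ => a == b
  }
```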

### Does this PR introduce _any_ user-facing change?

Yes, fixed a bug when running user query.

### How was this patch tested?

Unit tests.

Closes apache#29705 from viirya/SPARK-32819-3.0.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…adhesive with the unit

THIS PR backports apache#29708 to 3.0

### What changes were proposed in this pull request?
In this PR, we add a check on STRING-form interval values before parsing multiple-unit intervals and fail directly if the interval value contains alphabetic characters, to prevent correctness issues like `interval '1 day 2' day` = `3 days`.

### Why are the changes needed?

fix correctness issue

### Does this PR introduce _any_ user-facing change?

Yes. In Spark 3.0.0, `interval '1 day 2' day` = `3 days`, but now we fail with a ParseException.
### How was this patch tested?

add a test.

Closes apache#29716 from yaooqinn/SPARK-32840-30.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…nCommand

### What changes were proposed in this pull request?

We made a mistake in apache#29502, as there is no code comment to explain why we can't load the UDF class when creating functions. This PR improves the code comment.

### Why are the changes needed?

To avoid making the same mistake.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes apache#29713 from cloud-fan/comment.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(cherry picked from commit 328d81a)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…stly in DataStreamReaderWriterSuite

This PR aims to add `sinkParameter` to check sink options robustly and independently in DataStreamReaderWriterSuite.

`LastOptions.parameters` is designed to catch three cases: `sourceSchema`, `createSource`, `createSink`. However, `StreamQuery.stop` invokes `queryExecutionThread.join`, `runStream`, `createSource` immediately and resets the options stored by `createSink`.

To catch the `createSink` options, the test suite currently tries a workaround pattern. However, we sometimes observed flakiness with this pattern. If we track the `createSink` options separately, we don't need this workaround and can eliminate the flakiness.

```scala
val query = df.writeStream.
   ...
   .start()
assert(LastOptions.parameters(..))
query.stop()
```

No. This is a test-only change.

Pass the newly updated test case.

Closes apache#29730 from dongjoon-hyun/SPARK-32845.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit b4be6a6)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Follow-up PR as per the review comments in [29649](https://github.com/apache/spark/pull/29649/files/8d45542e915bea1b321f42988b407091065a2539#r487140171)

### Why are the changes needed?
Delete the unused code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT

Closes apache#29736 from sandeep-katta/deadlockfollowup.

Authored-by: sandeep.katta <sandeep.katta2007@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 2009f95)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…SPARK_VERSION correctly

### What changes were proposed in this pull request?

In https://github.com/apache/spark/blame/master/docs/quick-start.md#L402, it should be `{{site.SPARK_VERSION}}` rather than `{site.SPARK_VERSION}`.

### Why are the changes needed?

SPARK_VERSION isn't displayed correctly, as shown below

![image](https://user-images.githubusercontent.com/1892692/93006726-d03c8680-f514-11ea-85e3-1d7cfb682ef2.png)

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

tested locally, as shown below

![image](https://user-images.githubusercontent.com/1892692/93006712-a6835f80-f514-11ea-8d78-6831c9d65265.png)

Closes apache#29738 from bowenli86/doc.

Authored-by: bowen.li <bowenli86@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 0549c20)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…7 in HiveExternalCatalogVersionsSuite

### What changes were proposed in this pull request?

The Jenkins job fails to get the versions. This was fixed by adding temporary fallbacks at apache#28536.
This still doesn't work without the temporary fallbacks. See apache#29694

This PR adds new fallbacks since 2.3 is EOL and Spark 3.0.1 and 2.4.7 are released.

### Why are the changes needed?

To test correctly in Jenkins.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

Jenkins and GitHub Actions builds should test.

Closes apache#29748 from HyukjinKwon/SPARK-32876.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 0696f04)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…eding growth threshold

### What changes were proposed in this pull request?

When BytesToBytesMap is at `MAX_CAPACITY` and reaches its growth threshold, `numKeys >= growthThreshold` is true but `longArray.size() / 2 < MAX_CAPACITY` is false. This correctly prevents the map from growing, but `canGrowArray` incorrectly remains true. Therefore the map keeps accepting new keys and exceeds its growth threshold. If we attempt to spill the map in this state, the UnsafeKVExternalSorter will not be able to reuse the long array for sorting. By this point the task has typically consumed all available memory, so the allocation of the new pointer array is likely to fail.

This PR fixes the issue by setting `canGrowArray` to false in this case. This prevents the map from accepting new elements when it cannot grow to accommodate them.

### Why are the changes needed?

Without this change, hash aggregations will fail when the number of groups per task is greater than `MAX_CAPACITY / 2 = 2^28` (approximately 268 million), and when the grouping aggregation is the only memory-consuming operator in its stage.

For example, the final aggregation in `SELECT COUNT(DISTINCT id) FROM tbl` fails when `tbl` contains 1 billion distinct values and when `spark.sql.shuffle.partitions=1`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Reproducing this issue requires building a very large BytesToBytesMap. Because this is infeasible to do in a unit test, this PR was tested manually by adding the following test to AbstractBytesToBytesMapSuite. Before this PR, the test fails in 8.5 minutes. With this PR, the test passes in 1.5 minutes.

```java
public abstract class AbstractBytesToBytesMapSuite {
  // ...
  @Test
  public void respectGrowthThresholdAtMaxCapacity() {
    TestMemoryManager memoryManager2 =
        new TestMemoryManager(
            new SparkConf()
            .set(package$.MODULE$.MEMORY_OFFHEAP_ENABLED(), true)
            .set(package$.MODULE$.MEMORY_OFFHEAP_SIZE(), 25600 * 1024 * 1024L)
            .set(package$.MODULE$.SHUFFLE_SPILL_COMPRESS(), false)
            .set(package$.MODULE$.SHUFFLE_COMPRESS(), false));
    TaskMemoryManager taskMemoryManager2 = new TaskMemoryManager(memoryManager2, 0);
    final long pageSizeBytes = 8000000 + 8; // 8 bytes for end-of-page marker
    final BytesToBytesMap map = new BytesToBytesMap(taskMemoryManager2, 1024, pageSizeBytes);

    try {
      // Insert keys into the map until it stops accepting new keys.
      for (long i = 0; i < BytesToBytesMap.MAX_CAPACITY; i++) {
        if (i % (1024 * 1024) == 0) System.out.println("Inserting element " + i);
        final long[] value = new long[]{i};
        BytesToBytesMap.Location loc = map.lookup(value, Platform.LONG_ARRAY_OFFSET, 8);
        Assert.assertFalse(loc.isDefined());
        boolean success =
            loc.append(value, Platform.LONG_ARRAY_OFFSET, 8, value, Platform.LONG_ARRAY_OFFSET, 8);
        if (!success) break;
      }

      // The map should grow to its max capacity.
      long capacity = map.getArray().size() / 2;
      Assert.assertTrue(capacity == BytesToBytesMap.MAX_CAPACITY);

      // The map should stop accepting new keys once it has reached its growth
      // threshold, which is half the max capacity.
      Assert.assertTrue(map.numKeys() == BytesToBytesMap.MAX_CAPACITY / 2);

      map.free();
    } finally {
      map.free();
    }
  }
}
```

Closes apache#29744 from ankurdave/SPARK-32872.

Authored-by: Ankur Dave <ankurdave@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 72550c3)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…oadcast

### What changes were proposed in this pull request?
In TorrentBroadcast.scala
```scala
L133: if (!blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, tellMaster = false))
L137: TorrentBroadcast.blockifyObject(value, blockSize, SparkEnv.get.serializer, compressionCodec)
L147: if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, tellMaster = true))
```
After the original value is saved successfully (TorrentBroadcast.scala: L133), if the subsequent `blockifyObject()` (L137) or store-piece (L147) steps fail, there is no opportunity to release the broadcast from memory.

This patch is to remove all pieces of the broadcast when failed to blockify or failed to store some pieces of a broadcast.

### Why are the changes needed?
We use Spark thrift-server as a long-running service. A bad query submitted a heavy BroadcastNestLoopJoin operation and caused full GCs on the driver. We killed the bad query, but the driver's memory usage remained high and full GCs were still frequent. By investigating with a GC dump and logs, we found that the broadcast may leak memory.

> 2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure)
2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc):
116G->112G(170G), 184.9121920 secs]
[Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: 116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)]
1: 676531691 72035438432 [B
2: 676502528 32472121344 org.apache.spark.sql.catalyst.expressions.UnsafeRow
3: 99551 12018117568 [Ljava.lang.Object;
4: 26570 4349629040 [I
5: 6 3264536688 [Lorg.apache.spark.sql.catalyst.InternalRow;
6: 1708819 256299456 [C
7: 2338 179615208 [J
8: 1703669 54517408 java.lang.String
9: 103860 34896960 org.apache.spark.status.TaskDataWrapper
10: 177396 25545024 java.net.URI
...

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested. A unit test is hard to write for this and the patch is straightforward.

Closes apache#29558 from LantaoJin/SPARK-32715.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 7a9b066)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…loat and double

### What changes were proposed in this pull request?

The `LiteralGenerator` for float and double datatypes was supposed to yield special values (NaN, +-inf) among others, but the `Gen.chooseNum` method does not yield values that are outside the defined range. The `Gen.chooseNum` for a wide range of floats and doubles does not yield values in the "everyday" range as stated in typelevel/scalacheck#113 .

There is a similar class, `RandomDataGenerator`, that is used in some other tests. `-0.0` and `-0.0f` were added there as special values too.

These changes revealed an inconsistency with the equality check between `-0.0` and `0.0`.

### Why are the changes needed?

The `LiteralGenerator` is mostly used in the `checkConsistencyBetweenInterpretedAndCodegen` method in `MathExpressionsSuite`. This change would have caught the bug fixed in apache#29495 .

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Locally reverted apache#29495 and verified that the existing test cases caught the bug.

Closes apache#29515 from tanelk/SPARK-32688.

Authored-by: Tanel Kiis <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(cherry picked from commit 6051755)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…th for reading CSV

### What changes were proposed in this pull request?

This proposes to enhance the user documentation of the API for loading a Dataset of strings storing CSV rows. If the header option is set to true, the API removes all lines identical to the header.
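
A small sketch of the documented behavior, assuming a running `spark` session with implicits imported:

```scala
import spark.implicits._

val csvLines = Seq("id,name", "1,a", "id,name", "2,b").toDS()
// With header=true, every line equal to the header is dropped, not just the first.
spark.read.option("header", "true").csv(csvLines).show()
// +---+----+
// | id|name|
// +---+----+
// |  1|   a|
// |  2|   b|
// +---+----+
```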

### Why are the changes needed?

This behavior can confuse users. We should explicitly document it.

### Does this PR introduce _any_ user-facing change?

No. Only doc change.

### How was this patch tested?

Only doc change.

Closes apache#29765 from viirya/SPARK-32888.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 550c1c9)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…n.builder.getOrCreate

### What changes were proposed in this pull request?

In PySpark shell, if you call `SparkSession.builder.getOrCreate` as below:

```python
import warnings
from pyspark.sql import SparkSession, SQLContext
warnings.simplefilter('always', DeprecationWarning)
spark.stop()
SparkSession.builder.getOrCreate()
```

it shows the deprecation warning as below:

```
/.../spark/python/pyspark/sql/context.py:72: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
  DeprecationWarning)
```

via https://github.com/apache/spark/blob/d3304268d3046116d39ec3d54a8e319dce188f36/python/pyspark/sql/session.py#L222

We shouldn't print the deprecation warning from it. This is the only place ^.

### Why are the changes needed?

To avoid mistakenly telling users that `SparkSession.builder.getOrCreate` is deprecated.

### Does this PR introduce _any_ user-facing change?

Yes, it won't show a deprecation warning to end users for calling `SparkSession.builder.getOrCreate`.

### How was this patch tested?

Manually tested as above.

Closes apache#29768 from HyukjinKwon/SPARK-32897.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 657e39a)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
…e nulls

### What changes were proposed in this pull request?

This PR changes the way `UnsafeExternalSorter.SpillableIterator` checks whether it has spilled already, by checking whether `inMemSorter` is null. It also allows it to spill other `UnsafeSorterIterator`s than `UnsafeInMemorySorter.SortedIterator`.

### Why are the changes needed?

Before this PR `UnsafeExternalSorter.SpillableIterator` could not spill when there are NULLs in the input and radix sorting is used. Currently, Spark determines whether UnsafeExternalSorter.SpillableIterator has not spilled yet by checking whether `upstream` is an instance of `UnsafeInMemorySorter.SortedIterator`. When radix sorting is used and there are NULLs in the input however, `upstream` will be an instance of `UnsafeExternalSorter.ChainedIterator` instead, and Spark will assume that the `SpillableIterator` iterator has spilled already, and therefore cannot spill again when it's supposed to spill.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A test was added to `UnsafeExternalSorterSuite` (and therefore also to `UnsafeExternalSorterRadixSortSuite`). I manually confirmed that the test failed in `UnsafeExternalSorterRadixSortSuite` without this patch.

Closes apache#29772 from tomvanbussel/SPARK-32900.

Authored-by: Tom van Bussel <tom.vanbussel@databricks.com>
Signed-off-by: herman <herman@databricks.com>
(cherry picked from commit e5e54a3)
Signed-off-by: herman <herman@databricks.com>
### What changes were proposed in this pull request?
Correct the typo in Show Table document

### Why are the changes needed?
The current Show Table documentation results in a parse error, so it is misleading to users.

### Does this PR introduce _any_ user-facing change?
Yes, the Show Table documentation is corrected now.

### How was this patch tested?
NA

Closes apache#29758 from Udbhav30/showtable.

Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 88e87bc)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…f fatal error happens in `Inbox.process`

This is a backport for [pr#29580](apache#29580) to branch 3.0.

### What changes were proposed in this pull request?

Processing for `ThreadSafeRpcEndpoint` is controlled by `numActiveThreads` in `Inbox`. Now if any fatal error happens during `Inbox.process`, `numActiveThreads` is not reduced. Then other threads cannot process messages in that inbox, which causes the endpoint to "hang". For other types of endpoints, we should also keep `numActiveThreads` correct.

This problem is more serious in previous Spark 2.x versions since the driver, executor and block manager endpoints are all thread safe endpoints.

To fix this, we should reduce the number of active threads if a fatal error happens in `Inbox.process`.
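
A rough sketch of the intent with a toy inbox (not the real `Inbox` code):

```scala
import scala.collection.mutable

// The active-thread counter must be restored even when a fatal error escapes
// message handling, otherwise other dispatcher threads refuse to drain this
// inbox and the endpoint appears to hang.
class ToyInbox(handle: Any => Unit) {
  private val messages = mutable.Queue.empty[Any]
  private var numActiveThreads = 0

  def post(message: Any): Unit = synchronized { messages.enqueue(message) }

  def process(): Unit = {
    synchronized { numActiveThreads += 1 }
    try {
      var next = synchronized { if (messages.nonEmpty) Some(messages.dequeue()) else None }
      while (next.isDefined) {
        handle(next.get) // may throw a fatal error such as OutOfMemoryError
        next = synchronized { if (messages.nonEmpty) Some(messages.dequeue()) else None }
      }
    } finally {
      synchronized { numActiveThreads -= 1 } // always give the slot back
    }
  }
}
```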

### Why are the changes needed?

`numActiveThreads` is not correct when fatal error happens and will cause the described problem.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add a new test.

Closes apache#29763 from wzhfy/deal_with_fatal_error_3.0.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
This PR rewrites `FoldablePropagation` rule to replace attribute references in a node with foldables coming only from the node's children.

Before this PR, in the case of this example (with the setting `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation`):
```scala
val a = Seq("1").toDF("col1").withColumn("col2", lit("1"))
val b = Seq("2").toDF("col1").withColumn("col2", lit("2"))
val aub = a.union(b)
val c = aub.filter($"col1" === "2").cache()
val d = Seq("2").toDF( "col4")
val r = d.join(aub, $"col2" === $"col4").select("col4")
val l = c.select("col2")
val df = l.join(r, $"col2" === $"col4", "LeftOuter")
df.show()
```
foldable propagation happens incorrectly:
```
 Join LeftOuter, (col2#6 = col4#34)                                                              Join LeftOuter, (col2#6 = col4#34)
!:- Project [col2#6]                                                                             :- Project [1 AS col2#6]
 :  +- InMemoryRelation [col1#4, col2#6], StorageLevel(disk, memory, deserialized, 1 replicas)   :  +- InMemoryRelation [col1#4, col2#6], StorageLevel(disk, memory, deserialized, 1 replicas)
 :        +- Union                                                                               :        +- Union
 :           :- *(1) Project [value#1 AS col1#4, 1 AS col2#6]                                    :           :- *(1) Project [value#1 AS col1#4, 1 AS col2#6]
 :           :  +- *(1) Filter (isnotnull(value#1) AND (value#1 = 2))                            :           :  +- *(1) Filter (isnotnull(value#1) AND (value#1 = 2))
 :           :     +- *(1) LocalTableScan [value#1]                                              :           :     +- *(1) LocalTableScan [value#1]
 :           +- *(2) Project [value#10 AS col1#13, 2 AS col2#15]                                 :           +- *(2) Project [value#10 AS col1#13, 2 AS col2#15]
 :              +- *(2) Filter (isnotnull(value#10) AND (value#10 = 2))                          :              +- *(2) Filter (isnotnull(value#10) AND (value#10 = 2))
 :                 +- *(2) LocalTableScan [value#10]                                             :                 +- *(2) LocalTableScan [value#10]
 +- Project [col4#34]                                                                            +- Project [col4#34]
    +- Join Inner, (col2#6 = col4#34)                                                               +- Join Inner, (col2#6 = col4#34)
       :- Project [value#31 AS col4#34]                                                                :- Project [value#31 AS col4#34]
       :  +- LocalRelation [value#31]                                                                  :  +- LocalRelation [value#31]
       +- Project [col2#6]                                                                             +- Project [col2#6]
          +- Union false, false                                                                           +- Union false, false
             :- Project [1 AS col2#6]                                                                        :- Project [1 AS col2#6]
             :  +- LocalRelation [value#1]                                                                   :  +- LocalRelation [value#1]
             +- Project [2 AS col2#15]                                                                       +- Project [2 AS col2#15]
                +- LocalRelation [value#10]                                                                     +- LocalRelation [value#10]

```
and so the result is wrong:
```
+----+----+
|col2|col4|
+----+----+
|   1|null|
+----+----+
```

After this PR foldable propagation will not happen incorrectly and the result is correct:
```
+----+----+
|col2|col4|
+----+----+
|   2|   2|
+----+----+
```

### Why are the changes needed?
To fix a correctness issue.

### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue.

### How was this patch tested?
Existing and new UTs.

Closes apache#29771 from peter-toth/SPARK-32635-fix-foldable-propagation.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(cherry picked from commit 4ced588)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
1. Change the target error calculation according to the paper [Space-Efficient Online Computation of Quantile Summaries](http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf). It says that the error is `e = max(gi, deltai)/2` (see page 59). There is also a clear explanation at [ε-approximate quantiles](http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/08-Quantile/Greenwald.html#proofprop1).
2. Added a test to check different accuracies.
3. Added an input CSV file `percentile_approx-input.csv.bz2` to the resource folder `sql/catalyst/src/main/resources` for the test.

### Why are the changes needed?
To fix incorrect percentile calculation, see an example in SPARK-32908.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
- By running existing tests in `QuantileSummariesSuite` and in `ApproximatePercentileQuerySuite`.
- Added new test `SPARK-32908: maximum target error in percentile_approx` to `ApproximatePercentileQuerySuite`.

Closes apache#29784 from MaxGekk/fix-percentile_approx-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 75dd864)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…zing floats

### What changes were proposed in this pull request?

This PR intends to fix a minor bug when normalizing floats for struct types:
```
scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec
scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k")
scala> val agg = df.distinct()
scala> agg.explain()
== Physical Plan ==
*(2) HashAggregate(keys=[k#40], functions=[])
+- Exchange hashpartitioning(k#40, 200), true, [id=apache#62]
   +- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) null else named_struct(col1, knownfloatingpointnormalized(normalizenanandzero(k#40._1)))) AS k#40], functions=[])
      +- *(1) LocalTableScan [k#40]

scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: HashAggregateExec => a.output.head }
scala> aggOutput.foreach { attr => println(attr.prettyJson) }
### Final Aggregate ###
[ {
  "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
  "num-children" : 0,
  "name" : "k",
  "dataType" : {
    "type" : "struct",
    "fields" : [ {
      "name" : "_1",
                ^^^
      "type" : "double",
      "nullable" : false,
      "metadata" : { }
    } ]
  },
  "nullable" : true,
  "metadata" : { },
  "exprId" : {
    "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
    "id" : 40,
    "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
  },
  "qualifier" : [ ]
} ]

### Partial Aggregate ###
[ {
  "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
  "num-children" : 0,
  "name" : "k",
  "dataType" : {
    "type" : "struct",
    "fields" : [ {
      "name" : "col1",
                ^^^^
      "type" : "double",
      "nullable" : true,
      "metadata" : { }
    } ]
  },
  "nullable" : true,
  "metadata" : { },
  "exprId" : {
    "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
    "id" : 40,
    "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
  },
  "qualifier" : [ ]
} ]
```

### Why are the changes needed?

bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes apache#29780 from maropu/FixBugInNormalizedFloatingNumbers.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit b49aaa3)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
…legationTokens message

### What changes were proposed in this pull request?

With a long-running application in kerberized mode, the AM endpoint handles the `UpdateDelegationTokens` message incorrectly; it is a OneWayMessage that should be handled in the `receive` function.

```java
20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, launching executors on 0 of them.
20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive'
	at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive'
	at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```
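
A toy illustration of the dispatch split that bites here (this is not Spark's `RpcEndpoint` API): one-way messages are routed to `receive`, ask-style messages to `receiveAndReply`, so the handler for the one-way `UpdateDelegationTokens` message has to live in `receive`:

```scala
// Stand-in message class; the real one carries serialized credentials.
case class UpdateDelegationTokens(tokens: Array[Byte])

object TokenAwareAM {
  // Handling the one-way message here avoids the "does not implement 'receive'"
  // errors shown in the log above.
  def receive: PartialFunction[Any, Unit] = {
    case UpdateDelegationTokens(tokens) =>
      println(s"Updating delegation tokens (${tokens.length} bytes)")
  }
}
```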

### Why are the changes needed?

Bugfix: without a proper token refresher, long-running apps will potentially fail in a kerberized cluster.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Passing jenkins

and verify manually

I am running the sub-module `kyuubi-spark-sql-engine` of https://github.com/yaooqinn/kyuubi

The simplest way to reproduce the bug and verify this fix is to follow these steps

#### 1 build the `kyuubi-spark-sql-engine` module
```
mvn clean package -pl :kyuubi-spark-sql-engine
```
#### 2. config the spark with Kerberos settings towards your secured cluster

#### 3. start it in the background
```
nohup bin/spark-submit --class org.apache.kyuubi.engine.spark.SparkSQLEngine ../kyuubi-spark-sql-engine-1.0.0-SNAPSHOT.jar > kyuubi.log &
```

#### 4. check the AM log and see

"Updating delegation tokens ..." for SUCCESS

"Inbox: Ignoring error ...... does not implement 'receive'" for FAILURE

Closes apache#29777 from yaooqinn/SPARK-32905.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 9e9d4b6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
nob13 and others added 7 commits January 20, 2021 09:39
### What changes were proposed in this pull request?
Check SPARK_TESTING as a lazy val to avoid slowdown when there are many environment variables.

### Why are the changes needed?
If there are many environment variables, sys.env is very slow. As Utils.isTesting is called very often during DataFrame optimization, this can slow down evaluation considerably.

An example for triggering the problem can be found in the bug ticket https://issues.apache.org/jira/browse/SPARK-34115
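
A minimal sketch of the caching approach (hypothetical object; the real change is in `Utils.isTesting`):

```scala
object EnvFlags {
  // Evaluated once on first access instead of scanning sys.env on every call,
  // which matters when the environment holds thousands of variables.
  lazy val isTesting: Boolean =
    sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")
}
```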

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
With the example provided in the ticket.

Closes apache#31244 from nob13/bug/34115.

Lead-authored-by: Norbert Schultz <norbert.schultz@reactivecore.de>
Co-authored-by: Norbert Schultz <noschultz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit c3d8352)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…eRelation.newInstance

### What changes were proposed in this pull request?

Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`.

### Why are the changes needed?

```scala
val df = spark.range(2)
df.join(df, df("id") <=> df("id")).show()
```

For this query, it's supposed to be non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition:

https://github.com/apache/spark/blob/537a49fc0966b0b289b67ac9c6ea20093165b0da/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala#L125

However, `DetectAmbiguousSelfJoin` cannot apply this check because the right-side plan doesn't contain the dataset_id TreeNodeTag, which goes missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tag info for the copied node.

Fortunately, the query is still considered a non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left-side plan and the reference is the same as the left-side plan. However, this is not the expected behavior but only a coincidence.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated a unit test

Closes apache#31260 from Ngone51/fix-missing-tags.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f498977)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This test fails flakily. I found it failing in 1 out of 80 runs.
```
  Expected -0.35667494393873245 and -0.41914521201224453 to be within 0.15 using relative tolerance.
```
Increasing the relative tolerance to 0.2 should reduce flakiness.
```
0.2 * 0.35667494393873245 = 0.071 > 0.062 = |-0.35667494393873245 - (-0.41914521201224453)|
```

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes apache#31266 from Loquats/NaiveBayesSuite-reltol.

Authored-by: Andy Zhang <yue.zhang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit c8c70d5)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit dad201e)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…iInstanceRelation.newInstance"

This reverts commit 89443ab.
### What changes were proposed in this pull request?
When the Ruby version is 3.0, the jekyll server fails with:
```
yi.zhu$ SKIP_API=1 jekyll serve --watch
Configuration file: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs/_config.yml
            Source: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs
       Destination: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs/_site
 Incremental build: disabled. Enable with --incremental
      Generating...
                    done in 5.085 seconds.
 Auto-regeneration: enabled for '/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs'
                    ------------------------------------------------
      Jekyll 4.2.0   Please append `--trace` to the `serve` command
                     for any additional information or backtrace.
                    ------------------------------------------------
<internal:/usr/local/Cellar/ruby/3.0.0_1/lib/ruby/3.0.0/rubygems/core_ext/kernel_require.rb>:85:in `require': cannot load such file -- webrick (LoadError)
	from <internal:/usr/local/Cellar/ruby/3.0.0_1/lib/ruby/3.0.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve/servlet.rb:3:in `<top (required)>'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:179:in `require_relative'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:179:in `setup'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:100:in `process'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `block in process_with_graceful_fail'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `each'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `process_with_graceful_fail'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:86:in `block (2 levels) in init_with_program'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `block in execute'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `each'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `execute'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/program.rb:44:in `go'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary.rb:21:in `program'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/exe/jekyll:15:in `<top (required)>'
	from /usr/local/bin/jekyll:23:in `load'
	from /usr/local/bin/jekyll:23:in `<main>'
```

This issue is solved in jekyll/jekyll#8523

### Why are the changes needed?
Fix build issue

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes apache#31263 from AngersZhuuuu/SPARK-34181.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit faa4f0c)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…Server's spatial types

### What changes were proposed in this pull request?

This PR fixes the issue that reading tables which contain spatial datatypes from MS SQL Server fails.
MS SQL Server supports two non-standard spatial JDBC types, `geometry` and `geography`, but Spark SQL can't handle them:

```
java.sql.SQLException: Unrecognized SQL type -157
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:251)
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
 at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
 at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:366)
 at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:355)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:355)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
 at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:381)
```

Considering what the [data type mapping](https://docs.microsoft.com/ja-jp/sql/connect/jdbc/using-basic-data-types?view=sql-server-ver15) says, I think those spatial types can be mapped to Catalyst's `BinaryType`.

### Why are the changes needed?

To provide better support.

### Does this PR introduce _any_ user-facing change?

Yes. MS SQL Server users can use `geometry` and `geography` types in datasource tables.

### How was this patch tested?

New test case added to `MsSqlServerIntegrationSuite`.

Closes apache#31283 from sarutak/mssql-spatial-types.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>