[SPARK-32064][SQL] Supporting create temporary table by LantaoJin · Pull Request #28901 · apache/spark

LantaoJin · 2020-06-23T03:55:11Z

What changes were proposed in this pull request?

Many databases and data warehouse SQL engines support temporary tables. A temporary table, as its name implies, is a short-lived table that exists only for the current session.
Hive Temporary Table
Teradata Volatile Table
PostgreSQL Temporary Table

In Spark, there is no temporary table. The DDL “CREATE TEMPORARY TABLE AS SELECT” creates a temporary view. A temporary view is totally different from a temporary table.

This ticket adds support for a Spark-native temporary table. More details are described in the design docs.

Parent ticket https://issues.apache.org/jira/browse/SPARK-32063

Why are the changes needed?

A temporary view is just a VIEW. It doesn’t materialize data in storage, so it has the following shortcomings:

A view does not improve performance. Materializing intermediate data in temporary tables for a complex query will accelerate queries, especially in an ETL pipeline.
A view that calls other views can cause severe performance issues. Executing a very complex view may even fail in Spark.
A temporary view has no database namespace. In some complex ETL pipelines or data warehouse applications, working without a database prefix is inconvenient. These applications need some tables that are used only in the current session.

Does this PR introduce any user-facing change?

YES.

CREATE TEMPORARY TABLE tt1 AS SELECT ..

Before this patch, it creates a local temporary VIEW. After this patch, it creates a temporary table.

CREATE TEMPORARY TABLE tt1 USING ..

Before this patch, it throws an exception. After this patch, it creates a temporary table.

Add a new API in Catalog.scala

def dropTempTable(tableName: String): Boolean

How was this patch tested?

Add unit tests.

SparkQA · 2020-06-23T04:11:23Z

Test build #124382 has finished for PR 28901 at commit f3f4e15.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class TempViewAlreadyExistsException(table: String)
class TempTableAlreadyExistsException(table: TableIdentifier)
class TempTablePartitionUnsupportedException(table: TableIdentifier)

SparkQA · 2020-06-23T06:51:52Z

Test build #124394 has finished for PR 28901 at commit fa1a84a.

This patch fails build dependency tests.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2020-06-23T07:01:22Z

retest this please

SparkQA · 2020-06-23T07:06:08Z

Test build #124397 has finished for PR 28901 at commit fa1a84a.

This patch fails build dependency tests.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2020-06-23T07:30:38Z

retest this please

SparkQA · 2020-06-23T10:18:22Z

Test build #124400 has finished for PR 28901 at commit fa1a84a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-24T02:11:19Z

Test build #124444 has finished for PR 28901 at commit 5f23917.

This patch fails build dependency tests.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2020-06-24T02:27:25Z

retest this please

SparkQA · 2020-06-24T02:33:19Z

Test build #124446 has finished for PR 28901 at commit 5f23917.

This patch fails build dependency tests.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2020-06-24T02:40:47Z

retest this please

SparkQA · 2020-06-24T02:48:56Z

Test build #124447 has finished for PR 28901 at commit 5f23917.

This patch fails build dependency tests.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2020-06-24T06:01:59Z

retest this please

SparkQA · 2020-06-24T07:05:02Z

Test build #124463 has finished for PR 28901 at commit 5f23917.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile

This feature requires a lot of changes in different places. We need to define whether it should be global or local; whether we should create such a schema in each session; and how to handle errors when the expected table drop does not complete.

Trying to understand your use case first. Instead of creating a regular table, you want to create a temp table that does not need to be manually dropped?

LantaoJin · 2020-06-28T05:14:19Z

@gatorsmile Yes. Just like a Hive temporary table or a Teradata volatile table. We are migrating our Spark to v3.0. This is one of our internal features, and it has been widely used in our production.

SparkQA · 2020-06-28T07:05:02Z

Test build #124594 has finished for PR 28901 at commit 6d3274f.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-28T10:06:28Z

Test build #124600 has finished for PR 28901 at commit 62e8ac9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2020-06-28T12:37:35Z

retest this please

SparkQA · 2020-06-28T15:18:04Z

Test build #124603 has finished for PR 28901 at commit 62e8ac9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2020-07-03T01:16:40Z

@gatorsmile @cloud-fan The current implementation is not complex. Any comments?

gatorsmile · 2020-07-06T04:55:03Z

For proper support, this requires more discussion about the semantics.

Also, we need to list the expected behaviors for all the statements listed in https://spark.apache.org/docs/latest/sql-ref-syntax.html .

So far, this PR and the design doc do not have the corresponding content.

rxin · 2020-07-06T05:00:38Z

sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala

    }
  }
+
+  test("create temporary table using data source") {


maybe create a new suite for these?

rxin · 2020-07-06T05:01:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/StaticSQLConf.scala

+
+  val SPARK_SCRATCH_DIR =
+    buildStaticConf("spark.scratchdir")
+      .doc("Scratch space for Spark temporary table and so on. Similar with hive.exec.scratchdir")


let's not bring up hive here. Slowly nobody will care about Hive.

Also this should be spark.sql.scratchdir?

LantaoJin · 2020-07-06T06:47:39Z

> For proper support, this requires more discussion about the semantics.
>
> Also, we need to list the expected behaviors for all the statements listed in https://spark.apache.org/docs/latest/sql-ref-syntax.html .
>
> So far, this PR and the design doc do not have the corresponding content.

Sure. If you could tell me more about what we need to discuss and which details should be written in the documentation, that would be very helpful to me. The concept of a "temporary table" is widely used in the database domain (MySQL, PostgreSQL, Oracle, etc.) and in the data warehouse domain (Hive, Teradata, etc.). Even though their implementations and grammar may differ somewhat, their purposes are similar in my opinion. The implementation and grammar of the Spark temporary table mainly reference Hive and PostgreSQL, which are basically the same.

cloud-fan · 2020-07-06T10:21:56Z

If I write the output to a temp location and then create a temp view, is it similar to the temp table? Except that temp table can be removed when the session terminates.

LantaoJin · 2020-07-06T12:16:34Z

> If I write the output to a temp location and then create a temp view, is it similar to the temp table? Except that temp table can be removed when the session terminates.

A temp view has no path and no materialized data, so the answer is no. You can simply treat a temporary table as a Spark permanent data source table that is dropped automatically when the session closes, so the implementation is not complex. But the use cases for temporary tables are broader than they look. A permanent metastore table needs more maintenance: creating one requires write permission on the database it belongs to and folder permission on the storage. In many ad hoc use cases (OLAP), a user may only have limited permissions, such as read. So currently, users rely on temporary views to implement their complicated queries, and a view does not improve performance because no data is materialized. If users can create temporary tables, there is no need to grant write permission on production databases, which is very convenient for users.

Imagine a use case like the following:
A Databricks Runtime user logs in to a notebook and writes some statements to do an analysis job. In Databricks Runtime, the user may have r/w permissions on their own default space (we call it a workspace), but no write permission on the production database (for example, database "dw"). Without temporary tables, they may use temporary views in their SQL statements. Or they can create a temporary workspace/database (for example, a database named "tony_work"), create permanent tables in "tony_work", and then drop them all when they log out (if they can log out without failure). But users may want to share their scripts (the SQL statements above) with another user or a batch account. They then have to change their scripts, since the batch account or the other user doesn't have write permission on database "tony_work". So in our production, the temporary table feature is widely used, especially by Teradata users who migrated to Spark. @cloud-fan
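To make the "dropped automatically when the session closes" idea concrete, here is a minimal toy sketch in plain Python. It is not Spark code and not the implementation in this PR: `ToySession`, its methods, and the CSV layout are all hypothetical names chosen for illustration. It only shows the lifecycle being argued for: data is materialized once under a per-session scratch directory and removed when the session ends, so the user never needs write access to a shared database.

```python
import shutil
import tempfile
from pathlib import Path


class ToySession:
    """Hypothetical stand-in for a SQL session; this is not Spark's API."""

    def __init__(self, scratch_root):
        # Each session gets its own scratch directory for temp table data.
        self.scratch_dir = Path(tempfile.mkdtemp(dir=scratch_root, prefix="session-"))
        self.temp_tables = {}

    def create_temp_table(self, name, rows):
        # Materialize the rows once, loosely like CREATE TEMPORARY TABLE ... AS SELECT.
        if name in self.temp_tables:
            raise ValueError(f"temporary table {name} already exists")
        path = self.scratch_dir / f"{name}.csv"
        path.write_text("\n".join(",".join(map(str, row)) for row in rows))
        self.temp_tables[name] = path
        return path

    def read_temp_table(self, name):
        # Read back the materialized data; every value comes back as a string.
        text = self.temp_tables[name].read_text()
        return [line.split(",") for line in text.splitlines()]

    def drop_temp_table(self, name):
        # Assumption: True means a table was dropped, False means none existed.
        path = self.temp_tables.pop(name, None)
        if path is None:
            return False
        path.unlink()
        return True

    def close(self):
        # Ending the session removes every remaining temp table and its data.
        shutil.rmtree(self.scratch_dir, ignore_errors=True)
        self.temp_tables.clear()
```

A script written this way can be shared with another user or a batch account unchanged, because nothing in it refers to a database that only one account can write to.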

cloud-fan · 2020-07-06T12:45:18Z

Let me clarify my previous comment a little bit: If I write the output to a temp location and then create a temp view to read from this temp location, ...

I agree with your use case, but I'm a bit worried about adding a big new API (temp table) if there are easy workarounds.

LantaoJin · 2020-07-06T13:07:47Z

> If I write the output to a temp location and then create a temp view to read from this temp location, ...

Ah, I see what you mean now: using the CREATE TEMP VIEW USING command. I think it's still tricky. First, the platform needs to provide a "temp" location to store users' data, and it must be able to clean up all untracked files. That is hard to do. Second, the CREATE TEMP VIEW USING command lacks a view definition. In most cases, an immutable temporary table is enough: after a temporary table is created with AS SELECT ..., few insert/overwrite operations are performed on it. With CREATE TEMP VIEW USING, which has no AS SELECT, most usage seems to require insert/overwrite on it. What's more, storage folder permission is still a problem on an ACL-enabled platform (even though Spark itself does not include ACLs, many companies build platforms with ACLs similar to Databricks Runtime's), and analysts usually do not care about 'PATH'.

SparkQA · 2020-07-07T04:17:19Z

Test build #125153 has finished for PR 28901 at commit 9b11aac.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2020-07-07T05:44:21Z

retest this please

SparkQA · 2020-07-07T07:05:01Z

Test build #125178 has finished for PR 28901 at commit 9b11aac.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2020-07-07T08:03:53Z

retest this please

SparkQA · 2020-07-07T10:37:02Z

Test build #125195 has finished for PR 28901 at commit 9b11aac.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

LantaoJin · 2020-07-14T05:17:33Z

retest this please

SparkQA · 2020-07-14T07:05:02Z

Test build #125805 has finished for PR 28901 at commit 9b11aac.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

kevinjmh · 2020-07-23T03:34:55Z

How about using the CACHE TABLE command to do that?

LantaoJin · 2020-07-23T07:29:04Z

> How about using the CACHE TABLE command to do that?

I think you mean CACHE VIEW, since CACHE TABLE still requires a manual DROP TABLE.
There are four reasons to build a temporary table instead of caching a temporary view:

The intermediate tables that users want to create as temporary tables are usually very large. To avoid OOM, the user has to use CACHE TABLE viewname OPTIONS ('storageLevel' 'DISK_ONLY'). This is not friendly to SQL users, who are confused about what 'storageLevel' and 'DISK_ONLY' mean.
A view is dynamic and a table is static. Whenever the underlying detail tables change, accessing a view should always return the latest data. So when the underlying detail tables of a view change, Spark will recache all the data to the executors' local disks again. For a large intermediate table, this is not performance friendly.
The storage of a cached view and of a temporary table is different. The cache command stores blocks on the executors' local disks, managed by the BlockManager, while the data of a temporary table is stored in external storage such as HDFS. Local disks on executors are very limited and not easy to scale out. Besides, the data in HDFS can be organized in the Parquet file format, which greatly benefits scan operations and predicate pushdown.
The table statistics of a cached view easily become stale. IIUC, table statistics for a cached view are calculated when the cache operation occurs. When the data of the underlying detail tables changes, the statistics of the cached view are not updated, so some optimizations such as AQE cannot work correctly. A temporary table does not have this problem.
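The "view is dynamic, table is static" point can be shown with a tiny toy sketch in plain Python. This is not Spark code; `compare_view_and_table`, `base_rows`, `view`, and `temp_table` are hypothetical names. It assumes only that a view stores a query and re-evaluates it on access, while a temporary table stores the result computed at creation time.

```python
def compare_view_and_table():
    # The "underlying detail table" that both objects are derived from.
    base_rows = [1, 2, 3]

    def view():
        # A view stores only the query; it is re-evaluated on every access.
        return [x * 10 for x in base_rows]

    # A temporary table materializes the result once, at creation time.
    temp_table = [x * 10 for x in base_rows]

    # The underlying table changes after both objects were created.
    base_rows.append(4)

    return view(), temp_table


view_result, table_result = compare_view_and_table()
print(view_result)   # [10, 20, 30, 40]  (reflects the new base row)
print(table_result)  # [10, 20, 30]      (snapshot taken at creation)
```

In this toy model the temporary table needs no recomputation after the base data changes, which is the cost the cached-view approach would pay for a large intermediate result.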

github-actions · 2020-11-01T00:36:52Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

[SPARK-32064][SQL] Supporting create temporary table

f3f4e15

probot-autolabeler bot added CORE SQL labels Jun 23, 2020

fix mima check

fa1a84a

probot-autolabeler bot added the BUILD label Jun 23, 2020

LantaoJin changed the title ~~[SPARK-32064][SQL] Supporting create temporary table~~ [WIP][SPARK-32064][SQL] Supporting create temporary table Jun 23, 2020

fix ut

5f23917

gatorsmile reviewed Jun 24, 2020

fix ut

6d3274f

map Spark temporary table to the LOCAL TEMPORARY table type in JDBC

62e8ac9

LantaoJin changed the title ~~[WIP][SPARK-32064][SQL] Supporting create temporary table~~ [SPARK-32064][SQL] Supporting create temporary table Jun 29, 2020

rxin reviewed Jul 6, 2020

add TemporaryTableSuite

9b11aac

github-actions bot added the Stale label Nov 1, 2020

github-actions bot closed this Nov 2, 2020
