Skip to content

Conversation

@danny0405
Copy link
Contributor

What is the purpose of the change

Support create table for SqlClient

Brief change log

  • Add create table command support for SqlClient
  • Always cache the current catalog to the original session, so that when we set a variable, the current catalog table still can be reused

Verifying this change

See tests in LocalExecutorITCase

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? not documented

@flinkbot
Copy link
Collaborator

flinkbot commented Jul 10, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 829e8ae (Tue Aug 06 15:40:36 UTC 2019)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

Copy link
Contributor

@twalthr twalthr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 for this PR. I would like to propose a different architecture.

How about the following flow:

  • CLI sends SQL (any SQL not just DDL) to Executor
  • Executor needs to distinguish between immediate executable statements (such as storing tables in a catalog) and statements that just enrich the session context.
  • The updated session context could be sent back to the client.

This is just one idea. In any case we need to get the separation between CLI (client) and executor (gateway/server) right. This PR mixes responsibilities.

We can also discuss that in an privately first.


private final Map<String, ViewEntry> views;

private volatile Catalog currentCatalog;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A catalog should not be part of a session context. You need to image the session context like a JSON map that is serialized between CLI and Executor. Context + SQL could be sent to a stateless server.


#==============================================================================
# TEST ENVIRONMENT FILE
# General purpose default environment file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you copy code, make sure to also update it accordingly. This is not a "general purpose" file anymore but a file for a specific test. Also remove all unrelated content to see what you are actually testing.

@flinkbot
Copy link
Collaborator

CI report for commit 0033ba7: FAILURE Build

@xuefuz
Copy link
Contributor

xuefuz commented Jul 11, 2019

I appreciate Timo's vision of client/server model even in case that we only have local execution at the moment. However, I'm a little curious about the goal of a stateless gateway with state being passed back and forth. Some state is easily fit to this model, while others, such as temporary tables created by the user that's further referenced in subsequent queries. This is just one thing that's currently maintained in a table env. Without a live table env instance maintained for a remote client, it's hard to maintain the state via a session context.

To me, stateless gateway seems making a lot of sense for largely stateless client, like one submitting a streaming job. The scenario for batch can be quite different.

Personally, I am in favor of having some DDL support now and leaving re-architecturing for the next release.

@danny0405
Copy link
Contributor Author

@twalthr How about we cache a catalogName -> DDLs mapping in the SessionContext, just like we cache the ViewEntry, then we recover the tables from these DDLs every time we switch to a new session.

@flinkbot
Copy link
Collaborator

CI report for commit 829e8ae: FAILURE Build

@twalthr
Copy link
Contributor

twalthr commented Jul 11, 2019

Tables registered in catalogs should have been persisted. So we don't need to memorize them. It is true that SessionContext has some cache for views, but if you look into the implementation, a view is just a string that comes directly from the CLI. It is not evaluated before submitting an entire query.

I agree that we should not re-register tables in a catalog. A CREATE TABLE statement should be executed immediately and be persistent in the catalog. Have we ever thought about properly distinguishing between temporary tables (for the session) and persistent tables (across sessions)? This would make the implementation in the SQL Client much easier:

CREATE TEMPORARY TABLE (...) WITH (..) // buffered in the CLI session and re-registered when a query is executed

CREATE TABLE (...) WITH (...) // executed immediately and applied to the catalog

This is one idea of fixing it properly.

@xuefuz
Copy link
Contributor

xuefuz commented Jul 11, 2019

@twalthr Your example of temporary table usage might be simple enough to make client-side caching workable. However, that's usually not the case. For instance, user might do
CREATE TEMPORARY TABLE (...) WITH (..) AS SELECT ... FROM X JOIN Y ON .... WHERE ...
it's not efficient to cache this at client side and execute every time when the temp table is referenced.

The essence of the problem is that a SQL user session has a complicated state that cannot be easily managed by a client. Thus, the idea of stateless gateway, while solving some use cases (like submitting a streaming job), has difficulty to apply for batch cases where session state is large and complicated.

@flinkbot
Copy link
Collaborator

flinkbot commented Aug 6, 2019

CI report:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants