Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-29215][SQL] current namespace should be tracked in SessionCatalog if the current catalog is session catalog #25903

Closed
wants to merge 3 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Sep 23, 2019

What changes were proposed in this pull request?

when the current catalog is session catalog, get/set the current namespace from/to the SessionCatalog.

Why are the changes needed?

It's super confusing that we don't have a single source of truth for the current namespace of the session catalog. It can be in CatalogManager or SessionCatalog.

Ideally, we should always track the current catalog/namespace in CatalogManager. However, there are many commands that do not support v2 catalog API. They ignore the current catalog in CatalogManager and blindly go to SessionCatalog. This means, we must keep track of the current namespace of session catalog even if the current catalog is not session catalog.

Thus, we can't use CatalogManager to track the current namespace of session catalog because it changes when the current catalog is changed. To keep single source of truth, we should only track the current namespace of session catalog in SessionCatalog.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Newly added and updated test cases.

@cloud-fan
Copy link
Contributor Author

@imback82 I think this should help to simplify #25771

also cc @rdblue @gengliangwang

@SparkQA
Copy link

SparkQA commented Sep 23, 2019

Test build #111228 has finished for PR 25903 at commit e901b89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(

@SparkQA
Copy link

SparkQA commented Sep 23, 2019

Test build #111230 has finished for PR 25903 at commit 7c8f315.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(

}

def setCurrentCatalog(catalogName: String): Unit = synchronized {
_currentCatalog = Some(catalogName)
_currentCatalogName = Some(catalogName)
_currentNamespace = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to reset the namespace (set to default database) in v1SessionCatalog to make the behavior consistent when catalog is changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a behavior we can discuss: a user sets the current database of the session catalog, then switch to a custom catalog, shall we reset the current database of the session catalog? Think about the following commands:

USE myDB
SET CURRENT CATALOG myCata
ANALYZE TABLE t

shall we analyze myDB.t or default.t or fail?

also cc @gatorsmile @dongjoon-hyun @viirya

Copy link
Contributor

@imback82 imback82 Sep 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current namespace should be either 1) reset to default or 2) maintained per catalog, but not carried over when a catalog is switched (seems less intuitive since namespace is under catalog).

For example,

  1. reset to default:
# Starts with session catalog
SELECT current_database() # default
USE db
SELECT current_database() # db

# switch catalog
USE CATALOG myCat
SELECT current_database() # [] (default)
USE ns1.ns2
SELECT current_database() # ns1.ns2

# switch back to different catalogs. namespace is reset.
USE CATALOG session
SELECT current_database() # default
USE CATALOG myCat
SELECT current_database() # [] (default)
  1. previous namespace is maintained per catalog:
# Starts with session catalog
SELECT current_database() # default
USE db
SELECT current_database() # db

# switch catalog
USE CATALOG myCat
SELECT current_database() # [] (default)
USE ns1.ns2
SELECT current_database() # ns1.ns2

# switch back to different catalogs. namespace is preserved.
USE CATALOG session
SELECT current_database() # db
USE CATALOG myCat
SELECT current_database() # ns1.ns2

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for resetting it when we switch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resetting it when switching is definitely reasonable, but what I was talking about is v1 command:

# Starts with session catalog
SELECT current_database() # default
ANALYZE TABLE t # analyze default.t from session catalog

USE db
SELECT current_database() # db
ANALYZE TABLE t # analyze db.t from session catalog

# switch catalog
USE CATALOG myCat
SELECT current_database() # [] (default namespace of myCat)
ANALYZE TABLE t

For the last ANALYZE TABLE,it's unclear what the behavior should be. There are 3 options:

  1. We should look up the table from myCat, so this should either throw NoSuchTable exception, or throw unsupported exception because we don't have an analyze table API in DS v2.
  2. V1 commands should still go with the v1 code path. We should analyze table default.t because the current catalog is not session catalog and we don't know what the currennt database of session catalog is now.
  3. V1 commands should still go with the v1 code path. We should analyze table db.t because this is the most recent current database of the session catalog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I also think option 1 is better, but it's a lot of work. We need to create statement logical plans (e.g. DeleteFromStatement) and related resolution rules for all commands that needs to resolve tables even if they don't support v2 APIs. An initial list: ANALYZE TABLE, CACHE TABLE, REFRESH TABLE, ALTER TABLE ADD PARTITION, ALTER TABLE SET SERDE, REPAIR TABLE, ...

My refactoring should make this easier, but it's still a lot of work to be done here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for Option 1. I think we can make the behavior consistent when we switch catalogs (reset to default). But if Option 1 is a lot of work, we can go with Option 2 (default.t) until we create rules for all commands that don't support v2 APIs (I can help with this).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool! I'll go with option 2 here, with a TODO saying that we want to do option 1 in the future.

I'll create JIRA tickets for the followup work so that we can have people working on it in parallel,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, @cloud-fan .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Agree on that option 1 is more reasonable.

Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dongjoon-hyun
Copy link
Member

cc @rdblue

@SparkQA
Copy link

SparkQA commented Sep 24, 2019

Test build #111302 has finished for PR 25903 at commit b05cb4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

def setCurrentCatalog(catalogName: String): Unit = synchronized {
_currentCatalog = Some(catalogName)
_currentCatalogName = Some(catalogName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If set to SESSION_CATALOG_NAME, but current catalog is already SESSION_CATALOG_NAME? Will this also reset to DEFAULT_DATABASE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point! This is an issue for all catalogs, instead of just session catalog. setCurrentCatalog should be noop if it doesn't switch catalog. I'll fix it in this PR since it's simple.

@SparkQA
Copy link

SparkQA commented Sep 25, 2019

Test build #111317 has finished for PR 25903 at commit 31db62d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member

Thanks, merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants