
[SPARK-14014] [SQL] Replace existing catalog with SessionCatalog #11836

Closed
wants to merge 34 commits into apache:master from andrewor14:use-session-catalog

Conversation

andrewor14
Contributor

What changes were proposed in this pull request?

SessionCatalog, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to ExternalCatalog. This functionality overlaps a lot with the existing analysis.Catalog.
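
In rough terms, the division of labor looks like the sketch below: hypothetical, trimmed-down signatures, where only the class names and the delegation idea come from this description.

```scala
import scala.collection.mutable

// Simplified stand-ins for the real catalyst types; names and members here
// are illustrative only, not the actual Spark API.
case class TableIdentifier(table: String, database: Option[String] = None)
case class CatalogTable(identifier: TableIdentifier /*, schema, storage, ... */)

// The external catalog talks to the persistent metastore (in-memory or Hive)
// and is meant to be shared across sessions.
trait ExternalCatalog {
  def createTable(db: String, table: CatalogTable): Unit
  def getTable(db: String, table: String): CatalogTable
}

// The session catalog layers temporary tables on top, tracks the current
// database, and delegates metastore operations to the external catalog.
class SessionCatalog(externalCatalog: ExternalCatalog) {
  private val tempTables = new mutable.HashMap[String, CatalogTable]
  private var currentDb = "default"

  def setCurrentDatabase(db: String): Unit = { currentDb = db }

  def createTempTable(name: String, table: CatalogTable): Unit = {
    tempTables.put(name, table)
  }

  // Temporary tables shadow metastore tables with the same unqualified name.
  def lookupTable(name: TableIdentifier): CatalogTable = name.database match {
    case Some(db) => externalCatalog.getTable(db, name.table)
    case None =>
      tempTables.getOrElse(name.table, externalCatalog.getTable(currentDb, name.table))
  }
}
```

The intended split is that one ExternalCatalog instance can persist across sessions, while each SessionCatalog holds only session-local state such as temporary tables and the current database.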

As of this commit, SessionCatalog and ExternalCatalog will no longer be dead code. There are still things that need to be done after this patch, namely:

  • SPARK-14013: Properly implement temporary functions in SessionCatalog
  • SPARK-13879: Decide which DDL/DML commands to support natively in Spark
  • SPARK-?????: Implement the ones we do want to support through SessionCatalog.
  • SPARK-?????: Merge SQL/HiveContext

How was this patch tested?

This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are SessionCatalogSuite and ExternalCatalogSuite.

Andrew Or added 8 commits March 16, 2016 15:50
commit ad43a5f
Author: Andrew Or <andrew@databricks.com>
Date:   Wed Mar 16 14:35:02 2016 -0700

    Expand test scope + clean up test code

commit 08969cd
Author: Andrew Or <andrew@databricks.com>
Date:   Wed Mar 16 13:21:50 2016 -0700

    Fix tests

commit 6d9fa2f
Author: Andrew Or <andrew@databricks.com>
Date:   Wed Mar 16 12:31:52 2016 -0700

    Keep track of current database in SessionCatalog

    This allows us to not pass it into every single method like
    we used to before this commit.

commit ff1c2c4
Author: Andrew Or <andrew@databricks.com>
Date:   Tue Mar 15 19:42:22 2016 -0700

    Add TODO

commit 8c84dd8
Author: Andrew Or <andrew@databricks.com>
Date:   Tue Mar 15 19:41:30 2016 -0700

    Implement tests for functions

commit 3da16fb
Author: Andrew Or <andrew@databricks.com>
Date:   Tue Mar 15 19:04:03 2016 -0700

    Implement tests for table partitions

commit 7947445
Author: Andrew Or <andrew@databricks.com>
Date:   Tue Mar 15 18:52:30 2016 -0700

    Implement tests for databases and tables

commit 2f5121b
Author: Andrew Or <andrew@databricks.com>
Date:   Tue Mar 15 16:59:38 2016 -0700

    Fix infinite loop (woops)

commit d3f252d
Author: Andrew Or <andrew@databricks.com>
Date:   Tue Mar 15 16:12:55 2016 -0700

    Refactor CatalogTestCases to make methods accessible

commit caa4013
Author: Andrew Or <andrew@databricks.com>
Date:   Tue Mar 15 15:44:23 2016 -0700

    Clean up duplicate code in Table/FunctionIdentifier
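
As a rough illustration of one way such duplication can be factored out (a hypothetical sketch, not necessarily the exact shape used in the patch):

```scala
// Hypothetical sketch: pull the shared optional-database handling and
// quoting logic out of the two identifier types into a common trait.
sealed trait IdentifierWithDatabase {
  def name: String
  def database: Option[String]
  def quotedString: String =
    database.map(db => s"`$db`.`$name`").getOrElse(s"`$name`")
}

case class TableIdentifier(name: String, database: Option[String] = None)
  extends IdentifierWithDatabase

case class FunctionIdentifier(name: String, database: Option[String] = None)
  extends IdentifierWithDatabase
```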

commit 90ccdbb
Author: Andrew Or <andrew@databricks.com>
Date:   Tue Mar 15 15:33:30 2016 -0700

    Fix style

commit 5587a49
Author: Andrew Or <andrew@databricks.com>
Date:   Tue Mar 15 15:32:38 2016 -0700

    Implement SessionCatalog using ExternalCatalog

commit 196f7ce
Author: Andrew Or <andrew@databricks.com>
Date:   Tue Mar 15 14:39:22 2016 -0700

    Document and clean up function methods

commit 6d530a9
Author: Andrew Or <andrew@databricks.com>
Date:   Tue Mar 15 14:38:50 2016 -0700

    Fix tests

commit 2118212
Author: Andrew Or <andrew@databricks.com>
Date:   Tue Mar 15 14:33:20 2016 -0700

    Refactor CatalogFunction to use FunctionIdentifier

commit dd1fbae
Author: Andrew Or <andrew@databricks.com>
Date:   Tue Mar 15 14:22:37 2016 -0700

    Refactor CatalogTable to use TableIdentifier

    This is a standalone commit such that in the future we can split
    it out into a separate patch if preferable.
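
Roughly, the shape of that refactor looks like the following sketch (field names simplified and partly hypothetical):

```scala
case class TableIdentifier(table: String, database: Option[String] = None)

// Before (sketch): the table name and its database were carried as two
// separate fields on the table metadata.
case class OldCatalogTable(specifiedDatabase: Option[String], name: String /*, ... */)

// After (sketch): a single TableIdentifier carries both, so callers pass one
// value and name-resolution logic can live with the identifier type.
case class CatalogTable(identifier: TableIdentifier /*, schema, storage, ... */)
```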

commit 39a153c
Author: Andrew Or <andrew@databricks.com>
Date:   Tue Mar 15 13:53:42 2016 -0700

    Take into account current database in table methods

commit 5bf695c
Author: Andrew Or <andrew@databricks.com>
Date:   Mon Mar 14 17:14:59 2016 -0700

    Do the same for functions and partitions

commit 1d12578
Author: Andrew Or <andrew@databricks.com>
Date:   Mon Mar 14 16:27:11 2016 -0700

    Clean up table method signatures + add comments

commit 98c8a3b
Author: Andrew Or <andrew@databricks.com>
Date:   Thu Mar 10 16:35:35 2016 -0800

    Merge in @yhuai's changes

We need to be able to pass ExternalCatalog into the constructor
of SQLContext and its subclasses because it should be persistent
across sessions. Unfortunately, without significant refactoring
of the HiveContext and TestHive code we cannot make this simple
change happen.

This failed because SessionCatalog does not implement
refreshTable. This is a bigger problem because SessionCatalog
has no notion of caching tables in the first place, so it
doesn't really make sense for it to implement refreshTable. More
refactoring involving HiveMetastoreCatalog is required to
make this work.

This commit deletes the trait analysis.Catalog and all of its
subclasses, with one notable exception: HiveMetastoreCatalog
is kept because a lot of existing functionality (like caching
data source tables) is still needed. All other occurrences
are now replaced with SessionCatalog.

Unfortunately, because HiveMetastoreCatalog is a massive
sprawl of unmaintainable code, there is no clean way to
integrate it nicely with the new HiveCatalog. The path of
least resistance, then, is to route previous usages of
HiveMetastoreCatalog through HiveCatalog. This requires
some whacky initialization-order hacks because HMC takes
in HiveContext but HiveContext takes in HiveCatalog.

The biggest change here is moving HiveMetastoreCatalog from
HiveCatalog (the external one) to HiveSessionCatalog (the
session-specific one). This is needed because HMC depends on a
lot of session-specific things, e.g. creating data source tables.
This was failing tests that exercise multiple sessions, such as
HiveQuerySuite.

There were some issues with case sensitivity analysis and error
messages not being exactly as expected. The latter is now relaxed
where possible.
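
The HiveMetastoreCatalog move described above amounts to roughly the following arrangement (a minimal sketch with placeholder class bodies, not the actual Spark classes):

```scala
// Sketch only: placeholder definitions that show where each piece lives.
trait ExternalCatalog                                    // persistent, shared across sessions
class HiveCatalog extends ExternalCatalog                // the Hive-backed external catalog

class HiveMetastoreCatalog()                             // relies on session-specific state
class SessionCatalog(externalCatalog: ExternalCatalog)   // base session-level catalog

// HiveMetastoreCatalog now lives with the session-specific catalog rather
// than with the shared HiveCatalog, so each session gets its own instance.
class HiveSessionCatalog(
    externalCatalog: HiveCatalog,
    metastoreCatalog: HiveMetastoreCatalog)
  extends SessionCatalog(externalCatalog)
```
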
@andrewor14
Contributor Author

@yhuai @rxin

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53579 has finished for PR 11836 at commit 5e16480.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SimpleAnalyzer(conf: CatalystConf)
    • class SessionCatalog(externalCatalog: ExternalCatalog, conf: CatalystConf)
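
For reference, constructing the class with the reported constructor might look like the sketch below; the InMemoryCatalog and SimpleCatalystConf helpers (and their import paths) are assumptions about the surrounding catalyst code, and only the SessionCatalog signature itself comes from the report above.

```scala
// Sketch only (REPL-style): helper names and import paths are assumed,
// not taken from this PR; only the two-argument constructor is reported above.
import org.apache.spark.sql.catalyst.SimpleCatalystConf
import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}

val externalCatalog = new InMemoryCatalog
val conf = SimpleCatalystConf(caseSensitiveAnalysis = true)
val catalog = new SessionCatalog(externalCatalog, conf)
```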

@SparkQA

SparkQA commented Mar 19, 2016

Test build #53582 has finished for PR 11836 at commit c439280.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 19, 2016

Test build #53587 has finished for PR 11836 at commit f089e2b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 19, 2016

Test build #53585 has finished for PR 11836 at commit 34948a5.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

- private def existsPartition(db: String, table: String, spec: TablePartitionSpec): Boolean = {
+ private def partitionExists(db: String, table: String, spec: TablePartitionSpec): Boolean = {
Contributor


In the future, let's separate renames like this from large PRs.

@rxin
Contributor

rxin commented Mar 19, 2016

The changes look pretty good. We'd need to make Jenkins pass though.

@andrewor14
Contributor Author

all good, let's try this again

test this please

@SparkQA

SparkQA commented Mar 23, 2016

Test build #2664 has finished for PR 11836 at commit 5ea8469.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 23, 2016

looks really close ...

@SparkQA

SparkQA commented Mar 23, 2016

Test build #2662 has finished for PR 11836 at commit 5ea8469.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #2665 has finished for PR 11836 at commit 5ea8469.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53900 has finished for PR 11836 at commit 5ea8469.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53906 has finished for PR 11836 at commit 9519cd8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Mar 23, 2016

Ah, we also need to update the Python test now that we've changed the behavior of listing tables in a non-existent database.

@yhuai
Contributor

yhuai commented Mar 23, 2016

Looks like the conflict was introduced by https://github.com/apache/spark/pull/11889/files. It should be pretty minor, but let me see.

@yhuai
Contributor

yhuai commented Mar 23, 2016

looks like the conflicts are very easy to resolve.

dir
}

private lazy val temporaryConfig = newTemporaryConfiguration(useInMemoryDerby = false)
Contributor


I see why we have to remove

TestHiveContext.hiveClientConfigurations(hiveconf, warehousePath, scratchDirPath)
  .foreach { case (k, v) => metadataHive.runSqlHive(s"SET $k=$v") }

in 16a54ba. temporaryConfig was originally a lazy val, but if we use hiveClientConfigurations we actually generate new confs every time we call reset.
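
The distinction here is essentially lazy val (computed once and then cached) versus rebuilding the configuration on every reset. A minimal, standalone illustration of that difference, unrelated to the actual Hive client code:

```scala
object ConfCaching {
  private var builds = 0
  private def buildConf(): Map[String, String] = {
    builds += 1                               // count how many confs we generate
    Map("build" -> builds.toString)
  }

  // Evaluated at most once, on first access: every later use sees the cached map.
  lazy val temporaryConfig: Map[String, String] = buildConf()

  // Evaluated on every call: each "reset" would generate a fresh conf.
  def freshConfig(): Map[String, String] = buildConf()

  def main(args: Array[String]): Unit = {
    println(temporaryConfig)   // Map(build -> 1)
    println(temporaryConfig)   // Map(build -> 1), still the cached value
    println(freshConfig())     // Map(build -> 2)
    println(freshConfig())     // Map(build -> 3)
  }
}
```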

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53952 has finished for PR 11836 at commit c53f483.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53955 has finished for PR 11836 at commit ea1645b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Mar 23, 2016

Merging in master - thanks!

@asfgit asfgit closed this in 5dfc019 Mar 23, 2016
@andrewor14 andrewor14 deleted the use-session-catalog branch March 23, 2016 20:53
@SparkQA

SparkQA commented Mar 23, 2016

Test build #53956 has finished for PR 11836 at commit 350bd2e.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Mar 24, 2016

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes apache#11836 from andrewor14/use-session-catalog.
asfgit pushed a commit that referenced this pull request Mar 25, 2016
## What changes were proposed in this pull request?

This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests.

## How was this patch tested?

See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #11938 from andrewor14/session-catalog-again.