
[SPARK-13139][SQL] Create native DDL commands #11048

Closed
wants to merge 20 commits

Conversation


@viirya viirya commented Feb 3, 2016

JIRA: https://issues.apache.org/jira/browse/SPARK-13139

From JIRA: We currently delegate most DDLs directly to Hive, through NativePlaceholder in HiveQl.scala. In Spark 2.0, we want to provide native implementations for DDLs for both SQLContext and HiveContext.

This PR takes the first step: it parses DDL commands and creates logical commands that encapsulate them. The actual implementations still delegate to HiveNativeCommand for now.
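The two-phase approach described above (parse into a logical command now, keep native execution for later) can be illustrated with a minimal, self-contained sketch. All class and function names below are hypothetical stand-ins, not the actual Spark classes:

```scala
// Hypothetical, simplified model of the approach: recognized DDL text is parsed
// into a typed logical command, while anything unrecognized falls back to a raw
// pass-through (playing the role of HiveNativeCommand). Not real Spark classes.
sealed trait LogicalCommand
case class CreateDatabase(name: String, ifNotExists: Boolean) extends LogicalCommand
case class NativeCommand(sql: String) extends LogicalCommand // raw delegate to Hive

def parseDdl(sql: String): LogicalCommand = {
  val CreateDb = """(?i)CREATE\s+DATABASE\s+(IF\s+NOT\s+EXISTS\s+)?(\w+)""".r
  sql.trim match {
    case CreateDb(ifNotExists, name) => CreateDatabase(name, ifNotExists != null)
    case other                       => NativeCommand(other) // still delegated as text
  }
}
```

Even with execution still delegated, having typed commands lets the analyzer and catalog inspect the statement instead of treating it as an opaque string.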


SparkQA commented Feb 3, 2016

Test build #50658 has finished for PR 11048 at commit 77252af.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateDataBase(
    • case class CreateFunction(


SparkQA commented Feb 3, 2016

Test build #50660 has finished for PR 11048 at commit 2b38d11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


rxin commented Feb 3, 2016

Thanks - this looks pretty good as a start. We will need to add many other DDLs, including alter/drop table, etc.


viirya commented Feb 8, 2016

@rxin I've added alter table command support. As this command and the corresponding change are big, I think this PR should cover only these three commands and leave the other commands to follow-up PRs. What do you think?


SparkQA commented Feb 8, 2016

Test build #50919 has finished for PR 11048 at commit f9c1397.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 8, 2016

Test build #50922 has finished for PR 11048 at commit eb1fab7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 9, 2016

Test build #50965 has finished for PR 11048 at commit 3db2e1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


rxin commented Feb 10, 2016

@viirya yes we can do this incrementally. Let's just create subtasks under https://issues.apache.org/jira/browse/SPARK-13139

@@ -62,6 +66,458 @@ private[sql] class SparkQl(conf: ParserConf = SimpleParserConf()) extends Cataly
val tableIdent = extractTableIdent(nameParts)
RefreshTable(tableIdent)

case Token("TOK_CREATEDATABASE", Token(databaseName, Nil) :: createDatabaseArgs) =>

cc @hvanhovell

any suggestions on how we can make this file/function more modular? It is getting too long and we are about to add a lot more statements to it.


The main problem is LogicalPlan parsing, which we can split up into command/DDL/query parsing. We could use partial functions (or something similar) to implement the different parts of the parsing logic.

For instance:

abstract class BaseParser(val conf: ParserConf) extends ParserInterface {
  val planParsers: Seq[PlanParser]

  lazy val planParser = planParsers.reduce(_.orElse(_))

  def nodeToPlan(node: ASTNode): LogicalPlan = {
    planParser.applyOrElse(node, (n: ASTNode) => throw new NotImplementedError(n.text))
  }
}

abstract class PlanParser extends PartialFunction[ASTNode, LogicalPlan]

case class ExplainCommandParser(base: BaseParser) extends PlanParser {

  override def isDefinedAt(node: ASTNode): Boolean = node.text == "TOK_EXPLAIN"

  override def apply(v1: ASTNode): LogicalPlan = v1.children match {
    case (crtTbl @ Token("TOK_CREATETABLE" | "TOK_QUERY", _)) :: rest =>
      val extended = rest.exists(_.text.toUpperCase == "EXTENDED")
      ExplainCommand(base.nodeToPlan(crtTbl), extended)
  }
}

class SomeParser(conf: ParserConf) extends BaseParser(conf) {
  val planParsers: Seq[PlanParser] = Seq(
    ExplainCommandParser(this))
}
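Under stated assumptions (plain strings standing in for ASTNode and LogicalPlan, with illustrative parser names), the composition idea above can be exercised end to end with a self-contained toy version:

```scala
// Toy, self-contained version of the PlanParser composition idea: partial
// functions are chained with orElse, and unhandled nodes raise an error.
// Strings stand in for ASTNode/LogicalPlan; all names are illustrative only.
type Node = String

abstract class PlanParser extends PartialFunction[Node, String]

object ExplainParser extends PlanParser {
  override def isDefinedAt(node: Node): Boolean = node.startsWith("EXPLAIN ")
  override def apply(node: Node): String = s"Explain(${node.stripPrefix("EXPLAIN ")})"
}

object QueryParser extends PlanParser {
  override def isDefinedAt(node: Node): Boolean = node.startsWith("SELECT")
  override def apply(node: Node): String = s"Query($node)"
}

// Chain the parsers; order in the Seq determines precedence.
val planParser: PartialFunction[Node, String] =
  Seq[PlanParser](ExplainParser, QueryParser).reduce[PartialFunction[Node, String]](_ orElse _)

def nodeToPlan(node: Node): String =
  planParser.applyOrElse(node, (n: Node) => throw new NotImplementedError(n))
```

Adding support for a new statement then means adding one small parser object to the Seq, rather than growing a single giant match expression.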


rxin commented Feb 11, 2016

btw @viirya can we create an execution.commands package for this?


import org.apache.spark.sql.catalyst.plans.PlanTest

class SparkQlSuite extends PlanTest {

We really should test the resulting plans here, and not wait for an AnalysisException to be thrown. I know this is a PITA, but it will save us a lot of headaches in the future.

@hvanhovell
Copy link
Contributor

@viirya I have made an initial pass. This PR is large enough as it is; let's not add more commands to it.

@viirya viirya changed the title [SPARK-13139][SQL][WIP] Create native DDL commands [SPARK-13139][SQL] Create native DDL commands Feb 16, 2016

SparkQA commented Feb 16, 2016

Test build #51356 has finished for PR 11048 at commit 170bd77.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class BaseParser(val conf: ParserConf) extends ParserInterface with ParserBase
    • trait ParserBase
    • abstract class NativeDDLCommands(val sql: String) extends RunnableCommand
    • case class AlterTableSetProperties(
    • case class AlterTableDropProperties(
    • case class AlterTableCommandParser(base: CatalystQl) extends PlanParser


abstract class NativeDDLCommands(val sql: String) extends RunnableCommand {
  override def run(sqlContext: SQLContext): Seq[Row] = {
    sqlContext.catalog.runNativeCommand(sql)
  }
}

I see, it's because we want the NativeDDLCommands to be in the sql package, where it shouldn't know anything about Hive. In the future these commands will no longer be passed to Hive directly as a string so they shouldn't be "native" anymore. For now, I would create a temporary method in SQLContext:

// In SQLContext.scala
// TODO: remove this once we call specific operations in the catalog instead
protected[sql] def runDDLCommand(text: String): Seq[Row] = {
  throw new UnsupportedOperationException
}

// In HiveContext.scala
protected[sql] override def runDDLCommand(text: String): Seq[Row] = {
  runHiveSql(text).map(Row(_))
}

Even though it's temporary, I still think it's cleaner than doing it in the catalog.
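The stub-and-override layering suggested above can be sketched with simplified stand-in types; the class names are hypothetical and `Seq[String]` plays the role of `Seq[Row]`:

```scala
// Simplified sketch of the suggested layering: the base context exposes a stub
// that throws, and the Hive-backed context overrides it. Stand-in types only.
class SQLContextLike {
  // TODO (per the suggestion): remove once specific catalog operations are called
  protected def runDDLCommand(text: String): Seq[String] =
    throw new UnsupportedOperationException("native DDL requires a Hive-backed context")

  def sql(text: String): Seq[String] = runDDLCommand(text)
}

class HiveContextLike extends SQLContextLike {
  override protected def runDDLCommand(text: String): Seq[String] =
    Seq(s"hive: $text") // stands in for runHiveSql(text).map(Row(_))
}
```

This keeps the Hive dependency out of the base context while giving the DDL commands a single hook to call, which can later be replaced by real catalog operations.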

@andrewor14

@viirya Thanks for working on this. The overall approach is very reasonable but a few things can be improved:

  • many files need to be moved to the new packages
  • many unused imports in all the files
  • many fields and methods can be marked as private, but are not
  • the splitting of commands into 2 files is kind of arbitrary. IIUC all native commands we currently pass to Hive directly today are grouped under ddl.scala, and the rest are grouped under commands.scala. In the future, however, we would like to handle all the DDLs ourselves, so this division won't make sense anymore.
  • general lack of comments in this area of the code makes things difficult to follow (not your fault)

Additionally, I think this patch is so big mainly because it moves around a lot of files. The moves themselves can be done in a separate PR, which would reduce the diff significantly and make the patch easier to review.

Addressing all the outstanding comments will likely take a long time. Would you mind if I take this over? I'll be sure to give you credit in the final patch.


viirya commented Mar 3, 2016

@andrewor14 Thanks for reviewing. I don't mind if you want to take this over. Thanks for the credit!

asfgit pushed a commit that referenced this pull request Mar 3, 2016
## What changes were proposed in this pull request?

This patch simply moves things to a new package in an effort to reduce the size of the diff in #11048. Currently the new package only has one file, but in the future we'll add many new commands in SPARK-13139.

## How was this patch tested?

Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #11482 from andrewor14/commands-package.
asfgit pushed a commit that referenced this pull request Mar 4, 2016
## What changes were proposed in this pull request?

This patch simply moves things to existing package `o.a.s.sql.catalyst.parser` in an effort to reduce the size of the diff in #11048. This is conceptually the same as a recently merged patch #11482.

## How was this patch tested?

Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #11506 from andrewor14/parser-package.
andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Mar 8, 2016
asfgit pushed a commit that referenced this pull request Mar 8, 2016
## What changes were proposed in this pull request?

When we add more DDL parsing logic in the future, SparkQl will become very big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to parse alter table commands. However, these parser objects will need to access some helper methods that exist in CatalystQl. The proposal is to move those methods to an isolated ParserUtils object.

This is based on viirya's changes in #11048. It prefaces the bigger fix for SPARK-13139 to make the diff of that patch smaller.

## How was this patch tested?

No change in functionality, so just Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #11529 from andrewor14/parser-utils.
asfgit pushed a commit that referenced this pull request Mar 11, 2016
## What changes were proposed in this pull request?

This patch is ported over from viirya's changes in #11048. Currently for most DDLs we just pass the query text directly to Hive. Instead, we should parse these commands ourselves and in the future (not part of this patch) use the `HiveCatalog` to process these DDLs. This is a pretext to merging `SQLContext` and `HiveContext`.

Note: As of this patch we still pass the query text to Hive. The difference is that we now parse the commands ourselves so in the future we can just use our own catalog.

## How was this patch tested?

Jenkins, new `DDLCommandSuite`, which comprises about 40% of the changes here.

Author: Andrew Or <andrew@databricks.com>

Closes #11573 from andrewor14/parser-plus-plus.
@viirya viirya closed this Mar 11, 2016
@viirya viirya deleted the native-ddl branch December 27, 2023 18:33