[SPARK-13139][SQL] Create native DDL commands #11048
Conversation
Test build #50658 has finished for PR 11048 at commit
Test build #50660 has finished for PR 11048 at commit
Thanks - this looks pretty good as a start. We will need to add many other DDLs, including ALTER TABLE, DROP TABLE, etc.
@rxin I've added ALTER TABLE command support. Since this command and its corresponding changes are big, I think this PR should cover only these three commands and leave the other commands to follow-up PRs. What do you think?
Test build #50919 has finished for PR 11048 at commit
Test build #50922 has finished for PR 11048 at commit
Test build #50965 has finished for PR 11048 at commit
@viirya yes, we can do this incrementally. Let's just create subtasks under https://issues.apache.org/jira/browse/SPARK-13139
```
@@ -62,6 +66,458 @@ private[sql] class SparkQl(conf: ParserConf = SimpleParserConf()) extends Cataly
      val tableIdent = extractTableIdent(nameParts)
      RefreshTable(tableIdent)

    case Token("TOK_CREATEDATABASE", Token(databaseName, Nil) :: createDatabaseArgs) =>
```
cc @hvanhovell
any suggestions on how we can make this file/function more modular? It is getting too long and we are about to add a lot more statements to it.
The main problem is LogicalPlan parsing, which we can split up in command/ddl/query parsing. We could use partial functions (or something like that) to implement the different parts of the parsing logic.
For instance:
```scala
abstract class BaseParser(val conf: ParserConf) extends ParserInterface {
  val planParsers: Seq[PlanParser]
  lazy val planParser = planParsers.reduce(_.orElse(_))

  def nodeToPlan(node: ASTNode): LogicalPlan = {
    // applyOrElse takes a function as its default, so the throw stays lazy
    planParser.applyOrElse(node, (n: ASTNode) => throw new NotImplementedError(n.text))
  }
}

abstract class PlanParser extends PartialFunction[ASTNode, LogicalPlan]

case class ExplainCommandParser(base: BaseParser) extends PlanParser {
  override def isDefinedAt(node: ASTNode): Boolean = node.text == "TOK_EXPLAIN"
  override def apply(v1: ASTNode): LogicalPlan = v1.children match {
    case (crtTbl @ Token("TOK_CREATETABLE" | "TOK_QUERY", _)) :: rest =>
      val extended = rest.exists(_.text.toUpperCase == "EXTENDED")
      ExplainCommand(base.nodeToPlan(crtTbl), extended)
  }
}

class SomeParser(conf: ParserConf) extends BaseParser(conf) {
  val planParsers: Seq[PlanParser] = Seq(ExplainCommandParser(this))
}
```
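The orElse-composition idea can be sketched in a self-contained form. This is only an illustration of the pattern, not Spark's actual API: plain strings stand in for `ASTNode` and `LogicalPlan`, and all names here are hypothetical.

```scala
// Illustrative sketch only: strings stand in for ASTNode and LogicalPlan
// so the composition pattern runs standalone.
object PlanParserSketch {
  type Node = String   // stand-in for ASTNode
  type Plan = String   // stand-in for LogicalPlan

  // Each "parser" handles one family of tokens, like ExplainCommandParser above.
  val explainParser: PartialFunction[Node, Plan] = {
    case n if n.startsWith("TOK_EXPLAIN") => s"ExplainCommand($n)"
  }
  val ddlParser: PartialFunction[Node, Plan] = {
    case n if n.startsWith("TOK_CREATEDATABASE") => s"CreateDatabase($n)"
  }

  // reduce(_.orElse(_)) chains the partial functions: the first one whose
  // isDefinedAt matches wins.
  val planParser: PartialFunction[Node, Plan] =
    Seq(explainParser, ddlParser).reduce(_.orElse(_))

  def nodeToPlan(node: Node): Plan =
    // applyOrElse wants a function as its fallback, keeping the throw lazy
    planParser.applyOrElse(node, (n: Node) => throw new NotImplementedError(n))
}
```

For example, `PlanParserSketch.nodeToPlan("TOK_CREATEDATABASE db1")` yields `"CreateDatabase(TOK_CREATEDATABASE db1)"`, while an unknown token throws `NotImplementedError`.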
btw @viirya can we create an execution.commands package for this?
```scala
import org.apache.spark.sql.catalyst.plans.PlanTest

class SparkQlSuite extends PlanTest {
```
We really should test the resulting plans here, and not wait for an AnalysisException to be thrown. I know this is a PITA, but it will save us a lot of headaches in the future.
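In that spirit, here is a minimal standalone sketch of asserting on the resulting plan rather than waiting for an analysis failure. The case classes and the toy `parse` below are purely illustrative stand-ins, not Spark's `PlanTest` or parser API.

```scala
// Toy model: assert on the parsed logical command directly.
object PlanComparisonSketch {
  sealed trait Plan
  case class CreateDatabase(name: String, ifNotExists: Boolean) extends Plan

  // Stand-in for a DDL parser: maps SQL text to an expected logical command.
  def parse(sql: String): Plan = {
    val ifNotExists = sql.toUpperCase.contains("IF NOT EXISTS")
    val name = sql.trim.split("\\s+").last
    CreateDatabase(name, ifNotExists)
  }

  // Comparing plans catches regressions at parse time, earlier than an
  // AnalysisException would surface them.
  def comparePlans(actual: Plan, expected: Plan): Unit =
    assert(actual == expected, s"expected $expected, got $actual")
}
```

Usage: `comparePlans(parse("CREATE DATABASE IF NOT EXISTS db1"), CreateDatabase("db1", ifNotExists = true))` passes, and any change in the produced plan fails the test immediately.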
@viirya I have made an initial pass. This PR is large enough as it is; let's not add more commands to it.
Test build #51356 has finished for PR 11048 at commit
```scala
abstract class NativeDDLCommands(val sql: String) extends RunnableCommand {
  override def run(sqlContext: SQLContext): Seq[Row] = {
    sqlContext.catalog.runNativeCommand(sql)
```
I see, it's because we want the NativeDDLCommands to be in the sql package, where it shouldn't know anything about Hive. In the future these commands will no longer be passed to Hive directly as a string, so they shouldn't be "native" anymore. For now, I would create a temporary method in SQLContext:
```scala
// In SQLContext.scala
// TODO: remove this once we call specific operations in the catalog instead
protected[sql] def runDDLCommand(text: String): Seq[Row] = {
  throw new UnsupportedOperationException
}

// In HiveContext.scala
protected[sql] override def runDDLCommand(text: String): Seq[Row] = {
  runHiveSql(text).map(Row(_))
}
```
Even though it's temporary, I still think it's cleaner than doing it in the catalog.
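A compact, runnable sketch of that override pattern follows. The class names are shortened and `runHiveSql` is faked here purely for illustration; this is not the real SQLContext/HiveContext code.

```scala
// Base context: DDL is unsupported until catalog-specific operations exist.
class BaseContextSketch {
  protected def runDDLCommand(text: String): Seq[String] =
    throw new UnsupportedOperationException(s"DDL not supported: $text")

  def sql(text: String): Seq[String] = runDDLCommand(text)
}

// Hive-backed context: overrides the hook with a real implementation.
class HiveContextSketch extends BaseContextSketch {
  // Stand-in for passing the text to Hive, as the real code still does.
  private def runHiveSql(text: String): Seq[String] = Seq(s"hive: $text")

  override protected def runDDLCommand(text: String): Seq[String] =
    runHiveSql(text)
}
```

With this shape, `new HiveContextSketch().sql("CREATE DATABASE db1")` succeeds while the base context throws, which is the point of keeping the temporary method out of the catalog.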
@viirya Thanks for working on this. The overall approach is very reasonable but a few things can be improved:
Additionally, I think the reason this patch is so big is that we moved around a lot of files. The moving itself can be done in a separate PR, which would reduce the diff significantly and make the patch easier to review. Addressing all the outstanding comments will likely take a long time. Would you mind if I take this over? I'll be sure to give you credit in the final patch.
@andrewor14 Thanks for reviewing. I don't mind if you want to take this over. Thanks for the credit!
## What changes were proposed in this pull request?
This patch simply moves things to a new package in an effort to reduce the size of the diff in #11048. Currently the new package only has one file, but in the future we'll add many new commands in SPARK-13139.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes #11482 from andrewor14/commands-package.

## What changes were proposed in this pull request?
This patch simply moves things to existing package `o.a.s.sql.catalyst.parser` in an effort to reduce the size of the diff in #11048. This is conceptually the same as a recently merged patch #11482.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes #11506 from andrewor14/parser-package.

## What changes were proposed in this pull request?
When we add more DDL parsing logic in the future, SparkQl will become very big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to parse alter table commands. However, these parser objects will need to access some helper methods that exist in CatalystQl. The proposal is to move those methods to an isolated ParserUtils object. This is based on viirya's changes in #11048. It prefaces the bigger fix for SPARK-13139 to make the diff of that patch smaller.
## How was this patch tested?
No change in functionality, so just Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes #11529 from andrewor14/parser-utils.

## What changes were proposed in this pull request?
This patch is ported over from viirya's changes in #11048. Currently for most DDLs we just pass the query text directly to Hive. Instead, we should parse these commands ourselves and in the future (not part of this patch) use the `HiveCatalog` to process these DDLs. This is a pretext to merging `SQLContext` and `HiveContext`. Note: As of this patch we still pass the query text to Hive. The difference is that we now parse the commands ourselves so in the future we can just use our own catalog.
## How was this patch tested?
Jenkins, new `DDLCommandSuite`, which comprises about 40% of the changes here.
Author: Andrew Or <andrew@databricks.com>
Closes #11573 from andrewor14/parser-plus-plus.
JIRA: https://issues.apache.org/jira/browse/SPARK-13139
From JIRA: We currently delegate most DDLs directly to Hive, through NativePlaceholder in HiveQl.scala. In Spark 2.0, we want to provide native implementations for DDLs for both SQLContext and HiveContext.
This PR takes the first step: parse DDL commands and create logical commands that encapsulate them. Actual implementations still delegate to HiveNativeCommand for now.
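The parse-but-still-delegate idea can be sketched as follows. All names here are illustrative, not the PR's actual classes: each typed command keeps the original SQL text so execution can still hand it to Hive, the way HiveNativeCommand does.

```scala
// Typed logical commands that still carry the raw SQL for delegation.
sealed trait NativeCommandSketch { def sql: String }
case class CreateDatabaseCommand(name: String, sql: String) extends NativeCommandSketch
case class AlterTableCommand(table: String, sql: String) extends NativeCommandSketch

object DdlParserSketch {
  // Recognize the command shape ourselves instead of passing opaque text on.
  def parse(sql: String): NativeCommandSketch = {
    val tokens = sql.trim.split("\\s+")
    tokens.take(2).map(_.toUpperCase).toList match {
      case "CREATE" :: "DATABASE" :: Nil => CreateDatabaseCommand(tokens(2), sql)
      case "ALTER" :: "TABLE" :: Nil     => AlterTableCommand(tokens(2), sql)
      case _ => throw new IllegalArgumentException(s"unsupported DDL: $sql")
    }
  }
}
```

Once the command is typed, execution can later switch from "run `sql` via Hive" to catalog-specific operations without touching the parser again, which is the migration path this PR sets up.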