[SPARK-12689][SQL] Migrate DDL parsing to the newly absorbed parser #10723

Closed · wants to merge 20 commits · showing changes from 17 commits
5 changes: 5 additions & 0 deletions project/MimaExcludes.scala
@@ -153,6 +153,11 @@ object MimaExcludes {
ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.streaming.flume.sink.Logging.org$apache$spark$streaming$flume$sink$Logging$$_log_="),
ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.streaming.flume.sink.TransactionProcessor.org$apache$spark$streaming$flume$sink$Logging$$log_"),
ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.streaming.flume.sink.TransactionProcessor.org$apache$spark$streaming$flume$sink$Logging$$log__=")
) ++ Seq(
// SPARK-12689 Migrate DDL parsing to the newly absorbed parser
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.execution.datasources.DDLParser"),
Contributor: Please add a comment saying which JIRA ticket causes this; see the examples above.

ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.execution.datasources.DDLException"),
ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.sql.SQLContext.ddlParser")
) ++ Seq(
// SPARK-7799 Add "streaming-akka" project
ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.streaming.zeromq.ZeroMQUtils.createStream"),
@@ -493,6 +493,16 @@ descFuncNames
| functionIdentifier
;

// We are allowed to use FROM and TO in a CreateTableUsing command's options (it seems any string can actually be used as an option key). But we can't simply add them to nonReserved, because doing so would mess up other existing rules. So we create looseIdentifier and looseNonReserved here.
looseIdentifier
:
Identifier
| looseNonReserved -> Identifier[$looseNonReserved.text]
// If it decides to support SQL11 reserved keywords, i.e., useSQL11ReservedKeywordsForIdentifier()=false,
// the sql11keywords in existing q tests will NOT be added back.
| {useSQL11ReservedKeywordsForIdentifier()}? sql11ReservedKeywordsUsedAsIdentifier -> Identifier[$sql11ReservedKeywordsUsedAsIdentifier.text]
;

identifier
:
Identifier
@@ -518,6 +528,10 @@ principalIdentifier
| QuotedIdentifier
;

looseNonReserved
: nonReserved | KW_FROM | KW_TO
;

Member Author: We are allowed to use FROM and TO in a CreateTableUsing command's options (it seems any string can actually be used as an option key), but we can't simply add them to nonReserved because doing so would mess up other existing rules. So we create looseIdentifier and looseNonReserved here.

Contributor: Why not add this to the option rule directly?

Member Author: Because I don't know whether we will add other reserved words later; if so, the option rule might get too long. I haven't checked whether any other keywords are missing from nonReserved.

Member Author: Both approaches (the current one, or adding it to the option rule) are okay with me.

Contributor: Could you add your initial line comment as a comment in the code?

Member Author: Thanks for the reminder. I've added it.
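
For illustration, a statement exercising this rule might look like the following (the table name, provider, and option values here are made up):

sql("CREATE TEMPORARY TABLE t USING org.apache.spark.sql.json OPTIONS (from '1', to '10', path '/hypothetical/path')")

Here from and to are accepted as option keys via looseIdentifier even though FROM and TO are keywords.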

//The new version of nonReserved + sql11ReservedKeywordsUsedAsIdentifier = old version of nonReserved
//Non reserved keywords are basically the keywords that can be used as identifiers.
//All the KW_* are automatically not only keywords, but also reserved keywords.
@@ -324,6 +324,8 @@ KW_ISOLATION: 'ISOLATION';
KW_LEVEL: 'LEVEL';
KW_SNAPSHOT: 'SNAPSHOT';
KW_AUTOCOMMIT: 'AUTOCOMMIT';
KW_REFRESH: 'REFRESH';
KW_OPTIONS: 'OPTIONS';
KW_WEEK: 'WEEK'|'WEEKS';
KW_MILLISECOND: 'MILLISECOND'|'MILLISECONDS';
KW_MICROSECOND: 'MICROSECOND'|'MICROSECONDS';
@@ -465,7 +467,7 @@ Identifier
fragment
QuotedIdentifier
:
'`' ( '``' | ~('`') )* '`' { setText(getText().substring(1, getText().length() -1 ).replaceAll("``", "`")); }
'`' ( '``' | ~('`') )* '`' { setText(getText().replaceAll("``", "`")); }
Member Author: The old rule simply strips the backquotes. I think we should keep them because they have special meaning; at least the column name rule will need them.

Contributor: So we are not stripping quotes in the middle of strings anymore?

Member Author: We just don't strip the first and last backquotes anymore, since I removed the call to substring.

Contributor: So we want backticks at the beginning and the end of the identifier? I thought the first and last backtick were a means of identifying a quoted identifier, not part of the name. Do these backticks remain part of the name throughout the code?

Contributor: You are using the CatalystQl cleanIdentifier function to get rid of the remaining backticks. What happens if we do this:

sql("select * from (select 1 as ```weird`name`) as a")

Member Author: The above query gets a ParseException (mismatched character '' expecting '`') in both Hive and this PR.

Contributor: Sorry, I mispasted the query (GitHub also uses backticks for escaping):

sql("select * from (select 1 as ```weird``name`) as a")

We currently also support backticks in the name. The regex used in CatalystQl.cleanIdentifier is quite limited and will not strip the initial and final backticks in this case. So we were expecting this:

`weird`name

And getting this:

``weird`name`

Member Author: Then I think we can make cleanIdentifier strip the initial and final backticks for this case too. Actually, it should strip the initial and final backticks in all cases.

Contributor: That would solve it.

Member Author: Yeah, I will update this later. Thanks.
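
A minimal sketch of that follow-up, assuming cleanIdentifier simply drops one pair of outer backticks (the exact implementation is a guess, not what was merged):

protected def cleanIdentifier(ident: String): String = {
  // The lexer now keeps the outer backticks and only collapses doubled
  // (escaped) backticks, so strip one leading and one trailing backtick here.
  if (ident.length > 1 && ident.head == '`' && ident.last == '`') {
    ident.substring(1, ident.length - 1)
  } else {
    ident
  }
}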

;

WS : (' '|'\r'|'\t'|'\n') {$channel=HIDDEN;}
@@ -142,6 +142,7 @@ TOK_UNIONTYPE;
TOK_COLTYPELIST;
TOK_CREATEDATABASE;
TOK_CREATETABLE;
TOK_CREATETABLEUSING;
TOK_TRUNCATETABLE;
TOK_CREATEINDEX;
TOK_CREATEINDEX_INDEXTBLNAME;
@@ -371,6 +372,10 @@ TOK_TXN_READ_WRITE;
TOK_COMMIT;
TOK_ROLLBACK;
TOK_SET_AUTOCOMMIT;
TOK_REFRESHTABLE;
TOK_TABLEPROVIDER;
TOK_TABLEOPTIONS;
TOK_TABLEOPTION;
}


@@ -648,6 +653,12 @@ import java.util.HashMap;
}
private char [] excludedCharForColumnName = {'.', ':'};
private boolean containExcludedCharForCreateTableColumnName(String input) {
  if (input.length() > 0) {
    if (input.charAt(0) == '`' && input.charAt(input.length() - 1) == '`') {
      // When column name is backquoted, we don't care about excluded chars.
      return false;
    }
  }
Member Author: As the comment says, when we specify column names in backquotes, we can use these excluded chars.
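
For example, a dotted column name is now accepted when backquoted (hypothetical statement):

sql("CREATE TEMPORARY TABLE t (`a.b` INT) USING org.apache.spark.sql.json OPTIONS (path '/hypothetical/path')")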

  for (char c : excludedCharForColumnName) {
    if (input.indexOf(c) > -1) {
      return true;
@@ -764,6 +775,7 @@ ddlStatement
| truncateTableStatement
| alterStatement
| descStatement
| refreshStatement
| showStatement
| metastoreCheck
| createViewStatement
@@ -890,12 +902,31 @@ createTableStatement
@init { pushMsg("create table statement", state); }
@after { popMsg(state); }
: KW_CREATE (temp=KW_TEMPORARY)? (ext=KW_EXTERNAL)? KW_TABLE ifNotExists? name=tableName
( like=KW_LIKE likeName=tableName
(
like=KW_LIKE likeName=tableName
tableRowFormat?
tableFileFormat?
tableLocation?
tablePropertiesPrefixed?
-> ^(TOK_CREATETABLE $name $temp? $ext? ifNotExists?
^(TOK_LIKETABLE $likeName?)
tableRowFormat?
tableFileFormat?
tableLocation?
tablePropertiesPrefixed?
)
|
tableProvider
tableOpts?
(KW_AS selectStatementWithCTE)?
-> ^(TOK_CREATETABLEUSING $name $temp? ifNotExists?
tableProvider
tableOpts?
selectStatementWithCTE?
)
| (LPAREN columnNameTypeList RPAREN)?
(p=tableProvider?)
tableOpts?
tableComment?
tablePartition?
tableBuckets?
@@ -905,8 +936,15 @@
tableLocation?
tablePropertiesPrefixed?
(KW_AS selectStatementWithCTE)?
)
-> ^(TOK_CREATETABLE $name $temp? $ext? ifNotExists?
-> {p != null}?
^(TOK_CREATETABLEUSING $name $temp? ifNotExists?
columnNameTypeList?
$p
tableOpts?
selectStatementWithCTE?
)
->
^(TOK_CREATETABLE $name $temp? $ext? ifNotExists?
^(TOK_LIKETABLE $likeName?)
columnNameTypeList?
tableComment?
Expand All @@ -918,7 +956,8 @@ createTableStatement
tableLocation?
tablePropertiesPrefixed?
selectStatementWithCTE?
)
)
)
;
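
To make the new alternatives concrete, here are hedged examples of statements the TOK_CREATETABLEUSING branches should now accept (providers, names, and paths are made up):

sql("CREATE TEMPORARY TABLE t USING org.apache.spark.sql.json OPTIONS (path '/hypothetical/path')")
sql("CREATE TABLE t2 (id INT, name STRING) USING parquet")
sql("CREATE TABLE t3 USING parquet OPTIONS (path '/hypothetical/path') AS SELECT 1 AS id")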

truncateTableStatement
@@ -1362,6 +1401,13 @@ tabPartColTypeExpr
: tableName partitionSpec? extColumnName? -> ^(TOK_TABTYPE tableName partitionSpec? extColumnName?)
;

refreshStatement
@init { pushMsg("refresh statement", state); }
@after { popMsg(state); }
:
KW_REFRESH KW_TABLE tableName -> ^(TOK_REFRESHTABLE tableName)
;
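
For reference, the new rule accepts statements like this (hypothetical names):

sql("REFRESH TABLE mydb.mytable")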

descStatement
@init { pushMsg("describe statement", state); }
@after { popMsg(state); }
@@ -1757,6 +1803,30 @@ showStmtIdentifier
| StringLiteral
;

tableProvider
@init { pushMsg("table's provider", state); }
@after { popMsg(state); }
:
KW_USING Identifier (DOT Identifier)*
-> ^(TOK_TABLEPROVIDER Identifier+)
;

optionKeyValue
@init { pushMsg("table's option specification", state); }
@after { popMsg(state); }
:
(looseIdentifier (DOT looseIdentifier)*) StringLiteral
-> ^(TOK_TABLEOPTION looseIdentifier+ StringLiteral)
;

tableOpts
@init { pushMsg("table's options", state); }
@after { popMsg(state); }
:
KW_OPTIONS LPAREN optionKeyValue (COMMA optionKeyValue)* RPAREN
-> ^(TOK_TABLEOPTIONS optionKeyValue+)
;

tableComment
@init { pushMsg("table's comment", state); }
@after { popMsg(state); }
@@ -2115,7 +2185,7 @@ structType
mapType
@init { pushMsg("map type", state); }
@after { popMsg(state); }
: KW_MAP LESSTHAN left=primitiveType COMMA right=type GREATERTHAN
: KW_MAP LESSTHAN left=type COMMA right=type GREATERTHAN
Member Author: The key in a Map can be any type.
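
With the restriction lifted, a DDL like this (hypothetical) should parse, since the map key is no longer limited to primitive types:

sql("CREATE TEMPORARY TABLE t (m MAP<ARRAY<INT>, STRING>) USING org.apache.spark.sql.json OPTIONS (path '/hypothetical/path')")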

-> ^(TOK_MAP $left $right)
;

@@ -140,6 +140,7 @@ private[sql] class CatalystQl(val conf: ParserConf = SimpleParserConf()) extends
case Token("TOK_BOOLEAN", Nil) => BooleanType
case Token("TOK_STRING", Nil) => StringType
case Token("TOK_VARCHAR", Token(_, Nil) :: Nil) => StringType
case Token("TOK_CHAR", Token(_, Nil) :: Nil) => StringType
Member Author: We need to support the Char type.
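
For example (hypothetical table), both of these columns should now come back as StringType:

sql("CREATE TEMPORARY TABLE t (c CHAR(10), v VARCHAR(10)) USING org.apache.spark.sql.json OPTIONS (path '/hypothetical/path')")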

case Token("TOK_FLOAT", Nil) => FloatType
case Token("TOK_DOUBLE", Nil) => DoubleType
case Token("TOK_DATE", Nil) => DateType
@@ -156,9 +157,10 @@ private[sql] class CatalystQl(val conf: ParserConf = SimpleParserConf()) extends

protected def nodeToStructField(node: ASTNode): StructField = node match {
case Token("TOK_TABCOL", Token(fieldName, Nil) :: dataType :: Nil) =>
StructField(fieldName, nodeToDataType(dataType), nullable = true)
case Token("TOK_TABCOL", Token(fieldName, Nil) :: dataType :: _ /* comment */:: Nil) =>
StructField(fieldName, nodeToDataType(dataType), nullable = true)
StructField(cleanIdentifier(fieldName), nodeToDataType(dataType), nullable = true)
case Token("TOK_TABCOL", Token(fieldName, Nil) :: dataType :: comment :: Nil) =>
val meta = new MetadataBuilder().putString("comment", unquoteString(comment.text)).build()
StructField(cleanIdentifier(fieldName), nodeToDataType(dataType), nullable = true, meta)
Member Author: Add the comment to StructField's metadata.
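
A hedged sketch of how the stored comment could be read back later (the table and column names are made up):

val field = sqlContext.table("t").schema("c")
val comment = field.metadata.getString("comment") // the metadata key written above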

case _ =>
noParseRule("StructField", node)
}
@@ -633,15 +635,15 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
nodeToExpr(qualifier) match {
case UnresolvedAttribute(nameParts) =>
UnresolvedAttribute(nameParts :+ cleanIdentifier(attr))
case other => UnresolvedExtractValue(other, Literal(attr))
case other => UnresolvedExtractValue(other, Literal(cleanIdentifier(attr)))
}

/* Stars (*) */
case Token("TOK_ALLCOLREF", Nil) => UnresolvedStar(None)
// The format of dbName.tableName.* cannot be parsed by HiveParser. TOK_TABNAME will only
// has a single child which is tableName.
case Token("TOK_ALLCOLREF", Token("TOK_TABNAME", target) :: Nil) if target.nonEmpty =>
UnresolvedStar(Some(target.map(_.text)))
UnresolvedStar(Some(target.map(x => cleanIdentifier(x.text))))

/* Aggregate Functions */
case Token("TOK_FUNCTIONDI", Token(COUNT(), Nil) :: args) =>
@@ -949,7 +951,7 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
protected def nodeToGenerate(node: ASTNode, outer: Boolean, child: LogicalPlan): Generate = {
val Token("TOK_SELECT", Token("TOK_SELEXPR", clauses) :: Nil) = node

val alias = getClause("TOK_TABALIAS", clauses).children.head.text
val alias = cleanIdentifier(getClause("TOK_TABALIAS", clauses).children.head.text)

val generator = clauses.head match {
case Token("TOK_FUNCTION", Token(explode(), Nil) :: childNode :: Nil) =>
@@ -205,10 +205,7 @@ class SQLContext private[sql](
@transient
protected[sql] val sqlParser: ParserInterface = new SparkSQLParser(new SparkQl(conf))

@transient
protected[sql] val ddlParser: DDLParser = new DDLParser(sqlParser)

protected[sql] def parseSql(sql: String): LogicalPlan = ddlParser.parse(sql, false)
protected[sql] def parseSql(sql: String): LogicalPlan = sqlParser.parsePlan(sql)

protected[sql] def executeSql(sql: String):
org.apache.spark.sql.execution.QueryExecution = executePlan(parseSql(sql))