Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-28697][SQL] Invalidate Database/Table names starting with underscore #25448

Open
wants to merge 3 commits into
base: master
from

Conversation

@ajithme
Copy link
Contributor

commented Aug 14, 2019

What changes were proposed in this pull request?

I think we should disallow if a identifier starts with _ for create database and create table
Partially we can see its effect in SPARK-28697 where as the table name starts with _ (like _sampleTable) , the FileFormat assumes it to be a hidden folder and do not list it which causes unusual behavior

How was this patch tested?

Avoiding creating tables and databases with names starting from underscore. Added test case for same

@ajithme

This comment has been minimized.

Copy link
Contributor Author

commented Aug 14, 2019

@dongjoon-hyun @cloud-fan @HyukjinKwon please review and let me know your opinion on the fix

@ajithme ajithme changed the title [SPARK-28697] Invalidate names starting with _ to avoid unexpected behaviour [SPARK-28697] Invalidate Database/Table names starting with underscore Aug 14, 2019

@HyukjinKwon

This comment has been minimized.

Copy link
Member

commented Aug 14, 2019

ok to test

@SparkQA

This comment has been minimized.

Copy link

commented Aug 14, 2019

Test build #109093 has finished for PR 25448 at commit cef86c7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@ajithme

This comment has been minimized.

Copy link
Contributor Author

commented Aug 14, 2019

retest this please

@SparkQA

This comment has been minimized.

Copy link

commented Aug 14, 2019

Test build #109097 has finished for PR 25448 at commit 439d99e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@@ -102,7 +102,7 @@ class SessionCatalog(
@GuardedBy("this")
protected var currentDb: String = formatDatabaseName(DEFAULT_DATABASE)

private val validNameFormat = "([\\w_]+)".r
private val validNameFormat = "([a-zA-Z0-9]+[\\w]*)".r

This comment has been minimized.

Copy link
@cloud-fan

cloud-fan Aug 14, 2019

Contributor

can we use 0-9 as the first letter of table name?

This comment has been minimized.

Copy link
@ajithme

ajithme Aug 14, 2019

Author Contributor

yes, same as before.. just restricts name not to start with underscore

This comment has been minimized.

Copy link
@pvk2727

pvk2727 Aug 16, 2019

we have restricted only _ by this change ?

This comment has been minimized.

Copy link
@pvk2727

pvk2727 Aug 16, 2019

and abiding by hive standard

@SparkQA

This comment has been minimized.

Copy link

commented Aug 14, 2019

Test build #109104 has finished for PR 25448 at commit e49aa2b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28697] Invalidate Database/Table names starting with underscore [SPARK-28697][SQL] Invalidate Database/Table names starting with underscore Aug 14, 2019

@dongjoon-hyun dongjoon-hyun added the SQL label Aug 14, 2019

@dongjoon-hyun

This comment has been minimized.

Copy link
Member

commented Aug 14, 2019

@ajithme . As you see in the UT failures, this is exactly reverting our previous efforts. Please don't change the UT and try to narrow down this patch to fix only your issues (Hive tables?).

SPARK-19059: read file based table whose name starts with underscore

cc @gatorsmile

@HyukjinKwon

This comment has been minimized.

Copy link
Member

commented Aug 15, 2019

Do we need to keep the failed unittest case actually?
Seems Hive doesn't fix it too given HIVE-6431. If that's an unreasonable behaviour, let's drop that support and update migration guide - from a cursory look seems like there are multiple places to fix to support the table with leading _ and doesn't look worthy.

@cloud-fan

This comment has been minimized.

Copy link
Contributor

commented Aug 15, 2019

Wait, does table name starting with _ work in Spark currently? From SPARK-19059 it seems supported, but from SPARK-28697 it seems not.

@dilipbiswal

This comment has been minimized.

Copy link
Contributor

commented Aug 15, 2019

@cloud-fan @dongjoon-hyun @HyukjinKwon
Was just checking the db2 definition of a identifier in link

Its defined as following :

An ordinary identifier is an uppercase letter followed by zero or more characters, each of which is an uppercase letter, a digit, or the underscore character. Note that lower case letters can be used when specifying an ordinary identifier, but they are converted to uppercase when processed. An ordinary identifier should not be a reserved word.

Hive seems to have allowed digit as first character as well.

Identifier
    :
    (Letter | Digit) (Letter | Digit | '_')*
    | {allowQuotedId()}? QuotedIdentifier  /* though at the language level we allow all Identifiers to be QuotedIdentifiers;
                                              at the API level only columns are allowed to be of this form */
    | '`' RegexComponent+ '`'
    ;

Not sure why in spark we allowed "_" as starting char to begin with ? Is it to match some other system ?

@maropu

This comment has been minimized.

Copy link
Member

commented Aug 15, 2019

Just FYI: pgSQL accepts an underscore for the start of table names and doesn't do a digit;

postgres=# create table _t(t int);
CREATE TABLE

postgres=# create table 0t(t int);
ERROR:  syntax error at or near "0"
LINE 1: create table 0t(t int);
@ajithme

This comment has been minimized.

Copy link
Contributor Author

commented Aug 15, 2019

Wait, does table name starting with _ work in Spark currently? From SPARK-19059 it seems supported, but from SPARK-28697 it seems not.

When enable Hive support, spark doesn't support

@ajithme

This comment has been minimized.

Copy link
Contributor Author

commented Aug 15, 2019

I think spark allowed "_" to be used as start char for table/database to fall inline with PostGre, but this does deviate from behaviour of Hive.

@dilipbiswal

This comment has been minimized.

Copy link
Contributor

commented Aug 16, 2019

@maropu @ajithme
Thanks !! Yeah.. looks like the rule for identifier is different in different systems. I think we started with Presto's grammar. Presto also allows "_" as start char and so does SQL server.

@pvk2727
Copy link

left a comment

changes seems fine, do wee need to add positive test cases such as test valid names, to assure the valid format "([a-zA-Z0-9]+[\w]*)".r

@cloud-fan

This comment has been minimized.

Copy link
Contributor

commented Aug 19, 2019

According to the discussions in this PR, seems there are many other databases allowing the identifiers to start with _. Spark can support it as well without hive support. Shall we add the check only in HiveExternalCatalog.createTable?

@ajithme

This comment has been minimized.

Copy link
Contributor Author

commented Aug 21, 2019

According to the discussions in this PR, seems there are many other databases allowing the identifiers to start with _. Spark can support it as well without hive support. Shall we add the check only in HiveExternalCatalog.createTable?

@cloud-fan @HyukjinKwon
This for inputs. Would it not differ spark behavior when hive support is enabled or disabled.? (Hive cannot support _ starting identifiers due to limitations in FileFormat which affect its extended classes too) I think the identifier support inside spark should be uniform. Please correct me if i am wrong. Else i am ok to add check only at HiveExternalCatalog.createTable so that we can get more proper error message and mark this as a limitation when hive support is enabled

@AmplabJenkins

This comment has been minimized.

Copy link

commented Sep 16, 2019

Can one of the admins verify this patch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
9 participants
You can’t perform that action at this time.