Skip to content

Conversation

@ajithme
Copy link
Contributor

@ajithme ajithme commented Aug 14, 2019

What changes were proposed in this pull request?

I think we should disallow if a identifier starts with _ for create database and create table
Partially we can see its effect in SPARK-28697 where as the table name starts with _ (like _sampleTable) , the FileFormat assumes it to be a hidden folder and do not list it which causes unusual behavior

How was this patch tested?

Avoiding creating tables and databases with names starting from underscore. Added test case for same

@ajithme
Copy link
Contributor Author

ajithme commented Aug 14, 2019

@dongjoon-hyun @cloud-fan @HyukjinKwon please review and let me know your opinion on the fix

@ajithme ajithme changed the title [SPARK-28697] Invalidate names starting with _ to avoid unexpected behaviour [SPARK-28697] Invalidate Database/Table names starting with underscore Aug 14, 2019
@HyukjinKwon
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Aug 14, 2019

Test build #109093 has finished for PR 25448 at commit cef86c7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ajithme
Copy link
Contributor Author

ajithme commented Aug 14, 2019

retest this please

@SparkQA
Copy link

SparkQA commented Aug 14, 2019

Test build #109097 has finished for PR 25448 at commit 439d99e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

protected var currentDb: String = formatDatabaseName(DEFAULT_DATABASE)

private val validNameFormat = "([\\w_]+)".r
private val validNameFormat = "([a-zA-Z0-9]+[\\w]*)".r
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use 0-9 as the first letter of table name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, same as before.. just restricts name not to start with underscore

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have restricted only _ by this change ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and abiding by hive standard

@SparkQA
Copy link

SparkQA commented Aug 14, 2019

Test build #109104 has finished for PR 25448 at commit e49aa2b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28697] Invalidate Database/Table names starting with underscore [SPARK-28697][SQL] Invalidate Database/Table names starting with underscore Aug 14, 2019
@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Aug 14, 2019

@ajithme . As you see in the UT failures, this is exactly reverting our previous efforts. Please don't change the UT and try to narrow down this patch to fix only your issues (Hive tables?).

SPARK-19059: read file based table whose name starts with underscore

cc @gatorsmile

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Aug 15, 2019

Do we need to keep the failed unittest case actually?
Seems Hive doesn't fix it too given HIVE-6431. If that's an unreasonable behaviour, let's drop that support and update migration guide - from a cursory look seems like there are multiple places to fix to support the table with leading _ and doesn't look worthy.

@cloud-fan
Copy link
Contributor

Wait, does table name starting with _ work in Spark currently? From SPARK-19059 it seems supported, but from SPARK-28697 it seems not.

@dilipbiswal
Copy link
Contributor

dilipbiswal commented Aug 15, 2019

@cloud-fan @dongjoon-hyun @HyukjinKwon
Was just checking the db2 definition of a identifier in link

Its defined as following :

An ordinary identifier is an uppercase letter followed by zero or more characters, each of which is an uppercase letter, a digit, or the underscore character. Note that lower case letters can be used when specifying an ordinary identifier, but they are converted to uppercase when processed. An ordinary identifier should not be a reserved word.

Hive seems to have allowed digit as first character as well.

Identifier
    :
    (Letter | Digit) (Letter | Digit | '_')*
    | {allowQuotedId()}? QuotedIdentifier  /* though at the language level we allow all Identifiers to be QuotedIdentifiers;
                                              at the API level only columns are allowed to be of this form */
    | '`' RegexComponent+ '`'
    ;

Not sure why in spark we allowed "_" as starting char to begin with ? Is it to match some other system ?

@maropu
Copy link
Member

maropu commented Aug 15, 2019

Just FYI: pgSQL accepts an underscore for the start of table names and doesn't do a digit;

postgres=# create table _t(t int);
CREATE TABLE

postgres=# create table 0t(t int);
ERROR:  syntax error at or near "0"
LINE 1: create table 0t(t int);

@ajithme
Copy link
Contributor Author

ajithme commented Aug 15, 2019

Wait, does table name starting with _ work in Spark currently? From SPARK-19059 it seems supported, but from SPARK-28697 it seems not.

When enable Hive support, spark doesn't support

@ajithme
Copy link
Contributor Author

ajithme commented Aug 15, 2019

I think spark allowed "_" to be used as start char for table/database to fall inline with PostGre, but this does deviate from behaviour of Hive.

@dilipbiswal
Copy link
Contributor

@maropu @ajithme
Thanks !! Yeah.. looks like the rule for identifier is different in different systems. I think we started with Presto's grammar. Presto also allows "_" as start char and so does SQL server.

Copy link

@pvk2727 pvk2727 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes seems fine, do wee need to add positive test cases such as test valid names, to assure the valid format "([a-zA-Z0-9]+[\w]*)".r

@cloud-fan
Copy link
Contributor

According to the discussions in this PR, seems there are many other databases allowing the identifiers to start with _. Spark can support it as well without hive support. Shall we add the check only in HiveExternalCatalog.createTable?

@ajithme
Copy link
Contributor Author

ajithme commented Aug 21, 2019

According to the discussions in this PR, seems there are many other databases allowing the identifiers to start with _. Spark can support it as well without hive support. Shall we add the check only in HiveExternalCatalog.createTable?

@cloud-fan @HyukjinKwon
This for inputs. Would it not differ spark behavior when hive support is enabled or disabled.? (Hive cannot support _ starting identifiers due to limitations in FileFormat which affect its extended classes too) I think the identifier support inside spark should be uniform. Please correct me if i am wrong. Else i am ok to add check only at HiveExternalCatalog.createTable so that we can get more proper error message and mark this as a limitation when hive support is enabled

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Dec 26, 2019
@github-actions github-actions bot closed this Dec 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants