
SPARK-33755: Allow creating orc table when row format separator is defined #30785

Closed

StefanXiepj wants to merge 2 commits into apache:branch-2.4 from StefanXiepj:SPARK-33755


Conversation


@StefanXiepj StefanXiepj commented Dec 15, 2020

What changes were proposed in this pull request?

When creating a table like this:

```sql
create table test_orc(c1 string) row format delimited fields terminated by '002' stored as orcfile;
```

Spark throws an exception like:

```
Operation not allowed: ROW FORMAT DELIMITED is only compatible with 'textfile', not 'orcfile'(line 2, pos 0)
```
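For comparison, a minimal sketch of the Hive-compatible workaround: since ORC does not use the row-format delimiter (as discussed below, the clause is a no-op), dropping the clause gives an equivalent table that Spark already accepts. The table name here just mirrors the example above.

```sql
-- The delimiter clause is a no-op for ORC, so dropping it
-- produces an equivalent table that parses in Spark.
create table test_orc(c1 string) stored as orc;
```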

In this PR, we relax the rule so that creating an ORC table with ROW FORMAT DELIMITED is allowed.

Why are the changes needed?

I found this problem when migrating tasks from Hive to Spark. Hive accepts this syntax (which is not ideal, but harmless enough to ignore), so I fixed it against Spark 2.4. Although ORC doesn't need the delimiter, I don't think the syntax needs to be this strict; being lenient makes it more convenient to migrate tasks from Hive to Spark.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UTs to be added

@AmplabJenkins

Can one of the admins verify this patch?

```scala
| intField INT,
| stringField STRING
|)
|ROW FORMAT DELIMITED FIELDS TERMINATED BY '002'
```
Member

How does the ORC table work with the delimiter?

Author

@StefanXiepj StefanXiepj Dec 16, 2020

Although this is unnecessary for an ORC table, we could support an option that lets the user choose whether to ignore it. Maybe it is better to emit a warning than an error.

Member

It's better to be explicit for what Spark supports. It's a bit odd that we add no-op syntax.

Author

We have many tasks that run on Hive, and run slowly, so we want to migrate them from Hive to Spark. I think other companies want this too. If we supported ignoring this exception by setting spark.sql.orc.skipRowFormatDelimitedError=true, we could migrate Hive tasks without users having to modify their SQL scripts.

Member

Does Hive work with this delimiter specified with ORC?

Author

Hive doesn't use the delimiter either, but Hive doesn't throw an exception.

Member

Let's not fix it, then. It's odd for Spark to accept syntax that is a no-op, and that does not also work in Hive.

Author

Thanks very much, I'll close it.
