PARQUET-677: Quoted identifiers in column names. #361
base: master
Conversation
Hi @ptkool, thanks for your contribution.
@@ -426,7 +426,7 @@ public void testAll() throws Exception {
    assertEquals(emptyMap, nextRecord.get("myemptymap"));
    assertEquals(genericFixed, nextRecord.get("myfixed"));
  }
Nit: Please remove the whitespace-only changes. It's harder to pick commits between branches and do maintenance with these changes.
👍
Thanks for posting a patch @ptkool! Is this preventing Parquet from being used for some cases? I want to make sure the trade-off is worth including this.

Right now, we don't have to worry about column name compatibility across object models. Avro, for example, will generate Java and C code for schemas, so it only allows column names that are alphanumeric (plus '_'). This would break compatibility with those Avro fields, and probably with Thrift and some Protobuf as well. Is that worth fixing strange Hive columns? I'm wondering if we shouldn't solve this problem by using field IDs, or string aliases for column names, so that names stay alphanumeric and consistent across object models.
This is not necessarily preventing Parquet from being used. Delimited identifiers are quite common in the SQL world, and moving data from a table in a relational database to a Parquet file quite often involves 'fudging' column names to conform to Parquet identifier constraints. And then to use something like Apache Presto, you have a choice of either using different SQL between the two data sources, or creating views.
Thank you.
Overall this looks great.
My main request is to centralize the quoting/unquoting logic and to make sure we don't quote things that worked without quoting before.
name = checkNotNull(name, "name");
if (name.charAt(0) == '`') {
  name = name.substring(1, name.length() - 1).replace("``", "`");
}
We should check that the last character is also a quote.
We should also factor out this logic together with MessageTypeParser.Tokenizer.getName().
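A centralized helper along the lines suggested here might look like this. This is only a sketch, not the PR's actual code; the class and method names are hypothetical:

```java
// Hypothetical helper centralizing the unquoting logic discussed above:
// it verifies both delimiters are present and unescapes doubled backquotes.
public final class NameQuoting {
  public static String unquote(String name) {
    if (name.isEmpty() || name.charAt(0) != '`') {
      return name; // not a quoted identifier, return unchanged
    }
    if (name.length() < 2 || name.charAt(name.length() - 1) != '`') {
      // this is the missing check: the closing backquote must be there too
      throw new IllegalArgumentException("Unbalanced backquotes in: " + name);
    }
    return name.substring(1, name.length() - 1).replace("``", "`");
  }
}
```

With both ends checked, a malformed name like `` `abc `` fails loudly instead of silently losing its last character.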
@@ -114,6 +116,11 @@ public boolean isMoreRestrictiveThan(Repetition other) {
  private final ID id;

  /**
   * Pattern for matching column names that need to be wrapped in backquotes.
   */
  private static final Pattern pattern = Pattern.compile("[^a-zA-Z\\d_-]");
@rdblue : does that satisfy your requirement?
Ideally this should describe only what was supported for parquet before that change.
We don't want to quote strings that worked before without quoting.
Should this be anything that is in MessageTypeParser.Tokenizer's constructor initialization of the delimiter list? " ,;{}()\n\t="
What are the other characters that wouldn't work? I think "." since it is used as a column path delimiter.
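If the goal is to quote exactly the names the existing parser cannot tokenize, a pattern built from the Tokenizer's delimiter list quoted above, plus '.' (the path separator) and '`' (the quote character itself), might look like the following. This is a hypothetical sketch of that idea, not code from the PR:

```java
import java.util.regex.Pattern;

public final class DelimiterPattern {
  // Delimiters from MessageTypeParser.Tokenizer (" ,;{}()\n\t="), plus '.'
  // and '`'. Inside a character class these need no regex escaping.
  private static final Pattern NEEDS_QUOTES =
      Pattern.compile("[ ,;{}()\n\t=.`]");

  // A name only needs backquotes if it contains one of the delimiters.
  public static boolean needsQuoting(String name) {
    return NEEDS_QUOTES.matcher(name).find();
  }
}
```

This is narrower than the `[^a-zA-Z\d_-]` pattern above: a name like `my$col` would be left unquoted, since older readers already handled it.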
What requirement are you referring to?
My main concern is that if we allow special characters in column names we will allow creating Parquet files that are incompatible with most object models. Thrift and protobuf can't generate a method that contains "`" and Avro's spec disallows special characters specifically for compatibility across languages and uses (code gen, generic, etc.).
I don't think it's a good idea to add this support. The alternative is for the user to choose a column name in Hive that isn't going to break other systems, and I think that's a completely reasonable requirement.
Group g1 = gf.newGroup();
g1.addGroup("foo").append("bar?@", 2L);

testSchema(schema, Arrays.asList(g1));
but did that fail before without quotes?
assertEquals(expected, parsed);
MessageType reparsed = MessageTypeParser.parseMessageType(parsed.toString());
assertEquals(expected, reparsed);
}
We should add checks that column names that worked before do not get quoted when generating the schema now. Otherwise, older versions of Parquet won't be able to read those files.
import org.apache.parquet.Strings;

import static org.apache.parquet.Preconditions.checkNotNull;

public final class ColumnPath implements Iterable<String>, Serializable {

  private static Pattern pattern = Pattern.compile("[^a-zA-Z\\d_-]");
why do we need to define this pattern again?
We should define it once.
@@ -41,7 +47,14 @@ protected ColumnPath toCanonical(ColumnPath value) {

  public static ColumnPath fromDotString(String path) {
    checkNotNull(path, "path");
    return get(path.split("\\."));
    String[] parts = path.split("\\.(?=([^`]*`[^`]*`)*[^`]*$)");
I think we should centralize the quote parsing logic.
This is the 3rd place where we unquote/unescape quoted names.
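For reference, the lookahead in that regex splits on a dot only when the backquotes after it pair up evenly, so dots inside a quoted segment are left alone. A standalone sketch of its behavior (not the PR's code):

```java
public final class DotSplitDemo {
  // Split a dot-separated column path, ignoring dots that fall inside
  // backquoted segments. The lookahead accepts a '.' only if the rest of
  // the string contains an even number of backquotes (zero or more pairs).
  public static String[] split(String path) {
    return path.split("\\.(?=([^`]*`[^`]*`)*[^`]*$)");
  }
}
```

So `a.`b.c`.d` splits into three parts, with the quoted `b.c` kept whole. This is exactly the parsing that would live in one shared place if the logic were centralized as suggested.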
for (int i = 0; i < parts.length; ++i) {
  String part = parts[i];
  if (part.startsWith("`") && part.endsWith("`")) {
    parts[i] = part.substring(1, part.length() - 1);
unescaping "``" ?
private String getQuotedName(String name) {
  Matcher matcher = pattern.matcher(name);
  return matcher.find() ? "`" + name.replace("`", "``") + "`" : name;
}
duplicated with schema
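The quoting side of the duplicated logic, extracted into the kind of single shared helper being asked for here, could look like this sketch (names are hypothetical, not the PR's code):

```java
import java.util.regex.Pattern;

public final class QuoteDemo {
  // Names containing anything outside [a-zA-Z0-9_-] get backquoted,
  // mirroring the getQuotedName logic above. Backquotes inside the name
  // are escaped by doubling them, so quote() and an unquote() that
  // replaces "``" with "`" round-trip cleanly.
  private static final Pattern NEEDS_QUOTES = Pattern.compile("[^a-zA-Z\\d_-]");

  public static String quote(String name) {
    return NEEDS_QUOTES.matcher(name).find()
        ? "`" + name.replace("`", "``") + "`"
        : name;
  }
}
```

Defining the pattern and the escaping in one class would remove the copies in schema, ColumnPath, and the parser.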
@julienledem, the problem is that this allows names in Parquet files that will break some object models.

@rdblue But typically object model integration (Avro, Thrift, Protobuf) is used to import data into Parquet or read data imported in the same model. Since people have existing schemas, it makes sense to me to allow escaping special characters rather than forcing renaming individual fields, which is painful and brittle. Possibly we could add a flag on write to enable this. That would force users to acknowledge that they may not be able to read this with Avro (for example). But even then, something could be done at that layer to convert to an accepted field name.

@julienledem Yes, I will update this.

I'm still -1 on this because it is allowing names that we know will break object models. I wouldn't say that Thrift, Protobuf, and Avro are only used to import data that will be read by other models after that. Avro originally only read files where there was an Avro schema, but we quickly had requests to be able to read non-Avro files (those written by Impala) with Avro.
Force-pushed from e6f428b to 9b9fc67.
@rdblue I agree with your statement that we should not restrict reading a Parquet file to only the model that wrote it; however, I don't think that this PR forces that. To read back from a model that has different naming restrictions, I think we should deal with that on read, by having a standard rule per model for how to transform a name into one that is valid in that model. All we need is a way to map that name on read. There are already valid Parquet field names that are considered keywords in Thrift, for example.

@julienledem Any more thoughts on this?

This is currently blocking us from using Parquet, as we have arbitrary field names in our (so far JSON-backed) schema. While it is possible to escape/translate them to conform to Parquet's limitations, it would be far preferable if Parquet could use arbitrary field names.
Any news about this PR? It would be very useful IMHO |
Any update on this issue? I am facing this issue in one of my projects with the Parquet format and it's pissing me off. I don't want to alter the dataset by replacing space with
No description provided.