Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-677: Quoted identifiers in column names. #361

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ptkool
Copy link

@ptkool ptkool commented Aug 22, 2016

No description provided.

@julienledem
Copy link
Member

Hi @ptkool, thanks for your contribution.
Can you open a Parquet jira for this? (https://issues.apache.org/jira/browse/PARQUET)
then prefix the title of your PR with "PARQUET-XXX: "
Thank you!

@ptkool ptkool changed the title Quoted identifiers in column names. PARQUET-677: identifiers in column names. Aug 29, 2016
@@ -426,7 +426,7 @@ public void testAll() throws Exception {
assertEquals(emptyMap, nextRecord.get("myemptymap"));
assertEquals(genericFixed, nextRecord.get("myfixed"));
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please remove the whitespace-only changes. It's harder to pick commits between branches and do maintenance with these changes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@rdblue
Copy link
Contributor

rdblue commented Sep 8, 2016

Thanks for posting a patch @ptkool!

Is this preventing Parquet from being used for some cases? I want to make sure the trade-off is worth including this. Right now, we don't have to worry about column name compatibility across object models. Avro, for example, will generate Java and C code for schemas so it only allows column names that are alphanumeric (plus '_'). This would break compatibility with those Avro fields and probably with Thrift and some Protobuf as well. Is that worth fixing strange Hive columns?

I'm wondering if we shouldn't solve this problem by using field IDs or string aliases for column names that are alphanumeric to be consistent across object models.

Copy link
Author

@ptkool ptkool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessarily preventing Parquet from being used. Delimited identifiers are quite common in the SQL world, and moving data from a table in a relational database to a Parquet file quite often involves 'fudging' column names to conform to Parquet identifier constraints. And then to use something like Apache Presto, you have a choice of either using different SQL between the two data sources, or creating views.

@ptkool ptkool changed the title PARQUET-677: identifiers in column names. PARQUET-677: Quoted identifiers in column names. Oct 26, 2016
Copy link
Member

@julienledem julienledem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.
Overall this looks great.
My main request is to centralize the quoting/unquoting logic and to make sure we don't quote things that worked without quoting before.

name = checkNotNull(name, "name");
if (name.charAt(0) == '`') {
name = name.substring(1, name.length() - 1).replace("``", "`");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should check that the last character is also a quote.
We should factor out the logic with MessageTypeParser.Tokenizer.getName()

@@ -114,6 +116,11 @@ public boolean isMoreRestrictiveThan(Repetition other) {
private final ID id;

/**
* Pattern for matching column names that need to wrapped in backquotes.
*/
private static final Pattern pattern = Pattern.compile("[^a-zA-Z\\d_-]");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue : does that satisfy your requirement?
Ideally this should describe only what was supported for parquet before that change.
We don't want to quote strings that worked before without quoting.
Should this be anything that is in MessageTypeParser.Tokenizer's constructor initialization of the delimiter list? " ,;{}()\n\t="
What are the other character that wouldn't work? I think "." since it is used as a column path delimiter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What requirement are you referring to?

My main concern is that if we allow special characters in column names we will allow creating Parquet files that are incompatible with most object models. Thrift and protobuf can't generate a method that contains "`" and Avro's spec disallows special characters specifically for compatibility across languages and uses (code gen, generic, etc.).

I don't think it's a good idea to add this support. The alternative is for the user to choose a column name in Hive that isn't going to break others systems, and I think that's a completely reasonable requirement.

Group g1 = gf.newGroup();
g1.addGroup("foo").append("bar?@", 2l);

testSchema(schema, Arrays.asList(g1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but did that fail before without quotes?

assertEquals(expected, parsed);
MessageType reparsed = MessageTypeParser.parseMessageType(parsed.toString());
assertEquals(expected, reparsed);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add checks that column names that worked before do not get quoted when generating the schema now.
Otherwise older versions of parquet won't be able to read those files.


import org.apache.parquet.Strings;

import static org.apache.parquet.Preconditions.checkNotNull;

public final class ColumnPath implements Iterable<String>, Serializable {

private static Pattern pattern = Pattern.compile("[^a-zA-Z\\d_-]");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to define this pattern again?
We should define it once.

@@ -41,7 +47,14 @@ protected ColumnPath toCanonical(ColumnPath value) {

public static ColumnPath fromDotString(String path) {
checkNotNull(path, "path");
return get(path.split("\\."));
String[] parts = path.split("\\.(?=([^`]*`[^`]*`)*[^`]*$)");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should centralize the quote parsing logic.
This is the 3rd place where we unquote/unescape quoted names.

for (int i = 0; i < parts.length; ++i) {
String part = parts[i];
if (part.startsWith("`") && part.endsWith("`")) {
parts[i] = part.substring(1, part.length() - 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unescaping "``" ?

private String getQuotedName(String name) {
Matcher matcher = pattern.matcher(name);
return matcher.find() ? "`" + name.replace("`", "``") + "`" : name;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicated with schema

@julienledem
Copy link
Member

@ptkool did you want to update this?
I still think this would be a useful addition.
@rdblue I think it is compatible as long as we don't quote names that did not need quoting before.

@rdblue
Copy link
Contributor

rdblue commented Feb 2, 2017

@julienledem, the problem is that this allows names in Parquet files that will break some object models.

@julienledem
Copy link
Member

@rdblue But typically object model integration (Avro, Thrift, Protobuf) is used to import data into Parquet or read data imported in the same model. Since people have existing schemas, it makes sense to me to allow escaping special characters rather than forcing renaming individual fields which is painful and brittle. Possibly we could add a flag on write to enable this. That would force users to acknowledge that they may not be able to read this with Avro (for example). But even then something could be done at that layer to convert to an accepted field name.
There are probably a few places that rely on names not containing "." for example for path representation that would need to be fixed similarly.

@ptkool
Copy link
Author

ptkool commented Feb 4, 2017

@julienledem Yes, I will update this.

@rdblue
Copy link
Contributor

rdblue commented Feb 21, 2017

I'm still -1 on this because it is allowing names that we know will break object models. I wouldn't say that Thrift, Protobuf, and Avro are only used to import data that will be read by other models after that. Avro originally only read files where there was an Avro schema, but we quickly had requests to be able to read non-Avro files (those written by Impala) with Avro.

@ptkool ptkool force-pushed the master branch 5 times, most recently from e6f428b to 9b9fc67 Compare February 21, 2017 23:43
@julienledem
Copy link
Member

@rdblue I agree with your statement that we should not restrict reading a parquet file to only the model that wrote it however I don't think that this PR forces that.
Each model has different rules about what is a valid name and what is not. By restricting the names in a parquet schema we force mangling them while writing as well as when reading them back. Often schemas already exist with the name as is.

To read back from a model that has different naming restrictions I think we should deal with that on read by having a standard rule per model on how to transform a name into one that is valid in that model on read. All we need is a way to map that name on read.

There are already valid Parquet field that are considered keywords in thrift for example.

@ptkool
Copy link
Author

ptkool commented Apr 30, 2017

@julienledem Any more thoughts on this?

@MatthiasEgli
Copy link

This is currently blocking us from using Parquet as we are having arbitrary field names in our (so far json-backed) schema. While it is possible to escape/translate them to conform to Parquet limitations, it would be way preferable if Parquet could use arbitrary field names.

@fpompermaier
Copy link

Any news about this PR? It would be very useful IMHO

@airshad
Copy link

airshad commented Sep 11, 2020

Any update on this issue? I am facing this issue in one of my project with parquet format and its pissing me off. I don't want to alter the dataset by replacing space with _ OR removing it altogether.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants