PARQUET-677: Quoted identifiers in column names. #361

ptkool · 2016-08-22T20:10:55Z

No description provided.

julienledem · 2016-08-23T16:36:31Z

Hi @ptkool, thanks for your contribution.
Can you open a Parquet jira for this? (https://issues.apache.org/jira/browse/PARQUET)
then prefix the title of your PR with "PARQUET-XXX: "
Thank you!

rdblue · 2016-09-08T16:47:54Z

parquet-avro/src/test/java/org/apache/parquet/avro/TestReadWrite.java

@@ -426,7 +426,7 @@ public void testAll() throws Exception {
    assertEquals(emptyMap, nextRecord.get("myemptymap"));
    assertEquals(genericFixed, nextRecord.get("myfixed"));
  }
-
+  


Nit: Please remove the whitespace-only changes. It's harder to pick commits between branches and do maintenance with these changes.

rdblue · 2016-09-08T16:58:00Z

Thanks for posting a patch @ptkool!

Is this preventing Parquet from being used for some cases? I want to make sure the trade-off is worth including this. Right now, we don't have to worry about column name compatibility across object models. Avro, for example, will generate Java and C code for schemas so it only allows column names that are alphanumeric (plus '_'). This would break compatibility with those Avro fields and probably with Thrift and some Protobuf as well. Is that worth fixing strange Hive columns?

I'm wondering whether we should instead solve this problem with field IDs or string aliases, keeping the column names themselves alphanumeric so they stay consistent across object models.

ptkool

This is not necessarily preventing Parquet from being used. Delimited identifiers are quite common in the SQL world, and moving data from a table in a relational database to a Parquet file quite often involves 'fudging' column names to conform to Parquet identifier constraints. And then to use something like Apache Presto, you have a choice of either using different SQL between the two data sources, or creating views.

julienledem

Thank you.
Overall this looks great.
My main request is to centralize the quoting/unquoting logic and to make sure we don't quote things that worked without quoting before.

julienledem · 2016-10-26T17:11:46Z

parquet-column/src/main/java/org/apache/parquet/schema/Type.java

+    name = checkNotNull(name, "name");
+    if (name.charAt(0) == '`') {
+      name = name.substring(1, name.length() - 1).replace("``", "`");
+    }


We should check that the last character is also a quote.
We should also factor this logic out and share it with MessageTypeParser.Tokenizer.getName().
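
A minimal sketch of what a shared unquoting helper could look like, assuming a hypothetical utility class named `QuotedNames` (that name and its placement are illustrative, not part of the PR). It only strips the quotes when both the first and the last character are backquotes, and it also tolerates empty names, which `name.charAt(0)` in the diff would not:

```java
/** Hypothetical helper for illustration; not part of the PR or of parquet-mr. */
final class QuotedNames {
  private QuotedNames() {}

  /**
   * Strips surrounding backquotes and unescapes doubled backquotes.
   * Names that are not quoted at both ends are returned unchanged.
   */
  static String unquote(String name) {
    if (name == null) {
      throw new NullPointerException("name");
    }
    int len = name.length();
    // Require at least two characters so a lone "`" is not treated as quoted.
    boolean quoted = len >= 2
        && name.charAt(0) == '`'
        && name.charAt(len - 1) == '`';
    if (!quoted) {
      return name;
    }
    // Drop the outer quotes, then turn each escaped "``" back into "`".
    return name.substring(1, len - 1).replace("``", "`");
  }
}
```

With one such method, both the Type constructor and the tokenizer could call the same code instead of each carrying its own copy.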

julienledem · 2016-10-26T17:14:29Z

parquet-column/src/main/java/org/apache/parquet/schema/Type.java

@@ -114,6 +116,11 @@ public boolean isMoreRestrictiveThan(Repetition other) {
  private final ID id;

  /**
+   * Pattern for matching column names that need to be wrapped in backquotes.
+   */
+  private static final Pattern pattern = Pattern.compile("[^a-zA-Z\\d_-]");


@rdblue : does that satisfy your requirement?
Ideally this should describe only what was supported for parquet before that change.
We don't want to quote strings that worked before without quoting.
Should this be anything that is in MessageTypeParser.Tokenizer's constructor initialization of the delimiter list? " ,;{}()\n\t="
What are the other characters that wouldn't work? I think "." since it is used as a column path delimiter.
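
As a rough illustration of the narrower rule suggested here, the sketch below quotes a name only if it contains one of the tokenizer delimiters listed above, a ".", or a backquote. The class name `NameQuotingRule` is hypothetical, and the exact character set is an assumption that would need to be checked against what older Parquet versions actually accepted:

```java
/** Hypothetical rule for illustration; the character set is an assumption. */
final class NameQuotingRule {
  private NameQuotingRule() {}

  // Delimiters quoted in the review comment, plus '.' (column path separator)
  // and '`' (the quote character itself).
  private static final String SPECIAL = " ,;{}()\n\t=.`";

  /** Returns true if the name contains a character that would confuse parsing. */
  static boolean needsQuoting(String name) {
    for (int i = 0; i < name.length(); i++) {
      if (SPECIAL.indexOf(name.charAt(i)) >= 0) {
        return true;
      }
    }
    return false;
  }
}
```

Under this rule a name such as `bar?@` would be written without quotes, whereas the pattern in the diff would quote it.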

rdblue · 2016-10-26

What requirement are you referring to?

My main concern is that if we allow special characters in column names we will allow creating Parquet files that are incompatible with most object models. Thrift and protobuf can't generate a method that contains "`" and Avro's spec disallows special characters specifically for compatibility across languages and uses (code gen, generic, etc.).

I don't think it's a good idea to add this support. The alternative is for the user to choose a column name in Hive that isn't going to break other systems, and I think that's a completely reasonable requirement.

julienledem · 2016-10-26T17:17:43Z

parquet-column/src/test/java/org/apache/parquet/io/TestColumnIO.java

+    Group g1 = gf.newGroup();
+    g1.addGroup("foo").append("bar?@", 2l);
+
+    testSchema(schema, Arrays.asList(g1));


But did that fail before without quotes?

julienledem · 2016-10-26T17:19:21Z

parquet-column/src/test/java/org/apache/parquet/parser/TestParquetParser.java

+    assertEquals(expected, parsed);
+    MessageType reparsed = MessageTypeParser.parseMessageType(parsed.toString());
+    assertEquals(expected, reparsed);
+  }


We should add checks that column names that worked before do not get quoted when generating the schema now.
Otherwise, older versions of Parquet won't be able to read those files.

julienledem · 2016-10-26T17:20:58Z

parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java


 import org.apache.parquet.Strings;

 import static org.apache.parquet.Preconditions.checkNotNull;

 public final class ColumnPath implements Iterable<String>, Serializable {

+  private static Pattern pattern = Pattern.compile("[^a-zA-Z\\d_-]");


Why do we need to define this pattern again?
We should define it once.

julienledem · 2016-10-26T17:22:07Z

parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java

@@ -41,7 +47,14 @@ protected ColumnPath toCanonical(ColumnPath value) {

  public static ColumnPath fromDotString(String path) {
    checkNotNull(path, "path");
-    return get(path.split("\\."));
+    String[] parts = path.split("\\.(?=([^`]*`[^`]*`)*[^`]*$)");


I think we should centralize the quote parsing logic.
This is the 3rd place where we unquote/unescape quoted names.
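
One possible shape for a centralized parser, sketched under the assumption of a hypothetical `DotPathSplitter` class: a single pass over the string that splits on "." outside backquotes and unescapes "``" inside them, so the lookahead regex and the separate unquoting loop are not needed:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical splitter for illustration; not part of the PR or of parquet-mr. */
final class DotPathSplitter {
  private DotPathSplitter() {}

  /** Splits a dot-separated path, honoring backquoted segments. */
  static List<String> split(String path) {
    List<String> parts = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < path.length(); i++) {
      char c = path.charAt(i);
      if (c == '`') {
        if (inQuotes && i + 1 < path.length() && path.charAt(i + 1) == '`') {
          current.append('`'); // "``" inside quotes is an escaped backquote
          i++;                 // skip the second backquote of the pair
        } else {
          inQuotes = !inQuotes; // opening or closing quote, not part of the name
        }
      } else if (c == '.' && !inQuotes) {
        parts.add(current.toString()); // an unquoted dot ends the segment
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    if (inQuotes) {
      throw new IllegalArgumentException("Unterminated backquote in: " + path);
    }
    parts.add(current.toString());
    return parts;
  }
}
```

For example, `split("foo.`bar.baz`")` yields the two segments `foo` and `bar.baz`.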

julienledem · 2016-10-26T17:22:19Z

parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java

+    for (int i = 0; i < parts.length; ++i) {
+      String part = parts[i];
+      if (part.startsWith("`") && part.endsWith("`")) {
+        parts[i] = part.substring(1, part.length() - 1);


What about unescaping "``" here?

julienledem · 2016-10-26T17:22:43Z

parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java

+  private String getQuotedName(String name) {
+    Matcher matcher = pattern.matcher(name);
+    return matcher.find() ? "`" + name.replace("`", "``") + "`" : name;
+  }


This is duplicated with the quoting logic in the schema code.
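
A sketch of how the quoting side could live in one place, assuming a hypothetical `NameQuoter` class that both the schema code and ColumnPath would call. It reuses the character class from the diff as-is; whether that is the right set is the open question discussed in the review:

```java
import java.util.regex.Pattern;

/** Hypothetical shared quoting utility; not part of the PR or of parquet-mr. */
final class NameQuoter {
  private NameQuoter() {}

  // Same character class as the pattern in the PR diff, defined once.
  private static final Pattern NEEDS_QUOTING = Pattern.compile("[^a-zA-Z\\d_-]");

  /** Wraps the name in backquotes (escaping embedded ones) only when required. */
  static String quoteIfNeeded(String name) {
    return NEEDS_QUOTING.matcher(name).find()
        ? "`" + name.replace("`", "``") + "`"
        : name;
  }

  /** Joins path segments with '.', quoting each segment if needed. */
  static String toDotString(String... parts) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < parts.length; i++) {
      if (i > 0) {
        sb.append('.');
      }
      sb.append(quoteIfNeeded(parts[i]));
    }
    return sb.toString();
  }
}
```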

julienledem · 2017-02-02T00:36:09Z

@ptkool did you want to update this?
I still think this would be a useful addition.
@rdblue I think it is compatible as long as we don't quote names that did not need quoting before.

rdblue · 2017-02-02T01:01:11Z

@julienledem, the problem is that this allows names in Parquet files that will break some object models.

julienledem · 2017-02-02T07:42:35Z

@rdblue But typically object model integration (Avro, Thrift, Protobuf) is used to import data into Parquet or read data imported in the same model. Since people have existing schemas, it makes sense to me to allow escaping special characters rather than forcing renaming individual fields which is painful and brittle. Possibly we could add a flag on write to enable this. That would force users to acknowledge that they may not be able to read this with Avro (for example). But even then something could be done at that layer to convert to an accepted field name.
There are probably a few places that rely on names not containing "." for example for path representation that would need to be fixed similarly.

ptkool · 2017-02-04T11:45:08Z

@julienledem Yes, I will update this.

rdblue · 2017-02-21T22:43:25Z

I'm still -1 on this because it is allowing names that we know will break object models. I wouldn't say that Thrift, Protobuf, and Avro are only used to import data that will be read by other models after that. Avro originally only read files where there was an Avro schema, but we quickly had requests to be able to read non-Avro files (those written by Impala) with Avro.

julienledem · 2017-02-22T00:42:22Z

@rdblue I agree with your statement that we should not restrict reading a Parquet file to only the model that wrote it; however, I don't think that this PR forces that.
Each model has different rules about what is a valid name and what is not. By restricting the names in a parquet schema we force mangling them while writing as well as when reading them back. Often schemas already exist with the name as is.

To read back from a model that has different naming restrictions, I think we should deal with that on read, by having a standard rule per model for transforming a name into one that is valid in that model. All we need is a way to map that name on read.

There are already valid Parquet fields that are considered keywords in Thrift, for example.
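
To make the "standard rule per model" idea concrete, here is a minimal sketch for Avro, whose names must start with a letter or underscore and contain only letters, digits, and underscores. The class `AvroNameMapper` and the replace-with-underscore rule are assumptions chosen for illustration; no such rule is defined in this PR:

```java
/** Hypothetical read-side name mapping for Avro; the rule itself is an assumption. */
final class AvroNameMapper {
  private AvroNameMapper() {}

  /** Maps an arbitrary Parquet column name to a name Avro accepts. */
  static String toAvroName(String parquetName) {
    if (parquetName.isEmpty()) {
      return "_";
    }
    StringBuilder sb = new StringBuilder(parquetName.length() + 1);
    for (int i = 0; i < parquetName.length(); i++) {
      char c = parquetName.charAt(i);
      boolean letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
      boolean digit = c >= '0' && c <= '9';
      if (i == 0 && digit) {
        sb.append('_'); // Avro names may not start with a digit
      }
      // Keep valid characters, replace everything else with '_'.
      sb.append(letter || digit || c == '_' ? c : '_');
    }
    return sb.toString();
  }
}
```

A mapping like this is lossy: `created date` and `created_date` both become `created_date`, so a real rule would also need to detect or resolve collisions.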

ptkool · 2017-04-30T22:38:08Z

@julienledem Any more thoughts on this?

MatthiasEgli · 2017-08-31T21:34:07Z

This is currently blocking us from using Parquet, because we have arbitrary field names in our (so far JSON-backed) schema. While it is possible to escape or translate them to conform to Parquet's limitations, it would be far preferable if Parquet could use arbitrary field names.

fpompermaier · 2019-01-31T10:10:29Z

Any news about this PR? It would be very useful IMHO

airshad · 2020-09-11T11:58:41Z

Any update on this issue? I am facing this in one of my projects that uses the Parquet format, and it's pissing me off. I don't want to alter the dataset by replacing spaces with _ or removing them altogether.

Quoted identifiers in column names.

3d0aec1

ptkool changed the title ~~Quoted identifiers in column names.~~ PARQUET-677: identifiers in column names. Aug 29, 2016

rdblue reviewed Sep 8, 2016
Changes based on comments.

33d5088

ptkool commented Oct 25, 2016

ptkool changed the title ~~PARQUET-677: identifiers in column names.~~ PARQUET-677: Quoted identifiers in column names. Oct 26, 2016

julienledem requested changes Oct 26, 2016

ptkool added 2 commits February 10, 2017 10:10

Merge remote-tracking branch 'upstream/master'

368677e

Changes based on feedback.

9b9fc67

ptkool force-pushed the master branch 5 times, most recently from e6f428b to 9b9fc67 Compare February 21, 2017 23:43

gatorsmile mentioned this pull request May 16, 2017

[SPARK-20364][SQL] Disable Parquet predicate pushdown for fields having dots in the names apache/spark#18000

Closed

Skyscimitar mentioned this pull request Feb 17, 2021

AtotiJavaException: Illegal character in: created date atoti/atoti#242

Closed

wanx4910 mentioned this pull request Aug 30, 2021

PyArrow Parquet column partitioning apache/arrow#11027

Closed

liwensun mentioned this pull request Feb 23, 2022

Support arbitrary column names delta-io/delta#957

Closed
