Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1253: Support for new logical type representation #463

Merged
merged 12 commits into from
May 24, 2018

Conversation

nandorKollar
Copy link
Contributor

This PR implements the new logical type representation in parquet-mr which is already available in parquet-format. Reviews are welcome!

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this.
I've had some comments in the related parts.
In addition, could you please take care of the code parts using the former OriginalType/ConvertedType classes by deprecating public API and using the new LogicalType classes instead of the former ones wherever it is possible? (If you prefer, create a separate JIRA for it.)

this.precision = precision;
}

public void setPrecision(int precision) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like this type being mutable. Is it possible to initialize the scale/precision values at construction only somehow?

}
}

class NullLogicalTypeAnnotation implements LogicalTypeAnnotation {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, this implementation is not needed. All the cases where this instance is returned an exception should be thrown instead that the related case is unsupported.

}
}

class IntervalLogicalTypeAnnotation implements LogicalTypeAnnotation {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is fake type as INTERVAL is not supported in LogicalTypes currently. I can see the purpose though. I would suggest adding some comments to make the purpose clear.

return self();
}

public THIS as(LogicalTypeAnnotation type) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method breaks the fluent API of the builder as you need to use an outside factory method of the final type to create a LogicalTypeAnnotation. If it is practically feasible, I would suggest to refactor the LogicalTypeAnnotation API to fit more in the fluent API of Types.

}
}

LogicalTypeAnnotation.TimeUnit convertTimeUnit(TimeUnit unit) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see this method can be private. Please, use visibility as tight as possible and comment if something is visible for testing purposes only.

@@ -41,6 +40,10 @@

private static final ObjectMapper objectMapper = new ObjectMapper();

static {
objectMapper.configure(SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain why it is necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this configuration change on the objectMapper three tests in parquet-cascading3 fail with the following error:

testReadPattern(org.apache.parquet.cascading.TestParquetTupleScheme): [namecp] could not build flow from assembly: [java.io.IOException: Could not read footer: java.lang.RuntimeException: org.codehaus.jackson.map.JsonMappingException: No serializer found for class org.apache.parquet.schema.LogicalTypeAnnotation$StringLogicalTypeAnnotation and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS) ) (through reference chain: org.apache.parquet.hadoop.metadata.ParquetMetadata["fileMetaData"]->org.apache.parquet.hadoop.metadata.FileMetaData["schema"]->org.apache.parquet.schema.MessageType["fields"]->java.util.ArrayList[0]->org.apache.parquet.schema.PrimitiveType["logicalTypeAnnotation"])]

The reason is: the logical types are no longer represented as enum, but a classed, and Jackson can't serialize empty classes because FAIL_ON_EMPTY_BEANS feature is enabled by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks.
Could you please add some comments about that so others will also understand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, no problem, I will.

@@ -88,6 +88,7 @@ public GroupType(Repetition repetition, String name, OriginalType originalType,
* @param fields the contained fields
* @param id the id of the field
*/
@Deprecated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, it is enough to deprecate public API. If a method is not public we can freely modify/remove it.

}

/**
* @param repetition OPTIONAL, REPEATED, REQUIRED
* @param primitive STRING, INT64, ...
* @param name the name of the type
* @param originalType (optional) the original type to help with cross schema convertion (LIST, MAP, ...)
*
* @deprecated use {@link #PrimitiveType(Repetition, PrimitiveTypeName, String, LogicalTypeAnnotation)} instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usual pattern in parquet for deprecating is to mention that the related method will be removed in 2.0.0. In case of the suggestion would be to use one of the overloaded methods then it is fine to not mention.

@@ -436,13 +438,20 @@ public PrimitiveType(Repetition repetition, PrimitiveTypeName primitive,
* @param originalType (optional) the original type (MAP, DECIMAL, UTF8, ...)
* @param decimalMeta (optional) metadata about the decimal type
* @param id the id of the field
*
* @deprecated use {@link #PrimitiveType(Repetition, PrimitiveTypeName, int, String, LogicalTypeAnnotation, ID)} instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

public PrimitiveType(Repetition repetition, PrimitiveTypeName primitive,
int length, String name, OriginalType originalType,
DecimalMetadata decimalMeta, ID id) {
this(repetition, primitive, length, name, originalType, decimalMeta, id, null);
}

/**
* @deprecated use {@link #PrimitiveType(Repetition, PrimitiveTypeName, int, String, LogicalTypeAnnotation, ID, ColumnOrder)} instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not public, no need for deprecation.

/**
* @deprecated use {@link #PrimitiveType(Repetition, PrimitiveTypeName, int, String, LogicalTypeAnnotation, ID, ColumnOrder)} instead
*/
@Deprecated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

@@ -459,6 +468,37 @@ public PrimitiveType(Repetition repetition, PrimitiveTypeName primitive,
this.columnOrder = requireValidColumnOrder(columnOrder);
}

public PrimitiveType(Repetition repetition, PrimitiveTypeName primitive,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, in long term we do not want to expose the construction of types from the API. We would like the clients to use the builder instead. Therefore, I would suggest not adding public constructors if your code does not need it.

Copy link
Contributor

@gszadovszky gszadovszky Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about my previous comment? If you agree, you should say at the deprecations to use the Types factory instead of directly creating PrimitiveType objects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, missed this one. Sure, I can make it package private

*/
@Deprecated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not public, no deprecating is required.

/**
* @deprecated use {@link #Type(String, Repetition, LogicalTypeAnnotation, ID)} instead
*/
@Deprecated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

@@ -459,6 +468,37 @@ public PrimitiveType(Repetition repetition, PrimitiveTypeName primitive,
this.columnOrder = requireValidColumnOrder(columnOrder);
}

public PrimitiveType(Repetition repetition, PrimitiveTypeName primitive,
Copy link
Contributor

@gszadovszky gszadovszky Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about my previous comment? If you agree, you should say at the deprecations to use the Types factory instead of directly creating PrimitiveType objects.

private final byte bitWidth;
private final boolean isSigned;

public static LogicalTypeAnnotation create(byte bitWidth, boolean isSigned) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because you expect a byte for bitWidth the caller has to cast the value all the time. I think, there is no reason expecting a byte here so, I would suggest using an int instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntType in parquet-format expects byte as bitWidth, thus if this method would accept int instead of byte, then in line 503 (LogicalType.INTEGER(new IntType(bitWidth, isSigned))) we should downcast bitWidth to byte. Is this fine, should I downcast there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to do the cast in one place so the user don't have to do it every time when using this API. The cast would be safe as only the values 8, 16, 32 and 64 are allowed. By the way, I would be good to check if really one of these values are passed.

class StringLogicalTypeAnnotation implements LogicalTypeAnnotation {
private static final StringLogicalTypeAnnotation INSTANCE = new StringLogicalTypeAnnotation();

public static LogicalTypeAnnotation create() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like the pattern how the client has to write the whole classnames of the specialized LogicalTypeAnnotataion implementations to create one. What do you think about having named static factory methods one level up instead?
For example instead of calling LogicalTypeAnnotation.StringLogicalTypeAnnotation.create() you might call LogicalTypeAnnotation.string() or more simply string() if static import is used.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added one comment. Otherwise, seems good to me.

return BsonLogicalTypeAnnotation.INSTANCE;
}

static IntervalLogicalTypeAnnotation intervalType() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the user to be able to create an Interval or a MapKeyValue type? If they are only exist for backward compatibility reasons the factory method should not be public so users won't be able to create them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think MapKeyValue is here only for backward compatibility, but Interval should be supported in the future. Since these factory methods are defined on an interface, I can't easily change the visibility to private/package protected. Should I change it to an abstract class?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interval is not supported in LogicalType currently, so I would not expose it yet. It is always possible to make a restricted access public but in the other way it would break backward compatibility.
Abstract class sounds reasonable to me.

This commit
* Address code review changes: replace Interval and KeyValue logical type factory methods with methods on logical
  type to indicate that they're not intended to be used publicly
* Incorporate new parameters for logical types for timestamp, time and integer
Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the hard work you already invested in this change!
I would like you to keep up the good work so I added some additional comments ;) sorry...
Could you please also investigate the Travis failures and make it pass?

import java.util.Objects;

public interface LogicalTypeAnnotation {
public abstract class LogicalTypeAnnotation {
public enum LogicalTypes {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This enum is used only for parsing/printing and we don't want the users to really use them. So, I would suggest using a name that suggests its use e.g. LogicalTypeParseHelper?
Also, it would be nice if we could annotate/comment that this one is not part of the public API.

/**
* Convert this parquet-mr logical type to parquet-format LogicalType.
*
* @return the parquet-format LogicalType representation of this logical type implementation
*/
LogicalType toLogicalType();
public abstract LogicalType toLogicalType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually try to keep the usage of the thrift generated objects (from parquet-format) in ParquetMetadataConverter. Do you think it is possible to refactor the related code parts to there (not necessarily to the same class but to the same module)?

void accept(LogicalTypeAnnotationVisitor logicalTypeAnnotationVisitor);
public abstract void accept(LogicalTypeAnnotationVisitor logicalTypeAnnotationVisitor);

public abstract LogicalTypes getType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this one is not used outside of this class/package. I would suggest not being public.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accept is intended for public use, for the visitors

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant getType().


MessageType parsed = MessageTypeParser.parseMessageType(message);
MessageType expected = Types.buildMessage()
.required(INT32).as(LogicalTypeAnnotation.intType(8, true)).named("i8")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think, the static import of the method intType would help reading the code.

@nandorKollar
Copy link
Contributor Author

@gszadovszky it looks like the version of maven-shade-plugin and enforcer-rule dependency in maven-enforcer-plugin Maven plugins used in Parquet don't like Java 8 lambda expressions, that's why Travis build failed. Since Parquet is already upgraded to Java 8, I think we should upgrade these plugins. I'll create a separate Jira for this issue.

Copy link
Contributor

@gszadovszky gszadovszky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the changes. 2 more to go ;)
After fixing these, I'll be fine with this.
Thanks again for the efforts


import java.util.List;
import java.util.Objects;

public abstract class LogicalTypeAnnotation {
public enum LogicalTypes {
// This is a private enum intended only for internal use for parsing the schema
public enum LogicalTypeToken {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see this enum is used only from the schema package. I would suggest using package private access so the comment is not needed and the clients cannot misuse it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, thanks Gabor!

@@ -164,7 +135,7 @@ protected LogicalTypeAnnotation fromString(List<String> params) {
*/
public abstract void accept(LogicalTypeAnnotationVisitor logicalTypeAnnotationVisitor);

public abstract LogicalTypes getType();
protected abstract LogicalTypeToken getType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: package private would be fine here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, thanks! I'm wondering, if we should narrow down the scope of a bunch of other methods to package private in LogicalTypeAnnotation too? For example, toOriginalType is only used within the same package, apart from one test case, not sure that it should be part of public API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I would say if something is not necessary to be public then restrict its access before committing. The later the harder to remove from the public API.

@wesm
Copy link
Member

wesm commented Nov 21, 2018

What will happen if you try reading a Parquet file with the new logical types with an old reader (or another implementation, like Apache Impala)?

@nandorKollar
Copy link
Contributor Author

@wesm old logical types (ConvertedTypes in Thrift schema) are still written as long as the LogicalType can be paired with a ConvertedType. For those logical type, which already exist, this mapping is documented here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, old readers should be able to read these Thrift schemas as they did before, because the expected ConvertedType field is populated.

For those types, which are introduced later, for example nanosecond precision timestamp, we decided not to introduce a ConvertedType, but just add a new TimeUnit to Timestamp and Time logical types. In this case, the ConvertedType field is not populated at all, therefore old readers will just see the physical type. Does this answer your question?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants