-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-813] support metadata in Avro sink #1177
Conversation
R: @lukecwik |
@@ -455,6 +458,13 @@ private Read() {} | |||
} | |||
|
|||
/** | |||
* Returns a {@link PTransform} that writes Avro file(s) with specified metadata. | |||
*/ | |||
public static Bound<GenericRecord> withMetadata(Map<String, String> metadata) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also allow byte[] and long by making the value type of the map an Object.
In the Bound constructor we can validate that the map that the user passed in only contain these types. We should explicitly say in the javadoc that only these three types are supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to Map<String, Object>
and added Javadoc.
|
||
DataFileStream dataFileStream = new DataFileStream(new FileInputStream(outputFile), | ||
new GenericDatumReader()); | ||
assertTrue(dataFileStream.getMetaString("key").equals("value")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use assertEquals because it will give a better error message on failure.
assertEquals("value", dataFileStream.getMetaString("key"));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
for (Map.Entry<String, String> entry : metadata.entrySet()) { | ||
builder.add(DisplayData.item(entry.getKey(), entry.getValue()).withLabel("Metadata")); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove this extra white space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
/** | ||
* Returns a new {@link PTransform} that's like this one but | ||
* that writes to Avro file(s) with specified metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with specified metadata. -> with the specified metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -455,6 +458,13 @@ private Read() {} | |||
} | |||
|
|||
/** | |||
* Returns a {@link PTransform} that writes Avro file(s) with specified metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with specified metadata. -> with the specified metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -519,6 +532,7 @@ private Read() {} | |||
this.schema = schema; | |||
this.validate = validate; | |||
this.codec = codec; | |||
this.metadata = ImmutableMap.copyOf(metadata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we are duplicating the map copy multiple times. It makes sense that its only done within the withMetadata call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Done.
@@ -779,6 +823,10 @@ public void populateDisplayData(DisplayData.Builder builder) { | |||
.addIfNotDefault(DisplayData.item("codec", codec.toString()) | |||
.withLabel("Avro Compression Codec"), | |||
DEFAULT_CODEC.toString()); | |||
for (Map.Entry<String, String> entry : metadata.entrySet()) { | |||
builder.add(DisplayData.item(entry.getKey(), entry.getValue()).withLabel("Metadata")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will make all metadata have the same label called "Metadata" irrespective of the key which is not what I think you want.
I think you want to do:
builder.include("Metadata", new Metadata());
and add the following private class inside Bound:
private class Metadata implements HasDisplayData {
void populateDisplayData(DisplayData.Builder builder) {
for (Map... entry : ...) {
String valueString = // convert entry.getValue() to string
builder.add(DisplayData.item(entry.getKey(), entryString);
}
}
}
This allows for all the metadata to be grouped under a single namespace component and be logically related.
For the binary data, it won't be human readable and doesn't really fit as information that would be consumed, so I suggest that either:
- skip over specifying binary fields
- add them and make the value a word like "..." to mark that its specified
- add them and make the value the hex or base64 representation of a fixed prefix of the bytes followed by "..." if the bytes is long otherwise the entire hex/base64 representation.
- add them and make the value a checksum of the bytes
@nevillelyh @dhalperi @swegner What kinds of binary data do people add to Avro files and is its presence in the DisplayData genuinely useful?
@nevillelyh @dhalperi @swegner any specific preference for the above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. You may also find it useful to use DisplayData.inferType(..)
to easily pull out String/Long values and special-case others:
for (Map... entry : ...) {
DisplayData.Type type = DisplayData.inferFrom(entry.getValue());
if (type != null) {
builder.add(DisplayData.item(entry.getKey(), type, entry.getValue());
} else {
// handle binary data
}
}
Unless we know there is binary metadata which would be valuable to display, I recommend going the simple route and substituting a marker word ("..."
) or using the Object.toString()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Object.toString on a byte[] will not provide any meaningfully useful information since it will just print out the address of the byte[] in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're only interested in String metadata but base64 with optional "..." sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@lukecwik addressed your comments. LMK what you think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments, and then LGTM
Note in the future, please don't squash your additional commits into existing reviewed commits. It makes it much easier if a reviewer can look at the diff between the last reviewed version. If you prefix your commit message with fixup! or squash!, a committer will use git to clean-up the commit history. Contribution guide needs some updating to add the fixup!/squash! stuff to it but this is the practice that has been followed. http://beam.incubator.apache.org/contribute/contribution-guide/#code-review-and-revision
} | ||
} | ||
checkArgument( | ||
badKeys.isEmpty(), "Metadata value must be String, Long, or byte[]. {}", badKeys); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metadata value must be String, Long, or byte[]. {} -> Metadata value type must be one of String, Long, or byte[]. Found {}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
builder.add(DisplayData.item(entry.getKey(), type, entry.getValue())); | ||
} else { | ||
String base64 = BaseEncoding.base64().encode((byte[]) entry.getValue()); | ||
String repr = base64.length() <= 40 ? base64 : base64.substring(0, 40) + "..."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, extract the 40 to a constant, also make a comment for it that it should be a multiple of 4 to not get a partial encoded byte.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
dataFileWriter.setMeta(entry.getKey(), (Long) v); | ||
} else if (v instanceof byte[]) { | ||
dataFileWriter.setMeta(entry.getKey(), (byte[]) v); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though this case is guarded against by the Bound class, its best to have an else block that throws an IllegalStateException, This will prevent code rot if someone only updates the one list and not the other instead of silently ignoring it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
LGTM Thanks for the contribution. |
This was merged here but failed to close the PR: eba099f Please close this PR. |
No description provided.