Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-813] support metadata in Avro sink #1177

Closed
wants to merge 2 commits into from
Closed

[BEAM-813] support metadata in Avro sink #1177

wants to merge 2 commits into from

Conversation

nevillelyh
Copy link
Contributor

No description provided.

@lukecwik
Copy link
Member

R: @lukecwik

@@ -455,6 +458,13 @@ private Read() {}
}

/**
* Returns a {@link PTransform} that writes Avro file(s) with specified metadata.
*/
public static Bound<GenericRecord> withMetadata(Map<String, String> metadata) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also allow byte[] and long by making the value type of the map an Object.

In the Bound constructor we can validate that the map that the user passed in only contain these types. We should explicitly say in the javadoc that only these three types are supported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to Map<String, Object> and added Javadoc.


DataFileStream dataFileStream = new DataFileStream(new FileInputStream(outputFile),
new GenericDatumReader());
assertTrue(dataFileStream.getMetaString("key").equals("value"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use assertEquals because it will give a better error message on failure.
assertEquals("value", dataFileStream.getMetaString("key"));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

for (Map.Entry<String, String> entry : metadata.entrySet()) {
builder.add(DisplayData.item(entry.getKey(), entry.getValue()).withLabel("Metadata"));
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove this extra white space.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


/**
* Returns a new {@link PTransform} that's like this one but
* that writes to Avro file(s) with specified metadata.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with specified metadata. -> with the specified metadata.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -455,6 +458,13 @@ private Read() {}
}

/**
* Returns a {@link PTransform} that writes Avro file(s) with specified metadata.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with specified metadata. -> with the specified metadata.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -519,6 +532,7 @@ private Read() {}
this.schema = schema;
this.validate = validate;
this.codec = codec;
this.metadata = ImmutableMap.copyOf(metadata);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we are duplicating the map copy multiple times. It makes sense that its only done within the withMetadata call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Done.

@@ -779,6 +823,10 @@ public void populateDisplayData(DisplayData.Builder builder) {
.addIfNotDefault(DisplayData.item("codec", codec.toString())
.withLabel("Avro Compression Codec"),
DEFAULT_CODEC.toString());
for (Map.Entry<String, String> entry : metadata.entrySet()) {
builder.add(DisplayData.item(entry.getKey(), entry.getValue()).withLabel("Metadata"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make all metadata have the same label called "Metadata" irrespective of the key which is not what I think you want.

I think you want to do:

builder.include("Metadata", new Metadata());

and add the following private class inside Bound:

private class Metadata implements HasDisplayData {
  void populateDisplayData(DisplayData.Builder builder) {
    for (Map... entry : ...) {
      String valueString = // convert entry.getValue() to string
      builder.add(DisplayData.item(entry.getKey(), entryString);
    }
  }
}

This allows for all the metadata to be grouped under a single namespace component and be logically related.
For the binary data, it won't be human readable and doesn't really fit as information that would be consumed, so I suggest that either:

  • skip over specifying binary fields
  • add them and make the value a word like "..." to mark that its specified
  • add them and make the value the hex or base64 representation of a fixed prefix of the bytes followed by "..." if the bytes is long otherwise the entire hex/base64 representation.
  • add them and make the value a checksum of the bytes

@nevillelyh @dhalperi @swegner What kinds of binary data do people add to Avro files and is its presence in the DisplayData genuinely useful?
@nevillelyh @dhalperi @swegner any specific preference for the above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. You may also find it useful to use DisplayData.inferType(..) to easily pull out String/Long values and special-case others:

for (Map... entry : ...) {
  DisplayData.Type type = DisplayData.inferFrom(entry.getValue());
  if (type != null) {
    builder.add(DisplayData.item(entry.getKey(), type, entry.getValue());
  } else {
    // handle binary data
  }
}

Unless we know there is binary metadata which would be valuable to display, I recommend going the simple route and substituting a marker word ("...") or using the Object.toString().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Object.toString on a byte[] will not provide any meaningfully useful information since it will just print out the address of the byte[] in memory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're only interested in String metadata but base64 with optional "..." sounds good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@nevillelyh
Copy link
Contributor Author

@lukecwik addressed your comments. LMK what you think.

Copy link
Member

@lukecwik lukecwik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, and then LGTM

Note in the future, please don't squash your additional commits into existing reviewed commits. It makes it much easier if a reviewer can look at the diff between the last reviewed version. If you prefix your commit message with fixup! or squash!, a committer will use git to clean-up the commit history. Contribution guide needs some updating to add the fixup!/squash! stuff to it but this is the practice that has been followed. http://beam.incubator.apache.org/contribute/contribution-guide/#code-review-and-revision

}
}
checkArgument(
badKeys.isEmpty(), "Metadata value must be String, Long, or byte[]. {}", badKeys);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metadata value must be String, Long, or byte[]. {} -> Metadata value type must be one of String, Long, or byte[]. Found {}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

builder.add(DisplayData.item(entry.getKey(), type, entry.getValue()));
} else {
String base64 = BaseEncoding.base64().encode((byte[]) entry.getValue());
String repr = base64.length() <= 40 ? base64 : base64.substring(0, 40) + "...";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, extract the 40 to a constant, also make a comment for it that it should be a multiple of 4 to not get a partial encoded byte.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

dataFileWriter.setMeta(entry.getKey(), (Long) v);
} else if (v instanceof byte[]) {
dataFileWriter.setMeta(entry.getKey(), (byte[]) v);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though this case is guarded against by the Bound class, its best to have an else block that throws an IllegalStateException, This will prevent code rot if someone only updates the one list and not the other instead of silently ignoring it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@lukecwik
Copy link
Member

LGTM

Thanks for the contribution.

@lukecwik
Copy link
Member

This was merged here but failed to close the PR: eba099f

Please close this PR.

@nevillelyh nevillelyh closed this Oct 27, 2016
@nevillelyh nevillelyh deleted the neville/avro-metadata branch October 31, 2016 08:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants