[BEAM-2790] Use byte[] instead of ByteBuffer to read from Hadoop FS #3744
Conversation
Since the contained InputStream in some of the Hadoop FileSystem implementations may not implement ByteBufferReadable, we should take a conservative approach and read directly into a byte array.
// We avoid using the ByteBuffer based read for Hadoop because some FSInputStream
// implementations are not ByteBufferReadable.
// See https://issues.apache.org/jira/browse/HADOOP-14603
int read = inputStream.read(dst.array());
dst may have a limit on how much to read, and its start position may not be zero. dst may not have an accessible array because the bytes may be off heap. You should use position and limit to write into the ByteBuffer or you will not respect what the user has asked for. You should only use array() if you checked hasArray() and correctly calculate the offsets/limits with position(), arrayOffset() and remaining(). Otherwise you should use a fixed size buffer and copy the bytes twice.
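To illustrate the point above, here is a minimal sketch (not the PR's code; the helper name `readInto` is hypothetical) of a read that honors the buffer's position/limit, uses array() only after checking hasArray(), and falls back to a temporary heap array for off-heap buffers:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Sketch: read from an InputStream into a ByteBuffer while respecting
// the buffer's position/limit, with a copy fallback for direct buffers.
public class SafeByteBufferRead {
  static int readInto(InputStream in, ByteBuffer dst) throws IOException {
    if (dst.hasArray()) {
      // Heap buffer: read directly into the backing array at the correct offset.
      int read = in.read(dst.array(), dst.arrayOffset() + dst.position(), dst.remaining());
      if (read > 0) {
        dst.position(dst.position() + read);
      }
      return read;
    } else {
      // Direct (off-heap) buffer: copy through a temporary heap array.
      byte[] tmp = new byte[dst.remaining()];
      int read = in.read(tmp);
      if (read > 0) {
        dst.put(tmp, 0, read);
      }
      return read;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "hello world".getBytes();
    // Heap buffer with a non-zero start position.
    ByteBuffer heap = ByteBuffer.allocate(16);
    heap.position(4);
    System.out.println(readInto(new ByteArrayInputStream(data), heap));   // prints 11
    // Direct buffer with only 8 bytes remaining.
    ByteBuffer direct = ByteBuffer.allocateDirect(8);
    System.out.println(readInto(new ByteArrayInputStream(data), direct)); // prints 8
  }
}
```

Note the deliberate choice to size the temporary array by remaining(): reading more than the caller asked for would overflow the destination buffer.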
I'd recommend looking at PositionedReadable.readFully(), which will handle partial reads of the dest buffer by reading in more data, etc. A simple read() could return any value, including 0.
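The readFully contract mentioned above can be sketched without a Hadoop dependency (Hadoop's PositionedReadable.readFully does the equivalent loop internally; this standalone version is only illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Sketch of readFully-style semantics: a single read() may return fewer
// bytes than requested (any value, including 0), so loop until the
// requested length is filled, failing if EOF arrives first.
public class ReadFullySketch {
  static void readFully(InputStream in, byte[] buf, int off, int len) throws IOException {
    int total = 0;
    while (total < len) {
      int n = in.read(buf, off + total, len - total);
      if (n < 0) {
        throw new EOFException("EOF after " + total + " of " + len + " bytes");
      }
      total += n;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] dst = new byte[5];
    readFully(new ByteArrayInputStream("abcdefgh".getBytes()), dst, 0, 5);
    System.out.println(new String(dst)); // prints "abcde"
  }
}
```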
@lukecwik Not sure I understand: if you look at TextSource, it defines a fixed size for the read buffer, so we have a limit. TextBasedReader already defines it to be 8192.
private static final int READ_BUFFER_SIZE = 8192;
private final ByteBuffer readBuffer = ByteBuffer.allocate(READ_BUFFER_SIZE);
And the method that iterates to read (by delegating to the HadoopSeekableByteChannel) is already moving the buffer start position to zero on each iteration so we cannot get off heap, no?
beam/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSource.java
Lines 228 to 240 in 5181e61
private boolean tryToEnsureNumberOfBytesInBuffer(int minCapacity) throws IOException {
  // While we aren't at EOF or haven't fulfilled the minimum buffer capacity,
  // attempt to read more bytes.
  while (buffer.size() <= minCapacity && !eof) {
    eof = inChannel.read(readBuffer) == -1;
    readBuffer.flip();
    buffer = buffer.concat(ByteString.copyFrom(readBuffer));
    readBuffer.clear();
  }
  // Return true if we were able to honor the minimum buffer capacity request
  return buffer.size() >= minCapacity;
}
}
That is only true for TextSource as it is implemented today. We can't say that all user-implemented sources will always use in-memory byte buffers with a fixed size, or that the user will consume and reset the ByteBuffer on each read (for example, they may detect that not enough was read and ask for more).
@@ -189,7 +189,14 @@ public int read(ByteBuffer dst) throws IOException {
     if (closed) {
       throw new IOException("Channel is closed");
     }
-    return inputStream.read(dst);
+    // We avoid using the ByteBuffer based read for Hadoop because some FSInputStream
Add a test which exercises various bytebuffer setups.
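The kind of test asked for above might look like the following sketch (not the PR's actual test; it exercises a plain java.nio channel rather than HadoopSeekableByteChannel, just to show the buffer setups worth covering — non-zero position, a slice with non-zero arrayOffset, and a direct buffer with a reduced limit):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Hypothetical test sketch: exercise read(ByteBuffer) against several
// buffer setups and verify that position/limit are honored.
public class ByteBufferSetupsTest {
  static int readWith(ByteBuffer dst) throws IOException {
    ReadableByteChannel ch = Channels.newChannel(new ByteArrayInputStream("0123456789".getBytes()));
    return ch.read(dst);
  }

  public static void main(String[] args) throws IOException {
    // Heap buffer with a non-zero start position: the write must begin at index 6.
    ByteBuffer offsetBuf = ByteBuffer.allocate(16);
    offsetBuf.position(6);
    System.out.println(readWith(offsetBuf));       // prints 10

    // Slice of a larger buffer, so arrayOffset() is non-zero.
    ByteBuffer backing = ByteBuffer.allocate(16);
    backing.position(4);
    System.out.println(readWith(backing.slice())); // prints 10

    // Direct (off-heap) buffer with a limit smaller than the data.
    ByteBuffer direct = ByteBuffer.allocateDirect(16);
    direct.limit(5);
    System.out.println(readWith(direct));          // prints 5
  }
}
```

An implementation that reads into dst.array() from index 0, or that ignores the limit, fails at least one of these setups, which is exactly the class of bug discussed in this thread.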
@lukecwik @steveloughran, Ismaël is off for some weeks; I will continue this PR. For now I'm just starting to dig into the subject.
I did the fix in the code to deal with the offsets/limits to have the same behavior as …
I am closing this one because I suppose @echauchot will do a new one. See you in October.
Follow this checklist to help us incorporate your contribution quickly and easily:
- Make sure the PR title is formatted like: [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
- Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.