Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-4222 [CORE] use readFully in FixedLengthBinaryRecordReader #3093

Conversation

industrial-sloth
Copy link
Contributor

replaces the existing read() call with readFully().

@@ -115,7 +115,7 @@ private[spark] class FixedLengthBinaryRecordReader
if (currentPosition < splitEnd) {
// setup a buffer to store the record
val buffer = recordValue.getBytes
fileInputStream.read(buffer, 0, recordLength)
fileInputStream.readFully(buffer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, but this also doesn't check how many bytes were actually read?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, but readFully will either block until the full number of bytes is available, or throw an error:
http://docs.oracle.com/javase/6/docs/api/java/io/DataInput.html#readFully(byte[])

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@industrial-sloth
Copy link
Contributor Author

FWIW, we got bit by this over in the Thunder project, where we'd been using this class before Jeremy contributed it into Spark. The circumstance where it came up was doing big reads from S3, where the underlying buffer size was something like 16kb? I forget exactly... anyway, when the records didn't align with this buffer, badness happened.

Here's the equivalent change I made over there:
thunder-project/thunder@7ea31c8

The previous version on that patch shows my previous fix, which does check the number of bytes returned, before I realized that readFully() was a thing. Both that fix and the one here do resolve the original problem we encountered, though I'm having some trouble coming up with a clean unit test for it...

@mateiz
Copy link
Contributor

mateiz commented Nov 5, 2014

Good catch, thanks! Can you check that this is the only version of read() in that code?

@mateiz
Copy link
Contributor

mateiz commented Nov 5, 2014

BTW it would be good to open a JIRA issue for this on https://issues.apache.org/jira/browse/SPARK but unfortunately ASF JIRA seems to be down at the moment.

@mateiz
Copy link
Contributor

mateiz commented Nov 5, 2014

Jenkins, this is ok to test

@SparkQA
Copy link

SparkQA commented Nov 5, 2014

Test build #22904 has started for PR 3093 at commit a245c8a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 5, 2014

Test build #22904 has finished for PR 3093 at commit a245c8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • class RDDFunctions[T: ClassTag](self: RDD[T]) extends Serializable
    • class VectorUDT(UserDefinedType):
    • class NullType(PrimitiveType):
    • class UserDefinedType(DataType):
    • case class ScalaUdfBuilder[T: TypeTag](f: AnyRef)

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22904/
Test PASSed.

@industrial-sloth
Copy link
Contributor Author

Thanks @mateiz - Oops, I actually did open https://issues.apache.org/jira/browse/SPARK-4222 for this, then (inexplicably...) left it off the commit message. Not sure offhand of how best to address that - would be happy to close this PR and open a new one with a proper message, let me know if that's how you'd like to proceed.

This was indeed the only use of read() in the FixedLengthBinary* files.

@srowen
Copy link
Member

srowen commented Nov 5, 2014

@industrial-sloth just change the PR title to "SPARK-4222 [CORE] ..." since that title will be the commit message when merged.

@industrial-sloth industrial-sloth changed the title use readFully in FixedLengthBinaryRecordReader SPARK-4222 [CORE] use readFully in FixedLengthBinaryRecordReader Nov 5, 2014
@industrial-sloth
Copy link
Contributor Author

Thanks @srowen! Done.

@mateiz
Copy link
Contributor

mateiz commented Nov 5, 2014

Cool, thanks. Will merge this soon.

@asfgit asfgit closed this in f37817b Nov 5, 2014
asfgit pushed a commit that referenced this pull request Nov 5, 2014
replaces the existing read() call with readFully().

Author: industrial-sloth <industrial-sloth@users.noreply.github.com>

Closes #3093 from industrial-sloth/branch-1.2-fixedLenRecRdr and squashes the following commits:

a245c8a [industrial-sloth] use readFully in FixedLengthBinaryRecordReader
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants