
Updating to spark 3.0.1 and hadoop 3.2.1 #141

Merged (5 commits, Oct 7, 2020)

Conversation

@lbergelson (Contributor) commented Jun 19, 2020

Fixes #130, fixes #142

@heuermh Can you weigh in on this? I needed to make a weird change to use the RawLocalFileSystem in order to avoid a checksum issue. I'm not sure why we're getting the checksum failure though. I suspect it's not getting recomputed correctly after some operation but I don't know why.

If I don't force it to use the raw filesystem, we get:

htsjdk.samtools.util.RuntimeIOException: org.apache.hadoop.fs.ChecksumException: Checksum error: file:/var/folders/q3/hw5cxmn52wq347lg7rb_mzlw0000gq/T/test1179670579977857255.vcf at 0

	at htsjdk.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:88)
	at htsjdk.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:75)
	at htsjdk.samtools.util.AbstractIterator.hasNext(AbstractIterator.java:44)
	at htsjdk.tribble.readers.AsciiLineReaderIterator$TupleIterator.<init>(AsciiLineReaderIterator.java:78)
	at htsjdk.tribble.readers.AsciiLineReaderIterator.<init>(AsciiLineReaderIterator.java:33)
	at org.disq_bio.disq.impl.formats.vcf.VcfSource.getFileHeader(VcfSource.java:80)
	at org.disq_bio.disq.HtsjdkVariantsRddStorage.read(HtsjdkVariantsRddStorage.java:96)
	at org.disq_bio.disq.HtsjdkVariantsRddStorage.read(HtsjdkVariantsRddStorage.java:80)
	at org.disq_bio.disq.HtsjdkVariantsRddTest.testReadAndWrite(HtsjdkVariantsRddTest.java:97)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at junitparams.internal.InvokeParameterisedMethod.evaluate(InvokeParameterisedMethod.java:234)
	at junitparams.internal.ParameterisedTestMethodRunner.runMethodInvoker(ParameterisedTestMethodRunner.java:47)
	at junitparams.internal.ParameterisedTestMethodRunner.runTestMethod(ParameterisedTestMethodRunner.java:40)
	at junitparams.internal.ParameterisedTestClassRunner.runParameterisedTest(ParameterisedTestClassRunner.java:146)
	at junitparams.JUnitParamsRunner.runChild(JUnitParamsRunner.java:446)
	at junitparams.JUnitParamsRunner.runChild(JUnitParamsRunner.java:393)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
	at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:230)
	at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:58)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error: file:/var/folders/q3/hw5cxmn52wq347lg7rb_mzlw0000gq/T/test1179670579977857255.vcf at 0
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:260)
	at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:300)
	at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:252)
	at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:197)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at org.disq_bio.disq.impl.file.HadoopFileSystemWrapper$SeekableHadoopStream.read(HadoopFileSystemWrapper.java:241)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at htsjdk.samtools.seekablestream.SeekableBufferedStream.read(SeekableBufferedStream.java:133)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at java.io.FilterInputStream.read(FilterInputStream.java:107)
	at htsjdk.tribble.readers.PositionalBufferedStream.fill(PositionalBufferedStream.java:132)
	at htsjdk.tribble.readers.PositionalBufferedStream.peek(PositionalBufferedStream.java:123)
	at htsjdk.tribble.readers.PositionalBufferedStream.read(PositionalBufferedStream.java:62)
	at htsjdk.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:134)
	at htsjdk.tribble.readers.AsciiLineReader.readLine(AsciiLineReader.java:182)
	at htsjdk.tribble.readers.AsciiLineReaderIterator$TupleIterator.advance(AsciiLineReaderIterator.java:86)
	... 34 more

There may be other mechanisms to avoid this check. A better solution would be to make the check pass, but I'm not sure why it's failing in the first place.
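For context on why a stale checksum can surface this way: Hadoop's ChecksumFileSystem (which LocalFileSystem extends) stores a CRC sidecar file next to each file it writes and verifies it on read, so if the data file is later rewritten through a path that bypasses the checksummed filesystem, the stale sidecar produces exactly this kind of ChecksumException. A simplified, self-contained sketch of that failure mode, using plain java.util.zip.CRC32 rather than Hadoop's actual chunked CRC32C scheme (the file contents and class name here are illustrative, not from the test that fails):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumDemo {

  // Compute a CRC32 over the whole file body (Hadoop actually checksums
  // fixed-size chunks with CRC32C, but the failure mode is the same).
  static long crcOf(byte[] data) {
    CRC32 crc = new CRC32();
    crc.update(data, 0, data.length);
    return crc.getValue();
  }

  public static void main(String[] args) {
    // Sidecar checksum recorded when the file was first created.
    byte[] original = "##fileformat=VCFv4.2\n".getBytes(StandardCharsets.UTF_8);
    long sidecarCrc = crcOf(original);

    // The data file is rewritten, but the sidecar checksum is not updated.
    byte[] rewritten = "##fileformat=VCFv4.3\n".getBytes(StandardCharsets.UTF_8);

    // On the next read, a checksummed filesystem recomputes and compares.
    if (crcOf(rewritten) != sidecarCrc) {
      System.out.println("ChecksumException: Checksum error at 0");
    } else {
      System.out.println("checksum ok");
    }
  }
}
```

Reading through the raw filesystem sidesteps the comparison entirely (no sidecar is consulted), which is consistent with why forcing RawLocalFileSystem makes the error go away without explaining where the stale checksum comes from.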

Would you support dropping support for Spark 2 and Scala 2.11? I'm in favor because it makes my life easier, but I'm not sure which versions you need to support.

This would close #130

@tomwhite If you happen to have any insight into the checksum thing it might be valuable. I believe we had a similar issue in the hadoop-bam days but it went away in disq.

@lbergelson lbergelson force-pushed the lb_update_to_spark_hadoop_3 branch 2 times, most recently from 6aead20 to 5a31e40 Compare June 19, 2020 21:32
@heuermh (Contributor) commented Jun 21, 2020

Would you support dropping support for Spark 2 and Scala 2.11?

I would really like to; it depends mostly on how soon AWS EMR and other cloud providers' support for Spark 3 shows up.

As far as this particular issue goes, I will be updating all our Spark 3 related pull requests to use the 3.0 release version this week. I expect to run into other runtime issues, and will investigate this along with everything else I find.

@lbergelson (Contributor, Author) commented
@heuermh Have you gotten a chance to take a look at this at all?

@heuermh (Contributor) commented Aug 10, 2020

Thanks for the ping! Yeah, we have released ADAM and downstream cross-building with Scala 2.12 and Spark 3. For Disq, going forward I would be fine with only releasing against Spark 3. I have not had a chance to investigate this issue specifically.

@droazen left a review comment
@lbergelson One comment, otherwise looks good to me

throws IOException {
  final FileSystem fileSystem = p.getFileSystem(conf);
  if (fileSystem instanceof LocalFileSystem) {
    return ((LocalFileSystem) fileSystem).getRawFileSystem();

Add a comment explaining this special casing of LocalFileSystem (or, if we can't explain it, at least provide a comment with a reference for where the fix came from).

@heuermh (Contributor) commented Oct 7, 2020

From the Travis failure log:

[ERROR] Found 1 non-complying files, failing build
[ERROR] To fix formatting errors, run "mvn com.coveo:fmt-maven-plugin:format"
[ERROR] Non complying file: /home/travis/build/disq-bio/disq/src/main/java/org/disq_bio/disq/impl/file/HadoopFileSystemWrapper.java

@lbergelson (Contributor, Author) commented
Whoops! I always forget to run the linter locally.

@lbergelson lbergelson changed the title Updating to spark 3.0.0 and hadoop 3.2.1 Updating to spark 3.0.1 and hadoop 3.2.1 Oct 7, 2020
@lbergelson (Contributor, Author) commented

I'm going to merge this. If we ever understand it better we should revisit it...

@lbergelson lbergelson merged commit 4c44399 into master Oct 7, 2020
@lbergelson lbergelson deleted the lb_update_to_spark_hadoop_3 branch October 7, 2020 22:16
@heuermh (Contributor) commented Oct 8, 2020

Thank you, @lbergelson!

Successfully merging this pull request may close these issues:
- Update Spark dependency to version 3.0.1
- Add support for Spark 3