Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't discard unmapped reads in indel realignment #2019

Open
wants to merge 4 commits into
base: master
from

Conversation

Projects
None yet
5 participants
@pauldwolfe
Copy link
Contributor

pauldwolfe commented Jul 22, 2018

When doing some QC on ADAM outputs we noticed unmapped reads were dropped in indel realignment. We need to keep all reads throughout our pipeline, mapped or no.

Improvement comments on the implementation welcome, its a bit quick and dirty but at least gives us something to talk about. The idea now is to just union the unmapped reads when realignment is complete. Could sort at this point as well but decided to leave that to the client. I'm not sure indel realignment returns a sorted RDD in the first place.

Paul Wolfe added some commits Jun 28, 2018

Paul Wolfe Paul Wolfe
[ADAM-1697] Expand illumina metadata regex
Now includes possible index sequence characters.
Paul Wolfe Paul Wolfe
[ADAM-1697] Expand illumina metadata regex
Fix typo in unit test.
Paul Wolfe Paul Wolfe
ADAM-2018 Append unmapped reads after indel realignment
Resolves #2018. This change will optionally append the unmapped reads
to the realigned RDD when indel realignment is complete.
Paul Wolfe Paul Wolfe
Merge remote-tracking branch 'upstream/master'
# Conflicts:
#	adam-core/src/test/scala/org/bdgenomics/adam/rdd/fragment/FragmentRDDSuite.scala

@pauldwolfe pauldwolfe changed the title [ADAM-2018] Don't discard unmapped reads in indel realignment Don't discard unmapped reads in indel realignment Jul 22, 2018

@pauldwolfe

This comment has been minimized.

Copy link
Contributor Author

pauldwolfe commented Jul 22, 2018

Just realized I made a bit of a mess of this pull request by using an old fork. Let me know if we can salvage this one or if you'd like me to make a new fork from scratch

@coveralls

This comment has been minimized.

Copy link

coveralls commented Jul 22, 2018

Coverage Status

Coverage increased (+0.003%) to 78.937% when pulling a0cb07e on pauldwolfe:ADAM-2018 into 8e8007e on bigdatagenomics:master.

1 similar comment
@coveralls

This comment has been minimized.

Copy link

coveralls commented Jul 22, 2018

Coverage Status

Coverage increased (+0.003%) to 78.937% when pulling a0cb07e on pauldwolfe:ADAM-2018 into 8e8007e on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jul 22, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2792/
Test PASSed.

@heuermh

This comment has been minimized.

Copy link
Member

heuermh commented Jul 23, 2018

Thank you for the contribution, @pauldwolfe!

No where in the scaladoc does it say that unmapped reads are filtered out; we should certainly add that.

Can the select for unmapped reads and later union happen outside the call to realignIndels? I wonder if that might be cleaner than adding another flag to the method call.

@pauldwolfe

This comment has been minimized.

Copy link
Contributor Author

pauldwolfe commented Jul 24, 2018

Hey @heuermh , see your point, adding more options and conditionals always increases complexity. I don't mind pulling the fix out into my code, it's pretty straightforward.

That said, you might want to consider going the other direction, and having realign indels always return the unmapped reads. Looking at the commit that introduced this behaviour, seems like an internal performance improvement with a functional side-effect rather than an intentional behaviour change. Also from a client perspective it's easier to filter the unmapped after realignment than to do the filter + union.

Just a thought, let me know what you think. Otherwise I'll go ahead and handle in my own code.

@fnothaft

This comment has been minimized.

Copy link
Member

fnothaft commented Jan 14, 2019

Jenkins, test this please.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 14, 2019

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2934/

Build result: FAILURE

GitHub pull request #2019 of commit a0cb07e.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prbWiping out workspace first.Cloning the remote Git repositoryCloning repository https://github.com/bigdatagenomics/adam.git > git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > git rev-parse a0cb07e^{commit} # timeout=10 > git branch -a -v --no-abbrev --contains a0cb07e # timeout=10Checking out Revision a0cb07e (origin/pr/2019/head, origin/pr/2019/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f a0cb07e45e73727f38b0247370d51e845cae6bb4First time build. Skipping changelog.Triggering ADAM-prb ? 2.7.5,2.11,2.2.2,ubuntuADAM-prb ? 2.7.5,2.11,2.2.2,ubuntu completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.