New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding utilities for read trimming. #248

Merged
merged 1 commit into from May 29, 2014

Conversation

Projects
None yet
5 participants
@fnothaft
Member

fnothaft commented May 28, 2014

This pull request adds a few utilities for read trimming, as a first start towards read error correction. This pull request contains:

  • An update to the ADAMRecord schema to add fields for bases trimmed from the read start/end. After alignment, these fields are redundant (with hard clipping in the CIGAR), but the fields are important for trimming performed before alignment. These fields are added in the SAMRecordConverter.
  • Two new RDD functions. One performs a fixed length trim to all reads. A second function trims the prefix/suffix of all reads in a read group based on the geometric mean of the error probability at that position.
  • An update to the Transform CLI module, to add these two functions as possible transforms.
  • A correction to the CycleCovariate that correctly calculates the sequencing cycle for hard clipped reads.
@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins May 28, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/327/

AmplabJenkins commented May 28, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/327/

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins May 28, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/328/

AmplabJenkins commented May 28, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/328/

@@ -31,6 +31,8 @@ record ADAMRecord {
union { null, string } qual = null;
union { null, string } recordGroupName = null;
union { null, int } recordGroupId = null;
union { int, null } basesTrimmedFromStart = 0;

This comment has been minimized.

@tdanford

tdanford May 29, 2014

Contributor

I forget (Matt, this was the question I was going to ask you on IM), is there a backwards-compatibility issue with adding extra records to the middle of an Avro schema like this?

@tdanford

tdanford May 29, 2014

Contributor

I forget (Matt, this was the question I was going to ask you on IM), is there a backwards-compatibility issue with adding extra records to the middle of an Avro schema like this?

This comment has been minimized.

@fnothaft

fnothaft May 29, 2014

Member

Paging @massie...

@fnothaft

This comment has been minimized.

@massie

massie May 29, 2014

Member

I was wrong to worry about adding fields in the middle of the schema before. Both Avro and Parquet use name-based column access. We can add, remove and reorder the fields, we just can't rename them.

See https://github.com/Parquet/parquet-format/issues/91

@massie

massie May 29, 2014

Member

I was wrong to worry about adding fields in the middle of the schema before. Both Avro and Parquet use name-based column access. We can add, remove and reorder the fields, we just can't rename them.

See https://github.com/Parquet/parquet-format/issues/91

This comment has been minimized.

@tdanford

tdanford May 29, 2014

Contributor

Okay, then I have no comments on this line.

@tdanford

tdanford May 29, 2014

Contributor

Okay, then I have no comments on this line.

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford May 29, 2014

Contributor

Are we still maintaining updates to the CHANGES file(s) as part of a pull request?

Contributor

tdanford commented May 29, 2014

Are we still maintaining updates to the CHANGES file(s) as part of a pull request?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft May 29, 2014

Member

@tdanford I believe we've dropped the CHANGES.txt, and are just keeping the CHANGES.md. However, CHANGES.md should probably be updated via the script whenever we do a release.

Member

fnothaft commented May 29, 2014

@tdanford I believe we've dropped the CHANGES.txt, and are just keeping the CHANGES.md. However, CHANGES.md should probably be updated via the script whenever we do a release.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins May 29, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/332/

AmplabJenkins commented May 29, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/332/

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford May 29, 2014

Contributor

Frank, I think you've got some rebasing to do -- it looks like there are some commits in this branch that are duplicated from master, right?

Contributor

tdanford commented May 29, 2014

Frank, I think you've got some rebasing to do -- it looks like there are some commits in this branch that are duplicated from master, right?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft May 29, 2014

Member

@tdanford correct; not sure how that happened...

Member

fnothaft commented May 29, 2014

@tdanford correct; not sure how that happened...

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft May 29, 2014

Member

@tdanford this is rebased now, thanks for the heads up!

Member

fnothaft commented May 29, 2014

@tdanford this is rebased now, thanks for the heads up!

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford May 29, 2014

Contributor

Again! I just merged Matt's edit to CONTRIBUTING.md, you're still out of date! :-)

Contributor

tdanford commented May 29, 2014

Again! I just merged Matt's edit to CONTRIBUTING.md, you're still out of date! :-)

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft May 29, 2014

Member

@tdanford u r rebase troll :'(

Re-rebased...

Member

fnothaft commented May 29, 2014

@tdanford u r rebase troll :'(

Re-rebased...

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford May 29, 2014

Contributor

I like my commit histories like I like my spaces: linear.

Contributor

tdanford commented May 29, 2014

I like my commit histories like I like my spaces: linear.

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford May 29, 2014

Contributor

Okay, a more substantive comment: Frank, do you have somewhere (comments? a change file?) where you can write a little bit about the motivation behind these trimming additions? Are you trying to recreate specific functionality from Picard (I'm thinking the MergeBamAlignments command, but maybe something else)? From somewhere else? Does this satisfy a requirement or need for avocado or variant calling in particular? I understand there are a lot of reasons why one might "trim" reads, I'm just hoping for a little context here.

Contributor

tdanford commented May 29, 2014

Okay, a more substantive comment: Frank, do you have somewhere (comments? a change file?) where you can write a little bit about the motivation behind these trimming additions? Are you trying to recreate specific functionality from Picard (I'm thinking the MergeBamAlignments command, but maybe something else)? From somewhere else? Does this satisfy a requirement or need for avocado or variant calling in particular? I understand there are a lot of reasons why one might "trim" reads, I'm just hoping for a little context here.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft May 29, 2014

Member

@tdanford Good point; I don't have this documented yet. When I started on this, I was planning to implement read error correction first (à la Quake), but I then decided that read trimming was a better first start. The need is indeed for variant calling/assembly.

Member

fnothaft commented May 29, 2014

@tdanford Good point; I don't have this documented yet. When I started on this, I was planning to implement read error correction first (à la Quake), but I then decided that read trimming was a better first start. The need is indeed for variant calling/assembly.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins May 29, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/336/

AmplabJenkins commented May 29, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/336/

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins May 29, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/337/

AmplabJenkins commented May 29, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/337/

massie added a commit that referenced this pull request May 29, 2014

Merge pull request #248 from fnothaft/read-trimming
Adding utilities for read trimming.

@massie massie merged commit abe6834 into bigdatagenomics:master May 29, 2014

1 check passed

default Merged build finished.
Details
@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie May 29, 2014

Member

Thanks, Frank!

Member

massie commented May 29, 2014

Thanks, Frank!

@fnothaft fnothaft deleted the fnothaft:read-trimming branch Jul 10, 2014

@laserson

This comment has been minimized.

Show comment
Hide comment
@laserson

laserson Sep 10, 2015

Contributor

@fnothaft what are you computing this manually rather than taking the difference between samRecord.getUnclippedStart() and samRecord.getStart()? Is it because ADAM treats hard/soft clipping differently than samtools?

@fnothaft what are you computing this manually rather than taking the difference between samRecord.getUnclippedStart() and samRecord.getStart()? Is it because ADAM treats hard/soft clipping differently than samtools?

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 10, 2015

Member

Old school!

IIRC, samRecord.getUnclippedStart just looks at soft clipped bases. The AlignmentRecord basesClippedFromStart field specifically refers to hard clipped bases.

Member

fnothaft replied Sep 10, 2015

Old school!

IIRC, samRecord.getUnclippedStart just looks at soft clipped bases. The AlignmentRecord basesClippedFromStart field specifically refers to hard clipped bases.

This comment has been minimized.

Show comment
Hide comment
@laserson

laserson Sep 10, 2015

Contributor

Perfect. Looking at this as I'm trying to wire up hellbender to read/write ADAM

Contributor

laserson replied Sep 10, 2015

Perfect. Looking at this as I'm trying to wire up hellbender to read/write ADAM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment