scala.MatchError RegExp does not catch colons in value part properly #1061

Closed
pauca opened this Issue Jun 30, 2016 · 1 comment

pauca commented Jun 30, 2016

The line

val attrRegex = RegExp("([^:]{2,4}):([AifZHB]):([cCiIsSf]{1},)?(.*)")

does not properly handle alignment records with attributes like
OQ:Z:C55/15D:::::::.7GFFAFDA442.40F=AGHHE
i.e., attributes that have colons in the value part.

Some problematic reads are contained in the GATK bundle file CEUTrio.HiSeq.WGS.b37.NA12878.bam

scala> BamWriter.adamSAMSave( "output.bam", bam.sequences, bam.recordGroups , true, true ,false)
2016-06-30 17:01:41 ERROR Utils:95 - Aborting task
scala.MatchError: Z:C, (of class java.lang.String)
    at org.bdgenomics.adam.util.AttributeUtils$.createAttribute(AttributeUtils.scala:92)
    at org.bdgenomics.adam.util.AttributeUtils$.parseAttribute(AttributeUtils.scala:74)
    at org.bdgenomics.adam.util.AttributeUtils$$anonfun$parseAttributes$2.apply(AttributeUtils.scala:61)
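
For anyone trying to reproduce this without the full BAM, here is a minimal sketch of how the quoted pattern splits an attribute string. It uses the standard library `scala.util.matching.Regex` rather than ADAM's own `RegExp` helper, and the object name `AttrRegexDemo`, the `split` method, and the sample value `C,D:::.7G` are made up for illustration. The `Z:C,` in the error message suggests that the optional third group grabbed a leading `C,` from a string value, which is what the sketch shows.

```scala
import scala.util.matching.Regex

object AttrRegexDemo {
  // The pattern quoted in the report, compiled with the standard library
  // (assumption: ADAM's RegExp helper behaves like a plain regex here).
  val attrRegex: Regex = "([^:]{2,4}):([AifZHB]):([cCiIsSf]{1},)?(.*)".r

  // Returns (tag, type, optional array-type prefix, remaining value),
  // or None if the attribute string does not match at all.
  def split(attr: String): Option[(String, String, Option[String], String)] =
    attr match {
      case attrRegex(tag, tpe, arrayType, value) =>
        // An unmatched optional group comes back as null, hence Option(...).
        Some((tag, tpe, Option(arrayType), value))
      case _ => None
    }
}
```

With this sketch, a string value that begins with one of `cCiIsSf` followed by a comma loses its first two characters to the array-type group, even though the attribute type is `Z`. Colons in the value are consumed by the final `(.*)` group and do not seem to be the trigger on their own.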

Member

fnothaft commented Jul 1, 2016

Thanks for reporting this @pauca! We will look into this in the next week. We have some separate logic to extract the OQ field, and I think this isn't getting handled properly.

@fnothaft fnothaft added the bug label Jul 1, 2016

@fnothaft fnothaft added this to the 0.20.0 milestone Jul 1, 2016

@fnothaft fnothaft self-assigned this Jul 16, 2016

fnothaft added a commit to fnothaft/adam that referenced this issue Jul 17, 2016

[ADAM-1061] Fix attribute regex bug.
We had a bug in `org.bdgenomics.adam.util.AttributeUtils` where the regex for
splitting out the formatting string for array attributes was applied to all
attributes. In an array attribute (SAM "B" tags), the type of the array elements
is encoded before the attribute values, and is split off by commas. E.g.,
"B:i,1,2,3". If the attribute is a string (SAM "Z" tags), commas are allowed.
To resolve this, I split this regex into two regexes. We only apply the
regex for splitting out the array type if we are working on an array
attribute. This resolves #1061.
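
The two-regex approach described in the commit message can be sketched roughly as follows. This is not the code from the commit: the object name `TwoStageAttrParse`, the `Parsed` types, and the `parse` method are hypothetical, and the patterns are an assumption derived from the regex quoted in the report.

```scala
import scala.util.matching.Regex

object TwoStageAttrParse {
  // Stage 1: split tag, type, and raw value; no array handling here.
  val attrRegex: Regex = "([^:]{2,4}):([AifZHB]):(.*)".r
  // Stage 2: applied only to "B" attributes, splits the element type
  // from the comma-separated values.
  val arrayRegex: Regex = "([cCiIsSf]),(.*)".r

  sealed trait Parsed
  case class Scalar(tag: String, tpe: String, value: String) extends Parsed
  case class ArrayAttr(tag: String, elemType: String, values: List[String]) extends Parsed

  def parse(attr: String): Option[Parsed] = attr match {
    // Array attribute: the second regex runs on the value part only.
    case attrRegex(tag, "B", rest) =>
      rest match {
        case arrayRegex(elemType, values) =>
          Some(ArrayAttr(tag, elemType, values.split(",").toList))
        case _ => None
      }
    // Every other type keeps its value untouched, commas and colons included.
    case attrRegex(tag, tpe, value) => Some(Scalar(tag, tpe, value))
    case _ => None
  }
}
```

In this sketch `parse("XB:B:i,1,2,3")` yields an array attribute with element type `i`, while a `Z` value such as `C,D:::.7G` is returned whole.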

fnothaft added a commit to fnothaft/adam that referenced this issue Jul 18, 2016

[ADAM-1061] Fix attribute regex bug.

@heuermh heuermh closed this in #1080 Jul 19, 2016
