-
Notifications
You must be signed in to change notification settings - Fork 577
Commit
- Loading branch information
There are no files selected for viewing
6 comments
on commit 0ea45ee
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vruano now i completely rewrote this old gatk3 code and encoded the whole thing as a regexp. please review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I used a regex in my previous review I didn't mean to encourage to use one for the actual solution although I have to admit that it occurred to me. This could be a very elegant solution although it might not work well in practice if it turns out to be inefficient respect to just work with the list of CigarOperations.... I would like some input about this from @lbergelson or @droazen. Is true that we should not fall in premature optimizations but is also true that we should not produce grossly inefficient code (I'm kind of quoting @droazen here). Hopefully the pattern pre-compilation step does a good job in preventing that but we have to make sure...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before go ahead with merging this pull-request I would like to make sure that we are actually capturing what we are supposed to be capturing here.
For example,
- Do we need to filter out operators with 0 length (is this a valid Cigar at all?)
- Do we care if the cigar has consecutive operators of the same kind? what we should fail for two consecutive insertions (xIyI were x and y are numbers) but not for consecutive match (xMyM).
- Why we should not allow for a Insertion to be followed by a Deletion or viceversa. I think it is a genuine possibility otherwise you are kind of forcing an artifactual base map. Perhaps I am missing something here though.
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the intention is to port what there is in GATK3 as it is for now, the reg-expression provided is not doing it fully.
For example GATK3's filter would not complain for consecutive H operations at the beginning whereas the new code does.
Also it does accept non empty CIGAR with out any operator that consume read bases such as N or P. That would not happen with the GATK3 code.
I don't know if this can be captured appropriately with a regular expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we stick to using regex ...
I don't see the need of the of() operation at all with an or(String ... comps) that would join using the sep "I" and surround each component with brackets makes the code more readable IMO. Also you would be able to get rid of the clutter from those repetitive string concats and pipe ("I") string literals.
You could have static method to add starts or crosses star(), plus().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finish with my review back to @akiezun
I think that pat should be renamed to something more informative and using upper cases and underscores ... E.g. BAD_CIGARS?