-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
handle weird chr in fasta headers - fix issue #32 #50
Conversation
@@ -348,7 +349,9 @@ def main(args): | |||
logging.debug('b2w returned %d', ret_b2w) | |||
|
|||
# run diri_sampler on the aligned reads | |||
win_file = 'w-%s-%d-%d.reads.fas' % (ref_name, reg_start, reg_stop) | |||
win_file = 'w-%s-%u-%u.reads.fas' % (ref_name, reg_start, reg_stop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be in favour of writing file names with a single dot and with a conventional suffix (these were my mistakes many years ago), like
win_file = 'w-%s-%u-%u-reads.fasta' % (ref_name, reg_start, reg_stop)
This would need to be corrected in a few places throughout the code, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, that would be replacing all the searches for '.reads' with '-reads'.
But my main concern would be the change of names compared to legacy data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which is reasonable. On the other hand, we are dropping a whole bunch of code and relative functions (global reconstruction), changing API, probably switching to shared object files in the future... I'd say that name changes has a minor impact.
@@ -325,8 +326,8 @@ def main(args): | |||
ref_seq = list(SeqIO.parse(in_fasta, 'fasta'))[0] | |||
ref_name = ref_seq.id | |||
if region: | |||
reg_bound = region.split(':')[1].split('-') | |||
reg_start, reg_stop = int(reg_bound[0]), int(reg_bound[1]) | |||
reg_bound = re.search(r':(?P<start>\d+)-(?P<stop>\d+)$', region) # handles situation where ':' or '-' appears in the ID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very good!
@@ -325,8 +326,8 @@ def main(args): | |||
ref_seq = list(SeqIO.parse(in_fasta, 'fasta'))[0] | |||
ref_name = ref_seq.id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be easier and also add some robustness to just clean the reference id from special characters? Something like ref_name = re.sub(r'[^\w\s]', '_', ref_seq.id)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about something like that too. (And going even further : like [^[:alnum:]_]
or [^\w]
with ASCII rule flag).
The main reasons I am hesitating :
-
Change of file names compared to legacy data.
(shorah 2.0 would end up using different names than shorah 1.x. Scripts relying on these will need to be changed and shorah 2.0 will not necessarily recognize files processed with older versions)
-
the gcc version (4.8) used by bioconda is an older one and doesn't provide the necessary
libstdcxx
for the c++11 regex support (it appears in gcc 4.9+).That would either require to use boost::regex as a fall back on bioconda.
Or working around using
algorithm
'sreplace_if
andisalnum
On the other hand, a major version bump like 1.x => 2.0 would be a good opportunity to introduce breaking changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scripts will have to be adapted anyway.
I'm not really sure I'm getting the point about gcc, it's long time since I wrote it. Since we are sanitising sequence and, hence, file names in python, isn't it enough to find the last (and only) dot with rfind
? Where is the issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
File names (w-%s-%u-%u.read.fas
) as also generated in the C++ side of stuff.
So we should have the exact same file name sanitation on both Python and C++.
Which is problematic because the dinosaur of a GCC version (4.8) still used by bioconda lacks c++11 regex support so we won't be able to use the exact same re.sub
command on both sides.
Lastly, we'll also have to pay attention to minor detail between the C++ and the python sides that can come biting us back like the locale causing diverging interpretations of I don't know how many people could be running this on their desktops/workstation (hello, French locale !) |
@@ -243,8 +243,11 @@ int main(int argc, char* argv[]) | |||
rLen[j] = 0; | |||
} | |||
tmp.rLen = rLen; | |||
sprintf(filename, "w-%s-%d-%d.reads.fas", // read window filename | |||
sprintf(filename, "w-%s-%u-%u.reads.fas", // read window filename |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Names generated in C++ code here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if we change in python code, we could just do "w-%s-%u-%u-reads.fasta"
here, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested 0933ede and it fails when there is a dot in the reference name (I just renamed |
?!?!? I did exactly that :
Are you sure there's no old version of What you see is the behaviour of the old unpatched |
Looks like you are right. I need to check this. |
Tomorrow, I'll rebase and merge if tests pass. |
You were right. The reason was that in this PR we had the old I was trying to address too many issues at the same time and I created some confusion, sorry. |
- handles '.', '-', ':'
Rebased against latest master (that includes the latest |
Thanks for the support! |
Did a quick edit today to handle issue #32 and weird characters in fasta headers.
Questions :