handle weird chr in fasta headers - fix issue #32 #50

DrYak · 2018-04-23T19:01:56Z

Did a quick edit today to handle issue #32 and weird characters in fasta headers.

Questions :

shall I keep this regex-based approach ?
(Or shall I try to make a fasta header name-cleaning routine that removes anything weird)
I've gone using 'rfind' in c++, because regex are missing in the GCC (4.8) used by bio-conda
But I could use boost::regex as a fall-back (see docopt in Haploclique)

ozagordi · 2018-04-23T19:50:58Z

src/shorah/amplicon.py

@@ -348,7 +349,9 @@ def main(args):
    logging.debug('b2w returned %d', ret_b2w)

    # run diri_sampler on the aligned reads
-    win_file = 'w-%s-%d-%d.reads.fas' % (ref_name, reg_start, reg_stop)
+    win_file = 'w-%s-%u-%u.reads.fas' % (ref_name, reg_start, reg_stop)


I would be in favour of writing file names with a single dot and with a conventional suffix (these were my mistakes many years ago), like

win_file = 'w-%s-%u-%u-reads.fasta' % (ref_name, reg_start, reg_stop)

This would need to be corrected in a few places throughout the code, though.

Basically, that would be replacing all the searches for '.reads' with '-reads'.

But my main concern would be the change of names compared to legacy data.

Which is reasonable. On the other hand, we are dropping a whole bunch of code and relative functions (global reconstruction), changing API, probably switching to shared object files in the future... I'd say that name changes has a minor impact.

ozagordi · 2018-04-23T19:52:05Z

src/shorah/amplicon.py

@@ -325,8 +326,8 @@ def main(args):
    ref_seq = list(SeqIO.parse(in_fasta, 'fasta'))[0]
    ref_name = ref_seq.id
    if region:
-        reg_bound = region.split(':')[1].split('-')
-        reg_start, reg_stop = int(reg_bound[0]), int(reg_bound[1])
+        reg_bound = re.search(r':(?P<start>\d+)-(?P<stop>\d+)$', region)  # handles situation where ':' or '-' appears in the ID


This is very good!

ozagordi · 2018-04-23T19:57:20Z

src/shorah/amplicon.py

@@ -325,8 +326,8 @@ def main(args):
    ref_seq = list(SeqIO.parse(in_fasta, 'fasta'))[0]
    ref_name = ref_seq.id


Would it be easier and also add some robustness to just clean the reference id from special characters? Something like ref_name = re.sub(r'[^\w\s]', '_', ref_seq.id)?

I was thinking about something like that too. (And going even further : like [^[:alnum:]_] or [^\w] with ASCII rule flag).

The main reasons I am hesitating :

Change of file names compared to legacy data.

(shorah 2.0 would end up using different names than shorah 1.x. Scripts relying on these will need to be changed and shorah 2.0 will not necessarily recognize files processed with older versions)

the gcc version (4.8) used by bioconda is an older one and doesn't provide the necessary libstdcxx for the c++11 regex support (it appears in gcc 4.9+).

That would either require to use boost::regex as a fall back on bioconda.

Or working around using algorithm's replace_if and isalnum

On the other hand, a major version bump like 1.x => 2.0 would be a good opportunity to introduce breaking changes.

Scripts will have to be adapted anyway.

I'm not really sure I'm getting the point about gcc, it's long time since I wrote it. Since we are sanitising sequence and, hence, file names in python, isn't it enough to find the last (and only) dot with rfind? Where is the issue?

File names (w-%s-%u-%u.read.fas) as also generated in the C++ side of stuff.
So we should have the exact same file name sanitation on both Python and C++.
Which is problematic because the dinosaur of a GCC version (4.8) still used by bioconda lacks c++11 regex support so we won't be able to use the exact same re.sub command on both sides.

DrYak · 2018-04-24T08:47:48Z

Lastly, we'll also have to pay attention to minor detail between the C++ and the python sides that can come biting us back like the locale causing diverging interpretations of isalnum, \w versus ASCII rule and POSIX [:alpha:].
(Which can also be a problem with printing %f)

I don't know how many people could be running this on their desktops/workstation (hello, French locale !)

DrYak · 2018-04-24T09:06:38Z

src/cpp/b2w.cpp

@@ -243,8 +243,11 @@ int main(int argc, char* argv[])
                    rLen[j] = 0;
                }
                tmp.rLen = rLen;
-                sprintf(filename, "w-%s-%d-%d.reads.fas",  // read window filename
+                sprintf(filename, "w-%s-%u-%u.reads.fas",  // read window filename


Names generated in C++ code here.

But if we change in python code, we could just do "w-%s-%u-%u-reads.fasta" here, right?

but we will need to make sure that a future sanitize_id(tmp.in->header->target_name[i]) does the exact same thing as in python to keep thing coherent within the whole pipeline.

ozagordi · 2018-04-24T12:22:37Z

I tested 0933ede and it fails when there is a dot in the reference name (I just renamed >reference to >reference.2 in examples/ref_genome.fasta and realigned. The problem seems to be that diri_sampler only creates one debug file and one sampling file without window information: w-reference.dbg instead of w-reference.2-1-201.dbg, w-reference.2-68-268.dbg and so on. Would you please give a look?

DrYak · 2018-04-24T13:31:34Z

?!?!?

I did exactly that :

seded the reference name to reference.2
smalted a new .bam file according to the README.me
ran shorah amplicon on it (using freshly installed files)
got the expected results.

Are you sure there's no old version of diri_sampler at the wrong part of your search path (cf. your pullrequest #48 not always transferring the files where it should, and pullrequest #51 about searching them) ?

What you see is the behaviour of the old unpatched diri_sampler (using .find('.') instead of .rfind(".reads"), as if you were still having an old version left dangling around.

ozagordi · 2018-04-24T15:08:01Z

Looks like you are right. I need to check this.

DrYak · 2018-04-24T21:18:03Z

Tomorrow, I'll rebase and merge if tests pass.

ozagordi · 2018-04-25T07:56:52Z

You were right. The reason was that in this PR we had the old setup.py that didn't work with meson + pip.

I was trying to address too many issues at the same time and I created some confusion, sorry.

- handles '.', '-', ':'

DrYak · 2018-04-25T09:19:38Z

Rebased against latest master (that includes the latest setup.py).
Everything seems to work, as it should.

ozagordi · 2018-04-25T09:43:07Z

Thanks for the support!

DrYak requested a review from SoapZA April 23, 2018 19:01

ozagordi reviewed Apr 23, 2018

View reviewed changes

DrYak requested a review from ozagordi April 24, 2018 08:30

DrYak commented Apr 24, 2018

View reviewed changes

DrYak mentioned this pull request Apr 24, 2018

Restore custom install of exe files #48

Merged

ozagordi approved these changes Apr 25, 2018

View reviewed changes

handle weird chr in fasta headers - fix issue #32

d2c7d59

- handles '.', '-', ':'

DrYak force-pushed the anti-dots branch from e79850b to d2c7d59 Compare April 25, 2018 09:18

DrYak merged commit 6925abb into master Apr 25, 2018

DrYak deleted the anti-dots branch April 25, 2018 09:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

handle weird chr in fasta headers - fix issue #32 #50

handle weird chr in fasta headers - fix issue #32 #50

DrYak commented Apr 23, 2018

ozagordi Apr 23, 2018 •

edited

Loading

DrYak Apr 24, 2018

ozagordi Apr 24, 2018

ozagordi Apr 23, 2018

ozagordi Apr 23, 2018

DrYak Apr 24, 2018

ozagordi Apr 24, 2018

DrYak Apr 24, 2018

DrYak commented Apr 24, 2018

DrYak Apr 24, 2018

ozagordi Apr 24, 2018

DrYak Apr 24, 2018

ozagordi commented Apr 24, 2018

DrYak commented Apr 24, 2018

ozagordi commented Apr 24, 2018

DrYak commented Apr 24, 2018

ozagordi commented Apr 25, 2018

DrYak commented Apr 25, 2018

ozagordi commented Apr 25, 2018

		@@ -325,8 +326,8 @@ def main(args):
		ref_seq = list(SeqIO.parse(in_fasta, 'fasta'))[0]
		ref_name = ref_seq.id

handle weird chr in fasta headers - fix issue #32 #50

handle weird chr in fasta headers - fix issue #32 #50

Conversation

DrYak commented Apr 23, 2018

ozagordi Apr 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DrYak commented Apr 24, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ozagordi commented Apr 24, 2018

DrYak commented Apr 24, 2018

ozagordi commented Apr 24, 2018

DrYak commented Apr 24, 2018

ozagordi commented Apr 25, 2018

DrYak commented Apr 25, 2018

ozagordi commented Apr 25, 2018

ozagordi Apr 23, 2018 •

edited

Loading