Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle weird chr in fasta headers - fix issue #32 #50

Merged
merged 1 commit into from
Apr 25, 2018
Merged

Conversation

DrYak
Copy link
Member

@DrYak DrYak commented Apr 23, 2018

Did a quick edit today to handle issue #32 and weird characters in fasta headers.

Questions :

  • shall I keep this regex-based approach ?
  • (Or shall I try to make a fasta header name-cleaning routine that removes anything weird)
  • I've gone using 'rfind' in c++, because regex are missing in the GCC (4.8) used by bio-conda
  • But I could use boost::regex as a fall-back (see docopt in Haploclique)

@DrYak DrYak requested a review from SoapZA April 23, 2018 19:01
@@ -348,7 +349,9 @@ def main(args):
logging.debug('b2w returned %d', ret_b2w)

# run diri_sampler on the aligned reads
win_file = 'w-%s-%d-%d.reads.fas' % (ref_name, reg_start, reg_stop)
win_file = 'w-%s-%u-%u.reads.fas' % (ref_name, reg_start, reg_stop)
Copy link
Collaborator

@ozagordi ozagordi Apr 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be in favour of writing file names with a single dot and with a conventional suffix (these were my mistakes many years ago), like

win_file = 'w-%s-%u-%u-reads.fasta' % (ref_name, reg_start, reg_stop)

This would need to be corrected in a few places throughout the code, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, that would be replacing all the searches for '.reads' with '-reads'.

But my main concern would be the change of names compared to legacy data.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is reasonable. On the other hand, we are dropping a whole bunch of code and relative functions (global reconstruction), changing API, probably switching to shared object files in the future... I'd say that name changes has a minor impact.

@@ -325,8 +326,8 @@ def main(args):
ref_seq = list(SeqIO.parse(in_fasta, 'fasta'))[0]
ref_name = ref_seq.id
if region:
reg_bound = region.split(':')[1].split('-')
reg_start, reg_stop = int(reg_bound[0]), int(reg_bound[1])
reg_bound = re.search(r':(?P<start>\d+)-(?P<stop>\d+)$', region) # handles situation where ':' or '-' appears in the ID
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very good!

@@ -325,8 +326,8 @@ def main(args):
ref_seq = list(SeqIO.parse(in_fasta, 'fasta'))[0]
ref_name = ref_seq.id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be easier and also add some robustness to just clean the reference id from special characters? Something like ref_name = re.sub(r'[^\w\s]', '_', ref_seq.id)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about something like that too. (And going even further : like [^[:alnum:]_] or [^\w] with ASCII rule flag).

The main reasons I am hesitating :

  • Change of file names compared to legacy data.

    (shorah 2.0 would end up using different names than shorah 1.x. Scripts relying on these will need to be changed and shorah 2.0 will not necessarily recognize files processed with older versions)

  • the gcc version (4.8) used by bioconda is an older one and doesn't provide the necessary libstdcxx for the c++11 regex support (it appears in gcc 4.9+).

    That would either require to use boost::regex as a fall back on bioconda.

    Or working around using algorithm's replace_if and isalnum

On the other hand, a major version bump like 1.x => 2.0 would be a good opportunity to introduce breaking changes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scripts will have to be adapted anyway.

I'm not really sure I'm getting the point about gcc, it's long time since I wrote it. Since we are sanitising sequence and, hence, file names in python, isn't it enough to find the last (and only) dot with rfind? Where is the issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File names (w-%s-%u-%u.read.fas) as also generated in the C++ side of stuff.
So we should have the exact same file name sanitation on both Python and C++.
Which is problematic because the dinosaur of a GCC version (4.8) still used by bioconda lacks c++11 regex support so we won't be able to use the exact same re.sub command on both sides.

@DrYak DrYak requested a review from ozagordi April 24, 2018 08:30
@DrYak
Copy link
Member Author

DrYak commented Apr 24, 2018

Lastly, we'll also have to pay attention to minor detail between the C++ and the python sides that can come biting us back like the locale causing diverging interpretations of isalnum, \w versus ASCII rule and POSIX [:alpha:].
(Which can also be a problem with printing %f)

I don't know how many people could be running this on their desktops/workstation (hello, French locale !)

@@ -243,8 +243,11 @@ int main(int argc, char* argv[])
rLen[j] = 0;
}
tmp.rLen = rLen;
sprintf(filename, "w-%s-%d-%d.reads.fas", // read window filename
sprintf(filename, "w-%s-%u-%u.reads.fas", // read window filename
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Names generated in C++ code here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we change in python code, we could just do "w-%s-%u-%u-reads.fasta" here, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we will need to make sure that a future sanitize_id(tmp.in->header->target_name[i]) does the exact same thing as in python to keep thing coherent within the whole pipeline.

@ozagordi
Copy link
Collaborator

I tested 0933ede and it fails when there is a dot in the reference name (I just renamed >reference to >reference.2 in examples/ref_genome.fasta and realigned. The problem seems to be that diri_sampler only creates one debug file and one sampling file without window information: w-reference.dbg instead of w-reference.2-1-201.dbg, w-reference.2-68-268.dbg and so on. Would you please give a look?

@DrYak
Copy link
Member Author

DrYak commented Apr 24, 2018

?!?!?

I did exactly that :

  • seded the reference name to reference.2
  • smalted a new .bam file according to the README.me
  • ran shorah amplicon on it (using freshly installed files)
  • got the expected results.

Are you sure there's no old version of diri_sampler at the wrong part of your search path (cf. your pullrequest #48 not always transferring the files where it should, and pullrequest #51 about searching them) ?

What you see is the behaviour of the old unpatched diri_sampler (using .find('.') instead of .rfind(".reads"), as if you were still having an old version left dangling around.

@ozagordi
Copy link
Collaborator

Looks like you are right. I need to check this.

@DrYak
Copy link
Member Author

DrYak commented Apr 24, 2018

Tomorrow, I'll rebase and merge if tests pass.

@ozagordi
Copy link
Collaborator

You were right. The reason was that in this PR we had the old setup.py that didn't work with meson + pip.

I was trying to address too many issues at the same time and I created some confusion, sorry.

@DrYak
Copy link
Member Author

DrYak commented Apr 25, 2018

Rebased against latest master (that includes the latest setup.py).
Everything seems to work, as it should.

@DrYak DrYak merged commit 6925abb into master Apr 25, 2018
@DrYak DrYak deleted the anti-dots branch April 25, 2018 09:37
@ozagordi
Copy link
Collaborator

Thanks for the support!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants