Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need for nonamecheck in bedtools closest #212

Closed
michael-imbeault opened this issue Mar 20, 2015 · 10 comments
Closed

Need for nonamecheck in bedtools closest #212

michael-imbeault opened this issue Mar 20, 2015 · 10 comments

Comments

@michael-imbeault
Copy link

So recently a bug with the filter for chromosome name consistency was fixed in bedtools intersect, adding the nonamecheck option to bypass it.

Bedtools closest fails in the same manner and needs a similar option.

It is also my opinion that the filter is just plain harmful - at worst it should emit some warning, not crash the program completely as chromosome names are not stable nor predictable.

Here is one example occurring in the current Ensembl mouse annotation:

ERROR: File /tmp/pybedtools.HdzgEa.tmp has inconsistent naming convention for record:
CHR_MG132_PATCH 124291803 124294101 ENSMUSG00000098810 . - protein_coding exon CAAA01180111.1

@arq5x
Copy link
Owner

arq5x commented Mar 20, 2015

Thanks for reporting this, Michael. @nkindlon can you add -nonamecheck to closest? Also, IIRC, you did change it to be a warning, not an exit, correct?

@nkindlon
Copy link
Contributor

Sure, will do.

@nkindlon
Copy link
Contributor

To clarify, though, it is exactly because the chromosome names aren't stable or predictable that we opted to die instead of warn in the first place. We'd thought it better to die with an error than to allow possibly / probably wrong results to go through. Warnings can be missed if people redirect stderr and don't check it, or are running in batch mode and not catching it, but I guess there's only so much we can do with that.
If everyone's certain that a warning is better than dying, that's a very simple code change that I'll make.

@arq5x
Copy link
Owner

arq5x commented Mar 20, 2015

Hey Neil,

Your point about the warning is a fair one, as warnings can easily be missed given the vast output of most bedtools operations. Michael, the concer with not exiting is that users might miss warnings and therefore have incorrect results owing to inconsistent chromosome labelling schemes.

@michael-imbeault
Copy link
Author

I see - maybe adding a 'use nonamecheck if you want to bypass the filter' in the die message would be helpful then. To me it is still not clear why the chromosome names being not in the format we expect them is a problem - is it related to sort?

@arq5x
Copy link
Owner

arq5x commented Mar 20, 2015

Good point - Neil, does "CHR_MG132_PATCH" fail your test for chromosome goodness?

@nkindlon
Copy link
Contributor

Hi guys,
Michael, the chrom names format is only a problem when they don't match in all files. The intent was to catch cases like the query file having chr1, chr2, chr3, etc, but the db file just calls the chromosomes 1,2,3, and so on. Since intersect and all the tools based on it (map, closest, jaccard, etc) use exact string matching for the chromosome names, what you'll get are many false negatives, i.e, no hit for the chromosomes in question. I think the problem with the warning telling people to run with -nonamecheck might imply that doing so would fix the problem, when in fact it would only allow them to run to completion with no warning and wrong results. What they really need to do instead is fix their files, perhaps using sed or something similar.

All that being said, this and other user feedback we've gotten suggests that testing the name conventions doesn't work as well in practice as it does in theory. Revising it to cover all possibilities would be rather difficult; perhaps we should consider removing it.

@lparsons
Copy link

lparsons commented May 1, 2015

I'm running into a problem with bedtools intersect where there are some chromosomes named in one file, but not another and as a result it is failing this check. At least, I'm pretty sure that is what is going on. This is a bit of an issue since there is nothing wrong with either of these files, it's just that there simply aren't any regions defined for a given chromosome (sequence id) in one of the files.

CORRECTION:
Actually, it seems to be complaining for various records that don't start with "CHR" in some way. This seems overly restrictive.

@arq5x
Copy link
Owner

arq5x commented May 4, 2015

Hi @lparsons - does this behavior persist with the -nonamecheck option? The repo (subsequent to release 2.23.0) contains new logic that should avoid these warnings when -nonamecheck is used. Please let me know if you find otherwise.

@lparsons
Copy link

lparsons commented May 4, 2015

My apologies, this does appear to be working now, I am guessing that
updating to the latest version fixed it, though I had thought I was running
the latest at the time. At any rate, thanks for the excellent tools and
rapid response.

Lance

On Mon, May 4, 2015 at 3:07 PM, Aaron Quinlan notifications@github.com
wrote:

Hi @lparsons https://github.com/lparsons - does this behavior persist
with the -nonamecheck option? The repo (subsequent to release 2.23.0)
contains new logic that should avoid these warnings when -nonamecheck is
used. Please let me know if you find otherwise.


Reply to this email directly or view it on GitHub
#212 (comment).

  • Lance Parsons

@arq5x arq5x closed this as completed May 18, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants