Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

snpEff_get_chr_names wrapper is dysfunctional #2254

Closed
wm75 opened this issue Jan 24, 2019 · 15 comments
Closed

snpEff_get_chr_names wrapper is dysfunctional #2254

wm75 opened this issue Jan 24, 2019 · 15 comments

Comments

@wm75
Copy link
Contributor

wm75 commented Jan 24, 2019

The snpEff_get_chr_names wrapper, which is part of the snpeff collection of tools is currently broken in several ways.

  • Testing the tool on usegalaxy.org and usegalaxy.eu indicates that the tool tries to write outside the job working directory, at least, with the "named" option selected as its "genomeSrc" (on main, e.g., this causes a java.lang.RuntimeException: Cannot create directory '/cvmfs/main.galaxyproject.org/deps/_conda/pkgs/snpeff-4.3.1t-0/share/snpeff-4.3.1t-0/data/....

  • The strategy to pipe snpEff output to awk seems to be viable only for small test genomes, at least, exercising the "cached" option for "genomeSrc" on usegalaxy.eu runs into an OutOfMemory error.

  • The tool (again, tested on usegalaxy.org and usegalaxy.eu) does not seem to recognize any snpEff database datasets from a user's history (as attempted when the "history" or "custom" options of "genomeSrc" are selected). The cause for this is likely the metadata filter and the validator used, but I couldn't immediately see what's wrong with it.

I'm currently lacking the time to provide a PR for this, but it'd be great if anyone could take over this task (including adding better tests) because the tool gets installed together with the rest of the SnpEff tools from the MTS and really should not remain in such a poor state.

@bgruening
Copy link
Member

@hepcat72 do you have time to look at this? Would be great!

@hepcat72
Copy link
Contributor

Yeah, I'll take a look.

@hepcat72
Copy link
Contributor

I'll see what I can do about the memory error. I didn't test large genomes. However the genomeSrc select list issues are problems that were inherited from the snpeff eff tool, where I obtained that select list from in the first place, which I could not get to work either. I did a good bit of debugging on it and the only thing I found that worked were locally installed databases and the named option. Selecting the named option ("download on demand") tells snpeff to download the database itself and install it in its own prescribed location, somewhere off of the root dir. At least, that's what my copy of snpeff on my machine does, and I could not find where galaxy's copy was putting it, so I decided to assume that some data handler somewhere had it all worked out. I was surprised that the named option worked when I found it in the snfeff eff tool for that reason. I decided that figuring out why it worked wasn't in the scope of what I was trying to do and just copied that select list code out of the snpeff eff tool's xml. So I could be wrong, but I suspect those issues aren't specific to the chromosome info tool. Have you run the snpeff eff tool using that same named option? Does it get the same java exception? I bet it would.

The history and custom (which is also in the history) options never worked when I tried them in the snpeff eff tool either. In fact, that's why I created the chromosome info tool when I discovered that the only viable options were the locally installed genomes and the download on demand/named genome options. If the user couldn't supply a snpeff database with fixed chromosome IDs in their history, they needed to figure out what was in the database and adjust their annotations instead.

At first, I had tried to remove the history and custom options from the genomeSrc select list, but ultimately decided against it. Chromosome info's genomeSrc select list, having been snagged from the snpEff eff tool, thus inherited the same issues that the eff tool has. I gave figuring those issues out a try but it turned out to be too steep a learning curve and I figured that since they didn't work in the snpeff eff tool, that wrapper's developer would eventually address it. There may even be a ticket for it. That's also why I didn't create tests for those options, because I knew they were broken.

@hepcat72
Copy link
Contributor

As for testing a large genome for the memory issue, the dump command tends to take inordinately long, even though I only use the first few lines of output and ignore the rest. There's not an option to just get metadata. I can add a test for a large genome, but it will be a really lengthy test.

@wm75
Copy link
Contributor Author

wm75 commented Jan 24, 2019

@hepcat72 Turns out you are right about SnpEff eff's Download on demand option for the Genome source producing that same error. I wasn't aware of this.

On the other hand, I hope you agree that at least some options of the chromosome-info tool should be working 😉. The fact that you copied the select options directly from SnpEff eff just made me compare the two sections and realize where the problem with the non-recognized datasets stems from:
the get_chr_names wrapper has received that macro name fix recently, and turns out that, mistakenly, the
metadata.snpeff_version here cd82ec8#diff-a74728a8c784c38dd976c631ca5aba49L65 got capitalized, too.

@hepcat72
Copy link
Contributor

I feel like I recall encountering the capitalization issue, but I can't quite remember the context. Glad you figured that out.

Can you give me an example of either the named or locally installed databases that cause the error so I can reproduce the error and debug it?

@wm75
Copy link
Contributor Author

wm75 commented Jan 24, 2019

Locally installed: there are three versions of the human genome cached on usegalaxy.eu, so you can try things yourself. Of course, that won't really let you debug much.
Named: this is the same error with really just about everything - try: Saccharomyces_cerevisiae for example.

In case you really manage to fix the Download on demand option, you should also update its label, which gives GRCh38.76 as an example, but for version 4.3 of SnpEff it should be GRCh38.86.

@wm75
Copy link
Contributor Author

wm75 commented Jan 24, 2019

As for fixing the Download on demand option: from my own experience with SnpEff interfaces, the solution might be to use the -dataDir option also there (just like it's currently done for the other three options); but, without having tested it, it's hard to be sure.

@hepcat72
Copy link
Contributor

So let me see if I understand this correctly, because I’d tried the datadir option and couldn’t figure out where it was to set it, but maybe I had missed something obvious. If I recall correctly, the named option in the eff tool never sets datadir. I’ll take a look later today to double check I’m remembering correctly. So is the datadir the same as the locally installed datadir location? And even if I set it, how does that ever work with the eff tool?

@wm75
Copy link
Contributor Author

wm75 commented Jan 24, 2019

If I recall correctly, the named option in the eff tool never sets datadir.

exactly - and that could be the problem. You'd set it to the job working directory and then, hopefully, the download would go there. For the chomsome-info tool that might be all you have to change already, but I leave it up to you to work out if/how you can adapt this to the eff tool.

@hepcat72
Copy link
Contributor

If the problem you encounter is the creation of a directory outside the tool directory, then the named option should never work, but obviously it does in some cases or else my tests would never have passed. So I tried it on usegalaxy.org and on our installation using a genome that I used when I created the tests. It failed on usegalaxy, but works on ours:

galaxy princeton genomics - screen shot 2019-01-24 at 11 14 58 am

The key is going to be the difference in the galaxy installations. But this issue again, isn't specific to the chromosome info tool. All tools using the named feature in genomeSrc will be affected. The core issue is that snpeff is installing the database in a location outside of itself, which it does by default. Our galaxy install doesn't seem to care about that violation of the galaxy tool rules. But setting the datadir to inside the working directory is something I tried in the past, and it didn't work and I didn't understand why, despite spending a good bit of time on it. Pre-installed snpeff databases must be configured somewhere and they are definitely set up and configured in a database I know nothing about because when you use the download tool, the item in the history is a metadata file. I'm not sure this is something I have the skills to fix myself. I'm too new to galaxy tool development. I'll take a look when I get the chance though.

lparsons added a commit to lparsons/tools-iuc that referenced this issue Jan 24, 2019
@lparsons lparsons mentioned this issue Jan 24, 2019
5 tasks
@lparsons
Copy link
Contributor

Thanks for the report @wm75.

I've posted a PR that should address the first issue with the named option, could you test that out?

Can you clarify the nature of the OutOfMemory error? Is it your job manager (SLURM, PBS, etc.) that complains about memory usage? Or is his a Java OutOfMemory exception?

Also, in regards to the problem using a database in the users history, were those databases uploaded to Galaxy, or were they created using the SNP Eff Download tool?

lparsons added a commit to lparsons/tools-iuc that referenced this issue Jan 24, 2019
@lparsons
Copy link
Contributor

lparsons commented Jan 24, 2019

OK, I think PR #2255 will fix the issues with the named option as well as with the history and custom options.

The OutOfMemory error is a java heap space issue. It seems that the snpEff dump command is using the memory to compute various statistics.

Fatal error: Exit code 1 (Error)
java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3332)
	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
	at java.lang.StringBuffer.append(StringBuffer.java:367)
	at org.snpeff.util.Gpr.readFile(Gpr.java:576)
	at org.snpeff.SnpEff.loadInteractions(SnpEff.java:662)
	at org.snpeff.SnpEff.loadDb(SnpEff.java:614)
	at org.snpeff.snpEffect.commandLine.SnpEffCmdDump.run(SnpEffCmdDump.java:184)
	at org.snpeff.SnpEff.run(SnpEff.java:1183)
	at org.snpeff.SnpEff.main(SnpEff.java:162)

To resolve that, it will require modification of the job configuration in the job_conf.xml file. Raising the memory limit for those jobs should resolve the issue, based on https://github.com/galaxyproject/tools-iuc/blob/master/tool_collections/snpeff/snpEff_macros.xml#L19

@hepcat72
Copy link
Contributor

Awesome. Thanks @lparsons!

@nsoranzo
Copy link
Member

I think all the issues reported here have been fixed in #2255, if not I'd suggest to open a new issue with what's left to address. Thanks all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants