-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PhenoMenal wrapper merge #8
Conversation
* Modify script for new create msp * Add filter for preprocessing * Add thresholds * update test files * delete files * update file test
Cool, I'm currently testing the module. Regarding the local MetChem database, the python script throws an error: Fatal error: Exit code 2 ()
usage: metfrag.py [-h] [--input_pth INPUT_PTH] [--result_pth RESULT_PTH]
[..]
metfrag.py: error: unrecognized arguments: --LocalMetChemDatabaseServerIp 1.1.1.1 Seems like the argument is not exported in the python script. Regarding the suspect list, it is generated from this repo: https://github.com/oolonek/ISDB/tree/master/Data/dbs The final suspect list only contains the InCHIs and looks like this:
Maybe @sneumann knows whether we can share our version for testing or do we need to generate the db by ourselves? As far as I know, the weights are simply multiplied to the scores. @sneumann should know more details. I'm currently testing the module and it seems to run fine with PubChem - even utilizing less cpu (which is what is intended in our cluster to not overload it). I'm waiting for it to finish the calculations and report back. |
@Tomnl Can you give me access so that I can push an update to the phen_msp_to_metrag_update branch? |
@korseby I have added you to the repository |
Cool, I have added the suspect list and added some contributors (who have contributed to the PhenoMeNal Galaxy module). Please have a look. |
Hi @Tomnl , the module fails (with using PubChem) with the following output:
|
@korseby what MSP dataset are you using? |
Hi Tom, cut & paste of the first few entries. Can you try with this one?
|
Ok, testing again with the local MetChem database... |
Sorry forgot to mention, "Generic MSP" in case you ask... |
@Tomnl still no luck for the script to complete successfully. Could it be that some results return 0 candidates?
|
@korseby What other params are you using? It works ok with default params I used for some tests I ran with the example msp dataset |
Hi @Tomnl , my example MSP file has two newlines at the end of the file. I think your script expects that a new spectra starts but as nothing is there, it fails with the error message. Otherwise I'm using default parameters. Can you try that? |
Hi @korseby, Sorry still can't get the error. I tested with an example MSP file with two extra newlines at the end of the file and it seems to work OK. Is it OK if you could send the file you are testing on please? |
Hi @Tomnl same error with latest additions, I have sent you an email with the URL of the file. |
@korseby thanks. Just running now |
@korseby I have been running the example you sent via email with planemo. It is still running... However, I have made a bugfix that should hopefully catch the error you described |
Mine is still running. It takes approx 4 hours in our cluster (24 cpu). Have you completed the example? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work guys!
@RJMW thanks for review @korseby I ran the example without the new bug fix yesterday (finished this morning) and replicated the error you were getting I am now running the example again with the bug fix detailed in 01168c9 (started processing this morning - but only on 4 cores) Hopefully it passes this time! 🤞 |
Hi @korseby, The example file you sent (via email) passed without error on my local system. Did it pass on your system? |
Yes, since the latest change it also works in our Galaxy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't test right now but all looks very good for me ! :)
Thanks @korseby @jsaintvanne for the review |
Overview
This pull request relates to issue #6
The updates are based on adding the functionality that has been developed as part of PhenoMenal (script and xml). Keeping the tool as python though.
Additionally, @jsaintvanne and I have added general updates that we either required to make compatible with changes with msPurity and additional useful functionality.
Updates
Pre filters
Pre filters (e.g. MinimumElementsFilter, UnconnectedCompoundFilter, ElementInclusionFilter) were added with merged PR #7 by @jsaintvanne . These are pre-filters performed by MetFrag that limit what compounds will be used for the in silico fragmentation creation. e.g. only include compounds that contain certain elements. See xml and python
Post filters
Post filters (e.g. minimum score) were added with merged PR #7. These are filters that @jsaintvanne added to be performed after MetFrag has finished to make the ouput more manageable and relevant.
MassBank and generic MSP format
The tool can now handle both MassBank format and generic MSP format. User chooses at the tool level and different regex will be used
EDIT: Updated to have option to automatically determine schema type
Multiprocessing
Multiprocessing performed at python script level rather than within the MetFrag tool (e.g. each command line call is put onto a different process). This seems to give the best performance see below.
The command line option to change the number of threads used by MetFrag is still available but for the moment this is not accessible via the XML Galaxy tool. In the Phenomenal tool users could choose the number of cores directly though the Galaxy tool, however generally speaking with Galaxy tools the number of cores chosen is done via the Galaxy configuration - @korseby is this a problem for you?
Meta data in output
A number of approaches are now available to include metadata from the MSP as columns in the output.
Scoring and weights
The choice of different scoring approaches can now be chosen. The choice is dependant on the if suspsect list is used or not though (see below).
Regarding the weightings - @korseby - is there any documentation as to how to choose these weights? How does it affect the final "Score" column.
Suspect lists
Suspectlist can be included - this requires the user to provide an additional list compounds that are expected and they can be used to improve the annotation results. Depending on if this is used will impact on what Scoring options can be used.
@korseby do we have any examples we can use for a unit test for this?
Database lookup choice
As per #5 the choice of database is now conditional. The choices are now as follows:
The MetChem approach has not been tested yet
Unit tests
Additional unit test added for testing MassBank style files, generic MSP files and actual MassBank data files from https://github.com/MassBank/MassBank-data.
FYI: @RJMW