PhenoMenal wrapper merge #8

Merged
merged 12 commits into from
Apr 15, 2019
Tomnl
Member

@Tomnl Tomnl commented Apr 9, 2019

Overview

This pull request relates to issue #6

The updates add the functionality that was developed as part of PhenoMeNal (script and XML), while keeping the tool in Python.

Additionally, @jsaintvanne and I have added general updates that were either required for compatibility with changes in msPurity or provide additional useful functionality.

Updates

Pre filters

Pre filters (e.g. MinimumElementsFilter, UnconnectedCompoundFilter, ElementInclusionFilter) were added with merged PR #7 by @jsaintvanne. These filters are performed by MetFrag and limit which compounds are used for the in silico fragmentation, e.g. only include compounds that contain certain elements. See xml and python
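
To make the idea of an element-inclusion pre-filter concrete, here is a minimal sketch that keeps only candidates whose molecular formula contains every required element. This is not MetFrag's implementation: `parse_formula` and `contains_elements` are hypothetical helper names, the candidate formulas are made up, and the parser only handles simple formulas without brackets or charges.

```python
import re
from collections import Counter

# An element symbol (e.g. "C", "Cl") followed by an optional count.
ELEMENT_RE = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Count atoms per element in a simple formula such as 'C6H12O6'."""
    counts = Counter()
    for symbol, number in ELEMENT_RE.findall(formula):
        counts[symbol] += int(number) if number else 1
    return counts


def contains_elements(formula, required):
    """Return True if every required element occurs at least once."""
    counts = parse_formula(formula)
    return all(counts[element] > 0 for element in required)


# Made-up candidate formulas, for illustration only.
candidates = ["C6H12O6", "C5H5N5", "C9H11NO2"]
kept = [f for f in candidates if contains_elements(f, {"C", "N"})]
print(kept)
```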

Post filters

Post filters (e.g. minimum score) were added with merged PR #7. These are filters that @jsaintvanne added to be performed after MetFrag has finished, to make the output more manageable and relevant.
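
A minimum-score post filter could look roughly like the sketch below. It assumes the results are tabular with a numeric "Score" column; the `Identifier` column, the example rows and the function name `filter_by_min_score` are hypothetical and not taken from the tool.

```python
import csv
import io


def filter_by_min_score(rows, min_score, score_column="Score"):
    """Keep only result rows whose score is at least min_score."""
    return [row for row in rows if float(row[score_column]) >= min_score]


# Made-up example results, for illustration only.
example = io.StringIO("Identifier,Score\nA,0.91\nB,0.42\nC,0.75\n")
rows = list(csv.DictReader(example))
kept = filter_by_min_score(rows, min_score=0.7)
print([row["Identifier"] for row in kept])
```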

MassBank and generic MSP format

The tool can now handle both MassBank format and generic MSP format. The user chooses the format at the tool level, and a different set of regular expressions is used for each.

EDIT: Updated to have option to automatically determine schema type
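
As a rough illustration of automatic schema detection, the sketch below looks for the "RECORD_TITLE:" field (MassBank style) or the "NAME:" field (generic MSP style). The regular expressions and the function name `detect_schema` are assumptions for illustration and may differ from what the script actually uses.

```python
import re

# Illustrative patterns only; the real tool may use different expressions.
MASSBANK_TITLE = re.compile(r"^RECORD_TITLE:[ \t]*(.+)$", re.MULTILINE)
MSP_NAME = re.compile(r"^NAME:[ \t]*(.+)$", re.MULTILINE | re.IGNORECASE)


def detect_schema(text):
    """Guess the file style from the first recognisable title field."""
    if MASSBANK_TITLE.search(text):
        return "massbank"
    if MSP_NAME.search(text):
        return "msp"
    return "unknown"


print(detect_schema("NAME: spectrum_1\nNumPeaks: 1\n"))
print(detect_schema("ACCESSION: XX000001\nRECORD_TITLE: example\n"))
```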

Multiprocessing

Multiprocessing is performed at the Python script level rather than within the MetFrag tool (i.e. each command line call is put onto a different process). This seems to give the best performance; see below.

| Method | Real (elapsed time) | User (CPU time) |
|---|---|---|
| Single processing | 21m0.825s | 21m13.223s |
| 4 top level cores | 7m58.166s | 23m36.103s |
| 4 MetFrag threads | 19m27.448s | 20m57.833s |
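
The script-level approach can be sketched as follows. Note the swapped-in technique: this sketch uses a thread pool whose workers each launch and wait on one external command, so every command still runs as its own OS process, but it is not necessarily how metfrag.py distributes the calls. The stand-in commands and the names `run_command` and `run_all` are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(cmd):
    """Run one external command and return its exit code and stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout.strip()


def run_all(commands, workers=4):
    """Run commands concurrently; each one is its own OS process."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input commands.
        return list(pool.map(run_command, commands))


# Stand-in commands: in the real tool each would be one MetFrag CLI call.
commands = [[sys.executable, "-c", f"print({i} * {i})"] for i in range(4)]
results = run_all(commands)
print(results)
```

For reference, the timings reported in this comment put the run on 4 top level cores at roughly 2.6 times faster in elapsed time than single processing (21m0.825s versus 7m58.166s), at the cost of slightly more total CPU time.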

The command line option to change the number of threads used by MetFrag is still available, but for the moment it is not accessible via the XML Galaxy tool. In the PhenoMeNal tool users could choose the number of cores directly through the Galaxy tool; however, with Galaxy tools the number of cores is generally set via the Galaxy configuration - @korseby is this a problem for you?

Meta data in output

A number of approaches are now available to include metadata from the MSP as columns in the output.

  • Use either the "Name:" or "RECORD_TITLE:"
  • Use either the "Name:" or "RECORD_TITLE:" but split the content based on expected output from msPurity (e.g. 'MZ:100.2 | RT:20 | xcms_grp_id:1' would be split into columns MZ, RT, and xcms_grp_id)
  • Use all the parameters in the MSP metadata as columns
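
The second option can be sketched like this, using the example string given in the list. The function name `split_name_field` is hypothetical, and the real script may handle malformed fields differently.

```python
def split_name_field(value):
    """Split 'KEY:val | KEY:val' into a dict of column name -> value."""
    columns = {}
    for part in value.split("|"):
        key, sep, val = part.strip().partition(":")
        if sep:  # skip fragments that do not have a 'key:value' shape
            columns[key.strip()] = val.strip()
    return columns


print(split_name_field("MZ:100.2 | RT:20 | xcms_grp_id:1"))
```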

Scoring and weights

Different scoring approaches can now be chosen. The available choices depend on whether a suspect list is used (see below).

Regarding the weightings - @korseby - is there any documentation on how to choose these weights? How do they affect the final "Score" column?

Suspect lists

A suspect list can be included - this requires the user to provide an additional list of compounds that are expected, which can be used to improve the annotation results. Whether a suspect list is used determines which scoring options are available.

@korseby do we have any examples we can use for a unit test for this?

Database lookup choice

As per #5 the choice of database is now conditional. The choices are now as follows:

  • PubChem
  • KEGG
  • MetChem
    • User provides an IP address of MetChem instance
  • LocalCSV
    • user provides an uploaded CSV file

The MetChem approach has not been tested yet
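
The conditional choice can be sketched as a small function that only accepts the extra input relevant to the selected database. `LocalMetChemDatabaseServerIp` is the option name that appears in an error message reported later in this conversation; `MetFragDatabaseType` and `LocalDatabasePath` are assumed parameter names that should be checked against the MetFrag documentation, and `database_params` is a hypothetical helper.

```python
def database_params(db_type, metchem_ip=None, csv_path=None):
    """Return MetFrag-style parameters for the chosen database."""
    # Parameter names other than LocalMetChemDatabaseServerIp are assumptions.
    params = {"MetFragDatabaseType": db_type}
    if db_type == "MetChem":
        if not metchem_ip:
            raise ValueError("MetChem requires a server IP address")
        params["LocalMetChemDatabaseServerIp"] = metchem_ip
    elif db_type == "LocalCSV":
        if not csv_path:
            raise ValueError("LocalCSV requires a CSV file path")
        params["LocalDatabasePath"] = csv_path
    elif db_type not in ("PubChem", "KEGG"):
        raise ValueError(f"Unsupported database type: {db_type}")
    return params


print(database_params("PubChem"))
print(database_params("MetChem", metchem_ip="1.1.1.1"))
```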

Unit tests

Additional unit tests were added for MassBank style files, generic MSP files and actual MassBank data files from https://github.com/MassBank/MassBank-data.

FYI: @RJMW

@Tomnl changed the title from "PhenoMenal warpper merge" to "PhenoMenal wrapper merge" on Apr 9, 2019
@korseby
Collaborator

korseby commented Apr 10, 2019

Cool,

I'm currently testing the module.

Regarding the local MetChem database, the python script throws an error:

Fatal error: Exit code 2 ()
usage: metfrag.py [-h] [--input_pth INPUT_PTH] [--result_pth RESULT_PTH]
[..]
metfrag.py: error: unrecognized arguments: --LocalMetChemDatabaseServerIp 1.1.1.1

It seems the argument is not defined in the Python script.
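
The behaviour behind this error can be reproduced with a small argparse sketch: an option that has no matching `add_argument` call is reported as unrecognized, and declaring it makes the parser accept it. The two existing options are taken from the usage line above; everything else about the real parser is unknown, so this is only an illustration.

```python
import argparse

parser = argparse.ArgumentParser(prog="metfrag.py")
parser.add_argument("--input_pth")
parser.add_argument("--result_pth")

argv = ["--LocalMetChemDatabaseServerIp", "1.1.1.1"]

# Without a matching add_argument call, the option is left over as unknown
# (parse_args would exit with "unrecognized arguments" here).
_, unknown = parser.parse_known_args(argv)
print(unknown)

# Declaring the option makes the parser accept it.
parser.add_argument("--LocalMetChemDatabaseServerIp")
args = parser.parse_args(argv)
print(args.LocalMetChemDatabaseServerIp)
```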

Regarding the suspect list, it is generated from this repo:

https://github.com/oolonek/ISDB/tree/master/Data/dbs

The final suspect list contains only the first block of each InChIKey and looks like this:

AAANZTDKTFGJLZ
AADZLWJXMBAKKE
AAHHASSEIOXLIB
AAJDHOXHRGMSLB
[...]
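
Each entry above is 14 uppercase letters, which matches the length of the first block of an InChIKey, so the sketch below assumes that is what they are and that a candidate counts as a suspect when the first block of its InChIKey is in the list. The function name `in_suspect_list` and the candidate keys are hypothetical.

```python
def in_suspect_list(inchikey, suspects):
    """Check a candidate by the first block of its InChIKey."""
    first_block = inchikey.split("-")[0]
    return first_block in suspects


# Two entries copied from the excerpt above.
suspects = {"AAANZTDKTFGJLZ", "AADZLWJXMBAKKE"}

# Hypothetical candidate keys; the second and third blocks are made up.
print(in_suspect_list("AAANZTDKTFGJLZ-UHFFFAOYSA-N", suspects))
print(in_suspect_list("ZZZZZZZZZZZZZZ-UHFFFAOYSA-N", suspects))
```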

Maybe @sneumann knows whether we can share our version for testing or do we need to generate the db by ourselves?

As far as I know, the scores are simply multiplied by the weights. @sneumann should know more details.
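
Taking that description at face value, a combined score would be a weighted sum of the individual score terms, as in the sketch below. The term names, values and weights are made up for illustration, and MetFrag may also normalise each score before weighting, which is not shown here.

```python
def combined_score(scores, weights):
    """Weighted sum of per-term scores; both dicts are keyed by term name."""
    return sum(weights[name] * value for name, value in scores.items())


# Hypothetical score terms and weights, for illustration only.
scores = {"fragmenter": 0.8, "suspect_list": 1.0}
weights = {"fragmenter": 1.0, "suspect_list": 0.5}
print(combined_score(scores, weights))
```

Under this reading, raising a weight increases the influence of that term on the ranking relative to the others.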

I'm currently testing the module and it seems to run fine with PubChem - even utilizing less cpu (which is what is intended in our cluster to not overload it). I'm waiting for it to finish the calculations and report back.

@korseby
Collaborator

korseby commented Apr 10, 2019

@Tomnl Can you give me access so that I can push an update to the phen_msp_to_metrag_update branch?

@RJMW
Member

RJMW commented Apr 10, 2019

@korseby I have added you to the repository

@korseby
Collaborator

korseby commented Apr 10, 2019

Cool, I have added the suspect list and added some contributors (who have contributed to the PhenoMeNal Galaxy module). Please have a look.

@korseby
Collaborator

korseby commented Apr 10, 2019

Hi @Tomnl ,

the module fails (when using PubChem) with the following output:

Fatal error: Exit code 1 ()
INFO  de.ipbhalle.metfraglib.database.OnlinePubChemDatabase - Fetching candidates from PubChem
[..]
Traceback (most recent call last):
  File "/var/lib/galaxy/galaxy/tools/metfrag/metfrag.py", line 380, in <module>
    pctexplpeak = nbfindpeak / totpeaks * 100
ZeroDivisionError: float division by zero

@Tomnl
Member Author

Tomnl commented Apr 10, 2019

@korseby what MSP dataset are you using?

@korseby
Collaborator

korseby commented Apr 10, 2019

Hi Tom,

cut & paste of the first few entries. Can you try with this one?

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:1
AlignmentID: 1
RETENTIONTIME: 720.852
PRECURSORMZ: 149.13513692
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 3
121.099329887276	182
149.133267159026	510
150.14016979533	114

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:2
AlignmentID: 2
RETENTIONTIME: 650.574
PRECURSORMZ: 151.11569214
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 3
136.090178471023	168
151.110697295085	1052
152.117423567352	136

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:3
AlignmentID: 3
RETENTIONTIME: 42.174
PRECURSORMZ: 166.09213257
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 5
121.067062911877	262
131.049505932606	146
137.060202192821	531
149.059561491169	254
166.08537028381	545

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:4
AlignmentID: 4
RETENTIONTIME: 78.2928
PRECURSORMZ: 166.089286806667
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 5
118.065933655806	115
119.072278952628	113
120.080979101728	5069
121.083935362562	743
131.049093887557	139

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:5
AlignmentID: 5
RETENTIONTIME: 24.56202
PRECURSORMZ: 175.12203471
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 4
130.097489783372	495
158.093607647479	276
175.118502558422	485
224.939823965059	134

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:6
AlignmentID: 6
RETENTIONTIME: 155.1198
PRECURSORMZ: 188.07421875
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 11
115.051197438708	113
117.064988462758	153
118.064808002218	808
142.065631276731	291
143.074218460193	314
144.080055471837	423
145.082639304305	179
146.060591300355	1378
147.064153177706	130
170.060029455358	489
188.068854738121	140

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:7
AlignmentID: 7
RETENTIONTIME: 45.21798
PRECURSORMZ: 189.12660726
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 4
130.05053925562	396
143.118033450683	186
172.096003173354	747
173.099902938432	243

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:8
AlignmentID: 8
RETENTIONTIME: 908.514
PRECURSORMZ: 191.183143615
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 3
121.096264265038	145
135.117847855305	190
191.178343770579	534

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:9
AlignmentID: 9
RETENTIONTIME: 42.51
PRECURSORMZ: 193.038887025
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 2
129.018746579505	104
139.001406290082	1123

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:10
AlignmentID: 10
RETENTIONTIME: 654.654
PRECURSORMZ: 203.182054000909
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 14
119.085151368499	532
120.087842039128	126
121.10025451309	165
123.113860374699	160
133.101231199035	487
135.116023275453	263
145.099428428133	180
147.116855030957	1228
148.120983769871	220
160.122136310914	160
161.13287307095	473
175.147633115734	266
203.17708173968	2575
204.182146126692	473

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:11
AlignmentID: 11
RETENTIONTIME: 684.822
PRECURSORMZ: 203.1841506975
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 21
119.086281852106	714
120.086519303416	150
121.101020929824	481
123.11762390545	427
133.101300997891	1165
134.104738891484	142
135.115993638443	795
145.10107865268	164
147.117680797515	3016
148.120860686395	343
149.132746517263	1065
159.115213477893	187
160.120916343035	152
161.132652047532	1438
162.136273667092	375
163.14917441567	112
175.147496697069	1005
176.151809505938	117
203.177553694006	7654
204.181896474424	1085
205.193137337792	501

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:12
AlignmentID: 12
RETENTIONTIME: 693.978
PRECURSORMZ: 203.179397585
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 17
109.099539657111	100
119.086173445489	424
121.101564115738	149
123.1191016261	158
133.100915983421	578
135.119416350425	282
145.101362852958	145
147.116935252073	1229
148.121059665111	174
149.132579709198	461
160.123069074404	218
161.132037241938	737
162.132771840172	147
175.14796504241	388
203.177704143394	3406
204.181630319691	491
205.191100661583	495

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:13
AlignmentID: 13
RETENTIONTIME: 751.686
PRECURSORMZ: 203.18260956
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 17
119.084078178435	478
121.100445329585	1449
123.115793159832	170
133.101011024192	729
135.116358247534	375
137.129946417725	133
147.118112174997	723
148.122564637243	141
149.132422636733	1356
160.122915204839	143
161.133994338511	423
163.149343297591	147
174.134993824972	128
175.145412282655	353
203.177096426471	1453
204.179140280475	352
205.192224241714	625

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:14
AlignmentID: 14
RETENTIONTIME: 154.1052
PRECURSORMZ: 205.100937907333
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 21
115.053899672898	225
117.060446582388	343
118.065228231081	2719
119.06878068649	318
127.055178279259	146
130.065821086856	674
132.081237107688	1583
133.085586278994	180
142.065754353636	1029
143.073254521873	1092
144.080927565304	4638
145.08368898786	511
146.060441864619	17519
147.063392081277	1707
159.091754230715	2397
160.085132047051	559
161.078424202321	104
170.059623154778	3561
171.063443408256	494
188.069593846953	11560
189.073260192115	1306

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:15
AlignmentID: 15
RETENTIONTIME: 679.056
PRECURSORMZ: 205.19841384875
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 22
119.085109436512	145
120.086587341883	116
121.100476187437	1630
122.105624193074	251
123.116818139512	1611
124.121071253113	153
133.100817443176	243
134.105583706823	168
135.116100351222	5204
136.119760266374	618
137.131622959373	340
147.117356723788	214
148.120667843349	451
149.132251074927	9207
150.136797885252	1172
162.136402785973	267
163.148045475809	995
164.151799011145	124
177.163620388369	208
204.180873034966	962
205.193785614305	7044
206.196511855015	884

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:16
AlignmentID: 16
RETENTIONTIME: 718.818
PRECURSORMZ: 205.198698256667
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 31
105.066197704056	144
107.084533004556	326
109.100187251137	348
110.102972985183	138
119.085879851288	399
121.100586305974	8689
122.10373354907	912
123.116920816833	5891
124.119498978167	692
133.101226395772	247
134.109380064678	106
135.041605038547	128
135.116569600759	18435
136.119846049985	1828
137.131670234114	689
147.117402164736	703
148.121339251552	184
149.038833538425	160
149.13277347591	32582
149.330539557801	114
150.135914574479	3857
151.139641374012	139
162.139000847802	198
163.147424639626	4038
164.152117545449	477
176.152839476843	153
177.163153413103	569
178.1654705618	136
204.181053660849	228
205.192890286224	21636
206.196484133296	2987

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:17
AlignmentID: 17
RETENTIONTIME: 746.616
PRECURSORMZ: 205.197532655
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 22
107.08495152857	133
109.102368117571	371
119.086811365183	281
121.100697403017	5296
122.103561149594	517
123.116681113309	2222
124.118657856089	278
135.116527032889	8763
136.118429023375	815
137.13143326875	670
147.115016170194	238
149.045959860786	106
149.132224602793	16285
150.135340193089	1795
151.141077653114	103
161.132093770344	117
163.147875347619	2259
164.148539689899	316
177.163121834245	430
204.183468516206	155
205.193734948981	15545
206.196029191969	2414

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:18
AlignmentID: 18
RETENTIONTIME: 831.474
PRECURSORMZ: 205.197184243333
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 9
121.098786119125	292
123.11618362409	329
135.11622087148	829
136.120199192853	116
149.133118408405	1743
150.13938759065	129
163.146169489001	180
205.19267108391	1154
206.198128554339	113

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:19
AlignmentID: 19
RETENTIONTIME: 726.942
PRECURSORMZ: 206.20942688
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 21
107.082081889626	129
109.098158327114	170
119.084466037582	172
121.101174660971	5912
122.104386843486	709
123.116008386586	2476
124.122326374207	283
135.116248039231	8307
136.120492358628	844
137.134438558984	489
138.140609651624	130
147.113719641202	173
149.13279200944	14507
150.135909500474	2064
151.141809154572	134
163.147699526741	1709
164.15299246976	302
177.164252447133	281
205.1931551242	9487
206.197141901046	1614
207.200952891379	173

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:20
AlignmentID: 20
RETENTIONTIME: 559.038
PRECURSORMZ: 209.155904133333
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 8
139.077678697146	215
141.090489086884	200
149.128496040341	168
153.090459447657	144
165.125055454232	424
167.139353549861	145
209.15139685919	1024
210.154274023472	158

NAME: pos_27_winter_Marpol_27_2.E.3_01_17272:21
AlignmentID: 21
RETENTIONTIME: 194.7012
PRECURSORMZ: 216.09188652125
METABOLITENAME: Unknown
ADDUCTIONNAME: [M+H]+
NumPeaks: 12
136.062142355198	423
137.067912002702	137
148.062177362166	3260
149.063032114786	338
171.068975078169	136
173.069373292394	786
188.091620602901	1923
189.093418649142	260
198.07690532756	187
200.126641033121	338
216.087050918035	5363
217.089120645467	688

@korseby
Collaborator

korseby commented Apr 10, 2019

Ok, testing again with the local MetChem database...

@korseby
Collaborator

korseby commented Apr 10, 2019

Sorry forgot to mention, "Generic MSP" in case you ask...

@korseby
Collaborator

korseby commented Apr 10, 2019

@Tomnl still no luck for the script to complete successfully. Could it be that some results return 0 candidates?

Traceback (most recent call last):
  File "/var/lib/galaxy/galaxy/tools/metfrag/metfrag.py", line 400, in <module>
    pctexplpeak = nbfindpeak / totpeaks * 100
ZeroDivisionError: float division by zero
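
One way to guard the failing line is to return a default when there are no peaks to divide by. The variable names come from the traceback; the function `percent_explained` and the choice of 0.0 as the fallback are assumptions, and this is not necessarily the fix that was applied in the repository.

```python
def percent_explained(nbfindpeak, totpeaks):
    """Percentage of peaks explained, or 0.0 when there are no peaks."""
    if not totpeaks:
        return 0.0
    return nbfindpeak / totpeaks * 100


print(percent_explained(3, 4))
print(percent_explained(0, 0))
```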

@Tomnl
Member Author

Tomnl commented Apr 10, 2019

@korseby What other params are you using?

It works ok with default params I used for some tests I ran with the example msp dataset

@korseby
Collaborator

korseby commented Apr 11, 2019

Hi @Tomnl, my example MSP file has two newlines at the end of the file. I think your script expects a new spectrum to start there, but as nothing follows, it fails with the error message. Otherwise I'm using default parameters. Can you try that?
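
A parser that tolerates trailing blank lines could split the file into records on blank lines and drop any empty blocks, as in the sketch below. The example text is abbreviated from the first two spectra posted earlier (shortened names, one peak each), and `split_msp_records` is a hypothetical helper, not the function used in metfrag.py.

```python
import re


def split_msp_records(text):
    """Split MSP text into records on blank lines, dropping empty blocks."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [block.splitlines() for block in blocks if block.strip()]


# Abbreviated example with two extra newlines at the end of the file.
example = (
    "NAME: spectrum_1\n"
    "PRECURSORMZ: 149.13513692\n"
    "NumPeaks: 1\n"
    "149.133267159026\t510\n"
    "\n"
    "NAME: spectrum_2\n"
    "PRECURSORMZ: 151.11569214\n"
    "NumPeaks: 1\n"
    "151.110697295085\t1052\n"
    "\n\n"
)
records = split_msp_records(example)
print(len(records))
```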

@Tomnl
Member Author

Tomnl commented Apr 11, 2019

Hi @korseby,

Sorry still can't get the error. I tested with an example MSP file with two extra newlines at the end of the file and it seems to work OK.

Could you please send the file you are testing with?

@korseby
Collaborator

korseby commented Apr 11, 2019

Hi @Tomnl same error with latest additions, I have sent you an email with the URL of the file.

@Tomnl
Member Author

Tomnl commented Apr 11, 2019

@korseby thanks. Just running now

@Tomnl
Member Author

Tomnl commented Apr 11, 2019

@korseby I have been running the example you sent via email with planemo. It is still running...

However, I have made a bugfix that should hopefully catch the error you described

@korseby
Collaborator

korseby commented Apr 12, 2019

Mine is still running. It takes approx 4 hours in our cluster (24 cpu). Have you completed the example?

Member

@RJMW RJMW left a comment

Great work guys!

@Tomnl
Member Author

Tomnl commented Apr 12, 2019

@RJMW thanks for review

@korseby I ran the example without the new bug fix yesterday (finished this morning) and replicated the error you were getting

I am now running the example again with the bug fix detailed in 01168c9 (started processing this morning - but only on 4 cores)

Hopefully it passes this time! 🤞

@Tomnl
Member Author

Tomnl commented Apr 15, 2019

Hi @korseby,

The example file you sent (via email) passed without error on my local system. Did it pass on your system?

@korseby
Collaborator

korseby commented Apr 15, 2019

Yes, since the latest change it also works in our Galaxy.

Collaborator

@jsaintvanne jsaintvanne left a comment

Can't test right now, but it all looks very good to me! :)

@Tomnl
Member Author

Tomnl commented Apr 15, 2019

Thanks @korseby @jsaintvanne for the review


4 participants