## Convert files from Thermo

Download ProteoWizard (Windows only): https://proteowizard.sourceforge.io/tools/msconvert.html

Run the following command:

msconvert *.raw --filter "peakPicking true 1- " --filter "msLevel 1-" --ignoreUnknownInstrumentError

Alternatively, apply the peakPicking filter to MS levels 1- through the MSConvert GUI.

If the converted files still contain negative intensities, filter them out with FileFilter from the command line of an OpenMS nightly build:

./FileFilter -in "path" -out "path" -int "0:"
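
For batch conversion, the msconvert call above could be assembled from Python. This is a minimal sketch: the helper name `build_msconvert_command` and the file names are hypothetical, and only the flags shown in the command above are used.

```python
import shlex


def build_msconvert_command(raw_files):
    """Assemble the msconvert call shown above as an argument list."""
    return [
        "msconvert",
        *raw_files,
        "--filter", "peakPicking true 1-",  # centroid all MS levels
        "--filter", "msLevel 1-",           # keep MS1 and higher levels
        "--ignoreUnknownInstrumentError",
    ]


# Hypothetical file names, for illustration only
command = build_msconvert_command(["sample_A.raw", "sample_B.raw"])
print(shlex.join(command))
# To actually run the conversion: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the filter strings.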

## Import libraries

In [1]:
from pyopenms import *
from pandas import DataFrame
import pandas as pd
import pyteomics
from pyteomics.openms import featurexml
import numpy as np
import sys
from pyteomics import mztab

Determination of memory status is not supported on this 
 platform, measuring for memoryleaks will never fail


## Preprocessing step

The first preprocessing function consists of six steps:

### 1) PrecursorCorrection (To the "highest intensity MS1 peak")

The highest-intensity correction is applied directly after loading the file in order to fix wrong MS1 precursor annotations: within the precursor mass window (e.g. 0.01 Da), the highest-intensity peak is taken as the precursor. This corrects for instrument error, on the assumption that within the given mass window the precursor with the highest intensity was the one actually fragmented (top-n method).
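
The idea can be illustrated in plain Python. This is only a sketch of the principle, not the pyOpenMS implementation; the function name, the `(mz, intensity)` tuple layout and the example peaks are assumptions made for illustration.

```python
def correct_to_highest_intensity_peak(precursor_mz, ms1_peaks, tolerance_da=0.01):
    """Return the m/z of the most intense MS1 peak inside the precursor window.

    ms1_peaks is a list of (mz, intensity) tuples (assumed layout).
    If no peak falls inside the window, the annotation is left unchanged.
    """
    window = [
        (mz, intensity)
        for mz, intensity in ms1_peaks
        if abs(mz - precursor_mz) <= tolerance_da
    ]
    if not window:
        return precursor_mz
    return max(window, key=lambda peak: peak[1])[0]


# Made-up peaks: the third one is the most intense but lies outside the window
peaks = [(500.245, 1.0e4), (500.251, 8.0e5), (500.300, 9.0e6)]
corrected = correct_to_highest_intensity_peak(500.250, peaks)
print(corrected)
```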

### 2) Mass trace detection

A mass trace extraction method that gathers peaks similar in m/z and moving along retention time.

Peaks of an MSExperiment are sorted by their intensity and stored in a list of potential chromatographic apex positions. Only peaks that are above the (user-defined) noise threshold are analyzed, and only peaks that are n times above this minimal threshold are considered as apices. This saves computational resources and decreases the noise in the resulting output.

Starting with these apices, mass traces are extended in both directions along retention time. During this extension phase, the centroid m/z is computed on-line as an intensity-weighted mean of the peaks.

The extension phase ends when either the frequency of gathered peaks drops below a threshold (min_sample_rate, see MassTraceDetection parameters) or when the number of missed scans exceeds a threshold (trace_termination_outliers, see MassTraceDetection parameters).

Finally, only mass traces that pass a filter (a certain minimal and maximal length as well as having the minimal sample rate criterion fulfilled) get added to the result.
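
Two of the ideas above, apex selection and the on-line intensity-weighted centroid, can be sketched in plain Python. This is a simplified illustration, not the MassTraceDetection code; the names `apex_candidates`, `apex_factor` and `RunningCentroid`, and the `(mz, rt, intensity)` layout, are hypothetical.

```python
def apex_candidates(peaks, noise_threshold, apex_factor):
    """Return potential apices, most intense first.

    peaks is a list of (mz, rt, intensity) tuples (assumed layout).
    Only peaks at least apex_factor times the noise threshold qualify.
    """
    strong = [p for p in peaks if p[2] >= apex_factor * noise_threshold]
    return sorted(strong, key=lambda p: p[2], reverse=True)


class RunningCentroid:
    """Intensity-weighted mean m/z, updated one peak at a time."""

    def __init__(self):
        self.weighted_sum = 0.0
        self.total_intensity = 0.0

    def add(self, mz, intensity):
        if intensity <= 0:
            raise ValueError("intensity must be positive")
        self.weighted_sum += mz * intensity
        self.total_intensity += intensity
        return self.weighted_sum / self.total_intensity


# Made-up peaks for illustration
peaks = [(100.0, 1.0, 5.0e3), (100.1, 2.0, 5.0e4), (100.2, 3.0, 3.0e4)]
apices = apex_candidates(peaks, noise_threshold=1.0e4, apex_factor=2)

centroid = RunningCentroid()
centroid.add(100.0, 1.0)
current = centroid.add(100.2, 3.0)
print(apices, current)
```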

### 3) Elution peak detection

Extracts chromatographic peaks from a mass trace.

Mass traces may consist of several consecutively (partly overlapping) eluting peaks, e.g., stemming from (almost) isobaric compounds that are separated by retention time. Especially in metabolomics, isomeric compounds with exactly the same mass but different retention behaviour may still be contained in the same mass trace.

This method first applies smoothing on the mass trace's intensities, then detects local minima/maxima in order to separate the chromatographic peaks from each other. Detection of maxima is performed on the smoothed intensities and uses a fixed peak width (given as parameter chrom_fwhm) within which only a single maximum is expected. Currently smoothing is done using Savitzky-Golay smoothing with a second-order polynomial and a frame length of the fixed peak width.

Depending on the "width_filtering" parameters, mass traces are filtered by length in seconds ("fixed" filter) or by quantile.

The output of the algorithm is a set of chromatographic peaks for each mass trace, i.e. a vector of split mass traces (see ElutionPeakDetection parameters).

In general, a user would want to call the "detectPeaks" functions, potentially followed by the "filterByPeakWidth" function.
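
The smoothing and splitting steps can be sketched without pyOpenMS. The simplifications are deliberate and should be read as assumptions: a fixed 5-point second-order Savitzky-Golay window is used instead of a frame derived from chrom_fwhm, edge points are left unsmoothed, and every local maximum is accepted. The function names and the example trace are hypothetical.

```python
# Standard 5-point, second-order Savitzky-Golay weights (divide by 35)
SG_COEFFS = (-3.0, 12.0, 17.0, 12.0, -3.0)


def smooth(intensities):
    """Smooth interior points; the two points at each edge stay unchanged."""
    out = list(intensities)
    for i in range(2, len(intensities) - 2):
        window = intensities[i - 2:i + 3]
        out[i] = sum(c * v for c, v in zip(SG_COEFFS, window)) / 35.0
    return out


def local_maxima(values):
    """Indices of interior points that are higher than their left neighbour."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] >= values[i + 1]
    ]


def split_trace(intensities):
    """Split a trace at the lowest smoothed point between adjacent maxima.

    Returns (start, end) index pairs, end exclusive.
    """
    smoothed = smooth(intensities)
    maxima = local_maxima(smoothed)
    boundaries = [0]
    for left, right in zip(maxima, maxima[1:]):
        valley = min(range(left, right + 1), key=lambda i: smoothed[i])
        boundaries.append(valley)
    boundaries.append(len(intensities))
    return list(zip(boundaries, boundaries[1:]))


# Made-up trace with two elution peaks
trace = [1, 5, 20, 50, 20, 5, 2, 6, 25, 60, 25, 6, 1]
segments = split_trace(trace)
print(segments)
```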

### 4) Feature detection

FeatureFinderMetabo assembles metabolite features from singleton mass traces.

Mass traces alone would allow for further analysis such as metabolite ID or statistical evaluation. However, in general, monoisotopic mass traces are accompanied by satellite C13 peaks and thus may render the analysis more difficult. FeatureFinderMetabo fulfills a further data reduction step by assembling compatible mass traces to metabolite features (that is, all mass traces originating from one metabolite). To this end, multiple metabolite hypotheses are formulated and scored according to how well differences in RT (optional), m/z or intensity ratios match to those of theoretical isotope patterns.

If the raw data scans contain the scan polarity information, it is stored as meta value "scan_polarity" in the output file.

Mass trace clustering can be done using either 13C distances or a linear model (Kenar et al.); see parameter 'ffm:mz_scoring_13C'. Generally, for lipidomics, use 13C, since lipids are carbon-rich and therefore show pronounced 13C isotope peaks. For general metabolites, the linear model is usually more appropriate. To decide which is better, the total number of features can be used as an indirect measure: the lower(!) the better, since more mass traces are assembled into single features. Detailed information is stored in the featureXML output: it contains meta-values for each feature about the mass trace differences (inspectable via TOPPView).

By default, the linear model is used.
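
The 13C-distance idea can be illustrated with a small check of whether a second mass trace sits at the expected isotope spacing. This is a sketch of the principle only, not the FeatureFindingMetabo scoring (which also covers the linear model and intensity ratios); the function name and the `tol_da` tolerance are hypothetical. The m/z values are taken from the mass trace centroids in the example output further down.

```python
C13_C12_DIFF = 1.0033548  # mass difference between 13C and 12C in Da


def matches_13c_spacing(mono_mz, trace_mz, charge=1, isotope_index=1, tol_da=0.005):
    """Check whether trace_mz lies at the expected 13C isotope position."""
    expected = mono_mz + isotope_index * C13_C12_DIFF / charge
    return abs(trace_mz - expected) <= tol_da


# Singly charged pair (spacing close to 1.0034)
singly = matches_13c_spacing(510.400164716685481, 511.403523198760354, charge=1)
# Doubly charged pair (spacing close to 0.5017)
doubly = matches_13c_spacing(600.450020134685815, 600.951764100184505, charge=2)
print(singly, doubly)
```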

### 5) Metabolite adduct decharger (MetaboliteFeatureDeconvolution)

For each peak, this algorithm reconstructs neutral masses by enumerating all possible adducts with matching charge.
You can provide adduct lists and database files for the algorithm to parse.
SIRIUS, which is used later, only supports singly charged adducts, so charges higher than 1 are filtered out.
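
For singly charged adducts, the neutral mass is the measured m/z minus the mass of the charge carrier. The sketch below enumerates candidates for a few positive-mode adducts; it is not the MetaboliteFeatureDeconvolution algorithm, which additionally groups features that agree on one neutral mass. The function name is hypothetical, and the Na and NH4 shifts are approximate, commonly tabulated values (the proton mass 1.007276 matches the dc_charge_adduct_mass column in the output further down).

```python
# Mass shift (Da) of each singly charged adduct relative to the neutral molecule
ADDUCT_SHIFT = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989218,   # approximate tabulated value
    "[M+NH4]+": 18.033823,  # approximate tabulated value
}


def neutral_mass_candidates(mz):
    """Return one neutral-mass hypothesis per adduct for a charge-1 feature."""
    return {adduct: mz - shift for adduct, shift in ADDUCT_SHIFT.items()}


candidates = neutral_mass_candidates(460.254448)
for adduct, mass in candidates.items():
    print(f"{adduct}: {mass:.6f}")
```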

### 6) PrecursorCorrection (To the "nearest feature")

This algorithm is used after feature detection, adduct grouping and even identification via accurate mass search. It performs precursor correction on the MS2 level: if MS2 spectra in the feature space were measured on isotope traces, it "corrects" the MS2 precursor annotation to the monoisotopic trace. That is why a high mass deviation (100 ppm) is combined with an RT tolerance of 0.0. The MS2 spectra are thereby assigned to the feature centroid, which SIRIUS preprocessing can find and map later on.
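
The principle can be sketched as follows: a precursor that matches one of the isotope positions of a feature, while the MS2 scan lies inside the feature's RT range, is reassigned to the feature's monoisotopic m/z. This is an illustration under assumptions, not the pyOpenMS implementation; the function name, the feature dictionary layout and the example values are hypothetical.

```python
C13_C12_DIFF = 1.0033548  # mass difference between 13C and 12C in Da


def correct_to_nearest_feature(precursor_mz, precursor_rt, features,
                               mz_tol_ppm=100.0, rt_tol_s=0.0, max_trace=2):
    """Map a precursor to the monoisotopic m/z of the best matching feature.

    features is a list of dicts with keys mz, charge, rt_start, rt_end
    (assumed layout). Returns the original m/z if nothing matches.
    """
    best = None  # (ppm error, monoisotopic m/z)
    for feat in features:
        if not feat["rt_start"] - rt_tol_s <= precursor_rt <= feat["rt_end"] + rt_tol_s:
            continue
        charge = feat.get("charge") or 1  # treat unknown charge (0) as 1
        for k in range(max_trace + 1):
            expected = feat["mz"] + k * C13_C12_DIFF / charge
            error_ppm = abs(precursor_mz - expected) / expected * 1e6
            if error_ppm <= mz_tol_ppm and (best is None or error_ppm < best[0]):
                best = (error_ppm, feat["mz"])
    return precursor_mz if best is None else best[1]


# Made-up feature; the precursor was picked on its first isotope trace
features = [{"mz": 510.4002, "charge": 1, "rt_start": 500.0, "rt_end": 520.0}]
corrected = correct_to_nearest_feature(511.4035, 510.0, features)
print(corrected)
```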

In [2]:
def preprocess(filename):
    exp = MSExperiment()
    MzMLFile().load(filename, exp)
    exp.sortSpectra(True)
    
    delta_mzs= []
    mzs = []
    rts= []
    PrecursorCorrection.correctToHighestIntensityMS1Peak(exp, 100.0, True, delta_mzs, mzs, rts)
    
    mass_traces = []
    mtd = MassTraceDetection()
    mtd_par = mtd.getDefaults()
    mtd_par.setValue("mass_error_ppm", 10.0) 
    mtd_par.setValue("noise_threshold_int", 1.0e04)
    mtd.setParameters(mtd_par)
    mtd.run(exp, mass_traces, 0)
    
    mass_traces_split = []
    mass_traces_final = []
    epd = ElutionPeakDetection()
    epd_par = epd.getDefaults()
    epd_par.setValue("width_filtering", "fixed")
    epd.setParameters(epd_par)
    epd.detectPeaks(mass_traces, mass_traces_split)
    
    if (epd.getParameters().getValue("width_filtering") == "auto"):
        epd.filterByPeakWidth(mass_traces_split, mass_traces_final)
    else:
        mass_traces_final = mass_traces_split
        
    feature_map_FFM = FeatureMap()
    feat_chrom = []
    ffm = FeatureFindingMetabo()
    ffm_par = ffm.getDefaults() 
    ffm_par.setValue("isotope_filtering_model", "none")
    ffm_par.setValue("remove_single_traces", "true")
    ffm_par.setValue("mz_scoring_by_elements", "true")
    ffm.setParameters(ffm_par)
    ffm.run(mass_traces_final, feature_map_FFM, feat_chrom)
    feature_map_FFM.setUniqueIds()
    fh = FeatureXMLFile()
    fh.store('./pyOpenMS_results/FeatureFindingMetabo.featureXML', feature_map_FFM)
    
    mfd = MetaboliteFeatureDeconvolution()
    mdf_par = mfd.getDefaults()
    mdf_par.setValue("potential_adducts",  [b"H:+:0.6",b"Na:+:0.2",b"NH4:+:0.1", b"H2O:-:0.1"])
    mdf_par.setValue("charge_min", 1, "Minimal possible charge")
    mdf_par.setValue("charge_max", 1, "Maximal possible charge")
    mdf_par.setValue("charge_span_max", 1)
    mdf_par.setValue("max_neutrals", 1)
    mfd.setParameters(mdf_par)
    
    feature_map_DEC = FeatureMap()
    cons_map0 = ConsensusMap()
    cons_map1 = ConsensusMap()
    mfd.compute(feature_map_FFM, feature_map_DEC, cons_map0, cons_map1)
    fxml = FeatureXMLFile()
    PrecursorCorrection.correctToNearestFeature(feature_map_DEC, exp, 0.0, 100.0, True, False, False, False, 3, 0)
    fxml.store("./pyOpenMS_results/deconvoluted.featureXML", feature_map_DEC)
    
    with featurexml.read("./pyOpenMS_results/deconvoluted.featureXML") as f:
        features_list = [FXML for FXML in f]
    
    df = pd.DataFrame() 

    for feat in features_list:
        idx = feat['id']
        for key in feat.keys():
            if key == 'id':
                pass
            # For col with dictionary do the following
            elif key == 'position':
                pos_list = feat['position']
                for pos in pos_list:
                    if pos['dim'] == '0':
                        df.loc[idx, 'position_0'] = pos['position']
                    elif pos['dim'] == '1':
                        df.loc[idx, 'position_1'] = pos['position']
            elif key == 'quality':
                qual_list = feat['quality']
                for qual in qual_list:
                    if qual['dim'] == '0':
                        df.loc[idx, 'quality_0'] = qual['quality']
                    elif qual['dim'] == '1':
                        df.loc[idx, 'quality_1'] = qual['quality']
            else:
                df.loc[idx, key] = feat[key]
    # In featureXML, position dim 0 is the retention time and dim 1 is m/z
    df_tidy = df.rename(columns = {'position_0': 'RT', 'position_1': 'mz'}, inplace = False)
    df_tidy=df_tidy.drop(columns= ["quality_0", "quality_1", "overallquality", "label", "legal_isotope_pattern"])
    df_tidy.reset_index(drop=True, inplace=True) 
    df_tidy.to_csv("./pyOpenMS_results/preprocessedDF.csv")
    return df_tidy

### SIRIUS Adapter
The SIRIUS function is optional and includes the SIRIUS Adapter algorithm, which runs SIRIUS from the Böcker lab.

The algorithm generates formula predictions from scores calculated from 1) MS2 fragmentation scores (ppm error + intensity) and 2) MS1 isotopic pattern scores.

It can only process features that are singly charged. There is also a per-compound timeout, so that it does not compute for longer than 100 seconds per feature, which normally happens with larger molecules:

-sirius:compound_timeout <number>
Maximal computation time in seconds for a single compound. 0 for an infinite amount of time. (default: '100' min: '0')

This algorithm can help with data dereplication and analysis for direct library search.

In [3]:
def SIRIUS(filename):    
    exp = MSExperiment()
    MzMLFile().load(filename, exp)
    exp.sortSpectra(True)
    
    delta_mzs= []
    mzs = []
    rts= []
    PrecursorCorrection.correctToHighestIntensityMS1Peak(exp, 100.0, True, delta_mzs, mzs, rts)
    sirius_algo = SiriusAdapterAlgorithm()
    sirius_algo_par = sirius_algo.getDefaults()
    sirius_algo_par.setValue("preprocessing:filter_by_num_masstraces", 2) 
    sirius_algo_par.setValue("preprocessing:precursor_mz_tolerance", 10.0) #default
    sirius_algo_par.setValue("preprocessing:precursor_mz_tolerance_unit", "ppm")
    sirius_algo_par.setValue("preprocessing:precursor_rt_tolerance", 5.0) #default
    sirius_algo_par.setValue("preprocessing:feature_only", "true")
    sirius_algo_par.setValue("sirius:profile", "orbitrap")
    sirius_algo_par.setValue("sirius:db", "none")
    sirius_algo_par.setValue("sirius:ions_considered", "[M+H]+, [M-H2O+H]+, [M+Na]+, [M+NH4]+")
    sirius_algo_par.setValue("sirius:candidates", 10)
    sirius_algo_par.setValue("sirius:elements_enforced", "CHNOS") 
    sirius_algo_par.setValue("project:processors", 2)
    sirius_algo_par.setValue("fingerid:db", "BIO")
    sirius_algo.setParameters(sirius_algo_par)
    
    featureinfo = "./pyOpenMS_results/deconvoluted.featureXML"
    fm_info = FeatureMapping_FeatureMappingInfo()
    feature_mapping = FeatureMapping_FeatureToMs2Indices() 
    sirius_algo.preprocessingSirius(featureinfo,
                                    exp,
                                    fm_info,
                                    feature_mapping)
    sirius_algo.logFeatureSpectraNumber(featureinfo, 
                                    feature_mapping,
                                    exp)
    msfile = SiriusMSFile()
    debug_level = 3
    sirius_tmp = SiriusTemporaryFileSystemObjects(debug_level)
    siriusstring= String(sirius_tmp.getTmpMsFile())
    feature_only = sirius_algo.isFeatureOnly()
    isotope_pattern_iterations = sirius_algo.getIsotopePatternIterations()
    no_mt_info = sirius_algo.isNoMasstraceInfoIsotopePattern()
    compound_info = []
    msfile.store(exp,
                 String(sirius_tmp.getTmpMsFile()),
                 feature_mapping, 
                 feature_only,
                 isotope_pattern_iterations, 
                 no_mt_info, 
                 compound_info)
    out_csifingerid = "./pyOpenMS_results/csifingerID.mzTab" 
    executable= "/Users/eeko/Desktop/software/Contents/MacOS/sirius"
    subdirs = sirius_algo.callSiriusQProcess(String(sirius_tmp.getTmpMsFile()),
                                             String(sirius_tmp.getTmpOutDir()),
                                             String(executable),
                                             String(out_csifingerid),
                                             False)
    candidates = sirius_algo.getNumberOfSiriusCandidates()
    sirius_result = MzTab()
    siriusfile = MzTabFile()
    SiriusMzTabWriter.read(subdirs,
                            filename,
                            candidates,
                            sirius_result)
    siriusfile.store("./pyOpenMS_results/out_sirius_test.mzTab", sirius_result)
    
    siriusfile= "./pyOpenMS_results/out_sirius_test.mzTab"
    sirius=  pyteomics.mztab.MzTab(siriusfile, encoding='UTF8', table_format='df')
    sirius.metadata
    df= sirius.small_molecule_table
    SIRIUS_DF= df.drop(columns= ["identifier", "smiles", "inchi_key", "description", "calc_mass_to_charge", "charge", "taxid", "species","database", "database_version", "spectra_ref", "search_engine", "modifications"])
    SIRIUS_DF.to_csv("./pyOpenMS_results/SIRIUS_DF.csv")
    return SIRIUS_DF

### CSI:FingerID

The CSI_fingerID function wraps CSI:FingerID, another algorithm from the Böcker lab. It uses the formula predictions from SIRIUS to search structural libraries and predict the structure for each formula.

TODO: pass `subdirs` to CSI_fingerID without having to rerun SIRIUS, e.g. by saving it after the SIRIUS run.
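
One possible way to reuse the SIRIUS output directories without rerunning SIRIUS is to write the list to a small JSON file right after the SIRIUS call and read it back later. This is a sketch under assumptions: the helper names and the cache file name are hypothetical, the entries of `subdirs` are assumed to be `bytes` or convertible with `str()`, and the directories themselves must still exist on disk when CSI:FingerID results are read.

```python
import json
import tempfile
from pathlib import Path


def save_subdirs(subdirs, path):
    """Store the SIRIUS output directory list as JSON text."""
    as_text = [s.decode() if isinstance(s, bytes) else str(s) for s in subdirs]
    Path(path).write_text(json.dumps(as_text, indent=2))


def load_subdirs(path):
    """Read the directory list back as a list of str."""
    return json.loads(Path(path).read_text())


# Round trip with made-up directory names
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "sirius_subdirs.json"
    save_subdirs([b"/tmp/sirius/1_compound", "/tmp/sirius/2_compound"], cache)
    restored = load_subdirs(cache)
print(restored)
```

In the notebook, `save_subdirs` would be called inside SIRIUS() after the SIRIUS call returns, and the loaded list passed on to the CSI:FingerID reader.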

In [4]:
# subdirs: SIRIUS output directories as returned by callSiriusQProcess in SIRIUS()
def CSI_fingerID(filename, subdirs):
    top_hits= 5
    csi_result=MzTab()
    csi_file=MzTabFile()
    CsiFingerIdMzTabWriter.read(subdirs,
                        filename,
                        top_hits,
                        csi_result)
    csi_file.store("./pyOpenMS_results/csifingerID.mzTab", csi_result)
    csi_file= "./pyOpenMS_results/csifingerID.mzTab"
    CSI=  pyteomics.mztab.MzTab(csi_file, encoding='UTF8', table_format='df')
    CSI.metadata
    df= CSI.small_molecule_table
    csifingerID= df.drop(columns= ["calc_mass_to_charge", "charge", "taxid", "species","database", "database_version", "spectra_ref", "search_engine", "modifications"])
    return csifingerID

In [5]:
filename= "./rawdata/CLImzml/EpemicinsFLT.mzML"
preprocess(filename)

Unnamed: 0,RT,mz,intensity,charge,FWHM,max_height,num_of_masstraces,masstrace_intensity,masstrace_centroid_rt,masstrace_centroid_mz,isotope_distances,dc_charge_adducts,dc_charge_adduct_mass,Group,is_ungrouped_with_charge,map_idx,adducts,is_backbone
0,42.379768303980001,404.190102738177075,3.534303e05,1,6.644110,129759.664062,2.0,"[3.534302611248636e05, 8.032523655859887e05]","[42.379768303980001, 42.379768303980001]","[404.190102738177075, 405.198191650642855]",[1.00808891246578],H1,1.007276,6922735754755490651,1.0,,,
1,44.196742559999997,373.18483283821422,1.709681e05,1,5.536063,43890.843750,2.0,"[1.709681121205188e05, 1.690066035111322e05]","[44.196742559999997, 45.101220592019999]","[373.18483283821422, 374.175316417540955]",[0.990483579326735],H1,1.007276,7664605077221577358,1.0,,,
2,44.196742559999997,1477.007106295991207,3.43794e05,2,5.519711,145118.765625,2.0,"[3.437940108001174e05, 5.645828668345909e05]","[44.196742559999997, 44.196742559999997]","[1477.007106295991207, 1477.501176986231258]",[0.494070690240051],H2,2.014553,16454649826006369365,1.0,,,
3,45.101220592019999,364.153943880532893,1.747644e05,1,4.832314,84974.312500,2.0,"[1.74764381524645e05, 3.333736645573281e05]","[45.101220592019999, 46.025830015979999]","[364.153943880532893, 365.155544176485137]",[1.001600295952244],H1,1.007276,2635349743038291014,1.0,,,
4,45.101220592019999,421.15685600186265,4.954244e05,1,5.429286,118755.718750,2.0,"[4.954243807043884e05, 2.144474323613433e05]","[45.101220592019999, 46.959832751999997]","[421.15685600186265, 422.174593880076941]",[1.017737878214291],H1,1.007276,12474513868269941195,1.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2490,870.433413328020038,554.426359678747076,1.142151e06,1,5.756469,220571.781250,2.0,"[1.142150814992797e06, 3.30921115430037e05]","[870.433413328020038, 869.834840385000007]","[554.426359678747076, 555.4297484989944]",[1.003388820247324],H1,1.007276,10212228753604300667,1.0,,,
2491,871.433711809019997,600.450020134685815,4.951236e05,2,12.587696,48399.726562,2.0,"[4.951236260218139e05, 4.044639759062053e05]","[871.433711809019997, 870.433413328020038]","[600.450020134685815, 600.951764100184505]",[0.50174396549869],H2,2.014553,3366897729140362775,1.0,,,
2492,872.868845488980014,510.400164716685481,1.072304e06,1,7.293561,169854.453125,2.0,"[1.072303827657634e06, 3.000456179847701e05]","[872.868845488980014, 872.868845488980014]","[510.400164716685481, 511.403523198760354]",[1.003358482074873],H1,1.007276,10236554635733993854,1.0,,,
2493,877.998554481000042,466.373949664382394,6.042666e05,1,5.262895,128090.562500,2.0,"[6.042666170837692e05, 1.725592405209275e05]","[877.998554481000042, 877.998554481000042]","[466.373949664382394, 467.377344142265713]",[1.003394477883319],H1,1.007276,4894034969050366270,1.0,,,


In [6]:
SIRIUS(filename)

Unnamed: 0,chemical_formula,exp_mass_to_charge,retention_time,best_search_engine_score[1],best_search_engine_score[2],best_search_engine_score[3],opt_global_adduct,opt_gobal_precursorFormula,opt_global_rank,opt_global_explainedPeaks,opt_global_explainedIntensity,opt_global_median_mass_error_fragment_peaks_ppm,opt_global_median_absolute_mass_error_fragment_peaks_ppm,opt_global_mass_error_precursor_ppm,opt_global_compoundId,opt_global_compoundScanNumber,opt_global_featureId,opt_global_native_id
0,C20H39NO9,460.254448,269.175851,175.890441,174.933845,0.956596,[M + Na]+,C20H39NO9,1,44,0.831305,-1.926314,1.926314,5.964750,1262,1263,id_10377325295008937059,controllerType=0 controllerNumber=1 scan=1263
1,C21H35N5O5,460.254448,269.175851,173.401610,171.102401,2.299209,[M + Na]+,C21H35N5O5,2,44,0.836212,-1.974327,1.974327,3.058984,1262,1263,id_10377325295008937059,controllerType=0 controllerNumber=1 scan=1263
2,C17H31N11O3,460.254448,269.175851,169.644633,169.644633,0.000000,[M + Na]+,C17H31N11O3,3,44,0.836212,-1.974327,1.974327,8.893513,1262,1263,id_10377325295008937059,controllerType=0 controllerNumber=1 scan=1263
3,C23H37N2O6,460.254448,269.175851,169.542596,166.236359,3.306237,[M + Na]+,C23H37N2O6,4,44,0.831305,-1.926314,1.926314,0.141719,1262,1263,id_10377325295008937059,controllerType=0 controllerNumber=1 scan=1263
4,C18H39N5O5S,460.254448,269.175851,168.452900,168.452900,0.000000,[M + Na]+,C18H39N5O5S,5,43,0.823230,-1.939068,1.939068,-4.265504,1262,1263,id_10377325295008937059,controllerType=0 controllerNumber=1 scan=1263
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8341,C25H22N4O,417.166788,733.395929,132.572124,132.508969,0.063155,[M + Na]+,C25H22N4O,6,28,0.489294,-3.896603,3.896603,-4.299557,3689,3690,id_728039705289880318,controllerType=0 controllerNumber=1 scan=3690|...
8342,C22H24N3O4,417.166788,733.395929,132.507189,129.705345,2.801844,[M + Na]+,C22H24N3O4,7,28,0.489294,1.731576,2.094982,2.124914,3689,3690,id_728039705289880318,controllerType=0 controllerNumber=1 scan=3690|...
8343,C17H26N6O3S,417.166788,733.395929,131.650290,127.424556,4.225734,[M + Na]+,C17H26N6O3S,8,28,0.489294,-2.313506,2.313506,-2.737516,3689,3690,id_728039705289880318,controllerType=0 controllerNumber=1 scan=3690|...
8344,C13H22N12OS,417.166788,733.395929,129.746709,126.971450,2.775259,[M + Na]+,C13H22N12OS,9,28,0.489294,3.763273,3.763273,3.699641,3689,3690,id_728039705289880318,controllerType=0 controllerNumber=1 scan=3690|...


In [14]:
from pandas import DataFrame
import pandas as pd

import pyteomics
from pyteomics.openms import featurexml
with featurexml.read("./mzML_files/wf_testing/deconvolutedEpemicins.featureXML") as f:
    features_list = [FXML for FXML in f]
    
df = pd.DataFrame() 

for feat in features_list:
    idx = feat['id']
    for key in feat.keys():
        if key == 'id':
            pass
        # For col with dictionary do the following
        elif key == 'position':
            pos_list = feat['position']
            for pos in pos_list:
                if pos['dim'] == '0':
                    df.loc[idx, 'position_0'] = pos['position']
                elif pos['dim'] == '1':
                    df.loc[idx, 'position_1'] = pos['position']
        elif key == 'quality':
            qual_list = feat['quality']
            for qual in qual_list:
                if qual['dim'] == '0':
                    df.loc[idx, 'quality_0'] = qual['quality']
                elif qual['dim'] == '1':
                    df.loc[idx, 'quality_1'] = qual['quality']
        else:
            df.loc[idx, key] = feat[key]
# In featureXML, position dim 0 is the retention time and dim 1 is m/z
df_tidy = df.rename(columns = {'position_0': 'RT', 'position_1': 'mz'}, inplace = False)
df_tidy=df_tidy.drop(columns= ["quality_0", "quality_1", "overallquality", "label", "legal_isotope_pattern"])
df_tidy.reset_index(drop=True, inplace=True) 
df_tidy


df_tidy.loc[df_tidy['mz'] == "658"]
df_tidy

Unnamed: 0,RT,mz,intensity,charge,FWHM,max_height,num_of_masstraces,masstrace_intensity,masstrace_centroid_rt,masstrace_centroid_mz,isotope_distances,Group,is_ungrouped_monoisotopic,dc_charge_adducts,dc_charge_adduct_mass,is_ungrouped_with_charge,map_idx,adducts,is_backbone,old_charge
0,39.529802865000001,349.172005427819727,6.483638e05,0,8.939569,75668.257812,1.0,[6.48363806042836e05],[39.529802865000001],[349.172005427819727],[],8497532420404919093,1.0,,,,,,,
1,42.379768303980001,391.182592020985794,1.612096e06,0,10.832493,183479.296875,1.0,[1.612096255726351e06],[42.379768303980001],[391.182592020985794],[],1021302065514414871,1.0,,,,,,,
2,42.379768303980001,404.190102738177075,3.534303e05,1,6.644110,129759.664062,2.0,"[3.534302611248636e05, 8.032523655859887e05]","[42.379768303980001, 42.379768303980001]","[404.190102738177075, 405.198191650642855]",[1.00808891246578],1531188304737912594,,H1,1.007276,1.0,,,,
3,42.379768303980001,461.235803441382416,3.050932e05,0,4.296739,88459.125000,1.0,[3.050931746679126e05],[42.379768303980001],[461.235803441382416],[],14370752362800809058,1.0,,,,,,,
4,43.264481632020001,361.208104366155453,2.066669e06,0,5.489381,487036.281250,1.0,[2.066669269290893e06],[43.264481632020001],[361.208104366155453],[],17717354507030679941,1.0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14064,917.212863695999999,1321.983679608177454,2.423242e05,0,11.887800,29230.892578,1.0,[2.423241585391494e05],[917.212863695999999],[1321.983679608177454],[],16230232327793072483,1.0,,,,,,,
14065,918.210342559979949,420.319015627257613,2.520901e05,0,6.803956,42458.042969,1.0,[2.520900600146166e05],[918.210342559979949],[420.319015627257613],[],9712883258301259330,1.0,,,,,,,
14066,918.210342559979949,1421.977424770406287,1.326787e05,0,4.103276,38798.972656,1.0,[1.3267872324558e05],[918.210342559979949],[1421.977424770406287],[],3859115546576027604,1.0,,,,,,,
14067,918.210342559979949,1521.970953123048957,2.119047e05,0,6.773869,37788.941406,1.0,[2.119047253454692e05],[918.210342559979949],[1521.970953123048957],[],10096857489543211695,1.0,,,,,,,


### Explanation of columns
- mz = mass-to-charge ratio (m/z)
- RT = retention time (s)
- intensity = intensity of the feature (AU, arbitrary units)
- FWHM = full width of the peak at half its maximum height
- num_of_masstraces = number of mass traces detected (single mass traces are excluded). This is relevant to the isotopic pattern.
- isotope_distances = distance in m/z between the isotopes (jumps of approx. 1 for singly charged features are important to confirm that this is a real feature)

In [2]:
import pandas as pd
import numpy as np
import sys
import pyteomics
from pyteomics import mztab
filename= "./mzML_files/wf_testing/out_sirius_testGermB.mzTab"
sirius=  pyteomics.mztab.MzTab(filename, encoding='UTF8', table_format='df')
sirius.metadata
df= sirius.small_molecule_table
data= df.drop(columns= ["identifier", "smiles", "inchi_key", "description", "calc_mass_to_charge", "charge", "taxid", "species","database", "database_version", "spectra_ref", "search_engine", "modifications"])
data

Unnamed: 0,chemical_formula,exp_mass_to_charge,retention_time,best_search_engine_score[1],best_search_engine_score[2],best_search_engine_score[3],opt_global_adduct,opt_gobal_precursorFormula,opt_global_rank,opt_global_explainedPeaks,opt_global_explainedIntensity,opt_global_median_mass_error_fragment_peaks_ppm,opt_global_median_absolute_mass_error_fragment_peaks_ppm,opt_global_mass_error_precursor_ppm,opt_global_compoundId,opt_global_compoundScanNumber,opt_global_featureId,opt_global_native_id
0,CH4O4,81.017307,262.347736,10.29401,8.832051,1.461959,[M + H]+,CH4O4,1,2,0.008202,-13.847945,13.847945,-11.453751,1286,1287,id_15401598686271704206,controllerType=0 controllerNumber=1 scan=1287|...
1,C2H5FS,81.017307,262.347736,1.771033,1.771033,0.0,[M + H]+,C2H5FS,2,2,0.008202,4.838726,4.838726,5.323257,1286,1287,id_15401598686271704206,controllerType=0 controllerNumber=1 scan=1287|...
2,CB2O3,81.017307,262.347736,-1.872022,-1.872022,0.0,[M + H]+,CB2O3,3,1,0.0,-7.247884,7.247884,-7.247884,1286,1287,id_15401598686271704206,controllerType=0 controllerNumber=1 scan=1287|...
3,C5H12OS,121.068635,300.520158,14.094925,11.656118,2.438807,[M + H]+,C5H12OS,1,4,0.092262,3.68721,4.694264,3.90611,1539,1540,id_15868649392726213180,controllerType=0 controllerNumber=1 scan=1540|...
4,C5H8B2S,121.068635,300.520158,8.513833,8.513833,0.0,[M + H]+,C5H8B2S,2,5,0.110982,6.720613,9.794502,6.720613,1539,1540,id_15868649392726213180,controllerType=0 controllerNumber=1 scan=1540|...
5,C5H8F2N,121.068635,300.520158,1.258188,1.258188,0.0,[M + H]+,C5H8F2N,3,3,0.074048,-14.698028,14.698028,-9.265691,1539,1540,id_15868649392726213180,controllerType=0 controllerNumber=1 scan=1540|...
6,C5H6BN2O,121.068635,300.520158,-1.30328,-1.30328,0.0,[M + H]+,C5H6BN2O,4,2,0.052254,4.027006,4.027006,3.379071,1539,1540,id_15868649392726213180,controllerType=0 controllerNumber=1 scan=1540|...
7,CH8B2O5,121.068635,300.520158,-2.279097,-2.279097,0.0,[M + H]+,CH8B2O5,5,1,0.0,-13.946176,13.946176,-13.946176,1539,1540,id_15868649392726213180,controllerType=0 controllerNumber=1 scan=1540|...
8,C6H13FN4,183.102896,418.423143,53.600507,53.600507,0.0,[M + Na]+,C6H13FN4,1,20,0.340086,-4.561994,7.225714,6.830271,2228,2229,id_14840376333216993323,controllerType=0 controllerNumber=1 scan=2229|...
9,C11H14N,183.102896,418.423143,23.223615,20.174189,3.049426,[M + Na]+,C11H14N,2,11,0.201965,-7.28066,7.621158,5.739016,2228,2229,id_14840376333216993323,controllerType=0 controllerNumber=1 scan=2229|...


In [4]:
filename= "./mzML_files/wf_testing/csifingerID.mzTab"
CSI=  pyteomics.mztab.MzTab(filename, encoding='UTF8', table_format='df')
CSI.metadata
df= CSI.small_molecule_table
csifingerID= df.drop(columns= ["calc_mass_to_charge", "charge", "taxid", "species","database", "database_version", "spectra_ref", "search_engine", "modifications"])
csifingerID

Unnamed: 0,identifier,chemical_formula,smiles,inchi_key,description,exp_mass_to_charge,retention_time,best_search_engine_score[1],opt_global_rank,opt_global_compoundId,opt_global_compoundScanNumber,opt_global_featureId,opt_global_native_id,opt_global_adduct,opt_global_dblinks,opt_global_dbflags
0,123571071,C17H25BN2O2S,B1(OC(C(O1)(C)C)(C)C)C2=CC3=C(C=C2)N(SN3C)CC4CC4,QCQPLSTXPJUAAM,,349.209026,123.805954,-260.2015,1,745,746,id_6128946280250909851,controllerType=0 controllerNumber=1 scan=746,[M + H3N + H]+,PubChem:(123571071),2
1,131432082,C17H28BN3O2S,B1(OC(C(O1)(C)C)(C)C)C2=CN=C(S2)N3CCN4CCCCC4C3,SECXVOZRGXMKEM,,349.209026,123.805954,-299.350229,1,745,746,id_6128946280250909851,controllerType=0 controllerNumber=1 scan=746,[M + H]+,PubChem:(131432082),2
2,109567509,C15H31FN4O3S,CCNC(=NCCCF)N1CCN(CC1)S(=O)(=O)CCOC(C)C,BRHUEMIRBTXUSW,,349.209026,123.805954,-267.945346,1,745,746,id_6128946280250909851,controllerType=0 controllerNumber=1 scan=746,[M - H2O + H]+,PubChem:(109567509),2
3,70910984,C15H26FN3O2S,CC(C)(C)S(=O)(=O)NCCCCCNC1=NC=C(C=C1)CF,PEYBFSDPVVZPQL,,349.209026,123.805954,-244.285143,1,745,746,id_6128946280250909851,controllerType=0 controllerNumber=1 scan=746,[M + H3N + H]+,PubChem:(70910984),2
4,128994375,C15H26FN3O2S,CC1=NC(=CS1)CN2CC(CC2CN(C)CC(COC)O)F,CQZCDUGNYLCVRO,,349.209026,123.805954,-279.329205,2,745,746,id_6128946280250909851,controllerType=0 controllerNumber=1 scan=746,[M + H3N + H]+,PubChem:(128994375),2
5,138722980,C11H17FN6O3,CC(C(CO)OC(CF)N1C=NC2=C(N=C(N=C21)N)N)O,JJXXHWZKDWLJHM,,323.122389,356.847436,-270.225119,1,2846,2847,id_11377481446250170993,controllerType=0 controllerNumber=1 scan=2847,[M + Na]+,PubChem:(138722980),2
6,135244532,C11H17FN6O3,CC1C(N2C(=N1)C(=NC(=N2)N)N)C3C(C(C(O3)CO)O)F,VNYAFPCEJVZLAO,,323.122389,356.847436,-299.163846,2,2846,2847,id_11377481446250170993,controllerType=0 controllerNumber=1 scan=2847,[M + Na]+,PubChem:(135244532),2
7,137397861|137397935,C11H17FN6O3,COC(CN)(CO)C(C(N1C=NC2=C(N=CN=C21)N)F)O,RQTBEDYRRNIRGZ,,323.122389,356.847436,-314.805122,3,2846,2847,id_11377481446250170993,controllerType=0 controllerNumber=1 scan=2847,[M + Na]+,PubChem:(137397861 137397935),2
8,118595172,C11H17FN6O3,C=C1NC(C=CN1C2C(C(C(O2)(CN=[N+]=[N-])CO)O)F)N,MYSZRXFXRIVGPS,,323.122389,356.847436,-438.401003,4,2846,2847,id_11377481446250170993,controllerType=0 controllerNumber=1 scan=2847,[M + Na]+,PubChem:(118595172),2
9,25000395,C11H20BN3O4S,B1(OC(C(O1)(C)C)(C)C)C2=CN(C=N2)S(=O)(=O)N(C)C,ZWQXNFBXPCSFAO,"1-(N,N-Dimethylsulfamoyl)imidazole-4-boronic a...",323.122389,356.847436,-259.475588,1,2846,2847,id_11377481446250170993,controllerType=0 controllerNumber=1 scan=2847,[M + Na]+,PubChem:(25000395),2
