# Raw Spectra processing

This notebook takes 'raw' mass spectra in the open mzML format and converts them into tables that can be used in the main module of this software.

Before you begin, make sure that all your spectra are in the open mzML format. If they are not, you must convert them first. This can be done with freely available tools such as [ProteoWizard](https://proteowizard.sourceforge.io/download.html)'s MSConvert.
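A quick pre-flight check can catch unconverted or misplaced files before any loading is attempted. The sketch below uses only the standard library; `check_mzml_files` is a hypothetical helper introduced here for illustration and is not part of this software.

```python
from pathlib import Path

def check_mzml_files(file_names, folder='.'):
    """Return the names that are missing from `folder` or lack a .mzML extension.

    Hypothetical helper for illustration only.
    """
    problems = []
    for name in file_names:
        path = Path(folder) / name
        # Flag the file if the extension is wrong or the file is not on disk
        if path.suffix.lower() != '.mzml' or not path.is_file():
            problems.append(name)
    return problems

# Names without the .mzML extension are flagged regardless of what is on disk
print(check_mzml_files(['sample_A.raw', 'sample_B.d']))
```

An empty list means every file was found and has the expected extension.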

The notebook provides a visualization tool near the end, which lets you compare the different spectra and see what changes between the stages of data processing. Ideally, the spectra should be as 'cleaned up' as possible without losing relevant information.

The two export options default to False so that you have time to evaluate the quality of the processing, and possibly tweak parameters, before proceeding. Set them to True to export.

In [None]:
import numpy as np
import pandas as pd
import pyopenms as oms
%matplotlib ipympl
import matplotlib.pyplot as plt
import metabolinks as mtl
from metabolinks import align

## Convert raw spectra into processed mass lists

Insert the names of your files in the following cell.

In [None]:
files = ['BY0_000001.mzML','BY0_000002.mzML', 'BY0_000003.mzML',
         'GRE3_000001.mzML', 'GRE3_000002.mzML', 'GRE3_000003.mzML', 
         'dGLO1_000001.mzML', 'dGLO1_000002.mzML', 'dGLO1_000003.mzML',
         'GLO2_000001.mzML', 'GLO2_000002.mzML', 'GLO2_000003.mzML',
         'ENO1_000001.mzML', 'ENO1_000002.mzML', 'ENO1_000003.mzML']

out_name = '5_yeasts' # Stem of the name for the desired output files

Centroiding is done with pyOpenMS's `PeakPickerHiRes`. See documentation: https://openms.de/current_doxygen/html/classOpenMS_1_1PeakPickerHiRes.html

While the signal-to-noise threshold is the most critical parameter, others may also be customized in the pyopenms environment. If you know what you are doing, uncomment the corresponding `param.setValue` lines in the loop below (delete the leading `#`) to customize any of the others.
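To build intuition for what a signal-to-noise threshold does, here is a deliberately simplified sketch. It is not the algorithm used by `PeakPickerHiRes`: it assumes a single global noise level estimated as the median intensity, and `above_signal_to_noise` is a hypothetical name used only for this illustration.

```python
from statistics import median

def above_signal_to_noise(intensities, threshold):
    """Keep intensities whose ratio to a global median noise estimate meets the threshold.

    Simplified illustration only; real peak pickers estimate noise locally.
    """
    noise = median(intensities)
    if noise <= 0:
        # No usable noise estimate, so nothing can be rejected
        return list(intensities)
    return [i for i in intensities if i / noise >= threshold]

# Toy profile: mostly baseline points plus one strong and one weak signal
toy = [10, 12, 11, 9, 10, 250, 11, 40, 10]
print(above_signal_to_noise(toy, 3.3))
print(above_signal_to_noise(toy, 5.0))  # a stricter threshold keeps fewer points
```

Raising the threshold discards weaker signals first, which is why this parameter has the largest effect on how many centroided signals survive.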

In [None]:
signal_to_noise = 3.3

In [None]:
specs = {}

for f in files:

    f_name = f.split('.')[0]

    exp = oms.MSExperiment()
    oms.MzMLFile().load(f, exp)
    spec = exp[0]
    mz, intensity = spec.get_peaks()
    raw_df = pd.DataFrame(data={f_name: intensity}, index= mz)

    centroided_spectra = oms.MSExperiment()
    cnt = oms.PeakPickerHiRes()
    param = cnt.getParameters()
    param.setValue("signal_to_noise", signal_to_noise)
    #param.setValue("spacing_difference_gap", 7.0)
    #param.setValue("spacing_difference", 5.0)
    #param.setValue("SignalToNoise:win_len", 200.0) # def 200
    #param.setValue("SignalToNoise:bin_count", 200) #def 200
    #param.setValue("SignalToNoise:min_required_elements", 100)
    #param.setValue("SignalToNoise:max_intensity", 1000000)
    #param.setValue("SignalToNoise:auto_mode", -1)
    cnt.setParameters(param)
    cnt.pickExperiment(exp, centroided_spectra, True)
    cent_mz, cent_ints = centroided_spectra[0].get_peaks()
    cent_df = pd.DataFrame(data={f_name: cent_ints}, index = cent_mz)

    print(f_name,'|' ,'Raw signals:' , len(raw_df), '|', 'Centroided signals:', len(cent_df))

    specs[f_name] = {'raw': raw_df, 'centroided': cent_df}

If you wish to export the individual spectra (centroided but unaligned) to an Excel file (with each spectrum on its own sheet), you may do so here. This file can be used as input in the main module's data alignment section. Alternatively, you may proceed and do the alignment in this module.

In [None]:
CENT_TO_EXCEL = False

if CENT_TO_EXCEL:
    with pd.ExcelWriter(out_name+'_centroided.xlsx') as writer:
        for s in specs:
            # Keep the index: it holds the m/z values, which are needed for alignment
            specs[s]['centroided'].to_excel(writer, sheet_name=s)

## Data alignment

This section performs data alignment. You may select the PPM tolerance below which spectral entities will be considered the same compound. You may also select the minimum number of samples an entity needs to appear in to be preserved. This section is identical to the data alignment section in the main module.
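As a rough illustration of what a PPM tolerance means, the sketch below computes the relative difference between two m/z values in parts per million. It assumes the difference is taken relative to the mean of the two masses; the reference mass and grouping rule actually used by `align` may differ, and both function names are hypothetical.

```python
def ppm_difference(mz_a, mz_b):
    """Relative difference between two m/z values, in parts per million of their mean."""
    return abs(mz_a - mz_b) / ((mz_a + mz_b) / 2) * 1e6

def within_tolerance(mz_a, mz_b, ppmtol):
    """True if the two m/z values differ by no more than `ppmtol` ppm."""
    return ppm_difference(mz_a, mz_b) <= ppmtol

# Around m/z 500, a 1 ppm window is roughly 0.0005 wide
print(within_tolerance(500.0000, 500.0004, 1))  # about 0.8 ppm apart
print(within_tolerance(500.0000, 500.0010, 1))  # about 2 ppm apart
```

Because the tolerance is relative, the same PPM value corresponds to a wider absolute window at higher m/z.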

In [None]:
frames = [specs[s]['centroided'] for s in specs]

In [None]:
ppmtol = 1 # PPM Tolerance
min_samples = 2 # Number of times a compound needs to appear across all samples to be included

aligned_spectra = align(frames, ppmtol, min_samples)

aligned_spectra

In [None]:
print(list(specs))

In [None]:
ALIGNED_TO_CSV = False

if ALIGNED_TO_CSV:
    aligned_spectra.to_csv(out_name+'_aligned.csv')

## Visualization

You may select the spectra that you wish to see and in which forms. Be mindful that displaying too many spectra at once may crash the notebook. The normalization option is for visual comparisons only and does not alter the intensities in the output files.
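The normalization used for plotting divides every form of a spectrum by the sum of its raw intensities, so that raw, centroided and aligned panels share one scale. A minimal sketch with plain lists follows; `normalize_to_raw_sum` is a hypothetical helper, and the toy numbers are made up for illustration.

```python
def normalize_to_raw_sum(raw_ints, other_ints):
    """Divide both intensity lists by the total raw intensity."""
    raw_total = sum(raw_ints)
    norm_raw = [i / raw_total for i in raw_ints]
    norm_other = [i / raw_total for i in other_ints]
    return norm_raw, norm_other

raw = [1.0, 3.0, 4.0, 2.0]   # toy raw profile points
cent = [4.0]                 # toy centroided peak picked from that profile

norm_raw, norm_cent = normalize_to_raw_sum(raw, cent)
print(sum(norm_raw))   # the normalized raw intensities sum to approximately 1
print(norm_cent)       # the centroid keeps its height relative to the raw profile
```

Using the raw sum for all forms means the centroided and aligned values generally do not sum to 1, but their heights remain directly comparable with the raw trace.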

In [None]:
see_spec = ['BY0_000001', 'GRE3_000001', 'dGLO1_000001']

spec_form = ['raw', 'cent', 'aligned'] # 'raw', 'cent', 'aligned'

norm_specs = True # Normalization to the sum of all intensities. Centroided and aligned ints are normalized with the raw sum

mass_lim = (200, 1000) #(380, 383) #(1046, 1050) # Can be None
int_lim = None #(None, 0.0005) # can be None


# Create figure; squeeze=False keeps axs1 two-dimensional even with a single row or column
fig1, axs1 = plt.subplots(len(see_spec), len(spec_form), constrained_layout=True, squeeze=False)

# Fill in the data
for s1, s in enumerate(see_spec):

    # When normalizing, all three forms are divided by the raw total intensity
    scale = specs[s]['raw'].values.sum() if norm_specs else 1

    raw_masses = specs[s]['raw'].index
    raw_ints = specs[s]['raw'].values/scale
    cent_masses = specs[s]['centroided'].index
    cent_ints = specs[s]['centroided'].values/scale
    align_masses = aligned_spectra.index
    align_ints = aligned_spectra[s].values/scale

    for sf1, sf in enumerate(spec_form):

        ax = axs1[s1, sf1]

        if sf == 'raw':
            ax.plot(raw_masses, raw_ints)
            ax.set_title(s+' Raw')
        elif sf == 'cent':
            ax.stem(cent_masses, cent_ints)
            ax.set_title(s+' Centroided')
        elif sf == 'aligned':
            ax.stem(align_masses, align_ints)
            ax.set_title(s+' Aligned')
        else:
            # Unknown form: leave the panel empty
            continue

        ax.set_xlim(mass_lim)
        ax.set_ylim(int_lim)

## Evaluation

This section performs basic processing of the aligned data in order to allow for simple statistical evaluation. It's intended only to assist in troubleshooting raw data processing and alignment and should not be taken as the 'definitive' statistical analysis of your results. For that, please refer to the equivalent sections in the main module.