## Alignment of peaks from several _Vitis_ cultivars

Peak lists are aligned based on relative m/z differences (below a ppm threshold), and peaks that appear in fewer than a given number of samples of the Grapevine Datasets are filtered out (using the data in the 'data' folder).
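
As a rough illustration of what "relative difference below a ppm threshold" means, the sketch below computes the ppm difference between two m/z values. The function names are hypothetical, and using the mean of the two values as the reference mass is an assumption; metabolinks may compute it differently.

```python
def ppm_difference(mz_a: float, mz_b: float) -> float:
    """Relative difference between two m/z values, in parts per million.

    Assumption: the mean of the two values is used as the reference mass.
    """
    reference = (mz_a + mz_b) / 2
    return abs(mz_a - mz_b) / reference * 1e6


def within_tolerance(mz_a: float, mz_b: float, ppmtol: float = 1.0) -> bool:
    """True when the two peaks are close enough to count as one feature."""
    return ppm_difference(mz_a, mz_b) <= ppmtol


# Two peaks near m/z 243.18 that differ by 0.0002 (below 1 ppm)
print(within_tolerance(243.1802, 243.1804, ppmtol=1.0))
# The same peak against one that is 0.0005 away (above 1 ppm)
print(within_tolerance(243.1802, 243.1807, ppmtol=1.0))
```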

Organization of the notebook:

- Reading the data
- Presenting the two types of Alignment and Filtering strategies
- Application of Strategy nº 1 
- Application of Strategy nº 2

Requirements:

- metabolinks

#### Needed Imports

In [1]:
import pandas as pd
from collections import OrderedDict
from pathlib import Path

from metabolinks import align, read_data_from_xcel
from metabolinks.similarity import mz_similarity
import metabolinks as mtl



# Peak Alignment and Filtering to build the Negative and Positive Grapevine Datasets

Here, the different spectra (33 samples) from the Grapevine Dataset are aligned, and m/z peaks (features) that appear in fewer than a given number of samples (`min_samples`) are filtered out.

For ease, the different peak alignments and filterings will simply be referred to as 'Alignments'.

## Set up a store to hold all alignments

`alignments.h5` was used as the store for the different alignments. Later, after some changes, `alignments_new.h5` was created with the same alignments, compressed.

In [2]:
#alignments = pd.HDFStore('alignments_new.h5')
alignments = pd.HDFStore('alignments.h5')
pd.set_option('io.hdf.default_format','table')

### Set up metadata descriptions

There were 14 varieties; however, 3 of them came from another "experiment", so only the 11 listed here were aligned.

In [3]:
# Used to read the files
data_folder = 'data'
header_row = 3

data = {
    'CAN': {'filename': 'CAN (14, 15, 16).xlsx',
            'names'   : {'sample_names': '14 15 16'.split(), 'labels' : 'CAN'}},
    'CS':  {'filename': 'CS (29, 30, 31).xlsx',
            'names'   : {'sample_names': '29 30 31'.split(), 'labels' : 'CS'}},
    'LAB':  {'filename': 'LAB (8, 9, 10).xlsx',
            'names'   : {'sample_names': '8  9  10'.split(), 'labels' : 'LAB'}},
    'PN':  {'filename': 'PN (23, 24, 25).xlsx',
            'names'   : {'sample_names': '23 24 25'.split(), 'labels' : 'PN'}},
    'REG':  {'filename': 'REG (38, 39, 40).xlsx',
            'names'   : {'sample_names': '38 39 40'.split(), 'labels' : 'REG'}},
    'RIP':  {'filename': 'RIP (17, 18, 19).xlsx',
            'names'   : {'sample_names': '17 18 19'.split(), 'labels' : 'RIP'}},
    'RL':  {'filename': 'RL (26, 27, 28).xlsx',
            'names'   : {'sample_names': '26 27 28'.split(), 'labels' : 'RL'}},
    'ROT':  {'filename': 'ROT (20, 21, 22).xlsx',
            'names'   : {'sample_names': '20 21 22'.split(), 'labels' : 'ROT'}},
    'RU':  {'filename': 'RU (35, 36, 37).xlsx',
            'names'   : {'sample_names': '35 36 37'.split(), 'labels' : 'RU'}},
    'SYL':  {'filename': 'SYL (11, 12, 13).xlsx',
            'names'   : {'sample_names': '11 12 13'.split(), 'labels' : 'SYL'}},
    'TRI':  {'filename': 'TRI (32, 33, 34).xlsx',
            'names'   : {'sample_names': '32 33 34'.split(), 'labels' : 'TRI'}},
    # these are the new cultivars
#    'CFN':  {'filename': 'CFN (10713_1, 10713_2, 10713_3).xlsx',
#            'names'   : {'sample_names': '10713-1 10713-2 10713-3'.split(), 'labels' : 'CFN'}},
#    'CHT':  {'filename': 'CHT (13514_1, 13514_2, 13514_3).xlsx',
#            'names'   : {'sample_names': '13514-1 13514-2 13514-3'.split(), 'labels' : 'CHT'}},
#    'SB':  {'filename': 'SB (53211_1, 53211_2, 53211_3).xlsx',
#            'names'   : {'sample_names': '53211-1 53211-2 53211-3'.split(), 'labels' : 'SB'}},
}
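
Since the active entries follow a regular pattern (label, three sample numbers, and a file name built from both), the same structure could be generated instead of typed by hand. This is only a sketch: `build_metadata` and `replicates` are hypothetical names, and the file name pattern is inferred from the 11 entries above (it does not cover the commented-out cultivars).

```python
def build_metadata(replicates: dict) -> dict:
    """Build per-cultivar metadata from {label: [sample numbers]}."""
    return {
        label: {
            "filename": f"{label} ({', '.join(map(str, numbers))}).xlsx",
            "names": {
                "sample_names": [str(n) for n in numbers],
                "labels": label,
            },
        }
        for label, numbers in replicates.items()
    }


# Only three cultivars are shown here to keep the example short
replicates = {"CAN": [14, 15, 16], "CS": [29, 30, 31], "LAB": [8, 9, 10]}
toy_data = build_metadata(replicates)

# Quick sanity check: every cultivar should have 3 replicates
n_samples = sum(len(d["names"]["sample_names"]) for d in toy_data.values())
print(toy_data["CAN"]["filename"], n_samples)
```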

### Read spectra from Excel files

In [4]:
def read_vitis_data(filename, metadata):
    "Read spectra (sometimes multiple) from excel files."
    # header_row (defined with the metadata) is the row holding the column headers
    exp = read_data_from_xcel(filename, header=[header_row])
    label2assign = metadata['names']['labels']
    # For each sheet in the Excel file
    for sname in exp:
        dfs = exp[sname]
        # Name each spectrum after its sample and the index after what it holds
        for name, df in zip(metadata['names']['sample_names'], dfs):
            df.columns = [name]
            df.index.name = 'm/z'
        exp[sname] = [mtl.add_labels(df, labels=label2assign) for df in dfs]
    return exp

exp = read_vitis_data(Path(data_folder, data['CAN']['filename']), data['CAN'])
#exp # seems ok!

In [5]:
# Read all spectra from the data and store them in OrderedDict
all_spectra = OrderedDict()

for d, desc in data.items():
    fpath = Path(data_folder, desc['filename']) # Path to each data file
    sheets = read_vitis_data(fpath, desc)
    for sheet, spectra in sheets.items():
        print(f'Sheet {sheet} contains {len(spectra)} spectra')
        all_spectra[sheet] = spectra

Sheet CAN - NEGATIVO contains 3 spectra
Sheet CAN - POSITIVO contains 3 spectra
Sheet CS - NEGATIVO contains 3 spectra
Sheet CS - Positivo contains 3 spectra
Sheet LAB - NEGATIVO contains 3 spectra
Sheet LAB - POSITIVO contains 3 spectra
Sheet PN - NEGATIVO contains 3 spectra
Sheet PN - POSITIVO contains 3 spectra
Sheet REG - NEGATIVO contains 3 spectra
Sheet REG - POSITIVO contains 3 spectra
Sheet RIP - NEGATIVO contains 3 spectra
Sheet RIP - POSITIVO contains 3 spectra
Sheet RL - NEGATIVO contains 3 spectra
Sheet RL - POSITIVO contains 3 spectra
Sheet ROT - NEGATIVO contains 3 spectra
Sheet ROT - POSITIVO contains 3 spectra
Sheet RU - NEGATIVO contains 3 spectra
Sheet RU - POSITIVO contains 3 spectra
Sheet SYL - NEGATIVE contains 3 spectra
Sheet SYL - POSITIVE contains 3 spectra
Sheet TRI - NEGATIVO contains 3 spectra
Sheet TRI - POSITIVO contains 3 spectra


## Alignment and Filtering of peak lists

The alignment and filtering of the peak lists of the different spectra is performed using the `align` function from the metabolinks Python package.

Two different strategies for alignment and filtering can be used:

#### 1) Align all samples that make up the dataset simultaneously and filter peaks from there

This strategy aligns all samples simultaneously based on a ppm tolerance. After alignment, it removes features that appear in fewer than `min_samples` samples in the whole dataset. Because samples are aligned independently of the group they belong to, the alignment is not biased towards samples of the same group. For example, the alignment used for the Grapevine Datasets in the dissertation used this strategy with a 1 ppm tolerance and a minimum of 2 samples.

The nomenclature used for this type of alignment was: 'AlignAllSamples_ppmTol_MinSample_IonizationMode'.

For example, for the Negative GD used in the dissertation: 'all_1ppm_min2_neg'.
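
To make the idea concrete, here is a toy version of this strategy in plain pandas. It is not the metabolinks algorithm: `toy_align` is a hypothetical helper that uses a simple greedy grouping (a new group starts when a peak is more than `ppmtol` away from the first peak of the current group), and the spectra are made-up values.

```python
import pandas as pd


def toy_align(samples: dict, ppmtol: float = 1.0, min_samples: int = 1) -> pd.DataFrame:
    """Greedy toy alignment (NOT the metabolinks algorithm).

    `samples` maps a sample name to a {m/z: intensity} dict.
    """
    peaks = sorted(
        (mz, name, intensity)
        for name, spectrum in samples.items()
        for mz, intensity in spectrum.items()
    )
    groups, current = [], []
    for peak in peaks:
        # Close the group when the peak is too far from the group's first peak
        if current and (peak[0] - current[0][0]) / current[0][0] * 1e6 > ppmtol:
            groups.append(current)
            current = []
        current.append(peak)
    if current:
        groups.append(current)

    rows = {}
    for group in groups:
        mean_mz = sum(mz for mz, _, _ in group) / len(group)
        rows[mean_mz] = {name: intensity for _, name, intensity in group}
    table = pd.DataFrame.from_dict(rows, orient="index").reindex(columns=list(samples))
    table.index.name = "m/z"
    # Keep only features present in at least `min_samples` samples
    return table[table.notna().sum(axis=1) >= min_samples]


samples = {
    "14": {100.00000: 10.0, 200.00000: 5.0},
    "15": {100.00005: 12.0, 300.00000: 7.0},
    "29": {100.00008: 9.0, 200.00010: 4.0},
}
aligned = toy_align(samples, ppmtol=1.0, min_samples=2)
print(aligned)
```

With these values, the peak near m/z 300 is found in a single sample, so it is discarded when `min_samples = 2`.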

#### 2) Align the samples that belong to each biological group individually and filter peaks based on the number of times a feature appears in the samples of a group and then align again the data from the samples of all groups

This strategy first aligns the samples of each biological group separately and, in each of those alignments, filters out features that appear in fewer than `min_samples` samples of the group. Then, a global alignment of all the group alignments makes up the final dataset. This makes the samples of the same group artificially closer to each other than to other samples (relative to the pre-alignment samples).

The nomenclature used for this type of alignment was: 'AlignGroupSamples_ppmTol_MinSample_AlignAllSamples_ppmTol_IonizationMode'.

For example, for the Negative GD that uses this strategy in 'BinSim_Analysis_GD11_all2_groups2all1.ipynb' (Alignment 2-1): 'groups_1ppm_min2_all_1ppm_neg'.
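
The two naming schemes can be produced from the alignment parameters, which helps avoid typos when storing a new alignment. The helpers below are hypothetical (they are not part of metabolinks or of this notebook) and simply reproduce the key format described above.

```python
def _ppm(value: float) -> str:
    """Format 1.0 as '1ppm' and 2.5 as '2.5ppm'."""
    return f"{value:g}ppm"


def all_samples_key(ppmtol: float, min_samples: int, mode: str) -> str:
    """Key for Strategy 1, e.g. 'all_1ppm_min2_neg'."""
    return f"all_{_ppm(ppmtol)}_min{min_samples}_{mode}"


def groups_then_all_key(group_ppmtol: float, min_samples: int,
                        all_ppmtol: float, mode: str) -> str:
    """Key for Strategy 2, e.g. 'groups_1ppm_min2_all_1ppm_neg'."""
    return (f"groups_{_ppm(group_ppmtol)}_min{min_samples}"
            f"_all_{_ppm(all_ppmtol)}_{mode}")


print(all_samples_key(1.0, 2, "neg"))
print(groups_then_all_key(1.0, 2, 1.0, "neg"))
```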

#### An example of each strategy is shown below; its parameters can be changed to quickly produce a different alignment

##### The introduction to each example also lists the parameters used for the alignments (made with that strategy) that are present in the store 'alignments.h5'

In [6]:
# The different alignments made that are in the store
alignments.keys()
# Nomenclature (depending on the strategy):
# AlignAllSamples_ppmTol_MinSample_IonizationMode
# AlignGroupSamples_ppmTol_MinSample_AlignAllSamples_ppmTol_IonizationMode

['/all_1ppm_min13_neg',
 '/all_1ppm_min13_pos',
 '/all_1ppm_min2_neg',
 '/all_1ppm_min2_pos',
 '/all_1ppm_min6_neg',
 '/all_1ppm_min6_pos',
 '/groups_1ppm_min2_all_1ppm_neg',
 '/groups_1ppm_min2_all_1ppm_pos',
 '/groups_1ppm_min3_all_1ppm_neg',
 '/groups_1ppm_min3_all_1ppm_pos',
 '/groups_2ppm_min3_all_2ppm_neg',
 '/groups_2ppm_min3_all_2ppm_pos']

### Strategy 1: Align all samples that make up the dataset simultaneously and filter peaks from there (not aligning replicates first)

The alignments in the store that use this strategy have the following parameters:

- 'all_1ppm_min2_neg' and 'all_1ppm_min2_pos': ppmtol = 1.0, min_samples = 2.
- 'all_1ppm_min6_neg' and 'all_1ppm_min6_pos': ppmtol = 1.0, min_samples = 6.
- 'all_1ppm_min13_neg' and 'all_1ppm_min13_pos': ppmtol = 1.0, min_samples = 13.

min_samples here has a maximum of 33 (all samples of the dataset).

#### Example 'all_1ppm_min2_neg' and 'all_1ppm_min2_pos'

#### Separate negative from positive ionization modes data, putting all samples in the same list with the correct sample_names

In [7]:
posi = []
nega = []
for k, s in all_spectra.items():
    if k.upper()[-8:-1]=='POSITIV':
        #s[0].columns = [data[k.split()[0]]['names']['sample_names'][0]]
        #s[1].columns = [data[k.split()[0]]['names']['sample_names'][1]]
        #s[2].columns = [data[k.split()[0]]['names']['sample_names'][2]]
        posi.append(s[0])
        posi.append(s[1])
        posi.append(s[2])
    if k.upper()[-8:-1]=='NEGATIV':
        #s[0].columns = [data[k.split()[0]]['names']['sample_names'][0]]
        #s[1].columns = [data[k.split()[0]]['names']['sample_names'][1]]
        #s[2].columns = [data[k.split()[0]]['names']['sample_names'][2]]
        nega.append(s[0])
        nega.append(s[1])
        nega.append(s[2])
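
The slice `k.upper()[-8:-1]` works because every sheet name ends in 'POSITIVO', 'Positivo', 'POSITIVE' or the negative equivalents. A slightly more explicit alternative is sketched below; `ionization_mode` and `split_by_mode` are hypothetical helpers, and placeholder strings stand in for the spectra.

```python
def ionization_mode(sheet_name: str) -> str:
    """Return 'pos' or 'neg' for sheet names such as 'CS - Positivo'."""
    suffix = sheet_name.rsplit("-", 1)[-1].strip().upper()
    if suffix.startswith("POSITIV"):
        return "pos"
    if suffix.startswith("NEGATIV"):
        return "neg"
    raise ValueError(f"Cannot infer ionization mode from sheet name {sheet_name!r}")


def split_by_mode(spectra_by_sheet: dict) -> dict:
    """Flatten {sheet: [spectrum, ...]} into {'pos': [...], 'neg': [...]}."""
    by_mode = {"pos": [], "neg": []}
    for sheet, spectra in spectra_by_sheet.items():
        by_mode[ionization_mode(sheet)].extend(spectra)
    return by_mode


# Placeholder strings instead of real spectra
toy_spectra = {
    "CAN - NEGATIVO": ["n14", "n15", "n16"],
    "CS - Positivo": ["p29", "p30", "p31"],
    "SYL - NEGATIVE": ["n11", "n12", "n13"],
}
by_mode = split_by_mode(toy_spectra)
print(len(by_mode["pos"]), len(by_mode["neg"]))
```

Unlike the slice, this version raises an error for a sheet name it does not recognise instead of silently skipping it.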

#### Align globally the samples (for each mode).

In [8]:
ppmtol = 1.0
min_samples = 2 # 2 for the all_1ppm_min2_neg/pos, 6 for the all_1ppm_min6_neg/pos and 13 for the all_1ppm_min13_neg/pos

aligned_all_positive = align(posi, ppmtol, min_samples)
aligned_all_negative = align(nega, ppmtol, min_samples)

------ Aligning tables -------------
 Samples to align: [[('CAN', '14')], [('CAN', '15')], [('CAN', '16')], [('CS', '29')], [('CS', '30')], [('CS', '31')], [('LAB', '8')], [('LAB', '9')], [('LAB', '10')], [('PN', '23')], [('PN', '24')], [('PN', '25')], [('REG', '38')], [('REG', '39')], [('REG', '40')], [('RIP', '17')], [('RIP', '18')], [('RIP', '19')], [('RL', '26')], [('RL', '27')], [('RL', '28')], [('ROT', '20')], [('ROT', '21')], [('ROT', '22')], [('RU', '35')], [('RU', '36')], [('RU', '37')], [('SYL', '11')], [('SYL', '12')], [('SYL', '13')], [('TRI', '32')], [('TRI', '33')], [('TRI', '34')]]
- Extracting all features...
  Done, (total 62052 features in 33 samples)
- Grouping and joining...
  Done, 30660 groups found
Elapsed time: 00m 29.089s

- 23634 groups were discarded (#samples < 2)
Sample coverage of features
 2742 features in 2 samples
 1313 features in 3 samples
  671 features in 4 samples
  420 features in 5 samples
  328 features in 6 samples
  244 features in 7 samples
 

In [9]:
aligned_all_positive

label,CAN,CAN,CAN,CS,CS,CS,LAB,LAB,LAB,PN,...,ROT,RU,RU,RU,SYL,SYL,SYL,TRI,TRI,TRI
Unnamed: 0_level_1,14,15,16,29,30,31,8,9,10,23,...,22,35,36,37,11,12,13,32,33,34
96.999045,,,,,,,,,,,...,,,,,106869.0,,,,110856.0,
97.031130,,,,,,,,,,98169.0,...,,,,,,,,,,
97.071640,,,,,,,,108081.0,,,...,,,,,,110127.0,,,,
97.100765,,,,,,,,,,179818.0,...,,,,,,,,,,
97.177455,,,104165.0,146957.0,,166304.0,,,153067.0,275602.0,...,,430984.0,382831.0,712734.0,171313.0,113533.0,,,198609.0,496041.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
853.851470,,,,,,,,,,472207.0,...,,,,,,,,,,
863.742605,,,,,,,,,,,...,,,,161976.0,,,,268862.0,,269364.0
864.749430,,,,,,,,,,,...,,,,,,,,130892.0,,
885.372610,,,,,,,,,,973025.0,...,,,,,,,,,,


#### Storing Alignments into the hdf5store file (writing, using `put`)

In [10]:
# Put labels on the data (if needed/wanted)

# Right now, we don't want
#aligned_all_positive = mtl.add_labels(aligned_all_positive,
#                                      ['CAN','CS','LAB','PN','REG','RIP','RL','ROT','RU','SYL','TRI','CFN','CHT','SB'])
#aligned_all_negative = mtl.add_labels(aligned_all_negative,
#                                      ['CAN','CS','LAB','PN','REG','RIP','RL','ROT','RU','SYL','TRI','CFN','CHT','SB'])

In [11]:
# Take out the # to store
#alignments.put('all_1ppm_min2_pos', aligned_all_positive)
# Nomenclature: all samples at 1ppm with n min_samples

In [12]:
# Take out the # to store
#alignments.put('all_1ppm_min2_neg', aligned_all_negative)
# Nomenclature: all samples at 1ppm with n min_samples

#### Save alignments in CSV files

In [13]:
outdict = {'POSITIVE': aligned_all_positive, 'NEGATIVE': aligned_all_negative}

# Take out # to store in CSVs
#aligned_all_positive.to_csv('aligned_all_1ppm_min2_positive.csv', with_labels=True, sep=',')
#aligned_all_negative.to_csv('aligned_all_1ppm_min2_negative.csv', with_labels=True, sep=',')
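
For reference, a table with two column levels (label and sample) can also be written and read back with plain pandas. The sketch below uses the standard `DataFrame.to_csv` and `pandas.read_csv` (not the `with_labels` variant used above) on a small made-up table, with an in-memory buffer in place of a file.

```python
import io

import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("CAN", "14"), ("CAN", "15"), ("CS", "29")], names=["label", "sample"]
)
nan = float("nan")
toy = pd.DataFrame(
    [[nan, 104165.0, 146957.0], [98169.0, nan, nan]],
    index=pd.Index([97.177455, 97.03113], name="m/z"),
    columns=columns,
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Two header rows (label, sample) and the m/z values in the first column
restored = pd.read_csv(buffer, header=[0, 1], index_col=0)
print(restored)
```

Note that the sample names come back as strings, which matches how they are defined in the metadata.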

### Strategy 2: Align the samples that belong to each biological group individually (replicates) and filter peaks based on the number of times a feature appears in the samples of a group and then align again the data from the samples of all groups

The alignments in the store that use this strategy have the following parameters:

- 'groups_1ppm_min2_all_1ppm_neg' and 'groups_1ppm_min2_all_1ppm_pos': ppmtol = 1.0, min_samples = 2.
- 'groups_1ppm_min3_all_1ppm_neg' and 'groups_1ppm_min3_all_1ppm_pos': ppmtol = 1.0, min_samples = 3.
- 'groups_2ppm_min3_all_2ppm_neg' and 'groups_2ppm_min3_all_2ppm_pos': ppmtol = 2.0, min_samples = 3.

min_samples here has a maximum of 3 (number of samples per biological group).

For the 2nd (global) alignment, ppmtol is the same as in the 1st alignment and min_samples = 1.
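
The toy example below contrasts the two strategies on made-up spectra. It is not the metabolinks algorithm: `toy_align` is a hypothetical greedy grouping written only for illustration. The peak near m/z 250 appears once in each of two groups, so it survives the global filter of Strategy 1 but is removed by the per-group filter of Strategy 2.

```python
import pandas as pd


def toy_align(tables, ppmtol=1.0, min_samples=1):
    """Greedy toy alignment of m/z-indexed tables (NOT the metabolinks algorithm)."""
    tables = list(tables)
    rows = sorted(
        ((mz, row.dropna().to_dict()) for t in tables for mz, row in t.iterrows()),
        key=lambda item: item[0],
    )
    groups = []
    for mz, values in rows:
        # New group when the peak is more than `ppmtol` from the group's first peak
        if not groups or (mz - groups[-1]["first"]) / groups[-1]["first"] * 1e6 > ppmtol:
            groups.append({"first": mz, "mzs": [], "values": {}})
        groups[-1]["mzs"].append(mz)
        groups[-1]["values"].update(values)
    merged = pd.DataFrame.from_dict(
        {sum(g["mzs"]) / len(g["mzs"]): g["values"] for g in groups}, orient="index"
    )
    merged = merged.reindex(columns=[c for t in tables for c in t.columns])
    # Keep features detected in at least `min_samples` sample columns
    return merged[merged.notna().sum(axis=1) >= min_samples]


def spectrum(name, peaks):
    """One-column table of intensities indexed by m/z."""
    return pd.Series(peaks, name=name).to_frame()


groups_data = {
    "CAN": [spectrum("14", {100.0: 1.0, 250.0: 3.0}),
            spectrum("15", {100.00005: 2.0}),
            spectrum("16", {400.0: 1.0})],
    "CS": [spectrum("29", {250.0001: 4.0}),
           spectrum("30", {300.0: 5.0}),
           spectrum("31", {300.0002: 6.0})],
}

# Strategy 1: all samples at once, filter on the whole dataset
everything = [s for group in groups_data.values() for s in group]
strategy1 = toy_align(everything, ppmtol=1.0, min_samples=2)

# Strategy 2: filter within each group, then align the group tables (min_samples = 1)
per_group = [toy_align(g, ppmtol=1.0, min_samples=2) for g in groups_data.values()]
strategy2 = toy_align(per_group, ppmtol=1.0, min_samples=1)

print(len(strategy1), len(strategy2))
```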

#### Example 'groups_1ppm_min2_all_1ppm_neg' and 'groups_1ppm_min2_all_1ppm_pos'

#### Align for each mode and cultivar (keep if peak appears in at least 2 samples - min_samples = 2)

In [14]:
ppmtol = 1.0 # 2.0 for the groups_2ppm_min3_all_2ppm_neg/pos; 1 for the rest.
min_samples = 2 # 2 for the groups_1ppm_min2_all_1ppm_neg/pos, 3 for the groups_1ppm_min3_all_1ppm_neg/pos and 
                # groups_2ppm_min3_all_2ppm_neg/pos

aligned = {}
for k, s in all_spectra.items():
    print('=======================================')
    print(k)
    # print(s)
    aligned[k]  = align(s, ppmtol, min_samples)

CAN - NEGATIVO
------ Aligning tables -------------
 Samples to align: [[('CAN', '14')], [('CAN', '15')], [('CAN', '16')]]
- Extracting all features...
  Done, (total 1686 features in 3 samples)
- Grouping and joining...
  Done, 1014 groups found
Elapsed time: 00m 00.685s

- 539 groups were discarded (#samples < 2)
Sample coverage of features
  278 features in 2 samples
  197 features in 3 samples
m/z range (ppm) distribution
  [0.0,0.1[ : 17
  [0.1,0.2[ : 18
  [0.2,0.3[ : 25
  [0.3,0.4[ : 44
  [0.4,0.5[ : 75
  [0.5,0.6[ : 121
  [0.6,0.7[ : 86
  [0.7,0.8[ : 29
  [0.8,0.9[ : 27
  [0.9,1.0[ : 25
Peaks with m/z range in excess of tolerance
     # features   mean m/z  m/z range (ppm)
123           3  243.18020         1.233654
178           3  256.23689         1.014686
183           3  269.24925         1.002789
513           3  477.06803         1.131915
753           3  557.27065         1.040787
765           3  564.50908         1.045156
823           3  592.56747         1.130674
976

  Done, 754 groups found
Elapsed time: 00m 00.562s

- 342 groups were discarded (#samples < 2)
Sample coverage of features
  111 features in 2 samples
  301 features in 3 samples
m/z range (ppm) distribution
  [0.0,0.1[ : 62
  [0.1,0.2[ : 93
  [0.2,0.3[ : 67
  [0.3,0.4[ : 50
  [0.4,0.5[ : 36
  [0.5,0.6[ : 23
  [0.6,0.7[ : 18
  [0.7,0.8[ : 23
  [0.8,0.9[ : 15
  [0.9,1.0[ : 14
Peaks with m/z range in excess of tolerance
     # features   mean m/z  m/z range (ppm)
95            3  241.01233         1.120275
165           3  272.18686         1.175664
296           3  315.10937         1.110726
304           3  319.08297         1.159574
373           3  343.10296         1.020102
465           3  395.19279         1.012165
466           3  395.27592         1.163745
473           3  397.22624         1.258729
489           3  404.31045         1.162474
708           3  557.26871         1.058736
731           3  610.14751         1.065317
PN - POSITIVO
------ Aligning tables -------------

  Done, 1261 groups found
Elapsed time: 00m 00.881s

- 606 groups were discarded (#samples < 2)
Sample coverage of features
  274 features in 2 samples
  381 features in 3 samples
m/z range (ppm) distribution
  [0.0,0.1[ : 43
  [0.1,0.2[ : 125
  [0.2,0.3[ : 68
  [0.3,0.4[ : 55
  [0.4,0.5[ : 37
  [0.5,0.6[ : 44
  [0.6,0.7[ : 41
  [0.7,0.8[ : 67
  [0.8,0.9[ : 109
  [0.9,1.0[ : 40
Peaks with m/z range in excess of tolerance
      # features   mean m/z  m/z range (ppm)
20             3   99.65838         1.003428
21             3   99.65891         1.103765
22             3   99.65938         1.003418
56             3  112.14478         1.159216
166            3  170.83528         1.112183
213            3  238.53274         1.173844
335            3  266.15237         1.239892
407            3  282.40990         1.026877
501            3  282.70517         1.025804
522            3  283.76563         1.092452
536            3  293.12513         1.023454
549            3  299.26009        

  Done, 5436 groups found
Elapsed time: 00m 03.493s

- 3715 groups were discarded (#samples < 2)
Sample coverage of features
  824 features in 2 samples
  897 features in 3 samples
m/z range (ppm) distribution
  [0.0,0.1[ : 186
  [0.1,0.2[ : 229
  [0.2,0.3[ : 210
  [0.3,0.4[ : 204
  [0.4,0.5[ : 172
  [0.5,0.6[ : 170
  [0.6,0.7[ : 150
  [0.7,0.8[ : 132
  [0.8,0.9[ : 109
  [0.9,1.0[ : 95
Peaks with m/z range in excess of tolerance
      # features   mean m/z  m/z range (ppm)
217            3  105.00071         1.238088
259            3  106.90305         1.028970
508            3  117.66529         1.189816
927            3  139.17930         1.005897
953            3  140.76415         1.136654
...          ...        ...              ...
4598           3  547.64328         1.259945
4940           3  626.47591         1.149287
5154           3  683.74696         1.096898
5223           3  703.65177         1.009023
5378           3  835.70899         1.052999

[64 rows x 3 columns]
SYL 

#### Separate negative from positive ionization modes data

In [15]:
aligned_pos = {name : value for name,value in aligned.items() if name.upper()[-8:-1]=='POSITIV'}
aligned_neg = {name : value for name,value in aligned.items() if name.upper()[-8:-1]=='NEGATIV'}

#save_aligned_to_excel('aligned_cultivars_positive_1ppm_min2.xlsx', aligned_pos)
#save_aligned_to_excel('aligned_cultivars_negative_1ppm_min2.xlsx', aligned_neg)

#### Align globally the previously obtained alignments (for each mode).

In [16]:
ppmtol = 1.0 #2.0 for the groups_2ppm_min3_all_2ppm_neg/pos, 1 for the rest.
min_samples = 1 # Now it has to be 1, no extra filtering
positive = aligned_pos.values()
negative = aligned_neg.values()

aligned_all_pos = align(positive, ppmtol=ppmtol, min_samples=min_samples)
aligned_all_neg = align(negative, ppmtol=ppmtol, min_samples=min_samples)

------ Aligning tables -------------
 Samples to align: [[('CAN', '14'), ('CAN', '15'), ('CAN', '16')], [('CS', '29'), ('CS', '30'), ('CS', '31')], [('LAB', '8'), ('LAB', '9'), ('LAB', '10')], [('PN', '23'), ('PN', '24'), ('PN', '25')], [('REG', '38'), ('REG', '39'), ('REG', '40')], [('RIP', '17'), ('RIP', '18'), ('RIP', '19')], [('RL', '26'), ('RL', '27'), ('RL', '28')], [('ROT', '20'), ('ROT', '21'), ('ROT', '22')], [('RU', '35'), ('RU', '36'), ('RU', '37')], [('SYL', '11'), ('SYL', '12'), ('SYL', '13')], [('TRI', '32'), ('TRI', '33'), ('TRI', '34')]]
- Extracting all features...
  Done, (total 10828 features in 11 samples)
- Grouping and joining...
  Done, 4565 groups found
Elapsed time: 00m 03.759s

Sample coverage of features
 2574 features in 1 samples
  711 features in 2 samples
  422 features in 3 samples
  259 features in 4 samples
  140 features in 5 samples
  118 features in 6 samples
   79 features in 7 samples
   53 features in 8 samples
   51 features in 9 samples
   52 f

#### Storing alignments in the HDF5 store file and reading them back (using `put` and `get`)

Other functions are `df.to_hdf(store)` and `store.append(key, df)`

In [17]:
# Take out # to store
#alignments.put('groups_1ppm_min2_all_1ppm_pos', aligned_all_pos)
# Nomenclature: first groups at 1ppm then all at 1ppm

In [18]:
# Take out # to store
#alignments.put('groups_1ppm_min2_all_1ppm_neg', aligned_all_neg)
# Nomenclature: first groups at 1ppm then all at 1ppm

In [19]:
#alignments.keys()
# it seems to work
bigalignment = alignments.get('groups_1ppm_min2_all_1ppm_neg')
bigalignment.info()

<class 'pandas.core.frame.DataFrame'>
Float64Index: 3369 entries, 97.58869 to 977.1140350000001
Data columns (total 42 columns):
(CAN, 14)         442 non-null float64
(CAN, 15)         319 non-null float64
(CAN, 16)         378 non-null float64
(CS, 29)          674 non-null float64
(CS, 30)          677 non-null float64
(CS, 31)          628 non-null float64
(LAB, 8)          370 non-null float64
(LAB, 9)          402 non-null float64
(LAB, 10)         410 non-null float64
(PN, 23)          350 non-null float64
(PN, 24)          376 non-null float64
(PN, 25)          388 non-null float64
(REG, 38)         594 non-null float64
(REG, 39)         896 non-null float64
(REG, 40)         915 non-null float64
(RIP, 17)         462 non-null float64
(RIP, 18)         446 non-null float64
(RIP, 19)         469 non-null float64
(RL, 26)          586 non-null float64
(RL, 27)          487 non-null float64
(RL, 28)          592 non-null float64
(ROT, 20)         469 non-null float64
(ROT, 21)    

#### Save alignments in CSV files

In [20]:
outdict = {'POSITIVE': aligned_all_pos, 'NEGATIVE': aligned_all_neg}

# Take out # to store in CSVs
#aligned_all_pos.to_csv('aligned_1ppm_min2_1ppm_positive.csv', with_labels=True, sep=',')
#aligned_all_neg.to_csv('aligned_1ppm_min2_1ppm_negative.csv', with_labels=True, sep=',')

### Another Example 'groups_1ppm_min3_all_1ppm_neg' and 'groups_1ppm_min3_all_1ppm_pos'

#### Repeat the alignments, this time requiring a peak to be present in all replicates of each label

In [21]:
ppmtol = 1.0
min_samples = 3

aligned = {}
for k, s in all_spectra.items():
    print('=======================================')
    print(k)
    aligned[k]  = align(s, ppmtol, min_samples)

CAN - NEGATIVO
------ Aligning tables -------------
 Samples to align: [[('CAN', '14')], [('CAN', '15')], [('CAN', '16')]]
- Extracting all features...
  Done, (total 1686 features in 3 samples)
- Grouping and joining...
  Done, 1014 groups found
Elapsed time: 00m 00.768s

- 817 groups were discarded (#samples < 3)
Sample coverage of features
  197 features in 3 samples
m/z range (ppm) distribution
  [0.0,0.1[ : 1
  [0.1,0.2[ : 2
  [0.2,0.3[ : 6
  [0.3,0.4[ : 11
  [0.4,0.5[ : 47
  [0.5,0.6[ : 54
  [0.6,0.7[ : 37
  [0.7,0.8[ : 8
  [0.8,0.9[ : 12
  [0.9,1.0[ : 11
Peaks with m/z range in excess of tolerance
     # features   mean m/z  m/z range (ppm)
123           3  243.18020         1.233654
178           3  256.23689         1.014686
183           3  269.24925         1.002789
513           3  477.06803         1.131915
753           3  557.27065         1.040787
765           3  564.50908         1.045156
823           3  592.56747         1.130674
976           3  771.25059         1

  Done, 754 groups found
Elapsed time: 00m 00.541s

- 453 groups were discarded (#samples < 3)
Sample coverage of features
  301 features in 3 samples
m/z range (ppm) distribution
  [0.0,0.1[ : 46
  [0.1,0.2[ : 76
  [0.2,0.3[ : 51
  [0.3,0.4[ : 37
  [0.4,0.5[ : 25
  [0.5,0.6[ : 12
  [0.6,0.7[ : 12
  [0.7,0.8[ : 15
  [0.8,0.9[ : 11
  [0.9,1.0[ : 5
Peaks with m/z range in excess of tolerance
     # features   mean m/z  m/z range (ppm)
95            3  241.01233         1.120275
165           3  272.18686         1.175664
296           3  315.10937         1.110726
304           3  319.08297         1.159574
373           3  343.10296         1.020102
465           3  395.19279         1.012165
466           3  395.27592         1.163745
473           3  397.22624         1.258729
489           3  404.31045         1.162474
708           3  557.26871         1.058736
731           3  610.14751         1.065317
PN - POSITIVO
------ Aligning tables -------------
 Samples to align: [[('PN', 

  Done, 1261 groups found
Elapsed time: 00m 00.928s

- 880 groups were discarded (#samples < 3)
Sample coverage of features
  381 features in 3 samples
m/z range (ppm) distribution
  [0.0,0.1[ : 8
  [0.1,0.2[ : 21
  [0.2,0.3[ : 21
  [0.3,0.4[ : 36
  [0.4,0.5[ : 26
  [0.5,0.6[ : 23
  [0.6,0.7[ : 23
  [0.7,0.8[ : 57
  [0.8,0.9[ : 102
  [0.9,1.0[ : 38
Peaks with m/z range in excess of tolerance
      # features   mean m/z  m/z range (ppm)
20             3   99.65838         1.003428
21             3   99.65891         1.103765
22             3   99.65938         1.003418
56             3  112.14478         1.159216
166            3  170.83528         1.112183
213            3  238.53274         1.173844
335            3  266.15237         1.239892
407            3  282.40990         1.026877
501            3  282.70517         1.025804
522            3  283.76563         1.092452
536            3  293.12513         1.023454
549            3  299.26009         1.002473
560            3  30

  Done, 5436 groups found
Elapsed time: 00m 03.607s

- 4539 groups were discarded (#samples < 3)
Sample coverage of features
  897 features in 3 samples
m/z range (ppm) distribution
  [0.0,0.1[ : 73
  [0.1,0.2[ : 122
  [0.2,0.3[ : 102
  [0.3,0.4[ : 116
  [0.4,0.5[ : 90
  [0.5,0.6[ : 95
  [0.6,0.7[ : 76
  [0.7,0.8[ : 59
  [0.8,0.9[ : 49
  [0.9,1.0[ : 51
Peaks with m/z range in excess of tolerance
      # features   mean m/z  m/z range (ppm)
217            3  105.00071         1.238088
259            3  106.90305         1.028970
508            3  117.66529         1.189816
927            3  139.17930         1.005897
953            3  140.76415         1.136654
...          ...        ...              ...
4598           3  547.64328         1.259945
4940           3  626.47591         1.149287
5154           3  683.74696         1.096898
5223           3  703.65177         1.009023
5378           3  835.70899         1.052999

[64 rows x 3 columns]
SYL - NEGATIVE
------ Aligning tables 

#### Separate negative from positive ionization modes data

In [22]:
aligned_pos = {name : value for name,value in aligned.items() if name.upper()[-8:-1]=='POSITIV'}
aligned_neg = {name : value for name,value in aligned.items() if name.upper()[-8:-1]=='NEGATIV'}

#save_aligned_to_excel('aligned_cultivars_positive_1ppm_min3.xlsx', aligned_pos)
#save_aligned_to_excel('aligned_cultivars_negative_1ppm_min3.xlsx', aligned_neg)

#### Align globally the previously obtained alignments (for each mode).

In [23]:
ppmtol = 1.0
min_samples = 1 # Now it has to be 1
positive = aligned_pos.values()
negative = aligned_neg.values()

aligned_all_pos = align(positive, ppmtol=ppmtol, min_samples=min_samples)
aligned_all_neg = align(negative, ppmtol=ppmtol, min_samples=min_samples)

------ Aligning tables -------------
 Samples to align: [[('CAN', '14'), ('CAN', '15'), ('CAN', '16')], [('CS', '29'), ('CS', '30'), ('CS', '31')], [('LAB', '8'), ('LAB', '9'), ('LAB', '10')], [('PN', '23'), ('PN', '24'), ('PN', '25')], [('REG', '38'), ('REG', '39'), ('REG', '40')], [('RIP', '17'), ('RIP', '18'), ('RIP', '19')], [('RL', '26'), ('RL', '27'), ('RL', '28')], [('ROT', '20'), ('ROT', '21'), ('ROT', '22')], [('RU', '35'), ('RU', '36'), ('RU', '37')], [('SYL', '11'), ('SYL', '12'), ('SYL', '13')], [('TRI', '32'), ('TRI', '33'), ('TRI', '34')]]
- Extracting all features...
  Done, (total 5549 features in 11 samples)
- Grouping and joining...
  Done, 2079 groups found
Elapsed time: 00m 01.904s

Sample coverage of features
 1071 features in 1 samples
  300 features in 2 samples
  204 features in 3 samples
  151 features in 4 samples
   80 features in 5 samples
   70 features in 6 samples
   36 features in 7 samples
   55 features in 8 samples
   26 features in 9 samples
   30 fe

#### Save alignments in CSV files

In [24]:
outdict = {'POSITIVE': aligned_all_pos, 'NEGATIVE': aligned_all_neg}

# Take out the # to store in CSVs
#aligned_all_pos.to_csv('aligned_1ppm_min3_1ppm_positive.csv', with_labels=True, sep=',')
#aligned_all_neg.to_csv('aligned_1ppm_min3_1ppm_negative.csv', with_labels=True, sep=',')