# Automated Multiple Reaction Monitoring (MRM)-profiling and Ozone Electrospray Ionization (OzESI)-MRM Informatics Platform for High-throughput Lipidomics


In this Jupyter notebook you will automate lipidome data analysis. Doing this manually is challenging because lipids are structurally diverse and have many potential isomers. The notebook analyzes mzML files containing lipid MRM data acquired with ozone off and with ozone on. The goal is to identify possible double-bond locations in a lipid, in this case a triacylglycerol (TAG).

In [40]:
from IPython.display import Image

![title](Figures/agilent_lcms.png)

The examples shown here were run on an Agilent 6495C Triple Quadrupole LC/MS (shown above) that has been connected to an ozone line (not shown in the picture) for ozonolysis of lipids.

![title](Figures/TAG_example.png)
Here is an example of a TAG. Notice how many possible locations there are for even a single double bond, and how convoluted the analysis can become! This image was obtained from LipidMaps.org.

Import all necessary libraries

In [1]:
#Import all the necessary libraries
import pymzml
import csv
import os
import pandas as pd
import numpy as np
import math
from matplotlib import pyplot as plt
import re
import plotly.express as px

No module named 'ms_deisotope._c.averagine' averagine
No module named 'ms_deisotope._c.scoring'
No module named 'ms_deisotope._c.deconvoluter_base'
No module named 'ms_deisotope._c.deconvoluter_base'
No module named 'ms_deisotope._c.deconvoluter_base'


In [3]:
#Make a list of all folder names in data_mzml
folder_names_data_mzml = os.listdir('data_mzml')
print(folder_names_data_mzml)
#Make a list of all folder names in data_mzml/03-03-23
folder_name = os.listdir('data_mzml/03-03-23')
print(folder_name)


['fly_03_08_23', '03-13-23', '03-03-23', 'liver_LD', 'fly_1-11-23', 'old']
['all']


In [4]:
# Load the Excel file into a Pandas ExcelFile object
excel_file = pd.ExcelFile('SUPPLE_2.XLS')

# Create an empty list to store the dataframes
dfs = []

# Loop through each sheet in the Excel file and create a DataFrame for each sheet
for sheet_name in excel_file.sheet_names:
    # Load the sheet into a DataFrame
    df = pd.read_excel(excel_file, sheet_name=sheet_name)
    
    # Add a column with the sheet name to the DataFrame
    df['Class'] = sheet_name
    
    # Add the DataFrame to the list
    dfs.append(df)

# Concatenate the list of DataFrames into a single DataFrame
merged_df = pd.concat(dfs, ignore_index=True)

# Print the merged DataFrame
print(merged_df)
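The per-sheet loop above can also be written as a single `pd.concat` over a dictionary of DataFrames, which is the structure `pd.read_excel(..., sheet_name=None)` returns. The following is a minimal sketch that uses two small in-memory frames in place of the workbook; the sheet names and rows are illustrative assumptions, not the contents of SUPPLE_2.XLS.

```python
import pandas as pd

# Stand-in for pd.read_excel('SUPPLE_2.XLS', sheet_name=None), which returns
# a dict mapping sheet name -> DataFrame. Sheet names and rows are assumed.
sheets = {
    "PC": pd.DataFrame({"Compound Name": ["LPC(2:0)", "LPC(3:1)"],
                        "Parent Ion": [300.121225, 312.121225]}),
    "TAG": pd.DataFrame({"Compound Name": ["STD_15:0-18:1(d7)-15:0 TAG"],
                         "Parent Ion": [829.8]}),
}

# The dict keys become an index level named 'Class'. The first reset_index
# turns that level into an ordinary column; the second renumbers the rows.
merged = (
    pd.concat(sheets, names=["Class", "row"])
      .reset_index(level="Class")
      .reset_index(drop=True)
)
print(merged)
```

This produces the same kind of `Class` column as the loop, without building the list of DataFrames by hand.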


      Compound Group                    Compound Name  ISTD?  Parent Ion  \
0                NaN                         LPC(2:0)  False  300.121225   
1                NaN                         LPC(3:1)  False  312.121225   
2                NaN               LPC(3:0),PC(O-3:0)  False  314.136825   
3                NaN     LPC(4:0),PC(O-4:0),PC(O-5:0)  False  328.188925   
4                NaN                 PC(4:0),LPC(5:0)  False  342.168125   
...              ...                              ...    ...         ...   
3264             NaN  STD_15:0-18:1(d7) PI (NH4 Salt)  False  847.600000   
3265             NaN   STD_15:0-18:1(d7) PS (Na Salt)  False  755.530000   
3266             NaN       STD_15:0-18:1(d7)-15:0 TAG  False  829.800000   
3267             NaN          STD_18:1(d7) Chol Ester  False  675.640000   
3268             NaN            STD_d18:1-18:1(d9) SM  False  738.640000   

     MS1 Res  Product Ion MS2 Res  Dwell  Fragmentor  Collision Energy  \
0       Unit 

MAKE CLASSES FOR EACH LIPID

In [5]:
lipid_types = ["CE","TAG","CER","FFA","PC","PE","PG","PI","SM","AC"]

#Read every sheet in SUPPLE_2.XLS and keep Compound Name, Parent Ion, and Product Ion
mrm_list_new = pd.read_excel('SUPPLE_2.XLS', sheet_name = None)
mrm_list_new = pd.concat(mrm_list_new, ignore_index=True)
#.copy() makes an independent DataFrame, so the column assignments below do not raise SettingWithCopyWarning
mrm_list_offical = mrm_list_new[['Compound Name', 'Parent Ion', 'Product Ion']].copy()
#Replace spaces in the column names with underscores
mrm_list_offical.columns = mrm_list_offical.columns.str.replace(' ', '_')
#Round Parent Ion and Product Ion to 1 decimal place, then floor to the integer m/z
mrm_list_offical['Parent_Ion'] = np.floor(mrm_list_offical['Parent_Ion'].round(1))
mrm_list_offical['Product_Ion'] = np.floor(mrm_list_offical['Product_Ion'].round(1))
#Create the Transition column by joining Parent Ion and Product Ion with an arrow
mrm_list_offical['Transition'] = mrm_list_offical['Parent_Ion'].astype(str) + ' -> ' + mrm_list_offical['Product_Ion'].astype(str)
#Rename the Compound_Name column to Lipid
mrm_list_offical = mrm_list_offical.rename(columns={'Compound_Name': 'Lipid'})

#Class is the first run of letters in the lipid name
mrm_list_offical['Class'] = mrm_list_offical['Lipid'].str.extract('([a-zA-Z]+)')
#If Class is O or omega, change it to Cer
mrm_list_offical['Class'] = mrm_list_offical['Class'].replace(['O','omega'], 'Cer')
#If Class is TG, change it to TAG
mrm_list_offical['Class'] = mrm_list_offical['Class'].replace('TG', 'TAG')
#For internal standards (STD), use the full compound name as the class
mrm_list_offical['Class'] = np.where(mrm_list_offical['Lipid'].str.contains('STD'), mrm_list_offical['Lipid'], mrm_list_offical['Class'])
#Show every row when printing
pd.set_option('display.max_rows', None)
print(mrm_list_offical.head(None))

                                                  Lipid  Parent_Ion  \
0                                              LPC(2:0)       300.0   
1                                              LPC(3:1)       312.0   
2                                    LPC(3:0),PC(O-3:0)       314.0   
3                          LPC(4:0),PC(O-4:0),PC(O-5:0)       328.0   
4                                      PC(4:0),LPC(5:0)       342.0   
5                                              LPC(6:0)       356.0   
6                                      PC(6:0),LPC(7:0)       370.0   
7                                              LPC(8:0)       384.0   
8                                           LPC(O-10:1)       396.0   
9                                      PC(8:0),LPC(9:0)       398.0   
10                                 LPC(10:0),PC(O-10:0)       412.0   
11                                   PC(10:0),LPC(11:0)       426.0   
12                                           PC(O-12:1)       438.0   
13    
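The transition key used for matching is built by rounding each m/z to one decimal place and then flooring it to an integer. A minimal sketch of that step follows, using two rows from the printed table; the product-ion values are assumed for illustration. It also shows a side effect of rounding before flooring: a value just below an integer is pushed up to that integer.

```python
import numpy as np
import pandas as pd

mrm = pd.DataFrame({
    "Lipid": ["LPC(2:0)", "LPC(3:1)"],
    "Parent_Ion": [300.121225, 312.121225],
    "Product_Ion": [184.1, 184.1],  # assumed values, for illustration only
})

# Work on an independent copy so that column assignment is unambiguous.
keys = mrm[["Lipid", "Parent_Ion", "Product_Ion"]].copy()
for col in ["Parent_Ion", "Product_Ion"]:
    keys[col] = np.floor(keys[col].round(1))
keys["Transition"] = (
    keys["Parent_Ion"].astype(str) + " -> " + keys["Product_Ion"].astype(str)
)
print(keys["Transition"].tolist())

# Rounding first means 299.96 becomes 300.0 before the floor, while 299.94
# becomes 299.9 and floors to 299.0.
edge = [float(np.floor(round(x, 1))) for x in (299.94, 299.96)]
print(edge)
```

Because the same rule is applied to the measured Q1 and Q3 values later in the notebook, both sides of the comparison end up on the same unit-resolution grid.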

In [6]:
list_of_lipid_classes = mrm_list_offical['Class'].unique()
print(list_of_lipid_classes)

['LPC' 'PC' 'LPE' 'PE' 'SM' 'Cer' 'CerP' 'CAR' 'TAG' 'DG' 'LPS' 'PS' 'LPI'
 'PI' 'LPG' 'PG' 'CE' 'FA' 'STD 15:0-18:1-d7 DG' 'STD 18:1 (d7) Lyso PC'
 'STD 18:1 (d7) Lyso PE' 'STD 18:1(d7) MAG' 'STD C15 ceramide-D7'
 'STD_15:0-18:1(d7) PC' 'STD_15:0-18:1(d7) PE'
 'STD_15:0-18:1(d7) PG (Na Salt)' 'STD_15:0-18:1(d7) PI (NH4 Salt)'
 'STD_15:0-18:1(d7) PS (Na Salt)' 'STD_15:0-18:1(d7)-15:0 TAG'
 'STD_18:1(d7) Chol Ester' 'STD_d18:1-18:1(d9) SM']
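The class labels printed above come from taking the first run of letters in each lipid name and then applying a few renaming rules. The sketch below replays those rules on a handful of names; `TG(48:1)` is a hypothetical name added only to exercise the TG to TAG rule.

```python
import numpy as np
import pandas as pd

lipids = pd.Series([
    "LPC(2:0)",
    "PC(4:0),LPC(5:0)",
    "CE(12:0)",
    "TG(48:1)",                 # hypothetical name, for illustration only
    "STD_18:1(d7) Chol Ester",
])

# First run of ASCII letters in the name, e.g. 'LPC(2:0)' -> 'LPC'.
cls = lipids.str.extract(r"([a-zA-Z]+)", expand=False)
# Same renaming rules as the notebook cell.
cls = cls.replace(["O", "omega"], "Cer").replace("TG", "TAG")
# Internal standards keep their full name as the class.
cls = pd.Series(np.where(lipids.str.contains("STD"), lipids, cls))
print(cls.tolist())
```

Only the first run of letters is used, so a row that lists several candidates, such as `PC(4:0),LPC(5:0)`, is assigned the class of the first lipid named.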


Load each mzML file and convert it to a pandas DataFrame and a CSV file.
Columns: Q1 (stored as Parent_Ion), Q3 (stored as Product_Ion), Intensity, Transition, Lipid, Class.
The parsed data is also stored as a CSV file in data_csv.

In [7]:
#Load every mzML file in the data folder with the pymzml Reader and collect one row per MRM transition in a single pandas dataframe
path_to_mzml_files = './data_mzml/03-13-23/all/' #Path to the mzml files
data_folder = os.listdir(path_to_mzml_files) #File names in that folder
data_dict = {} #Empty dictionary to store all the data

df_all = pd.DataFrame(columns=['Parent_Ion','Product_Ion','Intensity','Transition','Lipid','Class']) #Empty dataframe; Lipid and Class are filled in later
for file in data_folder:
    if file.endswith('.mzML'):
        print(file)
        run = pymzml.run.Reader(path_to_mzml_files+file, skip_chromatogram=False) #Load the mzml file into the run object
        print('Spectrum # = ',run.get_spectrum_count())
        print('Chromatogram # =',run.get_chromatogram_count())
        q1_mz = 0 #Q1 and Q3 m/z of the current transition
        q3_mz = 0
        #Row index in df_all. It restarts at 0 for every file, so rows written for an earlier file are overwritten;
        #move this line above the file loop to keep the transitions from every file
        count = 0
        for spectrum in run:
            #The ID is a space-separated string that contains Q1=<m/z> and Q3=<m/z> tokens
            for element in spectrum.ID.split(' '):
                if 'Q1' in element:
                    #Round to 1 decimal place, then floor to the integer m/z
                    q1_mz = np.floor(round(float(element.split('=')[1]),1))
                if 'Q3' in element:
                    q3_mz = np.floor(round(float(element.split('=')[1]),1))
                    #Sum the intensity of every data point returned by peaks()
                    intensity_sum = np.sum([intensity for _, intensity in spectrum.peaks()])
                    df_all.loc[count,'Parent_Ion'] = q1_mz #Store the Q1 and Q3 m/z values in the pandas dataframe
                    df_all.loc[count,'Product_Ion'] = q3_mz
                    df_all.loc[count,'Intensity'] = intensity_sum #Store the summed intensity
                    df_all.loc[count,'Transition'] = str(q1_mz)+ ' -> '+ str(q3_mz) #Store the transition label
                    count+=1

CER_D12-Sup_2-1_N1_03132023.mzML
Spectrum # =  None
Chromatogram # = 84
TAG1_IPA_03132023-r001.mzML
Spectrum # =  None
Chromatogram # = 199
TAG1_D1-Sup_4-3_N1_03132023.mzML
Spectrum # =  None
Chromatogram # = 199
PG_D1-Sup_2-2_N1_03132023.mzML
Spectrum # =  None
Chromatogram # = 185
PS_IPA_03132023-r003.mzML
Spectrum # =  None
Chromatogram # = 185
AC_D1-Sup_2-2_N1_03132023.mzML
Spectrum # =  None
Chromatogram # = 67
PC_IPA_03132023-r005.mzML
Spectrum # =  None
Chromatogram # = 205
CL_IPA_03132023-r001.mzML
Spectrum # =  None
Chromatogram # = 247
CER_NoInject_03132023-r002.mzML
Spectrum # =  None
Chromatogram # = 84
TAG2_NoInject_03132023-r002.mzML
Spectrum # =  None
Chromatogram # = 192
FFA_D1-Sup_4-2_N1_03132023.mzML
Spectrum # =  None
Chromatogram # = 39
PG_IPA_03132023-r004.mzML
Spectrum # =  None
Chromatogram # = 185
FFA_IPA_03132023-r005.mzML
Spectrum # =  None
Chromatogram # = 39
PG_IPA_03132023-r001.mzML
Spectrum # =  None
Chromatogram # = 185
TAG2_Blank_N2_03132023.mzML
Spectru
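The parsing step in the cell above relies on each entry's ID containing `Q1=<m/z>` and `Q3=<m/z>` tokens separated by spaces. A small standalone sketch of that parsing follows. The helper name `parse_transition` and the example ID string are assumptions made for illustration; the exact ID text written by the instrument software may differ.

```python
import math
import re


def parse_transition(chrom_id):
    """Return (q1, q3) floored to unit m/z, or None if either token is missing."""
    values = {}
    for token in chrom_id.split(" "):
        match = re.fullmatch(r"(Q[13])=([0-9.]+)", token)
        if match:
            # Same rule as the notebook: round to 1 decimal, then floor.
            values[match.group(1)] = float(math.floor(round(float(match.group(2)), 1)))
    if "Q1" in values and "Q3" in values:
        return values["Q1"], values["Q3"]
    return None


# Assumed ID format, for illustration only.
example_id = "SRM SIC Q1=586.5 Q3=369.3 start=0.0 end=1.0"
print(parse_transition(example_id))
print(parse_transition("TIC"))
```

Returning `None` for IDs without both tokens makes it easy to skip entries such as a total ion chromatogram instead of recording them as a transition.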

In [8]:
df_all.head(None)

Unnamed: 0,Parent_Ion,Product_Ion,Intensity,Transition,Lipid,Class
0,586.0,369.0,122099.448273,586.0 -> 369.0,,
1,586.0,369.0,118825.468719,586.0 -> 369.0,,
2,612.0,369.0,125085.789581,612.0 -> 369.0,,
3,614.0,369.0,119177.929077,614.0 -> 369.0,,
4,626.0,369.0,463736.451416,626.0 -> 369.0,,
5,628.0,369.0,445456.411865,628.0 -> 369.0,,
6,636.0,369.0,673411.312256,636.0 -> 369.0,,
7,638.0,369.0,717381.271973,638.0 -> 369.0,,
8,640.0,369.0,734453.010498,640.0 -> 369.0,,
9,640.0,369.0,739935.614014,640.0 -> 369.0,,


Match each measured transition against the MRM transition list loaded above. This list is used to identify the possible lipids in our sample.

In [48]:
#Match df_all to mrm_list_offical and append Lipid and Class columns to df_all
for index in range(len(df_all)):
    for row in range(len(mrm_list_offical)):
        if mrm_list_offical.loc[row,'Parent_Ion'] == df_all.loc[index,'Parent_Ion'] and mrm_list_offical.loc[row,'Product_Ion'] == df_all.loc[index,'Product_Ion']:
            df_all.loc[index,'Lipid'] = mrm_list_offical.loc[row,'Lipid']
            df_all.loc[index,'Class'] = mrm_list_offical.loc[row,'Class']

df_matching = df_all.dropna() #drop rows with NaN values
            

In [49]:
df_matching

Unnamed: 0,Parent_Ion,Product_Ion,Intensity,Transition,Lipid,Class
0,586.0,369.0,122099.448273,586.0 -> 369.0,CE(12:0),CE
1,586.0,369.0,118825.468719,586.0 -> 369.0,CE(12:0),CE
2,612.0,369.0,125085.789581,612.0 -> 369.0,CE(14:1),CE
3,614.0,369.0,119177.929077,614.0 -> 369.0,CE(14:0),CE
4,626.0,369.0,463736.451416,626.0 -> 369.0,CE(15:1),CE
5,628.0,369.0,445456.411865,628.0 -> 369.0,CE(15:0),CE
6,636.0,369.0,673411.312256,636.0 -> 369.0,CE(16:3),CE
7,638.0,369.0,717381.271973,638.0 -> 369.0,CE(16:2),CE
8,640.0,369.0,734453.010498,640.0 -> 369.0,CE(16:1),CE
9,640.0,369.0,739935.614014,640.0 -> 369.0,CE(16:1),CE
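The nested loop compares every measured row with every row of the MRM list, which becomes slow as both tables grow. One possible alternative, sketched here on toy data, is a left merge on the two m/z columns. The first two transitions and their lipid names are taken from the output above; the third row and the intensity values are hypothetical.

```python
import pandas as pd

df_all = pd.DataFrame({
    "Parent_Ion": [586.0, 612.0, 999.0],   # 999.0 -> 100.0 is a made-up unmatched row
    "Product_Ion": [369.0, 369.0, 100.0],
    "Intensity": [122099.4, 125085.8, 10.0],
})
mrm_list = pd.DataFrame({
    "Parent_Ion": [586.0, 612.0],
    "Product_Ion": [369.0, 369.0],
    "Lipid": ["CE(12:0)", "CE(14:1)"],
    "Class": ["CE", "CE"],
})

# keep='last' mirrors the loop, where a later matching row overwrites an earlier one.
lookup = mrm_list.drop_duplicates(["Parent_Ion", "Product_Ion"], keep="last")
# A left merge keeps every measured row; unmatched rows get NaN in Lipid and Class.
annotated = df_all.merge(lookup, on=["Parent_Ion", "Product_Ion"], how="left")
df_matching = annotated.dropna(subset=["Lipid"])
print(df_matching)
```

To apply this to the notebook's `df_all`, the empty `Lipid` and `Class` columns would need to be dropped first so the merge does not create suffixed duplicates, and the two key columns would likely need to be cast to `float`, since they are filled row by row and may have `object` dtype.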


In [50]:
#import visualization libraries
import umap
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

In [51]:
#plot transition versus intensity of df_matching
fig = px.bar(df_matching, x="Transition", y="Intensity", color="Lipid", hover_data=['Lipid','Class'])
fig.show()


#plot lipid class versus intensity of df_matching in a bar chart
fig = px.bar(df_matching, x="Class", y="Intensity", color="Class", hover_data=['Lipid','Class'])
fig.show()
#plot lipid class versus intensity of df_matching in a pie chart
fig = px.pie(df_matching, values='Intensity', names='Class', title='Lipid Class')
fig.show()
#make a plotly heatmap of the intensity of each transition in each sample
fig = go.Figure(data=go.Heatmap(
                     z=df_matching['Intensity'],
                        x=df_matching['Lipid'],
                        y=df_matching['Class'],
                        colorscale='Viridis'))
fig.show()
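`df_matching` can contain several rows for the same transition (for example, `586.0 -> 369.0` appears twice in the output above), so the class-level charts stack many small segments. A minimal sketch of aggregating before plotting is shown below; the intensities and the `TAG(48:1)` row are toy values chosen for illustration.

```python
import pandas as pd

df_matching = pd.DataFrame({
    "Lipid": ["CE(12:0)", "CE(12:0)", "CE(14:1)", "TAG(48:1)"],
    "Class": ["CE", "CE", "CE", "TAG"],
    "Intensity": [100.0, 300.0, 200.0, 400.0],  # toy values
})

# One row per lipid: total intensity over repeated transitions.
per_lipid = df_matching.groupby(["Class", "Lipid"], as_index=False)["Intensity"].sum()

# One row per class, plus each class's share of the total signal.
per_class = df_matching.groupby("Class", as_index=False)["Intensity"].sum()
per_class["Fraction"] = per_class["Intensity"] / per_class["Intensity"].sum()
print(per_lipid)
print(per_class)
```

Passing `per_class` to `px.bar(per_class, x="Class", y="Intensity")` would then draw one bar per class, and the `Fraction` column gives the same proportions the pie chart displays.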