# Automated Multiple Reaction Monitoring (MRM)-profiling and Ozone Electrospray Ionizaton (OzESI)-MRM Informatics Platform for High-throughput Lipidomics


In this jupyter notebook you will automate the data analysis of the lipidome. This is a challenging problem to perform manually due to the diverse nature of lipids and the many potential isomers. In this notebook you will analyze mzML files containing data from lipid MRMs, with ozone off and ozone on. The goal is to identify possible double-bond locations in a lipid, in this case a TAG (triacylglycerols).

In [1]:
from IPython.display import Image

![title](Figures/agilent_lcms.png)

The examples shown here were run on an Agilent 6495C Triple Quadrupole LC/MS (example shown above) that has been connected to an ozone line (not shown in picture) for ozoneolysis of lipids.

![title](Figures/TAG_example.png)
Here is an example of a TAG. Notice how many possibilities there are for locations of one double-bond there could be and how convoluted the analysis can become! This image is obtained from LipidMaps.org

Import all necessary libraries

In [2]:
#Import all the necessary libraries
import pymzml
import csv
import os
import pandas as pd
import numpy as np
import math
from matplotlib import pyplot as plt
import re
import plotly.express as px

No module named 'ms_deisotope._c.averagine' averagine
No module named 'ms_deisotope._c.scoring'
No module named 'ms_deisotope._c.deconvoluter_base'
No module named 'ms_deisotope._c.deconvoluter_base'
No module named 'ms_deisotope._c.deconvoluter_base'


In [None]:
# #Make a list of all folder names in data_mzml
# folder_names_data_mzml = os.listdir('data_mzml')
# print(folder_names_data_mzml)
# #Make a list of all folder names in data_mzml/fly_1-11-23
# folder_name = os.listdir('data_mzml/03-03-23')
# print(folder_name)


In [None]:
# # Load the Excel file into a Pandas ExcelFile object
# excel_file = pd.ExcelFile('Lipid_Database.XLS')

# # Create an empty list to store the dataframes
# dfs = []

# # Loop through each sheet in the Excel file and create a DataFrame for each sheet
# for sheet_name in excel_file.sheet_names:
#     # Load the sheet into a DataFrame
#     df = pd.read_excel(excel_file, sheet_name=sheet_name)
    
#     # Add a column with the sheet name to the DataFrame
#     df['Class'] = sheet_name
    
#     # Add the DataFrame to the list
#     dfs.append(df)

# # Concatenate the list of DataFrames into a single DataFrame
# merged_df = pd.concat(dfs, ignore_index=True)

# # Print the merged DataFrame
# print(merged_df)


MAKE CLASSES FOR EACH LIPID

In [3]:
lipid_types = ["CE","TAG","CER","FFA","PC","PE","PG","PI","SM","AC"]

#loop through all sheets in SUPPLE_2.XLS and make a df of Compound Name, Parent Ion, and Product Ion
mrm_list_new = pd.read_excel('Lipid_Database.xlsx', sheet_name = None)
mrm_list_new = pd.concat(mrm_list_new, ignore_index=True)
mrm_list_offical = mrm_list_new[['Compound Name', 'Parent Ion', 'Product Ion', 'Class']]
#Add underscore to middle of columns names
mrm_list_offical.columns = mrm_list_offical.columns.str.replace(' ', '_')
#round Parent Ion and Product Ion to 1 decimal place
mrm_list_offical['Parent_Ion'] = np.floor(mrm_list_offical['Parent_Ion'].round(1))
mrm_list_offical['Product_Ion'] = np.floor(mrm_list_offical['Product_Ion'].round(1))
#create transition column by combining Parent Ion and Product Ion with arrow between numbers
mrm_list_offical['Transition'] = mrm_list_offical['Parent_Ion'].astype(str) + ' -> ' + mrm_list_offical['Product_Ion'].astype(str)
#change column compound name to lipid
mrm_list_offical = mrm_list_offical.rename(columns={'Compound_Name': 'Lipid'})
#make a column called Class match lipid column to lipid types


pd.set_option('display.max_rows', None)
print(mrm_list_offical.head(None))


                                                  Lipid  Parent_Ion  \
0                                              LPC(2:0)       300.0   
1                                              LPC(3:1)       312.0   
2                                    LPC(3:0),PC(O-3:0)       314.0   
3                          LPC(4:0),PC(O-4:0),PC(O-5:0)       328.0   
4                                      PC(4:0),LPC(5:0)       342.0   
5                                              LPC(6:0)       356.0   
6                                      PC(6:0),LPC(7:0)       370.0   
7                                              LPC(8:0)       384.0   
8                                           LPC(O-10:1)       396.0   
9                                      PC(8:0),LPC(9:0)       398.0   
10                                 LPC(10:0),PC(O-10:0)       412.0   
11                                   PC(10:0),LPC(11:0)       426.0   
12                                           PC(O-12:1)       438.0   
13    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mrm_list_offical['Parent_Ion'] = np.floor(mrm_list_offical['Parent_Ion'].round(1))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mrm_list_offical['Product_Ion'] = np.floor(mrm_list_offical['Product_Ion'].round(1))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mrm_list_offical['Transition'] = mrm_

In [4]:
list_of_lipid_classes = mrm_list_offical['Class'].unique()
print(list_of_lipid_classes)

['PC' 'PE' 'SM' 'Cer' 'CAR' 'TAG' 'DAG' 'PS' 'PI' 'PG' 'CE' 'FA'
 'STD 15:0-18:1-d7 DG' 'STD 18:1 (d7) Lyso PC' 'STD 18:1 (d7) Lyso PE'
 'STD 18:1(d7) MAG' 'STD C15 ceramide-D7' 'STD_15:0-18:1(d7) PC'
 'STD_15:0-18:1(d7) PE' 'STD_15:0-18:1(d7) PG (Na Salt)'
 'STD_15:0-18:1(d7) PI (NH4 Salt)' 'STD_15:0-18:1(d7) PS (Na Salt)'
 'STD_15:0-18:1(d7)-15:0 TAG' 'STD_18:1(d7) Chol Ester'
 'STD_d18:1-18:1(d9) SM']


Load mzML file and convert to pandas dataframe and csv file. |
Columns = Q1, Q3, Intensity, Transition, Lipid, Class  |
Parsed data is also stored as csv file in data_csv

In [5]:
#Create for loop to load all mzml files from the data folder into the run object from pymzml reader function and store in pandas dataframe
#Create empty dictionary to store all the data
data_folder = os.listdir('./data_mzml/Burda_3_17_23/') #Path to the mzml files
path_to_mzml_files = './data_mzml/Burda_3_17_23/' #Path to the mzml files
#data_dict = {} #Empty dictionary to store all the data
df = pd.DataFrame(columns=['Parent_Ion','Product_Ion','Intensity','Transition','Lipid','Class','Sample_ID'])
#Create a similar for loop, except store all data in a single pandas dataframe
df_all = pd.DataFrame(columns=['Parent_Ion','Product_Ion','Intensity','Transition','Lipid','Class','Sample_ID']) #Create empty pandas dataframe to store the data
#df_all = pd.DataFrame(columns=['Q1','Q3','Intensity','Transition','Lipid','Class']) #Create empty pandas dataframe to store the data
for file in data_folder:
        if file.endswith('.mzML'):
                print(file)
                run = pymzml.run.Reader(path_to_mzml_files+file, skip_chromatogram=False) #Load the mzml file into the run object
                print('Spectrum # = ',run.get_spectrum_count())
                print('Chromatogram # =',run.get_chromatogram_count())
                #create pandas dataframe to store the data with the columns Parent Ion, Product Ion, Intensity, Transition Lipid and Class
                #df_sample = pd.DataFrame(columns=['Parent_Ion','Product_Ion','Intensity','Transition','Lipid','Class']) #Create empty pandas dataframe to store the data
                #df_sample = pd.DataFrame(columns=['Q1','Q3','Intensity','Transition','Lipid','Class']) #Create empty pandas dataframe to store the data
                q1_mz = 0 #Create empty variables to store the Q1 and Q3 m/z values
                q3_mz = 0
                count = 0 #Create a counter to keep track of the number of transitions
                for spectrum in run:
                        for element in spectrum.ID.split(' '):
                                # print('element',element)
                                intensity_store = np.array([])
                                if 'Q1' in element:
                                        #print('Q1',element)
                                        q1 = element.split('=')
                                        #print('q1',q1[1])
                                        q1_mz= np.floor(round(float(q1[1]),1))
                                        # print('q1',q1)
                                
                                if 'Q3' in element:
                                        # print('Q3',element)
                                        q3 = element.split('=')
                                        #print('q3',q3[1])
                                        q3_mz=np.floor(round(float(q3[1]),1))
                                        # print('q3',q3)
                                        # df_sample.loc[count,'Q1'] = q1_mz
                                        # df_sample.loc[count,'Q3'] = q3_mz
                                        
                                        for mz,intensity in spectrum.peaks(): #Get the m/z and intensity values from the spectrum
                                                intensity_store = np.append(intensity_store,intensity) #Store the intensity values in an array
                        
                                
                                if 'Q3' in element:
                                        # print(intensity_sum)
                                        intensity_sum = np.sum(intensity_store) #Sum the intensity values
                                        df_all.loc[count,'Parent_Ion'] = q1_mz #Store the Q1 and Q3 m/z values in the pandas dataframe
                                        df_all.loc[count,'Product_Ion'] = q3_mz
                                        #round the Q1 and Q3 m/z values to 1 decimal places
                                        df_all.loc[count,'Parent_Ion'] = np.floor(round(df_all.loc[count,'Parent_Ion'],1))
                                        df_all.loc[count,'Product_Ion'] = np.floor(round(df_all.loc[count,'Product_Ion'],1))
                                        df_all.loc[count,'Intensity'] = intensity_sum #Store the intensity values in the pandas dataframe
                                        df_all.loc[count,'Transition'] = str(q1_mz)+ ' -> '+ str(q3_mz) #Store the transition values in the pandas dataframe
                                        #add file name to Sample_ID column without the mzmL extension
                                        df_all.loc[count,'Sample_ID'] = file[:-5]
                                        count+=1
        #append df_all to df_all2
        df = df.append(df_all, ignore_index=True)

PE_IPA Wash5x_p1-a1_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 189


  df = df.append(df_all, ignore_index=True)


CE_Blank-4x_p1 -a5_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 60
PS_IPA Wash4x_p1-a1_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 183
PG_AK0174-1x_p1-b1_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 183
TAG1_AA0474-4x-p_p1-e7_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 197


  df = df.append(df_all, ignore_index=True)


TAG1_SPLASH10x_p1-a9_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 197
CER_AK0168-3x_p1-c7_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 82
AC_AA0474-4x-p_p1-e7_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 65
PS_Blank1-3x_p1 -a4_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 183
CE_AK0157-4x_p1-d4_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 60
PS_AK0141-1x_p1-b9_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 183
TAG1_AK0155-4x_p1-c9_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 197
CE_AK0098-3x-p_p1-e6_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 60
CE_AK0140-4x_p1-d2_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 60
DAG_AK0139-1x_p1-b6_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 26
FFA_Blank1-1x_p1 -a2_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 37
PG_AA0469-3x-p_p1-e5_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 

  df = df.append(df_all, ignore_index=True)


PS_AK0175-2x_p1-c3_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 183
DAG_AA0472-2x-p_p1-e2_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 26
PI_SPLASH1x_p1-a9_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 181
TAG2_AK0079-4x-p_p1-e9_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 190
CER_AK0169-2x_p1-c4_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 82
FFA_AK0076-4x-p_p1-e8_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 37
DAG_AK0174-1x_p1-b1_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 26
TAG2_IPA Wash3x_p1-a1_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 190
CER_AK0079-4x-p_p1-e9_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 82
CER_AK0145-1x_p1-b3_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 82
AC_AK0166-3x_p1-c8_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram # = 65
DAG_AK0164-1x_p1-b5_N1__Burda__3_17_23.mzML
Spectrum # =  None
Chromatogram 

In [6]:
df.tail(5) #Print the pandas dataframe

Unnamed: 0,Parent_Ion,Product_Ion,Intensity,Transition,Lipid,Class,Sample_ID
140393,954.0,184.0,7980.880661,954.0 -> 184.0,,,PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23
140394,958.0,184.0,7836.600552,958.0 -> 184.0,,,PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23
140395,986.0,184.0,8709.900608,986.0 -> 184.0,,,PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23
140396,1006.0,184.0,8850.500561,1006.0 -> 184.0,,,PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23
140397,1014.0,184.0,6855.520538,1014.0 -> 184.0,,,PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23


In [None]:
###BLANK SUBTRACTION

#Find all Sample_IDs with blank in the name and average the intensity values
blank_sample_ids = df[df['Sample_ID'].str.contains('blank')][['Parent_Ion','Product_Ion','Intensity', 'Transition','Lipid','Class', 'Sample_ID']]

#add a row to take the mean of all intensities for each transition
blank_sample_ids.loc['mean'] = blank_sample_ids.mean()
print(blank_sample_ids.tail(5))


Load MRM transitions from csv file to pandas dataframe. This list will be used to identify the possible lipids in our sample.

In [10]:
#Match df_all to mrm_list_offical and append Lipid and Class columns to df_all
for index in range(len(df)):
    for row in range(len(mrm_list_offical)):
        if mrm_list_offical.loc[row,'Parent_Ion'] == df.loc[index,'Parent_Ion'] and mrm_list_offical.loc[row,'Product_Ion'] == df.loc[index,'Product_Ion']:
            df.loc[index,'Lipid'] = mrm_list_offical.loc[row,'Lipid']
            df.loc[index,'Class'] = mrm_list_offical.loc[row,'Class']

df_matching = df.dropna() #drop rows with NaN values
            

In [9]:
# #Match df to mrm_list_offical and append Lipid and Class columns to df_all
# #df_test = df.where( (df['Parent_Ion'].reset_index(drop=True) == mrm_list_offical['Parent_Ion'].reset_index(drop=True)) )
# df_matching = df.where(df['Parent_Ion'].isin(mrm_list_offical['Parent_Ion']) & df['Product_Ion'].isin(mrm_list_offical['Product_Ion']))
# #Match the lipid and the class to the df
# #df_matching = df_matching.merge(mrm_list_offical, left_on=['Parent_Ion','Product_Ion'], right_on=['Parent_Ion','Product_Ion'])
# #match Lipid of mrm_list_offical to Lipid of df_matching

# #df_matching = df_matching.dropna(axis=1)
# # print(df['Parent_Ion'])
# # print(mrm_list_offical['Parent_Ion'])
# print(df_matching.tail(5))

#create df_test and all rows from df where the Parent_Ion and Product_Ion are in mrm_list_offical
df_test = df[df['Parent_Ion'].isin(mrm_list_offical['Parent_Ion']) & df['Product_Ion'].isin(mrm_list_offical['Product_Ion'])]
#If Transition is in mrm_list_offical then add the Lipid and Class to df_test
df_test2 = df_test[df_test['Transition'].isin(mrm_list_offical['Transition'])]
#df_test2['Lipid'] = df_test2['Transition'].map(mrm_list_offical.set_index('Transition')['Lipid'])
#df_test2['Class'] = df_test2['Transition'].map(mrm_list_offical.set_index('Transition')['Class'])


#df_test2['Lipid'] = df_test2['Parent_Ion'].map(mrm_list_offical.set_index('Parent_Ion')['Lipid'])
# df_test2['Class'] = df_test2['Parent_Ion'].map(mrm_list_offical.set_index('Parent_Ion')['Class'])
print(df_test2.tail(5))
#print(df_matching.tail(5))

       Parent_Ion Product_Ion    Intensity       Transition Lipid Class  \
140393      954.0       184.0  7980.880661   954.0 -> 184.0   NaN   NaN   
140394      958.0       184.0  7836.600552   958.0 -> 184.0   NaN   NaN   
140395      986.0       184.0  8709.900608   986.0 -> 184.0   NaN   NaN   
140396     1006.0       184.0  8850.500561  1006.0 -> 184.0   NaN   NaN   
140397     1014.0       184.0  6855.520538  1014.0 -> 184.0   NaN   NaN   

                                        Sample_ID  
140393  PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23  
140394  PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23  
140395  PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23  
140396  PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23  
140397  PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23  


In [13]:
#df_test2.to_excel('data_results/data/data_matching/Burda/Burda2.xlsx')
#save df_matching to excel file in data_results data matching liver_LD
df_matching.to_excel('data_results/data/data_matching/Burda/Burda.xlsx')
df_matching.tail(5)

Unnamed: 0,Parent_Ion,Product_Ion,Intensity,Transition,Lipid,Class,Sample_ID
140393,954.0,184.0,7980.880661,954.0 -> 184.0,PC(48:2),PC,PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23
140394,958.0,184.0,7836.600552,958.0 -> 184.0,PC(48:0),PC,PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23
140395,986.0,184.0,8709.900608,986.0 -> 184.0,PC(50:0),PC,PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23
140396,1006.0,184.0,8850.500561,1006.0 -> 184.0,PC(52:4),PC,PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23
140397,1014.0,184.0,6855.520538,1014.0 -> 184.0,PC(52:0),PC,PCandSM_SPLASH3x_p1-a9_N1__Burda__3_17_23


In [10]:
#import visualization libraries
import umap
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

  from .autonotebook import tqdm as notebook_tqdm


In [None]:
#plot transition versus intensity of df_matching
fig = px.bar(df_matching, x="Transition", y="Intensity", color="Lipid", hover_data=['Lipid','Class'])
fig.show()


#plot lipid class versus intensity of df_matching in a bar chart
fig = px.bar(df_matching, x="Class", y="Intensity", color="Class", hover_data=['Lipid','Class'])
fig.show()
#plot lipid class versus intensity of df_matching in a pie chart
fig = px.pie(df_matching, values='Intensity', names='Class', title='Lipid Class')
fig.show()
#make a plotly heatmap of the intensity of each transition in each sample
fig = go.Figure(data=go.Heatmap(
                     z=df_matching['Intensity'],
                        x=df_matching['Lipid'],
                        y=df_matching['Class'],
                        colorscale='Viridis'))
fig.show()

#plot sample ID versus intensity of df_matching
fig = px.bar(df_matching, x="Sample_ID", y="Intensity", color="Sample_ID", hover_data=['Lipid','Class'])
fig.show()
#plot sample ID versus intensity of df_matching in a pie chart
fig = px.pie(df_matching, values='Intensity', names='Sample_ID', title='Sample ID')
fig.show()