In [1]:
from IPython.display import display, Image, clear_output

# Comprehensive Lipidome Automation Workflow (CLAW)

Welcome to CLAW, a tool designed to facilitate and optimize the processing of lipidomic MRM data. This Jupyter notebook encapsulates a suite of tools that streamline the various stages of lipidomics data analysis.

CLAW processes MRM data files in the mzML format. Once a file is uploaded, CLAW parses it into a structured pandas dataframe that includes the sample ID, MRM transition, and signal intensity. Each MRM transition is then matched against a default or custom lipid database for annotation.

CLAW also provides an OzESI option for determining the double bond location in lipid isomers. Users supply OzESI data and select which double bond positions to analyze. CLAW then predicts the expected m/z values for those positions and cross-references the predictions with the sample data.
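As a rough illustration of the m/z prediction step, the sketch below reproduces the `n-7`, `n-9`, and `n-12` values that appear in the results table later in this notebook (for example 876.8 gives 794.8, 766.8, and 724.8). The rule used here, a nominal shift of 14·x − 16 for an n-x double bond, is inferred from those table values and is consistent with losing x CH2 units and gaining one oxygen on ozonolysis. It is an assumption, not CLAW's actual implementation, and the function name `ozesi_product_mz` is hypothetical.

```python
def ozesi_product_mz(precursor_mz, db_position):
    """Predict the OzESI product m/z for an n-`db_position` double bond.

    Assumption: nominal shift of 14*x - 16, inferred from this notebook's
    output table rather than taken from CLAW's source code.
    """
    shift = 14.0 * db_position - 16.0
    return round(precursor_mz - shift, 1)


precursor = 876.8  # parent ion of one TAG transition in the results table
predicted = {f"n-{n}": ozesi_product_mz(precursor, n) for n in (7, 9, 12)}
print(predicted)
```

A real implementation would more likely use exact monoisotopic masses; the nominal values are enough here because the notebook rounds m/z to one decimal place.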

Because the workflow is automated, CLAW removes the need for manual data processing and reduces the time required to handle large volumes of lipid MRM data.

In [2]:
# Import the required Python libraries
import pymzml
import csv
import os
import pandas as pd
import numpy as np
import plotly.graph_objs as go
import json

#Import all the necessary CLAW libraries
import create_directory
import CLAW

No module named 'ms_deisotope._c.averagine' averagine
No module named 'ms_deisotope._c.scoring'
No module named 'ms_deisotope._c.deconvoluter_base'
No module named 'ms_deisotope._c.deconvoluter_base'
No module named 'ms_deisotope._c.deconvoluter_base'


## Directory and File Management
To keep project data organized, the workflow first creates an output directory. If the directory already exists, you can skip this step.

In [3]:
# # Create the output directory. If it already exists you can skip this step.
# create_directory.create_project_folder()


The name of the project is defined next in `name_of_project`. The project directory bears this name, which makes the data easy to identify and manage.

After the mzML files are uploaded to the project's mzML folder, the next block of code sorts them by acquisition type and moves them into the folders named 'o3on' and 'o2only'.
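The sorting step can be pictured with the minimal sketch below. It assumes that files are told apart by an `O3on` or `O2only` tag in the file name, as in the sample names used in this notebook; that rule, the function name `sort_mzml_files`, and the folder layout are assumptions for illustration and may differ from what `create_directory.filter_o3mzml_files` does.

```python
import shutil
from pathlib import Path


def sort_mzml_files(mzml_dir):
    """Move .mzML files into 'o3on' or 'o2only' subfolders based on their names."""
    mzml_dir = Path(mzml_dir)
    moved = {"o3on": [], "o2only": []}

    # Create both destination folders up front so the moves cannot fail
    # because a folder is missing.
    for name in moved:
        (mzml_dir / name).mkdir(parents=True, exist_ok=True)

    for path in sorted(mzml_dir.glob("*.mzML")):
        lowered = path.name.lower()
        if "o3on" in lowered:
            target = "o3on"
        elif "o2only" in lowered:
            target = "o2only"
        else:
            continue  # leave files that carry neither tag where they are
        shutil.move(str(path), str(mzml_dir / target / path.name))
        moved[target].append(path.name)

    return moved
```

Under these assumptions, calling `sort_mzml_files("Projects/canola/mzml")` would return a dictionary listing which files were moved into each folder.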

In [4]:
name_of_project = 'canola'
# After the mzML files are loaded into the mzml folder, filter them and move them to the o3on and o2only folders
create_directory.filter_o3mzml_files(name_of_project)

One or both of the destination directories 'Projects/canola/o3on' and 'Projects/canola/o2only' do not exist.


## Pre-Parsing Setup
The following block of code sets the variables used for parsing and passes them to `CLAW.pre_parsing_setup()`, which prints the configured values. The parsed data, including the sample ID, MRM transitions, and intensities, is later stored in a pandas dataframe for easy manipulation and analysis.

The function `CLAW.parsing_mzml_to_df` takes several arguments. `data_base_name_location` is the location of the lipid database, which contains information on lipid classes, fatty acid chains, and their corresponding MRM transitions. `Project_Folder_data` is the location of the mzML files for the samples to be analyzed. `tolerance` defines the acceptable deviation when matching MRM transitions against the lipid database. `remove_std` is a boolean; when `True`, the MRM transitions that correspond to standards (internal or external) present in the samples are removed.

The function outputs a pandas dataframe (`df`) in which each row corresponds to an MRM transition detected in a sample. Its columns include the sample ID, MRM transition, and intensity of the transition, among other values.
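To make the role of `tolerance` concrete, here is a minimal sketch of tolerance-based matching in plain Python. It assumes that a transition matches a database entry when both the parent and product m/z lie within `tolerance` of the database values. The function `match_transition` and the two-row database are hypothetical, and CLAW's own matching logic may differ.

```python
def match_transition(parent_ion, product_ion, database, tolerance=0.3):
    """Return the lipids whose parent and product m/z both lie within `tolerance`."""
    return [
        entry["Lipid"]
        for entry in database
        if abs(entry["Parent_Ion"] - parent_ion) <= tolerance
        and abs(entry["Product_Ion"] - product_ion) <= tolerance
    ]


# Hypothetical two-row database; the transitions are copied from this notebook's output.
database = [
    {"Lipid": "[TG(54:6)]_FA18:1", "Parent_Ion": 896.8, "Product_Ion": 597.6},
    {"Lipid": "[TG(54:5)]_FA18:1", "Parent_Ion": 898.8, "Product_Ion": 599.6},
]

print(match_transition(896.7, 597.5, database))  # within tolerance of the first entry
print(match_transition(760.6, 571.6, database))  # no entry within tolerance
```

A transition with no database entry inside the tolerance window stays unannotated, which would explain rows with empty `Class` and `Lipid` fields in the parsed output.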

In [5]:
# Set default values
data_base_name_location = 'lipid_database/Lipid_Database.xlsx'
Project = './Projects/'
Project_Name = 'canola'
Project_Folder_data = Project + Project_Name + '/mzml/o3on/'
Project_results = Project + Project_Name + '/results/'
file_name_to_save = 'canola'
tolerance = 0.3
remove_std = True
save_data = True

# Call pre_parsing_setup to initialize the variables
(data_base_name_location, Project_Folder_data, Project_results,
 file_name_to_save, tolerance, remove_std, save_data) = CLAW.pre_parsing_setup(
    data_base_name_location,
    Project,
    Project_Name,
    Project_Folder_data,
    Project_results,
    file_name_to_save,
    tolerance,
    remove_std,
    save_data,
)


data_base_name_location: lipid_database/Lipid_Database.xlsx
Project: ./Projects/
Project_Name: canola
Project_Folder_data: ./Projects/canola/mzml/o3on/
Project_results: ./Projects/canola/results/
file_name_to_save: canola
tolerance: 0.3
remove_std: True
save_data: True


Define the master dataframes where the data will be stored during the parsing step.

In [6]:
time_and_intensity_df, master_df, OzESI_time_df = CLAW.create_analysis_dataframes()

## CLAW.full_parse()
The `CLAW.full_parse()` function analyzes the MRM data. It takes the location of the lipid database, the paths to the data and results folders, the name of the result files, and the tolerance for matching MRM transitions. The function returns two dataframes, assigned here to `df_MRM`, which contains each detected lipid species and its corresponding MRM transition, and `df_OzESI`, which captures data from the OzESI-MS scans, including potential double bond locations of lipids. If `remove_std` is `True`, MRM transitions related to standards are removed from the dataframe, and if `save_data` is `True`, the dataframe is saved as a .csv file in the specified results folder. Note that the call below passes `remove_std=True` and `save_data=False` explicitly instead of reusing the variables set earlier.

In [7]:
# Use the initialized variables as arguments to full_parse
df_MRM, df_OzESI = CLAW.full_parse(data_base_name_location, 
                                               Project_Folder_data, 
                                               Project_results, 
                                               file_name_to_save, 
                                               tolerance, 
                                               remove_std=True, 
                                               save_data=False,
                                               batch_processing=True,
                                               plot_chromatogram=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Parent_Ion'] = np.round(lipid_MRM_data['Parent_Ion'], 1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Product_Ion'] = np.round(lipid_MRM_data['Product_Ion'], 1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Transition'] = lipid_MRM_data['Parent_Ion

Finished parsing mzML file: ./Projects/canola/mzml/o3on/CrudeCanola_O3on_150gN3_02082023.mzML



  OzESI_time_df = OzESI_time_df.append(pd.DataFrame(ozesi_rows), ignore_index=True)
  master_df = master_df.append(df, ignore_index=True)


Finished parsing mzML file: ./Projects/canola/mzml/o3on/DegummedCanola_O3on_150gN3_02082023.mzML



  OzESI_time_df = OzESI_time_df.append(pd.DataFrame(ozesi_rows), ignore_index=True)
  master_df = master_df.append(df, ignore_index=True)


Finished parsing mzML file: ./Projects/canola/mzml/o3on/RBDCanola_O3on_150gN3_02082023.mzML

Finished parsing all mzML files



In [8]:
df_MRM.head(None)

Unnamed: 0,Class,Intensity,Lipid,Parent_Ion,Product_Ion,Sample_ID,Transition
0,,5.451378e+05,,760.6,571.6,CrudeCanola_O3on_150gN3_02082023,760.6 -> 571.6
1,,6.208219e+05,,762.6,573.6,CrudeCanola_O3on_150gN3_02082023,762.6 -> 573.6
2,,9.441859e+05,,764.6,575.6,CrudeCanola_O3on_150gN3_02082023,764.6 -> 575.6
3,,1.137434e+06,,766.7,577.6,CrudeCanola_O3on_150gN3_02082023,766.7 -> 577.6
4,,5.900676e+05,,782.6,593.6,CrudeCanola_O3on_150gN3_02082023,782.6 -> 593.6
...,...,...,...,...,...,...,...
100,TAG,4.897507e+05,[TG(54:6)]_FA18:1,896.8,597.6,RBDCanola_O3on_150gN3_02082023,896.8 -> 597.6
101,TAG,1.179904e+06,[TG(54:5)]_FA18:1,898.8,599.6,RBDCanola_O3on_150gN3_02082023,898.8 -> 599.6
102,TAG,1.654774e+06,"[TG(55:11),TG(54:4)]_FA18:1",900.8,601.6,RBDCanola_O3on_150gN3_02082023,900.8 -> 601.6
103,TAG,5.234119e+06,"[TG(55:10),TG(54:3)]_FA18:1",902.8,603.6,RBDCanola_O3on_150gN3_02082023,902.8 -> 603.6


The `read_mrm_list()` function is first called to read the MRM database from the specified file location and return it as a pandas DataFrame, `mrm_database`. The `match_lipids_parser()` function then matches the lipids detected in the `df_OzESI` DataFrame, obtained from the OzESI-MS scans, with the known lipids in `mrm_database`, based on the MRM transitions and the specified `tolerance`. The result is saved in the `df_OzESI_matched` DataFrame, which contains the matched lipid species from the OzESI-MS data.

In [9]:
mrm_database = CLAW.read_mrm_list(data_base_name_location, deuterated=False)
df_OzESI_matched = CLAW.match_lipids_parser(mrm_database, df_OzESI, tolerance)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Parent_Ion'] = np.round(lipid_MRM_data['Parent_Ion'], 1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Product_Ion'] = np.round(lipid_MRM_data['Product_Ion'], 1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Transition'] = lipid_MRM_data['Parent_Ion

In [10]:
df_OzESI_matched.tail()

Unnamed: 0,Class,Lipid,OzESI_Intensity,Parent_Ion,Product_Ion,Retention_Time,Sample_ID,Transition
225352,TAG,"[TG(55:9),TG(54:2)]_FA18:1",148.500015,904.8,605.6,34.9317,RBDCanola_O3on_150gN3_02082023,904.8 -> 605.6
225353,TAG,"[TG(55:9),TG(54:2)]_FA18:1",131.800003,904.8,605.6,34.948,RBDCanola_O3on_150gN3_02082023,904.8 -> 605.6
225354,TAG,"[TG(55:9),TG(54:2)]_FA18:1",151.960007,904.8,605.6,34.964317,RBDCanola_O3on_150gN3_02082023,904.8 -> 605.6
225355,TAG,"[TG(55:9),TG(54:2)]_FA18:1",137.700012,904.8,605.6,34.980617,RBDCanola_O3on_150gN3_02082023,904.8 -> 605.6
225356,TAG,"[TG(55:9),TG(54:2)]_FA18:1",103.460007,904.8,605.6,34.996933,RBDCanola_O3on_150gN3_02082023,904.8 -> 605.6


### Lipidomics Data Processing and Double-Bond Location Analysis

The code block refines the data in several steps. First, `filter_rt()` filters the `df_OzESI_matched` DataFrame by retention time and minimum intensity. The filtered DataFrame is then enriched with the m/z values for the specified double bond positions (here 7, 9, and 12) by `calculate_DB_Position()`, and `add_lipid_info()` adds the matching n-# label and lipid information. After sorting by 'Sample_ID' and 'Product_Ion', `calculate_intensity_ratio()` calculates the intensity ratios and appends them to the DataFrame. The lipid species in the 'Lipid' column are then sorted by their second triacylglycerol (TG) components with `sort_by_second_tg`. Lastly, `filter_highest_ratio()` keeps the rows with the highest intensity ratios, resulting in the more concise DataFrame `df_OzESI_5_ratio_final`.
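The intensity ratio can be illustrated with a small plain-Python sketch. Judging from the results table shown further down, the `Ratio` column appears to be the n-9 intensity divided by the n-7 intensity for the same lipid and sample (16153 / 3815 ≈ 4.234 for TG(52:2) in the crude canola sample). The grouping logic, the function name `n9_n7_ratios`, and the shortened sample and lipid labels below are assumptions for illustration, not CLAW's implementation.

```python
from collections import defaultdict


def n9_n7_ratios(rows):
    """Compute the n-9 / n-7 intensity ratio for each (Sample_ID, Lipid) group."""
    intensities = defaultdict(dict)
    for row in rows:
        key = (row["Sample_ID"], row["Lipid"])
        intensities[key][row["db_pos"]] = row["OzESI_Intensity"]

    ratios = {}
    for key, by_pos in intensities.items():
        n9, n7 = by_pos.get("n-9"), by_pos.get("n-7")
        # Skip groups that lack either peak or have zero n-7 signal.
        if n9 is not None and n7:
            ratios[key] = n9 / n7
    return ratios


# Intensities copied from the results table; labels are simplified.
rows = [
    {"Sample_ID": "CrudeCanola", "Lipid": "TG(52:2)_FA18:1", "db_pos": "n-9", "OzESI_Intensity": 16153.0},
    {"Sample_ID": "CrudeCanola", "Lipid": "TG(52:2)_FA18:1", "db_pos": "n-7", "OzESI_Intensity": 3815.0},
    {"Sample_ID": "CrudeCanola", "Lipid": "TG(52:2)_FA18:1", "db_pos": "", "OzESI_Intensity": 102684.0},
]
ratios = n9_n7_ratios(rows)
print(ratios)
```

The unlabeled precursor row (empty `db_pos`) is carried along but does not enter the ratio, which mirrors the empty `Ratio` cells on precursor rows in the table.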

In [11]:
# Double bond positions (n-#) to analyze
db_pos_list = [7, 9, 12]

# Filter df_OzESI_matched by retention time and minimum intensity
df_OzESI_1_filtered = CLAW.filter_rt(df_OzESI_matched, min_rt=10, max_rt=20, min_intensity=400)

# Work on a copy of the filtered dataframe
df_OzESI_1_filtered_copy = df_OzESI_1_filtered.copy()

# Add the m/z values for each double bond position
df_OzESI_2_DB_pos = CLAW.calculate_DB_Position(df_OzESI_1_filtered_copy, db_pos_list=db_pos_list)
print(df_OzESI_2_DB_pos)

# Add an empty column for the n-# labels
df_OzESI_2_DB_pos['db_pos'] = ''

# Match each double bond position to its n-# label
df_OzESI_3_DB_pos_matched = CLAW.add_lipid_info(df_OzESI_2_DB_pos, db_pos_list, tolerance=0.3)

# Sort OzESI data by Sample_ID and Product_Ion
df_OzESI_3_DB_pos_sorted = df_OzESI_3_DB_pos_matched.sort_values(by=['Sample_ID', 'Product_Ion'])

# Calculate the intensity ratios on a copy of the sorted dataframe
df_OzESI_4_ratio = df_OzESI_3_DB_pos_sorted.copy()
df_OzESI_4_ratio = CLAW.calculate_intensity_ratio(df_OzESI_4_ratio)

# Make a copy and sort the entries in the Lipid column
df_OzESI_4_ratio_sort = df_OzESI_4_ratio.copy()
df_OzESI_4_ratio_sort['Lipid'] = df_OzESI_4_ratio_sort['Lipid'].apply(CLAW.sort_by_second_tg)

# Keep the rows with the highest ratios in the final dataframe df_OzESI_5_ratio_final
df_OzESI_5_ratio_final = CLAW.filter_highest_ratio(df_OzESI_4_ratio_sort)


   Class                        Lipid  OzESI_Intensity  Parent_Ion  \
0    NaN                          NaN            579.0       760.6   
1    NaN                          NaN           1875.0       762.6   
2    NaN                          NaN           3420.0       764.6   
3    NaN                          NaN          16153.0       766.7   
4    NaN                          NaN            636.0       782.6   
..   ...                          ...              ...         ...   
91   TAG            [TG(54:6)]_FA18:1           1138.0       896.8   
92   TAG            [TG(54:5)]_FA18:1          12850.0       898.8   
93   TAG  [TG(55:11),TG(54:4)]_FA18:1          87513.0       900.8   
94   TAG  [TG(55:10),TG(54:3)]_FA18:1         297883.0       902.8   
95   TAG   [TG(55:9),TG(54:2)]_FA18:1          26943.0       904.8   

    Product_Ion  Retention_Time                         Sample_ID  \
0         571.6           12.41  CrudeCanola_O3on_150gN3_02082023   
1         573.6      

  final_dataframe = final_dataframe.append(appended_row, ignore_index=True)


In [12]:
from plot import printed_ratio

# Assumes df_OzESI_4_ratio_sort is already defined in the notebook
printed_ratio(df_OzESI_4_ratio_sort)


Lipid: [TG(54:6)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratio: 0.7435897435897436
Lipid: [TG(54:6)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratio: 0.7435897435897436
Lipid: [TG(54:6)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratio: 0.7435897435897436
Lipid: [TG(52:5)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratio: 0.11009697661152311
Lipid: [TG(52:5)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratio: 0.11009697661152311
Lipid: [TG(52:5)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratio: 0.11009697661152311
Lipid: [TG(52:4)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratio: 3.048780487804878
Lipid: [TG(52:4)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratio: 3.048780487804878
Lipid: [TG(52:4)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratio: 3.048780487804878
Lipid: TG(

### Previewing Processed Lipidomics Data

This cell provides a snapshot of the fully processed and enriched lipidomics data set. At this stage, the dataframe includes the integrated information of lipid identities, their specific double-bond locations, and other pertinent characteristics. This prepared data is now ready to be exported for subsequent exploratory and statistical analyses, including visualization and inferential statistics.

In [13]:
df_OzESI_5_ratio_final.head(None)

Unnamed: 0,Class,Lipid,OzESI_Intensity,Parent_Ion,Product_Ion,Retention_Time,Sample_ID,Transition,n-7,n-9,n-12,db_pos,Ratio
107,,TG(52:2)]_FA18:1,16153.0,766.7,577.6,18.05,CrudeCanola_O3on_150gN3_02082023,766.7 -> 577.6,684.7,656.7,614.7,n-9,4.234076
25,TAG,TG(52:2)]_FA18:1,102684.0,876.8,577.6,18.04,CrudeCanola_O3on_150gN3_02082023,876.8 -> 577.6,794.8,766.8,724.8,,
155,,TG(52:2)]_FA18:1,3815.0,794.7,577.6,18.05,CrudeCanola_O3on_150gN3_02082023,794.7 -> 577.6,712.7,684.7,642.7,n-7,
103,,TG(52:3)]_FA18:1,3420.0,764.6,575.6,16.09,CrudeCanola_O3on_150gN3_02082023,764.6 -> 575.6,682.6,654.6,612.6,n-9,3.157895
24,TAG,TG(52:3)]_FA18:1,22411.0,874.8,575.6,16.12,CrudeCanola_O3on_150gN3_02082023,874.8 -> 575.6,792.8,764.8,722.8,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
92,TAG,[TG(54:5)]_FA18:1,12850.0,898.8,599.6,14.34,RBDCanola_O3on_150gN3_02082023,898.8 -> 599.6,816.8,788.8,746.8,,
329,,[TG(54:5)]_FA18:1,891.0,816.7,599.6,14.28,RBDCanola_O3on_150gN3_02082023,816.7 -> 599.6,734.7,706.7,664.7,n-7,
280,,[TG(54:6)]_FA18:1,587.0,786.6,597.6,19.48,RBDCanola_O3on_150gN3_02082023,786.6 -> 597.6,704.6,676.6,634.6,n-9,1.067273
91,TAG,[TG(54:6)]_FA18:1,1138.0,896.8,597.6,12.43,RBDCanola_O3on_150gN3_02082023,896.8 -> 597.6,814.8,786.8,744.8,,


### Lipidomic OzESI Data Visualization

This section presents the visual representation of the lipidomic OzESI data, focusing on the ratio analysis of isomeric lipids based on their double bond location. By default, the visualization emphasizes the n-9/n-7 ratios, but the configuration can be tailored to accommodate any specific double bond location on a lipid. Select the directory where the plots will be saved.

In [21]:
import plot
import re

def lipid_sort_key(lipid):
    # Extract numbers from the lipid string using regex
    matches = re.findall(r'(\d+)', lipid)
    
    # Extract the numbers and return a tuple for sorting
    if len(matches) >= 2:
        return (int(matches[0]), int(matches[1]))
    elif len(matches) == 1:
        return (int(matches[0]), 0)
    else:
        return (0, 0)  # Default return if no match


# Select the project and the plot subfolder
project = 'Projects/canola/'
plot_folder = 'plots/Ratios/'

# Build the output directory and create it if it doesn't exist
output_directory = project + plot_folder
os.makedirs(output_directory, exist_ok=True)

# Copy the dataframe to df_plot for plotting
df_plot = df_OzESI_5_ratio_final.copy()
df_plot = df_plot.sort_values(by='Lipid', key=lambda x: x.map(lipid_sort_key))
# Drop lipids whose name contains ":0"
df_plot = df_plot[~df_plot['Lipid'].str.contains(":0")]

# Define color mappings for Lipid patterns
color_mapping = {
    '50': 'red',
    '51': 'brown',
    '52': 'blue',
    '53': 'purple',
    '54': 'green',
}

# Plot the ratios with the plot_ratio function from the plot module
plot.plot_ratio(df_plot, color_mapping, output_directory, ratio_threshold=0.5)

In [15]:
print(df_plot.columns)

Index(['Class', 'Lipid', 'OzESI_Intensity', 'Parent_Ion', 'Product_Ion',
       'Retention_Time', 'Sample_ID', 'Transition', 'n-7', 'n-9', 'n-12',
       'db_pos', 'Ratio'],
      dtype='object')
