In [18]:
from IPython.display import display, Image, clear_output

# Comprehensive Lipidome Automation Workflow (CLAW)

Welcome to CLAW_v2, a tool designed to facilitate and optimize the processing of lipidomic MRM data. This Jupyter notebook encapsulates a suite of tools that streamline the various stages of lipidomics data analysis.

Our toolset enables users to efficiently process MRM data files in the mzML format. Upload a file and CLAW_v2 will parse the data into a structured Pandas dataframe. This dataframe includes critical information like sample_ID, MRM transition, and signal intensity. Furthermore, our tool aligns each MRM transition with a default or custom lipid_database for accurate and swift annotation.

CLAW_v2 also includes an OzESI (ozone electrospray ionization) option for locating double bonds in lipid isomers. Users supply OzESI data and select which double-bond positions to analyze. CLAW_v2 then predicts the expected m/z values for those positions and cross-references them against the sample data.

Because every step is automated, CLAW_v2 removes the need for manual data processing and substantially reduces analysis time, which makes it well suited to large volumes of lipid MRM data.

In [1]:
#Import all the necessary python libraries
import pymzml
import csv
import os
import pandas as pd
import numpy as np
import plotly.graph_objs as go
import json

#Import all the necessary CLAW_v2 libraries
import create_directory
import CLAW_v2

No module named 'ms_deisotope._c.averagine' averagine
No module named 'ms_deisotope._c.scoring'
No module named 'ms_deisotope._c.deconvoluter_base'
No module named 'ms_deisotope._c.deconvoluter_base'
No module named 'ms_deisotope._c.deconvoluter_base'


## Directory and File Management
For structured data management and efficient workflow, the system first ensures the presence of an output directory. If such a directory doesn't already exist, it will be created.

In [2]:
# # # Create the output directory if it doesn't exist
# create_directory.create_project_folder()
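
As a rough illustration of what such a setup step might do, the sketch below creates a project folder tree with `pathlib`. The helper name `make_project_folders` is hypothetical, and the subfolder names are assumptions inferred from the paths used later in this notebook; this is not the actual `create_directory.create_project_folder()` implementation.

```python
from pathlib import Path

# Hypothetical layout: subfolder names are assumptions inferred from the
# paths used later in this notebook, not taken from create_directory.
SUBFOLDERS = ("mzml/o3on", "mzml/o2only", "results", "Plots")


def make_project_folders(root, project_name):
    """Create the project folder tree, leaving existing folders untouched."""
    project_dir = Path(root) / project_name
    for sub in SUBFOLDERS:
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe
        (project_dir / sub).mkdir(parents=True, exist_ok=True)
    return project_dir
```

Calling `make_project_folders('./Projects', 'canola')` twice is harmless, since existing folders are left as they are.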


The name of the project (in this case 'canola') is defined next. The project directory bears this name, which makes the data easy to identify and manage.

After the mzML files are uploaded to the designated mzML folder, the next block of code segregates these files based on their characteristics. More specifically, it filters the files and transfers them to respective folders named 'o3on' and 'o2only'.

In [2]:
name_of_project = 'canola'
# After the mzML files are loaded into the mzml folder, filter them and move them into the o3on and o2only folders
create_directory.filter_o3mzml_files(name_of_project)

One or both of the destination directories 'Projects/canola/o3on' and 'Projects/canola/o2only' do not exist.
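
The sorting step described above could look roughly like the following sketch. It assumes that ozone-on runs carry `o3on` in the file name (as the sample files in this notebook do) and treats every other file as an O2-only run. The function name `sort_mzml_files` is hypothetical and the real `filter_o3mzml_files()` may use different rules. Unlike the call above, this sketch creates missing destination folders on demand.

```python
import shutil
from pathlib import Path


def sort_mzml_files(mzml_dir):
    """Move *.mzML files into 'o3on' or 'o2only' subfolders by file name.

    Assumption: ozone-on runs carry 'o3on' in the file name (any case),
    and every other file is treated as an O2-only run.
    """
    mzml_dir = Path(mzml_dir)
    moved = {"o3on": [], "o2only": []}
    # sorted() materializes the listing first, since files are moved while looping
    for path in sorted(mzml_dir.glob("*.mzML")):
        target = "o3on" if "o3on" in path.name.lower() else "o2only"
        dest_dir = mzml_dir / target
        dest_dir.mkdir(exist_ok=True)  # create the destination on demand
        shutil.move(str(path), str(dest_dir / path.name))
        moved[target].append(path.name)
    return moved
```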


## Pre-Parsing Setup
The following block of code takes the preset variable values and uses them to parse the mzML files. The parsed data, including the sample ID, MRM transitions, and intensities, is stored in a pandas dataframe for easy manipulation and analysis.

The function `CLAW_v2.parsing_mzml_to_df` takes several arguments. `data_base_name_location` is the location of the lipid database, which contains lipid classes, fatty acid chains, and their corresponding MRM transitions. `Project_Folder_data` is the folder that holds the mzML files of the samples to be analyzed. `tolerance` is the maximum allowed deviation when matching MRM transitions against the lipid database. `remove_std` is a boolean; when `True`, MRM transitions that correspond to standards (internal or external) are removed.

The function outputs a pandas dataframe (df) where each row corresponds to an MRM transition detected in a sample, and columns include the sample ID, MRM transition, and intensity of the transition, among other values.
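
To make the role of `tolerance` concrete, here is a minimal sketch of tolerance-based matching in plain Python. The function `match_transition` and the toy database are hypothetical, with values chosen to resemble the output tables in this notebook; the matching inside CLAW_v2 may differ in detail (for example, it appears to round m/z values to one decimal first).

```python
def match_transition(parent_ion, product_ion, database, tolerance=0.3):
    """Return names of database entries whose parent and product m/z both
    lie within +/- tolerance of the measured transition."""
    return [
        entry["Lipid"]
        for entry in database
        if abs(entry["Parent_Ion"] - parent_ion) <= tolerance
        and abs(entry["Product_Ion"] - product_ion) <= tolerance
    ]


# Toy database rows (illustrative only)
database = [
    {"Lipid": "TG(52:2)_FA18:1", "Parent_Ion": 876.8, "Product_Ion": 577.6},
    {"Lipid": "TG(54:5)_FA18:1", "Parent_Ion": 898.8, "Product_Ion": 599.6},
]

print(match_transition(876.7, 577.5, database))  # close to the first row
print(match_transition(880.0, 577.5, database))  # parent ion too far from any row
```

A transition with no match yields an empty list, which corresponds to the empty `Class` and `Lipid` cells seen for unannotated rows.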

In [3]:
# Set default values
data_base_name_location = 'lipid_database/Lipid_Database.xlsx'
Project = './Projects/'
Project_Name = 'canola'
Project_Folder_data = Project + Project_Name + '/mzml/o3on/'
Project_results = Project + Project_Name + '/results/'
file_name_to_save = 'canola'
tolerance = 0.3
remove_std = True
save_data = True

# # Define the data frame to store the data
# OzESI_time_df = pd.DataFrame(columns=['Lipid', 'Parent_Ion', 'Product_Ion', 'Intensity', 'Transition', 'Class', 'Sample_ID', 'Retention_Time', 'OzESI_Intensity'])
# time_and_intensity_df = pd.DataFrame(columns=['Time', 'Intensity'])

# Call pre_parsing_setup to initialize the variables
data_base_name_location, Project_Folder_data, Project_results, file_name_to_save, tolerance, remove_std, save_data = CLAW_v2.pre_parsing_setup(data_base_name_location,
 Project, 
 Project_Name, 
 Project_Folder_data,
 Project_results, 
 file_name_to_save, 
 tolerance, 
 remove_std,
 save_data)


data_base_name_location: lipid_database/Lipid_Database.xlsx
Project: ./Projects/
Project_Name: canola
Project_Folder_data: ./Projects/canola/mzml/o3on/
Project_results: ./Projects/canola/results/
file_name_to_save: canola
tolerance: 0.3
remove_std: True
save_data: True


## CLAW_v2.full_parse()
In this code, the `CLAW_v2.full_parse()` function is used to analyze the MRM data. It takes several parameters like the location of the lipid database, paths to the data and results folders, the name of the result files, and the tolerance for MRM transitions matching. The function returns two dataframes: `df_matched` that contains information about each detected lipid species and their corresponding MRM transitions, and `OzESI_time_df` which captures data related to OzESI-MS scans, including potential double bond locations of lipids. If `remove_std` is `True`, it removes MRM transitions related to standards from the dataframe, and if `save_data` is `True`, the dataframe is saved as a .csv file in the specified results folder.

In [4]:
time_and_intensity_df, master_df, OzESI_time_df = CLAW_v2.create_dataframes()



In [5]:
# Pass the initialized variables to full_parse (remove_std and save_data are set explicitly here)
df_matched, OzESI_time_df = CLAW_v2.full_parse(data_base_name_location, 
                                               Project_Folder_data, 
                                               Project_results, 
                                               file_name_to_save, 
                                               tolerance, 
                                               remove_std=True, 
                                               save_data=False,
                                               batch_processing=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mrm_list_offical['Parent_Ion'] = np.round(mrm_list_offical['Parent_Ion'],1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mrm_list_offical['Product_Ion'] = np.round(mrm_list_offical['Product_Ion'],1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mrm_list_offical['Transition'] = mrm_list_offical['

Finished parsing mzML file: ./Projects/canola/mzml/o3on/CrudeCanola_O3on_150gN3_02082023.mzML



  OzESI_time_df = OzESI_time_df.append(pd.DataFrame(ozesi_rows), ignore_index=True)
  master_df = master_df.append(df, ignore_index=True)


Finished parsing mzML file: ./Projects/canola/mzml/o3on/DegummedCanola_O3on_150gN3_02082023.mzML

Finished parsing mzML file: ./Projects/canola/mzml/o3on/RBDCanola_O3on_150gN3_02082023.mzML

Finished parsing all mzML files



  OzESI_time_df = OzESI_time_df.append(pd.DataFrame(ozesi_rows), ignore_index=True)
  master_df = master_df.append(df, ignore_index=True)


The `read_mrm_list()` function is first invoked to read the MRM database from the specified file location and return it as a pandas DataFrame `mrm_database`. Subsequently, the `match_lipids_parser()` function is called to match the detected lipids from the `OzESI_time_df` DataFrame, obtained from the OzESI-MS scans, with the known lipids in the `mrm_database` based on the MRM transitions within the specified `tolerance`. The result is saved in the `df_oz_matched` DataFrame, which now contains matched lipid species from the OzESI-MS data.

In [7]:
# time_and_intensity_df = pd.DataFrame(columns=['Time', 'Intensity'])
# master_df = pd.DataFrame(columns=['Parent_Ion', 'Product_Ion', 'Intensity', 'Transition', 'Sample_ID'])
# OzESI_time_df = pd.DataFrame(columns=['Parent_Ion', 'Product_Ion', 'Retention_Time', 'OzESI_Intensity', 'Sample_ID', 'Transition'])


time_and_intensity_df.head()
master_df.head()
OzESI_time_df.tail()

Unnamed: 0,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition
225352,904.8,605.6,34.9317,148.500015,RBDCanola_O3on_150gN3_02082023,904.8 -> 605.6
225353,904.8,605.6,34.948,131.800003,RBDCanola_O3on_150gN3_02082023,904.8 -> 605.6
225354,904.8,605.6,34.964317,151.960007,RBDCanola_O3on_150gN3_02082023,904.8 -> 605.6
225355,904.8,605.6,34.980617,137.700012,RBDCanola_O3on_150gN3_02082023,904.8 -> 605.6
225356,904.8,605.6,34.996933,103.460007,RBDCanola_O3on_150gN3_02082023,904.8 -> 605.6


In [8]:
mrm_database = CLAW_v2.read_mrm_list(data_base_name_location)
df_oz_matched = CLAW_v2.match_lipids_parser(mrm_database, OzESI_time_df, tolerance)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mrm_list_offical['Parent_Ion'] = np.round(mrm_list_offical['Parent_Ion'],1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mrm_list_offical['Product_Ion'] = np.round(mrm_list_offical['Product_Ion'],1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mrm_list_offical['Transition'] = mrm_list_offical['

In [9]:
df_oz_matched[df_oz_matched['Sample_ID'].str.contains('RBD')].head(None)

Unnamed: 0,Class,Lipid,OzESI_Intensity,Parent_Ion,Product_Ion,Retention_Time,Sample_ID,Transition
150238,,,96.220009,760.6,571.6,0.015933,RBDCanola_O3on_150gN3_02082023,760.6 -> 571.6
150239,,,193.240021,760.6,571.6,0.032233,RBDCanola_O3on_150gN3_02082023,760.6 -> 571.6
150240,,,206.660019,760.6,571.6,0.048550,RBDCanola_O3on_150gN3_02082023,760.6 -> 571.6
150241,,,216.500015,760.6,571.6,0.064850,RBDCanola_O3on_150gN3_02082023,760.6 -> 571.6
150242,,,170.640015,760.6,571.6,0.081167,RBDCanola_O3on_150gN3_02082023,760.6 -> 571.6
...,...,...,...,...,...,...,...,...
225352,TAG,"[TG(55:9),TG(54:2)]_FA18:1",148.500015,904.8,605.6,34.931700,RBDCanola_O3on_150gN3_02082023,904.8 -> 605.6
225353,TAG,"[TG(55:9),TG(54:2)]_FA18:1",131.800003,904.8,605.6,34.948000,RBDCanola_O3on_150gN3_02082023,904.8 -> 605.6
225354,TAG,"[TG(55:9),TG(54:2)]_FA18:1",151.960007,904.8,605.6,34.964317,RBDCanola_O3on_150gN3_02082023,904.8 -> 605.6
225355,TAG,"[TG(55:9),TG(54:2)]_FA18:1",137.700012,904.8,605.6,34.980617,RBDCanola_O3on_150gN3_02082023,904.8 -> 605.6


### Lipidomics Data Processing and Double-Bond Location Analysis

The code block begins refining the data by filtering `df_oz_matched` by retention time with the `filter_rt()` function. The filtered DataFrame can be concatenated with `df_matched` via `concat_dataframes()`; in the cell below that step is commented out and the filtered DataFrame is used directly. The DataFrame is then enriched with positional-isomer and lipid information for the specified double-bond positions (here 7, 9, and 12) through the `DB_Position_df()` and `add_lipid_info()` functions, respectively. After sorting by 'Sample_ID' and 'Product_Ion', the `calculate_intensity_ratio()` function calculates the intensity ratios and appends them to the DataFrame. Next, the lipid species in the 'Lipid' column are sorted by their second triacylglycerol (TG) component. Finally, the `filter_highest_ratios()` function keeps only the rows with the highest intensity ratios, which yields the more concise DataFrame `df_matched_6`.

In [10]:
# Filter the matched OzESI data by retention time
filtered_df = CLAW_v2.filter_rt(df_oz_matched)
# df_matched_2 = CLAW_v2.concat_dataframes(df_matched, filtered_df)
df_matched_2 = filtered_df.copy()

# Double-bond positions (n-#) to analyze
OzESI_list = [7, 9, 12]
df_matched_2 = CLAW_v2.DB_Position_df(df_matched_2, OzESI_list=OzESI_list)

# Add an empty column for the n-# labels
df_matched_2['db_pos'] = ''

# Add lipid names to the OzESI rows
df_matched_3 = CLAW_v2.add_lipid_info(df_matched_2, OzESI_list, tolerance=0.3)

df_matched_3_sorted = df_matched_3.sort_values(by=['Sample_ID','Product_Ion'])

df_matched_4 = df_matched_3_sorted.copy()
df_matched_4['Ratios'] = None


df_matched_4 = CLAW_v2.calculate_intensity_ratio(df_matched_4)

df_matched_5 = df_matched_4.copy()
df_matched_5['Lipid'] = df_matched_5['Lipid'].apply(CLAW_v2.sort_by_second_tg)
df_matched_6 = CLAW_v2.filter_highest_ratios(df_matched_5)


  df_test_2 = df_test_2.append(appended_row, ignore_index=True)


In [11]:

# Iterate through each row in the DataFrame
for index, row in df_matched_5.iterrows():
    # Extract Lipid, Sample_ID, db_pos and Ratios from the row
    lipid = row['Lipid']
    sample_id = row['Sample_ID']
    db_pos = row['db_pos']
    ratios = row['Ratios']

    # Check if ratios is not NaN
    if not pd.isna(ratios):
        # Print out the values
        print(f'Lipid: {lipid}, Sample_ID: {sample_id}, db_pos: {db_pos}, Ratios: {ratios}')


Lipid: [TG(54:6)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratios: 0.7435897435897436
Lipid: [TG(54:6)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratios: 0.7435897435897436
Lipid: [TG(54:6)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratios: 0.7435897435897436
Lipid: [TG(52:5)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratios: 1.7387387387387387
Lipid: [TG(52:5)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratios: 1.7387387387387387
Lipid: [TG(52:5)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratios: 1.7387387387387387
Lipid: [TG(54:5)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratios: 0.25874125874125875
Lipid: [TG(54:5)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratios: 0.25874125874125875
Lipid: [TG(54:5)]_FA18:1, Sample_ID: CrudeCanola_O3on_150gN3_02082023, db_pos: n-9, Ratios: 0.2587412587412587
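
The ratios printed above can be checked by hand. The sketch below assumes the ratio is the n-9 intensity divided by the n-7 intensity of the same lipid and sample; this direction is inferred from the `df_matched_6` preview in this notebook, not from the CLAW_v2 source. The helper `intensity_ratio` is hypothetical.

```python
def intensity_ratio(n9_intensity, n7_intensity):
    """Return the n-9 / n-7 intensity ratio, or None if it is undefined."""
    if not n7_intensity:  # missing (None) or zero n-7 signal
        return None
    return n9_intensity / n7_intensity


# OzESI intensities of the n-9 and n-7 rows for TG(52:2)_FA18:1 in the
# CrudeCanola sample, taken from the df_matched_6 preview in this notebook
ratio = intensity_ratio(16153.0, 3815.0)
print(round(ratio, 6))
```

With these two intensities the result agrees with the `Ratios` value of 4.234076 reported for that row.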

### Previewing Processed Lipidomics Data

This cell provides a snapshot of the fully processed and enriched lipidomics data set. At this stage, the dataframe includes the integrated information of lipid identities, their specific double-bond locations, and other pertinent characteristics. This prepared data is now ready to be exported for subsequent exploratory and statistical analyses, including visualization and inferential statistics.

In [12]:
df_matched_6.head(None)
# look at only RBD sample ID
# df_matched_6[df_matched_6['Sample_ID'].str.contains('RBD')].head(None)

Unnamed: 0,Class,Lipid,OzESI_Intensity,Parent_Ion,Product_Ion,Retention_Time,Sample_ID,Transition,n-7,n-9,n-12,db_pos,Ratios
112,,TG(52:2)]_FA18:1,16153.0,766.7,577.6,18.05,CrudeCanola_O3on_150gN3_02082023,766.7 -> 577.6,684.7,656.7,614.7,n-9,4.234076
27,TAG,TG(52:2)]_FA18:1,102684.0,876.8,577.6,18.04,CrudeCanola_O3on_150gN3_02082023,876.8 -> 577.6,794.8,766.8,724.8,,
171,,TG(52:2)]_FA18:1,3815.0,794.7,577.6,18.05,CrudeCanola_O3on_150gN3_02082023,794.7 -> 577.6,712.7,684.7,642.7,n-7,
108,,TG(52:3)]_FA18:1,3420.0,764.6,575.6,16.09,CrudeCanola_O3on_150gN3_02082023,764.6 -> 575.6,682.6,654.6,612.6,n-9,3.157895
26,TAG,TG(52:3)]_FA18:1,22411.0,874.8,575.6,16.12,CrudeCanola_O3on_150gN3_02082023,874.8 -> 575.6,792.8,764.8,722.8,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,TAG,[TG(54:5)]_FA18:1,12850.0,898.8,599.6,14.34,RBDCanola_O3on_150gN3_02082023,898.8 -> 599.6,816.8,788.8,746.8,,
369,,[TG(54:5)]_FA18:1,891.0,816.7,599.6,14.28,RBDCanola_O3on_150gN3_02082023,816.7 -> 599.6,734.7,706.7,664.7,n-7,
312,,[TG(54:6)]_FA18:1,587.0,786.6,597.6,19.48,RBDCanola_O3on_150gN3_02082023,786.6 -> 597.6,704.6,676.6,634.6,n-9,1.067273
97,TAG,[TG(54:6)]_FA18:1,1138.0,896.8,597.6,12.43,RBDCanola_O3on_150gN3_02082023,896.8 -> 597.6,814.8,786.8,744.8,,
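
The `n-7`, `n-9`, and `n-12` columns above follow a simple pattern relative to `Parent_Ion`. The sketch below reproduces them with a nominal mass loss of 14·x − 16 for position n-x (loss of x CH2 units, gain of one oxygen). This rule is read off the table, and the function `ozesi_product_mz` is a hypothetical helper, not the `DB_Position_df()` implementation.

```python
def ozesi_product_mz(precursor_mz, db_positions=(7, 9, 12)):
    """Predict OzESI product m/z values for n-x double-bond positions.

    Assumption: a nominal mass loss of 14*x - 16 per position, which
    reproduces the n-7/n-9/n-12 columns of the preview table.
    """
    return {
        f"n-{x}": round(precursor_mz - (14 * x - 16), 1)
        for x in db_positions
    }


# Precursor of the TAG row 876.8 -> 577.6 from the preview table
print(ozesi_product_mz(876.8))
```

For the precursor at 876.8 this gives 794.8, 766.8, and 724.8, matching the corresponding table row.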


## Visualizations

In [14]:
import os

import plotly.express as px

# Create the plot output directory if it doesn't exist
project = 'Projects/canola/'
plot_folder = 'Plots/py_plots/'
os.makedirs(os.path.join(project, plot_folder), exist_ok=True)


# Define color mappings for Lipid patterns
color_mapping = {
    '50': 'red',
    '51': 'brown',
    '52': 'blue',
    '53': 'purple',
    '54': 'green',
}

# Get the unique Sample_IDs
sample_ids = df_matched_6['Sample_ID'].unique()

# Loop over the unique Sample_IDs
for sample_id in sample_ids:
    
    # Filter the dataframe for the current Sample_ID
    df_sample = df_matched_6[df_matched_6['Sample_ID'] == sample_id]
    
    # Assign colors to Lipids based on patterns
    lipid_colors = []
    for lipid in df_sample['Lipid']:
        color = 'gray'  # Default color
        for pattern, pattern_color in color_mapping.items():
            if pattern in lipid:
                color = pattern_color
                break
        lipid_colors.append(color)
    
    # Create the bar plot (bar colors are applied below via marker_color)
    fig = px.bar(df_sample, x='Lipid', y='Ratios', text='Ratios', title=f'Bar Plot for Sample_ID: {sample_id}')
    
    # Apply colors to the bars
    fig.update_traces(
        marker_color=lipid_colors,
        texttemplate='%{text:.2f}',
        textposition='auto',
        marker_line_width=0
    )
    
    # Customize the layout
    fig.update_layout(
        uniformtext_minsize=18,
        uniformtext_mode='hide',
        xaxis=dict(
            title=dict(text='Lipid', font=dict(size=16))
        ),
        yaxis=dict(
            title=dict(text='Ratios', font=dict(size=16)),
            tickfont=dict(size=16)  # Set the font size of y-axis tick labels
        ),
        legend=dict(
            title='Lipid Patterns',
            tracegroupgap=50,
            itemsizing='constant'
        ),
        title=dict(
            text=f'Sample_ID: {sample_id}',
            font=dict(size=20)  # Set the title font size
        )
    )
    # Save the plot as an image, adding a numeric suffix if the file already exists
    out_dir = os.path.join(project, plot_folder)
    file_name = os.path.join(out_dir, f"plot_{sample_id}.png")
    index = 1
    while os.path.exists(file_name):
        file_name = os.path.join(out_dir, f"plot_{sample_id}_{index}.png")
        index += 1

    fig.write_image(file_name)
    # Show plot
    fig.show()
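
The color assignment in the loop above relies on substring matching. As an alternative sketch, the helper below ties the color explicitly to the TG carbon count by parsing it with a regular expression. It assumes lipid names follow the `TG(carbons:double_bonds)` pattern seen in the tables; `lipid_color` and `COLOR_BY_CARBONS` are hypothetical names, not part of CLAW_v2.

```python
import re

COLOR_BY_CARBONS = {"50": "red", "51": "brown", "52": "blue", "53": "purple", "54": "green"}


def lipid_color(lipid, mapping=COLOR_BY_CARBONS, default="gray"):
    """Pick a bar color from the TG carbon count in a lipid name."""
    # Capture the carbon count in names such as 'TG(54:2)' or
    # '[TG(55:9),TG(54:2)]_FA18:1'; the first count found in the mapping wins
    for carbons in re.findall(r"TG\((\d+):\d+\)", lipid):
        if carbons in mapping:
            return mapping[carbons]
    return default
```

With such a helper, the inner loops could be replaced by `lipid_colors = [lipid_color(name) for name in df_sample['Lipid']]`.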


In [19]:
import plot
# Specify the output directory and create it if it doesn't exist
output_directory = "Projects/test_py/plots/ratios/"
os.makedirs(output_directory, exist_ok=True)
df_plot = df_matched_6.copy()

# Define color mappings for Lipid patterns
color_mapping = {
    '50': 'red',
    '51': 'brown',
    '52': 'blue',
    '53': 'purple',
    '54': 'green',
}

plot.plot_ratios(df_plot, color_mapping, output_directory)
