In [1]:
from IPython.display import display, Image, clear_output

# Comprehensive Lipidome Automation Workflow (CLAW)

Welcome to CLAW, a tool designed to facilitate and optimize the processing of lipidomic MRM data. This Jupyter notebook encapsulates a suite of tools that streamline the various stages of lipidomics data analysis.

Our toolset enables users to efficiently process MRM data files in the mzML format. Upload a file and CLAW will parse the data into a structured Pandas dataframe. This dataframe includes critical information like sample_ID, MRM transition, and signal intensity. Furthermore, our tool aligns each MRM transition with a default or custom lipid_database for accurate and swift annotation.

Moreover, CLAW is equipped with an OzESI option, a tool to elucidate the double bond location in lipid isomers. This feature allows users to input OzESI data and pinpoint the precise location of double bonds in isomeric lipids. Users have the flexibility to select which double bond locations they want to analyze. Following this, CLAW autonomously predicts potential m/z values and cross-references these predictions with sample data, ensuring a comprehensive and meticulous analysis.

By automating these steps, CLAW removes the need for manual data processing and cuts the time spent on it. This makes large volumes of lipid MRM data practical to handle in lipidomics studies.

In [2]:
#Import all the necessary python libraries
import pymzml
import csv
import os
import pandas as pd
import numpy as np
import plotly.graph_objs as go
import json

#Import all the necessary CLAW libraries
import create_directory
import CLAW

import warnings

# Suppress FutureWarning messages only (other warnings are still shown)
warnings.simplefilter(action='ignore', category=FutureWarning)


## Directory and File Management
For structured data management and efficient workflow, the system first ensures the presence of an output directory. If such a directory already exists you can skip this step.
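
As a rough illustration of what "ensuring the presence of an output directory" can look like, the sketch below creates a project folder with subfolders and is safe to run repeatedly. It is not the code behind `create_directory.create_project_folder()`; the helper name `ensure_project_folders` and the `mzml` subfolder name are assumptions, while `o3on`, `o2only`, and `results` are taken from the paths used in this notebook.

```python
import tempfile
from pathlib import Path


def ensure_project_folders(root, project_name,
                           subfolders=("mzml", "o3on", "o2only", "results")):
    """Create root/project_name/<subfolder> for each subfolder (hypothetical helper).

    Existing folders are left untouched, so calling this twice is harmless.
    """
    project_dir = Path(root) / project_name
    created = []
    for sub in subfolders:
        folder = project_dir / sub
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Demonstrate inside a throwaway directory so no real project is touched
with tempfile.TemporaryDirectory() as tmp:
    folders = ensure_project_folders(tmp, "FaceFats")
    ensure_project_folders(tmp, "FaceFats")  # second call does not raise
    folder_names = sorted(f.name for f in folders)
    all_exist = all(f.is_dir() for f in folders)
```

Because `mkdir` is called with `exist_ok=True`, the step can be skipped or rerun without changing the result.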

In [3]:
# # Create the output directory. If it already exists you can skip this step.
# create_directory.create_project_folder()


The name of the project is defined next. This is important as the created directory will bear this name, allowing users to manage and identify their data with ease.

After the mzML files are uploaded to the designated mzML folder, the next block of code segregates these files based on their characteristics. More specifically, it filters the files and transfers them to respective folders named 'o3on' and 'o2only'.

In [4]:
name_of_project = 'FaceFats'
# After you load the mzML files into the mzml folder, this filters them and moves them to the o3on and o2only folders
create_directory.filter_o3mzml_files(name_of_project)

One or both of the destination directories 'Projects/FaceFats/o3on' and 'Projects/FaceFats/o2only' do not exist.


## Pre-Parsing Setup
The following block of code sets the preset variable values and passes them to `CLAW.pre_parsing_setup`, which prints them so they can be checked before the mzML files are parsed. The parsed data, including the sample ID, MRM transitions, and intensities, is stored in a pandas dataframe for easy manipulation and analysis.

The function `CLAW.parsing_mzml_to_df` takes several arguments. `data_base_name_location` is the location of the lipid database that contains information on lipid classes, fatty acid chains, and their corresponding MRM transitions. `Project_Folder_data` is the location of the mzML files for the samples to be analyzed. `tolerance` defines the acceptable range of deviation for the MRM transitions when matching them with the lipid database. The argument `remove_std` is a boolean that, when `True`, removes the MRM transitions that correspond to standards (internal or external) present in the samples.

The function outputs a pandas dataframe (df) where each row corresponds to an MRM transition detected in a sample, and columns include the sample ID, MRM transition, and intensity of the transition, among other values.
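
To make the shape of that dataframe concrete, here is a minimal sketch that builds the `Transition` column from parent and product ion m/z values. The warning text printed later in this notebook suggests that the ions are rounded to one decimal with `np.round` before the transition label is built; the rest (the toy values, the `sample_A` ID, and the string formatting) is an assumption for illustration, not CLAW's code.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for parsed MRM data (values are made up)
raw = pd.DataFrame({
    "Sample_ID": ["sample_A", "sample_A"],
    "Parent_Ion": [584.43, 612.38],
    "Product_Ion": [437.29, 437.31],
    "Intensity": [25452.2, 23887.6],
})

parsed = raw.copy()  # work on a copy so the assignments below are unambiguous
parsed["Parent_Ion"] = np.round(parsed["Parent_Ion"], 1)
parsed["Product_Ion"] = np.round(parsed["Product_Ion"], 1)

# Label each row as "<parent> -> <product>" with one decimal place
parsed["Transition"] = (
    parsed["Parent_Ion"].map("{:.1f}".format)
    + " -> "
    + parsed["Product_Ion"].map("{:.1f}".format)
)
```

Rounding first means that two scans whose m/z values differ only in the second decimal end up with the same `Transition` label.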

In [5]:
# Set default values
data_base_name_location = 'lipid_database/Lipid_Database.xlsx'
Project = './Projects/'
Project_Name = 'FaceFats'
Project_Folder_data = Project + Project_Name + '/ozoff_ozon/'
Project_results = Project + Project_Name + '/results/'
file_name_to_save = 'FaceFats'
tolerance = 0.3
remove_std = True
save_data = True

# Call pre_parsing_setup to initialize the variables
data_base_name_location, Project_Folder_data, Project_results, file_name_to_save, tolerance, remove_std, save_data = CLAW.pre_parsing_setup(data_base_name_location,
 Project, 
 Project_Name, 
 Project_Folder_data,
 Project_results, 
 file_name_to_save, 
 tolerance, 
 remove_std,
 save_data)


data_base_name_location: lipid_database/Lipid_Database.xlsx
Project: ./Projects/
Project_Name: FaceFats
Project_Folder_data: ./Projects/FaceFats/ozoff_ozon/
Project_results: ./Projects/FaceFats/results/
file_name_to_save: FaceFats
tolerance: 0.3
remove_std: True
save_data: True


Define the master dataframes where the data will be stored during the parsing step.

In [6]:
time_and_intensity_df, master_df, OzESI_time_df = CLAW.create_analysis_dataframes()

## CLAW.full_parse()
Here the `CLAW.full_parse()` function is used to analyze the MRM data. It takes the location of the lipid database, the paths to the data and results folders, the name of the result files, and the tolerance for matching MRM transitions. The function returns two dataframes, assigned below to `df_MRM` and `df_OzESI`: the first contains information about each detected lipid species and its corresponding MRM transition, and the second captures data related to OzESI-MS scans, including potential double bond locations of lipids. If `remove_std` is `True`, it removes MRM transitions related to standards from the dataframe, and if `save_data` is `True`, the dataframe is saved as a .csv file in the specified results folder. Note that the call below passes `remove_std=True` and `save_data=False` explicitly rather than reusing the values from the setup cell.

In [7]:
# Use the initialized variables as arguments to full_parse
df_MRM, df_OzESI = CLAW.full_parse(data_base_name_location, 
                                               Project_Folder_data, 
                                               Project_results, 
                                               file_name_to_save, 
                                               tolerance, 
                                               remove_std=True, 
                                               save_data=False,
                                               batch_processing=True,
                                               plot_chromatogram=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Parent_Ion'] = np.round(lipid_MRM_data['Parent_Ion'], 1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Product_Ion'] = np.round(lipid_MRM_data['Product_Ion'], 1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Transition'] = lipid_MRM_data['Parent_Ion

Finished parsing mzML file: ./Projects/FaceFats/ozoff_ozon/11292023_DOD93_F4_5xFAD_cereb_O3on_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/ozoff_ozon/11292023_DOD93_F4_5xFAD_cortex_O3on_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/ozoff_ozon/11292023_DOD93_F4_5xFAD_dienc_O3on_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/ozoff_ozon/11292023_DOD93_F4_5xFAD_hippo_O3on_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/ozoff_ozon/11292023_DOD94_F4_5xFAD_cereb_O3on_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/ozoff_ozon/11292023_DOD94_F4_5xFAD_cortex_O3on_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/ozoff_ozon/11292023_DOD94_F4_5xFAD_dienc_O3on_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/ozoff_ozon/11292023_DOD94_F4_5xFAD_hippo_O3on_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/ozoff_ozon/11292023_FAD185_M1_5xFAD_cereb_O3on_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/ozoff_ozon/11292023_FA

In [8]:
df_MRM.head(None)

Unnamed: 0,Class,Intensity,Lipid,Parent_Ion,Product_Ion,Sample_ID,Transition
0,,25452.201778,,584.4,437.3,11292023_DOD93_F4_5xFAD_cereb_O3on_01,584.4 -> 437.3
1,,23887.601665,,612.4,437.3,11292023_DOD93_F4_5xFAD_cereb_O3on_01,612.4 -> 437.3
2,,32008.842213,,622.5,503.4,11292023_DOD93_F4_5xFAD_cereb_O3on_01,622.5 -> 503.4
3,,28793.361992,,624.5,505.4,11292023_DOD93_F4_5xFAD_cereb_O3on_01,624.5 -> 505.4
4,,25249.781784,,626.5,437.3,11292023_DOD93_F4_5xFAD_cereb_O3on_01,626.5 -> 437.3
...,...,...,...,...,...,...,...
7195,TAG,33222.982410,"[TG(57:9),TG(56:2)]_FA18:1",932.9,633.6,11302023_FAD189_M2_5xFAD_hippo_O3off_01,932.9 -> 633.6
7196,TAG,23429.881687,"[TG(58:7),TG(57:0)]_FA18:1",950.9,651.6,11302023_FAD189_M2_5xFAD_hippo_O3off_01,950.9 -> 651.6
7197,TAG,23313.761707,"[TG(59:13),TG(58:6)]_FA18:1",952.8,653.5,11302023_FAD189_M2_5xFAD_hippo_O3off_01,952.8 -> 653.5
7198,TAG,23350.421665,"[TG(59:12),TG(58:5)]_FA18:1",954.8,655.5,11302023_FAD189_M2_5xFAD_hippo_O3off_01,954.8 -> 655.5


In [9]:
df_OzESI.head(None)
# df_OzESI.to_csv('FF_OzOFF_full.csv')
# df_OzESI.to_excel('FaceFatsOzdf.xlsx')

Unnamed: 0,Lipid,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition
0,,584.4,437.3,0.044183,41.720001,11292023_DOD93_F4_5xFAD_cereb_O3on_01,584.4 -> 437.3
1,,584.4,437.3,0.088567,41.680004,11292023_DOD93_F4_5xFAD_cereb_O3on_01,584.4 -> 437.3
2,,584.4,437.3,0.132967,41.240002,11292023_DOD93_F4_5xFAD_cereb_O3on_01,584.4 -> 437.3
3,,584.4,437.3,0.177367,41.800003,11292023_DOD93_F4_5xFAD_cereb_O3on_01,584.4 -> 437.3
4,,584.4,437.3,0.221783,41.640003,11292023_DOD93_F4_5xFAD_cereb_O3on_01,584.4 -> 437.3
...,...,...,...,...,...,...,...
4053552,,956.9,657.6,24.779100,41.120003,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6
4053553,,956.9,657.6,24.823500,41.220001,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6
4053554,,956.9,657.6,24.867917,41.100002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6
4053555,,956.9,657.6,24.912317,41.260002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6


Lipid matching takes two calls, which this notebook runs later (In [19]) on the clustered data. `read_mrm_list()` reads the MRM database from the specified file location and returns it as a pandas DataFrame, `mrm_database`. `match_lipids_parser()` then matches the transitions detected in the OzESI-MS scans against the known lipids in `mrm_database`, based on the MRM transitions within the specified `tolerance`, and returns a DataFrame of matched lipid species. Before that, the next cell drops the `Lipid` column from `df_OzESI` and keeps only the scans inside a chosen retention time window.
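
As a rough illustration of tolerance-based matching, the sketch below pairs every detected transition with every database entry and keeps the pairs whose parent and product ions both lie within `tolerance`. It is not how `CLAW.match_lipids_parser()` is implemented; the function `match_transitions`, the placeholder lipid names, and the database m/z values are assumptions. Joining multiple hits with ` | ` mirrors the combined names that appear in this notebook's output.

```python
import pandas as pd


def match_transitions(detected, database, tolerance=0.3):
    """Annotate detected transitions with database lipids (hypothetical helper)."""
    # Pair every detected transition with every database entry
    merged = detected.merge(database, how="cross", suffixes=("", "_db"))
    within = (
        ((merged["Parent_Ion"] - merged["Parent_Ion_db"]).abs() <= tolerance)
        & ((merged["Product_Ion"] - merged["Product_Ion_db"]).abs() <= tolerance)
    )
    hits = merged[within]
    # Join the names when several database entries fit one detected transition
    lipids = (
        hits.groupby(["Parent_Ion", "Product_Ion"])["Lipid"]
        .agg(" | ".join)
        .reset_index()
    )
    # Left merge keeps unmatched transitions, with a missing Lipid value
    return detected.merge(lipids, on=["Parent_Ion", "Product_Ion"], how="left")


detected = pd.DataFrame({"Parent_Ion": [932.9, 584.4],
                         "Product_Ion": [633.6, 437.3]})
database = pd.DataFrame({"Lipid": ["lipid_X", "lipid_Y"],
                         "Parent_Ion": [932.8, 700.0],
                         "Product_Ion": [633.5, 500.0]})

result = match_transitions(detected, database, tolerance=0.3)
```

A cross join is simple to read but grows with the product of the two table sizes, so it is only a sensible sketch for small transition lists.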

In [10]:
# Drop the first column (Lipid) and keep the remaining columns
d1 = df_OzESI.iloc[:, 1:9]

# Define the retention time range as a tuple (lower_bound, upper_bound)
retention_time_range = (10, 25)  # Replace with your specific range values

# Keep only rows where Retention_Time is within the range (both bounds included)
filtered_d1 = d1[d1['Retention_Time'].between(*retention_time_range)]

filtered_d1

Unnamed: 0,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition
225,584.4,437.3,10.035717,41.640003,11292023_DOD93_F4_5xFAD_cereb_O3on_01,584.4 -> 437.3
226,584.4,437.3,10.080133,41.920002,11292023_DOD93_F4_5xFAD_cereb_O3on_01,584.4 -> 437.3
227,584.4,437.3,10.124533,41.720001,11292023_DOD93_F4_5xFAD_cereb_O3on_01,584.4 -> 437.3
228,584.4,437.3,10.168950,41.720001,11292023_DOD93_F4_5xFAD_cereb_O3on_01,584.4 -> 437.3
229,584.4,437.3,10.213350,41.660004,11292023_DOD93_F4_5xFAD_cereb_O3on_01,584.4 -> 437.3
...,...,...,...,...,...,...
4053552,956.9,657.6,24.779100,41.120003,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6
4053553,956.9,657.6,24.823500,41.220001,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6
4053554,956.9,657.6,24.867917,41.100002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6
4053555,956.9,657.6,24.912317,41.260002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6


DBSCAN
Cluster Data

All Samples Clustering
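
The toy example below is a sketch of why the clustering is run on all samples pooled together, using the same `eps=0.09` and `min_samples=15` as this notebook. The scan spacing of about 0.0444 is read off the `Retention_Time` column shown above; the peak positions and the number of samples are made up. With that spacing, a point from a single sample has at most five neighbours within `eps`, so a peak only becomes a cluster when several samples contribute scans at the same retention time.

```python
import numpy as np
from sklearn.cluster import DBSCAN

scan_step = 0.0444  # approximate spacing between consecutive scans

# A peak seen in four samples: ten scans per sample at the same retention times
shared_peak = np.tile(12.0 + scan_step * np.arange(10), 4)

# A peak seen in one sample only: ten scans
single_peak = 20.0 + scan_step * np.arange(10)

# DBSCAN expects a 2-D array of shape (n_points, n_features)
retention_times = np.concatenate([shared_peak, single_peak]).reshape(-1, 1)

labels = DBSCAN(eps=0.09, min_samples=15).fit(retention_times).labels_

shared_labels = set(labels[:40].tolist())   # labels of the pooled four-sample peak
single_labels = set(labels[40:].tolist())   # labels of the single-sample peak
```

The pooled peak receives cluster label 0, while every point of the single-sample peak is labelled -1 (noise), which is the label filtered out later in the workflow.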

In [11]:
import pandas as pd
from sklearn.cluster import DBSCAN

clustered_groups = []  # Collect the clustered rows of each ion pair

# Iterate over each group of Parent_Ion and Product_Ion
for (parent_ion, product_ion), group in filtered_d1.groupby(['Parent_Ion', 'Product_Ion']):
    # Cutoff at 1% of the maximum 'OzESI_Intensity' for this ion pair
    max_intensity = group['OzESI_Intensity'].max()
    cutoff_intensity = max_intensity * 0.01

    # Apply both intensity filters; .copy() avoids the SettingWithCopyWarning when adding a column below
    filtered_group = group[(group['OzESI_Intensity'] >= cutoff_intensity) & (group['OzESI_Intensity'] > 500)].copy()

    # Skip ion pairs with no rows left after filtering
    if filtered_group.empty:
        continue

    # DBSCAN expects a 2-D array, so select 'Retention_Time' as a one-column frame
    retention_times = filtered_group[['Retention_Time']].values

    # Apply DBSCAN clustering to this specific ion pair group (all samples pooled)
    dbscan = DBSCAN(eps=0.09, min_samples=15).fit(retention_times)

    # Add the cluster labels to the filtered group; -1 marks noise points
    filtered_group['Cluster_Label'] = dbscan.labels_

    clustered_groups.append(filtered_group)

# Combine the separately clustered ion pairs into one DataFrame
clustered_data = pd.concat(clustered_groups) if clustered_groups else pd.DataFrame()
clustered_data


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Unnamed: 0,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition,Cluster_Label
2028402,622.5,503.4,22.202617,685.160034,11302023_DOD93_F4_5xFAD_cereb_O3off_01,622.5 -> 503.4,-1
2450587,622.5,503.4,19.493767,500.980042,11302023_DOD94_F4_5xFAD_cortex_O3off_01,622.5 -> 503.4,-1
2450588,622.5,503.4,19.538167,553.400024,11302023_DOD94_F4_5xFAD_cortex_O3off_01,622.5 -> 503.4,-1
2788457,622.5,503.4,22.691100,507.360046,11302023_FAD185_M1_5xFAD_cortex_O3off_01,622.5 -> 503.4,-1
2957315,622.5,503.4,20.914817,547.780029,11302023_FAD185_M1_5xFAD_hippo_O3off_01,622.5 -> 503.4,-1
...,...,...,...,...,...,...,...
3124437,956.9,657.6,16.963483,605.400024,11302023_FAD185_M3_5xFAD_cereb_O3off_01,956.9 -> 657.6,-1
3124438,956.9,657.6,17.007883,1224.380127,11302023_FAD185_M3_5xFAD_cereb_O3off_01,956.9 -> 657.6,-1
3293334,956.9,657.6,16.919083,662.660034,11302023_FAD185_M3_5xFAD_dienc_O3off_01,956.9 -> 657.6,-1
3293335,956.9,657.6,16.963483,1544.320068,11302023_FAD185_M3_5xFAD_dienc_O3off_01,956.9 -> 657.6,-1


Group data by transition
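
A small sketch of how `groupby(...).ngroup()` numbers the transitions: each distinct (`Parent_Ion`, `Product_Ion`) pair gets one integer, assigned in sorted key order by default, and every row of that pair carries the same number. The m/z values are taken from this notebook's output; the row order is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Parent_Ion": [694.6, 622.5, 694.6, 622.5, 932.9],
    "Product_Ion": [505.4, 503.4, 505.4, 503.4, 633.6],
})

# One integer per distinct (Parent_Ion, Product_Ion) pair, numbered in sorted key order
toy["Group"] = toy.groupby(["Parent_Ion", "Product_Ion"]).ngroup()

group_ids = toy["Group"].tolist()
```

Here the pair with the lowest parent ion (622.5) becomes group 0 regardless of where its rows sit in the table.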

In [12]:
# Group the clustered data by Parent_Ion and Product_Ion (one group per transition)
grouped_cluster_data = clustered_data.groupby(['Parent_Ion', 'Product_Ion'])
# Store each row's group number in a new column called Group
clustered_data['Group'] = grouped_cluster_data.ngroup()
clustered_data

Unnamed: 0,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition,Cluster_Label,Group
2028402,622.5,503.4,22.202617,685.160034,11302023_DOD93_F4_5xFAD_cereb_O3off_01,622.5 -> 503.4,-1,0
2450587,622.5,503.4,19.493767,500.980042,11302023_DOD94_F4_5xFAD_cortex_O3off_01,622.5 -> 503.4,-1,0
2450588,622.5,503.4,19.538167,553.400024,11302023_DOD94_F4_5xFAD_cortex_O3off_01,622.5 -> 503.4,-1,0
2788457,622.5,503.4,22.691100,507.360046,11302023_FAD185_M1_5xFAD_cortex_O3off_01,622.5 -> 503.4,-1,0
2957315,622.5,503.4,20.914817,547.780029,11302023_FAD185_M1_5xFAD_hippo_O3off_01,622.5 -> 503.4,-1,0
...,...,...,...,...,...,...,...,...
3124437,956.9,657.6,16.963483,605.400024,11302023_FAD185_M3_5xFAD_cereb_O3off_01,956.9 -> 657.6,-1,59
3124438,956.9,657.6,17.007883,1224.380127,11302023_FAD185_M3_5xFAD_cereb_O3off_01,956.9 -> 657.6,-1,59
3293334,956.9,657.6,16.919083,662.660034,11302023_FAD185_M3_5xFAD_dienc_O3off_01,956.9 -> 657.6,-1,59
3293335,956.9,657.6,16.963483,1544.320068,11302023_FAD185_M3_5xFAD_dienc_O3off_01,956.9 -> 657.6,-1,59


Plot clustered data if needed for validation

In [13]:
# import matplotlib.pyplot as plt
# import matplotlib.patches as mpatches
# import pandas as pd

# save_dir = 'Projects/FaceFats/plots/validation/'
# # Assuming clustered_data is your DataFrame with the necessary columns

# # Define custom colors for the clusters
# color_map = {-1: 'black', 0: 'green', 1: 'blue', 2: 'red', 3: 'pink'}

# # Get unique Group values
# unique_groups = clustered_data['Group'].unique()

# # Iterate through each unique group
# for group in unique_groups:
#     # Filter the data for each Group
#     group_filtered_data = clustered_data[clustered_data['Group'] == group]
#     colors = group_filtered_data['Cluster_Label'].map(color_map)

#     # Extract the corresponding Transition value for the group
#     # Assuming that all rows in a group have the same Transition value
#     transition_value = group_filtered_data['Transition'].iloc[0]

#     # Set up the plot for each group
#     plt.figure(figsize=(10, 6))
#     plt.scatter(group_filtered_data['Retention_Time'], group_filtered_data['OzESI_Intensity'], color=colors)

#     # Add labels and title
#     plt.xlabel('Retention Time')
#     plt.ylabel('OzESI Intensity')
#     plt.title(f'Scatter Plot for Group {group} (Transition: {transition_value})')

#     # Create a legend for the plot
#     patch_list = [mpatches.Patch(color=color, label=f'Cluster {label}') for label, color in color_map.items()]
#     plt.legend(handles=patch_list, loc='upper left', title='Cluster Labels')
    
#     # Save the plot as a PNG file in the specified directory
#     filename = f'Group_{group}_Transition_{transition_value}.png'
#     plt.savefig(os.path.join(save_dir, filename), bbox_inches='tight')


#     # Show the plot
#     plt.show()



Max intensity, mean RT
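
The per-cluster maximum and mean can be attached to every row with `groupby(...).transform(...)`, which returns a result aligned to the original rows instead of one row per group. A minimal sketch on made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "Group": [0, 0, 0, 1, 1],
    "Cluster_Label": [0, 0, 1, 0, 0],
    "Retention_Time": [12.0, 12.1, 15.0, 18.0, 18.2],
    "OzESI_Intensity": [600.0, 900.0, 700.0, 1500.0, 1200.0],
})

keys = ["Group", "Cluster_Label"]

# transform keeps the original row count and repeats each cluster's statistic on its rows
toy["Max_OzESI_Intensity"] = toy.groupby(keys)["OzESI_Intensity"].transform("max")
toy["Average_Retention_Time"] = toy.groupby(keys)["Retention_Time"].transform("mean")

max_values = toy["Max_OzESI_Intensity"].tolist()
mean_values = toy["Average_Retention_Time"].tolist()
```

This is why the same `Max_OzESI_Intensity` and `Average_Retention_Time` values repeat down the rows of one cluster in the tables of this notebook.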

In [14]:
import pandas as pd

# Max OzESI_Intensity for each Group and Cluster_Label, written onto every row of that cluster
clustered_data['Max_OzESI_Intensity'] = clustered_data.groupby(['Group', 'Cluster_Label'])['OzESI_Intensity'].transform('max')

# # Step 2: Identify the cluster with the highest Max_OzESI_Intensity for each Group
# group_max_cluster = clustered_data.groupby('Group')['Max_OzESI_Intensity'].idxmax()

# Average Retention_Time for each Parent_Ion, Product_Ion and Cluster_Label,
# written onto every row of that cluster
clustered_data['Average_Retention_Time'] = clustered_data.groupby(['Parent_Ion', 'Product_Ion', 'Cluster_Label'])['Retention_Time'].transform('mean')

# # Step 3: Create a DataFrame with only those rows that belong to the identified clusters
# filtered_data = clustered_data.loc[group_max_cluster]

# Drop the noise points (rows where 'Cluster_Label' is -1)
filtered_df = clustered_data[clustered_data['Cluster_Label'] != -1]

filtered_df


Unnamed: 0,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition,Cluster_Label,Group,Max_OzESI_Intensity,Average_Retention_Time
1019850,694.6,505.4,11.986050,1639.940063,11292023_FAD185_M3_5xFAD_cereb_O3on_01,694.6 -> 505.4,0,5,3431.980225,12.045271
1019851,694.6,505.4,12.030450,2286.900146,11292023_FAD185_M3_5xFAD_cereb_O3on_01,694.6 -> 505.4,0,5,3431.980225,12.045271
1019852,694.6,505.4,12.074867,1462.900146,11292023_FAD185_M3_5xFAD_cereb_O3on_01,694.6 -> 505.4,0,5,3431.980225,12.045271
1019853,694.6,505.4,12.119267,609.940063,11292023_FAD185_M3_5xFAD_cereb_O3on_01,694.6 -> 505.4,0,5,3431.980225,12.045271
1188748,694.6,505.4,11.986067,1172.420044,11292023_FAD185_M3_5xFAD_dienc_O3on_01,694.6 -> 505.4,0,5,3431.980225,12.045271
...,...,...,...,...,...,...,...,...,...,...
4051151,932.9,633.6,18.163700,1315.160034,11302023_FAD189_M2_5xFAD_hippo_O3off_01,932.9 -> 633.6,0,55,3828.580322,18.285265
4051152,932.9,633.6,18.208100,1249.500122,11302023_FAD189_M2_5xFAD_hippo_O3off_01,932.9 -> 633.6,0,55,3828.580322,18.285265
4051153,932.9,633.6,18.252500,604.360046,11302023_FAD189_M2_5xFAD_hippo_O3off_01,932.9 -> 633.6,0,55,3828.580322,18.285265
4051154,932.9,633.6,18.296917,531.280029,11302023_FAD189_M2_5xFAD_hippo_O3off_01,932.9 -> 633.6,0,55,3828.580322,18.285265


Choose Cluster with correct RT

In [15]:
# # Assuming filtered_df is your DataFrame

# # Step 1: Identify the combination with the highest Max_OzESI_Intensity for each Group
# group_max_intensity_combination = filtered_df.groupby('Group').apply(lambda x: x.loc[x['Max_OzESI_Intensity'].idxmax()])
# print('group max:', group_max_intensity_combination)
# # Identify which Cluster_Label this belongs to
# group_to_cluster = group_max_intensity_combination.set_index('Group')['Cluster_Label'].to_dict()
# print('group to cluster:', group_to_cluster)

# # Step 2: Drop other Cluster_Labels from the df for that specific group
# filtered_df = filtered_df[filtered_df.apply(lambda x: x['Cluster_Label'] == group_to_cluster[x['Group']], axis=1)]

# filtered_df

Filter max intensity for each Sample_ID and Group
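
The same kind of selection can be sketched without `apply`: `idxmax` returns the index label of the most intense row in each (`Sample_ID`, `Group`) pair, and `.loc` then pulls those rows. The toy values and sample names below are made up, and this is an alternative formulation rather than the code this notebook runs.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sample_ID": ["s1", "s1", "s1", "s2", "s2"],
    "Group": [5, 5, 8, 5, 5],
    "Retention_Time": [11.9, 12.0, 21.5, 12.0, 12.1],
    "OzESI_Intensity": [1600.0, 2300.0, 600.0, 1100.0, 900.0],
})

# Index label of the most intense row within each (Sample_ID, Group) pair
apex_idx = toy.groupby(["Sample_ID", "Group"])["OzESI_Intensity"].idxmax()

# One row per sample and transition group: the peak apex
apex = toy.loc[apex_idx].reset_index(drop=True)

apex_intensities = apex["OzESI_Intensity"].tolist()
apex_times = apex["Retention_Time"].tolist()
```

This relies on the index labels being unique, and it keeps the original column dtypes because whole rows are selected rather than rebuilt.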

In [16]:
import pandas as pd

# Group by 'Sample_ID' and 'Group', and keep the row with the highest 'OzESI_Intensity' in each group
filtered_df2 = filtered_df.groupby(['Sample_ID', 'Group']).apply(lambda x: x.loc[x['OzESI_Intensity'].idxmax()]).reset_index(drop=True)

filtered_df2


Unnamed: 0,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition,Cluster_Label,Group,Max_OzESI_Intensity,Average_Retention_Time
0,736.7,547.5,21.575283,599.980042,11292023_DOD93_F4_5xFAD_cereb_O3on_01,736.7 -> 547.5,0,8,1183.760132,21.528528
1,788.7,599.5,13.480633,967.360046,11292023_DOD93_F4_5xFAD_cereb_O3on_01,788.7 -> 599.5,0,20,33090.062500,13.479926
2,790.7,601.5,14.634317,1444.960083,11292023_DOD93_F4_5xFAD_cereb_O3on_01,790.7 -> 601.5,0,21,84424.968750,14.694375
3,792.7,603.5,15.877117,9305.840820,11292023_DOD93_F4_5xFAD_cereb_O3on_01,792.7 -> 603.5,1,22,96035.085938,16.202419
4,794.7,605.5,17.075800,2328.940186,11292023_DOD93_F4_5xFAD_cereb_O3on_01,794.7 -> 605.5,1,23,13066.581055,17.112404
...,...,...,...,...,...,...,...,...,...,...
568,900.8,601.5,14.837100,12376.260742,11302023_FAD189_M2_5xFAD_hippo_O3off_01,900.8 -> 601.5,0,43,821874.000000,14.823158
569,902.8,603.5,15.946983,63850.085938,11302023_FAD189_M2_5xFAD_hippo_O3off_01,902.8 -> 603.5,0,45,599856.187500,16.366210
570,904.8,605.5,17.190067,9394.621094,11302023_FAD189_M2_5xFAD_hippo_O3off_01,904.8 -> 605.5,0,46,83030.585938,17.304423
571,906.8,607.5,17.100950,1047.360107,11302023_FAD189_M2_5xFAD_hippo_O3off_01,906.8 -> 607.5,0,47,4937.580566,17.197895


Match lipids to this df

In [17]:
# mrm_database = CLAW.read_mrm_list(data_base_name_location, deuterated=False)
# matched_df = CLAW.match_lipids_parser(mrm_database, filtered_df2, tolerance=0.3)
# matched_df

Print out Lipid RTs for validation

In [18]:
# import pandas as pd

# # Assuming your DataFrame is named df

# # Filter the DataFrame to keep only unique lipids
# unique_lipids_df = matched_df.drop_duplicates(subset=['Lipid'])
# sorted_unique_lipids_df = unique_lipids_df.sort_values(by=['Lipid'])
# sorted_unique_lipids_df.head(50)
# # Now, unique_lipids_df contains only rows with unique values in the 'Lipid' column

# print(len(sorted_unique_lipids_df))
# # Loop through the DataFrame and print 'Lipid' and 'Retention_Time'
# for index, row in sorted_unique_lipids_df.iterrows():
#     print(f"Lipid: {row['Lipid']}, avg Retention Time: {row['Average_Retention_Time']}")

# #save this to a csv file with only the two columns Lipid and Retention Time
# sorted_unique_lipids_df.to_csv('FF_MRM_CorrectRT2.csv', columns=['Lipid', 'Average_Retention_Time','Parent_Ion','Product_Ion'], index=False)
# sorted_unique_lipids_df


Validation plots

In [19]:
mrm_database = CLAW.read_mrm_list(data_base_name_location, deuterated=False)
matched_df_v = CLAW.match_lipids_parser(mrm_database, filtered_df, tolerance=0.3)
matched_df_v

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Parent_Ion'] = np.round(lipid_MRM_data['Parent_Ion'], 1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Product_Ion'] = np.round(lipid_MRM_data['Product_Ion'], 1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Transition'] = lipid_MRM_data['Parent_Ion

Unnamed: 0,Average_Retention_Time,Class,Cluster_Label,Group,Lipid,Max_OzESI_Intensity,OzESI_Intensity,Parent_Ion,Product_Ion,Retention_Time,Sample_ID,Transition
1019850,12.045271,,0,5,,3431.980225,1639.940063,694.6,505.4,11.986050,11292023_FAD185_M3_5xFAD_cereb_O3on_01,694.6 -> 505.4
1019851,12.045271,,0,5,,3431.980225,2286.900146,694.6,505.4,12.030450,11292023_FAD185_M3_5xFAD_cereb_O3on_01,694.6 -> 505.4
1019852,12.045271,,0,5,,3431.980225,1462.900146,694.6,505.4,12.074867,11292023_FAD185_M3_5xFAD_cereb_O3on_01,694.6 -> 505.4
1019853,12.045271,,0,5,,3431.980225,609.940063,694.6,505.4,12.119267,11292023_FAD185_M3_5xFAD_cereb_O3on_01,694.6 -> 505.4
1188748,12.045271,,0,5,,3431.980225,1172.420044,694.6,505.4,11.986067,11292023_FAD185_M3_5xFAD_dienc_O3on_01,694.6 -> 505.4
...,...,...,...,...,...,...,...,...,...,...,...,...
4051151,18.285265,TAG,0,55,"[TG(57:9),TG(56:2)]_FA18:1",3828.580322,1315.160034,932.9,633.6,18.163700,11302023_FAD189_M2_5xFAD_hippo_O3off_01,932.9 -> 633.6
4051152,18.285265,TAG,0,55,"[TG(57:9),TG(56:2)]_FA18:1",3828.580322,1249.500122,932.9,633.6,18.208100,11302023_FAD189_M2_5xFAD_hippo_O3off_01,932.9 -> 633.6
4051153,18.285265,TAG,0,55,"[TG(57:9),TG(56:2)]_FA18:1",3828.580322,604.360046,932.9,633.6,18.252500,11302023_FAD189_M2_5xFAD_hippo_O3off_01,932.9 -> 633.6
4051154,18.285265,TAG,0,55,"[TG(57:9),TG(56:2)]_FA18:1",3828.580322,531.280029,932.9,633.6,18.296917,11302023_FAD189_M2_5xFAD_hippo_O3off_01,932.9 -> 633.6


In [20]:
import pandas as pd

# Group by 'Lipid' and find the index of the row with the highest 'OzESI_Intensity' in each group
indices_of_max_intensity = matched_df_v.groupby('Lipid')['OzESI_Intensity'].idxmax()
# Note: this overwrites the per-cluster mean in 'Average_Retention_Time' with each row's own
# 'Retention_Time', rounded to 2 decimal places
matched_df_v['Average_Retention_Time'] = matched_df_v['Retention_Time'].round(2)

# Keep only the max-intensity row of each lipid, so every 'Lipid' value appears once
unique_lipids_df_v = matched_df_v.loc[indices_of_max_intensity]

sorted_unique_lipids_df_v = unique_lipids_df_v.sort_values(by=['Lipid'])

print(len(sorted_unique_lipids_df_v))
# Print 'Lipid', the rounded retention time of its max-intensity row, and 'Cluster_Label'
for index, row in sorted_unique_lipids_df_v.iterrows():
    print(f"Lipid: {row['Lipid']}, avg Retention Time: {row['Average_Retention_Time']}, {row['Cluster_Label']}")

# Optionally save the Lipid, Average_Retention_Time, Parent_Ion and Product_Ion columns to a csv file
# sorted_unique_lipids_df_v.to_csv('FF_OzOffvsOzON_test.csv', columns=['Lipid', 'Average_Retention_Time','Parent_Ion','Product_Ion'], index=False)
sorted_unique_lipids_df_v


17
Lipid: [TG(42:2)]_FA18:1 | DG(44:9),DG(43:2)_C18:1, avg Retention Time: 13.4, 0
Lipid: [TG(45:3)]_FA16:1, avg Retention Time: 12.28, 0
Lipid: [TG(50:3)]_FA18:1, avg Retention Time: 13.73, 0
Lipid: [TG(52:5)]_FA16:1, avg Retention Time: 12.4, 0
Lipid: [TG(52:6)]_FA18:2, avg Retention Time: 11.15, 0
Lipid: [TG(54:5)]_FA18:1, avg Retention Time: 13.64, 0
Lipid: [TG(54:6)]_FA18:1, avg Retention Time: 12.53, 0
Lipid: [TG(55:10),TG(54:3)]_FA18:1, avg Retention Time: 15.99, 0
Lipid: [TG(55:11),TG(54:4)]_FA18:1, avg Retention Time: 14.84, 0
Lipid: [TG(55:8),TG(54:1)]_FA18:1, avg Retention Time: 17.19, 0
Lipid: [TG(55:9),TG(54:2)]_FA18:1, avg Retention Time: 17.19, 0
Lipid: [TG(56:6)]_FA18:1, avg Retention Time: 13.77, 0
Lipid: [TG(56:7),TG(55:0)]_FA18:1, avg Retention Time: 13.5, 0
Lipid: [TG(57:10),TG(56:3)]_FA18:1, avg Retention Time: 17.05, 0
Lipid: [TG(57:11),TG(56:4)]_FA18:1, avg Retention Time: 15.9, 0
Lipid: [TG(57:12),TG(56:5)]_FA18:1, avg Retention Time: 14.75, 0
Lipid: [TG(57:9),T

Unnamed: 0,Average_Retention_Time,Class,Cluster_Label,Group,Lipid,Max_OzESI_Intensity,OzESI_Intensity,Parent_Ion,Product_Ion,Retention_Time,Sample_ID,Transition
180458,13.4,TAG | DAG,0,7,"[TG(42:2)]_FA18:1 | DG(44:9),DG(43:2)_C18:1",3586.900146,3586.900146,736.6,437.3,13.404683,11292023_DOD93_F4_5xFAD_dienc_O3on_01,736.6 -> 437.3
1381299,12.28,TAG,0,17,[TG(45:3)]_FA16:1,4149.860352,4149.860352,776.7,505.4,12.28485,11292023_FAD189_M1_5xFAD_cereb_O3on_01,776.7 -> 505.4
3105785,13.73,TAG,0,38,[TG(50:3)]_FA18:1,26481.722656,26481.722656,846.8,547.5,13.731417,11302023_FAD185_M3_5xFAD_cereb_O3off_01,846.8 -> 547.5
3279720,12.4,TAG,0,40,[TG(52:5)]_FA16:1,61686.785156,61686.785156,870.8,599.5,12.396817,11302023_FAD185_M3_5xFAD_dienc_O3off_01,870.8 -> 599.5
1504570,11.15,TAG,0,39,[TG(52:6)]_FA18:2,5064.280273,5064.280273,868.8,571.5,11.154017,11292023_FAD189_M1_5xFAD_cortex_O3on_01,868.8 -> 571.5
3282563,13.64,TAG,0,42,[TG(54:5)]_FA18:1,351194.34375,351194.34375,898.8,599.5,13.6387,11302023_FAD185_M3_5xFAD_dienc_O3off_01,898.8 -> 599.5
3281975,12.53,TAG,0,41,[TG(54:6)]_FA18:1,51666.304688,51666.304688,896.8,597.5,12.528833,11302023_FAD185_M3_5xFAD_dienc_O3off_01,896.8 -> 597.5
3284305,15.99,TAG,0,45,"[TG(55:10),TG(54:3)]_FA18:1",599856.1875,599856.1875,902.8,603.5,15.991383,11302023_FAD185_M3_5xFAD_dienc_O3off_01,902.8 -> 603.5
3283153,14.84,TAG,0,43,"[TG(55:11),TG(54:4)]_FA18:1",821874.0,821874.0,900.8,601.5,14.8371,11302023_FAD185_M3_5xFAD_dienc_O3off_01,900.8 -> 601.5
3285458,17.19,TAG,0,47,"[TG(55:8),TG(54:1)]_FA18:1",4937.580566,4937.580566,906.8,607.5,17.189767,11302023_FAD185_M3_5xFAD_dienc_O3off_01,906.8 -> 607.5

