In [1]:
from IPython.display import display, Image, clear_output

# Comprehensive Lipidome Automation Workflow (CLAW)

Welcome to CLAW, a tool designed to facilitate and optimize the processing of lipidomic MRM data. This Jupyter notebook encapsulates a suite of tools that streamline the various stages of lipidomics data analysis.

Our toolset enables users to efficiently process MRM data files in the mzML format. Upload a file and CLAW will parse the data into a structured Pandas dataframe. This dataframe includes critical information like sample_ID, MRM transition, and signal intensity. Furthermore, our tool aligns each MRM transition with a default or custom lipid_database for accurate and swift annotation.

Moreover, CLAW is equipped with an OzESI option, a tool to elucidate the double bond location in lipid isomers. This feature allows users to input OzESI data and pinpoint the precise location of double bonds in isomeric lipids. Users have the flexibility to select which double bond locations they want to analyze. Following this, CLAW autonomously predicts potential m/z values and cross-references these predictions with sample data, ensuring a comprehensive and meticulous analysis.

With automation at its core, CLAW eliminates the need for manual data processing, significantly reducing time expenditure. It is a robust and invaluable tool for handling large volumes of lipid MRM data, accelerating scientific discovery in the field of lipidomics.

In [2]:
#Import all the necessary python libraries
import pymzml
import csv
import os
import pandas as pd
import numpy as np
import plotly.graph_objs as go
import json
from scipy.integrate import trapezoid as trapz  # `trapz` is no longer available in recent SciPy releases
from tqdm import tqdm

# Import the CLAW modules
import create_directory
import CLAW

# Plotting and warning control
import matplotlib.pyplot as plt
import warnings

import re
from sklearn.mixture import GaussianMixture

# Suppress FutureWarning messages only; other warning categories are still shown
warnings.simplefilter(action='ignore', category=FutureWarning)


## Directory and File Management
For structured data management and efficient workflow, the system first ensures the presence of an output directory. If such a directory already exists you can skip this step.

In [3]:
# # Create the output directory. If it already exists you can skip this step.
# create_directory.create_project_folder()


The name of the project is defined next. This is important as the created directory will bear this name, allowing users to manage and identify their data with ease.

After the mzML files are uploaded to the designated mzML folder, the next block of code segregates these files based on their characteristics. More specifically, it filters the files and transfers them to respective folders named 'o3on' and 'o2only'.

In [4]:
name_of_project = 'FaceFats'
# After you load the mzML files into the mzml folder, this filters them and moves them into the o3on and o2only folders
create_directory.filter_o3mzml_files(name_of_project)

One or both of the destination directories 'Projects/FaceFats/o3on' and 'Projects/FaceFats/o2only' do not exist.


## Pre-Parsing Setup
The following block of code takes the preset variable values and uses them to parse the mzML files. The parsed data, including the sample ID, MRM transitions, and intensities, is stored in a pandas dataframe for easy manipulation and analysis.

The function CLAW.parsing_mzml_to_df takes several arguments. data_base_name_location is the location of the lipid database that contains information on lipid classes, fatty acid chains, and their corresponding MRM transitions. Project_Folder_data is the location of the mzML files for the samples to be analyzed. tolerance defines the acceptable range of deviation for the MRM transitions when matching them with the lipid database. The argument remove_std is a boolean that, when True, indicates to remove the MRM transitions that correspond to standards (internal or external) present in the samples.

The function outputs a pandas dataframe (df) where each row corresponds to an MRM transition detected in a sample, and columns include the sample ID, MRM transition, and intensity of the transition, among other values.
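
To make the role of `tolerance` concrete, here is a minimal sketch of tolerance-based matching on toy data. It is not CLAW's implementation: the helper `match_transitions`, the lipid names, and the m/z values are all hypothetical and only illustrate the idea that a transition is annotated when both its parent and product ions fall within `tolerance` of a database entry.

```python
import pandas as pd

def match_transitions(observed, database, tolerance=0.3):
    """Label each observed transition with the first database lipid whose
    parent and product ions both lie within `tolerance` (hypothetical helper)."""
    lipids = []
    for parent, product in zip(observed["Parent_Ion"], observed["Product_Ion"]):
        hits = database[
            ((database["Parent_Ion"] - parent).abs() <= tolerance)
            & ((database["Product_Ion"] - product).abs() <= tolerance)
        ]
        # Take the first hit, or leave the transition unannotated
        lipids.append(hits["Lipid"].iloc[0] if not hits.empty else None)
    return observed.assign(Lipid=lipids)

# Toy data; names and m/z values are made up for illustration
database = pd.DataFrame({
    "Lipid": ["Lipid_A", "Lipid_B"],
    "Parent_Ion": [900.8, 902.8],
    "Product_Ion": [601.5, 603.5],
})
observed = pd.DataFrame({
    "Parent_Ion": [900.9, 950.0],
    "Product_Ion": [601.4, 651.0],
})

matched = match_transitions(observed, database, tolerance=0.3)
print(matched)
```

In this sketch the first transition is labeled `Lipid_A` and the second stays unannotated. A wider `tolerance` annotates more transitions but raises the chance of assigning a neighboring lipid.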

In [5]:
# Set default values
data_base_name_location = 'lipid_database/Lipid_Database.xlsx'
Project = './Projects/'
Project_Name = 'FaceFats'
Project_Folder_data = Project + Project_Name + '/mzml_brain5xFAD_OFF/'
Project_results = Project + Project_Name + '/results/'
file_name_to_save = 'Brain5xFAD_OzON'
tolerance = 0.3
remove_std = True
save_data = True

# Call pre_parsing_setup to initialize the variables
data_base_name_location, Project_Folder_data, Project_results, file_name_to_save, tolerance, remove_std, save_data = CLAW.pre_parsing_setup(data_base_name_location,
 Project, 
 Project_Name, 
 Project_Folder_data,
 Project_results, 
 file_name_to_save, 
 tolerance, 
 remove_std,
 save_data)


data_base_name_location: lipid_database/Lipid_Database.xlsx
Project: ./Projects/
Project_Name: FaceFats
Project_Folder_data: ./Projects/FaceFats/mzml_brain5xFAD_OFF/
Project_results: ./Projects/FaceFats/results/
file_name_to_save: Brain5xFAD_OzON
tolerance: 0.3
remove_std: True
save_data: True


Define the master dataframes where the data will be stored during the parsing step.

In [6]:
time_and_intensity_df, master_df, OzESI_time_df = CLAW.create_analysis_dataframes()

## CLAW.full_parse()
In this code, the `CLAW.full_parse()` function is used to analyze the MRM data. It takes several parameters, such as the location of the lipid database, the paths to the data and results folders, the name of the result files, and the tolerance for matching MRM transitions. The function returns two dataframes, assigned below to `df_MRM` and `df_OzESI`: the first contains each detected MRM transition with its matched lipid species, and the second holds the OzESI-MS scan data (retention time and intensity for each transition in each sample). If `remove_std` is `True`, MRM transitions related to standards are removed from the dataframe, and if `save_data` is `True`, the dataframe is saved as a .csv file in the specified results folder. Note that the call below passes `remove_std=True` and `save_data=False` explicitly rather than reusing the values from the setup cell.

In [7]:
# Use the initialized variables as arguments to full_parse.
# remove_std and save_data are passed explicitly here, so the values from the setup cell are not used.
df_MRM, df_OzESI = CLAW.full_parse(data_base_name_location,
                                   Project_Folder_data,
                                   Project_results,
                                   file_name_to_save,
                                   tolerance,
                                   remove_std=True,
                                   save_data=False,
                                   batch_processing=True,
                                   plot_chromatogram=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Parent_Ion'] = np.round(lipid_MRM_data['Parent_Ion'], 1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Product_Ion'] = np.round(lipid_MRM_data['Product_Ion'], 1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  lipid_MRM_data['Transition'] = lipid_MRM_data['Parent_Ion

Finished parsing mzML file: ./Projects/FaceFats/mzml_brain5xFAD_OFF/11302023_DOD93_F4_5xFAD_cereb_O3off_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/mzml_brain5xFAD_OFF/11302023_DOD93_F4_5xFAD_cortex_O3off_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/mzml_brain5xFAD_OFF/11302023_DOD93_F4_5xFAD_dienc_O3off_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/mzml_brain5xFAD_OFF/11302023_DOD93_F4_5xFAD_hippo_O3off_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/mzml_brain5xFAD_OFF/11302023_DOD94_F4_5xFAD_cereb_O3off_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/mzml_brain5xFAD_OFF/11302023_DOD94_F4_5xFAD_cortex_O3off_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/mzml_brain5xFAD_OFF/11302023_DOD94_F4_5xFAD_dienc_O3off_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/mzml_brain5xFAD_OFF/11302023_DOD94_F4_5xFAD_hippo_O3off_01.mzML

Finished parsing mzML file: ./Projects/FaceFats/mzml_brain5xFAD_OFF/11302023_FAD185_M1_5xFAD_c

In [8]:
df_MRM.head(None)

Unnamed: 0,Class,Intensity,Lipid,Parent_Ion,Product_Ion,Sample_ID,Transition
0,,23995.461746,,584.4,437.3,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3
1,,23274.421654,,612.4,437.3,11302023_DOD93_F4_5xFAD_cereb_O3off_01,612.4 -> 437.3
2,,32844.782284,,622.5,503.4,11302023_DOD93_F4_5xFAD_cereb_O3off_01,622.5 -> 503.4
3,,29137.402119,,624.5,505.4,11302023_DOD93_F4_5xFAD_cereb_O3off_01,624.5 -> 505.4
4,,23972.741741,,626.5,437.3,11302023_DOD93_F4_5xFAD_cereb_O3off_01,626.5 -> 437.3
...,...,...,...,...,...,...,...
3595,TAG,33222.982410,"[TG(57:9),TG(56:2)]_FA18:1",932.9,633.6,11302023_FAD189_M2_5xFAD_hippo_O3off_01,932.9 -> 633.6
3596,TAG,23429.881687,"[TG(58:7),TG(57:0)]_FA18:1",950.9,651.6,11302023_FAD189_M2_5xFAD_hippo_O3off_01,950.9 -> 651.6
3597,TAG,23313.761707,"[TG(59:13),TG(58:6)]_FA18:1",952.8,653.5,11302023_FAD189_M2_5xFAD_hippo_O3off_01,952.8 -> 653.5
3598,TAG,23350.421665,"[TG(59:12),TG(58:5)]_FA18:1",954.8,655.5,11302023_FAD189_M2_5xFAD_hippo_O3off_01,954.8 -> 655.5


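Assign a `Match_Group` ID to `df_MRM` as well: every unique combination of `Parent_Ion`, `Product_Ion`, and `Sample_ID` gets its own integer ID.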
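
As a small illustration of how `groupby(...).ngroup()` numbers the groups, consider the toy frame below (the sample names are hypothetical). IDs are assigned in sorted key order, so rows that share all three key values receive the same ID, and the same transition measured in different samples receives adjacent IDs.

```python
import pandas as pd

# Toy data; sample names are placeholders
toy = pd.DataFrame({
    "Parent_Ion": [584.4, 584.4, 612.4, 584.4],
    "Product_Ion": [437.3, 437.3, 437.3, 437.3],
    "Sample_ID": ["sample_A", "sample_B", "sample_A", "sample_A"],
})

# One integer per unique (Parent_Ion, Product_Ion, Sample_ID) combination
toy["Match_Group"] = toy.groupby(["Parent_Ion", "Product_Ion", "Sample_ID"]).ngroup()
print(toy["Match_Group"].tolist())  # [0, 1, 2, 0]
```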

In [9]:
# Creating the Match_Group column
df_MRM['Match_Group'] = df_MRM.groupby(['Parent_Ion', 'Product_Ion', 'Sample_ID']).ngroup()

# Display the DataFrame with the new column
df_MRM

Unnamed: 0,Class,Intensity,Lipid,Parent_Ion,Product_Ion,Sample_ID,Transition,Match_Group
0,,23995.461746,,584.4,437.3,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0
1,,23274.421654,,612.4,437.3,11302023_DOD93_F4_5xFAD_cereb_O3off_01,612.4 -> 437.3,24
2,,32844.782284,,622.5,503.4,11302023_DOD93_F4_5xFAD_cereb_O3off_01,622.5 -> 503.4,48
3,,29137.402119,,624.5,505.4,11302023_DOD93_F4_5xFAD_cereb_O3off_01,624.5 -> 505.4,72
4,,23972.741741,,626.5,437.3,11302023_DOD93_F4_5xFAD_cereb_O3off_01,626.5 -> 437.3,96
...,...,...,...,...,...,...,...,...
3595,TAG,33222.982410,"[TG(57:9),TG(56:2)]_FA18:1",932.9,633.6,11302023_FAD189_M2_5xFAD_hippo_O3off_01,932.9 -> 633.6,3359
3596,TAG,23429.881687,"[TG(58:7),TG(57:0)]_FA18:1",950.9,651.6,11302023_FAD189_M2_5xFAD_hippo_O3off_01,950.9 -> 651.6,3383
3597,TAG,23313.761707,"[TG(59:13),TG(58:6)]_FA18:1",952.8,653.5,11302023_FAD189_M2_5xFAD_hippo_O3off_01,952.8 -> 653.5,3407
3598,TAG,23350.421665,"[TG(59:12),TG(58:5)]_FA18:1",954.8,655.5,11302023_FAD189_M2_5xFAD_hippo_O3off_01,954.8 -> 655.5,3431


In [10]:
df_OzESI.head(None)
# df_OzESI.to_csv('FF_OzOFF_full.csv')
# df_OzESI.to_excel('FaceFatsOzdf.xlsx')

Unnamed: 0,Lipid,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition
0,,584.4,437.3,0.044183,41.160004,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3
1,,584.4,437.3,0.088567,41.180004,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3
2,,584.4,437.3,0.132967,41.120003,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3
3,,584.4,437.3,0.177367,41.040005,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3
4,,584.4,437.3,0.221783,99.900009,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3
...,...,...,...,...,...,...,...
2026774,,956.9,657.6,24.779100,41.120003,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6
2026775,,956.9,657.6,24.823500,41.220001,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6
2026776,,956.9,657.6,24.867917,41.100002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6
2026777,,956.9,657.6,24.912317,41.260002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6


The `read_mrm_list()` function reads the MRM database from the specified file location and returns it as a pandas DataFrame, `mrm_database`. The `match_lipids_parser()` function then matches the transitions detected in the OzESI-MS scans (`OzESI_time_df`) against the known lipids in `mrm_database`, within the specified `tolerance`, and stores the result in `df_oz_matched`. The cell below does not call either function directly. Instead, it drops the `Lipid` column from `df_OzESI` and keeps only the rows whose `Retention_Time` falls within a chosen window; lipid labels are assigned in a later cell.

In [11]:
# Drop the first column (Lipid); keep Parent_Ion through Transition
d1 = df_OzESI.iloc[:, 1:9]

# Define the retention time range as a tuple (lower_bound, upper_bound)
retention_time_range = (9.5, 21.5)  # Replace with your specific range values

# Keep only rows whose Retention_Time is within the specified range (bounds included)
d1a = d1[(d1['Retention_Time'] >= retention_time_range[0]) & (d1['Retention_Time'] <= retention_time_range[1])]

# # d1a now contains only the rows from d1 where Retention_Time is within the specified range
# d1a.to_csv('filtered_d1_FF_OzON_Liver.csv')
d1a

Unnamed: 0,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition
213,584.4,437.3,9.502817,41.020004,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3
214,584.4,437.3,9.547233,41.020004,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3
215,584.4,437.3,9.591633,41.060001,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3
216,584.4,437.3,9.636050,41.220001,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3
217,584.4,437.3,9.680450,41.080002,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3
...,...,...,...,...,...,...
2026696,956.9,657.6,21.315367,40.900002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6
2026697,956.9,657.6,21.359783,41.000004,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6
2026698,956.9,657.6,21.404183,41.240002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6
2026699,956.9,657.6,21.448600,41.240002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6


# Create a Match_Group for the OzESI DataFrame so later steps can iterate over groups instead of checking every row

In [12]:
# Copy first so the new column is added to an independent DataFrame, not to a slice of d1
d1a = d1a.copy()

# Creating the Match_Group column
d1a['Match_Group'] = d1a.groupby(['Parent_Ion', 'Product_Ion', 'Sample_ID']).ngroup()

# Display the DataFrame with the new column
d1a

Unnamed: 0,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition,Match_Group
213,584.4,437.3,9.502817,41.020004,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0
214,584.4,437.3,9.547233,41.020004,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0
215,584.4,437.3,9.591633,41.060001,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0
216,584.4,437.3,9.636050,41.220001,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0
217,584.4,437.3,9.680450,41.080002,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0
...,...,...,...,...,...,...,...
2026696,956.9,657.6,21.315367,40.900002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6,3455
2026697,956.9,657.6,21.359783,41.000004,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6,3455
2026698,956.9,657.6,21.404183,41.240002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6,3455
2026699,956.9,657.6,21.448600,41.240002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6,3455


In [13]:


# # Assuming clustered_data is your DataFrame with the necessary data

# # Plotting the Intensity vs Retention Time
# plt.figure(figsize=(10, 6))
# plt.scatter(d1a['Retention_Time'], d1a['OzESI_Intensity'])

# # Adding labels and title to the plot
# plt.xlabel('Retention Time')
# plt.ylabel('OzESI Intensity')
# plt.title('Retention Time vs OzESI Intensity')

# # Displaying the plot
# plt.show()


# Pipeline for OzON Truth

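Determine the correct retention time (RT) for each lipid, then add the double-bond (DB) position based on that RT.

The nested for loops below are too slow for this dataset. They are kept, commented out, only as a last resort in case the faster approaches cannot be made to work.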

In [14]:
# d1b = d1a.copy()
# # Define a function to check if two ions are within the tolerance
# def is_within_tolerance(ion1, ion2, tolerance=0.3):
#     return abs(ion1 - ion2) <= tolerance

# # Initialize the Lipid column in d1a
# d1b['Lipid'] = None

# # Iterate through d1b and match lipids from df_MRM
# for index, row in d1b.iterrows():
#     for _, mrm_row in df_MRM.iterrows():
#         if is_within_tolerance(row['Parent_Ion'], mrm_row['Parent_Ion']) and is_within_tolerance(row['Product_Ion'], mrm_row['Product_Ion']):
#             d1b.at[index, 'Lipid'] = mrm_row['Lipid']
#             break  # Stop searching once a match is found

# # Display the updated d1b
# print(d1b)

# import pandas as pd
# from tqdm import tqdm

# d1b = d1a.copy()

# # Define a function to check if two ions are within the tolerance
# def is_within_tolerance(ion1, ion2, tolerance=0.3):
#     return abs(ion1 - ion2) <= tolerance

# # Initialize the Lipid column in d1b
# d1b['Lipid'] = None

# # Iterate through d1b with a progress bar and match lipids from df_MRM
# for index, row in tqdm(d1b.iterrows(), total=d1b.shape[0], desc="Matching Lipids"):
#     for _, mrm_row in df_MRM.iterrows():
#         if is_within_tolerance(row['Parent_Ion'], mrm_row['Parent_Ion']) and is_within_tolerance(row['Product_Ion'], mrm_row['Product_Ion']):
#             d1b.at[index, 'Lipid'] = mrm_row['Lipid']
#             break  # Stop searching once a match is found

# # Display the updated d1b
# print(d1b)


# Match lipids once per Match_Group instead of with nested for loops over every row
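
For comparison, a fully merge-based alternative to nested loops can be sketched as follows. This is an illustration under stated assumptions, not the method used in the cell below: the helper `match_lipids_by_merge` and the toy values are hypothetical, the OzESI frame is assumed not to have a `Lipid` column yet, and `how="cross"` requires pandas 1.2 or newer. Unique transitions from both frames are cross-joined, filtered by the tolerance, and the first match per transition is merged back onto every scan.

```python
import pandas as pd

def match_lipids_by_merge(oz_df, mrm_df, tolerance=0.3):
    """Attach a Lipid label to every row of oz_df using merges (hypothetical helper)."""
    # Work on unique transitions only, which keeps the cross join small
    oz_pairs = oz_df[["Parent_Ion", "Product_Ion"]].drop_duplicates()
    mrm_pairs = mrm_df[["Parent_Ion", "Product_Ion", "Lipid"]].drop_duplicates()

    # Pair every OzESI transition with every MRM transition
    candidates = oz_pairs.merge(mrm_pairs, how="cross", suffixes=("", "_mrm"))

    within = (
        ((candidates["Parent_Ion"] - candidates["Parent_Ion_mrm"]).abs() <= tolerance)
        & ((candidates["Product_Ion"] - candidates["Product_Ion_mrm"]).abs() <= tolerance)
    )
    # Keep the first MRM match per transition, like the `break` in a loop
    first_match = candidates[within].drop_duplicates(["Parent_Ion", "Product_Ion"])
    first_match = first_match[["Parent_Ion", "Product_Ion", "Lipid"]]

    # Left merge so that unmatched transitions keep a missing Lipid
    return oz_df.merge(first_match, on=["Parent_Ion", "Product_Ion"], how="left")

# Toy data; values are made up for illustration
oz = pd.DataFrame({
    "Parent_Ion": [900.8, 900.8, 584.4],
    "Product_Ion": [601.5, 601.5, 437.3],
    "Retention_Time": [14.80, 14.84, 15.50],
})
mrm = pd.DataFrame({
    "Parent_Ion": [900.9],
    "Product_Ion": [601.4],
    "Lipid": ["Lipid_A"],
})

result = match_lipids_by_merge(oz, mrm)
print(result)
```

Two caveats apply to this sketch: the final merge returns a fresh 0-based index rather than the original row labels, and the cross join grows with the product of the two unique-transition counts, so it is only practical when those counts are modest.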

In [15]:
d1b = d1a.copy()

# Create Match_Group if the earlier grouping cell was skipped
if 'Match_Group' not in d1b.columns:
    d1b['Match_Group'] = d1b.groupby(['Parent_Ion', 'Product_Ion', 'Sample_ID']).ngroup()

# Function to check if two ions are within the tolerance
def is_within_tolerance(ion1, ion2, tolerance=0.3):
    return abs(ion1 - ion2) <= tolerance

# Add a new column for Lipid in d1b
d1b['Lipid'] = None

# All rows in a Match_Group share the same Parent_Ion and Product_Ion,
# so one representative row per group is enough for matching
representatives = d1b.drop_duplicates('Match_Group')

for _, group_row in representatives.iterrows():
    # Find the first df_MRM row (from any sample) within tolerance on both ions
    for _, mrm_row in df_MRM.iterrows():
        if is_within_tolerance(group_row['Parent_Ion'], mrm_row['Parent_Ion']) and is_within_tolerance(group_row['Product_Ion'], mrm_row['Product_Ion']):
            # Assign the lipid to all rows in the corresponding group in d1b
            d1b.loc[d1b['Match_Group'] == group_row['Match_Group'], 'Lipid'] = mrm_row['Lipid']
            break  # Stop searching once a match is found

# Display the updated d1b DataFrame
d1b

Unnamed: 0,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition,Match_Group,Lipid
213,584.4,437.3,9.502817,41.020004,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0,
214,584.4,437.3,9.547233,41.020004,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0,
215,584.4,437.3,9.591633,41.060001,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0,
216,584.4,437.3,9.636050,41.220001,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0,
217,584.4,437.3,9.680450,41.080002,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0,
...,...,...,...,...,...,...,...,...
2026696,956.9,657.6,21.315367,40.900002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6,3455,"[TG(59:11),TG(58:4)]_FA18:1"
2026697,956.9,657.6,21.359783,41.000004,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6,3455,"[TG(59:11),TG(58:4)]_FA18:1"
2026698,956.9,657.6,21.404183,41.240002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6,3455,"[TG(59:11),TG(58:4)]_FA18:1"
2026699,956.9,657.6,21.448600,41.240002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6,3455,"[TG(59:11),TG(58:4)]_FA18:1"


# Assign Correct_RT for each lipid group in OzON to define the ground-truth RT for each lipid in each sample
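
The idea of taking the retention time at the most intense scan of each group can be illustrated on toy data. The sketch below uses `idxmax`, which is one of several equivalent ways to do this; the numbers are made up.

```python
import pandas as pd

# Toy chromatogram points for two groups; values are placeholders
toy = pd.DataFrame({
    "Match_Group": [0, 0, 0, 1, 1],
    "Retention_Time": [14.80, 14.84, 14.88, 15.95, 15.99],
    "OzESI_Intensity": [100.0, 900.0, 300.0, 50.0, 700.0],
})

# Row label of the most intense scan in each group
apex_idx = toy.groupby("Match_Group")["OzESI_Intensity"].idxmax()

# Retention time at that scan, indexed by Match_Group
apex_rt = toy.loc[apex_idx].set_index("Match_Group")["Retention_Time"]

# Broadcast the apex retention time to every row of its group
toy["Correct_RT"] = toy["Match_Group"].map(apex_rt)
print(toy)
```

If several scans in a group tie for the maximum intensity, `idxmax` returns the first one, so the earliest of the tied retention times is used.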

In [16]:


d1c = d1b.copy()

# Group by Match_Group and find the Retention_Time corresponding to the max OzESI_Intensity for each group
max_rt_per_group = d1c.groupby('Match_Group').apply(lambda x: x.loc[x['OzESI_Intensity'].idxmax(), 'Retention_Time'])

# Map the max retention time to the Correct_RT column for each group
d1c['Correct_RT'] = d1c['Match_Group'].map(max_rt_per_group)

d1c

# # Print the number of unique values in the Correct_RT column
# print(len(d1c['Correct_RT'].unique()))




Unnamed: 0,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition,Match_Group,Lipid,Correct_RT
213,584.4,437.3,9.502817,41.020004,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0,,15.49775
214,584.4,437.3,9.547233,41.020004,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0,,15.49775
215,584.4,437.3,9.591633,41.060001,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0,,15.49775
216,584.4,437.3,9.636050,41.220001,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0,,15.49775
217,584.4,437.3,9.680450,41.080002,11302023_DOD93_F4_5xFAD_cereb_O3off_01,584.4 -> 437.3,0,,15.49775
...,...,...,...,...,...,...,...,...,...
2026696,956.9,657.6,21.315367,40.900002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6,3455,"[TG(59:11),TG(58:4)]_FA18:1",17.62960
2026697,956.9,657.6,21.359783,41.000004,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6,3455,"[TG(59:11),TG(58:4)]_FA18:1",17.62960
2026698,956.9,657.6,21.404183,41.240002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6,3455,"[TG(59:11),TG(58:4)]_FA18:1",17.62960
2026699,956.9,657.6,21.448600,41.240002,11302023_FAD189_M2_5xFAD_hippo_O3off_01,956.9 -> 657.6,3455,"[TG(59:11),TG(58:4)]_FA18:1",17.62960


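Filter out groups whose `Lipid` is NaN, keep only the most intense scan in each `Match_Group`, parse `Sample_ID` into `Cage`, `Mouse`, `Genotype`, and `Biology`, and drop groups whose maximum intensity is below 2000.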

In [17]:
# Drop groups where every 'Lipid' value is NaN (filter returns a new DataFrame, leaving d1c unchanged)
d1d = d1c.groupby('Match_Group').filter(lambda x: not x['Lipid'].isna().all())

# Find the max OzESI_Intensity for each Match_Group
max_intensity_per_group = d1d.groupby('Match_Group')['OzESI_Intensity'].max()

# Map the max intensity to a new column Max_Intensity for each group
d1d['Max_Intensity'] = d1d['Match_Group'].map(max_intensity_per_group)

# Round the Correct_RT column to 2 decimal places and also Max_Intensity and OzESI_Intensity to 0 decimal places and Retention_Time to 2 decimal places
d1d['Correct_RT'] = d1d['Correct_RT'].round(2)
d1d['Max_Intensity'] = d1d['Max_Intensity'].round(0)
d1d['OzESI_Intensity'] = d1d['OzESI_Intensity'].round(0)
d1d['Retention_Time'] = d1d['Retention_Time'].round(2)

# Keep only the row with the highest OzESI_Intensity in each Match_Group
d1d = d1d.sort_values('OzESI_Intensity', ascending=False).drop_duplicates('Match_Group')


def extract_details_from_sample_id(df, column_name='Sample_ID'):
    """
    Extracts details from the Sample_ID column and adds them as new columns: Cage, Mouse, Genotype, and Biology.

    Args:
    df (pandas.DataFrame): The DataFrame containing the Sample_ID column.
    column_name (str): The name of the column to extract the details from. Default is 'Sample_ID'.

    Returns:
    pandas.DataFrame: The original DataFrame with added columns 'Cage', 'Mouse', 'Genotype', 'Biology'.
    """
    # Regular expression pattern to extract Cage, Mouse, Genotype, and Biology
    pattern = r'^[^_]*_(?P<Cage>[^_]+)_(?P<Mouse>[^_]+)_(?P<Genotype>[^_]+)_(?P<Biology>[^_]+)'

    # Extract the matched patterns and create the new columns
    df_extracted = df[column_name].str.extract(pattern)

    # Add the new columns to the original DataFrame
    df = pd.concat([df, df_extracted[['Cage', 'Mouse', 'Genotype', 'Biology']]], axis=1)

    return df

# Add the Cage, Mouse, Genotype, and Biology columns to d1d
d1d = extract_details_from_sample_id(d1d)



# Group by Match_Group and filter out groups where the max Max_Intensity is under 2000
d1d = d1d.groupby('Match_Group').filter(lambda x: x['Max_Intensity'].max() >= 2000)

# d1d now contains one row per Match_Group with the highest OzESI_Intensity
#save the data to csv
# d1d.to_csv('Projects/FaceFats/data/OzOFF_CorrectRT/FF_Brain5xFAD_OzOFF_CorrectRT.csv')
d1d


Unnamed: 0,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition,Match_Group,Lipid,Correct_RT,Max_Intensity,Cage,Mouse,Genotype,Biology
1256375,900.8,601.5,14.84,821874.0,11302023_FAD185_M3_5xFAD_dienc_O3off_01,900.8 -> 601.5,3014,"[TG(55:11),TG(54:4)]_FA18:1",14.84,821874.0,FAD185,M3,5xFAD,dienc
1087477,900.8,601.5,14.84,673141.0,11302023_FAD185_M3_5xFAD_cereb_O3off_01,900.8 -> 601.5,3012,"[TG(55:11),TG(54:4)]_FA18:1",14.84,673141.0,FAD185,M3,5xFAD,cereb
1257527,902.8,603.5,15.99,599856.0,11302023_FAD185_M3_5xFAD_dienc_O3off_01,902.8 -> 603.5,3062,"[TG(55:10),TG(54:3)]_FA18:1",15.99,599856.0,FAD185,M3,5xFAD,dienc
1088629,902.8,603.5,15.99,392364.0,11302023_FAD185_M3_5xFAD_cereb_O3off_01,902.8 -> 603.5,3060,"[TG(55:10),TG(54:3)]_FA18:1",15.99,392364.0,FAD185,M3,5xFAD,cereb
1255785,898.8,599.5,13.64,351194.0,11302023_FAD185_M3_5xFAD_dienc_O3off_01,898.8 -> 599.5,2990,[TG(54:5)]_FA18:1,13.64,351194.0,FAD185,M3,5xFAD,dienc
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1923502,846.8,547.5,13.95,2134.0,11302023_FAD189_M2_5xFAD_dienc_O3off_01,846.8 -> 547.5,2662,[TG(50:3)]_FA18:1,13.95,2134.0,FAD189,M2,5xFAD,dienc
1089783,906.8,607.5,17.23,2064.0,11302023_FAD185_M3_5xFAD_cereb_O3off_01,906.8 -> 607.5,3108,"[TG(55:8),TG(54:1)]_FA18:1",17.23,2064.0,FAD185,M3,5xFAD,cereb
1846339,896.8,597.5,12.48,2059.0,11302023_FAD189_M2_5xFAD_cortex_O3off_01,896.8 -> 597.5,2973,[TG(54:6)]_FA18:1,12.48,2059.0,FAD189,M2,5xFAD,cortex
487865,846.8,547.5,13.86,2054.0,11302023_DOD94_F4_5xFAD_cortex_O3off_01,846.8 -> 547.5,2645,[TG(50:3)]_FA18:1,13.86,2054.0,DOD94,F4,5xFAD,cortex


# Create Group_Sample column

In [18]:


def add_group_sample_column(df):
    """
    Adds a new column 'Group_Sample' to the DataFrame, assigning a unique group number 
    for each combination of Cage, Mouse, Genotype, Biology, and Lipid.

    Args:
    df (pandas.DataFrame): The DataFrame to process.

    Returns:
    pandas.DataFrame: The DataFrame with the added 'Group_Sample' column.
    """
    # Create the 'Group_Sample' column by assigning a group number for each combination
    df['Group_Sample'] = df.groupby(['Cage', 'Mouse', 'Genotype', 'Biology', 'Lipid']).ngroup()

    return df

# Add the Group_Sample column to d1d
d1d = add_group_sample_column(d1d)
d1d




Unnamed: 0,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition,Match_Group,Lipid,Correct_RT,Max_Intensity,Cage,Mouse,Genotype,Biology,Group_Sample
1256375,900.8,601.5,14.84,821874.0,11302023_FAD185_M3_5xFAD_dienc_O3off_01,900.8 -> 601.5,3014,"[TG(55:11),TG(54:4)]_FA18:1",14.84,821874.0,FAD185,M3,5xFAD,dienc,78
1087477,900.8,601.5,14.84,673141.0,11302023_FAD185_M3_5xFAD_cereb_O3off_01,900.8 -> 601.5,3012,"[TG(55:11),TG(54:4)]_FA18:1",14.84,673141.0,FAD185,M3,5xFAD,cereb,58
1257527,902.8,603.5,15.99,599856.0,11302023_FAD185_M3_5xFAD_dienc_O3off_01,902.8 -> 603.5,3062,"[TG(55:10),TG(54:3)]_FA18:1",15.99,599856.0,FAD185,M3,5xFAD,dienc,76
1088629,902.8,603.5,15.99,392364.0,11302023_FAD185_M3_5xFAD_cereb_O3off_01,902.8 -> 603.5,3060,"[TG(55:10),TG(54:3)]_FA18:1",15.99,392364.0,FAD185,M3,5xFAD,cereb,57
1255785,898.8,599.5,13.64,351194.0,11302023_FAD185_M3_5xFAD_dienc_O3off_01,898.8 -> 599.5,2990,[TG(54:5)]_FA18:1,13.64,351194.0,FAD185,M3,5xFAD,dienc,74
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1923502,846.8,547.5,13.95,2134.0,11302023_FAD189_M2_5xFAD_dienc_O3off_01,846.8 -> 547.5,2662,[TG(50:3)]_FA18:1,13.95,2134.0,FAD189,M2,5xFAD,dienc,129
1089783,906.8,607.5,17.23,2064.0,11302023_FAD185_M3_5xFAD_cereb_O3off_01,906.8 -> 607.5,3108,"[TG(55:8),TG(54:1)]_FA18:1",17.23,2064.0,FAD185,M3,5xFAD,cereb,59
1846339,896.8,597.5,12.48,2059.0,11302023_FAD189_M2_5xFAD_cortex_O3off_01,896.8 -> 597.5,2973,[TG(54:6)]_FA18:1,12.48,2059.0,FAD189,M2,5xFAD,cortex,125
487865,846.8,547.5,13.86,2054.0,11302023_DOD94_F4_5xFAD_cortex_O3off_01,846.8 -> 547.5,2645,[TG(50:3)]_FA18:1,13.86,2054.0,DOD94,F4,5xFAD,cortex,22


# Save OzOFF df

In [19]:
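output_dir = 'Projects/FaceFats/data/OzOFF_CorrectRT'
os.makedirs(output_dir, exist_ok=True)  # to_csv does not create missing directories
d1d.to_csv(os.path.join(output_dir, 'FF_Brain5xFAD_OzOFF_CorrectRT.csv'))
d1d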

Unnamed: 0,Parent_Ion,Product_Ion,Retention_Time,OzESI_Intensity,Sample_ID,Transition,Match_Group,Lipid,Correct_RT,Max_Intensity,Cage,Mouse,Genotype,Biology,Group_Sample
1256375,900.8,601.5,14.84,821874.0,11302023_FAD185_M3_5xFAD_dienc_O3off_01,900.8 -> 601.5,3014,"[TG(55:11),TG(54:4)]_FA18:1",14.84,821874.0,FAD185,M3,5xFAD,dienc,78
1087477,900.8,601.5,14.84,673141.0,11302023_FAD185_M3_5xFAD_cereb_O3off_01,900.8 -> 601.5,3012,"[TG(55:11),TG(54:4)]_FA18:1",14.84,673141.0,FAD185,M3,5xFAD,cereb,58
1257527,902.8,603.5,15.99,599856.0,11302023_FAD185_M3_5xFAD_dienc_O3off_01,902.8 -> 603.5,3062,"[TG(55:10),TG(54:3)]_FA18:1",15.99,599856.0,FAD185,M3,5xFAD,dienc,76
1088629,902.8,603.5,15.99,392364.0,11302023_FAD185_M3_5xFAD_cereb_O3off_01,902.8 -> 603.5,3060,"[TG(55:10),TG(54:3)]_FA18:1",15.99,392364.0,FAD185,M3,5xFAD,cereb,57
1255785,898.8,599.5,13.64,351194.0,11302023_FAD185_M3_5xFAD_dienc_O3off_01,898.8 -> 599.5,2990,[TG(54:5)]_FA18:1,13.64,351194.0,FAD185,M3,5xFAD,dienc,74
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1923502,846.8,547.5,13.95,2134.0,11302023_FAD189_M2_5xFAD_dienc_O3off_01,846.8 -> 547.5,2662,[TG(50:3)]_FA18:1,13.95,2134.0,FAD189,M2,5xFAD,dienc,129
1089783,906.8,607.5,17.23,2064.0,11302023_FAD185_M3_5xFAD_cereb_O3off_01,906.8 -> 607.5,3108,"[TG(55:8),TG(54:1)]_FA18:1",17.23,2064.0,FAD185,M3,5xFAD,cereb,59
1846339,896.8,597.5,12.48,2059.0,11302023_FAD189_M2_5xFAD_cortex_O3off_01,896.8 -> 597.5,2973,[TG(54:6)]_FA18:1,12.48,2059.0,FAD189,M2,5xFAD,cortex,125
487865,846.8,547.5,13.86,2054.0,11302023_DOD94_F4_5xFAD_cortex_O3off_01,846.8 -> 547.5,2645,[TG(50:3)]_FA18:1,13.86,2054.0,DOD94,F4,5xFAD,cortex,22
