# Comprehensive Lipidome Automation Workflow (CLAW)

Welcome to CLAW, a tool designed to facilitate and optimize the processing of lipidomic MRM data. This Jupyter notebook encapsulates a suite of tools that streamline the various stages of lipidomics data analysis.

Our toolset enables users to efficiently process MRM data files in the mzML format. Upload a file and CLAW will parse the data into a structured Pandas dataframe. This dataframe includes critical information like sample_ID, MRM transition, and signal intensity. Furthermore, our tool aligns each MRM transition with a default or custom lipid_database for accurate and swift annotation.

CLAW also includes an OzESI (ozone electrospray ionization) option for locating double bonds in lipid isomers. Users supply OzESI data and select which double bond locations to analyze. CLAW then predicts the potential m/z values for those locations and cross-references the predictions with the sample data.
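
The m/z prediction step can be illustrated with a small sketch. It is not CLAW's own implementation: the function names and the default positions are hypothetical, and it assumes a singly charged ion, the aldehyde ozonolysis product, and a lost methyl-terminal fragment with no further double bonds (so an n-x double bond loses C_xH_2x and gains one O).

```python
# Monoisotopic masses of the elements involved.
MASS_C = 12.0
MASS_H = 1.00782503
MASS_O = 15.99491462


def ozesi_aldehyde_mz(precursor_mz, db_position):
    """Predicted aldehyde-product m/z for an n-<db_position> double bond.

    Assumptions (not taken from CLAW): singly charged ion, and the lost
    methyl-terminal fragment is C_x H_2x with x = db_position.
    """
    if db_position < 1:
        raise ValueError("db_position must be a positive integer")
    lost_fragment = db_position * (MASS_C + 2 * MASS_H)
    return precursor_mz - lost_fragment + MASS_O


def predict_ozesi_mz(precursor_mz, db_positions=(7, 9)):
    """Map each selected double bond position to its predicted m/z."""
    return {
        f"n-{pos}": round(ozesi_aldehyde_mz(precursor_mz, pos), 3)
        for pos in db_positions
    }


# Hypothetical precursor m/z, for illustration only.
print(predict_ozesi_mz(900.0))
```

Under these assumptions an n-9 double bond corresponds to a shift of roughly -110.15 and an n-7 double bond to roughly -82.11; the predicted values would then be compared with measured transitions within the chosen tolerance.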

With automation at its core, CLAW eliminates the need for manual data processing, significantly reducing time expenditure. It is a robust and invaluable tool for handling large volumes of lipid MRM data, accelerating scientific discovery in the field of lipidomics.

Import all necessary libraries

In [1]:
# Standard library imports
import csv
import json
import math
import os
import re
import time
import warnings

# Third-party imports
import ipywidgets as widgets
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
import plotly.io as pio
import pymzml
from IPython.display import Image, clear_output, display
from collections import defaultdict

# Custom Scripts
from NO_AVERAGE_SCRIPTS import (average_pie_chart_no_repeats,
                                filter_dataframe, full_parse,
                                hex_to_rgba_hex, json_to_string,
                                make_bar_plot_comparisons,
                                make_pie_chart_no_replicates,
                                prep_edge_R)

# GUI tools
from tools.GUI import (assign_blank, display_pair_widgets, filter_samples,
                       folder_navigator, get_unique_json_objects, 
                       load_blank_name, load_data, load_data_labels, 
                       load_project_folder, remove_empty_entries)

# Parsing tools
from tools.parsing import add_suffix

# Pre-folder path
Pre_folder = './Projects/'


No module named 'ms_deisotope._c.averagine' averagine
No module named 'ms_deisotope._c.scoring'
No module named 'ms_deisotope._c.deconvoluter_base'


In [2]:
import os
import shutil

def normalize_name(name):
    return name.replace("_", "-").replace("tag", "tg").replace("TAG", "TG")

def count_and_get_prefixes(folder_name, prefix_length):
    prefixes = set()
    count = 0
    notneeded_folder = 'Projects/BRAIN_5XFAD_OLD_DATA_FOR_PAPER/notneeded'
    os.makedirs(notneeded_folder, exist_ok=True)
    
    try:
        for name in os.listdir(folder_name):
            full_path = os.path.join(folder_name, name)
            if os.path.isfile(full_path):
                
                # Normalize the file name and rename the file on disk if it changed
                new_name = normalize_name(name)
                if new_name != name:  # Check if the name has been changed
                    new_full_path = os.path.join(folder_name, new_name)
                    shutil.move(full_path, new_full_path)  # Rename the file
                    name = new_name  # Update the name variable for subsequent use in this loop iteration
                    full_path = new_full_path  # Similarly, update the full_path variable
                
                if "equisplash" in name or "r Blank" in name or "blank" in name or "IPA" in name:
                    shutil.move(full_path, os.path.join(notneeded_folder, name))
                else:
                    prefixes.add(normalize_name(name[:prefix_length]))
                    count += 1
        return count, prefixes
    except FileNotFoundError:
        print(f"Folder {folder_name} not found!")
        return 0, set()


# The specific folder of interest
folder_path = 'Projects/BRAIN_5XFAD_OLD_DATA_FOR_PAPER/mzml'

# Number of leading characters of each file name used as its prefix
prefix_length = 10

count, prefixes = count_and_get_prefixes(folder_path, prefix_length)
print(f"Number of files in folder {folder_path}: {count}")


Number of files in folder Projects/BRAIN_5XFAD_OLD_DATA_FOR_PAPER/mzml: 397
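
Because the cell above renames and moves files as it counts them, it can help to preview the outcome first. The sketch below mirrors the same normalization and exclusion rules on a plain list of names without touching the filesystem; `plan_file_actions` and the example file names are hypothetical.

```python
EXCLUDE_MARKERS = ("equisplash", "r Blank", "blank", "IPA")


def normalize_name(name):
    """Same normalization as in the cell above."""
    return name.replace("_", "-").replace("tag", "tg").replace("TAG", "TG")


def plan_file_actions(file_names, prefix_length=10):
    """Return (renames, excluded, prefixes) without modifying any files."""
    renames, excluded, prefixes = {}, [], set()
    for name in file_names:
        new_name = normalize_name(name)
        if new_name != name:
            renames[name] = new_name
        # The markers are matched case-sensitively, as in the original cell.
        if any(marker in new_name for marker in EXCLUDE_MARKERS):
            excluded.append(new_name)
        else:
            prefixes.add(new_name[:prefix_length])
    return renames, excluded, prefixes


# Hypothetical file names, for illustration only.
names = ["5xFAD_cortex_TAG_01.mzML", "blank_IPA_01.mzML"]
renames, excluded, prefixes = plan_file_actions(names)
print(renames)
print(excluded)
print(prefixes)
```

Passing `os.listdir(folder_path)` instead of the example list would show which files the real cell is about to rename or move to the `notneeded` folder.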


The following cell sets the parameters and flags that configure lipidomic data parsing and visualization. The output file name (file_name_to_save) is 'TEST', with the additional descriptor (extra_name) 'Blank1'. The tolerance for acceptable error during data parsing is 0.1. The flags remove_std (whether to remove standard deviation values from the dataset), save_data (whether to save the processed data), and custom_data (whether to use a custom dataset) are set to True. The flag load_previously_parsed (whether to load previously parsed data) is set to False, so the raw data is parsed from scratch.

In [2]:
file_name_to_save = 'TEST' # Specifies the output file name
extra_name = "Blank1" # Additional descriptor for the output file
tolerance = 0.1 # Acceptable error level for data parsing
remove_std = True # Flag to decide if standard deviation values should be removed
save_data = True # Flag to decide if processed data should be saved
load_previously_parsed = False # Flag to decide if pre-existing parsed data should be loaded
custom_data = True # Flag to decide if a custom dataset should be used
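
To make the role of `tolerance` concrete, here is a minimal sketch of tolerance-based annotation. The internals of `full_parse` are not shown in this notebook, so this is only one plausible reading: a measured transition is annotated when both its parent and product m/z lie within `tolerance` of a database entry. The function `match_transitions` and the toy database row are hypothetical.

```python
import pandas as pd


def match_transitions(measured, database, tolerance=0.1):
    """Annotate transitions whose parent and product m/z both fall
    within `tolerance` of a database entry (illustrative only)."""
    # Pair every measured transition with every database entry.
    pairs = measured.merge(database, how="cross", suffixes=("", "_db"))
    within = (
        ((pairs["Parent_Ion"] - pairs["Parent_Ion_db"]).abs() <= tolerance)
        & ((pairs["Product_Ion"] - pairs["Product_Ion_db"]).abs() <= tolerance)
    )
    columns = ["Parent_Ion", "Product_Ion", "Intensity", "Lipid", "Class"]
    return pairs.loc[within, columns].reset_index(drop=True)


# Toy data: the lipid name and class are placeholders, not real annotations.
measured = pd.DataFrame({
    "Parent_Ion": [368.3, 500.0],
    "Product_Ion": [85.1, 200.0],
    "Intensity": [100.0, 50.0],
})
database = pd.DataFrame({
    "Parent_Ion": [368.35],
    "Product_Ion": [85.1],
    "Lipid": ["Lipid A"],
    "Class": ["Class X"],
})
print(match_transitions(measured, database, tolerance=0.1))
```

A cross join is simple to read but grows with the product of the two table sizes; for large databases a sorted lookup such as `pd.merge_asof` on the parent m/z may be a better fit.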

**Project Organization and Data Preparation Summary**

This section handles project organization, sample labelling, and data preparation. A folder navigation widget is used to select the project folder manually. The code then defines the paths to the mzML files, the lipid database, and the output folders, and loads the label file from the chosen project folder. Unique sample names are identified and one of them is assigned as the blank sample for later analysis. The label data is then filtered by sample, and the "Sample Name" and "Position" columns are removed. Finally, a list of labels is prepared, extended with "Class" and "Lipid", for the subsequent steps of the analysis.
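
The same preparation steps can be sketched without the GUI, which is handy for scripted runs. The helper `prepare_labels` is hypothetical, and it builds the label list from the remaining metadata columns only, which is an assumption; the notebook cell builds it from `labels_df`.

```python
import pandas as pd


def prepare_labels(labels_df, blank_name, keep_samples=None):
    """Non-interactive sketch of the label preparation steps."""
    unique_samples = labels_df["Sample Name"].unique()
    if blank_name not in unique_samples:
        raise ValueError(f"Blank sample {blank_name!r} not found in labels")
    if keep_samples is not None:
        # Stand-in for the sample filter widget.
        labels_df = labels_df[labels_df["Sample Name"].isin(keep_samples)]
    metadata = labels_df.drop(columns=["Sample Name", "Position"])
    labels_list = list(metadata.columns) + ["Class", "Lipid"]
    return metadata, labels_list


# Toy label table; sample names and positions are placeholders.
labels_df = pd.DataFrame({
    "Sample Name": ["Blank_01", "Sample_01", "Sample_02"],
    "Position": ["A1", "A2", "A3"],
    "Genotype": ["WT", "WT", "5xFAD"],
    "Sex": ["Male", "Male", "Male"],
})
metadata, labels_list = prepare_labels(labels_df, blank_name="Blank_01")
print(labels_list)
```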

In [3]:
# Launch a GUI to choose project folder
folder_navigator()

# Load selected project folder path and define various necessary paths for processed results, mzml files, etc. 
Project_Folder = load_project_folder()
folder_name_to_save = Project_Folder + 'Processed Results/'
data_base_name_location = 'lipid_database/Lipid_Database.xlsx'
mzml_folder = Project_Folder + "mzml/"
Pre_edge_r_path = Project_Folder + "Pre_EdgeR/"
plots_2_save_path = Project_Folder + "Plots/"

# Load labels from CSV file
label_file = Project_Folder + "Labels/labels.csv"
labels_df = pd.read_csv(label_file)

# Determine the blank sample
# Get unique sample names
unique_samples = labels_df['Sample Name'].unique()
# Launch a GUI to choose blank sample
assign_blank(unique_samples)
blank_name = load_blank_name()  # Load selected blank sample name

# Filter samples using GUI
filter_samples(labels_df)
labels_df2 = load_data_labels()

# Remove unnecessary columns "Sample Name" and "Position" from labels_df2
labels_df2 = labels_df2.drop(["Sample Name","Position"], axis=1)
#labels_df2 = labels_df2.drop(["Sample Name"], axis=1)


# Get the list of label names and extend it with 'Class' and 'Lipid'
labels_list = list(labels_df)
labels_list = labels_list + ["Class","Lipid"]


Button(description='Navigate', style=ButtonStyle())

Button(description='Select this folder', style=ButtonStyle())

Button(description='Select Current Folder', style=ButtonStyle())

Select(options=('/home/sanjay/github/lipids/Lipidomics/lipid_platform/Projects',), rows=10, value='/home/sanja…

Output()

Dropdown(description='Samples', options=('10xBlank_01202023', 'Blank10x_01182023', 'Blank10x_01192023', 'DOD10…

Button(description='Assign Blank', style=ButtonStyle())

Output()

SelectMultiple(description='Samples', index=(0,), options=('10xBlank_01202023', 'Blank10x_01182023', 'Blank10x…

Button(description='Filter Samples', style=ButtonStyle())

Output()

In [4]:
main_json = {col: labels_df2[col].unique().tolist() for col in labels_df2}

display_pair_widgets(main_json)

HBox(children=(SelectMultiple(description='Genotype', options=('WT', '5xFAD'), value=()), SelectMultiple(descr…

HBox(children=(SelectMultiple(description='Brain Region', options=('cerebellum', 'cortex', 'diencephalon', 'hi…

HBox(children=(SelectMultiple(description='Sex', options=('Male',), value=()), SelectMultiple(description='Sex…

HBox(children=(Button(description='Finish', style=ButtonStyle()), Button(description='Add more JSON pairs', st…
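
The dictionary comprehension in the cell above maps each label column to its unique values, in order of first appearance, and these values become the options offered by the selection widgets. A small self-contained illustration, using a toy `labels_df2` with values taken from the widget output:

```python
import pandas as pd

# Toy stand-in for labels_df2 after "Sample Name" and "Position" are dropped.
labels_df2 = pd.DataFrame({
    "Genotype": ["WT", "5xFAD", "WT"],
    "Sex": ["Male", "Male", "Male"],
})

main_json = {col: labels_df2[col].unique().tolist() for col in labels_df2}
print(main_json)
# {'Genotype': ['WT', '5xFAD'], 'Sex': ['Male']}
```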

## Lipid Data Processing and Plotting
The following code performs lipid data processing and generates various plots to visualize the results. It includes the loading and parsing of lipid data, custom class renaming, creation of pie and bar plots, data preparation for EdgeR processing, and executing the EdgeR processing through a bash script.
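
The custom class renaming relies on `Series.replace` with a dictionary: values that exactly match a key are replaced, and everything else passes through unchanged. A minimal illustration on a toy dataframe, using the same mapping as the cell below:

```python
import pandas as pd

class_rename_dict = {
    'AC': 'CAR', 'FFA': 'FA', 'CE | CE': 'CE', 'PE | PE': 'PE',
    'PG | PG': 'PG', 'PG | PG | PG': 'PG', 'PI | PI': 'PI',
    'PS | PS': 'PS', 'CER': 'Cer', 'TAG': 'TG',
}

# Toy data: 'PC' has no entry in the mapping and is left as-is.
toy = pd.DataFrame({"Class": ["AC", "PE | PE", "TAG", "PC"]})
toy["Class"] = toy["Class"].replace(class_rename_dict)
print(toy["Class"].tolist())
# ['CAR', 'PE', 'TG', 'PC']
```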

In [5]:
# Suppress warnings for a cleaner output
warnings.filterwarnings('ignore')

# Display a GIF as a visual cue for data loading process
# gif = Image(filename='Figures/cat_gif.gif')  # replace 'your_gif.gif' with the path to your GIF
# display(gif)
print("Your data is PURRing...")

# Load preprocessed data if the flag is set to True, else parse raw data
if load_previously_parsed:
    df_matched = pd.read_csv(os.path.join(Project_Folder, "Processed Results", file_name_to_save+".csv"))
else:
    df_matched = full_parse(data_base_name_location, mzml_folder, folder_name_to_save, labels_df, blank_name, 
                            file_name_to_save, tolerance, custom_data=custom_data, remove_std=remove_std, save_data=save_data)
print("Data processing complete")

# Class renaming for custom data
if custom_data:
    class_rename_dict = {'AC': 'CAR', 'FFA': 'FA', 'CE | CE': 'CE', 'PE | PE': 'PE', 'PG | PG': 'PG', 
                         'PG | PG | PG': 'PG', 'PI | PI': 'PI', 'PS | PS': 'PS','CER': 'Cer', 'TAG': 'TG',}
    df_matched['Class'] = df_matched['Class'].replace(class_rename_dict)

# Load comparison pairs for plotting and remove empty entries
json_list_pairs = remove_empty_entries(load_data())
# Get unique JSON objects for individual plotting
json_list_singles = get_unique_json_objects(json_list_pairs)

# # Plotting section
# make_pie_chart_no_replicates(df_matched, plots_2_save_path, json_list_singles, labels_list, blank_name, extra_name)
# average_pie_chart_no_repeats(df_matched, plots_2_save_path, json_list_singles, labels_list, blank_name, extra_name)
# make_bar_plot_comparisons(df_matched, plots_2_save_path, json_list_pairs, labels_list, blank_name, extra_name)

# Preparation for EdgeR processing
labels_list += ['method_type', "Transition"]
df_matched = add_suffix(df_matched)
combined_df = prep_edge_R(df_matched, json_list_pairs, Pre_edge_r_path, blank_name, labels_list, extra_name)

# Call bash script to run EdgeR processing
!bash myjob.sh


Your data is PURRing...
      Lipid Parent_Ion Product_Ion       Intensity      Transition Class  \
0       NaN      368.3        85.1   1785924.25415   368.3 -> 85.1   NaN   
1       NaN      368.3       306.3    24784.382042  368.3 -> 306.3   NaN   
2       NaN      370.3        85.1  1490543.856445   370.3 -> 85.1   NaN   
3       NaN      370.3       308.3     5253.500408  370.3 -> 308.3   NaN   
4       NaN      372.3        85.1  1480042.431885   372.3 -> 85.1   NaN   
...     ...        ...         ...             ...             ...   ...   
56912   NaN      990.9       693.9    18609.741333  990.9 -> 693.9   NaN   
56913   NaN      992.9       663.9   113377.466042  992.9 -> 663.9   NaN   
56914   NaN      992.9       671.9    15024.920998  992.9 -> 671.9   NaN   
56915   NaN      992.9       693.9    17602.081253  992.9 -> 693.9   NaN   
56916   NaN      992.9       695.9    14037.620884  992.9 -> 695.9   NaN   

                               Sample_ID  
0                   

KeyError: 'Sample Name'
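
The traceback is not shown, so it is unclear which dataframe lacked the 'Sample Name' column. One way to surface this kind of problem earlier is to check for required columns before calling the processing functions. The helpers below are hypothetical and not part of CLAW.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


def require_columns(df, required, df_name="dataframe"):
    """Raise a descriptive KeyError if any required column is absent."""
    missing = missing_columns(df, required)
    if missing:
        raise KeyError(
            f"{df_name} is missing columns {missing}; "
            f"available columns: {list(df.columns)}"
        )


# Toy example: a label table that lacks 'Sample Name'.
toy_labels = pd.DataFrame({"Position": ["A1"], "Genotype": ["WT"]})
print(missing_columns(toy_labels, ["Sample Name", "Position"]))
# ['Sample Name']
```

Calling `require_columns(labels_df, ["Sample Name"], "labels_df")` before parsing would report the available columns alongside the missing one.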

In [6]:
print(list(labels_df))
print(list(df_matched))

['Sample Name', 'Position', 'Genotype', 'Brain Region', 'Sex']


NameError: name 'df_matched' is not defined
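
The `$'\r': command not found` messages printed by the `myjob_v2.sh` cell that follows are what bash reports when a script has Windows-style (CRLF) line endings. A minimal sketch for converting a script to Unix (LF) line endings is shown below; the function names are hypothetical. The separate `module: command not found` message appears to be unrelated to line endings and would not be fixed by this conversion.

```python
from pathlib import Path


def strip_carriage_returns(script_bytes):
    """Convert CRLF line endings to LF."""
    return script_bytes.replace(b"\r\n", b"\n")


def convert_script_to_unix(path):
    """Rewrite the file in place if it contains CRLF line endings.

    Returns True when the file was changed, False otherwise.
    """
    script = Path(path)
    original = script.read_bytes()
    converted = strip_carriage_returns(original)
    if converted != original:
        script.write_bytes(converted)
        return True
    return False


# In-memory demonstration on a toy script.
print(strip_carriage_returns(b"#!/bin/bash\r\n\r\necho done\r\n"))
# b'#!/bin/bash\n\necho done\n'
```

Running `convert_script_to_unix("myjob_v2.sh")` once before the `!bash` call would be one way to apply it; the command-line tool `dos2unix` does the same job where it is installed.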

In [8]:
# Call bash script to run EdgeR processing
!bash myjob_v2.sh

myjob_v2.sh: line 3: $'\r': command not found
myjob_v2.sh: line 4: $'\r': command not found
myjob_v2.sh: line 5: module: command not found
myjob_v2.sh: line 6: $'\r': command not found
Loading required package: ggplot2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ lubridate 1.9.2     ✔ tibble    3.2.1
✔ purrr     1.0.1     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::arrange()   masks plyr::arrange()
✖ purrr::compact()   masks plyr::compact()
✖ dplyr::count()     masks plyr::count()
[31m✖[39m [34