# Reproducing LifeTracer results

## Overview

**LifeTracer** is a comprehensive Python package for 2D gas chromatography analysis and molecular classification. This notebook provides a step-by-step guide to reproduce the results from our research paper, walking you through the complete pipeline from raw chromatographic data to trained classification models.

### What you'll learn:
- How to extract and process Total Ion Intensity (TII) heatmaps from raw data
- Chromatographic peak detection and clustering
- Parameter optimization using calibration datasets
- Binary classification of samples using machine learning
- Visualization of chromatographic data

### System Requirements
- **Disk Space**: At least 500 GB for the complete pipeline (Steps 1-3 are the most demanding)
- **Memory**: Recommended 16+ GB RAM
- **Python**: 3.10.8 or higher
- **Environment**: Conda recommended

---

## Quick Start

**Option 1: Script Execution**
You can run the complete pipeline using the scripts in the `scripts/` directory:

```bash
# Activate environment and run pipeline steps
conda activate LifeTracer
python scripts/step1_extract_heatmaps.py
python scripts/step2_TII_alignment.py
# ... continue with subsequent steps
```

**Option 2: Interactive Notebook**
Follow the cells in this notebook for an interactive, step-by-step experience with detailed explanations.

### Important Notes:
- ⚠️ **Storage**: Steps 1-3 require at least **500 GB** of disk space.
- 🔧 **Dependencies**: Ensure all paths in the scripts point to the correct project directory.
- 🔄 **Flexibility**: Each step can be run independently or sequentially. To run a step without having executed the previous ones, follow that step's data download instructions first.

# Installation

## Linux/macOS Installation

### Step 1: Create Environment
```bash
conda create -n LifeTracer python=3.10.8
```

### Step 2: Activate Environment
```bash
conda activate LifeTracer
```

### Step 3: Install Package
```bash
# Navigate to project directory
cd LifeTracer

# Install in development mode
pip install -e .
```

### Step 4: Verify Installation
```bash
python -c "import lifetracer; print('LifeTracer installed successfully!')"
```

## Windows Installation

The installation process is identical to Linux/macOS. Use Anaconda Prompt or PowerShell:

```powershell
# Follow the same steps as above
conda create -n LifeTracer python=3.10.8
conda activate LifeTracer
cd LifeTracer
pip install -e .
```

### Troubleshooting
- **Permission Errors**: Use `pip install --user -e .`
- **Environment Issues**: Try `conda clean --all` and recreate environment
- **Path Problems**: Ensure you're in the correct project directory

# Step 1: TII Extraction

In this first step, we extract **Total Ion Intensity (TII)** images from raw chromatographic data for each sample and mass-to-charge ratio (m/z). This creates 2D heatmap representations that serve as the foundation for all subsequent analysis.

### What happens in this step:
1. **Input**: Raw CSV files containing chromatographic data
2. **Process**: Convert 3D data (RT1, RT2, Intensity) to 2D heatmaps per m/z
3. **Output**: TII heatmap files organized by sample and m/z value
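
As a rough illustration of the conversion described above, the sketch below groups long-format rows into one 2D grid per target m/z. It is not LifeTracer's implementation: the helper `rows_to_heatmaps`, the rule that a row belongs to a target when its m/z is within `mz_threshold` of it, and the example rows are all assumptions made for illustration.

```python
from collections import defaultdict

def rows_to_heatmaps(rows, mz_targets, mz_threshold=0.5):
    """Group (mz, rt1, rt2, area) rows into one {(rt1, rt2): area} grid per target m/z.

    Hypothetical helper: a row is assigned to the first target for which
    |mz - target| <= mz_threshold; areas landing on the same pixel are summed.
    """
    heatmaps = {target: defaultdict(float) for target in mz_targets}
    for mz, rt1, rt2, area in rows:
        for target in mz_targets:
            if abs(mz - target) <= mz_threshold:
                heatmaps[target][(rt1, rt2)] += area
                break
    return heatmaps

# Made-up rows in (M/Z, 1st Time (s), 2nd Time (s), Area) order
rows = [
    (102.1, 4080.0, 1.8, 10.0),
    (101.9, 4080.0, 1.8, 5.0),
    (154.0, 5178.0, 1.7, 7.0),
    (300.0, 6000.0, 2.0, 1.0),  # no matching target, so it is ignored
]
heatmaps = rows_to_heatmaps(rows, mz_targets=[102, 154])
print(dict(heatmaps[102]))
```

Each per-m/z grid is sparse here (a dictionary keyed by pixel); a dense array would serve equally well once the RT1 and RT2 axes are fixed.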

## 1.1 M/Z Target List

The m/z (mass-to-charge ratio) values we target for analysis are stored in `data/all_mz_values.csv`. This file contains m/z values ranging from **30 to 700**.

> 💡 **Tip**: You can modify this file to target specific m/z values relevant to your analysis.

In [None]:
# Load and examine m/z values
import pandas as pd

mz_data = pd.read_csv('data/all_mz_values.csv')
print(f"Total M/Z values: {len(mz_data)}")
print(f"Range: {mz_data['M/Z'].min()} - {mz_data['M/Z'].max()}")
print("\nFirst 5 entries:")
mz_data.head()


## 1.2 Sample Labels & Metadata

Each sample requires metadata specification including the raw data filename, sample name, and classification label. This information is stored in `data/labels.csv`.

### Label Convention:
- **Label 1**: Terrestrial samples (soil, geological)
- **Label 0**: Extraterrestrial samples (meteorites)
- **Label -1**: Unlabeled samples (for prediction)

> 💡 **Tip**: For unlabeled samples that you want to classify, use label `-1`.
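
The label convention can be applied with a few lines of code. The sketch below separates training samples (labels `0` and `1`) from samples to predict (label `-1`). The file names are made up, and only the `csv_file_name` and `label` columns are assumed, matching the default column names used in the configuration later in this notebook.

```python
import csv
import io

# Hypothetical excerpt laid out like data/labels.csv (file names are invented)
sample_csv = """csv_file_name,label
sample_a.csv,1
sample_b.csv,0
sample_c.csv,-1
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Labeled samples are used for training; label -1 marks samples to classify
training = [row for row in rows if int(row["label"]) in (0, 1)]
to_predict = [row for row in rows if int(row["label"]) == -1]

print(len(training), "training samples;", len(to_predict), "to predict")
```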

In [None]:
# Load and examine sample labels
labels_data = pd.read_csv('data/labels.csv')
labels_data.head(18)

## 1.3 Data Download

Download the raw chromatographic data required for Step 1.

### ⚠️ Requirements:
- **Disk Space**: ~350 GB free space
- **Internet**: Stable connection for large file downloads
- **Time**: 30-60 minutes depending on connection speed
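
Before starting a download of this size, it may help to confirm that enough space is available. The following is a minimal sketch using only the standard library; the helper name `has_free_space` is hypothetical, and the 350 GB figure is taken from the requirement above.

```python
import shutil

def has_free_space(path, required_gb):
    """Return True if the filesystem containing `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

if not has_free_space(".", 350):
    print("Warning: less than 350 GB free; the Step 1 download may not fit.")
```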

In [None]:
import os
import urllib.request
from tqdm.auto import tqdm

# Dataset URLs organized by type
METEORITE_URLS = [
    'https://huggingface.co/datasets/DS-20202/Meteorites_LifeTracer/resolve/main/230823_03_Murchison_Pristine_2.0_300uLDCM_100oC24h-003.csv',
    'https://huggingface.co/datasets/DS-20202/Meteorites_LifeTracer/resolve/main/230830_01_EET96029_300uLDCM_100oC24h.csv',
    'https://huggingface.co/datasets/DS-20202/Meteorites_LifeTracer/resolve/main/230830_02_Orgueil_300uLDCM_100oC24h-001.csv',
    'https://huggingface.co/datasets/DS-20202/Meteorites_LifeTracer/resolve/main/230901_06_ALH83100_300uLDCM_100oC24h.csv',
    'https://huggingface.co/datasets/DS-20202/Meteorites_LifeTracer/resolve/main/230901_07_LON94101_300uLDCM_100oC24h.csv',
    'https://huggingface.co/datasets/DS-20202/Meteorites_LifeTracer/resolve/main/230901_08_LEW85311_300uLDCM_100oC24h.csv',
    'https://huggingface.co/datasets/DS-20202/Meteorites_LifeTracer/resolve/main/231003_01_AZ_400uLDCM_100oC24h.csv',
    'https://huggingface.co/datasets/DS-20202/Meteorites_LifeTracer/resolve/main/231003_02_Jbilet_Winselwan_300uLDCM_100oC24h.csv',
]

SOIL_URLS = [
    'https://huggingface.co/datasets/DS-20202/SoilSample-LifeTracer/resolve/main/230823_01_Atacama_Soil_300uLDCM_100oC24h-001.csv',
    'https://huggingface.co/datasets/DS-20202/SoilSample-LifeTracer/resolve/main/230823_02_Rio_Tinto_Soil_300uLDCM_100oC24h.csv',
    'https://huggingface.co/datasets/DS-20202/SoilSample-LifeTracer/resolve/main/230823_04_Murchison_Soil_300uLDCM_100oC24h-001.csv',
    'https://huggingface.co/datasets/DS-20202/SoilSample-LifeTracer/resolve/main/230823_05_Antarctica_Soil_300uLDCM_100oC24h-001.csv',
    'https://huggingface.co/datasets/DS-20202/SoilSample-LifeTracer/resolve/main/230823_06_Jarosite_Soil_300uLDCM_100oC24h.csv',
    'https://huggingface.co/datasets/DS-20202/SoilSample-LifeTracer/resolve/main/230823_07_Green_River_Shale_Soil_500uLDCM_100oC24h.csv',
    'https://huggingface.co/datasets/DS-20202/SoilSample-LifeTracer/resolve/main/230901_05_GSFC_soil_300uLDCM_100oC24h.csv',
    'https://huggingface.co/datasets/DS-20202/SoilSample-LifeTracer/resolve/main/230830_03_Lignite_300uLDCM_100oC24h-001.csv',
    'https://huggingface.co/datasets/DS-20202/SoilSample-LifeTracer/resolve/main/231003_03_Utah_Soil_300uLDCM_100oC24h.csv',
    'https://huggingface.co/datasets/DS-20202/SoilSample-LifeTracer/resolve/main/231003_04_Iceland_Soil_300uLDCM_100oC24h.csv'
]

# Create directory for raw data
os.makedirs('downloads/raw', exist_ok=True)

print("Downloading meteorite samples...")
for url in tqdm(METEORITE_URLS, desc="Meteorites"):
    filename = url.split('/')[-1]
    urllib.request.urlretrieve(url, f'downloads/raw/{filename}')

print("\nDownloading soil samples...")  
for url in tqdm(SOIL_URLS, desc="Soil samples"):
    filename = url.split('/')[-1]
    urllib.request.urlretrieve(url, f'downloads/raw/{filename}')

print(f"\n✅ Download complete! {len(METEORITE_URLS + SOIL_URLS)} files downloaded to downloads/raw/")

## 1.4 Execute TII Extraction

### Method 1: Run via Script
```bash
python scripts/step1_extract_heatmaps.py
```

### Method 2: Interactive Execution (Current Notebook)
Configure and run the extraction process directly in this notebook.

### Configuration Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `mz_list_path` | Path to M/Z values CSV | `data/all_mz_values.csv` |
| `labels_path` | Path to sample labels CSV | `data/labels.csv` |
| `m_z_column_name` | M/Z column in raw data | `M/Z` |
| `area_column_name` | Intensity/Area column | `Area` |
| `first_time_column_name` | First retention time (RT1) | `1st Time (s)` |
| `second_time_column_name` | Second retention time (RT2) | `2nd Time (s)` |
| `csv_file_name_column` | Raw CSV filename column | `csv_file_name` |
| `label_column_name` | Sample label column | `label` |
| `heatmap_dir` | Output directory for TIIs | `output/heatmaps/` |
| `extract_heatmaps.raw_csv_path` | Raw data directory | `downloads/raw/` |
| `extract_heatmaps.m_z_threshold` | M/Z quantization threshold | `0.5` |
| `extract_heatmaps.parallel_processing` | Enable parallel processing | `True` |
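
Because the column mappings must match the headers of the raw CSV files, a quick header check can catch mismatches before a long extraction run. This is a sketch under the assumption that the raw files are plain comma-separated CSVs with a single header row; `missing_columns` is a hypothetical helper, not part of LifeTracer.

```python
import csv
import io

# Default column names from the configuration table
REQUIRED_COLUMNS = ["M/Z", "Area", "1st Time (s)", "2nd Time (s)"]

def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]

# In-memory example; for real data, pass open(path, newline="") instead
example = io.StringIO("M/Z,Area,1st Time (s)\n102,10.0,4080.0\n")
print(missing_columns(example))  # ['2nd Time (s)']
```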

In [None]:
import lifetracer
from pathlib import Path
import time

# Configuration for TII extraction
config = {
    # Input data paths
    "mz_list_path": "data/all_mz_values.csv",
    "labels_path": "data/labels.csv",
    
    # Column mappings for raw CSV data
    "m_z_column_name": "M/Z",
    "area_column_name": "Area", 
    "first_time_column_name": "1st Time (s)",
    "second_time_column_name": "2nd Time (s)",
    "csv_file_name_column": "csv_file_name",
    "label_column_name": "label",

    # Output directory for TII heatmaps
    "heatmap_dir": "output/heatmaps/",

    # Extraction parameters
    "extract_heatmaps": {
        "raw_csv_path": "downloads/raw/",        # Path to raw data directory
        "m_z_threshold": 0.5,                    # M/Z quantization threshold
        "parallel_processing": True              # Enable parallel processing
    },
}

# Create output directory
Path(config["heatmap_dir"]).mkdir(parents=True, exist_ok=True)

print("Starting TII extraction process...")
print(f"Input directory: {config['extract_heatmaps']['raw_csv_path']}")
print(f"Output directory: {config['heatmap_dir']}")
print(f"Parallel processing: {config['extract_heatmaps']['parallel_processing']}")

start_time = time.time()

# Execute TII extraction
lifetracer.extract_heatmap.heatmap_extraction(config)

elapsed_time = time.time() - start_time
print(f"\n✅ TII extraction completed in {elapsed_time/60:.1f} minutes")

# Step 2: TII Alignment

This step performs the TII alignment shown in the following figure from the paper:

<div style="text-align: center;"><img src="img/TII_Alignment.png" width="800px"></div>

### Download data required for running step 2

If you did not run the previous step, you can download the processed data (unaligned TIIs) instead.

> 💡 **Tip**:  Ensure you have `350 GB` of disk space available for this step.

In [None]:
import os
import urllib.request
import subprocess

# Download all samples
download_list = [
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/heatmaps.tar.gz.part-aa',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/heatmaps.tar.gz.part-ab',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/heatmaps.tar.gz.part-ac',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/heatmaps.tar.gz.part-ad'
]

# Create the output folder
os.makedirs('output', exist_ok=True)

# Download the data and store in output folder
for url in download_list:
    filename = url.split('/')[-1]
    output_path = f'output/{filename}'
    print(f"Downloading {filename}...")
    urllib.request.urlretrieve(url, output_path)

# Combine the parts and extract the archive (check=True raises if either command fails)
print("Combining and extracting files...")
subprocess.run("cat output/heatmaps.tar.gz.part-a* > output/heatmaps.tar.gz && tar -xzf output/heatmaps.tar.gz -C output/", shell=True, check=True)
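
The `cat` and `tar` commands above are not available in every shell (for example, a default Windows command prompt). A portable alternative, sketched here with the standard library only, joins the parts in name order and extracts the result. The helper `join_and_extract` is hypothetical and assumes the parts sort correctly by file name, as the `part-aa`, `part-ab`, ... suffixes do.

```python
import glob
import shutil
import tarfile

def join_and_extract(parts_pattern, archive_path, dest_dir):
    """Concatenate split archive parts in name order, then extract the tar.gz."""
    parts = sorted(glob.glob(parts_pattern))
    if not parts:
        raise FileNotFoundError(f"No files match {parts_pattern}")
    with open(archive_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streams in chunks, not all at once
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return parts

# Example call for this step (commented out so the cell has no side effects):
# join_and_extract("output/heatmaps.tar.gz.part-*", "output/heatmaps.tar.gz", "output/")
```

Streaming with `shutil.copyfileobj` keeps memory use low, which matters for archive parts of this size.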

### How to run the script

You can run the cell below or run this command:

```bash
python scripts/step2_TII_alignment.py
```

The remaining parameters are the same as in Step 1.

| Parameter | Description |
|-----------|-------------|
| `heatmap_dir` | Input directory with TIIs from Step 1 |
| `TII_aligned_dir` | Output directory for aligned TIIs |


In [None]:
config = {
    "mz_list_path": "data/all_mz_values.csv",
    "labels_path": "data/labels.csv",
    "m_z_column_name": "M/Z",
    "area_column_name": "Area",
    "first_time_column_name": "1st Time (s)",
    "second_time_column_name": "2nd Time (s)",
    "csv_file_name_column": "csv_file_name",
    "label_column_name": "label",
    "heatmap_dir": "output/heatmaps/",

    # The aligned TIIs will be saved in this directory
    "TII_aligned_dir": "output/TII_aligned/",
}

lifetracer.TII_alignment.align(config)

# Step 3: Calibration phase for parameter selection

We will perform the parameter calibration procedure described in the paper here.

### Calibration Dataset

The calibration dataset contains expert-verified compounds, which will help us automate parameter selection.

**Note:** In your experiments, you can modify `data/calibration_dataset.csv` with your own calibration data.

In [5]:
pd.read_csv('data/calibration_dataset.csv').head(10)

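| Unnamed: 0 | ID | Compound | RT1 | RT2 | Base Mass | ALH 83100 | Aguas Zarcas | EET 96029 | Jbilet Winselwan | Murchison | Orgueil | LEW 85311 | LON 94101 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Naphthalene | 4081.648 | 1.816 | 102 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 2 | Biphenyl | 5178.4 | 1.728 | 154 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 2 | 3 | Phenanthrene | 6692.13 | 2.04 | 178 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 3 | 4 | Anthracene | 6727.17 | 2.019 | 178 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 5 | 1-Phenylnaphthalene | 6727.17 | 2.019 | 178 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5 | 6 | Acenaphthene | 5626.91 | 1.6 | 153 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |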


In [7]:
# This file contains the m/z values that we will use for calibration
pd.read_csv('data/calibration_mz_values.csv').head(10)

| Unnamed: 0.1 | Unnamed: 0 | M/Z |
|---|---|---|
| 0 | 0 | 102 |
| 1 | 1 | 154 |
| 2 | 2 | 178 |
| 3 | 3 | 153 |


### Download data required for running step 3

Run the cell below to download the required data.

**Note:** Ensure you have `120 GB` of disk space available for this step.

**Note:** Ensure that the file is unzipped in the output/TII_aligned/ directory to run the next step (or you can modify the path in the next step)

In [None]:
import os
import urllib.request
import subprocess

# Download all samples
download_list = [
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-aa',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ab',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ac',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ad',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ae',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-af',
]

# Create the output folder
os.makedirs('output', exist_ok=True)

# Download the data and store it in the output folder
for url in download_list:
    filename = url.split('/')[-1]
    output_path = f'output/{filename}'
    print(f"Downloading {filename}...")
    urllib.request.urlretrieve(url, output_path)

# Combine the parts and extract the archive (check=True raises if a command fails)
part_files = ' '.join([f'output/TII_aligned.tar.gz.part-a{chr(ord("a") + i)}' for i in range(6)])
subprocess.run(f'cat {part_files} > output/TII_aligned.tar.gz', shell=True, check=True)
subprocess.run('tar -xzf output/TII_aligned.tar.gz -C output/', shell=True, check=True)

### How to run the script

You can run the cell below or run this command:

```bash
python scripts/step3_calibration_phase.py
```

The descriptions of the remaining parameters are the same as in Step 1.

| Parameter | Description |
|-----------|-------------|
| `lambda1s` | The range of lambda1 values you want to explore. |
| `lambda2s` | The range of lambda2 values you want to explore. |
| `rt1_tol` | Maximum RT1 deviation of peak location for the reference compound. |
| `rt2_tol` | Maximum RT2 deviation of peak location for the reference compound. |
| `accuracy_threshold` | Filters out lambda1 and lambda2 combinations that result in less than `accuracy_threshold` in peak detection accuracy. |
| `TII_aligned_dir` | Directory for aligned TIIs (from step 2). |
| `calibration_phase_output_dir` | Directory for calibration phase outputs. |
| `best_config_save_path` | Directory where you want the best configuration to be saved in JSON format. |
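
To make the roles of `rt1_tol`, `rt2_tol`, and `accuracy_threshold` concrete, the sketch below checks whether detected peaks fall within the tolerances of reference compounds and computes the fraction of references recovered. This is an illustrative assumption about how such a check could work, not the calibration code in LifeTracer; the helper names and the detected peaks are made up, while the reference coordinates come from the calibration table above.

```python
def matches(reference, peak, rt1_tol=50, rt2_tol=1):
    """True if `peak` lies within both tolerances of the reference (RT1, RT2)."""
    return (abs(reference[0] - peak[0]) <= rt1_tol
            and abs(reference[1] - peak[1]) <= rt2_tol)

def detection_accuracy(references, peaks, rt1_tol=50, rt2_tol=1):
    """Fraction of reference compounds with at least one matching detected peak."""
    found = sum(
        any(matches(ref, peak, rt1_tol, rt2_tol) for peak in peaks)
        for ref in references
    )
    return found / len(references)

references = [(4081.648, 1.816), (5178.4, 1.728)]  # Naphthalene, Biphenyl
peaks = [(4100.0, 1.9), (6000.0, 2.5)]             # invented detections

accuracy = detection_accuracy(references, peaks)
# A (lambda1, lambda2) pair would be kept only if accuracy >= accuracy_threshold
print(accuracy, accuracy >= 0.9)
```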


In [None]:
import lifetracer

config = {
    "calibration_dataset_path": "data/calibration_dataset.csv",
    "mz_list_path": "data/calibration_mz_values.csv",
    "labels_path": "data/labels.csv",
    "m_z_column_name": "M/Z",
    "lambda1s": [1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
    "lambda2s": [1,10,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200],
    "rt1_tol": 50,
    "rt2_tol": 1,
    "accuracy_threshold": 0.9,
    "TII_aligned_dir": "output/TII_aligned/",
    "calibration_phase_output_dir": "output/calibration_phase/",
    "best_config_save_path": "output/best_config/",
}

lifetracer.calibration_phase.calibration_phase(config)

In [9]:
# read json file
import json
with open('output/best_config/best_config.json', 'r') as f:
    best_config = json.load(f)

print(best_config)

{'lambda1': 5, 'lambda2': 100, 'rt1_threshold': 50, 'rt2_threshold': 0.8}


So the best parameters are $\lambda_1 = 5$, $\lambda_2 = 100$, $RT1_{thresh} = 50$ s, and $RT2_{thresh} = 0.8$ s.
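
The later steps in this notebook reuse these values: Steps 5 and 6 pass them as single-element lists, while Step 7 passes plain numbers. A small sketch of that conversion follows, using the JSON content printed above as an inline string so that it runs without the file; in practice the dictionary would come from `best_config.json`.

```python
import json

# Same content as the best_config.json output shown above
best_config_text = '{"lambda1": 5, "lambda2": 100, "rt1_threshold": 50, "rt2_threshold": 0.8}'
best_config = json.loads(best_config_text)

# Steps 5 and 6 expect each value wrapped in a list
list_params = {key: [value] for key, value in best_config.items()}
print(list_params)
```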

# Step 4: Extract peaks

In this step, we extract peaks from the aligned TIIs produced in Step 2, using the parameters selected in Step 3.

### Download data required for running step 4

Run the cell below to download the required data.

> 💡 **Tip**: Ensure you have `120 GB` of disk space available for this step.

> 💡 **Tip**: Ensure that the file is unzipped in the output/TII_aligned/ directory to run the next step (or you can modify the path in the next step)

In [None]:
import os
import urllib.request
import subprocess

# Download all samples
download_list = [
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-aa',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ab',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ac',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ad',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ae',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-af',
]

# Create the output folder
os.makedirs('output', exist_ok=True)

# Download the data and store it in the output folder
for url in download_list:
    filename = url.split('/')[-1]
    output_path = f'output/{filename}'
    print(f"Downloading {filename}...")
    urllib.request.urlretrieve(url, output_path)
    print(f"Downloaded {filename}")

# Combine the parts and extract the archive into output/
print("Combining and extracting files...")
part_files = [f'output/TII_aligned.tar.gz.part-a{suffix}' for suffix in 'abcdef']
with open('output/TII_aligned.tar.gz', 'wb') as archive:
    subprocess.run(['cat', *part_files], stdout=archive, check=True)
subprocess.run(['tar', '-xzf', 'output/TII_aligned.tar.gz', '-C', 'output/'], check=True)
print("Extraction completed!")

### How to run the script

You can run the cell below or run this command:

```bash
python scripts/step4_find_peaks.py
```

> 💡 **Tip**: The best `lambda1` and `lambda2` values were obtained in Step 3.

| Parameter | Description |
|-----------|-------------|
| `parallel_processing` | Enable/disable parallel processing |
| `number_of_splits` | Number of splits for parallel processing |
| `TII_aligned_dir` | Input directory with aligned TIIs |
| `peaks_dir_path` | Output directory for detected peaks |
| `lambda1` | Intensity threshold multiplier (default: 5) |
| `lambda2` | Local intensity filter threshold (default: 100) |
| `peak_max_neighbor_distance` | Max distance parameter for DBSCAN (default: 5) |
| `strict_noise_filtering` | Enable/disable rigorous noise filtering |
| `enable_noisy_regions` | Enable filtering of specific noisy regions |
| `noisy_regions` | List of rectangular regions in (RT1, RT2) space to filter |
| `convolution_filter.enable` | Enable/disable convolution-based filtering (Not used in the paper) |
| `overall_filter.enable` | Enable/disable filtering based on non-zero pixels |
| `overall_filter.non_zero_ratio_filter` | Threshold for non-zero pixel ratio |


In [None]:
import lifetracer

config = {
    "parallel_processing": True,
    "number_of_splits": 100,

    "mz_list_path": "data/all_mz_values.csv",
    "labels_path": "data/labels.csv",
    "m_z_column_name": "M/Z",
    "area_column_name": "Area",
    "first_time_column_name": "1st Time (s)",
    "second_time_column_name": "2nd Time (s)",
    "csv_file_name_column": "csv_file_name",
    "label_column_name": "label",
    "TII_aligned_dir": "output/TII_aligned/",
    "peaks_dir_path": "output/peaks/",
    "lambda1": 5,
    "lambda2": 100,
    "peak_max_neighbor_distance": 5,
    "strict_noise_filtering": True,

    "enable_noisy_regions": True,
    "noisy_regions": [
        {
            "first_time_start": 0,
            "second_time_start": 0,
            "first_time_end": -1,
            "second_time_end": 1,
            "non_zero_ratio_region_threshold": 1e-3
        },
        {
            "first_time_start": 8700,
            "second_time_start": 1.1,
            "first_time_end": -1,
            "second_time_end": 1.8,
            "non_zero_ratio_region_threshold": 1e-2
        },
        {
            "first_time_start": 8700,
            "second_time_start": 0,
            "first_time_end": -1,
            "second_time_end": -1,
            "non_zero_ratio_region_threshold": 1e-2
        },
        {
            "first_time_start": 8690,
            "second_time_start": 2.2,
            "first_time_end": 8710,
            "second_time_end": 3,
            "non_zero_ratio_region_threshold": 1e-2
        },
        { # 202 EET
            "first_time_start": 5174-50,
            "second_time_start": 0,
            "first_time_end": 5174 + 50,
            "second_time_end": -1,
            "non_zero_ratio_region_threshold": 1e-1
        },

        { # 202 EET
            "first_time_start": 5300-50,
            "second_time_start": 0,
            "first_time_end": 5300 + 50,
            "second_time_end": 1.8,
            "non_zero_ratio_region_threshold": 1e-2
        },

        {
            "first_time_start": 7700-50,
            "second_time_start": 0,
            "first_time_end": 7700 + 50,
            "second_time_end": -1,
            "non_zero_ratio_region_threshold": 1e-1
        },
        {
            "first_time_start": 8700-50,
            "second_time_start": 0,
            "first_time_end": 8700 + 50,
            "second_time_end": -1,
            "non_zero_ratio_region_threshold": 1e-2
        }
    ],
    
    "convolution_filter": {
        "enable": False,
        "lambda3": 1000000,
        "rt1_window_size": 100,
        "rt2_window_size": 0.5,
        "rt1_stride": 20,
        "rt2_stride": 0.5,
        "non_zero_ratio_lambda3_filter": 0.9
    },

    "overall_filter": {
        "enable": True,
        # "lambda": 10,
        "non_zero_ratio_filter": 0.1
    },
}

lifetracer.find_peaks.extract_peaks(config)

# Step 5: Create Features

In this step, we create features from the peaks extracted in Step 4.

### Download data required for running step 5

Run the cells below to download the required data.

In [None]:
# Download TIIs
import os
import urllib.request
import subprocess

# Download all samples
# If the cell doesn't run properly, you can download the data manually from the link below:
download_list = [
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-aa',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ab',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ac',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ad',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ae',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-af',
]

# Create the output folder
os.makedirs('output', exist_ok=True)

# Download the data and store it in the output folder
for url in download_list:
    filename = url.split('/')[-1]
    print(f"Downloading {filename}...")
    urllib.request.urlretrieve(url, f'output/{filename}')
    print(f"Downloaded {filename}")

# Combine the tar.gz parts, streaming each one to avoid loading it fully into memory
import shutil
import tarfile

print("Combining tar.gz parts...")
with open('output/TII_aligned.tar.gz', 'wb') as outfile:
    for part in ['part-aa', 'part-ab', 'part-ac', 'part-ad', 'part-ae', 'part-af']:
        with open(f'output/TII_aligned.tar.gz.{part}', 'rb') as infile:
            shutil.copyfileobj(infile, outfile)

print("Extracting tar.gz file...")
with tarfile.open('output/TII_aligned.tar.gz', 'r:gz') as tar:
    tar.extractall('output/')
print("Extraction complete!")

In [None]:
import os
import urllib.request
import zipfile

# Download all samples
download_list = [
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/peaks.zip'
]

# Create the output folder
os.makedirs('output', exist_ok=True)

# Download the data and store it in the output folder
for url in download_list:
    filename = url.split('/')[-1]
    urllib.request.urlretrieve(url, f'output/{filename}')
    print(f"Downloaded {filename}")

# Unzip the data
with zipfile.ZipFile('output/peaks.zip', 'r') as zip_ref:
    zip_ref.extractall('output/')
    print("Extracted peaks.zip")

### How to run the script

You can run the cell below or run this command:

```bash
python scripts/step5_retention_time_alignments.py
```

> 💡 **Tip**: The best `lambda1` and `lambda2` values were obtained in Step 3.

| Parameter | Description |
|-----------|-------------|
| `peaks_dir_path` | Path to directory containing peaks |
| `features_path` | Output directory for features |
| `rt1_threshold` | Maximum RT1 difference for clustering |
| `rt2_threshold` | Maximum RT2 difference for clustering |


In [None]:
import lifetracer

config = {
    "parallel_processing": True,
    "number_of_splits": 100,

    "mz_list_path": "data/all_mz_values.csv",
    "labels_path": "data/labels.csv",
    "m_z_column_name": "M/Z",
    "area_column_name": "Area",
    "first_time_column_name": "1st Time (s)",
    "second_time_column_name": "2nd Time (s)",
    "csv_file_name_column": "csv_file_name",
    "label_column_name": "label",

    "features_path": "output/features/",
    "TII_aligned_dir": "output/TII_aligned/",
    "peaks_dir_path": "output/peaks/",
    "lambda1": [5], # best lambda1 from step 3
    "lambda2": [100], # best lambda2 from step 3
    "rt1_threshold": [50], # best rt1_threshold from step 3
    "rt2_threshold": [0.8] # best rt2_threshold from step 3
}

lifetracer.retention_times_alignment.retention_times_alignment(config)

# Step 6: Parameter Selection for the Machine Learning Algorithm

In this step, we select the best regularization strength `C` for logistic regression with L2 regularization, based on cross-validation.

### Download data required for running step 6

Run the cell below to download the required data.

> 💡 **Tip**: The data required for this step is lightweight.

> 💡 **Tip**: Ensure that `features.zip` and `peaks.zip` are extracted into the `output/` directory before running the next step (or modify the paths in the next step).

In [None]:
import os
import urllib.request

# Download all samples
download_list = [
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/features.zip',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/peaks.zip',
]

# Create the output folder
os.makedirs('output', exist_ok=True)

# Download the data and store in output folder
for url in download_list:
    filename = url.split('/')[-1]
    print(f"Downloading {filename}...")
    urllib.request.urlretrieve(url, f"output/{filename}")
    print(f"Downloaded {filename}")

# Unzip the data
import zipfile

print("Extracting features.zip...")
with zipfile.ZipFile('output/features.zip', 'r') as zip_ref:
    zip_ref.extractall('output/')

print("Extracting peaks.zip...")
with zipfile.ZipFile('output/peaks.zip', 'r') as zip_ref:
    zip_ref.extractall('output/')

print("Extraction complete!")

### How to run the script

You can run the cell below or run this command:

```bash
python scripts/step6_parameters_selection.py
```

> 💡 **Tip**: The best `lambda1` and `lambda2` values were obtained in Step 3.

| Parameter | Description |
|-----------|-------------|
| `features_path` | Path to features |
| `parameters_selection_path` | Output directory for selection results |
| `C` | List of regularization strength values to test |
| `seed` | Random seed for reproducibility |


In [None]:
import lifetracer

config = {
    "mz_list_path": "data/all_mz_values.csv",
    "labels_path": "data/labels.csv",
    "m_z_column_name": "M/Z",
    "area_column_name": "Area",
    "first_time_column_name": "1st Time (s)",
    "second_time_column_name": "2nd Time (s)",
    "csv_file_name_column": "csv_file_name",
    "label_column_name": "label",

    "features_path": "output/features/",
    "peaks_dir_path": "output/peaks/",
    "parameters_selection_path":"output/parameters_selection/",

    # Logistic Regression with L2 regularization
    "C": [1e-4,1e-3,1e-2,1e-1,1e0,1e+1,1e+2,1e+3,1e+4],
    "seed": 42,
    "lambda1": [5],
    "lambda2": [100],
    "rt1_threshold": [50],
    "rt2_threshold": [0.8],
}

lifetracer.parameters_selection.parameters_selection(config)

# Step 7: Finally! Train a model on features

In this step, we're going to train a logistic regression model where the strength of regularization is determined from step 6.

### Download data required for running step 7

Run the cell below to download the required data.

> 💡 **Tip**: The data required for this step is lightweight.

> 💡 **Tip**: Ensure that `features.zip` and `peaks.zip` are extracted into the `output/` directory before running the next step (or modify the paths in the next step).

In [2]:
!pip install wget

In [None]:
import os
import urllib.request

# Download all samples
download_list = [
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/features.zip',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/peaks.zip',
]

# Create the output folder
os.makedirs('output', exist_ok=True)

# Download the data and store in output folder
for url in download_list:
    filename = url.split('/')[-1]
    print(f"Downloading {filename}...")
    urllib.request.urlretrieve(url, f"output/{filename}")
    print(f"Downloaded {filename}")

# Unzip the data
import zipfile

print("Extracting features.zip...")
with zipfile.ZipFile('output/features.zip', 'r') as zip_ref:
    zip_ref.extractall('output/')

print("Extracting peaks.zip...")
with zipfile.ZipFile('output/peaks.zip', 'r') as zip_ref:
    zip_ref.extractall('output/')

print("Extraction complete!")

Downloading features.zip...
Downloaded features.zip
Downloading peaks.zip...
Downloaded peaks.zip
Extracting features.zip...
Extracting peaks.zip...
Extraction complete!
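Before moving on, it can help to confirm that the extraction produced the directories the next cell reads from. The sketch below assumes the archives unpack into `output/features/` and `output/peaks/`, which are the paths used in the step 7 config; `missing_dirs` is a hypothetical helper, not part of LifeTracer.

```python
from pathlib import Path


def missing_dirs(root, expected=("features", "peaks")):
    """Return the expected subdirectories of `root` that are absent or empty."""
    root = Path(root)
    missing = []
    for name in expected:
        directory = root / name
        # A directory counts as missing if it does not exist or has no entries
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(name)
    return missing


# An empty list means both directories exist and contain files
print(missing_dirs("output"))
```

If the list is not empty, re-run the download cell or adjust `features_path` and `peaks_dir_path` in the config.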


### How to run the script

You can run the cell below or run the following command:

```bash
python scripts/step7_train_binary_classifier.py
```

In [5]:
import lifetracer

config = {
    "mz_list_path": "data/all_mz_values.csv",
    "labels_path": "data/labels.csv",
    "m_z_column_name": "M/Z",
    "area_column_name": "Area",
    "first_time_column_name": "1st Time (s)",
    "second_time_column_name": "2nd Time (s)",
    "csv_file_name_column": "csv_file_name",
    "label_column_name": "label",

    "features_path": "output/features/",
    "peaks_dir_path": "output/peaks/",
    "results_dir": "output/lr_l2_results/",

    # Logistic Regression with l2 regularization
    "C": 0.1, # best C from step 6
    "seed": 42, 
    "lambda1": 5, # Selected from calibration step
    "lambda2": 100, # Selected from calibration step
    "rt1_threshold": 50, # Selected from calibration step
    "rt2_threshold": 0.8, # Selected from calibration step
}

lifetracer.binary_classifier.binary_classifier(config)

2025-09-03 15:41:27.661 | INFO     | lifetracer.src.binary_classifier:binary_classifier:322 - Intercept: [0.03806688]
2025-09-03 15:41:49.474 | INFO     | lifetracer.src.binary_classifier:binary_classifier:346 - Signatures saved.


Folder created here: output/lr_l2_results/top_features/
Folder created here: output/lr_l2_results/feature_groups


2025-09-03 15:41:51.197 | INFO     | lifetracer.src.binary_classifier:binary_classifier:373 - Removed 30 indices from top_feature_group_indices manually
2025-09-03 15:41:54.519 | INFO     | lifetracer.src.binary_classifier:binary_classifier:385 - Top 10 signatures plotted in top_coefficients folder.
2025-09-03 15:41:54.944 | INFO     | lifetracer.src.binary_classifier:binary_classifier:389 - PCA plot saved.
2025-09-03 15:44:00.049 | INFO     | lifetracer.src.binary_classifier:binary_classifier:407 - 3D plot of signatures (png) saved.
2025-09-03 15:44:00.917 | INFO     | lifetracer.src.binary_classifier:binary_classifier:411 - 3D interactive plot of peaks saved.
2025-09-03 15:44:02.091 | INFO     | lifetracer.

3.472 0.04


2025-09-03 15:44:03.804 | INFO     | lifetracer.src.binary_classifier:binary_classifier:424 - Distribution of peaks across m/z values saved.
2025-09-03 15:44:03.809 | INFO     | lifetracer.src.binary_classifier:mann_whitney_u_test_mz:110 - Mann Whitney U test for m/z p-value: 0.9999999966445333
2025-09-03 15:44:03.810 | INFO     | lifetracer.src.binary_classifier:mann_whitney_u_test_mz:115 - Fail to reject null hypothesis-> Abiotic peak distribution for m/z is not significantly lower than biotic
2025-09-03 15:44:03.814 | INFO     | lifetracer.src.binary_classifier:mann_whitney_u_test_rt1:123 - Mann Whitney U test for RT1 p-value: 6.465282194451019e-209
2025-09-03 15:44:03.815 | INFO     | lifetracer.src.binary_classifier:mann_whitney_u_test_rt1:
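The log above reports one-sided Mann-Whitney U tests asking whether one group's values tend to be lower than the other's. The sketch below shows the idea in pure Python, using the normal approximation without tie or continuity correction. How LifeTracer computes the test internally is not shown in this notebook, so the function name, the approximation, and the toy numbers are assumptions for illustration only.

```python
import math


def mann_whitney_less(a, b):
    """One-sided test of H1: values in `a` tend to be smaller than values in `b`.

    Returns (U, p) where U counts pairs with a_i > b_j (ties count one half)
    and p comes from the normal approximation (no tie or continuity correction).
    """
    n1, n2 = len(a), len(b)
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    p = 0.5 * math.erfc(-z / math.sqrt(2.0))   # standard normal CDF at z
    return u, p


# Toy samples (hypothetical, not LifeTracer data)
low = [1, 2, 3, 4, 5]
high = [6, 7, 8, 9, 10]

u_low, p_low = mann_whitney_less(low, high)     # `low` really is lower: small p
u_high, p_high = mann_whitney_less(high, low)   # wrong direction: p close to 1
print(u_low, p_low, u_high, p_high)
```

A p-value close to 1, as in the m/z line of the log, means the data give no support for the "lower" alternative, whereas a very small p-value, as in the RT1 line, rejects the null hypothesis in favor of it. The pairwise loop is quadratic in sample size, so a rank-based library routine is preferable for large peak lists.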

# Plotting TIIs

### Download required data

Run the cell below to download the required data.

> 💡 **Tip**: Ensure you have `120 GB` of disk space available for this step.

> 💡 **Tip**: Ensure that the file is unzipped in the output/TII_aligned/ directory to run the next step (or you can modify the path in the next step)
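Since the download is large, a quick check of the available disk space before starting can save time. The sketch below uses the standard library's `shutil.disk_usage`; the helper name `has_free_space` is hypothetical and the threshold is counted in decimal gigabytes.

```python
import shutil


def has_free_space(path=".", required_gb=120):
    """Return True if the filesystem containing `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9   # bytes to decimal gigabytes
    return free_gb >= required_gb


if not has_free_space(".", 120):
    print("Less than 120 GB free: free up space or change the output location first.")
```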

In [None]:
import os
import shutil
import tarfile
import urllib.request

# The aligned TII archive is split into six parts
download_list = [
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-aa',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ab',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ac',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ad',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-ae',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/TII_aligned.tar.gz.part-af',
]

# Create the output folder if it does not exist
os.makedirs('output', exist_ok=True)

# Download each part into the output folder
for url in download_list:
    filename = url.split('/')[-1]
    output_path = f'output/{filename}'
    print(f"Downloading {filename}...")
    urllib.request.urlretrieve(url, output_path)
    print(f"Downloaded {filename}")

# Concatenate the parts in order, streaming so that no part is loaded fully into memory
print("Concatenating parts...")
with open('output/TII_aligned.tar.gz', 'wb') as outfile:
    for suffix in ['aa', 'ab', 'ac', 'ad', 'ae', 'af']:
        part_file = f'output/TII_aligned.tar.gz.part-{suffix}'
        with open(part_file, 'rb') as infile:
            shutil.copyfileobj(infile, outfile)

# Extract the reassembled archive into output/
print("Extracting archive...")
with tarfile.open('output/TII_aligned.tar.gz', 'r:gz') as tar:
    tar.extractall(path='output/')

print("Extraction complete!")

### How to run

In [None]:
import lifetracer

config = {
    # Path to the CSV file containing the labels
    "labels_path": "data/labels.csv",
    
    # Name of the column in the labels.csv that contains the labels
    "label_column_name": "label",
    
    # Directory where generated heatmaps are stored
    "heatmap_dir": "output/TII_aligned/",
    
    # Directory where generated plots will be saved
    "plot_dir": "output/plots/",
    
    # Boolean flag indicating whether all samples should be processed
    "all_samples": True, # Set to False to plot a single sample
    
    # Name of the sample to be analyzed if all_samples is False
    "sample_name": "230823_01_Atacama_Soil_300uLDCM_100oC24h-001.csv",

    "csv_file_name_column": "csv_file_name",
    
    # What m/z value to plot
    "m_z": "469"
}

lifetracer.plot_heatmap.plot_heatmap(config)
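To plot a single sample instead of all of them, `all_samples` is set to `False` and `sample_name` selects the file. The sketch below derives such a config from the one above without mutating it. `single_sample_config` is a hypothetical convenience helper, not a LifeTracer function, and the plotting call is left commented out so the snippet runs without the data.

```python
def single_sample_config(base_config, sample_name, m_z):
    """Return a copy of `base_config` set up to plot one sample at one m/z value."""
    cfg = dict(base_config)        # shallow copy; the base config stays unchanged
    cfg["all_samples"] = False
    cfg["sample_name"] = sample_name
    cfg["m_z"] = str(m_z)          # the notebook passes m/z as a string
    return cfg


base = {
    "labels_path": "data/labels.csv",
    "label_column_name": "label",
    "heatmap_dir": "output/TII_aligned/",
    "plot_dir": "output/plots/",
    "all_samples": True,
    "sample_name": "230823_01_Atacama_Soil_300uLDCM_100oC24h-001.csv",
    "csv_file_name_column": "csv_file_name",
    "m_z": "469",
}

one_sample = single_sample_config(
    base, "230823_01_Atacama_Soil_300uLDCM_100oC24h-001.csv", 469
)
print(one_sample["all_samples"], one_sample["m_z"])

# import lifetracer
# lifetracer.plot_heatmap.plot_heatmap(one_sample)
```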

# Model Evaluation

### Download required data

Run the cell below to download the required data if you did not run the previous steps.


In [6]:
import os
import urllib.request
import zipfile

# Archives with the precomputed features and peaks
download_list = [
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/features.zip',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/peaks.zip',
]

# Create the output folder if it does not exist
os.makedirs('output', exist_ok=True)

# Download each archive into the output folder
for url in download_list:
    filename = url.split('/')[-1]
    print(f"Downloading {filename}...")
    urllib.request.urlretrieve(url, f"output/{filename}")
    print(f"Downloaded {filename}")

# Unzip both archives into output/
print("Extracting features.zip...")
with zipfile.ZipFile('output/features.zip', 'r') as zip_ref:
    zip_ref.extractall('output/')

print("Extracting peaks.zip...")
with zipfile.ZipFile('output/peaks.zip', 'r') as zip_ref:
    zip_ref.extractall('output/')

print("Extraction complete!")

Downloading features.zip...
Downloaded features.zip
Downloading peaks.zip...
Downloaded peaks.zip
Extracting features.zip...
Extracting peaks.zip...
Extraction complete!


### How to run the code

In [7]:
import lifetracer

config = {
    "parallel_processing": True,
    "mz_list_path": "data/all_mz_values.csv",
    "labels_path": "data/labels.csv",
    "m_z_column_name": "M/Z",
    "area_column_name": "Area",
    "first_time_column_name": "1st Time (s)",
    "second_time_column_name": "2nd Time (s)",
    "csv_file_name_column": "csv_file_name",
    "label_column_name": "label",

    # Note: If you did not run all the steps, download the features and peaks from Hugging Face
    # (see the download cell above).
    # Download features: https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/features.zip
    # Download peaks: https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/peaks.zip
    "features_path": "output/features", # Path to features directory
    "peaks_dir_path": "output/peaks", # Path to peaks directory
    "eval_path":"output/eval/lr_l2", # Change the path to your desired output directory
    
    "model": "lr_l2",
    "lr_l2": {
        "C": [1e-4,1e-3,1e-2,1e-1,1e0,1e+1,1e+2,1e+3,1e+4],
        "lambda1": [5],
        "lambda2": [100],
        "rt1_threshold": [50],
        "rt2_threshold": [0.8],
    },

    # To evaluate a different model, comment out the "model" and "lr_l2" entries above
    # and uncomment one of the blocks below

    # "model": "lr_l1",
    # "lr_l1": {
    #     "C": [1e-4,1e-3,1e-2,1e-1,1e0,1e+1,1e+2,1e+3,1e+4],
    #     "lambda1": [5],
    #     "lambda2": [100],
    #     "rt1_threshold": [50],
    #     "rt2_threshold": [0.8],
    # }

    
    # "model": "svm",
    # "svm": {
    #     "C": [1e-3,1e-2,1e-1,1e0,1e+1,1e+2,1e+3],
    #     "kernel": ["linear","poly","rbf","sigmoid"],
    #     "lambda1": [5],
    #     "lambda2": [100],
    #     "rt1_threshold": [50],
    #     "rt2_threshold": [0.8],
    # },

    # "model": "rf",
    # "rf": {
    #     "n_estimators": [20, 50, 100, 200, 500],
    #     "lambda1": [5],
    #     "lambda2": [100],
    #     "rt1_threshold": [50],
    #     "rt2_threshold": [0.8],
    # }

    # "model": 'NaiveBayes',
    # "NaiveBayes": {
    #     "alpha": [0.01, 0.1, 0.5, 1, 5, 10],
    #     "lambda1": [5],
    #     "lambda2": [100],
    #     "rt1_threshold": [50],
    #     "rt2_threshold": [0.8],
    # }
}

lifetracer.evaluation.eval(config)

2025-09-03 16:12:09.623 | INFO     | lifetracer.src.evaluation:eval:379 - Model: lr_l2
2025-09-03 16:12:09.626 | INFO     | lifetracer.src.evaluation:eval:380 - Starting evaluation


Folder created here: output/eval/lr_l2
Parallel processing


Seed 741 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.93it/s]
Seed 627 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.96it/s]
Seed 661 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.88it/s]
Seed 522 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.87it/s]
Seed 137 - Processing parameters: 100%|██████████| 9/9 [00:02<00:00,  3.05it/s]
Seed 860 - Processing parameters: 100%|██████████| 9/9 [00:02<00:00,  3.05it/s]
Seed 412 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.74it/s]
Seed 514 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.67it/s]
Seed 738 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.63it/s]
Seed 679 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.65it/s]
Seed 741 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.68it/s]
Seed 522 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.83it/s]
Seed 514 - Processing parameters: 100%|█

[np.float64(88.88888888888889), np.float64(72.22222222222221), np.float64(88.88888888888889), np.float64(88.88888888888889), np.float64(88.88888888888889), np.float64(88.88888888888889), np.float64(94.44444444444444), np.float64(88.88888888888889), np.float64(83.33333333333334), np.float64(88.88888888888889)]
[np.float64(20.78698548207745), np.float64(41.5739709641549), np.float64(20.78698548207745), np.float64(20.78698548207745), np.float64(20.78698548207745), np.float64(20.78698548207745), np.float64(15.713484026367725), np.float64(20.78698548207745), np.float64(23.570226039551585), np.float64(20.78698548207745)]
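The progress bars above report 9 parameter combinations per seed, which matches the Cartesian product of the `lr_l2` grid: nine `C` values times one value for each of the other four parameters. The sketch below expands such a grid with `itertools.product`. How LifeTracer enumerates the grid internally is not shown here, so `expand_grid` is a hypothetical helper for illustration.

```python
from itertools import product

# Same grid as the "lr_l2" entry in the evaluation config
lr_l2_grid = {
    "C": [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e+1, 1e+2, 1e+3, 1e+4],
    "lambda1": [5],
    "lambda2": [100],
    "rt1_threshold": [50],
    "rt2_threshold": [0.8],
}


def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


combos = expand_grid(lr_l2_grid)
print(len(combos))   # 9
print(combos[0])
```

Adding a second value to any other parameter, for example `"lambda1": [5, 10]`, would double the number of combinations evaluated per seed.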
