# Logistic Regression Results

This notebook demonstrates **Step 7** from the complete pipeline (`reproduce_paper_results.ipynb`), focusing specifically on logistic regression analysis and generating the plots used in the research paper.

## Prerequisites

- **Storage Requirements**: This notebook needs about 2 GB of storage, far less than the full pipeline
- **Data Dependencies**: Assumes peaks and features have already been extracted by the previous pipeline steps
- **Alternative**: To follow every step, run the complete `reproduce_paper_results.ipynb` instead; it needs at least 750 GB of available storage

## What This Notebook Does

1. Downloads pre-processed peaks and features data
2. Runs logistic regression classification with L2 regularization
3. Generates evaluation metrics and plots
4. Outputs results matching those reported in the paper
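
Step 2 fits a logistic regression with an L2 penalty. As a minimal sketch of what such a model minimizes, the function below evaluates an L2-penalized logistic loss in plain Python. It assumes the scikit-learn convention, in which `C` is the inverse regularization strength and labels are coded as -1/+1. Whether `lifetracer` uses exactly this formulation is not confirmed by this notebook, and `l2_logistic_objective` is a hypothetical helper, not part of the package.

```python
import math

def l2_logistic_objective(w, b, X, y, C):
    """Return 0.5 * ||w||^2 + C * sum(log(1 + exp(-y_i * (w . x_i + b)))).

    Labels y_i are expected in {-1, +1} (assumed convention).
    """
    penalty = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        # log(1 + exp(-margin)), written so that large |margin| does not overflow
        loss += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return penalty + C * loss

# Toy data: two samples, two features, one sample per class
X = [[0.0, 1.0], [1.0, 0.0]]
y = [1, -1]
print(l2_logistic_objective([0.0, 0.0], 0.0, X, y, C=0.1))
```

A smaller `C` weights the data term less relative to the penalty, so the fitted coefficients are pulled more strongly toward zero.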

## Setup Instructions

### LifeTracer Package Installation

**For Linux/macOS:**


```bash
# Clone the repository (SSH URL; requires an SSH key registered with GitHub)
git clone git@github.com:amirgroup-codes/LifeTracer.git

# Step 1: Create conda environment
conda create -n LifeTracer python=3.10.8

# Step 2: Activate environment
conda activate LifeTracer

# Step 3: Navigate to project directory and install
cd LifeTracer
pip install -e .

# Step 4: Verify installation
python -c "import lifetracer; print('LifeTracer installed successfully!')"
```

**For Windows:**

Open Anaconda Prompt or PowerShell and run the same commands as above.

### Common Installation Issues

| Issue | Solution |
|-------|----------|
| Permission Errors | Use `pip install --user -e .` |
| Environment Issues | Run `conda clean --all` and recreate environment |
| Path Problems | Ensure you're in the correct project directory |
| Import Errors | Verify Python version is 3.10.8 |
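
To check the import and version rows of the table programmatically, a small sketch like the following can be run inside the activated environment. `check_environment` is a hypothetical helper written for illustration; it only compares the major and minor Python version and tests whether the `lifetracer` package can be located.

```python
import importlib.util
import sys

def check_environment(expected=(3, 10), version=None, package="lifetracer"):
    """Return a dict describing whether the interpreter and package look right."""
    version = tuple(version or sys.version_info[:3])
    return {
        "python_version": ".".join(str(part) for part in version),
        "python_ok": version[:2] == tuple(expected),
        # find_spec returns None when a top-level package cannot be located
        "package_found": importlib.util.find_spec(package) is not None,
    }

print(check_environment())
```

If `python_ok` is `False`, the active interpreter is probably not the one from the `LifeTracer` conda environment.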

## Data Download

Download the required pre-processed data (peaks and features) from the previous pipeline steps:

In [None]:
import os
import urllib.request
import zipfile

# Pre-processed archives (features and peaks) produced by the earlier pipeline steps
download_list = [
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/features.zip',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/peaks.zip',
]

# Create the folder that holds the downloaded archives and the extracted data
os.makedirs('output', exist_ok=True)

# Download each archive into the output folder
for url in download_list:
    filename = url.split('/')[-1]
    print(f"Downloading {filename}...")
    urllib.request.urlretrieve(url, f"output/{filename}")
    print(f"Downloaded {filename}")

# Unzip both archives into output/
print("Extracting features.zip...")
with zipfile.ZipFile('output/features.zip', 'r') as zip_ref:
    zip_ref.extractall('output/')

print("Extracting peaks.zip...")
with zipfile.ZipFile('output/peaks.zip', 'r') as zip_ref:
    zip_ref.extractall('output/')

print("Extraction complete!")
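
Before moving on, it can help to confirm that the extraction produced the directories the later cells read from. The sketch below assumes the archives unpack to `output/features/` and `output/peaks/`, which matches the `features_path` and `peaks_dir_path` values in the configuration used later in this notebook. `missing_data_dirs` is a hypothetical helper, not part of `lifetracer`.

```python
from pathlib import Path

def missing_data_dirs(root="output", required=("features", "peaks")):
    """Return the required sub-directories that are absent or empty under root."""
    missing = []
    for name in required:
        path = Path(root) / name
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(str(path))
    return missing

problems = missing_data_dirs()
if problems:
    print(f"Missing or empty: {problems}")
else:
    print("All data directories present.")
```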

## Running Logistic Regression

Execute the following cell to run the logistic regression classifier and generate the plots used in the paper. Results are written to `output/lr_l2_results/`.

In [1]:
import lifetracer

config = {
    "mz_list_path": "data/all_mz_values.csv",
    "labels_path": "data/labels.csv",
    "m_z_column_name": "M/Z",
    "area_column_name": "Area",
    "first_time_column_name": "1st Time (s)",
    "second_time_column_name": "2nd Time (s)",
    "csv_file_name_column": "csv_file_name",
    "label_column_name": "label",

    "features_path": "output/features/",
    "peaks_dir_path": "output/peaks/",
    "results_dir": "output/lr_l2_results/",

    # Logistic Regression with l2 regularization
    "C": 0.1, # best C from step 6
    "seed": 42, 
    "lambda1": 5, # Selected from calibration step
    "lambda2": 100, # Selected from calibration step
    "rt1_threshold": 50, # Selected from calibration step
    "rt2_threshold": 0.8, # Selected from calibration step
}

lifetracer.binary_classifier.binary_classifier(config)

2025-09-03 16:45:17.857 | INFO     | lifetracer.src.binary_classifier:binary_classifier:324 - Intercept: [0.03806688]
2025-09-03 16:45:27.251 | INFO     | lifetracer.src.binary_classifier:binary_classifier:348 - Signatures saved.

Folder created here: output/lr_l2_results/top_features/
Folder created here: output/lr_l2_results/feature_groups

2025-09-03 16:45:28.317 | INFO     | lifetracer.src.binary_classifier:binary_classifier:375 - Removed 30 indices from top_feature_group_indices manually
2025-09-03 16:45:29.596 | INFO     | lifetracer.src.binary_classifier:binary_classifier:387 - Top 10 signatures plotted in top_coefficients folder.
2025-09-03 16:45:29.886 | INFO     | lifetracer.src.binary_classifier:binary_classifier:391 - PCA plot saved.
2025-09-03 16:47:05.374 | INFO     | lifetracer.src.binary_classifier:binary_classifier:409 - 3D plot of signatures (png) saved.
2025-09-03 16:47:06.127 | INFO     | lifetracer.src.binary_classifier:binary_classifier:413 - 3D interactive plot of peaks saved.
2025-09-03 16:47:06.985 | INFO     | lifetracer.

3.472 0.04

2025-09-03 16:47:08.349 | INFO     | lifetracer.src.binary_classifier:binary_classifier:426 - Distribution of peaks across m/z values saved.
2025-09-03 16:47:08.352 | INFO     | lifetracer.src.binary_classifier:mann_whitney_u_test_mz:110 - Mann Whitney U test for m/z p-value: 0.9999999966445333
2025-09-03 16:47:08.353 | INFO     | lifetracer.src.binary_classifier:mann_whitney_u_test_mz:115 - Fail to reject null hypothesis-> Abiotic peak distribution for m/z is not significantly lower than biotic
2025-09-03 16:47:08.357 | INFO     | lifetracer.src.binary_classifier:mann_whitney_u_test_rt1:123 - Mann Whitney U test for RT1 p-value: 6.465282194451019e-209
2025-09-03 16:47:08.358 | INFO     | lifetracer.src.binary_classifier:mann_whitney_u_test_rt1:

## Model Evaluation

### Data Download (Optional)

**Note**: If you haven't run the data download section above, execute the cell below to download the required datasets:

In [None]:
import os
import urllib.request
import zipfile

# Pre-processed archives (features and peaks) produced by the earlier pipeline steps
download_list = [
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/features.zip',
    'https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/peaks.zip',
]

# Create the folder that holds the downloaded archives and the extracted data
os.makedirs('output', exist_ok=True)

# Download each archive into the output folder
for url in download_list:
    filename = url.split('/')[-1]
    print(f"Downloading {filename}...")
    urllib.request.urlretrieve(url, f"output/{filename}")
    print(f"Downloaded {filename}")

# Unzip both archives into output/
print("Extracting features.zip...")
with zipfile.ZipFile('output/features.zip', 'r') as zip_ref:
    zip_ref.extractall('output/')

print("Extracting peaks.zip...")
with zipfile.ZipFile('output/peaks.zip', 'r') as zip_ref:
    zip_ref.extractall('output/')

print("Extraction complete!")

Downloading features.zip...
Downloaded features.zip
Downloading peaks.zip...
Downloaded peaks.zip
Extracting features.zip...
Extracting peaks.zip...
Extraction complete!


### Model Evaluation Configuration

Configure and run the evaluation with different machine learning models. The default configuration uses **Logistic Regression with L2 regularization**.
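
As an alternative to commenting and uncommenting blocks by hand, the per-model grids can be kept in one dictionary and selected by name. The sketch below is only an illustration: `SHARED_GRID`, `MODEL_GRIDS`, and `build_eval_config` are hypothetical names, the grids are copied from the configuration cell in this section (only two models are shown), and the returned dictionary would still need the path and column settings before being passed to `lifetracer.evaluation.eval`.

```python
# Calibration values shared by every model in this notebook
SHARED_GRID = {
    "lambda1": [5],
    "lambda2": [100],
    "rt1_threshold": [50],
    "rt2_threshold": [0.8],
}

# Model-specific hyperparameter grids, copied from the configuration cell
MODEL_GRIDS = {
    "lr_l2": {"C": [1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e+1, 1e+2, 1e+3, 1e+4]},
    "svm": {
        "C": [1e-3, 1e-2, 1e-1, 1e0, 1e+1, 1e+2, 1e+3],
        "kernel": ["linear", "poly", "rbf", "sigmoid"],
    },
}

def build_eval_config(model, base=None):
    """Return a copy of base with "model", its grid, and a per-model eval_path set."""
    if model not in MODEL_GRIDS:
        raise KeyError(f"unknown model: {model!r}")
    config = dict(base or {})
    config["model"] = model
    config[model] = {**MODEL_GRIDS[model], **SHARED_GRID}
    config["eval_path"] = f"output/eval/{model}"
    return config

print(build_eval_config("svm")["svm"]["kernel"])
```

Keeping one grid per model name avoids leaving two `"model"` keys in the same dictionary literal, where Python would silently keep only the last one.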

In [None]:
import lifetracer

config = {
    "parallel_processing": True,
    "mz_list_path": "data/all_mz_values.csv",
    "labels_path": "data/labels.csv",
    "m_z_column_name": "M/Z",
    "area_column_name": "Area",
    "first_time_column_name": "1st Time (s)",
    "second_time_column_name": "2nd Time (s)",
    "csv_file_name_column": "csv_file_name",
    "label_column_name": "label",

    # If you skipped the earlier pipeline steps, download the pre-processed features and peaks
    # from Hugging Face (see the Data Download section of this notebook):
    # Download features: https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/features.zip
    # Download peaks: https://huggingface.co/datasets/DS-20202/LifeTracer-Processed-Data/resolve/main/peaks.zip
    "features_path": "output/features", # Path to features directory
    "peaks_dir_path": "output/peaks", # Path to peaks directory
    "eval_path":"output/eval/lr_l2", # Change the path to your desired output directory
    
    "model": "lr_l2",
    "lr_l2": {
        "C": [1e-4,1e-3,1e-2,1e-1,1e0,1e+1,1e+2,1e+3,1e+4],
        "lambda1": [5],
        "lambda2": [100],
        "rt1_threshold": [50],
        "rt2_threshold": [0.8],
    },

    # To evaluate a different model, comment out the "model" and "lr_l2" entries above
    # and uncomment one of the blocks below.

    # "model": "lr_l1",
    # "lr_l1": {
    #     "C": [1e-4,1e-3,1e-2,1e-1,1e0,1e+1,1e+2,1e+3,1e+4],
    #     "lambda1": [5],
    #     "lambda2": [100],
    #     "rt1_threshold": [50],
    #     "rt2_threshold": [0.8],
    # },

    
    # "model": "svm",
    # "svm": {
    #     "C": [1e-3,1e-2,1e-1,1e0,1e+1,1e+2,1e+3],
    #     "kernel": ["linear","poly","rbf","sigmoid"],
    #     "lambda1": [5],
    #     "lambda2": [100],
    #     "rt1_threshold": [50],
    #     "rt2_threshold": [0.8],
    # },

    # "model": "rf",
    # "rf": {
    #     "n_estimators": [20, 50, 100, 200, 500],
    #     "lambda1": [5],
    #     "lambda2": [100],
    #     "rt1_threshold": [50],
    #     "rt2_threshold": [0.8],
    # },

    # "model": 'NaiveBayes',
    # "NaiveBayes": {
    #     "alpha": [0.01, 0.1, 0.5, 1, 5, 10],
    #     "lambda1": [5],
    #     "lambda2": [100],
    #     "rt1_threshold": [50],
    #     "rt2_threshold": [0.8],
    # }
}

lifetracer.evaluation.eval(config)

2025-09-03 16:12:09.623 | INFO     | lifetracer.src.evaluation:eval:379 - Model: lr_l2
2025-09-03 16:12:09.626 | INFO     | lifetracer.src.evaluation:eval:380 - Starting evaluation

Folder created here: output/eval/lr_l2
Parallel processing

Seed 741 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.93it/s]
Seed 627 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.96it/s]
Seed 661 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.88it/s]
Seed 522 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.87it/s]
Seed 137 - Processing parameters: 100%|██████████| 9/9 [00:02<00:00,  3.05it/s]
Seed 860 - Processing parameters: 100%|██████████| 9/9 [00:02<00:00,  3.05it/s]
Seed 412 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.74it/s]
Seed 514 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.67it/s]
Seed 738 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.63it/s]
Seed 679 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.65it/s]
Seed 741 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.68it/s]
Seed 522 - Processing parameters: 100%|██████████| 9/9 [00:03<00:00,  2.83it/s]
Seed 514 - Processing parameters: 100%|█

[np.float64(88.88888888888889), np.float64(72.22222222222221), np.float64(88.88888888888889), np.float64(88.88888888888889), np.float64(88.88888888888889), np.float64(88.88888888888889), np.float64(94.44444444444444), np.float64(88.88888888888889), np.float64(83.33333333333334), np.float64(88.88888888888889)]
[np.float64(20.78698548207745), np.float64(41.5739709641549), np.float64(20.78698548207745), np.float64(20.78698548207745), np.float64(20.78698548207745), np.float64(20.78698548207745), np.float64(15.713484026367725), np.float64(20.78698548207745), np.float64(23.570226039551585), np.float64(20.78698548207745)]
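
The two lists printed above are not labeled in the output. Assuming the first one holds one accuracy percentage per seed (an interpretation, not something the log states), a short sketch for summarizing it with the standard library could look like this:

```python
import statistics

# Values copied from the first list above, rounded to two decimals
# (assumed to be per-seed accuracies in percent)
per_seed_accuracy = [88.89, 72.22, 88.89, 88.89, 88.89, 88.89, 94.44, 88.89, 83.33, 88.89]

mean_accuracy = statistics.mean(per_seed_accuracy)
spread = statistics.stdev(per_seed_accuracy)  # sample standard deviation across seeds

print(
    f"mean = {mean_accuracy:.2f}%, stdev across seeds = {spread:.2f}, "
    f"min = {min(per_seed_accuracy):.2f}%, max = {max(per_seed_accuracy):.2f}%"
)
```

Reporting the spread alongside the mean shows how much the result depends on the random seed.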
