# Tutorial: Loading, filtering, and preprocessing SAMPL-seq datasets for model inference

This tutorial shows how to process raw SAMPL-seq data and prepare it for model inference. It covers the key preprocessing steps, including:
- Filtering taxa based on minimum relative abundance and consistency across subjects
- Filtering particles based on minimum and maximum reads per particle
- Excluding specific taxa, time points, or subjects from the analysis
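
To make the first filter concrete, here is a minimal pandas sketch of one plausible reading of "minimum relative abundance and consistency across subjects": a taxon is kept if its relative abundance reaches a threshold in at least a given number of subjects. The function `filter_taxa`, the column names (`subject`, `otu`, `count`), and the toy data are all hypothetical illustrations, not the mcspace implementation or its input format.

```python
import pandas as pd

def filter_taxa(counts, min_abundance, num_consistent_subjects):
    """Keep taxa whose relative abundance is at least `min_abundance`
    in at least `num_consistent_subjects` subjects (illustrative only)."""
    # Total reads per subject (rows) and taxon (columns).
    totals = counts.groupby(["subject", "otu"])["count"].sum().unstack(fill_value=0)
    # Convert to relative abundances so that each subject's row sums to 1.
    rel = totals.div(totals.sum(axis=1), axis=0)
    # Count the subjects in which each taxon clears the threshold.
    n_subjects = (rel >= min_abundance).sum(axis=0)
    keep = n_subjects[n_subjects >= num_consistent_subjects].index
    return counts[counts["otu"].isin(keep)]

# Toy long-format table with hypothetical column names.
toy = pd.DataFrame({
    "subject": ["A", "A", "B", "B", "B"],
    "otu":     ["o1", "o2", "o1", "o2", "o3"],
    "count":   [90, 10, 80, 1, 19],
})
filtered = filter_taxa(toy, min_abundance=0.05, num_consistent_subjects=2)
print(sorted(filtered["otu"].unique()))
```

In this toy table only `o1` reaches 5% relative abundance in both subjects, so it is the only taxon retained.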

For demonstration, we use our mouse dataset from the main paper, applying stricter filtering criteria to obtain a smaller dataset for the tutorials.

In [1]:
from mcspace.data_utils import parse
from pathlib import Path
import pandas as pd

# Paths

In [3]:
basepath = Path("./")
datapath = basepath / "data"
outpath = basepath / "results"
outpath.mkdir(exist_ok=True, parents=True)

In [4]:
ls "data"

 Volume in drive C is Windows-SSD
 Volume Serial Number is 1086-9223

 Directory of C:\Users\Gary\Documents\PROJECTS\MCSPACE_FINAL\MCSPACE\mcspace\tutorials\data

12/04/2024  03:55 PM    <DIR>          .
12/04/2024  03:57 PM    <DIR>          ..
12/04/2024  02:40 PM        12,618,963 mouse_counts.csv.gz
12/04/2024  02:40 PM            30,857 newick_tree_query_reads.nhx
12/04/2024  02:40 PM           157,797 tax.csv
               3 File(s)     12,807,617 bytes
               2 Dir(s)  694,422,749,184 bytes free


# Load raw data, check formatting
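
One way to check the formatting of a raw input file before parsing is to read its first few rows with pandas and look at the column names. The helper below is a hedged sketch: `preview_table` is a hypothetical name, and nothing here assumes which columns mcspace expects.

```python
from pathlib import Path
import pandas as pd

def preview_table(path, nrows=5):
    """Read the first rows of a CSV file and report its column names."""
    # pandas infers gzip compression from a ".gz" suffix by default.
    df = pd.read_csv(path, nrows=nrows)
    print(f"{Path(path).name}: columns = {list(df.columns)}")
    return df
```

For example, `preview_table(datapath / "mouse_counts.csv.gz")` would show the first five rows of the counts file.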

# Parse for inference

To obtain a smaller dataset for the tutorials, we apply a minimum-reads threshold and remove additional time points.

In [None]:
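# Filter the raw data and prepare it for inference.
# `times_remove` and `otus_remove` must be defined before this call.
parse(datapath / "mouse_counts.csv.gz",
      datapath / "taxa.csv",
      datapath / "perturbations.csv",
      subjects_remove='JX09',         # subject to exclude
      times_remove=times_remove,      # time points to exclude
      otus_remove=otus_remove,        # taxa to exclude
      num_consistent_subjects=2,      # consistency across subjects (taxa filter)
      min_abundance=0.005,            # minimum relative abundance (taxa filter)
      min_reads=1000,                 # minimum reads per particle
      max_reads=10000)                # maximum reads per particle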
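
The `min_reads` and `max_reads` arguments bound the number of reads per particle. As a rough illustration of that kind of filter, the sketch below sums reads per particle and keeps only particles whose total falls inside the bounds. The function `filter_particles`, the column names, and the inclusive bounds are assumptions made for this example; the internals of `parse` may differ.

```python
import pandas as pd

def filter_particles(counts, min_reads, max_reads):
    """Keep rows of particles whose total reads lie in [min_reads, max_reads]."""
    # Total reads of the particle each row belongs to, aligned to the rows.
    reads = counts.groupby(["subject", "particle"])["count"].transform("sum")
    return counts[(reads >= min_reads) & (reads <= max_reads)]

# Toy long-format table with hypothetical column names.
toy = pd.DataFrame({
    "subject":  ["A", "A", "A", "A", "A"],
    "particle": ["p1", "p1", "p2", "p2", "p3"],
    "otu":      ["o1", "o2", "o1", "o2", "o1"],
    "count":    [300, 200, 1500, 500, 20000],
})
kept = filter_particles(toy, min_reads=1000, max_reads=10000)
print(kept["particle"].unique())
```

Here `p1` (500 reads) is too shallow and `p3` (20,000 reads) is too deep, so only `p2` (2,000 reads) remains.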