# Tutorial 1: Formatting, filtering, and preprocessing SAMPL-seq datasets for model inference

This tutorial shows how to process raw SAMPL-seq data and prepare it for model inference. It covers the key preprocessing steps:
- How to format input files for the MCSPACE software
- Filtering taxa based on minimum relative abundance and consistency across subjects
- Filtering particles based on minimum and maximum reads per particle
- Excluding specific taxa, time points, or subjects from the analysis

For demonstration, we use our mouse dataset from the main paper, applying stricter filtering criteria to obtain a smaller dataset for the tutorials.

In [17]:
from pathlib import Path
import pandas as pd
from mcspace.utils import pickle_save

The "parse" function processes SAMPL-seq data files and returns a dataset object that can be used for model inference. Import it as follows:

In [19]:
from mcspace.data_utils import parse

# Paths

Relative paths for this tutorial. `basepath` is the current working directory, which is assumed to be the folder containing this notebook.

In [20]:
basepath = Path("./")
datapath = basepath / "data"
outpath = basepath / "results"
outpath.mkdir(exist_ok=True, parents=True)

In [21]:
ls "data"

 Volume in drive C is Windows-SSD
 Volume Serial Number is 1086-9223

 Directory of C:\Users\Gary\Documents\PROJECTS\MCSPACE_FINAL\MCSPACE\mcspace\tutorials\data

12/05/2024  09:49 AM    <DIR>          .
12/05/2024  11:37 AM    <DIR>          ..
12/04/2024  02:40 PM        12,618,963 mouse_counts.csv.gz
12/04/2024  02:40 PM            30,857 newick_tree_query_reads.nhx
12/04/2024  04:06 PM                58 perturbations.csv
12/05/2024  09:48 AM            93,514 taxonomy.csv
               4 File(s)     12,743,392 bytes
               2 Dir(s)  688,379,641,856 bytes free


The `data` folder contains data for this tutorial. It contains the following files:
- `mouse_counts.csv.gz`: Particle count data for the samples in the study. We use a compressed file here because of space limitations on GitHub; an uncompressed csv file also works with the `parse` function.
- `newick_tree_query_reads.nhx`: Phylogenetic tree, in Newick format, for the OTUs in the study. This file is optional and may be used when visualizing results.
- `perturbations.csv`: CSV file indicating which timepoints are perturbed.
- `taxonomy.csv`: Taxonomic information for the OTUs in the study.

We load these files below to show how they should be formatted.

# Required formatting of input files

### counts file

In [24]:
counts = pd.read_csv(datapath / "mouse_counts.csv.gz", compression='gzip', index_col=0)



In [25]:
counts.head()

Unnamed: 0.1,Unnamed: 0,Particle,OTU,Sample,replicate,Count,Subject,chow,Time,timepoint,empty_tube_weight_mg,pellet_tube_weight_mg
0,0,20231021_1123_1_S_MEbcnum60_80_90,Otu1,20231021_1123_1_S_ME,1,537,JX07,standard,57,AM,1012.45,1054.25
1,1,20231021_1123_1_S_MEbcnum22_63_29,Otu1,20231021_1123_1_S_ME,1,79,JX07,standard,57,AM,1012.45,1054.25
2,2,20231021_1123_1_S_MEbcnum22_38_88,Otu1,20231021_1123_1_S_ME,1,7,JX07,standard,57,AM,1012.45,1054.25
3,3,20231021_1123_1_S_MEbcnum22_94_35,Otu1,20231021_1123_1_S_ME,1,12,JX07,standard,57,AM,1012.45,1054.25
4,4,20231021_1123_1_S_MEbcnum22_92_8,Otu1,20231021_1123_1_S_ME,1,34,JX07,standard,57,AM,1012.45,1054.25


The counts file contains sequencing counts for each OTU in each particle, for each sample. The following columns are **required**:
- `Particle`: The unique ID of the particle the row belongs to.
- `OTU`: The OTU (for example `Otu1`) to which the row's counts correspond.
- `Count`: The number of sequencing counts for the given OTU in the given particle and sample.
- `Time`: The timepoint to which the row's counts correspond.
- `Subject`: The subject to which the row's counts correspond.
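
Before calling `parse`, it can be useful to confirm that a counts table carries the required columns. The following is a minimal sketch in plain pandas, run on a tiny made-up table rather than the real data. `REQUIRED_COUNT_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration; they are not part of MCSPACE.

```python
import pandas as pd

# Column names listed as required above
REQUIRED_COUNT_COLUMNS = ["Particle", "OTU", "Count", "Time", "Subject"]


def missing_columns(df, required=REQUIRED_COUNT_COLUMNS):
    """Return the required columns absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


# Tiny stand-in table with made-up values; a real table would come from pd.read_csv
toy_counts = pd.DataFrame({
    "Particle": ["p1", "p1", "p2"],
    "OTU": ["Otu1", "Otu2", "Otu1"],
    "Count": [537, 79, 7],
    "Time": [57, 57, 57],
    "Subject": ["JX07", "JX07", "JX07"],
})

print(missing_columns(toy_counts))                            # []
print(missing_columns(toy_counts.drop(columns=["Subject"])))  # ['Subject']
```

Extra columns, such as `Sample` or `replicate` in the table above, do not affect this check.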

### perturbations

In [26]:
perturbations = pd.read_csv(datapath/"perturbations.csv")

In [27]:
perturbations

Unnamed: 0,Time,Perturbed
0,10,0
1,18,1
2,35,0
3,43,1
4,57,0
5,65,1
6,76,0


This file gives perturbation information for each timepoint in the study. The csv file **requires** two columns:
- `Time`: Lists the timepoints in the study, one per row.
- `Perturbed`: Must be either 0 or 1 in each row, with 0 indicating no perturbation at the corresponding timepoint and 1 indicating that the timepoint is perturbed.
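
As a rough illustration, a perturbations table of this shape can be built and sanity-checked with pandas. The values below are copied from the table shown above. The uniqueness check on `Time` is an assumption about what a well-formed file looks like, not a documented MCSPACE requirement.

```python
import pandas as pd

# Timepoints and perturbation flags copied from the table above
perturbations = pd.DataFrame({
    "Time": [10, 18, 35, 43, 57, 65, 76],
    "Perturbed": [0, 1, 0, 1, 0, 1, 0],
})

# Every flag must be 0 or 1
flags_ok = bool(perturbations["Perturbed"].isin([0, 1]).all())

# Assumption: each timepoint appears on exactly one row
times_unique = bool(perturbations["Time"].is_unique)

# Timepoints flagged as perturbed
perturbed_times = perturbations.loc[perturbations["Perturbed"] == 1, "Time"].tolist()

print(flags_ok, times_unique, perturbed_times)  # True True [18, 43, 65]
```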

### taxonomy

In [28]:
taxonomy = pd.read_csv(datapath / "taxonomy.csv")

In [30]:
taxonomy.head()

Unnamed: 0,Otu,Domain,Phylum,Class,Order,Family,Genus,Species
0,Otu1,Bacteria,Firmicutes,Bacilli,Lactobacillales,Lactobacillaceae,Lactobacillus,na
1,Otu7,Bacteria,Firmicutes,Clostridia,Clostridiales,Lachnospiraceae,Clostridium_XlVa,na
2,Otu11,Bacteria,Actinobacteria,Coriobacteriia,Eggerthellales,Eggerthellaceae,Adlercreutzia,na
3,Otu8,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Muribaculaceae,Duncaniella,na
4,Otu2,Bacteria,Verrucomicrobia,Verrucomicrobiae,Verrucomicrobiales,Akkermansiaceae,Akkermansia,na


The taxonomy file gives the taxonomic information for each OTU in the dataset. It requires the following columns: `Otu, Domain, Phylum, Class, Order, Family, Genus, Species`. A value of `na` indicates that the OTU is not resolved at the corresponding taxonomic level. When visualizing results (see the `visualizing_results.ipynb` tutorial file), the software automatically labels each OTU at its lowest resolved taxonomic level.
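
To make "lowest resolved taxonomic level" concrete, here is a minimal sketch that walks the ranks from most to least specific and returns the first one that is not `na`. `RANKS` and `lowest_resolved` are hypothetical names used only for illustration; this is not MCSPACE's own implementation.

```python
# Taxonomic ranks from most general to most specific, matching the required columns
RANKS = ["Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def lowest_resolved(row):
    """Return (rank, name) for the most specific rank whose value is not 'na'.

    Hypothetical helper for illustration only.
    """
    for rank in reversed(RANKS):
        if row[rank] != "na":
            return rank, row[rank]
    return None, "na"  # nothing resolved at any rank


# First row of the taxonomy table shown above
otu1 = {
    "Otu": "Otu1", "Domain": "Bacteria", "Phylum": "Firmicutes", "Class": "Bacilli",
    "Order": "Lactobacillales", "Family": "Lactobacillaceae",
    "Genus": "Lactobacillus", "Species": "na",
}

print(lowest_resolved(otu1))  # ('Genus', 'Lactobacillus')
```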

# Preparing data for inference

Data is prepared for model inference using the `parse` function as described below:

### parse:
**Required arguments**:
- `counts_data`: The first argument of the parse function takes the filepath for the counts file, as described above.
- `taxonomy`: The second argument of the parse function takes the taxonomy filepath.
- `perturbation_info`: The third argument of the parse function takes the filepath for the file containing perturbation information for the timepoints in the study.

**Optional keyword arguments**:
- `subjects_remove`: This argument takes in a list of subjects to be removed from the study. Default value is `None`.
- `times_remove`: This argument takes in a list of timepoints to be removed from the study. Default value is `None`.
- `otus_remove`: This argument takes in a list of OTUs to be removed from the study. Default value is `None`.
- `num_consistent_subjects`: The number of subjects in which an OTU must exceed `min_abundance` for it to be included. Default value is 1.
- `min_abundance`: The minimum relative abundance an OTU must reach at any timepoint for it to be included. Default value is 0.005.
- `min_reads`: This is the minimum number of reads a particle must contain for it to be included. The default value is 250.
- `max_reads`: This is the maximum number of reads a particle can contain for it to be included. The default value is 10000.
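
To give a feel for what the read and abundance filters mean, the sketch below reproduces them in plain pandas on a tiny made-up table. It is one plausible reading of the argument descriptions above, not MCSPACE's actual implementation. In particular, the inclusive bounds, the order of the two filters, and computing relative abundance per (subject, timepoint) sample are all assumptions.

```python
import pandas as pd

# Made-up long-format counts; particle IDs are assumed unique across samples
toy = pd.DataFrame({
    "Subject":  ["A", "A", "A", "B", "B", "B"],
    "Time":     [35, 35, 35, 35, 35, 35],
    "Particle": ["a1", "a1", "a2", "b1", "b1", "b2"],
    "OTU":      ["Otu1", "Otu2", "Otu1", "Otu1", "Otu3", "Otu1"],
    "Count":    [600, 400, 30, 500, 500, 12000],
})

min_reads, max_reads = 250, 10000
min_abundance, num_consistent_subjects = 0.005, 2

# Particle filter: total reads per particle, broadcast back onto each row
reads = toy.groupby("Particle")["Count"].transform("sum")
kept = toy[(reads >= min_reads) & (reads <= max_reads)]  # inclusive bounds assumed

# Taxa filter: relative abundance of each OTU within each (Subject, Time) sample
otu_counts = kept.groupby(["Subject", "Time", "OTU"])["Count"].sum()
sample_totals = otu_counts.groupby(level=["Subject", "Time"]).transform("sum")
rel_abundance = otu_counts / sample_totals

# Count the subjects in which each OTU reaches min_abundance at any timepoint
passes = rel_abundance[rel_abundance >= min_abundance].reset_index()
n_subjects = passes.groupby("OTU")["Subject"].nunique()
kept_otus = sorted(n_subjects[n_subjects >= num_consistent_subjects].index)

print(sorted(kept["Particle"].unique()))  # ['a1', 'b1']
print(kept_otus)                          # ['Otu1']
```

Particle `a2` (30 reads) falls below `min_reads` and `b2` (12000 reads) exceeds `max_reads`. Of the remaining OTUs, only `Otu1` passes the abundance threshold in two subjects.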

For our tutorials, we remove timepoints and apply stricter read filtering to obtain a smaller dataset, which speeds up inference in the next tutorial. We keep only the timepoints around the HFHF perturbation on day 43 and remove days 10, 18, 65, and 76. Additionally, we raise the minimum read threshold to 1000 reads per particle, remove subject `JX09`, and set `num_consistent_subjects=2`.

In [32]:
times_remove = [10,18,65,76]

In [34]:
processed_data = parse(datapath/"mouse_counts.csv.gz",
                     datapath/"taxonomy.csv",
                     datapath/"perturbations.csv",
                     subjects_remove=['JX09'],
                     times_remove=times_remove,
                     otus_remove=None,
                     num_consistent_subjects=2,
                     min_abundance=0.005,
                     min_reads=1000,
                     max_reads=10000)



In [8]:
processed_data.keys()

dict_keys(['perturbations', 'dataset', 'taxonomy', 'inference_data'])

The **parse** function returns a dictionary containing objects used in the MCSPACE model inference step. See the `running_inference.ipynb` tutorial for more information on performing model inference.
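
The tutorial imports `pickle_save` from `mcspace.utils` but does not show it being called, and its signature is not documented here. As a hedged alternative, the sketch below saves and reloads a dictionary with Python's standard `pickle` module. `processed_stub` is a stand-in with the same four keys as the real `parse` output and placeholder values. The file name is hypothetical, and whether the real objects pickle cleanly this way is an assumption.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the dictionary returned by parse; the values are placeholders
processed_stub = {
    "perturbations": "placeholder",
    "dataset": "placeholder",
    "taxonomy": "placeholder",
    "inference_data": "placeholder",
}

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "processed_data.pkl"  # hypothetical file name

    # Write the dictionary to disk
    with open(out_file, "wb") as fh:
        pickle.dump(processed_stub, fh)

    # Read it back to confirm the round trip
    with open(out_file, "rb") as fh:
        reloaded = pickle.load(fh)

print(sorted(reloaded.keys()))
# ['dataset', 'inference_data', 'perturbations', 'taxonomy']
```

In the tutorial itself, the natural destination would be the `outpath` folder created earlier rather than a temporary directory.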