# Tutorial 2: Running inference with the MCSPACE model

This tutorial goes over how to run model inference on processed SAMPL-seq data. Refer to the previous tutorial (`data_preprocessing.ipynb`) on how to prepare data for inference. 

In [1]:
from mcspace.data_utils import parse
from pathlib import Path
import pandas as pd
from mcspace.utils import pickle_save

The `run_inference` function performs model inference on preprocessed SAMPL-seq data. Import it as follows:

In [2]:
from mcspace.inference import run_inference

# Paths

This tutorial uses relative paths. `basepath` points to the directory containing this notebook (the current working directory).

In [3]:
basepath = Path("./")
datapath = basepath / "data"
outpath = basepath / "results"
outpath.mkdir(exist_ok=True, parents=True)

# Process data for model inference

See the previous tutorial, `data_preprocessing.ipynb`, for more details on this step.

In [4]:
times_remove = [10,18,65,76]

In [5]:
processed_data = parse(datapath/"mouse_counts.csv.gz",
                     datapath/"taxonomy.csv",
                     datapath/"perturbations.csv",
                     subjects_remove=['JX09'],
                     times_remove=times_remove,
                     otus_remove=None,
                     num_consistent_subjects=2,
                     min_abundance=0.005,
                     min_reads=1000,
                     max_reads=10000)


# Run model inference

Model inference is performed using the `run_inference` function, as described below:

### run_inference:
**Required arguments**:
- `data`: The preprocessed data returned by the `parse` function (first positional argument).
- `outpath`: Path to the directory where the inference results are saved (second positional argument).

**Optional keyword arguments**:
- `n_seeds`: Number of resets to use for model inference. The model is run `n_seeds` times, and the run with the lowest ELBO loss is selected as the best model. Default value is 10.
- `n_epochs`: This is the number of training epochs to use in each reset. Default value is 20000.
- `learning_rate`: Learning rate used with the Adam optimizer. Default value is 5e-3.
- `num_assemblages`: Maximum possible assemblages the model can learn. Default value is 100.
- `sparsity_prior`: Prior probability of an assemblage being present. The default value is None, which sets the value to 0.5/`num_assemblages`.
- `sparsity_power`: Power to which we raise the sparsity prior to scale with the dataset size. Default value is `None` which sets the value to 0.5% of the total number of reads in the dataset.
- `anneal_prior`: Specifies whether to anneal the strength of the sparsity prior during training. Default value is True.
- `process_variance_prior`: Location parameter of the prior on the process variance. Default value is `0.01`.
- `perturbation_prior`: Prior probability of a perturbation effect. Default value is None, which sets the value to 0.5/`num_assemblages`.
- `use_contamination`: Whether to use the contamination cluster in the model. Default value is True.
- `use_sparsity`: Specifies whether to sparsify the number of assemblages in the model. Default value is True.
- `use_kmeans_init`: Specifies whether to use a kmeans initialization for assemblage parameters. Default value is True.
- `device`: Specifies whether to use the CPU or GPU for model inference. By default, the software automatically detects and utilizes the GPU if available.
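The `None` defaults for `sparsity_prior` and `perturbation_prior` described above both resolve to 0.5/`num_assemblages`. The short sketch below only illustrates that arithmetic. The helper `default_prior` is hypothetical and is not part of the `mcspace` package.

```python
def default_prior(num_assemblages: int = 100) -> float:
    """Fallback described in the argument list: 0.5 / num_assemblages.

    Hypothetical helper for illustration only; mcspace computes this internally.
    """
    if num_assemblages <= 0:
        raise ValueError("num_assemblages must be a positive integer")
    return 0.5 / num_assemblages


# Show how the fallback prior shrinks as the maximum number of assemblages grows.
for k in (50, 100, 200):
    print(f"num_assemblages={k}: default prior = {default_prior(k)}")
```

With the default `num_assemblages=100`, the fallback prior is 0.005, so raising `num_assemblages` without setting the priors explicitly also makes them smaller.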

For this tutorial, we will run the MCSPACE model with 1 seed and 5000 epochs, keeping all other arguments at their default values.

In [6]:
run_inference(processed_data,
              outpath,
              n_seeds=1,
              n_epochs=5000)

running seed 0...

epoch 0
ELBO =  tensor(19047362., device='cuda:0', grad_fn=<NegBackward0>)

epoch 100
ELBO =  tensor(17319594., device='cuda:0', grad_fn=<NegBackward0>)

epoch 200
ELBO =  tensor(24300228., device='cuda:0', grad_fn=<NegBackward0>)

epoch 300
ELBO =  tensor(16783074., device='cuda:0', grad_fn=<NegBackward0>)

epoch 400
ELBO =  tensor(19798072., device='cuda:0', grad_fn=<NegBackward0>)

epoch 500
ELBO =  tensor(15433439., device='cuda:0', grad_fn=<NegBackward0>)

epoch 600
ELBO =  tensor(16299690., device='cuda:0', grad_fn=<NegBackward0>)

epoch 700
ELBO =  tensor(17427712., device='cuda:0', grad_fn=<NegBackward0>)

epoch 800
ELBO =  tensor(20006550., device='cuda:0', grad_fn=<NegBackward0>)

epoch 900
ELBO =  tensor(19618514., device='cuda:0', grad_fn=<NegBackward0>)

epoch 1000
ELBO =  tensor(20832796., device='cuda:0', grad_fn=<NegBackward0>)

epoch 1100
ELBO =  tensor(22068986., device='cuda:0', grad_fn=<NegBackward0>)

epoch 1200
ELBO =  tensor(23293548., device='
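The values printed every 100 epochs fluctuate considerably, so the overall trend can be hard to read from the raw log. One way to judge it is to smooth the values with a moving average. The sketch below is a minimal example of that idea. `moving_average` is a hypothetical helper, not an `mcspace` function, and the numbers are copied by hand from the complete log lines above (epochs 0 to 1100).

```python
def moving_average(values, window):
    """Return the moving average of `values`; the result has len(values) - window + 1 entries."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window forward: add the new value, drop the oldest one.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


# Loss values printed above for epochs 0, 100, ..., 1100.
logged = [19047362, 17319594, 24300228, 16783074, 19798072, 15433439,
          16299690, 17427712, 20006550, 19618514, 20832796, 22068986]

smoothed = moving_average(logged, window=4)
for i, value in enumerate(smoothed):
    print(f"epochs {100 * i}-{100 * (i + 3)}: mean loss = {value:.0f}")
```

A short window like this only covers the early part of training. For a full picture, the per-epoch values saved by the run are a better source than the printed log.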

# Results of model inference

Inference results are written to the `results/` folder.

In [7]:
ls "results/"

 Volume in drive C is Windows-SSD
 Volume Serial Number is 1086-9223

 Directory of c:\Users\Gary\Documents\PROJECTS\MCSPACE_revisions_8_29_25\MCSPACE\mcspace\tutorials\results

08/29/2025  03:02 PM    <DIR>          .
08/29/2025  02:59 PM    <DIR>          ..
08/29/2025  03:02 PM            20,690 assemblage_proportions.csv
08/29/2025  03:02 PM            18,970 assemblages.csv
08/29/2025  03:02 PM    <DIR>          best_model
08/29/2025  03:02 PM               581 perturbation_bayes_factors.csv
08/29/2025  03:02 PM             2,078 relative_abundances.csv
08/29/2025  03:02 PM            21,294 results.pkl
08/29/2025  02:59 PM    <DIR>          runs
               5 File(s)         63,613 bytes
               4 Dir(s)  365,254,557,696 bytes free
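The listing above is in the Windows `dir` format, so the same cell will print something different on Linux or macOS. A portable alternative is to list the folder with `pathlib`. The sketch below assumes the `results/` folder from this tutorial exists; `list_dir` is a hypothetical helper name.

```python
from pathlib import Path


def list_dir(path):
    """Return the sorted entry names in `path`, with a trailing '/' on directories."""
    root = Path(path)
    return sorted((p.name + "/") if p.is_dir() else p.name for p in root.iterdir())


results_dir = Path("results")
if results_dir.is_dir():
    # Print one entry per line, independent of the operating system.
    for name in list_dir(results_dir):
        print(name)
```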


In [9]:
ls "results/runs/seed_0/"

 Volume in drive C is Windows-SSD
 Volume Serial Number is 1086-9223

 Directory of c:\Users\Gary\Documents\PROJECTS\MCSPACE_revisions_8_29_25\MCSPACE\mcspace\tutorials\results\runs\seed_0

08/29/2025  03:02 PM    <DIR>          .
08/29/2025  02:59 PM    <DIR>          ..
08/29/2025  03:02 PM         2,667,946 data.pkl
08/29/2025  03:02 PM            40,151 elbos.pkl
08/29/2025  03:02 PM           308,399 model.pt
08/29/2025  03:02 PM             2,394 taxonomy.pkl
               4 File(s)      3,018,890 bytes
               2 Dir(s)  365,259,993,088 bytes free


The folder contains the following results from inference:
- `assemblages.csv`: A csv file containing the learned assemblages, with a row for each OTU and a column for each assemblage.
- `assemblage_proportions.csv`: A csv file giving the posterior summary of inferred assemblage proportions, in long format, for each assemblage at each timepoint for each subject.
- `perturbation_bayes_factors.csv`: A csv file containing the perturbation Bayes factors with columns corresponding to each perturbed timepoint and rows for each assemblage.
- `runs`: Folder containing the model inference results for each seed. In each seed folder, `model.pt` is the saved PyTorch model for that seed, and `elbos.pkl` holds the ELBO loss at each epoch of the run, which helps users monitor convergence and assess training quality.
- `best_model`: Folder containing the inference results for the seed with the lowest average ELBO loss. This model is used to generate the posterior summaries.
- `results.pkl`: A pickle file containing posterior summaries of the inferred parameters. It holds the same information as the csv files and can be passed to our visualization functions to plot the model results. See the next tutorial, `visualizating_results.ipynb`, for more details.
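As a rough convergence check, the saved `elbos.pkl` file can be loaded and summarized. The sketch below is written under a loud assumption: the internal format of `elbos.pkl` is not documented here, so the code assumes it unpickles to a flat sequence of numbers, one per epoch. If the file instead stores PyTorch tensors or another structure, the loading step will need to be adapted (and `torch` must be importable). `load_loss_trace` and `summarize` are hypothetical helper names, not part of `mcspace`.

```python
import pickle
from pathlib import Path


def load_loss_trace(path):
    """Unpickle a loss trace and coerce it to a list of floats (assumed format)."""
    with open(path, "rb") as fh:
        trace = pickle.load(fh)
    return [float(v) for v in trace]


def summarize(trace, tail=100):
    """Summarize a non-empty loss trace: length, first value, minimum, and mean of the last `tail` epochs."""
    if not trace:
        raise ValueError("trace is empty")
    tail = min(tail, len(trace))
    return {
        "n_epochs": len(trace),
        "first": trace[0],
        "best": min(trace),
        "tail_mean": sum(trace[-tail:]) / tail,
    }


elbo_file = Path("results") / "runs" / "seed_0" / "elbos.pkl"
if elbo_file.exists():
    print(summarize(load_loss_trace(elbo_file)))
```

A tail mean well below the first value, and close to the best value, suggests the loss has settled. A tail mean that is still dropping between runs of different length suggests training for more epochs.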