# Use case 3: Microbiome load prediction data preparation

This notebook prepares the dataset for the microbiome load prediction use case, following the general data preparation approach outlined in [the original publication by Nishijima et al. 2024](https://doi.org/10.1016/j.cell.2024.10.022).

This notebook can be run in the following conda environment:
```shell
mamba env create -f environment_prep_data.yml
conda activate ritme_examples_prep_data
pip install -e .
qiime dev refresh-cache
```

## Setup

In [None]:
import numpy as np
import pandas as pd
import qiime2 as q2

from src.process_u3 import process_feature_table

%load_ext autoreload
%autoreload 2

%matplotlib inline

## Fetch data

In [None]:
! ./../../src/fetch_mlp_data.sh

In [None]:
path_to_data = "../../data/u3_mlp_nishijima24"

## Create GALAXY dataset

### Metadata

In [None]:
galaxy_md = pd.read_csv(f"{path_to_data}/GALAXY_load.tsv", sep="\t", index_col=0)

# log10-transform the microbial load counts, as done in the publication
galaxy_md["count_log10"] = np.log10(galaxy_md["count"])

print(galaxy_md.shape)
galaxy_md.head()
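
`np.log10` returns `-inf` for a count of zero and `nan` for a negative count, and either value would silently end up in the prediction target. A minimal sketch of a guard is shown below. It uses a made-up toy table (sample IDs and counts are hypothetical, only the `count` column name is taken from the cell above) and is not part of the original preparation pipeline.

```python
import numpy as np
import pandas as pd

# toy metadata; sample IDs and counts are made up for illustration
toy_md = pd.DataFrame(
    {"count": [1.0e10, 5.0e11, 0.0]}, index=["s1", "s2", "s3"]
)

# flag samples whose count cannot be log-transformed to a finite value
invalid = toy_md["count"] <= 0
print(f"{invalid.sum()} sample(s) with a non-positive count")

# transform only the valid samples; invalid ones become NaN instead of -inf
toy_md["count_log10"] = np.log10(toy_md["count"].where(~invalid))
print(toy_md)
```

Whether such samples should be dropped or handled differently depends on the dataset, so the sketch only flags them.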

In [None]:
# save to disk
galaxy_md.to_csv(f"{path_to_data}/md_galaxy.tsv", sep="\t")

### Feature table

In [None]:
galaxy_motus = process_feature_table(path_to_data, "GALAXY_mOTUs_v25")
print(galaxy_motus.shape)

# save to disk
galaxy_motus.to_csv(f"{path_to_data}/galaxy_otu_table.tsv", sep="\t")
galaxy_motus.head()

In [None]:
# check that the abundances of each sample sum to 1 (relative abundances)
assert galaxy_motus.sum(axis=1).round(3).eq(1.0).all()
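
The assertion above reports only that the check failed, not which samples are affected. A sketch on a made-up toy table of listing the samples whose abundances do not sum to 1 within a tolerance follows. The feature names, values, and the tolerance `atol=5e-4` (chosen to roughly mirror rounding to three decimals) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# toy feature table: rows are samples, columns are (hypothetical) mOTUs
toy_ft = pd.DataFrame(
    {"mOTU_a": [0.25, 0.5, 0.4], "mOTU_b": [0.75, 0.5, 0.4]},
    index=["s1", "s2", "s3"],
)

# per-sample sum of abundances
row_sums = toy_ft.sum(axis=1)

# samples whose abundances do not sum to 1 within the tolerance
is_relative = np.isclose(row_sums, 1.0, atol=5e-4)
offending = row_sums[~is_relative]
print(offending)
```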

In [None]:
# check that feature table and metadata contain exactly the same sample IDs
assert set(galaxy_motus.index) == set(galaxy_md.index)
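
If this check fails, it helps to see which sample IDs are missing on each side. A small sketch with two hypothetical toy indices (the IDs are made up) uses `pd.Index.difference` for this:

```python
import pandas as pd

# toy sample IDs standing in for the feature table and metadata indices
toy_ft_index = pd.Index(["s1", "s2", "s3"])
toy_md_index = pd.Index(["s2", "s3", "s4"])

# IDs present in one index but not in the other
only_in_features = toy_ft_index.difference(toy_md_index)
only_in_metadata = toy_md_index.difference(toy_ft_index)

print("only in feature table:", list(only_in_features))
print("only in metadata:", list(only_in_metadata))
```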

### Taxonomy

In [None]:
taxonomy_mapping = pd.read_csv(f"{path_to_data}/motus2GTDB.txt", sep="\t", index_col=0)

# replace spaces in column values with underscores
for col in taxonomy_mapping.columns:
    taxonomy_mapping[col] = taxonomy_mapping[col].str.replace(" ", "_")

taxonomy_mapping.head()

In [None]:
prefix_matching = {
    "Kingdom": "k__",
    "Phylum": "p__",
    "Class": "c__",
    "Order": "o__",
    "Family": "f__",
    "Genus": "g__",
    "Species": "s__",
}

tax_df = pd.DataFrame(index=taxonomy_mapping.index)
tax_df["Taxon"] = taxonomy_mapping.apply(
    lambda x: "; ".join(
        [f"{prefix_matching[k]}{v}" for k, v in x.items() if not pd.isna(v)]
    ),
    axis=1,
)
# create correct index
tax_df.index = [f"ref_mOTU_v25_{x}" for x in tax_df.index.tolist()]
tax_df.index.name = "Feature ID"

# add unclassified
tax_df.loc[
    "unclassified", "Taxon"
] = "k__undef; p__undef; c__undef; o__undef; f__undef; g__undef; s__undef"
tax_df.head()
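
To make the string construction above concrete, here is a sketch on a made-up two-row mapping with only three ranks. The IDs and rank assignments are hypothetical and `row_to_taxon` is an illustrative helper, not a function from this repository. Ranks with a missing value are skipped, so the second row yields a shorter taxon string.

```python
import numpy as np
import pandas as pd

# rank prefixes, reduced to three ranks for brevity
prefixes = {"Kingdom": "k__", "Phylum": "p__", "Class": "c__"}

# toy mapping; IDs and assignments are made up for illustration
toy_mapping = pd.DataFrame(
    {
        "Kingdom": ["Bacteria", "Bacteria"],
        "Phylum": ["Firmicutes", "Proteobacteria"],
        "Class": ["Bacilli", np.nan],
    },
    index=["00001", "00002"],
)


def row_to_taxon(row: pd.Series) -> str:
    """Join the prefixed, non-missing ranks of one row into a taxon string."""
    return "; ".join(
        f"{prefixes[rank]}{name}" for rank, name in row.items() if not pd.isna(name)
    )


toy_taxa = toy_mapping.apply(row_to_taxon, axis=1)
print(toy_taxa.tolist())
```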

In [None]:
# save to disk
tax_art = q2.Artifact.import_data("FeatureData[Taxonomy]", tax_df)
tax_art.save(f"{path_to_data}/u3_taxonomy.qza")

No phylogenetic tree can be constructed because the nucleotide sequences of these mOTUs are not available. As a consequence, the trac model cannot be trained on this dataset.

## Create MetaCardis dataset

### Metadata

In [None]:
metacardis_md = pd.read_csv(
    f"{path_to_data}/MetaCardis_load.tsv", sep="\t", index_col=0
)

# log10-transform the microbial load counts, as done in the publication
metacardis_md["count_log10"] = np.log10(metacardis_md["count"])

print(metacardis_md.shape)
metacardis_md.head()

In [None]:
# save to disk
metacardis_md.to_csv(f"{path_to_data}/md_metacardis.tsv", sep="\t")

### Feature table

In [None]:
metacardis_motus = process_feature_table(path_to_data, "MetaCardis_mOTUs_v25")
print(metacardis_motus.shape)

# save to disk
metacardis_motus.to_csv(f"{path_to_data}/metacardis_otu_table.tsv", sep="\t")

metacardis_motus.head()

In [None]:
# check that the abundances of each sample sum to 1 (relative abundances)
assert metacardis_motus.sum(axis=1).round(3).eq(1.0).all()

In [None]:
# check that feature table and metadata contain exactly the same sample IDs
assert set(metacardis_motus.index) == set(metacardis_md.index)

### Taxonomy

The taxonomy was already processed above as `tax_art`; the same mapping can be used for both datasets.

As above, no phylogenetic tree can be constructed because the nucleotide sequences of these mOTUs are not available, so the trac model cannot be trained on this dataset either.