$\textit{Note:}$ To run this notebook, QIIME 2 must be installed. The installation instructions can be found [here](https://docs.qiime2.org/2021.4/install/native/).

[QIIME 2](https://docs.qiime2.org/2021.4/) is a microbiome bioinformatics platform to analyze amplicon sequence data.

In [None]:
import qiime2
import pandas as pd
import numpy as np

### Metadata of pH values in soil samples

Read the sample metadata. Only the pH column is needed.

In [None]:
metadata = pd.read_table('../../data/soil/original/88soils_modified_metadata.txt', index_col=0)

In [None]:
ph = metadata["ph"]  # to_csv returns None when given a path, so keep the Series itself
ph.to_csv('../../data/soil/processed/ph.csv', index=True)

Import the OTU count data

In [None]:
%%bash
qiime tools import \
    --input-path ../../data/soil/original/238_otu_table.biom \
    --output-path ../../data/soil/original/88soils.biom.qza \
    --type 'FeatureTable[Frequency]'

Load the table with 88 soil samples

In [None]:
table_art = qiime2.Artifact.load('../../data/soil/original/88soils.biom.qza')
all_samples = table_art.view(pd.DataFrame)

### Preprocessing

Keep only those OTUs whose total count across all samples is at least $100$

In [None]:
%%bash
qiime feature-table filter-features \
    --i-table ../../data/soil/original/88soils.biom.qza \
    --o-filtered-table ../../data/soil/original/88soils_filt100.biom.qza \
    --p-min-frequency 100
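
As a rough cross-check of what `--p-min-frequency 100` does, the same filter can be sketched in pandas on a toy table. The threshold applies to each feature's total frequency summed over all samples. The toy table and the `filter_min_frequency` helper are illustrative assumptions and are not part of the notebook's data or of QIIME 2.

```python
import pandas as pd


def filter_min_frequency(table: pd.DataFrame, min_frequency: int) -> pd.DataFrame:
    """Keep features (columns) whose total count over all samples (rows)
    is at least min_frequency. Hypothetical helper for illustration."""
    totals = table.sum(axis=0)
    return table.loc[:, totals >= min_frequency]


# Toy samples-by-OTUs table (made-up counts)
toy = pd.DataFrame(
    {"otu_a": [60, 50, 0], "otu_b": [1, 2, 3], "otu_c": [0, 0, 100]},
    index=["s1", "s2", "s3"],
)

filtered = filter_min_frequency(toy, 100)
print(list(filtered.columns))
```

Here `otu_b` is dropped because its total count (6) is below the threshold, while `otu_c` is kept even though it appears in a single sample.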

[Add](https://docs.qiime2.org/2018.6/plugins/available/composition/add-pseudocount/) a pseudocount of $1$ to all counts in all samples, so that no zeros remain and the CLR transformation (which takes logarithms) is well defined in the later analysis.

In [None]:
%%bash
qiime composition add-pseudocount \
    --i-table ../../data/soil/original/88soils_filt100.biom.qza \
    --p-pseudocount 1 \
    --o-composition-table ../../data/soil/original/88soils_composition.biom.qza
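
To illustrate why zeros have to be removed first, here is a minimal sketch of the centered log-ratio (CLR) transformation, $\mathrm{clr}(x)_j = \log x_j - \frac{1}{p}\sum_{k=1}^{p} \log x_k$, applied per sample. The `clr` helper and the toy counts are assumptions made for illustration; the later analysis may compute the transformation differently.

```python
import numpy as np
import pandas as pd


def clr(composition: pd.DataFrame) -> pd.DataFrame:
    """Centered log-ratio transform of each row (sample).
    Requires strictly positive entries. Hypothetical helper for illustration."""
    log_counts = np.log(composition)
    # Subtract each sample's mean log-count from its entries
    return log_counts.sub(log_counts.mean(axis=1), axis=0)


# Toy samples-by-OTUs counts containing zeros (made-up values)
toy_counts = pd.DataFrame(
    {"otu_a": [0, 4], "otu_b": [9, 0], "otu_c": [1, 1]},
    index=["s1", "s2"],
)

# Without the pseudocount, log(0) would give -inf
toy_clr = clr(toy_counts + 1)
print(toy_clr.round(3))
```

Each row of the CLR-transformed table sums to zero (up to floating-point error), which is a quick sanity check on the result.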

Load the table of filtered and zero-replaced data

In [None]:
table_1 = qiime2.Artifact.load('../../data/soil/original/88soils_composition.biom.qza')
df = table_1.view(pd.DataFrame)

Select the 116 OTUs that were retained by [Morton et al.](https://www.nature.com/articles/s41467-019-10656-5)

In [None]:
morton = pd.read_excel('../../data/soil/original/ph_morton.xlsx', engine='openpyxl')
morton_otus = np.array(morton['#OTU_ID'])

In [None]:
our_otus = list(map(int, np.array(df.columns)))

In [None]:
diff = np.setdiff1d(our_otus, morton_otus)
diff = diff.astype(str).tolist()  # column labels in df are strings

Select OTUs described by Morton et al.

In [None]:
final = df.loc[:, ~df.columns.isin(diff)]
final.shape
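
The same selection can also be written as a direct membership test against the reference IDs, which makes it easy to spot reference OTUs that are absent from the table. The sketch below uses a toy table and made-up IDs; `reference_ids` stands in for `morton_otus` and is an assumption for illustration.

```python
import pandas as pd

# Toy samples-by-OTUs table with string column labels, as in `df`
toy = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["s1", "s2"],
    columns=["101", "202", "303"],
)

# Hypothetical integer reference IDs (stand-in for morton_otus)
reference_ids = [101, 303, 404]
reference_str = [str(i) for i in reference_ids]

# Keep only the columns whose label appears in the reference list
selected = toy.loc[:, toy.columns.isin(reference_str)]

# Reference IDs that do not occur in the table at all
missing = sorted(set(reference_str) - set(toy.columns))

print(list(selected.columns), missing)
```

If `missing` is non-empty for the real data, the final table will have fewer columns than the reference list, which is worth checking against `final.shape`.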

In [None]:
final.to_csv('../../data/soil/processed/soil_116.csv', index=True)