This notebook creates OTU-filtered BIOM tables for the placebo group, to reduce the number of features provided as input to machine learning-based methods. OTUs present in fewer than 10% of the placebo group's pre-treatment samples are excluded from the final tables. The set of OTUs to keep is determined from the pre-treatment samples only, and is then applied to the table containing all placebo samples.

In [1]:
import pandas as pd
import biom

sample_metadata_fp = 'sample-metadata.tsv'
treatmentgroup = 'placebo'

sample_md = pd.read_csv(sample_metadata_fp, sep='\t', index_col=0, dtype=object)

# -p avoids an error if temp already exists (e.g. when the notebook is re-run)
!mkdir -p temp

In [2]:
s_treatmentgroup_visit = '"treatmentgroup:%s;visit:pre"' % treatmentgroup
s_treatmentgroup = '"treatmentgroup:%s"' % treatmentgroup
treatmentgroup_biom_fp = 'vsearch-derep.ms2.rdp-tax-99.%s.biom' % treatmentgroup
treatmentgroup_pre_biom_fp = 'vsearch-derep.ms2.rdp-tax-99.%s-pre.biom' % treatmentgroup
temp_treatmentgroup_pre_biom_fp = 'temp/%s' % treatmentgroup_pre_biom_fp

!filter_samples_from_otu_table.py -i ../vsearch-derep.ms2.rdp-tax-99.biom -o $temp_treatmentgroup_pre_biom_fp -s $s_treatmentgroup_visit -m $sample_metadata_fp
!filter_samples_from_otu_table.py -i ../vsearch-derep.ms2.rdp-tax-99.biom -o $treatmentgroup_biom_fp -s $s_treatmentgroup -m $sample_metadata_fp
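
To make the two sample selections above concrete, the following is a minimal sketch of the same metadata logic in plain Python. It is not how `filter_samples_from_otu_table.py` is implemented. The column names `treatmentgroup` and `visit` and the values `placebo` and `pre` come from the cells above; the sample IDs, the `other` group label, the `post` visit label, and the helper `select_sample_ids` are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical stand-in for sample-metadata.tsv; sample IDs and the
# 'other' / 'post' values are made up for this example.
metadata_tsv = (
    "#SampleID\ttreatmentgroup\tvisit\n"
    "S1\tplacebo\tpre\n"
    "S2\tplacebo\tpost\n"
    "S3\tother\tpre\n"
    "S4\tplacebo\tpre\n"
)


def select_sample_ids(handle, **required):
    """Return IDs of samples whose metadata matches every column=value pair."""
    reader = csv.DictReader(handle, delimiter="\t")
    id_column = reader.fieldnames[0]  # first column holds the sample ID
    return [
        row[id_column]
        for row in reader
        if all(row[column] == value for column, value in required.items())
    ]


# Analogue of "treatmentgroup:placebo;visit:pre"
pre_ids = select_sample_ids(
    io.StringIO(metadata_tsv), treatmentgroup="placebo", visit="pre"
)
# Analogue of "treatmentgroup:placebo"
all_ids = select_sample_ids(io.StringIO(metadata_tsv), treatmentgroup="placebo")

print(pre_ids)
print(all_ids)
```

The first selection is the subset used to decide which OTUs are prevalent enough to keep; the second is the set of samples that ends up in the final tables.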

In [3]:
table = biom.load_table(temp_treatmentgroup_pre_biom_fp)

In [4]:
print(table.shape)

(505989, 238)


In [5]:
# get 10% of the number of samples for the next step
ten_percent_of_num_samples = int(round(0.1 * table.shape[1]))

In [6]:
ten_percent_of_num_samples

24
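
With 238 pre-treatment samples, 10% is 23.8, so rounding gives 24 here. For other sample counts, rounding to the nearest integer can land just below 10%. The sketch below compares the notebook's arithmetic with a hypothetical stricter alternative based on `math.ceil`; the function names are illustrative and are not part of the notebook.

```python
import math


def min_samples_rounded(n_samples, fraction=0.1):
    """Same arithmetic as the cell above: round to the nearest integer."""
    return int(round(fraction * n_samples))


def min_samples_ceil(n_samples, fraction=0.1):
    """Hypothetical alternative: round up so the threshold never drops
    below the requested fraction of samples."""
    return math.ceil(fraction * n_samples)


for n in (238, 232):
    print(n, min_samples_rounded(n), min_samples_ceil(n))
```

For 238 samples both approaches agree. For 232 samples (a made-up count), rounding yields 23, which is slightly under 10% of the samples, while rounding up yields 24.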

In [7]:
treatmentgroup_pre_10p_biom_fp = 'vsearch-derep.ms2.rdp-tax-99.%s-pre-ms10p.biom' % treatmentgroup
temp_treatmentgroup_pre_10p_biom_fp = 'temp/%s' % treatmentgroup_pre_10p_biom_fp

# -s sets the minimum number of samples an OTU must be observed in to be retained
!filter_otus_from_otu_table.py -s $ten_percent_of_num_samples -i $temp_treatmentgroup_pre_biom_fp -o $temp_treatmentgroup_pre_10p_biom_fp
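
The prevalence filter applied above can be illustrated on a toy table. This is a sketch under the assumption that an OTU counts as "present" in a sample when its count is greater than zero; the dictionary layout, the OTU names, and the helper `filter_by_prevalence` are hypothetical and do not reflect how QIIME stores or filters tables.

```python
def filter_by_prevalence(counts, min_samples):
    """Keep OTUs with a nonzero count in at least `min_samples` samples.

    `counts` maps an OTU ID to a list of per-sample counts.
    """
    kept = {}
    for otu_id, row in counts.items():
        n_present = sum(1 for value in row if value > 0)
        if n_present >= min_samples:
            kept[otu_id] = row
    return kept


# Toy table: three OTUs across four samples (made-up values).
toy_counts = {
    "otu_a": [5, 0, 3, 1],  # present in 3 samples
    "otu_b": [0, 0, 9, 0],  # present in 1 sample
    "otu_c": [2, 2, 0, 0],  # present in 2 samples
}

kept = filter_by_prevalence(toy_counts, min_samples=2)
print(sorted(kept))
```

Note that the filter depends on how many samples contain the OTU, not on its total count: `otu_b` has the largest single count but is dropped because it appears in only one sample.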

In [8]:
temp_biom_tsv_fp = 'temp/temp.tsv'
temp_otus_to_keep = './temp/otu-ids-to-keep.tsv'

!biom convert --to-tsv -i $temp_treatmentgroup_pre_10p_biom_fp -o $temp_biom_tsv_fp

In [9]:
!cut -f 1 $temp_biom_tsv_fp > $temp_otus_to_keep

In [10]:
!head $temp_otus_to_keep

# Constructed from biom file
#OTU ID
e07591b0049b8831e12cbb7caca207d48628bf53
7b38ac67bdd77bf68971c5d1c9a637eed6b42f65
3c612a1ee0b3bd27cdf62a3d5d2b20d4daed3b01
c1d03cfa67e45fe3ad18ae4005c3a24d81fa8778
4e4339ab92c072ae49a595333c4961607b1eb366
eaa89e412e9ee480b7019f5b31997726bb8661b3
7e3f625115c48cf9721e9ffe63b3c48b5149ba18
0fab7ba69e9a53216d231dd0458b41129e4515bd
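
As the output shows, the ID file produced by `cut` still begins with two header lines that start with `#`. If the list is reused outside the QIIME scripts, those lines may need to be skipped. A minimal sketch of that parsing step follows; the helper name `read_otu_ids` is hypothetical, and the two IDs are copied from the output above.

```python
import io


def read_otu_ids(handle):
    """Return OTU IDs from a TSV-style handle, skipping blank and '#' lines."""
    ids = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Take the first tab-separated field in case other columns are present.
        ids.append(line.split("\t")[0])
    return ids


example = io.StringIO(
    "# Constructed from biom file\n"
    "#OTU ID\n"
    "e07591b0049b8831e12cbb7caca207d48628bf53\n"
    "7b38ac67bdd77bf68971c5d1c9a637eed6b42f65\n"
)

otu_ids = read_otu_ids(example)
print(len(otu_ids), otu_ids[0])
```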


In [11]:
treatmentgroup_biom_10p_fp = 'vsearch-derep.ms2.rdp-tax-99.%s-ms10p.biom' % treatmentgroup
# --negate_ids_to_exclude inverts -e, so only the OTUs listed in the file are kept
!filter_otus_from_otu_table.py -i $treatmentgroup_biom_fp -o $treatmentgroup_biom_10p_fp --negate_ids_to_exclude -e $temp_otus_to_keep
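
The step above restricts the table of all placebo samples to the OTU IDs selected from the pre-treatment samples. A toy sketch of that "keep only listed IDs" operation is shown here; the data, the OTU names, and the helper `keep_listed_otus` are hypothetical.

```python
def keep_listed_otus(counts, ids_to_keep):
    """Return only the rows of `counts` whose OTU ID is in `ids_to_keep`."""
    ids_to_keep = set(ids_to_keep)
    return {
        otu_id: row for otu_id, row in counts.items() if otu_id in ids_to_keep
    }


# Toy table covering all samples (pre- and post-treatment); values are made up.
full_counts = {
    "otu_a": [5, 0, 3, 1, 7, 0],
    "otu_b": [0, 0, 9, 0, 4, 4],
    "otu_c": [2, 2, 0, 0, 0, 1],
}

# IDs that passed the prevalence filter on the pre-treatment subset.
filtered = keep_listed_otus(full_counts, ["otu_a", "otu_c"])
print(sorted(filtered))
```

Every sample column is retained; only the rows (OTUs) change. An OTU that is common after treatment but rare before it, like `otu_b` in this toy data, is still removed because the ID list was built from pre-treatment samples alone.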

In [12]:
treatmentgroup_tsv_10p_fp = 'vsearch-derep.ms2.rdp-tax-99.%s-ms10p.tsv' % treatmentgroup
!biom convert --to-tsv -i $treatmentgroup_biom_10p_fp -o $treatmentgroup_tsv_10p_fp

Summary of the treatment-group-only table before OTU filtering

In [13]:
!biom summarize-table -i $treatmentgroup_biom_fp

Num samples: 487
Num observations: 505989
Total count: 14704555
Table density (fraction of non-zero values): 0.009

Counts/sample summary:
 Min: 0.0
 Max: 276779.0
 Median: 19399.000
 Mean: 30194.158
 Std. dev.: 32395.350
 Sample Metadata Categories: None provided
 Observation Metadata Categories: taxonomy

Counts/sample detail:
udca.trial.74MYRQPORAQFW: 0.0
udca.trial.2HWCMBKU9QMSE: 1.0
udca.trial.57VOZHTSP5XEE: 1.0
udca.trial.4R42WF9QJC5NG: 1.0
udca.trial.4YOLU4HMLZSY1: 1.0
udca.trial.46CKFM6MTT2XN: 1.0
udca.trial.67HWOV7G52RWJ: 1.0
udca.trial.4YWM4Q9599QU2: 1.0
udca.trial.1FX7H8EN3TWJL: 1.0
udca.trial.5QCMCTI2GCPAC: 1.0
udca.trial.B9PVQTY3C8KT: 1.0
udca.trial.8J471W2UG1YW: 1.0
udca.trial.3GOMKD1USXNLL: 2.0
udca.trial.7NZXBHG13L9VY: 2.0
udca.trial.5E96GFZVVBP7Z: 2.0
udca.trial.7DWSZP1SFBDVH: 4.0
udca.trial.ZGCRCM164LNW: 4.0
udca.trial.3W1UJKN3BG73K: 8.0
udca.trial.1TXQ1BK9U87LC: 8.0
udca.trial.3E316JNAJ78UU: 42.0
udca.trial.27Z3N6VEYLN24: 321.0
udc

Summary of the treatment-group-only table after OTU filtering

In [14]:
!biom summarize-table -i $treatmentgroup_biom_10p_fp

Num samples: 487
Num observations: 6368
Total count: 11096891
Table density (fraction of non-zero values): 0.190

Counts/sample summary:
 Min: 0.0
 Max: 222959.0
 Median: 15010.000
 Mean: 22786.224
 Std. dev.: 24253.907
 Sample Metadata Categories: None provided
 Observation Metadata Categories: taxonomy

Counts/sample detail:
udca.trial.3GOMKD1USXNLL: 0.0
udca.trial.2HWCMBKU9QMSE: 0.0
udca.trial.7NZXBHG13L9VY: 0.0
udca.trial.57VOZHTSP5XEE: 0.0
udca.trial.4R42WF9QJC5NG: 0.0
udca.trial.4YOLU4HMLZSY1: 0.0
udca.trial.46CKFM6MTT2XN: 0.0
udca.trial.67HWOV7G52RWJ: 0.0
udca.trial.1FX7H8EN3TWJL: 0.0
udca.trial.5QCMCTI2GCPAC: 0.0
udca.trial.74MYRQPORAQFW: 0.0
udca.trial.8J471W2UG1YW: 0.0
udca.trial.7DWSZP1SFBDVH: 1.0
udca.trial.4YWM4Q9599QU2: 1.0
udca.trial.B9PVQTY3C8KT: 1.0
udca.trial.5E96GFZVVBP7Z: 2.0
udca.trial.1TXQ1BK9U87LC: 2.0
udca.trial.3E316JNAJ78UU: 3.0
udca.trial.ZGCRCM164LNW: 3.0
udca.trial.3W1UJKN3BG73K: 7.0
udca.trial.27Z3N6VEYLN24: 267.0
udca.t
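
The two summaries can be compared directly. The short calculation below uses only the numbers reported by `biom summarize-table` above; the variable names are illustrative.

```python
# Values copied from the two summarize-table outputs above.
before = {"observations": 505989, "total_count": 14704555}
after = {"observations": 6368, "total_count": 11096891}

obs_retained = after["observations"] / before["observations"]
count_retained = after["total_count"] / before["total_count"]

print("OTUs retained:   {:.2%}".format(obs_retained))
print("Counts retained: {:.2%}".format(count_retained))
```

Filtering keeps a little over 1% of the OTUs while retaining roughly three quarters of the total sequence counts, and the table density rises from 0.009 to 0.190.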

In [15]:
!rm -r temp