# Tutorial 1: Data-loading and preprocessing
This tutorial demonstrates how to set up and use our software to load and preprocess raw data into the format (a Python pickle object) expected by the MDITRE model. The following steps show how our utilities load data from two sources, 16S and Metaphlan, and describe the preprocessing options that the user can specify.

In [1]:
# Load the required python packages
from mditre.data_utils import preprocess, ConfigParser, pickle

## Step 1: Define a configuration file
We expect the user to define a configuration file, which has different options for data loading and data preprocessing. We have provided example configuration files for 2 datasets in the following locations.

- Dataset from David et al., 2014 (16s): [file](./datasets/raw/david/david_benchmark.cfg)
- Dataset from Kostic et al., 2014 (Metaphlan): [file](./datasets/raw/t1d/t1d_benchmark.cfg)

The user can specify a range of configuration options, described in detail in Chapter 5 of the MITRE user manual [here](https://github.com/gerberlab/mitre/blob/master/docs/manual.pdf). For completeness, we briefly describe the available options below.

### Options for data loading

- <strong>tag</strong>: A short string used to generate output filenames.
- <strong>data_type</strong>: Choose '16s' (the default) or 'metaphlan'. If 'metaphlan', we expect Metaphlan clade abundance estimates as input and handle them specially.
- <strong>abundance_data</strong>: Filename or path to a table of OTU abundance data.
- <strong>sample_metadata</strong>: Filename or path to a table giving a subject and time-point for each sample in the abundance table.
- <strong>subject_data</strong>: Filename or path to a table of data about each subject.
- <strong>jplace_file</strong>: Filename or path to pplacer results in .jplace format.
- <strong>sequence_key</strong>: Filename or path to a FASTA file to be used to rename the OTUs listed in the abundance table, in the case where the name of each OTU is simply a sequence, as in some DADA2 output.
- <strong>outcome_variable</strong>: String, specifying which of the columns in the subject data table encodes the outcome for prediction. Subjects for whom this data is not available will be dropped from the calculation.
- <strong>outcome_positive_value</strong>: The value of the outcome variable that corresponds to a positive outcome.
- <strong>taxonomy_source</strong>: String specifying how taxonomic annotations for each variable (i.e., subtree of the overall phylogeny) should be generated. Valid options are ‘table’, ‘pplacer’, and ‘hybrid’ (‘table’ labels OTUs from a table and higher clades based on the OTUs they contain, ‘pplacer’ labels all variables based on the pplacer results, and ‘hybrid’ labels variables based on pplacer results except where a species-level placement is given in a table).
- <strong>placement_table</strong>: Filename or path to a table of taxonomic placements for OTUs.
- <strong>pplacer_taxa_table</strong>: Filename or path to a table from the pplacer reference package. Required if ‘taxonomy_source’ is ‘pplacer’ or ‘hybrid’; otherwise ignored.
- <strong>pplacer_seq_info</strong>: Filename or path to a table from the pplacer reference package. Required if ‘taxonomy_source’ is ‘pplacer’ or ‘hybrid’; otherwise ignored.
- <strong>metaphlan_do_weights</strong>: Boolean, default false, ignored unless ‘data_type’ is ‘metaphlan’. If this is set, variables corresponding to clades will be weighted according to the length of the subtree beneath them (effectively, the number of descendant clades for which a row is present in the metaphlan input file); otherwise, all clades will have the same weight.
- <strong>metaphlan_weight_scale</strong>: Float, default 1.0, ignored unless ‘data_type’ is ‘metaphlan’. If ‘metaphlan_do_weights’ is set, scale the subtree lengths by this value.
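
To make the data-loading options above concrete, here is a minimal sketch that builds a configuration with Python's standard `configparser` module and prints it in `.cfg` form. The section names (`description`, `data`), the file names, and the outcome values are assumptions made for illustration only; consult the example configuration files listed above for the layout that MDITRE actually expects.

```python
import configparser
import io

# Section names and all values below are hypothetical placeholders,
# not taken from the MDITRE example files.
config = configparser.ConfigParser()
config["description"] = {"tag": "my_study"}
config["data"] = {
    "data_type": "16s",
    "abundance_data": "abundance.csv",
    "sample_metadata": "sample_metadata.csv",
    "subject_data": "subject_data.csv",
    "jplace_file": "placements.jplace",
    "outcome_variable": "outcome",
    "outcome_positive_value": "1",
    "taxonomy_source": "pplacer",
    "pplacer_taxa_table": "taxa.csv",
    "pplacer_seq_info": "seq_info.csv",
}

# Serialize to text in the same INI-style format used by .cfg files.
buffer = io.StringIO()
config.write(buffer)
config_text = buffer.getvalue()
print(config_text)
```

Writing `config_text` to a file would give a starting point that can then be edited by hand.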

### Options for data processing

- <strong>min_overall_abundance</strong>: If this is specified, all OTUs with lower total abundance data than this value, summed across all samples, are dropped. (Abundance data is assumed to be measured in 16S read counts at this stage, but this is not enforced.)
- <strong>min_sample_reads</strong>: If this is specified, all samples where the total abundance data across all (remaining) OTUs sum to less than this value are dropped. (Abundance data is assumed to be measured in 16S read counts at this stage, but this is not enforced.)
- <strong>trim_start</strong>: The time to consider as the beginning of the study. (By default, this will be the time point of the earliest (remaining) sample.) Samples before this time will be dropped. If this is given, ‘trim_stop’ must also be specified.
- <strong>trim_stop</strong>: The time to consider as the end of the study. (By default, this will be the time point of the latest (remaining) sample.) Samples after this time will be dropped.
- <strong>density_filter_n_samples</strong>: If this is given, ‘density_filter_n_intervals’ and ‘density_filter_n_consecutive’ must be given also. The duration of the study will be divided into the specified number of time intervals, and subjects from whom at least the specified number of samples are available in every time window formed from the specified number of consecutive intervals are retained; samples from all other subjects are dropped.
- <strong>density_filter_n_intervals</strong>: See above.
- <strong>density_filter_n_consecutive</strong>: See above.
- <strong>take_relative_abundance</strong>: Boolean. If True, abundance data will be converted to relative abundances by normalizing so that the sum of the data for each sample is 1.0.
- <strong>aggregate_on_phylogeny</strong>: If True, introduce new variables corresponding to subtrees of the overall phylogeny of the OTUs, inferred from pplacer results given in the ‘jplace_file’ option above. The data for each new variable will be the sum of the data for the OTUs contained in the subtree.
- <strong>log_transform</strong>: If True, take the natural log of all abundance data.
- <strong>temporal_abundance_threshold</strong>: If this option is given, ‘temporal_abundance_consecutive_samples’ and ‘temporal_abundance_n_subjects’ must also be given. Variables which exceed the abundance threshold in at least the specified number of consecutive samples in the data from each of at least the specified number of subjects will be kept; others will be dropped. If the data has been log-transformed, this should be specified on a log scale.
- <strong>temporal_abundance_consecutive_samples</strong>: See above.
- <strong>temporal_abundance_n_subjects</strong>: See above.
- <strong>discard_surplus_internal_nodes</strong>: If true, variables corresponding to subtrees rooted at internal nodes which have only one (remaining) child node after the filtration steps have been applied will be dropped from the analysis.
- <strong>pickle_dataset</strong>: If true, the dataset object corresponding to the processed and filtered dataset will be serialized to `<specified_tag>_dataset_object.pickle` and written to disk at the conclusion of the preprocessing step.
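
As an illustration of the count-based options above, the following is a minimal NumPy sketch of what `min_overall_abundance`, `min_sample_reads`, and `take_relative_abundance` describe. It is not MDITRE's implementation: the function name `filter_and_normalize` and the toy count table are hypothetical, and the sketch assumes that rows are OTUs, columns are samples, and both thresholds are positive.

```python
import numpy as np

def filter_and_normalize(counts, min_overall_abundance, min_sample_reads):
    """Filter a (n_otus, n_samples) count table, then normalize each sample."""
    counts = np.asarray(counts, dtype=float)

    # min_overall_abundance: drop OTUs whose total over all samples is too low.
    keep_otus = counts.sum(axis=1) >= min_overall_abundance
    counts = counts[keep_otus]

    # min_sample_reads: drop samples whose total over the remaining OTUs is too low.
    keep_samples = counts.sum(axis=0) >= min_sample_reads
    counts = counts[:, keep_samples]

    # take_relative_abundance: scale each sample (column) so that it sums to 1.0.
    relative = counts / counts.sum(axis=0, keepdims=True)
    return relative, keep_otus, keep_samples

# Toy table: 3 OTUs (rows) by 3 samples (columns).
counts = np.array([
    [90, 40, 1],
    [10, 60, 1],
    [0, 1, 0],
])
relative, keep_otus, keep_samples = filter_and_normalize(
    counts, min_overall_abundance=5, min_sample_reads=10
)
print(relative)
```

In this toy table the third OTU is removed by the first filter and the third sample by the second, mirroring the order of the "After filtering" messages that the preprocessing step prints.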

## Step 2: Define the location of the config file and load it using ConfigParser

In [2]:
# Location of config file for 16s data
filename_16s_data_cfg = './datasets/raw/david/david_benchmark.cfg'
# Location of config file for metaphlan data
filename_metaphlan_data_cfg = './datasets/raw/t1d/t1d_benchmark.cfg'

In [3]:
# Load the config file for 16s data
config_16s = ConfigParser.ConfigParser()
config_16s.read(filename_16s_data_cfg)

['./datasets/raw/david/david_benchmark.cfg']

In [4]:
# Load the config file for Metaphlan data
config_metaphlan = ConfigParser.ConfigParser()
config_metaphlan.read(filename_metaphlan_data_cfg)

['./datasets/raw/t1d/t1d_benchmark.cfg']
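
The list printed by `read()` contains the files that were parsed successfully. With Python's standard `configparser`, a missing file is skipped silently and the list is simply empty, so it can be worth checking the return value. The sketch below uses the standard-library module directly, on the assumption that the `ConfigParser` object imported from `mditre.data_utils` behaves the same way; the helper name `load_config` and the `description` section are hypothetical.

```python
import configparser
import os
import tempfile

def load_config(path):
    """Read a .cfg file and fail loudly if it could not be read."""
    config = configparser.ConfigParser()
    files_read = config.read(path)
    if not files_read:
        # read() does not raise for a missing file; it returns an empty list.
        raise FileNotFoundError(f"Could not read config file: {path}")
    return config

# Demonstrate on a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    cfg_path = os.path.join(tmp_dir, "example.cfg")
    with open(cfg_path, "w") as handle:
        handle.write("[description]\ntag = example\n")
    config = load_config(cfg_path)
    sections = config.sections()
    tag = config.get("description", "tag")

print(sections, tag)
```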

## Step 3: Load and preprocess data
Using the preprocess utility, we obtain a pickle object containing the required dataset, which is readily accepted for analysis by the MDITRE model code.

In [5]:
# Load and preprocess 16s data
dataset_16s = preprocess(config_16s)

Data imported (before any filtering:)
17310 variables, 20 subjects, 236 total samples
After filtering RSVs/OTUs with too few counts:
8273 variables, 20 subjects, 236 total samples
After filtering samples with too few counts:
8273 variables, 20 subjects, 233 total samples
Transformed to relative abundance.
Phylogenetic aggregation begins.
Loading and reprocessing phylogenetic placements...
Identifying best placements and calculating subtree weights...
Attaching sequences/OTUs to tree...
Pruning (this may take a moment)...
Calculating weights...
Aggregating data...
Finalizing aggregated data...
After phylogenetic aggregation:
8907 variables, 20 subjects, 233 total samples
Parsing taxonomic annotations.
Appying temporal filtering
Removing surplus internal nodes...
old/new nodes:
{'4250', '12661', 'Otu000065', '5011', '6413', 'Otu000204', 'Otu000084', '6117', 'Otu000090', '4213', 'Otu000124', '7983', 'Otu000012', '13241', '12627', '6115', '9999', '10000', '8216', '6119', 'Otu000060', 'Otu0

Dataset written to david_diet_dataset_object.pickle


In [None]:
# Load and preprocess Metaphlan data
dataset_metaphlan = preprocess(config_metaphlan)

Data imported (before any filtering:)
380 variables, 19 subjects, 128 total samples
After trimming dataset to specified experimental time window:
380 variables, 19 subjects, 116 total samples
0.000000 400.000000 1
200.000000 600.000000 2
400.000000 800.000000 2
600.000000 1000.000000 3
passing 0:
[303. 457. 638. 853. 943.]
0.000000 400.000000 3
200.000000 600.000000 5
400.000000 800.000000 2
600.000000 1000.000000 0
failing 1:
[208. 249. 355. 474. 508.]
0.000000 400.000000 2
200.000000 600.000000 4
400.000000 800.000000 4
600.000000 1000.000000 2
passing 2:
[208. 303. 474. 531. 659. 750.]
0.000000 400.000000 1
200.000000 600.000000 3
400.000000 800.000000 3
600.000000 1000.000000 2
passing 3:
[369. 465. 600. 785.]
0.000000 400.000000 2
200.000000 600.000000 5
400.000000 800.000000 6
600.000000 1000.000000 5
passing 4:
[237. 366. 430. 520. 562. 616. 683. 788. 844. 918.]
0.000000 400.000000 2
200.000000 600.000000 4
400.000000 800.000000 3
600.000000 1000.000000 1
passing 5:
[207. 339. 4
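
The window lines in the log above appear to come from the density filter: each line seems to give a window start, a window end, and the number of samples a subject has in that window. The sketch below reproduces those counts for the first few subjects. It is a reading of the log rather than MDITRE's code: the study window of 0 to 1000, five intervals, two consecutive intervals per window, a minimum of one sample per window, and inclusive window bounds are all inferred assumptions, and the function names are hypothetical.

```python
import numpy as np

def window_counts(times, start, end, n_intervals, n_consecutive):
    """Count samples in each window of n_consecutive adjacent intervals."""
    times = np.asarray(times, dtype=float)
    edges = np.linspace(start, end, n_intervals + 1)
    counts = []
    for i in range(n_intervals - n_consecutive + 1):
        lower, upper = edges[i], edges[i + n_consecutive]
        # Both bounds are treated as inclusive (an assumption that matches the log).
        counts.append(int(np.sum((times >= lower) & (times <= upper))))
    return counts

def passes_density_filter(times, start, end, n_intervals, n_consecutive, n_samples):
    """A subject passes if every window holds at least n_samples samples."""
    counts = window_counts(times, start, end, n_intervals, n_consecutive)
    return all(count >= n_samples for count in counts)

# Sample times for subjects 0 and 1, copied from the log above.
subject_0 = [303, 457, 638, 853, 943]
subject_1 = [208, 249, 355, 474, 508]

params = dict(start=0, end=1000, n_intervals=5, n_consecutive=2)
print(window_counts(subject_0, **params), passes_density_filter(subject_0, n_samples=1, **params))
print(window_counts(subject_1, **params), passes_density_filter(subject_1, n_samples=1, **params))
```

Under these assumptions subject 1 fails because it has no samples between 600 and 1000, which agrees with the "failing 1" line in the log.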

In [18]:
# The final pickle object can be loaded and examined as follows.
with open('david_diet_dataset_object.pickle', 'rb') as f:
    loaded_dataset_16s = pickle.load(f)

In [19]:
# Examine the contents of the loaded pickle file
vars(loaded_dataset_16s).keys()

dict_keys(['X', 'T', 'y', 'variable_names', 'variable_weights', 'experiment_start', 'experiment_end', 'subject_data', 'subject_IDs', 'n_subjects', 'n_variables', '_primitive_result_cache', 'additional_subject_categorical_covariates', 'additional_covariate_default_states', 'additional_subject_continuous_covariates', 'additional_covariate_encoding', 'additional_covariate_matrix', 'n_fixed_covariates', 'variable_annotations', 'variable_tree'])

In [20]:
# Abundances as list of numpy arrays
print(loaded_dataset_16s.X)

[array([[8.37356978e-02, 9.67469782e-02, 7.81279160e-02, ...,
        1.67929517e-01, 1.78865164e-01, 1.08235544e-01],
       [0.00000000e+00, 0.00000000e+00, 1.46631475e-04, ...,
        0.00000000e+00, 0.00000000e+00, 3.54521927e-04],
       [1.75138267e-02, 1.32230064e-02, 1.51297022e-02, ...,
        3.96712642e-02, 3.61544249e-02, 3.60194278e-02],
       ...,
       [1.62408919e-02, 1.32716801e-02, 3.06193180e-02, ...,
        3.26466584e-02, 2.40770049e-02, 2.05622718e-02],
       [1.06809469e-03, 4.05613694e-04, 2.93262950e-03, ...,
        1.03593204e-03, 4.28093298e-04, 3.54521927e-04],
       [1.00000000e+00, 1.00000000e+00, 1.00000000e+00, ...,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00]]), array([[2.87938879e-01, 2.78545178e-01, 2.96863018e-01, ...,
        1.91763386e-01, 3.01393481e-01, 2.37889762e-01],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        6.29585021e-04, 0.00000000e+00, 0.00000000e+00],
       [4.01469622e-02, 3.69091370e-02, 3

In [21]:
# Time-points of samples for each subject
print(loaded_dataset_16s.T)

[array([-4., -3., -2., -1.,  0.,  1.,  2.,  4.,  5.,  6.,  7.,  8.,  9.]), array([-4., -2., -1.,  0.,  1.,  2.,  3.,  4.,  6.,  8.,  9., 10.]), array([-4., -3.,  0.,  1.,  5.,  6.,  7.,  8.,  9., 10.]), array([-4., -3., -2., -1.,  0.,  1.,  2.,  4.,  5.,  6.,  7.,  8.,  9.,
       10.]), array([-4., -1.,  3.,  5.,  6.,  7.,  8.,  9.]), array([-4., -3., -2., -1.,  0.,  1.,  3.,  4.,  5.,  6.,  7.,  8., 10.]), array([-4., -3., -2.,  0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  9.]), array([-4., -2., -1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.]), array([-4., -3., -2., -1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,
        9., 10.]), array([-4., -2.,  0.,  1.,  2.,  3.,  5.,  7.,  9., 10.]), array([-4., -3.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.]), array([-2., -1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.,  8., 10.]), array([-2., -1.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.]), array([-4., -3., -2., -1.,  2.,  3.,  4.,  5.,  8.,  9.]), array([-4., -2.,  0.,  2.,  3.,  4.,  5.,  6

In [22]:
# Outcomes of each subject
print(loaded_dataset_16s.y)

[False False False False False False False False False False  True  True
  True  True  True  True  True  True  True  True]
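
The three attributes printed so far (`X`, `T`, `y`) are parallel: one entry per subject. A quick consistency check can catch a mismatched dataset before it reaches the model. The sketch below runs on a small synthetic stand-in rather than the real pickle, and it assumes that each array in `X` has one row per variable and one column per time point; the helper name `check_dataset_shapes` is hypothetical.

```python
import numpy as np

def check_dataset_shapes(X, T, y):
    """Verify that X, T and y describe the same subjects and time points."""
    n_subjects = len(X)
    if len(T) != n_subjects or len(y) != n_subjects:
        raise ValueError("X, T and y must each have one entry per subject")
    n_variables = X[0].shape[0]
    for index, (abundances, times) in enumerate(zip(X, T)):
        # Assumed layout: rows are variables, columns are time points.
        if abundances.shape != (n_variables, len(times)):
            raise ValueError(f"Subject {index} has inconsistent array shapes")
    return n_subjects, n_variables

# Synthetic stand-in: 2 subjects, 5 variables, different numbers of time points.
rng = np.random.default_rng(0)
T = [np.array([-4.0, -3.0, 0.0]), np.array([-2.0, 1.0])]
X = [rng.random((5, len(times))) for times in T]
y = np.array([False, True])

summary = check_dataset_shapes(X, T, y)
print(summary)
```

With the real dataset, the same call would be `check_dataset_shapes(loaded_dataset_16s.X, loaded_dataset_16s.T, loaded_dataset_16s.y)`, provided the layout assumption holds.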


In [23]:
# ete3 Tree object holding the phylogenetic tree of the OTUs present
print(loaded_dataset_16s.variable_tree)


   /-16219
  |
  |            /-Otu000038
  |         /-|
  |        |  |   /-Otu000040
  |        |   \-|
  |        |     |   /-Otu000025
  |        |      \-|
  |        |         \-Otu000127
  |        |
  |        |            /-Otu000188
  |        |           |
  |        |         /-|--12210
  |        |        |  |
  |        |      /-|   \-Otu000031
  |        |     |  |
  |        |     |  |   /-Otu000112
  |        |     |   \-|
  |        |     |      \-12562
  |        |     |
  |        |     |            /-Otu000186
  |        |     |         /-|
  |        |     |        |   \-Otu000172
  |        |     |        |
  |        |     |        |      /-Otu000218
  |        |     |        |     |
  |        |     |        |     |      /-Otu000020
  |        |     |        |     |     |
  |        |     |        |     |     |--Otu000030
  |        |     |        |     |   /-|
  |        |     |        |   /-|  |  |--Otu000009
  |        |     |      /-|  |  |  |  |
  |     

In [24]:
# Mapping of OTU names in the phylogenetic tree to taxonomic annotations
print(loaded_dataset_16s.variable_annotations)

{'15454': 'a clade within phylum Chlamydiae or Deinococcus-Thermus or Planctomycetes or Elusimicrobia or Synergistetes or Spirochaetes or Fibrobacteres or Nitrospirae or Tenericutes or Chlorobi or Fusobacteria or Ignavibacteriae or Proteobacteria or Dictyoglomi or Lentisphaerae or Bacteroidetes or Actinobacteria or Firmicutes or Verrucomicrobia or Chloroflexi or Deferribacteres or Gemmatimonadetes or Caldiserica or Chrysiogenetes or Acidobacteria', '16219': 'a clade within phylum Euryarchaeota,including representatives of class Halobacteria, Methanomicrobia, Methanobacteria, Thermoplasmata', '15410': 'a clade within phylum Chlamydiae or Deinococcus-Thermus or Planctomycetes or Elusimicrobia or Spirochaetes or Fibrobacteres or Nitrospirae or Tenericutes or Chlorobi or Fusobacteria or Ignavibacteriae or Proteobacteria or Dictyoglomi or Lentisphaerae or Bacteroidetes or Actinobacteria or Firmicutes or Verrucomicrobia or Chloroflexi or Deferribacteres or Gemmatimonadetes or Chrysiogenetes 