# Metagenomic Datasets

### MetAML - Metagenomic prediction Analysis based on Machine Learning
* Reference: Pasolli, Edoardo, et al. "Machine learning meta-analysis of large metagenomic datasets: tools and biological insights." PLoS computational biology 12.7 (2016): e1004977.
* MetAML is a computational tool for metagenomics-based prediction tasks and for quantitative assessment of the strength of potential microbiome-phenotype associations.
    - The tool (i) is based on machine learning classifiers, (ii) includes automatic model and feature selection steps, (iii) comprises cross-validation and cross-study analysis, and (iv) uses as features quantitative microbiome profiles including species-level relative abundances and presence of strain-specific markers.
    - It also provides species-level taxonomic profiles, marker presence data, and metadata for 3000+ publicly available metagenomes.
* Open-source tools: http://segatalab.cibio.unitn.it/tools/metaml
    - The software framework, microbiome profiles, and metadata for thousands of samples are publicly available.
    - Github: https://github.com/segatalab/metaml
    - Tutorial: https://github.com/segatalab/metaml/wiki

In [32]:
import pandas as pd

## Dataset

* A collection of 2424 publicly available metagenomic samples from eight large-scale studies
* Available data for 3000+ metagenomes
    1. `abundance.txt.bz2`: species-level relative abundances 
    1. `marker_presence.txt.bz2`: presence of strain-specific markers
    1. `marker_abundance.txt.bz2`: abundance of strain-specific markers __Not available__
    1. `markers2clades_DB.txt.bz2`: lookup table to associate each marker identifier to the corresponding species
    1. `abundance_stoolsubset.txt.bz2`: subset containing stool samples only (added later, no official description)
* These files must be decompressed before use
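
A minimal sketch of the decompression step, using only the Python standard library. The helper name `decompress_bz2` is illustrative and not part of MetAML; running `bunzip2` on each file from the shell should be equivalent.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, dst=None):
    """Decompress a .bz2 file; the output path defaults to src without '.bz2'."""
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix("")
    # Stream in chunks so large tables never have to fit in memory at once.
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, `decompress_bz2("data/abundance.txt.bz2")` would write `data/abundance.txt` next to the archive and leave the archive in place.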

In [17]:
%%bash
ls realdata_metagenomics/metaml/data/

abundance_stoolsubset.txt
abundance.txt
marker_presence.txt
markers2clades_DB.txt


## Tools

* `dataset_selection.py`: script to extract from the whole available data (e.g., from "abundance.txt") only the samples/features of interest
* `classification.py`: script to run the classification task on the selected data
* `tools` folder: additional scripts to generate the figures present in the published paper
* `scripts` folder: commands to replicate the results reported in the published paper

In [21]:
%%bash
cd realdata_metagenomics/metaml
ls

classification.py
classification_thomas-manghi.py
data
dataset_selection.py
README.md
regression_dev.py
scripts
tools


### Tools: dataset_selection

In [22]:
%%bash
cd realdata_metagenomics/metaml/
python3.5 dataset_selection.py -h

usage: dataset_selection.py [-h] [-z FEATURE_IDENTIFIER] [-s SELECT]
                            [-r REMOVE] [-i INCLUDE] [-e EXCLUDE] [-t]
                            [INPUT_FILE] [OUTPUT_FILE]

Select specific dataset from input dataset file

positional arguments:
  INPUT_FILE            the input dataset file [stdin if not present]
  OUTPUT_FILE           the output dataset file

optional arguments:
  -h, --help            show this help message and exit
  -z FEATURE_IDENTIFIER, --feature_identifier FEATURE_IDENTIFIER
                        the feature identifier
  -s SELECT, --select SELECT
                        the samples to select
  -r REMOVE, --remove REMOVE
                        the samples to remove
  -i INCLUDE, --include INCLUDE
                        the fields to include
  -e EXCLUDE, --exclude EXCLUDE
                        the fields to exclude
  -t, --tout            transpose output dataset file


#### Example

The following command selects the species-level relative abundances of the 440 samples that belong to the T2D and WT2D datasets considered in the published paper.

* Input file : `data/abundance.txt`
* Output file: `data/abundance_t2d-WT2D.txt`
* Option `-z "k__"`
    - __Feature identifier__
    - All rows whose identifier (i.e., the first column) contains `"k__"` are treated as features; the remaining rows are treated as metadata

* Options `-s dataset_name:t2dmeta_long:t2dmeta_short:WT2D -r gender:"-":" -",disease:impaired_glucose_tolerance`:
    - __Selection of samples__
    - The pair of options -s (SELECT) and -r (REMOVE) defines which samples to select or remove.
    - SELECT all samples whose metadata field "dataset_name" has the value "t2dmeta_long" OR "t2dmeta_short" OR "WT2D".
    - REMOVE all samples whose metadata field "gender" has the value "-" OR " -" (here, this excludes the samples without metadata information) AND all samples whose metadata field "disease" has the value "impaired_glucose_tolerance".

* Options `-i feature_level:s__,dataset_name:disease -e feature_level:t__`
    - __Selection of metadata/features__
    - The pair of options -i (INCLUDE) and -e (EXCLUDE) defines which metadata/features to include or exclude.
    - INCLUDE all features from the species level (included, denoted by "s__") down to the sub-species level (excluded, denoted by "t__"), which amounts to selecting species-level features. In addition, only the metadata fields "dataset_name" AND "disease" are kept.
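
To make the option syntax concrete, the sketch below parses a `-s`/`-r` style string and applies it to a toy metadata table with pandas. It only illustrates the semantics described above (values of one field are OR-ed, and each comma-separated group in `-r` removes its matching samples). `parse_selector` and `apply_select_remove` are hypothetical helpers, not functions of `dataset_selection.py`, whose internals may differ.

```python
import pandas as pd


def parse_selector(spec):
    """Turn 'field:v1:v2,field2:v3' into {'field': ['v1', 'v2'], 'field2': ['v3']}."""
    groups = {}
    for group in spec.split(","):
        field, *values = group.split(":")
        groups[field] = values
    return groups


def apply_select_remove(meta, select, remove):
    """meta: samples x metadata-fields DataFrame. Returns the rows that are kept."""
    keep = pd.Series(True, index=meta.index)
    # SELECT: keep samples whose field takes any of the listed values
    # (combining several fields with AND is an assumption of this sketch).
    for field, values in parse_selector(select).items():
        keep &= meta[field].isin(values)
    # REMOVE: drop samples matching any of the listed field/value groups.
    for field, values in parse_selector(remove).items():
        keep &= ~meta[field].isin(values)
    return meta[keep]


# Toy metadata, invented for illustration only.
meta = pd.DataFrame({
    "dataset_name": ["t2dmeta_long", "WT2D", "WT2D", "hmp"],
    "gender": ["female", "-", "female", "male"],
    "disease": ["t2d", "n", "impaired_glucose_tolerance", "n"],
}, index=["s1", "s2", "s3", "s4"])

kept = apply_select_remove(
    meta,
    select="dataset_name:t2dmeta_long:t2dmeta_short:WT2D",
    remove="gender:-: -,disease:impaired_glucose_tolerance",
)
print(kept.index.tolist())  # ['s1']
```

Note that the shell strips the quotes around `"-"` and `" -"`, so the script receives the `-r` argument as `gender:-: -,disease:impaired_glucose_tolerance`, which is the form parsed here.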

In [23]:
%%bash
cd realdata_metagenomics/metaml/
python3.5 dataset_selection.py data/abundance.txt data/abundance_t2d-WT2D.txt -z "k__" -s dataset_name:t2dmeta_long:t2dmeta_short:WT2D -r gender:"-":" -",disease:impaired_glucose_tolerance -i feature_level:s__,dataset_name:disease -e feature_level:t__

In [45]:
%%bash
cd realdata_metagenomics/metaml/
ls data
du -sh data/abundance_t2d-WT2D.txt

abundance_stoolsubset.txt
abundance_t2d-WT2D.txt
abundance.txt
marker_presence_t2d-WT2D.txt
marker_presence.txt
markers2clades_DB.txt
852K	data/abundance_t2d-WT2D.txt


In [145]:
t2d_WT2D = pd.read_csv('realdata_metagenomics/metaml/data/abundance_t2d-WT2D.txt', sep='\t', index_col=0)
t2d_WT2D.shape

(607, 440)

In [146]:
t2d_WT2D

Unnamed: 0_level_0,t2dmeta_long,t2dmeta_long.1,t2dmeta_long.2,t2dmeta_long.3,t2dmeta_long.4,t2dmeta_long.5,t2dmeta_long.6,t2dmeta_long.7,t2dmeta_long.8,t2dmeta_long.9,...,WT2D.86,WT2D.87,WT2D.88,WT2D.89,WT2D.90,WT2D.91,WT2D.92,WT2D.93,WT2D.94,WT2D.95
dataset_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
disease,n,n,n,n,n,n,n,n,n,n,...,n,n,n,n,n,n,n,n,n,n
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_smithii,0.33364,0.49776,0,0,0.49446,0,0,0,0,0,...,0,1.76247,0,2.96027,7.4432,0.02598,2.78607,2.46789,6.72433,0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_unclassified,0,0.12802,0,0,0.06786,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.07156,0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanosphaera|s__Methanosphaera_stadtmanae,0,0,0,0,0,0,0,0,0,0,...,0,0.55541,0,0,0,0,0,0,0,0
k__Bacteria|p__Acidobacteria|c__Acidobacteriia|o__Acidobacteriales|f__Acidobacteriaceae|g__Acidobacteriaceae_unclassified,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_graevenitzii,0,0,0,0,0,0,0,0,0,0.01089,...,0,0,0,0,0.06781,0,0,0,0,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_odontolyticus,0,0,0,0,0,0,0,0,0,0.01138,...,0,0,0,0,0,0,0,0,0.03085,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_turicensis,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Varibaculum|s__Varibaculum_cambriense,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Micrococcaceae|g__Rothia|s__Rothia_mucilaginosa,0,0,0.01254,0.02847,0.02221,0,0.00123,0,0,0.00283,...,0,0,0,0.05249,0,0,0.02482,0,0,0


In [147]:
# Taxonomic lineage of each feature (skip row 0, which holds the disease labels)
[name.split('|') for name in t2d_WT2D.index.values[1:]]

[['k__Archaea',
  'p__Euryarchaeota',
  'c__Methanobacteria',
  'o__Methanobacteriales',
  'f__Methanobacteriaceae',
  'g__Methanobrevibacter',
  's__Methanobrevibacter_smithii'],
 ['k__Archaea',
  'p__Euryarchaeota',
  'c__Methanobacteria',
  'o__Methanobacteriales',
  'f__Methanobacteriaceae',
  'g__Methanobrevibacter',
  's__Methanobrevibacter_unclassified'],
 ['k__Archaea',
  'p__Euryarchaeota',
  'c__Methanobacteria',
  'o__Methanobacteriales',
  'f__Methanobacteriaceae',
  'g__Methanosphaera',
  's__Methanosphaera_stadtmanae'],
 ['k__Bacteria',
  'p__Acidobacteria',
  'c__Acidobacteriia',
  'o__Acidobacteriales',
  'f__Acidobacteriaceae',
  'g__Acidobacteriaceae_unclassified'],
 ['k__Bacteria',
  'p__Actinobacteria',
  'c__Actinobacteria',
  'o__Actinomycetales',
  'f__Actinomycetaceae',
  'g__Actinomyces',
  's__Actinomyces_graevenitzii'],
 ['k__Bacteria',
  'p__Actinobacteria',
  'c__Actinobacteria',
  'o__Actinomycetales',
  'f__Actinomycetaceae',
  'g__Actinomyces',
  's__Act
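
Since each prefix encodes a taxonomic rank, a split lineage can also be turned into a rank-to-name mapping. The sketch below assumes the rank prefixes visible in the output (`k__`, `p__`, `c__`, `o__`, `f__`, `g__`, `s__`); `parse_lineage` is an illustrative helper name.

```python
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species"}


def parse_lineage(name):
    """Map a 'k__...|p__...|...' identifier to {rank: taxon}."""
    lineage = {}
    for token in name.split("|"):
        prefix, _, taxon = token.partition("__")
        # Unknown prefixes are kept under their raw prefix.
        lineage[RANKS.get(prefix, prefix)] = taxon
    return lineage


example = ("k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|"
           "f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_smithii")
print(parse_lineage(example)["species"])  # Methanobrevibacter_smithii
```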

#### T2D

In [148]:
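# Columns are named after their dataset; keep those from the two T2D cohorts
t2dmeta_list = [name for name in t2d_WT2D.columns if 't2dmeta' in name]
t2dmeta = t2d_WT2D[t2dmeta_list]
t2d_x = t2dmeta.iloc[1:, :]  # features: species-level relative abundances
t2d_y = t2dmeta.iloc[0, :]   # labels: the "disease" row
t2d_x.shape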

(606, 344)

In [149]:
t2d_x

Unnamed: 0_level_0,t2dmeta_long,t2dmeta_long.1,t2dmeta_long.2,t2dmeta_long.3,t2dmeta_long.4,t2dmeta_long.5,t2dmeta_long.6,t2dmeta_long.7,t2dmeta_long.8,t2dmeta_long.9,...,t2dmeta_short.63,t2dmeta_short.64,t2dmeta_short.65,t2dmeta_short.66,t2dmeta_short.67,t2dmeta_short.68,t2dmeta_short.69,t2dmeta_short.70,t2dmeta_short.71,t2dmeta_short.72
dataset_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_smithii,0.33364,0.49776,0,0,0.49446,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_unclassified,0,0.12802,0,0,0.06786,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanosphaera|s__Methanosphaera_stadtmanae,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Bacteria|p__Acidobacteria|c__Acidobacteriia|o__Acidobacteriales|f__Acidobacteriaceae|g__Acidobacteriaceae_unclassified,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_graevenitzii,0,0,0,0,0,0,0,0,0,0.01089,...,0,0,0,0,0,0,0.00479,0.00242,0,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_odontolyticus,0,0,0,0,0,0,0,0,0,0.01138,...,0,0,0,0,0,0,0,0,0,0.0035
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_turicensis,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Varibaculum|s__Varibaculum_cambriense,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Micrococcaceae|g__Rothia|s__Rothia_mucilaginosa,0,0,0.01254,0.02847,0.02221,0,0.00123,0,0,0.00283,...,0,0,0,0,0,0.04101,0.01181,0,0.12243,0.12846
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Micrococcaceae|g__Rothia|s__Rothia_unclassified,0,0,0.00262,0,0,0,0.00361,0,0,0,...,0,0,0,0,0,0,0.01625,0,0,0


In [150]:
t2d_y.values

array(['n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 't2d', 't2d', 't2d',
       't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d',
       't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d',
       't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d',
       't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 

#### WT2D

In [156]:
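# Same split for the WT2D cohort
WT2D_list = [name for name in t2d_WT2D.columns if 'WT2D' in name]
WT2D = t2d_WT2D[WT2D_list]
WT2D_x = WT2D.iloc[1:, :]  # features: species-level relative abundances
WT2D_y = WT2D.iloc[0, :]   # labels: the "disease" row
WT2D_x.shape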

(606, 96)

In [152]:
WT2D_x

Unnamed: 0_level_0,WT2D,WT2D.1,WT2D.2,WT2D.3,WT2D.4,WT2D.5,WT2D.6,WT2D.7,WT2D.8,WT2D.9,...,WT2D.86,WT2D.87,WT2D.88,WT2D.89,WT2D.90,WT2D.91,WT2D.92,WT2D.93,WT2D.94,WT2D.95
dataset_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_smithii,0,0,3.83821,0.78534,9.11862,0.17688,1.7296,0,0,0.31469,...,0,1.76247,0,2.96027,7.4432,0.02598,2.78607,2.46789,6.72433,0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_unclassified,0,0,0.33097,0,0,0,0,0,0,0.03476,...,0,0,0,0,0,0,0,0,0.07156,0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanosphaera|s__Methanosphaera_stadtmanae,0,0,0.35798,0,0,0,0,0,0,0,...,0,0.55541,0,0,0,0,0,0,0,0
k__Bacteria|p__Acidobacteria|c__Acidobacteriia|o__Acidobacteriales|f__Acidobacteriaceae|g__Acidobacteriaceae_unclassified,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_graevenitzii,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0.06781,0,0,0,0,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_odontolyticus,0,0,0,0,0,0,0,0.00082,0.01135,0,...,0,0,0,0,0,0,0,0,0.03085,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_turicensis,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Varibaculum|s__Varibaculum_cambriense,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Micrococcaceae|g__Rothia|s__Rothia_mucilaginosa,0.00254,0.00647,0,0,0,0.01096,0,0,0.00203,0,...,0,0,0,0.05249,0,0,0.02482,0,0,0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Micrococcaceae|g__Rothia|s__Rothia_unclassified,0,0.0017,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0.00889,0,0,0


In [153]:
WT2D_y.values

array(['n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 't2d', 'n', 'n',
       'n', 'n', 'n', 't2d', 'n', 'n', 'n', 't2d', 't2d', 't2d', 't2d',
       't2d', 'n', 'n', 'n', 't2d', 'n', 't2d', 't2d', 't2d', 't2d', 'n',
       't2d', 't2d', 't2d', 'n', 't2d', 't2d', 't2d', 't2d', 'n', 't2d',
       't2d', 't2d', 't2d', 'n', 'n', 't2d', 't2d', 'n', 't2d', 't2d',
       't2d', 't2d', 't2d', 'n', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d',
       't2d', 't2d', 't2d', 't2d', 'n', 't2d', 't2d', 't2d', 't2d', 't2d',
       't2d', 't2d', 'n', 't2d', 't2d', 'n', 't2d', 't2d', 'n', 't2d',
       't2d', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n'],
      dtype=object)
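
The same split is repeated for both cohorts, so it could be wrapped in a small helper that also converts the abundances to floats and transposes the table into the samples-by-features layout that most machine learning libraries expect. This is a sketch under the assumption that the first row holds the `disease` labels, as in the tables above; `split_features_labels` is a hypothetical name.

```python
import pandas as pd


def split_features_labels(table, prefix):
    """Return (X, y) for the columns whose name contains `prefix`."""
    cols = [name for name in table.columns if prefix in name]
    subset = table[cols]
    # Row 0 holds the labels; the remaining rows are abundances, which may
    # have been read as strings because each column mixes text and numbers.
    y = subset.iloc[0, :]
    X = subset.iloc[1:, :].astype(float).T  # samples x features
    return X, y


# Toy table with the same layout as t2d_WT2D (values invented for illustration).
toy = pd.DataFrame(
    [["n", "t2d", "n"], ["0.5", "0", "1.2"], ["0", "0.1", "0"]],
    index=["disease", "k__A|s__a", "k__B|s__b"],
    columns=["t2dmeta_long", "t2dmeta_long.1", "WT2D"],
)
X, y = split_features_labels(toy, "t2dmeta")
print(X.shape, y.tolist())  # (2, 2) ['n', 't2d']
```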

#### Example

We can extract the same set of samples but in terms of presence of strain-specific markers by slightly modifying the command in the following way:

In [34]:
%%bash
cd realdata_metagenomics/metaml/
python3.5 dataset_selection.py data/marker_presence.txt data/marker_presence_t2d-WT2D.txt -z "GeneID":"gi|" -s dataset_name:t2dmeta_long:t2dmeta_short:WT2D -r gender:"-":" -",disease:impaired_glucose_tolerance -i dataset_name:disease

In [44]:
%%bash
cd realdata_metagenomics/metaml/
ls data
du -sh data/marker_presence_t2d-WT2D.txt

abundance_stoolsubset.txt
abundance_t2d-WT2D.txt
abundance.txt
marker_presence_t2d-WT2D.txt
marker_presence.txt
markers2clades_DB.txt
115M	data/marker_presence_t2d-WT2D.txt


In [36]:
pd.read_csv('realdata_metagenomics/metaml/data/marker_presence_t2d-WT2D.txt', sep='\t')



Unnamed: 0,dataset_name,t2dmeta_long,t2dmeta_long.1,t2dmeta_long.2,t2dmeta_long.3,t2dmeta_long.4,t2dmeta_long.5,t2dmeta_long.6,t2dmeta_long.7,t2dmeta_long.8,...,WT2D.86,WT2D.87,WT2D.88,WT2D.89,WT2D.90,WT2D.91,WT2D.92,WT2D.93,WT2D.94,WT2D.95
0,disease,n,n,n,n,n,n,n,n,n,...,n,n,n,n,n,n,n,n,n,n
1,gi|104773257|ref|NC_008054.1|:116729-117526,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,gi|104773257|ref|NC_008054.1|:1737697-1738332,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,gi|104773257|ref|NC_008054.1|:266275-267207,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,gi|104773257|ref|NC_008054.1|:294312-294563,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
5,gi|104773257|ref|NC_008054.1|:444407-444904,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
6,gi|104773257|ref|NC_008054.1|:54492-55274,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
7,gi|104773257|ref|NC_008054.1|:794401-794844,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
8,gi|104773257|ref|NC_008054.1|:c1060211-1059462,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
9,gi|104773257|ref|NC_008054.1|:c1169983-1168085,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
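
Each column of this table mixes a text label (the `disease` row) with 0/1 marker flags. A possible way to load it more explicitly is sketched below: everything is read as strings first, then the marker rows are converted to a compact integer type. `load_marker_presence` is a hypothetical helper, and the demo input is a tiny made-up table with the same layout.

```python
import io
import pandas as pd


def load_marker_presence(path_or_buffer):
    """Return (labels, markers) from a marker-presence table."""
    # Read everything as strings so label and 0/1 rows do not clash on dtype.
    table = pd.read_csv(path_or_buffer, sep="\t", index_col=0, dtype=str)
    labels = table.loc["disease"]
    # int8 is enough for 0/1 flags and keeps the large table small in memory.
    markers = table.drop(index="disease").astype("int8")
    return labels, markers


demo = io.StringIO(
    "dataset_name\tt2dmeta_long\tWT2D\n"
    "disease\tn\tt2d\n"
    "gi|1|ref|X|:1-2\t0\t1\n"
    "gi|1|ref|X|:3-4\t1\t1\n"
)
labels, markers = load_marker_presence(demo)
print(labels.tolist(), markers.shape)  # ['n', 't2d'] (2, 2)
```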


## TODO: run a prediction analysis (using `classification.py`)

In [103]:
%%bash
cd realdata_metagenomics/metaml/
python3.5 classification.py -h

usage: classification.py [-h] [-z FEATURE_IDENTIFIER] [-d DEFINE] [-t TARGET]
                         [-u UNIQUE] [-b] [-r RUNS_N] [-p RUNS_CV_FOLDS] [-w]
                         [-l {rf,svm,lasso,enet}] [-i {lasso,enet}]
                         [-f CV_FOLDS] [-g CV_GRID] [-s CV_SCORING]
                         [-j FS_GRID] [-e FIGURE_EXTENSION]
                         [INPUT_FILE] [OUTPUT_FILE]

MetAML - Metagenomic prediction Analysis based on Machine Learning

positional arguments:
  INPUT_FILE            the input dataset file [stdin if not present]
  OUTPUT_FILE           the output file [stdout if not present]

optional arguments:
  -h, --help            show this help message and exit
  -z FEATURE_IDENTIFIER, --feature_identifier FEATURE_IDENTIFIER
                        the feature identifier
  -d DEFINE, --define DEFINE
                        define the classification problem
  -t TARGET, --target TARGET
                        define the target domain
  -u UNIQUE, --un



#### Example
* The classification problem is defined with `-d 1:disease:t2d`: all samples whose metadata field "disease" has the value "t2d" are assigned to class "1". The remaining samples are automatically assigned to class "0".
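
In pandas terms, this rule amounts to a one-versus-rest labelling. The sketch below mirrors it on a toy metadata table; `define_classes` is a hypothetical helper, and the `class:field:value` reading of the option follows the description above rather than the script's source.

```python
import pandas as pd


def define_classes(metadata, define):
    """Apply a 'class:field:value' rule; unmatched samples get class 0."""
    cls, field, value = define.split(":")
    return (metadata[field] == value).astype(int) * int(cls)


# Toy metadata, invented for illustration only.
metadata = pd.DataFrame({"disease": ["n", "t2d", "t2d", "n"]},
                        index=["s1", "s2", "s3", "s4"])
y = define_classes(metadata, "1:disease:t2d")
print(y.tolist())  # [0, 1, 1, 0]
```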

In [109]:
%%bash
cd realdata_metagenomics/metaml/
#mkdir results
python3.5 classification.py data/abundance_t2d-WT2D.txt results/abundance_t2d-WT2D_rf -d 1:disease:t2d -g [] -w

Traceback (most recent call last):
  File "classification.py", line 245, in <module>
    f = pd.read_csv(par['inp_f'], sep='\t', header=None, index_col=0, dtype=unicode)
NameError: name 'unicode' is not defined
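
The traceback is consistent with `classification.py` having been written for Python 2, where `unicode` is a built-in type; Python 3 has no such name and uses `str` for text. One possible workaround, sketched below on an in-memory table, is to pass `dtype=str` to the same `pd.read_csv` call. Whether the rest of the script then runs under Python 3.5 has not been verified here.

```python
import io
import pandas as pd

# Python 3 has no `unicode` built-in; `str` plays that role.
text_type = str

# Tiny made-up table standing in for data/abundance_t2d-WT2D.txt.
demo = io.StringIO(
    "dataset_name\tWT2D\tWT2D\n"
    "disease\tn\tt2d\n"
    "k__A|s__a\t0.1\t0\n"
)
f = pd.read_csv(demo, sep="\t", header=None, index_col=0, dtype=text_type)
print(f.shape)  # (3, 2)
```

Running the script with a Python 2 interpreter would be the other obvious option, assuming its dependencies are available there.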
