# Metagenomic Datasets

### MetAML - Metagenomic prediction Analysis based on Machine Learning
* Reference: Pasolli, Edoardo, et al. "Machine learning meta-analysis of large metagenomic datasets: tools and biological insights." PLoS computational biology 12.7 (2016): e1004977.
* MetAML is a computational tool for metagenomics-based prediction tasks and for quantitative assessment of the strength of potential microbiome-phenotype associations.
    - The tool (i) is based on machine learning classifiers, (ii) includes automatic model and feature selection steps, (iii) comprises cross-validation and cross-study analysis, and (iv) uses as features quantitative microbiome profiles including species-level relative abundances and presence of strain-specific markers.
    - It also provides species-level taxonomic profiles, marker presence data, and metadata for 3000+ publicly available metagenomes.
* Open-source tools: http://segatalab.cibio.unitn.it/tools/metaml
    - The software framework, microbiome profiles, and metadata for thousands of samples are publicly available.
    - Github: https://github.com/segatalab/metaml
    - Tutorial: https://github.com/segatalab/metaml/wiki

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import KFold

## Dataset

* A collection of 2424 publicly available metagenomic samples from eight large-scale studies
* Available data for 3000+ metagenomes
    1. `abundance.txt.bz2`: species-level relative abundances 
    1. `marker_presence.txt.bz2`: presence of strain-specific markers
    1. `marker_abundance.txt.bz2`: abundance of strain-specific markers __Not available__
    1. `markers2clades_DB.txt.bz2`: lookup table to associate each marker identifier to the corresponding species
    1. `abundance_stoolsubset.txt.bz2`: no description provided (added subset with stool samples only)
* These files must be uncompressed before use
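
The decompression step can be done from Python as well as from the shell. The following is a minimal sketch using only the standard library; the helper name `decompress_bz2` is an illustrative assumption and is not part of MetAML.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, dst=None):
    """Decompress a .bz2 file and return the path of the uncompressed copy."""
    src = Path(src)
    # Default target: the same path with the trailing ".bz2" suffix removed.
    dst = Path(dst) if dst is not None else src.with_suffix("")
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Stream the data in chunks so large files do not have to fit in memory.
        shutil.copyfileobj(fin, fout)
    return dst
```

pandas can also read the compressed file directly, since `pd.read_csv` infers bz2 compression from a `.bz2` file extension by default.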

In [2]:
%%bash
cd data/metagenomics/metaml/
ls data
du -sh data/abundance_t2d-WT2D.txt

abundance_stoolsubset.txt
abundance_t2d-WT2D.txt
abundance.txt
marker_presence_t2d-WT2D.txt
marker_presence.txt
markers2clades_DB.txt
852K	data/abundance_t2d-WT2D.txt


## T2D_WT2D data cleaning

In [3]:
t2d_WT2D = pd.read_csv('data/metagenomics/metaml/data/abundance_t2d-WT2D.txt', sep='\t', index_col=0)
t2d_WT2D.shape

(607, 440)

In [4]:
t2d_WT2D

Unnamed: 0_level_0,t2dmeta_long,t2dmeta_long.1,t2dmeta_long.2,t2dmeta_long.3,t2dmeta_long.4,t2dmeta_long.5,t2dmeta_long.6,t2dmeta_long.7,t2dmeta_long.8,t2dmeta_long.9,...,WT2D.86,WT2D.87,WT2D.88,WT2D.89,WT2D.90,WT2D.91,WT2D.92,WT2D.93,WT2D.94,WT2D.95
dataset_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
disease,n,n,n,n,n,n,n,n,n,n,...,n,n,n,n,n,n,n,n,n,n
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_smithii,0.33364,0.49776,0,0,0.49446,0,0,0,0,0,...,0,1.76247,0,2.96027,7.4432,0.02598,2.78607,2.46789,6.72433,0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_unclassified,0,0.12802,0,0,0.06786,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.07156,0
k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|f__Methanobacteriaceae|g__Methanosphaera|s__Methanosphaera_stadtmanae,0,0,0,0,0,0,0,0,0,0,...,0,0.55541,0,0,0,0,0,0,0,0
k__Bacteria|p__Acidobacteria|c__Acidobacteriia|o__Acidobacteriales|f__Acidobacteriaceae|g__Acidobacteriaceae_unclassified,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
k__Bacteria|p__Planctomycetes|c__Planctomycetia|o__Planctomycetales|f__Planctomycetaceae|g__Rhodopirellula|s__Rhodopirellula_unclassified,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Vibrionales|f__Vibrionaceae|g__Vibrio|s__Vibrio_furnissii,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_sp_2_2_4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillaceae|g__Lysinibacillus|s__Lysinibacillus_fusiformis,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0.02423,0,0,0,0,0


### Phylogenetic tree

In [5]:
# Phylogenetic information
phylogentic_info_raw = [name.split('|') for name in t2d_WT2D.index.values[1:]]
phylogentic_info_raw

[['k__Archaea',
  'p__Euryarchaeota',
  'c__Methanobacteria',
  'o__Methanobacteriales',
  'f__Methanobacteriaceae',
  'g__Methanobrevibacter',
  's__Methanobrevibacter_smithii'],
 ['k__Archaea',
  'p__Euryarchaeota',
  'c__Methanobacteria',
  'o__Methanobacteriales',
  'f__Methanobacteriaceae',
  'g__Methanobrevibacter',
  's__Methanobrevibacter_unclassified'],
 ['k__Archaea',
  'p__Euryarchaeota',
  'c__Methanobacteria',
  'o__Methanobacteriales',
  'f__Methanobacteriaceae',
  'g__Methanosphaera',
  's__Methanosphaera_stadtmanae'],
 ['k__Bacteria',
  'p__Acidobacteria',
  'c__Acidobacteriia',
  'o__Acidobacteriales',
  'f__Acidobacteriaceae',
  'g__Acidobacteriaceae_unclassified'],
 ['k__Bacteria',
  'p__Actinobacteria',
  'c__Actinobacteria',
  'o__Actinomycetales',
  'f__Actinomycetaceae',
  'g__Actinomyces',
  's__Actinomyces_graevenitzii'],
 ['k__Bacteria',
  'p__Actinobacteria',
  'c__Actinobacteria',
  'o__Actinomycetales',
  'f__Actinomycetaceae',
  'g__Actinomyces',
  's__Act

In [6]:
phylogenetic_info = pd.DataFrame([dict([item.split('__') for item in sample]) for sample in phylogentic_info_raw])

# columns order: species genus family order class phylum kingdom(domain)
phylogenetic_info = phylogenetic_info[['s', 'g', 'f', 'o', 'c', 'p', 'k']]
phylogenetic_info.columns = ['Species', 'Genus', 'Family', 'Order', 'Class', 'Phylum', 'Domain']
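
To make the parsing step easier to follow, here is a small self-contained sketch on two toy lineage strings in the same `rank__name|rank__name` layout. The variable names (`lineages`, `rank_names`, `toy_info`) are illustrative, and using `reindex` to fix the column order is a choice made for this sketch rather than what the cell above does.

```python
import pandas as pd

# Two toy lineage strings; the second one stops at genus level (no "s__" part).
lineages = [
    "k__Archaea|p__Euryarchaeota|c__Methanobacteria|o__Methanobacteriales|"
    "f__Methanobacteriaceae|g__Methanobrevibacter|s__Methanobrevibacter_smithii",
    "k__Bacteria|p__Acidobacteria|c__Acidobacteriia|o__Acidobacteriales|"
    "f__Acidobacteriaceae|g__Acidobacteriaceae_unclassified",
]

rank_names = {"s": "Species", "g": "Genus", "f": "Family", "o": "Order",
              "c": "Class", "p": "Phylum", "k": "Domain"}

# One dict per lineage, e.g. {"k": "Archaea", "p": "Euryarchaeota", ...}
records = [dict(part.split("__", 1) for part in lineage.split("|"))
           for lineage in lineages]

# reindex fixes the column order (species first) before renaming to full rank names.
toy_info = (pd.DataFrame(records)
            .reindex(columns=list(rank_names))
            .rename(columns=rank_names))

# Lineages without a species rank fall back to the genus name.
toy_info["Species"] = toy_info["Species"].fillna(toy_info["Genus"])
print(toy_info[["Species", "Genus", "Domain"]])
```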

In [7]:
phylogenetic_info

Unnamed: 0,Species,Genus,Family,Order,Class,Phylum,Domain
0,Methanobrevibacter_smithii,Methanobrevibacter,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota,Archaea
1,Methanobrevibacter_unclassified,Methanobrevibacter,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota,Archaea
2,Methanosphaera_stadtmanae,Methanosphaera,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota,Archaea
3,,Acidobacteriaceae_unclassified,Acidobacteriaceae,Acidobacteriales,Acidobacteriia,Acidobacteria,Bacteria
4,Actinomyces_graevenitzii,Actinomyces,Actinomycetaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
...,...,...,...,...,...,...,...
601,Rhodopirellula_unclassified,Rhodopirellula,Planctomycetaceae,Planctomycetales,Planctomycetia,Planctomycetes,Bacteria
602,Vibrio_furnissii,Vibrio,Vibrionaceae,Vibrionales,Gammaproteobacteria,Proteobacteria,Bacteria
603,Bacteroides_sp_2_2_4,Bacteroides,Bacteroidaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
604,Lysinibacillus_fusiformis,Lysinibacillus,Bacillaceae,Bacillales,Bacilli,Firmicutes,Bacteria


In [8]:
# fill NaN with Genus column value
phylogenetic_info['Species'] = phylogenetic_info['Species'].fillna(phylogenetic_info['Genus'])
phylogenetic_info

Unnamed: 0,Species,Genus,Family,Order,Class,Phylum,Domain
0,Methanobrevibacter_smithii,Methanobrevibacter,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota,Archaea
1,Methanobrevibacter_unclassified,Methanobrevibacter,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota,Archaea
2,Methanosphaera_stadtmanae,Methanosphaera,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota,Archaea
3,Acidobacteriaceae_unclassified,Acidobacteriaceae_unclassified,Acidobacteriaceae,Acidobacteriales,Acidobacteriia,Acidobacteria,Bacteria
4,Actinomyces_graevenitzii,Actinomyces,Actinomycetaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
...,...,...,...,...,...,...,...
601,Rhodopirellula_unclassified,Rhodopirellula,Planctomycetaceae,Planctomycetales,Planctomycetia,Planctomycetes,Bacteria
602,Vibrio_furnissii,Vibrio,Vibrionaceae,Vibrionales,Gammaproteobacteria,Proteobacteria,Bacteria
603,Bacteroides_sp_2_2_4,Bacteroides,Bacteroidaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
604,Lysinibacillus_fusiformis,Lysinibacillus,Bacillaceae,Bacillales,Bacilli,Firmicutes,Bacteria


In [9]:
# Drop duplicates
phylogenetic_info = phylogenetic_info.drop_duplicates()
phylogenetic_info

Unnamed: 0,Species,Genus,Family,Order,Class,Phylum,Domain
0,Methanobrevibacter_smithii,Methanobrevibacter,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota,Archaea
1,Methanobrevibacter_unclassified,Methanobrevibacter,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota,Archaea
2,Methanosphaera_stadtmanae,Methanosphaera,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota,Archaea
3,Acidobacteriaceae_unclassified,Acidobacteriaceae_unclassified,Acidobacteriaceae,Acidobacteriales,Acidobacteriia,Acidobacteria,Bacteria
4,Actinomyces_graevenitzii,Actinomyces,Actinomycetaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
...,...,...,...,...,...,...,...
601,Rhodopirellula_unclassified,Rhodopirellula,Planctomycetaceae,Planctomycetales,Planctomycetia,Planctomycetes,Bacteria
602,Vibrio_furnissii,Vibrio,Vibrionaceae,Vibrionales,Gammaproteobacteria,Proteobacteria,Bacteria
603,Bacteroides_sp_2_2_4,Bacteroides,Bacteroidaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
604,Lysinibacillus_fusiformis,Lysinibacillus,Bacillaceae,Bacillales,Bacilli,Firmicutes,Bacteria


In [11]:
phylogenetic_info.shape

(606, 7)

In [12]:
# Save to csv
phylogenetic_info.to_csv('data/metagenomics/clean_t2d/t2d_species606_dic.csv', index=False)

In [13]:
%%bash
cd data/metagenomics/clean_t2d/
cat t2d_species606_dic.csv

Species,Genus,Family,Order,Class,Phylum,Domain
Methanobrevibacter_smithii,Methanobrevibacter,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota,Archaea
Methanobrevibacter_unclassified,Methanobrevibacter,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota,Archaea
Methanosphaera_stadtmanae,Methanosphaera,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota,Archaea
Acidobacteriaceae_unclassified,Acidobacteriaceae_unclassified,Acidobacteriaceae,Acidobacteriales,Acidobacteriia,Acidobacteria,Bacteria
Actinomyces_graevenitzii,Actinomyces,Actinomycetaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Actinomyces_odontolyticus,Actinomyces,Actinomycetaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Actinomyces_turicensis,Actinomyces,Actinomycetaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Varibaculum_cambriense,Varibaculum,Actinomycetaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Rothia_mucilagin

In [14]:
# Check
tree_level_list = ['Species', 'Genus', 'Family', 'Order', 'Class', 'Phylum']
phylogenetic_tree_info = pd.read_csv('data/metagenomics/clean_t2d/t2d_species606_dic.csv')
checked_tree_level_list = [lvl_name for lvl_name in tree_level_list if lvl_name in phylogenetic_tree_info.columns.tolist()]
phylogenetic_tree_info = phylogenetic_tree_info[checked_tree_level_list]
phylogenetic_tree_info

Unnamed: 0,Species,Genus,Family,Order,Class,Phylum
0,Methanobrevibacter_smithii,Methanobrevibacter,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota
1,Methanobrevibacter_unclassified,Methanobrevibacter,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota
2,Methanosphaera_stadtmanae,Methanosphaera,Methanobacteriaceae,Methanobacteriales,Methanobacteria,Euryarchaeota
3,Acidobacteriaceae_unclassified,Acidobacteriaceae_unclassified,Acidobacteriaceae,Acidobacteriales,Acidobacteriia,Acidobacteria
4,Actinomyces_graevenitzii,Actinomyces,Actinomycetaceae,Actinomycetales,Actinobacteria,Actinobacteria
...,...,...,...,...,...,...
601,Rhodopirellula_unclassified,Rhodopirellula,Planctomycetaceae,Planctomycetales,Planctomycetia,Planctomycetes
602,Vibrio_furnissii,Vibrio,Vibrionaceae,Vibrionales,Gammaproteobacteria,Proteobacteria
603,Bacteroides_sp_2_2_4,Bacteroides,Bacteroidaceae,Bacteroidales,Bacteroidia,Bacteroidetes
604,Lysinibacillus_fusiformis,Lysinibacillus,Bacillaceae,Bacillales,Bacilli,Firmicutes


In [56]:
print('-----------------------------------------------------------------------------------------------------')
print('Phylogenetic tree level list: %s' % tree_level_list)
print('-----------------------------------------------------------------------------------------------------')
phylogenetic_tree_dict = {'Number':{}}
for tree_lvl in tree_level_list:
    lvl_category = phylogenetic_tree_info[tree_lvl].unique()
    lvl_num = lvl_category.shape[0]
    print('%6s: %d' % (tree_lvl, lvl_num))
    phylogenetic_tree_dict[tree_lvl] = dict(zip(lvl_category, np.arange(lvl_num)))
    phylogenetic_tree_dict['Number'][tree_lvl]=lvl_num
print('-----------------------------------------------------------------------------------------------------')
print('Phylogenetic_tree_dict info: %s' % list(phylogenetic_tree_dict.keys()))
print('-----------------------------------------------------------------------------------------------------')

-----------------------------------------------------------------------------------------------------
Phylogenetic tree level list: ['Species', 'Genus', 'Family', 'Order', 'Class', 'Phylum']
-----------------------------------------------------------------------------------------------------
Species: 606
 Genus: 216
Family: 94
 Order: 48
 Class: 29
Phylum: 17
-----------------------------------------------------------------------------------------------------
Phylogenetic_tree_dict info: ['Number', 'Species', 'Genus', 'Family', 'Order', 'Class', 'Phylum']
-----------------------------------------------------------------------------------------------------
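
The per-level dictionaries map each distinct name to an integer code in order of first appearance. The sketch below reproduces that construction on a three-row toy table and then shows one possible use, looking up the genus code of every species row. The lookup step is an assumption about how such a dictionary might be used; the notebook itself does not show it.

```python
import numpy as np
import pandas as pd

toy_tree = pd.DataFrame({
    "Species": ["Methanobrevibacter_smithii",
                "Methanobrevibacter_unclassified",
                "Methanosphaera_stadtmanae"],
    "Genus": ["Methanobrevibacter", "Methanobrevibacter", "Methanosphaera"],
})

# Same construction as above: distinct names get consecutive integer codes.
codes = {}
for level in ["Species", "Genus"]:
    names = toy_tree[level].unique()  # keeps order of first appearance
    codes[level] = dict(zip(names, np.arange(len(names))))

# Hypothetical use: the genus code for each species row.
genus_code = toy_tree["Genus"].map(codes["Genus"])
print(genus_code.tolist())
```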


### T2D

In [15]:
t2dmeta_list = [name for name in t2d_WT2D.columns if 't2dmeta' in name]
t2dmeta = t2d_WT2D[t2dmeta_list]
t2d_x = t2dmeta.iloc[1:,:]
t2d_y = t2dmeta.iloc[0,:]

t2d_x.index = phylogenetic_tree_info['Species']
x = t2d_x.T
x

Species,Methanobrevibacter_smithii,Methanobrevibacter_unclassified,Methanosphaera_stadtmanae,Acidobacteriaceae_unclassified,Actinomyces_graevenitzii,Actinomyces_odontolyticus,Actinomyces_turicensis,Varibaculum_cambriense,Rothia_mucilaginosa,Rothia_unclassified,...,Xanthomonas_axonopodis,Xanthomonas_fuscans,Mycoplasma_bovis,Zunongwangia_profunda,Enterococcus_pallens,Rhodopirellula_unclassified,Vibrio_furnissii,Bacteroides_sp_2_2_4,Lysinibacillus_fusiformis,Lysinibacillus_sphaericus
t2dmeta_long,0.33364,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
t2dmeta_long.1,0.49776,0.12802,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
t2dmeta_long.2,0,0,0,0,0,0,0,0,0.01254,0.00262,...,0,0,0,0,0,0,0,0,0,0
t2dmeta_long.3,0,0,0,0,0,0,0,0,0.02847,0,...,0,0,0,0,0,0,0,0,0,0
t2dmeta_long.4,0.49446,0.06786,0,0,0,0,0,0,0.02221,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
t2dmeta_short.68,0,0,0,0,0,0,0,0,0.04101,0,...,0,0,0,0,0,0,0,0,0,0
t2dmeta_short.69,0,0,0,0,0.00479,0,0,0,0.01181,0.01625,...,0,0,0,0,0,0,0,0,0,0
t2dmeta_short.70,0,0,0,0,0.00242,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
t2dmeta_short.71,0,0,0,0,0,0,0,0,0.12243,0,...,0,0,0,0.00086,0,0,0.00943,0,0,0
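
The split into labels and features relies on the table layout: samples are columns, the first row holds the disease label, and the remaining rows hold taxa. A minimal sketch on a toy table with the same orientation follows; the taxon names and values are placeholders, and the `astype(float)` conversion is an addition made for this sketch.

```python
import pandas as pd

# Toy table: first row is the label, remaining rows are taxa, columns are samples.
toy = pd.DataFrame(
    {
        "t2dmeta_long": ["n", 0.33364, 0.0],
        "t2dmeta_long.1": ["t2d", 0.49776, 0.12802],
        "WT2D": ["n", 0.0, 0.0],
    },
    index=["disease", "taxon_a", "taxon_b"],
)

# Select one cohort by a substring of the column name.
cohort_cols = [c for c in toy.columns if "t2dmeta" in c]
cohort = toy[cohort_cols]

labels = cohort.iloc[0, :]       # disease status per sample
features = cohort.iloc[1:, :].T  # transpose to samples x taxa

# Columns that mix a label with numbers are held as object dtype,
# so convert the abundance values explicitly.
features = features.astype(float)
print(features)
```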


In [16]:
# Save to csv
x.to_csv('data/metagenomics/clean_t2d/X_t2d.csv', index=False)

In [17]:
%%bash
cd data/metagenomics/clean_t2d/
cat X_t2d.csv

Methanobrevibacter_smithii,Methanobrevibacter_unclassified,Methanosphaera_stadtmanae,Acidobacteriaceae_unclassified,Actinomyces_graevenitzii,Actinomyces_odontolyticus,Actinomyces_turicensis,Varibaculum_cambriense,Rothia_mucilaginosa,Rothia_unclassified,Bifidobacterium_adolescentis,Bifidobacterium_angulatum,Bifidobacterium_bifidum,Bifidobacterium_catenulatum,Bifidobacterium_longum,Bifidobacterium_pseudocatenulatum,Gardnerella_vaginalis,Adlercreutzia_equolifaciens,Atopobium_parvulum,Atopobium_rimae,Collinsella_aerofaciens,Collinsella_unclassified,Coriobacteriaceae_bacterium_phI,Eggerthella_lenta,Eggerthella_unclassified,Gordonibacter_pamelaeae,Olsenella_unclassified,Slackia_piriformis,Bacteroides_barnesiae,Bacteroides_caccae,Bacteroides_cellulosilyticus,Bacteroides_clarus,Bacteroides_coprocola,Bacteroides_coprophilus,Bacteroides_dorei,Bacteroides_eggerthii,Bacteroides_faecis,Bacteroides_finegoldii,Bacteroides_fragilis,Bacteroides_intestinalis,Bacteroides_massiliensis,Bacteroides_nordii,B

In [18]:
# Check
pd.read_csv('data/metagenomics/clean_t2d/X_t2d.csv')

Unnamed: 0,Methanobrevibacter_smithii,Methanobrevibacter_unclassified,Methanosphaera_stadtmanae,Acidobacteriaceae_unclassified,Actinomyces_graevenitzii,Actinomyces_odontolyticus,Actinomyces_turicensis,Varibaculum_cambriense,Rothia_mucilaginosa,Rothia_unclassified,...,Xanthomonas_axonopodis,Xanthomonas_fuscans,Mycoplasma_bovis,Zunongwangia_profunda,Enterococcus_pallens,Rhodopirellula_unclassified,Vibrio_furnissii,Bacteroides_sp_2_2_4,Lysinibacillus_fusiformis,Lysinibacillus_sphaericus
0,0.33364,0.00000,0.0,0.0,0.00000,0.0000,0.0,0.0,0.00000,0.00000,...,0.0,0.0,0.0,0.00000,0.0,0.0,0.00000,0,0,0
1,0.49776,0.12802,0.0,0.0,0.00000,0.0000,0.0,0.0,0.00000,0.00000,...,0.0,0.0,0.0,0.00000,0.0,0.0,0.00000,0,0,0
2,0.00000,0.00000,0.0,0.0,0.00000,0.0000,0.0,0.0,0.01254,0.00262,...,0.0,0.0,0.0,0.00000,0.0,0.0,0.00000,0,0,0
3,0.00000,0.00000,0.0,0.0,0.00000,0.0000,0.0,0.0,0.02847,0.00000,...,0.0,0.0,0.0,0.00000,0.0,0.0,0.00000,0,0,0
4,0.49446,0.06786,0.0,0.0,0.00000,0.0000,0.0,0.0,0.02221,0.00000,...,0.0,0.0,0.0,0.00000,0.0,0.0,0.00000,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
339,0.00000,0.00000,0.0,0.0,0.00000,0.0000,0.0,0.0,0.04101,0.00000,...,0.0,0.0,0.0,0.00000,0.0,0.0,0.00000,0,0,0
340,0.00000,0.00000,0.0,0.0,0.00479,0.0000,0.0,0.0,0.01181,0.01625,...,0.0,0.0,0.0,0.00000,0.0,0.0,0.00000,0,0,0
341,0.00000,0.00000,0.0,0.0,0.00242,0.0000,0.0,0.0,0.00000,0.00000,...,0.0,0.0,0.0,0.00000,0.0,0.0,0.00000,0,0,0
342,0.00000,0.00000,0.0,0.0,0.00000,0.0000,0.0,0.0,0.12243,0.00000,...,0.0,0.0,0.0,0.00086,0.0,0.0,0.00943,0,0,0


In [19]:
t2d_y.values

array(['n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 't2d', 't2d', 't2d',
       't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d',
       't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d',
       't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d',
       't2d', 't2d', 't2d', 't2d', 't2d', 't2d', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n',
       'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 

In [20]:
np.unique(t2d_y.values)

array(['n', 't2d'], dtype=object)

In [21]:
y = pd.DataFrame(t2d_y.values == 't2d')
y.columns = ['t2d']
y['t2d'] = y['t2d'].astype('int32')
y

Unnamed: 0,t2d
0,0
1,0
2,0
3,0
4,0
...,...
339,0
340,0
341,0
342,0


In [22]:
np.unique(y['t2d'])

array([0, 1], dtype=int32)
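
As a quick sanity check on the encoding, the same comparison can be run on a handful of toy labels and the class counts inspected. The array `raw` below is made-up data for illustration only.

```python
import numpy as np
import pandas as pd

raw = np.array(["n", "t2d", "t2d", "n", "t2d"], dtype=object)  # toy labels

# True where the sample is a T2D case, then cast to 0/1 integers.
y_toy = pd.DataFrame({"t2d": (raw == "t2d").astype("int32")})

# Number of samples per class.
print(y_toy["t2d"].value_counts().to_dict())
```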

In [23]:
# Save to csv
y.to_csv('data/metagenomics/clean_t2d/y_t2d.csv', index=False)

In [24]:
%%bash
cd data/metagenomics/clean_t2d/
cat y_t2d.csv

t2d
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0


In [25]:
# Check
pd.read_csv('data/metagenomics/clean_t2d/y_t2d.csv')

Unnamed: 0,t2d
0,0
1,0
2,0
3,0
4,0
...,...
339,0
340,0
341,0
342,0


#### Set of indices for the 5-fold cross-validation

In [26]:
number_of_fold = 5
nsample = y.shape[0]

kf = KFold(n_splits=number_of_fold, shuffle=True, random_state=12)
cv_gen = kf.split(range(nsample))
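# Training indices per fold; fold sizes differ, so an object array is needed (NumPy >= 1.24 rejects ragged input otherwise)
idxs = np.array([train_idx for train_idx, test_idx in cv_gen], dtype=object).T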
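
When the number of samples is not divisible by the number of folds, the training sets have different lengths. The sketch below illustrates this on a toy sample count and stores the per-fold training indices in an object array. The names `folds` and `train_sets` and the toy size of 11 are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 11  # toy size, deliberately not divisible by 5
kf = KFold(n_splits=5, shuffle=True, random_state=12)

# Materialise the generator so both index sets of every fold can be inspected.
folds = [(train_idx, test_idx) for train_idx, test_idx in kf.split(np.arange(n_samples))]

for i, (train_idx, test_idx) in enumerate(folds):
    # Within a fold, train and test indices together cover all samples.
    print(i, len(train_idx), len(test_idx))

# Store training indices of unequal length in a 1-D object array, one entry per fold.
train_sets = np.empty(len(folds), dtype=object)
for i, (train_idx, _) in enumerate(folds):
    train_sets[i] = train_idx
```

Keeping the test indices alongside the training indices makes it possible to evaluate each fold without recomputing the split.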

### WT2D

In [27]:
WT2D_list = [name for name in t2d_WT2D.columns if 'WT2D' in name]
WT2D = t2d_WT2D[WT2D_list]
WT2D_x = WT2D.iloc[1:,:]
WT2D_y = WT2D.iloc[0,:]

WT2D_x.index = phylogenetic_tree_info['Species']
x = WT2D_x.T
x

Species,Methanobrevibacter_smithii,Methanobrevibacter_unclassified,Methanosphaera_stadtmanae,Acidobacteriaceae_unclassified,Actinomyces_graevenitzii,Actinomyces_odontolyticus,Actinomyces_turicensis,Varibaculum_cambriense,Rothia_mucilaginosa,Rothia_unclassified,...,Xanthomonas_axonopodis,Xanthomonas_fuscans,Mycoplasma_bovis,Zunongwangia_profunda,Enterococcus_pallens,Rhodopirellula_unclassified,Vibrio_furnissii,Bacteroides_sp_2_2_4,Lysinibacillus_fusiformis,Lysinibacillus_sphaericus
WT2D,0,0,0,0,0,0,0,0,0.00254,0,...,0,0,0,0,0,0,0,0,0,0
WT2D.1,0,0,0,0,0,0,0,0,0.00647,0.0017,...,0,0,0,0,0,0,0,0,0,0
WT2D.2,3.83821,0.33097,0.35798,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WT2D.3,0.78534,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WT2D.4,9.11862,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
WT2D.91,0.02598,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WT2D.92,2.78607,0,0,0,0,0,0,0,0.02482,0.00889,...,0,0,0,0,0,0,0,0,0,0
WT2D.93,2.46789,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
WT2D.94,6.72433,0.07156,0,0,0,0.03085,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
# Save to csv
x.to_csv('data/metagenomics/clean_t2d/X_wt2d.csv', index=False)

In [29]:
%%bash
cd data/metagenomics/clean_t2d/
cat X_wt2d.csv

Methanobrevibacter_smithii,Methanobrevibacter_unclassified,Methanosphaera_stadtmanae,Acidobacteriaceae_unclassified,Actinomyces_graevenitzii,Actinomyces_odontolyticus,Actinomyces_turicensis,Varibaculum_cambriense,Rothia_mucilaginosa,Rothia_unclassified,Bifidobacterium_adolescentis,Bifidobacterium_angulatum,Bifidobacterium_bifidum,Bifidobacterium_catenulatum,Bifidobacterium_longum,Bifidobacterium_pseudocatenulatum,Gardnerella_vaginalis,Adlercreutzia_equolifaciens,Atopobium_parvulum,Atopobium_rimae,Collinsella_aerofaciens,Collinsella_unclassified,Coriobacteriaceae_bacterium_phI,Eggerthella_lenta,Eggerthella_unclassified,Gordonibacter_pamelaeae,Olsenella_unclassified,Slackia_piriformis,Bacteroides_barnesiae,Bacteroides_caccae,Bacteroides_cellulosilyticus,Bacteroides_clarus,Bacteroides_coprocola,Bacteroides_coprophilus,Bacteroides_dorei,Bacteroides_eggerthii,Bacteroides_faecis,Bacteroides_finegoldii,Bacteroides_fragilis,Bacteroides_intestinalis,Bacteroides_massiliensis,Bacteroides_nordii,B

In [30]:
# Check
pd.read_csv('data/metagenomics/clean_t2d/X_wt2d.csv')

Unnamed: 0,Methanobrevibacter_smithii,Methanobrevibacter_unclassified,Methanosphaera_stadtmanae,Acidobacteriaceae_unclassified,Actinomyces_graevenitzii,Actinomyces_odontolyticus,Actinomyces_turicensis,Varibaculum_cambriense,Rothia_mucilaginosa,Rothia_unclassified,...,Xanthomonas_axonopodis,Xanthomonas_fuscans,Mycoplasma_bovis,Zunongwangia_profunda,Enterococcus_pallens,Rhodopirellula_unclassified,Vibrio_furnissii,Bacteroides_sp_2_2_4,Lysinibacillus_fusiformis,Lysinibacillus_sphaericus
0,0.00000,0.00000,0.00000,0,0.0,0.00000,0.0,0.0,0.00254,0.00000,...,0,0,0,0,0,0,0,0.0,0.0,0.0
1,0.00000,0.00000,0.00000,0,0.0,0.00000,0.0,0.0,0.00647,0.00170,...,0,0,0,0,0,0,0,0.0,0.0,0.0
2,3.83821,0.33097,0.35798,0,0.0,0.00000,0.0,0.0,0.00000,0.00000,...,0,0,0,0,0,0,0,0.0,0.0,0.0
3,0.78534,0.00000,0.00000,0,0.0,0.00000,0.0,0.0,0.00000,0.00000,...,0,0,0,0,0,0,0,0.0,0.0,0.0
4,9.11862,0.00000,0.00000,0,0.0,0.00000,0.0,0.0,0.00000,0.00000,...,0,0,0,0,0,0,0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,0.02598,0.00000,0.00000,0,0.0,0.00000,0.0,0.0,0.00000,0.00000,...,0,0,0,0,0,0,0,0.0,0.0,0.0
92,2.78607,0.00000,0.00000,0,0.0,0.00000,0.0,0.0,0.02482,0.00889,...,0,0,0,0,0,0,0,0.0,0.0,0.0
93,2.46789,0.00000,0.00000,0,0.0,0.00000,0.0,0.0,0.00000,0.00000,...,0,0,0,0,0,0,0,0.0,0.0,0.0
94,6.72433,0.07156,0.00000,0,0.0,0.03085,0.0,0.0,0.00000,0.00000,...,0,0,0,0,0,0,0,0.0,0.0,0.0


In [31]:
WT2D_y.values

array(['n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 't2d', 'n', 'n',
       'n', 'n', 'n', 't2d', 'n', 'n', 'n', 't2d', 't2d', 't2d', 't2d',
       't2d', 'n', 'n', 'n', 't2d', 'n', 't2d', 't2d', 't2d', 't2d', 'n',
       't2d', 't2d', 't2d', 'n', 't2d', 't2d', 't2d', 't2d', 'n', 't2d',
       't2d', 't2d', 't2d', 'n', 'n', 't2d', 't2d', 'n', 't2d', 't2d',
       't2d', 't2d', 't2d', 'n', 't2d', 't2d', 't2d', 't2d', 't2d', 't2d',
       't2d', 't2d', 't2d', 't2d', 'n', 't2d', 't2d', 't2d', 't2d', 't2d',
       't2d', 't2d', 'n', 't2d', 't2d', 'n', 't2d', 't2d', 'n', 't2d',
       't2d', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n', 'n'],
      dtype=object)

In [32]:
np.unique(WT2D_y.values)

array(['n', 't2d'], dtype=object)

In [33]:
y = pd.DataFrame(WT2D_y.values == 't2d')
y.columns = ['t2d']
y['t2d'] = y['t2d'].astype('int32')
y

Unnamed: 0,t2d
0,0
1,0
2,0
3,0
4,0
...,...
91,0
92,0
93,0
94,0


In [34]:
np.unique(y['t2d'])

array([0, 1], dtype=int32)

In [35]:
# Save to csv
y.to_csv('data/metagenomics/clean_t2d/y_wt2d.csv', index=False)

In [36]:
%%bash
cd data/metagenomics/clean_t2d/
cat y_wt2d.csv

t2d
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
1
1
1
1
1
0
0
0
1
0
1
1
1
1
0
1
1
1
0
1
1
1
1
0
1
1
1
1
0
0
1
1
0
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
0
1
1
0
1
1
0
1
1
0
0
0
0
0
0
0
0
0
0


In [37]:
# Check
pd.read_csv('data/metagenomics/clean_t2d/y_wt2d.csv')

Unnamed: 0,t2d
0,0
1,0
2,0
3,0
4,0
...,...
91,0
92,0
93,0
94,0


#### Set of indices for the 5-fold cross-validation

In [38]:
number_of_fold = 5
nsample = y.shape[0]

kf = KFold(n_splits=number_of_fold, shuffle=True, random_state=12)
cv_gen = kf.split(range(nsample))
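# Training indices per fold; fold sizes differ, so an object array is needed (NumPy >= 1.24 rejects ragged input otherwise)
idxs = np.array([train_idx for train_idx, test_idx in cv_gen], dtype=object).T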

In [39]:
idxs

array([array([ 0,  1,  2,  3,  4,  5,  6,  9, 10, 11, 13, 15, 18, 19, 20, 21, 22,
       24, 25, 26, 27, 28, 29, 30, 32, 33, 34, 35, 36, 37, 41, 42, 43, 44,
       45, 46, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 59, 62, 63, 65, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       88, 89, 90, 91, 92, 93, 94, 95]),
       array([ 0,  1,  2,  3,  4,  5,  6,  7,  8, 11, 12, 13, 14, 15, 16, 17, 18,
       22, 23, 24, 25, 27, 30, 31, 32, 33, 34, 35, 37, 38, 39, 40, 42, 43,
       44, 45, 47, 48, 49, 50, 51, 52, 53, 54, 55, 57, 58, 59, 60, 61, 62,
       63, 64, 65, 66, 67, 68, 71, 72, 73, 74, 75, 76, 78, 81, 83, 85, 86,
       87, 88, 89, 90, 91, 92, 93, 94, 95]),
       array([ 0,  2,  3,  5,  6,  7,  8,  9, 10, 12, 13, 14, 15, 16, 17, 18, 19,
       20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
       38, 39, 40, 41, 45, 46, 47, 48, 49, 50, 52, 56, 58, 59, 60, 61, 62,
       63, 64, 66, 67, 69, 70, 71, 72, 74, 75, 76, 77, 78, 79, 80, 8