## Exploring the Kostic dataset
In this notebook we load the data files from the Kostic et al. (2015) dataset and show what each file contains. The study tracked the microbiomes of 17 infants sampled over the first three years of life, with each infant classified as either normal or as having developed type 1 diabetes.

In [1]:
import pandas as pd
import os

In [2]:
dataset_path = "./datasets/kostic/"

## MetaPhlAn abundance tables
MetaPhlAn reports the relative abundance of each organism as a percentage, with one clade per line. The `Taxonomy` column lists clades, ranging from taxonomic kingdoms (Bacteria, Archaea, etc.) down to species. Each name in a clade string carries a prefix that indicates its taxonomic level: `Kingdom: k__, Phylum: p__, Class: c__, Order: o__, Family: f__, Genus: g__, Species: s__`. Within each sample, the relative abundances of all clades at a given taxonomic level should therefore sum to approximately 100.0.
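
To make the prefix convention concrete, here is a minimal sketch that reads the taxonomic level off a clade string. The names `RANK_PREFIXES`, `clade_rank`, and `clade_name` are hypothetical helpers introduced for illustration and are not part of MetaPhlAn or pandas.

```python
# Map the rank prefixes listed above to readable rank names.
RANK_PREFIXES = {
    "k__": "kingdom",
    "p__": "phylum",
    "c__": "class",
    "o__": "order",
    "f__": "family",
    "g__": "genus",
    "s__": "species",
}


def clade_rank(clade: str) -> str:
    """Return the rank of the deepest level in a '|'-separated clade string."""
    deepest = clade.split("|")[-1]  # e.g. "p__Actinobacteria"
    return RANK_PREFIXES.get(deepest[:3], "unknown")


def clade_name(clade: str) -> str:
    """Return the deepest name in the clade string without its rank prefix."""
    return clade.split("|")[-1][3:]


print(clade_rank("k__Bacteria"))                     # kingdom
print(clade_rank("k__Bacteria|p__Actinobacteria"))   # phylum
print(clade_name("k__Bacteria|p__Actinobacteria"))   # Actinobacteria
```

Only the last `|`-separated element is inspected, because every clade string also contains the prefixes of all its parent levels.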

In [3]:
abundances = pd.read_csv(os.path.join(dataset_path, "diabimmune_t1d_metaphlan_table.txt"), sep="\t")

In [4]:
abundances

,Taxonomy,G35421,G35451,G35893,G35464,G35465,G35474,G35488,G35906,G35951,...,G36267,G36268,G36302,G36548,G36556,G36858,G36863,G36866,G36868,G36870
0,k__Bacteria,100.00000,100.00000,100.00000,100.00000,100.00000,100.00000,100.00000,100.00000,100.00000,...,100.00000,100.00000,100.00000,100.00000,100.00000,100.00000,100.00000,100.00000,100.00000,100.00000
1,k__Bacteria|p__Actinobacteria,36.48375,3.45029,0.97899,46.08772,17.65351,72.96490,59.17006,2.76232,0.49062,...,46.72820,1.95127,0.05178,21.30908,2.03329,4.86346,20.12716,4.11605,27.65587,12.08499
2,k__Bacteria|p__Actinobacteria|c__Actinobacteria,36.48375,3.45029,0.97899,46.08772,17.65351,72.96490,59.17006,2.76232,0.49062,...,46.72820,1.95127,0.05178,21.30908,2.03329,4.86346,20.12716,4.11605,27.65587,12.08499
3,k__Bacteria|p__Actinobacteria|c__Actinobacteri...,0.00941,0.00000,0.00000,0.00000,0.00000,0.00000,0.00895,0.07594,0.00000,...,0.02479,0.00000,0.00000,0.00136,0.00190,0.00000,0.00000,0.00000,0.03765,0.00000
4,k__Bacteria|p__Actinobacteria|c__Actinobacteri...,0.00941,0.00000,0.00000,0.00000,0.00000,0.00000,0.00895,0.00000,0.00000,...,0.02479,0.00000,0.00000,0.00136,0.00190,0.00000,0.00000,0.00000,0.03765,0.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
375,k__Bacteria|p__Verrucomicrobia|c__Verrucomicro...,4.62994,0.00000,0.08629,0.00000,0.02831,0.01743,0.00000,4.35279,0.00000,...,3.32411,0.00000,0.00000,0.00000,3.04692,2.98608,0.01205,0.00000,0.14880,0.18646
376,k__Bacteria|p__Verrucomicrobia|c__Verrucomicro...,4.62994,0.00000,0.08629,0.00000,0.02831,0.01743,0.00000,4.35279,0.00000,...,3.32411,0.00000,0.00000,0.00000,3.04692,2.98608,0.01205,0.00000,0.14880,0.18646
377,k__Bacteria|p__Verrucomicrobia|c__Verrucomicro...,4.62994,0.00000,0.08629,0.00000,0.02831,0.01743,0.00000,4.35279,0.00000,...,3.32411,0.00000,0.00000,0.00000,3.04692,2.98608,0.01205,0.00000,0.14880,0.18646
378,k__Bacteria|p__Verrucomicrobia|c__Verrucomicro...,4.62994,0.00000,0.08629,0.00000,0.02831,0.01743,0.00000,4.35279,0.00000,...,3.32411,0.00000,0.00000,0.00000,3.04692,2.98608,0.01205,0.00000,0.14880,0.18646


### Filtering at the species level

To keep only the species-level rows, we can search for the pattern `s__` in the `Taxonomy` column.

In [5]:
abundances.Taxonomy

0                                            k__Bacteria
1                          k__Bacteria|p__Actinobacteria
2        k__Bacteria|p__Actinobacteria|c__Actinobacteria
3      k__Bacteria|p__Actinobacteria|c__Actinobacteri...
4      k__Bacteria|p__Actinobacteria|c__Actinobacteri...
                             ...                        
375    k__Bacteria|p__Verrucomicrobia|c__Verrucomicro...
376    k__Bacteria|p__Verrucomicrobia|c__Verrucomicro...
377    k__Bacteria|p__Verrucomicrobia|c__Verrucomicro...
378    k__Bacteria|p__Verrucomicrobia|c__Verrucomicro...
379    k__Bacteria|p__Verrucomicrobia|c__Verrucomicro...
Name: Taxonomy, Length: 380, dtype: object

In [6]:
species_tag = "s__"

In [7]:
# Collect the row positions and names of all clades that contain the species tag
species = []
species_inds = []
for i, clade in enumerate(abundances.Taxonomy):
    if species_tag in clade:
        species_inds.append(i)
        species.append(clade)

In [8]:
species_abundances = abundances.iloc[species_inds,:]

In [9]:
species_abundances

,Taxonomy,G35421,G35451,G35893,G35464,G35465,G35474,G35488,G35906,G35951,...,G36267,G36268,G36302,G36548,G36556,G36858,G36863,G36866,G36868,G36870
6,k__Bacteria|p__Actinobacteria|c__Actinobacteri...,0.00941,0.00000,0.00000,0.00000,0.00000,0.00000,0.00895,0.00000,0.00000,...,0.02479,0.0,0.0,0.00136,0.00190,0.00000,0.00000,0.00000,0.03765,0.00000
7,k__Bacteria|p__Actinobacteria|c__Actinobacteri...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,...,0.00000,0.0,0.0,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000
8,k__Bacteria|p__Actinobacteria|c__Actinobacteri...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,...,0.00000,0.0,0.0,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000
10,k__Bacteria|p__Actinobacteria|c__Actinobacteri...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,...,0.00000,0.0,0.0,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000
13,k__Bacteria|p__Actinobacteria|c__Actinobacteri...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,...,0.00000,0.0,0.0,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
366,k__Bacteria|p__Proteobacteria|c__Gammaproteoba...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,...,0.00000,0.0,0.0,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000
368,k__Bacteria|p__Proteobacteria|c__Gammaproteoba...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,...,0.00000,0.0,0.0,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000
369,k__Bacteria|p__Proteobacteria|c__Gammaproteoba...,0.14277,0.05966,0.37976,0.26512,0.09975,0.12800,0.00000,0.08678,1.07824,...,0.00268,0.0,0.0,0.27817,0.13322,0.01853,0.07897,0.49642,0.12100,0.27605
373,k__Bacteria|p__Proteobacteria|c__Gammaproteoba...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,...,0.00000,0.0,0.0,0.03358,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000
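
The same selection can also be written as a boolean mask, which avoids the explicit loop. The sketch below uses a small toy table with made-up clade names and values, not the real data. It also shows a stricter variant that checks only the deepest level of each clade string. That matters only if a table contains rows nested below species (for example strain-level rows, assumed here to carry a `t__` prefix), since those rows also contain `s__`.

```python
import pandas as pd

# Toy table: clade names and abundances are invented for illustration.
prefix = "k__Bacteria|p__PhylumA|c__ClassA|o__OrderA|f__FamilyA|g__GenusA"
toy = pd.DataFrame({
    "Taxonomy": [
        "k__Bacteria",
        "k__Bacteria|p__PhylumA",
        prefix + "|s__GenusA_one",
        prefix + "|s__GenusA_two",
        prefix + "|s__GenusA_one|t__StrainX",  # hypothetical strain-level row
    ],
    "sample_1": [100.0, 100.0, 70.0, 30.0, 70.0],
})

# Equivalent to the loop above: keep every row whose clade string contains "s__".
contains_species = toy["Taxonomy"].str.contains("s__", regex=False)
species_toy = toy[contains_species]

# Stricter variant: keep rows whose deepest level is a species.
deepest = toy["Taxonomy"].str.split("|").str[-1]
is_species = deepest.str.startswith("s__")
strict_species_toy = toy[is_species]

print(len(species_toy), len(strict_species_toy))  # 3 2
```

For the table loaded in this notebook, the species sums computed from the loop come out close to 100, which suggests the simple substring match is sufficient here.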


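We can sum over all species in each sample to confirm that the abundances do indeed sum to approximately 100, up to rounding error. The `Taxonomy` column holds strings, so its entry in the output is just a concatenation of clade names and can be ignored.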

In [10]:
species_abundances.sum(axis=0)

Taxonomy    k__Bacteria|p__Actinobacteria|c__Actinobacteri...
G35421                                              100.00002
G35451                                               99.99999
G35893                                               99.99999
G35464                                               99.99999
                                  ...                        
G36858                                               99.99294
G36863                                               99.99542
G36866                                              100.00003
G36868                                               99.99998
G36870                                              100.00001
Length: 125, dtype: object
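
To turn this visual check into an explicit one, a sketch like the following drops the string column before summing and compares each total with 100 under a tolerance. The toy values and the tolerance of 0.01 are assumptions chosen for illustration. On the real data you would replace `toy_species` with `species_abundances`.

```python
import numpy as np
import pandas as pd

# Toy species-level table with made-up clade names and values.
toy_species = pd.DataFrame({
    "Taxonomy": ["k__Bacteria|s__Species_one", "k__Bacteria|s__Species_two"],
    "sample_1": [60.00001, 39.99998],
    "sample_2": [25.0, 74.99294],
})

# Sum only the numeric sample columns, one total per sample.
totals = toy_species.drop(columns="Taxonomy").sum(axis=0)

# Flag samples whose total is within the assumed tolerance of 100.
close_to_100 = np.isclose(totals, 100.0, atol=0.01)

print(totals)
print(close_to_100.all())
```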

### Sample metadata
The sample metadata is a CSV file that gives the associated subject ID and timepoint for each sample ID. The file has no header row, so we assign the column names after loading it.

In [11]:
sample_metadata = pd.read_csv(os.path.join(dataset_path, "t1d_sample_metadata.csv"), header=None)

In [12]:
sample_metadata.columns = ["sample_ID", "subject_ID", "time"]

In [13]:
sample_metadata

,sample_ID,subject_ID,time
0,G36451,E001463,303
1,G36025,E001463,457
2,G36836,E001463,638
3,G36847,E001463,853
4,G36829,E001463,943
...,...,...,...
123,G35421,T025418,477
124,G36446,T025418,527
125,G36032,T025418,568
126,G36448,T025418,629
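
Because the sample IDs in this file match the column names of the abundance table, the two can be joined. The sketch below shows one possible way to do that, by transposing the abundance table so that each sample becomes a row. It uses a toy table: the `G35421` values and the metadata rows are copied from the outputs above, and the `G36446` abundances are placeholders.

```python
import pandas as pd

# Toy abundance table (clades as rows, samples as columns).
abund = pd.DataFrame({
    "Taxonomy": ["k__Bacteria", "k__Bacteria|p__Actinobacteria"],
    "G35421": [100.0, 36.48375],
    "G36446": [100.0, 10.0],  # placeholder values, not from the dataset
})

# Toy sample metadata with the column names assigned in this notebook.
meta = pd.DataFrame({
    "sample_ID": ["G35421", "G36446"],
    "subject_ID": ["T025418", "T025418"],
    "time": [477, 527],
})

# Transpose so that samples are rows and clades are columns.
by_sample = abund.set_index("Taxonomy").T
by_sample.index.name = "sample_ID"
by_sample = by_sample.reset_index()

# Attach subject ID and timepoint to each sample's abundance profile.
merged = meta.merge(by_sample, on="sample_ID", how="inner")
merged = merged.sort_values(["subject_ID", "time"])

print(merged)
```

An inner join keeps only samples present in both tables. Using `how="outer"` with `indicator=True` instead would reveal samples that appear in only one of them.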


### Subject metadata
The subject metadata is a CSV file that gives information about each subject, including the value of the variable that will be used as the host outcome for prediction (e.g., normal or type 1 diabetes for the Kostic dataset), stored here in the `Case_Control` column.

In [14]:
subject_metadata = pd.read_csv(os.path.join(dataset_path, "t1d_wgs_subject_data.csv"))

In [15]:
subject_metadata

,Subject_ID,Case_Control
0,E001463,control
1,E003251,case
2,E003989,case
3,E006547,control
4,E006574,case
5,E006673,control
6,E010590,control
7,E010629,case
8,E010937,case
9,E016924,control
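
To attach the outcome label to every sample, the sample and subject metadata can be merged on the subject ID. Note that the key is named `subject_ID` in one table and `Subject_ID` in the other. The following is a sketch on toy frames built from a few rows shown above. The toy subject table deliberately contains only two subjects, so one sample ends up without a label, which illustrates how missing labels can be detected.

```python
import pandas as pd

# A few rows copied from the sample metadata output above.
samples = pd.DataFrame({
    "sample_ID": ["G36451", "G36025", "G35421"],
    "subject_ID": ["E001463", "E001463", "T025418"],
    "time": [303, 457, 477],
})

# Only the first two subjects, so T025418 has no label in this toy table.
subjects = pd.DataFrame({
    "Subject_ID": ["E001463", "E003251"],
    "Case_Control": ["control", "case"],
})

# Left join keeps every sample; the indicator column records whether a label was found.
labelled = samples.merge(
    subjects,
    left_on="subject_ID",
    right_on="Subject_ID",
    how="left",
    indicator=True,
).drop(columns="Subject_ID")

# Samples whose subject is absent from the subject table.
unlabelled = labelled[labelled["_merge"] == "left_only"]

print(labelled)
print(unlabelled["sample_ID"].tolist())  # ['G35421']
```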
