# Exploring the Brooks dataset
In this notebook, we load files from the Brooks (2017) study. The study tracked gut microbiomes of 30 infants sampled over 75 days. 

The files explored in this notebook were extracted from the pickled file `Brooks_dataset_phylo_new.pickle`. The Google Colab notebook `brooks_extract_from_pickle.ipynb`, located in this directory, extracts and processes the data from the pickled file to produce the files explored here.

In [1]:
import pandas as pd

## MetaPhlAn abundance tables
MetaPhlAn reports organism relative abundances, listed as one clade per line. The first column lists clades, ranging from taxonomic kingdoms (Bacteria, Archaea, etc.) through species. Each clade name carries a prefix that indicates its taxonomic level: `Kingdom: k__, Phylum: p__, Class: c__, Order: o__, Family: f__, Genus: g__, Species: s__`. MetaPhlAn itself reports abundances as percentages (out of 100%), but in this file they are expressed as fractions, so the values in each sample column should sum to 1.0.
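As a minimal sketch of how the level prefixes can be used, the snippet below splits a `|`-delimited clade string into its ranks. The helper name `parse_clade` and the short example string are illustrative assumptions, not part of the original extraction notebook.

```python
# Map each single-letter prefix to the rank it denotes.
RANK_PREFIXES = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}

def parse_clade(clade: str) -> dict:
    """Split a '|'-delimited clade string into a {rank: name} dict."""
    ranks = {}
    for part in clade.split("|"):
        # "p__Proteobacteria" -> prefix "p", name "Proteobacteria"
        prefix, _, name = part.partition("__")
        # Fall back to the raw prefix if it is not one of the known ranks.
        ranks[RANK_PREFIXES.get(prefix, prefix)] = name
    return ranks

# Shortened, illustrative clade string (real entries continue down to species).
example = "k__Bacteria|p__Proteobacteria"
parsed = parse_clade(example)
print(parsed)
```

Applied to the `Taxonomy` column with `abundances["Taxonomy"].apply(parse_clade)`, this would give one dictionary per clade, which can be convenient for grouping by phylum or genus.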

In [9]:
abundances = pd.read_csv("abundances.csv", index_col=0)

In [10]:
abundances

Unnamed: 0,Taxonomy,S0_T0,S0_T1,S0_T2,S0_T3,S0_T4,S0_T5,S0_T6,S0_T7,S0_T8,...,S27_T8,S27_T9,S27_T10,S27_T11,S27_T12,S27_T13,S27_T14,S27_T15,S27_T16,S27_T17
0,k__Bacteria|p__Actinobacteria|c__Actinobacteri...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
1,k__Bacteria|p__Actinobacteria|c__Actinobacteri...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
2,k__Bacteria|p__Actinobacteria|c__Actinobacteri...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
3,k__Bacteria|p__Actinobacteria|c__Actinobacteri...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
4,k__Bacteria|p__Actinobacteria|c__Actinobacteri...,0.000000,0.000000,0.000000,0.001063,0.001083,0.002021,0.002090,0.000000,0.000000,...,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
121,k__Bacteria|p__Proteobacteria|c__Gammaproteoba...,0.194332,0.185561,0.195945,0.157017,0.147991,0.156049,0.144259,0.164822,0.170829,...,0.000633,0.000489,0.0,0.002896,0.000484,0.0,0.0,0.000000,0.0,0.0
122,k__Bacteria|p__Proteobacteria|c__Gammaproteoba...,0.717414,0.716604,0.719074,0.660998,0.647380,0.650086,0.602931,0.662269,0.699816,...,0.003169,0.000904,0.0,0.011152,0.003008,0.0,0.0,0.000136,0.0,0.0
123,k__Bacteria|p__Proteobacteria|c__Gammaproteoba...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000316,0.000000,0.0,0.000079,0.000000,0.0,0.0,0.000000,0.0,0.0
124,k__Bacteria|p__Proteobacteria|c__Gammaproteoba...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,0.0


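We can check that the relative abundances in each sample column do indeed sum to 1.0. Because `Taxonomy` is a string column, it also appears in the result (its strings are concatenated), which is why the output has dtype `object`.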

In [11]:
abundances.sum(axis=0)

Taxonomy    k__Bacteria|p__Actinobacteria|c__Actinobacteri...
S0_T0                                                     1.0
S0_T1                                                     1.0
S0_T2                                                     1.0
S0_T3                                                     1.0
                                  ...                        
S27_T13                                                   1.0
S27_T14                                                   1.0
S27_T15                                                   1.0
S27_T16                                                   1.0
S27_T17                                                   1.0
Length: 403, dtype: object
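A slightly stricter version of this check sums only the numeric sample columns and compares them to 1.0 with a tolerance. The sketch below uses a small stand-in table with made-up values rather than the real `abundances.csv`, so the frame `toy` and its numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the abundances table; the values are made up for illustration.
toy = pd.DataFrame({
    "Taxonomy": ["k__Bacteria|p__Actinobacteria", "k__Bacteria|p__Proteobacteria"],
    "S0_T0": [0.25, 0.75],
    "S0_T1": [0.4, 0.6],
})

# Sum only the sample columns, leaving the Taxonomy strings out.
column_sums = toy.drop(columns="Taxonomy").sum(axis=0)

# Compare with a tolerance, since floating-point sums are rarely exactly 1.0.
all_sum_to_one = bool(np.isclose(column_sums, 1.0).all())
print(all_sum_to_one)
```

On the real table, the same two lines would be applied to `abundances` in place of `toy`.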

## Sample metadata
We next see how each sample maps to a subject and time point. The CSV file "sample_metadata.csv" specifies an associated subject ID and timepoint for each sample ID.

In [13]:
sample_metadata = pd.read_csv("sample_metadata.csv")

In [14]:
sample_metadata

Unnamed: 0,sampleID,subjectID,time
0,S0_T0,S0,0
1,S0_T1,S0,2
2,S0_T2,S0,4
3,S0_T3,S0,6
4,S0_T4,S0,8
...,...,...,...
397,S27_T13,S27,13
398,S27_T14,S27,14
399,S27_T15,S27,15
400,S27_T16,S27,16
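One quick sanity check on this table is that the subject encoded in each `sampleID` agrees with the `subjectID` column. The sketch below assumes that sample IDs follow the pattern `<subjectID>_T<index>` and uses only the first five rows shown above; the names `parts` and `ids_consistent` are illustrative.

```python
import pandas as pd

# First rows of sample_metadata as displayed above.
sample_metadata_head = pd.DataFrame({
    "sampleID": ["S0_T0", "S0_T1", "S0_T2", "S0_T3", "S0_T4"],
    "subjectID": ["S0"] * 5,
    "time": [0, 2, 4, 6, 8],
})

# Assumption: sample IDs follow the pattern "<subjectID>_T<index>".
parts = sample_metadata_head["sampleID"].str.split("_T", expand=True)
parts.columns = ["subject_from_id", "timepoint_index"]

# Every sample ID should start with its own subject ID.
ids_consistent = bool(
    (parts["subject_from_id"] == sample_metadata_head["subjectID"]).all()
)

# Number of samples recorded per subject.
samples_per_subject = sample_metadata_head.groupby("subjectID")["sampleID"].count()
print(ids_consistent)
print(samples_per_subject)
```

Note that the `T` index in a sample ID is not the same as the `time` column: in the rows above, `S0_T1` has `time` 2 and `S0_T4` has `time` 8.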


## Subject metadata
The CSV file "subject_metadata.csv" gives us a table showing how each `subjectID` maps to a label (`host_label`).

In [15]:
subject_metadata = pd.read_csv("subject_metadata.csv")

In [16]:
subject_metadata

Unnamed: 0,subjectID,host_label
0,S0,1.0
1,S1,0.0
2,S2,1.0
3,S3,1.0
4,S4,1.0
5,S5,0.0
6,S6,0.0
7,S7,0.0
8,S8,0.0
9,S9,1.0
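Since both metadata tables share the `subjectID` column, a label can be attached to every sample with a merge. The following is a minimal sketch on a few rows copied from the outputs above; the frame names `samples`, `subjects`, and `labeled` are illustrative, and on the full data `sample_metadata` and `subject_metadata` would take their place.

```python
import pandas as pd

# A few rows taken from the sample_metadata output above.
samples = pd.DataFrame({
    "sampleID": ["S0_T0", "S0_T1", "S27_T13"],
    "subjectID": ["S0", "S0", "S27"],
    "time": [0, 2, 13],
})

# The first two rows of the subject_metadata output above.
subjects = pd.DataFrame({"subjectID": ["S0", "S1"], "host_label": [1.0, 0.0]})

# Left merge keeps every sample; validate raises if a subject appears twice
# in the subject table.
labeled = samples.merge(subjects, on="subjectID", how="left", validate="many_to_one")

# Samples whose subject is missing from the subject table get NaN as label.
n_unlabeled = int(labeled["host_label"].isna().sum())
print(labeled)
print(n_unlabeled)
```

Here `S27` is unlabeled only because the excerpt of the subject table stops at the first rows; counting `NaN` labels after a left merge is a cheap way to spot subjects that are genuinely missing.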


In the processed pickled dataframe, birth mode is encoded numerically: vaginal birth as 0 and C-section as 1. Let's map these codes back to readable labels.

In [20]:
mapping = {0: 'vaginal', 1: 'c_section'}

In [21]:
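# Restrict the replacement to host_label so 0/1 values in other columns stay untouched
subject_metadata2 = subject_metadata.replace({"host_label": mapping})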

In [22]:
subject_metadata2

Unnamed: 0,subjectID,host_label
0,S0,c_section
1,S1,vaginal
2,S2,c_section
3,S3,c_section
4,S4,c_section
5,S5,vaginal
6,S6,vaginal
7,S7,vaginal
8,S8,vaginal
9,S9,c_section
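An alternative to `replace` is `Series.map`, which touches only the chosen column and turns any code missing from the mapping into `NaN` instead of leaving it unchanged. The sketch below rebuilds the ten rows shown above and counts subjects per birth mode; the names `subject_head`, `birth_mode`, and `counts` are illustrative.

```python
import pandas as pd

# The ten subject rows displayed above.
subject_head = pd.DataFrame({
    "subjectID": [f"S{i}" for i in range(10)],
    "host_label": [1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
})

# host_label is stored as floats, so float keys are used here.
mapping = {0.0: "vaginal", 1.0: "c_section"}

# map() returns NaN for any value not present in the mapping.
birth_mode = subject_head["host_label"].map(mapping)

# Number of subjects per birth mode among the displayed rows.
counts = birth_mode.value_counts()
print(counts)
```

Which of the two to use is a matter of preference: `replace` silently keeps unexpected codes, while `map` makes them visible as missing values.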
