# Exploring the Bokulich dataset
In this notebook we load the relevant files from the Bokulich (2016) study and show what each file contains. The study tracks the gut microbiomes of infants sampled over the first two years of life. 

In [1]:
import pandas as pd

## Abundance data
The CSV file "abundance.csv" contains microbial abundance data, obtained via 16S amplicon sequencing, for each sample ID.

In [2]:
abundances = pd.read_csv("abundance.csv")

Let's see how the data is formatted:

In [3]:
abundances.head()

Unnamed: 0.1,Unnamed: 0,AAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCCGCGCCGGGTACGGGCGGGCTTGAGTGCGGTAGGGGAGACTGGAATTCCCG,AAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAG,GAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATATCTTGAGTGCAGTTGAGGCAGGCGGAATTCGTG,AAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGCATGGCAAGCCAGATGTGAAAGCCCGGGGCTCAACCCCGGGACTGCATTTGGAACTGTCAGGCTAGAGTGTCGGAGAGGAAAGCGGAATTCCTA,AAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGAGAACAAGTTGGAAGTGAAATCCATGGGCTCAACCCATGAACTGCTTTCAAAACTGTTTTTCTTGAGTAGTGCAGAGGTAGGCGGAATTCCCG,GAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGACTGGTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGTCAGTCTTGAGTACAGTAGAGGTGGGCGGAATTCGTG,GAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATGTCTTGAGTGCAGTTGAGGCAGGCGGAATTCGTG,AAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAG,GAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGATTGTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGAAACTGGCAGTCTTGAGTACAGTAGAGGTGGGCGGAATTCGTG,...,TAGCGTTGTCCGGAATCACTGGGCGTAAAGGGTTCGCAGGCGGAATAACAAGTCAGATGTGAAAGGCATGGGCTCAACCCATGTAAGCATTTGAAACTGTAATTCTTGAGAAGTGGAGAGGTAAGTGGAATTACTAG,AAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGGAGAACAAGTTGGAAGTGAAATCCATGGGCTCAACCCATGAACGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCCTGCTAAG,AAGCGTTATCCGGAATTACTGGGTGTAAAGGGAGAGTAGGCGGCATGGTAAGTTAGATGTGAAAGCCTCGGGCTTAACTTGAGGATTGCATTTAAAACTATCAAGCTAGAGTACAGGAGAGGAAAGCGGAATTCCTA,AAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGCAGCGCAAGTCTGGAGTGAAATCCCATGGCTTAACCATGGAACTGCTTTGGAAACTGTGCAGCTGGAGTGCAGGAGGGGTAAGCGGAATTCCTA,AAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTACAGTAGAGGTGGGCGGAATTCGTG,GAGCGTTGTCCGGAATCATTGGGCGTAAAGGGTTCGTAGGCGGATAAGCAAGTTAGAAGTTAAATCCTATAGCTCAACTATAGCAAGCTTTTAAAACTGCTCATCTTGAGGTATGGAAGGGAAAGTGGAATTCCTAG,AGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGCCAGGTAAGCGTGTTGTGAAATGTACCGGCTCAACCGGTGAATTGCAGCGCGAACTGTCTGGCTTGAGTGCACGGTAAGCAGGCGGAATTCATG,AAGCGTTATCCGGAATTATTGGGTGTAAAGGGTGCGTAGACGGAAGAACAAGTTGGTTGTGAAATCCCTCGGCTCAACTGAGGAACTGCAACCAAAACTATTCTCCTTGAGTGTCGGAGAGGAAAGTGGAATTCCTA,GAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGTGGTTTATTAAGTCTGGTGTAAAAGGCAGTGGCCCAACCATTGTATGCATTGGAAACTGGTAGACTTGAGTGCAGGAGAGGAGAGTGGAATTCCATG,GAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGATCAGTCAGTCTGTCTTAAAAGTTCGGGGCTTAACCCCGTGATGGGATGGAAACTGTTTTTCTAGAGTGCCGGAGAGGTAAGCGGAATTCCTAG
0,10249.C001.02SS,768,46,0,51,0,0,1821,42,46,...,0,0,0,0,0,0,0,0,0,0
1,10249.C001.03SS,17,7545,0,10,45,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,10249.C001.05SS,13,18296,18,6,8,0,0,20,0,...,0,0,0,0,0,0,0,0,0,0
3,10249.C001.06SS,13,5850,5,7,5,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,10249.C001.07SS,1768,4367,7,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The sequences appear as column labels. To make the table easier to read, we transpose it so that the sequences become rows and the sample IDs become column labels.

In [4]:
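transposed = abundances.transpose()  # sequences become rows, samples become columns
abundances2 = transposed.rename(columns=transposed.iloc[0]).iloc[1:]  # promote the sample-ID row to column labels, then drop it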

In [5]:
abundances2

Unnamed: 0,10249.C001.02SS,10249.C001.03SS,10249.C001.05SS,10249.C001.06SS,10249.C001.07SS,10249.C001.08SS,10249.C001.09SS,10249.C001.10SS,10249.C001.11SS,10249.C001.12SS,...,10249.C038.06SS,10249.C041.03SS,10249.C041.05SS,10249.C042.03SS,10249.C042.05SS,10249.C043.02SS,10249.C043.04SS,10249.C044.02SS,10249.C044.03SS,10249.C044.05SS
AAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCCGCGCCGGGTACGGGCGGGCTTGAGTGCGGTAGGGGAGACTGGAATTCCCG,768,17,13,13,1768,7,9,684,5427,2790,...,1755,2195,17023,1114,822,6834,1061,5211,2103,3605
AAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAG,46,7545,18296,5850,4367,3625,2077,1317,1767,583,...,349,0,0,49,574,136,317,33,0,154
GAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATATCTTGAGTGCAGTTGAGGCAGGCGGAATTCGTG,0,0,18,5,7,2028,3961,7,0,0,...,2890,212,971,0,0,0,0,0,0,136
AAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGCATGGCAAGCCAGATGTGAAAGCCCGGGGCTCAACCCCGGGACTGCATTTGGAACTGTCAGGCTAGAGTGTCGGAGAGGAAAGCGGAATTCCTA,51,10,6,7,0,0,0,6,0,1604,...,1511,1053,7717,232,1399,1413,2136,0,0,0
AAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGAGAACAAGTTGGAAGTGAAATCCATGGGCTCAACCCATGAACTGCTTTCAAAACTGTTTTTCTTGAGTAGTGCAGAGGTAGGCGGAATTCCCG,0,45,8,5,0,0,0,0,0,0,...,0,0,3,0,0,0,0,2,0,18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GAGCGTTGTCCGGAATCATTGGGCGTAAAGGGTTCGTAGGCGGATAAGCAAGTTAGAAGTTAAATCCTATAGCTCAACTATAGCAAGCTTTTAAAACTGCTCATCTTGAGGTATGGAAGGGAAAGTGGAATTCCTAG,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGCCAGGTAAGCGTGTTGTGAAATGTACCGGCTCAACCGGTGAATTGCAGCGCGAACTGTCTGGCTTGAGTGCACGGTAAGCAGGCGGAATTCATG,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AAGCGTTATCCGGAATTATTGGGTGTAAAGGGTGCGTAGACGGAAGAACAAGTTGGTTGTGAAATCCCTCGGCTCAACTGAGGAACTGCAACCAAAACTATTCTCCTTGAGTGTCGGAGAGGAAAGTGGAATTCCTA,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
GAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGTGGTTTATTAAGTCTGGTGTAAAAGGCAGTGGCCCAACCATTGTATGCATTGGAAACTGGTAGACTTGAGTGCAGGAGAGGAGAGTGGAATTCCATG,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0


This dataframe now has OTUs (represented by their sequences) as rows and samples as columns.
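
The reshaping can also be sketched on a toy table. This is only an illustration, not the notebook's own code: it assumes the sample IDs sit in a column named "Unnamed: 0" (as the `head()` output suggests), and the toy sequences and sample names are made up. Setting the sample IDs as the index before transposing keeps the counts as integers, whereas transposing a table that mixes strings and numbers yields object-typed columns.

```python
import pandas as pd

# Toy stand-in for abundance.csv: one row per sample, one column per sequence.
# "Unnamed: 0" is assumed to be the name of the sample-ID column.
toy = pd.DataFrame({
    "Unnamed: 0": ["S1", "S2", "S3"],
    "ACGT": [768, 17, 13],
    "TTGA": [46, 7545, 18296],
})

# Use the sample IDs as the index, then transpose so sequences become rows.
by_sequence = toy.set_index("Unnamed: 0").T
by_sequence.index.name = "sequence"   # label the row index
by_sequence.columns.name = None       # drop the leftover "Unnamed: 0" label

print(by_sequence)
```

With this layout, `by_sequence["S2"]` returns the counts of every sequence in sample S2, and `by_sequence.loc["ACGT"]` returns one sequence's counts across all samples.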

## Taxonomy
To see how the sequences map to taxonomy, we open the "dada2_placements.csv" file.

In [7]:
dadaplace = pd.read_csv("dada2_placements.csv")

In [8]:
dadaplace

Unnamed: 0.1,Unnamed: 0,Kingdom,Phylum,Class,Order,Family,Genus,Species
0,AAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTC...,Bacteria,Actinobacteria,Actinobacteria,Bifidobacteriales,Bifidobacteriaceae,Bifidobacterium,adolescentis/breve/kashiwanohense/longum/pseud...
1,AAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTT...,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacteriales,Enterobacteriaceae,Escherichia/Shigella,
2,GAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATG...,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Bacteroidaceae,Bacteroides,dorei/vulgatus
3,AAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGCAT...,Bacteria,Firmicutes,Clostridia,Clostridiales,Lachnospiraceae,,
4,AAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGAG...,Bacteria,Firmicutes,Clostridia,Clostridiales,Ruminococcaceae,Faecalibacterium,prausnitzii
...,...,...,...,...,...,...,...,...
4775,GAGCGTTGTCCGGAATCATTGGGCGTAAAGGGTTCGTAGGCGGATA...,Bacteria,Firmicutes,Clostridia,Clostridiales,Clostridiales_Incertae_Sedis_XI,Anaerococcus,
4776,AGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGCCA...,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Prevotellaceae,Prevotella,
4777,AAGCGTTATCCGGAATTATTGGGTGTAAAGGGTGCGTAGACGGAAG...,Bacteria,Firmicutes,Clostridia,Clostridiales,,,
4778,GAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGTGGTTT...,Bacteria,Firmicutes,Bacilli,Lactobacillales,Streptococcaceae,Lactococcus,


This dataframe gives us the taxonomy for each sequence in the abundance dataframe.
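
As a rough sketch of how the two tables could be combined, the example below attaches taxonomy to a sequence-by-sample count table and sums the counts per genus. The data are toy values, and `sequence` is a hypothetical name for the column holding the sequences (in the output above that column is unnamed).

```python
import pandas as pd

# Toy abundance table: sequences as rows, samples as columns.
counts = pd.DataFrame(
    {"S1": [768, 46, 5], "S2": [17, 7545, 0]},
    index=["ACGT", "TTGA", "GGCC"],
)

# Toy taxonomy table; "sequence" is a hypothetical column name.
taxonomy = pd.DataFrame({
    "sequence": ["ACGT", "TTGA", "GGCC"],
    "Family": ["Bifidobacteriaceae", "Enterobacteriaceae", "Bifidobacteriaceae"],
    "Genus": ["Bifidobacterium", "Escherichia/Shigella", "Bifidobacterium"],
})

# Attach the taxonomy columns to each row by matching on the sequence.
annotated = counts.join(taxonomy.set_index("sequence"))

# Sum the counts of all sequences assigned to the same genus.
genus_counts = annotated.groupby("Genus")[["S1", "S2"]].sum()
print(genus_counts)
```

Note that `groupby` drops rows whose key is missing by default, so sequences without a genus assignment (several appear in the table above) would be left out of such a summary unless `dropna=False` is passed.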

## Sample metadata
We next see how each sample maps to a subject and time point. The CSV file "sample_metadata.csv" gives the subject ID and time point for each sample ID.

In [10]:
samplemeta = pd.read_csv("sample_metadata.csv", header=None)

In [11]:
samplemeta.columns = ["sample_ID", "host_subject_id", "time"]

In [12]:
samplemeta

Unnamed: 0,sample_ID,host_subject_id,time
0,10249.C001.01SS,1,0
1,10249.C001.02SS,1,36
2,10249.C001.03SS,1,42
3,10249.C001.04SS,1,49
4,10249.C001.05SS,1,57
...,...,...,...
892,10249.C057.06SS,57,157
893,10249.C057.08SS,57,215
894,10249.C057.09SS,57,238
895,10249.C057.10SS,57,244
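
One way this mapping might be used is to pull out a single subject's samples in time order and select the matching columns of the abundance table. The sketch below uses toy tables with made-up sample IDs and counts. It also guards against samples that are listed in the metadata but have no abundance column, which may occur in the real data (the metadata above lists 10249.C001.01SS, while the displayed abundance table starts at 10249.C001.02SS).

```python
import pandas as pd

# Toy sample metadata with the same three columns as above.
sample_meta = pd.DataFrame({
    "sample_ID": ["S1", "S2", "S3", "S4"],
    "host_subject_id": [1, 1, 1, 2],
    "time": [36, 0, 42, 5],
})

# Toy abundance table (sequences as rows); note that S2 has no column here.
counts = pd.DataFrame(
    {"S1": [768, 46], "S3": [17, 7545], "S4": [5, 0]},
    index=["ACGT", "TTGA"],
)

# Samples belonging to subject 1, ordered by time.
subject_samples = sample_meta[sample_meta["host_subject_id"] == 1].sort_values("time")

# Keep only the samples that actually have a column in the abundance table.
present = [s for s in subject_samples["sample_ID"] if s in counts.columns]

# Time-ordered abundance profile for subject 1.
subject_counts = counts[present]
print(subject_counts)
```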


## Subject metadata
The file "subject_data.csv" gives us a table showing how each host_subject_id maps to a label for mode of delivery, antibiotic use, and diet.

In [13]:
subjdata = pd.read_csv("subject_data.csv")

In [15]:
subjdata.head(10)

Unnamed: 0,host_subject_id,delivery,sex,mom_prenatal_abx,mom_prenatal_abx_class,mom_prenatal_abx_trimester,diet,diet_3
0,1,Vaginal,Female,False,na,na,bd,eb
1,2,Cesarean,Male,True,nitrofuran,1,bd,eb
2,4,Cesarean,Male,True,beta-lactam,3,bd,eb
3,5,Cesarean,Female,False,na,na,fd,fd
4,7,Cesarean,Male,False,na,na,bd,eb
5,8,Vaginal,Male,False,na,na,bd,eb
6,9,Vaginal,Male,False,na,na,bd,eb
7,10,Vaginal,Male,False,na,na,bd,eb
8,11,Cesarean,Female,False,na,na,fd,fd
9,12,Cesarean,Female,False,na,na,bd,eb


The first column gives the host subject ID, which matches the host_subject_id in the sample metadata file. The 'delivery' column gives the mode of birth (Cesarean vs. Vaginal). The 'sex' column gives the sex of the infant. The next three columns specify whether the mother was given prenatal antibiotics, the class of antibiotics, and the trimester in which they were administered. The final two columns give the diet. The 'diet' column specifies whether the infant was predominantly (>50% of feedings) breast fed (labeled bd) or formula fed (labeled fd). The 'diet_3' column further distinguishes between predominantly breast fed (bd) and exclusively breast fed (eb).
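
Because both metadata tables share host_subject_id, they can be merged so that every sample carries its subject's labels. The following is a minimal sketch on toy data, not part of the original notebook. The `diet_label` column and the `diet_labels` dictionary are illustrative names, and the label text simply restates the description above.

```python
import pandas as pd

# Toy sample metadata: several samples per subject.
sample_meta = pd.DataFrame({
    "sample_ID": ["S1", "S2", "S3"],
    "host_subject_id": [1, 1, 2],
    "time": [0, 36, 5],
})

# Toy subject table: one row per subject.
subject_data = pd.DataFrame({
    "host_subject_id": [1, 2],
    "delivery": ["Vaginal", "Cesarean"],
    "diet": ["bd", "fd"],
    "diet_3": ["eb", "fd"],
})

# Many samples map to one subject; validate= raises an error if a subject
# appears more than once in subject_data.
merged = sample_meta.merge(
    subject_data, on="host_subject_id", how="left", validate="many_to_one"
)

# Spell out the diet codes described above (illustrative helper mapping).
diet_labels = {"bd": "predominantly breast fed", "fd": "predominantly formula fed"}
merged["diet_label"] = merged["diet"].map(diet_labels)
print(merged)
```

Using `how="left"` keeps every sample even when its subject is missing from the subject table. The first ten rows shown above skip some IDs (for example 3 and 6), so such gaps may exist in the real data.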