## Exploring the David dataset
In this notebook we load the data files from the David et al. (2014) dataset and show what each file contains. The study tracked the gut microbiome composition of 20 healthy adults before, during, and after a 5-day period of consuming an exclusively plant-based or exclusively animal-based diet.

In [1]:
import pandas as pd

### Abundance data
The CSV file "abundance.csv" contains the microbial abundances. The first row gives the OTU IDs and the first column gives the sample IDs.

In [9]:
abundances = pd.read_csv("abundance.csv", index_col=0)

In [10]:
abundances.head()

Unnamed: 0,Otu000001,Otu000002,Otu000003,Otu000004,Otu000005,Otu000006,Otu000007,Otu000008,Otu000009,Otu000010,...,Otu017301,Otu017302,Otu017303,Otu017304,Otu017305,Otu017306,Otu017307,Otu017308,Otu017309,Otu017310
DD10,5629,0,623,0,291,0,0,1263,1961,515,...,0,0,0,0,0,0,0,0,0,0
DD102,5194,0,218,0,674,0,0,2307,560,0,...,0,0,0,0,0,0,0,0,0,0
DD104,5292,0,81,634,2518,0,1938,2009,0,691,...,0,0,0,0,0,0,0,0,0,0
DD106,1780,0,164,0,384,0,0,934,1798,865,...,0,0,0,0,0,0,0,0,0,0
DD107,6046,0,811,0,69,3,0,0,234,459,...,0,0,0,0,0,0,0,0,0,0


We transpose this table so that rows are labeled by OTUs and columns are labeled by samples.

In [11]:
abundances2 = abundances.transpose()

In [12]:
abundances2.head()

Unnamed: 0,DD10,DD102,DD104,DD106,DD107,DD108,DD110,DD111,DD112,DD113,...,ID87,ID89,ID9,ID90,ID91,ID92,ID95,ID97,ID98,ID99
Otu000001,5629,5194,5292,1780,6046,102,2289,3346,2786,2273,...,8745,30865,541,13732,9814,5963,11834,538,9981,32485
Otu000002,0,0,0,0,0,2850,66,0,0,0,...,1,6,22511,6,5,0,0,30569,0,0
Otu000003,623,218,81,164,811,26,472,0,0,764,...,3126,1998,882,3336,2,815,1650,179,5211,2763
Otu000004,0,0,634,0,0,0,0,389,604,893,...,0,659,162,329,4543,0,467,60,0,0
Otu000005,291,674,2518,384,69,583,541,145,543,224,...,1082,924,877,1966,1333,29,1184,396,1602,395


## Taxonomy
To see how each OTU maps to taxonomy, we open the "mothur_placements.csv" file.

In [14]:
mothurplace = pd.read_csv("mothur_placements.csv")

In [15]:
mothurplace.head()

Unnamed: 0.1,Unnamed: 0,Kingdom,Phylum,Class,Order,Family,Genus,Species
0,Otu000001,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Bacteroidaceae,Bacteroides,
1,Otu000002,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Prevotellaceae,Prevotella,
2,Otu000003,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Rikenellaceae,Alistipes,
3,Otu000004,Bacteria,Bacteroidetes,Bacteroidia,Bacteroidales,Bacteroidaceae,Bacteroides,
4,Otu000005,Bacteria,Firmicutes,Clostridia,Clostridiales,Ruminococcaceae,,


## Amplicon sequences
To get the actual amplicon sequence corresponding to each OTU, we load and process the "sequence_key.fa" file.

In [18]:
with open("sequence_key.fa", "r") as file:
    lines = file.readlines()

In [19]:
lines

['>Otu000001\n',
 'GAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCCTGCTAAGCTGCAACTGACATTGAGGCTCGA\n',
 '>Otu000002\n',
 'GGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGAAATGTAGATGCTCAACATCTGCACTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTGGAGCGCAACTGACGCTGAAGCTCGA\n',
 '>Otu000003\n',
 'AAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTTGATAAGTTAGAGGTGAAATTTCGGGGCTCAACCCTGAACGTGCTAGCGGTGAAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAACTATATCTGACGTTGAGGCACGA\n',
 '>Otu000004\n',
 'GAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACGCTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTGCTGGACTGTAACTGACGCTGATGCTCGA\n',
 '>Otu000005\n',
 'AAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGAAGACAAGTTGGAAGTGAAATCTATGGGCTCAACCCATAAACTGCTAGCGGTGGAATGCGTAGATATCGGGAGGAACACCAGTGGCGAAGGCGGCCTACTGGGCACCAACTGACGCTGAGGCTCGA\n',
 '>Otu000006\n',
 'GGGCGTTATCC

The file is in FASTA format: each record has a header line starting with ">" that gives the OTU ID (e.g., ">Otu000001"), followed by the corresponding sequence on the next line.

In [20]:
# create a dictionary with otu #'s as keys and sequences as values

In [35]:
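# Each header line (">Otu...") is followed by its sequence on the next line.
sequences = dict()
for i, line in enumerate(lines):
    if line.startswith(">"):
        key = line[1:].strip()      # OTU ID without the leading ">"
        seq = lines[i + 1].strip()  # sequence on the following line
        sequences[key] = seq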

In [39]:
# create dataframe from dictionary

In [46]:
seq_map = pd.Series(sequences, name='sequence').to_frame()

In [47]:
seq_map

Unnamed: 0,sequence
Otu000001,GAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATG...
Otu000002,GGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAG...
Otu000003,AAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTT...
Otu000004,GAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACG...
Otu000005,AAGCGTTGTCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGAA...
...,...
Otu017306,AGCGTTGTTCGGAATTACTGGGCGTAAAGGGTGTGTAGGCGGTTAT...
Otu017307,AAGCGTTAATCGGAATCACTGGGCGTAAAGCGCACGTAGGCTGTTA...
Otu017308,AGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTCT...
Otu017309,GAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCATAGGTGGTGA...


This dataframe now contains a sequence for each OTU. You can combine it with the mothurplace dataframe to have the OTU-to-sequence and OTU-to-taxonomy mappings in a single dataframe. It might also be useful to create a multi-index from the taxonomy columns, so that you can aggregate abundances at different taxonomic levels.
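
As a rough sketch of that idea, the cell below joins taxonomy and sequences on the OTU ID and then sums counts per phylum. It uses small toy stand-ins for `mothurplace`, `seq_map`, and `abundances2` so that it runs on its own; the counts and the shortened sequences are placeholders, not values from the dataset. The names `taxonomy`, `otu_info`, and `phylum_counts` are introduced here for illustration, and grouping by a taxonomy column is used as a simpler alternative to building a full multi-index.

```python
import pandas as pd

# Toy stand-ins for the notebook's dataframes (values are illustrative only).
mothurplace = pd.DataFrame({
    "Unnamed: 0": ["Otu000001", "Otu000002", "Otu000003"],
    "Phylum": ["Bacteroidetes", "Bacteroidetes", "Firmicutes"],
    "Genus": ["Bacteroides", "Prevotella", None],
})
seq_map = pd.Series(
    {"Otu000001": "GAGC", "Otu000002": "GGGC", "Otu000003": "AAGC"},
    name="sequence",
).to_frame()
abundances2 = pd.DataFrame(
    {"S1": [10, 0, 5], "S2": [3, 7, 2]},
    index=["Otu000001", "Otu000002", "Otu000003"],
)

# Index the taxonomy table by OTU ID, then attach the sequences on that index.
taxonomy = mothurplace.rename(columns={"Unnamed: 0": "otu"}).set_index("otu")
otu_info = taxonomy.join(seq_map, how="left")

# Sum OTU counts within each phylum, separately for every sample.
# The grouping Series is aligned to abundances2 by OTU ID.
phylum_counts = abundances2.groupby(otu_info["Phylum"]).sum()
print(phylum_counts)
```

Swapping `"Phylum"` for another taxonomy column aggregates at that level instead. OTUs with a missing label at that level are dropped by `groupby` by default.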

## Sample metadata
We next see how each sample maps to a subject and time point. The CSV file "sample_metadata.csv" specifies an associated subject ID and timepoint for each sample ID. 

In [48]:
sample_metadata = pd.read_csv("sample_metadata.csv", header=None)

In [49]:
sample_metadata.columns = ["sample_ID", "subject_ID", "time"]

In [50]:
sample_metadata

Unnamed: 0,sample_ID,subject_ID,time
0,DD2,Plant5,3.0
1,DD3,Plant7,4.0
2,DD4,Plant7,3.0
3,DD5,Plant4,2.0
4,DD6,Plant8,-1.0
...,...,...,...
231,ID262,Animal3,8.0
232,ID263,Animal4,10.0
233,ID264,Animal5,5.0
234,ID265,Animal1,-2.0
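
One way to use this mapping is to pull out the abundance columns that belong to a single subject and order them by time point. The cell below is a minimal sketch of that, using toy stand-ins for `sample_metadata` and `abundances2`: the sample-to-subject rows are taken from the table above, but the counts are made up for illustration. The names `subject`, `rows`, and `series` are hypothetical.

```python
import pandas as pd

# Toy stand-ins: metadata rows as shown above, counts are illustrative only.
sample_metadata = pd.DataFrame({
    "sample_ID": ["DD2", "DD3", "DD4"],
    "subject_ID": ["Plant5", "Plant7", "Plant7"],
    "time": [3.0, 4.0, 3.0],
})
abundances2 = pd.DataFrame(
    {"DD2": [1, 1], "DD3": [10, 0], "DD4": [3, 7]},
    index=["Otu000001", "Otu000002"],
)

subject = "Plant7"

# Samples of this subject, ordered by time point.
rows = sample_metadata[sample_metadata["subject_ID"] == subject].sort_values("time")

# Keep only samples that actually appear in the abundance table.
rows = rows[rows["sample_ID"].isin(abundances2.columns)]

# OTU-by-time table for this subject; columns are relabeled with time points.
series = abundances2[rows["sample_ID"]]
series.columns = rows["time"].to_numpy()
print(series)
```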


## Subject metadata
The file "subject_data.csv" gives us a table showing how each subject_ID maps to a label (e.g., Plant-diet or Animal-diet).

In [52]:
subject_metadata = pd.read_csv("subject_data.csv")

In [53]:
subject_metadata

Unnamed: 0,subject_ID,diet
0,Plant5,Plant
1,Plant7,Plant
2,Plant4,Plant
3,Plant8,Plant
4,Plant6,Plant
5,Plant9,Plant
6,Plant3,Plant
7,Plant1,Plant
8,Plant10,Plant
9,Plant2,Plant


The first column gives the subject_ID, which matches the subject_ID column in the sample metadata file. The second column gives the diet label for that subject.
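
Because both tables share the subject_ID column, the diet label can be attached to every sample with a merge. The sketch below uses toy stand-ins for `sample_metadata` and `subject_metadata`; the "Animal" label for Animal3 is an assumption, since only Plant rows are displayed above, and `sample_info` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-ins for the two metadata tables.
sample_metadata = pd.DataFrame({
    "sample_ID": ["DD2", "DD3", "ID262"],
    "subject_ID": ["Plant5", "Plant7", "Animal3"],
    "time": [3.0, 4.0, 8.0],
})
subject_metadata = pd.DataFrame({
    "subject_ID": ["Plant5", "Plant7", "Animal3"],
    "diet": ["Plant", "Plant", "Animal"],  # "Animal" label is assumed
})

# Many samples map to one subject; validate= raises an error if a
# subject_ID appears more than once in subject_metadata.
sample_info = sample_metadata.merge(
    subject_metadata, on="subject_ID", how="left", validate="many_to_one"
)
print(sample_info)
```

With `how="left"`, every sample is kept even if its subject is missing from the subject table, in which case the diet is NaN.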