## Exploring the David dataset
In this notebook we load the data files from the David et al. (2014) dataset and show what each file contains. David et al. (2014) tracked the gut microbiome composition of 20 healthy adults before, during, and after a 5-day period of consuming an exclusively plant-based or exclusively animal-based diet.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import os

In [3]:
dataset_path = "./datasets/david/"

### Abundance data
A CSV file containing the microbial abundances. The first row gives the OTU IDs and the first column gives the sample IDs, so each row is one sample and each column is one OTU.

In [4]:
abundances = pd.read_csv(os.path.join(dataset_path, "abundance.csv"), index_col=0)

In [5]:
abundances

Unnamed: 0,Otu000001,Otu000002,Otu000003,Otu000004,Otu000005,Otu000006,Otu000007,Otu000008,Otu000009,Otu000010,...,Otu017301,Otu017302,Otu017303,Otu017304,Otu017305,Otu017306,Otu017307,Otu017308,Otu017309,Otu017310
DD10,5629,0,623,0,291,0,0,1263,1961,515,...,0,0,0,0,0,0,0,0,0,0
DD102,5194,0,218,0,674,0,0,2307,560,0,...,0,0,0,0,0,0,0,0,0,0
DD104,5292,0,81,634,2518,0,1938,2009,0,691,...,0,0,0,0,0,0,0,0,0,0
DD106,1780,0,164,0,384,0,0,934,1798,865,...,0,0,0,0,0,0,0,0,0,0
DD107,6046,0,811,0,69,3,0,0,234,459,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ID92,5963,0,815,0,29,0,2458,674,386,8999,...,0,0,0,0,0,0,0,0,0,0
ID95,11834,0,1650,467,1184,0,0,0,1327,0,...,0,0,0,0,0,0,0,0,0,0
ID97,538,30569,179,60,396,19910,0,0,1398,20,...,0,0,0,0,0,0,2,0,0,0
ID98,9981,0,5211,0,1602,0,7275,0,0,898,...,5,0,0,0,0,0,0,0,0,0
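
A quick way to get a feel for a samples-by-OTUs table like this is to summarize its shape, how many entries are zero, and the total count per sample. The following is a minimal sketch, not part of the original notebook: `summarize_abundances` is a hypothetical helper, and the toy frame uses made-up values so that the block runs without the dataset files.

```python
import pandas as pd


def summarize_abundances(abundances: pd.DataFrame) -> dict:
    """Summarize a count table with samples as rows and OTUs as columns."""
    n_samples, n_otus = abundances.shape
    return {
        "n_samples": n_samples,
        "n_otus": n_otus,
        # Fraction of all entries that are exactly zero.
        "fraction_zero": float((abundances == 0).to_numpy().mean()),
        # Row sums: total count observed in each sample.
        "reads_per_sample": abundances.sum(axis=1),
    }


# Toy stand-in for `abundances` (hypothetical values, not from the dataset).
toy = pd.DataFrame(
    {"Otu000001": [10, 0, 5], "Otu000002": [0, 0, 5], "Otu000003": [30, 20, 0]},
    index=["S1", "S2", "S3"],
)
summary = summarize_abundances(toy)
print(summary["n_samples"], summary["n_otus"], round(summary["fraction_zero"], 3))
print(summary["reads_per_sample"])
```

With the real data loaded, `summarize_abundances(abundances)` would report the same quantities for the full table.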


### Sample metadata
A CSV file that gives the subject ID and timepoint associated with each sample ID. The file has no header row, so the column names are assigned after loading.

In [6]:
sample_metadata = pd.read_csv(os.path.join(dataset_path, "sample_metadata.csv"), header=None)

In [7]:
sample_metadata.columns = ["sample_ID", "subject_ID", "time"]

In [8]:
sample_metadata

Unnamed: 0,sample_ID,subject_ID,time
0,DD2,Plant5,3.0
1,DD3,Plant7,4.0
2,DD4,Plant7,3.0
3,DD5,Plant4,2.0
4,DD6,Plant8,-1.0
...,...,...,...
231,ID262,Animal3,8.0
232,ID263,Animal4,10.0
233,ID264,Animal5,5.0
234,ID265,Animal1,-2.0
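
Because each subject is sampled at several timepoints, it can help to list the timepoints available per subject. The sketch below assumes the three column names assigned above (`sample_ID`, `subject_ID`, `time`); `timepoints_per_subject` is a hypothetical helper, and the toy frame contains made-up rows rather than rows from the dataset.

```python
import pandas as pd


def timepoints_per_subject(sample_metadata: pd.DataFrame) -> pd.Series:
    """Return, for each subject_ID, the sorted list of sampled timepoints."""
    return (
        sample_metadata.sort_values("time")
        .groupby("subject_ID")["time"]
        .apply(list)
    )


# Toy stand-in for `sample_metadata` (hypothetical values, not from the dataset).
toy_meta = pd.DataFrame(
    {
        "sample_ID": ["A", "B", "C", "D"],
        "subject_ID": ["Subj1", "Subj2", "Subj1", "Subj2"],
        "time": [3.0, -1.0, -2.0, 4.0],
    }
)
per_subject = timepoints_per_subject(toy_meta)
print(per_subject)
```

The same call on the loaded `sample_metadata` would show which subjects have samples before, during, and after the diet period.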


### Subject metadata
A CSV file that gives information about each subject, including the value of whatever variable will be used as the host outcome for prediction (e.g., Plant-diet or Animal-diet in David et al.).

In [9]:
subject_metadata = pd.read_csv(os.path.join(dataset_path, "subject_data.csv"))

In [10]:
subject_metadata

Unnamed: 0,subject_ID,diet
0,Plant5,Plant
1,Plant7,Plant
2,Plant4,Plant
3,Plant8,Plant
4,Plant6,Plant
5,Plant9,Plant
6,Plant3,Plant
7,Plant1,Plant
8,Plant10,Plant
9,Plant2,Plant
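
The subject-level outcome can be attached to every sample by joining the two metadata tables on `subject_ID`. This is a minimal sketch under the assumption that the tables have the columns shown above; `attach_diet` is a hypothetical helper, and the toy frames use made-up IDs. A left join keeps every sample, so a sample whose subject has no record ends up with a missing `diet` rather than being dropped.

```python
import pandas as pd


def attach_diet(sample_metadata: pd.DataFrame, subject_metadata: pd.DataFrame) -> pd.DataFrame:
    """Add each subject's diet label to the per-sample metadata."""
    return sample_metadata.merge(
        subject_metadata,
        on="subject_ID",
        how="left",              # keep every sample, even without a subject record
        validate="many_to_one",  # raise if a subject_ID appears twice in subject_metadata
    )


# Toy stand-ins (hypothetical values, not from the dataset).
toy_samples = pd.DataFrame(
    {
        "sample_ID": ["s1", "s2", "s3", "s4"],
        "subject_ID": ["SubjA", "SubjA", "SubjB", "SubjC"],
        "time": [0.0, 1.0, 0.0, 2.0],
    }
)
toy_subjects = pd.DataFrame(
    {"subject_ID": ["SubjA", "SubjB"], "diet": ["Plant", "Animal"]}
)
merged = attach_diet(toy_samples, toy_subjects)
print(merged)
```

Here `SubjC` has no entry in the toy subject table, so its sample gets a missing `diet` value, which makes gaps between the two files easy to spot.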
