Timeseries filtering

gregcaporaso edited this page Nov 30, 2012 · 16 revisions

@jrrideout added qiime.filter.sample_ids_from_category_state_coverage (awaiting a couple of minor changes before merge). This lets us explore the effect of different filtering strategies: we pass a mapping file and can specify the minimum number of timepoints an individual must have provided to be included, and/or specific timepoints that must be present for a given individual. For example:

# do some setup
In [0]: from qiime.filter import sample_ids_from_category_state_coverage as s
In [1]: f = list(open("./StudentMicrobiomeProject.tsv",'U'))
In [2]: cc = "WeeksSinceStart"
In [3]: sc = "PersonalID"

# call help to learn how to use this function
In [4]: help(s)

# Now let's explore the data:
# How many individuals donated at least 1 timepoint?
In [28]: s(f,cc,sc,1)[1]
Out[28]: 123

# How many individuals donated at least 5 timepoints?
In [30]: s(f,cc,sc,5)[1]
Out[30]: 90

# What about 6 and up?
In [31]: s(f,cc,sc,6)[1]
Out[31]: 90
In [32]: s(f,cc,sc,7)[1]
Out[32]: 88
In [33]: s(f,cc,sc,8)[1]
Out[33]: 79
In [34]: s(f,cc,sc,9)[1]
Out[34]: 70
In [35]: s(f,cc,sc,10)[1]
Out[35]: 51
In [36]: s(f,cc,sc,11)[1]
Out[36]: 17
In [37]: s(f,cc,sc,12)[1]
Out[37]: 10

# We can also require specific timepoints.
# How many individuals donated samples at timepoints 0 and 10?
In [39]: s(f,cc,sc,1,map(str,[0,10]))[1]
Out[39]: 51

# And some other specific timepoints
In [40]: s(f,cc,sc,1,map(str,[0,9]))[1]
Out[40]: 50
In [41]: s(f,cc,sc,1,map(str,[0,8]))[1]
Out[41]: 76
In [42]: s(f,cc,sc,1,map(str,[0,7]))[1]
Out[42]: 55

# Of the pairs tried, requiring timepoints 0 and 8 retains the most individuals (76), so it looks like we might maximize our results if we treat 0 and 8 as the start and stop points. Now let's combine the two filtering methods: an individual must have provided timepoints 0 and 8 and at least n timepoints in total, for a few values of n.
In [43]: s(f,cc,sc,5,map(str,[0,8]))[1]
Out[43]: 76
In [44]: s(f,cc,sc,6,map(str,[0,8]))[1]
Out[44]: 76
In [45]: s(f,cc,sc,7,map(str,[0,8]))[1]
Out[45]: 76
In [46]: s(f,cc,sc,8,map(str,[0,8]))[1]
Out[46]: 71

# So it looks like the meat of our timecourse will contain 76 individuals who provided samples at both weeks 0 and 8 and at least 7 timepoints in total (the calls above do not restrict the count to weeks 0 through 8).
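
To make the two filtering criteria concrete, here is a minimal standalone sketch of the same idea in plain Python. It is not the QIIME implementation: the helper `subjects_with_coverage`, its parameters, and the toy data are all hypothetical, and it assumes that coverage means the number of distinct timepoints per individual rather than the number of samples.

```python
from collections import defaultdict


def subjects_with_coverage(rows, min_num_states=1, required_states=None):
    """Return the set of subjects that meet both coverage criteria.

    rows: iterable of (subject, state) pairs, one pair per sample.
    min_num_states: minimum number of distinct states per subject.
    required_states: states every kept subject must have (optional).

    Hypothetical helper for illustration only; not part of QIIME.
    """
    required = set(required_states or [])
    states_by_subject = defaultdict(set)
    for subject, state in rows:
        states_by_subject[subject].add(state)
    return {
        subject
        for subject, states in states_by_subject.items()
        if len(states) >= min_num_states and required <= states
    }


# Made-up toy data: (PersonalID, WeeksSinceStart) for each sample.
toy_rows = [
    ("A", "0"), ("A", "1"), ("A", "2"), ("A", "8"),
    ("B", "0"), ("B", "1"),
    ("C", "1"), ("C", "2"), ("C", "8"),
]

# Sweep the minimum number of timepoints in one pass instead of
# repeating the call by hand for each value.
counts_by_min = {
    n: len(subjects_with_coverage(toy_rows, n)) for n in range(1, 6)
}

# Combine both criteria: timepoints 0 and 8 must be present, and the
# subject must have at least 3 distinct timepoints overall.
kept = subjects_with_coverage(toy_rows, 3, ["0", "8"])

print(counts_by_min)
print(sorted(kept))
```

The dictionary comprehension mirrors the repeated `s(f,cc,sc,n)[1]` calls in the session, so the same sweep could be written as a loop over `n` against the real function.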