-
Notifications
You must be signed in to change notification settings - Fork 3
Timeseries filtering
gregcaporaso edited this page Nov 30, 2012
·
16 revisions
@jrrideout added qiime.filter.sample_ids_from_category_state_coverage
(now merged - some additional features are coming soon). This lets us explore the effect of different timeseries filtering strategies. We pass a mapping file and can specify the minimum number of timepoints that an individual must have provided to include them and/or specific timepoints that must be present for a given individual. Here's how to use it if others want to explore the data.
# do some setup
In [0]: from qiime.filter import sample_ids_from_category_state_coverage as s
In [1]: f = list(open("./StudentMicrobiomeProject.tsv",'U'))
In [2]: cc = "WeeksSinceStart"
In [3]: sc = "PersonalID"
# call help to learn how to use this function
In [4]: help(s)
# Now let's explore the data:
# How many individuals donated at least 1 timepoint?
In [28]: s(f,cc,sc,1)[1]
Out[28]: 123
# How many individuals donated at least 5 timepoint?
In [30]: s(f,cc,sc,5)[1]
Out[30]: 90
# What about 6 and up?
In [31]: s(f,cc,sc,6)[1]
Out[31]: 90
In [32]: s(f,cc,sc,7)[1]
Out[32]: 88
In [33]: s(f,cc,sc,8)[1]
Out[33]: 79
In [34]: s(f,cc,sc,9)[1]
Out[34]: 70
In [35]: s(f,cc,sc,10)[1]
Out[35]: 51
In [36]: s(f,cc,sc,11)[1]
Out[36]: 17
In [37]: s(f,cc,sc,12)[1]
Out[37]: 10
# We can also specify specific timepoints that we care about.
# How many individuals donated at samples at timepoints 0 and 10?
In [39]: s(f,cc,sc,1,map(str,[0,10]))[1]
Out[39]: 51
# And some other specific timepoints
In [40]: s(f,cc,sc,1,map(str,[0,9]))[1]
Out[40]: 50
In [41]: s(f,cc,sc,1,map(str,[0,8]))[1]
Out[41]: 76
In [42]: s(f,cc,sc,1,map(str,[0,7]))[1]
Out[42]: 55
# So from this, it looks like we might maximize our results if we
# treat 0 and 8 as the start and stop points. So now let's combine
# the two filtering methods to say that an individual must have
# provided timepoints 0 and 8 and at least n total samples for a
# few values of n.
In [43]: s(f,cc,sc,5,map(str,[0,8]))[1]
Out[43]: 76
In [44]: s(f,cc,sc,6,map(str,[0,8]))[1]
Out[44]: 76
In [45]: s(f,cc,sc,7,map(str,[0,8]))[1]
Out[45]: 76
In [46]: s(f,cc,sc,8,map(str,[0,8]))[1]
Out[46]: 71
# So it looks like the meat of our timecourse will contain 76
# individuals who provided at least 7 samples between/including
# weeks 0 through 8.
The commands below illustrate how the GutTimeseries
, TongueTimeseries
, PalmTimeseries
, ForeheadTimeseries
, and AnyTimeseries
columns were created in the mapping file.
open('./StudentMicrobiomeProject_forehead.tsv','w').write(filter_mapping_file_by_metadata_states(open('./StudentMicrobiomeProject.tsv','U'),"BodySite:forehead"))
open('./StudentMicrobiomeProject_gut.tsv','w').write(filter_mapping_file_by_metadata_states(open('./StudentMicrobiomeProject.tsv','U'),"BodySite:gut"))
open('./StudentMicrobiomeProject_palm.tsv','w').write(filter_mapping_file_by_metadata_states(open('./StudentMicrobiomeProject.tsv','U'),"BodySite:palm"))
open('./StudentMicrobiomeProject_tongue.tsv','w').write(filter_mapping_file_by_metadata_states(open('./StudentMicrobiomeProject.tsv','U'),"BodySite:tongue"))
In [64]: def f(min_num_states,covered_states=None):
sites = ['gut','tongue','forehead','palm']
for site in sites:
print site, s(open('./StudentMicrobiomeProject_%s.tsv' % site,'U'),cc,sc,min_num_states,covered_states)[1]
....:
# Determine number of individuals retained with on a per-body-site basis
In [65]: f(7,map(str,[0,8]))
gut 74
tongue 76
forehead 75
palm 75
# Generate the list of sids with enough timepoints on a per-body-site basis
gut_sids = s(open('./StudentMicrobiomeProject_gut.tsv','U'),cc,sc,7,map(str,[0,8]))[0]
tongue_sids = s(open('./StudentMicrobiomeProject_tongue.tsv','U'),cc,sc,7,map(str,[0,8]))[0]
palm_sids = s(open('./StudentMicrobiomeProject_palm.tsv','U'),cc,sc,7,map(str,[0,8]))[0]
forehead_sids = s(open('./StudentMicrobiomeProject_forehead.tsv','U'),cc,sc,7,map(str,[0,8]))[0]
# Parse the mapping file to a dict
from qiime.parse import parse_mapping_file_to_dict
map = parse_mapping_file_to_dict(open('./StudentMicrobiomeProject.tsv','U'))[0]
# Create a new mapping file for just this timeseries data which will be
# merged with the mapping file in Google Docs. Note that we are only keeping the
# timepoints between WeeksSinceStart 0 and 8 (inclusive) so we have well-defined
# start and end points.
In [99]: g = open('ts_map.txt','w')
In [100]: for k, v in map.items():
if v['WeeksSinceStart'] == 'na' or float(v['WeeksSinceStart']) > 8:
g.write('%s\t%s\n' % (k,'\t'.join(["No"] * 5)))
else:
fields = []
if k in gut_sids:
fields.append('Yes')
else:
fields.append('No')
if k in tongue_sids:
fields.append('Yes')
else:
fields.append('No')
if k in palm_sids:
fields.append('Yes')
else:
fields.append('No')
if k in forehead_sids:
fields.append('Yes')
else:
fields.append('No')
if 'Yes' in fields:
fields.append('Yes')
else:
fields.append('No')
g.write('%s\t%s\n' % (k,'\t'.join(fields)))
.....:
In [101]: g.close()