Skip to content

Timeseries filtering

gregcaporaso edited this page Dec 14, 2012 · 16 revisions

@jrrideout added qiime.filter.sample_ids_from_category_state_coverage (now merged - some additional features are coming soon). This lets us explore the effect of different timeseries filtering strategies. We pass a mapping file and can specify the minimum number of timepoints that an individual must have provided to include them and/or specific timepoints that must be present for a given individual. Here's how to use it if others want to explore the data.

# Define a valid-states string to filter mapping file to remove samples with ambiguous timepoints (list compiled by Gilbert) or samples with fewer than 10k sequences
In [31]: valid_states = 'SampleID:*,!F05216,!P16170,!T20802,!P16175,!P16177,!F05211,!P16034,!G10981,!F05256,!T20904,!F06365,!P15537,!G11212,!P15967,!P15570,!G10724,!G10142,!G11234,!T20922,!G11591,!T21271,!T20899,!F06450,!T20931,!F05523,!T21216,!T21300,!T20525,!G10784,!F05229,!F05590,!P15589,!F05569,!G11267,!F06517,!G10991,!F05247,!G11208,!F06459,!F06592,!P16510,!G10918,!G10152,!P15540,!T20761,!P16181,!P16183,!G10118,!P16596,!G11203,!T21386,!T20969,!G11057,!T21241,!F05536,!P15059,!G11503,!T21205,!P15055,!G10930,!F05162,!T20555;SequenceCount:*,!na'

# Filter the mapping file and write it to a new file
In [32]: open('./StudentMicrobiomeProject-map.filt.tsv','w').write(filter_mapping_file_by_metadata_states(open('./StudentMicrobiomeProject-map.tsv','U'),valid_states))

# Number of samples before and after the filter
In [ ]: !wc -l StudentMicrobiomeProject-map.tsv
    3735 StudentMicrobiomeProject-map.tsv

In [ ]: !wc -l StudentMicrobiomeProject-map.filt.tsv
    3293 StudentMicrobiomeProject-map.filt.tsv
# do some setup
In [0]: from qiime.filter import sample_ids_from_category_state_coverage as s
In [1]: f = list(open("./StudentMicrobiomeProject-map.filt.tsv",'U'))
In [2]: cc = "WeeksSinceStart"
In [3]: sc = "PersonalID"


# call help to learn how to use this function
In [4]: help(s)

# Now let's explore the data:
# How many individuals donated at least 1 timepoint?
In [7]: s(f,cc,sc,1)[1]
Out[7]: 124

# How many individuals donated at least 5 timepoint?
In [8]: s(f,cc,sc,5)[1]
Out[8]: 88

# What about 6 and up?
In [9]: s(f,cc,sc,6)[1]
Out[9]: 88
In [10]: s(f,cc,sc,7)[1]
Out[10]: 86
In [11]: s(f,cc,sc,8)[1]
Out[11]: 77
In [12]: s(f,cc,sc,9)[1]
Out[12]: 67
In [13]: s(f,cc,sc,10)[1]
Out[13]: 51
In [14]: s(f,cc,sc,11)[1]
Out[14]: 17
In [15]: s(f,cc,sc,12)[1]
Out[15]: 10

# We can also specify specific timepoints that we care about.
# How many individuals donated at samples at timepoints 0 and 10?
In [8]: s(f,cc,sc,1,[0,10])[1]
Out[8]: 50
# And some other specific timepoints
In [9]: s(f,cc,sc,1,[0,9])[1]
Out[9]: 49
In [10]: s(f,cc,sc,1,[0,8])[1]
Out[10]: 76
In [11]: s(f,cc,sc,1,[0,7])[1]
Out[11]: 53

# We can also check how many individuals we retain if we require a certain number of 
# timepoints in some range. For example, how many individuals provided at least 6 samples
# between timepoints 0 and 8 (optionally including those timepoints). 

In [29]: s(f,cc,sc,6,considered_states=[0,0.5,1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6,6.5,7,7.5,8])[1]
Out[29]: 86

# ... or at least 7 of those timepoints
In [30]: s(f,cc,sc,7,considered_states=[0,0.5,1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6,6.5,7,7.5,8])[1]
Out[30]: 79
# ... or at least 8 of those timepoints
In [31]: s(f,cc,sc,8,considered_states=[0,0.5,1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6,6.5,7,7.5,8])[1]
Out[31]: 61

# Alternatively, if we require 7 samples between timepoints 0 and 9 we retain almost the same 
#  number of individuals as with 6 samples between timepoints 0 and 8, so we'll go with that.
In [35]: s(f,cc,sc,7,considered_states=[0,0.5,1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6,6.5,7,7.5,8,8.5,9])[1]
Out[35]: 85

The commands below illustrate how the GutTimeseries, TongueTimeseries, PalmTimeseries, ForeheadTimeseries, and AnyTimeseries columns were created in the mapping file.

# Filter the mapping file to create body-site-specific mapping files.
In [ ]: from qiime.filter import filter_mapping_file_by_metadata_states
In [ ]: open('./StudentMicrobiomeProject_forehead.filt.tsv','w').write(filter_mapping_file_by_metadata_states(open('./StudentMicrobiomeProject-map.filt.tsv','U'),"BodySite:forehead"))
In [ ]: open('./StudentMicrobiomeProject_gut.filt.tsv','w').write(filter_mapping_file_by_metadata_states(open('./StudentMicrobiomeProject-map.filt.tsv','U'),"BodySite:gut"))
In [ ]: open('./StudentMicrobiomeProject_palm.filt.tsv','w').write(filter_mapping_file_by_metadata_states(open('./StudentMicrobiomeProject-map.filt.tsv','U'),"BodySite:palm"))
In [ ]: open('./StudentMicrobiomeProject_tongue.filt.tsv','w').write(filter_mapping_file_by_metadata_states(open('./StudentMicrobiomeProject-map.filt.tsv','U'),"BodySite:tongue"))


# Determine number of individuals retained with on a per-body-site basis
In [ ]: def f(min_num_states,considered_states=None):
    sites = ['gut','tongue','forehead','palm']
    for site in sites:
        print site, s(open('./StudentMicrobiomeProject_%s.filt.txt' % site,'U'),cc,sc,min_num_states,considered_states=considered_states)[1]
   ....:         

In [ ]: f(7,considered_states=[0,0.5,1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6,6.5,7,7.5,8,8.5,9])
gut 84
tongue 86
forehead 86
palm 85

In [ ]: considered_states=[0,0.5,1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6,6.5,7,7.5,8,8.5,9]

# Generate the list of sids with enough timepoints on a per-body-site basis
In [ ]: gut_sids = s(open('./StudentMicrobiomeProject_gut.filt.tsv','U'),cc,sc,7,considered_states=considered_states)[0]
In [ ]: tongue_sids = s(open('./StudentMicrobiomeProject_tongue.filt.tsv','U'),cc,sc,7,considered_states=considered_states)[0]
In [ ]: palm_sids = s(open('./StudentMicrobiomeProject_palm.filt.tsv','U'),cc,sc,7,considered_states=considered_states)[0]
In [ ]: forehead_sids = s(open('./StudentMicrobiomeProject_forehead.filt.tsv','U'),cc,sc,7,considered_states=considered_states)[0]

# Parse the mapping file to a dict
In [ ]: from qiime.parse import parse_mapping_file_to_dict
In [ ]: map = parse_mapping_file_to_dict(open('./StudentMicrobiomeProject-map.tsv','U'))[0]

# Create a new mapping file for just this timeseries data which will be 
# merged with the mapping file in Google Docs. Note that we are only keeping the
# timepoints between WeeksSinceStart 0 and 8 (inclusive) so we have well-defined
# start and end points.
In [99]: g = open('ts_map.txt','w')

In [100]: for k, v in map.items():
    if v['WeeksSinceStart'] == 'na' or float(v['WeeksSinceStart']) > 9:
        g.write('%s\t%s\n' % (k,'\t'.join(["No"] * 5)))
    else:
        fields = []
        if k in gut_sids:
            fields.append('Yes')
        else:
            fields.append('No')
        if k in tongue_sids:
            fields.append('Yes')
        else:
            fields.append('No')
        if k in palm_sids:
            fields.append('Yes')
        else:
            fields.append('No')
        if k in forehead_sids:
            fields.append('Yes')
        else:
            fields.append('No')
        if 'Yes' in fields:
            fields.append('Yes')
        else:
            fields.append('No')
        g.write('%s\t%s\n' % (k,'\t'.join(fields)))
   .....:         

In [101]: g.close()

Finally the new mapping file is merged with the input file, and the result is uploaded to Google Docs.

merge_mapping_files.py -m StudentMicrobiomeProject-map.txt,ts_map.tsv -o new.txt