Early in the American Gut Project, it was observed that some organisms bloomed, likely as a result of long shipping times and the delay between when samples were collected and when they were put on ice (more detail can be found [here](http://americangut.org/?page_id=277)). The purpose of this notebook is to apply the filter that was developed to bioinformatically subtract these observed bloom sequences from fecal samples. It is important to apply this filter when combining data with the American Gut in order to remove a potential study-effect bias, as all fecal data in the American Gut has had this filter applied. The specific steps covered are:

* Filter demultiplexed sequence data to only fecal samples
* Determine what sequences in the fecal samples recruit to the observed bloom sequences
* Remove the recruited bloom sequences from the demultiplexed sequence data

The filtering is only intended to be applied to fecal data. As such, this notebook allows you to specify which metadata column and value identify fecal samples, so that only those samples are considered when looking for blooms.

In [None]:
import os
import multiprocessing
import americangut.notebook_environment as agenv

Next, we'll establish the paths we will be creating.

In [None]:
fecal_sequences         = agenv.get_new_path(agenv.filenames['fecal-sequences'])
filtered_sequences      = agenv.get_new_path(agenv.filenames['filtered-sequences'])
observed_blooms         = agenv.get_new_path(agenv.filenames['observed-blooms'])
observed_blooms_biom    = agenv.get_new_path(agenv.filenames['observed-blooms-biom'])
observed_blooms_otu_map = agenv.get_new_path(agenv.filenames['observed-blooms-otu-map'])

This next call will set up and verify the path to the bloom sequences used for filtering.

In [None]:
bloom_sequences = agenv.get_bloom_sequences()

Now let's set up the paths to the sequences to filter. We also need the metadata in order to reduce the data to just the fecal samples. Please replace these variables with your own paths if you wish to filter your data for blooms (a necessary precursor if you wish to combine data with the American Gut).

In [None]:
# If you are filtering your own data, please update these filepath variables as necessary
sequences = agenv.get_existing_path(agenv.filenames['raw-sequences'])
metadata  = agenv.get_existing_path(agenv.filenames['raw-metadata'])

We also need to specify which metadata category and value indicate that a sample is fecal. These values may be study specific, so please modify them as needed if you are filtering other datasets.

In [None]:
# If you are filtering your own data, please update these variables to reflect your mapping file
metadata_category = 'BODY_SITE'
metadata_value    = 'UBERON:feces'
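
Before running the filter, it can help to confirm that the category and value actually match some samples in your mapping file. The following is a minimal sketch, not part of the American Gut code: the helper `matching_sample_ids` is hypothetical, and it assumes a tab-delimited QIIME-style mapping file whose first line is the header and whose first column holds the sample ID.

```python
import csv


def matching_sample_ids(mapping_lines, category, value):
    """Return the sample IDs whose `category` column equals `value`.

    `mapping_lines` is any iterable of lines (an open file works too).
    """
    reader = csv.reader(mapping_lines, delimiter='\t')
    header = next(reader)
    if category not in header:
        raise KeyError("column %r not found in mapping file" % category)
    col = header.index(category)

    ids = []
    for row in reader:
        # Skip blank lines, comment lines, and rows too short to hold the column
        if not row or row[0].startswith('#') or len(row) <= col:
            continue
        if row[col] == value:
            ids.append(row[0])
    return ids


# Toy mapping file used only for illustration
example_mapping = [
    "#SampleID\tBODY_SITE\tDescription\n",
    "s1\tUBERON:feces\tstool\n",
    "s2\tUBERON:tongue\toral\n",
    "s3\tUBERON:feces\tstool\n",
]
fecal_ids = matching_sample_ids(example_mapping, 'BODY_SITE', 'UBERON:feces')
print(fecal_ids)
```

With your own data, you could pass `open(metadata)` instead of the toy list; an empty result suggests that `metadata_category` or `metadata_value` needs to be changed.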

Now that we know which samples to focus on, we can reduce the input data to just the fecal sequences, which are the only ones that need to be screened for blooms.

In [None]:
_fecal_states = ':'.join([metadata_category, metadata_value])

!filter_fasta.py -f $sequences \
                 -o $fecal_sequences \
                 --mapping_fp $metadata \
                 --valid_states $_fecal_states

The next thing we need to do is set up the parameters for SortMeRNA, which is the method we'll use to compare all the input data to our reference of bloom sequences.

In [None]:
_params_file = agenv.get_path('sortmerna_pick_params.txt')
with open(_params_file, 'w') as f:
    f.write("pick_otus:otu_picking_method sortmerna\n")
    f.write("pick_otus:similarity 0.97\n")
    f.write("pick_otus:threads %d\n" % multiprocessing.cpu_count())
    
!pick_closed_reference_otus.py -i $fecal_sequences \
                               -o $observed_blooms \
                               -r $bloom_sequences \
                               -p $_params_file
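
The OTU map written by this step records which reads recruited to which bloom sequence. As an illustration of how that file can be read, here is a minimal sketch; the helper `recruited_sequence_ids` is hypothetical, and it assumes the usual QIIME OTU map layout of one tab-separated line per reference sequence, with the reference ID first and the assigned read IDs after it.

```python
def recruited_sequence_ids(otu_map_lines):
    """Collect every read ID that was assigned to any reference sequence."""
    recruited = set()
    for line in otu_map_lines:
        fields = line.rstrip('\n').split('\t')
        # fields[0] is the reference (bloom) ID; the rest are recruited reads
        recruited.update(f for f in fields[1:] if f)
    return recruited


# Toy OTU map used only for illustration
example_map = [
    "bloom1\ts1_0\ts1_7\ts3_2\n",
    "bloom2\ts3_5\n",
]
recruited = recruited_sequence_ids(example_map)
print(sorted(recruited))
```

With real output, you could pass `open(observed_blooms_otu_map)` in place of the toy list.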

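And now, we can remove the blooms from the input sequences. The `-n` flag negates the filter, so the sequences listed in the OTU map are discarded rather than kept.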

In [None]:
!filter_fasta.py -f $sequences \
                 -m $observed_blooms_otu_map \
                 -n \
                 -o $filtered_sequences
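
Conceptually, the negated filter drops every record whose sequence ID was recruited to a bloom and keeps everything else. The sketch below illustrates that idea in plain Python; it is not how `filter_fasta.py` is implemented. The helper `drop_sequences` is hypothetical, and it assumes the sequence ID is the first whitespace-delimited token of each FASTA label.

```python
def drop_sequences(fasta_lines, ids_to_drop):
    """Yield the lines of every FASTA record whose ID is not in `ids_to_drop`."""
    keep = True
    for line in fasta_lines:
        if line.startswith('>'):
            # The sequence ID is the first token after the '>' character
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ''
            keep = seq_id not in ids_to_drop
        if keep:
            yield line


# Toy demultiplexed sequences used only for illustration
example_fasta = [
    ">s1_0 orig_bc=ACGT\n", "ACGTACGT\n",
    ">s1_1 orig_bc=ACGT\n", "TTGACCAA\n",
    ">s3_2 orig_bc=GGCA\n", "ACGTACGA\n",
]
filtered = list(drop_sequences(example_fasta, {'s1_0', 's3_2'}))
print(''.join(filtered))
```

Because the function tracks the most recent label, it also handles records whose sequence is wrapped over several lines.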

Finally, let's do a quick sanity check that we have sequence data, and also print summary information about how many reads per sample recruited to the blooms.

In [None]:
assert os.stat(filtered_sequences).st_size > 0
!biom summarize-table -i $observed_blooms_biom
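
If you prefer to tally recruited reads yourself, the per-sample counts can also be derived from the read IDs. This is a minimal sketch under one assumption: read IDs follow the QIIME demultiplexed convention `<sample ID>_<read number>`. The helper `reads_per_sample` is hypothetical.

```python
from collections import Counter


def reads_per_sample(sequence_ids):
    """Count reads per sample from '<sample ID>_<read number>' identifiers."""
    # rsplit on the last underscore so sample IDs may themselves contain '_'
    return Counter(seq_id.rsplit('_', 1)[0] for seq_id in sequence_ids)


# Toy read IDs used only for illustration
counts = reads_per_sample(['s1_0', 's1_7', 's3_2', 'my_sample_4'])
for sample_id, n_reads in sorted(counts.items()):
    print("%s\t%d" % (sample_id, n_reads))
```

A sample with an unusually large count relative to its sequencing depth is one where the blooms made up a large share of the reads.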