# Sourcetracker

Here we will demonstrate how to run sourcetracker on 16S and metabolomics data. We aim to estimate the probability of an individual touching an object, such as a computer.

In [1]:
import pandas as pd
import numpy as np
from biom import load_table, Table
from biom.util import biom_open
import re

Now we will read in the metadata. Note that we will need to add two additional metadata columns:
a `SourceSink` column and an `Env` column. See the QIIME source tracking tutorial
(http://qiime.org/tutorials/source_tracking.html) for more explanation.

In [2]:
metadata = pd.read_csv('../data/qiita_refined_mapping.csv', index_col=0)

We will want to assign all of the volunteers as the source of the microbes / metabolites and the objects as sinks.  This information will be stored in the `SourceSink` column.  

The `Env` column is simpler: it just records the subject name, that is, the volunteer or object each sample came from.

In [3]:
def assign_source_sink(x):
    if x in ['Volunteer1',
             'Volunteer2',
             'Volunteer3',
             'Volunteer4']:
        return 'source'
    else:
        return 'sink'
    
metadata['SourceSink'] = metadata.subject.apply(assign_source_sink)
metadata['Env'] = metadata['subject']
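
To see what these two columns look like, the following sketch applies the same rule to a small made-up metadata table. The sample IDs and the object names (`Keyboard`, `Mouse`) are hypothetical and are not taken from the study. The sketch uses `isin` with `np.where` as a vectorized alternative to `apply`, and it should give the same labels.

```python
import numpy as np
import pandas as pd

# Hypothetical metadata: sample IDs and object names are made up for illustration.
toy = pd.DataFrame(
    {'subject': ['Volunteer1', 'Keyboard', 'Volunteer3', 'Mouse']},
    index=['s1', 's2', 's3', 's4'],
)

volunteers = ['Volunteer1', 'Volunteer2', 'Volunteer3', 'Volunteer4']

# Volunteers are sources; every other subject is treated as a sink.
toy['SourceSink'] = np.where(toy['subject'].isin(volunteers), 'source', 'sink')
toy['Env'] = toy['subject']
print(toy)
```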

For the sake of the sourcetracking analyses, we will focus on samples that aren't controls.

In [4]:
metadata = metadata.loc[metadata.subject != 'Empty']

# Metabolomics source tracking

Let's load up the MS1 features.  We'll also want to convert the IDs from the metabolomics
analysis into IDs compatible with the metadata.

In [5]:
table = pd.read_table('../data/lcms_pos_metabolites.csv', index_col=0)

# Raw string avoids invalid-escape warnings; the dot before mzXML is escaped to match literally
pattern = re.compile(r'P(\d)\S+R([A-Z][0-9]+)\S+\.mzXML')
def convert_id(x):
    p, col = re.findall(pattern, x)[0]
    return '10244.%s%s' % (p, col)
oids = table.columns
ids = list(map(convert_id, oids))
table = Table(table.values, table.index, ids, table_id='MS1 feature table')
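
As a quick check of the ID conversion, the sketch below runs the same pattern on a single file name. The file name is hypothetical and only mimics the `P<digit>...R<letter><number>...mzXML` shape that the pattern expects; real file names in the dataset may differ.

```python
import re

# Same pattern as in the cell above, written as a raw string with the dot escaped.
pattern = re.compile(r'P(\d)\S+R([A-Z][0-9]+)\S+\.mzXML')

def convert_id(x):
    # Capture the digit after 'P' and the letter-number code after 'R'.
    p, col = pattern.findall(x)[0]
    return '10244.%s%s' % (p, col)

# Hypothetical file name, made up for illustration.
example = 'P1_sample_RA12_01_100.mzXML'
new_id = convert_id(example)
print(new_id)  # 10244.1A12
```

If a file name does not match the pattern, `findall` returns an empty list and the `[0]` lookup raises an `IndexError`, which makes malformed names easy to spot.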

We'll want to make sure that the metabolomics samples match up with the sample metadata.

In [6]:
ms_metadata = metadata.loc[table.ids(axis='sample')]
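
In recent pandas versions, `.loc` with a list of labels raises a `KeyError` if any label is missing from the index, so it can help to check the overlap first. The sketch below shows one way to do that; the sample IDs and the `toy_metadata` table are hypothetical stand-ins for `table.ids(axis='sample')` and `metadata`.

```python
import pandas as pd

# Hypothetical stand-ins for table.ids(axis='sample') and the metadata table.
table_ids = ['10244.1A1', '10244.1A2', '10244.1B7']
toy_metadata = pd.DataFrame(
    {'subject': ['Volunteer1', 'Keyboard']},
    index=['10244.1A1', '10244.1A2'],
)

# IDs present in the feature table but absent from the metadata.
missing = sorted(set(table_ids) - set(toy_metadata.index))
print(missing)  # ['10244.1B7']

# Keep only the shared IDs, preserving the order used in the feature table.
shared = [i for i in table_ids if i in toy_metadata.index]
aligned = toy_metadata.loc[shared]
```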

Now we can save the MS table and the metadata table to disk for sourcetracker.

In [7]:
ms_metadata.to_csv('../data/refined_MS_metadata.txt', sep='\t')
with biom_open('../data/MS1.biom', 'w') as f:
    table.to_hdf5(f, "filtered")

We can now run sourcetracking on the metabolites. It is important to note that subsampling will be performed as part of the Gibbs sampling procedure. To avoid memory issues, we will explicitly sample with replacement.
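
To illustrate the difference between the two subsampling modes, here is a minimal NumPy sketch. It shows the statistical idea only and is not sourcetracker's internal code; the counts and depth are made up. One common way to subsample without replacement expands each count into individual observations, which can become memory-hungry when counts are large. Sampling with replacement can instead be drawn directly from a multinomial distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature counts for one sample, and a hypothetical target depth.
counts = np.array([500, 300, 150, 50])
depth = 100

# With replacement: one multinomial draw with probabilities proportional to the counts.
with_replacement = rng.multinomial(depth, counts / counts.sum())

# Without replacement: expand counts into one entry per observation, then pick a subset.
# The expanded array has counts.sum() entries, which is where the memory cost comes from.
expanded = np.repeat(np.arange(len(counts)), counts)
picked = rng.choice(expanded, size=depth, replace=False)
without_replacement = np.bincount(picked, minlength=len(counts))

print(with_replacement, without_replacement)
```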

# Microbial source tracking

Let's load up the deblurred biom table.

In [8]:
table = load_table('../data/final.only-16s.biom')

Before we begin sourcetracking microbes, we'll want to filter out any nonsense samples in the 16S biom table. This includes blanks and low-abundance samples.

In [9]:
sample_filter = lambda v, i, m: v.sum() > 1000
table = table.filter(sample_filter)

rrna_metadata = metadata.loc[table.ids(axis='sample')]
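
The depth filter above keeps a sample only when its total count is strictly greater than 1000. The sketch below mirrors that rule on a small pandas table rather than a biom table; the feature and sample names and the counts are hypothetical.

```python
import pandas as pd

# Hypothetical count table: rows are features, columns are samples.
toy_counts = pd.DataFrame(
    {'s1': [900, 400], 's2': [10, 5], 's3': [600, 400]},
    index=['f1', 'f2'],
)

# Total count per sample.
depth = toy_counts.sum(axis=0)

# Keep samples whose total is strictly greater than 1000, mirroring v.sum() > 1000.
kept = toy_counts.loc[:, depth > 1000]
print(list(kept.columns))  # ['s1']
```

Note that `s3`, with a total of exactly 1000, is dropped because the comparison is strict.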

Now we'll want to save this file to disk to be used later by sourcetracker.

In [10]:
rrna_metadata.to_csv('../data/refined_16S_metadata.txt', sep='\t')
with biom_open('../data/deblur-clean-16s.biom', 'w') as f:
    table.to_hdf5(f, "filtered")

Finally, we can start to run sourcetracking.

The sourcetracking results can be found in the results folder. Now we can visualize these
results on the 3D model using `ili`.