This notebook prepares the data needed to run SourceTracker.

In [1]:
import pandas as pd

In [2]:
fotu = '../../data/clean/rosen.otu_table.counts.clean'
fmeta = '../../data/clean/rosen.metadata.clean'

df = pd.read_csv(fotu, sep='\t', index_col=0)
meta = pd.read_csv(fmeta, sep='\t', index_col=0)

# Prepare samples

First, keep only the lung (`bal`), throat, and gastric samples. As in the rest of the paper, the second time point and lung transplant samples are removed in the step after that.

In [3]:
sites = ['bal', 'throat_swab', 'gastric_fluid']
print(meta.shape)
meta = meta.query('site == @sites')
print(meta.shape)

(586, 958)
(520, 958)
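
For reference, the same site filter can be written with `isin`. This is only a minimal sketch on toy data: the sample IDs are made up, and the `stool` site label is an assumption based on the stool-only patients mentioned later in this notebook.

```python
import pandas as pd

# Toy metadata; the sample IDs and the 'stool' label are hypothetical
toy_meta = pd.DataFrame(
    {'site': ['bal', 'stool', 'throat_swab', 'gastric_fluid', 'stool']},
    index=['a', 'b', 'c', 'd', 'e'])

sites = ['bal', 'throat_swab', 'gastric_fluid']

# Keep only the rows whose site is one of the three sites of interest
kept = toy_meta[toy_meta['site'].isin(sites)]
print(toy_meta.shape, '->', kept.shape)
```

Printing the shape before and after, as done above with the real metadata, is a cheap way to confirm that only rows (samples) were dropped and no columns.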


Also remove the second time points and lung transplant samples.

In [5]:
# Don't include samples from second time point or lung transplants
samples = meta.index
exclude = ['2', 'F', 'sick', 'F2T']
for s in exclude:
    samples = [i for i in samples if not i.endswith(s)]
samples = [i for i in samples if not i.startswith('05')]
len(samples)

425
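
The suffix and prefix rules above can also be applied in a single pass, since `str.endswith` accepts a tuple of suffixes. A minimal sketch on made-up sample IDs (the IDs are hypothetical and only mimic the naming pattern):

```python
# Hypothetical sample IDs, invented to exercise each exclusion rule
toy_ids = ['01-112-7GI', '01-112-7TI', '01-173-4G', '02-099-1B2',
           '03-050-2BF', '04-010-3Gsick', '04-011-8F2T', '05-001-1B']

# Same suffixes as in the cell above, as a tuple for str.endswith
exclude_suffixes = ('2', 'F', 'sick', 'F2T')

# Drop IDs with an excluded suffix or with the '05' prefix
kept = [i for i in toy_ids
        if not i.endswith(exclude_suffixes) and not i.startswith('05')]
print(kept)  # ['01-112-7GI', '01-112-7TI', '01-173-4G']
```

This should give the same result as the loop, because an ID is dropped as soon as it matches any one of the suffixes.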

In [7]:
len(meta.loc[samples, 'subject_id'].unique())

217

That's what we expect: the whole dataset has 222 subjects, 5 of whom have _only_ stool samples, which leaves 217 subjects with at least one throat, lung, or gastric fluid sample.

In [8]:
meta = meta.loc[samples]
df = df.loc[samples]

## Prepare SourceTracker metadata file

In [11]:
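# One row per sample, with the sample ID as a regular column
st = meta['site'].reset_index()
# Lung (bal) samples are the sink; throat and gastric samples are the sources
st['SourceSink'] = st['site'].apply(lambda x: 'sink' if x == 'bal' else 'source')
st = st.rename(columns={'index': '#SampleID', 'site': 'Env'})
st = st[['#SampleID', 'SourceSink', 'Env']]
st.head()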

| | #SampleID | SourceSink | Env |
|---|---|---|---|
| 0 | 01-112-7GI | source | gastric_fluid |
| 1 | 01-112-7TI | source | throat_swab |
| 2 | 01-164-7GI | source | gastric_fluid |
| 3 | 01-164-7TI | source | throat_swab |
| 4 | 01-173-4G | source | gastric_fluid |
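
To make the source/sink assignment easier to follow, here is a minimal sketch of the same construction on toy metadata. The sample IDs are hypothetical, and `rename_axis` is used here (instead of renaming the `index` column afterwards) so the result does not depend on whether the index already has a name.

```python
import pandas as pd

# Toy metadata with one sample per site; the IDs are hypothetical
toy_meta = pd.DataFrame(
    {'site': ['gastric_fluid', 'throat_swab', 'bal']},
    index=['s1G', 's1T', 's1B'])

# Name the index first so reset_index() yields a '#SampleID' column
st_toy = (toy_meta['site']
          .rename_axis('#SampleID')
          .reset_index()
          .rename(columns={'site': 'Env'}))

# Lung (bal) samples are sinks; every other site is a source
st_toy['SourceSink'] = st_toy['Env'].map(
    lambda x: 'sink' if x == 'bal' else 'source')

# Column order used for the mapping file in this notebook
st_toy = st_toy[['#SampleID', 'SourceSink', 'Env']]
print(st_toy)
```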


In [12]:
st.to_csv('../../data/clean/sourcetracker_mapping.txt', index=False, sep='\t')

## Prepare SourceTracker OTU table

In [15]:
# Transpose so that OTUs are rows and samples are columns
df = df.T
df.index.name = 'otu'
df.to_csv('../../data/clean/sourcetracker_table.txt', sep='\t')
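
Before moving on, it can be worth checking that the transposed table and the mapping file refer to exactly the same samples. A minimal sketch on toy data, where the IDs, OTU names, and counts are all hypothetical:

```python
import pandas as pd

# Toy count table in samples-by-OTUs orientation, like df before transposing
counts = pd.DataFrame(
    [[5, 0], [1, 3], [0, 7]],
    index=['s1G', 's1T', 's1B'],
    columns=['otu_a', 'otu_b'])

# Transpose so OTUs are rows and samples are columns
table = counts.T
table.index.name = 'otu'

# Sample IDs as they would appear in the mapping file's '#SampleID' column
mapping_ids = ['s1G', 's1T', 's1B']

# Any ID present on one side only points to a mismatch between the two files
missing_from_table = set(mapping_ids) - set(table.columns)
missing_from_mapping = set(table.columns) - set(mapping_ids)
print(table.shape, missing_from_table, missing_from_mapping)
```

Both sets should be empty here, since the table and the mapping were built from the same list of samples.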

Then, in the terminal, I converted the table to BIOM format:

```
biom convert -i sourcetracker_table.txt -m sourcetracker_mapping.txt -o sourcetracker_table.biom --to-hdf5
```

Finally, I ran SourceTracker:

```
sourcetracker2 gibbs -i sourcetracker_table.biom -m sourcetracker_mapping.txt -o ./sourcetracker/
```