This notebook creates the tables on which supervised learning can be run to identify the OTUs that differ most between the replicates, as well as between the runs as a whole.

In [1]:
%matplotlib inline
import pandas as pd
from q2d2 import rarify
from os.path import join
from functools import partial



In [2]:
from IPython.parallel import Client
clients = Client(profile='data-analysis-conda')
dview = clients.direct_view()

In [3]:
home = '/home/office-microbe-files'
notebooks = '/home/johnchase/office-project/office-microbes/notebooks'

Load mapping file
----------
Load the mapping file and filter it to contain only 16S samples

In [4]:
map_fp = join(home, 'master_map_150908.txt')
sample_md = pd.read_csv(map_fp, sep='\t', index_col=0, dtype=str)
sample_md = sample_md[sample_md['16SITS'] == '16S']

Load the two tables for run 1-3 and run 2-3
----------------


In [23]:
table_13 = pd.read_csv('table_13.txt', sep='\t', index_col=0, dtype='object').astype('float')
table_23 = pd.read_csv('table_23.txt', sep='\t', index_col=0, dtype='object').astype('float')

Rarefy the tables and write to file
------------

In [6]:
rarify1000 = partial(rarify, even_sampling_depth=1000)
df1, df2 = dview.map(rarify1000, [table_13, table_23])
df1.to_csv('table_13_rarefied.txt', sep='\t')
df2.to_csv('table_23_rarefied.txt', sep='\t')
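
For reference, this is a minimal sketch of what I understand rarefying to an even depth to mean: subsample each sample (column) to the same number of reads without replacement, and drop samples that are too shallow. It is not the `q2d2` implementation; `rarefy_sketch`, its `seed` argument and the toy table are all hypothetical.

```python
import numpy as np
import pandas as pd


def rarefy_sketch(table, depth, seed=0):
    """Subsample every sample (column) to `depth` reads without replacement.

    Samples with fewer than `depth` reads are dropped. Illustrative only;
    this is an assumed stand-in, not the q2d2 implementation.
    """
    rng = np.random.default_rng(seed)
    n_otus = table.shape[0]
    kept = {}
    for sample_id in table.columns:
        counts = table[sample_id].to_numpy().astype(int)
        if counts.sum() < depth:
            continue  # too shallow to rarefy to this depth
        # One entry per read, labelled by the row position of its OTU.
        reads = np.repeat(np.arange(n_otus), counts)
        chosen = rng.choice(reads, size=depth, replace=False)
        kept[sample_id] = np.bincount(chosen, minlength=n_otus)
    return pd.DataFrame(kept, index=table.index)


# Toy table: OTUs as rows, samples as columns, like table_13 and table_23.
toy = pd.DataFrame(
    {'s1': [6.0, 3.0, 1.0], 's2': [2.0, 0.0, 1.0], 's3': [0.0, 5.0, 5.0]},
    index=['otu_a', 'otu_b', 'otu_c'],
)
rarefied = rarefy_sketch(toy, depth=5)
```

In the toy table `s2` has only 3 reads, so it is dropped, and the remaining columns each sum to 5.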

```bash
biom convert -i table_13_rarefied.txt -o table_13_rarefied.biom --table-type "OTU table" --to-hdf5
biom convert -i table_23_rarefied.txt -o table_23_rarefied.biom --table-type "OTU table" --to-hdf5
```

Run beta diversity on the full tables
----------------------------------

```bash
parallel_beta_diversity.py -i /scratch/jc33/test_bdiv/tables/table_13_rarefied.biom -o /scratch/jc33/test_bdiv/output -t /scratch/jc33/beta_div/rep_set.tre  

parallel_beta_diversity.py -i /scratch/jc33/test_bdiv/tables/table_23_rarefied.biom -o /scratch/jc33/test_bdiv/output23 -t /scratch/jc33/beta_div/rep_set.tre
```

The replicate IDs are known. Although they could be munged from the mapping file, listing them explicitly is faster and less error-prone.

In [7]:
#These are the replicate_ids
replicate_ids = '''F2F.2.Ce.021
F2F.2.Ce.022
F2F.3.Ce.021
F2F.3.Ce.022
F2W.2.Ca.021
F2W.2.Ca.022
F2W.2.Ce.021
F2W.2.Ce.022
F3W.2.Ce.021
F3W.2.Ce.022
F1F.3.Ca.021
F1F.3.Ca.022
F1C.3.Ca.021
F1C.3.Ca.022
F1W.2.Ce.021
F1W.2.Ce.022
F1W.3.Dr.021
F1W.3.Dr.022
F1C.3.Dr.021
F1C.3.Dr.022'''.split('\n')

In [8]:
office_md = sample_md[sample_md['OfficeSample'] == 'yes']
office_md_13 = office_md[(office_md['Run'] == '1') | (office_md['Run'] == '3')]
office_md_23 = office_md[(office_md['Run'] == '2') | (office_md['Run'] == '3')]
reps_13 = office_md_13[office_md_13['Description'].isin(replicate_ids)]
reps_23 = office_md_23[office_md_23['Description'].isin(replicate_ids)]
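
The comparisons above use `'1'`, `'2'` and `'3'` as strings because the mapping file was read with `dtype=str`. A minimal sketch on a made-up mapping file (the sample IDs and rows are invented; only the column names come from this notebook) shows the same selection, with `isin` as an equivalent spelling of the two-way `|` test:

```python
import io

import pandas as pd

# Toy stand-in for the mapping file; the rows are invented for illustration.
toy_map = io.StringIO(
    "#SampleID\tRun\tOfficeSample\tDescription\n"
    "s1\t1\tyes\tF2F.2.Ce.021\n"
    "s2\t2\tyes\tF2F.2.Ce.021\n"
    "s3\t3\tno\tblank\n"
    "s4\t3\tyes\tF2F.2.Ce.021\n"
)
md = pd.read_csv(toy_map, sep='\t', index_col=0, dtype=str)

office = md[md['OfficeSample'] == 'yes']
# Same selection written two ways; Run holds strings, not integers.
either = office[(office['Run'] == '1') | (office['Run'] == '3')]
runs_13 = office[office['Run'].isin(['1', '3'])]
```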

In [9]:
# This seems redundant, but the two-sided mask is needed to keep every row whose Description is duplicated
reps_13 = reps_13[reps_13.duplicated('Description', keep='last') | reps_13.duplicated('Description')]
reps_23 = reps_23[reps_23.duplicated('Description', keep='last') | reps_23.duplicated('Description')]

Now we have the 10 replicates from each group of runs that we are interested in.
'F2F.3.Ce.022' was not included in the map from Argonne, so there are only 9 replicates for runs 2-3.
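
As a quick check of the duplicate filter used above, here is a sketch on a toy frame (the descriptions and sample IDs are invented). Combining `duplicated(keep='last')` with the default `duplicated()` marks every row whose `Description` occurs more than once, which is the same mask pandas returns for `keep=False`; a description that appears only once is dropped.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Description': ['rep_a', 'rep_a', 'rep_b', 'rep_c', 'rep_c']},
    index=['s1', 's2', 's3', 's4', 's5'],
)

# The two-sided mask from the cell above ...
two_sided = toy.duplicated('Description', keep='last') | toy.duplicated('Description')
# ... and the single-call equivalent.
keep_false = toy.duplicated('Description', keep=False)

paired = toy[two_sided]  # 'rep_b' occurs once, so s3 is dropped
```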

Filter the unrarefied tables to include only the replicate samples
-----------------

In [10]:
table_13_replicates = table_13[reps_13.index]
table_23_replicates = table_23[reps_23.index]
table_13_replicates.to_csv('replicate_filtered_tables/table_13_replicates.txt', sep='\t')
table_23_replicates.to_csv('replicate_filtered_tables/table_23_replicates.txt', sep='\t')
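
Selecting columns with `table_13[reps_13.index]` assumes every sample ID in the mapping file is also a column of the table; in recent pandas versions a missing ID raises a `KeyError`. A minimal sketch of a more defensive selection, on toy data with invented IDs:

```python
import pandas as pd

toy_table = pd.DataFrame(
    {'s1': [4.0, 0.0], 's2': [1.0, 2.0], 's3': [0.0, 7.0]},
    index=['otu_a', 'otu_b'],
)
wanted_ids = pd.Index(['s1', 's3', 's9'])  # 's9' is absent from the table

# Keep only the IDs that exist as columns, and record the ones that do not.
present = wanted_ids.intersection(toy_table.columns)
missing = wanted_ids.difference(toy_table.columns)
subset = toy_table[present]
```

Inspecting `missing` before writing the filtered table makes it obvious when a replicate was lost upstream.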

In [11]:
table_13_replicates_rarified = rarify1000(table_13_replicates)
table_23_replicates_rarified = rarify1000(table_23_replicates)
table_13_replicates_rarified.to_csv('replicate_filtered_tables/table_13_replicates_rarified.txt', sep='\t')
table_23_replicates_rarified.to_csv('replicate_filtered_tables/table_23_replicates_rarified.txt', sep='\t')

Run supervised learning on the resulting files
--------------------------------------

```bash
supervised_learning.py -i table_13_replicates_rarified.biom -m /home/office-microbe-files/master_map_150908.txt -c Run -o sl_13_out -e cv5  
supervised_learning.py -i table_23_replicates_rarified.biom -m /home/office-microbe-files/master_map_150908.txt -c Run -o sl_23_out -e cv5
```

Load summary stats for supervised learning
--------------

In [18]:
feat_13 = pd.read_csv('replicate_filtered_tables/sl_13_out/feature_importance_scores.txt', sep='\t', index_col=0)
feat_23 = pd.read_csv('replicate_filtered_tables/sl_23_out/feature_importance_scores.txt', sep='\t', index_col=0)

The workflow that filtered OTUs based on the blank samples used 10 samples from each run, 20 in total. Here we therefore filter out the top 20, 200 and 2000 OTUs; note that the table names below keep the labels 10, 100 and 1000.

Filter the tables at varying levels
--------------------

In [21]:
table_13_otus_10 = rarify(table_13.drop(feat_13.index[:20], inplace=False), 1000)
table_13_otus_100 = rarify(table_13.drop(feat_13.index[:200], inplace=False), 1000)
table_13_otus_1000 = rarify(table_13.drop(feat_13.index[:2000], inplace=False), 1000)

table_23_otus_10 = rarify(table_23.drop(feat_23.index[:20], inplace=False), 1000)
table_23_otus_100 = rarify(table_23.drop(feat_23.index[:200], inplace=False), 1000)
table_23_otus_1000 = rarify(table_23.drop(feat_23.index[:2000], inplace=False), 1000)
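
The cell above relies on the feature importance file already being sorted from most to least important, which I am assuming rather than checking. A minimal sketch that makes the ranking explicit before dropping the top `n` OTUs; `drop_top_features`, the `importance` column name and the toy data are hypothetical:

```python
import pandas as pd


def drop_top_features(table, scores, n, score_col='importance'):
    """Return a copy of `table` without the `n` highest-scoring features.

    `score_col` is a placeholder name; use whichever column of the
    feature importance file holds the score.
    """
    ranked = scores.sort_values(score_col, ascending=False)
    top_ids = ranked.index[:n]
    # errors='ignore' skips feature IDs that are not rows of the table.
    return table.drop(top_ids, errors='ignore')


toy_table = pd.DataFrame(
    {'s1': [5.0, 1.0, 0.0, 2.0], 's2': [4.0, 0.0, 3.0, 2.0]},
    index=['otu_a', 'otu_b', 'otu_c', 'otu_d'],
)
toy_scores = pd.DataFrame(
    {'importance': [0.01, 0.30, 0.20, 0.05]},
    index=['otu_a', 'otu_b', 'otu_c', 'otu_d'],
)
filtered = drop_top_features(toy_table, toy_scores, n=2)
```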

In [None]:
otus = [10, 100, 1000, 10, 100, 1000]
runs = ['13']*3 + ['23']*3
paths = [join(notebooks, 'blank_filtered_tables/table_{0}_otus_{1}.txt'.format(run, otu)) for run, otu in zip(runs, otus)]
dfs = [table_13_otus_10, table_13_otus_100, table_13_otus_1000, 
       table_23_otus_10, table_23_otus_100, table_23_otus_1000]

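# df_to_file is not defined anywhere in this notebook; the helper below is an
# assumed stand-in that writes each table as a tab-separated file.
def df_to_file(df, path):
    df.to_csv(path, sep='\t')

h = dview.map(df_to_file, dfs, paths)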
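
The run and OTU labels above are written out by hand so that `zip` pairs them correctly. As a sketch, `itertools.product` gives the same six paths in the same order from the two short lists; `base` is a placeholder for the notebooks directory.

```python
from itertools import product
from os.path import join

base = 'notebooks'  # placeholder for the real notebooks directory
template = 'blank_filtered_tables/table_{0}_otus_{1}.txt'

# Every (run, otu) combination, with runs varying slowest.
paths = [join(base, template.format(run, otu))
         for run, otu in product(['13', '23'], [10, 100, 1000])]

# The hand-written version from the cell above, for comparison.
zip_runs = ['13'] * 3 + ['23'] * 3
zip_otus = [10, 100, 1000, 10, 100, 1000]
zip_paths = [join(base, template.format(run, otu))
             for run, otu in zip(zip_runs, zip_otus)]
```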

Convert the tables to biom format
----------------

```bash
biom convert -i replicate_filtered_tables/filtered_tables/table_13_otus_10.txt -o replicate_filtered_tables/filtered_tables/table_13_otus_10.biom --to-hdf5 --table-type "OTU table" &
biom convert -i replicate_filtered_tables/filtered_tables/table_13_otus_100.txt -o replicate_filtered_tables/filtered_tables/table_13_otus_100.biom --to-hdf5 --table-type "OTU table" &
biom convert -i replicate_filtered_tables/filtered_tables/table_13_otus_1000.txt -o replicate_filtered_tables/filtered_tables/table_13_otus_1000.biom --to-hdf5 --table-type "OTU table" &
biom convert -i replicate_filtered_tables/filtered_tables/table_23_otus_10.txt -o replicate_filtered_tables/filtered_tables/table_23_otus_10.biom --to-hdf5 --table-type "OTU table" &
biom convert -i replicate_filtered_tables/filtered_tables/table_23_otus_100.txt -o replicate_filtered_tables/filtered_tables/table_23_otus_100.biom --to-hdf5 --table-type "OTU table" &
biom convert -i replicate_filtered_tables/filtered_tables/table_23_otus_1000.txt -o replicate_filtered_tables/filtered_tables/table_23_otus_1000.biom --to-hdf5 --table-type "OTU table"
```
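
Since the six commands differ only in the run and OTU labels, they could also be generated rather than typed. This is a sketch only: `biom_convert_command` is a hypothetical helper, and it uses just the flags that already appear in the commands above.

```python
import shlex


def biom_convert_command(txt_path):
    """Build the `biom convert` command line for one tab-separated table."""
    if not txt_path.endswith('.txt'):
        raise ValueError('expected a .txt table, got {0}'.format(txt_path))
    biom_path = txt_path[:-len('.txt')] + '.biom'
    args = ['biom', 'convert', '-i', txt_path, '-o', biom_path,
            '--to-hdf5', '--table-type', 'OTU table']
    # shlex.join quotes the argument that contains a space.
    return shlex.join(args)


table_dir = 'replicate_filtered_tables/filtered_tables'
commands = [
    biom_convert_command('{0}/table_{1}_otus_{2}.txt'.format(table_dir, run, otu))
    for run in ('13', '23')
    for otu in (10, 100, 1000)
]
```

The resulting strings can be printed into a shell script or passed to a job scheduler.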

Run beta diversity
---------------

```bash
#!/bin/bash
#SBATCH --job-name=bd_sl
#SBATCH --output=/scratch/jc33/bdiv/std_out.txt
#SBATCH --error=/scratch/jc33/bdiv/std_err.txt
#SBATCH --workdir=/scratch/jc33/beta_div
#SBATCH --mail-type=ALL
#SBATCH --time=5-00:00:00
#SBATCH --mem-per-cpu=32000

module load qiime

srun parallel_beta_diversity.py -i /scratch/jc33/bdiv/replicate_filtered_tables/filtered_tables/table_13_otus_10.biom -o /scratch/jc33/bdiv/replicate_filtered_tables/filtered_tables/bdiv_13_10_out -t /scratch/jc33/beta_div/rep_set.tre &  

parallel_beta_diversity.py -i /scratch/jc33/bdiv/replicate_filtered_tables/filtered_tables/table_13_otus_100.biom -o /scratch/jc33/bdiv/replicate_filtered_tables/filtered_tables/bdiv_13_100_out -t /scratch/jc33/beta_div/rep_set.tre &  

parallel_beta_diversity.py -i /scratch/jc33/bdiv/replicate_filtered_tables/filtered_tables/table_13_otus_1000.biom -o /scratch/jc33/bdiv/replicate_filtered_tables/filtered_tables/bdiv_13_1000_out -t /scratch/jc33/beta_div/rep_set.tre &  

parallel_beta_diversity.py -i /scratch/jc33/bdiv/replicate_filtered_tables/filtered_tables/table_23_otus_10.biom -o /scratch/jc33/bdiv/replicate_filtered_tables/filtered_tables/bdiv_23_10_out -t /scratch/jc33/beta_div/rep_set.tre &  

parallel_beta_diversity.py -i /scratch/jc33/bdiv/replicate_filtered_tables/filtered_tables/table_23_otus_100.biom -o /scratch/jc33/bdiv/replicate_filtered_tables/filtered_tables/bdiv_23_100_out -t /scratch/jc33/beta_div/rep_set.tre &  

parallel_beta_diversity.py -i /scratch/jc33/bdiv/replicate_filtered_tables/filtered_tables/table_23_otus_1000.biom -o /scratch/jc33/bdiv/replicate_filtered_tables/filtered_tables/bdiv_23_1000_out -t /scratch/jc33/beta_div/rep_set.tre

# Wait for the backgrounded jobs above to finish before the batch script exits
wait
```

The workflow
-----------

Unlike the previous example, where I was looking at blanks, here I do want to rarefy the samples before comparing them. The idea is that *anything* found in the blanks should not have been there, whereas here we expect the composition to be relatively similar between replicates, so we rarefy to an even depth in order to compare them directly.
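
To illustrate why depth matters when comparing replicates, here is a small sketch with invented counts. Two replicates with identical composition but a tenfold difference in sequencing depth look very different on raw counts and nearly identical once both are subsampled to the same depth. Bray-Curtis is used only because it is easy to compute by hand; it is a stand-in, not necessarily the metric produced by the beta diversity commands in this notebook.

```python
import numpy as np


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.abs(a - b).sum() / (a + b).sum()


def subsample(counts, depth, rng):
    """Draw `depth` reads without replacement from a count vector."""
    counts = np.asarray(counts, dtype=int)
    reads = np.repeat(np.arange(len(counts)), counts)
    chosen = rng.choice(reads, size=depth, replace=False)
    return np.bincount(chosen, minlength=len(counts))


rng = np.random.default_rng(0)
rep_1 = np.array([500, 300, 200])     # 1,000 reads
rep_2 = np.array([5000, 3000, 2000])  # same proportions, 10,000 reads

raw_distance = bray_curtis(rep_1, rep_2)
rarefied_distance = bray_curtis(subsample(rep_1, 1000, rng),
                                subsample(rep_2, 1000, rng))
```

On raw counts the dissimilarity is 9000 / 11000, about 0.82, purely because of depth; after subsampling only a small amount of sampling noise remains.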

1. Rarefy the full tables
2. Compare the differences with beta diversity (this was done in the previous workflow)
3. Filter tables to only contain replicate samples
4. Compare these with supervised learning and/or differential abundance
5. Filter out the top 10, 100, 1000 OTUs from the full table and rerun beta diversities

Repeat the above steps, but this time compare the run to