This notebook creates the tables on which supervised learning can be run to identify the OTUs that differ most between the replicates, as well as between the runs as a whole.

In [1]:
%matplotlib inline
import pandas as pd
from q2d2 import rarify
from os.path import join
from functools import partial



In [2]:
from IPython.parallel import Client
clients = Client(profile='data-analysis-conda')
dview = clients.direct_view()

In [3]:
home = '/home/office-microbe-files'
notebooks = '/home/johnchase/office-project/office-microbes/notebooks'

Load mapping file
----------
Load the mapping file and filter it to contain only the 16S samples.

In [4]:
map_fp = join(home, 'master_map_150908.txt')
sample_md = pd.read_csv(map_fp, sep='\t', index_col=0, dtype=str)
sample_md = sample_md[sample_md['16SITS'] == '16S']

Load the two tables
----------------


In [5]:
table_13 = pd.read_csv('table_13.txt', sep='\t', index_col=0, dtype='object').astype('float')
table_23 = pd.read_csv('table_23.txt', sep='\t', index_col=0, dtype='object').astype('float')

Rarefy the tables
------------

In [6]:
rarify1000 = partial(rarify, even_sampling_depth=1000)
df1, df2 = dview.map(rarify1000, [table_13, table_23])
df1.to_csv('table_13_rarefied.txt', sep='\t')
df2.to_csv('table_23_rarefied.txt', sep='\t')
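For readers unfamiliar with rarefaction, the idea can be sketched as subsampling every sample to the same number of counts without replacement. The function below, `rarefy_sketch`, is a hypothetical stand-in written for illustration only; it is not the `q2d2` implementation, and the choice to drop samples below the requested depth is an assumption of this sketch.

```python
import numpy as np
import pandas as pd


def rarefy_sketch(table, depth, seed=0):
    """Subsample each sample (column) to `depth` counts without replacement.

    Hypothetical illustration, not the q2d2 `rarify` function.
    Assumption: samples with fewer than `depth` total counts are dropped.
    """
    rng = np.random.RandomState(seed)
    result = {}
    for sample_id in table.columns:
        counts = table[sample_id].values.astype(int)
        if counts.sum() < depth:
            continue  # too shallow to subsample to the requested depth
        # One entry per observed sequence, labelled by the OTU's row position
        pool = np.repeat(np.arange(len(counts)), counts)
        picked = rng.choice(pool, size=depth, replace=False)
        result[sample_id] = np.bincount(picked, minlength=len(counts))
    return pd.DataFrame(result, index=table.index, dtype=float)


# Toy OTU table (rows are OTUs, columns are samples); values are invented
toy = pd.DataFrame(
    {'s1': [50, 30, 20], 's2': [5, 0, 2], 's3': [10, 10, 80]},
    index=['otu_a', 'otu_b', 'otu_c'],
)
rarefied = rarefy_sketch(toy, depth=20)
```

In this toy table `s2` has only 7 counts, so it does not appear in `rarefied`, while `s1` and `s3` each sum to exactly 20.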

The replicate IDs are already known. They could be munged from the mapping file, but listing them explicitly is faster and less error-prone.

In [7]:
# Known replicate IDs (these match values in the 'Description' column)
replicate_ids = '''F2F.2.Ce.021
F2F.2.Ce.022
F2F.3.Ce.021
F2F.3.Ce.022
F2W.2.Ca.021
F2W.2.Ca.022
F2W.2.Ce.021
F2W.2.Ce.022
F3W.2.Ce.021
F3W.2.Ce.022
F1F.3.Ca.021
F1F.3.Ca.022
F1C.3.Ca.021
F1C.3.Ca.022
F1W.2.Ce.021
F1W.2.Ce.022
F1W.3.Dr.021
F1W.3.Dr.022
F1C.3.Dr.021
F1C.3.Dr.022'''.split('\n')

In [8]:
office_md = sample_md[sample_md['OfficeSample'] == 'yes']
office_md_13 = office_md[(office_md['Run'] == '1') | (office_md['Run'] == '3')]
office_md_23 = office_md[(office_md['Run'] == '2') | (office_md['Run'] == '3')]
reps_13 = office_md_13[office_md_13['Description'].isin(replicate_ids)]
reps_23 = office_md_23[office_md_23['Description'].isin(replicate_ids)]

In [9]:
# This looks redundant, but both masks are needed: each one alone flags only
# one copy of a duplicated Description, and we want to keep every copy.
reps_13 = reps_13[reps_13.duplicated('Description', keep='last') | reps_13.duplicated('Description')]
reps_23 = reps_23[reps_23.duplicated('Description', keep='last') | reps_23.duplicated('Description')]
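The duplicate filter above keeps every row whose `Description` occurs more than once. The toy example below (invented sample IDs and descriptions) shows why a single `duplicated` mask is not enough, and that `keep=False`, available in pandas 0.17 and later, selects the same rows in one call.

```python
import pandas as pd

# Invented metadata: A.021 and C.021 were sequenced twice, B.022 only once
toy_md = pd.DataFrame(
    {'Description': ['A.021', 'A.021', 'B.022', 'C.021', 'C.021'],
     'Run': ['1', '3', '1', '1', '3']},
    index=['s1', 's2', 's3', 's4', 's5'],
)

# Default keep='first' flags only the later copies (s2, s5)
later_copies = toy_md.duplicated('Description')
# keep='last' flags only the earlier copies (s1, s4)
earlier_copies = toy_md.duplicated('Description', keep='last')
both = toy_md[later_copies | earlier_copies]

# keep=False flags every member of a duplicated group at once
same = toy_md[toy_md.duplicated('Description', keep=False)]
```

Both `both` and `same` contain `s1`, `s2`, `s4` and `s5`; the unpaired sample `s3` is dropped.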

We now have the 10 replicates of interest from each group of runs.
'F2F.3.Ce.022' was not included in the map from Argonne, so there are only 9 run 2-3 replicates.

Filter the unrarefied tables to include only the replicate samples
-----------------

In [10]:
table_13_replicates = table_13[reps_13.index]
table_23_replicates = table_23[reps_23.index]
table_13_replicates.to_csv('replicate_filtered_tables/table_13_replicates.txt', sep='\t')
table_23_replicates.to_csv('replicate_filtered_tables/table_23_replicates.txt', sep='\t')
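Selecting columns with a metadata index assumes that every sample ID in the metadata is also a column of the table. A minimal sketch of a defensive variant, using invented data and hypothetical names (`toy_table`, `toy_reps`), restricts the selection to the IDs that both objects share and reports the rest:

```python
import pandas as pd

# Invented OTU table: rows are OTUs, columns are sample IDs
toy_table = pd.DataFrame(
    {'s1': [3.0, 0.0], 's2': [1.0, 4.0], 's3': [0.0, 2.0]},
    index=['otu_a', 'otu_b'],
)
# Invented replicate metadata; 's4' has no column in the table
toy_reps = pd.DataFrame({'Description': ['X.021', 'X.021']}, index=['s1', 's4'])

# Sample IDs present in both the metadata and the table
shared_ids = toy_reps.index.intersection(toy_table.columns)
# Sample IDs listed in the metadata but absent from the table
missing_ids = toy_reps.index.difference(toy_table.columns)

toy_replicates = toy_table[shared_ids]
```

Whether this guard is needed depends on how the tables and the mapping file were produced; it is shown here only as an option.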

In [11]:
table_13_replicates_rarified = rarify1000(table_13_replicates)
table_23_replicates_rarified = rarify1000(table_23_replicates)
table_13_replicates_rarified.to_csv('replicate_filtered_tables/table_13_replicates_rarified.txt', sep='\t')
table_23_replicates_rarified.to_csv('replicate_filtered_tables/table_23_replicates_rarified.txt', sep='\t')

The workflow
-----------

Unlike the previous example, which looked at blanks, here I do want to rarefy the samples before comparing them. In the blanks, *anything* present should not have been there. Here we expect the composition of the replicates to be relatively similar, and rarefying to an even depth lets us compare them directly.

1. Rarefy the full tables
2. Compare the differences with beta diversity (this was done in the previous workflow)
3. Filter tables to only contain replicate samples
4. Compare these with supervised learning and/or differential abundance
5. Filter out the top 10, 100, 1000 OTUs from the full table and rerun beta diversities
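Step 5 of the list above can be sketched as follows. This assumes that step 4 yields one score per OTU (for example a feature importance); the helper `drop_top_otus`, the series `toy_scores` and all values are hypothetical and only illustrate the filtering itself.

```python
import pandas as pd


def drop_top_otus(table, scores, n):
    """Return `table` without the `n` highest-scoring OTUs (rows).

    Hypothetical helper; `scores` is a Series indexed by OTU ID.
    """
    top = scores.sort_values(ascending=False).index[:n]
    return table.drop(top)


# Invented OTU table and invented per-OTU scores
toy_table = pd.DataFrame(
    {'s1': [10.0, 0.0, 5.0, 1.0], 's2': [8.0, 2.0, 0.0, 3.0]},
    index=['otu_a', 'otu_b', 'otu_c', 'otu_d'],
)
toy_scores = pd.Series([0.9, 0.1, 0.5, 0.3], index=toy_table.index)

# One filtered table per cutoff; the real cutoffs would be 10, 100 and 1000
filtered = {n: drop_top_otus(toy_table, toy_scores, n) for n in (1, 2)}
```

Each filtered table could then be passed to the same beta diversity step as the full table.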

Repeat the above steps, but this time compare the run to

In [12]:
sample_md[sample_md['Description'] == 'F1F.3.Ca.021']

#SampleID,16SITS,BarcodeSequence,BlankExtraction,BlankSwab,City,Cooler,Date,Description,Duplicate,Event,...,Period,PlateLocation,ProjectID,Replicate,Row,Run,Time,TimeOfCollection,WeekDay,SampleType
3OY7IWPU4ULSM,16S,GTTCCTCCATTA,no,no,flagstaff,,8/6/13 16:45,F1F.3.Ca.021,yes,21,...,1,floor,F1F.3.Ca.021,yes,3,1,1645,16:45:00,Tuesday,office
MQ64KAEXGK27,16S,CCTGACACACAC,no,no,flagstaff,,8/6/13,F1F.3.Ca.021,yes,21,...,1,floor,F1F.3.Ca.021,yes,3,3,1645,,Tuesday,office
6H3VNIOXDI5T9,16S,CTATTAAGCGGC,no,no,flagstaff,,8/6/13,F1F.3.Ca.021,yes,21,...,1,floor,F1F.3.Ca.021,yes,3,4,1645,,Tuesday,office


In [13]:
table_13_replicates['MQ64KAEXGK27'].sum()

146.0
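A per-sample total like the one above can be compared against the even sampling depth of 1000 used earlier. The sketch below does this for a whole table at once. The data are invented, and how `rarify` treats samples below the depth is not shown in this notebook, so the check only flags them.

```python
import pandas as pd

depth = 1000  # the even sampling depth used for rarefaction above

# Invented table: 'sample_a' totals 146 counts, 'sample_b' totals 1250
toy_table = pd.DataFrame(
    {'sample_a': [100.0, 46.0, 0.0], 'sample_b': [900.0, 300.0, 50.0]},
    index=['otu_a', 'otu_b', 'otu_c'],
)

# Total counts per sample (column sums)
totals = toy_table.sum()
# Samples that cannot be subsampled to `depth` without replacement
shallow = totals[totals < depth]
```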