# Prior Probabilities

This notebook describes how to generate and use prior probability compositions to train ``q2-feature-classifier`` for taxonomic classification of marker-gene sequence data.

Some classification methods use prior probabilities to represent how likely each class label is to be observed in a population before any sequence evidence is considered. Training with prior probabilities may improve classification accuracy, particularly when specific taxa are unique to an environment. The alternative is a uniform prior, which assumes that every class is equally likely to be observed in a given environment. A uniform prior is a prudent choice when handling unknown samples, but a naive one when handling samples from previously characterized sample types.
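
To make the effect of a prior concrete, here is a minimal sketch of Bayes' rule applied to a single read and two candidate taxa. The likelihood values, prior values, and taxon names are invented for illustration; this is not how ``q2-feature-classifier`` is implemented internally.

```python
def posterior(likelihoods, priors):
    """Return normalized P(class | read) from P(read | class) and P(class)."""
    unnormalized = {c: likelihoods[c] * priors[c] for c in likelihoods}
    total = sum(unnormalized.values())
    return {c: p / total for c, p in unnormalized.items()}

# Hypothetical likelihoods of one read under two closely related taxa
likelihoods = {"Lactobacillus": 0.40, "Pediococcus": 0.60}

# Uniform prior: both taxa are assumed equally likely in the environment
uniform_prior = {"Lactobacillus": 0.5, "Pediococcus": 0.5}
# Hypothetical informed prior for an environment dominated by Lactobacillus
informed_prior = {"Lactobacillus": 0.9, "Pediococcus": 0.1}

uniform_post = posterior(likelihoods, uniform_prior)
informed_post = posterior(likelihoods, informed_prior)

# Report the most probable taxon under each prior
print(max(uniform_post, key=uniform_post.get))
print(max(informed_post, key=informed_post.get))
```

With these made-up numbers the uniform prior favors the taxon with the higher likelihood, whereas the informed prior shifts the call toward the taxon expected to dominate the environment.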

This notebook serves two functions:

    1) generating prior probability trainers to train ``q2-feature-classifier``.
    2) comparing classification accuracy of training classifiers with:
        a) a uniform prior.
        b) prior probabilities appropriate for sample type X.
        c) prior probabilities inappropriate for sample type X.
        
This notebook examines the effect of prior probabilities in classifying two different types of data sets:

    1) simulated communities generated from observed compositions of natural communities.
    2) natural communities in which the true composition is unknown, but well studied.
    
The general approach is to:

    1) Select an appropriate natural community (NC).
    2) Generate a prior probability training set as a function of NC taxonomic composition at levels 1-6.
    3) Select an appropriate reference dataset (RD).
    4) Extract RD sequences x,y,z (matching NC taxonomies x,y,z).
    5) Mix x,y,z at defined distributions D to make simulated community S.
    6) Provide prior distributions to classifier C.
    7) Assign taxonomy to S using C.
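
Step 2 of this approach can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function name `priors_at_level` is hypothetical, the taxonomy strings follow the semicolon-delimited Greengenes style, and the counts are invented rather than taken from any of the studies in this notebook.

```python
from collections import Counter

def priors_at_level(taxon_counts, level):
    """Collapse semicolon-delimited taxonomy strings to the first `level`
    ranks and return relative frequencies that sum to 1."""
    collapsed = Counter()
    for taxonomy, count in taxon_counts.items():
        ranks = [rank.strip() for rank in taxonomy.split(";")]
        collapsed["; ".join(ranks[:level])] += count
    total = sum(collapsed.values())
    return {taxon: n / total for taxon, n in collapsed.items()}

# Hypothetical observation counts for a natural community (not real data)
observed = {
    "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__brevis": 60,
    "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__": 20,
    "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
    "o__Rhodospirillales; f__Acetobacteraceae; g__Acetobacter; s__": 20,
}

# Level 6 corresponds to genus in a seven-rank taxonomy
priors = priors_at_level(observed, level=6)
for taxon, probability in priors.items():
    print(taxon, probability)
```

Collapsing to level 6 merges the two Lactobacillus entries, so the prior is expressed per genus rather than per species.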


In [26]:
from os.path import join, expandvars 

from qiime.plugins import feature_classifier, taxa
from qiime import Artifact
#from tax_credit.process_mocks import *

import pandas as pd

In [4]:
# Project directory
project_dir = expandvars("$HOME/Desktop/projects/short-read-tax-assignment")
# Directory containing reference sequence databases
reference_database_dir = expandvars("$HOME/Desktop/ref_dbs/")
# prior probabilities directory
prior_probabilities_dir = join(project_dir, "data", "prior_probabilities")

## Generating prior probabilities from natural community observations
This section describes how we generate prior probability trainers. To do so, we use taxonomic composition observations from previously analyzed natural communities. This does introduce bias, in that we are assuming that those previous observations are accurate, or at least precise enough that training with this prior probability will yield better predictive accuracy than a uniform prior.

This can be performed on any sample type and used for subsequent analysis of samples, or re-analysis of the same samples. Samples should be well-characterized and the microbial composition reasonably well-known. For this reason, we demonstrate using three defined sample types from the following data sets. Each is publicly available, with hyperlinks to the QIITA study and open-access publications below.

[Wine fermentation microbiota (bacterial 16S)](https://qiita.ucsd.edu/study/description/10119)    ([Bokulich et al. mBio 2016](http://dx.doi.org/10.1128/mBio.00631-16))

[Sour beer fermentation microbiota (bacterial 16S)](https://qiita.ucsd.edu/study/description/1689)    ([Bokulich et al. PLoS One 2012](http://dx.doi.org/10.1371/journal.pone.0035507))

["Moving pictures of the human microbiome" (human feces bacterial 16S)](https://qiita.ucsd.edu/study/description/550)     ([Caporaso et al. Genome Biol 2012](http://dx.doi.org/10.1186/gb-2011-12-5-r50))

Download the taxonomy-annotated biom files and mapping files and place them in a single directory, which we designate as ``natural_community_dir`` below. It is critical that only one sample type is used to make each prior. Hence, any dissimilar sample types must be removed before classification. For example, the "Moving Pictures" dataset contains samples from several body sites, but we want to focus on stool samples here.
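
As a rough sketch of restricting a dataset to one sample type, the snippet below reads a tab-separated mapping file and keeps only the sample IDs of the requested type. The helper name `sample_ids_of_type`, the tiny in-memory mapping file, and the values `feces` and `tongue` are assumptions for illustration; check the column names and values in the actual mapping file before relying on them.

```python
import csv
import io

# Hypothetical tab-separated mapping file; real QIITA files have many more columns
mapping_text = (
    "#SampleID\tsample_type\n"
    "S1\tfeces\n"
    "S2\ttongue\n"
    "S3\tfeces\n"
)

def sample_ids_of_type(handle, sample_type, column="sample_type"):
    """Return the IDs of all samples whose `column` equals `sample_type`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["#SampleID"] for row in reader if row[column] == sample_type]

# Keep only the samples that belong to the sample type of interest
keep = sample_ids_of_type(io.StringIO(mapping_text), "feces")
print(keep)
```

The resulting list of IDs could then be used to subset the biom table so that only one sample type contributes to the prior.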

(Note that QIITA uses different processing parameters from those used in the original studies, and thus results may differ slightly from those in publication. QIITA conveniently provides pre-processed data, such that taxonomy assignments can be immediately produced from multiple studies, facilitating the training of prior probabilities.)

In [23]:
# Select an appropriate natural community (NC).
# Directory containing natural community observations (taxonomy assignments)
natural_community_dir = expandvars("$HOME/Desktop/projects/natural_community_priors/")

# Create dictionary that maps datasets to their biom and mapping files, and to the marker gene type
# {dataset : (db, biom_table, rep_seqs, metadata_map)}
natural_datasets = {
 #'moving_pics' : ('gg_13_8_otus', '67_otu_table.biom',
 # '550_prep_72_qiime_20160804-161109.txt'), # Moving pictures
 #'sour_beer' : ('gg_13_8_otus',
 #               join(natural_community_dir, 'beer-final.only-16s.biom'), # https://qiita.ucsd.edu/download/95214
 #               join(natural_community_dir, 'beer-final.seqs.fa.no_artifacts'), # https://qiita.ucsd.edu/download/95213
 #               join(natural_community_dir, '1989_prep_698_qiime_20150818-225324.txt')), # https://qiita.ucsd.edu/download/25875
 'wine' : ('gg_13_8_otus',
           join(natural_community_dir, 'wine-final.only-16s.biom'), # https://qiita.ucsd.edu/download/102065
           join(natural_community_dir, 'wine-final.seqs.fa.no_artifacts'), # https://qiita.ucsd.edu/download/102064
           join(natural_community_dir, '10119_prep_258_qiime_20160529-140847.txt')), # https://qiita.ucsd.edu/download/69224
}

# Set paths to reference database seqs, taxonomy, and trained classifiers
# {db : (reference_seqs, reference_tax, classifier_path)}
reference_dbs = {'gg_13_8_otus' : (join(reference_database_dir, 
                                        'gg_13_8_otus/rep_set/97_otus_515f-806r_trim250.fasta'), 
                                   join(reference_database_dir, 
                                        'gg_13_8_otus/taxonomy/97_otu_taxonomy.txt'),
                                   expandvars("$HOME/Desktop/tools/gg-13-8-99-515-806-nb-classifier.qza"))
                }

In [24]:
feature_classifier.methods.classify.signature

reads : (FeatureData[Sequence], <class 'q2_types.feature_data._transformer.DNAIterator'>), classifier : (TaxonomicClassifier, <class 'dict'>), chunk_size : (Int, <class 'int'>), n_jobs : (Int, <class 'int'>), pre_dispatch : (Str, <class 'str'>), confidence : (Float, <class 'float'>) -> ((FeatureData[Taxonomy], <class 'pandas.core.frame.DataFrame'>),)

In [28]:
# DataFrame.from_csv has been removed from pandas; read_csv with index_col=0 is the replacement
df = pd.read_csv(join(natural_community_dir, '10119_prep_258_qiime_20160529-140847.txt'), sep='\t', index_col=0)
df

#SampleID,BarcodeSequence,LinkerPrimerSequence,center_name,center_project_name,emp_status,experiment_center,experiment_design_description,experiment_title,illumina_technology,instrument_model,...,sample_tube_id,sample_type,scientific_name,stage,taxon_id,title,variety,vineyard,winery,Description
10119.MT.283,CGATTACG,GTGCCAGCMGCCGCGGTAA,UCDAES,Microbiota of regional grapes and wine ferment...,EMP_Processed,UCDCAES,Microbiota of regional grapes and wine ferment...,Microbiota of regional grapes and wine ferment...,MiSeq,Illumina MiSeq,...,MT.283,must,food metagenome,B,870726,Microbiota of regional grapes and wine ferment...,Cabernet_Sauvignon,21,NN,MT.283
10119.MT.282,CACTGTCT,GTGCCAGCMGCCGCGGTAA,UCDAES,Microbiota of regional grapes and wine ferment...,EMP_Processed,UCDCAES,Microbiota of regional grapes and wine ferment...,Microbiota of regional grapes and wine ferment...,MiSeq,Illumina MiSeq,...,MT.282,must,food metagenome,B,870726,Microbiota of regional grapes and wine ferment...,Cabernet_Sauvignon,21,NN,MT.282
10119.MT.281,CAGAGACA,GTGCCAGCMGCCGCGGTAA,UCDAES,Microbiota of regional grapes and wine ferment...,EMP_Processed,UCDCAES,Microbiota of regional grapes and wine ferment...,Microbiota of regional grapes and wine ferment...,MiSeq,Illumina MiSeq,...,MT.281,must,food metagenome,B,870726,Microbiota of regional grapes and wine ferment...,Merlot,28,NN,MT.281
10119.MT.280,CAGTCACA,GTGCCAGCMGCCGCGGTAA,UCDAES,Microbiota of regional grapes and wine ferment...,EMP_Processed,UCDCAES,Microbiota of regional grapes and wine ferment...,Microbiota of regional grapes and wine ferment...,MiSeq,Illumina MiSeq,...,MT.280,must,food metagenome,B,870726,Microbiota of regional grapes and wine ferment...,Cabernet_Sauvignon,21,NN,MT.280
10119.MT.287,CATCAGCA,GTGCCAGCMGCCGCGGTAA,UCDAES,Microbiota of regional grapes and wine ferment...,EMP_Processed,UCDCAES,Microbiota of regional grapes and wine ferment...,Microbiota of regional grapes and wine ferment...,MiSeq,Illumina MiSeq,...,MT.287,must,food metagenome,B,870726,Microbiota of regional grapes and wine ferment...,Cabernet_Sauvignon,21,NN,MT.287
10119.MT.286,CATCTGGA,GTGCCAGCMGCCGCGGTAA,UCDAES,Microbiota of regional grapes and wine ferment...,EMP_Processed,UCDCAES,Microbiota of regional grapes and wine ferment...,Microbiota of regional grapes and wine ferment...,MiSeq,Illumina MiSeq,...,MT.286,wine,food metagenome,D,870726,Microbiota of regional grapes and wine ferment...,Cabernet_Sauvignon,24,NN,MT.286
10119.MT.285,CATGTCCT,GTGCCAGCMGCCGCGGTAA,UCDAES,Microbiota of regional grapes and wine ferment...,EMP_Processed,UCDCAES,Microbiota of regional grapes and wine ferment...,Microbiota of regional grapes and wine ferment...,MiSeq,Illumina MiSeq,...,MT.285,wine,food metagenome,D,870726,Microbiota of regional grapes and wine ferment...,Cabernet_Sauvignon,24,NN,MT.285
10119.MT.289,CAGAGAGT,GTGCCAGCMGCCGCGGTAA,UCDAES,Microbiota of regional grapes and wine ferment...,EMP_Processed,UCDCAES,Microbiota of regional grapes and wine ferment...,Microbiota of regional grapes and wine ferment...,MiSeq,Illumina MiSeq,...,MT.289,wine,food metagenome,D,870726,Microbiota of regional grapes and wine ferment...,Cabernet_Sauvignon,19,NN,MT.289
10119.MT.288,CAGTCAGT,GTGCCAGCMGCCGCGGTAA,UCDAES,Microbiota of regional grapes and wine ferment...,EMP_Processed,UCDCAES,Microbiota of regional grapes and wine ferment...,Microbiota of regional grapes and wine ferment...,MiSeq,Illumina MiSeq,...,MT.288,must,food metagenome,B,870726,Microbiota of regional grapes and wine ferment...,Cabernet_Sauvignon,21,NN,MT.288
10119.MT.840,GCTATTCC,GTGCCAGCMGCCGCGGTAA,UCDAES,Microbiota of regional grapes and wine ferment...,EMP_Processed,UCDCAES,Microbiota of regional grapes and wine ferment...,Microbiota of regional grapes and wine ferment...,MiSeq,Illumina MiSeq,...,MT.840,wine,food metagenome,G,870726,Microbiota of regional grapes and wine ferment...,Merlot,28,NN,MT.840


In [22]:
# Generate prior probability training set as a function of NC taxonomic composition
for dataset, data in natural_datasets.items():
    db, biom_table, rep_seqs, metadata_map = data
    reference_seqs, reference_tax, classifier_path = reference_dbs[db]
    
    # import data into qiime
    feature_table = Artifact.import_data("FeatureTable[Frequency]", biom_table)
    rep_seqs_artifact = Artifact.import_data("FeatureData[Sequence]", rep_seqs)

    # assign taxonomy using uniform prior
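    # load the pre-trained classifier; classify expects an Artifact, not a file path
    classifier = Artifact.load(classifier_path)
    # the signature lists a single output, so unpack the taxonomy artifact
    tax_assign, = feature_classifier.methods.classify(reads = rep_seqs_artifact,
                                                      classifier = classifier)
    tax_assign.save(join(natural_community_dir, dataset + '_tax_assignment.qza'))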
    
    

ValueError: InPath('/Users/nbokulich/Desktop/projects/natural_community_priors/wine-final.seqs.fa.no_artifacts.fasta') is not formatted as a DNAFASTAFormat file.

In [None]:
    ### Collapsing features and tables may be unnecessary for simulation
    ### just randomly sample tax_assign to make simulated sets.
                    
    # collapse features
    # in qiime2, species = level 7
    taxa_table = taxa.methods.collapse(table = feature_table, taxonomy = tax_assign,
                                       level = 7)
    tax_assign.save(join(natural_community_dir, dataset + '_tax_assignment.qza'))
    # collapse samples

    # store as dict?
                    
    # add to natural_datasets? as dataframe? 
    # Then we can store dataframe and call composition below without re-running this step
                    

## Generating simulated communities resembling natural communities
The purpose of this step is to generate simulated communities that resemble natural sample types, but contain a known taxonomic composition.
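
One possible way to build such communities is to draw sequences at random in proportion to a target composition. The sketch below uses only the standard library; the function name `simulate_community`, the genus-level composition, and the random seed are hypothetical, while the sample count and depth mirror the `num_samples` and `seqs_per_sample` settings used in this notebook.

```python
import random
from collections import Counter

def simulate_community(composition, seqs_per_sample, rng):
    """Draw `seqs_per_sample` taxon labels in proportion to `composition`
    and return the observed count of each taxon."""
    taxa = list(composition)
    weights = [composition[taxon] for taxon in taxa]
    draws = rng.choices(taxa, weights=weights, k=seqs_per_sample)
    return Counter(draws)

# Hypothetical target composition (relative abundances, not real data)
composition = {"g__Lactobacillus": 0.7, "g__Acetobacter": 0.2, "g__Oenococcus": 0.1}

num_samples = 5
seqs_per_sample = 1000
rng = random.Random(0)  # fixed seed so the simulation is repeatable

samples = [simulate_community(composition, seqs_per_sample, rng)
           for _ in range(num_samples)]
for counts in samples:
    print(dict(counts))
```

Each simulated sample has a known expected composition, and each drawn label would still need to be replaced by a reference sequence carrying that taxonomy before classification.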

In [None]:
num_samples = 5
seqs_per_sample = 1000

# Select an appropriate reference dataset (RD).
# extract RD sequences x,y,z (match NC taxonomies x,y,z)
# mix x,y,z at defined distributions D to make simulated community S


## Classification of simulated communities using prior probabilities

In [None]:
# provide prior distributions to classifier C
# assign taxonomy to S using C

## Classification of natural communities using prior probabilities

In [None]:
# provide prior distributions to classifier C
# assign taxonomy to NC using C
