# Evaluate Seqs & Taxonomy by Processing Step

## Impact of stepwise processing

Ultimately, we'd like to know what effect each processing step has on our sequence and taxonomy evaluations. Many of the files we need were already made in the notebook `02-SILVA-Benchmark-prep.ipynb`; we'll continue by following the flowchart below:

<img src="flowchart-evals-by-step.png">

We'll use the files from the following path for input: 

`/home/mrobeson/projects/rescript_benchmarks/ref_dbs/silva-138/silva-138-nr99-default-noeuks`

The files we'll be using from the above path are those in the 'white' boxes. Everything in the 'yellow' boxes we'll generate here via the QIIME 2 API.

In [15]:
import os
from os import getcwd, listdir, chdir, mkdir, path
import pandas as pd
import qiime2 as q2
from qiime2.plugins import rescript, taxa, feature_classifier
import glob
#from IPython.display import Image

In [2]:
mwd = '/home/mrobeson/projects/rescript_benchmarks'
refdb = mwd + '/ref_dbs/silva-138'
bench = mwd + '/benchmarks/silva-138'

In [3]:
wd = refdb + '/silva-138-nr99-default-noeuks'

In [4]:
os.chdir(wd)
os.getcwd()

'/home/mrobeson/projects/rescript_benchmarks/ref_dbs/silva-138/silva-138-nr99-default-noeuks'

## Initial setup

We never made files that keep only the good labels using the default mode with `uniq` dereplication, so we'll make them here.

In [6]:
#import existing NR99 files
seqs_derep = q2.Artifact.load('silva-nr99-default-noeuks-derep-seqs.qza')
tax_derep = q2.Artifact.load('silva-nr99-default-noeuks-derep-taxa.qza')

In [7]:
exclude_taxa = 's__unidentified,hloroplast,itochondria,s__uncultured,s__uncultivated,etagenome,andidatus,ryza_sativa,_bacterium,_proteobacterium,manure,arctic,marine,water,gut,symbiont,oral,lake,sea,microbial_mat,glacial,drainage,thermal_vent,nrichment,synthetic,candidate,clone,mineralizing,swine,isolate,aerobic,hot_spring,halophilic,gas_vacuolate'
exclude_taxa_list = exclude_taxa.split(',')

In [8]:
# Exclude sequences that contain unidentified / poor taxonomy labels, then save.
seqs_goodlabels, = taxa.actions.filter_seqs(sequences = seqs_derep,
                                            taxonomy = tax_derep,
                                            exclude = exclude_taxa,
                                            mode='contains')
seqs_goodlabels.save('silva-nr99-default-noeuks-derep-seqs-gl.qza')

'silva-nr99-default-noeuks-derep-seqs-gl.qza'
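To make the effect of a comma-separated exclusion string easier to see, here is a minimal pandas-only sketch of substring-based exclusion. It is an illustration, not the `filter_seqs` implementation: the toy taxonomy, the shortened term list, and the helper `ids_without_terms` are all hypothetical, and the real action may differ in details such as case handling.

```python
import pandas as pd

# Toy taxonomy; the feature IDs and labels are invented for illustration.
toy_tax = pd.Series({
    'seq1': 'd__Bacteria; p__Cyanobacteria; c__Cyanobacteriia; o__Chloroplast',
    'seq2': 'd__Bacteria; p__Proteobacteria; g__Escherichia-Shigella; s__Escherichia_coli',
    'seq3': 'd__Bacteria; p__Firmicutes; g__Bacillus; s__uncultured_bacterium',
    'seq4': 'd__Bacteria; p__Proteobacteria; o__Rickettsiales; f__Mitochondria',
}, name='Taxon')

# A shortened, hypothetical version of the exclusion string.
toy_exclude = 'hloroplast,itochondria,s__uncultured'

def ids_without_terms(taxonomy, exclude):
    """Return the IDs whose label contains none of the comma-separated terms.

    Assumption: plain, case-sensitive substring matching.
    """
    terms = exclude.split(',')
    has_term = taxonomy.apply(lambda label: any(t in label for t in terms))
    return taxonomy.index[~has_term.to_numpy()]

kept = ids_without_terms(toy_tax, toy_exclude)
print(list(kept))
```

Dropping the first letter of a term (e.g. `hloroplast`) lets a single substring match both the capitalised and lower-case spellings under case-sensitive matching.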

## Filter taxa

We use the API to filter the taxonomy, as the only way to do this is with a list of feature IDs.

In [9]:
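def filt_taxa(seq_infile, tax_infile, tax_outfile):
    """Keep only the taxonomy entries whose feature IDs occur in `seq_infile`."""
    seqs = q2.Artifact.load(seq_infile)
    tax = q2.Artifact.load(tax_infile)
    # The index of the sequence Series holds the feature IDs to keep.
    feature_ids = [feature_id for feature_id in seqs.view(pd.Series).index]
    ids = pd.Index(feature_ids, name='Feature ID')
    # `filter_taxa` takes the IDs as Metadata, so wrap them in an empty DataFrame.
    ids_to_keep = q2.Metadata(pd.DataFrame(index=ids))
    tax_goodlabels, = rescript.actions.filter_taxa(taxonomy=tax, ids_to_keep=ids_to_keep)
    tax_goodlabels.save(tax_outfile)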
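The idea behind the function above, stripped of QIIME 2 artifacts, is to subset a taxonomy by the index of the matching sequences. The following is a minimal pandas-only sketch under that assumption; the toy data and the helper `subset_taxonomy` are hypothetical and stand in for the artifact views.

```python
import pandas as pd

# Toy stand-ins for the taxonomy and sequence views (invented data).
toy_tax = pd.Series({
    'a': 'd__Bacteria; p__Firmicutes',
    'b': 'd__Bacteria; p__Proteobacteria',
    'c': 'd__Archaea; p__Crenarchaeota',
}, name='Taxon')
toy_tax.index.name = 'Feature ID'

# Sequence 'b' was removed by some upstream processing step.
toy_seqs = pd.Series({'a': 'ACGT', 'c': 'GGCC'})

def subset_taxonomy(taxonomy, seqs):
    """Keep only the taxonomy rows whose IDs are present in `seqs`."""
    ids = pd.Index(seqs.index, name='Feature ID')
    return taxonomy.loc[ids]

print(subset_taxonomy(toy_tax, toy_seqs))
```

This keeps the sequence and taxonomy files in sync after every step, which is why the same parsed taxonomy can serve as input for each call.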

**Base No Euks**

In [40]:
# No Euks. Yes, there is a typo in the name (i.e. "noeuks-noeuks") from when I first made these files.
# I propagated it through all downstream outputs so we know these files "belong together".
filt_taxa('silva-nr99-default-noeuks-noeuks-seqs.qza', 
          'silva-nr99-default-noeuks-parsed-taxonomy.qza', 
          'silva-nr99-default-noeuks-noeuks-taxa.qza')

Input features: 510984
Output features: 452173


**Culled Seqs**

In [41]:
# Culled Seqs
filt_taxa('silva-nr99-default-noeuks-culled-seqs.qza', 
          'silva-nr99-default-noeuks-parsed-taxonomy.qza', 
          'silva-nr99-default-noeuks-culled-taxa.qza')

Input features: 510984
Output features: 429812


**Filter Seqs Length by Taxon**

In [42]:
# Filt SeqLength by Taxon
filt_taxa('silva-nr99-default-noeuks-filt-seqs.qza', 
          'silva-nr99-default-noeuks-parsed-taxonomy.qza', 
          'silva-nr99-default-noeuks-filt-taxa.qza')

Input features: 510984
Output features: 429812


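**Note:** Filtering by length per taxon did not remove any additional sequences (429812 output features both before and after this step). This suggests that many of the short sequences also contained homopolymers and/or ambiguous bases, and so were already removed during culling.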
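Equal feature counts do not by themselves prove that two steps kept the same features, so a quick ID comparison can be reassuring. Below is a minimal sketch with invented IDs; the helper `compare_feature_ids` is hypothetical. With the real artifacts, the IDs could be taken from `q2.Artifact.load(...).view(pd.Series).index`, as in `filt_taxa`.

```python
import pandas as pd

def compare_feature_ids(ids_before, ids_after):
    """Summarise which feature IDs were removed or added between two steps."""
    before = pd.Index(ids_before)
    after = pd.Index(ids_after)
    return {
        'n_before': len(before),
        'n_after': len(after),
        'removed': sorted(before.difference(after)),
        'added': sorted(after.difference(before)),
    }

# Toy IDs standing in for the culled and length-filtered sequence sets.
toy_culled = ['id1', 'id2', 'id3']
toy_filt = ['id1', 'id2', 'id3']

report = compare_feature_ids(toy_culled, toy_filt)
print(report)
```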

**derep uniq**

No need to run `filter-taxa` for the dereplicated data, as the matching taxonomy is already output by `rescript dereplicate`.

**remove unclassified / poor labels**

In [43]:
# Filt by removing unclassified / poor labels.
filt_taxa('silva-nr99-default-noeuks-derep-seqs-gl.qza', 
          'silva-nr99-default-noeuks-parsed-taxonomy.qza', 
          'silva-nr99-default-noeuks-derep-taxa-gl.qza')

Input features: 510984
Output features: 101990


In [13]:
# Filt by removing unclassified / poor labels from v4.
filt_taxa('silva-nr99-default-noeuks-derep-seqs-515-806-gl.qza', 
          'silva-nr99-default-noeuks-derep-taxa-515-806.qza', 
          'silva-nr99-default-noeuks-derep-taxa-515-806-gl.qza')

Input features: 278221
Output features: 52103


## Eval Taxa & Seqs

### Full length only

In [13]:
! qiime rescript evaluate-taxonomy \
    --i-taxonomies silva-nr99-default-noeuks-noeuks-taxa.qza \
                   silva-nr99-default-noeuks-culled-taxa.qza \
                   silva-nr99-default-noeuks-filt-taxa.qza \
                   silva-nr99-default-noeuks-derep-taxa.qza \
                   silva-nr99-default-noeuks-derep-taxa-gl.qza \
    --p-labels 'Base' 'Culled' 'LengFiltByTax' 'DereplicateUniq' 'NoAmbigLabels' \
    --p-rank-handle-regex "^[dkpcofgs]__" \
    --o-taxonomy-stats 'NR99-proc-steps-evaltax.qzv'

Saved Visualization to: NR99-proc-steps-evaltax.qzv
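For readers unfamiliar with the rank-handle regex, the sketch below shows what the pattern `^[dkpcofgs]__` matches when applied to each rank of a SILVA-style label. It only demonstrates the regular expression; how `evaluate-taxonomy` uses the stripped labels internally is not shown here, and the helper `strip_rank_handles` is hypothetical.

```python
import re

# Same pattern as passed to --p-rank-handle-regex.
rank_handle = re.compile(r'^[dkpcofgs]__')

def strip_rank_handles(taxon):
    """Split a semicolon-delimited label and strip the leading rank handle."""
    ranks = [rank.strip() for rank in taxon.split(';')]
    return [rank_handle.sub('', rank) for rank in ranks]

print(strip_rank_handles('d__Bacteria; p__Firmicutes; g__Bacillus; s__'))
```

Once the handle is removed, a rank with no annotation (such as the bare `s__` above) becomes an empty string, which makes unannotated ranks easy to tell apart from annotated ones.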


In [14]:
q2.Visualization.load('NR99-proc-steps-evaltax.qzv')

In [15]:
! qiime rescript evaluate-seqs \
    --i-sequences silva-nr99-default-noeuks-noeuks-seqs.qza \
                   silva-nr99-default-noeuks-culled-seqs.qza \
                   silva-nr99-default-noeuks-filt-seqs.qza \
                   silva-nr99-default-noeuks-derep-seqs.qza \
                   silva-nr99-default-noeuks-derep-seqs-gl.qza \
    --p-labels 'Base' 'Culled' 'LengFiltByTax' 'DereplicateUniq' 'NoAmbigLabels' \
    --p-palette 'cividis' \
    --p-subsample-kmers 0.2 \
    --p-kmer-lengths 32 16 8 \
    --o-visualization 'NR99-proc-steps-evalseqs.qzv'

Saved Visualization to: NR99-proc-steps-evalseqs.qzv
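To give a feel for what the k-mer lengths mean, the following sketch counts distinct k-mers in a set of toy sequences at several lengths. It is an illustration only and makes no claim about how `evaluate-seqs` computes or subsamples its k-mer statistics; the helpers `kmers` and `unique_kmer_counts` and the toy sequences are hypothetical.

```python
def kmers(seq, k):
    """Return all overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def unique_kmer_counts(seqs, kmer_lengths):
    """Count distinct k-mers across all sequences, for each k."""
    counts = {}
    for k in kmer_lengths:
        unique = set()
        for seq in seqs:
            unique.update(kmers(seq, k))
        counts[k] = len(unique)
    return counts

# Two short toy sequences that differ only near the 3' end.
toy_seqs = ['ACGTACGTAC', 'ACGTACGTTT']
print(unique_kmer_counts(toy_seqs, [8, 4]))
```

Longer k-mers are more specific to individual sequences, while shorter ones are shared more widely, which is one reason to evaluate several lengths side by side.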


In [16]:
q2.Visualization.load('NR99-proc-steps-evalseqs.qzv')

## Cross validate / fit classifier

Run `evaluate-cross-validate` and `evaluate-fit-classifier` on the output of each processing step.

In [5]:
ecv_pbs_str2 = """#!/bin/bash
#PBS -l nodes=1:ppn={ppn},mem={mem},walltime={wt}
#PBS -l feature=xeon
#PBS -N {job_name}
#PBS -o {base_out_path}/{job_name}.out
#PBS -e {base_out_path}/{job_name}.err

export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8

source activate qiime2-2020.6

date

cd {base_out_path}

qiime rescript evaluate-cross-validate \
    --i-sequences {base_in_path}/{inseq} \
    --i-taxonomy  {base_in_path}/{intax} \
    --p-n-jobs {ppn} \
    --o-expected-taxonomy {base_out_path}/{exp_tax} \
    --o-observed-taxonomy {base_out_path}/{obs_tax} \
    --o-evaluation {base_out_path}/{eval_tax}

date

source deactivate"""

In [6]:
efc_pbs_str2 = """#!/bin/bash
#PBS -l nodes=1:ppn={ppn},mem={mem},walltime={wt}
#PBS -l feature=xeon
#PBS -N {job_name}
#PBS -o {base_out_path}/{job_name}.out
#PBS -e {base_out_path}/{job_name}.err

export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8

source activate qiime2-2020.6

date

cd {base_out_path}

qiime rescript evaluate-fit-classifier \
    --i-sequences {base_in_path}/{inseq} \
    --i-taxonomy  {base_in_path}/{intax} \
    --p-n-jobs {ppn} \
    --o-classifier {base_out_path}/{classifier_out_fp} \
    --o-evaluation {base_out_path}/{eval_out_fp} \
    --o-observed-taxonomy {base_out_path}/{obs_tax_out_fp}

date

source deactivate"""
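One detail worth knowing about the two templates above: they are ordinary (non-raw) triple-quoted strings, so Python treats a backslash at the end of a line as a line continuation and removes both the backslash and the newline. The rendered script therefore carries each `qiime` command on a single line, which the shell runs just the same. The sketch below demonstrates this with a hypothetical toy template.

```python
# In a normal triple-quoted string, backslash-newline is dropped by Python.
toy_template = """echo \
    {word}"""

# In a raw string the backslash and the newline are both preserved.
toy_template_raw = r"""echo \
    {word}"""

rendered = toy_template.format(word='hello')
rendered_raw = toy_template_raw.format(word='hello')

print(repr(rendered))
print(repr(rendered_raw))
```

If the multi-line layout should survive in the written `.pbs` file, a raw string (or doubled backslashes) would be one way to keep it.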

In [24]:
# This is not a generic function. Written specifically for the output we generated earlier.
# Using global string variable `ecv_pbs_str2`
def make_ecv_benchmark_pbs_files(ppn = '1',
                                 mem = '100GB',
                                 wt = '72:00:00',
                                 base_in_path = '/home/mrobeson/projects/rescript_benchmarks/ref_dbs/silva-138/silva-138-nr99-default-noeuks',
                                 base_out_path = '/home/mrobeson/projects/rescript_benchmarks/ref_dbs/silva-138/silva-138-nr99-default-noeuks',
                                 glob_seqs_list = ['silva-nr99-default-noeuks-noeuks-seqs.qza',  'silva-nr99-default-noeuks-culled-seqs.qza',  'silva-nr99-default-noeuks-filt-seqs.qza',  'silva-nr99-default-noeuks-derep-seqs.qza',  'silva-nr99-default-noeuks-derep-seqs-gl.qza']):

    chdir(base_in_path)
    seq_files = []
    for f in glob_seqs_list:
        seq_files.extend(glob.glob(f))

    tax_files = [sf.replace('seqs', 'taxa') for sf in seq_files]

    for s,t in zip(seq_files, tax_files):
        bn = path.splitext(t)[0]
        job_name = bn + '-ecv'
        ecv_str = ecv_pbs_str2.format(ppn = ppn,
                   mem = mem,
                   wt = wt,
                   job_name = job_name,
                   base_in_path = base_in_path,
                   base_out_path = base_out_path,
                   inseq = s,
                   intax = t,
                   exp_tax = bn + '-ecv-exptax.qza',
                   obs_tax = bn + '-ecv-obstax.qza',
                   eval_tax = bn + '-ecv-evaltax.qzv')
    
        job_file_name = job_name + '.pbs'    
    
        with open(path.join(base_out_path, job_file_name), 'w') as outfile:
            outfile.write(ecv_str)    
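The naming convention used in the function above can be checked in isolation. The sketch below reproduces only the string handling (taxonomy file name, base name, job name, and output names) for one sequence file; the helper `derive_ecv_names` is hypothetical and writes nothing to disk.

```python
from os import path

def derive_ecv_names(seq_file, suffix='-ecv'):
    """Derive the taxonomy, job, and output names for one sequence file."""
    # The taxonomy file shares the sequence file name, with 'seqs' -> 'taxa'.
    tax_file = seq_file.replace('seqs', 'taxa')
    bn = path.splitext(tax_file)[0]
    job_name = bn + suffix
    return {
        'inseq': seq_file,
        'intax': tax_file,
        'job_name': job_name,
        'job_file': job_name + '.pbs',
        'exp_tax': bn + suffix + '-exptax.qza',
        'obs_tax': bn + suffix + '-obstax.qza',
        'eval_tax': bn + suffix + '-evaltax.qzv',
    }

names = derive_ecv_names('silva-nr99-default-noeuks-derep-seqs-gl.qza')
for key, value in names.items():
    print(key, value, sep='\t')
```

Because the taxonomy name is derived by string replacement, this only works when every sequence file has a sibling taxonomy file that follows the same `seqs`/`taxa` naming pattern.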

In [25]:
# This is not a generic function. Written specifically for the output we generated earlier.
# Using global string variable `efc_pbs_str2`
def make_efc_benchmark_pbs_files(ppn = '1',
                        mem = '100GB',
                        wt = '72:00:00',
                        base_in_path = '/home/mrobeson/projects/rescript_benchmarks/ref_dbs/silva-138/silva-138-nr99-default-noeuks',
                        base_out_path = '/home/mrobeson/projects/rescript_benchmarks/ref_dbs/silva-138/silva-138-nr99-default-noeuks',
                        glob_seqs_list = ['silva-nr99-default-noeuks-noeuks-seqs.qza',  'silva-nr99-default-noeuks-culled-seqs.qza',  'silva-nr99-default-noeuks-filt-seqs.qza',  'silva-nr99-default-noeuks-derep-seqs.qza',  'silva-nr99-default-noeuks-derep-seqs-gl.qza']):
    
    chdir(base_in_path)

    seq_files = []
    for f in glob_seqs_list:
        seq_files.extend(glob.glob(f))

    tax_files = [sf.replace('seqs', 'taxa') for sf in seq_files]

    for s,t in zip(seq_files, tax_files):
        bn = path.splitext(t)[0]
        job_name = bn + '-efc'
        efc_str = efc_pbs_str2.format(ppn = ppn,
                   mem = mem,
                   wt = wt,
                   job_name = job_name,
                   base_in_path = base_in_path,
                   base_out_path = base_out_path,
                   inseq = s,
                   intax = t,
                   classifier_out_fp = bn + '-efc-classifier.qza',
                   obs_tax_out_fp = bn + '-efc-obstax.qza',
                   eval_out_fp = bn + '-efc-evaltax.qzv')

        job_file_name = job_name + '.pbs'

        with open(path.join(base_out_path, job_file_name), 'w') as outfile:
            outfile.write(efc_str)
    

In [26]:
make_ecv_benchmark_pbs_files()

In [27]:
make_efc_benchmark_pbs_files()
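The notebook does not show how the generated job files were submitted. Assuming a PBS/Torque scheduler where `qsub` is on the `PATH`, a minimal sketch for collecting and submitting them might look like the following. The helper `submit_pbs_jobs` is hypothetical, and it defaults to a dry run that only returns the commands.

```python
import glob
import subprocess
from os import path

def submit_pbs_jobs(job_dir, patterns=('*-ecv.pbs', '*-efc.pbs'), dry_run=True):
    """Collect job files in job_dir and optionally submit each with qsub.

    Returns the list of commands, whether or not they were executed.
    """
    commands = []
    for pattern in patterns:
        for job_file in sorted(glob.glob(path.join(job_dir, pattern))):
            commands.append(['qsub', job_file])
    if not dry_run:
        for cmd in commands:
            # Assumption: qsub is available on this machine.
            subprocess.run(cmd, check=True)
    return commands

# Dry run: list what would be submitted from the current directory.
for cmd in submit_pbs_jobs('.', dry_run=True):
    print(' '.join(cmd))
```

Reviewing the dry-run output first makes it easy to confirm that one job file exists per processing step before anything is queued.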