**Author**: Justine Debelius<br>
**Date**: Summer/Fall 2021<br>
**Conda environment**: `micc-2021.11`<br>

The goal of this notebook is to document the import and filtering of the MiCC DADA2 table processed through [CTMR Amplicon](https://github.com/ctmrbio/ctmr-amplicon).

We'll read in the table generated with DADA2 and Silva 128. The taxonomy gets pulled off, then massaged to provide labels and deal with ragged strings. The taxonomy gets filtered to get rid of any feature which does not at least have the kingdom (bacteria) assigned.

The representative sequences get converted into hashed feature IDs (the IDs combine an MD5 hash of the sequence with a partial description of the lowest assigned taxonomic level).

The whole thing then gets wrapped into a tidy QIIME 2 table, with IDs formatted to match the metadata.

We'll then filter the table so that we retain only samples present in the metadata and features with a phylum-level annotation.

In [1]:
import hashlib

import biom
import pandas as pd
import numpy as np
import skbio

from qiime2 import Artifact, Metadata

import qiime2.plugins.feature_table.actions as q2_feature_table
import qiime2.plugins.taxa.actions as q2_taxa

# Import data into QIIME 2

We start by importing the feature table generated through DADA2.

In [2]:
dada2_table = pd.read_csv('data/raw_data/DADA2_seqtab_silva128_20191206.tsv',
                          sep='\t',
                          dtype=str)

I'd like to create a shorter identifier for the taxonomy. For this, I'll add a hashed sequence.

In [3]:
def hash_seq(x):
    return hashlib.md5(x.encode()).hexdigest()
dada2_table['hash_seq'] = dada2_table['Sequence'].apply(hash_seq)
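
As a quick sanity check on the hashing step, the sketch below applies the same `hashlib.md5` call to toy input. The sequences `toy_seq` and `other_seq` are made up for illustration and do not come from the MiCC table.

```python
import hashlib


def hash_seq(x):
    """Returns the MD5 hex digest of a sequence string."""
    return hashlib.md5(x.encode()).hexdigest()


# Hypothetical toy sequences, not taken from the DADA2 table
toy_seq = 'TGGGGAATATTGCACAATGG'
other_seq = 'TGAGGAATATTGGTCAATGG'

toy_hash = hash_seq(toy_seq)

# The hex digest has a fixed length, whatever the sequence length
print(len(toy_hash))
# Hashing the same sequence again gives the same ID
print(toy_hash == hash_seq(toy_seq))
# A different sequence is expected to give a different ID
print(toy_hash == hash_seq(other_seq))
```

Because the hash depends only on the sequence, the same sequence should receive the same identifier across runs and across tables.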

In [4]:
# Drops features whose taxonomy string is exactly 'Bacteria;'
dada2_table = dada2_table.loc[~dada2_table['Taxonomy'].isin(['Bacteria;'])]

For taxonomy, we prefix levels with a Greengenes-style taxonomic identifier, and then fill empty levels with the previous identifier. Unfortunately, it's hard to determine if the value is missing due to misclassification or because it's not defined, but since this is Silva, I suspect undefined is more likely.

Then, we'll use the genus level data to get a short label.

In [5]:
# Parses the taxonomy into a series instead of a semicolon-delimited string
taxonomy = dada2_table.set_index('hash_seq')['Taxonomy'].apply(
    lambda x: pd.Series(x.split(';')))
taxonomy.replace({"": np.nan}, inplace=True)
# Adds prefixes
for i, level in enumerate(['k', 'p', 'c', 'o', 'f', 'g', 's']):
    taxonomy[i] = taxonomy[i].dropna().apply(lambda x: f'{level}__{x}')
# Fills in the missing taxonomy
taxonomy = taxonomy.ffill(axis=1)
# Truncates taxonomy to get a short label for a hash
taxonomy['short'] = taxonomy[5].apply(lambda x: x.split("__")[-1][:4])
# Truncates the sequence hash to 7 characters
taxonomy['hash_short'] = taxonomy.index.to_frame()['hash_seq'].apply(lambda x: x[:7])
# Generates a composite feature id
taxonomy['Feature ID'] = taxonomy['short'] + '-' + taxonomy['hash_short']
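
To make the prefix, fill, and label steps concrete, here is a minimal plain-Python sketch of the same logic applied to a single taxonomy string. The helper names `prefix_and_fill` and `feature_id` and the toy inputs are hypothetical, and the sketch pads short strings to seven ranks rather than relying on pandas to handle ragged rows.

```python
import hashlib

LEVELS = ['k', 'p', 'c', 'o', 'f', 'g', 's']


def prefix_and_fill(tax_string):
    """Prefixes assigned ranks and forward-fills unassigned ones."""
    # Pads to seven ranks so ragged strings line up
    names = (tax_string.split(';') + [''] * len(LEVELS))[:len(LEVELS)]
    filled = []
    for level, name in zip(LEVELS, names):
        if name:
            filled.append(f'{level}__{name}')
        else:
            # Carries the last assigned rank forward, prefix included
            filled.append(filled[-1] if filled else None)
    return filled


def feature_id(tax_string, sequence):
    """Builds a '<4-letter label>-<7-character hash>' identifier."""
    genus_level = prefix_and_fill(tax_string)[5]
    short = genus_level.split('__')[-1][:4]
    hash_short = hashlib.md5(sequence.encode()).hexdigest()[:7]
    return f'{short}-{hash_short}'


# Hypothetical inputs for illustration only
toy_tax = 'Bacteria;Proteobacteria;Gammaproteobacteria;;;;'
toy_seq = 'TGGGGAATATTGCACAATGG'

print(prefix_and_fill(toy_tax))
print(feature_id(toy_tax, toy_seq))
```

In this toy case the genus is unassigned, so the genus slot holds the forward-filled class and the identifier starts with `Gamm-`. Note that the filled value keeps its original prefix (`c__`), which is what later makes it possible to tell which levels were actually assigned.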

In [6]:
dada2_table.head()

Unnamed: 0,#Seq_ID,P001__044_27-6572_MiCC__1177,P002__044_27-6572_MiCC__1178,P003__044_27-6572_MiCC__1179,P004__044_27-6572_MiCC__1180,P005__044_27-6572_MiCC__1181,P006__044_27-6572_MiCC__1182,P007__044_27-6572_MiCC__1183,P008__044_27-6572_MiCC__1184,P009__044_27-6572_MiCC__1185,...,P244__044_27-6608_MiCC__1056,P245__044_27-6608_MiCC__1057,P246__044_27-6608_MiCC__1058,P247__044_27-6608_MiCC__Neg_ex_27-6608,P248__044_27-6608_MiCC__Pos_ex_27-6608,P249__044_27-6608_MiCC__neg_pcr_27-6608,P250__044_27-6608_MiCC__pos_pcr_27-6608,Taxonomy,Sequence,hash_seq
0,Seq_00001,52,415,355,538,226,164,30,3,410,...,0,145,1556,9,5931,24,844,Bacteria;Proteobacteria;Gammaproteobacteria;En...,TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGC...,ffc36e27c82042664a16bcd4d380b286
1,Seq_00002,1418,0,0,0,0,299,125,9,0,...,0,257,0,0,0,0,0,Bacteria;Bacteroidetes;Bacteroidia;Bacteroidal...,TGAGGAATATTGGTCAATGGGCGATGGCCTGAACCAGCCAAGTAGC...,4abaa483334092f021534a979086baeb
2,Seq_00003,716,1053,957,0,0,355,759,8,329,...,0,288,0,0,0,0,0,Bacteria;Firmicutes;Clostridia;Clostridiales;R...,TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGC...,4516aa60a483dd8c7bbc57098c45f1a5
3,Seq_00004,0,930,823,0,0,122,264,0,83,...,0,138,0,0,0,0,0,Bacteria;Firmicutes;Clostridia;Clostridiales;R...,TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGC...,c728ad6f5d183cb36fa06b6a3a47758b
4,Seq_00005,32,55,55,1432,251,0,77,0,1192,...,0,315,1170,0,0,3,0,Bacteria;Bacteroidetes;Bacteroidia;Bacteroidal...,TGAGGAATATTGGTCAATGGGCGCTAGCCTGAACCAGCCAAGTAGC...,47f3d645d96038371757074de1d8fb8d


In [7]:
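# Counts, per sample, the observed features that lack a phylum assignment
# (i.e. the phylum column was forward-filled with the kingdom label)
(dada2_table.set_index('hash_seq').loc[taxonomy[1].apply(lambda x: 'k__' in x)].drop(
    columns=['#Seq_ID', 'Taxonomy', 'Sequence']).astype(float) > 0).sum(axis=0).sort_values()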

P001__044_27-6572_MiCC__1177               0.0
P159__044_27-6604_MiCC__1093               0.0
P160__044_27-6604_MiCC__1094               0.0
P161__044_27-6604_MiCC__1095               0.0
P162__044_27-6604_MiCC__1096               0.0
                                          ... 
P090__044_27-6576_MiCC__1146               0.0
P091__044_27-6576_MiCC__1147               0.0
P092__044_27-6576_MiCC__1148               0.0
P079__044_27-6576_MiCC__1135               0.0
P250__044_27-6608_MiCC__pos_pcr_27-6608    0.0
Length: 250, dtype: float64

And then, we'll import the data into QIIME 2.

In [8]:
taxa_for_qiime = taxonomy.copy().set_index('Feature ID').drop(columns=['short', 'hash_short'])
taxa_for_qiime = taxa_for_qiime.apply(lambda x: ';'.join(x), axis=1)
taxa_for_qiime.name = 'Taxon'
taxa_q2 = Artifact.import_data('FeatureData[Taxonomy]', taxa_for_qiime, pd.Series)
taxa_q2.save('data/tables/taxonomy.qza')

'data/tables/taxonomy.qza'

The table gets updated with the short taxonomy name, and we'll pull out the sequences.

In [9]:
dada2_table['short_name'] = \
    dada2_table['hash_seq'].replace(taxonomy['Feature ID'].to_dict())
dada2_table.set_index('short_name', inplace=True)


In [10]:
rep_seqs = pd.Series({
    id_: skbio.DNA(seq, metadata={'id': id_})
    for id_, seq in dada2_table['Sequence'].items()
})
rep_seqs_q2 = Artifact.import_data('FeatureData[Sequence]', rep_seqs, pd.Series)
rep_seqs_q2.save('data/tables/rep_seqs.qza')

'data/tables/rep_seqs.qza'

Finally, we'll pull together the DADA2 table. We'll drop metadata from the table (Taxonomy, Sequence, hash) and then rename the columns to match the format from the metadata.

In [11]:
dada2_table.drop(columns=['#Seq_ID', 'Taxonomy', 'Sequence', 'hash_seq'],
                 inplace=True)
dada2_table.index.set_names('feature-id', inplace=True)
dada2_table.rename(
    columns={c: c.split("__")[-1] for c in dada2_table.columns},
    inplace=True
)
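
The renaming rule can be checked on two column names copied from the table preview above. This is only an illustration of the `split("__")[-1]` step; the variable names `raw_columns` and `renamed` are hypothetical.

```python
# Column names copied from the table preview
raw_columns = [
    'P001__044_27-6572_MiCC__1177',
    'P247__044_27-6608_MiCC__Neg_ex_27-6608',
]

# Keeps only the text after the last double underscore
renamed = {c: c.split('__')[-1] for c in raw_columns}

for old, new in renamed.items():
    print(f'{old} -> {new}')
```

Study samples end up with a short numeric ID (`1177`), while controls keep a descriptive name (`Neg_ex_27-6608`) because their single underscores are not treated as separators.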

In [12]:
dada2_table = dada2_table.T
dada2_table.index.set_names('sample-id', inplace=True)
table_q2 = Artifact.import_data('FeatureTable[Frequency]', dada2_table, pd.DataFrame)
table_q2.save('data/tables/table.qza')

'data/tables/table.qza'

# Filter to a working table

We read in the provided metadata. Note that this is slightly modified from the ENA metadata, and contains only the variables needed for working analysis.

In [14]:
meta = pd.read_csv('data/metadata_paired.tsv', sep='\t', dtype=str)
meta.set_index('sample-id', inplace=True)
# Drops an ENA column which conflicts with the QIIME 2 namespace
meta.drop(columns=['sample_name'], inplace=True)
meta_q2 = Metadata(meta)

We filter the table based on the metadata so we have the correct set of samples.

In [15]:
paired_table = q2_feature_table.filter_samples(
    table=table_q2, 
    metadata=meta_q2
).filtered_table
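
Conceptually, filtering samples against the metadata keeps the table samples whose IDs also appear in the metadata. The plain-Python sketch below illustrates that idea only; it is not the QIIME 2 implementation, and the sample IDs are hypothetical.

```python
# Hypothetical sample IDs for illustration
table_samples = ['1177', '1178', 'Neg_ex_27-6608', 'pos_pcr_27-6608']
metadata_samples = {'1177', '1178', '1179'}

# Retains table samples that also appear in the metadata, preserving order
kept_samples = [s for s in table_samples if s in metadata_samples]
# Samples in the table but not in the metadata (controls here) are dropped
dropped_samples = [s for s in table_samples if s not in metadata_samples]

print(kept_samples)
print(dropped_samples)
```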

Then we filter to retain only features that have a phylum-level assignment.

In [16]:
phylum_def = q2_taxa.filter_table(
    table=paired_table, 
    taxonomy=taxa_q2, 
    include="p__",
    mode='contains'
).filtered_table
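
The `p__` search term works here because unassigned levels were forward-filled with their original prefix, so a feature with no phylum never contains `p__` anywhere in its annotation. A rough plain-Python sketch of a substring filter, assuming hypothetical feature IDs and annotations (this is not the q2-taxa code):

```python
# Hypothetical taxonomy annotations keyed by feature ID
toy_taxonomy = {
    'feat-a': 'k__Bacteria;p__Firmicutes;c__Clostridia',
    'feat-b': 'k__Bacteria;k__Bacteria;k__Bacteria',
}


def contains_filter(taxonomy, term):
    """Returns IDs whose annotation contains the search term."""
    return [fid for fid, taxon in taxonomy.items() if term in taxon]


kept = contains_filter(toy_taxonomy, 'p__')
print(kept)
```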

And we drop all-zero features in case any are still present.

In [17]:
paired_nonzero = q2_feature_table.filter_features(
    table=phylum_def,
    min_samples=1,
    min_frequency=1,
).filtered_table
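
As a rough illustration of the two thresholds, the sketch below keeps a feature only if its total count is at least `min_frequency` and it is observed in at least `min_samples` samples. The helper `passes` and the counts are hypothetical, and the sketch mirrors the idea rather than the QIIME 2 internals.

```python
# Hypothetical counts: feature -> per-sample frequencies
toy_counts = {
    'feat-a': [10, 0, 3],
    'feat-b': [0, 0, 0],
    'feat-c': [0, 1, 0],
}


def passes(counts, min_frequency=1, min_samples=1):
    """Checks the total count and the number of samples with a nonzero count."""
    total = sum(counts)
    n_samples = sum(1 for c in counts if c > 0)
    return total >= min_frequency and n_samples >= min_samples


kept_features = [f for f, c in toy_counts.items() if passes(c)]
print(kept_features)
```

With both thresholds set to 1, only features that are zero in every retained sample (`feat-b` here) are removed.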

In [18]:
paired_nonzero.save('data/tables/phylum_defined_table.qza')

'data/tables/phylum_defined_table.qza'

And then, I'd like a matching group of representative sequences.

In [20]:
paired_seqs = q2_feature_table.filter_seqs(
    data=rep_seqs_q2, 
    table=paired_nonzero
).filtered_data
paired_seqs.save('data/tables/phylum_defined_rep_seqs.qza')

'data/tables/phylum_defined_rep_seqs.qza'

And now we have a feature table and representative sequences imported into QIIME 2 and ready for analysis.