In [1]:
import os
import pandas as pd
from qiime2 import Visualization
import matplotlib.pyplot as plt
import numpy as np

import qiime2 as q2

%matplotlib inline


In [2]:
data_dir = 'Alien_data'

# Create the data directory if it does not exist yet
os.makedirs(data_dir, exist_ok=True)

In [8]:
! wget -nv -O $data_dir/sequences.qza 'https://polybox.ethz.ch/index.php/s/PCQspFMocVCKjZ3/download'

2022-10-17 17:38:52 URL:https://polybox.ethz.ch/index.php/s/PCQspFMocVCKjZ3/download [3433846903/3433846903] -> "Alien_data/sequences.qza" [1]


In [56]:
! wget -nv -O $data_dir/sample_metadata.tsv 'https://polybox.ethz.ch/index.php/s/r1AYzdUVWnQyiRL/download'

2022-10-10 19:45:54 URL:https://polybox.ethz.ch/index.php/s/r1AYzdUVWnQyiRL/download [10012/10012] -> "Alien_data/sample_metadata.tsv" [1]


In [35]:
data_dir

'Alien_data'

In [59]:
metadata_df = pd.read_csv(f'{data_dir}/sample_metadata.tsv', sep='\t')

In [3]:
! qiime tools peek $data_dir/sequences.qza

UUID:        394c4773-80e2-46a6-9fba-40e7c8ec3fb9
Type:        SampleData[PairedEndSequencesWithQuality]
Data format: SingleLanePerSamplePairedEndFastqDirFmt


In [10]:
! qiime demux summarize \
    --i-data $data_dir/sequences.qza \
    --o-visualization $data_dir/sequences.qzv

Saved Visualization to: Alien_data/sequences.qzv

In [3]:
Visualization.load(f'{data_dir}/sequences.qzv')

I am not sure whether we need to trim the primers. I have already emailed Lina to ask whether the primers she provided should be trimmed.
The clustering approach did not work well: Xinyang and I tried chimera removal, but it ran for more than a day without finishing, so denoising with DADA2 is used here instead.
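
While waiting for an answer, one rough way to judge whether trimming is needed is to check how many reads still begin with the primer. The sketch below is only an illustration of that idea: the helper names, the one-mismatch tolerance, and the toy reads are my own assumptions, not part of QIIME 2, and real reads would have to be taken from the exported FASTQ files. The primer is the 563F sequence listed in the PCR extraction notes.

```python
# Map each IUPAC nucleotide code to the bases it can stand for.
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}


def starts_with_primer(read, primer, max_mismatches=1):
    """Return True if the first len(primer) bases of `read` match `primer`.

    Degenerate primer positions accept any base allowed by the IUPAC code.
    `max_mismatches` is a hypothetical tolerance chosen for this sketch.
    """
    if len(read) < len(primer):
        return False
    mismatches = sum(base not in IUPAC[code] for base, code in zip(read, primer))
    return mismatches <= max_mismatches


def fraction_with_primer(reads, primer, max_mismatches=1):
    """Return the fraction of reads that start with the primer."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(starts_with_primer(r, primer, max_mismatches) for r in reads)
    return hits / len(reads)


FWD_PRIMER = 'AYTGGGYDTAAAGNG'  # 563F, as listed for our samples

# Made-up reads for illustration only (not taken from sequences.qza).
toy_reads = [
    'ACTGGGCGTAAAGCGTTTT',   # starts with a valid instance of 563F
    'ATTGGGTATAAAGGGCCCC',   # another valid instance of 563F
    'GGGGCCCCAAAATTTTGGGG',  # no primer at the start
]

primer_fraction = fraction_with_primer(toy_reads, FWD_PRIMER)
print(f'{primer_fraction:.2f} of the toy reads start with the forward primer')
```

If most forward reads start with the primer, that would suggest the primers are still in the data and trimming deserves a closer look; a fraction near zero would suggest they were already removed.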

In [4]:
! qiime dada2 denoise-paired \
    --i-demultiplexed-seqs $data_dir/sequences.qza \
    --p-trunc-len-f 181 \
    --p-trunc-len-r 190 \
    --p-n-threads 4 \
    --o-table $data_dir/dada2_table.qza \
    --o-representative-sequences $data_dir/dada2_rep_set.qza \
    --o-denoising-stats $data_dir/dada2_stats.qza

Saved FeatureTable[Frequency] to: Alien_data/dada2_table.qza
Saved FeatureData[Sequence] to: Alien_data/dada2_rep_set.qza
Saved SampleData[DADA2Stats] to: Alien_data/dada2_stats.qza

In [5]:
! qiime metadata tabulate \
    --m-input-file $data_dir/dada2_stats.qza \
    --o-visualization $data_dir/dada2_stats.qzv

Saved Visualization to: Alien_data/dada2_stats.qzv

In [6]:
Visualization.load(f'{data_dir}/dada2_stats.qzv')

In [7]:
! qiime feature-table summarize \
  --i-table $data_dir/dada2_table.qza \
  --o-visualization $data_dir/dada2_table.qzv

Saved Visualization to: Alien_data/dada2_table.qzv

In [10]:
Visualization.load(f'{data_dir}/dada2_table.qzv')

In [8]:
! qiime feature-table tabulate-seqs \
  --i-data $data_dir/dada2_rep_set.qza \
  --o-visualization $data_dir/dada2_rep_set.qzv

Saved Visualization to: Alien_data/dada2_rep_set.qzv

In [9]:
Visualization.load(f'{data_dir}/dada2_rep_set.qzv')

Taxonomy classification and data curation:

In [15]:
! qiime rescript get-silva-data \
    --p-version '138.1' \
    --p-target 'SSURef_NR99' \
    --p-include-species-labels \
    --o-silva-sequences $data_dir/silva-138.1-ssu-nr99-rna-seqs.qza \
    --o-silva-taxonomy $data_dir/silva-138.1-ssu-nr99-tax.qza

Saved FeatureData[RNASequence] to: Alien_data/silva-138.1-ssu-nr99-rna-seqs.qza
Saved FeatureData[Taxonomy] to: Alien_data/silva-138.1-ssu-nr99-tax.qza

In [16]:
 ! qiime rescript cull-seqs \
     --i-sequences $data_dir/silva-138.1-ssu-nr99-rna-seqs.qza \
     --p-num-degenerates 5 \
     --p-homopolymer-length 8 \
     --p-n-jobs 3 \
     --o-clean-sequences $data_dir/silva-138.1-ssu-nr99-rna-seqs-cleaned.qza

Saved FeatureData[Sequence] to: Alien_data/silva-138.1-ssu-nr99-rna-seqs-cleaned.qza

In [17]:
! qiime rescript filter-seqs-length-by-taxon \
    --i-sequences $data_dir/silva-138.1-ssu-nr99-rna-seqs-cleaned.qza \
    --i-taxonomy $data_dir/silva-138.1-ssu-nr99-tax.qza \
    --p-labels Archaea Bacteria Eukaryota \
    --p-min-lens 900 1200 1400 \
    --o-filtered-seqs $data_dir/silva-138-ssu-nr99-seqs-filt.qza \
    --o-discarded-seqs $data_dir/silva-138-ssu-nr99-seqs-discard.qza

Saved FeatureData[Sequence] to: Alien_data/silva-138-ssu-nr99-seqs-filt.qza
Saved FeatureData[Sequence] to: Alien_data/silva-138-ssu-nr99-seqs-discard.qza

In [4]:
! qiime rescript dereplicate \
    --i-sequences $data_dir/silva-138-ssu-nr99-seqs-filt.qza  \
    --i-taxa $data_dir/silva-138.1-ssu-nr99-tax.qza \
    --p-rank-handles 'silva' \
    --p-mode 'uniq' \
    --p-threads 3 \
    --o-dereplicated-sequences $data_dir/silva-138-ssu-nr99-seqs-derep-uniq.qza \
    --o-dereplicated-taxa $data_dir/silva-138-ssu-nr99-tax-derep-uniq.qza

Saved FeatureData[Sequence] to: Alien_data/silva-138-ssu-nr99-seqs-derep-uniq.qza
Saved FeatureData[Taxonomy] to: Alien_data/silva-138-ssu-nr99-tax-derep-uniq.qza

PCR extraction: 
The primers used for our samples are designed for the V4-V5 region:
563 F: 5′-AYTGGGYDTAAAGNG-3′
926 R: 5′-CCGTCAATTYHTTTRAGT-3′
I think the step below should use the same primers as in our exercise (515F/806R), since its purpose is to extract the amplified region from the reference database, but I am not sure, so I also asked Lina about this in the email. If anything needs to be changed, I will redo this step.
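
To make clear what extracting a PCR region from a reference sequence means, here is a minimal sketch of the idea, using the 563F/926R primers listed above. It is a simplified stand-in that I wrote for illustration, not how `qiime feature-classifier extract-reads` is implemented: it needs an exact match at every non-degenerate position, it drops the primers from the result (an assumption of this sketch), and the function names and the toy reference sequence are hypothetical.

```python
import re

# Bases allowed by each IUPAC nucleotide code.
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}

# Complement of each IUPAC code (for example R = A/G pairs with Y = T/C).
COMPLEMENT = {
    'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A',
    'R': 'Y', 'Y': 'R', 'S': 'S', 'W': 'W', 'K': 'M', 'M': 'K',
    'B': 'V', 'V': 'B', 'D': 'H', 'H': 'D', 'N': 'N',
}


def reverse_complement(seq):
    """Return the reverse complement of a (possibly degenerate) sequence."""
    return ''.join(COMPLEMENT[base] for base in reversed(seq))


def primer_regex(primer):
    """Turn a degenerate primer into a regular expression."""
    return ''.join(
        code if len(IUPAC[code]) == 1 else f'[{IUPAC[code]}]'
        for code in primer
    )


def extract_amplicon(reference, fwd_primer, rev_primer):
    """Return the region between the two primers, or None if not found.

    The reverse primer binds the opposite strand, so on the forward
    strand we look for its reverse complement.
    """
    pattern = re.compile(
        primer_regex(fwd_primer)
        + '(.+?)'
        + primer_regex(reverse_complement(rev_primer))
    )
    match = pattern.search(reference)
    return match.group(1) if match else None


FWD_PRIMER = 'AYTGGGYDTAAAGNG'     # 563F
REV_PRIMER = 'CCGTCAATTYHTTTRAGT'  # 926R

# Made-up reference: flank + 563F site + insert + 926R site + flank.
toy_reference = (
    'GGCC' + 'ACTGGGCGTAAAGCG' + 'TTTTCCCCGGGG' + 'ACTCAAAGGAATTGACGG' + 'AATT'
)

amplicon = extract_amplicon(toy_reference, FWD_PRIMER, REV_PRIMER)
print(amplicon)
```

The sketch shows why the choice of primers matters: the extracted region is defined entirely by where the two primers bind, so a different primer pair yields a different stretch of the reference.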

In [7]:
! qiime feature-classifier extract-reads \
    --i-sequences $data_dir/silva-138-ssu-nr99-seqs-derep-uniq.qza \
    --p-f-primer GTGCCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACHVGGGTWTCTAAT \
    --p-n-jobs 3 \
    --p-read-orientation 'forward' \
    --o-reads $data_dir/silva-138-ssu-nr99-seqs-515f-806r.qza

Saved FeatureData[Sequence] to: Alien_data/silva-138-ssu-nr99-seqs-515f-806r.qza

In [8]:
! qiime rescript dereplicate \
    --i-sequences $data_dir/silva-138-ssu-nr99-seqs-515f-806r.qza \
    --i-taxa $data_dir/silva-138-ssu-nr99-tax-derep-uniq.qza \
    --p-rank-handles 'silva' \
    --p-mode 'uniq' \
    --p-threads 3 \
    --o-dereplicated-sequences $data_dir/silva-138-ssu-nr99-seqs-515f-806r-uniq.qza \
    --o-dereplicated-taxa  $data_dir/silva-138-ssu-nr99-tax-515f-806r-derep-uniq.qza

Saved FeatureData[Sequence] to: Alien_data/silva-138-ssu-nr99-seqs-515f-806r-uniq.qza
Saved FeatureData[Taxonomy] to: Alien_data/silva-138-ssu-nr99-tax-515f-806r-derep-uniq.qza

The code below trains the taxonomy classifier. This is the step where I am stuck: it does not produce any output in my notebook.

In [7]:
! qiime feature-classifier fit-classifier-naive-bayes \
     --i-reference-reads $data_dir/silva-138-ssu-nr99-seqs-515f-806r-uniq.qza \
     --i-reference-taxonomy $data_dir/silva-138-ssu-nr99-tax-515f-806r-derep-uniq.qza \
     --o-classifier $data_dir/515f-806r-classifier.qza
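
Since the cell gives no feedback, one thing that might help with debugging is to run the command from Python and print each output line with the elapsed time, together with the final exit code. The helper below is a hypothetical sketch of my own (`run_and_stream` is not a QIIME 2 function), and it is demonstrated on a harmless stand-in command rather than on the classifier training itself.

```python
import subprocess
import sys
import time


def run_and_stream(command):
    """Run `command` (a list of strings) and echo its output as it arrives.

    Returns (exit_code, lines), where `lines` holds the combined
    stdout/stderr output of the command.
    """
    start = time.monotonic()
    lines = []
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge error messages into the same stream
        text=True,
    )
    for line in process.stdout:
        line = line.rstrip('\n')
        lines.append(line)
        print(f'[{time.monotonic() - start:7.1f}s] {line}')
    exit_code = process.wait()
    elapsed = time.monotonic() - start
    print(f'finished with exit code {exit_code} after {elapsed:.1f}s')
    return exit_code, lines


# Stand-in command for illustration; the qiime call would be passed the
# same way, as a list of arguments.
exit_code, lines = run_and_stream([sys.executable, '-c', 'print("ok")'])
```

For the real step, the list would hold `qiime`, `feature-classifier`, `fit-classifier-naive-bayes` and the arguments from the cell above, optionally with the QIIME 2 `--verbose` flag added. On Linux, a negative exit code means the process was killed by a signal, which would be consistent with (but does not prove) the machine running out of memory.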