# **Generating Taxonomic Profile from Microbiome Data**

## **1. Importing data (generate artifact)**

In [3]:
##### FASTQ to Artifact #####
# import the FASTQ files listed in the manifest into a QIIME 2 artifact
!qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-format SingleEndFastqManifestPhred33V2 \
  --input-path manifest.tsv \
  --output-path sequences.qza

Imported manifest.tsv as SingleEndFastqManifestPhred33V2 to sequences.qza
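
The import above expects `manifest.tsv` to be a tab-separated file with a `sample-id` column and an `absolute-filepath` column. The notebook does not show how the manifest was built, so the following is only a sketch of one way to generate it. The helper name `write_manifest` and the rule that the sample ID is the file name up to the first underscore are assumptions for illustration.

```python
import csv
from pathlib import Path


def write_manifest(fastq_dir, manifest_path="manifest.tsv"):
    """Write a single-end manifest listing every *.fastq.gz file in fastq_dir."""
    rows = []
    for fastq in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        # Assumption: the sample ID is the file name up to the first underscore.
        sample_id = fastq.name.split("_")[0]
        rows.append((sample_id, str(fastq.resolve())))

    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Header expected by SingleEndFastqManifestPhred33V2.
        writer.writerow(["sample-id", "absolute-filepath"])
        writer.writerows(rows)
    return rows
```

Calling `write_manifest("path/to/fastq")` (a placeholder directory) writes the file and returns the `(sample-id, path)` pairs so they can be checked against the metadata before importing.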

In [4]:
##### Artifact to Visualization #####
# Summarize the imported reads so that sequencing quality can be inspected.
!qiime demux summarize \
	--i-data sequences.qza \
	--o-visualization qualities.qzv

Saved Visualization to: qualities.qzv

In [5]:
# visualize
!qiime tools view qualities.qzv

Press the 'q' key, Control-C, or Control-D to quit. This view may no longer be accessible or work correctly after quitting.

## **2. Denoise/Demultiplex data**

In [6]:
##### Quality Filtering: From Sequence to ASV #####
# Note to self: change trunc-len to 250 (this run used 175)
!qiime dada2 denoise-single \
    --i-demultiplexed-seqs sequences.qza \
    --p-trunc-len 175 \
    --p-n-threads 4 \
    --output-dir dada --verbose

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada.R --input_directory /tmp/qiime2/davo/data/07969cdf-a5ec-49d8-a585-f91f4aba5419/data --output_path /tmp/tmp54cvbted/output.tsv.biom --output_track /tmp/tmp54cvbted/track.tsv --filtered_directory /tmp/tmp54cvbted --truncation_length 175 --trim_left 0 --max_expected_errors 2.0 --truncation_quality_score 2 --max_length Inf --pooling_method independent --chimera_method consensus --min_parental_fold 1.0 --allow_one_off False --num_threads 4 --learn_min_reads 1000000 --homopolymer_gap_penalty NULL --band_size 16

R version 4.3.3 (2024-02-29) 
Loading required package: Rcpp
DADA2: 1.30.0 / Rcpp: 1.0.13.1 / RcppParallel: 5.1.9 
2) Filtering The filter removed all reads: /tmp/tmp54cvbted/CSM5FZ3R_53_L001_R1_001.fastq.gz not 

## **3. Generating Feature table (OTU)**

In [7]:
# Denoising statistics
!qiime metadata tabulate \
    --m-input-file dada/denoising_stats.qza \
    --o-visualization denoising-stats.qzv

Saved Visualization to: denoising-stats.qzv

In [10]:
# visualize 
!qiime tools view denoising-stats.qzv

Press the 'q' key, Control-C, or Control-D to quit. This view may no longer be accessible or work correctly after quitting.
Opening in existing browser session.

In [13]:
# Feature table summary
!qiime feature-table summarize \
  --i-table ./dada/table.qza \
  --m-sample-metadata-file ./metadata.tsv \
  --o-visualization ./dada_freqtable.qzv

Saved Visualization to: ./dada_freqtable.qzv

In [14]:
# visualize 
!qiime tools view ./dada_freqtable.qzv

Press the 'q' key, Control-C, or Control-D to quit. This view may no longer be accessible or work correctly after quitting.
Opening in existing browser session.

In [15]:
# Export feature table
!qiime tools export \
  --input-path dada/table.qza \
  --output-path exported_table

Exported dada/table.qza as BIOMV210DirFmt to directory exported_table

## **4. Taxonomic Classification (match features to labels)**

### **a. Silva 138 99% OTUs full-length sequences**

In [None]:
# get the classifier
# !wget -nv -O \
#   "classifier/silva-138-99-nb-classifier.qza" \
#   "https://data.qiime2.org/classifiers/sklearn-1.4.2/silva/silva-138-99-nb-classifier.qza"

2024-12-08 05:50:58 URL:https://s3-us-west-2.amazonaws.com/qiime2-data/classifiers/sklearn-1.4.2/silva/silva-138-99-nb-classifier.qza [218245868/218245868] -> "silva-138-99-515-806-nb-classifier.qza" [1]


In [16]:
# Assign taxonomy to the representative sequences
!qiime feature-classifier classify-sklearn \
  --i-classifier classifier/silva-138-99-nb-classifier.qza \
  --i-reads ./dada/representative_sequences.qza \
  --o-classification silva-taxonomy.qza

Saved FeatureData[Taxonomy] to: silva-taxonomy.qza

In [17]:
# Export TSV
!qiime tools export \
  --input-path silva-taxonomy.qza \
  --output-path exported_table

Exported silva-taxonomy.qza as TSVTaxonomyDirectoryFormat to directory exported_table

In [18]:
# generate the taxonomy bar plot visualization
!qiime taxa barplot \
  --i-table ./dada/table.qza  \
  --i-taxonomy silva-taxonomy.qza \
  --m-metadata-file metadata.tsv \
  --o-visualization taxa-barplot.qzv

Saved Visualization to: taxa-barplot.qzv

In [1]:
# view the taxonomy bar plot
!qiime tools view taxa-barplot.qzv

Press the 'q' key, Control-C, or Control-D to quit. This view may no longer be accessible or work correctly after quitting.
Opening in existing browser session.

#### Get table for model

In [20]:
### Combine OTU and Taxonomy data
import biom
import pandas as pd
import numpy as np

# Load the OTU table
otu = biom.load_table('exported_table/feature-table.biom')
otu = otu.to_dataframe()

# Load the taxonomy table
taxonomy = pd.read_csv('exported_table/taxonomy.tsv', sep='\t', index_col=0)

# Merge the OTU table with the taxonomy table
otu_taxonomy_merged = pd.merge(taxonomy, otu, left_index=True, right_index=True)

# Save the merged table to a CSV file
otu_taxonomy_merged.to_csv('otu_with_taxonomy.csv')


In [21]:
# Split the 'Taxon' column into multiple columns
df_split = otu_taxonomy_merged['Taxon'].str.split(';', expand=True)

# Assign the split columns to the original dataframe
columns_to_update = ['Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
otu_taxonomy_merged[columns_to_update] = df_split

# Remove the 3-character rank prefix (e.g. 'd__', 'p__') from the specified columns
otu_taxonomy_merged[columns_to_update] = otu_taxonomy_merged[columns_to_update].apply(lambda x: x.str.slice(3))

# Replace empty strings and 'ssigned' (what is left of 'Unassigned' after slicing) with np.nan.
# Note: `'' or 'ssigned'` evaluates to just 'ssigned', so both values are passed as a list instead.
otu_taxonomy_merged[columns_to_update] = otu_taxonomy_merged[columns_to_update].replace(['', 'ssigned'], np.nan)
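
The splitting step above relies on every label carrying a three-character prefix and on the separator containing no spaces. A slightly more defensive variant is sketched below on toy data. The function name `split_taxon` and the prefix pattern `^[a-z]__` are assumptions for illustration, not something the notebook defines.

```python
import re

import pandas as pd

RANKS = ["Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def split_taxon(taxon):
    """Split semicolon-delimited taxonomy strings into one column per rank."""
    rows = []
    for label in taxon:
        # Split on ';' and drop any spaces around each level.
        parts = [part.strip() for part in str(label).split(";")]
        # Strip a leading rank prefix such as 'd__' or 'p__' if present.
        cleaned = [re.sub(r"^[a-z]__", "", part) for part in parts]
        # Treat empty labels and the literal 'Unassigned' as missing.
        cleaned = [c if c and c != "Unassigned" else None for c in cleaned]
        # Pad (or trim) so every row has exactly one value per rank.
        cleaned = (cleaned + [None] * len(RANKS))[: len(RANKS)]
        rows.append(cleaned)
    return pd.DataFrame(rows, index=taxon.index, columns=RANKS)


toy = pd.Series(
    ["d__Bacteria; p__Firmicutes; c__Clostridia", "Unassigned"],
    index=["feature_a", "feature_b"],
)
print(split_taxon(toy))
```

Because the prefix is removed by pattern rather than by position, `Unassigned` is handled explicitly instead of being turned into `ssigned` first.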

In [22]:
# Reorder columns to place columns_to_update after the 'Taxon' column
cols = ['Taxon'] + columns_to_update + [col for col in otu_taxonomy_merged.columns if col not in ['Taxon'] + columns_to_update]
otu_taxonomy_merged = otu_taxonomy_merged[cols]

otu_taxonomy_merged

| | Taxon | Domain | Phylum | Class | Order | Family | Genus | Species | Confidence | CSM5FZ3N | ... | MSM5LLFQ | ESM5MEBP | CSM5MCU8 | CSM5MCWK | MSM5LLH8 | MSM5LLEL | ESM5GEZ1 | CSM5MCUC | HSM5MD4U | MSM5LLI2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| e555dbe2062a0ffcb4f273545b2674a6 | d__Bacteria;p__Firmicutes;c__Clostridia;o__Pep... | Bacteria | Firmicutes | Clostridia | Peptococcales | Peptococcaceae | Peptococcus | | 0.999321 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 63763e4255f871b815820732e3ae4ec0 | d__Bacteria;p__Firmicutes;c__Clostridia;o__Osc... | Bacteria | Firmicutes | Clostridia | Oscillospirales | Ruminococcaceae | Faecalibacterium | | 0.940070 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6af6ac1420fe044e24359eef15c3cda2 | d__Bacteria;p__Firmicutes;c__Clostridia;o__Osc... | Bacteria | Firmicutes | Clostridia | Oscillospirales | Ruminococcaceae | Faecalibacterium | | 0.900157 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| bd89cafb38710bddb7ef4c8531634e47 | d__Bacteria;p__Firmicutes;c__Clostridia;o__Osc... | Bacteria | Firmicutes | Clostridia | Oscillospirales | Ruminococcaceae | Subdoligranulum | | 0.998394 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| cb1b80f3841b89cc8691183c24c6b87c | d__Bacteria;p__Firmicutes;c__Clostridia | Bacteria | Firmicutes | Clostridia | | | | | 0.817713 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 451b459a5a35107cd1691274c807f8a2 | d__Bacteria | Bacteria | | | | | | | 0.766184 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2846ae324ecc370a21433a16059eaa96 | d__Eukaryota | Eukaryota | | | | | | | 0.881866 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4f68e8328bf3e815e9adb1d96d78fa69 | d__Bacteria | Bacteria | | | | | | | 0.767886 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1d01290fe70c6ac4fa126625896349b5 | Unassigned | | | | | | | | 0.557753 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


#### Exploration

In [23]:
print(otu_taxonomy_merged['Confidence'].describe())

count    3023.000000
mean        0.803355
std         0.208818
min         0.301606
25%         0.641529
50%         0.874125
75%         0.996899
max         1.000000
Name: Confidence, dtype: float64


In [26]:
import missingno as msno

msno.matrix(otu_taxonomy_merged[columns_to_update])

<Axes: >
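
The missingness matrix above can be complemented with a number per rank. The sketch below computes the fraction of features that carry a label at each taxonomic rank. `rank_completeness` is a hypothetical helper, and the toy frame stands in for `otu_taxonomy_merged[columns_to_update]`.

```python
import pandas as pd


def rank_completeness(frame, ranks):
    """Return the fraction of rows with a non-missing label at each rank."""
    return frame[ranks].notna().mean()


toy = pd.DataFrame(
    {
        "Domain": ["Bacteria", "Bacteria", "Eukaryota", None],
        "Phylum": ["Firmicutes", "Firmicutes", None, None],
        "Genus": ["Peptococcus", None, None, None],
    }
)
print(rank_completeness(toy, ["Domain", "Phylum", "Genus"]))
```

In the notebook this would be called as `rank_completeness(otu_taxonomy_merged, columns_to_update)`, assuming missing labels are stored as `NaN` or `None` rather than as empty strings.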

In [27]:
# export csv
otu_taxonomy_merged.to_csv('otu_with_taxonomy.csv')

## **5. Alpha Rarefaction and Selecting a Rarefaction Depth**

In [17]:
# calculate rarefaction
!qiime diversity alpha-rarefaction \
  --i-table ./dada/table.qza \
  --m-metadata-file metadata.tsv \
  --p-min-depth 10 \
  --p-max-depth 4900 \
  --o-visualization alpha_rarefaction_curves.qzv

Saved Visualization to: alpha_rarefaction_curves.qzv
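
Rarefying to a given depth drops every sample with fewer reads than that depth and subsamples the remaining samples down to it, so choosing a depth is a trade-off between samples kept and reads kept per sample. The sketch below tabulates that trade-off for a few candidate depths. `depth_tradeoff` and the toy totals are illustrative assumptions. In the notebook, the per-sample totals could be taken from the feature table loaded earlier, for example `otu.sum(axis=0)`.

```python
import pandas as pd


def depth_tradeoff(sample_totals, depths):
    """For each candidate depth, count the samples and reads that rarefying would keep."""
    rows = []
    for depth in depths:
        # Samples with fewer reads than the depth are dropped entirely.
        kept = int((sample_totals >= depth).sum())
        # Every retained sample contributes exactly `depth` reads.
        rows.append({"depth": depth, "samples_kept": kept, "reads_kept": kept * depth})
    return pd.DataFrame(rows).set_index("depth")


toy_totals = pd.Series({"sample_1": 100, "sample_2": 5000, "sample_3": 12000})
print(depth_tradeoff(toy_totals, [10, 4900, 10000]))
```

The table is meant to be read alongside the rarefaction curves: a reasonable depth is one where the curves have flattened while most samples are still retained.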

In [214]:
# view the alpha rarefaction curves
!qiime tools view alpha_rarefaction_curves.qzv

Press the 'q' key, Control-C, or Control-D to quit. This view may no longer be accessible or work correctly after quitting.
Opening in existing browser session.

## **6. Phylogenetics**

In [36]:
# Aligning sequences and constructing a phylogenetic tree with QIIME2
!qiime phylogeny align-to-tree-mafft-fasttree \
	--i-sequences dada/representative_sequences.qza \
	--output-dir tree

^C

Aborted!


In [14]:
# Visualization for the tree using the empress QIIME 2 plugin
!qiime empress tree-plot \
	--i-tree tree/rooted_tree.qza \
	--o-visualization tree/empress.qzv

Saved Visualization to: tree/empress.qzv

In [35]:
# export the taxonomy table
!qiime tools export \
  --input-path taxonomy.qza \
  --output-path exported_table

Exported taxonomy.qza as TSVTaxonomyDirectoryFormat to directory exported_table

In [15]:
# visualize empress
!qiime tools view tree/empress.qzv

Press the 'q' key, Control-C, or Control-D to quit. This view may no longer be accessible or work correctly after quitting.
Opening in existing browser session.