# Biomphalaria organ microbiome


## Aim

Internship project of Lauren Carruthers. The goal is to compare the microbiomes of different tissues from two *Biomphalaria* species.


Several folders are present or will be created during the analysis. Here is the list of the folders and their content:
* **data**: Raw data files used for the analysis.
* **.env**: Files needed to create the appropriate environment.
* **graphs**: Graphical representations of the data. If missing, this folder will be created during the analysis.
* **results**: Files that are generated through data processing. If missing, this folder will be created during the analysis.
* **scripts**: Scripts used for the analysis.


## Environment and data

### Creating environment

In [None]:
# Check if conda available
[[ ! $(which conda 2> /dev/null) ]] && echo "conda not available in \$PATH. Please interrupt the kernel and fix the situation." && sleep inf

# Creating conda environment
conda env create -f .env/env.yml

The following cell must be run each time a new Jupyter session is started.

In [2]:
# Activate the environment
source $(sed "s,/bin/conda,," <<<$CONDA_EXE)/etc/profile.d/conda.sh
conda activate ubiome_organs

# Remove potential variable interferences
export PERL5LIB=""
export PYTHONNOUSERSITE=1


In [None]:
# Installing needed R packages
Rscript ".env/R package dependencies.R"

### Downloading sequencing data

This step downloads the fastq files of the different samples.

**vvvv To be removed** This generates the data and will have to be replaced by the downloading of the data from SRA

In [None]:
# Project directory
cd $HOME/analyses/11-Microbiome/
mkdir 2019-08-08_Biomphalaria_tissues
cd 2019-08-08_Biomphalaria_tissues

# Fastq generation (MiSeq Output previously uploaded)
cd 0-Raw\ data/190806_M01370_0001_000000000-CKY27/
nohup bcl2fastq --output-dir ../fastq_files/
cd ../..

# Working directory
mkdir 1-Qiime
cd 1-Qiime

# Link data
mkdir data
ln -s ../0-Libraries/*.fastq.gz data/

**^^^ To be removed**

In [None]:
# Data directory
ldir="data/libraries"
[[ ! -d "$ldir" ]] && mkdir -p "$ldir"

# Bioproject
bioproject=PRJNXXXXXXX

# Download related information to data project
wget -q -O runinfo "http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&rettype=runinfo&db=sra&term=${bioproject}"

# Field of interest (library name and weblink)
fdn=$(head -n 1 runinfo | tr "," "\n" | grep -w -n "LibraryName" | cut -d ":" -f 1)
fdr=$(head -n 1 runinfo | tr "," "\n" | grep -w -n "Run" | cut -d ":" -f 1)

# Download fastq files
while IFS= read -r line
do
    # Filename and run accession
    fln=$(cut -d "," -f "$fdn" <<<"$line")
    run=$(cut -d "," -f "$fdr" <<<"$line")
    
    # Download
    echo "$fln"
    #wget -P "$ldir" -O "$fln" "http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=dload&run_list=${run}&format=fastq"
    fastq-dump -O "$ldir" --split-files "$run"
    
    mv "$ldir/${run}_1.fastq" "$ldir/${fln}_R1.fastq"
    mv "$ldir/${run}_2.fastq" "$ldir/${fln}_R2.fastq"
        
done < <(tail -n +2 runinfo)

# Compress files
pigz "$ldir/"*

rm runinfo
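
The column lookup used in the cell above can be tried offline on a toy table. The sketch below assumes a made-up `runinfo`-style CSV: the header names `Run` and `LibraryName` come from the cell above, while the accessions, the other columns, and the library names are invented for illustration. It resolves column positions by name rather than by a fixed index, so it keeps working if the column order changes.

```shell
# Toy run table written to a temporary file (values are made up for illustration)
tmp=$(mktemp)
printf '%s\n' \
    "Run,ReleaseDate,LibraryName,LibraryLayout" \
    "SRR0000001,2020-01-01,Sample.A,PAIRED" \
    "SRR0000002,2020-01-01,Sample.B,PAIRED" > "$tmp"

# Resolve the column number of each field of interest from the header
fdn=$(head -n 1 "$tmp" | tr "," "\n" | grep -w -n "LibraryName" | cut -d ":" -f 1)
fdr=$(head -n 1 "$tmp" | tr "," "\n" | grep -w -n "Run" | cut -d ":" -f 1)

# List each run accession next to the file name prefix it would be renamed to
pairs=$(tail -n +2 "$tmp" | cut -d "," -f "$fdr,$fdn" | tr "," "\t")
echo "$pairs"

rm "$tmp"
```

Because `grep -w` matches whole words only, a header such as `ReleaseDate` is not mistaken for `Run`.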

### Downloading database

The Silva database is used to assign taxonomy to the ASVs generated from the sequencing data.

In [None]:
# Database directory
dbdir="data/Silva db"
[[ ! -d "$dbdir" ]] && mkdir -p "$dbdir"


# Download and extract the relevant Silva file
wget -P "$dbdir" 'https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip'
unzip "$dbdir/Silva_132_release.zip" -d "$dbdir" && rm "$dbdir/Silva_132_release.zip"

# Import the sequence database in Qiime format
qiime tools import \
    --input-path "$dbdir/SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99/silva_132_99_16S.fna" \
    --output-path "$dbdir/silva_132_99_16S.qza" \
    --type 'FeatureData[Sequence]'

# Import the taxonomy database in Qiime format
qiime tools import \
    --input-path "$dbdir/SILVA_132_QIIME_release/taxonomy/16S_only/99/taxonomy_all_levels.txt" \
    --output-path "$dbdir/silva_132_99_16S_taxa.qza" \
    --type 'FeatureData[Taxonomy]' \
    --input-format HeaderlessTSVTaxonomyFormat

## Qiime pipeline

This section processes the data to generate ASVs and assign taxonomy.

In [5]:
# Qiime output directory
qdir="results/1-qiime"
[[ ! -d "$qdir" ]] && mkdir -p "$qdir"

# Metadata file
metadata="data/sample-metadata.tsv"


**Note**: The C.4.BgBS90.O sample is missing because its amplification failed, so no library was made from this sample.

**vvvv To be removed** This generates the manifest from fastq generated by bcl2fastq and will have to be replaced by the cell after that generates the manifest from the downloaded SRA data

In [5]:
# Create the manifest for importing data in artefact
## source: https://docs.qiime2.org/2019.4/tutorials/importing/#fastq-manifest-formats
for i in $(ls data/* | cut -d "_" -f 3-5 | uniq)
do
    nm=$(sed "s,data/,, ; s,_,.,g" <<<$i)
    fl=$(ls -1 $PWD/data/*$i* | tr "\n" "\t")

    echo -e "$nm\t$fl"
done > "$qdir/manifest"

# Add header
sed -i "1s/^/sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n/" "$qdir/manifest"

# Import data
## source: https://docs.qiime2.org/2019.4/tutorials/importing/
qiime tools import \
    --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path "$qdir/manifest" \
    --input-format PairedEndFastqManifestPhred33V2 \
    --output-path "$qdir/demux-paired-end.qza"

**^^^ To be removed**

In [None]:
# Check for sequencing data
[[ ! $(find "$ldir" -type f -name "*fastq.gz") ]] && echo "No sequencing data. Please interrupt the kernel and fix the situation." && sleep inf

# Create the manifest for importing data in artefact
## source: https://docs.qiime2.org/2019.4/tutorials/importing/#fastq-manifest-formats
for i in $(ls "$ldir"/* | sed "s,_R[12].fastq.*,,g" | uniq)
do
    nm=$(sed "s,$ldir/,," <<<$i)
    fl=$(ls -1 "$PWD/$i"* | tr "\n" "\t")

    echo -e "$nm\t$fl"
done > "$qdir/manifest"

# Add header
sed -i "1s/^/sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n/" "$qdir/manifest"

# Import data
## source: https://docs.qiime2.org/2019.4/tutorials/importing/
qiime tools import \
    --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path "$qdir/manifest" \
    --input-format PairedEndFastqManifestPhred33V2 \
    --output-path "$qdir/demux-paired-end.qza"

**Note**:
* No need to remove adapters and barcodes. This has been done during `bcl2fastq`. This can be checked using `grep`.
* Importing with `--input-format PairedEndFastqManifestPhred33V2` instead of `--input-format CasavaOneEightSingleLanePerSampleDirFmt` for [custom sample names](https://docs.qiime2.org/2019.4/tutorials/importing/#fastq-manifest-formats).

### Data quality

To assess data quality, we generate a visualization of the read quality. The visualization can be viewed on the [Qiime2 website](https://view.qiime2.org/).

In [9]:
# Make a summary to check read quality
qiime demux summarize \
    --i-data "$qdir/demux-paired-end.qza" \
    --o-visualization "$qdir/demux-paired-end.qzv"

Saved Visualization to: demux-paired-end2.qzv

**Note**: Read quality drops toward the end of the reads but remains above 10, so no trimming was done.

### Sequence clustering and denoising

This step generates ASVs from the sequencing data. It is performed by the `dada2` module.

In [None]:
qiime dada2 denoise-paired \
    --i-demultiplexed-seqs "$qdir/demux-paired-end.qza" \
    --p-trunc-len-f 177 \
    --p-trunc-len-r 202 \
    --p-trim-left-f 0 \
    --p-trim-left-r 13 \
    --p-max-ee-f 5 \
    --p-max-ee-r 10 \
    --p-n-threads 0 \
    --o-table "$qdir/table.qza" \
    --o-representative-sequences "$qdir/rep-seqs.qza" \
    --o-denoising-stats "$qdir/denoising-stats.qza"

In [None]:
qiime feature-classifier classify-consensus-vsearch \
    --i-query "$qdir/rep-seqs.qza" \
    --i-reference-reads "$dbdir/silva_132_99_16S.qza" \
    --i-reference-taxonomy "$dbdir/silva_132_99_16S_taxa.qza" \
    --p-perc-identity 0.97 \
    --p-threads $(nproc) \
    --o-classification "$qdir/rep-seqs_taxa.qza"
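
Once the classification is exported to a tab-separated table (for example with `qiime tools export`), counting ASVs per top-level taxon gives a quick view of how many remained `Unassigned`. The sketch below runs on a toy table. The three-column layout and the `D_0__` style labels are assumptions meant to mimic a Silva 132 classification, and the feature IDs and taxa are invented, not results of this analysis.

```shell
# Toy taxonomy table (feature IDs and taxa are made up for illustration)
tmp=$(mktemp)
printf '%s\t%s\t%s\n' \
    "Feature ID" "Taxon" "Consensus" \
    "asv1" "D_0__Bacteria;D_1__Proteobacteria" "1.0" \
    "asv2" "D_0__Bacteria;D_1__Firmicutes" "1.0" \
    "asv3" "Unassigned" "1.0" > "$tmp"

# Count ASVs per top-level taxon (first ";"-separated level of the second column)
counts=$(tail -n +2 "$tmp" | cut -f 2 | cut -d ";" -f 1 | sort | uniq -c | awk '{print $2 "\t" $1}')
echo "$counts"

rm "$tmp"
```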

In [None]:
# source: https://chmi-sops.github.io/mydoc_qiime2.html

# Multiple sequence alignment using Mafft
qiime alignment mafft \
    --i-sequences "$qdir/rep-seqs.qza" \
    --o-alignment "$qdir/aligned-rep-seqs.qza"

# Masking (or filtering) the alignment to remove positions that are highly variable. These positions are generally considered to add noise to a resulting phylogenetic tree.
qiime alignment mask \
    --i-alignment "$qdir/aligned-rep-seqs.qza" \
    --o-masked-alignment "$qdir/masked-aligned-rep-seqs.qza"

# Creating tree using the Fasttree program
qiime phylogeny fasttree \
    --i-alignment "$qdir/masked-aligned-rep-seqs.qza" \
    --o-tree "$qdir/unrooted-tree.qza"

# Root the tree at the midpoint of the longest tip-to-tip distance
qiime phylogeny midpoint-root \
    --i-tree "$qdir/unrooted-tree.qza" \
    --o-rooted-tree "$qdir/rooted-tree.qza"

## Identification of ASVs with unassigned taxonomy

ASVs with unassigned taxonomy could correspond to eukaryotic contaminants because the 16S primers can also amplify 5S or 18S regions. To exclude such contaminants, we perform a megablast search against the NCBI nt database to find the most similar sequence for each unassigned ASV. The results are then used to exclude non-16S sequences from the subsequent analysis.

Because of the time this analysis can take (up to 7h), the list of contaminants is already available in the data folder.

### Export of unassigned ASV

We use Qiime to filter and export the ASVs without taxonomy assignments.

In [3]:
# Filtering assigned ASV out 
qiime taxa filter-seqs \
    --i-sequences "$qdir/rep-seqs.qza" \
    --i-taxonomy "$qdir/rep-seqs_taxa.qza" \
    --p-include "Unassigned" \
    --o-filtered-sequences "$qdir/rep-seqs_unassigned.qza"

# Exporting sequences in fasta format
qiime tools export \
    --input-path "$qdir/rep-seqs_unassigned.qza" \
    --output-path "$qdir/rep-seqs_unassigned"

Saved FeatureData[Sequence] to: results/1-qiime/rep-seqs_unassigned.qza
Exported results/1-qiime/rep-seqs_unassigned.qza as DNASequencesDirectoryFormat to directory results/1-qiime/rep-seqs_unassigned

### Blast and annotation of unassigned ASVs

To identify the sequences most similar to the unassigned ASVs, we perform a megablast search. We use a relatively lenient e-value threshold (1e-2) to increase detection power. This step is relatively long (estimated running time: 5 h - 8 h) even when limited to a maximum of 1 target sequence and 1 alignment (HSP) per query. The resulting table is then updated with the title and the top-level taxon of each blast match, which is the first field of the lineage and is stored in the `phylum` variable (estimated running time: 30 min - 1 h). Finally, we correct the annotations because some ASVs match mitochondrial 16S DNA of eukaryotic organisms and are classified as eukaryotes although they likely represent bacteria.

In [None]:
# Blast the unassigned against the nt database
blastn -task megablast -db nt -remote \
    -query "$qdir/rep-seqs_unassigned/dna-sequences.fasta" \
    -max_target_seqs 1 \
    -max_hsps 1 \
    -evalue 1e-2 \
    -outfmt 6 > "$qdir/rep-seqs_unassigned/unassigned.blastn.tsv"

# Complete the table to identify what kind of organism the match belongs to
btsv="$qdir/rep-seqs_unassigned/unassigned.blastn.tsv"
while IFS= read -r line
do
    # Get the subject identifier (GI or accession, second column) from the blast result
    gi=$(cut -f 2 <<<"$line")
    
    # Download entry and get title and top-level taxon info
    entry=$(wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${gi}&rettype=gb")
    title=$(echo "$entry" | grep "^DEFINITION" | cut -d " " -f 3-)
    phylum=$(echo "$entry" | grep -A 1 "^ *ORGANISM" | sed -n "2p" | cut -d ";" -f 1 | sed "s/ *//g")
    
    # Append the annotation to the line (printf copes with "/" or "&" in titles, which would break a sed substitution)
    printf '%s\t%s\t%s\n' "$line" "$title" "$phylum"
    
    # Sleep a little to avoid server closing connection on next request
    sleep 0.25s
done < "$btsv" > "$btsv.tmp" && mv "$btsv.tmp" "$btsv"

# Correct annotation
for i in $(grep -E -n "mitochondri.*Eukaryota$" "$qdir/rep-seqs_unassigned/unassigned.blastn.tsv" | cut -d ":" -f 1)
do
    sed -i "${i}s/Eukaryota/Bacteria/" "$qdir/rep-seqs_unassigned/unassigned.blastn.tsv"
done
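
The annotated table can then be reduced to a list of ASV identifiers to exclude, with one ID per line as expected by the contaminant removal step later in this notebook. How `data/contaminants` was actually produced is not shown here, so the sketch below is only one possible way. It assumes the last column holds the corrected top-level taxon and treats anything other than `Bacteria` as a contaminant, which is itself an assumption. The toy rows (IDs, accessions, titles) are invented and shortened to four columns.

```shell
# Toy annotated blast table: ASV ID, subject ID, title, top-level taxon
# (IDs, accessions, and titles are made up for illustration)
tmp=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
    "asv1" "XX000001.1" "Biomphalaria glabrata 18S ribosomal RNA gene" "Eukaryota" \
    "asv2" "XX000002.1" "Uncultured bacterium 16S ribosomal RNA gene" "Bacteria" \
    "asv3" "XX000003.1" "Some mitochondrion, complete genome" "Bacteria" > "$tmp"

# Keep the ID of every ASV whose best match is not bacterial
contaminants=$(awk -F "\t" '$NF != "Bacteria" {print $1}' "$tmp" | sort -u)
echo "$contaminants"

rm "$tmp"
```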

## Functional inference

We investigate the potential role of the hemolymph microbiome by analyzing the metabolic pathways differentially represented between hemolymph and water samples. The pathway inference is done using PICRUSt2, following this [tutorial](https://github.com/picrust/picrust2/wiki/PICRUSt2-Tutorial-(v2.1.4-beta)#pathway-level-inference).

In [4]:
# PICRUSt2 output directory
pdir="results/2-picrust2"
[[ ! -d "$pdir" ]] && mkdir -p "$pdir"


In [7]:
# Export sequences
qiime tools export \
    --input-path "$qdir/rep-seqs.qza" \
    --output-path "$pdir/"

# Remove contaminants
while read i
do
    sed -i "/>${i}/,+1d" "$pdir/dna-sequences.fasta"
done < "data/contaminants"

qiime tools export \
    --input-path "$qdir/table.qza" \
    --output-path "$pdir/rep"

Exported results/1-qiime/rep-seqs.qza as DNASequencesDirectoryFormat to directory results/2-picrust2/
Exported results/1-qiime/table.qza as BIOMV210DirFmt to directory results/2-picrust2/rep
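
The `sed` range used for contaminant removal deletes a header line and the single line after it, so it relies on each sequence occupying exactly one line and on GNU `sed` (the `addr,+N` form is a GNU extension). The sketch below shows the behaviour on a toy fasta file with invented IDs. Unlike the cell above, it anchors the pattern to the whole header line, which is an optional precaution against partial ID matches.

```shell
# Toy two-line-per-record fasta and contaminant list (IDs are made up)
tmpdir=$(mktemp -d)
printf '%s\n' ">asv1" "ACGT" ">asv2" "GGCC" ">asv3" "TTAA" > "$tmpdir/seqs.fasta"
printf '%s\n' "asv2" > "$tmpdir/contaminants"

# Delete each contaminant header together with its sequence line
while read -r i
do
    sed -i "/^>${i}\$/,+1d" "$tmpdir/seqs.fasta"
done < "$tmpdir/contaminants"

# Remaining sequence IDs
remaining=$(grep "^>" "$tmpdir/seqs.fasta" | tr -d ">" | paste -s -d "," -)
echo "$remaining"

rm -r "$tmpdir"
```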

In [8]:
# Place reads into reference tree
place_seqs.py -s "$pdir/dna-sequences.fasta" -o "$pdir/out.tre" -p $(nproc) \
    --intermediate "$pdir/intermediate/place_seqs"

# Hidden-state prediction of gene families
hsp.py -i 16S -t "$pdir/out.tre" -o "$pdir/marker_predicted_and_nsti.tsv.gz" -p $(nproc) -n
hsp.py -i EC -t "$pdir/out.tre" -o "$pdir/EC_predicted.tsv.gz" -p $(nproc) -n

# Number of outliers
zcat "$pdir/marker_predicted_and_nsti.tsv.gz"  | tail -n +2 | awk '$3 >= 2' | wc -l


This is the set of poorly aligned input sequences to be excluded: fe70a7151005deea0059d7a2e3cdc991, 0c4dc293d67c7ec04776747e95472401, eb85cbbe78c5a9f7a8f459a93bc77e40, 3f7b9842f475a6d1cb89cb3e35a48438, b3806aba53e45979728e3295fbc9b138, f58b60a0ffa6b70af68fd95d63b691ab, 3822ef1b1055e494cd9f36fceec4db9d, bd498b1081c7b564609c406a3549dda4, 80985301cd7eb46118ce99a3504cbec3, f9c944f3f7414e0917d4a2d559014387, f295b1ee79e587340b5d4705403aa67f, 5c5a3b52521062cfae2421e4932c5b7c, 50ed35943a01d18c25ca7ff8d9d0b1bd, 33620f8983bc6136c1d4a5934c2145b3, 8880fa044897f3b276f9fb6b06bac807, a9e1207d131ed1b5a1f089aef8be502d, d8f45ceb8cfd4c1c75d0e029047590d7, 8c585b220e4e405520466f588c25e34f, 11edfdcd0551a9c73de61c895bf46c9f, 3f403dbee943b163577f2215323d28e2, 95daa04d5a452f70cf0001a438d24ca0, d1f6957777e1cc48276f21b0911a865c, 7cee499c9cf8589a1cf315b2c23cdf38, cd9099cbd25af5269071c9d214615f99, 8464f96cbfb7217b34b377bdc02e20e1, 9681f56486e4962acecd443d619f8724, e92859a0f57d9fb7aa595f266fea2dd8, df05b5305bb9623


In [15]:
# Generate metagenome predictions
metagenome_pipeline.py \
    -i "$pdir/rep/feature-table.biom" \
    -m "$pdir/marker_predicted_and_nsti.tsv.gz" \
    -f "$pdir/EC_predicted.tsv.gz" \
    --max_nsti 2.0 \
    -o "$pdir/rep/EC_metagenome_out" \
    --strat_out \
    --wide_table

139 of 1711 ASVs were above the max NSTI cut-off of 2.0 and were removed.
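
The effect of the NSTI cut-off can be illustrated on a toy table. The sketch below assumes the three-column layout that the outlier count above relies on (ASV ID, predicted 16S copy number, NSTI in the third column). The header names and all values are invented for illustration.

```shell
# Toy NSTI table (header names and values are made up for illustration)
tmp=$(mktemp)
printf '%s\t%s\t%s\n' \
    "sequence" "16S_rRNA_Count" "metadata_NSTI" \
    "asv1" "1" "0.05" \
    "asv2" "2" "2.70" \
    "asv3" "1" "0.90" \
    "asv4" "3" "1.20" > "$tmp"

# Count ASVs above the cut-off and express them as a share of all ASVs
summary=$(tail -n +2 "$tmp" | awk -v max=2.0 '$3 > max {n++} END {printf "%d of %d ASVs (%.1f%%) above the cut-off", n, NR, 100 * n / NR}')
echo "$summary"

rm "$tmp"
```

For the run recorded above, 139 of 1711 ASVs corresponds to about 8.1% of the ASVs being removed.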

In [17]:
# Pathway-level inference
pathway_pipeline.py \
    -i "$pdir/rep/EC_metagenome_out/pred_metagenome_strat.tsv.gz" \
    -o "$pdir/rep/pathways_out" \
    -p $(nproc) \
    --wide_table

# Add functional descriptions
add_descriptions.py -i "$pdir/rep/EC_metagenome_out/pred_metagenome_unstrat.tsv.gz" -m EC \
                    -o "$pdir/rep/EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz"

add_descriptions.py -i "$pdir/rep/pathways_out/path_abun_unstrat.tsv.gz" -m METACYC \
                    -o "$pdir/rep/pathways_out/path_abun_unstrat_descrip.tsv.gz"


## Trunk

### Sample metadata updates

In [None]:
# Create the sample metadata table from the fastq file names
## source: https://docs.qiime2.org/2019.4/tutorials/importing/#fastq-manifest-formats
for i in $(ls data/* | cut -d "_" -f -5 | uniq)
do
    nm=$(sed "s,data/,, ; s,_,.,g" <<<$i)
    cln=$(echo "$nm" | cut -d "." -f 4)
    cln="$cln\t$(echo "$nm" | cut -d "." -f 4-5)"
    cln="$cln\t$(echo "$nm" | cut -d "." -f 5)"
    
    # Update name
    nm=$(echo "$nm" | cut -d "." -f 3-5)

    echo -e "$nm\t$cln"
done > sample-metadata.tsv

# Add header
sed -i "1s/^/sample-id\tSpecies\tComb\tTissue\n/" sample-metadata.tsv

In [19]:
## !! WARNING To be removed when samplesheet corrected
sed -i "s/\tCa/\tBa/g" sample-metadata.tsv
