# Snail hemolymph microbiome


## Aim

This notebook contains the different steps to install the environment, download the data and perform the data analysis. 


Several folders are present or will be created during the analysis. Here is the list of the folders and their content:
* **data**: Raw data files used for the analysis.
* **.env**: Files needed to create appropriate environment.
* **graphs**: Graphical representation of the data. If it does not exist, it will be created during the analysis.
* **results**: Files that are generated through data processing. If it does not exist, it will be created during the analysis.
* **scripts**: Scripts used for the analysis.


## Environment and data

### Creating environment

In [None]:
# Check if conda available
[[ ! $(which conda 2> /dev/null) ]] && echo "conda not available in \$PATH. Exiting..." && exit 1

# Creating conda environment
conda env create -f .env/env.yml
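
A minimal sketch of the same availability check, assuming a Bash kernel. It uses `command -v` and returns a status instead of calling `exit`. The function name `require_tool` is hypothetical and not part of this repository.

```shell
# Hypothetical helper: report whether a program is available in $PATH.
require_tool() {
    if command -v "$1" > /dev/null 2>&1
    then
        echo "$1 found"
    else
        echo "$1 not available in \$PATH" >&2
        return 1
    fi
}

# `sh` exists on any POSIX system; the second name is deliberately bogus.
found_msg=$(require_tool sh)
missing_status=0
require_tool no_such_tool_xyz 2> /dev/null || missing_status=$?
echo "$found_msg (status for missing tool: $missing_status)"
```

The same helper could be called with `conda`, `qiime` or `fastq-dump` before the cells that depend on them.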

The following cell must be run each time a new Jupyter session is started.

In [None]:
# Activate the environment
source $(sed "s,/bin/conda,," <<<$CONDA_EXE)/etc/profile.d/conda.sh
conda activate ubiome_hml

In [None]:
# Installing needed R packages
Rscript ".env/R package dependencies.R"

### Downloading sequencing data

This step downloads the fastq files of the different samples.

In [None]:
# Data directory
ldir="data/libraries"
[[ ! -d "$ldir" ]] && mkdir -p "$ldir"

# Bioproject
bioproject=PRJNA613098

# Download the run information of the BioProject
wget -q -O runinfo "http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&rettype=runinfo&db=sra&term=${bioproject}"

# Fields of interest (library name and run accession)
fdn=$(head -n 1 runinfo | tr "," "\n" | grep -w -n "LibraryName" | cut -d ":" -f 1)
fdr=$(head -n 1 runinfo | tr "," "\n" | grep -w -n "Run" | cut -d ":" -f 1)

# Download fastq files
while read line
do
    # Filename and download link
    fln=$(cut -d "," -f $fdn <<<$line)
    run=$(cut -d "," -f $fdr <<<$line)
    
    # Download
    echo "$fln"
    #wget -P "$ldir" -O "$fln" "http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=dload&run_list=${run}&format=fastq"
    fastq-dump -O "$ldir" --split-files "$run"
    
    mv "$ldir/${run}_1.fastq" "$ldir/${fln}_R1.fastq"
    mv "$ldir/${run}_2.fastq" "$ldir/${fln}_R2.fastq"
        
done < <(tail -n +2 runinfo)

# Compress files
pigz "$ldir/"*

rm runinfo
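
The column lookup used in the cell above can be tried on a toy table. This is only a sketch: the accessions, dates and library names below are made up, and the real `runinfo` file has many more columns.

```shell
# Toy run table; the accessions and library names below are made up.
runinfo_demo=$(mktemp)
cat > "$runinfo_demo" <<'EOF'
Run,ReleaseDate,LibraryName,LibraryLayout
SRR0000001,2020-01-01,Sample.A.1,PAIRED
SRR0000002,2020-01-01,Sample.B.1,PAIRED
EOF

# Locate columns by header name: put one field per line, then keep the
# line number of the field that matches the whole word.
fdn=$(head -n 1 "$runinfo_demo" | tr "," "\n" | grep -w -n "LibraryName" | cut -d ":" -f 1)
fdr=$(head -n 1 "$runinfo_demo" | tr "," "\n" | grep -w -n "Run" | cut -d ":" -f 1)

# List each run accession with its library name
# (cut prints the selected fields in file order)
pairs=$(tail -n +2 "$runinfo_demo" | cut -d "," -f "$fdr,$fdn" | tr "," "\t")
echo "$pairs"

rm "$runinfo_demo"
```

Looking columns up by name rather than by fixed position keeps the cell working if the column order of the run table changes.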

### Downloading database

The Silva database is used to assign taxonomy to the ASVs generated from the sequencing data.

In [None]:
# Database directory
dbdir="data/Silva db"
[[ ! -d "$dbdir" ]] && mkdir -p "$dbdir"


# Download and extract the relevant Silva file
wget -P "$dbdir" 'https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip'
unzip "$dbdir/Silva_132_release.zip" -d "$dbdir" && rm "$dbdir/Silva_132_release.zip"

# Import the sequence database in Qiime format
qiime tools import \
  --input-path "$dbdir/SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99/silva_132_99_16S.fna" \
  --output-path "$dbdir/silva_132_99_16S.qza" \
  --type 'FeatureData[Sequence]'

# Import the taxonomy database in Qiime format
qiime tools import \
  --input-path "$dbdir/SILVA_132_QIIME_release/taxonomy/16S_only/99/taxonomy_all_levels.txt" \
  --output-path "$dbdir/silva_132_99_16S_taxa.qza" \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat

## Qiime pipeline

This section processes the data to generate ASVs and assign taxonomy.

In [None]:
# Qiime output directory
qdir="results/1-qiime"
[[ ! -d "$qdir" ]] && mkdir -p "$qdir"

# Metadata file
metadata="data/sample-metadata.tsv"

### Sequencing data

If sequencing data are present, they will be imported as a Qiime artifact.

In [None]:
# Check for sequencing data
[[ ! $(find "$ldir" -type f -name "*fastq.gz") ]] && echo  "No sequencing data. Exiting..." && exit 1

# Create the manifest for importing data in artefact
## source: https://docs.qiime2.org/2019.4/tutorials/importing/#fastq-manifest-formats
for i in $(ls "$ldir"/* | sed "s,_R[12].fastq.*,,g" | uniq)
do
    nm=$(sed "s,$ldir/,," <<<$i)
    fl=$(ls -1 "$PWD/$i"* | tr "\n" "\t")

    echo -e "$nm\t$fl"
done > "$qdir/manifest"

# Add header
sed -i "1s/^/sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n/" "$qdir/manifest"

# Import data
## source: https://docs.qiime2.org/2019.4/tutorials/importing/
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path "$qdir/manifest" \
  --input-format PairedEndFastqManifestPhred33V2 \
  --output-path "$qdir/demux-paired-end.qza"

**Note**:
* No need to remove adapters and barcodes. This has been done during `bcl2fastq`. This can be checked using `grep`.
* Importing with `--input-format PairedEndFastqManifestPhred33V2` instead of `--input-format CasavaOneEightSingleLanePerSampleDirFmt` for [custom sample names](https://docs.qiime2.org/2019.4/tutorials/importing/#fastq-manifest-formats).
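
For reference, the layout expected by `PairedEndFastqManifestPhred33V2` is a tab-separated file with a header and one line per sample. The sketch below builds such a manifest from empty placeholder files in a temporary directory. The sample names are hypothetical, and the loop is a simplified variant of the cell above, not the code used for the analysis.

```shell
# Temporary library directory with empty placeholder files (hypothetical sample names)
demo_dir=$(mktemp -d)
touch "$demo_dir"/Sample.A.1_R1.fastq.gz "$demo_dir"/Sample.A.1_R2.fastq.gz \
      "$demo_dir"/Sample.B.1_R1.fastq.gz "$demo_dir"/Sample.B.1_R2.fastq.gz

# Header, then one line per sample: ID, forward read path, reverse read path
manifest="$demo_dir/manifest"
printf "sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n" > "$manifest"
for fwd in "$demo_dir"/*_R1.fastq.gz
do
    id=$(basename "$fwd" _R1.fastq.gz)
    printf "%s\t%s\t%s\n" "$id" "$fwd" "${fwd%_R1.fastq.gz}_R2.fastq.gz" >> "$manifest"
done

cat "$manifest"
n_lines=$(wc -l < "$manifest")
second_id=$(sed -n 2p "$manifest" | cut -f 1)
rm -r "$demo_dir"
```

Writing the header first avoids the later `sed -i` step, and iterating over the `_R1` files avoids parsing the output of `ls`.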

### Data quality

To assess data quality, we generate a visualization summarizing read quality. The visualization can be viewed on the [Qiime2 website](https://view.qiime2.org/).

In [None]:
# Make a summary to check read quality
qiime demux summarize \
  --i-data "$qdir/demux-paired-end.qza" \
  --o-visualization "$qdir/demux-paired-end.qzv"

**Note**: read quality drops toward the end of the reads but remains above 10, so no quality trimming was performed.

### Sequence clustering and denoising

This step generates ASVs from the sequencing data. It is performed by the `dada2` module.

In [None]:
[[ -f "$qdir/table.qza" ]] && echo "ASV file (table.qza) exists already. Exiting..." && exit 1

# Run dada2
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs "$qdir/demux-paired-end.qza" \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 250 \
  --p-max-ee 5 \
  --p-n-threads 0 \
  --o-table "$qdir/table.qza" \
  --o-representative-sequences "$qdir/rep-seqs.qza" \
  --o-denoising-stats "$qdir/denoising-stats.qza"

# Add metadata information to the denoising stats
qiime metadata tabulate \
    --m-input-file "$metadata" \
    --m-input-file "$qdir/denoising-stats.qza" \
    --o-visualization "$qdir/denoising-stats.qzv"

**Note:** Visual inspection shows that the optimal sampling depth is 25,079 features per sample, which retains 2,633,295 (46.33%) features in 105 (99.06%) samples. This leads to the exclusion of one sample: `Ba.Water.1`.
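
The figures in this note can be cross-checked with a little arithmetic. The sketch below reads 25079 as a per-sample sampling depth, which is an interpretation and not stated explicitly in the note. It also assumes 106 samples in total, that is the 105 retained plus the one excluded.

```shell
# Values reported in the note above
depth=25079         # features kept per sample (read here as the sampling depth)
kept_samples=105

# Features retained = samples kept x sampling depth
retained=$((kept_samples * depth))
echo "$retained"

# Share of samples retained, assuming 105 kept + 1 excluded = 106 in total
pct=$(awk -v k="$kept_samples" 'BEGIN { printf "%.2f", 100 * k / (k + 1) }')
echo "$pct"
```

Both results agree with the reported 2,633,295 features and 99.06% of samples.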

### Taxonomy identification

This step assigns taxonomy to the ASVs generated.

In [None]:
qiime feature-classifier classify-consensus-vsearch \
  --i-query "$qdir/rep-seqs.qza" \
  --i-reference-reads "$dbdir/silva_132_99_16S.qza" \
  --i-reference-taxonomy "$dbdir/silva_132_99_16S_taxa.qza" \
  --p-perc-identity 0.97 \
  --p-threads $(nproc) \
  --o-classification "$qdir/rep-seqs_taxa.qza"

### Phylogeny

This step generates a phylogeny from the ASVs ([source](https://chmi-sops.github.io/mydoc_qiime2.html)).

In [None]:
[[ -f "$qdir/rooted-tree.qza" ]] && echo "A tree file (rooted-tree.qza) exists already. Exiting..." && exit 1

# Multiple sequence alignment using Mafft
qiime alignment mafft \
    --i-sequences "$qdir/rep-seqs.qza" \
    --o-alignment "$qdir/aligned-rep-seqs.qza"

# Masking (or filtering) the alignment to remove positions that are highly variable. These positions are generally considered to add noise to a resulting phylogenetic tree.
qiime alignment mask \
    --i-alignment "$qdir/aligned-rep-seqs.qza" \
    --o-masked-alignment "$qdir/masked-aligned-rep-seqs.qza"

# Creating tree using the Fasttree program
qiime phylogeny fasttree \
    --i-alignment "$qdir/masked-aligned-rep-seqs.qza" \
    --o-tree "$qdir/unrooted-tree.qza"

# Root the tree at the midpoint of the longest tip-to-tip distance
qiime phylogeny midpoint-root \
    --i-tree "$qdir/unrooted-tree.qza" \
    --o-rooted-tree "$qdir/rooted-tree.qza"

## Library analysis

Analyze the Qiime visualizations and generate a table summarizing the number of reads after each step of the pipeline.

In [None]:
scripts/library_stats.R

## Microbiome diversity

This step analyzes ASV data using different methods (rarefaction curves, $\alpha$ and $\beta$ diversity). Technical replicates are also evaluated to show the robustness of the library generation method. Details about the methods used are in the R script. Analysis of the results is in the manuscript.

In [None]:
Rscript scripts/microbiome_diversity.R

## Functional inference

The analysis follows this PICRUSt2 [tutorial](https://github.com/picrust/picrust2/wiki/PICRUSt2-Tutorial-(v2.1.4-beta)#pathway-level-inference).

In [None]:
# PICRUSt2 output directory
pdir="results/2-picrust2"
[[ ! -d "$pdir" ]] && mkdir -p "$pdir"

In [None]:
# Filter the table to retain the samples of each replicate
for i in {1..2}
do
    awk -v i=$i 'NR==1; $5 == i {print}' "$metadata" > "$qdir/.metadata"
    
    qiime feature-table filter-samples \
        --i-table "$qdir/table.qza"  \
        --m-metadata-file "$qdir/.metadata" \
        --o-filtered-table "$qdir/table_rep$i.qza"
done

# Clean
rm "$qdir/.metadata"
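
The `awk` filter used above can be checked on a toy metadata table. In this sketch the sample names and columns 2 to 4 are made up; as in the cell above, the replicate number is assumed to sit in column 5. The `-F "\t"` option is an addition that keeps fields containing spaces intact.

```shell
# Toy metadata table; sample names and columns 2-4 are made up.
meta_demo=$(mktemp)
printf "sample-id\tpopulation\ttype\ttank\treplicate\n" > "$meta_demo"
printf "S1\tA\thml\t1\t1\n" >> "$meta_demo"
printf "S2\tA\thml\t1\t2\n" >> "$meta_demo"
printf "S3\tB\thml\t2\t1\n" >> "$meta_demo"

# 'NR==1' prints the header; the second rule prints rows whose column 5 equals i
rep1=$(awk -F "\t" -v i=1 'NR==1; $5 == i {print}' "$meta_demo")
echo "$rep1"

# Sample IDs kept for replicate 1
rep1_ids=$(echo "$rep1" | tail -n +2 | cut -f 1 | paste -s -d " " -)
rm "$meta_demo"
```

The header is printed only once because its fifth field is text and never equals the replicate number.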

In [None]:
qiime tools export \
    --input-path "$qdir/rep-seqs.qza" \
    --output-path "$pdir/"

for i in {1..2}
do
    qiime tools export \
        --input-path "$qdir/table_rep$i.qza" \
        --output-path "$pdir/rep$i"
done

In [None]:
# Place reads into reference tree
place_seqs.py -s "$pdir/dna-sequences.fasta" -o "$pdir/out.tre" -p $(nproc) \
    --intermediate "$pdir/intermediate/place_seqs"

# Hidden-state prediction of gene families
hsp.py -i 16S -t "$pdir/out.tre" -o "$pdir/marker_predicted_and_nsti.tsv.gz" -p $(nproc) -n
hsp.py -i EC -t "$pdir/out.tre" -o "$pdir/EC_predicted.tsv.gz" -p $(nproc) -n

# Number of outliers (sequences with an NSTI of 2 or more, read from the third column)
zcat "$pdir/marker_predicted_and_nsti.tsv.gz"  | tail -n +2 | awk '$3 >= 2' | wc -l
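
The outlier count can be illustrated on a small compressed table. This is a sketch with made-up values; the header names are assumed to match the `hsp.py` output, with the NSTI in the third column as in the command above.

```shell
# Toy prediction table (made-up ASV names and values), compressed like the real output
nsti_demo=$(mktemp)
printf "sequence\t16S_rRNA_Count\tmetadata_NSTI\n" > "$nsti_demo"
printf "asv1\t1\t0.05\n" >> "$nsti_demo"
printf "asv2\t2\t2.31\n" >> "$nsti_demo"
printf "asv3\t1\t1.99\n" >> "$nsti_demo"
gzip "$nsti_demo"

# Skip the header, keep rows whose third column is at least 2, and count them
n_outliers=$(gzip -dc "$nsti_demo.gz" | tail -n +2 | awk -F "\t" '$3 >= 2' | wc -l)
echo "$n_outliers"

rm "$nsti_demo.gz"
```

Only `asv2` passes the threshold here, so a single outlier is counted.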

In [None]:
# Run for each replicate
for i in {1..2}
do
    # Generate metagenome predictions
    metagenome_pipeline.py \
        -i "$pdir/rep$i/feature-table.biom" \
        -m "$pdir/marker_predicted_and_nsti.tsv.gz" \
        -f "$pdir/EC_predicted.tsv.gz" \
        --max_nsti 2.0 \
        -o "$pdir/rep$i/EC_metagenome_out" \
        --strat_out --metagenome_contrib

    # Pathway-level inference
    pathway_pipeline.py \
        -i "$pdir/rep$i/EC_metagenome_out/pred_metagenome_strat.tsv.gz" \
        -o "$pdir/rep$i/pathways_out" -p $(nproc)

    # Add functional descriptions
    add_descriptions.py -i "$pdir/rep$i/EC_metagenome_out/pred_metagenome_unstrat.tsv.gz" -m EC \
                        -o "$pdir/rep$i/EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz"

    add_descriptions.py -i "$pdir/rep$i/pathways_out/path_abun_unstrat.tsv.gz" -m METACYC \
                        -o "$pdir/rep$i/pathways_out/path_abun_unstrat_descrip.tsv.gz"
done

In [None]:
scripts/pathway_analysis.R

## Microbiome density

Could the differences observed between populations be explained by microbial density?

In [None]:
# qPCR output directory
ddir="data/qPCR"
[[ ! -d "$ddir" ]] && echo "$ddir and qPCR data are missing. Exiting..." && exit 1

# Analyze data
scripts/microbiome_density.R