---
layout: post  
title:  Analyzing Soil Microbiomes  
date: 2020-03-12  
author: Cameron Prybol  

---

In [1]:
import Dates

In [2]:
projects_dir = joinpath(homedir(), "projects")
project_dir = joinpath(projects_dir, "$(Dates.today())-soil-compost-analysis")
# mkpath creates any missing parent directories and does nothing if the path already exists
mkpath(project_dir)
cd(project_dir)

# Soil-Microbiome

[Comparing unamended soil with soil enriched with fresh organic matter and pyrogenic organic matter](https://www.ncbi.nlm.nih.gov/biosample?term=%22geo_loc_name=USA:%20Mt.%20Pleasant%20research%20farm,%20Cornell%20University,%20New%20York%22[attr])

Follow the link above, then:

1. On the page, click "Send to:"
2. Under "Choose Destination", select "File"
3. Under "Format", choose "Accessions List"
4. Click "Create File"

5. Under "Find related data", next to "Database:", select "SRA"
6. Click "Find items"

1. On the page, click "Send to:"
2. Under "Choose Destination", select "File"
3. Under "Format", choose "Summary"
4. Repeat, but under "Format", choose "Accessions List"

- We should now have three files: `biosample_result.txt`, `SraAccList.txt`, and `sra_result.csv`
- Make a folder named with a descriptive sample set ID, `usa_mt-pleasant-research-farm_cornell-university_new-york`
- Place the downloaded files into this folder
- `SraAccList.txt` has a blank line at the end of the file, which I deleted manually with vi; this could also be handled programmatically in the future
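
The manual blank-line cleanup could be scripted along these lines. This is a minimal sketch, not part of the original workflow: the accession IDs are placeholders, and the temporary working directory is an assumption made so the demo does not touch real files.

```bash
# Demo input: a stand-in accession list with a trailing blank line (placeholder IDs).
workdir=$(mktemp -d)
printf 'SRR0000001\nSRR0000002\n\n' > "$workdir/SraAccList.txt"

# Keep only lines that contain at least one non-whitespace character.
grep -v '^[[:space:]]*$' "$workdir/SraAccList.txt" > "$workdir/SraAccList.tmp"
mv "$workdir/SraAccList.tmp" "$workdir/SraAccList.txt"

# Count the remaining accessions.
n_accessions=$(wc -l < "$workdir/SraAccList.txt" | tr -d ' ')
echo "$n_accessions accessions"
```

Writing to a temporary file and moving it into place avoids truncating the list while `grep` is still reading it.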

```bash
parallel prefetch {} :::: SraAccList.txt
```

```bash
parallel fastq-dump --dumpbase --gzip --split-files --outdir {} {} :::: SraAccList.txt
```

In [None]:
# grab first sample
ID=$(ls -1 | perl -pe 's/_pass_[12]\.fastq\.gz//g' | sort -u | head -n1)
OUT_DIR=$BASE/$ID
mkdir -p $OUT_DIR
echo $OUT_DIR

In [None]:
FORWARD="$ID"_pass_1.fastq.gz
REVERSE="$ID"_pass_2.fastq.gz
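
The cells above pick only the first sample. As a hedged sketch, the same ID extraction could be extended to list every sample with its read pair. The file names below are placeholders, the demo directory is an assumption, and `sed` stands in for the `perl` one-liner.

```bash
# Demo directory with placeholder read files named like the "_pass_" files used above.
demo_dir=$(mktemp -d)
touch "$demo_dir"/SRR0000001_pass_1.fastq.gz "$demo_dir"/SRR0000001_pass_2.fastq.gz
touch "$demo_dir"/SRR0000002_pass_1.fastq.gz "$demo_dir"/SRR0000002_pass_2.fastq.gz

# Strip the read suffix so each sample appears once.
sample_ids=$(ls -1 "$demo_dir" | sed -E 's/_pass_[12]\.fastq\.gz$//' | sort -u)

# Report the forward and reverse file for each sample.
for id in $sample_ids; do
    forward="${id}_pass_1.fastq.gz"
    reverse="${id}_pass_2.fastq.gz"
    echo "$id $forward $reverse"
done
```

The body of the loop is where the per-sample steps that follow (quality reports, trimming, assembly) would go.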

In [None]:
# initialize a directory that will house all assembly related information, qc, and reports
mkdir -p $OUT_DIR/fastqc_pre_filtering
# run fastqc quality report on raw, pre-filtered reads
fastqc --outdir $OUT_DIR/fastqc_pre_filtering $FORWARD $REVERSE

In [None]:
# perform quality and adapter trimming
trim_galore --output_dir $OUT_DIR/trim_galore --paired $FORWARD $REVERSE

In [None]:
# re-evaluate the data quality post trimming
TRIMMED_FORWARD=$OUT_DIR/trim_galore/$(basename $FORWARD | perl -pe 's/(.*?)\.(fastq|fq).*$/$1/')_val_1.fq.gz
TRIMMED_REVERSE=$OUT_DIR/trim_galore/$(basename $REVERSE | perl -pe 's/(.*?)\.(fastq|fq).*$/$1/')_val_2.fq.gz
mkdir -p $OUT_DIR/fastqc_post_filtering
fastqc --outdir $OUT_DIR/fastqc_post_filtering $TRIMMED_FORWARD $TRIMMED_REVERSE
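
The trimmed file names above are built by stripping the FASTQ extension and appending Trim Galore's `_val_1.fq.gz` / `_val_2.fq.gz` suffixes for paired reads. A small sketch of that naming step on its own, using a hypothetical helper `strip_fastq_ext` and placeholder paths, with `sed` in place of `perl`:

```bash
# Hypothetical helper: drop the directory, then everything from the first ".fastq" or ".fq" onward.
strip_fastq_ext() {
    basename "$1" | sed -E 's/\.(fastq|fq).*$//'
}

# Placeholder inputs; the second uses the shorter ".fq.gz" extension to show both cases.
trimmed_forward="trim_galore/$(strip_fastq_ext /data/SRR0000001_pass_1.fastq.gz)_val_1.fq.gz"
trimmed_reverse="trim_galore/$(strip_fastq_ext /data/SRR0000001_pass_2.fq.gz)_val_2.fq.gz"

echo "$trimmed_forward"
echo "$trimmed_reverse"
```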

In [None]:
# perform metaspades assembly as well, so we can compare which one assembles better
metaspades.py -t $(nproc) -o $OUT_DIR/metaspades -1 $TRIMMED_FORWARD -2 $TRIMMED_REVERSE

In [None]:
# use quast to generate assembly statistic reports
quast.py --output-dir $OUT_DIR/metaspades/quast --min-contig 1 $OUT_DIR/metaspades/scaffolds.fasta

In [None]:
# run bwa on the raw reads to assess how well the assembly agrees with the raw data
bwa index $OUT_DIR/metaspades/scaffolds.fasta
mkdir -p $OUT_DIR/metaspades/alignments
bwa mem -t $(nproc) $OUT_DIR/metaspades/scaffolds.fasta $FORWARD $REVERSE | samtools view -buh - | samtools sort - > $OUT_DIR/metaspades/alignments/scaffolds.fasta.bam

In [None]:
# run qualimap to generate reports on read-mapping quality
qualimap bamqc -nt $(nproc) -bam $OUT_DIR/metaspades/alignments/scaffolds.fasta.bam -outdir $OUT_DIR/metaspades/qualimap

In [None]:
# run multiqc to compile a meta-report, which some people may prefer over the individual reports
multiqc --outdir $OUT_DIR/metaspades/multiqc $OUT_DIR/metaspades

In [None]:
# generate a bandage plot of the assembly graph
mkdir -p $OUT_DIR/metaspades/bandage
/home/jovyan/Bandage image $OUT_DIR/metaspades/assembly_graph_with_scaffolds.gfa $OUT_DIR/metaspades/bandage/bandage.jpg

In [None]:
# predict genes with prodigal, then convert the GFF output to BED
mkdir -p $OUT_DIR/prodigal
prodigal -i $OUT_DIR/metaspades/scaffolds.fasta -f gff -o $OUT_DIR/prodigal/prodigal.gff &> /dev/null

grep -v "#" $OUT_DIR/prodigal/prodigal.gff \
    | awk '{OFS="\t"}{print $1, $4-1, $5, ".", ".", $7}' \
    > $OUT_DIR/prodigal/prodigal.gff.bed
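
To make the coordinate shift in the `awk` step concrete, here is a minimal sketch that runs the same conversion on a two-line stand-in for Prodigal's GFF output. The contig name, coordinates, and scores are invented for illustration.

```bash
# Tiny stand-in for Prodigal GFF output (placeholder contig and coordinates).
gff=$(mktemp)
printf '##gff-version  3\n' > "$gff"
printf 'contig_1\tProdigal_v2.6.3\tCDS\t3\t98\t10.5\t+\t0\tID=1_1;\n' >> "$gff"
printf 'contig_1\tProdigal_v2.6.3\tCDS\t150\t401\t7.2\t-\t0\tID=1_2;\n' >> "$gff"

# GFF is 1-based and inclusive; BED is 0-based and half-open,
# so only the start coordinate shifts down by one.
bed_lines=$(grep -v '^#' "$gff" \
    | awk 'BEGIN {FS = OFS = "\t"} {print $1, $4 - 1, $5, ".", ".", $7}')
echo "$bed_lines"
```

The end coordinate is unchanged because the last base of a 1-based inclusive interval is also the exclusive end of the matching 0-based interval.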

In [None]:
BED=$OUT_DIR/prodigal/prodigal.gff.bed

In [None]:
bedtools getfasta -s -fi $OUT_DIR/metaspades/scaffolds.fasta -bed $BED \
    > $BED.fna

In [None]:
translate6frames.sh \
    -Xmx2g \
    in=$BED.fna \
    out=stdout.fa 'frames=1' \
    | awk '{if ($1 ~ /^>/) {gsub(" +","|"); print $0} else {print $0}}' \
    > $BED.fna.faa

In [None]:
# search the predicted proteins against RefSeq; the output name matches the .blastp file read in the next cell
time diamond blastp --threads $(nproc) --db ~/databases/refseq-protein-diamond --query $BED.fna.faa --out $BED.fna.faa.blastp --evalue 0.001

In [None]:
while IFS= read -r line
do
    CONTIG=$(echo $line | awk '{print $1}')
    CONTIG_SIZE=$(bioawk -c fastx '{ print $name, length($seq) }' $OUT_DIR/metaspades/scaffolds.fasta | grep -m 1 "$CONTIG" | awk '{print $2}')
    START=$(echo $line | awk '{print $2}')
    STOP=$(echo $line | awk '{print $3}')
    STRAND=$(echo $line | awk '{print $6}')
    NAME=$(grep -m 1 "^$CONTIG:$START-$STOP" $BED.fna.faa.blastp | awk '{print $2}')
    SCORE=$(grep -m 1 "^$CONTIG:$START-$STOP" $BED.fna.faa.blastp | awk '{print $12}')
    # Filter out any wrap-around calls made by the gene caller
    if [ "$STOP" -le "$CONTIG_SIZE" ];
    then
        echo -e "$CONTIG\t$START\t$STOP\t$NAME\t$SCORE\t$STRAND"
    fi
done <$BED.filtered.bed | sort -k 1,1 -k2,2n > $BED.filtered.bed.fna.faa.blastp.bed
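
The loop above looks up the contig length once per gene call. One possible alternative for the length filter alone (it does not do the BLAST name and score lookup) is a single `awk` pass over a precomputed length table. This is a sketch under assumptions: the file names `contig_lengths.tsv`, `genes.bed`, and `genes.filtered.bed` and all values in them are hypothetical.

```bash
work=$(mktemp -d)
# Placeholder contig lengths (name, length), in the two-column layout the bioawk call prints.
printf 'contig_1\t400\ncontig_2\t120\n' > "$work/contig_lengths.tsv"
# Placeholder gene calls in BED layout; the second call runs past the end of contig_2.
printf 'contig_1\t2\t98\t.\t.\t+\ncontig_2\t100\t130\t.\t.\t-\n' > "$work/genes.bed"

# The first file fills a length lookup table; a line from the second file
# is printed only when its stop coordinate fits within the contig.
awk 'BEGIN {FS = OFS = "\t"}
     NR == FNR {len[$1] = $2 + 0; next}
     ($1 in len) && ($3 + 0 <= len[$1])' \
    "$work/contig_lengths.tsv" "$work/genes.bed" > "$work/genes.filtered.bed"

kept=$(wc -l < "$work/genes.filtered.bed" | tr -d ' ')
echo "$kept call(s) kept"
```

Adding `+ 0` forces numeric comparison, so a stop of 98 is not compared to a length of 400 as text.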

In [None]:
```bash
mkdir -p 1.FASTQ
parallel fastq-dump --skip-technical --clip --read-filter pass --dumpbase --gzip --split-3 --outdir 1.FASTQ {} :::: SraAccList.txt
```

fastqc [-o output dir] -d -t seqfile1 seqfile2 .. seqfileN


https://github.com/FelixKrueger/TrimGalore

fastqc

gunzip -c file.fastq.gz | jellyfish count -o file.jf -m ...
jellyfish histo -o file_jf.hist -f file.jf

In [None]:
BASE="$(homedir())/projects/2020-03-12-soil-microbiome"
if !isdir(BASE)
    mkdir(BASE)
end