eolesin edited this page Dec 30, 2020 · 18 revisions

17.12.2020 -- Quality check on AMOR sequences

Sequencing data has arrived from NSC (hooray!), and now it's time to get started looking at what we got out of it. My process will be on SAGA. You can do these same processes on a local server like kjempefuru using conda packages fastqc, multiqc, and parallel.

# login
ssh username@saga.sigma2.no
cd /cluster/projects/nn9836k/ # project folder where I'm working. Yours may be different and this access needs permissions from PI.

Step 1: FastQC

We were given PDFs of the data but not the QC files themselves, so we have to run our own FastQC. Let's get the jobs going in parallel using some lines stolen off Biostars:

#!/usr/bin/bash
#SBATCH --account=nn9836k
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem-per-cpu=5GB
#SBATCH --time=6:00:00   # It's ok to be generous on time, not as $$ as mem designation is.

#Load the modules
module load FastQC/0.11.9-Java-11
module load parallel/20190922-GCCcore-8.3.0

find TheMoon/ -name '*.fastq.gz' | awk '{printf("fastqc \"%s\"\n", $0)}' | parallel -j 10 --verbose
# Move all the FastQC output files into FASTQC/ (create the directory first, otherwise mv fails)
mkdir -p FASTQC
find TheMoon/ -name '*fastqc.*' | xargs -I '{}' mv '{}' FASTQC/

You can follow the job's progress by watching the slurm-<jobid>.out file that Slurm writes while it's running.
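Once the job finishes, it's worth checking that every read file actually got a report. Here is a minimal sketch, assuming FastQC's default output naming (`<name>_fastqc.zip` for `<name>.fastq.gz`) and the `TheMoon/` and `FASTQC/` directories used above. The function name `missing_fastqc` is just something I made up for illustration.

```shell
#!/usr/bin/env bash
# Print every FASTQ file under READS_DIR that has no FastQC zip in QC_DIR.
# Assumes FastQC's default naming: <input name without .fastq.gz>_fastqc.zip
missing_fastqc() {
    local reads_dir="$1" qc_dir="$2" fq base
    find "$reads_dir" -name '*.fastq.gz' | sort | while IFS= read -r fq; do
        base="$(basename "$fq" .fastq.gz)"
        # Report the read file only if its zip is absent
        [ -e "$qc_dir/${base}_fastqc.zip" ] || echo "$fq"
    done
}

# Usage: missing_fastqc TheMoon/ FASTQC/
```

If this prints nothing, every file has a report. Anything it does print can be rerun through fastqc on its own.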

Step 2: MultiQC (For visualising outputs from multiple modules, not only fastqc)

The sequencing centre provides individual FASTQC reports from every sample, but this is somewhat tough to navigate individually. So we will run MultiQC to see what the results were in context of all the samples. We will probably have to split up the files for multiple multiqc reports.

To get the MultiQC program, the easiest way I found was installing through conda (tried Easybuild modules and python virtualenv but ran into permissions problems and config issues):

# Make a conda environment if you don't have one
conda init bash # One-time shell setup. You'll have to close out the shell and open a new one for this to take effect.
conda create --name YourEnvName

# Activate and install multiqc
conda activate YourEnvName
conda install -c bioconda multiqc

In SAGA we must make scripts for each job we run as the login node cannot really process anything. I created a script with the following content, called MultiQC.sh in the Metagenomics_AMOR_2020 folder in our project (/cluster/projects/nn9836k/).

touch MultiQC.sh # Make the file
vim MultiQC.sh

And we add the following to the script doc:

#!/usr/bin/bash
#SBATCH --account=nn9836k
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
# every job requires some specification of the memory (RAM) it needs
#SBATCH --mem-per-cpu=5GB
# every job requires a runtime limit
#SBATCH --time=48:00:00
#SBATCH --job-name=QC_1

#Set up job environment
# Load modules
module purge
module load Anaconda3/2019.03

# Activate the conda environment
conda activate YourEnvName

# Scan directories for data to report on
multiqc TheMoon/ FASTQC/ # FASTQC/ holds the FastQC outputs moved there in Step 1

Run the script and check the status:

sbatch MultiQC.sh # run
squeue -u $USER # status
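If one report for all samples gets too crowded, one way to split it is by read direction. This is only a sketch. It assumes the FastQC zips sit in `FASTQC/` and follow the Illumina-style `_R1_001` / `_R2_001` naming. The function name `write_multiqc_lists` and the `multiqc_lists/` directory are hypothetical. It relies on MultiQC's `--file-list` and `--filename` options.

```shell
#!/usr/bin/env bash
# Write one list of FastQC zip files per read direction (R1, R2).
# Assumes Illumina-style "_R1_001" / "_R2_001" file names.
write_multiqc_lists() {
    local qc_dir="$1" out_dir="$2" mate
    mkdir -p "$out_dir"
    for mate in R1 R2; do
        find "$qc_dir" -name "*_${mate}_001_fastqc.zip" | sort \
            > "$out_dir/fastqc_${mate}.txt"
    done
}

# Usage (inside the job script, after activating the conda environment):
#   write_multiqc_lists FASTQC/ multiqc_lists/
#   multiqc --file-list multiqc_lists/fastqc_R1.txt --filename multiqc_R1
#   multiqc --file-list multiqc_lists/fastqc_R2.txt --filename multiqc_R2
```

The same idea works for any other grouping (by lane, by sampling site): change the `find` pattern so that each list holds the files you want in one report.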

Step 3: Quality Control of sequences

As per the anvi'o docs, create a file listing the sequence file paths for each sample. There are probably clever ways to build the whole file on the command line, but for now I just grab all the R1 and R2 file paths and edit them in Google Sheets into the right format, with sample, r1, and r2 columns.

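find TheMoon/ -type f -name '*R1_001.fastq.gz' -printf "%p\n" | sort > SampleFileNamesR1.tsv
find TheMoon/ -type f -name '*R2_001.fastq.gz' -printf "%p\n" | sort > SampleFileNamesR2.tsv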
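For the command-line route, here is a rough sketch that builds the sample, r1, r2 table directly. It assumes every R1 file has a mate with the same name except for `_R2_001`, and it takes the sample name from the file name in front of `_R1_001.fastq.gz`. The function name `make_sample_sheet` and the output name `samples.txt` are my own placeholders. The sample names may still need tidying by hand if anvi'o complains about characters in them.

```shell
#!/usr/bin/env bash
# Print a tab-separated sample sheet (sample, r1, r2) for each R1 file under READS_DIR.
# Assumes the R2 mate has the same path with "_R2_001" in place of "_R1_001".
make_sample_sheet() {
    local reads_dir="$1" r1 r2 sample
    printf 'sample\tr1\tr2\n'
    find "$reads_dir" -type f -name '*_R1_001.fastq.gz' | sort | while IFS= read -r r1; do
        r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
        sample="$(basename "$r1" _R1_001.fastq.gz)"
        if [ -f "$r2" ]; then
            printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2"
        else
            # Unpaired files go to stderr so they don't end up in the sheet
            echo "warning: no R2 mate for $r1" >&2
        fi
    done
}

# Usage: make_sample_sheet TheMoon/ > samples.txt
```

Warnings about unpaired files go to stderr, so redirecting stdout to a file still gives a clean table.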

And now,
