01_QC
Sequencing data has arrived from NSC (hooray!), and now it's time to start looking at what we got. I run everything on SAGA.
# login
ssh username@saga.sigma2.no
cd /cluster/projects/nn9836k/ # Project folder where I'm working. Yours may be different, and access requires permission from the PI.
We were given PDFs of the QC results but not the QC files themselves, so we have to run our own FastQC. Let's get the runs going in parallel using some lines borrowed from Biostars:
#!/usr/bin/bash
#SBATCH --account=nn9836k
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=5GB
#SBATCH --time=2:00:00
#Load the modules
module load FastQC/0.11.9-Java-11
module load parallel/20190922-GCCcore-8.3.0
find TheMoon/ -name '*.fastq.gz' | awk '{printf("fastqc \"%s\"\n", $0)}' | parallel -j 4 --verbose
# Move all the FastQC outputs (.html and .zip) into FASTQC/
mkdir -p FASTQC # the target folder must exist before mv can move files into it
find TheMoon/ -name '*fastqc.*' | xargs -I '{}' mv '{}' FASTQC/
You can follow the progress of the file creation by looking at the slurm-<jobid>.out file that Slurm writes while the job is running.
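Once the job has finished, it can be worth confirming that every FASTQ file actually got a report. Here is a minimal sketch of such a check, not part of the original pipeline. The function name `missing_fastqc_reports` is made up for illustration, and it assumes FastQC's default output naming (`sample.fastq.gz` becomes `sample_fastqc.zip`).

```shell
#!/usr/bin/env bash
# Print every FASTQ file that has no matching FastQC zip report.
# Assumption: FastQC's default naming, sample.fastq.gz -> sample_fastqc.zip
# (hypothetical helper, not part of FastQC itself)
missing_fastqc_reports() {
    local fastq_dir="$1" report_dir="$2"
    local fq base
    find "$fastq_dir" -name '*.fastq.gz' | sort | while IFS= read -r fq; do
        # Strip the directory and the .fastq.gz suffix to get the sample name
        base="$(basename "$fq" .fastq.gz)"
        if [ ! -e "$report_dir/${base}_fastqc.zip" ]; then
            printf '%s\n' "$fq"
        fi
    done
}

# Example with the folders used on this page:
# missing_fastqc_reports TheMoon FASTQC
```

An empty result means every input has a report. Anything listed can be rerun through fastqc on its own.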
The sequencing centre provides individual FastQC reports for every sample, but these are tough to navigate one by one. So we will run MultiQC to see the results in the context of all the samples. We will probably have to split the files across multiple MultiQC reports.
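If the reports do need splitting, one possible approach is to copy the FastQC zip files into numbered batch folders and run MultiQC once per folder. This is only a sketch under assumptions: the function name `split_reports_into_batches`, the `batches/` output folder, and the batch size are all illustrative choices, not something from the original pipeline.

```shell
#!/usr/bin/env bash
# Copy FastQC zip reports into numbered folders holding at most
# batch_size reports each (hypothetical helper for illustration).
split_reports_into_batches() {
    local report_dir="$1" batch_size="$2" out_dir="$3"
    local i=0 batch zip
    find "$report_dir" -maxdepth 1 -name '*_fastqc.zip' | sort | while IFS= read -r zip; do
        # Integer division decides which batch this report belongs to
        batch=$(( i / batch_size + 1 ))
        mkdir -p "$out_dir/batch_$batch"
        # Copy rather than move, so the original report folder stays intact
        cp "$zip" "$out_dir/batch_$batch/"
        i=$(( i + 1 ))
    done
}

# Example: 20 reports per batch, then one MultiQC report per batch folder
# split_reports_into_batches FASTQC 20 batches
# for d in batches/batch_*; do multiqc "$d" -o "$d"; done
```

Sorting the file list first keeps the batches reproducible between runs, so a given sample always lands in the same report.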
To get MultiQC, the easiest way I found was installing it through conda (I tried EasyBuild modules and a Python virtualenv but ran into permissions problems and config issues):
# Make a conda environment if you don't have one
module load Anaconda3/2019.03 # makes the conda command available (same module the MultiQC job script loads)
conda create --name YourEnvName
conda init bash # You'll have to close out the shell and open a new one for this to take effect.
# Activate and install multiqc
conda activate YourEnvName
conda install -c bioconda multiqc
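Before submitting any job, a quick check that the install worked can save a wasted queue wait. The sketch below is an illustration only: `require_tools` is a made-up helper name, and it simply tests whether each named command can be found on the PATH of the current shell.

```shell
#!/usr/bin/env bash
# Return non-zero if any of the named commands is not on the PATH.
# (hypothetical helper, not part of conda or MultiQC)
require_tools() {
    local tool missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example, with the conda environment activated:
# require_tools multiqc || echo "multiqc is not available in this environment"
```

The same call could go at the top of a job script, so the job fails immediately with a clear message instead of partway through.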
On SAGA we must write a script for each job we run, because the login node should not be used for any real processing. I created a script called MultiQC.sh, with the following content, in the Metagenomics_AMOR_2020 folder in our project (/cluster/projects/nn9836k/).
touch MultiQC.sh # Make the file
vim MultiQC.sh
And we add the following to the script:
#!/usr/bin/bash
#SBATCH --account=nn9836k
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
# every job requires some specification of the memory (RAM) it needs
#SBATCH --mem-per-cpu=5GB
# every job requires a runtime limit
#SBATCH --time=48:00:00
#SBATCH --job-name=QC_1
#Set up job environment
# Load modules
module purge
module load Anaconda3/2019.03
# Activate the conda environment
conda activate YourEnvName
# Scan directories for data to report on
multiqc TheMoon/
Run the script and check the status:
sbatch MultiQC.sh # run
squeue -u $USER # status
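To follow the job's output as well as its queue status, it helps to know which log file belongs to the job. The sketch below assumes Slurm's defaults: sbatch confirms with a line ending in the job ID ("Submitted batch job <jobid>"), and the log is written to slurm-<jobid>.out because the script sets no --output option. The helper name `slurm_log_for` is made up for illustration.

```shell
#!/usr/bin/env bash
# Derive the default Slurm log file name from sbatch's confirmation message.
# Assumption: the message ends with the job ID and no --output option is set.
# (hypothetical helper, not part of Slurm)
slurm_log_for() {
    local sbatch_output="$1" job_id
    # Keep only the last space-separated word, which is the job ID
    job_id="${sbatch_output##* }"
    printf 'slurm-%s.out\n' "$job_id"
}

# Example on the cluster:
# log="$(slurm_log_for "$(sbatch MultiQC.sh)")"
# tail -f "$log"   # may need a moment until the job starts and creates the file
```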
In 2020, the Dahle group sent 60 samples for sequencing from various chimneys across the AMOR. This wiki shares the pipeline I used to process that dataset. The intent is to be specific about every step involved, so that other lab members do not have to repeat the same time-consuming processes. Hosting it on my Git page adds accountability and gives you someone to email if something doesn't work for you. :)