# Quality Control and Trimming

This notebook walks through the workflow for read quality control and trimming. We will follow each of the steps below, and each one requires time on the HPC to run. Be sure to check back after each step to make sure you have the right files, and then start the next step.

1. Quality control using fastqc to determine quality thresholds.
2. Compressing files before trimming.
3. Trimming reads with [Trimmomatic](https://carpentries-lab.github.io/metagenomics-analysis/03-trimming-filtering/index.html).
4. Optional: Final QC check after trimming.


## Getting Started (Run this before each step)

You will need to rerun this section each time you come back to this notebook to kick off the next step.

In [None]:
# set the variables for your netid and xfile
# (xfile is the name of the file in your 05_getting_data directory
# that lists your accessions)
netid = "MY_NETID"
xfile = "MY_XFILE"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/06_qc_trimming"
%cd $work_dir

## Creating a config file
Each of the scripts below relies on certain variables being set. Rather than editing the code in each script, we will define all of these variables in one config file, and each script will "source" that file to load them. This is generally good practice when writing scripts for the HPC, because you only need to modify the config file (rather than each individual script). We will create this file using the variables you set above in "Getting Started". Note that you only need to create this config file once, even if you are returning to complete the next step.

In [None]:
# create a config file with all of the variables you need
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export WORK_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/06_qc_trimming" >> config.sh
!echo "export FASTQ_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/05_getting_data" >> config.sh

In [None]:
# check the config file to be sure it is correct
# Is your netid and xfile correct? Do you have the right directories?
!cat config.sh
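
If you would rather check the config file programmatically than by eye, here is a minimal sketch. The helper `parse_exports` and the sample text are hypothetical (they are not part of the assignment), and the sketch assumes every line has the simple `export NAME=value` form written above.

```python
def parse_exports(text):
    """Return {NAME: value} for each `export NAME=value` line in the text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("export "):
            continue  # skip blank lines or anything that is not an export
        name, _, value = line[len("export "):].partition("=")
        settings[name.strip()] = value.strip().strip('"')
    return settings


# Sample text in the same shape as config.sh (placeholder values only).
sample = """export NETID=MY_NETID
export XFILE=MY_XFILE
export WORK_DIR=/xdisk/bhurwitz/bh_class/MY_NETID/assignments/06_qc_trimming
export FASTQ_DIR=/xdisk/bhurwitz/bh_class/MY_NETID/assignments/05_getting_data
"""

config = parse_exports(sample)
required = ["NETID", "XFILE", "WORK_DIR", "FASTQ_DIR"]
missing = [name for name in required if not config.get(name)]
print("missing variables:", missing)
```

To check your real file, you could pass `open("config.sh").read()` instead of `sample`.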

## Step 1: Assessing Read Quality

Now that we have all of our data downloaded, we are ready to start the quality control process. We will use a tool called fastqc that generates a report about the quality of our sequence data.

First, we will create an sbatch script that runs fastqc on each of the sequence files. Note that when you kick off this analysis by running "sbatch" below, you will need to wait ~1 hour for the results to come back, depending on the queue wait time. 

In [None]:
# Create a script to run fastqc on each of our accessions
# A few important points:
# 1. We are using the variables from the config file via
# the `source ./config.sh` command in the script.
# 2. fastqc runs on each of the fastq files in the $FASTQ_DIR
# 3. We are creating a directory called check_fastqc in our home directory
# this allows us to copy the *html files produced by fastqc and explore
# them using Jupyter server on the on demand hpc portal.
# 4. The fastqc program runs in the $FASTQ_DIR, but to keep our files
# organized, we are going to move the results into our $WORK_DIR.
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=10:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-4                         
#SBATCH --output=Job-fastqc-%a.out
#SBATCH --cpus-per-task=1                  
#SBATCH --mem=4G                           

pwd; hostname; date

source ./config.sh
names=($(cat $FASTQ_DIR/$XFILE))

apptainer run /contrib/singularity/shared/bhurwitz/fastqc-0.11.9.sif fastqc \
    $FASTQ_DIR/${names[${SLURM_ARRAY_TASK_ID}]}_*.fastq*

mkdir -p ~/check_fastqc
cp $FASTQ_DIR/${names[${SLURM_ARRAY_TASK_ID}]}_*_fastqc.html ~/check_fastqc 
mv $FASTQ_DIR/${names[${SLURM_ARRAY_TASK_ID}]}_*_fastqc.html $WORK_DIR
mv $FASTQ_DIR/${names[${SLURM_ARRAY_TASK_ID}]}_*_fastqc.zip $WORK_DIR
 
'''

with open('fastqc_parallel.sh', mode='w') as file:
    file.write(my_code)
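
To see how the job array in the script picks its files, here is a minimal Python sketch of the same indexing. The helper names `task_pattern` and `array_range` are hypothetical, and apart from ERR2198631 the accession names are made up for illustration.

```python
def task_pattern(names_text, task_id, fastq_dir):
    """Mimic names=($(cat file)) and ${names[$SLURM_ARRAY_TASK_ID]} from the script."""
    names = names_text.split()  # the shell also splits the file on whitespace
    return f"{fastq_dir}/{names[task_id]}_*.fastq*"


def array_range(names_text):
    """Return the --array value that gives one task per accession."""
    return f"0-{len(names_text.split()) - 1}"


# Made-up accession list standing in for the contents of your xfile.
xfile_text = "ERR2198631\nACCESSION_B\nACCESSION_C\nACCESSION_D\nACCESSION_E\n"

print("array range:", array_range(xfile_text))
for task_id in range(len(xfile_text.split())):
    print(task_id, task_pattern(xfile_text, task_id, "$FASTQ_DIR"))
```

Each array task gets a different value of `SLURM_ARRAY_TASK_ID`, so each task runs fastqc on the files for one accession. If your xfile does not list five accessions, the `--array` range in the script would need to change to match.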

In [None]:
# Check the code and make sure your script above was created.
!cat fastqc_parallel.sh

In [None]:
# you should be in your working directory when you run this script
# do you see your config.sh file, and the fastqc_parallel.sh script?
!pwd
!ls

In [None]:
# Let's run sbatch to run fastqc on each of the FASTQ files
# Remember that this may take 1 hour to run, so take a break, 
# and get a coffee.
!sbatch ./fastqc_parallel.sh
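
When sbatch accepts a job it prints a line of the form `Submitted batch job <id>`, and that number is the job ID to use with squeue. A small sketch for pulling the ID out of that text follows. The helper `extract_job_id` and the sample job number are hypothetical.

```python
import re


def extract_job_id(sbatch_output):
    """Return the job ID from sbatch's confirmation line, or None if absent."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return match.group(1) if match else None


# Example text in the shape sbatch prints (the job number is made up).
print(extract_job_id("Submitted batch job 1234567"))
```

In a notebook you could capture the real output with `out = !sbatch ./fastqc_parallel.sh` and then call `extract_job_id("\n".join(out))`. Treat this as one possible convenience, not a required step.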

In [None]:
# You can check if it is running using the squeue command
# (replace MY_JOBID with the job ID that sbatch printed).
# If you get an error "Invalid job id specified", then your job already
# completed.
# You can also check for all jobs under your netid
# !squeue --user=$netid
!squeue --job=MY_JOBID

In [None]:
# Once your jobs have run (or are running) you can check the progress
# and also look for errors in the *out files
# For example, you can look at Job-fastqc-0.out
!ls
!cat Job-fastqc-0.out

In [None]:
# Double check that all of your files have run through fastqc.
# Do you see a *.html and *.zip file for each one?
!ls /xdisk/bhurwitz/bh_class/$netid/assignments/06_qc_trimming
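
With several accessions it is easy to miss a report when reading the listing. The sketch below checks file names for you. The helpers `expected_reports` and `missing_reports` are hypothetical, and the sketch assumes fastqc names its outputs by dropping the `.fastq` or `.fastq.gz` ending and adding `_fastqc.html` and `_fastqc.zip`.

```python
def expected_reports(fastq_name):
    """Return the report names fastqc is assumed to write for one FASTQ file."""
    base = fastq_name
    for suffix in (".gz", ".fastq"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]
    return [f"{base}_fastqc.html", f"{base}_fastqc.zip"]


def missing_reports(fastq_names, result_names):
    """List every expected report that is not among the result file names."""
    present = set(result_names)
    return [
        report
        for fastq in fastq_names
        for report in expected_reports(fastq)
        if report not in present
    ]


# Made-up listings standing in for the two directories.
fastq_files = ["ERR2198631_1.fastq", "ERR2198631_2.fastq"]
results = ["ERR2198631_1_fastqc.html", "ERR2198631_1_fastqc.zip",
           "ERR2198631_2_fastqc.html"]
print(missing_reports(fastq_files, results))
```

On the HPC you could build the two lists with `os.listdir` on your 05_getting_data and 06_qc_trimming directories, keeping only the names that contain `.fastq` in the first list.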

Great job! It looks like your files have all been checked with fastqc. Before moving on to the next step, start up the Jupyter server on HPC on demand, navigate to the folder called check_fastqc, and double click on each of the html files to check the quality of each of your sequence files. Be sure to refer back to the in-class exercise on quality control to understand what each of the sections means. Because your sequence data come from the SRA, you will likely find that your files are all passing quality control checks already. But, to be certain, we will run a few basic "screening and cleaning" steps via trimmomatic in Step 3 to make sure the sequences are up to par. 
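
If you want a quick text summary in addition to the HTML reports, the sketch below lists every module that did not pass. It assumes each fastqc zip contains a `summary.txt` with tab-separated lines of status, module name, and file name. The helper `flagged_modules` and the sample text are hypothetical.

```python
def flagged_modules(summary_text):
    """Return (status, module) pairs for every module that is not PASS."""
    flagged = []
    for line in summary_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        status, module = parts[0], parts[1]
        if status != "PASS":
            flagged.append((status, module))
    return flagged


# Made-up summary text in the assumed layout.
sample_summary = (
    "PASS\tBasic Statistics\tERR2198631_1.fastq\n"
    "PASS\tPer base sequence quality\tERR2198631_1.fastq\n"
    "WARN\tSequence Duplication Levels\tERR2198631_1.fastq\n"
    "FAIL\tAdapter Content\tERR2198631_1.fastq\n"
)
for status, module in flagged_modules(sample_summary):
    print(status, module)
```

To use it on a real report, you could read the summary with Python's `zipfile` module, for example `zipfile.ZipFile(path).read(member).decode()`, where `member` is the path of `summary.txt` inside the archive.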

## Step 2: Compressing your *fastq files using gzip 

Trimmomatic can read FASTQ files that are compressed with either gzip or bzip2, and the trimming script in Step 3 expects input files ending in .fastq.gz. So, before we run trimmomatic, we will compress our read files with gzip, which also saves disk space. These FASTQ files are massive, and gzip takes time to run, so let's create a script to sbatch the compression job.

In [None]:
# Let's create a script that gzips all of the FASTQ files
# These are huge files, so it may take 2 hours to run.
# This script uses gzip to compress each of the *.fastq files.
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=10:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-4              # one array task per accession (indices 0-4)
#SBATCH --output=Job-gzip-%a.out
#SBATCH --cpus-per-task=1        # num CPUs per task
#SBATCH --mem=4G                 # total memory per node
 
pwd; hostname; date
source ./config.sh
names=($(cat ${FASTQ_DIR}/${XFILE}))
gzip ${FASTQ_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_*.fastq
'''

with open('gzip_untrimmedfiles.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# Let's double check that your script was created.
!cat gzip_untrimmedfiles.sh

In [None]:
# OK, we are ready to kick off the script
# Time to go get another coffee...before completing Step 3.
!sbatch gzip_untrimmedfiles.sh

In [None]:
# You can check if it is running using the squeue command or looking for
# jobs under your netid
!squeue --job=MY_JOBID
!squeue --user=$netid

In [None]:
# Check to see if all of your *.fastq files are gzip-ed
# Note that these files are in the "05_getting_data" directory
# You should just see one gzipped file for each fastq
# For example, ERR2198631_1.fastq.gz not ERR2198631_1.fastq too...
# If you see both .fastq and .fastq.gz the gzip command 
# is still in progress.
!ls /xdisk/bhurwitz/bh_class/$netid/assignments/05_getting_data
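
The rule in the comments above (both `.fastq` and `.fastq.gz` present means gzip is still working) can be sketched in Python. The helper `gzip_status` is hypothetical, and the file names other than the ERR2198631 ones are made up.

```python
def gzip_status(filenames):
    """Classify each FASTQ file as done, in progress, or not started."""
    names = set(filenames)
    status = {}
    for name in sorted(names):
        if name.endswith(".fastq.gz"):
            raw = name[:-3]  # the matching uncompressed file name
            # gzip removes the original only once compression has finished
            status[raw] = "in progress" if raw in names else "done"
        elif name.endswith(".fastq") and name + ".gz" not in names:
            status[name] = "not started"
    return status


# Made-up directory listing for illustration.
listing = ["ERR2198631_1.fastq", "ERR2198631_1.fastq.gz",
           "ERR2198631_2.fastq.gz", "ACCESSION_B_1.fastq"]
for fastq, state in gzip_status(listing).items():
    print(fastq, "->", state)
```

You could pass `os.listdir(...)` for your 05_getting_data directory in place of `listing`.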

## Step 3: Trimming .fastq Files

In order to run trimmomatic in a PE (paired-end) format we'll need two files. In our case, we have *_1.fastq.gz and *_2.fastq.gz for each accession from the SRA. You should now have those from the steps above.

Note that we are following the same trimming protocol from the in-class exercise.

### Initial Data Management
The output from trimmomatic will give us 4 output files (forward paired, forward unpaired, reverse paired, and reverse unpaired). To keep our data organized, let's set up some output directories so the script can organize our data as it runs.


In [None]:
trim_dir = work_dir + "/trimmed_reads"
unpair_dir = work_dir + "/unpaired_reads"
!mkdir $trim_dir
!mkdir $unpair_dir
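
As a quick reference for where each of the four outputs will land, here is a small sketch that builds the paths for one accession using the two directories created above. The helper `trimmomatic_outputs` is hypothetical, and `MY_NETID` is a placeholder.

```python
def trimmomatic_outputs(accession, work_dir):
    """Return the four output paths used for one accession in this notebook."""
    trim_dir = f"{work_dir}/trimmed_reads"
    unpair_dir = f"{work_dir}/unpaired_reads"
    return {
        "forward paired": f"{trim_dir}/{accession}_1.fastq.gz",
        "forward unpaired": f"{unpair_dir}/{accession}_1.fastq.gz",
        "reverse paired": f"{trim_dir}/{accession}_2.fastq.gz",
        "reverse unpaired": f"{unpair_dir}/{accession}_2.fastq.gz",
    }


example_dir = "/xdisk/bhurwitz/bh_class/MY_NETID/assignments/06_qc_trimming"
for label, path in trimmomatic_outputs("ERR2198631", example_dir).items():
    print(f"{label}: {path}")
```

Note that the paired and unpaired files share the same file names, so the directory is the only thing that tells them apart.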

In [None]:
# Let's create a script that runs trimmomatic on all of our fastq files
# you can only run this after the *.fastq files are all gzip-ed (step 2)
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=10:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-4                          # one array task per accession (indices 0-4)
#SBATCH --output=Job-trim-%a.out
#SBATCH --cpus-per-task=1                    # num CPUs per task
#SBATCH --mem=4G                             # total memory per node
 
pwd; hostname; date
source ./config.sh
names=($(cat ${FASTQ_DIR}/${XFILE}))

TRIM_DIR="${WORK_DIR}/trimmed_reads"
UNPAIR_DIR="${WORK_DIR}/unpaired_reads"

apptainer run /contrib/singularity/shared/bhurwitz/trimmomatic:0.39--hdfd78af_2.sif trimmomatic PE \
    ${FASTQ_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_1.fastq.gz ${FASTQ_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_2.fastq.gz \
    ${TRIM_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_1.fastq.gz ${UNPAIR_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_1.fastq.gz \
    ${TRIM_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_2.fastq.gz ${UNPAIR_DIR}/${names[${SLURM_ARRAY_TASK_ID}]}_2.fastq.gz \
    ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 SLIDINGWINDOW:4:20
'''

with open('trimmomatic_fastq.sh', mode='w') as file:
    file.write(my_code)
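
To build some intuition for the `SLIDINGWINDOW:4:20` setting in the script, here is a simplified sketch of sliding-window trimming. It is not Trimmomatic's actual code: the helpers `phred33` and `sliding_window_keep` are hypothetical, and details such as how bases inside the failing window are handled may differ in the real tool.

```python
def phred33(quality_string):
    """Convert a FASTQ quality string to scores, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality_string]


def sliding_window_keep(scores, window=4, threshold=20):
    """Return how many leading bases to keep.

    Scan from the start of the read and cut at the first window whose
    average quality falls below the threshold.
    """
    for start in range(len(scores) - window + 1):
        chunk = scores[start:start + window]
        if sum(chunk) / window < threshold:
            return start  # keep only the bases before this window
    return len(scores)  # no window failed, keep the whole read


# Made-up read: good quality ('I' = 40) followed by poor quality ('+' = 10).
sequence = "ACGTACGT"
quality = "IIII++++"
keep = sliding_window_keep(phred33(quality))
print(sequence[:keep], quality[:keep])
```

The window size (4) and the quality threshold (20) correspond to the two numbers in `SLIDINGWINDOW:4:20`.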

In [None]:
# Did you create the script file correctly?
!cat trimmomatic_fastq.sh

In [None]:
# Now we can run trimmomatic
!sbatch trimmomatic_fastq.sh

In [None]:
# You can check if it is running using the squeue command
!squeue --job=MY_JOBID
!squeue --user=$netid

### Checking your output files

Once your job has completed, you should see that there are four output files from two input files. The trimmomatic program places all of the "orphaned" reads in a separate file from the trimmed reads. Reads can become orphaned when their "mate pair" is either too short, or too low quality. For our analyses going forward, we will only use the reads that were trimmed, and have both the forward and reverse read. 
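
The routing of each read pair into the four output files can be sketched as follows. The helper `route_pair` is hypothetical and simply restates the rule described above.

```python
def route_pair(forward_survives, reverse_survives):
    """Return the output files that receive the reads of one pair."""
    if forward_survives and reverse_survives:
        return ["forward paired", "reverse paired"]
    if forward_survives:
        return ["forward unpaired"]  # its mate was dropped
    if reverse_survives:
        return ["reverse unpaired"]  # its mate was dropped
    return []  # both reads were dropped


for fwd in (True, False):
    for rev in (True, False):
        print(f"forward ok={fwd}, reverse ok={rev} ->", route_pair(fwd, rev))
```

Only the first case lands in `trimmed_reads`, which is why those are the files we carry forward.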

In [None]:
# Let's check the file sizes to see that they are smaller for our trimmed
# reads
!echo "trimmed:"
!ls -l /xdisk/bhurwitz/bh_class/$netid/assignments/06_qc_trimming/trimmed_reads/*fastq.gz
!echo "untrimmed:"
!ls -l /xdisk/bhurwitz/bh_class/$netid/assignments/05_getting_data/*fastq.gz
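
File size is only a rough proxy for how many reads were removed. If you want an actual read count, here is a minimal sketch, assuming standard four-line FASTQ records. The helper `count_reads` is hypothetical, and the demo file is made up so the example runs anywhere.

```python
import gzip
import os
import tempfile


def count_reads(path):
    """Count reads in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


# Write a tiny made-up FASTQ file with two reads for the demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_1.fastq.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")

print(count_reads(demo_path))
```

Counting reads in the full-size files takes a while because every file has to be decompressed, so for large data this is better run inside an sbatch job than in the notebook.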

## Step 4 (Optional): Final QC Check

If you have any doubts about the trimming process, you can always run fastqc on the trimmed data and double check that you see all "green". Open the fastqc HTML reports in Jupyter to look for any failures or warnings.

In [None]:
# Here we are re-running fastqc on the trimmed data
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=10:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-4                         
#SBATCH --output=Job-fastqc-trim-%a.out
#SBATCH --cpus-per-task=1                  
#SBATCH --mem=4G                           

pwd; hostname; date

source ./config.sh
names=($(cat $FASTQ_DIR/$XFILE))
TRIM_DIR="${WORK_DIR}/trimmed_reads"

apptainer run /contrib/singularity/shared/bhurwitz/fastqc-0.11.9.sif fastqc \
    $TRIM_DIR/${names[${SLURM_ARRAY_TASK_ID}]}_*.fastq*

mkdir -p ~/check_fastqc_trimmed
cp $TRIM_DIR/${names[${SLURM_ARRAY_TASK_ID}]}_*_fastqc.html ~/check_fastqc_trimmed 

'''

with open('fastqc_trim_parallel.sh', mode='w') as file:
    file.write(my_code)

In [None]:
!sbatch ./fastqc_trim_parallel.sh

In [None]:
# check to see if your job is finished running
!squeue --user=$netid

Once your job completes, you can look at the *.html files in your home directory in ~/check_fastqc_trimmed. 

## Final Step
Copy your notebook to the current working directory.

In [None]:
!cp ~/06_qc_trimming.ipynb $work_dir