# Quality Control

## Shell Variables
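Assign the shell variables used in this notebook. `set -u` makes the shell report an error if a later cell refers to a variable that has not been set.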

In [None]:
set -u

# We used these in the last notebook
DATA_BASE="/data/hts_2019_data"
RAW_FASTQS="$DATA_BASE/hts2019_pilot_rawdata"

# The following are new for this notebook
CUROUT=$HOME/work/scratch/bioinf_intro
QC=$CUROUT/qc_output
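
To see what `set -u` changes in practice, here is a minimal sketch that refers to a variable nobody has defined, once with `set -u` and once without. The name `NOT_SET_ANYWHERE` is made up for this illustration, and each attempt runs in a subshell so it cannot disturb the notebook's own shell.

```shell
# NOT_SET_ANYWHERE is a hypothetical name, chosen because nothing defines it.
# Each attempt runs inside ( ... ), a subshell, so a failure stays contained.

# With set -u, expanding an unset variable is treated as an error.
if ( set -u; echo "value: $NOT_SET_ANYWHERE" ) >/dev/null 2>&1; then
    strict="ran without error"
else
    strict="stopped with an error"
fi

# Without it, the same reference silently expands to an empty string.
if ( set +u; echo "value: $NOT_SET_ANYWHERE" ) >/dev/null 2>&1; then
    lax="ran without error"
else
    lax="stopped with an error"
fi

echo "with set -u:    $strict"
echo "without set -u: $lax"
```

The practical benefit is that a mistyped variable name fails loudly instead of quietly turning a path like `$QC/results` into `/results`.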

## Making New Directories
Directories *do* carry over between notebooks because they live on the server's file system rather than in the notebook's shell session. We therefore only need to make the directories that are new in this notebook.

In [None]:
mkdir -p $QC

Now let's check that it worked. We will run `ls` and check that the new `qc_output` directory now exists in the `$CUROUT` directory.

In [None]:
ls $CUROUT
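
Because directories persist on the server, the `mkdir` cell may be rerun in a later session when the directory already exists. The sketch below illustrates why `-p` makes that harmless: it creates missing parent directories and does not complain if the target is already there. It works in a throwaway directory from `mktemp -d`, and the `demo_*` names are made up for this example.

```shell
# Work in a throwaway directory so nothing under $HOME is touched.
demo_root=$(mktemp -d)
demo_qc="$demo_root/scratch/bioinf_intro/qc_output"

# First run: none of the three levels exist yet, and -p creates them all.
mkdir -p "$demo_qc"

# Second run: the directory already exists, and -p still succeeds.
if mkdir -p "$demo_qc"; then
    rerun="ok"
else
    rerun="failed"
fi

# Record whether the directory is present before cleaning up.
if [ -d "$demo_qc" ]; then
    present="yes"
else
    present="no"
fi
echo "second mkdir -p: $rerun; directory present: $present"

rm -rf "$demo_root"
```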

## FastQC
Now that we have had a quick look at our FASTQ file, let's run some basic analysis using a program called FastQC. This will take less than a minute to run. Remember that while a "long" job is running, the left margin shows "In [\*]:"; once the job is done, a number replaces the asterisk.

In [None]:
fastqc -h

In [None]:
# fastqc --threads 2 --quiet $DEMUX/r1.8A_pilot.fq.gz -o $QC
fastqc --extract $RAW_FASTQS/21_2019_P_M1_S21_L002_R1_001.fastq.gz -o $QC

Once fastqc is done running, we can view the results by finding the output in the Jupyter file browser. It should be in:

In [None]:
echo $QC
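
To know which files to look for in that directory, it helps that FastQC names its outputs after the input file. The following is a rough sketch that derives those names with shell parameter expansion and, if the extracted `summary.txt` is present, prints any module not marked `PASS`. The helper `list_flagged` is a hypothetical name introduced here, not part of FastQC.

```shell
# File name taken from the fastqc cell above.
fastq="21_2019_P_M1_S21_L002_R1_001.fastq.gz"

# FastQC drops the .fastq.gz suffix and appends _fastqc to name its outputs.
base="${fastq%.fastq.gz}"
report_html="${base}_fastqc.html"        # the report to open in the browser
report_dir="${base}_fastqc"              # unpacked folder created by --extract
summary_txt="${report_dir}/summary.txt"  # one PASS/WARN/FAIL line per QC module

echo "$report_html"
echo "$summary_txt"

# list_flagged (hypothetical helper): print the status and module name for
# every line of a tab-separated summary.txt whose status is not PASS.
list_flagged() {
    awk -F'\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}

# Only read the summary if it is actually there. QC is the variable assigned
# in the Shell Variables cell; ${QC:-} avoids an error if it is unset.
if [ -f "${QC:-}/$summary_txt" ]; then
    list_flagged "$QC/$summary_txt"
fi
```

The HTML report remains the main thing to read; the text summary is just a quick way to see which modules deserve attention.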

## MultiQC
FastQC is a useful tool, but it has one problem: it generates one report for each FASTQ file.  When you have more than a handful of FASTQs (as most projects will), it is tedious to look at each one, and there is no simple way to compare them.

MultiQC is a solution to this problem.  It mines the results from FastQC (and other HTS analysis tools) and generates reports that combine and summarize results for all the FASTQs analyzed.

### Run FastQC on Multiple FASTQs
Let's pick a selection of FASTQs to look at. We will run FastQC on the Lane 1, Read 1 files for samples 10 through 19, because these files are easy to specify with a single glob pattern.

In [None]:
ls $RAW_FASTQS/1?_2019_*_L001_R1_001.fastq.gz
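
The pattern in the `ls` command relies on two glob characters: `?` matches exactly one character and `*` matches any run of characters. As an illustration, the sketch below applies the same pattern to a few bare file names using `case`, which uses the same matching rules. The helper `is_selected` and the two `EXAMPLE` file names are made up for this demonstration; only the last name appears in this notebook.

```shell
# is_selected (hypothetical helper): succeed if a file name fits the pattern
# used in the ls command above.
is_selected() {
    case "$1" in
        1?_2019_*_L001_R1_001.fastq.gz) return 0 ;;
        *) return 1 ;;
    esac
}

# The first two names are invented for illustration; the third is the file
# used in the single-file fastqc run earlier in this notebook.
for name in \
    "10_2019_EXAMPLE_S10_L001_R1_001.fastq.gz" \
    "19_2019_EXAMPLE_S19_L001_R2_001.fastq.gz" \
    "21_2019_P_M1_S21_L002_R1_001.fastq.gz"
do
    if is_selected "$name"; then
        echo "selected: $name"
    else
        echo "skipped:  $name"
    fi
done
```

Only the first name is selected: the second is a Read 2 file, and the third starts with `2` and comes from Lane 2.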

`--threads 2` tells fastqc to process two files at a time, which uses 2 of the 16 cores that we all share. In general you should not do this because it is potentially inconsiderate of other users. I am using it so we don't have to wait too long for this to run during class.

In [None]:
fastqc --quiet --threads 2 --extract $RAW_FASTQS/1?_2019_*_L001_R1_001.fastq.gz -o $QC
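
On a shared server it can be worth checking how many cores the machine has before asking for extra threads. The following is one possible sketch, assuming `getconf _NPROCESSORS_ONLN` is available (it is on typical Linux systems); the variable names `wanted` and `threads` are made up for this example.

```shell
# Number of cores currently online on this machine.
cores=$(getconf _NPROCESSORS_ONLN)

# Ask for 2 threads, as in the cell above, but never more than the machine has.
wanted=2
if [ "$wanted" -gt "$cores" ]; then
    threads=$cores
else
    threads=$wanted
fi

echo "machine has $cores cores; would run fastqc with --threads $threads"
```

This only caps the request at the size of the machine. It does not check how busy the other users are, so the advice about being considerate still applies.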

### Run MultiQC

In [None]:
multiqc -h

In [None]:
multiqc $QC --outdir $QC

Once multiqc is done running, we can view the results by finding the output in the Jupyter file browser. It should be in a file named `multiqc_report.html` in:

In [None]:
echo $QC
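
If the report does not show up in the file browser, a quick check from the shell can confirm whether it was written. This is a minimal sketch; `check_report` is a hypothetical helper name, and the fallback path simply repeats the value assigned to `QC` in the Shell Variables cell so the snippet also runs in a fresh shell.

```shell
# check_report (hypothetical helper): say whether a directory contains the
# MultiQC report, and return non-zero if it does not.
check_report() {
    if [ -f "$1/multiqc_report.html" ]; then
        echo "found: $1/multiqc_report.html"
    else
        echo "missing: no multiqc_report.html in $1"
        return 1
    fi
}

# QC comes from the Shell Variables cell; the fallback keeps this runnable
# when QC has not been set. "|| true" keeps a missing report from being
# treated as a failure of the whole cell.
check_report "${QC:-$HOME/work/scratch/bioinf_intro/qc_output}" || true
```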