# Quality Control

## Shell Variables
Assign the variables in this notebook.

In [1]:
set -u

# We used these in the last notebook
DATA_BASE="/data/hts_2019_data"
RAW_FASTQS="$DATA_BASE/hts2019_pilot_rawdata"

# The following are new for this notebook
CUROUT=$HOME/work/scratch/bioinf_intro
QC=$CUROUT/qc_output

## Making New Directories
The directories *do* carry over between notebooks because they live on the server, so we only need to make the directories that are new in this notebook.

In [2]:
mkdir -p $QC

Now let's check to be sure that worked.  We will run `ls` and check that the `qc_output` directory now exists in the `$CUROUT` directory.

In [3]:
ls $CUROUT

[0m[01;34mcount_out[0m  [01;34migv[0m     [01;34mqc_output[0m  [01;31mstuff_for_igv_shorter_intron.tgz[0m  [01;34mtrimmed_fastqs[0m
[01;34mgenome[0m     [01;34mmyinfo[0m  [01;34mstar_out[0m   [01;31mstuff_for_igv.tgz[0m


## Fastqc
Now that we have had a quick look at our FASTQ file, let's run some basic analysis using a program called FastQC.  This will take less than a minute to run.  Remember that while a "long" job is running, the left margin shows "In [\*]:"; once the job is done, a number replaces the asterisk.

In [4]:
fastqc -h


            FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

	fastqc seqfile1 seqfile2 .. seqfileN

    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] 
           [-c contaminant file] seqfile1 .. seqfileN

DESCRIPTION

    FastQC reads a set of sequence files and produces from each one a quality
    control report consisting of a number of different modules, each one of 
    which will help to identify a different potential type of problem in your
    data.
    
    If no files to process are specified on the command line then the program
    will start as an interactive graphical application.  If files are provided
    on the command line then the program will run with no user interaction
    required.  In this mode it is suitable for inclusion into a standardised
    analysis pipeline.
    
    The options for the program as as follows:
    
    -h --help       Print this help file and exit
    
    -v --version    Print the version of the program and exit

In [5]:
# fastqc --threads 2 --quiet $DEMUX/r1.8A_pilot.fq.gz -o $QC
fastqc --extract $RAW_FASTQS/21_2019_P_M1_S21_L002_R1_001.fastq.gz -o $QC

Started analysis of 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 5% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 10% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 15% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 20% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 25% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 30% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 35% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 40% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 45% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 50% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 55% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 60% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 65% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 70% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 75% complete for 21_2019_P_M1_S21_L002_R1_001.fastq.gz
Approx 80% co

Once fastqc is done running, we can view the results by finding the output in the Jupyter file browser.  It should be in:

In [6]:
echo $QC

/home/jovyan/work/scratch/bioinf_intro/qc_output
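
Besides opening the HTML report, the `--extract` option used above leaves an unzipped `*_fastqc` directory next to each report, and that directory contains a tab-separated `summary.txt` (status, module name, file name). As a rough sketch, the modules that did not pass can be listed from the command line; `list_flagged_modules` is a hypothetical helper name, not something FastQC provides.

```shell
# Hypothetical helper (an assumption for illustration): print every FastQC
# module whose status is not PASS, for all extracted reports found in the
# directory given as $1.
list_flagged_modules() {
    local qc_dir=$1
    local summary
    for summary in "$qc_dir"/*_fastqc/summary.txt; do
        # Skip the unexpanded pattern when no extracted reports exist.
        [ -e "$summary" ] || continue
        # Columns in summary.txt: status, module name, input file name.
        awk -F'\t' '$1 != "PASS" { print $3 ": " $1 " - " $2 }' "$summary"
    done
}
```

Running `list_flagged_modules "$QC"` would then print one line per WARN or FAIL module, which gives a quick hint about which sections of the HTML report to read first.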


## MultiQC
FastQC is a useful tool, but it has one problem: it generates one report for each FASTQ file.  When you have more than a handful of FASTQs (as most projects will), it is tedious to look at each one, and there is no simple way to compare them.

MultiQC is a solution to this problem.  It mines the results from FastQC (and other HTS analysis tools) and generates reports that combine and summarize results for all the FASTQs analyzed.

### Run FastQC on Multiple FASTQs
Let's pick a selection of FASTQs to look at - we will run FastQC on the Lane 1 results for samples 10 through 19 (because it is easy to specify these files with a simple command).

In [7]:
ls $RAW_FASTQS/1?_2019_*_L001_R1_001.fastq.gz

[0m[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/10_2019_P_M1_S10_L001_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/11_2019_P_M1_S11_L001_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/12_2019_P_M1_S12_L001_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/13_2019_P_M1_S13_L001_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/14_2019_P_M1_S14_L001_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/15_2019_P_M1_S15_L001_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/16_2019_P_M1_S16_L001_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/17_2019_P_M1_S17_L001_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/18_2019_P_M1_S18_L001_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/19_2019_P_M1_S19_L001_R1_001.fastq.gz[0m


`--threads 2` tells fastqc to process two files at a time, using 2 of the 16 cores that we all share.  In general you should not do this on a shared server because it takes resources away from other users.  I am using it here so we don't have to wait too long for this to run during class.

In [8]:
fastqc --quiet --threads 2 --extract $RAW_FASTQS/1?_2019_*_L001_R1_001.fastq.gz -o $QC
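
The pattern `1?_2019_*_L001_R1_001.fastq.gz` used above relies on shell globbing: `?` matches exactly one character and `*` matches any run of characters. The sketch below illustrates this with empty placeholder files in a temporary directory; the file names only imitate the naming scheme of the real data, and `demo_glob` is a hypothetical name.

```shell
# Hypothetical demonstration (placeholder files, not the real data):
# show which names the pattern selects.
demo_glob() {
    local tmp
    tmp=$(mktemp -d)
    touch "$tmp/9_2019_P_M1_S9_L001_R1_001.fastq.gz" \
          "$tmp/10_2019_P_M1_S10_L001_R1_001.fastq.gz" \
          "$tmp/19_2019_P_M1_S19_L001_R1_001.fastq.gz" \
          "$tmp/19_2019_P_M1_S19_L002_R1_001.fastq.gz" \
          "$tmp/21_2019_P_M1_S21_L001_R1_001.fastq.gz"
    # "1?" matches "1" plus exactly one more character (10 through 19);
    # "*" matches any run of characters; "_L001_" restricts to lane 1.
    ( cd "$tmp" && printf '%s\n' 1?_2019_*_L001_R1_001.fastq.gz )
    rm -rf "$tmp"
}
```

Calling `demo_glob` prints only the lane 1 placeholders for samples 10 and 19; sample 9, sample 21, and the lane 2 file are excluded.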

### Run MultiQC

In [9]:
multiqc -h

  configs = yaml.load(f)
  sp = yaml.load(f)
Usage: multiqc [OPTIONS] <analysis directory>

  MultiQC aggregates results from bioinformatics analyses across many
  samples into a single report.

  It searches a given directory for analysis logs and compiles a HTML
  report. It's a general use tool, perfect for summarising the output from
  numerous bioinformatics tools.

  To run, supply with one or more directory to scan for analysis results. To
  run here, use 'multiqc .'

  See http://multiqc.info for more details.

  Author: Phil Ewels (http://phil.ewels.co.uk)

Options:
  -f, --force                     Overwrite any existing reports
  -d, --dirs                      Prepend directory to sample names
  -dd, --dirs-depth INTEGER       Prepend [INT] directories to sample names.
                                  Negative number to take from start of path.
  -s, --fullnames                 Do not clean the sample names (leave as full
                                  file name)
  -i, 

In [10]:
multiqc $QC --outdir $QC

  configs = yaml.load(f)
  sp = yaml.load(f)
[INFO   ]         multiqc : This is MultiQC v1.7
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '/home/jovyan/work/scratch/bioinf_intro/qc_output'
Searching 221 files..  [####################################]  100%
[INFO   ]          fastqc : Found 11 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : ../../scratch/bioinf_intro/qc_output/multiqc_report_1.html
[INFO   ]         multiqc : Data        : ../../scratch/bioinf_intro/qc_output/multiqc_data_1
[INFO   ]         multiqc : MultiQC complete
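
The log above reports how many FastQC reports MultiQC found. One way to sanity-check that number is to count the `*_fastqc.zip` archives that FastQC wrote, since it produces one per input file. This is only a sketch under that assumption, and `count_fastqc_reports` is a hypothetical helper name.

```shell
# Hypothetical helper (an assumption for illustration): count the FastQC
# archives directly inside the directory given as $1.
count_fastqc_reports() {
    local qc_dir=$1
    # FastQC writes one <name>_fastqc.zip archive per input file.
    # tr strips the padding that some wc implementations add.
    find "$qc_dir" -maxdepth 1 -name '*_fastqc.zip' | wc -l | tr -d ' '
}
```

If `count_fastqc_reports "$QC"` prints a different number than the one in the MultiQC log, some reports were probably missed or counted twice and the directory is worth a closer look.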


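Once multiqc is done running, we can view the results by finding the output in the Jupyter file browser.  The report is normally named `multiqc_report.html`; in the run logged above it was written as `multiqc_report_1.html`, which MultiQC does when a report with the default name already exists.  It should be in: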

In [11]:
echo $QC

/home/jovyan/work/scratch/bioinf_intro/qc_output
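
Since MultiQC may add a numeric suffix to the report name when an earlier report is already present, several `multiqc_report*.html` files can accumulate in the same directory. The following sketch picks the most recently modified one; `latest_multiqc_report` is a hypothetical helper name, and it relies on the `-nt` file comparison, which bash supports but which is not strictly POSIX.

```shell
# Hypothetical helper (an assumption for illustration): print the newest
# multiqc_report*.html in the directory given as $1, or nothing if none exist.
latest_multiqc_report() {
    local qc_dir=$1
    local newest=""
    local f
    for f in "$qc_dir"/multiqc_report*.html; do
        # Skip the unexpanded pattern when no report exists.
        [ -e "$f" ] || continue
        # Keep the file with the most recent modification time.
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest=$f
        fi
    done
    if [ -n "$newest" ]; then
        echo "$newest"
    fi
}
```

For example, `latest_multiqc_report "$QC"` prints the path of the report to open in the Jupyter file browser.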
