# Trimming and Filtering a FASTQ


## Shell Variables
Retyping shell variables in every notebook is getting old, and it's error-prone.  Let's centralize them so we can share them between notebooks.  We can create a shell script that contains the shell variables we need, and then `source` it in each notebook.  Let's call it `bioinf_intro_config.sh`.  We can create it using the Jupyter text editor.
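
As a rough sketch, `bioinf_intro_config.sh` might contain something like the following.  The variable names are the ones used later in this notebook; the paths are assumptions inferred from the command output shown further down (with `$HOME` assumed to be `/home/jovyan`) and will likely need adjusting for your environment.

```shell
# Sketch of bioinf_intro_config.sh; the paths are assumptions, adjust as needed.
CUROUT="$HOME/work/scratch/bioinf_intro"                # top-level output directory
RAW_FASTQS="/data/hts_2019_data/hts2019_pilot_rawdata"  # raw FASTQ files
TRIMMED="$CUROUT/trimmed_fastqs"                        # output of fastq-mcf
MYINFO="$CUROUT/myinfo"                                 # adapter file and other metadata
```

Because the file is sourced rather than executed, the variables are set in the notebook's own shell session.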

In [1]:
source bioinf_intro_config.sh

## Making New Directories
Make the directories that are new in this notebook:

In [2]:
mkdir -p $TRIMMED
mkdir -p $MYINFO

Now let's check to be sure that worked.  We will run `ls` and check that these directories now exist in the `$CUROUT` directory.

In [3]:
ls $CUROUT

[0m[01;34mcount_out[0m  [01;34migv[0m     [01;34mqc_output[0m  [01;31mstuff_for_igv_shorter_intron.tgz[0m  [01;34mtrimmed_fastqs[0m
[01;34mgenome[0m     [01;34mmyinfo[0m  [01;34mstar_out[0m   [01;31mstuff_for_igv.tgz[0m


# Trimming and Filtering
Now we get into some actual preprocessing.  We will use `fastq-mcf` to trim adapter sequence from our reads and do some quality filtering.  We need to trim adapter because, if a fragment is shorter than the read length, we will sequence all the way through the fragment and into the adapter.  The adapter sequence is not found in the genome, so it can keep the read from aligning properly.  To do the trimming, we need to generate an adapter file.

## Making an adapter file
The first step is to get the adapter sequence.  We can get this from the [manual](https://www.neb.com/-/media/catalog/datacards-or-manuals/manuale7600.pdf), but sequences from a PDF can pick up weird characters, so we are better off getting the adapter sequences from the [Primer Sample Sheet](https://www.neb.com/-/media/nebus/files/excel/e7600_nextseq_v4.csv?la=en).  

We can download and display the Sample Sheet using `curl`:

In [4]:
curl "https://www.neb.com/-/media/nebus/files/excel/e7600_nextseq_v4.csv?la=en"

[Header],,,,,,,,,
IEMFileVersion,4,,,,,,,,
Date,,,,,,,,,
Workflow,,,,,,,,,
Application,,,,,,,,,
Assay,,,,,,,,,
Description,,,,,,,,,
Chemistry,,,,,,,,,
,,,,,,,,,
[Reads],,,,,,,,,
,,,,,,,,,
,,,,,,,,,
,,,,,,,,,
[Settings],,,,,,,,,
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,,,,,,,,
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT,,,,,,,,
,,,,,,,,,
[Data],,,,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
1,,,,D701,ATTACTCG,D501,AGGCTATA,,
2,,,,D702,TCCGGAGA,D502,GCCTCTAT,,
3,,,,D703,CGCTCATT,D503,AGGATAGG,,
4,,,,D704,GAGATTCC,D504,TCAGAGCC,,
5,,,,D705,ATTCAGAA,D505,CTTCGCCT,,
6,,,,D706,GAATTCGT,D506,TAAGATTA,,
7,,,,D707,CTGAAGCT,D507,ACGTCCTG,,
8,,,,D708,TAATGCGC,D508,GTCAGTAC,,
9,,,,D709,CGGCTATG,D501,AGGCTATA,,
10,,,,D710,TCCGCGAA,D502,GCCTCTAT,,
11,,,,D711,TCTCGCGC,D503,AGGATAGG,,
12,,,,D712,AGCGATAG,D504,TCAGAGCC,,


We want the adapter sequences from the sample sheet: 
```
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,,,,,,,,
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT,,,,,,,,
```
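
If you would rather not copy the lines by hand, a filter along these lines can pull the adapter rows out of the sample sheet.  This is a minimal sketch: it runs on a short excerpt stored in a hypothetical `sheet_excerpt` variable so that it works offline, and it assumes the adapter rows are the only ones whose first field starts with `Adapter`.

```shell
# Excerpt of the sample sheet, embedded here so the example runs offline.
sheet_excerpt='[Settings],,,,,,,,,
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,,,,,,,,
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT,,,,,,,,
,,,,,,,,,'

# Strip any Windows carriage returns, then print name and sequence
# (tab-separated) for rows whose first field starts with "Adapter".
adapter_rows=$(printf '%s\n' "$sheet_excerpt" \
    | tr -d '\r' \
    | awk -F, '$1 ~ /^Adapter/ { print $1 "\t" $2 }')
printf '%s\n' "$adapter_rows"
```

To run it against the real sheet, pipe the output of the `curl` command above into the same `tr | awk` filter.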

Now we need to make the adapter file; this needs to be in FASTA format.

0. Browse to scratch/bioinf_intro/myinfo
1. Click on the Jupyter "File" menu, and select "Open".
2. When the new browser window/tab opens, click on the "Files" tab if it is not already active.
3. Click on the "home" symbol to go to the top level directory, then click on "myinfo".
4. In the "New" menu select "Text File".
5. In this text file, paste the adapter lines from above.
6. We also want to include the reverse complement of each adapter, in case the adapter contamination as sequenced is the reverse complement of what is given.  The easiest way to do that is to use https://www.bioinformatics.org/sms/rev_comp.html to generate the reverse complement, then name it something like "Adapter_rc".
7. Now clean up by making sure that . . .
    1. Each sequence is on its own line
    2. Each sequence has a name on the line before it
    3. The sequence name is preceded by a ">"
    4. All commas, spaces, and other non-sequence characters are removed from the sequence lines

Now it should look like this:
```
>Adapter
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>AdapterRead2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
>Adapter_rc
TGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>AdapterRead2_rc
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
```
8. Click on "untitled.txt" to change the file name to "neb_e7600_adapters.fasta"
9. Save the file.
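
The same file can also be built from the command line, which avoids copy-and-paste mistakes.  The sketch below is one way to do it, not the procedure used above: `revcomp` is a hypothetical helper that reverses a sequence with `rev` and complements it with `tr`, and the output goes to a differently named file (`neb_e7600_adapters_check.fasta`, also hypothetical) in the current directory so that it cannot overwrite the hand-made one.

```shell
# Reverse-complement a DNA sequence: reverse the string, then swap A<->T and C<->G.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

adapter="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
adapter_read2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Write the four FASTA records: each name line starts with ">",
# and each sequence sits on its own line.
{
    printf '>Adapter\n%s\n' "$adapter"
    printf '>AdapterRead2\n%s\n' "$adapter_read2"
    printf '>Adapter_rc\n%s\n' "$(revcomp "$adapter")"
    printf '>AdapterRead2_rc\n%s\n' "$(revcomp "$adapter_read2")"
} > neb_e7600_adapters_check.fasta

cat neb_e7600_adapters_check.fasta
```

Running `diff neb_e7600_adapters_check.fasta $MYINFO/neb_e7600_adapters.fasta` prints nothing if the two files are identical.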


## fastq-mcf
You can run `fastq-mcf -h` to get details about running fastq-mcf.  We will adjust some run parameters, because some of the defaults set a low bar (even the author acknowledges this).

In [5]:
# the "| cat" is a hack that prevents problems with jupyter
fastq-mcf -h | cat

Usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...] 
Version: 1.04.807

Detects levels of adapter presence, computes likelihoods and
locations (start, end) of the adapters.   Removes the adapter
sequences from the fastq file(s).

Stats go to stderr, unless -o is specified.

Specify -0 to turn off all default settings

If you specify multiple 'paired-end' inputs, then a -o option is
required for each.  IE: -o read1.clip.q -o read2.clip.fq

Options:
    -h       This help
    -o FIL   Output file (stats to stdout)
    -O N     Only output the first N records (all)
    -s N.N   Log scale for adapter minimum-length-match (2.2)
    -t N     % occurance threshold before adapter clipping (0.25)
    -m N     Minimum clip length, overrides scaled auto (1)
    -p N     Maximum adapter difference percentage (10)
    -l N     Minimum remaining sequence length (19)
    -L N     Maximum remaining sequence length (none)
    -D N     Remove duplicate reads : Read_1 has an identical N b

### Running fastq-mcf
We will run fastq-mcf with the following arguments:

1. neb_e7600_adapters.fasta : the adapter file
2. 21_2019_P_M1_S21_L002_R1_001.fastq.gz : the FASTQ with the data (fastq-mcf, like most NGS analysis software, detects gzipped files and automatically decompresses them on the fly)
3. -q 20 : if a read has any bases with a quality score lower than this, trim them and anything 3' of that base
4. -x 0.5 : if this percentage (or higher) of the reads have an "N" in a given position, trim all reads to that position
5. -o 21_2019_P_M1_S21_L002_R1_001.trim.fastq.gz : the output file (the .gz ending tells fastq-mcf to compress the output file)

In [6]:
fastq-mcf $MYINFO/neb_e7600_adapters.fasta \
    $RAW_FASTQS/21_2019_P_M1_S21_L002_R1_001.fastq.gz \
    -q 20 -x 0.5 \
    -o $TRIMMED/21_2019_P_M1_S21_L002_R1_001.trim.fastq.gz

Command Line: /home/jovyan/work/scratch/bioinf_intro/myinfo/neb_e7600_adapters.fasta /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L002_R1_001.fastq.gz -q 20 -x 0.5 -o /home/jovyan/work/scratch/bioinf_intro/trimmed_fastqs/21_2019_P_M1_S21_L002_R1_001.trim.fastq.gz
Scale used: 2.2
Phred: 33
Threshold used: 751 out of 300000
Adapter Adapter (AGATCGGAAGAGCACACGTCTGAACTCCAGTCA): counted 2515 at the 'end' of '/data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L002_R1_001.fastq.gz', clip set to 6
Files: 1
Total reads: 2437108
Too short after clip: 1347
Clipped 'end' reads: Count: 44977, Mean: 15.55, Sd: 8.27
Trimmed 288960 reads by an average of 1.70 bases on quality < 20
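
The command above trims a single lane.  To apply the same settings to every lane of this sample, a loop is one option.  This is a sketch under a few assumptions: the lane files follow the `L00*` naming pattern seen above, `trimmed_name` is a hypothetical helper that derives the output path, and the loop skips quietly if no input files are found.

```shell
# Derive the output path for a raw FASTQ: swap .fastq.gz for .trim.fastq.gz
# and place the result in $TRIMMED.
trimmed_name() {
    printf '%s/%s.trim.fastq.gz\n' "$TRIMMED" "$(basename "$1" .fastq.gz)"
}

for fastq in "$RAW_FASTQS"/21_2019_P_M1_S21_L00*_R1_001.fastq.gz; do
    [ -e "$fastq" ] || continue    # no such file (e.g. the glob matched nothing); skip
    fastq-mcf "$MYINFO/neb_e7600_adapters.fasta" \
        "$fastq" \
        -q 20 -x 0.5 \
        -o "$(trimmed_name "$fastq")"
done
```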


At this point we could run fastqc on the output of fastq-mcf to see whether the statistics have improved, but we will skip that for now.

In [7]:
ls $TRIMMED

[0m[01;31m21_2019_P_M1_S21_L001_R1_001.trim.fastq.gz[0m
[01;31m21_2019_P_M1_S21_L002_R1_001.trim.fastq.gz[0m
[01;31m21_2019_P_M1_S21_L003_R1_001.trim.fastq.gz[0m
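
A lighter-weight check than fastqc is to count the reads in each trimmed file.  The sketch below assumes standard four-line FASTQ records; `count_reads` is a hypothetical helper, not part of fastq-mcf.

```shell
# Count reads in a gzipped FASTQ, assuming exactly four lines per record.
count_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

for trimmed in "$TRIMMED"/*.trim.fastq.gz; do
    [ -e "$trimmed" ] || continue    # skip if no trimmed files exist
    printf '%s\t%s\n' "$(basename "$trimmed")" "$(count_reads "$trimmed")"
done
```

For the L002 file, the count should be no larger than the `Total reads` value that fastq-mcf reported, since trimming and filtering can only remove reads.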
