# Trimming bad data

There are two ways of filtering data: trimming read ends that may have very low quality, or removing
whole reads that are low quality. In general, short-read sequence aligners take quality information
into account, and so conservative trimming and filtering is not necessary. However, if you have
a run with very low quality ends, trimming those ends can help your analysis, especially if you
are assembling a de novo transcriptome.

There are a number of tools designed to help you control read quality, each with its own
benefits. For today, we will use a program called 'Trimmomatic' because it explicitly handles
both single-end and paired-end data. Trimmomatic is a Java program; here we call it through the
`trimmomatic` command and simply pass the arguments we want to use. For more detail
on each option, go to the website: [http://www.usadellab.org/cms/?page=trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic).

*One Note:*
Paired-end data requires two outputs for each input file: one for reads whose mate from the
opposite direction also survived trimming, and one for reads whose mate did not. The code below
is an example that may be a helpful starting point. Note that the `\` at the end of a line means
'this command continues on the next line; don't run it yet'. You can copy the backslashes in
as-is, in which case the shell joins the lines for you, or leave them out and type everything on
one line yourself.
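
For reference, a paired-end call could look roughly like the sketch below. The files in this tutorial are single-end, so this is only an illustration: the sample name and all six file names are hypothetical, and the trimming steps simply reuse the values from the single-end loop in this section. The command is assembled in a bash array and printed so you can inspect it before running anything.

```shell
# Hypothetical paired-end inputs (not part of this tutorial's data)
base=sample
r1="${base}_R1.fastq.gz"
r2="${base}_R2.fastq.gz"

# PE mode takes two inputs and four outputs: for each direction, one file of
# reads whose mate survived (paired) and one of reads whose mate did not.
cmd=(
  trimmomatic PE -threads 8
  "$r1" "$r2"
  "${base}_R1_paired.fastq.gz" "${base}_R1_unpaired.fastq.gz"
  "${base}_R2_paired.fastq.gz" "${base}_R2_unpaired.fastq.gz"
  SLIDINGWINDOW:4:20 MINLEN:75
)

# Print the assembled command for inspection
printf '%s ' "${cmd[@]}"; echo

# Uncomment to actually run it once the input files exist:
# "${cmd[@]}"
```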

Let's again check our input files to trim: 

In [None]:
cd /home/gea_user/data/pre-imported/sra-fastq
ls

These are all single-end files, so we can process them all in one loop using the following options:

- SE - Single-end reads
- -threads - Number of CPU threads (8 here) to use to complete this analysis
- SLIDINGWINDOW:i:j - Scan the read with a window i bases wide and trim once the average quality within the window falls below j
- MINLEN:i - Drop any read shorter than i nucleotides after SLIDINGWINDOW trimming
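
To make the sliding-window idea concrete, here is a simplified bash sketch. It is not Trimmomatic's actual implementation, which may differ in details such as exactly where the cut is placed; the function name `sliding_window_keep` and the example quality scores are made up for illustration.

```shell
# Simplified sketch of sliding-window trimming (not Trimmomatic's own code).
# Arguments: window size, quality threshold, then one Phred score per base.
# Prints how many bases, counted from the start of the read, would be kept.
sliding_window_keep() {
  local window=$1 threshold=$2
  shift 2
  local quals=("$@")
  local n=${#quals[@]}
  local start k sum
  for (( start = 0; start + window <= n; start++ )); do
    sum=0
    for (( k = start; k < start + window; k++ )); do
      sum=$(( sum + quals[k] ))
    done
    # Compare sums rather than means to stay in integer arithmetic:
    # mean < threshold is the same as sum < threshold * window
    if (( sum < threshold * window )); then
      echo "$start"   # cut at the start of the first failing window
      return
    fi
  done
  echo "$n"           # no window failed: keep the whole read
}

# A made-up 10-base read whose quality decays toward the end,
# checked with the same settings as SLIDINGWINDOW:4:20
sliding_window_keep 4 20 38 38 36 30 25 18 12 8 5 2
```

With these scores, the first window that averages below 20 starts at the fifth base, so the sketch keeps 4 bases. A read that short would then be dropped by MINLEN:75.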

In [None]:
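for infile in /home/gea_user/data/pre-imported/sra-fastq/*.fastq.gz
 do
 # Strip the directory and the .fastq.gz suffix to get the sample name
 base=$(basename --suffix=.fastq.gz "$infile")
 # Single-end mode: one input file, one output file, then the trimming steps
 trimmomatic \
  SE \
  -threads 8 \
  "${infile}" "${base}_trimmed.fastq.gz" \
  SLIDINGWINDOW:4:20 MINLEN:75
 done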

We now have 6 trimmed FASTQ files in the current directory:

In [None]:
ls *_trimmed.fastq.gz

Let's move these into a directory in our project:

In [None]:
mkdir -p /home/gea_user/rna-seq-project/trimmed-reads
mv *_trimmed.fastq.gz /home/gea_user/rna-seq-project/trimmed-reads

Let's run another round of FastQC on the trimmed results to compare them with the raw reads.
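
One possible way to do that is sketched below. The output directory is an assumption made for this example, so adjust it to match wherever your earlier FastQC reports went. The guard skips the run if `fastqc` or the trimmed-reads directory is missing.

```shell
trimmed_dir=/home/gea_user/rna-seq-project/trimmed-reads
fastqc_out="${trimmed_dir}/fastqc"   # hypothetical output location

if command -v fastqc >/dev/null 2>&1 && [ -d "$trimmed_dir" ]; then
  # FastQC expects the output directory to exist already
  mkdir -p "$fastqc_out"
  fastqc --threads 8 --outdir "$fastqc_out" "$trimmed_dir"/*_trimmed.fastq.gz
else
  echo "fastqc or ${trimmed_dir} not found; skipping" >&2
fi
```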

## Pre-Computed data (optional)

The Trimmomatic step can take about an hour to run for these 6 large files. If you don't want to wait, the pre-trimmed files are available here:

In [None]:
ls /home/gea_user/data/worked-trimmed-reads