
<h1>Fastq</h1>

The [fastq format](https://en.wikipedia.org/wiki/FASTQ_format) is a text format, usually with four lines per record, that stores a sequence and its corresponding quality scores. There are different ways of encoding quality in a `.fastq` file; nanopore reads use [Sanger Phred scores](https://academic.oup.com/nar/article/38/6/1767/3112533). A sequence record is made up of 4 lines:

```
line 1: Sequence ID and Sequence description
line 2: Sequence line e.g. ATCGs
line 3: plus symbol (can additionally have description here)
line 4: Quality line, one quality character per base of line 2
```
**IMPORTANT:** Line 2 and line 4 must have the same length, otherwise the sequence record is not valid.
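
Because each quality character describes exactly one base, the quality line can be decoded character by character. The sketch below assumes the Sanger encoding, in which the Phred score is the ASCII code of the character minus 33. The function names and the short made-up sequence and quality strings are ours, chosen only for illustration.

```python
def phred_scores(quality, offset=33):
    """Decode a Sanger-encoded quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probabilities(scores):
    """Convert Phred scores Q into error probabilities 10^(-Q/10)."""
    return [10 ** (-q / 10) for q in scores]


# Made-up example data (assumption: Sanger encoding, offset 33).
sequence = "ATCG"
quality = "!^%%"

# One quality character per base, so both lines must have the same length.
assert len(sequence) == len(quality)

scores = phred_scores(quality)
print(scores)  # [0, 61, 4, 4]
```

A score of 0 corresponds to an error probability of 1, and every increase of 10 in the score divides the error probability by ten.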

For example, a sample record looks like:

```
@sequence_id sequence_description
ATCG
+
!^%%
```
The sequence ID must not contain any spaces. Anything after the first space in the sequence ID line will be considered the "description".
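
The ID and description rule can be sketched in a few lines of Python. This is a minimal illustration rather than a full parser: it assumes a strict 4-line record with a space-separated description, and the `FastqRecord` and `parse_record` names are hypothetical.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    description: str
    sequence: str
    quality: str


def parse_record(text):
    """Parse one 4-line fastq record given as a single string."""
    # Unpacking raises ValueError if there are not exactly four lines.
    header, sequence, separator, quality = text.strip().split("\n")
    if not header.startswith("@"):
        raise ValueError("header line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("third line must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Everything up to the first space is the ID; the rest is the description.
    identifier, _, description = header[1:].partition(" ")
    return FastqRecord(identifier, description, sequence, quality)


record = parse_record("@sequence_id sequence_description\nATCG\n+\n!^%%\n")
print(record.identifier)   # sequence_id
print(record.description)  # sequence_description
```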

A `.fastq` file may contain multiple records. By default, each fastq file generated during a nanopore run contains 4000 reads (16000 lines).


## Useful snippets

The following snippets demonstrate common tasks you might want to perform on a single `.fastq` file or a set of such files. For many tasks we recommend the excellent [seqkit](https://github.com/shenwei356/seqkit) program.

To run the examples below, first create a working directory to hold the sample data:

In [1]:
# create a work directory and move into it
directory = "fastq_tutorial"
working_dir='/home/jovyan/work/{}/'.format(directory)
!mkdir -p "$working_dir"
%cd "$working_dir"

/home/jovyan/work/fastq_tutorial


#### How many records in my `.fastq` file?

To count the number of records in a `.fastq` file we can use the Linux [word count](https://linux.die.net/man/1/wc) command to count the number of lines in the file, then divide by four to account for the four lines per record:

In [3]:
filename = "test.fastq"  #@param {type: "string"}
!echo $(( $(wc -l < $filename) / 4 )) reads

4000 reads
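
The same count can be made in Python, which also makes it easy to check that the line count really is a multiple of four. This is a sketch under the same assumption as the `wc` approach (exactly four lines per record). The `count_records` name and the throwaway demo file are hypothetical and are not part of the sample data.

```python
import os
import tempfile


def count_records(path):
    """Count fastq records, assuming exactly four lines per record."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Demo on a throwaway two-record file (made-up data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fastq")
    with open(demo_path, "w") as handle:
        handle.write("@read1\nATCG\n+\n!^%%\n@read2\nGGCC\n+\n%%%%\n")
    n_records = count_records(demo_path)

print(n_records, "reads")  # 2 reads
```

For the file used above, the call would be `count_records("test.fastq")`.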


#### List all the fastqs in a directory

As Oxford Nanopore Technologies' sequencing devices output multiple `.fastq` files during the course of an experiment, it can be useful to find and list all such files. We can do this with the Linux [find](https://linux.die.net/man/1/find) command:

In [5]:
directory = "." #@param {type: "string"}

!find $directory -name "*.fastq"

./test0/fail/test.fastq
./test0/pass/test.fastq
./test.fastq


The default directory value here (`.`) means "the current working directory."
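
When the list of files is needed inside Python rather than in the shell, a recursive search can be sketched with the standard `pathlib` module. The `find_fastqs` helper is a hypothetical name, and the demo builds an empty, made-up directory tree that merely resembles the layout listed above.

```python
import tempfile
from pathlib import Path


def find_fastqs(directory="."):
    """Return all *.fastq files below `directory`, sorted by path."""
    return sorted(p for p in Path(directory).rglob("*.fastq") if p.is_file())


# Demo on a throwaway directory tree of empty files (made-up layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for relative in ["test.fastq", "test0/pass/test.fastq",
                     "test0/fail/test.fastq", "notes.txt"]:
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    found = [p.relative_to(root).as_posix() for p in find_fastqs(root)]

print(found)
```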

#### Concatenate all fastqs in a directory into a single file

Many bioinformatics programs require all sequence data to be present in a single `.fastq` file. In order to process sequences across multiple files we must concatenate (or "cat") all the `.fastq` files into a single consolidated file. To perform this task we can use a combination of the Linux [find](https://linux.die.net/man/1/find), [xargs](https://linux.die.net/man/1/xargs), and [cat](https://linux.die.net/man/1/cat) commands:

In [8]:
directory = "." #@param {type: "string"}
output_fastq = "all_records.fastq" #@param {type: "string"}

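# search the chosen directory, skip the output file itself, and pass
# filenames null-separated so that names containing spaces are handled
!find "$directory" -type f \( -iname "*.fastq" ! -iname "$output_fastq" \) -print0 | \
    xargs -0 cat > "$output_fastq"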
!echo $(( $(wc -l < $output_fastq) / 4 )) reads

12000 reads


Again the default directory value here (`.`) means "the current working directory."

You may often see a simple form of the above:

    cat *.fastq > output.fastq

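however, this command only matches files in the current directory (not in subdirectories), and it will fail with an "Argument list too long" error if the number of `.fastq` files found is very large.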
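
A Python version of the concatenation avoids the argument-length limit altogether, because files are opened one at a time. The sketch below assumes every input file ends with a newline, so that records from consecutive files do not run together. The `concatenate_fastqs` name and the made-up demo files are ours.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate_fastqs(directory, output_fastq):
    """Concatenate every *.fastq below `directory` into `output_fastq`."""
    output_path = Path(output_fastq).resolve()
    # Build the input list first and exclude the output file itself.
    inputs = sorted(
        p for p in Path(directory).rglob("*.fastq")
        if p.is_file() and p.resolve() != output_path
    )
    with open(output_path, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return inputs


# Demo on two throwaway one-record files (made-up data).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "test0").mkdir()
    (root / "test.fastq").write_text("@read1\nATCG\n+\n!^%%\n")
    (root / "test0" / "test.fastq").write_text("@read2\nGGCC\n+\n%%%%\n")
    output = root / "all_records.fastq"
    concatenate_fastqs(root, output)
    concatenate_fastqs(root, output)  # a second run does not re-read the output
    n_lines = len(output.read_text().splitlines())

print(n_lines // 4, "reads")  # 2 reads
```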

#### Remove all duplicates in a fastq

It can sometimes happen that a `.fastq` file contains duplicates of the same read. To remove these we can use the [`rmdup`](https://bioinf.shenwei.me/seqkit/usage/#rmdup) command of the [seqkit](https://github.com/shenwei356/seqkit) program:

In [9]:
input_fastq = "all_records.fastq" #@param {type: "string"}
output_fastq = "deduplicated.fastq" #@param {type: "string"}

!seqkit rmdup "$input_fastq" -o "$output_fastq"

[INFO][0m 8000 duplicated records removed


For the example data, 8000 duplicate records are identified because the three files (containing 4000 records each) are in fact copies of the same file.
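
To make the idea concrete, the following sketch removes duplicates by sequence ID in plain Python, keeping the first occurrence of each ID. It is an illustration of the principle only, not a description of how seqkit is implemented. It assumes strict 4-line records and keeps every ID seen so far in memory. The `deduplicate_by_id` name and the records are made up.

```python
def deduplicate_by_id(lines):
    """Yield the lines of 4-line records whose ID has not been seen before."""
    seen = set()
    record = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            header = record[0][1:].strip()
            # The ID is the header text up to the first whitespace.
            identifier = header.split()[0] if header else ""
            if identifier not in seen:
                seen.add(identifier)
                yield from record
            record = []


# Made-up records: read1 appears twice with different descriptions.
lines = [
    "@read1 first copy\n", "ATCG\n", "+\n", "!^%%\n",
    "@read2\n", "GGCC\n", "+\n", "%%%%\n",
    "@read1 second copy\n", "ATCG\n", "+\n", "!^%%\n",
]
unique = list(deduplicate_by_id(lines))
print(len(unique) // 4, "records kept")  # 2 records kept
```

Because the function accepts any iterable of lines, it can also be given an open file handle.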

#### Compress or extract a fastq file

We can save hard disk space by compressing `.fastq` files. To do this we recommend using [`bgzip`](http://www.htslib.org/doc/bgzip.html), which allows for indexing and fast retrieval of sequences by bioinformatics programs. Note that `bgzip` replaces the input file with the compressed file:

In [21]:
#@markdown Compress the fastq
input_fastq = "test.fastq" #@param {type: "string"}
compressed_fastq = "test.fastq.gz" #@param {type: "string"}
!ls -lh "$input_fastq"
!bgzip "$input_fastq"
!ls -lh "$compressed_fastq"

-rw-r--r-- 1 jovyan users 70M Mar 16 10:17 test.fastq
-rw-r--r-- 1 jovyan users 36M Mar 16 10:18 test.fastq.gz


The size of the compressed file is roughly half that of the original. To decompress the compressed file, we again use `bgzip`:

In [0]:
#@markdown Extract the compressed fastq
compressed_fastq = "test.fastq.gz" #@param {type: "string"}
!bgzip -d "$compressed_fastq"
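
Decompressing to disk is not always necessary. Files written by `bgzip` are valid gzip files, so they can be read directly with Python's standard `gzip` module. The sketch below counts records in a compressed file. Note that the demo file is written with the `gzip` module rather than with `bgzip`, and that the `count_records_gz` name and the data are made up.

```python
import gzip
import os
import tempfile


def count_records_gz(path):
    """Count records in a gzip- or bgzip-compressed fastq file."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


# Demo on a throwaway compressed file (written with gzip, not bgzip).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fastq.gz")
    with gzip.open(demo_path, "wt") as handle:
        handle.write("@read1\nATCG\n+\n!^%%\n@read2\nGGCC\n+\n%%%%\n")
    n_records = count_records_gz(demo_path)

print(n_records, "reads")  # 2 reads
```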

#### Compress a directory structure

In order to compress a directory structure we can use the Linux [`tar`](https://linux.die.net/man/1/tar) command with the compression option:

In [23]:
directory = "test0" #@param {type: "string"}
archive = "archive.tar.gz" #@param {type: "string"}

# the options here mean: create, gzip compress, verbose, output file
!tar -czvf "$archive" "$directory"

test0/
test0/fail/
test0/fail/test.fastq
test0/pass/
test0/pass/test.fastq


When compressing directories and their contents in this way it is good practice to compress a single top-level directory, so that when the archive is decompressed a single top-level directory is retrieved (and the user's working directory isn't polluted).

To decompress the archive we use a similar command:

In [26]:
archive = "archive.tar.gz" #@param {type: "string"}

# A temporary folder (tmp) is created here simply to avoid confusion with the
# original directory compressed in the previous example. This is not necessary
# in practice.

# the options here mean: extract, gzip compressed, verbose, input file
!rm -rf tmp && mkdir tmp && cd tmp && \
    tar -xzvf ../"$archive"

test0/
test0/fail/
test0/fail/test.fastq
test0/pass/
test0/pass/test.fastq
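
Before extracting an archive received from elsewhere, it can be worth checking that it really does contain a single top-level directory. A minimal sketch using Python's standard `tarfile` module follows. The `top_level_entries` helper is a hypothetical name, and the demo archive is built from made-up data.

```python
import os
import tarfile
import tempfile
from pathlib import PurePosixPath


def top_level_entries(archive):
    """Return the set of top-level names stored in a tar archive."""
    with tarfile.open(archive, "r:*") as tar:
        names = tar.getnames()
    # Keep only the first path component of each member name.
    return {PurePosixPath(n).parts[0] for n in names if PurePosixPath(n).parts}


# Demo: build a small archive with one top-level directory, then inspect it.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "test0", "pass")
    os.makedirs(source)
    with open(os.path.join(source, "test.fastq"), "w") as handle:
        handle.write("@read1\nATCG\n+\n!^%%\n")
    archive_path = os.path.join(tmp, "archive.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(os.path.join(tmp, "test0"), arcname="test0")
    entries = top_level_entries(archive_path)

print(entries)  # {'test0'}
```

If the returned set contains more than one name, extracting into a fresh directory (as done with `tmp` above) keeps the working directory tidy.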
