<a href="https://colab.research.google.com/github/epi2me-labs/tutorials/blob/fastq_things/Fastq_bits.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fastq

The [fastq format](https://en.wikipedia.org/wiki/FASTQ_format) is (usually) a 4-line text format that stores a sequence and its corresponding quality scores. There are different ways of encoding quality in fastq; nanopore reads use [Sanger Phred scores](https://academic.oup.com/nar/article/38/6/1767/3112533). A sequence record is made up of 4 lines:

```
line 1: @ followed by the sequence ID and an optional sequence description
line 2: Sequence line e.g. ATCGs
line 3: plus symbol (can be followed by an additional description)
line 4: Quality line, one quality character per base of line 2
```
**IMPORTANT:** Line 2 and line 4 must have the same length or the sequence record is not valid.

```
@sequence_id sequence_description
ATCG
+
!^%%
```
The SEQUENCE_ID must not contain any spaces. Anything after the first space in the sequence id line will be considered "description".
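
To make the Sanger Phred encoding concrete, the sketch below decodes the quality line of the example record. It assumes the standard Sanger offset of 33 (a quality character's ASCII code minus 33 gives the Phred score Q, and the error probability is 10^(-Q/10)). The function names `phred_scores` and `error_probabilities` are illustrative and not part of any library.

```python
def phred_scores(quality_line, offset=33):
    """Decode a Sanger-encoded quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality_line]


def error_probabilities(scores):
    """Convert Phred scores Q into error probabilities P = 10 ** (-Q / 10)."""
    return [10 ** (-q / 10) for q in scores]


# Quality line of the example record above.
scores = phred_scores("!^%%")
print(scores)  # [0, 61, 4, 4]
print(error_probabilities(scores))
```

A score of 0 (the `!` character) corresponds to an error probability of 1, so the first base of the toy record carries no confidence at all.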

A fastq file contains multiple records. The default number of records in a fastq file generated during a nanopore run is 4000 reads (16000 lines).  
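
As a rough illustration of the rules above, the following sketch walks through fastq text four lines at a time, splits the header into ID and description at the first space, and rejects records whose sequence and quality lines differ in length. It assumes strictly 4-line records (no wrapped sequences), and `parse_fastq` is a hypothetical helper written for this example.

```python
def parse_fastq(lines):
    """Yield (seq_id, description, sequence, quality) from 4-line fastq records."""
    # Drop line endings and ignore completely blank lines (e.g. at the end of a file).
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("number of non-blank lines is not a multiple of 4")
    for start in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[start:start + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record: {!r}".format(header))
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ in {!r}".format(header))
        # Everything after the first space on the header line is the description.
        seq_id, _, description = header[1:].partition(" ")
        yield seq_id, description, sequence, quality


example = "@sequence_id sequence_description\nATCG\n+\n!^%%\n"
records = list(parse_fastq(example.splitlines()))
print(records)  # [('sequence_id', 'sequence_description', 'ATCG', '!^%%')]
```

For real data a dedicated parser such as `Bio.SeqIO` is the safer choice; this sketch only mirrors the format description.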


### How many records in my fastq file?

In [22]:
filename = "test.fastq"  #@param {type: "string"}
!echo $(( `cat $filename | wc -l` / 4 )) reads

15076 reads
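
The same count can be sketched in plain Python, which may be handy outside a notebook. This assumes an uncompressed file with strictly 4 lines per record; `count_reads` is an illustrative name, not an existing tool.

```python
def count_reads(path):
    """Count the records in an uncompressed 4-line-per-record fastq file."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError("{} has {} lines, not a multiple of 4".format(path, n_lines))
    return n_lines // 4


# Example call (the file name is the one used in the cell above):
# print(count_reads("test.fastq"), "reads")
```

Unlike the integer division in the shell one-liner, this version raises an error when the line count is not a multiple of 4, which can flag a truncated file.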


### List all the fastqs in a directory

In [31]:
directory = "." #@param {type: "string"}

!find $directory -name "*.fastq"

./test.fastq
./test_dir/test.fastq
./test_dir/test_dir2/test2.fastq
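
A possible Python equivalent of the `find` call uses `pathlib` to search the directory recursively. The helper name `find_fastqs` is made up for this sketch.

```python
from pathlib import Path


def find_fastqs(directory="."):
    """Return all *.fastq files below `directory`, searched recursively."""
    return sorted(p for p in Path(directory).rglob("*.fastq") if p.is_file())


# Example call:
# for path in find_fastqs("."):
#     print(path)
```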


### Cat all fastqs in a directory into a single file

In [59]:
directory = "." #@param {type: "string"}
output_fastq = "total.fastq" #@param {type: "string"}

!find $directory -type f \( -iname "*.fastq" ! -iname $output_fastq \) | xargs cat > $output_fastq
!wc -l $output_fastq

180912 total.fastq
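
The concatenation can also be sketched in Python. The version below collects the input files before opening the output and skips the output file itself, so a rerun does not append the previous result to itself. Note that it excludes only the exact output path, whereas the shell command excludes every file with that name. `concatenate_fastqs` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def concatenate_fastqs(directory, output_fastq):
    """Concatenate every *.fastq below `directory` into `output_fastq`."""
    output_path = Path(output_fastq).resolve()
    # Collect the inputs first and leave out the output file itself.
    inputs = sorted(
        p for p in Path(directory).rglob("*.fastq")
        if p.is_file() and p.resolve() != output_path
    )
    with open(output_path, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return inputs


# Example call:
# concatenate_fastqs(".", "total.fastq")
```

Like `cat`, this copies bytes verbatim, so an input file that lacks a trailing newline would run into the next file's first header.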


### Remove all duplicates in a fastq


In [91]:
from Bio import SeqIO
from progressbar import ProgressBar

input_fastq = "total.fastq" #@param {type: "string"}
output_fastq = "total_without_duplicates.fastq" #@param {type: "string"}

#@markdown Tick here to keep every record, including duplicates, and make
#@markdown the IDs unique by appending the record index:
id_addition = True #@param {type: "boolean"}

# Key the records by ID: a later record with the same ID replaces an earlier one.
new_fastq = {}
n_records = 0
for i, record in enumerate(SeqIO.parse(input_fastq, "fastq")):
  n_records += 1
  if id_addition:
    # The index suffix makes every ID unique, so no record is dropped.
    record.id = record.id + "_{}".format(i)
  new_fastq[record.id] = record

print("Records found: {} | Records kept: {}".format(n_records, len(new_fastq)))

with ProgressBar(max_value=len(new_fastq)) as bar:
  with open(output_fastq, "w") as f:
    # SeqIO.write writes all records in one call and returns how many it wrote.
    n_written = SeqIO.write(new_fastq.values(), handle=f, format="fastq")
    bar.update(n_written)

Records found: 45228 | Records kept: 45228


100% (45228 of 45228) |##################| Elapsed Time: 0:00:00 Time:  0:00:00
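
For comparison, here is a dependency-free sketch of the deduplication step that streams the file instead of holding every record in memory. It assumes strictly 4-line records and, unlike the dictionary approach above (where a later record replaces an earlier one), it keeps the first record seen for each ID. `deduplicate_fastq` is an illustrative name.

```python
def deduplicate_fastq(input_fastq, output_fastq):
    """Write the first record for each read ID; return (records found, records kept)."""
    seen = set()
    found = kept = 0
    with open(input_fastq) as src, open(output_fastq, "w") as dst:
        # zip over the same iterator four times groups the file into 4-line records.
        for record in zip(*[iter(src)] * 4):
            found += 1
            # The ID is the header text between "@" and the first space.
            read_id = record[0][1:].strip().partition(" ")[0]
            if read_id not in seen:
                seen.add(read_id)
                dst.writelines(record)
                kept += 1
    return found, kept


# Example call:
# found, kept = deduplicate_fastq("total.fastq", "total_without_duplicates.fastq")
# print("Records found: {} | Records kept: {}".format(found, kept))
```

Only the set of IDs is kept in memory, which matters for large runs. Trailing lines that do not form a complete record are silently ignored by `zip`.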


### Compress or extract a fastq file

In [96]:
#@markdown Compress the fastq
input_fastq = "total.fastq" #@param {type: "string"}
compressed_fastq = "total.fastq.tar.gz" #@param {type: "string"}
!tar -czvf $compressed_fastq $input_fastq

total.fastq


In [94]:
#@markdown Extract the compressed fastq
input_tar_gz = "total.fastq.tar.gz" #@param {type: "string"}
!tar -xvf $input_tar_gz 

tar: Ignoring unknown extended header keyword 'LIBARCHIVE.creationtime'
tar: Ignoring unknown extended header keyword 'SCHILY.dev'
tar: Ignoring unknown extended header keyword 'SCHILY.ino'
tar: Ignoring unknown extended header keyword 'SCHILY.nlink'
total.fastq
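
Since only a single file is being compressed, plain gzip (without the tar wrapper) is a common alternative, and many bioinformatics tools read `.fastq.gz` files directly. The sketch below uses Python's standard `gzip` module rather than the `tar` commands above. The helper names are hypothetical, and the resulting `.gz` file is not a tar archive.

```python
import gzip
import shutil


def gzip_file(input_path, output_path):
    """Compress `input_path` into a gzip file at `output_path`."""
    with open(input_path, "rb") as src, gzip.open(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def gunzip_file(input_path, output_path):
    """Decompress the gzip file `input_path` into `output_path`."""
    with gzip.open(input_path, "rb") as src, open(output_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Example calls:
# gzip_file("total.fastq", "total.fastq.gz")
# gunzip_file("total.fastq.gz", "total_restored.fastq")
```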
