# Quality filtering for FASTQ files

## Context:
- In this tutorial, we read FASTQ data with Biopython and keep only the reads whose minimum per-base Phred quality score is at least 20.

### Get Data

In [1]:
!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz 

--2023-07-14 02:23:24--  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz
           => 'SRR020192.fastq.gz'
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.193.138|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /vol1/fastq/SRR020/SRR020192 ... done.
==> SIZE SRR020192.fastq.gz ... 1777817
==> PASV ... done.    ==> RETR SRR020192.fastq.gz ... done.
Length: 1777817 (1.7M) (unauthoritative)


In [3]:
# !gzip -d SRR020192.fastq.gz

# The command above did not work on my system, so I decompressed the file manually with 7-Zip.
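
If the `gzip` command-line tool is unavailable, the archive can also be decompressed from Python itself. The following is a minimal sketch using only the standard library; the helper name `gunzip` is hypothetical and not part of the original notebook.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dst):
    """Decompress the gzip file `src` into `dst`, streaming in chunks."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


# Only attempt the decompression if the download is present.
archive = Path("SRR020192.fastq.gz")
if archive.exists():
    gunzip(archive, archive.with_suffix(""))  # writes SRR020192.fastq
```

Alternatively, `SeqIO.parse` accepts an open text handle, so passing `gzip.open("SRR020192.fastq.gz", "rt")` should make it possible to read the records without decompressing to disk at all.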

In [5]:
# counting the number of sequences

from Bio import SeqIO
count = 0
for rec in SeqIO.parse("SRR020192.fastq", "fastq"):
    count += 1
print(f"No of sequences: {count}") 

No of sequences: 41892


In [6]:
# Get the Phred quality score of each nucleotide in the first read
records = SeqIO.parse("SRR020192.fastq", "fastq")
first_record = next(records)
first_record.letter_annotations['phred_quality']

[24, 23, 27, 30, 30, 30, 23, 23, 24, 23, 23, 30, 28, 27, 25, 25, 27, 27, 27,
 22, 22, 24, 18, 18, 18, 30, 19, 19, 23, 23, 30, 30, 32, 32, 32, 30, 24, 23,
 23, 27, 30, 32, 30, 32, 29, 28, 28, 17, 17, 17, 17, 24, 17, 17, 13, 15, 17,
 25, 25, 24, 24, 23, 27, 27, 15, 15, 15, 15, 15, 17, 17, 11, 15, 15]
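
To make these numbers easier to interpret: a Phred score Q corresponds to a base-calling error probability of 10^(-Q/10). The short sketch below converts a few scores; the function name `phred_to_error_prob` is a hypothetical helper, not part of Biopython.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


# 11 is the lowest score in the first read; 20 is the threshold used in this tutorial.
for q in (11, 20, 30):
    print(f"Q{q}: error probability = {phred_to_error_prob(q):.4f}")
# Q11: error probability = 0.0794
# Q20: error probability = 0.0100
# Q30: error probability = 0.0010
```

A threshold of 20 therefore means every base in a retained read has an estimated error probability of at most 1%.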

In [9]:
# Keep only the reads whose minimum Phred quality score is >= 20
good_reads = (
    rec
    for rec in SeqIO.parse("SRR020192.fastq", "fastq")
    if min(rec.letter_annotations["phred_quality"]) >= 20
)
count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
print(f"Saved {count} reads.")

Saved 20050 reads.


Interpretation: Of the 41892 reads in the file, 20050 (about 48%) have a minimum per-base Phred quality score of at least 20.
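
For readers curious about what happens under the hood, here is a rough standard-library sketch of the same minimum-quality test applied directly to a FASTQ quality string. It assumes the Sanger (Phred+33) encoding, which is what Biopython's `"fastq"` format denotes; the function names and example strings are illustrative only.

```python
def decode_phred33(quality_string):
    """Decode a Sanger (Phred+33) quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def passes_min_quality(quality_string, threshold=20):
    """Return True if every base in the read scores at least `threshold`."""
    return min(decode_phred33(quality_string)) >= threshold


print(decode_phred33("II5,"))      # [40, 40, 20, 11]
print(passes_min_quality("II5I"))  # True: the lowest score is exactly 20
print(passes_min_quality("II5,"))  # False: the last base scores 11
```

Because the test uses the minimum, a single low-quality base is enough to discard an entire read, which helps explain why roughly half of the reads are removed here.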

## Summary: In this tutorial, we downloaded a FASTQ file and kept only the reads whose minimum per-base Phred quality score is at least 20.

# Finish!