# Indexing a FASTQ file - How to read a big FASTQ file faster?

## Context:
- FASTQ files often contain millions of reads, so the whole file may not fit in RAM at once.
- The Bio.SeqIO.index() function gives access to any read in a FASTQ file by its name (ID). It scans the file once and keeps only each read's ID and position in the file in memory, so a read with a known ID can be fetched directly instead of looping through the file until the desired sequence is found.
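
The idea behind such an index can be sketched with the standard library alone. This is a simplified illustration, not Biopython's actual implementation: the helper names `build_offset_index` and `fetch_sequence` and the tiny demo file are made up here, and the code assumes the common 4-line FASTQ layout (header, sequence, `+`, quality).

```python
import os
import tempfile


def build_offset_index(path):
    """Map each read ID to the byte offset where its record starts.

    Assumes 4 lines per record: header, sequence, '+', quality.
    """
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            start = handle.tell()
            header = handle.readline()
            if not header.strip():
                break  # end of file (or a trailing blank line)
            # Read ID = text after '@' up to the first whitespace.
            read_id = header[1:].split()[0].decode("ascii")
            offsets[read_id] = start
            for _ in range(3):  # skip the sequence, '+', and quality lines
                handle.readline()
    return offsets


def fetch_sequence(path, offsets, read_id):
    """Seek straight to one record and return its sequence line."""
    with open(path, "rb") as handle:
        handle.seek(offsets[read_id])
        handle.readline()  # header line
        return handle.readline().strip().decode("ascii")


# Tiny made-up FASTQ file, for illustration only.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq")
with open(demo_path, "w", newline="\n") as out:
    out.write("@read.1 first\nACGT\n+\nIIII\n@read.2 second\nTTGCA\n+\nIIIII\n")

demo_index = build_offset_index(demo_path)
print(fetch_sequence(demo_path, demo_index, "read.2"))  # TTGCA
```

Only the IDs and offsets live in memory; the sequence data stays on disk until a read is requested.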

### Download Data

Downloading the gzipped FASTQ file (~578 MB according to the wget log below) - took 4m 34.4s on my system :|

In [1]:
!wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR494/SRR494102/SRR494102.fastq.gz

--2023-07-13 20:19:56--  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR494/SRR494102/SRR494102.fastq.gz
           => 'SRR494102.fastq.gz'
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.193.138|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /vol1/fastq/SRR494/SRR494102 ... done.
==> SIZE SRR494102.fastq.gz ... 606078059
==> PASV ... done.    ==> RETR SRR494102.fastq.gz ... done.
Length: 606078059 (578M) (unauthoritative)

     0K .......... .......... .......... .......... ..........  0% 66.3K 2h28m
    50K .......... .......... .......... .......... ..........  0%  274K 92m23s
   100K .......... .......... .......... .......... ..........  0% 7.47M 62m1s
   150K .......... .......... .......... .......... ..........  0%  281K 55m16s
   200K .......... .......... .......... .......... ..........  0% 9.01M 44m26s
   250K .......... ...

In [14]:
# Unzip the FASTQ file (~4GB uncompressed)
# !gzip -d SRR494102.fastq.gz

# The gzip command above didn't work on my system, so I unzipped the file manually using 7-Zip instead
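
If the `gzip` command is not available, one alternative is to decompress from Python with the standard library. The sketch below is one possible approach, not what was actually run for this notebook. The helper name `gunzip_file` and the small demo file are hypothetical. The data is copied in chunks, so the ~4GB output never has to fit in RAM.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_file(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress src_path into dst_path, one chunk at a time."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)


# Small self-contained demo on a made-up file.
demo_dir = tempfile.mkdtemp()
gz_path = os.path.join(demo_dir, "demo.fastq.gz")
fq_path = os.path.join(demo_dir, "demo.fastq")
with gzip.open(gz_path, "wb") as handle:
    handle.write(b"@read.1\nACGT\n+\nIIII\n")

gunzip_file(gz_path, fq_path)
with open(fq_path, "rb") as handle:
    print(handle.read().decode("ascii"), end="")

# For the file in this notebook the call would be:
# gunzip_file("SRR494102.fastq.gz", "SRR494102.fastq")
```

Unlike `gzip -d`, this leaves the original `.gz` file in place.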

### Indexing

In [2]:
from Bio import SeqIO

# Build a dictionary-like index: read ID -> record, parsed from disk on demand
fq_dict = SeqIO.index("SRR494102.fastq", "fastq")

In [3]:
len(fq_dict)

27626583
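
A cheap cross-check of this number is to count lines and divide by four. The sketch below assumes every record occupies exactly 4 lines (no wrapped sequences), which is typical for FASTQ files but not guaranteed. The helper name `count_fastq_records` and the demo file are made up for illustration.

```python
import os
import tempfile


def count_fastq_records(path, chunk_size=1024 * 1024):
    """Count records as (number of lines) // 4, reading the file in chunks."""
    newlines = 0
    last_byte = b"\n"
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            newlines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    if last_byte != b"\n":
        newlines += 1  # the final line has no trailing newline
    return newlines // 4


# Tiny made-up FASTQ file with three records.
count_demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq")
with open(count_demo_path, "w", newline="\n") as out:
    for i in range(1, 4):
        out.write(f"@read.{i}\nACGT\n+\nIIII\n")

print(count_fastq_records(count_demo_path))  # 3
```

For the file in this notebook, `count_fastq_records("SRR494102.fastq")` should agree with `len(fq_dict)` if the 4-line assumption holds.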

In [4]:
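from itertools import islice
list(islice(fq_dict, 4))  # first four read IDs, without building a list of all 27.6 million keys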

['SRR494102.1', 'SRR494102.2', 'SRR494102.3', 'SRR494102.4']

In [5]:
fq_dict["SRR494102.20000"].seq  # look up one read by its ID (here, the 20,000th read)

Seq('AGCAACCACCATGACCACCCCTTCACCAACCACCAC')

## Summary: In this tutorial, we indexed a FASTQ file with the SeqIO.index() function and retrieved a read by its ID.

# Finish!