# Generator/Stream of FASTQ reads

**In this notebook, I create and test my scripts to stream FASTQ reads**

In [1]:
import os
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')

DIR_DAT_RAW  = "/home/jovyan/work/Data/SRR4841864/rawdata"
DIR_DAT_TRIM = "/home/jovyan/work/Data/SRR4841864/trimmed_fastqs/"
DIR_CUR      = "/home/jovyan/work/GitRepo/Duke_CBB520_HW2"

**Define my test function to stream the data**

In [2]:
def stream_fastq(fpath):
    """Stream over the given FASTQ file; yield one dict per four-line record."""
    with open(fpath, 'r') as fp:
        ### initialization
        keys = ["label", "seq", "label2", "qscore"]
        dct  = dict()

        ### iterate through the file
        for cnt, line in enumerate(fp):
            # yield the completed record every four lines and
            # reset the dictionary
            if (cnt != 0) and (cnt % 4 == 0):
                yield dct
                dct = dict()

            # strip the line and store it under the matching key
            idx  = cnt % 4
            line = line.strip()
            dct[keys[idx]] = line

        # yield the final record, which is still pending when the loop ends
        if len(dct) == len(keys):
            yield dct
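
**Alternative sketch: group the lines in fours**

The same streaming idea can be written by pulling four lines per step from a single iterator, so that every complete record, including the last one in the file, is yielded. This is only a sketch under assumptions: the name `stream_fastq_grouped` is hypothetical, each record is assumed to occupy exactly four lines (no wrapped sequences), and blank lines are skipped, so zero-length reads are not supported.

```python
from itertools import zip_longest

FASTQ_KEYS = ("label", "seq", "label2", "qscore")


def stream_fastq_grouped(fpath):
    """Yield one dict per FASTQ record, reading four lines at a time.

    Hypothetical helper; assumes four lines per record and no empty reads.
    """
    with open(fpath, "r") as fp:
        # Strip each line and skip blank ones (assumption: no empty reads).
        lines = (line.strip() for line in fp if line.strip())
        # Passing the same iterator four times makes zip_longest consume
        # four consecutive lines per step; missing lines become None.
        for chunk in zip_longest(lines, lines, lines, lines):
            if None in chunk:
                raise ValueError("truncated FASTQ record: {}".format(chunk))
            if not chunk[0].startswith("@") or not chunk[2].startswith("+"):
                raise ValueError("malformed FASTQ record: {}".format(chunk))
            yield dict(zip(FASTQ_KEYS, chunk))
```

The two `startswith` checks are a cheap guard against the reader drifting out of step with the file; they are not a full FASTQ validation.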

**Read the FASTQ file line by line**

In [3]:
with open(os.path.join(DIR_DAT_RAW, "SRR4841864_1.fastq"), 'r') as fp:
    for cnt, line in enumerate(fp):
        line = line.strip()
        if cnt % 4 == 0:
            print("\n====================================================")
        print("Line {0:^3}: {1}".format(cnt, line))
        if cnt > 10:
            break


Line  0 : @SRR4841864.1 1 length=101
Line  1 : NGGCATACCAATGGGAAGAGCATGCGTGGGAATGGTATTCAAAAATCGAGAAAGAAGGATGCTGCTGCTGCTGCCGCTGTTGGGAAAGCTATAAAGGATGA
Line  2 : +SRR4841864.1 1 length=101
Line  3 : #1:BDFFFDHFHDGGBBGGHGIGGHEHGIFIHGFDB?FHGHIIIDDGFHGCHIIIIIIIIIHGCCCBBDFDFCCECCCCBCCCBBABBCCCCCCECCCCCC

Line  4 : @SRR4841864.2 2 length=101
Line  5 : NAGCTTCGTTAAGACTTTGACTGGTAAGACCATCACTTTGGAAGTTGAATCTTCTGACACCATTGACAATGTCAAGTCCAAGATCCAAGACAAGGAAGGTA
Line  6 : +SRR4841864.2 2 length=101
Line  7 : #1=BDDDD?CFFFIIEFIFFHFIEFHIBHFFEFEFFIFEFIIFE<<?D?DGFIFIIGFEDFFIEGFFGGCC7@EFE7=AEEEDDDDDC@CCCBBBBBBBA>

Line  8 : @SRR4841864.3 3 length=101
Line  9 : NTTATATCTATGAATGAATATGCATTGCTCTTCTGATACTATGGTATGAGAGGCCTTGGTAAAGGGCATCGATAACCGAAAGAATCATCCGCAGAGAAGTA
Line 10 : +SRR4841864.3 3 length=101
Line 11 : #1=?DDD?F?D8<FEGBABHBA<FHE93AAAHD<AFEG4?:ED<?DD<B4B918B@?FH8BDG?)BDFDHID@6;3=:59=<ACCCA@>CBB8?B253<:3
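
**Sketch: preview the first lines with `itertools.islice`**

The preview above stops with a counter and `break`. A possible alternative, shown here only as a sketch, is to let `itertools.islice` cut the file iterator off after `n` lines. The helper name `head_lines` is hypothetical and not part of the original notebook.

```python
from itertools import islice


def head_lines(fpath, n=12):
    """Return the first n lines of a text file, stripped of whitespace.

    Hypothetical helper; fewer than n lines are returned for short files.
    """
    with open(fpath, "r") as fp:
        # islice stops after n lines without reading the rest of the file.
        return [line.strip() for line in islice(fp, n)]
```

With the default `n=12`, this returns the lines of the first three records, which can then be printed with the same `"Line {0:^3}: {1}"` format.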


**Read the FASTQ file using my function and double-check against the results above**

In [4]:
fn  = "SRR4841864_1.fastq"          # file name
fp  = os.path.join(DIR_DAT_RAW, fn) # file path
gen = stream_fastq(fp)              # generator

res = next(gen)
print(res["label"])
print(res["seq"])
print(res["qscore"])

print("\n======================\n")

res = next(gen)
print(res["label"])
print(res["seq"])
print(res["qscore"])

@SRR4841864.1 1 length=101
NGGCATACCAATGGGAAGAGCATGCGTGGGAATGGTATTCAAAAATCGAGAAAGAAGGATGCTGCTGCTGCTGCCGCTGTTGGGAAAGCTATAAAGGATGA
#1:BDFFFDHFHDGGBBGGHGIGGHEHGIFIHGFDB?FHGHIIIDDGFHGCHIIIIIIIIIHGCCCBBDFDFCCECCCCBCCCBBABBCCCCCCECCCCCC


@SRR4841864.2 2 length=101
NAGCTTCGTTAAGACTTTGACTGGTAAGACCATCACTTTGGAAGTTGAATCTTCTGACACCATTGACAATGTCAAGTCCAAGATCCAAGACAAGGAAGGTA
#1=BDDDD?CFFFIIEFIFFHFIEFHIBHFFEFEFFIFEFIIFE<<?D?DGFIFIIGFEDFFIEGFFGGCC7@EFE7=AEEEDDDDDC@CCCBBBBBBBA>
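
**Sketch: automate the double check**

The comparison above is done by eye. A minimal sketch of doing the same check in code is given below, assuming the records are dicts with the keys `label`, `seq`, `label2`, and `qscore`, as produced by `stream_fastq`. The helper name `records_match_raw` is hypothetical.

```python
from itertools import islice

FASTQ_KEYS = ("label", "seq", "label2", "qscore")


def records_match_raw(fpath, records, n_records=2):
    """Return True if the first n_records dicts from `records` equal the
    first 4 * n_records raw lines of the file, field by field.

    Hypothetical helper; `records` is any iterable of record dicts.
    """
    with open(fpath, "r") as fp:
        raw = [line.strip() for line in islice(fp, 4 * n_records)]
    parsed = list(islice(records, n_records))

    # Fail if either the file or the record stream is too short.
    if len(raw) != 4 * n_records or len(parsed) != n_records:
        return False

    for i, rec in enumerate(parsed):
        # Rebuild the expected record from the four raw lines.
        expected = dict(zip(FASTQ_KEYS, raw[4 * i:4 * i + 4]))
        if rec != expected:
            return False
    return True
```

In this notebook, a call such as `records_match_raw(fp, stream_fastq(fp), n_records=2)` would be expected to return `True` for the two reads printed above, where `fp` is the file path defined in the previous cell.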
