# Working with FASTQ files in Python

## Table of Contents

1. [Available Tools](#available-tools)

## Available Tools

- [htseq](https://github.com/simon-anders/htseq)
- [fastq-and-furious](https://github.com/lgautier/fastq-and-furious)
- [biopython](https://github.com/biopython/biopython)
- [fastqp](https://github.com/mdshw5/fastqp) 

In [1]:
import os
import sys
import time
import itertools

import numpy as np
import pandas as pd

import HTSeq as hs
import fastqandfurious as ff
from Bio import SeqIO

In [2]:
os.getcwd()

'/home/ifrancium/Documents/sophiagenetics-francisglee/notebooks'

In [3]:
os.listdir('./data/original_files')

['aln.bam', 'aln1.fastq', 'aln.bam.bai', 'aln2.fastq']

In [4]:
aln1_FASTQ_path = '/home/ifrancium/Documents/sophiagenetics-francisglee/notebooks/data/original_files/aln1.fastq'
aln2_FASTQ_path = '/home/ifrancium/Documents/sophiagenetics-francisglee/notebooks/data/original_files/aln2.fastq'

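There appear to be a few different standards for encoding Phred scores.  I believe we have the Illumina MiSeq version, as indicated by the BAM file.  Recent Illumina instruments such as the MiSeq write Sanger-style Phred+33 qualities, which Biopython's SeqIO calls `"fastq"` (an alias for `"fastq-sanger"`), so that is the format name we pass when parsing.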

In [8]:
aln1_fastq = SeqIO.parse(aln1_FASTQ_path, "fastq")
aln2_fastq = SeqIO.parse(aln2_FASTQ_path, "fastq")
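
As a quick sanity check on the encoding, the lowest quality character in a file is a useful hint. Below is a minimal sketch, assuming we already have some raw quality strings in hand; `guess_phred_offset` is a hypothetical helper, not part of Biopython, and the heuristic can be fooled by a small sample of uniformly high-quality reads.

```python
def guess_phred_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from raw FASTQ quality strings.

    Heuristic only: the lowest character a +64 encoding can produce is ';'
    (ASCII 59, Solexa score -5), so anything below that implies Phred+33.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    if lowest < 59:
        return 33
    # No low characters seen: most likely a +64 encoding, although a small
    # sample of very high-quality Phred+33 reads could also end up here.
    return 64


# Made-up quality strings, for illustration only.
sample = ["#>>AAA", "CCCFFFFFHHHHH"]
print(guess_phred_offset(sample))
```

An offset of 33 corresponds to Biopython's `"fastq"` format name and an offset of 64 to `"fastq-illumina"` or `"fastq-solexa"`.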

Let's see how many reads there are in each FASTQ file.

In [6]:
aln1_count = 0
aln2_count = 0

for read in aln1_fastq:
    aln1_count += 1

for read in aln2_fastq:
    aln2_count += 1
    
print("There are {} reads in aln1.fastq".format(aln1_count))
print("There are {} reads in aln2.fastq".format(aln2_count))

There are 575002 reads in aln1.fastq
There are 575002 reads in aln2.fastq
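
If all we want is the read count, we don't strictly need to parse every record. Here is a rough sketch that just counts lines, assuming a strict four-lines-per-record FASTQ (no wrapped sequences, no trailing blank lines); `count_fastq_records` is a hypothetical helper and the demo data is made up.

```python
import io


def count_fastq_records(handle):
    """Count records in a FASTQ stream, assuming exactly 4 lines per record."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError("line count is not a multiple of 4")
    return n_lines // 4


# In-memory stand-in for an open FASTQ file.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\n####\n")
print(count_fastq_records(demo))
```

With a real file, the same function would be called on `open(path)` inside a `with` block.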


Biopython's SeqIO allows us to view the Phred Scores in a few different formats:

In [9]:
count = 1

for index, record in enumerate(itertools.islice(aln2_fastq, count)):
    print(record)
    print('\n')
    print(record.id, "Phred Scores: ", record.letter_annotations["phred_quality"])
    print('\n')
    print(record.format("fastq-illumina"))
    print('\n')
    print(record.format("fastq")) # fastq-sanger
    print('\n')
    print(record.format("fastq-solexa"))

ID: M01636:3:000000000-A442D:1:1101:15821:1569
Name: M01636:3:000000000-A442D:1:1101:15821:1569
Description: M01636:3:000000000-A442D:1:1101:15821:1569 2:N:0:1
Number of features: 0
Per letter annotation for: phred_quality
Seq('NAGAAAAGGTGGGCCTGAGGTTCAGAGCCATGGACCCCCACACAGCAAAGCAGA...GAT', SingleLetterAlphabet())


M01636:3:000000000-A442D:1:1101:15821:1569 Phred Scores:  [2, 29, 29, 32, 32, 32, 37, 33, 32, 16, 33, 32, 36, 38, 38, 37, 32, 37, 38, 38, 37, 38, 39, 38, 39, 34, 38, 15, 38, 39, 33, 16, 38, 38, 39, 39, 38, 38, 38, 38, 38, 38, 38, 38, 39, 38, 37, 39, 39, 39, 39, 38, 38, 37, 33, 33, 38, 37, 39, 35, 39, 39, 39, 37, 38, 39, 37, 36, 36, 39, 38, 38, 37, 37, 39, 38, 39, 37, 37, 39, 39, 39, 39, 39, 39, 37, 38, 38, 38, 38, 14, 38, 38, 37, 17, 38, 37, 39, 39, 39, 37, 39, 37, 38, 16, 38, 39, 39, 39, 39, 39, 38, 38, 38, 33, 17, 36, 38, 36, 36, 31, 34, 33, 34, 30, 27, 31, 37, 39, 39, 17, 33, 35, 38, 37, 39, 37, 38, 38, 38, 39, 38, 38, 38, 37, 37, 30, 37, 38, 15, 30]


@M01636:3:000000000

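Seems like the Phred encoding is Phred+33: parsed as `"fastq"` (Phred+33), this read's scores fall between 2 and 39, which is a plausible range.  If the file were really Phred+64, the same parse would have shifted every score up by 31.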
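
To make the Phred+33 relationship concrete: each quality character is its ASCII code minus 33, and a score Q corresponds to an error probability of 10^(-Q/10). The sketch below uses two hypothetical helper names, and the quality string is simply the first seven scores printed above re-encoded under Phred+33.

```python
def decode_phred33(quality_string):
    """Convert a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


scores = decode_phred33("#>>AAAF")
print(scores)
print([round(error_probability(q), 4) for q in scores])
```

The leading score of 2 lines up with the `N` base call at the start of that read.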

In [10]:
print(type(aln1_fastq))

<class 'generator'>


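Seems like `SeqIO.parse()` creates a generator, so the records are gone once we've looped over them.  The generator itself doesn't store the FASTQ data; to keep the records around we'd have to materialize them ourselves (for example with `list()`), which costs memory on files this size.  Bummer.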
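
The one-shot behaviour is a property of Python generators in general. Here is a small sketch using a stand-in generator (`fake_parse` is hypothetical and only mimics the shape of what `SeqIO.parse` hands back):

```python
def fake_parse(n):
    """Stand-in for a record generator such as the one SeqIO.parse returns."""
    for i in range(n):
        yield "read_{}".format(i)


records = fake_parse(3)
first_pass = sum(1 for _ in records)   # consumes the generator
second_pass = sum(1 for _ in records)  # nothing left to iterate over

stored = list(fake_parse(3))           # materialize once, reuse freely
print(first_pass, second_pass, len(stored))
```

The same pattern applies to the real parser: either call it again for each pass over the file, or wrap it in `list()` when the records fit in memory.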

At first glance, these FASTQ files look raw and unprocessed.  Let's see what the primer distributions are.

| Forward Primers | Reverse Primers | 
|:----------------|:----------------|
| TTGCCAGTTAACGTCTTCCTTCTCTCTCTG | GAGAAAAGGTGGGCCTGAGGTTCAGAGCCA |   
| CCCTTGTCTCTGTGTTCTTGTCCCCCCCA | CCCCACCAGACCATGAGAGGCCCTGCGGCC |    
| TGATCTGTCCCTCACAGCAGGGTCTTCTCT | TGACCTAAAGCCACCTCCTTA |
| CACACTGACGTGCCTCTCCCTCCCTCCA | CCGTATCTCCCTTCCCTGATTA |
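
One simple way to get a primer distribution is to count how many reads start with each primer. This is a minimal sketch, assuming exact matching at the 5' end; `primer_counts` is a hypothetical helper, and the demo reads are only the visible prefixes of reads printed elsewhere in this notebook, not full sequences.

```python
from collections import Counter

# Forward primers copied from the table above.
FORWARD_PRIMERS = [
    "TTGCCAGTTAACGTCTTCCTTCTCTCTCTG",
    "CCCTTGTCTCTGTGTTCTTGTCCCCCCCA",
    "TGATCTGTCCCTCACAGCAGGGTCTTCTCT",
    "CACACTGACGTGCCTCTCCCTCCCTCCA",
]


def primer_counts(sequences, primers):
    """Count reads whose 5' end exactly matches one of the primers."""
    counts = Counter()
    for seq in sequences:
        for primer in primers:
            if seq.startswith(primer):
                counts[primer] += 1
                break
        else:
            # No primer matched this read.
            counts["no primer"] += 1
    return counts


demo_reads = [
    "TTGCCAGTTAACGTCTTCCTTCTCTCTCTGTCATAGGGACTCTGGATCCCAGAA",
    "CACACTGACGTGCCTCTCCCTCCCTCCAGGAAGCCTACGTGATGGCCAGCGTGG",
    "NAGAAAAGGTGGGCCTGAGGTTCAGAGCCATGGACCCCCACACAGCAAAGCAGA",
]
for primer, n in primer_counts(demo_reads, FORWARD_PRIMERS).items():
    print(primer, n)
```

Exact matching is deliberately strict: a read that begins with an `N`, like the third one here, will not match any primer, so a real analysis would probably want to tolerate a mismatch or two.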
   

#### Let's use a different library to parse FASTQ: fastq-and-furious.

In [11]:
count = 2

for index, record in enumerate(itertools.islice(aln1_fastq, count)):
    print(record)

ID: M01636:3:000000000-A442D:1:1101:15821:1569
Name: M01636:3:000000000-A442D:1:1101:15821:1569
Description: M01636:3:000000000-A442D:1:1101:15821:1569 1:N:0:1
Number of features: 0
Per letter annotation for: phred_quality
Seq('TTGCCAGTTAACGTCTTCCTTCTCTCTCTGTCATAGGGACTCTGGATCCCAGAA...TTG', SingleLetterAlphabet())
ID: M01636:3:000000000-A442D:1:1101:17381:1570
Name: M01636:3:000000000-A442D:1:1101:17381:1570
Description: M01636:3:000000000-A442D:1:1101:17381:1570 1:N:0:1
Number of features: 0
Per letter annotation for: phred_quality
Seq('CACACTGACGTGCCTCTCCCTCCCTCCAGGAAGCCTACGTGATGGCCAGCGTGG...TGT', SingleLetterAlphabet())


In [None]:
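# NOTE: pd.read_csv expects delimited text; pointing it at the raw FASTQ
# will not produce the 'Phred Score' column used in the cells below.
aln1_FASTQ_df = pd.read_csv(aln1_FASTQ_path)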

In [None]:
print(aln1_FASTQ_df.index)

In [None]:
print(aln1_FASTQ_df.columns)

In [None]:
aln1_FASTQ_df.head(30)

In [None]:
phred_scores = aln1_FASTQ_df['Phred Score']

print(phred_scores[0], type(phred_scores[0]))

In [None]:
# Print the first 11 score sets (indices 0 through 10).
for index, score_set in enumerate(itertools.islice(phred_scores, 11)):
    print(index, score_set)
    

In [None]:
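# Drop the brackets, then split on commas and/or whitespace so that both
# "[2, 29, 29]" and "[ 2 29 29]" style strings parse cleanly.
score_list = [int(x) for x in phred_scores[0].strip("[]").replace(",", " ").split()]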
print(score_list)

score_array = np.asarray(score_list)
print(score_array)

#### Python implementation of a fastq reader.
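
A minimal sketch of what such a reader could look like, assuming strict four-line records (header, sequence, `+` separator, quality) with no wrapped sequences. `FastqRecord` and `read_fastq` are hypothetical names, and the demo data is made up.

```python
import io
from collections import namedtuple

FastqRecord = namedtuple("FastqRecord", "name sequence quality")


def read_fastq(handle):
    """Yield FastqRecord tuples from a strict 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed record near {!r}".format(header))
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in {!r}".format(header))
        yield FastqRecord(header[1:].rstrip("\r\n"), sequence, quality)


# In-memory stand-in for an open FASTQ file.
demo = io.StringIO("@r1 1:N:0:1\nACGT\n+\nIIII\n@r2 1:N:0:1\nTTGA\n+\n####\n")
for record in read_fastq(demo):
    print(record.name, record.sequence, record.quality)
```

Because it is a generator, it has the same one-pass behaviour as `SeqIO.parse`; wrap it in `list()` to keep the records.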

#### Trimming with Trimmomatic

```
java -jar trimmomatic-0.38.jar PE -phred33 \
  /home/ifrancium/Documents/sophiagenetics-francisglee/notebooks/data/original_files/aln1.fastq \
  /home/ifrancium/Documents/sophiagenetics-francisglee/notebooks/data/original_files/aln2.fastq \
  /home/ifrancium/Documents/sophiagenetics-francisglee/notebooks/data/trimmomatic_output/output_forward_paired.fq.gz \
  /home/ifrancium/Documents/sophiagenetics-francisglee/notebooks/data/trimmomatic_output/output_forward_unpaired.fq.gz \
  /home/ifrancium/Documents/sophiagenetics-francisglee/notebooks/data/trimmomatic_output/output_reverse_paired.fq.gz \
  /home/ifrancium/Documents/sophiagenetics-francisglee/notebooks/data/trimmomatic_output/output_reverse_unpaired.fq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
```
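
To keep the trimming step inside the notebook, the same paired-end command can be assembled as an argument list and handed to `subprocess`. This is only a sketch: `trimmomatic_pe_command` is a hypothetical helper, the relative paths in the demo are assumptions about where the notebook is run from, and the trimming steps are copied from the command above.

```python
from pathlib import Path


def trimmomatic_pe_command(jar, forward, reverse, out_dir):
    """Build the argument list for a paired-end Trimmomatic run."""
    out_dir = Path(out_dir)
    # Trimmomatic PE expects: forward paired/unpaired, then reverse paired/unpaired.
    outputs = [
        out_dir / "output_forward_paired.fq.gz",
        out_dir / "output_forward_unpaired.fq.gz",
        out_dir / "output_reverse_paired.fq.gz",
        out_dir / "output_reverse_unpaired.fq.gz",
    ]
    steps = [
        "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10",
        "LEADING:3",
        "TRAILING:3",
        "SLIDINGWINDOW:4:15",
        "MINLEN:36",
    ]
    return (
        ["java", "-jar", str(jar), "PE", "-phred33", str(forward), str(reverse)]
        + [str(path) for path in outputs]
        + steps
    )


cmd = trimmomatic_pe_command(
    "trimmomatic-0.38.jar",
    "data/original_files/aln1.fastq",
    "data/original_files/aln2.fastq",
    "data/trimmomatic_output",
)
print(" ".join(cmd))
# To actually run it (requires Java and the Trimmomatic jar):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single shell string avoids quoting problems if any of the paths contain spaces.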