# Reading and Summarizing FASTQ Data

The pipeline comprises the following steps:

1. FASTQ reading and quality summary – Confirm read count, length, and quality distribution.
2. Align reads to reference (Bowtie2) – Map sequencing reads to the S288C reference genome (which lacks the KanMX insertion) to assess read coverage across the CHZ1 locus.
3. Parse alignments for coverage and clipping – Examine alignment coverage across the CHZ1 locus and identify reads with soft-clips (which may indicate partial alignment due to the insertion).
4. Search for KanMX-specific sequences – Scan reads for known KanMX cassette sequences (primer tails and internal motifs) to directly detect the insertion.
5. Compile evidence and interpret – Combine the findings to determine whether CHZ1 is fully replaced by KanMX, and discuss confidence in the result.
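
Steps 3 and 4 above can be sketched with two small helpers. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: `soft_clip_lengths` and `count_reads_with_motif` are hypothetical names, the CIGAR strings are made-up examples, and `GATTACA` is a placeholder motif rather than a real KanMX sequence. Note that Bowtie2 only reports soft-clipped alignments in local mode (`--local`); its default end-to-end mode does not soft-clip.

```python
import re

# A CIGAR string is a series of <length><operation> pairs (SAM format).
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clipped base counts for one CIGAR string."""
    ops = [(int(n), op) for n, op in _CIGAR_RE.findall(cigar)]
    # Hard clips (H) can sit outside soft clips, so ignore them first.
    core = [item for item in ops if item[1] != "H"]
    left = core[0][0] if core and core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return left, right


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_reads_with_motif(reads, motif):
    """Count reads containing the motif on either strand (case-insensitive)."""
    motif = motif.upper()
    rc = reverse_complement(motif)
    return sum(1 for read in reads if motif in read.upper() or rc in read.upper())


# Toy inputs for illustration only.
example_cigars = ["100M", "20S80M", "75M25S", "5H10S60M15S", "*"]
for cigar in example_cigars:
    print(cigar, soft_clip_lengths(cigar))

toy_reads = ["TTGATTACATT", "AATGTAATCAA", "CCCCCCCC"]
placeholder_motif = "GATTACA"  # placeholder, NOT a KanMX sequence
print(count_reads_with_motif(toy_reads, placeholder_motif))
```

In a real run, the CIGAR strings would come from the sixth column of the SAM records overlapping CHZ1, and the motifs would be the KanMX primer tails and internal sequences you have on hand. Searching both strands matters because reads can originate from either strand of the insertion.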

In [1]:
# Read the FASTQ file and summarize read count, length, and quality

from Bio import SeqIO

fastq_path = "matching.fq"  # path to your FASTQ file
total_reads = 0
lengths = []
mean_quals = []  # mean Phred quality of each read

for record in SeqIO.parse(fastq_path, "fastq"):
    total_reads += 1
    seq_len = len(record.seq)
    lengths.append(seq_len)
    # Mean quality of this read; skip empty reads to avoid dividing by zero
    if seq_len:
        phred_scores = record.letter_annotations["phred_quality"]
        mean_quals.append(sum(phred_scores) / seq_len)

print(f"Total reads: {total_reads}")
if lengths:
    print(f"Read length: min={min(lengths)}, max={max(lengths)}, average={sum(lengths)/len(lengths):.1f}")
if mean_quals:
    # Average of the per-read means, not a per-base average
    print(f"Mean Phred quality per read: ~{sum(mean_quals)/len(mean_quals):.1f}")


FileNotFoundError: [Errno 2] No such file or directory: 'matching.fastq'
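
The error above means the FASTQ file was not found at the path given to `SeqIO.parse`. Note that the code cell uses `matching.fq` while the error reports `matching.fastq`, so the file name is worth checking before re-running. A minimal sketch of such a check, assuming the file lives in the current working directory (`first_existing` is a hypothetical helper, not part of Biopython):

```python
from pathlib import Path


def first_existing(candidates):
    """Return the first candidate that exists as a regular file, else None."""
    for name in candidates:
        path = Path(name)
        if path.is_file():
            return path
    return None


# Both names appear in this notebook; adjust the list to your own data.
candidates = ["matching.fq", "matching.fastq"]
fastq_path = first_existing(candidates)

if fastq_path is None:
    print(f"None of {candidates} found in {Path.cwd()}")
else:
    print(f"Using FASTQ file: {fastq_path}")
```

If neither name is found, the printed working directory shows where Python is looking, which usually makes it clear whether the file name or the location is wrong.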