# Pipeline for Detecting a KanMX Insertion at the Yeast CHZ1 (YER030W) Locus

The pipeline comprises the following steps:

1. FASTQ reading and quality summary – Confirm read count, length, and quality distribution.
2. Align reads to reference (Bowtie2) – Map sequencing reads to the S288C reference genome (which lacks the KanMX insertion) to see how CHZ1 is covered.
3. Parse alignments for coverage and clipping – Examine alignment coverage across the CHZ1 locus and identify reads with soft-clips (which may indicate partial alignment due to the insertion).
4. Search for KanMX-specific sequences – Scan reads for known KanMX cassette sequences (primer tails and internal motifs) to directly detect the insertion.
5. Compile evidence and interpret – Combine the findings to determine if CHZ1 is fully replaced by KanMX, and discuss confidence in the result.

## Reading and Summarizing FASTQ Data

First, we load the sequencing reads from the FASTQ file and compute basic statistics: the number of reads, their lengths, and their quality scores. We use Biopython’s Bio.SeqIO module to parse the FASTQ file, which yields one SeqRecord object per [read](https://biopython.org/wiki/SeqIO#:~:text=Aims). With the "fastq" format name, Biopython assumes the Sanger variant of FASTQ ([Phred+33 quality encoding](https://biopython.org/wiki/SeqIO#:~:text=fastq,See%20also%20the%20incompatible%20%E2%80%9Cfastq)). For large files, the lower-level Bio.SeqIO.QualityIO.FastqGeneralIterator is faster because it returns plain strings instead of building SeqRecord objects.

```python
from Bio import SeqIO

fastq_path = "matching.fq"  # path to your FASTQ file
total_reads = 0
lengths = []
mean_quals = []  # per-read mean Phred score

for record in SeqIO.parse(fastq_path, "fastq"):
    total_reads += 1
    seq_len = len(record.seq)
    lengths.append(seq_len)
    # Per-read mean quality; skip empty reads to avoid dividing by zero
    phred_scores = record.letter_annotations["phred_quality"]
    if seq_len:
        mean_quals.append(sum(phred_scores) / seq_len)

print(f"Total reads: {total_reads}")
if lengths:
    print(f"Read length: min={min(lengths)}, max={max(lengths)}, average={sum(lengths)/len(lengths):.1f}")
if mean_quals:
    print(f"Mean Phred quality per read: ~{sum(mean_quals)/len(mean_quals):.1f}")
```

Output:

```
Total reads: 265
Read length: min=150, max=150, average=150.0
Mean Phred quality per read: ~35.9
```


This shows that we have 265 reads, each 150 bp long, with a high average quality (mean Phred score of about 35.9).

If needed, we could also examine the distribution of quality scores or read lengths. (Here every read is 150 bp, which is what one would expect from untrimmed, fixed-length short reads.)
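As a minimal sketch of such a distribution check, the function below bins reads by mean Phred score and counts read lengths. It takes any iterable of (title, sequence, quality string) tuples, which is the shape FastqGeneralIterator yields, and it assumes Phred+33 encoding. The function name, the bin width, and the demo records are made up for illustration.

```python
from collections import Counter

PHRED_OFFSET = 33  # assumption: Sanger / Phred+33 encoding, as in the text


def mean_quality_histogram(records, bin_width=5):
    """Count reads per mean-quality bin and per read length.

    `records` is an iterable of (title, sequence, quality_string) tuples.
    Returns (quality_bins, length_counts) as Counter objects; quality bins
    are keyed by the lower edge of each bin.
    """
    quality_bins = Counter()
    length_counts = Counter()
    for _title, seq, qual in records:
        length_counts[len(seq)] += 1
        if not qual:
            continue  # skip empty reads to avoid dividing by zero
        mean_q = sum(ord(ch) - PHRED_OFFSET for ch in qual) / len(qual)
        quality_bins[int(mean_q // bin_width) * bin_width] += 1
    return quality_bins, length_counts


# Made-up records for illustration only: "I" = Q40, "5" = Q20, "+" = Q10
demo_records = [
    ("read1", "ACGT", "IIII"),
    ("read2", "ACGT", "5555"),
    ("read3", "ACG", "II+"),
]
quality_bins, length_counts = mean_quality_histogram(demo_records)
for bin_start in sorted(quality_bins):
    print(f"mean Q >= {bin_start}: {quality_bins[bin_start]} read(s)")
print("lengths:", dict(length_counts))

# With Biopython installed, the real file could be fed in like this:
#   from Bio.SeqIO.QualityIO import FastqGeneralIterator
#   with open("matching.fq") as handle:
#       quality_bins, length_counts = mean_quality_histogram(FastqGeneralIterator(handle))
```

A histogram with most reads in the top bins and a single read length would be consistent with the summary statistics reported above.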

__Indication:__ The high read quality suggests the data are suitable for analysis. The CHZ1 coding region encodes [153 amino acids](https://wiki.yeastgenome.org/index.php/Chromosome_V_History#:~:text=The%20start%20site%20of%20YER030W,160%20aa%20to%20153%20aa), which corresponds to 462 bp including the stop codon (483 bp matches the earlier 160-amino-acid annotation), so a single 150 bp read cannot span it. We will rely on multiple reads, and possibly overlapping pairs if the data are paired-end, to cover the locus. If CHZ1 has been knocked out, we expect anomalies in how these reads map to the reference genome at that location, which we will assess in the next steps.
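The length arithmetic behind this paragraph can be checked in a few lines. The sketch below assumes an intron-free ORF, so the coding length is three bases per codon plus one stop codon; the helper names are illustrative and not part of the pipeline.

```python
import math


def orf_length_bp(n_amino_acids):
    """Coding length in bp for an intron-free ORF, including the stop codon."""
    return 3 * (n_amino_acids + 1)


def min_reads_to_span(locus_len_bp, read_len_bp):
    """Fewest reads that could tile a locus end to end with no overlap."""
    return math.ceil(locus_len_bp / read_len_bp)


for n_aa in (153, 160):  # current and earlier annotated protein lengths
    length = orf_length_bp(n_aa)
    print(f"{n_aa} aa -> {length} bp, at least "
          f"{min_reads_to_span(length, 150)} reads of 150 bp")
```

Under either annotation the locus needs at least four 150 bp reads placed end to end, and in practice more, because real reads overlap rather than tile perfectly.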


## Aligning Reads to the S. cerevisiae Reference (Bowtie2)

Next, we align the reads to the yeast reference genome (S288C, release R64-4-1, matching the reference file used below). We use Bowtie2, an ultrafast and memory-efficient aligner for short [reads](https://gensoft.pasteur.fr/docs/bowtie2/2.1.0/#:~:text=Bowtie%202%20is%20an%20ultrafast,used%20simultaneously%20to%20achieve%20greater). Bowtie2 is invoked from the command line; we’ll assume it’s installed on our VM. We first need a reference sequence, either chromosome V alone or the whole genome. Here we use the whole genome, so that reads originating elsewhere are not forced onto chromosome V.

### Downloading the reference genome

The yeast reference can be obtained from the Saccharomyces Genome Database (SGD) or Ensembl. For example, SGD provides a FASTA file of the S288C chromosomes, which you can download and extract from [http://sgd-archive.yeastgenome.org/?prefix=sequence/S288C_reference/genome_releases/](http://sgd-archive.yeastgenome.org/?prefix=sequence/S288C_reference/genome_releases/).

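This FASTA contains all 16 nuclear chromosomes (I through XVI) plus the mitochondrial genome. The exact sequence names (for example chrI, chrII, ..., chrXVI) depend on where the file was obtained, so check the header lines before relying on them. We are particularly interested in chromosome V, where CHZ1 (YER030W) is located.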

We need the S288C_reference_sequence_R64-4-1_20230830.fsa file, which is included in the repo.
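Because the alignment records will refer to chromosomes by their FASTA header names, it can help to list those names and the sequence lengths first. The following is a minimal sketch using only the standard library; the function name and the demo records are made up, and the real header layout should be read from the downloaded file rather than assumed.

```python
def fasta_lengths(lines):
    """Map each FASTA record name to its sequence length.

    The name is the first whitespace-delimited token after '>', which is
    also what Bowtie2 reports as the reference name in its SAM output.
    """
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Made-up records for illustration only
demo_fasta = [">seqA some description", "ACGTAC", "GT", ">seqB", "AAAA"]
print(fasta_lengths(demo_fasta))

# For the real reference (reference_path is a placeholder):
#   with open(reference_path) as handle:
#       for name, length in fasta_lengths(handle).items():
#           print(name, length)
```

Whichever name corresponds to chromosome V is the one to filter on when inspecting alignments at the CHZ1 locus.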

### Building the Bowtie2 Index from the reference file

Installation instructions for Bowtie2 are available [here](https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#obtaining-bowtie-2).

Run this command to build a Bowtie2 index from your reference FASTA file:
```
bowtie2-build S288C_reference_sequence_R64-4-1_20230830.fsa yeast_R64_index
```
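This produces six index files with the prefix yeast_R64_index (yeast_R64_index.1.bt2 through yeast_R64_index.4.bt2, plus yeast_R64_index.rev.1.bt2 and yeast_R64_index.rev.2.bt2), which Bowtie2 uses to align reads rapidly.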
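Once the index exists, the alignment step could be driven from Python along the following lines. This is a sketch under stated assumptions, not the pipeline's confirmed invocation: it treats matching.fq as unpaired reads (-U), the output name chz1_alignment.sam is made up, and it passes --local because Bowtie2's default end-to-end mode does not soft-clip reads, whereas the later clipping analysis depends on soft-clips being reported.

```python
import shlex
import shutil
import subprocess
from pathlib import Path

# File suffixes bowtie2-build writes for a small index
# (very large genomes get ".bt2l" files instead)
INDEX_SUFFIXES = (".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2")


def missing_index_files(index_prefix):
    """Return the expected index files that are not present on disk."""
    return [index_prefix + suffix for suffix in INDEX_SUFFIXES
            if not Path(index_prefix + suffix).exists()]


def bowtie2_command(index_prefix, fastq_path, sam_path, local=True, threads=1):
    """Build a bowtie2 argument list for unpaired reads."""
    cmd = ["bowtie2", "-x", index_prefix, "-U", fastq_path,
           "-S", sam_path, "-p", str(threads)]
    if local:
        cmd.append("--local")  # allow soft-clipping at read ends
    return cmd


def run_alignment(cmd):
    """Run the command, raising if bowtie2 is absent or exits non-zero."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# chz1_alignment.sam is a hypothetical output name
cmd = bowtie2_command("yeast_R64_index", "matching.fq", "chz1_alignment.sam")
print(shlex.join(cmd))
# To execute: result = run_alignment(cmd); Bowtie2 writes its summary to result.stderr
```

Building the command as a list rather than a shell string avoids quoting problems with file names, and calling missing_index_files("yeast_R64_index") beforehand gives a clearer error than a failed Bowtie2 run.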