# PicklSeq & Haplotype Caller


## Introduction

This notebook implements the methods introduced in **"Machine learning for targeted denoising and haplotype phasing of mixed clone pathogens using nanopore sequence data"**. It combines the PicklSeq and Haplotype Caller libraries. PicklSeq is a high-level wrapper around several open-source tools that extracts, maps, and aligns reads from raw fastq files and stores them in a Python-friendly .pkl format for subsequent processing.

#### PicklSeq:
https://github.com/paopaoch/PicklSeq

#### VariantCalling
https://github.com/paopaoch/VariantCalling

## Section 1: PicklSeq

In this section, we set up the Colab environment to run PicklSeq. This includes installing the required packages and open-source tools (SamTools, Minimap2, and Chopper). Lastly, we clone the PicklSeq repository to complete the environment setup.

### Installing packages and tools

In [None]:
######### Force Environment to use Keras < 3.0 #########
!pip install "keras<3.0.0" "tensorflow<2.16" "tf-models-official<2.16" mediapipe-model-maker

######### Packages Installation #########
# autoheader is provided by the autoconf package; there is no separate "autoheader" package
!apt-get install -y autoconf

######### SamTools #########
# Install HTSlib
!git clone https://github.com/samtools/htslib --recursive
%cd htslib/
!autoreconf -i  # Build the configure script and install files it uses
!./configure    # Optional but recommended, for choosing extra functionality
!make
!make install

%cd /content
!git clone https://github.com/samtools/samtools --recursive
%cd /content/samtools/
!echo "Running autoheader"
!pwd
!autoheader            # Build config.h.in
!autoconf -Wno-syntax  # Generate the configure script
!./configure           # Needed for choosing optional functionality
!make
!make install

######### MiniMap2 #########
%cd /content/
!curl -L https://github.com/lh3/minimap2/releases/download/v2.26/minimap2-2.26_x64-linux.tar.bz2 | tar -jxvf -
!cp /content/minimap2-2.26_x64-linux/minimap2 /usr/local/bin

######### Chopper #########
%cd /content/
!wget https://github.com/wdecoster/chopper/releases/download/v0.6.0/chopper-linux.zip
!yes|unzip chopper-linux.zip
!cp /content/chopper /usr/local/bin
!chmod +x /usr/local/bin/chopper

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package autoheader
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
autoconf is already the newest version (2.71-2).
autoconf set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.
Cloning into 'htslib'...
remote: Enumerating objects: 17027, done.
remote: Counting objects: 100% (1304/1304), done.
remote: Compressing objects: 100% (602/602), done.
remote: Total 17027 (delta 828), reused 890 (delta 672), pack-reused 15723
Receiving objects: 100% (17027/17027), 12.61 MiB | 14.06 MiB/s, done.
Resolving deltas: 100% (12240/12240), done.
Submodule 'htscodecs' (https://github.com/samtools/htscodecs.git) registered for path 'htscodecs'
Cloning into '/content/htslib/htscodecs'...
remote: Enumerating objects: 2216, done.        
remote: Counting objects: 100% (591/591), done.        
re
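
Once the installation cell above has finished, a quick check that the three command-line tools are reachable can save debugging time later. The snippet below is a minimal sketch: it assumes the binaries are named `samtools`, `minimap2`, and `chopper` on `PATH` (as the install commands suggest), and the helper `find_missing_tools` is a hypothetical name introduced here, not part of PicklSeq.

```python
import shutil

# Tool names as invoked on the command line; assumed to match the binaries installed above.
REQUIRED_TOOLS = ["samtools", "minimap2", "chopper"]


def find_missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```

If any tool is reported as missing, re-running the corresponding part of the installation cell is usually the first thing to try.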

### Cloning the PicklSeq Repo and Downloading the Example Fastq File

In [None]:
from google.colab import drive
%cd /content/
drive.mount('/content/drive')

#!cat /content/drive/MyDrive/fastq_files/barcode03_pf_gdna_mix_repeat_ori_barcode6.fastq.gz | gzip -d > /content/barcode03_pf_gdna_mix_repeat_ori_barcode6.fastq

!cat /content/drive/MyDrive/fastq_files/barcode01_pf_gdna_mix_repeat_pcr_barcode6.fastq.gz | gzip -d > /content/barcode01_pf_gdna_mix_repeat_pcr_barcode6.fastq


/content
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/
!git clone https://<insert token>@github.com/paopaoch/PicklSeq.git
!wget --header 'Authorization: token <insert token>' https://raw.githubusercontent.com/paopaoch/VariantCalling/main/samples/G22_Control_C_DD2_DRAG1_PfMAP.fastq

/content
Cloning into 'PicklSeq'...
remote: Enumerating objects: 89, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 89 (delta 0), reused 1 (delta 0), pack-reused 86
Receiving objects: 100% (89/89), 47.85 MiB | 10.74 MiB/s, done.
Resolving deltas: 100% (36/36), done.
--2024-07-11 10:57:24--  https://raw.githubusercontent.com/paopaoch/VariantCalling/main/samples/G22_Control_C_DD2_DRAG1_PfMAP.fastq
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 92508684 (88M) [text/plain]
Saving to: ‘G22_Control_C_DD2_DRAG1_PfMAP.fastq’


2024-07-11 10:57:31 (84.8 MB/s) - ‘G22_Control_C_DD2_DRAG1_PfMAP.fastq’ saved [92508684/92508684]



### Running PicklSeq

In this example, we supply a fastq file containing read data from *P. falciparum* DD2 clones, and we align and match the reads against the chloroquine resistance transporter (CRT) sequence. On successful execution, the program writes an output pickle file to the /content directory.

First, we initialize the Python variables that both PicklSeq and the Haplotype Caller need.

In [None]:
sequence_selected = "CRT"
seq_length = 178
clone_names = ["3D7","DD2","7G8"]
nb_mutations = len(clone_names)
clone_file = "crt_clones.txt"

In [None]:
import os
import subprocess

%cd /content/PicklSeq/
# Define the sequence
in_file = r"/content/barcode01_pf_gdna_mix_repeat_pcr_barcode6.fastq"
#in_file = r"/content/barcode02_pf_gdna_mix_repeat_pcr_barcode7.fastq"
#in_file = r"/content/barcode03_pf_gdna_mix_repeat_ori_barcode6.fastq"
#in_file = r"/content/barcode04_pf_gdna_mix_repeat_ori_barcode7.fastq"
#in_file = r"/content/barcode05_pf_gdna_mix_repeat_new_barcode6.fastq"
#in_file = r"/content/barcode06_pf_gdna_mix_repeat_new_barcode7.fastq"
out_file = in_file.replace(".fastq",".pkl")
print("Currently processing: ", in_file, "\t expecting output: ", out_file)
subprocess.run(["python", "picklseq.py", "-f="+in_file, "-o="+out_file, "-c=20", "-M=4000", "-t="+sequence_selected])

%cd /content


/content/PicklSeq
Currently processing:  /content/barcode01_pf_gdna_mix_repeat_pcr_barcode6.fastq 	 expecting output:  /content/barcode01_pf_gdna_mix_repeat_pcr_barcode6.pkl
/content
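
Because `subprocess.run` does not raise on failure by default, it can be worth confirming that the output file was actually written before moving on. The sketch below assumes nothing about the internal layout of the PicklSeq pickle: it only reports the type of the top-level object and its length, if it has one. The helper name `summarize_pickle` is hypothetical, and if the pickle stores custom classes, the defining module would need to be importable for loading to succeed.

```python
import os
import pickle


def summarize_pickle(path):
    """Load a pickle file and return (type name, length or None)."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Expected output file not found: {path}")
    with open(path, "rb") as handle:
        # Only unpickle files you produced yourself; pickle can execute code on load.
        obj = pickle.load(handle)
    length = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, length


# In the notebook, this could be called as:
#   print(summarize_pickle(out_file))
```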


## Section 2: Haplo-Reader

In this section, we perform haplotype calling on the reads that PicklSeq extracted from the fastq file. We first load the ML model and then generate a similarity score between each read and the reference sequences (only the 3D7, DD2, and 7G8 clones are considered). Each read is matched to the reference clone with the highest similarity score, subject to a threshold of 0.5: when no reference sequence achieves a score > 0.5, the read is assigned to the unknown group. This is done read by read until all read data in the pickle file has been processed. As the last step of the haplotype-calling process, the frequency (proportion) of each clone is computed.
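
The assignment rule described above can be illustrated in a few lines of plain Python. This is a minimal sketch, not the code used inside `Comparator`: the function names `assign_read` and `clone_proportions` are hypothetical, and the similarity scores are made-up values chosen only to exercise the rule. The last slot of the count list plays the role of the unknown group.

```python
def assign_read(scores, threshold=0.5):
    """Return the index of the best-scoring clone, or len(scores) for 'unknown'."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    # A read is only assigned when its best score is strictly above the threshold.
    return best if scores[best] > threshold else len(scores)


def clone_proportions(all_scores, n_clones, threshold=0.5):
    """Count assignments per clone (plus unknown) and convert them to percentages."""
    counts = [0] * (n_clones + 1)  # last slot counts unknown reads
    for scores in all_scores:
        counts[assign_read(scores, threshold)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * (n_clones + 1)
    return [100.0 * count / total for count in counts]


# Hypothetical similarity scores for four reads against 3D7, DD2, and 7G8.
example_scores = [
    [0.91, 0.12, 0.05],  # assigned to 3D7
    [0.08, 0.87, 0.10],  # assigned to DD2
    [0.20, 0.95, 0.02],  # assigned to DD2
    [0.30, 0.41, 0.22],  # no score above 0.5, so unknown
]
print(clone_proportions(example_scores, n_clones=3))
# → [25.0, 50.0, 0.0, 25.0]
```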

### Cloning the Repository

In [None]:
%cd /content/
!git clone https://<insert token>@github.com/paopaoch/VariantCalling.git

/content
Cloning into 'VariantCalling'...
remote: Enumerating objects: 517, done.
remote: Counting objects: 100% (98/98), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 517 (delta 69), reused 78 (delta 58), pack-reused 419
Receiving objects: 100% (517/517), 104.14 MiB | 14.47 MiB/s, done.
Resolving deltas: 100% (262/262), done.
Updating files: 100% (90/90), done.


### Running Inference on the Preprocessed Pickle File

In [None]:
%cd /content/VariantCalling
import tensorflow as tf
import VariantCalling as vc
import Comparator
import pickle
import numpy as np
import pandas as pd
import importlib
importlib.reload(Comparator)

model_path = r"/content/VariantCalling/comparator_models/Comparator_CRT.keras"
model = Comparator.load_comparator(model_path)
tracker, pred_list = Comparator.run_comparator_inference(model,out_file,clone_file)
print("\nClone\t\tProportion")
for i in range(len(tracker)-1):
    print(clone_names[i] + ":\t\t" + str(round(tracker[i]*100/sum(tracker),2)) + "%")
print("Unknown:\t" + str(round(tracker[len(tracker)-1]*100/sum(tracker),2)) + "%")
print(in_file)

/content/VariantCalling

Clone		Proportion
3D7:		35.37%
DD2:		63.58%
7G8:		0.41%
Unknown:	0.63%
/content/barcode01_pf_gdna_mix_repeat_pcr_barcode6.fastq


## Expected Results

File: G22_Control_C_DD2_DRAG1_PfMAP.fastq

Clone		Proportion
3D7:		2.56%
DD2:		96.99%
7G8:		0.16%
Unknown:	0.28%