# PicklSeq & Haplotype Caller


## Introduction

This notebook implements the methods introduced in **"Machine learning for targeted denoising and haplotype phasing of mixed clone pathogens using nanopore sequence data"**. It combines the PicklSeq and Haplotype Caller libraries. PicklSeq is a high-level wrapper that combines several open-source tools to extract, map, and align reads from raw fastq files and store them in a Python-friendly .pkl format for subsequent processing.

#### PicklSeq:
https://github.com/paopaoch/PicklSeq

#### VariantCalling
https://github.com/VariantCalling/VariantCalling

## Section 1: PicklSeq

In this section, we set up the Colab environment to run PicklSeq. This includes installing the required packages and open-source tools (SAMtools, minimap2, and Chopper). Finally, we clone the PicklSeq repository to complete the environment setup.

### Installing packages and tools

In [1]:
######### Force Environment to use Keras < 3.0 #########
#!pip install "keras<3.0.0" "tensorflow<2.16" "tf-models-official<2.16" mediapipe-model-maker

######### Packages Installation #########
# autoheader is not a separate apt package; it is provided by autoconf
!apt-get install -y autoconf

######### SamTools #########
# Install HTSlib
!git clone https://github.com/samtools/htslib --recursive
%cd htslib/
!autoreconf -i  # Build the configure script and install files it uses
!./configure    # Optional but recommended, for choosing extra functionality
!make
!make install

%cd /content
!git clone https://github.com/samtools/samtools --recursive
%cd /content/samtools/
!echo "Running autoheader"
!pwd
!autoheader            # Build config.h.in
!autoconf -Wno-syntax  # Generate the configure script
!./configure           # Needed for choosing optional functionality
!make
!make install

######### MiniMap2 #########
%cd /content/
!curl -L https://github.com/lh3/minimap2/releases/download/v2.26/minimap2-2.26_x64-linux.tar.bz2 | tar -jxvf -
!cp /content/minimap2-2.26_x64-linux/minimap2 /usr/local/bin

######### Chopper #########
%cd /content/
!wget https://github.com/wdecoster/chopper/releases/download/v0.6.0/chopper-linux.zip
!yes|unzip chopper-linux.zip
!cp /content/chopper /usr/local/bin
!chmod +x /usr/local/bin/chopper

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package autoheader
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
autoconf is already the newest version (2.71-2).
autoconf set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Cloning into 'htslib'...
remote: Enumerating objects: 17721, done.
remote: Counting objects: 100% (196/196), done.
remote: Compressing objects: 100% (81/81), done.
remote: Total 17721 (delta 141), reused 115 (delta 115), pack-reused 17525 (from 2)
Receiving objects: 100% (17721/17721), 13.18 MiB | 29.46 MiB/s, done.
Resolving deltas: 100% (12795/12795), done.
Submodule 'htscodecs' (https://github.com/samtools/htscodecs.git) registered for path 'htscodecs'
Cloning into '/content/htslib/htscodecs'...
remote: Enumerating objects: 2280, done.        
remote: Counting objects: 100% (573/573), done.      
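
Because the build log is long, a failed step is easy to miss. As a quick sanity check, we can confirm that the three command-line tools ended up on the `PATH`. The following is a minimal sketch using only the standard library; the helper name `missing_tools` is hypothetical and not part of PicklSeq.

```python
import shutil


def missing_tools(names):
    """Return the subset of `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


# Executable names as installed by the cell above.
required = ["samtools", "minimap2", "chopper"]
missing = missing_tools(required)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools are available.")
```

If any tool is reported as missing, rerun the corresponding part of the installation cell before continuing.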

### Cloning the PicklSeq Repo and Downloading the Example Fastq File

In [2]:
%cd /content/
!git clone https://github.com/cchuenchoksan/PicklSeq.git

!gdown https://drive.google.com/uc?id=1scYGpLgL3Aj0d6MYoT-VaWasBx0XY0qB # mix_3clones.zip
!gdown https://drive.google.com/uc?id=1huSbAzYVNKOfQg7mgXIBnrvGeXLjdj2J # single_samples.zip
!gdown https://drive.google.com/uc?id=1kgvCYVxmIg-JJteAb0GC6Z_JconQ4Y9U # mix_2clones.zip

!yes | unzip -j mix_3clones.zip
!yes | unzip -j mix_2clones.zip
!yes | unzip -j single_samples.zip

/content
Cloning into 'PicklSeq'...
remote: Enumerating objects: 89, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 89 (delta 0), reused 1 (delta 0), pack-reused 86 (from 1)
Receiving objects: 100% (89/89), 47.85 MiB | 38.86 MiB/s, done.
Resolving deltas: 100% (36/36), done.
Downloading...
From (original): https://drive.google.com/uc?id=1scYGpLgL3Aj0d6MYoT-VaWasBx0XY0qB
From (redirected): https://drive.google.com/uc?id=1scYGpLgL3Aj0d6MYoT-VaWasBx0XY0qB&confirm=t&uuid=340e2050-47c8-49bb-bf5b-cba4abf2113d
To: /content/mix_3clones.zip
100% 826M/826M [00:06<00:00, 124MB/s] 
Downloading...
From (original): https://drive.google.com/uc?id=1huSbAzYVNKOfQg7mgXIBnrvGeXLjdj2J
From (redirected): https://drive.google.com/uc?id=1huSbAzYVNKOfQg7mgXIBnrvGeXLjdj2J&confirm=t&uuid=459e1947-2ed3-4e0b-a8b9-1b8fe449ce08
To: /content/single_samples.zip
100% 551M/551M [00:04<00:00, 121MB/s]
Downloading...
From (original): https://driv

### Running PicklSeq

In this example, we supply a fastq file containing read data from *P. falciparum* DD2 clones and align and match the reads against the chloroquine resistance transporter (CRT) sequence. On successful execution, the program writes an output pickle file to the /content directory.

First, we initialize the Python variables needed by both PicklSeq and Haplotype Caller.

### Select Sequence by Updating haplo_caller_notebook_script.py
Supported sequences: pfcrt (CRT), pfdhps (DHPS), pfdhfr (DHFR), SARS-CoV-2 RBM (RBM200)

In [3]:
# User Input - Note: This section has to be copied to haplo_caller_notebook_script.py too
sequence_selected = "CRT" #  "CRT" | "DHPS" | "DHFR" | "RBM200"
barcode = "02" # Options: 00-09
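
A typo in either setting only surfaces later, when the input file is not found or the target is rejected. The sketch below checks both values against the options listed in the comments above. The function `validate_inputs` and the constant names are hypothetical helpers, not part of PicklSeq or the Haplotype Caller.

```python
# Options taken from the comments in the user-input cell.
SUPPORTED_SEQUENCES = {"CRT", "DHPS", "DHFR", "RBM200"}
SUPPORTED_BARCODES = {f"{i:02d}" for i in range(10)}  # "00" .. "09"


def validate_inputs(sequence_selected, barcode):
    """Raise ValueError if a selection falls outside the documented options."""
    if sequence_selected not in SUPPORTED_SEQUENCES:
        raise ValueError(
            f"Unsupported sequence {sequence_selected!r}; "
            f"expected one of {sorted(SUPPORTED_SEQUENCES)}"
        )
    if barcode not in SUPPORTED_BARCODES:
        raise ValueError(
            f"Unsupported barcode {barcode!r}; expected a two-digit string from 00 to 09"
        )
    return sequence_selected, barcode


# Same values as in the cell above.
validate_inputs("CRT", "02")
```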

In [4]:
import os
import subprocess

%cd /content/PicklSeq/
# Define the sequence
in_file = r"/content/FAY33695_pass_barcode" + barcode + ".fastq"

out_file = in_file.replace(".fastq",".pkl")
print("Currently processing: ", in_file, "\t expecting output: ", out_file)
subprocess.run(["python", "picklseq.py", "-f="+in_file, "-o="+out_file, "-c=20", "-M=4000", "-t="+sequence_selected])

%cd /content


/content/PicklSeq
Currently processing:  /content/FAY33695_pass_barcode02.fastq 	 expecting output:  /content/FAY33695_pass_barcode02.pkl
/content
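
The cell above does not check whether PicklSeq actually produced the pickle file. A minimal sketch for inspecting the result follows. The internal structure of the .pkl file is not documented in this notebook, so the hypothetical helper `describe_pickle` only reports the top-level type and, where available, the length.

```python
import os
import pickle


def describe_pickle(path):
    """Load a pickle file and return (type name, length or None)."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Expected output file not found: {path}")
    with open(path, "rb") as handle:
        # Only unpickle files you created yourself; pickle can run arbitrary code.
        data = pickle.load(handle)
    length = len(data) if hasattr(data, "__len__") else None
    return type(data).__name__, length
```

Calling `describe_pickle(out_file)` after the PicklSeq cell raises `FileNotFoundError` if the run did not produce any output.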


## Section 2: Haplo-Reader

In this section, we perform haplotype calling on the reads that PicklSeq extracted from the fastq file. We first load the ML model and then compute a similarity score between each read and the reference sequences (only the 3D7, DD2, and 7G8 clones are considered). Each read is matched to the reference clone with the highest similarity score, subject to a threshold of 0.5: when no reference sequence achieves a score > 0.5, the read is assigned to the unknown group. This is done read by read until all read data in the pickle file has been processed. As the last step of the haplotype calling process, the frequency (proportion) of each clone is computed.
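
The assignment rule described above can be sketched in a few lines. This is an illustration only, not the Haplotype Caller's code: the function names are hypothetical, and the similarity scores are made-up numbers standing in for the model output.

```python
from collections import Counter

THRESHOLD = 0.5  # a read is "Unknown" unless its best score exceeds this


def assign_read(scores, threshold=THRESHOLD):
    """Return the clone with the highest score, or "Unknown" if none exceeds the threshold."""
    best_clone = max(scores, key=scores.get)
    return best_clone if scores[best_clone] > threshold else "Unknown"


def clone_proportions(per_read_scores, threshold=THRESHOLD):
    """Assign every read and return the fraction of reads per group."""
    counts = Counter(assign_read(scores, threshold) for scores in per_read_scores)
    total = sum(counts.values())
    return {clone: n / total for clone, n in counts.items()}


# Made-up similarity scores for four reads (illustrative, not real data).
reads = [
    {"3D7": 0.91, "DD2": 0.12, "7G8": 0.05},
    {"3D7": 0.20, "DD2": 0.85, "7G8": 0.10},
    {"3D7": 0.40, "DD2": 0.30, "7G8": 0.45},  # no score above 0.5
    {"3D7": 0.75, "DD2": 0.60, "7G8": 0.10},  # two above 0.5, highest wins
]
print(clone_proportions(reads))
# {'3D7': 0.5, 'DD2': 0.25, 'Unknown': 0.25}
```

Note that a score of exactly 0.5 does not pass the threshold in this sketch, matching the "score > 0.5" wording above.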

### Cloning the Repository

In [5]:
%cd /content/
!git clone https://github.com/VariantCalling/VariantCalling.git

/content
Cloning into 'VariantCalling'...
remote: Enumerating objects: 644, done.
remote: Counting objects: 100% (73/73), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 644 (delta 58), reused 55 (delta 52), pack-reused 571 (from 2)
Receiving objects: 100% (644/644), 156.57 MiB | 31.63 MiB/s, done.
Resolving deltas: 100% (318/318), done.
Updating files: 100% (160/160), done.


### Creating a Separate Python 3.10 Environment

The latest Colab update changed the default Python version to 3.11, so we install Python 3.10 alongside it.

In [9]:
# HaploCaller requires Python 3.10
!apt-get install python3.10
!apt-get update -y
!update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1
!update-alternatives --config python
# select python version
!apt install python3-pip
!apt install python3.10-distutils
!ln -s /usr/bin/python /usr/local/bin/python3
!python --version
######### Force Environment to use Keras < 3.0 #########
!/usr/bin/python3.10 -m pip install "keras<3.0.0" "tensorflow[and-cuda]<2.16" "tf-models-official<2.16" "matplotlib==3.10" mediapipe-model-maker

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
python3.10 is already the newest version (3.10.12-1~22.04.10).
0 upgraded, 0 newly installed, 0 to remove and 36 not upgraded.
Hit:1 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:4 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured f
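
Since the notebook kernel and the separately installed interpreter now differ, it can be worth confirming which version a given executable reports before running inference. The sketch below asks an interpreter for its version through `subprocess`; the helper name `interpreter_version` is hypothetical.

```python
import subprocess
import sys


def interpreter_version(python_path):
    """Return (major, minor) as reported by the interpreter at `python_path`."""
    result = subprocess.run(
        [
            python_path,
            "-c",
            "import sys; print(sys.version_info.major, sys.version_info.minor)",
        ],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the interpreter fails to run
    )
    major, minor = result.stdout.split()
    return int(major), int(minor)


# Version of the interpreter running this cell (the notebook kernel).
print(interpreter_version(sys.executable))
```

After the installation cell above, `interpreter_version("/usr/bin/python3.10")` should return `(3, 10)`.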

#### Running Inference on the Preprocessed Pickle File

In [10]:
%cd /content/VariantCalling
!/usr/bin/python3.10 haplo_caller_notebook_script.py

/content/VariantCalling
2025-06-22 20:48:50.179093: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-06-22 20:48:50.179140: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-06-22 20:48:50.180932: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-22 20:48:53.235235: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

Clone		Proportion
3D7:		97.62%
DD2:		0.97%
7G8:		0.69%
Unknown:	0.72%
/content/FAY33695_pass_barcode02.fastq
Time elapsed (s
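
Because the script runs in a separate interpreter, its results are only available as printed text. If the proportions are needed in the notebook, the printed table can be parsed back into a dictionary. The following is a minimal sketch that assumes the `Name:<whitespace>value%` layout shown in the output above; `parse_proportions` is a hypothetical helper.

```python
import re

# Matches lines such as "3D7:\t\t97.62%"; other lines (headers, logs) are skipped.
LINE_PATTERN = re.compile(r"^\s*([^:\s]+):\s+(\d+(?:\.\d+)?)%\s*$")


def parse_proportions(text):
    """Return {clone name: percentage} for every matching line in `text`."""
    proportions = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            proportions[match.group(1)] = float(match.group(2))
    return proportions


# The table printed by the cell above.
sample = (
    "Clone\t\tProportion\n"
    "3D7:\t\t97.62%\n"
    "DD2:\t\t0.97%\n"
    "7G8:\t\t0.69%\n"
    "Unknown:\t0.72%\n"
)
print(parse_proportions(sample))
# {'3D7': 97.62, 'DD2': 0.97, '7G8': 0.69, 'Unknown': 0.72}
```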

### This section is deprecated after the latest Google Colab Python 3.11 update

In [8]:
%cd /content/VariantCalling
import time
import importlib
import pickle

import numpy as np
import pandas as pd
import tensorflow as tf

import Comparator
import VariantCalling as vc

importlib.reload(Comparator)

# model_name, clone_file, clone_names, and time_start must be defined before this cell runs.
model_path = r"/content/VariantCalling/comparator_models/" + model_name
model = Comparator.load_comparator(model_path)
tracker, pred_list = Comparator.run_comparator_inference(model, out_file, clone_file)

# The last entry of tracker is reported as the unknown group.
total = sum(tracker)
print("\nClone\t\tProportion")
for i in range(len(tracker) - 1):
    print(clone_names[i] + ":\t\t" + str(round(tracker[i] * 100 / total, 2)) + "%")
print("Unknown:\t" + str(round(tracker[-1] * 100 / total, 2)) + "%")
print(in_file)
time_end = time.time()
print("Time elapsed (s):\t", time_end - time_start)

/content/VariantCalling


NameError: name 'model_name' is not defined

# Expected Results
## CRT

| ont_barcode | 3D7 + HB3 | DD2 |
| -------- | ------- | -------- |
| barcode01 | 0.99 | 0.01 |
| barcode02 | 1.00 | 0.00 |
| barcode03 | 0.05 | 0.95 |
| barcode04 | 0.11 | 0.89 |
| barcode05 | 0.10 | 0.90 |
| barcode06 | 0.39 | 0.61 |
| barcode07 | 0.46 | 0.54 |
| barcode08 | 0.40 | 0.60 |
| barcode09 | 0.45 | 0.55 |

## DHPS

| ont_barcode | 3D7 + HB3 | DD2 |
| -------- | ------- | -------- |
| barcode01 | 0.988 | 0.012 |
| barcode02 | 0.988 | 0.012 |
| barcode03 | 0.012 | 0.988 |
| barcode04 | 0.255 | 0.745 |
| barcode05 | 0.283 | 0.717 |
| barcode06 | 0.842 | 0.158 |
| barcode07 | 0.842 | 0.158 |
| barcode08 | 0.446 | 0.554 |
| barcode09 | 0.503 | 0.497 |

## DHFR

| ont_barcode | 3D7 | DD2 | HB3 |
| -------- | ------- | -------- | -------- |
| barcode01 | 0.985 | 0.003 | 0.012 |
| barcode02 | 0.011 | 0.004 | 0.986 |
| barcode03 | 0.015 | 0.923 | 0.062 |
| barcode04 | 0.251 | 0.704 | 0.045 |
| barcode05 | 0.248 | 0.698 | 0.054 |
| barcode06 | 0.81 | 0.168 | 0.022 |
| barcode07 | 0.819 | 0.16 | 0.021 |
| barcode08 | 0.337 | 0.526 | 0.137 |
| barcode09 | 0.359 | 0.508 | 0.133 |
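
To compare a run against these tables, one simple option is the largest absolute difference between observed and expected proportions. The sketch below uses the CRT row for barcode02 and the proportions printed by the inference cell earlier in this notebook. Several parts are assumptions rather than the authors' evaluation method: the helper `max_abs_deviation`, the tolerance of 0.05, and the mapping of the combined "3D7 + HB3" column onto the 3D7 output.

```python
def max_abs_deviation(observed, expected):
    """Largest absolute difference over the clones listed in `expected`."""
    return max(
        abs(observed.get(clone, 0.0) - value) for clone, value in expected.items()
    )


# CRT, barcode02. Expected: table row above ("3D7 + HB3" mapped to "3D7", an assumption).
expected = {"3D7": 1.00, "DD2": 0.00}
# Observed: percentages from the inference output, converted to fractions.
observed = {"3D7": 0.9762, "DD2": 0.0097, "7G8": 0.0069, "Unknown": 0.0072}

TOLERANCE = 0.05  # assumed acceptance threshold, not taken from the paper
deviation = max_abs_deviation(observed, expected)
status = "OK" if deviation <= TOLERANCE else "CHECK"
print(f"max deviation: {deviation:.4f} ({status})")
# max deviation: 0.0238 (OK)
```

Clones that appear only in the observed output (here 7G8 and Unknown) are ignored by this comparison, so a large unknown fraction would need a separate check.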