# Every Variant Sequencing with Oxford Nanopore Technologies

This notebook is run after sequencing. It either basecalls the raw pod5 files or starts directly from reads that have already been basecalled (fastq.gz).

## Workflow

### 1. Basecalling (Optional)

- The raw reads are stored in the main ONT data folder (e.g., /var/lib/minknow/data). Enter the experiment name as input.
- Sequences are basecalled with the model of choice. If enough computational power is available, we recommend the "sup" model.

### 2. Demultiplexing 
- Each read is assigned to a plate/well combination.
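As a rough illustration of what a plate/well assignment could look like, the sketch below maps a front barcode index to a well name and treats the back barcode index as the plate. The row-major layout (barcode 1 = A1, barcode 13 = B1), the plate convention, and the helper names `barcode_to_well` and `assign_read` are assumptions made for illustration; they are not taken from the demultiplexer itself.

```python
import string

ROWS = string.ascii_uppercase[:8]  # rows A-H of a 96-well plate
N_COLS = 12                        # columns 1-12


def barcode_to_well(front_index: int) -> str:
    """Map a 1-based front barcode index (1-96) to a well name such as 'A1'."""
    if not 1 <= front_index <= len(ROWS) * N_COLS:
        raise ValueError(f"front barcode index out of range: {front_index}")
    row, col = divmod(front_index - 1, N_COLS)
    return f"{ROWS[row]}{col + 1}"


def assign_read(front_index: int, back_index: int) -> tuple[int, str]:
    """Return (plate, well), assuming the back barcode identifies the plate."""
    return back_index, barcode_to_well(front_index)


print(assign_read(front_index=13, back_index=9))  # (9, 'B1')
```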

### 3. Variant Calling
- Minimap2 aligns the reads of each well to the template; the stacked alignments serve as the Multiple Sequence Alignment (MSA)
- A base frequency caller then calls the variants from this alignment
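For orientation, a Minimap2 alignment of nanopore reads against a template can be launched roughly as follows. This is a sketch only: the exact options used inside `minION.variantcaller` are not shown in this notebook, and the helper name `minimap2_command` and the file names are placeholders.

```python
from pathlib import Path


def minimap2_command(template_fasta: Path, reads_fastq: Path, out_sam: Path) -> list[str]:
    """Build a minimap2 call that aligns ONT reads to the template and writes SAM."""
    return [
        "minimap2",
        "-a",             # emit SAM instead of the default PAF
        "-x", "map-ont",  # preset for Oxford Nanopore reads
        "-o", str(out_sam),
        str(template_fasta),
        str(reads_fastq),
    ]


cmd = minimap2_command(Path("template.fasta"), Path("reads.fastq.gz"), Path("aligned.sam"))
print(" ".join(cmd))
# To execute it: subprocess.run(cmd, check=True), with minimap2 installed and on PATH
```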



### Packages 

In [15]:
# Import all packages

import sys
sys.path.append("/home/emre/github_repo/MinION")

from minION.util import IO_processor
from minION.basecaller import Basecaller

from minION.variantcaller import *

from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import subprocess
import importlib
importlib.reload(IO_processor)

<module 'minION.util.IO_processor' from '/home/emre/github_repo/MinION/minION/util/IO_processor.py'>

### Meta Data 

Provide the following arguments:

- Result Path: Path where the minION result folder will be created. All experiment results are stored in this folder.
- Experiment Name: The name assigned to the experiment when running the sequencer. Use the same name here so that the run can be identified.
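The folder layout can be pictured with a small `pathlib` sketch. The layout `<result_path>/minION_results/<experiment_name>_<model>` is inferred from the paths in the result table and from how the experiment name is later suffixed with the model type; `make_result_folder` is a hypothetical stand-in, not the implementation of `IO_processor.create_folder`.

```python
import tempfile
from pathlib import Path


def make_result_folder(result_path: Path, experiment_name: str, model: str) -> Path:
    """Create <result_path>/minION_results/<experiment_name>_<model> and return it."""
    folder = Path(result_path) / "minION_results" / f"{experiment_name}_{model}"
    folder.mkdir(parents=True, exist_ok=True)  # safe to call again on reruns
    return folder


# Demonstrate in a throwaway directory rather than the real home folder
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = make_result_folder(Path(tmp), "my-experiment", "sup")
    print(demo_folder.name)  # my-experiment_sup
```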


In [None]:
result_path = Path("/home/emre/")
experiment_name = "20240206-MinION-KingBob10"
basecall_model_type = "sup"
result_folder = IO_processor.create_folder( experiment_name,
                                            basecall_model_type, 
                                            target_path=result_path)




# Create Barcode fasta file 
barcode_path = "../minION/barcoding/minion_barcodes.fasta" # Path to standard barcode file
front_prefix = "NB"
back_prefix = "RB"
bp = IO_processor.BarcodeProcessor(barcode_path, front_prefix, back_prefix)
barcode_path = result_folder / "minion_barcodes_filtered.fasta"

# Barcode indexes
front_min = 1
front_max = 96
back_min = 1
back_max = 12

# Expected fragment sizes
min_size = 800
max_size = 5000

bp.filter_barcodes(barcode_path, (front_min,front_max), (back_min,back_max))


file_to_experiment= f"/var/lib/minknow/data/{experiment_name}"
template_fasta = "/home/longy/zq-p411.fasta"

# Basecalling
basecall_folder = result_folder / "basecalled"
basecall_folder.mkdir(parents=True, exist_ok=True)
experiment_folder = IO_processor.find_experiment_folder(experiment_name) # Folder where pod5 files are located

# Demultiplexing
experiment_name = experiment_name + "_" + basecall_model_type
result_folder_path = IO_processor.find_folder(result_path, experiment_name)


In [None]:
# Add conditions to avoid running the script accidentally
skip_basecalling = True
skip_demultiplex = False
skip_variant_calling = False

### Step 1 (Optional): Basecall reads

- Basecalling can usually be done during sequencing, provided a GPU is available
- Otherwise, basecall afterwards with the cell below
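One way to decide whether basecalling is still needed is to check for existing fastq.gz files below the experiment folder. The sketch below is an assumption about how such a check could look; `find_basecalled_reads` and `needs_basecalling` are hypothetical helpers and not part of `IO_processor`.

```python
import tempfile
from pathlib import Path


def find_basecalled_reads(experiment_folder: Path) -> list[Path]:
    """Return all fastq.gz files found anywhere below the experiment folder."""
    return sorted(Path(experiment_folder).rglob("*.fastq.gz"))


def needs_basecalling(experiment_folder: Path) -> bool:
    """Basecalling is only needed when no basecalled reads exist yet."""
    return not find_basecalled_reads(experiment_folder)


# Demonstrate on a throwaway folder that mimics a run directory
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp)
    print(needs_basecalling(run))  # True: nothing has been basecalled yet
    (run / "fastq_pass").mkdir()
    (run / "fastq_pass" / "reads_0.fastq.gz").touch()
    print(needs_basecalling(run))  # False: basecalled reads are present
```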

In [None]:
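if not skip_basecalling:
    # Locate the raw pod5 files of this experiment
    pod5_files = IO_processor.find_folder(experiment_folder, "pod5")
    # Positional arguments must precede keyword arguments; the model is
    # assumed to be the first parameter of Basecaller
    bc = Basecaller(basecall_model_type, pod5_files, basecall_folder, fastq=True)
    bc.run_basecaller()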


In [None]:
# Find fastq files
file_to_fastq = IO_processor.find_folder(experiment_folder, "fastq_pass")
print(file_to_fastq)

### Step 2: Demultiplex with SW
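- The `demultiplex` binary matches the front and back barcodes of each read with SW and takes the filtered barcode file and the expected fragment size range (`min_size`, `max_size`) as input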
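Assuming that SW stands for Smith-Waterman local alignment, the idea behind barcode matching can be sketched in plain Python. This is not the compiled demultiplexer: the scoring values, the barcode sequences, and the helper names are made up for illustration.

```python
def smith_waterman_score(query: str, target: str,
                         match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between query and target (linear gap penalty)."""
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, t in enumerate(target, start=1):
            score = max(
                0,                                              # start a new local alignment
                prev[j - 1] + (match if q == t else mismatch),  # match or mismatch
                prev[j] + gap,                                  # query base against a gap
                curr[j - 1] + gap,                              # target base against a gap
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def best_barcode(read: str, barcodes: dict[str, str]) -> tuple[str, int]:
    """Return the name and score of the barcode that aligns best to the read."""
    scores = {name: smith_waterman_score(seq, read) for name, seq in barcodes.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


barcodes = {"NB01": "ACGTACGT", "NB02": "TTGGCCAA"}  # made-up sequences
print(best_barcode("GGGACGTACGTGGG", barcodes))  # ('NB01', 16)
```

A real demultiplexer would typically also apply a minimum score cutoff so that reads without a convincing barcode match are discarded rather than assigned to the closest well.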

In [None]:
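if not skip_demultiplex:
    path_to_code = "/home/emre/github_repo/MinION/source/source/demultiplex"
    # Pass the arguments as a list so that paths containing spaces need no shell quoting
    command = [path_to_code, "-f", str(file_to_fastq), "-d", str(result_folder),
               "-b", str(barcode_path), "-w", "100", "-r", "100",
               "-m", str(min_size), "-x", str(max_size)]
    subprocess.run(command, check=True)  # check=True raises if the demultiplexer fails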

In [None]:
demultiplex_folder = result_folder 
print(demultiplex_folder)

### Step 3: Call Variants with Pileup Analysis

- Call variants using a minimum base frequency and a minimum read depth (the cell below passes `threshold=0.2` and `min_depth=5`)
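The effect of the frequency and depth thresholds can be shown with a minimal frequency-based caller. This sketch assumes that the reads are already aligned to the reference and padded with `-` for deletions, and that a position is called when its most common base differs from the reference. It is not the `VariantCaller` implementation, and `call_variants` is a hypothetical name.

```python
from collections import Counter


def call_variants(reference: str, aligned_reads: list[str],
                  threshold: float = 0.2, min_depth: int = 5) -> str:
    """Return substitutions such as 'C2G', joined by '_' (positions are 1-based)."""
    variants = []
    for pos, ref_base in enumerate(reference):
        # Pileup column: every base observed at this position, ignoring gaps
        column = [read[pos] for read in aligned_reads
                  if pos < len(read) and read[pos] != "-"]
        depth = len(column)
        if depth < min_depth:
            continue  # too few reads to trust this position
        base, count = Counter(column).most_common(1)[0]
        if base != ref_base and count / depth >= threshold:
            variants.append(f"{ref_base}{pos + 1}{base}")
    return "_".join(variants)


reads = ["AGGT"] * 8 + ["ACGT"] * 2
print(call_variants("ACGT", reads))  # C2G
```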

Read Summary file (Optional):


In [None]:

demultiplex_folder_name = result_folder

In [16]:
if not skip_variant_calling:
    vc = VariantCaller(experiment_folder, 
                   template_fasta, 
                   demultiplex_folder_name=demultiplex_folder_name, 
                   padding_start=0, 
                   padding_end=0)
    
    variant_df = vc.get_variant_df(qualities=True, 
                                threshold=0.2,
                                min_depth=5)
    seq_gen = IO_processor.SequenceGenerator(variant_df, template_fasta)
    variant_df = seq_gen.get_sequences()
    #TODO: Save the variant_df to a file after running. Currently it is not saved.

0it [00:00, ?it/s]

unsupported operand type(s) for /: 'float' and 'str'


68it [00:39,  1.61it/s]

'NoneType' object is not subscriptable


71it [00:40,  1.62it/s]

unsupported operand type(s) for /: 'float' and 'str'


93it [00:52,  1.96it/s]

'NoneType' object is not subscriptable


96it [00:54,  1.63it/s]

unsupported operand type(s) for /: 'float' and 'str'


133it [01:19,  1.41it/s]

'NoneType' object is not subscriptable


183it [01:54,  1.55it/s]

'NoneType' object is not subscriptable


187it [01:56,  2.33it/s]

unsupported operand type(s) for /: 'float' and 'str'


189it [01:57,  2.06it/s]

'NoneType' object is not subscriptable


194it [01:58,  3.56it/s]

Too many positions: 28, Skipping...


226it [02:22,  1.46it/s]

'NoneType' object is not subscriptable


243it [02:33,  1.25it/s]

unsupported operand type(s) for /: 'float' and 'str'


272it [02:51,  1.15it/s]

'NoneType' object is not subscriptable


300it [03:02,  6.84it/s]

'[1184, 1185, 1187, 1188] not in index'
'[488, 489, 491, 492] not in index'
'[1915, 1916, 1918, 1919] not in index'
'[471, 472, 474, 475] not in index'
'[1831, 1832, 1834, 1835] not in index'
'[228, 229, 231, 232] not in index'
'[1099, 1100, 1102, 1103] not in index'
'[389, 390, 392, 393] not in index'
'[1339, 1340, 1342, 1343] not in index'
'[1395, 1396, 1398, 1399] not in index'
Too many positions: 70, Skipping...
'[436, 437, 439, 440] not in index'
Too many positions: 26, Skipping...
Too many positions: 18, Skipping...
Too many positions: 87, Skipping...
'[1158, 1159, 1161, 1162] not in index'
Too many positions: 21, Skipping...


317it [03:02, 18.15it/s]

Too many positions: 159, Skipping...
Too many positions: 17, Skipping...
'[1096, 1097, 1099, 1100] not in index'
'[782, 783, 785, 786] not in index'
Too many positions: 26, Skipping...
Too many positions: 14, Skipping...
'[986, 987, 989, 990] not in index'
Too many positions: 12, Skipping...
'[100, 101, 103, 104] not in index'
Too many positions: 68, Skipping...
'[419, 420, 422, 423] not in index'
'[934, 935, 937, 938] not in index'
Too many positions: 112, Skipping...
'[1953, 1954, 1956, 1957] not in index'
'[345, 346, 348, 349] not in index'
'[546, 547, 549, 550] not in index'
Too many positions: 47, Skipping...
Too many positions: 31, Skipping...
Too many positions: 44, Skipping...


332it [03:02, 27.84it/s]

Too many positions: 69, Skipping...
Too many positions: 74, Skipping...
'[322, 323, 325, 326] not in index'
'[223, 224, 226, 227] not in index'
'[1395, 1396, 1398, 1399] not in index'
'[1605, 1606, 1608, 1609] not in index'
Too many positions: 46, Skipping...
Too many positions: 21, Skipping...
'[2007, 2008, 2010, 2011] not in index'


349it [03:02, 42.93it/s]

Too many positions: 16, Skipping...
Too many positions: 19, Skipping...
Too many positions: 11, Skipping...
'[34, 35, 37, 38] not in index'
Too many positions: 46, Skipping...
'[790, 791, 793, 794] not in index'
Too many positions: 15, Skipping...
'[1532, 1533, 1535, 1536] not in index'
'[766, 767, 770] not in index'
Too many positions: 89, Skipping...
Too many positions: 31, Skipping...
'[534, 535, 537, 538] not in index'
'[1507, 1508, 1510, 1511] not in index'
Too many positions: 13, Skipping...
Too many positions: 87, Skipping...
Too many positions: 14, Skipping...
'[1051, 1052, 1054, 1055] not in index'
'[148, 149, 151, 152] not in index'
Too many positions: 23, Skipping...


364it [03:03, 46.75it/s]

'[1164, 1165, 1167, 1168] not in index'
'[2015, 2016, 2018, 2019] not in index'
Too many positions: 80, Skipping...
'[421, 422, 424, 425] not in index'
'[577, 578, 580, 581] not in index'
Too many positions: 35, Skipping...
'[1481, 1482, 1484, 1485] not in index'
Too many positions: 31, Skipping...
'[585, 586, 588, 589] not in index'


372it [03:03, 45.34it/s]

Too many positions: 73, Skipping...
'[591, 592, 594, 595] not in index'
Too many positions: 34, Skipping...
Too many positions: 28, Skipping...
'[290, 291, 293, 294] not in index'
'[1870, 1871, 1873, 1874] not in index'
Too many positions: 30, Skipping...
'[1486, 1487, 1489, 1490] not in index'


384it [03:03,  2.09it/s]

Too many positions: 40, Skipping...
'[1587, 1588, 1590, 1591] not in index'
'[1192, 1193, 1195, 1196] not in index'
'[140, 141, 143, 144] not in index'
'[21, 22, 24, 25] not in index'
'[161, 162, 164, 165] not in index'
Too many positions: 14, Skipping...
'[955, 956, 958, 959] not in index'
'[500, 501, 503, 504] not in index'
Too many positions: 35, Skipping...





In [None]:
#20 - 30
variant_df.to_csv(result_folder / "variant_df.csv", index=False)  

In [None]:
variant_df.iloc[1,2]

In [17]:
variant_df

| | Plate | Well | Path | Alignment_count | Variant | Probability | Sequence |
|---|---|---|---|---|---|---|---|
| 0 | 9 | A1 | | 0 | | | |
| 1 | 9 | A2 | /home/emre/minION_results/20240206-MinION-King... | 37 | C541G_G543T | 0.972916 | ACCGTTATTAAACACAGATAAACCGGTTCAAGCTTTGATGAAAATT... |
| 2 | 9 | A3 | /home/emre/minION_results/20240206-MinION-King... | 49 | T542G_G543T | 0.959444 | ACCGTTATTAAACACAGATAAACCGGTTCAAGCTTTGATGAAAATT... |
| 3 | 9 | A4 | /home/emre/minION_results/20240206-MinION-King... | 29 | C541T_T542G | 0.735540 | ACCGTTATTAAACACAGATAAACCGGTTCAAGCTTTGATGAAAATT... |
| 4 | 9 | A5 | /home/emre/minION_results/20240206-MinION-King... | 65 | C541G_T542G_G543T | 1.000000 | ACCGTTATTAAACACAGATAAACCGGTTCAAGCTTTGATGAAAATT... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 379 | 12 | H8 | /home/emre/minION_results/20240206-MinION-King... | 118 | | | |
| 380 | 12 | H9 | /home/emre/minION_results/20240206-MinION-King... | 267 | | | |
| 381 | 12 | H10 | /home/emre/minION_results/20240206-MinION-King... | 249 | | | |
| 382 | 12 | H11 | /home/emre/minION_results/20240206-MinION-King... | 166 | | | |
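Entries in the `Variant` column follow a reference base, position, new base pattern joined by underscores (for example `C541G_G543T`). As a sketch of how such a string relates to the `Sequence` column, the code below applies the substitutions to a template. It assumes 1-based positions relative to the template, and `apply_variants` is a hypothetical helper rather than the `SequenceGenerator` implementation.

```python
import re

VARIANT_PATTERN = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def apply_variants(template: str, variant: str) -> str:
    """Apply substitutions such as 'C541G_G543T' to the template sequence."""
    if not variant:
        return template  # no variant called: the sequence equals the template
    sequence = list(template)
    for token in variant.split("_"):
        match = VARIANT_PATTERN.match(token)
        if match is None:
            raise ValueError(f"unrecognised variant token: {token!r}")
        ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
        if sequence[pos - 1] != ref:  # positions are assumed to be 1-based
            raise ValueError(f"template has {sequence[pos - 1]} at {pos}, expected {ref}")
        sequence[pos - 1] = alt
    return "".join(sequence)


print(apply_variants("ACGT", "C2G_T4A"))  # AGGA
```

Checking the reference base before substituting catches off-by-one errors and mismatched templates early, which matters when padding or trimmed primers shift the coordinates.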
