## 1. Load Required Libraries

## Ensure the required dependencies are installed

1. You need to have samtools installed
2. You also need minimap2 installed
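
Both tools can also be checked from Python before running the rest of the notebook. The snippet below is a minimal sketch that only looks for the executables on `PATH`; the helper name `find_missing_tools` is made up for illustration and is not part of LevSeq.

```python
import shutil


def find_missing_tools(tools=("samtools", "minimap2")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("samtools and minimap2 are both available.")
```

This checks only that the executables can be found, not which versions are installed.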


## Download the test data or use your own

We recommend first trying this with the data that we make available via Zenodo (https://zenodo.org/records/13694463). Download the fastq.zip file along with its reference file, extract the archive, and put both files in a data folder so that its contents are:

```
20240421-YL-ParLQ-ep1.csv
20240422-YL-ParLQ-ep1.fastq
```
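
A quick way to confirm the download is to check that both files are in place. This is a small sketch: the file names come from the listing above, while the folder name `data` and the helper `missing_files` are assumptions for illustration, so adjust the path to wherever you saved the files.

```python
from pathlib import Path

# File names taken from the listing above.
EXPECTED_FILES = ("20240421-YL-ParLQ-ep1.csv", "20240422-YL-ParLQ-ep1.fastq")


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `data_dir`."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


# "data" is an assumed folder name; an empty list means both files were found.
print(missing_files("data"))
```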


In [1]:
! samtools  # This will tell you whether you have samtools installed 


Program: samtools (Tools for alignments in the SAM format)
Version: 1.15.1 (using htslib 1.15.1)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     fqidx          index/extract FASTQ
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
     markdup        mark duplicates
     ampliconclip   clip oligos from the end of reads

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     consensus      produce a consensus Pileup/FASTA/FASTQ
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          spli

In [2]:
! minimap2

Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]
Options:
  Indexing:
    -H           use homopolymer-compressed k-mer (preferrable for PacBio)
    -k INT       k-mer size (no larger than 28) [15]
    -w INT       minimizer window size [10]
    -I NUM       split index for every ~NUM input bases [4G]
    -d FILE      dump index to FILE []
  Mapping:
    -f FLOAT     filter out top FLOAT fraction of repetitive minimizers [0.0002]
    -g NUM       stop chain enlongation if there are no minimizers in INT-bp [5000]
    -G NUM       max intron length (effective with -xsplice; changing -r) [200k]
    -F NUM       max fragment length (effective with -xsr or in the fragment mode) [800]
    -r NUM[,NUM] chaining/alignment bandwidth and long-join bandwidth [500,20000]
    -n INT       minimal number of minimizers on a chain [3]
    -m INT       minimal chaining score (matching bases minus log gap penalty) [40]
    -X           skip self and dual mappings (for the all-vs-all m

In [1]:
# Load necessary libraries
import os
import sys
import matplotlib.pyplot as plt
from pathlib import Path
import numpy as np
import pandas as pd
from importlib import resources
import subprocess
from Bio import SeqIO
import tqdm
import re
import gzip
import shutil

# Add the path to the levseq directory to the system path
sys.path.append('../levseq')

# Import custom functions from the provided script
from run_levseq import *
result_folder = os.getcwd()

## 2. Define Run Location
We'll specify the location of the sequencing run data. The paths below are given relative to this notebook's working directory.


In [2]:
os.getcwd()

'/Users/arianemora/Documents/code/MinION/example'

In [3]:
# Define the path to the run data
# This is where you downloaded that data to (i.e. where you put the concatenated fastq files)
run_location = '../zenodo_download/'

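# Load the reference file; the columns used below are barcode_plate, name and refseq
ref_df = pd.read_csv(os.path.join(run_location, '20240421-YL-ParLQ-ep1.csv'))
# Output CSV for the combined variant calls, and the name of this run
variant_csv_path = 'OutputExample.csv'
name = 'Test-ep1'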

## 3. Demultiplexing and variant calling
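Demultiplexing is the process of separating out individual samples from a multiplexed sequencing run. For each row of the reference file, we'll use the `demux_fastq` function from the custom script to demultiplex the reads, then `call_variant` to call variants against that row's reference sequence.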
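
To make the idea concrete, here is a toy illustration of demultiplexing. It is not how `demux_fastq` works internally: it matches barcodes exactly at the start of each read, and the barcode sequences and the function `toy_demux` are invented for this example.

```python
from collections import defaultdict


def toy_demux(reads, barcodes):
    """Group reads by the barcode they start with (exact match only)."""
    bins = defaultdict(list)
    for read in reads:
        for label, barcode in barcodes.items():
            if read.startswith(barcode):
                # Store the read with its barcode trimmed off.
                bins[label].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read.
            bins["unassigned"].append(read)
    return dict(bins)


barcodes = {"NB01": "ACGT", "NB02": "TTGA"}  # made-up barcode sequences
reads = ["ACGTGGGCCC", "TTGAAAATTT", "CCCCCCCCCC", "ACGTTTTT"]
result = toy_demux(reads, barcodes)
print(result)
```

Real nanopore reads contain sequencing errors, so practical demultiplexers align barcodes with some tolerance rather than requiring an exact prefix match.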


In [4]:
# Create empty variant df
result_folder = os.path.join(result_folder, name)
variant_df = pd.DataFrame(columns=["barcode_plate", "name", "refseq", "variant"])

for i, row in ref_df.iterrows():
    barcode_plate = row["barcode_plate"]
    name = row["name"]
    refseq = row["refseq"].upper()

    # Create a subfolder for the current iteration using the name value
    name_folder = os.path.join(result_folder, name)
    os.makedirs(name_folder, exist_ok=True)

    # Write the refseq to a temporary fasta file
    temp_fasta_path = os.path.join(name_folder, f"temp_{name}.fasta")
    with open(temp_fasta_path, "w") as f:
        f.write(f">{name}\n{refseq}\n")
    # Create a filtered barcode file for this plate:
    # front barcodes f_min..f_max plus the reverse barcode with index rbc
    f_min = 1
    f_max = 96
    rbc = i + 1  # reverse barcode index, taken from the row position in ref_df
    front_prefix = "NB"
    back_prefix = "RB"
    barcode_path = "../levseq/barcoding/minion_barcodes.fasta"
    barcode_path_filter = os.path.join(name_folder, "minion_barcodes_filtered.fasta")
    filter_barcodes(
        barcode_path,
        barcode_path_filter,
        (f_min, f_max),
        rbc,
        front_prefix,
        back_prefix,
    )
    
    # Perform demultiplexing
    demux_fastq(run_location, name_folder, barcode_path_filter)
    
    variant_result = call_variant(f"{name}", name_folder, temp_fasta_path, barcode_path_filter)
    variant_result["barcode_plate"] = barcode_plate
    variant_result["name"] = name
    variant_result["refseq"] = refseq
    variant_df = pd.concat([variant_df, variant_result])
variant_df.to_csv(variant_csv_path, index=False)

dyld[18242]: weak-def symbol not found '__ZNKRSt7__cxx1115basic_stringbufIcSt11char_traitsIcESaIcEE3strEv'


CalledProcessError: Command '/Users/arianemora/miniconda3/envs/minion/lib/python3.8/site-packages/levseq/barcoding/demultiplex-arm64 -f ../zenodo_download/ -d /Users/arianemora/Documents/code/MinION/example/Test-ep1/300-1 -b /Users/arianemora/Documents/code/MinION/example/Test-ep1/300-1/minion_barcodes_filtered.fasta -w 100 -r 100 -m 150 -x 10000' died with <Signals.SIGABRT: 6>.

## 4. Create variant and visualization csv files

In [13]:
variant_df

Unnamed: 0,barcode_plate,name,refseq,variant,index,Plate,Well,Barcode,ID,P value,Mixed Well,Variant,Average mutation frequency,Alignment Count,P adj. value
0,1,300-1,ATGGCGGTTCCCGGCTACGATTTTGGCAAAGTCCCGGATGCCCCAA...,,0.0,300-1,A1,RB01_NB01,300-1_A1,1.000000e+00,,,,,1.000000e+00
1,1,300-1,ATGGCGGTTCCCGGCTACGATTTTGGCAAAGTCCCGGATGCCCCAA...,,1.0,300-1,A2,RB01_NB02,300-1_A2,1.000000e+00,,,,0.0,1.000000e+00
2,1,300-1,ATGGCGGTTCCCGGCTACGATTTTGGCAAAGTCCCGGATGCCCCAA...,,2.0,300-1,A3,RB01_NB03,300-1_A3,1.000000e+00,True,T62G_G273A_A284G,0.912500,80.0,1.000000e+00
3,1,300-1,ATGGCGGTTCCCGGCTACGATTTTGGCAAAGTCCCGGATGCCCCAA...,,3.0,300-1,A4,RB01_NB04,300-1_A4,4.277116e-86,True,T163C,0.705263,95.0,4.106032e-84
4,1,300-1,ATGGCGGTTCCCGGCTACGATTTTGGCAAAGTCCCGGATGCCCCAA...,,4.0,300-1,A5,RB01_NB05,300-1_A5,1.000000e+00,True,T124C_G269T_G369T_G519C,0.542500,200.0,1.000000e+00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,4,500-2,ATGGCGGTTCCCGGCTACGATTTTGGCAAAGTCCCGGATGCCCCAA...,,91.0,500-2,H8,RB04_NB92,500-2_H8,1.000000e+00,,,,,1.000000e+00
92,4,500-2,ATGGCGGTTCCCGGCTACGATTTTGGCAAAGTCCCGGATGCCCCAA...,,92.0,500-2,H9,RB04_NB93,500-2_H9,1.740075e-43,True,C378T_T422C_T446DEL,0.822222,15.0,1.670472e-41
93,4,500-2,ATGGCGGTTCCCGGCTACGATTTTGGCAAAGTCCCGGATGCCCCAA...,,93.0,500-2,H10,RB04_NB94,500-2_H10,5.367269e-117,False,G246DEL_T417C,0.937500,40.0,5.152578e-115
94,4,500-2,ATGGCGGTTCCCGGCTACGATTTTGGCAAAGTCCCGGATGCCCCAA...,,94.0,500-2,H11,RB04_NB95,500-2_H11,3.079761e-35,False,T281C,1.000000,23.0,2.956571e-33
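
The `Variant` column packs one or more mutations into a single underscore-separated string. The sketch below splits such a string into its parts. It assumes only the two forms visible in the table above (substitutions such as `T62G` and deletions such as `T446DEL`); the helper `parse_variant` is illustrative and not part of LevSeq, and other labels would raise an error.

```python
import re

# One mutation: reference base, 1+ digit position, then a new base or "DEL".
_MUTATION = re.compile(r"^([ACGT])(\d+)([ACGT]|DEL)$")


def parse_variant(variant):
    """Split e.g. 'C378T_T446DEL' into (ref_base, position, alt) tuples."""
    mutations = []
    for token in variant.split("_"):
        match = _MUTATION.match(token)
        if match is None:
            raise ValueError(f"Unrecognised mutation: {token!r}")
        ref, pos, alt = match.groups()
        mutations.append((ref, int(pos), alt))
    return mutations


print(parse_variant("C378T_T422C_T446DEL"))
# [('C', 378, 'T'), ('T', 422, 'C'), ('T', 446, 'DEL')]
```

Wells with no called variant have an empty `Variant` entry, so filter those out before parsing.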


## 5. Visualization
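Finally, we'll visualize the results as plate maps. `create_df_v` turns the variant table into a variants DataFrame and a visualization DataFrame, and `generate_platemaps` builds the plate layout from the latter, which makes it easy to see which wells carry which variants.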
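
As background for reading a plate map, the sketch below shows how well IDs such as `A3` or `H10` map onto the 8 x 12 grid of a 96-well plate. It is a plain-Python illustration, not the implementation behind `generate_platemaps`; the helper names are hypothetical and the alignment counts are copied from a few rows of the table above.

```python
ROWS = "ABCDEFGH"  # 8 plate rows
N_COLS = 12        # 12 plate columns


def well_to_position(well):
    """'H8' -> (7, 7): zero-based (row, column) on a 96-well plate."""
    return ROWS.index(well[0]), int(well[1:]) - 1


def plate_grid(values_by_well):
    """Arrange per-well values on an 8 x 12 grid; empty wells stay None."""
    grid = [[None] * N_COLS for _ in ROWS]
    for well, value in values_by_well.items():
        row, col = well_to_position(well)
        grid[row][col] = value
    return grid


# Alignment counts for a few wells, taken from the table above.
counts = {"A3": 80.0, "A4": 95.0, "A5": 200.0, "H10": 40.0}
grid = plate_grid(counts)
print(grid[0][:5])  # first five wells of row A
# [None, None, 80.0, 95.0, 200.0]
```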

In [16]:
df_variants, df_vis = create_df_v(variant_df)
layout = generate_platemaps(
    max_combo_data=df_vis,
    result_folder=result_folder,
)
layout


## 6. Upload to LevSeq website

Hold tight, this will be deployed in 1 day :D 
