In [None]:
#!/usr/bin/env python


## NanoDiP all-in-one Jupyter Notebook
*J. Hench, S. Frank, and C. Hultschig, Neuropathology, IfP Basel, 2021*

This software is provided free of charge and without warranty; by using it you
agree to do so at your own risk. The authors shall not be held liable for any
damage caused by this software. We have assembled and tested it to the best of
our knowledge.

The purpose of NanoDiP (Nanopore Digital Pathology) is to compare low-coverage
Nanopore sequencing data from natively extracted DNA sequencing runs against
a flexibly adaptable collection of 450K/850K Illumina Infinium Methylation
array data. These data have to be preprocessed into binary beta value files;
this operation is performed in R (using minfi to read raw array data) and
outputs binary float files (one per dataset). These beta value files (e.g.,
204949770141_R03C01_betas_filtered.bin) are named according to the array ID
(Sentrix ID) followed by the suffix. A collection of betas_filtered.bin files
can be provided in a static manner and XLSX (Microsoft Excel) tables can be
used to select a subset thereof alongside a user-defined annotation. The
corresponding datasets will be loaded into memory and then serve as the
reference cohort to which the Nanopore data are compared by dimension reduction
(UMAP). This comparison is optimized for speed and low resource consumption so
that it can run on the computer that operates the sequencer. The sequencing run
is initiated through the MinKNOW API by this application. Basecalling and
methylation calling occur as background tasks outside this Jupyter Notebook.
User interaction occurs through a web interface based on CherryPy, which has
been tested on the Chromium web browser. It is advisable to run it locally,
because there are no measures to secure the generated website.
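
As a rough illustration of the beta value format described above, the following sketch writes a few made-up beta values as raw 64-bit floats, reads them back with NumPy, and binarizes them at a cutoff. The file name, the values, and the helper `binarize_betas` are hypothetical. The cutoff of 0.35 mirrors `METHYLATION_CUTOFF` from the configuration, and the float64 layout is an assumption based on how this notebook reads the files.

```python
import os
import tempfile

import numpy as np


def binarize_betas(path, cutoff=0.35):
    """Read raw float64 beta values and mark values >= cutoff as methylated."""
    betas = np.fromfile(path, dtype=np.float64)
    return betas >= cutoff


# Hypothetical beta values for five CpG probes.
example_betas = np.array([0.02, 0.34, 0.35, 0.80, 0.99], dtype=np.float64)

with tempfile.TemporaryDirectory() as tmp_dir:
    # The file name only imitates the "<Sentrix ID>_betas_filtered.bin" pattern.
    example_path = os.path.join(tmp_dir, "example_betas_filtered.bin")
    example_betas.tofile(example_path)
    methylated = binarize_betas(example_path)

print(methylated.tolist())
```

Storing one boolean per probe instead of one float keeps the reference cohort small enough to hold in memory on the sequencing computer.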

To use this application properly, please make sure you are somewhat familiar
with Jupyter Notebook. To run the software, press the button called
*restart the kernel, re-run the whole notebook (with dialog)* and confirm
execution. Then, in the Chromium browser, navigate to http://localhost:8080/ and
preferably bookmark this location for convenience. In case of errors, you can
simply click the same button *restart the kernel, re-run the whole notebook
(with dialog)* again.
___

### Technical Details

* Tested with Python 3.7.5; Python 3.8.8 fails to load minknow_api in Jupyter
  Notebook.
* Verified to run on Ubuntu 18.04/JetPack on ARMv8 and x86_64 CPUs; not
  tested on Windows and macOS. The latter two platforms are unsupported, and
  we do not intend to support them.
* **CAUTION**: Requires a *patched* version of the MinKNOW API, file
  `[VENV]/lib/python3.7/site-packages/minknow_api/tools/protocols.py`.
  Without the patch, the generated fast5 sequencing data will be unreadable
  with f5c or nanopolish (wrong compression algorithm, which is the default in
  the MinKNOW backend).


In [None]:
# Verify running Python version (should be 3.7.5) and adjust Jupyter Notebook.
import IPython
import os
# IPython.display is the public import path; IPython.core.display is deprecated.
from IPython.display import display, HTML

In [None]:
# Set display width to 100%.
display(HTML("<style>.container { width:100% !important; }</style>"))
os.system('python --version')


## Multithreading Options
Depending on the number of parallel threads/cores of the underlying hardware,
threading options for multithreaded modules need to be set as
environment-specific parameters. One way to do so is through the *os* module.


In [None]:
# Execution-wide multithreading options; set according to your hardware
# (suggested value on the Jetson AGX: "2"). These must be set before importing
# other modules that query these parameters.
import os
os.environ["NUMBA_NUM_THREADS"] = "2"
os.environ["OPENBLAS_NUM_THREADS"] = "2"
os.environ["MKL_NUM_THREADS"] = "2"


## Modules
This section imports the required modules that should have been installed via
pip. Other package managers have not been tested. To install packages, use the
setup script provided with this software or, alternatively, install them one
by one, ideally in a virtual python environment. Note that the MinKNOW API
requires manual patching after installation with pip.


In [None]:
from minknow_api.manager import Manager
from minknow_api.tools import protocols
from numba import jit
from plotly.io import write_json, from_json
from tqdm import tqdm
import argparse
import bisect
import cherrypy
import csv
import datetime
import fnmatch
import jinja2
import logging
import math
import minknow_api.device_pb2
import minknow_api.statistics_pb2
import multiprocessing as mp
import numpy as np
import openpyxl
import os
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import psutil
import pysam
import re
import shutil
import socket
import subprocess
import sys
import time
import xhtml2pdf.pisa


## Configuration
Below are system-specific parameters that may or may not require adaptation.
Many variable names are self-explanatory. The key difference between Nanopore
setups is between devices provided by ONT (MinIT, including the MinIT
distribution running on an NVIDIA Jetson developer kit such as the AGX Xavier,
and GridION) and the typical Ubuntu-based MinKNOW version on x86_64 computers.
The raw data are written into a `/data` directory on ONT-based devices, while
they are found in `/var/lib/minknow/data` on x86_64 installations. Make sure to
adapt your `DATA` accordingly. Furthermore, there are permission issues as well
as special folders and files in the MinKNOW data directory. These files and
folders should be excluded from analysis through `EXCLUDED_FROM_ANALYSIS` so
that only real run folders will be parsed. Finally, `NANODIP_OUTPUT` is the
place in which the background methylation and alignment process will place its
results by replicating the directory hierarchy of the MinKNOW data location.
It will not duplicate the raw data, and its results will be much smaller than
the raw run data. They can be placed anywhere in the file tree, including a
sub-folder inside the MinKNOW data path. In the latter case, make sure to
apply appropriate read/write permissions. Final reports and figures generated
by NanoDiP are written into `NANODIP_REPORTS`.
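
To make the exclusion logic more concrete, here is a minimal sketch of how run folders could be listed while skipping special folders. It is not the code used by NanoDiP: the helper `list_run_folders`, the folder names, and the assumption that exclusion patterns are matched as substrings are all illustrative.

```python
import os
import tempfile


def list_run_folders(data_dir, excluded, exclusion_patterns=()):
    """Return sorted sub-folders of data_dir that look like real runs."""
    runs = []
    for entry in sorted(os.listdir(data_dir)):
        # Only directories can be run folders.
        if not os.path.isdir(os.path.join(data_dir, entry)):
            continue
        # Skip special folders by exact name.
        if entry in excluded:
            continue
        # Skip folders whose name contains an exclusion pattern (assumption).
        if any(pattern in entry for pattern in exclusion_patterns):
            continue
        runs.append(entry)
    return runs


with tempfile.TemporaryDirectory() as data_dir:
    # Hypothetical content of a MinKNOW data directory.
    for folder in ["B2021_1234", "pings", "reads", "B2021_TestRun_01"]:
        os.mkdir(os.path.join(data_dir, folder))
    run_folders = list_run_folders(
        data_dir,
        excluded=["pings", "reads"],
        exclusion_patterns=["_TestRun_"],
    )

print(run_folders)
```

Keeping the exclusion list in the configuration rather than in the listing function means a new special folder only requires a one-line change.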


In [None]:
NANODIP_VERSION = 24

In [None]:
# Data directories
DATA = "/data"
NANODIP_OUTPUT = os.path.join(DATA, "nanodip_output")
NANODIP_REPORTS = os.path.join(DATA, "nanodip_reports")
REFERENCE_DATA = "/applications/reference_data"
BETA_VALUES = os.path.join(REFERENCE_DATA, "betaEPIC450Kmix_bin")
ANNOTATIONS = os.path.join(REFERENCE_DATA, "reference_annotations")
ANNOTATIONS_ABBREVIATIONS_BASEL = "/applications/reference_data/reference_annotations/mc_anno_ifp_basel.csv"

In [None]:
# https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations
ANNOTATIONS_ABBREVIATIONS_TCGA = "/applications/reference_data/reference_annotations/tcga_study_abbreviations.tsv"

In [None]:
# Reference data
ILUMINA_CG_MAP = os.path.join(REFERENCE_DATA, "minimap_data/hg19_HumanMethylation450_15017482_v1-2_cgmap.tsv")
REFERENCE_METHYLATION_DATA = os.path.join(REFERENCE_DATA, "EPIC450K")
REFERENCE_METHYLATION = os.path.join(REFERENCE_METHYLATION_DATA, "methylation.bin")
REFERENCE_CPG_SITES = os.path.join(REFERENCE_METHYLATION_DATA, "cpg_sites.csv")
REFERENCE_SPECIMENS = os.path.join(REFERENCE_METHYLATION_DATA, "specimens.csv")
REFERENCE_METHYLATION_SHAPE = os.path.join(REFERENCE_METHYLATION_DATA, "shape.csv")

In [None]:
# Genome reference data
CHROMOSOMES = os.path.join(REFERENCE_DATA, "hg19_cnv", "hg19_chromosomes.tsv")

In [None]:
# Human reference genome in fa/minimap2 mmi format.
REFERENCE_GENOME_FA = "/applications/reference_data/minimap_data/hg19.fa"
REFERENCE_GENOME_MMI = "/applications/reference_data/minimap_data/hg19_20201203.mmi"

In [None]:
# Barcode strings, currently kit SQK-RBK004.
BARCODE_NAMES = [
    "barcode01","barcode02","barcode03",
    "barcode04","barcode05","barcode06",
    "barcode07","barcode08","barcode09",
    "barcode10","barcode11","barcode12",
]

In [None]:
# HG19 Gene data
# https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.refGene.gtf.gz
GENES_RAW = os.path.join(REFERENCE_DATA, "hg19_cnv", "hg19.refGene.gtf")
GENES = os.path.join(REFERENCE_DATA, "hg19_cnv", "hg19_genes.csv")
RELEVANT_GENES = os.path.join(REFERENCE_DATA, "hg19_cnv", "relevant_genes.csv")

In [None]:
# Beta values above cutoff will be interpreted as methylated.
METHYLATION_CUTOFF = 0.35

In [None]:
# Number of basecalled bases until run termination occurs.
NEEDED_NUMBER_OF_BASES = 150_000_000

In [None]:
# URL prefix/suffix to load PDF with CNV plot for a given Sentrix ID.
CNV_URL_PREFIX = "http://s1665.rootserver.io/umapplot01/"
CNV_URL_SUFFIX = "_CNV_IFPBasel_annotations.pdf"

CNV_GRID = "/applications/tmp/grid.json" # TODO to /tmp/nanodip

In [None]:
# Number of reference cases to be shown in subplot including copy
# number profile links (not advisable >200, plotly will become really
# slow)
UMAP_PLOT_TOP_MATCHES = 100

PLOTLY_RENDER_MODE = "webgl"

ANALYSIS_EXCLUSION_PATTERNS = ["_TestRun_"]

In [None]:
# List of files and folders in DATA to be excluded from analysis.
EXCLUDED_FROM_ANALYSIS = [
    ".Trash-1000",
    "core-dump-db",
    "intermediate",
    "lost+found",
    "minimap_data",
    "nanodip_output",
    "nanodip_reports",
    "nanodip_tmp",
    "non-ont",
    "pings",
    "playback_raw_runs",
    "queued_reads",
    "raw_for_playback",
    "reads",
    "user_scripts",
]

In [None]:
# List of file name sections that identify past runs.
RESULT_ENDINGS = {
    "cnv_png": "_CNVplot.png",
    "ranking": "_NanoDiP_ranking.pdf",
    "report": "_NanoDiP_report.pdf",
    "umap_all": "_UMAP_all.html",
    "umap_top": "_UMAP_top.html",
}

ENDINGS = {
    **RESULT_ENDINGS,
    "aligned_reads": "_alignedreads.txt",
    "cnv_bins_json": "_CNV_binsplot.json",
    "cnv_html": "_CNVplot.html",
    "cnv_json": "_CNVplot.json",
    "cpg_cnt": "_cpgcount.txt",
    "genes": "_genes.csv",
    "methyl": "_methyl_overlap.npy",
    "reads_csv": "_reads.csv",
    "relevant_genes": "_relevant_genes.csv",
    "umap_all_json": "_UMAP_all.json",
    "umap_csv": "_UMAP.csv",
    "umap_top_json": "_UMAP_top.json",
}

DEBUG_MODE = True

In [None]:
# 0=low log verbosity, 1=high log verbosity (with timestamps, for benchmarking and debugging)
VERBOSITY = 0 # TODO replace logger

In [None]:
# Host and port on which the NanoDiP UI will be served
CHERRYPY_HOST = "localhost"
THIS_HOST = "localhost"
CHERRYPY_PORT = 8080

In [None]:
# The web browser favicon file for this application.
BROWSER_FAVICON = "/applications/nanodip/favicon.ico"

In [None]:
# The location where image files for the web application are stored.
IMAGES = "/applications/nanodip"

In [None]:
# Number of reads per file. 400 works well on the Jetson AGX. Higher numbers
# increase batch size and RAM usage, lower numbers use more I/O resources due
# to more frequent reloading of the alignment reference.
READS_PER_FILE = "400"

In [None]:
# Paths to binaries for methylation calling.
F5C = "/applications/f5c/f5c"
MINIMAP2 = "/applications/nanopolish/minimap2/minimap2"
SAMTOOLS = "/applications/samtools/samtools"
RSCRIPT = "/applications/R-4.0.3/bin/Rscript"

In [None]:
# TODO del: R script that reads CpGs into simplified text file (absolute path)
READ_CPG_RSCRIPT = "/applications/nanodip/readCpGs_mod02.R"

In [None]:
def extract_referenced_cpgs(sample_methylation,
                            output_overlap,
                            output_overlap_cnt):
    """Extract Illumina CpG sites including methylation status from sample.

    Args:
        sample_methylation: methylation file of sample
        output_overlap: file path of CpG overlap
        output_overlap_cnt: file path of CpG overlap count
    """
    reference_cpgs = pd.read_csv(
        ILUMINA_CG_MAP,
        delimiter="\t",
        names=["ilmnid","chromosome","strand","start"],
    )
    sample_cpgs = pd.read_csv(
        sample_methylation,
        delimiter="\t",
    )
    cpgs = pd.merge(sample_cpgs, reference_cpgs, on=["chromosome", "start"])
    # Extract singleton CpGs.
    cpgs = cpgs.loc[cpgs["num_cpgs_in_group"] == 1]
    cpgs = cpgs.loc[
       (~cpgs["chromosome"].isin(["chrX", "chrY"])) # TODO is this necessary?
       & (~cpgs["ilmnid"].duplicated())
    ]
    cpgs["is_methylated"] = 0
    cpgs.loc[cpgs["methylated_frequency"] > 0.5 ,"is_methylated"] = 1
    # Write overlap Data Frame
    cpgs[["ilmnid", "is_methylated"]].to_csv(
        output_overlap, header=False, index=False, sep="\t")
    # Write number of CpG's
    with open(output_overlap_cnt, "w") as f:
        f.write(f"{len(cpgs)}")

In [None]:
def render_template(template_name, **context):
    """Render a Jinja2 template from the "templates" directory."""
    loader = jinja2.FileSystemLoader("templates")
    template = jinja2.Environment(
        loader=loader).get_template(template_name)
    return template.render(context)

In [None]:
def convert_html_to_pdf(source_html, output_file):
    """Create PDF from html-string."""
    with open(output_file, "w+b") as f:
        pisa_status = xhtml2pdf.pisa.CreatePDF(source_html, dest=f)
    return pisa_status.err

In [None]:
def date_time_string_now():
    """Return current date and time as a string to create timestamps."""
    now = datetime.datetime.now()
    return now.strftime("%Y%m%d_%H%M%S")


    nanodip.data
    ------------
    
    Data containers for sample, reference-data and reference-genome/gene
    data.  


In [None]:
def binary_reference_data_exists():
    """Check if the binary form of the reference data was already created."""
    return (
        os.path.exists(REFERENCE_METHYLATION_DATA) and
        os.path.exists(REFERENCE_METHYLATION) and
        os.path.exists(REFERENCE_CPG_SITES) and
        os.path.exists(REFERENCE_SPECIMENS) and
        os.path.exists(REFERENCE_METHYLATION_SHAPE)
    )

In [None]:
def make_binary_reference_data(input_dir=BETA_VALUES,
                               output_dir=REFERENCE_METHYLATION_DATA,
                               cutoff=METHYLATION_CUTOFF):
    """Create binary methylation files from raw reference data.

    Args:
        input_dir: Directory of reference data as float array files.
        output_dir: Output directory containing the binary array file.
        cutoff: Empirical cutoff value for methylated
            (round to 1) and unmethylated (round to 0) CpGs.
    """
    print("Generating binary reference data. This takes 5-10 minutes.")
    # Also creates missing parent directories; no error if it already exists.
    os.makedirs(output_dir, exist_ok=True)

    specimens = [f for f in os.listdir(input_dir)
                 if f.endswith(".bin")]

    # Get shape parameters of output_data
    path0 = os.path.join(input_dir, specimens[0])
    # Beta values are raw binary floats, so open the file in binary mode.
    with open(path0, "rb") as f:
        beta_values_0 = np.fromfile(f, dtype=float)
    shape = (len(specimens), len(beta_values_0))

    methylation_data = np.empty(shape, dtype=bool)

    for i, specimen in enumerate(tqdm(specimens, desc="Reading reference")):
        specimen_path = os.path.join(input_dir, specimen)

        with open(specimen_path, "rb") as f:
            beta_values = np.fromfile(f, dtype=float)
            methylation_data[i] = np.digitize(
                beta_values,
                bins=[cutoff]
            ).astype(bool)

    # write methylation data as binary
    methylation_file = os.path.join(output_dir, "methylation.bin")
    methylation_data.tofile(methylation_file)

    # write shape parameters (number of specimens, number of CpG sites)
    shape_file = os.path.join(output_dir, "shape.csv")
    with open(shape_file, "w") as f:
        f.write("%s\n%s" % shape)

    # write reference specimens
    specimens_file = os.path.join(output_dir, "specimens.csv")
    specimen_names = [s[:-len("_betas_filtered.bin")] for s in specimens]
    with open(specimens_file, "w") as f:
        f.write("\n".join(specimen_names))

    # write reference cpg sites
    index_file = os.path.join(output_dir, "cpg_sites.csv")
    with open(os.path.join(input_dir, "index.csv")) as f:
        index = f.read()
    with open(index_file, "w") as f:
        f.write(index)

In [None]:
def make_binary_reference_data_if_needed():
    if not binary_reference_data_exists():
        make_binary_reference_data()

In [None]:
class ReferenceData:
    """Container of reference data and metadata."""
    def __init__(self, name):
        make_binary_reference_data_if_needed()
        self.name = name
        self.annotation = self.get_annotation()
        with open(REFERENCE_CPG_SITES, "r") as f:
            # Save as dictionary to allow fast index lookup.
            self.cpg_sites = {cpg:i for i, cpg in enumerate(
                f.read().splitlines()
            )}

        with open(REFERENCE_SPECIMENS) as f:
            self.specimens = f.read().splitlines()

        # Keep only annotated specimens that have a corresponding
        # methylation binary file.
        self.annotated_specimens = list(
            set(self.annotation["id"]) & set(self.specimens)
        )

        # Save as dictionary to allow fast hash lookup.
        index = {s:i for i, s in enumerate(self.specimens)}
        self.annotated_specimens_index = [index[a]
            for a in self.annotated_specimens]
        self.annotated_specimens_index.sort()

        # Save as dictionary to allow fast hash lookup.
        methyl_dict = {i:mc for i, mc in
            zip(self.annotation.id, self.annotation.methylation_class)
        }
        self.specimen_ids = [self.specimens[i]
            for i in self.annotated_specimens_index]
        self.methylation_class = [methyl_dict[s] for s in self.specimen_ids]
        self.description = ReferenceData.get_description(
            self.methylation_class
        ) 

    @staticmethod
    def get_description(methylation_classes):
        """Returns a description of the methylation class."""
        abbr_df = pd.read_csv(ANNOTATIONS_ABBREVIATIONS_BASEL)
        abbr = {
            mc:desc for mc, desc in 
            zip(abbr_df.MethylClassStr, abbr_df.MethylClassShortDescr)
        }
        non_trivial_abbr = abbr.copy()
        # Drop the trivial "-" entry so it cannot match as a substring.
        non_trivial_abbr.pop("-", None)
        tcga_df = pd.read_csv(ANNOTATIONS_ABBREVIATIONS_TCGA, delimiter="\t")
        tcga = {r.iloc[0]: r.iloc[1] for _, r in tcga_df.iterrows()}
        def description(mc):
            mc = mc.upper()
            # Exact match
            if mc in abbr:
                return abbr[mc]
            # Else choose longest substring from Basel-Annotations/TCGA
            basel_substring = [a for a in non_trivial_abbr if a in mc]
            basel_substring.sort(key=lambda x: len(x))
            tcga_substring = [a for a in tcga if a in mc]
            tcga_substring.sort(key=lambda x: len(x))
            # Prefer Basel Annotation
            if (
                basel_substring and (
                    not tcga_substring or
                    len(basel_substring[-1]) >= len(tcga_substring[-1])
                )
            ):
                return abbr[basel_substring[-1]]
            if tcga_substring:
                return tcga[tcga_substring[-1]]
            # No proper annotation for "PITUI"
            if mc == "PITUI":
                return "Pituicytoma"
            else:
                return ""
        mc_description = [
            description(mc).capitalize() for mc in methylation_classes
        ]
        return mc_description

    def get_annotation(self):
        """Reads annotation as csv file from disk and returns it as
        pd.DataFrame. If the csv file is missing or not up to date, the
        annotation is read from the original excel file (slow) and the csv
        file is written to disk.
        """
        path_csv = os.path.join(ANNOTATIONS, self.name + ".csv")
        path_xlsx = os.path.join(ANNOTATIONS, self.name + ".xlsx")
        if not os.path.exists(path_csv):
            csv_exists_and_up_to_date = False
        else:
            csv_exists_and_up_to_date = (
                os.path.getmtime(path_csv) > os.path.getmtime(path_xlsx)
            )
        if csv_exists_and_up_to_date:
            return pd.read_csv(path_csv)
        annotation = pd.read_excel(
            path_xlsx,
            header=None,
            names=["id", "methylation_class", "custom_text"],
        )
        annotation.to_csv(path_csv, index=False)
        return annotation
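
The description lookup in `ReferenceData.get_description` first tries an exact match and otherwise falls back to the longest known abbreviation contained in the class name. A minimal sketch of that idea follows, assuming a single made-up abbreviation table instead of the Basel and TCGA tables. The helper `describe` and all table entries are hypothetical.

```python
def describe(methylation_class, abbreviations):
    """Describe a class by exact match, else by the longest contained key."""
    key = methylation_class.upper()
    # An exact match wins.
    if key in abbreviations:
        return abbreviations[key]
    # Otherwise pick the longest abbreviation that occurs in the name.
    candidates = [a for a in abbreviations if a in key]
    if not candidates:
        return ""
    return abbreviations[max(candidates, key=len)]


# Hypothetical abbreviation table, for illustration only.
abbreviations = {
    "GBM": "glioblastoma",
    "GBM_MES": "glioblastoma, mesenchymal",
    "MNG": "meningioma",
}

exact = describe("gbm", abbreviations)
longest = describe("GBM_MES_TCGA", abbreviations)
unknown = describe("XYZ", abbreviations)
print(exact, "|", longest, "|", repr(unknown))
```

Preferring the longest contained abbreviation avoids labelling a specific subclass with the description of its more general parent class.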

In [None]:
class ReferenceGenome:
    """Container of reference genome and gene data."""

    def __init__(self):
        self.chrom = pd.read_csv(CHROMOSOMES,
                                 delimiter="\t",
                                 index_col=False)
        self.chrom["offset"] = [0] + np.cumsum(self.chrom["len"]).tolist()[:-1]
        self.chrom["center"] = self.chrom["offset"] + self.chrom["len"]//2
        self.chrom["centromere_offset"] = (self.chrom["offset"]
            + (self.chrom["centromere_start"] + self.chrom["centromere_end"])//2)
        self.length = (self.chrom["offset"].iloc[-1]
                     + self.chrom["len"].iloc[-1])
        if not os.path.exists(GENES):
            self.write_genes_csv()
        self.set_genes()

    def __iter__(self):
        return self.chrom.itertuples()

    def set_genes(self):
        """Read and set genes from csv file."""
        self.genes = pd.read_csv(
            GENES,
            delimiter="\t",
        )

    def write_genes_csv(self):
        """Write csv gene list with one selected transcript per gene."""
        genes = pd.read_csv(
            GENES_RAW,
            delimiter="\t",
            names=["seqname", "source", "feature", "start", "end",
                   "score", "strand", "frame", "attribute"],
            usecols=["seqname", "feature", "start", "end", "attribute"]
        )
        genes = genes.loc[
            (genes["feature"] == "transcript")
            & (genes["seqname"].isin(self.chrom.name))
        ]
        genes["name"] = genes.attribute.apply(
            lambda x: re.search('gene_name(.*)"(.*)"', x).group(2)
        )
        genes["transcript"] = genes.attribute.apply(
            lambda x: re.search(
                'transcript_id(.*)"(.*)"(.*)gene_name(.*)', x
                ).group(2)
        )
        genes = genes.drop_duplicates(subset=["name", "seqname"], keep="first")
        genes = genes.sort_values("name")
        genes["loc"] = genes.apply(
            lambda z: (
                  z["seqname"]
                + ":"
                + "{:,}".format(z["start"])
                + "-"
                + "{:,}".format(z["end"])
            ),
            axis=1,
        )
        # Make data compatible with pythonic notation (end is exclusive).
        genes["end"] += 1
        offset = {i.name:i.offset for i in self}
        genes["start"] = genes.apply(
            lambda z: offset[z["seqname"]] + z["start"],
            axis=1,
        )
        genes["end"] = genes.apply(
            lambda z: offset[z["seqname"]] + z["end"],
            axis=1,
        )
        genes["midpoint"] = (genes["start"] + genes["end"]) // 2
        with open(RELEVANT_GENES, "r") as f:
            relevant_genes = f.read().splitlines()
        genes["relevant"] = genes.name.apply(lambda x: x in relevant_genes)
        genes["len"] = genes["end"] - genes["start"]
        genes[["name", "seqname", "start", "end",
               "len", "midpoint", "relevant", "transcript",
               "loc",
        ]].to_csv(GENES, index=False, sep="\t")

In [None]:
class SampleData:
    """Container of sample data."""
    def __init__(self, name):
        self.name = name
        self.cpg_sites = SampleData.get_read_cpgs(name)
        self.cpg_overlap = None
        self.cpg_overlap_index = None
        self.reads = None

    def set_reads(self):
        """Calculate all read start and end positions."""
        genome = ReferenceGenome()
        bam_files = []
        sample_path = os.path.join(NANODIP_OUTPUT, self.name)
        for root, _, files in os.walk(sample_path):
            bam_files.extend(
                [os.path.join(root, f)
                for f in files if f.endswith(".bam")]
            )
        read_positions = []
        for f in bam_files:
            # The context manager closes the BAM file after reading.
            with pysam.AlignmentFile(f, "rb") as samfile:
                for chrom in genome:
                    for read in samfile.fetch(chrom.name):
                        read_positions.append([
                            read.reference_start + chrom.offset,
                            # reference_end equals first position after
                            # alignment consistent with python notations.
                            read.reference_end + chrom.offset,
                        ])
                        assert (read.reference_length != 0), "Empty read"
        self.reads = read_positions

    @staticmethod
    def get_read_cpgs(name):
        """Get all Illumina methylation sites with methylation status
        within a sample's reads.

        Args:
            name: sample name to be analysed

        Returns:
            Pandas Data Frame containing the reads' Illumina cpg_sites and
            methylation status.
        """

        sample_path = os.path.join(NANODIP_OUTPUT, name)

        if not os.path.exists(sample_path):
            raise FileNotFoundError(sample_path)

        cpg_files = []
        for root, _, files in os.walk(sample_path):
            cpg_files.extend(
                [os.path.join(root, f)
                for f in files if f.endswith("methoverlap.tsv")]
            )

        frames = []

        for f in cpg_files:
            # Some fast5 files do not contain any CpGs.
            try:
                frames.append(
                    pd.read_csv(f, delimiter="\t", header=None,
                                names=["cpg_site", "methylation"])
                )
            except (FileNotFoundError, pd.errors.EmptyDataError):
                logger.exception("empty file encountered, skipping")

        # DataFrame.append was removed in pandas 2.0; concatenate once instead.
        if not frames:
            return pd.DataFrame(columns=["cpg_site", "methylation"])
        return pd.concat(frames)

    def set_cpg_overlap(self, reference):
        """Calculate CpG overlap data between sample and reference.

        Some probes have been skipped from the reference set, e.g. sex
        chromosomes.
        """ #TODO is this true?
        self.cpg_overlap = set(self.cpg_sites["cpg_site"]).intersection(
            reference.cpg_sites.keys())

        self.cpg_overlap_index = [reference.cpg_sites[f]
            for f in self.cpg_overlap]
        self.cpg_overlap_index.sort()

In [None]:
def _get_reference_methylation(reference_index, cpg_index):
    """Extract and return methylation information matrix from reference data.

    Args:
        reference_index: Index of references to extract from reference
            data.
        cpg_index: Index of CpG's to extract from CpG data.

    Returns:
        Numpy array matrix containing submatrix of reference data
        with rows=reference_index and columns=cpg_index.
    """

    make_binary_reference_data_if_needed()
    shape = [len(reference_index), len(cpg_index)]
    delta_offset = np.diff(reference_index, prepend=-1) - 1
    reference_matrix = np.empty(shape, dtype=bool)

    with open(REFERENCE_METHYLATION_SHAPE, "r") as f:
        number_of_cpgs = [int(s) for s in f.read().splitlines()][1]

    with open(REFERENCE_METHYLATION, "rb") as f:
        for i, d in enumerate(delta_offset):
            reference_matrix[i] = np.fromfile(
                f, dtype=bool, offset=d*number_of_cpgs, count=number_of_cpgs
            )[cpg_index]
    return reference_matrix
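
The loop in `_get_reference_methylation` reads only the requested rows of the flat boolean file by skipping the rows in between. The following sketch shows that technique on a made-up 4 x 3 matrix. The helper `read_rows`, the file name, and the data are hypothetical, and the sketch assumes a sorted row index, as the original does.

```python
import os
import tempfile

import numpy as np


def read_rows(path, row_index, n_columns):
    """Read the rows listed in sorted row_index from a flat boolean file."""
    # Number of rows to skip before each wanted row, relative to the file
    # position left behind by the previous read.
    rows_to_skip = np.diff(row_index, prepend=-1) - 1
    result = np.empty((len(row_index), n_columns), dtype=bool)
    with open(path, "rb") as f:
        for i, skip in enumerate(rows_to_skip):
            # One boolean occupies one byte, so the byte offset equals the
            # number of skipped values.
            result[i] = np.fromfile(
                f, dtype=bool, offset=int(skip) * n_columns, count=n_columns
            )
    return result


# Hypothetical methylation matrix: 4 specimens x 3 CpG sites.
matrix = np.array(
    [[0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 1]], dtype=bool
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_methylation.bin")
    matrix.tofile(path)
    selected = read_rows(path, row_index=[1, 3], n_columns=3)

print(selected.astype(int).tolist())
```

Because skipped rows are never loaded, memory use scales with the number of annotated specimens rather than with the size of the whole reference file.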

In [None]:
def get_reference_methylation(sample, reference):
    """Extract and return methylation information matrix from overlap of sample
    CpG's with annotated reference data.
    """

    reference_index = reference.annotated_specimens_index
    cpg_index = sample.cpg_overlap_index
    result = _get_reference_methylation(reference_index, cpg_index)
    return result

In [None]:
def get_sample_methylation(sample, reference):
    """Calculate sample methylation info from reads.

    Args:
        sample: Sample data set.
        reference: Reference data set.

    Returns:
        Numpy array containing sample methylation information.
    """

    sample_methylation = np.full(len(reference.cpg_sites), 0, dtype=bool)
    sample_mean_methylation = sample.cpg_sites.groupby(
        "cpg_site",
        as_index=False).mean()

    for _, row in sample_mean_methylation.iterrows():
        cpg = row["cpg_site"]
        if cpg in sample.cpg_overlap:
            i = reference.cpg_sites[cpg]
            sample_methylation[i] = row["methylation"] > \
                                    METHYLATION_CUTOFF
    sample_methylation = sample_methylation[sample.cpg_overlap_index]
    return sample_methylation
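
`get_sample_methylation` averages the calls of all reads covering a CpG site and marks the site as methylated if the mean exceeds `METHYLATION_CUTOFF`. The same rule can be sketched without pandas on made-up data. The helper `call_methylation` and the CpG names are hypothetical, and the cutoff of 0.35 is copied from the configuration.

```python
from collections import defaultdict


def call_methylation(calls, cutoff=0.35):
    """Average the per-read calls of each CpG and compare to the cutoff."""
    per_site = defaultdict(list)
    for cpg_site, is_methylated in calls:
        per_site[cpg_site].append(is_methylated)
    return {
        site: sum(values) / len(values) > cutoff
        for site, values in per_site.items()
    }


# Hypothetical per-read calls: (CpG site, 1 = methylated, 0 = unmethylated).
calls = [
    ("cg0001", 1), ("cg0001", 0), ("cg0001", 0),
    ("cg0002", 0),
    ("cg0003", 1), ("cg0003", 0),
]
status = call_methylation(calls)
print(status)
```

With low-coverage data most sites are covered by a single read, so the mean is usually just that one call.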


    nanodip.plots
    -------------

    Create Methylation UMAP plot.
    Create Copy Number Variation plot.


In [None]:
def umap_plot_from_data(sample, reference, umap_data_frame, close_up):
    """Create and return umap plot from UMAP data.

    Args:
        sample: sample data
        reference: reference data
        umap_data_frame: pandas data frame containing umap info. First
            row corresponds to sample.
        close_up: bool to indicate if only top matches should be plotted.
    """
    umap_sample = umap_data_frame.iloc[0]
    umap_title = f"UMAP for {sample.name} against {reference.name}, "\
        + f"{len(reference.annotated_specimens)} reference cases, "\
        + f"{len(sample.cpg_overlap)} CpGs"
    if close_up:
        umap_title = "Close-up " + umap_title
    umap_plot = px.scatter(
        umap_data_frame,
        x="x",
        y="y",
        labels={"x":"UMAP 0", "y":"UMAP 1", "methylation_class":"WHO class"},
        title=umap_title,
        color="methylation_class",
        hover_name="id",
        hover_data=["description"],
        render_mode=PLOTLY_RENDER_MODE,
        template="simple_white",
    )
    umap_plot.add_annotation(
        x=umap_sample["x"],
        y=umap_sample["y"],
        text=sample.name,
        showarrow=True,
        arrowhead=1,
    )
    umap_plot.update_yaxes(
        scaleanchor = "x",
        scaleratio = 1,
        mirror=True,
    )
    umap_plot.update_xaxes(
        mirror=True,
    )

    # If close-up add hyperlinks for all references and draw circle
    if close_up:
        umap_plot.update_traces(marker=dict(size=5))
        # Add hyperlinks
        for _, row in umap_data_frame.iloc[1:].iterrows():
            umap_plot.add_annotation(
                x=row["x"],
                y=row["y"],
                text="<a href='" + CNV_URL_PREFIX + row["id"]
                    + CNV_URL_SUFFIX
                    + "' target='_blank'>&nbsp;</a>",
                showarrow=False,
                arrowhead=1,
            )
        # Draw circle
        radius = umap_data_frame["distance"].iloc[-1]
        umap_plot.add_shape(
            type="circle",
            x0=umap_sample["x"] - radius,
            y0=umap_sample["y"] - radius,
            x1=umap_sample["x"] + radius,
            y1=umap_sample["y"] + radius,
            line_color="black",
            line_width=0.5,
        )
    return umap_plot

In [None]:
def umap_data_frame(sample, reference):
    """Create UMAP methylation analysis matrix.

    Args:
        sample: sample to analyse
        reference: reference data
    """
    import umap #TODO move to beginning

    logger.info(
        f"UMAP Plot initiated for {sample.name} and reference {reference.name}."
    )
    logger.info(f"Reference Annotation:\n{reference.annotation}")
    logger.info(f"Reference CpG Sites No:\n{len(reference.cpg_sites)}")
    logger.info(f"Reference Specimens No:\n{len(reference.specimens)}")
    logger.info(
        f"Reference Annotated specimens: {len(reference.annotated_specimens)}"
    )
    logger.info(f"Sample CpG Sites No:\n{len(sample.cpg_sites)}")
    logger.info(f"Sample CpG overlap No before:\n{sample.cpg_overlap}")

    # Calculate overlap of sample CpG's with reference CpG's (some probes have
    # been skipped from the reference set, e.g. sex chromosomes).
    sample.set_cpg_overlap(reference)
    logger.info(f"Sample read. CpG overlap No after:\n{len(sample.cpg_overlap)}")

    if not sample.cpg_overlap:
        logger.info("UMAP done. No Matrix created, no overlapping data.")
        raise ValueError("Sample has no overlapping CpG's with reference.")

    # Extract reference and sample methylation according to CpG overlap.
    reference_methylation = get_reference_methylation(sample,
                                                      reference)
    logger.info(f"""Reference methylation extracted:
                {reference_methylation}""")
    sample_methylation = get_sample_methylation(sample, reference)
    logger.info(f"""Sample methylation extracted:
                {sample_methylation}""")
    logger.info("UMAP algorithm initiated.")

    # Calculate UMAP Nx2 Matrix. Time intensive (~1min).
    methyl_overlap = np.vstack([sample_methylation, reference_methylation])
    umap_2d = umap.UMAP(verbose=True).fit_transform(methyl_overlap)

    # Free memory
    del reference_methylation
    del sample_methylation

    logger.info("UMAP algorithm done.")

    umap_sample = umap_2d[0]
    umap_df = pd.DataFrame({
        "distance": [np.linalg.norm(z - umap_sample) for z in umap_2d],
        "methylation_class":  [sample.name] + reference.methylation_class,
        "description":  ["undetermined"] + reference.description,
        "id": [sample.name] + reference.specimen_ids,
        "x": umap_2d[:,0],
        "y": umap_2d[:,1],
    })

    logger.info("UMAP done. Matrix created.")

    return (methyl_overlap, umap_df)
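
The `distance` column built above is the Euclidean distance of every embedded point to the sample, which sits in row 0 of the UMAP matrix. The close-up plot later keeps only the nearest rows. The following is a minimal, dependency-free sketch of that ranking; the coordinates are invented and `rank_by_distance` is a hypothetical helper, not part of NanoDiP.

```python
import math


def rank_by_distance(points, labels, top=2):
    """Rank 2D points by Euclidean distance to the first point.

    Returns the first point itself plus its `top` nearest neighbours,
    mirroring a sort by distance followed by a slice of length top + 1.
    """
    origin = points[0]
    ranked = sorted(
        zip(labels, points),
        key=lambda label_point: math.dist(label_point[1], origin),
    )
    return [(label, math.dist(point, origin)) for label, point in ranked[:top + 1]]


# Toy embedding: the sample comes first, followed by three reference cases.
toy_points = [(0.0, 0.0), (3.0, 4.0), (1.0, 0.0), (0.0, 2.0)]
toy_labels = ["sample", "ref_a", "ref_b", "ref_c"]
nearest = rank_by_distance(toy_points, toy_labels, top=2)
print(nearest)
```

The sample always ranks first with distance 0, which is why the close-up slice takes one row more than the number of requested matches.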

In [None]:
def get_bin_edges(n_bins, genome):
    """Return the edges of `n_bins` roughly equal-sized bins spanning the
    genome. The edge nearest to each chromosome start is moved onto that
    start, so that every bin is limited to one chromosome."""
    if n_bins < 100:
        raise ValueError("Too few bins (n_bins < 100).")
    edges = np.linspace(0, genome.length, num=n_bins + 1).astype(int)
    # limit bins to only one chromosome
    for chrom_edge in genome.chrom.offset:
        i_nearest = np.abs(edges - chrom_edge).argmin()
        edges[i_nearest] = chrom_edge
    return edges
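
The snapping step in `get_bin_edges` can be illustrated without NumPy. This is a sketch under stated assumptions: the toy genome length and chromosome offsets are invented, and `snap_edges_to_boundaries` is a hypothetical name. Truncating with `int` corresponds to `astype(int)`, and taking the first minimum corresponds to `argmin`.

```python
def snap_edges_to_boundaries(genome_length, n_bins, chrom_offsets):
    """Return n_bins + 1 integer bin edges, with the edge nearest to each
    chromosome offset moved onto that offset."""
    step = genome_length / n_bins
    edges = [int(i * step) for i in range(n_bins + 1)]
    for offset in chrom_offsets:
        # Index of the edge closest to this chromosome start.
        i_nearest = min(range(len(edges)), key=lambda i: abs(edges[i] - offset))
        edges[i_nearest] = offset
    return edges


# Hypothetical toy genome: 1000 bases, chromosomes starting at 0, 330 and 710.
toy_edges = snap_edges_to_boundaries(1000, 10, [0, 330, 710])
print(toy_edges)
```

In this toy case the edges at 300 and 700 move to 330 and 710, so the neighbouring bins become slightly unequal in width but no bin straddles a chromosome boundary.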

In [None]:
def get_cnv(read_positions, genome):
    """Return the bin midpoints and the number of read starts per bin."""
    expected_reads_per_bin = 30
    n_bins = len(read_positions)//expected_reads_per_bin
    read_start_positions = [i[0] for i in read_positions]

    copy_numbers, bin_edges = np.histogram(
        read_start_positions,
        bins=get_bin_edges(n_bins, genome),
        range=[0, genome.length],
    )

    bin_midpoints = (bin_edges[1:] + bin_edges[:-1])/2
    # Relative deviation from the expected read count per bin. Currently
    # unused: the plot works on the raw counts returned below.
    expected_reads = np.diff(bin_edges) * len(read_positions) / genome.length
    cnv = [(x - e)/e for x, e in zip(copy_numbers, expected_reads)]
    return bin_midpoints, copy_numbers
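
The per-bin expectation in `get_cnv` scales the total read count by each bin's share of the genome, and the relative deviation is `(observed - expected) / expected`. A small worked sketch with invented numbers follows; `relative_copy_number` is a hypothetical helper in plain Python, not the notebook's implementation.

```python
def relative_copy_number(counts, bin_edges, n_reads, genome_length):
    """Return (observed - expected) / expected for every bin, where the
    expected count is proportional to the bin width."""
    widths = [right - left for left, right in zip(bin_edges[:-1], bin_edges[1:])]
    expected = [width * n_reads / genome_length for width in widths]
    return [(obs - exp) / exp for obs, exp in zip(counts, expected)]


# Toy data: 100 reads on a 1000-base genome split into three unequal bins.
toy_bin_edges = [0, 250, 500, 1000]
toy_counts = [25, 50, 25]
deviation = relative_copy_number(toy_counts, toy_bin_edges, 100, 1000)
print(deviation)
```

Here the first bin matches its expectation (0.0), the second holds twice the expected reads (1.0), and the third holds half (-0.5), even though the first and third bins have the same raw count.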

In [None]:
def cnv_grid(genome):
    """Make the chromosome grid layout for the CNV plot and save it to disk.
    If a saved grid is available, it is read from disk instead.
    """
    # Check if grid exists and return if available.
    grid_path = CNV_GRID
    if os.path.exists(grid_path):
        with open(grid_path, "r") as f:
            grid = from_json(f.read())
        return grid
        
    grid = go.Figure()
    grid.update_layout(
        coloraxis_showscale=False,
        xaxis = dict(
            linecolor="black",
            linewidth=1,
            mirror=True,
            range=[0, genome.length],
            showgrid=False,
            ticklen=10,
            tickmode="array",
            ticks="outside",
            tickson="boundaries",
            ticktext=genome.chrom.name,
            tickvals=genome.chrom.center,
            zeroline=False,
        ),
        yaxis = dict(
            linecolor="black",
            linewidth=1,
            mirror=True,
            showline=True,
        ),
        width=1700,
        height=900,
        template="simple_white",
    )
    # Vertical lines: centromeres.
    for i in genome.chrom.centromere_offset:
        grid.add_vline(x=i, line_color="black",
                       line_dash="dot", line_width=1)
    # Vertical lines: chromosome boundaries.
    for i in genome.chrom.offset.tolist() + [genome.length]:
        grid.add_vline(x=i, line_color="black", line_width=1)
    # Save to disk
    grid.write_json(grid_path)
    return grid
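
`cnv_grid` follows a load-or-build pattern: reuse the file on disk if it exists, otherwise build the object and write it out. A generic sketch of this pattern with the standard `json` module is shown below; the helper `load_or_build` and the toy layout are assumptions for illustration, and Plotly figures are replaced by a plain dictionary.

```python
import json
import os
import tempfile


def load_or_build(path, build):
    """Return the JSON object stored at `path`; if the file is missing,
    call `build()`, store its result at `path` and return it."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    result = build()
    with open(path, "w") as f:
        json.dump(result, f)
    return result


build_calls = []


def build_layout():
    """Stand-in for an expensive layout computation."""
    build_calls.append(1)
    return {"width": 1700, "height": 900}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "grid.json")
    first = load_or_build(cache_path, build_layout)   # builds and saves
    second = load_or_build(cache_path, build_layout)  # reads from disk
print(first == second, len(build_calls))
```

A cached file is never invalidated in this pattern, so it has to be deleted by hand whenever the inputs (here: the reference genome) change.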

In [None]:
def cnv_plot_from_data(data_x, data_y, E_y, sample_name, read_num, genome):
    """Create CNV plot from CNV data.

    Args:
        data_x: x-Values to plot.
        data_y: y-Values to plot.
        E_y: expected y-Value.
        sample_name: Name of sample.
        read_num: Number of mapped reads.
        genome: Reference Genome.
    """
    grid = cnv_grid(genome)
    # Expected value: horizontal line.
    grid.add_hline(y=E_y, line_color="black", line_width=1)
    cnv_plot = px.scatter(
        x=data_x,
        y=data_y,
        labels={
            "x":f"Number of mapped reads: {read_num}",
            "y":f"Copy numbers per {round(genome.length/(len(data_x) * 1e6), 2)} MB"
        },
        title=f"Sample ID: {sample_name}",
        color=data_y,
        range_color=[E_y*0, E_y*2],
        color_continuous_scale="Portland",
        render_mode=PLOTLY_RENDER_MODE,
    )
    cnv_plot.update_traces(
        hovertemplate="Copy Numbers = %{y} <br>",
    )
    cnv_plot.update_layout(
        grid.layout,
        yaxis_range = [-0.5, 2*E_y],
    )
    return cnv_plot

In [None]:
def number_of_reads(read_start_pos, interval):
    """Return the number of reads starting within the half-open interval
    [left, right). `read_start_pos` must be sorted in ascending order."""
    left, right = interval
    i_left = bisect.bisect_left(read_start_pos, left)
    i_right = bisect.bisect_left(read_start_pos, right)
    # Difference of the indices; avoids copying a slice just to count it.
    return max(i_right - i_left, 0)
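
Counting reads per interval with `bisect` only works on sorted input, and using `bisect_left` for both bounds makes the interval half-open. A short sketch with invented read start positions follows; `count_in_interval` is a hypothetical stand-alone variant.

```python
import bisect


def count_in_interval(sorted_positions, interval):
    """Count values v with left <= v < right in an ascending list."""
    left, right = interval
    i_left = bisect.bisect_left(sorted_positions, left)
    i_right = bisect.bisect_left(sorted_positions, right)
    return i_right - i_left


# Toy read start positions; sorting is required before bisecting.
starts = sorted([5, 40, 12, 99, 40, 73])
inside = count_in_interval(starts, (10, 50))
right_excluded = count_in_interval(starts, (40, 73))
print(inside, right_excluded)
```

Each lookup costs O(log n), so counting reads for many gene intervals stays cheap once the positions have been sorted a single time.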

In [None]:
def get_cnv_plot(sample, bin_midpoints, cnv, genome):
    """Create a genome-wide copy number plot."""
    logger.info("CNVP start")
    logger.info(f"Read positions:\n{sample.reads[:100]}")

    logger.info(f"Bin midpoints:\n{bin_midpoints}")
    logger.info(f"CNV:\n{cnv}")

    avg_read_per_bin = len(sample.reads) // len(bin_midpoints)

    cnv_plot = cnv_plot_from_data(
        data_x=bin_midpoints,
        data_y=cnv,
        E_y = avg_read_per_bin,
        sample_name=sample.name,
        read_num=len(sample.reads),
        genome=genome,
    )

    logger.info("CNVP done")
    return cnv_plot

In [None]:
class UMAPData:
    """Umap data container and methods for invoking umap plot algorithm."""
    def __init__(self, sample_name, reference_name):
        self.sample_name = sample_name
        self.reference_name = reference_name
        self.path = os.path.join(
            NANODIP_REPORTS,
            f"{sample_name}_{reference_name}",
        )

    def make_umap_plot(self):
        """Invoke umap plot algorithm and save to disk."""
        self.sample = SampleData(self.sample_name)
        self.reference = ReferenceData(self.reference_name) # time: 3.6s
        self.methyl_overlap, self.umap_df = umap_data_frame(
            self.sample, self.reference
        )
        self.plot = umap_plot_from_data(
            self.sample,
            self.reference,
            self.umap_df,
            close_up=False,
        )
        logger.info("UMAP plot generated.")
        self.cu_umap_df = self.umap_df.sort_values(
            by="distance"
        )[:UMAP_PLOT_TOP_MATCHES + 1]
        self.cu_plot = umap_plot_from_data(
            self.sample,
            self.reference,
            self.cu_umap_df,
            close_up=True,
        )
        logger.info("UMAP close-up plot generated.")

        # Convert to json.
        self.plot_json = self.plot.to_json()
        self.cu_plot_json = self.cu_plot.to_json()

        self.save_to_disk()

    def save_ranking_report(self):
        """Save pdf containing the nearest neighbours from umap analysis."""
        rows = [row for _, row in self.cu_umap_df.iterrows()]

        html_report = render_template("umap_report.html", rows=rows)

        file_path = os.path.join(
            NANODIP_REPORTS,
            "%s_%s%s" % (self.sample_name,
                         self.reference_name,
                         ENDINGS["ranking"],
                        ),
        )
        convert_html_to_pdf(html_report, file_path)

        file_path = os.path.join(
            NANODIP_REPORTS,
            "%s%s" % (self.sample_name, ENDINGS["cpg_cnt"]),
        )
        with open(file_path, "w") as f:
            f.write("%s" % len(self.sample.cpg_overlap))

    def save_to_disk(self):
        # Save Methylation Matrix.
        file_path = os.path.join(NANODIP_REPORTS,
            "%s_%s%s" % (self.sample_name,
                         self.reference_name,
                         ENDINGS["methyl"])
        )
        np.save(file_path, self.methyl_overlap)

        # Save UMAP Matrix.
        file_path = os.path.join(NANODIP_REPORTS,
            "%s_%s%s" % (self.sample_name,
                         self.reference_name,
                         ENDINGS["umap_csv"])
        )
        self.umap_df.to_csv(file_path, index=False)

        # Write UMAP plot to disk.
        file_path = os.path.join(
            NANODIP_REPORTS,
            "%s_%s%s" % (self.sample_name, self.reference_name,
                         ENDINGS["umap_all"]),
        )
        self.plot.write_html(file_path, config=dict({"scrollZoom": True}))
        self.plot.write_json(file_path[:-4] + "json")
        self.plot.write_image(file_path[:-4] + "png") # Time consumption 1.8s

        # Write UMAP close-up plot to disk.
        file_path = os.path.join(
            NANODIP_REPORTS,
            "%s_%s%s" % (self.sample_name, self.reference_name,
                         ENDINGS["umap_top"]),
        )
        self.cu_plot.write_html(file_path, config=dict({"scrollZoom": True}))
        self.cu_plot.write_json(file_path[:-4] + "json")
        self.cu_plot.write_image(file_path[:-4] + "png") # Time consumption 0.9s

        # Save close up ranking report.
        self.save_ranking_report() # Time consumption 0.4s


    def files_on_disk(self):
        methyl_overlap_path = os.path.join(NANODIP_REPORTS,
            "%s_%s%s" % (self.sample_name,
                         self.reference_name,
                         ENDINGS["methyl"])
        )
        plot_path = self.path + ENDINGS["umap_all_json"]
        cu_plot_path = self.path + ENDINGS["umap_top_json"]

        return (os.path.exists(plot_path) and
                os.path.exists(cu_plot_path) and
                os.path.exists(methyl_overlap_path))

    def read_from_disk(self):
        methyl_overlap_path = os.path.join(NANODIP_REPORTS,
            "%s_%s%s" % (self.sample_name,
                         self.reference_name,
                         ENDINGS["methyl"])
        )
        plot_path = self.path + ENDINGS["umap_all_json"]
        cu_plot_path = self.path + ENDINGS["umap_top_json"]

        # Read UMAP plot as json.
        with open(plot_path, "r") as f:
            self.plot_json = f.read()
        #self.plot = from_json(self.plot_json)

        # Read UMAP close-up plot as json.
        with open(cu_plot_path, "r") as f:
            self.cu_plot_json = f.read()
        #self.cu_plot = from_json(self.cu_plot_json)

        # Read Methylation Matrix.
        self.methyl_overlap = np.load(methyl_overlap_path,
            allow_pickle=True)

In [None]:
def TODO():
    path = umap_output_path(sample, reference, close_up=True)
    with open(path["html"], "w") as f:
        f.write("<html><body>No data to plot.</body></html>")

In [None]:
class CNVData:
    """CNV data container and methods for invoking cnv plot algorithm."""
    genome = ReferenceGenome()
    def __init__(self, sample_name):
        self.sample_name = sample_name
        self.path = os.path.join(NANODIP_REPORTS, f"{sample_name}")

    def read_from_disk(self):
        plot_path = self.path + ENDINGS["cnv_json"]
        genes_path = self.path + ENDINGS["genes"]
        with open(plot_path, "r") as f:
            self.plot_json = f.read()
        self.plot = from_json(self.plot_json)
        self.genes = pd.read_csv(genes_path)

    def files_on_disk(self):
        plot_path = self.path + ENDINGS["cnv_json"]
        genes_path = self.path + ENDINGS["genes"]
        return (
            os.path.exists(plot_path) and 
            os.path.exists(genes_path)
        )

    def make_cnv_plot(self):
        self.sample = SampleData(self.sample_name)
        self.sample.set_reads() # time consumption 2.5s
        self.bin_midpoints, self.cnv = get_cnv(
            self.sample.reads,
            CNVData.genome,
        )
        self.plot = get_cnv_plot(
            sample=self.sample,
            bin_midpoints=self.bin_midpoints,
            cnv=self.cnv,
            genome=CNVData.genome,
        )
        self.plot_json = self.plot.to_json()
        self.genes = self.gene_cnv(
            CNVData.genome.length // len(self.bin_midpoints)
        )
        self.relevant_genes = self.genes.loc[self.genes.relevant]
        self.save_to_disk()

    def save_to_disk(self):
        self.plot.write_html(
            self.path + ENDINGS["cnv_html"],
            config=dict({"scrollZoom": True}),
        )
        write_json(self.plot, self.path + ENDINGS["cnv_json"])
        if not os.path.exists(self.path + ENDINGS["cnv_png"]):
            # time consuming operation (1.96s)
            self.plot.write_image(
                self.path + ENDINGS["cnv_png"], width=1280, height=720
            )
        with open(self.path + ENDINGS["aligned_reads"], "w") as f:
            f.write("%s" % len(self.sample.reads))
        with open(self.path + ENDINGS["reads_csv"], "w") as f:
            write = csv.writer(f)
            write.writerows(self.sample.reads)
        self.genes.to_csv(self.path + ENDINGS["genes"], index=False)
        self.relevant_genes.to_csv(
            self.path + ENDINGS["relevant_genes"],
            index=False,
        )

    def gene_cnv(self, bin_width):
        genes = CNVData.genome.genes
        genes["interval"] = list(zip(genes.start, genes.end))
        read_start_pos = [i[0] for i in self.sample.reads]
        read_start_pos.sort()
        num_reads = len(read_start_pos)
        genes["cn_obs"] = genes.interval.apply(
            lambda z: number_of_reads(read_start_pos, z)
        )
        genes["cn_norm"] = genes.apply(
            lambda z: z["cn_obs"]/z["len"] * bin_width, # TODO exp/norm not compatible
            axis=1,
        )
        genes["cn_exp"] = genes.apply(
            lambda z: num_reads*z["len"]/CNVData.genome.length,
            axis=1,
        )
        genes = genes.sort_values(by="cn_obs", ascending=False)
        return genes

    def get_gene_positions(self, genes):
        gene_pos = self.genes.loc[self.genes.name.isin(genes)]
        return gene_pos

    def plot_cnv_and_genes(self, gene_names):
        genes = self.get_gene_positions(gene_names)
        plot = go.Figure(self.plot)
        plot.add_trace(
            go.Scatter(
                customdata=genes[[
                    "name",          # 0
                    "loc",           # 1
                    "transcript",    # 2
                    "len",           # 3
                ]],
                hovertemplate=(
                    "Copy numbers = %{y} <br>"
                    "<b> %{customdata[0]} </b> <br>"
                    "%{customdata[1]} "
                    "(hg19 %{customdata[2]}) <br>"
                    "%{customdata[3]} bases <br>"
                ),
                name="",
                marker_color="rgba(0,0,0,1)",
                mode="markers+text",
                marker_symbol="diamond",
                textfont_color="rgba(0,0,0,1)",
                showlegend=False,
                text=genes.name,
                textposition="top center",
                x=genes.midpoint,
                y=genes.cn_obs,
            ))
        return plot.to_json()

In [None]:
# create and configure logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
formatter = logging.Formatter(
    "%(levelname)s %(asctime)s %(lineno)d - %(message)s")
file_handler = logging.FileHandler(
    os.path.join(NANODIP_REPORTS, "nanodip.log"),
    "w",
)
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)

In [None]:
## output to console
# stream_handler = logging.StreamHandler()
# logger.addHandler(stream_handler)


# No user editable code below
Do not modify the cells below unless you would like to patch errors or create
something new.

## Sections
1. Generic Functions
2. MinKNOW API Functions
3. CNV Plotter
4. UMAP Methylation Plotter
5. User Interface Functions
6. Report Generator
7. CherryPy Web UI



### 1. Generic Functions


In [None]:
def logpr(v,logstring): # logging function that reads verbosity parameter
    if v==1:
        print(str(datetime.datetime.now())+": "+str(logstring))

In [None]:
def get_runs():
    """Return list of run folders from MinKNOW data directory sorted by
    modification time."""
    runs = []
    for f in os.listdir(DATA):
        if f not in EXCLUDED_FROM_ANALYSIS:
            file_path = os.path.join(DATA, f)
            mod_time = os.path.getmtime(file_path)
            if os.path.isdir(file_path):
                runs.append([f, mod_time])
    # sort based on modif. date
    runs.sort(key=lambda x: (x[1], x[0]), reverse=True)
    # Remove date after sorting
    return [x[0] for x in runs]
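
Listing run folders by modification time can be tried out on a temporary directory. The sketch below is an illustration, not NanoDiP's code: the folder names are invented, `dirs_by_mtime` is a hypothetical helper, and the modification times are set explicitly with `os.utime` so that the order is reproducible.

```python
import os
import tempfile


def dirs_by_mtime(base, excluded=()):
    """Return sub-directory names of `base`, newest first, skipping any
    name listed in `excluded`."""
    entries = []
    with os.scandir(base) as it:
        for entry in it:
            if entry.is_dir() and entry.name not in excluded:
                entries.append((entry.stat().st_mtime, entry.name))
    # Sort by modification time, then by name, in descending order.
    entries.sort(reverse=True)
    return [name for _, name in entries]


with tempfile.TemporaryDirectory() as base:
    for age, name in enumerate(["run_old", "run_mid", "run_new"]):
        path = os.path.join(base, name)
        os.mkdir(path)
        # Assign increasing, fixed timestamps (seconds since the epoch).
        os.utime(path, (1_000 + age, 1_000 + age))
    listing = dirs_by_mtime(base, excluded={"run_mid"})
print(listing)
```

Sorting on the `(mtime, name)` pair keeps the order deterministic when two folders share the same timestamp.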

In [None]:
def predominant_barcode(sample_name):
    """Return the predominant barcode among all fast5 files."""
    fast5_files = []
    for root, _, files in os.walk(os.path.join(DATA, sample_name)):
        fast5_files.extend(
            [os.path.join(root, f) for f in files if f.endswith(".fast5")]
        )
    barcode_hits=[]
    for barcode in BARCODE_NAMES:
        barcode_hits.append(
            len([f for f in fast5_files if barcode in f])
        )
    max_barcode = max(barcode_hits)
    if max_barcode > 1:
        predominant_barcode = BARCODE_NAMES[barcode_hits.index(max_barcode)]
    else:
        predominant_barcode = "undetermined"
    return predominant_barcode
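
The barcode vote above counts, for every known barcode name, how many file paths contain it, and accepts the winner only if it occurs more than once. A compact sketch of the same idea follows; the paths and barcode names are invented and `most_frequent_barcode` is a hypothetical helper.

```python
def most_frequent_barcode(paths, barcode_names, min_hits=2):
    """Return the barcode name contained in most paths, or "undetermined"
    if the best barcode occurs in fewer than `min_hits` paths."""
    hits = {bc: sum(bc in path for path in paths) for bc in barcode_names}
    best = max(hits, key=hits.get)
    return best if hits[best] >= min_hits else "undetermined"


toy_barcodes = ["barcode01", "barcode02", "barcode03"]
toy_paths = [
    "run/fast5_pass/barcode02/a_0.fast5",
    "run/fast5_pass/barcode02/a_1.fast5",
    "run/fast5_pass/barcode01/a_0.fast5",
    "run/fast5_pass/barcode02/a_2.fast5",
]
winner = most_frequent_barcode(toy_paths, toy_barcodes)
too_few = most_frequent_barcode(toy_paths[:1], toy_barcodes)
print(winner, too_few)
```

Because the match is a plain substring test, barcode names that are prefixes of one another would be counted together; the sketch shares that limitation.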

In [None]:
def reference_annotations():
    """Return list of all reference annotation files (MS Excel XLSX format)."""
    annotations = []
    for r in os.listdir(ANNOTATIONS):
        if r.endswith(".xlsx"):
            annotations.append(r)
    return annotations

In [None]:
# TODO del
def writeReferenceDefinition(sampleId,referenceFile):
    # write the filename of the UMAP reference for the current run into a
    # text file
    with open(NANODIP_REPORTS+'/'+sampleId+'_selected_reference.txt', 'w') as f:
        f.write(referenceFile)

In [None]:
def write_reference_name(sample_id,reference_name):
    """Write the filename of the UMAP reference for the current run into
    a text file."""
    path = os.path.join(
        NANODIP_REPORTS, sample_id + "_selected_reference.txt"
    )
    with open(path, "w") as f:
        f.write(reference_name)

In [None]:
def readReferenceDefinition(sampleId):
    # read the filename of the UMAP reference for the current sample;
    # return an empty string if it cannot be read
    try:
        with open(NANODIP_REPORTS+'/'+sampleId+'_selected_reference.txt', 'r') as f:
            referenceFile=f.read()
    except:
        referenceFile=""
    return referenceFile

In [None]:
def writeRunTmpFile(sampleId,deviceId):
    # append the statistics of the current run (timestamp, read count,
    # basecalled bases, overlapping CpGs) as one tab-separated line to a
    # temporary text file
    with open(NANODIP_REPORTS+'/'+sampleId+'_'+deviceId+'_runinfo.tmp', 'a') as f:
        try:
            runId=getActiveRun(deviceId)
        except:
            runId="none"
        ro=getThisRunOutput(deviceId,sampleId,runId)
        readCount=ro[0]
        basecalledBases=ro[1]
        overlapCpGs=getOverlapCpGs(sampleId)
        f.write(str(int(time.time()))+"\t"+
                str(readCount)+"\t"+
                str(basecalledBases)+"\t"+
                str(overlapCpGs)+"\n")

In [None]:
def readRunTmpFile(sampleId):
    print("readRunTmpFile not ready")

In [None]:
def getOverlapCpGs(sampleName):
    methoverlapPath=NANODIP_OUTPUT+"/"+sampleName # collect matching CpGs from sample
    methoverlapTsvFiles=[] # find all *methoverlap.tsv files
    for root, dirnames, filenames in os.walk(methoverlapPath):
        for filename in fnmatch.filter(filenames, '*methoverlap.tsv'):
            methoverlapTsvFiles.append(os.path.join(root, filename))
    # Sum the row counts per file. This equals the length of the concatenated
    # table and avoids DataFrame.append, which was removed in pandas 2.0.
    overlapCount=0
    for f in methoverlapTsvFiles:
        try: # some fast5 files do not contain any CpGs
            m=pd.read_csv(f, delimiter='\t', header=None, index_col=0)
            overlapCount+=len(m)
        except Exception:
            logpr(VERBOSITY,"empty file encountered, skipping")
    return overlapCount

In [None]:
def f5cOneFast5(sampleId,analyzeOne=True):
    analyzedCount=0
    thisRunDir=DATA+"/"+sampleId
    pattern = '*.fast5'
    fileList = []
    for dName, sdName, fList in os.walk(thisRunDir): # Walk through directory
        for fileName in fList:
            if fnmatch.fnmatch(fileName, pattern): # Match search string
                fileList.append(os.path.join(dName, fileName))
    calledList=[]
    completedCount=0
    maxBcCount=1 # at least 2 "passed" files (>1) need to be present
    targetBc="undetermined"
    for bc in BARCODE_NAMES:
        thisBc=0
        for f in fileList:
            if bc in f:
                if "_pass_" in f:
                    thisBc+=1
        if thisBc > maxBcCount:
            maxBcCount=thisBc
            targetBc=bc
    f5cAnalysisDir=NANODIP_OUTPUT+"/"+sampleId
    if os.path.exists(f5cAnalysisDir)==False:
        os.mkdir(f5cAnalysisDir)
    thisBcFast5=[]
    thisBcFastq=[]
    for f in fileList:
        if targetBc in f:
            q=f.replace(".fast5","").replace("fast5_pass","fastq_pass")+".fastq"
            if os.path.exists(q): # check if accompanying fastq exists
                thisBcFast5.append(f)
                thisBcFastq.append(q)
                thisBcFileName=f.split("/")
                thisBcFileName=thisBcFileName[len(thisBcFileName)-1].replace(".fast5","") # get name prefix (to be the analysis subdir name later)
                thisAnalysisDir=f5cAnalysisDir+"/"+thisBcFileName
                if os.path.exists(thisAnalysisDir)==False:
                    os.mkdir(thisAnalysisDir)
                target5=thisAnalysisDir+"/"+thisBcFileName+".fast5"
                targetq=thisAnalysisDir+"/"+thisBcFileName+".fastq"
                if os.path.exists(target5)==False:
                    os.symlink(f,target5)             # fast5 symlink
                if os.path.exists(targetq)==False:
                    os.symlink(q,targetq)             #fastq symlink
                if os.path.exists(thisAnalysisDir+"/"+thisBcFileName+"-methoverlapcount.txt")==False:
                    if (analyzeOne==True and analyzedCount==0) or analyzeOne==False:
                        cmd=F5C+" index -t 1 --iop 100 -d "+thisAnalysisDir+" "+targetq
                        p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE) #index, call methylation and get methylation frequencies
                        p.wait()
                        cmd=MINIMAP2+" -a -x map-ont "+REFERENCE_GENOME_MMI+" "+targetq+" -t 4 | "+SAMTOOLS+" sort -T tmp -o "+thisAnalysisDir+"/"+thisBcFileName+"-reads_sorted.bam"
                        p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE) # get sorted BAM (4 threads)
                        p.wait()
                        cmd=SAMTOOLS+" index "+thisAnalysisDir+"/"+thisBcFileName+"-reads_sorted.bam"
                        p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE) # index BAM
                        p.wait()
                        cmd=F5C+" call-methylation -B2000000 -K400 -b "+thisAnalysisDir+"/"+thisBcFileName+"-reads_sorted.bam -g "+REFERENCE_GENOME_FA+" -r "+targetq+" > "+thisAnalysisDir+"/"+thisBcFileName+"-result.tsv"
                        p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE) # set B to 2 megabases (GPU) and 0.4 kreads
                        p.wait()
                        cmd=F5C+" meth-freq -c 2.5 -s -i "+thisAnalysisDir+"/"+thisBcFileName+"-result.tsv > "+thisAnalysisDir+"/"+thisBcFileName+"-freq.tsv"
                        p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
                        p.wait()
                        cmd=RSCRIPT+" "+READ_CPG_RSCRIPT+" "+thisAnalysisDir+"/"+thisBcFileName+"-freq.tsv "+ILUMINA_CG_MAP+" "+thisAnalysisDir+"/"+thisBcFileName+"-methoverlap.tsv "+thisAnalysisDir+"/"+thisBcFileName+"-methoverlapcount.txt"
                        p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
                        p.wait()
                        calledList.append(thisBcFileName)
                        analyzedCount+=1
                else:
                    completedCount+=1
    return "Target = "+targetBc+"<br>Methylation called for "+str(calledList)+". "+str(completedCount+analyzedCount)+"/"+str(len(thisBcFast5))


### 2. MinKNOW API Functions
Check https://github.com/nanoporetech/minknow_api for reference.

The following code requires a patched version of the MinKNOW API, install it
from https://github.com/neuropathbasel/minknow_api.


In [None]:
# Construct a manager using the host + port provided. This is used to connect
# to the MinKNOW service through the MK API.
def mkManager():
    return Manager(host=THIS_HOST, port=9501, use_tls=False)

In [None]:
def listMinionPositions(): # list MinION devices that are currently connected to the system
    manager = mkManager()
    positions = manager.flow_cell_positions() # Find a list of currently available sequencing positions.
    return positions   # User could call {pos.connect()} here to connect to the running MinKNOW instance.

In [None]:
def listMinionExperiments(): # list all current and previous runs in the MinKNOW buffer, lost after MinKNOW restart
    manager=mkManager()
    htmlHost="<b>Host: "+THIS_HOST+"</b><br><table border='1'><tr>"
    positions=manager.flow_cell_positions() # Find a list of currently available sequencing positions.
    htmlPosition=[]
    for p in positions:
        htmlPosinfo="<b>-"+str(p)+"</b><br>"
        connection = p.connect()
        mountedFlowCellID=connection.device.get_flow_cell_info().flow_cell_id # return the flow cell info
        htmlPosinfo=htmlPosinfo+"--mounted flow cell ID: <b>" + mountedFlowCellID +"</b><br>"
        htmlPosinfo=htmlPosinfo+"---"+str(connection.acquisition.current_status())+"<br>" # READY, STARTING, sequencing/mux = PROCESSING, FINISHING; Pause = PROCESSING
        protocols = connection.protocol.list_protocol_runs()
        bufferedRunIds = protocols.run_ids
        for b in bufferedRunIds:
            htmlPosinfo=htmlPosinfo+"--run ID: " + b +"<br>"
            run_info = connection.protocol.get_run_info(run_id=b)
            htmlPosinfo=htmlPosinfo+"---with flow cell ID: " + run_info.flow_cell.flow_cell_id +"<br>"
        htmlPosition.append(htmlPosinfo)
    hierarchy = htmlHost
    for p in htmlPosition:
        hierarchy=hierarchy + "<td valign='top'><tt>"+p+"</tt></td>"
    hierarchy=hierarchy+"</table>"
    return(hierarchy)

In [None]:
def getFlowCellID(thisDeviceId): # determine flow cell ID (if any). Note that some CTCs have an empty ID string.
    mountedFlowCellID="no_flow_cell"
    manager=mkManager()
    positions=manager.flow_cell_positions() # Find a list of currently available sequencing positions.
    for p in positions:
        if thisDeviceId in str(p):
            connection = p.connect()
            mountedFlowCellID=connection.device.get_flow_cell_info().flow_cell_id # return the flow cell info
    return mountedFlowCellID

In [None]:
# This cell starts a run on Mk1b devices and performs several checks concerning
# the run protocol.

# modified from the MinKNOW API on https://github.com/nanoporetech/minknow_api (2021-06)
# created from the sample code at
# https://github.com/nanoporetech/minknow_api/blob/master/python/examples/start_protocol.py
# minknow_api.manager supplies "Manager" a wrapper around MinKNOW's Manager
# gRPC API with utilities for querying sequencing positions + offline
# basecalling tools.
# from minknow_api.manager import Manager

# We need 'find_protocol' to search for the required protocol given a kit +
# product code.
# from minknow_api.tools import protocols
def parse_args():
    """Build and parse the command line arguments for starting a protocol.

    Returns:
        Parsed arguments to be used when starting a protocol.
    """
    parser = argparse.ArgumentParser(
        description="""
        Run a sequencing protocol in a running MinKNOW instance.
        """
    )
    parser.add_argument(
        "--host",
        default="localhost",
        help="IP address of the machine running MinKNOW (defaults to localhost)",
    )
    parser.add_argument(
        "--port",
        help="Port to connect to on host (defaults to standard MinKNOW port based on tls setting)",
    )
    parser.add_argument(
        "--no-tls", help="Disable tls connection", default=False, action="store_true"
    )
    parser.add_argument("--verbose", action="store_true", help="Enable debug logging")

    parser.add_argument("--sample-id", help="sample ID to set")
    parser.add_argument(
        "--experiment-group",
        "--group-id",
        help="experiment group (aka protocol group ID) to set",
    )
    parser.add_argument(
        "--position",
        help="position on the machine (or MinION serial number) to run the protocol at",
    )
    parser.add_argument(
        "--flow-cell-id",
        metavar="FLOW-CELL-ID",
        help="ID of the flow-cell on which to run the protocol. (specify this or --position)",
    )
    parser.add_argument(
        "--kit",
        required=True,
        help="Sequencing kit used with the flow-cell, eg: SQK-LSK108",
    )
    parser.add_argument(
        "--product-code",
        help="Override the product-code stored on the flow-cell and previously user-specified "
        "product-codes",
    )
    # BASECALL ARGUMENTS
    parser.add_argument(
        "--basecalling",
        action="store_true",
        help="enable base-calling using the default base-calling model",
    )
    parser.add_argument(
        "--basecall-config",
        help="specify the base-calling config and enable base-calling",
    )
    # BARCODING ARGUMENTS
    parser.add_argument(
        "--barcoding", action="store_true", help="protocol uses barcoding",
    )
    parser.add_argument(
        "--barcode-kits",
        nargs="+",
        help="bar-coding expansion kits used in the experiment",
    )
    parser.add_argument(
        "--trim-barcodes", action="store_true", help="enable bar-code trimming",
    )
    parser.add_argument(
        "--barcodes-both-ends",
        action="store_true",
        help="bar-code filtering (both ends of a strand must have a matching barcode)",
    )

    parser.add_argument(
        "--detect-mid-strand-barcodes",
        action="store_true",
        help="bar-code filtering for bar-codes in the middle of a strand",
    )
    parser.add_argument(
        "--min-score",
        type=float,
        default=0.0,
        help="read selection based on bar-code accuracy",
    )
    parser.add_argument(
        "--min-score-rear",
        type=float,
        default=0.0,
        help="read selection based on bar-code accuracy",
    )

    parser.add_argument(
        "--min-score-mid",
        type=float,
        default=0.0,
        help="read selection based on bar-code accuracy",
    )
    # ALIGNMENT ARGUMENTS
    parser.add_argument(
        "--alignment-reference",
        help="Specify alignment reference to send to basecaller for live alignment.",
    )
    parser.add_argument(
        "--bed-file", help="Specify bed file to send to basecaller.",
    )
    # Output arguments
    parser.add_argument(
        "--fastq",
        action="store_true",
        help="enables FastQ file output, defaulting to 4000 reads per file",
    )
    parser.add_argument(
        "--fastq-reads-per-file",
        type=int,
        default=4000,
        help="set the number of reads combined into one FastQ file.",
    )
    parser.add_argument(
        "--fast5",
        action="store_true",
        help="enables Fast5 file output, defaulting to 4000 reads per file, this will store raw, "
        "fastq and trace-table data",
    )
    parser.add_argument(
        "--fast5-reads-per-file",
        type=int,
        default=4000,
        help="set the number of reads combined into one Fast5 file.",
    )
    parser.add_argument(
        "--bam",
        action="store_true",
        help="enables BAM file output, defaulting to 4000 reads per file",
    )
    parser.add_argument(
        "--bam-reads-per-file",
        type=int,
        default=4000,
        help="set the number of reads combined into one BAM file.",
    )
    # Read until
    parser.add_argument(
        "--read-until-reference", type=str, help="Reference file to use in read until",
    )
    parser.add_argument(
        "--read-until-bed-file", type=str, help="Bed file to use in read until",
    )
    parser.add_argument(
        "--read-until-filter",
        type=str,
        choices=["deplete", "enrich"],
        help="Filter type to use in read until",
    )
    # Experiment
    parser.add_argument(
        "--experiment-duration",
        type=float,
        default=72,
        help="time spent sequencing (in hours)",
    )
    parser.add_argument(
        "--no-active-channel-selection",
        action="store_true",
        help="disable dynamic selection of channels to select pores for sequencing, "
        "ignored for Flongle flow-cells",
    )
    parser.add_argument(
        "--mux-scan-period",
        type=float,
        default=1.5,
        help="number of hours before a mux scan takes place, enables active-channel-selection, "
        "ignored for Flongle flow-cells",
    )
    parser.add_argument(
        "extra_args",
        metavar="ARGS",
        nargs="*",
        help="Additional arguments passed verbatim to the protocol script",
    )
    args = parser.parse_args()
    # Further argument checks
    # Read until must have a reference and a filter type, if enabled:
    if (
        args.read_until_filter is not None
        or args.read_until_reference is not None
        or args.read_until_bed_file is not None
    ):
        if args.read_until_filter is None:
            print("Unable to specify read until arguments without a filter type.")
            sys.exit(1)

        if args.read_until_reference is None:
            print("Unable to specify read until arguments without a reference.")
            sys.exit(1)

    if args.bed_file and not args.alignment_reference:
        print("Unable to specify `--bed-file` without `--alignment-reference`.")
        sys.exit(1)

    if (args.barcoding or args.barcode_kits) and not (
        args.basecalling or args.basecall_config
    ):
        print(
            "Unable to specify `--barcoding` or `--barcode-kits` without `--basecalling`."
        )
        sys.exit(1)
    if args.alignment_reference and not (args.basecalling or args.basecall_config):
        print("Unable to specify `--alignment-reference` without `--basecalling`.")

        sys.exit(1)
    if not (args.fast5 or args.fastq):
        print("No output (fast5 or fastq) specified")

    return args
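
The dependency checks above can be tried without a sequencer by parsing an explicit argument list instead of `sys.argv`. The following is a minimal sketch under stated assumptions: `build_parser` and `check_args` are hypothetical names, the parser is reduced to three of the options defined above, and `ValueError` replaces `sys.exit(1)` so that a calling web handler could report the problem instead of terminating.

```python
import argparse


def build_parser():
    """Hypothetical, reduced parser with three of the options used above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--basecalling", action="store_true")
    parser.add_argument("--alignment-reference")
    parser.add_argument("--bed-file")
    return parser


def check_args(args):
    """Raise ValueError for option combinations that cannot work."""
    if args.bed_file and not args.alignment_reference:
        raise ValueError("--bed-file requires --alignment-reference")
    if args.alignment_reference and not args.basecalling:
        raise ValueError("--alignment-reference requires --basecalling")
    return args


# Parse an explicit list (here a whitespace-split string) instead of sys.argv.
argument_string = "--basecalling --alignment-reference ref.fa"
args = check_args(build_parser().parse_args(argument_string.split()))
print(args.alignment_reference)
```

Raising an exception rather than exiting is a design choice for this sketch only; the function above keeps the behavior of the original command-line script.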

In [None]:
def is_position_selected(position, args):
    """Find if the {position} is selected by command line arguments {args}."""

    # First check for name match:
    if args.position == position.name:
        return True

    # Then verify if the flow cell matches:
    connected_position = position.connect()
    if args.flow_cell_id is not None:
        flow_cell_info = connected_position.device.get_flow_cell_info()
        if (
            flow_cell_info.user_specified_flow_cell_id == args.flow_cell_id
            or flow_cell_info.flow_cell_id == args.flow_cell_id
        ):
            return True

    return False

In [None]:
def startRun():
    """Entrypoint to start protocol example"""
    # Parse arguments to be passed to started protocols:
    run_id=""
    args = parse_args()
    #args = parse_args(minknowApiShellArgumentString.split())

    # Specify --verbose on the command line to get debug-level log output:
    if args.verbose:
        logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

    # Construct a manager using the host + port provided:
    #manager = Manager(host=args.host, port=args.port, use_tls=not args.no_tls)
    manager=mkManager()
    errormessage=""

    # Find which positions we are going to start protocol on:
    positions = manager.flow_cell_positions()
    filtered_positions = list(
        filter(lambda pos: is_position_selected(pos, args), positions)
    )

    # At least one position needs to be selected:
    if not filtered_positions:
        errormessage="No positions selected for protocol - specify `--position` or `--flow-cell-id`"
    else:
        protocol_identifiers = {}
        for pos in filtered_positions:
            # Connect to the sequencing position:
            position_connection = pos.connect()

            # Check if a flowcell is available for sequencing
            flow_cell_info = position_connection.device.get_flow_cell_info()
            if not flow_cell_info.has_flow_cell:
                errormessage="No flow cell present in position "+str(pos)
            else:
                # Select product code:
                if args.product_code:
                    product_code = args.product_code
                else:
                    product_code = flow_cell_info.user_specified_product_code
                    if not product_code:
                        product_code = flow_cell_info.product_code

                # Find the protocol identifier for the required protocol:
                protocol_info = protocols.find_protocol(
                    position_connection,
                    product_code=product_code,
                    kit=args.kit,
                    basecalling=args.basecalling,
                    basecall_config=args.basecall_config,
                    barcoding=args.barcoding,
                    barcoding_kits=args.barcode_kits,
                )

                if not protocol_info:
                    print("Failed to find protocol for position %s" % (pos.name))
                    print("Requested protocol:")
                    print("  product-code: %s" % args.product_code)
                    print("  kit: %s" % args.kit)
                    print("  basecalling: %s" % args.basecalling)
                    print("  basecall_config: %s" % args.basecall_config)
                    print("  barcode-kits: %s" % args.barcode_kits)
                    print("  barcoding: %s" % args.barcoding)
                    errormessage="Protocol build error, consult application log."
                else:
                    # Store the identifier for later:
                    protocol_identifiers[pos.name] = protocol_info.identifier

                    # Start protocol on the requested positions:
                    print("Starting protocol on %s positions" % len(filtered_positions))
                    for pos in filtered_positions:

                        # Connect to the sequencing position:
                        position_connection = pos.connect()

                        # Find the protocol identifier for the required protocol:
                        protocol_identifier = protocol_identifiers[pos.name]

                        # Now select which arguments to pass to start protocol:
                        print("Starting protocol %s on position %s" % (protocol_identifier, pos.name))

                        # Set up user specified product code if requested:
                        if args.product_code:
                            position_connection.device.set_user_specified_product_code(
                                code=args.product_code
                            )

                        # Build arguments for starting protocol:
                        basecalling_args = None
                        if args.basecalling or args.basecall_config:
                            barcoding_args = None
                            alignment_args = None
                            if args.barcode_kits or args.barcoding:
                                barcoding_args = protocols.BarcodingArgs(
                                    args.barcode_kits,
                                    args.trim_barcodes,
                                    args.barcodes_both_ends,
                                    args.detect_mid_strand_barcodes,
                                    args.min_score,
                                    args.min_score_rear,
                                    args.min_score_mid,
                                )

                            if args.alignment_reference:
                                alignment_args = protocols.AlignmentArgs(
                                    reference_files=[args.alignment_reference], bed_file=args.bed_file,
                                )

                            basecalling_args = protocols.BasecallingArgs(
                                config=args.basecall_config,
                                barcoding=barcoding_args,
                                alignment=alignment_args,
                            )

                        read_until_args = None
                        if args.read_until_filter:
                            read_until_args = protocols.ReadUntilArgs(
                                filter_type=args.read_until_filter,
                                reference_files=[args.read_until_reference],
                                bed_file=args.read_until_bed_file,
                                first_channel=None,  # These default to all channels.
                                last_channel=None,
                            )

                        def build_output_arguments(args, name):
                            if not getattr(args, name):
                                return None
                            return protocols.OutputArgs(
                                reads_per_file=getattr(args, "%s_reads_per_file" % name)
                            )

                        fastq_arguments = build_output_arguments(args, "fastq")
                        fast5_arguments = build_output_arguments(args, "fast5")
                        bam_arguments = build_output_arguments(args, "bam")

                        # print the protocol parameters
                        print("position_connection "+str(position_connection))
                        print("protocol_identifier "+str(protocol_identifier))
                        print("args.sample_id "+str(args.sample_id))
                        print("args.experiment_group "+str(args.experiment_group))
                        print("basecalling_args "+str(basecalling_args))
                        print("read_until_args "+str(read_until_args))
                        print("fastq_arguments "+str(fastq_arguments)) #fastq_arguments OutputArgs(reads_per_file=400)
                        print("fast5_arguments "+str(fast5_arguments)) #fast5_arguments OutputArgs(reads_per_file=400)
                        print("bam_arguments "+str(bam_arguments))
                        print("args.no_active_channel_selection "+str(args.no_active_channel_selection))
                        print("args.mux_scan_period "+str(args.mux_scan_period))
                        print("args.experiment_duration "+str(args.experiment_duration))
                        print("args.extra_args "+str(args.extra_args))  # Any extra args passed.

                        # Now start the protocol:
                        run_id = protocols.start_protocol(
                            position_connection,
                            protocol_identifier,
                            sample_id=args.sample_id,
                            experiment_group=args.experiment_group,
                            basecalling=basecalling_args,
                            read_until=read_until_args,
                            fastq_arguments=fastq_arguments,
                            fast5_arguments=fast5_arguments,
                            bam_arguments=bam_arguments,
                            disable_active_channel_selection=args.no_active_channel_selection,
                            mux_scan_period=args.mux_scan_period,
                            experiment_duration=args.experiment_duration,
                            args=args.extra_args,  # Any extra args passed.
                        )

                        #print("Started protocol %s" % run_id)
    return errormessage+run_id # one of them should be ""
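
The nested helper `build_output_arguments` looks up both the on/off flag and the matching `<name>_reads_per_file` option by attribute name. A minimal sketch of that lookup in isolation, assuming a hypothetical `OutputArgs` named tuple as a stand-in for the real `protocols.OutputArgs` and a hand-built `Namespace` in place of parsed command-line arguments:

```python
from argparse import Namespace
from collections import namedtuple

# Hypothetical stand-in for protocols.OutputArgs (assumption for this sketch).
OutputArgs = namedtuple("OutputArgs", ["reads_per_file"])


def build_output_arguments(args, name):
    """Return OutputArgs for an enabled output format, otherwise None."""
    if not getattr(args, name):
        return None
    return OutputArgs(reads_per_file=getattr(args, "%s_reads_per_file" % name))


# Hand-built arguments: FastQ output enabled, Fast5 output disabled.
args = Namespace(
    fastq=True, fastq_reads_per_file=400,
    fast5=False, fast5_reads_per_file=4000,
)
fastq_arguments = build_output_arguments(args, "fastq")
fast5_arguments = build_output_arguments(args, "fast5")
print(fastq_arguments, fast5_arguments)
```

A disabled format yields `None`, which is how `start_protocol` is told above to skip that output type.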

In [None]:
def stopRun(minionId): # stop an existing run (if any) for a MinION device
    manager=mkManager()
    positions = list(manager.flow_cell_positions())
    filtered_positions = list(filter(lambda pos: pos.name == minionId, positions))
    # Connect to the grpc port for the position:
    connection = filtered_positions[0].connect()
    protocolRuns = connection.protocol.list_protocol_runs() # local name avoids shadowing the protocols module
    bufferedRunIds = protocolRuns.run_ids
    thisMessage="No protocol running, nothing was stopped."
    for b in bufferedRunIds:
        try:
            connection.protocol.stop_protocol()
            thisMessage="Protocol "+b+" stopped on "+minionId+"."
        except Exception: # stopping failed, e.g. because no protocol is currently running
            pass
    return thisMessage

In [None]:
# from minknow_api demos, start_seq.py
def is_position_selected(position, args):
    """Find if the {position} is selected by command line arguments {args}."""
    if args.position == position.name: # First check for name match:
        return True
    connected_position = position.connect()  # Then verify if the flow cell matches:
    if args.flow_cell_id is not None:
        flow_cell_info = connected_position.device.get_flow_cell_info()
        if (flow_cell_info.user_specified_flow_cell_id == args.flow_cell_id
            or flow_cell_info.flow_cell_id == args.flow_cell_id):
            return True
    return False

In [None]:
def getMinKnowApiStatus(deviceString): # MinKNOW status per device
    replyString=""
    testHost="localhost"
    manager=mkManager()
    positions = list(manager.flow_cell_positions())
    filtered_positions = list(filter(lambda pos: pos.name == deviceString, positions))
    connection = filtered_positions[0].connect() # Connect to the grpc port for the position
    # determine if anything is running and the kind of run, via set temperature
    replyString=replyString+"acquisition.get_acquisition_info().state: "+str(connection.acquisition.get_acquisition_info().state)+"<br>"
    replyString=replyString+"acquisition.current_status(): "+str(connection.acquisition.current_status())+"<br>"
    replyString=replyString+"minion_device.get_settings().temperature_target.min: "+str(connection.minion_device.get_settings().temperature_target.min)+"<br>"
    replyString=replyString+"device.get_temperature(): " + str(connection.device.get_temperature().minion.heatsink_temperature)+"<br>"
    replyString=replyString+"device.get_bias_voltage(): " + str(connection.device.get_bias_voltage())+"<br>"
    return replyString
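
The status functions in this section all repeat the same lookup: list the positions, filter by name, and connect to the first match. `filtered_positions[0]` raises an `IndexError` when the named device is not connected. A minimal sketch of a shared lookup that returns `None` instead, assuming a hypothetical helper name `find_position` and stand-in objects with made-up device names in place of real MinKNOW positions:

```python
from types import SimpleNamespace


def find_position(positions, device_name):
    """Return the first position whose name matches, or None if absent."""
    return next((pos for pos in positions if pos.name == device_name), None)


# Stand-in objects; real ones would come from manager.flow_cell_positions().
positions = [SimpleNamespace(name="MN00001"), SimpleNamespace(name="MN00002")]

found = find_position(positions, "MN00002")
missing = find_position(positions, "MN99999")
print(found.name, missing)
```

A caller could then test for `None` and return a readable message before calling `.connect()`.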

In [None]:
def getActiveRun(deviceString):
    manager=mkManager()
    positions = list(manager.flow_cell_positions())
    filtered_positions = list(filter(lambda pos: pos.name == deviceString, positions))
    connection = filtered_positions[0].connect() # Connect to the grpc port for the position
    try:
        activeRun=connection.acquisition.get_current_acquisition_run().run_id # error if no acquisition is running, same as with acquisition.current_status(), no acquisition until temperature reached
    except Exception:
        activeRun="none"
    return activeRun

In [None]:
def getRealDeviceActivity(deviceString):            # seq. runs: 34 degC and flow cell checks 37 degC target
    manager=mkManager()                             # temperatures seem to be the only way to determine if
    positions = list(manager.flow_cell_positions()) # a device has been started
    filtered_positions = list(filter(lambda pos: pos.name == deviceString, positions))
    connection = filtered_positions[0].connect() # Connect to the grpc port for the position
    targetTemp=str(connection.minion_device.get_settings().temperature_target.min)
    returnValue=""
    if targetTemp=="34.0":
        returnValue="sequencing"
    elif targetTemp=="37.0":
        returnValue="checking flow cell"
    elif targetTemp=="35.0":
        returnValue="idle"
    return returnValue
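
`getRealDeviceActivity` maps the target temperature to an activity by comparing its string form. The same mapping can be sketched as a numeric lookup with a small tolerance, which would not depend on how the float is formatted. The names `ACTIVITY_BY_TARGET_TEMP` and `activity_from_target_temp` and the tolerance value are assumptions for illustration; the temperatures are those used in the function above.

```python
import math

# Target temperatures in degC, taken from the function above.
ACTIVITY_BY_TARGET_TEMP = {
    34.0: "sequencing",
    37.0: "checking flow cell",
    35.0: "idle",
}


def activity_from_target_temp(target_temp, tolerance=0.05):
    """Return the activity for a target temperature, or "" if unknown."""
    for temp, activity in ACTIVITY_BY_TARGET_TEMP.items():
        if math.isclose(target_temp, temp, abs_tol=tolerance):
            return activity
    return ""


print(activity_from_target_temp(34.0))
```

Unknown temperatures still yield an empty string, matching the fallback of the original function.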

In [None]:
def getThisRunState(deviceString): # obtain further information about a particular device / run
    manager=mkManager()
    positions = list(manager.flow_cell_positions())
    filtered_positions = list(filter(lambda pos: pos.name == deviceString, positions))
    connection = filtered_positions[0].connect() # Connect to the grpc port for the position
    try:
        thisRunState="Run state for "+deviceString+": "
        thisRunState=thisRunState+str(connection.protocol.get_current_protocol_run().state)+"/"
        thisRunState=thisRunState+str(connection.acquisition.get_acquisition_info().state)
    except:
        thisRunState="No state information in MinKNOW buffer for "+deviceString
    return thisRunState

In [None]:
def getThisRunSampleID(deviceString): # get SampleID from MinKNOW by device, only available after data
    manager=mkManager()               # acquisition has been initiated by MinKNOW.
    positions = list(manager.flow_cell_positions())
    filtered_positions = list(filter(lambda pos: pos.name == deviceString, positions))
    connection = filtered_positions[0].connect() # Connect to the grpc port for the position
    try:
        thisRunSampleID=connection.protocol.get_current_protocol_run().user_info.sample_id.value
    except:
        thisRunSampleID="No sampleId information in MinKNOW buffer for "+deviceString
    return thisRunSampleID

In [None]:
def getThisRunYield(deviceString): # get run yield by device. The data of the previous run will remain
    manager=mkManager()            # in the buffer until acquisition (not just a start) of a new run
    positions = list(manager.flow_cell_positions()) # has been initiated.
    filtered_positions = list(filter(lambda pos: pos.name == deviceString, positions))
    connection = filtered_positions[0].connect() # Connect to the grpc port for the position
    try:
        acqinfo=connection.acquisition.get_acquisition_info()
        thisRunYield="Run yield for "+deviceString+"("+acqinfo.run_id+"):&nbsp;"
        thisRunYield=thisRunYield+str(acqinfo.yield_summary)
    except:
        thisRunYield="No yield information in MinKNOW buffer for "+deviceString
    return thisRunYield

In [None]:
def getThisRunOutput(deviceString,sampleName,runId): # get pass read count and called bases by device, sampleName, runId
    thisRunOutput=[-1,-1] # defaults in case of error / missing information
    manager=mkManager()
    positions = list(manager.flow_cell_positions())
    filtered_positions = list(filter(lambda pos: pos.name == deviceString, positions))
    connection = filtered_positions[0].connect() # Connect to the grpc port for the position
    readCount=-3
    calledBases=-3
    if getThisRunSampleID(deviceString)==sampleName: # check that runID and sampleID match
        readCount=-4
        calledBases=-4
        if connection.acquisition.get_current_acquisition_run().run_id==runId:
            if connection.acquisition.current_status()!="status: READY": # i.e., working
                try:
                    acq=connection.acquisition.get_acquisition_info()
                    readCount=acq.yield_summary.basecalled_pass_read_count
                    calledBases=acq.yield_summary.basecalled_pass_bases
                except:
                    readCount=-5
                    calledBases=-5
    thisRunOutput=[readCount,calledBases]
    return thisRunOutput # shall be a list

In [None]:
def getThisRunEstimatedOutput(deviceString,sampleName,runId): # get pass read count and estimated selected bases by device, sampleName, runId
    thisRunOutput=[-1,-1] # defaults in case of error / missing information
    manager=mkManager()
    positions = list(manager.flow_cell_positions())
    filtered_positions = list(filter(lambda pos: pos.name == deviceString, positions))
    connection = filtered_positions[0].connect() # Connect to the grpc port for the position
    readCount=-3
    calledBases=-3
    if getThisRunSampleID(deviceString)==sampleName: # check that runID and sampleID match
        readCount=-4
        calledBases=-4
        if connection.acquisition.get_current_acquisition_run().run_id==runId:
            if connection.acquisition.current_status()!="status: READY": # i.e., working
                try:
                    acq=connection.acquisition.get_acquisition_info()
                    readCount=acq.yield_summary.basecalled_pass_read_count
                    calledBases=acq.yield_summary.estimated_selected_bases
                except:
                    readCount=-5
                    calledBases=-5
    thisRunOutput=[readCount,calledBases]
    return thisRunOutput # shall be a list
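
Both output functions signal problems through negative placeholder values rather than exceptions. The sketch below decodes them into readable text. The meanings are read off the control flow of the two functions above, and the names `OUTPUT_STATUS` and `describe_output` are hypothetical.

```python
# Meaning of the negative placeholders, derived from the two functions above.
OUTPUT_STATUS = {
    -3: "sample ID does not match the current run",
    -4: "run ID does not match or the device is not acquiring",
    -5: "yield summary could not be read",
}


def describe_output(run_output):
    """Turn a [read_count, bases] list into a short status text."""
    read_count, bases = run_output
    if read_count < 0:
        return OUTPUT_STATUS.get(read_count, "unknown error")
    return f"{read_count} reads, {bases / 1e6:.2f} Mb"


print(describe_output([-3, -3]))
print(describe_output([1200, 3500000]))
```

Callers that do arithmetic on the second list element, such as the run terminator, may want to check for negative values first.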

In [None]:
def getThisRunInformation(deviceString): # get current run information. Only available after data acquisition
    manager=mkManager()                  # has started.
    positions = list(manager.flow_cell_positions())
    filtered_positions = list(filter(lambda pos: pos.name == deviceString, positions))
    connection = filtered_positions[0].connect() # Connect to the grpc port for the position
    try:
        thisRunInfo="Run information for "+deviceString+"<br><br>"+str(connection.protocol.get_current_protocol_run())
    except:
        thisRunInfo="No protocol information in MinKNOW buffer for "+deviceString
    return thisRunInfo

In [None]:
def thisRunWatcherTerminator(deviceString,sampleName):
    realRunId=getActiveRun(deviceString) #
    currentBases=getThisRunEstimatedOutput(deviceString,sampleName,realRunId)[1]
    currentBasesString=str(round(currentBases/1e6,2))
    wantedBasesString=str(round(NEEDED_NUMBER_OF_BASES/1e6,2))
    myString="<html><head>"
    myString=myString+"<title>"+currentBasesString+"/"+wantedBasesString+"MB:"+sampleName+"</title>"
    if currentBases < NEEDED_NUMBER_OF_BASES: # don't refresh after showing the STOP state
        myString=myString+"<meta http-equiv='refresh' content='10'>"
    myString=myString+"</head><body>"
    myString=myString+"<b>Automatic run terminator</b> for sample <b>"+sampleName+ "</b>, run ID="+realRunId+" on "+deviceString+" when reaching "+wantedBasesString+" MB, now "+currentBasesString+" MB"
    myString=myString+"<hr>"
    myString=myString+"Last refresh at "+date_time_string_now()+".<hr>"
    if currentBases >= NEEDED_NUMBER_OF_BASES: # same threshold as the refresh check above
        stopRun(deviceString)
        myString=myString+"STOPPED at "+date_time_string_now()
    elif currentBases==0:
        myString=myString+"IDLE / MUX / ETC"
    else:
        myString=myString+"RUNNING"
    myString=myString+"</body></html>"
    return myString
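
The decision at the core of the run terminator can be separated from the HTML generation, which makes it easy to check. A minimal sketch, assuming a hypothetical function name `terminator_state`; treating negative placeholder values like zero is an assumption of this sketch, not the behavior of the function above.

```python
def terminator_state(current_bases, needed_bases):
    """Classify a run as STOP, IDLE or RUNNING from its base count."""
    if current_bases >= needed_bases:
        return "STOP"
    if current_bases <= 0:  # zero or a negative placeholder: nothing sequenced yet
        return "IDLE"
    return "RUNNING"


for bases in (0, 50_000_000, 150_000_000):
    print(bases, terminator_state(bases, 150_000_000))
```

Only the `STOP` state would trigger `stopRun`, and only the other two states would keep the page refreshing.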


### 3. CNV Plotter



### 4. UMAP Methylation Plotter



### 5. Report Generator



### 6. User Interface Functions


In [None]:
# TODO changed
# String patterns in sample names that exclude data from downstream analysis,
# e.g., test runs
def analysis_launch_table():
    """Presents a html table from which analyses can be started in a post-hoc
    manner."""
    analysis_runs = [run for run in get_runs() if
        not any(pattern in run for pattern in ANALYSIS_EXCLUSION_PATTERNS)]
    annotations = reference_annotations()
    table = f"""
        <tt>
        <font size='-2'>
        <table border=1>
        <thead>
        <tr>
            <th align='left'><b>Sample ID </b></th>
            <th align='left'><b>CpGs</b></th>
            <th align='left'><b>CNV</b></th>"""
    for a in annotations:
        table += f"""
            <th align='left'>
                <b>UMAP against<br>{a.replace(".xlsx", "")}</b>
            </th>"""
    table += """
        </tr>
        </thead>
        <tbody>"""
    for run in analysis_runs:
        table += f"""
        <tr>
            <td>{run}</td>
            <td>
            <a href='./analysisLauncher?functionName=methylationPoller&sampleName={run}&refAnno=None'
            target='_blank' rel='noopener noreferrer' title='{run}: CpGs'>
                get CpGs
            </a>
            </td>
            <td>
            <a href='./analysisLauncher?functionName=cnvplot&sampleName={run}&refAnno=None'
                target='_blank' rel='noopener noreferrer' title='{run}: CNV'>
                    plot CNV
            </a>
            </td>"""
        for a in annotations:
            table += f"""
            <td>
            <a href='./analysisLauncher?functionName=umapplot&sampleName={run}&refAnno={a}'
            target='_blank' rel='noopener noreferrer'
            title='{run}: {a.replace(".xlsx", "")}'>
                plot UMAP
            </a>&nbsp;
            <a href='./makePdf?sampleName={run}&refAnno={a}' target='_blank'
            rel='noopener noreferrer' title='{run}: {a.replace(".xlsx", "")}'>
                make PDF
            </a>
            </td>"""
        table += """
        </tr>"""
    table += """
        </tbody>
        </table>
        </font>
        </tt>"""
    return table
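
The links in this table place sample and annotation names directly into the query string. If a name contained a space, `&` or a quote, the link would break. A minimal sketch of one way to guard against that with the standard library; `launcher_link` is a hypothetical helper, and the sample name is made up.

```python
from html import escape
from urllib.parse import urlencode


def launcher_link(function_name, sample_name, ref_anno, label):
    """Build one analysisLauncher link with encoded query parameters."""
    query = urlencode({
        "functionName": function_name,
        "sampleName": sample_name,
        "refAnno": ref_anno,
    })
    # Escape the query for use inside an HTML attribute ('&' becomes '&amp;').
    return (
        f"<a href='./analysisLauncher?{escape(query, quote=True)}' "
        f"target='_blank' rel='noopener noreferrer' "
        f"title='{escape(sample_name, quote=True)}'>{escape(label)}</a>"
    )


link = launcher_link("umapplot", "B2021 test&1", "ref.xlsx", "plot UMAP")
print(link)
```

CherryPy decodes query parameters before passing them to the handler, so the receiving function would see the original sample name.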

In [None]:
def get_all_results():
    """Return list of all analysis result files in report directory sorted
    by modification time."""
    files = []
    for f in os.listdir(NANODIP_REPORTS):
        for e in RESULT_ENDINGS.values():
            if f.endswith(e):
                mod_time = os.path.getmtime(
                    os.path.join(NANODIP_REPORTS, f)
                )
                files.append([f, mod_time])
    files.sort(key=lambda x: (x[1], x[0]), reverse=True)
    return [f[0] for f in files]
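
The listing logic can be tried on a temporary directory. In the sketch below, `list_results` is a hypothetical variant that takes the directory and endings as parameters instead of the globals `NANODIP_REPORTS` and `RESULT_ENDINGS`; the file names and endings are made up. Passing a tuple to `str.endswith` lists each file once even if it matches several endings.

```python
import os
import tempfile


def list_results(report_dir, endings):
    """Return result files in report_dir, newest first."""
    files = [f for f in os.listdir(report_dir) if f.endswith(tuple(endings))]
    return sorted(
        files,
        key=lambda f: (os.path.getmtime(os.path.join(report_dir, f)), f),
        reverse=True,
    )


with tempfile.TemporaryDirectory() as report_dir:
    names = ["a_CNVplot.png", "b_UMAP_all.html", "notes.txt"]
    for age, name in enumerate(names):
        path = os.path.join(report_dir, name)
        open(path, "w").close()
        os.utime(path, (1000 + age, 1000 + age))  # explicit modification times
    results = list_results(report_dir, ["_CNVplot.png", "_UMAP_all.html"])

print(results)
```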

In [None]:
def livePage(deviceString): # generate a live preview of the data analysis with the current PNG figures
    thisSampleID=getThisRunSampleID(deviceString) # if there is a run that produces data, the run ID will exist
    thisSampleRef=readReferenceDefinition(thisSampleID).replace(".xlsx", "")
    cnvPlotPath="reports/"+thisSampleID+"_CNVplot.png"
    umapAllPlotPath="reports/"+thisSampleID+"_"+thisSampleRef+"_UMAP_all.png"
    umapAllPlotlyPath="reports/"+thisSampleID+"_"+thisSampleRef+"_UMAP_all.html"
    umapTopPlotPath="reports/"+thisSampleID+"_"+thisSampleRef+"_UMAP_top.png"
    ht="<html><body><tt>sample ID: "+thisSampleID+" with reference "+thisSampleRef+"</tt><br>"
    ht=ht+"<a href='"+cnvPlotPath+"' target='_blank'><img align='Top' src='"+cnvPlotPath+"' width='50%' alt='CNV plot will appear here'></a>"
    ht=ht+"<a href='"+umapAllPlotlyPath+"' target='_blank'><img align='Top' src='"+umapAllPlotPath+"' width='50%' alt='UMAP plot will appear here'></a>"
    ht=ht+"</body></html>"
    return ht

In [None]:
def methcallLivePage(sampleName): # generate a self-refreshing page to invoke methylation calling
    ht="<html><head><title>MethCaller: "+sampleName+"</title>"
    ht=ht+"<meta http-equiv='refresh' content='3'></head><body>"
    ht=ht+"last refresh and console output at "+date_time_string_now()+"<hr>shell output<br><br><tt>"
    #ht=ht+calculateMethylationAndBamFromFast5Fastq(sampleName)
    ht=ht+f5cOneFast5(sampleName,analyzeOne=True)
    ht=ht+"</tt></body></html>"
    return ht

In [None]:
#TODO changed
def menuheader(current_page, autorefresh=0):
    """Generate a universal website header for the UI pages that
    contains a simple main menu."""
    menu = {
        "index":[
            "Overview",
            "General system information",
        ],
        "listPositions":[
            "Mk1b Status",
            "Live status of all connected Mk1b devices",
        ],
        "startSequencing":[
            "Start seq.",
            "Start a sequencing run on an idle Mk1b device",
        ],
        "startTestrun":[
            "Start test run",
            "Start a test seq. run on an idle Mk1b device to verify that the previous flow cell wash was successful.",
        ],
        "listExperiments":[
            "Seq. runs",
            "List all buffered runs. Will be purged upon MinKNOW backend restart.",
        ],
        "listRuns":[
            "Results",
            "List all completed analysis results",
        ],
        "analyze":[
            "Analyze",
            "Launch data analyses manually, e.g. for retrospective analysis",
        ],
        "about":[
            "About NanoDiP",
            "Version, etc.",
        ],
    }
    html = f"""
        <html>
        <head>
        <title>
            NanoDiP Version {NANODIP_VERSION}
        </title>"""
    if autorefresh > 0:
        html += f"<meta http-equiv='refresh' content='{autorefresh}'>"
    html += """
        </head>
        <body>
        <table border=0 cellpadding=2>
        <tr>
            <td>
                <img src='img/EpiDiP_Logo_01.png' width='40px' height='40px'>
            </td>"""
    for key, value in menu.items():
        selected_color = "#E0E0E0" if current_page == key else "white"
        html += f"""
            <td bgcolor='{selected_color}'>
                <b>
                <a href='{key}' title='{value[1]}'> {value[0]}
                </a>
                </b>
            </td>"""
    html += f"""
        </tr>
        </table>
        <br>"""
    return html


### 6. CherryPy Web UI
The browser-based user interface is based on CherryPy, which contains an
integrated web server and serves pages locally. Communication between the
service and browser typically generates static web pages that may or may not
contain automatic self-refresh commands. In the case of self-refreshing pages,
the browser will re-request a given page, which leads to re-execution of the
respective Python functions. The main handles to these functions are located in
the Web UI cell below.
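
The self-refresh mechanism needs nothing more than a `meta` tag in the returned HTML. The following is a minimal sketch using only the standard library; `refreshing_page` is a hypothetical helper. In the Web UI, a method decorated with `@cherrypy.expose` would return such a string, and each browser refresh would call that method again.

```python
from datetime import datetime


def refreshing_page(title, body, refresh_seconds=0):
    """Return an HTML page that reloads itself if refresh_seconds > 0."""
    refresh = ""
    if refresh_seconds > 0:
        refresh = f"<meta http-equiv='refresh' content='{refresh_seconds}'>"
    return (
        f"<html><head><title>{title}</title>{refresh}</head>"
        f"<body>{body}</body></html>"
    )


# The timestamp changes on every call, i.e. on every browser refresh.
now = datetime.now().isoformat(timespec="seconds")
page = refreshing_page("status", "last refresh at " + now, refresh_seconds=10)
print(page)
```

Leaving `refresh_seconds` at 0 yields a static page, which corresponds to the pages without refresh commands.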


In [None]:
class UserInterface(object):
    """The CherryPy Web UI Webserver class defines entrypoints and
    function calls."""
    # global variables within the CherryPy Web UI
    cpgQueue = 0 # TODO use mutex instead
    umapQueue = 0
    cnvpQueue = 0
    cnv_lock = mp.Lock()
    umap_lock = mp.Lock()

    @cherrypy.expose
    def index(self):
        total, used, free = shutil.disk_usage(DATA)
        sys_stat = {
            "hostname": socket.gethostname(),
            "disk_total": total // (2**30),
            "disk_used": used // (2**30),
            "disk_free": free // (2**30),
            "memory_free": round(
                psutil.virtual_memory().available * 100
                / psutil.virtual_memory().total
            ),
            "cpu": round(psutil.cpu_percent()),
            "cpgs": UserInterface.cpgQueue,
            "cnvp": len([p for p in mp.active_children() if p.name == "cnv"]),
            "umap": len([p for p in mp.active_children() if p.name == "umap"]),
        }
        return render_template("index.html", sys_stat=sys_stat)

    @cherrypy.expose
    def old(self):
        """Titlepage."""
        html = menuheader('index', 15)
        html += "<tt><b>Computer:</b> "
        html += str(socket.gethostname())
        html += "</tt><br><br>"
        return html

    @cherrypy.expose
    def restart(self):
        cherrypy.engine.restart()
        return render_template("restart.html")

    @cherrypy.expose
    def reset_queue(self, queue_name=""):
        html = menuheader('index', 15)
        if queue_name:
            if queue_name == "cpg":
                UserInterface.cpgQueue = 0
            if queue_name == "umap":
                UserInterface.umapQueue = 0
            if queue_name == "cnvp":
                UserInterface.cnvpQueue = 0
            html += queue_name + " queue reset"
        return html


    @cherrypy.expose
    def listPositions(self):
        myString=menuheader(1,10)
        positions=listMinionPositions()
        for pos in positions:
            n=str(pos.name) # pos.state does not tell much other than that the device is connected with USB ("running")
            myString=myString+"<br><iframe src='DeviceStatusLive?deviceString="+n+"' height='200' width='600' title='"+n+"' border=3></iframe>"
            myString=myString+"<iframe src='AnalysisStatusLive?deviceString="+n+"' height='200' width='600' title='"+n+"' border=3></iframe>"
            myString=myString+"<br><a href='DeviceStatusLive?deviceString="+n+"' target='_blank' title='Click to open device status page in new tab or window'>"+n+"</a>"
            myString=myString+", live state: "+getRealDeviceActivity(n)
            activeRun=getActiveRun(n)
            myString=myString+", active run: "+activeRun
            if activeRun!="none":
                myString=myString+" <a href='launchAutoTerminator?sampleName="+getThisRunSampleID(n)+"&deviceString="+n+"' target='_blank'>"
                myString=myString+"<br>Click this link to launch automatic run terminator after "+str(round(NEEDED_NUMBER_OF_BASES/1e6))+" MB.</a>"
                myString=myString+"<br><font color='#ff0000'><a href='stopSequencing?deviceId="+n+"' title='Clicking this will terminate the current run immediately! Use with care!'>terminate manually</a></font>"
            myString=myString+"<br><br>"
        myString=myString+"</body></html>"
        return myString

    @cherrypy.expose
    def status(self):
        positions = [str(pos.name) for pos in listMinionPositions()]
        print("--------------", positions)
        return render_template(
            "status.html",
            positions=positions,
            mega_bases=NEEDED_NUMBER_OF_BASES // 1e6)

    @cherrypy.expose
    def startSequencing(self,deviceId="",sampleId="",runDuration="",referenceFile=""):
        myString=menuheader(2,0)
        if sampleId:
            if float(runDuration)>=0.1:
                sys.argv = ['',
                            '--host','localhost',
                            '--position',deviceId,
                            '--sample-id',sampleId,
                            '--experiment-group',sampleId,
                            '--experiment-duration',runDuration,
                            '--basecalling',
                            '--fastq',
                            '--fastq-reads-per-file',READS_PER_FILE,
                            '--fast5',
                            '--fast5-reads-per-file',READS_PER_FILE,
                            '--verbose',
                            '--kit','SQK-RBK004',
                            '--barcoding',
                            '--barcode-kits','SQK-RBK004']
                realRunId=startRun()
                writeReferenceDefinition(sampleId,referenceFile)
                myString=myString+"sequencing run started for "+sampleId+" on "+deviceId+" as "+realRunId+" with reference "+referenceFile
                myString=myString+"<hr>"+getThisRunInformation(deviceId)
                myString=myString+"<hr><a href='launchAutoTerminator?sampleName="+sampleId+"&deviceString="+deviceId+"'>"
                myString=myString+"Click this link to launch automatic run terminator after "+str(round(NEEDED_NUMBER_OF_BASES/1e6))+" MB.</a> "
                myString=myString+"If you do not start the run terminator, you will have to terminate the run manually, or it will stop after the predefined time."
        else:
            myString=myString+'''<form action="startSequencing" method="GET">
                Select an idle Mk1b:&nbsp;<select name="deviceId" id="deviceId">'''
            positions=listMinionPositions()
            for pos in positions:
                thisPos=pos.name
                if getRealDeviceActivity(thisPos)=="idle":
                    if getFlowCellID(thisPos)!="":
                        myString=myString+'<option value="'+thisPos+'">'+thisPos+': '+getFlowCellID(thisPos)+'</option>'
            myString=myString+'''
                </select>&nbsp; and enter the sample ID:&nbsp;<input type="text" name="sampleId" />
                &nbsp;for&nbsp;<input type="text" name="runDuration" value="72" />&nbsp;hours.
                &nbsp;Reference set&nbsp;<select name="referenceFile" id="referenceFile">'''
            for ref in reference_annotations():
                myString=myString+'<option value="'+ref+'">'+ref+'</option>'
            myString=myString+'&nbsp;<input type="submit" value="start sequencing now"/></form>'
        return myString



    @cherrypy.expose
    def start(self, device_id="", sample_id="",
              run_duration="", reference_id=""):
        # bool() ensures the template receives True/False, not the sample ID.
        start_now = bool(sample_id) and float(run_duration) >= 0.1
        if start_now:
            sys.argv = [
                "",
                "--host", "localhost",
                "--position", device_id,
                "--sample-id", sample_id,
                "--experiment-group", sample_id,
                "--experiment-duration", run_duration,
                "--basecalling",
                "--fastq",
                "--fastq-reads-per-file", READS_PER_FILE,
                "--fast5",
                "--fast5-reads-per-file", READS_PER_FILE,
                "--verbose",
                "--kit", "SQK-RBK004",
                "--barcoding",
                "--barcode-kits", "SQK-RBK004",
            ]
            run_id = startRun()
            write_reference_name(sample_id, reference_id)
            return render_template(
                "start.html",
                start_now=start_now,
                test=False,
                sample_id=sample_id,
                reference_id=reference_id,
                device_id=device_id,
                run_id=run_id,
                mega_bases=NEEDED_NUMBER_OF_BASES // 1e6,
                run_info=getThisRunInformation(device_id),
            )
        else:
            positions = [p.name for p in listMinionPositions()]
            idle = [p for p in positions if getRealDeviceActivity(p) == "idle"
                and getFlowCellID(p) != ""]
            return render_template(
                "start.html",
                start_now=start_now,
                test=False,
                idle=idle,
                references=reference_annotations(),
            )


    @cherrypy.expose
    def startTestrun(self,deviceId=""):
        myString=menuheader('startTestrun', 0)
        if deviceId:
            sampleId=date_time_string_now()+"_TestRun_"+getFlowCellID(deviceId)
            sys.argv = ['',
                        '--host','localhost',
                        '--position',deviceId,
                        '--sample-id',sampleId,
                        '--experiment-group',sampleId,
                        '--experiment-duration','0.1',
                        '--basecalling',
                        '--fastq',
                        '--fastq-reads-per-file',READS_PER_FILE,
                        '--fast5',
                        '--fast5-reads-per-file',READS_PER_FILE,
                        '--verbose',
                        '--kit','SQK-RBK004',
                        '--barcoding',
                        '--barcode-kits','SQK-RBK004']
            realRunId=startRun()
            myString=myString+"sequencing run started for "+sampleId+" on "+deviceId+" as "+realRunId
            myString=myString+"<hr>"+getThisRunInformation(deviceId)
        else:
            myString=myString+'''<form action="startTestrun" method="GET">
                Select an idle Mk1b:&nbsp;<select name="deviceId" id="deviceId">'''
            positions=listMinionPositions()
            for pos in positions:
                thisPos=pos.name
                if getRealDeviceActivity(thisPos)=="idle":
                    if getFlowCellID(thisPos)!="":
                        myString=myString+'<option value="'+thisPos+'">'+thisPos+': '+getFlowCellID(thisPos)+'</option>'
            myString=myString+'''
                </select>&nbsp;<input type="submit" value="start test run now (0.1h)"/></form>'''
        return myString

    @cherrypy.expose
    def test_run(self, device_id=""):
        if device_id:
            sample_id = (date_time_string_now() + "_TestRun_"
                + getFlowCellID(device_id))
            sys.argv = [
                "",
                "--host", "localhost",
                "--position", device_id,
                "--sample-id", sample_id,
                "--experiment-group", sample_id,
                "--experiment-duration", "0.1",
                "--basecalling",
                "--fastq",
                "--fastq-reads-per-file", READS_PER_FILE,
                "--fast5",
                "--fast5-reads-per-file", READS_PER_FILE,
                "--verbose",
                "--kit", "SQK-RBK004",
                "--barcoding",
                "--barcode-kits", "SQK-RBK004",
            ]
            run_id = startRun()
            return render_template(
                "start.html",
                start_now=True,
                sample_id=sample_id,
                reference_id="TEST",
                device_id=device_id,
                run_id=run_id,
                mega_bases=NEEDED_NUMBER_OF_BASES // 1e6,
                run_info=getThisRunInformation(device_id),
            )
        else:
            positions = [p.name for p in listMinionPositions()]
            idle = [p for p in positions if getRealDeviceActivity(p) == "idle"
                and getFlowCellID(p) != ""]
            return render_template(
                "start.html",
                start_now=False,
                test=True,
                idle=idle,
                references=reference_annotations(),
            )

    @cherrypy.expose
    def stopSequencing(self, deviceId=""):
        myString=menuheader('listPositions', 0)
        myString=myString + stopRun(deviceId)
        myString=myString + "<br><br>Click on any menu item to proceed."
        return myString

    @cherrypy.expose
    def listExperiments(self):
        myString=menuheader('listExperiments', 10)
        myString=myString+"Running and buffered experiments:<br>"
        experiments=listMinionExperiments()
        myString=myString+experiments
        return myString

    @cherrypy.expose
    def list_runs(self):
        status = {}
        mounted_flow_cell_id = {}
        current_status = {}
        flow_cell_id = {}
        buffered_run_ids = {}

        manager = mkManager()
        # Find a list of currently available sequencing positions.
        positions = manager.flow_cell_positions()

        for p in positions:
            connection = p.connect()
            # return the flow cell info
            mounted_flow_cell_id[p] = connection.device.get_flow_cell_info(
                ).flow_cell_id
            # READY, STARTING, sequencing/mux = PROCESSING, FINISHING;
            # Pause = PROCESSING
            current_status[p] = connection.acquisition.current_status()
            protocols = connection.protocol.list_protocol_runs()
            buffered_run_ids[p] = protocols.run_ids
            for b in buffered_run_ids[p]:
                run_info = connection.protocol.get_run_info(run_id=b)
                flow_cell_id[(p, b)] = run_info.flow_cell.flow_cell_id
        return render_template(
            "list_runs.html",
            positions=positions,
            host=CHERRYPY_HOST,
            status=status,
            mounted_flow_cell_id=mounted_flow_cell_id,
            current_status=current_status,
            flow_cell_id=flow_cell_id,
            buffered_run_ids=buffered_run_ids,
        )


    @cherrypy.expose
    def results(self):
        files = get_all_results()
        return render_template("results.html", files=files)

    @cherrypy.expose
    def analyze(self):
        myString=menuheader('analyze',0)
        myString=myString+analysis_launch_table()
        return myString

    @cherrypy.expose
    def analysis(self, func="", samp="", ref="", new="False"):
        if func == "":
            analysis_runs = [run for run in get_runs() if not any(pattern in run
                for pattern in ANALYSIS_EXCLUSION_PATTERNS)]
            annotations = [a.replace(".xlsx", "")
                for a in reference_annotations()]
            return render_template(
                "analysis_start.html",
                analysis_runs=analysis_runs,
                annotations=annotations,
            )
        if func == "cnv":
            genome = ReferenceGenome()
            genes = genome.genes.name.to_list()
            return render_template(
                "analysis_cnv.html",
                sample_name=samp,
                genes=genes,
                new=new,
            )
        if func == "umap":
            return render_template(
                "analysis_umap.html",
                sample_name=samp,
                reference_name=ref,
                new=new,
                first_use=not binary_reference_data_exists(),
            )
        else:
            raise cherrypy.HTTPError(404, "URL not found")

    @cherrypy.expose
    def cnv(self, samp, genes="", new="False"):
        t0=time.time()
        print("NEW**********************",new)
        try:
            cnv_plt_data = CNVData(samp)
        except FileNotFoundError:
            raise cherrypy.HTTPError(405, "URL not allowed")

        def make_plot(cnv_data, lock):
            """Plot function for multiprocessing."""
            # The context manager releases the lock even if plotting fails.
            with lock:
                if not cnv_data.files_on_disk() or new == "True":
                    cnv_data.make_cnv_plot()

        proc = mp.Process(
            target=make_plot,
            args=(cnv_plt_data, UserInterface.cnv_lock),
            name="cnv",
        )
        proc.start()
        proc.join()
        cnv_plt_data.read_from_disk()
        print("CNV=====================", time.time()-t0)

        return cnv_plt_data.plot_cnv_and_genes([genes])

    @cherrypy.expose
    def umap(self, samp, ref, close_up="", new="False"):
        t0=time.time()
        try:
            umap_data = UMAPData(samp, ref)
        except FileNotFoundError:
            raise cherrypy.HTTPError(405, "URL not allowed")

        def make_plot(plt_data, lock):
            """Plot function for multiprocessing."""
            # The context manager releases the lock even if plotting fails.
            with lock:
                if not plt_data.files_on_disk() or new == "True":
                    plt_data.make_umap_plot()

        proc = mp.Process(
            target=make_plot,
            args=(umap_data, UserInterface.umap_lock),
            name="umap",
        )
        proc.start()
        proc.join()
        umap_data.read_from_disk()

        if close_up == "True":
            return umap_data.cu_plot_json

        return umap_data.plot_json

    @cherrypy.expose
    def umapplot(self, sampleName=None, refAnno=None):
        html = ""
        if sampleName and refAnno:
            while UserInterface.umapQueue > 0:
                time.sleep(2)
            UserInterface.umapQueue += 1
            reference_name = refAnno.replace(".xlsx", "")
            try:
                make_umap_plot(sampleName, reference_name)
                html_error = ""
            except Exception: # TODO narrow down to the expected exception
                html_error = """
                    <b>
                    <font color='#FF0000'>ERROR OCCURRED, PLEASE RELOAD TAB
                    </font>
                    </b>"""
            html += f"""
                <html>
                <head>
                <title>
                    {sampleName} against {refAnno} at {date_time_string_now()}
                </title>
                <meta http-equiv='refresh' content='1;
                    URL=reports/{sampleName}_{reference_name}_UMAP_all.html'>
                </head>
                <body>
                {html_error}
                Loading UMAP plot. If it fails,
                <a href='reports/{sampleName}_{reference_name}_UMAP_all.html'>
                    click here to load plot
                </a>.
                </body>
                </html>"""
            UserInterface.umapQueue -= 1
        return html

    @cherrypy.expose
    def make_pdf(self, samp=None, ref=None):
        path = os.path.join(NANODIP_REPORTS, samp + "_cpgcount.txt")
        with open(path, "r") as f:
            overlap_cnt = f.read()

        path = os.path.join(NANODIP_REPORTS, samp + "_alignedreads.txt")
        with open(path, "r") as f:
            read_numbers = f.read()

        cnv_path = os.path.join(NANODIP_REPORTS, samp + "_CNVplot.png") #TODO png
        umap_path = os.path.join(
            NANODIP_REPORTS,
            samp + "_" + ref + "_UMAP_top.png",
        )

        html_report = render_template(
            "pdf_report.html",
            sample_name=samp,
            sys_name=socket.gethostname(),
            date=date_time_string_now(),
            barcode=predominant_barcode(samp),
            reads=read_numbers,
            cpg_overlap_cnt=overlap_cnt,
            reference=ref,
            cnv_path=cnv_path,
            umap_path=umap_path,
        )
        report_name = samp + "_" + ref + "_NanoDiP_report.pdf"
        report_path = os.path.join(NANODIP_REPORTS, report_name)
        convert_html_to_pdf(html_report, report_path)
        raise cherrypy.HTTPRedirect(os.path.join("reports", report_name))

    @cherrypy.expose
    def about(self):
        return render_template("about.html")

    @cherrypy.expose
    def DeviceStatusLive(self,deviceString=""):
        currentFlowCellId=getFlowCellID(deviceString)
        myString="<html><head><title>"+deviceString+": "+currentFlowCellId+"</title>"
        try:
            myString=myString+"<meta http-equiv='refresh' content='2'>"
            if getRealDeviceActivity(deviceString)=="sequencing":
                myString=myString+"<body bgcolor='#00FF00'>"
            else:
                myString=myString+"<body>"
            myString=myString+"<b>"+deviceString+": "+currentFlowCellId+"</b><br><tt>"
            myString=myString+getMinKnowApiStatus(deviceString)
        except Exception:
            myString=myString+"<br>No previous device activity, information will appear as soon as the device has been running once in this session.<br>"
        myString=myString+"Sample ID: "+getThisRunSampleID(deviceString)+"<br>"
        myString=myString+getThisRunState(deviceString)
        myString=myString+"<br>"+getThisRunYield(deviceString)
        myString=myString+"</tt></body></html>"
        return myString

    @cherrypy.expose
    def AnalysisStatusLive(self,deviceString=""):
        myString=""
        if deviceString:
            myString=livePage(deviceString)
        return myString

    @cherrypy.expose
    def analysisLauncher(self,functionName="",sampleName="",refAnno=""):
        if functionName and sampleName and refAnno:
            myString="<html><head><title>"+sampleName+" "+functionName+"</title></head><body>"
            myString=myString+functionName+" launched for "+sampleName+" "
            if refAnno!="None":
                myString=myString+"against "+refAnno
            myString=myString+" at "+date_time_string_now()+". "
            myString=myString+"Frame below will display result upon completion, if this tab/window is kept open."
            if refAnno=="None":
                myString=myString+"<br><iframe src='./"+functionName+"?sampleName="+sampleName+"' height='95%' width='100%' title='"+sampleName+"' border=3></iframe>"
            else:
                myString=myString+"<br><iframe src='./"+functionName+"?sampleName="+sampleName+"&refAnno="+refAnno+"' height='95%' width='100%' title='"+sampleName+"' border=3></iframe>"
        else:
            myString="Nothing to launch. You may close this tab now."
        return myString

    @cherrypy.expose
    def analysisPoller(self,sampleName="",deviceString="",runId=""):
        myString="<html><head>"
        if sampleName and deviceString and runId:
            myString=myString+"<title>Poller: "+sampleName+"/"+deviceString+"/"+runId+"</title>"
            myString=myString+"<meta http-equiv='refresh' content='15'>"
            myString=myString+"<body>"
            myString=myString+"Last refresh for "+sampleName+"/"+deviceString+"/"+runId+" at "+date_time_string_now()
            myString=myString+"</body></html>"
            writeRunTmpFile(sampleName,deviceString)
        return myString

    @cherrypy.expose
    def methylationPoller(self,sampleName=""):
        while UserInterface.cpgQueue>0:
            time.sleep(2)
        UserInterface.cpgQueue+=1
        myString=methcallLivePage(sampleName)
        UserInterface.cpgQueue-=1
        return myString

    @cherrypy.expose
    def launchAutoTerminator(self,sampleName="",deviceString=""):
        myString="ERROR"
        if sampleName and deviceString:
            myString=thisRunWatcherTerminator(deviceString,sampleName)
        return myString

In [None]:
def main():
    # Start CherryPy Webserver
    if DEBUG_MODE:
        #set access logging
        cherrypy.log.screen = True
        cherrypy.config.update({'log.screen': True})
    else:
        #set access logging
        cherrypy.log.screen = False
        cherrypy.config.update({'log.screen': False})
        cherrypy.config.update({ "environment": "embedded" })

    print(f"NanoDiP server running at http://{CHERRYPY_HOST}:{CHERRYPY_PORT}")

    cherrypy_config = {
        '/favicon.ico': {
            'tools.staticfile.on': True,
            'tools.staticfile.filename': BROWSER_FAVICON,
        },
        '/img': {
            'tools.staticdir.on': True,
            'tools.staticdir.dir': IMAGES,
        },
        '/reports': {
            'tools.staticdir.on': True,
            'tools.staticdir.dir': NANODIP_REPORTS,
        },
        '/static': {
            'tools.staticdir.on': True,
            'tools.staticdir.dir': os.path.join(os.getcwd(), "static"),
        },
    }
    cherrypy.quickstart(UserInterface(), "/", cherrypy_config)

if __name__ == "__main__":
    main()


### ^^^ LIVE LOG ABOVE ^^^
All CherryPy access will be logged here, including live progress bars for
computationally intense analyses. Detailed access logging is turned off by
default (DEBUG_MODE is False), but can be turned on, e.g., for debugging,
in the configuration section at the beginning of this notebook. While it is not
required to have a look at these messages during normal operation, information
contained in the log may be helpful in troubleshooting. Line numbers in error
messages indicated here typically match those given in the respective Jupyter
Notebook cells.
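
The logging switch can be summarized in a small helper. This is a sketch that
mirrors the two branches of `main()` in the cell above; the function name
`logging_config` is hypothetical and not part of NanoDiP. It only builds the
settings dictionary, which would then be passed to `cherrypy.config.update`.

```python
def logging_config(debug_mode):
    """Build the global CherryPy settings that control access logging.

    Mirrors main(): screen logging follows the debug flag, and the quiet
    "embedded" environment is selected only when debugging is off.
    """
    config = {"log.screen": bool(debug_mode)}
    if not debug_mode:
        config["environment"] = "embedded"
    return config


quiet = logging_config(False)   # default: no detailed access log
verbose = logging_config(True)  # debugging: access log shown in this cell
# In the notebook these would be applied with cherrypy.config.update(quiet).
```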

To preserve these messages, halt the Python kernel, then save and close the
notebook before sending it for support. This ensures that the code as well as
the error messages are preserved.

To launch the user interface, wait until you see a pink log entry stating that
the web server has started, then navigate to http://localhost:8080.
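
Instead of watching the log, one could also poll the port until the server
accepts connections. The sketch below uses only the standard library;
`wait_for_port` is a hypothetical helper that is not part of NanoDiP, and its
defaults assume the host and port given in the address above.

```python
import socket
import time


def wait_for_port(host="localhost", port=8080, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Server not reachable yet; wait and try again.
            time.sleep(interval)
    return False
```

A call such as `wait_for_port()` from a separate Python session would return
`True` as soon as the web server is listening. It only checks that the port is
open, not that the page renders correctly.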
