# ColabBE: automated base editing analysis for ClinVar variants

Written by Angus Li with contributions from Alvin Hsu

Editing predictions based on BE-Hive by Max W. Shen (https://www.crisprbehive.design/), an implementation of machine learning models for base editing outcomes described in:

**Arbab M**, **Shen MW**, Mok B, et al. Determinants of Base Editing Outcomes from Target Library Analysis and Machine Learning. *Cell*. 2020;182(2):463-480.e30. doi:10.1016/j.cell.2020.05.037

Cas9 variant recommendations based on HT-PAMDA data courtesy of Rachel E. Silverstein and Benjamin P. Kleinstiver, as described in:

**Silverstein RA**, Kim N, Kroell AS, et al. Custom CRISPR-Cas9 PAM variants via scalable engineering and machine learning. *Nature*. Published online April 22, 2025. doi:10.1038/s41586-025-09021-y

For coding mutations, ColabBE analyzes base editing strategies that are expected to correct the mutated codon to any codon for the reference amino acid. For non-coding mutations, ColabBE analyzes base editing strategies that are expected to revert the mutated base to the reference sequence.
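
As a rough illustration of the codon-level rule above, the sketch below checks which base-conversion classes turn a mutant codon into any codon for the reference amino acid. It is not the notebook's implementation: `CODON_TABLE` and `corrective_edits` are hypothetical names and the standard genetic code is assumed. Like the notebook's `testReversion` helper, it converts every matching base in the codon at once rather than modeling individual edits.

```python
from itertools import product

# Standard genetic code in NCBI "TCAG" order (assumption: no alternative tables).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Conversions reachable by adenine or cytosine base editing on either strand,
# written relative to the coding strand.
CONVERSIONS = {"A>G": ("A", "G"), "C>T": ("C", "T"), "T>C": ("T", "C"), "G>A": ("G", "A")}


def corrective_edits(mut_codon, ref_aa):
    """Return (conversion, edited codon) pairs that restore the reference amino acid."""
    hits = []
    for label, (src, dst) in CONVERSIONS.items():
        edited = mut_codon.replace(src, dst)
        # Skip conversions that leave the codon unchanged.
        if edited != mut_codon and CODON_TABLE[edited] == ref_aa:
            hits.append((label, edited))
    return hits


# Thr (ACG) mutated to Met (ATG): only T>C restores threonine.
thr_to_met = corrective_edits("ATG", "T")
# Hypothetical Leu (CTA) mutated to Pro (CCA): C>T yields TTA, a different codon
# that still encodes leucine, so it counts as a correction.
leu_to_pro = corrective_edits("CCA", "L")
print(thr_to_met, leu_to_pro)
```

The second case shows why the rule is stated in terms of the amino acid rather than the exact codon: a synonymous codon is accepted for coding mutations, whereas a non-coding mutation must return to the reference base itself.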

**Instructions**:

- First run the section titled "Install/import libraries and define functions" by clicking the "Play" button to the left.
- The optional "Search ClinVar for ClinVar IDs" section can be used to find ClinVar IDs of interest using more general search terms.
- Specify all parameters under "Perform analysis and download results", then start the analysis by clicking the "Play" button to the left.

**Results interpretation**:

- When analysis is finished, a zip archive is created, containing all output files.
- For each ClinVar ID with potential corrective strategies, two files are provided: `...predicted base editing statistics.xlsx` and `...predicted sequence distributions.xlsx`.
- `...predicted base editing statistics.xlsx` contains the editor, 20-nucleotide spacer, and 4-nucleotide PAM for each analyzed base editing strategy, as well as the fraction of predicted perfect correction, Z-score, percentile, and predicted target efficiency given the specified `average_editing_efficiency`. For coding mutations, perfect correction is defined as the fraction of **edited** sequencing reads for which the corresponding translated protein sequence is reverted to the same as the reference sequence. For non-coding mutations, perfect correction is defined as the fraction of **edited** sequencing reads for which the DNA sequence exactly matches the reference sequence. Cas9 variant recommendations and corresponding HT-PAMDA scores are also provided based on the PAM sequence.
- `...predicted sequence distributions.xlsx` contains the predicted fraction of sequences resulting from each analyzed editing strategy. Each editor, spacer, and PAM combination is a separate sheet in the file.
- A file named `Summary of editor candidates above threshold.xlsx` is also provided. It lists every editor candidate whose predicted perfect correction is at or above the cutoff specified in `threshold`, along with the corresponding ClinVar ID and variant name.
- A file named `Variants with editor candidates above threshold.xlsx` is also provided. It lists every analyzed variant that has at least one editor candidate with predicted perfect correction at or above the cutoff specified in `threshold`.
- All output files are also stored in a timestamped folder on your Google Drive. This can be deleted after analysis is finished if space is limited.
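
To make the "perfect correction" and `threshold` definitions above concrete, here is a minimal sketch with made-up numbers. The function names `perfect_correction` and `above_threshold`, the example sequences, and the frequencies are all hypothetical. The `key` argument is an assumed hook: the identity gives the non-coding definition (exact DNA match), while passing a translation function such as Biopython's `Seq.translate` would approximate the coding definition (match on translated protein).

```python
def perfect_correction(outcomes, reference, key=lambda seq: seq):
    """Sum the predicted fractions of edited reads whose outcome matches the reference.

    outcomes: mapping of edited DNA sequence -> predicted fraction of edited reads.
    key: transform applied to both sequences before comparison.
    """
    target = key(reference)
    return sum(freq for seq, freq in outcomes.items() if key(seq) == target)


def above_threshold(stats, threshold):
    """Keep candidates whose perfect correction is at or above the cutoff."""
    return [row for row in stats if row["Perfect correction"] >= threshold]


# Hypothetical outcome distributions for two editing strategies.
reference = "ACGTACGT"
outcomes_a = {"ACGTACGT": 0.75, "ACGTGCGT": 0.25}
outcomes_b = {"ACGTACGT": 0.5, "ACGCACGT": 0.5}

stats = [
    {"Editor": "ABE8e", "Perfect correction": perfect_correction(outcomes_a, reference)},
    {"Editor": "BE4", "Perfect correction": perfect_correction(outcomes_b, reference)},
]
hits = above_threshold(stats, threshold=0.6)
print(hits)  # [{'Editor': 'ABE8e', 'Perfect correction': 0.75}]
```

Because the fractions are taken over edited reads only, a strategy can score high on perfect correction and still have low overall efficiency, which is why the statistics file reports predicted target efficiency separately.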

In [None]:
# Install/import libraries and define functions

import requests
from Bio import Seq
from Bio import Entrez
import io
import pandas as pd
from datetime import datetime
import os
import subprocess
import re
pam_dict = pd.read_excel('PAM_scores.xlsx', index_col = 'PAM').to_dict(orient='index')

def cvIDtoHg38Coords(cvID):
    ASM_MAP = {"NCBI36": "hg18", "GRCh37": "hg19", "GRCh38": "hg38", "T2T-CHM13v2.0": "hs1"}
    resp = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?",
                      params = {"db": "clinvar", "id": cvID, "retmode": "json"})
    if not resp.ok: raise Exception("Failed to query ClinVar")
    data = resp.json()
    if "result" not in data or cvID not in data['result']: raise Exception(f'Invalid ClinVar ID: {cvID}')
    varData = data["result"][cvID]["variation_set"][0]
    spdi = varData["canonical_spdi"]
    [start, ref, mut] = spdi.split(":")[1:]
    coord = [entry for entry in varData["variation_loc"] if entry['status'] == 'current'][0]
    [chrom, assembly_name] = [coord['chr'], coord['assembly_name']]
    return {'coords': {'assembly': ASM_MAP[assembly_name],
                                  'chrom': "chr" + chrom,
                                  'pos': start },
                        'alleles': { 'vName': varData["variation_name"],
                                   'ref': ref,
                                   'mut': mut },
                        'gene': data["result"][cvID]["gene_sort"]}

def coordsToRefSeq(coords):
    if 'start' in coords and 'end' in coords:
        start = coords['start']
        end = coords['end']
    else:
        start = int(coords['pos'])
        end = start + 1
    query = {'track': 'ccdsGene', 'genome': coords['assembly'], 'chrom': coords['chrom'],
                 'start': start, 'end': end}
    resp = requests.get("https://api.genome.ucsc.edu/getData/track", params = query)
    if not resp.ok: raise Exception(f'Failed to get RefSeq annotations from UCSC: {resp.status_code} {resp.text}')
    data = resp.json()
    if data['ccdsGene']: return data['ccdsGene'][0]
    return None

def fetchSequenceFromCoords(coords, contextLen):
    [assembly, chrom, pos] = [coords['assembly'], coords['chrom'], int(coords['pos'])]
    startPos = pos - contextLen
    endPos = pos + contextLen
    query = {'genome': assembly, 'chrom': chrom, 'start': startPos, 'end': endPos}
    resp = requests.get('https://api.genome.ucsc.edu/getData/sequence', params = query)
    if not resp.ok: raise Exception('Failed to fetch sequence from UCSC Genome Browser')
    data = resp.json()
    return data['dna']

def getContextExonTranslations(geneData, target, contextLen):
    target = int(target)
    contextStart = target - contextLen
    contextEnd = target + contextLen
    exonStarts = [int(n) for n in geneData["exonStarts"].split(',') if n.isdigit() and int(n) != 0]
    exonEnds = [int(n) for n in geneData["exonEnds"].split(',') if n.isdigit() and int(n) != 0]
    exonFrames = [int(n) for n in geneData["exonFrames"].split(',') if n.isdigit()]
    if geneData['strand'] == '+':
        exonStarts[0] = geneData['cdsStart']
        exonEnds[-1] = geneData['cdsEnd']
        exonFrames[0] = 0
    else:
        exonStarts[0] = geneData['cdsStart']
        exonEnds[-1] = geneData['cdsEnd']
        exonFrames[-1] = 0
    contentExons = []
    for i in range(len(exonStarts)):
        if exonStarts[i] <= contextEnd and exonEnds[i] >= contextStart:
            exonNumber = i + 1 if geneData['strand'] == '+' else len(exonStarts) - i
            startOffset = max(0, exonStarts[i] - contextStart)
            endOffset = min(exonEnds[i], contextEnd) - contextStart
            adjustedFrame = exonFrames[i]
            if exonStarts[i] < contextStart and geneData['strand'] == '+':
                distanceFromContextStart = contextStart - exonStarts[i]
                adjustedFrame = (exonFrames[i] + distanceFromContextStart) % 3
            elif exonEnds[i] > contextEnd and geneData['strand'] == '-':
                distanceFromContextStart = exonEnds[i] - contextEnd
                adjustedFrame = (exonFrames[i] + distanceFromContextStart) % 3
            contentExons.append({'name': f'{geneData["name2"]} Exon {exonNumber}',
                                 'start': startOffset, 'end': endOffset, 'direction': geneData['strand'],
                                 'frame': (3 - adjustedFrame) % 3})
    return contentExons

def testReversion(mutCodon, wtAA, verbose = True):
    candidates = []
    abe_plus = mutCodon.replace('A', 'G')
    if Seq.translate(abe_plus) == wtAA:
        if verbose: print(f'{mutCodon} could be corrected to {abe_plus}')
        candidates.append(['ABE8e', '+'])
    cbe_plus = mutCodon.replace('C', 'T')
    if Seq.translate(cbe_plus) == wtAA:
        if verbose: print(f'{mutCodon} could be corrected to {cbe_plus}')
        candidates.append(['BE4', '+'])
    abe_minus = mutCodon.replace('T', 'C')
    if Seq.translate(abe_minus) == wtAA:
        if verbose: print(f'{mutCodon} could be corrected to {abe_minus}')
        candidates.append(['ABE8e', '-'])
    cbe_minus = mutCodon.replace('G', 'A')
    if Seq.translate(cbe_minus) == wtAA:
        if verbose: print(f'{mutCodon} could be corrected to {cbe_minus}')
        candidates.append(['BE4', '-'])
    return candidates

def isTransition(allele, verbose = True):
    if allele['mut'] == 'A' and allele['ref'] == 'G':
        if verbose: print('A could be corrected to G')
        return ['ABE8e', '+']
    elif allele['mut'] == 'C' and allele['ref'] == 'T':
        if verbose: print('C could be corrected to T')
        return ['BE4', '+']
    elif allele['mut'] == 'T' and allele['ref'] == 'C':
        if verbose: print('T could be corrected to C')
        return ['ABE8e', '-']
    elif allele['mut'] == 'G' and allele['ref'] == 'A':
        if verbose: print('G could be corrected to A')
        return ['BE4', '-']
    return False

def behive(seq50,start,contextLen,editor,pred_efficiency,correctExon,strand=None):
    frame = None
    exonStart = 0
    exonEnd = 2 * contextLen
    if correctExon:
        frame = correctExon['frame']
        exonStart = correctExon['start']
        exonEnd = correctExon['end']
        if correctExon['direction'] == '+':
            if strand == '+': frame_shifted = (exonStart + frame - start) % 3 + 1
            else: frame_shifted = (-1*frame - ((exonEnd-exonStart)%3) + (2*contextLen-exonEnd)+start) % 3 + 1
        elif correctExon['direction'] == '-':
            if strand == '-': frame_shifted = (-1 * exonStart + ((exonEnd-exonStart)%3)+ frame + start) % 3 + 1
            else: frame_shifted = (-1*frame  - (2*contextLen-exonEnd)-start) % 3 + 1
    else:
        frame_shifted = 1
        strand = '+'

    URL = 'https://www.crisprbehive.design'

    dash_update = f'{URL}/_dash-update-component'

    payload = {
    "output":"S_csv_download_link.href",
    "changedPropIds":[
        "S_hidden_pred_signal_bystander.children"
    ],
    "inputs":[
        {
            "id":"S_hidden_pred_signal_bystander",
            "property":"children",
            "value":f"{seq50},{editor},mES"
        },
        {
            "id":"S_hidden_chosen_aa_frame",
            "property":"children",
            "value":f"{frame_shifted},{strand}"
        }
    ]
    }

    headers = {
        "Content-Type": "application/json"
    }

    resp = requests.post(dash_update, headers=headers, json=payload)
    href = resp.json()['response']['props']['href']
    resp = requests.get(f'{URL}{href}')
    df = pd.read_csv(io.StringIO(resp.text), index_col=0)

    payload = {
        "output":"S_efficiency_longtext.children",
        "changedPropIds":[
            "S_hidden_pred_signal_efficiency.children"
        ],
        "inputs":[
            {
                "id":"S_slider_efficiency_mean",
                "property":"value",
                "value":pred_efficiency
            },
            {
                "id":"S_hidden_pred_signal_efficiency",
                "property":"children",
                "value":f"{seq50},{editor},mES"
            }
        ]
    }
    resp = requests.post(dash_update, headers=headers, json=payload)
    href = resp.json()
    for block in href['response']['props']['children']:
        s = block['props']['children']
        if isinstance(s, str):
            if 'Predicted Z-score:' in s:
                z = float(s[19:])
            elif 'Percentile:' in s:
                percentile = float(s[-4:])
            elif 'then this target\'s efficiency is' in s:
                target_efficiency = float(s[-6:-2])
    return {'df':df, 'z':z, 'percentile':percentile, 'target_efficiency':target_efficiency}

def analyze(cvId, verbose = True, pred_efficiency = 0.5):
    contextLen = 35
    results = {}
    analysis = {}
    stats_dict = []
    a = cvIDtoHg38Coords(cvId)
    vname = a['alleles']['vName']
    print(f'ClinVar ID: {cvId}')
    print(f'Variant name: {vname}')
    if len(a['alleles']['ref']) != 1 or len(a['alleles']['mut']) != 1:
        print(f'The mutation with ClinVar ID {cvId} is not a single nucleotide variant.')
        # Return three values so the caller's three-way unpacking does not fail.
        return [analysis, stats_dict, vname]
    geneData = coordsToRefSeq(a['coords'])
    seq = fetchSequenceFromCoords(a['coords'], contextLen)
    seq = seq.upper()
    mut = seq[:contextLen] + a['alleles']['mut'] + seq[contextLen+1:]
    if geneData: c = getContextExonTranslations(geneData, a['coords']['pos'], contextLen)
    else: c = {}
    correctExon = None
    for exon in c:
        seqSlice = seq[exon['start']:exon['end']]
        mutSlice = mut[exon['start']:exon['end']]
        if exon['direction'] == '-':
            seqSlice = Seq.reverse_complement(seqSlice)
            mutSlice = Seq.reverse_complement(mutSlice)
        seqTranslation = Seq.translate(seqSlice[exon['frame']:len(seqSlice)-((len(seqSlice)-exon['frame'])%3)])
        mutTranslation = Seq.translate(mutSlice[exon['frame']:len(seqSlice)-((len(seqSlice)-exon['frame'])%3)])
        if verbose:
            print('WT:')
            print(seqSlice)
            print(' ' * exon['frame'] + ''.join([' ' + seqTranslation[aa] + ' ' for aa in range(len(seqTranslation))]))
            print(Seq.complement(seqSlice))
            print('Variant:')
            print(mutSlice)
            print(' ' * exon['frame'] + ''.join([' ' + mutTranslation[aa] + ' ' for aa in range(len(mutTranslation))]))
            print(Seq.complement(mutSlice))
        if contextLen < exon['end'] and contextLen > exon['start']:
            correctExon = exon
            if exon['direction'] == '+':
                pos = contextLen - exon['start']
            elif exon['direction'] == '-':
                pos = (exon['end'] - 1) - contextLen
            wtCodon = seqSlice[pos - ((pos - exon['frame']) % 3):pos - ((pos - exon['frame']) % 3) + 3]
            mutCodon = mutSlice[pos - ((pos - exon['frame']) % 3):pos - ((pos - exon['frame']) % 3) + 3]
            wtAA = Seq.translate(wtCodon)
            mutAA = Seq.translate(mutCodon)
            if verbose: print(f'The wild-type codon {wtCodon}, coding for {wtAA}, is mutated to {mutCodon}, coding for {mutAA}')
    if correctExon:
        candidates = testReversion(mutCodon, wtAA, verbose)
        if candidates:
            for candidate in candidates:
                [editor, strand] = candidate
                if strand == correctExon['direction']:
                    for start in range(17):
                        seq50 = mut[start:start+50]
                        if verbose: print(f'Sending {seq50} (top strand) to BE-Hive with editor {editor}')
                        results[(cvId, start, editor, strand)] = behive(seq50, start, contextLen, editor, pred_efficiency, correctExon, strand)
                else:
                    mut_rc = Seq.reverse_complement(mut)
                    for start in range(17):
                        seq50 = mut_rc[start:start+50]
                        if verbose: print(f'Sending {seq50} (bottom strand) to BE-Hive with editor {editor}')
                        results[(cvId, start, editor, strand)] = behive(seq50, start, contextLen, editor, pred_efficiency, correctExon, strand)
            seqSlice = seq[correctExon['start']:correctExon['end']]
            if correctExon['direction'] == '-':
                seqSlice = Seq.reverse_complement(seqSlice)
            seqTranslation = Seq.translate(seqSlice[correctExon['frame']:len(seqSlice)-((len(seqSlice)-correctExon['frame'])%3)])
            for entry in results:
                (cvId, start, editor, strand) = entry
                df = results[entry]['df']
                z = results[entry]['z']
                percentile = results[entry]['percentile']
                target_efficiency = results[entry]['target_efficiency']
                perfect_correction = 0
                if strand == correctExon['direction']:
                    spacer = mut[start+20:start+40]
                    pam = mut[start+40:start+44]
                else:
                    mut_rc = Seq.reverse_complement(mut)
                    spacer = mut_rc[start+20:start+40]
                    pam = mut_rc[start+40:start+44]
                params = (editor, spacer, pam)
                analysis[params] = {}
                for p, genotype in zip(df['Predicted frequency'], df['Genotype']):
                    if strand == correctExon['direction']:
                        mut_edited = mut[:start] + genotype + mut[start+50:]
                    else:
                        mut_rc_edited = mut_rc[:start] + genotype + mut_rc[start+50:]
                        mut_edited = Seq.reverse_complement(mut_rc_edited)
                    mutSlice = mut_edited[correctExon['start']:correctExon['end']]
                    if correctExon['direction'] == '-': mutSlice = Seq.reverse_complement(mutSlice)
                    mutTranslation = Seq.translate(mutSlice[correctExon['frame']:len(seqSlice)-((len(seqSlice)-correctExon['frame'])%3)])
                    if seqTranslation == mutTranslation:
                        perfect_correction += p
                    if mutTranslation not in analysis[params]: analysis[params][mutTranslation] = 0
                    analysis[params][mutTranslation] += p
                stats_dict.append({'ClinVar ID': cvId, 'Variant name': vname, 'Editor': editor, 'Spacer': spacer, 'PAM': pam, 'Perfect correction': perfect_correction,
                                     'Z-score': z, 'Percentile': percentile, 'Target efficiency': target_efficiency})
        else: print(f'The coding mutation with ClinVar ID {cvId} is unlikely to be correctable via canonical base editing. Try prime editing instead.')
    else:
        transition = isTransition(a['alleles'], verbose)
        if transition:
            [editor, strand] = transition
            if strand == '+':
                for start in range(17):
                    seq50 = mut[start:start+50]
                    if verbose: print(f'Sending {seq50} (top strand) to BE-Hive with editor {editor}')
                    results[(cvId, start, editor, strand)] = behive(seq50, start, contextLen, editor, pred_efficiency, correctExon)
            elif strand == '-':
                for start in range(17):
                    seq50 = Seq.reverse_complement(mut)[start:start+50]
                    if verbose: print(f'Sending {seq50} (bottom strand) to BE-Hive with editor {editor}')
                    results[(cvId, start, editor, strand)] = behive(seq50, start, contextLen, editor, pred_efficiency, correctExon)
            for entry in results:
                (cvId, start, editor, strand) = entry
                df = results[entry]['df']
                z = results[entry]['z']
                percentile = results[entry]['percentile']
                target_efficiency = results[entry]['target_efficiency']
                perfect_correction = 0
                if strand == '+':
                    spacer = mut[start+20:start+40]
                    pam = mut[start+40:start+44]
                elif strand == '-':
                    mut_rc = Seq.reverse_complement(mut)
                    spacer = mut_rc[start+20:start+40]
                    pam = mut_rc[start+40:start+44]
                params = (editor, spacer, pam)
                analysis[params] = {}
                for p, genotype in zip(df['Predicted frequency'], df['Genotype']):
                    if strand == '+':
                        mut_edited = mut[:start] + genotype + mut[start+50:]
                    elif strand == '-':
                        mut_rc_edited = mut_rc[:start] + genotype + mut_rc[start+50:]
                        mut_edited = Seq.reverse_complement(mut_rc_edited)
                    if seq == mut_edited:
                        perfect_correction += p
                    if mut_edited not in analysis[params]: analysis[params][mut_edited] = 0
                    analysis[params][mut_edited] += p
                stats_dict.append({'ClinVar ID': cvId, 'Variant name': vname, 'Editor': editor, 'Spacer': spacer, 'PAM': pam, 'Perfect correction': perfect_correction,
                                   'Z-score': z, 'Percentile': percentile, 'Target efficiency': target_efficiency})
        else: print(f'The non-coding mutation with ClinVar ID {cvId} is not accessible to canonical base editing. Try prime editing instead.')
    print('Done')
    return [analysis, stats_dict, vname]

In [None]:
# Search ClinVar for ClinVar IDs
# This section can be used to find ClinVar IDs for mutations of interest. It can be run as many times as needed before starting analysis by clicking the "Play" button to the left. Before running for the first time, also execute the section above.
email = 'liangus@broadinstitute.org'
# An email address is required for Entrez searches for ClinVar records. We do not store your email.
search_term = "KIF1A L249P"
max_record_number = 5
if search_term:
  Entrez.email = email
  handle = Entrez.esearch(db='clinvar', retmax = max(1, max_record_number), term=search_term)
  record = Entrez.read(handle)
  cvIds_search = record['IdList']
  post = Entrez.read(Entrez.epost('clinvar', id=','.join(cvIds_search)))
  webenv = post['WebEnv']
  query_key = post['QueryKey']
  records = Entrez.read(Entrez.esummary(db='clinvar', query_key=query_key, WebEnv=webenv), validate = False)
  df = pd.DataFrame.from_dict(records['DocumentSummarySet']['DocumentSummary'])
  df = df[['title','gene_sort','chr_sort','location_sort','obj_type','protein_change','germline_classification']]
  df.insert(0, 'ClinVar ID', cvIds_search)
  df['germline_classification'] = df['germline_classification'].apply(lambda x: x['description'])
  display(df)

Unnamed: 0,ClinVar ID,title,gene_sort,chr_sort,location_sort,obj_type,protein_change,germline_classification
0,986214,NM_001244008.2(KIF1A):c.746T>C (p.Leu249Pro),KIF1A,2,240783791,single nucleotide variant,L249P,Pathogenic/Likely pathogenic


In [None]:
# Perform analysis and download results
# When all parameters have been specified, start analysis by clicking the "Play" button to the left. A CPU runtime is sufficient.
email = 'liangus@broadinstitute.org' 
# An email address is required for Entrez searches for ClinVar records. We do not store your email.
mode = 'Enter ClinVar IDs directly' 
# Inputs for modes other than the one selected will be ignored. Excel and CSV files are supported for upload. If applicable, you will be prompted to upload after clicking the "Play" button to the left.
spreadsheet_has_headers = True 
# If you are uploading a spreadsheet with column headers, select `spreadsheet_has_headers`.
spreadsheet_column = 'VariationID' 
# If you are uploading a spreadsheet with column headers, specify the name of the column containing ClinVar IDs here. If your spreadsheet does not contain headers, the first column of the spreadsheet will automatically be used to retrieve the list of ClinVar IDs.
gene = 'ATM' 
# To search for corrective base editing strategies for all pathogenic/likely pathogenic single nucleotide variants in a gene documented in ClinVar (may take a long time), use the "Enter a gene name" mode.
clinVar_ids = '30169, 986214' 
# ClinVar IDs found using the search form above can be entered here. ClinVar IDs can be separated by any mix of spaces, commas, and/or semicolons. This list will be used if the "Enter ClinVar IDs directly" mode is selected.
average_editing_efficiency = 0.5 # @param {type:"slider", min:0.01, max:0.99, step:0.01}
# Specify the average editing efficiency of your base editor in `average_editing_efficiency`. This only affects the predicted editing efficiency at each target and will not alter the list of suggested editing strategies.
threshold = 0.6 # @param {type:"slider", min:0.01, max:0.99, step:0.01}
# Use `threshold` to specify the cutoff threshold for predicted edit purity required for inclusion in the abbreviated summary of high-potential editor candidates. For coding mutations, purity is measured by exact match to reference translated protein sequence. For non-coding mutations, purity is measured by exact DNA sequence match to reference. This does not alter the full editor analysis that is also provided.
verbose_output = True # @param {type:"boolean"}
# Selecting `verbose_output` will provide a running summary of edit strategies being tested. If deselected, only the current ClinVar ID and variant name being tested will be displayed.

if mode == "Enter ClinVar IDs directly":
  # Drop empty strings left by leading or trailing separators.
  cvIds = [cvId for cvId in re.split(r'[ ,;\t]+', clinVar_ids.strip()) if cvId]
elif mode == "Enter a gene name":
  Entrez.email = email
  handle = Entrez.esearch(db='clinvar', retmax = 5000, term=f'((("{gene}"[Gene Name]) AND "single nucleotide variant"[Type of variation]) AND ("clinsig pathogenic"[Filter] OR "clinsig likely path"[Filter]))')
  record = Entrez.read(handle)
  cvIds = record['IdList']
elif mode == "Upload a spreadsheet containing ClinVar IDs":
  fname = ''
  ext = fname.split('.')[-1]
  if ext == 'xlsx' or ext == 'xls':
    if not spreadsheet_has_headers:
      df = pd.read_excel(fname, header = None)
      cvIds = df.iloc[:, 0].to_list()
    else:
      df = pd.read_excel(fname)
      cvIds = df[spreadsheet_column].to_list()
  elif ext == 'csv':
    if not spreadsheet_has_headers:
      df = pd.read_csv(fname, header = None)
      cvIds = df.iloc[:, 0].to_list()
    else:
      df = pd.read_csv(fname)
      cvIds = df[spreadsheet_column].to_list()
  else: raise Exception('Unsupported file type')
  cvIds = [str(cvId) for cvId in cvIds]
elif mode == "Use all ClinVar search results from above":
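  # Stop here if no search has been run; otherwise cvIds would be undefined below.
  try: cvIds = cvIds_search
  except NameError: raise Exception('No ClinVar search was run in the current session. Run a ClinVar search and try again.')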


homedir = '/content/gdrive/MyDrive'
timestamp = datetime.now().strftime('%Y-%m-%d %HH %MM %SS')
os.mkdir(os.path.join(homedir,f'ColabBE {timestamp}'))
run = 1
candidates = []
hits = {}
for cvId in cvIds:
    print(f'Run {run} of {len(cvIds)}')
    [sequence_distributions, stats_dict, vname] = analyze(cvId, verbose_output, average_editing_efficiency)
    if stats_dict:
        with pd.ExcelWriter(os.path.join(homedir,f'ColabBE {timestamp}',f'{cvId} ({vname}) predicted sequence distributions.xlsx')) as seq_dist:
            for entry in sequence_distributions:
                (editor, spacer, pam) = entry
                df_seq = pd.DataFrame.from_dict(sequence_distributions[entry], orient='index', columns=['Frequency'])
                df_seq.to_excel(seq_dist, sheet_name=f'{editor} {spacer}|{pam}', index_label='Sequence')
        for entry in stats_dict:
            if entry['PAM'] in pam_dict:
              entry['HT-PAMDA score'] = pam_dict[entry['PAM']]['max_score']
              entry['Recommended Cas9 variant'] = pam_dict[entry['PAM']]['best_PAM']
            else:
              entry['HT-PAMDA score'] = '< -2.5'
              entry['Recommended Cas9 variant'] = 'Non-SpCas9'
            if entry['Perfect correction'] >= threshold:
                candidates.append(entry)
                hits[entry['ClinVar ID']] = entry['Variant name']
        df_stats = pd.DataFrame.from_dict(stats_dict)
        df_stats.to_excel(os.path.join(homedir,f'ColabBE {timestamp}',f'{cvId} ({vname}) predicted base editing statistics.xlsx'))
    run += 1
df_summary = pd.DataFrame.from_dict(candidates)
df_summary.to_excel(os.path.join(homedir,f'ColabBE {timestamp}',f'Summary of editor candidates above threshold.xlsx'))
df_hits = pd.DataFrame.from_dict(hits, orient='index', columns=['Variant name'])
df_hits.to_excel(os.path.join(homedir,f'ColabBE {timestamp}',f'Variants with editor candidates above threshold.xlsx'), index_label='ClinVar ID')
subprocess.run(['zip', '-r', os.path.join(homedir, f'ColabBE {timestamp}.zip'), os.path.join(homedir, f'ColabBE {timestamp}')])

Run 1 of 2
ClinVar ID: 30169
Variant name: NM_001244008.2(KIF1A):c.296C>T (p.Thr99Met)
WT:
GGATACAACGTGTGCATCTTCGCCTATGGGCAGACGGGTGCCGGCAAGTCCTACACCATGATGGGCAAGC
 G  Y  N  V  C  I  F  A  Y  G  Q  T  G  A  G  K  S  Y  T  M  M  G  K 
CCTATGTTGCACACGTAGAAGCGGATACCCGTCTGCCCACGGCCGTTCAGGATGTGGTACTACCCGTTCG
Variant:
GGATACAACGTGTGCATCTTCGCCTATGGGCAGATGGGTGCCGGCAAGTCCTACACCATGATGGGCAAGC
 G  Y  N  V  C  I  F  A  Y  G  Q  M  G  A  G  K  S  Y  T  M  M  G  K 
CCTATGTTGCACACGTAGAAGCGGATACCCGTCTACCCACGGCCGTTCAGGATGTGGTACTACCCGTTCG
The wild-type codon ACG, coding for T, is mutated to ATG, coding for M
ATG could be corrected to ACG
Sending GCTTGCCCATCATGGTGTAGGACTTGCCGGCACCCATCTGCCCATAGGCG (top strand) to BE-Hive with editor ABE8e
Sending CTTGCCCATCATGGTGTAGGACTTGCCGGCACCCATCTGCCCATAGGCGA (top strand) to BE-Hive with editor ABE8e
Sending TTGCCCATCATGGTGTAGGACTTGCCGGCACCCATCTGCCCATAGGCGAA (top strand) to BE-Hive with editor ABE8e
Sending TGCCCATCATGGTGTAGGACTTGCCGGCACCCATCTGCCCATAGGCGAAG (top strand) 
