## Pooling for Sequencing

### Converts a confusing spreadsheet to an automated pooling protocol

### Goals:

* Create a script that calculates how to pool samples evenly or unevenly for different sequencers

* Provide a GUI drop-down menu of sequencer options for the user to choose from (each has a different total read capacity and loading requirements, so the selection must be linked to those details in some way: library, dictionary, etc.)

* Report the total concentration of the final pool to determine dilution needs, if any

* Ask whether the user wants PhiX and at what percentage, then calculate how many reads go to PhiX and how many are left over for samples

* Ask the user how many libraries are being sequenced

* For each library, ask for the number of samples, the number of reads per sample, the concentration (ng/uL), the fraction of the library between 200 and 1000 base pairs long, and the average base pair length

* From that input, calculate MW, nmol/uL, nM, the total number of reads needed, and each library's proportion of the total pool

* Ask the user for the minimum volume they are willing to pipette

* After the user has entered all libraries and their details, calculate the volume of each library to add (uL). If any volume falls below the minimum pipetting volume, rescale the volumes proportionally so that each one meets that minimum.

* Provide barcoding instructions that depend on the sequencer

* Finally, calculate the dilutions or total volume needed to meet each sequencer's requirements


* Perhaps add a second GUI at the end that lets the user choose a library to edit, i.e. "Which library do you want to change?" and then "Which pooling aspect do you want to change?"
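
The minimum-pipetting goal is not implemented in the cells below, so here is a minimal sketch of one way it could work. It assumes each library's relative volume is its read share divided by its molar concentration, and that all volumes are scaled so the smallest one equals the minimum. The function name `pooling_volumes` and the example libraries are hypothetical; the keys `proportion_of_pool_total` and `nM` match the ones the notebook stores.

```python
def pooling_volumes(libraries, min_pipette_uL):
    """Return {library name: volume in uL} with every volume >= min_pipette_uL.

    Assumption: the amount of each library in the pool should be proportional
    to its share of the reads, so relative volume = read share / nM.
    """
    relative = {
        name: lib["proportion_of_pool_total"] / lib["nM"]
        for name, lib in libraries.items()
    }
    # Scale every volume by the same factor so the ratios between libraries
    # are preserved and the smallest volume lands on the pipetting minimum.
    scale = min_pipette_uL / min(relative.values())
    return {name: rel * scale for name, rel in relative.items()}


# Hypothetical example: lib_A needs 75% of the reads but is half as concentrated.
example_libraries = {
    "lib_A": {"proportion_of_pool_total": 0.75, "nM": 10.0},
    "lib_B": {"proportion_of_pool_total": 0.25, "nM": 20.0},
}
volumes = pooling_volumes(example_libraries, min_pipette_uL=1.0)
print(volumes)
```

With these made-up numbers, lib_B sits at the 1 uL minimum and lib_A needs six times that volume.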

In [1]:
import tkinter as tk
import Bio as BP
import pandas as pd
import numpy as np
import collections as col
import os
import csv
import re
import regex

In [2]:
'''Resource for tkinter GUI (graphical user interface)
http://web.archive.org/web/20201111235321/https://effbot.org/tkinterbook/tkinter-hello-tkinter.htm
'''
def pick_sequencer():
    
    OPTIONS = [
    "NovaSeq S4 4 Lanes (8 billion clusters $30,780)",
    "NovaSeq S4 3 Lanes (6 billion clusters $25,335)",
    "NovaSeq S4 2 Lanes (4 billion clusters $16,890)",
    "NovaSeq S4 1 Lane (2 billion clusters $8445)",
    "NovaSeq S2 2 Lanes (3 billion clusters $10,575 - $16,700)",
    "NovaSeq S2 1 Lane (1.5 billion clusters $6687 - $9750)",
    "NovaSeq SP (800 million clusters $2250 - $5500)",   
    "NextSeq 2000 P3 (1.1 billion clusters $2762 - $5100)",
    "NextSeq High Output (400 million clusters $1525 - $4680)",
    "NextSeq Mid Output (150 million clusters $1110 - $1790)",
    "MiSeq V3 (23 million clusters $930 - $1625)",
    "MiSeq V2 (13 million clusters $840 - $1210)",
    "MiSeq V2 Micro (4 million clusters $450)",
    "MiSeq V2 Nano (1 million clusters $300 - $360)",
    "iSeq (4 million clusters $625)",
    ] 

    Sequencer = tk.Tk()

    variable = tk.StringVar(Sequencer)
    variable.set(OPTIONS[0]) # default value

    m = tk.Label(Sequencer, text="Please pick your sequencer!")
    w = tk.OptionMenu(Sequencer, variable, *OPTIONS)
    m.pack()
    w.pack()


    picked = []
    Close = tk.Label(Sequencer, text="Great! Now please click the X to close.")
    
    def ok():
        stuff = variable.get()
        picked.append(str(stuff))
        Close.pack()
        
    button = tk.Button(Sequencer, text="Submit", command=ok)
    button.pack()
    
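    Sequencer.mainloop()
    # Use the most recent submission; fall back to the default option if the
    # window was closed without clicking Submit (picked would be empty).
    return picked[-1] if picked else OPTIONS[0]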

In [3]:
def sequencer_param(sequencer):
    if sequencer == "iSeq (4 million clusters $625)":
        volume_input_uL = 10 
        concentration_nM = range(4, 20)     
        reads = 4000000
        
    if "MiSeq V3 (23 million clusters $930 - $1625)" in sequencer:
        volume_input_uL = 10 
        concentration_nM = range(4, 20)   
        reads = 23000000
        
    if "MiSeq V2 (13 million clusters $840 - $1210)" in sequencer:
        volume_input_uL = 10 
        concentration_nM = range(4, 20) 
        reads = 13000000
        
    if "MiSeq V2 Micro (4 million clusters $450)" in sequencer:
        volume_input_uL = 10 
        concentration_nM = range(4, 20)  
        reads = 4000000
        
    if "MiSeq V2 Nano (1 million clusters $300 - $360)" in sequencer:
        volume_input_uL = 10 
        concentration_nM = range(4, 20)
        reads = 1000000
        
    if "NextSeq 2000 P3 (1.1 billion clusters $2762 - $5100)" in sequencer:
        volume_input_uL = 10 
        concentration_nM = range(4, 20) 
        reads = 1100000000 
        
    if "NextSeq High Output (400 million clusters $1525 - $4680)" in sequencer:
        volume_input_uL = 10 
        concentration_nM = range(4, 20)   
        reads = 400000000
        
    if "NextSeq Mid Output (150 million clusters $1110 - $1790)" in sequencer:
        volume_input_uL = 10 
        concentration_nM = range(4, 20)   
        reads = 150000000
                
    if "NovaSeq SP" in sequencer:
        volume_input_uL = 100 
        concentration_nM = range(4, 10)  
        reads = 800000000
    
    if "NovaSeq S2 2 Lanes" in sequencer:
        volume_input_uL = 200 
        concentration_nM = range(4, 10)   
        reads = 3000000000
        
    if "NovaSeq S2 1 Lane" in sequencer:
        volume_input_uL = 200 
        concentration_nM = range(4, 10)   
        reads = 1500000000
    
    if "NovaSeq S4 4 Lanes" in sequencer:
        volume_input_uL = 350 
        concentration_nM = range(4, 10)
        reads = 8000000000
        
    if "NovaSeq S4 3 Lanes" in sequencer:
        volume_input_uL = 350 
        concentration_nM = range(4, 10)  
        reads = 6000000000
        
    if "NovaSeq S4 2 Lanes" in sequencer:
        volume_input_uL = 350 
        concentration_nM = range(4, 10)  
        reads = 4000000000
        
    if "NovaSeq S4 1 Lane" in sequencer:
        volume_input_uL = 350 
        concentration_nM = range(4, 10)  
        reads = 2000000000
        
    return volume_input_uL, concentration_nM, reads
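
The goals mention linking the menu selection to its details through a dictionary. A minimal sketch of that alternative follows, using a few of the values from the function above. The names `SEQUENCER_PARAMS` and `lookup_sequencer_params` are hypothetical, and the table is deliberately incomplete. It assumes the sequencer name is everything before the first `" ("` in the menu label.

```python
# (volume_input_uL, concentration_nM, reads), copied from sequencer_param above.
# Only a few entries are shown; the remaining sequencers would be added the same way.
SEQUENCER_PARAMS = {
    "iSeq": (10, range(4, 20), 4_000_000),
    "MiSeq V3": (10, range(4, 20), 23_000_000),
    "NovaSeq SP": (100, range(4, 10), 800_000_000),
    "NovaSeq S4 4 Lanes": (350, range(4, 10), 8_000_000_000),
}


def lookup_sequencer_params(label):
    """Return (volume_input_uL, concentration_nM, reads) for a menu label."""
    # "NovaSeq SP (800 million clusters ...)" -> "NovaSeq SP"
    name = label.split(" (")[0]
    try:
        return SEQUENCER_PARAMS[name]
    except KeyError:
        # Fail loudly instead of reaching the return with unset variables.
        raise ValueError(f"Unknown sequencer: {label!r}") from None


volume_uL, conc_nM, reads = lookup_sequencer_params(
    "NovaSeq SP (800 million clusters $2250 - $5500)"
)
print(volume_uL, conc_nM, reads)
```

An exact match on the name avoids the ambiguity between labels such as "MiSeq V2" and "MiSeq V2 Micro", and an unrecognized label raises a clear error.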

In [4]:
def library_number_and_names():
    lib_num = input("How many libraries are you sequencing? ")
    lib_num = int(lib_num) 
    dictionary_of_libraries = {}
    for i in range(lib_num):
        question = "\n\nWhat is the name of library number " + str(i) + "? (Unique names are strongly encouraged.) "
        question2 = "\nHow many samples in this library? "
        
        
        question3 = "\nHow many reads per sample do you want for this library?"
        question4 = "\nWhat is the current concentration of this library in ng/uL?"
        question5 = "\nWhat percentage of your library is between 200 and 1000 base pairs long? (in decimal form please)"
        question6 = "\nWhat is your average base pair length for this library?"
        name = None
        while True:
            name = input(question)
            if name in dictionary_of_libraries.keys():
                print("name already exists")
            else:
                break
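        # Answers are stored as the raw strings typed by the user;
        # calculated_library_quant converts them to float later.
        dictionary_of_libraries[name] = {
            'number_of_samples': input(question2),
            'reads_per_sample': input(question3),
            'concentration_library': input(question4),
            '%_lib_between_200_1000': input(question5),
            'average_bp_length': input(question6),
        }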
    return lib_num, dictionary_of_libraries
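
The prompts above accept any text, so a typo only shows up later when the value is converted to a number. As a sketch of one possible guard, the hypothetical helper `ask_number` below re-asks until the answer parses and is not below a minimum. The `reader` parameter is an assumption added so the helper can be exercised without typing; in the notebook it would default to `input`.

```python
def ask_number(prompt, cast=float, minimum=0, reader=input):
    """Ask until the answer converts with `cast` and is >= minimum."""
    while True:
        raw = reader(prompt)
        try:
            value = cast(raw)
        except ValueError:
            print(f"Could not read {raw!r} as a number, please try again.")
            continue
        if value < minimum:
            print(f"Please enter a value of at least {minimum}.")
            continue
        return value


# Scripted answers stand in for a user: two bad entries, then a valid one.
scripted_answers = iter(["abc", "-1", "2.5"])
concentration = ask_number(
    "What is the current concentration of this library in ng/uL? ",
    reader=lambda prompt: next(scripted_answers),
)
print(concentration)
```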


In [5]:
'''Adds to the nested dictionary created in library_number_and_names.
From the user input, calculates MW, nmol/uL, nM, and total number of reads per library.'''

def calculated_library_quant(dictionary_of_libraries):
    for lib in dictionary_of_libraries.values():
        # Approximate molecular weight of double-stranded DNA (g/mol) from its length in bp
        lib['MW'] = (float(lib['average_bp_length']) * 607.4) + 157.9
        # ng/uL divided by g/mol gives nmol/uL; only the 200-1000 bp fraction is counted
        lib['nmol/ul'] = float(lib['concentration_library']) * float(lib['%_lib_between_200_1000']) / lib['MW']
        # nmol/uL * 1,000,000 uL/L = nmol/L = nM
        lib['nM'] = lib['nmol/ul'] * 1000000
        lib['total_num_reads_per_library'] = float(lib['reads_per_sample']) * float(lib['number_of_samples'])
    return dictionary_of_libraries
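
To make the unit conversion concrete, here is the same arithmetic for a single library, pulled out into a standalone function. The name `library_molarity` and the input values are made up for illustration; the formulas are the ones used in `calculated_library_quant`.

```python
def library_molarity(concentration_ng_per_uL, fraction_in_range, average_bp_length):
    """Return (MW in g/mol, nmol/uL, nM) for one library."""
    mw = average_bp_length * 607.4 + 157.9
    # 1 g/mol equals 1 ng/nmol, so ng/uL divided by MW is nmol/uL.
    nmol_per_uL = concentration_ng_per_uL * fraction_in_range / mw
    # 1 L = 1,000,000 uL, so nmol/uL * 1e6 = nmol/L = nM.
    nM = nmol_per_uL * 1_000_000
    return mw, nmol_per_uL, nM


# Illustrative values: 2 ng/uL, 90% of fragments in range, 400 bp average length.
mw, nmol_per_uL, nM = library_molarity(2.0, 0.9, 400)
print(f"MW = {mw:.1f} g/mol, {nmol_per_uL:.3e} nmol/uL, {nM:.2f} nM")
```

For these inputs the molecular weight is 243,117.9 g/mol and the concentration works out to roughly 7.4 nM.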


In [6]:
'''Calculates the total number of reads needed and each library's proportion of the total pool.
Still to do: ask whether the user wants PhiX and at what percentage, then
calculate how many reads go to PhiX and how many are left over for samples.'''

def cont_library_quant(dictionary_of_libraries, reads):
    # reads (the sequencer's capacity) is accepted but not used yet
    total_reads_all = sum(float(lib['total_num_reads_per_library']) for lib in dictionary_of_libraries.values())
    for lib in dictionary_of_libraries.values():
        lib['proportion_of_pool_total'] = float(lib['total_num_reads_per_library']) / total_reads_all
    return dictionary_of_libraries, total_reads_all
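
The PhiX step described in the docstring is not written yet. A minimal sketch of the read budget follows, assuming the user gives the PhiX share as a decimal fraction of the run's total reads. The name `split_reads_for_phix` and the 5% figure are illustrative only.

```python
def split_reads_for_phix(total_reads, phix_fraction):
    """Return (phix_reads, sample_reads) for a run with total_reads capacity."""
    if not 0 <= phix_fraction < 1:
        raise ValueError("phix_fraction must be a decimal between 0 and 1")
    phix_reads = round(total_reads * phix_fraction)
    sample_reads = total_reads - phix_reads
    return phix_reads, sample_reads


# Example: a 23 million read run (the MiSeq V3 capacity above) with 5% PhiX.
phix_reads, sample_reads = split_reads_for_phix(23_000_000, 0.05)
print(phix_reads, sample_reads)

# The requested reads (total_reads_all in the notebook) can then be checked
# against what is left for samples; 20 million is a made-up request.
requested_reads = 20_000_000
fits_on_run = requested_reads <= sample_reads
print(fits_on_run)
```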

In [7]:
# Note: storing the total inside the dictionary (dictionary_of_libraries['total_reads_all'] = a) makes it its own separate dictionary item, which breaks the loops that iterate over the libraries, so the total is returned separately instead.

In [8]:
def main():
    print("Hello user! This script will help you pool your libraries for sequencing. Just follow the instructions on the screen.")
    sequencer = pick_sequencer()
    volume_input_uL, concentration_nM, reads = sequencer_param(sequencer)
    lib_num, dictionary_of_libraries = library_number_and_names()
    dictionary_of_libraries = calculated_library_quant(dictionary_of_libraries)
    dictionary_of_libraries, total_reads_all = cont_library_quant(dictionary_of_libraries, reads)
    print(sequencer, volume_input_uL, concentration_nM, reads, lib_num, dictionary_of_libraries, total_reads_all)
    
    

In [9]:
main()

Hello user! This script will help you pool your libraries for sequencing. Just follow the instructions on the screen.
How many libraries are you sequencing? 2


What is the name of library number 0? (Unique names are strongly encouraged.) bob

How many samples in this library? 2

How many reads per sample do you want for this library?2

What is the current concentration of this library in ng/uL?2

What percentage of your library is between 200 and 1000 base pairs long? (in decimal form please)2

What is your average base pair length for this library?2


What is the name of library number 1? (Unique names are strongly encouraged.) 2

How many samples in this library? 2

How many reads per sample do you want for this library?2

What is the current concentration of this library in ng/uL?2

What percentage of your library is between 200 and 1000 base pairs long? (in decimal form please)2

What is your average base pair length for this library?2
NovaSeq S4 4 Lanes (8 billion clusters $30,78