# Create a tool to help design LNA gapmers

## Input = a long-ish nucleic acid sequence

Programme asks the user to paste the input sequence

## Output = all possible sequence stretches of length X-nt

Programme asks the user to input the desired length of the stretches (this should be 16nt for typical gapmers, for example).

Programme outputs an Excel spreadsheet with all the sequence stretches written out in a single column, with their names in an adjacent column (eg out_1, out_2, ... etc)

# (Intended) Outline of steps

## 1. Read in the input sequence

Ask the user to paste it in.

## 2. Read in the length of output stretches

Ask the user for the desired length of the output stretches.

## 3. Parse the input

Eg split off the Fasta sequence name - could then use it later in some output

Extract the nucleic acid sequence. Eg ignore spaces, line gaps, special characters - just go along collecting letters

Generate a warning message if the sequence isn't all-RNA or all-DNA (ie if it contains letters outside the ATCG set or the AUCG set)
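
A minimal sketch of the FASTA-splitting idea from this step. The helper name `split_fasta` is hypothetical (it does not appear in the notebook below), and the sketch assumes the pasted text holds at most one record and is already available as a single string:

```python
def split_fasta(text):
    """Split pasted text into (name, raw_sequence).

    Assumption: at most one FASTA record. The name is None when the
    first non-empty line does not start with '>'.
    """
    # Drop surrounding whitespace and empty lines
    lines = [line.strip() for line in text.strip().splitlines()]
    lines = [line for line in lines if line]

    if lines and lines[0].startswith(">"):
        # Header line: everything after '>' is treated as the name
        name = lines[0][1:].strip() or None
        body = lines[1:]
    else:
        name = None
        body = lines

    # Join the remaining lines; further clean-up (spaces, digits,
    # non-base letters) is left to a later formatting step
    return name, "".join(body)


name, raw = split_fasta(">my_gene example\nATCG TTCG\nTAGC\n")
print(name)  # my_gene example
print(raw)   # ATCG TTCGTAGC
```

The returned name could then be reused when naming the output file, as suggested in step 6.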


## 4. Create a dataframe to store outputs

Just an empty Pandas dataframe with a column for the sequence and a column for sequence name

## 5. Create sequence stretches

Scan along the extracted sequence with an X-character-long window and write each stretch out into the dataframe


## 6. Save the dataframe as an Excel spreadsheet

Create a new Excel file. Could use Fasta sequence name to name it, if this was given. Else just a default name.

Would be cool to incorporate today's date in the naming - Python might have something to fetch this?


## 7. Generate & print extra bits of output info

Eg was everything okay - was it all DNA / RNA / mix of both / had extra issues?
Eg the number of sequence stretches generated
Eg the Excel file name that's been generated
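
One possible shape for this summary step, as a rough sketch. The function names `classify_bases` and `summarise_run` and the file name `example.xlsx` are made up for illustration. The sketch assumes the sequence has already been cleaned down to base letters:

```python
def classify_bases(sequence):
    """Classify a cleaned sequence as DNA, RNA, or a mix of both."""
    letters = set(sequence.upper())
    if "T" in letters and "U" in letters:
        return "mix of DNA and RNA"
    if "U" in letters:
        return "RNA"
    # Assumption: a sequence with neither T nor U is reported as DNA
    return "DNA"


def summarise_run(sequence, stretch_count, filename):
    """Return a short, human-readable summary of one run."""
    return "\n".join([
        "Sequence type: " + classify_bases(sequence),
        "Number of stretches generated: " + str(stretch_count),
        "Excel file written: " + filename,
    ])


print(summarise_run("ATCGTTCGTAGCTGCTGCTGGTCC", 1, "example.xlsx"))
```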

In [8]:
import pandas as pd
import re

## Define functions used later

In [9]:
# define function to transform input sequence into one neat string of letters
# 1) remove line breaks and white spaces
# 2) check for special characters => (a) remove, (b) print a warning
# 3) check for non-base letters => (a) remove, (b) print a warning
# 4) check if it's DNA / RNA / a mix => (a) make mix into DNA, (b) print notification on the outcome

def seq_formater (user_data):
    
    import re
    
    # remove line breaks and white spaces
    single_line = user_data.replace('\n', '')
    #print(single_line)
    no_spaces = ''.join(single_line.split())
    #print(no_spaces)
    
    
    
    # check for non-letter characters
    
    letters_only = re.sub("[^a-zA-Z]", "", no_spaces)
    if letters_only != no_spaces:
        print ('\nWarning: your sequence contained non-letter characters.\nThese have now been removed.')
    else:
        print ('\nNo special characters in your sequence, yay.')
    #print(letters_only)
    
    
    
    # check for non-DNA/RNA characters

    # note: no commas inside the character class, otherwise ',' would count as an allowed character
    bases_only = re.sub("[^ACGTUacgtu]", "", letters_only)
    if bases_only != letters_only:
        print ('\nWarning: your sequence contained non-base letters. That\'s concerning.\nStill, these have now been removed.')
    else:
        print ('\nNo non-base letters in your sequence, yay.')
    #print(bases_only)
    
    
    
    # check for DNA-only or RNA-only characters

    DNA_only = re.sub("[uU]", "T", bases_only)
    RNA_only = re.sub("[tT]", "U", bases_only)
    
    if DNA_only == bases_only:
        formated_seq = DNA_only
        print ('\nLooks like your submitted sequence is DNA.')
    elif RNA_only == bases_only:
        formated_seq = RNA_only
        print ('\nLooks like your submitted sequence is RNA.')
    else:
        formated_seq = DNA_only
        print('\nLooks like your submitted sequence was a mix of RNA and DNA. It\'s now been converted to DNA.')

    
    #make sequence all upper-case (required by QIAgen, for example)
    formated_seq = formated_seq.upper()
    
        
    #print('\nThe final formated input sequence is:\n', formated_seq, '\n')    
           
        
        
    return formated_seq

In [10]:
# define function to make reverse complement

def rev_comp(sequence):
    # debug output: length of the sequence being processed
    print(len(sequence))
    
    # complement of each base (case-insensitive); T and U both pair with A
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "U": "A"}
    
    reverse_complement = ""
    for old_letter in sequence:
        # fail loudly instead of silently reusing the previous letter
        if old_letter.upper() not in complement:
            raise ValueError("Unexpected character in sequence: " + old_letter)
        # prepend each new letter, so the complement comes out reversed
        reverse_complement = complement[old_letter.upper()] + reverse_complement
    return reverse_complement
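
As an alternative sketch, the same reverse complement can be built with Python's `str.maketrans` and `str.translate`. The name `rev_comp_translate` is hypothetical. Unlike a letter-by-letter check, this version passes unknown characters through unchanged, so it assumes the input has already been cleaned:

```python
# Translation table: DNA complement, with U treated like T; output is upper-case
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAATGCAA")


def rev_comp_translate(sequence):
    """Return the reverse complement of a cleaned DNA/RNA sequence as DNA."""
    # Complement every base, then reverse the string with a slice
    return sequence.translate(_COMPLEMENT)[::-1]


print(rev_comp_translate("ATCGTTCGTAGCTGCTGCTGGTCC"))
# GGACCAGCAGCAGCTACGAACGAT
```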

In [11]:
# define function to convert sequence ends to LNA format as requested
# in this case, it's IDT and QIAgen format: +A

def LNAse_IDT(sequence):
    gapmer = ""
    i = 1
    #print(len(sequence))
    
    while i <= LNA_5:
        old_letter = sequence[i-1]
        new_letter = "+" + old_letter
        gapmer += new_letter
        i += 1
    #print(gapmer)
    #print(i)
    
    while i > LNA_5 and i <= len(sequence) - LNA_3:
        old_letter = sequence[i-1]
        new_letter = old_letter
        gapmer += new_letter
        i += 1
    #print(gapmer)
    #print(i)
    
    while i >= (len(sequence) - LNA_3 + 1) and i <= len(sequence):
        old_letter = sequence[i-1]
        new_letter = "+" + old_letter
        gapmer += new_letter
        i += 1
    #print(gapmer)
    #print(i)
    
    return gapmer  
    

In [12]:
# define function to convert sequence ends to LNA format as requested
# in this case, it's Eurogentec format: {A}

def LNAse_Eurogentec(sequence):
    gapmer = ""
    i = 1
    #print(len(sequence))
    
    while i <= LNA_5:
        old_letter = sequence[i-1]
        new_letter = "{" + old_letter + "}"
        gapmer += new_letter
        i += 1
    #print(gapmer)
    #print(i)
    
    while i > LNA_5 and i <= len(sequence) - LNA_3:
        old_letter = sequence[i-1]
        new_letter = old_letter
        gapmer += new_letter
        i += 1
    #print(gapmer)
    #print(i)
    
    while i >= (len(sequence) - LNA_3 + 1) and i <= len(sequence):
        old_letter = sequence[i-1]
        new_letter = "{" + old_letter + "}"
        gapmer += new_letter
        i += 1
    #print(gapmer)
    #print(i)
    
    return gapmer  

In [13]:
# define function to convert sequence ends to LNA format as requested
# in this case, it's Sigma format: [+A]

def LNAse_Sigma(sequence):
    gapmer = ""
    i = 1
    #print(len(sequence))
    
    while i <= LNA_5:
        old_letter = sequence[i-1]
        new_letter = "[+" + old_letter + "]"
        gapmer += new_letter
        i += 1
    #print(gapmer)
    #print(i)
    
    while i > LNA_5 and i <= len(sequence) - LNA_3:
        old_letter = sequence[i-1]
        new_letter = old_letter
        gapmer += new_letter
        i += 1
    #print(gapmer)
    #print(i)
    
    while i >= (len(sequence) - LNA_3 + 1) and i <= len(sequence):
        old_letter = sequence[i-1]
        new_letter = "[+" + old_letter + "]"
        gapmer += new_letter
        i += 1
    #print(gapmer)
    #print(i)
    
    return gapmer  

In [14]:
# define function to convert sequence ends to LNA format as requested
# in this case, it's Sigma all-PS (phosphorothioate) format: [+A], with '*' between neighbouring nucleotides

def LNAse_Sigma_PS(sequence):
    gapmer = ""
    i = 1
    #print(len(sequence))
    
    while i <= LNA_5:
        old_letter = sequence[i-1]
        new_letter = "[+" + old_letter + "]"
        gapmer = gapmer + new_letter + '*'
        i += 1
    #print(gapmer)
    #print(i)
    
    while i > LNA_5 and i <= len(sequence) - LNA_3:
        old_letter = sequence[i-1]
        new_letter = old_letter
        gapmer = gapmer + new_letter + '*'
        i += 1
    #print(gapmer)
    #print(i)
    
    while i >= (len(sequence) - LNA_3 + 1) and i < len(sequence):
        old_letter = sequence[i-1]
        new_letter = "[+" + old_letter + "]"
        gapmer = gapmer + new_letter + '*'
        i += 1
    
    if i >= (len(sequence) - LNA_3 + 1) and i == len(sequence):
        old_letter = sequence[i-1]
        new_letter = "[+" + old_letter + "]"
        gapmer = gapmer + new_letter
        i += 1
    
    
    #print(gapmer)
    #print(i)
    
    return gapmer  
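
The four formatting functions above differ only in the characters placed around each LNA base and in the optional `*` separator, so they could be folded into one parameterised helper. The sketch below is one way to do that. The name `lna_format` and its parameters are assumptions, and it takes the LNA counts as arguments instead of reading the global `LNA_5` and `LNA_3`. Unlike the functions above, it raises an error when the two LNA ends would overlap.

```python
def lna_format(sequence, lna_5, lna_3, prefix="+", suffix="", linker=""):
    """Wrap the first lna_5 and last lna_3 bases in LNA markup.

    prefix/suffix surround each LNA base; linker is placed between
    neighbouring nucleotides (for example '*' for an all-PS design).
    """
    n = len(sequence)
    if lna_5 < 0 or lna_3 < 0 or lna_5 + lna_3 > n:
        raise ValueError("LNA counts must be non-negative and fit within the sequence")

    units = []
    for index, base in enumerate(sequence):
        # A base is LNA if it lies in the 5' wing or the 3' wing
        is_lna = index < lna_5 or index >= n - lna_3
        units.append(prefix + base + suffix if is_lna else base)

    # join() puts the linker only between units, never after the last one
    return linker.join(units)


seq = "GGACCAGCAGCAGCTACGAACGAT"
print(lna_format(seq, 4, 4))                          # IDT / QIAgen style
print(lna_format(seq, 4, 4, prefix="{", suffix="}"))  # Eurogentec style
print(lna_format(seq, 4, 4, prefix="[+", suffix="]")) # Sigma style
print(lna_format(seq, 4, 4, prefix="[+", suffix="]", linker="*"))  # Sigma all-PS style
```

Because the linker is added with `join`, no trailing `*` is produced even when `lna_3` is 0.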

## Run the programme

In [15]:
# Print a welcome message

welcome_message = "\nWelcome to the LNA gapmer target-generator!\n\n"
print(welcome_message)


Welcome to the LNA gapmer target-generator!




In [16]:
# Get the name of the user sequence

#title = input("\nWhat's the name of your nucleic acid sequence?\n\n")

In [17]:
# Get user sequence

user_data = input("\nPlease paste your nucleic acid sequence:\n")


Please paste your nucleic acid sequence:
atcgttcgtagctgctgctggtcc


In [18]:
# format input sequence
formated_seq = seq_formater (user_data)

print("\nYour formated sequence is\n", formated_seq, "\n", sep="")
print("The length of your sequence is ", len(formated_seq), " nt.\n", sep="")


No special characters in your sequence, yay.

No non-base letters in your sequence, yay.

Looks like your submitted sequence is DNA.

Your formated sequence is
ATCGTTCGTAGCTGCTGCTGGTCC

The length of your sequence is 24 nt.



In [19]:
# Get the desired length of sequence stretches

stretch_length = int(input("\nDesired length of the gapmer:\t"))


Desired length of the gapmer:	24


In [20]:
# Generate sequence stretches

stretches = [formated_seq[i:i+stretch_length] for i in range(len(formated_seq)-(stretch_length-1))]
stretch_count = len(stretches)

print("Example stretches:")
print (stretches[0:2])
print (stretches [-2:])

print("\nNumber of stretches generated is ", stretch_count, '.', sep="")

Example stretches:
['ATCGTTCGTAGCTGCTGCTGGTCC']
['ATCGTTCGTAGCTGCTGCTGGTCC']

Number of stretches generated is 1.


In [21]:
# ask how many LNAs to put in

LNA_5 = int(input("\nHow many LNA nucleotides on the 5' end?\t"))
LNA_3 = int(input("How many LNA nucleotides on the 3' end?\t"))

print("\nOkay, this will leave ", stretch_length - LNA_5 - LNA_3, " nt of DNA in the middle of the gapmer.", sep="")


How many LNA nucleotides on the 5' end?	4
How many LNA nucleotides on the 3' end?	4

Okay, this will leave 16 nt of DNA in the middle of the gapmer.
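
The cell above does not check that the requested LNA ends actually fit inside the gapmer. A small validation helper could catch that before any sequences are formatted. This is only a sketch; `check_gapmer_design` is a hypothetical name, and the rules it enforces are assumptions about what counts as a sensible design:

```python
def check_gapmer_design(stretch_length, lna_5, lna_3):
    """Return the DNA gap length, or raise ValueError for an impossible design."""
    if stretch_length <= 0:
        raise ValueError("Gapmer length must be a positive integer.")
    if lna_5 < 0 or lna_3 < 0:
        raise ValueError("LNA counts cannot be negative.")

    # Nucleotides left over for the DNA gap in the middle
    gap = stretch_length - lna_5 - lna_3
    if gap < 0:
        raise ValueError("The LNA ends are longer than the gapmer itself.")
    return gap


print(check_gapmer_design(24, 4, 4))  # 16
```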


In [22]:
# Set up an empty dataframe

excel_lines = pd.DataFrame(columns=['Target stretch name', 'Target stretch sequence', 'Reverse complement',
                                    'LNA gapmer for IDT and QIAgen', 'LNA gapmer for Eurogentec', 'LNA gapmer for Sigma',
                                   'LNA all_PS gapmer for Sigma'])
excel_lines

Unnamed: 0,Target stretch name,Target stretch sequence,Reverse complement,LNA gapmer for IDT and QIAgen,LNA gapmer for Eurogentec,LNA gapmer for Sigma,LNA all_PS gapmer for Sigma


In [23]:
# Populate the dataframe

# collect one dictionary per stretch, then build the dataframe in one go
rows = []
for x in range(stretch_count):
    stretch_name = "seq_" + str(x + 1)
    reverse_complement = rev_comp(stretches[x])
    LNA_gapmer_IDT = LNAse_IDT(reverse_complement)
    LNA_gapmer_Eurogentec = LNAse_Eurogentec(reverse_complement)
    LNA_gapmer_Sigma = LNAse_Sigma(reverse_complement)
    LNA_gapmer_Sigma_PS = LNAse_Sigma_PS(reverse_complement)
    new_line = {'Target stretch name': stretch_name, 'Target stretch sequence': stretches[x],
                'Reverse complement': reverse_complement, 'LNA gapmer for IDT and QIAgen': LNA_gapmer_IDT,
                'LNA gapmer for Eurogentec': LNA_gapmer_Eurogentec, 'LNA gapmer for Sigma': LNA_gapmer_Sigma,
               'LNA all_PS gapmer for Sigma': LNA_gapmer_Sigma_PS}
    
    rows.append(new_line)

# DataFrame.append was removed in pandas 2.0, so the rows are passed to the constructor instead
excel_lines = pd.DataFrame(rows, columns=excel_lines.columns)

excel_lines

24


Unnamed: 0,Target stretch name,Target stretch sequence,Reverse complement,LNA gapmer for IDT and QIAgen,LNA gapmer for Eurogentec,LNA gapmer for Sigma,LNA all_PS gapmer for Sigma
0,seq_1,ATCGTTCGTAGCTGCTGCTGGTCC,GGACCAGCAGCAGCTACGAACGAT,+G+G+A+CCAGCAGCAGCTACGAA+C+G+A+T,{G}{G}{A}{C}CAGCAGCAGCTACGAA{C}{G}{A}{T},[+G][+G][+A][+C]CAGCAGCAGCTACGAA[+C][+G][+A][+T],[+G]*[+G]*[+A]*[+C]*C*A*G*C*A*G*C*A*G*C*T*A*C*...


In [24]:
# create & save a new Excel spreadsheet

# fetch today's date in YYYY-MM-DD format
from datetime import datetime
date = datetime.today().strftime('%Y-%m-%d')


# create a unique file name
import os.path

x = 1
filename = "Sequence stretches " + date + " no" + str(x) + ".xlsx"
while os.path.isfile(filename):
    x += 1
    filename = "Sequence stretches " + date + " no" + str(x) + ".xlsx"

# write out the Excel file
excel_lines.to_excel(filename, index=False) 