### Goal of the Notebook:

I have a collection of 16 forward and reverse primers for my TnSeq experiment. With this script I want to figure out which Nextera primers (ordered by Kaylee) are compatible with my primer set.

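Compatibility is defined as a Nextera index having a Hamming distance of at least 4 (the `ham_thresh` value set in the code) from every index in the matching TnSeq set: s5xx indices are checked against my forward set, and n7xx indices against my reverse set.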

NOTE: Another potential concern when selecting barcodes is color balance. However, given that we have over 25 n7xx and s5xx barcodes, this should not be a major issue.
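
If color balance ever does need a quick look, one rough way to eyeball it is to count which bases occur at each position across the chosen barcodes. The snippet below is only a sketch: `base_composition` is a hypothetical helper (not part of this notebook), it assumes all barcodes have the same length, and it makes no claim about what composition a given sequencer actually requires.

```python
from collections import Counter

def base_composition(barcodes):
    """Count how often each base occurs at each position (assumes equal-length barcodes)."""
    # zip(*barcodes) walks the barcodes column by column, i.e. position by position
    return [Counter(column) for column in zip(*barcodes)]

# three of the s5xx indices, used here purely as example input
example = ['TAGATCGC', 'CTCTCTAT', 'TATCCTCT']
for position, counts in enumerate(base_composition(example)):
    print(position, dict(counts))
```

A position where a single base dominates across all selected barcodes would be the thing to look out for.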

In [1]:
import numpy as np

In [2]:
def hamming_distance(str1, str2):
    #Hamming distance is only defined for strings of equal length, so check that first
    assert len(str1)==len(str2), "DNA barcodes aren't equal in length"
    #basically counting how many positions are different
    return sum(char1 != char2 for char1, char2 in zip(str1, str2))
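
As a quick sanity check of the function above, here it is applied to a few 8-base sequences. The function is repeated so the snippet runs on its own, and the last call uses two of the s5xx indices purely as example input.

```python
def hamming_distance(str1, str2):
    # same definition as in the cell above, repeated so this snippet is self-contained
    assert len(str1) == len(str2), "DNA barcodes aren't equal in length"
    return sum(char1 != char2 for char1, char2 in zip(str1, str2))

print(hamming_distance('ATCGATCG', 'ATCGATCG'))  # identical sequences -> 0
print(hamming_distance('ATCGATCG', 'ATCGATCC'))  # only the last base differs -> 1
print(hamming_distance('CTCTCTAT', 'TATCCTCT'))  # S502 vs S503 -> 5
```

Passing sequences of different lengths triggers the assertion instead of returning a number.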

In [12]:
#importing all the fasta files
#forward primer indices
with open('tnseq_forward_indices.txt', 'r') as in_handle:
    tnseq_f = in_handle.read().splitlines()
#reverse primer indices
with open('tnseq_reverse_indices.txt', 'r') as in_handle:
    tnseq_r = in_handle.read().splitlines()
#s5xx series indices
with open('nextera_s5xx.txt', 'r') as in_handle:
    nxt_s5xx = in_handle.read().splitlines()
#n7xx series indices
with open('nextera_n7xx.txt', 'r') as in_handle:
    nxt_n7xx = in_handle.read().splitlines()

The fasta files are in the following format:

\> barcode name 

ATCGATCG

So, I extract every second line, which will contain the barcode sequence
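
Taking every second line works as long as the files strictly alternate header and sequence. A slightly more defensive alternative, sketched below, pairs each `>` header with the line that follows it and skips blank lines. `parse_barcode_lines` is a hypothetical helper, not something this notebook uses, and it assumes each sequence sits on a single line.

```python
def parse_barcode_lines(lines):
    """Pair each '>' header with the single sequence line that follows it."""
    barcodes = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            name = line[1:].strip()  # drop the '>' and surrounding spaces
        elif name is not None:
            barcodes[name] = line.upper()
            name = None  # wait for the next header
    return barcodes

# in-memory example; in the notebook the lines would come from read().splitlines()
example_lines = ['>S501', 'TAGATCGC', '', '>S502', 'CTCTCTAT']
print(parse_barcode_lines(example_lines))
```

Keeping names and sequences together in a dict also avoids the `2*i` / `2*i+1` index arithmetic when printing results later.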

In [18]:
#keep only the sequence lines (every second line), dropping the '>' header lines
tnseq_f_index = tnseq_f[1::2]
tnseq_r_index = tnseq_r[1::2]
nxt_s5xx_index = nxt_s5xx[1::2]
nxt_n7xx_index = nxt_n7xx[1::2]

In [28]:
#minimum Hamming distance I'll require for a barcode to be used for analysis
ham_thresh = 4

In [29]:
#comparing Hamming distances: each s5xx index against all TnSeq forward indices
for i, test_index in enumerate(nxt_s5xx_index):
    #dist tracks the smallest distance seen so far, capped at ham_thresh
    dist = ham_thresh
    for tnseq_index in tnseq_f_index:
        dist = min(dist, hamming_distance(test_index, tnseq_index))
    #dist is still ham_thresh only if every distance was >= ham_thresh
    if dist==ham_thresh:
        print(nxt_s5xx[2*i])
        print(nxt_s5xx[2*i+1])

>S501
TAGATCGC
>S502
CTCTCTAT
>S503
TATCCTCT
>S506
ACTGCATA
>S507
AAGGAGTA
>S509
GGCTACTC
>S510
CCTCAGAC
>S511
TCCTTACG
>S512
ACGCGTGG
>S513
GGAACTCC
>S514
TGGCCATG
>S516
CGCGGTTA
>S517
GACCGCCA
>S518
TAAGATGG
>S519
ATTGACAT
>S520
AGCCAACT
>S521
TACTAGGT
>S522
TCACGGTT
>S523
TGTAATGA
>S524
CACGTCAG
>S525
CTGAATTC
>S526
CGTACCGG
>S528
TATAGACG
>S529
GTCATTGA
>S530
GCATCGTT
>S531
AGGTTGAC
>S532
TGAAACTG
>S533
CAATCACA
>S534
ACATGCAA
>S535
ATCGCGCC
>S536
TCGGTTAA


In [31]:
#comparing Hamming distances: each n7xx index against all TnSeq reverse indices
for i, test_index in enumerate(nxt_n7xx_index):
    #dist tracks the smallest distance seen so far, capped at ham_thresh
    dist = ham_thresh
    for tnseq_index in tnseq_r_index:
        dist = min(dist, hamming_distance(test_index, tnseq_index))
    #dist is still ham_thresh only if every distance was >= ham_thresh
    if dist==ham_thresh:
        print(nxt_n7xx[2*i])
        print(nxt_n7xx[2*i+1])

>N701
TCGCCTTA
>N703
TTCTGCCT
>N704
GCTCAGGA
>N705
AGGAGTCC
>N707
GTAGAGAG
>N708
CCTCTCTG
>N709
AGCGTAGC
>N710
CAGCCTCG
>N711
TGCCTCTT
>N712
TCCTCTAC
>N713
ATTACAAT
>N714
GAATGATC
>N716
AATAACGG
>N717
TAGAAGAA
>N718
GTCAGGTA
>N719
GCGGTCCT
>N720
AATCGGAC
>N721
AACTCGTG
>N722
GGCCGTGG
>N723
TTACATGT
>N724
AGTTAACA


### These are the Nextera index (barcode) sequences that are compatible with my TnSeq primer index sequences
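
The two filtering loops above apply the same rule: keep a candidate only if its distance to every reference index is at least the threshold. As a sketch, that rule can be folded into one reusable function. `compatible_barcodes` is a hypothetical name, and the sequences below are toy data, not the real index sets.

```python
def hamming_distance(str1, str2):
    # same definition as earlier in the notebook
    assert len(str1) == len(str2), "DNA barcodes aren't equal in length"
    return sum(char1 != char2 for char1, char2 in zip(str1, str2))

def compatible_barcodes(candidates, references, min_dist):
    """Return names of candidates at least min_dist away from every reference."""
    return [name for name, seq in candidates.items()
            if all(hamming_distance(seq, ref) >= min_dist for ref in references)]

# toy data for illustration only
candidates = {'cand_A': 'AAAAAAAA', 'cand_B': 'ACGTACGT'}
references = ['AAAAAACC', 'TTTTTTTT']
print(compatible_barcodes(candidates, references, 4))  # -> ['cand_B']
```

Here `cand_A` is rejected because it is only 2 mismatches away from the first reference, while `cand_B` is 6 away from both.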