# Bioinformatics

This notebook illustrates some basic bioinformatics in Python. It follows, and draws material from:

http://hplgit.github.io/bioinf-py/doc/pub/html/index.html

Life is definitely digital. The genetic code of all living organisms is represented by a long sequence of simple molecules called nucleotides, or bases, which make up deoxyribonucleic acid, better known as DNA. There are only four such nucleotides, and the entire genetic code of a human can be seen as a simple, though 3 billion long, string of the letters A, C, G, and T. Analyzing DNA data to gain biological understanding is largely a matter of searching (long) strings for certain patterns involving the letters A, C, G, and T. This is an integral part of bioinformatics, a scientific discipline addressing the use of computers to search for, explore, and use information about genes, nucleic acids, and proteins.

## Basic Bioinformatics Examples in Python

The instructions that tell the computer how to perform the analysis are written in the Python programming language. The forthcoming examples are simple illustrations of the type of problem settings and corresponding Python implementations that are encountered in bioinformatics. However, the leading Python software for bioinformatics applications is [BioPython](https://biopython.org/), and for real-world problem solving one should use BioPython rather than home-made solutions. The aim of the sections below is to illustrate the nature of bioinformatics analysis and introduce what is inside packages like BioPython.

We shall start with some very simple examples of DNA analysis that bring together basic building blocks in programming: loops, if tests, and functions. As a reader, you should be somewhat familiar with these building blocks in general and also know the specific Python syntax.

## Counting Letters in DNA Strings

Given some string `dna` containing the letters A, C, G, or T, representing the bases that make up DNA, we ask the question: how many times does a certain base occur in the DNA string? For example, if `dna` is ATGGCATTA and we ask how many times the base A occurs in this string, the answer is 3.

A general Python implementation answering this problem can be written in many ways. The simplest uses the built-in `str.count` method.

In [1]:
list("ATGC")

['A', 'T', 'G', 'C']

In [2]:
dna = 'ATGCGGACCTAT'
base = 'C'
dna.count(base)

3
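The same count can also be written out explicitly. The sketch below shows two alternatives to `str.count`: a plain loop with an if test, and `collections.Counter` from the standard library, which counts every letter in one pass. The function name `count_base_loop` is illustrative and not part of the original material.

```python
from collections import Counter

def count_base_loop(dna, base):
    """Count occurrences of `base` in `dna` with an explicit loop."""
    n = 0
    for letter in dna:
        if letter == base:
            n += 1
    return n

dna = "ATGGCATTA"
print(count_base_loop(dna, "A"))   # same answer as dna.count("A")

all_counts = Counter(dna)          # maps each letter to its count
print(all_counts["A"], all_counts["T"], all_counts["G"], all_counts["C"])
```

The loop makes the logic visible, while `Counter` is convenient when the counts of all four bases are needed at once.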

## Generating Random DNA Strings

We can use the `random.choice` function to select a character from our 'alphabet' of A, T, C, and G.

The `random.choice(x)` function selects an element of the sequence `x` at random.

Note that the sequence length `N` is very often a large number. In Python 2.x, `range(N)` generates a list of `N` integers; `xrange` avoids the list by generating one integer at a time. In Python 3.x, `range` behaves like the `xrange` of version 2.x, so no list is built. Building a random DNA string of length 10 with a list comprehension gives

In [8]:
import random
alphabet = list("ATCG")
dna = "".join([random.choice(alphabet) for i in range(10)])
dna

'TGTGAGGGCC'
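Wrapping the construction in a function makes it reusable, and a seed makes the result reproducible. The following is a minimal sketch, assuming Python 3.6 or later for `random.choices`; the name `generate_dna` and its `seed` parameter are hypothetical and not from the original material.

```python
import random

def generate_dna(n, seed=None):
    """Return a random DNA string of length n, uniform over A, C, G, T."""
    rng = random.Random(seed)    # private generator; seed=None leaves it unseeded
    return "".join(rng.choices("ACGT", k=n))

first = generate_dna(10, seed=1)
second = generate_dna(10, seed=1)
print(first)
print(first == second)           # the same seed gives the same string
```

Using a private `random.Random` instance keeps the seed from affecting other code that relies on the global generator.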

## Computing Frequencies

Your genetic code is essentially the same from birth until death, and the same in your blood and your brain. What differs between cells is which genes are turned on and off. This regulation of genes is orchestrated by an immensely complex mechanism, which we have only started to understand. A central part of this mechanism consists of molecules called transcription factors that float around in the cell and attach to DNA, and in doing so turn nearby genes on or off. These molecules bind preferentially to specific DNA sequences, and this binding preference pattern can be represented by a table of frequencies of given symbols at each position of the pattern. More precisely, each row in the table corresponds to one of the bases A, C, G, and T, while column j records how many times that base appears in position j across the DNA sequences.

For example, if our set of DNA sequences is TAG, GGT, and GGG, the table becomes:

| **base** | **0** | **1** | **2** |
| ----- | -- | -- | -- |
| A | 0 | 1 | 0 |
| C | 0 | 0 | 0 |
| G | 2 | 2 | 2 |
| T | 1 | 0 | 1 |

From this table we can read that base A appears once at index 1 in the DNA strings, base C does not appear at all, base G appears twice at every position, and base T appears once at the beginning and once at the end of the strings.

In the following we shall present different data structures to hold such a table and different ways of computing them. The table is known as a frequency matrix in bioinformatics and this is the term used here too.
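As a small worked example, the table for TAG, GGT, and GGG can be reproduced with a dictionary of lists, one list of per-position counts for each base. This is a minimal sketch; the function name `freq_dict_of_lists` is illustrative, and the code assumes all sequences have the same length.

```python
def freq_dict_of_lists(dna_seqs):
    """Build a frequency matrix as {base: [count at position 0, 1, ...]}."""
    n = len(dna_seqs[0])                        # assumes equal-length sequences
    matrix = {base: [0] * n for base in "ACGT"}
    for seq in dna_seqs:
        for j, base in enumerate(seq):
            matrix[base][j] += 1                # one more `base` at position j
    return matrix

matrix = freq_dict_of_lists(["TAG", "GGT", "GGG"])
for base in "ACGT":
    print(base, matrix[base])
```

Each printed row corresponds to one row of the table above.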

In [194]:
import numpy as np
def generate_sequence(N, as_str=False):
    alphabet=list("ATCG")
    if as_str:
        return "".join([random.choice(alphabet) for i in range(N)])
    else:
        return [random.choice(alphabet) for i in range(N)]
    
dna_list = np.asarray([generate_sequence(100) for i in range(1000)])

We're going to use `numpy` and `pandas` to generate a matrix of these characters which should make counting substantially easier:

In [195]:
import pandas as pd

In [196]:
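def calculate_count(MAT):
    """Count each base at every position (column) of a 2-D character array."""
    n_seqs, seq_len = MAT.shape   # rows are sequences, columns are positions
    freqs = []
    for j in range(seq_len):
        # Use the argument MAT here, not the global dna_list
        uniq, counts = np.unique(MAT[:, j], return_counts=True)
        freqs.append(pd.Series(counts, index=uniq, dtype=np.int_))
    # A base missing from a column becomes NaN on concat; fill it with 0
    return pd.concat(freqs, axis=1, sort=False).fillna(0).astype(np.int_)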

counts = calculate_count(dna_list)

In [197]:
counts

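|   | **0** | **1** | **2** | **3** | **4** | **5** | **6** | **7** | **8** | **9** | **...** | **90** | **91** | **92** | **93** | **94** | **95** | **96** | **97** | **98** | **99** |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| A | 236 | 252 | 239 | 247 | 263 | 246 | 244 | 260 | 248 | 244 | ... | 243 | 243 | 253 | 235 | 257 | 247 | 257 | 239 | 251 | 240 |
| C | 280 | 263 | 242 | 247 | 242 | 251 | 265 | 260 | 252 | 257 | ... | 243 | 261 | 258 | 260 | 229 | 277 | 222 | 265 | 240 | 268 |
| G | 252 | 256 | 281 | 265 | 260 | 253 | 245 | 258 | 245 | 266 | ... | 237 | 248 | 220 | 249 | 232 | 252 | 264 | 238 | 224 | 241 |
| T | 232 | 229 | 238 | 241 | 235 | 250 | 246 | 222 | 255 | 233 | ... | 277 | 248 | 269 | 256 | 282 | 224 | 257 | 258 | 285 | 251 |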


## Analyzing the Frequency Matrix

Having built a frequency matrix out of a collection of DNA strings, it is time to use it for analysis. The short DNA strings that a frequency matrix is built from are typically substrings of a larger DNA sequence that share some common purpose. An example is a set of substrings that serve as anchors, or magnets, at which given molecules attach to DNA and perform biological functions (like turning genes on or off). With the frequency matrix constructed from a limited set of known anchor locations (substrings), we can now scan for other similar substrings that have the potential to perform the same function.

The simplest way to do this is to first determine the most typical substring according to the frequency matrix, i.e., the substring having the most frequent nucleotide at each position. This is referred to as the consensus string of the frequency matrix. We can then look for occurrences of the consensus string in a larger DNA sequence, and consider these occurrences as likely candidates for serving the same function (e.g., as anchor locations for molecules).

In [116]:
def find_consensus_v1(frequency_matrix):
    base2index = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    consensus = ''
    frequency_matrix=np.asarray(frequency_matrix)
    dna_length = len(frequency_matrix[0])

    for i in range(dna_length):  # loop over positions in string
        max_freq = -1            # holds the max freq. for this i
        max_freq_base = None     # holds the corresponding base

        for base in 'ATGC':
            if frequency_matrix[base2index[base]][i] > max_freq:
                max_freq = frequency_matrix[base2index[base]][i]
                max_freq_base = base
            elif frequency_matrix[base2index[base]][i] == max_freq:
                max_freq_base = '-' # more than one base as max

        consensus += max_freq_base  # add new base with max freq
    return consensus

In [117]:
find_consensus_v1(counts)

'AATCAAT-ACAACAATTTCCGTTGCCGCCGTAAAA-ACATCTC-GATGAAGGTCGCCGTATCATACCCTAAT-AGTGGGTCTGACGTGAACATTAATAAT'

Alternatively, `pandas` lets us express this more concisely, although the timings below show that it is substantially slower than the explicit loop.

We create a *custom function* called `exclusive_max` that returns the base with the highest count *only if* it is a unique maximum; otherwise it returns `'-'`.

We then simply aggregate over all the base positions and apply the function.

In [165]:
def exclusive_max(col):
    """Given a column of DNA base frequencies, find the 
    exclusive maximum, or return '-' if no exclusive exists.
    """
    mx = col.max()
    if ((col.eq(mx).sum())) == 1:
        return col.idxmax()
    else:
        return "-"

def find_consensus_slow(freq_matrix):
    """assumes freq_matrix is a pandas.DataFrame"""
    if isinstance(freq_matrix, pd.DataFrame):
        return freq_matrix.aggregate(exclusive_max).str.cat()
    else:
        raise TypeError("freq_matrix must be of type 'pd.DataFrame'")

In [166]:
find_consensus_slow(counts)

'AATCAAT-ACAACAATTTCCGTTGCCGCCGTAAAA-ACATCTC-GATGAAGGTCGCCGTATCATACCCTAAT-AGTGGGTCTGACGTGAACATTAATAAT'

In [167]:
%timeit find_consensus_v1(counts)

203 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [168]:
%timeit find_consensus_slow(counts)

28.8 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
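The last step described at the start of this section, looking for occurrences of the consensus string in a larger DNA sequence, can be sketched with `str.find`. The function name `find_all` and the short test sequence are made up for illustration. Note that a consensus string containing the tie marker `-` will never match a real DNA sequence, so ties would need separate handling.

```python
def find_all(sequence, pattern):
    """Return the start index of every (possibly overlapping) match of pattern."""
    hits = []
    start = sequence.find(pattern)
    while start != -1:
        hits.append(start)
        # Resume one position after the last hit so overlapping matches are found
        start = sequence.find(pattern, start + 1)
    return hits

hits = find_all("GGTAGGTAGTAG", "TAG")
print(hits)   # [2, 6, 9]
```

Each returned index is a candidate location where the pattern, and hence possibly the same biological function, occurs.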


## Dot Plots from a Pair of DNA Sequences

Dot plots are commonly used to visualize the similarity between two protein or nucleic acid sequences. They compare two sequences, say $d_1$ and $d_2$, by organizing $d_1$ along the x-axis and $d_2$ along the y-axis of a plot. When `d1[i] == d2[j]` we mark this by drawing a dot at location $(i,j)$ in the plot. A first implementation stores the plot as a list of lists of the characters '0' and '1':

In [127]:
def dotplot_list_of_lists(dna_x, dna_y):
    dotplot_matrix = [['0' for x in dna_x] for y in dna_y]
    for x_index, x_value in enumerate(dna_x):
        for y_index, y_value in enumerate(dna_y):
            if x_value == y_value:
                dotplot_matrix[y_index][x_index] = '1'
    return dotplot_matrix

In [128]:
dna_x = 'TAATGCCTGAAT'
dna_y = 'CTCTATGCC'
dotplot_list_of_lists(dna_x, dna_y)

[['0', '0', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0'],
 ['1', '0', '0', '1', '0', '0', '0', '1', '0', '0', '0', '1'],
 ['0', '0', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0'],
 ['1', '0', '0', '1', '0', '0', '0', '1', '0', '0', '0', '1'],
 ['0', '1', '1', '0', '0', '0', '0', '0', '0', '1', '1', '0'],
 ['1', '0', '0', '1', '0', '0', '0', '1', '0', '0', '0', '1'],
 ['0', '0', '0', '0', '1', '0', '0', '0', '1', '0', '0', '0'],
 ['0', '0', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0'],
 ['0', '0', '0', '0', '0', '1', '1', '0', '0', '0', '0', '0']]

In [129]:
def dotplot_numpy(dna_x, dna_y):
    # np.int was removed in NumPy 1.24; the built-in int is the replacement
    dotplot_matrix = np.zeros((len(dna_y), len(dna_x)), dtype=int)
    for x_index, x_value in enumerate(dna_x):
        for y_index, y_value in enumerate(dna_y):
            if x_value == y_value:
                dotplot_matrix[y_index,x_index] = 1
    return dotplot_matrix

In [130]:
dotplot_numpy(dna_x, dna_y)

array([[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
       [0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]])
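The double loop can also be replaced by a single NumPy comparison using broadcasting. This is a sketch of one possible vectorized variant, not the original author's implementation; the name `dotplot_broadcast` is illustrative.

```python
import numpy as np

def dotplot_broadcast(dna_x, dna_y):
    """Dot plot via broadcasting: rows follow dna_y, columns follow dna_x."""
    x = np.array(list(dna_x))
    y = np.array(list(dna_y))
    # y[:, None] has shape (len_y, 1) and x[None, :] has shape (1, len_x),
    # so the comparison broadcasts to a (len_y, len_x) boolean matrix
    return (y[:, None] == x[None, :]).astype(int)

m = dotplot_broadcast("TAATGCCTGAAT", "CTCTATGCC")
print(m.shape)
print(m[0])
```

The result has the same layout as the matrices above: entry `[j, i]` is 1 exactly when `dna_y[j] == dna_x[i]`.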

## Base Frequencies

DNA consists of four molecules called nucleotides, or bases, and can be represented as a string of the letters A, C, G, and T. But this does not mean that all four nucleotides need to be equally frequent. Are some nucleotides more frequent than others, say in yeast, as represented by the first chromosome of yeast? Also, DNA is really not a single thread, but two threads wound together. This winding is based on an A from one thread binding to a T of the other thread, and C binding to G (that is, A will only bind with T, not with C or G). Could this fact force groups of the four symbol frequencies to be equal? The answer is that the A-T and G-C binding does not in principle force certain frequencies to be equal, but in practice they usually become so because of evolutionary factors related to this pairing.

Our first programming task now is to compute the frequencies of the bases A, C, G, and T. That is, the number of times each base occurs in the DNA string, divided by the length of the string. For example, if the DNA string is ACGGAAA, the length is 7, A appears 4 times with frequency 4/7, C appears once with frequency 1/7, G appears twice with frequency 2/7, and T does not appear so the frequency is 0.

From a coding perspective, we could create one function for counting how many times A, C, G, and T appear in the string and another for computing the frequencies. In both cases we want dictionaries, so that we can index with the character and get the count or the frequency out. The version below combines both steps in a single dictionary comprehension.

In [191]:
def get_base_frequencies_v1(s):
    return {base: s.count(base) / float(len(s)) for base in "ATGC"}

In [207]:
get_base_frequencies_v1(generate_sequence(1000, as_str=True))

{'A': 0.257, 'T': 0.236, 'G': 0.24, 'C': 0.267}
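The two-function design described above can be sketched as follows, using the ACGGAAA example from the text. The names `get_base_counts` and `get_base_frequencies` are illustrative, and the code assumes a non-empty string.

```python
def get_base_counts(dna):
    """Return {base: number of occurrences} for A, C, G, and T."""
    counts = {base: 0 for base in "ACGT"}
    for letter in dna:
        if letter in counts:          # ignore anything that is not A, C, G, or T
            counts[letter] += 1
    return counts

def get_base_frequencies(dna):
    """Return {base: count / len(dna)}; assumes dna is not empty."""
    counts = get_base_counts(dna)
    return {base: count / len(dna) for base, count in counts.items()}

freqs = get_base_frequencies("ACGGAAA")
print(freqs)
```

For ACGGAAA this gives 4/7 for A, 1/7 for C, 2/7 for G, and 0 for T, matching the hand calculation.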

### Real world example: Yeast chromosome

    wget http://hplgit.github.com/bioinf-py/data/yeast_chr1.txt

In [203]:
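def read_dnafile(filename):
    # Read all lines and join them into one string without newlines;
    # the with block closes the file even if reading fails
    with open(filename, "r") as infile:
        lines = infile.readlines()
    dna = "".join([line.strip() for line in lines])
    return dna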

In [206]:
yeast = read_dnafile("yeast_chr1.txt")
get_base_frequencies_v1(yeast)

{'A': 0.3033256880733945,
 'T': 0.30394252154573254,
 'G': 0.19879847789824853,
 'C': 0.1939333124826244}

The varying frequency of different nucleotides in DNA is referred to as nucleotide bias. The nucleotide bias varies between organisms, and has a range of biological implications. For many organisms the nucleotide bias has been highly optimized through evolution and reflects characteristics of the organisms and their environments, for instance the typical temperature the organism is adapted to. The interested reader can, e.g., find more details in [this article](http://embor.embopress.org/content/6/12/1208).
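The earlier question of whether the A-T and G-C pairing makes frequencies equal in practice can be checked directly against the yeast numbers. The sketch below copies the frequencies from the output above and compares the paired bases; the variable names and the combined G + C fraction are added here for illustration.

```python
# Frequencies copied from the yeast chromosome output above
yeast_freqs = {
    "A": 0.3033256880733945,
    "T": 0.30394252154573254,
    "G": 0.19879847789824853,
    "C": 0.1939333124826244,
}

at_gap = abs(yeast_freqs["A"] - yeast_freqs["T"])    # difference within the A-T pair
gc_gap = abs(yeast_freqs["G"] - yeast_freqs["C"])    # difference within the G-C pair
gc_fraction = yeast_freqs["G"] + yeast_freqs["C"]    # share of the string that is G or C

print(f"|A - T| = {at_gap:.4f}")
print(f"|G - C| = {gc_gap:.4f}")
print(f"G + C   = {gc_fraction:.4f}")
```

Within each pair the frequencies differ by less than one percentage point, while the two pairs differ markedly from each other, which is the nucleotide bias described above.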

## Translating Genes into Proteins

An important use of DNA is for cells to store information on their arsenal of proteins. Briefly, a gene is a region of the DNA consisting of several coding parts (called exons) interspersed with non-coding parts (called introns). The coding parts are concatenated to form a string called mRNA, in which every occurrence of the letter T is replaced by U. Each triplet of mRNA letters codes for a specific amino acid, and amino acids are the building blocks of proteins. Consecutive triplets of letters in mRNA therefore define a specific sequence of amino acids, which amounts to a certain protein.

Here is an example of using the mapping from DNA to proteins to create the lactase protein (LPH), using the DNA sequence of the lactase gene (LCT) as the underlying code. An important function of LPH is digesting lactose, which is found most notably in milk. Lack of LPH function leads to digestive problems referred to as lactose intolerance. Most mammals, and many humans, lose their expression of LCT, and therefore their ability to digest milk, when they stop receiving breast milk.

The genetic code file can be obtained with:

    wget http://hplgit.github.com/bioinf-py/data/genetic_code.tsv

In [217]:
genetic_code = pd.read_csv("genetic_code.tsv", sep="\t", header=None)
genetic_code.columns = ["Triplet", "AA-code", "AA-name", "AA-fullname"]

In [219]:
genetic_code.head()

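|   | **Triplet** | **AA-code** | **AA-name** | **AA-fullname** |
| -- | -- | -- | -- | -- |
| 0 | UUU | F | Phe | Phenylalanine |
| 1 | UUC | F | Phe | Phenylalanine |
| 2 | UUA | L | Leu | Leucine |
| 3 | UUG | L | Leu | Leucine |
| 4 | CUU | L | Leu | Leucine |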


To form mRNA, we need to grab the exon regions (the coding parts) of the lactase gene. These regions are substrings of the lactase gene DNA string, corresponding to the start and end positions of the exon regions. Then we must replace T by U, and combine all the substrings to build the mRNA string.

Obtain the lactase exon using the following:

    wget http://hplgit.github.com/bioinf-py/data/lactase_exon.tsv
    wget http://hplgit.github.com/bioinf-py/data/lactase_gene.txt
    
where `lactase_exon` provides the (start, end) positions of each exon region.

In [224]:
lactase_exon = pd.read_csv("lactase_exon.tsv",sep="\t",header=None)
lactase_exon.columns=["start","end"]

In [226]:
lactase_exon.head()

|   | **start** | **end** |
| -- | -- | -- |
| 0 | 0 | 651 |
| 1 | 3990 | 4070 |
| 2 | 7504 | 7588 |
| 3 | 13177 | 13280 |
| 4 | 15082 | 15161 |


In [229]:
lactase_gene = read_dnafile("lactase_gene.txt")

For simplicity’s sake, we shall consider mRNA as the concatenation of exons, although in reality additional nucleotides are added to each end. With the lactase gene as a string and the exon regions as a `DataFrame` of (start, end) rows, it is straightforward to extract the regions as substrings, replace T by U, and concatenate the substrings:

In [251]:
def create_mRNA(gene, exon_regions):
    mrna = ''
    for i, (start, end) in exon_regions.iterrows():
        mrna += gene[start:end].replace('T','U')
    return mrna

In [253]:
mrna_lactase = create_mRNA(lactase_gene,lactase_exon)

In [255]:
mrna_lactase[:50]

'GUUCCUAGAAAAUGGAGCUGUCUUGGCAUGUAGUCUUUAUUGCCCUGCUA'
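To see the exon extraction on something small enough to check by hand, here is a sketch using a made-up gene and a plain list of (start, end) tuples instead of a `DataFrame`. The function name, the toy sequence, and the exon positions are all invented for illustration. The (start, end) pairs are treated as half-open ranges, the same way `gene[start:end]` slices in the code above.

```python
def create_mrna_from_tuples(gene, exon_regions):
    """Join the exon substrings of gene and transcribe T to U."""
    exons = [gene[start:end] for start, end in exon_regions]
    return "".join(exons).replace("T", "U")

toy_gene = "ATGTTTCCCCGGCTAAGG"     # made-up sequence, not real data
toy_exons = [(0, 6), (10, 16)]      # two exons: ATGTTT and GGCTAA
toy_mrna = create_mrna_from_tuples(toy_gene, toy_exons)
print(toy_mrna)   # AUGUUUGGCUAA
```

The four bases between the two exons (CCCC) play the role of an intron and are dropped from the mRNA.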

To create the protein, we replace the triplets of the mRNA strings by the corresponding 1-letter name as specified in the `genetic_code.tsv` file.

In [273]:
def create_protein(mrna, genetic_code):
    # create dictionary
    dd = genetic_code[["Triplet","AA-code"]].set_index("Triplet").squeeze().to_dict()
    protein = ''
    for i in range(len(mrna)//3):
        start = i * 3
        end = start + 3
        protein += dd[mrna[start:end]]
    return protein

In [287]:
lactase_prot = create_protein(mrna_lactase,genetic_code)

In [291]:
lactase_prot

'VPRKWSCLGMXSLLPCXVFHAGGQTGSLIEISFPPLVLXPMTCCTTXVVSWETRVLTLXQGTKTCMFVTSHCPLSCQNTSAVSMPVRSPIIRYFCHGHSSSQQEAPRIQTRKQCSATGDSSRPSRLHGFSPWSSCTTRPSLPAPSGEPKPLLTSSPTMPHSPSTPSGTXLGSGSPSVTWRKXSRSFPTRNQERHNSRPSVMPTEKPMRFTTKAMLFRAENSLLSCELKISRSSCXNHPYLRLPRTRSISSLLICLMNAKMRQVCGRSXVNCRPLSQKXKFSSSTXNSQTAPPPXRTQPVCSSAFLKPXIKTKCSPLGLILMSFXVVHQVPRKACLVLXLAAWPFSLTSSRTTRPRTPLLPLPIRESGKHLPISPGRKGMPSCRILSLKASSGVPPQEPLTWKEAGPRVGEGXASGIHAGPXTPLRAKRRWRWPATVTTRXPLTSPCFAASGLRCTSSPSPGPGSSPWGTGAAPASQALPTTTSXLTGYRMRASSPWPRCSTGTCLRPCRIMVDGRMRAWWMPSWTMRPSASPHLGTVXSCGXPSMSRGXXATQAMAPASTLPASLTQEWPLLRWLTWSSRLMPELGTTTTAIIAHSSRGTWALCXTQTGQNPCLQRGLRTXEPLSASCTSCWAGLHTPSLWMETTQPPXGPRSNRXTDSAPILWLNSPSSQRQRSSSXKALLIFWVCRITPPASSATPHKTPASLAMIPLEASPNTXTMCGPRPHPLGFVWCPGGXGGCCSLYPWNTQEEKFQYTLPGMACPXGKVKISLMIPXEXTTSINISMRCSRLSRKTLWMFVPTLLVPSLMASKALLVTASGLACTTSTSATAASQGLPGNLPTFSLASXKRTVSSPRGQKDCYHLIQXTSPPKSEPSLFHLRCPPRLKSFGKSSPANPSSKEICSTTGRFGMTFCGACPLPLIRLKARGMPMAKAPASGITLPTHQGAMXKTMPLETSPVTAITSWMPIXICSELXRXRPTASLSPGLGFSQLGETALSTVM

Unfortunately, this first attempt at simulating translation is incorrect. Translation always begins at the start codon AUG, which codes for the amino acid methionine, and ends when one of the stop codons is met (the code below treats the one-letter code X as a stop). We must therefore check for the correct start and stop criteria. A fix is

In [288]:
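def create_protein_fixed(mrna, genetic_code):
    # create dictionary mapping each triplet to its one-letter amino acid code
    dd = genetic_code[["Triplet","AA-code"]].set_index("Triplet").squeeze().to_dict()
    protein_fixed = ''
    trans_start_pos = mrna.find('AUG')   # translation starts at the first AUG
    if trans_start_pos == -1:            # no start codon, so nothing to translate
        return protein_fixed
    for i in range(len(mrna[trans_start_pos:])//3):
        start = trans_start_pos + i*3
        end = start + 3
        amino = dd[mrna[start:end]]
        if amino == 'X':                 # 'X' marks a stop codon
            break
        protein_fixed += amino
    return protein_fixed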

In [289]:
lactase_prot2 = create_protein_fixed(mrna_lactase,genetic_code)

In [290]:
lactase_prot2

'MELSWHVVFIALLSFSCWGSDWESDRNFISTAGPLTNDLLHNLSGLLGDQSSNFVAGDKDMYVCHQPLPTFLPEYFSSLHASQITHYKVFLSWAQLLPAGSTQNPDEKTVQCYRRLLKALKTARLQPMVILHHQTLPASTLRRTEAFADLFADYATFAFHSFGDLVGIWFTFSDLEEVIKELPHQESRASQLQTLSDAHRKAYEIYHESYAFQGGKLSVVLRAEDIPELLLEPPISALAQDTVDFLSLDLSYECQNEASLRQKLSKLQTIEPKVKVFIFNLKLPDCPSTMKNPASLLFSLFEAINKDQVLTIGFDINEFLSCSSSSKKSMSCSLTGSLALQPDQQQDHETTDSSPASAYQRIWEAFANQSRAERDAFLQDTFPEGFLWGASTGAFNVEGGWAEGGRGVSIWDPRRPLNTTEGQATLEVASDSYHKVASDVALLCGLRAQVYKFSISWSRIFPMGHGSSPSLPGVAYYNKLIDRLQDAGIEPMATLFHWDLPQALQDHGGWQNESVVDAFLDYAAFCFSTFGDRVKLWVTFHEPWVMSYAGYGTGQHPPGISDPGVASFKVAHLVLKAHARTWHHYNSHHRPQQQGHVGIVLNSDWAEPLSPERPEDLRASERFLHFMLGWFAHPVFVDGDYPATLRTQIQQMNRQCSHPVAQLPEFTEAEKQLLKGSADFLGLSHYTSRLISNAPQNTCIPSYDTIGGFSQHVNHVWPQTSSSWIRVVPWGIRRLLQFVSLEYTRGKVPIYLAGNGMPIGESENLFDDSLRVDYFNQYINEVLKAIKEDSVDVRSYIARSLIDGFEGPSGYSQRFGLHHVNFSDSSKSRTPRKSAYFFTSIIEKNGFLTKGAKRLLPPNTVNLPSKVRAFTFPSEVPSKAKVVWEKFSSQPKFERDLFYHGTFRDDFLWGVSSSAYQIEGAWDADGKGPSIWDNFTHTPGSNVKDNATGDIACDSYHQLDADLNMLRALKVKAYRFSISWSRIFPTGRNSSINSHGVDY
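The start and stop rules are easier to verify on a short string. The sketch below uses a four-entry excerpt of the standard genetic code in a plain dictionary, with `X` standing for a stop codon as in the code above. The function name `translate`, the dictionary `toy_code`, and the toy mRNA are made up for illustration.

```python
# Tiny excerpt of the standard genetic code; 'X' marks a stop codon
toy_code = {"AUG": "M", "UUU": "F", "GGC": "G", "UAA": "X"}

def translate(mrna, code):
    """Translate from the first AUG until a stop codon or the end of mrna."""
    start = mrna.find("AUG")
    if start == -1:                  # no start codon found
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):   # step through complete triplets
        amino = code[mrna[i:i + 3]]
        if amino == "X":
            break
        protein.append(amino)
    return "".join(protein)

toy_protein = translate("CCAUGUUUGGCUAAUUU", toy_code)
print(toy_protein)   # MFG
```

The leading CC is skipped because it precedes the start codon, and the trailing UUU is ignored because it follows the stop codon UAA.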

## Random Mutation of Genes

A simple model of mutation replaces the letter at a randomly chosen position of the DNA with a randomly chosen letter from the alphabet A, C, G, and T. Python’s `random` module can be used to generate the random numbers. Selecting a random position means generating a random index into the DNA string, and `random.randint(a, b)` generates random integers between `a` and `b` (both included). The easiest way to generate a random letter is to keep a list of the letters and use `random.choice(list)` to pick an arbitrary element.

Because Python strings are immutable, changing a character requires constructing a new string. The most straightforward implementation therefore converts the DNA string to a list of letters, changes one element in place, and joins the list back into a string:

In [296]:
lactase_gene[10]

'A'

In [301]:
random.choice(list("ATCG"))

'A'

In [308]:
import random
def mutate_v1(dna):
    dna_list = list(dna)
    mutation_site = random.randint(0, len(dna_list) - 1)
    dna_list[mutation_site] = random.choice(list('ATCG'))
    return ''.join(dna_list)

In [314]:
mutate_v1("ATTTACG")

'TTTTACG'
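Note that `mutate_v1` may draw the base that is already at the chosen position, in which case the string comes back unchanged. If every call should produce an actual change, one option is to draw only from the three other bases. This is a sketch of that variant; the name `mutate_point` is hypothetical, and it builds the new string by slicing rather than through a list.

```python
import random

def mutate_point(dna, rng=random):
    """Replace one randomly chosen base with a different base."""
    site = rng.randrange(len(dna))                        # 0 <= site < len(dna)
    alternatives = [b for b in "ACGT" if b != dna[site]]  # the three other bases
    return dna[:site] + rng.choice(alternatives) + dna[site + 1:]

original = "ATTTACG"
mutated = mutate_point(original)
n_changed = sum(a != b for a, b in zip(original, mutated))
print(original, mutated, n_changed)   # n_changed is always 1
```

Which behavior is appropriate depends on the model: allowing a base to be replaced by itself lowers the effective mutation rate.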

In [334]:
i2c = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}

def mutate_v2(dna, N):
    dna = np.asarray(dna, dtype='c')  # array of characters
    # np.random.randint excludes the upper bound (unlike random.randint),
    # so use len(dna) to make the last position reachable
    mutation_sites = np.random.randint(0, len(dna), size=N)
    # Must draw bases as integers; upper bound 4 so that 3 ('T') can be drawn
    new_bases_i = np.random.randint(0, 4, size=N)
    # Translate integers to characters
    new_bases_c = np.zeros(N, dtype='c')
    for i in i2c:
        new_bases_c[new_bases_i == i] = i2c[i]
    dna[mutation_sites] = new_bases_c
    return "".join(dna.astype("str").tolist())

In [339]:
mutate_v2(lactase_gene[:500],50)

'CTTCCTAGAAACTGGAGCTGTCTTGGCATGTAGTCCTTATTGCCCTGCTAAGTTTTTCATGCTGGGGGGCACACTGGGAGTCTGATAGAAATTTCATTTCCACCGCTGGTCCTCAAACCAATGACTTCCCGCACAACCAGAATGGTGTACTGCGAGACCAAAGTTATAACATTGTAGCAGCGGCCGAAGACATGTATGTTTGTCACCAACCACTGCCCAATTTCCTGCCAGAATACTTCAGCAGTGTCCATGCCAGTCAGATCACCCATTATACGGTATTTCTGTCATGAGCACAGCTCCTCCCAGCAGGAAGGACCCAGAATCCAGCCGAGACAAAAGTGCAGTGCTAGCGGCGAGTCCTCAACGCCCTCAAGACTGCACGGCTTCAGCCCATGGTCATCCTAGACCACCAGACCCTCCCTGCCAGCACGCTCCGGAGAAGCGAAGCCTTTGCTGACCTCTTCGCCGACTATGCCACATTCACCTTACACTCCTTCGGG'
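The same vectorized idea can be written with NumPy's newer `Generator` interface, followed by a quick check of how many positions actually changed. This is a sketch under stated assumptions, not the original implementation: the name `mutate_many`, the `seed` parameter, and the repeated-ATGC test string are made up for illustration.

```python
import numpy as np

def mutate_many(dna, n, seed=None):
    """Apply n random point mutations to dna and return the new string."""
    rng = np.random.default_rng(seed)
    letters = np.array(list(dna))                     # one character per element
    sites = rng.integers(0, len(letters), size=n)     # upper bound is excluded
    letters[sites] = rng.choice(np.array(list("ACGT")), size=n)
    return "".join(letters.tolist())

original = "ATGC" * 25
mutated = mutate_many(original, 10, seed=0)
n_changed = sum(a != b for a, b in zip(original, mutated))
print(len(mutated), n_changed)
```

The number of changed positions can be smaller than `n`, because a site may be drawn more than once and a new base may equal the old one.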