## LSI31008 Elements of Bioinformatics, Assignment: Basic sequence statistics, the genetic code and transcription factor binding sites

## Introduction to the assignment

Here we will explore the (near) universal genetic code discussed in the lecture. We first load the coding sequences of the yeast *S. cerevisiae*, downloaded from [SGD](https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/). The idea is to compute various statistics from the genomic data and to interpret them in the light of the genetic code. For background reading, please have a look at ["JB Plotkin, G Kudla: Synonymous but not the same: the causes and consequences of codon bias"](https://www.nature.com/articles/nrg2899) and ["EV Koonin, AS Novozhilov: Origin and evolution of the genetic code: The universal enigma"](https://iubmb.onlinelibrary.wiley.com/doi/abs/10.1002/iub.146).

The Biopython package offers easy ways to read FASTA files and to handle sequences. To install it in the CSC Jupyter environment, run the following commands. (Note: you might need to run this cell separately and then restart the notebook so that the package is available to your session.)

In [1]:
import sys
!{sys.executable} -m pip install --user biopython



After Biopython installs, you need to restart the Jupyter kernel by selecting `Kernel -> Restart Kernel...` in the user interface at the top of the page (or clicking on the circular arrow button).

### Library imports

Import pyplot for figure generation:

In [1]:
from matplotlib import pyplot as plt

Import also Counter and OrderedDict, for subsequent usage:

In [2]:
from collections import Counter, OrderedDict

Import Biopython's SeqIO for reading FASTA files (if the import fails, restart the kernel as described in the previous section).

Please see the following links for information on using Biopython sequence tools:

https://biopython.org/wiki/SeqIO

https://biopython.org/wiki/SeqRecord

https://biopython.org/wiki/Seq

In [3]:
from Bio import SeqIO

### Data reading and primary exploration

The yeast coding sequence data used in this exercise is provided as a FASTA file containing many short sequences, each identified by an ID and accompanied by supplementary information such as chromosome number and genomic position.

We read the sequence data file into a dictionary of SeqRecord objects:

In [4]:
input_file = 'data/orf_coding.fasta'
records = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))

Check that we have 5917 keys (IDs):

In [5]:
ids = list(records.keys())
len(ids)

5917

A single record in the dictionary holds several kinds of data. Let's see what the record for the **first** ID contains:

In [6]:
records[ids[0]]

SeqRecord(seq=Seq('ATGGTACTGACGATTTATCCTGACGAACTCGTACAAATAGTGTCTGATAAAATT...TAA'), id='YAL001C', name='YAL001C', description='YAL001C TFC3 SGDID:S000000001, Chr I from 151006-147594,151166-151097, Genome Release 64-2-1, reverse complement, intron sequence removed, Verified ORF, "Subunit of RNA polymerase III transcription initiation factor complex; part of the TauB domain of TFIIIC that binds DNA at the BoxB promoter sites of tRNA and similar genes; cooperates with Tfc6p in DNA binding; largest of six subunits of the RNA polymerase III transcription initiation factor complex (TFIIIC)"', dbxrefs=[])

To get the nucleotide sequence, use `.seq`:

In [7]:
sequence = (records[ids[0]]).seq
sequence

Seq('ATGGTACTGACGATTTATCCTGACGAACTCGTACAAATAGTGTCTGATAAAATT...TAA')

It may be more convenient to convert the sequence to a string (`str`):

In [8]:
str(sequence)

'ATGGTACTGACGATTTATCCTGACGAACTCGTACAAATAGTGTCTGATAAAATTGCTTCAAATAAGGGAAAAATCACTTTGAATCAGCTGTGGGATATATCTGGTAAATATTTTGATTTGTCTGATAAAAAAGTTAAACAGTTCGTGCTTTCATGCGTGATATTGAAAAAGGACATTGAGGTGTATTGTGATGGTGCTATAACAACTAAAAATGTGACTGATATTATAGGCGACGCTAATCATTCATACTCGGTTGGGATTACTGAGGACAGCCTATGGACATTATTAACGGGATACACAAAAAAGGAGTCAACTATTGGAAATTCTGCATTTGAACTACTTCTCGAAGTTGCCAAATCAGGAGAAAAAGGGATCAATACTATGGATTTGGCGCAGGTAACTGGGCAAGATCCTAGAAGTGTGACTGGACGTATCAAGAAAATAAACCACCTGTTAACAAGTTCACAACTGATTTATAAGGGACACGTCGTGAAGCAATTGAAGCTAAAAAAATTCAGCCATGACGGGGTGGATAGTAATCCCTATATTAATATTAGGGATCATTTAGCAACAATAGTTGAGGTGGTAAAACGATCAAAAAATGGTATTCGCCAGATAATTGATTTAAAGCGTGAATTGAAATTTGACAAAGAGAAAAGACTTTCTAAAGCTTTTATTGCAGCTATTGCATGGTTAGATGAAAAGGAGTACTTAAAGAAAGTGCTTGTAGTATCACCCAAGAATCCTGCCATTAAAATCAGATGTGTAAAATACGTGAAAGATATTCCAGACTCTAAAGGCTCGCCTTCATTTGAGTATGATAGCAATAGCGCGGATGAAGATTCTGTATCAGATAGCAAGGCAGCTTTCGAAGATGAAGACTTAGTCGAAGGTTTAGATAATTTCAATGCGACTGATTTATTACAAAATCAAGGCCTTGTTATGGAAGAGAAAGAGGATGCTGTAAAGAATGAAGTTCTTCTTAATCGATTTTATCCA

Use `.translate()` to get the amino acid sequence:

In [9]:
str(sequence.translate())

'MVLTIYPDELVQIVSDKIASNKGKITLNQLWDISGKYFDLSDKKVKQFVLSCVILKKDIEVYCDGAITTKNVTDIIGDANHSYSVGITEDSLWTLLTGYTKKESTIGNSAFELLLEVAKSGEKGINTMDLAQVTGQDPRSVTGRIKKINHLLTSSQLIYKGHVVKQLKLKKFSHDGVDSNPYINIRDHLATIVEVVKRSKNGIRQIIDLKRELKFDKEKRLSKAFIAAIAWLDEKEYLKKVLVVSPKNPAIKIRCVKYVKDIPDSKGSPSFEYDSNSADEDSVSDSKAAFEDEDLVEGLDNFNATDLLQNQGLVMEEKEDAVKNEVLLNRFYPLQNQTYDIADKSGLKGISTMDVVNRITGKEFQRAFTKSSEYYLESVDKQKENTGGYRLFRIYDFEGKKKFFRLFTAQNFQKLTNAEDEISVPKGFDELGKSRTDLKTLNEDNFVALNNTVRFTTDSDGQDIFFWHGELKIPPNSKKTPNKNKRKRQVKNSTNASVAGNISNPKRIKLEQHVSTAQEPKSAEDSPSSNGGTVVKGKVVNFGGFSARSLRSLQRQRAILKVMNTIGGVAYLREQFYESVSKYMGSTTTLDKKTVRGDVDLMVESEKLGARTEPVSGRKIIFLPTVGEDAIQRYILKEKDSKKATFTDVIHDTEIYFFDQTEKNRFHRGKKSVERIRKFQNRQKNAKIKASDDAISKKSTSVNVSDGKIKRRDKKVSAGRTTVVVENTKEDKTVYHAGTKDGVQALIRAVVVTKSIKNEIMWDKITKLFPNNSLDNLKKKWTARRVRMGHSGWRAYVDKWKKMLVLAIKSEKISLRDVEELDLIKLLDIWTSFDEKEIKRPLFLYKNYEENRKKFTLVRDDTLTHSGNDLAMSSMIQREISSLKKTYTRKISASTKDLSKSQSDDYIRTVIRSILIESPSTTRNEIEALKNVGNESIDNVIMDMAKEKQIYLHGSKLECTDTLPDILENRGNYKDFGVAFQYRCKVNELLEAGNAIVIN

##  Problem 1a
Compute the fraction of each of the nucleotides 'A', 'C', 'G' and 'T' in the coding sequence of *S. cerevisiae*. You can use the [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) from the code below to get the number of occurrences of every nucleotide (remember to convert the counts to fractions).

In [10]:
# Example with the first sequence in the sequence data
sequence = str(records[ids[0]].seq)
for nucleotide, count in Counter(sequence).items():
    print(nucleotide, ":", count)

A : 1238
T : 952
G : 744
C : 549


Alternatively, you can pass the output of the `.most_common()` method, which sorts the elements of a `Counter` object by descending count, to an [OrderedDict](https://docs.python.org/3/library/collections.html#collections.OrderedDict), which keeps the elements in that order:

In [11]:
# most_common() sorts by descending count; OrderedDict only preserves that insertion order.
OrderedDict(Counter(sequence).most_common())

OrderedDict([('A', 1238), ('T', 952), ('G', 744), ('C', 549)])

**Hint:** remember that each of the 5917 sequences in the data is part of the full coding sequence! It may be useful to combine the entire coding sequence into one long string.
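
The step from counts to fractions can be sketched on toy data. This is a minimal illustration, assuming two made-up sequences in place of the yeast records; the names `toy_sequences`, `full_sequence` and `fractions` are placeholders, not part of the assignment data.

```python
from collections import Counter

# Toy stand-ins for the SeqRecord sequences (made-up data, not from the yeast file)
toy_sequences = ["ATGAAATAA", "ATGCCCTGA"]

# Join all sequences into one long string, then count each nucleotide once
full_sequence = "".join(toy_sequences)
counts = Counter(full_sequence)

# Divide each count by the total length to turn counts into fractions
total = sum(counts.values())
fractions = {nucleotide: count / total for nucleotide, count in counts.items()}

for nucleotide, fraction in sorted(fractions.items()):
    print(nucleotide, ":", round(fraction, 3))
```

By construction the fractions sum to 1, which is a quick sanity check worth repeating on the real data.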

In [None]:
# Your code here...


##  Problem 1b
Translate the coding sequences to protein sequences. Make a bar chart showing the count of each amino acid and of the stop codon, ordered by prevalence.

**Hint:** Adapt the code in 1a, but for amino acids instead of nucleotides, and then make the bar chart.
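
One possible way to prepare counts for an ordered bar chart is sketched below on a made-up protein string (`toy_protein` is a placeholder, not real yeast data). It only shows the ordering step and leaves the plotting itself out.

```python
from collections import Counter

# Made-up protein string; '*' marks a stop codon, as in the output of translate()
toy_protein = "MKKLLLS*MKL*"

# most_common() returns (symbol, count) pairs sorted from most to least frequent
ranked = Counter(toy_protein).most_common()

# Split the pairs into two parallel tuples: bar labels and bar heights
labels, heights = zip(*ranked)
print(labels)
print(heights)
```

The two tuples can then be handed to `plt.bar(labels, heights)` to draw the bars in order of prevalence.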

In [73]:
# Your code here...


##  Problem 1c
Pick out the most frequent amino acid from the figure you made in 1b and plot the frequencies of the corresponding nucleotide triplets (codons) in the full coding sequence. 

Use the `triplet_to_aminoacid` dictionary below or this [infographic](https://i0.wp.com/www.compoundchem.com/wp-content/uploads/2014/09/20-Common-Amino-Acids-v3.png?ssl=1) to determine which nucleotide triplets code for the selected amino acid.

**Hint:** Use the `extract_triplets` function to get the number of occurrences of each non-overlapping triplet in the sequence.
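
The idea of counting non-overlapping triplets and then picking out the codons of one amino acid can be sketched as follows. This is an illustration only: `count_codons` is a hypothetical `Counter`-based helper written for this sketch, and `toy_cds` is a made-up sequence.

```python
from collections import Counter

def count_codons(sequence):
    """Count non-overlapping triplets, reading from position 0 (hypothetical helper)."""
    usable = len(sequence) - len(sequence) % 3  # ignore a trailing partial triplet
    return Counter(sequence[i:i + 3] for i in range(0, usable, 3))

# Made-up coding sequence, not taken from the yeast data
toy_cds = "ATGAAAAAGAAATAA"

# Lysine (K) is encoded by AAA and AAG in the standard genetic code
lysine_codons = ["AAA", "AAG"]

codon_counts = count_codons(toy_cds)
lysine_usage = {codon: codon_counts[codon] for codon in lysine_codons}
print(lysine_usage)
```

Note that reading a concatenated string in steps of three keeps the correct reading frame only if the length of every individual sequence is a multiple of three, so it may be worth checking this before joining the records.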

In [None]:
# The following function reads a string by triplets, counts the occurrence of each triplet and returns a dictionary
def extract_triplets(s):
    size = len(s)//3
    out = {}
    for i in range(size):
        start = i*3
        end = start + 3
        triplet = s[start:end]
        if triplet in out:
            out[triplet] += 1
        else:
            out[triplet] = 1
    return out


triplet_to_aminoacid = {'AAA' : 'K', 'AAC' : 'N', 'AAG' : 'K', 'AAT' : 'N',
                        'ACA' : 'T', 'ACC' : 'T', 'ACG' : 'T', 'ACT' : 'T',
                        'AGA' : 'R', 'AGC' : 'S', 'AGG' : 'R', 'AGT' : 'S',
                        'ATA' : 'I', 'ATC' : 'I', 'ATG' : 'M', 'ATT' : 'I',
                        'CAA' : 'Q', 'CAC' : 'H', 'CAG' : 'Q', 'CAT' : 'H',
                        'CCA' : 'P', 'CCC' : 'P', 'CCG' : 'P', 'CCT' : 'P',
                        'CGA' : 'R', 'CGC' : 'R', 'CGG' : 'R', 'CGT' : 'R',
                        'CTA' : 'L', 'CTC' : 'L', 'CTG' : 'L', 'CTT' : 'L',
                        'GAA' : 'E', 'GAC' : 'D', 'GAG' : 'E', 'GAT' : 'D',
                        'GCA' : 'A', 'GCC' : 'A', 'GCG' : 'A', 'GCT' : 'A',
                        'GGA' : 'G', 'GGC' : 'G', 'GGG' : 'G', 'GGT' : 'G',
                        'GTA' : 'V', 'GTC' : 'V', 'GTG' : 'V', 'GTT' : 'V',
                        'TAA' : '*', 'TAC' : 'Y', 'TAG' : '*', 'TAT' : 'Y',
                        'TCA' : 'S', 'TCC' : 'S', 'TCG' : 'S', 'TCT' : 'S',
                        'TGA' : '*', 'TGC' : 'C', 'TGG' : 'W', 'TGT' : 'C',
                        'TTA' : 'L', 'TTC' : 'F', 'TTG' : 'L', 'TTT' : 'F'}

In [None]:
# Your code here...


##  Problem 1d

Look at the codon frequencies for randomly selected yeast (*S. cerevisiae*) genes in Figure 1 of ["JB Plotkin, G Kudla: Synonymous but not the same: the causes and consequences of codon bias"](https://www.nature.com/articles/nrg2899) and select some codons that show a clear bias. Then, as in problem 1c, make bar charts of the triplet frequencies for the amino acid encoded by your chosen codons. Can you see the same bias among the codons for that amino acid at the whole-genome level in yeast?

In [None]:
# Your code here...


Your answer here (double click to edit this markdown cell):

##  Problem 2: Binding energy statistics in the yeast genome (intergenic regions) 

Here we analyse signatures of selection from genomic data using transcription factor binding sites as an example. See e.g. ["Kinney JB, Tkacik G, Callan CG (2007) Precise physical models of protein-DNA interaction from high-throughput data. Proc Natl Acad Sci USA 104(2):501–506"](http://www.pnas.org/content/104/2/501.short) and ["Mustonen V, Kinney J, Callan CG, Lässig M (2008) Energy-dependent fitness: a quantitative model for the evolution of yeast transcription factor binding sites. Proc Natl Acad Sci USA 105(34):12376–12381"](http://www.pnas.org/content/105/34/12376.short). 

The directory `data/` contains a file `alignment.txt.NaN.removed` with aligned intergenic sequences for four yeast species (*S.cer, S.par, S.mik, S.bay*). Note that we have removed insertions and deletions from the alignment; this makes the analyses in this assignment, which focus on a single species, a bit simpler.

In [74]:
from setup import *
%matplotlib inline

In [75]:
# Import data to a dataframe
datafile = "data/alignment.txt.NaN.removed"
seq_igs = pd.read_csv(datafile, sep=r"\s+")
tfBindingFile = "data/Emat.abf1.kinney"
# Import an energy matrix modelling the binding of transcription factor Abf1 to DNA.
# The dimensions are 4 x 20, corresponding to the four nucleotides A, C, G, T and the binding site motif length 20
Emat = np.array(pd.read_csv(tfBindingFile, header=None, sep=r"\s+"))

In [76]:
def getEnergy(seq, Emat):
    s1 = list(seq.replace('A', '0').replace('C', '1').replace('G', '2').replace('T', '3').replace('N', '4'))
    Lmat = Emat.shape[1]
    Lseq = len(s1)
    Ev = []
    for i in range(0, Lseq-Lmat+1):
        E = 0.0;
        k = 0;
        flag = 0;
        eps = 0.0;
        for j in range(i, i+Lmat):
            nuc = int(s1[j])
            if nuc < 4:
                eps = Emat[nuc, k];
            else:
                # Remove sequences with missing data
                flag = 1;
            E += eps
            k += 1;
        if flag == 0:    
            Ev.append(E)
    return Ev

In [77]:
def randomise(seq):
    s1 = list(seq.replace('A', '0').replace('C', '1').replace('G', '2').replace('T', '3').replace('N', '4'))
    random.shuffle(s1)
    return "".join(s1)

#### Problem 2a: 
Visualise the binding energy matrix *Emat* with elements $\epsilon_k(a)$, where $k$ denotes the column (binding site position) and $a \in \{A, C, G, T\}$ the nucleotide. Use a heatmap for visualisation (e.g., [imshow](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imshow.html)).

What is the best possible binding sequence according to this model? 

**Hint:** Smaller energy values indicate better binding.
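
Because `getEnergy()` adds up one matrix entry per position, the total energy is minimised by choosing the lowest-energy nucleotide in each column independently. The sketch below illustrates this on a made-up 4 x 3 matrix (`toy_emat` is not the Abf1 matrix) using plain Python lists.

```python
# Made-up 4 x 3 energy matrix; rows are A, C, G, T and columns are site positions
nucleotides = "ACGT"
toy_emat = [
    [0.0, 1.2, 0.7],  # A
    [1.5, 0.0, 0.9],  # C
    [0.8, 2.0, 0.0],  # G
    [2.1, 0.4, 1.3],  # T
]

best_site = ""
for k in range(len(toy_emat[0])):
    # Collect the energies of all four nucleotides at position k
    column = [toy_emat[a][k] for a in range(4)]
    # The row index of the smallest energy identifies the preferred nucleotide
    best_site += nucleotides[column.index(min(column))]

print(best_site)
```

With a NumPy array, `np.argmin(Emat, axis=0)` returns the same per-column row indices in one call.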

In [79]:
# Your code here...


#### Problem 2b:

Explain what the functions `getEnergy()` and `randomise()` do, and add some comments to the code to help a reader understand how they work.

In [None]:
# Your code here...


#### Problem 2c: 
Now run the cell below (it takes about 1 minute) and add a comment to each line to explain what it does. Plot a histogram of `EvAll` (use a logarithmic y-axis and increase the number of bins, e.g., to 100).

In [92]:
Nigs = seq_igs.shape[0]
Eigs = []
for n in range(0, Nigs):
    seq = seq_igs['Scer'][n]
    Eigs.append(getEnergy(seq, Emat))
    
EvAll = [val for sublist in Eigs for val in sublist]

In [112]:
# Your code here...


#### Problem 2d: 
In fact, the histogram in 2c does not show all possible binding sites in the intergenic regions, because the given sequences represent only one strand of the DNA. To get the missing half, you can either build a second data set by reverse complementing every intergenic sequence, or reverse complement the energy matrix and score the original sequences with it (copy the code from 2c and modify it as needed; hint: `np.fliplr()` and `np.flipud()` might be useful). Make a histogram as in 2c, but now for all the data.
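
Why the two routes give the same energies can be checked on a small example. The sketch below assumes a made-up 4 x 3 matrix with rows ordered A, C, G, T; `site_energy` and `reverse_complement` are hypothetical helpers written for this illustration, not functions from the assignment.

```python
# Made-up 4 x 3 energy matrix; rows are A, C, G, T and columns are site positions
toy_emat = [
    [0.0, 1.2, 0.7],  # A
    [1.5, 0.0, 0.9],  # C
    [0.8, 2.0, 0.0],  # G
    [2.1, 0.4, 1.3],  # T
]
row_of = {"A": 0, "C": 1, "G": 2, "T": 3}
complement = {"A": "T", "C": "G", "G": "C", "T": "A"}

def site_energy(site, emat):
    """Sum the matrix entries selected by each nucleotide of the site."""
    return sum(emat[row_of[nuc]][k] for k, nuc in enumerate(site))

def reverse_complement(site):
    """Complement every nucleotide and reverse the reading direction."""
    return "".join(complement[nuc] for nuc in reversed(site))

# Reversing the row order swaps A<->T and C<->G; reversing each row reverses the positions
flipped_emat = [row[::-1] for row in toy_emat[::-1]]

site = "ACT"
energy_on_other_strand = site_energy(reverse_complement(site), toy_emat)
energy_with_flipped_matrix = site_energy(site, flipped_emat)
print(energy_on_other_strand, energy_with_flipped_matrix)
```

The row reversal works only because the rows are in the order A, C, G, T, where each nucleotide's complement sits at the mirrored row index. For a NumPy array, the same double flip is `np.fliplr(np.flipud(Emat))`.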

In [104]:
# Your code here...


#### Problem 2e: 
Make a null model by randomly permuting each intergenic sequence with the function `randomise()` (i.e., copy the code from 2d and add one line that calls `randomise()` before `getEnergy()`). Plot the counts from the null model together with those from the real data in one histogram. Comment on what you see. Where is selection visible?
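
The reasoning behind this null model is that shuffling keeps the nucleotide composition of a sequence but destroys the order of the nucleotides, so any low-energy sites in the shuffled sequence arise by chance. A minimal sketch on a made-up sequence (`toy_sequence` is a placeholder, and the fixed seed is only there to make the sketch repeatable):

```python
import random
from collections import Counter

# Made-up intergenic sequence, not taken from the alignment file
toy_sequence = "ACGTTTGACCA"

random.seed(0)  # fixed seed only so the sketch is repeatable
letters = list(toy_sequence)
random.shuffle(letters)  # shuffles the list in place
shuffled = "".join(letters)

# The shuffle changes the order but keeps the nucleotide composition
print(Counter(toy_sequence) == Counter(shuffled))
```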


In [107]:
# Your code here...


Your answer here (double click to edit this markdown cell):