# Tutorial notebook for dinucleotide forces computation

We use an example genome as a working example to test the functions provided by the scripts.

In [1]:
# load the genome - this is an Influenza H5N1 PB2 segment, strain used: A/Anhui/1/2005
sequence = "ATGGAGAGAATAAAAGAATTAAGGGATCTAATGTCACAGTCCCGCACTCGCGAGATACTAACAAAAACCACTGTGGACCATATGGCCATAATCAAGAAGTACACATCAGGAAGACAAGAGAAGAACCCTGCTCTCAGAATGAAATGGATGATGGCAATGAAATATCCAATCACAGCGGACAAGAGAATAACAGAGATGATTCCTGAAAGGAATGAACAAGGGCAGACGCTCTGGAGCAAGACAAATGATGCCGGATCGGACAGGTTGATGGTGTCTCCCTTAGCTGTAACTTGGTGGAATAGGAATGGGCCGACGACAAGTGCAGTCCATTATCCAAAGGTTTACAAAACATACTTTGAGAAGGCTGAAAGGCTAAAACATGGAACCTTCGGTCCCGTCCATTTTCGAAACCAAGTTAAAATACGCCGCCGAGTTGATATAAATCCTGGCCATGCAGATCTCAGTGCTAAAGAAGCACAAGATGTCATCATGGAGGTCGTTTTCCCAAATGAAGTGGGAGCTAGAATATTGACATCAGAGTCACAATTGACAATAACGAAAGAGAAGAAAGAAGAGCTCCAAGATTGTAAGATTGCTCCCTTAATGGTTGCATACATGTTGGAAAGGGAACTGGTCCGCAAAACCAGATTCCTACCGGTAGCAAGCGGAACAAGCAGTGTGTACATTGAGGTATTGCATTTGACTCAAGGGACCTGCTGGGAACAGATGTACACTCCAGGCGGAGAAGTGAGAAACGACGATGTTGACCAGAGTTTGATCATCGCTGCCAGAAACATTGTTAGGAGAGCAACGGTATCAGCGGATCCACTGGCATCACTGCTGGAGATGTGTCACAGCACACAAATTGGTGGGATAAGGATGGTGGACATCCTTAGGCAAAACCCAACTGAGGAACAAGCTGTGGGTATATGCAAAGCAGCAATGGGTCTGAGGATCAGTTCATCCTTTAGCTTTGGAGGCTTCACTTTCAAAAGAACAAGTGGATCATCCGTCACGAAGGAAGAGGAAGTGCTTACAGGCAACCTCCAAACATTGAAAATAAGAGTACATGAGGGGTATGAAGAGTTCACAATGGTTGGACGGAGGGCAACAGCTATCCTGAGGAAAGCAACTAGAAGGCTGATTCAGTTGATAGTAAGTGGAAGAGACGAACAATCAATCGCTGAGGCAATCATTGTAGCAATGGTGTTCTCACAGGAGGATTGCATGATAAAGGCAGTCCGGGGCGATTTGAATTTCGTAAACAGAGCAAACCAAAGATTAAACCCCATGCATCAACTCCTGAGACATTTTCAAAAGGACGCAAAAGTGCTATTTCAGAATTGGGGAATTGAACCCATTGATAATGTCATGGGGATGATCGGAATATTACCTGACCTGACTCCCAGCACAGAAATGTCACTGAGAAGAGTAAGAGTTAGTAAAGTGGGAGTGGATGAATATTCCAGCACTGAGAGAGTAATTGTAAGTATTGACCGTTTCTTAAGGGTTCGAGATCAGCGGGGGAACGTACTCTTATCTCCCGAAGAGGTCAGCGAAACCCAGGGAACAGAGAAATTGACAATAACATATTCATCATCAATGATGTGGGAAATCAACGGTCCTGAGTCAGTGCTTGTTAACACCTATCAATGGATCATCAGAAACTGGGAAACTGTGAAGATTCAATGGTCTCAAGACCCCACGATGCTGTACAATAAGATGGAGTTTGAACCGTTCCAATCCTTGGTACCTAAGGCTGCCAGAGGTCAATACAGTGGATTTGTGAGAACACTATTCCAACAAATGCGTGACGTACTGGGGACATTTGATACTGTCCAGATAATAAAGCTGCTACCATTTGCAGCAGCCCCACCAGAGCAGAGCAGAATGCAGTTTTCTTCTCTAACTGTGAATGTGAGAGGCTCAGGAATGAGAATACTCGTAAGGGGCAATTCCCCTGTGTTCAACTACAATAAGGCAACCAA
AAGGCTTACCGTTCTTGGAAAGGACGCAGGTGCATTAACAGAGGATCCAGATGAGGGGACAACCGGAGTGGAGTCTGCAGTACTGAGGGAATTCCTAATTCTAGGCAAGGAGGACAAAAGATATGGACCAGCATTGAGTATCAATGAACTGAGCAACCTTGCGAAAGGGGAGAAAGCTAATGTGCTGATAGGACAAGGAGACGTGGTGTTGGTAATGAAACGGAAACGGGACTCTAGCATACTTACTGACAGCCAGACAGCGACCAAAAGAATTCGGATGGCCATCAATTAG";

## Non-coding forces

In [3]:
package_path = "./"
!in(package_path, LOAD_PATH) && push!(LOAD_PATH, package_path)
using NoncodingForces_v2_1

┌ Info: Precompiling NoncodingForces_v2_1 [top-level]
└ @ Base loading.jl:1317


Let's start by computing the force on the CpG motif along the full genome:

In [5]:
motifs = ["CG"]

NoncodingForces_v2_1.DimerForce(sequence, motifs)

Dict{String, Float64} with 5 entries:
  "A"  => -1.1779
  "T"  => -1.61537
  "C"  => -1.52913
  "CG" => -0.995938
  "G"  => -1.28543

When no nucleotide biases (frequencies) are specified, as above, the corresponding fields $h_n$ are inferred together with the forces.
Notice that $\sum_n e^{h_n} = 1$, where $n \in \{A,C,G,T\}$. This is the result of a gauge choice: in general the model probabilities are invariant under the transformation
$$ h_n \to h_n+K, $$
and this freedom is used to make each field interpretable as the logarithm of a frequency.
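As a quick sanity check of this normalization, the fields printed above can be exponentiated and summed. The sketch below copies the four values from the output by hand (so the sum is only accurate to the printed precision) and then illustrates how an arbitrary shift $K$ can be undone by renormalizing. The variable names are illustrative and not part of the package.

```julia
# Fields copied from the output of the cell above (printed precision only)
fields = Dict("A" => -1.1779, "C" => -1.52913, "G" => -1.28543, "T" => -1.61537)

# In the chosen gauge, the exponentials of the fields sum to 1
total = sum(exp(h) for h in values(fields))
println("sum of exp(h_n) = ", total)

# Shift all fields by an arbitrary constant K (a gauge transformation) ...
K = 0.5
shifted = Dict(n => h + K for (n, h) in fields)

# ... and go back to the normalized gauge by subtracting log(sum(exp(h)))
log_norm = log(sum(exp(h) for h in values(shifted)))
renormalized = Dict(n => h - log_norm for (n, h) in shifted)
println(renormalized)
```

Up to rounding of the printed values, `renormalized` coincides with `fields`, which is what makes $e^{h_n}$ readable as a frequency.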

A user-specified bias can be given, and in this case fields are not inferred:

In [7]:
nt_bias = [0.25, 0.25, 0.25, 0.25] # probs for A, C, G, T
motifs = ["CG"]

NoncodingForces_v2_1.DimerForce(sequence, motifs; freqs=nt_bias)

Dict{String, Float64} with 1 entry:
  "CG" => -1.04766
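A uniform bias is only one possible choice. As a hedged sketch, a bias vector could also be built from the empirical nucleotide frequencies of a sequence, keeping the A, C, G, T order used in the cell above. The short string `example` is just a stand-in for illustration, and whether empirical frequencies are the appropriate bias depends on the analysis.

```julia
# Short illustrative sequence (a stand-in, not the full PB2 segment)
example = "ATGGAGAGAATAAAAGAATTAAGGGATCTAATG"

# Order must match the one expected for the bias vector: A, C, G, T
order = ['A', 'C', 'G', 'T']

# Count each nucleotide and normalize to obtain frequencies
counts = [count(==(c), example) for c in order]
nt_bias = counts ./ sum(counts)

println(nt_bias)
```

The resulting `nt_bias` could then be passed through the `freqs` keyword, as done above with the uniform vector.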

The script also makes it easy to compute forces on two or more dinucleotides:

In [8]:
motifs = ["CG", "TA"]

NoncodingForces_v2_1.DimerForce(sequence, motifs)

Dict{String, Float64} with 6 entries:
  "A"  => -1.11739
  "T"  => -1.48581
  "C"  => -1.63146
  "CG" => -0.912403
  "G"  => -1.38268
  "TA" => -0.761354

Notice how the CpG force differs across the three cases above, because different motifs (as well as fields) interact.
When the full set of dinucleotides is inferred, the system of equations solved to obtain the forces is underdetermined. This means that a gauge choice must be made, for instance by setting some forces to zero by passing fewer than 16 dinucleotides as motifs.
If such a choice is not made, the script makes it, as follows:

In [9]:
alphabet = ["A", "C", "G", "T"]
motifs = [a*b for a in alphabet for b in alphabet]

NoncodingForces_v2_1.DimerForce(sequence, motifs)

Dict{String, Float64} with 20 entries:
  "T"  => -1.51866
  "C"  => -1.58002
  "CC" => 0.0
  "GC" => -0.0745032
  "GG" => 0.0
  "CG" => -0.923865
  "AT" => 0.0
  "A"  => -1.15173
  "CA" => 0.244623
  "TG" => 0.0438542
  "TA" => -0.499717
  "GT" => 0.0
  "G"  => -1.35119
  "GA" => 0.319797
  "TT" => 0.0
  "AC" => -0.152997
  "CT" => 0.0
  "AA" => 0.0
  "AG" => -0.178605
  "TC" => 0.0621285

In particular, the gauge is chosen such that:
- the exponentials of the fields sum to 1;
- the forces for dinucleotides of the form NN (a repeated nucleotide: AA, CC, GG, TT) and those of the form NT (any nucleotide followed by T) are set to zero.
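One way to see where this freedom comes from is that nucleotide and dinucleotide counts are not independent. The sketch below assumes a model that depends on the sequence only through such counts and treats the sequence as circular, so that the relations hold exactly; both are simplifying assumptions made here for illustration, and the package may handle sequence ends differently. On a circular sequence, summing the dinucleotide counts over the second (or first) letter gives back the single-nucleotide counts. These 4 + 4 relations share one overall sum, leaving 7 independent ones, which matches the 7 forces set to zero above.

```julia
# Short illustrative sequence (a stand-in, not the full PB2 segment)
toy = "ATGGAGAGAATAAAAGAATTAAGGGATCTAATG"
alphabet = ['A', 'C', 'G', 'T']

# Single-nucleotide counts
n1 = Dict(a => count(==(a), toy) for a in alphabet)

# Overlapping dinucleotide counts on the circular sequence:
# the last letter is paired with the first one
L = length(toy)
n2 = Dict((a, b) => 0 for a in alphabet for b in alphabet)
for i in 1:L
    pair = (toy[i], toy[mod1(i + 1, L)])
    n2[pair] += 1
end

# Summing over the second letter recovers the count of the first letter ...
row_ok = all(sum(n2[(a, b)] for b in alphabet) == n1[a] for a in alphabet)
# ... and summing over the first letter recovers the count of the second
col_ok = all(sum(n2[(a, b)] for a in alphabet) == n1[b] for b in alphabet)

println("row relations hold: ", row_ok)
println("column relations hold: ", col_ok)
```

Because of these relations, a shift of some forces can be compensated by a shift of the fields without changing the model probabilities, so some parameters have to be fixed by convention.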

Notice that the script allows for flexible choices: if not all the dinucleotides are given, the gauge is always chosen such that the maximum possible number of dinucleotides of the form NN and NT have forces equal to zero:

In [10]:
motifs = ["AC", "AG", "AT", "CA", "GA", "TA"]
NoncodingForces_v2_1.DimerForce(sequence, motifs)

Dict{String, Float64} with 10 entries:
  "AC" => -0.15147
  "AT" => 0.0
  "A"  => -1.10232
  "T"  => -1.35861
  "C"  => -1.75463
  "AG" => 0.00906447
  "CA" => 0.463157
  "G"  => -1.43586
  "TA" => -0.613914
  "GA" => 0.262192

In [11]:
motifs = [
 "AC",
 "AG",
 "AT",
 "CA",

 "CG",
 "CT",
 "GA",
 "GC",

 "GT",
 "TA",
 "TC",
 "TG",
]
NoncodingForces_v2_1.DimerForce(sequence, motifs)

Dict{String, Float64} with 16 entries:
  "T"  => -1.51982
  "C"  => -1.58108
  "GC" => -0.0729455
  "CG" => -0.92149
  "AT" => 0.0
  "A"  => -1.14948
  "CA" => 0.242638
  "TG" => 0.0479148
  "G"  => -1.35211
  "TA" => -0.500021
  "GT" => 0.0
  "GA" => 0.317307
  "AC" => -0.151731
  "CT" => 0.0
  "AG" => -0.177023
  "TC" => 0.0658733

Notice that omitting a dinucleotide is equivalent to fixing its force to 0. Depending on which dinucleotides are omitted, this may or may not amount to a gauge choice; in other words, it may or may not result in an equivalent model.
For instance, the first three of the following cells result in equivalent models (although the third has different parameters because it is in another gauge), while the fourth does not:

In [12]:
motifs = ["AC", "AG", "AT", "CA", "GA", "TA"]
NoncodingForces_v2_1.DimerForce(sequence, motifs)

Dict{String, Float64} with 10 entries:
  "AC" => -0.15147
  "AT" => 0.0
  "A"  => -1.10232
  "T"  => -1.35861
  "C"  => -1.75463
  "AG" => 0.00906447
  "CA" => 0.463157
  "G"  => -1.43586
  "TA" => -0.613914
  "GA" => 0.262192

In [13]:
motifs = ["AC", "AG", "CA", "GA", "TA"]
NoncodingForces_v2_1.DimerForce(sequence, motifs)

Dict{String, Float64} with 9 entries:
  "AC" => -0.15147
  "A"  => -1.10232
  "T"  => -1.35861
  "C"  => -1.75463
  "AG" => 0.00906447
  "CA" => 0.463157
  "G"  => -1.43586
  "TA" => -0.613914
  "GA" => 0.262192

In [14]:
motifs = ["AG", "AT", "CA", "GA", "TA"]
NoncodingForces_v2_1.DimerForce(sequence, motifs)

Dict{String, Float64} with 9 entries:
  "AT" => 0.170807
  "A"  => -1.10216
  "T"  => -1.36147
  "C"  => -1.75115
  "AG" => 0.17067
  "CA" => 0.301661
  "G"  => -1.43553
  "TA" => -0.775411
  "GA" => 0.100696

In [15]:
motifs = ["AG", "CA", "GA", "TA"]
NoncodingForces_v2_1.DimerForce(sequence, motifs)

Dict{String, Float64} with 8 entries:
  "A"  => -1.10409
  "T"  => -1.33658
  "C"  => -1.78238
  "AG" => 0.079359
  "CA" => 0.39291
  "G"  => -1.43756
  "TA" => -0.684162
  "GA" => 0.191945
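The first two results above can be compared directly: the only difference is that "AT" is absent from the second one, and its force was already 0 in the first. A minimal sketch of this comparison, with the values copied by hand from the printed outputs and a missing motif treated as a force of 0:

```julia
# Output of the cell with motifs AC, AG, AT, CA, GA, TA
with_AT = Dict("AC" => -0.15147, "AT" => 0.0, "A" => -1.10232, "T" => -1.35861,
               "C" => -1.75463, "AG" => 0.00906447, "CA" => 0.463157,
               "G" => -1.43586, "TA" => -0.613914, "GA" => 0.262192)

# Output of the cell with motifs AC, AG, CA, GA, TA (AT omitted)
without_AT = Dict("AC" => -0.15147, "A" => -1.10232, "T" => -1.35861,
                  "C" => -1.75463, "AG" => 0.00906447, "CA" => 0.463157,
                  "G" => -1.43586, "TA" => -0.613914, "GA" => 0.262192)

# Treat a motif that was not given as having force 0, then compare entry by entry
all_keys = union(keys(with_AT), keys(without_AT))
same_parameters = all(
    isapprox(get(with_AT, k, 0.0), get(without_AT, k, 0.0); atol=1e-6)
    for k in all_keys
)
println("same parameters: ", same_parameters)
```

Identical parameters are a sufficient but not a necessary sign of equivalence: the third cell is stated to describe the same model with different parameter values, so a check like this one cannot detect that case.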

In some cases (notably when all the dinucleotides are given), fixing the fields is equivalent to fixing (part of) a gauge, so the model is independent of this choice (although the values of the inferred parameters depend on it).

# From here on, it is work in progress!!!

In [None]:
# the pseudocount option below does not seem to work well yet...

With `add_pseudocount=true`, a single pseudocount is added to the count of each inferred motif (nucleotide or dinucleotide):

In [22]:
short_sequence = sequence[1:500]

"ATGGAGAGAATAAAAGAATTAAGGGATCTAATGTCACAGTCCCGCACTCGCGAGATACTAACAAAAACCACTGTGGACCATATGGCCATAATCAAGAAGTACACATCAGGAAGACAAGAGAAGAACCCTGCTCTCAGAATGAAATGGATGATGGCAATGAAATATCCAATCACAGCGGACAAGAGAATAACAGAGATGATTCCTGAAAGGAATGAACAAGGGCAGACGCTCTGGAGCAAGACAAATGATGCCGGATCGGACAGGTTGATGGTGTCTCCCTTAGCTGTAACTTGGTGGAATAGGAATGGGCCGACGACAAGTGCAGTCCATTATCCAAAGGTTTACAAAACATACTTTGAGAAGGCTGAAAGGCTAAAACATGGAACCTTCGGTCCCGTCCATTTTCGAAACCAAGTTAAAATACGCCGCCGAGTTGATATAAATCCTGGCCATGCAGATCTCAGTGCTAAAGAAGCACAAGATGTCATCATGGAGGTCGT"

In [24]:
alphabet = ["A", "C", "G", "T"]
motifs = [a*b for a in alphabet for b in alphabet]
NoncodingForces_v2_1.DimerForce(short_sequence, motifs; add_pseudocount=false)

Dict{String, Float64} with 20 entries:
  "T"  => -1.78999
  "C"  => -1.46096
  "CC" => 0.0
  "GC" => -0.0881668
  "GG" => 0.0
  "CG" => -0.673973
  "AT" => 0.0
  "A"  => -1.02452
  "CA" => 0.201549
  "TG" => 0.285663
  "TA" => -0.190423
  "GT" => 0.0
  "G"  => -1.41859
  "GA" => 0.470057
  "TT" => 0.0
  "AC" => -0.497816
  "CT" => 0.0
  "AA" => 0.0
  "AG" => -0.420048
  "TC" => 0.323648

In [23]:
alphabet = ["A", "C", "G", "T"]
motifs = [a*b for a in alphabet for b in alphabet]
NoncodingForces_v2_1.DimerForce(short_sequence, motifs; add_pseudocount=true)

Dict{String, Float64} with 20 entries:
  "T"  => -3.3792
  "C"  => -1.30829
  "CC" => 0.0
  "GC" => -0.0826845
  "GG" => 0.0
  "CG" => -0.645923
  "AT" => 0.0
  "A"  => -0.881581
  "CA" => 0.191599
  "TG" => 2.18227
  "TA" => 1.71958
  "GT" => 0.0
  "G"  => -1.26757
  "GA" => 0.455001
  "TT" => 0.0
  "AC" => -0.48536
  "CT" => 0.0
  "AA" => 0.0
  "AG" => -0.409147
  "TC" => 2.21938
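To make the idea of a pseudocount concrete, here is a minimal sketch of what adding one count to every dinucleotide could look like. The function `dimer_counts` is hypothetical and written only for illustration: it is not the package's implementation, it covers dinucleotides only, and it says nothing about how the counts are then used in the inference.

```julia
# Hypothetical helper: count overlapping dinucleotides, optionally starting
# every count from 1 instead of 0 (the pseudocount)
function dimer_counts(seq::AbstractString; add_pseudocount::Bool=false)
    alphabet = ["A", "C", "G", "T"]
    motifs = [a * b for a in alphabet for b in alphabet]
    counts = Dict(m => (add_pseudocount ? 1 : 0) for m in motifs)
    for i in 1:length(seq)-1
        m = seq[i:i+1]
        # Skip pairs containing characters outside A, C, G, T
        haskey(counts, m) && (counts[m] += 1)
    end
    return counts
end

raw = dimer_counts("ACGTAC")
smoothed = dimer_counts("ACGTAC"; add_pseudocount=true)

println("AC: ", raw["AC"], " -> ", smoothed["AC"])
println("AA: ", raw["AA"], " -> ", smoothed["AA"])
```

With a pseudocount, motifs that never occur in a short sequence (such as "AA" in this toy string) get a count of 1 instead of 0, which is the usual motivation for this kind of smoothing.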

In [25]:
short_sequence = sequence[1:200]
alphabet = ["A", "C", "G", "T"]
motifs = [a*b for a in alphabet for b in alphabet]
NoncodingForces_v2_1.DimerForce(short_sequence, motifs; add_pseudocount=true)

Dict{String, Float64} with 20 entries:
  "T"  => NaN
  "C"  => NaN
  "CC" => NaN
  "GC" => NaN
  "GG" => NaN
  "CG" => NaN
  "AT" => NaN
  "A"  => NaN
  "CA" => NaN
  "TG" => NaN
  "TA" => NaN
  "GT" => NaN
  "G"  => NaN
  "GA" => NaN
  "TT" => 0.0
  "AC" => NaN
  "CT" => NaN
  "AA" => NaN
  "AG" => NaN
  "TC" => NaN

In [26]:
short_sequence = sequence[1:100]
alphabet = ["A", "C", "G", "T"]
motifs = [a*b for a in alphabet for b in alphabet]
NoncodingForces_v2_1.DimerForce(short_sequence, motifs; add_pseudocount=true)

LoadError: LinearAlgebra.SingularException(1)

## Coding forces

TODO