# KwARG Tutorial (Windows)

This tutorial gives an overview of the usage of KwARG and the types of output that it can produce. 

Open the Windows Command Prompt (Start menu $\rightarrow$ Windows System $\rightarrow$ Command Prompt) and navigate to the folder where you have saved the kwarg binaries (using the command "cd directory_of_the_folder").

To recreate the results on the command line, remove the '!' from the commands below.

## Available options

A list of the available options can be viewed by running KwARG with the '-h', '-H' or '-?' option:

In [1]:
!kwarg -h

Usage: kwarg [options] < [input]
The program reads data from the input file specified and greedily
constructs a history with a low number of recombinations and
recurrent mutations (homoplasies). The history is constructed by
stepping backwards in time using coalescence, mutation and
recombination events. At each point in this process, all possible
next events (strictly speaking, only a useful subset of all possible
next events) are considered, and the resulting ancestral states are
scored. The scores are used to choose an event either at random or
to proceed to an ancestral state with minimum score (see option -T).
This process is NOT guaranteed to lead to a history with a minimum
number of recombinations or the minimum number of homoplasies.
Legal options are:
 -L[x] Provide an optional label x (should be an integer) to print at
       the start of each line.
 -S[x] Specify cost of a sequencing error (default: x = 0.5).
 -M[x] Specify cost of a recurrent mutation (default: x = 0.9).
 

## Running KwARG

We will first run KwARG on the Kreitman dataset, with 3 iterations for each of the default cost parameters:

In [2]:
!kwarg -Q3 < kreitman_snp.txt

         Seed   Temp  SE_cost  RM_cost   R_cost  RR_cost  SE  RM   R   N_states            Time
   1607523139   30.0    -1.00    -1.00     1.00     2.00   0   0   8        153      0.02100000
   1607551929   30.0    -1.00    -1.00     1.00     2.00   0   0   8        168      0.01600000
   1607566027   30.0    -1.00    -1.00     1.00     2.00   0   0   8        190      0.00700000
   1607533024   30.0     1.00     1.01     1.00     2.00   1   1   4        728      0.09500000
   1607581506   30.0     1.00     1.01     1.00     2.00  NA  NA  NA       1042      0.11600000
   1607531870   30.0     1.00     1.01     1.00     2.00   0   1   5        744      0.07800000
   1607553963   30.0     0.90     0.91     1.00     2.00   2   0   4        754      0.09300000
   1607566190   30.0     0.90     0.91     1.00     2.00  NA  NA  NA       1136      0.13500000
   1607539839   30.0     0.90     0.91     1.00     2.00  NA  NA  NA        797      0.09600000
   1607547804   30.0     0.80     0.81  

The output table gives:
- Seed: the random seed needed to rerun this particular iteration (demonstrated below)
- Temp: the annealing temperature used (default: 30)
- SE_cost, RM_cost, R_cost, RR_cost: the cost parameter for a sequencing error, recurrent mutation, single recombination, and two consecutive recombination events, respectively
- SE, RM, R: the number of each type of event in the solution
- N_states: the total number of states considered when constructing one-step neighbourhoods
- Time: CPU time for each iteration.

Lines where the number of events is shown as 'NA' correspond to runs which were identified as sub-optimal and terminated before completion.
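
For downstream analysis it can help to load this table into Python. The following is a minimal sketch, assuming the whitespace-separated layout shown above; the function name `parse_kwarg_table` and the choice to map 'NA' to `None` are illustrative and not part of KwARG. The sample rows are copied from the run above.

```python
# Sample rows copied from the default-cost run shown above.
SAMPLE = """\
         Seed   Temp  SE_cost  RM_cost   R_cost  RR_cost  SE  RM   R   N_states            Time
   1607523139   30.0    -1.00    -1.00     1.00     2.00   0   0   8        153      0.02100000
   1607533024   30.0     1.00     1.01     1.00     2.00   1   1   4        728      0.09500000
   1607581506   30.0     1.00     1.01     1.00     2.00  NA  NA  NA       1042      0.11600000
"""

# Columns that hold counts or identifiers rather than real-valued costs/times.
INT_COLS = {"Seed", "SE", "RM", "R", "N_states"}


def parse_kwarg_table(text):
    """Turn KwARG's summary table into a list of dicts (hypothetical helper)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        row = {}
        for name, value in zip(header, ln.split()):
            if value == "NA":
                row[name] = None  # run was terminated before completion
            elif name in INT_COLS:
                row[name] = int(value)
            else:
                row[name] = float(value)
        rows.append(row)
    return rows


rows = parse_kwarg_table(SAMPLE)
# Keep only the runs that finished and therefore report event counts.
completed = [r for r in rows if r["R"] is not None]
```

Filtering on `None` as above is one simple way to discard the terminated runs before comparing solutions.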

KwARG can also be run with specified costs, as follows:

In [3]:
!kwarg -S0.2,0.4 -M0.3,0.5 -Q5 < kreitman_snp.txt

         Seed   Temp  SE_cost  RM_cost   R_cost  RR_cost  SE  RM   R   N_states            Time
   1607522951   30.0     0.20     0.30     1.00     2.00   8   0   2       1061      0.12100000
   1607562527   30.0     0.20     0.30     1.00     2.00   6   2   1        736      0.06500000
   1607563436   30.0     0.20     0.30     1.00     2.00   5   0   2        770      0.08400000
   1607534898   30.0     0.20     0.30     1.00     2.00  NA  NA  NA        815      0.08900000
   1607542092   30.0     0.20     0.30     1.00     2.00  13   0   0        823      0.16800000
   1607539673   30.0     0.40     0.50     1.00     2.00  NA  NA  NA       1037      0.10200000
   1607528196   30.0     0.40     0.50     1.00     2.00   5   0   2        723      0.09200000
   1607558364   30.0     0.40     0.50     1.00     2.00  NA  NA  NA       1035      0.08100000
   1607553783   30.0     0.40     0.50     1.00     2.00  NA  NA  NA        858      0.24300000
   1607573336   30.0     0.40     0.50  

The same number of SE_cost and RM_cost parameters should be specified, separated by commas.

The options '-R' and '-C' can be used to specify the cost of single and double recombination events; these default to 1.0 and 2.0, respectively. If these options are used, then SE_cost and RM_cost must also be provided.
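
When scripting many runs, the two rules above (matching numbers of SE_cost and RM_cost values, and '-R'/'-C' requiring '-S'/'-M') can be checked before KwARG is called. This is a minimal sketch; `build_kwarg_args` is a hypothetical helper, it only uses the flags shown in this tutorial, and it assumes '-R' and '-C' each take a single value.

```python
def build_kwarg_args(se_costs, rm_costs, r_cost=None, rr_cost=None, runs=None):
    """Assemble a KwARG argument list from cost settings (hypothetical helper)."""
    if len(se_costs) != len(rm_costs):
        raise ValueError("SE_cost and RM_cost lists must have the same length")
    if (r_cost is not None or rr_cost is not None) and not se_costs:
        raise ValueError("-R/-C require SE_cost and RM_cost to be given as well")

    args = ["kwarg"]
    if se_costs:
        args.append("-S" + ",".join(str(c) for c in se_costs))
        args.append("-M" + ",".join(str(c) for c in rm_costs))
    if r_cost is not None:
        args.append(f"-R{r_cost}")  # assumption: a single value per flag
    if rr_cost is not None:
        args.append(f"-C{rr_cost}")
    if runs is not None:
        args.append(f"-Q{runs}")
    return args


command = " ".join(build_kwarg_args([0.2, 0.4], [0.3, 0.5], runs=5))
```

The resulting string matches the command used in the cell above (without the input redirection).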

### Input formats

#### Binary data
The input data can be in 0/1 binary format, and any other symbols will be interpreted as missing data.

In [4]:
!type example_data_1.txt

00010010
1001101x
11001x01
01100101


Sequence and site labels can also be provided, as follows:

In [5]:
!type example_data_2.txt

#> Seq1
00010010
#> Seq2
1001101-
#> Seq3
11001-01
#> Seq4
01100101
#positions: 2 3 7 9 10 11 13 14
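
The labelled format above can be read into Python with a few lines. This is a sketch based only on the example file shown here: `parse_binary_input` is a hypothetical helper, the fallback label `seqN` is made up for unlabelled sequences, and any symbol other than 0 or 1 is stored as `None` to mark missing data.

```python
SAMPLE = """\
#> Seq1
00010010
#> Seq2
1001101-
#> Seq3
11001-01
#> Seq4
01100101
#positions: 2 3 7 9 10 11 13 14
"""


def parse_binary_input(text):
    """Return (labels, sequences, positions) from a labelled 0/1 file."""
    labels, seqs, positions = [], [], None
    pending = None  # label waiting for its sequence line
    for ln in text.splitlines():
        ln = ln.strip()
        if not ln:
            continue
        if ln.startswith("#positions:"):
            positions = [int(p) for p in ln.split(":", 1)[1].split()]
        elif ln.startswith("#>"):
            pending = ln[2:].strip()
        else:
            # Hypothetical fallback name when no label precedes the sequence.
            labels.append(pending if pending is not None else f"seq{len(seqs) + 1}")
            # Anything other than 0/1 is treated as missing data.
            seqs.append([int(c) if c in "01" else None for c in ln])
            pending = None
    return labels, seqs, positions


labels, seqs, positions = parse_binary_input(SAMPLE)
```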


#### FASTA format
Input data in FASTA format can be provided, for instance:

In [6]:
!type example_data_3.fasta

>Seq1
ACTTAAGG
>Seq2
TCTTTAGN
>Seq3
TGTATNAA
>Seq4
AGCAATAA


Any symbols other than 'ACTG' are taken to denote missing data. 

To denote that the data is provided in FASTA format in nucleotide representation, the '-f' and '-n' flags should be used:

In [7]:
!kwarg -f -n -S0.5 -M0.9 < example_data_3.fasta

         Seed   Temp  SE_cost  RM_cost   R_cost  RR_cost  SE  RM   R   N_states            Time
   1607542536   30.0     0.50     0.90     1.00     2.00   2   0   0         36      0.00000000


### Annealing parameter

We can change the annealing temperature using the '-T' option:
- '-T30' is the default temperature of 30.
- '-T0' means the next step is chosen uniformly at random among all available moves (this is not recommended).
- '-T-1' signifies $T = \infty$, so the next step is chosen among all available moves with the minimum score.
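
To build intuition for these three settings, the sketch below computes selection probabilities for a handful of candidate moves. The weighting `exp(-T * (score - best))` is an assumption chosen for illustration only; the tutorial does not state the formula KwARG uses, and `move_probabilities` is a hypothetical function. What the sketch does reproduce is the documented behaviour at the two extremes: uniform choice at T = 0 and minimum-score choice at '-T-1'.

```python
import math


def move_probabilities(scores, temperature):
    """Selection probabilities for candidate moves (illustrative only)."""
    best = min(scores)
    if temperature == -1:
        # Documented as T = infinity: choose among minimum-score moves only.
        weights = [1.0 if s == best else 0.0 for s in scores]
    elif temperature < 0:
        raise ValueError("temperature must be non-negative or exactly -1")
    else:
        # ASSUMPTION: Boltzmann-style weights; not taken from KwARG itself.
        weights = [math.exp(-temperature * (s - best)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


uniform = move_probabilities([1.0, 1.5, 3.0], 0)    # every move equally likely
default = move_probabilities([1.0, 1.5, 3.0], 30)   # strongly favours low scores
greedy = move_probabilities([1.0, 1.0, 3.0], -1)    # ties at the minimum share the mass
```

Under this assumed weighting, intermediate temperatures interpolate between random search and a purely greedy choice.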

In [8]:
!kwarg -T-1 -Q3 < kreitman_snp.txt

         Seed   Temp  SE_cost  RM_cost   R_cost  RR_cost  SE  RM   R   N_states            Time
   1607542536   -1.0    -1.00    -1.00     1.00     2.00   0   0   7        146      0.01400000
   1607556362   -1.0    -1.00    -1.00     1.00     2.00   0   0   7        167      0.01600000
   1607585230   -1.0    -1.00    -1.00     1.00     2.00   0   0   7        167      0.01600000
   1607565299   -1.0     1.00     1.01     1.00     2.00   1   0   6        797      0.12100000
   1607583785   -1.0     1.00     1.01     1.00     2.00   1   0   6        797      0.09200000
   1607522775   -1.0     1.00     1.01     1.00     2.00   1   0   6        797      0.08500000
   1607584167   -1.0     0.90     0.91     1.00     2.00   3   0   3        773      0.09200000
   1607527690   -1.0     0.90     0.91     1.00     2.00   3   0   3        773      0.10300000
   1607525508   -1.0     0.90     0.91     1.00     2.00   3   0   3        773      0.09300000
   1607557310   -1.0     0.80     0.81  

### Turning off recurrent mutations

Specifying '-S-1 -M-1' turns off recurrent mutations, so in this case an upper bound on Rmin is computed under the infinite sites assumption:

In [9]:
!kwarg -S-1 -M-1 -T-1 -Q5 < kreitman_snp.txt

         Seed   Temp  SE_cost  RM_cost   R_cost  RR_cost  SE  RM   R   N_states            Time
   1607522763   -1.0    -1.00    -1.00     1.00     2.00   0   0   7        146      0.00900000
   1607555700   -1.0    -1.00    -1.00     1.00     2.00   0   0   7        167      0.01000000
   1607545368   -1.0    -1.00    -1.00     1.00     2.00   0   0   7        167      0.01100000
   1607531335   -1.0    -1.00    -1.00     1.00     2.00   0   0   7        146      0.00800000
   1607568421   -1.0    -1.00    -1.00     1.00     2.00   0   0   7        167      0.01400000


### Turning off recombination

Specifying '-R-1 -C-1' turns off the possibility of recombinations, so an upper bound on the minimum parsimony score is computed:

In [10]:
!kwarg -S1.0 -M1.1 -R-1 -C-1 -Q5 < kreitman_snp.txt

         Seed   Temp  SE_cost  RM_cost   R_cost  RR_cost  SE  RM   R   N_states            Time
   1607548939   30.0     1.00     1.10    -1.00    -1.00  20   4   0       1841      0.27100000
   1607537049   30.0     1.00     1.10    -1.00    -1.00  NA  NA  NA       1892      0.42700000
   1607524830   30.0     1.00     1.10    -1.00    -1.00  20   0   0       1544      0.28600000
   1607531213   30.0     1.00     1.10    -1.00    -1.00  NA  NA  NA       1639      0.34000000
   1607566656   30.0     1.00     1.10    -1.00    -1.00  NA  NA  NA       2024      0.27600000


### Specifying the ancestral (root) sequence

It is possible to specify a particular sequence as ancestral to the sample (corresponding to the root of the ARG), using the '-k' option. 

If the input data is in binary format, then the all-zero sequence will be assumed to be ancestral (whether or not this is included in the data). 

If the input data is in nucleotide or amino acid representation, then the first sequence in the data will be taken as ancestral. 

The effect on the resulting history is illustrated further below. 

## Outputs

We can re-run a particular instance by specifying the random seed, and output the history as text.
Here we are specifying the seed 3183788175, SE_cost=0.01 and RM_cost=0.02. We expect to see 5 sequencing errors, 0 recurrent mutations and 2 recombinations in the output. 

The "-b" option will save the history in file "example_history.txt". 

The "-d" option will output a picture of the corresponding ARG in DOT format and save it in the file "example_arg.dot". The "-e" option tells KwARG to label the edges with mutations.

If an output option is specified, an output file name must also be provided.

In [11]:
!kwarg -S0.01 -M0.02 -Z3183788175 -bexample_history.txt -dexample_arg.dot -e < kreitman_snp.txt

         Seed   Temp  SE_cost  RM_cost   R_cost  RR_cost  SE  RM   R   N_states            Time
   3183788175   30.0     0.01     0.02     1.00     2.00   5   0   2        702      0.10600000


In [12]:
!type example_history.txt

Mutation of site 6 in sequence 11
Mutation of site 7 in sequence 11
Mutation of site 8 in sequence 11
Mutation of site 10 in sequence 1
Mutation of site 14 in sequence 11
Mutation of site 15 in sequence 5
Mutation of site 21 in sequence 11
Mutation of site 25 in sequence 11
Mutation of site 39 in sequence 4
Mutation of site 40 in sequence 4
Mutation of site 41 in sequence 3
Mutation of site 42 in sequence 6
Mutation of site 43 in sequence 3
Coalescing sequences 8 and 9
Coalescing sequences 8 and 10
Mutation of site 13 in sequence 8
Mutation of site 38 in sequence 8
---->Sequencing error at site 36 in sequence 6
---->Stretch of sequencing errors spanning 2 sites:
---->Sequencing error at site 17 in sequence 5
---->Sequencing error at site 18 in sequence 5
Mutation of site 17 in sequence 4
Mutation of site 18 in sequence 4
Coalescing sequences 3 and 4
Mutation of site 36 in sequence 3
---->Sequencing error at site 9 in sequence 1
---->Sequencing error at site 3 in sequence 2
Coalescing s
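
The history file can be summarised by counting event lines. This is a minimal sketch that relies only on the line prefixes visible in the excerpt above; `count_events` is a hypothetical helper. Recombination and recurrent mutation lines do not appear in the part of the file shown here, so their wording is not guessed and they are not counted.

```python
from collections import Counter

# A few lines copied from the history file shown above.
EXCERPT = """\
Coalescing sequences 8 and 10
Mutation of site 13 in sequence 8
Mutation of site 38 in sequence 8
---->Sequencing error at site 36 in sequence 6
---->Stretch of sequencing errors spanning 2 sites:
---->Sequencing error at site 17 in sequence 5
---->Sequencing error at site 18 in sequence 5
Mutation of site 17 in sequence 4
"""


def count_events(text):
    """Count event types by line prefix (only prefixes seen in the excerpt)."""
    counts = Counter()
    for ln in text.splitlines():
        # Drop the '---->' marker that flags sequencing-error lines.
        ln = ln.strip().lstrip("->")
        if ln.startswith("Sequencing error"):
            counts["sequencing_error"] += 1
        elif ln.startswith("Mutation"):
            counts["mutation"] += 1
        elif ln.startswith("Coalescing"):
            counts["coalescence"] += 1
        # "Stretch of sequencing errors" is a summary line, not an event.
    return counts


counts = count_events(EXCERPT)
```

Applied to the full file, the sequencing error count can be compared with the SE column of the summary table.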

If GraphViz is installed, it can be used to convert the DOT file to a PNG image.

In [13]:
!dot -Tpng -o example_arg.png example_arg.dot

![Example_ARG](example_arg.png)

Recurrent mutations are labelled with '\*'. Recombination nodes are coloured blue, with the number inside the node denoting the recombination breakpoint. The parts of the genome to the left and right of the breakpoint are inherited from two different parent nodes. The edges leading to these nodes are labelled 'P' and 'S', which stand for 'prefix' and 'suffix', respectively. 

See the help for a full list of possible output formats. For instance, we can also print out the local trees for each interval in Newick format as follows:

In [14]:
!kwarg -S0.01 -M0.02 -Z3257491408 -texample_local_trees.txt -I < kreitman_snp.txt

         Seed   Temp  SE_cost  RM_cost   R_cost  RR_cost  SE  RM   R   N_states            Time
   3257491408   30.0     0.01     0.02     1.00     2.00   5   0   2        702      0.13000000


The '-I' option means a tree is given for each interval between recombination breakpoints. Not specifying this option will mean that one tree is produced for each site.

In [15]:
!type example_local_trees.txt

((((((1,2),(3,4)),5),6),7),(((8,9),10),11))1-3;
(((((1,2),(3,4)),6),7),((5,((8,9),10)),11))4-29;
((((((1,2),(3,4)),5),6),7),(((8,9),10),11))30-43;
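
Each line of this file is a Newick tree followed by the interval of sites it covers. The sketch below splits the two parts, assuming the 'start-end;' suffix produced with the '-I' option as shown above; `parse_local_trees` is a hypothetical helper, and the format written without '-I' is not handled.

```python
import re

SAMPLE = """\
((((((1,2),(3,4)),5),6),7),(((8,9),10),11))1-3;
(((((1,2),(3,4)),6),7),((5,((8,9),10)),11))4-29;
((((((1,2),(3,4)),5),6),7),(((8,9),10),11))30-43;
"""

# Tree topology in parentheses, then the site interval, then ';'.
LINE = re.compile(r"^(?P<tree>\(.*\))(?P<start>\d+)-(?P<end>\d+);$")


def parse_local_trees(text):
    """Split each line into a plain Newick string and its site interval."""
    trees = []
    for ln in text.splitlines():
        m = LINE.match(ln.strip())
        if not m:
            continue  # skip blank or unexpected lines
        leaves = sorted(int(x) for x in re.findall(r"\d+", m["tree"]))
        trees.append({
            "newick": m["tree"] + ";",
            "start": int(m["start"]),
            "end": int(m["end"]),
            "leaves": leaves,
        })
    return trees


trees = parse_local_trees(SAMPLE)
intervals = [(t["start"], t["end"]) for t in trees]
```

In this example the first and third trees share the same topology, while sequence 5 sits in a different clade over sites 4 to 29.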


### Specifying the ancestral sequence
Compare the output ARGs when the ancestral sequence is and is not specified:

In [16]:
!type example_data_2.txt

#> Seq1
00010010
#> Seq2
1001101-
#> Seq3
11001-01
#> Seq4
01100101
#positions: 2 3 7 9 10 11 13 14


In [17]:
!kwarg -L1 -Z3066074262 -dexample_ancestral.dot -k -e -vboth < example_data_2.txt
!kwarg -L2 -Z3066074262 -dexample_nonancestral.dot -e -vboth -s < example_data_2.txt
!dot -Tpng -o example_ancestral_arg.png example_ancestral.dot
!dot -Tpng -o example_nonancestral_arg.png example_nonancestral.dot

       Ref          Seed   Temp  SE_cost  RM_cost   R_cost  RR_cost  SE  RM   R   N_states            Time
         1    3066074262   30.0     0.50     0.90     1.00     2.00   2   0   0         55      0.00800000
         2    3066074262   30.0     0.50     0.90     1.00     2.00   2   0   0         36      0.00100000


Here '-vboth' means that the leaves are marked with their corresponding reference (if specified), and all nodes are labelled with the corresponding sequence.

![Ancestral sequence specified](example_ancestral_arg.png)

![Ancestral sequence not specified](example_nonancestral_arg.png)

## Simplify

This is a small program which reduces the input dataset using the 'Clean' algorithm, removing all mutations and coalescing all possible sequences until no further reduction is possible. The input data types are the same as for KwARG.

In [18]:
!simplify < kreitman_snp.txt

Input dataset: 11 sequences, 43 sites
Reduced dataset: 9 sequences, 16 sites
0100010101010101
0001010101010101
0101010001010111
0101011001010111
0110101001010100
0001000000000010
0001000010101000
1010100010101000
1010000010101000
Sequences:
X 1 2 3 4 5 6 X 10 
Sites:
X 2 X 8 X 15 X X 29 30 31 32 33 34 35 36 


The output sequence labels show either the number of the sequence in the input dataset, or 'X' if the sequence in the output has been coalesced with another. The site labels show either the number of the site as in the input dataset, or 'X' if the column corresponds to multiple sites collapsed into one.
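
The kind of reduction performed here can be illustrated with a simplified sketch. It is not the code used by `simplify`: it ignores missing data, does not track labels, and assumes that a site is uninformative when fewer than two sequences carry a 1 (with the all-zero sequence as ancestral). The function name `clean` is chosen for illustration.

```python
def clean(rows):
    """Repeatedly drop uninformative sites, merge duplicate adjacent
    columns, and merge identical sequences until nothing changes."""
    rows = list(rows)
    changed = True
    while changed:
        changed = False

        # 1. Remove sites where fewer than two sequences carry a 1 (assumption).
        n_sites = len(rows[0]) if rows else 0
        keep = [j for j in range(n_sites)
                if sum(r[j] == "1" for r in rows) >= 2]
        if len(keep) < n_sites:
            rows = ["".join(r[j] for j in keep) for r in rows]
            changed = True

        # 2. Collapse adjacent sites with identical patterns into one column.
        n_sites = len(rows[0]) if rows else 0
        keep = [j for j in range(n_sites)
                if j == 0 or any(r[j] != r[j - 1] for r in rows)]
        if len(keep) < n_sites:
            rows = ["".join(r[j] for j in keep) for r in rows]
            changed = True

        # 3. Coalesce identical sequences, keeping the first copy.
        unique = list(dict.fromkeys(rows))
        if len(unique) < len(rows):
            rows = unique
            changed = True
    return rows


# Tree-compatible data reduces all the way to a single empty sequence.
fully_reduced = clean(["110", "110", "001"])
# Data containing all four gametes at two sites cannot be reduced further.
irreducible = clean(["00", "01", "10", "11"])
```

Whatever remains after the reduction is the part of the data that requires recombinations or recurrent mutations to explain.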

## Flip

This is a small program which flips the entry at each given sequence and site (from 0 to 1 or from 1 to 0) and outputs the amended dataset. In the example below, '-q6,8 -s37,1' flips site 37 of sequence 6 and site 1 of sequence 8.

In [19]:
!type kreitman_snp.txt

0010000001000001001101110111101010101000000
0000000010000001001101110111101010101000000
0010000010000001000000000000001010111000101
0010000010000001110000000000001010111011000
0011100000110010110000000000001010100000000
0000000010000000000000000000000000010000010
0000000010000000000000000000010101000000000
1101100000111000000000000000010101000100000
1101100000111000000000000000010101000100000
1101100000111000000000000000010101000100000
1101111100000100000010001000010101000000000


In [20]:
!flip -q6,8 -s37,1 < kreitman_snp.txt

0010000001000001001101110111101010101000000
0000000010000001001101110111101010101000000
0010000010000001000000000000001010111000101
0010000010000001110000000000001010111011000
0011100000110010110000000000001010100000000
0000000010000000000000000000000000011000010
0000000010000000000000000000010101000000000
0101100000111000000000000000010101000100000
1101100000111000000000000000010101000100000
1101100000111000000000000000010101000100000
1101111100000100000010001000010101000000000


Using the option '-bfilename.txt' will save the resulting dataset to filename.txt.
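
The same operation is easy to reproduce in Python when experimenting with small datasets. This is a minimal sketch; the function `flip` below is a stand-in written for illustration, not the `flip` program itself, and it takes 1-based (sequence, site) pairs, mirroring how the '-q' and '-s' lists pair up in the example output above.

```python
def flip(rows, targets):
    """Flip 0 <-> 1 at each 1-based (sequence, site) pair in `targets`."""
    out = [list(r) for r in rows]
    for seq, site in targets:
        current = out[seq - 1][site - 1]
        if current not in "01":
            raise ValueError(f"cannot flip non-binary entry {current!r}")
        out[seq - 1][site - 1] = "1" if current == "0" else "0"
    return ["".join(r) for r in out]


# Toy dataset (not the Kreitman data): flip site 3 of sequence 2
# and site 1 of sequence 3.
toy = ["0010", "0000", "1101"]
flipped = flip(toy, [(2, 3), (3, 1)])
```

For the command in the cell above, the equivalent call would use the pairs `(6, 37)` and `(8, 1)`.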