# MentaLiST quick start

This notebook shows how to run MentaLiST to create new MLST scheme databases, either by downloading schemes from public MLST websites or from custom files, and then how to call alleles for NGS samples.

## Help
MentaLiST.jl is the main script, with several commands available. To see a list of commands, run MentaLiST with the -h flag:  

In [1]:
cd ../src/mentalist_results
alias mentalist="julia --depwarn=no  /projects/pathogist/pfeijao/mentalist2/MentaLiST/src/mentalist" 

bash: cd: ../src/mentalist_results: No such file or directory


In [2]:
# Help: shows all available commands:
mentalist -h

usage: mentalist [-h]
                 {call|build_db|list_pubmlst|download_pubmlst|list_cgmlst|download_cgmlst|download_enterobase}

commands:
  call                 MLST caller, given a sample and a k-mer
                       database.
  build_db             Build a MLST k-mer database, given a list of
                       FASTA files.
  list_pubmlst         List all available MLST schemes from
                       www.pubmlst.org.
  download_pubmlst     Dowload a MLST scheme from pubmlst and build a
                       MLST k-mer database.
  list_cgmlst          List all available cgMLST schemes from
                       www.cgmlst.org.
  download_cgmlst      Dowload a MLST scheme from cgmlst.org and build
                       a MLST k-mer database.
  download_enterobase  Dowload a MLST scheme from Enterobase
                       (enterobase.warwick.ac.uk) and build a MLST
                       k-mer database.

optional arguments:
  -h, --help           show this hel

To see the help for a particular command, run MentaLiST with the command name and the -h flag:

In [3]:
mentalist call -h

usage: mentalist call -o O -s S --db DB [-t MUTATION_THRESHOLD]
                      [--kt KT] [--output_votes] [--output_special]
                      [-h] files...

positional arguments:
  files                 FastQ input files

optional arguments:
  -o O                  Output file with MLST call
  -s S                  Sample name
  --db DB               Kmer database
  -t, --mutation_threshold MUTATION_THRESHOLD
                        Maximum edit distance (number of mutations)
                        when looking for novel alleles. (type: Int64,
                        default: 6)
  --kt KT               Minimum # of times a kmer is seen, to be
                        considered 'solid', meaning actually present
                        in the sample. (type: Int64, default: 10)
  --output_votes        Also outputs the results for the original
                        voting algorithm, without novel.
  --output_special      Also outputs a FASTA file with the alleles
           

The following sections give quick examples of how to use each MentaLiST command. It is a good idea to create a new folder to store the results:

In [4]:
mkdir mentalist_results
cd mentalist_results 

# Installing MLST schemes
MentaLiST needs to create a k-mer database file for a given MLST scheme before it can call alleles. There are several options, from custom schemes based on local FASTA files to public schemes downloaded from pubmlst.org or cgmlst.org.

## pubMLST schemes

MentaLiST can search for and install MLST schemes from pubMLST.org, as shown below.

### List available pubmlst.org schemes
The command 'list_pubmlst' lists the schemes available on pubMLST. Since there are many, you can also give a prefix so that only schemes matching this prefix are listed.

In [5]:
mentalist list_pubmlst -h

usage: mentalist list_pubmlst [-p PREFIX] [-h]

optional arguments:
  -p, --prefix PREFIX  Only list schemes that starts with this prefix.
  -h, --help           show this help message and exit



In [6]:
# List campylobacter schema:
mentalist list_pubmlst -p Campylobacter

#id	organism
23	Campylobacter concisus/curvus 
24	Campylobacter fetus           
25	Campylobacter helveticus      
26	Campylobacter hyointestinalis 
27	Campylobacter insulaenigrae   
28	Campylobacter jejuni          
29	Campylobacter lanienae        
30	Campylobacter lari            
31	Campylobacter sputorum        
32	Campylobacter upsaliensis     
INFO: 10 schema found.


### Install a pubmlst.org scheme
A scheme can be referenced by species name (exact match) or, more simply, by the ID given by the 'list_pubmlst' command. To install the 'Campylobacter jejuni' scheme (ID 28), run the following command:

In [7]:
mentalist download_pubmlst -k 31 -o campy_mlst_fasta_files -s 28 --db campy_mlst.db 

INFO: Searching for the scheme ... 
2017-12-06T12:06:52.198 - info: Downloading scheme for Campylobacter jejuni ... 
2017-12-06T12:06:52.389 - info: Downloading profile ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  302k  100  302k    0     0  34231      0  0:00:09  0:00:09 --:--:-- 78831
2017-12-06T12:07:01.501 - info: Downloading locus aspA ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  216k  100  216k    0     0  26112      0  0:00:08  0:00:08 --:--:-- 47866
2017-12-06T12:07:10.02 - info: Downloading locus glnA ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  298k  100  298k    0     0  35891      0  0:00:08  0:00:08 --:--:-- 83921

In [8]:
# The output folder (-o) has all the FASTA files and profile for the scheme.
ls campy_mlst_fasta_files

aspA.tfa           glnA.tfa  glyA.tfa  tkt.tfa
campylobacter.txt  gltA.tfa  pgm.tfa   uncA.tfa


In [9]:
# The --db flag indicates the database file that will be used by MentaLiST in the calling phase.
ls -lh campy_mlst.db

-rw-r--r-- 1 pfeijao users 1.1M Dec  6 12:08 campy_mlst.db


## cgMLST schemes
As with the pubMLST schemes, MentaLiST can also download and install cgMLST schemes from cgmlst.org.

### List available cgMLST schemes from cgmlst.org

In [10]:
mentalist list_cgmlst

#id	organism
3956907	Acinetobacter baumannii       
3560802	Clostridioides difficile      
991893	Enterococcus faecium          
260204	Francisella tularensis        
2187931	Klebsiella pneumoniae/variicola/quasipneumoniae
1025099	Legionella pneumophila        
690488	Listeria monocytogenes        
741110	Mycobacterium tuberculosis/bovis/africanum/canettii
141106	Staphylococcus aureus         
INFO: 9 schema found.


### Download and install a cgMLST scheme from cgmlst.org

In [11]:
mentalist download_cgmlst -h

usage: mentalist download_cgmlst -o OUTPUT -s SCHEME -k K --db DB [-c]
                        [-h]

optional arguments:
  -o, --output OUTPUT   Output folder for the scheme files.
  -s, --scheme SCHEME   Species name or ID of the scheme
  -k K                  K-mer size (type: Int8)
  --db DB               Output file for the kmer database.
  -c, --disable_compression
                        Disables the default compression of the
                        database, that stores only the most
                        informative kmers. Not recommended unless for
                        debugging.
  -h, --help            show this help message and exit



In [12]:
mentalist download_cgmlst -o mtb_cgmlst_fasta -s 741110 -k 31 --db mtb_cgmlst.db

2017-12-06T12:08:40.451 - info: Downloading cgMLST scheme ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9422k    0 9422k    0     0   507k      0 --:--:--  0:00:18 --:--:-- 1492k
2017-12-06T12:08:59.377 - info: Unzipping cgMLST scheme into individual FASTA files for each loci ...
...............
2017-12-06T12:09:56.045 - info: 2891 loci found.
2017-12-06T12:09:56.045 - info: Building the k-mer database ...
2017-12-06T12:10:01.439 - info: Opening FASTA files ... 
2017-12-06T12:18:57.54 - info: Combining results for each locus ...
2017-12-06T12:19:06.386 - info: Saving DB ...
2017-12-06T12:19:09.468 - info: Done!


## Install a custom scheme from FASTA files
It is also possible to install a custom MLST scheme from FASTA files. Each file should be named LOCUS.fa (the extension is not important; it can be .fasta, .tfa, etc.), and each allele in the file should have the identifier LOCUS_N (or, alternatively, LOCUS.N), where N is a unique number for each allele, usually running from 1 to the number of alleles.
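
Before building a database, it can help to check that a custom FASTA file follows this naming convention. The snippet below is a minimal sketch, not part of MentaLiST: `check_locus_fasta` is a hypothetical helper, and the two small FASTA files are toy data created only for the demonstration.

```shell
# Hypothetical helper (not a MentaLiST command): print every FASTA header that
# does not follow the LOCUS_N / LOCUS.N convention, where LOCUS is the file
# name without its extension. Returns non-zero if any header is wrong.
check_locus_fasta() {
    local fasta=$1
    local locus
    locus=$(basename "$fasta")
    locus=${locus%.*}            # strip the extension (.fa, .fasta, .tfa, ...)
    awk -v locus="$locus" '
        /^>/ {
            id = substr($1, 2)                        # header without ">"
            prefix = substr(id, 1, length(locus))     # should equal the locus name
            rest = substr(id, length(locus) + 1)      # should be "_N" or ".N"
            if (prefix != locus || rest !~ /^[_.][0-9]+$/) {
                print FILENAME ": " id
                bad = 1
            }
        }
        END { exit bad }
    ' "$fasta"
}

# Toy data for demonstration only.
demo_dir=$(mktemp -d)
printf '>glnA_1\nGATC\n>glnA_2\nGATT\n' > "$demo_dir/glnA.tfa"
printf '>pgm_1\nACGT\n>allele2\nACGA\n' > "$demo_dir/pgm.tfa"

check_locus_fasta "$demo_dir/glnA.tfa" && echo "glnA.tfa: OK"
check_locus_fasta "$demo_dir/pgm.tfa"  || echo "pgm.tfa: headers need fixing"
```

The check compares the prefix as a plain string rather than a regular expression, so locus names containing characters such as `.` are handled literally.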

For instance, let's test this functionality with the Campylobacter scheme FASTA files that were downloaded in the previous example:

In [13]:
# Each file is a different locus:
ls campy_mlst_fasta_files/*.tfa

campy_mlst_fasta_files/aspA.tfa  campy_mlst_fasta_files/pgm.tfa
campy_mlst_fasta_files/glnA.tfa  campy_mlst_fasta_files/tkt.tfa
campy_mlst_fasta_files/gltA.tfa  campy_mlst_fasta_files/uncA.tfa
campy_mlst_fasta_files/glyA.tfa


In [14]:
# For each locus file, a different ID and sequence for each allele:
head -n 18 campy_mlst_fasta_files/glnA.tfa

>glnA_1
GATCCTTTTACGGCTGATCCTACTATCATAGTATTTTGTGATGTGTATGATATTTACAAA
GGACAAATGTATGAAAAATGTCCAAGAAGCATAGCAAAAAAAGCAATAGAACACCTTAAA
AATAGTGGCATAGCTGATACTGCTTACTTTGGACCAGAAAATGAATTCTTTGTTTTTGAT
AGTGTAAAAATAGTTGATACTACTCATTGTTCTAAGTATGAAGTTGATACCGAAGAAGGA
GAGTGGAATGATGATAGAGAATTTACCGATAGCTACAATACTGGACACAGGCCAAGAAAC
AAAGGTGGATATTTTCCAGTTCAGCCAATTGATTCTTTAGTAGATATTCGTTCTGAAATG
GTTCAAACCCTTGAAAAAGTAGGTCTTAAAACTTTTGTTCATCATCATGAAGTTGCACAA
GGACAAGCTGAAATAGGAGTAAATTTTGGCACGCTTGTAGAAGCAGCTGACAATGTT
>glnA_2
GATCCTTTTACGGCTGATCCTACTATCATAGTATTTTGTGATGTGTATGATATTTACAAA
GGACAAATGTATGAAAAATGTCCAAGAAGCATAGCAAAAAAAGCAATGGAACACCTTAAA
AATAGTGGCATAGCTGATACTGCTTACTTTGGACCAGAAAATGAATTCTTTGTTTTTGAT
AGTGTAAAAATAGTTGATACTACTCATTGTTCTAAGTATGAAGTTGATACCGAAGAAGGA
GAGTGGAATGATGATAGAGAATTTACCGATAGCTACAATACTGGACACAGGCCAAGAAAC
AAAGGTGGATATTTTCCAGTTCAGCCAATTGATTCTTTAGTAGATATTCGTTCTGAAATG
GTTCAAACCCTTGAAAAAGTAGGTCTTAAAACTTTTGTTCATCATCATGAAGTTGCACAA
GGACAAGCTGAAATAGGAGTAAATTTTGGCACGCTTGTAGAAGCAGCTGACAATGTT


In [15]:
# Install the Campylobacter jejuni scheme directly from the FASTA files; let's use a different k-mer length:
mentalist build_db -k 25 --db campy_mlst_25.db -p campy_mlst_fasta_files/campylobacter.txt -f campy_mlst_fasta_files/*.tfa

2017-12-06T12:19:19.237 - info: Opening FASTA files ... 
2017-12-06T12:19:23.944 - info: Combining results for each locus ...
2017-12-06T12:19:24.241 - info: Saving DB ...
2017-12-06T12:19:26.088 - info: Done!


# Calling MLST alleles for a sample

After a k-mer database has been created, MentaLiST can call alleles for a given sample.

In [16]:
# Help:
mentalist call -h

usage: mentalist call -o O -s S --db DB [-t MUTATION_THRESHOLD]
                      [--kt KT] [--output_votes] [--output_special]
                      [-h] files...

positional arguments:
  files                 FastQ input files

optional arguments:
  -o O                  Output file with MLST call
  -s S                  Sample name
  --db DB               Kmer database
  -t, --mutation_threshold MUTATION_THRESHOLD
                        Maximum edit distance (number of mutations)
                        when looking for novel alleles. (type: Int64,
                        default: 6)
  --kt KT               Minimum # of times a kmer is seen, to be
                        considered 'solid', meaning actually present
                        in the sample. (type: Int64, default: 10)
  --output_votes        Also outputs the results for the original
                        voting algorithm, without novel.
  --output_special      Also outputs a FASTA file with the alleles
           

For this example we are using a Campylobacter jejuni sample from EMBL ENA. You can download the FASTQ file with the following command:

In [17]:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR582/007/SRR5824107/SRR5824107_1.fastq.gz

--2017-12-06 12:19:31--  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR582/007/SRR5824107/SRR5824107_1.fastq.gz
           => ‘SRR5824107_1.fastq.gz’
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.192.7
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.192.7|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /vol1/fastq/SRR582/007/SRR5824107 ... done.
==> SIZE SRR5824107_1.fastq.gz ... 42927998
==> PASV ... done.    ==> RETR SRR5824107_1.fastq.gz ... done.
Length: 42927998 (41M) (unauthoritative)


2017-12-06 12:19:47 (2.99 MB/s) - ‘SRR5824107_1.fastq.gz’ saved [42927998]



Now run the MentaLiST caller on this sample, passing the database that we created previously with the --db flag.

(Ignore any warnings; they are caused by deprecated commands in libraries that MentaLiST uses, which Julia reports verbosely.)

In [18]:
mentalist call -o campy_call.txt -s SRR5824107 --db campy_mlst.db SRR5824107_1.fastq.gz 

2017-12-06T12:19:56.299 - info: Opening kmer database ... 
2017-12-06T12:19:59.739 - info: Opening fastq file(s) and counting kmers ... 
2017-12-06T12:20:13.66 - info: Voting for alleles ... 
2017-12-06T12:20:14.099 - info: Calling alleles and novel alleles ...
2017-12-06T12:20:16.199 - info: Writing output ...
2017-12-06T12:20:16.526 - info: Done.


The output consists of two files: one has the calls, and the other some details about the coverage of calls and special cases.

In [19]:
# results:
ls campy_call.*

campy_call.txt  campy_call.txt.coverage.txt


In [20]:
# Allele calls and ST are in the campy_call.txt file:
column -ts $'\t' campy_call.txt

Sample      aspA  glnA  gltA  glyA  pgm  tkt  uncA  ST   clonal_complex
SRR5824107  2     17    2     3     2    1    5     883  ST-21 complex


In [21]:
# Detailed vote count for each allele:
cat campy_call.txt.coverage.txt

Locus	Coverage	MinKmerDepth	Call
aspA	1	61	Called allele 2.
glnA	1	34	Called allele 17.
gltA	1	62	Called allele 2.
glyA	1	46	Called allele 3.
pgm	1	15	Called allele 2.
tkt	1	34	Called allele 1.
uncA	1	54	Called allele 5.
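
For schemes with many loci, it may be easier to filter the coverage report than to read it in full. The following is a minimal sketch under stated assumptions: `low_depth_loci` is a hypothetical helper, the cutoff of 20 is an arbitrary value chosen for illustration, and the demo file simply reproduces a few rows of the report above. The only thing taken from MentaLiST is the tab-separated layout `Locus`, `Coverage`, `MinKmerDepth`, `Call`.

```shell
# Hypothetical helper (not a MentaLiST command): list loci whose MinKmerDepth
# (third column of the coverage report) is below a chosen cutoff.
# The default cutoff of 20 is an arbitrary illustration value.
low_depth_loci() {
    local coverage_file=$1
    local min_depth=${2:-20}
    # Skip the header line and rows where the depth is not a number (e.g. "*").
    awk -F'\t' -v min="$min_depth" \
        'NR > 1 && $3 != "*" && $3 + 0 < min { print $1 "\t" $3 }' \
        "$coverage_file"
}

# Demo file reproducing a few rows of the report shown above.
demo_cov=$(mktemp)
printf 'Locus\tCoverage\tMinKmerDepth\tCall\n'  > "$demo_cov"
printf 'aspA\t1\t61\tCalled allele 2.\n'       >> "$demo_cov"
printf 'pgm\t1\t15\tCalled allele 2.\n'        >> "$demo_cov"
printf 'tkt\t1\t34\tCalled allele 1.\n'        >> "$demo_cov"

low_depth_loci "$demo_cov" 20
```

On a real run, the same function could be pointed at `campy_call.txt.coverage.txt` instead of the demo file.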


## M. tuberculosis sample on the cgMLST scheme

Now we test the MentaLiST caller on an M. tuberculosis sample, also downloaded from EMBL ENA. This time we download both paired-end FASTQ files:

In [22]:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR615/008/SRR6152708/SRR6152708_{1,2}.fastq.gz

--2017-12-06 12:20:17--  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR615/008/SRR6152708/SRR6152708_1.fastq.gz
           => ‘SRR6152708_1.fastq.gz’
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.192.7
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.192.7|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /vol1/fastq/SRR615/008/SRR6152708 ... done.
==> SIZE SRR6152708_1.fastq.gz ... 118042753
==> PASV ... done.    ==> RETR SRR6152708_1.fastq.gz ... done.
Length: 118042753 (113M) (unauthoritative)


2017-12-06 12:21:32 (1.53 MB/s) - ‘SRR6152708_1.fastq.gz’ saved [118042753]

--2017-12-06 12:21:32--  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR615/008/SRR6152708/SRR6152708_2.fastq.gz
           => ‘SRR6152708_2.fastq.gz’
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.192.7|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.

For this example, we will use the flags `--output_votes` and `--output_special`, which tell MentaLiST to create additional output files. To use paired-end samples, just include both files at the end of the command:

In [23]:
## Call alleles for the sample:
mentalist call -o SRR6152708.txt -s SRR6152708 --output_votes --output_special  --db mtb_cgmlst.db SRR6152708_1.fastq.gz SRR6152708_2.fastq.gz

2017-12-06T12:22:17.101 - info: Opening kmer database ... 
2017-12-06T12:22:24.27 - info: Opening fastq file(s) and counting kmers ... 
2017-12-06T12:23:47.149 - info: Voting for alleles ... 
2017-12-06T12:23:49.717 - info: Calling alleles and novel alleles ...
2017-12-06T12:24:58.569 - info: Writing output ...
2017-12-06T12:24:59.512 - info: Done.
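
When several samples need to be called against the same database, the command above can be generated once per sample. The sketch below only prints the commands (a dry run) so that they can be reviewed before running. `print_call_commands` is a hypothetical helper, the sample names are made up, and the `_1.fastq.gz` / `_2.fastq.gz` naming is an assumption based on the ENA files used in this notebook. Only the `-o`, `-s` and `--db` options shown in the help are used.

```shell
# Hypothetical helper (not a MentaLiST command): print one `mentalist call`
# command per sample found in a folder, pairing *_1.fastq.gz with *_2.fastq.gz
# when the second file exists. Nothing is executed; the commands are only echoed.
print_call_commands() {
    local fastq_dir=$1
    local db=$2
    local r1 r2 sample
    for r1 in "$fastq_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue                  # no match: skip the literal glob
        sample=$(basename "$r1" _1.fastq.gz)
        r2="$fastq_dir/${sample}_2.fastq.gz"
        if [ -e "$r2" ]; then
            echo "mentalist call -o $sample.txt -s $sample --db $db $r1 $r2"
        else
            echo "mentalist call -o $sample.txt -s $sample --db $db $r1"
        fi
    done
}

# Empty placeholder files with made-up sample names, for demonstration only.
demo_fastq=$(mktemp -d)
touch "$demo_fastq/sampleA_1.fastq.gz" "$demo_fastq/sampleA_2.fastq.gz" \
      "$demo_fastq/sampleB_1.fastq.gz"

print_call_commands "$demo_fastq" mtb_cgmlst.db
```

Once the printed commands look right, they could be piped to `bash` or saved to a script.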


In addition to the regular output files seen in the previous example (here `SRR6152708.txt` and `SRR6152708.txt.coverage.txt`), there are some new files, created because of the `--output_votes` and `--output_special` flags.

In [24]:
ls -l SRR6152708.txt*

-rw-r--r-- 1 pfeijao users  27616 Dec  6 12:24 SRR6152708.txt
-rw-r--r-- 1 pfeijao users  27624 Dec  6 12:24 SRR6152708.txt.byvote
-rw-r--r-- 1 pfeijao users  89586 Dec  6 12:24 SRR6152708.txt.coverage.txt
-rw-r--r-- 1 pfeijao users  85232 Dec  6 12:24 SRR6152708.txt.novel.fa
-rw-r--r-- 1 pfeijao users   3459 Dec  6 12:24 SRR6152708.txt.novel.txt
-rw-r--r-- 1 pfeijao users 198343 Dec  6 12:24 SRR6152708.txt.special_cases.fa
-rw-r--r-- 1 pfeijao users    434 Dec  6 12:24 SRR6152708.txt.ties.txt
-rw-r--r-- 1 pfeijao users 589190 Dec  6 12:24 SRR6152708.txt.votes.txt


In [25]:
# Quick check of the first 10 calls:
cut -f1-10 SRR6152708.txt | column -ts $'\t'  

Sample      Rv0014c  Rv0015c  Rv0016c  Rv0017c  Rv0018c  Rv0019c  Rv0021c  Rv0022c  Rv0023
SRR6152708  5        2        1        1        2        1        1        1        1


In [26]:
# Details; we can see that MentaLiST found some novel alleles:
head -n 15 SRR6152708.txt.coverage.txt

Locus	Coverage	MinKmerDepth	Call
Rv0014c	1	43	Called allele 5.
Rv0015c	1	18	Called allele 2.
Rv0016c	1	45	Called allele 1.
Rv0017c	1	45	Called allele 1.
Rv0018c	1	33	Called allele 2.
Rv0019c	1	39	Called allele 1.
Rv0021c	1	35	Called allele 1.
Rv0022c	1	32	Called allele 1.
Rv0023	1	57	Called allele 1.
Rv0024	1	42	Novel, 1 mutation from allele 1: Del of len 1 at pos 6
Rv0025	1	55	Called allele 1.
Rv0033	1	48	Called allele 1.
Rv0034	1	45	Called allele 2.
Rv0035	1	25	Novel, 1 mutation from allele 2: Subst G->A at pos 1023


### Novel allele detection:

In [27]:
# The following file lists all novel alleles with descriptions: for each one, it gives the
# known allele and the mutation(s) that MentaLiST applied to it to obtain the novel allele.
head SRR6152708.txt.novel.txt

Loci	MinKmerDepth	Nmut	Desc
Rv0024	42	1	From allele 1, Del of len 1 at pos 6.
Rv0035	25	1	From allele 2, Subst G->A at pos 1023.
Rv0045c	35	1	From allele 185, Ins of base G at pos 373.
Rv0063	37	1	From allele 1, Ins of base C at pos 1417.
Rv0101	32	1	From allele 8, Subst A->G at pos 6088.
Rv0134	45	1	From allele 1, Del of len 1 at pos 386.
Rv0165c	40	2	From allele 42, Ins of base C at pos 176, Ins of base G at pos 178.
Rv0195	59	1	From allele 2, Subst T->C at pos 191.
Rv0226c	36	1	From allele 4, Subst A->C at pos 36.


In [28]:
# Novel allele sequences are in the following file:
head -n4 SRR6152708.txt.novel.fa

>Rv0024
GTGAATACAGCGAGGTCGAGCTGTTGAGTCGCGCTCATCAACTGTTCGCCGGAGACAGTCGGCGACCGGGGTTGGATGCGGGCACCACACCCTACGGGGATCTGCTGTCTCGGGCTGCCGACCTGAATGTGGGTGCGGGCCAGCGCCGGTATCAACTCGCCGTGGACCACAGCCGGGCGGCCTTGCTGTCTGCTGCGCGAACCGATGCCGCGGCCGGGGCCGTCATCACCGGCGCTCAACGGGATCGGGCATGGGCCCGGCGGTCGACCGGAACCGTTCTCGACGAGGCTCGCTCGGATACCACCGTTACTGCGGTTATGCCGATAGCCCAGCGCGAAGCCATACGCCGTCGTGTGGCGCGGCTGCGCGCGCAACGAGCCCATGTGCTGACGGCGCGACGACGGGCACGACGGCACCTGGCGGCGCTGCGTGCGCTGCGGTACCGGGTGGCGCACGGCCCGGGGGTCGCGCTGGCCAAACTTCGGCTGCCGTCGCCGAGCGGTCGCGCCGGCATCGCGGTCCACGCCGCGCTGTCGCGACTTGGCCGTCCCTATGTCTGGGGCGCAACGGGGCCCAACCAGTTCGACTGTTCCGGTTTGGTCCAGTGGGCCTACGCCCAGGCGGGTGTTCACCTGGATCGCACCACCTATCAACAGATCAACGAGGGGATCCCGGTGCCGCGCTCACAGGTCCGGCCGGGCGATCTGGTCTTCCCGCACCCCGGGCACGTGCAGCTGGCGATCGGCAACAATCTGGTCGTCGAGGCGCCCCATGCGGGCGCGTCGGTTCGGGTCAGCTCGCTGGGCAACAACGTGCAGATTCGGCGACCGCTGAGTGGCAGATAA
>Rv0035
ATGACGGCGGCCTTGCTTTCACCAGCCATCGCCTGGCAGCAGATCTGGGCTTGCACGGACCGCACGCTGACGATCTCTTGCGAGGATTCCGAGGTAATCAGCTATCAGGACCTCATCGCGCGCGCGGCGGCATGCATC

### Optional: outputting the voting calls

The `--output_votes` flag makes MentaLiST output three additional files: `SRR6152708.txt.byvote`, `SRR6152708.txt.votes.txt` and `SRR6152708.txt.ties.txt`. These are the results of the old calling algorithm from MentaLiST v0.1, where only the votes are considered. In the current version, MentaLiST also checks the allele sequences to ensure that the called allele has full coverage, and tries to find novel alleles.

In [29]:
# Calls by the old voting algorithm:
cut -f1-10 SRR6152708.txt.byvote | column -ts $'\t'   

Sample      Rv0014c  Rv0015c  Rv0016c  Rv0017c  Rv0018c  Rv0019c  Rv0021c  Rv0022c  Rv0023
SRR6152708  5        2        1        1        2        1        1        1        1


The `SRR6152708.txt.votes.txt` file has the top voted alleles for each locus:


In [30]:
head -n12 SRR6152708.txt.votes.txt

Locus	TotalLocusVotes	Allele(votes),...
Rv0014c	110902	5(1822),354(279),310(257),336(244),137(228),33(219),16(123),427(116),164(114),246(114),191(79),60(66),165(53),1(0),367(-13),162(-37),266(-61),189(-98),232(-113),284(-131)
Rv0015c	66560	2(1710),219(1524),306(1087),282(1084),153(1082),201(1075),25(1070),304(1066),204(1058),133(1043),131(1036),303(1030),155(1021),286(988),280(947),327(883),188(433),119(433),288(401),321(200)
Rv0016c	97453	1(0),278(-716),249(-726),65(-797),286(-1091),141(-1240),10(-1472),190(-1492),108(-1502),137(-1516),251(-1518),67(-1522),126(-1526),28(-1550),236(-1559),33(-1574),219(-1604),110(-1638),208(-1687),9(-1712)
Rv0017c	84916	1(0),281(-60),32(-365),132(-686),267(-686),173(-822),71(-1025),38(-1065),216(-1120),285(-1208),243(-1231),291(-1288),289(-1371),172(-1545),127(-1573),55(-1577),68(-1579),115(-1580),219(-1580),158(-1582)
Rv0018c	91617	2(0),228(-46),176(-303),337(-342),358(-409),257(-569),366(-743),312(-743),128(-1093),334(-1144),191(-1184),67(-1206),295(
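
The votes file can be summarized per locus with a short `awk` script. This is a sketch under stated assumptions: `top_votes` is a hypothetical helper, it assumes that the entries in the third column are sorted by decreasing votes (as they appear in the output above), and the second demo row is made up to illustrate a tie. For each locus it prints the first allele, its votes, and how many alleles share that top vote count.

```shell
# Hypothetical helper (not a MentaLiST command): for each locus in a votes
# file, print the first listed allele, its votes, and the number of alleles
# that share that vote count. Assumes entries are sorted by decreasing votes.
top_votes() {
    awk -F'\t' 'NR > 1 {
        n = split($3, entry, ",")
        tied = 0
        for (i = 1; i <= n; i++) {
            split(entry[i], part, /[()]/)   # "5(1822)" -> part[1]="5", part[2]="1822"
            if (i == 1) { best = part[1]; top = part[2] }
            if (part[2] == top) tied++
        }
        print $1 "\t" best "\t" top "\t" tied
    }' "$1"
}

# Demo file: the first row is shortened from the output above,
# the second row (DemoLocus) is made up to show a two-way tie.
demo_votes=$(mktemp)
printf 'Locus\tTotalLocusVotes\tAllele(votes),...\n'       > "$demo_votes"
printf 'Rv0014c\t110902\t5(1822),354(279),310(257)\n'     >> "$demo_votes"
printf 'DemoLocus\t100\t1(0),3(0),7(-5)\n'                >> "$demo_votes"

top_votes "$demo_votes"
```

A value greater than 1 in the last column marks a locus where the vote alone cannot decide the call.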

Some loci have a tie for the most voted allele, for example locus Rv0024. The `SRR6152708.txt.ties.txt` file lists those loci, with all tied alleles:


In [31]:
cat SRR6152708.txt.ties.txt

Rv0024	165, 217, 1, 222
Rv0101	8, 815
Rv0165c	2, 42
Rv0538	231, 258, 270, 2, 159, 40, 208, 68
Rv0757	117, 1
Rv0818	42, 29, 1
Rv0826	1, 27
Rv1001	111, 1
Rv1097c	1, 108
Rv1363c	1, 104
Rv1413	1, 21
Rv1417	58, 90, 78, 70, 52, 26, 17, 44, 45, 13, 1, 32, 80, 91, 9, 60, 61, 79, 48, 81, 16, 21, 10, 51, 6, 88, 53, 72, 5
Rv2148c	84, 1, 115
Rv2241	90, 1
Rv2330c	1, 3
Rv2437	1, 75
Rv2526	31, 1
Rv3091	1, 277
Rv3234c	1, 16
Rv3830c	1, 50, 116, 5


If we check those loci in the MentaLiST coverage report, we can see three different cases:

In [32]:
for p in $(cut -f1 SRR6152708.txt.ties.txt); do grep $p SRR6152708.txt.coverage.txt; done

Rv0024	1	42	Novel, 1 mutation from allele 1: Del of len 1 at pos 6
Rv0101	1	32	Novel, 1 mutation from allele 8: Subst A->G at pos 6088
Rv0165c	1	40	Novel, 2 mutations from allele 42: Ins of base C at pos 176, Ins of base G at pos 178
Rv0538	1	20	Called allele 2.
Rv0757	1	47	Novel, 1 mutation from allele 117: Subst G->C at pos 373
Rv0818	1	28	Novel, 1 mutation from allele 1: Subst G->A at pos 293
Rv0826	1	40	Novel, 1 mutation from allele 1: Subst C->G at pos 901
Rv1001	1	40	Novel, 1 mutation from allele 1: Del of len 1 at pos 999
Rv1097c	1	36	Novel, 2 mutations from allele 1: Del of len 2 at pos 311
Rv1363c	1	42	Novel, 1 mutation from allele 1: Subst C->G at pos 75
Rv1413	1	49	Novel, 1 mutation from allele 1: Subst G->A at pos 80
Rv1417	0.5678	0	Not present; allele 58 is the best voted 

For loci Rv0538 and Rv2330c, MentaLiST found the only allele in each locus that has full coverage, and made the call. For locus Rv1417, MentaLiST made a 'not present' call, since the best allele has only around 57% coverage; this might be due to a poorly covered region in the sample, or because the gene is really absent from the sample but other regions of the genome are similar enough to this gene to cause the partial presence.

In all of the other loci, MentaLiST was able to find a putative novel allele. 
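
To get an overview of how many loci fall into each kind of call, the `Call` column of the coverage report can be tallied. The sketch below is an illustration only: `tally_calls` is a hypothetical helper, the category names are derived from the first words of the messages shown in this notebook, and the demo file reproduces a few rows of the report above.

```shell
# Hypothetical helper (not a MentaLiST command): count the loci in a coverage
# report by the kind of call, based on how the fourth column starts.
tally_calls() {
    awk -F'\t' 'NR > 1 {
        kind = "Other"
        if ($4 ~ /^Called/) kind = "Called"
        else if ($4 ~ /^Novel/) kind = "Novel"
        else if ($4 ~ /^Multiple/) kind = "Multiple"
        else if ($4 ~ /^Partial/) kind = "Partial"
        else if ($4 ~ /^Not present/) kind = "Not present"
        count[kind]++
    }
    END { for (k in count) print k "\t" count[k] }' "$1" | sort
}

# Demo file reproducing a few rows of the coverage report shown above.
demo_report=$(mktemp)
printf 'Locus\tCoverage\tMinKmerDepth\tCall\n'                                      > "$demo_report"
printf 'Rv0023\t1\t57\tCalled allele 1.\n'                                         >> "$demo_report"
printf 'Rv0024\t1\t42\tNovel, 1 mutation from allele 1: Del of len 1 at pos 6\n'   >> "$demo_report"
printf 'Rv0025\t1\t55\tCalled allele 1.\n'                                         >> "$demo_report"
printf 'Rv0538\t1\t20\tCalled allele 2.\n'                                         >> "$demo_report"

tally_calls "$demo_report"
```

Matching on the start of the message keeps the tally independent of the allele numbers and mutation details that follow.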

### Special Cases

There are some possible 'special cases' that make MentaLiST flag the call in the output file.

The first is **multiple possible alleles**, when more than one allele has full coverage in the sample:

In [33]:
grep Multiple SRR6152708.txt.coverage.txt

Rv1319c	1	*	Multiple possible alleles:8, 3 with depth 35, 35 and votes 3914, 3417. Most voted (8) is chosen on call file.
Rv1911c	1	*	Multiple possible alleles:1, 118 with depth 26, 26 and votes 0, -267. Most voted (1) is chosen on call file.
Rv2319c	1	*	Multiple possible alleles:7, 1 with depth 28, 30 and votes 97, 0. Most voted (7) is chosen on call file.


In these cases, MentaLiST chooses the most voted allele, but includes a flag "+" in the output:

In [34]:
cut -f 1023 SRR6152708.txt

Rv1319c
8+
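
Hard-coded column numbers such as 1023 depend on the scheme, so it can be more convenient to look up a locus by name. The following is a minimal sketch: `call_for_locus` is a hypothetical helper, and the demo call file is a shortened, hand-made version of the real one, using values shown in this notebook.

```shell
# Hypothetical helper (not a MentaLiST command): print the sample name and the
# call for one locus, finding the column from the header of the call file.
call_for_locus() {
    local call_file=$1
    local locus=$2
    awk -F'\t' -v locus="$locus" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == locus) col = i; next }
        col     { print $1 "\t" $col }      # prints nothing if the locus is unknown
    ' "$call_file"
}

# Shortened demo call file with three loci, for demonstration only.
demo_calls=$(mktemp)
printf 'Sample\tRv1319c\tRv1417\tRv0275c\n'  > "$demo_calls"
printf 'SRR6152708\t8+\t0\t1/N?\n'          >> "$demo_calls"

call_for_locus "$demo_calls" Rv1319c
call_for_locus "$demo_calls" Rv1417
```

With the real output, the call would be `call_for_locus SRR6152708.txt Rv1319c`.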


Another case is a locus called as missing, when some k-mers from the locus are present in the sample, but too few to reach the minimum threshold:

In [35]:
grep "Not present" SRR6152708.txt.coverage.txt 

Rv1417	0.5678	0	Not present; allele 58 is the best voted but below threshold with 188/435 missing kmers.


In this case, MentaLiST outputs a zero (0) in the call file.


In [36]:
cut -f 1094 SRR6152708.txt

Rv1417
0


There is also the case where a locus has an allele that is partially covered: MentaLiST then tries to find a novel allele that would be fully covered, but does not find one.

In [37]:
grep Partial SRR6152708.txt.coverage.txt

Rv0275c	0.9856	9	Partially covered alelle, novel or not present; Most covered allele 1 has 10/696 missing kmers, and no novel was found.
Rv0581	0.9731	8	Partially covered alelle, novel or not present; Most covered allele 1 has 5/186 missing kmers, and no novel was found.
Rv0860	0.9869	9	Partially covered alelle, novel or not present; Most covered allele 3 has 28/2133 missing kmers, and no novel was found.
Rv1269c	0.9449	0	Partially covered alelle, novel or not present; Most covered allele 1 has 19/345 missing kmers, and no novel was found.
Rv1860	0.9525	6	Partially covered alelle, novel or not present; Most covered allele 2 has 45/948 missing kmers, and no novel was found.
Rv1999c	0.9575	6	Partially covered alelle, novel or not present; Most covered allele 5 has 55/1293 missing kmers, and no novel was found.
Rv2249c	0.998	8	Partially covered alelle, novel or not presen

In this case, the output is in the format x/N?, where x is the most covered allele found, because MentaLiST cannot tell whether this is a novel allele or not. Here we show three of the above loci as an example:

In [38]:
cut -f 217,467,674 SRR6152708.txt

Rv0275c	Rv0581	Rv0860
1/N?	1/N?	3/N?
