# MentaLiST 0.2

MentaLiST 0.2 introduces a new calling algorithm that detects and reconstructs putative novel alleles and reports loci that are not present in the sample, which makes it usable with wgMLST schemes.



## Downloading the new version

MentaLiST 0.2 lives on a separate branch of the GitHub repository. Clone the repository as usual, then switch to the new branch:

In [1]:
git clone https://github.com/WGS-TB/MentaLiST

Cloning into 'MentaLiST'...
remote: Counting objects: 742, done.
remote: Compressing objects: 100% (103/103), done.
remote: Total 742 (delta 59), reused 51 (delta 22), pack-reused 611
Receiving objects: 100% (742/742), 28.62 MiB | 49.27 MiB/s, done.
Resolving deltas: 100% (379/379), done.


In [2]:
cd MentaLiST
git checkout mentalist_v0.2

Branch mentalist_v0.2 set up to track remote branch mentalist_v0.2 from origin.
Switched to a new branch 'mentalist_v0.2'


You can create an alias to make this version easier to access, especially if you also have the original MentaLiST version installed with Bioconda.

In [3]:
alias mentalist="julia --depwarn=no $PWD/src/mentalist"
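
Bash does not expand aliases in non-interactive scripts by default, so a shell function may be more convenient if you call MentaLiST from a script. The sketch below is one possible way to do that; the variable name `MENTALIST_DIR` is an assumption introduced here and should point at your clone of the repository.

```shell
# Assumption: MENTALIST_DIR is the path to the cloned MentaLiST repository
# (defaults to the current directory, like $PWD in the alias above).
MENTALIST_DIR="${MENTALIST_DIR:-$PWD}"

# Function equivalent of the alias; "$@" forwards all arguments unchanged.
mentalist() {
  julia --depwarn=no "$MENTALIST_DIR/src/mentalist" "$@"
}
```

After sourcing this definition, `mentalist call -h` should behave the same way as with the alias.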

MLST calling has some new options compared with the previous version:

In [4]:
mentalist call -h

usage: mentalist call -o O -s S --db DB [-t MUTATION_THRESHOLD]
                      [--kt KT] [--output_votes] [--output_special]
                      [-h] files...

positional arguments:
  files                 FastQ input files

optional arguments:
  -o O                  Output file with MLST call
  -s S                  Sample name
  --db DB               Kmer database
  -t, --mutation_threshold MUTATION_THRESHOLD
                        Maximum edit distance (number of mutations)
                        when looking for novel alleles. (type: Int64,
                        default: 6)
  --kt KT               Minimum # of times a kmer is seen, to be
                        considered 'solid', meaning actually present
                        in the sample. (type: Int64, default: 10)
  --output_votes        Also outputs the results for the original
                        voting algorithm, without novel.
  --output_special      Also outputs a FASTA file with the alleles
           

The -t option sets the maximum edit distance, in number of mutations (nucleotide substitutions, insertions, and/or deletions), that MentaLiST will apply to an existing allele while searching for a novel allele present in the sample. Larger values can take considerably longer.

The --kt option sets the minimum number of times a k-mer must be observed in the FASTQ sample to be considered 'solid'. This threshold is used in the calling phase to decide whether a particular allele is present (all of its k-mers must occur in the sample and meet the --kt threshold) and also in the search for novel alleles. You may want to increase this value if your sample has a higher sequencing depth.
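
To make the idea of a 'solid' k-mer concrete, the following sketch counts k-mers in a few toy reads and keeps those seen at least a given number of times. It only illustrates the thresholding idea and is not MentaLiST's implementation: the helper name `solid_kmers`, the toy reads, and the tiny values of k and the threshold are all assumptions, and reverse complements are ignored.

```shell
# Hypothetical helper: read one sequence per line on stdin and print
# every k-mer observed at least `kt` times, with its count.
# Usage: solid_kmers K KT < reads.txt
solid_kmers() {
  awk -v k="$1" -v kt="$2" '
    {
      # slide a window of length k over the read and count each k-mer
      for (i = 1; i + k - 1 <= length($0); i++) count[substr($0, i, k)]++
    }
    END {
      # keep only k-mers that reach the threshold
      for (km in count) if (count[km] >= kt) print km, count[km]
    }
  ' | sort
}

# Toy example: three short reads, k = 3, threshold = 2
printf 'ACGTA\nACGTT\nCGTAC\n' | solid_kmers 3 2
```

In this toy input only ACG, CGT, and GTA reach the threshold; GTT and TAC are each seen once and are dropped, in the same way that k-mers below --kt are not treated as present in the sample.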

The --output_votes option also writes the MLST calls from the old algorithm, which is based only on the maximum vote count and does not consider allele coverage.

When --output_special is given, a FASTA file with all the 'special cases' is created. These include loci where no allele has 100% coverage, loci where more than one allele has 100% coverage, and novel alleles.

## Running MentaLiST 0.2

Because of the new calling algorithm, new information has to be stored in the MentaLiST database, so databases created with the previous version are not compatible. The command for creating a database is exactly the same as before.

In [5]:
mentalist build_db -h

usage: mentalist build_db --db DB -k K -f FASTA_FILES [FASTA_FILES...]
                        [-p PROFILE] [-c] [-h]

optional arguments:
  --db DB               Output file (kmer database)
  -k K                  Kmer size (type: Int8)
  -f, --fasta_files FASTA_FILES [FASTA_FILES...]
                        Fasta files with the MLST scheme
  -p, --profile PROFILE
                        Profile file for known genotypes.
  -c, --disable_compression
                        Disables the default compression of the
                        database, that stores only the most
                        informative kmers. Not recommended unless for
                        debugging.
  -h, --help            show this help message and exit



The folder ../MTB_scheme contains all FASTA files from an M. tuberculosis cgMLST scheme. To build a MentaLiST database for this scheme, run:

In [6]:
mentalist build_db --db mtb_cgMLST.db -k 31 -f ../MTB_scheme/*.fa

2017-11-20T14:26:09.925 - info: Opening FASTA files ... 
2017-11-20T14:33:05.367 - info: Combining results for each locus ...
2017-11-20T14:33:11.98 - info: Saving DB ...
2017-11-20T14:33:15.073 - info: Done!


In [7]:
ls -lh mtb_cgMLST.db

-rw-r--r-- 1 pfeijao users 46M Nov 20 14:33 mtb_cgMLST.db

With the database built, call the alleles for a sample:


In [9]:
mentalist call -o sample.call -s sample --db mtb_cgMLST.db --kt 10 --output_votes --output_special ../sample.fastq.gz

2017-11-20T14:42:31.451 - info: Opening kmer database ... 
2017-11-20T14:42:36.915 - info: Opening fastq file(s) and counting kmers ... 
2017-11-20T14:43:24.928 - info: Voting for alleles ... 
2017-11-20T14:43:26.77 - info: Calling alleles and novel alleles ...
2017-11-20T14:44:21.856 - info: Writing output ...
2017-11-20T14:44:24.359 - info: Done.


In [10]:
ls sample.call*

sample.call               sample.call.novel.fa          sample.call.ties.txt
sample.call.byvote        sample.call.novel.txt         sample.call.votes.txt
sample.call.coverage.txt  sample.call.special_cases.fa


In [11]:
# novel alleles:
head -n4 sample.call.novel.fa

>Rv0024
GTGAATACAGCGAGGTCGAGCTGTTGAGTCGCGCTCATCAACTGTTCGCCGGAGACAGTCGGCGACCGGGGTTGGATGCGGGCACCACACCCTACGGGGATCTGCTGTCTCGGGCTGCCGACCTGAATGTGGGTGCGGGCCAGCGCCGGTATCAACTCGCCGTGGACCACAGCCGGGCGGCCTTGCTGTCTGCTGCGCGAACCGATGCCGCGGCCGGGGCCGTCATCACCGGCGCTCAACGGGATCGGGCATGGGCCCGGCGGTCGACCGGAACCGTTCTCGACGAGGCTCGCTCGGATACCACCGTTACTGCGGTTATGCCGATAGCCCAGCGCGAAGCCATACGCCGTCGTGTGGCGCGGCTGCGCGCGCAACGAGCCCATGTGCTGACGGCGCGACGACGGGCACGACGGCACCTGGCGGCGCTGCGTGCGCTGCGGTACCGGGTGGCGCACGGCCCGGGGGTCGCGCTGGCCAAACTTCGGCTGCCGTCGCCGAGCGGTCGCGCCGGCATCGCGGTCCACGCCGCGCTGTCGCGACTTGGCCGTCCCTATGTCTGGGGCGCAACGGGGCCCAACCAGTTCGACTGTTCCGGTTTGGTCCAGTGGGCCTACGCCCAGGCGGGTGTTCACCTGGATCGCACCACCTATCAACAGATCAACGAGGGGATCCCGGTGCCGCGCTCACAGGTCCGGCCGGGCGATCTGGTCTTCCCGCACCCCGGGCACGTGCAGCTGGCGATCGGCAACAATCTGGTCGTCGAGGCGCCCCATGCGGGCGCGTCGGTTCGGGTCAGCTCGCTGGGCAACAACGTGCAGATTCGGCGACCGCTGAGTGGCAGATAA
>Rv0045c
TCAGCGTGTGTCGAGCACCCCGCGCACGATCTCGATCAGGGCGCGCGGTTGGTCACTTTGCACCGAGTGGCCTGACTTCTCGACGATGTGAACGCCACGGAAATGCGTTGCACGCCTGTGGAGTTCGGCGGTGTCCT
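
Because the sequences are long, a per-record summary can be easier to scan than the raw FASTA. The sketch below prints the name and length of each record; the helper name `fasta_lengths` and the toy input are assumptions made for illustration.

```shell
# Hypothetical helper: print "name<TAB>length" for each FASTA record on stdin.
# Usage: fasta_lengths < sample.call.novel.fa
fasta_lengths() {
  awk '
    /^>/ {
      # a header starts a new record: flush the previous one first
      if (name != "") print name "\t" len
      name = substr($0, 2); len = 0
      next
    }
    { len += length($0) }                      # sequence lines may be wrapped
    END { if (name != "") print name "\t" len }
  '
}

# Toy example with two records, the first wrapped over two lines
printf '>locusA\nACGT\nAC\n>locusB\nGGG\n' | fasta_lengths
```

On the real output, `fasta_lengths < sample.call.novel.fa` would list one line per novel allele.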

In [12]:
# Description of novel alleles:
head sample.call.novel.txt

Loci	MinKmerDepth	Nmut	Desc
Rv0024	20	1	Del of len 1 at pos 6
Rv0045c	27	1	Ins of base G at pos 373
Rv0063	33	1	Ins of base C at pos 1417
Rv0134	32	1	Del of len 1 at pos 386
Rv0165c	31	2	Ins of base C at pos 176, Ins of base G at pos 178
Rv0174	26	1	Subst C->G at pos 632
Rv0217c	30	1	Del of len 1 at pos 777
Rv0266c	21	1	Subst T->C at pos 2164
Rv0322	27	1	Subst C->G at pos 97
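
Since this file is tab-separated, it can be filtered with standard tools. As a minimal sketch, the helper below keeps novel alleles with at least a given minimum k-mer depth and at most a given number of mutations; the name `filter_novel` and the cut-off values are assumptions chosen for illustration, not recommendations.

```shell
# Hypothetical helper: filter the novel-allele table read from stdin.
# Usage: filter_novel MIN_DEPTH MAX_MUT < sample.call.novel.txt
filter_novel() {
  awk -F'\t' -v min_depth="$1" -v max_mut="$2" '
    NR == 1 { next }                                        # skip the header row
    $2 >= min_depth && $3 <= max_mut { print $1 "\t" $4 }   # locus and description
  '
}

# Demo on three rows copied from the table above
{
  printf 'Loci\tMinKmerDepth\tNmut\tDesc\n'
  printf 'Rv0024\t20\t1\tDel of len 1 at pos 6\n'
  printf 'Rv0165c\t31\t2\tIns of base C at pos 176, Ins of base G at pos 178\n'
  printf 'Rv0174\t26\t1\tSubst C->G at pos 632\n'
} | filter_novel 25 1
```

With these demo cut-offs only Rv0174 is printed: Rv0024 falls below the depth cut-off and Rv0165c has two mutations.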


In [13]:
# Description of each call:
head -n15 sample.call.coverage.txt

Locus	Coverage	MinKmerDepth	Call
Rv0014c	1	27	Called allele 5.
Rv0015c	1	27	Called allele 2.
Rv0016c	1	27	Called allele 1.
Rv0017c	1	22	Called allele 1.
Rv0018c	1	28	Called allele 2.
Rv0019c	1	24	Called allele 1.
Rv0021c	1	20	Called allele 1.
Rv0022c	1	34	Called allele 1.
Rv0023	1	23	Called allele 1.
Rv0024	1	20	Novel, 1 mutation from allele 1: Del of len 1 at pos 6
Rv0025	1	33	Called allele 1.
Rv0033	1	42	Called allele 1.
Rv0034	1	31	Called allele 2.
Rv0035	1	21	Called allele 2.
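
For a quick overview of a whole sample, the Call column can be tallied by its leading words. The sketch below is one way to do this, assuming the message prefixes seen in this tutorial ("Called", "Novel", "Not present", "Either novel or not present"); the helper name `summarize_calls` and the category labels are made up here, and any other message is counted as `other`.

```shell
# Hypothetical helper: count call categories in a coverage table on stdin.
# Usage: summarize_calls < sample.call.coverage.txt
summarize_calls() {
  awk -F'\t' '
    NR == 1            { next }                        # skip the header row
    $4 ~ /^Called/      { n["called"]++; next }
    $4 ~ /^Novel/       { n["novel"]++; next }
    $4 ~ /^Not present/ { n["not_present"]++; next }
    $4 ~ /^Either/      { n["novel_or_absent"]++; next }
                        { n["other"]++ }               # unrecognized message
    END { for (c in n) print c "\t" n[c] }
  ' | sort
}

# Demo on a few rows in the same format as the table above
{
  printf 'Locus\tCoverage\tMinKmerDepth\tCall\n'
  printf 'Rv0023\t1\t23\tCalled allele 1.\n'
  printf 'Rv0024\t1\t20\tNovel, 1 mutation from allele 1: Del of len 1 at pos 6\n'
  printf 'Rv0025\t1\t33\tCalled allele 1.\n'
  printf 'Rv0767c\t0.538\t17\tNot present, allele 18 is the best voted but below threshold with 283 missing kmers.\n'
} | summarize_calls
```

The demo reports two called loci, one locus that is not present, and one novel allele.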


In [14]:
# Missing alleles:
grep present sample.call.coverage.txt

Rv0767c	0.538	17	Not present, allele 18 is the best voted but below threshold with 283 missing kmers.
Rv0768	0.168	99	Not present, allele 20 is the best voted but below threshold with 1195 missing kmers.
Rv1269c	0.945	52	Either novel or not present; Allele 1 has 19 missing kmers, and no novel was found.
Rv2017	0.981	35	Either novel or not present; Allele 2 has 19 missing kmers, and no novel was found.
Rv2947c	0.986	109	Either novel or not present; Allele 2 has 21 missing kmers, and no novel was found.
Rv3160c	0.984	23	Either novel or not present; Allele 1 has 10 missing kmers, and no novel was found.
Rv3180c	0.835	27	Either novel or not present; Allele 1 has 67 missing kmers, and no novel was found.
Rv3181c	0.922	19	Either novel or not present; Allele 1 has 33 missing kmers, and no novel was found.
Rv3183	0.753	28	Either novel or not present
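
An alternative to matching on the message text is to select loci by the Coverage column. The sketch below lists loci whose coverage is below 1, lowest first; the helper name `incomplete_loci` is an assumption, and it relies only on the column layout shown in the coverage file.

```shell
# Hypothetical helper: list loci with coverage below 1, sorted by coverage.
# Usage: incomplete_loci < sample.call.coverage.txt
incomplete_loci() {
  awk -F'\t' 'NR > 1 && $2 < 1 { print $1 "\t" $2 }' |
    LC_ALL=C sort -t "$(printf '\t')" -k2,2n   # C locale so "." is the decimal point
}

# Demo on rows in the same format as the coverage file
{
  printf 'Locus\tCoverage\tMinKmerDepth\tCall\n'
  printf 'Rv0014c\t1\t27\tCalled allele 5.\n'
  printf 'Rv0767c\t0.538\t17\tNot present, allele 18 is the best voted but below threshold with 283 missing kmers.\n'
  printf 'Rv0768\t0.168\t99\tNot present, allele 20 is the best voted but below threshold with 1195 missing kmers.\n'
  printf 'Rv1269c\t0.945\t52\tEither novel or not present; Allele 1 has 19 missing kmers, and no novel was found.\n'
} | incomplete_loci
```

In the demo, Rv0768 is listed first because it has the lowest coverage, and Rv0014c is omitted because its coverage is 1.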

In [15]:
# All calls: N for novel allele.
cut -f1-14 sample.call

Sample	Rv0014c	Rv0015c	Rv0016c	Rv0017c	Rv0018c	Rv0019c	Rv0021c	Rv0022c	Rv0023	Rv0024	Rv0025	Rv0033	Rv0034
sample	5	2	1	1	2	1	1	1	1	N	1	1	2
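
Since novel alleles are marked with N in the call file, they can be counted per sample directly. This is a minimal sketch under that assumption; the helper name `count_novel_calls` is hypothetical.

```shell
# Hypothetical helper: for each sample row on stdin, print
# "sample<TAB>number of N calls<TAB>number of loci".
# Usage: count_novel_calls < sample.call
count_novel_calls() {
  awk -F'\t' '
    NR == 1 { next }                              # skip the header row of locus names
    {
      n = 0
      for (i = 2; i <= NF; i++) if ($i == "N") n++
      print $1 "\t" n "\t" (NF - 1)
    }
  '
}

# Demo on three loci taken from the output above
printf 'Sample\tRv0023\tRv0024\tRv0025\nsample\t1\tN\t1\n' | count_novel_calls
```

For the demo row this reports one novel call out of three loci.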
