# MentaLiST quick start

This notebook shows how to run MentaLiST to create MLST scheme databases, either downloaded from public MLST websites or built from custom files, and then how to call alleles for NGS samples.

## Help
MentaLiST.jl is the main script and provides several commands. To list them, run MentaLiST with the -h flag:

In [1]:
# Help: shows all available commands:
MentaLiST.jl -h

usage: MentaLiST.jl [-h]
                    {call|build_db|list_pubmlst|download_pubmlst|list_cgmlst|download_cgmlst}

commands:
  call              MLST caller, given a sample and a k-mer database.
  build_db          Build a MLST k-mer database, given a list of FASTA
                    files.
  list_pubmlst      List all available MLST schema from
                    www.pubmlst.org.
  download_pubmlst  Dowload a MLST scheme from pubmlst and build a
                    MLST k-mer database.
  list_cgmlst       List all available cgMLST schema from
                    www.cgmlst.org.
  download_cgmlst   Dowload a MLST scheme from cgmlst.org and build a
                    MLST k-mer database.

optional arguments:
  -h, --help        show this help message and exit



To see the help for a particular command, run MentaLiST with the command name and the -h flag:

In [2]:
MentaLiST.jl call -h

usage: MentaLiST.jl call -o O -s S --db DB [-t T] [-q] [-e] [-j J]
                        [-h] files...

positional arguments:
  files       FastQ input files

optional arguments:
  -o O        Output file with MLST call
  -s S        Sample name
  --db DB     Kmer database
  -t T        A read of length L is discarded if it has at less than
              (L - k) * t hits to the same locus in the kmer database,
              where k is the kmer length. 0 <= t <= 1 (type: Float64,
              default: 0.2)
  -q          Quick filter (MentaLiST FAST); if middle kmer of a read
              is not in the kmer DB, the read is discarded. Disabled
              by default.
  -e          Use external kmc kmer counter. Disabled by default.
  -j J        Skip length between consecutive k-mers. Defaults to 1.
              (type: Int64, default: 1)
  -h, --help  show this help message and exit



The following sections give quick examples of how to use each MentaLiST command. It is a good idea to create a new folder to store the results:

In [3]:
mkdir mentalist_results
cd mentalist_results 

# Installing MLST schema
MentaLiST needs to build a k-mer database file for a given MLST scheme before it can call alleles. The scheme can be a custom one based on local FASTA files, or a public one downloaded from pubmlst.org or cgmlst.org.

## pubMLST schema

MentaLiST can search for and install MLST schema from pubMLST.org, as shown below.

### List available pubmlst.org schema
The command 'list_pubmlst' lists the schema available on pubMLST. Since there are many, it also accepts a prefix, so that only schema whose names start with this prefix are listed.

In [4]:
MentaLiST.jl list_pubmlst -h

usage: MentaLiST.jl list_pubmlst [-p PREFIX] [-h]

optional arguments:
  -p, --prefix PREFIX  Only list schema that starts with this prefix.
  -h, --help           show this help message and exit



In [5]:
# List Campylobacter schema:
MentaLiST.jl list_pubmlst -p Campylobacter

2017-08-02T15:50:45.58 - info: Downloading the MLST database xml file...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  110k  100  110k    0     0  12022      0  0:00:09  0:00:09 --:--:-- 12022
Campylobacter concisus/curvus  ID:23
Campylobacter fetus            ID:24
Campylobacter helveticus       ID:25
Campylobacter hyointestinalis  ID:26
Campylobacter insulaenigrae    ID:27
Campylobacter jejuni           ID:28
Campylobacter lanienae         ID:29
Campylobacter lari             ID:30
Campylobacter sputorum         ID:31
Campylobacter upsaliensis      ID:32
10 schema found.
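
Because the listing prints one scheme per line with its ID at the end, the ID for a given species can be pulled out with a small filter. The following is only a sketch: `scheme_id` is a hypothetical helper, not a MentaLiST command, and it assumes the `NAME ... ID:N` line layout shown above (the demo lines are copied from that output).

```shell
# Hypothetical helper (not a MentaLiST command): print the ID of the scheme
# whose species name matches $1 exactly. The listing is read from stdin.
scheme_id() {
    awk -v name="$1" '
        {
            pos = index($0, "ID:")
            if (pos == 0) next                  # skip lines without an ID
            species = substr($0, 1, pos - 1)
            sub(/[ -]+$/, "", species)          # trim padding and the "-" printed by list_cgmlst
            id = substr($0, pos + 3)
            sub(/[ \t]+$/, "", id)              # trim trailing whitespace
            if (species == name) print id
        }
    '
}

# Demo on two lines copied from the listing above:
printf '%s\n' \
    'Campylobacter jejuni           ID:28' \
    'Campylobacter lari             ID:30' |
    scheme_id 'Campylobacter jejuni'
```

With MentaLiST installed, the same filter could be applied as `MentaLiST.jl list_pubmlst -p Campylobacter | scheme_id 'Campylobacter jejuni'`, assuming the listing is written to standard output.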


### Install a pubmlst.org scheme
A scheme can be referenced by species name (exact match) or, more simply, by the ID given in the output of the 'list_pubmlst' command. To install the 'Campylobacter jejuni' scheme, run the following command:

In [6]:
MentaLiST.jl download_pubmlst -k 31 -o Campy -s 28 --db Campy/mlst_31.db 

2017-08-02T15:50:59.684 - info: Searching for the scheme ... 
2017-08-02T15:50:59.882 - info: Downloading scheme for Campylobacter jejuni ... 
2017-08-02T15:50:59.883 - info: Downloading profile ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  298k  100  298k    0     0  35655      0  0:00:08  0:00:08 --:--:-- 35655
2017-08-02T15:51:08.575 - info: Downloading locus aspA ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  211k  100  211k    0     0  17411      0  0:00:12  0:00:12 --:--:-- 17412
2017-08-02T15:51:21.033 - info: Downloading locus glnA ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  297k  100  297k    0     0  31496      0  0:00

In [7]:
# The output folder has all the FASTA files and the profile for the scheme, as well as the kmer database file,
# mlst_31.db in this example.
ls Campy

aspA.tfa           glnA.tfa  glyA.tfa    mlst_31.db.profile  tkt.tfa
campylobacter.txt  gltA.tfa  mlst_31.db  pgm.tfa             uncA.tfa


## cgMLST schema
As with the pubMLST schema, MentaLiST can also download and install cgMLST schema from cgmlst.org.

### List available cgMLST schema from cgmlst.org

In [8]:
MentaLiST.jl list_cgmlst

2017-08-02T15:52:37.262 - info: Downloading the cgmlist HTML to find schema...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4221    0  4221    0     0    551      0 --:--:--  0:00:07 --:--:--  1015
Clostridioides difficile       - ID:3560802
Enterococcus faecium           - ID:991893
Francisella tularensis         - ID:260204
Legionella pneumophila         - ID:1025099
Listeria monocytogenes         - ID:690488
Mycobacterium tuberculosis     - ID:741110
Staphylococcus aureus          - ID:141106
7 schema found.


### Download and install a cgMLST scheme from cgmlst.org

In [9]:
MentaLiST.jl download_cgmlst -h

usage: MentaLiST.jl download_cgmlst -o OUTPUT -s SCHEME -k K --db DB
                        [-h]

optional arguments:
  -o, --output OUTPUT  Output folder for the schema files.
  -s, --scheme SCHEME  Species name or ID of the scheme
  -k K                 K-mer size (type: Int8)
  --db DB              Output file for the kmer database.
  -h, --help           show this help message and exit



In [10]:
MentaLiST.jl download_cgmlst -o cgmlst/legionella -s 1025099 -k 31 --db cgmlst/legionella/db_31

2017-08-02T15:52:54.016 - info: Downloading cgMLST scheme ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2340k    0 2340k    0     0   291k      0 --:--:--  0:00:08 --:--:--  480k
2017-08-02T15:53:02.33 - info: Unzipping cgMLST scheme into individual FASTA files for each loci ...
........
2017-08-02T15:53:10.07 - info: 1521 loci found.
2017-08-02T15:53:10.07 - info: Building the k-mer database ...
2017-08-02T15:53:13.733 - info: Opening FASTA files ... 
2017-08-02T15:53:36.065 - info: Combining results for each locus ...
2017-08-02T15:54:16.393 - info: Saving DB ...
2017-08-02T15:54:20.905 - info: Done!


## Install a custom scheme from FASTA files
It is also possible to install a custom MLST scheme from FASTA files, one file per locus. Each file should be named LOCUS.fa (the extension does not matter; .fasta, .tfa, etc. also work), and each allele in the file should have the identifier LOCUS_N (or, alternatively, LOCUS.N), where N is a number unique to that allele. Alleles are usually numbered consecutively from 1.

For instance, the Campylobacter scheme downloaded in the example above contains:

In [11]:
# Each file is a different locus:
ls Campy/*.tfa

Campy/aspA.tfa  Campy/gltA.tfa  Campy/pgm.tfa  Campy/uncA.tfa
Campy/glnA.tfa  Campy/glyA.tfa  Campy/tkt.tfa


In [12]:
# For each locus file, a different ID and sequence for each allele:
head -n 20 Campy/glnA.tfa

>glnA_1
GATCCTTTTACGGCTGATCCTACTATCATAGTATTTTGTGATGTGTATGATATTTACAAA
GGACAAATGTATGAAAAATGTCCAAGAAGCATAGCAAAAAAAGCAATAGAACACCTTAAA
AATAGTGGCATAGCTGATACTGCTTACTTTGGACCAGAAAATGAATTCTTTGTTTTTGAT
AGTGTAAAAATAGTTGATACTACTCATTGTTCTAAGTATGAAGTTGATACCGAAGAAGGA
GAGTGGAATGATGATAGAGAATTTACCGATAGCTACAATACTGGACACAGGCCAAGAAAC
AAAGGTGGATATTTTCCAGTTCAGCCAATTGATTCTTTAGTAGATATTCGTTCTGAAATG
GTTCAAACCCTTGAAAAAGTAGGTCTTAAAACTTTTGTTCATCATCATGAAGTTGCACAA
GGACAAGCTGAAATAGGAGTAAATTTTGGCACGCTTGTAGAAGCAGCTGACAATGTT
>glnA_2
GATCCTTTTACGGCTGATCCTACTATCATAGTATTTTGTGATGTGTATGATATTTACAAA
GGACAAATGTATGAAAAATGTCCAAGAAGCATAGCAAAAAAAGCAATGGAACACCTTAAA
AATAGTGGCATAGCTGATACTGCTTACTTTGGACCAGAAAATGAATTCTTTGTTTTTGAT
AGTGTAAAAATAGTTGATACTACTCATTGTTCTAAGTATGAAGTTGATACCGAAGAAGGA
GAGTGGAATGATGATAGAGAATTTACCGATAGCTACAATACTGGACACAGGCCAAGAAAC
AAAGGTGGATATTTTCCAGTTCAGCCAATTGATTCTTTAGTAGATATTCGTTCTGAAATG
GTTCAAACCCTTGAAAAAGTAGGTCTTAAAACTTTTGTTCATCATCATGAAGTTGCACAA
GGACAAGCTGAAATAGGAGTAAATTTTGGCACGCTTGTAGAAGCAGCTGACAATGTT
>glnA_3
GATCCT
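
Before building a database from custom files, it can help to confirm that every header follows the naming rule described above. The sketch below uses a hypothetical helper, `check_locus_headers` (not part of MentaLiST), that prints any header that is not of the form LOCUS_N or LOCUS.N. The locus name is passed explicitly as the first argument rather than derived from the file name.

```shell
# Hypothetical helper (not a MentaLiST command): print every FASTA header that
# is not of the form LOCUS_N or LOCUS.N, where N consists of digits only.
# Usage: check_locus_headers LOCUS [FILE...]   (reads stdin if no file is given)
check_locus_headers() {
    locus=$1
    shift
    awk -v locus="$locus" '
        /^>/ {
            id = substr($1, 2)                        # header without the leading ">"
            prefix = substr(id, 1, length(locus))
            sep = substr(id, length(locus) + 1, 1)
            num = substr(id, length(locus) + 2)
            if (prefix != locus || (sep != "_" && sep != ".") || num !~ /^[0-9]+$/)
                print id
        }
    ' "$@"
}

# Demo: the second header uses "-" instead of "_" or ".", so it is reported.
printf '>glnA_1\nACGT\n>glnA-2\nACGT\n>glnA.3\nACGT\n' | check_locus_headers glnA
```

For the scheme downloaded above, a call such as `check_locus_headers glnA Campy/glnA.tfa` should print nothing if every header is well formed.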

In [13]:
# Install the Campylobacter jejuni scheme directly from the FASTA files:
MentaLiST.jl build_db -k 25 --db Campy/mlst_25.db -p Campy/campylobacter.txt -f Campy/*.tfa

2017-08-02T15:54:29.65 - info: Opening FASTA files ... 
2017-08-02T15:54:31.13 - info: Combining results for each locus ...
2017-08-02T15:54:31.763 - info: Saving DB ...
2017-08-02T15:54:33.771 - info: Done!


# Calling MLST alleles for a sample

After a k-mer database has been created, MentaLiST can call alleles for a given sample.

In [14]:
# Help:
MentaLiST.jl call -h

usage: MentaLiST.jl call -o O -s S --db DB [-t T] [-q] [-e] [-j J]
                        [-h] files...

positional arguments:
  files       FastQ input files

optional arguments:
  -o O        Output file with MLST call
  -s S        Sample name
  --db DB     Kmer database
  -t T        A read of length L is discarded if it has at less than
              (L - k) * t hits to the same locus in the kmer database,
              where k is the kmer length. 0 <= t <= 1 (type: Float64,
              default: 0.2)
  -q          Quick filter (MentaLiST FAST); if middle kmer of a read
              is not in the kmer DB, the read is discarded. Disabled
              by default.
  -e          Use external kmc kmer counter. Disabled by default.
  -j J        Skip length between consecutive k-mers. Defaults to 1.
              (type: Int64, default: 1)
  -h, --help  show this help message and exit
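
To get a feel for the -t threshold, the cutoff (L - k) * t from the help text can be evaluated for a concrete read. The sketch below is plain arithmetic, not MentaLiST code; `min_hits` is a hypothetical helper name and the 150 bp read length is only an illustrative value.

```shell
# Hypothetical helper: evaluate the cutoff (L - k) * t from the help text.
# $1 = read length L, $2 = k-mer length k, $3 = threshold t
min_hits() {
    awk -v L="$1" -v k="$2" -v t="$3" 'BEGIN { printf "%.1f\n", (L - k) * t }'
}

min_hits 150 31 0.2   # 150 bp read, k = 31, default t -> prints 23.8
min_hits 150 31 0.5   # stricter threshold             -> prints 59.5
```

With these values and the default threshold, a read with fewer than 23.8 hits to the same locus (that is, 23 or fewer) would be discarded.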



This example uses a Campylobacter jejuni sample from NCBI SRA that was heavily downsampled to reduce its size. The sample is available in the GitHub repository at https://github.com/WGS-TB/MentaLiST/blob/master/data/SRR5824107_small.fastq.gz. If you do not have a local clone of the repository, you can download the file with the following command:

In [15]:
wget https://github.com/WGS-TB/MentaLiST/raw/master/data/SRR5824107_small.fastq.gz

--2017-08-02 15:54:38--  https://github.com/WGS-TB/MentaLiST/raw/master/data/SRR5824107_small.fastq.gz
Resolving github.com (github.com)... 192.30.253.112, 192.30.253.113
Connecting to github.com (github.com)|192.30.253.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/WGS-TB/MentaLiST/master/data/SRR5824107_small.fastq.gz [following]
--2017-08-02 15:54:39--  https://raw.githubusercontent.com/WGS-TB/MentaLiST/master/data/SRR5824107_small.fastq.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.52.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.52.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2992916 (2.9M) [application/octet-stream]
Saving to: ‘SRR5824107_small.fastq.gz’


2017-08-02 15:54:39 (20.1 MB/s) - ‘SRR5824107_small.fastq.gz’ saved [2992916/2992916]



Now, run the MentaLiST caller on this sample:

In [16]:
MentaLiST.jl call -o campy_call.txt -s SRR5824107 --db Campy/mlst_31.db SRR5824107_small.fastq.gz 

2017-08-02T15:54:46.904 - info: Opening kmer database ... 
2017-08-02T15:54:50.733 - info: Opening fastq file(s) ... 
2017-08-02T15:54:52.316 - info: Writing output ...
2017-08-02T15:54:52.997 - info: Done.


The output consists of three files: one contains the calls, and the other two give details on the number of votes per allele and on any ties between the best alleles.

In [17]:
# results:
ls campy_call.*

campy_call.txt  campy_call.txt.ties.txt  campy_call.txt.votes.txt


In [18]:
# Allele calls and ST are in the campy_call.txt file:
column -ts $'\t' campy_call.txt

Sample      aspA  glnA  gltA  glyA  pgm  tkt  uncA  ST   clonal_complex
SRR5824107  2     17    2     3     2    1    5     883  ST-21 complex
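
For schemes with many loci, the wide table with one row per sample is easier to inspect in long form. As a sketch, the hypothetical `calls_long` helper below (not part of MentaLiST) prints one line per column with the sample name, the column name and the value. It assumes the tab-separated layout shown above, with a header line and the sample name in the first column.

```shell
# Hypothetical helper (not a MentaLiST command): turn the wide call table into
# "sample <TAB> column <TAB> value" lines. Reads the given files, or stdin.
calls_long() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }   # remember column names
        { for (i = 2; i <= NF; i++) printf "%s\t%s\t%s\n", $1, header[i], $i }
    ' "$@"
}

# Demo on a shortened copy of the table above:
printf 'Sample\taspA\tglnA\tST\nSRR5824107\t2\t17\t883\n' | calls_long
```

On the real output, this would be run as `calls_long campy_call.txt`.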


In [19]:
# Detailed vote count for each allele:
cat campy_call.txt.votes.txt

Locus	Allele(votes),...
aspA	2(1105), 43(1095), 308(1045), 150(1031), 31(1010), 398(965), 36(951), 149(919), 197(919), 144(919)
glnA	17(842), 520(806), 607(800), 234(788), 526(788), 549(779), 289(727), 475(719), 254(719), 76(718)
gltA	2(782), 307(761), 149(752), 89(746), 16(746), 250(743), 156(743), 206(714), 255(698), 267(698)
glyA	3(1646), 9(1638), 10(1637), 121(1629), 389(1624), 506(1604), 658(1601), 449(1578), 73(1574), 280(1572)
pgm	2(1318), 258(1307), 865(1304), 20(1298), 815(1298), 447(1297), 291(1294), 642(1287), 497(1287), 38(1287)
tkt	1(2116), 298(2086), 474(2066), 343(2061), 255(2056), 454(2053), 312(2053), 90(2049), 62(2045), 617(2043)
uncA	5(627), 25(621), 291(619), 246(612), 103(611), 195(610), 63(607), 282(607), 429(599), 225(596)
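
The votes file makes it possible to see how close the runner-up allele is to the best one. The sketch below assumes the layout shown above: a header line, then a locus name, a tab, and a comma-separated list of `allele(votes)` entries sorted by decreasing votes. `vote_margins` is a hypothetical helper, not part of MentaLiST.

```shell
# Hypothetical helper (not a MentaLiST command): for each locus, print the top
# allele, its votes, and the vote difference to the second entry ("NA" if none).
vote_margins() {
    awk -F '\t' '
        $1 == "Locus" || NF < 2 { next }          # skip the header and blank lines
        {
            n = split($2, entry, ", ")
            split(entry[1], best, /[()]/)         # "2(1105)" -> best[1]=2, best[2]=1105
            margin = "NA"
            if (n > 1) {
                split(entry[2], second, /[()]/)
                margin = best[2] - second[2]
            }
            printf "%s\t%s\t%s\t%s\n", $1, best[1], best[2], margin
        }
    ' "$@"
}

# Demo on two shortened lines from the output above:
printf 'Locus\tAllele(votes),...\naspA\t2(1105), 43(1095)\nuncA\t5(627), 25(621)\n' | vote_margins
```

On the real output, this would be run as `vote_margins campy_call.txt.votes.txt`. For the two demo lines the margins are 10 and 6 votes.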


Next, we run the MentaLiST caller on a Legionella sample, also downloaded from NCBI SRA and downsampled. This sample is available at https://github.com/WGS-TB/MentaLiST/blob/master/data/ERR2009175_small.fastq.gz

In [20]:
wget https://github.com/WGS-TB/MentaLiST/raw/master/data/ERR2009175_small.fastq.gz

--2017-08-02 15:54:53--  https://github.com/WGS-TB/MentaLiST/raw/master/data/ERR2009175_small.fastq.gz
Resolving github.com (github.com)... 192.30.253.113, 192.30.253.112
Connecting to github.com (github.com)|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/WGS-TB/MentaLiST/master/data/ERR2009175_small.fastq.gz [following]
--2017-08-02 15:54:54--  https://raw.githubusercontent.com/WGS-TB/MentaLiST/master/data/ERR2009175_small.fastq.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.52.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.52.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26835758 (26M) [application/octet-stream]
Saving to: ‘ERR2009175_small.fastq.gz’


2017-08-02 15:54:54 (41.8 MB/s) - ‘ERR2009175_small.fastq.gz’ saved [26835758/26835758]



In [25]:
# Call alleles for the sample:
MentaLiST.jl call -o legionela.txt -s ERR2009175 --db cgmlst/legionella/db_31 ERR2009175_small.fastq.gz 

2017-08-02T16:05:19.326 - info: Opening kmer database ... 
2017-08-02T16:05:42.477 - info: Opening fastq file(s) ... 
2017-08-02T16:06:15.855 - info: Writing output ...
2017-08-02T16:06:16.784 - info: Done.


In [26]:
# Quick check of the first 10 columns (sample name plus the first 9 loci):
cut -f1-10 legionela.txt | column -ts $'\t'  

Sample      lpg0004  lpg0006  lpg0007  lpg0008  lpg0009  lpg0010  lpg0011  lpg0012  lpg0014
ERR2009175  4        4        4        4        1        4        4        4        4


In [27]:
# votes:
head legionela.txt.votes.txt

Locus	Allele(votes),...
lpg0004	4(5455), 10(1058), 1(978), 13(962), 22(917), 7(862), 9(852), 33(731), 23(685), 31(427)
lpg0006	4(10963), 1(4251), 8(4243), 22(3774), 31(3305), 6(3165), 21(2998), 12(2840), 32(2313), 7(1362)
lpg0007	4(2267), 20(519), 14(383), 1(353), 19(245), 3(220), 9(199), 2(113), 8(113), 5(113)
lpg0008	4(3817), 35(3662), 1(2433), 7(2274), 33(960), 15(906), 34(860), 31(802), 2(799), 13(746)
lpg0009	1(87), 2(30), 5(-20), 6(-24), 3(-33), 7(-305), 4(-386)
lpg0010	4(3960), 1(1567), 8(1418), 27(1405), 14(1352), 6(1165), 18(1025), 15(937), 22(665), 26(586)
lpg0011	4(368), 6(244), 12(244), 7(232), 14(182), 17(155), 1(96), 16(62), 2(37), 5(13)
lpg0012	4(6589), 7(1913), 2(1678), 9(1659), 17(1632), 6(1629), 16(1605), 24(1468), 21(1332), 27(1276)
lpg0014	4(1875), 21(1021), 27(575), 1(421), 20(368), 8(347), 9(291), 2(237), 5(197), 3(193)


Since this downsampled sample has very low coverage, some loci have several alleles with the same number of votes, as listed in the ties file:

In [29]:
# ties:
cat legionela.txt.ties.txt

lpg0073	2, 1
lpg0350	4, 2, 3, 1
lpg0441	6, 1
lpg0859	4, 2, 3, 5, 1
lpg0878	7, 3
lpg0892	22, 3
lpg1293	9, 19, 4
lpg1458	6, 3, 5, 15, 1
lpg1768	3, 20
lpg1943	2, 11, 7, 9, 10, 8, 6, 4, 3, 5, 1
lpg2005	21, 19, 3, 20
lpg2280	7, 4
lpg2323	9, 4
lpg2517	8, 6, 4, 12
lpg2724	9, 4, 13
lpg2825	4, 2, 3, 5, 6, 1
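
To see at a glance how many loci are affected, the ties file can be summarized. The sketch assumes each line holds a locus name, a tab, and a comma-separated list of tied alleles, as in the output above; `summarize_ties` is a hypothetical helper, not part of MentaLiST.

```shell
# Hypothetical helper (not a MentaLiST command): print the number of tied
# alleles per locus, followed by the total number of loci with a tie.
summarize_ties() {
    awk -F '\t' '
        NF >= 2 {
            n = split($2, alleles, ", ")          # alleles listed as tied for this locus
            printf "%s\t%d\n", $1, n
            total++
        }
        END { printf "tied_loci\t%d\n", total }
    ' "$@"
}

# Demo on the first two lines of the output above:
printf 'lpg0073\t2, 1\nlpg0350\t4, 2, 3, 1\n' | summarize_ties
```

On the real output, this would be run as `summarize_ties legionela.txt.ties.txt`.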
