# MentaLiST quick start

This notebook shows how to use MentaLiST to create MLST scheme databases, either by downloading schemes from public MLST websites or by building them from custom files, and then how to call alleles for NGS samples.

## Help
MentaLiST.jl is the main script, with several commands available. To see a list of commands, run MentaLiST with the -h flag:  

In [1]:
# Help: shows all available commands:
MentaLiST.jl -h

usage: MentaLiST.jl [-h]
                    {call|build_db|list_pubmlst|download_pubmlst|list_cgmlst|download_cgmlst}

commands:
  call              MLST caller, given a sample and a k-mer database.
  build_db          Build a MLST k-mer database, given a list of FASTA
                    files.
  list_pubmlst      List all available MLST schema from
                    www.pubmlst.org.
  download_pubmlst  Dowload a MLST scheme from pubmlst and build a
                    MLST k-mer database.
  list_cgmlst       List all available cgMLST schema from
                    www.cgmlst.org.
  download_cgmlst   Dowload a MLST scheme from cgmlst.org and build a
                    MLST k-mer database.

optional arguments:
  -h, --help        show this help message and exit



To see the help for a particular command, run MentaLiST with the command name and the -h flag:

In [2]:
MentaLiST.jl call -h

usage: MentaLiST.jl call -o O -s S --db DB [-t T] [-q] [-e] [-j J]
                        [-h] files...

positional arguments:
  files       FastQ input files

optional arguments:
  -o O        Output file with MLST call
  -s S        Sample name
  --db DB     Kmer database
  -t T        A read of length L is discarded if it has at less than
              (L - k) * t hits to the same locus in the kmer database,
              where k is the kmer length. 0 <= t <= 1 (type: Float64,
              default: 0.2)
  -q          Quick filter (MentaLiST FAST); if middle kmer of a read
              is not in the kmer DB, the read is discarded. Disabled
              by default.
  -e          Use external kmc kmer counter. Disabled by default.
  -j J        Skip length between consecutive k-mers. Defaults to 1.
              (type: Int64, default: 1)
  -h, --help  show this help message and exit
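
The `-t` threshold in the help text above is easier to judge with concrete numbers. The sketch below only evaluates the formula `(L - k) * t` from the help text; the helper name `min_hits` and the 100 bp read length are assumptions made for illustration and are not part of MentaLiST.

```shell
# Hypothetical helper (not part of MentaLiST): evaluate (L - k) * t, the hit
# count below which a read of length L is discarded according to the help text.
min_hits() {
    read_len=$1; k=$2; t=$3
    awk -v L="$read_len" -v k="$k" -v t="$t" \
        'BEGIN { printf "%.1f\n", (L - k) * t }'
}

# Assumed 100 bp read, k = 31, default t = 0.2
min_hits 100 31 0.2   # prints 13.8
```

Under these assumptions, a 100 bp read with fewer than 13.8 hits to the same locus would be discarded. Raising `-t` raises this cutoff, so the filter becomes stricter.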



The following sections give quick examples of how to use each of MentaLiST's commands. It is a good idea to create a new folder to store the results:

In [3]:
mkdir mentalist_results
cd mentalist_results 

# Installing MLST schema
MentaLiST needs to create a k-mer database file for a given MLST scheme before it can call alleles. A database can be built from a custom scheme based on local FASTA files, or from a public scheme downloaded from pubmlst.org or cgmlst.org.

## pubMLST schema

MentaLiST can search and install MLST schema from pubMLST.org, as shown below.

### List available pubmlst.org schema
The command 'list_pubmlst' lists the schema available on pubMLST. Since there are many, a prefix can be given so that only schema matching this prefix are listed.

In [None]:
MentaLiST.jl list_pubmlst -h

usage: MentaLiST.jl list_pubmlst [-p PREFIX] [-h]

optional arguments:
  -p, --prefix PREFIX  Only list schema that starts with this prefix.
  -h, --help           show this help message and exit



In [None]:
# List campylobacter schema:
MentaLiST.jl list_pubmlst -p Campylobacter

2017-08-02T12:46:31.172 - info: Downloading the MLST database xml file...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed


### Install a pubmlst.org scheme
A scheme can be referenced by species name (exact match) or, more simply, by the ID shown in the output of the 'list_pubmlst' command. To install the 'Campylobacter jejuni' scheme, run the following command:

In [None]:
MentaLiST.jl download_pubmlst -k 31 -o Campy -s 28 --db Campy/mlst_31.db 

In [None]:
# The output folder has all the FASTA files and the profile for the scheme, and also the kmer database file,
# mlst_31.db in this example.
ls Campy

## cgMLST schema
As with the pubMLST schema, MentaLiST can also download and install cgMLST schema from cgmlst.org.

### List available cgMLST schema from cgmlst.org

In [None]:
MentaLiST.jl list_cgmlst

### Download and install a cgMLST scheme from cgmlst.org

In [None]:
MentaLiST.jl download_cgmlst -h

In [None]:
MentaLiST.jl download_cgmlst -o cgmlst/legionella -s 1025099 -k 31 --db cgmlst/legionella/db_31

## Install a custom scheme from FASTA files
It is also possible to install a custom MLST scheme from FASTA files. Each file should be named LOCUS.fa (the extension does not matter; .fasta, .tfa, etc. also work), and each allele in the file should have the identifier LOCUS_N (or, alternatively, LOCUS.N), where N is a unique number for that allele, usually running from 1 to the number of alleles in the file.

For instance, for the Campylobacter scheme that was downloaded in the example above, we have:

In [None]:
# Each file is a different locus:
ls Campy/*.tfa

In [None]:
# For each locus file, a different ID and sequence for each allele:
head -n 20 Campy/glnA.tfa
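
Before building a database from custom files, it can help to confirm that every header follows the LOCUS_N / LOCUS.N convention described above. The following is a minimal sketch: `check_locus_headers` is a hypothetical helper, not a MentaLiST command, and the toy sequences are invented for the demonstration.

```shell
# Hypothetical helper (not part of MentaLiST): print every FASTA header whose ID
# is not LOCUS_N or LOCUS.N, where LOCUS is the file name without its extension.
# Returns a non-zero status if any such header is found.
check_locus_headers() {
    fasta=$1
    locus=$(basename "$fasta")
    locus=${locus%.*}
    awk -v locus="$locus" '
        /^>/ {
            id = substr($1, 2)                      # header ID without ">"
            prefix = substr(id, 1, length(locus))   # expected: the locus name
            rest = substr(id, length(locus) + 1)    # expected: "_N" or ".N"
            if (prefix != locus || rest !~ /^[_.][0-9]+$/) {
                print FILENAME ": " id
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$fasta"
}

# Toy locus file with two made-up alleles
demo_dir=$(mktemp -d)
printf '>glnA_1\nACGT\n>glnA_2\nACGA\n' > "$demo_dir/glnA.tfa"
check_locus_headers "$demo_dir/glnA.tfa" && echo "glnA.tfa: headers OK"
```

To check a whole scheme folder, the helper can be run in a loop over the files, for example `for f in Campy/*.tfa; do check_locus_headers "$f"; done`.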

In [None]:
# Install the Campylobacter jejuni scheme directly from the FASTA files:
MentaLiST.jl build_db -k 25 --db Campy/mlst_25.db -p Campy/campylobacter.txt -f Campy/*.tfa

# Calling MLST alleles for a sample

After a k-mer database has been created, MentaLiST can call alleles for a given sample.

In [None]:
# Help:
MentaLiST.jl call -h

For this example we use a Campylobacter jejuni sample from NCBI SRA that was heavily downsampled to make it smaller. This sample is available in the GitHub repository at https://github.com/WGS-TB/MentaLiST/blob/master/data/SRR5824107_small.fastq.gz. If you don't have a local clone of the repository, you can download the file with the following command:

In [None]:
wget https://github.com/WGS-TB/MentaLiST/raw/master/data/SRR5824107_small.fastq.gz
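
A truncated download only shows up later as a confusing error, so it may be worth checking the file first. The sketch below tests the gzip archive and counts the reads. It assumes a standard four-lines-per-record FASTQ layout, and `count_reads` is a hypothetical helper, not part of MentaLiST.

```shell
# Hypothetical helper (not part of MentaLiST): verify that a gzipped FASTQ file
# is intact, then print its read count, assuming four lines per record.
count_reads() {
    gzip -t "$1" || return 1
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Only run the check if the sample has been downloaded
if [ -f SRR5824107_small.fastq.gz ]; then
    count_reads SRR5824107_small.fastq.gz
fi
```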

Now, run MentaLiST caller on this sample:

In [None]:
MentaLiST.jl call -o campy_call.txt -s SRR5824107 --db Campy/mlst_31.db SRR5824107_small.fastq.gz 

The output consists of three files: one contains the allele calls, and the other two give details on the number of votes per allele and on any ties between the best-scoring alleles.

In [None]:
# results:
ls campy_call.*

In [None]:
# Allele calls and ST are in the campy_call.txt file:
column -ts $'\t' campy_call.txt

In [None]:
# Detailed vote count for each allele:
cat campy_call.txt.votes.txt
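
When many samples are processed, a quick way to flag calls that involved ties can be useful. The sketch below assumes that the ties file is named `<output>.ties.txt`, matching the file names used in this notebook, and that it is empty or absent when there were no ties. The second point is an assumption about the output, not something confirmed here, and `report_ties` is a hypothetical helper.

```shell
# Hypothetical helper (not part of MentaLiST): report whether a call produced
# ties. Assumes <output>.ties.txt is empty or missing when there were none.
report_ties() {
    ties_file="$1.ties.txt"
    if [ -s "$ties_file" ]; then
        echo "$1: ties reported, see $ties_file"
    else
        echo "$1: no ties reported"
    fi
}

report_ties campy_call.txt
```

Note that a missing file is also reported as "no ties", so this check is only meaningful after the call has finished successfully.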

Next, we run the MentaLiST caller on a Legionella sample, also downloaded from NCBI SRA and downsampled. This sample is available at https://github.com/WGS-TB/MentaLiST/blob/master/data/ERR2009175_small.fastq.gz.

In [None]:
wget https://github.com/WGS-TB/MentaLiST/raw/master/data/ERR2009175_small.fastq.gz

In [None]:
## Legionella, small sample:
MentaLiST.jl call -o legionela2.txt -s ERR2009175 --db cgmlst/legionella/db_31 ERR2009175_small.fastq.gz

In [None]:
# Check the first 10 calls:
cut -f1-10 legionela2.txt | column -ts $'\t'  

In [None]:
# votes:
head legionela2.txt.votes.txt

In [None]:
cat legionela2.txt.ties.txt