# Module 7: MLST (Multilocus sequence typing)

## Overview

Multilocus sequence typing (MLST) characterises isolates of bacterial species using the sequences of internal fragments of (usually) seven house-keeping genes. Approximately 450-500 bp internal fragments of each gene are used, as these can be accurately sequenced on both strands using an automated DNA sequencer. For each house-keeping gene, the different sequences present within a bacterial species are assigned as distinct alleles and, for each isolate, the alleles at each of the seven loci define the allelic profile or sequence type (ST). Each isolate of a species is therefore unambiguously characterised by a series of seven integers which correspond to the alleles at the seven house-keeping loci.
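
To make the idea of an allelic profile concrete, here is a minimal Python sketch of how a series of seven allele numbers maps to a sequence type. The locus names are the seven loci of the S. pneumoniae scheme. The allele numbers and ST values in `PROFILES` are invented for illustration and are not taken from PubMLST, and `sequence_type` is a hypothetical helper name.

```python
# Loci of the S. pneumoniae MLST scheme, in a fixed order.
LOCI = ("aroE", "gdh", "gki", "recP", "spi", "xpt", "ddl")

# Hypothetical profile table: allele numbers per locus -> sequence type.
# These numbers are made up for illustration only.
PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 1, 1, 1, 1, 1, 2): 2,
    (3, 1, 4, 1, 5, 9, 2): 3,
}


def sequence_type(alleles):
    """Return the ST for a dict of locus -> allele number.

    Returns None when the combination of alleles is not in the table.
    """
    profile = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(profile)


isolate = dict(zip(LOCI, (3, 1, 4, 1, 5, 9, 2)))
print(sequence_type(isolate))
```

A single different allele at any locus gives a different profile, which is why two isolates share an ST only when all seven alleles match.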

MLST can be performed with SRST2, which works directly from sequencing reads, or with mlst, which works from assembled contigs. We will learn both tools in this module.

### Install condacolab

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

### Install software

In [None]:
# Install mlst
!conda install -y -c conda-forge -c bioconda -c defaults mlst

In [None]:
# Install srst2
!conda install -y bioconda::srst2

In [None]:
# Install any2fasta
!conda install -y -c bioconda any2fasta

In [None]:
# Install blast
!conda install -y bioconda::blast

### Download data

In [None]:
!wget

## Part 1: Multilocus sequence typing using SRST2

[SRST2](https://github.com/katholt/srst2) (Short Read Sequence Typing for Bacterial Pathogens) takes Illumina sequence data, an MLST database and/or a database of gene sequences (e.g. resistance genes, virulence genes) and reports the presence of STs and/or reference genes. It performs fast and accurate detection of genes and alleles directly from WGS short reads. SRST2 can type reads against any sequence database(s) and can calculate combinatorial sequence types defined in MLST-style databases.

*Further reading*: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-014-0090-6

SRST2 requires:

1. Sequence reads (this module uses the paired reads 17150_4#79_1.fastq.gz and 17150_4#79_2.fastq.gz).
2. A FASTA sequence database to match against. For MLST, this is a FASTA file of all allele sequences. To assign STs, you also need a tab-delimited file that defines each ST profile as a combination of alleles. You can retrieve these files automatically from pubmlst.org/data/ using the script provided with SRST2.

### Step 1: Download the FASTA sequence database and the tab-delimited profile file from pubmlst.org/data/ using the command:

In [None]:
# Download the mlst database
!getmlst.py --species "Streptococcus pneumoniae"

> NOTE: The SRST2 scripts require older versions of samtools and Bowtie 2, so you will need to have **samtools-0.1.18** and **bowtie2-2.1.0** installed.

Explore the contents of your working directory using the `ls` command. You should have the following files:

- alleles_fasta
- mlst_data_download_Streptococcus_pneumoniae_None.log
- profiles_csv
- Streptococcus_pneumoniae.fasta
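
If you would rather check for these files programmatically than read the `ls` output, the sketch below reports any that are missing. The file names come from the list above; `missing_files` is a hypothetical helper, not part of SRST2.

```python
from pathlib import Path

# File names listed above as the expected output of getmlst.py.
EXPECTED = [
    "alleles_fasta",
    "mlst_data_download_Streptococcus_pneumoniae_None.log",
    "profiles_csv",
    "Streptococcus_pneumoniae.fasta",
]


def missing_files(directory=".", expected=EXPECTED):
    """Return the expected file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in expected if not (base / name).exists()]


missing = missing_files(".")
print("All files present" if not missing else f"Missing: {missing}")
```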

### Step 2: To execute SRST2 on a single strain (17150_4#79), we will use the command:

In [None]:
# Run srst2
!srst2 --input_pe 17150_4#79_1.fastq.gz 17150_4#79_2.fastq.gz --output 17150_4#79_test --log --mlst_db Streptococcus_pneumoniae.fasta --mlst_definitions profiles_csv --mlst_delimiter _

An explanation of this command is as follows:

`srst2` is the tool

`--input_pe` specifies that the input files are paired-end reads, here 17150_4#79_1.fastq.gz and 17150_4#79_2.fastq.gz

`--output` specifies the prefix for the output files, here 17150_4#79_test

`--log` writes the log to a file instead of printing it to the screen

`--mlst_db` specifies the FASTA file of allele sequences, here Streptococcus_pneumoniae.fasta

`--mlst_definitions` specifies the file of ST profiles, here profiles_csv

`--mlst_delimiter` specifies the character(s) separating the gene name from the allele number in the MLST database (default "-", as in arcc-1); here it is "_"
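
The role of the delimiter can be illustrated with a small Python sketch that splits an allele name into its gene and allele number. This is an illustration of the naming convention only: `split_allele` is a hypothetical helper, and SRST2's own parsing may differ in detail.

```python
def split_allele(header, delimiter="_"):
    """Split an allele name such as 'aroE_1' into (gene, allele_number).

    Splits on the last occurrence of the delimiter, so gene names that
    themselves contain the delimiter are kept whole.
    """
    gene, _, number = header.rpartition(delimiter)
    if not gene or not number.isdigit():
        raise ValueError(f"cannot split {header!r} on {delimiter!r}")
    return gene, int(number)


print(split_allele("aroE_1"))                 # ('aroE', 1)
print(split_allele("arcc-1", delimiter="-"))  # ('arcc', 1)
```

If the delimiter passed to the function does not match the one used in the database, the split fails, which mirrors why `--mlst_delimiter _` has to be set for this database.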

Run the command `ls -lh` to check the contents of the folder.

In [None]:
# Show the output
!ls -lh

The listing includes the files that SRST2 created.

The MLST results are written to `17150_4#79_test__mlst__Streptococcus_pneumoniae__results.txt`.

Use `cat "17150_4#79_test__mlst__Streptococcus_pneumoniae__results.txt"` to view the contents of this file.

Sample 17150_4#79 is assigned sequence type (ST) 30.
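
As an alternative to reading the file by eye, the sketch below pulls the ST for each sample out of a results table. It assumes the file is tab-delimited with a header row containing `Sample` and `ST` columns; check this against your own output before relying on it. The in-memory example shows only those two assumed columns, and `read_sequence_types` is a hypothetical helper.

```python
import csv
import io


def read_sequence_types(handle):
    """Return {sample: ST} from a tab-delimited table with 'Sample' and 'ST' columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Sample"]: row["ST"] for row in reader}


# Stand-in for the results file; only the two assumed columns are shown.
example = io.StringIO("Sample\tST\n17150_4#79\t30\n")
print(read_sequence_types(example))

# With the real file:
# with open("17150_4#79_test__mlst__Streptococcus_pneumoniae__results.txt") as fh:
#     print(read_sequence_types(fh))
```

The ST is kept as a string rather than converted to an integer, so any non-numeric value in that column is preserved as written.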

### Step 3: To execute SRST2 on multiple strains, run the command:

`--input_pe *.fastq.gz` passes every compressed FASTQ file in the directory as paired-end input. By default, SRST2 pairs the files by sample name using the `_1` and `_2` suffixes.

In [None]:
# Run srst2 with all fastq files
!srst2 --input_pe *.fastq.gz --output s.pneumo --log --mlst_db Streptococcus_pneumoniae.fasta --mlst_definitions profiles_csv --mlst_delimiter _
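
Before running many samples at once, it can help to confirm that every sample has both read files. The sketch below groups file names by sample, assuming the `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` naming used in this module. `pair_reads` is a hypothetical helper and `sampleX` is a made-up name used to show an incomplete pair.

```python
import re
from collections import defaultdict

# Assumes reads are named <sample>_1.fastq.gz and <sample>_2.fastq.gz.
READ_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.fastq\.gz$")


def pair_reads(filenames):
    """Group file names by sample.

    Returns (pairs, incomplete): a dict of sample -> (forward, reverse)
    for complete pairs, and a sorted list of samples missing one mate.
    """
    mates = defaultdict(dict)
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match:
            mates[match["sample"]][match["mate"]] = name
    pairs = {s: (m["1"], m["2"]) for s, m in mates.items() if len(m) == 2}
    incomplete = sorted(s for s, m in mates.items() if len(m) != 2)
    return pairs, incomplete


files = ["17150_4#79_1.fastq.gz", "17150_4#79_2.fastq.gz", "sampleX_1.fastq.gz"]
pairs, incomplete = pair_reads(files)
print(pairs)
print(incomplete)
```

In the notebook, `glob.glob("*.fastq.gz")` could supply the file list in place of the hard-coded `files`.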

___

## Part 2: Multilocus sequence typing using mlst

[mlst](https://github.com/tseemann/mlst) scans contig files against traditional PubMLST typing schemes.

### Step 1: To execute mlst on a single strain (21127_1#30), we will use the command:

In [None]:
# Run mlst
!mlst --legacy --scheme spneumoniae 21127_1#30_output_contigs.fasta

An explanation of this command is as follows:

`mlst` is the tool

`--legacy` uses the older output format, which includes a header row with the allele names (requires `--scheme`)

`--scheme spneumoniae` specifies the PubMLST scheme for the species. You can list the available schemes using the command `mlst --longlist`

`21127_1#30_output_contigs.fasta` is the input file

The result is printed to the terminal.

### Step 2: To execute mlst on multiple strains, we will use the command:



In [None]:
# Run mlst for all fasta files
!mlst --legacy --scheme spneumoniae *.fasta > mlst.csv


An explanation of this command is as follows:

`mlst` is the tool

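`--legacy` uses the older output format, which includes a header row with the allele names (requires `--scheme`)

`--scheme spneumoniae` specifies the PubMLST scheme for the species. You can list the available schemes using the command `mlst --longlist`

`*.fasta` matches all FASTA files in the directory as input

`> mlst.csv` redirects the output to the file mlst.csv (mlst writes tab-separated text by default, despite the .csv extension)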
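
Once many isolates have been typed, a common next step is to count how many fall into each ST. The sketch below does this for a tab-separated table with a header row, assuming `FILE`, `SCHEME` and `ST` columns; confirm the header of your own `mlst.csv` first. The rows in the example are invented for illustration, and `count_sequence_types` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_sequence_types(handle):
    """Count isolates per ST in a tab-separated table with an 'ST' column."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["ST"] for row in reader)


# Invented rows standing in for mlst.csv; allele columns are omitted.
example = io.StringIO(
    "FILE\tSCHEME\tST\n"
    "isolate_a.fasta\tspneumoniae\t30\n"
    "isolate_b.fasta\tspneumoniae\t30\n"
    "isolate_c.fasta\tspneumoniae\t5\n"
)
counts = count_sequence_types(example)
for st, n in counts.most_common():
    print(st, n)
```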

## BONUS!

If you are working with Bash on your own computer or on an HPC cluster and have many files to process, loops let you run the same command on every sample without typing it each time.

Here is one way to do it.

Create a new bash script using nano named `srst2.sh`

In [None]:
#!/bin/bash

# Loop over the forward reads and derive each sample's read pair from the file name
for i in *_1.trimmed.fastq.gz; do
    NAME=$(basename "$i" _1.trimmed.fastq.gz)
    echo "$NAME"
    j="${NAME}_1.trimmed.fastq.gz"
    echo "$j"
    k="${NAME}_2.trimmed.fastq.gz"
    echo "$k"
    srst2 --input_pe "$j" "$k" --output srst2output --log --mlst_db Streptococcus_pneumoniae.fasta --mlst_defin>
done
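
The same per-sample loop can be driven from Python, which is convenient inside a notebook. The sketch below only builds and prints the command for one sample. It reuses the flags from Part 1, Step 2, and, as an assumption that differs from the script above, uses the sample name as the `--output` prefix so that each sample gets its own result files. `srst2_command` is a hypothetical helper.

```python
import shlex


def srst2_command(sample,
                  suffix_1="_1.trimmed.fastq.gz",
                  suffix_2="_2.trimmed.fastq.gz"):
    """Build the srst2 argument list for one sample (nothing is executed)."""
    return [
        "srst2",
        "--input_pe", sample + suffix_1, sample + suffix_2,
        "--output", sample,  # assumption: one output prefix per sample
        "--log",
        "--mlst_db", "Streptococcus_pneumoniae.fasta",
        "--mlst_definitions", "profiles_csv",
        "--mlst_delimiter", "_",
    ]


cmd = srst2_command("17150_4#79")
print(shlex.join(cmd))

# To actually run it:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with characters such as `#` in the file names.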

Create a new bash script using nano named `mlst.sh`

In [None]:
## S. pneumoniae
mlst --legacy --scheme spneumoniae *contigs.fasta > mlst_results.csv

## S. agalactiae (run only the line for your species: both write to mlst_results.csv)
mlst --legacy --scheme sagalactiae *contigs.fasta > mlst_results.csv