# Module 9: MLST (Multilocus sequence typing)

## Overview

Multilocus sequence typing (MLST) characterises isolates of bacterial species using the sequences of internal fragments of (usually) seven house-keeping genes. Approximately 450-500 bp internal fragments of each gene are used, as these can be accurately sequenced on both strands using an automated DNA sequencer. For each house-keeping gene, the different sequences present within a bacterial species are assigned as distinct alleles and, for each isolate, the alleles at each of the seven loci define the allelic profile or sequence type (ST). Each isolate of a species is therefore unambiguously characterised by a series of seven integers which correspond to the alleles at the seven house-keeping loci.
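The mapping from sequences to allele numbers, and from an allelic profile to an ST, can be sketched in a few lines of Python. This is only a toy illustration: it uses three hypothetical loci instead of seven, and every sequence, allele number and ST below is invented.

```python
# Toy illustration of how an allelic profile maps to a sequence type (ST).
# Assumption: three made-up loci and invented sequences/numbers; real schemes use ~7 loci.

LOCI = ("locusA", "locusB", "locusC")

# Known alleles per locus: sequence -> allele number (hypothetical data).
ALLELES = {
    "locusA": {"ATGCA": 1, "ATGCT": 2},
    "locusB": {"GGCTA": 1, "GGCTT": 2},
    "locusC": {"TTACG": 1, "TTACC": 2},
}

# Known profiles: tuple of allele numbers (in LOCI order) -> ST (hypothetical data).
PROFILES = {
    (1, 1, 1): 1,
    (2, 1, 2): 2,
}


def allelic_profile(isolate_seqs):
    """Return the allele number at each locus (None where the sequence is unknown)."""
    return tuple(ALLELES[locus].get(isolate_seqs[locus]) for locus in LOCI)


def sequence_type(profile):
    """Look up the ST for a profile; None if this combination is not in the table."""
    return PROFILES.get(profile)


isolate = {"locusA": "ATGCT", "locusB": "GGCTA", "locusC": "TTACC"}
profile = allelic_profile(isolate)
print(profile, sequence_type(profile))  # (2, 1, 2) 2
```

A profile whose combination of alleles is not in the table has no ST yet, which is why `sequence_type` returns `None` in that case.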

MLST can be performed using SRST2 or mlst. We will learn both tools in this module.

### Install condacolab

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

### Install software

> **Note**: In this module, we will use srst2 to determine sequence types (STs). This tool does not work properly with Python 3, so we will create a Python 2.7 environment named `py2_env` and run the tool inside it by prefixing commands with `!conda run -n py2_env`.

In [None]:
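# Create a new Python 2.7 environment with the name py2_env
# (--yes answers the confirmation prompt, which cannot be answered interactively in a notebook cell)
!conda create -n py2_env python=2.7 --yes
#!apt-get install python2
# Confirm that the environment uses Python 2.7
!conda run -n py2_env python --version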

In [None]:
# Install srst2
!conda run -n py2_env conda install -c bioconda srst2 --yes

In [None]:
# Check if srst2 is installed
!conda run -n py2_env srst2 --help

In [None]:
# Install mlst
!conda install -c conda-forge -c bioconda -c defaults mlst --yes

In [None]:
# Install any2fasta
!conda install -c bioconda any2fasta --yes

In [None]:
# Install blast
!conda install bioconda::blast --yes

### Download data

In [None]:
!wget https://zenodo.org/records/13750987/files/Module_9.tar.gz
# Extract the archive (assumed to unpack into a Module_9 folder)
!tar -xzf Module_9.tar.gz

## Part 1: Multilocus sequence typing using SRST2

[SRST2](https://github.com/katholt/srst2) (Short Read Sequence Typing for Bacterial Pathogens) takes Illumina sequence data, an MLST database and/or a database of gene sequences (e.g. resistance genes, virulence genes) and reports the presence of STs and/or reference genes. It performs fast and accurate detection of genes and alleles directly from WGS short reads. SRST2 can type reads against any sequence database(s) and can calculate combinatorial sequence types defined in MLST-style databases.

*Further reading*: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-014-0090-6

SRST2 requires: 


1. Sequence reads

We will analyze the paired FASTQ files ERR1795461_1.fastq.gz and ERR1795461_2.fastq.gz, which correspond to the run accession ERR1795461 from the project [PRJEB3084](https://www.ebi.ac.uk/ena/browser/view/PRJEB3084).

Some important data about the sample:

- Country of origin: Brazil
- Organism: *Streptococcus pneumoniae*
- Instrument Platform: ILLUMINA
- Instrument Model: Illumina MiSeq
- Read Count: 3627822
- Base Count: 453477750
- Center Name: Wellcome Sanger Institute; SC
- Library Layout: PAIRED
- Library Strategy: WGS

2. A FASTA sequence database to match against. For MLST, this is a FASTA file of all allele sequences. To assign STs, you also need a tab-delimited file that defines each ST profile as a combination of alleles.

### Download the FASTA sequence database and the tab-delimited profile file from [PubMLST](https://pubmlst.org/organisms/streptococcus-pneumoniae) using the command:

In [None]:
# Download the mlst database
!conda run -n py2_env getmlst.py --species 'Streptococcus pneumoniae'

Explore the contents of your working directory using the `ls` command. You should have the following files:

- alleles_fasta
- mlst_data_download_Streptococcus_pneumoniae_None.log
- profiles_csv
- Streptococcus_pneumoniae.fasta
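To see how the tab-delimited profile file relates alleles to STs, the following Python sketch reads such a table into a dictionary. It assumes the layout is one header row (`ST` followed by the seven locus names) and one ST per row. The locus names are those of the *S. pneumoniae* scheme, but the allele and ST numbers in the example text are invented, and `load_profiles` is a hypothetical helper, not part of SRST2.

```python
import csv
import io

# Invented excerpt in the layout assumed for profiles_csv (tab-delimited).
EXAMPLE = (
    "ST\taroE\tgdh\tgki\trecP\tspi\txpt\tddl\n"
    "1\t1\t1\t1\t1\t1\t1\t1\n"
    "2\t1\t2\t1\t3\t1\t1\t2\n"
)


def load_profiles(handle, n_loci=7):
    """Return (locus names, {allelic profile: ST}) from a tab-delimited profile table.

    Assumption: column 1 is the ST and the next n_loci columns are the loci;
    any further columns are ignored.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    loci = header[1:1 + n_loci]
    profiles = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        profiles[tuple(row[1:1 + n_loci])] = row[0]
    return loci, profiles


loci, profiles = load_profiles(io.StringIO(EXAMPLE))
print(loci)
print(profiles[("1", "2", "1", "3", "1", "1", "2")])  # 2
```

To try it on the downloaded file, pass an open file handle instead of the example text, for instance `with open("profiles_csv") as fh: loci, profiles = load_profiles(fh)`.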

### Execute SRST2 on a single strain:

In [None]:
# Run srst2
!conda run -n py2_env srst2 --input_pe ERR1795461_1.fastq.gz ERR1795461_2.fastq.gz --output ERR1795461_output --log --mlst_db Streptococcus_pneumoniae.fasta --mlst_definitions profiles_csv --mlst_delimiter _

An explanation of this command is as follows:

`srst2` is the tool

`--input_pe` specifies that the input files are paired-end reads, here ERR1795461_1.fastq.gz and ERR1795461_2.fastq.gz

`--output` specifies the prefix for the output files, here ERR1795461_output

`--log` writes the log to a file instead of standard error

`--mlst_db` specifies the FASTA file of MLST alleles, here Streptococcus_pneumoniae.fasta

`--mlst_definitions` specifies the file of ST definitions, here profiles_csv

`--mlst_delimiter` specifies the character(s) separating the gene name from the allele number in the MLST database (default "-", as in arcc-1); here we use "_"
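The role of the delimiter can be illustrated with a short Python sketch. It assumes allele identifiers such as `arcc-1` or `aroE_1`, and it splits at the last occurrence of the delimiter, which is a choice made for this example. `split_allele` is a hypothetical helper and not SRST2's own code.

```python
def split_allele(name, delimiter="-"):
    """Split an allele identifier into (gene name, allele number).

    Assumption: the allele number follows the last occurrence of the delimiter.
    """
    gene, allele = name.rsplit(delimiter, 1)
    return gene, int(allele)


print(split_allele("arcc-1"))       # ('arcc', 1)
print(split_allele("aroE_1", "_"))  # ('aroE', 1)
```

If the wrong delimiter is given, the identifier cannot be separated into a gene and an allele number, which is why the option has to match the naming used in the allele FASTA file.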

### Check the results

Move to the `results` folder

In [None]:
%cd /content/Module_9/srst2/results/

Run the command `ls -lh` to check the contents of the folder.

In [None]:
# Show the output
!ls -lh

The output file from the MLST run is `ERR1795461__mlst__Streptococcus_pneumoniae__results.txt`.

Use `cat ERR1795461__mlst__Streptococcus_pneumoniae__results.txt` to view the contents of this file.
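As an alternative to `cat`, the results table can be loaded in Python. The sketch below assumes the file is tab-delimited with a header row. The column names (`Sample`, `ST`, and the locus columns) and the values in the demonstration table are assumptions for illustration, not real results, and `read_results` is a hypothetical helper.

```python
import csv
import os
import tempfile


def read_results(path):
    """Read a tab-delimited results table into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Demonstration on an invented table written to a temporary file.
demo = "Sample\tST\taroE\tgdh\nsampleA\t10\t1\t2\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(demo)

rows = read_results(tmp.name)
os.remove(tmp.name)

for row in rows:
    print(row["Sample"], row["ST"])  # sampleA 10
```

With the real output, the call would be `read_results("ERR1795461__mlst__Streptococcus_pneumoniae__results.txt")`.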

### To execute SRST2 on multiple strains, run the command:

>**Note**: In this module, we will not run the analysis on multiple strains because of the limited resources in Colab. However, here is an example of how to do it.

In [None]:
# Do not execute
# Run srst2 with multiple fastq files
# !conda run -n py2_env srst2 --input_pe *.fastq.gz --output s.pneumo --log --mlst_db Streptococcus_pneumoniae.fasta --mlst_definitions profiles_csv --mlst_delimiter _

`--input_pe *.fastq.gz`: specifies that the input is every compressed paired-end fastq.gz file in the current directory.

___

## Part 2: Multilocus sequence typing using mlst

[mlst](https://github.com/tseemann/mlst) scans contig files against traditional PubMLST typing schemes.

### To execute mlst on a single strain, we will use the command:

In [None]:
# Run mlst
!mlst --legacy --scheme spneumoniae /spades_assembly/contigs.fasta

An explanation of this command is as follows:

`mlst` is the tool

`--legacy` use old legacy output with allele header row (requires --scheme)

`--scheme spneumoniae` specifies the species (pubmlst scheme).

`contigs.fasta` input file

### To execute mlst on multiple strains, we will use the command:

>**Note**: In this module, we will not run the analysis on multiple strains because of the limited resources in Colab. However, here is an example of how to do it.

In [None]:
# Do not execute
# Run mlst for all fasta files
#!mlst --legacy --scheme spneumoniae *.fasta > mlst.csv


An explanation of this command is as follows:

`mlst` is the tool

`--legacy` use old legacy output with allele header row (requires --scheme)

`--scheme spneumoniae` specifies the species (pubmlst scheme).

`*.fasta` input files

`> mlst.csv` redirects the output to the file mlst.csv (mlst writes tab-separated text unless the `--csv` option is given)
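Once many isolates have been typed, a common next step is to count how often each ST occurs. The Python sketch below does this for tab-separated output with a header row. It assumes the header contains an `ST` column, as the `--legacy` header row is expected to, and the demonstration rows are invented. `count_sts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_sts(handle):
    """Count how many rows (isolates) carry each value in the ST column."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["ST"] for row in reader)


# Invented example; "-" stands for an isolate with no assigned ST.
demo = (
    "FILE\tSCHEME\tST\n"
    "isolate1.fasta\tspneumoniae\t10\n"
    "isolate2.fasta\tspneumoniae\t10\n"
    "isolate3.fasta\tspneumoniae\t-\n"
)
counts = count_sts(io.StringIO(demo))
print(counts.most_common())  # [('10', 2), ('-', 1)]
```

For a real run, open the redirected output file and pass the handle to `count_sts`.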

## BONUS!

If you are working with Bash on your own computer or on an HPC cluster and have many files, loops let you run the same command over a large dataset.

Here is one way to do it.

Create a new bash script named `srst2.sh` using nano:

In [None]:
#!/bin/bash

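# Loop over the forward-read files; each sample also needs its matching _2 file
for i in *_1.trimmed.fastq.gz; do
    NAME=$(basename "$i" _1.trimmed.fastq.gz)
    echo "$NAME"
    j="${NAME}_1.trimmed.fastq.gz"
    echo "$j"
    k="${NAME}_2.trimmed.fastq.gz"
    echo "$k"
    # Use the sample name as the output prefix so results are not overwritten
    srst2 --input_pe "$j" "$k" --output "$NAME" --log --mlst_db Streptococcus_pneumoniae.fasta --mlst_defin>
done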
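The same pairing logic can be sketched in Python, which makes it easy to check which samples have both read files before anything is run. The example only builds and prints the commands. The flags are the ones used for the single-strain run earlier in this module, the file names are invented, and `pair_reads` and `srst2_command` are hypothetical helpers.

```python
def pair_reads(filenames,
               fwd_suffix="_1.trimmed.fastq.gz",
               rev_suffix="_2.trimmed.fastq.gz"):
    """Return {sample name: (forward file, reverse file)} for complete pairs only."""
    names = set(filenames)
    pairs = {}
    for fwd in sorted(names):
        if fwd.endswith(fwd_suffix):
            sample = fwd[: -len(fwd_suffix)]
            rev = sample + rev_suffix
            if rev in names:  # skip samples whose mate file is missing
                pairs[sample] = (fwd, rev)
    return pairs


def srst2_command(sample, fwd, rev):
    """Build the srst2 argument list, using the sample name as the output prefix."""
    return [
        "srst2", "--input_pe", fwd, rev,
        "--output", sample, "--log",
        "--mlst_db", "Streptococcus_pneumoniae.fasta",
        "--mlst_definitions", "profiles_csv",
        "--mlst_delimiter", "_",
    ]


# Invented file names: sample B has no reverse-read file and is skipped.
files = ["A_1.trimmed.fastq.gz", "A_2.trimmed.fastq.gz", "B_1.trimmed.fastq.gz"]
for sample, (fwd, rev) in pair_reads(files).items():
    print(" ".join(srst2_command(sample, fwd, rev)))
```

Each argument list could then be passed to `subprocess.run` to execute the command, one sample at a time.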

Create a new bash script using nano named `mlst.sh`

In [None]:
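#!/bin/bash

## S. pneumoniae
mlst --legacy --scheme spneumoniae *contigs.fasta > mlst_results_spneumoniae.csv

## S. agalactiae (separate output file so the first result is not overwritten)
mlst --legacy --scheme sagalactiae *contigs.fasta > mlst_results_sagalactiae.csv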

*Adapted from:*

- Advanced Bioinformatics Course developed for the GPS and JUNO projects - Wellcome Sanger Institute

*Modified by Luisa Sacristán (Universidad de los Andes-CABANA)*