# Module 8: Serotyping of *Streptococcus pneumoniae* and *Streptococcus agalactiae* 

## Overview

This module will be developed in two parts:

1. Part 1: Predicting serotypes of *S. pneumoniae*

2. Part 2: Predicting serotypes of *S. agalactiae*

### Install condacolab

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

### Install software

> **Note**: In this module, we will use SeroBA for the prediction of *S. pneumoniae* serotypes. However, the original repository for this tool is no longer supported. Therefore, we will use the [Bentley-group](https://github.com/sanger-bentley-group/seroba) repository, where the use of the tool with Docker is recommended.


### What is Docker?

Docker is an open platform for developing, shipping, and running applications. It provides the ability to package and run an application in a loosely isolated environment called a container. Containers are lightweight and contain everything needed to run the application, so you do not need to rely on what is currently installed on the host; i.e., it involves bundling an application together with all of the necessary configuration files, libraries, and dependencies to ensure the software can run in a reproducible fashion across a diversity of computing environments. You can easily share containers while you work, and be sure that everyone you share with gets the same container that works in the same way.

![docker](images/docker.png)

*Taken from: https://docs.docker.com/get-started/docker-overview/*

If you want to delve deeper into how Docker and its containers work, you can visit: https://docs.docker.com/get-started/introduction/

In [None]:
# Install udocker (each line is run as a shell command through the leading "!")
!pip install udocker
!udocker --allow-root install

In [None]:
# Get SeroBA
!udocker --allow-root pull sangerbentleygroup/seroba

In [None]:
# Check if SeroBA is installed
!udocker --allow-root run sangerbentleygroup/seroba seroba version

> **Note**: In this module, we will use srst2 for the prediction of *S. agalactiae* serotypes. However, this tool does not work properly with Python 3. Therefore, we will create a conda environment with Python 2.7 (`py2_env`) and run the tool inside it with `!conda run -n py2_env`.

In [None]:
# Create a Python 2.7 environment with conda. The environment is called py2_env
!conda create -n py2_env python=2.7 --yes
#!apt-get install python2
!conda run -n py2_env python --version

In [None]:
# Install srst2
!conda run -n py2_env conda install -c bioconda srst2 --yes

In [None]:
# Check if srst2 is installed
!conda run -n py2_env srst2 --help

### Download data

In [None]:
!wget https://zenodo.org/records/13750987/files/Module_8.tar.gz

### Extract the .tar.gz file 

In [None]:
!tar xvf Module_8.tar.gz

## Part 1: Predicting *S. pneumoniae* serotypes

To date, there are >100 known serotypes described for *S. pneumoniae* based on differing biochemical and antigenic properties of the capsule. There are a number of in-silico methods to detect the cps locus, which can then be used to predict serotypes from WGS data. Accurate identification of pneumococcal serotypes is important for tracking the distribution and evolution of serotypes following the introduction of effective vaccines.

[SeroBA](https://github.com/sanger-pathogens/seroba?tab=readme-ov-file#installation) predicts serotypes directly from raw whole-genome sequencing reads by identifying the cps locus with a k-mer based method, using a database adapted from PneumoCaT. It makes efficient use of computational resources and detects the cps locus accurately even at low coverage: it is reported to reach 98% concordance, to process 10,000 samples in just over one day on a standard server, and to call serotypes at a coverage as low as 10x. SeroBA is implemented in Python 3 and is freely available under the open-source GPLv3 license.

*Further reading*: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6113868/

Explore the usage of SeroBA by running ``seroba --help``

In [None]:
# Run seroBA
!udocker --allow-root run sangerbentleygroup/seroba seroba --help

SeroBA requires only three inputs: 

1. A database built with KMC (a utility designed for counting k-mers) and ARIBA (Antimicrobial Resistance Identification By Assembly).

For this module, we will use the database `seroba_k71_14082017` located in the module folder.

If you want to create the database yourself on your computer, you can follow the steps provided [here](https://github.com/sanger-pathogens/seroba?tab=readme-ov-file#setting-up-the-database).

2. Forward and reverse read files in FASTQ format

We will use the paired files ERR331173_1.fastq.gz and ERR331173_2.fastq.gz, which correspond to run accession ERR331173 from the [PRJEB3084](https://www.ebi.ac.uk/ena/browser/view/PRJEB3084) project.

Some important data about the sample:

- Country of origin: Peru
- Organism: *Streptococcus pneumoniae*
- Instrument Platform: ILLUMINA
- Instrument Model: Illumina MiSeq
- Read Count: 1203646
- Base Count: 240729200
- Center Name: Wellcome Sanger Institute; SC
- Library Layout: PAIRED
- Library strategy: WGS


3. Output prefix
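Because SeroBA is reported to call serotypes at a coverage as low as 10x, it can be useful to estimate the sequencing depth of a read set before running it. The following is a minimal sketch that uses the read and base counts listed above. The genome size of about 2.1 Mb is an assumption for a typical *S. pneumoniae* genome, not a value taken from the sample record, and the function name `estimate_coverage` is only illustrative.

```python
# Rough sequencing-depth estimate from the run metadata listed above.
# ASSUMPTION: a genome size of ~2.1 Mb, used here as a typical value for
# S. pneumoniae; replace it with the size of your own reference.
def estimate_coverage(base_count, genome_size_bp=2_100_000):
    """Return the approximate fold coverage (total bases / genome size)."""
    return base_count / genome_size_bp

read_count = 1203646    # "Read Count" from the sample description
base_count = 240729200  # "Base Count" from the sample description

bases_per_read_entry = base_count / read_count
coverage = estimate_coverage(base_count)

print(f"Bases per listed read entry: {bases_per_read_entry:.0f}")
print(f"Estimated coverage: {coverage:.0f}x")
```

With these numbers the estimate is well above the 10x threshold mentioned earlier, so low coverage should not be a concern for this sample.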

### To predict the serotype of a single strain, we will use the command:

In [None]:
# Run seroBA
!udocker --allow-root run -v /content/Module_8/seroba_k71_14082017:/seroba_k71_14082017 -v /content/Module_8:/fastq_files sangerbentleygroup/seroba seroba runSerotyping /seroba_k71_14082017 /fastq_files/ERR1795461_1.fastq.gz /fastq_files/ERR1795461_2.fastq.gz /fastq_files/output_test

An explanation of this command is as follows:

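`udocker --allow-root run` starts a container. udocker runs Docker images in user space, and `--allow-root` allows it to be executed by the root user, which is the user in Colab

`-v /content/Module_8/seroba_k71_14082017:/seroba_k71_14082017` and `-v /content/Module_8:/fastq_files` mount the database folder and the module folder inside the container, in the form `host_path:container_path`

`sangerbentleygroup/seroba` is the Docker image (`sangerbentleygroup` is the repository and `seroba` is the image name)

`seroba` is the tool/software

`runSerotyping` is the SeroBA subcommand that performs the serotyping

`/seroba_k71_14082017` is the path to the SeroBA database directory inside the container

The two arguments ending in `_1.fastq.gz` and `_2.fastq.gz` are the forward and reverse FASTQ files

The last argument (`/fastq_files/output_test` in the command above) is the output prefix

In the output folder, you will find a file named **pred.tsv** that contains the predicted serotype.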
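Once the run has finished, the prediction can be read back into Python. The sketch below assumes that `pred.tsv` is a tab-separated file without a header, with the sample name in the first column and the serotype in the second; this layout is an assumption, so check your own file before relying on it. The function name `read_predictions` and the demo file contents are made up for illustration and are not real results.

```python
import csv
import tempfile
from pathlib import Path

def read_predictions(tsv_path):
    """Return {sample: serotype} from a tab-separated file.

    ASSUMPTION: column 1 is the sample name and column 2 the serotype.
    """
    predictions = {}
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2:  # skip empty or malformed lines
                predictions[row[0]] = row[1]
    return predictions

# Demo on a made-up file (placeholder values, not a real SeroBA result).
demo_dir = Path(tempfile.mkdtemp())
demo_file = demo_dir / "pred.tsv"
demo_file.write_text("demo_sample\t19A\n")

demo_predictions = read_predictions(demo_file)
print(demo_predictions)
```

In the notebook, you would call `read_predictions` on the `pred.tsv` file inside your own output folder instead of the demo file.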

### To predict the serotype of multiple strains

>**Note**: In this module, we will not run the serotyping of multiple strains because of the limited resources in Colab. However, here is an example of how to do it.

### First, we will create a folder for each pair of compressed FASTQ files and the strain identification name using the command:

In [None]:
# Do not execute
# Move files
# for x in *1.fastq.gz; do mkdir ${x%%_1.fastq.gz} ; mv $x ${x%%_1.fastq.gz}; mv ${x%%1.fastq.gz}2.fastq.gz ${x%%_1.fastq.gz}; done

An explanation of this command is as follows:

`for x in *1.fastq.gz;` This starts a for loop that iterates over all files in the current directory that end with "1.fastq.gz".

`do` This starts the code block that will be executed for each file.

`mkdir ${x%%_1.fastq.gz}` This creates a new directory with the same name as the file, but with the "_1.fastq.gz" suffix removed.

`mv $x ${x%%_1.fastq.gz}` This moves the original file into the new directory created in the previous step.

`mv ${x%%1.fastq.gz}2.fastq.gz ${x%%_1.fastq.gz}` This moves a second file that has the same prefix as the first file, but with a "2.fastq.gz" suffix, into the new directory created in the first step.

`done` This ends the for loop.

Overall, this script is designed to take paired-end sequencing data that is stored in two separate files with names that end in "_1.fastq.gz" and "_2.fastq.gz", and organize it into directories based on the prefix of the file names. 

*Further reading on loop commands*: https://www.gnu.org/software/bash/manual/bash.html#Looping-Constructs
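The same reorganisation can be sketched in Python, which some readers may find easier to follow than shell parameter expansion. This is only an illustrative equivalent of the loop explained above, not part of the original workflow; the function name `organise_pairs` and the demo file names are hypothetical. Unlike the shell loop, it skips a sample when the reverse file is missing.

```python
import shutil
import tempfile
from pathlib import Path

def organise_pairs(directory):
    """Move each *_1.fastq.gz / *_2.fastq.gz pair into a folder named after the sample."""
    directory = Path(directory)
    moved = []
    for forward in sorted(directory.glob("*_1.fastq.gz")):
        sample = forward.name[: -len("_1.fastq.gz")]
        reverse = directory / f"{sample}_2.fastq.gz"
        if not reverse.exists():
            print(f"Skipping {sample}: no reverse file found")
            continue
        sample_dir = directory / sample
        sample_dir.mkdir(exist_ok=True)
        shutil.move(str(forward), str(sample_dir / forward.name))
        shutil.move(str(reverse), str(sample_dir / reverse.name))
        moved.append(sample)
    return moved

# Demo on empty placeholder files in a temporary directory.
demo = Path(tempfile.mkdtemp())
for name in ("sampleA_1.fastq.gz", "sampleA_2.fastq.gz", "sampleB_1.fastq.gz"):
    (demo / name).touch()

organised = organise_pairs(demo)
print(organised)
```

In the demo, `sampleA` is moved into its own folder, while `sampleB` is left in place because it has no reverse file.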

### We will then run SeroBA using the command:

In [None]:
# Do not execute
# Run seroBA for all samples
#for x in * ; do seroba runSerotyping seroba_k71_14082017 $x/${x}_1.fastq.gz $x/${x}_2.fastq.gz $x"_output"; done

An explanation of this command is as follows:

`for x in *;` This starts a for loop that iterates over every entry in the current directory, here the sample folders created in the previous step.

`do` This starts the code block that will be executed for each folder.

`seroba runSerotyping seroba_k71_14082017` This runs SeroBA's `runSerotyping` subcommand with the `seroba_k71_14082017` database.

`$x/${x}_1.fastq.gz $x/${x}_2.fastq.gz` These are the input files for the command, located in the folder `$x` and named after it with "_1.fastq.gz" or "_2.fastq.gz" appended.

`$x"_output"` This is the output prefix for the command, also named after the folder (with "_output" appended to the name).

`;` This separates the seroba command from the end of the for loop.

`done` This ends the for loop.
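To make the structure of each call explicit, the sketch below builds the argument list that the loop passes to SeroBA for every sample folder, without running anything. It mirrors the command explained above; the helper name `seroba_command` and the sample names are hypothetical.

```python
import shlex

def seroba_command(sample, database="seroba_k71_14082017"):
    """Build the runSerotyping argument list for one sample folder."""
    return [
        "seroba", "runSerotyping", database,
        f"{sample}/{sample}_1.fastq.gz",  # forward reads
        f"{sample}/{sample}_2.fastq.gz",  # reverse reads
        f"{sample}_output",               # output prefix
    ]

# Hypothetical sample folder names, for illustration only.
commands = [seroba_command(sample) for sample in ["sampleA", "sampleB"]]
for command in commands:
    print(shlex.join(command))
```

Each list could be passed to `subprocess.run` on a machine where SeroBA is installed; printing the commands first is a simple way to check the paths before launching a long run.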

### We will then compile the results from the runs above using the command:

This command will combine the seroba outputs in one tsv file. 

In [None]:
# Do not execute
# Summary of seroBA results
#!seroba summary ./

___

## Part 2: Predicting *S. agalactiae* serotypes 

*Streptococcus agalactiae* (Group B Streptococcus, or GBS) is currently divided into ten serotypes based on type-specific capsular antigens, designated Ia, Ib, II, III, IV, V, VI, VII, VIII, and IX.

The Group B Streptococcus Serotyping by Genome Sequencing (GBS-SBG) repository contains a curated reference file that can be used to serotype *Streptococcus agalactiae* in silico from whole genome sequencing data. The reference file (GBS-SBG.fasta) is designed to be usable for both short-read mapping and assembly-based strategies.

*Further reading*: https://github.com/swainechen/GBS-SBG 

The program [SRST2](https://github.com/katholt/srst2) (Short Read Sequence Typing for Bacterial Pathogens) is designed to take Illumina sequence data, an MLST database, and/or a gene sequence database (e.g., resistance genes, virulence genes, etc.) and report the presence of STs and/or reference genes. This program performs rapid and accurate detection of genes and alleles from short-read WGS sequencing. SRST2 can type reads using any sequence database and can calculate combinatorial sequence types defined in MLST-type databases.

*Further reading*: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-014-0090-6

SRST2 requires:

1. Paired reads:

In this section, we will analyze data from the study [Near-term pregnant women in the Dominican Republic experience high rates of Group B Streptococcus rectovaginal colonization with virulent strains](https://doi.org/10.1371/journal.pgph.0002281) conducted by Laycock KM, Acosta F, Valera S, Villegas A, Mejia E, Mateo C, et al. (2023). We will use the paired FASTQ files SRR23874884_1.fastq.gz and SRR23874884_2.fastq.gz, which correspond to run accession SRR23874884 from the [PRJNA945321](https://www.ebi.ac.uk/ena/browser/view/PRJNA945321) project.

Some important data about the sample:

- Country of origin: Dominican Republic
- Organism: *Streptococcus agalactiae*
- Instrument Platform: ILLUMINA
- Instrument Model: Illumina NovaSeq 6000
- Read Count: 1916182
- Base Count: 453475786869647750
- Center Name: Wellcome Sanger Institute; SC
- Library Layout: PAIRED
- Library strategy: WGS

2. A database of reference sequences in FASTA format:

GBS-SBG.fasta

### To predict the serotype of a single strain, we will use the command: 

In [None]:
# First, we need to move to the directory where the files are located
%cd /content/Module_8/s_agalactiae/

In [None]:
# Run srst2
!conda run -n py2_env srst2 --input_pe SRR23874884_1.fastq.gz SRR23874884_2.fastq.gz --output SRR23874884 --log --gene_db GBS-SBG.fasta

An explanation of this command is as follows:

`conda` is the command-line tool for managing environments and packages in Conda, which is a package and environment management system.

`run` This Conda subcommand is used to execute commands within a specific Conda environment without needing to activate the environment first.

`-n py2_env`: The `-n` option specifies the name of the Conda environment in which the commands should be executed. In this case, `py2_env` is the name of the environment we have created.

`srst2` is the tool

`--input_pe` specifies that the input files are paired-end reads, here SRR23874884_1.fastq.gz and SRR23874884_2.fastq.gz

`--output` specifies the output prefix, SRR23874884

`--log` switches on logging to a file, rather than to standard output

`--gene_db` specifies the database GBS-SBG.fasta

Run the command `ls -lh` to check the contents in the folder.

In [None]:
# List files
!ls -lh

The output file from the above run is `SRR23874884__genes__GBS-SBG__results.txt`


Use `cat SRR23874884__genes__GBS-SBG__results.txt` to view the contents of this file.

In [None]:
# Show results
!cat SRR23874884__genes__GBS-SBG__results.txt
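The results file name combines the output prefix and the database name. The small sketch below reconstructs that name so the file can be located from Python. The pattern is inferred only from the single file name shown above, so treat it as an assumption rather than a documented rule; the helper name `expected_results_name` is hypothetical.

```python
from pathlib import Path

def expected_results_name(output_prefix, gene_db):
    """Guess the gene results file name from the output prefix and database file.

    ASSUMPTION: the pattern <prefix>__genes__<database name>__results.txt,
    inferred from the file produced in this module.
    """
    db_name = Path(gene_db).stem  # "GBS-SBG.fasta" -> "GBS-SBG"
    return f"{output_prefix}__genes__{db_name}__results.txt"

results_name = expected_results_name("SRR23874884", "GBS-SBG.fasta")
print(results_name)
print("Found in current folder:", Path(results_name).exists())
```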

### To execute SRST2 on multiple strains, run the command:

>**Note**: In this module, we will not run the serotyping of multiple strains because of the limited resources in Colab. However, here is an example of how to do it.

In [None]:
#Do not execute
# Run srst2 for all samples
#!conda run -n py2_env srst2 --input_pe *.fastq.gz --output s.agalactiae --log --gene_db GBS-SBG.fasta

`--input_pe *.fastq.gz` specifies that the input is every compressed FASTQ file (`.fastq.gz`) in the current folder, covering multiple samples.
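Before launching a run over many files, it can help to confirm that every sample has both a forward and a reverse file. The sketch below groups file names by sample, assuming the `_1.fastq.gz` / `_2.fastq.gz` naming used in this module. The function name `group_pairs` and the example file names are hypothetical.

```python
from collections import defaultdict

def group_pairs(filenames, forward="_1.fastq.gz", reverse="_2.fastq.gz"):
    """Group FASTQ file names by sample and report samples missing a mate."""
    samples = defaultdict(dict)
    for name in filenames:
        if name.endswith(forward):
            samples[name[: -len(forward)]]["forward"] = name
        elif name.endswith(reverse):
            samples[name[: -len(reverse)]]["reverse"] = name
    complete = {s: mates for s, mates in samples.items() if len(mates) == 2}
    incomplete = sorted(s for s, mates in samples.items() if len(mates) < 2)
    return complete, incomplete

# Hypothetical file names, for illustration only.
files = ["s1_1.fastq.gz", "s1_2.fastq.gz", "s2_1.fastq.gz", "notes.txt"]
complete, incomplete = group_pairs(files)
print("Complete pairs:", sorted(complete))
print("Missing a mate:", incomplete)
```

In practice the list of names could come from `os.listdir()` in the folder that holds the reads.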

## BONUS!

If you are working with Bash on your own computer or on an HPC cluster and you have many files, loops are a convenient way to process large datasets.

Here's a way to do it. 

Create a new bash script using nano named `serotype.sh`

In [None]:
#!/bin/bash
# File name: serotype.sh
# This script serotypes S. pneumoniae from trimmed fastq.gz files

# Run a container as the current user, with the current directory mounted at /data
function docker_run() { docker run --rm=True -u $(id -u):$(id -g) -v $(pwd):/data "$@" ;}
workdir=/path/to/your/directory/
cd "$workdir" || exit 1

# Loop over the forward read files; the glob avoids parsing the output of ls
for i in *_1.trimmed.fastq.gz; do

NAME=$(basename "$i" _1.trimmed.fastq.gz)
echo "$NAME"
j="${NAME}_1.trimmed.fastq.gz"
echo "$j"
k="${NAME}_2.trimmed.fastq.gz"
echo "$k"

docker_run staphb/seroba seroba runSerotyping seroba_k71_14082017 "./$j" "./$k" "${NAME}_serotype_output"

done

Move all the outputs to a new folder (`mkdir serotype_results && mv *_serotype_output serotype_results`), then compile the results by running `seroba summary ./` inside that folder. You should obtain a TSV file.

Create a new bash script using nano named `serotype_2.sh`

In [None]:
#!/bin/bash
# File name: serotype_2.sh
# This script serotypes multiple S. agalactiae read sets

# Run a container as the current user, with the current directory mounted at /data
function docker_run() { docker run --rm=True -u $(id -u):$(id -g) -v $(pwd):/data "$@" ;}
workdir=/path/to/your/directory/
cd "$workdir" || exit 1
mkdir -p serotyping_output

# Loop over the forward read files; the glob avoids parsing the output of ls
for i in *_1.fastq.gz; do

NAME=$(basename "$i" _1.fastq.gz)
echo "$NAME"
j="${NAME}_1.fastq.gz"
echo "$j"
k="${NAME}_2.fastq.gz"
echo "$k"

# Write results into the serotyping_output folder created above
docker_run staphb/srst2 srst2 --input_pe "$j" "$k" --output "./serotyping_output/${NAME}_output" --log --gene_db analysis/clean_data/GBS-SBG.fasta

done

*Adapted from:*

- Advanced Bioinformatics Course developed for the GPS and JUNO projects - Wellcome Sanger Institute

*Modified by Luisa Sacristán (Universidad de los Andes-CABANA)*