# Module 6: Serotyping *Streptococcus pneumoniae* and *Streptococcus agalactiae*

## Overview

This module is divided into two parts:

1. Part 1: Predicting serotypes of *S. pneumoniae*

2. Part 2: Predicting serotypes of *S. agalactiae*

### Install condacolab

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

### Install software

In [None]:
# Install seroBA (-y answers the confirmation prompt automatically)
!conda install -y -c bioconda seroba

In [None]:
# Install SRST2 (-y answers the confirmation prompt automatically)
!conda install -y bioconda::srst2

In [None]:
# Check that seroBA was installed correctly
!seroba --version

### Download data

In [None]:
!wget

## Part 1: Predicting *S. pneumoniae* serotypes

To date, more than 100 serotypes have been described for *S. pneumoniae*, based on differing biochemical and antigenic properties of the capsule. Several in silico methods detect the *cps* locus, which can then be used to predict serotypes from WGS data. Accurate identification of pneumococcal serotypes is important for tracking the distribution and evolution of serotypes following the introduction of effective vaccines.

[SeroBA](https://github.com/sanger-pathogens/seroba?tab=readme-ov-file#installation) predicts serotypes directly from raw whole genome sequencing reads by identifying the *cps* locus with a k-mer based method, using a database adapted from PneumoCaT. It makes efficient use of computational resources and detects the *cps* locus accurately at low coverage: it reports 98% concordance, can process 10,000 samples in just over 1 day on a standard server, and can call serotypes at a coverage as low as 10x. SeroBA is implemented in Python 3 and is freely available under the open-source GPLv3 license.

*Further reading*: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6113868/

Explore the usage of SeroBA by running ``seroba -h``

In [None]:
# Show the seroBA help message
!seroba -h

SeroBA requires only three inputs:

1. A database directory built with KMC (a utility for counting k-mers) and ARIBA (Antimicrobial Resistance Identification By Assembly)
2. Forward and reverse read files in FASTQ format
3. An output prefix
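
Before launching a run, it can help to confirm that the three inputs listed above are actually in place. The following is a minimal sketch, not part of seroBA: the helper name `missing_inputs` is hypothetical, and the file names are the ones used later in this module.

```python
from pathlib import Path


def missing_inputs(db_dir, forward, reverse):
    """Return a list of problems; an empty list means every input was found."""
    problems = []
    if not Path(db_dir).is_dir():
        problems.append(f"database directory not found: {db_dir}")
    for label, reads in (("forward", forward), ("reverse", reverse)):
        if not Path(reads).is_file():
            problems.append(f"{label} reads not found: {reads}")
    return problems


# File names taken from this module; adjust them to your own data.
for problem in missing_inputs(
    "seroba_k71_14082017", "17150_4#79_1.fastq.gz", "17150_4#79_2.fastq.gz"
):
    print(problem)
```

The output prefix is not checked because seroBA creates it during the run.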


Download the [PneumoCaT](https://github.com/ukhsa-collaboration/PneumoCaT) database using the command: 

In [None]:
# Download the PneumoCaT database
!seroba getPneumocat PneumoCaT_dir

This command downloads PneumoCaT and builds a TSV-formatted metadata file from it. For this module, however, we will use the `seroba_k71_14082017` database because it is up to date.

### Step 1: To predict the serotype of a single strain (17150_4#79), we will use the command:

In [None]:
# Run seroBA
!seroba runSerotyping seroba_k71_14082017 17150_4#79_1.fastq.gz 17150_4#79_2.fastq.gz 17150_4#79_output

An explanation of this command is as follows:

`seroba` is the tool/program

`runSerotyping` is the subcommand that performs serotyping

`seroba_k71_14082017` is the path to the seroBA database directory

`17150_4#79_1.fastq.gz` and `17150_4#79_2.fastq.gz` are the forward and reverse fastq files

`17150_4#79_output` specifies the output prefix

In the output folder, you will find a **pred.tsv** file containing the predicted serotype.

### Step 2: To predict the serotype of multiple strains

In [None]:
%%bash
# Move each read pair into a directory named after its sample
for x in *1.fastq.gz; do mkdir ${x%%_1.fastq.gz} ; mv $x ${x%%_1.fastq.gz}; mv ${x%%1.fastq.gz}2.fastq.gz ${x%%_1.fastq.gz}; done

An explanation of this command is as follows:

`for x in *1.fastq.gz;` This starts a for loop that iterates over all files in the current directory that end with "1.fastq.gz".

`do` This starts the code block that will be executed for each file.

`mkdir ${x%%_1.fastq.gz}` This creates a new directory with the same name as the file, but with the "_1.fastq.gz" suffix removed.

`mv $x ${x%%_1.fastq.gz}` This moves the original file into the new directory created in the previous step.

`mv ${x%%1.fastq.gz}2.fastq.gz ${x%%_1.fastq.gz}` This moves a second file that has the same prefix as the first file, but with a "2.fastq.gz" suffix, into the new directory created in the first step.

`done` This ends the for loop.

Overall, this script is designed to take paired-end sequencing data that is stored in two separate files with names that end in "_1.fastq.gz" and "_2.fastq.gz", and organize it into directories based on the prefix of the file names. 
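
The suffix-stripping logic of the loop can be sketched in Python to preview which files would end up in which directory before anything is moved. This is an illustrative equivalent under the `_1.fastq.gz` / `_2.fastq.gz` naming used here; the function names `sample_name` and `planned_moves` are hypothetical.

```python
def sample_name(forward_file, suffix="_1.fastq.gz"):
    """Mirror the shell expansion ${x%%_1.fastq.gz}: drop the suffix if present."""
    if forward_file.endswith(suffix):
        return forward_file[: -len(suffix)]
    return forward_file


def planned_moves(filenames):
    """Map each sample directory to the read pair the loop would move into it."""
    moves = {}
    for name in sorted(filenames):
        if name.endswith("_1.fastq.gz"):
            sample = sample_name(name)
            moves[sample] = (name, sample + "_2.fastq.gz")
    return moves


files = ["17150_4#79_1.fastq.gz", "17150_4#79_2.fastq.gz"]
print(planned_moves(files))
# {'17150_4#79': ('17150_4#79_1.fastq.gz', '17150_4#79_2.fastq.gz')}
```

Nothing is moved by this sketch; it only reports the plan, which makes it safe to run on a real directory listing such as `os.listdir(".")`.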

*Further reading on loop commands*: https://www.gnu.org/software/bash/manual/bash.html#Looping-Constructs

We will then run seroBA using the command:

In [None]:
%%bash
# Run seroBA for all samples
for x in *#* ; do seroba runSerotyping seroba_k71_14082017 $x/${x}_1.fastq.gz $x/${x}_2.fastq.gz $x"_output"; done

An explanation of this command is as follows:

`for x in *#* ;` This starts a for loop that iterates over every entry in the current directory whose name contains the `#` character (here, the sample directories created in the previous step).

`do` This starts the code block that will be executed for each sample.

`seroba runSerotyping seroba_k71_14082017` This runs the `runSerotyping` subcommand of seroBA, using the "seroba_k71_14082017" database.

`$x/${x}_1.fastq.gz $x/${x}_2.fastq.gz` These are the input files for the command. Each one is located in the directory named after the sample (`$x`), and its file name is the sample name with "_1.fastq.gz" or "_2.fastq.gz" appended.

`$x"_output"` This is the output prefix for the command: the sample name with "_output" appended.

`;` This separates the seroBA command from `done`.

`done` This ends the for loop.

Overall, this script runs a serotyping analysis on the paired-end sequencing data of every sample whose directory name contains the `#` character. The output for each sample is stored in a directory named after that sample.
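
To see exactly which command the loop issues for a given sample, the argument list can be assembled in Python and printed before running anything. This is a sketch that reuses the database and sample names from this module; the helper `seroba_command` is hypothetical.

```python
import shlex


def seroba_command(sample, database="seroba_k71_14082017"):
    """Return the argument list for one sample directory, matching the loop body."""
    return [
        "seroba",
        "runSerotyping",
        database,
        f"{sample}/{sample}_1.fastq.gz",
        f"{sample}/{sample}_2.fastq.gz",
        f"{sample}_output",
    ]


cmd = seroba_command("17150_4#79")
# shlex.join quotes arguments that contain special characters such as '#'
print(shlex.join(cmd))
```

To execute the command rather than print it, the same list could be passed to `subprocess.run(cmd, check=True)`.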

### Step 3: We will then compile the results from the runs above using the command:

The command below combines the seroBA outputs into a single TSV file.

In [None]:
# Summary of seroBA results
!seroba summary ./
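
Once the results are combined, a quick tally of serotypes gives a first view of their distribution. The sketch below assumes the combined file is tab-separated with the sample ID in the first column and the serotype in the second; check this against your own file. The rows shown are made-up placeholders, not real results.

```python
import csv
from collections import Counter


def count_serotypes(lines, serotype_column=1):
    """Tally serotype calls from tab-separated rows (the column index is an assumption)."""
    counts = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) > serotype_column and row[serotype_column].strip():
            counts[row[serotype_column].strip()] += 1
    return counts


# Made-up placeholder rows, not real results.
example = ["sample_A\t19F", "sample_B\t19F", "sample_C\t23F"]
print(count_serotypes(example).most_common())
# [('19F', 2), ('23F', 1)]
```

To apply it to a real file, pass an open file handle instead of the list, for example `count_serotypes(open("summary.tsv", newline=""))`, where the file name `summary.tsv` is also an assumption.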

___

## Part 2: Predicting *S. agalactiae* serotypes 

*Streptococcus agalactiae* (Group B Streptococcus, or GBS) is currently divided into ten serotypes based on type-specific capsular antigens: Ia, Ib, II, III, IV, V, VI, VII, VIII, and IX.

The Group B Streptococcus Serotyping by Genome Sequencing (GBS-SBG) repository contains a curated reference file that can be used to serotype *Streptococcus agalactiae* in silico from whole genome sequencing data. The reference file (GBS-SBG.fasta) is designed to be usable for both short-read mapping and assembly-based strategies.

*Further reading*: https://github.com/swainechen/GBS-SBG 

### Step 1: To predict the serotype of a single strain (20280_5#33), we will use the command: 

In [None]:
# Run srst2
!srst2 --input_pe 20280_5#33_1.fastq.gz 20280_5#33_2.fastq.gz --output 20280_5#33_test --log --gene_db GBS-SBG.fasta

An explanation of this command is as follows:

`srst2` is the tool

`--input_pe` specifies that the input files are paired-end reads, here 20280_5#33_1.fastq.gz and 20280_5#33_2.fastq.gz

`--output` specifies the output prefix, 20280_5#33_test

`--log` switches on logging to a file rather than to standard output

`--gene_db` specifies the database GBS-SBG.fasta

Run the command `ls -lh` to check the contents in the folder.

In [None]:
# List files
!ls -lh

The listing should include the SRST2 result files.

SRST2 writes a per-sample summary to `20280_5#33_test__genes__GBS-SBG__results.txt` and a detailed report for each detected gene to `20280_5#33_test__fullgenes__GBS-SBG__results.txt`.

Use `cat` to view the detailed report:

In [None]:
# Show results
!cat 20280_5#33_test__fullgenes__GBS-SBG__results.txt
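
The result file names follow a visible pattern: the output prefix, the kind of report, and the database name, joined by double underscores. The sketch below reconstructs that pattern from the names shown in this module; it is inferred from those names rather than taken from the SRST2 source, and the helper `srst2_results_file` is hypothetical.

```python
from pathlib import Path


def srst2_results_file(output_prefix, gene_db, kind="genes"):
    """Build a results file name from the --output prefix and the --gene_db file."""
    db_name = Path(gene_db).stem  # "GBS-SBG.fasta" -> "GBS-SBG"
    return f"{output_prefix}__{kind}__{db_name}__results.txt"


print(srst2_results_file("20280_5#33_test", "GBS-SBG.fasta"))
# 20280_5#33_test__genes__GBS-SBG__results.txt
print(srst2_results_file("20280_5#33_test", "GBS-SBG.fasta", kind="fullgenes"))
# 20280_5#33_test__fullgenes__GBS-SBG__results.txt
```

This is handy when scripting over many samples, since the expected file name can be derived from the prefix instead of being typed by hand.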

### Step 2: To execute SRST2 on multiple strains, run the command:

In [None]:
# Run SRST2 on all samples at once
!srst2 --input_pe *.fastq.gz --output s.agalactiae --log --gene_db GBS-SBG.fasta

`--input_pe *.fastq.gz` passes every compressed FASTQ file in the current directory as paired-end input; SRST2 pairs the files by sample name.
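
Because the wildcard picks up every FASTQ file, a sample with a missing mate file is easy to overlook. The following sketch lists read files whose partner is absent, assuming the `_1.fastq.gz` / `_2.fastq.gz` naming used in this module. The helper `unpaired_reads` and the sample names in the listing are hypothetical.

```python
def unpaired_reads(filenames, forward="_1.fastq.gz", reverse="_2.fastq.gz"):
    """Return the read files whose mate is missing from the list."""
    names = set(filenames)
    missing_mate = []
    for name in sorted(names):
        if name.endswith(forward):
            mate = name[: -len(forward)] + reverse
        elif name.endswith(reverse):
            mate = name[: -len(reverse)] + forward
        else:
            continue  # not a read file following this naming scheme
        if mate not in names:
            missing_mate.append(name)
    return missing_mate


# Hypothetical listing: sample_B has lost its reverse reads.
listing = ["sample_A_1.fastq.gz", "sample_A_2.fastq.gz", "sample_B_1.fastq.gz"]
print(unpaired_reads(listing))
# ['sample_B_1.fastq.gz']
```

On real data the listing could come from `os.listdir(".")`; an empty result means every read file has its mate.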

## BONUS!

If you are working with Bash on your own computer or on an HPC cluster and have many files, loops let you process large datasets efficiently.

Here's a way to do it. 

Create a new bash script using nano named `serotype.sh`

In [None]:
#!/bin/bash
# File name: serotype.sh
# Serotype S. pneumoniae from trimmed fastq.gz reads

# Run a Docker image as the current user, with the current directory mounted at /data
function docker_run() { docker run --rm=True -u $(id -u):$(id -g) -v $(pwd):/data "$@" ;}
wordir=/home/bioinfo-sanger/Group_work/species/s.pneumo/results_fastqc_gz/trimmed_results/results_trimmed_gz
cd "$wordir" || exit 1

# Loop over the forward read files; the glob avoids parsing the output of ls
for i in *_1.trimmed.fastq.gz; do

NAME=$(basename "$i" _1.trimmed.fastq.gz)
echo "$NAME"
j="${NAME}_1.trimmed.fastq.gz"
echo "$j"
k="${NAME}_2.trimmed.fastq.gz"
echo "$k"

docker_run staphb/seroba seroba runSerotyping seroba_k71_14082017 "./$j" "./$k" "${NAME}_serotype_output"

done

Move all the outputs into a new folder with `mkdir -p serotype_results && mv *_serotype_output serotype_results`, then compile the results from inside that folder with `seroba summary ./`. You should obtain a TSV file.

Create a new bash script using nano named `serotype_2.sh`

In [None]:
#!/bin/bash
# Serotype multiple S. agalactiae read sets

# Run a Docker image as the current user, with the current directory mounted at /data
function docker_run() { docker run --rm=True -u $(id -u):$(id -g) -v $(pwd):/data "$@" ;}
wordir=/home/bioinfo-sanger/Data/Group_work/s.pneumo/s.agalactiae/lanes_2.txt/
cd "$wordir" || exit 1
mkdir -p serotyping_output

# Loop over the forward read files; the glob avoids parsing the output of ls
for i in *_1.fastq.gz; do

NAME=$(basename "$i" _1.fastq.gz)
echo "$NAME"
j="${NAME}_1.fastq.gz"
echo "$j"
k="${NAME}_2.fastq.gz"
echo "$k"

# Write results into the directory created above (the original pointed at serotyping_out)
docker_run staphb/srst2 srst2 --input_pe "$j" "$k" --output "./serotyping_output/${NAME}_output" --log --gene_db analysis/clean_data/GBS-SBG.fasta

done