# Metagenomic Binning

This notebook walks through the workflow for binning the contigs from your metagenome assemblies into species-level bins, the starting point for metagenome-assembled genomes (MAGs).

1. Create bins for your megahit contigs
2. Create bins for your metaspades contigs


## Getting Started

You will need to rerun this section each time you come back to this notebook to reset all directories and variables.

In [None]:
# set the variables for your netid and xfile
netid = "MY_NETID"
xfile = "MY_XFILE"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/09_metag_binning"
%cd $work_dir

## Creating a config file
The scripts below execute code that requires certain variables to be set. So that we don't need to edit the code in each script, we will use a config file that defines all of these variables for us. When a script needs these variables, it will "source" the config file to set them.
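
As a rough illustration of what the config file holds and what sourcing it provides, the sketch below builds the same `export NAME=value` lines in Python and reads them back into a dictionary. The helper names `build_config` and `parse_config` are hypothetical and are not part of the course scripts; the directory layout is copied from the cell that follows.

```python
# Illustrative sketch only: build_config and parse_config are hypothetical helpers.
BASE = "/xdisk/bhurwitz/bh_class"  # base path used throughout this notebook


def build_config(netid, xfile):
    """Return the text of config.sh for a given netid and xfile."""
    root = f"{BASE}/{netid}/assignments"
    settings = {
        "NETID": netid,
        "XFILE": xfile,
        "WORK_DIR": f"{root}/09_metag_binning",
        "XFILE_DIR": f"{root}/05_getting_data",
        "FASTQ_DIR": f"{root}/07_contam_removal",
        "MEGAHIT_DIR": f"{root}/08_assembly/out_megahit",
        "METASPADES_DIR": f"{root}/08_assembly/out_metaspades",
    }
    return "".join(f"export {name}={value}\n" for name, value in settings.items())


def parse_config(text):
    """Collect the NAME=value pairs that `source config.sh` would export."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("export ") and "=" in line:
            name, value = line[len("export "):].split("=", 1)
            variables[name] = value
    return variables


config_text = build_config("MY_NETID", "MY_XFILE")
print(config_text)
print(parse_config(config_text)["MEGAHIT_DIR"])
```

Keeping every path in one place means a change of netid or directory layout is made once, rather than in each job script.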

In [None]:
# create a config file with all of the variables you need
# notice that we are using the reads post-trimming, and post-human removal
!echo "export NETID=$netid" > config.sh
!echo "export XFILE=$xfile" >> config.sh
!echo "export WORK_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/09_metag_binning" >> config.sh
!echo "export XFILE_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/05_getting_data" >> config.sh
!echo "export FASTQ_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/07_contam_removal" >> config.sh
!echo "export MEGAHIT_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/08_assembly/out_megahit" >> config.sh
!echo "export METASPADES_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/08_assembly/out_metaspades" >> config.sh

In [None]:
# check the config file to be sure it is correct
# Are your netid and xfile correct? Do you have the right directories?
!cat config.sh

## Step 1: Binning contigs from your Megahit Assembly

In this step, we will create species-level bins for the contigs that were created from our megahit assembly.

In [None]:
# Create a script to run maxbin to bin megahit contigs by species
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command in the script.
# 2. maxbin runs on each of the fastq files in the trimmed $FASTQ_DIR
# 3. The results will be written into our $WORK_DIR
# 4. Notice that we are asking for a lot more resources (24 cores and 5G of memory per core) and more time (24 hours)
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=24:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-4                         
#SBATCH --output=Job-mega-bins-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=5G                                    

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

### reads after trimming and human filtering
PAIR1=${FASTQ_DIR}/${SAMPLE_ID}_1.fastq*
PAIR2=${FASTQ_DIR}/${SAMPLE_ID}_2.fastq*

MEGAHIT_OUTDIR=${WORK_DIR}/out_megahit
OUTDIR=${MEGAHIT_OUTDIR}/${SAMPLE_ID}

### create the output directories if they do not exist
### (mkdir -p also creates the parent and is safe when several array tasks start together)
mkdir -p "$OUTDIR"

### final contigs
CONTIGS="${MEGAHIT_DIR}/${SAMPLE_ID}/final_contigs.fa"

### -out is an output file prefix, so bins are written as ${OUTDIR}/mbin.001.fasta, etc.
/contrib/singularity/shared/bhurwitz/maxbin2:2.2.7--hdbdd923_5.sif run_MaxBin.pl \
-thread 24 -contig ${CONTIGS} \
-reads ${PAIR1} \
-reads2 ${PAIR2} \
-out ${OUTDIR}/mbin

'''

with open('megahit_bin_parallel.sh', mode='w') as file:
    file.write(my_code)
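
The script above picks one sample per array task: the xfile is read into the `names` array, and `SLURM_ARRAY_TASK_ID` is used as a 0-based index into it. The sketch below mirrors that mapping in Python and shows why `--array=0-4` corresponds to five samples. The helper names and the sample IDs are made up for illustration.

```python
# Hypothetical helpers for illustration; they mirror the bash lines
#   names=($(cat $XFILE_DIR/$XFILE))
#   SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}


def read_sample_ids(xfile_text):
    """Split the xfile contents on whitespace, as the bash array assignment does."""
    return xfile_text.split()


def sample_for_task(sample_ids, task_id):
    """Return the sample handled by one array task (0-based index)."""
    return sample_ids[task_id]


def array_spec(sample_ids):
    """Return the --array range that covers every sample exactly once."""
    if not sample_ids:
        raise ValueError("the xfile contains no sample IDs")
    return f"0-{len(sample_ids) - 1}"


# Made-up sample IDs, used only to show the mapping
ids = read_sample_ids("sampleA\nsampleB\nsampleC\nsampleD\nsampleE\n")
print(array_spec(ids))
print(sample_for_task(ids, 0))
```

If your xfile lists a different number of samples, the `--array` range in the script would need to change to match.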

In [None]:
# Check the code and make sure your script above was created.
!cat megahit_bin_parallel.sh

In [None]:
# you should be in your working directory when you run this script
# do you see your config.sh file, and the megahit_bin_parallel.sh script?
!pwd
!ls

In [None]:
# Let's run sbatch to run the megahit contig binning
# Remember that this may take a while to run, so take a break, and get a coffee.
!sbatch ./megahit_bin_parallel.sh

In [None]:
# You can check if it is running using the squeue command
# Check for all jobs under your netid
!squeue --user=$netid

In [None]:
# Once your jobs have run (or are running) you can check the progress
# and also look for errors in the *out files
# For example, you can look at Job-mega-bins-0.out
!ls
!cat Job-mega-bins-0.out

In [None]:
# Double check that you have bins for your contigs from megahit
!ls $work_dir/out_megahit

Rock on! You have created bins for your megahit contigs. These bins should represent the species present in your samples.

MaxBin generates a series of files for each sample. Take a look at them. In particular, you should see a series of numbered `*.fasta` files (for example, `mbin.001.fasta`). These are the different genome bins predicted by MaxBin.

Take a look at the mbin.summary file. What is shown?

Now, we are going to generate a concatenated file that contains all of our genome bins put together. We will change the fasta header name to include the bin number so that we can tell them apart later.

In [None]:
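%%bash
# Run this from the directory that contains the mbin.*.fasta bins for one sample
for file in mbin.*.fasta; do
  # keep only the digits of the file name, i.e. the bin number
  num=${file//[!0-9]/}
  # append the bin number to every fasta header line
  sed -e "/^>/ s/$/ ${num}/" "$file"
done > maxbin_binned.concat.fasta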
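
For comparison, here is a Python sketch of the same concatenation step. It assumes the bins are named `mbin.<number>.fasta`, as in the loop above; `concat_bins` is a hypothetical helper name, not part of MaxBin.

```python
import re
from pathlib import Path


def concat_bins(bin_dir, out_name="maxbin_binned.concat.fasta", prefix="mbin"):
    """Concatenate <prefix>.<number>.fasta files, appending the bin number to each header."""
    bin_dir = Path(bin_dir)
    out_path = bin_dir / out_name
    pattern = re.compile(rf"{re.escape(prefix)}\.(\d+)\.fasta")
    with open(out_path, "w") as out:
        for fasta in sorted(bin_dir.glob(f"{prefix}.*.fasta")):
            match = pattern.fullmatch(fasta.name)
            if match is None:
                continue  # skip files that do not follow the expected naming
            bin_num = match.group(1)
            with open(fasta) as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if line.startswith(">"):
                        # header line: add the bin number so contigs can be traced back
                        line = f"{line} {bin_num}"
                    out.write(line + "\n")
    return out_path


# Example call, run from a directory that holds the bins:
# concat_bins(".")
```

Because the output file is opened for writing rather than appending, rerunning the function replaces the concatenated file instead of duplicating its contents.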

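## Step 2: Binning contigs from your Metaspades Assembly

In this step, we will create species-level bins for the contigs that were created from our metaspades assembly.

In [None]: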
# Create a script to run maxbin to bin metaspades contigs by species
# A few important points:
# 1. We are using the variables from the config file via the `source ./config.sh` command in the script.
# 2. maxbin runs on each of the fastq files in the trimmed $FASTQ_DIR
# 3. The results will be written into our $WORK_DIR
# 4. Notice that we are asking for a lot more resources (24 cores and 5G of memory per core) and more time (24 hours)
my_code = '''#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1             
#SBATCH --time=24:00:00   
#SBATCH --partition=standard
#SBATCH --account=bh_class
#SBATCH --array=0-4                         
#SBATCH --output=Job-metaspades-bins-%a.out
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=5G                                    

pwd; hostname; date

source ./config.sh
names=($(cat $XFILE_DIR/$XFILE))

SAMPLE_ID=${names[${SLURM_ARRAY_TASK_ID}]}

### reads after trimming and human filtering
PAIR1=${FASTQ_DIR}/${SAMPLE_ID}_1.fastq*
PAIR2=${FASTQ_DIR}/${SAMPLE_ID}_2.fastq*

METASPADES_OUTDIR=${WORK_DIR}/out_metaspades
OUTDIR=${METASPADES_OUTDIR}/${SAMPLE_ID}

### create the output directories if they do not exist
### (mkdir -p also creates the parent and is safe when several array tasks start together)
mkdir -p "$OUTDIR"

### final contigs
CONTIGS="${METASPADES_DIR}/${SAMPLE_ID}/contigs.fasta"

### -out is an output file prefix, so bins are written as ${OUTDIR}/mbin.001.fasta, etc.
/contrib/singularity/shared/bhurwitz/maxbin2:2.2.7--hdbdd923_5.sif run_MaxBin.pl \
-thread 24 -contig ${CONTIGS} \
-reads ${PAIR1} \
-reads2 ${PAIR2} \
-out ${OUTDIR}/mbin

'''

with open('metaspades_bin_parallel.sh', mode='w') as file:
    file.write(my_code)
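
The cell above only writes `metaspades_bin_parallel.sh`; as with the megahit script in Step 1, it still has to be submitted with `sbatch`. The sketch below is one way to do that from Python. The helper name `submit_job` is hypothetical, and the function skips the submission when `sbatch` is not available on the current machine.

```python
import shutil
import subprocess


def submit_job(script):
    """Submit a SLURM batch script; return sbatch's message, or None if sbatch is unavailable."""
    if shutil.which("sbatch") is None:
        print(f"sbatch not found on this machine; skipping submission of {script}")
        return None
    # check=True raises CalledProcessError if sbatch rejects the script
    result = subprocess.run(["sbatch", script], capture_output=True, text=True, check=True)
    return result.stdout.strip()


message = submit_job("./metaspades_bin_parallel.sh")
print(message)
```

Once the jobs are submitted, their progress can be followed with `squeue` and the `Job-metaspades-bins-*.out` files, in the same way as in Step 1.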

## Final Step
Copy your notebook to the current working directory

In [None]:
cp ~/07_contam_removal.ipynb $work_dir