### Binning Metagenomes

- Clustering-based binning
- Contigs with similar sequence composition and coverage are grouped into bins, essentially creating metagenome-assembled genomes (MAGs). The goal is for the contigs (sequences) in a bin to come together as a single genome that represents one microbial species. A MAG can then be matched against a reference database and assigned to a species; a MAG with no match may represent a novel species.

Software documentation:

MetaBAT: https://bitbucket.org/berkeleylab/metabat/src/master/README.md

CONCOCT: https://github.com/BinPro/CONCOCT

CheckM: https://github.com/Ecogenomics/CheckM/wiki

DAS_Tool: https://github.com/cmks/DAS_Tool

In [None]:
#Create new env and install software 
conda create -n binning python=3.7
conda activate binning
conda install -c bioconda metabat2

conda install -c bioconda checkm-genome
conda install -c bioconda das_tool

#### MetaBat2
The first piece of code here generates a simple text file with the per-contig coverage (depth), computed from the BAM files of the mapping step. It gathers all mapping information into a single depth file, so you can reuse that one file in the next analysis. The next set of code runs MetaBat2 (v2.10.2) using minContig 1500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, cluster size 50000 and maxEdges 200.

An important parameter to play around with is the minimum bin size (`-s`). The default of 200000 base pairs is fairly low, so it is a reasonable starting point. Even so, a cutoff of 200000 can severely limit the number of bins you recover, especially if your samples aren't perfect. Therefore, it is wise to run MetaBAT several times with slight alterations to the `-s` flag to find your optimal setting (you don't want 3 bins, but you also don't want 1000).

For a reference on how to do this accurately, use: https://bitbucket.org/berkeleylab/metabat/wiki/Best%20Binning%20Practices
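
One way to organise the repeated runs suggested above is to generate the command lines first and inspect them before submitting anything. The following is a minimal Python sketch, not part of the original workflow. The function name `metabat_sweep_commands`, the candidate bin sizes, the file names and the `minbin_<size>` output layout are all assumptions made for illustration; only the `-i`, `-a`, `-o`, `-m` and `-s` flags are taken from the MetaBAT script in this notebook.

```python
from pathlib import PurePosixPath


def metabat_sweep_commands(contigs, depth, out_root, min_bin_sizes, min_contig=1500):
    """Build one metabat2 argument list per candidate minimum bin size.

    Nothing is executed here; the lists can be printed, reviewed and then
    pasted into a SLURM script or passed to subprocess.run().
    """
    commands = []
    for size in min_bin_sizes:
        # Hypothetical layout: one output prefix per bin-size setting
        out_prefix = PurePosixPath(out_root) / f"minbin_{size}" / "bin"
        commands.append([
            "metabat2",
            "-i", str(contigs),      # assembled contigs (FASTA)
            "-a", str(depth),        # depth file from jgi_summarize_bam_contig_depths
            "-o", str(out_prefix),   # output prefix for the bin FASTA files
            "-m", str(min_contig),   # minimum contig length
            "-s", str(size),         # minimum bin size being tested
        ])
    return commands


if __name__ == "__main__":
    # Placeholder file names and sizes, chosen only for the demonstration
    for cmd in metabat_sweep_commands(
        "contigs.fsa", "depth.txt", "binning/sweep", [200000, 100000, 50000]
    ):
        print(" ".join(cmd))
```

Keeping each setting in its own output prefix makes it easy to run CheckM on every result and compare the number and quality of bins side by side.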

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=50G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 20:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/binning/slurm-%j.out  # %j = job ID

module load miniconda/22.11.1-1
conda activate binning

#set parameters for binning:
SAMPLENAME="healthy_2019_mcav"
OUTDIR=MetaBAT_"$SAMPLENAME"_bins2
DEPTHPATH=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/binning/MetaBAT_"$SAMPLENAME"_bins

CONTIGPATH=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/mapping
CONTIGFILE="healthy_2019_mcav_filtered.contigs-fixed.fsa"

#run in dir: brooke/mcav/healthy_2019_mcav
mkdir -p ./binning/$OUTDIR

#create depth file for MetaBat2
#jgi_summarize_bam_contig_depths --outputDepth ./binning/$OUTDIR/MetaBAT_"$SAMPLENAME"_depth.txt $CONTIGPATH/index/*.bam

#MetaBat2 script with minimum contig length (-m, has to be >=1500) and minimum bin size (-s)
#every continued line must end with '\'; otherwise -s is run as a separate command and metabat2 uses the default bin size
metabat2 -i $CONTIGPATH/$CONTIGFILE -a $DEPTHPATH/MetaBAT_"$SAMPLENAME"_depth.txt \
-o ./binning/$OUTDIR/Metabat/ \
-m 1500 \
-s 5000
#MetaBAT 2 (v2.12.1) using minContig 1500, minCV 1.0, minCVSum 1.0

# default parameters:
# -m [ --minContig ] arg (=2500), has to be at least 1500 
# maxP 95% Percentage of 'good' contigs considered for binning decided by connection among contigs
# minS 60 Minimum score of a edge for binning (should be between 1 and 99). The greater, the more specific.
# maxEdges 200 Maximum number of edges per node. The greater, the more sensitive.
#-x [ --minCV ] arg (=1)           Minimum mean coverage of a contig in each library for binning.
#  --minCVSum arg (=1)               Minimum total effective mean coverage of a contig (sum of depth over minCV) for binning.
# -s [ --minClsSize ] arg (=200000) Minimum size of a bin as the output.

#this runs CheckM immediately after and puts the results alongside your bins
checkm lineage_wf -x fa -t 3 ./binning/$OUTDIR/Metabat ./binning/$OUTDIR/Metabat/bins-stats

#bash script: metabat
#script location: brooke/mcav/healthy_2019_mcav/binning
#JOB ID: 19498217

In [None]:
MetaBAT_healthy_2019_mcav_bins: MetaBAT results with min contig length 1500 and min bin size 200000 (the default value)

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id            Marker lineage            # genomes   # markers   # marker sets    0     1    2    3   4   5+   Completeness   Contamination   Strain heterogeneity  
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  .6        o__Sphingomonadales (UID3310)         26         569           293         3    545   21   0   0   0       99.59            3.15               0.00          
  .14      c__Gammaproteobacteria (UID4444)      263         507           232         43   451   13   0   0   0       85.98            2.83              38.46          
  .22           k__Bacteria (UID3187)            2258        188           117         73   112   3    0   0   0       59.27            1.28               0.00          
  .16             k__Archaea (UID2)              207         149           107         80    47   17   3   0   2       40.20           20.84              18.06          
  .9              k__Archaea (UID2)              207         149           107         82    51   10   5   1   0       39.73            4.41               0.00          
  .17         p__Cyanobacteria (UID2143)         129         472           368        345   127   0    0   0   0       26.70            0.00               0.00          
  .19        o__Actinomycetales (UID1530)        622         259           152        210    48   1    0   0   0       15.33            0.11               0.00          
  .18            k__Bacteria (UID203)            5449        104            58         93    9    2    0   0   0        9.72            0.73              50.00          
  .10             k__Archaea (UID2)              207         149           107        142    7    0    0   0   0        5.45            0.00               0.00          
  .8              k__Archaea (UID2)              207         145           103        142    3    0    0   0   0        2.43            0.00               0.00          
  .3             k__Bacteria (UID203)            5449        104            58        101    3    0    0   0   0        2.04            0.00               0.00          
  .1             k__Bacteria (UID203)            5449         96            51         94    2    0    0   0   0        1.31            0.00               0.00          
  .7                 root (UID1)                 5656         56            24         56    0    0    0   0   0        0.00            0.00               0.00          
  .5                 root (UID1)                 5656         56            24         56    0    0    0   0   0        0.00            0.00               0.00          
  .4                 root (UID1)                 5656         56            24         56    0    0    0   0   0        0.00            0.00               0.00          
  .21                root (UID1)                 5656         56            24         56    0    0    0   0   0        0.00            0.00               0.00          
  .20                root (UID1)                 5656         56            24         56    0    0    0   0   0        0.00            0.00               0.00          
  .2                 root (UID1)                 5656         56            24         56    0    0    0   0   0        0.00            0.00               0.00          
  .15                root (UID1)                 5656         56            24         56    0    0    0   0   0        0.00            0.00               0.00          
  .13                root (UID1)                 5656         56            24         56    0    0    0   0   0        0.00            0.00               0.00          
  .12                root (UID1)                 5656         56            24         56    0    0    0   0   0        0.00            0.00               0.00          
  .11                root (UID1)                 5656         56            24         56    0    0    0   0   0        0.00            0.00               0.00          
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[2024-02-21 16:42:10] INFO: { Current stage: 0:00:01.755 || Total: 0:19:31.234 }
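
To pick out the usable bins from a table like the one above without reading it by eye, the text output can be parsed and filtered. This is a minimal Python sketch under stated assumptions: the function names `parse_checkm_table` and `filter_bins` are hypothetical, and the default cutoffs (at least 50% completeness, under 10% contamination) are a commonly used medium-quality threshold that is not prescribed anywhere in this notebook. The three sample rows are copied from the MetaBAT table above.

```python
SAMPLE = """\
  Bin Id   Marker lineage   # genomes   # markers   # marker sets   0   1   2   3   4   5+   Completeness   Contamination   Strain heterogeneity
  .6    o__Sphingomonadales (UID3310)   26     569   293   3    545   21   0   0   0   99.59   3.15    0.00
  .22   k__Bacteria (UID3187)           2258   188   117   73   112   3    0   0   0   59.27   1.28    0.00
  .16   k__Archaea (UID2)               207    149   107   80   47    17   3   0   2   40.20   20.84   18.06
"""


def parse_checkm_table(text):
    """Extract bin ID, completeness and contamination from CheckM's text table."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 4:
            continue  # blank lines and the dashed rules
        try:
            completeness, contamination, heterogeneity = (float(t) for t in tokens[-3:])
        except ValueError:
            continue  # header, log lines and comments do not end in three numbers
        rows.append({
            "bin": tokens[0],
            "completeness": completeness,
            "contamination": contamination,
            "strain_heterogeneity": heterogeneity,
        })
    return rows


def filter_bins(rows, min_completeness=50.0, max_contamination=10.0):
    """Keep bins that pass both cutoffs (defaults are an assumption, see above)."""
    return [
        r for r in rows
        if r["completeness"] >= min_completeness and r["contamination"] < max_contamination
    ]


if __name__ == "__main__":
    for row in filter_bins(parse_checkm_table(SAMPLE)):
        print(row["bin"], row["completeness"], row["contamination"])
```

The parser relies only on the first column and the last three columns, so it does not depend on the exact column widths of the table.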


In [None]:
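MetaBAT_healthy_2019_mcav_bins2: MetaBAT results with min contig length 1500 and min bin size 5000 were EXACTLY THE SAME as with min bin size 200000. Identical output is what you would expect if `-s 5000` never reached `metabat2` (for example, when the line before it does not end in `\`), because the run then falls back to the default minimum bin size of 200000.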


#### CONCOCT
https://concoct.readthedocs.io/en/latest/installation.html

This set of commands runs CONCOCT in its standard mode. It first creates a depth/coverage file for CONCOCT to use and then runs CONCOCT with the standard settings: the k-mer length is 4, the minimum contig length is 1000, and CONCOCT runs on the number of threads passed to it with `-t`.

CONCOCT builds its coverage table from the BAM files created in the mapping step, so it is key that these are all in the correct places before proceeding with binning. The result is a single file, which is then used for the complete binning process. Do keep in mind that binning might take a while, so be prepared to let this run overnight.
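
To make the "k-mer value is set to 4" setting more concrete, the sketch below counts tetranucleotide frequencies for a single sequence. It illustrates the kind of composition signal a binner can use and is not CONCOCT's own code, which differs in detail. The function name `kmer_frequencies` and the toy sequence are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the relative frequency of every possible DNA k-mer in a sequence.

    Windows containing anything other than A, C, G or T (for example N)
    are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


if __name__ == "__main__":
    freqs = kmer_frequencies("ACGTACGT")  # toy sequence, not real data
    print(len(freqs), freqs["ACGT"])
```

Contigs from the same genome tend to have similar frequency profiles, which is why composition is combined with coverage when contigs are clustered into bins.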

In [None]:
# Conda installation
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

conda create -n concoct_env python=3 concoct

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=50G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 24:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/binning/slurm-%j-concoct.out  # %j = job ID

module load miniconda/22.11.1-1
conda activate concoct_env


#set parameters
SAMPLENAME="healthy_2019_mcav"
BINPATH=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/binning/Concoct_"$SAMPLENAME"_bins
CONTIGPATH=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/mapping
CONTIGFILE="healthy_2019_mcav_filtered.contigs-fixed.fsa"
BAMPATH='/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/mapping/index'
TEMPDIR=concoct_"$SAMPLENAME"_temp

#create the output directory (-p makes this safe to re-run when it already exists)
mkdir -p $BINPATH
#creates the CONCOCT depth file
#this part cuts up the contigs into 10kb pieces for CONCOCT to use 
cut_up_fasta.py $CONTIGPATH/$CONTIGFILE -c 10000 -o 0 --merge_last -b $BINPATH/${SAMPLENAME}_contigs_cut.bed > $BINPATH/${SAMPLENAME}_contigs_cut.fa
#estimate contig coverage

concoct_coverage_table.py $BINPATH/${SAMPLENAME}_contigs_cut.bed $BAMPATH/*.bam > $BINPATH/coverage_table_${SAMPLENAME}.tsv || { echo 'Exit code 2: failed to create coverage file, exiting.'; exit 2; }

#CONCOCT script

#the CONCOCT output directory has to exist before the run, because the later steps read from $BINPATH/$TEMPDIR/
mkdir -p $BINPATH/$TEMPDIR

#run CONCOCT
concoct --composition_file $BINPATH/${SAMPLENAME}_contigs_cut.fa --coverage_file $BINPATH/coverage_table_${SAMPLENAME}.tsv -t 3 -b $BINPATH/$TEMPDIR || { echo 'Exit code 3: CONCOCT failed to run, exiting.'; exit 3; }
merge_cutup_clustering.py $BINPATH/$TEMPDIR/clustering_gt1000.csv > $BINPATH/$TEMPDIR/${SAMPLENAME}_clustering_merged.csv || { echo 'Exit code 4: failed to merge clusters, exiting.'; exit 4; }
extract_fasta_bins.py $CONTIGPATH/$CONTIGFILE $BINPATH/$TEMPDIR/${SAMPLENAME}_clustering_merged.csv --output_path $BINPATH || { echo 'Exit code 5: Bins were not extracted, exiting.'; exit 5; }

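# CheckM is installed in the binning env
conda deactivate
conda activate binning
#this runs CheckM immediately after and puts the results alongside your bins
#note: CheckM treats every .fa file in $BINPATH as a bin, including ${SAMPLENAME}_contigs_cut.fa
checkm lineage_wf -x fa -t 3 $BINPATH  $BINPATH/CheckM
#optional: add '--tab_table' to print the summary as a tab-separated table, which is easier to read and parse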

#bash script: concoct
#script location: mcav/healthy_2019_mcav/binning
#JOB ID: 21569803
#checkm job id: 21563395
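
The `cut_up_fasta.py` call in the script above uses `-c 10000 -o 0 --merge_last`. The sketch below shows how a single contig length would be split under those three settings, as I understand them: pieces of 10 kb with no overlap, with the final short remainder merged into the last piece rather than emitted on its own. It is a simplified illustration, not CONCOCT's code, and the function name `cut_intervals` is hypothetical.

```python
def cut_intervals(length, chunk_size=10000):
    """Return (start, end) intervals for one contig, with no overlap.

    Mirrors the assumed behaviour of chunk size 10000, overlap 0 and
    merge_last: contigs shorter than two chunks stay whole, and the last
    interval absorbs whatever remains.
    """
    if length < 2 * chunk_size:
        return [(0, length)]  # too short to split: keep the contig whole
    intervals = []
    for start in range(0, length - chunk_size + 1, chunk_size):
        if start + 2 * chunk_size <= length:
            intervals.append((start, start + chunk_size))
        else:
            intervals.append((start, length))  # last piece takes the remainder
    return intervals


if __name__ == "__main__":
    for contig_length in (15000, 25000, 30000):
        print(contig_length, cut_intervals(contig_length))
```

Under these assumptions every piece is between 10 kb and just under 20 kb long, and contigs shorter than 20 kb are left uncut.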

In [None]:
Bin Id                                   Marker lineage            # genomes   # markers   # marker sets    0     1    2    3    4   5+   Completeness   Contamination   Strain heterogeneity  
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  healthy_2019_mcav_contigs_cut             root (UID1)                 5656         56            24         0     0    1    2    5   48      100.00          610.87              1.19          
  42                               o__Sphingomonadales (UID3310)         26         569           293         3    556   10   0    0   0       99.59            2.30               0.00          
  56                              c__Gammaproteobacteria (UID4444)      263         507           232         45   450   12   0    0   0       85.79            2.68              33.33          
  109                                  k__Bacteria (UID3187)            2258        188           117         53   132   3    0    0   0       68.90            1.12               0.00          
  68                                     k__Archaea (UID2)              207         149           107         61    62   18   7    1   0       53.82           13.76               0.00          
  139                                p__Cyanobacteria (UID2143)         129         472           368        242   220   8    2    0   0       46.70            1.84              14.29          
  50                                    k__Bacteria (UID203)            5449        103            58         39    30   8    11   5   10      41.74           32.07               6.25          
  150                               o__Actinomycetales (UID1530)        622         259           152        174    83   2    0    0   0       28.65            0.16               0.00          
  167                                    k__Archaea (UID2)              207         149           107        113    28   6    2    0   0       22.39            5.99              16.67          
  98                                     k__Archaea (UID2)              207         149           107        117    24   5    2    0   1       18.04            7.31              26.92          
  170                                   k__Bacteria (UID203)            5449        104            58         90    14   0    0    0   0       13.87            0.00               0.00          
  83                                    k__Bacteria (UID203)            5449        104            58         94    9    1    0    0   0        8.73            0.16              100.00         
  113                                    k__Archaea (UID2)              207         149           107        134    11   4    0    0   0        8.54            1.08               0.00          
  84                                     k__Archaea (UID2)              207         149           107        138    9    1    1    0   0        7.94            1.87              50.00          
  7                                    k__Bacteria (UID1453)            901         171           117        144    27   0    0    0   0        5.72            0.00               0.00          
  105                                   k__Bacteria (UID203)            5449        103            58         91    10   2    0    0   0        5.58            1.72               0.00          
  54                                     k__Archaea (UID2)              207         149           107        143    6    0    0    0   0        4.52            0.00               0.00          
  76                                        root (UID1)                 5656         56            24         55    1    0    0    0   0        4.17            0.00               0.00          
  49                                        root (UID1)                 5656         56            24         55    1    0    0    0   0        4.17            0.00               0.00          
  169                                    k__Archaea (UID2)              207         145           103        139    6    0    0    0   0        4.05            0.00               0.00          
  66                                    k__Bacteria (UID203)            5449        103            58         97    6    0    0    0   0        3.28            0.00               0.00          
  152                                   k__Bacteria (UID203)            5449        103            58         94    8    1    0    0   0        2.82            0.16               0.00          
  136                                    k__Archaea (UID2)              207         145           103        142    3    0    0    0   0        1.94            0.00               0.00          
  97                                    k__Bacteria (UID203)            5449        104            58        103    1    0    0    0   0        1.72            0.00               0.00          
  137                                   k__Bacteria (UID203)            5449        104            58        102    2    0    0    0   0        1.44            0.00               0.00          
  154                                    k__Archaea (UID2)              207         145           103        141    3    1    0    0   0        1.36            0.07               0.00          
  14                                     k__Archaea (UID2)              207         149           107        146    3    0    0    0   0        1.08            0.00               0.00          
  90                                    k__Bacteria (UID203)            5449        104            58        100    4    0    0    0   0        1.00            0.00               0.00          
  75                                    k__Bacteria (UID203)            5449        102            56        101    1    0    0    0   0        0.89            0.00               0.00          
  160                                       root (UID1)                 5656         56            24         55    1    0    0    0   0        0.52            0.00               0.00          
  156                                   k__Bacteria (UID203)            5449        104            58        103    1    0    0    0   0        0.34            0.00               0.00          
  43                                     k__Archaea (UID2)              207         148           106        147    1    0    0    0   0        0.31            0.00               0.00          
  188                                    k__Archaea (UID2)              207         149           107        148    1    0    0    0   0        0.31            0.00               0.00          
  166                                    k__Archaea (UID2)              207         149           107        148    1    0    0    0   0        0.31            0.00               0.00          
  134                                   k__Bacteria (UID203)            5449        104            58        103    1    0    0    0   0        0.16            0.00               0.00          
  45                                     k__Archaea (UID2)              207         149           107        148    1    0    0    0   0        0.07            0.00               0.00          
  99                                        root (UID1)                 5656         56            24         56    0    0    0    0   0        0.00            0.00               0.00          
  #UID1 repeats after this   
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[2024-03-15 03:42:52] INFO: { Current stage: 0:00:08.272 || Total: 0:43:47.260 }


#obtained from stdout file 

### DAS_Tool
- Combines and refines the bins from several binning tools into one non-redundant set
- Documentation: https://github.com/cmks/DAS_Tool
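
DAS_Tool reads one tab-separated contig-to-bin table per binning tool, so it can save a failed job to check those tables first. The following is a minimal Python sketch with a hypothetical helper, `check_contig2bin`. It assumes each table should have exactly two columns and list every contig only once; that is a sanity check chosen for illustration, not a rule taken from the DAS_Tool documentation.

```python
import csv


def check_contig2bin(path):
    """Validate a contig-to-bin table and return (n_contigs, n_bins).

    Raises ValueError if a row does not have exactly two tab-separated
    fields or if a contig appears more than once.
    """
    contigs = set()
    bins = set()
    with open(path, newline="") as handle:
        for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            if len(row) != 2:
                raise ValueError(f"{path}, line {line_no}: expected 2 columns, found {len(row)}")
            contig, bin_id = row
            if contig in contigs:
                raise ValueError(f"{path}, line {line_no}: contig {contig!r} listed twice")
            contigs.add(contig)
            bins.add(bin_id)
    return len(contigs), len(bins)
```

Calling it on both `concoct_bins.tsv` and `metabat_bins.tsv` gives a quick count of contigs and bins per tool before the job is submitted.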

In [33]:
import pandas as pd
import os

In [26]:
os.chdir('/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/binning/Concoct_healthy_2019_mcav_bins/concoct_healthy_2019_mcav_temp')

In [40]:
# Convert contig-bin file from concoct to tab-delimited format
# (first column = contig ID, second column = bin ID, no header row)
concoct_bins = pd.read_csv('healthy_2019_mcav_clustering_merged.csv', index_col=0)
concoct_bins.to_csv("concoct_bins.tsv", sep='\t', header=False)

In [46]:
# Convert all contig-bin files from metabat to single tab-delimited file

directory = '/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/binning/MetaBAT_healthy_2019_mcav_bins/Metabat'
output_file = os.path.join(directory, 'metabat_bins.tsv')
# Define a dictionary to store the mapping of contig IDs to bin IDs
contig_bin_mapping = {}

# Iterate through each file in the directory
for filename in os.listdir(directory):
    # Check if the file is a .fa file and matches the expected pattern
    if filename.endswith('.fa') and filename.startswith('.'):
        # Extract the bin ID from the filename
        bin_id = int(filename.split('.')[1])
        
        # Open the file and read the contig IDs
        with open(os.path.join(directory, filename), 'r') as file:
            for line in file:
                # Check if the line starts with '>'
                if line.startswith('>'):
                    # Extract the contig ID and add it to the dictionary with the corresponding bin ID
                    contig_id = line.strip().split()[0][1:]  # Remove '>' and any additional information
                    contig_bin_mapping[contig_id] = bin_id

# Write the contig IDs and corresponding bin IDs to a tab-delimited file
with open(output_file, 'w') as outfile:
    for contig_id, bin_id in contig_bin_mapping.items():
        outfile.write(f"{contig_id}\t{bin_id}\n")

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=50G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 24:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/binning/slurm-%j-das_tool.out  # %j = job ID


module load miniconda/22.11.1-1
conda activate binning

# Set parameters
CONCOCTPATH='/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/binning/Concoct_healthy_2019_mcav_bins/concoct_healthy_2019_mcav_temp/concoct_bins.tsv'  
METABATPATH='/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/binning/MetaBAT_healthy_2019_mcav_bins/Metabat/metabat_bins.tsv'
CONTIGPATH='/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/mcav/healthy_2019_mcav/mapping'
CONTIGFILE="healthy_2019_mcav_filtered.contigs-fixed.fsa"


DAS_Tool -i $CONCOCTPATH,$METABATPATH \
-l concoct,metabat \
-c $CONTIGPATH/$CONTIGFILE \
-t 11 \
--write_bin_evals \
--write_bins \
-o das_tool

# -i input list: comma-separated list of tab-separated contig-to-bin tables (one per binning tool)
#--score_threshold default is 0.5

#bash script: das_tool
#script location: mcav/healthy_2019_mcav/binning
# JOBID: 21582072

### prep file to import as collection file to anvio 
- metabin produces fasta files containing contigs of each bin 
- collection artifact requires a txt file that contains list of contigs with their associated bins (2 columns) 
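
The two-column file described above can also be produced in Python, which fits the other conversion cells in this notebook. This is a minimal sketch: the function name `write_collection`, the default `.fa` extension and the choice to replace dots in bin names with underscores are assumptions made for illustration.

```python
from pathlib import Path


def write_collection(bin_dir, out_path, extension=".fa"):
    """Write 'contig<TAB>bin' lines for every bin FASTA file in bin_dir.

    The bin name is the file name without its extension, with dots
    replaced by underscores. Returns the number of contigs written.
    """
    written = 0
    with open(out_path, "w") as out:
        for fasta in sorted(Path(bin_dir).glob(f"*{extension}")):
            bin_name = fasta.name[: -len(extension)].replace(".", "_")
            with open(fasta) as handle:
                for line in handle:
                    if not line.startswith(">"):
                        continue  # sequence line, not a header
                    fields = line[1:].split()
                    if fields:
                        # keep only the contig ID, drop any description
                        out.write(f"{fields[0]}\t{bin_name}\n")
                        written += 1
    return written
```

Only the first word of each FASTA header is kept, so the contig names should match those in the contigs file as long as that file uses the same IDs.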

In [None]:
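#!/bin/bash
# A simple script to convert metabin results to anvio
# Run it in the directory that holds the bin FASTA files (*.fa)
for f in *.fa; do
 # bin name = file name without the .fa extension
 NAME=$(basename "$f" .fa)
 # replace dots in the bin name only, so contig names are left untouched
 NAME=${NAME//./_}
 # one "contig<TAB>bin" line per FASTA header
 grep ">" "$f" | sed 's/>//' | sed -e "s/$/\t$NAME/" >> metabins4anvio.txt
done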