### Preparing assembly files for binning
To create a depth file, the reads must first be mapped back to the assembled contigs. Here this was done with bowtie2 (BWA also works). The next steps are to create a depth file with MetaBat2, convert it into a form suitable for CONCOCT and MaxBin2, and then process these into bins.

Software documentation:

MetaBAT: https://bitbucket.org/berkeleylab/metabat/src/master/README.md

CONCOCT: https://github.com/BinPro/CONCOCT

CheckM: https://github.com/Ecogenomics/CheckM/wiki

Das_tool: https://github.com/cmks/DAS_Tool

Metabinner: https://github.com/ziyewang/MetaBinner

In [None]:
#Create new env and install software 
conda create -n binning python=3.7
conda activate binning
conda install -c bioconda metabat2

conda install -c bioconda checkm-genome
conda install -c bioconda das_tool
conda install -c bioconda metabinner
#errors downloading metabinner...HAVEN'T INSTALLED IN THE ENV YET

#### MetaBat2
The first command in the script below generates a fairly simple text file with the coverage of each contig. The next command runs MetaBat2 (v2.10.2) using minContig 1500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, cluster size 50000 and maxEdges 200. The minimum size for a bin is 200000 base pairs, which is a reasonable starting point. All mapping information is gathered into a single depth file, so you can reuse that one file in the next analysis. An important parameter to experiment with is the minimum bin size (the -s flag). At 200000 it can severely limit the number of bins you recover, especially if your samples aren't perfect. It is therefore wise to run MetaBAT several times with slightly different -s values to find your optimal setting (you don't want 3 bins, but you also don't want 1000).

For a reference on how to do this accurately, use: https://bitbucket.org/berkeleylab/metabat/wiki/Best%20Binning%20Practices
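
When comparing runs with different -s values, it helps to count how many bins each run produced. The following is a minimal sketch, assuming each run wrote its `.fa` bins into its own directory. The directory names `Metabin_s200000` and `Metabin_s100000` are hypothetical and not part of the original workflow.

```shell
#!/bin/bash
# count_bins DIR: print the number of *.fa bin files directly inside DIR
count_bins() {
    find "$1" -maxdepth 1 -type f -name '*.fa' | wc -l | tr -d ' '
}

# Hypothetical run directories, one per -s setting that was tried
for run in ./results/Metabin_s200000 ./results/Metabin_s100000; do
    if [ -d "$run" ]; then
        printf '%s\t%s bins\n' "$run" "$(count_bins "$run")"
    fi
done
```

A run that yields only a handful of bins, or many hundreds, is a hint that the -s value is worth adjusting.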

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=50G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 20:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/bash_scripts/slurm-%j.out  # %j = job ID

module load miniconda/22.11.1-1
conda activate binning

#set parameters for binning:
SAMPLENAME=mcav
OUTDIR=MetaBAT_"$SAMPLENAME"_bins_redo

CONTIGPATH=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/working/mcav_assembly_redo
CONTIGFILE=mcav.contigs-fixed.fa

#mkdir ./results/$OUTDIR
#mkdir ./working/binning_redo

#create depth file for MetaBat2
jgi_summarize_bam_contig_depths --outputDepth ./working/binning_redo/MetaBAT_"$SAMPLENAME"_depth.txt $CONTIGPATH/*.bam

#mkdir results/$OUTDIR/Metabin_3
#MetaBat2 run with minimum contig length (-m, has to be >=1500); minimum bin size (-s) is left at its default
metabat2 -i $CONTIGPATH/$CONTIGFILE -a ./working/binning_redo/MetaBAT_"$SAMPLENAME"_depth.txt \
-o ./results/$OUTDIR/Metabin_3/ \
-m 1500 
#wanted to try again with no min contig length, but MetaBAT2 requires -m to be at least 1500
#MetaBAT 2 (v2.12.1) using minContig 1500, minCV 1.0, minCVSum 1.0, maxP 95%, minS 60, and maxEdges 200.

#this runs CheckM immediately after and puts the results alongside your bins
checkm lineage_wf -x fa -t 3 ./results/$OUTDIR/Metabin_3 ./results/$OUTDIR/Metabin_3/bins-stats

#bash script: metabat
#JOB ID: 15838479

In [None]:
#same results from filtering to 1000bp in mapping step...
#probably cuz you're specifying the same mincontig length in metabat
#so it doesn't matter what had previously been filtered out if it was less than that min length

### Prepare a collection file to import into anvi'o
- MetaBAT produces one FASTA file per bin, containing that bin's contigs
- the anvi'o collection import requires a txt file listing each contig with its associated bin (2 tab-separated columns)

In [None]:
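#!/bin/bash
# A simple script to convert MetaBAT bin FASTA files into an anvi'o collection file
for f in *.fa; do
 # strip the .fa extension and replace dots in the bin name with underscores
 NAME=$(basename "$f" .fa | tr '.' '_')
 # one line per contig: contig name, tab, bin name
 grep ">" "$f" | sed 's/^>//' | sed -e "s/$/\t$NAME/" >> metabins4anvio.txt
done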
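
Before importing, the collection file can be sanity-checked. This is a minimal sketch, not part of the original workflow. It assumes anvi'o wants exactly two tab-separated columns, each contig listed once, and bin names made only of letters, digits and underscores (that last rule is my reading of anvi'o's naming checks, so treat it as an assumption). The function name `check_collection` is hypothetical.

```shell
#!/bin/bash
# check_collection FILE: succeed only if every line has exactly two
# tab-separated fields, no contig is listed twice, and bin names use only
# letters, digits and underscores
check_collection() {
    awk -F'\t' '
        NF != 2 { printf "line %d: expected 2 columns\n", NR; bad = 1 }
        NF == 2 && seen[$1]++ { printf "line %d: duplicate contig %s\n", NR, $1; bad = 1 }
        NF == 2 && $2 !~ /^[A-Za-z0-9_]+$/ { printf "line %d: bad bin name %s\n", NR, $2; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# Only runs if the collection file exists in the current directory
if [ -f metabins4anvio.txt ]; then
    if check_collection metabins4anvio.txt; then
        echo "collection file looks fine"
    fi
fi
```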

#### CONCOCT
This set of commands runs CONCOCT in its standard mode. It first creates a depth/coverage file for itself to use and then runs CONCOCT with the standard settings. This means the k-mer length is set to 4, the minimum contig length is 1000, and CONCOCT runs on the number of threads passed with -t.

CONCOCT creates its depth file from the coverage produced in the mapping step, so it is key that the mapping files are all in the correct places before proceeding with binning. It creates a single file, which is then used for the complete binning process. Keep in mind that binning might take a while, so be prepared to let this run overnight.
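
One thing worth checking before the coverage step: as far as I know, the CONCOCT documentation expects the BAM files to be sorted and indexed. The sketch below only checks that an index file sits next to each BAM. It does not verify sorting, and the function name `check_bam_indexes` is hypothetical.

```shell
#!/bin/bash
# check_bam_indexes DIR: report every *.bam in DIR that has no index next to it
# (either sample.bam.bai or sample.bai); returns non-zero if any is missing
check_bam_indexes() {
    local dir=$1 missing=0 bam
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue   # the glob matched nothing
        if [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
            echo "missing index: $bam"
            missing=1
        fi
    done
    return "$missing"
}
```

In a job script this could be called as `check_bam_indexes "$MAPPINGPATH" || exit 1` before the coverage table is built.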

IMPORTANT: the current version of CONCOCT is missing a vital file called libmkl.so. Without this file, CONCOCT will not start. You can fix this by installing the mkl package through Conda:

conda install mkl

Additionally, samtools will not work properly after a fresh CONCOCT install. The easiest way to fix this is to go to your environment where you installed CONCOCT and force an update through conda. 

### Cannot install concoct at the moment

In [None]:
#error installing concoct..everything needs different python and dependencies version so creating its own env
#conda create -n concoct python=3
conda activate concoct
conda install -c bioconda concoct
#error installing concoct - used this instead:
#conda install cython numpy scipy biopython pandas pip scikit-learn

In [None]:
#try finding scripts in directory:
#lib/python2.7/site-packages/concoct

cd /home/brooke_sienkiewicz_student_uml_edu/CONCOCT-1.1.0/
#trying to load possible dependencies that aren't automatically loaded?
module load python/3.8.12
module load openblas/0.3.21
module load py-pandas/1.5.3+py3.8.12
module load gcc/11.2.0
module load gsl/2.6

#ERROR: 
#You need to have Cython installed on your system to run setup.py. Sorry!
#$ pip install Cython
#Requirement already satisfied: Cython in /home/brooke_sienkiewicz_student_uml_edu/.conda/envs/concoct/lib/python3.6/site-packages (0.29.24)

In [None]:
# trying fresh env 
conda create -n concoct_env python=3
conda activate concoct_env
conda install -c bioconda -c conda-forge concoct

In [None]:
#!/bin/bash
#SBATCH -c 24  # Number of Cores per Task
#SBATCH --mem=50G  # Requested Memory
#SBATCH -p cpu  # Partition
#SBATCH -t 24:00:00  # Job time limit
#SBATCH --mail-type=ALL
#SBATCH -o /project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/bash_scripts/slurm-%j-concoct.out  # %j = job ID

module load miniconda/22.11.1-1
conda activate concoct_env


#set parameters
SAMPLENAME=mcav
MAPPINGPATH=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/working/mcav_assembly_redo
CONTIGPATH=/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/working/mcav_assembly_redo
CONTIGFILE=mcav.contigs-fixed.fa
OUTDIR=concoct_"$SAMPLENAME"_bins
TEMPDIR=concoct_"$SAMPLENAME"_temp


ABSPATH='/project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/working/binning_redo/concoct'
#creates the CONCOCT depth file
#this part cuts up the contigs into 10kb pieces for CONCOCT to use
#note: ${SAMPLENAME}_... needs the braces, otherwise bash reads $SAMPLENAME_contigs_cut as one (unset) variable
cut_up_fasta.py $CONTIGPATH/$CONTIGFILE -c 10000 -o 0 --merge_last -b $ABSPATH/"${SAMPLENAME}"_contigs_cut.bed > $ABSPATH/"${SAMPLENAME}"_contigs_cut.fa
#estimate contig coverage

concoct_coverage_table.py $ABSPATH/"${SAMPLENAME}"_contigs_cut.bed $MAPPINGPATH/*.bam > $ABSPATH/coverage_table_$SAMPLENAME.tsv || { echo 'Exit code 2: failed to create coverage file, exiting.' && exit 2; }

#CONCOCT script

mkdir -p $ABSPATH/$TEMPDIR
#output directory for the extracted bins
mkdir -p /project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/results/$OUTDIR

#run CONCOCT
concoct --composition_file $ABSPATH/"${SAMPLENAME}"_contigs_cut.fa --coverage_file $ABSPATH/coverage_table_$SAMPLENAME.tsv -t 3 -b $ABSPATH/$TEMPDIR || { echo 'Exit code 3: CONCOCT failed to run, exiting.' && exit 3; }
merge_cutup_clustering.py $ABSPATH/$TEMPDIR/clustering_gt1000.csv > $ABSPATH/$TEMPDIR/"${SAMPLENAME}"_clustering_merged.csv || { echo 'Exit code 4: failed to merge clusters, exiting.' && exit 4; }
extract_fasta_bins.py $CONTIGPATH/$CONTIGFILE $ABSPATH/$TEMPDIR/"${SAMPLENAME}"_clustering_merged.csv --output_path /project/pi_sarah_gignouxwolfsohn_uml_edu/brooke/results/$OUTDIR || { echo 'Exit code 5: Bins were not extracted, exiting.' && exit 5; }

#this runs CheckM immediately after and puts the results alongside your bins
#checkm lineage_wf -x fa -t 3 ./results/$OUTDIR ./results/$OUTDIR/bins-stats

#don't know if installing checkM will mess up the dependencies so don't want to mess with it yet

#bash script: concoct
#JOB ID:  15908004
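
A shell detail that matters when file names are built from a variable followed by an underscore: bash treats the underscore and the letters after it as part of the variable name unless the name is wrapped in braces. The short sketch below shows the difference using the sample name from this notebook.

```shell
#!/bin/bash
SAMPLENAME=mcav

# Without braces the shell looks up a variable called SAMPLENAME_contigs_cut,
# which is unset, so only the extension is left
unbraced="$SAMPLENAME_contigs_cut.bed"

# With braces the variable name ends at the closing brace
braced="${SAMPLENAME}_contigs_cut.bed"

echo "unbraced: '$unbraced'"   # prints: unbraced: '.bed'
echo "braced:   '$braced'"     # prints: braced:   'mcav_contigs_cut.bed'
```

A name like `coverage_table_$SAMPLENAME.tsv` is unaffected, because the dot cannot be part of a variable name.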

You can now proceed to actually running Metabinner. 

### Bin_refiner
This runs an ensemble binning software on your bins. There are two options here: one is to run all of the bins together and allow the software to choose a better set for you. Alternatively, you can use the second script, below the first one, to make different combinations of the three and then choose which two sets to combine into your perfect set of bins.

Very confused: there is no third binning script, I don't know where the DAS_Tool instructions are, and I don't know what BUSCO is.

In [None]:

module load miniconda/22.11.1-1
conda activate binning

SAMPLENAME=mcav
CONCOCTBINS=./results/concoct_"$SAMPLENAME"_bins
MetaBATBINS=./results/MetaBAT_"$SAMPLENAME"_bins_redo/Metabin_3
#EXTRABINNERBINS=/path/to/extra/binner/bins/
#did not add extra binning software - only have 2
INDIR=Refiner_"$SAMPLENAME"_bins
OUTDIR="$SAMPLENAME"_final_bins
#CHECKMPATH=/path/to/CHECKM/DATABASE

mkdir -p ./working/$INDIR/CONCOCT
mkdir -p ./working/$INDIR/MetaBAT
#mkdir -p ./working/$INDIR/Extrabinner   #no third binner here, so no folder for it
mkdir -p ./results/$OUTDIR

cp $CONCOCTBINS/*.fa ./working/$INDIR/CONCOCT/
cp $MetaBATBINS/*.fa ./working/$INDIR/MetaBAT/
#cp $EXTRABINNERBINS/*.fa ./working/$INDIR/Extrabinner/

INPUT=./working/$INDIR/
Binning_refiner -i $INPUT -p $SAMPLENAME -plot -m 50

mv "$SAMPLENAME"_Binning_refiner_outputs/"$SAMPLENAME"_refined_bins/*.fasta ./results/$OUTDIR

#run CheckM for quick assessment
checkm lineage_wf -x fasta -t 3 ./results/$OUTDIR ./results/$OUTDIR/bins-stats || { echo 'Exit code 5: CheckM failed' && exit 5; }

#this next bit runs busco on your final bins
conda deactivate
conda activate busco
busco -m genome -i ../data/results/$OUTDIR -o ../data/results/$OUTDIR/bins-stats-Busco --auto-lineage -c $NSLOTS --download-path $DOWNLOADPATH || { echo 'Exit code 6: BUSCO failed, exiting.' && exit 6; }

#Then we generate files for Anvi'O processing:
cd ../data/results/$OUTDIR/
#the next loop generates a txt file which contains all the necessary information for Anvi'O
for fasta_file in *.fasta
do
    #bin name = file name up to the first dot
    bin_name=$(echo "$fasta_file" | awk 'BEGIN{FS="."}{print $1}')
    #contig ID = first word of each header, with the leading > removed so Anvi'O recognises it
    #(the old sed 's/\>//g' did not remove it: in GNU sed \> is a word boundary)
    for contig_id in $(grep '>' "$fasta_file" | sed 's/^>//' | awk '{print $1}')
    do
        echo -e "$contig_id\t$bin_name"
    done
done > "$SAMPLENAME"_collection.txt

### Continuing and troubleshooting
You should now have several sets of bins, each created with a slightly different algorithm, and a single consolidated set produced by the ensemble step above (Binning_refiner in this notebook). It is now important to run the CheckM software with the script below and generate output files for all of them. This will tell you about the quality of your bins, in particular their completeness and contamination. After this, you can proceed to the "Refine Bins" part of the workflow.

CheckM runs a check against a database to determine the levels of completeness versus contamination. These statistics are vital in deciding how to proceed with refinement. CheckM will run without the database location being set, but the results are very confusing, so make sure you set it correctly before running it. The scripts above already run CheckM, and it's good to know that CheckM is the reason these scripts need to run on a himem node: it regularly spikes above the 16G of RAM used per node.
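
To turn these statistics into a shortlist, the CheckM summary can be filtered by completeness and contamination. The following is a minimal sketch, assuming the summary was saved as a tab-separated table with header columns named `Completeness` and `Contamination` (for example by adding `--tab_table -f checkm_summary.tsv` to `checkm lineage_wf`). The file name `checkm_summary.tsv`, the function name `filter_checkm` and the 50/10 thresholds in the usage line are assumptions for illustration, not values from this workflow.

```shell
#!/bin/bash
# filter_checkm TABLE MIN_COMPLETENESS MAX_CONTAMINATION
# Prints the IDs (first column) of bins that pass both thresholds.
# Columns are located by header name, so their order does not matter.
filter_checkm() {
    awk -F'\t' -v minc="$2" -v maxc="$3" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "Completeness")  comp = i
                if ($i == "Contamination") cont = i
            }
            next
        }
        comp && cont && $comp + 0 >= minc + 0 && $cont + 0 <= maxc + 0 { print $1 }
    ' "$1"
}
```

For example, `filter_checkm checkm_summary.tsv 50 10` would list bins that are at least 50% complete with at most 10% contamination.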

Annoyingly, MetaBAT won't create its own output directories. Make sure the directories are in place before it runs.
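
A small sketch of creating the directories up front, using the same names as the MetaBat2 script in this notebook. `mkdir -p` creates any missing parent directories and does nothing if they already exist, so it is safe to leave in the job script.

```shell
#!/bin/bash
SAMPLENAME=mcav
OUTDIR=MetaBAT_"$SAMPLENAME"_bins_redo

# -p creates missing parent directories and does not complain if they exist
mkdir -p ./results/"$OUTDIR"/Metabin_3 ./working/binning_redo
```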

CONCOCT will generally create more bins than MetaBAT2, but you can likely discard quite a few of them, since they will be only around 3000 bp long, which is not a lot (although such a bin could be a viral sequence).
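
To find those very small bins, the total length of each bin can be computed directly from its FASTA file. This is a minimal sketch, assuming the bins are `.fa` files in one directory. The function names and any size cutoff you pass in are hypothetical choices, not settings from CONCOCT.

```shell
#!/bin/bash
# bin_size FILE: total number of bases in a FASTA file (header lines excluded)
bin_size() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# list_small_bins DIR MIN_BP: print bins in DIR whose total length is below MIN_BP
list_small_bins() {
    local f size
    for f in "$1"/*.fa; do
        [ -e "$f" ] || continue   # the glob matched nothing
        size=$(bin_size "$f")
        if [ "$size" -lt "$2" ]; then
            printf '%s\t%s\n' "$f" "$size"
        fi
    done
}
```

The output is a list to review by hand rather than delete automatically, since a short bin might still be a viral sequence.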

Congratulations! You have finished binning. The bins you have produced are considered putative genomes and can be used for a fair number of analyses, some of which I have listed in the Anvi'O notebook and others in the Analysis notebook. Good luck!