# Comparative metagenomics using SIMKA

This notebook walks you through running a de novo comparative metagenomics analysis with SIMKA.

Step 1: Run SIMKA on the read files for your set


## Getting Started

You will need to rerun this section each time you come back to this notebook to reset all directories and variables.

In [None]:
# Set your netid and the ID of your sample set
netid = "YOUR_NETID"
setid = "YOUR_SET"

In [None]:
# Go into the working directory
work_dir = "/xdisk/bhurwitz/bh_class/" + netid + "/assignments/20_kmer_comparisons"
%cd $work_dir

## Creating a config file
Let's create a config file with all of the variables we will need in the scripts below. Then when we want to use these variables in the script, we will "source" the config file to set the variables.
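
As a rough sketch of the idea, the same config file could be rendered from Python in one step instead of a series of `echo` commands. The helper name `render_config` is hypothetical and not part of the assignment; the paths and variable names are the ones used in the cell below.

```python
def render_config(netid, setid, kmer=31):
    """Return the text of a config.sh-style file as one string."""
    # Same working directory layout as the notebook cell below.
    work_dir = f"/xdisk/bhurwitz/bh_class/{netid}/assignments/20_kmer_comparisons"
    settings = {
        "NETID": netid,
        "SETID": setid,
        "WORK_DIR": work_dir,
        "DATA_DIR": f"{work_dir}/data",
        "SIMKA": "/contrib/singularity/shared/bhurwitz/simka:1.5.3--hdcf5f25_4.sif",
        "KMER": str(kmer),
    }
    # One "export NAME=value" line per setting, in insertion order.
    return "".join(f"export {name}={value}\n" for name, value in settings.items())


print(render_config("YOUR_NETID", "YOUR_SET"), end="")
```

Once a script runs `source config.sh`, each exported name is available as a shell variable, for example `${WORK_DIR}` or `${KMER}`.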

In [None]:
# create a config file with all of the variables you need
!echo "export NETID=$netid" > config.sh
!echo "export SETID=$setid" >> config.sh
!echo "export WORK_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/20_kmer_comparisons" >> config.sh
!echo "export DATA_DIR=/xdisk/bhurwitz/bh_class/$netid/assignments/20_kmer_comparisons/data" >> config.sh
!echo "export SIMKA=/contrib/singularity/shared/bhurwitz/simka:1.5.3--hdcf5f25_4.sif" >> config.sh
!echo "export KMER=31" >> config.sh

In [None]:
# check the config file to be sure it is correct
# Are your netid and setid correct? Do you have the right directories?
!cat config.sh

## Step 1: Running SIMKA on the fastq files for all of the samples in your set

In this step, we will run SIMKA to do an all-vs-all sequence comparison of all of the read files in your set.
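
To make the all-vs-all comparison concrete: SIMKA compares samples by the k-mers found in their reads and reports ecological distances such as Bray-Curtis. The toy sketch below is not SIMKA's implementation. The function names and the example reads are made up for illustration, and it uses k=3 rather than the k=31 set in the config file.

```python
from collections import Counter
from itertools import combinations


def count_kmers(reads, k):
    """Count every overlapping k-mer across a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two k-mer count profiles."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # two empty profiles are treated as identical
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1 - 2 * shared / total


# Hypothetical toy samples; real inputs would be fastq read files.
samples = {
    "A": ["ACGTAC", "GTACGT"],
    "B": ["ACGTAC", "TTTTTT"],
    "C": ["GGGCCC"],
}
profiles = {name: count_kmers(reads, k=3) for name, reads in samples.items()}

# All-vs-all: every pair of samples is compared once.
for x, y in combinations(profiles, 2):
    print(x, y, bray_curtis(profiles[x], profiles[y]))
```

In this toy data, A and B share half of their k-mer counts, so their dissimilarity is 0.5, while C shares no k-mers with either and sits at the maximum distance of 1.0.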


In [None]:
# Create a script to run SIMKA on all fastq files
# (raw string, so the trailing backslashes are written to the script as line continuations)
my_code = r'''#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4        
#SBATCH --partition=standard
#SBATCH --account=bhurwitz                       
#SBATCH --output=Job-simka.out
#SBATCH --mem=24gb
#SBATCH --time=48:00:00 

pwd; hostname; date

module load python/3.9/3.9.10
module load R/4.2.2

source $SLURM_SUBMIT_DIR/config.sh

cd ${WORK_DIR}
FILE_LIST=${SETID}_list
TEMP=${SETID}_temp

### run simka

apptainer run ${SIMKA} simka \
-kmer-size ${KMER} \
-in ${FILE_LIST} \
-out simka_${KMER}_results \
-out-tmp ${TEMP} \
-nb-cores 128 \
-max-memory 768000 \
-count-file ./simka_count.sh \
-merge-file ./simka_merge.sh \
-count-cmd 'sbatch --partition=standard --account=bhurwitz' \
-merge-cmd 'sbatch --partition=standard --account=bhurwitz' \
-max-count 32 \
-max-merge 32

### run visualization

apptainer run ${SIMKA} python scripts/visualization/run-visualization.py \
-in simka_${KMER}_results \
-out simka_${KMER}_figures \
-pca -heatmap -tree

echo "Finished `date`"

'''

with open('run_simka.sh', mode='w') as file:
    file.write(my_code)

In [None]:
# Check the code and make sure your script above was created.
!cat run_simka.sh

In [None]:
# you should be in your working directory when you run this script
# do you see your config.sh file and the run_simka.sh script?
!pwd
!ls
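
The job script reads its inputs from `${SETID}_list`, so that file also needs to be present in the working directory. If you ever need to build such a list yourself, the sketch below shows one way to do it. It assumes SIMKA's input format is one sample per line, written as `ID: path`, with the two files of a paired-end sample separated by ` ; `. It also assumes a hypothetical naming scheme of `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`. Check both assumptions against the SIMKA documentation and your own data before relying on it.

```python
from pathlib import PurePosixPath


def simka_list_lines(fastq_paths):
    """Group fastq paths by sample ID and format one list line per sample."""
    by_sample = {}
    for path in sorted(fastq_paths):
        # Sample ID = file name up to the first dot, minus a _1/_2 mate suffix.
        sample = PurePosixPath(path).name.split(".")[0]
        if sample.endswith(("_1", "_2")):
            sample = sample[:-2]
        by_sample.setdefault(sample, []).append(path)
    # Paired files are joined with " ; " (assumed SIMKA paired-end syntax).
    return [f"{sample}: {' ; '.join(paths)}" for sample, paths in by_sample.items()]


# Hypothetical file names for illustration only.
example_paths = [
    "data/S2_1.fastq.gz",
    "data/S1_2.fastq.gz",
    "data/S1_1.fastq.gz",
    "data/S2_2.fastq.gz",
    "data/S3.fastq.gz",
]
for line in simka_list_lines(example_paths):
    print(line)
```

Writing the returned lines to a file named after your set ID, one per line, would give the script something to read as `FILE_LIST`.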

In [None]:
# Let's use sbatch to submit the SIMKA job
!sbatch run_simka.sh

In [None]:
# Welcome back
# You can check if it is running using the squeue command
# Check for all jobs under your netid
!squeue --user=$netid

In [None]:
# You can check to see if there are any errors by looking at one of the job output files
!cat Job-simka.out

In [None]:
# check to make sure the SIMKA results directory was created and contains output files
!ls -l *results
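
Once the results directory exists, you may want to look at a distance matrix from Python. The sketch below parses a small in-memory example. It assumes the matrix is delimited text with sample IDs in the header row and in the first column, and that the delimiter is a semicolon. Both are assumptions to verify against your actual output files. The function name `parse_distance_matrix` is hypothetical.

```python
import csv
import io


def parse_distance_matrix(text, delimiter=";"):
    """Parse a labeled square matrix into a nested dict: matrix[row][col]."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    header = rows[0][1:]  # first cell of the header row is the empty corner
    matrix = {}
    for row in rows[1:]:
        matrix[row[0]] = {col: float(value) for col, value in zip(header, row[1:])}
    return matrix


# Made-up values for illustration; not real SIMKA output.
example = ";A;B\nA;0.0;0.5\nB;0.5;0.0\n"
matrix = parse_distance_matrix(example)
print(matrix["A"]["B"])
```

For a gzip-compressed file, the text could be read with `gzip.open(path, "rt")` from the standard library before being passed to the function.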

## Final Step
Copy your notebook to the current working directory

In [None]:
!cp ~/hw20_kmer_comparison.ipynb $work_dir