# Running BIGSI

Now let's run BIGSI on our data.

First, let's set up some directories.

In [14]:
data_dir=data-downsampled
kmer_dir=kmer-downsampled
kmer_counts_dir=kmer-counts-mccortex
kmers_input_sizes="9"

PROJECT_DIR=`git rev-parse --show-toplevel`
cd $PROJECT_DIR

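The code below assumes you have [conda](https://docs.conda.io/en/latest/) environments set up for [mccortex](https://github.com/mcveanlab/mccortex) and [bigsi](https://github.com/Phelimb/BIGSI). The `run_mccortex` function defined later expects the mccortex environment to be named `mccortex`. The `jellyfish` and `bigsi` environments used in the checks below can be created with: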

```bash
conda create --name jellyfish jellyfish
conda create --name bigsi bigsi
```

Let's verify these commands exist (and verify versions).

In [2]:
conda run --name jellyfish jellyfish --version
conda run --name bigsi bigsi bloom --help

jellyfish 2.2.8
usage: bigsi-v0.3.1 bloom [-h] [-c CONFIG] ctx outfile

Creates a bloom filter from a sequence file or cortex graph.
(fastq,fasta,bam,ctx) e.g. index insert ERR1010211.ctx

positional arguments:
  ctx
  outfile

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG


Once this step is complete, we need the maximum number of kmers in each of our three datasets. This value is passed to `mccortex` to set the hash size. Let's define a bash function for this.

## Find max data kmer cardinality

In [3]:
# Purpose: Gets max kmer cardinality estimates on sequence reads.
# Args:
#      input_dir: The input directory containing all the kmer counts.
#      kmer_size: The kmer_size to find the max.
# Output: Prints the maximum kmer count for the kmer size
#         in this directory (files named like `kmer-9.tsv`).
get_max_kmer_cardinality() {
    input_dir=$1
    kmer_size=$2
    
    cut -f 2 "${input_dir}/kmer-${kmer_size}.tsv" | sort -n | tail -n 1 | awk '{print int($1+0.5)}'
}

Let's test this code out.

In [4]:
get_max_kmer_cardinality "microbial/${kmer_dir}" "31"
get_max_kmer_cardinality "human/${kmer_dir}" "31"
get_max_kmer_cardinality "metagenomics/${kmer_dir}" "31"

38109197
50876500
78652018
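
To make the expected input format concrete, here is a minimal sketch that runs the same function on a tiny synthetic file. The sample names and values are made up for illustration, and the meaning of the first column is an assumption: the function only reads the second tab-separated column.

```bash
# Same function as above, repeated so this sketch runs on its own.
get_max_kmer_cardinality() {
    input_dir=$1
    kmer_size=$2

    cut -f 2 "${input_dir}/kmer-${kmer_size}.tsv" | sort -n | tail -n 1 | awk '{print int($1+0.5)}'
}

# Hypothetical input: the first column (a sample name here) is an assumption;
# only the second column is read by the function.
tmp_dir=$(mktemp -d)
printf 'sampleA\t1200.4\nsampleB\t980.0\nsampleC\t1530.6\n' > "${tmp_dir}/kmer-9.tsv"

# Largest value in column 2, rounded to the nearest integer.
max_kmer=$(get_max_kmer_cardinality "${tmp_dir}" 9)
echo "${max_kmer}"

rm -r "${tmp_dir}"
```

With this input the largest estimate is 1530.6, so the function prints 1531.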


Great. Now that we have this set up, we can define a function that counts and lists all kmers in the dataset using `mccortex`.

## Bash mccortex kmer count function

In [17]:
# Purpose: Runs mccortex on sequence reads to count kmers.
# Args:
#      type_dir: The input type directory (e.g., microbial).
#      output_dir: A directory to save the mccortex output into.
#      kmer_sizes: The kmer_sizes to run (separated by spaces).
#      mccortex_pe_param: The parameter used depending on paired-end/single-end data.
# Output: mccortex kmer counts in the passed output directory.
run_mccortex() {
    type_dir=$1
    output_dir=$2
    kmer_sizes=$3
    mccortex_pe_param=$4
    
    input_dir=${type_dir}/${data_dir}
    
    threads=1
    
    rm -rf ${output_dir}
    mkdir ${output_dir}
    
    before=`date +%s`
    
    for kmer_size in ${kmer_sizes}
    do
        # Find max kmers
        max_kmer=`get_max_kmer_cardinality "${type_dir}/${kmer_dir}" "${kmer_size}"`
        
        # Hash size is 10x the max kmers in dataset.
        hash_size=`echo "10*${max_kmer}" | bc`
        
        output_dir_kmer=${output_dir}/${kmer_size}
        mkdir ${output_dir_kmer}
    
        #for file in ${input_dir}/*.fastq.gz
        for file in ${input_dir}/ERR1144976.fastq.gz ${input_dir}/SRR10512965.fastq.gz
        do
            accession=`basename ${file} .fastq.gz`

            mccortex_out=${output_dir_kmer}/${accession}.ctx
            mccortex_log=${output_dir_kmer}/mccortex.count.${accession}.log
        
            command="/usr/bin/time -v mccortex ${kmer_size} build --nkmers ${hash_size} --threads ${threads} --kmer ${kmer_size} \
                --sample ${accession} ${mccortex_pe_param} ${file} ${mccortex_out} 2> ${mccortex_log}.err 1> ${mccortex_log}"
            echo ${command}
            conda run --name mccortex ${command}
        done
    done
    
    after=`date +%s`
    minutes=`echo "(${after}-${before})/60" | bc -l`
    printf "Done. Took %0.2f minutes." ${minutes}
}
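
As a small illustration of how `run_mccortex` derives its per-sample output names, the sketch below repeats just the path-building steps for one input file, without running mccortex. The directory and file names are the ones used in this notebook.

```bash
# Values matching the microbial run in this notebook.
output_dir="microbial/kmer-counts-mccortex"
kmer_size=9
file="microbial/data-downsampled/ERR1144976.fastq.gz"

# Strip the directory and the .fastq.gz suffix to get the accession.
accession=$(basename "${file}" .fastq.gz)

# Each kmer size gets its own subdirectory holding the graph and log files.
output_dir_kmer="${output_dir}/${kmer_size}"
mccortex_out="${output_dir_kmer}/${accession}.ctx"
mccortex_log="${output_dir_kmer}/mccortex.count.${accession}.log"

echo "${accession}"
echo "${mccortex_out}"
echo "${mccortex_log}.err"
```

Standard output goes to the `.log` file and standard error to the `.log.err` file, so each accession produces three files per kmer size.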

Now that we've got our mccortex code defined, let's run it on a dataset.

## Microbial mccortex

In [18]:
input_dir_type="microbial"
run_mccortex ${input_dir_type} "${input_dir_type}/${kmer_counts_dir}" "${kmers_input_sizes}" "--seqi"

/usr/bin/time -v mccortex 9 build --nkmers 1310850 --threads 1 --kmer 9 --sample ERR1144976 --seqi microbial/data-downsampled/ERR1144976.fastq.gz microbial/kmer-counts-mccortex/9/ERR1144976.ctx 2> microbial/kmer-counts-mccortex/9/mccortex.count.ERR1144976.log.err 1> microbial/kmer-counts-mccortex/9/mccortex.count.ERR1144976.log
/usr/bin/time -v mccortex 9 build --nkmers 1310850 --threads 1 --kmer 9 --sample SRR10512965 --seqi microbial/data-downsampled/SRR10512965.fastq.gz microbial/kmer-counts-mccortex/9/SRR10512965.ctx 2> microbial/kmer-counts-mccortex/9/mccortex.count.SRR10512965.log.err 1> microbial/kmer-counts-mccortex/9/mccortex.count.SRR10512965.log
Done. Took 0.50 minutes.
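
The `--nkmers 1310850` value in the commands above comes from the 10x rule inside `run_mccortex`. A minimal sketch of that arithmetic follows. The maximum of 131085 is not printed anywhere in this notebook. It is inferred here by dividing the logged `--nkmers` value by 10. The sketch also uses shell arithmetic rather than `bc`, which is equivalent for integers.

```bash
# Inferred maximum kmer count for k=9 (assumption: 1310850 / 10).
max_kmer=131085

# Hash size is 10x the max kmers in the dataset.
hash_size=$(( 10 * max_kmer ))

echo "${hash_size}"
```

This prints 1310850, matching the logged commands.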

Alright. It's all finished. Let's look at some of the output files.

In [23]:
ls -lh ${input_dir_type}/${kmer_counts_dir}/9 | head -n 5
echo -e "kmer\ttotal size"
for kmer in ${kmers_input_sizes}
do
    size=`du -ch ${input_dir_type}/${kmer_counts_dir}/${kmer}/*.ctx | grep total | sed -e 's/total//'`
    echo -e "${kmer}\t${size}"
done

total 3.3M
-rw-r--r-- 1 apetkau grp_apetkau 1.6M Dec 10 12:11 ERR1144976.ctx
-rw-r--r-- 1 apetkau grp_apetkau    0 Dec 10 12:11 mccortex.count.ERR1144976.log
-rw-r--r-- 1 apetkau grp_apetkau 4.0K Dec 10 12:11 mccortex.count.ERR1144976.log.err
-rw-r--r-- 1 apetkau grp_apetkau    0 Dec 10 12:11 mccortex.count.SRR10512965.log
kmer	total size
9	3.3M	
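
The `.log.err` files are non-empty because `/usr/bin/time -v` writes its resource report to standard error. The sketch below shows one way to pull the peak memory and wall-clock time out of such a file. The log here is a hand-written stand-in, not real output from this run: the two lines follow the GNU `time -v` report layout, and the values are made up.

```bash
# Stand-in for a mccortex.count.<accession>.log.err file.
# The values below are hypothetical and only illustrate the layout.
err_log=$(mktemp)
cat > "${err_log}" <<'EOF'
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.20
	Maximum resident set size (kbytes): 52340
EOF

# In both lines the value is the last whitespace-separated field.
max_rss_kb=$(grep "Maximum resident set size" "${err_log}" | awk '{print $NF}')
elapsed=$(grep "Elapsed (wall clock) time" "${err_log}" | awk '{print $NF}')

printf 'max_rss_kb\t%s\n' "${max_rss_kb}"
printf 'elapsed\t%s\n' "${elapsed}"

rm "${err_log}"
```

Pointing `err_log` at a real file such as `microbial/${kmer_counts_dir}/9/mccortex.count.ERR1144976.log.err` should work the same way, assuming GNU `time` produced the report.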


In [22]:
cat microbial/${kmer_counts_dir}/9/mccortex.count.ERR1144976.log.err

