# Experiment: BIGSI Indexing

Now let's run BIGSI to index our data.

First, let's set up some directories.

In [1]:
fastq_data_dir=data-downsampled
data_dir=kmer-counts-jellyfish
bigsi_dir=bigsi
kmer_size="11"

threads=1

PROJECT_DIR=`git rev-parse --show-toplevel`
cd $PROJECT_DIR

The code below assumes you have a [conda](https://docs.conda.io/en/latest/) environment set up with both [mccortex](https://github.com/mcveanlab/mccortex) and [bigsi](https://github.com/Phelimb/BIGSI) installed. This can be done with:

```bash
# Create a single environment containing both tools
# (running `conda create` twice with the same name would not add to the first environment)
conda create --name bigsi_mccortex mccortex bigsi
```

Let's verify that these commands exist (and check their versions).

In [2]:
conda run --name bigsi_mccortex mccortex 31 build --help 2>&1 | grep 'mccortex=v'
conda run --name bigsi_mccortex bigsi bloom --help 2>&1 | grep 'bigsi-v'

[11 Dec 2019 23:19:22-sUS][version] mccortex=v0.0.3-610-g400c0e3 zlib=1.2.11 htslib=1.8-17-g699ed53 ASSERTS=ON hash=Lookup3 CHECKS=ON k=3..31
[11 Dec 2019 23:19:22-sUS][version] mccortex=v0.0.3-610-g400c0e3 zlib=1.2.11 htslib=1.8-17-g699ed53 ASSERTS=ON hash=Lookup3 CHECKS=ON k=3..31
usage: bigsi-v0.3.1 bloom [-h] [-c CONFIG] ctx outfile


Great. Now let's figure out the optimal Bloom filter sizes and hash functions for BIGSI and HowDeSBT.

## Bloom filter sizes

First, let's pull out the maximum number of unique (canonical) kmers from each dataset type.

In [3]:
(echo -e "data_type\tkmer_size\tmax_kmers"
for data_type in microbial metagenomics human
do
    # Get max kmer counts from kmers counted by jellyfish (for the kmer size set in ${kmer_size})
    # tail -n+2 removes header line (first line) from data
    max_kmer_count=`tail -n+2 ${data_type}/${data_dir}/${kmer_size}/kmer-counts.tsv | sort -k2,2n | cut -f 2 | tail -n 1`
    echo -e "${data_type}\t${kmer_size}\t${max_kmer_count}"
done) | sort -k3,3n | column -s$'\t' -t

data_type     kmer_size  max_kmers
human         11         1618435
microbial     11         1621455
metagenomics  11         1892593


Let's also pull out the estimated union of all k-mers across all datasets.

In [4]:
(echo -e "data_type\tkmer_size\tunion_kmers"
for data_type in microbial metagenomics human
do
    count=`sed -e 's/Estimated number of unique exact matches: //' ${data_type}/${fastq_data_dir}/total-unique-kmers-${kmer_size}.txt | \
    awk '{print int($1+0.5)}'`
    
    echo -e "${data_type}\t${kmer_size}\t${count}"
done) | sort -k3,3n | column -s$'\t' -t

data_type     kmer_size  union_kmers
human         11         2093361
microbial     11         2095943
metagenomics  11         2097351
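
As a rough sanity check on these estimates, it helps to know the ceiling on the number of distinct canonical k-mers. For odd k, no k-mer is its own reverse complement, so the ceiling is 4^k / 2. The sketch below computes it with bash arithmetic; the function name `max_canonical_kmers` is ours and not part of any tool used here.

```bash
# Hypothetical helper (not part of the original notebook).
# Upper bound on distinct canonical k-mers; only valid for odd k,
# where every k-mer pairs with a different reverse complement.
max_canonical_kmers() {
    local k=$1
    echo $(( 4 ** k / 2 ))
}

max_canonical_kmers 11   # prints 2097152
```

The union estimates above are all within about 0.2% of this ceiling, which would suggest that nearly every possible 11-mer appears somewhere in each collection.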


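It's the union of unique kmers that we'll use to set the Bloom filter sizes, rounded up to a convenient number of bits (for kmer size 11, 2100000 bits sits just above the largest union estimate). We'll set these as follows.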

In [5]:
# For kmer size 9
#microbial_bits=140000
#human_bits=140000
#metagenomics_bits=140000
# For kmer size 11
microbial_bits=2100000
human_bits=2100000
metagenomics_bits=2100000
# For kmer size 15
#microbial_bits=61000000
#human_bits=72000000
#metagenomics_bits=210000000
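
To get a feel for what a given bit count implies, the standard Bloom filter approximation p = (1 - e^(-h*n/m))^h can be evaluated for n inserted kmers, m bits, and h hash functions. The sketch below is only an illustration under that textbook formula (it assumes independent, uniform hashes); `bloom_fpr` is a hypothetical helper name, not a BIGSI command.

```bash
# Hypothetical helper (not part of the original notebook).
# Approximate Bloom filter false-positive rate:
#   p = (1 - exp(-h * n / m)) ^ h
# n = number of inserted kmers, m = bits, h = number of hash functions
bloom_fpr() {
    local n=$1 m=$2 h=$3
    LC_ALL=C awk -v n="${n}" -v m="${m}" -v h="${h}" \
        'BEGIN { printf "%.4f\n", (1 - exp(-h * n / m)) ^ h }'
}

# Example: largest microbial sample from the table above, with the kmer size 11 settings
bloom_fpr 1621455 "${microbial_bits:-2100000}" 1
```

Passing the per-sample maximum rather than the union is deliberate here, since each sample gets its own Bloom filter.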

## Create BIGSI config file

In [6]:
create_bigsi_config() {
    kmer_size_config=$1
    bits=$2
    hashes=$3
    output_dir_config=$4

echo "## Example config using berkeleyDB
h: ${hashes}
k: ${kmer_size_config}
m: ${bits}
storage-engine: berkeleydb
storage-config:
  filename: ${output_dir_config}/kmer${kmer_size_config}-bits${bits}-hashes${hashes}-bigsi.db
  flag: "c" ## Change to 'r' for read-only access
" > ${output_dir_config}/berkelydb.yaml
}

In [7]:
data_type=microbial
output_dir=${data_type}/${bigsi_dir}/${kmer_size}
mkdir -p ${output_dir}
create_bigsi_config "${kmer_size}" "${microbial_bits}" "1" "${output_dir}"
ls -l ${output_dir}/berkelydb.yaml

data_type=metagenomics
output_dir=${data_type}/${bigsi_dir}/${kmer_size}
mkdir -p ${output_dir}
create_bigsi_config "${kmer_size}" "${metagenomics_bits}" "1" "${output_dir}"
ls -l ${output_dir}/berkelydb.yaml

data_type=human
output_dir=${data_type}/${bigsi_dir}/${kmer_size}
mkdir -p ${output_dir}
create_bigsi_config "${kmer_size}" "${human_bits}" "1" "${output_dir}"
ls -l ${output_dir}/berkelydb.yaml

-rw-r--r-- 1 apetkau grp_apetkau 216 Dec 11 23:19 microbial/bigsi/11/berkelydb.yaml
-rw-r--r-- 1 apetkau grp_apetkau 219 Dec 11 23:19 metagenomics/bigsi/11/berkelydb.yaml
-rw-r--r-- 1 apetkau grp_apetkau 212 Dec 11 23:19 human/bigsi/11/berkelydb.yaml
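
Before moving on, it can be worth confirming that each generated file contains the keys written by `create_bigsi_config`. The following is a minimal sketch that only greps for the top-level keys shown in the function above; `check_bigsi_config` is a hypothetical helper and does not validate the YAML itself.

```bash
# Hypothetical helper (not part of the original notebook).
# Checks that a generated config has the top-level keys written by create_bigsi_config.
check_bigsi_config() {
    local config=$1
    local key
    for key in h k m storage-engine; do
        if ! grep -q "^${key}:" "${config}"; then
            echo "missing key: ${key} in ${config}" >&2
            return 1
        fi
    done
    echo "ok: ${config}"
}

# Example (path taken from the listing above):
# check_bigsi_config microbial/bigsi/11/berkelydb.yaml
```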


With the config files in place, we now need to define the bash function that constructs the BIGSI Bloom filters. BIGSI requires a cortex file as input, which can be generated by first running the kmer list from the previous `jellyfish` step through `mccortex`.

We will first do this, then generate the BIGSI Bloom filters.

## BIGSI Bloom filter bash function

In [8]:
run_bigsi_bloom() {
    type_dir=$1
    output_dir=$2
    nkmers=$3
    mem=$4
    jobs=$5
    
    input_dir=${type_dir}/${data_dir}/${kmer_size}
    
    export BIGSI_CONFIG=${output_dir}/berkelydb.yaml
            
    before=`date +%s`
    
    rm ${output_dir}/mccortex.*
    rm -rf ${output_dir}/bigsi.*
    rm ${output_dir}/*bigsi.db

    commands_file=`mktemp`
    
    for file in ${input_dir}/*.kmer.gz
    do
        accession=`basename ${file} .kmer.gz`

        mccortex_out=${output_dir}/mccortex.${accession}.ctx
        mccortex_log=${output_dir}/mccortex.count.${accession}.log
        
        bigsi_out=${output_dir}/bigsi.${accession}.bloom
        bigsi_log=${output_dir}/bigsi.${accession}.bloom.log

        command="/usr/bin/time -v mccortex ${kmer_size} build --nkmers ${nkmers} --threads ${threads} --kmer ${kmer_size} \
            --mem ${mem} --sample ${accession} --seq ${file} ${mccortex_out} 2> ${mccortex_log}.err 1> ${mccortex_log} && \
            /usr/bin/time -v bigsi bloom ${mccortex_out} ${bigsi_out} 2> ${bigsi_log}.err 1> ${bigsi_log}"
        echo ${command} >> ${commands_file}
    done
    
    echo "Will run commands (mccortex and bigsi bloom) from [${commands_file}] like:"
    head -n 1 ${commands_file}
    command="parallel -j ${jobs} -a ${commands_file}"
    echo -e "\n${command}"
    conda run --name bigsi_mccortex ${command}
    
    echo -e "\nNow, let's merge all these files together into a single BIGSI database."
    bigsi_merge_log=${output_dir}/bigsi.build.log
    files=`echo -n ${output_dir}/*.bloom`
    samples=`for file in ${files}; do echo -n "-s "; basename ${file} .bloom | sed -e 's/^bigsi\.//'; done`
    command="/usr/bin/time -v bigsi build ${files} ${samples} 2> ${bigsi_merge_log}.err 1> ${bigsi_merge_log}"
    echo ${command}
    conda run --name bigsi_mccortex ${command}
    
    after=`date +%s`
    minutes=`echo "(${after}-${before})/60" | bc -l`
    printf "Done. Took %0.2f minutes." ${minutes}
}
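
Because the per-sample commands run under `parallel` with their output redirected to log files, a failed sample is easy to miss. One way to catch this, sketched below under the file-naming scheme used in `run_bigsi_bloom`, is to list the accessions that have a `.kmer.gz` input but no non-empty `.bloom` output. The helper name `missing_blooms` is hypothetical.

```bash
# Hypothetical helper (not part of the original notebook).
# Prints accessions that have an input <accession>.kmer.gz but no
# non-empty bigsi.<accession>.bloom in the output directory.
missing_blooms() {
    local input_dir=$1 output_dir=$2
    local file accession
    for file in "${input_dir}"/*.kmer.gz; do
        [ -e "${file}" ] || continue   # glob matched nothing
        accession=$(basename "${file}" .kmer.gz)
        [ -s "${output_dir}/bigsi.${accession}.bloom" ] || echo "${accession}"
    done
}

# Example (paths follow the layout used above):
# missing_blooms microbial/kmer-counts-jellyfish/11 microbial/bigsi/11
```

An empty result means every input produced a Bloom filter; any accession printed points to the matching `.log.err` files for details.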

Now that we've got our mccortex and BIGSI code defined, let's run it on a dataset.

## Microbial bigsi

In [9]:
input_dir_type="microbial"
run_bigsi_bloom "${input_dir_type}" "${input_dir_type}/${bigsi_dir}/${kmer_size}" "${microbial_bits}" "3G" "24"

rm: cannot remove 'microbial/bigsi/11/mccortex.*': No such file or directory
rm: cannot remove 'microbial/bigsi/11/*bigsi.db': No such file or directory
Will run commands (mccortex and bigsi bloom) from [/tmp/tmp.Mu7frCOjX8] like:
/usr/bin/time -v mccortex 11 build --nkmers 2100000 --threads 1 --kmer 11 --mem 3G --sample ERR1144974 --seq microbial/kmer-counts-jellyfish/11/ERR1144974.kmer.gz microbial/bigsi/11/mccortex.ERR1144974.ctx 2> microbial/bigsi/11/mccortex.count.ERR1144974.log.err 1> microbial/bigsi/11/mccortex.count.ERR1144974.log && /usr/bin/time -v bigsi bloom microbial/bigsi/11/mccortex.ERR1144974.ctx microbial/bigsi/11/bigsi.ERR1144974.bloom 2> microbial/bigsi/11/bigsi.ERR1144974.bloom.log.err 1> microbial/bigsi/11/bigsi.ERR1144974.bloom.log

parallel -j 24 -a /tmp/tmp.Mu7frCOjX8

Now, let's merge all these files together into a single BIGSI database.
/usr/bin/time -v bigsi build microbial/bigsi/11/bigsi.ERR1144974.bloom microbial/bigsi/11/bigsi.ERR1144975.bloom microbial/b

Alright. It's all finished. Let's look at some of the output files.

In [10]:
du -sh ${input_dir_type}/${bigsi_dir}/${kmer_size}/*.bloom | head -n 5

264K	microbial/bigsi/11/bigsi.ERR1144974.bloom
264K	microbial/bigsi/11/bigsi.ERR1144975.bloom
264K	microbial/bigsi/11/bigsi.ERR1144976.bloom
264K	microbial/bigsi/11/bigsi.ERR1144977.bloom
264K	microbial/bigsi/11/bigsi.ERR1144978.bloom
du: write error


These contain the individual BIGSI Bloom filters.

In [11]:
du -sh ${input_dir_type}/${bigsi_dir}/${kmer_size}/*.ctx | head -n 5

14M	microbial/bigsi/11/mccortex.ERR1144974.ctx
14M	microbial/bigsi/11/mccortex.ERR1144975.ctx
14M	microbial/bigsi/11/mccortex.ERR1144976.ctx
14M	microbial/bigsi/11/mccortex.ERR1144977.ctx
14M	microbial/bigsi/11/mccortex.ERR1144978.ctx
du: write error


These are the intermediate cortex graphs.

In [12]:
ls -lh ${input_dir_type}/${bigsi_dir}/${kmer_size}/*bigsi.db

-rw-r--r-- 1 apetkau grp_apetkau 81M Dec 11 23:20 microbial/bigsi/11/kmer11-bits2100000-hashes1-bigsi.db


This is the final BIGSI database.

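Let's now measure how much disk space the intermediate files and the final database take up. Note that `du` reports allocated disk blocks, which can differ from the apparent file size shown by `ls -lh`.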

In [13]:
du -mc ${input_dir_type}/${bigsi_dir}/${kmer_size}/{*.ctx,*.bloom} | 
    grep 'total' | 
    sed -e 's/\ttotal$/ total intermediate (MB)/' | 
    tee ${input_dir_type}/${bigsi_dir}/${kmer_size}/bigsi-total-disk.txt
    
du -mc ${input_dir_type}/${bigsi_dir}/${kmer_size}/*bigsi.db |
    grep 'total' | 
    sed -e 's/\ttotal$/ total database (MB)/' |
    tee -a ${input_dir_type}/${bigsi_dir}/${kmer_size}/bigsi-total-disk.txt

754 total intermediate (MB)
68 total database (MB)


## Metagenomics bigsi

In [14]:
input_dir_type="metagenomics"
run_bigsi_bloom "${input_dir_type}" "${input_dir_type}/${bigsi_dir}/${kmer_size}" "${metagenomics_bits}" "10G" "8"

rm: cannot remove 'metagenomics/bigsi/11/mccortex.*': No such file or directory
rm: cannot remove 'metagenomics/bigsi/11/*bigsi.db': No such file or directory
Will run commands (mccortex and bigsi bloom) from [/tmp/tmp.PyLXAuc2pX] like:
/usr/bin/time -v mccortex 11 build --nkmers 2100000 --threads 1 --kmer 11 --mem 10G --sample ERR1713331 --seq metagenomics/kmer-counts-jellyfish/11/ERR1713331.kmer.gz metagenomics/bigsi/11/mccortex.ERR1713331.ctx 2> metagenomics/bigsi/11/mccortex.count.ERR1713331.log.err 1> metagenomics/bigsi/11/mccortex.count.ERR1713331.log && /usr/bin/time -v bigsi bloom metagenomics/bigsi/11/mccortex.ERR1713331.ctx metagenomics/bigsi/11/bigsi.ERR1713331.bloom 2> metagenomics/bigsi/11/bigsi.ERR1713331.bloom.log.err 1> metagenomics/bigsi/11/bigsi.ERR1713331.bloom.log

parallel -j 8 -a /tmp/tmp.PyLXAuc2pX

Now, let's merge all these files together into a single BIGSI database.
/usr/bin/time -v bigsi build metagenomics/bigsi/11/bigsi.ERR1713331.bloom metagenomics/bigsi/1

In [15]:
du -mc ${input_dir_type}/${bigsi_dir}/${kmer_size}/{*.ctx,*.bloom} | 
    grep 'total' | 
    sed -e 's/\ttotal$/ total intermediate (MB)/' | 
    tee ${input_dir_type}/${bigsi_dir}/${kmer_size}/bigsi-total-disk.txt
    
du -mc ${input_dir_type}/${bigsi_dir}/${kmer_size}/*bigsi.db |
    grep 'total' | 
    sed -e 's/\ttotal$/ total database (MB)/' |
    tee -a ${input_dir_type}/${bigsi_dir}/${kmer_size}/bigsi-total-disk.txt

1158 total intermediate (MB)
68 total database (MB)


## Human bigsi

In [16]:
input_dir_type="human"
run_bigsi_bloom "${input_dir_type}" "${input_dir_type}/${bigsi_dir}/${kmer_size}" "${human_bits}" "5G" "12"

rm: cannot remove 'human/bigsi/11/mccortex.*': No such file or directory
rm: cannot remove 'human/bigsi/11/*bigsi.db': No such file or directory
Will run commands (mccortex and bigsi bloom) from [/tmp/tmp.AjXGJa2EeP] like:
/usr/bin/time -v mccortex 11 build --nkmers 2100000 --threads 1 --kmer 11 --mem 5G --sample SRR038300 --seq human/kmer-counts-jellyfish/11/SRR038300.kmer.gz human/bigsi/11/mccortex.SRR038300.ctx 2> human/bigsi/11/mccortex.count.SRR038300.log.err 1> human/bigsi/11/mccortex.count.SRR038300.log && /usr/bin/time -v bigsi bloom human/bigsi/11/mccortex.SRR038300.ctx human/bigsi/11/bigsi.SRR038300.bloom 2> human/bigsi/11/bigsi.SRR038300.bloom.log.err 1> human/bigsi/11/bigsi.SRR038300.bloom.log

parallel -j 12 -a /tmp/tmp.AjXGJa2EeP

Now, let's merge all these files together into a single BIGSI database.
/usr/bin/time -v bigsi build human/bigsi/11/bigsi.SRR038300.bloom human/bigsi/11/bigsi.SRR039632.bloom human/bigsi/11/bigsi.SRR1012332.bloom human/bigsi/11/bigsi.SRR1024141.

In [17]:
du -mc ${input_dir_type}/${bigsi_dir}/${kmer_size}/{*.ctx,*.bloom} | 
    grep 'total' | 
    sed -e 's/\ttotal$/ total intermediate (MB)/' | 
    tee ${input_dir_type}/${bigsi_dir}/${kmer_size}/bigsi-total-disk.txt
    
du -mc ${input_dir_type}/${bigsi_dir}/${kmer_size}/*bigsi.db |
    grep 'total' | 
    sed -e 's/\ttotal$/ total database (MB)/' |
    tee -a ${input_dir_type}/${bigsi_dir}/${kmer_size}/bigsi-total-disk.txt

624 total intermediate (MB)
68 total database (MB)
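
Since each run writes its totals to a `bigsi-total-disk.txt` file, the three results can be collected into one table. The sketch below assumes the two-line format produced by the `sed`/`tee` commands above and the `bigsi` directory name from the first cell; `summarize_bigsi_disk` is a hypothetical helper.

```bash
# Hypothetical helper (not part of the original notebook).
# Collects the "total intermediate" and "total database" figures (in MB)
# from each data type's bigsi-total-disk.txt into one tab-separated table.
summarize_bigsi_disk() {
    local root=$1 kmer=$2
    shift 2
    local data_type report
    printf 'data_type\tintermediate_mb\tdatabase_mb\n'
    for data_type in "$@"; do
        report=${root}/${data_type}/bigsi/${kmer}/bigsi-total-disk.txt
        [ -f "${report}" ] || continue   # skip data types that have not been run
        awk -v t="${data_type}" '
            /intermediate/ { inter = $1 }
            /database/     { db = $1 }
            END            { printf "%s\t%s\t%s\n", t, inter, db }
        ' "${report}"
    done
}

# Example, run from the project directory:
# summarize_bigsi_disk . 11 microbial metagenomics human | column -s$'\t' -t
```

For the runs above, this would put the 754, 1158, and 624 MB intermediate totals side by side with the 68 MB reported for each database.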
