# Experiment: BIGSI Indexing

Now let's use BIGSI to index our data.

First, let's set up some directories.

In [None]:
fastq_data_dir=data-downsampled
data_dir=kmer-counts-jellyfish
bigsi_dir=bigsi
kmer_size="17"

threads=1

PROJECT_DIR=`git rev-parse --show-toplevel`
cd $PROJECT_DIR

The code below assumes you have a [conda](https://docs.conda.io/en/latest/) environment set up with [mccortex](https://github.com/mcveanlab/mccortex) and [bigsi](https://github.com/Phelimb/BIGSI) installed. This can be done with:

```bash
conda create --name bigsi_mccortex mccortex
conda install --name bigsi_mccortex bigsi
```

Let's verify these commands exist (and verify versions).

In [None]:
conda run --name bigsi_mccortex mccortex 31 build --help 2>&1 | grep 'mccortex=v'
conda run --name bigsi_mccortex bigsi bloom --help 2>&1 | grep 'bigsi-v'
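
The later cells also call a few tools directly from the shell (`conda`, `bc`, and `/usr/bin/time`). Here is a minimal pre-flight sketch for checking them. The helper name `require_tools` is hypothetical and not part of the original notebook, and `parallel` is left out because it is launched through `conda run` and may live inside the environment rather than on your `PATH`.

```bash
# Hypothetical helper: report every missing tool, return non-zero if any is absent.
require_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        # command -v succeeds for names on PATH and for executable paths
        if ! command -v "${tool}" > /dev/null 2>&1; then
            echo "Missing required tool: ${tool}" >&2
            missing=1
        fi
    done
    return ${missing}
}

# Tools invoked directly (outside 'conda run') by the cells in this notebook.
require_tools conda bc /usr/bin/time || echo "Install the tools listed above before continuing."
```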

Great. Now let's look at figuring out the optimal Bloom filter sizes and hash functions for BIGSI and HowDeSBT.

## Bloom filter sizes

First, let's pull out the maximum number of unique (canonical) kmers (for the kmer size set in `kmer_size`) from each dataset type.

In [None]:
(echo -e "data_type\tkmer_size\tmax_kmers"
for data_type in microbial metagenomics human
do
    # Get max kmer counts from kmers counted by jellyfish (for kmer size ${kmer_size})
    # tail -n+2 removes header line (first line) from data
    max_kmer_count=`tail -n+2 ${data_type}/${data_dir}/${kmer_size}/kmer-counts.tsv | sort -k2,2n | cut -f 2 | tail -n 1`
    echo -e "${data_type}\t${kmer_size}\t${max_kmer_count}"
done) | sort -k3,3n | column -s$'\t' -t
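
As a small illustration of the "maximum of column 2" step, here is a sketch that does the same thing in a single `awk` pass on a made-up table. The header names, sample names, and counts are all hypothetical; only the layout (one header line, count in the second tab-separated column) follows the cell above.

```bash
# Build a tiny stand-in for kmer-counts.tsv (all values are made up).
demo_tsv=$(mktemp)
printf 'accession\tunique_kmers\n' > "${demo_tsv}"
printf 'sampleA\t120\nsampleB\t300\nsampleC\t45\n' >> "${demo_tsv}"

# Skip the header (NR > 1) and keep the largest numeric value seen in column 2.
max_kmer_count=$(awk -F'\t' 'NR > 1 && $2 + 0 > max { max = $2 + 0 } END { print max + 0 }' "${demo_tsv}")
echo "${max_kmer_count}"

rm -f "${demo_tsv}"
```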

Let's also pull out the estimated union of all k-mers across all datasets.

In [None]:
(echo -e "data_type\tkmer_size\tunion_kmers"
for data_type in microbial metagenomics human
do
    count=`sed -e 's/Estimated number of unique exact matches: //' ${data_type}/${fastq_data_dir}/total-unique-kmers-${kmer_size}.txt | \
    awk '{print int($1+0.5)}'`
    
    echo -e "${data_type}\t${kmer_size}\t${count}"
done) | sort -k3,3n | column -s$'\t' -t
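
To make the `sed`/`awk` step concrete, here is a sketch that runs the same two commands on a made-up input line. The value `1234.6` is purely illustrative: `sed` strips the text label and `awk` rounds the remaining number to the nearest integer.

```bash
# Stand-in for total-unique-kmers-<k>.txt with a made-up estimate.
demo_txt=$(mktemp)
echo "Estimated number of unique exact matches: 1234.6" > "${demo_txt}"

# Strip the label, then round to the nearest integer (add 0.5 and truncate).
count=$(sed -e 's/Estimated number of unique exact matches: //' "${demo_txt}" | awk '{print int($1+0.5)}')
echo "${count}"

rm -f "${demo_txt}"
```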

It's the union of unique kmers we'll use to set Bloom filter sizes. We'll set these as follows.

In [None]:
# For kmer size 9
#microbial_bits=140000
#human_bits=140000
#metagenomics_bits=140000
# For kmer size 11
#microbial_bits=2100000
#human_bits=2100000
#metagenomics_bits=2100000
# For kmer size 13
#microbial_bits=24000000
#human_bits=24000000
#metagenomics_bits=33000000
# For kmer size 15
#microbial_bits=61000000
#human_bits=72000000
#metagenomics_bits=210000000
# For kmer size 17
microbial_bits=74000000
human_bits=86000000
metagenomics_bits=340000000
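
The notebook does not state how the values above were derived. One common way to relate a Bloom filter size to a kmer count is the standard false-positive approximation p = (1 - e^(-hn/m))^h, solved for m. The sketch below does that; the helper name `bloom_bits` and the example inputs (one million kmers, a 10% false-positive rate) are assumptions for illustration, not the settings used in this experiment.

```bash
# Hypothetical helper: bits needed for n elements, false-positive rate p, h hash functions.
# Solves p = (1 - exp(-h*n/m))^h for m and rounds up.
bloom_bits() {
    local n=$1 p=$2 h=$3
    awk -v n="${n}" -v p="${p}" -v h="${h}" 'BEGIN {
        root = exp(log(p) / h)          # p^(1/h)
        m = -h * n / log(1 - root)      # bits required
        printf "%d\n", m + 0.999999     # round up to a whole bit
    }'
}

# Illustrative call: 1,000,000 kmers, p = 0.1, a single hash function.
bloom_bits 1000000 0.1 1
```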

## Create BIGSI config file

In [None]:
create_bigsi_config() {
    kmer_size_config=$1
    bits=$2
    hashes=$3
    output_dir_config=$4

echo "## Example config using berkeleyDB
h: ${hashes}
k: ${kmer_size_config}
m: ${bits}
storage-engine: berkeleydb
storage-config:
  filename: ${output_dir_config}/kmer${kmer_size_config}-bits${bits}-hashes${hashes}-bigsi.db
  flag: \"c\" ## Change to 'r' for read-only access
" > ${output_dir_config}/berkelydb.yaml
}
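
For comparison, here is a sketch of the same config written with a here-document, which avoids having to escape the double quotes around the `flag` value. The function name `write_bigsi_config_heredoc` is hypothetical, and the demo writes to a temporary directory so it does not touch the real output directories.

```bash
# Hypothetical variant: same fields as the config above, written with a here-document.
write_bigsi_config_heredoc() {
    local k=$1 m=$2 h=$3 out_dir=$4
    mkdir -p "${out_dir}"
    cat > "${out_dir}/berkelydb.yaml" <<EOF
## Example config using berkeleyDB
h: ${h}
k: ${k}
m: ${m}
storage-engine: berkeleydb
storage-config:
  filename: ${out_dir}/kmer${k}-bits${m}-hashes${h}-bigsi.db
  flag: "c" ## Change to 'r' for read-only access
EOF
}

# Demo in a scratch directory: print the generated file to inspect it.
demo_dir=$(mktemp -d)
write_bigsi_config_heredoc 17 74000000 1 "${demo_dir}"
cat "${demo_dir}/berkelydb.yaml"
```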

In [None]:
data_type=microbial
output_dir=${data_type}/${bigsi_dir}/${kmer_size}
mkdir -p ${output_dir}
create_bigsi_config "${kmer_size}" "${microbial_bits}" "1" "${output_dir}"
ls -l ${output_dir}/berkelydb.yaml

data_type=metagenomics
output_dir=${data_type}/${bigsi_dir}/${kmer_size}
mkdir -p ${output_dir}
create_bigsi_config "${kmer_size}" "${metagenomics_bits}" "1" "${output_dir}"
ls -l ${output_dir}/berkelydb.yaml

data_type=human
output_dir=${data_type}/${bigsi_dir}/${kmer_size}
mkdir -p ${output_dir}
create_bigsi_config "${kmer_size}" "${human_bits}" "1" "${output_dir}"
ls -l ${output_dir}/berkelydb.yaml

Once this step is complete, we need to define the bash function for constructing the BIGSI Bloom filters. BIGSI requires a cortex file as input, which can be generated from the kmer list produced in the previous `jellyfish` step by first running it through `mccortex`.

We will first do this, then generate the BIGSI Bloom filters.

## BIGSI Bloom filter bash function

In [None]:
run_bigsi_bloom() {
    type_dir=$1
    output_dir=$2
    nkmers=$3
    mem=$4
    jobs=$5
    
    input_dir=${type_dir}/${data_dir}/${kmer_size}
    
    export BIGSI_CONFIG=${output_dir}/berkelydb.yaml
            
    before=`date +%s`
    
    # Remove output from any previous run (-f avoids errors when nothing exists yet)
    rm -f ${output_dir}/mccortex.*
    rm -rf ${output_dir}/bigsi.*
    rm -f ${output_dir}/*bigsi.db

    commands_file=`mktemp`
    
    for file in ${input_dir}/*.kmer.gz
    do
        accession=`basename ${file} .kmer.gz`

        mccortex_out=${output_dir}/mccortex.${accession}.ctx
        mccortex_log=${output_dir}/mccortex.count.${accession}.log
        
        bigsi_out=${output_dir}/bigsi.${accession}.bloom
        bigsi_log=${output_dir}/bigsi.${accession}.bloom.log

        command="/usr/bin/time -v mccortex ${kmer_size} build --nkmers ${nkmers} --threads ${threads} --kmer ${kmer_size} \
            --mem ${mem} --sample ${accession} --seq ${file} ${mccortex_out} 2> ${mccortex_log}.err 1> ${mccortex_log} && \
            /usr/bin/time -v bigsi bloom ${mccortex_out} ${bigsi_out} 2> ${bigsi_log}.err 1> ${bigsi_log}"
        echo ${command} >> ${commands_file}
    done
    
    echo "Will run commands (mccortex and bigsi bloom) from [${commands_file}] like:"
    head -n 1 ${commands_file}
    command="parallel -j ${jobs} -a ${commands_file}"
    echo -e "\n${command}"
    conda run --name bigsi_mccortex ${command}
    
    echo -e "\nNow, let's merge all these files together into a single BIGSI database."
    bigsi_merge_log=${output_dir}/bigsi.build.log
    files=`echo -n ${output_dir}/*.bloom`
    samples=`for file in ${files}; do echo -n "-s "; basename ${file} .bloom | sed -e 's/^bigsi\.//'; done`
    command="/usr/bin/time -v bigsi build ${files} ${samples} 2> ${bigsi_merge_log}.err 1> ${bigsi_merge_log}"
    echo ${command}
    conda run --name bigsi_mccortex ${command}
    
    after=`date +%s`
    minutes=`echo "(${after}-${before})/60" | bc -l`
    printf "Done. Took %0.2f minutes.\n" ${minutes}
}
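
The merge step derives one `-s <sample>` argument per Bloom filter from the file names. Here is a dry-run sketch of just that name handling, using empty placeholder files in a temporary directory. The accessions `SRR0001` and `SRR0002` are made up, and nothing here calls `bigsi` itself.

```bash
# Create placeholder Bloom filter files (names only, no real data).
demo_dir=$(mktemp -d)
touch "${demo_dir}/bigsi.SRR0001.bloom" "${demo_dir}/bigsi.SRR0002.bloom"

# Strip the directory, the '.bloom' suffix and the 'bigsi.' prefix to recover each accession.
sample_args=""
for file in "${demo_dir}"/*.bloom; do
    sample=$(basename "${file}" .bloom | sed -e 's/^bigsi\.//')
    sample_args="${sample_args} -s ${sample}"
done

# These are the sample arguments that would follow the list of .bloom files.
echo "${sample_args}"

rm -rf "${demo_dir}"
```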

Now that we've got our mccortex and BIGSI code defined, let's run it on a dataset.

## Microbial bigsi

In [None]:
input_dir_type="microbial"
run_bigsi_bloom "${input_dir_type}" "${input_dir_type}/${bigsi_dir}/${kmer_size}" "${microbial_bits}" "3G" "24"

Alright. It's all finished. Let's look at some of the output files.

In [None]:
du -sh ${input_dir_type}/${bigsi_dir}/${kmer_size}/*.bloom | head -n 5

These contain the individual BIGSI Bloom filters.

In [None]:
du -sh ${input_dir_type}/${bigsi_dir}/${kmer_size}/*.ctx | head -n 5

These are the intermediate cortex graphs.

In [None]:
ls -lh ${input_dir_type}/${bigsi_dir}/${kmer_size}/*bigsi.db

This is the final BIGSI database.

Let's now measure the size on disk of the intermediate files and of the database.

In [None]:
du -mc ${input_dir_type}/${bigsi_dir}/${kmer_size}/{*.ctx,*.bloom} | 
    grep 'total' | 
    sed -e 's/\ttotal$/ total intermediate (MB)/' | 
    tee ${input_dir_type}/${bigsi_dir}/${kmer_size}/bigsi-total-disk.txt
    
du -mc ${input_dir_type}/${bigsi_dir}/${kmer_size}/*bigsi.db |
    grep 'total' | 
    sed -e 's/\ttotal$/ total database (MB)/' |
    tee -a ${input_dir_type}/${bigsi_dir}/${kmer_size}/bigsi-total-disk.txt

## Metagenomics bigsi

In [None]:
input_dir_type="metagenomics"
run_bigsi_bloom "${input_dir_type}" "${input_dir_type}/${bigsi_dir}/${kmer_size}" "${metagenomics_bits}" "10G" "8"

In [None]:
du -mc ${input_dir_type}/${bigsi_dir}/${kmer_size}/{*.ctx,*.bloom} | 
    grep 'total' | 
    sed -e 's/\ttotal$/ total intermediate (MB)/' | 
    tee ${input_dir_type}/${bigsi_dir}/${kmer_size}/bigsi-total-disk.txt
    
du -mc ${input_dir_type}/${bigsi_dir}/${kmer_size}/*bigsi.db |
    grep 'total' | 
    sed -e 's/\ttotal$/ total database (MB)/' |
    tee -a ${input_dir_type}/${bigsi_dir}/${kmer_size}/bigsi-total-disk.txt

## Human bigsi

In [None]:
input_dir_type="human"
run_bigsi_bloom "${input_dir_type}" "${input_dir_type}/${bigsi_dir}/${kmer_size}" "${human_bits}" "5G" "12"

In [None]:
du -mc ${input_dir_type}/${bigsi_dir}/${kmer_size}/{*.ctx,*.bloom} | 
    grep 'total' | 
    sed -e 's/\ttotal$/ total intermediate (MB)/' | 
    tee ${input_dir_type}/${bigsi_dir}/${kmer_size}/bigsi-total-disk.txt
    
du -mc ${input_dir_type}/${bigsi_dir}/${kmer_size}/*bigsi.db |
    grep 'total' | 
    sed -e 's/\ttotal$/ total database (MB)/' |
    tee -a ${input_dir_type}/${bigsi_dir}/${kmer_size}/bigsi-total-disk.txt
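
Each dataset now has a `bigsi-total-disk.txt` file with two lines. Here is a sketch of how those files could be turned into one tab-separated row per dataset. The helper name `print_disk_summary` is hypothetical, and the demo uses a temporary file with made-up sizes rather than real results.

```bash
# Hypothetical helper: print "<label> <intermediate MB> <database MB>" from a summary file.
print_disk_summary() {
    local label=$1 file=$2
    awk -v label="${label}" '
        /total intermediate/ { inter = $1 }
        /total database/     { db = $1 }
        END { printf "%s\t%s\t%s\n", label, inter, db }
    ' "${file}"
}

# Demo with made-up numbers in the same two-line layout the cells above write.
demo_file=$(mktemp)
printf '12 total intermediate (MB)\n34 total database (MB)\n' > "${demo_file}"
summary=$(print_disk_summary demo "${demo_file}")
echo "${summary}"
rm -f "${demo_file}"
```

With real data, the same helper could be called in a loop over `microbial`, `metagenomics`, and `human`, pointing at `${data_type}/${bigsi_dir}/${kmer_size}/bigsi-total-disk.txt`.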