# Experiment: BIGSI and HowDeSBT search for genes from assembly

To measure the accuracy of BIGSI and HowDeSBT, we compare their performance in finding a gene against an assembly- and BLAST-based method (assembling with `skesa` and finding genes with BLAST using `staramr`).

First, let's set up some environment variables.

In [11]:
data_type=microbial
assembly_dir=${data_type}/assembly
bigsi_dir=${data_type}/bigsi
howdesbt_dir=${data_type}/howdesbt
accuracy_dir=accuracy
queries_dir=queries
query_file=${queries_dir}/accuracy_query.fasta
staramr_out_dir=${assembly_dir}/staramr
true_samples_file=${queries_dir}/microbial_accuracy_true_samples.txt
all_samples_file=${data_type}/microbial-genomes.txt

kmer_sizes_list="9 11 13 15 17"
perfect_search_threshold=1.00
high_search_threshold=0.99
low_search_threshold=0.70

query_string=`grep -v '^>' ${query_file} | tr -d '\n'`

PROJECT_DIR=`git rev-parse --show-toplevel`
cd $PROJECT_DIR
cat ${query_file}

>Col(BS512)_1__NC_010656 isolate: SRR10527348, contig: Contig_1_18.8607_Circ, contig_start: 632, contig_end: 400, database_gene_start: 1, database_gene_end: 233, hsp/length: 233/233, pid: 100.00%, plength: 100.00%
ATGAATGCGGCGTTTAAGCGAATGGAAAAGCGAAAGGAGCTATCACCTGTTCAGGGGTGG
ATCAGGGCTACGGAGGTGACGCGAGGTAAGGATGGCAGCGCACATCCGCATTTTCACTGT
CTGCTGATGGTGCAACCTTCTTGGTTTAAAGGGAAGAACTACGTTAAGCACGAACGTTGG
GTAGAACTCTGGCGCGATTGCTTGCGGGTGAACTATGAGCCGAATATCGATAT
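
The `query_string` variable set above is just the FASTA record with its header line dropped and the line breaks removed. As a rough illustration, the same `grep -v '^>' | tr -d '\n'` step could be sketched in Python as follows. The function name `fasta_to_query_string` and the toy record are made up for this example and are not part of the project.

```python
def fasta_to_query_string(fasta_text):
    """Drop FASTA header lines and join the remaining sequence lines."""
    return "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )


# Toy record (hypothetical), wrapped over two lines like a real FASTA file.
toy_fasta = ">toy_gene some description\nATGAATGC\nGGCGTT\n"
print(fasta_to_query_string(toy_fasta))  # prints ATGAATGCGGCGTT
```

Like the shell version, this concatenates every sequence line in the file, so it only makes sense for a query file holding a single record.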


The code below assumes you have the following [conda](https://docs.conda.io/en/latest/) environments set up with [bigsi](https://github.com/Phelimb/BIGSI) and [howdesbt](https://github.com/medvedevgroup/HowDeSBT) installed. They can be created with:

```bash
conda create --name bigsi_mccortex bigsi
conda create --name howdesbt howdesbt
```

Let's verify these commands exist (and verify versions).

In [2]:
conda run --name bigsi_mccortex bigsi bloom --help 2>&1 | grep 'bigsi-v'
conda run --name howdesbt howdesbt --version

usage: bigsi-v0.3.1 bloom [-h] [-c CONFIG] ctx outfile
version 2.00.02 20191014


Let's also show the list of samples which contain this gene as defined by `staramr` (BLAST) in the previous step.

In [3]:
cat ${true_samples_file}

SRR10527348
SRR10527351
SRR10527352


## True/false positive code

Now let's define some code to classify each result as a true or false match and print a confusion table. This is written as a bash function because we are in Jupyter Bash mode, but really it's just Python code (there's probably a better way to do this).

In [4]:
print_confusion_table() {
    method_name=$1
    all_samples_file_func=$2
    blast_matches_file=$3
    matches_file=$4

    python -c "
all_samples = set(l.strip() for l in open('${all_samples_file_func}'))
method_name=\"${method_name}\"

blast_matches = set(l.strip() for l in open('${blast_matches_file}'))
blast_non_matches = all_samples - blast_matches

matches = set(l.strip() for l in open('${matches_file}'))
non_matches = all_samples - matches

true_matches = matches & blast_matches
true_non_matches = non_matches & blast_non_matches
false_matches = matches & blast_non_matches
false_non_matches = non_matches & blast_matches

print(\"\tBLAST Match\tBLAST non-Match\")
print(\"%s Match\t%s\t%s\" % (method_name, len(true_matches), len(false_matches)))
print(\"%s non-Match\t%s\t%s\" % (method_name, len(false_non_matches), len(true_non_matches)))
"
}
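
The same set logic can be sketched as a standalone Python function, which is easier to check than the inline `python -c` string. This is only an illustration: the function name `confusion_counts` and the placeholder sample names are made up here. The example input mimics a 50-sample collection in which a method reports a single sample that BLAST does not.

```python
def confusion_counts(all_samples, blast_matches, method_matches):
    """Return (true_matches, false_matches, false_non_matches, true_non_matches)."""
    all_samples = set(all_samples)
    blast_matches = set(blast_matches)
    method_matches = set(method_matches)

    # Anything not reported counts as a non-match.
    blast_non_matches = all_samples - blast_matches
    method_non_matches = all_samples - method_matches

    return (
        len(method_matches & blast_matches),
        len(method_matches & blast_non_matches),
        len(method_non_matches & blast_matches),
        len(method_non_matches & blast_non_matches),
    )


# Hypothetical collection: the 3 BLAST-positive samples plus 47 placeholders.
blast = {"SRR10527348", "SRR10527351", "SRR10527352"}
everything = blast | {f"placeholder_{i}" for i in range(47)}

# A method that reports one sample, and it is not a BLAST match.
print(confusion_counts(everything, blast, {"placeholder_0"}))  # prints (0, 1, 3, 46)
```

Returning plain counts rather than printing keeps the formatting (tabs, `column`) separate from the comparison itself.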

## BIGSI queries

Okay. Now let's try querying the different BIGSI indexes we've generated using all the different k-mer sizes and determine how well we can match the BLAST results.

First let's test out some queries.

In [8]:
kmer_size="17"
export BIGSI_CONFIG=${bigsi_dir}/${kmer_size}/berkelydb.yaml

conda run --name bigsi_mccortex bigsi search "${query_string}" 2>/dev/null | \
    tr "'" '"' | jq '.results[].sample_name' | sed -e 's/"//g'

ERR3655992


Huh? That does not match the three genomes listed above (`SRR10527348`, `SRR10527351`, `SRR10527352`) at all. Why is that?
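
One thing worth ruling out is the post-processing. The `tr | jq | sed` pipeline swaps single quotes for double quotes so that `jq` can read the output, then pulls `sample_name` out of each entry in `results`. A minimal Python sketch of that step is below. It assumes the search output is a Python-style dict literal, and the example string is hypothetical, reduced to the two keys the `jq` filter relies on; real BIGSI output contains more fields.

```python
import ast


def sample_names(bigsi_output):
    """Extract sample names from search output printed as a Python-style dict."""
    parsed = ast.literal_eval(bigsi_output.strip())
    return [hit["sample_name"] for hit in parsed["results"]]


# Hypothetical output, trimmed to the keys used by the jq filter.
example = "{'results': [{'sample_name': 'ERR3655992'}]}"
print(sample_names(example))  # prints ['ERR3655992']
```

Using `ast.literal_eval` avoids the blanket quote replacement, which would break if any value ever contained an apostrophe.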

Let's maybe try lowering the threshold from 100%.

In [9]:
kmer_size="17"
export BIGSI_CONFIG=${bigsi_dir}/${kmer_size}/berkelydb.yaml

conda run --name bigsi_mccortex bigsi search --threshold 0.99 "${query_string}" 2>/dev/null | \
    tr "'" '"' | jq '.results[].sample_name' | sed -e 's/"//g'

SRR10527348
SRR10527351
SRR10527352


Now we get the correct results. Let's test setting a threshold of `1.0` explicitly and loop over all k-mer sizes.

In [10]:
for kmer_size in ${kmer_sizes_list}
do
    export BIGSI_CONFIG=${bigsi_dir}/${kmer_size}/berkelydb.yaml
    
    echo "For kmer size ${kmer_size} and threshold 1.0"
    conda run --name bigsi_mccortex bigsi search --threshold 1.0 "${query_string}" 2>/dev/null | \
        tr "'" '"' | jq '.results[].sample_name' | sed -e 's/"//g'
done

For kmer size 9 and threshold 1.0
ERR1144974
ERR1144975
ERR1144976
ERR1144977
ERR1144978
ERR3655992
ERR3655994
For kmer size 11 and threshold 1.0
ERR3655992
For kmer size 13 and threshold 1.0
ERR3655992
For kmer size 15 and threshold 1.0
ERR3655992
For kmer size 17 and threshold 1.0
ERR3655992


Now it's wrong again, and consistently so across all k-mer sizes. I do not know why this is, but something about the exact matching method (threshold `1.0`) gives us completely different results.

This is perhaps due to our data, or maybe a bug in the software. In any case, to make fair comparisons we will compare both BIGSI and HowDeSBT at a threshold of `0.99` as the highest instead of `1.00` (though we will include the results for `1.00` here for reference).
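
To get a feel for what these thresholds mean for our query, here is a small sketch that counts the k-mers in a 233 bp sequence (the length given in the query header) and how many of them would have to be found at each threshold. It assumes the threshold is the minimum fraction of query k-mers that must be present and that the required count is rounded up; the exact comparison and rounding rules inside BIGSI and HowDeSBT may differ. The function names are made up for this example.

```python
import math


def kmer_count(query_length, k):
    """Number of overlapping k-mers in a sequence of the given length."""
    return max(query_length - k + 1, 0)


def required_hits(query_length, k, threshold):
    """Smallest k-mer count reaching the threshold fraction (assumed rule)."""
    total = kmer_count(query_length, k)
    # Small epsilon guards against floating-point noise before rounding up.
    return math.ceil(threshold * total - 1e-9)


query_length = 233  # from "hsp/length: 233/233" in the query header
for k in (9, 11, 13, 15, 17):
    total = kmer_count(query_length, k)
    needed = {t: required_hits(query_length, k, t) for t in (1.00, 0.99, 0.70)}
    print(k, total, needed)
```

Under these assumptions, a k-mer size of 17 gives 217 query k-mers, and a threshold of `0.99` requires 215 of them, so the step from `1.00` to `0.99` tolerates only two missing k-mers.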

### BIGSI search thresholds

In [13]:
for search_threshold in ${perfect_search_threshold} ${high_search_threshold} ${low_search_threshold}
do
    for kmer_size in ${kmer_sizes_list}
    do
        bigsi_dir_kmer=${bigsi_dir}/${kmer_size}

        export BIGSI_CONFIG=${bigsi_dir_kmer}/berkelydb.yaml

        bigsi_accuracy_dir=${bigsi_dir_kmer}/${accuracy_dir}
        mkdir -p "${bigsi_accuracy_dir}"

        search_out_file=${bigsi_accuracy_dir}/accuracy-search-threshold-${search_threshold}.txt
        search_confusion_table_file=${bigsi_accuracy_dir}/accuracy-search-threshold-${search_threshold}-table.tsv

        echo -e "\nFor kmer size ${kmer_size} and threshold ${search_threshold}"
        conda run --name bigsi_mccortex bigsi search --threshold ${search_threshold} "${query_string}" 2>/dev/null | \
            tr "'" '"' | jq '.results[].sample_name' | sed -e 's/"//g' > ${search_out_file}

        print_confusion_table "BIGSI" "${all_samples_file}" "${true_samples_file}" "${search_out_file}" | \
            tee ${search_confusion_table_file} | column -s$'\t' -t -n
    done
done


For kmer size 9 and threshold 1.00
                 BLAST Match  BLAST non-Match
BIGSI Match      0            7
BIGSI non-Match  3            40

For kmer size 11 and threshold 1.00
                 BLAST Match  BLAST non-Match
BIGSI Match      0            1
BIGSI non-Match  3            46

For kmer size 13 and threshold 1.00
                 BLAST Match  BLAST non-Match
BIGSI Match      0            1
BIGSI non-Match  3            46

For kmer size 15 and threshold 1.00
                 BLAST Match  BLAST non-Match
BIGSI Match      0            1
BIGSI non-Match  3            46

For kmer size 17 and threshold 1.00
                 BLAST Match  BLAST non-Match
BIGSI Match      0            1
BIGSI non-Match  3            46

For kmer size 9 and threshold 0.99
                 BLAST Match  BLAST non-Match
BIGSI Match      3            40
BIGSI non-Match  0            7

For kmer size 11 and threshold 0.99
                 BLAST Match  BLAST non-Match
BIGSI Match      3            0

Let's look at one of the files we saved.

In [14]:
cat ${bigsi_dir}/17/${accuracy_dir}/accuracy-search-threshold-0.99.txt
cat ${bigsi_dir}/17/${accuracy_dir}/accuracy-search-threshold-0.99-table.tsv

SRR10527348
SRR10527351
SRR10527352
	BLAST Match	BLAST non-Match
BIGSI Match	3	0
BIGSI non-Match	0	47


Awesome. From these files we can read off the true/false positives (when compared to BLAST).
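
As a sketch of how those numbers could be read back, the snippet below parses the text of one of the saved `-table.tsv` files and derives sensitivity, specificity, and precision relative to BLAST. The function names are hypothetical, and the parser assumes exactly the three-line, tab-separated layout printed above.

```python
def parse_confusion_table(text):
    """Return (tp, fp, fn, tn) from the tab-separated confusion table text."""
    lines = [line for line in text.splitlines() if line.strip()]
    match_row = lines[1].split("\t")      # "<method> Match", tp, fp
    non_match_row = lines[2].split("\t")  # "<method> non-Match", fn, tn
    tp, fp = int(match_row[1]), int(match_row[2])
    fn, tn = int(non_match_row[1]), int(non_match_row[2])
    return tp, fp, fn, tn


def accuracy_metrics(tp, fp, fn, tn):
    """Compute ratios against BLAST; None when a denominator is zero."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Contents of the k-mer size 17, threshold 0.99 table shown above.
table = "\tBLAST Match\tBLAST non-Match\nBIGSI Match\t3\t0\nBIGSI non-Match\t0\t47\n"
counts = parse_confusion_table(table)
print(counts)                     # prints (3, 0, 0, 47)
print(accuracy_metrics(*counts))
```

Keep in mind that BLAST is treated as the truth here, so these ratios measure agreement with `staramr` rather than absolute correctness.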

Now let's look at the HowDeSBT results.

## HowDeSBT queries

In [15]:
# Reset ourselves back to main directory
cd ${PROJECT_DIR}

kmer_size="17"
# Now change the HowDeSBT results directory (since it assumes we run from directory with files)
cd ${howdesbt_dir}/${kmer_size}
pwd

conda run --name howdesbt howdesbt query --tree=howdesbt.build.sbt ${PROJECT_DIR}/${query_file} | \
    tail -n+2 | sed -e 's/^howdesbt.//'

cd ${PROJECT_DIR}

/home/CSCScience.ca/apetkau/workspace/comp7934-project/microbial/howdesbt/17
SRR10527352
SRR10527351
SRR10527348


This looks great. Let's run for all kmer sizes.

In [17]:
# Reset ourselves back to main directory
cd ${PROJECT_DIR}

for search_threshold in ${perfect_search_threshold} ${high_search_threshold} ${low_search_threshold}
do
    for kmer_size in ${kmer_sizes_list}
    do
        # Now change the HowDeSBT results directory (since it assumes we run from directory with files)
        cd ${howdesbt_dir}/${kmer_size}

        mkdir -p "${accuracy_dir}"

        search_out_file=${accuracy_dir}/accuracy-search-threshold-${search_threshold}.txt
        search_confusion_table_file=${accuracy_dir}/accuracy-search-threshold-${search_threshold}-table.tsv

        echo "For kmer size ${kmer_size} and threshold ${search_threshold}"
        conda run --name howdesbt howdesbt query --threshold=${search_threshold} --tree=howdesbt.build.sbt ${PROJECT_DIR}/${query_file} | \
            tail -n+2 | sed -e 's/^howdesbt.//' > ${search_out_file}

        print_confusion_table "HowDeSBT" "${PROJECT_DIR}/${all_samples_file}" "${PROJECT_DIR}/${true_samples_file}" "${search_out_file}" | \
            tee ${search_confusion_table_file} | column -s$'\t' -t -n

        cd ${PROJECT_DIR}
    done
done

cd ${PROJECT_DIR}

For kmer size 9 and threshold 1.00
                    BLAST Match  BLAST non-Match
HowDeSBT Match      3            29
HowDeSBT non-Match  0            18
For kmer size 11 and threshold 1.00
                    BLAST Match  BLAST non-Match
HowDeSBT Match      3            0
HowDeSBT non-Match  0            47
For kmer size 13 and threshold 1.00
                    BLAST Match  BLAST non-Match
HowDeSBT Match      3            0
HowDeSBT non-Match  0            47
For kmer size 15 and threshold 1.00
                    BLAST Match  BLAST non-Match
HowDeSBT Match      3            0
HowDeSBT non-Match  0            47
For kmer size 17 and threshold 1.00
                    BLAST Match  BLAST non-Match
HowDeSBT Match      3            0
HowDeSBT non-Match  0            47
For kmer size 9 and threshold 0.99
                    BLAST Match  BLAST non-Match
HowDeSBT Match      3            42
HowDeSBT non-Match  0            5
For kmer size 11 and threshold 0.99
                    BLAST Mat

Awesome. Let's take a look at some of the output files.

In [18]:
cd ${PROJECT_DIR}
cat ${howdesbt_dir}/17/${accuracy_dir}/accuracy-search-threshold-0.99.txt
cat ${howdesbt_dir}/17/${accuracy_dir}/accuracy-search-threshold-0.99-table.tsv

SRR10527352
SRR10527351
SRR10527348
	BLAST Match	BLAST non-Match
HowDeSBT Match	3	0
HowDeSBT non-Match	0	47


Hooray. We have all our results.