In [7]:
NOTEBOOK_DIR=`git rev-parse --show-toplevel`
ROOT_DIR=$NOTEBOOK_DIR/microbial
cd $ROOT_DIR
ls

all_kat_hist                            freq_k31.hist
all_kat_hist.dist_analysis.json         freq_k7.hist
all_kat_hist.png                        images
bigsi                                   kat.hist
data                                    kat.hist.dist_analysis.json
ERR1144974                              kat.hist.png
ERR1144974.dist_analysis.json           microbial-bigsi.ipynb
ERR1144974_kat_hist                     microbial-genomes.txt
ERR1144974_kat_hist.dist_analysis.json  o.png
ERR1144974_kat_hist.png                 process-data.ipynb
ERR1144974.png


# Microbial BIGSI

This notebook runs `bigsi` <https://bigsi.readme.io/> on the microbial dataset. First, let's set up some variables that define the Bloom filter size and the other BIGSI parameters.

In [9]:
number_hashes=3
m_value=25000000
kmer_size=9

threads=32
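
As a rough sanity check on these parameters, the textbook Bloom filter approximation puts the false-positive rate at (1 - e^(-h*n/m))^h, where `h` is the number of hashes, `m` the filter size in bits and `n` the number of distinct k-mers inserted. The sketch below assumes BIGSI's filters behave like this textbook model; `bloom_fpr` and `max_kmers` are names introduced here for illustration, and 4^k is only an upper bound on the distinct k-mers in a sample.

```shell
# Approximate Bloom filter false-positive rate: (1 - e^(-h*n/m))^h
# h = number of hash functions, m = filter size in bits, n = distinct k-mers inserted.
# Assumption: the textbook model applies to the filters built later in this notebook.
bloom_fpr() {
    local h=$1 m=$2 n=$3
    awk -v h="$h" -v m="$m" -v n="$n" \
        'BEGIN { printf "%.3e\n", (1 - exp(-h * n / m)) ^ h }'
}

# Upper bound on distinct k-mers for k=9: 4^9 = 262144
max_kmers=$((4 ** 9))
bloom_fpr 3 25000000 "$max_kmers"
```

With the values above this prints a rate of roughly 3e-05, so under these assumptions the filter is far from saturated.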

# Step 1: Counting k-mers

Step 1 counts k-mers with [mccortex](https://github.com/mcveanlab/mccortex), producing one cortex graph (`.ctx`) file per accession.

In [15]:
cd $ROOT_DIR
mkdir bigsi
pushd bigsi

for genome in $ROOT_DIR/data/*_1.fastq.gz;
do
    accession=`basename $genome _1.fastq.gz`
    echo "Working on $accession"
       
    fastq_file_1=$ROOT_DIR/data/${accession}_1.fastq.gz
    fastq_file_2=$ROOT_DIR/data/${accession}_2.fastq.gz
    cortex_out=${accession}.ctx
    cortex_log=${accession}_mccortex.log
    
    set -x
    conda run --name mccortex mccortex ${kmer_size} build -t ${threads} -k ${kmer_size} -s ${accession} --seq2 ${fastq_file_1}:${fastq_file_2} ${cortex_out} 2> ${cortex_log}.err 1> ${cortex_log}
    set +x
done

echo "Done"

mkdir: cannot create directory ‘bigsi’: File exists
~/workspace/comp7934-project/microbial/bigsi ~/workspace/comp7934-project/microbial ~/workspace/comp7934-project/microbial ~/workspace/comp7934-project/microbial ~/workspace/comp7934-project/microbial ~/workspace/comp7934-project/microbial ~/workspace/comp7934-project/microbial ~/workspace/comp7934-project/microbial
Working on ERR1144974
+ conda run --name mccortex mccortex 9 build -t 32 -k 9 -s ERR1144974 --seq2 /home/CSCScience.ca/apetkau/workspace/comp7934-project/microbial/data/ERR1144974_1.fastq.gz:/home/CSCScience.ca/apetkau/workspace/comp7934-project/microbial/data/ERR1144974_2.fastq.gz ERR1144974.ctx
+ set +x
Working on ERR1144975
+ conda run --name mccortex mccortex 9 build -t 32 -k 9 -s ERR1144975 --seq2 /home/CSCScience.ca/apetkau/workspace/comp7934-project/microbial/data/ERR1144975_1.fastq.gz:/home/CSCScience.ca/apetkau/workspace/comp7934-project/microbial/data/ERR1144975_2.fastq.gz ERR1144975.ctx
+ set +x
Working on ERR11

In [20]:
ls *.ctx|head
ls *.ctx | wc -l
du -sh .

ERR1144974.ctx
ERR1144975.ctx
ERR1144976.ctx
ERR1144977.ctx
ERR1144978.ctx
ERR3655992.ctx
ERR3655994.ctx
ERR3655996.ctx
ERR3655998.ctx
ERR3656002.ctx
50
82M	.


Awesome. We now have 50 files with names like `ERR1144974.ctx` in our directory, each containing the cortex graph/k-mers for one accession.
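
Counting files only shows the total. A minimal sketch for finding which accessions, if any, failed to produce a graph is below. It assumes the same `<accession>_1.fastq.gz` and `<accession>.ctx` naming used in the loop above; `missing_ctx` is a hypothetical helper, not part of mccortex.

```shell
# List accessions that have a *_1.fastq.gz input but no non-empty .ctx output.
missing_ctx() {
    local data_dir=$1 out_dir=$2 fastq accession
    for fastq in "$data_dir"/*_1.fastq.gz; do
        [ -e "$fastq" ] || continue          # glob matched nothing
        accession=$(basename "$fastq" _1.fastq.gz)
        # -s is true only if the file exists and is non-empty
        [ -s "$out_dir/$accession.ctx" ] || echo "$accession"
    done
}
```

Calling `missing_ctx "$ROOT_DIR/data" "$ROOT_DIR/bigsi"` should print nothing when every accession has a graph.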

# Step 2: Build BIGSI index

In step 2, we'll build a `bigsi` Bloom filter from each of these cortex graphs and then merge the filters into a single index.

Let's set up the bigsi index configuration.

In [21]:
cat > berkeleydb.yaml << EOF
## Example config using berkeleyDB
h: ${number_hashes}
k: ${kmer_size}
m: ${m_value}
storage-engine: berkeleydb
storage-config:
  filename: bigsi.db
  flag: "c" ## Change to 'r' for read-only access
EOF

export BIGSI_CONFIG=berkeleydb.yaml
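
Because the heredoc delimiter is unquoted, the shell substitutes `${number_hashes}`, `${kmer_size}` and `${m_value}` when the file is written, so an unset variable (for example after a kernel restart) would silently produce an empty field. A minimal sketch of a guard is below; `render_bigsi_config` is a hypothetical helper that emits the same fields as the config above and refuses to run with missing values.

```shell
# Print a BIGSI berkeleydb config to stdout, failing if h, k or m is empty.
# The field names mirror the config written above; the function itself is illustrative.
render_bigsi_config() {
    local h=${1:-} k=${2:-} m=${3:-} db=${4:-bigsi.db}
    if [ -z "$h" ] || [ -z "$k" ] || [ -z "$m" ]; then
        echo "render_bigsi_config: h, k and m are required" >&2
        return 1
    fi
    cat << EOF
h: ${h}
k: ${k}
m: ${m}
storage-engine: berkeleydb
storage-config:
  filename: ${db}
  flag: "c"
EOF
}
```

It could be used as `render_bigsi_config "$number_hashes" "$kmer_size" "$m_value" > berkeleydb.yaml`.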

Now, let's construct the Bloom filters.

In [26]:
conda run --name bigsi bigsi
set -x
parallel --jobs ${threads} -I% conda run --name bigsi bigsi bloom % %.bloom ::: *.ctx
set +x

bigsi-v0.3.1

Available Commands:

 - bloom: Creates a bloom filter from a sequence file or cortex graph. (fastq...
 - build
 - delete
 - insert: Inserts a bloom filter into the graph          e.g. bigsi insert E...
 - merge
 - search

ERROR conda.cli.main_run:execute(39): Subprocess for 'conda run ['bigsi']' command failed.  Stderr was:

+ parallel --jobs 32 -I% conda run --name bigsi bigsi bloom % %.bloom ::: ERR1144974.ctx ERR1144975.ctx ERR1144976.ctx ERR1144977.ctx ERR1144978.ctx ERR3655992.ctx ERR3655994.ctx ERR3655996.ctx ERR3655998.ctx ERR3656002.ctx ERR3656004.ctx ERR3656010.ctx ERR3656012.ctx ERR3656013.ctx ERR3656015.ctx ERR3656018.ctx ERR3656019.ctx SRR10298903.ctx SRR10298904.ctx SRR10298905.ctx SRR10298906.ctx SRR10298907.ctx SRR10512964.ctx SRR10512965.ctx SRR10512968.ctx SRR10513325.ctx SRR10513326.ctx SRR10513328.ctx SRR10513363.ctx SRR10513672.ctx SRR10519468.ctx SRR10519469.ctx SRR10519616.ctx SRR10519617.ctx SRR10519619.ctx SRR10519620.ctx SRR10519637.ctx SRR1052198

In [29]:
ls -d *.bloom|head
ls -d *.bloom|wc -l

ERR1144974.ctx.bloom
ERR1144975.ctx.bloom
ERR1144976.ctx.bloom
ERR1144977.ctx.bloom
ERR1144978.ctx.bloom
ERR3655992.ctx.bloom
ERR3655994.ctx.bloom
ERR3655996.ctx.bloom
ERR3655998.ctx.bloom
ERR3656002.ctx.bloom
50


Awesome. We now have our BIGSI bloom filters.

Let's merge them all into a single BIGSI index.

In [31]:
# Builds command-line string
samples=`for i in *.bloom; do echo -n "-s "; basename $i .ctx.bloom; done`

set -x
time conda run --name bigsi bigsi build *.bloom ${samples}
set +x

+ conda run --name bigsi bigsi build ERR1144974.ctx.bloom ERR1144975.ctx.bloom ERR1144976.ctx.bloom ERR1144977.ctx.bloom ERR1144978.ctx.bloom ERR3655992.ctx.bloom ERR3655994.ctx.bloom ERR3655996.ctx.bloom ERR3655998.ctx.bloom ERR3656002.ctx.bloom ERR3656004.ctx.bloom ERR3656010.ctx.bloom ERR3656012.ctx.bloom ERR3656013.ctx.bloom ERR3656015.ctx.bloom ERR3656018.ctx.bloom ERR3656019.ctx.bloom SRR10298903.ctx.bloom SRR10298904.ctx.bloom SRR10298905.ctx.bloom SRR10298906.ctx.bloom SRR10298907.ctx.bloom SRR10512964.ctx.bloom SRR10512965.ctx.bloom SRR10512968.ctx.bloom SRR10513325.ctx.bloom SRR10513326.ctx.bloom SRR10513328.ctx.bloom SRR10513363.ctx.bloom SRR10513672.ctx.bloom SRR10519468.ctx.bloom SRR10519469.ctx.bloom SRR10519616.ctx.bloom SRR10519617.ctx.bloom SRR10519619.ctx.bloom SRR10519620.ctx.bloom SRR10519637.ctx.bloom SRR10521982.ctx.bloom SRR10521983.ctx.bloom SRR10521984.ctx.bloom SRR10527348.ctx.bloom SRR10527349.ctx.bloom SRR10527351.ctx.bloom SRR10527352.ctx.bloom SRR10527353.
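
The `samples` string above depends on word splitting of an unquoted variable. A sketch of an array-based alternative is below: it collects the Bloom filter paths and the matching `-s <sample>` pairs in the same glob order, so each file stays aligned with its sample name. `collect_bloom_args`, `bloom_files` and `sample_args` are names introduced here, and the `.ctx.bloom` suffix is the one produced earlier in this notebook.

```shell
# Fill two global arrays from the *.ctx.bloom files in a directory:
#   bloom_files  - the filter paths, in glob order
#   sample_args  - "-s <sample>" pairs in the same order
collect_bloom_args() {
    local dir=$1 bloom
    bloom_files=()
    sample_args=()
    for bloom in "$dir"/*.ctx.bloom; do
        [ -e "$bloom" ] || continue          # glob matched nothing
        bloom_files+=("$bloom")
        sample_args+=(-s "$(basename "$bloom" .ctx.bloom)")
    done
}
```

After `collect_bloom_args .`, the build step would become `conda run --name bigsi bigsi build "${bloom_files[@]}" "${sample_args[@]}"`, which passes the same arguments as the command above.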

In [32]:
ls -lh bigsi.db

-rw-r--r-- 1 apetkau grp_apetkau 1.3G Dec  3 12:05 bigsi.db


Awesome. The database has been built (1.3G on disk). Let's try it out.

In [44]:
echo -e "sample_name\tpercent_kmers"
time conda run --name bigsi bigsi search GTTTCGTTCTTCCGGCGCGGGCGGTCAGCACGTTAACACCACCGACTCCGCTATCCGTATTACCCACTTGCCGACCGGCATCTTGGTGGAATGCCAGGACGAGC 2> /dev/null \
    | sed -e $'s/\'/\"/g' | jq -r '.results[] | "\(.sample_name)\t\(.percent_kmers_found)"'

sample_name	percent_kmers
ERR1144974	100
ERR1144975	100
ERR1144976	100
ERR1144977	100
ERR1144978	100
ERR3655992	100
ERR3655994	100

real	0m1.310s
user	0m3.438s
sys	0m3.947s


Awesome. We can use this BIGSI index to pull out the identifiers of samples that contain a specific sequence, such as a gene.
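
To keep only the strongest matches, the tab-separated output can be filtered on the percentage column. The sketch below assumes a two-column table with a header on the first line, as printed above; `filter_hits` and the cutoff value are illustrative choices, not part of `bigsi`.

```shell
# Keep the header plus every row whose second column is >= the given cutoff.
# Input: tab-separated lines of "sample_name<TAB>percent_kmers" on stdin.
filter_hits() {
    local min_percent=$1
    awk -F'\t' -v min="$min_percent" 'NR == 1 || $2 + 0 >= min + 0'
}

# Example with made-up rows (sample names here are placeholders):
printf 'sample_name\tpercent_kmers\nsampleA\t100\nsampleB\t42\n' | filter_hits 90
```

The example keeps the header and `sampleA` and drops `sampleB`. In the notebook, the header and the search results could be piped through `filter_hits` together.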