In [3]:
NOTEBOOK_DIR=`git rev-parse --show-toplevel`
ROOT_DIR=$NOTEBOOK_DIR/microbial
cd $ROOT_DIR
ls

all_kat_hist                            images
all_kat_hist.dist_analysis.json         kat.hist
all_kat_hist.png                        kat.hist.dist_analysis.json
bigsi                                   kat.hist.png
data                                    microbial-bigsi.ipynb
ERR1144974                              microbial-genomes.txt
ERR1144974.dist_analysis.json           microbial-process-data.ipynb
ERR1144974_kat_hist                     microbial-process-data.ipynb.bak
ERR1144974_kat_hist.dist_analysis.json  o.png
ERR1144974_kat_hist.png                 sha256
ERR1144974.png                          total-bp-reads.txt
freq_k31.hist                           total-kmers.txt
freq_k7.hist


# Microbial BIGSI

This runs `bigsi` <https://bigsi.readme.io/> on the microbial dataset. First, let's set up some variables that define the Bloom filter size and the other `bigsi` parameters.

In [4]:
number_hashes=3
m_value=25000000
kmer_size=9

threads=48
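
To get a feel for how `number_hashes` and `m_value` interact, the textbook Bloom filter approximation `p ≈ (1 - e^(-h*n/m))^h` can be evaluated directly in the shell. This is only a rough sketch and not something `bigsi` reports itself: `n` (the number of distinct k-mers inserted for one sample) is an assumption, set here to the upper bound of `4^k` possible k-mers, and `bloom_fpr` is a hypothetical helper name.

```shell
# Hypothetical helper: estimate the false-positive rate of one Bloom filter
# using the standard approximation p = (1 - e^(-h*n/m))^h.
#   $1 = number of hash functions (h)
#   $2 = filter size in bits (m)
#   $3 = number of distinct k-mers inserted (n), an assumed value
bloom_fpr() {
    awk -v h="$1" -v m="$2" -v n="$3" \
        'BEGIN { printf "%.3e\n", (1 - exp(-h * n / m)) ^ h }'
}

number_hashes=3
m_value=25000000
kmer_size=9

# Upper bound on distinct k-mers for this k: 4^k (worst-case assumption).
max_kmers=$(awk -v k="$kmer_size" 'BEGIN { printf "%d\n", 4 ^ k }')

echo "max distinct ${kmer_size}-mers: ${max_kmers}"
echo "estimated false-positive rate: $(bloom_fpr "$number_hashes" "$m_value" "$max_kmers")"
```

Raising `m_value` or lowering the number of inserted k-mers pushes the estimate down; the real rate depends on how many distinct k-mers each sample actually contains.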

# Step 1: Counting k-mers

Step 1 counts k-mers using [mccortex](https://github.com/mcveanlab/mccortex).

In [5]:
cd "$ROOT_DIR"
mkdir -p bigsi
cd bigsi

for genome in $ROOT_DIR/data/subsample/*_1.fastq.gz;
do
    accession=`basename $genome _1.fastq.gz`
    echo "Working on $accession"
       
    fastq_file_1=$ROOT_DIR/data/subsample/${accession}_1.fastq.gz
    fastq_file_2=$ROOT_DIR/data/subsample/${accession}_2.fastq.gz
    cortex_out=${accession}.ctx
    cortex_log=${accession}_mccortex.log
    
    conda run --name mccortex /usr/bin/time -v mccortex ${kmer_size} build -t ${threads} -k ${kmer_size} -s ${accession} \
        --seq2 ${fastq_file_1}:${fastq_file_2} ${cortex_out} 2> ${cortex_log}.err 1> ${cortex_log}
done

echo "Done"

Working on ERR1144974
Working on ERR1144975
Working on ERR1144976
Working on ERR1144977
Working on ERR1144978
Working on ERR3655992
Working on ERR3655994
Working on ERR3655996
Working on ERR3655998
Working on ERR3656002
Working on ERR3656004
Working on ERR3656010
Working on ERR3656012
Working on ERR3656013
Working on ERR3656015
Working on ERR3656018
Working on ERR3656019
Working on SRR10298903
Working on SRR10298904
Working on SRR10298905
Working on SRR10298906
Working on SRR10298907
Working on SRR10512964
Working on SRR10512965
Working on SRR10512968
Working on SRR10513325
Working on SRR10513326
Working on SRR10513328
Working on SRR10513363
Working on SRR10513672
Working on SRR10519468
Working on SRR10519469
Working on SRR10519616
Working on SRR10519617
Working on SRR10519619
Working on SRR10519620
Working on SRR10519637
Working on SRR10521982
Working on SRR10521983
Working on SRR10521984
Working on SRR10527348
Working on SRR10527349
Working on SRR10527351
Working on SRR10527352
Worki
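
The loop in this step derives each accession by stripping the `_1.fastq.gz` suffix with `basename` and assumes that a matching `_2.fastq.gz` file exists. Below is a minimal sketch of that naming logic with a guard for missing mates, run against empty placeholder files in a temporary directory. The accessions `SAMPLE_A` and `SAMPLE_B` and the helper `list_complete_pairs` are made up for illustration.

```shell
# Hypothetical helper: print the accession of every *_1.fastq.gz file in a
# directory that also has a matching *_2.fastq.gz mate; warn about the rest.
list_complete_pairs() {
    local dir=$1 forward accession
    for forward in "$dir"/*_1.fastq.gz; do
        [ -e "$forward" ] || continue            # glob matched nothing
        accession=$(basename "$forward" _1.fastq.gz)
        if [ -e "$dir/${accession}_2.fastq.gz" ]; then
            echo "$accession"
        else
            echo "Skipping $accession: no _2.fastq.gz mate" >&2
        fi
    done
}

# Demo on empty placeholder files (made-up accessions, not real data).
demo_dir=$(mktemp -d)
touch "$demo_dir/SAMPLE_A_1.fastq.gz" "$demo_dir/SAMPLE_A_2.fastq.gz" \
      "$demo_dir/SAMPLE_B_1.fastq.gz"
complete_pairs=$(list_complete_pairs "$demo_dir" 2>/dev/null)
echo "$complete_pairs"
rm -rf "$demo_dir"
```

Running a check like this before the `mccortex` loop would surface unpaired files early instead of leaving the failure in a per-sample `.err` log.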

In [7]:
ls *.ctx|head
ls *.ctx | wc -l
du -sh .

ERR1144974.ctx
ERR1144975.ctx
ERR1144976.ctx
ERR1144977.ctx
ERR1144978.ctx
ERR3655992.ctx
ERR3655994.ctx
ERR3655996.ctx
ERR3655998.ctx
ERR3656002.ctx
50
82M	.


Awesome. We've now got files named like `SRR8088185.ctx` in our directory containing the cortex graph/kmers.

# Step 2: Build BIGSI index

In step 2, we'll build a Bloom filter from each of these k-mer sets and then merge them into a single `bigsi` index.

Let's set up the bigsi index configuration.

In [8]:
cat > berkeleydb.yaml << EOF
## Example config using berkeleyDB
h: ${number_hashes}
k: ${kmer_size}
m: ${m_value}
storage-engine: berkeleydb
storage-config:
  filename: bigsi.db
  flag: "c" ## Change to 'r' for read-only access
EOF

export BIGSI_CONFIG=berkeleydb.yaml
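
Because the `EOF` delimiter in the cell above is unquoted, the shell substitutes `${number_hashes}`, `${kmer_size}`, and `${m_value}` while writing the file, so the YAML ends up with literal numbers. Here is a small sketch that writes the same three keys to a temporary file and checks that each one received a value. The `demo_config` path and the check itself are illustrative additions, not part of `bigsi`.

```shell
number_hashes=3
m_value=25000000
kmer_size=9

# Write a throwaway copy of the parameters; the unquoted EOF lets the shell
# expand the variables above before the text reaches the file.
demo_config=$(mktemp)
cat > "$demo_config" << EOF
h: ${number_hashes}
k: ${kmer_size}
m: ${m_value}
EOF

# Report a problem if any key is empty, e.g. because a variable was unset.
if grep -Eq '^[hkm]: *$' "$demo_config"; then
    echo "config has an empty parameter" >&2
else
    echo "config parameters look filled in:"
    cat "$demo_config"
fi

config_text=$(cat "$demo_config")
rm -f "$demo_config"
```

Quoting the delimiter (`<< 'EOF'`) would instead write the `${...}` placeholders verbatim, which is not what we want here.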

Now, let's construct the Bloom filters.

In [9]:
conda run --name bigsi bigsi
parallel --jobs ${threads} -I% conda run --name bigsi bigsi bloom % %.bloom ::: *.ctx

bigsi-v0.3.1

Available Commands:

 - bloom: Creates a bloom filter from a sequence file or cortex graph. (fastq...
 - build
 - delete
 - insert: Inserts a bloom filter into the graph          e.g. bigsi insert E...
 - merge
 - search

ERROR conda.cli.main_run:execute(39): Subprocess for 'conda run ['bigsi']' command failed.  Stderr was:

  config = yaml.load(infile)

In [11]:
ls -d *.bloom|head
ls -d *.bloom|wc -l

ERR1144974.ctx.bloom
ERR1144975.ctx.bloom
ERR1144976.ctx.bloom
ERR1144977.ctx.bloom
ERR1144978.ctx.bloom
ERR3655992.ctx.bloom
ERR3655994.ctx.bloom
ERR3655996.ctx.bloom
ERR3655998.ctx.bloom
ERR3656002.ctx.bloom
50


Awesome. We now have our BIGSI bloom filters.

Let's merge these all into a BIGSI index.

In [12]:
# Builds command-line string
samples=`for i in *.bloom; do echo -n "-s "; basename $i .ctx.bloom; done`
echo bigsi build *.bloom ${samples}
conda run --name bigsi /usr/bin/time -v bigsi build *.bloom ${samples}

bigsi build ERR1144974.ctx.bloom ERR1144975.ctx.bloom ERR1144976.ctx.bloom ERR1144977.ctx.bloom ERR1144978.ctx.bloom ERR3655992.ctx.bloom ERR3655994.ctx.bloom ERR3655996.ctx.bloom ERR3655998.ctx.bloom ERR3656002.ctx.bloom ERR3656004.ctx.bloom ERR3656010.ctx.bloom ERR3656012.ctx.bloom ERR3656013.ctx.bloom ERR3656015.ctx.bloom ERR3656018.ctx.bloom ERR3656019.ctx.bloom SRR10298903.ctx.bloom SRR10298904.ctx.bloom SRR10298905.ctx.bloom SRR10298906.ctx.bloom SRR10298907.ctx.bloom SRR10512964.ctx.bloom SRR10512965.ctx.bloom SRR10512968.ctx.bloom SRR10513325.ctx.bloom SRR10513326.ctx.bloom SRR10513328.ctx.bloom SRR10513363.ctx.bloom SRR10513672.ctx.bloom SRR10519468.ctx.bloom SRR10519469.ctx.bloom SRR10519616.ctx.bloom SRR10519617.ctx.bloom SRR10519619.ctx.bloom SRR10519620.ctx.bloom SRR10519637.ctx.bloom SRR10521982.ctx.bloom SRR10521983.ctx.bloom SRR10521984.ctx.bloom SRR10527348.ctx.bloom SRR10527349.ctx.bloom SRR10527351.ctx.bloom SRR10527352.ctx.bloom SRR10527353.ctx.bloom SRR8088181.ctx.
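
The build cell relies on the `*.bloom` glob and the generated `-s` flags coming out in the same sorted order, so that each filter lines up with its sample name. As a sketch of that pairing (assuming, as the cell above does, that `bigsi build` matches filters and `-s` names by position), the following only prints the command for two placeholder entries; `SAMPLE_A`, `SAMPLE_B`, and `print_build_command` are invented for illustration.

```shell
# Hypothetical helper: print (not run) a `bigsi build` command line for every
# *.ctx.bloom entry in the current directory, adding one "-s <sample>" flag
# per entry in the same glob order.
print_build_command() {
    local bloom
    set --                                   # reuse "$@" as the flag list
    for bloom in *.ctx.bloom; do
        [ -e "$bloom" ] || continue          # glob matched nothing
        set -- "$@" -s "$(basename "$bloom" .ctx.bloom)"
    done
    echo bigsi build *.ctx.bloom "$@"
}

# Demo with empty placeholder entries (made-up sample names).
demo_dir=$(mktemp -d)
touch "$demo_dir/SAMPLE_A.ctx.bloom" "$demo_dir/SAMPLE_B.ctx.bloom"
build_command=$(cd "$demo_dir" && print_build_command)
echo "$build_command"
rm -rf "$demo_dir"
```

Collecting the flags in `"$@"` rather than in an unquoted string keeps each sample name a separate argument, which matters if a name ever contains unusual characters.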

In [14]:
ls -lh bigsi.db

-rw-r--r-- 1 apetkau grp_apetkau 1.3G Dec  4 17:22 bigsi.db


Awesome. We've gotten our database constructed. Let's try it out.

In [18]:
echo -e "sample_name\tpercent_kmers"
conda run --name bigsi /usr/bin/time -v bigsi search GTTTCGTTCTTCCGGCGCGGGCGGTCAGCACGTTAACACCACCGACTCCGCTATCCGTATTACCCACTTGCCGACCGGCATCTTGGTGGAATGCCAGGACGAGC \
    | sed -e $'s/\'/\"/g' | jq -r '.results[] | "\(.sample_name)\t\(.percent_kmers_found)"'

sample_name	percent_kmers
  config = yaml.load(infile)
	Command being timed: "bigsi search GTTTCGTTCTTCCGGCGCGGGCGGTCAGCACGTTAACACCACCGACTCCGCTATCCGTATTACCCACTTGCCGACCGGCATCTTGGTGGAATGCCAGGACGAGC"
	User time (seconds): 3.11
	System time (seconds): 3.82
	Percent of CPU this job got: 1435%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.48
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 63324
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 10962
	Voluntary context switches: 137
	Involuntary context switches: 150489
	Swaps: 0
	File system inputs: 0
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
ERR1144974	100
ERR1144975	100
ERR1144976	100
ERR1144977	100
ERR1144978	100
ERR3655992	100
ERR3655994	10

Awesome. We can use this BIGSI index to find which samples contain a specific gene sequence.
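
The tab-separated table printed by the search cell is easy to post-process. As a sketch, the following keeps only the samples at or above a k-mer match threshold. The first three data rows are copied from the output above, while the `EXAMPLE_PARTIAL` row with a value of 42 and the `filter_hits` helper are invented to show a row being dropped.

```shell
# Hypothetical helper: read "sample<TAB>percent" rows (with a header line) on
# stdin and print the sample names whose percentage is >= the threshold in $1.
filter_hits() {
    awk -F '\t' -v min="$1" 'NR > 1 && $2 + 0 >= min { print $1 }'
}

hits=$(printf 'sample_name\tpercent_kmers\nERR1144974\t100\nERR1144975\t100\nERR1144976\t100\nEXAMPLE_PARTIAL\t42\n' \
    | filter_hits 100)
echo "$hits"
```

In the notebook, the same function could be appended to the `jq` pipeline, since `/usr/bin/time -v` writes its report to stderr and leaves stdout as a clean table.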