In [1]:
NOTEBOOK_DIR=`git rev-parse --show-toplevel`
ROOT_DIR=$NOTEBOOK_DIR/metagenomics
cd $ROOT_DIR
ls

[0m[01;34mdata[0m                                    metagenome-bigsi.ipynb
ERR1713335_kat_hist                     metagenomes.txt
ERR1713335_kat_hist.dist_analysis.json  PRJEB13831.txt
[01;35mERR1713335_kat_hist.png[0m                 process-data.ipynb
[01;34mimages[0m


# Metagenomics BIGSI

This runs `bigsi` <https://bigsi.readme.io/> on the metagenomics dataset. First, let's set up some variables that define the size of the bloom filters and the other `bigsi` parameters.

In [2]:
number_hashes=3
m_value=25000000
kmer_size=9

threads=32
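
As a rough check on these values, the standard bloom filter approximation puts the false-positive rate at (1 - e^(-h*n/m))^h for n inserted items. The sketch below plugs in the parameters above and takes n = 4^k (every possible kmer of length k, ignoring reverse complements) as a worst-case assumption, since the real number of distinct kmers per sample is not known at this point. The `fpr` variable name is only for illustration.

```shell
number_hashes=3
m_value=25000000
kmer_size=9

# Worst-case assumption: every possible kmer (4^k of them) ends up in one filter.
fpr=$(awk -v h="$number_hashes" -v m="$m_value" -v k="$kmer_size" 'BEGIN {
    n = 4 ^ k                       # upper bound on distinct kmers
    p = (1 - exp(-h * n / m)) ^ h   # standard bloom filter approximation
    printf "%.3e", p
}')

echo "Estimated worst-case false-positive rate per kmer lookup: $fpr"
```

With such a small kmer size the filter is far from full, so the estimate comes out very low; a larger `kmer_size` would make n (and this bound) grow quickly.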

# Step 1: Counting k-mers

Step 1 involves counting kmers using [mccortex](https://github.com/mcveanlab/mccortex).
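
To make "counting kmers" concrete, here is a toy sketch that lists every kmer in a short made-up read using `awk`. It is only an illustration: the sequence is invented, reverse complements are ignored, and it says nothing about how mccortex works internally.

```shell
# Toy read (made up for illustration) and the kmer size used in this notebook.
read_seq="GTTTCGTTCTTCC"
kmer_size=9

# Slide a window of length kmer_size along the read, one base at a time.
kmers=$(echo "$read_seq" | awk -v k="$kmer_size" '{
    for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k)
}')

# Count how often each kmer occurs (each occurs once in this tiny example).
echo "$kmers" | sort | uniq -c
```

A read of length L yields L - k + 1 kmers, so this 13-base read gives 5 kmers of length 9.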

In [3]:
cd $ROOT_DIR
mkdir -p bigsi
pushd bigsi

for genome in $ROOT_DIR/data/*_1.fastq.gz;
do
    accession=`basename $genome _1.fastq.gz`
    echo "Working on $accession"
       
    fastq_file_1=$ROOT_DIR/data/${accession}_1.fastq.gz
    fastq_file_2=$ROOT_DIR/data/${accession}_2.fastq.gz
    cortex_out=${accession}.ctx
    cortex_log=${accession}_mccortex.log
    
    set -x
    conda run --name mccortex mccortex ${kmer_size} build -t ${threads} -k ${kmer_size} -s ${accession} --seq2 ${fastq_file_1}:${fastq_file_2} ${cortex_out} 2> ${cortex_log}.err 1> ${cortex_log}
    set +x
done

echo "Done"

~/workspace/comp7934-project/metagenomics/bigsi ~/workspace/comp7934-project/metagenomics
Working on ERR1713335
+ conda run --name mccortex mccortex 9 build -t 32 -k 9 -s ERR1713335 --seq2 /home/CSCScience.ca/apetkau/workspace/comp7934-project/metagenomics/data/ERR1713335_1.fastq.gz:/home/CSCScience.ca/apetkau/workspace/comp7934-project/metagenomics/data/ERR1713335_2.fastq.gz ERR1713335.ctx
+ set +x
Working on ERR1713374
+ conda run --name mccortex mccortex 9 build -t 32 -k 9 -s ERR1713374 --seq2 /home/CSCScience.ca/apetkau/workspace/comp7934-project/metagenomics/data/ERR1713374_1.fastq.gz:/home/CSCScience.ca/apetkau/workspace/comp7934-project/metagenomics/data/ERR1713374_2.fastq.gz ERR1713374.ctx
+ set +x
Working on ERR1713375
+ conda run --name mccortex mccortex 9 build -t 32 -k 9 -s ERR1713375 --seq2 /home/CSCScience.ca/apetkau/workspace/comp7934-project/metagenomics/data/ERR1713375_1.fastq.gz:/home/CSCScience.ca/apetkau/workspace/comp7934-project/metagenomics/data/ERR1713375_2.fast

In [5]:
ls *.ctx|head
ls *.ctx | wc -l
du -sh .

ERR1713335.ctx
ERR1713374.ctx
ERR1713375.ctx
ERR1713397.ctx
ERR1713406.ctx
5
8.2M	.


Awesome. We've now got files named like `ERR1713335.ctx` in our directory, each containing the cortex graph/kmers for one sample.

# Step 2: Build BIGSI index

In step 2, we'll build the `bigsi` index from these kmer counts.

Let's set up the bigsi index configuration.

In [6]:
cat > berkeleydb.yaml << EOF
## Example config using berkeleyDB
h: ${number_hashes}
k: ${kmer_size}
m: ${m_value}
storage-engine: berkeleydb
storage-config:
  filename: bigsi.db
  flag: "c" ## Change to 'r' for read-only access
EOF

export BIGSI_CONFIG=berkeleydb.yaml
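
Since the heredoc delimiter `EOF` is unquoted, the shell substitutes `${number_hashes}`, `${kmer_size}` and `${m_value}` when the file is written, so an unset variable would silently leave an empty value in the config. A minimal sanity check, sketched here on a temporary copy of the three numeric keys (the `check_config` helper and the temporary file are assumptions made for illustration, not part of `bigsi`):

```shell
number_hashes=3
m_value=25000000
kmer_size=9

# Write the numeric keys to a temporary file, the same way as the cell above.
config_copy=$(mktemp)
cat > "$config_copy" << EOF
h: ${number_hashes}
k: ${kmer_size}
m: ${m_value}
EOF

# Hypothetical helper: succeed only if h, k and m are all present as integers.
check_config() {
    for key in h k m; do
        grep -Eq "^${key}: [0-9]+\$" "$1" || { echo "bad or missing key: ${key}" >&2; return 1; }
    done
}

check_config "$config_copy" && echo "config looks OK"
```

The same function could be pointed at `berkeleydb.yaml` before any bloom filters are built.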

Now, let's construct the bloom filters.

In [10]:
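# Print the list of bigsi commands; run without a subcommand, it appears to exit non-zero, hence the conda ERROR in the output
conda run --name bigsi bigsi
set -x
# Build one bloom filter per cortex graph in parallel; -I% makes % stand for each .ctx file name
parallel --jobs ${threads} -I% conda run --name bigsi bigsi bloom % %.bloom ::: *.ctx
set +x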

bigsi-v0.3.1

Available Commands:

 - bloom: Creates a bloom filter from a sequence file or cortex graph. (fastq...
 - build
 - delete
 - insert: Inserts a bloom filter into the graph          e.g. bigsi insert E...
 - merge
 - search

ERROR conda.cli.main_run:execute(39): Subprocess for 'conda run ['bigsi']' command failed.  Stderr was:

+ parallel --jobs 32 -I% conda run --name bigsi bigsi bloom % %.bloom ::: ERR1713335.ctx ERR1713374.ctx ERR1713375.ctx ERR1713397.ctx ERR1713406.ctx
  config = yaml.load(infile)
  config = yaml.load(infile)
  config = yaml.load(infile)
  config = yaml.load(infile)
  config = yaml.load(infile)
+ set +x


In [12]:
ls -d *.bloom|head
ls -d *.bloom|wc -l

ERR1713335.ctx.bloom
ERR1713374.ctx.bloom
ERR1713375.ctx.bloom
ERR1713397.ctx.bloom
ERR1713406.ctx.bloom
5


Awesome. We now have our BIGSI bloom filters.

Let's merge these all into a BIGSI index.
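
The build command takes the bloom filter files followed by one `-s` sample name per file, with the names presumably matched to the files by position. The sketch below shows how those `-s` flags can be derived from the file names; `sample_flags` is a hypothetical helper, and the two file names are copied from the listing above.

```shell
# Hypothetical helper: print "-s <sample>" for each bloom filter path,
# in the same order as the paths are given.
sample_flags() {
    for bloom in "$@"; do
        printf -- '-s %s ' "$(basename "$bloom" .ctx.bloom)"
    done
}

flags=$(sample_flags ERR1713335.ctx.bloom ERR1713374.ctx.bloom)
echo "sample flags: ${flags}"
```

Deriving both lists from the same glob keeps files and sample names in the same order.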

In [20]:
# Builds command-line string
files=`echo -n *.bloom`
samples=`for i in ${files}; do echo -n "-s "; basename $i .ctx.bloom; done`

set -x
/usr/bin/time -v conda run --name bigsi bigsi build ${files} ${samples}
set +x

+ /usr/bin/time -v conda run --name bigsi bigsi build ERR1713335.ctx.bloom ERR1713374.ctx.bloom ERR1713375.ctx.bloom ERR1713397.ctx.bloom ERR1713406.ctx.bloom -s ERR1713335 -s ERR1713374 -s ERR1713375 -s ERR1713397 -s ERR1713406
{'result': 'success'}
  config = yaml.load(infile)
INFO:bigsi.cmds.build:Building index: 0/1
DEBUG:bigsi.cmds.build:Loading /home/CSCScience.ca/apetkau/workspace/comp7934-project/metagenomics/bigsi/ERR1713335.ctx.bloom/ERR1713335.ctx.bloom 
DEBUG:bigsi.cmds.build:Loading /home/CSCScience.ca/apetkau/workspace/comp7934-project/metagenomics/bigsi/ERR1713374.ctx.bloom/ERR1713374.ctx.bloom 
DEBUG:bigsi.cmds.build:Loading /home/CSCScience.ca/apetkau/workspace/comp7934-project/metagenomics/bigsi/ERR1713375.ctx.bloom/ERR1713375.ctx.bloom 
DEBUG:bigsi.cmds.build:Loading /home/CSCScience.ca/apetkau/workspace/comp7934-project/metagenomics/bigsi/ERR1713397.ctx.bloom/ERR1713397.ctx.bloom 
DEBUG:bigsi.cmds.build:Loading /home/CSCScience.ca/apetkau/workspace/comp7934-project/

In [22]:
ls -lh bigsi.db

-rw-r--r-- 1 apetkau grp_apetkau 1.3G Dec  3 14:15 bigsi.db


Awesome. Our database is now constructed. Let's try it out.

In [23]:
echo -e "sample_name\tpercent_kmers"
/usr/bin/time -v conda run --name bigsi bigsi search GTTTCGTTCTTCCGGCGCGGGCGGTCAGCACGTTAACACCACCGACTCCGCTATCCGTATTACCCACTTGCCGACCGGCATCTTGGTGGAATGCCAGGACGAGC 2> /dev/null \
    | sed -e $'s/\'/\"/g' | jq -r '.results[] | "\(.sample_name)\t\(.percent_kmers_found)"'

sample_name	percent_kmers
ERR1713335	100


Awesome. We can use this BIGSI index to find the samples that contain a specific gene sequence.
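
The `sed` step in the search command is there because `bigsi` prints Python-style dictionaries with single quotes (as in the `{'result': 'success'}` line from the build step), which `jq` will not accept as JSON. A small sketch of that conversion follows; the input line mimics the build output and is not real search output.

```shell
# Made-up line in the single-quoted style that bigsi printed during the build step.
raw="{'result': 'success'}"

# Swap every single quote for a double quote so the text becomes valid JSON.
json=$(echo "$raw" | sed -e "s/'/\"/g")
echo "$json"
```

This blanket replacement would break on values that themselves contain an apostrophe, which is fine for accession names like the ones used here.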