# BIGSI Test 1

This is a test of the software `bigsi` <https://bigsi.readme.io/>, following the tutorial at <https://bigsi.readme.io/docs/your-first-bigsi>.

# Step 1

Step 1 downloads the fastq files and counts kmers. It uses `fasterq-dump` to download the fastq files and `mccortex31` to count kmers and write a cortex graph for each sample. The accessions of the genomes to download are listed in the file `genomes.txt`.

In [9]:
NOTEBOOK_DIR=`git rev-parse --show-toplevel`

cd $NOTEBOOK_DIR
rm -rf kmers
mkdir kmers
pushd kmers

~/workspace/bigsi-examples/kmers ~/workspace/bigsi-examples ~/workspace/bigsi-examples


In [11]:
kmer_size=31
threads=16

for accession in `cat ../genomes.txt`;
do
    echo "Working on $accession"
    
    # Output file names for this accession
    fastq_file=${accession}.fastq
    cortex_out=${accession}.ctx
    
    set -x
    fasterq-dump -s -o ${fastq_file} "$accession"
    mccortex31 build -t ${threads} -k ${kmer_size} -s ${accession} --seqi ${fastq_file} ${cortex_out}
    set +x
done

echo "Done"

Working on SRR9842706
+ fasterq-dump -s -o SRR9842706.fastq SRR9842706
2019-10-01T20:56:38 fasterq-dump.2.10.0 err: fasterq-dump.c fastdump_csra() checking ouput-file 'SRR9842706.fastq' -> RC(rcExe,rcFile,rcPacking,rcName,rcExists)
+ mccortex31 build -t 16 -k 31 -s SRR9842706 --seqi SRR9842706.fastq SRR9842706.ctx
[01 Oct 2019 15:56:38-qaD][cmd] mccortex31 build -t 16 -k 31 -s SRR9842706 --seqi SRR9842706.fastq SRR9842706.ctx
[01 Oct 2019 15:56:38-qaD][cwd] /home/CSCScience.ca/apetkau/workspace/bigsi-examples/kmers
[01 Oct 2019 15:56:38-qaD][version] mccortex=v0.0.3-610-g400c0e3 zlib=1.2.11 htslib=1.8-17-g699ed53 ASSERTS=ON hash=Lookup3 CHECKS=ON k=3..31
[01 Oct 2019 15:56:38-qaD] Saving graph to: SRR9842706.ctx
[01 Oct 2019 15:56:38-qaD][sample] 0: SRR9842706
[01 Oct 2019 15:56:38-qaD][task] SRR9842706.fastq; FASTQ offset: auto-detect, threshold: off; cut homopolymers: off; remove PCR duplicates: no; colour: 0
[01 Oct 2019 15:56:38-qaD][memory] 104 bits per kmer
[01 Oct 2019 15:56:38-

Awesome. We've now got files named like `SRR9842706.ctx` in our directory containing the cortex graph/kmers.
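
The `rcExists` error from `fasterq-dump` in the log above appears to mean that `SRR9842706.fastq` was already present from an earlier run, so `mccortex31` went on to use the existing file. One way to make re-runs explicit is a small guard like the sketch below. `fetch_if_missing` is a hypothetical helper name, not part of either tool, and it treats any non-empty file as complete, which would be wrong after an interrupted download.

```shell
# Hypothetical helper: run a command only when its output file is absent or empty.
# Usage: fetch_if_missing <output_file> <command> [args...]
fetch_if_missing() {
    out_file=$1
    shift
    if [ -s "$out_file" ]; then
        echo "Skipping, $out_file already exists"
        return 0
    fi
    "$@"
}

# Inside the loop above, the download step could then be wrapped like this:
# fetch_if_missing "${fastq_file}" fasterq-dump -s -o "${fastq_file}" "$accession"
```

With this guard, a second run prints a "Skipping" message instead of the `fasterq-dump` error.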

# Step 2

In step 2, we'll build the `bigsi` index from the cortex files produced in step 1.

In [12]:
number_hashes=3
m_value=25000000

mkdir bigsi
pushd bigsi

~/workspace/bigsi-examples/kmers/bigsi ~/workspace/bigsi-examples/kmers ~/workspace/bigsi-examples ~/workspace/bigsi-examples
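
The two values set above are the Bloom filter parameters: `number_hashes` is the number of hash functions and `m_value` is the filter size in bits. To get a feel for what they imply, the sketch below evaluates the textbook Bloom filter false-positive approximation `(1 - e^(-h*n/m))^h`. This is a general formula and is not taken from the `bigsi` documentation, and `n_kmers` is a hypothetical count of distinct kmers, not a number measured in this notebook.

```shell
# Approximate Bloom filter false-positive rate: (1 - e^(-h*n/m))^h
#   h = number of hash functions
#   m = filter size in bits
#   n = number of distinct kmers inserted
bloom_fpr() {
    awk -v h="$1" -v m="$2" -v n="$3" \
        'BEGIN { printf "%.6f\n", (1 - exp(-h * n / m)) ^ h }'
}

number_hashes=3
m_value=25000000
n_kmers=5000000   # hypothetical kmer count, for illustration only

bloom_fpr "$number_hashes" "$m_value" "$n_kmers"
```

A larger `m` or fewer inserted kmers lowers the rate, so `m_value` should be chosen with the expected number of kmers per sample in mind.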


Let's set up the bigsi index configuration.

In [21]:
cat > berkeleydb.yaml << EOF
## Example config using berkeleyDB
h: ${number_hashes}
k: ${kmer_size}
m: ${m_value}
storage-engine: berkeleydb
storage-config:
  filename: test-berkeleydb
  flag: "c" ## Change to 'r' for read-only access
EOF

export BIGSI_CONFIG=berkeleydb.yaml
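
Because the heredoc above is unquoted, the shell substitutes `${number_hashes}`, `${kmer_size}` and `${m_value}` when the file is written. If one of them is unset (for example after a kernel restart), the config silently ends up with an empty value. A quick check could look like the sketch below. `check_bigsi_config` is a hypothetical helper, and it only checks the three numeric keys used in this notebook.

```shell
# Hypothetical check: succeed only if h, k and m each have a positive integer value.
check_bigsi_config() {
    config_file=$1
    for key in h k m; do
        if ! grep -Eq "^${key}: [1-9][0-9]*\$" "$config_file"; then
            echo "Missing or non-numeric value for '${key}' in ${config_file}" >&2
            return 1
        fi
    done
}

# Usage, assuming the file written above:
# check_bigsi_config berkeleydb.yaml && export BIGSI_CONFIG=berkeleydb.yaml
```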

Now, let's construct the Bloom filters.

In [22]:
bigsi bloom ../SRR9842706.ctx SRR9842706.bloom

  config = yaml.load(infile)


In [23]:
ls

berkeleydb.yaml  SRR9842706.bloom


In [25]:
bigsi build SRR9842706.bloom -s SRR9842706

  config = yaml.load(infile)
INFO:bigsi.cmds.build:Building index: 0/1
DEBUG:bigsi.cmds.build:Loading /home/CSCScience.ca/apetkau/workspace/bigsi-examples/kmers/bigsi/SRR9842706.bloom/SRR9842706.bloom 
DEBUG:bigsi.graph.bigsi:Insert sample metadata
DEBUG:bigsi.graph.bigsi:Create signature index
DEBUG:bigsi.graph.index:Transpose bitarrays
DEBUG:bigsi.graph.index:Insert rows
DEBUG:bigsi.storage.base:set bitarrays
{'result': 'success'}
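
The cells above index a single sample. If `genomes.txt` lists several accessions, the same two commands would need to cover every `.ctx` file. The sketch below only prints the commands it would run, so they can be inspected first. It assumes that `bigsi build` accepts several Bloom filters followed by one `-s <sample>` per filter in the same order, which is how we read the tutorial but have not verified here. `plan_bigsi_build` is a hypothetical helper name.

```shell
# Print (do not run) the bigsi commands for every .ctx file in a directory.
# Assumption: `bigsi build` takes one `-s <sample>` per Bloom filter, in matching order.
plan_bigsi_build() {
    ctx_dir=$1
    blooms=""
    samples=""
    for ctx in "$ctx_dir"/*.ctx; do
        [ -e "$ctx" ] || continue          # no .ctx files: the glob stays literal
        sample=$(basename "$ctx" .ctx)
        echo "bigsi bloom $ctx $sample.bloom"
        blooms="$blooms $sample.bloom"
        samples="$samples -s $sample"
    done
    if [ -n "$blooms" ]; then
        echo "bigsi build$blooms$samples"
    fi
}

# Usage from the bigsi directory, where the .ctx files are one level up:
# plan_bigsi_build ..
```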


In [26]:
bigsi search GTTTCGTTCTTCCGGCGCGGGCGGTCAGCACGTTAACACCACCGACTCCGCTATCCGTATTACCCACTTGCCGACCGGCATCTTGGTGGAATGCCAGGACGAGC

  config = yaml.load(infile)
{'query': 'GTTTCGTTCTTCCGGCGCGGGCGGTCAGCACGTTAACACCACCGACTCCGCTATCCGTATTACCCACTTGCCGACCGGCATCTTGGTGGAATGCCAGGACGAGC', 'threshold': 1.0, 'results': [{'percent_kmers_found': 100.0, 'num_kmers': 74, 'num_kmers_found': 74, 'sample_name': 'SRR9842706'}], 'citation': 'http://dx.doi.org/10.1038/s41587-018-0010-1'}
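
As a sanity check on the search result, a sequence of length `L` contains `L - k + 1` overlapping kmers of size `k`. The sketch below computes that for the query above with `k=31`. `kmer_count` is a hypothetical helper, not a `bigsi` command. The query is 104 bases long, so the count agrees with the `num_kmers: 74` reported in the output.

```shell
# Number of overlapping kmers in a sequence: length - k + 1 (0 if the sequence is shorter than k).
kmer_count() {
    sequence=$1
    k=$2
    len=${#sequence}
    if [ "$len" -lt "$k" ]; then
        echo 0
    else
        echo $((len - k + 1))
    fi
}

query=GTTTCGTTCTTCCGGCGCGGGCGGTCAGCACGTTAACACCACCGACTCCGCTATCCGTATTACCCACTTGCCGACCGGCATCTTGGTGGAATGCCAGGACGAGC
kmer_count "$query" 31
```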
