# Kmer analysis tests

First, let's set the project directory as an environment variable and change into it.

In [30]:
PROJECT_DIR=$(git rev-parse --show-toplevel)
cd "$PROJECT_DIR"

# True/false positives code

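Now, let's set up code to count true/false positives. The first file is treated as the expected (reference) gene list and the second as the observed gene list.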

In [50]:
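print_positives() {
    # file1: expected (reference) genes, file2: observed genes; one name per line
    file1=$1
    file2=$2

    python -c "
# Read each file into a set of gene names, skipping blank lines
with open('${file1}') as f1, open('${file2}') as f2:
    set1 = set(l.strip() for l in f1 if l.strip())
    set2 = set(l.strip() for l in f2 if l.strip())

true_positives = set1 & set2
false_positives = set2 - set1
false_negatives = set1 - set2

# Sensitivity = TP / (TP + FN); guard against an empty reference list
denominator = len(true_positives) + len(false_negatives)
sensitivity = len(true_positives) / denominator if denominator else float('nan')

print(\"True Positives: %s: %s\" % (len(true_positives), true_positives))
print(\"False Positives: %s: %s\" % (len(false_positives), false_positives))
print(\"False Negatives: %s: %s\" % (len(false_negatives), false_negatives))
print(\"Sensitivity: %0.2f\" % (sensitivity))
"
}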
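As a standalone sketch of the same comparison (not part of the original notebook), the logic can also be written as plain Python functions. The names `read_gene_list` and `compare_gene_sets` are hypothetical, and precision is an extra metric added here for illustration only. Because the inputs become sets, a gene name that appears twice in a list is counted once.

```python
def read_gene_list(lines):
    """Return the set of non-empty, stripped gene names from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


def compare_gene_sets(expected, observed):
    """Compare observed gene names against an expected (reference) set."""
    expected, observed = set(expected), set(observed)
    true_positives = expected & observed
    false_positives = observed - expected
    false_negatives = expected - observed
    # Sensitivity = TP / (TP + FN); precision = TP / (TP + FP)
    sensitivity = len(true_positives) / len(expected) if expected else float("nan")
    precision = len(true_positives) / len(observed) if observed else float("nan")
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "sensitivity": sensitivity,
        "precision": precision,
    }


# Toy example with made-up gene names, including a duplicate and a blank line
expected = read_gene_list(["geneA\n", "geneB\n", "geneB\n", "\n", "geneC\n"])
observed = read_gene_list(["geneB\n", "geneC\n", "geneD\n"])
result = compare_gene_sets(expected, observed)
print("Sensitivity: %0.2f" % result["sensitivity"])
print("Precision: %0.2f" % result["precision"])
```

Passing an open file object to `read_gene_list` works the same way as passing a list of lines.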

# Examine input files

Let's take a look at the input files.

## CARD protein homolog database

The CARD protein homolog database looks like this:

In [1]:
head db/nucleotide_fasta_protein_homolog_model.fasta

>gb|GQ343019|+|132-1023|ARO:3002999|CblA-1 [mixed culture bacterium AX_gF3SD01_15] 
ATGAAAGCATATTTCATCGCCATACTTACCTTATTCACTTGTATAGCTACCGTCGTCCGGGCGCAGCAAATGTCTGAACTTGAAAACCGGATTGACAGTCTGCTCAATGGCAAGAAAGCCACCGTTGGTATAGCCGTATGGACAGACAAAGGAGACATGCTCCGGTATAACGACCATGTACACTTCCCCTTGCTCAGTGTATTCAAATTCCATGTGGCACTGGCCGTACTGGACAAGATGGATAAGCAAAGCATCAGTCTGGACAGCATTGTTTCCATAAAGGCATCCCAAATGCCGCCCAATACCTACAGCCCCCTGCGGAAGAAGTTTCCCGACCAGGATTTCACGATTACGCTTAGGGAACTGATGCAATACAGCATTTCCCAAAGCGACAACAATGCCTGCGACATCTTGATAGAATATGCAGGAGGCATCAAACATATCAACGACTATATCCACCGGTTGAGTATCGACTCCTTCAACCTCTCGGAAACAGAAGACGGCATGCACTCCAGCTTCGAGGCTGTATACCGCAACTGGAGTACTCCTTCCGCTATGGTCCGACTACTGAGAACGGCTGATGAAAAAGAGTTGTTCTCCAACAAGGAGCTGAAAGACTTCTTGTGGCAGACCATGATAGATACTGAAACCGGTGCCAACAAACTGAAAGGTATGTTGCCAGCCAAAACCGTGGTAGGACACAAGACCGGCTCTTCCGACCGCAATGCCGACGGTATGAAAACTGCAGATAATGATGCCGGCCTCGTTATCCTTCCCGACGGCCGGAAATACTACATTGCCGCCTTCGTCATGGACTCATACGAGACGGATGAGGACAATGCGAACATCATCGCCCGCATATCACGCATGGTATATGATGCGATGAGATGA
>gb|HQ845196|+|0-861|ARO
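The FASTA headers shown above appear to pack several fields between `|` separators. Here is a minimal sketch of pulling them apart, assuming every header follows the layout of the first record (source, accession, strand, coordinates, ARO identifier, then a gene name with the organism in square brackets). The function name `parse_card_header` and the field names are hypothetical.

```python
def parse_card_header(header):
    """Split a CARD-style FASTA header into its pipe-separated fields.

    Assumes the layout seen in the first record of the database:
    >gb|ACCESSION|STRAND|START-END|ARO:ID|NAME [ORGANISM]
    """
    fields = header.lstrip(">").strip().split("|", 5)
    source, accession, strand, coords, aro, description = fields
    start, end = (int(x) for x in coords.split("-"))
    # The gene name may contain spaces, so split on the opening bracket instead
    name, _, rest = description.partition(" [")
    return {
        "source": source,
        "accession": accession,
        "strand": strand,
        "start": start,
        "end": end,
        "aro": aro,
        "name": name,
        "organism": rest.rstrip("]"),
    }


record = parse_card_header(
    ">gb|GQ343019|+|132-1023|ARO:3002999|CblA-1 [mixed culture bacterium AX_gF3SD01_15] "
)
print(record["aro"], record["name"])
```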

## Sequence reads

The sequence reads look like:

In [7]:
ls -lrth input/*.fastq

-rwxr-xr-x 1 apetkau grp_apetkau 474M Jan 15 22:42 input/SRR1952908_1.fastq
-rwxr-xr-x 1 apetkau grp_apetkau 471M Jan 15 22:42 input/SRR1952908_2.fastq


# Scenario 1: Assemble genome with SKESA, run RGI

Let's assemble the full genome with SKESA and then run RGI on the assembly.

In [54]:
conda run --name skesa /usr/bin/time -v skesa --cores 1 \
    --fastq input/SRR1952908_1.fastq,input/SRR1952908_2.fastq \
    --contigs_out SRR1952908-full-assembly.fasta --vector_percent 1

skesa --cores 1 --fastq input/SRR1952908_1.fastq,input/SRR1952908_2.fastq --contigs_out SRR1952908-full-assembly.fasta --vector_percent 1 

Total mates: 4294092 Paired reads: 2147046
Reads acquired in  10.307535s wall, 9.950000s user + 0.350000s system = 10.300000s CPU (99.9%)
Adapters clip is disabled

Kmer len: 21
Raw kmers: 338084078 Memory needed (GB): 6.49121 Memory available (GB): 29.7537 1 cycle(s) will be performed
Distinct kmers: 5702791
Kmer count in  78.417178s wall, 70.970000s user + 8.790000s system = 79.760000s CPU (101.7%)
Uniq kmers merging in  0.403631s wall, 0.210000s user + 0.190000s system = 0.400000s CPU (99.1%)
Kmers branching in  11.062922s wall, 11.010000s user + 0.170000s system = 11.180000s CPU (101.1%)

Average read length: 99
Genome size estimate: 4298544

Kmer: 21 Graph size: 5702791 Contigs in: 0
Valley: 13

Mark used kmers in  0.000014s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)
Kmers in multiple/single contigs: 0 0
Fragments before: 1

Great. Let's look at the assembled genome and run RGI on it.

In [55]:
ls -lrth SRR1952908-full-assembly.fasta

-rw-r--r-- 1 apetkau grp_apetkau 4.6M Jan 16 20:18 SRR1952908-full-assembly.fasta


In [56]:
mkdir rgi-full-assembly-out
pushd rgi-full-assembly-out
conda run --name rgi-4.2.2 /usr/bin/time -v rgi main -i ../SRR1952908-full-assembly.fasta -o out -n 1
popd

~/workspace/kmer-analysis-test/rgi-full-assembly-out ~/workspace/kmer-analysis-test
	Command being timed: "rgi main -i ../SRR1952908-full-assembly.fasta -o out -n 1"
	User time (seconds): 236.39
	System time (seconds): 4.84
	Percent of CPU this job got: 109%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 3:40.13
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 141668
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 254807
	Voluntary context switches: 370
	Involuntary context switches: 217658
	Swaps: 0
	File system inputs: 0
	File system outputs: 204216
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
~/workspace/kmer-analysis-test


In [60]:
grep 'protein homolog model' rgi-full-assembly-out/out.txt | cut -f 9 | tee rgi-full-assembly-out/genes.txt

ugd
basS
MdtK
acrD
PmrF
TEM-57
tet(C)
kdpE
patA
bacA
Escherichia coli acrA
acrB
golS
mdsA
adeF
mdsC
cpxA
emrR
emrA
emrB
CRP
mdtC
sdiA
AAC(6')-Iaa
marA
OKP-B-10
mdtM
Escherichia coli mdfA
msbA
mef(B)
sul3
qacH
aadA13
Salmonella enterica cmlA
aadA
aadA9
mdtB
mdtC
baeR


# Scenario 2: Prefilter reads with kat, assemble with SKESA, run RGI

## 2.1: Prefilter reads

In [20]:
conda run --name kat /usr/bin/time -v kat filter seq -t 1 -o SRR1952908.kat -m 31 \
    --seq input/SRR1952908_1.fastq --seq2 input/SRR1952908_2.fastq \
    db/nucleotide_fasta_protein_homolog_model.fasta

Kmer Analysis Toolkit (KAT) V2.4.2

Running KAT in filter sequence mode
-----------------------------------

Input 50320 is a sequence file.  Counting kmers for input 50320 (db/nucleotide_fasta_protein_homolog_model.fasta) ... done.  Time taken: 0.6s

Filtering sequences ...
Processed 100000 pairs
Processed 200000 pairs
Processed 300000 pairs
Processed 400000 pairs
Processed 500000 pairs
Processed 600000 pairs
Processed 700000 pairs
Processed 800000 pairs
Processed 900000 pairs
Processed 1000000 pairs
Processed 1100000 pairs
Processed 1200000 pairs
Processed 1300000 pairs
Processed 1400000 pairs
Processed 1500000 pairs
Processed 1600000 pairs
Processed 1700000 pairs
Processed 1800000 pairs
Processed 1900000 pairs
Processed 2000000 pairs
Processed 2100000 pairs
Finished filtering.  Time taken: 109.4s

Found 10204 / 2147046 to keep

KAT filter seq completed.
Total runtime: 110.0s

	Command being timed: "kat filter seq -t 1 -o SRR1952908.kat -m 31 --seq input/SRR1952908_1.fastq --seq2 inp

In [22]:
ls -lrth *kat*.fastq

-rw-r--r-- 1 apetkau grp_apetkau 2.3M Jan 16 17:30 SRR1952908.kat.in.R2.fastq
-rw-r--r-- 1 apetkau grp_apetkau 2.3M Jan 16 17:30 SRR1952908.kat.in.R1.fastq
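To make the idea behind the prefilter concrete, here is a minimal sketch of k-mer based read filtering in Python. It is not KAT's implementation: keeping a pair when either mate shares at least one k-mer with the database (on either strand) is an assumption made for illustration, and the helper names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def keep_pair(read1, read2, db_kmers, k):
    """Keep a read pair if either mate shares a k-mer with the database."""
    for read in (read1, read2):
        for strand in (read, reverse_complement(read)):
            if not kmers(strand, k).isdisjoint(db_kmers):
                return True
    return False


# Toy database and reads (made-up sequences, k=4 instead of 31 for brevity)
db_kmers = kmers("ACGTACGTGG", 4)
print(keep_pair("TTACGTTT", "TTTTTTTT", db_kmers, 4))  # shares ACGT
print(keep_pair("CCAC", "TTTT", db_kmers, 4))          # matches on reverse strand
print(keep_pair("TTTTTTTT", "GGGGGGGG", db_kmers, 4))  # no shared k-mer
```

In the run above, KAT kept 10204 of 2147046 pairs, which is under 0.5% of the input.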


## 2.2: Assemble extracted reads

In [24]:
conda run --name skesa /usr/bin/time -v skesa --cores 1 \
    --fastq SRR1952908.kat.in.R1.fastq,SRR1952908.kat.in.R2.fastq \
    --contigs_out SRR1952908.kat.contigs.fasta --vector_percent 1

skesa --cores 1 --fastq SRR1952908.kat.in.R1.fastq,SRR1952908.kat.in.R2.fastq --contigs_out SRR1952908.kat.contigs.fasta --vector_percent 1 

Total mates: 20408 Paired reads: 10204
Reads acquired in  0.063515s wall, 0.050000s user + 0.000000s system = 0.050000s CPU (78.7%)
Adapters clip is disabled

Kmer len: 21
Raw kmers: 1607679 Memory needed (GB): 0.0308674 Memory available (GB): 29.9988 1 cycle(s) will be performed
Distinct kmers: 33311
Kmer count in  0.421162s wall, 0.400000s user + 0.020000s system = 0.420000s CPU (99.7%)
Uniq kmers merging in  0.004130s wall, 0.010000s user + 0.000000s system = 0.010000s CPU (242.2%)
Kmers branching in  0.044778s wall, 0.040000s user + 0.010000s system = 0.050000s CPU (111.7%)

Average read length: 99
Genome size estimate: 9918

Kmer: 21 Graph size: 25504 Contigs in: 0
Valley: 19

Mark used kmers in  0.000007s wall, 0.000000s user + 0.000000s system = 0.000000s CPU (n/a%)
Kmers in multiple/single contigs: 0 0
Fragments before: 33 22459
Fragments

In [26]:
ls -lh *kat*.fasta

-rw-r--r-- 1 apetkau grp_apetkau 22K Jan 16 17:32 SRR1952908.kat.contigs.fasta


## 2.3: Run RGI

In [27]:
mkdir rgi-kat
pushd rgi-kat
conda run --name rgi-4.2.2 /usr/bin/time -v rgi main -i ../SRR1952908.kat.contigs.fasta -o out -n 1
popd

~/workspace/kmer-analysis-test/rgi-kat ~/workspace/kmer-analysis-test
	Command being timed: "rgi main -i ../SRR1952908.kat.contigs.fasta -o out -n 1"
	User time (seconds): 6.03
	System time (seconds): 4.63
	Percent of CPU this job got: 343%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.10
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 128820
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 164086
	Voluntary context switches: 200
	Involuntary context switches: 19263
	Swaps: 0
	File system inputs: 0
	File system outputs: 11232
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
~/workspace/kmer-analysis-test
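The "Elapsed (wall clock) time" lines from `/usr/bin/time -v` can be converted to seconds to compare the two RGI runs. This is a small sketch with a hypothetical helper name, using the elapsed times reported for RGI on the full assembly (3:40.13) and on the kat-filtered contigs (0:03.10).

```python
def elapsed_to_seconds(text):
    """Convert an 'h:mm:ss' or 'm:ss.ss' string from `time -v` into seconds."""
    seconds = 0.0
    for part in text.strip().split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Elapsed times copied from the two RGI runs in this notebook
full_assembly = elapsed_to_seconds("3:40.13")
kat_filtered = elapsed_to_seconds("0:03.10")

print("RGI on full assembly: %0.2f s" % full_assembly)
print("RGI on kat-filtered contigs: %0.2f s" % kat_filtered)
print("Ratio: %0.0fx" % (full_assembly / kat_filtered))
```

This compares the RGI step only. The kat filtering step itself reported a total runtime of 110.0s, so it has to be included in any end-to-end comparison.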


In [52]:
grep 'protein homolog model' rgi-kat/out.txt | cut -f 9 | tee rgi-kat/genes.txt

MdtK
sul3
aadA
Salmonella enterica cmlA
aadA13
qacH
emrR
mef(B)
sdiA
TEM-57
patA
AAC(6')-Ib7
cpxA
mdtC
AAC(6')-Iaa
golS
acrB
mdsC
adeF
mdsA


Let's compare these genes to the Scenario 1 results, treating the full-assembly gene list as the reference.

In [61]:
print_positives "rgi-full-assembly-out/genes.txt" "rgi-kat/genes.txt"

True Positives: 19: {'aadA13', 'mdtC', 'patA', 'cpxA', 'acrB', 'adeF', 'sul3', 'qacH', 'Salmonella enterica cmlA', 'emrR', "AAC(6')-Iaa", 'TEM-57', 'golS', 'sdiA', 'mdsA', 'MdtK', 'aadA', 'mdsC', 'mef(B)'}
False Positives: 1: {"AAC(6')-Ib7"}
False Negatives: 19: {'msbA', 'CRP', 'baeR', 'OKP-B-10', 'basS', 'kdpE', 'mdtM', 'Escherichia coli mdfA', 'acrD', 'ugd', 'PmrF', 'marA', 'emrB', 'aadA9', 'tet(C)', 'Escherichia coli acrA', 'emrA', 'bacA', 'mdtB'}
Sensitivity: 0.50
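The sensitivity follows directly from the counts above, and the same counts give a precision as well. The sketch below redoes the arithmetic; precision and F1 are not reported by `print_positives` and are added here only for illustration, with the Scenario 1 gene list assumed to be the truth.

```python
# Counts copied from the comparison output above
tp, fp, fn = 19, 1, 19

sensitivity = tp / (tp + fn)  # fraction of Scenario 1 genes recovered
precision = tp / (tp + fp)    # fraction of Scenario 2 genes also in Scenario 1
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print("Sensitivity: %0.2f" % sensitivity)  # 0.50
print("Precision: %0.2f" % precision)      # 0.95
print("F1: %0.2f" % f1)                    # 0.66
```

So the prefiltered assembly reports almost nothing that the full assembly lacks, but it misses half of the genes the full assembly finds.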


# Scenario 3: Break RGI database to kmers first

## 3.1: Run jellyfish

In [59]:
head -n 2 db/nucleotide_fasta_protein_homolog_model.fasta

>gb|GQ343019|+|132-1023|ARO:3002999|CblA-1 [mixed culture bacterium AX_gF3SD01_15] 
ATGAAAGCATATTTCATCGCCATACTTACCTTATTCACTTGTATAGCTACCGTCGTCCGGGCGCAGCAAATGTCTGAACTTGAAAACCGGATTGACAGTCTGCTCAATGGCAAGAAAGCCACCGTTGGTATAGCCGTATGGACAGACAAAGGAGACATGCTCCGGTATAACGACCATGTACACTTCCCCTTGCTCAGTGTATTCAAATTCCATGTGGCACTGGCCGTACTGGACAAGATGGATAAGCAAAGCATCAGTCTGGACAGCATTGTTTCCATAAAGGCATCCCAAATGCCGCCCAATACCTACAGCCCCCTGCGGAAGAAGTTTCCCGACCAGGATTTCACGATTACGCTTAGGGAACTGATGCAATACAGCATTTCCCAAAGCGACAACAATGCCTGCGACATCTTGATAGAATATGCAGGAGGCATCAAACATATCAACGACTATATCCACCGGTTGAGTATCGACTCCTTCAACCTCTCGGAAACAGAAGACGGCATGCACTCCAGCTTCGAGGCTGTATACCGCAACTGGAGTACTCCTTCCGCTATGGTCCGACTACTGAGAACGGCTGATGAAAAAGAGTTGTTCTCCAACAAGGAGCTGAAAGACTTCTTGTGGCAGACCATGATAGATACTGAAACCGGTGCCAACAAACTGAAAGGTATGTTGCCAGCCAAAACCGTGGTAGGACACAAGACCGGCTCTTCCGACCGCAATGCCGACGGTATGAAAACTGCAGATAATGATGCCGGCCTCGTTATCCTTCCCGACGGCCGGAAATACTACATTGCCGCCTTCGTCATGGACTCATACGAGACGGATGAGGACAATGCGAACATCATCGCCCGCATATCACGCATGGTATATGATGCGATGAGATGA
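To illustrate what breaking a database sequence into k-mers means, here is a minimal counting sketch in Python. It counts canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement), which is one common convention. Whether that convention would be used in this scenario is an assumption, and this is not jellyfish itself; the function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    reverse = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse)


def count_canonical_kmers(seq, k):
    """Count canonical k-mers in a single DNA sequence."""
    seq = seq.upper()
    return Counter(canonical(seq[i:i + k]) for i in range(len(seq) - k + 1))


# First 12 bases of the CblA-1 record shown above, with k=4 for brevity
counts = count_canonical_kmers("ATGAAAGCATAT", 4)
print(len(counts), "distinct canonical 4-mers")
print(counts.most_common(3))
```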


# Scenario 4: RGI BWT

## 4.1: Run RGI BWT

In [29]:
mkdir rgi-bwt-out
pushd rgi-bwt-out
conda run --name rgi-4.2.2 /usr/bin/time -v rgi bwt -1 SRR1952908_1.fastq -2 SRR1952908_2.fastq -a bwa -n 1 -o out-bwt
popd

~/workspace/kmer-analysis-test/rgi-bwt-out ~/workspace/kmer-analysis-test
[bwa_idx_build] fail to open file '/home/CSCScience.ca/apetkau/miniconda3/envs/rgi-4.2.2/lib/python3.6/site-packages/app/_data/card_reference.fasta' : No such file or directory
[E::bwa_idx_load_from_disk] fail to locate the index files
	Command being timed: "rgi bwt -1 SRR1952908_1.fastq -2 SRR1952908_2.fastq -a bwa -n 1 -o out-bwt"
	User time (seconds): 3.86
	System time (seconds): 4.40
	Percent of CPU this job got: 465%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.77
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 184568
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 32
	Minor (reclaiming a frame) page faults: 87474
	Voluntary context switches: 288
	Involuntary context switches: 48445
	Swaps: 0
	File system inputs: 8760
	File system outputs: 64
	Soc

# Compare results (detected AMR)

In [26]:
cut -f 1,3,4 kat-staramr-out/summary.tsv

Isolate ID	Genotype	Predicted Phenotype
kat.filter.contigs	aadA1, aadA2, blaTEM-57, cmlA1, sul3, tet(A)	streptomycin, ampicillin, chloramphenicol, sulfisoxazole, tetracycline


In [27]:
cut -f 1,3,4 staramr-out/summary.tsv

Isolate ID	Genotype	Predicted Phenotype
SRR1952908	aadA1, aadA2, blaTEM-57, cmlA1, sul3, tet(A)	streptomycin, ampicillin, chloramphenicol, sulfisoxazole, tetracycline
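The two staramr summaries can also be compared programmatically instead of by eye. This is a sketch that parses the three columns shown above from tab-separated text; the helper name `genotypes_by_isolate` is hypothetical, and the sample rows are copied from the outputs in this section.

```python
import csv
import io


def genotypes_by_isolate(tsv_text):
    """Map each isolate ID to the set of genes in its Genotype column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["Isolate ID"]: {gene.strip() for gene in row["Genotype"].split(",")}
        for row in reader
    }


header = "Isolate ID\tGenotype\tPredicted Phenotype\n"
genes = "aadA1, aadA2, blaTEM-57, cmlA1, sul3, tet(A)"
phenotype = "streptomycin, ampicillin, chloramphenicol, sulfisoxazole, tetracycline"

kat_summary = header + "kat.filter.contigs\t%s\t%s\n" % (genes, phenotype)
full_summary = header + "SRR1952908\t%s\t%s\n" % (genes, phenotype)

kat_genes = genotypes_by_isolate(kat_summary)["kat.filter.contigs"]
full_genes = genotypes_by_isolate(full_summary)["SRR1952908"]

print("Only in kat-filtered:", sorted(kat_genes - full_genes))
print("Only in full assembly:", sorted(full_genes - kat_genes))
```

For the two summaries shown above, both differences are empty: staramr reports the same six genes for the kat-filtered contigs and for the full assembly.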
