---
layout: post  
title: Norwalk Virus Kmer Distributions  
date: 2019-08-06  
author: Cameron Prybol  

---

In the [previous post](/selecting-genomes-by-taxonomy.html) we integrated taxonomic information with the RefSeq reference genomes to select a set of representative genomes from each taxonomic category. In this post, we will evaluate the kmer frequency profiles of the selected virus, the Norwalk virus. 

The first step in analyzing kmer frequencies is to determine all of the kmers in a genome. This can be done by starting at the beginning of the genome and returning each k-length slice, advancing one base at a time. If the genome is composed of multiple chromosomes, the process is the same, but it is done chromosome by chromosome. Depending on the question you are interested in, you may wish to review the kmers as they are actually observed, or you may wish to convert all kmers to their canonical form. The canonical form is jargon for whichever of a kmer and its reverse complement is alphabetically first.
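As a rough illustration of the sliding window and the canonical form, here is a minimal pure-bash sketch. The function names (`revcomp`, `canonical`, `stream_kmers`) and the toy sequence are made up for this example; this is not how Eisenia is implemented, and it only handles uppercase `A`, `C`, `G`, and `T`.

```bash
#!/usr/bin/env bash
# Toy stand-in for kmer streaming; these helpers are hypothetical, not Eisenia's API.

# Reverse complement of a DNA string (uppercase A/C/G/T; anything else becomes N).
revcomp() {
    local seq=$1 out="" i
    for (( i = ${#seq} - 1; i >= 0; i-- )); do
        case ${seq:i:1} in
            A) out+=T ;;
            C) out+=G ;;
            G) out+=C ;;
            T) out+=A ;;
            *) out+=N ;;
        esac
    done
    printf '%s\n' "$out"
}

# Canonical form: whichever of the kmer and its reverse complement sorts first.
canonical() {
    local kmer=$1 rc
    rc=$(revcomp "$kmer")
    if [[ "$kmer" < "$rc" ]]; then
        printf '%s\n' "$kmer"
    else
        printf '%s\n' "$rc"
    fi
}

# Slide a window of width k along the sequence, one base at a time,
# and print the canonical form of each window.
stream_kmers() {
    local k=$1 seq=$2 i
    for (( i = 0; i + k <= ${#seq}; i++ )); do
        canonical "${seq:i:k}"
    done
}

# A made-up 7-base sequence yields 7 - 3 + 1 = 5 kmers.
stream_kmers 3 GTGAATG
```

For this toy input the windows are `GTG`, `TGA`, `GAA`, `AAT`, and `ATG`; the first two are replaced by their reverse complements (`CAC` and `TCA`) because those sort first.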

I have written a function to do this in the [Eisenia](https://github.com/cjprybol/Eisenia) package, which can be accessed as a standalone command line function.

In [2]:
genomes_directory = joinpath(dirname(pwd()), "datasets", "refseq_reference_genomes")
fasta_file_base = "GCF_000868425.1_ViralProj17577_genomic.fna"
fasta_path_base = joinpath(genomes_directory, fasta_file_base)
run(pipeline(`Eisenia stream-kmers --k 3 --fasta $fasta_path_base.gz`))

CAC
TCA
GAA
AAA
AAT
ATG
TCA
CTC
AGG
GGA
ATC
ATG
CCA
GCC
GCA
CAA
AAC
ACG
CGC
GCC
CCA
ATG
ATC
AGA
AAG
GAA
AGA
CAG
GCA
CGC
CGC
GCC
CCC
AGG
CTC
AGA
CAG
ACA
CAC
GCA
CGC
CGC
GCA
CAA
AAC
ACA
CAC
ACA
CAG
AGA
CTC
AGA
GAA
AAA
AAC
ACG
CGC
GCA
CAA
AAA
AAA
AAA
AAC
ACA
CAA
AAG
AGA
GAA
AAG
AGG
GCC
AGC
AAG
GAA
CGA
ACG
GAC
AGA
CTA
TAA
AAA
AAG
AGC
AGC
CTA
CTA
ACT
CAC
ACA
GAC
AGA
CTC
GGA
AGG
AAG
AAA
CAA
CCA
GGA
CTC
AGC
GCA
CAC
ACC
AGG
CTA
CTA
AGC
GCC
CCC
CCC
CCC
AGG
CTC
AGA
CTC
AGA
AAG
GAA
CGA
CCG
GGA
CTC
AGA
CTC
AGC
CGC
CGA
GAA
AAG
AGA
GAC
ACG
CGA
GAA
AAA
AAT
AAT
TAA
TAA
AAT
AAT
TAA
GTA
ACA
ATG
ATG
TCA
GAC
ACC
CCC
CCC
AGG
CTC
GGA
AGG
CAG
TCA
CTC
AGC
GCA
CAG
AGG
GGA
GAA
AAG
AGC
AGC
CTC
TCA
CAG
AGC
GCC
CCC
CCG
CCG
GCC
CGC
CGC
GCC
CCC
AGG
AAG
CAA
GCA
CGC
CCG
GCC
GCC
CCC
AGG
AAG
GAA
TCA
ATG
ATG
GCA
CGC
CCG
GGA
GAA
AAG
AGG
CCC
GCC
GCC
CCG
CGC
AGC
AAG
CAA
GCA
GCC
CCG
CCG
CCC
GCC
AGC
CTC
GGA
CCC
CCC
CCG
ACG
CAC
TCA
GAC
ACG
CGC
CGC
ACG
GTA
CTA
ACT
CAC
TCA
ATC
ATG
GCA
GCA
CAC
ACG
CGC
CGC
ACG
CAC
GCA
AGC
CAG
TCA


GCC
GCA
CAG
AGC
AGC
CTA
ATA
AAT
CAA
CCA
CCC
ACC
CAC
GCA
GCA
CAA
AAA
AAG
AGC
GCA
CAA
AAT
ATG
CCA
CCC
CCC
ACC
CAC
GCA
GCA
CAC
ACC
AGG
CAG
ACA
CAC
ACA
GTA
CTA
AGA
CTC
AGG
GGA
CTC
AGG
GCC
GCC
CCA
ATG
ATG
GCA
GCA
ATG
AAT
AAA
TAA
GTA
ACT
AAG
TAA
TAA
AAG
AGG
GGA
ATC
AAT
TAA
ATA
ATG
TCA
CTC
AGG
ACC
CAC
CCA
GCC
AGC
CTC
GGA
CCC
AGG
CAG
ACA
GAC
TCA
CAG
AGG
ACC
CAC
TCA
ATC
ATC
AGA
CTA
GTA
ACA
CAA
AAT
ATG
CCA
ACC
CAC
GCA
GCC
CCA
CAC
ACC
AGG
CTA
ATA
ATA
TAA
AAT
ATG
ACA
CAC
TCA
CTC
AGC
AGC
CAG
GCA
GCA
ATG
ATC
TCA
CAA
AAG
AGG
CCC
ACC
CAC
GCA
GCC
CCC
CCC
CCA
CAA
AAT
ATG
CCA
ACC
AAC
CAA
TCA
GAA
AAA
AAA
AAG
AGG
ACC
GAC
TCA
CAA
AAG
AGG
GGA
CTC
AGC
GCC
AGG
CAG
TCA
GAA
AAT
AAT
CAA
GCA
GCC
CCC
CCA
CAA
AAA
AAA
AAC
ACA
CAC
ACT
AAG
CAA
ACA
GAC
TCA
CAA
AAC
ACT
CAG
ACA
CAC
ACA
GAC
TCA
CAG
AGA
GAA
AAG
AGG
GGA
ATC
ATA
TAA
AAA
AAG
AGG
GGA
CTC
AGG
GCC
GCC
CCC
CCG
CGC
GCC
AGG
CTC
GGA
CCG
CGC
AGC
CAG
GCA
AGC
CTA
GTA
ACT
CAG
ACA
GTA
CTA
AGG
GGA
ATC
ATG
CCA
GCC
AGC
CAG
GCA
AGC
CAG
TCA
GAC
ACG
ACG
GAC
TCA
ATG
ATC
TCA
CAC


ATC
ATC
CGA
ACG
CAC
GCA
GCC
CCA
CAG
AGC
GCA
ATG
ATC
CGA
CGA
GAC
ACC
CCG
CGC
GCC
CCA
CAG
AGC
AGC
CTC
GGA
AGG
CTC
AGA
CAG
CCA
GGA
GAC
ACT
CTA
TAA
AAA
AAG
AGG
GGA
GAC
ACC
AGG
CTA
TAA
AAC
ACC
CCA
CAC
ACC
CCA
CAG
AGA
GAA
AAC
ACC
CCC
CCC
AGG
AAG
AAA
CAA
TCA
CTC
AGA
GAC
ACT
CTC
AGA
CTC
GGA
CCC
AGG
CAG
CCA
GGA
GAC
ACA
ATG
ATG
GCA
AGC
CTC
TCA
CAG
AGA
CTC
AGA
GAC
ACC
CCC
AGG
CTC
GGA
CCC
CCA
CAA
AAC
ACT
CTA
TAA
AAT
ATG
CCA
GCC
GCC
CCC
AGG
CAG
GCA
AGC
CTC
CGA
CCG
ACC
CAC
TCA
CTC
AGG
GCC
AGC
CAG
GCA
GCC
CCA
ATG
ATG
GCA
GCA
ATG
ATG
CCA
ACC
CAC
TCA
GAA
AAA
AAA
AAG
ACT
GTA
ATA
AAT
TAA
GTA
ACA
CAG
AGG
GGA
GAC
ACT
CAG
ACA
CAC
CCA
GCC
AGC
AAG
GAA
GGA
CCC
CCG
ACG
CAC
ACA
GAC
AGA
CTC
GGA
CCA
CAA
AAG
AGG
GGA
CTC
AGG
GCC
GCC
CCG
CGC
GCC
CCC
CCA
CAA
AAA
AAG
ACT
CAC
CCA
CCC
GGA
ATC
ATA
CTA
AGA
GAA
AAA
AAT
ATG
CCA
ACC
GTA
CTA
ACT
GAC
GGA
CCC
CCA
CAC
ACG
CGC
GCC
CCA
CAC
ACC
CCG
CGA
ATC
ATC
AGA
CAG
ACA
AAC
GAA
AGA
CAG
GCA
CGC
CGC
AGC
CAG
CCA
CCC
ACC
CAC
GCA
CGC
CGC
AGC
AAG
AAA
CAA
CCA
GGA
GAA
AAC
ACA
CAA
AAT
ATG


Process(`Eisenia stream-kmers --k 3 --fasta /Users/Cameron/Desktop/cjprybol.github.io/datasets/refseq_reference_genomes/GCF_000868425.1_ViralProj17577_genomic.fna.gz`, ProcessExited(0))

The Norwalk virus has a relatively small genome, but even small genomes contain a lot of kmers! Let's go ahead and condense the information by piping the output to `sort` and then to `uniq` with the `--count` flag.

In [3]:
run(pipeline(`Eisenia stream-kmers --k 3 --fasta $fasta_path_base.gz`, `sort`, `uniq --count`))

    166 AAA
    170 AAC
    269 AAG
    143 AAT
    182 ACA
    277 ACC
    138 ACG
    141 ACT
    270 AGA
    240 AGC
    338 AGG
     69 ATA
    245 ATC
    263 ATG
    293 CAA
    246 CAC
    324 CAG
    400 CCA
    363 CCC
    223 CCG
    150 CGA
    235 CGC
     97 CTA
    299 CTC
    205 GAA
    222 GAC
    269 GCA
    408 GCC
    276 GGA
    101 GTA
     83 TAA
    275 TCA


Base.ProcessChain(Base.Process[Process(`Eisenia stream-kmers --k 3 --fasta /Users/Cameron/Desktop/cjprybol.github.io/datasets/refseq_reference_genomes/GCF_000868425.1_ViralProj17577_genomic.fna.gz`, ProcessExited(0)), Process(`sort`, ProcessExited(0)), Process(`uniq --count`, ProcessExited(0))], Base.DevNull(), Base.DevNull(), Base.DevNull())

To condense this information even further into a histogram of kmer frequencies, we can isolate the first column of kmer counts with `awk` and then repeat the `sort | uniq --count` process.

In [4]:
run(pipeline(`Eisenia stream-kmers --k 3 --fasta $fasta_path_base.gz`,
             `sort`, 
             `uniq --count`, 
             `awk '{print $1}'`,
             `sort --numeric`,
             `uniq --count`))

      1 69
      1 83
      1 97
      1 101
      1 138
      1 141
      1 143
      1 150
      1 166
      1 170
      1 182
      1 205
      1 222
      1 223
      1 235
      1 240
      1 245
      1 246
      1 263
      2 269
      1 270
      1 275
      1 276
      1 277
      1 293
      1 299
      1 324
      1 338
      1 363
      1 400
      1 408


Base.ProcessChain(Base.Process[Process(`Eisenia stream-kmers --k 3 --fasta /Users/Cameron/Desktop/cjprybol.github.io/datasets/refseq_reference_genomes/GCF_000868425.1_ViralProj17577_genomic.fna.gz`, ProcessExited(0)), Process(`sort`, ProcessExited(0)), Process(`uniq --count`, ProcessExited(0)), Process(`awk '{print $1}'`, ProcessExited(0)), Process(`sort --numeric`, ProcessExited(0)), Process(`uniq --count`, ProcessExited(0))], Base.DevNull(), Base.DevNull(), Base.DevNull())

It looks like every kmer has a unique frequency except for the two kmers with a frequency of 269, which are `AAG` and `GCA`.
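The two-stage `sort | uniq` pattern used above can be tried without Eisenia on a toy kmer list. The sketch below uses made-up input and the short flags `-c` and `-n`; the variable names are only for illustration.

```bash
#!/usr/bin/env bash
# Toy kmer list standing in for the Eisenia output (values are made up).
kmers='AAA
AAC
AAA
CAC
AAC
AAA
GCC'

# Stage 1: count how many times each distinct kmer occurs.
counts=$(printf '%s\n' "$kmers" | LC_ALL=C sort | uniq -c)
printf '%s\n' "$counts"

# Stage 2: keep only the count column, then count how often each count occurs.
# Each output line reads "<number of kmers> <frequency>".
histogram=$(printf '%s\n' "$counts" | awk '{print $1}' | sort -n | uniq -c)
printf '%s\n' "$histogram"
```

Here `AAA` occurs three times, `AAC` twice, and `CAC` and `GCC` once each, so the histogram reports two kmers with a frequency of 1, one with a frequency of 2, and one with a frequency of 3.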

The final step, now that we have our histogram of kmer counts, is to plot the results. [Eisenia](https://github.com/cjprybol/Eisenia) has another function for this, which we will use.

In [5]:
run(pipeline(
        pipeline(`Eisenia stream-kmers --k 3 --fasta $fasta_path_base.gz`,
                 `sort`, 
                 `uniq --count`, 
                 `awk '{print $1}'`,
                 `sort --numeric`,
                 `uniq --count`), stdout="$fasta_path_base.K3.counts.histogram"))
run(pipeline(`Eisenia plot histogram --histogram $fasta_path_base.K3.counts.histogram`))
run(pipeline(`mv $fasta_path_base.K3.counts.histogram.svg $(joinpath(dirname(pwd()), "assets", "images"))`))

Process(`mv /Users/Cameron/Desktop/cjprybol.github.io/datasets/refseq_reference_genomes/GCF_000868425.1_ViralProj17577_genomic.fna.K3.counts.histogram.svg /Users/Cameron/Desktop/cjprybol.github.io/assets/images`, ProcessExited(0))

![](../assets/images/GCF_000868425.1_ViralProj17577_genomic.fna.K3.counts.histogram.svg)

This isn't a super interesting plot, and it represents so few values that we could have understood the dataset simply by reviewing the raw data. However, we'll continue increasing the length of k to see how the distribution changes as we include additional information in each kmer. Before we run the rest of the analysis, though, here are two asides: one about choosing k-lengths to evaluate, and one about optimizing the efficiency of our kmer histogram evaluations in the shell.

In general, I prefer to use k-lengths that are prime. There are two reasons for this. The first is that every prime other than 2 is odd, and an odd-length kmer can never be equal to its own reverse complement. For example, the kmer `AT` in the forward direction is also `AT` in the reverse complement direction, but the kmer `ATA` is `TAT` in the reverse complement direction, thus breaking the equivalence. The second reason is that prime k-lengths break up common 3-mer (codon) patterns. For example, if there is a repetitive genomic element of length 9 such as `AAATTTAAA` that translates to a commonly used sequence of amino acids, it might be observed many times in the genome. To break up these naturally occurring patterns, we're better off using a prime k-length such as 11, which may help us better identify the unique genomic location of the sequence we are evaluating.
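The reverse-complement argument is easy to check by hand or with a small script. The sketch below uses two hypothetical helper functions (`revcomp` and `is_rc_palindrome`) to compare a few even-length and odd-length kmers with their reverse complements. An odd-length kmer can never match, because its middle base would have to be its own complement.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; uppercase A/C/G/T input only.

# Reverse complement of a DNA string.
revcomp() {
    local seq=$1 out="" i
    for (( i = ${#seq} - 1; i >= 0; i-- )); do
        case ${seq:i:1} in
            A) out+=T ;;
            C) out+=G ;;
            G) out+=C ;;
            T) out+=A ;;
            *) out+=N ;;
        esac
    done
    printf '%s\n' "$out"
}

# Succeeds when a kmer is identical to its own reverse complement.
is_rc_palindrome() {
    [[ "$1" == "$(revcomp "$1")" ]]
}

for kmer in AT ACGT ATA ACGTA; do
    if is_rc_palindrome "$kmer"; then
        echo "$kmer equals its reverse complement"
    else
        echo "$kmer differs from its reverse complement ($(revcomp "$kmer"))"
    fi
done
```

The even-length examples `AT` and `ACGT` are their own reverse complements, while the odd-length examples `ATA` and `ACGTA` map to `TAT` and `TACGT`.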

For the remaining kmer frequency evaluations, I'll break out of the Julia Jupyter notebook environment and run the commands directly in the shell. There are a few interesting things to note about these commands, which I've tried to optimize through testing and parallelization. The first is the [`parallel`](https://www.gnu.org/software/parallel/) command that prefixes each command. This allows us to evaluate multiple k-lengths in parallel using multiple CPU cores. When using `parallel`, we have to escape nearly everything with a backslash `\` to bypass the standard shell command parsing and allow `parallel` to handle all of the command parsing. The easiest way that I've found to handle this correctly is to write a command that you can confirm works as a single-threaded operation, and then use the `parallel --shellquote` command to properly escape it for use with `parallel`.

A few other choices may be of interest. Prefixing `sort` with `LC_ALL=C` is something I have found to both standardize sort order and, more importantly, drastically reduce sort times, which can become a bottleneck in larger sort operations on large genomes. The `--temporary-directory` option avoids overusing `/tmp` space on academic clusters that have hard limits on shared `/tmp` space. Finally, I make liberal use of `gzip` compression because evaluating large kmers on large genomes can rapidly consume hundreds of GB (potentially TB?) of disk space. Practically, none of these optimizations matter on such a small genome with such small k-lengths, but this pattern of commands will remain unchanged all the way up through the use of human genomes and raw genomic datasets.

```bash
FASTA=GCF_000868425.1_ViralProj17577_genomic.fna
K_RANGE="5 7 11 13 17"
# Count each kmer for every k in K_RANGE, one job per k
parallel Eisenia\ stream-kmers\ --k\ \{1\}\ --fasta\ $FASTA.gz\ \|\ LC_ALL=C\ sort\ --temporary-directory\ \.\ --compress-program\ gzip\ \|\ uniq\ --count\ \|\ gzip\ \>\ $FASTA.K\{1\}.counts.gz ::: $K_RANGE
# Collapse the per-kmer counts into a histogram of counts
parallel gzip\ --decompress\ --stdout\ $FASTA.K\{1\}.counts.gz\ \|\ awk\ \'\{print\ \$1\}\'\ \|\ LC_ALL=C\ sort\ --numeric\ \|\ uniq\ --count\ \>\ $FASTA.K\{1\}.counts.histogram ::: $K_RANGE
# Plot each histogram, then move the resulting SVG files into the site's image directory
parallel Eisenia\ plot\ histogram\ --histogram\ $FASTA.K\{1\}.counts.histogram ::: $K_RANGE
mv $FASTA.K*.counts.histogram.svg ../../assets/images/
```
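To see what `LC_ALL=C` does to sort order, here is a small sketch that doesn't depend on Eisenia or `parallel`. It sorts a made-up, mixed-case list (as might occur with a soft-masked FASTA file) in byte order, and it uses `-T`, the short form of `--temporary-directory`, to point `sort` at a scratch directory of our choosing. The scratch directory created with `mktemp` is only an assumption for the demo.

```bash
#!/usr/bin/env bash
# Scratch directory for sort's temporary files (a stand-in for a cluster-specific path).
workdir=$(mktemp -d)

# With LC_ALL=C, sort compares raw bytes, so uppercase letters come before
# lowercase letters no matter which locale the user has configured.
sorted=$(printf 'acg\nACG\nttt\nAAA\n' | LC_ALL=C sort -T "$workdir")
printf '%s\n' "$sorted"

rmdir "$workdir"
```

In byte order the result is `AAA`, `ACG`, `acg`, `ttt`. A locale-aware sort may interleave the upper- and lowercase entries differently, which is one reason pinning the locale makes results reproducible across machines.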

And here are the results. While this single genome doesn't give a great representation of how all genomes behave, you will notice that after a certain k-length, a linear relationship develops between `log(# of kmers)` and `log(coverage)`, before the k-length gets so large that all kmers become unique.

![](../assets/images/GCF_000868425.1_ViralProj17577_genomic.fna.K5.counts.histogram.svg)
![](../assets/images/GCF_000868425.1_ViralProj17577_genomic.fna.K7.counts.histogram.svg)
![](../assets/images/GCF_000868425.1_ViralProj17577_genomic.fna.K11.counts.histogram.svg)
![](../assets/images/GCF_000868425.1_ViralProj17577_genomic.fna.K13.counts.histogram.svg)
![](../assets/images/GCF_000868425.1_ViralProj17577_genomic.fna.K17.counts.histogram.svg)
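One way to put a number on the log-log relationship described above is to fit a straight line to `log(# of kmers)` against `log(coverage)`. The sketch below does this with an ordinary least-squares fit in `awk`. The histogram values are synthetic, chosen to follow an exact power law, and are not taken from the Norwalk virus results; the two-column layout matches the `.counts.histogram` files produced earlier.

```bash
#!/usr/bin/env bash
# Synthetic histogram in "<number of kmers> <coverage>" layout.
# These values are made up so that the number of kmers falls as 1 / coverage^2.
histogram='1024 1
256 2
64 4
16 8'

# Least-squares slope of log(number of kmers) against log(coverage).
slope=$(printf '%s\n' "$histogram" | awk '
    {
        x = log($2); y = log($1)
        n++; sx += x; sy += y; sxx += x * x; sxy += x * y
    }
    END { printf "%.2f\n", (n * sxy - sx * sy) / (n * sxx - sx * sx) }')

echo "log-log slope: $slope"
```

For this synthetic input the slope is -2.00. On real histograms the points will only be approximately linear, so the slope should be read as a rough summary and not as an exact property of the genome.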

I'll follow up on this post with additional genomes and also with another interesting relationship known as [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law). According to my observations, Zipf's law doesn't hold up as well as the above relationship when evaluating genomes themselves. However, when evaluating the kmer frequencies in sequencing datasets, where a genome may be sampled dozens, hundreds, or even thousands of times or more, I hypothesize that Zipf's law will become a more prominent relationship that will help us choose an ideal k-length for analysis during assembly and genome identification.