# k-Mer Composition

## Background Info

* **k-mer:** A length k substring of a genetic string.
* **k-mer composition:** The k-mer composition of a string $s$ encodes the number of times that each possible k-mer occurs in $s$.
* **exon:** A contiguous segment of a gene whose transcribed sequence is retained in the mature mRNA after splicing; exons carry the protein-coding sequence.
* **fragment assembly:** Algorithmic reconstruction of contiguous chromosomes from short fragments of DNA.

A genetic string of length $n$ can be seen as composed of $n - k + 1$ overlapping k-mers. The 1-mer composition is a generalization of the GC-content of a strand of DNA. The biological significance of k-mer composition is manifold. GC-content helps to identify a piece of unknown DNA, and a genomic region with high GC-content compared to the rest of the genome may belong to an **exon**. Analyzing k-mer composition is vital to **fragment assembly** as well.

For larger values of $k$, the k-mer composition gives a more robust fingerprint of a string's identity because it analyzes the string on the scale of substrings (i.e., words) rather than single symbols. As a basis of comparison, in language analysis, the k-mer composition of a text can be used to pin down not only the language but often the author as well.

## Problem

For a fixed positive integer $k$, order all possible k-mers taken from an underlying alphabet lexicographically. Then the k-mer composition of a string $s$ can be represented by an array $A$ for which $A[m]$ denotes the number of times that the $m$th k-mer appears in $s$.

**Given:** A DNA string $s$ in FASTA format.<br>
**Return:** The 4-mer composition of $s$.
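
As a small illustration of the array $A$ defined above, the following sketch enumerates every possible k-mer in lexicographic order and counts its occurrences. The function name `kmer_composition` and the default alphabet `"ACGT"` are assumptions made for this example, and $k = 2$ is used only to keep the output short.

```python
from itertools import product

def kmer_composition(s, k, alphabet="ACGT"):
    """Return the counts of every possible k-mer, in lexicographic order."""
    # product() over a sorted alphabet yields tuples in lexicographic order.
    kmers = ["".join(p) for p in product(sorted(alphabet), repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    # A string of length n contains n - k + 1 overlapping k-mers.
    for i in range(len(s) - k + 1):
        window = s[i:i + k]
        if window in counts:  # ignore windows with symbols outside the alphabet
            counts[window] += 1
    return [counts[kmer] for kmer in kmers]

# The 2-mers of "ACGTAC" are AC, CG, GT, TA, AC.
print(kmer_composition("ACGTAC", 2))
# → [0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
```

For the problem as stated, the same function would be called with `k=4`, giving an array of $4^4 = 256$ counts.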

## Solution Explanation

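1. Treat each 4-mer as a 4-digit number in the quaternary (base-4) numeral system, with digits assigned in alphabetical order (A = 0, C = 1, G = 2, T = 3). The value of that number is the index of the 4-mer in an array of all $4^4 = 256$ possible 4-mers laid out in lexicographic order.
2. Slide a window of length 4 along $s$ and, at each position, increment the count stored at the index of the 4-mer currently in the window.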
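
The two steps above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the names `CODE`, `parse_fasta`, and `four_mer_composition` are chosen for this example, the input is assumed to be a single FASTA record, and any symbol other than A, C, G, or T simply restarts the window.

```python
# Digit values follow alphabetical order, so base-4 value == lexicographic index.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def parse_fasta(text):
    """Concatenate the sequence lines of a single-record FASTA string."""
    return "".join(
        line.strip()
        for line in text.splitlines()
        if line and not line.startswith(">")
    )

def four_mer_composition(s, k=4):
    """Count every k-mer of s, indexed by its base-4 value."""
    size = 4 ** k
    counts = [0] * size
    index = 0   # base-4 value of the most recent (up to k) symbols
    valid = 0   # number of consecutive recognized symbols seen so far
    for ch in s:
        digit = CODE.get(ch)
        if digit is None:          # unknown symbol: restart the window
            index, valid = 0, 0
            continue
        # Shift in the new digit and drop the oldest one via the modulus.
        index = (index * 4 + digit) % size
        valid += 1
        if valid >= k:
            counts[index] += 1
    return counts

fasta = ">example_record\nACGTA\nCGT\n"
counts = four_mer_composition(parse_fasta(fasta))
print(" ".join(map(str, counts)))
```

Updating the index incrementally avoids re-encoding all four symbols at every window position. In the example, the sequence `ACGTACGT` contains `ACGT` twice, and `ACGT` has base-4 value $0 \cdot 64 + 1 \cdot 16 + 2 \cdot 4 + 3 = 27$, so `counts[27]` is 2.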