Tutorial
---

### Nucleotides

The `Dna` class represents an IUPAC-valid sequence of non-degenerate DNA nucleotides.
For the purposes of this tutorial, we will work with single-nucleotide sequences.

In [1]:
from nucleic import Dna, Snv

Dna("A").is_purine()

True

### Creating Variant Alleles

In [2]:
Dna("A").to("C")

Snv(ref=Dna("A"), alt=Dna("C"), context=Dna("A"))

By default, the context of the variant is set to the reference base, although a larger context can be set.
The context must be symmetrical in length about the base substitution; otherwise, an error is raised.

In [3]:
Snv('A', 'C').within("TAG")

Snv(ref=Dna("A"), alt=Dna("C"), context=Dna("TAG"))
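
The symmetry requirement can be illustrated with a minimal pure-Python sketch. It does not use `nucleic` itself: the helper names `center_base` and `check_context` are hypothetical, and raising `ValueError` is an assumption, since the text above only says that an error is raised.

```python
def center_base(context: str) -> str:
    """Return the middle base of an odd-length context, or raise ValueError."""
    if len(context) % 2 == 0:
        raise ValueError(f"Context {context!r} has even length, so no base is central")
    return context[len(context) // 2]


def check_context(ref: str, context: str) -> None:
    """Raise ValueError unless `ref` sits at the exact center of `context`."""
    if center_base(context) != ref:
        raise ValueError(f"Reference {ref!r} is not the central base of {context!r}")


# Passes silently: "A" has exactly one flanking base on each side in "TAG".
check_context("A", "TAG")
```

An even-length context such as `"TA"` has no central position, so a check of this kind would reject it.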

Unless the chemical process for the base substitution is known, it is useful to represent all base substitutions in a canonical form, with either a purine or pyrimidine as the reference base.

In [4]:
Snv('A', 'C').within("TAG").with_pyrimidine_ref()

Snv(ref=Dna("T"), alt=Dna("G"), context=Dna("CTA"))
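
The result above is consistent with taking the reverse complement of the reference, the alternate, and the context whenever the reference is a purine. The following sketch reproduces that arithmetic in plain Python; the function names are hypothetical and this is not necessarily how `nucleic` implements `with_pyrimidine_ref`.

```python
from typing import Tuple

# Complement table for the four non-degenerate DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
PYRIMIDINES = {"C", "T"}


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def to_pyrimidine_ref(ref: str, alt: str, context: str) -> Tuple[str, str, str]:
    """Return (ref, alt, context) expressed on the strand where ref is C or T."""
    if ref in PYRIMIDINES:
        return ref, alt, context
    return reverse_complement(ref), reverse_complement(alt), reverse_complement(context)


print(to_pyrimidine_ref("A", "C", "TAG"))  # ('T', 'G', 'CTA')
```

A variant whose reference is already a pyrimidine is returned unchanged, so applying the conversion twice gives the same answer as applying it once.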

### Single Nucleotide Variant Spectra

A `SnvSpectrum` can be initialized by specifying the size of the local context and the reference notation.

In [5]:
from nucleic import SnvSpectrum, Notation

spectrum = SnvSpectrum(k=3, notation=Notation.pyrimidine)
spectrum

SnvSpectrum(k=3, notation=Notation.pyrimidine)

Record observations by accessing the `SnvSpectrum` like a Python dictionary.

In [6]:
snv = Snv('C', 'A').within('ACA')

spectrum[snv] += 2

If you have a vector of counts or probabilities, you can build a `SnvSpectrum` directly, provided the data is listed in the alphabetical order of the `SnvSpectrum` keys.

In [7]:
vector = [6, 5, 2, 2, 3, 8]

spectrum = SnvSpectrum.from_list(vector, k=1, notation=Notation.pyrimidine)
spectrum.items()

array([[Snv(ref=Dna("C"), alt=Dna("A"), context=Dna("C")), 6],
       [Snv(ref=Dna("C"), alt=Dna("G"), context=Dna("C")), 5],
       [Snv(ref=Dna("C"), alt=Dna("T"), context=Dna("C")), 2],
       [Snv(ref=Dna("T"), alt=Dna("A"), context=Dna("T")), 2],
       [Snv(ref=Dna("T"), alt=Dna("C"), context=Dna("T")), 3],
       [Snv(ref=Dna("T"), alt=Dna("G"), context=Dna("T")), 8]],
      dtype=object)
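
To see which position in the vector corresponds to which substitution, the key order shown above for `k=1` in pyrimidine notation can be reproduced with the standard library. This is only a sketch of the ordering visible in that output; plain `(ref, alt)` tuples stand in for `Snv` objects, and the ordering for larger `k` is not covered here.

```python
from itertools import permutations

counts = [6, 5, 2, 2, 3, 8]

# permutations() over a sorted alphabet yields (ref, alt) pairs in
# lexicographic order; keep only pairs whose reference is a pyrimidine.
keys = [(ref, alt) for ref, alt in permutations("ACGT", 2) if ref in "CT"]

labelled = dict(zip(keys, counts))
for (ref, alt), count in labelled.items():
    print(f"{ref}>{alt}: {count}")
```

Printing the pairs next to the counts is a quick way to confirm that a vector is in the expected order before passing it to `SnvSpectrum.from_list`.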

### Working with Probability

Many spectra are produced from whole-genome or whole-exome sequencing experiments. Spectra must be normalized to the *k*-mer frequencies in the target study.
Without normalization, no valid spectrum comparison can be made between data generated from different target territories or species.

By default, each `nucleic.Variant` is given a weight of 1, and calling `nucleic.SnvSpectrum.mass_as_array` will simply give the proportion of `nucleic.Snv` counts in the `nucleic.SnvSpectrum`.
After the weights are set to the observed *k*-mer counts or frequencies of the target territory, calling `SnvSpectrum.mass` will compute a true normalized probability mass.

All weights can be set with assignment, *e.g.*: `spectrum.context_weights["ACA"] = 23420`.

In [8]:
spectrum.mass()

array([0.23076923, 0.19230769, 0.07692308, 0.07692308, 0.11538462,
       0.30769231])
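
The values above are the counts divided by their total of 26, which is what uniform weights of 1 would give. The sketch below shows one plausible reading of the normalization: divide each count by the weight of its context, then rescale so the result sums to 1. The function `mass` here is a stand-in written for illustration, not the library's implementation, and the non-uniform weights are hypothetical numbers.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (ref, alt); with k=1 the context is just the ref base


def mass(
    counts: Dict[Key, float],
    context_weights: Optional[Dict[str, float]] = None,
) -> Dict[Key, float]:
    """Divide each count by its context weight, then rescale to sum to 1."""
    if context_weights is None:
        context_weights = {}
    # Contexts without an explicit weight default to 1.
    scaled = {key: n / context_weights.get(key[0], 1.0) for key, n in counts.items()}
    total = sum(scaled.values())
    return {key: value / total for key, value in scaled.items()}


counts = {
    ("C", "A"): 6, ("C", "G"): 5, ("C", "T"): 2,
    ("T", "A"): 2, ("T", "C"): 3, ("T", "G"): 8,
}

uniform = mass(counts)
print(round(uniform[("C", "A")], 8))  # 0.23076923

# Hypothetical target in which "C" contexts are twice as common as "T" contexts.
weighted = mass(counts, {"C": 200.0, "T": 100.0})
```

With the hypothetical weights, substitutions at the rarer `T` context gain probability mass relative to the uniform case, which is the purpose of normalizing to the target territory.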

*k*-mer counts can be found with `skbio.DNA.kmer_frequencies` for large targets.
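
For small sequences, or to check what such counts look like, overlapping *k*-mers can also be tallied with the standard library. This is a minimal sketch with a hypothetical helper name, `kmer_counts`; it is not the scikit-bio implementation and is not tuned for genome-scale input.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping window of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


counts = kmer_counts("ACACAT", 3)
print(counts["ACA"])  # 2
```

Counts of this form could then be assigned to the context weights one *k*-mer at a time, as in the assignment example above.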

### Fetching COSMIC Signatures

Download the published [COSMIC signatures](http://cancer.sanger.ac.uk/cosmic/signatures) of mutational processes in human cancer:

In [9]:
from nucleic.cosmic import fetch_cosmic_signatures

signatures = fetch_cosmic_signatures()

### Plotting Spectra

Spectra with `k=3` in either `pyrimidine` or `purine` reference notation can be plotted using a style first used by Alexandrov *et al.* in 2013 (PMID: [`23945592`](https://www.ncbi.nlm.nih.gov/pubmed/23945592)). Either the raw `nucleic.Variant` counts (`kind="count"`) or their probabilities (`kind="mass"`) can be plotted.

The figure and axes are returned to allow for custom formatting.

In [10]:
from nucleic.plotting import trinucleotide_spectrum


fig, (ax_main, ax_cbar) = trinucleotide_spectrum(signatures["Signature 1"], kind="mass")
fig, (ax_main, ax_cbar) = trinucleotide_spectrum(signatures["Signature 14"], kind="mass")

In [11]:
import nucleic; nucleic.__version__

'0.7.0'