# Calculating K-mer Counts in hg19
```
pi:ababaian
start: 2016 12 19
complete : 2017 01 02
```
## Introduction

This work was carried out between Dec 19, 2016 and Jan 2, 2017 and is now being written up as a digital notebook from the notes I took at the time.

To measure the Shannon information content of k-mers in each encoding, I first need to count how often each k-mer occurs in the genome. The count divided by the total number of k-mers gives the probability P(x) used in the definitions below.

- *Shannon Information Content* : 
    h(x) = log2 ( 1 / P(x) )
    
- *Entropy of an ensemble X* :
    H(X) = sum_i [ P(x_i) * log2 ( 1 / P(x_i) ) ] over all outcomes x_i
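
As a worked instance of the two definitions, the sketch below computes h(x) for each base and H(X) for the simplest case, k = 1. The counts are the `occurrences(a|c|g|t)` values that suffixerator reports for hg19 in the log later in this notebook. The function name `entropy_from_counts` is hypothetical, and ignoring wildcards (N) is an assumption of this example.

```shell
# Sketch: Shannon information and entropy for k = 1 (single bases) in hg19.
# Counts come from the suffixerator log; wildcards (N) are ignored here.
entropy_from_counts() {
    # stdin:  "<symbol> <count>" per line
    # stdout: h(x) per symbol, then H(X), all in bits
    awk '
        { sym[NR] = $1; cnt[NR] = $2; total += $2 }
        END {
            for (i = 1; i <= NR; i++) {
                p = cnt[i] / total
                h = log(1 / p) / log(2)      # h(x) = log2(1 / P(x))
                H += p * h                   # H(X) = sum_i P(x_i) * h(x_i)
                printf "h(%s) = %.4f bits\n", sym[i], h
            }
            printf "H(X) = %.4f bits\n", H
        }'
}

base_entropy=$(printf '%s\n' \
    'a 844868045' \
    'c 585017944' \
    'g 585360436' \
    't 846097277' | entropy_from_counts)
echo "$base_entropy"
```

With these counts H(X) comes out just under the 2-bit maximum for a four-letter alphabet, because G and C are less frequent than A and T. The same function could be fed k-mer counts for larger k once they are available.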


## Objective

* Count how often each k-mer occurs in the standard and non-standard encodings of hg19
* Establish a baseline measurement of whether information content differs between encodings
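
To make the first objective concrete, here is a minimal sketch of what "counting k-mers" means, using a sliding window over a toy sequence. This is a naive illustration only, not how tallymer works internally, and it would not scale to a whole genome. The function name `count_kmers` and the toy sequence are hypothetical.

```shell
# Sketch: naive k-mer counting on a toy sequence (illustration only).
count_kmers() {
    # args:   k sequence
    # stdout: "<kmer> <count>" per line, sorted by k-mer
    awk -v k="$1" -v seq="$2" 'BEGIN {
        n = length(seq)
        # Slide a window of width k along the sequence and tally each k-mer
        for (i = 1; i + k - 1 <= n; i++) {
            count[substr(seq, i, k)]++
        }
        for (kmer in count) {
            print kmer, count[kmer]
        }
    }' | LC_ALL=C sort
}

toy_counts=$(count_kmers 2 ACGTACGT)
echo "$toy_counts"
```

Dividing each count by the total number of windows gives the P(x) needed for the information measures in the Introduction.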


## Materials and Methods

### Software
* [Genometools / Tallymer](http://genometools.org/pub/genometools-1.5.9.tar.gz)

In [2]:
# Global Variables
SERRATUS='/home/artem/Serratus'



In [None]:
## Tallymer Indexing
# For each encoding hg19, hg19_ry, hg19_sw, hg19_mk, hg19_h
# create a tallymer index file for looking up k-mers
cd $SERRATUS/resources
    mkdir -p hg19_tally; cd hg19_tally
    # Link genomes to tallymer index folder
    ln -s ../hg19/hg19.fa ./
    ln -s ../hg19/hg19_ry.fa ./
    ln -s ../hg19/hg19_sw.fa ./
    ln -s ../hg19/hg19_mk.fa ./
    ln -s ../hg19/hg19_h.fa ./


# Index each genome by its k-mers
    gt suffixerator -dna -pl -tis -suf -lcp -v -parts 4 -db hg19.fa -indexname hg19
    gt suffixerator -dna -pl -tis -suf -lcp -v -parts 4 -db hg19_ry.fa -indexname hg19_ry
    gt suffixerator -dna -pl -tis -suf -lcp -v -parts 4 -db hg19_sw.fa -indexname hg19_sw
    gt suffixerator -dna -pl -tis -suf -lcp -v -parts 4 -db hg19_mk.fa -indexname hg19_mk
    gt suffixerator -dna -pl -tis -suf -lcp -v -parts 4 -db hg19_h.fa -indexname hg19_h
    

In [1]:
## Sample Standard Output after running suffixerator (genome indexer)
# Complete output can be viewed in '~/resources/hg19_tally/index.stdout'

#gt suffixerator -dna -pl -tis -suf -lcp -v -parts 4 -db hg19.fa -indexname hg19


# dna=yes
# indexname="hg19"
# prefixlength=automatic
# storespecialcodes=false
# inputfile[0]=hg19.fa
# indexname=hg19
# outtistab=true,outsuftab=true,outlcptab=true,outbwttab=false,outbcktab=false,outdestab=true,outsdstab=true,outssptab=true,outkystab=false
# parts=4
# maxinsertionsort=3
# maxbltriesort=1000
# maxcountingsort=4000
# lcpdist=false
# sizeof (GtUword)=64
# wildcardranges of length 1=45
# wildcardranges of length 2=6
# wildcardranges of length 3=1
# wildcardranges of length 4=1
# wildcardranges of length 10=4
# wildcardranges of length 20=1
# wildcardranges of length 100=1
# wildcardranges of length 200=2
# wildcardranges of length 500=1
# wildcardranges of length 700=1
# wildcardranges of length 851=1
# wildcardranges of length 1000=2
# wildcardranges of length 1500=1
# wildcardranges of length 5100=1
# wildcardranges of length 5290=1
# wildcardranges of length 6000=1
# wildcardranges of length 10000=20
# wildcardranges of length 16500=1
# wildcardranges of length 19000=1
# wildcardranges of length 24900=1
# wildcardranges of length 25000=3
# wildcardranges of length 30000=1
# wildcardranges of length 40000=1
# wildcardranges of length 40442=1
# wildcardranges of length 50000=155
# wildcardranges of length 55500=1
# wildcardranges of length 60000=20
# wildcardranges of length 100000=28
# wildcardranges of length 110000=1
# wildcardranges of length 142000=1
# wildcardranges of length 150000=27
# wildcardranges of length 200000=1
# wildcardranges of length 307000=1
# wildcardranges of length 486181=1
# wildcardranges of length 1050000=1
# wildcardranges of length 3000000=9
# wildcardranges of length 3100000=6
# wildcardranges of length 3150000=1
# wildcardranges of length 3200000=1
# wildcardranges of length 9411193=1
# wildcardranges of length 11100000=1
# wildcardranges of length 16050000=1
# wildcardranges of length 18150000=1
# wildcardranges of length 19000000=1
# wildcardranges of length 19020000=1
# wildcardranges of length 20000000=1
# wildcardranges of length 21050000=1
# wildcardranges of length 30000000=1
# init character encoding (uint32, 773926674 bytes, 2.00 bits/symbol)
# init ssptab encoding (uint32, 104 bytes, 0.00 bits/symbol)
# totallength=3095694007
# numofsequences=25
# specialcharacters=234350305
# specialranges=342
# realspecialranges=342
# wildcards=234350281
# wildcardranges=362
# realwildcardranges=362
# occurrences(a)=844868045
# occurrences(c)=585017944
# occurrences(g)=585360436
# occurrences(t)=846097277
# automatically determined prefixlength=13
# maxinsertionsort=3
# maxbltriesort=1000
# maxcountingsort=4000
# storespecialcodes=false
# cmpcharbychar=false
# totallength=3095694007
# sizeof (leftborder)=268435460 bytes
# sizeof (countspecialcodes)=67108864 bytes
# sizeof (distpfxidx)=22369616 bytes
# sizeof (bcktab)=357913940 bytes
# largest bucket size=1733658
# widthofpart[0]=715335929
# widthofpart[1]=715336241
# widthofpart[2]=715335670
# widthofpart[3]=715335862
# create suffix_sort_space: suftab uses 64bit values: maxvalue=3095694007,numofentries=715336241
# compute part 0: 715335929 suffixes,14587585 buckets from 0..14587584
# used workspace for sorting: 0.11 MB
# countinsertionsort=1638691
# countbltriesort=12241539
# countcountingsort=62848
# countshortreadsort=0
# countradixsort=0
# counttqsort=16483
# compute part 1: 715336241 suffixes,18969420 buckets from 14587585..33557004
# countinsertionsort=2504730
# countbltriesort=14570021
# countcountingsort=61746
# countshortreadsort=0
# countradixsort=0
# counttqsort=18972
# compute part 2: 715335670 suffixes,19935928 buckets from 33557005..53492932
# countinsertionsort=2608844
# countbltriesort=15807519
# countcountingsort=59223
# countshortreadsort=0
# countradixsort=0
# counttqsort=18622
# compute part 3: 715335862 suffixes,13615931 buckets from 53492933..67108863
# countinsertionsort=1531548
# countbltriesort=11593005
# countcountingsort=66204
# countshortreadsort=0
# countradixsort=0
# counttqsort=17857
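
The totals in this log can be cross-checked as a quick sanity test that the whole genome was indexed: the four base counts plus `specialcharacters` should add up to `totallength`. A minimal sketch of that check, with all values copied from the log above (the helper name `check_totals` is hypothetical):

```shell
# Sketch: check that the suffixerator totals for hg19 add up.
# All numbers are copied from the log; check_totals is a hypothetical helper.
check_totals() {
    # args: totallength specialcharacters count_a count_c count_g count_t
    local total=$1 special=$2
    shift 2
    local sum=$special n
    for n in "$@"; do
        sum=$((sum + n))
    done
    if [ "$sum" -eq "$total" ]; then
        echo "OK: bases + special = $sum"
    else
        echo "MISMATCH: bases + special = $sum, totallength = $total"
        return 1
    fi
}

hg19_check=$(check_totals 3095694007 234350305 \
    844868045 585017944 585360436 846097277)
echo "$hg19_check"
```

The four `widthofpart` values also sum to the same A+C+G+T total (2861343702), so the four sorting parts cover every non-special position. The difference of 24 between `specialcharacters` and `wildcards` is presumably the separators between the 25 sequences.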



In [3]:
# List Tallymer Directory
ls -alh $SERRATUS/resources/hg19_tally

total 136G
drwxrwxr-x 2 artem artem 4.0K Jan  3 10:40 .
drwxrwxr-x 4 artem artem 4.0K Jan  3 10:34 ..
-rw-rw-r-- 1 artem artem  154 Dec 20 13:04 hg19.des
-rw-rw-r-- 1 artem artem 739M Dec 20 13:06 hg19.esq
lrwxrwxrwx 1 artem artem   43 Jan  2 18:34 hg19.fa -> /home/artem/Serratus/resources/hg19/hg19.fa
-rw-rw-r-- 1 artem artem  154 Dec 22 20:52 hg19_h.des
-rw-rw-r-- 1 artem artem 739M Dec 22 20:54 hg19_h.esq
lrwxrwxrwx 1 artem artem   45 Jan  2 18:34 hg19_h.fa -> /home/artem/Serratus/resources/hg19/hg19_h.fa
-rw-rw-r-- 1 artem artem 2.9G Dec 23 06:25 hg19_h.lcp
-rw-rw-r-- 1 artem artem 472M Dec 23 06:25 hg19_h.llv
-rw-rw-r-- 1 artem artem  825 Dec 22 20:52 hg19_h.md5
-rw-rw-r-- 1 artem artem  515 Dec 23 06:25 hg19_h.prj
-rw-rw-r-- 1 artem artem  192 Dec 22 20:52 hg19_h.sds
-rw-rw-r-- 1 artem artem  104 Dec 22 20:54 hg19_h.ssp
-rw-rw-r-- 1 artem artem  24G Dec 23 06:25 hg19_h.suf
-rw-rw-r-- 1 artem artem 2.9G Dec 20 14:37 hg19.lcp
-rw-rw-r-- 1 artem artem 259M Dec 20 14

## Results


## Discussion
