# Overview of `kvector` features

In [2]:
import kvector

## Read HOMER Motifs

Read a HOMER motif file into a pandas series keyed by motif name, where each value is a dataframe holding that motif's position weight matrix (PWM).

In [3]:
motifs = kvector.read_motifs('kvector/tests/data/example_rbps.motif', residues='ACGT')
motifs.head()

M001_0.6_A1CF_ENSG00000148584_Homo_sapiens\tM001_0.6_A1CF_ENSG00000148584_Homo_sapiens\t5.0                                          A         C         G         T
0  0...
M002_0.6_ANKRD17_ENSG00000132466_Homo_sapiens\tM002_0.6_ANKRD17_ENSG00000132466_Homo_sapiens\t5.0                                    A         C         G         T
0  0...
M003_0.6_FBgn0262475_FBgn0262475_Drosophila_melanogaster\tM003_0.6_FBgn0262475_FBgn0262475_Drosophila_melanogaster\t5.0              A         C         G         T
0  0...
M004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens\tM004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens\t5.0                                    A         C         G         T
0  0...
dtype: object

This is a series containing a small dataframe for each motif. You can access individual motifs with the usual pandas indexing:

In [4]:
# the 4th motif (position 3, counting from 0); .iloc makes the positional lookup explicit
motifs.iloc[3]

|   | A | C | G | T |
|---|---|---|---|---|
| 0 | 0.085063 | 0.085063 | 0.175952 | 0.653921 |
| 1 | 0.013046 | 0.013046 | 0.776577 | 0.19733 |
| 2 | 0.013046 | 0.013046 | 0.013046 | 0.960861 |
| 3 | 0.013046 | 0.013046 | 0.764576 | 0.209331 |
| 4 | 0.013046 | 0.013046 | 0.104634 | 0.869273 |
| 5 | 0.013046 | 0.013046 | 0.666799 | 0.307108 |
| 6 | 0.083101 | 0.083101 | 0.264548 | 0.56925 |


In [5]:
# By motif name (the key is the full header line, tabs included)
motifs['M004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens\tM004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens\t5.0']

|   | A | C | G | T |
|---|---|---|---|---|
| 0 | 0.085063 | 0.085063 | 0.175952 | 0.653921 |
| 1 | 0.013046 | 0.013046 | 0.776577 | 0.19733 |
| 2 | 0.013046 | 0.013046 | 0.013046 | 0.960861 |
| 3 | 0.013046 | 0.013046 | 0.764576 | 0.209331 |
| 4 | 0.013046 | 0.013046 | 0.104634 | 0.869273 |
| 5 | 0.013046 | 0.013046 | 0.666799 | 0.307108 |
| 6 | 0.083101 | 0.083101 | 0.264548 | 0.56925 |
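
As a rough illustration of what gets read, here is a minimal sketch of a parser for HOMER-style motif text. It assumes each motif starts with a `>` header line followed by one whitespace-separated row of probabilities per position. `parse_homer_motifs` and the sample text are hypothetical (the sample is rebuilt from the first two rows of the M004 output above), and this is not `kvector`'s own implementation. It uses plain dicts and lists instead of pandas so it runs without dependencies.

```python
def parse_homer_motifs(text, residues="ACGT"):
    """Parse HOMER-style motif text into {header: [row, ...]}.

    Each row is a dict mapping a residue to its probability at that position.
    """
    motifs = {}
    current = None
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep everything after '>' as the key,
            # tabs included, like the series index shown above.
            current = line[1:]
            motifs[current] = []
        elif current is not None:
            # Matrix line: one probability per residue, in column order.
            values = [float(x) for x in line.split()]
            motifs[current].append(dict(zip(residues, values)))
    return motifs


# Hypothetical file content, rebuilt from the M004 output shown above.
example = (
    ">M004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens\t"
    "M004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens\t5.0\n"
    "0.085063\t0.085063\t0.175952\t0.653921\n"
    "0.013046\t0.013046\t0.776577\t0.19733\n"
)

parsed = parse_homer_motifs(example)
for name, rows in parsed.items():
    print(repr(name), len(rows), rows[0]["T"])
```

Keeping the whole header line as the key mirrors the tab-containing names in the output above, which is why lookups by name need the full string.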


## Convert motifs to kmer vectors

Instead of representing a motif as a position weight matrix, which requires aligning motifs before they can be compared, you can convert each motif to a vector of kmers, where the value for each kmer is that kmer's score in the motif.

Citation: [Xu and Su, *PLoS ONE* (2010)](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0008797)

In [12]:
motif_kmer_vectors = kvector.motifs_to_kmer_vectors(motifs, residues='ACGT',
    kmer_lengths=(3, 4))
motif_kmer_vectors.head()

Unnamed: 0,M001_0.6_A1CF_ENSG00000148584_Homo_sapiens	M001_0.6_A1CF_ENSG00000148584_Homo_sapiens	5.0,M002_0.6_ANKRD17_ENSG00000132466_Homo_sapiens	M002_0.6_ANKRD17_ENSG00000132466_Homo_sapiens	5.0,M003_0.6_FBgn0262475_FBgn0262475_Drosophila_melanogaster	M003_0.6_FBgn0262475_FBgn0262475_Drosophila_melanogaster	5.0,M004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens	M004_0.6_BRUNOL4_ENSG00000101489_Homo_sapiens	5.0
AAA,0.442114,0.310285,0.012428,0.022518
AAC,0.301068,0.278607,0.012428,0.022518
AAG,0.323372,0.27345,0.133575,0.134406
AAT,0.424485,0.271529,0.212117,0.207887
ACA,0.31289,0.301837,0.012428,0.022518
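
To make the scores above less opaque, here is a minimal sketch of one scoring scheme: slide the kmer across every window of the PWM, average the per-position probabilities within each window, then average over windows. This scheme is an assumption inferred from the numbers, not taken from the `kvector` source or the cited paper. It does agree with the M004 values shown above to within the rounding of the printed PWM. `kmer_score` and `pwm_to_kmer_vector` are hypothetical names.

```python
from itertools import product

# M004 PWM as printed earlier (values are rounded to 6 decimals).
M004 = [
    {"A": 0.085063, "C": 0.085063, "G": 0.175952, "T": 0.653921},
    {"A": 0.013046, "C": 0.013046, "G": 0.776577, "T": 0.19733},
    {"A": 0.013046, "C": 0.013046, "G": 0.013046, "T": 0.960861},
    {"A": 0.013046, "C": 0.013046, "G": 0.764576, "T": 0.209331},
    {"A": 0.013046, "C": 0.013046, "G": 0.104634, "T": 0.869273},
    {"A": 0.013046, "C": 0.013046, "G": 0.666799, "T": 0.307108},
    {"A": 0.083101, "C": 0.083101, "G": 0.264548, "T": 0.56925},
]


def kmer_score(pwm, kmer):
    """Mean per-position probability of `kmer`, averaged over all PWM windows."""
    k = len(kmer)
    n_windows = len(pwm) - k + 1
    if n_windows < 1:
        raise ValueError("kmer is longer than the motif")
    total = 0.0
    for start in range(n_windows):
        window = pwm[start:start + k]
        # Average probability of the kmer's residues within this window.
        total += sum(row[base] for row, base in zip(window, kmer)) / k
    return total / n_windows


def pwm_to_kmer_vector(pwm, residues="ACGT", kmer_lengths=(3, 4)):
    """Score every possible kmer of the requested lengths against one PWM."""
    return {
        "".join(kmer): kmer_score(pwm, kmer)
        for k in kmer_lengths
        for kmer in product(residues, repeat=k)
    }


vector = pwm_to_kmer_vector(M004)
for kmer in ("AAA", "AAC", "AAG", "AAT"):
    print(kmer, round(vector[kmer], 6))
```

Because every motif is scored against the same fixed set of kmers, the resulting vectors have the same length and can be compared directly, with no alignment step.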


## Count kmers in fasta files


You may also want to simply count the occurrences of each DNA word (kmer) in a FASTA file. `count_kmers` does just that, returning a pandas dataframe with one column per kmer.

In [4]:
kmer_vector = kvector.count_kmers('kvector/tests/data/example.fasta', kmer_lengths=(3, 4))
kmer_vector.head()

Skipped NNN because it contains an unknown nucleotide
Skipped NNN because it contains an unknown nucleotide
Skipped NNN because it contains an unknown nucleotide
Skipped NNN because it contains an unknown nucleotide
Skipped NNN because it contains an unknown nucleotide
Skipped NNN because it contains an unknown nucleotide
Skipped NNN because it contains an unknown nucleotide
Skipped NNN because it contains an unknown nucleotide
Skipped NNNN because it contains an unknown nucleotide
Skipped NNNN because it contains an unknown nucleotide
Skipped NNNN because it contains an unknown nucleotide
Skipped NNNN because it contains an unknown nucleotide
Skipped NNNN because it contains an unknown nucleotide
Skipped NNNN because it contains an unknown nucleotide
Skipped NNNN because it contains an unknown nucleotide


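|   | AAA | AAC | AAG | AAT | ACA | ACC | ACG | ACT | AGA | AGC | ... | TTCG | TTCT | TTGA | TTGC | TTGG | TTGT | TTTA | TTTC | TTTG | TTTT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 3 | 1 | 0 | 0 | 2 | 1 | 2 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 3 | 1 | 0 | 0 | 2 | 1 | 2 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 6 | 5 | 6 | 0 | 0 | 3 | 2 | 4 | 3 | 5 | ... | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 6 | 6 | 4 | 1 | 12 | 2 | 0 | 5 | 3 | 9 | ... | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 1 |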
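
For intuition, the counting step can be sketched for a single sequence with a sliding window. This is only an illustration under stated assumptions: windows overlap, and any window containing a character outside `residues` is skipped, much like the `Skipped NNN` messages above. `count_kmers_in_sequence` is a hypothetical helper, not part of `kvector`.

```python
from collections import Counter
from itertools import product


def count_kmers_in_sequence(sequence, k, residues="ACGT"):
    """Count overlapping kmers of length k, skipping windows with unknown residues."""
    allowed = set(residues)
    # Start every possible kmer at zero so absent kmers still appear in the result.
    counts = Counter({"".join(p): 0 for p in product(residues, repeat=k)})
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer) <= allowed:
            counts[kmer] += 1
        # Otherwise the window (e.g. 'GTN') contains an unknown nucleotide and is skipped.
    return counts


counts = count_kmers_in_sequence("ACGTNACG", k=3)
print(counts["ACG"], counts["CGT"], counts["AAA"])  # prints: 2 1 0
```

Initializing all `4**k` kmers to zero is what gives every sequence a vector of the same length, so the per-sequence results can be stacked into a table.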


Since this is a pandas dataframe, you can do convenient things like compute the mean and standard deviation of each kmer's count.

In [5]:
kmer_vector.mean()

AAA     5.833333
AAC     4.333333
AAG     2.166667
AAT     1.333333
ACA     3.166667
ACC     2.833333
ACG     0.666667
ACT     3.500000
AGA     2.000000
AGC     2.500000
AGG     2.000000
AGT     1.166667
ATA     0.500000
ATC     1.333333
ATG     1.666667
ATT     1.000000
CAA     3.000000
CAC     4.000000
CAG     2.666667
CAT     1.666667
CCA     2.166667
CCC     9.000000
CCG     2.333333
CCT     5.166667
CGA     0.500000
CGC     1.166667
CGG     2.666667
CGT     1.000000
CTA     1.833333
CTC     5.666667
          ...   
TGAG    0.500000
TGAT    0.166667
TGCA    1.833333
TGCC    0.333333
TGCG    0.666667
TGCT    0.500000
TGGA    1.000000
TGGC    0.333333
TGGG    0.333333
TGGT    0.833333
TGTA    0.500000
TGTC    0.666667
TGTG    0.333333
TGTT    0.333333
TTAA    0.333333
TTAC    0.333333
TTAG    0.166667
TTAT    0.166667
TTCA    0.500000
TTCC    0.500000
TTCG    0.333333
TTCT    0.333333
TTGA    0.166667
TTGC    0.333333
TTGG    0.500000
TTGT    0.500000
TTTA    0.166667
TTTC    0.8333

In [6]:
kmer_vector.std()

AAA     6.882345
AAC     3.076795
AAG     2.316607
AAT     2.804758
ACA     5.154286
ACC     2.714160
ACG     0.816497
ACT     2.810694
AGA     1.549193
AGC     3.728270
AGG     1.264911
AGT     1.471960
ATA     0.836660
ATC     2.422120
ATG     2.581989
ATT     2.449490
CAA     3.405877
CAC     4.857983
CAG     2.943920
CAT     2.658320
CCA     1.722401
CCC     5.899152
CCG     2.581989
CCT     2.786874
CGA     1.224745
CGC     1.169045
CGG     1.966384
CGT     1.549193
CTA     0.983192
CTC     3.141125
          ...   
TGAG    0.836660
TGAT    0.408248
TGCA    2.857738
TGCC    0.816497
TGCG    0.816497
TGCT    0.836660
TGGA    1.095445
TGGC    0.516398
TGGG    0.516398
TGGT    1.329160
TGTA    0.836660
TGTC    0.516398
TGTG    0.816497
TGTT    0.816497
TTAA    0.816497
TTAC    0.816497
TTAG    0.408248
TTAT    0.408248
TTCA    1.224745
TTCC    0.836660
TTCG    0.816497
TTCT    0.816497
TTGA    0.408248
TTGC    0.516398
TTGG    0.836660
TTGT    1.224745
TTTA    0.408248
TTTC    1.6020
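
When reading these numbers, note that `DataFrame.mean()` and `DataFrame.std()` work column by column, and that pandas' `std()` defaults to the sample standard deviation (`ddof=1`). The sketch below shows the difference with the standard library, using only the five `AAA` counts visible in the `head()` output above. Because the full table has more rows, the results differ from the `AAA` values printed by `kmer_vector.mean()` and `kmer_vector.std()`.

```python
import statistics

# The five AAA counts visible in the head() output above (not the full column).
aaa_counts = [0, 2, 2, 6, 6]

mean = statistics.mean(aaa_counts)
sample_std = statistics.stdev(aaa_counts)       # divides by n - 1, like pandas' default ddof=1
population_std = statistics.pstdev(aaa_counts)  # divides by n, like ddof=0

print(mean, round(sample_std, 6), round(population_std, 6))
```

To get the population version from pandas itself, pass `ddof=0` to `std()`.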