<h1 style="color:purple"  align="center"><b><u>kProcessor Tutorial</u></b></h1>

### Import the kProcessor Python package

In [1]:
import kProcessor as kp

### Import random library to generate random numbers

In [2]:
from random import randint

<h2><u>kDataFrame Class</u></h2>

<h2> The kDataFrame class holds kmers with their features and provides functions to manipulate them</h2>
<hr>
<h3><b><u>kDataFrame Functions</u></b></h3>
<ul>

<li><b>kDataFrame.insert(kmer): </b>Insert a kmer</li>
<li><b>kDataFrame.insert(kmer, count): </b>Insert a kmer with its count</li>

<li><b>kDataFrame.setCount(kmer, count): </b>set the kmer count</li>
<li><b>kDataFrame.count(kmer): </b>returns the kmer count (number of times that kmer was inserted)</li>

<li><b>kDataFrame.erase(kmer): </b>remove a kmer from the kDataFrame</li>
<li><b>kDataFrame.size(): </b>returns the number of kmers in the kDataFrame</li>

<li><b>kDataFrame.max_size(): </b>returns the maximum number of kmers that the kDataFrame can hold.</li>
<li><b>kDataFrame.save(): </b>Save the kDataFrame to disk.</li>
<li><b>kDataFrame.load(): </b>Load the kDataFrame from disk.</li>


</ul>

<hr style="height:1px;">

<h3><b><u>kDataFrameIterator Functions</u></b></h3>

<ul>

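<li><b>kDataFrame.begin(): </b>Return an iterator at the first position of a kDataFrame object (called on the kDataFrame itself, e.g. kf.begin())</li>
<li><b>kDataFrame.end(): </b>Return the position just past the last kmer of a kDataFrame object, used to detect the end of iteration (e.g. kf.end())</li>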
<li><b>kDataFrameIterator.next(): </b>Increment the iterator position by one</li>
<li><b>kDataFrameIterator.getKmer(): </b>Return the kmer at the current iterator position</li>
<li><b>kDataFrameIterator.getKmerCount(): </b>Return the kmer count at the current iterator position</li>
<li><b>kDataFrameIterator.setKmerCount(): </b>Set the kmer count at the current iterator position</li>

</ul>

<hr style="height:1px;">
    
<h3><b><u>Other Functions</u></b></h3>

<ul>

<li><b>parseSequences(fastq_file, no_threads, kDataFrame): </b>Parse the fastq and insert the kmers in the kDataFrame</li>

</ul>

<h2  align="center"> Example 1 (kDataFrame Manipulation) </h2>
<hr>
<h3><b>Description</b></h3>
<ol>
    <li> Create kDataFrame with kmerSize = 21 </li> 
    <li> Insert two kmers with explicit counts </li>
    <li> Query by each kmer and check the count </li>
    <li> Insert an existing kmer again and check its count </li>
    <li> Erase a kmer and check its count and the kDataFrame size </li>
</ol>

### Kmers List

In [3]:
kmers = ["CTGACTACTCAGAGCTAGCCT", "CGTAACCTATGCTAGCTAGAT"]

### Instantiate an object from the class kDataFrameMQF


In [4]:
kf = kp.kDataFrameMQF(21) # Empty kDataFrame

### Print the size of the kDataFrame

In [5]:
print("kf size: %d" % kf.size())

kf size: 0


### Insert the first kmer with count 1 and the second kmer with count 10

In [6]:
print("Inserting 2 kmers")
kf.insert(kmers[0], 1)
kf.insert(kmers[1], 10)

Inserting 2 kmers


True

### Print the size of the kDataFrame & counts of kmer1 and kmer2

In [7]:
print("kf size: %d" % kf.size())
# Print the counts of both kmers
print("Kmer1 Count: %d" % kf.count(kmers[0]))
print("Kmer2 Count: %d" % kf.count(kmers[1]))

kf size: 2
Kmer1 Count: 1
Kmer2 Count: 10


### Insert a duplicate kmer without count

In [8]:
print("[*] Inserting kmer 1 again")
kf.insert(kmers[0])

[*] Inserting kmer 1 again


True

### Print the first kmer count again

In [9]:
print("Kmer1 Count: %d" % kf.count(kmers[0]))

Kmer1 Count: 2


### Erase kmer1 from the kDataFrame

In [10]:
print("[*] Erasing kmer1")
kf.erase(kmers[0])

[*] Erasing kmer1


True

### Print the first kmer count again and the kDataFrame size

In [11]:
print("Kmer1 Count: %d" % kf.count(kmers[0]))
print("kf size: %d" % kf.size())

Kmer1 Count: 0
kf size: 1
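
The counting behaviour seen in this example can be modelled with a plain `collections.Counter`. This is only a pure-Python stand-in to make the semantics concrete, not how `kDataFrameMQF` is implemented. The class name `ToyKmerTable` is hypothetical, and the sketch assumes that `insert(kmer, count)` adds to an existing count, which the cells above do not test.

```python
from collections import Counter


class ToyKmerTable:
    """Hypothetical stand-in that mimics the counts observed in Example 1."""

    def __init__(self, k):
        self.k = k
        self._counts = Counter()

    def insert(self, kmer, count=1):
        # Assumption: inserting adds `count` to whatever is already stored.
        if len(kmer) != self.k:
            raise ValueError("expected a kmer of length %d" % self.k)
        self._counts[kmer] += count
        return True

    def count(self, kmer):
        # Missing kmers report 0, as kf.count() did after erase().
        return self._counts.get(kmer, 0)

    def erase(self, kmer):
        # Remove the kmer entirely, whatever its count was.
        return self._counts.pop(kmer, None) is not None

    def size(self):
        # Number of distinct kmers, not the sum of their counts.
        return len(self._counts)


toy = ToyKmerTable(21)
toy.insert("CTGACTACTCAGAGCTAGCCT", 1)
toy.insert("CGTAACCTATGCTAGCTAGAT", 10)
toy.insert("CTGACTACTCAGAGCTAGCCT")  # duplicate insert bumps the count to 2
toy.erase("CTGACTACTCAGAGCTAGCCT")
print(toy.count("CTGACTACTCAGAGCTAGCCT"), toy.size())
```

Note that `size()` counts distinct kmers, which matches the sizes printed above (2 after two inserts, 1 after the erase).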


<hr style="height:5px">

<h2  align="center"> Example 2 (Iteration on kDataFrame kmers) </h2>
<hr>
<h3><b>Description</b></h3>
<ol>
    <li> Create kDataFrame with kmerSize = 21 </li> 
    <li> Insert four kmers with random counts </li>
    <li> Iterate over the kDataFrame's kmers and print each kmer and its count </li>
    <li> Save the result in a dictionary </li>
</ol>

### Create a kmers list with 4 kmers

In [12]:
kmers = ["ATCATACTGATCGATCGATGC", "CGTAACCTATGCTAGCTAGAT", "CTGACTACTCAGAGCTAGCCT","CAATCGCTGATACGATACGTA"]

### Create an empty kDataFrame

In [13]:
kf2 = kp.kDataFrameMQF(21)

### Insert all kmers using a for loop

In [14]:
for kmer in kmers:
    random_count = randint(1,100) # generate random count between 1 and 100
    print("Inserting kmer: %s with count %d" % (kmer, random_count))
    kf2.insert(kmer, random_count)

Inserting kmer: ATCATACTGATCGATCGATGC with count 40
Inserting kmer: CGTAACCTATGCTAGCTAGAT with count 21
Inserting kmer: CTGACTACTCAGAGCTAGCCT with count 1
Inserting kmer: CAATCGCTGATACGATACGTA with count 99
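
Because the counts come from `randint`, rerunning the cell will generally print different numbers from the output shown here. If reproducible counts are wanted, one option is a dedicated seeded generator, sketched below. The seed value 42 and the names `rng` and `planned` are arbitrary choices for illustration, and the snippet only prints the planned inserts rather than calling kProcessor.

```python
from random import Random

kmers = ["ATCATACTGATCGATCGATGC", "CGTAACCTATGCTAGCTAGAT",
         "CTGACTACTCAGAGCTAGCCT", "CAATCGCTGATACGATACGTA"]

rng = Random(42)  # fixed seed, so the same counts are drawn on every run

# Draw one count per kmer up front; each pair could then be passed to insert().
planned = [(kmer, rng.randint(1, 100)) for kmer in kmers]

for kmer, count in planned:
    print("Would insert kmer: %s with count %d" % (kmer, count))
```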


### Iterate over all kmers and print their count and save them in a dictionary

In [15]:
# Create empty dictionary
kf2_data = {}

# Create an iterator at the first position of the kDataFrame
it = kf2.begin()

while it != kf2.end():
    
    # Get the kmer string
    kmer = it.getKmer()
    
    # Get the kmer count
    count = it.getKmerCount()
    
    # Print the data
    print("retrieved kmer: %s with count: %d" % (kmer, count))
    
    # Save data in a dictionary
    kf2_data[kmer] = count
    
    it.next()  # Advance the iterator; without this call the loop never ends

retrieved kmer: CTGACTACTCAGAGCTAGCCT with count: 1
retrieved kmer: CGTAACCTATGCTAGCTAGAT with count: 21
retrieved kmer: ATCATACTGATCGATCGATGC with count: 40
retrieved kmer: CAATCGCTGATACGATACGTA with count: 99


### Dump the dictionary data

In [16]:
kf2_data

{'CTGACTACTCAGAGCTAGCCT': 1,
 'CGTAACCTATGCTAGCTAGAT': 21,
 'ATCATACTGATCGATCGATGC': 40,
 'CAATCGCTGATACGATACGTA': 99}

<h2  align="center"> Example 3 (Create a kDataFrame from a fastq file) </h2>
<hr>
<h3><b>Description</b></h3>
<ol>
    <li> Create an empty kDataFrame with kmerSize = 21 </li> 
    <li> Parse a fastq file into the kDataFrame </li>
    <li> Save the kDataFrame on disk </li>
</ol>

### Create an empty kDataFrame

In [17]:
kf3 = kp.kDataFrameMQF(21)

### Parse the fastq file into the kf3 kDataFrame

In [18]:
kp.parseSequences("data/test.fastq", 1, kf3)
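
To make the parsing step concrete, here is a pure-Python sketch of what extracting kmers from fastq reads involves: take the sequence line of each four-line record and slide a window of length k across it. It is an illustration only, not kProcessor's implementation. The function names are hypothetical, and the sketch assumes plain four-line records, skips windows containing non-ACGT characters, and does no canonicalization, none of which is confirmed for `parseSequences`.

```python
from collections import Counter
from io import StringIO


def read_fastq_sequences(handle):
    """Yield the sequence line of each four-line FASTQ record."""
    for line_no, line in enumerate(handle):
        if line_no % 4 == 1:  # the second line of every record holds the bases
            yield line.strip().upper()


def count_kmers(sequences, k):
    """Count every length-k window; windows with non-ACGT bases are skipped."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


# Tiny in-memory FASTQ with two reads, standing in for a real file.
fastq_text = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGTNACG\n+\nIIIIIIII\n"
kmer_counts = count_kmers(read_fastq_sequences(StringIO(fastq_text)), k=4)
print(kmer_counts.most_common(3))
```

A read of length n contributes at most n - k + 1 kmers, so the number of distinct kmers reported for a file depends on both the read lengths and how often kmers repeat.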

### Print the size of kf3 and the kmer size used in it

In [21]:
print(kf3.size())
print(kf3.getkSize())

14481
21


### Save the kDataFrame on disk with the name "kf3"

In [20]:
kf3.save("kf3")

<hr style="height:5px">

<h2  align="center"> Example 4 (Loading the kDataFrame from disk) </h2>
<hr>
<h3><b>Description</b></h3>
<ol>
    <li> Load the kf3 kDataFrame from disk into a new kDataFrame named kf4 </li> 
    <li> Verify the loading by printing the total size and the kmerSize </li>
</ol>

In [22]:
kf4 = kp.kDataFrame.load("kf3")

In [23]:
print(kf4.size())
print(kf4.getkSize())

14481
21
