<h1 style="color:purple"  align="center"><b><u>kProcessor Tutorial</u></b></h1>

### Data Download

In [1]:
#!wget https://github.com/drtamermansour/nu-ngs02/raw/master/Day5/kProcessor/data/data.zip
#!unzip data.zip
#!ls data/

### Import the kProcessor Python package

In [2]:
import kProcessor as kp

### Import random library to generate random numbers

In [3]:
from random import randint

<h2 align="center"> <u> Main Classes and Functions </u> </h2>
<hr>


<h3 style="color:green">1- kDataFrame Object Instantiation</h3>
<ul>
    <li><b>kp.kDataFrameMQF(kSize): </b>Create an empty kDataFrame with a predefined kmer size (this is the class used in the examples below)</li>
</ul>

<hr style="height:1px;">

<h3 style="color:green">2- kDataFrame filling (From fastq file)</h3>
<ul>
    <li><b>kp.parseSequences(fastq_file, no_threads, kDataFrame): </b>Parse the fastq and insert the kmers in the kDataFrame with their count (kmer Counting)</li>
</ul>

<hr style="height:1px;">

<h3 style="color:green">3- kDataFrame basic functions</h3>
<ul>
<li><b>kDataFrame.size(): </b>returns the number of kmers in the kDataFrame</li>
<li><b>kDataFrame.max_size(): </b>returns the maximum number of kmers that the kDataFrame can hold.</li>
<li><b>kDataFrame.save(fileName): </b>Save the kDataFrame on disk.</li>
<li><b>kp.kDataFrame.load(fileName): </b>Load the kDataFrame from disk and return a kDataFrame object</li>
</ul>

<hr style="height:1px;">

<h3 style="color:green">4- kDataFrame kmer-specific functions</h3>
<ul>
<li><b>kDataFrame.insert(kmer): </b>Insert a kmer</li>
<li><b>kDataFrame.insert(kmer, count): </b>Insert a kmer with its count</li>
<li><b>kDataFrame.setCount(kmer, count): </b>set the kmer count</li>
<li><b>kDataFrame.count(kmer): </b>returns the kmer count (number of times that kmer was inserted)</li>
<li><b>kDataFrame.erase(kmer): </b>remove a kmer from the kDataFrame</li>
</ul>


<hr style="height:1px;">

<h3 style="color:green"><b>5- kDataFrameIterator Functions</b></h3>
<ul>
<li><b>kDataFrame.begin(): </b>Return an iterator at the first position of a kDataFrame object</li>
<li><b>kDataFrame.end(): </b>Return an iterator marking the end position of a kDataFrame object; iteration stops when it is reached</li>
<li><b>kDataFrameIterator.next(): </b>Increment the iterator position by one</li>
<li><b>kDataFrameIterator.getKmer(): </b>Return the kmer at the current iterator position</li>
<li><b>kDataFrameIterator.getKmerCount(): </b>Return the kmer count at the current iterator position</li>
<li><b>kDataFrameIterator.setKmerCount(): </b>Set the kmer count at the current iterator position</li>
</ul>

<hr style="height:1px;">
    
<h3 style="color:green"><b>6- coloredKDataFrame instantiation & filling</b></h3>
<ul>
<li><b>kp.index(fastaFile, namesFile, kmerSize): </b>Index the kmers of the given fasta file with respect to the namesFile and return a coloredKDataFrame object</li>
</ul>
    
<hr style="height:1px;">
 
<h3 style="color:green">7- colored kDataFrame functions</h3>
<ul>
    <li><b>Inherits All functions in the kDataFrame</b></li>
    <li><b>colored_kDataFrame.getKmerColor(kmer): </b>returns the color of the kmer</li>
    <li><b>colored_kDataFrame.getSamplesIDForColor(color, colorsList): </b>fill the given colorsList with the IDs of all transcripts that have that color</li>
    <li><b>colored_kDataFrame.getSamplesIDForKmer(kmer, colorsList): </b>fill the given colorsList with the IDs of all transcripts that contain that kmer</li>
    <li><b>colored_kDataFrame.namesMap: </b>unordered_map holding the names mapping as {tr_id:tr_name}</li>
    <li><b>colored_kDataFrame.namesMapInv: </b>unordered_map holding the inverse names mapping as {tr_name:tr_id}</li>
    <li><b>colored_kDataFrame.load(fileNamePrefix): </b>load the coloredkDataFrame from disk and return a coloredkDataFrame object</li>
    <li><b>colored_kDataFrame.save(fileNamePrefix): </b>save the coloredkDataFrame to disk</li>
</ul>

<hr style="height:2px;">

<h2  align="center"> Example 1 (Create a kDataFrame from a fastq file) </h2>
<hr>
<h3><b>Description</b></h3>
<ol>
    <li> Create an empty kDataFrame with kmerSize = 21 </li>
    <li> Parse a fastq file into the kDataFrame </li>
    <li> Save the kDataFrame on disk </li>
</ol>

### Create an empty kDataFrame

In [4]:
kf3 = kp.kDataFrameMQF(21)

### Parse the fastq file into the kf3 kDataFrame

In [5]:
kp.parseSequences("data/test.fastq", 1, kf3)
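
To make the idea of kmer counting concrete, here is a minimal plain-Python sketch of what "insert the kmers with their count" means. It is not kProcessor's implementation: `count_kmers` and `toy_reads` are hypothetical names introduced for illustration, the reads are made up, and the library may treat reverse complements or ambiguous bases differently from this sketch.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every substring of length k across the given sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k; sequences shorter than k yield nothing.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Made-up reads for illustration, not taken from data/test.fastq.
toy_reads = ["ACGTACGT", "CGTA"]
counts = count_kmers(toy_reads, 4)

print(len(counts))      # number of distinct kmers, comparable to size()
print(counts["CGTA"])   # how many times one kmer was seen, comparable to count(kmer)
```

The number of distinct keys plays the role of `size()`, and the stored value plays the role of `count(kmer)`.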

### Print the kf3 size and the kmer size used in it

In [6]:
print(kf3.size())
print(kf3.getkSize())

14481
21


### Save the kDataFrame on disk with a name "kf3"

In [7]:
kf3.save("kf3")

<hr style="height:5px">

<h2  align="center"> Example 2 (Loading the kDataFrame from disk) </h2>
<hr>
<h3><b>Description</b></h3>
<ol>
    <li> Load the kf3 kDataFrame from disk into a new kDataFrame named kf4 </li>
    <li> Verify the loading by printing the total size and the kmerSize </li>
</ol>

In [8]:
kf4 = kp.kDataFrame.load("kf3")

In [9]:
print(kf4.size())
print(kf4.getkSize())

14481
21


<h2  align="center"> Example 3 (kDataFrame Manipulation) </h2>
<hr>
<h3><b>Description</b></h3>
<ol>
    <li> Create a kDataFrame with kmerSize = 21 </li>
    <li> Insert two kmers with explicit counts </li>
    <li> Query each kmer and check its count </li>
    <li> Insert an existing kmer again and check its count </li>
    <li> Erase a kmer and check its count and the kDataFrame size again </li>
</ol>

### Kmers List

In [10]:
kmers = ["CTGACTACTCAGAGCTAGCCT", "CGTAACCTATGCTAGCTAGAT"]

### Instantiate an object from the class kDataFrameMQF


In [11]:
kf = kp.kDataFrameMQF(21) # Empty kDataFrame

### Print the size of the kDataFrame

In [12]:
print("kf size: %d" % kf.size())

kf size: 0


### Insert the first kmer with count 1 and the second kmer with count 10

In [13]:
print("Inserting 2 kmers")
kf.insert(kmers[0], 1)
kf.insert(kmers[1], 10)

Inserting 2 kmers


True

### Print the size of the kDataFrame & counts of kmer1 and kmer2

In [14]:
print("kf size: %d" % kf.size())
# Print the first kmer count 
print("Kmer1 Count: %d" % kf.count(kmers[0]))
print("Kmer2 Count: %d" % kf.count(kmers[1]))

kf size: 2
Kmer1 Count: 1
Kmer2 Count: 10


### Insert a duplicate kmer without count

In [15]:
print("[*] Inserting kmer 1 again")
kf.insert(kmers[0])

[*] Inserting kmer 1 again


True

### Print the first kmer count again

In [16]:
print("Kmer1 Count: %d" % kf.count(kmers[0]))

Kmer1 Count: 2


### Erase kmer1 from the kDataFrame

In [17]:
print("[*] Erasing kmer1")
kf.erase(kmers[0])

[*] Erasing kmer1


True

### Print the first kmer count again and the kDataFrame size

In [18]:
print("Kmer1 Count: %d" % kf.count(kmers[0]))
print("kf size: %d" % kf.size())

Kmer1 Count: 0
kf size: 1
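
The behaviour seen in this example can be summarised with a small dictionary-backed stand-in. This is only a sketch of the observed semantics, not how kProcessor stores kmers: `ToyKmerTable` is a hypothetical class, and the rule that `insert` adds to an existing count is an assumption inferred from the output above (count 1, insert again, count 2).

```python
class ToyKmerTable:
    """Dictionary-backed stand-in (hypothetical, not part of kProcessor)."""

    def __init__(self, k):
        self.k = k
        self._counts = {}

    def insert(self, kmer, count=1):
        # Assumption: inserting adds `count` to whatever is already stored.
        if len(kmer) != self.k:
            raise ValueError("expected a kmer of length %d" % self.k)
        self._counts[kmer] = self._counts.get(kmer, 0) + count
        return True

    def count(self, kmer):
        # A kmer that was never inserted (or was erased) has count 0.
        return self._counts.get(kmer, 0)

    def erase(self, kmer):
        # Returns True only if the kmer was present.
        return self._counts.pop(kmer, None) is not None

    def size(self):
        return len(self._counts)


toy = ToyKmerTable(21)
toy.insert("CTGACTACTCAGAGCTAGCCT", 1)
toy.insert("CGTAACCTATGCTAGCTAGAT", 10)
toy.insert("CTGACTACTCAGAGCTAGCCT")          # duplicate insert without a count
print(toy.count("CTGACTACTCAGAGCTAGCCT"))    # count after the duplicate insert
toy.erase("CTGACTACTCAGAGCTAGCCT")
print(toy.count("CTGACTACTCAGAGCTAGCCT"), toy.size())
```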


<hr style="height:5px">

<h2  align="center"> Example 4 (Iteration over kDataFrame kmers) </h2>
<hr>
<h3><b>Description</b></h3>
<ol>
    <li> Create a kDataFrame with kmerSize = 21 </li>
    <li> Insert some kmers with random counts </li>
    <li> Iterate over the kDataFrame's kmers and print each kmer and its count </li>
    <li> Save the result in a dictionary </li>
</ol>

### Create a kmers list with 4 kmers

In [19]:
kmers = ["ATCATACTGATCGATCGATGC", "CGTAACCTATGCTAGCTAGAT", "CTGACTACTCAGAGCTAGCCT","CAATCGCTGATACGATACGTA"]

### Create an empty kDataFrame

In [20]:
kf2 = kp.kDataFrameMQF(21)

### Insert all kmers using a for loop

In [21]:
for kmer in kmers:
    random_count = randint(1,100) # generate random count between 1 and 100
    print("Inserting kmer: %s with count %d" % (kmer, random_count))
    kf2.insert(kmer, random_count)

Inserting kmer: ATCATACTGATCGATCGATGC with count 77
Inserting kmer: CGTAACCTATGCTAGCTAGAT with count 58
Inserting kmer: CTGACTACTCAGAGCTAGCCT with count 32
Inserting kmer: CAATCGCTGATACGATACGTA with count 13


### Iterate over all kmers and print their count and save them in a dictionary

In [22]:
# Create empty dictionary
kf2_data = {}

# create iterator with the first position in the kDataFrame
it = kf2.begin()

while(it != kf2.end()):
    
    # Get the kmer string
    kmer = it.getKmer()
    
    # Get the kmer count
    count = it.getKmerCount()
    
    # Print the data
    print("retrieved kmer: %s with count: %d" % (kmer, count))
    
    # Save data in a dictionary
    kf2_data[kmer] = count
    
    it.next() # Extremely important: advance the iterator, otherwise the loop never ends

retrieved kmer: CTGACTACTCAGAGCTAGCCT with count: 32
retrieved kmer: CGTAACCTATGCTAGCTAGAT with count: 58
retrieved kmer: ATCATACTGATCGATCGATGC with count: 77
retrieved kmer: CAATCGCTGATACGATACGTA with count: 13
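
Because forgetting `it.next()` leaves the loop stuck on the same position, it can help to wrap the begin/end/next pattern in a generator once and then use an ordinary `for` loop. The sketch below assumes only the methods used above (`begin()`, `end()`, `next()`, `getKmer()`, `getKmerCount()`). `iter_kmers` is a hypothetical helper, and `ToyFrame` and `ToyIterator` are stand-ins written so the example runs without kProcessor installed.

```python
class ToyIterator:
    """Stand-in for a kDataFrame iterator (hypothetical, for illustration)."""

    def __init__(self, items, pos):
        self._items = items
        self._pos = pos

    def getKmer(self):
        return self._items[self._pos][0]

    def getKmerCount(self):
        return self._items[self._pos][1]

    def next(self):
        self._pos += 1

    def __eq__(self, other):
        return self._pos == other._pos

    def __ne__(self, other):
        return not self == other


class ToyFrame:
    """Stand-in exposing begin() and end() like the kDataFrame used above."""

    def __init__(self, counts):
        self._items = list(counts.items())

    def begin(self):
        return ToyIterator(self._items, 0)

    def end(self):
        return ToyIterator(self._items, len(self._items))


def iter_kmers(frame):
    """Yield (kmer, count) pairs; the advance step lives in one place."""
    it = frame.begin()
    end = frame.end()
    while it != end:
        yield it.getKmer(), it.getKmerCount()
        it.next()


frame = ToyFrame({"ATCATACTGATCGATCGATGC": 77, "CGTAACCTATGCTAGCTAGAT": 58})
data = dict(iter_kmers(frame))
print(data)
```

With such a helper, building the dictionary becomes a single `dict(iter_kmers(...))` call.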


### Dump the dictionary data to a file

In [23]:
with open("kf2_data.tsv", 'w') as out_file:  # separate name, so the kf2 kDataFrame is not overwritten
    out_file.write("kmer\tcount\n")
    for kmer, count in kf2_data.items():
        out_file.write("%s\t%d\n" % (kmer, count))
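
A file written in this layout can be read back with the standard `csv` module. The following sketch writes a small two-column file to a temporary directory and reloads it. `read_kmer_counts` is a hypothetical helper name, and the two rows reuse counts printed above purely as sample data.

```python
import csv
import os
import tempfile


def read_kmer_counts(path):
    """Read a tab-separated file with 'kmer' and 'count' columns into a dict."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["kmer"]: int(row["count"]) for row in reader}


# Sample rows in the same layout as kf2_data.tsv.
expected = {"CTGACTACTCAGAGCTAGCCT": 32, "CGTAACCTATGCTAGCTAGAT": 58}

path = os.path.join(tempfile.mkdtemp(), "kf2_data.tsv")
with open(path, "w") as out_file:
    out_file.write("kmer\tcount\n")
    for kmer, count in expected.items():
        out_file.write("%s\t%d\n" % (kmer, count))

loaded = read_kmer_counts(path)
print(loaded == expected)
```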

<hr style="height:5px">

<h2  align="center"> Example 5 (Kmers Indexing) </h2>
<hr>
<h3><b>Description</b></h3>
<ol>
    <li>Index a fasta file using the index() function and store the coloredKDataFrame in ckf1</li>
    <li>Load the namesMap as a python dictionary and print it</li>
    <li>Load the inverted namesMap as a python dictionary and print it</li>
    <li>Query by kmer to get its color</li>
    <li>Save the coloredKDataFrame on disk</li>
</ol>

In [24]:
ckf1 = kp.index("data/min_test_sample.fa","data/min_test_sample.fa.names", 31)

In [25]:
namesMap = dict(ckf1.namesMap)
inverted_namesMap = dict(ckf1.namesMapInv)

print(namesMap)
print("-----------------")
print(inverted_namesMap)

{1: 'ENST00000641515.2', 31: 'ENST00000624697.3', 20: 'ENST00000622660.1', 10: 'ENST00000622503.4', 15: 'ENST00000620200.4', 13: 'ENST00000618779.4', 11: 'ENST00000618323.4', 9: 'ENST00000618181.4', 8: 'ENST00000617307.4', 14: 'ENST00000616125.4', 12: 'ENST00000616016.4', 25: 'ENST00000491024.1', 30: 'ENST00000484667.2', 21: 'ENST00000466300.1', 17: 'ENST00000455979.1', 19: 'ENST00000338591.7', 2: 'ENST00000335137.4', 16: 'ENST00000341065.8', 29: 'ENST00000304952.10', 4: 'ENST00000332831.4', 18: 'ENST00000327044.6', 26: 'ENST00000341290.6', 7: 'ENST00000342066.7', 24: 'ENST00000379407.7', 23: 'ENST00000379409.6', 6: 'ENST00000437963.5', 22: 'ENST00000379410.7', 5: 'ENST00000420190.6', 3: 'ENST00000426406.3', 28: 'ENST00000428771.6', 27: 'ENST00000433179.3'}
-----------------
{'ENST00000624697.3': 31, 'ENST00000622660.1': 20, 'ENST00000622503.4': 10, 'ENST00000620200.4': 15, 'ENST00000618323.4': 11, 'ENST00000617307.4': 8, 'ENST00000616125.4': 14, 'ENST00000491024.1': 25, 'ENST000004846

<h2>namesMap as an unordered_map <span style="color:green">[Optional]</span> </h2>

### Iterate over the namesMap <span style="color:green">[Optional]</span>

In [26]:
# Get the namesMap
c_namesMap = ckf1.namesMap

# create map iterator
c_it = c_namesMap.begin()

while(c_it != c_namesMap.end()):
    
    # Get the current value (python tuple) and print it
    curr_val = c_it.value()
    print(curr_val)
    
    # Increment the iterator
    c_it.next() # Extremely important: advance the iterator, otherwise the loop never ends

(1, 'ENST00000641515.2')
(31, 'ENST00000624697.3')
(20, 'ENST00000622660.1')
(10, 'ENST00000622503.4')
(15, 'ENST00000620200.4')
(13, 'ENST00000618779.4')
(11, 'ENST00000618323.4')
(9, 'ENST00000618181.4')
(8, 'ENST00000617307.4')
(14, 'ENST00000616125.4')
(12, 'ENST00000616016.4')
(25, 'ENST00000491024.1')
(30, 'ENST00000484667.2')
(21, 'ENST00000466300.1')
(17, 'ENST00000455979.1')
(19, 'ENST00000338591.7')
(2, 'ENST00000335137.4')
(16, 'ENST00000341065.8')
(29, 'ENST00000304952.10')
(4, 'ENST00000332831.4')
(18, 'ENST00000327044.6')
(26, 'ENST00000341290.6')
(7, 'ENST00000342066.7')
(24, 'ENST00000379407.7')
(23, 'ENST00000379409.6')
(6, 'ENST00000437963.5')
(22, 'ENST00000379410.7')
(5, 'ENST00000420190.6')
(3, 'ENST00000426406.3')
(28, 'ENST00000428771.6')
(27, 'ENST00000433179.3')


### Find a single element <span style="color:green">[Optional]</span>

In [27]:
# Search for transcript with ID = 1
id_1 = c_namesMap.find(1).value()
print(id_1)

# Search using the dictionary
print(namesMap[1])

(1, 'ENST00000641515.2')
ENST00000641515.2


### Get a kmer color from the coloredKDataFrame

In [28]:
kmer = "TCGAAGCTGGAGAAGGCGGACATCCTGGAGA"[:31]
color = ckf1.getKmerColor(kmer)
print(color)

68


### Get all the transcript IDs associated with this kmer's color

In [29]:
#1 Create an empty colorsList
colors_list = kp.colorsList()

#2 Fill it with the transcript IDs
ckf1.getSamplesIDForColor(68, colors_list)

#3 Print the colorsList as a list
print(list(colors_list))

[28, 29, 30]


### Get all transcript names of that color

In [30]:
for color in colors_list:
    print(c_namesMap.find(color).value())

(28, 'ENST00000428771.6')
(29, 'ENST00000304952.10')
(30, 'ENST00000484667.2')


### Get all transcript names of that color using the namesMap dictionary

In [31]:
for color in colors_list:
    print(namesMap[color])

ENST00000428771.6
ENST00000304952.10
ENST00000484667.2
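
The steps above show that a color stands for a set of transcript IDs: color 68 resolves to the IDs 28, 29 and 30. The sketch below illustrates that idea in plain Python on made-up sequences. It is not kProcessor's indexing algorithm. `build_color_index` is a hypothetical function, and the color numbers it assigns are arbitrary and do not correspond to the colors printed in this tutorial.

```python
def build_color_index(samples, k):
    """Map each kmer to a color, where a color identifies a set of sample IDs.

    samples: dict of {sample_id: sequence}
    Returns (kmer_to_color, color_to_samples).
    """
    # First collect, for every kmer, the set of samples that contain it.
    kmer_to_samples = {}
    for sample_id, seq in samples.items():
        for i in range(len(seq) - k + 1):
            kmer_to_samples.setdefault(seq[i:i + k], set()).add(sample_id)

    # Then give each distinct sample set one color (arbitrary numbering).
    color_of_set = {}
    kmer_to_color = {}
    for kmer, ids in kmer_to_samples.items():
        key = frozenset(ids)
        if key not in color_of_set:
            color_of_set[key] = len(color_of_set) + 1
        kmer_to_color[kmer] = color_of_set[key]

    color_to_samples = {c: sorted(s) for s, c in color_of_set.items()}
    return kmer_to_color, color_to_samples


# Made-up sequences keyed by made-up sample IDs.
toy_samples = {1: "ACGTT", 2: "ACGTA", 3: "CGTAC"}
kmer_to_color, color_to_samples = build_color_index(toy_samples, 4)

shared_color = kmer_to_color["CGTA"]        # analogous to getKmerColor(kmer)
print(color_to_samples[shared_color])       # analogous to getSamplesIDForColor(color, ...)
```

Kmers that occur in exactly the same samples share one color, so the per-kmer value stays a single integer however many samples contain the kmer.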


### Save ckf1 to disk with the file name prefix "ckf1"

In [32]:
ckf1.save("ckf1")

<hr style="height:5px">

<h2  align="center"> Example 6 (Loading ckf1 as kDataFrame and iterate over kmers & colors) </h2>
<hr>
<h3><b>Description</b></h3>
<ol>
    <li>Load the ckf1 coloredKDataFrame as a plain kDataFrame named kf5 (it can also be loaded as a coloredKDataFrame)</li>
    <li>Iterate over the kmers and read each kmer's color, which is stored in place of the count</li>
    <li>Print the first 5 kmers with their colors and verify them against ckf1</li>
</ol>

### Loading ckf1 as kDataFrame

In [33]:
kf5 = kp.kDataFrame.load("ckf1")

### To load as coloredkDataFrame <span style="color:brown"> (will not be used in the next steps) </span>

In [34]:
ckf2 = kp.colored_kDataFrame.load("ckf1")

### Create an Iterator

In [35]:
kf5_it = kf5.begin()

### Iterate and print the first 5 kmers with their colors

In [36]:
for i in range(5):
    
    # Get the kmer
    kmer = kf5_it.getKmer()
    
    # Get the color (stored as count)
    color = kf5_it.getKmerCount()
    
    # Verify by querying the kmer on the colored kDataFrame
    
    color2 = ckf1.getKmerColor(kmer)
    
    print("Kmer  : %s" % kmer)
    print("Color : %d" % color)
    print("Color2: %d" % color2)
    print("-------------------------------")
    
    kf5_it.next() # Extremely important: advance the iterator to the next kmer

Kmer  : GGGCGCACCATGGCCGCAGACACGCCGGGGA
Color : 67
Color2: 67
-------------------------------
Kmer  : GTACTCGGCCGGCGGCTATGACGGGGCCTCC
Color : 59
Color2: 59
-------------------------------
Kmer  : CTGCAGCCGCCGCCAGAGGGTTTCCTTCGGC
Color : 18
Color2: 18
-------------------------------
Kmer  : GGCACAGGAGAGCAGCCCTTGTCCCCCACGA
Color : 46
Color2: 46
-------------------------------
Kmer  : TGCCCTGTACGTGGCAGGGGGCAACGACGGC
Color : 59
Color2: 59
-------------------------------
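
Once kmer and color pairs have been collected into a dictionary, a common follow-up is to group the kmers by color. The sketch below does this for the five pairs printed above, copied in by hand so the example runs without kProcessor. `group_by_color` is a hypothetical helper name.

```python
from collections import defaultdict

# The five kmer/color pairs printed above, copied by hand.
kmer_colors = {
    "GGGCGCACCATGGCCGCAGACACGCCGGGGA": 67,
    "GTACTCGGCCGGCGGCTATGACGGGGCCTCC": 59,
    "CTGCAGCCGCCGCCAGAGGGTTTCCTTCGGC": 18,
    "GGCACAGGAGAGCAGCCCTTGTCCCCCACGA": 46,
    "TGCCCTGTACGTGGCAGGGGGCAACGACGGC": 59,
}


def group_by_color(kmer_to_color):
    """Invert a {kmer: color} dictionary into {color: [kmers]}."""
    groups = defaultdict(list)
    for kmer, color in kmer_to_color.items():
        groups[color].append(kmer)
    return dict(groups)


groups = group_by_color(kmer_colors)
for color in sorted(groups):
    print(color, len(groups[color]))
```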


<hr style="height:3px;">