# Recipe 5: Kmers Indexing

## Description

1. Index a FASTA file with the `index()` function and store the resulting colored kDataFrame in `ckf1`
2. Load the namesMap as a Python dictionary and print it
3. Load the inverted namesMap as a Python dictionary and print it
4. Query a kmer to get its color
5. Save the colored kDataFrame to disk

## Implementation

### Importing

In [1]:
import kProcessor as kp

### Create kDataFrame Object

In [2]:
KF = kp.kDataFrameMQF(31)

### Indexing

In [3]:
# kp.index(kDataFrame, mode, params, file_path, chunk size, namesfile)
ckf1 = kp.index(KF, "kmers", {"k_size" : 31}, "data/min_test_sample.fa", 1000, "data/min_test_sample.fa.names")
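
To make the `"kmers"` mode and the `k_size` parameter concrete, here is a minimal pure-Python sketch of what splitting a sequence into kmers means. The helper `extract_kmers` and the toy sequence are hypothetical and exist only for illustration; the sketch says nothing about how kProcessor hashes or stores kmers internally.

```python
def extract_kmers(sequence, k):
    """Return every overlapping substring of length k, from left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n yields n - k + 1 kmers (none if n < k)
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Toy input, not taken from data/min_test_sample.fa
toy_sequence = "ACGTACGTAC"
toy_kmers = extract_kmers(toy_sequence, 4)
print(toy_kmers)
```

With `k_size` set to 31, as in this recipe, each window would be 31 bases long instead of 4.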

### Getting the names map and its inverse

In [4]:
namesMap = ckf1.names_map()
inverse_namesMap = ckf1.inverse_names_map()

In [5]:
namesMap

{14: 'ENST00000616125.4',
 23: 'ENST00000379409.6',
 5: 'ENST00000420190.6',
 19: 'ENST00000338591.7',
 28: 'ENST00000428771.6',
 10: 'ENST00000622503.4',
 1: 'ENST00000641515.2',
 24: 'ENST00000379407.7',
 15: 'ENST00000620200.4',
 6: 'ENST00000437963.5',
 20: 'ENST00000622660.1',
 29: 'ENST00000304952.10',
 11: 'ENST00000618323.4',
 2: 'ENST00000335137.4',
 25: 'ENST00000491024.1',
 22: 'ENST00000379410.7',
 9: 'ENST00000618181.4',
 26: 'ENST00000341290.6',
 4: 'ENST00000332831.4',
 21: 'ENST00000466300.1',
 18: 'ENST00000327044.6',
 13: 'ENST00000618779.4',
 30: 'ENST00000484667.2',
 27: 'ENST00000433179.3',
 31: 'ENST00000624697.3',
 8: 'ENST00000617307.4',
 17: 'ENST00000455979.1',
 12: 'ENST00000616016.4',
 3: 'ENST00000426406.3',
 7: 'ENST00000342066.7',
 16: 'ENST00000341065.8'}

In [6]:
# Inverse of the namesMap dictionary
inverse_namesMap

{'ENST00000624697.3': 31,
 'ENST00000379407.7': 24,
 'ENST00000622503.4': 10,
 'ENST00000622660.1': 20,
 'ENST00000616125.4': 14,
 'ENST00000428771.6': 28,
 'ENST00000491024.1': 25,
 'ENST00000338591.7': 19,
 'ENST00000618779.4': 13,
 'ENST00000433179.3': 27,
 'ENST00000455979.1': 17,
 'ENST00000341290.6': 26,
 'ENST00000616016.4': 12,
 'ENST00000342066.7': 7,
 'ENST00000379410.7': 22,
 'ENST00000426406.3': 3,
 'ENST00000379409.6': 23,
 'ENST00000484667.2': 30,
 'ENST00000620200.4': 15,
 'ENST00000341065.8': 16,
 'ENST00000617307.4': 8,
 'ENST00000641515.2': 1,
 'ENST00000332831.4': 4,
 'ENST00000466300.1': 21,
 'ENST00000437963.5': 6,
 'ENST00000304952.10': 29,
 'ENST00000618181.4': 9,
 'ENST00000327044.6': 18,
 'ENST00000618323.4': 11,
 'ENST00000335137.4': 2,
 'ENST00000420190.6': 5}
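
The two dictionaries above mirror each other: one maps sequence IDs to names, the other maps names back to IDs. A minimal pure-Python sketch of that relationship, using three entries copied from the output above, might look as follows. `invert_mapping` is a hypothetical helper, not part of kProcessor.

```python
def invert_mapping(mapping):
    """Swap keys and values; refuse to invert if two keys share a value."""
    inverted = {value: key for key, value in mapping.items()}
    if len(inverted) != len(mapping):
        raise ValueError("values are not unique, so the mapping cannot be inverted")
    return inverted


# Three ID -> name entries taken from the namesMap output above
toy_names_map = {
    1: "ENST00000641515.2",
    2: "ENST00000335137.4",
    3: "ENST00000426406.3",
}
toy_inverse = invert_mapping(toy_names_map)
print(toy_inverse)
```

Looking a name up in the inverse and then looking the resulting ID up in the original returns the same name, which is a quick consistency check for the pair.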

### Get a kmer color from the colored kDataFrame

In [7]:
# kmer from the dataset to perform the query
kmer = "TCGAAGCTGGAGAAGGCGGACATCCTGGAGA"

# Get the color
kmer_color = ckf1.getKmerColor(kmer)

print(kmer_color)

61


### Get all the transcript IDs associated with the previous color

In [8]:
transcript_ids = ckf1.getKmerSourceFromColor(61)

### Get all transcript names of that color

In [9]:
for tr_id in transcript_ids:
    print(namesMap[tr_id])

ENST00000428771.6
ENST00000304952.10
ENST00000484667.2
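
The lookup chain in the cells above is kmer → color → sequence IDs → names. The sketch below reproduces the last two steps with plain dictionaries standing in for the colored kDataFrame. The ID list `[28, 29, 30]` is inferred by matching the three printed names against the namesMap output, and `names_for_color` is a hypothetical helper, not a kProcessor function.

```python
def names_for_color(color, color_to_sources, names_map):
    """Resolve a color to the names of every sequence that carries it."""
    return [names_map[source_id] for source_id in color_to_sources[color]]


# ID -> name entries copied from the namesMap output
toy_names_map = {
    28: "ENST00000428771.6",
    29: "ENST00000304952.10",
    30: "ENST00000484667.2",
}
# Assumed color table: color 61 stands for the set of sequences 28, 29 and 30
toy_color_to_sources = {61: [28, 29, 30]}

resolved = names_for_color(61, toy_color_to_sources, toy_names_map)
print(resolved)
```

The point of the indirection is that a color is a single integer stored per kmer, while the set of sequences it stands for is kept once in a separate table.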


### Save the colored kDataFrame to disk with name "ckf1"

In [10]:
ckf1.save("ckf1")

---

### Load the colored kDataFrame again from disk.

Unlike the other kDataFrames, the colored_kDataFrame is a structure that contains a kDataFrame alongside the color information of its kmers.
<br>
Loading a colored_kDataFrame is done by using the static function `load(file_path)` from the `colored_kDataFrame` class.

In [11]:
loaded_ckf = kp.colored_kDataFrame.load("ckf1")

We can use the colored_kDataFrame to query kmers and find their associated colors and source sequences.
<br>
The kDataFrame encapsulated in the colored_kDataFrame, on the other hand, is used to iterate directly over the kmers and their colors.

### Get the embedded kDataFrame in the colored_kDataFrame *loaded_ckf*

In [12]:
loaded_kf = loaded_ckf.getkDataFrame()

### Print the types of loaded_kf & loaded_ckf 

In [13]:
print(type(loaded_ckf))
print(type(loaded_kf))

<class 'kProcessor.colored_kDataFrame'>
<class 'kProcessor.kDataFrame'>


### Create a kDataFrame Iterator 

In [14]:
it = loaded_kf.begin()

### Iterate and print the first 5 kmers with their colors

In [15]:
for i in range(5):
    
    # Get the kmer
    kmer = it.getKmer()
    
    # Get the color (stored as count)
    color = it.getKmerCount()
    
    # Verify by querying the kmer on the colored kDataFrame
    assert loaded_ckf.getKmerColor(kmer) == color
    
    print("Kmer  : %s" % kmer)
    print("Color : %d" % color)
    
    print("-------------------------------")
    
    it.next()  # Advance the iterator; without this call every pass reads the same kmer

Kmer  : ACTTCCCAGCCCGCTTCCCGTCCCACCCTCG
Color : 64
-------------------------------
Kmer  : CCTCTCCGTCCGAGTCTTTGGGGGGCTCGTC
Color : 45
-------------------------------
Kmer  : GTACTCGGCCGGCGGCTATGACGGGGCCTCC
Color : 40
-------------------------------
Kmer  : AGGGCACCCTCCAGCACGGCCACGCCCGCTG
Color : 40
-------------------------------
Kmer  : CTGCAGCCGCCGCCAGAGGGTTTCCTTCGGC
Color : 18
-------------------------------
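
The read-then-advance pattern in the loop above can be wrapped in a small helper. The sketch below relies only on the three iterator methods used in this recipe (`getKmer()`, `getKmerCount()`, `next()`). Both `take_kmers` and the stand-in class `FakeKmerIterator` are hypothetical; the stand-in exists only so the example runs without kProcessor installed. Like the original loop, the helper does not check whether the iterator has reached the end.

```python
class FakeKmerIterator:
    """Hypothetical stand-in exposing the three iterator methods used in this recipe."""

    def __init__(self, pairs):
        self._pairs = pairs
        self._pos = 0

    def getKmer(self):
        return self._pairs[self._pos][0]

    def getKmerCount(self):
        return self._pairs[self._pos][1]

    def next(self):
        self._pos += 1


def take_kmers(it, n):
    """Collect n (kmer, color) pairs, advancing the iterator after each read."""
    collected = []
    for _ in range(n):
        collected.append((it.getKmer(), it.getKmerCount()))
        it.next()  # move to the following kmer before the next read
    return collected


# Two (kmer, color) pairs copied from the output above
sample_pairs = [
    ("ACTTCCCAGCCCGCTTCCCGTCCCACCCTCG", 64),
    ("CCTCTCCGTCCGAGTCTTTGGGGGGCTCGTC", 45),
]
first_two = take_kmers(FakeKmerIterator(sample_pairs), 2)
print(first_two)
```

With kProcessor available, the same helper would presumably be called as `take_kmers(loaded_kf.begin(), 5)`.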


---

## Complete Script

```python
import kProcessor as kp

KF = kp.kDataFrameMQF(31)

# kp.index(kDataFrame, mode, params, file_path, chunk size, namesfile)
ckf1 = kp.index(KF, "kmers", {"k_size" : 31}, "data/min_test_sample.fa", 1000, "data/min_test_sample.fa.names")

namesMap = ckf1.names_map()

inverse_namesMap = ckf1.inverse_names_map()

print(f"namesMap: {namesMap}")

# Inverse of the namesMap dictionary
print(f"inverse_namesMap: {inverse_namesMap}")


# kmer from the dataset to perform the query
kmer = "TCGAAGCTGGAGAAGGCGGACATCCTGGAGA"

# Get the color
kmer_color = ckf1.getKmerColor(kmer)

print(kmer_color)

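# Get the IDs of all transcripts associated with that color
transcript_ids = ckf1.getKmerSourceFromColor(kmer_color)

# Print the name of each transcript
for tr_id in transcript_ids:
    print(namesMap[tr_id])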


ckf1.save("ckf1")

loaded_ckf = kp.colored_kDataFrame.load("ckf1")
loaded_kf = loaded_ckf.getkDataFrame()

it = loaded_kf.begin()

for i in range(5):
    
    # Get the kmer
    kmer = it.getKmer()
    
    # Get the color (stored as count)
    color = it.getKmerCount()
    
    # Verify by querying the kmer on the colored kDataFrame
    assert loaded_ckf.getKmerColor(kmer) == color
    
    print("Kmer  : %s" % kmer)
    print("Color : %d" % color)
    
    print("-------------------------------")
    
    it.next()  # Advance the iterator; without this call every pass reads the same kmer

```