In [1]:
%%sh

pip3 install keyvi
pip3 install memory_profiler



In [3]:
import gzip
from timeit import timeit
from tf.core.data import Data
from tf.core.timestamp import Timestamp
from tf.parameters import PICKLE_PROTOCOL, GZIP_LEVEL
from keyvi.compiler import StringDictionaryCompiler
from keyvi.dictionary import Dictionary

# Keyvi

Sophisticated key-value store using shared memory.

See the [GitHub repository](https://github.com/KeyviDev/keyvi).

# First steps

In [4]:
%%sh

keyvi compile toy.txt toy.kv key-only
keyvi dump toy.kv toy.kv.txt

In [5]:
data = {"1": "aap", "2": "noot"}

In [6]:
kv = StringDictionaryCompiler()

In [7]:
for (k, v) in data.items():
    kv.Add(k, v)

In [8]:
kv.Compile()

In [9]:
kv.WriteToFile("data.txt")

In [10]:
D = Dictionary("data.txt")

In [11]:
"1" in D

True

In [12]:
"2" in D

True

In [13]:
"3" in D

False

In [14]:
D["1"].GetValueAsString()

'aap'

# Data preparation

We load the feature `g_word_utf8` from the BHSA.

In [15]:
path = '_temp/g_word_utf8.tf'

In [16]:
tmObj = Timestamp()
F = Data(path, tmObj)
F._readDataBin()
print(len(F.data))

426584


We also compile this data as keyvi data:

In [17]:
kv = StringDictionaryCompiler()
for (k, v) in F.data.items():
    kv.Add(str(k), v)
kv.Compile()
kv.WriteToFile("_temp/kv/g_word_utf8.txt")
with gzip.open("_temp/kv/g_word_utf8.txt.gz", "wb", compresslevel=GZIP_LEVEL) as fout:
    with open("_temp/kv/g_word_utf8.txt", "rb") as fin:
        fout.write(fin.read())
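
The gzip step above reads the whole compiled file into memory before writing it out. For larger files, a streaming copy may be preferable. The following is a minimal sketch using only the standard library; the helper name `gzip_file` and the default `level=9` are illustrative assumptions (the notebook itself uses `GZIP_LEVEL` from `tf.parameters`).

```python
import gzip
import shutil


def gzip_file(src_path, dst_path, level=9):
    """Compress src_path into dst_path in chunks.

    `level` is a hypothetical default; pass GZIP_LEVEL to match the notebook.
    """
    with open(src_path, "rb") as fin, gzip.open(
        dst_path, "wb", compresslevel=level
    ) as fout:
        # copyfileobj streams in fixed-size chunks instead of one big read()
        shutil.copyfileobj(fin, fout)
```

With this helper, the last three lines of the cell would become `gzip_file("_temp/kv/g_word_utf8.txt", "_temp/kv/g_word_utf8.txt.gz", level=GZIP_LEVEL)`.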

In [18]:
D = Dictionary("_temp/kv/g_word_utf8.txt")

# Performance of accessing keys

In [19]:
def testPerformanceTf(data):
    times = 1000000
    key = 200000
    return timeit("data[key]", globals=locals(), number=times)

testPerformanceTf(F.data)

0.06972066900016216

In [20]:
def testPerformanceKv(data):
    times = 1000000
    key = "200000"
    return timeit("data[key].GetValueAsString()", globals=locals(), number=times)
testPerformanceKv(D)

0.6346312719997513
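
The two timings above can be compared directly. This small sketch only does the arithmetic on the numbers printed by the two cells; the helper name `per_lookup_ns` is introduced here for illustration.

```python
# Timings copied from the two cells above: seconds for 1,000,000 lookups.
TIMES = 1_000_000
tf_seconds = 0.06972066900016216  # plain dict lookup, data[key]
kv_seconds = 0.6346312719997513   # keyvi lookup plus GetValueAsString()


def per_lookup_ns(total_seconds, n=TIMES):
    """Average cost of a single lookup in nanoseconds."""
    return total_seconds / n * 1e9


slowdown = kv_seconds / tf_seconds
print(f"dict:  {per_lookup_ns(tf_seconds):.0f} ns per lookup")
print(f"keyvi: {per_lookup_ns(kv_seconds):.0f} ns per lookup")
print(f"slowdown: {slowdown:.1f}x")
```

Note that the keyvi timing includes the `GetValueAsString()` call, so it measures lookup plus value decoding, not the lookup alone.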

# Conclusion

Keyvi is roughly **10 times as slow** as a plain Python dictionary lookup: 0.63 s versus 0.07 s for one million lookups, a factor of about 9.

That makes it uninteresting for TF.

# Measuring RAM

Nevertheless, we measure the RAM needed.

We cannot directly measure the size of D, because it is an object produced by custom C code.

We do it in a roundabout way: each data set is loaded in a small separate script that is run under `memory_profiler`.
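
The same roundabout idea can also be sketched with only the standard library, by reading the peak resident set size of the process before and after loading. This is an illustrative alternative, not the method used in this notebook; it assumes a Unix system (the `resource` module is not available on Windows), and the names `peak_rss_mib` and `measure_increase` are hypothetical.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in KiB on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure_increase(load):
    """Call load() and return (result, growth of peak RSS in MiB)."""
    before = peak_rss_mib()
    result = load()  # returned so that the loaded object stays alive
    return result, peak_rss_mib() - before


# Stand-in loader for illustration; in this notebook the loader would be
# the TF read or the keyvi Dictionary(...) call.
data, grew = measure_increase(lambda: {i: str(i) for i in range(200_000)})
print(f"peak RSS grew by {grew:.1f} MiB")
```

Because this reads the peak value, it can only grow; it is suited to measuring a single load in a fresh process, which is also why each data set gets its own script.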

In [21]:
%%sh

python3 -m memory_profiler tfmem.py

<class 'tf.core.data.Data'>
Filename: tfmem.py

Line #    Mem usage    Increment   Line Contents
     5   40.434 MiB   40.434 MiB   @profile
     6                             def mem():
     7   40.434 MiB    0.000 MiB       path = '_temp/g_word_utf8.tf'
     8   40.434 MiB    0.000 MiB       tmObj = Timestamp()
     9   40.438 MiB    0.004 MiB       F = Data(path, tmObj)
    10   96.426 MiB   55.988 MiB       F._readDataBin()
    11   96.426 MiB    0.000 MiB       print(type(F))




After reading the TF data, the memory usage has increased by about 56 MiB.

In [22]:
%%sh

python3 -m memory_profiler kvmem.py

<class '_core.Dictionary'>
Filename: kvmem.py

Line #    Mem usage    Increment   Line Contents
     4   40.727 MiB   40.727 MiB   @profile
     5                             def mem():
     6   40.809 MiB    0.082 MiB       D = Dictionary("_temp/kv/g_word_utf8.txt")
     7   40.809 MiB    0.000 MiB       print(type(D))




After reading the KV data, the memory usage has increased by only about **84 KiB** (0.082 MiB).
This is wonderful.
Alas, the roughly 10-fold performance drop is a huge price. We cannot afford that.

# Disk size

TF: 5.4 MB in the plain text TF file, and 2.4 MB in the gzipped binary derived file.

KV: 4.6 MB in the compiled file, 2.1 MB in the gzipped file.

A bit strange: why can we not store the contents of the RAM (less than 0.1 MB) on disk, instead of the big 4.6 MB file?
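
One possible answer, offered as an assumption rather than something the measurements above establish: keyvi is described as using shared memory, which suggests that the compiled file is memory-mapped instead of being read into the process. In that case the small RAM increase would reflect bookkeeping only, and the data itself would stay in the file until it is accessed. The sketch below illustrates memory-mapped access in general with Python's `mmap` module; it does not use keyvi, and the helper `read_slice_mmap` is hypothetical.

```python
import mmap
import os
import tempfile


def read_slice_mmap(path, start, length):
    """Read `length` bytes at offset `start` through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing copies only the requested bytes out of the mapping;
            # the rest of the file is not read explicitly.
            return mm[start:start + length]


# Demo: a 4 MiB filler file with a short marker at the end.
FILLER = 4 * 1024 * 1024
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * FILLER + b"needle")
    demo_path = tmp.name

tail = read_slice_mmap(demo_path, FILLER, 6)
os.remove(demo_path)
print(tail)
```

If keyvi works this way, the 4.6 MB file is the data, and there is no smaller in-RAM representation to write to disk.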