## Problem statement

Given a set of symbols, we want to encode a list of symbols (a word) as a unique integer. We also want to be able to decode that integer back into the word that generated it.

For example, given ['cat','dog'], we want to assign an integer to each, such as

```
'cat' -> 11
'dog' -> 22
```

Given 11 we want to fetch 'cat', and given 22 we want to fetch 'dog', but without storing a dictionary of all possible words and their integers.


In [47]:
import math
from math import log

In [48]:
symbol_to_num = {0:0, 'a':1, 'b':2, 'c':3, 'd': 4}

data = ['a','b','c','d', 'aa','ab','ac','ad','ba','bb',
        'bc','bd','ca','cb','cc','cd','da','db','dc','dd','aaa', 'abc', 'ddd']

In [49]:
40**12 < 2**64-1 + 2**63-1 + 2**62-1  + 2**61-1 + 2**60-1

True

In [53]:
def map_str_to_num(word, symbol_to_num):
    res = 0
    for i,c in enumerate(word[::-1]):
        res += (symbol_to_num[c]) * len(symbol_to_num)**i
    return res

In [51]:
math.ceil(math.log(125,3))

5

In [52]:
map_str_to_num

<function __main__.map_str_to_num(word, symbol_to_num)>

In [29]:
for d in data:
    print(d,map_str_to_num(d, symbol_to_num))

a 1
b 2
c 3
d 4
aa 6
ab 7
ac 8
ad 9
ba 11
bb 12
bc 13
bd 14
ca 16
cb 17
cc 18
cd 19
da 21
db 22
dc 23
dd 24
aaa 31
abc 38
ddd 124
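
The values above are the word read as a number in base 5 (the size of `symbol_to_num`), with the leftmost symbol as the most significant digit. A minimal sketch that spells out the terms for one word; `expand` is a helper name introduced here for illustration and is not part of the notebook:

```python
symbol_to_num = {0: 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4}
base = len(symbol_to_num)  # 5

def expand(word):
    """Return the (digit, place value) pairs whose products sum to the code of `word`."""
    k = len(word)
    return [(symbol_to_num[c], base ** (k - 1 - i)) for i, c in enumerate(word)]

terms = expand('abc')
code = sum(digit * place for digit, place in terms)
print(terms)  # [(1, 25), (2, 5), (3, 1)]
print(code)   # 38 = 1*25 + 2*5 + 3*1
```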


Now we want to map numbers back to strings.

In [30]:
def num_to_symbol(n, symbol_to_num):
    for k,v in symbol_to_num.items():
        if v==n:
            return k

In [31]:
for i in range(1,5):
    print( f"feat = {i} ---> {num_to_symbol(i, symbol_to_num)}")

feat = 1 ---> a
feat = 2 ---> b
feat = 3 ---> c
feat = 4 ---> d
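
`num_to_symbol` scans the whole dictionary on every call. If that becomes a bottleneck, one option is to invert the mapping once and look digits up directly. This is only a sketch: `num_to_symbol_table` and `num_to_symbol_fast` are names made up here, and it assumes every symbol has a distinct number.

```python
symbol_to_num = {0: 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4}

# Invert the mapping once (assumes the values in symbol_to_num are unique).
num_to_symbol_table = {v: k for k, v in symbol_to_num.items()}

def num_to_symbol_fast(n):
    """Hypothetical constant-time variant of num_to_symbol."""
    return num_to_symbol_table[n]

for i in range(1, 5):
    print(i, num_to_symbol_fast(i))
```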


In [32]:
def map_num_to_coeff(num, symbol_to_num):
    b = len(symbol_to_num)  # base = number of symbols, including the reserved 0
    assert num >= 0, "num must be non-negative"
    if num == 0:
        # special case: returns a (symbol, digits) tuple rather than a plain list
        return num_to_symbol(num, symbol_to_num), [num]
    digits = []
    while num:
        digits.append(int(num % b))  # least significant digit first
        num //= b
    return digits[::-1]  # most significant digit first

In [33]:
map_num_to_coeff(0, symbol_to_num)

(0, [0])

In [34]:
for d in data:
    d_num = map_str_to_num(d, symbol_to_num)
    d_coeff = map_num_to_coeff(d_num, symbol_to_num)
    
    print(d,d_num, d_coeff)

a 1 [1]
b 2 [2]
c 3 [3]
d 4 [4]
aa 6 [1, 1]
ab 7 [1, 2]
ac 8 [1, 3]
ad 9 [1, 4]
ba 11 [2, 1]
bb 12 [2, 2]
bc 13 [2, 3]
bd 14 [2, 4]
ca 16 [3, 1]
cb 17 [3, 2]
cc 18 [3, 3]
cd 19 [3, 4]
da 21 [4, 1]
db 22 [4, 2]
dc 23 [4, 3]
dd 24 [4, 4]
aaa 31 [1, 1, 1]
abc 38 [1, 2, 3]
ddd 124 [4, 4, 4]
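
Note that the coefficient lists above never contain 0: the value 0 is reserved in `symbol_to_num` and real symbols start at 1. The sketch below illustrates why that seems to matter for uniqueness. It assumes a compact `encode` helper (written here with Horner's rule, equivalent to `map_str_to_num`) and a hypothetical alternative mapping `without_zero` in which 'a' is assigned 0.

```python
def encode(word, symbol_to_num):
    """Base-b positional encoding, most significant symbol first."""
    base = len(symbol_to_num)
    res = 0
    for c in word:
        res = res * base + symbol_to_num[c]
    return res

with_zero = {0: 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4}   # mapping used in this notebook
without_zero = {'a': 0, 'b': 1, 'c': 2, 'd': 3}      # hypothetical: 'a' acts as a leading zero

print(encode('a', with_zero), encode('aa', with_zero))        # 1 6
print(encode('a', without_zero), encode('aa', without_zero))  # 0 0  (collision)
```

With the reserved 0, words of different lengths cannot collide because no word starts with a zero digit.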


In [35]:
def coeffs_to_str(coeffs, symbol_to_num):
    return "".join([num_to_symbol(x, symbol_to_num) for x in coeffs])

In [36]:
coeffs_to_str([4,4,4],symbol_to_num)

'ddd'

In [37]:
for d in data:
    d_num = map_str_to_num(d, symbol_to_num)
    d_coeff = map_num_to_coeff(d_num, symbol_to_num)
    d_str = coeffs_to_str(d_coeff, symbol_to_num)
    
    print(f'input word to encode = {d}, hash value assigned = {d_num}, coefficients = {d_coeff}, decoded_string_from_number={d_str}')

input word to encode = a, hash value assigned = 1, coefficients = [1], decoded_string_from_number=a
input word to encode = b, hash value assigned = 2, coefficients = [2], decoded_string_from_number=b
input word to encode = c, hash value assigned = 3, coefficients = [3], decoded_string_from_number=c
input word to encode = d, hash value assigned = 4, coefficients = [4], decoded_string_from_number=d
input word to encode = aa, hash value assigned = 6, coefficients = [1, 1], decoded_string_from_number=aa
input word to encode = ab, hash value assigned = 7, coefficients = [1, 2], decoded_string_from_number=ab
input word to encode = ac, hash value assigned = 8, coefficients = [1, 3], decoded_string_from_number=ac
input word to encode = ad, hash value assigned = 9, coefficients = [1, 4], decoded_string_from_number=ad
input word to encode = ba, hash value assigned = 11, coefficients = [2, 1], decoded_string_from_number=ba
input word to encode = bb, hash value assigned = 12, coefficients

## Using a bigger dictionary

In [38]:
symbols = 'abcdefghijklmnopqrstuxywzñç-'

In [39]:
symbol_to_num = {0:0, **{c:i+1 for i,c in enumerate(symbols)}}
symbol_to_num

{0: 0,
 'a': 1,
 'b': 2,
 'c': 3,
 'd': 4,
 'e': 5,
 'f': 6,
 'g': 7,
 'h': 8,
 'i': 9,
 'j': 10,
 'k': 11,
 'l': 12,
 'm': 13,
 'n': 14,
 'o': 15,
 'p': 16,
 'q': 17,
 'r': 18,
 's': 19,
 't': 20,
 'u': 21,
 'x': 22,
 'y': 23,
 'w': 24,
 'z': 25,
 'ñ': 26,
 'ç': 27,
 '-': 28}

In [40]:
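# note: 'veterinarians' contains 'v', which is missing from `symbols`, so encoding it would raise a KeyError
data = ['professionals', 'lightweight', 'veterinarians', 'castañera']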

In [41]:
num = map_str_to_num(data[0],symbol_to_num)
num

5887046198922477792

In [42]:
d_coeff = map_num_to_coeff(num, symbol_to_num)
d_str = coeffs_to_str(d_coeff, symbol_to_num)
d_str

'professionals'

In [43]:
map_str_to_num(data[-1], symbol_to_num)

1529702327394

In [44]:
map_str_to_num('-------------', symbol_to_num)

10260628712958602188

In [46]:
10260628712958602188 < 18446744073709551615  # largest 13-symbol code vs. 2**64 - 1

True
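
The check above can be generalised: every word of length at most k encodes to a value below 2**64 as long as base**k <= 2**64. A small sketch that finds the largest such k with exact integer arithmetic; `max_word_length` is a name introduced here for illustration:

```python
def max_word_length(base, bits=64):
    """Largest k such that every word of length <= k encodes below 2**bits."""
    limit = 2 ** bits
    k = 0
    while base ** (k + 1) <= limit:
        k += 1
    return k

print(max_word_length(29))  # 13, for the 29-entry symbol_to_num above
print(max_word_length(5))   # 27, for the small a-d example
```

So with this alphabet, words longer than 13 symbols would no longer fit in an unsigned 64-bit integer, although Python's own integers would still hold them.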

### Ngram Creation

Now let's assume we want to handle n-grams and be able to hash them without collisions.
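
One possible approach, sketched here under stated assumptions rather than as the notebook's method: join the words of the n-gram with a separator symbol and encode the joined string as a single word. The sketch assumes '-' is reserved as the separator and never appears inside a word. `encode`, `decode`, `encode_ngram` and `decode_ngram` are names introduced for this example.

```python
symbols = 'abcdefghijklmnopqrstuxywzñç-'  # same alphabet as in the cells above
symbol_to_num = {0: 0, **{c: i + 1 for i, c in enumerate(symbols)}}
num_to_sym = {v: k for k, v in symbol_to_num.items()}
BASE = len(symbol_to_num)
SEP = '-'  # assumption: reserved as separator, never used inside a word

def encode(text):
    """Base-BASE positional encoding of a string."""
    res = 0
    for c in text:
        res = res * BASE + symbol_to_num[c]
    return res

def decode(num):
    """Inverse of encode for strings built from `symbols`."""
    chars = []
    while num:
        num, digit = divmod(num, BASE)
        chars.append(num_to_sym[digit])
    return ''.join(reversed(chars))

def encode_ngram(words):
    return encode(SEP.join(words))

def decode_ngram(num):
    return decode(num).split(SEP)

code = encode_ngram(['the', 'cat'])
print(decode_ngram(code))  # ['the', 'cat']
```

Because the joined string is still an ordinary word over the same alphabet, distinct n-grams get distinct integers. The trade-off is length: the joined string, separators included, must stay within 13 symbols to fit in 64 bits, otherwise the code only fits in a Python arbitrary-precision integer.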

### Check memory usage 

In [1470]:
from pympler import asizeof
from sys import getsizeof

In [1471]:
x = 'p'
asizeof.asizeof(x), getsizeof(x)

(56, 50)

In [1477]:
x = 1529702327394
asizeof.asizeof(x), getsizeof(x)

(32, 32)

In [1479]:
x = {'the': 0, 'cat':1, 'is':2, 'big':3}
asizeof.asizeof(x), getsizeof(x)

(576, 232)

In [1482]:
x = {map_str_to_num(w, symbol_to_num) for w in ['the','cat','is','big'] }
asizeof.asizeof(x), getsizeof(x)

(344, 216)
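
Each Python `int` above still carries object overhead (32 bytes for the example code). If the codes fit in 64 bits, they could be packed into a typed array instead. This is a sketch using the standard-library `array` module; the two values are the codes computed earlier for 'castañera' and 'professionals'.

```python
from array import array
from sys import getsizeof

codes = [1529702327394, 5887046198922477792]  # codes taken from the cells above
packed = array('Q', codes)  # 'Q' = unsigned long long, 8 bytes on common platforms

print(packed.itemsize)                        # bytes used per stored code
print(getsizeof(packed))                      # whole array, including its header
print(sum(getsizeof(c) for c in codes))       # the same codes as separate int objects
```

The array header is a fixed cost, so the saving only shows up once many codes are stored.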
