## For traversal algo


Could test storing node TOIDs as memoryviews instead of strings, as they might read faster. Or as NumPy arrays with the smallest int/float dtype possible, then converted to memoryview. Or as bytes/bytearrays, which give finer control over how many bytes each ID uses. 3 bytes might be enough for TOIDs (256**3 = 16777216 distinct values).

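Memoryview might be less suitable if you want to delete nodes, since a memoryview cannot be resized (and a bytearray cannot be resized while a memoryview of it exists). But a dict of memoryviews might be OK...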
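
A minimal sketch of the fixed-width bytes idea, assuming each TOID can be mapped to a non-negative integer below 256**3 (an assumption, the real TOID format is not covered in these notes). The names `ID_WIDTH`, `pack_ids` and `get_id` are hypothetical.

```python
ID_WIDTH = 3  # bytes per ID; assumes every ID is an int below 256**3


def pack_ids(ids):
    """Pack non-negative ints into one bytes object, ID_WIDTH bytes each."""
    return b''.join(i.to_bytes(ID_WIDTH, 'big') for i in ids)


def get_id(view, index):
    """Read the ID at position `index` from a memoryview of the packed buffer."""
    start = index * ID_WIDTH
    return int.from_bytes(view[start:start + ID_WIDTH], 'big')


packed = pack_ids([0, 1, 70000, 256**3 - 1])
view = memoryview(packed)
print(len(packed))      # 4 IDs * 3 bytes each
print(get_id(view, 2))  # third ID, read via a no-copy slice
```

Whether this beats plain strings for the traversal would still need timing on real data.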






#### Other notes

Encoding = converting code points to byte sequences; decoding = converting byte sequences back to code points.

Code point for 'A' = U+0041
UTF-8 code for 'A' (single byte) = \x41
UTF-16LE code for 'A' (2 bytes) = \x41\x00

Says binary sequences are really sequences of integers

The bytes for tab, newline, carriage return, and backslash are displayed as the escape sequences \t, \n, \r, and \\
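
The points above can be checked directly. This is a small sketch using only built-in `str` and `bytes` methods; the variable names are arbitrary.

```python
s = 'A'
print(hex(ord(s)))           # code point of 'A' as hex
print(s.encode('utf_8'))     # one byte
print(s.encode('utf_16le'))  # two bytes, low byte first

# Indexing or iterating over bytes yields integers in range(256)
b = 'A\t\n\r\\'.encode('utf_8')
print(list(b))
print(b)                     # the repr shows the escape sequences
```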




In [80]:
256**3

16777216

In [78]:
s = 'cafe'
b = s.encode('utf8')
len(b)

4

In [19]:
# encode text as bytes
a = bytes('café', encoding='utf_8')
print(len(a))
print(type(a))
print(a[0])
print(a[1])
print(a[:4])

5
<class 'bytes'>
99
97
b'caf\xc3'


In [20]:
# converting to bytearray and looking at bytes
ba = bytearray(a)
for ac in ba:
    print(ac)

99
97
102
195
169


In [21]:
# bytes objects can be iterated just like bytearrays; each item is an int
for ac in a:
    print(ac)

99
97
102
195
169


In [24]:
# decode hex encoding
k = bytes.fromhex('31 4B CE A9') 
print(k)
print(k.decode())

b'1K\xce\xa9'
1KΩ


In [30]:
import array
numbers = array.array('h', [-2, -1, 0, 1, 2, 3]) # h = int16, or two bytes per number
dd = bytes(numbers)
print(type(dd))
print(len(dd))    # 2 bytes per number, so 12 bytes in total
dd

<class 'bytes'>
12


b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00\x03\x00'

In [42]:
import numpy as np
ar = np.asarray([1, 1, 2, 3, 4]).astype('int64')
np_bytes = bytes(ar)
print(np_bytes)
print(len(np_bytes))  # int64 uses 8 bytes per integer

print(len(bytes(np.asarray([1, 1, 2, 3, 4]).astype('int32'))))  # int32 uses 4 bytes per integer

b'\x01\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00'
40
20


#### struct module helps
> parse packed bytes into a tuple of fields of different types, and vice versa

struct is used with bytes, bytearray, and memoryview objects
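
A short sketch of struct working on each of these buffer types. The record layout (a 3-byte tag plus an unsigned 16-bit count) is a hypothetical example, not something from the traversal code.

```python
import struct

# Hypothetical record: 3-byte tag + unsigned 16-bit count, little-endian, no padding
fmt = '<3sH'
record = struct.pack(fmt, b'abc', 513)
print(record)
print(struct.calcsize(fmt))  # bytes per record

# unpack_from reads from any buffer (bytes, bytearray, memoryview) at an offset
buf = memoryview(bytearray(record * 2))
print(struct.unpack_from(fmt, buf, offset=5))

# pack_into writes into a writable buffer in place
struct.pack_into(fmt, buf, 0, b'xyz', 7)
print(bytes(buf[:5]))
```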







In [52]:
import struct
import time


In [51]:
# memoryviews can be sliced without copying the underlying data, unlike bytes, bytearrays or strings.
# Each memoryview slice is O(1), so the loop below is linear overall; each bytes/bytearray/str slice
# copies up to i items, so the same loop is quadratic overall
# ref: https://stackoverflow.com/questions/18655648/what-exactly-is-the-point-of-memoryview-in-python

data_byte = b'x'*100000
data_memview = memoryview(data_byte)
data_string = 'x' * 100000
data_bytearray = bytearray(data_byte)

start = time.time()
for i in range(len(data_memview)):
    data_memview[:i]
print(time.time() - start)

start = time.time()
for i in range(len(data_byte)):
    data_byte[:i]
print(time.time() - start)

start = time.time()
for i in range(len(data_bytearray)):
    data_bytearray[:i]
print(time.time() - start)

start = time.time()
for i in range(len(data_string)):
    data_string[:i]
print(time.time() - start)


0.018759965896606445
0.27395105361938477
0.27632999420166016
0.30631279945373535
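
The no-copy behaviour behind these timings can also be seen directly. A minimal sketch with arbitrary variable names: a memoryview slice writes through to the original buffer, while a bytearray slice is an independent copy.

```python
data = bytearray(b'abcdef')
view = memoryview(data)

part = view[2:5]      # no copy: refers to the same underlying buffer
part[0] = ord('X')    # writes through to data
print(data)

copied = data[2:5]    # slicing a bytearray makes a new object
copied[0] = ord('Y')  # does not affect data
print(data)
```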


In [65]:
# unpack raw bytes into fields; the format string gives the byte order, size and type of each field
fmt = '<3s3sHH'  # '<' = little-endian; two sequences of 3 bytes (3s3s) and two unsigned 16-bit integers (HH): 10 bytes in total
struct.unpack(fmt, b'annwimnrmv')

(b'ann', b'wim', 29294, 30317)

In [76]:
# element-by-element access through a memoryview is faster than through the numpy array itself (in a pure Python loop)
a = np.random.rand(10000, 1000)
b = memoryview(a)

start = time.time()
for i in range(10000):
    for j in range(1000):
        a[i, j]
print(time.time() - start)

start = time.time()
for i in range(10000):
    for j in range(1000):
        b[i, j]
print(time.time() - start)



2.841446876525879
1.6500070095062256


In [95]:
a = np.random.rand(10)
b = memoryview(a)
np.asarray(b[:7])

array([0.004781  , 0.25555808, 0.7781253 , 0.73146312, 0.5996862 ,
       0.51406519, 0.55937536])

In [99]:
# decoding with the wrong codec gives wrong text, not an error: \xe9 is 'é' in latin1, but koi8_r (a Russian encoding) maps it to 'И'
octets = b'Montr\xe9al'
octets.decode('koi8_r')

# How do you find the encoding of a byte sequence? Short answer: you can’t. You must be told.

'MontrИal'

In [109]:
# UTF-8 is the default source-code encoding for Python 3, but open() without an encoding argument uses a
# platform-dependent default (see locale.getpreferredencoding()): the output below shows this Mac defaulting to UTF-8

with open('example.txt', 'w') as f:
    f.write('example')
       
fp2 = open('example.txt')
print(fp2)

fp2 = open('example.txt', encoding='utf_8') # can force encoding choice
print(fp2)

fp2 = open('example.txt', 'rb') # The 'rb' flag opens a file for reading in binary mode, so read() returns bytes. Not needed for ordinary text files.
f = fp2.read()
print(fp2)
print(type(f))
print(f)

<_io.TextIOWrapper name='example.txt' mode='r' encoding='UTF-8'>
<_io.TextIOWrapper name='example.txt' mode='r' encoding='utf_8'>
<_io.BufferedReader name='example.txt'>
<class 'bytes'>
b'example'


Little-endian machine = least significant byte stored first (as in the array.array output above, where -2 is stored as \xfe\xff)

Big-endian machine = most significant byte stored first
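
A quick way to see both byte orders for the same integer, as a sketch using `int.to_bytes` and `sys.byteorder`:

```python
import sys

n = 0x0102  # 258
little = n.to_bytes(2, 'little')  # least significant byte (0x02) first
big = n.to_bytes(2, 'big')        # most significant byte (0x01) first
print(little)
print(big)
print(sys.byteorder)              # native order of this machine: 'little' or 'big'
```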



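The re and os modules treat str and bytes arguments differently: a regex built from a bytes pattern only matches ASCII characters with \d and \w, and os functions given a bytes filename return bytes results.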
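
A small sketch of the re difference, using the same 'café' text as earlier in these notes. The variable names are arbitrary.

```python
import re

text = 'caf\u00e9 42'
raw = text.encode('utf_8')

# str pattern: \w matches Unicode word characters, including é
str_words = re.findall(r'\w+', text)
# bytes pattern: \w matches ASCII word characters only, so the two bytes of é end the word
bytes_words = re.findall(rb'\w+', raw)

print(str_words)
print(bytes_words)
```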