# Numpy arrays

We experiment with numpy arrays as a way to store large amounts of data.

We load the BHSA in the normal way, and then we write code to represent the levUp data,
which is a list of lists of numbers.

Can we represent this as a numpy array? If so, how much memory do we save,
and is there a penalty in terms of lookup speed?
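
As a minimal sketch of the idea, assuming nothing beyond standard numpy: a list of lists with varying lengths can be stored as an object array whose elements are `uint32` arrays. The toy data and the names `ragged` and `packed` are illustrative only and do not come from the BHSA.

```python
import numpy

# Toy stand-in for a "list of lists of numbers" (hypothetical data).
ragged = [[5, 7], [], [3], [8, 9, 10]]

# An object array whose elements are uint32 arrays of varying length.
# Filling a preallocated array keeps numpy from trying to build a 2-D array.
packed = numpy.empty(len(ragged), dtype=object)
for i, row in enumerate(ragged):
    packed[i] = numpy.array(row, dtype=numpy.uint32)

print(packed[3])         # the last row, now a uint32 array
print(packed[3].nbytes)  # three items of 4 bytes each
```

Each inner array stores its numbers in 4 bytes apiece, while the outer array only holds references to the inner arrays.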

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tf.app import use
import functools
from timeit import timeit

import numpy
from pack import deepSize

In [3]:
def testPerformance(data):
    """Time 10 million lookups of one fixed element; return total seconds."""
    testMember = 100000
    times = 10000000
    xTime = timeit("data[testMember]", globals=locals(), number=times)
    return xTime

In [4]:
A = use("ETCBC/bhsa:clone", checkout="clone", hoist=globals(), silent="verbose")

**Locating corpus resources ...**

This is Text-Fabric 11.4.6
122 features found and 0 ignored
   |     0.80s T otype                from ~/github/ETCBC/bhsa/tf/2021
   |       11s T oslots               from ~/github/ETCBC/bhsa/tf/2021
    12s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.00s T book@zh              from ~/github/ETCBC/bhsa/tf/2021
   |     0.00s T book@pa              from ~/github/ETCBC/bhsa/tf/2021
   |     0.00s T book@ja              from ~/github/ETCBC/bhsa/tf/2021
   |     0.99s T g_cons_utf8          from ~/github/ETCBC/bhsa/tf/2021
   |     0.00s T book@el              from ~/github/ETCBC/bhsa/tf/2021
   |     1.05s T g_word               from ~/github/ETCBC/bhsa/tf/2021
   |     0.00s T book@ur              from ~/github/ETCBC/bhsa/tf/2021
   |     0.00s T book@la              from ~/github/ETCBC/bhsa/tf/2021
   |     0.00s T book@es              from ~/github/ETCBC/bhsa/tf/2021
   |     0.00s T book@id              from ~/github/ETCBC/bhsa/tf/2021


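| Name | # of nodes | # slots/node | % coverage |
| --- | ---: | ---: | ---: |
| book | 39 | 10938.21 | 100 |
| chapter | 929 | 459.19 | 100 |
| lex | 9230 | 46.22 | 100 |
| verse | 23213 | 18.38 | 100 |
| half_verse | 45179 | 9.44 | 100 |
| sentence | 63717 | 6.7 | 100 |
| sentence_atom | 64514 | 6.61 | 100 |
| clause | 88131 | 4.84 | 100 |
| clause_atom | 90704 | 4.7 | 100 |
| phrase | 253203 | 1.68 | 100 |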


In [22]:
x = numpy.array([1, 2, 3], dtype=numpy.uint32)
type(x[0])

numpy.uint32

In [23]:
type(x[0]) is numpy.uint32

True
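
The two cells above show that indexing a `uint32` array yields a `numpy.uint32` scalar rather than a built-in `int`. The following sketch, using only standard numpy calls, shows ways to get built-in integers back when downstream code expects them.

```python
import numpy

x = numpy.array([1, 2, 3], dtype=numpy.uint32)

elem = x[0]            # a numpy.uint32 scalar
as_int = int(elem)     # explicit conversion to a built-in int
as_item = x[0].item()  # the same conversion via the scalar's own method
as_list = x.tolist()   # converts the whole array to built-in ints

print(type(elem).__name__, type(as_int).__name__, type(as_list[0]).__name__)
```

Each element access has to create such a scalar object, which is one plausible source of overhead compared with indexing a plain list.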

In [9]:
results = A.search("""
clause
  phrase function=Pred
    word sp=verb
""")

  0.47s 57070 results


In [6]:
results[0:3]

[(427559, 651574, 3), (427560, 651579, 15), (427563, 651589, 33)]

In [7]:
T.text(results[0][0])

'בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃ '

In [8]:
A.show(results, end=3)

Now we compute an alternative representation for the levUp data:
a numpy object array that holds, for each node, a `uint32` array of the nodes that embed it.

In [12]:
info = A.info
error = A.error
otype = A.TF.features["otype"].data
oslots = A.TF.features["oslots"].data
rank = C.rank.data

In [45]:
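def levUp(info, error, otype, oslots, rank):
    """Compute, for every node, its embedders as a uint32 array.

    Returns an object array with one entry per node; entry n - 1 holds
    the nodes that embed node n, sorted by decreasing rank value.
    """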
    (otype, maxSlot, maxNode, slotType) = otype
    oslots = oslots[0]
    info("making inverse of edge feature oslots")
    oslotsInv = {}
    for (k, mList) in enumerate(oslots):
        for m in mList:
            oslotsInv.setdefault(m, set()).add(k + 1 + maxSlot)
    info("listing embedders of all nodes")
    embedders = []
    for n in range(1, maxSlot + 1):
        contentEmbedders = oslotsInv.get(n, tuple())
        embedders.append(
            numpy.array(
                sorted(
                    (m for m in contentEmbedders if m != n),
                    key=lambda k: -rank[k - 1],
                ),
                dtype="uint32",
            )
        )
    seen = {}
    for n in range(maxSlot + 1, maxNode + 1):
        mList = tuple(oslots[n - maxSlot - 1])
        if mList in seen:
            theseEmbedders = seen[mList]
        else:
            if len(mList) == 0:
                theseEmbedders = numpy.array([], dtype="uint32")
            else:
                contentEmbedders = functools.reduce(
                    lambda x, y: x & oslotsInv[y],
                    mList[1:],
                    oslotsInv[mList[0]],
                )
                theseEmbedders = numpy.array(
                    sorted(
                        (m for m in contentEmbedders if m != n),
                        key=lambda k: -rank[k - 1],
                    ),
                    dtype="uint32",
                )
            seen[mList] = theseEmbedders
        embedders.append(theseEmbedders)
    return numpy.array(embedders, dtype=object)

In [46]:
levUpN = levUp(info, error, otype, oslots, rank)

36m 31s making inverse of edge feature oslots
36m 32s listing embedders of all nodes
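
The core of `levUp` is an inverted index from slots to the nodes that contain them, followed by a set intersection over all slots of a node. The following toy sketch illustrates that step on a hypothetical four-slot corpus. It sorts embedders by node number instead of by rank, and the helper name `embedders` is made up for the illustration.

```python
import functools

# Hypothetical toy corpus: slots are 1..4, non-slot nodes start at 5.
maxSlot = 4
# oslots[i] lists the slots of node maxSlot + 1 + i.
oslots = [
    (1, 2, 3, 4),  # node 5: spans all slots
    (1, 2),        # node 6
    (3, 4),        # node 7
]

# Invert: for each slot, the set of nodes that contain it.
oslotsInv = {}
for k, slots in enumerate(oslots):
    for s in slots:
        oslotsInv.setdefault(s, set()).add(k + 1 + maxSlot)


def embedders(n):
    """Nodes that contain every slot of node n, excluding n itself."""
    slots = oslots[n - maxSlot - 1]
    common = functools.reduce(
        lambda acc, s: acc & oslotsInv[s],
        slots[1:],
        oslotsInv[slots[0]],
    )
    return sorted(m for m in common if m != n)


print(embedders(6))  # node 6 is embedded only by node 5
print(embedders(5))  # node 5 has no embedders
```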


In [47]:
deepSize(C.levUp.data)

310102620

In [48]:
deepSize(levUpN)

11574760
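
The two figures above differ by a factor of roughly 27 (310102620 versus 11574760 bytes). Whether the second figure includes the memory of the nested `uint32` arrays depends on how `deepSize` treats numpy arrays, which is not shown here. As a cross-check, the sketch below counts the pointer buffer of an object array plus the data buffers of the distinct arrays it refers to. The helper name `objectArrayBytes` is hypothetical, and the count ignores the per-object overhead of each inner array.

```python
import numpy


def objectArrayBytes(arr):
    """Bytes in the pointer buffer of `arr` plus the data buffers of its
    distinct element arrays (shared elements are counted once)."""
    total = arr.nbytes
    seen = set()
    for item in arr:
        if id(item) in seen:
            continue
        seen.add(id(item))
        total += item.nbytes
    return total


shared = numpy.array([7, 8], dtype=numpy.uint32)
toy = numpy.empty(3, dtype=object)
toy[0] = shared
toy[1] = shared  # same object stored twice, like the `seen` cache in levUp
toy[2] = numpy.array([9], dtype=numpy.uint32)

print(toy.nbytes, objectArrayBytes(toy))
```

Running such a helper on `levUpN` would give a lower bound on its real footprint that can be compared with the `deepSize` result.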

In [42]:
testPerformance(C.levUp.data)

0.15462375000061002

In [41]:
testPerformance(levUpN)

0.30584074999933364
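
In the measurements above, 10,000,000 lookups took about 0.155 s on the original data and about 0.306 s on the numpy version, so a single lookup is roughly twice as slow. The sketch below repeats the comparison on hypothetical toy data, for readers without the corpus at hand. The names `asList`, `asArray` and `lookupTime` are illustrative, and absolute timings will vary by machine.

```python
from timeit import timeit

import numpy

rows = [[i, i + 1] for i in range(1000)]  # hypothetical toy data

asList = rows
asArray = numpy.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    asArray[i] = numpy.array(row, dtype=numpy.uint32)


def lookupTime(data, member=500, times=100_000):
    """Total seconds for `times` evaluations of data[member]."""
    return timeit(
        "data[member]",
        globals={"data": data, "member": member},
        number=times,
    )


print(f"list : {lookupTime(asList):.4f}s")
print(f"numpy: {lookupTime(asArray):.4f}s")
```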