<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Document-methods-benchmark" data-toc-modified-id="Document-methods-benchmark-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Document methods benchmark</a></span><ul class="toc-item"><li><span><a href="#Document-pop" data-toc-modified-id="Document-pop-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Document <code>pop</code></a></span></li><li><span><a href="#Document-clear" data-toc-modified-id="Document-clear-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Document <code>clear</code></a></span></li><li><span><a href="#Documentarray" data-toc-modified-id="Documentarray-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Documentarray</a></span><ul class="toc-item"><li><span><a href="#Get-embedding-from-Document" data-toc-modified-id="Get-embedding-from-Document-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Get embedding from Document</a></span></li></ul></li><li><span><a href="#Docarray-memmap" data-toc-modified-id="Docarray-memmap-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Docarray memmap</a></span><ul class="toc-item"><li><span><a href="#creating-an-array-with-a-particular-dtype" data-toc-modified-id="creating-an-array-with-a-particular-dtype-1.4.1"><span class="toc-item-num">1.4.1&nbsp;&nbsp;</span>creating an array with a particular dtype</a></span></li><li><span><a href="#Acessing-dtype-of-a-doc" data-toc-modified-id="Acessing-dtype-of-a-doc-1.4.2"><span class="toc-item-num">1.4.2&nbsp;&nbsp;</span>Acessing dtype of a doc</a></span></li><li><span><a href="#Getting-an-embedding" data-toc-modified-id="Getting-an-embedding-1.4.3"><span class="toc-item-num">1.4.3&nbsp;&nbsp;</span>Getting an embedding</a></span></li></ul></li></ul></li></ul></div>

# Document methods benchmark

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import jina
from jina import Document, DocumentArray
import numpy as np
import os

## Document `pop`

In [3]:
d = Document(text='doc0')

In [12]:
help(d.pop)

Help on method pop in module jina.types.document:

pop(*fields) -> None method of jina.types.document.Document instance
    Remove the values from the given fields of this Document.
    
    :param fields: field names



In [5]:
d.pop('text')

In [6]:
d.text

''

## Document `clear`

In [20]:
d = Document(text='doc0', embedding=np.array([1,2,3]))
d.text

'doc0'

In [23]:
d.clear()

In [24]:
d.text

''

## Documentarray

In [4]:
da = DocumentArray([Document(text='doc0'),
                    Document(text='doc1'),
                    Document(text='doc2')]) 
da[0].text

'doc0'

In [17]:
da = DocumentArray([Document(text='doc0'),
                    Document(text='doc1'),
                    Document(text='doc2')]) 

da = da.shuffle(seed=1234)
da[0].text,da[1].text, da[2].text

('doc1', 'doc0', 'doc2')

#### Accessing `Document._pb_body`

In [266]:
%timeit d._pb_body

47.7 ns ± 0.464 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


#### Fastest way to get type of embedding from proto

How can we know if a Document has a sparse or a dense embedding?

In [254]:
d = Document(text='doc0', embedding = np.array([2.3,4.5,4.5]))

In [249]:
type(d._pb_body.embedding.sparse)

jina_pb2.SparseNdArrayProto

In [255]:
type(d._pb_body.embedding.dense)

jina_pb2.DenseNdArrayProto

We can use the protobuf `HasField` method, which reports whether a given field of the message is set:

In [262]:
type(d._pb_body)

jina_pb2.DocumentProto

In [240]:
%timeit d._pb_body.embedding.HasField('dense')

302 ns ± 5.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
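As a rough illustration of the two protobuf calls used in this notebook, the sketch below runs `HasField` and `WhichOneof` on `google.protobuf.struct_pb2.Value`. That message is only a stand-in chosen because it ships with protobuf and has a oneof (named `kind`); it is not Jina's `NdArrayProto`, and the helper `which_kind` is a hypothetical name.

```python
from google.protobuf import struct_pb2


def which_kind(value):
    """Return the name of the populated scalar member of the `kind` oneof, or None."""
    for name in ("number_value", "string_value", "bool_value"):
        if value.HasField(name):
            return name
    return None


number = struct_pb2.Value(number_value=1.5)
text = struct_pb2.Value(string_value="doc0")
empty = struct_pb2.Value()

# HasField answers a yes/no question about one field;
# WhichOneof returns the name of whichever member is set (or None).
for value in (number, text, empty):
    print(which_kind(value), value.WhichOneof("kind"))
```

When only two alternatives exist, as with `dense` and `sparse`, a single `HasField` call is enough to decide between them.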


A faster way to access the dtype of a dense embedding is to go through `d._pb_body` rather than `d.proto`:

In [252]:
%timeit d._pb_body.embedding.dense.dtype

317 ns ± 4.29 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [253]:
%timeit d.proto.embedding.dense.dtype

397 ns ± 5.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


### Get embedding from Document

Note that embeddings store the dtype of the arrays:

In [191]:
d = Document(text='doc0', embedding = np.array([2.3,4.5,4.5]))
d.embedding.dtype

dtype('float64')

In [195]:
d = Document(text='doc0', embedding = np.array([2.3,4.5,4.5],dtype=np.float32))
d.embedding.dtype

dtype('float32')

This information is stored in `d._pb_body.embedding.dense.dtype`:


- `'f8'` is for float64
- `'f4'` is for float32
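These codes are NumPy type strings (kind `f` for float, followed by the size in bytes), so they can be handed straight back to `np.dtype`. The sketch below uses only NumPy, no Jina objects, to show the mapping; a leading `<` or `>` in the string, when present, marks the byte order.

```python
import numpy as np

# 'f8' and 'f4' name the same types as np.float64 and np.float32.
assert np.dtype('f8') == np.float64
assert np.dtype('f4') == np.float32

# Without an explicit dtype, Python floats become float64.
default = np.array([2.3, 4.5, 4.5])
single = np.array([2.3, 4.5, 4.5], dtype='f4')

# dtype.str is the type string including byte order; itemsize is bytes per element.
print(default.dtype.str, default.dtype.itemsize)
print(single.dtype.str, single.dtype.itemsize)
```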


In [201]:
d._pb_body

id: "78dc1722-0bc3-11ec-82eb-787b8ab3f5de"
mime_type: "text/plain"
text: "doc0"
embedding {
  dense {
    buffer: "33\023@\000\000\220@\000\000\220@"
    shape: 3
    dtype: "<f4"
  }
}
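
The three fields printed above (`buffer`, `shape`, `dtype`) are enough to rebuild the array. The following is a minimal sketch using `np.frombuffer`, with the values copied from the dump; it is not necessarily how Jina itself decodes the message.

```python
import numpy as np

# Values copied from the dense message above (octal escapes rewritten as hex).
buffer = b"33\x13@\x00\x00\x90@\x00\x00\x90@"
shape = [3]
dtype = "<f4"

# Interpret the raw bytes with the stored dtype, then restore the shape.
restored = np.frombuffer(buffer, dtype=dtype).reshape(shape)
print(restored, restored.dtype)

# Going the other way gives back exactly the same bytes.
assert restored.tobytes() == buffer
```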

In [297]:
d = Document(embedding = np.random.rand((128)))

In [298]:
#this PR
%timeit aux = d.embedding

2.8 µs ± 50.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [300]:
#master
%timeit aux = d.embedding

6.2 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


#### Getting embedding

Note that every access to `d._pb_body.embedding` has a cost, so it is better to store the sub-message in a variable and reuse it:

In [295]:
%%timeit
aux1 = d._pb_body.embedding.dense.buffer
aux2 = d._pb_body.embedding.dense.dtype
aux3 = d._pb_body.embedding.dense.shape

514 ns ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [296]:
%%timeit
dense_proto = d._pb_body.embedding.dense
aux1 = dense_proto.buffer
aux2 = dense_proto.dtype
aux3 = dense_proto.shape

304 ns ± 2.81 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
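
The same pattern can be tried without Jina. The sketch below uses plain Python stand-in classes (`Body`, `Embedding`, `Dense` are hypothetical names, not the protobuf types) and compares walking the attribute chain three times against walking it once. The absolute numbers will differ from the protobuf timings above.

```python
import timeit


class Dense:
    """Hypothetical stand-in for the dense sub-message."""

    def __init__(self):
        self.buffer = b"\x00" * 24
        self.dtype = "<f8"
        self.shape = [3]


class Embedding:
    def __init__(self):
        self.dense = Dense()


class Body:
    def __init__(self):
        self.embedding = Embedding()


def repeated_lookup(body):
    # Walks body -> embedding -> dense three times.
    return (body.embedding.dense.buffer,
            body.embedding.dense.dtype,
            body.embedding.dense.shape)


def cached_lookup(body):
    # Walks the chain once and reuses the result.
    dense = body.embedding.dense
    return dense.buffer, dense.dtype, dense.shape


body = Body()
runs = 100_000
for fn in (repeated_lookup, cached_lookup):
    seconds = timeit.timeit(lambda: fn(body), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e9:.0f} ns per call")
```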


##### Timing in get_attributes embedding

In [276]:
d = Document(embedding = np.random.rand((128)))
print(d._pb_body.embedding.HasField('dense'))
print(d._pb_body.embedding.HasField('sparse'))

True
False


In [237]:
da = DocumentArray([d]*1000)

In [238]:
#This PR
%timeit da.get_attributes('embedding')

14.9 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
d._pb_body.embedding.HasField('dense')

We can also have sparse embeddings, in which case `HasField('sparse')` is the call that returns `True`:

In [273]:
import scipy.sparse as sp
d = Document(embedding = sp.csr_matrix(np.random.rand((128))))

In [275]:
print(d._pb_body.embedding.HasField('dense'))
print(d._pb_body.embedding.HasField('sparse'))

False
True


## Docarray memmap

In [302]:
from jina.types.arrays.memmap import DocumentArrayMemmap


In [334]:
dam = DocumentArrayMemmap('./')

In [339]:
dam.extend([Document(text='lala')])

In [316]:
dam.save()

In [331]:
da = DocumentArray([Document(text='lala')])

In [332]:
da.save('./docarray')

In [341]:
da.save_binary('./docarraybin')

In [346]:
dam.save()

### creating an array with a particular dtype

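Accessing `d._pb_body.embedding.dense.dtype` is unreasonably slow: passing it as the `dtype` argument takes about 1.7 µs, against about 1.2 µs for the literal `'f4'`.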

In [162]:
%timeit np.array([2.3, 4.5, 4.5])

1.09 µs ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [161]:
%timeit np.array([2.3, 4.5, 4.5], \
                 dtype=d._pb_body.embedding.dense.dtype)

1.7 µs ± 9.94 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [127]:
%timeit np.array([2.3, 4.5, 4.5], dtype='f4')

1.2 µs ± 6.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [168]:
%timeit d._pb_body.embedding

206 ns ± 1.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


### Acessing dtype of a doc

In [169]:
d = Document(text='doc0', embedding = np.array([2.3,4.5,4.5]))

In [170]:
%timeit d._pb_body.embedding.dense

592 ns ± 7.16 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [171]:
%timeit d._pb_body.embedding.dense.dtype

686 ns ± 3.92 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [176]:
print(type(d._pb_body))
print(type(d._pb_body.embedding.dense))

<class 'jina_pb2.DocumentProto'>
<class 'jina_pb2.DenseNdArrayProto'>


In [180]:
d = Document(embedding = np.array([2.3,4.5,4.5]))

In [181]:
%timeit d._pb_body

44.2 ns ± 0.288 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [178]:
%timeit d._pb_body.embedding.dense

589 ns ± 6.02 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [179]:
%timeit d._pb_body.embedding.dense.dtype

678 ns ± 9.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


Getting the dtype of an embedding (about 680 ns) takes around half the time needed to sum 500 floats (about 1.5 µs):

In [151]:
x = np.ones(500)
%timeit x.sum()

1.51 µs ± 8.03 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


### Getting an embedding

In [109]:
%timeit d.embedding

2.34 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [83]:
%timeit d._pb_body.embedding.WhichOneof('content')

370 ns ± 14.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [84]:
d._pb_body.embedding.WhichOneof

<function NdArrayProto.WhichOneof>

In [98]:
d._pb_body.embedding.DESCRIPTOR

<google.protobuf.pyext._message.MessageDescriptor at 0x7fdcd87d92b0>

In [104]:
%timeit d._pb_body.embedding.HasField('sparse')

171 ns ± 1.4 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [105]:
%timeit d._pb_body.embedding.HasField('dense')

173 ns ± 1.56 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [64]:
%timeit d.embedding

5.4 µs ± 30.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [8]:
%timeit d.get_attributes('embedding')

6.03 µs ± 86.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [23]:
# Check whether the sparse sub-message is empty (nothing stored in it)
if d._pb_body.embedding.sparse.ByteSize() == 0:
    print('hi')

In [27]:
%timeit d._pb_body.embedding.sparse

92.3 ns ± 0.438 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [33]:
d._pb_body.embedding.dense.buffer

b'ffffff\x02@\x00\x00\x00\x00\x00\x00\x12@\x00\x00\x00\x00\x00\x00\x12@'

In [39]:
%timeit d._pb_body.embedding.sparse.ByteSize()

150 ns ± 0.923 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [51]:
%timeit d._pb_body.embedding.WhichOneof('content')

202 ns ± 2.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [58]:
d._pb_body.embedding.dense.shape

[3]

#### Appending to list vs numpy array

In [17]:
l = [1,2,3,4]
lnp = np.array([1,2,3,4])

In [21]:
%timeit np.append(lnp,lnp)

2.87 µs ± 30.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [22]:
%timeit np.append(lnp, l)

4.37 µs ± 30 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
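
Two things are worth keeping in mind when reading these timings: `np.append` always returns a new array rather than growing its input, and a list argument has to be converted to an array first. The sketch below, using the same `l` and `lnp`, checks both and shows the usual alternative of collecting items in a list and converting once.

```python
import numpy as np

l = [1, 2, 3, 4]
lnp = np.array([1, 2, 3, 4])

# np.append returns a new array; the input keeps its original length.
appended = np.append(lnp, l)

# Appending a list gives the same values as appending an array;
# the list is simply converted on the way in.
same = np.append(lnp, np.asarray(l))

# When items arrive one by one, collecting them in a list and converting
# once avoids copying the whole array on every append.
collected = []
for value in l:
    collected.append(value)
converted = np.array(collected)

print(appended, lnp.shape, converted)
```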


In [1]:
#self.quantize = os.environ.get('JINA_ARRAY_QUANT', quantize)

In [26]:
import jina
from jina import Document
import numpy as np
import os

In [6]:
d = Document(embedding=np.array([1,2,3]))

In [24]:
%timeit aux = d.embedding

6.21 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [28]:
%timeit aux = os.environ.get('JINA_ARRAY_QUANT', None) 

715 ns ± 5.69 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
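
At roughly 0.7 µs, an `os.environ.get` call is a noticeable share of a constructor that takes one or two microseconds in total. One possible way to avoid paying for it on every instantiation is to read the variable once at import time. The sketch below is only an illustration of that idea: the class names and `_DEFAULT_QUANT` are hypothetical, and it makes no claim about how Jina handles `JINA_ARRAY_QUANT`.

```python
import os


class PerInstanceLookup:
    def __init__(self, quantize=None):
        # Reads the environment on every construction.
        self.quantize = os.environ.get('JINA_ARRAY_QUANT', quantize)


# Read once, when the module is imported.
_DEFAULT_QUANT = os.environ.get('JINA_ARRAY_QUANT')


class CachedLookup:
    def __init__(self, quantize=None):
        # Same precedence as above: the environment value wins if it was set.
        self.quantize = _DEFAULT_QUANT if _DEFAULT_QUANT is not None else quantize


print(PerInstanceLookup('fp16').quantize, CachedLookup('fp16').quantize)
```

The trade-off is that the cached version does not notice changes made to the environment after import.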


In [35]:
%timeit aux = jina.types.ndarray.generic.NdArray()

1.24 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [47]:
import jina.types.ndarray.dense.numpy

In [49]:
%timeit aux = jina.types.ndarray.dense.numpy.DenseNdArray(quantize='fp16')

1.63 µs ± 58.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [52]:
# before
%timeit aux = jina.types.ndarray.dense.numpy.DenseNdArray()

2.81 µs ± 177 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [48]:
# after
%timeit aux = jina.types.ndarray.dense.numpy.DenseNdArray()

1.48 µs ± 11.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [83]:
d = Document(embedding = np.array([1,2,3]))

In [54]:
%timeit aux = d.embedding

4.29 µs ± 80.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [62]:
# before
%timeit d.embedding = x

6.28 µs ± 82.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [60]:
# after 
%timeit d.embedding = x

5.31 µs ± 74 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [63]:
from jina.types.ndarray.generic import NdArray
from scipy.sparse import coo_matrix

In [66]:
row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
a = coo_matrix((data, (row, col)), shape=(4, 4))
dense_a = a.toarray()



In [67]:
a

<4x4 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in COOrdinate format>

In [82]:
aux = NdArray(a, is_sparse=True)

AttributeError: _ndarray

In [75]:
aux._ndarray

<jina.types.ndarray.dense.numpy.DenseNdArray at 140511073762272>

In [81]:
aux._ndarray.value = np.array([1,23])

In [79]:
aux.value

array([ 1, 23])

In [41]:
class A:
    def __init__(self):
        pass

In [43]:
%timeit A()

109 ns ± 0.808 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [36]:
print(type(d._pb_body))
d._pb_body

<class 'jina_pb2.DocumentProto'>


id: "387c926c-01bc-11ec-958c-787b8ab3f5de"
embedding {
  dense {
    buffer: "\001\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\003\000\000\000\000\000\000\000"
    shape: 3
    dtype: "<i8"
  }
}

In [19]:
type(d._pb_body.embedding)

jina_pb2.NdArrayProto

In [30]:
d._pb_body

id: "387c926c-01bc-11ec-958c-787b8ab3f5de"
embedding {
  dense {
    buffer: "\001\000\000\000\000\000\000\000\002\000\000\000\000\000\000\000\003\000\000\000\000\000\000\000"
    shape: 3
    dtype: "<i8"
  }
}