<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Missing-slicing-DocumentArray-with-list-of-indices" data-toc-modified-id="Missing-slicing-DocumentArray-with-list-of-indices-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Missing slicing DocumentArray with list of indices</a></span></li></ul></div>

In [1]:
from jina import Document, DocumentArray
import re
import numpy as np

In [2]:
d1 = Document(tags={'city': 'Barcelona', 'phone':'None'})
d2 = Document(tags={'city': 'Berlin','phone':'648907348'})
d3 = Document(tags={'city': 'Paris', 'phone': 'None'})

D = DocumentArray([d1,d2,d3])

In [3]:
dict(D[0].tags)

{'phone': 'None', 'city': 'Barcelona'}

In [4]:
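from typing import Dict

def fuzzy_filter(docs, regexes: Dict[str, str], traversal_paths):
    """Return the documents for which at least one tag matches its regex."""
    filtered = DocumentArray()
    # Compile each pattern once, outside the document loop
    patterns = {tag_name: re.compile(expr) for tag_name, expr in regexes.items()}

    # Traverse once, so the result is correct even if traverse_flat returns a one-shot iterator
    for doc in docs.traverse_flat(traversal_paths):
        for tag_name, pattern in patterns.items():
            tag_value = doc.tags.get(tag_name, None)
            # A missing or non-string tag does not count as a match
            if isinstance(tag_value, str) and pattern.match(tag_value):
                filtered.append(doc)
                break  # append each document at most once
    return filtered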

In [5]:
regexes = {'city':r'B.*'}
Dfiltered = fuzzy_filter(D, regexes, ['r'])

In [6]:
[dict(d) for d in Dfiltered.get_attributes('tags')]

[{'phone': 'None', 'city': 'Barcelona'},
 {'phone': '648907348', 'city': 'Berlin'}]

There are a couple of considerations:

- We do not want to compile a regex for every document, because it takes time.
- We want the caller to be able to pass an operator that decides, from the regex matches, which documents are kept.

In [7]:
%%timeit
re.match(r'B.*', 'La Barcelona')

520 ns ± 7.97 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [8]:
exp = re.compile(r'B.*')

In [13]:
%%timeit
exp.match('La Barcelona')

191 ns ± 1.09 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
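
The two timings above can be reproduced outside IPython with the standard `timeit` module. The following is a minimal sketch; the function names `match_uncompiled` and `match_compiled` and the repeat count `n` are illustrative choices, and the absolute numbers will differ from machine to machine. Note that `match` only matches at the start of the string, so on `'La Barcelona'` both calls return `None`; the comparison measures the call overhead of a failed match.

```python
import re
import timeit

TEXT = 'La Barcelona'
compiled = re.compile(r'B.*')

def match_uncompiled():
    # Goes through the module-level helper, which looks the pattern up on every call
    return re.match(r'B.*', TEXT)

def match_compiled():
    # Reuses the pattern object compiled once above
    return compiled.match(TEXT)

n = 100_000  # number of calls per measurement (arbitrary)
t_uncompiled = timeit.timeit(match_uncompiled, number=n)
t_compiled = timeit.timeit(match_compiled, number=n)

print(f'module-level re.match: {t_uncompiled / n * 1e9:.0f} ns per call')
print(f'precompiled pattern:   {t_compiled / n * 1e9:.0f} ns per call')
```

If the intent is to find `B.*` anywhere in the tag value rather than only at its start, `search` would be the method to use instead of `match`.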


We want to allow the user to specify an operator that selects documents by how many regexes they match.

For example:

- return a document if all regexes match
- return a document if any regex matches
- return a document if more than X regexes match
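
The selection rules above can be read as predicates on the number of regexes a document matched. The sketch below shows one possible way to express them with a lookup table from the standard `operator` module; the names `COMPARISONS` and `is_selected` are hypothetical and not part of Jina.

```python
import operator

# Comparison operators applied to the number of regexes a document matched
COMPARISONS = {
    '<': operator.lt,
    '>': operator.gt,
    '==': operator.eq,
    '!=': operator.ne,
    '<=': operator.le,
    '>=': operator.ge,
}

def is_selected(match_count, n_regexes, op, value=1):
    """Decide whether a document with `match_count` matching regexes is kept."""
    if op == 'any':
        return match_count >= 1
    if op == 'all':
        return match_count == n_regexes
    if op not in COMPARISONS:
        raise ValueError(f'unsupported operator: {op!r}')
    return COMPARISONS[op](match_count, value)

# Match counts for three documents checked against two regexes
counts = [2, 1, 1]
print([is_selected(c, 2, 'all') for c in counts])   # [True, False, False]
print([is_selected(c, 2, 'any') for c in counts])   # [True, True, True]
print([is_selected(c, 2, '>', 1) for c in counts])  # [True, False, False]
```

A table keeps the set of supported operators in one place and makes an unsupported operator fail with an explicit error.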

In [38]:
from typing import Dict

import numpy as np

def find(docs, regexes: Dict[str, str], traversal_paths, operator, value=1):
    """Return the documents whose number of matching regexes satisfies `operator`."""
    filtered = DocumentArray()
    # Materialize the traversal so it can be counted and indexed by position
    iterdocs = list(docs.traverse_flat(traversal_paths))
    matched_counts = np.zeros(len(iterdocs), dtype=np.int32)

    # Compile each regex once, without mutating the caller's dict
    patterns = {tag_name: re.compile(expr) for tag_name, expr in regexes.items()}

    for pos, doc in enumerate(iterdocs):
        for tag_name, pattern in patterns.items():
            tag_value = doc.tags.get(tag_name, None)
            # A missing or non-string tag does not count as a match
            if isinstance(tag_value, str) and pattern.match(tag_value):
                matched_counts[pos] += 1

    if operator == '<':
        coordinate_flags = matched_counts < value
    elif operator == '>':
        coordinate_flags = matched_counts > value
    elif operator == '==':
        coordinate_flags = matched_counts == value
    elif operator == '!=':
        coordinate_flags = matched_counts != value
    elif operator == '<=':
        coordinate_flags = matched_counts <= value
    elif operator == '>=':
        coordinate_flags = matched_counts >= value
    elif operator == 'any':
        coordinate_flags = matched_counts >= 1
    elif operator == 'all':
        coordinate_flags = matched_counts == len(patterns)
    else:
        raise ValueError(f'unsupported operator: {operator!r}')

    indices = np.where(coordinate_flags)[0].tolist()
    for pos in indices:
        filtered.append(iterdocs[pos])

    return filtered

In [39]:
regexes = {'city':r'B.*', 'phone':'None'}
Dfiltered = find(D, regexes, ['r'], 'all')
Dfiltered

<jina.types.arrays.document.DocumentArray length=1 at 140403026028624>

In [40]:
regexes = {'city':r'B.*', 'phone':'None'}
Dfiltered = find(D, regexes, ['r'], 'any')
Dfiltered

<jina.types.arrays.document.DocumentArray length=3 at 140403026028384>

In [41]:
d1 = Document(tags={ 'phone':'None'})
d2 = Document(tags={'city': 'Berlin','phone':'648907348'})
d3 = Document(tags={'city': 'Paris', 'phone': 'None'})

D2 = DocumentArray([d1,d2,d3])

In [45]:
# If a Document lacks the tag (d1 has no 'city'), it simply does not count as a match
regexes = {'city':r'B.*'}
Dfiltered = find(D2, regexes, ['r'], 'any')
Dfiltered

<jina.types.arrays.document.DocumentArray length=1 at 140403026027712>

In [43]:
d2.tags.get('city', None)

'Berlin'

In [48]:
print(set(regexes.keys()))

{'city'}


### Missing slicing DocumentArray with list of indices

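Indexing a `DocumentArray` with a list of indices is currently not implemented, although NumPy arrays support it, as the cells below show.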

In [145]:
X = np.random.random((10,3))

In [135]:
X[1:3]

array([[0.89165356, 0.72906645, 0.67695036],
       [0.23279987, 0.82382816, 0.31403296]])

In [142]:
X[[0,2]]

array([[0.1805567 , 0.51602885, 0.26808973],
       [0.23279987, 0.82382816, 0.31403296]])

In [144]:
# Note that an index may be repeated, which copies the same row more than once
X[[0,0,2]]

array([[0.1805567 , 0.51602885, 0.26808973],
       [0.1805567 , 0.51602885, 0.26808973],
       [0.23279987, 0.82382816, 0.31403296]])

In [140]:
D[1:3]

<jina.types.arrays.document.DocumentArray length=2 at 140186673583200>

In [141]:
D[[0,2]]

IndexError: do not support this index type builtins.list: [0, 2]
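
Until list indexing is supported, the selection can be done by hand with integer indexing, which the slicing cells above suggest is available. The helper below is only a sketch: `select_by_indices` is a hypothetical name, and it is demonstrated on a plain list so that it runs without Jina installed.

```python
from typing import Callable, Iterable, Sequence

def select_by_indices(items: Sequence, indices: Iterable[int], factory: Callable = list):
    """Collect items[i] for every i in indices, in order; repeated indices are allowed."""
    return factory([items[i] for i in indices])

# Stand-in for a DocumentArray: any sequence that supports integer indexing
cities = ['Barcelona', 'Berlin', 'Paris']
print(select_by_indices(cities, [0, 2]))     # ['Barcelona', 'Paris']
print(select_by_indices(cities, [0, 0, 2]))  # ['Barcelona', 'Barcelona', 'Paris']

# With Jina, the call would presumably look like
#   select_by_indices(D, [0, 2], factory=DocumentArray)
# assuming DocumentArray can be built from a list of Documents, as in cell In [2].
```

This mirrors the NumPy behaviour shown above, including repeated indices, at the cost of a Python-level loop.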