Merged
86 changes: 46 additions & 40 deletions datajoint/fetch.py
@@ -1,8 +1,13 @@
from collections import OrderedDict
from functools import wraps
import itertools
import re
from .blob import unpack
import numpy as np
from datajoint import DataJointError
from . import key as PRIMARY_KEY
from collections import abc


def prepare_attributes(relation, item):
if isinstance(item, str) or item is PRIMARY_KEY:
@@ -20,46 +25,57 @@ def prepare_attributes(relation, item):
raise DataJointError("Index must be a slice, a tuple, a list, a string.")
return item, attributes

class FetchQuery:
def copy_first(f):
Member:
What's the reason for copy_first?

Contributor:
We thought it would make more sense for each operation on the fetch object to return a new fetch object rather than modifying the object in place. Since a fetch object is itself very cheap, this shouldn't be real overhead. copy_first is a decorator that makes the decorated method operate on, and modify, a copy of the object, supporting this idea.

Member:
why not use copy.copy?

Contributor (author):
copy.copy does not copy the behavior dict. deepcopy would copy _relation as well, which seemed like overkill. That's why I introduced a copy constructor and used a decorator to tell the method that I'd like self copied. Making the elements of behavior plain object attributes wouldn't solve the problem either, since some of the values of behavior are lists.

Member:
But why do we need to copy the object in the first place?

Member:
OK, I have looked into other modules and, yes, things like these usually copy the object. I am merging this PR.
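The copy-semantics question above can be checked in plain Python. The sketch below uses a simplified stand-in class (hypothetical, no relation attached) to show why copy.copy is not enough: it duplicates the object but shares the behavior dict between the two instances, whereas rebuilding the dict one level deep keeps the settings independent.

```python
import copy

class FetchSketch:
    """Simplified stand-in for the PR's Fetch class (hypothetical)."""
    def __init__(self):
        self.behavior = dict(offset=0, limit=None, order_by=None, as_dict=False)

f = FetchSketch()

shallow = copy.copy(f)
# copy.copy duplicates the object but not its attributes:
# both objects now share the very same behavior dict
assert shallow.behavior is f.behavior

# the copy constructor in the PR instead rebuilds the dict one level deep
manual = FetchSketch()
manual.behavior = dict(f.behavior)
assert manual.behavior is not f.behavior
f.behavior['limit'] = 10
assert manual.behavior['limit'] is None  # settings no longer leak across
```

Note that dict(f.behavior) is still shallow with respect to list values, which is consistent with the author's remark that some behavior values are lists; the order_by implementation sidesteps this by rebuilding the list rather than mutating it in place.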

@wraps(f)
def ret(*args, **kwargs):
args = list(args)
args[0] = args[0].__class__(args[0]) # call copy constructor
return f(*args, **kwargs)

def __init__(self, relation):
"""
return ret

"""
self.behavior = dict(
offset=0, limit=None, order_by=None, descending=False, as_dict=False, map=None
)
self._relation = relation
class Fetch:
def __init__(self, relation):
if isinstance(relation, Fetch): # copy constructor
self.behavior = dict(relation.behavior)
self._relation = relation._relation
else:
self.behavior = dict(
offset=0, limit=None, order_by=None, as_dict=False
)
self._relation = relation


@copy_first
def from_to(self, fro, to):
Member:
Is the pair of methods (from_to, limit_to) preferred to the pair (limit, offset) that is already familiar to SQL programmers?

relation.fetch.offset(300).limit(100)()
relation.fetch.from_to(300,400)()

Contributor (author):
We could introduce another method called offset_by

Contributor:
There is no standalone OFFSET clause in SQL: it always has to occur with a LIMIT clause. I suppose we can define offset_by(N) to translate into LIMIT X OFFSET N, where X is some large number by default, to ensure all rows from the offset to the end are included in the query.

Contributor (author):
That's why I put in from_to: it sets both. In the current version you cannot set offset without setting limit as well.

self.behavior['offset'] = fro
Member:
so from_to(1,2) will yield 1 tuple? That's not the usual understanding of what 100 to 100 means. I think we should stick with offset and limit in line with SQL's usage.

Contributor (author):
That's in line with Python's indexing convention, and I think it makes a lot of sense. If you want to split an interval (a, b) at c, you can say from_to(a, c) and from_to(c, b) instead of from_to(a, c) and from_to(c+1, b).

Member:
Python index convention does not use the language 'from' and 'to', which implies including the first and last values. I really think we should revert to SQL's offset and limit. We don't need to reinvent the wheel. These are not used too often.

Contributor (author):
I am fine with offset_by and limit_to.

Member:
offset and limit -- cleaner

Contributor (author):
Either order, limit, and offset or order_by, limit_to, offset_by. Since we are not MySQL there is no need to adopt inconsistent naming conventions. I like the second option better because statements read clearer (like populated_from). We could make a poll among existing datajoint-python users which statement they prefer.

Contributor:
FYI, SQLAlchemy also uses the method name order_by, although it uses limit (not limit_to). Following Python convention, I like it when a chain of method calls reads like a sentence. I really do like order_by, as it strongly suggests that arguments must be passed in to make it work; order_by() without arguments doesn't really make sense anyway. I feel less strongly about limit, but I could see the same argument applying to limit_to.

Contributor (author):
What I don't like is that limit and offset sound like properties.
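To make the offset/limit mechanics from this thread concrete: the cursor code in this PR appends the clauses roughly as sketched below, so from_to(fro, to) corresponds to offset=fro, limit=to-fro. The helper name is hypothetical and standalone, not part of the PR.

```python
def limit_clause(offset=0, limit=None):
    # hypothetical helper mirroring how cursor() appends LIMIT/OFFSET;
    # MySQL has no standalone OFFSET clause, it must accompany a LIMIT
    sql = ''
    if limit is not None:
        sql += ' LIMIT %d' % limit
    if offset:
        sql += ' OFFSET %d' % offset
    return sql

# relation.fetch.from_to(300, 400) sets offset=300 and limit=400-300:
assert limit_clause(offset=300, limit=100) == ' LIMIT 100 OFFSET 300'
```

An offset without a limit would produce an invalid MySQL clause, which is exactly why from_to sets both at once.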

self.behavior['limit'] = to - fro
return self

def order_by(self, order_by):
self.behavior['order_by'] = order_by
@copy_first
def order_by(self, *args):
if len(args) > 0:
self.behavior['order_by'] = self.behavior['order_by'] if self.behavior['order_by'] is not None else []
namepat = re.compile(r"\s*(?P<name>\w+).*")
for a in args: # remove duplicates
Member:
why do we need to remove duplicates?

Contributor (author):
Users could do something like this:

dog = relation.fetch.order_by('stimulus ASC', 'user_id DESC',\
                            'trial ASC').limit_to(1000)

data1 = dog.order_by('user_id DESC')()
data2 = dog.order_by('trial ASC')()

dog would be the default, and only one sorting parameter is changed at a time. To have that (in my opinion nice) feature we need (i) copies of the object and (ii) the ability to overwrite existing settings. According to @eywalker this is consistent with how SQLAlchemy treats it (@eywalker, correct me if I am wrong).

Contributor:
Each method call creating a separate fetch object is consistent with how SQLAlchemy handles its Query objects. I just checked the queries SQLAlchemy generates, and it does not remove duplicates; it simply appends all statements to the SQL. As for SQL, it is completely fine executing queries like ORDER BY name DESC, language ASC, name ASC. As it sorts from last to first, this causes the newer order_by('name ASC') call to be effectively ignored.

One thing SQLAlchemy supports that may be nice is calling order_by(None): it returns a new query object with all order_by settings reset. A call to order_by without arguments is simply ignored in SQLAlchemy.

Contributor (author):
If SQL sorts from last to first, this means that the first attribute is the major sort key and the others are subordinate keys. The question is, what do we expect from a call rel.fetch.order_by('attr1').order_by('attr2')? Is attr1 the major sort key or the subordinate one?

  • If attr1 is subordinate, we can just insert new order keys in front of behavior['order_by'] and don't even need to remove duplicates. Removing them would probably be a bit more efficient on the MySQL side, since I don't expect MySQL to check whether keys are duplicated.
  • If attr1 is supposed to be the major sort key, then we need to agree on whether a newly added attribute with the same name should cancel out the old one. I am actually slightly in favour of not removing it and implementing rel.fetch.order_by(None) to reset the current order_by chain.
  • Either way, I think having rel.fetch.order_by(None) is a good idea.

Contributor:
I would think it actually makes sense to make later keys the subordinate ones, in line with how MySQL handles ORDER BY. In that case, a newly added attribute whose name already exists should not cancel out the older one. Rather, I agree that we should just implement order_by(None) to support resetting, so the user can construct the ordering as desired.

Member:
Let's not get carried away and overcomplicate things. Let's do this: order_by takes the full ordering list and overrides any previous order_by settings:

a = relation.fetch.order_by('field1 desc', 'field2')()
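The "sorts from last to first" point debated above can be emulated in plain Python: a multi-key ORDER BY is equivalent to a sequence of stable sorts applied from the subordinate key up to the major key, which is exactly the trick the tests in this PR use.

```python
# rows mimic the Language lookup table: (entry_id, name, language)
rows = [(0, 'Fabian', 'English'),
        (1, 'Edgar', 'English'),
        (4, 'Fabian', 'German')]

# emulate ORDER BY name ASC, language DESC with stable sorts,
# applying the subordinate key first and the major key last
rows.sort(key=lambda r: r[2], reverse=True)   # language DESC (subordinate)
rows.sort(key=lambda r: r[1])                 # name ASC (major)
assert [r[0] for r in rows] == [1, 4, 0]
```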

name = namepat.match(a).group('name')
pat = re.compile(r"%s(\s*$|\s+(\S*\s*)*$)" % (name,))
self.behavior['order_by'] = [e for e in self.behavior['order_by'] if not pat.match(e)]
self.behavior['order_by'].extend(args)
return self

@copy_first
def as_dict(self):
self.behavior['as_dict'] = True

def ascending(self):
self.behavior['descending'] = False
return self

def descending(self):
self.behavior['descending'] = True
return self

def apply(self, f):
self.behavior['map'] = f
return self

def limit_by(self, limit):
@copy_first
def limit_to(self, limit):
Member:
How about limit and order instead of order_by and limit_to? It seems just as clear but shorter.

Contributor (author):
If you use order_by and limit_to a fetch statement almost reads like a sentence: rel.fetch.order_by('stimulus').limit_to(10). I like that. It's clear.

Member:
I do not think that making code read like a sentence is an advantage. Expressiveness and brevity are better than sounding like English sentences.

Contributor (author):
It's clear that we have different opinions on this. Let's have a vote and stick with what the majority prefers. I suggest that the two options are 1. limit, order, offset and 2. limit_to, order_by, offset_by. I am happy to setup a doodle and send it around.

self.behavior['limit'] = limit
Member:
what's the reason for grouping the settings in behavior rather than making them properties of Fetch?

Contributor:
I believe this is to make copying of Fetch object easier

Contributor (author):
And it's easier to insert them into the cursor as kwargs. It distinguishes the fetch behavior from other properties of the object, like _relation.
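A quick sketch of why the dict is convenient here: dict(self.behavior, **kwargs) merges per-call overrides over the stored defaults without mutating them, and the merged dict unpacks straight into a cursor-style signature. fake_cursor below is a hypothetical stand-in for RelationalOperand.cursor.

```python
behavior = dict(offset=0, limit=None, order_by=None, as_dict=False)

def fake_cursor(offset=0, limit=None, order_by=None, as_dict=False):
    # hypothetical stand-in for RelationalOperand.cursor; echoes its kwargs
    return dict(offset=offset, limit=limit, order_by=order_by, as_dict=as_dict)

merged = dict(behavior, limit=10)        # call-time override, as in __call__
assert fake_cursor(**merged)['limit'] == 10
assert behavior['limit'] is None         # the stored defaults are untouched
```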

return self

@copy_first
def set_behavior(self, **kwargs):
self.behavior.update(kwargs)
return self
@@ -78,9 +94,7 @@ def __call__(self, **kwargs):
"""
behavior = dict(self.behavior, **kwargs)

cur = self._relation.cursor(offset=behavior['offset'], limit=behavior['limit'],
order_by=behavior['order_by'], descending=behavior['descending'],
as_dict=behavior['as_dict'])
cur = self._relation.cursor(**behavior)

heading = self._relation.heading
if behavior['as_dict']:
@@ -92,22 +106,15 @@ def __call__(self, **kwargs):
for blob_name in heading.blobs:
ret[blob_name] = list(map(unpack, ret[blob_name]))

if behavior['map'] is not None:
f = behavior['map']
for i in range(len(ret)):
ret[i] = f(ret[i])

return ret

def __iter__(self):
"""
Iterator that returns the contents of the database.
"""
behavior = self.behavior
behavior = dict(self.behavior)

cur = self._relation.cursor(offset=behavior['offset'], limit=behavior['limit'],
order_by=behavior['order_by'], descending=behavior['descending'],
as_dict=behavior['as_dict'])
cur = self._relation.cursor(**behavior)

heading = self._relation.heading
do_unpack = tuple(h in heading.blobs for h in heading.names)
@@ -126,10 +133,10 @@ def keys(self, **kwargs):
"""
Iterator that returns primary keys.
"""
b = dict(self.behavior, **kwargs)
if 'as_dict' not in kwargs:
kwargs['as_dict'] = True
yield from self._relation.project().fetch.set_behavior(**kwargs)

b['as_dict'] = True
yield from self._relation.project().fetch.set_behavior(**b)

def __getitem__(self, item):
"""
@@ -146,7 +153,7 @@ def __getitem__(self, item):
single_output = isinstance(item, str) or item is PRIMARY_KEY or isinstance(item, int)
item, attributes = prepare_attributes(self._relation, item)

result = self._relation.project(*attributes).fetch()
result = self._relation.project(*attributes).fetch(**self.behavior)
return_values = [
np.ndarray(result.shape,
np.dtype({name: result.dtype.fields[name] for name in self._relation.primary_key}),
@@ -158,8 +165,7 @@ def __getitem__(self, item):
return return_values[0] if single_output else return_values


class Fetch1Query:

class Fetch1:
def __init__(self, relation):
self._relation = relation

@@ -202,4 +208,4 @@ def __getitem__(self, item):
else result[attribute][0]
for attribute in item
)
return return_values[0] if single_output else return_values
return return_values[0] if single_output else return_values
18 changes: 8 additions & 10 deletions datajoint/relational_operand.py
@@ -10,7 +10,7 @@
from . import DataJointError
import logging

from .fetch import FetchQuery, Fetch1Query
from .fetch import Fetch, Fetch1

logger = logging.getLogger(__name__)

@@ -171,7 +171,7 @@ def __call__(self, *args, **kwargs):
"""
return self.fetch(*args, **kwargs)

def cursor(self, offset=0, limit=None, order_by=None, descending=False, as_dict=False):
def cursor(self, offset=0, limit=None, order_by=None, as_dict=False):
"""
Return query cursor.
See Relation.fetch() for input description.
@@ -182,8 +182,7 @@ def cursor(self, offset=0, limit=None, order_by=None, descending=False, as_dict=
sql = self.make_select()
if order_by is not None:
sql += ' ORDER BY ' + ', '.join(order_by)
if descending:
sql += ' DESC'

if limit is not None:
sql += ' LIMIT %d' % limit
if offset:
@@ -206,12 +205,13 @@ def __repr__(self):
repr_string += ' (%d tuples)\n' % len(self)
return repr_string

@property
def fetch1(self):
return Fetch1Query(self)
return Fetch1(self)

@property
def fetch(self):
return FetchQuery(self)
return Fetch(self)

@property
def where_clause(self):
@@ -253,8 +253,6 @@ def make_condition(arg):
return ' WHERE ' + ' AND '.join(condition_string)




class Not:
"""
inverse restriction
@@ -319,9 +317,9 @@ def __init__(self, arg, group=None, *attributes, **renamed_attributes):
self._arg = Subquery(arg)
else:
self._group = None
if arg.heading.computed or\
if arg.heading.computed or \
(isinstance(arg.restrictions, RelationalOperand) and \
all(attr in self._attributes for attr in arg.restrictions.heading.names)) :
all(attr in self._attributes for attr in arg.restrictions.heading.names)):
# can simplify the expression because all restriction attrs are projected out anyway!
self._arg = arg
self._restrictions = self._arg.restrictions
Empty file added doc/source/_static/.dummy
20 changes: 20 additions & 0 deletions tests/schema.py
@@ -39,6 +39,26 @@ class Subject(dj.Manual):
def prepare(self):
self.insert(self.contents, ignore_errors=True)

@schema
class Language(dj.Lookup):

definition = """
# languages spoken by some of the developers

entry_id : int
---
name : varchar(40) # name of the developer
Member:
Sorry, this table is not in 3rd normal form. I know it's just a test example but we might as well show good examples.

Contributor (author):
I can change it later.

language : varchar(40) # language
"""

contents = [
(0, 'Fabian', 'English'),
(1, 'Edgar', 'English'),
(2, 'Dimitri', 'English'),
(3, 'Dimitri', 'Ukrainian'),
(4, 'Fabian', 'German'),
(5, 'Edgar', 'Japanese'),
]
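Following up on the normal-form comment above: one way to bring the test table closer to third normal form would be to factor the developer names into their own lookup table and reference them by key. This is a hypothetical sketch for illustration, not what the PR merges.

```python
# hypothetical refactoring of the test schema (not part of this PR)
@schema
class Developer(dj.Lookup):
    definition = """
    # developers
    dev_id : int
    ---
    name : varchar(40)  # name of the developer
    """
    contents = [(0, 'Fabian'), (1, 'Edgar'), (2, 'Dimitri')]

@schema
class Language(dj.Lookup):
    definition = """
    # languages spoken by some of the developers
    -> Developer
    language : varchar(40)  # language
    """
    contents = [(0, 'English'), (0, 'German'),
                (1, 'English'), (1, 'Japanese'),
                (2, 'English'), (2, 'Ukrainian')]
```

With name stored once per developer, the duplication across rows of the original table disappears, and a renamed developer requires a single update.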

@schema
class Experiment(dj.Imported):
Expand Down
133 changes: 133 additions & 0 deletions tests/test_fetch.py
@@ -0,0 +1,133 @@
from operator import itemgetter, attrgetter
import itertools
from nose.tools import assert_true
from numpy.testing import assert_array_equal, assert_equal
import numpy as np

from . import schema
import datajoint as dj


class TestFetch:
def __init__(self):
self.subject = schema.Subject()
self.lang = schema.Language()

def test_getitem(self):
"""Testing Fetch.__getitem__"""

np.testing.assert_array_equal(sorted(self.subject.project().fetch(), key=itemgetter(0)),
sorted(self.subject.fetch[dj.key], key=itemgetter(0)),
'Primary key is not returned correctly')

tmp = self.subject.fetch(order_by=['subject_id'])

for column, field in zip(self.subject.fetch[:], [e[0] for e in tmp.dtype.descr]):
np.testing.assert_array_equal(sorted(tmp[field]), sorted(column), 'slice : does not work correctly')

subject_notes, key, real_id = self.subject.fetch['subject_notes', dj.key, 'real_id']
#
np.testing.assert_array_equal(sorted(subject_notes), sorted(tmp['subject_notes']))
np.testing.assert_array_equal(sorted(real_id), sorted(tmp['real_id']))
np.testing.assert_array_equal(sorted(key, key=itemgetter(0)),
sorted(self.subject.project().fetch(), key=itemgetter(0)))

for column, field in zip(self.subject.fetch['subject_id'::2], [e[0] for e in tmp.dtype.descr][::2]):
np.testing.assert_array_equal(sorted(tmp[field]), sorted(column), 'slice : does not work correctly')

def test_order_by(self):
"""Tests order_by sorting order"""
langs = schema.Language.contents

for ord_name, ord_lang in itertools.product(*2 * [['ASC', 'DESC']]):
cur = self.lang.fetch.order_by('name ' + ord_name, 'language ' + ord_lang)()
langs.sort(key=itemgetter(2), reverse=ord_lang == 'DESC')
langs.sort(key=itemgetter(1), reverse=ord_name == 'DESC')
for c, l in zip(cur, langs):
assert_true(np.all([cc == ll for cc, ll in zip(c, l)]), 'Sorting order is different')

def test_order_by_default(self):
"""Tests order_by sorting order with defaults"""
langs = schema.Language.contents

cur = self.lang.fetch.order_by('language', 'name DESC')()
langs.sort(key=itemgetter(1), reverse=True)
langs.sort(key=itemgetter(2), reverse=False)

for c, l in zip(cur, langs):
assert_true(np.all([cc == ll for cc, ll in zip(c, l)]), 'Sorting order is different')

def test_order_by_direct(self):
"""Tests order_by sorting order passing it to __call__"""
langs = schema.Language.contents

cur = self.lang.fetch(order_by=['language', 'name DESC'])
langs.sort(key=itemgetter(1), reverse=True)
langs.sort(key=itemgetter(2), reverse=False)
for c, l in zip(cur, langs):
assert_true(np.all([cc == ll for cc, ll in zip(c, l)]), 'Sorting order is different')

def test_limit_to(self):
"""Test the limit_to function """
langs = schema.Language.contents

cur = self.lang.fetch.limit_to(4)(order_by=['language', 'name DESC'])
langs.sort(key=itemgetter(1), reverse=True)
langs.sort(key=itemgetter(2), reverse=False)
assert_equal(len(cur), 4, 'Length is not correct')
for c, l in list(zip(cur, langs))[:4]:
assert_true(np.all([cc == ll for cc, ll in zip(c, l)]), 'Sorting order is different')

def test_from_to(self):
"""Test the from_to function """
langs = schema.Language.contents

cur = self.lang.fetch.from_to(2, 6)(order_by=['language', 'name DESC'])
langs.sort(key=itemgetter(1), reverse=True)
langs.sort(key=itemgetter(2), reverse=False)
assert_equal(len(cur), 4, 'Length is not correct')
for c, l in list(zip(cur, langs[2:6])):
assert_true(np.all([cc == ll for cc, ll in zip(c, l)]), 'Sorting order is different')

def test_iter(self):
"""Test iterator"""
langs = schema.Language.contents

cur = self.lang.fetch.order_by('language', 'name DESC')
langs.sort(key=itemgetter(1), reverse=True)
langs.sort(key=itemgetter(2), reverse=False)
for (_, name, lang), (_, tname, tlang) in list(zip(cur, langs)):
assert_true(name == tname and lang == tlang, 'Values are not the same')

def test_keys(self):
"""test key iterator"""
langs = schema.Language.contents
langs.sort(key=itemgetter(1), reverse=True)
langs.sort(key=itemgetter(2), reverse=False)

cur = self.lang.fetch.order_by('language', 'name DESC')['entry_id']
cur2 = [e['entry_id'] for e in self.lang.fetch.order_by('language', 'name DESC').keys()]

keys, _, _ = list(zip(*langs))
for k, c, c2 in zip(keys, cur, cur2):
assert_true(k == c == c2, 'Values are not the same')

def test_fetch1(self):
key = {'entry_id': 0}
true = schema.Language.contents[0]

dat = (self.lang & key).fetch1()
for k, (ke, c) in zip(true, dat.items()):
assert_true(k == c == (self.lang & key).fetch1[ke], 'Values are not the same')

def test_copy(self):
"""Test whether modifications copy the object"""
f = self.lang.fetch
f2 = f.order_by('name')
assert_true(f.behavior['order_by'] is None and len(f2.behavior['order_by']) == 1, 'Object was not copied')

def test_overwrite(self):
"""Test whether order_by overwrites duplicates"""
f = self.lang.fetch.order_by('name DeSc ')
f2 = f.order_by('name')
assert_true(f2.behavior['order_by'] == ['name'], 'order_by attribute was not overwritten')