-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Add elastic app #177
WIP: Add elastic app #177
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a quick review.
from resolwe.flow.models import dict_dot | ||
|
||
|
||
class NoValue(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain that this is used to differentiate missing attributes from None
(another approach would be to catch AttributeError
instead of having a custom guard type).
def __init__(self, obj): | ||
self.obj = obj | ||
|
||
def filter(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add documentation for all methods.
|
||
if field in self.mapping: | ||
if callable(self.mapping[field]): | ||
setattr(document, field, self.mapping[field]()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better if the callable is called with the object (self.obj
) as argument.
continue | ||
|
||
if isinstance(self.mapping[field], six.string_types): | ||
object_attr = dict_dot(self.obj, self.mapping[field], default=self.mapping[field]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a good approach as typos will automatically change to default values (e.g. saying fo.bar
instead of foo.bar
will silently set all indexed values for that field to the string 'fo.bar'
). In order to set constants we should force them to be wrapped in lambda expressions or figure out a different way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was considering the same. But I think that lambda obj: None
is really ugly workaround. If you will miss-type, you will see the error really quickly on the API. So I don't see this as really big problem.
Another solution would be to add identity function (i.e. const
) and set constant as const(None)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this const
suggestion, just name it constant
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, is there a reason why a constant value should be indexed? Wouldn't it be the same for all documents?
else: | ||
object_attr = self.mapping[field] | ||
if callable(object_attr): | ||
setattr(document, field, object_attr()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, pass self.obj
as argument.
setattr(document, field, object_attr) | ||
continue | ||
|
||
object_value = dict_dot(self.obj, field, default=NoValue()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why using NoValue
instead of propagating AttributeError
/KeyError
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will catch error and raise more descriptive one.
document.save() | ||
|
||
def get_permissions(self): | ||
return { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How expensive is this when doing indexing in bulk? Can we optimize during bulk reindex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment each object is processed individually, so I don't see easy option to optimise this. But this is definitely something that we have to work on in the future. I will add a comment about this.
|
||
def handle(self, *args, **options): | ||
"""Command handle.""" | ||
for data in Data.objects.all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid hardcoding which models to index? Instead, the list of registered index classes should be used to determine the set of models to index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will finish everything else and then work on this.
add_index(obj) | ||
|
||
|
||
@receiver(post_save, sender=Data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like these hardcoded signals. Can we make it so that the first time a new index references a new model, the appropriate signals are connected?
pass | ||
|
||
def preprocess_object(self): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raise NotImplementedError exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not required method, just convenient one, so you don't need to implement it.
@@ -0,0 +1,27 @@ | |||
""".. Ignore pydocstyle D400. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider renaming to elastic_index.py
.
|
||
|
||
class ElasticSearchViewSet(GenericViewSet): | ||
"""Django REST Framework viewset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One line comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will improve documentation :)
1086bcd
to
a6fcc12
Compare
Current coverage is 85.67% (diff: 34.83%)@@ master #177 diff @@
==========================================
Files 84 94 +10
Lines 5641 5885 +244
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 4957 5042 +85
- Misses 684 843 +159
Partials 0 0
|
d213e93
to
1d53934
Compare
def __init__(self): | ||
"""Initialize index builder object.""" | ||
# Set dafault connection for ElasticSearch | ||
# TODO: Reestablish connection if broken! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not handled automatically (e.g. retry API request)? Because AFAIK connections are not persistent, each request is done via HTTP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be, but the connection stopped working when ElasticSearch was restarted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did it stop working, did the queries raise errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, urllib3 raised error that connection is broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did another review pass, I like the improved API, just a few things more.
if inspect.isclass(attr) and issubclass(attr, BaseIndex) and attr is not BaseIndex: | ||
self.indexes.append(attr()) | ||
except ImportError: | ||
pass # no `elastic_indexes` in app |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we differentiate this from "importing that specific index caused an ImportError"? Because this one should be reraised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
except ImportError: | ||
pass # no `elastic_indexes` in app | ||
|
||
def trigger(self, obj=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if trigger
is the best name for this, perhaps it should be build
, index
or update_object
(to be consistent with remove_object
below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll name it build
, because it triggers whole build process and remove_object
only removes single object, so there should be some difference in the name.
""" | ||
|
||
#: list of user ids with view permission on the object | ||
users_with_permissions = dsl.String() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't there be a multi=True
here as it accepts an array of strings (also below)?
|
||
def run(self, obj=None): | ||
"""Main function for running indexes.""" | ||
if obj and obj not in self.queryset: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this execute a database query just to determine if an object matches the filter? Can this be improved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, database query is executed, but I don't see any easier way to implement this. This is used only for single object (if obj
is defined), so it shouldn't slow the process too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, if it is just for a single object (and never used in loops with a large number of objects) then it is ok.
__all__ = ('ElasticSearchViewSet',) | ||
|
||
|
||
class ElasticSearchViewSet(GenericViewSet): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this viewset also include methods for handling most common search requests, so that the user needs to only override a few specific things and common stuff like ordering and pagination are handled correctly? An even better approach would be to maybe have this as a mixin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, I've changed it to the mixin.
fc0b046
to
0fd2277
Compare
""" | ||
|
||
#: list of user ids with view permission on the object | ||
users_with_permissions = dsl.String(many=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be multi
and not many
? At least based on this.
|
||
self._index_name = self.document_class()._get_index() # pylint: disable=not-callable,protected-access | ||
|
||
def filter(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which case is filter
used, now that we have queryset
?
6de0bcd
to
cf0fc19
Compare
b61d468
to
5d9c527
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, just some minor things. Also, I see that we currently only have a management command for reindexing? What about adding one for clearing indices as well?
|
||
elasticsearch_host = getattr(settings, 'ELASTICSEARCH_HOST', 'localhost') # pylint: disable=invalid-name | ||
elasticsearch_port = getattr(settings, 'ELASTICSEARCH_PORT', 9200) # pylint: disable=invalid-name | ||
connections.create_connection(hosts=['{}:{}'.format(elasticsearch_host, elasticsearch_port)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why this must be done here and not in the ready
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for spotting this, Ive forgot to revert the changes when performing some tests.
def __init__(self): | ||
"""Initialize index builder object.""" | ||
# Set dafault connection for ElasticSearch | ||
# elasticsearch_host = getattr(settings, 'ELASTICSEARCH_HOST', 'localhost') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's up with this commented block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
# Set dafault connection for ElasticSearch | ||
# elasticsearch_host = getattr(settings, 'ELASTICSEARCH_HOST', 'localhost') | ||
# elasticsearch_port = getattr(settings, 'ELASTICSEARCH_PORT', 9200) | ||
# # TODO: Reestablish connection if broken! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So currently connections are not reestablished? Would it be hard to add this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it has started to work by itself :)
|
||
for field in document._doc_type.mapping: # pylint: disable=protected-access | ||
if field in ['users_with_permissions', 'groups_with_permissions']: | ||
continue # this fields are handled separately |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this fields
-> These fields
BTW, this PR seems to decrease coverage? |
4ea8bea
to
acd832a
Compare
@kostko I've added a commit with fixes. I will add tests in separate pull request, so we can start testing this. |
acd832a
to
98c15a0
Compare
No description provided.