Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add elastic app #177

Merged
merged 1 commit into from
Jan 4, 2017
Merged

WIP: Add elastic app #177

merged 1 commit into from
Jan 4, 2017

Conversation

dblenkus
Copy link
Member

No description provided.

Copy link
Contributor

@kostko kostko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a quick review.

from resolwe.flow.models import dict_dot


class NoValue(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain that this is used to differentiate missing attributes from None (another approach would be to catch AttributeError instead of having a custom guard type).

def __init__(self, obj):
self.obj = obj

def filter(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add documentation for all methods.


if field in self.mapping:
if callable(self.mapping[field]):
setattr(document, field, self.mapping[field]())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better if the callable is called with the object (self.obj) as argument.

continue

if isinstance(self.mapping[field], six.string_types):
object_attr = dict_dot(self.obj, self.mapping[field], default=self.mapping[field])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a good approach as typos will automatically change to default values (e.g. saying fo.bar instead of foo.bar will silently set all indexed values for that field to the string 'fo.bar'). In order to set constants we should force them to be wrapped in lambda expressions or figure out a different way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was considering the same. But I think that lambda obj: None is really ugly workaround. If you will miss-type, you will see the error really quickly on the API. So I don't see this as really big problem.

Another solution would be to add identity function (i.e. const) and set constant as const(None).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this const suggestion, just name it constant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, is there a reason why a constant value should be indexed? Wouldn't it be the same for all documents?

else:
object_attr = self.mapping[field]
if callable(object_attr):
setattr(document, field, object_attr())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, pass self.obj as argument.

setattr(document, field, object_attr)
continue

object_value = dict_dot(self.obj, field, default=NoValue())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why using NoValue instead of propagating AttributeError/KeyError?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will catch error and raise more descriptive one.

document.save()

def get_permissions(self):
return {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How expensive is this when doing indexing in bulk? Can we optimize during bulk reindex?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment each object is processed individually, so I don't see easy option to optimise this. But this is definitely something that we have to work on in the future. I will add a comment about this.


def handle(self, *args, **options):
"""Command handle."""
for data in Data.objects.all():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid hardcoding which models to index? Instead, the list of registered index classes should be used to determine the set of models to index.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will finish everything else and then work on this.

add_index(obj)


@receiver(post_save, sender=Data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like these hardcoded signals. Can we make it so that the first time a new index references a new model, the appropriate signals are connected?

pass

def preprocess_object(self):
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raise NotImplementedError exception.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not required method, just convenient one, so you don't need to implement it.

@@ -0,0 +1,27 @@
""".. Ignore pydocstyle D400.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider renaming to elastic_index.py.



class ElasticSearchViewSet(GenericViewSet):
"""Django REST Framework viewset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One line comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will improve documentation :)

@codecov-io
Copy link

codecov-io commented Dec 15, 2016

Current coverage is 85.67% (diff: 34.83%)

Merging #177 into master will decrease coverage by 2.19%

@@             master       #177   diff @@
==========================================
  Files            84         94    +10   
  Lines          5641       5885   +244   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           4957       5042    +85   
- Misses          684        843   +159   
  Partials          0          0          

Powered by Codecov. Last update 8ac4543...98c15a0

@dblenkus dblenkus force-pushed the feature-elastic branch 2 times, most recently from d213e93 to 1d53934 Compare December 15, 2016 21:50
def __init__(self):
"""Initialize index builder object."""
# Set dafault connection for ElasticSearch
# TODO: Reestablish connection if broken!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not handled automatically (e.g. retry API request)? Because AFAIK connections are not persistent, each request is done via HTTP?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be, but the connection stopped working when ElasticSearch was restarted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did it stop working, did the queries raise errors?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, urllib3 raised error that connection is broken.

Copy link
Contributor

@kostko kostko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did another review pass, I like the improved API, just a few things more.

if inspect.isclass(attr) and issubclass(attr, BaseIndex) and attr is not BaseIndex:
self.indexes.append(attr())
except ImportError:
pass # no `elastic_indexes` in app
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we differentiate this from "importing that specific index caused an ImportError"? Because this one should be reraised.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

except ImportError:
pass # no `elastic_indexes` in app

def trigger(self, obj=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if trigger is the best name for this, perhaps it should be build, index or update_object (to be consistent with remove_object below).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll name it build, because it triggers whole build process and remove_object only removes single object, so there should be some difference in the name.

"""

#: list of user ids with view permission on the object
users_with_permissions = dsl.String()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't there be a multi=True here as it accepts an array of strings (also below)?


def run(self, obj=None):
"""Main function for running indexes."""
if obj and obj not in self.queryset:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this execute a database query just to determine if an object matches the filter? Can this be improved?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, database query is executed, but I don't see any easier way to implement this. This is used only for single object (if obj is defined), so it shouldn't slow the process too much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, if it is just for a single object (and never used in loops with a large number of objects) then it is ok.

__all__ = ('ElasticSearchViewSet',)


class ElasticSearchViewSet(GenericViewSet):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this viewset also include methods for handling most common search requests, so that the user needs to only override a few specific things and common stuff like ordering and pagination are handled correctly? An even better approach would be to maybe have this as a mixin.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I've changed it to the mixin.

@dblenkus dblenkus force-pushed the feature-elastic branch 2 times, most recently from fc0b046 to 0fd2277 Compare December 20, 2016 07:35
"""

#: list of user ids with view permission on the object
users_with_permissions = dsl.String(many=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be multi and not many? At least based on this.


self._index_name = self.document_class()._get_index() # pylint: disable=not-callable,protected-access

def filter(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case is filter used, now that we have queryset?

Copy link
Contributor

@kostko kostko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just some minor things. Also, I see that we currently only have a management command for reindexing? What about adding one for clearing indices as well?


elasticsearch_host = getattr(settings, 'ELASTICSEARCH_HOST', 'localhost') # pylint: disable=invalid-name
elasticsearch_port = getattr(settings, 'ELASTICSEARCH_PORT', 9200) # pylint: disable=invalid-name
connections.create_connection(hosts=['{}:{}'.format(elasticsearch_host, elasticsearch_port)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why this must be done here and not in the ready method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spotting this, Ive forgot to revert the changes when performing some tests.

def __init__(self):
"""Initialize index builder object."""
# Set dafault connection for ElasticSearch
# elasticsearch_host = getattr(settings, 'ELASTICSEARCH_HOST', 'localhost')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's up with this commented block?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

# Set dafault connection for ElasticSearch
# elasticsearch_host = getattr(settings, 'ELASTICSEARCH_HOST', 'localhost')
# elasticsearch_port = getattr(settings, 'ELASTICSEARCH_PORT', 9200)
# # TODO: Reestablish connection if broken!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So currently connections are not reestablished? Would it be hard to add this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it has started to work by itself :)


for field in document._doc_type.mapping: # pylint: disable=protected-access
if field in ['users_with_permissions', 'groups_with_permissions']:
continue # this fields are handled separately
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fields -> These fields

@kostko
Copy link
Contributor

kostko commented Jan 3, 2017

BTW, this PR seems to decrease coverage?

@dblenkus dblenkus force-pushed the feature-elastic branch 2 times, most recently from 4ea8bea to acd832a Compare January 4, 2017 09:41
@dblenkus
Copy link
Member Author

dblenkus commented Jan 4, 2017

@kostko I've added a commit with fixes.

I will add tests in separate pull request, so we can start testing this.

@dblenkus dblenkus merged commit 98c15a0 into genialis:master Jan 4, 2017
@dblenkus dblenkus deleted the feature-elastic branch January 4, 2017 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants