
Feedback on elasticsearch-model #4

Closed · jlecour opened this issue Jan 27, 2014 · 18 comments


jlecour commented Jan 27, 2014

Hi @karmi,

I wanted to give you some feedback about elasticsearch-model but didn't find a better place to post it.

This gem is really interesting, especially the "model" part of it.

I really like the decoupling between the class that holds the data, the one that does the request, and the one that represents the response and the results.

Still, there is something that bothers me: there is a big assumption that a "model" is something stored in a primary datastore other than Elasticsearch, accessible through ActiveRecord, Mongoid or the like.

I am working on an app where a lot of data is indexed in ES from an external data source. In the "main" app, I just want to query ES, fetch results and populate collections of results with data found in ES.

The objects I'm dealing with are very simple Ruby objects (POROs) on which I can add some behavior, …

I guess I could use the various parts of Elasticsearch::Model with only what I want (excluding all the Record stuff) but it seems a little tedious and brittle.

Maybe it would be enough to have an example that includes just what's needed to get those plain Ruby objects coming from ES.

Another question: in Result, how do you deal with script_fields (with a calculated distance, for example) when the rest of the _source is also present? I guess a basic merge would do the job.


karmi commented Jan 27, 2014

Hi Jérémy,

feedback here in GitHub issues is absolutely perfect. Easy to track, review, and contribute.

Still, there is something that bothers me: there is a big assumption that a "model" is something stored in
a primary datastore other than Elasticsearch, accessible through ActiveRecord, Mongoid or the like.

Yes, indeed. That's the main use case for the -model gem -- your primary database is "something else", and you want to
search/analyze it with Elasticsearch.

Now, the first important point to make is that a "persistence" module for this gem is planned. If you're familiar with Tire::Model::Persistence, it should have very similar semantics and feature set, making it easy to use Elasticsearch as the primary storage for your models. I expect it to be available sometime during spring.

I am working on an app where a lot of data is indexed in ES from an external data source. In the "main" app, I just
want to query ES, fetch results and populate collections of results with data found in ES. (...)
The objects I'm dealing with are very simple Ruby objects (POROs) on which I can add some behavior (...)
I guess I could use the various parts of Elasticsearch::Model with only what I want (...)

Everything depends on how much you want your PORO class to be involved with Elasticsearch. Should it be responsible
for configuring the index settings and mappings? Should it be able to create the index? And so on.

In the simplest case, you can simply wrap each hit from the results in your custom class, à la response['hits']['hits'].map { |r| MyWrapper.new(r) }; see the example in the elasticsearch-api README with Hashie::Mash.

With such a wrapper in place, you can of course selectively include certain Elasticsearch::Model modules, to add support for search, naming, etc. For certain features, you'd have to make the class ActiveModel-compatible, or provide the necessary methods, of course.
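
For illustration, a minimal sketch of such a wrapper could look like this (the MyWrapper class, the articles index, and the title accessor are just placeholders):

require 'elasticsearch'

# A minimal PORO wrapper around a single hit from the response
class MyWrapper
  attr_reader :id, :attributes

  def initialize(hit)
    @id         = hit['_id']
    @attributes = hit['_source'] || {}
  end

  # Add whatever behaviour you need on top of the raw attributes
  def title
    attributes['title']
  end
end

client   = Elasticsearch::Client.new host: 'localhost:9250'
response = client.search index: 'articles', body: { query: { match_all: {} } }
results  = response['hits']['hits'].map { |hit| MyWrapper.new(hit) }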

Another question: in Result, how do you deal with script_fields (with a calculated distance, for example) when the
rest of the _source is also present? I guess a basic merge would do the job.

The Result class is just a convenience wrapper around the hit item, so you should be able to get to any object in the JSON hierarchy. That said, it provides a "sugar" accessor for _source, so instead of myresult._source.title, you can do myresult.title. I can see the argument being made here that it should also delegate into fields and script_fields, but given the preference for _source filtering in recent Elasticsearch versions, and to prevent subtle errors, I am on the fence, and prefer "explicit over implicit" here.
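
To illustrate, a stripped-down sketch of such a wrapper (not the actual Result implementation) which delegates method calls into _source only, keeping fields and script_fields explicit:

# A simplified hit wrapper: method calls delegate into _source only;
# fields/script_fields have to be accessed explicitly
class SimpleResult
  def initialize(hit)
    @hit = hit
  end

  # Explicit access to any part of the hit, e.g. result['fields']
  def [](key)
    @hit[key]
  end

  def method_missing(name, *args, &block)
    source = @hit['_source']
    source && source.key?(name.to_s) ? source[name.to_s] : super
  end

  def respond_to_missing?(name, include_private = false)
    (@hit['_source'] || {}).key?(name.to_s) || super
  end
end

result = SimpleResult.new('_id' => '1', '_source' => { 'title' => 'Test' })
result.title  # => "Test"
result['_id'] # => "1"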

An example of source filtering in Elasticsearch 1.0:

require 'elasticsearch'

client = Elasticsearch::Client.new host: 'localhost:9250'

client.indices.delete index: 'articles'
client.index index: 'articles', type: 'a', id: 1, refresh: true, body: { title: "Test", tags: ["a", "b", "c"] }

response = client.search index: 'articles', body: {
  _source: {
    include: 'title'
  }
}

puts response['hits']['hits'].map { |r| r['_source'] }
# => {"title"=>"Test"}

Many thanks for the feedback and the questions! Post more if you have them.


jlecour commented Jan 27, 2014

Thanks for your reply. I like that kind of discussion because it challenges my mental model.

Regarding persistence, I'd like to have my models mostly ignorant of Elasticsearch. Right now I need neither the indexing nor the mapping/settings parts. That said, if I did need them, it might be comfortable to have them at hand.
I'm not sure, though, that the model is the best place. I'd put them in a "repository" object, responsible for reading from and writing to ES.
Depending on my time, I might try to selectively include some parts and see how it goes.

Regarding the mapping, I've already done something similar. Right now I'm using Virtus, but it is slow as hell when I need to create thousands of objects. I don't really like Hashie either, since it is very leaky: if you have a size key in your hash, you're screwed, because there is a size method on Hash.

Regarding the _source, I've come up with a #all_fields method on my result class. It simply merges _source and fields.

I think that the recent change in ES to restrict the use of fields in favor of _source is good in many ways, but bad when you have script_fields, which create a fields key and remove the _source key from the response. To have them both, I have to add "_source": true to the request (which is weird to me).

Also, why ES wraps all fields values in an Array is a complete mystery to me. I've read the issue about this, but it doesn't make sense to me (don't bother explaining this to me, I've just updated my client code accordingly).
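
For the record, my #all_fields looks something like this (a sketch of the idea; it assumes the raw hit is available through a hit reader, and it unwraps those array-wrapped fields values on the way):

# Merge _source with the (array-wrapped) fields section of the hit
def all_fields
  source = hit['_source'] || {}
  fields = hit['fields']  || {}

  # fields values arrive wrapped in Arrays; unwrap single-valued ones
  unwrapped = fields.each_with_object({}) do |(key, value), memo|
    memo[key] = value.is_a?(Array) && value.size == 1 ? value.first : value
  end

  source.merge(unwrapped)
end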


jlecour commented Jan 30, 2014

Hi @karmi,

I've just found an article by @ahawkins about the Repository pattern: http://hawkins.io/2013/10/implementing_the_repository_pattern/

I find it quite relevant to what I have in mind regarding a clear separation between models and persistence, using Elasticsearch as a backend, elasticsearch-ruby as a transport/API adapter, and clean custom models on the other side.


karmi commented Jan 31, 2014

Regarding persistence, I'd like to have my models mostly ignorant of Elasticsearch.
Right now I need neither the indexing nor the mapping/settings parts.

In this case, I think you actually don't need much supporting "integration", since a to_hash serialization can be used as the "wire serialization" for Elasticsearch, à la:

class MyModel
  include MyStuff

  def to_hash
    { foo: "bar" }
  end
end

m = MyModel.new

client.index index: m.index_name, type: m.document_type, body: m.to_hash

The repository "support" would be quite small here, since most of the other stuff would be tied to your model implementation -- how it handles naming (index_name and such), how it serializes itself into a Hash, etc.

I've just found an article by @ahawkins about the Repository pattern (...)

Yeah, I'm familiar with that. There are actually quite a lot of different implementations of the pattern in Ruby on GitHub, in various states of maturity and with various interfaces. One of the most complete ones seems to be braintree/curator.

I'll think about it a bit more, and play with some sketches for the interface.

Also, why ES wraps all fields values in an Array is a complete mystery to me. (... don't bother explaining this to me...)

I will, anyway :) Basically, any field or property or attribute (or whatever you call it) of your document is multi-valued in Lucene by default. Elasticsearch here "leaks" this fact, and in my opinion rightly so. Source filtering is much more intuitive, and should be preferred anyway. Only people who know what they're doing should work with fields directly.
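
To make the wrapping visible, continuing the articles example from my earlier comment (response shape as in Elasticsearch 1.x):

require 'elasticsearch'
client = Elasticsearch::Client.new host: 'localhost:9250'

response = client.search index: 'articles', body: { fields: ['title', 'tags'] }

puts response['hits']['hits'].first['fields'].inspect
# => {"title"=>["Test"], "tags"=>["a", "b", "c"]}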


karmi commented Feb 3, 2014

Just for the record, I did a little bit of research (the Domain-Driven Design book by Eric Evans, the repository vs. active record patterns, etc.) and I'm quite sure I'll try to support both the data mapper and active record patterns in the persistence module.

REPOSITORY: "A mechanism for encapsulating storage, retrieval, and search behavior which emulates a collection of objects."

[Domain Driven Design, p. 513]

For inspiration, I quite like the approach Braintree took in their braintree/curator gem (example Rails application). See the accompanying article for background: Untangle Domain and Persistence Logic with Curator.


jlecour commented Feb 3, 2014

I've been studying the Curator source code lately and it is indeed really interesting.

There is something that bothers me: there is nothing in it about building queries, besides using an index to fetch a record.

With Elasticsearch, the index looks like the bucket in Riak or the collection in Mongo, and the id is obviously present too, but besides that there are many more concepts involved. The type and routing are not part of the query itself but are really important. Values returned are not monolithic: _source, fields, facets, … are returned in different parts of the response.

I can quite easily see how I could improve my implementation by using Curator's basic parts: Configuration, DataStore, Repository, … but the (missing) QueryBuilder is the most sensitive piece (to me).


karmi commented Feb 3, 2014

I'm quite sure I won't design & implement it as an "adapter" to Curator, Ruby Object Mapper or do something heavy like that.

The way I see it, the repository would be responsible for storage & retrieval related behaviour, which in Elasticsearch's world means also setting up mappings, defining index/type, etc.

The model would have to conform to a simple contract (having a #to_hash method, etc.), and just be passed to the repository. (In my understanding, this would be an entity in the system, in Evans' parlance.)

Some kind of a "sugar" module could "include" the repository in a model, so it would provide familiar active record semantics (Article.create! title: "Foo", @article.save, etc.), without forcing people to use the active record pattern.
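
A minimal sketch of what such a "sugar" module might look like (all names hypothetical; the repository here is only a stub printing what it would store):

# Stub repository; a real one would call the Elasticsearch client,
# e.g. client.index index: ..., body: model.to_hash
class Repository
  def initialize(klass)
    @klass = klass
  end

  def save(model)
    puts "Saving: #{model.to_hash.inspect}"
  end
end

module Persistence
  def self.included(base)
    base.extend ClassMethods
  end

  module ClassMethods
    def repository
      @repository ||= Repository.new(self)
    end

    def create!(attributes = {})
      new(attributes).tap { |model| repository.save(model) }
    end
  end

  def save
    self.class.repository.save(self)
  end
end

class Article
  include Persistence

  attr_reader :title

  def initialize(attributes = {})
    @title = attributes[:title]
  end

  def to_hash
    { title: title }
  end
end

Article.create! title: "Foo"    # saves via the class-level repository
Article.new(title: "Bar").save  # saves via the instance-level sugar

The point is that the active record semantics live in an optional module, while the repository stays usable on its own with plain objects.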

You correctly highlight the special meaning queries have in Elasticsearch. There are plans for an elasticsearch-dsl gem, which would provide a Ruby DSL for specifying queries, à la Tire. On the other hand, most complex applications and systems have to abstract and centralize this in some kind of QueryBuilder class/module. This is certainly something I'll try to take into consideration when writing documentation and example code.


jlecour commented Feb 3, 2014

I'm quite sure I won't design & implement it as an "adapter" to Curator, Ruby Object Mapper or do something heavy like that.

I'm not sure I understand how you would do something like Curator/ROM, … for Elasticsearch without making it an adapter. I guess I'm missing a key concept.

The way I see it, the repository would be responsible for storage & retrieval related behavior, which in Elasticsearch's world means also setting up mappings, defining index/type, etc.

👍

For the persistence of models, my ideas are not that clear yet. For the moment, I mainly focus on querying many indexes with complex queries.

You correctly highlight the special meaning queries have in Elasticsearch. There are plans for an elasticsearch-dsl gem, which would provide a Ruby DSL for specifying queries, à la Tire

I have to admit, I've never been a fan of Tire's DSL. I prefer building a hash over time, with a few helpers when a refactoring is useful.
For example, for the query part, depending on what I've accumulated in a terms variable, I have 3 cases (see the sketch below):

  • no term => match_all: {}
  • a unique term => filtered: { filter: terms.first }
  • many terms => filtered: { filter: { and: terms } }

Another example: depending on min and/or max values for a search parameter, I'll build a range filter.
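
Put together, a sketch of that hash-building approach (params and the price field are made up for the example):

# Hypothetical search params
params = { status: 'published', min: 10 }

terms = []
terms << { term: { status: params[:status] } } if params[:status]

range = {}
range[:gte] = params[:min] if params[:min]
range[:lte] = params[:max] if params[:max]
terms << { range: { price: range } } unless range.empty?

query =
  case terms.size
  when 0 then { match_all: {} }
  when 1 then { filtered: { filter: terms.first } }
  else        { filtered: { filter: { and: terms } } }
  end

body = { query: query }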

All this can be done using a DSL, but I've found it quite hard.
It reminds me of this Alan Kay quote:

Simple things should be simple, complex things should be possible.

And oftentimes, DSLs aim at easing the simple/common things but fail at making the complex things possible.


karmi commented Feb 3, 2014

I'm not sure I understand how you would do something like Curator/ROM, … for Elasticsearch, without making it an adapter.

What I'm saying is that my first goal is to come up with a standalone repository pattern implementation in the elasticsearch-persistence gem. There could be adapters for these projects in the future, but that isn't my immediate goal.

I prefer building a hash over time, with a few helpers when a refactoring is useful.

Even when it takes ~ 150 lines, like in this complex example? :)

Have a look at this sketch of using self-calling lambdas for dynamically building the Hash, I think the convenience factor of a "DSL" is immediately visible there.
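
For the idea in a nutshell (a reconstruction of the general trick, not the linked sketch itself): a lambda is defined and immediately called in place while the Hash literal is being built.

terms = [ { term: { tags: 'ruby' } } ]

body = {
  # The lambda runs right here, so the resulting Hash contains its
  # return value, not the lambda itself
  query: -> {
    case terms.size
    when 0 then { match_all: {} }
    when 1 then { filtered: { filter: terms.first } }
    else        { filtered: { filter: { and: terms } } }
    end
  }.(),
  size: 25
}
# => body[:query] is { filtered: { filter: { term: { tags: "ruby" } } } }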


jlecour commented Feb 3, 2014

Even after a few minutes of fiddling with the sketch you've linked, I'm not sure I understand 100% how it works and what I can do with it. It is definitely shorter than your previous example.

I guess I should take some time to rethink my query building. I've coded it on the go, while designing my queries and how I wanted to fetch my data. With the whole picture in front of me, I might come up with something more obvious/simple than what I've done lately.


jlecour commented Feb 3, 2014

That said, this self-calling lambda trick is really neat.


jlecour commented Feb 3, 2014

I'd like to come back to a point I was referring to earlier: instantiating objects with the "records" gathered from the index.

As I've said, I've been using Virtus. It's a very nice project with many useful features. You can give it a hash of attributes, and each of them becomes a separate method on the object. You can coerce the values.

The problem is that it's very slow when you coerce many fields. I've updated my Gist with many benchmarks.

It shows that the fastest approach is no coercion at all (RawWithoutVirtus), but it's very poor feature-wise.
With a very verbose/explicit class definition (CuratorStyle) I get pretty good results.
The IndifferentCoercedWithoutVirtus version is even faster, but using method_missing might be a bad idea if attributes are accessed frequently.

In my particular case, I need simple value objects, with very little behavior but with important coercions. That's why a naive Hashie/OpenStruct-style approach is not enough.


karmi commented Feb 3, 2014

As I've said, I've been using Virtus. It's a very nice project with many useful features. You can give it a hash of attributes, and each of them becomes a separate method on the object.

I'm quite familiar with Virtus. I'm considering using it as the basis for the active record layer of the "persistence" module, because it's the most advanced and feature-complete implementation of ActiveModel (incl. casting and coercion). As discussed, this won't prevent people from using other model implementations, or plain old Ruby objects, with the repository layer.

The problem is that it's very slow when you coerce many fields. (...)

I've seen the benchmark, and wasn't quite sure about the concern: is it that creating thousands of Virtus instances is slow, or that creating a handful of Virtus instances with many attributes is slow?

Usually, you're creating only a limited number (10, 25, 50) of instances based on search results, and even when that isn't the case (scan/scroll workflows), the cost seems tolerable when compared with network overhead, JSON parsing, etc.?


jlecour commented Feb 3, 2014

I've seen big slowdowns with Virtus when creating tens/hundreds of objects with 5 to 20 attributes.
The penalty grows with the number of attributes, the number and complexity of coercions and the number of instances.

Since this morning, I've fiddled with a simpler implementation, with 2 simple specs:

  • I must be able to instantiate an object with a hash of attributes (symbol- or string-indexed) and have an accessor for each attribute;
  • I must be able to coerce attributes freely.

With the same attributes and coercions, between a Virtus implementation and the simplest implementation that satisfies my specs, I get this:

2014-02-03 15:03:06 +0100
ruby 2.1.0p0 (2013-12-25 revision 44422) [x86_64-darwin12.0]

Rehearsal ----------------------------------------------------------
coerced_with_virtus      1.150000   0.000000   1.150000 (  1.161851)
coerced_without_virtus   0.090000   0.000000   0.090000 (  0.086930)
------------------------------------------------- total: 1.240000sec

                             user     system      total        real
coerced_with_virtus      1.100000   0.010000   1.110000 (  1.109354)
coerced_without_virtus   0.070000   0.000000   0.070000 (  0.073982)

You can see the benchmark here: https://gist.github.com/jlecour/9dd8022c542c9d353ed7#file-z2_benchmark_virtus-rake
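
The simple implementation looks something like this (a sketch along the same lines, not the exact code from the gist; coercions run lazily, on access):

require 'time'

class SimpleModel
  # Define a reader for each attribute, with an optional coercion lambda
  def self.attribute(name, coercion = nil)
    define_method(name) do
      value = @attributes[name]
      coercion ? coercion.call(value) : value
    end
  end

  def initialize(attributes = {})
    # Accept string- or symbol-keyed hashes
    @attributes = attributes.each_with_object({}) { |(k, v), memo| memo[k.to_sym] = v }
  end
end

class Event < SimpleModel
  attribute :name
  attribute :starts_at, ->(value) { Time.parse(value.to_s) }
  attribute :attendees, ->(value) { Integer(value) }
end

event = Event.new('name' => 'Meetup', starts_at: '2014-02-03 15:00', attendees: '42')
event.name      # => "Meetup"
event.starts_at # => 2014-02-03 15:00:00 +0100
event.attendees # => 42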


karmi commented Feb 18, 2014

@jlecour I can see the difference in the benchmarks, though I still don't understand why it's such a big concern. Again, in regular use you instantiate a handful (10, 25, 50) of objects, e.g. for displaying them. Can you clarify this bit?

karmi added the feature label on Feb 18, 2014

jlecour commented Feb 18, 2014

As you said, the regular use case is not an issue. But when I have to instantiate (often, in my situation) hundreds or even thousands of objects, it becomes one.

I have an example with an API where I have a few hundred main elements to return in a JSON response. Each of these main elements has 10-20 sub-elements that come from a different index.


karmi commented Apr 4, 2014

A working version of the repository pattern implementation was added in #71.


karmi commented Apr 17, 2014

Closing this issue in favour of #71 and #78.

karmi closed this as completed on Apr 17, 2014