Adding to index is slow #139

Closed
IanTrudel opened this issue Feb 2, 2015 · 19 comments

@IanTrudel

Hi Florian,

I am contributing to https://github.com/Shoes3 and migrated the help system search from ftsearch to Picky. Picky works well and is a clear improvement over a very old, unmaintained, and crash-prone ftsearch. One thing bothers me, though: adding to the index is very slow, enough to be noticeable when loading the help system, as it momentarily freezes the Shoes splash screen.

The help system only contains 280 entries (title and descriptions).

Also, to fully match the previous search behaviour I would need partial matching, but that makes generating the index even slower. Provided below is a benchmark on Ruby 2.1.5, picky 4.26.1, on Windows 8, along with a test script representative of how I used Picky.

  • How can adding to the index be improved?
  • How can partial search be improved?
  • How to cache the index?

Caching might be an option, but I would prefer to stick to real-time index building; otherwise it increases the maintenance burden on Shoes (needing to regenerate the index file each time we modify the help manual... not cool). Right now the index is built once, the first time the manual is loaded; there is no latency with ftsearch but some with Picky.

0.000000 0.000000 0.000000 ( 0.000000)   # building the Entry array
0.047000 0.000000 0.047000 ( 0.043028)   # adding to a plain index
6.125000 0.000000 6.125000 ( 6.154334)   # adding with Infix partial

require("picky")
require("benchmark")

samples = (1..280).map { 
   [
      [*('A'..'Z')].sample(rand(10) + 5).join, 
      5.times.collect { [*('A'..'Z')].sample(rand(26)).join }
   ]
}

data = []
Entry = Struct.new(:id, :a, :b)
time = Benchmark.measure {
   samples.each { |n|
      data << Entry.new(data.size + 1, n[0], n[1])
   }
}
puts time

index = Picky::Index.new :data do
  category :a
  category :b
end

time = Benchmark.measure {
   data.each { |n| index.add n }
   @search = Picky::Search.new index
}
puts time

index = Picky::Index.new :data do
  category :a
  category :b, partial: Picky::Partial::Infix.new(from: 1, to: -1)
end

data = []
time = Benchmark.measure {
   samples.each { |n|
      data << Entry.new(data.size + 1, n[0], n[1])
      index.add data[-1]
      @search = Picky::Search.new index
   }
}
puts time
@floere (Owner) commented Feb 2, 2015

Hi @backorder,

Thanks for your detailed issue entry, even with included code. I appreciate it a lot.

There are some issues with your measurements. For example, you include entry creation and search creation, the latter of which would only be run once.
Also, you seem to be indexing an Array of Strings for category b. Is that what you intended to do? Or did you intend to just have a text with 5 garbled words indexed?

I've slightly rewritten your example to take these issues into account:

require("picky")
require("benchmark")

samples = (1..280).map { 
   [
      [*('A'..'Z')].sample(rand(10) + 5).join, 
      [*('A'..'Z')].sample(rand(26)).join
   ]
}

data = []
Entry = Struct.new(:id, :a, :b)

# Add samples to data.
time = Benchmark.measure {
   samples.each { |n|
     data << Entry.new(data.size + 1, n[0], n[1])
   }
}
puts time

# Add to a simple index.
index = Picky::Index.new :data do
  category :a
  category :b
end
time = Benchmark.measure {
   data.each { |n| index.add n }
}
puts time

# Add to a more complicated index, with Infix.
index = Picky::Index.new :data do
  category :a
  category :b, partial: Picky::Partial::Infix.new(from: 1, to: -1)
end
time = Benchmark.measure {
   data.each { |n| index.add n }
}
puts time

entries = Picky::Search.new index
piece = data.first.b[3..4]
puts "Searching *#{piece}*"
results = entries.search(piece).ids
puts "Found IDs: #{results}"

# Add to a more complicated index, with Substring.
index = Picky::Index.new :data do
  category :a
  category :b, partial: Picky::Partial::Substring.new(from: 1)
end
time = Benchmark.measure {
   data.each { |n| index.add n }
}
puts time

entries = Picky::Search.new index
piece = data.first.b[0..1]
puts "Searching #{piece}*"
results = entries.search(piece).ids
puts "Found IDs: #{results}"

I'll answer your questions in another comment.

@floere (Owner) commented Feb 2, 2015

On to your questions.

A general remark: Do you intend to use Infix or Postfix indexing? I wonder since you talk about partial queries.
A disclaimer: Picky is not yet optimised for Infix queries, but Postfix covers most cases, I've found.

You asked:

  1. How can adding to the index be improved?
  2. How can partial search be improved?
  3. How to cache the index?
  1. You seem to want to add full-text. I suggest adding lots of stopwords as described here: http://pickyrb.com/documentation.html#tokenizing-options
  2. Depends what you mean by improved. What do you mean?
  3. There's an example on the Picky site on how to dump/load an index: http://florianhanke.com/picky/examples.html#dump-load A good idea is to load it on startup and dump it in an exit handler, something like at_exit { index.dump }.

Let me know what you think so I can help – let's continue the discussion :)

@IanTrudel (Author)

Thanks @floere, as you can see I am new to Picky. The documentation is well written but somewhat difficult to get started with; it needs a little push to understand. I had to dig into the repository to find an example of how to use Partial.

The measurements were in fact correct in including search creation, because they measure the startup time required for Picky, which is the code slowing down the loading of the help system in Shoes. Also, you modified the sample by getting rid of the 5.times.collect {}, which was meant to generate representative data.

Take a look at the actual manual; it is a custom markup language, parsed into an array of titles and descriptions, with the title also being used as an access reference: https://github.com/Shoes3/shoes3/blob/master/static/manual-en.txt

  1. You seem to want to add full-text. I suggest adding lots of stopwords as described here: http://pickyrb.com/documentation.html#tokenizing-options

I need to search in titles and descriptions, :a and :b in the example. You can take a look at the manual above. Do I need tokenizing here?

  1. How can partial search be improved?

For example, searching for the term oval without partial search will overlook occurrences such as oval. or oval, (the word followed by punctuation). I am totally new to Picky, so I have no idea how to properly use partial. The partial search doesn't need to be wide.

There's an example on the Picky site on how to dump/load an index: http://florianhanke.com/picky/examples.html#dump-load A good idea is to load it on startup and dump it in an exit handler, something like at_exit { index.dump }.

Thanks. This is a great example. There might be something to do with this here. I'd like to explore with you the other options discussed in this thread first while keeping in mind this one.

@floere (Owner) commented Feb 2, 2015

Just a quick note before I leave – in the last example, you include data << Entry.new(data.size + 1, n[0], n[1]) and Search interface creation each time. I don't think either should be included.

I'll have a look at the rest soon :) Thanks so far!

@IanTrudel (Author)

You are correct, but note that the difference in time is negligible. The time reported in the first benchmark includes the very same code you are referring to, and the time is effectively 0.000000. The whole point of the example is to get a feel for what's going on. The search is very fast and the features are great; it's just that building the index is slow.

Now for the good news. I tried dumping and loading the index. The time is close to adding to the index without partial (even when I saved with partial), which is good. It is still slightly slower (and somewhat noticeably so) than ftsearch on startup. Maybe something can be done to improve loading time.

Also, is it possible to dump and load the index as a single file? It might be slightly faster and more convenient for me, and I could also check whether the file is newer than the manual (File.mtime) before updating the index.

I am looking forward to your feedback.

@floere (Owner) commented Feb 2, 2015

Indeed, on running it myself, I see it only uses 1/1000th of a second in total.

Picky stores JSON files for the indexes (mostly). Are you already requiring Yajl? E.g. in the Gemfile: gem 'yajl-ruby', :require => 'yajl' That helps a lot with speed.

Storing it in a single file would require quite a few changes, I think. Perhaps you could touch the top level index directory after you've dumped the index and use File.mtime on that?

@floere (Owner) commented Feb 2, 2015

Thanks @floere, as you can see I am new to Picky. The documentation is well written but somewhat difficult to get started with; it needs a little push to understand. I had to dig into the repository to find an example of how to use Partial.

That is a good insight. And I am painfully aware it needs improvements for people who start out – the examples were a start, but nowhere near what is needed. I've added an issue: #140

Take a look at the actual manual; it is a custom markup language, parsed into an array of titles and descriptions, with the title also being used as an access reference: https://github.com/Shoes3/shoes3/blob/master/static/manual-en.txt

Thanks, that helps!

  1. You seem to want to add full-text. I suggest adding lots of stopwords as described here: http://pickyrb.com/documentation.html#tokenizing-options

I need to search in titles and descriptions, :a and :b in the example. You can take a look at the manual above. Do I need tokenizing here?

Picky will tokenize – and it's almost the core work when designing the search engine to get the tokenizing right.
For starters, maybe this set helps. First I define some stopwords. Together with the rejects_token_if option, no words of length 1 or 2 will make it into the index (or any of the stopwords). Then I define which characters are going to be removed (or in this regexp, which will make it through). Then I split on a regexp (incidentally, this removes all the characters in the regexp). What survives is basically a-z and 0-9. In the tokenization of the Search instance, you'd have to use the same tokenization so that queries get the same treatment as the indexing does.

    words = %w(are from has its that the was were will with)
    stopwords = /\b(#{words.join('|')})\b/i

    default_indexing = {
      removes_characters: /[^a-z0-9\s\/\-\_\:\"\&\.]/i,
      stopwords:          stopwords,
      splits_text_on:     %r{[\s/\-\_\:\"\&/\.]},
      rejects_token_if:   lambda { |token| token.size < 3 }
    }
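To make the effect of these options concrete, here is a pure-Ruby sketch that applies the same steps by hand to a sample string. Picky applies them internally during indexing; the sample text here is made up for illustration.

```ruby
words = %w(are from has its that the was were will with)
stopwords = /\b(#{words.join('|')})\b/i

text = "The oval, and the Rect: draw shapes from Shoes!"

cleaned = text.downcase.gsub(/[^a-z0-9\s\/\-\_\:\"\&\.]/i, "") # removes_characters
cleaned = cleaned.gsub(stopwords, "")                          # stopwords
tokens  = cleaned.split(%r{[\s/\-\_\:\"\&/\.]})                # splits_text_on
tokens  = tokens.reject { |t| t.size < 3 }                     # rejects_token_if

p tokens  # => ["oval", "and", "rect", "draw", "shapes", "shoes"]
```

Note that "and" survives: it is three characters long and not in the stopword list above, so you would extend the list for words like it.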
  1. How can partial search be improved?

For example, searching for the term oval without partial search will overlook occurrences such as oval. or oval, (the word followed by punctuation). I am totally new to Picky, so I have no idea how to properly use partial. The partial search doesn't need to be wide.

The issue you describe would be handled by removing all characters other than a-z0-9 as described above, when indexing and when searching.

See http://pickyrb.com/documentation.html#tokenizing-options, so in short:

index = Index.new :name do
  indexing tokenizing_hash_or_tokenizer
end
Search.new index do
  searching tokenizing_hash_or_tokenizer
end

There's an example on the Picky site on how to dump/load an index: http://florianhanke.com/picky/examples.html#dump-load A good idea is to load it on startup and dump it in an exit handler, something like at_exit { index.dump }.

Thanks. This is a great example. There might be something to do with this here. I'd like to explore with you the other options discussed in this thread first while keeping in mind this one.

My pleasure – just write if you have more questions :)

@IanTrudel (Author)

That is a good insight. And I am painfully aware it needs improvements for people who start out – the examples were a start, but nowhere near what is needed. I've added an issue: #140

I'd be happy to give you feedback and perhaps write a few samples to help you out. The manual is otherwise well written.

Picky stores JSON files for the indexes (mostly). Are you already requiring Yajl? E.g. in the Gemfile: gem 'yajl-ruby', :require => 'yajl' That helps a lot with speed.

Shoes does not use Bundler. I added the gem and required it before picky; no difference. Does Picky need to be configured to use Yajl? I would prefer to minimize dependencies whenever possible.

The issue you describe would be handled by removing all characters other than a-z0-9 as described above, when indexing and when searching.

This sounds easy enough. Thanks.

See http://pickyrb.com/documentation.html#tokenizing-options, so in short:

I read it before contacting you but don't understand how to use the options. Nor do I understand how to use the stopwords and default_indexing example above.

Both indexing and searching throw an error:

    sample004.rb:8:in `block in <main>': undefined local variable or method `tokenizing_hash_or_tokenizer' for #<Picky::Index:0x2637e28> (NameError)

My understanding is that in this particular case, for the help manual, only stripping the text before adding it to the index would be enough. There is no need for partial nor for custom tokenizing. Is that correct?

Storing it in a single file would require quite a few changes, I think. Perhaps you could touch the top level index directory after you've dumped the index and use File.mtime on that?

Shoes is multiplatform and includes Windows support where File.mtime on a directory is equivalent to File.ctime. So it won't work. Workaround for now is File.mtime(Dir[File.join("index", "development", "terms", "*")].first).

@floere (Owner) commented Feb 2, 2015

That is a good insight. And I am painfully aware it needs improvements for people who start out – the examples were a start, but nowhere near what is needed. I've added an issue: #140

I'd be happy to give you feedback and perhaps write a few samples to help you out. The manual is otherwise well written.

Thanks – I've added you to a list, so if I get to updating the manual, I may contact you.

Picky stores JSON files for the indexes (mostly). Are you already requiring Yajl? E.g. in the Gemfile: gem 'yajl-ruby', :require => 'yajl' That helps a lot with speed.

Shoes does not use Bundler. I added the gem and required it before picky; no difference. Does Picky need to be configured to use Yajl? I would prefer to minimize dependencies whenever possible.

Picky does not need to be configured – it will just use it if available. I suggest you give it a try and see if it helps with performance, then consider the tradeoff.

See http://pickyrb.com/documentation.html#tokenizing-options, so in short:

I read it before contacting you but don't understand how to use the options. Nor do I understand how to use the stopwords and default_indexing example above.

Both indexing and searching throw an error:

    sample004.rb:8:in `block in <main>': undefined local variable or method `tokenizing_hash_or_tokenizer' for #<Picky::Index:0x2637e28> (NameError)

Ah, tokenizing_hash_or_tokenizer was just a placeholder for a hash with options, or your own tokenizer.

My understanding is that in this particular case, for the help manual, only stripping the text before adding it to the index would be enough. There is no need for partial nor for custom tokenizing. Is that correct?

You could well strip the offending characters yourself. The default tokenizer options only split on \s. However, you would then have to strip the same characters in the search interface, or from the query before passing it in.
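As a sketch of that do-it-yourself approach (the helper name is made up), the same stripping applied to both the indexed text and the query keeps the two sides consistent:

```ruby
# Hypothetical helper: keep only a-z, 0-9 and whitespace. Apply it both to
# text before index.add and to the query before searching.
def strip_for_search(text)
  text.downcase.gsub(/[^a-z0-9\s]/, "").strip
end

strip_for_search("Oval,")          # => "oval"
strip_for_search("the Rect: ...")  # => "the rect"
```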

I suggest you try setting tokenizer options and see how it works:

words = %w(are from has its that the was were will with)
stopwords = /\b(#{words.join('|')})\b/i

# Use the words and stopwords in the options
index = Picky::Index.new(:some_name) do
   indexing removes_characters: /[^a-z0-9\s\/\-\_\:\"\&\.]/i,
     stopwords:          stopwords,
     splits_text_on:     %r{[\s/\-\_\:\"\&/\.]},
     rejects_token_if:   lambda { |token| token.size < 3 }

     # ... your index config here
end

# Use the words and stopwords in the options
search_interface = Picky::Search.new(index) do
   searching removes_characters: /[^a-z0-9\s\/\-\_\:\"\&\.]/i,
     stopwords:          stopwords,
     splits_text_on:     %r{[\s/\-\_\:\"\&/\.]},
     rejects_token_if:   lambda { |token| token.size < 3 }

     # ... your search config here
end

Storing it in a single file would require quite a few changes, I think. Perhaps you could touch the top level index directory after you've dumped the index and use File.mtime on that?

Shoes is multiplatform and includes Windows support where File.mtime on a directory is equivalent to File.ctime. So it won't work. Workaround for now is File.mtime(Dir[File.join("index", "development", "terms", "*")].first).

Glad you found a solution! It's late here, so I won't be able to respond as quickly.

@floere (Owner) commented Feb 3, 2015

@backorder Have you tried running it on Windows?

@IanTrudel (Author)

Picky does not need to be configured – it will just use it if available. I suggest you give it a try and see if it helps with performance, then consider the tradeoff.

Then it did not make any difference.

Ah, tokenizing_hash_or_tokenizer was just a placeholder for a hash with options, or your own tokenizer.

It all makes sense now that you've provided an example. You should consider adding such an example to the manual. Note that rejects_token_if is written rejectstokenif in the manual.

I suggest you try setting tokenizer options and see how it works:

The example didn't work as given, but it gives me something to experiment with. Using your example without the stopwords gives me the proper search results, but it modifies the original data, which turns out like this:

[screenshot: manual rendered with the stripped text]

It should look like this:

[screenshot: manual rendered correctly]

Using the tokenizer is faster than partial, but it strips the original data. Is there a way to prevent this?

@floere (Owner) commented Feb 3, 2015

Yes, you can tell Picky where to take the data from – so you could dup the data before it gets indexed:

Here's an example with an index of book titles:

category :title, :from => lambda { |book| book.title.dup }
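The reason this helps: if the indexer normalizes its input string in place, handing it a dup leaves the original untouched. A pure-Ruby simulation of the effect (the destructive normalize! helper below is a stand-in invented for this sketch, not Picky's actual tokenizer code):

```ruby
Book = Struct.new(:title)
book = Book.new("The Oval & the Rect")

# Simulated in-place normalization, standing in for a destructive tokenizer.
def normalize!(str)
  str.downcase!
  str.gsub!(/[^a-z0-9\s]/, "")
  str
end

normalize!(book.title.dup)  # the dup is modified; the original is not
p book.title                # => "The Oval & the Rect"

normalize!(book.title)      # without dup, the original is clobbered
p book.title                # => "the oval  the rect"
```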

@IanTrudel (Author)

The combination of tokenizing and dup works and still has better performance than using partial. It is in fact as fast as without tokenizing, or at least hardly distinguishable.

How does using dup affect the memory payload? Right now the manual only has 280 entries, but what would happen if it had 280,000...

Before you replied, a small test turned up an interesting observation about dup with index.add: it still modified the original data. I had to do a deep copy using Marshal.load(Marshal.dump(book)). Your solution is obviously better.
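That dup-versus-deep-copy behaviour can be seen in plain Ruby: dup copies the struct, but the copy still shares its member strings with the original, while a Marshal round-trip copies those too (the Item struct here is made up for illustration):

```ruby
Item = Struct.new(:id, :title)
original = Item.new(1, "Oval")

shallow = original.dup                       # struct copied, strings shared
deep    = Marshal.load(Marshal.dump(original))  # everything copied

shallow.title << "!"   # mutates the string shared with the original
deep.title    << "?"   # mutates only the deep copy's own string

p original.title       # => "Oval!"  (affected by the shallow copy)
p deep.title           # => "Oval?"
```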

@floere (Owner) commented Feb 4, 2015

Re payload – in a way, it does not affect it. I assume that you keep the help HTML/HAML/SLIM (or whatever you are using) on disk – is that right?

@floere (Owner) commented Feb 4, 2015

@backorder Should we close this issue or is the index loading/indexing speed still an issue?

@IanTrudel (Author)

Re payload – in a way, it does not affect it. I assume that you keep the help HTML/HAML/SLIM (or whatever you are using) on disk – is that right?

A markdown file that is actually fully loaded into memory.

@backorder Should we close this issue or is the index loading/indexing speed still an issue?

You can close this issue. The tips you gave me are likely the best I can get for now. You may want to consider that adding to the index might need optimization once a larger database comes into play.

Thanks again for your invaluable help. Picky is awesome.

@floere (Owner) commented Feb 4, 2015

@backorder I wonder why the md file needs to be in memory. Is it not fast enough if it needs to be fetched from disk?

Indeed, you are right, and I am working on optimisations – using a non-default Hash is currently in the works. Also note that Picky can use a file-based index, but that documentation is sadly inadequate: http://pickyrb.com/documentation.html#indexes-types This will make querying slower, but if you have memory issues and do not index while running, that would be an option.

Thanks for the kind words. Feel free to open another issue whenever you have questions.

@floere closed this as completed Feb 4, 2015
@IanTrudel (Author)

@backorder I wonder why the md file needs to be in memory. Is it not fast enough if it needs to be fetched from disk?

It is only 280 entries, for a total of 120 KB on disk; it is better to load it into memory, where the markdown is parsed only once and is always available for rendering.

http://pickyrb.com/documentation.html#indexes-types

Isn't the default backend File, which creates the multiple JSON files in index/development/terms? It should be mentioned in your manual that this is the default backend. We previously talked about having a single-file index. My understanding is that it would be possible to write a new backend to do exactly this. Correct?

Not that I would want to do that right now but it's something interesting to know.

Where is the source for Picky located on GitHub? There are so many choices: client, live, server, etc. The gem installed here is simply picky (4.26.1).

@floere (Owner) commented Feb 4, 2015

Almost. The default backend is in memory. And if you call dump on the index, then the index files in #{Picky.root}/index are created. With the File backend, slightly different files are created on a dump, and then when querying, Picky goes straight to the file. Here are some tests: https://github.com/floere/picky/blob/master/server/spec/functional/backends/file_spec.rb (I have to admit, I wrote this three years ago, so I am a bit unsure about the details)

The picky gem is in the server directory, here: https://github.com/floere/picky/tree/master/server. I'd like to rename it core, but as so many references point there, I haven't done so yet.
