Adding to index is slow #139
Hi @backorder, thanks for your detailed issue entry, even with included code – I appreciate it a lot. There are some issues with your measurements, though. For example, you include entry creation and search creation, the latter of which would only be run once. I've slightly rewritten your example to take these issues into account:

require 'picky'
require 'benchmark'
samples = (1..280).map {
  [
    [*('A'..'Z')].sample(rand(10) + 5).join,
    [*('A'..'Z')].sample(rand(26)).join
  ]
}
data = []
Entry = Struct.new(:id, :a, :b)
# Add samples to data.
time = Benchmark.measure {
  samples.each { |n|
    data << Entry.new(data.size + 1, n[0], n[1])
  }
}
puts time
# Add to a simple index.
index = Picky::Index.new :data do
  category :a
  category :b
end

time = Benchmark.measure {
  data.each { |n| index.add n }
}
puts time
# Add to a more complicated index, with Infix.
index = Picky::Index.new :data do
  category :a
  category :b, partial: Picky::Partial::Infix.new(from: 1, to: -1)
end

time = Benchmark.measure {
  data.each { |n| index.add n }
}
puts time
entries = Picky::Search.new index
piece = data.first.b[3..4]
puts "Searching *#{piece}*"
results = entries.search(piece).ids
puts "Found IDs: #{results}"
# Add to a more complicated index, with Substring.
index = Picky::Index.new :data do
  category :a
  category :b, partial: Picky::Partial::Substring.new(from: 1)
end

time = Benchmark.measure {
  data.each { |n| index.add n }
}
puts time
entries = Picky::Search.new index
piece = data.first.b[0..1]
puts "Searching #{piece}*"
results = entries.search(piece).ids
puts "Found IDs: #{results}"

I'll answer your questions in another comment. |
On to your questions. A general remark: do you intend to use Infix or Postfix indexing? I ask because you talk about partial queries. You asked:
Let me know what you think so I can help – let's continue the discussion :) |
Thanks @floere – as you can see, I am new to Picky. The documentation is well written but somehow difficult to begin with; it needs a little push to understand. I had to dig into the repository to find an example of how to use Partial. The measurements were in fact correct in including search creation, because they measure the startup time required for Picky, which is the code slowing down the loading of the help system in Shoes. Also, you modified the sample by removing the 5.times.collect {} that was meant to generate representative data. Take a look at the actual manual – it is a custom markup language, parsed into an array of titles and descriptions, with the title also being used as an access reference: https://github.com/Shoes3/shoes3/blob/master/static/manual-en.txt
I need to search the titles and descriptions – :a and :b in the example. You can take a look at the manual above. Do I need tokenizing here?
For example, search for the term oval without partial search will overlook entries such as
Thanks. This is a great example. There might be something to do with this here. I'd like to explore with you the other options discussed in this thread first while keeping in mind this one. |
Just a quick note before I leave – in the last example, you include

I'll have a look at the rest soon :) Thanks so far! |
You are correct, but note that the difference in time is negligible. The time reported by the first benchmark includes the very same code you are referring to, and that time is effectively 0.000000. The whole point of the example is to get a feel for what's going on. The search is very fast and the features are great; it's just that building the index is slow.

Now for the good news: I did try dumping and loading the index. The loading time is close to that of adding to the index without partial (even when I saved with partial), which is good. It is still slightly slower (and somehow noticeably so) than ftsearch on startup, so maybe there is something to be done to improve loading time. Also, is it possible to dump and load the index as a single file? It might be slightly faster and more convenient for me, and I could also check whether the file is newer (File.mtime) than the manual before updating the index. I am looking forward to your feedback. |
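(Aside: the File.mtime staleness check mentioned above could be sketched like this in plain Ruby – the helper name and file paths are hypothetical, not part of the actual Shoes code:)

```ruby
# Rebuild the index dump only when the manual has changed since the
# dump was written. Paths are hypothetical examples.
def index_stale?(manual_path, dump_path)
  return true unless File.exist?(dump_path)  # no dump yet -> must build
  File.mtime(manual_path) > File.mtime(dump_path)
end
```

On startup one would call `index_stale?('static/manual-en.txt', 'index/manual.dump')` and only re-run the indexing when it returns true, otherwise load the dump.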
Indeed, on running it myself, I see it only uses 1/1000th of a second in total.

Picky stores JSON files for the indexes (mostly). Are you already requiring Yajl? E.g. in the Gemfile:

Storing it in a single file would require quite a few changes, I think. Perhaps you could |
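(Aside: the Gemfile snippet is cut off above. It presumably referred to the yajl-ruby gem; a plausible line – the require: name is an assumption – would be:)

```ruby
# Hypothetical Gemfile entry (the original snippet was cut off above);
# yajl-ruby provides the Yajl JSON parser that Picky uses when available.
gem 'yajl-ruby', require: 'yajl'
```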
That is a good insight. And I am painfully aware that it needs improvements for people who are starting out – the examples were a start, but nowhere near what is needed. I've added an issue: #140
Thanks, that helps!
Picky will tokenize – and getting the tokenizing right is almost the core work when designing the search engine.

words = %w(are from has its that the was were will with)
stopwords = /\b(#{words.join('|')})\b/i

default_indexing = {
  removes_characters: /[^a-z0-9\s\/\-\_\:\"\&\.]/i,
  stopwords: stopwords,
  splits_text_on: %r{[\s/\-\_\:\"\&/\.]},
  rejects_token_if: lambda { |token| token.size < 3 }
}
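(As a plain-Ruby aside, with a made-up sample string: the stopwords regex above can be tried out directly, without Picky:)

```ruby
words = %w(are from has its that the was were will with)
stopwords = /\b(#{words.join('|')})\b/i

# Stopwords are removed case-insensitively, and only on word boundaries,
# so "these" is untouched even though it contains "the".
cleaned = "The title was found with these words".gsub(stopwords, '')
puts cleaned.split.join(' ')  # => "title found these words"
```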
The issue you describe would be handled by removing all characters other than a-z0-9, as described above, when indexing and when searching. See http://pickyrb.com/documentation.html#tokenizing-options. In short:

index = Index.new :name do
  indexing tokenizing_hash_or_tokenizer
end

Search.new index do
  searching tokenizing_hash_or_tokenizer
end
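(Another plain-Ruby aside, with made-up sample strings: the removes_characters pattern above strips everything outside the allowed character class:)

```ruby
# Same character class as in the removes_characters option above:
# everything except a-z, 0-9, whitespace and a few allowed symbols is dropped.
removes = /[^a-z0-9\s\/\-\_\:\"\&\.]/i

puts "oval?!".gsub(removes, '')  # => "oval"
puts "100%".gsub(removes, '')    # => "100"
```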
My pleasure – just write if you have more questions :) |
I'd be happy to give you feedback and perhaps write a few samples to help you out. The manual is otherwise well written.
Shoes does not use Bundler. I added the gem and required it before Picky – no difference. Does Picky need to be configured to use Yajl? I would prefer to minimize dependencies whenever possible.
This sounds easy enough. Thanks.
I read it before contacting you, but I don't understand how to use the options. Neither do I understand how to use the stopwords and default_indexing example above. Both indexing and searching throw an error: tokenizing_hash_or_tokenizer' for #<Picky::Index:0x2637e28> (NameError).
My understanding is that in this particular case, for the help manual, only stripping the text before adding it to the index would be enough – no need for partial matching or custom tokenizing. Is that correct?
Shoes is multiplatform and includes Windows, where File.mtime on a directory is equivalent to File.ctime, so that won't work. The workaround for now is |
Thanks – I've added you to a list, so if I get to updating the manual, I may contact you.
Picky does not need to be configured – it will just use it if available. I suggest you give it a try and see if it helps with performance, then consider the tradeoff.
Ah,
You could well strip the offending characters yourself. The default tokenizer options only split on

I suggest you try setting tokenizer options and see how it works:

words = %w(are from has its that the was were will with)
stopwords = /\b(#{words.join('|')})\b/i

# Use the words and stopwords in the options.
index = Picky::Index.new(:some_name) do
  indexing removes_characters: /[^a-z0-9\s\/\-\_\:\"\&\.]/i,
           stopwords: stopwords,
           splits_text_on: %r{[\s/\-\_\:\"\&/\.]},
           rejects_token_if: lambda { |token| token.size < 3 }
  # ... your index config here
end
# Use the same words and stopwords in the search options.
search_interface = Picky::Search.new(index) do
  searching removes_characters: /[^a-z0-9\s\/\-\_\:\"\&\.]/i,
            stopwords: stopwords,
            splits_text_on: %r{[\s/\-\_\:\"\&/\.]},
            rejects_token_if: lambda { |token| token.size < 3 }
  # ... your search config here
end
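(A last plain-Ruby aside, with made-up sample text: the splits_text_on pattern above breaks text into tokens like this:)

```ruby
splits = %r{[\s/\-\_\:\"\&/\.]}

# Text is split into tokens on whitespace, slashes, dashes, underscores,
# colons, double quotes, ampersands and dots.
p "shapes/oval-method draws".split(splits)  # => ["shapes", "oval", "method", "draws"]
```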
Glad you found a solution! It's late here, so I won't be able to respond as quickly. |
@backorder Have you tried running it on Windows? |
Then it did not make any difference.
It all makes sense now that you've provided an example – you should consider adding such an example to the manual. Note that rejects_token_if is misspelled rejectstokenif in the manual.
The example didn't work, but it gives me something to experiment with. Using your example without the stopwords gives me the proper search results, but it modifies the original data, which turns out like this:

It should look like this:

Using the tokenizer is faster than partial, but it does strip the original data. Is there a way to prevent this? |
Yes, you can tell Picky where to take the data from – so you could

Here's an example with an index of book titles:

category :title, :from => lambda { |book| book.title.dup } |
The combination of tokenizing and dup works and still performs better than using partial – in fact it is as fast as without tokenizing, or at least hardly distinguishable. How does using dup affect the memory payload? Right now the manual only has 280 entries, but what would happen if it had 280,000? Before you replied, a small test turned into an interesting observation about dup with index.add: it will still modify the original data. I had to do a deep copy using |
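(Aside: the deep-copy call is cut off above. As a plain-Ruby sketch of why a shallow dup is not enough – a Marshal round-trip is shown here as one common deep-copy approach, not necessarily the one the commenter used; the Entry struct is borrowed from the benchmark earlier:)

```ruby
Entry = Struct.new(:id, :a, :b)

original = Entry.new(1, "Title", "Description")

shallow = original.dup                          # copies the struct, not its strings
deep    = Marshal.load(Marshal.dump(original))  # copies the strings too

shallow.b.downcase!   # mutates the very String object the original also holds
puts original.b       # => "description" -- the original was changed
puts deep.b           # => "Description" -- the deep copy is independent
```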
Re payload – in a way, it does not affect it. I assume that you keep the help HTML/HAML/SLIM (or whatever you are using) on disk – is that right? |
@backorder Should we close this issue or is the index loading/indexing speed still an issue? |
A markdown file that is actually fully loaded into memory.
You can close this issue. The tips you gave me are likely to be the best I can get for now. You may consider that adding to the index might need optimization when comes a larger database. Thanks again for your invaluable help. Picky is awesome. |
@backorder I wonder why the md file needs to be in memory – is it not fast enough when fetched from disk? Indeed, you are right, and I am working on optimisations – using a non-default Hash is currently in the works. Also note that Picky can use a file-based index, though that documentation is sadly inadequate: http://pickyrb.com/documentation.html#indexes-types This makes querying slower, but if you have memory issues and do not index while running, it would be an option. Thanks for the kind words – feel free to open another issue whenever you have questions. |
It is only 280 entries, for a total of 120 KB on disk – better to load it into memory, where the markdown is parsed only once and is always available for rendering. Isn't the default backend File, the one that creates the multiple JSON files in index/development/terms? It should be mentioned in your manual that it is the default backend. We previously talked about having a single-file index; my understanding is that it would be possible to write a new backend to do exactly this – correct? Not that I want to do that right now, but it is interesting to know. Where is the source for Picky located on GitHub? There are so many choices: client, live, server, etc. The gem installed here is simply |
Almost. The default backend is in memory. And if you call

The |
Hi Florian,
I am contributing to https://github.com/Shoes3 and migrated the help system search from ftsearch to Picky. Picky works well and is a clear improvement over the very old, unmaintained and capriciously crashing ftsearch. There is nonetheless one thing bothering me: adding to the index is very slow – slow enough to be noticeable when loading the help system, as it momentarily freezes the Shoes splash screen.
The help system only contains 280 entries (title and descriptions).
Also, to fully match the behaviour of the previous search I would need to use partial matching, but that makes index generation even slower. Provided is a benchmark on Ruby 2.1.5, Picky 4.26.1 on Windows 8, plus a test script representative of how I use Picky.
Caching would be good to know about and might be an option, but I would prefer to stick to real-time index building; otherwise it increases the maintenance burden on Shoes (needing to regenerate the index file each time we modify the help manual... not cool). Right now the index is built the first time the manual is loaded – no latency with ftsearch, but some with Picky.