feat: Support multithreaded and multiprocess environments
- acquire an index writer on every writing operation
- if `exclusive_writer` is set to `true`, acquire it only once
- add `transaction` method to group writing operations

BREAKING CHANGE: `commit` method is no longer public
baygeldin committed Mar 17, 2022
1 parent 7c6001e commit 053b4a0
Showing 17 changed files with 387 additions and 88 deletions.
35 changes: 28 additions & 7 deletions README.md
@@ -7,9 +7,9 @@

Need a fast full-text search for your Ruby script, but Solr and Elasticsearch are overkill? 😏

You're in the right place. **Tantiny** is a minimalistic full-text search library for Ruby based on [Tantivy](https://github.com/quickwit-oss/tantivy) (an awesome alternative to Apache Lucene written in Rust). It's great for cases when your task at hand requires a full-text search, but configuring a full-blown distributed search engine would take more time than the task itself. And even if you already use such an engine in your project (which is highly likely, actually), it still might be easier to just use Tantiny instead because unlike Solr and Elasticsearch it doesn't need *anything* to work (no separate server or process or whatever), it's purely embeddable. So, when you find yourself in a situation where using your search engine of choice would be tricky/inconvenient or would require additional setup you can always fall back to a quick and dirty solution that is nonetheless flexible and fast.
You're in the right place. **Tantiny** is a minimalistic full-text search library for Ruby based on [Tanti**v**y](https://github.com/quickwit-oss/tantivy) (an awesome alternative to Apache Lucene written in Rust). It's great for cases when your task at hand requires a full-text search, but configuring a full-blown distributed search engine would take more time than the task itself. And even if you already use such an engine in your project (which is highly likely, actually), it still might be easier to just use Tantiny instead because unlike Solr and Elasticsearch it doesn't need *anything* to work (no separate server or process or whatever), it's purely embeddable. So, when you find yourself in a situation where using your search engine of choice would be tricky/inconvenient or would require additional setup you can always fall back to a quick and dirty solution that is nonetheless flexible and fast.

Tantiny is not exactly bindings to Tantivy, but it tries to be close. The main philosophy is to provide low-level access to Tantivy's inverted index, but with a nice Ruby-esque API, sensible defaults, and additional functionality sprinkled on top.
Tantiny is not exactly Ruby bindings to Tantivy, but it tries to be close. The main philosophy is to provide low-level access to Tantivy's inverted index, but with a nice Ruby-esque API, sensible defaults, and additional functionality sprinkled on top.

Take a look at the most basic example:

@@ -20,7 +20,6 @@ index << { id: 1, description: "Hello World!" }
index << { id: 2, description: "What's up?" }
index << { id: 3, description: "Goodbye World!" }

index.commit
index.reload

index.search("world") # 1, 3
@@ -91,6 +90,8 @@ rio_bravo = OpenStruct.new(
release_date: Date.parse("March 18, 1959")
)

index << rio_bravo

hanabi = {
imdb_id: "tt0119250",
type: "/crime/Japan",
@@ -101,6 +102,8 @@ hanabi = {
release_date: Date.parse("December 1, 1998")
}

index << hanabi

brother = {
imdb_id: "tt0118767",
type: "/crime/Russia",
@@ -111,8 +114,6 @@ brother = {
release_date: Date.parse("December 12, 1997")
}

index << rio_bravo
index << hanabi
index << brother
```

@@ -129,12 +130,32 @@ You can also delete it if you want:
index.delete(rio_bravo.imdb_id)
```

After that you need to commit the index for the changes to take place:
### Transactions

If you need to perform multiple writing operations, you should always use `transaction`:

```ruby
index.transaction do
index << rio_bravo
index << hanabi
index << brother
end
```

Transactions group changes and [commit](https://docs.rs/tantivy/latest/tantivy/struct.IndexWriter.html#method.commit) them to the index in one go. This is *dramatically* more efficient than performing these changes one by one. In fact, all writing operations (i.e. `<<` and `delete`) are wrapped in a transaction implicitly when you call them outside of a transaction, so calling `<<` 10 times outside of a transaction is the same thing as performing 10 separate transactions.
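To see why batching matters, here is a toy model (illustration only, not Tantiny's actual internals) of the implicit-transaction rule described above: every write outside a `transaction` commits on its own, while writes inside one share a single commit.

```ruby
# Toy model of implicit transactions. Each top-level write commits once;
# nested writes reuse the surrounding transaction and add no commits.
class ToyIndex
  attr_reader :commits

  def initialize
    @commits = 0
    @inside = false
  end

  def transaction
    return yield if @inside # nested calls reuse the open transaction

    @inside = true
    yield
    @commits += 1 # one commit for the whole block
    @inside = false
    nil
  end

  def <<(_document)
    transaction { } # every write is implicitly wrapped in a transaction
  end
end

index = ToyIndex.new
10.times { |i| index << i } # 10 implicit transactions
index.commits               # => 10

batched = ToyIndex.new
batched.transaction { 10.times { |i| batched << i } }
batched.commits             # => 1
```

Since each commit does real work in Tantivy, the batched version is the dramatically cheaper one.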

### Concurrency and thread-safety

Tantiny is thread-safe, meaning that you can safely share a single instance of the index between threads. You can also spawn separate processes that write to and read from the same index. However, while reading from the index can happen in parallel, writing to it can**not**. Whenever you call `transaction` or any other operation that modifies the index (i.e. `<<` and `delete`), it will lock the index for the duration of the operation or wait for another process or thread to release the lock. The only exception is when another process is running an index with an exclusive writer, in which case the methods that modify the index will fail immediately.

Thus, it's best to have a single writer process and many reader processes if you want to avoid blocking calls. The proper way to do this is to set `exclusive_writer` to `true` when initializing the index:

```ruby
index.commit
index = Tantiny::Index.new("/path/to/index", exclusive_writer: true) {}
```

This way the [index writer](https://docs.rs/tantivy/latest/tantivy/struct.IndexWriter.html) will only be acquired once, which means the memory and indexing threads for it will only be allocated once as well. Otherwise, a new index writer is acquired every time you perform a writing operation.
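The fail-fast behavior comes down to a writer lockfile. Here is a sketch of the underlying mechanism using plain `File#flock` on a temporary file rather than a real Tantiny index (the file name is illustrative):

```ruby
require "tempfile"

tmp = Tempfile.new("tantiny-writer-lock")
lock_path = tmp.path

# The exclusive-writer process takes the writer lock once and holds it.
writer = File.open(lock_path, File::RDWR)
writer.flock(File::LOCK_EX)

# A second handle stands in for another process. A non-blocking attempt
# to take the same lock fails immediately instead of waiting.
other = File.open(lock_path, File::RDWR)
busy = other.flock(File::LOCK_EX | File::LOCK_NB) # => false (lock is held)

writer.flock(File::LOCK_UN)
free = other.flock(File::LOCK_EX | File::LOCK_NB) # => 0 (acquired)
```

`flock` returns `false` when `LOCK_NB` is given and the lock is taken, which is exactly the "fail immediately" behavior described above.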

## Searching

Make sure that your index is up-to-date by reloading it first:
17 changes: 11 additions & 6 deletions bin/console
@@ -7,9 +7,13 @@ require "pry"
require "tantiny"

path = File.join(__dir__, "../tmp")
en_stem = Tantiny::Tokenizer.new(:stemmer, language: :en)

index = Tantiny::Index.new path, tokenizer: en_stem do
options = {
tokenizer: Tantiny::Tokenizer.new(:stemmer, language: :en),
exclusive_writer: true,
}

index = Tantiny::Index.new(path, **options) do
id :imdb_id
facet :category
string :title
@@ -49,11 +53,12 @@ brother = {
release_date: Date.parse("December 12, 1997")
}

index << rio_bravo
index << hanabi
index << brother
index.transaction do
index << rio_bravo
index << hanabi
index << brother
end

index.commit
index.reload

binding.pry
2 changes: 1 addition & 1 deletion ext/Rakefile
@@ -1,5 +1,5 @@
require "thermite/tasks"

project_dir = File.dirname(File.dirname(__FILE__))
project_dir = File.dirname(__FILE__, 2)
Thermite::Tasks.new(cargo_project_path: project_dir, ruby_project_path: project_dir)
task default: %w[thermite:build]
2 changes: 2 additions & 0 deletions lib/tantiny.rb
@@ -4,6 +4,8 @@
RubyNext::Language.setup_gem_load_path

require "rutie"
require "concurrent"
require "fileutils"

require "tantiny/version"
require "tantiny/errors"
13 changes: 11 additions & 2 deletions lib/tantiny/errors.rb
@@ -3,9 +3,18 @@
module Tantiny
class TantivyError < StandardError; end

class UnknownField < StandardError
class IndexWriterBusyError < StandardError
def initialize
super("Can't find the specified field in the schema.")
msg = "Failed to acquire an index writer. "\
"Is there an active index with an exclusive writer already?"

super(msg)
end
end

class UnexpectedNone < StandardError
def initialize(type)
super("Didn't expect Option<#{type}> to be empty.")
end
end

10 changes: 10 additions & 0 deletions lib/tantiny/helpers.rb
@@ -5,5 +5,15 @@ module Helpers
def self.timestamp(date)
date.to_datetime.iso8601
end

def self.with_lock(lockfile)
File.open(lockfile, File::CREAT) do |file|
file.flock(File::LOCK_EX)

yield

file.flock(File::LOCK_UN)
end
end
end
end
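The new `Helpers.with_lock` is a thin wrapper around `File#flock`. A standalone sketch of the same pattern (reimplemented here so it runs without the gem), showing that sequential critical sections each acquire and release the lock:

```ruby
require "tempfile"

# Same shape as Tantiny::Helpers.with_lock: open the lockfile (creating
# it if needed), take an exclusive lock, run the block, then unlock.
def with_lock(lockfile)
  File.open(lockfile, File::CREAT) do |file|
    file.flock(File::LOCK_EX)
    yield
    file.flock(File::LOCK_UN)
  end
end

tmp = Tempfile.new("tantiny-demo-lock")
events = []

with_lock(tmp.path) { events << :first }
with_lock(tmp.path) { events << :second }
events # => [:first, :second]
```

One subtlety: if the block raises, the explicit `LOCK_UN` is skipped, but `File.open`'s block form still closes the file, and closing releases `flock` locks, so the lock doesn't leak.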
118 changes: 96 additions & 22 deletions lib/tantiny/index.rb
@@ -1,23 +1,19 @@
# frozen_string_literal: true

require "fileutils"

module Tantiny
class Index
DEFAULT_INDEX_SIZE = 50_000_000
LOCKFILE = ".tantiny.lock"
DEFAULT_WRITER_MEMORY = 5_000_000 # 5MB
DEFAULT_LIMIT = 10

def self.new(path, **options, &block)
index_size = options[:size] || DEFAULT_INDEX_SIZE
default_tokenizer = options[:tokenizer] || Tokenizer.default

FileUtils.mkdir_p(path)

default_tokenizer = options[:tokenizer] || Tokenizer.default
schema = Schema.new(default_tokenizer, &block)

object = __new(
path.to_s,
index_size,
schema.default_tokenizer,
schema.field_tokenizers.transform_keys(&:to_s),
schema.text_fields.map(&:to_s),
@@ -28,35 +24,64 @@ def self.new(path, **options, &block)
schema.facet_fields.map(&:to_s)
)

object.send(:schema=, schema)
object.send(:initialize, path, schema, **options)

object
end

def initialize(path, schema, **options)
@path = path
@schema = schema

@indexer_memory = options[:writer_memory] || DEFAULT_WRITER_MEMORY
@exclusive_writer = options[:exclusive_writer] || false

@active_transaction = Concurrent::ThreadLocalVar.new(false)
@transaction_semaphore = Mutex.new

acquire_index_writer if exclusive_writer?
end

attr_reader :schema

def commit
__commit
def transaction
if inside_transaction?
yield
else
synchronize do
open_transaction!

yield

close_transaction!
end
end

nil
end

def reload
__reload
end

def <<(document)
__add_document(
resolve(document, schema.id_field).to_s,
slice_document(document, schema.text_fields) { |v| v.to_s },
slice_document(document, schema.string_fields) { |v| v.to_s },
slice_document(document, schema.integer_fields) { |v| v.to_i },
slice_document(document, schema.double_fields) { |v| v.to_f },
slice_document(document, schema.date_fields) { |v| Helpers.timestamp(v) },
slice_document(document, schema.facet_fields) { |v| v.to_s }
)
transaction do
__add_document(
resolve(document, schema.id_field).to_s,
slice_document(document, schema.text_fields) { |v| v.to_s },
slice_document(document, schema.string_fields) { |v| v.to_s },
slice_document(document, schema.integer_fields) { |v| v.to_i },
slice_document(document, schema.double_fields) { |v| v.to_f },
slice_document(document, schema.date_fields) { |v| Helpers.timestamp(v) },
slice_document(document, schema.facet_fields) { |v| v.to_s }
)
end
end

def delete(id)
__delete_document(id.to_s)
transaction do
__delete_document(id.to_s)
end
end

def search(query, limit: DEFAULT_LIMIT, **smart_query_options)
@@ -83,8 +108,6 @@ def search(query, limit: DEFAULT_LIMIT, **smart_query_options)

private

attr_writer :schema

def slice_document(document, fields, &block)
fields.inject({}) do |hash, field|
hash.tap { |h| h[field.to_s] = resolve(document, field) }
@@ -94,5 +117,56 @@ def slice_document(document, fields, &block)
def resolve(document, field)
document.is_a?(Hash) ? document[field] : document.send(field)
end

def acquire_index_writer
__acquire_index_writer(@indexer_memory)
rescue TantivyError => e
case e.message
when /Failed to acquire Lockfile/
raise IndexWriterBusyError.new
else
raise
end
end

def release_index_writer
__release_index_writer
end

def commit
__commit
end

def open_transaction!
acquire_index_writer unless exclusive_writer?

@active_transaction.value = true
end

def close_transaction!
commit

release_index_writer unless exclusive_writer?

@active_transaction.value = false
end

def inside_transaction?
@active_transaction.value
end

def exclusive_writer?
@exclusive_writer
end

def synchronize(&block)
@transaction_semaphore.synchronize do
Helpers.with_lock(lockfile_path, &block)
end
end

def lockfile_path
@lockfile_path ||= File.join(@path, LOCKFILE)
end
end
end
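`@active_transaction` above is a `Concurrent::ThreadLocalVar`, so "am I inside a transaction?" is answered per thread. A sketch of why that matters, using Ruby's built-in thread-locals in place of `concurrent-ruby` (the class and key names here are illustrative):

```ruby
# Per-thread "inside a transaction" flag. Thread#thread_variable_get/set
# stands in for Concurrent::ThreadLocalVar: each thread sees its own value,
# so one thread being inside a transaction never fools another thread into
# skipping lock acquisition.
class TxState
  KEY = :tantiny_active_tx

  def inside?
    Thread.current.thread_variable_get(KEY) || false
  end

  def open!
    Thread.current.thread_variable_set(KEY, true)
  end
end

state = TxState.new
state.open! # the main thread enters a transaction

other_thread_view = Thread.new { state.inside? }.value

state.inside?     # => true  (this thread)
other_thread_view # => false (other threads are unaffected)
```

Note that `thread_variable_get`/`thread_variable_set` are truly thread-local (not fiber-local), matching the semantics a thread-local transaction flag needs.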
2 changes: 2 additions & 0 deletions sig/tantiny/helpers.rbs
@@ -2,5 +2,7 @@
module Tantiny
module Helpers
def self.timestamp: ((Date | DateTime) date) -> String

def self.with_lock: (String lockfile) { (*untyped) -> void } -> void
end
end
