Esmy

Esmy is a library for full-text search, written in Rust. It is inspired by Lucene, but aims to be more flexible.

Features

Text indexing with different analyzers.
Text search, including phrases.
Parallel indexing
Document deletions
Quite fast

Roadmap

Document scoring
~~Document deletions~~
Doc-values data structures (fast access to values of fields)
Improve merge concurrency
More query types (e.g. spans, more boolean logic)

Example

let schema = SegmentSchemaBuilder::new()
    .add_full_doc("full_doc_feature") //features have names
    .add_string_index(
        "text_string_index",
        "text",
        Box::new(UAX29Analyzer::new())) //Unicode tokenization
    .build();

let index = IndexBuilder::new().create("path/to/index", schema).unwrap();
let doc1 = Doc::new().string_field("text", "The quick brown fox jumps over the lazy dog");
index.add_doc(doc1).unwrap();
let doc2 = Doc::new().string_field("text", "Foxes are generally smaller than some other members of the family Canidae");
index.add_doc(doc2).unwrap();
index.commit().unwrap();

let query = TextQuery::new(
    "text",                         //field
    "brown fox",                    //value
    Box::new(UAX29Analyzer::new()), //Search with the same analyzer as we indexed
);
let mut collector = CountCollector::new();
let reader = index.open_reader().unwrap();
reader.search(&query, &mut collector).unwrap();
assert_eq!(1, collector.total_count());

Design

Esmy is an information retrieval system that takes a lot of inspiration from Lucene. The main idea is to have an inverted index, which lets you look up which documents contain a given term. However, additional data structures are often needed to visualize or process the data, e.g. to create histograms of result sets or to do geo-search. Esmy is therefore structured to accommodate new data structures.
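To make the inverted index idea concrete, here is a minimal in-memory sketch. It is not Esmy's implementation or on-disk format: the function name `build_inverted_index` is made up for illustration, and the naive split-and-lowercase tokenizer only stands in for a real analyzer.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy inverted index: maps each term to the sorted set of ids of the
/// documents that contain it. Hypothetical helper, not part of Esmy.
fn build_inverted_index(docs: &[&str]) -> BTreeMap<String, BTreeSet<usize>> {
    let mut index: BTreeMap<String, BTreeSet<usize>> = BTreeMap::new();
    for (doc_id, text) in docs.iter().enumerate() {
        // Naive analyzer (an assumption): split on non-alphanumeric
        // characters and lowercase each token.
        for term in text
            .split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
        {
            index.entry(term.to_lowercase()).or_default().insert(doc_id);
        }
    }
    index
}
```

Looking up a term is then a single map access, e.g. `index.get("fox")`, which returns the ids of all documents containing that exact term.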

Esmy, like Lucene, is structured around indexes and segments. A segment is a collection of on-disk data structures, and an index is a set of segments. Segments are immutable. Documents added to Esmy are at some point committed to disk, at which point a new segment is created. Over time, this produces many small segments. To keep the number of small segments down, Esmy can merge segments into larger ones. The on-disk data structures of the segments can then be used to do something useful, e.g. searching for text.
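The commit and merge cycle described above can be sketched with a small in-memory model. Everything here (`ToyIndex`, `Segment`, `merge_all`) is hypothetical and kept in memory for brevity; Esmy's real segments live on disk and its merge policy is not shown in this document.

```rust
use std::collections::BTreeMap;

/// A committed segment: term -> sorted doc ids, local to the segment.
/// It is never modified after creation.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    doc_count: u32,
    postings: BTreeMap<String, Vec<u32>>,
}

/// Hypothetical index: pending (uncommitted) docs plus a set of segments.
#[derive(Default)]
struct ToyIndex {
    pending: Vec<String>,
    segments: Vec<Segment>,
}

impl ToyIndex {
    fn add_doc(&mut self, text: &str) {
        self.pending.push(text.to_string());
    }

    /// Turn the pending documents into one new segment.
    fn commit(&mut self) {
        if self.pending.is_empty() {
            return;
        }
        let mut postings: BTreeMap<String, Vec<u32>> = BTreeMap::new();
        for (doc_id, text) in self.pending.iter().enumerate() {
            for term in text.split_whitespace() {
                let ids = postings.entry(term.to_lowercase()).or_default();
                // Record each document at most once per term.
                if ids.last() != Some(&(doc_id as u32)) {
                    ids.push(doc_id as u32);
                }
            }
        }
        let doc_count = self.pending.len() as u32;
        self.segments.push(Segment { doc_count, postings });
        self.pending.clear();
    }

    /// Build one larger segment from the existing ones and swap it in.
    /// The input segments are only read, never changed.
    fn merge_all(&mut self) {
        if self.segments.len() < 2 {
            return;
        }
        let mut merged = Segment { doc_count: 0, postings: BTreeMap::new() };
        for seg in &self.segments {
            // Doc ids of later segments shift by the docs already merged.
            let offset = merged.doc_count;
            for (term, ids) in &seg.postings {
                merged
                    .postings
                    .entry(term.clone())
                    .or_default()
                    .extend(ids.iter().map(|id| id + offset));
            }
            merged.doc_count += seg.doc_count;
        }
        self.segments = vec![merged];
    }
}
```

Because segments are immutable, merging builds a fresh segment and replaces the old ones rather than editing them in place.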

Apart from not being on the JVM, there are a few differences from Lucene.

One is that Lucene treats the inverted index as the core of the library. While it is an important feature of Esmy, it is only one kind of useful data structure. Esmy instead has a concept of a segment feature. The inverted index is one such segment feature. The requirements on a segment feature are that you can create one from a set of documents, and that the feature can merge files that it wrote into larger files.
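The two requirements on a segment feature could be expressed as a trait along these lines. This is a sketch under assumptions, not Esmy's actual trait: the names `SegmentFeature`, `build`, `merge` and `FieldValues` are invented here, and the data is kept in memory instead of being written to files.

```rust
use std::collections::BTreeMap;

/// A document as a flat map of field name -> value (a simplification).
type Doc = BTreeMap<String, String>;

/// Hypothetical trait mirroring the two requirements in the text.
trait SegmentFeature {
    /// What this feature stores for one segment.
    type Data;
    /// Create the feature's data from a set of documents.
    fn build(&self, docs: &[Doc]) -> Self::Data;
    /// Merge data this feature produced earlier into one larger piece.
    fn merge(&self, parts: Vec<Self::Data>) -> Self::Data;
}

/// A feature that is not an inverted index: a column holding the value
/// of one field per document, where the doc id is the position.
struct FieldValues {
    field: String,
}

impl SegmentFeature for FieldValues {
    type Data = Vec<Option<String>>;

    fn build(&self, docs: &[Doc]) -> Self::Data {
        docs.iter().map(|doc| doc.get(&self.field).cloned()).collect()
    }

    fn merge(&self, parts: Vec<Self::Data>) -> Self::Data {
        // Segments are concatenated in order, so positions (doc ids)
        // of later segments shift by the lengths of the earlier ones.
        parts.into_iter().flatten().collect()
    }
}
```

An inverted index would be another implementation of the same trait, with postings as its `Data` and a merge that rewrites doc ids.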

Features are identified by names, and since they are decoupled from fields you can add more than one type of index for a particular field. This means that you can, for example, have a document indexed with different analyzers without needing a separate field for each, as you would in Lucene.
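A small sketch of what decoupling features from fields buys you: two named index features read the same field but analyze it differently. The types and functions below (`IndexFeature`, `index_docs`, `lowercase_words`, `exact_value`) are illustrative assumptions and do not correspond to Esmy's API.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// An analyzer turns a field value into index terms.
type Analyzer = fn(&str) -> Vec<String>;

fn lowercase_words(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

fn exact_value(text: &str) -> Vec<String> {
    vec![text.to_string()]
}

/// One named index feature: which field it reads and how it analyzes it.
struct IndexFeature {
    name: &'static str,
    field: &'static str,
    analyzer: Analyzer,
}

/// Postings keyed by feature name: feature -> term -> doc ids.
type Postings = BTreeMap<String, BTreeMap<String, BTreeSet<usize>>>;

fn index_docs(features: &[IndexFeature], docs: &[BTreeMap<String, String>]) -> Postings {
    let mut out = Postings::new();
    for feature in features {
        let terms = out.entry(feature.name.to_string()).or_default();
        for (doc_id, doc) in docs.iter().enumerate() {
            // Several features may read the same field.
            if let Some(value) = doc.get(feature.field) {
                for term in (feature.analyzer)(value.as_str()) {
                    terms.entry(term).or_default().insert(doc_id);
                }
            }
        }
    }
    out
}
```

With features named, say, `title_words` and `title_exact` both pointing at a `title` field, a query picks the behavior it wants by feature name, and the document itself still has only one `title` field.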

Another difference is that Esmy has a more opinionated (but open) view of what a document is. Lucene treats a document as a set of fields at input, but has no notion of a document when reading. This leads to e.g. Elasticsearch emulating documents with a JSON structure, by storing the JSON as a string field (_source). Since Lucene is not Elasticsearch, Lucene itself cannot use that _source field. Esmy instead has a notion of a document, and an on-disk data structure for it. This means that Esmy can use the document.

License

This repository is licensed under the Apache License, Version 2.0, with the exception of the data in the data directory, which comes from Wikipedia and is only used for testing purposes.
