
kotlin-search

This is a simple multi-language index implementation that can be embedded into applications. It is useful for applications that require full-text search and natural language processing support without adding a full-featured search engine server like Elasticsearch.

This library is built on top of Lucene and it does not implement any search or NLP algorithms itself. It wraps Lucene to provide the following features:

  • Multi-language search index
  • Text normalization
  • Multi-language stemming and tokenization

Dependency

This library is available on Maven Central. It can be added using the following dependency:

<dependency>
    <groupId>be.rlab</groupId>
    <artifactId>kotlin-search</artifactId>
    <version>1.4.1</version>
</dependency>
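
Or, if you use Gradle (Kotlin DSL), the equivalent coordinate is:

implementation("be.rlab:kotlin-search:1.4.1")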

This version supports Kotlin 1.9 and Lucene 9.

The only two hard dependencies are SLF4J and commons-codec. To avoid classpath conflicts, we will not add more dependencies unless strictly necessary.

Multi-language search index

The IndexManager is the component that provides access to the search index. It allows you to index, retrieve, and search for documents. Documents are dictionaries (sets of key/value fields) scoped to a namespace.

The IndexManager is a file-system-based index. In order to support multiple languages, it creates one index per language, so documents in different languages are physically stored in different indexes. This makes inserting and searching for documents very efficient, since the index does not need to perform a range query to retrieve all documents for a language. The trade-off is that searching for documents in different languages simultaneously is penalized.

You can use the IndexManager.Builder class to build and configure an IndexManager:

import be.rlab.nlp.model.Language
import be.rlab.search.IndexManager
import org.apache.lucene.search.similarities.BM25Similarity

// This is the default configuration if you don't specify a different set of options.
val indexManager: IndexManager = IndexManager.Builder("/tmp/lucene-index")
  .forLanguages(Language.entries)
  .withSimilarity(BM25Similarity())
  .build()

Before reading the following sections, we strongly recommend reading Search and Scoring and Classic Scoring Formula in the Lucene documentation. This library uses Lucene's default scoring algorithm (Okapi BM25) to match documents.
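
For instance, if you prefer Lucene's classic TF-IDF scoring, you can pass a different Similarity implementation to the builder. A minimal sketch using Lucene's ClassicSimilarity:

import org.apache.lucene.search.similarities.ClassicSimilarity

// Same builder as above, but with Lucene's classic TF-IDF similarity.
val tfIdfIndex: IndexManager = IndexManager.Builder("/tmp/lucene-index")
    .forLanguages(Language.entries)
    .withSimilarity(ClassicSimilarity())
    .build()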

Documents

The Document is the root entity that represents an entry in the index. Each document in the Lucene index is a plain list of key -> value fields. Lucene supports multi-value fields, which means the same key might be stored multiple times with different values in a document. In order to support both single-value and multi-value fields, we use a List<Any> type to store the values in a Field.

Documents have a 160-bit unique identifier composed of the following fields:

<32-bits hash of the Language><96-bits timestamp><32-bits unique id>

The language is included in the identifier in order to resolve which index should be queried to retrieve a document. The identifier also includes a timestamp, so sorting documents by id produces a collection sorted by creation date. If you need a document's language, the be.rlab.search.Hashes.getLanguage(id) utility method extracts it from the document's identifier.
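
For example (a minimal sketch; documentId stands for the identifier of any previously indexed document):

import be.rlab.nlp.model.Language
import be.rlab.search.Hashes

// Extracts the language encoded in the 160-bit document identifier.
val language: Language = Hashes.getLanguage(documentId)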

The document namespace emulates domain collections. All queries are scoped to a namespace, so querying the index is analogous to querying a collection in a NoSQL database.

Document Schemas

Lucene fields have some attributes that determine how the field is processed at index time. The stored attribute tells Lucene to store the field value in the index. The indexed attribute indicates that a field will be used for search, so it needs to be processed for that purpose. The docValues attribute marks a field to be saved in a dedicated document-level space, which makes sorting and faceting much faster.

Each Lucene data type provides a default value for all these attributes (look at the data types section below). If you want to change the behavior of a field, you would have to change it for each field in each document. To make this easier, kotlin-search introduces the concept of document schemas. A document schema allows you to pre-define a set of fields and their preferred attributes for a document. It provides data type validation out of the box, and it can be added to the IndexManager.

The following example adds a document schema for the namespace players using the Functional DSL. For the Object Mapper, the document schema is automatically created based on the annotations (look at the Indexing Documents section below).

import be.rlab.search.query.*

indexManager.apply {
    addSchema("players") {
        string("id")
        text("firstName")
        text("lasName")
        int("age") {
            store()
            index()
            docValues()
        }
        float("score") {
            index()
            docValues()
        }
    }
}

Data Types

Lucene supports only a few native data types. The following table shows the default attributes for each data type.

| Field type | Stored | Indexed | Description |
|------------|--------|---------|-------------|
| string     | yes    | no      | A String value stored exactly as it is provided. |
| text       | yes    | yes     | A String value that is tokenized and pre-processed by language analyzers. |
| int        | no     | no      | A multi-dimensional Int value for fast range filters. |
| long       | no     | no      | A multi-dimensional Long value for fast range filters. |
| float      | no     | no      | A multi-dimensional Float value for fast range filters. |
| double     | no     | no      | A multi-dimensional Double value for fast range filters. |

In order to support additional Kotlin types, kotlin-search provides a flexible FieldTypeMapper interface with a default implementation for the native Lucene types. Custom implementations can be registered only through the IndexMapper. So far the standard IndexManager does not support custom mappers, but we might consider adding support if there are valid use cases.

Lucene does not support storing multi-dimensional fields, since they are packed as a BytesRef value. This library does not support BytesRef field types yet.

The following table shows the default mapping from Kotlin to Lucene types.

| Lucene Type | Kotlin Type(s) | Nullable |
|-------------|----------------|----------|
| string      | String, List   | yes      |
| text        | String, List   | yes      |
| int         | Int, List      | yes      |
| long        | Long, List     | yes      |
| float       | Float, List    | yes      |
| double      | Double, List   | yes      |

You can take a look at the SimpleTypeMapper and ListTypeMapper components for further information.
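
For example, a multi-value field maps to a List of the corresponding Kotlin type. The following sketch uses the object mapper annotations described in the Indexing Documents section below; PlayerProfile and nicknames are hypothetical names:

import be.rlab.search.*
import be.rlab.search.annotation.*

@IndexDocument(namespace = "players")
data class PlayerProfile(
    @IndexField @IndexFieldType(FieldType.STRING) val id: String,
    // Multi-value text field: each element is stored under the same key.
    @IndexField val nicknames: List<String>
)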

Indexing Documents

kotlin-search provides two strategies to index documents: a functional DSL and an object mapper. In order to keep backward compatibility with older versions, the two strategies cannot be mixed: if you indexed documents using the functional DSL, you cannot use the object mapper for searching.

There is a plan to support older indexes in the future, but it will require some extra configuration.

The following examples create a new document within the players namespace in the Spanish index.

Functional DSL

import be.rlab.nlp.model.Language
import be.rlab.search.IndexManager

indexManager.index("players", Language.SPANISH) {
    string("id", "player-id-1234")
    text("firstName", "Juan")
    text("lastName", "Pérez")
    int("age", 27) {
        store()
    }
    float("score", 10.0F)
}

Object Mapper

The object mapper strategy allows you to use a data class to define a Lucene document structure. It uses a set of FieldTypeMappers to transform Kotlin objects to Lucene documents and vice versa.

By default, fields are stored and indexed according to the Lucene default behavior for the data type, but you can override this behavior by setting the store and index attributes in the @IndexField annotation. If a field is marked as not stored, it must be nullable.

You can override the default Lucene type using the @IndexFieldType annotation.

Note that the object mapper is strictly designed to map Lucene documents. You should not annotate your domain entities, since it probably won't work as expected. The Kotlin field types are restricted to the supported Lucene field types; using other types will cause an error. If you want to map your custom types, you need to implement a FieldTypeMapper.

import be.rlab.nlp.model.Language
import be.rlab.nlp.model.BoolValue
import be.rlab.search.*
import be.rlab.search.annotation.*

@IndexDocument(namespace = "players")
data class Player(
    @IndexField @IndexFieldType(FieldType.STRING) val id: String,
    @IndexField val firstName: String,
    @IndexField val lastName: String,
    @IndexField(index = BoolValue.YES) val age: Int,
    @IndexField(store = BoolValue.YES) val score: Float?
)

val mapper = IndexMapper(indexManager)

mapper.index(Player(
    id = "player-id-1234",
    firstName = "Juan",
    lastName = "Pérez",
    age = 27,
    score = 10.0F
), Language.SPANISH)

Searching

kotlin-search provides a functional DSL and an object mapper to build Lucene queries. All queries (look at the table below) provide the following types of search:

  • By a single field using the field name
  • By a single field using a class annotated property
  • By all fields in the document

To search by all fields in the document, all queries support a signature without the field name. You can search by all fields or use the by modifier to restrict the search to a list of fields (look at the example below).

Functional DSL

In order to search by all fields using the functional DSL, you need to define a schema for the index namespace. The schema only needs to be defined once. Defining a schema is optional: if you don't plan to search by multiple fields, you can skip it.

import be.rlab.search.query.*

indexManager.apply {
    addSchema("players") {
        string("id")
        text("firstName")
        text("lasName")
        int("age")
        float("score")
    }
}

indexManager.search("players", Language.SPANISH) {
    // Search in all fields defined in the schema.
    term("Juan") {
        // Optionally you can restrict the search to a list of fields;
        // if not specified, it searches all fields.
        by("firstName")
    }
    range("age", 20, 30)
}

Object Mapper

import be.rlab.search.query.*

val mapper = IndexMapper(indexManager)

mapper.search<Player>(Language.SPANISH) {
    term(Player::firstName, "Juan")
    range(Player::age, 20, 30)
}

Both the functional DSL and the object mapper support the following types of queries:

| Query type | Field types  | Default boolean clause |
|------------|--------------|------------------------|
| term       | all          | MUST                   |
| range      | all          | SHOULD                 |
| wildcard   | string, text | MUST                   |
| regex      | string, text | MUST                   |
| fuzzy      | string, text | MUST                   |
| phrase     | string, text | MUST                   |

If you need to build a custom query, the QueryBuilder provides the custom() method, which receives the current BooleanQuery under construction.
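
A minimal sketch, assuming the query under construction is exposed as Lucene's BooleanQuery.Builder (the exact callback signature may differ):

import org.apache.lucene.index.Term
import org.apache.lucene.search.BooleanClause
import org.apache.lucene.search.TermQuery

indexManager.search("players", Language.SPANISH) {
    custom { builder ->
        // Adds a raw Lucene term query to the boolean query under construction.
        builder.add(TermQuery(Term("firstName", "juan")), BooleanClause.Occur.MUST)
    }
}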

All queries have additional parameters that are initialized to the default Lucene values. If you need to boost a query, you can apply the boost as a modifier:

indexManager.search("players", Language.SPANISH) {
    term("firstName", "Juan") {
        boost(0.5F)
    }
}

Faceted search is not supported yet.

Sorting

Sorting search results requires marking fields as DocValues before indexing. DocValues is a fast document-level index used for sorting and faceting. You can mark a field using either the Functional DSL or the Object Mapper.

The query builder provides a sortBy() clause that allows you to specify a list of fields to sort results by. Note that using the sortBy() clause requires a valid document schema. Sorting documents without an explicit document schema is not supported, since the sorting criteria depend on the data type (look at the Document Schemas section above).

Functional DSL

import be.rlab.nlp.model.Language
import be.rlab.search.IndexManager

indexManager.index("players", Language.SPANISH) {
    string("id", "player-id-1234")
    text("firstName", "Juan")
    text("lastName", "Pérez")
    int("age", 27) {
        store()
    }
    float("score", 10.0F) {
        docValues()
    }
}

indexManager.search("players", Language.SPANISH) {
    range("age", 20, 30)
    sortBy("score")
}

Object Mapper

import be.rlab.nlp.model.Language
import be.rlab.nlp.model.BoolValue
import be.rlab.search.*
import be.rlab.search.annotation.*

@IndexDocument(namespace = "players")
data class Player(
    @IndexField @IndexFieldType(FieldType.STRING) val id: String,
    @IndexField val firstName: String,
    @IndexField val lastName: String,
    @IndexField(index = BoolValue.YES) val age: Int,
    @IndexField(store = BoolValue.YES, docValues = true) val score: Float?
)

val mapper = IndexMapper(indexManager)

mapper.index(Player(
    id = "player-id-1234",
    firstName = "Juan",
    lastName = "Pérez",
    age = 27,
    score = 10.0F
), Language.SPANISH)

mapper.search<Player>(Language.SPANISH) {
    range(Player::age, 20, 30)
    sortBy(Player::score)
}

Query Parser

The QueryBuilder also supports parsing Lucene queries using the QueryParser syntax.

indexManager.search("players", Language.SPANISH) {
    parse("firstName", "age:[22 TO 35] AND Juan")
}

The first parameter of the parse() method is the default field to use when no field is specified in the query. For instance, the previous query searches for all players with the first name Juan and an age between 22 and 35 years. For the full syntax documentation, take a look at the Lucene documentation.

Pagination

indexManager.search searches for documents in the index. By default it limits results to IndexManager.DEFAULT_LIMIT documents. The operation returns a SearchResult that contains the documents in the first page and a cursor to the next page in the recordset. You must call indexManager.search again, providing the cursor, to fetch the next page. This pagination strategy is useful when you have to defer the search and continue it later.

If you don't need deferred pagination, you can use indexManager.find to get the full list of results as a Sequence. It will query the index as many times as required until the recordset has no more documents.
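
A minimal sketch using find, assuming it accepts the same namespace, language, and query-builder arguments as search:

indexManager.find("players", Language.SPANISH) {
    term("firstName", "Juan")
}.forEach { document ->
    // Process each document; the Sequence transparently follows the cursor.
}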

Text Normalization

Text normalization is a key part of indexing and searching. Lucene applies a set of text normalization techniques to make search more accurate.

kotlin-search provides access to the text normalization components through the be.rlab.nlp.Normalizer class. Text normalization usually has three stages (a usage sketch follows the list):

  1. Pre-normalization
  2. Tokenization
  3. Token Normalization
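
A hypothetical usage sketch (the constructor and method names are assumptions, not the library's confirmed API):

import be.rlab.nlp.Normalizer
import be.rlab.nlp.model.Language

// Applies pre-normalization, tokenization and token normalization.
val normalized: String = Normalizer(
    text = "¡Los jugadores están jugando!",
    language = Language.SPANISH
).normalize()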

Pre-normalization

In this phase, the Normalizer applies the following normalization techniques to the entire text:

  • Removes diacritics
  • Removes punctuation
  • Performs Unicode normalization
  • Transforms the text to lowercase to make it case insensitive

Tokenization

In this phase, the Normalizer splits the text into tokens using the following tokenizers:

  • Word Tokenizer
  • Stop Words Tokenizer

Note that the Stop Words Tokenizer requires a language, since stop words are language-specific. If you create the Normalizer without setting a language, it will fail with an error.

The Stop Words Tokenizer uses the collection of stop words from stopwords-iso.

Token Normalization

In this phase, the Normalizer applies the Snowball stemmer to extract the root of each word. Stemming is provided by the be.rlab.nlp.MultiLanguageStemmer component, which delegates the processing to the language-specific stemmer distributed with Lucene.

Note that the MultiLanguageStemmer also requires a language. If you create the Normalizer without setting a language, it will fail with an error.
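
A hypothetical sketch (the constructor and the stem() method are assumptions, not the library's confirmed API):

import be.rlab.nlp.MultiLanguageStemmer
import be.rlab.nlp.model.Language

val stemmer = MultiLanguageStemmer(Language.SPANISH)
// The Snowball stemmer reduces inflected forms to a common root;
// the exact output depends on the language-specific rules.
val root: String = stemmer.stem("jugando")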