Migration to Elasticsearch #2830

Closed
wants to merge 357 commits into from

@josegar74 josegar74 commented Jun 4, 2018

The move from the Lucene library to an off-the-shelf search engine such as Solr or Elasticsearch is mainly motivated by two goals:

  • first, improve performance on large catalogues and allow clustering,
  • then, solve a number of limitations of the current Lucene implementation that would otherwise require major effort.

Moving to Elasticsearch will bring a lot of flexibility in configuring indexing and search in the catalogue. This page highlights the main benefits of this move and the current progress. GeoNetwork developers are looking for funding to continue this task.

Current limitations

  • The GeoNetwork Lucene index cannot be shared across catalogues, which makes clustering infeasible (unless some synchronization mechanism is set up).
  • Searches
    • Limited set of operators (eg. NOT operator, phrase search)
    • Issue with the " character used in search
    • Suggestions may display information from private records (even if the records themselves are not visible)
    • Facets need to be configured to be available
    • Limited boosting mechanism
  • Search results are complex to customize (requires indexing, dumping fields, a restart and a UI change)
  • The multilingual support implementation is currently quite complex (and could be simplified)

Use cases to illustrate some of the benefits

Main benefits that this move can bring to the application concern:

  • Searching:
    • Better & more relevant searches (eg. scoring & boosting)
    • Combined search on features (eg. indexed from WFS sources) and records
    • Analysis: find similar documents, facet on any fields
    • Search responses can use objects instead of the current delimited text (eg. link, contact)

image

  • Client can request the fields required and as such limit response size
  • Indexing:
    • Synonym support (eg. the INSPIRE theme Elevation corresponds to Annex II)
    • Advanced analysis & filtering chain
    • Simplify multilingual support

This change will also probably improve search and indexing performance while still supporting spatial searches.

It also aims to simplify the codebase, which will make life easier for newcomers:

image

The following sections illustrate some of the benefits:

More flexible facets (named aggregations)

The Elasticsearch API allows the following:

image

  • nested aggregations can be defined to create a tree structure, eg. Level 1: by resource type > Level 2: by status
  • aggregations are not limited to document counts; they can also be used for analytics (eg. sum, avg)
  • spatial aggregations can be used to create heatmaps
  • matrix
  • ...
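
As a rough illustration of the nested aggregations listed above, the sketch below builds a search request body in which one terms aggregation is nested inside another. The field names `resourceType` and `status` are placeholders chosen for this example and are not necessarily the names used in the GeoNetwork index.

```python
import json


def nested_terms_aggregation(level1_field, level2_field, size=10):
    """Build a request body with a terms aggregation nested in another one."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "aggs": {
            "level1": {
                "terms": {"field": level1_field, "size": size},
                # Sub-aggregation: computed inside each level1 bucket.
                "aggs": {
                    "level2": {"terms": {"field": level2_field, "size": size}}
                },
            }
        },
    }


if __name__ == "__main__":
    # Field names are assumptions made for this example.
    body = nested_terms_aggregation("resourceType", "status")
    print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a `_search` request; replacing the inner `terms` with a metric such as `avg` or `sum` would give the analytics variant mentioned in the list.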

Better suggestions

The current suggestion mechanism does not always take user privileges into account. Elasticsearch allows suggesters to be used on any indexed field (and combined with searches).

eg.
image

More to analyze:

  • Suggest similar terms
  • Phrase suggestion
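
The two items above map onto the term and phrase suggesters of Elasticsearch. The following is a minimal sketch of a request body that uses both with a shared input text; the suggestion names `similar_terms` and `similar_phrases` and the field passed in are illustrative assumptions, and this is not the configuration used in the branch.

```python
import json


def suggest_body(text, field):
    """Build a request body combining a term and a phrase suggester."""
    return {
        "suggest": {
            "text": text,  # global text shared by both suggesters below
            "similar_terms": {"term": {"field": field}},
            "similar_phrases": {"phrase": {"field": field}},
        }
    }


if __name__ == "__main__":
    # "anytext" is used here only as an example field.
    print(json.dumps(suggest_body("land covr", "anytext"), indent=2))
```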

More like this

"More like this" suggests records similar to the one you are currently viewing. Similarity can be defined by the fields on which it is computed and by the frequency of terms (more testing is needed on how to set the "more like this" parameters).
image
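
A minimal sketch of what such a request could look like with the Elasticsearch `more_like_this` query is given below. The record identifier and the field names are hypothetical, and the values of `min_term_freq` and `max_query_terms` are only starting points, since the parameters still need testing.

```python
import json


def more_like_this_body(index, record_id, fields,
                        min_term_freq=1, max_query_terms=12, size=5):
    """Build a request body that looks for records similar to one record."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": fields,  # fields on which similarity is computed
                "like": [{"_index": index, "_id": record_id}],
                "min_term_freq": min_term_freq,  # ignore rarer terms
                "max_query_terms": max_query_terms,
            }
        },
    }


if __name__ == "__main__":
    # The identifier and field names are assumptions for this example.
    body = more_like_this_body("gn-records", "some-record-uuid",
                               ["resourceTitle", "resourceAbstract"])
    print(json.dumps(body, indent=2))
```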

Dynamic dashboards

Once the catalogue is indexed in Elasticsearch, Kibana can be used to analyze its content. The resulting analyses can help promote the catalogue on a CMS or third-party website, and can also help improve the content of records by searching for invalid values.

image

Dashboards can also focus on geographical extents:

image

Performances

After basic testing with ab, it looks like we could expect search, facets and indexing to be about 5 times faster than with the current services.

image

Migration roadmap

The dev branch is available at https://github.com/geonetwork/core-geonetwork/tree/es

This migration is a major piece of work and will require several iterations to cover the full scope of GeoNetwork's search and indexing features. The migration process is described as a succession of levels, from Level 0, which provides the minimal set of features, to the last level, which could provide the same features as today. This migration can therefore also be an opportunity to remove unused features and simplify the codebase.

Level 0 (Proof of concept)

Level 0 means that the application starts, creates the index if it does not exist, indexes documents from the database, and makes the main search available. This is the proof of concept that allows analyzing what the main implementation goals will be.
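
The startup behaviour described here (wait for the search engine, then create the index only if it is missing) can be sketched as follows. This is not the GeoNetwork implementation, which is written in Java; `is_up`, `index_exists` and `create_index` are hypothetical callables standing in for whatever client is used.

```python
import time


def init_index_when_available(is_up, index_exists, create_index,
                              attempts=5, delay=1.0, sleep=time.sleep):
    """Create the index once the search engine answers.

    Returns True as soon as the index is known to exist, or False if the
    engine did not answer within the given number of attempts. The caller
    can keep running and retry later when False is returned.
    """
    for attempt in range(attempts):
        if is_up():
            if not index_exists():
                create_index()
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before checking again
    return False


if __name__ == "__main__":
    # Fake engine that becomes available on the third check.
    answers = iter([False, False, True])
    ok = init_index_when_available(
        is_up=lambda: next(answers),
        index_exists=lambda: False,
        create_index=lambda: print("creating index"),
        delay=0.1,
    )
    print("index ready:", ok)
```

Passing the checks in as callables keeps the retry logic independent of any particular client library, which also makes it easy to exercise with fakes.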

The search API is provided by the /api/search/records/_search service:

image

Tasks for this level are:

  • Removing the Lucene dependency. This is required in order to add Elasticsearch as a dependency, since it depends on a much more recent version of Lucene.
  • Indexing support for 19139/19110/Dublin Core
  • Add an index proxy taking care of forwarding search from the user interface to the index. The proxy also takes care of:
  • Do not fail on startup if ES is not available. This means that GeoNetwork can start without a running Elasticsearch and initialize the index once Elasticsearch becomes available.
  • UI / Main search form / Use new search service
    • Define draft core response structure
    • Simple one level facet support
    • Full text search support
  • Define index fields structures, naming and conventions
  • More like this concept

image

  • Suggestion and phrase search concept

image

  • Admin console / Add status information on the index
  • Admin console / Add tools required to manage the new index

image

Funding: This level 0 of features was developed mainly during the 2018 Bolsena codesprint.

Level 1

Level 1 targets a fully working user interface based on the new search service. Search is used in many places, from the home page to the associated resources panel in the editor.

Technical challenges:

  • Request only the required fields
  • One level aggregations keeping in mind that nested aggregation will be implemented in the future
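
Both challenges can be illustrated with a single request body: `_source` restricts the fields returned for each hit, and a one-level terms aggregation provides the facet. This is a sketch only; the field names in the usage example are assumptions, not the actual index fields.

```python
import json


def search_body(query_text, source_fields, facet_field, size=20):
    """Build a search body that returns only the listed fields and one facet."""
    return {
        "size": size,
        "_source": source_fields,  # only these fields come back per hit
        "query": {"query_string": {"query": query_text}},
        # One-level facet; a nested "aggs" key could be added here later.
        "aggs": {facet_field: {"terms": {"field": facet_field}}},
    }


if __name__ == "__main__":
    # Field names are assumptions made for this example.
    body = search_body("water", ["resourceTitle", "resourceType"],
                       "resourceType")
    print(json.dumps(body, indent=2))
```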

Tasks for this level are:

  • Application can start even if index is down.

image

  • Search / wire all the UI (home/edit/admin) to the new search service with proper search results

  • Search / Return in _source only the fields required by the UI (performance)

  • Translation of codelist

  • Facets / restore full support of one level facet with configuration from the settings

    • Configuration is now available in the admin per module ie. search/editor board

image

  • Add support for filter aggregations
GET /gn-records/_search
{
  "aggs": {
    "messages": {
      "filters": {
        "filters": {
          "availableInViewService": {
            "query_string": {
              "query": "+linkProtocol:/OGC:WMS.*/"
            }
          },
          "availableInDownloadService": {
            "query_string": {
              "query": "+linkProtocol:/OGC:WFS.*/"
            }
          }
        }
      }
    }
  },
  "query": {
    "match_all": {}
  }
}
  • Utility / XSLUtils / get index fields value
  • Search related
  • Cleaning / Remove unused Lucene XSLT from schema plugins
  • Autocompletion and UI configuration

Various Elasticsearch queries can be used to configure autocompletion.

image

By default, a multi_match on anytext + its ngram associated fields is configured in order to propose record titles based on analysis of partial word match.
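
A minimal sketch of such a request body follows. Only `anytext` comes from the description above; the name of the ngram subfield (`anytext._ngram`) is an assumption made for this example, so the field list is passed in as a parameter.

```python
import json


def autocomplete_body(user_input, fields, size=10):
    """Build a multi_match request over a field and its ngram variants."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": user_input,  # partial text typed by the user
                "fields": fields,
            }
        },
    }


if __name__ == "__main__":
    # "anytext._ngram" is a hypothetical subfield name.
    body = autocomplete_body("hydro", ["anytext", "anytext._ngram"])
    print(json.dumps(body, indent=2))
```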

image

Funding: This level 1 of features was developed mainly:

Level 2

Level 2 focuses on restoring CSW and improving aggregation (aka facet) support.

Tasks for this level are:

  • CSW support

  • CSW / support geometries with And / Or conditions.

  • Virtual CSW (deprecated and replaced by portal, which includes a virtual CSW)

  • Migrate to the RestHighLevelClient Java client API instead of the JEST library (JEST was used because it has no dependency on Lucene)

  • Selection / Restore selection manager

  • Selection / Restore MEF export

  • Selection / Restore PDF/CSV export

  • How to integrate ES for running tests and to package the installer

  • Improve index status checker by reporting status

  • Automatically create the index when it does not exist

  • Multi portal support / Restore portal filter injection

  • Facet / Allow OR in a category

image

  • Facet / Allow not a value eg. not a service

image

  • Hierarchical facets, eg. GEMET. Facet hierarchies are now supported using 2 approaches:

    • the sub-aggregation concept in Elasticsearch, which allows nested aggregations, eg. below on resource type > format
    • a path hierarchy using a separator, eg. below on the GEMET thesaurus

image
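
To make the path hierarchy approach more concrete, the sketch below shows index analysis settings using the Elasticsearch `path_hierarchy` tokenizer, together with a small pure-Python function that mimics the tokens it produces for a path without a leading separator. The `^` separator and the `facet_path` names are assumptions for this example; the separator actually used in the branch is not stated here.

```python
def path_hierarchy_settings(delimiter="^"):
    """Analysis settings declaring a path_hierarchy tokenizer and analyzer."""
    return {
        "analysis": {
            "tokenizer": {
                "facet_path": {"type": "path_hierarchy", "delimiter": delimiter}
            },
            "analyzer": {
                "facet_path": {"type": "custom", "tokenizer": "facet_path"}
            },
        }
    }


def path_hierarchy_tokens(path, delimiter="^"):
    """Mimic the tokenizer: one token per ancestor, from root to leaf."""
    parts = path.split(delimiter)
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]


if __name__ == "__main__":
    # A terms aggregation on these tokens yields one bucket per tree level.
    for token in path_hierarchy_tokens("theme^subtheme^term"):
        print(token)
```

Because every ancestor path becomes its own token, a record tagged with a leaf term is also counted under each of its parents.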

At the end of level 2, GeoNetwork should provide the main functionalities for users requiring search (including CSW), editing, map and admin console.

  • Facet / Load more values

image

  • Histogram aggregation support

image
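
The "OR in a category" and "not a value" facet behaviours listed above could be expressed as a bool query, as in the sketch below. This is one possible encoding, not necessarily the one used by the UI: values selected within a field are ORed through a `terms` query, different fields are ANDed through `filter`, and excluded values go into `must_not`. The field and value names in the example are assumptions.

```python
import json


def facet_filter(selected, excluded):
    """Build a bool query from facet selections.

    selected: {field: [values]}, ORed within a field, ANDed across fields.
    excluded: {field: [values]}, records with any of these values are dropped.
    """
    filters = [{"terms": {field: values}} for field, values in selected.items()]
    must_not = [{"terms": {field: values}} for field, values in excluded.items()]
    return {"query": {"bool": {"filter": filters, "must_not": must_not}}}


if __name__ == "__main__":
    # Datasets or series, but not services (names are illustrative).
    body = facet_filter({"resourceType": ["dataset", "series"]},
                        {"resourceType": ["service"]})
    print(json.dumps(body, indent=2))
```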

Funding: This level 2 of features was developed mainly:

Level 3

  • Search suggestion / Configuration (which fields to suggest on)
  • Index / Subtemplates
  • Map / wire all searches
  • Index / Handle graphic overview with or without label
  • Index / Add Anchor support
  • Search / Support special character eg. search for "1+1=2"

image

  • Admin / Harvester / List records for current harvester
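
For the special character item above, one option when user input is passed to a `query_string` query is to escape the characters reserved by the query syntax. The helper below is a sketch of that idea and not necessarily how the branch handles it; `<` and `>` are removed because the Elasticsearch documentation states they cannot be escaped.

```python
import re

# Characters and operators reserved by the query_string syntax.
_RESERVED = re.compile(r'([+\-=!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_query_string(text):
    """Escape user input so it is searched literally by query_string."""
    # "<" and ">" cannot be escaped, so they are dropped.
    text = text.replace("<", "").replace(">", "")
    # Prefix every reserved character or operator with a backslash.
    return _RESERVED.sub(r"\\\1", text)


if __name__ == "__main__":
    print(escape_query_string("1+1=2"))
```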

Funding:

  • titellus
  • Metawal

Level 4

Level 4 focuses on making a first beta release of GeoNetwork on Elasticsearch that is usable in practice.

image

  • ES / suggestion / current filter apply
  • ES / sort / Most recent (to easily find latest created)
  • ES / Index / Temporal coverage using date range type
  • Release a GeoNetwork 3.99.0 ?
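
As an illustration of the temporal coverage item in the list above, the sketch below shows a mapping fragment using the Elasticsearch `date_range` field type, a document value for it, and a range query that matches records whose coverage overlaps a given period. The field name `resourceTemporalExtent` is a placeholder chosen for this example.

```python
import json


def temporal_extent_properties(field):
    """Mapping fragment declaring a date_range field."""
    return {"properties": {field: {"type": "date_range"}}}


def temporal_extent_value(field, start, end):
    """Document fragment storing a coverage period in that field."""
    return {field: {"gte": start, "lte": end}}


def temporal_overlap_query(field, start, end):
    """Query matching records whose coverage intersects [start, end]."""
    return {
        "query": {
            "range": {
                field: {"gte": start, "lte": end, "relation": "intersects"}
            }
        }
    }


if __name__ == "__main__":
    field = "resourceTemporalExtent"  # hypothetical field name
    print(json.dumps(temporal_extent_properties(field), indent=2))
    print(json.dumps(temporal_extent_value(field, "2010-01-01", "2012-12-31")))
    print(json.dumps(temporal_overlap_query(field, "2012-06-01", "2013-06-01")))
```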

Funding:

  • EEA
  • Metawal

Reference documents:

@josegar74 josegar74 added this to the 4.0.0 milestone Jun 4, 2018
@fxprunayre fxprunayre added the schema plugin change Indicate that this work introduces a schema plugin change. label Jun 6, 2018
@fxprunayre fxprunayre changed the base branch from 3.4.x to master June 26, 2018 08:51
@fxprunayre fxprunayre changed the title from "WIP - Migration to ES" to "WIP - Migration to Elasticsearch" Mar 11, 2019
fxprunayre and others added 24 commits June 25, 2019 19:43
* Add any field for more global full text search
* Add recordGroup for collapse mode
* Test routing key
* Add recordLink to target one parent (need more work)
* Link also feature to parent record
…ch based on the bucket id in session and you can reuse it somewhere else if needed. Query can be done using JSON Elastic object or Lucene query syntax also.
…hen there're not failed documents to avoid exception
@fxprunayre fxprunayre marked this pull request as ready for review April 22, 2020 17:19
fxprunayre commented Apr 28, 2020

CFV https://sourceforge.net/p/geonetwork/mailman/message/36995405/

  • +1 Jose, Florent, Emanuele, Jo, Jeroen, Francois

Actions:

  • Version changed to 4.0.0.alpha-1

image

Note:

  • Database migrations between alpha releases will not be created (database changes should be minor, so only a migration from 3.10.x to 4.0.0 will be created)


7 participants