
Moving to a search server: Why & how?


Since 2010, the GeoNetwork community has been discussing a move from Lucene to Solr in order to improve the user search experience. The main motivations for this move were:

  • Improve suggestions (e.g. spell check, suggestions based only on the records visible to the user)
  • Facets on any field and cross-field facet hierarchies
  • Scoring, boosting results
  • Similar documents
  • Highlights in response
  • Join query
  • Fix Lucene memory issues on some setups (which required a restart)
  • Reduce Lucene multilingual/search complexity

Moving from Lucene to Solr or Elasticsearch introduces a major change in the application: the search server runs alongside GeoNetwork. A proxy is implemented in GeoNetwork to perform searches and to enrich queries and responses based on user privileges.

Building on the WFS data indexing work funded by Ifremer, a first code sprint was held in April/May 2016 with titellus (François Prunayre) and camptocamp (Patrick Valsecchi, Antoine Abt, Florent Gravin) to replace Lucene with Solr.

This code sprint focused on starting the move to Solr in order to identify the main issues, risks and benefits, and to draw a roadmap that can then be used to look for funding. This document summarizes what has been done so far and illustrates features that could be relevant for GeoNetwork.

Code sprint main targets

  • Analyze how to move to Solr
  • Investigate Solr features and illustrate the benefits
  • Start migration & refactoring focusing on main search service and CSW; identify features to deprecate
  • Illustrate with a simple search interface providing the capability to search on metadata and datasets


Technical overview

New dependencies:

  • Solr 6
  • Java 8 required

Removed dependency:

  • Lucene 4.9

GeoNetwork major changes:

  • The Angular app uses a simple HTTP interceptor to allow basic search (the interceptor mimics the q service query/response translation from/to the Solr format). This is used to enable basic functionality in the current UI.
  • New experimental Angular UI for search (on features and metadata)
  • Integrate the cleaning PR, i.e. remove the ExtJS UI, old XSL services and the Z39.50 server

Work

See branch https://github.com/geonetwork/core-geonetwork/tree/solr

Preview of improvements

First experiments:

Spell check & suggestions

The spell check module suggests related searches to end users in case of typos. The suggestion module can be used to provide suggestions based on a field in the index.

Examples (screenshots omitted) show suggestions of similar words, corrections of typos, and spell check working on whole phrases.

The current suggestion mechanism in GeoNetwork is based on a search and, unlike this new implementation, cannot provide terms that do not match any result (see https://github.com/geonetwork/core-geonetwork/issues/1466, https://github.com/geonetwork/core-geonetwork/issues/634 and https://github.com/geonetwork/core-geonetwork/issues/1003).

Find similar documents

Using "MoreLikeThis" component, easily provide similar document to the one you're currently looking at (eg. other versions of the same dataset). See https://cwiki.apache.org/confluence/display/solr/MoreLikeThis

E.g. when searching for ortho imagery and retrieving an image from 2015, similar images from 2009 and 2012 are also returned.

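A minimal sketch of a MoreLikeThis query using the MLT search component on the select handler (the resourceTitle field name is an assumption; resourceAbstract follows the highlighting example below):

http://localhost:8984/solr/catalog_srv_shard1_replica1/select?q=id:501&mlt=true&mlt.fl=resourceTitle,resourceAbstract&mlt.count=5

The response then contains a moreLikeThis section listing, for each matched document, the most similar documents in the index.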

Boosting

Searches can now boost specific fields at search or indexing time (e.g. give a higher score to matches in the title) using the Solr search API.
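For example, per-field boosts can be set at query time with the eDisMax query parser (a sketch; the resourceTitle field name is an assumption):

http://localhost:8984/solr/catalog_srv_shard1_replica1/select?q=water+quality&defType=edismax&qf=resourceTitle^5+resourceAbstract^2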

Synonyms

Solr supports synonym configuration based on a simple text file or on a more advanced synonym map (configurable using the API). Synonyms are heavily used in the INSPIRE dashboard project (e.g. INSPIRE themes & annexes https://github.com/INSPIRE-MIF/daobs/blob/daobs-1.0.x/solr/solr-config/src/main/solr-cores/data/conf/_schema_analysis_synonyms_inspireannex.json, contacts and territories in France https://github.com/fxprunayre/daobs/blob/geocataloguefr/solr/solr-config/src/main/solr-cores/data/conf/_schema_analysis_synonyms_geocat_producer_territory.json).

Once configured, synonyms can be used in the search/facet/stats components.
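A minimal sketch of a file-based synonym configuration, assuming the filter is added to the query analyzer of the relevant field type in the schema. A synonyms.txt file lists equivalent terms on one line, or explicit mappings:

ortho, orthophoto, orthoimagery
GPS => Global Positioning System

and the field type references it:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>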

This extends the use of thesauri in GeoNetwork; currently only the broader/narrower relations in a thesaurus are used, for hierarchical facets (https://github.com/geonetwork/core-geonetwork/wiki/201411HierarchicalFacetSupport).

Fine-tuned queries

The Lucene/Solr query syntax can be used to make searches more flexible.

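A few sketches of what the syntax allows (field names are assumptions):

  • Fuzzy search tolerating typos: q=resourceTitle:bosin~2
  • Proximity search (words within 5 positions of each other): q="water quality"~5
  • Range query on a date field: q=dateStamp:[2015-01-01T00:00:00Z TO NOW]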

The search and index analysis chains are also better configured, and will avoid search errors such as those that occurred when searching on a full title.

Highlighting

The highlighter module provides the capability to highlight matching words in results, e.g. in the abstract.
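A sketch of a highlighting request using the standard highlighter parameters:

http://localhost:8984/solr/catalog_srv_shard1_replica1/select?q=map&hl=true&hl.fl=resourceAbstract

The response then contains a dedicated highlighting section: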


{
  response: {
    docs: [ ... ]
  },
  highlighting: {
    "501": {
      resourceAbstract: [
        "Use this template to describe a static <strong>map</strong> (eg. PDF or image) or an interactive <strong>map</strong> (eg. WMC)."
      ]
    }
  }
}
Note: the field MUST be tokenized, e.g. highlighting does not work with the String type; use the text_general type instead.

Faceting

Instead of using the server-side config-summary.xml, which defines a predefined list of facets, Solr allows facets to be created on any field. The client can easily request any facet it requires. For example, the WFS feature data filter automatically computes facets on all feature attributes. It computes statistics on numeric and date fields and builds the facet configuration on the fly.
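A sketch of such on-the-fly statistics using the JSON facet API (the denominator field name is an assumption):

http://localhost:8984/solr/catalog_srv_shard1_replica1/select?q=*:*&rows=0&json.facet={minScale:"min(denominator)",maxScale:"max(denominator)"}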


GeoNetwork facets only support term facets, returning a list of values with a count of records. More advanced faceting can be done with Solr (see the range facet sketch after this list):

  • range
  • interval
  • heatmap (for geometry)
  • pivot

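A sketch of a range facet with the JSON facet API (the publicationYear field name is an assumption):

http://localhost:8984/solr/catalog_srv_shard1_replica1/select?q=*:*&rows=0&json.facet={years:{type:range,field:publicationYear,start:2000,end:2020,gap:5}}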

Pivots can also be quite flexible using the new Solr facet API, which allows multilevel facets. A user could for example request:

  • a first-level facet on resource type (e.g. feature/dataset/service)
  • a second-level facet on point of contact
  • a third level on conformity
  • ... and get statistics on each pivot

e.g. http://localhost:8984/solr/catalog_srv_shard1_replica1/select?indent=on&q=*:*&wt=json&rows=0&facet=true&json.facet={test:{terms:resourceType}}

e.g. http://localhost:8984/solr/catalog_srv_shard1_replica1/select?indent=on&q=*:*&wt=json&rows=0&facet=true&json.facet={level1:{type:terms,field:resourceType,missing:true,facet:{tag:{type:terms,field:tag}}}}

The facet API also provides the capability to request more facet values, paging in facets, etc.

Indexing related documents and data

Data available via WFS can already be indexed (see https://github.com/geonetwork/core-geonetwork/wiki/WFS-Filters-based-on-WFS-indexing-with-SOLR). This work needs to be extended to also index other types of documents (e.g. PDF). A parser like Apache Tika can be used for this task.
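A sketch of indexing a PDF through Solr's extracting request handler, which relies on Tika internally (assuming the handler is enabled in solrconfig.xml; the file name and id are hypothetical):

curl "http://localhost:8984/solr/catalog_srv_shard1_replica1/update/extract?literal.id=doc1&commit=true" -F "file=@report.pdf"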

Grouping/Collapsing

These features could be relevant for grouping results (datasets/series, features/datasets, ...). Links between documents must be added to the index, e.g. so that a search can combine both metadata and features.
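A sketch of a grouping request on the parent field used in the response below:

http://localhost:8984/solr/catalog_srv_shard1_replica1/select?q=*:*&group=true&group.field=parent

A grouped response looks like: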

grouped: {
  parent: {
    matches: 8624,
    groups: [
      {
        groupValue: "89dee307e38c972b333b152d9bd19bb2e9bb0d4d",
        doclist: {
          numFound: 49,
          start: 0,
          docs: [
            {
              id: "states.1",
              docType: "feature"
            },
            ...
More work required:

Issues:

  • Does not return info about child docs.

Spatial searches

Spatial search has been tested for both feature and metadata indexing/searching. Indexing millions of objects was tested. Some limitations were identified and need more testing (e.g. indexing ship tracks covering the whole world was quite slow, depending on the index grid size).

The heatmap feature is also used in feature analysis.

Spatial search is based on Lucene spatial and does not use GeoTools filters. So far, spatial queries appear to work fine.
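A sketch of a spatial filter added to a select request (the geom field name is an assumption; ENVELOPE takes minX, maxX, maxY, minY):

fq=geom:"Intersects(ENVELOPE(-10,30,60,35))"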

Performance

To be tested.

Misc.

Conclusion

Moving from Lucene to a search engine will bring major benefits, thanks to the many features implemented in search servers like Solr or Elasticsearch (including better scalability). In both cases, a proxy is placed in front in order to deal with privileges and build responses. The major tasks which will represent most of the workload are:

  • implementing multilingual support (using one field per language instead of one index per language as we do now)
  • reworking the Angular client to deal with the new response format
  • re-implementing all search protocols (the POC focused on CSW, but GN also implements OpenSearch, OAI-PMH, SRU, Atom, ...)

This move will also make it possible to build more advanced dashboards based on Banana (for Solr) or Kibana (for Elasticsearch), like what the daobs project does (e.g. https://inspire-dashboard.eea.europa.eu/official/dashboard2/#/dashboard/solr/INSPIRE Reporting 2011 - Ref. year 2010 - Metadata availability and conformity). Dashboards could be created dynamically from the catalog content (based on record content) and could also replace the search statistics pages available in the admin console.


Technical analysis & configuration

Suggestion

Solr configuration

Sample query http://localhost:8984/solr/catalog_srv_shard1_replica1/spell?q=bosin&spellcheck=true&spellcheck.collateParam.q.op=AND

Spellcheck and suggestion configuration is made in:

  • solrconfig.xml: defines the module configuration
  • schema: defines which fields are used to build the dictionary (currently title, tags, abstract)

The response contains a dedicated spellcheck and suggestion section:

<response>
  <result name="response" numFound="0" start="0"/>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="bosin">
        <int name="numFound">1</int>
        <int name="startOffset">0</int>
        <int name="endOffset">5</int>
        <int name="origFreq">0</int>
        <arr name="suggestion">
          <lst>
            <str name="word">basins</str>
            <int name="freq">1</int>
          </lst>
        </arr>
      </lst>
    </lst>
    <bool name="correctlySpelled">false</bool>
    <lst name="collations">
      <lst name="collation">
        <str name="collationQuery">basins</str>
        <int name="hits">1</int>
        <lst name="misspellingsAndCorrections">
          <str name="bosin">basins</str>
        </lst>
      </lst>
    </lst>
  </lst>
</response>

More work required

Client search application

The simple search application work focused on drafting Angular components to easily create interfaces on top of Solr search. In this work, we tried to overcome issues found in the first Angular components (e.g. difficulties in having more than one search in the same app) and we started the design of search components (e.g. requestHandler, facets, results, paging, ...).

TODO: Add some more details.

Preview of limitations

  • TODO

Solr migration work

Search

All communication with Solr is handled by a proxy. The proxy takes care of:

  • Query / Add user privileges to search filters
  • Response / Add extra information to metadata documents, e.g. can edit, is selected (formerly geonet:info)
  • Provide access to search, spellcheck, suggestion and facets.
  • Provide access to search for any type of document, i.e. metadata or data. The client should filter what to query.
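For example, for a user with id 101 in groups 1 and 42, the proxy could append a filter query restricting results to records the user owns or records visible to those groups (a sketch; the owner and op0 field names are hypothetical):

fq=owner:101 OR op0:(1 OR 42)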

The search response format is JSON.

Solr is not required to start the application, but a warning is displayed if there is an error contacting the search engine.


A health check tests if Solr is up and running and reports the status in the admin console.

Major changes:

  • Search / Parameters / No defaults are set. The client needs to define all of them (before, searches defaulted to isTemplate:n)
  • Selection / Added a q parameter to select the records matching a specific query. It is no longer related to the session's last search. See SelectionManager

More work required

  • Multilingual search / Move from one index per language to one field per language in the same index
  • OAI-PMH
  • Atom service
  • RSS search
  • CSV search
  • Server / Response / Can we have complex JSON objects in the response instead of only a flat structure?
  • Client / Cannot sort on multivalued fields (e.g. denominator): create min and max fields in the index

CSW

  • GetDomain / Basic support / RangeValues is not supported
  • GetRecords
  • Config / Review the mapping to Solr fields

More work required

  • Virtual CSW / Needs testing
  • Testing

Indexing

Indexing is still done in two steps:

  • XSL transformation to extract information from metadata record
  • Add information from the database.

Atomic updates have been implemented in order to update popularity and rating without reindexing the full document, for better performance.
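A sketch of such an atomic update, incrementing the popularity field through Solr's JSON update syntax (assuming the update log is enabled; the document id 501 follows the highlighting example above):

curl -X POST -H "Content-Type: application/json" "http://localhost:8984/solr/catalog_srv_shard1_replica1/update?commit=true" -d '[{"id": "501", "popularity": {"inc": 1}}]'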

Integration tests

More work required:

  • How to set up/start Solr for running tests?

Relation

  • Editor / Update field name in relation panel

Multinode support

This was not taken into account during the code sprint. It sounds relevant to have one Solr collection per node and to provide one searcher per node. The way beans are accessed could probably be improved in order to make better use of Spring bean scopes.

API Changes

  • GetPublicMetadataAsRdf: move from URL params to a Solr query, e.g. /rdf.metadata.public.get?q=...
  • Log search
    • Removed: analyze the Solr logs instead - all requests made using GET contain their parameters.
    • Open question: search the logs in Solr?
    • The Requests and Params tables are removed.
  • Admin console / Dashboard: removed - use Solr facets instead and build a new dashboard from that.
  • Search
    • No support of geometry by id, e.g. geometry:region:kantone:15
  • CSW
    • Language is defined by the URL only to return the DC response (no language detection).
    • GetRecords / The Result_with_summary custom extension is removed.
    • GetDomain / No support for range.

Misc

More work required:

Cleaning

