Is your feature request related to a problem? Please describe.
Background:
After the GeoNetwork upgrade to version 4.2.5, initial harvesting of all metadata was performed.
We were harvesting from a node with a large number of metadata records (> 280,000).
The initial harvesting runtime was very long, and several consecutive harvesting runs failed as a result.
Therefore, multiple harvesting runs over several days were required until all metadata was available in the database and in the index.
The question arises: Why does the initial harvesting take so much time?
We used the profiling tool VisualVM to analyze which methods require the most time during the initial harvesting process.
The exact process of indexing during harvesting is described further down in the ticket.
The following shares of the harvesting time were measured:
The addMetadata method accounts for ~89% of the total harvesting time
The indexMetadata method, which is called from addMetadata, accounts for ~37% of the total harvesting time
Important
Indexing of new metadata takes about 37% of the total harvesting time. Therefore, performance enhancement in indexing has a high potential to decrease harvesting time.
Describe the solution you'd like
Tip
Suggestion: The performance of indexing during harvesting could possibly be improved by indexing several metadata records at once using bulk requests.
GeoNetwork already uses the bulk API, but with CSW harvesting each bulk request contains only one metadata record. The addMetadata method is called individually for each metadata record, and in the code the parameter forceRefreshReaders is set to true, which causes this behavior.
Indexing multiple documents per bulk request, instead of sending each document in its own request, could increase performance.
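To give a rough sense of scale, here is a small sketch that counts how many bulk requests a harvest run would send. The class `BulkRequestEstimate` and the batch size of 200 are assumptions for illustration only; they are not taken from the GeoNetwork code base, and fewer requests does not by itself say how much time would be saved.

```java
// Hypothetical helper, not part of GeoNetwork: counts how many bulk
// requests are needed when each request carries a fixed number of documents.
final class BulkRequestEstimate {

    /** Number of bulk requests needed to index `records` documents. */
    static long bulkRequestCount(long records, int docsPerRequest) {
        if (records < 0 || docsPerRequest < 1) {
            throw new IllegalArgumentException(
                "records must be >= 0 and docsPerRequest must be >= 1");
        }
        // Ceiling division without floating point.
        return (records + docsPerRequest - 1) / docsPerRequest;
    }

    public static void main(String[] args) {
        long records = 280_000L;     // lower bound reported in this ticket
        int assumedBatchSize = 200;  // assumption, chosen only for illustration
        System.out.println("one document per request: "
            + bulkRequestCount(records, 1));
        System.out.println("batched (" + assumedBatchSize + " per request): "
            + bulkRequestCount(records, assumedBatchSize));
    }
}
```

With one document per request, the number of bulk requests equals the number of harvested records; with batching it shrinks by roughly the batch size.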
Additional context
Analyzed process of indexing during harvesting with VisualVM
kernel.harvest.harvester.csw.Aligner.align()
kernel.harvest.harvester.csw.Aligner.insertOrUpdate()
foreach loop over each record → Call Aligner.addMetadata()
Result: addMetadata is called individually for each added metadata record
kernel.harvest.harvester.csw.Aligner.addMetadata() (~89% of total harvesting time)
Call BaseMetadataIndexer.indexMetadata() with parameter forceRefreshReaders = true
Only one metadata record is transferred for indexing
Possible Solution: Flag metadata record for indexing, but don't index it immediately. Instead, index multiple metadata records at once with a bulk request.
kernel.datamanager.base.BaseMetadataIndexer.indexMetadata() (~37% of total harvesting time)
Call EsSearchManager.index() with parameter forceRefreshReaders = true
kernel.search.EsSearchManager.index()
forceRefreshReaders is true
Consequence: A bulk request is carried out with one document / metadata record
index.es.EsRestClient.bulkRequest()
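The "flag now, index later" idea from the possible solution could look roughly like the sketch below. It is only an illustration: `DeferredIndexQueue`, its methods, and the `bulkIndexer` callback are hypothetical names, and the callback stands in for whatever code would actually build and send one bulk request; none of this is existing GeoNetwork API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not GeoNetwork code: collects metadata ids during
// harvesting and hands them to a bulk indexer in batches.
final class DeferredIndexQueue {
    private final int batchSize;
    private final Consumer<List<String>> bulkIndexer; // stand-in for one bulk request
    private final List<String> pending = new ArrayList<>();

    DeferredIndexQueue(int batchSize, Consumer<List<String>> bulkIndexer) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.bulkIndexer = bulkIndexer;
    }

    /** Flags a record for indexing; sends a batch once enough ids are pending. */
    void flag(String metadataId) {
        pending.add(metadataId);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends all pending ids as one batch; call once after the harvest loop. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        bulkIndexer.accept(new ArrayList<>(pending)); // copy so the batch stays stable
        pending.clear();
    }

    public static void main(String[] args) {
        List<List<String>> sent = new ArrayList<>();
        DeferredIndexQueue queue = new DeferredIndexQueue(3, sent::add);
        for (int i = 1; i <= 7; i++) {
            queue.flag("uuid-" + i); // would replace the per-record index call
        }
        queue.flush(); // index the remainder at the end of the run
        System.out.println(sent.size() + " bulk requests: " + sent);
    }
}
```

The sketch is not thread-safe and ignores error handling. A real change would also need to decide what happens to flagged records when a harvesting run fails before the final flush.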
I just came here to say this is an excellent finding, @rime1014; I hope it will be addressed in a future update (I'm not a GeoNetwork member).
Unrelated to harvesting performance: I'm curious whether you have the CSW harvester's search filter working properly? Thank you.
Thank you very much.
Yes, we use search filters in some cases to harvest the metadata of a CSW interface in parts, using several harvesters (e.g. split by the editing date of the metadata, RevisionDate). This works well.