Is your feature request related to a problem? Please describe.
This is a suggestion for improving harvesting performance by configuring the maxRecords value for the getRecords request per harvester.
The impact of the maxRecords value on performance was noticed through the following observation in a harvester.
Warning
After the responses of a harvested CSW were reduced to 10 data records each (instead of 20), the harvesting time doubled from 13 hours to 26 hours.
We used the profiling tool VisualVM to analyze which methods require the most time during the harvesting process:
The analysis showed that the align method with the initialization of the UUIDMapper class was called twice as often.
Therefore, a DB query on the metadata table, filtered by harvester, is executed for every 10 data records (instead of every 20). With 259,188 metadata records, this corresponds to 25,918 DB queries, which is evident from the number of GeoNetwork warnings
Declared number of returned records (10) does not match requested record count (20)
in the harvester log file.
Before the switch to 10 records, only 12,959 DB queries would have been necessary.
Additionally, a matching of the local metadata with the remote metadata is performed for every 10 data records. Therefore, 10 metadata records of the CSW response are compared to all 259,188 metadata records of the harvester stored in the DB. This matching process is repeated 25,918 times (instead of 12,959 times with 20 metadata records within the CSW response). In total about 3.3 billion metadata records were compared during one harvesting process of 259,188 metadata records.
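The request counts above can be checked with a short calculation. This is only a sketch: the function names are made up for illustration, and it assumes that every response triggers exactly one lookup that loads all records of the harvester. It rounds up rather than down, because a final partial page still needs its own request, so the counts come out one higher than the figures quoted above. The totals depend on how comparisons are counted, so treat them as an order-of-magnitude check rather than a reproduction of the figure in the text.

```python
TOTAL_RECORDS = 259_188  # record count reported in this issue


def pages_needed(total: int, page_size: int) -> int:
    """Number of GetRecords responses needed to page through `total` records."""
    # Ceiling division in pure integer arithmetic.
    return -(-total // page_size)


def rows_loaded(total: int, page_size: int) -> int:
    """Rows loaded overall if each response reloads all `total` local records.

    Assumption: one full lookup per response, as described for align/UUIDMapper.
    """
    return pages_needed(total, page_size) * total


if __name__ == "__main__":
    for size in (10, 20, 100, 200):
        print(size, pages_needed(TOTAL_RECORDS, size), rows_loaded(TOTAL_RECORDS, size))
```

Under this model the cost scales inversely with the page size, which is consistent with the observed doubling of the harvesting time when the page size was halved.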
The database queries and the matching are a bottleneck because some of the methods involved (e.g. setDateAndTime) are time-consuming.
In addition, more getRecords queries against the CSW interface are necessary to retrieve all data.
Describe the solution you'd like
CSW interfaces might support a higher response value than 20 for maxRecords.
For each response to the getRecords query, the align method is called, which creates a new instance of the UUIDMapper. When the UUIDMapper is instantiated, the findAllSimple method is called, which uses a DB query to determine all metadata records already available in GeoNetwork for the given harvester.
With fewer getRecords queries due to a higher maxRecords value, the align method is called less often and therefore fewer DB queries are required.
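The relationship between maxRecords and the number of lookups can be illustrated with a small simulation. This is not the GeoNetwork code: `FakeMetadataStore`, `find_all_for_harvester` and `harvest` are hypothetical stand-ins that only mimic the behaviour described above, namely one full lookup of the harvester's local records per GetRecords response.

```python
from dataclasses import dataclass


@dataclass
class FakeMetadataStore:
    """Stand-in for the metadata table; counts lookups instead of querying a DB."""
    uuids: list
    query_count: int = 0

    def find_all_for_harvester(self) -> set:
        # Each call represents one DB query returning all records of the harvester.
        self.query_count += 1
        return set(self.uuids)


def harvest(remote_uuids: list, store: FakeMetadataStore, max_records: int) -> int:
    """Page through the remote records; return the number of responses processed."""
    responses = 0
    start = 0
    while start < len(remote_uuids):
        page = remote_uuids[start:start + max_records]
        responses += 1
        # Per-response lookup, mirroring the align/UUIDMapper behaviour described above.
        local = store.find_all_for_harvester()
        _unknown = [u for u in page if u not in local]  # matching step (result unused here)
        start += len(page)
    return responses


if __name__ == "__main__":
    remote = [f"uuid-{i}" for i in range(1000)]
    for max_records in (10, 20, 100):
        store = FakeMetadataStore(uuids=list(remote))
        harvest(remote, store, max_records)
        print(max_records, store.query_count)
```

In this model the lookup count equals the number of responses, so raising maxRecords from 10 to 100 cuts the lookups by a factor of ten for the same set of records.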
An additional setting in the harvester settings to set this value per harvester might significantly improve harvesting performance.
Default value: 20
Additional context
Result of the VisualVM analysis:
Maybe we could even default to a higher number (e.g. 200) to also reduce HTTP calls. A value of 200 was used in the INSPIRE monitoring exercise in the past and worked fine. To improve performance further, we could also use only the GetRecords operation with results instead of requesting each record with GetRecordById.
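As a rough illustration of this suggestion, a CSW 2.0.2 GetRecords request that asks for full records in pages of 200 could be built as follows. The helper name `build_get_records` and its defaults are assumptions made for this sketch; it omits `outputSchema` and any filter constraint, which a real harvester would normally set.

```python
import xml.etree.ElementTree as ET

CSW_NS = "http://www.opengis.net/cat/csw/2.0.2"


def build_get_records(start_position: int = 1, max_records: int = 200,
                      element_set: str = "full") -> str:
    """Return a CSW 2.0.2 GetRecords request body as an XML string.

    Hypothetical helper: resultType="results" asks for the records themselves,
    and ElementSetName "full" asks for complete records, so no follow-up
    GetRecordById request per record is needed.
    """
    ET.register_namespace("csw", CSW_NS)
    root = ET.Element(f"{{{CSW_NS}}}GetRecords", {
        "service": "CSW",
        "version": "2.0.2",
        "resultType": "results",
        "startPosition": str(start_position),
        "maxRecords": str(max_records),
    })
    query = ET.SubElement(root, f"{{{CSW_NS}}}Query", {"typeNames": "csw:Record"})
    ET.SubElement(query, f"{{{CSW_NS}}}ElementSetName").text = element_set
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_get_records(start_position=1, max_records=200))
```

Paging would then advance `start_position` by the number of records actually returned, since a server may return fewer records than requested.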
rime1014 changed the title from "OGC CSW 2.0.2 Harvesting / Performance / Konfiguration of getRecords-Value" to "OGC CSW 2.0.2 Harvesting / Performance / Configuration of getRecords-Value" on Apr 29, 2024
josegar74 added a commit to GeoCat/core-geonetwork that referenced this issue on May 19, 2024
- Increase GetRecords max records parameter to 100
- Use GetRecords with ElementSetName FULL to retrieve the full xml and avoid individual GetRecordById requests
Includes Sonarlint improvements.
Fixes geonetwork#7995