feat(elasticsearch): Implement optimization to use reindexing instead… #8352

Merged
merged 6 commits into from
Jul 12, 2023

Conversation

iprentic
Contributor

@iprentic iprentic commented Jun 30, 2023

… of deleteByQuery when a large amount of records will be deleted

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the product and devops labels Jun 30, 2023
@@ -326,6 +328,12 @@ public void configure() {
}
}

@Override
public String reindexAsync(String index, @Nullable QueryBuilder filterQuery, BatchWriteOperationsOptions options)
Collaborator

@david-leifker david-leifker Jun 30, 2023

At first glance, this doesn't seem right to me. The graph service index is not timeseries and would not be subject to timeseries-like truncation. Remove reindexAsync from the common interface; it is only needed on the timeseries-related classes.

@@ -44,6 +46,12 @@ public void configure() {
indexBuilders.reindexAll();
}

@Override
public String reindexAsync(String index, @Nullable QueryBuilder filterQuery, BatchWriteOperationsOptions options)
Collaborator

@david-leifker david-leifker Jun 30, 2023

Same as above: remove it from the interface. This also applies to the non-timeseries Service classes below.

@@ -220,6 +224,27 @@ public void buildIndex(ReindexConfig indexState) throws IOException {
}
}

public String reindexInPlaceAsync(String indexAlias, @Nullable QueryBuilder filterQuery, BatchWriteOperationsOptions options)
Contributor Author

Can do, I'll move that. That seems better than taking the batch size and timeout as parameters, since we may want to do other tuning (or support other storage backends that have different parameters) in the future.

@@ -220,6 +224,27 @@ public void buildIndex(ReindexConfig indexState) throws IOException {
}
}

public String reindexInPlaceAsync(String indexAlias, @Nullable QueryBuilder filterQuery, BatchWriteOperationsOptions options)
Collaborator

This is very much just for timeseries, so I'd say we should name it differently to make that clear. It then also makes sense that a timeseries method would take an options parameter from the timeseries package.



public interface ElasticSearchIndexed {
String reindexAsync(String index, @Nullable QueryBuilder filterQuery, BatchWriteOperationsOptions options)
Collaborator

Let's remove it from this interface; it is not generally a method for all indices at this time.

@@ -33,6 +36,12 @@ public void reindexAll() {
}
}

@Override
Collaborator

This should exist, though perhaps not as an override. Depending on how the clean-up of the ElasticSearchIndexed interface goes, you may end up wanting to create a new interface like ElasticSearchIndexedTimeseries (yikes, that's long). The timeseries interface applies to the indices of timeseries Services (or a subset of them). Most Services will not implement the timeseries methods: entity, system metadata, and graph, for example, do not require the timeseries truncations.
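
For illustration, the interface split suggested above could look roughly like the sketch below. This is not the code that was merged. FilterQuery and BatchOptions are hypothetical stand-ins for QueryBuilder and BatchWriteOperationsOptions so that the example compiles on its own, the reindexAll() method on the common interface is an assumption, and the service classes are toy implementations.

```java
/** Sketch only: a narrower timeseries interface layered on the common one. */
class TimeseriesIndexSketch {

    /** Hypothetical stand-in for Elasticsearch's QueryBuilder. */
    interface FilterQuery {
        String describe();
    }

    /** Hypothetical stand-in for BatchWriteOperationsOptions. */
    static class BatchOptions {
        final int batchSize;
        final long timeoutSeconds;

        BatchOptions(int batchSize, long timeoutSeconds) {
            this.batchSize = batchSize;
            this.timeoutSeconds = timeoutSeconds;
        }
    }

    /** Common contract (method list assumed): every indexed service can rebuild its indices. */
    interface ElasticSearchIndexed {
        void reindexAll();
    }

    /** Narrower contract, implemented only by timeseries services. */
    interface ElasticSearchIndexedTimeseries extends ElasticSearchIndexed {
        /** Starts a filtered reindex and returns an identifier for the async task. */
        String reindexAsync(String index, FilterQuery filterQuery, BatchOptions options);
    }

    /** Non-timeseries service: only the common contract applies. */
    static class GraphService implements ElasticSearchIndexed {
        boolean reindexed = false;

        @Override
        public void reindexAll() {
            reindexed = true;
        }
    }

    /** Timeseries service: also exposes the truncation-oriented reindex. */
    static class TimeseriesAspectService implements ElasticSearchIndexedTimeseries {
        @Override
        public void reindexAll() {
            // No-op in this sketch.
        }

        @Override
        public String reindexAsync(String index, FilterQuery filterQuery, BatchOptions options) {
            // A real implementation would submit a reindex request; this returns a fake id.
            return "fake-node:" + options.batchSize;
        }
    }
}
```

With this shape, graph, entity, and system metadata services never see reindexAsync, and callers that need truncation depend on the timeseries interface explicitly.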

@iprentic iprentic changed the title [WIP] feat(elasticsearch): Implement optimization to use reindexing instead… feat(elasticsearch): Implement optimization to use reindexing instead… Jul 7, 2023
@iprentic
Contributor Author

iprentic commented Jul 7, 2023

Testing on quickstart instance:

Dry run:

$ curl --location --request POST  http://localhost:8080/operations\?action\=truncateTimeseriesAspect \
--header 'Content-Type: application/json' \
--data-raw '{
    "entityType": "dataset",
    "aspect": "datasetusagestatistics",
    "endTimeMillis": 1687970192000,
    "dryRun": true,
    "batchSize": 100,
    "timeoutSeconds": 3600
}'
{"value":"Delete 134517 out of 135600 rows (99.20%). Reindexing the aspect without the deleted records. This was a dry run. Run with dryRun = false to execute."}
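
The dry-run message above reflects a choice between deleteByQuery and reindexing based on how much of the index would be deleted. A minimal sketch of that kind of decision follows. The 0.5 threshold, the class name, and the method names are assumptions for illustration; the PR's actual cut-off is not shown in this conversation.

```java
import java.util.Locale;

/** Sketch only: pick a truncation strategy from the fraction of rows to delete. */
class TruncationPlanSketch {

    /** Hypothetical cut-off: above this fraction, rebuilding is assumed cheaper than deleting. */
    static final double REINDEX_THRESHOLD = 0.5;

    /** Returns true when the share of rows to delete exceeds the assumed threshold. */
    static boolean shouldReindex(long toDelete, long total) {
        if (total <= 0) {
            return false;
        }
        return (double) toDelete / total > REINDEX_THRESHOLD;
    }

    /** Builds a summary line similar in spirit to the dry-run output. */
    static String describe(long toDelete, long total) {
        double pct = total == 0 ? 0.0 : 100.0 * toDelete / total;
        String action = shouldReindex(toDelete, total)
                ? "reindex without the deleted records"
                : "deleteByQuery";
        return String.format(Locale.ROOT,
                "Delete %d out of %d rows (%.2f%%): %s", toDelete, total, pct, action);
    }
}
```

With the numbers from the dry run, describe(134517, 135600) reports 99.20% and selects the reindex path.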

Execute:

$ curl --location --request POST  http://localhost:8080/operations\?action\=truncateTimeseriesAspect \
--header 'Content-Type: application/json' \
--data-raw '{
    "entityType": "dataset",
    "aspect": "datasetusagestatistics",
    "endTimeMillis": 1687970192000,
    "dryRun": false,
    "batchSize": 100,
    "timeoutSeconds": 3600
}'
{"value":"qhxGdzytQS-pQek8CwBCZg:533446"}
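
The returned value has the shape of an Elasticsearch task ID (node ID, a colon, then a task number), which would fit a method named reindexAsync, although the conversation does not state this explicitly. Assuming it is one, the sketch below splits the ID and builds the task management API URL that could be polled for progress. The class and method names are hypothetical, and the base URL is taken from the localhost:9200 address used in the count check.

```java
import java.net.URI;

/** Sketch only: helpers for an Elasticsearch-style task id of the form nodeId:taskNumber. */
class ReindexTaskSketch {

    /** Everything before the last colon is treated as the node id. */
    static String nodeId(String taskId) {
        return taskId.substring(0, taskId.lastIndexOf(':'));
    }

    /** Everything after the last colon is treated as the numeric task number. */
    static long taskNumber(String taskId) {
        return Long.parseLong(taskId.substring(taskId.lastIndexOf(':') + 1));
    }

    /** URL of the task status endpoint (GET _tasks/<task_id>) for the given cluster base URL. */
    static URI statusUri(String esBaseUrl, String taskId) {
        return URI.create(esBaseUrl + "/_tasks/" + taskId);
    }
}
```

For the ID above, statusUri("http://localhost:9200", "qhxGdzytQS-pQek8CwBCZg:533446") yields http://localhost:9200/_tasks/qhxGdzytQS-pQek8CwBCZg:533446, which can be fetched with curl in the same way as the count request.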

Number of records after the reindex:

$ curl -X GET "localhost:9200/dataset_datasetusagestatisticsaspect_v1/_count?pretty"                  
{
  "count" : 138,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

@anshbansal anshbansal merged commit a884cf3 into master Jul 12, 2023
36 checks passed
@anshbansal anshbansal deleted the nd-reindex branch July 12, 2023 11:42
Labels
devops PR or Issue related to DataHub backend & deployment product PR or Issue related to the DataHub UI/UX
3 participants