
[HUDI-6399] Warn when datadog api key is wrong#8997

Closed
parisni wants to merge 1 commit into apache:master from parisni:fix-datadog-wrong-apikey

Conversation

@parisni
Contributor

@parisni parisni commented Jun 16, 2023

Change Logs

Currently, when the Datadog API key is wrong, the job fails. We should not fail the job but log a warning instead, to avoid failing the whole pipeline.

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@parisni
Contributor Author

parisni commented Jun 16, 2023

@xushiyan maybe?

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

```diff
-    } catch (IOException e) {
-      throw new IllegalStateException("Failed to connect to Datadog to validate API key.", e);
+    } catch (IOException | IllegalStateException e) {
+      LOG.warn(String.format("Failed to connect to Datadog to validate API key. %s", e.getMessage()));
```
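The hunk above swaps a hard failure for a warning. A minimal self-contained sketch of the proposed behavior, where the class and method names and the simulated `IOException` are hypothetical stand-ins, not Hudi's actual reporter code (a real implementation would call Datadog's key-validation endpoint over HTTP):

```java
import java.io.IOException;
import java.util.logging.Logger;

public class ApiKeyValidationSketch {
  private static final Logger LOG = Logger.getLogger(ApiKeyValidationSketch.class.getName());

  // Hypothetical stand-in for the Datadog key-validation call.
  static void checkApiKey(boolean keyIsValid) throws IOException {
    if (!keyIsValid) {
      throw new IOException("403 Forbidden from Datadog validate endpoint");
    }
  }

  // Proposed behavior: catch the failure and warn instead of rethrowing,
  // so the ingestion pipeline keeps running without metrics.
  static boolean validateApiKeyLeniently(boolean keyIsValid) {
    try {
      checkApiKey(keyIsValid);
      return true;
    } catch (IOException | IllegalStateException e) {
      LOG.warning(String.format("Failed to connect to Datadog to validate API key. %s", e.getMessage()));
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(validateApiKeyLeniently(true));
    System.out.println(validateApiKeyLeniently(false));
  }
}
```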
Contributor

Should we still fail the job if the metrics collector does not work?

Contributor Author

We just should show a warning as proposed here

Contributor Author

Do you mean catch any exception? That makes sense.

Member

Should we still fail the job if the metrics collector does not work?

+1

Contributor

What I meant is, we should not change the behavior of throwing an exception here. If metric collection does not work because of the API key, it should fail the job so that the user knows and fixes it before proceeding.

Contributor

I don't understand the rationale here. Shouldn't the user know the API key is not working and fix it before running the job again, so that metrics are properly generated? It's not a good idea to silently fail here.

Contributor Author

@parisni parisni Jun 20, 2023


To me, the metrics provider is responsible for alerting the user when metrics stop working (mail alarm, on-call, ...). But the ingestion jobs should not stop working. Not having metrics is a minor problem compared to having all the company's pipelines broken because of a token renewal issue.

Also, users configure the metrics provider to alarm when no metrics arrive.

At least, I assume some users won't want their nightly jobs broken because of a token; the same would apply to any API or metrics-collection outage, which is a minor problem compared to the jobs stopping.

Currently the same applies to pushing metrics: if it does not work, it is only a warning, see

```java
try {
  MetricRegistry registry = Metrics.getInstance().getRegistry();
  HoodieGauge gauge = (HoodieGauge) registry.gauge(metricName, () -> new HoodieGauge<>(value));
  gauge.setValue(value);
} catch (Exception e) {
  // Here we catch all exceptions, so the major upsert pipeline will not be affected if the
  // metrics system has some issues.
  LOG.error("Failed to send metrics: ", e);
}
```

Contributor

Got it. I suggest having a feature flag on whether to fail the job if the metrics provider does not work. By default it's on, i.e., the job fails due to the metrics provider, the same behavior as before, while users can turn it off when metrics can be skipped.
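The flag suggested above could be sketched like this; the name `failOnMetricsError` and the method are hypothetical, since the reviewer only proposes "a feature flag", not a concrete config key:

```java
import java.io.IOException;
import java.util.logging.Logger;

public class MetricsFailureFlagSketch {
  private static final Logger LOG = Logger.getLogger(MetricsFailureFlagSketch.class.getName());

  // If failOnMetricsError is true (the proposed default), keep today's
  // behavior and fail fast; otherwise degrade to a warning so the
  // ingestion job keeps running without metrics.
  static void onValidationFailure(IOException cause, boolean failOnMetricsError) {
    if (failOnMetricsError) {
      throw new IllegalStateException("Failed to connect to Datadog to validate API key.", cause);
    }
    LOG.warning("Failed to connect to Datadog to validate API key. " + cause.getMessage());
  }
}
```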

Contributor Author

Sure, will do that

Contributor Author

https://hudi.apache.org/docs/configurations#hoodiemetricsdatadogapikeyskipvalidation
Shame on me, the option to skip validation already exists. Then this PR is unnecessary.
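For reference, the escape hatch linked above is a write config. A minimal properties snippet, with the key name taken from the linked configurations page and the default shown being an assumption:

```properties
# Skip Datadog API key validation at reporter startup so a bad key
# does not fail the job (assumed default: false, i.e. validate and fail).
hoodie.metrics.datadog.api.key.skip.validation=true
```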


Labels

area:metrics Metrics and monitoring

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

4 participants