Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send Customer.io notifications and warnings for connections being auto-disabled #11670

Merged
merged 19 commits into from
Apr 20, 2022

Conversation

terencecho
Copy link
Contributor

@terencecho terencecho commented Apr 1, 2022

What

A feature flag exists where connections that have been constantly failing will be auto-disabled to save on resources. This PR utilizes infrastructure from a previous PR to send notifications of when connections have been auto-disabled, or warnings that they are about to be auto-disabled.

These notifications require that a Customer.io api key is set as an env variable and that the AUTO_DISABLE_FAILING_CONNECTIONS flag is turned on. It also makes the assumption that an email template exists in your Customer.io account.

This PR address issue #10929.

This followup issue is to allow for users to specify which email they want the notifications to go to.

How

A ticket exists to allow users to set up a customer.io notification configuration, similar to how it works with slack notifications. Because this behavior does not exist yet, a separate method has been created to make sure that these notifications are sent even without a configuration being set. These notifications are sent to the email associated with the workspace of the connection being disabled.

To do

  • send warning when there has been only failing connections for the past maxDaysOfOnlyFailedJobs / 2 days

Recommended reading order

  1. airbyte-scheduler/persistence/src/main/java/io/airbyte/scheduler/persistence/JobNotifier.java
  2. airbyte-workers/src/main/java/io/airbyte/workers/temporal/scheduling/activities/AutoDisableConnectionActivityImpl.java

@github-actions github-actions bot added area/platform issues related to the platform area/scheduler area/worker Related to worker labels Apr 1, 2022
@terencecho terencecho temporarily deployed to more-secrets April 2, 2022 00:21 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 2, 2022 00:21 Inactive
@terencecho terencecho requested review from pmossman and removed request for pmossman April 4, 2022 17:46
@terencecho terencecho temporarily deployed to more-secrets April 6, 2022 22:23 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 6, 2022 22:23 Inactive
@terencecho terencecho marked this pull request as ready for review April 6, 2022 22:55
@terencecho terencecho temporarily deployed to more-secrets April 6, 2022 23:45 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 6, 2022 23:45 Inactive
@terencecho terencecho requested a review from pmossman April 7, 2022 00:00
@terencecho terencecho temporarily deployed to more-secrets April 7, 2022 00:01 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 7, 2022 00:01 Inactive
@terencecho
Copy link
Contributor Author

I wanted to point out for the reviewers that the warning notifications are meant to be sent out when they have reached 50% of the failure limit, so 50 out of 100 consecutive failure OR 7 days of straight failure, halfway to the 14 days where the auto disable kicks in.

For the consecutive failures, we can just warn once when the number of failures is exactly at 50, but for the days of straight failures, the current implementation does keep a state of if a warning has been sent. This means currently, we send a warning after each failure after 7 days of straight failures, which is not idea.

Copy link
Contributor

@pmossman pmossman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a read through this, I think the logic for the most part makes sense, though it is pretty convoluted and a bit hard to follow. I wonder if readability would be helped if the large autoDisableFailingConnection() method were broken into discrete private method calls?

I'm also curious what happens to the temporal workflow when a connection is auto-disabled. I see that the autoDisableFailingConnection activity updates the database status of the connection to INACTIVE, but what happens to the temporal connection manager workflow?

@terencecho
Copy link
Contributor Author

terencecho commented Apr 8, 2022

Thanks for the comments Parker. Totally agree with the code being a bit hard to follow after I added the 7 day warning. I'll work on splitting it up and making it easier to understand. I wanted to get feedback on if the logic made sense and if I should be handling the logic in a different way.

When the connection is set to Inactive, the temporal connection manager workflow is still running, it's just that the next job to be scheduled is set to like 100s of years in the future (I forget the exact value), so basically stuck in a state of running, but doesn't do anything unless it receives another signal to do something.

@terencecho
Copy link
Contributor Author

Going to move this PR back to a draft while I refactor for readability.

@terencecho terencecho marked this pull request as draft April 8, 2022 17:13
final String logUrl = connectionPageUrl + connectionId;
final String jobDescription = getJobDescription(job, "");

if (CONNECTION_DISABLED_NOTIFICATION.equals(action)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's sad we don't use an enum here :(.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. I'll make a note to change these to enums once notification configurations/tables have been finalized as a part of #10999

final String logUrl = connectionPageUrl + connectionId;
final String jobDescription = getJobDescription(job, "");

if (CONNECTION_DISABLED_NOTIFICATION.equals(action)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We can use a switch here, like:

switch (action) {
case CONNECTION... 
break;
case CONNECTION_DIS....
break;
default
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

final String durationString = formatDurationPart(duration.toDaysPart(), "day")
+ formatDurationPart(duration.toHoursPart(), "hour")
+ formatDurationPart(duration.toMinutesPart(), "minute")
+ formatDurationPart(duration.toSecondsPart(), "second");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


final Job lastJob;
final Optional<Job> optionalJob = jobPersistence.getLastReplicationJob(input.getConnectionId());
if (optionalJob.isPresent()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it can be refactor in:

final lastJob = optionalJob.orElseThrow(() -> new Exeception())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@terencecho terencecho temporarily deployed to more-secrets April 8, 2022 18:53 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 8, 2022 18:53 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 19, 2022 05:20 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 19, 2022 05:43 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 19, 2022 05:43 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 19, 2022 06:30 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 19, 2022 06:30 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 19, 2022 06:38 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 19, 2022 06:38 Inactive
@terencecho
Copy link
Contributor Author

@pmossman PTAL

@terencecho
Copy link
Contributor Author

For extra context, I refactored the database query to also retrieve timestamps when getting status. With those timestamps, we calculate if the previous failure should have triggered a warning, if i should have, then we skip sending another warning to avoid spam.

Copy link
Contributor

@pmossman pmossman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@terencecho this makes sense to me! I made a recommendation to help with readability of the conditional logic in a couple places. I think once that's addressed, this should be good to go


// sending warning for disable connection if only failed jobs in the past
// maxDaysOfOnlyFailedJobsBeforeWarning days
if (numDaysSinceFirstReplicationJob >= maxDaysOfOnlyFailedJobsBeforeWarning &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit hard to follow because of the nesting, can you pull some of the inner conditional logic out into named variables to help with readability?

Something like:

final boolean firstReplicationIsOldEnough = numDaysSinceFirstReplicationJob >= maxDaysOfOnlyFailedJobsBeforeWarning;
final boolean enoughDaysPassedSinceLastSuccess = successTimestamp.isEmpty()
                || getDaysSinceTimestamp(currTimestampInSeconds, successTimestamp.get()) >= maxDaysOfOnlyFailedJobsBeforeWarning);

if (firstReplicationIsOldEnough && enoughDaysPassedSinceLastSuccess) {
    ...
}

You should pick better variable names if you can think of them, especially if mine don't actually match the logic 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions. I edited the above to help with readability.

I went with uglier, longer names like firstReplicationOlderThanMaxDisableWarningDays and successOlderThanPrevFailureByMaxWarningDays to help make things as clear as possible, but do let me know if they actually make things less clear, happy to go with your suggestions if you think so.

prevFailedJob = jobs.get(i);
}

return (successTimestamp.isEmpty()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar recommendation here for using local boolean variables to aid comprehension of the nested conditional

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@terencecho terencecho temporarily deployed to more-secrets April 20, 2022 00:18 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 20, 2022 00:18 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 20, 2022 00:23 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 20, 2022 00:24 Inactive
Copy link
Contributor

@pmossman pmossman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Definitely much easier to follow the logic now. LGTM!

noPreviousSuccess || getDaysSinceTimestamp(currTimestampInSeconds, successTimestamp.get()) >= maxDaysOfOnlyFailedJobsBeforeWarning;

// send warning if there are only failed jobs in the past maxDaysOfOnlyFailedJobsBeforeWarning days
// _except_ if a warning should have already been sent in the previous failure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: instead of _except_ if a warning..., go with _unless_ a warning...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@terencecho terencecho temporarily deployed to more-secrets April 20, 2022 00:33 Inactive
@terencecho terencecho temporarily deployed to more-secrets April 20, 2022 00:33 Inactive
@terencecho
Copy link
Contributor Author

@benmoriceau i'm going to merge this, but feel free to comment with any suggestions and I can follow up on another diff.

In case you want a tldr: didn't make any changes on the temporal side, instead decided to read timestamp info from database to determine if a warning should have already been sent to avoid spam instead of saving state of last warning sent somewhere.

@terencecho terencecho merged commit 9619f28 into master Apr 20, 2022
@terencecho terencecho deleted the tcho/10929-auto-disable-email-alert branch April 20, 2022 22:02
@terencecho terencecho linked an issue Apr 21, 2022 that may be closed by this pull request
suhomud pushed a commit that referenced this pull request May 23, 2022
…o-disabled (#11670)

* init commit to send notifications regarding connection auto-disabling

* remove debugging prints

* Add logic for warning about only failures for days

* cleanup comments

* remove deplicate code

* refactor activity to be more readable

* small cleanup

* Refactor JobNotifier

* fix formatting

* Save state of last warning

* Move last warning state to WorkflowInternalState

* Reduce spam by getting timestamp info instead of saving state

* edit tests

* Fix formatting

* cleanup checkIfWarningPreviouslySent

* fix formatting

* refactor for readability

* Add extra comments

* edit comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/platform issues related to the platform area/scheduler area/worker Related to worker
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Notify users of auto-disabled connections
3 participants