Skip to content

Conversation

@enochtangg
Copy link
Contributor

@enochtangg enochtangg commented Nov 26, 2024

Overview

This is PR is mainly responsible for adding an end-to-end test for validating that consumer rebalances should not produce 1 or more messages in sqlite. More importantly, this PR introduces python dependencies (pytest) into this project as it is much simpler/quicker to write these end-to-end test in python.

Testing

Run make integration-test

@enochtangg
Copy link
Contributor Author

Looking for some feedback on the following:

  1. Thoughts on using pytest for end-to-end integration testing with sentry?
  2. Currently, the test_rebalancing_only_processed_once() doesn't actually produce any messages. One thought was to run the send_task CLI command as a previous step in CI before running the pytest. But this wouldn't be repeatable for other tests. Maybe we could run the CLI command in the python test?
  3. The test could probably use some clean up.

@enochtangg enochtangg requested a review from a team November 26, 2024 22:35
import yaml


def manage_consumer(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the purpose of this function. At the moment it starts, and then after a random delay dies. Is that the intended behaviour? There's no guarantee it will process the messages assigned to it. I think Mark had a good idea in his PR, where he passes in the number of expected messages, and can track to see whether the consumer has processed them or not.

Copy link
Contributor Author

@enochtangg enochtangg Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the function starts, it executes the rust binary in a new child process. Then, thread itself sleeps for a random delay. This gives the consumer a chance to start up, receive assigned partitions, and process messages. Finally, a SIGINT is sent to the process and the consumer shutsdown gracefully.

It is true that we aren't explicitly checking all messages are processed here. It is possible that the consumer is started and dies before it can process any messages (though unlikely given the min_sleep and max_sleep parameters). The purpose of the test is to check for duplicate messages during rebalancing, not necessarily that all messages have been processed.

@markstory markstory linked an issue Nov 28, 2024 that may be closed by this pull request
process = subprocess.Popen(
[consumer_path, "-c", config_file_path], stderr=subprocess.STDOUT, stdout=log_file
)
random.seed(random_seed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to set the seed multiple times. Doing it once before you generate other random values is enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the outcomes of seeding, is that all (8) consumers restart at the same time for every iteration. FWIW, I think this is also a good test to have because we'd want to simulate both rolling deploys and deploys which shutdown multiple consumers at the same time.

min_restart_duration = 5
max_restart_duration = 20
topic_name = "task-worker"
curr_time = int(time.time())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we fixate the time/seed when trying to reproduce a failure? We could read from an environment variable like TEST_SEED 🤷

Copy link
Contributor

@john-z-yang john-z-yang Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we just have the seed as cli argument? And in our gha it's always set to 1 or something
Ooops never mind, I see we already went for envar

@enochtangg enochtangg marked this pull request as ready for review November 29, 2024 20:42
@enochtangg
Copy link
Contributor Author

enochtangg commented Nov 29, 2024

This integration test no longer relies on sentry to produce test messages for integration tests for two reasons:

  1. It's nicer to separate the two concerns and functionally test each component without potentially breaking each other. Breaking sentry taskworker will not break taskbroker CI.
  2. With the new devinfra tool, we might not need to clone and install deps for sentry in taskbroker CI. This removes that requirement.

Copy link
Member

@markstory markstory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. I think we should move forward with this. We can refine and expand it incrementally.

@enochtangg enochtangg changed the title feat: Add template for sentry CI and setup rebalance integration testing feat: Add rebalance integration testing via pytest and setup integration testing on CI Nov 30, 2024
@enochtangg enochtangg force-pushed the github-ci branch 2 times, most recently from c566a7a to 75ca8e0 Compare December 2, 2024 04:46
@enochtangg enochtangg force-pushed the github-ci branch 2 times, most recently from bd85f0a to 1a197a5 Compare December 2, 2024 21:54
@enochtangg
Copy link
Contributor Author

enochtangg commented Dec 3, 2024

After a long investigation, @john-z-yang and I have found what change was causing CI to fail this integration test. Strangely enough, this test passed consistently locally which made investigating this troublesome. When this test was executed on ubuntu, the consumer would write duplicate messages (typically 0-2 before and after the rebalance). Reverting this change: 2b9064c#diff-316edec7494fa460e0a4bdc62a604da6b4caa7933db418d023bbfd27c6277c4dR64 fixed the problem. This is because when using select! macro, the consumer client shutdown step was not fully completed since the check_for_shutdown() yields first during rebalancing. As a result, the commit state was not being saved correctly causing duplicate entries.

res
}
}
handle_os_signals(event_sender.clone());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is now working the way we want: how to handle_events and handle_consumer_client signal shutdowns if they error out? I think if those threads just stop running, the consumer won't shut down?

Copy link
Contributor

@john-z-yang john-z-yang Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind elaborating on this? I'm having a hard time coming up with a scenario where those 2 threads panic

@enochtangg enochtangg merged commit 7c9107c into main Dec 3, 2024
3 of 4 checks passed
@enochtangg enochtangg deleted the github-ci branch December 3, 2024 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Write end to end testing harness for loading sqlite

5 participants