Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hds: prevent timer reset on every update #5977

Merged
merged 5 commits into from
Feb 28, 2019
Merged

Conversation

sschepens
Copy link
Contributor

If we push frequent updates but with the same interval, we may never get responses back because the timer gets reset on every update.
This change makes a reset only if the interval has changed or a disconnect ocurred.

Signed-off-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com>
@sschepens
Copy link
Contributor Author

@markatou @htuch could you take a look at this please?

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks legit, can you add a test case in hds_integration_test to cover this behavior?

@htuch htuch added the waiting label Feb 15, 2019
@htuch htuch self-assigned this Feb 15, 2019
@sschepens
Copy link
Contributor Author

@htuch i was wondering how to test this in an integration test. Do you have any ideas?
I was thinking about about making an update loop, but then verify that reports have been received, but this seems somehow tricky.

@htuch
Copy link
Member

htuch commented Feb 19, 2019

@sschepens I think some variant of

// Tests Envoy TCP health checking an endpoint that doesn't respond and reporting that it is
with multiple updates and a sufficiently long timeout to cover them should work. Bonus points for using Test/SimulatedTimeSystem (not sure how easy it is to use inside integration tests yet).

@sschepens
Copy link
Contributor Author

@htuch we have 2 things that could be tested, one is the update of the interval, if I change the interval I should receive a new response in the new interval. The other thing is the bug of the timer being reset everytime an update is received.
Not sure if we want to test both, testing an update on the interval is probably trivial, but testing the timer reset is somehow tricky, i was thinking of making a loop of update+responseCheck with a wait time lower than the interval, if the timer is reset everytime, this would generate a neverending loop.

@htuch
Copy link
Member

htuch commented Feb 19, 2019

@sschepens

i was thinking of making a loop of update+responseCheck with a wait time lower than the interval, if the timer is reset everytime, this would generate a neverending loop.

Maybe just iterate some arbitrary number of times, e.g. 10, and see if we observe 10 responses?

Signed-off-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com>
@sschepens
Copy link
Contributor Author

sschepens commented Feb 19, 2019

@htuch

Maybe just iterate some arbitrary number of times, e.g. 10, and see if we observe 10 responses?

We would need to sleep enough in each iteration to allow for responses to arrive, would that be OK?

I just committed a test for interval update, could you take a look at it please?

@htuch
Copy link
Member

htuch commented Feb 19, 2019

Interval update test looks good (although I wish we could use simulated time, will need to check with @jmarantz when he is back on this). I'd take as guidance that we don't want these tests to take more than a few seconds each as the upper bound on any real-time delays you'll add.

Signed-off-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com>
@htuch htuch added the waiting label Feb 20, 2019
@stale
Copy link

stale bot commented Feb 27, 2019

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Feb 27, 2019
@jmarantz
Copy link
Contributor

+1 for using simulated time, which is definitely done in at least one integration test:

public Event::SimulatedTimeSystem,

I'm not 100% sure whether that will work in this case, because if the timing is done in grpc itself it won't be using the Envoy time system abstraction.

It's easy enough to tell: just switch to a SimulatedTimeSystem per the pattern above, and then run the test. It will run very fast, and probably fail 99% of the time, or work 100% of the time.

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Feb 27, 2019
Signed-off-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com>
@sschepens
Copy link
Contributor Author

It's easy enough to tell: just switch to a SimulatedTimeSystem per the pattern above, and then run the test. It will run very fast, and probably fail 99% of the time, or work 100% of the time.

@jmarantz it would seem that Event::SimulatedTimeSystem conflicts with BaseIntegrationTest they both provide a timeSystem() getter, which is actually being used in hds_integration_test.
what would be the way to work around this?

@jmarantz
Copy link
Contributor

See the example I pointed to. Just inherit from SimulatedTimeSystem in your test class.

TimeSystem is the base class for SimulatedTimeSystem.

Signed-off-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com>
@sschepens
Copy link
Contributor Author

Maybe make the time skew 250ms instead of 100ms between expect/timeout, that should hopefully not flake.

@htuch not excatly sure what you meant here, but I increased the timeouts to avoid flakes

@jmarantz
Copy link
Contributor

@sschepens ping me on slack if you have more questions about how sim-time works with integration tests. The the admin integration test I described above inherits indirectly from BaseIntegrationTest but it just inherits from SimulatedTimeSystem first, which establishes what time-system to use for the duration of the test.

@htuch
Copy link
Member

htuch commented Feb 28, 2019

@sschepens Looks good, but could you chat with @jmarantz and figure out if we could use simulated time here without too much extra work? Thanks.

@jmarantz
Copy link
Contributor

SimulatedTimeSystem works with this test, but actually (and surprisingly to me) makes it 10x slower. I want to go investigate why that is and follow up. In the meantime I suggest going forward with this as is, using real time and I can address later.

@htuch
Copy link
Member

htuch commented Feb 28, 2019

@jmarantz ack, thanks for looking into this folks.

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@htuch htuch merged commit 8c3321e into envoyproxy:master Feb 28, 2019
@sschepens sschepens deleted the hds-timer branch March 1, 2019 02:24
fredlas pushed a commit to fredlas/envoy that referenced this pull request Mar 5, 2019
If we push frequent updates but with the same interval, we may never get responses back because the timer gets reset on every update.
This change makes a reset only if the interval has changed or a disconnect ocurred.

Signed-off-by: Sebastian Schepens <sebastian.schepens@mercadolibre.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants