Revert TableOps.exists() behavior. Closes #768 #914
milleruntime wants to merge 4 commits into apache:master
Conversation
Could you make this more concise by just saying "at the moment this method was called"? Was this some kind of user complaint? The examples you included seem obvious to me for any database.
Not a complaint, but poorly used. I have seen users make decisions based on the outcome, like:

```java
tableOps().delete(t1);
// ... do some stuff ...
if (tableOps().exists(t1))
  throw new Exception();
```
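To make the hazard concrete, here is a toy, self-contained simulation of the pattern above. The class and map names (`StaleExistsDemo`, `zooStore`, `clientCache`) are hypothetical stand-ins, not Accumulo's API; the point is only that an exists() answered from an un-invalidated client cache can contradict a delete that already happened.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: a client-side cache that is not invalidated on delete()
// keeps exists() answering "true" for a table that is already gone.
public class StaleExistsDemo {
  // Authoritative table map (stands in for ZooKeeper).
  static Map<String, String> zooStore = new HashMap<>();
  // Client-side cache of table name -> id.
  static Map<String, String> clientCache = new HashMap<>();

  static void create(String table) {
    zooStore.put(table, "id-" + table);
    clientCache.put(table, zooStore.get(table)); // cache warmed on create
  }

  static void delete(String table) {
    zooStore.remove(table);
    // The behavior under discussion: the client cache is NOT invalidated
    // here, so exists() serves stale data until the cache refreshes.
  }

  static boolean exists(String table) {
    return clientCache.containsKey(table); // answers from the cache only
  }

  public static void main(String[] args) {
    create("t1");
    delete("t1");
    // Gone in the authoritative store, but the cached answer disagrees:
    System.out.println("exists after delete: " + exists("t1"));
  }
}
```

A user who treats that stale `true` as ground truth (as in the snippet above) will throw an exception for a table that was in fact deleted.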
I'm not sure I see the value in the addition, then. If you're using a database in isolation, the semantics are obvious. But if you're using a shared system, then of course other folks might do stuff that messes with you. Seems like it would be opening up a rabbit hole to me.
I am not sure what you mean. We don't want to remove or deprecate the method because it can be useful. But it can be problematic, so I am just trying to warn users. What is obvious to you may not be obvious to everyone.
Perhaps I could make the warning more concise, without explaining the situations in detail?
My opinion is that we shouldn't be catering documentation to folks who don't know that their distributed, multi-tenant database may have changes made by other users.
I agree. But our documentation can be clearer and even link to common issues like this.
Do carry on trying to make the docs better; that's a worthwhile endeavor. I don't think I have anything else to add here, and I don't have strong feelings on what you want to do.
I agree with adding the clarification to the documentation, but there may be more we can do with the underlying issue. This could be handled by keeping #768 open, or by creating a new feature request.

Some background: it seems this issue was triggered by recent changes in the ZooKeeper caching that changed client behavior and increased the time (minutes) before clients would see table changes (additions, in this case).

The scenario: Process 1 creates a table, performs a bulk import, and stores the created table name externally for clients. Clients 1..n use the external name to scan the table. The clients were defensively coded, using exists() before starting the scan and throwing errors if exists() returned false. With the changes/additions of ZooKeeper caching in the client, the period while the table existed but was not "known" to the client in the cache increased, with the delay now possibly being minutes.

The use of exists() seemed reasonable because tables are externally managed, and if the external processes were not managing tables as expected, the exists() check could detect this upfront before starting a scan. In this case, the "fix" was to remove the check and instead rely on the scanner to determine if the table exists. Because the scanner / tservers handle the ZooKeeper caching differently than the client (IIRC, they clear the cache and retry), the issue was mitigated.

Caching is an internal implementation detail, and clients do need to understand that these types of conditions can and will occur in a distributed system, but we may be able to reduce the period while this condition exists. Clarifying the documentation helps, but we should consider adopting the scanner strategy for exists(), where it automatically checks with ZooKeeper on a cache miss and updates the cache entry, or clears the cache more aggressively. (Less desirable, because it exposes internals, would be to provide clients a clearCache() or clearAndCheckExists() method.)
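The "check with ZooKeeper on a cache miss and update the cache entry" strategy described above can be sketched as a small simulation. Everything here (`CacheMissFallback`, `zooStore`, `cache`) is a hypothetical stand-in for illustration, not Accumulo's actual caching code:

```java
import java.util.HashMap;
import java.util.Map;

// On a cache miss, treat the answer as "unknown": consult the
// authoritative store directly and repair the cache entry, instead of
// reporting that the table does not exist.
public class CacheMissFallback {
  static Map<String, String> zooStore = new HashMap<>(); // authoritative
  static Map<String, String> cache = new HashMap<>();    // client cache

  static boolean exists(String table) {
    String id = cache.get(table);
    if (id != null) {
      return true;                // cache hit: trust it
    }
    id = zooStore.get(table);     // cache miss: go to ZooKeeper
    if (id != null) {
      cache.put(table, id);       // repair the stale/missing cache entry
      return true;
    }
    return false;                 // truly absent in the authoritative store
  }

  public static void main(String[] args) {
    zooStore.put("t1", "1a");     // table created elsewhere; cache is cold
    System.out.println(exists("t1"));    // miss falls through to ZooKeeper
    System.out.println(cache.get("t1")); // cache entry was repaired
  }
}
```

This handles the "created but not yet visible" case discussed above; note it does nothing for the delete case, where the cache holds a stale positive entry rather than a miss.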
Woah. That's... bad. I'd expect the upper bound on this to be a couple of seconds, not multiple minutes. What's the change you're referencing, @EdColeman?
I think it may be related to the changes in #364. If not that specific change, it was around that section of code. I need to look more to verify this.
Ah OK, thanks for the info @EdColeman. I will leave this open since it may point to an underlying issue.
I think you can improve the docs separately from any possible underlying issue.
Whoops, well I just pushed b04b91c. I could make this a separate PR.
OK, so my changes in b04b91c are similar to what was done in 1.8, prior to the table ID cache map change. This will obviously eliminate the benefits of having a cache for this method. I am not sure what to do, since reverting will cause performance to regress for use cases like Fluo, with lots of lookups but not many create/drop table operations. @EdColeman's use case is the opposite: there are many create/drop table operations, but they want the exists check to be more accurate. Maybe we revert the functionality of the exists method to the 1.8 behavior and create a new method that uses cached table IDs?
I did a little more git log research. The more I think about it, the more I think we should revert the behavior of this method back to just using the ZooCache, like we did in 1.8.1. I feel this is a good compromise: other areas of Accumulo will keep the benefits of the cache, but the behavior of this method will go back to what was expected in previous releases. Anyone else have thoughts on this?
@EdColeman, do you think reverting the exists method back to the 1.8 behavior is needed? Or does using the scanner suffice for your use case?
I think the current behavior should be classified as a bug and not something that should only be handled by documentation. With that, I think the discussion then becomes: how serious a bug is it? Would I consider it a blocker for the next 1.9.x or 2.0 release? Maybe not. For 2.1, probably.

There is a work-around for the "forward" case, where the table has been created but not propagated to the cache: instead of checking exists(), relying on the scanner to fail "works", but it seems like it could be an expensive way to check, especially in cases where the table truly did not exist but people keep trying to scan anyway.

I do not know how to handle the reverse case, where you want to make sure a table has been deleted. If there was a method to clear the cache, you could test, and if you didn't get the answer you expected, clear the cache and retry. That may be a better work-around than relying on the scanner behavior.

Do you think reverting the behavior would impact performance of applications that have a relatively static configuration? Are there ways to use the znode timestamps / ids to determine if the cache is current?
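The clear-cache-and-retry work-around suggested above for the delete case might look like the following toy sketch. `ClearAndRetry`, `clearCache()`, and the bounded `maxRetries` loop are all hypothetical, added only to illustrate the idea:

```java
import java.util.HashMap;
import java.util.Map;

// After a delete, if a cached exists() still answers true, drop the cache
// and check again, bounded so a genuinely-present table doesn't loop forever.
public class ClearAndRetry {
  static Map<String, String> zooStore = new HashMap<>(); // authoritative
  static Map<String, String> cache = new HashMap<>();    // client cache

  static boolean existsCached(String table) {
    return cache.containsKey(table);
  }

  static void clearCache() {
    cache.clear();
    cache.putAll(zooStore); // repopulate from the authoritative store
  }

  // Returns false once the deletion is visible; true if the table still
  // appears to exist after maxRetries cache refreshes.
  static boolean existsAfterDelete(String table, int maxRetries) {
    for (int i = 0; i < maxRetries; i++) {
      if (!existsCached(table)) {
        return false;
      }
      clearCache(); // stale answer? drop the cache and look again
    }
    return existsCached(table);
  }

  public static void main(String[] args) {
    zooStore.put("t1", "1a");
    cache.put("t1", "1a");
    zooStore.remove("t1"); // delete happens, but the cache still has t1
    System.out.println(existsAfterDelete("t1", 3));
  }
}
```

The retry bound matters: without it, a table that really does exist would make the loop spin forever.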
I believe so. Applications that took advantage of the performance improvements when upgrading to 1.9 (I believe Fluo is one of them) will be impacted by reverting the behavior back to 1.8.
This may be possible by checking the "mtime" on the tables node, which should indicate when a table ID -> name mapping was updated or a table was added/dropped. This would require a bit more work, though. I am thinking we may have put ourselves in a situation where we need to create another exists method: revert the behavior of the current method to always check ZooKeeper, then add a new method, something like "existsInCache(String tableName, boolean clearCacheOnFail)", which will have better performance and the option to clear the cache. @keith-turner thoughts on this?
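The split proposed above could be sketched like this. The existsInCache signature comes from the comment; the class name and the map-backed implementation are toy stand-ins, not Accumulo code:

```java
import java.util.HashMap;
import java.util.Map;

// Two methods: exists() always consults the authoritative store (the
// 1.8-style behavior), while existsInCache() serves fast lookups from the
// cache, optionally clearing and rechecking on a miss.
public class TableOpsSketch {
  static Map<String, String> zooStore = new HashMap<>(); // "ZooKeeper"
  static Map<String, String> cache = new HashMap<>();    // client cache

  // Always accurate: goes straight to the authoritative store.
  static boolean exists(String tableName) {
    return zooStore.containsKey(tableName);
  }

  // Fast path: answer from the cache; optionally refresh on a miss.
  static boolean existsInCache(String tableName, boolean clearCacheOnFail) {
    if (cache.containsKey(tableName)) {
      return true;
    }
    if (clearCacheOnFail) {
      cache.clear();
      cache.putAll(zooStore); // refresh from the authoritative store
      return cache.containsKey(tableName);
    }
    return false;
  }

  public static void main(String[] args) {
    zooStore.put("t1", "1a"); // created elsewhere; cache is cold
    System.out.println(existsInCache("t1", false)); // stale cache says no
    System.out.println(existsInCache("t1", true));  // refresh finds it
    System.out.println(exists("t1"));               // always accurate
  }
}
```

This would let callers like Fluo keep the cheap cached lookup while callers who need accuracy pay the ZooKeeper round trip explicitly.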
There appears to be a race condition with the new cache of the table ID name map. I will close this PR since it doesn't fix the problem.
A bunch of words to try and explain the race conditions to the user