Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: ExtendLease() returns existing valid unexpired lease #7781

Merged
merged 1 commit into from Jul 14, 2016

Conversation

vivekmenezes
Copy link
Contributor

@vivekmenezes vivekmenezes commented Jul 12, 2016

ExtendLease() was always writing to the table descriptor resulting
in the system config getting gossiped more often than needed.
This change writes a lease with a new expiration time only when the
lease is closer to its expiration time. With schema changes now
executing in chunks and ExtendLease() being called more often after
each chunk is processed, forcing a new expiration time is
unnecessary.

fixes #7724


This change is Reviewable

@andreimatei
Copy link
Contributor

:lgtm:


Review status: 0 of 2 files reviewed at latest revision, 10 unresolved discussions.


sql/schema_changer.go, line 81 [r1] (raw file):

func (sc *SchemaChanger) createSchemaChangeLease() sqlbase.TableDescriptor_SchemaChangeLease {
  return sqlbase.TableDescriptor_SchemaChangeLease{
      NodeID: sc.nodeID, ExpirationTime: timeutil.Now().Add(jitteredLeaseDuration()).UnixNano()}

if you feel so inclined, let's change this to not use LeaseDuration any more. Let's use a different constant for schema change leases than for descriptor leases.
And I don't think we need jittering here. I think that's only useful for descriptor leases.


sql/schema_changer.go, line 150 [r1] (raw file):

// ExtendLease for the current leaser if necessary.
func (sc *SchemaChanger) ExtendLease(

rename to ExtendLeaseMaybe?


sql/schema_changer.go, line 150 [r1] (raw file):

// ExtendLease for the current leaser if necessary.
func (sc *SchemaChanger) ExtendLease(

Comment that this needs to be called early and often when performing schema changes.


sql/schema_changer.go, line 161 [r1] (raw file):

      // Check if there is still time on this lease.
      minDesiredExpiration := timeutil.Now().Add(MinLeaseDuration)

MinLeaseDuration refers to descriptor leases. Let's create a new constant for the other types of leases.
It's already confusing enough that the code calls them both "leases".


sql/schema_changer.go, line 161 [r1] (raw file):

      // Check if there is still time on this lease.
      minDesiredExpiration := timeutil.Now().Add(MinLeaseDuration)

I'd say let's move this out of the txn, since it's the common case, and the txn reading the descriptor is expensive.
If you have a lease that's nowhere close to expiration, no point in reading anything from the database. This would make the guarantees of ExtendLease not erroring weaker - we'd no longer be checking that we're extending the most recent lease. To fix that, we can add a tainted field to the lease, and whenever one gets a new lease (through this ExtendsLease), we'd set the bit on the old one.


sql/schema_changer_test.go, line 59 [r1] (raw file):

  // Set MinLeaseDuration to always expire the lease.
  minLeaseDuration := csql.MinLeaseDuration
  csql.MinLeaseDuration = 2 * csql.LeaseDuration

Let's keep fighting the crusade changing such globals :). They lead to races and overall headaches. Pass it through the params.
In fact, although setting this to 2 * LeaseDuration is a nice trick, I didn't notice it and spent a bit thinking about how this test is supposed to work at all given that we're not advancing the clock. Maybe a better thing for this test is to pass a testing knob that forces the always-renew behavior.

If you keep the lease life value, instead of changing it below (which you couldn't easily do through the TestServerArgs), you can just create a new server with a different value.

Of course, the correct thing to do is neither changing the value, nor introducing a knob, but advancing the clock, but alas the TestServer doesn't take a manual clock. If you feel like rabbit holeing....


sql/schema_changer_test.go, line 87 [r1] (raw file):

  // Acquiring another lease will fail.
  if newLease, err := changer.AcquireLease(); err == nil {

use testutils.IsError. It's better and you don't need to write a new massage in Fatalf.


sql/schema_changer_test.go, line 103 [r1] (raw file):

  // The new lease is a brand new lease.
  if newLease == lease {
      t.Fatalf("leases are equal: %v", lease)

I think a better error would be "lease was not extended"


sql/schema_changer_test.go, line 106 [r1] (raw file):

  }

  // Extending an old lease fails.

Can you explain why this is expected to fail?
Also, use testutils.IsError(...)

I suspect we're using this failure to ExtendLease() as a proxy for checking that a schema changer with anything but the latest lease cannot perform mutations. It'd be cool if we'd also test at least one of the schema changers to actually try to do a mutation after a lease has expired, and check that that fails.

Maybe we can also improve the schema changer interface so that it's more stupid-proof (less likely for someone to forget to call ExtendLease). How about we make all the code in schema changer only get table descriptor through a (schema change) lease (so the lease would have a GetDesc() method, which would call findTableWithLease()). Or in some other way ensure that the lease is always verified to be the latest when mutations are applied. I don't have a great idea.


sql/schema_changer_test.go, line 112 [r1] (raw file):

  }

  // Releasing an old lease fails.

is the fact that releasing an old lease fails important, or just an implementation artifact? If it's not important, I'd say remove it from the test.


Comments from Reviewable

ExtendLease() was always writing to the table descriptor resulting
in the system config getting gossiped more often than needed.
This change writes a lease with a new expiration time only when the
lease is closer to its expiration time. With schema changes now
executing in chunks and ExtendLease() being called more often after
each chunk is processed, forcing a new expiration time is
unnecessary.
@vivekmenezes
Copy link
Contributor Author

Thanks for reviewing this change!


Review status: 0 of 2 files reviewed at latest revision, 10 unresolved discussions, some commit checks pending.


sql/schema_changer.go, line 81 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

if you feel so inclined, let's change this to not use LeaseDuration any more. Let's use a different constant for schema change leases than for descriptor leases.
And I don't think we need jittering here. I think that's only useful for descriptor leases.

Done

sql/schema_changer.go, line 150 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

rename to ExtendLeaseMaybe?

I've left it as is because that's an implementation detail

sql/schema_changer.go, line 150 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Comment that this needs to be called early and often when performing schema changes.

done

sql/schema_changer.go, line 161 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

MinLeaseDuration refers to descriptor leases. Let's create a new constant for the other types of leases.
It's already confusing enough that the code calls them both "leases".

done

sql/schema_changer.go, line 161 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'd say let's move this out of the txn, since it's the common case, and the txn reading the descriptor is expensive.
If you have a lease that's nowhere close to expiration, no point in reading anything from the database. This would make the guarantees of ExtendLease not erroring weaker - we'd no longer be checking that we're extending the most recent lease. To fix that, we can add a tainted field to the lease, and whenever one gets a new lease (through this ExtendsLease), we'd set the bit on the old one.

Taken it out of the txn. I see no point to tainting it since an old lease is more likely to expire than a new one.

sql/schema_changer_test.go, line 59 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Let's keep fighting the crusade changing such globals :). They lead to races and overall headaches. Pass it through the params.
In fact, although setting this to 2 * LeaseDuration is a nice trick, I didn't notice it and spent a bit thinking about how this test is supposed to work at all given that we're not advancing the clock. Maybe a better thing for this test is to pass a testing knob that forces the always-renew behavior.

If you keep the lease life value, instead of changing it below (which you couldn't easily do through the TestServerArgs), you can just create a new server with a different value.

Of course, the correct thing to do is neither changing the value, nor introducing a knob, but advancing the clock, but alas the TestServer doesn't take a manual clock. If you feel like rabbit holeing....

I'm leaving the code as is. While the TestingKnob stuff has some good properties I feel it will be a big sword to attack a tiny infidel.

sql/schema_changer_test.go, line 87 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

use testutils.IsError. It's better and you don't need to write a new massage in Fatalf.

done

sql/schema_changer_test.go, line 103 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think a better error would be "lease was not extended"

done

sql/schema_changer_test.go, line 106 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Can you explain why this is expected to fail?
Also, use testutils.IsError(...)

I suspect we're using this failure to ExtendLease() as a proxy for checking that a schema changer with anything but the latest lease cannot perform mutations. It'd be cool if we'd also test at least one of the schema changers to actually try to do a mutation after a lease has expired, and check that that fails.

Maybe we can also improve the schema changer interface so that it's more stupid-proof (less likely for someone to forget to call ExtendLease). How about we make all the code in schema changer only get table descriptor through a (schema change) lease (so the lease would have a GetDesc() method, which would call findTableWithLease()). Or in some other way ensure that the lease is always verified to be the latest when mutations are applied. I don't have a great idea.

Added testutils.IsError()

We already have a test that checks that once a schema change is running another thread cannot acquire the lease in TestRaceWithBackfill. See the first operation done in runSchemaChangeWithOperations()


sql/schema_changer_test.go, line 112 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

is the fact that releasing an old lease fails important, or just an implementation artifact? If it's not important, I'd say remove it from the test.

I think it's a good test just as a double check that if someone where to release an old lease they would see an error. Of course now with this commit, a schema change that runs faster than 4 minutes is not going to renew the lease and it's less important.

Comments from Reviewable

@vivekmenezes vivekmenezes merged commit 71a1698 into cockroachdb:master Jul 14, 2016
@vivekmenezes vivekmenezes deleted the vivek/extend branch July 14, 2016 17:02
@andreimatei
Copy link
Contributor

Review status: 0 of 2 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


sql/schema_changer.go, line 161 [r1] (raw file):

Previously, vivekmenezes wrote…

Taken it out of the txn. I see no point to tainting it since an old lease is more likely to expire than a new one.

I think something's not quite right here:
  • thread A gets lease L0, passes it to threads B and C
  • B extends is (no-op) and starts doing work
  • ... time passes ...
  • C extends it, which does an actual extension, so it gets lease L1, and starts using it
  • B tries to extend L0, but fails, because it no longer has the latest lease (findTableWithLease errors)

So even though the node has a valid lease, different workers can fail to use it. We had this problem before, but now you're making it worse because before it would've been detected when B tries to extend the lease the first time, now it's only detected when it tries to do it the second time. So sharing leases now would mostly appears to work, except until it doesn't.
I think the issue is that this leasing interface is confusing. These leases seem to be per-node, but we're passing them by value everywhere meaning we're encouraging different workers to get different instances. The lease for a table should be a (thread-safe) singleton, and workers should share it.

Thoughts? Should we at least file an issue?


Comments from Reviewable

@vivekmenezes
Copy link
Contributor Author

Review status: 0 of 2 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


sql/schema_changer.go, line 161 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think something's not quite right here:

  • thread A gets lease L0, passes it to threads B and C
  • B extends is (no-op) and starts doing work
  • ... time passes ...
  • C extends it, which does an actual extension, so it gets lease L1, and starts using it
  • B tries to extend L0, but fails, because it no longer has the latest lease (findTableWithLease errors)

So even though the node has a valid lease, different workers can fail to use it. We had this problem before, but now you're making it worse because before it would've been detected when B tries to extend the lease the first time, now it's only detected when it tries to do it the second time. So sharing leases now would mostly appears to work, except until it doesn't.
I think the issue is that this leasing interface is confusing. These leases seem to be per-node, but we're passing them by value everywhere meaning we're encouraging different workers to get different instances. The lease for a table should be a (thread-safe) singleton, and workers should share it.

Thoughts? Should we at least file an issue?

Well we never have threads pass leases to other threads. A thread once it has a lease can use the lease to do work and it needs to keep extending the lease as it does its work. Any other thread can request its own lease from the table descriptor, but if it finds an unexpired lease on the descriptor is cannot gain a lease. This all works as long as it is used as described.

Comments from Reviewable

@andreimatei
Copy link
Contributor

Review status: 0 of 2 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


sql/schema_changer.go, line 161 [r1] (raw file):

Previously, vivekmenezes wrote…

Well we never have threads pass leases to other threads. A thread once it has a lease can use the lease to do work and it needs to keep extending the lease as it does its work. Any other thread can request its own lease from the table descriptor, but if it finds an unexpired lease on the descriptor is cannot gain a lease. This all works as long as it is used as described.

But the point is that if this lease is passed by value everywhere, you have problems even without actual parallel use. You can't even delegate some of the work to a function, even if the caller is willing to wait.

Comments from Reviewable

@vivekmenezes
Copy link
Contributor Author

Review status: 0 of 2 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


sql/schema_changer.go, line 161 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

But the point is that if this lease is passed by value everywhere, you have problems even without actual parallel use. You can't even delegate some of the work to a function, even if the caller is willing to wait.

pass by value requires the callee to return a new value back to the caller if it has extended the lease internally. Do file an issue if you are concerned about it and we can discuss it there. Thanks for taking a second look at this.

Comments from Reviewable

@andreimatei
Copy link
Contributor

Review status: 0 of 2 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


sql/schema_changer.go, line 161 [r1] (raw file):

Previously, vivekmenezes wrote…

pass by value requires the callee to return a new value back to the caller if it has extended the lease internally. Do file an issue if you are concerned about it and we can discuss it there. Thanks for taking a second look at this.

Hmmm you're right, I see it's actually passed by pointer by all the users. It's just passed in/out by value by functions in this file. I think that's misleading, or at least it made me thing that you're supposed to copy it around (i.e. it suggests that you can use an instance as a _handle_ on lease). I might clean up to pass by pointer everywhere.

Comments from Reviewable

@vivekmenezes
Copy link
Contributor Author

thanks. There is also the option to return a lease instead of using a
pointer and allow ReleaseLease() to ignore a zero lease.

On Thu, Jul 14, 2016 at 7:35 PM Andrei Matei notifications@github.com
wrote:

Review status: 0 of 2 files reviewed at latest revision, 10 unresolved

discussions, all commit checks successful.

sql/schema_changer.go, line 161 [r1]
https://reviewable.io:443/reviews/cockroachdb/cockroach/7781#-KM_0i4UszlY7Siv61UO:-KMfjLghcx-3uQCho5LP:1307589885
(raw file

minDesiredExpiration := timeutil.Now().Add(MinLeaseDuration)
):
Previously, vivekmenezes wrote…

pass by value requires the callee to return a new value back to the caller
if it has extended the lease internally. Do file an issue if you are
concerned about it and we can discuss it there. Thanks for taking a second
look at this.

Hmmm you're right, I see it's actually passed by pointer by all the users.
It's just passed in/out by value by functions in this file. I think that's
misleading, or at least it made me thing that you're supposed to copy it
around (i.e. it suggests that you can use an instance as a handle on
lease). I might clean up to pass by pointer everywhere.


Comments from Reviewable
https://reviewable.io:443/reviews/cockroachdb/cockroach/7781


You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
#7781 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALOpBG8_3DxxC7iN7w_qg96vlwyZQ98Bks5qVseqgaJpZM4JKjo5
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: schema change ExtendLease() calling txn.SetSystemConfigTrigger() frequently
2 participants