-
Notifications
You must be signed in to change notification settings - Fork 873
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: test_replicaof_reject_on_load crash on stop #2818
Conversation
@adiholden I did not supply a fix for pytests and the crashed instance. We need to think about this because |
CHECK(!dispatch_fb_.IsJoinable()); // can't move once it started | ||
CHECK_EQ(cc_->subscriptions, 0); // are bound to thread local caches | ||
CHECK_EQ(self_.use_count(), 1u); // references cache our thread and backpressure | ||
if (dispatch_fb_.IsJoinable()) { // can't move once it started |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is triggering the dispatch_fb_ creation on shutdown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This https://github.com/dragonflydb/dragonfly/blob/main/src/facade/dragonfly_listener.cc#L301 which calls SendCheckpoint
. See also: https://github.com/dragonflydb/dragonfly/blob/main/src/facade/dragonfly_listener.cc#L301 and https://github.com/dragonflydb/dragonfly/blob/main/src/facade/dragonfly_listener.cc#L455
Moreover, this is what I was telling you about migrations. In this flow, we migrate and it happens that the traversal fiber happens before the migration and so when we migrate the dispatch fiber is already started. Now, if we remove this check and cancel the migration, it could also be the case that another fiber run which killed the socket (after PreShutdown and DispatchTracker) and I have seen this being triggered and that's why I checked for the socket to be active.
Of course, it could be the case that traversal happened after migration
but before the connection is relinked to the thread local and in that case we would probably deadlock or loose some important message being send from the traversal.
@kostasrim can you merge this? |
…nfly ( v1.16.1 → v1.17.0 ) (#3473) This PR contains the following updates: | Package | Update | Change | |---|---|---| | [docker.dragonflydb.io/dragonflydb/dragonfly](https://togithub.com/dragonflydb/dragonfly) | minor | `v1.16.1` -> `v1.17.0` | --- ### Release Notes <details> <summary>dragonflydb/dragonfly (docker.dragonflydb.io/dragonflydb/dragonfly)</summary> ### [`v1.17.0`](https://togithub.com/dragonflydb/dragonfly/releases/tag/v1.17.0) [Compare Source](https://togithub.com/dragonflydb/dragonfly/compare/v1.16.1...v1.17.0) ##### Dragonfly v1.17.0 Some prominent changes include: - Improved performance for MGET operations ([#​2453](https://togithub.com/dragonflydb/dragonfly/issues/2453)) - Fix argument parsing in json.objkeys ([#​2872](https://togithub.com/dragonflydb/dragonfly/issues/2872)) - Fix ipv6 support for replication ([#​2889](https://togithub.com/dragonflydb/dragonfly/issues/2889)) - Support serialisation of bloom filters - saving to and loading from snapshots ([#​2846](https://togithub.com/dragonflydb/dragonfly/issues/2846)) - Support of HLL PFADD ([#​2761](https://togithub.com/dragonflydb/dragonfly/issues/2761)) - Support bullmq workloads that do not have `{}` hashtags in their queue names ([#​2890](https://togithub.com/dragonflydb/dragonfly/issues/2890)) ##### What's Changed - fix: [#​2745](https://togithub.com/dragonflydb/dragonfly/issues/2745) don't start migration process again after apply the same the same config is applied by [@​BorysTheDev](https://togithub.com/BorysTheDev) in [dragonflydb/dragonfly#2822 - feat(transaction): Idempotent callbacks (immediate runs) by [@​dranikpg](https://togithub.com/dranikpg) in [dragonflydb/dragonfly#2453 - refactor(cluster): replace sync_id with node_id for slot migration [#​2835](https://togithub.com/dragonflydb/dragonfly/issues/2835) by [@​BorysTheDev](https://togithub.com/BorysTheDev) in [dragonflydb/dragonfly#2838 - feat(tiering): Simple OpManager by [@​dranikpg](https://togithub.com/dranikpg) in [dragonflydb/dragonfly#2781 - chore: implement path mutation for JsonFlat by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2805 - feat(cluster): add migration removing by config [#​2835](https://togithub.com/dragonflydb/dragonfly/issues/2835) by [@​BorysTheDev](https://togithub.com/BorysTheDev) in [dragonflydb/dragonfly#2844 - chore: expose direct API on Bloom objects by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2845 - chore: generalize CompactObject::AllocateMR by [@​kostasrim](https://togithub.com/kostasrim) in [dragonflydb/dragonfly#2847 - feat(tiering): Simplest small bins by [@​dranikpg](https://togithub.com/dranikpg) in [dragonflydb/dragonfly#2810 - refactor: clean cluster slot migration code by [@​BorysTheDev](https://togithub.com/BorysTheDev) in [dragonflydb/dragonfly#2848 - fix(tests): Fix numsub test by [@​dranikpg](https://togithub.com/dranikpg) in [dragonflydb/dragonfly#2852 - fix: healthcheck for docker containers by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2853 - fix: possible crash in tls code by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2854 - fix(server): Do not block admin-port commands by [@​chakaz](https://togithub.com/chakaz) in [dragonflydb/dragonfly#2842 - fix(pytest): make pytests fail if server crash on shutdown by [@​adiholden](https://togithub.com/adiholden) in [dragonflydb/dragonfly#2827 - feat(server): add prints on takeover timeout by [@​adiholden](https://togithub.com/adiholden) in [dragonflydb/dragonfly#2856 - fix(pytest): dont check process return code on kill by [@​adiholden](https://togithub.com/adiholden) in [dragonflydb/dragonfly#2862 - fix: authorize the http connection to call commands by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2863 - feat(cluster): Send number of keys for incoming and outgoing migrations. by [@​chakaz](https://togithub.com/chakaz) in [dragonflydb/dragonfly#2858 - feat(tiering): TieredStorageV2 by [@​dranikpg](https://togithub.com/dranikpg) in [dragonflydb/dragonfly#2849 - bug(server): set connection flags block/pause flag on all blocking commands by [@​adiholden](https://togithub.com/adiholden) in [dragonflydb/dragonfly#2816 - chore: serialize SBF by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2846 - fix: test_replicaof_reject_on_load crash on stop by [@​kostasrim](https://togithub.com/kostasrim) in [dragonflydb/dragonfly#2818 - feat(dbslice): Add self-laundering iterator in `DbSlice` by [@​chakaz](https://togithub.com/chakaz) in [dragonflydb/dragonfly#2815 - chore: License update by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2767 - fix(acl): incompatibilities with acl load by [@​kostasrim](https://togithub.com/kostasrim) in [dragonflydb/dragonfly#2867 - fix(json): make path optional in json.objkeys by [@​kostasrim](https://togithub.com/kostasrim) in [dragonflydb/dragonfly#2872 - fix: return wrong type errors for SET...GET command by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2874 - fix(redis replication): remove partial sync flow ,not supported yet by [@​adiholden](https://togithub.com/adiholden) in [dragonflydb/dragonfly#2865 - chore: limit traffic logger only to the main interface by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2877 - chore: relax repltakeover constraints to only exclude write commands by [@​kostasrim](https://togithub.com/kostasrim) in [dragonflydb/dragonfly#2873 - chore(replayer): Roll back to go1.18 by [@​dranikpg](https://togithub.com/dranikpg) in [dragonflydb/dragonfly#2881 - fix: brpoplpush single shard to wake up blocked transactions by [@​kostasrim](https://togithub.com/kostasrim) in [dragonflydb/dragonfly#2875 - chore: LockTable tracks fingerprints of keys by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2839 - chore: reject TLS handshake when our listener is plain TCP by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2882 - Add support for Sparse HLL PFADD by [@​azuredream](https://togithub.com/azuredream) in [dragonflydb/dragonfly#2761 - feat server: bring visibility to script errors by [@​adiholden](https://togithub.com/adiholden) in [dragonflydb/dragonfly#2879 - chore: clean up REPLTAKEOVER flow by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2887 - chore(tiering): Move files and move kb literal to common by [@​dranikpg](https://togithub.com/dranikpg) in [dragonflydb/dragonfly#2868 - chore(interpreter): Support object replies by [@​dranikpg](https://togithub.com/dranikpg) in [dragonflydb/dragonfly#2885 - fix(ci/helm): Stick to v0.73.0 version of prom operator by [@​Pothulapati](https://togithub.com/Pothulapati) in [dragonflydb/dragonfly#2893 - fix(acl): authentication with UDS socket by [@​kostasrim](https://togithub.com/kostasrim) in [dragonflydb/dragonfly#2895 - feat(cluster): add repeated ACK if an error is happened by [@​BorysTheDev](https://togithub.com/BorysTheDev) in [dragonflydb/dragonfly#2892 - chore(blocking): Remove faulty DCHECK by [@​dranikpg](https://togithub.com/dranikpg) in [dragonflydb/dragonfly#2898 - chore: add a clear link on how to build dragonfly from source by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2884 - feat(server): Allow configuration of hashtag extraction by [@​chakaz](https://togithub.com/chakaz) in [dragonflydb/dragonfly#2890 - fix: fix build under macos by [@​BorysTheDev](https://togithub.com/BorysTheDev) in [dragonflydb/dragonfly#2901 - fix(cluster_replication): replicate redis cluster node bug fix by [@​adiholden](https://togithub.com/adiholden) in [dragonflydb/dragonfly#2876 - fix(acl): skip http and add check on connection traversals by [@​kostasrim](https://togithub.com/kostasrim) in [dragonflydb/dragonfly#2883 - fix(zset): Better memory consumption calculation by [@​chakaz](https://togithub.com/chakaz) in [dragonflydb/dragonfly#2900 - fix: fix ld for num converting by [@​BorysTheDev](https://togithub.com/BorysTheDev) in [dragonflydb/dragonfly#2902 - chore: add help string for memory_fiberstack_vms_bytes by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2903 - fix(sanitizers): false positive fail on multi_test::Eval by [@​kostasrim](https://togithub.com/kostasrim) in [dragonflydb/dragonfly#2896 - chore: pull helio and add ipv6 replication test by [@​dranikpg](https://togithub.com/dranikpg) in [dragonflydb/dragonfly#2889 - chore: add ipv6 support for native linux release by [@​romange](https://togithub.com/romange) in [dragonflydb/dragonfly#2908 ##### Huge thanks to all the contributors! ❤️ **Full Changelog**: dragonflydb/dragonfly@v1.16.0...v1.17.0 </details> <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4zMDEuNSIsInVwZGF0ZWRJblZlciI6IjM3LjMwMS41IiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJyZW5vdmF0ZS9jb250YWluZXIiLCJ0eXBlL21pbm9yIl19--> Co-authored-by: repo-jeeves[bot] <106431701+repo-jeeves[bot]@users.noreply.github.com>
resolves #2769
The problem is that we are shutting down master. During this flow, we iterate and shutdown all of the listeners and then we give a chance for the currently executing commands to finish with
DispatchTracker
. However, this is problematic, because internally it traverses all of the connections in all shards and sendsSendCheckPoint
messages. When this happens, it forces the connection that handles the replica to activate itsdispatch_fiber_
. If this happens before weMigrate
the connection in the replication flow (dflycmd) then we will trigger the CHECK that asserts that the dispatch_fiber is not actve (it was never launched).Unfortunately this is based on timing and therefore it requires two things:
dispatch_fiber
is active. In that case do no crash but return false to indicate that the migration was not successful.