Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat server: bring visibility to script errors #2879

Merged
merged 8 commits into from
Apr 11, 2024
Merged

feat server: bring visibility to script errors #2879

merged 8 commits into from
Apr 11, 2024

Conversation

adiholden
Copy link
Collaborator

fixes #2870

  1. expose a metric with script errors
  2. Add an error log of a script SHA and the reason for failure. The error log must be bounded with at most K log entries per script so we won't have logs overflow for high throughput traffic.
  3. Additional feature (under a flag) upon receiving the undeclared variable error - transform the script to globally-atomic and retry the call.

Signed-off-by: adi_holden <adi@dragonflydb.io>
Signed-off-by: adi_holden <adi@dragonflydb.io>
Signed-off-by: adi_holden <adi@dragonflydb.io>
dranikpg
dranikpg previously approved these changes Apr 10, 2024
Copy link
Contributor

@dranikpg dranikpg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that this is very risky behavior because it can leave the data inconsistent without notice. I think this is more serious than your system not running. Many script just don't assume they can fail. Some of them actually declare the queue name, but access arbitrary data as undeclared

@@ -33,8 +33,12 @@ ABSL_FLAG(
bool, lua_auto_async, false,
"If enabled, call/pcall with discarded values are automatically replaced with acall/apcall.");

namespace dfly {
ABSL_FLAG(bool, lua_allow_undeclared_auto_correct, true,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's flip it to false to avoid inconsistent behavior by default

@@ -2040,6 +2040,7 @@ void ServerFamily::Info(CmdArgList args, ConnectionContext* cntx) {
append("blocked_on_interpreter", m.coordinator_stats.blocked_on_interpreter);
append("ram_hits", m.events.ram_hits);
append("ram_misses", m.events.ram_misses);
append("script_erros", script_mgr_->script_errors());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have "script_errors" in "info all". see should_enter("ERRORSTATS", true) section.
the goal is actually to expose script errors via /metrics

@romange romange dismissed dranikpg’s stale review April 10, 2024 11:34

needs more fixes

Signed-off-by: adi_holden <adi@dragonflydb.io>
Signed-off-by: adi_holden <adi@dragonflydb.io>
@adiholden adiholden requested a review from romange April 10, 2024 19:38
dranikpg
dranikpg previously approved these changes Apr 10, 2024
@@ -95,6 +95,7 @@ struct ReplyStats {
size_t io_write_cnt = 0;
size_t io_write_bytes = 0;
absl::flat_hash_map<std::string, uint64_t> err_count;
absl::flat_hash_map<std::string, std::string> script_error_map;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a comment saying what it maps to what

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: adi_holden <adi@dragonflydb.io>
@@ -1146,6 +1146,17 @@ void PrintPrometheusMetrics(const Metrics& m, StringResponse* resp) {
}
}

{
const auto& reply_stats = m.facade_stats.reply_stats;
string sript_error_type_metric;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I think it's unnecessary detailed. The goal is to be able to get alerted about servers that have an error and then go to the server logs and check the warnings. So I think a simple script_error counter (not GAUGE) should do the job here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok fixed

Signed-off-by: adi_holden <adi@dragonflydb.io>
Signed-off-by: adi_holden <adi@dragonflydb.io>
@adiholden adiholden merged commit 7580a88 into main Apr 11, 2024
10 checks passed
@adiholden adiholden deleted the fix_2870 branch April 11, 2024 08:25
szinn pushed a commit to szinn/k8s-homelab that referenced this pull request Apr 16, 2024
…nfly ( v1.16.1 → v1.17.0 ) (#3473)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
|
[docker.dragonflydb.io/dragonflydb/dragonfly](https://togithub.com/dragonflydb/dragonfly)
| minor | `v1.16.1` -> `v1.17.0` |

---

### Release Notes

<details>
<summary>dragonflydb/dragonfly
(docker.dragonflydb.io/dragonflydb/dragonfly)</summary>

###
[`v1.17.0`](https://togithub.com/dragonflydb/dragonfly/releases/tag/v1.17.0)

[Compare
Source](https://togithub.com/dragonflydb/dragonfly/compare/v1.16.1...v1.17.0)

##### Dragonfly v1.17.0

Some prominent changes include:

- Improved performance for MGET operations
([#&#8203;2453](https://togithub.com/dragonflydb/dragonfly/issues/2453))
- Fix argument parsing in json.objkeys
([#&#8203;2872](https://togithub.com/dragonflydb/dragonfly/issues/2872))
- Fix ipv6 support for replication
([#&#8203;2889](https://togithub.com/dragonflydb/dragonfly/issues/2889))
- Support serialisation of bloom filters - saving to and loading from
snapshots
([#&#8203;2846](https://togithub.com/dragonflydb/dragonfly/issues/2846))
- Support of HLL PFADD
([#&#8203;2761](https://togithub.com/dragonflydb/dragonfly/issues/2761))
- Support bullmq workloads that do not have `{}` hashtags in their queue
names
([#&#8203;2890](https://togithub.com/dragonflydb/dragonfly/issues/2890))

##### What's Changed

- fix:
[#&#8203;2745](https://togithub.com/dragonflydb/dragonfly/issues/2745)
don't start migration process again after apply the same the same config
is applied by [@&#8203;BorysTheDev](https://togithub.com/BorysTheDev) in
[dragonflydb/dragonfly#2822
- feat(transaction): Idempotent callbacks (immediate runs) by
[@&#8203;dranikpg](https://togithub.com/dranikpg) in
[dragonflydb/dragonfly#2453
- refactor(cluster): replace sync_id with node_id for slot migration
[#&#8203;2835](https://togithub.com/dragonflydb/dragonfly/issues/2835)
by [@&#8203;BorysTheDev](https://togithub.com/BorysTheDev) in
[dragonflydb/dragonfly#2838
- feat(tiering): Simple OpManager by
[@&#8203;dranikpg](https://togithub.com/dranikpg) in
[dragonflydb/dragonfly#2781
- chore: implement path mutation for JsonFlat by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2805
- feat(cluster): add migration removing by config
[#&#8203;2835](https://togithub.com/dragonflydb/dragonfly/issues/2835)
by [@&#8203;BorysTheDev](https://togithub.com/BorysTheDev) in
[dragonflydb/dragonfly#2844
- chore: expose direct API on Bloom objects by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2845
- chore: generalize CompactObject::AllocateMR by
[@&#8203;kostasrim](https://togithub.com/kostasrim) in
[dragonflydb/dragonfly#2847
- feat(tiering): Simplest small bins by
[@&#8203;dranikpg](https://togithub.com/dranikpg) in
[dragonflydb/dragonfly#2810
- refactor: clean cluster slot migration code by
[@&#8203;BorysTheDev](https://togithub.com/BorysTheDev) in
[dragonflydb/dragonfly#2848
- fix(tests): Fix numsub test by
[@&#8203;dranikpg](https://togithub.com/dranikpg) in
[dragonflydb/dragonfly#2852
- fix: healthcheck for docker containers by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2853
- fix: possible crash in tls code by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2854
- fix(server): Do not block admin-port commands by
[@&#8203;chakaz](https://togithub.com/chakaz) in
[dragonflydb/dragonfly#2842
- fix(pytest): make pytests fail if server crash on shutdown by
[@&#8203;adiholden](https://togithub.com/adiholden) in
[dragonflydb/dragonfly#2827
- feat(server): add prints on takeover timeout by
[@&#8203;adiholden](https://togithub.com/adiholden) in
[dragonflydb/dragonfly#2856
- fix(pytest): dont check process return code on kill by
[@&#8203;adiholden](https://togithub.com/adiholden) in
[dragonflydb/dragonfly#2862
- fix: authorize the http connection to call commands by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2863
- feat(cluster): Send number of keys for incoming and outgoing
migrations. by [@&#8203;chakaz](https://togithub.com/chakaz) in
[dragonflydb/dragonfly#2858
- feat(tiering): TieredStorageV2 by
[@&#8203;dranikpg](https://togithub.com/dranikpg) in
[dragonflydb/dragonfly#2849
- bug(server): set connection flags block/pause flag on all blocking
commands by [@&#8203;adiholden](https://togithub.com/adiholden) in
[dragonflydb/dragonfly#2816
- chore: serialize SBF by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2846
- fix: test_replicaof_reject_on_load crash on stop by
[@&#8203;kostasrim](https://togithub.com/kostasrim) in
[dragonflydb/dragonfly#2818
- feat(dbslice): Add self-laundering iterator in `DbSlice` by
[@&#8203;chakaz](https://togithub.com/chakaz) in
[dragonflydb/dragonfly#2815
- chore: License update by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2767
- fix(acl): incompatibilities with acl load by
[@&#8203;kostasrim](https://togithub.com/kostasrim) in
[dragonflydb/dragonfly#2867
- fix(json): make path optional in json.objkeys by
[@&#8203;kostasrim](https://togithub.com/kostasrim) in
[dragonflydb/dragonfly#2872
- fix: return wrong type errors for SET...GET command by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2874
- fix(redis replication): remove partial sync flow ,not supported yet by
[@&#8203;adiholden](https://togithub.com/adiholden) in
[dragonflydb/dragonfly#2865
- chore: limit traffic logger only to the main interface by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2877
- chore: relax repltakeover constraints to only exclude write commands
by [@&#8203;kostasrim](https://togithub.com/kostasrim) in
[dragonflydb/dragonfly#2873
- chore(replayer): Roll back to go1.18 by
[@&#8203;dranikpg](https://togithub.com/dranikpg) in
[dragonflydb/dragonfly#2881
- fix: brpoplpush single shard to wake up blocked transactions by
[@&#8203;kostasrim](https://togithub.com/kostasrim) in
[dragonflydb/dragonfly#2875
- chore: LockTable tracks fingerprints of keys by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2839
- chore: reject TLS handshake when our listener is plain TCP by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2882
- Add support for Sparse HLL PFADD by
[@&#8203;azuredream](https://togithub.com/azuredream) in
[dragonflydb/dragonfly#2761
- feat server: bring visibility to script errors by
[@&#8203;adiholden](https://togithub.com/adiholden) in
[dragonflydb/dragonfly#2879
- chore: clean up REPLTAKEOVER flow by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2887
- chore(tiering): Move files and move kb literal to common by
[@&#8203;dranikpg](https://togithub.com/dranikpg) in
[dragonflydb/dragonfly#2868
- chore(interpreter): Support object replies by
[@&#8203;dranikpg](https://togithub.com/dranikpg) in
[dragonflydb/dragonfly#2885
- fix(ci/helm): Stick to v0.73.0 version of prom operator by
[@&#8203;Pothulapati](https://togithub.com/Pothulapati) in
[dragonflydb/dragonfly#2893
- fix(acl): authentication with UDS socket by
[@&#8203;kostasrim](https://togithub.com/kostasrim) in
[dragonflydb/dragonfly#2895
- feat(cluster): add repeated ACK if an error is happened by
[@&#8203;BorysTheDev](https://togithub.com/BorysTheDev) in
[dragonflydb/dragonfly#2892
- chore(blocking): Remove faulty DCHECK by
[@&#8203;dranikpg](https://togithub.com/dranikpg) in
[dragonflydb/dragonfly#2898
- chore: add a clear link on how to build dragonfly from source by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2884
- feat(server): Allow configuration of hashtag extraction by
[@&#8203;chakaz](https://togithub.com/chakaz) in
[dragonflydb/dragonfly#2890
- fix: fix build under macos by
[@&#8203;BorysTheDev](https://togithub.com/BorysTheDev) in
[dragonflydb/dragonfly#2901
- fix(cluster_replication): replicate redis cluster node bug fix by
[@&#8203;adiholden](https://togithub.com/adiholden) in
[dragonflydb/dragonfly#2876
- fix(acl): skip http and add check on connection traversals by
[@&#8203;kostasrim](https://togithub.com/kostasrim) in
[dragonflydb/dragonfly#2883
- fix(zset): Better memory consumption calculation by
[@&#8203;chakaz](https://togithub.com/chakaz) in
[dragonflydb/dragonfly#2900
- fix: fix ld for num converting by
[@&#8203;BorysTheDev](https://togithub.com/BorysTheDev) in
[dragonflydb/dragonfly#2902
- chore: add help string for memory_fiberstack_vms_bytes by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2903
- fix(sanitizers): false positive fail on multi_test::Eval by
[@&#8203;kostasrim](https://togithub.com/kostasrim) in
[dragonflydb/dragonfly#2896
- chore: pull helio and add ipv6 replication test by
[@&#8203;dranikpg](https://togithub.com/dranikpg) in
[dragonflydb/dragonfly#2889
- chore: add ipv6 support for native linux release by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2908

##### Huge thanks to all the contributors! ❤️

**Full Changelog**:
dragonflydb/dragonfly@v1.16.0...v1.17.0

</details>

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4zMDEuNSIsInVwZGF0ZWRJblZlciI6IjM3LjMwMS41IiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJyZW5vdmF0ZS9jb250YWluZXIiLCJ0eXBlL21pbm9yIl19-->

Co-authored-by: repo-jeeves[bot] <106431701+repo-jeeves[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bring visibility to script errors
3 participants