
Update Manager and stats to correctly store information about processes #289

Closed
wants to merge 12 commits into akira:master from PPCBee:fix-sidekiq5

Conversation

ondrejbartas

This fixes a problem with the Sidekiq Web UI, which changed its API after version 4.

We now correctly add data about the manager and processes to Redis.
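For context, here is a rough sketch of the kind of Redis writes Sidekiq's Web UI expects for a live process. It is illustrative only, not the PR's code: it uses Redix directly with hard-coded values, and the hash fields follow Sidekiq's process format (info, busy, beat, quiet).

{:ok, conn} = Redix.start_link()

# One key per running process, registered in the global "processes" set.
identity = "my-host:elixir"
info = ~s({"hostname":"my-host","started_at":1507680000,"queues":["default"],"concurrency":10})

Redix.pipeline!(conn, [
  ["SADD", "exq:processes", identity],
  ["HMSET", identity, "info", info, "busy", "0",
   "beat", Integer.to_string(System.system_time(:second)), "quiet", "false"],
  # Let the key expire on its own if heartbeats stop.
  ["EXPIRE", identity, "60"]
])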


@coveralls

coveralls commented Oct 11, 2017

Coverage Status

Coverage increased (+0.1%) to 89.231% when pulling fb3ddc5 on PPCBee:fix-sidekiq5 into 46f0d0d on akira:master.

@coveralls

coveralls commented Oct 11, 2017

Coverage Status

Coverage increased (+0.2%) to 89.332% when pulling 16e710a on PPCBee:fix-sidekiq5 into 46f0d0d on akira:master.

@ondrejbartas
Author

It is still not working; there is a problem with cleaning up processes after init :(

@coveralls

coveralls commented Oct 11, 2017

Coverage Status

Coverage increased (+0.5%) to 89.581% when pulling 097a820 on PPCBee:fix-sidekiq5 into 46f0d0d on akira:master.

@coveralls

coveralls commented Oct 11, 2017

Coverage Status

Coverage increased (+0.5%) to 89.594% when pulling 0dd6738 on PPCBee:fix-sidekiq5 into 46f0d0d on akira:master.

@coveralls

coveralls commented Oct 11, 2017

Coverage Status

Coverage increased (+0.5%) to 89.594% when pulling adbd0d7 on PPCBee:fix-sidekiq5 into 46f0d0d on akira:master.

@coveralls

coveralls commented Oct 11, 2017

Coverage Status

Coverage increased (+0.7%) to 90.476% when pulling 3a610b0 on PPCBee:fix-sidekiq5 into ee06d8c on akira:master.

@akira
Owner

akira commented Oct 15, 2017

@ondrejbartas Thanks! Will need to take a look.

Also, I'm assuming this will require changes to exq_ui?

@ondrejbartas
Author

@akira Yes, exq_ui will need to change.

{:ok, state, 0}
end

def cocurency_count(state) do
Owner

Spelling

@@ -251,6 +281,17 @@ defmodule Exq.Manager.Server do

job_results = jobs |> Enum.map(fn(potential_job) -> dispatch_job(state, potential_job) end)

# Update worker info in redis that it is alive
Owner

So this updates worker state each time it dequeues; is this how Sidekiq does it?

Owner

This seems to be adding overhead. Looking at the performance test:

without lines 284:293
2017-10-29 13:25:34.217 [debug]: Perf test took 0.680126 secs

with lines 284:293
2017-10-29 13:23:03.301 [debug]: Perf test took 1.000115 secs

(To run this I set ExUnit.start(capture_log: false) in test_helper.exs)

Author

Hi @akira, the biggest problem is the cleanup of queues, which happens in parallel via this call: https://github.com/akira/exq/pull/289/files/adbd0d7657abff63dcdfdb1db5a7240c7ce4d4a6#diff-ee19dd3b88a278d7cf31b33f2b2b5705R147

I am new to Elixir, so I don't know how to do it in one pass: first the cleanup, then the rest of the code.

Author

then you will need just:

name = redis_worker_name(state)

worker_init = [
  ["HSET", name, "beat", Time.unix_seconds],
  # expire information about the live worker in poll_interval + 5s
  ["EXPIRE", name, state.poll_timeout / 1000 + 5]
]

Connection.qp!(state.redis, worker_init)

Author

Also, this code can run every second and the result will be the same.

Owner

Yes, for Sidekiq it seems like this runs in a separate thread:
https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/launcher.rb#L25
That is why I was wondering whether we can run this separately as well; would it be possible to do this in a separate GenServer?

If you need synchronous calls from the Manager to that GenServer, you can still use GenServer.call instead of GenServer.cast and it will wait for the reply. Let me know if you think this is doable or not. The other option is to lower the frequency of this call so it doesn't run on every dequeue.

The other thing I noticed is that the section with HSET, etc. is duplicated; it would be nice to extract it into a function at least, or into a function in the JobStat module.
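A minimal sketch of the separate heartbeat GenServer idea, assuming it receives a Redix connection and a precomputed identity key; the state fields, interval, and Redis calls here are illustrative, not the PR's final code.

defmodule Exq.Heartbeat.Server do
  use GenServer

  # Illustrative interval; the real value would come from config.
  @interval 5_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def init(opts) do
    schedule_beat()
    {:ok, Map.new(opts)}
  end

  def handle_info(:beat, state) do
    # Refresh the "alive" marker and let it expire shortly after the next
    # expected beat, so a crashed node disappears from the Web UI on its own.
    Redix.pipeline!(state.redis, [
      ["HSET", state.identity, "beat", Integer.to_string(System.system_time(:second))],
      ["EXPIRE", state.identity, "60"]
    ])

    schedule_beat()
    {:noreply, state}
  end

  defp schedule_beat, do: Process.send_after(self(), :beat, @interval)
end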

@akira
Copy link
Owner

akira commented Oct 29, 2017

Looking at the code again, I think the heartbeat mechanism could be tweaked. Updating it on every dequeue seems like high overhead when it could be done a lot less frequently. Perhaps we can have a separate GenServer that does the heartbeat?

This PR actually makes some progress on issue #261. For the full implementation, a random UUID can be generated whenever a new heartbeat GenServer comes online (one for the whole system). If certain parts of the system crash, this should crash as well. The cleanup process can then look through the UUIDs every once in a while, see which ones are stale, and re-queue those as lost messages.

However, we don't necessarily have to implement all of that for this PR. If not, maybe we can start with a separate GenServer that simply starts up, inserts the worker information, and then heartbeats on a regular basis, and crashes when the system crashes.

Let me know your thoughts.
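A sketch of the stale-node scan half of that idea (the re-queueing of in-progress jobs from a dead node is left out); the module name, threshold, and "beat" field format are assumptions consistent with the sketches above.

defmodule Exq.Heartbeat.Monitor do
  # Seconds without a beat before a node is considered dead (assumption).
  @stale_after 60

  def stale_nodes(conn, namespace \\ "exq") do
    now = System.system_time(:second)
    ids = Redix.command!(conn, ["SMEMBERS", "#{namespace}:processes"])

    Enum.filter(ids, fn id ->
      case Redix.command!(conn, ["HGET", id, "beat"]) do
        # Key already expired: the node is definitely gone.
        nil -> true
        beat -> now - String.to_integer(beat) > @stale_after
      end
    end)
  end
end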

@TondaHack

Hi @akira,

Could you check these changes we made?

Thanks

@ondrejbartas
Author

@akira Hi, can you review this PR again?

@akira
Owner

akira commented Feb 7, 2018

Sorry, have been slammed at work lately. Will try to take a look when things lighten up. Thanks!

Exq.Stats.Server.cleanup_host_stats(state.stats, state.namespace, state.node_id, state.pid)
end)

HeartbeatServer.start_link(state)
Owner

Starting a process this way is not recommended. The proper way of starting it is to add this to the supervision tree: https://github.com/akira/exq/blob/master/lib/exq/support/mode.ex#L26
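For illustration, the child list in Exq.Support.Mode could grow one more worker entry so the supervisor owns the heartbeat server's lifecycle; the children and options below are a sketch, not the file's real contents.

defmodule Exq.Support.Mode.Sketch do
  import Supervisor.Spec

  # Sketch of the child list: the heartbeat server sits next to the other
  # servers so the supervisor starts and restarts it, instead of the Manager.
  def children(opts) do
    [
      worker(Exq.Manager.Server, [opts]),
      worker(Exq.Stats.Server, [opts]),
      worker(Exq.Heartbeat.Server, [opts])
    ]
  end
end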

defp getRedisCommands(state) do
name = redis_worker_name(state)
[
["SADD", JobQueue.full_key(state.namespace, "processes"), name],
Owner

Can we refactor the Redis commands into one of the other modules? For example, RedisStat.

Also, a nit: this uses camelCase for a function name.

@@ -145,4 +186,30 @@ defmodule Exq.Redis.JobStat do
val
end
end

def get_redis_commands(namespace, node_id, started_at, master_pid, queues, work_table, poll_timeout) do
Owner

Maybe name this something more descriptive; the similar methods above are named something like _commands.

JobQueue.full_key(namespace, "#{node_id}:elixir")
end

defp cocurency_count(queues, work_table) do
Owner

This is misspelled.

@akira
Owner

akira commented Feb 20, 2018

Looks like the build is failing due to some compilation issues.

@TondaHack

TondaHack commented Feb 21, 2018

Hi @akira,

Thanks for the review. A few of the builds failed because of a compilation issue, but the rest failed because of tests. I know the reason is Exq.Heartbeat.Server. I am trying to find the issue, but every build shows different failing tests. It's happening in /exq/test/exq_test.exs with tests related to ExqTest.PerformWorker.

I've tried to run the first heartbeat with a smaller timeout or no timeout at all. It didn't work. I also tried extending the assert_receive timeout from 100 to 200 or 400 ms. There were fewer failing tests, but still some. Could you take a look, please?

Edit: It's happening on Elixir 1.3 and 1.4. My machine has 1.7.0-dev and it works just fine.

@akira
Owner

akira commented Apr 12, 2018

@TondaHack I see the tests pass now; looks like you figured out the issue?

@TondaHack

@akira Yes, I think so. The cleanup of the Heartbeat server helped. What do you think about the current changes?

@akira
Owner

akira commented Jun 11, 2018

@TondaHack @ananthakumaran I wonder if it would be better to have a generated node ID instead of just using the hostname. If you look at #321, some people may be running multiple workers on the same machine (at least this is the case during deploys). However, this obscures each node's heartbeat, since two nodes will be registering the same heartbeat.

If we use a more ephemeral key for a node, for example the PID or a UUID (plus the hostname), it would be unique even if multiple processes are running on the same node.

The other comment I had is that this is a big change to the current system and may break some compatibility. I wonder if there's a way to deploy this behind a config flag, where we keep the old method and allow a user to switch to the new method. This would be a much lower-risk way of deploying this change without impacting anyone.
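A sketch of that ephemeral identity, assuming hostname plus a random hex token generated at boot (standing in for the UUID mentioned above; the module and function names are made up).

defmodule NodeIdentity do
  # Regenerated on every restart, so a redeploy produces a fresh key and the
  # stale one can simply expire; two workers on one host never collide.
  def generate do
    {:ok, hostname} = :inet.gethostname()
    token = :crypto.strong_rand_bytes(8) |> Base.encode16(case: :lower)
    "#{to_string(hostname)}:#{token}"
  end
end

# e.g. "web-1:9f2c4a1b7d3e5a60"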

@leeicmobile

@akira Any chance we can just make the heartbeat solution a major version release rather than a config change? It's currently a potential blocker in my production deployment to a k8s cluster. #imselfish

@akira
Owner

akira commented Sep 10, 2018

@leeicmobile

@akira Any chance we can just make the heartbeat solution a major version release rather than a config change? It's currently a potential blocker in my production deployment to a k8s cluster. #imselfish

I have been very busy with work and life and have not had much time lately. I would consider merging this if the following happened:

  1. Someone can review this diff.
  2. This can be tested in a running system to ensure no new bugs have been introduced.
  3. ExqUI is updated to understand this format (it is not backwards compatible with this PR).
  4. (Nice to have) Add a dynamic UUID to the node identifier, so it handles restarts on the same host.

Let me know if anyone has done this or can spend some time doing any of these. It is a big change and I just didn't want to merge it without proper due diligence.

@ananthakumaran
Collaborator

@leeicmobile There is an open PR which implements the heartbeat. If you are still waiting for it, you could help with testing it.

@ananthakumaran
Collaborator

  1. Someone can review this diff.
  2. This can be tested in a running system to ensure no new bugs have been introduced.
  3. ExqUI is updated to understand this format (it is not backwards compatible with this PR).
  4. (Nice to have) Add a dynamic UUID to the node identifier, so it handles restarts on the same host.

I would love to get this merged into master. I can help with 1 & 2. I believe 4 is already fixed in master. I am not sure about 3, as I am not familiar with the exq_ui codebase and we don't use it either.

@ananthakumaran
Collaborator

fixed in master
