
Update Manager and stats to correctly store information about processes #289

Closed
wants to merge 12 commits into akira:master from PPCBee:fix-sidekiq5

Conversation

ondrejbartas

This fixes a problem with the Sidekiq Web UI, which changed its API after version 4.

We now correctly add data about the manager and processes to Redis.
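For context, here is a rough sketch of the kind of Redis writes Sidekiq's Web UI expects for a live process. It is illustrative only, not the PR's code: it uses Redix directly with hard-coded values, and the hash fields follow Sidekiq's process format (info, busy, beat, quiet).

{:ok, conn} = Redix.start_link()

# One key per running process, registered in the global "processes" set.
identity = "my-host:elixir"
info = ~s({"hostname":"my-host","started_at":1507680000,"queues":["default"],"concurrency":10})

Redix.pipeline!(conn, [
  ["SADD", "exq:processes", identity],
  ["HMSET", identity, "info", info, "busy", "0",
   "beat", Integer.to_string(System.system_time(:second)), "quiet", "false"],
  # Let the key expire on its own if heartbeats stop.
  ["EXPIRE", identity, "60"]
])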


@coveralls

coveralls commented Oct 11, 2017

Coverage Status

Coverage increased (+0.1%) to 89.231% when pulling fb3ddc5 on PPCBee:fix-sidekiq5 into 46f0d0d on akira:master.

@coveralls

coveralls commented Oct 11, 2017

Coverage Status

Coverage increased (+0.2%) to 89.332% when pulling 16e710a on PPCBee:fix-sidekiq5 into 46f0d0d on akira:master.

@ondrejbartas
Author

It is still not working; there is a problem with cleaning up processes after init :(

@coveralls

coveralls commented Oct 11, 2017

Coverage Status

Coverage increased (+0.5%) to 89.581% when pulling 097a820 on PPCBee:fix-sidekiq5 into 46f0d0d on akira:master.

@coveralls

coveralls commented Oct 11, 2017

Coverage Status

Coverage increased (+0.5%) to 89.594% when pulling 0dd6738 on PPCBee:fix-sidekiq5 into 46f0d0d on akira:master.

@coveralls

coveralls commented Oct 11, 2017

Coverage Status

Coverage increased (+0.5%) to 89.594% when pulling adbd0d7 on PPCBee:fix-sidekiq5 into 46f0d0d on akira:master.

@coveralls

coveralls commented Oct 11, 2017

Coverage Status

Coverage increased (+0.7%) to 90.476% when pulling 3a610b0 on PPCBee:fix-sidekiq5 into ee06d8c on akira:master.

@akira
Owner

akira commented Oct 15, 2017

@ondrejbartas Thanks! Will need to take a look.

Also, I'm assuming this will require changes to exq_ui?

@ondrejbartas
Author

@akira Yes, exq_ui will need to change.

{:ok, state, 0}
end

def cocurency_count(state) do
Owner

Spelling

@@ -251,6 +281,17 @@ defmodule Exq.Manager.Server do

job_results = jobs |> Enum.map(fn(potential_job) -> dispatch_job(state, potential_job) end)

# Update worker info in redis that it is alive
Owner

So this updates worker state each time it dequeues; is this how Sidekiq does it?

Owner

This seems to be adding overhead. Looking at the performance test:

without lines 284:293
2017-10-29 13:25:34.217 [debug]: Perf test took 0.680126 secs

with lines 284:293
2017-10-29 13:23:03.301 [debug]: Perf test took 1.000115 secs

(To run this I set ExUnit.start(capture_log: false) in test_helper.exs)

Author

Hi @akira, the biggest problem is the cleanup of queues, which happens in parallel via this call: https://github.com/akira/exq/pull/289/files/adbd0d7657abff63dcdfdb1db5a7240c7ce4d4a6#diff-ee19dd3b88a278d7cf31b33f2b2b5705R147

I am new to Elixir, so I don't know how to do it in one pass: first the cleanup, then the rest of the code.

Author

then you will need just:

name = redis_worker_name(state)

worker_init = [
  ["HSET", name, "beat", Time.unix_seconds],
  # expire information about the live worker in poll_interval + 5s
  ["EXPIRE", name, state.poll_timeout / 1000 + 5]
]

Connection.qp!(state.redis, worker_init)

Author

Also, this code can run every second and the result will be the same.

Owner

Yes, for Sidekiq it seems like this runs in a separate thread:
https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/launcher.rb#L25
That is why I was wondering whether we can run this separately as well; would it be possible to do this in a separate GenServer?

If you need synchronous calls from the Manager to that GenServer, you can still use GenServer.call instead of GenServer.cast and it will wait for the reply. Let me know if you think this is doable or not. The other option is to lower the frequency of this call so it doesn't run on every dequeue.

The other thing I noticed is that the section with HSET, etc. is duplicated; it would be nice to extract it into a function at least, or into a function in the JobStat module.
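A minimal sketch of the separate heartbeat GenServer idea, assuming it receives a Redix connection and a precomputed identity key; the state fields, interval, and Redis calls here are illustrative, not the PR's final code.

defmodule Exq.Heartbeat.Server do
  use GenServer

  # Illustrative interval; the real value would come from config.
  @interval 5_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def init(opts) do
    schedule_beat()
    {:ok, Map.new(opts)}
  end

  def handle_info(:beat, state) do
    # Refresh the "alive" marker and let it expire shortly after the next
    # expected beat, so a crashed node disappears from the Web UI on its own.
    Redix.pipeline!(state.redis, [
      ["HSET", state.identity, "beat", Integer.to_string(System.system_time(:second))],
      ["EXPIRE", state.identity, "60"]
    ])

    schedule_beat()
    {:noreply, state}
  end

  defp schedule_beat, do: Process.send_after(self(), :beat, @interval)
end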

@akira
Copy link
Owner

akira commented Oct 29, 2017

Looking at the code again, I think the heartbeat mechanism could be tweaked. Updating it on every dequeue seems like high overhead when it could be done a lot less frequently. Perhaps we can have a separate GenServer that does the heartbeat?

This PR actually makes some progress on issue #261. For the full implementation, a random UUID can be generated whenever a new heartbeat GenServer comes online (one for the whole system). If certain parts of the system crash, this should crash as well. The cleanup process can then look through the UUIDs every once in a while, see which ones are stale, and re-queue those as lost messages.

However, we don't necessarily have to implement all of that for this PR. If not, maybe we can start with a separate GenServer that simply starts up, inserts the worker information, and then heartbeats on a regular basis, and crashes when the system crashes.

Let me know your thoughts.
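A sketch of the stale-node scan half of that idea (the re-queueing of in-progress jobs from a dead node is left out); the module name, threshold, and "beat" field format are assumptions consistent with the sketches above.

defmodule Exq.Heartbeat.Monitor do
  # Seconds without a beat before a node is considered dead (assumption).
  @stale_after 60

  def stale_nodes(conn, namespace \\ "exq") do
    now = System.system_time(:second)
    ids = Redix.command!(conn, ["SMEMBERS", "#{namespace}:processes"])

    Enum.filter(ids, fn id ->
      case Redix.command!(conn, ["HGET", id, "beat"]) do
        # Key already expired: the node is definitely gone.
        nil -> true
        beat -> now - String.to_integer(beat) > @stale_after
      end
    end)
  end
end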

@TondaHack

Hi @akira,

Could you check these changes we made?

Thanks

@ondrejbartas
Author

@akira Hi, can you review this PR again?

@akira
Owner

akira commented Feb 7, 2018

Sorry, have been slammed at work lately. Will try to take a look when things lighten up. Thanks!

Exq.Stats.Server.cleanup_host_stats(state.stats, state.namespace, state.node_id, state.pid)
end)

HeartbeatServer.start_link(state)
Owner

Starting a process this way is not recommended. The proper way of starting it is to add this to the supervision tree: https://github.com/akira/exq/blob/master/lib/exq/support/mode.ex#L26
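For illustration, the child list in Exq.Support.Mode could grow one more worker entry so the supervisor owns the heartbeat server's lifecycle; the children and options below are a sketch, not the file's real contents.

defmodule Exq.Support.Mode.Sketch do
  import Supervisor.Spec

  # Sketch of the child list: the heartbeat server sits next to the other
  # servers so the supervisor starts and restarts it, instead of the Manager.
  def children(opts) do
    [
      worker(Exq.Manager.Server, [opts]),
      worker(Exq.Stats.Server, [opts]),
      worker(Exq.Heartbeat.Server, [opts])
    ]
  end
end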

defp getRedisCommands(state) do
name = redis_worker_name(state)
[
["SADD", JobQueue.full_key(state.namespace, "processes"), name],
Owner

Can we refactor the Redis commands into one of the other modules? For example, RedisStat.

Also, a nit: this uses camelCase for a function name.

@@ -145,4 +186,30 @@ defmodule Exq.Redis.JobStat do
val
end
end

def get_redis_commands(namespace, node_id, started_at, master_pid, queues, work_table, poll_timeout) do
Owner

Maybe name this something more descriptive; the similar methods above are named something like _commands.

JobQueue.full_key(namespace, "#{node_id}:elixir")
end

defp cocurency_count(queues, work_table) do
Owner

This is misspelled.

@akira
Owner

akira commented Feb 20, 2018

Looks like the build is failing due to some compilation issues.

@TondaHack

TondaHack commented Feb 21, 2018

Hi @akira,

Thanks for the review. A few of the builds failed because of a compilation issue, but the rest failed because of tests. I know the reason is Exq.Heartbeat.Server. I am trying to find the issue, but every build shows different failing tests. It's happening in /exq/test/exq_test.exs with tests related to ExqTest.PerformWorker.

I've tried to run the first heartbeat with a smaller timeout or no timeout at all. It didn't work. I also tried extending the assert_receive timeout from 100 to 200 or 400 ms. There were fewer failing tests, but still some. Could you take a look, please?

Edit: It's happening on Elixir 1.3 and 1.4. My machine has 1.7.0-dev and it works just fine.

@akira
Owner

akira commented Apr 12, 2018

@TondaHack I see the tests pass now; looks like you figured out the issue?

@TondaHack

@akira Yes, I think so. The cleanup of the Heartbeat server helped. What do you think about the current changes?

@akira
Owner

akira commented Jun 11, 2018

@TondaHack @ananthakumaran I wonder if it would be better to have a generated node ID instead of just using the hostname. If you look at #321, some people may be running multiple workers on the same machine (at least this is the case during deploys). However, this obscures each node's heartbeat, since two nodes will be registering the same heartbeat.

If we use a more ephemeral key for a node, for example the PID or a UUID (plus the hostname), it would be unique even if multiple processes are running on the same node.

The other comment I had is that this is a big change to the current system and may break some compatibility. I wonder if there's a way to deploy this behind a config flag, where we keep the old method and allow a user to switch to the new method. This would be a much lower-risk way of deploying this change without impacting anyone.
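A sketch of that ephemeral identity, assuming hostname plus a random hex token generated at boot (standing in for the UUID mentioned above; the module and function names are made up).

defmodule NodeIdentity do
  # Regenerated on every restart, so a redeploy produces a fresh key and the
  # stale one can simply expire; two workers on one host never collide.
  def generate do
    {:ok, hostname} = :inet.gethostname()
    token = :crypto.strong_rand_bytes(8) |> Base.encode16(case: :lower)
    "#{to_string(hostname)}:#{token}"
  end
end

# e.g. "web-1:9f2c4a1b7d3e5a60"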

@leeicmobile

@akira Any chance we can just make the heartbeat solution a major version release rather than a config change? It's currently a potential blocker in my production deployment to a k8s cluster. #imselfish

@akira
Owner

akira commented Sep 10, 2018

@leeicmobile

@akira Any chance we can just make the heartbeat solution a major version release rather than a config change? It's currently a potential blocker in my production deployment to a k8s cluster. #imselfish

I have been very busy with work and life and have not had much time lately. I would consider merging this if the following happened:

  1. Someone can review this diff.
  2. This can be tested in a running system to ensure no new bugs have been introduced.
  3. ExqUI is updated to understand this format (it is not backwards compatible with this PR).
  4. (Nice to have) Add a dynamic UUID to the node identifier, so it handles restarts on the same host.

Let me know if anyone has done this or can spend some time doing any of these. It is a big change and I just didn't want to merge it without proper due diligence.

@ananthakumaran
Collaborator

@leeicmobile There is an open PR which implements the heartbeat. If you are still waiting for it, you could help with testing it.

@ananthakumaran
Collaborator

  1. Someone can review this diff.
  2. This can be tested in a running system to ensure no new bugs have been introduced.
  3. ExqUI is updated to understand this format (it is not backwards compatible with this PR).
  4. (Nice to have) Add a dynamic UUID to the node identifier, so it handles restarts on the same host.

I would love to get this merged into master. I can help with 1 & 2. I believe 4 is already fixed in master. I am not sure about 3, as I am not familiar with the exq_ui codebase and we don't use it either.

@ananthakumaran
Collaborator

fixed in master
