Don't retry dead pods indefinitely #6220

Closed
jaywink opened this issue Jul 21, 2015 · 9 comments

7 participants
@jaywink
Contributor

commented Jul 21, 2015

We currently seem to retry dead pods, if not forever, then at least for a long time. For a recent post I made, I had 89 failures for outgoing Sidekiq jobs. I can only imagine how much extra load this puts on large, long-running pods.

So I would propose one of the following:

  • Add a "retry_next" timestamp which would be set on a failure. The delay would start at "1s" and increase exponentially each time a failure occurs. When a job needs to be done, the sending pod first checks whether the timestamp exists and whether it has passed; if the receiving pod should not be retried yet, it would just brutally skip the job.
  • After, say, "local users*1000" failures, or some other threshold that takes pod size into account, queue deletion of the pod entry and all local content from that pod. Brutal, but it would clean things up for real users.
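The two proposals above can be sketched roughly as follows. This is only an illustration and not code from diaspora*: the class `PodRetryState`, its methods, the doubling factor and the 30-day cap on the delay are all assumptions. "local users" is treated as a count supplied by the caller, because the proposal does not say where it would come from.

```ruby
# Hypothetical sketch; none of these names exist in the diaspora* code base.
class PodRetryState
  BASE_DELAY = 1                    # seconds, the "1s" starting point from the proposal
  MAX_DELAY  = 30 * 24 * 60 * 60    # assumed cap (30 days) so the delay stays bounded

  attr_reader :failures, :retry_next

  def initialize
    @failures = 0
    @retry_next = nil
  end

  # Record a failed delivery and double the wait before the next attempt.
  def record_failure(now = Time.now)
    delay = [BASE_DELAY * (2**@failures), MAX_DELAY].min
    @failures += 1
    @retry_next = now + delay
  end

  # A successful delivery clears the backoff entirely.
  def record_success
    @failures = 0
    @retry_next = nil
  end

  # True when a job for this pod may run; false means "skip the job".
  def deliverable?(now = Time.now)
    @retry_next.nil? || now >= @retry_next
  end

  # Second proposal: give up after "local users * 1000" failures.
  # The factor is a parameter because the right value is an open question.
  def exhausted?(local_users, factor = 1000)
    @failures >= local_users * factor
  end
end
```

A delivery job would call `deliverable?` before sending, `record_failure` or `record_success` afterwards, and could queue cleanup once `exhausted?` returns true.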

Here is the list as an FYI, from the post on my pod. A few I know might not be permanently down, but most are, and many of these names disappeared years ago.

http://diaspora.linuxmaniac.net/receive/public return_code=couldnt_resolve_host response_code=0
http://party.spraci.info/receive/public return_code=ok response_code=502
http://pod.gnuaspora.de:8080/receive/public return_code=couldnt_connect response_code=0
http://red.zy.lc/receive/public return_code=ok response_code=410
http://uxpod.vispress.com/receive/public return_code=operation_timedout response_code=0
https://4ray.co/receive/public return_code=ok response_code=404
https://antipod.net/receive/public return_code=couldnt_resolve_host response_code=0
https://bastion.enigmacurry.com/receive/public return_code=couldnt_resolve_host response_code=0
https://blue.skye.cx/receive/public return_code=couldnt_connect response_code=0
https://calispora.org/receive/public return_code=couldnt_resolve_host response_code=0
https://canfly.org/receive/public return_code=operation_timedout response_code=0
https://caray.net/receive/public return_code=couldnt_resolve_host response_code=0
https://colapso.eu/receive/public return_code=ok response_code=301
https://community.fi/receive/public return_code=ok response_code=502
https://d.viewskew.com/receive/public return_code=couldnt_resolve_host response_code=0
https://d.w2bh.com.ar/receive/public return_code=operation_timedout response_code=0
https://d.xd.cm/receive/public return_code=operation_timedout response_code=0
https://datamol.in/receive/public return_code=ok response_code=302
https://diasp-pucheen.rhcloud.com/receive/public return_code=ok response_code=503
https://diasp.net/receive/public return_code=couldnt_resolve_host response_code=0
https://diaspo.rrerr.net/receive/public return_code=ok response_code=404
https://diaspop.org/receive/public return_code=couldnt_resolve_host response_code=0
https://diaspora.cc/receive/public return_code=couldnt_connect response_code=0
https://diaspora.dark-fx.com/receive/public return_code=ok response_code=500
https://diaspora.deadhexagon.com/receive/public return_code=couldnt_connect response_code=0
https://diaspora.derfahrervombangbushatdenschlechtestenjobimporn.biz/receive/public return_code=couldnt_resolve_host response_code=0
https://diaspora.digitalignition.net/receive/public return_code=couldnt_resolve_host response_code=0
https://diaspora.ing-adb.eu/receive/public return_code=operation_timedout response_code=0
https://diaspora.la/receive/public return_code=operation_timedout response_code=0
https://diaspora.melroy.org/receive/public return_code=couldnt_resolve_host response_code=0
https://diaspora.mike-jones.me.uk/receive/public return_code=ok response_code=503
https://diaspora.monetd.fr/receive/public return_code=couldnt_connect response_code=0
https://diaspora.selent.me/receive/public return_code=couldnt_resolve_host response_code=0
https://diaspora.teriksson.com/receive/public return_code=couldnt_resolve_host response_code=0
https://diaspora.toxip.net/receive/public return_code=operation_timedout response_code=0
https://diasporado.org/receive/public return_code=couldnt_connect response_code=0
https://diaspote.org/receive/public return_code=ok response_code=503
https://flokk.no/receive/public return_code=ok response_code=500
https://friendiaspora.com/receive/public return_code=couldnt_connect response_code=0
https://frolix8.asia/receive/public return_code=operation_timedout response_code=0
https://hardrate.net/receive/public return_code=couldnt_resolve_host response_code=0
https://hfase.com/receive/public return_code=couldnt_resolve_host response_code=0
https://homebutter.com/receive/public return_code=couldnt_connect response_code=0
https://i.luckfables.net/receive/public return_code=couldnt_connect response_code=0
https://ilikefreedom.org/receive/public return_code=operation_timedout response_code=0
https://jauspora.com/receive/public return_code=couldnt_resolve_host response_code=0
https://jiprnet.cf/receive/public return_code=couldnt_resolve_host response_code=0
https://logicfish.org.uk/receive/public return_code=couldnt_resolve_host response_code=0
https://meekshome.com/receive/public return_code=operation_timedout response_code=0
https://morab.drunkenhorse.org/receive/public return_code=ok response_code=404
https://nakama.moe/receive/public return_code=couldnt_connect response_code=0
https://nerdbash.net/receive/public return_code=operation_timedout response_code=0
https://netiz.in/receive/public return_code=couldnt_connect response_code=0
https://nym.no/receive/public return_code=operation_timedout response_code=0
https://oi.cx/receive/public return_code=couldnt_resolve_host response_code=0
https://ontrigue.com/receive/public return_code=ok response_code=404
https://paracin.me/receive/public return_code=couldnt_resolve_host response_code=0
https://pod.02019.org/receive/public return_code=operation_timedout response_code=0
https://pod.affix.me/receive/public return_code=couldnt_resolve_host response_code=0
https://pod.andreasrohner.at/receive/public return_code=ok response_code=503
https://pod.carlnimbus.com/receive/public return_code=operation_timedout response_code=0
https://pod.daquan.eu/receive/public return_code=ok response_code=503
https://pod.diaspbr.org/receive/public return_code=couldnt_resolve_host response_code=0
https://pod.diaspora.co.nz/receive/public return_code=couldnt_resolve_host response_code=0
https://pod.ecigusine.fr/receive/public return_code=couldnt_connect response_code=0
https://pod.felmey.org/receive/public return_code=couldnt_connect response_code=0
https://pod.freedombox.org/receive/public return_code=couldnt_resolve_host response_code=0
https://pod.glasgownet.com/receive/public return_code=ok response_code=404
https://pod.gleisnetze.de/receive/public return_code=couldnt_resolve_host response_code=0
https://pod.matbranyon.net/receive/public return_code=peer_failed_verification response_code=0
https://pod.p07.co/receive/public return_code=couldnt_resolve_host response_code=0
https://pod.pyaspora.info/receive/public return_code=couldnt_resolve_host response_code=0
https://pod.sfunk1x.com/receive/public return_code=couldnt_resolve_host response_code=0
https://pod.teejay.org/receive/public return_code=couldnt_resolve_host response_code=0
https://podricing.org/receive/public return_code=ok response_code=502
https://pubpod.alqualonde.org/receive/public return_code=operation_timedout response_code=0
https://rbsn.org/receive/public return_code=ok response_code=404
https://social-underground.nl/receive/public return_code=ok response_code=404
https://social.deuxfleurs.fr/receive/public return_code=ok response_code=502
https://social.neshema.com/receive/public return_code=couldnt_resolve_host response_code=0
https://social.sedrubal.de/receive/public return_code=couldnt_resolve_host response_code=0
https://social.sysfu.com/receive/public return_code=couldnt_resolve_host response_code=0
https://social.trashserver.net/receive/public return_code=ok response_code=404
https://sxspora.de/receive/public return_code=couldnt_resolve_host response_code=0
https://triaspora.net/receive/public return_code=ok response_code=502
https://www.diasporasf.org/receive/public return_code=ok response_code=500
https://www.dpod.se/receive/public return_code=ok response_code=301
https://www.sour.be/receive/public return_code=couldnt_resolve_host response_code=0
https://www.theantidrug.org/receive/public return_code=couldnt_connect response_code=0

@Flaburgan

Member

commented Jul 21, 2015

Kind of a duplicate of #4183, or at least it deals with the same problem. I think it is enough to stop trying to send messages to a pod which hasn't answered for more than X days. Of course, we have to remove the pod from the blacklist if we receive a message from it.

@jaywink

Contributor Author

commented Jul 22, 2015

Related but not the same. #4183 talks about manual cleanup actions. This talks about tuning the federation layer so that it more intelligently stops banging its head against a brick wall indefinitely.

@Raven24

Member

commented Jul 23, 2015

Some time ago I wrote a small test script to check pods against some possible reasons why they can't be reached (DNS, connection, HTTP). (Here is a Gist)
My idea back then was to add a list of pods to the admin section, showing each as reachable/unreachable and, where a pod is not responding, the reason and how long that has been the case. With that information there could easily be a rake task clearing out old pods that haven't been online for a configurable amount of time.
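To illustrate the reachable/unreachable-with-reason idea, here is a minimal sketch that sorts failure lines like the ones in the list at the top of this issue into the DNS, connection and HTTP buckets. It is not the Gist mentioned above, which is not reproduced here. The helper name `classify_failure`, the regular expression and the extra `:ssl` bucket are assumptions made for this example.

```ruby
# Hypothetical helper; not taken from the Gist or from diaspora*.
# Matches lines of the form "<url> return_code=<name> response_code=<number>".
FAILURE_LINE = /\A(?<url>\S+) return_code=(?<return_code>\w+) response_code=(?<response_code>\d+)\z/

def classify_failure(line)
  m = FAILURE_LINE.match(line.strip)
  return nil unless m # not a failure line in the expected format

  reason =
    case m[:return_code]
    when "couldnt_resolve_host" then :dns
    when "couldnt_connect", "operation_timedout" then :connection
    when "peer_failed_verification" then :ssl
    when "ok" then :http # the transport worked, so the HTTP status is the problem
    else :unknown
    end

  { url: m[:url], reason: reason, status: m[:response_code].to_i }
end
```

An admin view or rake task could then group pods by `reason` and combine that with how long each pod has been failing.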

@jaywink

Contributor Author

commented Jul 23, 2015

I'd still like this to be somehow automatic, without podmins having to do maintenance by hand. To me, this should be in the code that actually figures out where to send a request to, instead of it just blindly sending. We already do some stuff regarding SSL-related return codes. This would just be more logic there.

@Flaburgan

Member

commented Jul 24, 2015

@jaywink so what about a boolean "reachable" on the pod table, true by default? Before sending any message to podB, podA checks whether reachable is true for podB. If no message reaches podB successfully for 30 days (to be determined), reachable is set to false in podA's table. If any message from podB is received by podA, reachable is reset to true.
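The state changes described above could look something like the sketch below. It uses a plain in-memory object rather than a real database row. The class `PodRecord`, its method names and the `last_success_at` bookkeeping field are assumptions, and the 30-day value is the "to be determined" placeholder from the comment.

```ruby
# Hypothetical in-memory stand-in for podA's table row about podB.
class PodRecord
  UNREACHABLE_AFTER = 30 * 24 * 60 * 60 # 30 days in seconds (to be determined)

  attr_reader :reachable

  def initialize(now = Time.now)
    @reachable = true      # true by default
    @last_success_at = now # assumption: the 30-day clock starts at creation
  end

  # A message reached podB successfully.
  def delivery_succeeded(now = Time.now)
    @last_success_at = now
  end

  # A delivery failed; flip the flag once nothing has succeeded for 30 days.
  def delivery_failed(now = Time.now)
    @reachable = false if now - @last_success_at >= UNREACHABLE_AFTER
  end

  # Any message received from podB resets the flag.
  # Assumption: it also restarts the clock, otherwise the very next
  # failed delivery would mark the pod unreachable again.
  def message_received(now = Time.now)
    @reachable = true
    @last_success_at = now
  end
end
```

The sending code would then check `reachable` before queueing a job for that pod.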

@KentShikama

Contributor

commented Jul 31, 2015

Thanks for bringing this up, @jaywink. I think this would greatly improve the federation.

I would be in favor of having a "retry_next" timestamp. Perhaps even make it so that once the time has exceeded a certain threshold, the pod is locally destroyed, as mentioned in the second point.

If we do use @Flaburgan's idea, I think that instead of true/false it should just be a timestamp of when the pod was last reached. Then the server could check whether the timestamp is less than 30 days before Time.now.
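As a rough sketch of the timestamp variant, the check can be derived from a single "last reached" value, with no flag to keep in sync. The method name `within_reach_window?` and the constant are made up for this example. Treating a missing timestamp as "never contacted, so try anyway" is an assumption the comment does not cover.

```ruby
# Hypothetical helper; the name and the nil handling are assumptions.
REACH_WINDOW = 30 * 24 * 60 * 60 # 30 days in seconds

def within_reach_window?(last_reached_at, now = Time.now)
  # Assumption: no timestamp means the pod was never contacted, so try it.
  return true if last_reached_at.nil?

  now - last_reached_at < REACH_WINDOW
end
```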

@Raven24

Member

commented Aug 2, 2015

I'm working on it and hopefully will have some coherent code ready for a PR tomorrow.

@Flaburgan

Member

commented Aug 3, 2015

Awesome! Please base your work on the stable branch ;)

@Raven24

Member

commented Aug 6, 2015

Just linking #6290 for reference.

@denschub denschub reopened this Mar 5, 2016

SuperTux88 added a commit to SuperTux88/diaspora that referenced this issue Sep 24, 2016

Don't federate to pods that are offline for more than two weeks
Also fix a case where offline_since can be nil.

fixes diaspora#6220

@denschub denschub closed this in fe5811b Sep 25, 2016
