Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

background_jobs crashes makes other instances not receive federation updates #1820

Closed
dessalines opened this issue Oct 6, 2021 · 22 comments
Closed
Labels
bug Something isn't working

Comments

@dessalines
Copy link
Member

dessalines commented Oct 6, 2021

This seems to happen on lemmy intemittently, sometimes after a week or so of running fine. background_jobs will stop showing up in the logs.

cc @asonix @Nutomic

This issue happened again, twice in one day. A restart seemed to fix it again.

edit: I'm tailing the log right now:

sudo docker-compose logs -f --tail=1000 lemmy | grep activity_queue
lemmy_1     | [2021-10-06T23:13:25Z INFO  lemmy_apub::activity_queue] Sending activity https://lemmy.ml/activities/dislike/d1dbd77f-8e84-4367-ac94-92c3057f6681
lemmy_1     | [2021-10-06T23:13:29Z INFO  lemmy_apub::activity_queue] Sending activity https://lemmy.ml/activities/undo/8f646974-c257-4f01-aac7-24c3fdb765f2
lemmy_1     | [2021-10-06T23:14:52Z INFO  lemmy_apub::activity_queue] Sending activity https://lemmy.ml/activities/dislike/b6e8b8af-a6e0-484b-aac3-4496b8fe02bb
lemmy_1     | [2021-10-06T23:15:13Z INFO  lemmy_apub::activity_queue] Sending activity https://lemmy.ml/activities/undo/2846bd5c-a041-4b23-92a6-e7dcc93416be
@dessalines dessalines added the bug Something isn't working label Oct 6, 2021
@asonix
Copy link
Collaborator

asonix commented Oct 7, 2021

The only thing I can think of is the arbiter the jobs are running in is dying for some reason. You could maybe get around this by running the jobs on their own arbiter:

let arbiter = actix_web::rt::Arbiter::new();

WorkerConfig::new(|| MyState {
  client: Client::default(),
})
.register::<SendActivityTask>()
.start_in_arbiter(&arbiter, queue_handle.clone());

@asonix
Copy link
Collaborator

asonix commented Oct 7, 2021

There's also a warn level log that emits when a worker closes cleanly: Worker {id} closing
I imagine these aren't closing cleanly, though.

@asonix
Copy link
Collaborator

asonix commented Oct 7, 2021

I've just published a new version of background-jobs-actix which should warn when workers drop regardless of cleanliness, so you can try updating to that version to catch when they drop

background-jobs-actix = "0.9.3"

you should be able to pull it in automatically with cargo update

@dessalines
Copy link
Member Author

Thx, I'll try to get new builds out for this today, including the arbiter fix. I'm not sure what's causing the crash, but unfortunately there's nothing in the logs. It crashed sometime last night, and new actions aren't ending up in the activity queue.

lemmy_1     | [2021-10-07T01:53:48Z WARN  lemmy_apub::activity_queue] error sending request for url (https://forum.purplerabbit.xyz/inbox): error trying to connect: tcp connect error: Operation timed out (os error 110)
lemmy_1     | [2021-10-07T01:54:02Z INFO  lemmy_apub::activity_queue] Sending activity https://lemmy.ml/activities/announce/4181cbf5-469c-431f-9372-5f0df07ef2db

@dessalines
Copy link
Member Author

dessalines commented Oct 7, 2021

@asonix background-jobs and background-jobs-actix are still at 0.10.0: https://crates.io/crates/background-jobs/versions

Let me know when you get those updated, or if I can go to a git version.

@dessalines
Copy link
Member Author

nm, I downgraded to 0.9.0, and that picked up the new one. Running checks now.

dessalines added a commit that referenced this issue Oct 7, 2021
@dessalines
Copy link
Member Author

Okay I deployed 0.13.1 on lemmy.ml with the arbiter fix and extra logging above. I'll keep tailing the log to see if there's any problems.

@asonix
Copy link
Collaborator

asonix commented Oct 7, 2021

yeah 0.10 switched log out for tracing so i figured until y'all move your logging to tracing as well I'll release fixes for 0.9

dessalines added a commit that referenced this issue Oct 8, 2021
Nutomic pushed a commit that referenced this issue Oct 8, 2021
@dessalines
Copy link
Member Author

The arbiter seems to do it, I'll re-open if we encounter this again.

@dessalines dessalines reopened this Oct 29, 2021
@dessalines
Copy link
Member Author

This bug cropped up again after 3 weeks of running fine. I didn't run the correct log check to make sure its good, but its :

sudo docker-compose logs -f --tail=100 lemmy | grep background_jobs

I'll make sure to check that next time we have another issue.

@Nutomic
Copy link
Member

Nutomic commented Nov 1, 2021

Theres a new version of background-jobs which we should upgrade to.

@kromonos
Copy link

Can you please take a look again? Looks like my instance fapsi.be doesn't get any updates from lemmy.ml at the moment.

@dessalines
Copy link
Member Author

dessalines commented Nov 17, 2021

Seems like it crashed yesterday, with nothing in the logs again:

sudo docker logs --since 24h lemmyml_lemmy_1 2>&1 | grep background_jobs

These messages are pretty common, but its the last one:

[2021-11-16T19:56:39Z INFO  background_jobs_core::processor_map] Job 395c85d9-139a-421a-a32c-cb8fbcfff6f3 SendActivityTask completed 0.361524
[2021-11-16T19:56:39Z INFO  background_jobs_core::processor_map] Job 32abbf6d-e034-495d-8000-888d1b915277 SendActivityTask errored Error performing job: Failed to send activity {"actor":["https://lemmy.ml/c/linux"],"to":["https://www.w3.org/ns/activitystreams#Public"],"object":{"actor":["https://lemmy.ml/u/VonMax"],"to":["https://www.w3.org/ns/activitystreams#Public"],"object":["https://lemmy.ml/comment/92138"],"cc":[["https://lemmy.ml/c/linux"]],"type":"Like","id":"https://lemmy.ml/activities/like/a843fb84-b192-4e40-803d-cab8aa4ca1ff","@context":["https://www.w3.org/ns/activitystreams",{"comments_enabled":{"type":"sc:Boolean","id":"pt:commentsEnabled"},"moderators":"as:moderators","matrixUserId":{"type":"sc:Text","id":"as:alsoKnownAs"},"stickied":"as:stickied","sc":"http://schema.org#","sensitive":"as:sensitive","pt":"https://join-lemmy.org#"},"https://w3id.org/security/v1"]},"cc":["https://lemmy.ml/c/linux/followers"],"type":"Announce","id":"https://lemmy.ml/activities/announce/0ebc2f61-3916-4442-9505-8b68817d2fa7","@context":["https://www.w3.org/ns/activitystreams",{"stickied":"as:stickied","sc":"http://schema.org#","sensitive":"as:sensitive","moderators":"as:moderators","pt":"https://join-lemmy.org#","comments_enabled":{"type":"sc:Boolean","id":"pt:commentsEnabled"},"matrixUserId":{"type":"sc:Text","id":"as:alsoKnownAs"}},"https://w3id.org/security/v1"]} to https://forum.purplerabbit.xyz/inbox 0.012036

I've saved the entire log now, and restarted lemmy.

Okay I've searched the log for a lot of different terms, and unfortunately it seems that background_jobs crashes without an error message. @asonix

@kromonos
Copy link

Can confirm, that the federation is working again.

image

@dessalines
Copy link
Member Author

Sweet, sorry about this again.

@asonix
Copy link
Collaborator

asonix commented Nov 17, 2021

@dessalines are you sure there's no background jobs log about the Ticker stopping? if there's not, then this doesn't look like a crash, it just looks like it stops doing anything

@dessalines
Copy link
Member Author

No ticker or Ticker in the logs

@asonix
Copy link
Collaborator

asonix commented Nov 17, 2021

@dessalines does it look like there could be jobs that started but didn't finish? Probably not since the last log line involves a job finishing but

@kromonos
Copy link

Is it possible, that this bug hit lemmy.ml again?
Lemmy.ml: https://lemmy.ml/post/110385
Same on fapsi.be: https://fapsi.be/post/13991

@asonix
Copy link
Collaborator

asonix commented Dec 10, 2021

I wonder if this is related:

I think maybe a bit more logging is in order to confirm where the problem lies, though

@dessalines
Copy link
Member Author

@kromonos k I just did a lemmy restart, it temporarily fixed it. I'll re-add restarts to our nightly cron job until we can figure out why this keeps happening.

@Nutomic
Copy link
Member

Nutomic commented Jun 21, 2022

I think this problem was caused by too low worker count and wrong implementation which meant that failed activity sends would not be retried. At least i havent heard of any similar problems since fixing those.

@Nutomic Nutomic closed this as completed Jun 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants