
fix: automatic restart on changedetection memory leak #1831

Closed
adripo wants to merge 3 commits

Conversation

adripo commented Oct 3, 2023

Based on https://github.com/dgtlmoon/changedetection.io/wiki/Playwright-content-fetcher#playwright-memory-leak

A memory leak occurs when changedetection monitors large html pages with playwright.

I added automatic restart to docker compose when memory leak occurs.

deploy:
  resources:
    limits:
      memory: 1G
dgtlmoon (Owner) commented:

this is super cool, however, even better is to leave this commented out with # so that people can install it

juuuust in case someone really needs it to run with that much RAM, at least they can find an easy solution
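
For example, shipping it commented out in the example compose might look like this (just a sketch; the 1G figure is only the value used in this PR):

# Optional: cap memory so Docker restarts the container if the limit is hit.
# Uncomment if you are affected by the memory growth described in the wiki.
# deploy:
#   resources:
#     limits:
#       memory: 1G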

dgtlmoon (Owner) commented Oct 3, 2023

@adripo please don't make silly demands in other issues, the memory issue is actually due to LXML string handling, and you even put the memory restart feature on the changedetection.io container, not the browserless container

adripo (Author) commented Oct 4, 2023

I apologize for any misunderstanding caused by my previous messages. I've taken your feedback and removed my question from the other issue.

adripo (Author) commented Oct 4, 2023

Going back to this PR: I followed the suggestion from the wiki, which suggests killing the changedetection service and not the browserless service.
This behaviour works fine for me and solves the memory leak from this container.

I've also tested moving the restart policy to the browserless container instead of the changedetection one; this is the RAM usage after one day.

browserless container with restart policy - no restart in 1d
(screenshot)

changedetection container without restart policy - already 4GB ram
(screenshot)

This is the docker compose that I am using:

docker-compose.yml
version: "3"

services:
  app:
    image: ghcr.io/dgtlmoon/changedetection.io:latest
    environment:
      - PUID=1000
      - PGID=1000
      # Alternative Playwright URL, do not use "'s or 's!
      - PLAYWRIGHT_DRIVER_URL=ws://playwright-chrome:3000/?--disable-web-security=true
      # Base URL of your changedetection.io install (Added to the notification alert)
      - BASE_URL=${APP_BASE_URL}
      # Hides the `Referer` header so that monitored websites can't see the changedetection.io hostname.
      - HIDE_REFERER=true
    volumes:
      - data:/datastore
    deploy:
      resources:
        limits:
          memory: 2G
      restart_policy:
        condition: on-failure
    restart: unless-stopped
    depends_on:
      - playwright-chrome

  playwright-chrome:
    image: browserless/chrome:1-chrome-stable
    environment:
      # Set session timeout to 10 min
      - CONNECTION_TIMEOUT=600000
      - DEFAULT_STEALTH=true
      # Ignore HTTPS errors, like for self-signed certs
      - DEFAULT_IGNORE_HTTPS_ERRORS=true
    restart: unless-stopped

volumes:
  data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /path/to/data

Without the restart policy, it has already saturated all 32GB of my RAM multiple times in the past.

dgtlmoon (Owner) commented Oct 4, 2023

I understand. Another critical thing is that the Docker shutdown MUST send SIGTERM, not SIGKILL, or it risks corrupting the database. Does this cover that situation?

adripo (Author) commented Oct 4, 2023

It just limits the container memory and triggers an OOME (out-of-memory exception) when the limit is reached. The kernel handles it and kills the process, so Docker detects the exit and restarts the container. The signal is managed by the kernel, so the behaviour can change from system to system.

More details here: https://docs.docker.com/compose/compose-file/compose-file-v3/#out-of-memory-exceptions-oome

Constantin1489 (Contributor) commented Oct 5, 2023

I think the CD manual (or wiki) that we have misleads us: it talks about the Playwright memory issue, but the script it gives applies the limit to the CD container.

So the wiki should only recommend the OOM killer for the Playwright container.

@adripo I noticed CD doesn't write changes immediately, so when CD received 'podman rmi --force' or similar, the unwritten changes were just gone. I've just read about the OOM killer (https://unix.stackexchange.com/a/305840): it sends SIGKILL, and as far as I know 'trap' cannot catch SIGKILL, so it will cause data loss. (FYI, CD has signal.signal(signal.SIGTERM, sigterm_handler).)

Also, FYI, I don't know whether podman has this problem, but some Docker versions do: hitting the container kernel memory limit can cause OOM on the Docker host.
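
Something along these lines, maybe (just a sketch - the 1G figure is borrowed from this PR and would need tuning for the browser's real usage):

playwright-chrome:
  image: browserless/chrome:1-chrome-stable
  # Put the memory cap on the browser container instead, so an OOM kill
  # (SIGKILL) never lands on changedetection while it has unwritten data
  deploy:
    resources:
      limits:
        memory: 1G
  restart: unless-stopped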

dgtlmoon (Owner) commented Oct 5, 2023

For the sake of argument, here is our SIGTERM handler:

https://github.com/dgtlmoon/changedetection.io/blob/master/changedetectionio/changedetection.py#L27

https://github.com/dgtlmoon/changedetection.io/blob/master/changedetectionio/changedetection.py#L105

if SIGKILL is sent without a prior SIGTERM... who knows what can happen

dgtlmoon (Owner) commented Oct 5, 2023

@adripo please update the title and description of your issue here, the memory leak you are 'fixing' actually has nothing to do with Playwright other than processing huge amounts of HTML

dgtlmoon closed this Oct 5, 2023
Constantin1489 (Contributor) commented Oct 5, 2023

@dgtlmoon FYI, my CD container actually doesn't stop correctly with SIGTERM if the --time is short (mine is the default 10s). (Hmmm?)
--stop: first SIGTERM with a wait, then SIGKILL.

#I use podman instead of docker.
[shelf@localhost ~]$ podman stop xpath_for_infra 
WARN[0010] StopSignal SIGTERM failed to stop container xpath_for_infra in 10 seconds, resorting to SIGKILL 
xpath_for_infra

dgtlmoon reopened this Oct 5, 2023
dgtlmoon (Owner) commented Oct 5, 2023

@dgtlmoon FYI, my CD container actually doesn't stop correctly with SIGTERM if the --time is short (mine is the default 10s). (Hmmm?)
--stop: first SIGTERM with a wait, then SIGKILL.

aaah ok! yes, this needs a test then, can you confirm it's stopping at all under SIGTERM?
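
If it does stop but just slowly, one thing worth trying (an assumption on my part, not verified here) is a longer grace period before the runtime escalates to SIGKILL, e.g. in compose:

app:
  image: ghcr.io/dgtlmoon/changedetection.io:latest
  # Default grace period is 10s; give the SIGTERM handler more time to flush
  # the datastore before the runtime falls back to SIGKILL
  stop_grace_period: 60s

(podman has the equivalent podman stop --time setting mentioned above.)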

Constantin1489 (Contributor) commented Oct 5, 2023

@dgtlmoon I actually noticed that your trap doesn't capture it.

This CD has two changedetection.py files.

#signal.signal(signal.SIGCHLD, sigchld_handler)

Maybe the SIGTERM goes to the wrong .py or something - I didn't read all the code in your program, so I'm not sure how to fix it currently. I will look at it within two weeks; right now I'm too busy to deep-dive.

adripo (Author) commented Oct 6, 2023

@adripo please update the title and description of your issue here, the memory leak you are 'fixing' actually has nothing to do with Playwright other than processing huge amounts of HTML

Okay, then I think there is another issue with changedetection. I am just monitoring 40 Amazon products once every hour. Can this be a problem? Why is the memory usage continuously growing?

All the watched pages are using Playwright. In the past I used basic HTML fetching without noticing this memory issue, but some pages were continuously failing to parse with an error message that the "filter pattern cannot be found"; I had to rerun them multiple times before they succeeded.

adripo changed the title from "fix: automatic restart on memory leak" to "fix: automatic restart on changedetection memory leak" on Oct 6, 2023
adripo (Author) commented Oct 6, 2023

I think the CD manual (or wiki) that we have misleads us: it talks about the Playwright memory issue, but the script it gives applies the limit to the CD container.

So the wiki should only recommend the OOM killer for the Playwright container.

@adripo I noticed CD doesn't write changes immediately, so when CD received 'podman rmi --force' or similar, the unwritten changes were just gone. I've just read about the OOM killer (https://unix.stackexchange.com/a/305840): it sends SIGKILL, and as far as I know 'trap' cannot catch SIGKILL, so it will cause data loss. (FYI, CD has signal.signal(signal.SIGTERM, sigterm_handler).)

Also, FYI, I don't know whether podman has this problem, but some Docker versions do: hitting the container kernel memory limit can cause OOM on the Docker host.

Thank you for the input. I wasn't aware of this issue with Docker, and it hasn't happened to me so far.

I think it's better, then, to find a solution for the growing memory usage and avoid those limitations.

dgtlmoon (Owner) commented Oct 6, 2023

I believe the memory issue is part of an LXML parsing bug, which may actually be related to the way the memory is 'perceived' to be used. I spent 3 weeks of unpaid time digging around and I couldn't figure it out... LXML is used to parse the HTML etc.

all I can say is - dig in the direction of LXML :)

also - you can work around this by just restarting your containers regularly
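
for example, something like this sidecar could do it (purely a sketch - the docker:cli image, the 24h interval and the container name "changedetection" are all assumptions, adjust them for your own stack):

  # Hypothetical helper service that restarts the app container once a day
  scheduled-restart:
    image: docker:cli
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    # "changedetection" is a placeholder - substitute the real container name
    entrypoint: ["/bin/sh", "-c", "while true; do sleep 86400; docker restart changedetection; done"]
    restart: unless-stopped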

closing for now - let's add an automated test for the shutdown behaviour under SIGTERM
