
Extremely slow on Ubuntu 20.04 Server #425

Closed
Davincible opened this issue Feb 21, 2021 · 25 comments

@Davincible

I am running the edge version of the Docker images; I tried it both locally (Manjaro) and on my Ubuntu server. Locally it works fine; on the server I need to wait a minute for all the requests to complete. Resources are not the issue, it's a quad-core with 8 GB RAM. I tried it both with a MariaDB container and with a MariaDB instance on the server itself. Running it directly on the server, installed with bench, also works fine.

I've disabled Traefik and exposed the port of the nginx container directly.
I can't seem to figure out what's causing it to slow down, as logs are extremely limited.

Any ideas as to what's going on?

Davincible added the bug label Feb 21, 2021
revant added question and removed bug labels Feb 22, 2021
@revant
Collaborator

revant commented Feb 22, 2021

Is it slow in docker only?

Site creation is slow; on the develop branch, it creates 950+ tables in mariadb when creating a new site.

@Davincible
Author

Davincible commented Feb 22, 2021

Yes, only in docker. I have also installed it with the bench command, and those web requests return instantaneously.

I don't mean during site creation, but during site operation. I see the requests coming in in the docker logs, but somehow it takes a long time to process all requests.

As the same simple setup does work fast on Manjaro, I am inclined to speculate that some configuration is affecting the Ubuntu server setup, but I am not sure where to look.

I have tried a few different containers on the server, and it's the same result every time.

@revant
Collaborator

revant commented Feb 24, 2021

I'm using Ubuntu 20.04 for k8s nodes. That's what my cloud provider provides by default.

@revant
Collaborator

revant commented Feb 25, 2021

I think we need more details to fix the config.

I'm closing the issue.

Re-open if needed.

revant closed this as completed Feb 25, 2021
@revant
Collaborator

revant commented Feb 25, 2021

❯ kubectl get node -o wide -w
NAME                            STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP     OS-IMAGE                        KERNEL-VERSION     CONTAINER-RUNTIME
project-44cfec5f844943c78f34c   Ready    <none>   87d   v1.20.4   XX.XX.XXX.XX   XX.XX.XXX.XXX   Ubuntu 20.04.1 LTS 2da9bb3059   5.4.0-53-generic   docker://19.3.13
project-95d303e2489c4af2b5c14   Ready    <none>   87d   v1.20.4   XX.XX.XX.XXX   XX.XXX.XXX.XX   Ubuntu 20.04.1 LTS 2da9bb3059   5.4.0-53-generic   docker://19.3.13
project-d58c47a5a3e7449b9c299   Ready    <none>   87d   v1.20.4   XX.XX.XXX.XX   XX.XX.XXX.XXX   Ubuntu 20.04.1 LTS 2da9bb3059   5.4.0-53-generic   docker://19.3.13

@Davincible
Author

What details do you want? I wouldn't consider it closed. This is the only docker container I get slow responses with; I've tried Odoo and Wazuh as well.

@revant
Collaborator

revant commented Feb 25, 2021

I'll keep it open. If someone else from the community finds a fix, we'll close it. At least I'm not fixing it for now.

I really can't help right now; my Ubuntu 20.04 LTS servers are not causing any problems. If I face any, I'll have to fix them, since I depend on this with a lot of data already running in production.

Also, the edge version introduces many things daily. Keep trying daily to see if there are any improvements, or keep track of the Travis CI cron job.

I'm not using the edge version in production.

I'm using up-to-date v12 and v13-beta.

revant reopened this Feb 25, 2021
@sunhoww
Contributor

sunhoww commented Mar 9, 2021

One possible reason could be the mariadb config. The bench install sets innodb-buffer-pool-size dynamically based on the host memory. See -
https://github.com/frappe/bench/blob/f3809b00acc4bfa586e9a12116fb6bd262d3226e/bench/playbooks/roles/mariadb/files/mariadb_config.cnf#L46

However, the config file here does not. So, I had to manually set the config value after going through this -
https://dba.stackexchange.com/questions/27328/how-large-should-be-mysql-innodb-buffer-pool-size
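
For reference, roughly what I ended up setting by hand; the 1G figure and the mount path are just examples from my setup, not something frappe_docker ships:

[mysqld]
# pick a value for your host; ~50-70% of the RAM you can dedicate to MariaDB
innodb-buffer-pool-size = 1G

One way to apply it is to mount a file like this into the mariadb container under /etc/mysql/conf.d/ (the official mariadb image picks up *.cnf files from there), or to set it however your database is actually deployed.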

Another thing could be the nginx config, specifically client_body_buffer_size. I was getting a lot of warnings like this -

1963/11/22 12:30:00 [warn] 150#150: *211054 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000001631, client: 0.0.0.0, server: example.com, request: "POST /api/method/frappe.desk.form.save.savedocs HTTP/2.0", host: "example.com", referrer: "https://example.com/desk"

Possibly because requests forwarded from the proxy to the containers are already decompressed. I'm not 100% sure about this, though. I was using docker-compose-letsencrypt-nginx-proxy-companion at the time and did change the mentioned config value, among others.
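
For what it's worth, the kind of override I used on the proxy (the 64k value is arbitrary; tune it to your typical POST body size):

# http/server block of the proxy's nginx config: raise the in-memory request body
# buffer so small POST bodies are not spilled to /var/cache/nginx/client_temp
client_body_buffer_size 64k;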

Lastly, I noticed that the worker services seem to run wild sometimes; maybe you need some resource limits on these containers, although I haven't gotten around to doing this myself.
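
Something like this might be a starting point, though it's untested on my side; the service name and the limits are illustrative, and deploy.resources is honored in swarm mode or by docker-compose when run with --compatibility:

erpnext-worker-default:
  ...
  deploy:
    resources:
      limits:
        cpus: "1.0"
        memory: 1G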

@Davincible
Author

@sunhoww Hmm, those sound like things that could cause the issue. I'll have a look at them, thanks.

@Davincible
Author

@sunhoww I did some digging and don't think it's the database limit. In hindsight, I'm using a local DB on my host instead of a container, and frappe set the limit to 5G; I manually set it to 1G, as that was the value reported by the command in the linked StackExchange answer.

However, while looking at the request logs and the devtools network inspector, I noticed that the issue is a request to /website_script.js?ver=1615007321.0.
In my host nginx error log it shows up after a minute (timeout), with the following entries:

2021/03/18 22:37:41 [warn] 4093098#4093098: *53592 upstream server temporarily disabled while connecting to upstream, client: <ip>, server: example.com, request: "GET /website_script.js?ver=1615007321.0 HTTP/1.1", upstream: "http://[::1]:8060/website_script.js?ver=1615007321.0", host: "example.com", referrer: "example.com"
2021/03/18 22:37:41 [error] 4093098#4093098: *53592 upstream timed out (110: Connection timed out) while connecting to upstream, client: <ip>, server: example.com, request: "GET /website_script.js?ver=1615007321.0 HTTP/1.1", upstream: "http://[::1]:8060/website_script.js?ver=1615007321.0", host: "example.com", referrer: "example.com"

Do you have any idea what might be causing this?

@Davincible
Author

@revant How can I check the server logs in the python container? The default logs are very limited. I want to figure out why /website_script.js is timing out.

@revant
Collaborator

revant commented Mar 20, 2021

Recently "logs" volume was contributed

#422 check if it helps.

For old images logs are in container, exec into the container and grep frappe-bench/logs directory for errors.
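
Something like this; the container name depends on your compose project, and the bench path inside the image is my assumption, adjust if it lives elsewhere:

docker ps --filter name=erpnext-python      # find the exact container name
docker exec -it <project>_erpnext-python_1 \
  grep -riE "error|timeout" /home/frappe/frappe-bench/logs/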

@sunhoww
Contributor

sunhoww commented Mar 20, 2021

Sorry for the late reply. If it is only happening with /website_script.js and other dynamically generated content and NOT with the static assets, you can maybe disable/stop the other worker containers (-default, -short, -long, -schedule) and check if the timeout still persists.
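
Roughly (the service names follow the -default/-short/-long/-schedule suffixes from the single-bench compose file; adjust them to whatever your compose file actually uses):

docker-compose stop erpnext-worker-default erpnext-worker-short \
                    erpnext-worker-long erpnext-schedule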

@Davincible
Author

Hmm, when I turn off the workers none of the dynamic links get resolved; when they're on they all do except that single request. What I also noticed is that website_script.js is requested twice: the first request comes through without a problem, and then the second one times out a minute later.

@revant Yeah, I noticed. Unfortunately, those logs only record high-level processes and are not very telling in this case.

@sunhoww
Contributor

sunhoww commented Mar 20, 2021

Interesting. I have no issue serving any resources, dynamic or otherwise, when the workers are disabled (running just the -nginx, -python and -socketio containers).

  1. Are you running the frappe build of the images or have you rebuilt some of the images?
  2. Are you using docker-compose or running in swarm mode?

@Davincible
Author

@sunhoww

  1. Not sure what build I'm running, just regular old docker-compose without the traefik; tried the latest version and version-13-beta.
  2. WARNING: The Docker Engine you're using is running in swarm mode. -- I suppose I am; not sure what the implications of this are.
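
For what it's worth, something like this checks and disables swarm mode; exact commands may differ on your setup, and --force is only needed on a manager node:

docker info --format '{{.Swarm.LocalNodeState}}'   # prints "active" when swarm mode is on
docker swarm leave --force                          # take this node out of swarm mode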

@Davincible
Author

Davincible commented Mar 20, 2021

I've disabled swarm and tried with and without the extra services. Now the other dynamic content does get resolved, where previously it only did if the extra services were enabled, but it doesn't make a difference for the timed-out request, as shown in the screenshot.

[screenshot: devtools network inspector showing the timed-out /website_script.js request]

My docker-compose is here: https://gist.github.com/Davincible/d4b9f02bd5d9f60352780ffe5d88ae4c

@sunhoww
Contributor

sunhoww commented Mar 20, 2021

Weird how everything is being requested twice. I just noticed this on my production deploys as well. It's happening just with Firefox though, and not with Chrome. Maybe something to do with the application code, so probably not related to the issue at hand.

Did these...

I take it you're following the guide here - https://github.com/frappe/frappe_docker/blob/develop/docs/single-bench.md

I haven't used this, so I tried it on a 1 vCPU / 4 GB RAM VM running Debian 10. Again no issues, apart from the duped requests. I tried with both the version-12 and edge tags, even with YOUR compose file from the gist. Of course, I had to re-enable the mariadb container and make requests over port 80. Also, I did not proceed to the setup wizard stages.

One more thing you could check is the disk space. Some applications take a performance hit when available space becomes limited. While you're at it, you could look into the disk IOPS and throughput.
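
Quick ways to look (iostat comes from the sysstat package; the docker data path shown is the default one):

df -h /var/lib/docker     # free space where docker keeps images/volumes by default
iostat -dxm 5             # watch %util and await for the disk backing that path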

@revant
Collaborator

revant commented Mar 21, 2021

Can you also check the Ubuntu kernel and the Manjaro kernel, and see if something is related to that?

This is one of my DigitalOcean VPSes; it's on Ubuntu 20.04 and running in production with a few sites.
I'm using swarm mode with Portainer.

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.1 LTS
Release:        20.04
Codename:       focal
Kernel:         Linux docker 5.4.0-42-generic #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

I faced some problems with the KVM kernel; I didn't try to figure out why, I just changed to the generic kernel and everything worked as expected.

@Davincible
Author

@revant Pretty much the same here; I am running the docker containers on an Ubuntu VPS too, as I want to use them for production:

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal
Kernel:         Linux Naboo 5.4.0-67-generic #75-Ubuntu SMP Fri Feb 19 18:03:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Docker:         Docker version 19.03.8, build afacb8b7f0
Docker-Compose: Docker-compose version 1.25.0, build unknown

@Davincible
Author

@sunhoww Disk usage is 50 GB/150 GB, and IOPS shouldn't be an issue either, as I wasn't running much else during testing. It's a quad-core VPS with 8 GB RAM that has handled a bunch of other containers too, so I'd assume hardware is not the issue.

I'm really curious to see the direct log output from the python server in the containers and watch the requests there as they come in and get processed, as I think that might give some valuable insight. Unfortunately, I haven't been able to figure out how to get any of these logs besides the main generic logs in the logs folder of the domain.

@Davincible
Author

@sunhoww My disk usage is 50 GB/150 GB, and the VPS has a quad-core CPU with 8 GB RAM, so I don't think hardware is the issue. I've run a bunch of other stuff without any issues.

What would be interesting would be to see the python server logs inside the python container, and to see the requests as they come in and are being processed; that could provide some information as to what is going wrong, but I haven't been able to find a way to get at such logs.

The other possibility I could think of is that something is going wrong in the proxy forwarding from the host to the container. My host nginx config is here.

@sunhoww
Contributor

sunhoww commented Mar 21, 2021

What would be interesting would be to see the python server logs inside the python container, and to see the requests as they come in and are being processed; that could provide some information as to what is going wrong, but I haven't been able to find a way to get at such logs.

I think you might need to change the gunicorn log level here -

gunicorn -b 0.0.0.0:$FRAPPE_PORT \
--worker-tmp-dir /dev/shm \
--threads=4 \
--workers $WORKERS \
--worker-class=gthread \
--log-file=- \
-t 120 frappe.app:application --preload
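
For example, the same command with verbose logging switched on; --log-level and --access-logfile are standard gunicorn flags, but how you get them into the container (overriding the command, patching the entrypoint, rebuilding) is up to you:

gunicorn -b 0.0.0.0:$FRAPPE_PORT \
--worker-tmp-dir /dev/shm \
--threads=4 \
--workers $WORKERS \
--worker-class=gthread \
--log-file=- \
--log-level=debug \
--access-logfile=- \
-t 120 frappe.app:application --preload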

Then you could just do docker logs <container-name>

The other possibility I could think of is that something is going wrong in the proxy forwarding from the host to the container.

Any particular reason why you need nginx and certbot? The traefik service included in the single-bench guide should be enough to proxy and manage certs. Maybe you can disable those host services and verify whether the proxy forwarding is indeed the issue.

@revant
Collaborator

revant commented May 6, 2021

The WORKER_CLASS environment variable for the erpnext-python container defaults to gthread.

Try setting it to sync:

erpnext-python:
  ...
  environment:
    ...
    - WORKER_CLASS=sync

I found the best performance with the gevent worker class.
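
For example (same pattern as above; this assumes gevent is available in the image, which I haven't verified for every tag):

erpnext-python:
  ...
  environment:
    ...
    - WORKER_CLASS=gevent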

@github-actions
Contributor

github-actions bot commented Aug 1, 2021

This issue has been automatically marked as stale. You have a week to explain why you believe this is an error.
