
proxy: increase timeout since raven can take lots of time to process requests #122

Merged
merged 4 commits into master from fix-raven-timeout-while-processing-requests on Feb 4, 2021

Conversation

@tlvu
Collaborator

@tlvu tlvu commented Feb 3, 2021

Fix this timeout error in some Raven notebooks:

```
HTTPError: 504 Server Error: Gateway Time-out for url:
https://pavics.ouranos.ca/twitcher/ows/proxy/raven/wps
```

This is just a work-around, since something is very wrong on our
production host Boreas (a physical host with 128G RAM and 48 logical CPUs).

During the test, Boreas had a "load average: 6.35, 5.90,
4.33". For its hardware specs, that is basically idle.

The increased timeout was not needed on my test VM (10G RAM, 2 CPUs),
on medus.ouranos.ca (a physical host with 16G RAM, 16 logical CPUs), or on
hirondelle.crim.ca (a VM with 32G RAM, 8 CPUs).

Should fix Ouranosinc/raven#362 and fix Ouranosinc/raven#357.

Ping @moulab88 to take a look at Boreas. Just a wild guess, is it due for a reboot?

Ping @richardarsenault: can you retry the 2 broken Raven notebooks? I've already deployed this to prod.
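For anyone who wants to confirm which timeout the proxy actually applies after deployment, one quick check is to grep the generated Nginx config inside the proxy container. This is only a sketch: the container name `proxy` and the `/etc/nginx` path are assumptions about this deployment, not part of this PR.

```
# Sketch only: container name "proxy" and the /etc/nginx path are assumptions.
docker exec proxy grep -R "proxy_read_timeout" /etc/nginx/
```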

@tlvu tlvu requested a review from huard February 3, 2021 21:47
Collaborator

@huard huard left a comment


I think we should document the fact that any process running longer than this limit will fail.
Also, what happens on the PyWPS side: is the process killed, or does it become a zombie?

@tlvu
Collaborator Author

tlvu commented Feb 4, 2021

I think we should document the fact that any process running longer than this limit will fail.
Also, what happens on the PyWPS side: is the process killed, or does it become a zombie?

It just continues and returns later ... when it's too late. So no zombie or accumulated queue.

@richardarsenault

@tlvu I confirm that the 2 notebooks (Ouranosinc/raven#362 and Ouranosinc/raven#357) now work without timing out!

@tlvu tlvu force-pushed the fix-raven-timeout-while-processing-requests branch from 114b6bb to 2507c6f February 4, 2021 01:22
@tlvu tlvu force-pushed the fix-raven-timeout-while-processing-requests branch from 143cdf5 to 4a6b5a4 February 4, 2021 01:32
@tlvu tlvu merged commit 3fcb760 into master Feb 4, 2021
@tlvu tlvu deleted the fix-raven-timeout-while-processing-requests branch February 4, 2021 01:36
@moulab88
Collaborator

moulab88 commented Feb 4, 2021

The machine has been up for 13 days; it does not need a reboot. I will analyze incoming traffic in more detail.

@tlvu
Collaborator Author

tlvu commented Feb 4, 2021

The machine has been up for 13 days; it does not need a reboot. I will analyze incoming traffic in more detail.

@moulab88 Just to be clear, the traffic is not blocked. There is simply something that appears to slow down Raven responses from the client's point of view. There are quite a few possible causes here:

  • The request takes time to reach Raven (maybe the OS firewall, Nginx, or Twitcher buffering the request)
  • Raven itself is slower (slow disk or network access to get the data? There is plenty of RAM and CPU, so probably not a RAM/CPU problem)
  • The response from Raven takes time to return to the caller (again, maybe the OS firewall, Nginx, or Twitcher buffering the response)

Ping me if you need anything.
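To help narrow down which of the layers above adds the delay, one option is to time the same lightweight WPS request through the full proxy chain and directly against Raven. A rough sketch only: the direct port 8096 is the one used for the direct-hit test later in this thread, and a GetCapabilities call only measures proxy overhead, not the slow Execute itself.

```
# Rough sketch: compare response times through Nginx + Twitcher vs. hitting Raven directly.
curl -s -o /dev/null -w "via twitcher: %{time_total}s\n" \
  "https://pavics.ouranos.ca/twitcher/ows/proxy/raven/wps?service=WPS&request=GetCapabilities"
curl -s -o /dev/null -w "direct raven: %{time_total}s\n" \
  "http://pavics.ouranos.ca:8096/?service=WPS&request=GetCapabilities"
```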

@moulab88
Collaborator

moulab88 commented Feb 4, 2021

The OS firewall is disabled. I will just check whether there are dropped packets on the connections.

@moulab88
Collaborator

moulab88 commented Feb 4, 2021

We have better latency on the local disks (/ or /home) than on the direct-attached RAID5 disks, which have better throughput:

To increase container I/O performance, maybe it is better to move all the Docker directories to the local disks.

@tlvu
Collaborator Author

tlvu commented Feb 4, 2021

We have better latency on the local disks (/ or /home) than on the direct-attached RAID5 disks, which have better throughput:

To increase container I/O performance, maybe it is better to move all the Docker directories to the local disks.

Sensible suggestion. The migration will take time; we have lots of images and data.

Also, there might not be enough space: / has 38G free, for sure not enough; /home has 114G free, probably not enough, and for sure not future-proof.

Given how complicated the migration would be, we should probably find a way to test the idea without doing the full migration.

@moulab88
Collaborator

moulab88 commented Feb 4, 2021

Forgot to mention /var (265 GB free); this space was reserved for this need.

@tlvu
Collaborator Author

tlvu commented Feb 4, 2021

We should also be using overlay2 as the storage driver for Docker, see https://docs.docker.com/storage/storagedriver/overlayfs-driver/

Currently, that is not the case:

```
$ docker info
 Storage Driver: overlay
  Backing Filesystem: xfs
  Supports d_type: true
 Docker Root Dir: /pvcs1/var-lib/docker
WARNING: the overlay storage-driver is deprecated, and will be removed in a future release.
```
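For reference, the storage driver is set via `/etc/docker/daemon.json`. A minimal sketch, assuming Docker is managed by systemd; the `data-root` value below is only an illustration of moving it off /pvcs1, and switching drivers hides existing images/containers until they are migrated or re-pulled.

```
# Sketch only: changing the storage driver makes existing images/containers
# invisible until re-pulled or migrated; plan the cleanup/migration first.
sudo systemctl stop docker
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "overlay2",
  "data-root": "/var/lib/docker"
}
EOF
sudo systemctl start docker
docker info | grep -A1 "Storage Driver"
```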

@tlvu
Collaborator Author

tlvu commented Feb 4, 2021

```
$ sudo du -sh /pvcs1/var-lib/docker/
478G    /pvcs1/var-lib/docker/
```

There is cleanup to do, and we'll need an even bigger partition!
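The usual Docker cleanup commands would be the starting point; whether each one is safe to run on Boreas depends on what must be kept:

```
docker system df        # breakdown of space used by images, containers, volumes, build cache
docker image prune -a   # remove images not referenced by any container (asks for confirmation)
docker system prune     # also remove stopped containers, dangling images and unused networks
```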

@moulab88
Collaborator

moulab88 commented Feb 4, 2021

Oh! It would be a good candidate for a new SSD/NVMe disk/partition.

@tlogan2000
Collaborator

With Raven, is there an issue with Progress=True or async mode like we have seen for Finch?

@tlvu
Collaborator Author

tlvu commented Feb 10, 2021

With Raven, is there an issue with Progress=True or async mode like we have seen for Finch?

This is an excellent question. There should be Raven notebooks using async mode, since issue Ouranosinc/raven#353 exists, but I have not seen the same "queue not cleared" problem with Raven! This is very odd: why does Finch have the problem and not Raven?

@huard
Collaborator

huard commented Feb 10, 2021

I don't know if we should read too much into that. Maybe Raven has not been exercised as intensively as Finch.
One difference I see is that Finch uses Sentry. Maybe it has an unhealthy relationship with PyWPS.

tlvu added a commit that referenced this pull request Feb 10, 2021
…uld-be-configurable

proxy: proxy_read_timeout config should be configurable

We have a performance problem with the production deployment at Ouranos, so we need a longer timeout. Being an Ouranos-specific need, it should not be hardcoded as in the previous PR #122.

The previous increase was sometimes not enough!

The value is now configurable via `env.local`, like most other customizations. Documentation updated.

Timeout in Prod:
```
WPS_URL=https://pavics.ouranos.ca/twitcher/ows/proxy/raven/wps FINCH_WPS_URL=https://pavics.ouranos.ca/twitcher/ows/proxy/finch/wps FLYINGPIGEON_WPS_URL=https://pavics.ouranos.ca/twitcher/ows/proxy/flyingpigeon/wps pytest --nbval-lax --verbose docs/source/notebooks/Running_HMETS_with_CANOPEX_dataset.ipynb --sanitize-with docs/source/output-sanitize.cfg --ignore docs/source/notebooks/.ipynb_checkpoints

HTTPError: 504 Server Error: Gateway Time-out for url: https://pavics.ouranos.ca/twitcher/ows/proxy/raven/wps

===================================================== 11 failed, 4 passed, 1 warning in 249.80s (0:04:09) ===========================================
```

Pass easily on my test VM with very modest hardware (10G ram, 2 cpu):
```
WPS_URL=https://lvupavicsmaster.ouranos.ca/twitcher/ows/proxy/raven/wps FINCH_WPS_URL=https://lvupavicsmaster.ouranos.ca/twitcher/ows/proxy/finch/wps FLYINGPIGEON_WPS_URL=https://lvupavicsmaster.ouranos.ca/twitcher/ows/proxy/flyingpigeon/wps pytest --nbval-lax --verbose docs/source/notebooks/Running_HMETS_with_CANOPEX_dataset.ipynb --sanitize-with docs/source/output-sanitize.cfg --ignore docs/source/notebooks/.ipynb_checkpoints

=========================================================== 15 passed, 1 warning in 33.84s ===========================================================
```

Pass against Medus:
```
WPS_URL=https://medus.ouranos.ca/twitcher/ows/proxy/raven/wps FINCH_WPS_URL=https://medus.ouranos.ca/twitcher/ows/proxy/finch/wps FLYINGPIGEON_WPS_URL=https://medus.ouranos.ca/twitcher/ows/proxy/flyingpigeon/wps pytest --nbval-lax --verbose docs/source/notebooks/Running_HMETS_with_CANOPEX_dataset.ipynb --sanitize-with docs/source/output-sanitize.cfg --ignore docs/source/notebooks/.ipynb_checkpoints

============================================== 15 passed, 1 warning in 42.44s =======================================================
```

Pass against `hirondelle.crim.ca`:
```
WPS_URL=https://hirondelle.crim.ca/twitcher/ows/proxy/raven/wps FINCH_WPS_URL=https://hirondelle.crim.ca/twitcher/ows/proxy/finch/wps FLYINGPIGEON_WPS_URL=https://hirondelle.crim.ca/twitcher/ows/proxy/flyingpigeon/wps pytest --nbval-lax --verbose docs/source/notebooks/Running_HMETS_with_CANOPEX_dataset.ipynb --sanitize-with docs/source/output-sanitize.cfg --ignore docs/source/notebooks/.ipynb_checkpoints

=============================================== 15 passed, 1 warning in 35.61s ===============================================
```

For comparison, a run on Prod without Twitcher (PR bird-house/birdhouse-deploy-ouranos#5):
```
WPS_URL=https://pavics.ouranos.ca/raven/wps FINCH_WPS_URL=https://pavics.ouranos.ca/twitcher/ows/proxy/finch/wps FLYINGPIGEON_WPS_URL=https://pavics.ouranos.ca/twitcher/ows/proxy/flyingpigeon/wps pytest --nbval-lax --verbose docs/source/notebooks/Running_HMETS_with_CANOPEX_dataset.ipynb --sanitize-with docs/source/output-sanitize.cfg --ignore docs/source/notebooks/.ipynb_checkpoints

HTTPError: 504 Server Error: Gateway Time-out for url: https://pavics.ouranos.ca/raven/wps

================================================ 11 failed, 4 passed, 1 warning in 248.99s (0:04:08) =================================================
```

A run on Prod without Twitcher and Nginx (hitting Raven directly):
```
WPS_URL=http://pavics.ouranos.ca:8096/ FINCH_WPS_URL=https://pavics.ouranos.ca/twitcher/ows/proxy/finch/wps FLYINGPIGEON_WPS_URL=https://pavics.ouranos.ca/twitcher/ows/proxy/flyingpigeon/wps pytest --nbval-lax --verbose docs/source/notebooks/Running_HMETS_with_CANOPEX_dataset.ipynb --sanitize-with docs/source/output-sanitize.cfg --ignore docs/source/notebooks/.ipynb_checkpoints

===================================================== 15 passed, 1 warning in 218.46s (0:03:38) ======================================================
```
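
As an illustration of the `env.local` mechanism described in that commit message, the override would look something like the line below; the variable name here is hypothetical, see the documentation added in that PR for the actual one.

```
# In env.local -- variable name is hypothetical, check the birdhouse-deploy docs for the real one
export PROXY_READ_TIMEOUT="240s"
```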
@tlvu
Collaborator Author

tlvu commented Feb 19, 2021

@moulab88 Performance seems to have improved on prod Boreas; did you change something? See the test results in Ouranosinc/PAVICS-e2e-workflow-tests#61.

@moulab88
Collaborator

I did nothing on my side.
