
proxy: increase timeout since raven can take lots of time to process requests #122

Merged
merged 4 commits into master from fix-raven-timeout-while-processing-requests on Feb 4, 2021

Conversation

@tlvu
Collaborator

@tlvu tlvu commented Feb 3, 2021

Fix this timeout error in some Raven notebooks:

```
HTTPError: 504 Server Error: Gateway Time-out for url:
https://pavics.ouranos.ca/twitcher/ows/proxy/raven/wps
```

This is just a work-around, since something is very wrong on our
production host Boreas (a physical host with 128G RAM and 48 logical CPUs).

During the test, Boreas had a "load average: 6.35, 5.90,
4.33". For its hardware specs, that is basically idle.

The increased timeout was not needed on my test VM (10G RAM, 2 CPUs),
on medus.ouranos.ca (a physical host with 16G RAM, 16 logical CPUs), or on
hirondelle.crim.ca (a VM with 32G RAM, 8 CPUs).

Should fix Ouranosinc/raven#362 and fix Ouranosinc/raven#357.

Ping @moulab88 to take a look at Boreas. Just a wild guess, is it due for a reboot?

Ping @richardarsenault: can you retry the 2 broken Raven notebooks? I've already deployed this to prod.
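For anyone who wants to confirm which timeout the proxy actually applies after deployment, one quick check is to grep the generated Nginx config inside the proxy container. This is only a sketch: the container name `proxy` and the `/etc/nginx` path are assumptions about this deployment, not part of this PR.

```
# Sketch only: container name "proxy" and the /etc/nginx path are assumptions.
docker exec proxy grep -R "proxy_read_timeout" /etc/nginx/
```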

@tlvu tlvu requested a review from huard February 3, 2021 21:47
Collaborator

@huard huard left a comment


I think we should document the fact that any process running longer than this limit will fail.
Also, what happens on the PyWPS side: is the process killed, or does it become a zombie?

@tlvu
Collaborator Author

tlvu commented Feb 4, 2021

I think we should document the fact that any process running longer than this limit will fail.
Also, what happens on the PyWPS side: is the process killed, or does it become a zombie?

It just continues and returns later ... when it's too late. So no zombie or accumulated queue.

@richardarsenault

@tlvu I confirm that the 2 notebooks (Ouranosinc/raven#362 and Ouranosinc/raven#357) now work without timing out!

@tlvu tlvu force-pushed the fix-raven-timeout-while-processing-requests branch from 114b6bb to 2507c6f February 4, 2021 01:22
@tlvu tlvu force-pushed the fix-raven-timeout-while-processing-requests branch from 143cdf5 to 4a6b5a4 February 4, 2021 01:32
@tlvu tlvu merged commit 3fcb760 into master Feb 4, 2021
@tlvu tlvu deleted the fix-raven-timeout-while-processing-requests branch February 4, 2021 01:36
@moulab88
Collaborator

moulab88 commented Feb 4, 2021

The machine has been up for 13 days; it does not need a reboot. I will analyze incoming traffic in more detail.

@tlvu
Collaborator Author

tlvu commented Feb 4, 2021

The machine has been up for 13 days; it does not need a reboot. I will analyze incoming traffic in more detail.

@moulab88 Just to be clear, the traffic is not blocked. There is simply something that appears to slow down Raven responses from the client's point of view. There are quite a few possible causes here:

  • The request takes time to reach Raven (maybe the OS firewall, Nginx, or Twitcher buffering the request)
  • Raven itself is slower (slow disk or network access to get the data? There is plenty of RAM and CPU, so probably not a RAM/CPU problem)
  • The response from Raven takes time to return to the caller (again, maybe the OS firewall, Nginx, or Twitcher buffering the response)

Ping me if you need anything.
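To help narrow down which of the layers above adds the delay, one option is to time the same lightweight WPS request through the full proxy chain and directly against Raven. A rough sketch only: the direct port 8096 is the one used for the direct-hit test later in this thread, and a GetCapabilities call only measures proxy overhead, not the slow Execute itself.

```
# Rough sketch: compare response times through Nginx + Twitcher vs. hitting Raven directly.
curl -s -o /dev/null -w "via twitcher: %{time_total}s\n" \
  "https://pavics.ouranos.ca/twitcher/ows/proxy/raven/wps?service=WPS&request=GetCapabilities"
curl -s -o /dev/null -w "direct raven: %{time_total}s\n" \
  "http://pavics.ouranos.ca:8096/?service=WPS&request=GetCapabilities"
```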

@moulab88
Collaborator

moulab88 commented Feb 4, 2021

The OS firewall is disabled. I will just check whether there are dropped packets on the connections.

@moulab88
Collaborator

moulab88 commented Feb 4, 2021

We have better latency on the local disks (/ or /home) than on the direct-attached RAID5 disks, which have better throughput:

To increase container I/O performance, maybe it is better to move all the Docker directories to the local disks.

@tlvu
Collaborator Author

tlvu commented Feb 4, 2021

We have better latency on the local disks (/ or /home) than on the direct-attached RAID5 disks, which have better throughput:

To increase container I/O performance, maybe it is better to move all the Docker directories to the local disks.

Sensible suggestion. The migration will take time; we have lots of images and data.

Also, there might not be enough space: / has 38G free, for sure not enough; /home has 114G free, probably not enough, and for sure not future-proof.

Given how complicated the migration would be, we should probably find a way to test the idea without doing the full migration.

@moulab88
Collaborator

moulab88 commented Feb 4, 2021

Forgot to mention /var (265 GB free); this space was reserved for this need.

@tlvu
Collaborator Author

tlvu commented Feb 4, 2021

We should also be using overlay2 as the storage driver for Docker, see https://docs.docker.com/storage/storagedriver/overlayfs-driver/

Currently, that is not the case:

```
$ docker info
 Storage Driver: overlay
  Backing Filesystem: xfs
  Supports d_type: true
 Docker Root Dir: /pvcs1/var-lib/docker
WARNING: the overlay storage-driver is deprecated, and will be removed in a future release.
```
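For reference, the storage driver is set via `/etc/docker/daemon.json`. A minimal sketch, assuming Docker is managed by systemd; the `data-root` value below is only an illustration of moving it off /pvcs1, and switching drivers hides existing images/containers until they are migrated or re-pulled.

```
# Sketch only: changing the storage driver makes existing images/containers
# invisible until re-pulled or migrated; plan the cleanup/migration first.
sudo systemctl stop docker
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "overlay2",
  "data-root": "/var/lib/docker"
}
EOF
sudo systemctl start docker
docker info | grep -A1 "Storage Driver"
```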

@tlvu
Collaborator Author

tlvu commented Feb 4, 2021

```
$ sudo du -sh /pvcs1/var-lib/docker/
478G    /pvcs1/var-lib/docker/
```

There is cleanup to do, and we'll need an even bigger partition!
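The usual Docker cleanup commands would be the starting point; whether each one is safe to run on Boreas depends on what must be kept:

```
docker system df        # breakdown of space used by images, containers, volumes, build cache
docker image prune -a   # remove images not referenced by any container (asks for confirmation)
docker system prune     # also remove stopped containers, dangling images and unused networks
```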

@moulab88
Collaborator

moulab88 commented Feb 4, 2021

Oh! It would be a good candidate for a new SSD/NVMe disk/partition.

@tlogan2000
Collaborator

With Raven, is there an issue with Progress=True or async mode like we have seen for Finch?

@tlvu
Collaborator Author

tlvu commented Feb 10, 2021

With Raven, is there an issue with Progress=True or async mode like we have seen for Finch?

This is an excellent question. There should be Raven notebooks using async mode, since issue Ouranosinc/raven#353 exists, but I have not seen the same "queue not cleared" problem with Raven! This is very odd: why does Finch have the problem and not Raven?

@huard
Collaborator

huard commented Feb 10, 2021

I don't know if we should read too much into that. Maybe Raven has not been exercised as intensively as Finch.
One difference I see is that Finch uses Sentry. Maybe it has an unhealthy relationship with PyWPS.

tlvu added a commit that referenced this pull request Feb 10, 2021
…uld-be-configurable

proxy: proxy_read_timeout config should be configurable

We have a performance problem with the production deployment at Ouranos, so we need a longer timeout. Being an Ouranos-specific need, it should not be hardcoded as in the previous PR #122.

The previous increase was sometimes not enough!

The value is now configurable via `env.local`, like most other customizations. Documentation updated.

Timeout in Prod:
```
WPS_URL=https://pavics.ouranos.ca/twitcher/ows/proxy/raven/wps FINCH_WPS_URL=https://pavics.ouranos.ca/twitcher/ows/proxy/finch/wps FLYINGPIGEON_WPS_URL=https://pavics.ouranos.ca/twitcher/ows/proxy/flyingpigeon/wps pytest --nbval-lax --verbose docs/source/notebooks/Running_HMETS_with_CANOPEX_dataset.ipynb --sanitize-with docs/source/output-sanitize.cfg --ignore docs/source/notebooks/.ipynb_checkpoints

HTTPError: 504 Server Error: Gateway Time-out for url: https://pavics.ouranos.ca/twitcher/ows/proxy/raven/wps

===================================================== 11 failed, 4 passed, 1 warning in 249.80s (0:04:09) ===========================================
```

Pass easily on my test VM with very modest hardware (10G ram, 2 cpu):
```
WPS_URL=https://lvupavicsmaster.ouranos.ca/twitcher/ows/proxy/raven/wps FINCH_WPS_URL=https://lvupavicsmaster.ouranos.ca/twitcher/ows/proxy/finch/wps FLYINGPIGEON_WPS_URL=https://lvupavicsmaster.ouranos.ca/twitcher/ows/proxy/flyingpigeon/wps pytest --nbval-lax --verbose docs/source/notebooks/Running_HMETS_with_CANOPEX_dataset.ipynb --sanitize-with docs/source/output-sanitize.cfg --ignore docs/source/notebooks/.ipynb_checkpoints

=========================================================== 15 passed, 1 warning in 33.84s ===========================================================
```

Pass against Medus:
```
WPS_URL=https://medus.ouranos.ca/twitcher/ows/proxy/raven/wps FINCH_WPS_URL=https://medus.ouranos.ca/twitcher/ows/proxy/finch/wps FLYINGPIGEON_WPS_URL=https://medus.ouranos.ca/twitcher/ows/proxy/flyingpigeon/wps pytest --nbval-lax --verbose docs/source/notebooks/Running_HMETS_with_CANOPEX_dataset.ipynb --sanitize-with docs/source/output-sanitize.cfg --ignore docs/source/notebooks/.ipynb_checkpoints

============================================== 15 passed, 1 warning in 42.44s =======================================================
```

Pass against `hirondelle.crim.ca`:
```
WPS_URL=https://hirondelle.crim.ca/twitcher/ows/proxy/raven/wps FINCH_WPS_URL=https://hirondelle.crim.ca/twitcher/ows/proxy/finch/wps FLYINGPIGEON_WPS_URL=https://hirondelle.crim.ca/twitcher/ows/proxy/flyingpigeon/wps pytest --nbval-lax --verbose docs/source/notebooks/Running_HMETS_with_CANOPEX_dataset.ipynb --sanitize-with docs/source/output-sanitize.cfg --ignore docs/source/notebooks/.ipynb_checkpoints

=============================================== 15 passed, 1 warning in 35.61s ===============================================
```

For comparison, a run on Prod without Twitcher (PR bird-house/birdhouse-deploy-ouranos#5):
```
WPS_URL=https://pavics.ouranos.ca/raven/wps FINCH_WPS_URL=https://pavics.ouranos.ca/twitcher/ows/proxy/finch/wps FLYINGPIGEON_WPS_URL=https://pavics.ouranos.ca/twitcher/ows/proxy/flyingpigeon/wps pytest --nbval-lax --verbose docs/source/notebooks/Running_HMETS_with_CANOPEX_dataset.ipynb --sanitize-with docs/source/output-sanitize.cfg --ignore docs/source/notebooks/.ipynb_checkpoints

HTTPError: 504 Server Error: Gateway Time-out for url: https://pavics.ouranos.ca/raven/wps

================================================ 11 failed, 4 passed, 1 warning in 248.99s (0:04:08) =================================================
```

A run on Prod without Twitcher and Nginx (hitting Raven directly):
```
WPS_URL=http://pavics.ouranos.ca:8096/ FINCH_WPS_URL=https://pavics.ouranos.ca/twitcher/ows/proxy/finch/wps FLYINGPIGEON_WPS_URL=https://pavics.ouranos.ca/twitcher/ows/proxy/flyingpigeon/wps pytest --nbval-lax --verbose docs/source/notebooks/Running_HMETS_with_CANOPEX_dataset.ipynb --sanitize-with docs/source/output-sanitize.cfg --ignore docs/source/notebooks/.ipynb_checkpoints

===================================================== 15 passed, 1 warning in 218.46s (0:03:38) ======================================================
```
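
As an illustration of the `env.local` mechanism described in that commit message, the override would look something like the line below; the variable name here is hypothetical, see the documentation added in that PR for the actual one.

```
# In env.local -- variable name is hypothetical, check the birdhouse-deploy docs for the real one
export PROXY_READ_TIMEOUT="240s"
```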
@tlvu
Collaborator Author

tlvu commented Feb 19, 2021

@moulab88 Performance seems to have improved on prod Boreas; did you change something? See the test results in Ouranosinc/PAVICS-e2e-workflow-tests#61.

@moulab88
Collaborator

I did nothing on my side.
