
Very slow data extraction through opendap when using twitcher #97

Closed
aulemahal opened this issue Jul 8, 2020 · 25 comments · Fixed by Ouranosinc/Magpie#388

aulemahal commented Jul 8, 2020

Working on a notebook running on boréas, I noticed that going through twitcher really slows down the process.

MWE:

import xarray as xr

def extract(url):
    ds = xr.open_dataset(url, drop_variables=['ts', 'time_vectors'],
                         chunks={'time': 256, 'lat': 16, 'lon': 16})
    ds.isel(lon=slice(0, 10), lat=slice(0, 10)).to_netcdf('test.nc', compute=True)

Test 1, using twitcher (normal URL):

%%time
baseurl = "http://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/ouranos/cb-oura-1.0_rechunk/MPI-ESM-LR/rcp85/day/MPI-ESM-LR_rcp85_allvars.ncml"
extract(baseurl)

Output:

CPU times: user 4.51 s, sys: 1.79 s, total: 6.3 s
Wall time: 3min 10s

And the CPU usage (of the notebook) is super low, around 2-3% for most of the 3 minutes. I can see that twitcher (on boréas) runs at 30-60% CPU.

Test 2 bypassing twitcher:

%%time
baseurl = "http://pavics.ouranos.ca:8083/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/ouranos/cb-oura-1.0_rechunk/MPI-ESM-LR/rcp85/day/MPI-ESM-LR_rcp85_allvars.ncml"
extract(baseurl)

Output:

CPU times: user 3.56 s, sys: 996 ms, total: 4.56 s
Wall time: 11.3 s

Also, CPU usage of the notebook stays reasonable (30-40%).

This is a simple example with a small region, but in my notebook I need to extract the same region (potentially large) from 23 different datasets. Going through twitcher means I have to wait around 1 hr in some cases, even when launching multiple (2-3) extractions in parallel...
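
For reference, a minimal stand-alone version of this comparison (a sketch only: it reuses the extract() function from the MWE above and times it with time.perf_counter instead of the notebook %%time magic):

import time

# Same NcML aggregation, reached through twitcher and via the direct port (URLs from the tests above).
urls = {
    "via twitcher": "http://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/ouranos/cb-oura-1.0_rechunk/MPI-ESM-LR/rcp85/day/MPI-ESM-LR_rcp85_allvars.ncml",
    "bypassing twitcher": "http://pavics.ouranos.ca:8083/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/ouranos/cb-oura-1.0_rechunk/MPI-ESM-LR/rcp85/day/MPI-ESM-LR_rcp85_allvars.ncml",
}

for label, url in urls.items():
    start = time.perf_counter()
    extract(url)  # extract() as defined in the MWE above
    print(f"{label}: {time.perf_counter() - start:.1f} s elapsed")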


tlvu commented Jul 9, 2020

ping @dbyrns @fmigneault here is the performance issue my colleague observed, for you to assess, thanks


dbyrns commented Jul 9, 2020

Thanks @tlvu we will take care of this.

@fmigneault

@tlvu @aulemahal
Can we get the twitcher version that was running on boreas at that time?

@fmigneault

I suspect this is the culprit, but it needs to be investigated.
https://github.com/bird-house/twitcher/pull/35/files


tlvu commented Jul 9, 2020

Can we get the twitcher version that was running on boreas at that time?

@fmigneault It's the current head of master in the birdhouse-deploy repo: https://github.com/bird-house/birdhouse-deploy/blob/775c3b392813872cb8045be473d6e4b091d52d88/birdhouse/docker-compose.yml#L310


tlvu commented Oct 14, 2020

Follow-up from @fmigneault

From memory, the fix made by David Caron was more related to caching of repetitive requests, but that does not really impact the one-shot download speed of a Thredds file as presented here.

Otherwise, the other issue about streaming the response in Twitcher seems to have been integrated a long time ago:
8022353

The only thing I can see is that chunk_size=64*1024 is still quite small.
If the service responds with e.g. 10 MB, the 64 kB chunks will really slow everything down.
I would use at least ~10 MB (i.e. 10 * 2^20).
Otherwise, we could also set chunk_size=None, which automatically uses the content size returned by the service, but there is a risk of running out of memory if it is too large.

The chunk_size is hardcoded here:

class BufferedResponse(object):
    def __init__(self, resp):
        self.resp = resp

    def __iter__(self):
        return self.resp.iter_content(64 * 1024)
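
A minimal sketch of how the chunk size could be made configurable instead of hardcoded (illustrative only; the default and the ~10 MB value come from the comment above, everything else is an assumption and not existing Twitcher code):

DEFAULT_CHUNK_SIZE = 64 * 1024  # current hardcoded value


class BufferedResponse(object):
    def __init__(self, resp, chunk_size=DEFAULT_CHUNK_SIZE):
        self.resp = resp
        # chunk_size=None lets requests yield the data in whatever sizes it arrives,
        # at the risk of very large in-memory chunks for big responses.
        self.chunk_size = chunk_size

    def __iter__(self):
        return self.resp.iter_content(self.chunk_size)


# e.g. BufferedResponse(resp, chunk_size=10 * 2**20)  # ~10 MB chunks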


fmigneault commented Oct 24, 2020

@dbyrns @tlvu
I think I got it. Seems to be because of waitress.

No code change to Twitcher. Just changed the WSGI app runner.

Stepping through the code until going into waitress, it looks like it is pre-fetching and doing some unnecessary intermediate writes before returning the content.

I didn't bother going through how gunicorn handles it, but it clearly does it better.

My test results are as follows:

  1. direct access to my fake thredds file server
  2. waitress running Twitcher with MagpieAdapter
  3. gunicorn running Twitcher with MagpieAdapter
❯ rm -f test.nc && time wget http://localhost:7000/thredds/dodsC/test.nc
--2020-10-23 20:28:26--  http://localhost:7000/thredds/dodsC/test.nc
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:7000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 996330431 (950M) [application/x-netcdf]
Saving to: ‘test.nc’

test.nc                                          100%[============================>] 950.17M   956MB/s    in 1.0s    

2020-10-23 20:28:27 (956 MB/s) - ‘test.nc’ saved [996330431/996330431]

wget http://localhost:7000/thredds/dodsC/test.nc  0.17s user 0.81s system 97% cpu 0.999 total

❯ rm -f test.nc && time wget http://localhost:8000/ows/proxy/fakeThredds/dodsC/test.nc
--2020-10-23 20:28:37--  http://localhost:8000/ows/proxy/fakeThredds/dodsC/test.nc
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 996330431 (950M) [application/x-netcdf]
Saving to: ‘test.nc’

test.nc                                          100%[============================>] 950.17M  21.5MB/s    in 44s     

2020-10-23 20:29:24 (21.5 MB/s) - ‘test.nc’ saved [996330431/996330431]

wget http://localhost:8000/ows/proxy/fakeThredds/dodsC/test.nc  0.13s user 1.15s system 2% cpu 46.741 total

❯ rm -f test.nc && time wget http://localhost:8001/ows/proxy/fakeThredds/dodsC/test.nc
--2020-10-23 20:29:28--  http://localhost:8001/ows/proxy/fakeThredds/dodsC/test.nc
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8001... connected.
HTTP request sent, awaiting response... 200 OK
Length: 996330431 (950M) [application/x-netcdf]
Saving to: ‘test.nc’

test.nc                                          100%[============================>] 950.17M   435MB/s    in 2.2s    

2020-10-23 20:29:31 (435 MB/s) - ‘test.nc’ saved [996330431/996330431]

wget http://localhost:8001/ows/proxy/fakeThredds/dodsC/test.nc  0.08s user 0.91s system 40% cpu 2.433 total


tlvu commented Oct 26, 2020

I think I got it. Seems to be because of waitress.

FYI, a benchmark of a few other WSGI servers: https://www.appdynamics.com/blog/engineering/a-performance-analysis-of-python-wsgi-servers-part-2/

@fmigneault

FYI, a benchmark of a few other WSGI servers: https://www.appdynamics.com/blog/engineering/a-performance-analysis-of-python-wsgi-servers-part-2/

I gave the highest-rated ones other than gunicorn a try. Results are for bjoern (port 8003) and meinheld (port 8002).
bjoern needed an extra dataflake.wsgi.bjoern package to support INI PasteDeploy. meinheld supports it directly using worker_class=egg:meinheld#gunicorn_worker in the config.

Obviously, we can't really see a difference for a single file-download request.
I wouldn't be surprised if, regardless of the choice between gunicorn, bjoern and meinheld, the limiting factor in our situation will always be the pyramid app and the underlying response times of the proxied services.
I think that bjoern would be worthwhile only if we were serving an application directly with a very high demand of quick requests.
I don't think it is necessary for our use case, and since we always need to install gunicorn or similar to support PasteDeploy, I would just use it directly.

At the very least, anything works better than waitress.

❯ rm -f test.nc && time wget http://localhost:8002/ows/proxy/fakeThredds/dodsC/test.nc
--2020-10-26 11:43:02--  http://localhost:8002/ows/proxy/fakeThredds/dodsC/test.nc
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8002... connected.
HTTP request sent, awaiting response... 200 OK
Length: 996330431 (950M) [application/x-netcdf]
Saving to: ‘test.nc’

test.nc                                          100%[============================>] 950.17M   393MB/s    in 2.4s    

2020-10-26 11:43:05 (393 MB/s) - ‘test.nc’ saved [996330431/996330431]

wget http://localhost:8002/ows/proxy/fakeThredds/dodsC/test.nc  0.14s user 0.89s system 39% cpu 2.633 total
❯ rm -f test.nc && time wget http://localhost:8003/ows/proxy/fakeThredds/dodsC/test.nc
--2020-10-26 11:43:09--  http://localhost:8003/ows/proxy/fakeThredds/dodsC/test.nc
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:8003... connected.
HTTP request sent, awaiting response... 200 OK
Length: 996330431 (950M) [application/x-netcdf]
Saving to: ‘test.nc’

test.nc                                          100%[============================>] 950.17M   402MB/s    in 2.4s    

2020-10-26 11:43:12 (402 MB/s) - ‘test.nc’ saved [996330431/996330431]

wget http://localhost:8003/ows/proxy/fakeThredds/dodsC/test.nc  0.12s user 0.93s system 40% cpu 2.603 total
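
For reference, a hedged sketch of the kind of PasteDeploy [server:main] section implied by the note above (all values are illustrative except the meinheld worker_class, which is quoted from the comment; egg:gunicorn#main is the standard gunicorn server entry point):

[server:main]
; gunicorn as the WSGI runner (port 8001 in the results above)
use = egg:gunicorn#main
host = 0.0.0.0
port = 8001
workers = 4

; meinheld variant (port 8002): run the meinheld worker under gunicorn
; worker_class = egg:meinheld#gunicorn_worker

; bjoern (port 8003) instead needs the extra dataflake.wsgi.bjoern package
; mentioned above to provide its own PasteDeploy server entry point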

fmigneault added a commit that referenced this issue Oct 29, 2020
Fixes #97

Replaces `waitress` with `gunicorn` by default.
`waitress` seems to do some kind of deduplication of contents, with extra reads/writes of chunks during data streaming.
Could not find any option/flag to prevent `waitress` from doing this.

Test results can be found here:
- #97 (comment)
- #97 (comment)

Leaving `waitress` in the dependencies so that pre-existing deployments can use either `gunicorn` or `waitress`.

Important note:
`gunicorn > 20` needs to be started with `pserve` to have the correct port assigned.
There is a change in how the `PasteDeploy` config is parsed, such that starting directly with `gunicorn` doesn't detect the right port.
(It doesn't matter that much in this case, since the default is 8000, which is also the one employed.)

Regardless, the Docker image was already using `pserve`, so everything should work just fine.

tlvu commented Dec 10, 2020

@fmigneault

I've tried to backport this fix to PAVICS, see commit bird-house/birdhouse-deploy@84df6c9

The fix provided no improvement at all on my test VM, see notebook https://gist.github.com/tlvu/40346480fd17deaa10915bd0010f55c6#file-test-fix-for-slow-twitcher-ipynb

I'd suggest you test this fix in a PAVICS stack itself to be sure. Maybe additional config changes are required for PAVICS.


matprov commented Feb 11, 2021

@dbyrns @fmigneault

Tested with bird-house/birdhouse-deploy@84df6c9 and saw no difference compared to master.
With a 60 GB file, passing through Twitcher takes roughly 2x longer than accessing thredds behind nginx.

https://gist.github.com/MatProv/e5065c73fddf508aeaf3bf6304231f15


tlvu commented Feb 11, 2021

https://gist.github.com/MatProv/e5065c73fddf508aeaf3bf6304231f15

Thredds behind Nginx and Twitcher: avg time: 21.201374769210815
Thredds behind Nginx:              avg time:  8.687683343887329
Thredds directly:                  avg time:  7.670507272084554

Twitcher overhead: -5%

Shouldn't this Twitcher overhead be 21.201374769210815 / 8.687683343887329 * 100 = 244% and not 5%? :D


tlvu commented Feb 11, 2021

https://gist.github.com/MatProv/e5065c73fddf508aeaf3bf6304231f15

Great notebook, ready for automation! I looked at the test result calculation (the bad 5%) and it looks good. I think the notebook output simply needs to be refreshed.


matprov commented Feb 11, 2021

Shouldn't this Twitcher overhead be 21.201374769210815 / 8.687683343887329 * 100 = 244% and not 5%? :D

@tlvu Oh no, I forgot to specify that only the first two tests were run for this test (twitcher+nginx+thredds and nginx+thredds). The difference was too obvious, so the actual assertions were useless.

So yes, the old cell outputs are a bit misleading here. That won't happen once it is part of the automated suite ;)


dbyrns commented Feb 16, 2021

Thank you @matprov, now we see that for a 60 GB file, twitcher introduces a latency of 13 s, which is not bad at all given that downloading that file over a 1 Gbps link would require at least 8 minutes.
The issue originally stated >3 min behind twitcher vs 6 s bypassing it (3000%), which is clearly not the case anymore.
@tlvu what do you think? Can we close this issue?
As due diligence, I asked @ldperron to look at this issue to see if he can spot something that could be improved quickly, but I think we have dedicated enough time to this issue.


tlvu commented Feb 16, 2021

Thank you @matprov, now we see that for a 60 GB file, twitcher introduces a latency of 13 s, which is not bad at all given that downloading that file over a 1 Gbps link would require at least 8 minutes.
The issue originally stated >3 min behind twitcher vs 6 s bypassing it (3000%), which is clearly not the case anymore.
@tlvu what do you think? Can we close this issue?

@dbyrns What? I have not seen 3000% anywhere.

But we still have a 244% overhead, see my comment #97 (comment). It's not acceptable at all for us.

Also, as you are aware, we have a Thredds optimization project with SSDs and it clearly shows on that project. The testing notebook is private since it has internal hostnames (the host where we fake the SSD with a RAM disk), but the Twitcher overhead vs Nginx alone is routinely between 300% and 400%.


dbyrns commented Feb 16, 2021

@dbyrns What? I have not seen 3000% anywhere.

My 3000% came from @aulemahal's original issue description: Test 1: Wall time: 3min 10s vs Test 2: Wall time: 11.3 s.

But we still have a 244% overhead, see my comment #97 (comment). It's not acceptable at all for us.

We can discuss that in our next meeting, but sadly we don't have the budget to achieve nginx's level of performance either. Please consider a contribution from your project if both the security and the performance are required, or maybe bypass twitcher completely in the internal hostname / RAM disk setup, as this is not the intended use case. Even the test case used by @matprov, 60 GB in 21 s, implies at least 22 Gbps of bandwidth, which is not representative of the real world.

Don't get me wrong, I agree that the performance isn't stellar, but while @ldperron is still trying to improve it, we're starting to exhaust our budget on this issue and will need to stop working on it soon.


dbyrns commented Feb 17, 2021

@tlvu we might have been on a bad lead. You said that caching should not improve performance because we are downloading one big file (a bandwidth issue), but we are checking whether, by using the opendap protocol, we actually get many small requests instead, where it is more of a latency issue.

Also, I just found out that the whole 60 GB is never downloaded in the test, so let's forget the bandwidth part of the conversation.

Hopefully light will be shed on this question soon.


dbyrns commented Feb 17, 2021

Good news! In fact it was really a caching issue.
@ldperron performed some tests and discovered that opendap makes 223 requests to thredds to complete the operation. Those requests in turn generated more than 2574 SQL queries.
The fix made by David C. in 1.6.0 allowed some caching, but the configuration was never done in birdhouse-deploy.
This is what should have been added: https://github.com/Ouranosinc/Magpie/blob/master/config/magpie.ini#L37
With a proper config, @ldperron lowered the number of SQL queries to 1784, for a total time of 14 s (compared to 21 s before).

@fmigneault this is what we have found that is not currently in the cache and could be added to further improve the performance:
SELECT resources.resource_display_name AS resources_resource_display_name, services.resource_id AS services_resource_id, resources.resource_id AS resources_resource_id, resources.parent_id AS resources_parent_id, resources.ordering AS resources_ordering, resources.resource_name AS resources_resource_name, resources.resource_type AS resources_resource_type, resources.owner_group_id AS resources_owner_group_id, resources.owner_user_id AS resources_owner_user_id, resources.root_service_id AS resources_root_service_id, services.url AS services_url, services.type AS services_type, services.sync_type AS services_sync_type FROM resources JOIN services ON services.resource_id = resources.resource_id WHERE resources.resource_name = 'thredds'
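
For context, a hedged sketch of beaker-style cache settings of the kind referenced above (region names and expiration values are illustrative; the linked magpie.ini is the authoritative reference):

cache.regions = acl, service
cache.type = memory
cache.acl.expire = 5
cache.service.expire = 60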

ldperron added a commit to bird-house/birdhouse-deploy that referenced this issue Feb 17, 2021
@fmigneault

@dbyrns
bird-house/birdhouse-deploy@7737c5a defines the "best" caching offered by magpie for the moment.

This SQL query seemed to be caused by this, with the resulting service calling the cached ACL:
https://github.com/Ouranosinc/Magpie/blob/a06e65e6122d5936e57a9914101cdff0c99b1498/magpie/adapter/magpieowssecurity.py#L40-L43

It will require a new Magpie version with added caching of this step.
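
For illustration, the general beaker pattern such a change would likely rely on (a sketch only, not Magpie's actual implementation; the region name "acl" and the helper compute_acl_from_database are assumptions):

from beaker.cache import cache_region

@cache_region("acl")  # region configured through cache.* settings like the sketch shown earlier
def get_acl_cached(service_name, user_id, permission):
    # Hypothetical helper: resolve the ACL with the usual SQL queries once;
    # identical calls within the region's expire window are served from the cache.
    return compute_acl_from_database(service_name, user_id, permission)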


tlvu commented Feb 18, 2021

@tlvu we might have been on a bad lead. You said that caching should not improve performance because we are downloading one big file (a bandwidth issue), but we are checking whether, by using the opendap protocol, we actually get many small requests instead, where it is more of a latency issue.

@dbyrns I don't remember what I said, but I most probably referred to Thredds caching and not Magpie SQL query caching. Since we repeatedly hit the same file on Thredds, this makes the Thredds cache "hot", so the size of the Thredds cache should not impact this performance test.

So I am glad we are onto something with Magpie SQL query caching.

I did not expect such tight coupling between Twitcher and Magpie.


tlvu commented Feb 18, 2021

bird-house/birdhouse-deploy@7737c5a defines the "best" caching offered by magpie for the moment.

Great! Thanks to @ldperron for the investigation and the patch.

@fmigneault Can this be merged now, and is it compatible with the current Magpie 1.7.3?

@fmigneault

It should be available; it was added in 1.6.
The above Ouranosinc/Magpie#388 adds even more caching, where the above SQL queries (#97 (comment)) are also cached, but this requires the upcoming Magpie 3.7.


dbyrns commented Feb 18, 2021

@dbyrns I don't remember what I said but I most probably referred to Thredds caching and not Magpie SQL query caching

I was referring to this: #97 (comment), and indeed the relationship between opendap, twitcher, magpie and thredds was not obvious.

So to sum it up:

  1. opendap needs a lot of requests to fetch even one file from thredds
  2. twitcher needs to grant access to each of those requests
  3. so twitcher, via its magpie adapter, makes a lot of SQL queries and finally grants access
  4. twitcher forwards the request to thredds
  5. thredds starts to warm up its own cache and returns the data
  6. twitcher streams the response back to opendap

Steps 2 and 3 cause latency, and this is what Ouranosinc/Magpie#388 and bird-house/birdhouse-deploy@7737c5a will solve.
Step 6 was made possible a long time ago by 8022353.

Thank you all, and I'm really looking forward to seeing Magpie 3.7 integrated in the stack!

fmigneault pushed a commit to bird-house/birdhouse-deploy that referenced this issue May 19, 2021
fmigneault pushed a commit to bird-house/birdhouse-deploy that referenced this issue Jun 28, 2021
fmigneault pushed a commit to bird-house/birdhouse-deploy that referenced this issue Jul 12, 2021
fmigneault pushed a commit to bird-house/birdhouse-deploy that referenced this issue Jul 14, 2021
fmigneault pushed a commit to bird-house/birdhouse-deploy that referenced this issue Aug 2, 2021
fmigneault added a commit to bird-house/birdhouse-deploy that referenced this issue Aug 2, 2021
## Overview

Applies the same changes as #174 (cache settings) but disables Twitcher caching explicitly (Magpie caching is already disabled).

This is in order to validate that the cache settings are applied and that the test suite can run successfully with the cache disabled,
to isolate whether the sporadic errors seen in #174 are related to the enabled cache.

## Changes

**Non-breaking changes**
- Added cache settings of `MagpieAdapter` through Twitcher (intended to improve response times, but disabled in this PR).
- Update Magpie/Twitcher to version 3.14.0

**Breaking changes**
- Unique user emails are enforced in Magpie 3.13.0; see the birdhouse-deploy changelog for how to find and update conflicting values before the upgrade, otherwise the upgrade will break. See [Magpie 3.13.0 change log](https://pavics-magpie.readthedocs.io/en/3.13.0/changes.html#bug-fixes) for details regarding the introduced unique email restriction.

## Related Issue / Discussion

- Undo resolution of [DAC-296 Twitcher performance issue](https://www.crim.ca/jira/browse/DAC-296) since caching is now disabled
- Undo resolution of bird-house/twitcher#97 since the cache becomes disabled
- Based on top of #174 with cache-settings fixes
- Related new stress test Ouranosinc/PAVICS-e2e-workflow-tests#74
fmigneault added a commit to bird-house/birdhouse-deploy that referenced this issue Aug 2, 2021
Bump version: 1.13.14 → 1.14.0
Following the merge of pull request #182 from bird-house/cache-settings-off


tlvu commented May 3, 2023

@ldperron performed some tests and discovered that opendap makes 223 requests to thredds to complete the operation. Those requests in turn generated more than 2574 SQL queries.

@ldperron could you let me know how to reproduce your test setup to find the result above? See the original comment #97 (comment)

The performance and intermittent problems are back again, see #123
