-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TAS deployment failing - CPI error message 'Client 'admin' exceeded request rate of 100 per second' at Creating missing stage #331
Comments
We plan to look into this; here's the Tracker story (but you can't see it because it's a private Tracker) https://www.pivotaltracker.com/story/show/182745569 |
Nice research! It looks like there were 32 VMs being deployed simultaneously (I counted 32 unique VM ids), which triggered several NSX-T API calls apiece. This is a relief in a way—we're not generating an excessive number of calls per VM; we're merely not handling the throttling properly. |
NSX-T throttles API requests when they exceed 100/sec (default configuration). Our code fails rather than retries when it receives and HTTP Status 429 ("Too many requests"). This commit addresses that by implementing an exponential backup (it backs off 1 second, tries again, then 2 seconds, until 8 seconds is reached, at which point we give up). We're confident that this will fix the problem because it's hard to trigger. In the reported case, the user had 32 VMs being deployed simultaneously, which is the max # of threads that the BOSH Director has. Each VM called the NSX-T API 3-4 times. The user had a fast Director and a fast NSX-T Manager. A 1-second backoff should have been enough, but we wanted to be safe. This commit addresses both the NSX-T Policy API and the now-deprecated NSX-T Manager API. Fixes: ``` Task 54 | 10:27:37 | Creating missing vms: backup_restore/c30ae701-5d46-47df-b19d-21bc7c077ec0 (0) (00:11:47) L Error: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-224340') ``` [Fixes #331] [#182745569] Signed-off-by: Brian Cunnie <bcunnie@vmware.com> Signed-off-by: Rajan Agaskar <ragaskar@vmware.com> Signed-off-by: Brian Upton <bupton@vmware.com>
NSX-T throttles API requests when they exceed 100/sec (default configuration). Our code fails rather than retries when it receives and HTTP Status 429 ("Too many requests"). This commit addresses that by implementing an exponential backup (it backs off 1 second, tries again, then 2 seconds, until 8 seconds is reached, at which point we give up). We're confident that this will fix the problem because it's hard to trigger. In the reported case, the user had 32 VMs being deployed simultaneously, which is the max # of threads that the BOSH Director has. Each VM called the NSX-T API 3-4 times. The user had a fast Director and a fast NSX-T Manager. A 1-second backoff should have been enough, but we wanted to be safe. This commit addresses both the NSX-T Policy API and the now-deprecated NSX-T Manager API. Fixes: ``` Task 54 | 10:27:37 | Creating missing vms: backup_restore/c30ae701-5d46-47df-b19d-21bc7c077ec0 (0) (00:11:47) L Error: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-224340') ``` [Fixes #331] [#182745569] Signed-off-by: Brian Cunnie <bcunnie@vmware.com> Signed-off-by: Rajan Agaskar <ragaskar@vmware.com> Signed-off-by: Brian Upton <bupton@vmware.com>
NSX-T throttles API requests when they exceed 100/sec (default configuration). Our code fails rather than retries when it receives and HTTP Status 429 ("Too many requests"). This commit addresses that by implementing an exponential backup (it backs off 1 second, tries again, then 2 seconds, until 8 seconds is reached, at which point we give up). We're confident that this will fix the problem because it's hard to trigger. In the reported case, the user had 32 VMs being deployed simultaneously, which is the max # of threads that the BOSH Director has. Each VM called the NSX-T API 3-4 times. The user had a fast Director and a fast NSX-T Manager. A 1-second backoff should have been enough, but we wanted to be safe. This commit addresses both the NSX-T Policy API and the now-deprecated NSX-T Manager API. Fixes: ``` Task 54 | 10:27:37 | Creating missing vms: backup_restore/c30ae701-5d46-47df-b19d-21bc7c077ec0 (0) (00:11:47) L Error: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-224340') ``` [Fixes #331] [#182745569] Signed-off-by: Brian Cunnie <bcunnie@vmware.com> Signed-off-by: Rajan Agaskar <ragaskar@vmware.com> Signed-off-by: Brian Upton <bupton@vmware.com> Signed-off-by: Kenneth Lakin <klakin@vmware.com>
Hi @cunnie - Could you please tell on which TAS version it got fixed? |
My Slack answer from 9/22:
|
Describe the bug
A clear and concise description of what the bug is.
We have NSX-T + TAS (TASforVM) 2.13.4 + NCP ENV
During TAS deployment stage (Creating VMs) we are seeing below Error:
Task 54 | 10:27:37 | Creating missing vms: backup_restore/c30ae701-5d46-47df-b19d-21bc7c077ec0 (0) (00:11:47)
L Error: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-224340')
Task 54 | 10:27:37 | Creating missing vms: diego_database/09b50aae-e7f9-4381-953d-dfbc4733576b (2) (00:11:47)
L Error: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-824664')
Task 54 | 10:27:38 | Creating missing vms: diego_cell/7e170845-64b5-4ed1-ba07-30945ea4f09e (9) (00:11:48)
L Error: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-657944')
NSX Client API rate limit: 100 requests/sec
We seeing more than 100 request per sec in between 10:25:29:78999 -> 10:25:30:55555 time period.
From 121 Request - 96 were GET calls and 25 were PUT calls
From bosh task 54 - debug logs - 96 Searching request logs
96 GET call on NSX proxy localhost logs
Looks like APIs are responding faster , which are leading too many APIs getting called in one second, and so we need to be handled in bosh upon hitting 429/503 error, bosh should retry after some backoff interval
To Reproduce
Steps to reproduce the behavior:
CPI Error Log
Can be obtained by running
bosh <env params> task <failed-task-id> --cpi
D, [2022-07-04T10:25:28.665739 #14065] DEBUG -- [req_id cpi-705561]: Running method 'FindByInventoryPath'...
I, [2022-07-04T10:25:28.710780 #14065] INFO -- [req_id cpi-705561]: Checking if dvportgroup-13168 is backed by NSXT
I, [2022-07-04T10:25:29.580548 #14065] INFO -- [req_id cpi-705561]: DVPG dvportgroup-13168 is backed by NSXT
I, [2022-07-04T10:25:29.886422 #14065] INFO -- [req_id cpi-705561]: NIC with device key 4000 is backed by NSXT DVPG
I, [2022-07-04T10:25:29.894319 #14065] INFO -- [req_id cpi-705561]: NSX-T networks found for vm 'vm-ee8010aa-6ae2-41b1-b132-e20d69723844': ["DVSwitch: 50 00 38 e5 cc 62 5d a3-a4 ef 22 74 0d c0 6b 3e"]
I, [2022-07-04T10:25:29.894470 #14065] INFO -- [req_id cpi-705561]: Searching for LogicalPorts for vm 'vm-ee8010aa-6ae2-41b1-b132-e20d69723844'
I, [2022-07-04T10:25:30.389665 #14065] INFO -- [req_id cpi-705561]: Searching VIFs with 'owner_vm_id: 5000716b-fd84-98c4-427b-1b6db4df8ad2'
I, [2022-07-04T10:25:30.500763 #14065] INFO -- [req_id cpi-705561]: Searching LogicalPorts with 'attachment_id: 8ac6a0a1-400c-43b2-924d-6bce3c20cf80'
I, [2022-07-04T10:25:31.731257 #14065] INFO -- [req_id cpi-705561]: Finished set_vm_metadata in 159.85 seconds
D, [2022-07-04T10:25:31.731767 #14065] DEBUG -- [req_id cpi-705561]: Running method 'Logout'...
E, [2022-07-04T10:25:32.140480 #8410] [create_missing_vm(router/9eb98956-b00c-4545-90b9-7629367216f6 (1)/36)] ERROR -- DirectorJobRunner: Failed to create/contact VM : #<Bosh::Clouds::ExternalCpi::UnknownError: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-705561')>
D, [2022-07-04T10:27:52.184199 #8410] [create_missing_vm(router/9eb98956-b00c-4545-90b9-7629367216f6 (1)/36)] DEBUG -- DirectorJobRunner: (0.000977s) (conn: 251400) UPDATE "tasks" SET "event_output" = ("event_output" || '{"time":1656930472,"stage":"Creating missing vms","tags":[],"total":36,"task":"router/9eb98956-b00c-4545-90b9-7629367216f6 (1)","index":15,"state":"failed","progress":100,"data":{"error":"Unknown CPI error ''Unknown'' with message ''Client ''admin'' exceeded request rate of 100 per second'' in ''set_vm_metadata'' CPI method (CPI request ID: ''cpi-705561'')"}}
E, [2022-07-04T10:27:52.228164 #8410] [] ERROR -- DirectorJobRunner: Worker thread raised exception: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-705561') - /var/vcap/data/packages/director/b210b5f1943ce1c22ea22ff8474c9f44f2b60f53/gem_home/ruby/3.0.0/gems/bosh-director-0.0.0/lib/cloud/external_cpi.rb:223:in `handle_error'
{"time":1656930472,"stage":"Creating missing vms","tags":[],"total":36,"task":"router/9eb98956-b00c-4545-90b9-7629367216f6 (1)","index":15,"state":"failed","progress":100,"data":{"error":"Unknown CPI error ''Unknown'' with message ''Client ''admin'' exceeded request rate of 100 per second'' in ''set_vm_metadata'' CPI method (CPI request ID: ''cpi-705561'')"}}
ubuntu@opsmanager-2-10:~$
Expected behavior
A clear and concise description of what you expected to happen.
TAS deployment should complete successfully.
Screenshots
If applicable, add screenshots to help explain your problem.
Release Version & Related Info (please complete the following information):
Additional context
Add any other context about the problem here.
Bosh task 54 debug log and NSX proxy-localhost logs attached here:
bosh_task.txt
localhost_logs.txt
The text was updated successfully, but these errors were encountered: