
TAS deployment failing - CPI error message 'Client 'admin' exceeded request rate of 100 per second' at Creating missing stage #331

Closed
Nilesh7756 opened this issue Jul 5, 2022 · 4 comments

Describe the bug

We have an NSX-T + TAS (TAS for VMs) 2.13.4 + NCP environment.

During the TAS deployment stage (Creating missing vms) we are seeing the error below:

Task 54 | 10:27:37 | Creating missing vms: backup_restore/c30ae701-5d46-47df-b19d-21bc7c077ec0 (0) (00:11:47)
L Error: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-224340')
Task 54 | 10:27:37 | Creating missing vms: diego_database/09b50aae-e7f9-4381-953d-dfbc4733576b (2) (00:11:47)
L Error: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-824664')
Task 54 | 10:27:38 | Creating missing vms: diego_cell/7e170845-64b5-4ed1-ba07-30945ea4f09e (9) (00:11:48)
L Error: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-657944')

NSX Client API rate limit: 100 requests/sec

We are seeing more than 100 requests per second in the 10:25:29:78999 -> 10:25:30:55555 time period.

root@nsx-mgr-0:/var/log/proxy# cat localhost_access_log.txt | grep "10:25:" | wc -l
121
root@nsx-mgr-0:/var/log/proxy#

Of the 121 requests, 96 were GET calls and 25 were PUT calls.

The bosh task 54 debug logs show 96 'Searching' requests:

ubuntu@opsmanager-2-10:~ cat task-54-debug.txt | grep "10:25:" | grep 'Searching' | wc -l
96
ubuntu@opsmanager-2-10:~

Of those 96 GET calls in the NSX proxy localhost logs, 7 received HTTP 429 responses:

root@nsx-mgr-0:/var/log/proxy# cat localhost_access_log.txt | grep '10:25:' | grep 'GET' | grep '" 429' | wc -l
7
root@nsx-mgr-0:/var/log/proxy# cat localhost_access_log.txt | grep '10:25:' | grep 'GET' | grep -v '" 429' | wc -l
89
root@nsx-mgr-0:/var/log/proxy#
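
As a sanity check on the minute-granularity grep counts above, the same analysis can be bucketed per second programmatically. A minimal Ruby sketch; the helper name `over_limit_seconds` is illustrative, and the timestamp regex assumes ISO-style timestamps such as those in the CPI debug log:

```ruby
# Hypothetical per-second request counter (not part of the CPI).
# The regex assumes ISO-style timestamps, e.g. "2022-07-04T10:25:29.58".
LIMIT = 100 # NSX client API rate limit: 100 requests/sec

def over_limit_seconds(lines)
  counts = Hash.new(0)
  lines.each do |line|
    ts = line[/\d{2}:\d{2}:\d{2}/] # extract the HH:MM:SS portion
    counts[ts] += 1 if ts
  end
  counts.select { |_, n| n > LIMIT } # only the seconds that exceeded the limit
end
```

Run over the proxy access log, any second whose count exceeds 100 would confirm that the limit was hit within a single one-second window rather than spread across the minute.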

It looks like the APIs are responding quickly, which leads to too many API calls being made within a single second. This needs to be handled in BOSH: upon hitting a 429/503 error, BOSH should retry after some backoff interval.
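
The suggested retry-after-backoff behavior could be sketched as follows; `with_backoff` and `MAX_SLEEP` are hypothetical names, not the actual BOSH or CPI API:

```ruby
# Sketch of a retry-with-backoff wrapper for rate-limited API calls.
MAX_SLEEP = 8 # seconds; give up once the next backoff would exceed this

def with_backoff
  sleep_seconds = 1
  loop do
    response = yield
    # Retry only on throttling (429) or service-unavailable (503) responses
    return response unless %w[429 503].include?(response.code)
    raise "Giving up after repeated HTTP #{response.code}" if sleep_seconds > MAX_SLEEP
    sleep(sleep_seconds)
    sleep_seconds *= 2 # exponential backoff: 1, 2, 4, 8 seconds
  end
end
```

The wrapper sleeps 1, 2, 4, and 8 seconds between retries before giving up, which bounds the total extra wait to roughly 15 seconds per call.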

To Reproduce
Steps to reproduce the behavior:

  1. Deploy NSX-T + Bosh + TAS 2.13.4 ENV
  2. During TAS tile deployment we saw this issue.

CPI Error Log
Can be obtained by running bosh <env params> task <failed-task-id> --cpi

VM - vm-ee8010aa-6ae2-41b1-b132-e20d69723844

D, [2022-07-04T10:25:28.665739 #14065] DEBUG -- [req_id cpi-705561]: Running method 'FindByInventoryPath'...
I, [2022-07-04T10:25:28.710780 #14065] INFO -- [req_id cpi-705561]: Checking if dvportgroup-13168 is backed by NSXT
I, [2022-07-04T10:25:29.580548 #14065] INFO -- [req_id cpi-705561]: DVPG dvportgroup-13168 is backed by NSXT
I, [2022-07-04T10:25:29.886422 #14065] INFO -- [req_id cpi-705561]: NIC with device key 4000 is backed by NSXT DVPG
I, [2022-07-04T10:25:29.894319 #14065] INFO -- [req_id cpi-705561]: NSX-T networks found for vm 'vm-ee8010aa-6ae2-41b1-b132-e20d69723844': ["DVSwitch: 50 00 38 e5 cc 62 5d a3-a4 ef 22 74 0d c0 6b 3e"]
I, [2022-07-04T10:25:29.894470 #14065] INFO -- [req_id cpi-705561]: Searching for LogicalPorts for vm 'vm-ee8010aa-6ae2-41b1-b132-e20d69723844'
I, [2022-07-04T10:25:30.389665 #14065] INFO -- [req_id cpi-705561]: Searching VIFs with 'owner_vm_id: 5000716b-fd84-98c4-427b-1b6db4df8ad2'
I, [2022-07-04T10:25:30.500763 #14065] INFO -- [req_id cpi-705561]: Searching LogicalPorts with 'attachment_id: 8ac6a0a1-400c-43b2-924d-6bce3c20cf80'
I, [2022-07-04T10:25:31.731257 #14065] INFO -- [req_id cpi-705561]: Finished set_vm_metadata in 159.85 seconds
D, [2022-07-04T10:25:31.731767 #14065] DEBUG -- [req_id cpi-705561]: Running method 'Logout'...
E, [2022-07-04T10:25:32.140480 #8410] [create_missing_vm(router/9eb98956-b00c-4545-90b9-7629367216f6 (1)/36)] ERROR -- DirectorJobRunner: Failed to create/contact VM : #<Bosh::Clouds::ExternalCpi::UnknownError: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-705561')>
D, [2022-07-04T10:27:52.184199 #8410] [create_missing_vm(router/9eb98956-b00c-4545-90b9-7629367216f6 (1)/36)] DEBUG -- DirectorJobRunner: (0.000977s) (conn: 251400) UPDATE "tasks" SET "event_output" = ("event_output" || '{"time":1656930472,"stage":"Creating missing vms","tags":[],"total":36,"task":"router/9eb98956-b00c-4545-90b9-7629367216f6 (1)","index":15,"state":"failed","progress":100,"data":{"error":"Unknown CPI error ''Unknown'' with message ''Client ''admin'' exceeded request rate of 100 per second'' in ''set_vm_metadata'' CPI method (CPI request ID: ''cpi-705561'')"}}
E, [2022-07-04T10:27:52.228164 #8410] [] ERROR -- DirectorJobRunner: Worker thread raised exception: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-705561') - /var/vcap/data/packages/director/b210b5f1943ce1c22ea22ff8474c9f44f2b60f53/gem_home/ruby/3.0.0/gems/bosh-director-0.0.0/lib/cloud/external_cpi.rb:223:in `handle_error'
{"time":1656930472,"stage":"Creating missing vms","tags":[],"total":36,"task":"router/9eb98956-b00c-4545-90b9-7629367216f6 (1)","index":15,"state":"failed","progress":100,"data":{"error":"Unknown CPI error ''Unknown'' with message ''Client ''admin'' exceeded request rate of 100 per second'' in ''set_vm_metadata'' CPI method (CPI request ID: ''cpi-705561'')"}}
ubuntu@opsmanager-2-10:~$

Expected behavior
TAS deployment should complete successfully.


Release Version & Related Info (please complete the following information):

  • CPI Version
  • BOSH Director Version 2.10.19-build.314
  • Stemcell Name & Version bosh-stemcell-621.135-vsphere-esxi-ubuntu-xenial-go_agent.tgz
  • TAS 2.13.4
  • vCenter Version: 7.0.3
  • NSX(T/V) Version (If using) NSXT

Additional context

Bosh task 54 debug log and NSX proxy-localhost logs attached here:
bosh_task.txt
localhost_logs.txt


cunnie commented Jul 22, 2022

We plan to look into this; here's the Tracker story (but you can't see it because it's a private Tracker) https://www.pivotaltracker.com/story/show/182745569


cunnie commented Aug 5, 2022

Nice research! It looks like there were 32 VMs being deployed simultaneously (I counted 32 unique VM ids), which triggered several NSX-T API calls apiece. This is a relief in a way—we're not generating an excessive number of calls per VM; we're merely not handling the throttling properly.

cunnie pushed a commit that referenced this issue Aug 15, 2022
NSX-T throttles API requests when they exceed 100/sec (default
configuration). Our code fails rather than retries when it receives an
HTTP Status 429 ("Too Many Requests").

This commit addresses that by implementing an exponential backoff (it
backs off 1 second, tries again, then 2 seconds, and so on until 8
seconds is reached, at which point we give up).

We believe this will fix the problem, though it's hard to verify because
the issue is hard to trigger. In the reported case, the user had 32 VMs
being deployed simultaneously, which is the max # of threads that the
BOSH Director has. Each VM called the NSX-T API 3-4 times. The user had
a fast Director and a fast NSX-T Manager. A 1-second backoff should have
been enough, but we wanted to be safe.

This commit addresses both the NSX-T Policy API and the now-deprecated NSX-T
Manager API.

Fixes:
```
Task 54 | 10:27:37 | Creating missing vms: backup_restore/c30ae701-5d46-47df-b19d-21bc7c077ec0 (0) (00:11:47)
L Error: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-224340')
```

[Fixes #331]
[#182745569]

Signed-off-by: Brian Cunnie <bcunnie@vmware.com>
Signed-off-by: Rajan Agaskar <ragaskar@vmware.com>
Signed-off-by: Brian Upton <bupton@vmware.com>
cunnie pushed a commit that referenced this issue Sep 6, 2022
cunnie pushed a commit that referenced this issue Sep 6, 2022
cunnie added a commit to cunnie/deployments that referenced this issue Sep 6, 2022
@cunnie cunnie closed this as completed in 04bf2c7 Sep 6, 2022
@suyash2796

Hi @cunnie - could you please tell us in which TAS version this was fixed?


cunnie commented Oct 3, 2022

My Slack answer from 9/22:

Answer: the fix is not in TAS but in Ops Manager, and it hasn’t been released yet.
Ops Manager consists of many components, one of which is the BOSH vSphere CPI Release. The BOSH vSphere CPI is the component that interfaces with NSX-T, and the newest release of the CPI, v80, not even 24 hours old, has the fix.
Most likely the next version of Ops Manager (2.10.47?) will incorporate the fix, but you can check for sure by checking the download page for Ops Manager and then checking the release notes. For example, here are the release notes for Ops Manager 2.10.46, and you can see that it’s using the older version of the vSphere CPI (v79), so it doesn’t have the fix.
