
TAS deployment failing - CPI error message 'Client 'admin' exceeded request rate of 100 per second' at Creating missing stage #331

Closed
Nilesh7756 opened this issue Jul 5, 2022 · 4 comments

Describe the bug

We have an NSX-T + TAS (TAS for VMs) 2.13.4 + NCP environment.

During the TAS deployment stage (Creating missing vms) we are seeing the error below:

Task 54 | 10:27:37 | Creating missing vms: backup_restore/c30ae701-5d46-47df-b19d-21bc7c077ec0 (0) (00:11:47)
L Error: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-224340')
Task 54 | 10:27:37 | Creating missing vms: diego_database/09b50aae-e7f9-4381-953d-dfbc4733576b (2) (00:11:47)
L Error: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-824664')
Task 54 | 10:27:38 | Creating missing vms: diego_cell/7e170845-64b5-4ed1-ba07-30945ea4f09e (9) (00:11:48)
L Error: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-657944')

NSX Client API rate limit: 100 requests/sec

We are seeing more than 100 requests per second in the 10:25:29:78999 -> 10:25:30:55555 time period.

root@nsx-mgr-0:/var/log/proxy# cat localhost_access_log.txt | grep "10:25:" | wc -l
121
root@nsx-mgr-0:/var/log/proxy#

Of the 121 requests, 96 were GET calls and 25 were PUT calls.

The bosh task 54 debug logs show 96 'Searching' requests:

ubuntu@opsmanager-2-10:~ cat task-54-debug.txt | grep "10:25:" | grep 'Searching' | wc -l
96
ubuntu@opsmanager-2-10:~

Of those 96 GET calls in the NSX proxy localhost logs, 7 received HTTP 429 responses:

root@nsx-mgr-0:/var/log/proxy# cat localhost_access_log.txt | grep '10:25:' | grep 'GET' | grep '" 429' | wc -l
7
root@nsx-mgr-0:/var/log/proxy# cat localhost_access_log.txt | grep '10:25:' | grep 'GET' | grep -v '" 429' | wc -l
89
root@nsx-mgr-0:/var/log/proxy#
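
As a sanity check on the minute-granularity grep counts above, the same analysis can be bucketed per second programmatically. A minimal Ruby sketch; the helper name `over_limit_seconds` is illustrative, and the timestamp regex assumes ISO-style timestamps such as those in the CPI debug log:

```ruby
# Hypothetical per-second request counter (not part of the CPI).
# The regex assumes ISO-style timestamps, e.g. "2022-07-04T10:25:29.58".
LIMIT = 100 # NSX client API rate limit: 100 requests/sec

def over_limit_seconds(lines)
  counts = Hash.new(0)
  lines.each do |line|
    ts = line[/\d{2}:\d{2}:\d{2}/] # extract the HH:MM:SS portion
    counts[ts] += 1 if ts
  end
  counts.select { |_, n| n > LIMIT } # only the seconds that exceeded the limit
end
```

Run over the proxy access log, any second whose count exceeds 100 would confirm that the limit was hit within a single one-second window rather than spread across the minute.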

It looks like the APIs are responding quickly, which leads to too many API calls being made within a single second. This needs to be handled in BOSH: upon hitting a 429/503 error, BOSH should retry after some backoff interval.
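
The suggested retry-after-backoff behavior could be sketched as follows; `with_backoff` and `MAX_SLEEP` are hypothetical names, not the actual BOSH or CPI API:

```ruby
# Sketch of a retry-with-backoff wrapper for rate-limited API calls.
MAX_SLEEP = 8 # seconds; give up once the next backoff would exceed this

def with_backoff
  sleep_seconds = 1
  loop do
    response = yield
    # Retry only on throttling (429) or service-unavailable (503) responses
    return response unless %w[429 503].include?(response.code)
    raise "Giving up after repeated HTTP #{response.code}" if sleep_seconds > MAX_SLEEP
    sleep(sleep_seconds)
    sleep_seconds *= 2 # exponential backoff: 1, 2, 4, 8 seconds
  end
end
```

The wrapper sleeps 1, 2, 4, and 8 seconds between retries before giving up, which bounds the total extra wait to roughly 15 seconds per call.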

To Reproduce
Steps to reproduce the behavior:

  1. Deploy NSX-T + Bosh + TAS 2.13.4 ENV
  2. During TAS tile deployment we saw this issue.

CPI Error Log
Can be obtained by running bosh <env params> task <failed-task-id> --cpi

VM - vm-ee8010aa-6ae2-41b1-b132-e20d69723844

D, [2022-07-04T10:25:28.665739 #14065] DEBUG -- [req_id cpi-705561]: Running method 'FindByInventoryPath'...
I, [2022-07-04T10:25:28.710780 #14065] INFO -- [req_id cpi-705561]: Checking if dvportgroup-13168 is backed by NSXT
I, [2022-07-04T10:25:29.580548 #14065] INFO -- [req_id cpi-705561]: DVPG dvportgroup-13168 is backed by NSXT
I, [2022-07-04T10:25:29.886422 #14065] INFO -- [req_id cpi-705561]: NIC with device key 4000 is backed by NSXT DVPG
I, [2022-07-04T10:25:29.894319 #14065] INFO -- [req_id cpi-705561]: NSX-T networks found for vm 'vm-ee8010aa-6ae2-41b1-b132-e20d69723844': ["DVSwitch: 50 00 38 e5 cc 62 5d a3-a4 ef 22 74 0d c0 6b 3e"]
I, [2022-07-04T10:25:29.894470 #14065] INFO -- [req_id cpi-705561]: Searching for LogicalPorts for vm 'vm-ee8010aa-6ae2-41b1-b132-e20d69723844'
I, [2022-07-04T10:25:30.389665 #14065] INFO -- [req_id cpi-705561]: Searching VIFs with 'owner_vm_id: 5000716b-fd84-98c4-427b-1b6db4df8ad2'
I, [2022-07-04T10:25:30.500763 #14065] INFO -- [req_id cpi-705561]: Searching LogicalPorts with 'attachment_id: 8ac6a0a1-400c-43b2-924d-6bce3c20cf80'
I, [2022-07-04T10:25:31.731257 #14065] INFO -- [req_id cpi-705561]: Finished set_vm_metadata in 159.85 seconds
D, [2022-07-04T10:25:31.731767 #14065] DEBUG -- [req_id cpi-705561]: Running method 'Logout'...
E, [2022-07-04T10:25:32.140480 #8410] [create_missing_vm(router/9eb98956-b00c-4545-90b9-7629367216f6 (1)/36)] ERROR -- DirectorJobRunner: Failed to create/contact VM : #<Bosh::Clouds::ExternalCpi::UnknownError: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-705561')>
D, [2022-07-04T10:27:52.184199 #8410] [create_missing_vm(router/9eb98956-b00c-4545-90b9-7629367216f6 (1)/36)] DEBUG -- DirectorJobRunner: (0.000977s) (conn: 251400) UPDATE "tasks" SET "event_output" = ("event_output" || '{"time":1656930472,"stage":"Creating missing vms","tags":[],"total":36,"task":"router/9eb98956-b00c-4545-90b9-7629367216f6 (1)","index":15,"state":"failed","progress":100,"data":{"error":"Unknown CPI error ''Unknown'' with message ''Client ''admin'' exceeded request rate of 100 per second'' in ''set_vm_metadata'' CPI method (CPI request ID: ''cpi-705561'')"}}
E, [2022-07-04T10:27:52.228164 #8410] [] ERROR -- DirectorJobRunner: Worker thread raised exception: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-705561') - /var/vcap/data/packages/director/b210b5f1943ce1c22ea22ff8474c9f44f2b60f53/gem_home/ruby/3.0.0/gems/bosh-director-0.0.0/lib/cloud/external_cpi.rb:223:in `handle_error'
{"time":1656930472,"stage":"Creating missing vms","tags":[],"total":36,"task":"router/9eb98956-b00c-4545-90b9-7629367216f6 (1)","index":15,"state":"failed","progress":100,"data":{"error":"Unknown CPI error ''Unknown'' with message ''Client ''admin'' exceeded request rate of 100 per second'' in ''set_vm_metadata'' CPI method (CPI request ID: ''cpi-705561'')"}}
ubuntu@opsmanager-2-10:~$

Expected behavior
TAS deployment should complete successfully.


Release Version & Related Info (please complete the following information):

  • CPI Version
  • BOSH Director Version 2.10.19-build.314
  • Stemcell Name & Version bosh-stemcell-621.135-vsphere-esxi-ubuntu-xenial-go_agent.tgz
  • TAS 2.13.4
  • vCenter Version: 7.0.3
  • NSX(T/V) Version (If using) NSXT

Additional context

Bosh task 54 debug log and NSX proxy-localhost logs attached here:
bosh_task.txt
localhost_logs.txt


cunnie commented Jul 22, 2022

We plan to look into this; here's the Tracker story (but you can't see it because it's a private Tracker) https://www.pivotaltracker.com/story/show/182745569


cunnie commented Aug 5, 2022

Nice research! It looks like there were 32 VMs being deployed simultaneously (I counted 32 unique VM ids), which triggered several NSX-T API calls apiece. This is a relief in a way—we're not generating an excessive number of calls per VM; we're merely not handling the throttling properly.

cunnie pushed a commit that referenced this issue Aug 15, 2022
NSX-T throttles API requests when they exceed 100/sec (default
configuration). Our code fails rather than retries when it receives an
HTTP Status 429 ("Too Many Requests").

This commit addresses that by implementing an exponential backoff (it
backs off 1 second, tries again, then 2 seconds, and so on until 8
seconds is reached, at which point we give up).

We believe this will fix the problem, though it's hard to verify because
the issue is hard to trigger. In the reported case, the user had 32 VMs
being deployed simultaneously, which is the max # of threads that the
BOSH Director has. Each VM called the NSX-T API 3-4 times. The user had
a fast Director and a fast NSX-T Manager. A 1-second backoff should have
been enough, but we wanted to be safe.

This commit addresses both the NSX-T Policy API and the now-deprecated NSX-T
Manager API.

Fixes:
```
Task 54 | 10:27:37 | Creating missing vms: backup_restore/c30ae701-5d46-47df-b19d-21bc7c077ec0 (0) (00:11:47)
L Error: Unknown CPI error 'Unknown' with message 'Client 'admin' exceeded request rate of 100 per second' in 'set_vm_metadata' CPI method (CPI request ID: 'cpi-224340')
```

[Fixes #331]
[#182745569]

Signed-off-by: Brian Cunnie <bcunnie@vmware.com>
Signed-off-by: Rajan Agaskar <ragaskar@vmware.com>
Signed-off-by: Brian Upton <bupton@vmware.com>
cunnie pushed a commit that referenced this issue Sep 6, 2022
cunnie pushed a commit that referenced this issue Sep 6, 2022
cunnie added a commit to cunnie/deployments that referenced this issue Sep 6, 2022
@cunnie cunnie closed this as completed in 04bf2c7 Sep 6, 2022
@suyash2796

Hi @cunnie - could you please tell us in which TAS version this was fixed?


cunnie commented Oct 3, 2022

My Slack answer from 9/22:

Answer: the fix is not in TAS but in Ops Manager, and it hasn’t been released yet.
Ops Manager consists of many components, one of which is the BOSH vSphere CPI Release. The BOSH vSphere CPI is the component that interfaces with NSX-T, and the newest release of the CPI, v80, not even 24 hours old, has the fix.
Most likely the next version of Ops Manager (2.10.47?) will incorporate the fix, but you can check for sure by checking the download page for Ops Manager and then checking the release notes. For example, here are the release notes for Ops Manager 2.10.46, and you can see that it’s using the older version of the vSphere CPI (v79), so it doesn’t have the fix.
