Openstack Plugin has performance issue #27423
Thank you for your very detailed analysis. There are definitely some improvements to be made here. Also, most of these improvements should actually be made at the shade level rather than in the Ansible modules. First of all, the list/filter approach is not an accident; it is in place for a reason.

For some (but not all) of the resources you are mentioning there exists the ability to filter on the server side instead. The list/filter approach is the one that we know will work everywhere, and it limits shade to a single list call for servers, ports, and floating IPs. Fixing some of these has been on the todo list for a while. To combat this, there are a few possible improvements (some similar to your suggestions):

The list+filter approach is essential for high-volume, long-lived processes. BUT, for execution profiles such as the os_server module, where different invocations run in separate short-lived processes, it can be counterproductive. An OpenStackCloud constructor parameter such as "batch_queries" would let callers choose between the two behaviors.
(I use the term "pushdown" to mean doing server-side filtering with a query parameter.) We do pushdown for some but not all resources; it has to be done on a case-by-case basis. The Cinder API docs do not mention any ability to do server-side filtering by name, although there is https://github.com/openstack/cinder/blob/master/cinder/api/v3/resource_filters.py which would be great, because it would mean we can write general code to know which filters a given cloud supports. However, that's all brand-new, so we need to limit using it to specific versions. This is also a reason that having a mechanism to support a list vs. get choice matters.
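To make the trade-off concrete, here is a toy sketch (illustrative only, not shade's actual code) contrasting "pushdown" filtering, where the server does the work via a query parameter, with the list+filter approach, where the client fetches everything and filters locally:

```python
# Illustrative sketch of pushdown vs. list+filter; endpoint/field names
# are examples, not a specific OpenStack API contract.

def pushdown_url(endpoint, name):
    # Pushdown: ask the API itself to filter, e.g. GET /volumes?name=<name>.
    # Cheap per lookup, but only works where the service supports the filter.
    return f"{endpoint}/volumes?name={name}"

def list_then_filter(volumes, name):
    # list+filter: fetch every volume with detail once, filter client-side.
    # Works on every cloud, and one list result can serve many lookups,
    # but the payload grows with the size of the tenant.
    return [v for v in volumes if v.get("name") == name]
```

The point of the thread is that neither is always right: pushdown wins for one-shot module runs, list+filter wins when one cached list serves many lookups.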
This one is very easy. One of the downsides to allowing users to pass in values as either name or id is that shade has to figure out which one it was given. When shade can tell the value is an id, it will not do a search for the flavor object but will instead use the id directly. In os_server, the flavor parameter is currently treated as a name that must be looked up. If we instead accepted either a name or an id, a user who knows they have a flavor id (or the flavor object itself) could pass it directly and skip the lookup in both cases.
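For illustration, a playbook could then pass either form. This snippet is hypothetical: the exact parameter behavior described is the proposal under discussion, not the module's documented behavior at the time of this thread.

```yaml
# Hypothetical tasks: if os_server accepted either a name or an id for
# flavor, a user who already knows the id could skip the name lookup.
- os_server:
    name: test-vm
    image: cirros
    flavor: m1.small                              # by name: requires a lookup
- os_server:
    name: test-vm-2
    image: cirros
    flavor: 0c1d9008-f546-4608-9e8f-f8bdaec8dddd  # by id: no lookup needed
```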
The code you have that tests whether the value is an id by trying to parse it as a UUID is a reasonable approach. Identifying places where we can safely use that to optimize our lookups is the next step.
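The parse-a-UUID check mentioned here can be sketched as follows (a minimal illustration; the helper name is ours, not shade's):

```python
import uuid

def looks_like_uuid(value):
    """Best-effort check for whether a value is a resource id (UUID).

    If parsing succeeds, the caller can attempt a direct GET by id and
    skip the list-and-filter name lookup entirely.
    """
    try:
        uuid.UUID(str(value))
        return True
    except (TypeError, ValueError, AttributeError):
        return False
```

A flavor name like `m1.small` fails to parse and falls back to the name lookup, while a real id short-circuits it.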
This is the trickiest bit of engineering, and one that I have an engineer looking into. There are currently three states of caching in shade. The goal is to have ALL resources use dogpile.cache, so that a user can configure caching per resource. Even if we get all of the other logic improved, having dogpile-based caching will remain important for large-scale use.
In your clouds.yaml file, you can configure per-resource caching, which will avoid the loop over flavors to get all the extra_specs. The internal calls in shade itself should all be setting get_extra to False; if any are not, that's a bug.
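A clouds.yaml caching configuration of this kind might look like the following. The key names follow os-client-config's cache section as best we can tell; the specific values and the `mycloud` entry are illustrative assumptions, not taken from this thread:

```yaml
# Illustrative only: per-resource cache settings in clouds.yaml.
cache:
  class: dogpile.cache.memory
  expiration_time: 3600      # default TTL (seconds) for cached resources
  expiration:
    flavor: 86400            # flavors rarely change; cache them for a day
    server: 5                # servers change often; keep them fresh
clouds:
  mycloud:
    auth:
      auth_url: https://keystone.example.com:5000/v3
```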
This is primarily os_server and os_server_actions, but there is a bunch of logic in the modules that duplicates what shade already does. A good instance of where this is a problem is here: https://github.com/ansible/ansible/blob/devel/lib/ansible/modules/cloud/openstack/os_server.py#L510

** Summary **

I'd love to make improvements in the areas you have highlighted. The shade team WILL get to all of these, but given various projects and priorities it may take some time.

#27443 has the "get rid of local stuff in os_server that's done better in shade" work.
@lealoncity Thanks for the detailed report! And @emonty for your detailed response. We've got one person here that might have capacity in a while to look into some of the suggested changes. As this is a community-supported set of modules, we'd really need someone in the community to dig in and provide some PRs to resolve the issues; otherwise, it's going to be a while before we can look into it.
Hi emonty and calfonso, thank you very much for your very detailed responses. Sorry for the late answer; I was away on a trip. I will relay your advice and plan to my colleagues, and then we will follow up on this issue here. Thanks again~
Sorry emonty and calfonso, this is a very important issue for an actual customer of ours. The problem actually comes from Ansible, and especially from shade, but the customer concludes that the performance of OpenStack itself is bad. This puts us in a difficult position, and we hope the issue will be fixed soon. As discussed with my colleagues, unfortunately we really do not have the bandwidth for such an effort right now, but we can exchange ideas here to support each other. We hope you can help find someone to improve the performance, and share this issue with the shade community. Because we are worried the customer will give up on Ansible and OpenStack, could you provide a fix plan for this performance issue? We could then talk with our customer to rebuild their confidence. I think that would really help Ansible's adoption in OpenStack scenarios. Thank you very much; hoping for a response~ :)
Hi emonty, why do you say the per-resource GET calls cause an extreme amount of load on the clouds? I am not clear on the reason. Do you mean that when there really are lots of volumes in a tenant, you call only one list API and then filter for all the volumes? But with lots of volumes, it looks like it will still make one list API call for each volume query. Maybe a persistent cache could limit this to a single list API call... So high-scale scenarios really need a persistent cache, or the performance will be bad. I don't know whether my understanding is right. Please be so kind as to give me a reply :)
Any new information about this issue? Thanks~
Sorry, my turn to have been delayed by a trip. :) I will write up a more detailed fix plan by tomorrow, and I'll see what we can prioritize on my end. I think there might be some ways we can do a quick hack to help your customer that we can generalize as part of the plan (but I need to write up the details to make sure).

As for the reason the individual GET calls can be bad: imagine an application that is creating and destroying 1000s of VMs in a constant/dynamic manner, but in a parallel/event-driven manner. In this case you have, for instance, 1000 different create requests that then must be polled for success. The only way to poll for success (in the general case) is to do a GET /servers/{uuid} and check the status. If you do that for each server, you potentially have 1000 independent GET calls. However, if you have a centralized cached/batched list call, you can make 1 GET call to /servers/detail every 5 seconds, and then you can fulfill each of the GET requests in the code by filtering the results of that GET. Same with ports and floating-ips, which also must be checked.

As you have noted, for more direct use this is more expensive, which is the root of your customer's problems right now. For high-scale cases using different processes for each operation, you are correct: a persistent/shared cache is required. For high-scale cases using the same process but multiple threads or coroutines, the in-memory cache works well today.

This is why there are ultimately two different items that need work for this to be FULLY solved: allowing the use of single-get instead of list for smaller cases, and improving the shared-cache story for large-scale uses of Ansible so that the Ansible modules can take advantage of the same logic that is available to shade users running long-lived processes. Both are ultimately important, but I believe there is some short-term help we can provide for your current case.
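The batched/cached list pattern described here can be sketched as follows. This is a toy illustration, not shade's implementation; `list_servers` is a stand-in for the real API call, and the class name is ours:

```python
import time

class BatchedServerPoller:
    """Serve many per-server status checks from one periodic list call.

    Instead of one GET /servers/{uuid} per pending server, a single list
    call is refreshed at most every `interval` seconds, and each status
    check filters the cached result.
    """

    def __init__(self, list_servers, interval=5.0, clock=time.monotonic):
        self._list_servers = list_servers   # callable returning list of dicts
        self._interval = interval
        self._clock = clock                 # injectable for testing
        self._cache = None
        self._fetched_at = None

    def get_server(self, server_id):
        now = self._clock()
        if self._cache is None or now - self._fetched_at >= self._interval:
            # One batched list call refreshes the whole cache.
            self._cache = {s["id"]: s for s in self._list_servers()}
            self._fetched_at = now
        return self._cache.get(server_id)
```

With 1000 servers being polled, this turns 1000 GET calls per polling round into one list call shared by all of them, which is exactly the trade-off described above.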
I'll write up a description of the specific fixes I think we need to make, and also give an estimate from my side on what making progress will look like. I have another plane flight in a few hours which will be a good opportunity to work on this.
Hi emonty, thanks for your support! I think I understand the design thinking behind the list/filter method, and as you said, we need an option to control batched-list vs. direct-get.

Another question: I tried to configure per-resource caching, but I think I used it the wrong way. I just pasted the lines into my playbook yml file, at the same level as 'task'. Do I need to install something, such as dogpile? Help please :) We are waiting for your detailed plan; then we can advise our customer to continue using Ansible rather than other tools for their daily jobs. This is worth looking forward to! Thank you very much!
Hi emonty, as you said before, "The shade team WILL get to all of these, but given various projects and ..." My colleague asked me if there is a link for this bug or patch; he may be able to help drive the shade project.
Hi emonty, could you help classify, clearly and once more, which jobs need to be done in Ansible and which in shade? We may help drive the shade project, but my colleague is not clear on what shade needs to do in detail. Thanks :)
Hi emonty, it would be best if you could file the bugs/patches in the OpenStack community; then my colleague may help push this issue forward. Thanks :)
Hi emonty, any update?
Hi emonty, any update?
Yes - we've got patches landed in shade now for adding get_by_id methods,
as well as a global constructor option to switch the default shade behavior
from list/filter to direct get. There is one more patch we need to land and
then we'll cut a new shade release, at which point we can update ansible to
set the direct get flag by default.
Thanks for your patience on this one.
Hi yes,
@lealoncity how is the performance, is this still an issue?
!component =lib/ansible/modules/cloud/openstack/os_server.py
Thank you very much for your interest in Ansible. Ansible has migrated much of the content into separate repositories to allow for more rapid, independent development. We are closing this issue/PR because this content has been moved to one or more collection repositories.
For further information, please see:
ISSUE TYPE
COMPONENT NAME
ansible/modules/cloud/openstack/os_server.py
ansible/modules/cloud/openstack/os_volume.py
ansible/modules/cloud/openstack/os_server_volume.py
ANSIBLE VERSION
CONFIGURATION
default ansible.cfg
OS / ENVIRONMENT
SUMMARY
STEPS TO REPRODUCE
ansible-playbook test.yml
EXPECTED RESULTS
1. The whole job completes within a very short time.
2. No performance difference between the OpenStack plugin and the AWS plugin.
ACTUAL RESULTS
===============================================================================
Create an instance ---------------------------------------------------- 218.52s
Attach volumes -------------------------------------------------------- 173.52s
Create System Volume -------------------------------------------------- 116.32s
Wait for 90 seconds so the instance is ready for the next task --------- 90.01s
Create static IP NIC --------------------------------------------------- 16.40s
Create Data Volume ----------------------------------------------------- 13.59s
Return configured local ipv4 -------------------------------------------- 0.02s
Create DHCP NIC --------------------------------------------------------- 0.01s
real 10m29.258s
user 0m44.327s
sys 0m6.167s
Analysis:
From the API log, we found some unreasonable parts, i.e. unreasonable API calls:
1. When we create the system volume, the playbook specifies the volume name and image id, so the plugin should use the list API with a name filter for the volume and get the image by id, rather than listing all volumes with detail and listing all images with detail.
2. API calls that list all images or all volumes cost much more time than calls with filter parameters.
3. Since the playbook already specifies the flavor name, the OpenStack plugin shouldn't get all flavors with detail and then loop to query every flavor's detail.
4. When we create the data volume, the playbook specifies the volume name, so the plugin should use the list API with a name filter rather than listing all volumes with detail; the same applies to ports, floating IPs, networks, subnets, security groups, and images.
5. After the VM is created successfully, the plugin should get details by instance id rather than looping over all resources such as ports, floating IPs, networks, security groups, images, subnets, etc.
6. The playbook specifies the image id, so the plugin should use the get-image-by-id API rather than listing all images with detail.
7. For the step of attaching a volume to an instance, there is no reason to list all resources in the tenant; from the trace, we find that this step makes many list-all calls for ports, servers, floating IPs, networks, subnets, security groups, volumes, etc.
In summary, every get_${resource} function in the OpenStack plugin has the following logic:
a. List all of the resources in the given tenant.
b. Loop over all resources to find the one the user specified.
The resource may be an image, flavor, volume, server, port, subnet, floating IP, security group, network, etc.
These functions are called very frequently during a task; for example, after the create-instance request is sent, the plugin calls get_server(server_id) repeatedly to check whether the VM status is ACTIVE.
The query functions that check resource status in the Amazon plugin are implemented quite differently from the OpenStack plugin's, and the Amazon plugin shows better performance and specialized optimizations.
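A minimal sketch of this list-and-loop pattern (illustrative only, not the plugin's actual code) shows why repeated status checks get expensive: every single lookup re-lists the whole tenant.

```python
# Toy model of the get_<resource> pattern described above. `list_all`
# stands in for a "list all resources with detail" API call.

def get_resource(list_all, name_or_id):
    # a. List all of the resources in the tenant (one big API call).
    resources = list_all()
    # b. Loop over all resources to find the one the user specified.
    for res in resources:
        if name_or_id in (res.get("id"), res.get("name")):
            return res
    return None
```

Polling one server's status N times therefore issues N full list calls, which matches the repeated "List all servers" entries in the API trace below.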
Overall:
a. The timing reported by the callback plugin is misleading: the many unreasonable API calls inflate the measured time even though every API call during the test completed normally. For example, creating the instances itself took 23s, but Ansible counted 218s in total.
For the create-instance task, the plugin makes the following API calls:
List all flavors
Get every flavors detail
List all flavors with detail
List all volumes with detail
Create Instances
List all servers
List all networks
List all subnets
List all servers to check if servers ACTIVE
List all ports
List all floatingips
List all ports
List all floatingips
List all flavors with detail
Get Server security-groups
List all images with detail
List all volumes with detail
b. The logic of the OpenStack plugin in Ansible performs poorly: when the plugin could get a resource by id, it instead fetches all resources in the tenant and then loops to find the specified one.