
Openstack Plugin has performance issue #27423

Closed

lealoncity opened this issue Jul 28, 2017 · 26 comments

Labels: affects_2.3, bot_closed, bug, cloud, collection, module, openstack, performance, support:community

Comments
lealoncity commented Jul 28, 2017

ISSUE TYPE
  • Bug Report
COMPONENT NAME

ansible/modules/cloud/openstack/os_server.py
ansible/modules/cloud/openstack/os_volume.py
ansible/modules/cloud/openstack/os_server_volume.py

ANSIBLE VERSION
2.3.1.0
CONFIGURATION

default ansible.cfg

OS / ENVIRONMENT
  1. Ansible installed on Ubuntu 14.04
  2. Test public cloud based on OpenStack
SUMMARY
STEPS TO REPRODUCE

ansible-playbook test.yml

test.yml
---
- name: Create an instance
  hosts: localhost
  gather_facts: false

  vars:
    ipv4_address: 10.32.15.150

  tasks:
    - name: Create System Volume
      os_volume:
        state:                      present
        availability_zone:          "AZ1"
        size:                       "8"
        wait:                       yes                                 
        image:                      "Standard_CentOS_7.3_latest"
        volume_type:                "SATA"
        display_name:               "ops-dummy-0001-volume-system"  
     
    - name: Create Data Volume
      os_volume:
        state:                      present
        availability_zone:          "AZ1"
        size:                       "50"
        wait:                       yes                              
        volume_type:                "SSD"
        display_name:               "ops-dummy-0001-volume-data" 

    - name: Create static IP NIC
      os_port:
        state:                      present
        name:                       "ops-dummy-0001-port1"
        admin_state_up:             true
        network:                    "76cc4d0c-18b8-433d-9dc8-a26b5c3f6508"  
        security_groups:            "ops-sg-tier1"
        fixed_ips:
          - ip_address: "10.32.15.150"
      when: ipv4_address is defined
      
    - name: Create DHCP NIC
      os_port:
        state: present
        name:                       "ops-dummy-0001-port1"
        admin_state_up:             true
        network:                    "76cc4d0c-18b8-433d-9dc8-a26b5c3f6508"
        security_groups:            "ops-sg-tier1"
      when: ipv4_address is not defined

    - name: Create an instance
      os_server:
        state:                      present
        name:                       "ops-dummy-0001"                    
        key_name:                   "ops"
        boot_volume:                "ops-dummy-0001-volume-system"      
        timeout:                    200                                     
        flavor:                     "c1.medium"                             
        availability_zone:          "AZ1"
        auto_ip:                    false                                   
        reuse_ips:                  false # for now
        nics:
         - port-name:               "ops-dummy-0001-port1"              
      register: osvm                                                       
      
    - name: Attach volumes
      os_server_volume:
        state:                      present
        server:                     "ops-dummy-0001"
        volume:                     "ops-dummy-0001-volume-data"
        device:                     "/dev/xvdb"

    - name: Return configured local ipv4
      set_fact:
        instance_ipv4: "{{ osvm.openstack.private_v4}}"

    - name: Wait for 90 seconds so the instance is ready for the next task
      pause:
        seconds: 90
EXPECTED RESULTS

1. The whole job completes within a very short time.
2. No significant performance difference between the OpenStack modules and the AWS modules.

ACTUAL RESULTS

===============================================================================
Create an instance ---------------------------------------------------- 218.52s
Attach volumes -------------------------------------------------------- 173.52s
Create System Volume -------------------------------------------------- 116.32s
Wait for 90 seconds so the instance is ready for the next task --------- 90.01s
Create static IP NIC --------------------------------------------------- 16.40s
Create Data Volume ----------------------------------------------------- 13.59s
Return configured local ipv4 -------------------------------------------- 0.02s
Create DHCP NIC --------------------------------------------------------- 0.01s
real 10m29.258s
user 0m44.327s
sys 0m6.167s

Analysis:

From the API log we found some unreasonable parts, i.e. unreasonable API calls:

1. When creating the system volume, the playbook specifies the volume name and the image ID, so the module should use the list API with a filter for the volume and get the image by ID, rather than listing all volumes with detail and all images with detail.

2. Listing all images or all volumes costs much more time than calling the API with filter parameters.

3. Since the playbook already specifies the flavor name, the OpenStack module should not get all flavors with detail and then loop to query every flavor's detail.

4. When creating the data volume, the playbook specifies the volume name, so the module should use the list API with a filter for the volume rather than listing all volumes with detail. The same applies to ports, floating IPs, networks, subnets, security groups and images.

5. After the VM is created successfully, the module should get the detail by instance ID rather than looping over all of the resources, such as ports, floating IPs, networks, security groups, images, subnets, etc.

6. The playbook specifies the image ID, so the module should use the get-image-by-ID API rather than listing all images with detail.

7. For the step that attaches the volume to the instance, there is no reason to list all of the tenant's resources, yet the trace shows that this step makes many list-all calls for ports, servers, floating IPs, networks, subnets, security groups, volumes, etc.

In summary, every get_${resource} function in the OpenStack modules follows the logic below (see the sketch after this paragraph):
a. List all of the resources for the given tenant.
b. Loop over all the resources to find the one the user specified.
The resource may be an image, flavor, volume, server, port, subnet, floating IP, security group, network, etc.

These functions are called very frequently during a task. For example, after the create-instance request is sent, the module calls the get_server(server_id) function repeatedly to check whether the VM status is ACTIVE.

The query functions that check resource status in the Amazon modules are implemented differently, and the Amazon modules show better performance and specialized optimization.
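For illustration, a minimal sketch of that pattern in Python (a hypothetical helper written for this report, not the actual shade code), assuming a shade-like cloud object:

# Hypothetical illustration of the list-then-filter logic described above;
# not the actual shade implementation.
def get_volume(cloud, name_or_id):
    # a. List every volume in the tenant (one "list all with detail" call).
    all_volumes = cloud.list_volumes()
    # b. Loop client-side to find the one the user specified.
    for volume in all_volumes:
        if volume.get('id') == name_or_id or volume.get('name') == name_or_id:
            return volume
    return None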

In conclusion:
a. The time reported by the callback plugin is not a fair measure: it includes many unreasonable API calls that account for most of the test time, and we confirmed that all of the API calls during the test completed normally. For example, creating the instance itself took 23s on the cloud side, but Ansible counted 218s in total.
For the create-instance task, the trace includes the following API calls:

List all flavors
Get every flavors detail
List all flavors with detail
List all volumes with detail
Create Instances
List all servers
List all networks
List all subnets
List all servers to check if servers ACTIVE
List all ports
List all floatingips
List all ports
List all floatingips
List all flavors with detail
Get Server security-groups
List all images with detail
List all volumes with detail
b. The logic of the OpenStack modules in Ansible performs poorly: where a module could get a resource by ID, it instead gets all of the tenant's resources and then loops to find the specified one.

We used a temporary workaround that improves performance a little, but it is really not the best way; we hope Ansible can fix this issue properly. Thanks~

Our steps are as follows:

1. Modify these three modules

ansible/modules/cloud/openstack/os_server.py
ansible/modules/cloud/openstack/os_volume.py
ansible/modules/cloud/openstack/os_server_volume.py

a. Edit os_server.py and add the lines below just before the line containing “cloud = shade.openstack_cloud(**cloud_params)”:

from shade import os_monkey_patch
os_monkey_patch.monkey_patch_os_server()

b. Edit os_volume.py and add the lines below just before the line containing “cloud = shade.openstack_cloud(**module.params)”:

from shade import os_monkey_patch
os_monkey_patch.monkey_patch_os_volume()

c. Edit os_server_volume.py and add the lines below just before the line containing “cloud = shade.openstack_cloud(**module.params)”:

from shade import os_monkey_patch
os_monkey_patch.monkey_patch_os_server_volume()


2. Create a new file called os_monkey_patch.py at:
python2.7/dist-packages/shade/os_monkey_patch.py

The file is as follows:
import shade
import uuid
import keystoneauth1.exceptions
import warnings
from urllib import urlencode

from shade import _utils
from shade import meta
from shade.openstackcloud import OpenStackCloud


# Patch the os_server module in ansible.modules.cloud.openstack
def patch_list_flavors(self, get_extra=None, filters=None):
    if get_extra is None:
        get_extra = self._extra_config['get_flavor_extra_specs']
    flavors = self._normalize_flavors(
        self._compute_client.get(
            '/flavors/detail', params=dict(is_public='None'),
            error_message="Error fetching flavor list"))

    filters_values = filters.values() if filters else []
    for flavor in flavors:
        if flavor.id not in filters_values and flavor.name not in filters_values:
            continue
        if not flavor.extra_specs and get_extra:
            endpoint = "/flavors/{id}/os-extra_specs".format(
                id=flavor.id)
            try:
                flavor.extra_specs = self._compute_client.get(
                    endpoint,
                    error_message="Error fetching flavor extra specs")
            except shade.exc.OpenStackCloudHTTPError as e:
                flavor.extra_specs = {}
                self.log.debug(
                    'Fetching extra specs for flavor failed:'
                    ' %(msg)s', {'msg': str(e)})
    return flavors


def patch_search_flavors(self, name_or_id=None, filters=None, get_extra=True):
    flavors = self.list_flavors(get_extra=get_extra, filters={'name_or_id': name_or_id})
    return _utils._filter_list(flavors, name_or_id, filters)


def _no_pending_images(images):
    """If there are any images not in a steady state, don't cache"""
    for image in images:
        if image.status not in ('active', 'deleted', 'killed'):
            return False
    return True


@_utils.cache_on_arguments(should_cache_fn=_no_pending_images)
def patch_list_images(self, filter_deleted=True, filters=None):
    images = []
    image_list = []
    try:
        if self.cloud_config.get_api_version('image') == '2':
            endpoint = '/images'
        else:
            endpoint = '/images/detail'
        params = {}
        try:
            uuid.UUID(filters['name_or_id'])
            params['id'] = filters['name_or_id']
        except ValueError:
            params['name'] = filters['name_or_id']
        if params:
            response = self._image_client.get(endpoint + "?" + urlencode(params))
            if len(response['images']) == 0 and 'id' in params:
                params['name'] = params.pop('id')
                response = self._image_client.get(endpoint + "?" + urlencode(params))
        else:
            response = self._image_client.get(endpoint)

    except keystoneauth1.exceptions.catalog.EndpointNotFound:
        response = self._compute_client.get('/images/detail')
    while 'next' in response:
        image_list.extend(meta.obj_list_to_munch(response['images']))
        endpoint = response['next']
        response = self._raw_image_client.get(endpoint)
    if 'images' in response:
        image_list.extend(meta.obj_list_to_munch(response['images']))
    else:
        image_list.extend(response)

    for image in image_list:
        if not filter_deleted:
            images.append(image)
        elif image.status.lower() != 'deleted':
            images.append(image)
    return self._normalize_images(images)


def patch_search_images(self, name_or_id=None, filters=None):
    images = self.list_images(filters={'name_or_id': name_or_id})
    return _utils._filter_list(images, name_or_id, filters)


def _no_pending_volumes(volumes):
    for volume in volumes:
        if volume['status'] not in ('available', 'error', 'in-use'):
            return False
    return True


def patch_search_volumes(self, name_or_id=None, filters=None):
    volumes = self.list_volumes(filters={'name_or_id': name_or_id})
    return _utils._filter_list(
        volumes, name_or_id, filters)


@_utils.cache_on_arguments(should_cache_fn=_no_pending_volumes)
def patch_list_volumes(self, cache=True, filters=None):
    def _list(data):
        volumes.extend(data.get('volumes', []))
        endpoint = None
        for l in data.get('volumes_links', []):
            if 'rel' in l and 'next' == l['rel']:
                endpoint = l['href']
                break
        if endpoint:
            try:
                _list(self._volume_client.get(endpoint))
            except shade.exc.OpenStackCloudURINotFound:
                self.log.debug(
                    "While listing volumes, could not find next link"
                    " {link}.".format(link=data))
                raise

    if not cache:
        warnings.warn('cache argument to list_volumes is deprecated. Use '
                      'invalidate instead.')
    attempts = 5
    for _ in range(attempts):
        volumes = []
        params = {}
        if not filters:
            data = self._volume_client.get('/volumes/detail')
        else:
            try:
                uuid.UUID(filters['name_or_id'])
                params['id'] = filters['name_or_id']
            except ValueError:
                params['name'] = filters['name_or_id']
            if params:
                data = self._volume_client.get('/volumes/detail', params=params)
                if len(data.get('volumes', [])) == 0 and 'id' in params:
                    params['name'] = params.pop('id')
                    data = self._volume_client.get('/volumes/detail', params=params)
            else:
                data = self._volume_client.get('/volumes/detail')

        if 'volumes_links' not in data:
            volumes.extend(data.get('volumes', []))
            break
        try:
            _list(data)
            break
        except shade.exc.OpenStackCloudURINotFound:
            pass
    else:
        self.log.debug(
            "List volumes failed to retrieve all volumes after"
            " {attempts} attempts. Returning what we found.".format(
                attempts=attempts))
    return self._normalize_volumes(
        meta.get_and_munchify(key=None, data=volumes))


def monkey_patch_os_server():
    setattr(OpenStackCloud, 'list_flavors', patch_list_flavors)
    setattr(OpenStackCloud, 'search_flavors', patch_search_flavors)
    setattr(OpenStackCloud, 'list_images', patch_list_images)
    setattr(OpenStackCloud, 'search_images', patch_search_images)
    setattr(OpenStackCloud, 'list_volumes', patch_list_volumes)
    setattr(OpenStackCloud, 'search_volumes', patch_search_volumes)


def monkey_patch_os_volume():
    setattr(OpenStackCloud, 'list_images', patch_list_images)
    setattr(OpenStackCloud, 'search_images', patch_search_images)
    setattr(OpenStackCloud, 'list_volumes', patch_list_volumes)
    setattr(OpenStackCloud, 'search_volumes', patch_search_volumes)


def monkey_patch_os_server_volume():
    setattr(OpenStackCloud, 'list_volumes', patch_list_volumes)
    setattr(OpenStackCloud, 'search_volumes', patch_search_volumes)



@ansibot added the affects_2.3, bug_report, cloud, module, needs_triage, openstack and support:community labels on Jul 28, 2017
@s-hertel removed the needs_triage label on Jul 28, 2017
@emonty (Contributor) commented Jul 28, 2017

Thank you for your very detailed analysis. There are definitely some
improvements that can be made, but I'm going to respond to each point with more
information, and then point to places where the improvements can be made.

Also, most of these improvements should actually be made at the shade level.
The reason for that should be fairly clear given that you had to monkeypatch
shade anyway. I'll be more than happy to work with you on getting improvements
made there so that we can surface them here.

First of all - the list/filter approach is not an accident, it is in place for
two reasons:

  1. In very high scale scenarios, doing individual GET calls for single IDs
    causes an extreme amount of load for the clouds. We have actually crashed
    at least one public cloud before implementing batched and rate-limited
    list operations.

  2. For many of the resources it is not possible to know if the user is giving
    a name or an id and the server does not present an API to get by either.
    (neutron does a good job of making names and ids interchangeable, the same
    cannot be said elsewhere)

For some (but not all) of the resources you are mentioning there exists the
ability to configure per-resource caching - and it is possible to configure
that caching to use a persistent cache that spans process boundaries. If you
add the following to your clouds.yaml:

cache:
   expiration:
     server: 5
     port: 5
     floating-ip: 5
     flavor: 3600
     image: 3600
   class: dogpile.cache.dbm
   arguments:
     filename: /home/username/.cache/openstack/shade.dbm

It will limit shade to doing only a single list call for servers, ports and
floating IPs once every five seconds, so that the poll loops are less costly,
and it will cache the flavor and image lists into that dbm file and keep them
for 3600 seconds. I'll mention needed improvements to this system below.

The list/filter approach is the one that we know will work everywhere and the
one we know handles the largest cases, but it is certainly not the most
efficient and the per-resource caching implementation is not fully complete.

Fixing some of these has been on the todo list for a while, but we just spent
the last six months focused on removing the python-*client dependencies
(almost done) and getting prepared for proper microversion support so making
progress has been slower than we'd like. (If you can't tell already, I'm hoping
you'll help out! If you don't have time but making progress on this is
important for Huawei, I can raise that with a few folks there and see if
we can get some resources allocated)

To combat this, there are a few possibilities of improvements (some similar to
ones in your patch) that break down into a few major work areas:

  • Add ability for shade user to select batched-list vs. direct-get profile

The list+filter approach is essential for high-volume long-lived processes.
Coupled with a rate-limiting TaskManager it makes things work VERY well at
scales of 1000s of servers at a time.

BUT - for execution profiles such as the os_server module where different
operations cross process boundaries, it can be known that the high-volume
optimization is actually not an optimization.

If there was an OpenStackCloud constructor parameter such as "batch_queries"
(or something, name not important) that controlled whether get and search
attempted to do the smallest REST calls when they could, then we could set
that flag in all of the Ansible modules and it would allow us to make progress
on the next item:

  • Implement condition pushdown for more resources

(I use the term "pushdown" to mean doing server-side filtering with a
parameter. It's a term from MySQL storage engine implementations to distinguish
conditions that can be satisfied by communicating the constraint to the
storage engine vs. conditions that can only be satisfied at the query execution
layer)

We do pushdown for some but not all resources. It has to be done on a case by
case basis because different resources have different abilities for server-side
filtering.

The Cinder API docs do not mention any ability to do server-side filtering by
name such as you have in your patch. I checked with the Cinder team and it
seems this cycle has brought a generalized server-side resource filtering
implementation. Additionally, they've added an API that one can use to fetch
which resources can filter on which parameters:

https://github.com/openstack/cinder/blob/master/cinder/api/v3/resource_filters.py

which would be great, because it would mean we can write general code to know
if an input parameter can be used server-side or if we need to filter client
side.

However, that's all brand-new - so we need to limit using it to a specific
microversion. I've also asked the Cinder team to update the docs to indicate
which versions of Cinder support name as a server-side filter. I believe name
is available earlier than the above work - but I don't know at what point it
starts being available.

This is also a reason that having a mechanism to support a list vs. get
decision is needed. It will make it easier to degrade to list+filter in a
general way depending on what parameter the user has provided and what version
the cloud is running.
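As a rough sketch of that degrade decision (assuming a shade-style volume client with the same .get(url, params=...) interface used in the monkeypatch above; "supports_name_filter" is a hypothetical capability flag, not a real shade attribute):

# A hedged sketch only: push the name filter to the server when the cloud is
# known to support it, otherwise fall back to list + client-side filtering.
def find_volumes(volume_client, name_or_id, supports_name_filter):
    if supports_name_filter:
        # Server-side filtering ("condition pushdown").
        data = volume_client.get('/volumes/detail', params={'name': name_or_id})
        return data.get('volumes', [])
    # Degrade: list everything and filter client-side.
    data = volume_client.get('/volumes/detail')
    return [v for v in data.get('volumes', [])
            if v.get('name') == name_or_id or v.get('id') == name_or_id]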

  • Plumb the get-by-dict optimization into ansible

This one is very easy.

One of the downsides to allowing users to pass in values as either name or
id is not being able to know whether a name or an ID has been provided. At the
python layer we allow name_or_id fields to also take a dict. So if a user does:

cloud.create_server('test', flavor=dict(id='1234'))

shade will not do a search for the flavor object but instead will just directly
pass 1234 in the flavorRef parameter to the POST /servers call.

In os_server, the flavor parameter is defined as:

flavor = dict(default=None),

If we define that instead as:

flavor = dict(default=None, type='raw'),

That will allow a user that knows they have a flavor object or just an id to
do:

 - os_server:
     flavor:
       id: 1234

or:

 - os_flavor_facts:
     name: my_flavor
   register: flavor_object
 - os_server:
     flavor: flavor_object  

and skip lookups in both cases.

  • Add id detection for resources where it can work

The code you have to test for the id being an id by trying to parse a UUID
is good - but it won't work for all resources because some resources allow
arbitrary ids. (flavor, domain, project and image are all resources that
immediately spring to mind as not having guaranteed UUID ids) This is also
API version dependent. Old versions of Nova (which we still have to support)
do not require UUID ids and there exist clouds that use non-UUID.

Identifying places we can safely use that to optimize our ability to
skip looking up an id when we don't need to, and then putting it in as an
optimization is a great idea - and one that should be very easy and valuable
to get done.
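For example, a minimal sketch of that id-detection idea (get_volume_by_id is a hypothetical direct-GET helper, and the check is only safe for resources whose ids are guaranteed to be UUIDs):

import uuid

def looks_like_uuid(value):
    # True only when the value parses as a UUID, so a direct GET is safe.
    try:
        uuid.UUID(str(value))
        return True
    except ValueError:
        return False

def find_volume(cloud, name_or_id):
    if looks_like_uuid(name_or_id):
        # Skip the lookup and GET directly by id (hypothetical helper).
        return cloud.get_volume_by_id(name_or_id)
    # Otherwise degrade to the existing list + filter behaviour.
    return cloud.get_volume(name_or_id)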

  • Finish per-resource caching implementation

This is the trickiest bit of engineering, and one that I have an engineer
thinking about already. (I've told him he can't actually work on it until
we've finished RESTification because it'll be too invasive)

There are currently three states of caching in shade:

  • Resources with no caching implementation at all
  • Resources using dogpile.cache
  • Resources with a custom batching/caching implementation
    (server, port, floating-ip)

The goal is to have ALL resources use dogpile.cache, so that a user can
configure caching to as granular a level as they want. The hard part is that
we need to make sure that the per-resource dogpile.cache-based caching
properly handles the batching/rate-limiting semantics in place for the custom
resources. We wrote this once before but didn't get it all the way correct
and it broke nodepool so we reverted and haven't had a chance to diagnose the
logic error and reattempt. https://review.openstack.org/#/c/366143/ has the
un-revert - but it needs to be rebased and then thoroughly tested. The previous
problem had to do, I think, with a flaw in invalidation logic. But it might
also simply be that we had places that were internally expecting immediate
current results as part of a workflow that weren't waiting properly. (I think
we also need to add a flag for "this call requires up-to-date data" so that
if batching/rate-limiting is in place the call can block appropriately.)

Even if we get all of the other logic improved, having dogpile-based
per-resource caching will be quite nice from the Ansible module perspective.
There are many things, such as flavor lists, that just simply do not change
very frequently, and where even if they supported server-side fetch by name,
the GET /flavors?name=my-flavor is a waste of a call most of the
time. Since in-memory caching doesn't help much with Ansible being
multi-process, the ability to configure a local installation to cache to a
file, a redis or even a memcached will GREATLY improve more complex
playbooks.
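To make that concrete, a small standalone illustration (not shade code) of dogpile.cache with the dbm backend that the configuration above points at; the 'mycloud' name and the cached wrapper are assumptions made for the example:

import shade
from dogpile.cache import make_region

# A persistent, file-backed cache region: it survives across processes,
# which is what multi-process Ansible runs need.
region = make_region().configure(
    'dogpile.cache.dbm',
    expiration_time=3600,  # image/flavor lists change rarely
    arguments={'filename': '/tmp/shade-cache-example.dbm'},
)

@region.cache_on_arguments()
def list_images_cached(cloud_name):
    # The first call in any process hits the cloud; later calls within the
    # hour are served from the dbm file.
    cloud = shade.openstack_cloud(cloud=cloud_name)
    return cloud.list_images()

images = list_images_cached('mycloud')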

  • Document get_extra_specs config option and/or change shade default

In your clouds.yaml file, you can do:

 clients:
   shade:
     get_extra_specs: False

Which will avoid the loop on flavors to get all the extra_specs. It's an
unfortunate current behavior which is only on by default because it started
out that way and we didn't want to break people. I'm increasingly of the
opinion that we should flip the default because the extra calls are just silly
if you don't actually want the info.
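The Python-level equivalent, assuming a shade cloud object built from a clouds.yaml entry named 'mycloud' (the name is only for the example):

import shade

cloud = shade.openstack_cloud(cloud='mycloud')
# Skip the per-flavor os-extra_specs lookups; a single flavor list call only.
flavors = cloud.list_flavors(get_extra=False)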

The internal calls in shade itself should all be setting get_extra to False. If
there is a codepath that is doing a flavor list in the background that isn't
setting it to False, that's a bug. However, we also have another item:

  • Remove duplicate logic from os_* modules

This is primarily os_server and os_server_actions, but there is a bunch of
logic that is directly in several of the os_ modules that now exists
in shade. The os_ modules are intended to be thin mappings between ansible
input and output parameters and shade calls with the actual logic that needs
to be performed happening in shade. (Logic that is ansible-specific, such as
whether or not to make a call based on a state parameter should stay in os_
modules)

A good example of where this is a problem is here:

https://github.com/ansible/ansible/blob/devel/lib/ansible/modules/cloud/openstack/os_server.py#L510

The create_server call in shade handles this logic, handles the dict() id
optimization and also knows to set get_extra=False. Similarly, the _network_args
method should be completely removed, because the same logic is in create_server
and has been improved there. So we need to go through and do some cleanup.

In os_server_actions there are several calls to cloud.nova_client - but shade
has the calls for these things now.

Summary

I'd love to make improvements in the areas you have highlighted. I hope my
writeup of where/how we need to investigate making them is helpful. If you
are interested in working on any of these I'd be MORE than happy to work with
you and to help point you in the right direction. As I mentioned before, if
you do not have the bandwidth for such an effort but it is important for you
that they get made soon, let me know that too and I'll go make some noise to try
to find someone to work on it.

The shade team WILL get to all of these, but given various projects and
priorities for the team members it might take a little while. I'd love to be
able to get more traction on this.

@emonty (Contributor) commented Jul 28, 2017

#27443 has the "get rid of local stuff in os_server that's done better in shade"

@calfonso (Contributor)

@lealoncity Thanks for the detailed report! And @emonty for your detailed response. We've got one person here that might have capacity in a while to look into some of the suggested changes. As this is a community supported set of modules, we'd really need someone in the community to dig in and provide some PRs to resolve the issues; otherwise, it's going to be a while before we can look into it.

@calfonso added the waiting_on_contributor label on Jul 31, 2017
@ansibot removed the waiting_on_contributor label on Jul 31, 2017
@calfonso (Contributor)

needs_contributor

@ansibot added the waiting_on_contributor label on Jul 31, 2017
@lealoncity (Author)

Hi emonty and calfonso,

Thank you very much for your very detailed response. Sorry for the late answer; I was on a trip.

I will pass your advice and plan on to my colleagues, and then we will continue the discussion on this issue here.

Thanks again~

@lealoncity (Author)

Sorry emonty and calfonso,

This is a very important issue for an actual customer of ours. The issue actually comes from Ansible, and especially from shade, but the customer thinks the performance of OpenStack itself is bad. This puts us in a difficult position, and we hope the issue will be fixed soon. Thanks~

After talking with my colleagues, unfortunately we really do not have the bandwidth for such an effort in the near future, but we can exchange ideas here to support each other. We hope you can help find someone to improve the performance, and share this issue with the shade community.

Since we worry that the customer may give up on Ansible and OpenStack, could you provide the fix plan for this performance issue? We could then talk with our customer to build their confidence. I think it would really help Ansible's adoption in OpenStack scenarios.

Thank you very much, hoping for a response~ :)

@lealoncity (Author)

Hi emonty,

Could you help explain why, in very high scale scenarios, doing individual GET calls for single IDs causes an extreme amount of load on the clouds? I am not clear on the reason.

Do you mean that when there really are lots of volumes in a tenant, you call the list API only once and then filter all the volumes? But with lots of volumes it looks like it will still make one list API call for every volume query. Maybe the persistent cache can control and limit it to a single list API call... so the high scale scenario really needs the persistent cache, or the performance will be bad.

I don't know whether my understanding is right. Please be so kind as to give me a reply :)

@lealoncity (Author)

Any new information about this issue? Thanks~

@emonty (Contributor) commented Aug 7, 2017

Sorry - my turn to have been delayed by trip. :)

I will write up a more detailed fix plan by tomorrow, and I'll see what we can prioritize on my end. I think there might be some ways we can do a quick hack to help your customer that we can generalize as part of the plan. (but I need to write up the details to make sure)

As for the reason the individual GET calls can be bad -

Imagine an application that is creating and destroying 1000s of VMs in a constant/dynamic manner, but in a parallel/event-driven manner. In this case you have, for instance, 1000 different create requests that then must be polled for success. The only way to poll for success (in the general case) is to do a GET /servers/{uuid} and check the status. If you do that for each server, you potentially have 1000 independent GET calls. However, if you have a centralized cached/batched list call, you can make one GET call to /servers/detail every 5 seconds, and then you can fulfill each of the GET requests in the code by filtering the results of that list.

Same with ports and floating-ips, which also must be checked.

As you have noted, for more direct use this is more expensive, which is the root of your customer's problems right now.

For high-scale cases using different processes for each operation - you are correct, a persistent/shared cache is required. For high-scale cases using the same process but multiple threads or coroutines, in-memory cache works well today. This is why there are ultimately two different items that need work for this to be FULLY solved - allowing the use of single-get instead of list for smaller cases, and improving the shared-cache story for large-scale uses of Ansible so that the Ansible modules can take advantage of the same logic that is available to shade users running long-lived processes.
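A minimal sketch of that centralized batched-poll idea (not shade's actual TaskManager/caching code; the class and names here are invented for illustration):

import time

class BatchedServerPoller(object):
    # Serves many per-server status checks from one periodic list call.
    def __init__(self, cloud, interval=5):
        self.cloud = cloud        # a shade-like cloud object (assumed)
        self.interval = interval
        self._servers = {}
        self._last_fetch = 0.0

    def _refresh(self):
        now = time.time()
        if now - self._last_fetch >= self.interval:
            # One list of all servers instead of N GET /servers/{uuid} calls.
            self._servers = dict(
                (s['id'], s) for s in self.cloud.list_servers())
            self._last_fetch = now

    def get_server(self, server_id):
        self._refresh()
        return self._servers.get(server_id)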

Both are ultimately important, but I believe there is some short-term help we can provide for your current case.

I'll write up a description of the specific fixes I think we need to make, and also give an estimate from my side on what making progress will look like. I have another plane flight in a few hours which will be a good opportunity to work on this.

@lealoncity (Author)

Hi emonty,

Thanks for your support! I think I now understand the design thinking behind the list/filter method, and as you said, we need an option to control batched-list vs. direct-get.

Another question: I have tried to configure per-resource caching, but I think I used it in the wrong way. I just pasted the following lines into my playbook YAML file, at the same level as 'tasks'.

cache:
  expiration:
    server: 5
    port: 5
    floating-ip: 5
    flavor: 3600
    image: 3600
  class: dogpile.cache.dbm
  arguments:
    filename: /home/username/.cache/openstack/shade.dbm

Do I need to install something, such as dogpile? Please help :)

We are waiting for your detailed plan; then we can advise our customer to keep using Ansible rather than other tools to complete their daily jobs. That is worth looking forward to!

Thank you very much!

@lealoncity (Author)

Hi emonty,

As you said before, "The shade team WILL get to all of these, but given various projects and priorities for the team members it might take a little while. I'd love to be able to get more traction on this."

My colleague asked me whether there is a link for this bug or a patch. He may be able to help drive the shade project.

@lealoncity (Author)

Hi emonty,

Could you help to classify clearly, once more, which work needs to be done in Ansible and which in shade? We may help to drive the shade project, but my colleague is not very clear about what shade needs to do in detail.

Thanks :)

@lealoncity (Author)

Hi emonty,

It would be better if you could help file the bugs/patches with the OpenStack community. Then my colleague may help to push this issue forward.

Thanks :)

@lealoncity (Author)

Hi emonty,

Any update?

@lealoncity (Author)

Hi emonty,

Any update?

@ansibot (Contributor) commented Aug 31, 2017

@emonty (Contributor) commented Aug 31, 2017 via email

@lealoncity (Author)

Hi, yes.
I found this: https://bugs.launchpad.net/shade/+bug/1709577
I used the latest shade with Ansible 2.3.1.0 to test again, but no performance boost occurred. Do we also need to optimize Ansible's performance in the latest version?

@ansibot added the bug and performance labels and removed the bug_report label on Mar 1, 2018
@ansibot (Contributor) commented May 12, 2018

cc @omgjlk

@ansibot added the support:core label and removed the support:community label on Sep 17, 2018
@ansibot added the support:community label and removed the support:core label on Oct 3, 2018
@ansibot (Contributor) commented Oct 11, 2018

cc @mnaser

@mnaser (Contributor) commented Oct 13, 2018

@lealoncity how is the performance, is this still an issue?

@ansibot (Contributor) commented Nov 2, 2018

@ansibot (Contributor) commented Apr 6, 2019

cc @gtema

@ansibot added the collection, collection:openstack.cloud and needs_collection_redirect labels on Apr 29, 2020
@ansibot removed the collection:openstack.cloud and needs_collection_redirect labels on Jul 16, 2020
@sivel (Member) commented Aug 17, 2020

!component =lib/ansible/modules/cloud/openstack/os_server.py

@ansibot (Contributor) commented Aug 25, 2020

Thank you very much for your interest in Ansible. Ansible has migrated much of the content into separate repositories to allow for more rapid, independent development. We are closing this issue/PR because this content has been moved to one or more collection repositories.

For further information, please see:
https://github.com/ansible/ansibullbot/blob/master/docs/collection_migration.md

@ansibot closed this as completed on Aug 25, 2020
@ansible locked and limited conversation to collaborators on Sep 22, 2020
@sivel removed the waiting_on_contributor label on Dec 3, 2020