Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for multiple CPIs (mvp) #1471

Merged
merged 55 commits into from
Nov 14, 2016
Merged

Support for multiple CPIs (mvp) #1471

merged 55 commits into from
Nov 14, 2016

Conversation

MatthiasWinzeler
Copy link
Contributor

This PR provides basic support for multiple CPIs in a single BOSH director.

The high level concept is outlined here: https://github.com/cloudfoundry/bosh-notes/blob/master/multi-cpi.md
@cppforlife supported me with implementation details and reviews of some code parts.

Some notable points:

  • The heart of the PR is the newly introduced CpiConfig which stores the multiple CPI endpoint configurations. CloudFactory then looks up the cpi config for a given AZ and returns an instantiated Cloud. To achieve this lookup, the cloud factory uses the latest cpi config and the deployment's cloud config (which contains the AZ -> CPI relation).
  • Every cloud interaction thus needs an up-to-date deployment.cloud_config. In some cases, i.e. while updating the deployment, deployment.cloud_config did not contain the most recent CloudConfig but an older one, because this association was only set and persisted after updating the deployment. I had to do some ugly hacks to set deployment.cloud_config before every cloud interaction. I guess this needs some improvements.
  • I added only very basic integration tests. The branch is built in BOSH's concourse and is successful. I never ran BATS tests though. Do we need some more integration tests or BATS tests?
  • I was having a hard time completely understanding the code parts of stemcell compilation. I’m not quite sure I changed it the right way, so I’d appreciate if you could look at this more carefully.
  • For stemcells, we extended the unique constraint to the new cpi attribute. But since postgresql seems to completely ignore the uniqueness constraint if one of the fields is NULL (which is the case for the cpiattribute when not using cpi config), I added a second validation. I assume this validation could fail if there is a stemcell with name=x, version=y, cpi=z in DB and we add one with name=x, version=y, cpi=nil. Is there a better way to do this? A custom validator? Ignore this case?
  • To be able to support the cpi config, each CPI needs to be modified to also be able to read the configuration from the context hash. We implemented this small change for the openstack cpi.
    We tested this in our lab and were successfully able to stretch deployments over multiple openstacks. I will file the PR there as soon as this PR is merged and a new version of the bosh_cpi gem is ready for downstream use. (cc @voelzmo)

* develop: (34 commits)
  Introducing the DiskManagerFactory.
  Extracting OrphanDiskManager from DiskManager.
  Fix cis spec
  Don't log interpolated values in output
  Refactoring to replace InstanceGroup#persistent_disk_type with InstanceGroup#persistent_disks.
  Bump bosh-agent
  stemcell_builder: force-remove SSH host keys
  stemcell/google: Disable SSH set host keys by Google agent
  Disable IPv6 Redirect Acceptance
  Integration test for resource pool placeholders
  refactor deep_hash_replacement ignore
  Allow to use root partition as ephemeral disk
  Fully qualify syslog reload command when logrotating
  Quote all proxy environment variables to account for spaces
  more robust config server integration test
  Config server better SSL error messages
  Fix config server integration test
  fix hash first level interpolation
  integration tests for env interpolation
  uninterpolated resource pool env are now handled
  ...
* develop: (203 commits)
  More go cli integration test migration
  Continue porting test to go cli
  Add additional tests to gocli integration suite
  Copy syslog to tmp directory before scp'ing
  Bump Agent
  Attach disk if necessary before update settings
  Fix unit test
  Switch to use update settings call for associate disks
  Adding final release for build 3262.14 (v257.14) finalized on master
  Remove tests for removed gocli functionality
  Un-pend external CPI tests for gocli
  Add gocli integration tests
  Add azure-hyperv to list of stemcell artifacts
  Use xen hypervisor when building softlayer stemcells
  Add stemcell smoketest for /var/log permission fixes
  Change concourse build stemcell task when moving a stemcell from tmp -> output to use the stemcell filename instead of a wildcard since tmp now also has bosh-257.3.tgz.
  Bump OS images
  Only use Syslog::Logger with Ruby >= 2
  Write director events also to syslog
  Bump os-images
  ...
* develop:
  Instance returns 'lifecycle' from spec if available
  Add 'lifecycle' attribute to instance apply spec
  disable bosh_micro building in new stemcell pipeline
  Add additional int. test for cert generation
  Split out separate go cli instance class
* develop: (49 commits)
  Bump bosh agent
  Remove polluted test setup
  Revert "Temporarily remove managed_persistent_disk_spec.rb"
  Allow custom configuration of semver resource for stemcells
  Removing policy from README in favor of file
  Bump the parallel_tests gem
  Rename os images directories to reflect move to containers
  Revert "Prepend namespace to config keys."
  Bump OS images
  Prepend namespace to config keys.
  Create IAM policy set to the bosh:os-image s3 resource
  Migrate tests for gocli integration
  Bump bosh-agent for fixing ephemeral disk test
  Temporarily remove managed_persistent_disk_spec.rb
  Move git tag step into publish job
  Push new os-images with public-read permissions
  Fix incorrect semver syntax in initial_version
  bosh pipeline should create candidate release tarballs
  clarify use of dry_run flag in job_runner
  Bump bosh-agent
  ...
… vars (which might no be set globally, i.e. for rvm)
@tjvman
Copy link
Member

tjvman commented Oct 20, 2016

@MatthiasWinzeler @voelzmo Thanks for this PR! We've been looking over the code and doing local testing and it's looking good so far. There are a couple issues that we'd like to bring to your attention before we finalize the review, though.

  1. When using cpi_config, if you run bosh task x --debug then the logs will contain any credentials that were passed in the request['context'] to the CPI; redact creds in the context before sending them to debug logs. See bosh_cpi/lib/cloud/external_cpi.rb:84.
  2. Always using the newest cloud_config when performing CPI calls breaks bosh recreate functionality (may break other things). We would like to preserve the versioning functionality for the CloudConfig, but from your initial PR comments it seemed like always using the latest CloudConfig was necessary. Can you give us some more context on why the newest CloudConfig has to be used so we can work together and figure out a solution that allows both the new PR and desired pre-existing functionality to work?

Copy link
Member

@tjvman tjvman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good overall. There's just a handful of things we'd like you to take care of, plus a few questions we had.

@@ -66,12 +67,15 @@ def ping(*arguments); invoke_cpi_method(__method__.to_s, *arguments); end
private

def invoke_cpi_method(method_name, *arguments)
context = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If context is always going to be initialized in this way, then the CPIs (Openstack CPI currently does this) should not test if context && context[<CPI TYPE>] before merging in the context CPI properties because context will always contain a value (holds true for 3146.x, 256.x, 257.x - backwards compatibility)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I removed the check for context in the openstack cpi: https://github.com/swisscom/bosh-openstack-cpi-release/commit/77970370f428447586925001414c4a84cb892d6d

@@ -79,6 +90,32 @@ def spec
}
end

def cid_for_az(az)
raise 'please bind model first' if @models.nil?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to have a more detailed error/specific error type here.

Copy link
Contributor Author

@MatthiasWinzeler MatthiasWinzeler Nov 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept it a generic runtime exception on purpose - it should always happen during develop time and bubble up to the developer (it means he didn't bind the model before using cid_for_az).

What's the preferred way in such situations? Should we still introduce a more specific error? Does it need to be put in errors.rb (even though it should never leave the system)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point, this will be fine for now.

module Bosh::Director
class CloudFactory
def self.create_from_deployment(deployment,
cpi_config = Bosh::Director::Api::CpiConfigManager.new.latest,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since CloudFactoryHelper is the only place that this method is called and it doesn't override the default_cloud, remove the default_cloud method parameter. (See comment in CloudFactoryHelper as well)

# that could be passed to cloud factory, we use the latest cloud config
CloudFactory.new(CloudFactory.create_cloud_planner(Bosh::Director::Api::CloudConfigManager.new.latest),
CloudFactory.parse_cpi_config(Bosh::Director::Api::CpiConfigManager.new.latest),
Config.cloud)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already the default for this parameter. Remove, also see comment on cloud_factory.rb:15

status(201)
end

get '/', scope: :read do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tighten access to this endpoint; cpi_configs can have sensitive credentials so we don't want to be as permissive as we are with CloudConfig.

@@ -5,6 +5,7 @@ module Jobs
class UpdateStemcell < BaseJob
include ValidationHelper
include DownloadHelper
include CloudFactoryHelper

UPDATE_STEPS = 5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can no longer be hard-coded because it's dependent on number of configured CLIs. Currently update_stemcell output with multiple CPIs looks like

Started update stemcell
  Started update stemcell > Extracting stemcell archive. Done (00:00:00)
  Started update stemcell > Verifying stemcell manifest. Done (00:00:00)
  Started update stemcell > Checking if this stemcell already exists on cloud foo. Done (00:00:00)
  Started update stemcell > Uploading some-name/some-version to the cloud foo. Done (00:00:12)
  Started update stemcell > Save some-name/some-version (some-ami) for cloud foo. Done (00:00:00)
     Done update stemcell (00:00:12)
  Started update stemcell > Checking if this stemcell already exists on cloud bar. Done (00:00:00)
  Started update stemcell > Uploading stemcell some-name/some-version to the cloud bar. Done (00:00:07)
  Started update stemcell > Save stemcell some-name/some-version (some-ami) for cloud bar. Done (00:00:00)

The Done update message should print after all stemcells have been updated successfully.

}

subject(:parsed_cpi_config) { described_class.new(cpis) }
describe '#find_cpi_by_name' do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since cpi_manifest_parser doesn't allow CPIs with duplicate names, is there value in testing that the ParsedCpiConfig returns the first cpi?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there's not. I changed it.

@@ -15,7 +15,7 @@ module Bosh::Director
let(:dns_manager) { DnsManagerProvider.create }
let(:event_log) { Config.event_log }

let(:cloud) { instance_double('Bosh::Cloud') }
let(:cloud) { Config.cloud }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line can be deleted; cloud was only needed for initializing the subject.

@@ -9,8 +9,8 @@ def make_deployment(name)
BD::Models::Deployment.make(:name => name)
end

def make_stemcell(name, version, os = 'os1')
BD::Models::Stemcell.make(:name => name, :operating_system=>os, :version => version)
def make_stemcell(name, version, os = 'os1', params = {})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like cpi could be a separate argument and default to nil. Then params arg could be removed; doesn't seem necessary to have a params hash when only one param is ever used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not only cpi, but also cid (see further below in the spec). Should I add both as separate arguments? I prefer the params way since it's more readable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't catch the cid. I don't have strong feelings about params vs separate args in this case, so this seems fine as is.

@@ -329,43 +328,6 @@
end
end
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this test removed? Don't see how it relates to this PR.

@ljfranklin
Copy link
Contributor

@cppforlife @MatthiasWinzeler I realize this thought is coming a little late, but do we actually need a separate cpi-config file and command or could we just add those properties to the cloud-config?

The distinction is more or less an implementation detail as the cloud config is already full of properties that will be passed to the CPI. I'm worried users not familiar with the internals of BOSH won't understand why some IaaS properties need to go in cloud-config and others in cpi-config. I'd imagine this also simplifies the multi-cpi implementation and solves the issue mentioned by @tjvman where multi-cpi could break bosh recreate. Let me know if I'm missing something.

@cppforlife
Copy link
Contributor

thinking behind separation was that cpi config isnt versioned just like cloud config. its likely that you will have to update/shift configuration over time for your cpi (cred change, etc) and i dont think we want users to have to redeploy for that.

Sent from my iPhone

On Oct 26, 2016, at 7:44 AM, Lyle Franklin notifications@github.com wrote:

@cppforlife @MatthiasWinzeler I realize this thought is coming a little late, but do we actually need a separate cpi-config file and command or could we just add those properties to the cloud-config?

The distinction is more or less an implementation detail as the cloud config is already full of properties that will be passed to the CPI. I'm worried users not familiar with the internals of BOSH won't understand why some IaaS properties need to go in cloud-config and others in cpi-config. I'd imagine this also simplifies the multi-cpi implementation and solves the issue mentioned by @tjvman where multi-cpi could break bosh recreate. Let me know if I'm missing something.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@voelzmo
Copy link
Contributor

voelzmo commented Oct 26, 2016

@ljfranklin @cppforlife Furthermore, cpi-config contains credentials and should potentially handled differently by admins and also the director than the rest of the cloud-config.

@MatthiasWinzeler
Copy link
Contributor Author

Thanks for reviewing the changes!
I pushed most of the requested changes and will complete the remaining open points in the coming days.

@MatthiasWinzeler
Copy link
Contributor Author

Always using the newest cloud_config when performing CPI calls breaks bosh recreate functionality (may break other things). We would like to preserve the versioning functionality for the CloudConfig, but from your initial PR comments it seemed like always using the latest CloudConfig was necessary. Can you give us some more context on why the newest CloudConfig has to be used so we can work together and figure out a solution that allows both the new PR and desired pre-existing functionality to work?

It's best explained when looking at this integration test which tests the switch from non-cpi-config to cpi-config:

  • Deploy without cpi-config
  • Create cpi-config
  • Update cloud-config (add CPIs to AZs)
  • Deploy

My assumption was that this works out of the box. Without my ugly fix, this did not work - BOSH used the older cloud config while deploying, so it did not know which AZ uses which CPI and thus deployed to the director cpi.

You can easily reproduce this by removing my hack and run the integration test above.

Maybe this is some other bug?

* develop: (159 commits)
  Fix various gocli integration tests
  Separate stemcell buckets for candidate and publish
  Fix `environment` command calls in gocli integration test
  Migrate lifecycle to instance spec
  Fix ruby integration test
  Fix go cli integration test
  If a http with a payload returns a 401, then return a response instead of an error
  Fix gocli integration tests that used stateful env
  Guard against credentials being nil. Raise an auth error when director returns 401
  Fix stemcell upload expectations in ruby CLI
  Quietly ignore stemcells which already exist
  try to fix integration tests for gocli statefullness removal
  Fix broken integration tests
  Unpend dry-run integration tests
  Unpend passing test
  Drop agent_client from GemComponents list
  Allow release sha1 to be passed in body or as query param
  Drop `agent_client` source folder
  Drop simple_blobstore_server
  Remove genisoimage package from release
  ...
* develop:
  When excluding dns records by ip, cast ip to string
  Add tags from runtime_config to deployment_plan
  Formatting
  Add logic to ensure duplicate dns agent records are not sent to agent.
  Add logic to remove stale dns records for an instance
  Fix publish stemcell and integration test
  Use the proper variable
  Unpend cck --auto integration tests
  Bump bosh-agent
  Upload blobs for s3cli/s3cli-0.0.53-linux-amd64
  Bump s3cli to 0.0.53
  Allow trimming empty lines in templates
  Adding mkdir of tmp because it fails without folder created
@MatthiasWinzeler
Copy link
Contributor Author

Pushed some last improvements - it's ready now for another review - and merged current develop. integration-1.9-postgres has some broken specs, but it's the same on develop.

@cppforlife
Copy link
Contributor

@MatthiasWinzeler we are planning to take a look at the changes next week.

@tjvman
Copy link
Member

tjvman commented Nov 10, 2016

@MatthiasWinzeler Thanks for making those changes. We looked into how the change to when the deployment model gets updated, and we believe we've identified the source of the bug as well as a way to fix it.

With that in mind, we intend to merge this PR as it currently stands and make our change to how deployment.cloud_config works on top of that.

@mingxiao mingxiao merged commit 00f3301 into cloudfoundry:develop Nov 14, 2016
@mingxiao
Copy link
Contributor

@MatthiasWinzeler the cpi-config and update-cpi-config has also been ported over to the go bosh cli

@youngm
Copy link

youngm commented Dec 12, 2016

Nice work @MatthiasWinzeler ! Thanks tons for contributing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants