Issues that have come up when deploying or managing Spinnaker.
- Seems to happen when `hal deploy apply` gives up after waiting on the bootstrap Services
- Not able to delete Pods
- Have to restart Docker Daemon on Nodes, or rotate Nodes out
- Solution:
  - Seems like this does not occur when running on Kubernetes Nodes with more resources available
Shows error
2018-08-09 08:39:51.952 ERROR 1 --- [ecutionAction-6] c.n.s.fiat.roles.UserRolesSyncer : [] Unable to resolve service account permissions. com.netflix.spinnaker.fiat.permissions.PermissionResolutionException: com.netflix.spinnaker.fiat.providers.ProviderException: (Provider: DefaultAccountProvider) retrofit.RetrofitError: connect timed out
- Solution:
- Make sure Clouddriver has a Pod running
- https://github.com/spinnaker/fiat/blob/397706a98b56d4470a06f63972048a3157f98aaf/fiat-roles/src/main/java/com/netflix/spinnaker/fiat/providers/internal/ClouddriverService.java#L32-L36
Make sure `spec.replicas` > 0
    kubectl -n spinnaker get pods
    kubectl -n spinnaker get replicasets
    kubectl -n spinnaker edit replicasets spin-clouddriver-v###
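The replica check above can also be scripted. This is a minimal sketch rather than a definitive procedure: the function names `clouddriver_replicas` and `scale_clouddriver` are hypothetical, and it assumes the `spinnaker` namespace and the `spin-clouddriver-v###` ReplicaSet naming used in these notes.

```shell
#!/bin/sh
# Hypothetical helper: list each Clouddriver ReplicaSet with its desired replica count.
# Assumes the "spinnaker" namespace and spin-clouddriver-v### names used in these notes.
clouddriver_replicas() {
  kubectl -n spinnaker get replicasets \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.replicas}{"\n"}{end}' |
    grep '^spin-clouddriver-'
}

# Hypothetical helper: scale one ReplicaSet to a single replica.
# $1 is the ReplicaSet name printed by clouddriver_replicas.
scale_clouddriver() {
  kubectl -n spinnaker scale replicaset "$1" --replicas=1
}
```

Run `clouddriver_replicas` first; if a line ends in `0`, pass that ReplicaSet name to `scale_clouddriver`.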
- x.509 port defined as `default.apiPort: 8085` in `gate-local.yml`
- Output of `netstat -ntlp` on Gate shows no listener on 8085
- Solution:
  - Requires SSL to be enabled
        hal config security api ssl enable
- Loading page shows `502 Bad Gateway`
- Traefik Ingress using HTTP to communicate with the new HTTPS port
- Traefik recognizes the scheme based on port, if 443 use HTTPS
- Solution:
  - Configure Traefik to use HTTPS
  - Update Gate Service with `kubectl` to route port 443
        apiVersion: v1
        kind: Service
        metadata:
          name: spin-gate
          namespace: spinnaker
          annotations:
            prometheus.io/path: /prometheus_metrics
            prometheus.io/port: "8008"
            prometheus.io/scrape: "true"
        spec:
          ports:
          - name: https
            port: 443
            targetPort: 8084
          - name: http
            port: 8084
            targetPort: 8084
  - Update Gate Ingress to use Service port 443
        apiVersion: extensions/v1beta1
        kind: Ingress
        metadata:
          name: spin-gate
          namespace: spinnaker
        spec:
          rules:
          - host: gate.example.com
            http:
              paths:
              - path: /
                backend:
                  serviceName: spin-gate
                  servicePort: https
- Now page loads with `500 Internal Server Error`
- Traefik Ingress does not trust self-signed Certificate
- Possible solutions:
  - Use a publicly trusted Certificate
  - Add the private Certificate Authority to Traefik
  - Set `insecureSkipVerify = true` in Traefik's global configuration
- Solution:
  - Short term, set `insecureSkipVerify = true`
  - Add configuration file for Traefik
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: traefik-config
          namespace: kube-system
        data:
          traefik.toml: |
            logLevel = "INFO"
            insecureSkipVerify = true
  - Mount Traefik configuration file
        kind: Deployment
        apiVersion: extensions/v1beta1
        metadata:
          name: traefik-ingress-controller
          namespace: kube-system
          labels:
            k8s-app: traefik-ingress-lb
        spec:
          template:
            spec:
              containers:
              - image: traefik
                name: traefik-ingress-lb
                args:
                - --api
                - --kubernetes
                volumeMounts:
                - name: traefik-config
                  mountPath: /etc/traefik
              volumes:
              - name: traefik-config
                configMap:
                  name: traefik-config
  - Page now loads as expected
- Front50 returns 403 (permission denied)
Orca error in logs:
    2018-05-29 14:14:59.937 ERROR 1 --- [ handlers-19] c.n.s.orca.q.handler.RunTaskHandler : [] Error running UpsertApplicationTask for orchestration[00000000-0000-0000-0000-000000000000]
    retrofit.RetrofitError: 403
        at retrofit.RetrofitError.httpError(RetrofitError.java:40)
        at retrofit.RestAdapter$RestHandler.invokeRequest(RestAdapter.java:388)
        at retrofit.RestAdapter$RestHandler.invoke(RestAdapter.java:240)
        at com.sun.proxy.$Proxy106.get(Unknown Source)
        at com.netflix.spinnaker.orca.front50.Front50Service$get.call(Unknown Source)
        at com.netflix.spinnaker.orca.front50.tasks.AbstractFront50Task.fetchApplication(AbstractFront50Task.groovy:73)
        at com.netflix.spinnaker.orca.applications.tasks.UpsertApplicationTask.performRequest(UpsertApplicationTask.groovy:39)
        at com.netflix.spinnaker.orca.applications.tasks.UpsertApplicationTask$performRequest.callCurrent(Unknown Source)
        at com.netflix.spinnaker.orca.front50.tasks.AbstractFront50Task.execute(AbstractFront50Task.groovy:67)
        at com.netflix.spinnaker.orca.q.handler.RunTaskHandler$handle$1$1.invoke(RunTaskHandler.kt:82)
        at com.netflix.spinnaker.orca.q.handler.RunTaskHandler$handle$1$1.invoke(RunTaskHandler.kt:51)
        at com.netflix.spinnaker.orca.q.handler.AuthenticationAwareKt$sam$Callable$55f02348.call(AuthenticationAware.kt)
        at com.netflix.spinnaker.security.AuthenticatedRequest.lambda$propagate$1(AuthenticatedRequest.java:79)
        at com.netflix.spinnaker.orca.q.handler.AuthenticationAware$DefaultImpls.withAuth(AuthenticationAware.kt:49)
        at com.netflix.spinnaker.orca.q.handler.RunTaskHandler.withAuth(RunTaskHandler.kt:51)
        at com.netflix.spinnaker.orca.q.handler.RunTaskHandler$handle$1.invoke(RunTaskHandler.kt:81)
        at com.netflix.spinnaker.orca.q.handler.RunTaskHandler$handle$1.invoke(RunTaskHandler.kt:51)
        at com.netflix.spinnaker.orca.q.handler.RunTaskHandler$withTask$1.invoke(RunTaskHandler.kt:173)
        at com.netflix.spinnaker.orca.q.handler.RunTaskHandler$withTask$1.invoke(RunTaskHandler.kt:51)
        at com.netflix.spinnaker.orca.q.handler.OrcaMessageHandler$withTask$1.invoke(OrcaMessageHandler.kt:47)
        at com.netflix.spinnaker.orca.q.handler.OrcaMessageHandler$withTask$1.invoke(OrcaMessageHandler.kt:31)
        at com.netflix.spinnaker.orca.q.handler.OrcaMessageHandler$withStage$1.invoke(OrcaMessageHandler.kt:57)
        at com.netflix.spinnaker.orca.q.handler.OrcaMessageHandler$withStage$1.invoke(OrcaMessageHandler.kt:31)
        at com.netflix.spinnaker.orca.q.handler.OrcaMessageHandler$DefaultImpls.withExecution(OrcaMessageHandler.kt:66)
        at com.netflix.spinnaker.orca.q.handler.RunTaskHandler.withExecution(RunTaskHandler.kt:51)
        at com.netflix.spinnaker.orca.q.handler.OrcaMessageHandler$DefaultImpls.withStage(OrcaMessageHandler.kt:53)
        at com.netflix.spinnaker.orca.q.handler.RunTaskHandler.withStage(RunTaskHandler.kt:51)
        at com.netflix.spinnaker.orca.q.handler.OrcaMessageHandler$DefaultImpls.withTask(OrcaMessageHandler.kt:40)
        at com.netflix.spinnaker.orca.q.handler.RunTaskHandler.withTask(RunTaskHandler.kt:51)
        at com.netflix.spinnaker.orca.q.handler.RunTaskHandler.withTask(RunTaskHandler.kt:166)
        at com.netflix.spinnaker.orca.q.handler.RunTaskHandler.handle(RunTaskHandler.kt:63)
        at com.netflix.spinnaker.orca.q.handler.RunTaskHandler.handle(RunTaskHandler.kt:51)
        at com.netflix.spinnaker.q.MessageHandler$DefaultImpls.invoke(MessageHandler.kt:36)
        at com.netflix.spinnaker.orca.q.handler.OrcaMessageHandler$DefaultImpls.invoke(OrcaMessageHandler.kt)
        at com.netflix.spinnaker.orca.q.handler.RunTaskHandler.invoke(RunTaskHandler.kt:51)
        at com.netflix.spinnaker.orca.q.audit.ExecutionTrackingMessageHandlerPostProcessor$ExecutionTrackingMessageHandlerProxy.invoke(ExecutionTrackingMessageHandlerPostProcessor.kt:47)
        at com.netflix.spinnaker.q.QueueProcessor$pollOnce$1$1.run(QueueProcessor.kt:74)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
- Solution:
  - Set `fiat.cache.expiresAfterWriteSeconds: 0` in `fiat-local.yml` and `services.fiat.cache.expiresAfterWriteSeconds: 0` in `spinnaker-local.yml`
  - https://www.bountysource.com/issues/48656889-application-not-found-and-delay-issue-in-ui
  - Property needs to be set in both files
  - Reduces the cache expiry from its default of 20 seconds
  - Application creation workflow now goes:
    - Front50 responds 404 (not found) instead of 403 (access denied)
          com.netflix.spinnaker.front50.exception.NotFoundException: Object not found (key: exampleapplication)
    - Create Application
    - Application exists immediately
- Anyone is able to disable and enable Clusters
- Destroying a Cluster disables the Cluster, then fails on the destroy step with error `Access denied to account ${ACCOUNT}`
- Solution:
  - Will fail properly with Traffic Guards enabled for Cluster
  - Anyone can modify the Traffic Guards for an Application
  - After removing the safety, someone can later disable a Cluster and take down traffic
`ThrottleException` in Clouddriver logs
    2018-05-09 01:36:48.681 INFO 1 --- [cutionAction-47] com.amazonaws.latency : ServiceName=[AmazonElasticLoadBalancing], ThrottleException=[com.amazonaws.services.elasticloadbalancingv2.model.AmazonElasticLoadBalancingException: Rate exceeded (Service: AmazonElasticLoadBalancing; Status Code: 400; Error Code: Throttling; Request ID: 00000000-0000-0000-0000-000000000000)], AWSErrorCode=[Throttling], StatusCode=[400, 200], ServiceEndpoint=[https://elasticloadbalancing.us-west-2.amazonaws.com], RequestType=[DescribeTargetHealthRequest], AWSRequestID=[00000000-0000-0000-0000-000000000000, 00000000-0000-0000-0000-000000000000], HttpClientPoolPendingCount=0, RetryCapacityConsumed=0, ThrottleException=1, HttpClientPoolAvailableCount=0, RequestCount=2, HttpClientPoolLeasedCount=0, RetryPauseTime=[474.151], RequestMarshallTime=[0.002], ResponseProcessingTime=[0.214], ClientExecuteTime=[700.076], HttpClientSendRequestTime=[0.059, 0.048], HttpRequestTime=[4.672, 42.883], RequestSigningTime=[0.082, 0.105], CredentialsRequestTime=[0.002, 0.002, 0.003], HttpClientReceiveResponseTime=[4.564, 27.471],
- Solution:
- Decrease allowed Provider API requests per second
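One way to lower the request rate is Clouddriver's `serviceLimits` block in a custom profile. The sketch below is an assumption-laden example, not a confirmed fix: the `serviceLimits.cloudProviderOverrides.aws.rateLimit` key is recalled from Clouddriver's rate-limit settings and should be verified against your Spinnaker version, and the function name `write_aws_rate_limit`, the target path, and the value `5.0` are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: write a Clouddriver profile that caps AWS API requests per second.
# $1 = output file, $2 = requests per second (illustrative value; tune for your account).
# NOTE: the serviceLimits key layout is an assumption to verify for your Spinnaker version.
write_aws_rate_limit() {
  cat > "$1" <<EOF
serviceLimits:
  cloudProviderOverrides:
    aws:
      rateLimit: $2
EOF
}
```

For example, `write_aws_rate_limit ~/.hal/default/profiles/clouddriver-local.yml 5.0` followed by `hal deploy apply` would roll the setting out. Note that the helper overwrites the target file, so merge by hand if a custom profile already exists.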
Exception ( Monitor Deploy ) `unable to resolve AMI imageId from ami-a5532fdd`
- Solution:
  - Fix where Clouddriver is trying to find AMIs
  - Not sure what the `hal` command is, but modify `.hal/config` so `primaryAccount` is the Account to search
        deploymentConfigurations:
        - name: default
          providers:
            aws:
              primaryAccount: HALYARD_AWS_ACCOUNT_NAME
Exception ( Determine Source Server Group ) `403`
- Solution 1:
  - Missing `READ` permissions for Account
  - Look at `.hal/config` for what Roles are listed under `READ`
  - For Service Accounts, add the Role
  - For Users, add the User to the Group in the SAML or other authentication Provider
- Solution 2:
  - Deploy Stage `application` value does not match Spinnaker Application
  - In the UI, the `Cluster` name should be the same as the Spinnaker Application
# Igor
2018-10-25 23:25:06.607 INFO 1 --- [RxIoScheduler-4] c.n.s.igor.jenkins.JenkinsBuildMonitor : [master=Jenkins:job=example-job] has no other builds between [Thu Oct 25 23:21:42 GMT 2018 - Thu Oct 25 23:24:00 GMT 2018], advancing cursor to 1540509840709
# Echo
2018-10-25 23:25:06.607 INFO 1 --- [IoScheduler-987] c.n.s.e.p.monitor.TriggerMonitor : Found matching pipeline example-application:example-pipeline
2018-10-25 23:25:06.607 INFO 1 --- [IoScheduler-987] c.n.s.e.p.orca.PipelineInitiator : Triggering Pipeline(example-application, example-pipeline, 00000000-0000-0000-0000-000000000000) due to Trigger(00000000-0000-0000-0000-000000000000, jenkins, Jenkins, example-job, null, gitlab, null, null, null, null, null, null, {}, null, {}, null, null, [], null, null, null, null, Pipeline(example-application, example-pipeline, 00000000-0000-0000-0000-000000000000))
2018-10-25 23:25:06.608 INFO 1 --- [it-/orchestrate] c.n.s.e.p.orca.OrcaService : ---> HTTP POST http://spin-orca.spinnaker:8083/orchestrate
2018-10-25 23:25:06.651 INFO 1 --- [it-/orchestrate] c.n.s.e.p.orca.OrcaService : <--- HTTP 403 http://spin-orca.spinnaker:8083/orchestrate (45ms)
2018-10-25 23:25:06.693 ERROR 1 --- [ Retrofit-Idle] c.n.s.e.p.orca.PipelineInitiator : Retrying pipeline trigger, attempt 1/5
2018-10-25 23:25:27.023 ERROR 1 --- [ Retrofit-Idle] c.n.s.e.p.orca.PipelineInitiator : Error triggering pipeline: Pipeline(example-application, example-pipeline, 00000000-0000-0000-0000-000000000000)
# Orca
2018-10-25 23:25:06.686 INFO 1 --- [0.0-8083-exec-8] c.n.s.o.c.OperationsController : [] received pipeline 00000000-0000-0000-0000-000000000000:{…}
2018-10-25 23:25:06.687 INFO 1 --- [0.0-8083-exec-8] c.n.s.o.c.OperationsController : [] requested pipeline: {…}
2018-10-25 23:25:06.687 INFO 1 --- [0.0-8083-exec-8] c.n.s.orca.front50.Front50Service : [] ---> HTTP GET http://spin-front50.spinnaker:8080/pipelines/example-application?refresh=false
2018-10-25 23:25:06.692 INFO 1 --- [0.0-8083-exec-8] c.n.s.orca.front50.Front50Service : [] <--- HTTP 403 http://spin-front50.spinnaker:8080/pipelines/example-application?refresh=false (5ms)
- Solution:
  - Missing `Run As User` with Application `READ` and `WRITE` Permissions
  - When not populated, the `Run As User` defaults to `Anonymous`
  - When there are any Roles configured in the Application Permissions, `Anonymous` authorization no longer works
  - Create a Service Account: https://www.spinnaker.io/setup/security/authorization/service-accounts/
  - Configure Spinnaker Application Permissions to allow `READ` and `WRITE` for any Role the Service Account belongs to
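Creating the Service Account can be sketched as a POST to Front50. Treat this as a rough example under stated assumptions: the `/serviceAccounts` endpoint and the `name`/`memberOf` fields follow the Service Accounts page linked above as I recall it, the Front50 address is copied from the Orca log earlier in this section, and the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build the JSON body for a Service Account.
# $1 = Service Account name, $2 = a single Role it belongs to.
service_account_json() {
  printf '{"name": "%s", "memberOf": ["%s"]}' "$1" "$2"
}

# Hypothetical helper: register the Service Account with Front50.
# The address below comes from the Orca log in these notes; adjust it for your cluster.
create_service_account() {
  curl -sS -X POST \
    -H 'Content-Type: application/json' \
    -d "$(service_account_json "$1" "$2")" \
    http://spin-front50.spinnaker:8080/serviceAccounts
}
```

The Role passed as the second argument should be one that has `READ` and `WRITE` in the Application Permissions, and the Service Account name is what goes into the pipeline's `Run As User` field.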
- Solution:
  - Set memory limits for Containers
  - https://www.spinnaker.io/reference/halyard/component-sizing/
  - Set Pod memory requests and limits in `.hal/config`
        deploymentConfigurations:
        - name: default
          deploymentEnvironment:
            customSizing:
              spin-clouddriver:
                limits:
                  memory: 2Gi
  - Set the JVM flags to be 80-90% of the Pod memory in `.hal/default/service-settings/clouddriver.yml`
        env:
          # 2GB * .8
          JAVA_OPTS: -Xmx1638m
  - `-Xms` should be 80-90% of Pod `requests`
  - `-Xmx` should be 80-90% of Pod `limits`
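The 80-90% rule is simple integer arithmetic. As a small sketch, the hypothetical helper `jvm_heap_flag` below derives the flag from a Pod memory value given in MiB; the name and argument order are illustrative, not part of Halyard.

```shell
#!/bin/sh
# Hypothetical helper: derive a JVM heap flag from a Pod memory value.
# $1 = flag name (Xms or Xmx), $2 = Pod memory in MiB, $3 = percentage (80-90).
jvm_heap_flag() {
  echo "-$1$(( $2 * $3 / 100 ))m"
}

# A 2Gi (2048 MiB) limit at 80% matches the value used in clouddriver.yml above.
jvm_heap_flag Xmx 2048 80   # prints -Xmx1638m
```

Shell arithmetic truncates toward zero, so the result rounds down, which keeps the heap inside the intended fraction.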
JavaScript Console errors when selecting Account: `TypeError: Cannot read property 'slice' of undefined`
- Solution:
  - Specify default Account and Region in Deck
  - Use `.hal/default/profiles/settings-local.js` to override the defaults in `.hal/default/staging/settings.js`
        window.spinnakerSettings.providers.aws.defaults = {
          account: 'test',
          region: 'us-east-5',
          iamRole: 'DEFAULT_IAM_PROFILE',
        };
- Have to remember to check `Create an internal load balancer` when creating Load Balancers
- Solution:
  - Configure Deck to infer the Internal flag based on the Subnet Purpose name
  - Use `.hal/default/profiles/settings-local.js` to override the defaults in `.hal/default/staging/settings.js`
        window.spinnakerSettings.providers.aws.loadBalancers.inferInternalFlagFromSubnet = true;