
Route DNS hostnames not routable in airgap scenario so Che fails to start #15187

Closed
3 of 23 tasks
tomgeorge opened this issue Nov 13, 2019 · 13 comments
Labels
area/install: Issues related to installation, including offline/air gap and initial setup
kind/bug: Outline of a bug - must adhere to the bug report template.
severity/P1: Has a major impact to usage or development of the system.

Comments

@tomgeorge
Contributor

Describe the bug

Depending on the network topology or DNS servers, a fully disconnected installation will, in some instances, be unable to resolve route URLs inside the cluster. This manifests as the Che server pod failing to retrieve the OpenID configuration at $PUBLIC_KEYCLOAK_URL/auth/realms/che/.well-known/openid-configuration.

I don't know exactly how OpenShift does DNS in different environments. I would expect in-cluster traffic to be able to resolve a route properly, but that does not appear to be the case in all scenarios.

curl $KEYCLOAK_ROUTE_URL/auth/realms/che/.well-known/openid-configuration times out, but

curl keycloak.namespace.svc:8080/auth/realms/che/.well-known/openid-configuration succeeds
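
One way to reproduce this comparison from inside the cluster is a throwaway pod (a sketch; the image choice is illustrative and not from the original report):

$ oc run route-check -it --rm --restart=Never \
    --image=registry.access.redhat.com/ubi8/ubi -- \
    curl -sv --max-time 10 \
    "$KEYCLOAK_ROUTE_URL/auth/realms/che/.well-known/openid-configuration"
$ oc run svc-check -it --rm --restart=Never \
    --image=registry.access.redhat.com/ubi8/ubi -- \
    curl -sv --max-time 10 \
    http://keycloak.namespace.svc:8080/auth/realms/che/.well-known/openid-configuration

If the first call times out while the second succeeds, the pod can reach the service network but not the route's external address.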

Che version

  • latest
  • nightly
  • other: please specify

Steps to reproduce

Start a Che installation in a disconnected environment.

Expected behavior

Runtime

  • kubernetes (include output of kubectl version)
  • OpenShift (include output of oc version)
  • minikube (include output of minikube version and kubectl version)
  • minishift (include output of minishift version and oc version)
  • docker-desktop + K8S (include output of docker version and kubectl version)
  • other: (please specify)

Screenshots

Installation method

  • chectl
  • che-operator 7.4.0
  • minishift-addon
  • I don't know

Environment

  • my computer
    • Windows
    • Linux
    • macOS
  • Cloud
    • Amazon
    • Azure
    • GCE
    • other (please specify)
  • other: please specify

Additional context

@tomgeorge tomgeorge added the kind/bug Outline of a bug - must adhere to the bug report template. label Nov 13, 2019
@che-bot che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Nov 13, 2019
@tomgeorge
Contributor Author

tomgeorge commented Nov 13, 2019

Alternatively, it may be possible to override these URLs by setting customCheProperties fields in the CheCluster CR. I am working on the list of those properties.
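
For example, a minimal sketch of such an override (the property name is the one used later in this thread; the CheCluster name and namespace are illustrative):

$ oc patch checluster eclipse-che -n che --type=merge \
    -p '{"spec":{"server":{"customCheProperties":{"CHE_KEYCLOAK_AUTH__SERVER__URL":"http://keycloak.che.svc:8080/auth"}}}}'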

@tomgeorge tomgeorge changed the title from "Che server should query the kubernetes service URL of keycloak, instead of the public facing route" to "Route DNS entries not resolvable in airgap scenario so Che fails to start" Nov 14, 2019
@rhopp
Contributor

rhopp commented Nov 14, 2019

The same problem happens much sooner when installing with TLS and a self-signed certificate: while extracting the certificate, the operator creates a temporary route, but this route is not accessible:

time="2019-11-14T14:19:59Z" level=info msg="Creating a test route test to extract router crt" 
time="2019-11-14T14:19:59Z" level=info msg="Creating a new object Route, name: test" 
time="2019-11-14T14:20:29Z" level=error msg="An error occurred when reaching test TLS route: Get https://test-rhopp-air-gap-crw.apps.rhopp-airgap.qe.devcluster.openshift.com: dial tcp 3.130.28.239:443: i/o timeout" 
time="2019-11-14T14:20:29Z" level=error msg="Failed to extract crt. Failed to create a secret with a self signed crt: Get https://test-rhopp-air-gap-crw.apps.rhopp-airgap.qe.devcluster.openshift.com: dial tcp 3.130.28.239:443: i/o timeout" 
time="2019-11-14T14:20:30Z" level=info msg="Creating a test route test to extract router crt" 
time="2019-11-14T14:20:31Z" level=error msg="An error occurred when reaching test TLS route: Get https:: http: no Host in request URL" 
time="2019-11-14T14:20:31Z" level=error msg="Failed to extract crt. Failed to create a secret with a self signed crt: Get https:: http: no Host in request URL" 
time="2019-11-14T14:20:32Z" level=info msg="Creating a test route test to extract router crt" 
time="2019-11-14T14:20:33Z" level=error msg="An error occurred when reaching test TLS route: Get https:: http: no Host in request URL" 
time="2019-11-14T14:20:33Z" level=error msg="Failed to extract crt. Failed to create a secret with a self signed crt: Get https:: http: no Host in request URL" 
time="2019-11-14T14:20:34Z" level=info msg="Creating a test route test to extract router crt" 
...
...
...
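
As a possible workaround for the certificate extraction itself, the router's default certificate can be read directly from the ingress operator's secret instead of going through a test route (a sketch; assumes the router-ca secret that the OpenShift 4 ingress operator creates for self-signed setups):

$ oc get secret router-ca -n openshift-ingress-operator \
    -o jsonpath='{.data.tls\.crt}' | base64 -d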

@sleshchenko
Member

@rhopp eclipse-che/che-operator@db15bdb may be a fix for the issue you mentioned, but I'm not sure

@rhopp
Contributor

rhopp commented Nov 14, 2019

I've been able to successfully start the che-server using the k8s internal DNS name of the keycloak service (in my case CHE_KEYCLOAK_AUTH__SERVER__URL: 'http://keycloak.rhopp-air-gap-crw.svc:8080/auth').

But then (as expected) the dashboard wasn't able to load (with the typical message Authorization token is missed), because my browser couldn't resolve that URL.

@davidfestal
Contributor

davidfestal commented Nov 14, 2019

@rhopp @tomgeorge Would it be possible to check with the OpenShift teams whether it is expected that typical airgapped OpenShift 4.2 installations would not allow pods to access external routes?
That seems like a very hard restriction that would probably make Che fail anyway.

@ironcladlou

By default, on AWS, GCP, and Azure, if cluster DNS zone configuration was provided to the OpenShift installer, OpenShift will manage wildcard DNS records for ingress in the configured zones (assuming ingress is being exposed by a LoadBalancer Service, which is the default on those platforms.)

On other platforms, or if cluster DNS zone configuration is omitted, wildcard DNS records for ingress are not managed and it's up to the cluster owner to configure DNS to expose ingress (if desired.)

I hope that helps clarify some of the DNS management behavior. I can provide more specific details if someone can help me understand how the problematic clusters are being created (e.g. through the OpenShift installer IPI flow, UPI, etc.)
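
A quick way to see which of these cases applies on a given cluster is to inspect the DNS config and the ingress Service (a sketch):

# spec.publicZone / spec.privateZone are set only when zone config was provided to the installer
$ oc get dns.config.openshift.io cluster -o yaml
# TYPE LoadBalancer means ingress is exposed via a LoadBalancer Service
$ oc -n openshift-ingress get service router-default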

@tomgeorge tomgeorge changed the title from "Route DNS entries not resolvable in airgap scenario so Che fails to start" to "Route DNS hostnames not routable in airgap scenario so Che fails to start" Nov 14, 2019
@tomgeorge
Contributor Author

Thanks to @ironcladlou for looking into this with me. The issue appears to be that the traffic is rejected by the LB or on the way back to the node. We should look at the way this cluster was configured in QE and see whether it matches the installation procedure in https://docs.openshift.com/container-platform/4.2/installing/installing_restricted_networks/installing-restricted-networks-aws.html

@tolusha tolusha added area/install Issues related to installation, including offline/air gap and initial setup severity/P1 Has a major impact to usage or development of the system. and removed status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. labels Nov 15, 2019
@rhopp
Contributor

rhopp commented Nov 15, 2019

Response from Jianlin Liu, who knows how the cluster is configured:

Ah, I know what the problem is there.
In this disconnected env, we enabled proxy settings so that some core operators can connect to the cloud API.
cluster-image-registry-operator loaded these proxy settings, which is why my testing is going well.
sh-4.2$ env|grep -i proxy
NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.rhopp-airgap.qe.devcluster.openshift.com,api.rhopp-airgap.qe.devcluster.openshift.com,etcd-0.rhopp-airgap.qe.devcluster.openshift.com,etcd-1.rhopp-airgap.qe.devcluster.openshift.com,etcd-2.rhopp-airgap.qe.devcluster.openshift.com,localhost,test.no-proxy.com
HTTPS_PROXY=http://ec2-3-17-157-193.us-east-2.compute.amazonaws.com:3128
HTTP_PROXY=http://ec2-3-17-157-193.us-east-2.compute.amazonaws.com:3128

In this disconnected env, we drop all internet connections from the subnets to create an airgap env. The apps DNS points to an external ELB provisioned by the ingress operator.
So that means you are trying to access an internet URL from inside the cluster. I think that is expected behavior.

Personally, the apps URL is mainly for external user access; if you want to access a cluster service from inside the cluster, why not use the k8s svc endpoints?
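
For reference, the cluster-wide proxy settings those operators consume can be inspected directly on recent OpenShift 4 clusters (a sketch):

$ oc get proxy.config.openshift.io cluster -o yaml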

@ironcladlou

If I understand the setup correctly and you really want to use Routes on an internal subnet (i.e. routes that can be accessed only within the private subnet), then with OpenShift 4.2 you can try replacing the default ingresscontroller with an internally-scoped variant that provisions the LB on the cluster's private subnet, e.g.

$ oc replace --force --wait --filename - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  namespace: openshift-ingress-operator
  name: default
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal
EOF

(See the Kubernetes documentation on internal load balancers for more detail on how this works.)
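
One way to verify the replacement took effect (a sketch; on AWS an internal scope shows up as an internal ELB):

$ oc -n openshift-ingress-operator get ingresscontroller default \
    -o jsonpath='{.spec.endpointPublishingStrategy.loadBalancer.scope}'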

@ibuziuk
Member

ibuziuk commented Nov 15, 2019

@rhopp @jianlinliu could you please clarify the expected & recommended OCP installation setup/config in airgap mode regarding DNS / LB? If I understand correctly, we face this issue because DNS resolution of routes on the QA cluster happens on the public internet, and the only way to communicate is using the service name + port combo. What I do not understand is how come OCP in airgap mode falls back to the public internet for route resolution? Shouldn't it use internal DNS by default?

@tomgeorge
Contributor Author

tomgeorge commented Nov 15, 2019

This document is the best thing that we have for airgap/restricted network installations: https://docs.openshift.com/container-platform/4.2/installing/installing_restricted_networks/installing-restricted-networks-aws.html

The issue is not actually DNS resolution but rather that there is no route for traffic to exit the cluster and return through the ELB. After looking at the templates in http://git.app.eng.bos.redhat.com/git/openshift-misc.git/plain/v3-launch-templates/functionality-testing/aos-4_2/hosts/upi_on_aws-cloudformation-templates/ it looks like the cluster is using Route53 for DNS resolution.

I went through the CloudFormation templates used in this installation and compared them with the ones in the docs, and found that the only differences were in the VPC/networking configuration. The documented CloudFormation stack had:

  • AWS::EC2::NatGateways tied to the public subnets
  • AWS::EC2::EIPs in the VPC's domain
  • AWS::EC2::Routes from the private routing tables to the NatGateway

The template used in cluster provisioning did not have these resources. Additionally, the template used in installation had:

  • AWS::EC2::SecurityGroup Allowing ingress from all protocols to the VPC CIDR range
  • A VPCEndpoint in the VPC to the Services com.amazonaws.${AWS_REGION}.ec2 and com.amazonaws.${AWS_REGION}.elasticloadbalancing

Could the lack of AWS Routes from the private subnets to the NAT Gateway be the cause of this? The docs seem to indicate that not all of these resources are necessary in a restricted network environment:

You must have a public internet gateway, with public routes, attached to the VPC. In the provided templates, each public subnet has a NAT gateway with an EIP address. These NAT gateways allow cluster resources, like private-subnet instances, to reach the internet and are not required for some restricted network or proxy scenarios.

So it seems to be a combination of a configuration difference from the documented approach and the behavior of AWS ELBs, where the traffic must leave the AWS network and come back in.
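
For reference, a minimal sketch of the NAT-gateway plumbing that the documented template includes and this cluster's template omits (resource and parameter names here are illustrative, not taken from either template):

NatEIP:
  Type: AWS::EC2::EIP
  Properties:
    Domain: vpc
NatGateway:
  Type: AWS::EC2::NatGateway
  Properties:
    AllocationId: !GetAtt NatEIP.AllocationId
    SubnetId: !Ref PublicSubnet
PrivateDefaultRoute:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref PrivateRouteTable
    DestinationCidrBlock: 0.0.0.0/0
    NatGatewayId: !Ref NatGateway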

I wonder how hard it would be to refactor Che to use service hostnames wherever possible, and keep the public-facing route for client-side things?

@rhopp
Contributor

rhopp commented Nov 15, 2019

Latest info from my side:
I'm not very sure about all the underlying networking/solutions... but as Jianlin said they are using a proxy, I've tried to deploy CRW 2.0 with a proxy configured and it works (the server successfully started & the dashboard loaded - the keycloak redirection was working).
I wasn't able to try workspace startup and I don't have time to do that now - this will have to wait for Monday.
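
For what it's worth, a minimal sketch of the proxy fields in the CheCluster CR for such a setup (field names as I understand the 7.x operator CRD; values are placeholders based on Jianlin's env output, not the exact config used):

spec:
  server:
    proxyURL: http://ec2-3-17-157-193.us-east-2.compute.amazonaws.com
    proxyPort: '3128'
    nonProxyHosts: localhost|127.0.0.1|.svc|.cluster.local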

@ibuziuk
Member

ibuziuk commented Nov 27, 2019

The PR with the docs update, eclipse-che/che-docs#944, has been merged. Closing.

@ibuziuk ibuziuk closed this as completed Nov 27, 2019