
Route DNS hostnames not routable in airgap scenario so Che fails to start #15187

Closed
3 of 23 tasks
tomgeorge opened this issue Nov 13, 2019 · 13 comments
Labels
area/install: Issues related to installation, including offline/air gap and initial setup
kind/bug: Outline of a bug - must adhere to the bug report template.
severity/P1: Has a major impact to usage or development of the system.

Comments

@tomgeorge
Contributor

Describe the bug

Depending on the network topology or DNS servers, a fully disconnected installation will, in some instances, be unable to resolve route URLs inside the cluster. This manifests as the Che server pod failing to retrieve the OpenID configuration at $PUBLIC_KEYCLOAK_URL/auth/realms/che/.well-known/openid-configuration.

I don't know exactly how OpenShift does DNS in different environments. I would expect in-cluster traffic to be able to resolve a route properly, but that does not appear to be the case in all scenarios.

curl $KEYCLOAK_ROUTE_URL/auth/realms/che/.well-known/openid-configuration times out, but

curl keycloak.namespace.svc:8080/auth/realms/che/.well-known/openid-configuration succeeds
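
One way to reproduce this comparison from inside the cluster is a throwaway pod (a sketch; the image choice is illustrative and not from the original report):

$ oc run route-check -it --rm --restart=Never \
    --image=registry.access.redhat.com/ubi8/ubi -- \
    curl -sv --max-time 10 \
    "$KEYCLOAK_ROUTE_URL/auth/realms/che/.well-known/openid-configuration"
$ oc run svc-check -it --rm --restart=Never \
    --image=registry.access.redhat.com/ubi8/ubi -- \
    curl -sv --max-time 10 \
    http://keycloak.namespace.svc:8080/auth/realms/che/.well-known/openid-configuration

If the first call times out while the second succeeds, the pod can reach the service network but not the route's external address.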

Che version

  • latest
  • nightly
  • other: please specify

Steps to reproduce

Start a Che installation in a disconnected environment.

Expected behavior

Runtime

  • kubernetes (include output of kubectl version)
  • OpenShift (include output of oc version)
  • minikube (include output of minikube version and kubectl version)
  • minishift (include output of minishift version and oc version)
  • docker-desktop + K8S (include output of docker version and kubectl version)
  • other: (please specify)

Screenshots

Installation method

  • chectl
  • che-operator 7.4.0
  • minishift-addon
  • I don't know

Environment

  • my computer
    • Windows
    • Linux
    • macOS
  • Cloud
    • Amazon
    • Azure
    • GCE
    • other (please specify)
  • other: please specify

Additional context

@tomgeorge tomgeorge added the kind/bug Outline of a bug - must adhere to the bug report template. label Nov 13, 2019
@che-bot che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Nov 13, 2019
@tomgeorge
Contributor Author

tomgeorge commented Nov 13, 2019

Alternatively, it may be possible to override these URLs by setting customCheProperties fields in the CheCluster CR. I am working on the list of those properties.
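
For example, a minimal sketch of such an override (the property name is the one used later in this thread; the CheCluster name and namespace are illustrative):

$ oc patch checluster eclipse-che -n che --type=merge \
    -p '{"spec":{"server":{"customCheProperties":{"CHE_KEYCLOAK_AUTH__SERVER__URL":"http://keycloak.che.svc:8080/auth"}}}}'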

@tomgeorge tomgeorge changed the title from "Che server should query the kubernetes service URL of keycloak, instead of the public facing route" to "Route DNS entries not resolvable in airgap scenario so Che fails to start" Nov 14, 2019
@rhopp
Contributor

rhopp commented Nov 14, 2019

The same problem happens much sooner when installing with TLS and a self-signed certificate: while extracting the certificate, the operator creates a temporary route, but this route is not accessible:

time="2019-11-14T14:19:59Z" level=info msg="Creating a test route test to extract router crt" 
time="2019-11-14T14:19:59Z" level=info msg="Creating a new object Route, name: test" 
time="2019-11-14T14:20:29Z" level=error msg="An error occurred when reaching test TLS route: Get https://test-rhopp-air-gap-crw.apps.rhopp-airgap.qe.devcluster.openshift.com: dial tcp 3.130.28.239:443: i/o timeout" 
time="2019-11-14T14:20:29Z" level=error msg="Failed to extract crt. Failed to create a secret with a self signed crt: Get https://test-rhopp-air-gap-crw.apps.rhopp-airgap.qe.devcluster.openshift.com: dial tcp 3.130.28.239:443: i/o timeout" 
time="2019-11-14T14:20:30Z" level=info msg="Creating a test route test to extract router crt" 
time="2019-11-14T14:20:31Z" level=error msg="An error occurred when reaching test TLS route: Get https:: http: no Host in request URL" 
time="2019-11-14T14:20:31Z" level=error msg="Failed to extract crt. Failed to create a secret with a self signed crt: Get https:: http: no Host in request URL" 
time="2019-11-14T14:20:32Z" level=info msg="Creating a test route test to extract router crt" 
time="2019-11-14T14:20:33Z" level=error msg="An error occurred when reaching test TLS route: Get https:: http: no Host in request URL" 
time="2019-11-14T14:20:33Z" level=error msg="Failed to extract crt. Failed to create a secret with a self signed crt: Get https:: http: no Host in request URL" 
time="2019-11-14T14:20:34Z" level=info msg="Creating a test route test to extract router crt" 
...
...
...
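
As a possible workaround for the certificate extraction itself, the router's default certificate can be read directly from the ingress operator's secret instead of going through a test route (a sketch; assumes the router-ca secret that the OpenShift 4 ingress operator creates for self-signed setups):

$ oc get secret router-ca -n openshift-ingress-operator \
    -o jsonpath='{.data.tls\.crt}' | base64 -d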

@sleshchenko
Member

@rhopp eclipse-che/che-operator@db15bdb may be a fix for the issue you mentioned, but I'm not sure

@rhopp
Contributor

rhopp commented Nov 14, 2019

I've been able to successfully start the che-server using the k8s internal DNS name of the keycloak service (in my case CHE_KEYCLOAK_AUTH__SERVER__URL: 'http://keycloak.rhopp-air-gap-crw.svc:8080/auth').

But then (as expected) the dashboard wasn't able to load (with the typical message Authorization token is missed), because my browser couldn't resolve that URL.

@davidfestal
Contributor

davidfestal commented Nov 14, 2019

@rhopp @tomgeorge Would it be possible to check with the OpenShift teams whether it is expected that typical airgapped OpenShift 4.2 installations would not allow pods to access external routes?
That seems like a very hard restriction that would probably make Che fail anyway.

@ironcladlou

By default, on AWS, GCP, and Azure, if cluster DNS zone configuration was provided to the OpenShift installer, OpenShift will manage wildcard DNS records for ingress in the configured zones (assuming ingress is being exposed by a LoadBalancer Service, which is the default on those platforms.)

On other platforms, or if cluster DNS zone configuration is omitted, wildcard DNS records for ingress are not managed and it's up to the cluster owner to configure DNS to expose ingress (if desired.)

I hope that helps clarify some of the DNS management behavior. I can provide more specific details if someone can help me understand how the problematic clusters are being created (e.g. through the OpenShift installer IPI flow, UPI, etc.)
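
A quick way to see which of these cases applies on a given cluster is to inspect the DNS config and the ingress Service (a sketch):

# spec.publicZone / spec.privateZone are set only when zone config was provided to the installer
$ oc get dns.config.openshift.io cluster -o yaml
# TYPE LoadBalancer means ingress is exposed via a LoadBalancer Service
$ oc -n openshift-ingress get service router-default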

@tomgeorge tomgeorge changed the title from "Route DNS entries not resolvable in airgap scenario so Che fails to start" to "Route DNS hostnames not routable in airgap scenario so Che fails to start" Nov 14, 2019
@tomgeorge
Contributor Author

Thanks to @ironcladlou for looking into this with me. The issue appears to be that the traffic is rejected by the LB or on the way back to the node. We should look at the way this cluster was configured in QE and see whether it matches the installation procedure in https://docs.openshift.com/container-platform/4.2/installing/installing_restricted_networks/installing-restricted-networks-aws.html

@tolusha tolusha added area/install Issues related to installation, including offline/air gap and initial setup severity/P1 Has a major impact to usage or development of the system. and removed status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. labels Nov 15, 2019
@rhopp
Contributor

rhopp commented Nov 15, 2019

Response from Jianlin Liu, who knows how the cluster is configured:

Ah, I know what the problem is there.
In this disconnected env, we enabled proxy settings so that some core operators can connect to the cloud API.
cluster-image-registry-operator loaded these proxy settings, which is why my testing is going well.
sh-4.2$ env|grep -i proxy
NO_PROXY=.cluster.local,.svc,.us-east-2.compute.internal,10.0.0.0/16,10.128.0.0/14,127.0.0.1,169.254.169.254,172.30.0.0/16,api-int.rhopp-airgap.qe.devcluster.openshift.com,api.rhopp-airgap.qe.devcluster.openshift.com,etcd-0.rhopp-airgap.qe.devcluster.openshift.com,etcd-1.rhopp-airgap.qe.devcluster.openshift.com,etcd-2.rhopp-airgap.qe.devcluster.openshift.com,localhost,test.no-proxy.com
HTTPS_PROXY=http://ec2-3-17-157-193.us-east-2.compute.amazonaws.com:3128
HTTP_PROXY=http://ec2-3-17-157-193.us-east-2.compute.amazonaws.com:3128

In this disconnected env, we drop all internet connections from the subnets to create an airgap env. The apps DNS points to an external ELB provisioned by the ingress operator.
So that means you are trying to access an internet URL from inside the cluster. I think that is expected behavior.

Personally, the apps URL is mainly for external user access; if you want to access a cluster service from inside the cluster, why not use the k8s svc endpoints?
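
For reference, the cluster-wide proxy settings those operators consume can be inspected directly on recent OpenShift 4 clusters (a sketch):

$ oc get proxy.config.openshift.io cluster -o yaml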

@ironcladlou

If I understand the setup correctly and you really want to use Routes on an internal subnet (i.e. routes that can be accessed only within the private subnet), then with OpenShift 4.2 you can try replacing the default ingresscontroller with an internally-scoped variant that provisions the LB on the cluster's private subnet, e.g.

$ oc replace --force --wait --filename - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  namespace: openshift-ingress-operator
  name: default
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: Internal
EOF

(See the Kubernetes documentation on internal load balancers for more detail on how this works.)
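
One way to verify the replacement took effect (a sketch; on AWS an internal scope shows up as an internal ELB):

$ oc -n openshift-ingress-operator get ingresscontroller default \
    -o jsonpath='{.spec.endpointPublishingStrategy.loadBalancer.scope}'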

@ibuziuk
Member

ibuziuk commented Nov 15, 2019

@rhopp @jianlinliu could you please clarify the expected & recommended OCP installation setup/config in airgap mode regarding DNS / LB? If I understand correctly, we face this issue because DNS resolution of routes on the QA cluster happens on the public internet, and the only way to communicate is using the service name + port combo. What I do not understand is how come OCP in airgap mode falls back to the public internet for route resolution? Shouldn't it use internal DNS by default?

@tomgeorge
Contributor Author

tomgeorge commented Nov 15, 2019

This document is the best thing that we have for airgap/restricted network installations: https://docs.openshift.com/container-platform/4.2/installing/installing_restricted_networks/installing-restricted-networks-aws.html

The issue is not actually DNS resolution but rather that there is no route for traffic to exit the cluster and return through the ELB. After looking at the templates in http://git.app.eng.bos.redhat.com/git/openshift-misc.git/plain/v3-launch-templates/functionality-testing/aos-4_2/hosts/upi_on_aws-cloudformation-templates/ it looks like the cluster is using Route53 for DNS resolution.

I went through the CloudFormation templates used in this installation and compared them with the ones in the docs, and found that the only differences were in the VPC/networking configuration. The documented CloudFormation stack had:

  • AWS::EC2::NatGateways tied to the public subnets
  • AWS::EC2::EIPs in the VPC's domain
  • AWS::EC2::Routes from the private routing tables to the NatGateway

The template used in cluster provisioning did not have these resources. Additionally, the template used in installation had:

  • AWS::EC2::SecurityGroup Allowing ingress from all protocols to the VPC CIDR range
  • A VPCEndpoint in the VPC to the Services com.amazonaws.${AWS_REGION}.ec2 and com.amazonaws.${AWS_REGION}.elasticloadbalancing

Could the lack of AWS Routes from the private subnets to the NAT Gateway be the cause of this? The docs seem to indicate that not all of these resources are necessary in a restricted network environment:

You must have a public internet gateway, with public routes, attached to the VPC. In the provided templates, each public subnet has a NAT gateway with an EIP address. These NAT gateways allow cluster resources, like private-subnet instances, to reach the internet and are not required for some restricted network or proxy scenarios.

So it seems to be a combination of a configuration difference from the documented approach and the behavior of AWS ELBs, where the traffic must leave the AWS network and come back in.
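
For reference, a minimal sketch of the NAT-gateway plumbing that the documented template includes and this cluster's template omits (resource and parameter names here are illustrative, not taken from either template):

NatEIP:
  Type: AWS::EC2::EIP
  Properties:
    Domain: vpc
NatGateway:
  Type: AWS::EC2::NatGateway
  Properties:
    AllocationId: !GetAtt NatEIP.AllocationId
    SubnetId: !Ref PublicSubnet
PrivateDefaultRoute:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref PrivateRouteTable
    DestinationCidrBlock: 0.0.0.0/0
    NatGatewayId: !Ref NatGateway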

I wonder how hard it would be to refactor Che to use service hostnames wherever possible, and keep the public-facing route for client-side things?

@rhopp
Contributor

rhopp commented Nov 15, 2019

Latest info from my side:
I'm not very sure about all the underlying networking/solutions... but as Jianlin said they are using a proxy, I've tried to deploy CRW 2.0 with a proxy configured and it works (the server successfully started & the dashboard loaded - the keycloak redirection was working).
I wasn't able to try workspace startup and I don't have time to do that now - this will have to wait for Monday.
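
For what it's worth, a minimal sketch of the proxy fields in the CheCluster CR for such a setup (field names as I understand the 7.x operator CRD; values are placeholders based on Jianlin's env output, not the exact config used):

spec:
  server:
    proxyURL: http://ec2-3-17-157-193.us-east-2.compute.amazonaws.com
    proxyPort: '3128'
    nonProxyHosts: localhost|127.0.0.1|.svc|.cluster.local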

@ibuziuk
Member

ibuziuk commented Nov 27, 2019

The PR with the docs update, eclipse-che/che-docs#944, has been merged. Closing.

@ibuziuk ibuziuk closed this as completed Nov 27, 2019