
[BEAM-4430] Improve Performance Testing Documentation #465

Closed

Conversation

@lgajowy commented Jun 8, 2018

The previous version of these docs was not up to date with the current state of the project. I improved it a bit by adding new sections that explain in more detail how the Performance Testing Framework works. I also provided some more examples (HDFS) to show the different configurations that are feasible now.

It can be cumbersome to keep the docs up to date with new ghprb plugin phrases (line 332), because they may change very rapidly. On the other hand, I think it's better to have something (even if it gets slightly outdated over time) than to have no documentation for this at all. This is a useful feature that not many people know about. WDYT?

@chamikaramj could you take a look?
@szewi could you also take a look especially at the HDFS examples?
CC: @melap


Example run with the HDFS filesystem and Cloud Dataflow runner:

HDFS clusters require `export HADOOP_USER_NAME=root` to be set before running the `performanceTest` task.

Review comment: exporting HADOOP_USER_NAME is only required when running with the DirectRunner.

@lgajowy (author): ok

```
export HADOOP_USER_NAME=root

./gradlew integrationTest -p sdks/java/io/hadoop-input-format \
  -DintegrationTestPipelineOptions='["--project=GOOGLE_CLOUD_PROJECT", "--tempRoot=GOOGLE_STORAGE_BUCKET", "--numberOfRecords=1000", "--postgresPort=5432", "--postgresServerName=SERVER_NAME", "--postgresUsername=postgres", "--postgresPassword=PASSWORD", "--postgresDatabaseName=postgres", "--postgresSsl=false", "--runner=TestDataflowRunner"]' \
  -DintegrationTestRunner=dataflow \
  --tests=org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIOIT
```

Review comment on the `export HADOOP_USER_NAME=root` line: Please see comment above.

@lgajowy (author): ok

Example usage on HDFS filesystem and Direct runner:

Review comment: This will only work when the /etc/hosts file contains entries with the Hadoop namenode and Hadoop datanode external IPs; otherwise the user will get java.nio.channels.UnresolvedAddressException. It's worth mentioning, although this info is already in the comment section of the yml files. I would suggest at least changing this to:

Example usage on HDFS filesystem and Direct runner (with /etc/hosts entries added):

to make people aware of what needs to be done before running this with the DirectRunner.
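For illustration, the /etc/hosts additions could look like the following sketch; the hostnames and IPs here are made-up placeholders, not values from the Beam Kubernetes scripts (use the actual external IPs of the namenode and datanode pods):

```
# Map the cluster-internal Hadoop hostnames to externally reachable IPs so a
# locally running Direct runner can resolve them. All values are hypothetical.
35.225.10.11  hadoop-namenode
35.225.10.12  hadoop-datanode-1
35.225.10.13  hadoop-datanode-2
```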

@lgajowy (author): This is good advice! I'm fixing this.


### Performance testing dashboard {#performance-testing-dashboard}

We measure the performance of IOITs by gathering test execution times from Jenkins jobs that run periodically. The resulting measurements are stored in a database (BigQuery), so we can display them in the form of plots.
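To make that flow concrete, here is a sketch of how such stored timings could be queried back out of BigQuery; the project, dataset, table, and column names below are invented placeholders, not Beam's actual schema:

```
# Sketch only: all names below are hypothetical placeholders.
bq query --use_legacy_sql=false \
  'SELECT test_id, timestamp, runtime_seconds
   FROM `my-project.beam_performance.io_it_results`
   WHERE test_id = "hadoop_input_format_ioit"
   ORDER BY timestamp DESC
   LIMIT 100'
```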

Review comment: nit: mesure -> measure

@lgajowy (author): ok

@chamikaramj left a comment:

Thanks.

@@ -147,21 +147,30 @@ However, **PerfKit Benchmarker is not required for running integration tests**.

Prerequisites:
1. [Install PerfKit Benchmarker](https://github.com/GoogleCloudPlatform/PerfKitBenchmarker)
1. Have a running Kubernetes cluster you can connect to locally using kubectl
1. Have a running Kubernetes cluster you can connect to locally using kubectl. A cluster hosted on Google Kubernetes Engine might be the best fit as it is used to run the tests on Beam's Jenkins.


@chamikaramj: You mean "since Jenkins machines are authenticated to connect to GCP"?

@lgajowy (author): No - this additional sentence was just to emphasize the fact that we're using GKE on Jenkins and it works there. We are not 100% sure that everything works the same on other Kubernetes cluster alternatives (e.g. Minikube).

Now I think this sentence is misleading and doesn't bring any value. I will delete it and leave the line as it was before (it was all right).


Example run with the direct runner:
Example run with the Direct runner:

@lgajowy (author): ok


Example run with the HDFS filesystem and Cloud Dataflow runner:

@lgajowy (author): ok


1. Set up the data store corresponding to the test you wish to run. You can find Kubernetes scripts for all currently supported data stores in [.test-infra/kubernetes](https://github.com/apache/beam/tree/master/.test-infra/kubernetes).
1. In some cases, there is a setup script (*.sh). In other cases, you can just run ``kubectl create -f [scriptname]`` to create the data store.
1. Convention dictates there will be:
1. A core yml script for the data store itself, plus a `NodePort` service. The `NodePort` service opens a port to the data store for anyone who connects to the Kubernetes cluster's machines.
1. A separate script, called for-local-dev, which sets up a LoadBalancer service.
1. A yml script for the data store itself, plus a `NodePort` service. The `NodePort` service opens a port to the data store for anyone who connects to the Kubernetes cluster's machines from within the same subnetwork. Such scripts are typically useful when running on Minikube.


@chamikaramj: yml scripts can be useful for any standalone Kubernetes setup, right?

@lgajowy (author): Yes, in general, this is true. What I meant here was that when we use a NodePort service, there is no ExternalIP exposed to the "outer world" (other networks). In such a case we are not able to use datastores hosted this way if we run the test from a different network than the one the datastore is in. Typically when using Minikube, everything happens on the same machine (same network), so everything works fine (we can connect to the db).
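As a concrete illustration of the `kubectl create` step mentioned in the list above, the invocation could look like this; the exact Postgres script path is an assumption about the repo layout, not a verified reference:

```
# Hypothetical example: create a Postgres data store from its yml script.
kubectl create -f .test-infra/kubernetes/postgres/postgres.yml

# Check that the pod and its NodePort service came up.
kubectl get pods,services
```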

1. A separate script, with a LoadBalancer service. Such a service will expose an _external IP_ for the datastore. Such scripts are needed when external access is required (e.g. on Jenkins).


@chamikaramj: External access to the Kubernetes cluster?

@lgajowy (author): Yes. More precisely, to the datastore that is hosted on the Kubernetes cluster.
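To illustrate the difference this thread discusses, here is a minimal sketch of the two Service flavors; all names, selectors, and port numbers are hypothetical, not taken from the Beam scripts:

```
# Hypothetical NodePort service: reachable only via the cluster nodes' IPs,
# i.e. from within the same network as the Kubernetes cluster.
apiVersion: v1
kind: Service
metadata:
  name: postgres-for-dev
spec:
  type: NodePort
  selector:
    app: postgres
  ports:
    - port: 5432
      nodePort: 30432
---
# Hypothetical LoadBalancer service: provisions an external IP, so the
# datastore can be reached from outside the cluster (e.g. from Jenkins).
apiVersion: v1
kind: Service
metadata:
  name: postgres-external
spec:
  type: LoadBalancer
  selector:
    app: postgres
  ports:
    - port: 5432
```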

@@ -282,12 +466,11 @@ As discussed in [Integration tests, data stores, and Kubernetes](#integration-te
If you would like help with this or have other questions, contact the Beam dev@ mailing list and the community may be able to assist you.

Guidelines for creating a Beam data store Kubernetes script:
1. **You must only provide access to the data store instance via a `NodePort` service.**
* This is a requirement for security, since it means that only the local network has access to the data store. This is particularly important since many data stores don't have security on by default, and even if they do, their passwords will be checked in to our public Github repo.


@chamikaramj: Is this not required anymore?

@lgajowy (author): Short version: this is a requirement that cannot (or can only with difficulty) be satisfied on the current testing infrastructure.

Long version:
This sentence was in the docs from the very beginning of our work on the performance testing framework, and we were never able to use the NodePort service on Jenkins. This is due to the nature of the NodePort service - it doesn't allow connecting to the database from an external network.

Let's take the JDBC test as an example. It:

  1. sets up the database in the setup() method. This all happens on the Jenkins executor's/user's machine. Those machines are not in the same network as GKE, so they cannot connect when a NodePort service is used.
  2. runs the actual pipeline code. This happens on Dataflow, so only there are we in the same network as the Kubernetes cluster (GKE), and only when we are using the DataflowRunner (which will not always be the case).
  3. tears down the database. Similarly to setup (1), this is done outside Dataflow, so it happens in a separate network.

As you can see, we are unable to use setup and teardown if we run them outside Dataflow, and this is the situation on Jenkins now.

@lgajowy (author) commented Jun 28, 2018

@chamikaramj @szewi I updated the PR and responded to your comments. Sorry for doing it so late. Could you take a look again?

@szewi commented Jul 5, 2018

Ok, I will check this. By the way, maybe it's also worth mentioning that the *IOIT.java files for the different IOs contain example gradle commands. Edit: Sorry, it's already there ;)

@szewi commented Jul 5, 2018

Ok, so I went through this and it looks good to me. One thing that I'm considering is using local Kubernetes clusters for development purposes. Some users may want to recreate the infra on locally available clusters via Minikube, and of course a different port will be used, as Minikube uses ports >30000, so we need to override the default ports when running pipelines. Simple services that use a single port (like Postgres on 5432) could handle that (we just override the Postgres port 5432 with some 300xx port), but when we run complex multi-port services like HDFS, that simply won't work. What I mean is that the most suitable infra to develop against is GKE on GCP, rather than Minikube or local Kubernetes clusters. For simple datastores Minikube is ok, but for complex ones it's painful. The advantage of having Kubernetes on GCP is also the fact that the infra would be the same as the one created by Jenkins. TL;DR: we should suggest using GKE rather than local Kubernetes clusters.
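For illustration, a single-port override against a hypothetical Minikube-hosted Postgres could look like the sketch below; the 30432 value stands in for whatever NodePort Minikube actually assigned (check `minikube service list`), and the option list is illustrative, not prescriptive:

```
# Sketch only: NodePort value and option list are hypothetical placeholders.
MINIKUBE_IP=$(minikube ip)
./gradlew integrationTest -p sdks/java/io/hadoop-input-format \
  -DintegrationTestPipelineOptions='["--postgresServerName='"${MINIKUBE_IP}"'", "--postgresPort=30432", "--postgresUsername=postgres", "--postgresPassword=PASSWORD", "--postgresDatabaseName=postgres", "--postgresSsl=false", "--numberOfRecords=1000", "--runner=DirectRunner"]' \
  --tests=org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIOIT
```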

@lgajowy (author) commented Jul 5, 2018

Ok, thanks @szewi, and thanks for the effort of testing this on Minikube. Initially there was a mention in the docs that it is best to use GKE, as this is what we've used for the testing infrastructure (and know best). I deleted it, but now I think it's good to leave such a note in the docs.

Since @chamikaramj is unavailable it might be a good idea to add another reviewer here. @iemejia, do you think you could take a look too?

@lgajowy (author) commented Jul 11, 2018

@aromanenko-dev could you also take a look?

@aromanenko-dev:

LGTM with all these comments above. Thanks!

@lgajowy (author) commented Jul 13, 2018

I added a clarifying sentence to line 150. @aromanenko-dev could you take a look again?

@aromanenko-dev:

retest this please

@lgajowy (author) commented Jul 16, 2018

retest this please

@aromanenko-dev:

@asfgit merge

asfgit pushed a commit that referenced this pull request Jul 16, 2018
@asfgit commented Jul 16, 2018

Error: PR failed in verification; check the Jenkins job for more information.

@aromanenko-dev:

@asfgit merge

asfgit pushed a commit that referenced this pull request Jul 16, 2018
@asfgit commented Jul 16, 2018

Error: PR failed in verification; check the Jenkins job for more information.

@aromanenko-dev:

@asfgit merge

@asfgit closed this in ea80837 Jul 17, 2018
swegner pushed a commit to swegner/beam that referenced this pull request Sep 19, 2018