
[BEAM-4430] Improve Performance Testing Documentation #465

Closed

Conversation

@lgajowy commented Jun 8, 2018

The previous version of these docs was not up to date with the current state of the project. I improved it a bit by adding new sections that explain in more detail how the Performance Testing Framework works. I also provided some more examples (HDFS) to show the different configurations that are feasible now.

It can be cumbersome to keep the docs up to date with new ghprb plugin phrases (line 332), because they may change very rapidly. On the other hand, I think it's better to have something (even if it gets slightly outdated over time) than to have no documentation for this at all. This is a useful feature that not many people know about. WDYT?

@chamikaramj could you take a look?
@szewi could you also take a look especially at the HDFS examples?
CC: @melap


Example run with the HDFS filesystem and Cloud Dataflow runner:

HDFS clusters require `export HADOOP_USER_NAME=root` to be set before running the `performanceTest` task.

Review comment: exporting HADOOP_USER_NAME is only required when running with the DirectRunner.

@lgajowy (author): ok

```
export HADOOP_USER_NAME=root

./gradlew integrationTest -p sdks/java/io/hadoop-input-format \
  -DintegrationTestPipelineOptions='["--project=GOOGLE_CLOUD_PROJECT", "--tempRoot=GOOGLE_STORAGE_BUCKET", "--numberOfRecords=1000", "--postgresPort=5432", "--postgresServerName=SERVER_NAME", "--postgresUsername=postgres", "--postgresPassword=PASSWORD", "--postgresDatabaseName=postgres", "--postgresSsl=false", "--runner=TestDataflowRunner"]' \
  -DintegrationTestRunner=dataflow \
  --tests=org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIOIT
```

Review comment on the `export HADOOP_USER_NAME=root` line: Please see comment above.

@lgajowy (author): ok

Example usage on HDFS filesystem and Direct runner:

Review comment: This will only work when the /etc/hosts file contains entries with the Hadoop namenode and Hadoop datanode external IPs; otherwise the user will get java.nio.channels.UnresolvedAddressException. It's worth mentioning, although this info is already in the comment section of the yml files. I would suggest at least changing this to:

Example usage on HDFS filesystem and Direct runner (with /etc/hosts entries added):

to make people aware of what needs to be done before running this with the DirectRunner.
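For illustration, the /etc/hosts additions could look like the following sketch; the hostnames and IPs here are made-up placeholders, not values from the Beam Kubernetes scripts (use the actual external IPs of the namenode and datanode pods):

```
# Map the cluster-internal Hadoop hostnames to externally reachable IPs so a
# locally running Direct runner can resolve them. All values are hypothetical.
35.225.10.11  hadoop-namenode
35.225.10.12  hadoop-datanode-1
35.225.10.13  hadoop-datanode-2
```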

@lgajowy (author): This is good advice! I'm fixing this.


### Performance testing dashboard {#performance-testing-dashboard}

We measure the performance of IOITs by gathering test execution times from Jenkins jobs that run periodically. The resulting measurements are stored in a database (BigQuery), so we can display them in the form of plots.
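To make that flow concrete, here is a sketch of how such stored timings could be queried back out of BigQuery; the project, dataset, table, and column names below are invented placeholders, not Beam's actual schema:

```
# Sketch only: all names below are hypothetical placeholders.
bq query --use_legacy_sql=false \
  'SELECT test_id, timestamp, runtime_seconds
   FROM `my-project.beam_performance.io_it_results`
   WHERE test_id = "hadoop_input_format_ioit"
   ORDER BY timestamp DESC
   LIMIT 100'
```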

Review comment: nit: mesure -> measure

@lgajowy (author): ok

@chamikaramj left a comment:

Thanks.

@@ -147,21 +147,30 @@ However, **PerfKit Benchmarker is not required for running integration tests**.

Prerequisites:
1. [Install PerfKit Benchmarker](https://github.com/GoogleCloudPlatform/PerfKitBenchmarker)
1. Have a running Kubernetes cluster you can connect to locally using kubectl
1. Have a running Kubernetes cluster you can connect to locally using kubectl. A cluster hosted on Google Kubernetes Engine might be the best fit as it is used to run the tests on Beam's Jenkins.


@chamikaramj: You mean "since Jenkins machines are authenticated to connect to GCP"?

@lgajowy (author): No - this additional sentence was just to emphasize the fact that we're using GKE on Jenkins and it works there. We are not 100% sure that everything works the same on other Kubernetes cluster alternatives (e.g. Minikube).

Now I think this sentence is misleading and doesn't bring any value. I will delete it and leave the line as it was before (it was all right).


Example run with the direct runner:
Example run with the Direct runner:

@lgajowy (author): ok


Example run with the HDFS filesystem and Cloud Dataflow runner:

@lgajowy (author): ok


1. Set up the data store corresponding to the test you wish to run. You can find Kubernetes scripts for all currently supported data stores in [.test-infra/kubernetes](https://github.com/apache/beam/tree/master/.test-infra/kubernetes).
1. In some cases, there is a setup script (*.sh). In other cases, you can just run ``kubectl create -f [scriptname]`` to create the data store.
1. Convention dictates there will be:
1. A core yml script for the data store itself, plus a `NodePort` service. The `NodePort` service opens a port to the data store for anyone who connects to the Kubernetes cluster's machines.
1. A separate script, called for-local-dev, which sets up a LoadBalancer service.
1. A yml script for the data store itself, plus a `NodePort` service. The `NodePort` service opens a port to the data store for anyone who connects to the Kubernetes cluster's machines from within the same subnetwork. Such scripts are typically useful when running on Minikube.


@chamikaramj: yml scripts can be useful for any standalone Kubernetes setup, right?

@lgajowy (author): Yes, in general, this is true. What I meant here was that when we use a NodePort service, there is no ExternalIP exposed to the "outer world" (other networks). In such a case we are not able to use datastores hosted this way if we run the test from a different network than the one the datastore is in. Typically when using Minikube, everything happens on the same machine (same network), so everything works fine (we can connect to the db).
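As a concrete illustration of the `kubectl create` step mentioned in the list above, the invocation could look like this; the exact Postgres script path is an assumption about the repo layout, not a verified reference:

```
# Hypothetical example: create a Postgres data store from its yml script.
kubectl create -f .test-infra/kubernetes/postgres/postgres.yml

# Check that the pod and its NodePort service came up.
kubectl get pods,services
```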

1. A separate script, with a LoadBalancer service. Such a service will expose an _external IP_ for the datastore. Such scripts are needed when external access is required (e.g. on Jenkins).


@chamikaramj: External access to the Kubernetes cluster?

@lgajowy (author): Yes. More precisely, to the datastore that is hosted on the Kubernetes cluster.
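To illustrate the difference this thread discusses, here is a minimal sketch of the two Service flavors; all names, selectors, and port numbers are hypothetical, not taken from the Beam scripts:

```
# Hypothetical NodePort service: reachable only via the cluster nodes' IPs,
# i.e. from within the same network as the Kubernetes cluster.
apiVersion: v1
kind: Service
metadata:
  name: postgres-for-dev
spec:
  type: NodePort
  selector:
    app: postgres
  ports:
    - port: 5432
      nodePort: 30432
---
# Hypothetical LoadBalancer service: provisions an external IP, so the
# datastore can be reached from outside the cluster (e.g. from Jenkins).
apiVersion: v1
kind: Service
metadata:
  name: postgres-external
spec:
  type: LoadBalancer
  selector:
    app: postgres
  ports:
    - port: 5432
```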

@@ -282,12 +466,11 @@ As discussed in [Integration tests, data stores, and Kubernetes](#integration-te
If you would like help with this or have other questions, contact the Beam dev@ mailing list and the community may be able to assist you.

Guidelines for creating a Beam data store Kubernetes script:
1. **You must only provide access to the data store instance via a `NodePort` service.**
* This is a requirement for security, since it means that only the local network has access to the data store. This is particularly important since many data stores don't have security on by default, and even if they do, their passwords will be checked in to our public Github repo.


@chamikaramj: Is this not required anymore?

@lgajowy (author): Short version: this is a requirement that cannot (or can only with difficulty) be satisfied on the current testing infrastructure.

Long version:
This sentence was in the docs from the very beginning of our work on the performance testing framework, and we were never able to use the NodePort service on Jenkins. This is due to the nature of the NodePort service - it doesn't allow connecting to the database from an external network.

Let's take the JDBC test as an example. It:

  1. sets up the database in the setup() method. This all happens on the Jenkins executor's/user's machine. Those machines are not in the same network as GKE, so they cannot connect when a NodePort service is used.
  2. runs the actual pipeline code. This happens on Dataflow, so only there are we in the same network as the Kubernetes cluster (GKE), and only when we are using the DataflowRunner (which will not always be the case).
  3. tears down the database. Similarly to setup (1), this is done outside Dataflow, so it happens in a separate network.

As you can see, we are unable to use setup and teardown if we run them outside Dataflow, and this is the situation on Jenkins now.

@lgajowy (author) commented Jun 28, 2018

@chamikaramj @szewi I updated the PR and responded to your comments. Sorry for doing it so late. Could you take a look again?

@szewi commented Jul 5, 2018

Ok, I will check this. By the way, maybe it's also worth mentioning that the *IOIT.java files for the different IOs contain example gradle commands. Edit: Sorry, it's already there ;)

@szewi commented Jul 5, 2018

Ok, so I went through this and it looks good to me. One thing that I'm considering is using local Kubernetes clusters for development purposes. Some users may want to recreate the infra on locally available clusters via Minikube, and of course a different port will be used, as Minikube uses ports >30000, so we need to override the default ports when running pipelines. Simple services that use a single port (like Postgres on 5432) could handle that (we just override the Postgres port 5432 with some 300xx port), but when we run complex multi-port services like HDFS, that simply won't work. What I mean is that the most suitable infra to develop against is GKE on GCP, rather than Minikube or local Kubernetes clusters. For simple datastores Minikube is ok, but for complex ones it's painful. The advantage of having Kubernetes on GCP is also the fact that the infra would be the same as the one created by Jenkins. TL;DR: we should suggest using GKE rather than local Kubernetes clusters.
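For illustration, a single-port override against a hypothetical Minikube-hosted Postgres could look like the sketch below; the 30432 value stands in for whatever NodePort Minikube actually assigned (check `minikube service list`), and the option list is illustrative, not prescriptive:

```
# Sketch only: NodePort value and option list are hypothetical placeholders.
MINIKUBE_IP=$(minikube ip)
./gradlew integrationTest -p sdks/java/io/hadoop-input-format \
  -DintegrationTestPipelineOptions='["--postgresServerName='"${MINIKUBE_IP}"'", "--postgresPort=30432", "--postgresUsername=postgres", "--postgresPassword=PASSWORD", "--postgresDatabaseName=postgres", "--postgresSsl=false", "--numberOfRecords=1000", "--runner=DirectRunner"]' \
  --tests=org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIOIT
```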

@lgajowy (author) commented Jul 5, 2018

Ok, thanks @szewi, and thanks for the effort of testing this on Minikube. Initially there was a mention in the docs that it is best to use GKE, as this is what we've used for the testing infrastructure (and know best). I deleted it, but now I think it's good to leave such a note in the docs.

Since @chamikaramj is unavailable it might be a good idea to add another reviewer here. @iemejia, do you think you could take a look too?

@lgajowy (author) commented Jul 11, 2018

@aromanenko-dev could you also take a look?

@aromanenko-dev:

LGTM with all these comments above. Thanks!

@lgajowy (author) commented Jul 13, 2018

I added a clarifying sentence to line 150. @aromanenko-dev could you take a look again?

@aromanenko-dev:

retest this please

@lgajowy (author) commented Jul 16, 2018

retest this please

@aromanenko-dev:

@asfgit merge

asfgit pushed a commit that referenced this pull request Jul 16, 2018
@asfgit commented Jul 16, 2018

Error: PR failed in verification; check the Jenkins job for more information.

@aromanenko-dev:

@asfgit merge

asfgit pushed a commit that referenced this pull request Jul 16, 2018
@asfgit commented Jul 16, 2018

Error: PR failed in verification; check the Jenkins job for more information.

@aromanenko-dev:

@asfgit merge

@asfgit closed this in ea80837 Jul 17, 2018
swegner pushed a commit to swegner/beam that referenced this pull request Sep 19, 2018