[BEAM-4430] Improve Performance Testing Documentation #465
Conversation
src/documentation/io/testing.md
Outdated
Example run with the HDFS filesystem and Cloud Dataflow runner:

HDFS clusters require `export HADOOP_USER_NAME=root` to be set before running the `performanceTest` task.
exporting HADOOP_USER_NAME is only required when running with DirectRunner
ok
src/documentation/io/testing.md
Outdated
HDFS clusters require `export HADOOP_USER_NAME=root` to be set before running the `performanceTest` task.

```
export HADOOP_USER_NAME=root
```
Please see comment above.
ok
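To make the prerequisite concrete, a minimal pre-run sketch (the `performanceTest` invocation itself is omitted; per the review above, the export is only needed for DirectRunner runs):

```shell
# The docs excerpt above uses root; per the review discussion, this is
# only required when running with the DirectRunner, not with Dataflow.
export HADOOP_USER_NAME=root
echo "HADOOP_USER_NAME=$HADOOP_USER_NAME"
```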
```
./gradlew integrationTest -p sdks/java/io/hadoop-input-format -DintegrationTestPipelineOptions='["--project=GOOGLE_CLOUD_PROJECT", "--tempRoot=GOOGLE_STORAGE_BUCKET", "--numberOfRecords=1000", "--postgresPort=5432", "--postgresServerName=SERVER_NAME", "--postgresUsername=postgres", "--postgresPassword=PASSWORD", "--postgresDatabaseName=postgres", "--postgresSsl=false", "--runner=TestDataflowRunner"]' -DintegrationTestRunner=dataflow --tests=org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIOIT
```
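As a readability aid (not part of the PR), the one-liner above can be assembled from a shell variable first; the values are the same placeholders as in the command:

```shell
# Build the pipeline options separately so each flag is visible;
# values are the same placeholders as in the one-line command above.
OPTS='["--project=GOOGLE_CLOUD_PROJECT", "--tempRoot=GOOGLE_STORAGE_BUCKET",'
OPTS+=' "--numberOfRecords=1000", "--postgresPort=5432",'
OPTS+=' "--postgresServerName=SERVER_NAME", "--postgresUsername=postgres",'
OPTS+=' "--postgresPassword=PASSWORD", "--postgresDatabaseName=postgres",'
OPTS+=' "--postgresSsl=false", "--runner=TestDataflowRunner"]'

# The actual invocation (from the Beam repository root) would then be:
#   ./gradlew integrationTest -p sdks/java/io/hadoop-input-format \
#     -DintegrationTestPipelineOptions="$OPTS" \
#     -DintegrationTestRunner=dataflow \
#     --tests=org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIOIT
echo "$OPTS"
```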
Example usage on HDFS filesystem and Direct runner:
This will only work when the /etc/hosts file contains entries with the external IPs of the Hadoop namenode and datanodes; otherwise the user will get `java.nio.channels.UnresolvedAddressException`.
It's worth mentioning, although this info is already in the comment section of the yml files. I would suggest at least changing the line to:
Example usage on HDFS filesystem and Direct runner (with /etc/hosts entries added):
to make people aware of what needs to be done before running this with the DirectRunner.
This is good advice! I'm fixing this.
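For reference, the kind of /etc/hosts entries being discussed might look like the fragment below. Both the IPs and the hostnames are placeholders; the real values come from the cluster itself (e.g. the comment sections of the yml files mentioned above):

```
35.224.10.11   hadoop-namenode
35.224.10.12   hadoop-datanode-1
35.224.10.13   hadoop-datanode-2
```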
src/documentation/io/testing.md
Outdated
### Performance testing dashboard {#performance-testing-dashboard}
We mesure the performance of IOITs by gathering test execution times from Jenkins jobs that run periodically. The consequent results are stored in a database (BigQuery), therefore we can display them in a form of plots. |
nit: mesure -> measure
ok
Thanks.
src/documentation/io/testing.md
Outdated
@@ -147,21 +147,30 @@ However, **PerfKit Benchmarker is not required for running integration tests**.

Prerequisites:
1. [Install PerfKit Benchmarker](https://github.com/GoogleCloudPlatform/PerfKitBenchmarker)
1. Have a running Kubernetes cluster you can connect to locally using kubectl
1. Have a running Kubernetes cluster you can connect to locally using kubectl. A cluster hosted on Google Kubernetes Engine might be the best fit as it is used to run the tests on Beam's Jenkins.
You mean "since Jenkins machines are authenticated to connect to GCP"?
No - this additional sentence was just to emphasize the fact that we're using GKE on Jenkins and it works there. We are not 100% sure that everything works the same on other Kubernetes cluster alternatives (e.g. minikube).
Now I think this sentence is misleading and doesn't bring any value. I will delete it and leave it as it was before (it was all right).
src/documentation/io/testing.md
Outdated
Example run with the direct runner:
Example run with the Direct runner:
DirectRunner (with a link to https://beam.apache.org/documentation/runners/direct/).
ok
Example run with the HDFS filesystem and Cloud Dataflow runner:
DataflowRunner (with a link to https://beam.apache.org/documentation/runners/dataflow/).
ok
1. Set up the data store corresponding to the test you wish to run. You can find Kubernetes scripts for all currently supported data stores in [.test-infra/kubernetes](https://github.com/apache/beam/tree/master/.test-infra/kubernetes).
1. In some cases, there is a setup script (*.sh). In other cases, you can just run ``kubectl create -f [scriptname]`` to create the data store.
1. Convention dictates there will be:
1. A core yml script for the data store itself, plus a `NodePort` service. The `NodePort` service opens a port to the data store for anyone who connects to the Kubernetes cluster's machines.
1. A separate script, called for-local-dev, which sets up a LoadBalancer service.
1. A yml script for the data store itself, plus a `NodePort` service. The `NodePort` service opens a port to the data store for anyone who connects to the Kubernetes cluster's machines from within the same subnetwork. Such scripts are typically useful when running the scripts on Minikube Kubernetes Engine.
yml scripts can be useful for any standalone Kubernetes setup, right?
Yes, in general, this is true. What I meant here was that when we use a NodePort service, there is no ExternalIP exposed to the "outer world" (other networks). In that case we are not able to use the data stores hosted this way if we run the test from a different network than the one the data store is in. Typically when using Minikube, everything happens on the same machine (same network), so everything works fine (we can connect to the db).
1. A core yml script for the data store itself, plus a `NodePort` service. The `NodePort` service opens a port to the data store for anyone who connects to the Kubernetes cluster's machines.
1. A separate script, called for-local-dev, which sets up a LoadBalancer service.
1. A yml script for the data store itself, plus a `NodePort` service. The `NodePort` service opens a port to the data store for anyone who connects to the Kubernetes cluster's machines from within the same subnetwork. Such scripts are typically useful when running the scripts on Minikube Kubernetes Engine.
1. A separate script, with a LoadBalancer service. Such a service will expose an _external ip_ for the data store. Such scripts are needed when external access is required (e.g. on Jenkins).
External access to the Kubernetes cluster?
Yes. More precisely, the datastore that is hosted on the Kubernetes cluster.
@@ -282,12 +466,11 @@ As discussed in [Integration tests, data stores, and Kubernetes](#integration-tests-data-stores-and-kubernetes)
If you would like help with this or have other questions, contact the Beam dev@ mailing list and the community may be able to assist you.

Guidelines for creating a Beam data store Kubernetes script:
1. **You must only provide access to the data store instance via a `NodePort` service.**
* This is a requirement for security, since it means that only the local network has access to the data store. This is particularly important since many data stores don't have security on by default, and even if they do, their passwords will be checked in to our public Github repo.
Is this not required any more ?
Short version: this is a requirement that cannot be (or is hard to be) satisfied on the current testing infrastructure.
Long version:
This sentence has been in the docs from the very beginning of our work on the performance testing framework, and we were never able to use the NodePort service on Jenkins. This is due to the nature of NodePort services - they don't allow connecting to the database from an external network.
Let's take the JDBC test as an example. It:
- sets up the database in its setup() method. This all happens on the Jenkins executor's/user's machine. Those machines are not in the same network as GKE, so they cannot connect when a NodePort service is used
- runs the actual pipeline code. This happens on Dataflow, so only there are we in the same network as the Kubernetes cluster (GKE), and only when we are using the DataflowRunner (which will not always be the case)
- tears down the database. Similarly to setup (1), this is done outside Dataflow, so it happens in a separate network
As you can see, we are unable to run setup and teardown if they execute outside Dataflow, and this is the situation on Jenkins now.
Force-pushed 0413f17 to 4fea52c
@chamikaramj @szewi I updated the PR and responded to your comments. Sorry for doing it so late. Could you take a look again?
Ok, I will check this. By the way, maybe it's also worth mentioning that
Ok, so I went through this and it looks good to me. One thing I'm considering is using local Kubernetes clusters for development purposes. Some users may want to recreate the infra on locally available clusters via Minikube, and of course a different port will be used, as Minikube uses ports >30000, so we need to override the default ports when running pipelines. Simple services that use a single port (like Postgres on 5432) could handle that (we just override Postgres port 5432 with some 300xx port), but when we run complex multi-port services like HDFS, that simply won't work. What I mean is that the most suitable infra to develop on is GKE on GCP, rather than Minikube or local Kubernetes clusters. For simple data stores Minikube is ok, but for complex ones it's painful. The advantage of having Kubernetes on GCP is also the fact that the infra would be the same as the one created by Jenkins. TL;DR: we should suggest using GKE rather than local Kubernetes clusters.
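The port-override point above can be sketched in shell. The options string and the NodePort value 30432 are illustrative assumptions, not values from the PR:

```shell
# Default pipeline options, as used against a cluster-hosted Postgres
# listening on its standard port.
PG_OPTS='["--postgresPort=5432", "--postgresServerName=SERVER_NAME"]'

# On Minikube the service is exposed on a NodePort in the 30000+ range,
# so a single-port service only needs its port substituted
# (hypothetical NodePort 30432 shown):
MINIKUBE_OPTS="${PG_OPTS/5432/30432}"
echo "$MINIKUBE_OPTS"
```

Multi-port services like HDFS have no such single substitution point, which is why the comment recommends GKE for those.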
Ok, thanks @szewi, and thanks for the effort of testing this on Minikube. Initially there was a mention in the docs that it is best to use GKE, as this is what we've used for the testing infrastructure (and know best). I deleted it, but now I think it's good to leave such a note in the docs. Since @chamikaramj is unavailable, it might be a good idea to add another reviewer here. @iemejia, do you think you could take a look too?
@aromanenko-dev could you also take a look?
LGTM with all these comments above. Thanks!
Force-pushed 4fea52c to 6df18ab
I added a clarifying sentence to line 150. @aromanenko-dev could you take a look again?
retest this please
@asfgit merge
Error: PR failed in verification; check the Jenkins job for more information.
@asfgit merge
Error: PR failed in verification; check the Jenkins job for more information.
@asfgit merge
The previous version of these docs was not up to date with the current state of the project. I improved it a bit by adding new sections that explain more about how the Performance Testing Framework works. I also provided more examples (HDFS) to show the different configurations that are now feasible.
It can be cumbersome to keep the docs up to date with new ghprb plugin phrases (line 332), because they may change very rapidly. On the other hand, I think it's better to have something (even if it gets slightly outdated over time) than to have no documentation for this at all. This is a useful feature that not many people know about. WDYT?
@chamikaramj could you take a look?
@szewi could you also take a look especially at the HDFS examples?
CC: @melap