
Benchmark repeat parameter and results pushed to s3 #406

Merged: 10 commits merged into develop from feature-benchmark-repeat on Aug 23, 2019

Conversation

@gusmith (Contributor) commented Aug 20, 2019

Two main features are added at the same time (they could have been split):

  • The benchmark now uses boto3 to push its results to s3, instead of relying on k8s volumes, which are hard to access afterwards.
  • A parameter repetition has been added to the experiment schema so that the same experiment can be repeated a number of times within a single benchmarking k8s job. This should reduce the number of issues we observed when running the 1M*1M linkage 100 times: previously we repeated the same k8s job 100 times, which can run into problems with volumes that cannot be shared.
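For illustration, a minimal sketch of what the repetition field could look like, assuming the experiment schema is expressed as jsonschema (only the field name and its constraints come from this PR; the surrounding structure is assumed):

    # Hypothetical jsonschema fragment: "repetition" must be an integer >= 1.
    REPETITION_FIELD = {
        "type": "integer",
        "minimum": 1,
    }

    def get_repetition(experiment):
        # jsonschema validation does not inject defaults, so apply the
        # documented default of 1 explicitly when the field is absent.
        return experiment.get("repetition", 1)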

Guillaume Smith added 4 commits August 20, 2019 18:41
This is an optional field, which needs to be an integer greater than or equal to 1.
If absent, it defaults to 1.
Used to post the results to s3.
@gusmith gusmith requested a review from hardbyte August 20, 2019 08:53
@gusmith gusmith self-assigned this Aug 20, 2019
@hardbyte (Collaborator) left a comment

I really like the addition of uploading results to an object store. Just a few small suggestions from me.

benchmarking/benchmark.py (review thread resolved)
    def push_result_s3(experiment_file):
        client = boto3.client(
            's3',
            aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
@hardbyte (Collaborator):

My hunch is we should be agnostic, as multiple services offer the S3 API (including MinIO, which we already use in this project).

How about OBJECT_STORE_SERVER, OBJECT_STORE_ACCESS_KEY, OBJECT_STORE_BUCKET etc as the environment variables? And open an issue to change them in the backend settings from MINIO_ - https://github.com/data61/anonlink-entity-service/blob/develop/backend/entityservice/settings.py#L27

@gusmith (Contributor, author):

I'm actually not convinced. My reasoning is the following:

  • we do not support anything other than AWS in the benchmark
  • the benchmark is fully independent of the main repository, so variable names from the main repo should not drive choices made for the benchmark (even if consistency may be a good idea)
  • but if we are thinking about both repos together, I would actually make them even more specific, such as AWS_BENCHMARK_..., to be sure they cannot be mistaken for any context other than the chosen one. I would also point out that if we implement a script doing everything at once (deploying and benchmarking), it's good to use different environment variable names so that a mis-configuration cannot lead to misuse of a token. E.g. if I create a script deploying the entity service that sets the env var OBJECT_STORE_ACCESS_KEY, and the script also starts the benchmark but I forget to update the env var, I would push the results to the same bucket. Here that is not important as we are not deleting anything, but if we were, we could do really bad things...
  • would there be a scenario where a single service has multiple keys? E.g. one key for bucket x and one key for bucket y? The underlying question is whether we prefer keys per application or keys per use-case (the application can do x, y and z; or key a can do x, key b can do y and key c can do z, and I give the application the keys a, b and c). Both have pros and cons, but we may want to think about it.
  • if we generalise to OBJECT_STORE_SERVER, I think we will also need an extra field OBJECT_STORE_TYPE, in which case we may also need to modify the description of the env variables: one object store type may not use a bucket and access key, but something totally different (I have no clue what that could be, but maybe :) ). This field would also help to know which service to use.

But the more options we add, the more I think we should push them into a command-line tool instead of env vars.

While not convinced, I'm also not strongly opinionated on this, so happy to change if that's the preference :)

@hardbyte (Collaborator):

I'm not too fussed about the variable names, but it would be nice to include a way to set the object store server so the benchmark user can decide for themselves.

See the docs at https://docs.min.io/docs/minio-select-api-quickstart-guide.html: it looks like all you would need is to add an endpoint_url, set only if an environment variable is present, e.g. S3_SERVER or (my preference) OBJECT_STORE_SERVER.

s3 = boto3.client('s3',
                  endpoint_url='http://localhost:9000',
                  aws_access_key_id='minio',
                  aws_secret_access_key='minio123',
                  )

I don't think we need to tell the benchmark component the TYPE of object store - that would defeat the purpose of an abstraction - in this case the S3 API.

@gusmith (Contributor, author):

OBJECT_STORE_SERVER added, and the other ones renamed to OBJECT_STORE_ACCESS_KEY, OBJECT_STORE_SECRET_KEY and OBJECT_STORE_BUCKET.

            aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY')
        )
        s3_bucket = "anonlink-benchmark-result"
        client.upload_file(experiment_file, s3_bucket, "results.json")
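Following the thread above, a minimal sketch of how the client construction could look after the rename, with the endpoint passed only when configured (the OBJECT_STORE_* names come from the comments; the helper itself is assumed):

    import os
    import boto3

    def make_object_store_client():
        # Only pass endpoint_url when the user configured a server, so plain
        # AWS S3 keeps working by default and MinIO can be targeted explicitly.
        client_kwargs = {
            'aws_access_key_id': os.getenv('OBJECT_STORE_ACCESS_KEY'),
            'aws_secret_access_key': os.getenv('OBJECT_STORE_SECRET_KEY'),
        }
        endpoint = os.getenv('OBJECT_STORE_SERVER')
        if endpoint:
            client_kwargs['endpoint_url'] = endpoint
        return boto3.client('s3', **client_kwargs)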
@hardbyte (Collaborator):

So if this benchmarking job is run multiple times in k8s with the same s3 bucket (extremely likely), the results get overwritten. I suggest we include a timestamp or UUID in the uploaded filename.

@gusmith (Contributor, author):

👍
I will use timestamps, to make the results easier to find afterwards (the job may not be kept on k8s, so a UUID may become hard to recover).
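A minimal sketch of such a timestamped key, reusing client, s3_bucket and experiment_file from the diff above (the exact filename format is assumed):

    import time

    # Timestamp in the key so repeated runs don't overwrite each other.
    result_key = "results-{}.json".format(
        time.strftime("%Y%m%dT%H%M%SZ", time.gmtime()))
    client.upload_file(experiment_file, s3_bucket, result_key)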

    @@ -293,6 +323,7 @@ def main():
        pprint(results)
        with open(config['results_path'], 'wt') as f:
            json.dump(results, f)
        push_result_s3(config['results_path'])
@hardbyte (Collaborator):

Should uploading to s3 be optional? Someone might want to run the benchmark locally and just see the output as before?

@gusmith (Contributor, author):

👍
I'll just check whether the environment variables have been set, rather than adding an extra one.
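A minimal sketch of that check, assuming the upload is simply skipped when the object store variables are unset (function and config names come from the diff; the log message is assumed):

    import logging
    import os

    # Skip the upload when the object store is not configured, so local runs
    # still just print and write results as before.
    if os.getenv('OBJECT_STORE_ACCESS_KEY') and os.getenv('OBJECT_STORE_SECRET_KEY'):
        push_result_s3(config['results_path'])
    else:
        logging.info("Object store credentials not set, skipping results upload.")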

@gusmith (Contributor, author) commented Aug 21, 2019

@hardbyte in the benchmark, we print out the environment variables if there is an exception thrown in the read_conf method. Do you know where the benchmark logs are pushed to? Mainly: who can see them? If we add some credentials, we may not want to print them...

@gusmith gusmith requested a review from hardbyte August 21, 2019 03:45
@gusmith (Contributor, author) commented Aug 21, 2019

@hardbyte all your concerns have been resolved, except the naming of the environment variables, for which we may want to chat; cf. the discussion above.

@hardbyte (Collaborator) left a comment

Approved, but suggest making the S3 server configurable

@gusmith gusmith merged commit 70120d2 into develop Aug 23, 2019
@gusmith gusmith deleted the feature-benchmark-repeat branch August 23, 2019 04:41