Benchmark repeat parameter and results pushed to s3 #406
Conversation
This is an optional field, which must be an integer greater than or equal to 1. If absent, it defaults to 1.
Used to post the results to S3.
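The optional field with a default of 1 described above could be read as in the following sketch (the `repetition` key name matches the discussion below, but the `read_repetition` helper is illustrative, not the PR's actual code, which may validate via its experiment schema instead):

```python
def read_repetition(experiment):
    """Return the optional 'repetition' field of an experiment dict,
    defaulting to 1 when absent.

    Illustrative helper: rejects non-integers and values below 1.
    """
    value = experiment.get('repetition', 1)
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(value, int) or isinstance(value, bool) or value < 1:
        raise ValueError("'repetition' must be an integer >= 1")
    return value
```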
I really like the addition of uploading results to an object store. Just a few small suggestions from me.
benchmarking/benchmark.py
Outdated
def push_result_s3(experiment_file):
    client = boto3.client(
        's3',
        aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
My hunch is we should be agnostic, as multiple services offer the S3 API (including MinIO, which we already use in this project).
How about OBJECT_STORE_SERVER, OBJECT_STORE_ACCESS_KEY, OBJECT_STORE_BUCKET etc. as the environment variables? And open an issue to change them in the backend settings from MINIO_:
- https://github.com/data61/anonlink-entity-service/blob/develop/backend/entityservice/settings.py#L27
I'm actually not convinced. My reasoning is the following:
- We do not support anything other than AWS in the benchmark.
- The benchmark is fully independent from the main repository, so variable names from the main repo should not impact choices made for the benchmark (even if consistency may be a good idea).
- If we are thinking about both repos together, I would actually make the names even more specific, such as AWS_BENCHMARK_..., to be sure they cannot be imagined to apply to any context other than the chosen one. I would also point out that if we implement a script doing everything at once (deploying and benchmarking), it is good to use different environment variable names, so that a misconfiguration cannot lead to misuse of a token. E.g. I create a script deploying the entity service that sets the env var OBJECT_STORE_ACCESS_KEY; if the script also starts the benchmark but I forgot to update the env var, I would push the results to the same bucket. Here it is not important as we are not deleting anything, but if we were, we could do really bad things...
- Would there be a scenario where a single service has multiple keys? E.g. one key for the bucket x and one key for the bucket y? The underlying question is whether we prefer keys per application, or keys per use case (the application can do x, y and z; or key a can do x, key b can do y and key c can do z, and I give the application keys a, b and c). Both have pros and cons, but we may want to think about it.
- If we generalise to OBJECT_STORE_SERVER, I think we will also need an extra field OBJECT_STORE_TYPE, in which case we may also need to modify the description of the env variables: one object store type may not use a bucket and access key, but something totally different (I have no clue what it could be, but maybe :) ). This field would also help to know which service to use.

But the more options we add, the more I would assume we should push them into a command-line tool instead of env vars.
While not being convinced, I'm also not strongly opinionated on this, so happy to change if that's the preference :)
I'm not too fussed about the variable names, but it would be nice to include a way to set the object store server, so the benchmark user can decide for themselves.
See the docs at https://docs.min.io/docs/minio-select-api-quickstart-guide.html: it looks like all you would need is to add an endpoint_url which is only set if an environment variable is present, e.g. S3_SERVER or (my preference) OBJECT_STORE_SERVER.
s3 = boto3.client('s3',
                  endpoint_url='http://localhost:9000',
                  aws_access_key_id='minio',
                  aws_secret_access_key='minio123',
                  )
I don't think we need to tell the benchmark component the TYPE of object store: that would defeat the purpose of an abstraction, in this case the S3 API.
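The suggestion above (only set endpoint_url when the environment provides one) could look roughly like this; the variable names follow this review thread and are assumptions, not necessarily the merged code:

```python
import os


def object_store_client_kwargs(env=os.environ):
    """Build keyword arguments for boto3.client('s3', ...) from the
    environment.

    OBJECT_STORE_SERVER is optional: when unset, no endpoint_url is
    passed and boto3 falls back to the default AWS S3 endpoint, so the
    same code works against AWS or a MinIO server.
    """
    kwargs = {
        'aws_access_key_id': env.get('OBJECT_STORE_ACCESS_KEY'),
        'aws_secret_access_key': env.get('OBJECT_STORE_SECRET_KEY'),
    }
    endpoint = env.get('OBJECT_STORE_SERVER')
    if endpoint:  # e.g. 'http://localhost:9000' for a local MinIO
        kwargs['endpoint_url'] = endpoint
    return kwargs
```

Usage would then be `client = boto3.client('s3', **object_store_client_kwargs())`, keeping the S3 API as the abstraction boundary.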
OBJECT_STORE_SERVER added, and the other ones renamed to OBJECT_STORE_ACCESS_KEY, OBJECT_STORE_SECRET_KEY and OBJECT_STORE_BUCKET.
benchmarking/benchmark.py
Outdated
        aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY')
    )
    s3_bucket = "anonlink-benchmark-result"
    client.upload_file(experiment_file, s3_bucket, "results.json")
So if this benchmarking job is run multiple times in k8s with the same S3 bucket (extremely likely), the results get overwritten. I suggest we include a timestamp or UUID in the uploaded filename.
👍
I will use timestamps, to more easily access the results afterwards (the job may not be kept on k8s, so a UUID may become hard to access).
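A timestamped object key along these lines would avoid the overwriting problem while staying sortable (the exact key format here is a sketch, not necessarily what the PR uses):

```python
import time


def result_object_name(prefix='results'):
    """Build a unique, lexicographically sortable object key for an
    uploaded result file, using a UTC timestamp.

    Sketch only: the PR's actual key format may differ.
    """
    stamp = time.strftime('%Y%m%d-%H%M%S', time.gmtime())
    return '{}-{}.json'.format(prefix, stamp)
```

The upload call would then become `client.upload_file(experiment_file, s3_bucket, result_object_name())` instead of a fixed "results.json".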
benchmarking/benchmark.py
Outdated
@@ -293,6 +323,7 @@ def main():
    pprint(results)
    with open(config['results_path'], 'wt') as f:
        json.dump(results, f)
    push_result_s3(config['results_path'])
Should uploading to s3 be optional? Someone might want to run this benchmarking locally and just see the output as before?
👍
I'll just check whether the environment variables have been set, rather than adding an extra one.
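Making the upload opt-in via the existing variables could be as simple as the guard below; the variable names follow the renaming agreed earlier in this thread, and the helper itself is a sketch:

```python
import os


def object_store_configured(env=os.environ):
    """Return True only when all object-store variables are set.

    Sketch: lets a local run without credentials skip the upload and
    just print its output as before, with no extra opt-in variable.
    """
    required = ('OBJECT_STORE_ACCESS_KEY',
                'OBJECT_STORE_SECRET_KEY',
                'OBJECT_STORE_BUCKET')
    return all(env.get(name) for name in required)
```

In main() this would wrap the upload: `if object_store_configured(): push_result_s3(config['results_path'])`.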
@hardbyte in the benchmark, we are printing out the environment variables if there is an exception thrown in the …
@hardbyte all your concerns have been resolved, except the naming of the environment variables, for which we may want to chat; cf. the discussion.
Added the time when a run starts and when it stops.
Approved, but I suggest making the S3 server configurable.
Two main features added at the same time (they could have been split):
1. Use of boto3 for the benchmark to push the results to S3, instead of relying on k8s volumes, which are hard to access afterwards.
2. A parameter repetition has been added to the experiment schema, to repeat the same experiment a number of times within the same benchmarking k8s job. This should reduce the number of issues we observed when running a 1M*1M linkage 100 times: previously we repeated the same k8s job 100 times, which can have issues with volumes that cannot be shared.
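The intended effect of the repetition parameter described above can be sketched as a loop inside a single job; `run_experiment` is a hypothetical stand-in for the benchmark's actual per-run function, not code from this PR:

```python
def run_with_repetition(experiment, run_experiment):
    """Run one experiment 'repetition' times within a single k8s job,
    collecting one result per run.

    Sketch: replaces launching 100 separate k8s jobs (with their
    hard-to-share volumes) with a single job that loops.
    """
    repetition = experiment.get('repetition', 1)
    return [run_experiment(experiment) for _ in range(repetition)]
```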