
Benchmark repeat parameter and results pushed to s3 #406

Merged: 10 commits merged into develop from feature-benchmark-repeat on Aug 23, 2019

Conversation

@gusmith (Contributor) commented Aug 20, 2019

Two main features are added at the same time (they could have been split):

  • The benchmark now uses boto3 to push its results to s3, instead of relying on k8s volumes, which are hard to access afterwards.
  • A parameter repetition has been added to the experiment schema so that the same experiment can be repeated a number of times within a single benchmarking k8s job. This should reduce the number of issues we observed when running the 1M*1M linkage 100 times: previously we repeated the same k8s job 100 times, which can run into problems with volumes that cannot be shared.
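For illustration, a minimal sketch of what the repetition field could look like, assuming the experiment schema is expressed as jsonschema (only the field name and its constraints come from this PR; the surrounding structure is assumed):

    # Hypothetical jsonschema fragment: "repetition" must be an integer >= 1.
    REPETITION_FIELD = {
        "type": "integer",
        "minimum": 1,
    }

    def get_repetition(experiment):
        # jsonschema validation does not inject defaults, so apply the
        # documented default of 1 explicitly when the field is absent.
        return experiment.get("repetition", 1)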

Guillaume Smith added 4 commits August 20, 2019 18:41
This is an optional field, which needs to be an integer greater than or equal to 1.
If absent, it defaults to 1.
Used to post the results to s3.
@gusmith gusmith requested a review from hardbyte August 20, 2019 08:53
@gusmith gusmith self-assigned this Aug 20, 2019
@hardbyte (Collaborator) left a comment

I really like the addition of uploading results to an object store. Just a few small suggestions from me.

benchmarking/benchmark.py (review thread resolved)
    def push_result_s3(experiment_file):
        client = boto3.client(
            's3',
            aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
@hardbyte (Collaborator):

My hunch is we should be agnostic, as multiple services offer the S3 API (including MinIO, which we already use in this project).

How about OBJECT_STORE_SERVER, OBJECT_STORE_ACCESS_KEY, OBJECT_STORE_BUCKET etc as the environment variables? And open an issue to change them in the backend settings from MINIO_ - https://github.com/data61/anonlink-entity-service/blob/develop/backend/entityservice/settings.py#L27

@gusmith (Contributor, author):

I'm actually not convinced. My reasoning is the following:

  • we do not support anything other than AWS in the benchmark
  • the benchmark is fully independent of the main repository, so variable names from the main repo should not drive choices made for the benchmark (even if consistency may be a good idea)
  • but if we are thinking about both repos together, I would actually make them even more specific, such as AWS_BENCHMARK_..., to be sure they cannot be mistaken for any context other than the chosen one. I would also point out that if we implement a script doing everything at once (deploying and benchmarking), it's good to use different environment variable names so that a mis-configuration cannot lead to misuse of a token. E.g. if I create a script deploying the entity service that sets the env var OBJECT_STORE_ACCESS_KEY, and the script also starts the benchmark but I forget to update the env var, I would push the results to the same bucket. Here that is not important as we are not deleting anything, but if we were, we could do really bad things...
  • would there be a scenario where a single service has multiple keys? E.g. one key for bucket x and one key for bucket y? The underlying question is whether we prefer keys per application or keys per use-case (the application can do x, y and z; or key a can do x, key b can do y and key c can do z, and I give the application the keys a, b and c). Both have pros and cons, but we may want to think about it.
  • if we generalise to OBJECT_STORE_SERVER, I think we will also need an extra field OBJECT_STORE_TYPE, in which case we may also need to modify the description of the env variables: one object store type may not use a bucket and access key, but something totally different (I have no clue what that could be, but maybe :) ). This field would also help to know which service to use.

But the more options we add, the more I think we should push them into a command-line tool instead of env vars.

While not convinced, I'm also not strongly opinionated on this, so happy to change if that's the preference :)

@hardbyte (Collaborator):

I'm not too fussed about the variable names, but it would be nice to include a way to set the object store server so the benchmark user can decide for themselves.

See the docs at https://docs.min.io/docs/minio-select-api-quickstart-guide.html: it looks like all you would need is to add an endpoint_url, set only if an environment variable is present, e.g. S3_SERVER or (my preference) OBJECT_STORE_SERVER.

s3 = boto3.client('s3',
                  endpoint_url='http://localhost:9000',
                  aws_access_key_id='minio',
                  aws_secret_access_key='minio123',
                  )

I don't think we need to tell the benchmark component the TYPE of object store - that would defeat the purpose of an abstraction - in this case the S3 API.

@gusmith (Contributor, author):

OBJECT_STORE_SERVER added, and the other ones renamed to OBJECT_STORE_ACCESS_KEY, OBJECT_STORE_SECRET_KEY and OBJECT_STORE_BUCKET.

            aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY')
        )
        s3_bucket = "anonlink-benchmark-result"
        client.upload_file(experiment_file, s3_bucket, "results.json")
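Following the thread above, a minimal sketch of how the client construction could look after the rename, with the endpoint passed only when configured (the OBJECT_STORE_* names come from the comments; the helper itself is assumed):

    import os
    import boto3

    def make_object_store_client():
        # Only pass endpoint_url when the user configured a server, so plain
        # AWS S3 keeps working by default and MinIO can be targeted explicitly.
        client_kwargs = {
            'aws_access_key_id': os.getenv('OBJECT_STORE_ACCESS_KEY'),
            'aws_secret_access_key': os.getenv('OBJECT_STORE_SECRET_KEY'),
        }
        endpoint = os.getenv('OBJECT_STORE_SERVER')
        if endpoint:
            client_kwargs['endpoint_url'] = endpoint
        return boto3.client('s3', **client_kwargs)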
@hardbyte (Collaborator):

So if this benchmarking job is run multiple times in k8s with the same s3 bucket (extremely likely), the results get overwritten. I suggest we include a timestamp or UUID in the uploaded filename.

@gusmith (Contributor, author):

👍
I will use timestamps, to make the results easier to find afterwards (the job may not be kept on k8s, so a UUID may become hard to recover).
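A minimal sketch of such a timestamped key, reusing client, s3_bucket and experiment_file from the diff above (the exact filename format is assumed):

    import time

    # Timestamp in the key so repeated runs don't overwrite each other.
    result_key = "results-{}.json".format(
        time.strftime("%Y%m%dT%H%M%SZ", time.gmtime()))
    client.upload_file(experiment_file, s3_bucket, result_key)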

    @@ -293,6 +323,7 @@ def main():
        pprint(results)
        with open(config['results_path'], 'wt') as f:
            json.dump(results, f)
        push_result_s3(config['results_path'])
@hardbyte (Collaborator):

Should uploading to s3 be optional? Someone might want to run the benchmark locally and just see the output as before?

@gusmith (Contributor, author):

👍
I'll just check whether the environment variables have been set, rather than adding an extra one.
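A minimal sketch of that check, assuming the upload is simply skipped when the object store variables are unset (function and config names come from the diff; the log message is assumed):

    import logging
    import os

    # Skip the upload when the object store is not configured, so local runs
    # still just print and write results as before.
    if os.getenv('OBJECT_STORE_ACCESS_KEY') and os.getenv('OBJECT_STORE_SECRET_KEY'):
        push_result_s3(config['results_path'])
    else:
        logging.info("Object store credentials not set, skipping results upload.")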

@gusmith (Contributor, author) commented Aug 21, 2019

@hardbyte in the benchmark, we print out the environment variables if there is an exception thrown in the read_conf method. Do you know where the benchmark logs are pushed to? Mainly: who can see them? If we add some credentials, we may not want to print them...

@gusmith gusmith requested a review from hardbyte August 21, 2019 03:45
@gusmith (Contributor, author) commented Aug 21, 2019

@hardbyte all your concerns have been resolved, except the naming of the environment variables, for which we may want to chat; cf. the discussion above.

@hardbyte (Collaborator) left a comment

Approved, but suggest making the S3 server configurable

@gusmith gusmith merged commit 70120d2 into develop Aug 23, 2019
@gusmith gusmith deleted the feature-benchmark-repeat branch August 23, 2019 04:41