
cross-region performance optimization #11

Closed
rspicer opened this issue Aug 30, 2017 · 15 comments

@rspicer

rspicer commented Aug 30, 2017

Hello again my friends - just finished my implementation, calling this library from within a child process in Node. I initially thought my slow speeds were due to the startup and teardown of the process itself, but after some timing tests it seems that isn't the case.

The decrypt function, by itself, takes 1+ seconds. Is this normal behaviour? I'm calling it a single time, with a string payload no longer than 400 characters. Sometimes I call it with a longer string, but even with strings ~50 characters long it takes the same full second.

For our use case (calling it within an API endpoint) this doesn't really work. Am I doing something wrong? Let me know if you need to see my code, but it only has a few additions to the example code you provided, and I'm timing only the call to the SDK itself.

Thanks guys!

@rspicer
Author

rspicer commented Aug 30, 2017

I know I haven't given you a wealth of specifics here - feel free to ask for details. To add a few: I'm using an EC2 instance spun up via Elastic Beanstalk. Currently we're trying a t2.medium, as the t2.micro was twice as slow. Speeds seem to get worse the more times we call the endpoint mentioned, but that's probably our problem to solve - it's worth noting that even when called a single time in isolation, we still see the same 1s+ timing.

@mattsb42-aws
Member

Unfortunately, because you are calling a child process, you are seeing the overhead of Python interpreter startup and library loading in addition to the actual runtime of the decrypt operation.

Quick note before I go any further, as for why it is running slower on subsequent calls: check the CloudWatch metrics for the instance you are using. It sounds to me like you are probably running out of T2 CPU credits.

I have in-depth performance testing/tweaking on my todo list, but haven't gotten a chance to really dig into that yet. I did some basic initial testing to try and replicate your issue and highlight where time is going. My test key is in us-east-1, so I ran the test on my laptop (i7-5557U) in Seattle as well as on a t2.small (with plenty of CPU credits) in us-east-1.

The test is extremely basic: it runs either an encrypt or a decrypt operation on a file (writing the output to another file), uses the default algorithm suite and a KMS master key provider with a single key, and alternately measures the runtime of the relevant KMS operation alone (to highlight latency). Results below.

| | t2.small system runtime (s) | t2.small execution time (s) | laptop system runtime (s) | laptop execution time (s) |
| --- | --- | --- | --- | --- |
| (py2.7) full encrypt | 0.289 | 0.111 | 0.703 | 0.445 |
| (py2.7) kms generate data key | 0.252 | 0.081 | 0.559 | 0.345 |
| (py2.7) full decrypt | 0.280 | 0.109 | 0.595 | 0.375 |
| (py2.7) kms decrypt | 0.254 | 0.086 | 0.562 | 0.346 |
| (py3.6) full encrypt | 0.349 | 0.100 | 0.908 | 0.426 |
| (py3.6) kms generate data key | 0.308 | 0.056 | 0.601 | 0.304 |
| (py3.6) full decrypt | 0.363 | 0.103 | 0.661 | 0.361 |
| (py3.6) kms decrypt | 0.315 | 0.069 | 0.623 | 0.324 |

As you can see, the interpreter/library loading overhead can have a significant impact on total runtime, though it can be outweighed by latency to KMS if you are making cross-region calls. While I did not get times matching what you are seeing (especially in-region), the Python 3.6 system runtimes from my laptop got close. This is all from calling the Python interpreters directly from the shell; I'm not sure what overhead Node adds on child process calls.

Some things I would recommend looking at: your available T2 CPU credits, and how many KMS keys you are using and in which regions they are located (especially relative to the region your host is in). Of course, if you can replace the thing calling this in a child process with a full Python implementation, that would have the biggest impact.

@rspicer
Author

rspicer commented Aug 31, 2017

Your writeup is about as helpful as anyone could ask for - I appreciate it loads, man. We are indeed going cross-region; I didn't think about the performance impact that would have. As for T2 CPU credits... I'll look into that. It's worth mentioning that I don't think this particular issue is affected by interpreter/library loading:

https://stackoverflow.com/questions/1557571/how-do-i-get-time-of-a-python-programs-execution

This is the method I'm using to time the execution calls. It isn't perfect, but it gives a good estimate. I'm starting the timer directly before the call to the Encryption SDK and ending it immediately after. Unless I'm wrong, this should take startup overhead out of the question.
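Concretely, the timing wrapper looks roughly like this (trimmed down; the key ARN and ciphertext here are placeholders for our real values):

```python
import time

import aws_encryption_sdk

# Placeholder ARN - stands in for our real CMK
key_provider = aws_encryption_sdk.KMSMasterKeyProvider(
    key_ids=['arn:aws:kms:us-east-1:111122223333:key/example']
)

# ciphertext is the encrypted payload handed in from Node (placeholder here)
start = time.time()
plaintext, header = aws_encryption_sdk.decrypt(
    source=ciphertext,
    key_provider=key_provider,
)
print('decrypt took {:.3f}s'.format(time.time() - start))
```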

Still, that being said, you've given me plenty to go on. Let me spend the next day or two tweaking things, specifically the cross-region KMS key, and I'll see where I stand.

Much, much appreciated. Cheers!

@mattsb42-aws
Member

That's basically how I tested the timing too. There are more precise ways, but that's the easiest. I missed in your original comment that you were only timing the SDK call. With that in mind, I suspect the time you are seeing is mainly due to the cross-region call(s).

Here's the basic checker I threw together for this, if you would like to compare same with same.

https://gist.github.com/mattsb42-aws/862b7e32c31c0dca44f1cfac6e2d4f2b

Depending on what your exact use case is, one workaround could be for you to do in-region encryption in the API call handler, then have a post-processor (not in the API call chain) re-encrypt the message using (I assume) keys in multiple regions. That way you can keep your API call chain fast but still persist the more portable ciphertext.
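To sketch that idea out (placeholder key ARNs; the multi-region provider simply lists a master key in each region you need the ciphertext to be decryptable in):

```python
import aws_encryption_sdk

# Fast path: a single master key in the same region as the API host (placeholder ARN)
local_provider = aws_encryption_sdk.KMSMasterKeyProvider(
    key_ids=['arn:aws:kms:us-east-1:111122223333:key/in-region-example']
)

# Post-processor: master keys in every region you need (placeholder ARNs)
multi_region_provider = aws_encryption_sdk.KMSMasterKeyProvider(key_ids=[
    'arn:aws:kms:us-east-1:111122223333:key/in-region-example',
    'arn:aws:kms:eu-west-1:111122223333:key/remote-example',
])


def handle_request(plaintext):
    # In the API call chain: one in-region KMS call, so low latency
    ciphertext, _header = aws_encryption_sdk.encrypt(
        source=plaintext, key_provider=local_provider
    )
    return ciphertext


def post_process(ciphertext):
    # Outside the API call chain: decrypt, then re-encrypt with all master keys
    plaintext, _header = aws_encryption_sdk.decrypt(
        source=ciphertext, key_provider=local_provider
    )
    portable, _header = aws_encryption_sdk.encrypt(
        source=plaintext, key_provider=multi_region_provider
    )
    return portable
```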

@mattsb42-aws
Member

@rspicer did you find a solution to the performance problem you were having?

@rspicer
Author

rspicer commented Sep 19, 2017

Truthfully, I'm only today getting back to this issue (got tied up in other things). I'll be testing same-region calls within the next couple of days and will let you know if that fixes the performance issues.

On the topic of performance, I do have another issue that could use your input... perhaps it isn't appropriate to put it in this thread, but it's a very similar issue. I'll leave it here, and if you'd prefer I move it to its own issue, let me know:

Is there a way to have multiple calls to the decrypt/encrypt functions run in parallel, asynchronously? In Node, which is what I know best, this would be a simple task... but I'm not sure how to do it in Python. I don't need much help on this problem - I'll do the heavy lifting on the actual parallel processing - but my question is simple: is it possible to run the Encryption SDK functions asynchronously?

Once again, if you feel this needs to be moved to its own thread, no problem. I'll also make sure to give you an update on our original performance problems shortly. Thanks so much!

@mattsb42-aws
Member

The clients are all thread safe, yes, with one minor qualifier: instances of KMSMasterKeyProvider should not be shared between threads, for the reasons outlined in the boto3 docs. We do create a new boto3 session for each regional client within each KMSMasterKeyProvider instance, so you don't need to worry about issues below that level. As long as you create a new KMSMasterKeyProvider for each thread, you should be fine.

If you are using data key caching, however, the caches themselves can be shared across threads without issue, though if you want to share entries in a cache across threads you will need to be careful (see the explanation about partition names).
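For illustration, a pattern like this keeps one provider per thread (a sketch only; the key ARN is a placeholder):

```python
import threading

import aws_encryption_sdk

_thread_local = threading.local()


def get_key_provider():
    # Lazily build one KMSMasterKeyProvider per thread; instances are
    # never shared across threads, per the boto3 session guidance above.
    if not hasattr(_thread_local, 'key_provider'):
        _thread_local.key_provider = aws_encryption_sdk.KMSMasterKeyProvider(
            key_ids=['arn:aws:kms:us-east-1:111122223333:key/example']  # placeholder
        )
    return _thread_local.key_provider


def decrypt_message(ciphertext):
    plaintext, _header = aws_encryption_sdk.decrypt(
        source=ciphertext, key_provider=get_key_provider()
    )
    return plaintext
```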

Side note: I realized we did not actually have any tests verifying this, so I made some in #15.

mattsb42-aws changed the title from "Decrypt function called basically, with context, taking 1 second or more." to "cross-region performance optimization" on Sep 20, 2017
@mattsb42-aws
Member

I'm fine keeping this in the same thread; it's all related to your original question. I renamed the issue, though, to better align it with the content.

@rspicer
Author

rspicer commented Sep 20, 2017

Perhaps you can guide me just a little bit more on this problem, and I think I'll be off and running... I've been researching all day and just can't quite find the right solution; I was hoping not to bother you with it. Basically, here it is: most of the time, I'd say 90%, I only have to run the encrypt/decrypt function a single time. That's easy enough, lol - I can handle that. But the other 10% of the time requires me to run decrypt in bulk, anywhere from 500-2500 times within the span of a single API call. That API call can be a long one, but obviously the shorter the better.

What, in your opinion, is the fastest way to do this? As is the story of this thread, performance is key. I see three options, but I don't know enough to tell which is the right one (all of them are viable; one is worse than the others implementation-wise, but if it's the fastest, no problem):

  1. Run a standard asyncio coroutine that uses await and the event loop, single-threaded (my understanding is that standard asyncio runs single-threaded), for all of the tasks "at once".
  2. Use run_in_executor (https://stackoverflow.com/a/29280606/5335646) with thread or process pooling to run each "set" of tasks (say, a group of 100 decryptions per thread) in parallel/concurrently. My understanding, however, is that this is more for blocking tasks.
  3. Actually run multiple separate Python processes at the same time, communicating with each other over HTTP, with Docker (I'll be using Docker for this anyway; I have a feeling this idea is the worst and makes no sense, but might as well include it). Or use multiprocessing.

There may be other options that I'm unaware of, but these are the main ones that come to mind. For some reason I have it stuck in my head that I should be using a combination of multithreading or multiprocessing and asynchronicity, but I could be off the mark.

I may delete this if I come up with an answer soon, but if you have any suggestions, let me know!

@rspicer
Author

rspicer commented Sep 20, 2017

I'm going to start trying all of these implementations now; I have a few ideas. But if you know flat-out which is fastest, please interrupt me and let me know! 😄

@rspicer
Author

rspicer commented Sep 20, 2017

I'm clearly mistaken... "TypeError: object tuple can't be used in 'await' expression" - so this can't be used as an asynchronous function at all, can it? Which means what you meant above is that the fastest route would be to use multithreading or multiprocessing straight up? Apologies - as stated, I come from Node, where nearly every function is async 🤣 making my way around slowly.

@rspicer
Author

rspicer commented Sep 20, 2017

I'm currently using run_in_executor with a ThreadPoolExecutor (ProcessPoolExecutor doesn't seem to work; problems with pickling) with max_workers set to 40 on my laptop (that seems to be where the gains from increasing it trail off). In the same region (which helps performance a bit) I'm getting 45 seconds for a load of 2500 small encryptions. Which actually... might be something I can work with. I'll update you later tonight.
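For reference, the core of what I'm running looks roughly like this (trimmed down; placeholder ARN, and one provider per thread per your earlier advice):

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

import aws_encryption_sdk

_thread_local = threading.local()


def _provider():
    # One KMSMasterKeyProvider per worker thread, as recommended above
    if not hasattr(_thread_local, 'key_provider'):
        _thread_local.key_provider = aws_encryption_sdk.KMSMasterKeyProvider(
            key_ids=['arn:aws:kms:us-east-1:111122223333:key/example']  # placeholder
        )
    return _thread_local.key_provider


def encrypt_one(plaintext):
    ciphertext, _header = aws_encryption_sdk.encrypt(
        source=plaintext, key_provider=_provider()
    )
    return ciphertext


async def encrypt_bulk(payloads):
    # Fan the blocking SDK calls out onto a thread pool from the event loop
    loop = asyncio.get_event_loop()
    with ThreadPoolExecutor(max_workers=40) as pool:
        return await asyncio.gather(
            *[loop.run_in_executor(pool, encrypt_one, p) for p in payloads]
        )

# results = asyncio.get_event_loop().run_until_complete(encrypt_bulk(payloads))
```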

@mattsb42-aws
Member

With Python, the general rule of thumb is: if you are IO-bound, use multithreading; if you are CPU-bound, use multiprocessing. Multiprocessing actually spawns new system processes, each with an independent instance of the Python runtime (and thus an independent GIL), letting you actually use multiple CPU cores if you have access to them. There is a really good introduction to the GIL here that goes into a lot more detail on this.

Which approach makes the most sense for your use case will likely depend mostly on how large the objects you expect to encrypt are and what kind of hardware resources you have available.

Given a consistent configuration, the CMM portion of the encrypt/decrypt calls (generating the data key, encrypting the data key, generating the signing key, etc.) tends to have a pretty consistent runtime. If you are using the KMSMasterKeyProvider, this portion of the runtime is also largely IO-bound, as most of the time is spent waiting on KMS to respond.

If you are encrypting small amounts of data and the CMM runtime makes up a large portion of your total runtime, multithreading will probably be better - at least up to the point where you become CPU-bound processing the actual encryption operations. If you are encrypting large amounts of data, you are likely to be CPU-bound from the start, so multiprocessing will make more sense.

Making the right decision on what approach to use will require some tuning on your end, based on monitoring the performance and runtimes of your specific scenario. What might end up being the best choice is a combination of the two, tuned based on the results of your testing.

Another thing that might help with performance is to take advantage of data key caching to cut out the hit to KMS for some of those calls.
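A minimal caching setup would look something like this (sketch only; the key ARN is a placeholder, and the capacity, age, and message limits are numbers you would tune for your workload):

```python
import aws_encryption_sdk

key_provider = aws_encryption_sdk.KMSMasterKeyProvider(
    key_ids=['arn:aws:kms:us-east-1:111122223333:key/example']  # placeholder
)

# Cache data keys locally so repeated calls skip the round trip to KMS
cache = aws_encryption_sdk.LocalCryptoMaterialsCache(capacity=100)
caching_cmm = aws_encryption_sdk.CachingCryptoMaterialsManager(
    master_key_provider=key_provider,
    cache=cache,
    max_age=600.0,               # seconds a cached data key may be reused
    max_messages_encrypted=100,  # messages per cached data key
)

# Pass the caching CMM instead of the raw key provider
ciphertext, header = aws_encryption_sdk.encrypt(
    source=b'example plaintext', materials_manager=caching_cmm
)
```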

To be honest, I'm not very familiar with asyncio; I haven't taken the time to play with it yet. That TypeError message sounds more like you might just be feeding in parameters incorrectly, though.

@rspicer
Author

rspicer commented Sep 25, 2017

Hey there! Just want to thank you so much for all of your help, and give you a bit of an update: you were absolutely, 100%, right on the money. We haven't tested decrypt yet (it's hard, as all of our data is encrypted with the eu-central key), but we HAVE tested encrypt... and the results are ridiculous.

Encrypt with the SDK on our t2.medium (running in a Docker container, but that should make no difference) averages roughly 1.2 seconds of execution time with the eu-central key (our servers being in sa-east). Moving the key to sa-east with the server drops encrypt down to 0.08 seconds. Not 0.8, to be clear - 0.08. Now those are times we can work with! 😄

That pretty much wraps things up for us, assuming decrypt shows the same drop. Your help on threading has been hugely helpful as well. In short, thank you! Hopefully I've helped you in some way too, by bringing up the cross-region time differences, but I think I got the better end of this one! Cheers!

@mattsb42-aws
Member

That's great to hear!

If you need your eu-central master key to be able to decrypt the message but want to retain the low runtime on encrypt, something you could do is encrypt only with an in-region master key during initial request handling, then have a post-processing step (a Lambda watching an S3 bucket, perhaps?) re-encrypt the data using the full complement of master keys.

As always, feel free to reach out again with any other issues you encounter.
