Persistent private instances #1749
For example, when I request an API, Cortex will spin up an instance. The next time I make a call, will I access the same instance or some other one (considering that I will specify the compute requirement to be the max for the instance type I'll use, so that one instance will be used per API)?
@da-source that is an interesting question. I have a few follow-up questions:
Generally, it would be better to not do it this way if possible, since there is some behavior that could be undefined. For example, what if the user doesn't make a request for a bit, the instance spins down, and then the user makes a new request: will they have lost their data, or is it ok to spin up a new instance? Also, locking users to specific instances could affect autoscaling's ability to distribute the load, since instances created during autoscaling could only be used for new users (all existing users would have to stay on their initial instance).
Somewhere around 150-300 users. It would be best if each user only had access to one instance. I will set up a web interface with a timer set to the number of minutes an instance will be alive once it spins up (which must be defined in cortex.yaml, as I understand). After the timer runs out, there will be an option to request a new instance. Since the time window during which an instance is alive will be predefined and known by the user, the user will have that time window (around 30-50 minutes) to interact with the instance, after which a new instance could be requested if needed. The ability to distribute the load that you've mentioned wouldn't be an issue, I think, since all existing users having to stay on their instances until the timer runs out / the instance spins down would actually be preferable.
Hello! Engineering-wise, that sounds like a very bad idea for an application architecture. Can you tell us what it is that you are trying to achieve? There must be other ways to achieve it while maintaining a stateless application.
@miguelvr I have a few quite large models and datasets which I would like to be available at any given time, and I would like to make this application as cost-effective as possible. I considered creating a separate API for each model/dataset, but doing so would require running many idle instances on the cluster when the APIs are not being used. I also considered using a single API implementation - downloading all the models/datasets to each instance and then loading them when needed. If I could instead create a single API that downloads the model+dataset combination onto an instance upon a user's request, that would be much more cost-effective.
Sounds like what you want might be solved with multi-model caching, which Cortex supports. You can check the documentation here: https://docs.cortex.dev/workloads/multi-model
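For reference, here is a rough sketch of what the predictor section of a multi-model API spec could look like when expressed as a Python dict. The field names under `models` (such as `dir`, `cache_size`, and `disk_cache_size`), the bucket path, and the numbers are assumptions for illustration; verify them against the docs linked above for your Cortex version.

```python
# sketch only: the field names under "models" are assumptions, check the
# multi-model docs linked above for the exact names in your Cortex version
api_spec = {
    "name": "multi-model-api",
    "kind": "RealtimeAPI",
    "predictor": {
        "type": "python",
        "models": {
            "dir": "s3://my-bucket/models/",  # hypothetical bucket with one sub-directory per model
            "cache_size": 3,                  # max models kept in memory at once
            "disk_cache_size": 10,            # max models kept on disk at once
        },
    },
}
```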
@miguelvr With multi-model caching, will all the models be downloaded onto an instance when it spins up, or will the model of choice be downloaded after each API call specifying which model to get?
You can configure how many models are cached on disk.
But will the models be downloaded on instance startup or on API call? I don't want to unnecessarily download models/data, since they are quite large and AWS charges $0.15/GB.
@deliahu The multi-model documentation states that you can store many models in the cache and then load the model that you need. But will all the models be downloaded onto an instance when it is requested? Or will only the model specified in query_params be downloaded? If so, if it is downloaded on one instance, and an API call randomly redirects the user to another instance without that model, will it have to be downloaded again? I'm trying to avoid those AWS data transfer prices, because my files are so large (all of them combined currently weigh around 95-100GB).
@da-source the models are downloaded only when they are requested. This means that when multi-model caching is enabled, the models won't get downloaded when the API starts, but when requests start coming in. The cache then gets populated until its threshold is hit (the max number of models kept in the cache), and then models are dropped based on the LRU policy. The models that will get downloaded are those specified in the API spec.

To reiterate, the API will start with no models, and only when requests come in do they get downloaded. Subsequent requests for the same model will be faster because the model will already be present on disk / in memory. The opposite is when only live reloading is enabled, in which case the models are downloaded on the API from the very beginning - and they are always available to the user.

Does that make sense to you? And if not, could you tell us where we could improve our documentation (the parts that are confusing, if any) with regards to live-reloading/multi-model caching? Your feedback is going to help us improve our documentation for future users :)

Out of pure curiosity, what kind of models are you deploying? 95-100GB each is quite a lot.
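As an aside, here is a minimal sketch of the LRU eviction behavior described above. It is purely illustrative (not Cortex's actual implementation), and `download_model` is a hypothetical callback supplied by the caller.

```python
from collections import OrderedDict


class ModelCache:
    """Keeps at most `capacity` models; evicts the least recently used one."""

    def __init__(self, capacity, download_model):
        self.capacity = capacity
        self.download_model = download_model  # hypothetical downloader callback
        self.models = OrderedDict()

    def get(self, name):
        if name in self.models:
            # cache hit: mark this model as most recently used
            self.models.move_to_end(name)
        else:
            # cache miss: download the model, then evict the LRU entry if over capacity
            self.models[name] = self.download_model(name)
            if len(self.models) > self.capacity:
                self.models.popitem(last=False)
        return self.models[name]
```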
Thanks for clearing it up! One small detail: with multi-model caching, when a request for a certain model comes in, will it affect only one instance, or will the requested model be downloaded on all of the running instances? The multi-model part of the documentation seemed a little unclear to me, and you could update it with the information that you've given me to make it a bit more detailed. Also, it's not clear from the documentation how to use query_params when making a request to an API with curl. I'm using a number of pretrained XLNet models. What I meant to say was that all of them combined weigh around 100GB (not each one of them).
@da-source with multi-model caching, when a request for a certain model comes in, it will only affect a single API replica (not an instance, and not all instances). An API replica can fit one or more times on a single instance; it depends on how many compute resources the API demands. I suppose that the thing you are requesting is described by ticket #1288. Is this something that you would definitely need?
Anything that's passed into the
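For illustration, here is a sketch of calling a deployed API with a query parameter from Python. The endpoint URL and the `model` parameter name are assumptions for this example; the real endpoint can be retrieved with `cortex get <api-name>`.

```python
import requests

# hypothetical endpoint and parameter name; check `cortex get <api-name>` for the real URL
endpoint = "https://example.execute-api.us-east-1.amazonaws.com/my-api"

response = requests.post(
    endpoint,
    params={"model": "xlnet-base"},  # becomes ?model=xlnet-base in the query string
    json={"text": "sample input"},   # request body forwarded to the predictor
)
print(response.json())
```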
I see! We'll see what we can do about it! Thank you!
@RobertLucian I see. If I specify the API compute requirement so that it can fit only once on one instance, make a request for a certain model (which will download the model onto one instance if it's not already present), and then after some time make a request for the same model, will Cortex access the instance on which the model was previously downloaded, or will an instance be chosen at random, so the model will have to be downloaded again if the randomly accessed instance doesn't have it?
Yes, that is correct.
Is there a way to modify this behaviour, or at least increase the chances that the instance with the needed model at hand is used instead of an instance that doesn't have the previously downloaded model? That way, models that aren't chosen often wouldn't need to be downloaded on each API call. Ideally, each user would repeatedly access only one instance (where the needed model and data would be stored) before the instance spins down.
@da-source something like this could be possible for us to implement. Since it would be based on consistent hashing of a request header (e.g. a user ID), it would rely on a few things being true to run as smoothly as possible - e.g. do your users each generate a similar number of requests, or is there a wide variety?

There is another option worth considering, which is to create a separate API for each user. This would ensure that each replica is fully owned by a single user: all requests would be routed to the same replica, however multiple users would not be able to share a replica (I wrote "replica" since you can have multiple replicas per instance). Does this approach seem promising?
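As a rough illustration of the consistent-hashing idea mentioned above (this is not an existing Cortex feature; the replica names and the user-ID key are hypothetical):

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Maps a request header value (e.g. a user ID) to a stable replica."""

    def __init__(self, replicas, vnodes=100):
        # place several virtual nodes per replica so load spreads evenly
        self.ring = sorted(
            (_hash(f"{replica}#{i}"), replica)
            for replica in replicas
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def route(self, user_id: str) -> str:
        idx = bisect.bisect(self.keys, _hash(user_id)) % len(self.ring)
        return self.ring[idx][1]


ring = ConsistentHashRing(["replica-1", "replica-2", "replica-3"])
print(ring.route("user-42"))  # the same user ID always maps to the same replica
```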
In my case, I think the deviation in the number of user requests would usually be below 100. Could something like this be done with the current architecture? How would the autoscaling work in the scenario that you've suggested? Is there a way to auto-create/delete an API upon each new user's request, or will I need to launch an enormous number of APIs on the cluster at launch?
The current architecture supports it, but it would need to be implemented (this functionality is not currently implemented or exposed to the user).
Each API would have its own autoscaling, so it would be based on the traffic generated by each user. Each user would have a dedicated API with at least one replica.
You can create/delete APIs programmatically. The user would have to make two different types of requests: one to create the API, and one to call it once it's live. For the API that creates Cortex APIs, for separation of concerns, it might be best to run it outside of the Cortex cluster on a separate backend (e.g. App Engine, Elastic Beanstalk, Lambda, Heroku, etc.), but I don't see why it couldn't run in the Cortex cluster if that's your preference. You would use Cortex's Python client to create/delete the Cortex APIs.
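As a sketch of that pattern (a small backend that creates a dedicated Cortex API per user), the environment name, the API naming scheme, and the delete call below are assumptions; verify them against the Python client docs for your Cortex version. The `create_api` call mirrors the one used later in this thread.

```python
import cortex

client = cortex.client("aws")  # assumes an environment named "aws" is configured


def create_user_api(user_id: str, predictor_class):
    # one dedicated RealtimeAPI per user, named after the user (hypothetical naming scheme)
    api_name = f"user-{user_id}"
    client.create_api(
        api_spec={
            "name": api_name,
            "kind": "RealtimeAPI",
            "predictor": {"type": "python"},
            "compute": {"cpu": "1", "mem": "5G"},
        },
        predictor=predictor_class,
    )
    return api_name


def delete_user_api(user_id: str):
    # assumption: the delete method name may differ between client versions
    client.delete_api(f"user-{user_id}")
```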
@deliahu Deploying an API that creates/deletes Cortex APIs seems to be the best solution in my case! I'm getting an error, though.
What could be causing this?
@da-source Do you also see a stack trace? If so, do you mind sending it? Also, what version of Cortex are you using? And just to make sure: when you are creating the client, are you replacing "myoperator" with the actual operator endpoint?
Yes, of course, I replace the arguments with the actual keys and operator endpoint.
@deliahu I’m using Cortex 0.25
@da-source it seems to me that you are not using the correct Cortex client. Please try using this:

```python
import cortex as cx

client = cx.client(your_env_name)
```
@miguelvr I don't think that is the issue, I'm getting:
And what is the name of the environment in which your cluster is? Please run
@da-source Thanks for bringing this to our attention. There is indeed a bug in the code that prevents the creation and updating of environments in Python. PR #1772 will fix the bug. As a temporary workaround, please use the CLI.
Alright! But it still hasn't resolved the problem that @deliahu and I started addressing:
There is a bug in the Python client's environment configuration. As a workaround, you can mimic the behaviour of the `cortex env configure` CLI command. If running Cortex CLI commands isn't feasible in your use case, you can run the Python code below, which will mimic the behaviour of `cortex env configure`:

```python
import cortex
from cortex.binary import run_cli

env_name = "test"
provider = "aws"
operator_endpoint = "your operator endpoint"
aws_access_key_id = "your aws access key id"
aws_secret_access_key = "your aws secret key"

# build the same arguments that `cortex env configure` would receive on the CLI
cli_args = [
    "env", "configure", env_name,
    "--provider", provider,
    "--operator-endpoint", operator_endpoint,
    "--aws-access-key-id", aws_access_key_id,
    "--aws-secret-access-key", aws_secret_access_key,
]
run_cli(cli_args, hide_output=True)

client = cortex.client(env_name)
```

Let me know if this works for you.
@vishalbollu I was able to create a client, but I'm getting an error when trying to
I tried running
But after re-running
@da-source This error is happening because for some reason the Python client is not able to connect to the cluster. When you run
The operator I get by calling
Also, I retried running this snippet, which mimics the behaviour of `cortex env configure`.
And got this error:
Which is weird because
Is it possible that there is a copy-paste error? Based on the script you've specified above,
@vishalbollu Yes, that was a simple copy-paste error :)
or would you recommend some other way?
@da-source so let's answer your question.
When using the Python Client, if the
The best way to configure the API at deploy time is to populate the `config` field of the API spec's `predictor` section - whatever you put there is passed to the predictor's constructor as the `config` argument.

One thing I notice is that the `PythonPredictor` implementation in your example has the incorrect signature for your constructor - at the very least, the constructor has to accept the `config` argument.

The second thing I notice is that in

```python
client.create_api(api_spec={
    'name': 'modelname',
    'kind': 'RealtimeAPI',
    'predictor': {'type': 'python'},
    'compute': {
        'cpu': '1',
        'mem': '5G'
    }}, predictor=PythonPredictor('modelname'))
```

you're initializing the predictor yourself - Cortex instantiates the predictor class for you, so `predictor` should be the class itself, not an instance.

All in all, your example would look like this:

```python
# define predictor
class PythonPredictor:
    def __init__(self, config):
        self.config = config
        if self.config["condition"] == "value-1":
            pass  # initialization of type 1
        elif self.config["condition"] == "value-2":
            pass  # initialization of type 2
        else:
            pass  # whatever other kind of initialization

    def predict(self):
        pass  # do something


# create api
client.create_api(api_spec={
    'name': 'modelname',
    'kind': 'RealtimeAPI',
    'predictor': {
        'type': 'python',
        'config': {'condition': 'value-1'},
    },
    'compute': {
        'cpu': '1',
        'mem': '5G'
    }}, predictor=PythonPredictor)
```

Let us know if this clears things up for you!
@RobertLucian Thanks! And what should I do about importing libraries? I tried doing it on
@da-source the imports' scope is limited to their respective module/class/function (or method). In your case, the import could be exposed to the other methods like this:

```python
class PythonPredictor:
    def __init__(self, config):
        self.boto3 = __import__("boto3")
        # or import boto3 and then assign it to the boto3 attribute

    def predict(self):
        return self.boto3
```

Generally, packages should not be made available this way (through assigning them to attributes) - the imports should be done in the constructor, you initialize whatever you need there, and then in the predict method you use what you initialized.

I wouldn't recommend doing the imports at the class level like this either:

```python
class PythonPredictor:
    boto3 = __import__("boto3")
```

And that's because the imports will be done when the class is defined (and not when an object of this class is instantiated). There may be some work we have to do to improve the UX with regards to importing modules, though.
@RobertLucian So for now, this:
is the best option, am I correct?
@da-source as long as you're required to use boto3 in your predict method, yes, for now. That being said, I would still point out that importing the modules wherever they are needed is the better alternative - in your case, I would strive to import them in the constructor and use them to initialize stuff just there.
@da-source @RobertLucian I just wanted to be clear and double-check, for others reading this thread, that this is the recommended approach:

```python
class PythonPredictor:
    def __init__(self, config):
        import boto3
        # boto3 can be used here

    def predict(self):
        import boto3
        # boto3 can be used here
```
@da-source I'll go ahead and close this issue, let us know if you have additional questions.
I would like to use Cortex functionality to create an application where each user will be able to request and communicate with an AWS instance for a period of time. In this scenario, the data of each user will be processed and stored on one whole AWS instance. From the documentation, I understand that each API call will use an instance that is not busy at the moment. It wouldn't be ideal if, by making an API call, a user would receive sensitive data stored by another user on the same instance. Would it be possible to somehow mark an instance to which an API call is being made? That way the data of individual users wouldn't be made accessible to everyone, but only to those users who request/use an instance.