Performance with cortex local server is 6X slower than when model is run from jupyter notebook #1774

Closed · lminer opened this issue on Jan 9, 2021 · 13 comments
Labels: bug (Something isn't working)

Comments

lminer commented Jan 9, 2021

I'm finding that my TensorFlow model is ~6X slower when run from the local server than when it is run from a Jupyter notebook. I've checked nvtop while the model is running, and it appears that the GPU is being used, although only for a very brief portion of the overall time. I've also tried running the model in BentoML; in that case it's also slower, but only 3X. Speeds are comparable when I run on AWS, although I'm using a T4 in that case rather than the RTX 2080 Ti that I use locally. Any suggestions on how I might diagnose the cause of the slowdown? Here are my config files:

- name: Foo
  kind: RealtimeAPI
  predictor:
    type: tensorflow
    path: serving/cortex_server.py
    models:
      path: foo
      signature_key: serving_default
    image: quay.io/robertlucian/tensorflow-predictor:0.25.0-tfs
    tensorflow_serving_image: quay.io/robertlucian/cortex-tensorflow-serving-gpu-tf2.4:0.25.0
  compute:
    gpu: 1
  autoscaling:
    min_replicas: 1
    max_replicas: 1
# cluster.yaml

# EKS cluster name
cluster_name: foo

# AWS region
region: us-east-1

# list of availability zones for your region
availability_zones: # default: 3 random availability zones in your region, e.g. [us-east-1a, us-east-1b, us-east-1c]

# instance type
instance_type: g4dn.xlarge

# minimum number of instances
min_instances: 1

# maximum number of instances
max_instances: 1

# disk storage size per instance (GB)
instance_volume_size: 50

# instance volume type [gp2 | io1 | st1 | sc1]
instance_volume_type: gp2

# instance volume iops (only applicable to io1)
# instance_volume_iops: 3000

# subnet visibility [public (instances will have public IPs) | private (instances will not have public IPs)]
subnet_visibility: private

# NAT gateway (required when using private subnets) [none | single | highly_available (a NAT gateway per availability zone)]
nat_gateway: single

# API load balancer scheme [internet-facing | internal]
api_load_balancer_scheme: internal

# operator load balancer scheme [internet-facing | internal]
# note: if using "internal", you must configure VPC Peering to connect your CLI to your cluster operator
operator_load_balancer_scheme: internet-facing

# to install Cortex in an existing VPC, you can provide a list of subnets for your cluster to use
# subnet_visibility (specified above in this file) must match your subnets' visibility
# this is an advanced feature (not recommended for first-time users) and requires your VPC to be configured correctly; see https://eksctl.io/usage/vpc-networking/#use-existing-vpc-other-custom-configuration
# here is an example:
# subnets:
#   - availability_zone: us-west-2a
#     subnet_id: subnet-060f3961c876872ae
#   - availability_zone: us-west-2b
#     subnet_id: subnet-0faed05adf6042ab7

# additional tags to assign to AWS resources (all resources will automatically be tagged with cortex.dev/cluster-name: <cluster_name>)
tags: # <string>: <string> map of key/value pairs

# whether to use spot instances in the cluster (default: false)
spot: true

spot_config:
  # additional instance types with identical or better specs than the primary cluster instance type (defaults to only the primary instance type)
  instance_distribution: # [similar_instance_type_1, similar_instance_type_2]

  # minimum number of on demand instances (default: 0)
  on_demand_base_capacity: 0

  # percentage of on demand instances to use after the on demand base capacity has been met [0, 100] (default: 50)
  # note: setting this to 0 may hinder cluster scale up when spot instances are not available
  on_demand_percentage_above_base_capacity: 0

  # max price for spot instances (default: the on-demand price of the primary instance type)
  max_price: # <float>

  # number of spot instance pools across which to allocate spot instances [1, 20] (default: number of instances in instance distribution)
  instance_pools: 3

  # fallback to on-demand instances if spot instances were unable to be allocated (default: true)
  on_demand_backup: true

# SSL certificate ARN (only necessary when using a custom domain)
ssl_certificate_arn:

# primary CIDR block for the cluster's VPC
vpc_cidr: 192.168.0.0/16

lminer added the bug label on Jan 9, 2021
miguelvr (Collaborator) commented Jan 9, 2021

@lminer can you please try it out in a notebook with the same GPU used with Cortex? T4 GPUs are considerably slower than an RTX 2080 Ti.

lminer commented Jan 9, 2021

@miguelvr I already have. That number is for the RTX 2080 Ti: 6X is the difference between running the model in the server on my local machine and running it in a Jupyter notebook.

deliahu (Member) commented Jan 10, 2021

@lminer Another possibility is insufficient system memory (assuming the GPUs' memory is equivalent). It might be worth trying, e.g., a g4dn.4xlarge, which still has 1 GPU but has 64 GB of memory (versus 16 GB for the g4dn.xlarge).

lminer commented Jan 10, 2021

@deliahu probably best to forget about the AWS cluster for now. This is an issue on my local box, which has 90 GB of RAM, 2 GPUs, and a Threadripper 1950. If I run inference on the model in a Jupyter notebook, it is more than 6X faster than if I spin up a local cortex server and run inference through the server. When I run the model through the server, I only see ~10 seconds when the GPU is actually being used. The rest of the time both the CPU and GPU are idle.

deliahu (Member) commented Jan 10, 2021

@lminer It would be worth profiling where the time is going. An easy first step is to add a few log statements within your predict() function, to see what is eating the bulk of the time (e.g. is it before predict() is even called, or is it all in the self.client.predict() call?).

Also, since local mode has been removed in newer releases, and since local does not have exactly the same architecture as running in the cluster, it would probably be best to check on the cluster.
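
As an illustration of that suggestion, here is a minimal timing sketch, assuming the Cortex 0.25 TensorFlow predictor interface (a TensorFlowPredictor class whose predict() receives the request payload and forwards it to TensorFlow Serving via tensorflow_client); the logger setup is illustrative:

```python
# predictor.py -- timing sketch (assumes the Cortex 0.25 TensorFlowPredictor interface)
import logging
import time

logger = logging.getLogger(__name__)


class TensorFlowPredictor:
    def __init__(self, tensorflow_client, config):
        # tensorflow_client is Cortex's handle to the TensorFlow Serving container
        self.client = tensorflow_client

    def predict(self, payload):
        start = time.time()
        # any pre-processing of the payload could be timed separately here
        prediction = self.client.predict(payload)
        logger.info("self.client.predict() took %.2fs", time.time() - start)
        return prediction
```

Comparing these log timestamps against the request's end-to-end latency shows whether the time is spent inside self.client.predict() or before/after it.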

lminer commented Jan 10, 2021

@deliahu I did some logging and it looks like the slowdown is more like 10X: 124 seconds is spent in self.client.predict(). When I run inference via Jupyter, it only takes 11 seconds.

If I look at GPU utilization during those 124 seconds, it appears as if the GPU is only being used for ~10 seconds.

I'm wondering if this issue relates to #1740. I'm passing 40 MB inputs to TensorFlow Serving and back, and maybe it's bottlenecked on that for some reason.

Incidentally, the issue is the same on the cluster.

deliahu (Member) commented Jan 10, 2021

@lminer yes, that could be it, although 124 seconds still seems high to me for this kind of issue. Maybe it has something to do with how the tensor is created/encoded before being sent off to TF Serving? Although you did mention that the CPU is idle too...

One alternative you could look into is using Cortex's Python predictor type instead of the TensorFlow type (then there would not be an extra hop). How are you importing your model and running inference in your notebook, and would that be easily transferable to the PythonPredictor?

In the Python predictor, you would pass the path to your model in the API's config field, and download/load it in __init__(). Here are the docs, and we have a couple of examples that work like this, e.g. pytorch/iris-classifier.
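
For context, a minimal sketch of what that could look like, assuming the Cortex 0.25 PythonPredictor interface and a model exported as a Keras SavedModel; the "model_path" config key is hypothetical, and downloading the model from S3 in __init__() (as described above) is omitted:

```python
# predictor.py -- PythonPredictor sketch; the model runs in-process, so there is no
# extra hop to TensorFlow Serving
import tensorflow as tf


class PythonPredictor:
    def __init__(self, config):
        # "model_path" is a hypothetical key set in the API's config field; it is assumed
        # to point at a local copy of a Keras-exported SavedModel
        self.model = tf.keras.models.load_model(config["model_path"])

    def predict(self, payload):
        # payload["audio"] is expected to be a nested list / array of shape (num_samples, 2)
        audio = tf.constant(payload["audio"], dtype=tf.float32)
        prediction = self.model(audio, training=False)
        return prediction.numpy().tolist()
```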

lminer commented Jan 12, 2021

@deliahu maybe we should close this, as it is a consequence of the change made in #1740. The problem does indeed seem to be with passing large amounts of data to TensorFlow Serving.

deliahu (Member) commented Jan 12, 2021

@lminer I'm glad to hear you got it working!

I'd like to keep this issue open for now, since I have one more theory I'd like to try out (it still seems to take too long for this to be only a matter of networking).

When you were passing in the data via self.client.predict(), what was the type of the object you were passing in? Would you be able to send us an example model / input file / predictor.py that reproduces the super long latency? (I'm assuming the 40 MB input you referenced above comes from the user's request.) Feel free to email us at dev@cortex.dev.

lminer commented Jan 12, 2021

@deliahu Unfortunately I can't send along the model. The data object I was sending was as follows:

self.client.predict({"audio": audio}), where audio is a float32 tensor of shape (None, 2) and None corresponds to the number of samples in the audio file.
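
For a rough sense of scale (the sample count below is illustrative, not from the thread): a float32 array of shape (N, 2) reaches the ~40 MB mentioned earlier at about 5 million samples, i.e. roughly two minutes of stereo audio at 44.1 kHz:

```python
import numpy as np

# ~5 million stereo float32 samples come out to ~40 MB, before any serialization overhead
audio = np.zeros((5_000_000, 2), dtype=np.float32)
print(audio.nbytes / 1e6)  # -> 40.0 (MB)

# the call described above ("self.client" being Cortex's TensorFlow client handle):
# self.client.predict({"audio": audio})
```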

deliahu (Member) commented Jan 14, 2021

where audio is a float32 tensor of shape (None, 2)

@lminer just to confirm, was audio of type tf.Tensor?

lminer commented Jan 14, 2021

Yeah, that's correct. The same thing holds with a numpy array as well.

vishalbollu (Contributor) commented

Closing this issue because cortex local is no longer supported.
