# Truss on GPU

In this guide, we'll walk through how to use Truss on a GPU. We'll build a model that uses the GPU, package it as a Truss, and run that Truss with GPU support on an AWS instance.

# Getting a GPU server on AWS

We'll start by setting up our GPU instance. 
1. Navigate to the [EC2](https://us-west-2.console.aws.amazon.com/ec2/) dashboard and select the __Create Instance__ button on the top right. 
2. You'll need to select an instance type. [This page](https://instances.vantage.sh/?region=us-west-2) is helpful if you want to compare GPU instance pricing (search for "GPU" in the *Name* column). We'll be using a `p2.xlarge`.
3. Next, we'll select an AMI. Specifically, we'll use the *Deep Learning AMI GPU PyTorch 1.12.0 (Amazon Linux 2)* AMI. This AMI is free and comes with much of the necessary setup for GPU support. When looking at GPU AMIs, we want one that supports CUDA, cuDNN, and the library we're using (in this case, PyTorch).
4. Generate a key pair; we'll need it to connect to our instance. Make sure to download the resulting `.cer` file.
5. Press __Launch Instance__ on the bottom right. 

# Connecting to your GPU server 

Now that we have a GPU server, let's connect to it. 
1. On your AWS dashboard, find the instance's `Public IPv4 DNS`. This is the address we'll use to access our instance.
2. In your local terminal, execute the following command. This'll SSH us into our GPU server. 
```
ssh -i [PATH_TO_.CER_FILE] ec2-user@[IPV4_DNS_ADDRESS]
```
3. If successful, your terminal prompt should now show the user as `ec2-user@ip-[IPV4_ADDRESS]`. You should also be able to run `nvidia-smi` to see details about the GPU and the NVIDIA drivers on your system.
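
If you'd rather check the driver from a script than by eye, the sketch below wraps the same `nvidia-smi` check in Python. The helper name `gpu_summary` is our own, and the queried fields are just one reasonable selection; it returns `None` when `nvidia-smi` is missing or fails.

```python
import shutil
import subprocess
from typing import Optional


def gpu_summary() -> Optional[str]:
    """Return one CSV line per GPU, or None if nvidia-smi is unavailable or fails."""
    # shutil.which returns None when the binary is not on PATH.
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,driver_version,memory.total",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code is handled below instead of raising
    )
    return result.stdout.strip() if result.returncode == 0 else None


print(gpu_summary() or "nvidia-smi not found or no GPU visible")
```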

# Setting up the GPU server
We'll need to install a couple packages before we can create and run our model. 
1. We'll want to activate the PyTorch environment that comes with our AMI
``` 
source activate pytorch 
```

2. We'll also want to install Truss and the Transformers package 
``` 
pip install --upgrade truss transformers
```
3. Let's create a directory where our model will live
```
mkdir dialo
cd dialo
```
4. We'll also want to make sure that `docker buildx` is installed. If running `docker buildx` results in a "command not found" error, we can install it with the commands below (they rely on `jq` being available):
```
LATEST=$(wget -qO- "https://api.github.com/repos/docker/buildx/releases/latest" | jq -r .name)
wget https://github.com/docker/buildx/releases/download/$LATEST/buildx-$LATEST.linux-amd64
chmod a+x buildx-$LATEST.linux-amd64
mkdir -p ~/.docker/cli-plugins
mv buildx-$LATEST.linux-amd64 ~/.docker/cli-plugins/docker-buildx
```

5. Finally, we'll initialize a base Truss scaffold 
```
truss init ./ 
``` 
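
Before creating the model, it can be worth confirming that the `pytorch` environment activated in step 1 actually sees the GPU. The following is a minimal sketch; the function name `describe_torch_device` is our own, and the check falls back to a message if `torch` isn't importable in the current environment.

```python
import importlib.util


def describe_torch_device() -> str:
    """Report whether PyTorch is importable and whether it can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    if torch.cuda.is_available():
        # Device index 0 is the first (and on a p2.xlarge, the only) GPU.
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "torch is installed, but no CUDA device is visible"


print(describe_torch_device())
```

If this reports that no CUDA device is visible, the model in the next section will still load, but it will silently fall back to the CPU.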

# Creating our model 
We'll be using Microsoft's `DialoGPT` model, which is designed to converse with users. Let's replace the base `model.py` with the GPU-compatible `model.py` below (the comments in the snippet explain the GPU-specific changes).


```
# Our model.py file
from typing import Dict, List

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class Model(object):
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs['data_dir']
        self._config = kwargs['config']
        self._model = None

    def load(self):
        # First, in our load function, we set a `device` attribute. This lets us push
        # the model and its inputs to the GPU instead of the CPU. When we run inference
        # on the GPU, the data we're using must live on the GPU as well.
        self.device = 0 if torch.cuda.is_available() else 'cpu'
        self._tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
        # Now, we'll load our model onto the GPU. In order to utilize the GPU, both the model and any
        # variable needed for inference (like the tokenized input) need to be `.to(self.device)`.
        self._model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large").to(self.device)
        self.ready = True

    def preprocess(self, request: Dict) -> Dict:
        """
        Incorporate pre-processing required by the model if desired here.

        These might be feature transformations that are tightly coupled to the model.
        """
        return request

    def postprocess(self, request: Dict) -> Dict:
        """
        Incorporate post-processing required by the model if desired here.
        """
        return request

    def predict(self, request: Dict) -> Dict[str, List]:
        response = {}
        inputs = request['inputs'] # noqa
        with torch.no_grad():
            conversation_history = []
            for message in inputs:
                text = message['prompt'] + self._tokenizer.eos_token
                text_ids = self._tokenizer.encode(text, return_tensors='pt')
                conversation_history.append(text_ids)
            # As mentioned above, this is where we "push" our input to the GPU.
            # The '.to(self.device)' pushes the Torch Tensor to our GPU.
            conversation_history_torch = torch.cat(conversation_history, dim=-1).to(self.device)
            chat_history_ids = self._model.generate(conversation_history_torch, max_length=500, pad_token_id=self._tokenizer.eos_token_id)
            # NOTE: You cannot move the tokenizer to the GPU, only the model and any input to the model.
            # The slice below drops the prompt tokens so that only the newly generated reply is decoded.
            decoded_response = self._tokenizer.decode(chat_history_ids[:, conversation_history_torch.shape[-1]:][0], skip_special_tokens=True)
        response['response'] = decoded_response
        return response
```
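
The placement rule from the comments above (the model and every input tensor must sit on the same device) can be tried in isolation. This is a minimal sketch rather than part of the Truss: a tiny `torch.nn.Linear` layer stands in for DialoGPT, and the function name `run_on_best_device` is hypothetical. It returns `None` if `torch` isn't installed.

```python
import importlib.util
from typing import Optional


def run_on_best_device() -> Optional[str]:
    """Run a toy forward pass and return the device type the output lives on."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    # Same decision as in load(): use the first GPU if present, else the CPU.
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    # Stand-in for the real model; .to(device) moves its weights.
    model = torch.nn.Linear(4, 2).to(device)
    # Inputs must be moved too, otherwise PyTorch raises a device-mismatch error.
    inputs = torch.ones(1, 4).to(device)

    with torch.no_grad():
        outputs = model(inputs)
    return outputs.device.type


print(run_on_best_device())
```

The output tensor lives wherever the model and inputs were placed, which is why `predict` can decode the generated ids without moving the tokenizer anywhere.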

We'll also need to edit our `config.yaml` file. Specifically, under the `resources` key, we'll remove the other keys and keep only `use_gpu`, set to `true`, which tells Truss to run our model in GPU mode. Your `config.yaml` should look something like this:

```
data_dir: data
environment_variables: {}
examples_filename: examples.yaml
input_type: Any
model_class_filename: model.py
model_class_name: Model
model_framework: custom
model_metadata: {}
model_module_dir: model
model_name: null
model_type: custom
python_version: py39
requirements: []
resources:
  use_gpu: true
secrets: {}
system_packages: []
```
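
A quick way to double-check the edit is to confirm that `use_gpu` is set under `resources`. The sketch below is dependency-free and only understands the simple two-level layout shown above; the function name `gpu_enabled` is our own, and a real check would be better served by a proper YAML parser.

```python
def gpu_enabled(config_text: str) -> bool:
    """Return True if `use_gpu: true` appears under the top-level `resources` key."""
    in_resources = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if not line.startswith((" ", "\t")):
            # An unindented line starts a new top-level key.
            in_resources = stripped == "resources:"
            continue
        if in_resources:
            key, _, value = stripped.partition(":")
            if key == "use_gpu":
                return value.strip().lower() == "true"
    return False


example = "requirements: []\nresources:\n  use_gpu: true\nsecrets: {}\n"
print(gpu_enabled(example))
```

To run it against the real file, pass in the file's contents, for example `gpu_enabled(open("config.yaml").read())`.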

# Running our GPU-enabled Truss 
We've set up our GPU instance, defined our model, made it GPU-compatible, and updated our `config.yaml` to let Truss know to use the GPU. Let's run our model now!

We can run our model by using the Truss CLI. To begin, we'll run our model locally with the `--run-local` flag. 

```
truss predict --target_directory ./ --request '{"inputs" : [{"prompt" : "Hi whats your name"}]}' --run-local
```
The model should respond to your question. To continue the conversation, we just pass another prompt object in the request. 

```
truss predict --target_directory ./ --request '{"inputs" : [{"prompt" : "Hi whats your name"}, {"prompt" : "Im not sure what you mean by that"}]}' --run-local
```
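
As the conversation grows, writing the JSON by hand gets error-prone, especially once a prompt contains an apostrophe. The sketch below builds the same command from a list of prompts; `build_predict_command` is a hypothetical helper, and it uses only the flags shown in the commands above.

```python
import json
import shlex
from typing import List


def build_predict_command(prompts: List[str], run_local: bool = True) -> str:
    """Build a shell-safe `truss predict` command for a list of conversation turns."""
    payload = {"inputs": [{"prompt": prompt} for prompt in prompts]}
    parts = [
        "truss", "predict",
        "--target_directory", "./",
        "--request", json.dumps(payload),
    ]
    if run_local:
        parts.append("--run-local")
    # shlex.quote escapes quotes and spaces so the JSON survives the shell intact.
    return " ".join(shlex.quote(part) for part in parts)


print(build_predict_command(["Hi whats your name", "Im not sure what you mean by that"]))
```

The printed command can be pasted into the terminal as-is.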

We can also run the model inside a Docker image by dropping the `--run-local` flag. This is useful when you want to deploy your Truss on a server as a single packaged unit.
```
truss predict --target_directory ./ --request '{"inputs" : [{"prompt" : "Hi whats your name"}]}'
```