# Running models at scale using DL & AWS Batch

In the previous notebook we ran our model using AWS Lambda. Although Lambda can execute our model asynchronously, we will often run into scenarios where a job needs more resources or where we need more control over how it executes. In these scenarios we can leverage another AWS service, Batch, which provides a scalable compute environment on which to execute our model. In this notebook we will run the same model as in the previous notebook, but using Batch instead of Lambda.

## Runtime code

For our Batch deployment we will need to slightly modify our runtime code from our Lambda example. We will be using `click` to simplify the command we pass along when submitting jobs. The runtime code can be found in `../app/dl_aws_batch.py` or seen below:

```python
from .model import get_field_class

import click
import logging
import json

logger = logging.getLogger()
logger.setLevel(logging.INFO)

@click.command()
@click.argument(
    "geom_str",
    type=click.STRING
)
@click.argument(
    "fid",
    type=click.STRING
)
@click.argument(
    "s3_bucket",
    type=click.STRING
)
@click.argument(
    "model_name",
    type=click.STRING
)
def run_batch_field_model(
    geom_str,
    fid,
    s3_bucket,
    model_name
):
    geom = json.loads(geom_str)
    result = get_field_class(geom, fid, s3_bucket, model_name)
    logger.info(result)

    return result

if __name__ == "__main__":
    run_batch_field_model()

```

Unlike in our Lambda example we will be passing in the S3 bucket and model name when we submit the jobs so these are added as arguments for our `run_batch_field_model()` function.
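To make the argument handling concrete, the sketch below mimics, outside of `click`, how the four positional strings are packed by the caller and unpacked by the entry point. The helper names `build_command` and `parse_command` are illustrative only and are not part of the project code.

```python
import json


def build_command(geom, fid, s3_bucket, model_name):
    """Pack the inputs as four strings, in the order the click command expects."""
    return [json.dumps(geom), str(fid), s3_bucket, model_name]


def parse_command(argv):
    """Mirror run_batch_field_model: decode the geometry, pass the rest through."""
    geom_str, fid, s3_bucket, model_name = argv
    return json.loads(geom_str), fid, s3_bucket, model_name


# Round trip: what the caller sends is what the entry point recovers.
geom = {"type": "Point", "coordinates": [-91.55, 35.8]}
argv = build_command(geom, "test-field", "dl-aws-onboarding", "classifier.joblib")
decoded_geom, fid, bucket, model = parse_command(argv)
```

Because every argument travels as a plain string, the geometry has to be JSON-encoded on the way in and decoded with `json.loads` inside the container.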

## Building a Docker image

As in the previous notebook we will need to build a special Docker image to be run in Batch. You can find the Dockerfile for this notebook in `../dockerfiles/batch/Dockerfile`. The file is fairly similar to the one we used in the Lambda example but with a few subtle differences:

```Dockerfile
FROM python:3.9-slim-buster

COPY app app

COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
RUN pip3 install -U "descarteslabs>=1.11.0"

ENV DESCARTESLABS_ENV=aws-production

RUN mkdir /tmp/models

ENTRYPOINT ["python3", "-m", "app.dl_aws_batch"]
```

We use a different base image here, `python:3.9-slim-buster`, and do not copy the application code to a Lambda-specific execution directory. We also specify an `ENTRYPOINT` rather than a `CMD`. Both are ways to specify what code to run in the container, but for this example an `ENTRYPOINT` will simplify things slightly when we submit our job, since the command we pass at submission is appended to it as arguments.

We now need to build our image and push it to ECR. For more information on these steps please see the following:

- [Creating an ECR repository](https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-create.html)
- [Build and push your Docker image](https://docs.aws.amazon.com/lambda/latest/dg/images-create.html)

The general steps for this process look something like this though:
1) `cd ~/dl-ea-aws-onboarding`
2) `docker build -t dl-aws-onboarding-batch -f dockerfiles/batch/Dockerfile .` You can specify a different name for your image by swapping out "dl-aws-onboarding-batch" for something else
3) `docker tag dl-aws-onboarding-batch:latest {your-container-registry}/dl-aws-onboarding-batch:latest`
4) `docker push {your-container-registry}/dl-aws-onboarding-batch:latest`

With the registry used in this notebook, the build, tag, and push commands look like this:

`docker build -t dl-aws-onboarding-batch -f dockerfiles/batch/Dockerfile .`

`docker tag dl-aws-onboarding-batch:latest 851517463584.dkr.ecr.us-west-2.amazonaws.com/dl-aws-onboarding-batch:latest`

`docker push 851517463584.dkr.ecr.us-west-2.amazonaws.com/dl-aws-onboarding-batch:latest`
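The image reference used in the tag and push commands follows the usual private ECR pattern of account ID, region, repository, and tag. As a small sketch, assuming a hypothetical helper named `ecr_image_uri`, the same string can be assembled in Python for later use in the job definition:

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build a private ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repository}:{tag}"


# Values taken from the commands above; substitute your own account and region.
image_uri = ecr_image_uri("851517463584", "us-west-2", "dl-aws-onboarding-batch")
```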

## Creating IAM role for Batch

We will now need to create some infrastructure pieces to support our Batch deployment. We first need to create an IAM role (permissions) to be used for executing our project code. This IAM role will need access to a few different services to be able to properly run the model. Please note that this notebook only provides a rough outline of how to create an IAM role for your application. When adjusting permissions and creating roles and policies, please proceed with caution to avoid any issues with security. Descartes Labs, Inc. is not responsible for any issues arising from improperly specified AWS security infrastructure. Please see the following docs for more information on how to tailor your role and policies to your needs: https://docs.aws.amazon.com/iam/index.html

We will be specifying our infrastructure on AWS using `boto3` (AWS's Python SDK).

In [None]:
import boto3
import json
import geopandas as gpd
import shapely.geometry as sg

We start by instantiating the IAM client.

In [None]:
aws_iam = boto3.client('iam')

We then need a trust policy that lets two AWS services assume our IAM role: S3 and ECS Tasks. Trusting these two service principals lets the role be used when working with S3 buckets and when running jobs on Batch/ECS.

In [None]:
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "sts:AssumeRole",
            "Principal": {
                "Service": "s3.amazonaws.com"
            },
            "Effect": "Allow",
            "Sid": ""
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ecs-tasks.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

In [None]:
role_response = aws_iam.create_role(
    RoleName='dl-aws-onboarding-batch-role',
    AssumeRolePolicyDocument=json.dumps(assume_role_policy)
)

In [None]:
role_response["Role"]["Arn"]
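The trust policy above repeats the same statement once per service. As a minimal sketch, assuming a hypothetical helper named `build_trust_policy`, the document can be generated from a list of service principals instead of being written out by hand:

```python
import json


def build_trust_policy(service_principals):
    """Return a trust policy letting each listed AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service},
                "Action": "sts:AssumeRole",
            }
            for service in service_principals
        ],
    }


trust_policy = build_trust_policy(["s3.amazonaws.com", "ecs-tasks.amazonaws.com"])
# create_role expects the policy as a JSON string in AssumeRolePolicyDocument.
trust_policy_json = json.dumps(trust_policy)
```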

### Attaching policies to the execution role

We now need to attach policies to this role to allow it to use specific AWS services. The policies we care about for this example are: S3 Access, EC2 Container Registry, CloudWatch Logs, and Secrets Manager. For more information about policies please see [these docs](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies.html). Again, you may want to specify narrower, more limited policies for your IAM role. For more information about this please see [here](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html).

In [None]:
execution_role = aws_iam.get_role(RoleName='dl-aws-onboarding-batch-role')

In [None]:
aws_iam.attach_role_policy(
    RoleName=execution_role["Role"]["RoleName"],
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess"
)

In [None]:
aws_iam.attach_role_policy(
    RoleName=execution_role["Role"]["RoleName"],
    PolicyArn="arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
)

In [None]:
aws_iam.attach_role_policy(
    RoleName=execution_role["Role"]["RoleName"],
    PolicyArn="arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"
)

In [None]:
aws_iam.attach_role_policy(
    RoleName=execution_role["Role"]["RoleName"],
    PolicyArn="arn:aws:iam::aws:policy/SecretsManagerReadWrite"
)
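The four attachment calls differ only in the policy ARN, so they could also be written as a loop. The sketch below is one way to do that. `attach_policies` and the `RecordingIAM` stand-in are hypothetical names, and the stand-in exists only so the example can be run without AWS credentials.

```python
MANAGED_POLICY_ARNS = (
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
    "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess",
    "arn:aws:iam::aws:policy/SecretsManagerReadWrite",
)


def attach_policies(iam_client, role_name, policy_arns=MANAGED_POLICY_ARNS):
    """Attach each managed policy to the role and return the ARNs attached."""
    attached = []
    for policy_arn in policy_arns:
        iam_client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
        attached.append(policy_arn)
    return attached


class RecordingIAM:
    """Stand-in for boto3.client("iam") that only records calls (dry run)."""

    def __init__(self):
        self.calls = []

    def attach_role_policy(self, RoleName, PolicyArn):
        self.calls.append((RoleName, PolicyArn))


dry_run = RecordingIAM()
attached = attach_policies(dry_run, "dl-aws-onboarding-batch-role")
```

With the real client, the call would be `attach_policies(aws_iam, execution_role["Role"]["RoleName"])`.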

## Create a compute environment

Although you can create a compute environment using boto3 we will instead use the management console. Navigate to Batch and "Compute environments". Then select "Create".

<img src="../images/create_compute_env_batch.png" align="center"/>

Now provide a name for your compute environment. Keep everything default until you select the "Instance Configuration". For this example choose either "Fargate" or "Fargate Spot". Please note that the cost of these services will differ depending on your selection. Please consult [the pricing documentation](https://aws.amazon.com/batch/pricing/) for more info on this.

<img src="../images/instance_config.png" align="center"/>

You will also need to specify a maximum number of vCPUs that can be leveraged by your compute environment. For this example you can specify a fairly low value (32, for example).

Under the networking section you will likely want to define a specific VPC to use for this compute environment. For more information on this please consult the [AWS docs here](https://docs.aws.amazon.com/batch/latest/userguide/get-set-up-for-aws-batch.html#create-a-vpc).

Finally you can click "Create compute environment"!
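For readers who prefer `boto3` over the console, the sketch below builds the keyword arguments for `create_compute_environment` on Fargate. The helper name `fargate_compute_env_kwargs` is hypothetical, and the subnet and security group IDs are placeholders that must be replaced with values from your own VPC. Depending on your account setup you may also need to pass a `serviceRole`.

```python
def fargate_compute_env_kwargs(name, subnets, security_group_ids, max_vcpus=32, spot=False):
    """Build keyword arguments for batch.create_compute_environment on Fargate."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "state": "ENABLED",
        "computeResources": {
            # "FARGATE_SPOT" corresponds to the "Fargate Spot" console option.
            "type": "FARGATE_SPOT" if spot else "FARGATE",
            "maxvCpus": max_vcpus,
            "subnets": list(subnets),
            "securityGroupIds": list(security_group_ids),
        },
    }


env_kwargs = fargate_compute_env_kwargs(
    "dl-aws-onboarding-batch",
    subnets=["subnet-EXAMPLE"],         # placeholder, not a real subnet
    security_group_ids=["sg-EXAMPLE"],  # placeholder, not a real security group
)
```

With a Batch client in hand, `boto3.client("batch").create_compute_environment(**env_kwargs)` would submit the request.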

## Creating job queue and job definition

Now that we have a compute environment we need to create [a job queue](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html) and [a job definition](https://docs.aws.amazon.com/batch/latest/userguide/job_definitions.html).

In [None]:
batch = boto3.client("batch")

We need to specify a few variables to use in our `boto3` calls for creating our queue and definition. The values in the cells below need to be specified to match your compute environment and Docker image in AWS ECR. The other values are used to specify the resources, retries, and timeouts associated with the jobs we will be running. For more info on these please see [the docs here](https://docs.aws.amazon.com/batch/latest/userguide/job_definition_parameters.html). There are also [specific resource limitations](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-cpu-memory-error.html) on which combinations of vCPUs and memory are allowed.

In [None]:
compute_env = "dl-aws-onboarding-batch"
image = "851517463584.dkr.ecr.us-west-2.amazonaws.com/dl-aws-onboarding-batch:latest"

timeout = 600
vcpu = 1
memory = 2048
retries = 1
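Since Fargate only accepts certain vCPU and memory pairings, it can help to check the values before registering a job definition. The sketch below is a rough check under stated assumptions: the table is a partial snapshot of commonly documented combinations (larger sizes may also be available), so please verify it against the linked AWS docs. `FARGATE_MEMORY_MIB` and `is_valid_fargate_size` are hypothetical names.

```python
# Partial snapshot of vCPU -> allowed memory (MiB); check the AWS docs for the current table.
FARGATE_MEMORY_MIB = {
    0.25: [512, 1024, 2048],
    0.5: list(range(1024, 4096 + 1, 1024)),
    1: list(range(2048, 8192 + 1, 1024)),
    2: list(range(4096, 16384 + 1, 1024)),
    4: list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(vcpu, memory_mib):
    """Return True if the pairing appears in the snapshot table above."""
    return memory_mib in FARGATE_MEMORY_MIB.get(vcpu, [])


# The values used in this notebook: 1 vCPU with 2048 MiB.
notebook_size_ok = is_valid_fargate_size(1, 2048)
```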

### Job queue

We start by defining the queue. We must provide a name, the compute environment, and a priority.

In [None]:
# Define the name of the job queue
queue_name = "dl-aws-onboarding-batch-queue"

# Create the job queue
response = batch.create_job_queue(
    jobQueueName=queue_name,
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        {
            "order": 0,
            "computeEnvironment": compute_env,
        },
    ],
)
job_queue_arn = response["jobQueueArn"]

In [None]:
print(job_queue_arn)

In [None]:
job_queue_arn = "arn:aws:batch:us-west-2:851517463584:job-queue/dl-aws-onboarding-batch-queue"

### Creating a DL Auth secret in AWS Secrets Manager

Before we create our job definition we will want to add our DL credentials to AWS Secrets Manager so that we can authenticate to use the DL services from our Batch jobs. For more information on how to do this [please see the docs](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html). You will need to store your client id and secret in Secrets Manager and then get the ARN associated with the secret to use in your job definition. The previous notebook on Lambda has details on how to get your client id and secret.

In [None]:
dl_auth_secret_name = "arn:aws:secretsmanager:us-west-2:851517463584:secret:dylan_aws_dlauth_creds-dfmj5U"
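The secret can also be created from Python rather than the console. The sketch below assumes the credentials are stored as a single JSON secret with the keys `client_id` and `client_secret`, which is what the `valueFrom` references in the job definition expect. `create_dl_secret`, `secret_value_from`, and the `FakeSecretsManager` stand-in are hypothetical names, and the stand-in only exists so the example runs offline.

```python
import json


def create_dl_secret(secrets_client, name, client_id, client_secret):
    """Store both credentials in one JSON secret and return the secret's ARN."""
    payload = json.dumps({"client_id": client_id, "client_secret": client_secret})
    response = secrets_client.create_secret(Name=name, SecretString=payload)
    return response["ARN"]


def secret_value_from(secret_arn, json_key):
    """Point at one JSON key of the secret: <arn>:<json-key>:<version-stage>:<version-id>."""
    return f"{secret_arn}:{json_key}::"


class FakeSecretsManager:
    """Stand-in for boto3.client("secretsmanager"), so the sketch runs offline."""

    def __init__(self):
        self.stored = {}

    def create_secret(self, Name, SecretString):
        self.stored[Name] = SecretString
        # Made-up ARN; real ARNs end in a random suffix chosen by AWS.
        return {"ARN": f"arn:aws:secretsmanager:us-west-2:000000000000:secret:{Name}-EXAMPLE"}


fake_secrets = FakeSecretsManager()
secret_arn = create_dl_secret(fake_secrets, "dl-auth-creds", "my-client-id", "my-client-secret")
client_id_ref = secret_value_from(secret_arn, "client_id")
```

With real credentials, `boto3.client("secretsmanager")` would be passed in place of the stand-in. Avoid hard-coding the client id and secret in the notebook itself.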

### Job definition

We can now create our job definition.

In [None]:
job_def_name = "dl-aws-onboarding-batch-definition"
# Register the job definition (re-registering an existing name creates a new revision)
response = batch.register_job_definition(
    jobDefinitionName=job_def_name,
    type="container",
    timeout={"attemptDurationSeconds": timeout},
    containerProperties={
        "image": image,
        "executionRoleArn": execution_role["Role"]["Arn"],
        "jobRoleArn": execution_role["Role"]["Arn"],
        "resourceRequirements": [
            {"value": str(vcpu), "type": "VCPU"},
            {"value": str(memory), "type": "MEMORY"},
        ],
        "networkConfiguration": {"assignPublicIp": "ENABLED"},
        "secrets": [
            {
                "name": "DESCARTESLABS_CLIENT_ID",
                "valueFrom": f"{dl_auth_secret_name}:client_id::"
            },
            {
                "name": "DESCARTESLABS_CLIENT_SECRET",
                "valueFrom": f"{dl_auth_secret_name}:client_secret::"
            },
        ],
    },
    platformCapabilities=[
        "FARGATE",
    ],
    retryStrategy={
        "attempts": retries,
    },
)
job_definition_arn = response["jobDefinitionArn"]

In [None]:
print(job_queue_arn, job_definition_arn)

In [None]:
job_definition_arn = "arn:aws:batch:us-west-2:851517463584:job-definition/dl-aws-onboarding-batch-definition:9"

## Submitting jobs

With our job queue and definition created we can now start submitting jobs. To do this you can use the simple function below. The function takes a geometry, a field identifier (unique id), and the S3 bucket and model name for the model you had previously stored (please see the Lambda notebook for information on this).

In [None]:
def submit_job(
    geom, 
    fid, 
    s3_bucket,
    model_name
):
    cmd = [
        json.dumps(geom),
        str(fid),  # every command entry must be a string
        s3_bucket,
        model_name
    ]

    response = batch.submit_job(
        jobName=f"dl_ea_onboarding_class_fid-{fid}",
        jobQueue=job_queue_arn,
        jobDefinition=job_definition_arn,
        containerOverrides={"command": cmd}
    )
    
    return response

We can submit a job using a simple test geometry to start:

In [None]:
test_geom = {
    "type": "Polygon",
    "coordinates": [
      [
        [
          -91.55319213867188,
          35.805249625952506
        ],
        [
          -91.54885768890381,
          35.805249625952506
        ],
        [
          -91.54885768890381,
          35.80895624882348
        ],
        [
          -91.55319213867188,
          35.80895624882348
        ],
        [
          -91.55319213867188,
          35.805249625952506
        ]
      ]
    ]
  }

test_fid = "test-field"
s3_bucket = "dl-aws-onboarding"
model_name = "classifier.joblib"

In [None]:
job_response = submit_job(test_geom, test_fid, s3_bucket, model_name)

In [None]:
job_response
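The response only confirms that the job was accepted, so we still need to check whether it finished. A minimal polling sketch using `describe_jobs` follows. `wait_for_job` is a hypothetical helper, and `FakeBatch` is a stand-in that replays a fixed sequence of statuses so the example runs without AWS.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_job(batch_client, job_id, poll_seconds=15, max_polls=40):
    """Poll describe_jobs until the job succeeds or fails; return the last status seen."""
    status = None
    for _ in range(max_polls):
        jobs = batch_client.describe_jobs(jobs=[job_id])["jobs"]
        status = jobs[0]["status"] if jobs else None
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    return status


class FakeBatch:
    """Stand-in for boto3.client("batch") that replays a list of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_jobs(self, jobs):
        return {"jobs": [{"jobId": jobs[0], "status": next(self._statuses)}]}


fake_batch = FakeBatch(["SUBMITTED", "RUNNABLE", "RUNNING", "SUCCEEDED"])
final_status = wait_for_job(fake_batch, "example-job-id", poll_seconds=0)
```

With the real client this would be `wait_for_job(batch, job_response["jobId"])`.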

### Submitting multiple jobs

The final piece will be to submit a list of jobs to Batch to be run asynchronously. To do this we will load a few hundred agricultural fields in Iowa and submit the geometries to our queue. We load the GeoJSON data into a GeoPandas GeoDataFrame and visualize the fields.

In [None]:
ia_fields = gpd.read_file("ia_test_fields.geojson")

In [None]:
ia_fields.head()

In [None]:
ia_fields.plot()

We specify the location of our model in S3 and then, for each field in our GeoDataFrame, submit a job to our queue. The geometries we submit must be JSON-serializable, so we use `shapely`'s `mapping` to convert each geometry to a GeoJSON-like dict.

In [None]:
s3_bucket = "dl-aws-onboarding"
model_name = "classifier.joblib"

In [None]:
jobs = []
for index, row in ia_fields.iterrows():
    # print(sg.mapping(row["geometry"]), row["FBndID"], s3_bucket, model_name)
    jobs.append(submit_job(sg.mapping(row["geometry"]), row["FBndID"], s3_bucket, model_name))

Now we can visit our Batch dashboard to watch as our tasks are run!

<img src="../images/batch_dashboard.png" align="center"/>

In [None]:
len(jobs)
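Beyond the number of submissions, it is often useful to see how many jobs are in each state without opening the dashboard. The sketch below tallies statuses with `describe_jobs`, querying in chunks because that call accepts at most 100 job IDs per request. `job_status_counts` is a hypothetical helper and `FakeBatch` is an offline stand-in.

```python
from collections import Counter


def job_status_counts(batch_client, job_ids, chunk_size=100):
    """Count job statuses, calling describe_jobs on at most chunk_size IDs at a time."""
    counts = Counter()
    for start in range(0, len(job_ids), chunk_size):
        chunk = job_ids[start:start + chunk_size]
        for job in batch_client.describe_jobs(jobs=chunk)["jobs"]:
            counts[job["status"]] += 1
    return counts


class FakeBatch:
    """Stand-in for boto3.client("batch") backed by a dict of job ID -> status."""

    def __init__(self, status_by_id):
        self.status_by_id = status_by_id
        self.calls = 0

    def describe_jobs(self, jobs):
        self.calls += 1
        return {"jobs": [{"jobId": j, "status": self.status_by_id[j]} for j in jobs]}


# 250 made-up jobs: every fifth one is still running, the rest have succeeded.
job_ids = [f"job-{i}" for i in range(250)]
fake_batch = FakeBatch({j: ("SUCCEEDED" if i % 5 else "RUNNING") for i, j in enumerate(job_ids)})
counts = job_status_counts(fake_batch, job_ids)
```

With the real client and the `jobs` list above, the call would be `job_status_counts(batch, [job["jobId"] for job in jobs])`.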

In [None]:
jobs[5]
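Because the runtime code logs the model result with `logger.info`, the output of a finished job ends up in CloudWatch Logs. The sketch below reads it back, assuming the default Batch log group `/aws/batch/job` and fetching only the first page of events. `job_log_messages` and both stand-in classes are hypothetical, and the log stream name and message are made up.

```python
def job_log_messages(batch_client, logs_client, job_id, log_group="/aws/batch/job"):
    """Return the first page of log lines written by one Batch job, oldest first."""
    jobs = batch_client.describe_jobs(jobs=[job_id])["jobs"]
    stream = jobs[0].get("container", {}).get("logStreamName") if jobs else None
    if stream is None:  # the job has not started a container yet
        return []
    events = logs_client.get_log_events(
        logGroupName=log_group, logStreamName=stream, startFromHead=True
    )["events"]
    return [event["message"] for event in events]


class FakeBatch:
    """Stand-in for boto3.client("batch") returning a fixed log stream name."""

    def describe_jobs(self, jobs):
        return {"jobs": [{"jobId": jobs[0], "container": {"logStreamName": "example/default/abc123"}}]}


class FakeLogs:
    """Stand-in for boto3.client("logs") returning one made-up log line."""

    def get_log_events(self, logGroupName, logStreamName, startFromHead):
        return {"events": [{"message": "example model output"}]}


messages = job_log_messages(FakeBatch(), FakeLogs(), "example-job-id")
```

With real clients this would be `job_log_messages(batch, boto3.client("logs"), jobs[5]["jobId"])`.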

Congrats! You have now deployed a model at scale leveraging the DL platform and AWS Batch.