# Introduction

We're introducing a new capability in the SageMaker Python SDK that lets data scientists run their Python functions as SageMaker jobs, taking advantage of the compute power SageMaker offers. This sample notebook gives a quick introduction to the capability using dummy Python functions.

Before running the samples here, please follow the quick start guide to set up the environment properly, including downloading and installing the private version of SageMaker Python SDK.


## Install the dependencies

In [None]:
%pip install -r ./requirements.txt

## Invoke a Python function as a SageMaker job

There are two ways to invoke a function as a job.

* Using the `@remote` decorator. Calling the decorated function runs it as a job synchronously: the call blocks until the job finishes.

In [None]:
from sagemaker.remote_function import remote

@remote
def divide(x, y):
    print(f"Calculating {x}/{y}")
    return x / y

divide(3, 2)

* Using the RemoteExecutor APIs. They follow the pattern of the Python [concurrent.futures](https://docs.python.org/3/library/concurrent.futures.html) APIs.

In [None]:
def divide(x, y):
    print(f"Calculating {x}/{y}")
    return x / y

from sagemaker.remote_function import RemoteExecutor

with RemoteExecutor(max_parallel_job=1, keep_alive_period_in_seconds=30) as executor:
    futures = [executor.submit(divide, x, 2) for x in [1, 2, 3]]

In [None]:
[future.result() for future in futures]
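
For comparison, the same submit-then-collect pattern can be tried locally with the standard library's `ThreadPoolExecutor`. This is only a sketch of the `concurrent.futures` pattern that `RemoteExecutor` follows; it runs in local threads, launches no SageMaker job, and `max_workers=1` is merely a loose analogue of limiting the executor to one parallel job.

```python
from concurrent.futures import ThreadPoolExecutor


def divide(x, y):
    print(f"Calculating {x}/{y}")
    return x / y


# Local stand-in for RemoteExecutor: one worker means one call runs at a time.
with ThreadPoolExecutor(max_workers=1) as executor:
    # submit() returns immediately with a Future for each call.
    futures = [executor.submit(divide, x, 2) for x in [1, 2, 3]]

# result() blocks until the corresponding call has finished.
results = [future.result() for future in futures]
print(results)  # [0.5, 1.0, 1.5]
```

As in the cells above, the futures are created inside the `with` block and their results are read afterwards.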

## Customize the decorator settings

In the examples above, the `@remote` decorator and `RemoteExecutor` look for configurations in the following places:

* The global configuration file at `<user’s home folder>/.sagemaker/defaults/sdk-default-config.yaml`
* The local configuration file in the directory where the decorated function is invoked, specifically `./config.yaml`

Configurations in the local configuration file take precedence over those in the global one.

You can override the configurations by specifying the decorator arguments directly. The following example launches the job on a more powerful `ml.m5.4xlarge` instance instead of the `ml.m5.xlarge` specified in `./config.yaml`.

In [None]:
@remote(instance_type='ml.m5.4xlarge')
def divide(x, y):
    print(f"Calculating {x}/{y}")
    return x / y

divide(3, 2)
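
The precedence described in this section (decorator arguments over the local file, the local file over the global file) can be pictured with a small lookup sketch. This is not how the SDK resolves settings internally; the flat dictionaries and the global values are hypothetical stand-ins for the YAML files, whose real structure is not shown in this notebook.

```python
from collections import ChainMap

# Hypothetical, flattened settings. The real configuration files are YAML and
# their exact keys are not shown in this notebook.
global_config = {"instance_type": "ml.m5.large", "keep_alive_period_in_seconds": 0}
local_config = {"instance_type": "ml.m5.xlarge"}
decorator_args = {"instance_type": "ml.m5.4xlarge"}

# ChainMap searches its mappings from left to right, so earlier mappings win.
effective = ChainMap(decorator_args, local_config, global_config)

print(effective["instance_type"])                 # ml.m5.4xlarge
print(effective["keep_alive_period_in_seconds"])  # 0
```

A setting that appears only in the global file is still picked up, while a setting given in several places resolves to the most specific one.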

## Add extra dependencies using a conda environment yml file

In the example below, the function will run in a new conda environment where pandas and sagemaker will be installed.
(Note that the cell will not run if executed in SageMaker Studio.)

In [None]:
import pandas as pd

@remote(dependencies='./environment.yml')
def multiply(dataframe: pd.DataFrame, factor: float):
    return dataframe * factor

df = pd.DataFrame(
    {
        "A": [14, 4, 5, 4, 1],
        "B": [5, 2, 54, 3, 2],
        "C": [20, 20, 7, 3, 8],
        "D": [14, 3, 6, 2, 6],
    }
)

multiply(df, 10.0)
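
The contents of `./environment.yml` are not shown in this notebook. As a rough illustration only, a conda environment file that installs pandas and sagemaker might look like the text printed below. The environment name, the channel, and the choice to install `sagemaker` through pip are all assumptions, and the `sagemaker` entry is a placeholder for whichever SDK build the quick start guide prescribes.

```python
# Hypothetical contents for ./environment.yml. Only the package names (pandas
# and sagemaker) come from the text above; everything else is an assumption.
EXAMPLE_ENVIRONMENT_YML = """\
name: remote-function-example
channels:
  - conda-forge
dependencies:
  - pandas
  - pip
  - pip:
      - sagemaker
"""

print(EXAMPLE_ENVIRONMENT_YML)
```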

## Common errors

### SerializationError and DeserializationError

Behind the scenes, the function, its arguments, and its return value are serialized and deserialized using pickle.
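
The round trip can be tried locally with the standard `pickle` module. The sketch below assumes the SDK's serializer behaves like standard `pickle` for these objects; it shows plain arguments surviving serialization and an open file handle being rejected. Locally the failure is a `TypeError`, not the SDK's own error type.

```python
import os
import pickle

# Plain data survives a serialize/deserialize round trip unchanged.
args = (3, 2)
round_tripped = pickle.loads(pickle.dumps(args))
print(round_tripped)  # (3, 2)

# An open file handle does not: pickle refuses to serialize it.
with open(os.devnull) as handle:
    try:
        pickle.dumps(handle)
        outcome = "pickled"
    except TypeError as exc:
        outcome = f"not picklable: {exc}"

print(outcome)
```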

A `SerializationError` occurs when an object that cannot be pickled, such as an XGBoost `DMatrix` or an open file object, is passed as a function argument. A `DeserializationError` typically occurs when the dependencies in the local environment differ from those in the job environment. In the following example, the dataframe is created locally with the latest pandas and passed to the function call. The job environment has an older version of pandas installed, so the dataframe can't be deserialized there.

In [None]:
import pandas as pd

@remote(dependencies='./incompatible_requirements.txt')
def multiply(dataframe: pd.DataFrame, factor: float):
    return dataframe * factor


df = pd.DataFrame(
    {
        "A": [14, 4, 5, 4, 1],
        "B": [5, 2, 54, 3, 2],
        "C": [20, 20, 7, 3, 8],
        "D": [14, 3, 6, 2, 6],
    }
)

multiply(df, 10.0)
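
One way to catch this kind of mismatch before launching a job is to compare the exact pins in the requirements file against the locally installed versions. The helper below is a minimal sketch, not part of the SDK: `find_version_mismatches` is a hypothetical name, it only understands `name==version` lines, and the example file contents and local versions are made up for illustration.

```python
from importlib import metadata


def find_version_mismatches(requirements_text, installed_version=metadata.version):
    """Return {package: (pinned, local)} for each `name==version` pin that differs locally."""
    mismatches = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # this sketch only compares exact pins
        name, pinned = (part.strip() for part in line.split("==", 1))
        try:
            local = installed_version(name)
        except metadata.PackageNotFoundError:
            local = None  # package is not installed locally
        if local != pinned:
            mismatches[name] = (pinned, local)
    return mismatches


# Hypothetical file contents and local versions, for illustration only.
example_requirements = "pandas==1.3.5\nnumpy==1.26.4  # same as local\n"
fake_local = {"pandas": "2.2.2", "numpy": "1.26.4"}

print(find_version_mismatches(example_requirements, fake_local.get))
# {'pandas': ('1.3.5', '2.2.2')}
```

Called with only the text of a real requirements file, the helper falls back to `importlib.metadata.version` and reports pins that differ from the packages installed in the current environment.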