# DJL Sklearn Handler Tutorial

This notebook shows how to use the DJL Serving scikit-learn handler to serve models on SageMaker.

**Prerequisites:**
- Write access to an S3 bucket
- SageMaker access

## Step 1: Setup and Install Dependencies

In [None]:
%pip install sagemaker scikit-learn --upgrade --quiet

In [None]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import joblib
import json
import sys
import sklearn

role = sagemaker.get_execution_role()
sess = sagemaker.session.Session()
region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name
account_id = sess.account_id()

## Step 2: Create and Train Sklearn Model

DJL Serving's scikit-learn handler supports multiple model persistence formats:

**Secure formats (default):**
- `skops`: .skops files

**Insecure formats (require trust_insecure_model_files=true):**
- `joblib`: .joblib, .jl files
- `pickle`: .pkl, .pickle files
- `cloudpickle`: .pkl, .pickle, .cloudpkl files

For security, only skops format is allowed by default. Other formats require setting `option.trust_insecure_model_files=true`.
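
The format rules above can be restated as a small lookup. This is a minimal sketch, not DJL Serving's own code: `candidate_formats` and `needs_trust_flag` are hypothetical helper names, and the extension sets are copied from the lists above. It also shows why the format has to be named explicitly: `.pkl` and `.pickle` match both `pickle` and `cloudpickle`.

```python
from pathlib import Path

# Extension sets copied from the format lists in this notebook.
FORMAT_EXTENSIONS = {
    "skops": {".skops"},
    "joblib": {".joblib", ".jl"},
    "pickle": {".pkl", ".pickle"},
    "cloudpickle": {".pkl", ".pickle", ".cloudpkl"},
}
SECURE_FORMATS = {"skops"}


def candidate_formats(filename):
    """Return every format whose extension list matches the file (hypothetical helper)."""
    ext = Path(filename).suffix.lower()
    return sorted(fmt for fmt, exts in FORMAT_EXTENSIONS.items() if ext in exts)


def needs_trust_flag(model_format):
    """True when the format requires option.trust_insecure_model_files=true."""
    return model_format not in SECURE_FORMATS


print(candidate_formats("model.joblib"))  # ['joblib']
print(candidate_formats("model.pkl"))     # ['cloudpickle', 'pickle'] (ambiguous)
print(needs_trust_flag("joblib"))         # True
```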

In [None]:
# Create synthetic dataset for demo purposes
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X, y)

# Save model
joblib.dump(model, 'model.joblib')
print("Model trained and saved")

## Step 3: Using the Built-in Handler's Default Behavior

The scikit-learn handler must be specified as the entry point in the model server options, either in a serving.properties file or via environment variables that are passed to the SageMaker Model object. For more detail, read DJL's deployment guide: https://docs.djl.ai/master/docs/serving/serving/docs/lmi/deployment_guide/deploying-your-endpoint.html.

You must also specify the model format (skops, joblib, pickle, or cloudpickle) in the model server options. If the model format is not secure (only skops is secure by default), you must additionally set OPTION_TRUST_INSECURE_MODEL_FILES to true. If using skops, set OPTION_SKOPS_TRUSTED_TYPES, for example option.skops_trusted_types='sklearn.ensemble._forest.RandomForestClassifier,numpy.ndarray'.

DJL supports the following file extensions for each format:

- `skops`: .skops
- `joblib`: .joblib, .jl
- `pickle`: .pkl, .pickle
- `cloudpickle`: .pkl, .pickle, .cloudpkl
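
For the environment-variable route, the sketch below converts `option.*` lines from a serving.properties file into an `env` dictionary. The naming rule (drop the `option.` prefix, upper-case the rest, prepend `OPTION_`) is an assumption extrapolated from the pairing this notebook mentions, `option.trust_insecure_model_files` and `OPTION_TRUST_INSECURE_MODEL_FILES`; check the DJL deployment guide before relying on it for other keys. `properties_to_env` is a hypothetical helper, and non-`option.` keys such as `engine` are skipped because their environment-variable names are not covered here.

```python
def properties_to_env(properties_text):
    """Convert option.* lines into OPTION_* variables (assumed naming rule)."""
    env = {}
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and malformed lines.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key.startswith("option."):
            env["OPTION_" + key[len("option."):].upper()] = value.strip()
    return env


props = """
engine=Python
option.entryPoint=djl_python.sklearn_handler
option.model_format=joblib
option.trust_insecure_model_files=true
"""

env = properties_to_env(props)
for name, value in env.items():
    print(f"{name}={value}")
```

The resulting dictionary has string values only, which is the shape the `env` parameter of a SageMaker `Model` expects.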

In [None]:
%%writefile serving.properties
engine=Python
option.entryPoint=djl_python.sklearn_handler
option.model_format=joblib
option.trust_insecure_model_files=true

In [None]:
%%sh
mkdir sklearn_basic
mv model.joblib sklearn_basic/
mv serving.properties sklearn_basic/
tar czvf sklearn_basic.tar.gz sklearn_basic/
rm -rf sklearn_basic

## Step 4: DJL Custom Formatters and Handlers Model

You can use DJL Serving's custom decorators to write your own custom pre-processing, prediction, and post-processing code for flexible inference. To do so, include a model.py file in the model directory.

In [None]:
# Re-save the trained model; the previous model.joblib was moved into sklearn_basic/
joblib.dump(model, 'model.joblib')

In [None]:
%%writefile serving.properties
engine=Python
option.entryPoint=djl_python.sklearn_handler
option.model_format=joblib
option.trust_insecure_model_files=true

In your model.py file, include the following decorators and signatures. The function names are flexible. If you omit any definition, the default scikit-learn handler logic is used for that stage (initialization, input processing, prediction, or output processing). By default, the handler expects a NumPy array for prediction and returns the prediction as a NumPy array. CSV and JSON input/output are both supported, with JSON as the default.
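
The fallback behavior can be pictured as custom hooks overlaid on a table of defaults. The sketch below is a conceptual illustration only, not DJL Serving's implementation: every name in it (`DEFAULT_HOOKS`, `build_pipeline`, the `default_*` functions) is hypothetical, and plain Python lists stand in for NumPy arrays.

```python
def default_input(payload):
    # Stand-in default: read rows from the "inputs" key.
    return payload["inputs"]


def default_predict(rows, model):
    # Stand-in default: apply the model to each row.
    return [model(row) for row in rows]


def default_output(predictions):
    return {"predictions": predictions}


DEFAULT_HOOKS = {
    "input": default_input,
    "predict": default_predict,
    "output": default_output,
}


def build_pipeline(custom_hooks):
    """Use a custom hook where one is supplied, the default otherwise."""
    hooks = {**DEFAULT_HOOKS, **custom_hooks}

    def run(payload, model):
        rows = hooks["input"](payload)
        predictions = hooks["predict"](rows, model)
        return hooks["output"](predictions)

    return run


def features_input(payload):
    # Custom input hook: read rows from "features" instead of "inputs".
    return payload["features"]


# Only the input stage is overridden; prediction and output use the defaults.
run = build_pipeline({"input": features_input})
result = run({"features": [[1, 2], [3, 4]]}, model=sum)
print(result)  # {'predictions': [3, 7]}
```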

In [None]:
%%writefile model.py
import numpy as np
import joblib
import os
import sys
import sklearn
from djl_python.input_parser import input_formatter, prediction_handler, init_handler
from djl_python.output_formatter import output_formatter
from djl_python import Input

@init_handler
def custom_init_fn(model_dir, **kwargs):
    """Custom model initialization - load with custom logic"""
    print(f"Python version: {sys.version}")
    print(f"Scikit-learn version: {sklearn.__version__}")
    
    model_path = os.path.join(model_dir, "model.joblib")
    model = joblib.load(model_path)
    print(f"Custom init: Loaded model with {model.n_estimators} estimators")
    return model

@input_formatter
def custom_input_fn(inputs: Input, **kwargs):
    """Custom input processing - expects {"features": [...]} format or default {"inputs": [...]} format"""
    data = inputs.get_as_json()
    features = data.get("features", data.get("inputs", data))
    X = np.array(features)
    
    # Ensure 2D array
    if X.ndim == 1:
        X = X.reshape(1, -1)
    
    return X

@prediction_handler
def custom_predict_fn(X, model, **kwargs):
    """Custom prediction logic - returns both predictions and probabilities"""
    predictions = model.predict(X)
    probabilities = model.predict_proba(X)
    
    return {
        "predictions": predictions,
        "probabilities": probabilities
    }

@output_formatter
def custom_output_fn(result):  
    """Custom output formatting - returns detailed predictions"""
    predictions = result["predictions"]
    probabilities = result["probabilities"]
    
    return {
        "predictions": predictions.tolist(),
        "probabilities": probabilities.tolist(),
        "model_type": "sklearn_random_forest",
        "num_samples": len(predictions)
    }

In [None]:
%%sh
mkdir sklearn_custom
mv model.joblib sklearn_custom/
mv model.py sklearn_custom/
mv serving.properties sklearn_custom/
tar czvf sklearn_custom.tar.gz sklearn_custom/
rm -rf sklearn_custom

## Step 5: SageMaker Custom Script and ENV Variable Compatibility

The scikit-learn handler is also backwards compatible with SageMaker inference scripts that implement model_fn, input_fn, predict_fn, and output_fn. As with the DJL custom formatters, you can omit any of the four functions and the default handler logic is used for the missing function(s). The functions must have the following signatures:

- model_fn(model_dir) where model_dir is the path to the model directory.
- input_fn(request_body, request_content_type) where request_body is a byte buffer. (request_content_type can also be named content_type)
- predict_fn(input_object, model) where input_object is the object returned by input_fn.
- output_fn(prediction, response_content_type) where prediction is returned by predict_fn and output_fn returns a byte array of data serialized to the specified type. (response_content_type can also be named accept)
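
Before packaging, it can help to chain the four functions locally to confirm that each one accepts what the previous one returns. The sketch below uses only the standard library: `ThresholdModel` is a hypothetical stand-in for a trained estimator, and `handle` is an assumed composition order, not the handler's actual dispatch code.

```python
import json


class ThresholdModel:
    """Hypothetical stand-in for a trained estimator."""

    def predict(self, rows):
        # Predict 1 when the feature sum is positive, else 0.
        return [int(sum(row) > 0) for row in rows]


def model_fn(model_dir):
    # A real script would load a file from model_dir here.
    return ThresholdModel()


def input_fn(request_body, request_content_type):
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    data = json.loads(request_body)
    rows = data["inputs"] if isinstance(data, dict) else data
    # Promote a single sample to a batch of one.
    if rows and not isinstance(rows[0], list):
        rows = [rows]
    return rows


def predict_fn(input_object, model):
    return model.predict(input_object)


def output_fn(prediction, response_content_type):
    # Return serialized bytes, as the signature description requires.
    return json.dumps({"predictions": prediction}).encode("utf-8")


def handle(model, body, content_type="application/json", accept="application/json"):
    """Assumed call order: input_fn -> predict_fn -> output_fn."""
    return output_fn(predict_fn(input_fn(body, content_type), model), accept)


model = model_fn("/opt/ml/model")
response = handle(model, b'{"inputs": [1.0, -3.0]}')
print(response)  # b'{"predictions": [0]}'
```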

In [None]:
# Re-save the trained model; the previous model.joblib was moved into sklearn_custom/
joblib.dump(model, 'model.joblib')

In [None]:
%%writefile serving.properties
engine=Python
option.entryPoint=djl_python.sklearn_handler
option.model_format=joblib
option.trust_insecure_model_files=true

In [None]:
%%writefile model.py
import joblib
import numpy as np
import os
import json
import sys
import sklearn
import logging

# Configure logging to ensure it appears
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def model_fn(model_dir):
    """Load the scikit-learn model from the model directory."""
    model_path = os.path.join(model_dir, "model.joblib")
    model = joblib.load(model_path)
    return model

def input_fn(request_body, request_content_type):
    """Parse input data for inference."""
    if request_content_type == "application/json":
        data = json.loads(request_body)
        if isinstance(data, dict):
            features = data.get("features", data.get("inputs", data))
        else:
            features = data
        result = np.array(features)
        return result
    else:
        raise ValueError(f"Unsupported content type: {request_content_type}")

def predict_fn(input_data, model):
    """Run prediction on the input data."""
    if input_data.ndim == 1:
        input_data = input_data.reshape(1, -1)
    
    prediction = model.predict(input_data)
    probabilities = model.predict_proba(input_data)
    return {"prediction": prediction, "probabilities": probabilities}

def output_fn(prediction, response_content_type):
    """Format the prediction output."""
    result = {
        "prediction": prediction["prediction"].tolist(),
        "probabilities": prediction["probabilities"].tolist(),
        "custom_sagemaker_functions": "WORKING",
        "sklearn_version": sklearn.__version__
    }
    json_result = json.dumps(result)
    return json_result

In [None]:
%%sh
mkdir sklearn_sagemaker
mv model.joblib sklearn_sagemaker/
mv model.py sklearn_sagemaker/
mv serving.properties sklearn_sagemaker/
tar czvf sklearn_sagemaker.tar.gz sklearn_sagemaker/
rm -rf sklearn_sagemaker

## Step 6: Deploy to SageMaker

In [None]:

"""
TODO: Uncomment this once container is officially available.
# Get DJL container image 
image_uri = image_uris.retrieve(
    framework="djl-inference",
    region=region,
    version="0.35.0"
)
"""

image_uri = "875423407011.dkr.ecr.us-west-2.amazonaws.com/djl-serving-demo-cpufull:latest"

In [None]:
# Upload all models to S3
s3_code_prefix = "sklearn-djl/code"
bucket = sess.default_bucket()

# Upload all three model variants
models = ['sklearn_basic.tar.gz', 'sklearn_custom.tar.gz', 'sklearn_sagemaker.tar.gz']
model_uris = {}

for model_file in models:
    model_name = model_file.replace('.tar.gz', '')
    model_uri = sess.upload_data(model_file, bucket, f"{s3_code_prefix}/{model_name}")
    model_uris[model_name] = model_uri
    print(f"{model_name} uploaded to: {model_uri}")

# Choose which model to deploy
selected_model = 'sklearn_sagemaker'  # Change this as needed
code_artifact = model_uris[selected_model]
print(f"\nSelected model for deployment: {selected_model}")

The following environment variables are now supported in the DJL serving container:

- SAGEMAKER_MAX_REQUEST_SIZE
- SAGEMAKER_NUM_MODEL_WORKERS
- SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT
- SAGEMAKER_MODEL_SERVER_TIMEOUT_SECONDS
- SAGEMAKER_MODEL_SERVER_VMARGS
- SAGEMAKER_STARTUP_TIMEOUT
- SAGEMAKER_MAX_PAYLOAD_IN_MB

You can set these environment variables in the env parameter when creating your SageMaker Model.
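
Environment values are passed as strings, and the deployment cell in this notebook notes that SAGEMAKER_MAX_REQUEST_SIZE is given in bytes. A minimal sketch of a helper that builds such a dictionary from more readable inputs follows; `build_env` and its parameter names are hypothetical.

```python
def build_env(num_workers, max_request_mb, timeout_seconds):
    """Build an env dict with string values (hypothetical helper)."""
    return {
        "SAGEMAKER_NUM_MODEL_WORKERS": str(num_workers),
        # Convert mebibytes to bytes: 10 MiB -> 10485760.
        "SAGEMAKER_MAX_REQUEST_SIZE": str(max_request_mb * 1024 * 1024),
        "SAGEMAKER_MODEL_SERVER_TIMEOUT_SECONDS": str(timeout_seconds),
    }


env = build_env(num_workers=2, max_request_mb=10, timeout_seconds=300)
print(env["SAGEMAKER_MAX_REQUEST_SIZE"])  # 10485760
```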

In [None]:
from sagemaker.predictor import Predictor

# Example of how to specify environment variables 
model = Model(
    image_uri=image_uri,
    model_data=code_artifact, 
    role=role,
    predictor_cls=Predictor, # Include predictor_cls parameter if using customized DJL container via ECR URI
    # specify all environment variable configs in this map (note that SAGEMAKER_MAX_REQUEST_SIZE is in bytes)
    env={
        "SAGEMAKER_NUM_MODEL_WORKERS": "2",
        "SAGEMAKER_MAX_REQUEST_SIZE": "10485760",  
        "SAGEMAKER_MODEL_SERVER_TIMEOUT_SECONDS": "300",
    }
)

"""
# Create SageMaker model
model = Model(
    image_uri=image_uri,
    model_data=code_artifact,
    role=role,
    predictor_cls=Predictor
)
"""

In [None]:
# Deploy endpoint
instance_type = "ml.m5.large"
endpoint_name = sagemaker.utils.name_from_base("sklearn-djl")

print(f"Endpoint name: {endpoint_name}")

# Print the model object and its type as a sanity check before deploying
print(f"Model object: {model}")
print(f"Model type: {type(model)}")

try:
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        endpoint_name=endpoint_name,
        serializer=serializers.JSONSerializer()
    )
    print(f"Predictor created: {predictor}")
except Exception as e:
    print(f"Deployment error: {e}")


## Step 7: Test Inference

By default, the scikit-learn handler supports CSV and JSON inputs/outputs. CSVs must use a comma as the delimiter and separate samples with new lines. JSON inputs must be either a dictionary of the form `{"inputs": [...]}` or an array of arrays. Examples can be seen below.
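
As a quick illustration of these shapes, the sketch below serializes the same rows as CSV text and as the two JSON forms. `rows_to_csv` and `rows_to_json` are hypothetical helpers written for this example; they only build request bodies and do not call the endpoint.

```python
import json


def rows_to_csv(rows):
    """Comma-delimited values, one sample per line."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def rows_to_json(rows, wrap=True):
    """Either {"inputs": [...]} (wrap=True) or a bare array of arrays."""
    return json.dumps({"inputs": rows} if wrap else rows)


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

print(rows_to_csv(rows))
print(rows_to_json(rows))
print(rows_to_json(rows, wrap=False))
```

When sending the CSV form through a SageMaker predictor, the serializer would need to be a CSV serializer rather than the JSON serializer used in the deployment cell.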

In [None]:
# Test basic DJL-compatible JSON formats ({"inputs": [...]} or an array of arrays (list of lists))
basic_payload = {
    "inputs": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
}

# Note that this doesn't work with sklearn_custom because of how its input formatter is implemented; expect an error.
list_payload = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]]

# Note that result is a byte buffer because we didn't specify a deserializer
result = predictor.predict(basic_payload)
print("Basic format result:")
print(result)
print("Decoded basic format result:")
print(result.decode('utf-8'))

result = predictor.predict(list_payload)
print("List format result:")
print(result)
print("Decoded list format result:")
print(result.decode('utf-8'))

In [None]:
# Test custom format (works with models with custom formatters)
custom_payload = {
    "features": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
}

try:
    result = predictor.predict(custom_payload)
    print("Custom format result:")
    print(result)
    print("Decoded custom format result:")
    print(result.decode('utf-8'))
except Exception as e:
    print(f"Custom format not supported by this model: {e}")

In [None]:
# Test batch prediction
batch_payload = {
    "inputs": [
        [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
        [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0],
        [3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
    ]
}

result = predictor.predict(batch_payload)
print("Batch prediction result:")
print(result)
print("Decoded batch format result:")
print(result.decode('utf-8'))

## Step 8: Clean Up

In [None]:
# Clean up resources
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()
print("Resources cleaned up")

In [None]:
# Test payloads for different scenarios

# Single sample
single_sample = {
    "inputs": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
}

# Batch samples
batch_samples = {
    "inputs": [
        [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
        [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
    ]
}

# Custom format (for custom formatters above) using "features" instead of "inputs"
custom_format = {
    "features": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
}

# CSV format: switch the serializer to CSVSerializer when sending CSV inputs. CSV inputs won't work with the
# custom formatter/handler models above because of how their input functions are implemented, but they work with the basic model.
csv_single = "1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0"
csv_batch = """1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0
2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0"""

print("Single sample JSON:")
print(json.dumps(single_sample, indent=2))

print("\nBatch samples JSON:")
print(json.dumps(batch_samples, indent=2))

print("\nCustom format JSON:")
print(json.dumps(custom_format, indent=2))

print("\nCSV single:")
print(csv_single)

print("\nCSV batch:")
print(csv_batch)

## Summary

This notebook demonstrated three ways to use DJL's sklearn handler:

1. **Basic Handler**: Using the built-in sklearn handler with default processing
2. **Custom DJL Formatters**: Using DJL's decorators for custom input/output processing
3. **SageMaker Compatible**: Using SageMaker's standard inference functions

**Key Points:**

- **Model Formats**: Only `skops` is secure by default. Other formats (`joblib`, `pickle`, `cloudpickle`) require `option.trust_insecure_model_files=true`
- **File Extensions**: 
  - `skops`: .skops
  - `joblib`: .joblib, .jl
  - `pickle`: .pkl, .pickle
  - `cloudpickle`: .pkl, .pickle, .cloudpkl
- **Custom Processing**: Use DJL decorators (`@init_handler`, `@input_formatter`, `@prediction_handler`, `@output_formatter`) for flexible inference
- **SageMaker Compatibility**: Implement standard SageMaker functions (`model_fn`, `input_fn`, `predict_fn`, `output_fn`) for PySDK compatibility

For more information, see the [DJL Serving documentation](https://docs.djl.ai/docs/serving/index.html).