This example is part of a talk in Supply Chain Security for MLSecOps.
The example will cover the following steps:
- Training a simple Scikit-Learn model, serialising it with
joblib
. - Signing our model, and verifying manually the signature.
- Implementing a MLServer runtime which handles the verification during model load.
The first step will be to train a simple scikit-learn
model.
For that, we will use the MNIST example from the scikit-learn
documentation which trains an SVM model.
# Original source code and more details can be found in:
# https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split
# The digits dataset
digits = datasets.load_digits()
# To apply a classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)
# Split data into train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
data, digits.target, test_size=0.5, shuffle=False)
# We learn the digits on the first half of the digits
classifier.fit(X_train, y_train)
To save our trained model, we will serialise it using joblib
.
While this is not a perfect approach, it's currently the recommended method to persist models to disk in the scikit-learn
documentation.
Our model will be persisted as a file named mnist-svm.joblib
import joblib
model_file_name = "./models/good-model/model.joblib"
joblib.dump(classifier, model_file_name)
open(model_file_name, 'rb').read()
Now that we have a model artefact, the next step will be to sign our binary.
For this, we will use the sigstore
CLI, which will generate a set of files:
- A *.crt certificate.
- A *.sig signature.
- A *.sigstore bundle, which combines both.
Note that the signature also verifies that a particular ID was the one who created it. As ID, we will use our own Gmail account (which will be authenticated through Google's OIDC server ahead of creating the signature).
!sigstore sign --overwrite {model_file_name}
%ls ./models/good-model/model.joblib*
To validate that the signature does what it should, we will try to use it manually to verify the identity for our artefact.
As you can see, the sigstore verify identity
command needs us to specify both our own identity, and the OIDC issuer for this identity (Google's in this case).
!sigstore verify identity \
--bundle {model_file_name}.sigstore \
--cert-identity agm@seldon.io \
--cert-oidc-issuer https://accounts.google.com \
--offline \
{model_file_name}
To also validate that the signature detects if the file has been tampered with, we will manually inject a malicious pickle within our model artefact.
!cp ./models/good-model/model.joblib* ./models/tampered-model/
tampered_file_name = "./models/tampered-model/model.joblib"
pwnd_pickle = b'\x80\x04\x95)\x00\x00\x00\x00\x00\x00\x00\x8c\x05posix\x94\x8c\x06system\x94\x93\x94\x8c\x0eenv > pwnd.txt\x94\x85\x94R\x94.'
with open(tampered_file_name, 'wb') as f:
f.write(pwnd_pickle)
As we can see below, this makes the verification fail with the following error:
Signature is invalid for input
!sigstore verify identity \
--bundle {tampered_file_name}.sigstore \
--cert-identity agm@seldon.io \
--cert-oidc-issuer https://accounts.google.com \
--offline \
{tampered_file_name}
Now that we have signed our artefact and validated that it does its job, we will build a MLServer runtime that handles the signature verification at model load time.
MLServer greatly simplifies building a custom inference runtime to serve our models. In this case, we will extend the built-in Scikit-Learn runtime to also handle verification.
For the latter, we will leverage the Python API exposed by the sigstore
package.
Note that this runtime is not production-ready, however it should be good to enough to showcase how signature verification could be enforced at serving time.
# %load ./runtime.py
from mlserver import MLModel
from mlserver.utils import get_model_uri
from mlserver.errors import MLServerError
from mlserver_sklearn.sklearn import SKLearnModel, WELLKNOWN_MODEL_FILENAMES
from sigstore_protobuf_specs.dev.sigstore.bundle.v1 import Bundle
from sigstore.verify import Verifier, VerificationMaterials
from sigstore.verify.policy import Identity
CERT_IDENTITY = "agm@seldon.io"
CERT_OIDC_ISSUER = "https://accounts.google.com"
class VerificationError(MLServerError):
def __init__(self, model_name: str, reason: str):
msg = f"Invalid signature for model '{model_name}': {reason}."
super().__init__(msg)
class SigstoreModel(SKLearnModel):
async def load(self):
model_uri = await get_model_uri(
self._settings, wellknown_filenames=WELLKNOWN_MODEL_FILENAMES
)
self.verify(model_uri)
return await super().load()
def _get_bundle(self, model_uri: str) -> Bundle:
bundle_path = f"{model_uri}.sigstore"
with open(bundle_path, 'r') as bundle_file:
return Bundle().from_json(bundle_file.read())
def _get_materials(self, model_uri: str) -> VerificationMaterials:
with open(model_uri, 'rb') as model_file:
bundle = self._get_bundle(model_uri)
return VerificationMaterials.from_bundle(
input_=model_file,
bundle=bundle,
offline=True
)
def verify(self, model_uri: str):
verifier = Verifier.production()
identity = Identity(
identity=CERT_IDENTITY,
issuer=CERT_OIDC_ISSUER,
)
materials = self._get_materials(model_uri)
result = verifier.verify(
materials,
identity
)
if not result.success:
raise VerificationError(self.name, result.reason)
Note that we will also add a small piece of config to MLServer to disable auto-loading all available models. That way we will be able to load them separately to assess whether the tampered model gets rejected.
%%writefile models/settings.json
{
"load_models_at_startup": false
}
With all in place, the next step will be to start MLServer on a separate terminal with:
mlserver start ./models
Now that MLServer is running on the background, we can send a request to list the available models. As you can see, MLServer finds the two artefacts that we created previously:
good-model
: our original trained model, with the right signed artefact.naive-model
: the model we tampered, which will be loaded without signature verification.tampered-model
: the model we tampered, injecting malicious code into the Pickle.
!curl -s -X POST -H 'Content-Type: application/json' localhost:8080/v2/repository/index -d '{}' | jq
We will start by testing the model we originally trained and signed. As you can see below, the model gets loaded correctly - as expected.
!curl -I -X POST localhost:8080/v2/repository/models/good-model/load
We will now load the tampered model without verifying its signature. For this, we will just instruct MLServer to load the model with the base SKLearn runtime, which lacks the verification added by our custom runtime.
!curl -I -X POST localhost:8080/v2/repository/models/naive-model/load
As we can see, the model goes through as normal.
However, if we inspect the folder where MLServer is running, we will be able to see that it the malicious code we injected previously run and exposed our environment variables in a pwnd.txt
file:
!cat pwnd.txt
We will now try to load the tampered model which, as you below, will fail with the following error:
Signature is invalid for input
!rm pwnd.txt
!curl -s -X POST localhost:8080/v2/repository/models/tampered-model/load | jq
Thanks to the signature verification, we avoided to run the malicious code that we injected previously.
!ls pwnd.txt