<h1><center>Let's exploit pickle</center></h1>
<h2><center>and skops to the rescue!</center></h2>
<h3><center>Adrin Jalali</center></h3>
<h4><center>github.com/adrinjalali</center></h4>
<h4><center>@probabl.ai</center></h4>
<h4><center>November 2024</center></h4>

## Me
- PhD in interpretable methods for cancer diagnostics
- ML consulting
- Worked in an algorithmic privacy and fairness team
- Open source
    - `scikit-learn`
    - `fairlearn`
    - `skops`

In [1]:
import pickle

In [2]:
pickle.loads(b"cos\nsystem\n(S'echo hello world'\ntR.")

hello world


0

That payload works because the unpickler imports `os` on request. We can restrict which globals may be loaded by overriding `find_class` when loading a pickle file:


``` python
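import builtins
import pickle

# Allow-list of builtins the unpickler may resolve
# (the example set used in the Python pickle documentation).
safe_builtins = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):

    def find_class(self, module, name):
        # Only allow safe classes from builtins.
        if module == "builtins" and name in safe_builtins:
            return getattr(builtins, name)
        # Forbid everything else.
        raise pickle.UnpicklingError(
            f"global '{module}.{name}' is forbidden"
        )


# Load through the restricted unpickler instead of pickle.load().
with open("file.pkl", "rb") as f:
    obj = RestrictedUnpickler(f).load()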
```
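
As a quick check, the sketch below feeds the `os.system` payload from the opening cell to a restricted unpickler. It restates the class so the snippet runs on its own. The `safe_builtins` allow-list and the `restricted_loads` helper are assumptions borrowed from the example in the Python `pickle` documentation, not something defined in this talk.

```python
import builtins
import io
import pickle

# Assumed allow-list; adjust it to what your data legitimately needs.
safe_builtins = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if module == "builtins" and name in safe_builtins:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads(), but resolves globals via RestrictedUnpickler."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


payload = b"cos\nsystem\n(S'echo hello world'\ntR."
blocked = False
message = ""
try:
    restricted_loads(payload)
except pickle.UnpicklingError as exc:
    blocked = True
    message = str(exc)

print(blocked, message)

# An allow-listed builtin still round-trips.
allowed = restricted_loads(pickle.dumps(range(3)))
print(allowed)
```

The payload is rejected before `os.system` is ever looked up, while a `range` object still loads.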

*Exploits*: https://ctftime.org/writeup/16723

## PEP 307 - Extensions to the pickle protocol

https://peps.python.org/pep-0307/#security-issues
    
<div>
<img src="figs/security.png" width="600"/>
</div>

# pickles
- Pickler
- Unpickler
- pickling instruction set (`OP` codes)
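
To make the instruction set concrete, here is a minimal sketch that lists the opcodes of a small pickle with `pickletools.genops`, which decodes the byte stream without executing it. Pinning `protocol=4` is an assumption made only to keep the output stable across Python versions.

```python
import pickle
import pickletools

data = pickle.dumps([1, 2, 3], protocol=4)

# genops() yields (opcode, argument, position) triples; nothing is executed.
opcode_names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]
print(opcode_names)
```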

# `__getstate__`, `__setstate__`

https://docs.python.org/3/library/pickle.html

In [4]:
class C:
    def __getstate__(self):
        return {"a": 42}
    
    def __setstate__(self, state):
        for key, value in state.items():
            setattr(self, key, value)
            
obj = pickle.loads(pickle.dumps(C()))
obj.a

42

In [5]:
import pickletools
pickletools.dis(pickle.dumps(C()))

    0: \x80 PROTO      4
    2: \x95 FRAME      31
   11: \x8c SHORT_BINUNICODE '__main__'
   21: \x94 MEMOIZE    (as 0)
   22: \x8c SHORT_BINUNICODE 'C'
   25: \x94 MEMOIZE    (as 1)
   26: \x93 STACK_GLOBAL
   27: \x94 MEMOIZE    (as 2)
   28: )    EMPTY_TUPLE
   29: \x81 NEWOBJ
   30: \x94 MEMOIZE    (as 3)
   31: }    EMPTY_DICT
   32: \x94 MEMOIZE    (as 4)
   33: \x8c SHORT_BINUNICODE 'a'
   36: \x94 MEMOIZE    (as 5)
   37: K    BININT1    42
   39: s    SETITEM
   40: b    BUILD
   41: .    STOP
highest protocol among opcodes = 4


In [6]:
with open("/tmp/dumps/oddpickle.pkl", "wb") as f:
    pickle.dump(C(), f)

with open("/tmp/dumps/oddpickle.pkl", "rb") as f:
    obj = pickle.load(f)
obj

<__main__.C at 0x7d45b6b3a990>

# `__reduce__` 👹
https://docs.python.org/3/library/pickle.html#object.__reduce__

Returns a tuple of two to six items; the first two are mandatory:

- A callable object that will be called to create the initial version of the object.
- A tuple of arguments for the callable object. An empty tuple must be given if the callable does not accept any argument.

In [7]:
class D:
    def __reduce__(self):
        return (print, ("!!!I SEE YOU!!!",))
    
pickled = pickle.dumps(D())
pickletools.dis(pickled)

    0: \x80 PROTO      4
    2: \x95 FRAME      44
   11: \x8c SHORT_BINUNICODE 'builtins'
   21: \x94 MEMOIZE    (as 0)
   22: \x8c SHORT_BINUNICODE 'print'
   29: \x94 MEMOIZE    (as 1)
   30: \x93 STACK_GLOBAL
   31: \x94 MEMOIZE    (as 2)
   32: \x8c SHORT_BINUNICODE '!!!I SEE YOU!!!'
   49: \x94 MEMOIZE    (as 3)
   50: \x85 TUPLE1
   51: \x94 MEMOIZE    (as 4)
   52: R    REDUCE
   53: \x94 MEMOIZE    (as 5)
   54: .    STOP
highest protocol among opcodes = 4


In [8]:
pickle.loads(pickled)

!!!I SEE YOU!!!


In [9]:
import os
class E:
    def __reduce__(self):
        return (
            os.system,
            ("""echo "!!!I'm in YOUR SYSTEM!!!" > /tmp/dumps/demo.txt""",),
        )
    
pickled = pickle.dumps(E())
pickletools.dis(pickled)

    0: \x80 PROTO      4
    2: \x95 FRAME      80
   11: \x8c SHORT_BINUNICODE 'posix'
   18: \x94 MEMOIZE    (as 0)
   19: \x8c SHORT_BINUNICODE 'system'
   27: \x94 MEMOIZE    (as 1)
   28: \x93 STACK_GLOBAL
   29: \x94 MEMOIZE    (as 2)
   30: \x8c SHORT_BINUNICODE 'echo "!!!I\'m in YOUR SYSTEM!!!" > /tmp/dumps/demo.txt'
   85: \x94 MEMOIZE    (as 3)
   86: \x85 TUPLE1
   87: \x94 MEMOIZE    (as 4)
   88: R    REDUCE
   89: \x94 MEMOIZE    (as 5)
   90: .    STOP
highest protocol among opcodes = 4


In [10]:
pickle.loads(pickled)

0
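
Since the callable is named in the byte stream, it can be spotted without loading the pickle. The sketch below is a rough heuristic, not a complete scanner: the `find_globals` helper and the `Noisy` class are hypothetical names, and the `STACK_GLOBAL` handling simply assumes the two most recent string opcodes hold the module and the name, which a crafted pickle can defeat (for example by fetching them from the memo).

```python
import pickle
import pickletools

STRING_OPS = {"SHORT_BINUNICODE", "BINUNICODE", "BINUNICODE8", "UNICODE"}


def find_globals(data):
    """Return (module, name) pairs a pickle asks to import. Heuristic only."""
    found = []
    strings = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in STRING_OPS:
            strings.append(arg)
        elif opcode.name == "GLOBAL":
            # Older protocols carry "module name" as a single argument.
            module, name = arg.split(" ", 1)
            found.append((module, name))
        elif opcode.name == "STACK_GLOBAL" and len(strings) >= 2:
            # Protocol 4+: module and name were pushed as strings just before.
            found.append((strings[-2], strings[-1]))
    return found


class Noisy:
    def __reduce__(self):
        return (print, ("hello",))


# Dumping only records the callable; nothing is printed or executed here.
print(find_globals(pickle.dumps(Noisy(), protocol=4)))
print(find_globals(b"cos\nsystem\n(S'echo hello world'\ntR."))
print(find_globals(pickle.dumps([1, 2, 3], protocol=4)))
```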

# Other attacks
- Denial of service
    - Unhandled exceptions
    - Protocol downgrades
    - pickle bombs
- Weird Machine
    - Unused `OP` codes, such as `DUP`
    - Parser abuse
    - Stack corruption
    
source: https://github.com/moreati/pickle-fuzz

In [11]:
import ast
import pickle
from fickling.fickle import Pickled
print(ast.dump(Pickled.load(pickle.dumps([1, 2, 3, 4])).ast, indent=4))

Module(
    body=[
        Assign(
            targets=[
                Name(id='result', ctx=Store())],
            value=List(
                elts=[
                    Constant(value=1),
                    Constant(value=2),
                    Constant(value=3),
                    Constant(value=4)],
                ctx=Load()))],
    type_ignores=[])


In [12]:
print(ast.dump(Pickled.load(pickle.dumps(E())).ast, indent=4))

Module(
    body=[
        ImportFrom(
            module='posix',
            names=[
                alias(name='system')],
            level=0),
        Assign(
            targets=[
                Name(id='_var0', ctx=Store())],
            value=Call(
                func=Name(id='system', ctx=Load()),
                args=[
                    Constant(value='echo "!!!I\'m in YOUR SYSTEM!!!" > /tmp/dumps/demo.txt')],
                keywords=[])),
        Assign(
            targets=[
                Name(id='result', ctx=Store())],
            value=Name(id='_var0', ctx=Load()))],
    type_ignores=[])


In [13]:
!fickling /tmp/dumps/oddpickle.pkl

from __main__ import C
_var0 = C()
_var0.__setstate__({'a': 42})
result0 = _var0


In [14]:
with open("/tmp/dumps/badpickle.pkl", "wb") as f:
    pickle.dump(E(), f)

In [15]:
!fickling /tmp/dumps/badpickle.pkl

from posix import system
_var0 = system('echo "!!!I\'m in YOUR SYSTEM!!!" > /tmp/dumps/demo.txt')
result0 = _var0


In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

clf = LogisticRegression(solver="liblinear").fit(X, y)
with open("/tmp/dumps/goodpickle.pkl", "wb") as f:
    pickle.dump(clf, f)

In [17]:
!fickling /tmp/dumps/goodpickle.pkl

from sklearn.linear_model._logistic import LogisticRegression
from numpy.core.multiarray import _reconstruct
from numpy import ndarray
_var0 = _reconstruct(ndarray, (0,), b'b')
from numpy import dtype
_var1 = dtype('i8', False, True)
_var2 = _var1
_var2.__setstate__((3, '<', None, None, None, -1, -1, 0))
_var3 = _var0
_var3.__setstate__((1, (3,), _var2, False, b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00'))
_var4 = _reconstruct(ndarray, (0,), b'b')
_var5 = dtype('f8', False, True)
_var6 = _var5
_var6.__setstate__((3, '<', None, None, None, -1, -1, 0))
_var7 = _var4
_var7.__setstate__((1, (3, 4), _var6, True, b'# ?T\xff@\xda?z]5nM\\\xdb?4z\xa2\x86\xfbQ\xfb\xbf\x0eh|N5m\xf7?\x1f\xfb$3:\xcb\xf9\xbf\xbb\x99m\xbff\x8c\xf8\xbfz\xc8\x01\x01\x8c\x14\x02\xc0\xa0\xc9\xc4e\x18m\xe2?\xbds\x82\xa2\x8a\xc4\x03@`\xf08\xe4(V\xf0\xbf\x19\\}\x85\xaf\x7f\xf6\xbf\xfaL#\nfq\x04@'))
_var8 = _reconstruct(ndarray, (0,), b'b')
_var9 = _var8
_var9.__setstate

**Fickling**: https://github.com/trailofbits/fickling

# skops
More secure persistence with `skops.io`

https://skops.readthedocs.io/en/stable/persistence.html

In [18]:
import skops.io as sio
sio.loads(sio.dumps(D()))

UntrustedTypesFoundException: Untrusted types found in the file: ['__main__.D'].

In [None]:
sio.loads(sio.dumps(D()), trusted=['__main__.D'])

### File content

Let's check dumped files!

In [None]:
sio.dump(D(), "/tmp/dumps/D.skops")

In [19]:
sio.dump(C(), "/tmp/dumps/C.skops")

In [20]:
sio.dump(clf, "/tmp/dumps/lr.skops")

## `numpy.save`

https://numpy.org/doc/stable/reference/generated/numpy.save.html

<div>
<img src="figs/numpy-save.png" width="600"/>
</div>

## `numpy.load`

https://numpy.org/doc/stable/reference/generated/numpy.load.html

<div>
<img src="figs/numpy-load.png" width="600"/>
</div>

In [22]:
sio.load("/tmp/dumps/lr.skops")

In [23]:
sio.visualize("/tmp/dumps/lr.skops", show="all")

root: sklearn.linear_model._logistic.LogisticRegression
└── attrs: builtins.dict
    ├── penalty: json-type("l2")
    ├── dual: json-type(false)
    ├── tol: json-type(0.0001)
    ├── C: json-type(1.0)
    ├── fit_intercept: json-type(true)
    ├── intercept_scaling: json-type(1)
    ├── class_weight: json-type(null)
    ├── random_state: json-type(null)
    ├── solver: json-type("liblinear")
    ├── max_iter: json-type(100)
    ├── multi_class: json-type("deprecated")
    ├── verbose: json-type(0)
    ├── warm_start: json-type(false)
    ├── n_jobs: json-type(null)
    ├── l1_ratio: json-type(null)
    ├── n_features_in_: json-type(4)
    ├── classes_: numpy.ndarray
    ├── coef_: numpy.ndarray
    ├── intercept_: numpy.ndarray
    ├── n_iter_: numpy.ndarray
    └── _sklearn_version: json-type("1.5.2")


In [24]:
from sklearn.preprocessing import FunctionTransformer

def f(x):
    return x + 1

obj = FunctionTransformer(f)
dumped = sio.dumps(obj)

In [25]:
sio.loads(dumped)

UntrustedTypesFoundException: Untrusted types found in the file: ['__main__.f'].

In [26]:
sio.get_untrusted_types(data=dumped)

['__main__.f']

In [28]:
sio.visualize(dumped, show="all")

root: sklearn.preprocessing._function_transformer.FunctionTransformer
└── attrs: builtins.dict
    ├── func: __main__.f [UNSAFE]
    ├── inverse_func: json-type(null)
    ├── validate: json-type(false)
    ├── accept_sparse: json-type(false)
    ├── check_inverse: json-type(true)
    ├── feature_names_out: json-type(null)
    ├── kw_args: json-type(null)
    ├── inv_kw_args: json-type(null)
    └── _sklearn_version: json-type("1.5.2")


In [29]:
sio.loads(dumped, trusted="__main__.f")

# `skops` format

```
zip file:
    schema.json
    139801436035376.npy
    139803280731088.npy
    139803280731952.npy
    139803280801840.npy
```
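
Because the container is a regular zip file, its members can be listed and the JSON schema read with the standard library, without constructing any objects. The sketch below builds a toy archive in memory to show the pattern. The member name `0.npy` and the schema content are invented for illustration and do not reproduce the real `skops` schema.

```python
import io
import json
import zipfile

# Toy archive: same shape as the listing above, made-up content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("schema.json", json.dumps({"note": "toy schema"}))
    archive.writestr("0.npy", b"placeholder bytes")

# Inspect it like any zip: member names first, then the JSON document.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    names = archive.namelist()
    schema = json.loads(archive.read("schema.json"))

print(names)
print(schema)
```

To look inside a real file, pass its path to `zipfile.ZipFile` instead of the in-memory buffer.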

## Serializers and Loaders

Default serializer/loader:

- `__new__` to construct.
- `__getstate__` and `__setstate__` to get and set attributes.
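
Done by hand, those two steps look roughly like this. It reuses the `C` class from the earlier cell and only illustrates the idea; it is not the `skops` implementation.

```python
class C:
    def __getstate__(self):
        return {"a": 42}

    def __setstate__(self, state):
        for key, value in state.items():
            setattr(self, key, value)


original = C()
state = original.__getstate__()  # what a serializer would store

clone = C.__new__(C)             # construct without calling __init__
clone.__setstate__(state)        # restore the attributes
print(clone.a)
```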

Special treatment of
- `dict`, `set`, `list`, `tuple`, `type`, `slice`
- `partial`, methods, and functions
- `numpy` and `scipy` arrays, `ufunc`s, and RNGs
- scikit-learn's C extension types, some using `__reduce__`
    - hard coded list of allowed objects
    - some C extension types from non scikit-learn libs
        - supported but not trusted by default
- scikit-learn's custom objects

## Loading Process

- Load content into memory w/o constructing any objects
- Check included types/functions against a trusted set
- Construct objects only if every included type and function is known and trusted
    - this is where `__new__` and `__setstate__` are called
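
The steps above can be sketched as an "audit first, construct second" loader. Everything here is a toy under stated assumptions: the tree layout (`type` and `attrs` keys) and the `collect_types`, `construct`, and `load` helpers are hypothetical and do not mirror the real `skops` schema or API.

```python
import importlib


def collect_types(node):
    """Walk the tree and gather type names without importing anything."""
    found = {node["type"]}
    for child in node.get("attrs", {}).values():
        if isinstance(child, dict) and "type" in child:
            found |= collect_types(child)
    return found


def construct(node):
    """Import the type, create it with __new__, then set its attributes."""
    module_name, _, class_name = node["type"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    obj = cls.__new__(cls)
    for key, value in node.get("attrs", {}).items():
        if isinstance(value, dict) and "type" in value:
            value = construct(value)
        setattr(obj, key, value)
    return obj


def load(node, trusted):
    """Refuse to construct anything unless every type is trusted."""
    untrusted = sorted(collect_types(node) - set(trusted))
    if untrusted:
        raise ValueError(f"Untrusted types found: {untrusted}")
    return construct(node)


tree = {
    "type": "types.SimpleNamespace",
    "attrs": {
        "a": 42,
        "child": {"type": "types.SimpleNamespace", "attrs": {"b": 1}},
    },
}
obj = load(tree, trusted={"types.SimpleNamespace"})
print(obj.a, obj.child.b)

try:
    load({"type": "subprocess.Popen", "attrs": {}}, trusted=set())
    rejected = False
except ValueError:
    rejected = True  # raised before subprocess is ever imported by load()
print(rejected)
```

The point of the ordering is that the type check runs over plain data, so an untrusted entry is rejected before any import or object construction happens.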

# Web app: convert pickles to skops format
- Uses Gradio: https://www.gradio.app/
- Hosted on Hugging Face Spaces: https://huggingface.co/spaces/adrin/pickle-to-skops
- Source code: https://huggingface.co/spaces/adrin/pickle-to-skops/tree/main

## Is it Safe?

- No code is 100% safe!
    - We're trying to make things safe\[er\]!
- `zip` file vulnerabilities apply
    - zip bomb
- Raised exceptions
- Very large objects

## Notes
- Doesn't support any custom object with mandatory args to `__new__`

## Roadmap
- Trust more safe and commonly used types and functions by default
- Speed improvements: memory mapping ndarrays, etc
- Public protocol for third parties to implement to be "`skops`able"
    - C extension types, currently via `__reduce__`, but only if we already know them

## Help Us
- Find vulnerabilities
- Test it
- Report issues on our issue tracker: https://github.com/skops-dev/skops/issues