PicklingError on compute with HyperbandSearchCV #549

Open
fonnesbeck opened this issue Oct 1, 2019 · 11 comments

@fonnesbeck

I'm attempting to do a hyperparameter search using HyperbandSearchCV on a PyTorch model that has been wrapped with skorch, but am running into a failure when I call fit:

Exception: PicklingError("Can't pickle <class '__main__.DNNRegressor'>: it's not the same object as __main__.DNNRegressor")

The exception does not seem to make sense.

My model is a subclass of torch.nn.Module that is just a deep neural network regressor, and this has been wrapped by a skorch NeuralNetRegressor as follows

dnnr = NeuralNetRegressor(
    module=DNNRegressor,
    module__n_feature=len(NUMERIC_COLUMNS),
    module__n_hidden=128,
    module__n_output=1,
    module__dropout_rate=0.5,
    criterion=torch.nn.MSELoss,
    device=device
)

Any obvious reason for this to be happening?

Running dask_ml 1.0.0, skorch 0.6.0 and pytorch 1.1.0 on a GCS instance.

@stsievert
Member

That looks like an error caused by an ambiguous class definition: the name now points to a different class object than the one being pickled, so Python doesn't know what to pickle. Here's a good SO question: https://stackoverflow.com/questions/1412787/picklingerror-cant-pickle-class-decimal-decimal-its-not-the-same-object

Basically, either remove the %load_ext autoreload and %autoreload 2 lines from the notebook, or put the definition of DNNRegressor in a separate module/Python file.

I'd be surprised if this is an issue with skorch: https://skorch.readthedocs.io/en/stable/user/save_load.html
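
The "not the same object" failure can be reproduced without a notebook or skorch. The following is a minimal sketch, assuming a plain class as a stand-in for the torch.nn.Module subclass and a manual re-definition as a stand-in for whatever autoreload does; the helper name try_pickle is made up for illustration.

```python
import pickle


class DNNRegressor:
    """Plain stand-in for the torch.nn.Module subclass in the report."""


# Keep a reference to the first class object, as an existing estimator would.
stale_class = DNNRegressor


class DNNRegressor:  # noqa: F811 - deliberate re-definition, as a reload would do
    """Same name, but a brand-new class object."""


def try_pickle(obj):
    """Return None on success, or the pickling exception on failure."""
    try:
        pickle.dumps(obj)
    except pickle.PicklingError as exc:
        return exc
    return None


# pickle stores classes by name, then checks that the name still resolves
# to the very same object. The stale class fails that check.
stale_error = try_pickle(stale_class)
fresh_error = try_pickle(DNNRegressor)
print(type(stale_error).__name__, fresh_error)
```

Only the stale reference fails, which matches the idea that the estimator is holding on to an older copy of the class than the one currently bound to the name.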

@TomAugspurger
Member

@fonnesbeck have you had a chance to look into this again? I was recently able to use HyperbandSearchCV with Skorch.

@fonnesbeck
Author

Yes, removing my model class from the notebook and putting it into a Python file did the trick. The error message will continue to confuse users, though.
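
For anyone landing here, a rough sketch of why the separate file helps: a class that lives in an importable module can be pickled by reference and resolves back to the same object. The module name dnn_models and the empty class body are assumptions for illustration; the real file would contain the actual torch.nn.Module subclass.

```python
import importlib
import pathlib
import pickle
import sys
import tempfile

# Hypothetical module name and contents; a real file would hold the actual
# torch.nn.Module subclass instead of this empty stand-in.
MODULE_NAME = "dnn_models"
MODULE_SOURCE = "class DNNRegressor:\n    pass\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the class into its own file, as done with the notebook model.
    pathlib.Path(tmp_dir, f"{MODULE_NAME}.py").write_text(MODULE_SOURCE)
    sys.path.insert(0, tmp_dir)
    try:
        models = importlib.import_module(MODULE_NAME)
        # The pickle payload only records "dnn_models.DNNRegressor".
        payload = pickle.dumps(models.DNNRegressor)
        restored = pickle.loads(payload)
    finally:
        sys.path.remove(tmp_dir)

print(restored.__module__, restored is models.DNNRegressor)
```

Since only the name is stored, the module also has to be importable wherever the object is unpickled, so workers on other machines would need the file too (for example via Client.upload_file).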

@TomAugspurger
Member

I suspect there's not much we can do about it, since it's an error from Python about a different package. We just happen to hit it here since dask needs to pickle things to move them around :/

@mrocklin
Member

mrocklin commented Aug 5, 2020

Would this be a good place for us to build custom serialization? Is there an obvious subclass for all of these and a clean way of serializing them?

(I also ran into this)

@TomAugspurger
Member

Does anyone have a reproducible example? This doesn't do it

from distributed import Client
import torch
import skorch
import pickle


client = Client()


class DNNRegressor(torch.nn.Module):
    pass

dnnr = skorch.NeuralNetRegressor(
    module=DNNRegressor,
    module__n_feature=128,
    module__n_hidden=128,
    module__n_output=1,
    module__dropout_rate=0.5,
    criterion=torch.nn.MSELoss,
)

pickle.loads(pickle.dumps(dnnr))

client.scatter([dnnr], broadcast=True)

Do I need Hyperband to reproduce the problem?

@jrbourbeau
Member

Here's a reproducer

from distributed import Client
from dask_ml.model_selection import HyperbandSearchCV
from dask_ml.datasets import make_classification
import torch
import skorch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from skorch import NeuralNetRegressor
from scipy.stats import loguniform, uniform


client = Client()

X, y = make_classification(chunks=(10, -1))


class HiddenLayerNet(nn.Module):
    def __init__(self, n_features=10, n_outputs=1, n_hidden=100, activation="relu"):
        super().__init__()
        self.fc1 = nn.Linear(n_features, n_hidden)
        self.fc2 = nn.Linear(n_hidden, n_outputs)
        self.activation = getattr(F, activation)

    def forward(self, x, **kwargs):
        return self.fc2(self.activation(self.fc1(x)))


niceties = {
    "callbacks": False,
    "warm_start": True,
    "train_split": None,
    "max_epochs": 1,
}

model = NeuralNetRegressor(
    module=HiddenLayerNet,
    module__n_features=X.shape[1],
    optimizer=optim.SGD,
    criterion=nn.MSELoss,
    lr=0.0001,
    **niceties,
)


params = {
    "module__activation": ["relu", "elu", "softsign", "leaky_relu", "rrelu"],
    "batch_size": [32, 64, 128, 256],
    "optimizer__lr": loguniform(1e-4, 1e-3),
    "optimizer__weight_decay": loguniform(1e-6, 1e-3),
    "optimizer__momentum": uniform(0, 1),
    "optimizer__nesterov": [True],
}

search = HyperbandSearchCV(model, params, random_state=2, verbose=True, max_iter=2)
search.fit(X, y)

@mrocklin
Member

mrocklin commented Aug 5, 2020 via email

@TomAugspurger
Member

It's almost like the class is being mutated, by hyperband or someone else? I'll look a bit today.

@TomAugspurger
Member

Nothing on the pickling yet, but a couple updates to James' reproducer based on using Client(processes=False)

  1. We're using skorch.NeuralNetRegressor, so the data should come from make_regression()
  2. Something in torch wants int32 / float32, so astype to those
  3. I think torch wants y to be (n_samples, 1), so reshape to that.

from distributed import Client
from dask_ml.model_selection import HyperbandSearchCV
from dask_ml.datasets import make_classification, make_regression
import torch
import skorch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from skorch import NeuralNetRegressor
from scipy.stats import loguniform, uniform


client = Client(processes=True)

X, y = make_regression(chunks=(10, -1))
y = y.reshape(-1, 1).astype("float32")
X = X.astype("float32")


class HiddenLayerNet(nn.Module):
    def __init__(self, n_features=10, n_outputs=1, n_hidden=100, activation="relu"):
        super().__init__()
        self.fc1 = nn.Linear(n_features, n_hidden)
        self.fc2 = nn.Linear(n_hidden, n_outputs)
        self.activation = getattr(F, activation)

    def forward(self, x, **kwargs):
        return self.fc2(self.activation(self.fc1(x)))


niceties = {
    "callbacks": False,
    "warm_start": True,
    "train_split": None,
    "max_epochs": 1,
}

model = NeuralNetRegressor(
    module=HiddenLayerNet,
    module__n_features=X.shape[1],
    optimizer=optim.SGD,
    criterion=nn.MSELoss,
    lr=0.0001,
    **niceties,
)


params = {
    "module__activation": ["relu", "elu", "softsign", "leaky_relu", "rrelu"],
    "batch_size": [32, 64, 128, 256],
    "optimizer__lr": loguniform(1e-4, 1e-3),
    "optimizer__weight_decay": loguniform(1e-6, 1e-3),
    "optimizer__momentum": uniform(0, 1),
    "optimizer__nesterov": [True],
}

search = HyperbandSearchCV(model, params, random_state=2, verbose=True, max_iter=2)
search.fit(X, y)

But still seeing the

_pickle.PicklingError: Can't pickle <class '__main__.HiddenLayerNet'>: attribute lookup HiddenLayerNet on __main__ failed

with that.

@TomAugspurger
Member

This rabbit hole keeps on going. I don't fully understand the issue, but the original exception came from trying to pickle model.module_. That's set when model.initialize() is called, and does the equivalent of model.module().to("cpu").

>>> model.module().to("cpu")
HiddenLayerNet(
  (fc1): Linear(in_features=10, out_features=100, bias=True)
  (fc2): Linear(in_features=100, out_features=1, bias=True)
)

That's an instance of the interactively defined class. Apparently something (cloudpickle? Dask?) has trouble serializing those when they're attributes of another object.

Anyway, we can get around that by serializing it separately

import cloudpickle

import skorch
# Absolute import so the snippet also works outside the distributed.protocol package
from distributed.protocol import dask_serialize, dask_deserialize


@dask_serialize.register(skorch.NeuralNet)
def serialize_skorch(x):
    has_module = hasattr(x, "module_")
    headers = {"has_module": has_module}
    if has_module:
        module = x.__dict__.pop("module_")
        try:
            frames = [cloudpickle.dumps(x), cloudpickle.dumps(module)]
        finally:
            x.__dict__["module_"] = module
    else:
        frames = [cloudpickle.dumps(x)]

    return headers, frames


@dask_deserialize.register(skorch.NeuralNet)
def deserialize_skorch(header, frames):
    model = cloudpickle.loads(frames[0])
    if header["has_module"]:
        module = cloudpickle.loads(frames[1])
        model.module_ = module
    return model

But now we face a trickier problem. Hyperband calls copy.deepcopy(model), which invokes torch.save, which eventually tries to pickle the interactively defined HiddenLayerNet, which pickle can't serialize (though cloudpickle can).

I'd hoped that

model = deepcopy(model)

could be changed to sklearn.base.clone(model), but that makes some tests fail. I'll need to look into it more later.
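
To illustrate the deepcopy chain described above without torch or skorch: copy.deepcopy falls back to __reduce_ex__, which calls __getstate__, so any pickling done there runs during the copy. This is only a sketch under that assumption. PickledStateWrapper and make_unimportable_instance are hypothetical stand-ins, with pickle.dumps playing the role of torch.save and a locally defined class playing the role of the interactively defined HiddenLayerNet.

```python
import copy
import pickle


class PickledStateWrapper:
    """Hypothetical stand-in for an estimator that pickles part of its state."""

    def __init__(self, module):
        self.module_ = module

    def __getstate__(self):
        state = self.__dict__.copy()
        # Stand-in for the torch.save call: the attribute is serialized
        # with the standard pickle machinery.
        state["module_"] = pickle.dumps(state["module_"])
        return state

    def __setstate__(self, state):
        state = dict(state)
        state["module_"] = pickle.loads(state["module_"])
        self.__dict__.update(state)


def make_unimportable_instance():
    """Return an instance of a class that pickle cannot look up by name."""

    class LocalNet:
        pass

    return LocalNet()


wrapper = PickledStateWrapper(make_unimportable_instance())

try:
    # deepcopy -> __reduce_ex__ -> __getstate__ -> pickle.dumps
    copy.deepcopy(wrapper)
    deepcopy_error = None
except (pickle.PicklingError, AttributeError) as exc:
    # The exception type for unpicklable classes differs between Python versions.
    deepcopy_error = exc

print(type(deepcopy_error).__name__)
```

No network transfer is involved here, so custom Dask serializers alone would not help with a failure of this shape; the copy itself has to avoid the plain pickle path.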
