# DoubleML with auto-sklearn

In [None]:
!pip install sklearn -U

Requirement already up-to-date: sklearn in /usr/local/lib/python3.7/dist-packages (0.0)


In [None]:
!sudo apt-get install build-essential swig
!pip install auto-sklearn==0.12.4
!python -m pip install dask distributed --upgrade

Reading package lists... Done
Building dependency tree       
Reading state information... Done
build-essential is already the newest version (12.4ubuntu1).
swig is already the newest version (3.0.12-1).
0 upgraded, 0 newly installed, 0 to remove and 30 not upgraded.
Requirement already up-to-date: dask in /usr/local/lib/python3.7/dist-packages (2021.3.1)
Requirement already up-to-date: distributed in /usr/local/lib/python3.7/dist-packages (2021.3.1)


In [None]:
import numpy as np
import pandas as pd
from sklearn.base import clone
import sklearn

from autosklearn.classification import AutoSklearnClassifier
from autosklearn.regression import AutoSklearnRegressor

from google.colab import files
from tqdm.notebook import tqdm
import pickle


# An Adapted AutoSklearnRegressor Class to Use with DoubleML

Two things need to be adapted if the `AutoSklearnRegressor` is tuned externally and DoubleML should then only call `refit()` on each fold.

1. `DoubleML.fit()` should call `AutoSklearnRegressor.refit()` instead of `AutoSklearnRegressor.fit()`. To achieve this, we define a new class `AutoSklearnRegressorDoubleML`. It inherits from `AutoSklearnRegressor`, i.e., it comes with the same properties and functionality. We override the `fit()` method so that a call to `fit()` is redirected to `refit()`. To still be able to tune the model, we add a new method `tune()`, which calls the original `AutoSklearnRegressor.fit()` method from an object of class `AutoSklearnRegressorDoubleML`.
2. At the moment, `DoubleML.fit()` clones the learners during the cross-validated estimation process. To do so, it uses `sklearn.base.clone` (https://scikit-learn.org/stable/modules/generated/sklearn.base.clone.html). Applying `sklearn.base.clone` "constructs a new unfitted estimator with the same parameters". For objects of class `AutoSklearnRegressor`, this means in particular that the tuned models, i.e., `AutoSklearnRegressor.automl_`, are removed. In the background, `sklearn.base.clone` rebuilds the estimator from the parameters returned by `get_params`, which, like `set_params`, is available for all scikit-learn learners. To be able to pass through a pre-tuned auto-sklearn model, we therefore slightly adapt the `set_params` method of the new class `AutoSklearnRegressorDoubleML`. It now accepts the pre-tuned model (the attribute `automl_`) as a parameter. This works because the parameters registered via `DoubleML.set_ml_nuisance_params()` are set after the `clone` within the cross-validated estimation process.
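The second point can be illustrated without auto-sklearn. The following is a minimal sketch, not the DoubleML or auto-sklearn implementation: `ToyLearner`, `ToyLearnerPassThrough` and the attribute `state_` are hypothetical stand-ins for `AutoSklearnRegressor`, `AutoSklearnRegressorDoubleML` and `automl_`. It shows that `clone` keeps constructor parameters but drops fitted state, and that an adapted `set_params` can put the state back afterwards.

```python
from sklearn.base import BaseEstimator, clone


class ToyLearner(BaseEstimator):
    """Hypothetical stand-in for AutoSklearnRegressor; `state_` mimics `automl_`."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        # Pretend that an expensive search happened here.
        self.state_ = {"tuned_on": len(X)}
        return self


class ToyLearnerPassThrough(ToyLearner):
    def set_params(self, **params):
        # Pull the fitted state out before the regular parameter validation.
        params = dict(params)
        if "state_" in params:
            self.state_ = params.pop("state_")
        return super().set_params(**params)


tuned = ToyLearnerPassThrough(alpha=0.5).fit([[0], [1], [2]], [0, 1, 2])

cloned = clone(tuned)
dropped_by_clone = not hasattr(cloned, "state_")
print("state_ dropped by clone:", dropped_by_clone)

# Setting the state after the clone restores the tuned object.
cloned.set_params(state_=tuned.state_)
print("state_ restored:", cloned.state_ is tuned.state_)
```

With the plain `ToyLearner`, the same `set_params(state_=...)` call would be rejected as an invalid parameter, which is why the override is needed.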

In [None]:
class AutoSklearnRegressorDoubleML(AutoSklearnRegressor):
    def fit(self, X, y):
        # DoubleML calls fit() on every fold; only refit the already tuned models.
        return self.refit(X, y)

    def tune(self, X, y, X_test=None, y_test=None, feat_type=None, dataset_name=None):
        # Run the actual auto-sklearn search via the parent's fit().
        return super().fit(X, y, X_test=X_test, y_test=y_test,
                           feat_type=feat_type, dataset_name=dataset_name)

    def set_params(self, **params):
        # Accept a pre-tuned `automl_` alongside the regular constructor parameters.
        this_params = params.copy()
        if 'automl_' in this_params:
            self.automl_ = this_params.pop('automl_')
        super().set_params(**this_params)
        return self

class AutoSklearnClassifierDoubleML(AutoSklearnClassifier):
    target_type = 'binary'

    def fit(self, X, y):
        # DoubleML calls fit() on every fold; only refit the already tuned models.
        return self.refit(X, y)

    def tune(self, X, y, X_test=None, y_test=None, feat_type=None, dataset_name=None):
        # Run the actual auto-sklearn search via the parent's fit().
        return super().fit(X, y, X_test=X_test, y_test=y_test,
                           feat_type=feat_type, dataset_name=dataset_name)

    def set_params(self, **params):
        # Accept a pre-tuned `automl_` alongside the regular constructor parameters.
        this_params = params.copy()
        if 'automl_' in this_params:
            self.automl_ = this_params.pop('automl_')
        super().set_params(**this_params)
        return self

In [None]:
X = pd.read_csv("X_nt_normalised.csv", index_col=0)

In [None]:
df = pd.read_csv("dataset_00.csv", index_col=0)
y = df['y']

In [None]:
from autosklearn.metrics import mean_squared_error
model = AutoSklearnRegressorDoubleML(
    time_left_for_this_task=36000,
    per_run_time_limit=600,
    n_jobs=-1,
    memory_limit=10240,
    include_preprocessors=["no_preprocessing"],
    ensemble_size=10,
    seed=0,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds':5},
    metric=mean_squared_error)

In [None]:
model.tune(X, y)

AutoSklearnRegressorDoubleML(ensemble_size=10,
                             include_preprocessors=['no_preprocessing'],
                             memory_limit=10240, metric=mean_squared_error,
                             n_jobs=-1, per_run_time_limit=600,
                             resampling_strategy='cv',
                             resampling_strategy_arguments={'folds': 5}, seed=0,
                             time_left_for_this_task=36000)

In [None]:
print(model.show_models())

[(0.300000, SimpleRegressionPipeline({'data_preprocessing:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessing:categorical_transformer:category_coalescence:__choice__': 'minority_coalescer', 'data_preprocessing:numerical_transformer:imputation:strategy': 'most_frequent', 'data_preprocessing:numerical_transformer:rescaling:__choice__': 'none', 'feature_preprocessor:__choice__': 'no_preprocessing', 'regressor:__choice__': 'random_forest', 'data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction': 0.039623360293329364, 'regressor:random_forest:bootstrap': 'True', 'regressor:random_forest:criterion': 'friedman_mse', 'regressor:random_forest:max_depth': 'None', 'regressor:random_forest:max_features': 0.28349366048362246, 'regressor:random_forest:max_leaf_nodes': 'None', 'regressor:random_forest:min_impurity_decrease': 0.0, 'regressor:random_forest:min_samples_leaf': 7, 'regressor:random_forest:min_sampl

In [None]:
# Use a context manager so that the file is flushed and closed after writing.
with open("model.pickle", "wb") as f:
    pickle.dump(model, f)

In [None]:
n = len(X)
cut_points = [fold * n // 5 for fold in range(5)] + [n]
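The integer arithmetic behind `cut_points` is easiest to check on a small example. The sketch below wraps the same expression in a hypothetical helper `contiguous_cut_points` (the name and the sample size 23 are illustrative only) and verifies that the resulting blocks are contiguous and cover every row exactly once.

```python
def contiguous_cut_points(n, n_folds=5):
    """Boundaries of n_folds contiguous blocks that together cover range(n)."""
    return [fold * n // n_folds for fold in range(n_folds)] + [n]


cuts = contiguous_cut_points(23)
print(cuts)  # [0, 4, 9, 13, 18, 23]

# Block sizes differ by at most one when n is not divisible by n_folds.
fold_sizes = [stop - start for start, stop in zip(cuts[:-1], cuts[1:])]
print(fold_sizes)  # [4, 5, 4, 5, 5]

# Every row index lands in exactly one held-out block.
covered = [i for start, stop in zip(cuts[:-1], cuts[1:]) for i in range(start, stop)]
assert covered == list(range(23))
```

Because the blocks follow the row order, this scheme assumes that the rows are not sorted in a way that correlates with the outcome.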

In [None]:
for knob in range(4, 9):
    for dataset in tqdm(range(10)):
        y = pd.read_csv(f"dataset_{knob}{dataset}.csv", index_col=0)["y"]
        y_hat = pd.Series(index=range(n), dtype='float64')
        for fold in range(5):
            # Rows [start, stop) are held out; all other rows are used for refitting.
            start, stop = cut_points[fold], cut_points[fold + 1]
            train_mask = np.ones(n, dtype=bool)
            train_mask[start:stop] = False
            # Select by position so the split does not depend on the index labels.
            X_in, y_in = X.iloc[train_mask], y.iloc[train_mask]
            X_out = X.iloc[start:stop]
            model.fit(X_in, y_in)  # redirected to refit(), reuses the tuned models
            y_hat.iloc[start:stop] = model.predict(X_out)
        y_hat.to_csv(f"yhat_dataset_{knob}{dataset}.csv")

In [None]:
for dataset in range(10):
    files.download("yhat_dataset_8"+str(dataset)+".csv")

# Reproducibility of AutoSklearn
It is difficult to verify whether the slightly adapted `AutoSklearnRegressorDoubleML` class really does what it should.
The primary reason is that `AutoSklearnRegressor` does not seem to be fully reproducible (see also the discussions at https://github.com/automl/auto-sklearn/issues/514 and https://github.com/automl/auto-sklearn/issues/725).

In [None]:
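# Fit an AutoSklearnRegressor.
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

X, y = sklearn.datasets.load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)
automl_reg = AutoSklearnRegressor(time_left_for_this_task=120, seed=3141)
automl_reg.fit(X_train, y_train)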

To verify that the `AutoSklearnRegressorDoubleML` class behaves as expected, the following calls should be equivalent:

In [None]:
np.random.seed(3141)
automl_reg.refit(X_test, y_test)
y_hat = automl_reg.predict(X_test)
print("R2 score:", sklearn.metrics.r2_score(y_test, y_hat))

In [None]:
np.random.seed(3141)
automl_ = automl_reg.automl_
new_automl_reg = clone(automl_reg)
new_automl_reg.automl_ = automl_ # set the fitted model as in AutoSklearnRegressorDoubleML
new_automl_reg.refit(X_test, y_test)
#print(new_automl_reg.show_models())
y_hat = new_automl_reg.predict(X_test)
print("R2 score:", sklearn.metrics.r2_score(y_test, y_hat))

We see that we do not get exactly the same result.
However, even without the clone we run into the same "not fully reproducible" issue:

In [None]:
np.random.seed(3141)
automl_reg.refit(X_test, y_test)
#print(automl_reg.show_models())
y_hat = automl_reg.predict(X_test)
print("R2 score:", sklearn.metrics.r2_score(y_test, y_hat))

In [None]:
np.random.seed(3141)
automl_reg.refit(X_test, y_test)
#print(automl_reg.show_models())
y_hat = automl_reg.predict(X_test)
print("R2 score:", sklearn.metrics.r2_score(y_test, y_hat))
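Comparing printed R2 scores only shows that two runs differ, not by how much. One option is to compare the prediction vectors directly. The following is a minimal sketch: the helper `compare_predictions` and the numbers in `run_1` and `run_2` are made up for illustration and are not results from the runs above.

```python
import numpy as np


def compare_predictions(y_hat_a, y_hat_b, rtol=1e-7, atol=0.0):
    """Summarise how far two prediction vectors of equal length are apart."""
    a = np.asarray(y_hat_a, dtype=float)
    b = np.asarray(y_hat_b, dtype=float)
    return {
        "identical": bool(np.array_equal(a, b)),                  # bitwise equal values
        "close": bool(np.allclose(a, b, rtol=rtol, atol=atol)),   # equal up to tolerance
        "max_abs_diff": float(np.max(np.abs(a - b))),             # worst-case deviation
    }


# Illustrative, made-up predictions from two hypothetical refit runs.
run_1 = [21.0, 34.5, 18.2]
run_2 = [21.0, 34.5, 18.7]

res_same = compare_predictions(run_1, run_1)
res_diff = compare_predictions(run_1, run_2)
print(res_same)
print(res_diff)
```

In the notebook, the two arguments would be the `y_hat` arrays stored from two consecutive `refit()` and `predict()` calls.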