# How to write a scikit-learn compatible estimator/transformer

## Adrin Jalali
### @adrinjalali, Anaconda Inc., scikit-learn

### Intro to scikit-learn

A machine learning library implementing statistical machine learning methods.

__Out of scope:__
- Neural Networks (except a simple MLP implementation)
- GPU

__Important components in the API:__

- estimators (transformers and predictors)
- scorers
- meta-estimators
    - pipeline
    - grid search
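
As a rough illustration of how these components fit together, here is a minimal sketch. It assumes a built-in `LogisticRegression` and a synthetic dataset, neither of which comes from the talk's own code. A scorer is a callable taking `(estimator, X, y)`, and model-selection tools accept scorers by name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# A scorer is called as scorer(estimator, X, y) and returns a float
accuracy = get_scorer("accuracy")
train_accuracy = accuracy(clf, X, y)

# Model-selection helpers look the scorer up from its name
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=3, scoring="accuracy")
```

For a classifier, the `"accuracy"` scorer computes the same quantity as the estimator's own `score` method.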

### Basic API

- `fit(X, y, **kwargs)`
- `predict(X)` (`predict_proba` and `decision_function`)
- `transform(X)`
- `score(X, y[, sample_weight])`
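
The methods above can be sketched with built-in estimators before writing a custom one. This example assumes `StandardScaler` as the transformer and `LogisticRegression` as the predictor; both are illustrative choices rather than anything prescribed by the talk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Transformer: fit learns the statistics, transform applies them
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Predictor: fit returns the estimator itself, so calls can be chained
clf = LogisticRegression().fit(X_scaled, y)
labels = clf.predict(X_scaled)             # shape (n_samples,)
proba = clf.predict_proba(X_scaled)        # shape (n_samples, n_classes)
margins = clf.decision_function(X_scaled)  # shape (n_samples,) for binary problems
accuracy = clf.score(X_scaled, y)          # mean accuracy for classifiers
```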


## Our First Estimator

In [1]:
import logging
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import (
    check_array, check_is_fitted, check_X_y)
from sklearn.svm import SVC

logger = logging.getLogger('mymodule')
logger.setLevel(logging.DEBUG)
FORMAT = '%(asctime)-15s %(message)s'
logging.basicConfig(format=FORMAT)

class MyClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self, C=1):
        self.C = C
    
    def fit(self, X, y, **fit_params):
        # input validation
        X, y = check_X_y(X, y)
        logger.debug(f"C: {self.C}")
        logger.debug(f"fitting on {len(X)} samples")
        # estimator_ is set in `fit`, not in `__init__`
        self.estimator_ = SVC(C=self.C).fit(X, y, **fit_params)
        logger.debug("fit end/")
        # a classifier requires a `classes_` attribute
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        # make sure the fit has been called
        check_is_fitted(self)
        X = check_array(X)
        logger.debug(f"predicting on {len(X)} samples")
        return self.estimator_.predict(X)

    def decision_function(self, X):
        check_is_fitted(self)
        X = check_array(X)
        logger.debug(f"decision function on {len(X)} samples")
        return self.estimator_.decision_function(X)
    

``` python
def check_is_fitted(estimator, attributes=None, msg=None, all_or_any=all):
    """Perform is_fitted validation for estimator.

    """

def check_array(array, accept_sparse=False, accept_large_sparse=True,
                dtype="numeric", order=None, copy=False, force_all_finite=True,
                ensure_2d=True, allow_nd=False, ensure_min_samples=1,
                ensure_min_features=1, estimator=None):

    """Input validation on an array, list, sparse matrix or similar.

    By default, the input is checked to be a non-empty 2D array containing
    only finite values. If the dtype of the array is object, attempt
    converting to float, raising on failure.
    """
    
def check_X_y(X, y, accept_sparse=False, accept_large_sparse=True,
              dtype="numeric", order=None, copy=False, force_all_finite=True,
              ensure_2d=True, allow_nd=False, multi_output=False,
              ensure_min_samples=1, ensure_min_features=1, y_numeric=False,
              estimator=None):
    """Input validation for standard estimators.

    Checks X and y for consistent length, enforces X to be 2D and y 1D. By
    default, X is checked to be non-empty and containing only finite values.
    Standard input checks are also applied to y, such as checking that y
    does not have np.nan or np.inf targets. For multi-label y, set
    multi_output=True to allow 2D and sparse y. If the dtype of X is
    object, attempt converting to float, raising on failure.
    """
```
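
To see what these helpers do in practice, here is a small sketch that relies only on their default arguments. The input values are made up for illustration.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.svm import SVC
from sklearn.utils.validation import check_array, check_is_fitted

# Nested lists are converted to a 2D NumPy array
X = check_array([[1, 2], [3, 4]])

# 1D input is rejected because ensure_2d=True by default
try:
    check_array([1, 2, 3])
    rejected_1d = False
except ValueError:
    rejected_1d = True

# Non-finite values are rejected by default
try:
    check_array([[1.0, np.nan]])
    rejected_nan = False
except ValueError:
    rejected_nan = True

# An estimator that has not been fitted raises NotFittedError
try:
    check_is_fitted(SVC())
    raised_not_fitted = False
except NotFittedError:
    raised_not_fitted = True
```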

### Fit the Model

In [2]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_informative=10)
X_train, X_test, y_train, y_test = train_test_split(X, y)

clf = MyClassifier().fit(X_train, y_train)

2020-01-21 11:28:52,264 C: 1
2020-01-21 11:28:52,265 fitting on 750 samples
2020-01-21 11:28:52,279 fit end/


### Get a Score

In [3]:
clf.score(X_test, y_test)

2020-01-21 11:28:52,285 predicting on 250 samples


0.94

## Run Common Tests on Our Classifier

In [4]:
from sklearn.utils.estimator_checks import parametrize_with_checks

@parametrize_with_checks([MyClassifier])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)

In [5]:
import ipytest
ipytest.config(rewrite_asserts=True, magics=True)
__file__ = 'custom_estimators.ipynb'

logger.setLevel(logging.INFO)

ipytest.run()

platform linux -- Python 3.8.1, pytest-5.3.3, py-1.8.1, pluggy-0.13.1
rootdir: /home/adrin/Documents/talks/sklearn-estimator-fossdem-2020
collected 40 items

custom_estimators.py ........................................ [100%]



## In a Pipeline

In [6]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest

pipe = make_pipeline(SelectKBest(k=2), MyClassifier(C=.1))
pipe.fit(X_train, y_train).score(X_test, y_test)

0.74

## The Pipeline in a Grid Search

In [7]:
param_grid = {'selectkbest__k': [1, 2, 5, 10, 20],
              'myclassifier__C': [0.1, 1, 100]}
logger.setLevel(logging.INFO)
gs = GridSearchCV(pipe, param_grid=param_grid).fit(X_train, y_train)
gs.score(X_test, y_test)

0.94

In [8]:
gs.best_estimator_

Pipeline(memory=None,
         steps=[('selectkbest',
                 SelectKBest(k=20,
                             score_func=<function f_classif at 0x7fa77c9c3700>)),
                ('myclassifier', MyClassifier(C=1))],
         verbose=False)
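
The `param_grid` keys follow the `<step name>__<parameter>` convention, where `make_pipeline` derives each step name from the lowercased class name. A minimal sketch, assuming a built-in `SVC` in place of `MyClassifier` so that the snippet stands on its own:

```python
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

pipe = make_pipeline(SelectKBest(k=2), SVC(C=0.1))

# Step names are generated from the class names
step_names = [name for name, _ in pipe.steps]

# Nested parameters are addressed as <step name>__<parameter>
pipe.set_params(selectkbest__k=5, svc__C=10)
nested_params = pipe.get_params()
```

Grid search uses the same `set_params` mechanism to apply each candidate combination.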

## Conventions

- `fit` should only receive sample-aligned data (such as `sample_weight`) in `fit_params`
    - everything else should go through `__init__`
- `__init__` only stores the parameters passed to it, and does nothing else
- `attr` is set through `__init__` and `set_params`
- `attr_` is set during fit and counts as public API
- `_attr` is private
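
These conventions are what make `get_params`, `set_params`, and `clone` work. Here is a minimal sketch using a hypothetical `MeanShifter` transformer; the class and its `scale` parameter are invented for illustration and are not part of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.utils.validation import check_array, check_is_fitted


class MeanShifter(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: subtract `scale` times the per-feature mean."""

    def __init__(self, scale=1.0):
        # Only store the parameter: no validation, no derived state
        self.scale = scale

    def fit(self, X, y=None):
        X = check_array(X)
        # Learned state gets a trailing underscore
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        check_is_fitted(self)
        X = check_array(X)
        return X - self.scale * self.mean_


X = np.array([[1.0, 2.0], [3.0, 6.0]])
shifter = MeanShifter(scale=1.0).fit(X)
X_centered = shifter.transform(X)

# get_params and set_params are inherited and rely on the __init__ signature
params = shifter.get_params()
shifter.set_params(scale=0.5)

# clone copies the parameters but not the fitted state
fresh = clone(shifter)
```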

## Estimator Tags

``` python
_DEFAULT_TAGS = {
    'non_deterministic': False,
    'requires_positive_X': False,
    'requires_positive_y': False,
    'X_types': ['2darray'],
    'poor_score': False,
    'no_validation': False,
    'multioutput': False,
    "allow_nan": False,
    'stateless': False,
    'multilabel': False,
    '_skip_test': False,
    'multioutput_only': False,
    'binary_only': False,
    'requires_fit': True}
```

You can override the defaults by defining `_more_tags` on your estimator:

``` python
class MyMultiOutputEstimator(BaseEstimator):

    def _more_tags(self):
        return {'multioutput_only': True,
                'non_deterministic': True}
```

## Upcoming

- `n_features_in_` [#16112](https://github.com/scikit-learn/scikit-learn/pull/16112), `n_features_out_` [#14241](https://github.com/scikit-learn/scikit-learn/pull/14241)
- `feature_names_in_`, `feature_names_out_` [SLEP#7](https://github.com/scikit-learn/enhancement_proposals/pull/17), [SLEP#8](https://github.com/scikit-learn/enhancement_proposals/pull/18), [SLEP#12](https://github.com/scikit-learn/enhancement_proposals/pull/25)
- sample/feature/data properties (through `_request_props`?) [SLEP#6](https://github.com/scikit-learn/enhancement_proposals/pull/16), [#16079](https://github.com/scikit-learn/scikit-learn/pull/16079)

## More Details

- [https://scikit-learn.org/stable/developers/develop.html](https://scikit-learn.org/stable/developers/develop.html)
- `sklearn/base.py`
- `sklearn/utils/metaestimators.py`
- `sklearn/utils/validation.py`

## Questions?
### Thank you!