# How to write a scikit-learn compatible estimator

## Adrin Jalali
### @adrinjalali, scikit-learn

### Intro to scikit-learn

A machine learning library implementing statistical machine learning methods.

__Out of scope:__
- Neural Networks (except a simple MLP implementation)
- GPU

__Important components in the API:__

- estimators (transformers and predictors)
- scorers
- meta-estimators
    - pipeline
    - grid search

``` python
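from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# A common workflow includes a pipeline once the data is loaded.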
# We usually preprocess the data and prepare it for the
# final classifier or regressor.
# We call each step an "estimator", the preprocessing steps which
# augment the data "transformers", and the final step a classifier
# or a regressor.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# Each step can be tuned with many hyper-parameters, and we want to
# find the best hyper-parameter set for the given dataset.
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'clf__max_iter': (20,),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
}

# To find the best parameters for both the feature extraction and the
# classifier, we use a grid search.
grid_search = GridSearchCV(pipeline, parameters)
```
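To make the workflow above concrete, here is a minimal sketch that actually fits a grid search over a text pipeline. The toy corpus, the labels, and the reduced parameter grid are made up for illustration; they are assumptions, not part of the original example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (made up for illustration): 0 = pets, 1 = vehicles.
texts = [
    "the cat sat on the mat", "dogs chase the cat", "my dog loves bones",
    "a kitten and a puppy play", "the puppy sleeps all day", "cats purr loudly",
    "the car needs new tires", "engines burn fuel", "a truck hauls cargo",
    "the car engine is loud", "new tires for the truck", "fuel prices and cars",
]
labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Keys follow the "<step name>__<parameter name>" convention.
parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-5, 1e-4],
}

# 2 x 2 = 4 candidates, each evaluated with 3-fold cross-validation.
grid_search = GridSearchCV(pipeline, parameters, cv=3)
grid_search.fit(texts, labels)

print(grid_search.best_params_)
print(grid_search.best_score_)
```

After fitting, `grid_search` behaves like the best pipeline it found, so `grid_search.predict(...)` can be called on raw text.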

## Why a custom estimator?

- scikit-learn doesn't include all algorithms, and it has a very high bar for including one. You can test your new or modified algorithm as a custom estimator.
- The library does not include methods that are appropriate only for a small set of use cases. If you work on one of those problems, you may need to develop your own estimator to tackle its specific issues.
- You may want to add some steps before or after running each step, in which case you can write a meta-estimator wrapping around the steps you usually would have in a pipeline.
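The last point, wrapping an existing estimator to add a step around it, can be sketched as follows. `ClippedRegressor` and its `min_value`/`max_value` parameters are hypothetical names chosen for this illustration; the sketch only shows the general shape of a meta-estimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, MetaEstimatorMixin, RegressorMixin, clone
from sklearn.linear_model import LinearRegression


class ClippedRegressor(RegressorMixin, MetaEstimatorMixin, BaseEstimator):
    """Hypothetical meta-estimator: clips the wrapped regressor's predictions."""

    def __init__(self, estimator, min_value=0.0, max_value=1.0):
        self.estimator = estimator
        self.min_value = min_value
        self.max_value = max_value

    def fit(self, X, y):
        # Fit a clone so the estimator passed in by the user stays untouched.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def predict(self, X):
        # Post-processing step added around the wrapped estimator.
        raw = self.estimator_.predict(X)
        return np.clip(raw, self.min_value, self.max_value)


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])

model = ClippedRegressor(LinearRegression(), min_value=0.0, max_value=2.0).fit(X, y)
predictions = model.predict(np.array([[-1.0], [1.0], [5.0]]))
print(predictions)
```

Because the wrapped estimator is an ordinary `__init__` parameter, its own parameters stay reachable for tuning under the `estimator__<name>` key.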

## Example - True Story!
### Train

In [11]:
import numpy as np
import spacy
# need to run: python -m spacy download en_core_web_sm
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = fetch_20newsgroups(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X[:500], y[:500])

nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(X_train))
feature_matrix = np.array(list(map(lambda x: x.vector, docs)))

clf = SGDClassifier().fit(feature_matrix, y_train)
clf.score(feature_matrix, y_train)

0.5813333333333334

### Test

In [12]:
docs = list(nlp.pipe(X_test))
feature_matrix_test = np.array(list(map(lambda x: x.vector, docs)))
clf.score(feature_matrix_test, y_test)

0.184

## Deployment and backend

If the model is being _deployed_ in a backend, the backend would need to know how to process the data and how to run the model.
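One way to hand both pieces to a backend is to ship a single fitted pipeline object. The sketch below uses `pickle` with a stand-in scaler and classifier on synthetic data; these choices are assumptions for illustration, and the loading side is assumed to have compatible library versions installed.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Preprocessing and model live in one object ...
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# ... so the backend only has to load it and call predict on raw input.
payload = pickle.dumps(pipe)
restored = pickle.loads(payload)

same = (restored.predict(X) == pipe.predict(X)).all()
print(same)
```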

### Basic API

- `fit(X, y, **kwargs)`
- `predict(X)` (`predict_proba` and `decision_function`)
- `transform(X)`
- `score(X, y[, sample_weight])`
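As a rough illustration of these methods, the toy classifier below implements `fit` and `predict` itself and inherits `score` (mean accuracy) from `ClassifierMixin`. `MajorityClassifier` is a hypothetical name used only for this sketch.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy classifier: always predicts the most frequent class."""

    def fit(self, X, y):
        # Learned attributes end with an underscore.
        self.classes_, counts = np.unique(y, return_counts=True)
        self.majority_ = self.classes_[np.argmax(counts)]
        return self  # fit returns self so calls can be chained

    def predict(self, X):
        return np.full(len(X), self.majority_)


X = np.zeros((5, 2))
y = np.array([1, 1, 1, 0, 0])

clf = MajorityClassifier().fit(X, y)
print(clf.predict(X))
print(clf.score(X, y))  # accuracy, provided by ClassifierMixin
```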


## Our Custom Transformer

In [1]:
import numpy as np
import spacy
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted

class SpacyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, language_model='en_core_web_sm'):
        self.language_model = language_model

    def fit(self, X=None, y=None):
        self.nlp_ = spacy.load(self.language_model)
        return self

    def transform(self, X, y=None):
        check_is_fitted(self)
        try:
            docs = list(self.nlp_.pipe(X))
        except OSError:
            # This is needed when the language model is not pickled with
            # the transformer itself: reload the configured model.
            self.nlp_ = spacy.load(self.language_model)
            docs = list(self.nlp_.pipe(X))

        feature_matrix = np.array(list(map(lambda x: x.vector, docs)))
        return feature_matrix

    def _more_tags(self):
        return {'X_types': ['string']}

``` python
def check_is_fitted(estimator, attributes=None, msg=None, all_or_any=all):
    """Perform is_fitted validation for estimator.

    """

def check_array(array, accept_sparse=False, accept_large_sparse=True,
                dtype="numeric", order=None, copy=False, force_all_finite=True,
                ensure_2d=True, allow_nd=False, ensure_min_samples=1,
                ensure_min_features=1, estimator=None):

    """Input validation on an array, list, sparse matrix or similar.

    By default, the input is checked to be a non-empty 2D array containing
    only finite values. If the dtype of the array is object, attempt
    converting to float, raising on failure.
    """
    
def check_X_y(X, y, accept_sparse=False, accept_large_sparse=True,
              dtype="numeric", order=None, copy=False, force_all_finite=True,
              ensure_2d=True, allow_nd=False, multi_output=False,
              ensure_min_samples=1, ensure_min_features=1, y_numeric=False,
              estimator=None):
    """Input validation for standard estimators.

    Checks X and y for consistent length, enforces X to be 2D and y 1D. By
    default, X is checked to be non-empty and containing only finite values.
    Standard input checks are also applied to y, such as checking that y
    does not have np.nan or np.inf targets. For multi-label y, set
    multi_output=True to allow 2D and sparse y. If the dtype of X is
    object, attempt converting to float, raising on failure.
    """
```
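A minimal sketch of how these helpers are typically used inside an estimator: `check_array` validates the input in `fit` and `transform`, and `check_is_fitted` guards `transform`. `MeanCenterer` is a hypothetical transformer written for this illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_array, check_is_fitted


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: subtracts the per-feature training mean."""

    def fit(self, X, y=None):
        # By default this rejects 1D input, NaN/inf and non-numeric data.
        X = check_array(X)
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        # Raises NotFittedError when no fitted attribute (name_) exists yet.
        check_is_fitted(self)
        X = check_array(X)
        return X - self.mean_


try:
    MeanCenterer().transform([[1.0, 2.0]])
    raised = False
except NotFittedError:
    raised = True

centered = MeanCenterer().fit_transform([[1.0, 2.0], [3.0, 6.0]])
print(raised)
print(centered)
```

For estimators that also take a target, `check_X_y(X, y)` plays the same role in `fit` and additionally checks that `X` and `y` have consistent lengths.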

## Run Common Tests on Our Transformer

In [3]:
from sklearn.utils.estimator_checks import parametrize_with_checks

@parametrize_with_checks([SpacyTransformer()])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)

In [4]:
import ipytest
ipytest.config(rewrite_asserts=True, magics=True)
__file__ = 'custom_estimators.ipynb'

ipytest.run()

platform linux -- Python 3.8.1, pytest-5.3.3, py-1.8.1, pluggy-0.13.1
rootdir: /home/adrin/Documents/talks/sklearn-estimator-spacy
collected 1 item

custom_estimators.py s                                                   [100%]

/home/adrin/miniconda3/envs/talks/lib/python3.8/site-packages/sklearn/utils/estimator_checks.py:234



### Fit the Model

In [5]:
SpacyTransformer().fit().transform(X_train)

array([[ 0.45556316,  0.17772849, -0.96406066, ...,  0.60679364,
         0.72729415,  0.5039766 ],
       [ 0.23949091, -0.09925516, -0.97056216, ...,  0.578748  ,
         0.3359907 ,  0.43843523],
       [ 0.46740833,  0.8174765 , -1.347426  , ...,  0.6512799 ,
         0.33042884,  0.7753896 ],
       ...,
       [ 0.31319967, -0.34968168, -1.3392841 , ...,  0.2398599 ,
         0.46239212,  0.4577189 ],
       [ 0.4845453 ,  0.23500457, -0.88941014, ...,  0.5740574 ,
         0.23149489,  0.663821  ],
       [ 0.40018892,  0.3793777 , -1.191784  , ...,  0.7832488 ,
         0.00516417,  0.70668983]], dtype=float32)

## In a Pipeline

In [6]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDClassifier

pipe = make_pipeline(SpacyTransformer(), SGDClassifier())
pipe.fit(X_train, y_train).score(X_test, y_test)

0.152
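The same pattern can be tried without downloading a spaCy model. The sketch below swaps in a hypothetical `TextStatsTransformer` that turns each string into two numeric features; it is a stand-in for illustration, not the transformer used above, and the documents and labels are made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class TextStatsTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for SpacyTransformer: strings in, 2D floats out."""

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        # Two features per document: character count and word count.
        return np.array([[len(doc), len(doc.split())] for doc in X], dtype=float)


docs = [
    "short", "tiny", "a b",
    "this is a much longer document",
    "another fairly long piece of text",
    "many many words appear in here",
]
labels = [0, 0, 0, 1, 1, 1]

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(TextStatsTransformer(), LogisticRegression())
pipe.fit(docs, labels)

print(list(pipe.named_steps))
print(pipe.predict(["hi", "a considerably longer sentence than the others"]))
```

The step names generated by `make_pipeline` are what the `sgdclassifier__...` style keys in a parameter grid refer to.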

## The Pipeline in a Grid Search

In [7]:
param_grid = {'sgdclassifier__penalty': ['l1', 'l2'],
              'sgdclassifier__alpha': [0.0001, 0.001]}
gs = GridSearchCV(pipe, param_grid=param_grid, n_jobs=6, verbose=1).fit(X_train, y_train)
gs.score(X_test, y_test)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  20 out of  20 | elapsed:  1.5min finished


0.168

In [8]:
gs.best_estimator_

Pipeline(steps=[('spacytransformer', SpacyTransformer()),
                ('sgdclassifier', SGDClassifier(penalty='l1'))])

In [None]:
gs.set_params(n_jobs=4)
gs.fit(...)

## Conventions

- `fit` should only receive sample-aligned data (for example `sample_weight`) in `fit_params`
    - everything else should go through `__init__`
- `__init__` doesn't set anything other than the parameters passed to it
- `obj.attr` is set through `__init__` and `set_params`
- `obj.attr_` is set during fit and counts as public API
- `obj._attr` is private

## Estimator Tags

``` python
_DEFAULT_TAGS = {
    'non_deterministic': False,
    'requires_positive_X': False,
    'requires_positive_y': False,
    'X_types': ['2darray'],
    'poor_score': False,
    'no_validation': False,
    'multioutput': False,
    'allow_nan': False,
    'stateless': False,
    'multilabel': False,
    '_skip_test': False,
    'multioutput_only': False,
    'binary_only': False,
    'requires_fit': True}
```

You can change them with:

``` python
class MyMultiOutputEstimator(BaseEstimator):

    def _more_tags(self):
        return {'multioutput_only': True,
                'non_deterministic': True}
```

## Upcoming Features

- `n_features_in_` [#16112](https://github.com/scikit-learn/scikit-learn/pull/16112), `n_features_out_` [#14241](https://github.com/scikit-learn/scikit-learn/pull/14241)
- `feature_names_in_`, `feature_names_out_` [SLEP#7](https://github.com/scikit-learn/enhancement_proposals/pull/17), [SLEP#8](https://github.com/scikit-learn/enhancement_proposals/pull/18), [SLEP#12](https://github.com/scikit-learn/enhancement_proposals/pull/25)
- sample/feature/data properties (through `_request_props`?) [SLEP#6](https://github.com/scikit-learn/enhancement_proposals/pull/16), [#16079](https://github.com/scikit-learn/scikit-learn/pull/16079)

## More Details/Further Reading

- [https://scikit-learn.org/stable/developers/develop.html](https://scikit-learn.org/stable/developers/develop.html)
- `sklearn/base.py`
- `sklearn/utils/metaestimators.py`
- `sklearn/utils/validation.py`

## Questions?
### Thank you!