# How to write a scikit-learn compatible estimator

## Adrin Jalali
### @adrinjalali, HuggingFace, scikit-learn, PyData Berlin

### Intro to scikit-learn

A machine learning library implementing statistical machine learning methods.

__Important components in the API:__

- estimators (transformers, classifiers, regressors, clustering algorithms)
- scorers
- meta-estimators
    - pipeline
    - grid search

``` python
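from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# A common workflow includes a pipeline once the data is loaded.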
# We usually preprocess the data and prepare it for the
# final classifier or regressor.
# We call each step an "estimator", the preprocessing steps which
# augment the data are "transformers", and the final step is a
# classifier or a regressor.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# Each step can be tuned with many hyper-parameters, and we want to
# find the best hyper-parameter set for the given dataset.
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'clf__max_iter': (20,),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
}

# To find the best parameters for both the feature extraction and the
# classifier, we use a grid search.
grid_search = GridSearchCV(pipeline, parameters)
```

But it can be a more complicated pipeline!
See the full example [here](https://scikit-learn.org/dev/auto_examples/compose/plot_column_transformer.html).
```python
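from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# subject_body_transformer and text_stats_transformer are defined in the
# full example linked above.
pipeline = Pipeline(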
    [
        # Extract subject & body
        ("subjectbody", subject_body_transformer),
        # Use ColumnTransformer to combine the subject and body features
        (
            "union",
            ColumnTransformer(
                [
                    # bag-of-words for subject (col 0)
                    ("subject", TfidfVectorizer(min_df=50), 0),
                    # bag-of-words with decomposition for body (col 1)
                    (
                        "body_bow",
                        Pipeline(
                            [
                                ("tfidf", TfidfVectorizer()),
                                ("best", TruncatedSVD(n_components=50)),
                            ]
                        ),
                        1,
                    ),
                    # Pipeline for pulling text stats from post's body
                    (
                        "body_stats",
                        Pipeline(
                            [
                                (
                                    "stats",
                                    text_stats_transformer,
                                ),  # returns a list of dicts
                                (
                                    "vect",
                                    DictVectorizer(),
                                ),  # list of dicts -> feature matrix
                            ]
                        ),
                        1,
                    ),
                ],
                # weight above ColumnTransformer features
                transformer_weights={
                    "subject": 0.8,
                    "body_bow": 0.5,
                    "body_stats": 1.0,
                },
            ),
        ),
        # Use a SVC classifier on the combined features
        ("svc", LinearSVC(dual=False)),
    ],
    verbose=True,
)
```

## Why a custom estimator?

- scikit-learn doesn't include all algorithms, and it has a very high bar for including one. You can test your new or modified algorithm as a custom estimator.
- The library does not include methods that suit only a small set of use cases; if you work on one of those problems, you may need to develop your own estimator to address its specific issues.
- You may want to add some steps before or after running each step, in which case you can write a meta-estimator wrapping around the steps you usually would have in a pipeline.

### Basic API

- `fit(X, y[, sample_weight], **kwargs)`
- `predict(X)` (`predict_proba` and `decision_function`)
- `transform(X)`
- `score(X, y[, sample_weight])`
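
As a rough illustration of the methods listed above, the sketch below calls them on two built-in estimators. The dataset, the `random_state`, and the `max_iter` value are assumptions chosen only to keep the example small.

``` python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A transformer: fit learns the statistics, transform applies them.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# A classifier: fit learns the model, the other methods use it.
clf = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
labels = clf.predict(X_test_scaled)                 # one label per sample
probabilities = clf.predict_proba(X_test_scaled)    # one column per class
margins = clf.decision_function(X_test_scaled)      # raw decision scores
accuracy = clf.score(X_test_scaled, y_test)         # mean accuracy
```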


## Our First Estimator

``` python
import numpy as np
from sklearn.base import BaseEstimator

class MyClassifier(BaseEstimator):
    def fit(self, X, y):
        self._targets = np.unique(y)

    def predict(self, X):
        return np.asarray([self._targets[0]] * len(X))
```

``` python
from sklearn.utils.estimator_checks import check_estimator
check_estimator(MyClassifier())
```

The above check runs all estimator checks against your estimator. You can iteratively fix issues with your estimator until it passes.

``` python
def check_is_fitted(estimator, attributes=None, msg=None, all_or_any=all):
    """Perform is_fitted validation for estimator.

    ...
    """

def check_array(
    array,
    accept_sparse=False,
    *,
    accept_large_sparse=True,
    dtype="numeric",
    order=None,
    copy=False,
    force_all_finite=True,
    ensure_2d=True,
    allow_nd=False,
    ensure_min_samples=1,
    ensure_min_features=1,
    estimator=None,
    input_name="",
):
    """Input validation on an array, list, sparse matrix or similar.

    By default, the input is checked to be a non-empty 2D array containing
    only finite values. If the dtype of the array is object, attempt
    converting to float, raising on failure.
    
    ...
    """
    
def check_X_y(
    X,
    y,
    accept_sparse=False,
    *,
    accept_large_sparse=True,
    dtype="numeric",
    order=None,
    copy=False,
    force_all_finite=True,
    ensure_2d=True,
    allow_nd=False,
    multi_output=False,
    ensure_min_samples=1,
    ensure_min_features=1,
    y_numeric=False,
    estimator=None,
):
    """Input validation for standard estimators.

    Checks X and y for consistent length, enforces X to be 2D and y 1D. By
    default, X is checked to be non-empty and containing only finite values.
    Standard input checks are also applied to y, such as checking that y
    does not have np.nan or np.inf targets. For multi-label y, set
    multi_output=True to allow 2D and sparse y. If the dtype of X is
    object, attempt converting to float, raising on failure.
    
    ...
    """


def check_classification_targets(y):
    """Ensure that target y is of a non-regression type.

    Only the following target types (as defined in type_of_target) are allowed:
        'binary', 'multiclass', 'multiclass-multioutput',
        'multilabel-indicator', 'multilabel-sequences'
    ...
    """
```
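
One possible way to combine the validation helpers above in a classifier is sketched below. `MajorityClassifier` and its `majority_` attribute are hypothetical names invented for this illustration, and the sketch is not claimed to pass every common check.

``` python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.multiclass import check_classification_targets
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical example: always predicts the most frequent class."""

    def fit(self, X, y):
        # Validate X and y together, then reject regression-like targets.
        X, y = check_X_y(X, y)
        check_classification_targets(y)
        self.classes_, counts = np.unique(y, return_counts=True)
        self.majority_ = self.classes_[np.argmax(counts)]
        self.n_features_in_ = X.shape[1]
        return self

    def predict(self, X):
        # Raises NotFittedError if fit has not been called yet.
        check_is_fitted(self)
        X = check_array(X)
        return np.full(X.shape[0], self.majority_)


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1, 1, 1, 0])
clf = MajorityClassifier().fit(X, y)
predictions = clf.predict(X)
```

Returning `self` from `fit` is what allows the chained `MajorityClassifier().fit(X, y)` call, and `ClassifierMixin` supplies a default `score` based on accuracy.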

## Run Common Tests on Our Classifier with pytest

``` python
from sklearn.utils.estimator_checks import parametrize_with_checks

@parametrize_with_checks([MyClassifier()])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)
```

## A Custom Transformer

``` python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted

class MyTransformer(TransformerMixin, BaseEstimator):
    # y is optional so that fit(X) and fit_transform(X) also work
    def fit(self, X, y=None):
        return self

    def __sklearn_is_fitted__(self):
        return True

    def transform(self, X):
        return X + 1
```

``` python
from sklearn.utils.estimator_checks import check_estimator
check_estimator(MyTransformer())
```

## Transformer in a Pipeline

``` python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

est = make_pipeline(
    StandardScaler(),
    LogisticRegression()
).fit(X_train, y_train)
est[:-1].get_feature_names_out()
```

``` python
est = make_pipeline(
    MyTransformer(),
    StandardScaler(),
    LogisticRegression()
).fit(X_train, y_train)
est[:-1].get_feature_names_out()
```

``` python
est
```

## TimedEstimator

``` python
from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone
from sklearn.utils.validation import check_is_fitted
import time


class TimedEstimator(MetaEstimatorMixin, BaseEstimator):
    def __init__(self, estimator, verbose=0):
        self.estimator = estimator
        self.verbose = verbose

    def fit(self, X, y, **fit_params):
        start = time.perf_counter()
        self.estimator_ = clone(self.estimator).fit(X, y, **fit_params)
        end = time.perf_counter()
        self.time_ = end - start
        if self.verbose > 0:
            print(self.time_)
        return self
```

``` python
from sklearn.utils.estimator_checks import check_estimator
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.preprocessing import StandardScaler
check_estimator(TimedEstimator(estimator=LogisticRegression()))
check_estimator(TimedEstimator(estimator=LinearRegression()))
check_estimator(TimedEstimator(estimator=StandardScaler()))
```

## TimedEstimator for and in a Pipeline

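``` python
est = TimedEstimator(
    make_pipeline(
        MyTransformer(),
        StandardScaler(),
        TimedEstimator(LogisticRegression())
    )
).fit(X_train, y_train)
```

``` python
# The outer TimedEstimator is not indexable; the fitted pipeline lives in
# `estimator_`, and its last step is the inner TimedEstimator.
est.estimator_[-1].time_
```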

## In a GridSearchCV

``` python
from sklearn.model_selection import GridSearchCV

est = make_pipeline(
        MyTransformer(),
        StandardScaler(),
        LogisticRegression()
)

grid = {
    "logisticregression__C": [0.01, 0.1, 1, 10, 100]
}

gs = GridSearchCV(estimator=est, param_grid=grid)
gs.fit(X_train, y_train)
```

``` python
gs.best_estimator_
```

``` python
from sklearn.model_selection import GridSearchCV

est = TimedEstimator(
    make_pipeline(
        MyTransformer(),
        StandardScaler(),
        TimedEstimator(LogisticRegression())
    )
)

grid = {
    "estimator": [
        make_pipeline(
            MyTransformer(),
            StandardScaler(),
            TimedEstimator(LogisticRegression(C=C))
        )
        for C in [0.01, 0.1, 1, 10]
    ]
}

gs = GridSearchCV(estimator=est, param_grid=grid)
gs.fit(X_train, y_train)
```

``` python
gs.best_estimator_
```

``` python
from sklearn.model_selection import GridSearchCV

est = TimedEstimator(
    make_pipeline(
        MyTransformer(),
        StandardScaler(),
        TimedEstimator(LogisticRegression())
    )
)

grid = {
    "logisticregression__C": [0.01, 0.1, 1, 10, 100]
}

gs = GridSearchCV(estimator=est, param_grid=grid)
gs.fit(X_train, y_train)
```

``` python
gs.best_estimator_
```

## Conventions

- `fit` should only receive sample-aligned data in `fit_params`
    - everything else should go through `__init__`
- `__init__` doesn't set anything other than the parameters passed to it
- `obj.attr` is set through `__init__` and `set_params`
- `obj.attr_` is set during fit and counts as public API
- `obj._attr` is private
- feature-name support: `n_features_in_`, `feature_names_in_`, and `get_feature_names_out()`
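
A small sketch of why these conventions matter: `get_params`, `set_params`, and `clone` rely on `__init__` storing its arguments unchanged, and on learned state living in attributes set during `fit`. `ShiftTransformer` and its attributes are hypothetical names used only for this illustration.

``` python
import numpy as np
from sklearn.base import BaseEstimator, clone


class ShiftTransformer(BaseEstimator):
    def __init__(self, offset=1.0):
        # Only store the parameter; no validation or derived state here.
        self.offset = offset

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.n_features_in_ = X.shape[1]  # public, learned during fit
        self._fit_dtype = X.dtype         # private implementation detail
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) + self.offset


shift = ShiftTransformer(offset=2.0).fit([[0.0, 1.0]])
params = shift.get_params()   # read from the __init__ signature
fresh = clone(shift)          # same parameters, no fitted state
shift.set_params(offset=3.0)  # updates the public parameter in place
```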

## Estimator Tags

``` python
_DEFAULT_TAGS = {
    "non_deterministic": False,
    "requires_positive_X": False,
    "requires_positive_y": False,
    "X_types": ["2darray"],
    "poor_score": False,
    "no_validation": False,
    "multioutput": False,
    "allow_nan": False,
    "stateless": False,
    "multilabel": False,
    "_skip_test": False,
    "_xfail_checks": False,
    "multioutput_only": False,
    "binary_only": False,
    "requires_fit": True,
    "preserves_dtype": [np.float64],
    "requires_y": False,
    "pairwise": False,
}
```
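
These defaults can be overridden per estimator. As a hedged sketch: in scikit-learn releases that use the `_DEFAULT_TAGS` dictionary shown above, an estimator declares its own values by defining `_more_tags()`; newer releases moved to `__sklearn_tags__()`, so treat the method name as version-dependent. `NaNTolerantTransformer` is a hypothetical class.

``` python
from sklearn.base import BaseEstimator, TransformerMixin


class NaNTolerantTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that passes NaN values through untouched."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

    def _more_tags(self):
        # Only the tags that differ from the defaults need to be listed.
        return {"allow_nan": True, "stateless": True, "requires_fit": False}


overrides = NaNTolerantTransformer()._more_tags()
```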

## Upcoming Features

- passing feature names along the pipeline during `fit`: [SLEP#18](https://github.com/scikit-learn/enhancement_proposals/pull/68)
- sample/feature/data properties through `get_metadata_routing()` and `set_{method}_request()`: [SLEP#6](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep006/proposal.html), [#22893](https://github.com/scikit-learn/scikit-learn/issues/22893)
- standardized parameter validation through `validate_params` [#23462](https://github.com/scikit-learn/scikit-learn/issues/23462)

## More Details/Further Reading

- [https://scikit-learn.org/dev/developers/develop.html](https://scikit-learn.org/dev/developers/develop.html)
- `sklearn/base.py`
- `sklearn/metaestimators.py`
- `sklearn/utils/metaestimators.py`
- `sklearn/utils/validation.py`
- `check_random_state`

## Questions?
### Thank you!