In [None]:
#default_exp benchmark

In [None]:
#hide
from nbdev.showdoc import *

# Automatic benchmark model
> Functions to create a relevant, fast and reasonably well-performing benchmark

A Benchmark object has a similar API to a `scikit-learn` estimator: you build an instance with the desired arguments and fit it to the data later.

A Benchmark is a convenience wrapper that reads the training data, passes it through a simplified pipeline consisting of data imputation and a standard scaler, and then fits the benchmark function, calibrated with a grid search.

A `gingado` Benchmark comprises the following steps, all glued together:
* a split of the dataset into training and test datasets
* a `Pipeline` consisting of a missing data imputation step and a random forest estimator
* a grid search object that tunes the parameters of the random forest
* a `compare` method that helps users evaluate if their model is better than the benchmark
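
The first three steps listed above can be sketched with plain `scikit-learn`. This is a minimal illustration under stated assumptions, not `gingado`'s own implementation: the synthetic data, the `SimpleImputer` strategy and the parameter grid are all chosen here for the example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data with a few missing values (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Step 1: split the dataset into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 2: a pipeline with missing data imputation and a random forest
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("forest", RandomForestClassifier(n_estimators=25, random_state=0)),
])

# Step 3: a grid search that tunes the random forest (assumed grid)
param_grid = {"forest__max_depth": [2, None], "forest__min_samples_leaf": [1, 5]}
search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X_train, y_train)

# Accuracy of the tuned pipeline on the held-out test data
test_accuracy = search.score(X_test, y_test)
```

The fitted `search` object behaves like any other estimator, which is the property a Benchmark object relies on when it delegates `predict` and related methods.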

In addition to the methods that a Benchmark object has by virtue of being an estimator itself, it also has a `compare` method, which takes as argument either another fitted estimator (a single estimator or a whole pipeline) or a list of fitted estimators.
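
To make the idea of comparing fitted models concrete, the sketch below scores a benchmark and one or more candidates on the same test data. The helper `compare_on_test_set` is hypothetical and is not `gingado`'s `compare` method; it only assumes that every model exposes the usual `score` method.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def compare_on_test_set(benchmark, candidates, X_test, y_test):
    """Hypothetical helper: score fitted models on the same test data."""
    # Accept either a single fitted estimator or a list of them
    if not isinstance(candidates, (list, tuple)):
        candidates = [candidates]
    models = {"benchmark": benchmark}
    for i, model in enumerate(candidates):
        models[f"candidate_{i}"] = model
    # `score` returns R^2 for regressors and accuracy for classifiers
    scores = {name: model.score(X_test, y_test) for name, model in models.items()}
    best = max(scores, key=scores.get)
    return scores, best


# Illustrative data: a linear signal with some noise
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

scores, best = compare_on_test_set(forest, linear, X_test, y_test)
```

Scoring all models on data that none of them saw during fitting keeps the comparison fair.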

Benchmarks start with default values, but the user is also free to choose any of the benchmark's components by passing as arguments the data split, pipeline, a dictionary of parameters for the hyperparameter tuning, etc.

In [None]:
#export
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit, StratifiedShuffleSplit, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.utils.metaestimators import available_if
#from gingado.model_documentation import ModelCard

class ModelCard:
    def __init__(self):
        pass

def _benchmark_has(attr):
    # Check used by `available_if`: the method is exposed only if `self.benchmark` has `attr`
    def check(self):
        getattr(self.benchmark, attr)
        return True
    return check
        
class ggdBenchmark:
    """
    The base class for gingado's Benchmark objects.
    """
    def _check_is_time_series(self, X, y=None):
        """
        Checks whether the data is a time series, and sets a data splitter
        accordingly if no data splitter is provided by the user
        Note: all data without an index (eg, a Numpy array) are considered to NOT be a time series
        """
        if hasattr(X, "index"):
            self.is_timeseries = pd.core.dtypes.common.is_datetime_or_timedelta_dtype(X.index)
        else:
            self.is_timeseries = False
        if self.is_timeseries and y is not None:
            if hasattr(y, "index"):
                self.is_timeseries = pd.core.dtypes.common.is_datetime_or_timedelta_dtype(y.index)
            else:
                self.is_timeseries = False

        if self.cv is None:
            self.cv = TimeSeriesSplit() if self.is_timeseries else StratifiedShuffleSplit()

    def _creates_estimator(self):
        if self.estimator is None:
            pass

    def _fit(self, X, y):
        self._check_is_time_series(X, y)

        X, y = self._validate_data(X, y)

        if self.param_search and self.param_grid:                
            self.benchmark = self.param_search(estimator=self.estimator, param_grid=self.param_grid, scoring=self.scoring)
            self.benchmark.fit(X, y)

        if self.auto_document:
            self.document()

        return self
    
    def set_benchmark(self, estimator):
        self.benchmark = estimator

    def compare(self, X, candidate):
        """
        Uses a test dataset to compare the performance of the fitted benchmark model with one or more candidate models.
        The comparison is carried out with a grid search over the benchmark, the candidates and an ensemble of them.
        """
        # Step 1: create a param_grid *list* where the first item is the current benchmark,
        # ... the other elements are the candidate model(s), and the final model is an ensemble
        # ... of all the previous models (including the benchmark), with uniform weights (1/N)

        # Step 2: Evaluate them using the same CV strategy as defined in self.cv and select the best model

        # Step 3: The best model (or the ensemble) is now the current benchmark
        pass

    def document(self):
        pass

    @available_if(_benchmark_has("predict"))
    def predict(self, X, **predict_params):
        return self.benchmark.predict(X, **predict_params)

    @available_if(_benchmark_has("fit_predict"))
    def fit_predict(self, X, y=None, **predict_params):
        return self.benchmark.fit_predict(X, y, **predict_params)

    @available_if(_benchmark_has("predict_proba"))
    def predict_proba(self, X, **predict_proba_params):
        return self.benchmark.predict_proba(X, **predict_proba_params)

    @available_if(_benchmark_has("decision_function"))
    def decision_function(self, X):
        return self.benchmark.decision_function(X)
    
    @available_if(_benchmark_has("score_samples"))
    def score_samples(self, X):
        return self.benchmark.score_samples(X)

    @available_if(_benchmark_has("predict_log_proba"))
    def predict_log_proba(self, X, **predict_log_proba_params):
        return self.benchmark.predict_log_proba(X, **predict_log_proba_params)

In [None]:
#show_doc(Benchmark)

### Classification tasks

The default benchmark for classification tasks is a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) object. Its parameters are fine-tuned in each case according to the user's data.

In [None]:
#export
from sklearn.ensemble import RandomForestClassifier

class ClassificationBenchmark(ggdBenchmark):
    def __init__(self, cv=None, estimator=RandomForestClassifier(), param_grid=None, param_search=GridSearchCV, scoring=None, auto_document=ModelCard()):
        self.cv = cv
        self.estimator = estimator
        self.param_grid = param_grid
        self.param_search = param_search
        self.scoring = scoring
        self.auto_document = auto_document

    def fit(self, X, y=None):
        self._fit(X, y)
        return self

In [None]:
#show_doc(ClassificationBenchmark)

### Regression tasks

The default benchmark for regression tasks is a [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) object. Its parameters are fine-tuned in each case according to the user's data.

In [None]:
#export
from sklearn.ensemble import RandomForestRegressor

class RegressionBenchmark(ggdBenchmark):
    def __init__(self, cv=None, estimator=RandomForestRegressor(), param_search=GridSearchCV, param_grid=None, scoring=None, auto_document=ModelCard()):
        self.cv = cv
        self.estimator = estimator
        self.param_grid = param_grid
        self.param_search = param_search
        self.scoring = scoring
        self.auto_document = auto_document

    def fit(self, X, y=None):
        self._fit(X, y)
        return self

It is also simple to use a model that you have already fitted as the benchmark and still benefit from the other functionality provided by the `Benchmark` class. The same applies to a saved version of a fitted model (eg, the model you are using in production) that you want to use as the benchmark.

In [None]:
from sklearn.ensemble import RandomForestRegressor
#from gingado.benchmark import RegressionBenchmark

# `X` and `y` are assumed to be already loaded (see the data-loading cell further below)
forest = RandomForestRegressor().fit(X, y)

bm = RegressionBenchmark()
bm.set_benchmark(estimator=forest)

assert forest == bm.benchmark
assert hasattr(bm.benchmark, "predict")
# note that now the `bm` object can be used as the estimator: 
assert bm.predict(X).shape == y.shape

### Data split

Please refer to [this page](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) for more information on the different `Splitter` classes available on `scikit-learn`, and [this page](https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py) for practical advice on how to choose a splitter for data that are not time series. Any one of these objects (or a custom splitter that is compatible with them) can be passed to a `Benchmark` object.

The API does not accept custom parameters for the splitters. Users who wish to use specific parameters should pass an already configured `Splitter` object as the argument.
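
As an illustration of both points, the sketch below picks a default splitter from the type of the data's index and otherwise returns whatever configured splitter the user passes in. The helper `choose_splitter` is hypothetical: it mirrors the logic described for `_check_is_time_series`, but checks the index type with `isinstance` rather than with the `pandas` function used in the class.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, TimeSeriesSplit


def choose_splitter(X, cv=None):
    """Hypothetical helper: return the user's splitter, or a default one."""
    if cv is not None:
        return cv
    # Data without an index (eg, a NumPy array) is not treated as a time series
    index = getattr(X, "index", None)
    is_timeseries = isinstance(index, (pd.DatetimeIndex, pd.TimedeltaIndex))
    return TimeSeriesSplit() if is_timeseries else StratifiedShuffleSplit()


dates = pd.date_range("2022-01-03", periods=30, freq="D")
X_ts = pd.DataFrame({"a": np.arange(30.0)}, index=dates)
X_plain = np.arange(30.0).reshape(-1, 1)

default_ts = choose_splitter(X_ts)
default_plain = choose_splitter(X_plain)

# Custom parameters go into the splitter object itself
custom = choose_splitter(X_ts, cv=TimeSeriesSplit(n_splits=3, test_size=5))
folds = list(custom.split(X_ts))
```

Each fold of a `TimeSeriesSplit` trains only on observations that precede the test observations, which avoids leaking future information into the model.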

In [None]:
bm.get_params()

TypeError: BaseEstimator.get_params() missing 1 required positional argument: 'self'

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

p = Pipeline([('scaler', StandardScaler()), ('clf', RandomForestClassifier())])

In [None]:
p.steps

[('scaler', StandardScaler()), ('clf', RandomForestClassifier())]

In [None]:
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit()

In [None]:
from gingado.utils import load_EURFX_data
X = load_EURFX_data()
y = X.pop('BRL')

In [None]:
t

| TIME_PERIOD | AUD | CAD | CHF | GBP | JPY | SGD | USD |
|---|---|---|---|---|---|---|---|
| 2003-01-02 | 1.8554 | 1.6422 | 1.4528 | 0.65200 | 124.40 | 1.8188 | 1.0446 |
| 2003-01-03 | 1.8440 | 1.6264 | 1.4555 | 0.65000 | 124.56 | 1.8132 | 1.0392 |
| 2003-01-06 | 1.8281 | 1.6383 | 1.4563 | 0.64950 | 124.40 | 1.8210 | 1.0488 |
| 2003-01-07 | 1.8160 | 1.6257 | 1.4565 | 0.64960 | 124.82 | 1.8155 | 1.0425 |
| 2003-01-08 | 1.8132 | 1.6231 | 1.4586 | 0.64950 | 124.90 | 1.8102 | 1.0377 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-05-19 | 1.5036 | 1.3490 | 1.0265 | 0.84728 | 134.46 | 1.4576 | 1.0525 |
| 2022-05-20 | 1.4980 | 1.3526 | 1.0280 | 0.84820 | 135.34 | 1.4588 | 1.0577 |
| 2022-05-23 | 1.4982 | 1.3626 | 1.0310 | 0.84783 | 136.05 | 1.4639 | 1.0659 |
| 2022-05-24 | 1.5152 | 1.3714 | 1.0334 | 0.85750 | 136.49 | 1.4722 | 1.0720 |


### Custom benchmarks

`gingado` provides users with two `Benchmark` objects out of the box: `ClassificationBenchmark` and `RegressionBenchmark`, to be used depending on the task at hand. Both classes derive from a base class `ggdBenchmark`, which implements methods that facilitate model comparison. Users that want to create a customised benchmark model for themselves have two options:

* the simpler possibility is to train the estimator as usual, and then assign the fitted estimator to a `Benchmark` object.
* if the user wants more control over how the benchmark is fitted, they can create a class that subclasses `ggdBenchmark` and either implements custom `fit`, `predict` and `score` methods, or also subclasses [`scikit-learn`'s `BaseEstimator`](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html).
  * In any case, if the user wants the benchmark to automatically detect whether the data is a time series and to document the model right after fitting, the `fit` method should call `self._fit` on the data. Otherwise, the user can simply implement any consistent logic in `fit` as the user sees fit (pun intended).
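
The second option can be sketched as follows. To keep the example self-contained it subclasses only `scikit-learn`'s `BaseEstimator` and `RegressorMixin`; `MeanBenchmark` is a hypothetical class, and in `gingado` it would additionally subclass `ggdBenchmark` as described above.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


class MeanBenchmark(RegressorMixin, BaseEstimator):
    """Hypothetical benchmark that always predicts the training mean."""

    def fit(self, X, y):
        # A `ggdBenchmark` subclass could call `self._fit(X, y)` here instead,
        # to get time series detection and automatic documentation
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # One constant prediction per row of X
        return np.full(len(X), self.mean_)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

bm = MeanBenchmark().fit(X, y)
predictions = bm.predict(X)

# `score` is inherited from RegressorMixin and returns R^2
r2 = bm.score(X, y)
```

A constant prediction of this kind is a useful floor: any candidate model that cannot beat it is not learning anything from the features.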
