# Example showing how the caching works
- A change in estimator causes a fresh run
- An identical estimator (even if technically a different instance) will use cached data
- The first call always runs from scratch, since nothing has been cached yet.
- If the input data is different, the cached data is ignored.
- This works across runs: if you run this notebook again, all fit and predict calls will be quick because the results have already been saved. Delete the data in `./cache_data` to run it as new.
- We use a `RandomForestClassifier`. Try different estimators, or even a custom one with a slower fit/predict, and you'll see even greater improvements.
- This is a simple example. Anything you can do with an estimator you should be able to do with the wrapped estimator (use in `Pipeline`s, cross-validation, etc.).
- We've used numpy arrays, but it should also work nicely with pandas and polars data.
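
To make the "identical estimator, different instance" point concrete, here is a minimal sketch of how a cache key could be derived from an estimator's class and parameters rather than from the object itself. `ToyEstimator` and `estimator_key` are hypothetical names invented for this illustration, and this is not necessarily how `sklearn_estimator_caching` builds its keys. The only thing borrowed from scikit-learn is the `get_params` convention.

```python
import hashlib
import json


class ToyEstimator:
    """Hypothetical stand-in that follows scikit-learn's get_params convention."""

    def __init__(self, n_estimators=100, random_state=None):
        self.n_estimators = n_estimators
        self.random_state = random_state

    def get_params(self):
        return {"n_estimators": self.n_estimators, "random_state": self.random_state}


def estimator_key(estimator):
    """Hash the class name and the sorted parameters, never the object identity."""
    payload = {
        "class": type(estimator).__qualname__,
        "params": estimator.get_params(),
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


same_a = estimator_key(ToyEstimator())
same_b = estimator_key(ToyEstimator())  # a separate instance, same parameters
different = estimator_key(ToyEstimator(random_state=7))

print(same_a == same_b)     # True
print(same_a == different)  # False
```

Two instances built with the same arguments share a key, while changing any parameter produces a new one, which matches the behaviour listed above.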

In [1]:
import os
import sys

sys.path.append(os.path.abspath('../'))

In [2]:
import logging
import structlog

# Set the logging level for standard logging
logging.basicConfig(level=logging.WARNING)

# Configure structlog to use the standard logging
structlog.configure(
    processors=[
        structlog.processors.KeyValueRenderer(),
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    wrapper_class=structlog.stdlib.BoundLogger,
)

# Set the logging level for structlog as well
logging.getLogger().setLevel(logging.WARNING)

In [3]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

from sklearn_estimator_caching import EstimatorCachingWrapperGetter

In [4]:
wrapper = EstimatorCachingWrapperGetter(
    # If we modify `experiment` or `project` (or any other kwarg of your choice),
    # the data will be saved to and loaded from a new location.
    project="sklearn_wrapping_development",
    experiment="testing1",
    base_dir="./cache_data"
)
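
As a rough illustration of why changing `project` or `experiment` leads to a new cache location, the sketch below turns arbitrary keyword arguments into a directory path. The function name `cache_dir_for` and the `name=value` folder layout are assumptions made up for this example; the real on-disk layout used by the library may look quite different.

```python
from pathlib import Path


def cache_dir_for(base_dir, **namespace):
    """Build a directory path from sorted name=value pairs (hypothetical layout)."""
    path = Path(base_dir)
    for name in sorted(namespace):
        path = path / f"{name}={namespace[name]}"
    return path


first = cache_dir_for(
    "./cache_data", project="sklearn_wrapping_development", experiment="testing1"
)
second = cache_dir_for(
    "./cache_data", project="sklearn_wrapping_development", experiment="testing2"
)

print(first)
print(first != second)  # True
```

Because every keyword argument contributes to the path, two experiments never read each other's cached results under this scheme.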

In [5]:
# Large dataset so that fit/predict take some time. Increase these numbers and you'll see an even greater speedup.
X, y = make_classification(n_samples=100_000, n_features=10,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

In [6]:
random_forest = wrapper(
    RandomForestClassifier()
)
random_forest # note how the output retains the detail of the original classifier.


### First fit call
This should take a while since we have a lot of data.

In [7]:
%%time
random_forest.fit(X, y)



CPU times: user 34.4 s, sys: 206 ms, total: 34.6 s
Wall time: 36.7 s


### Second fit call
This should be very quick: the estimator and the data are the same, so we load from the cache.

In [8]:
%%time
random_forest.fit(X, y)



CPU times: user 128 ms, sys: 51.5 ms, total: 179 ms
Wall time: 134 ms
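
The load-or-fit pattern behind this speedup can be sketched in a few lines. This is an assumed, simplified version: `CountingFitter` and `fit_with_cache` are hypothetical helpers, the "model" is just a mean, and the fitted state is stored as JSON for simplicity, whereas a real wrapper would need to serialise a whole fitted estimator.

```python
import json
import tempfile
from pathlib import Path


class CountingFitter:
    """Hypothetical stand-in for an expensive fit; counts how often it really runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, values):
        self.calls += 1
        return {"mean": sum(values) / len(values)}


def fit_with_cache(fit_fn, values, key, cache_dir):
    """Return the saved state for `key` if it exists, otherwise fit and save it."""
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    state = fit_fn(values)
    path.write_text(json.dumps(state))
    return state


fitter = CountingFitter()
with tempfile.TemporaryDirectory() as cache_dir:
    first = fit_with_cache(fitter, [1.0, 2.0, 3.0], "demo-key", cache_dir)
    second = fit_with_cache(fitter, [1.0, 2.0, 3.0], "demo-key", cache_dir)

print(fitter.calls)     # 1
print(first == second)  # True
```

Since the state lives on disk rather than in memory, a cache like this also survives a kernel restart, which is why re-running the notebook is fast until the cache directory is deleted.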


### Change the training data
With the same estimator but different training data, the fit should run again from scratch.

In [9]:
X2 = X.copy()
np.random.shuffle(X2)

In [10]:
%%time
random_forest.fit(X2, y)



CPU times: user 54.5 s, sys: 244 ms, total: 54.8 s
Wall time: 56.9 s
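
One plausible way to detect that the training data has changed is to fingerprint its raw bytes. The sketch below uses plain Python lists and a hypothetical `data_fingerprint` helper; it is an assumption about the general technique, not a description of how the library hashes numpy, pandas, or polars inputs.

```python
import hashlib
from array import array


def data_fingerprint(rows):
    """Hash the raw bytes of a list of float rows; row order affects the result."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(array("d", row).tobytes())
    return digest.hexdigest()


rows = [[float(i), float(i) * 2.0] for i in range(100)]
same_copy = [list(row) for row in rows]  # new objects, identical content
reordered = rows[::-1]                   # same rows, different order

print(data_fingerprint(rows) == data_fingerprint(same_copy))  # True
print(data_fingerprint(rows) == data_fingerprint(reordered))  # False
```

A copy with identical content maps to the same fingerprint, while merely reordering the rows produces a different one, so a shuffled array counts as new data.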


### Modify the target data and the fit runs again from scratch

In [11]:
y2 = y.copy()
np.random.shuffle(y2)

In [12]:
%%time
random_forest.fit(X2, y2)



CPU times: user 53 s, sys: 242 ms, total: 53.3 s
Wall time: 54.3 s


In [13]:
%%time
random_forest.fit(X2, y2)



CPU times: user 172 ms, sys: 106 ms, total: 278 ms
Wall time: 263 ms


### A different but identical wrapped classifier with the same input data uses the cached estimator
That is, a different instance with the exact same args/kwargs (in this case none) behaves like the original one, since the data processing should be identical.

In [14]:
%%time
wrapper(RandomForestClassifier()).fit(X, y)



CPU times: user 121 ms, sys: 48.1 ms, total: 169 ms
Wall time: 125 ms


### The same input data, but a different estimator with a different attribute
The fit runs again because it is a different model.

In [15]:
%%time
wrapper(RandomForestClassifier(random_state=7)).fit(X, y)



CPU times: user 32 s, sys: 93 ms, total: 32.1 s
Wall time: 32.4 s


## Predict calls
These examples would work the same way with a transformer.

### The first call runs the prediction

In [16]:
%%time
predictions_first_call = random_forest.predict(X)



CPU times: user 2.14 s, sys: 81 ms, total: 2.22 s
Wall time: 2.6 s


### The second call with the same data loads the cached data instead of running the prediction, and the outputs are identical

In [17]:
%%time
predictions_second_call = random_forest.predict(X)



CPU times: user 82.5 ms, sys: 19.8 ms, total: 102 ms
Wall time: 59.8 ms


In [18]:
np.testing.assert_array_equal(predictions_first_call, predictions_second_call)

### Using different data means that it re-runs the prediction

In [19]:
%%time
predictions_third_call = random_forest.predict(X2)



CPU times: user 2.14 s, sys: 21.7 ms, total: 2.16 s
Wall time: 2.13 s


In [20]:
%%time
predictions_fourth_call = random_forest.predict(X2)



CPU times: user 84.2 ms, sys: 15.7 ms, total: 99.9 ms
Wall time: 43.2 ms


In [21]:
assert not np.array_equal(predictions_second_call, predictions_third_call)
np.testing.assert_array_equal(predictions_third_call, predictions_fourth_call)

### All of the behaviour shown also works with `fit_predict` (and `fit_transform` in the case of transformers)
Because both fit and predict have already been run and cached for this data, `fit_predict` uses that cached data.

In [22]:
%%time
predictions_fifth_call = random_forest.fit_predict(X, y)



CPU times: user 202 ms, sys: 65.3 ms, total: 267 ms
Wall time: 180 ms


In [23]:
assert not np.array_equal(predictions_fifth_call, predictions_third_call)
np.testing.assert_array_equal(predictions_second_call, predictions_fifth_call)
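
The reason `fit_predict` can reuse earlier work is easiest to see if it is written as a composition of a cached fit and a cached predict. The toy class below is a hypothetical in-memory sketch (`CachedThresholdModel` is not part of the library, and the real wrapper persists to disk), but it shows how one call can be served entirely from results produced by earlier, separate calls.

```python
class CachedThresholdModel:
    """Hypothetical toy classifier with memoised fit and predict steps."""

    def __init__(self):
        self.fit_cache = {}
        self.predict_cache = {}
        self.fit_runs = 0
        self.predict_runs = 0

    def fit(self, samples, labels):
        key = (tuple(samples), tuple(labels))
        if key not in self.fit_cache:
            self.fit_runs += 1
            positives = [s for s, label in zip(samples, labels) if label == 1]
            self.fit_cache[key] = min(positives)
        self.threshold_ = self.fit_cache[key]
        return self

    def predict(self, samples):
        # The key combines the fitted state with the input data.
        key = (self.threshold_, tuple(samples))
        if key not in self.predict_cache:
            self.predict_runs += 1
            self.predict_cache[key] = [int(s >= self.threshold_) for s in samples]
        return self.predict_cache[key]

    def fit_predict(self, samples, labels):
        # Both steps go through their caches, so earlier calls are reused.
        return self.fit(samples, labels).predict(samples)


model = CachedThresholdModel()
samples, labels = [0.1, 0.4, 0.6, 0.9], [0, 0, 1, 1]

separate = model.fit(samples, labels).predict(samples)
combined = model.fit_predict(samples, labels)

print(separate == combined)                # True
print(model.fit_runs, model.predict_runs)  # 1 1
```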