# Minimising computation time

`AutoEmulate` fits many models, so it can be slow when the input data has many observations (rows) or many output variables. By default, `AutoEmulate` fits and cross-validates each model, so with 10 models and 5 cross-validation folds we compute 10 * 5 = 50 model fits. Computation time is relatively short for datasets of up to a few thousand datapoints, but some models (e.g. Gaussian Processes) scale poorly with the number of observations, so computation time can quickly become an issue.

If we want to use hyperparameter search, we suddenly have to fit many more models. For each model, we might have 20 different parameter combinations, and we cross validate each combination, leading to 20 * 5 = 100 model fits per model. 
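The arithmetic above can be written as a small helper. This is only a back-of-the-envelope sketch: `count_model_fits` is a hypothetical function, not part of `AutoEmulate`, and it ignores any extra fits the library may perform (for example, refitting the best model).

```python
def count_model_fits(n_models, n_folds=5, n_param_settings=1):
    """Return the number of model fits in a cross-validated comparison.

    Each model is fitted once per fold for every parameter setting tried.
    """
    return n_models * n_param_settings * n_folds


# Without hyperparameter search: one parameter setting per model.
fits_default = count_model_fits(n_models=10)

# With hyperparameter search: 20 sampled parameter settings per model.
fits_search = count_model_fits(n_models=10, n_param_settings=20)

print(fits_default)  # 50
print(fits_search)   # 1000
```

Every option discussed below reduces one of the three factors, or spreads the resulting fits across several processes.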

However, there are several ways to speed up `AutoEmulate`:

1) parallelise model fits using `n_jobs` 
2) restrict the number of models using `model_subset` 
3) run fewer cross validation folds using `cross_validator` 
4) for hyperparameter search:
    - all of the above
    - run fewer iterations using `param_search_iters`

Here's how you might approach working with a larger dataset:

In [9]:
from sklearn.datasets import make_regression
from autoemulate.compare import AutoEmulate

Let's make a dataset.

In [10]:
X, y = make_regression(n_samples=500, n_features=10, n_targets=5)
X.shape, y.shape

((500, 10), (500, 5))

And see how long `AutoEmulate` takes to run (without hyperparameter search).

In [11]:
import time

start = time.time()

em = AutoEmulate()
em.setup(X, y)
em.compare()

end = time.time()
print(f"Time taken: {end - start} seconds")

| | Values |
|---|---|
| Simulation input shape (X) | (500, 10) |
| Simulation output shape (y) | (500, 5) |
| # hold-out set samples (test_set_size) | 100 |
| Do hyperparameter search (param_search) | False |
| Type of hyperparameter search (search_type) | random |
| # sampled parameter settings (param_search_iters) | 20 |
| Scale data before fitting (scale) | True |
| Scaler (scaler) | StandardScaler |
| Dimensionality reduction before fitting (reduce_dim) | False |
| Dimensionality reduction method (dim_reducer) | PCA |


Initializing:   0%|          | 0/11 [00:00<?, ?it/s]

Time taken: 51.25502681732178 seconds
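The start/end timing pattern used here recurs in the cells that follow. One way to avoid repeating it is a small context manager. This is a minimal sketch: `timed` is a hypothetical helper, not part of `AutoEmulate`, and `time.sleep` stands in for the `em.setup(...)` and `em.compare()` calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block and print the elapsed wall-clock seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep timings for later comparison
        print(f"{label}: {elapsed:.2f} seconds")


timings = {}
with timed("baseline", timings):
    time.sleep(0.05)  # stand-in for em.setup(X, y) and em.compare()
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic clock intended for measuring short durations.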


### 1) parallelise model fits using `n_jobs` 

In [12]:
start = time.time()

em = AutoEmulate()
em.setup(X, y, n_jobs=5)
em.compare()

end = time.time()
print(f"Time taken: {end - start} seconds")

| | Values |
|---|---|
| Simulation input shape (X) | (500, 10) |
| Simulation output shape (y) | (500, 5) |
| # hold-out set samples (test_set_size) | 100 |
| Do hyperparameter search (param_search) | False |
| Type of hyperparameter search (search_type) | random |
| # sampled parameter settings (param_search_iters) | 20 |
| Scale data before fitting (scale) | True |
| Scaler (scaler) | StandardScaler |
| Dimensionality reduction before fitting (reduce_dim) | False |
| Dimensionality reduction method (dim_reducer) | PCA |


Initializing:   0%|          | 0/11 [00:00<?, ?it/s]

Time taken: 21.6130108833313 seconds
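How many jobs to request depends on the machine. The sketch below picks a value from the number of available cores and computes the speed-up observed in the two runs above. `pick_n_jobs` is a hypothetical helper, and leaving one core free is a common convention rather than an `AutoEmulate` requirement.

```python
import os


def pick_n_jobs(reserve=1):
    """Return a job count that leaves `reserve` cores free, but at least 1."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, cores - reserve)


n_jobs = pick_n_jobs()

# Wall-clock times reported by the two runs above (rounded).
serial_seconds = 51.26
parallel_seconds = 21.61
speedup = serial_seconds / parallel_seconds

print(f"n_jobs={n_jobs}, speed-up={speedup:.2f}x")
```

With 5 jobs the speed-up in this run is about 2.4x rather than 5x. Process start-up, data transfer and a few slow models that dominate the run all limit the gain, so it is worth timing a run before requesting more jobs.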



### 2) restrict the number of models using `model_subset` 


In [16]:
em = AutoEmulate()
em.setup(X, y)

# let's see all models
em.print_model_names()

# setup with fewer models
start = time.time()

em.setup(X, y, model_subset=["sop", "rbf", "gb"])
em.compare()

end = time.time()
print(f"Time taken: {end - start} seconds")

| | Values |
|---|---|
| Simulation input shape (X) | (500, 10) |
| Simulation output shape (y) | (500, 5) |
| # hold-out set samples (test_set_size) | 100 |
| Do hyperparameter search (param_search) | False |
| Type of hyperparameter search (search_type) | random |
| # sampled parameter settings (param_search_iters) | 20 |
| Scale data before fitting (scale) | True |
| Scaler (scaler) | StandardScaler |
| Dimensionality reduction before fitting (reduce_dim) | False |
| Dimensionality reduction method (dim_reducer) | PCA |


| | short name |
|---|---|
| SecondOrderPolynomial | sop |
| RadialBasisFunctions | rbf |
| RandomForest | rf |
| GradientBoosting | gb |
| GaussianProcess | gp |
| SupportVectorMachines | svm |
| LightGBM | lgbm |
| PyTorchMultiLayerPerceptron | ptmlp |
| PyTorchRadialBasisFunctionsNetwork | ptrbfn |
| NeuralNetSk | nns |


| | Values |
|---|---|
| Simulation input shape (X) | (500, 10) |
| Simulation output shape (y) | (500, 5) |
| # hold-out set samples (test_set_size) | 100 |
| Do hyperparameter search (param_search) | False |
| Type of hyperparameter search (search_type) | random |
| # sampled parameter settings (param_search_iters) | 20 |
| Scale data before fitting (scale) | True |
| Scaler (scaler) | StandardScaler |
| Dimensionality reduction before fitting (reduce_dim) | False |
| Dimensionality reduction method (dim_reducer) | PCA |


Initializing:   0%|          | 0/3 [00:00<?, ?it/s]

Time taken: 2.5793070793151855 seconds
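Because `model_subset` takes short names as plain strings, a typo is easy to make. The sketch below checks a subset against the short names printed by `em.print_model_names()` before starting a run. `check_model_subset` is a hypothetical helper, and the name set is copied from the table above, so it may differ in other versions of the library.

```python
# Short names copied from the print_model_names() output above (assumption:
# the installed version exposes exactly these names).
KNOWN_SHORT_NAMES = {
    "sop", "rbf", "rf", "gb", "gp", "svm", "lgbm", "ptmlp", "ptrbfn", "nns",
}


def check_model_subset(subset, known=KNOWN_SHORT_NAMES):
    """Return the subset unchanged, or raise if it contains unknown names."""
    unknown = sorted(set(subset) - set(known))
    if unknown:
        raise ValueError(f"Unknown model short names: {unknown}")
    return list(subset)


model_subset = check_model_subset(["sop", "rbf", "gb"])
print(model_subset)  # ['sop', 'rbf', 'gb']
```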


### 3) run fewer cross validation folds using `cross_validator` 

With larger datasets, you might initially want to set the number of cross-validation folds to 3 instead of 5 (the default), so that there are fewer models to fit. `AutoEmulate` takes a `cross_validator` argument, which accepts a scikit-learn cross-validator or [splitter](https://scikit-learn.org/stable/api/sklearn.model_selection.html). Let's use `KFold` with 3 splits, which saves 2 model fits per model.

In [17]:
from sklearn.model_selection import KFold

start = time.time()

em = AutoEmulate()
em.setup(X, y, cross_validator=KFold(n_splits=3))
em.compare()

end = time.time()
print(f"Time taken: {end - start} seconds")

| | Values |
|---|---|
| Simulation input shape (X) | (500, 10) |
| Simulation output shape (y) | (500, 5) |
| # hold-out set samples (test_set_size) | 100 |
| Do hyperparameter search (param_search) | False |
| Type of hyperparameter search (search_type) | random |
| # sampled parameter settings (param_search_iters) | 20 |
| Scale data before fitting (scale) | True |
| Scaler (scaler) | StandardScaler |
| Dimensionality reduction before fitting (reduce_dim) | False |
| Dimensionality reduction method (dim_reducer) | PCA |


Initializing:   0%|          | 0/11 [00:00<?, ?it/s]

Time taken: 27.888551950454712 seconds
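Fewer folds also change how much data each fit sees. The sketch below reproduces the documented `KFold` sizing rule (the first `n_samples % n_splits` folds get one extra sample) in plain Python. `kfold_sizes` is a hypothetical helper, and the figure of 400 cross-validation samples assumes the 100 hold-out samples are set aside before cross-validation.

```python
def kfold_sizes(n_samples, n_splits):
    """Return the validation-fold sizes KFold would produce."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds receive one additional sample.
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_cv_samples = 500 - 100  # assumption: hold-out set removed before CV

for n_splits in (5, 3):
    val_sizes = kfold_sizes(n_cv_samples, n_splits)
    train_sizes = [n_cv_samples - v for v in val_sizes]
    print(n_splits, val_sizes, train_sizes)
```

With 3 folds, each fit trains on about 267 samples instead of 320. Each fit is therefore somewhat cheaper, but the cross-validation scores can be slightly more pessimistic than with 5 folds.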


### 4) modify hyperparameter search

As mentioned above, we fit many models when searching for the best hyperparameters. It is therefore recommended to restrict the search to a few models of interest and to maximise their performance.

For example, let's check out how well the models did in the example above:

In [18]:
em.print_results()

| | model | short | r2 | rmse |
|---|---|---|---|---|
| 0 | RadialBasisFunctions | rbf | 1.0 | 0.0 |
| 1 | SecondOrderPolynomial | sop | 1.0 | 0.0 |
| 2 | GaussianProcess | gp | 0.9999 | 1.5016 |
| 3 | ConditionalNeuralProcess | cnp | 0.9935 | 12.9235 |
| 4 | NeuralNetSk | nns | 0.9082 | 47.4905 |
| 5 | SupportVectorMachines | svm | 0.8758 | 57.3031 |
| 6 | LightGBM | lgbm | 0.8742 | 59.0567 |
| 7 | GradientBoosting | gb | 0.8539 | 63.699 |
| 8 | RandomForest | rf | 0.6382 | 99.7956 |
| 9 | PyTorchMultiLayerPerceptron | ptmlp | -0.0085 | 166.1604 |


Now, we might like to see whether the Random Forest and Gradient Boosting models could do better with better hyperparameters. To minimise computation time, we select only those two models, run 10 search iterations (instead of the default 20), run the fits in parallel using 5 jobs, and use 3-fold cross-validation (instead of the default 5).

In [20]:
start = time.time()

em.setup(X, y, model_subset=["rf", "gb"], param_search=True, param_search_iters=10, 
         n_jobs=5, cross_validator=KFold(n_splits=3))
em.compare()
em.print_results()

end = time.time()
print(f"Time taken: {end - start} seconds")

| | Values |
|---|---|
| Simulation input shape (X) | (500, 10) |
| Simulation output shape (y) | (500, 5) |
| # hold-out set samples (test_set_size) | 100 |
| Do hyperparameter search (param_search) | True |
| Type of hyperparameter search (search_type) | random |
| # sampled parameter settings (param_search_iters) | 10 |
| Scale data before fitting (scale) | True |
| Scaler (scaler) | StandardScaler |
| Dimensionality reduction before fitting (reduce_dim) | False |
| Dimensionality reduction method (dim_reducer) | PCA |


Initializing:   0%|          | 0/2 [00:00<?, ?it/s]

| | model | short | r2 | rmse |
|---|---|---|---|---|
| 0 | GradientBoosting | gb | 0.9077 | 50.5277 |
| 1 | RandomForest | rf | 0.6085 | 103.7577 |


Time taken: 13.609801292419434 seconds
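To make concrete what `param_search_iters` controls, here is a minimal sketch of random search in plain Python. It is not `AutoEmulate`'s implementation: the parameter space is a hypothetical one for a random forest, `sample_param_settings` is an illustrative helper, and it samples with replacement for simplicity.

```python
import random

# Hypothetical search space; AutoEmulate's actual spaces may differ.
param_space = {
    "n_estimators": [50, 100, 200, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4],
}


def sample_param_settings(space, n_iter, seed=0):
    """Draw `n_iter` random parameter settings from a discrete space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]


settings = sample_param_settings(param_space, n_iter=10)
n_folds = 3

# Every sampled setting is cross-validated, so fits = settings * folds.
print(len(settings) * n_folds)  # 30 fits for one model
```

The space above has 4 * 4 * 3 = 48 combinations, so 10 iterations explore only part of it. Halving the iterations halves the cost, but it also lowers the chance of finding a good setting, which may be one reason the Random Forest score did not improve in the run above.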
