In [3]:
%matplotlib inline



# Model Selection

In `model_evaluation`, we saw how to check the performance of an interpolator using
cross-validation. We found that the default parameters for :class:`verde.Spline` are not
good for predicting our sample air temperature data. Now, let's see how we can tune the
:class:`~verde.Spline` to improve the cross-validation performance.

Once again, we'll start by importing the required packages and loading our sample data.


In [4]:
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import itertools
import pyproj
import verde as vd

data = vd.datasets.fetch_baja_bathymetry()
#auxilary data for testing
dataz = vd.datasets.fetch_texas_wind()

# Use Mercator projection because Spline is a Cartesian gridder
projection = pyproj.Proj(proj="merc", lat_ts=data.latitude.mean())
projectionz = pyproj.Proj(proj="merc", lat_ts=dataz.latitude.mean())
proj_coords = projection(data.longitude.values, data.latitude.values)
proj_coordz = projection(dataz.longitude.values, dataz.latitude.values)

region = vd.get_region((data.longitude, data.latitude))
# For this data, we'll generate a grid with 15 arc-minute spacing
spacing = 30 / 60

Before we begin tuning, let's reiterate what the results were with the default
parameters.



In [5]:
spline_default = vd.Spline()
score_default = np.mean(
    vd.cross_val_score(spline_default, proj_coordz, dataz.air_temperature_c)
)
spline_default.fit(proj_coordz, dataz.air_temperature_c)
print("R² with defaults:", score_default)

R² with defaults: 0.7960375866932156


## Tuning

:class:`~verde.Spline` has many parameters that can be set to modify the final result.
Mainly the ``damping`` regularization parameter which controls how much smoothness is imposed on the estimated forces. It allows our algorithm to take more straight forwards path towards local optima and damp out vertical oscillations. The ``mindists`` param which accounts for slight errors in the measured location of forces and their actual locations. Would changing the default values give us a better score?

We can answer these questions by changing the values in our ``spline`` and
re-evaluating the model score repeatedly for different values of these parameters.
Let's test the following combinations:



In [6]:
dampings = [None, 1e-4, 1e-3, 1e-2]
mindists = [5e3, 10e3, 50e3, 100e3]

# Use itertools to create a list with all combinations of parameters to test
parameter_sets = [
    dict(damping=combo[0], mindist=combo[1])
    for combo in itertools.product(dampings, mindists)
]
print("Number of combinations:", len(parameter_sets))
print("Combinations:", parameter_sets)

Number of combinations: 16
Combinations: [{'damping': None, 'mindist': 5000.0}, {'damping': None, 'mindist': 10000.0}, {'damping': None, 'mindist': 50000.0}, {'damping': None, 'mindist': 100000.0}, {'damping': 0.0001, 'mindist': 5000.0}, {'damping': 0.0001, 'mindist': 10000.0}, {'damping': 0.0001, 'mindist': 50000.0}, {'damping': 0.0001, 'mindist': 100000.0}, {'damping': 0.001, 'mindist': 5000.0}, {'damping': 0.001, 'mindist': 10000.0}, {'damping': 0.001, 'mindist': 50000.0}, {'damping': 0.001, 'mindist': 100000.0}, {'damping': 0.01, 'mindist': 5000.0}, {'damping': 0.01, 'mindist': 10000.0}, {'damping': 0.01, 'mindist': 50000.0}, {'damping': 0.01, 'mindist': 100000.0}]


Now we can loop over the combinations and collect the scores for each parameter set.



In [7]:
spline = vd.Spline()
scores = []
for params in parameter_sets:
    spline.set_params(**params)
    score = np.mean(vd.cross_val_score(spline, proj_coordz, dataz.air_temperature_c))
    scores.append(score)
print(scores)

[-3.0558088625105246, -1.0324987610313634, 0.8388935474419433, 0.8372297486556756, 0.8357332950163141, 0.8302176127333297, 0.8503658696313814, 0.842379393827142, 0.8371791440825426, 0.8403780767301334, 0.852794541088171, 0.8524545032000166, 0.8403005442318481, 0.8344566205401425, 0.843101054678694, 0.8486770494068516]


The largest score will yield the best parameter combination.



In [8]:
best = np.argmax(scores)
print("Best score:", scores[best])
print("Score with defaults:", score_default)
print("Best parameters:", parameter_sets[best])

Best score: 0.852794541088171
Score with defaults: 0.7960375866932156
Best parameters: {'damping': 0.001, 'mindist': 50000.0}


**That is a nice improvement over our previous score!**

This type of tuning is important and should always be performed when using a new
gridder or a new dataset.
